This blog post is authored by Brandon Rohrer, Senior Data Scientist at Microsoft.
Data has dollar value. Gasoline has horsepower. Extracting both takes some care.
We start with data, the gasoline of data science. Although we hear the phrase “big data” a lot, quantity isn’t everything. If it were, we would just fill up with barrels of crude oil. Just like the fuel you put in your car, data needs to be clean, concentrated, and of the right type.
The bottom line: Without good data, nothing produces any value at all.
The Algorithm: Internal Combustion
Algorithms are the internal combustion engine of the data science world. They are the black boxes that take data in and magically convert it to insights and predictions. And, like internal combustion, most of us have a vague notion of how they work, but it takes an expert to really wrap their head around their finer points of operation.
Algorithms are rules for learning. When data is run through those rules, the result is a rough summary of the data called a model. The art of building models from algorithms and data is an important part of data science. There are lots of pitfalls, and tricks to avoid them. For instance, there are tricks for building a model that is specific, but not too specific: models are trained on one set of data, but their accuracy is measured on an entirely separate second set. There are lots of machine learning algorithms in a data scientist’s toolbox. Choosing which algorithm to use is a deep discussion of its own. There are also tricks for cleaning up noisy data, filling in missing values, and squeezing the data for every last bit of information it has to offer. Azure ML Studio is a powerful tool for learning and using these tricks.
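To make two of those tricks concrete, here is a minimal Python sketch of filling in missing values and of training on one set of data while measuring accuracy on a separate one. Everything in it is an assumption made for illustration: the made-up data, the choice of a straight-line model, and the helper names (fill_missing, train_test_split, fit_line, mean_abs_error). It is not how Azure ML Studio does it, just the idea in miniature.

```python
import random
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept (the 'model')."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def mean_abs_error(model, rows):
    """Average size of the model's mistakes on the given rows."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in rows)


# Made-up data: y is roughly 2x + 1, with two readings missing.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]
ys[3] = None
ys[11] = None

rows = list(zip(xs, fill_missing(ys)))
train, test = train_test_split(rows)
model = fit_line(train)  # learn from the training set only
print("error on held-out data:", mean_abs_error(model, test))
```

The number that matters is the error on the held-out rows, because the model never saw them while it was being fit. Filling gaps with the mean is the bluntest possible choice and is used here only to keep the sketch short.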
A good model is a beautiful thing. (Just ask any ML specialist.) But it can’t really do anything on its own. It’s like an internal combustion engine without fuel injection, cooling, electric, or lubrication. To run new data through the model, some poor soul has to manually feed it files and store the results. And that doesn’t even touch on the issues of getting the data beforehand and processing it after.
The bottom line: Models alone don’t generate much business value.
The API: An Engine
To make a model into a contributing part of your operation, it needs to be operationalized. This is a big word for an even wordier idea: taking the model and packaging it so that it can be easily used from another program. This packaged model is called an Application Programming Interface, or API for short. An API is an internal combustion engine complete with pumps, hoses, wires, and brackets all around.
It may seem like a small step to take one set of computer instructions, the model, and wrap it inside another, the API. But for the intern whose job it was to spoon-feed data to the model, it makes a world of difference. Having an API means that the intern just has to write (yet another) program that can spin it up each time new data arrives. It also means that new data is processed very quickly (low latency) and that, without a human in the way to slow things down, your business can handle much more data (high throughput). And the intern can move on to trickier problems.
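For a feel of what that wrapping looks like, here is a bare-bones sketch using only Python's standard library. The model (a slope and an intercept), the JSON request shape with an "inputs" list, the port number, and the names score_payload, ScoreHandler and serve are all assumptions invented for this example. It is a toy, not a description of how any particular service builds its APIs.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: a slope and intercept learned elsewhere.
MODEL = {"slope": 2.0, "intercept": 1.0}


def predict(x):
    """Run one number through the model."""
    return MODEL["slope"] * x + MODEL["intercept"]


def score_payload(body):
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        inputs = json.loads(body)["inputs"]
        predictions = [predict(float(x)) for x in inputs]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing "inputs" key, or non-numeric values.
        return 400, json.dumps({"error": 'expected {"inputs": [numbers]}'})
    return 200, json.dumps({"predictions": predictions})


class ScoreHandler(BaseHTTPRequestHandler):
    """Answers each POST request with the model's predictions."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = score_payload(self.rfile.read(length))
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Block forever, answering requests on localhost."""
    HTTPServer(("127.0.0.1", port), ScoreHandler).serve_forever()
```

Calling serve() starts the engine. Any other program can then POST a body such as {"inputs": [0, 1.5]} and get predictions back, with no human in the loop. Notice what the sketch leaves out: security, uptime, backups and the rest are exactly the parts that are hard to get right.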
Creating APIs from models is conceptually simple, but difficult to get right. You have to set up servers, which come with a long list of worries (security, uptime, accessibility, robustness, data format consistency, error recovery, backups), all of which incur risk and the cost of mitigating that risk. There are two ways to make an end run around that complexity. One is to use Azure ML. In that case, converting a model to a well-engineered API happens with a single mouse click. I’m not exaggerating. One mouse click. Your second option is to use APIs based on models someone else has written. There is a library of these in the Cortana Analytics Gallery covering several tough problems, including text and image processing, the engine behind the viral how-old.net.
The bottom line: APIs generate more value than models alone.
The End-to-End Solution: An Automobile
A fully operational engine is an impressive engineering feat, but it still can’t drive you to the beach. Without wheels, brakes, a transmission and an MP3 player it remains a curiosity. Likewise, when your goal is to make data-driven business decisions, an API won’t help you by itself. There is a long list of other jobs that have to be taken care of before your process can make it out of the driveway.
Luckily, there is a one-stop shop where you can get all of this. The Cortana Analytics Suite is a collection of products that help take care of these jobs. They’re all in the cloud, which just means they live and run in a secure building full of computers. Here are some of the most important jobs you’ll need taken care of and the members of the Cortana Analytics family that do them.
Gathering and reporting results: Azure Data Factory
Plotting results: Power BI
The bottom line: End-to-end solutions generate more value than APIs. In some cases a great deal more.
There is some assembly required to get your entire data science machine working together. ML models are just one part of the system. The Cortana Analytics Suite provides the other pieces you need. When it’s all up and running, an end-to-end solution will transform your raw data into the information you need and deliver it where you need it.
Follow me on Twitter or ping me on LinkedIn.