Azure Data Services for (near) Real Time Scenarios

Light takes about eight and a half minutes to get from the surface of the sun to the earth, and it can actually take a photon 20,000 years to get from the core of the sun to the surface.  On earth that means that if the sun stops shining we won't know about it until eight and a half minutes later, which begs the question: what is real time, as in real time analytics?  So I tend to stick to the term near real time, to be precise, because if we are to do any kind of analysis we'll need a small amount of time for that, no matter what technology we use.

In my diagram for this scenario I have identified two candidate sources for this sort of data – IoT and Feeds.

[Diagram: the two candidate data sources (IoT and feeds) flowing into Azure orchestration, processing, and storage services]

IoT, the internet of things, is the crazy world of connecting devices and humans to the net and gathering telemetry from them: things like wearables, weather stations, or even wildlife.  These things are typically low-powered devices, often prototyped using maker technology like the Raspberry Pi, Arduino and many others, and they then send feeds to a service.  The question is how to ingest that data and what to do with it: there could be hundreds to millions of these devices, and some of them can generate a lot of data as they send detailed state readings, possibly several times a second, so it's scale we need to factor in when deciding what to use.

In the diagram above I have shown two Azure services in the orchestration box in pink, Service Bus and Event Hubs; they are complementary but different.  For smaller-scale scenarios Azure Service Bus can be used by itself; it was originally designed to allow asynchronous communication between service tiers, e.g. an Azure web role talking to a worker role (web application and logic application, as they are now called).
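
As a concrete sketch of that queue pattern, here's roughly what the two sides might look like with the azure-servicebus Python SDK; the connection string and queue name are placeholders, and this is an illustration of the idea rather than the exact web/worker role plumbing.

    # A minimal sketch of asynchronous tier-to-tier messaging with the
    # azure-servicebus Python SDK (connection string and queue name are
    # placeholders).
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    CONN_STR = "<service-bus-connection-string>"  # placeholder
    QUEUE = "work-items"                          # hypothetical queue name

    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        # The web tier drops a message on the queue and carries on.
        with client.get_queue_sender(queue_name=QUEUE) as sender:
            sender.send_messages(ServiceBusMessage("resize image 42"))

        # The worker tier picks the work up whenever it is ready.
        with client.get_queue_receiver(queue_name=QUEUE, max_wait_time=5) as receiver:
            for msg in receiver:
                print("processing:", str(msg))
                receiver.complete_message(msg)

The point is the decoupling: the sender never waits on the worker, and the queue absorbs any mismatch in their speeds.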

When more scale is required, one or more event hubs can be used behind the service bus to handle millions of events a second, so where before a service bus had queues and topics, it can now have event hubs behind it as well. However, where multiple queues and topics compete to process the next message, event hubs are partitioned, with each partition having its own sequence of events. Events are routed to partitions by a partition key that you can define; for example, if a sensor sends in temperature and pressure readings, you might partition by a reading-type key.  The event hub pricing gives you an idea of how powerful they are, with each throughput unit handling up to 1,000 events a second with a data ingress of 1 MB/sec.
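
To illustrate the partition key idea, here's a small sketch using the azure-eventhub Python SDK that sends each reading type with its own partition key; the connection string, hub name and payload shape are all placeholders for this example.

    # A minimal sketch of sending sensor readings to an event hub with
    # the azure-eventhub Python SDK, using the reading type as the
    # partition key (connection string and hub name are placeholders).
    import json
    from azure.eventhub import EventHubProducerClient, EventData

    CONN_STR = "<event-hubs-connection-string>"  # placeholder
    HUB = "telemetry"                            # hypothetical hub name

    producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=HUB)
    with producer:
        for reading_type, value in [("temperature", 21.3), ("pressure", 101.2)]:
            # Events that share a partition key land in the same partition,
            # so each reading type keeps its own ordered sequence of events.
            batch = producer.create_batch(partition_key=reading_type)
            batch.add(EventData(json.dumps({"type": reading_type, "value": value})))
            producer.send_batch(batch)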

So event hubs can ingest and cache events at great scale, but what do we do with them?  We could write them as-is to storage, but typically we'll want to aggregate and process the data in some way first, and this is where Stream Analytics comes into play.  This is another Azure service, one that uses a SQL-like language to analyse data over time and send the results on to another process or write them to one of the many storage formats on Azure.
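
That SQL-like language reads much like T-SQL with time windows added. As a sketch, the following hypothetical query (the [telemetry-in] and [telemetry-out] aliases would be defined as the job's input and output) averages each device's temperature readings over ten-second tumbling windows:

    -- Hypothetical Stream Analytics query: average each device's
    -- temperature readings over ten-second tumbling windows.
    SELECT
        DeviceId,
        AVG(Temperature) AS AvgTemperature,
        System.Timestamp AS WindowEnd
    INTO
        [telemetry-out]
    FROM
        [telemetry-in]
    GROUP BY
        DeviceId,
        TumblingWindow(second, 10)

Each window emits one row per device as it closes, which is what makes this streaming rather than batch processing.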

To compare this with Data Factory, the other orchestration service in Azure, consider how we might analyse a group of cars by make, model and colour.  If we used a tool like Data Factory, or any other ETL (Extract, Transform and Load) solution, we would go to a car park where the cars were and count them every day, possibly noting which ones had left and arrived since the last count. If we used Stream Analytics, we would stand on a bridge over a motorway and count the number of cars that went under the bridge in a given time.

The other thing to know about Stream Analytics is what it can consume and what it can output. It can consume data from Event Hubs and Azure Blob Storage, and it can output to both of these as well as to Azure SQL databases.  This means we can recurse and reuse data as it arrives to do ever more sophisticated analysis.

I have also included HDInsight in the diagram, as we can configure Apache Storm on top of HDInsight by selecting this as an option when we create an HDInsight cluster.

[Screenshot: an example of the quick-create cluster form in the portal]

This allows open-source-orientated organisations to use Storm as a service, and thus use the tools and techniques they are familiar with to do near real-time analysis at scale as well.

In both cases the (near) real-time dashboards in Power BI allow these feeds to be visualised, and in this video you'll see some of that as used by Microsoft's Digital Crimes Unit, who continually monitor threats and work with law enforcement agencies to mitigate them.

To conclude, while there is a lot of interest in IoT, what really matters is the data coming off those devices and how we can use it for good: for example, by feeding more people through analysis of the environment, by making the internet safer, and by using wearables to improve the management of chronic health problems.