Adventure Works!

My title for this post is a pun on the Adventure Works databases and samples that have shipped with SQL Server for as long as I can remember.  There were also some data mining examples (as referenced in this old post), but these have not really moved on since 2011 when I last wrote about them, so you might be forgiven for thinking that data mining is dead as far as Microsoft is concerned.

However, since that time two big things have happened: the hyper-scale of the cloud, and the rise of social media as a business tool rather than just a bit of fun for sharing strange pictures and meaningless chat.  Coupled together, this is big data: masses of largely useless data, produced at a rate faster than it can be downloaded, in a variety of semi-structured and unstructured formats. In other words Volume, Velocity and Variety.  Hidden in this data are nuggets of gold such as brand sentiment, how users navigate our web sites and what they are looking at, and patterns that we can’t immediately recognise. Up until now, processing and analysing big data has really only been possible for large corporates and governments, as they have the resources and expertise to do it. But as well as storing big data, the cloud can also make big data analysis available to anyone who has the sense of adventure to give it a try; all that’s needed is access to the data and an understanding of how to mine the information.  That understanding is still the problem: this expertise, aka data science, is the bottleneck, and a quick search for jobs in this area on your favourite jobs board will confirm it.

So what is Microsoft doing about this?

What they have always done: simplify it, commoditise it, and integrate it.  If I go back to SQL Server 2000, we had DTS to load and transform data from any source and Analysis Services to slice and dice it from Excel, and then we got Reporting Services in 2004, all in one product.  In 2014 we have a complete set of tools to mash, hack and slice data into submission from any source, but these tools are no longer in SQL Server; they are in the cloud, specifically in Azure and in Office 365.  So what are the tools?

  • HDInsight, which is Hadoop running as a service in Azure, where you can build a cluster as large as you need and feed it data with all the industry standard tools you are used to (Mahout and Pig, for example).
  • Microsoft Azure Machine Learning (MAML) can take data from anywhere, including HDInsight, and do the kind of predictive analytics that data mining promised, but without the need to be a data scientist yourself.  This is because the MAML studio has a raft of the best algorithms that are publicly available and is also very easy to use from IE or Chrome. Actually it reminds me a bit of SQL Server Integration Services, which is no bad thing.

Once you have trained your experiment (as they are called) you can expose it as a web service, which can then be consumed on a transaction by transaction basis to score credit, advise on purchase decisions etc. within your own web sites.
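To give a feel for what consuming such a service one transaction at a time might look like, here is a minimal Python sketch. Everything specific in it is an assumption on my part for illustration: the URL, the API key, the bearer-token header, the "input" wrapper and the feature names are all hypothetical, and the real request and response formats are whatever your published experiment defines.

```python
import json
import urllib.request

# Hypothetical values: use the endpoint and key of your own published experiment.
SCORING_URL = "https://example.invalid/score"
API_KEY = "replace-with-your-key"


def build_scoring_request(transaction, url=SCORING_URL, api_key=API_KEY):
    """Wrap one transaction (a dict of feature name -> value) in an HTTP POST.

    The {"input": ...} wrapper and the bearer-token header are assumptions,
    not the documented format of any particular service.
    """
    body = json.dumps({"input": transaction}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score(transaction, timeout=10):
    """Send one transaction and return the decoded JSON reply."""
    request = build_scoring_request(transaction)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

From a web site you would call something like score({"age": 42, "income": 30000}) for each transaction and act on whatever the reply contains; building the request separately from sending it makes the payload easy to inspect without touching the network.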

  • Office 365 provides the presentation layer for the tools above, with access to HDInsight data and machine learning datasets from the Power BI suite of tools.

In order to play with any of these there are two other things you’ll need: an MSDN subscription, to get free hours on Azure to try this out, and a copy of Office 2013 for the Power BI stuff. You’ll also want to watch the Microsoft Virtual Academy for advice and guidance, although at the time of writing there aren’t any courses on MAML as it’s so new.

Finally, a word of warning before you start on your own adventure: these tools can all encode a certain amount of business logic, so it’s important to understand the end to end changes you have made in building your models from source to target, and to consider where and when to use which tool.  For example, Power Pivot can itself do quite a lot of data analysis, but in a big data world it is best used as a front end for HDInsight or a machine learning experiment.

I will be going deeper into this in subsequent posts, as this stuff is not only jolly interesting, it’s also a huge career opportunity for anyone who loves mucking around with data.