Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

If you haven’t got a question then Machine Learning is not the answer

I am in the middle of preparing for some more events, the next big one being a day of machine learning (ML) at SQL Saturday Exeter.  The organisers asked me to make a short video to explain what I’ll be covering, and that got me thinking about what my advice would be, having worked with the tool for a good year.

What’s the question?  Unlike many aspects of BI and data mining, we are not on a fishing expedition, and although ML can be used just to explore data with its built-in tools or via your own Python / R scripts, that is not what it was designed for.  The typical use cases for ML (predictive maintenance, customer churn, sentiment analysis) are all about answering very specific questions: when will a part fail, which customers am I about to lose, was that tweet being nice about me or not?

Does the answer lead to action? At a recent briefing on ethics in medical data, one of the speakers remarked that a significant factor in patients not completing their treatment was their credit card rating: the lower it was, the more likely they were to drop out.  The problem is that it’s not easy to see what to do about that (no suggestions please), whereas in more typical customer churn scenarios we might set up some sort of offer to retain those at risk.

For me this is why ML has not quite taken off in many organisations as expected, given the massive interest in it. It wows everyone, but then it becomes a question of how to apply it in the organisation. Having said that, I think a massive area is process automation and checking, e.g. in contracts and financial approvals, to detect fraud and spot mistakes.

ML is not perfect.  We are very used to humans making mistakes; indeed I would argue that unless you made a mistake today you haven’t learnt anything. Learning machines also make mistakes: they can’t perfectly recognise speech or images, for example. However, they are starting to equal or outperform us in some of these areas.  My point here is to have a process in place to handle this. Say we are in the business of approving loan applications.  If ML is not confident of the outcome, it will return a mid-range confidence value (say 0.33 to 0.66); a lower score would be rejected automatically and a higher one accepted.  The mid-range values we might then refer to a human to decide.  Similarly, if ML can’t decode the postcode on a letter then maybe a human can, but then again in image recognition ML has been shown to be better than humans when it comes to recognising objects and faces.
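The loan example can be sketched in a few lines of Python. This is only an illustration of the routing idea, not anything built into the tool: the function name `route_application` is made up, and the 0.33 and 0.66 cut-offs are simply the example values from the paragraph above, which you would tune for your own business.

```python
def route_application(score, low=0.33, high=0.66):
    """Decide what to do with a loan application given a model confidence score.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < low:
        return "reject"   # confidently a bad risk: reject automatically
    if score > high:
        return "accept"   # confidently a good risk: accept automatically
    return "refer"        # mid-range: the model is unsure, so ask a human


for score in (0.10, 0.50, 0.90):
    print(score, route_application(score))
```

The useful part is the third branch: the model is allowed to say "I don’t know", and the process has somewhere for those cases to go.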

ML needs to be retrained.  An example is the best way to explain this: an algorithm for basket recommendation may be superb, but when new products appear there are no patterns to follow, and even if there are, these may not be typical behaviour.  So for production you’ll need a process to do this retraining, and ideally it should be programmatic rather than interactive.  This is not any harder than consuming an API as you would to make a prediction; however, in this case it’s a batch call with your training set as the argument.
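To make the basket example concrete, here is a toy stand-in written in plain Python. It is not the retraining API of the service, just a deliberately simple co-occurrence recommender; the `train` and `recommend` functions and the baskets themselves are invented for illustration. The point is that retraining is one batch call with the full, updated training set as its argument.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train(baskets):
    """Batch 'training': count how often each pair of products shares a basket."""
    model = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            model[a][b] += 1
    return model


def recommend(model, product):
    """Return the product most often bought with `product`, or None if unseen."""
    if product not in model:
        return None
    return model[product].most_common(1)[0][0]


# Made-up history used to train the original model.
old_baskets = [["bread", "butter"], ["bread", "butter"], ["bread", "jam"]]
model = train(old_baskets)

# A new product line appears: the old model has no pattern for it.
new_baskets = [["oat milk", "granola"], ["oat milk", "granola"]]
print(recommend(model, "oat milk"))            # None, never seen in training

# Retraining is one batch call on the combined data, easy to schedule.
model_retrained = train(old_baskets + new_baskets)
print(recommend(model_retrained, "oat milk"))  # granola
```

Because `train` takes nothing but data, it can be run from a scheduled job rather than by someone clicking through a designer.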

Algorithms are about numbers not industries.  The patient not following through on treatment and the student dropping out of a course are just like the customer churn long studied in retail, and the same algorithms will work equally well in these other areas, as all ML is doing is comparing correlations of factors against what we are trying to predict.  Similarly, anomaly detection works for bank fraud and for predictive maintenance.
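A small sketch shows what "numbers not industries" means in practice. The function below uses a simple z-score check, which is a stand-in I have chosen for brevity and not the anomaly detection module in the tool, and both data sets are made up. The same code flags the odd card transaction and the odd vibration reading without knowing which industry either came from.

```python
from statistics import mean, stdev


def anomalies(values, threshold=2.0):
    """Return the indexes of values more than `threshold` standard deviations
    from the mean. The threshold of 2.0 is an illustrative assumption."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Invented example data: card spend in pounds, bearing vibration in mm/s.
card_spend = [20, 25, 22, 19, 24, 21, 23, 950]
bearing_vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 1.90]

print(anomalies(card_spend))         # [7]
print(anomalies(bearing_vibration))  # [7]
```

A real deployment would use something more robust than a z-score on eight points, but the shape of the problem is identical in both cases.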

Not your normal neural network.  Buried in the neural network module in ML is the ability to do deep learning using a special scripting language, Net# (nothing to do with .NET really). It is more for the hardened data scientist, and if you want to know more check this session at last year’s Cortana Analytics Workshop.

ChChChanges.  ML is continually evolving, and this is important for two reasons: one, check back regularly to see what has been added, and two, if you have feedback then share it to drive the features you feel you need.  A good example of this is the much deeper support for Python in ML.  However, I am more interested in seeing more algorithms, e.g. time series analysis, and more OpenCV modules to handle images, both of which will mean we should be writing less of our own code anyway.

Do Something.  The only way to begin to see the potential of ML in your organisation is to try it. Yes you need a question, yes you need data, and yes you do need to be scientific in your approach to the problem.  However, you don’t need to be a full-on data scientist; you just need their curiosity, added to your data wrangling skills and business knowledge.  The Cortana Analytics Gallery is awash with interesting business-based scenarios and tutorials to help you, so don’t just sit there…