Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

How to train your MAML

In this fourth post in my Azure Machine Learning series we are actually going to do the training itself.  I have tidied up my experiment from last time to get rid of the modules to export data to SQL Azure as that has now served it’s purpose ..

image

Before we get stuck into training there’s still a bit more tidying up to do:

  • We can get rid of the year month and day columns as they were only needed to join our datasets and don’t have any relevance in predicting flight delay (actually they might do but I am keeping things simple here).
  • We can put the weather reading data into discrete ranges rather than leaving them as continuous variables.  For example temperature could be aggregated into groups say less than –10,  –10 to 0, 0-10 , 10 , 20-40 and  40 plus.  This is called quantization in the same way that the  quantum in quantum physics recognises that particles like electrons have discrete amounts of energy.  There’s a special quantization module to help us do this in ML studio.

So lets do those tow steps in ML studio. First add in another Project Columns module under the join to remove the time based columns..

image

Now add in the a Quantize module underneath that.  Quantization can work in several ways depending on the binning and quantile normalization options we select in this module.  We can design our own bins into which each value falls by setting the bin edges (like the temperature groups I just mentioned).  If we decide to let the Quantize model set the bin edges automatically then we can choose the algorithm it will use,  the quantile normalization method, and we can set how many bins we want.  For our initial training we’ll select 10 bins for all of the numerical data we are going to use to base our predictions on and we’ll overwrite the columns with the new values..

 image

If we now Visualize the left hand output of the quantize function we can see that there are now 10 unique values for each of our numerical columns..

image

Now we can  actually get on with the training itself, although in the real world we may have to revisit some of the earlier decisions we made to improve the output.  The process of training is much like learning simple arithmetic we are introduced to a new symbol say + and given example of how it works.  To confirm we’ve got it we use it ourselves and compare our results against some more examples.  In machine learning we do the initial training by carving out a sample from our data set.  In MAML we can use the split module to do this which we used this in my last post to just to get us the data from one airport we are now going to use the split function to get a random sample. 

To do create our training set we drag a Split module to the bottom of the design surface and connect its input to the left hand output of the Quantization module and set its properties as follows to give a 20% random sample we can use for training..

image

The question now is which algorithm to use to make our flight delay prediction? This is a huge topic in its own right and also requires prior knowledge of statistics to make sense of what some of these do.  Also you can elect to ignore all the good work Microsoft have done in providing some really powerful algorithms used in Xbox and Bing and bring your own algorithm written in open source R language. In MAML there are three types of built in algorithms, Classification, Clustering & Regression so what exactly do those terms mean?

  • Classification is where we want to determine which group something belongs to – in our case we want to determine whether a flight is in the delayed group r the on time group.
  • Clustering is similar to that but this time we don’t know what the groups are.  For example we might take a collection of purchases from a shopping web site and MAML would work out what things were bought with what other things to provide recommendations when a customer places something in their basket. 
  • Regression is the process of fitting points to curve or line (hence linear regression) so here we are looking at predicting a continuous variable say house price based on attribute of a property such as age area, nearest station distance etc.    

So if we expand out the  Machine Learning | Evaluate | Classification object on the left of ML studio we can see some curiously named modules like Multiclass Decision Jungle  which we can use..

Module

Description

Multiclass Decision Forest

Create a multiclass classification model using a decision forest

Multiclass Decision Jungle

Create a multiclass classification model using a decision jungle

Multiclass Logistic Regression

Create a multiclass logistic regression classification model

Multiclass Neural Network

Create a multiclass classifier using a neural network algorithm

One-vs-All Multiclass

Create an one-vs-all classification model

Two-Class Averaged Perceptron

Creates an Averaged Perceptron binary classification model

Two-Class Bayes Point Machine

Create a Bayes Point Machine binary classification model

Two-Class Boosted Decision Tree

Create a binary classifier using a boosted decision tree algorithm

Two-Class Decision Forest

Create a two-class classification model using a decision forest

Two-Class Decision Jungle

Create a two-class classification model using a decision jungle

Two-Class Logistic Regression

Create a two-class logistic regression model

Two-Class Neural Network

Create a binary classifier using a neural network algorithm

Two-Class Support Vector Machine

Creates a Support Vector Machine binary classification model

At the time of writing the help on MSDN on what these do is pretty opaque to mere mortals and to add to the fun many of these have parameters that can be tweaked as well.  At this point it’s worth recapping what data we have and what we are trying to do – In our case there is probably some sort of relationship between some of our columns for example  visibility humidity wind speed and temperature and our two outcomes are that the flight is delayed or it isn’t.  So let’s jump in as we can’t really break anything and take Microsoft’s advice that Two-Class Boosted Decision Tree is really effective in predicting one of two out comes especially when the data is sorted of related.

The other thing we can do while training our model is to work out which of columns are actually informing the decision making process.  Eliminating non significant columns both improves computation time (which we are paying for!) and also improves accuracy by eliminating spurious noise.  In our experiment it might be that the  the non-weather related attributes a factor such as  the carrier (airline), and the originating and departing airports.  Rather than guess or randomly eliminating columns with yet another Project columns module we can get MAML to do the work for us by using the Sweep Parameters module. Let’s see this in action –drag the Sweep parameters module onto the design surface. Note it has three inputs which we connect as follows:

  • the algorithm we want to use which is a Two-Class Boosted Decision Tree module
  • the trained data set this is the 20% split from the split module – so the left hand output.
  • the data to be used to evaluate the sweep is working which is simply the rest of the data form the split module.

So how does the evaluation but work – simply by selecting the column to be used which has the answer we are looking for , whether or not  a flight is delayed which is the ArrDel15 column.  We can simply leave the rest of the sweep parameters alone for now and rerun the model.

image

Each of the outputs from the sweep look very different from what we have in our data set, and although accuracy (at an average of 0.91) is in one of the columns it’s difficult for a data science newbie like me to work out how good the predictions are is. Fortunately we can get a better insight into what’s going on by using the Score and Evaluate modules.  To do this we connect the Score module the left output of the sweep parameters module, underneath this we can then drag on the Evaluate module and connect the output of the score module to the left hand input and rerun the experiment( neither of these have any settings )

image

If we now look visualize the output of the Evaluate module we get this graph..

image

What is this telling us? The key numbers are the false positive figures just under the graph where we can see that there were 302,343 correctly predicted on late flights, but 98,221 were identified as being late but were on time and that 1,466,0871 flights were correctly predicted as being on time but 61,483 were predicted to be on time but were late.  The ROC graph is also an indication of accuracy the greater the area between the blue line and a straight line is an indication of accuracy as well so the closer the blue line is to un upside down L shape the better it is.

Could we improve on this score and how might we do that?  We could tweak the parameters for the Two-Class Boosted Decision Tree and what parameters we sweep for or we could see if there is a better algorithm.  For this I would use the technique I use to evaluate a good wine which is to try blind taste two wines and select my favourite hang on to that and then  compare that with the next wine and continue until I have my favourite of a given group.  In MAML we can do this by adding another branch tot eh second input of the second input evaluate module and compare how they do.

By way of an example we can drag on the Two-Class Logistic Regression module,  copying and pasting the Score Model and Sweep Parameters modules  and connecting up these as shown below..

image

and if we visualize the evaluate module we now get two ROC curves and by highlighting them we can see the detailed results  where the red curve is the right hand input so our Two-Class Logistic Regression module

image

We can see that the Two-Class Boosted Decision Tree module is slightly outperforming Two-Class Logistic Regression and we should stick with that or compare it to something else.

So not only is MAML really easy to setup but provides sophisticated tools to help you to evolve to the right algorithm to meet your business need and we still haven’t had to enter any code or get too deep into maths & stats.  What now – well we could output the data to something like Power BI to show the results but we might want also want to use the experiment we have made to predict flights on a departure by departure basis in some sort of web site and that’s what we’ll look at in my next post in this series.