Since I wrote this series we have taken on two computer science graduates in the evangelism team in the UK, Amy Nicholson and Bianca Furtuna. Both studied machine learning as part of their courses so I introduced them to this series on Azure ML to get them up to speed.  They have taken this series apart and put it back together again as a full day training course which we have been testing at several events recently.  They also spotted some improvements and so I thought I would share these.

Amount of Training Data:

When I split the data into training and test sets in the experiment, I kept only 20% of the data for training.  What I should have done is use 80% of the data for training instead: in this type of machine learning scenario the majority of the data is usually given to train the model, and the remaining 20% is used to validate it.  This allows the model to really refine the patterns it finds in the data and predict with more accuracy.  It’s also important to have a good balance of data, and in this case we need a mix of flights that are late and on time.
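In Azure ML the split is configured on the module rather than coded, but as a rough illustration of the idea, here is a minimal Python sketch of an 80/20 split that keeps the same proportion of late and on-time flights on both sides. The `stratified_split` function, the `late` field and the sample rows are all made up for the example and are not part of the original experiment.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows into (train, test), preserving the label mix in each part."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group the rows by label (e.g. late / on time).
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]            # copy so the caller's data is untouched
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])   # majority goes to training
        test.extend(group[cut:])    # remainder is held back for validation
    return train, test


# Hypothetical data: 100 flights, 30 of them late.
flights = [{"id": i, "late": i % 10 < 3} for i in range(100)]
train, test = stratified_split(flights, lambda r: r["late"])
print(len(train), len(test))
```

Splitting each label group separately is what keeps the balance of late and on-time flights roughly equal in the training and test sets.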

Excluding Month and DayofMonth:

I removed the time-related columns used for joining the Flight and Weather data.  Amy wondered if the time of year was significant and ran some tests with the Boosted Decision Tree algorithm:

|          | Keep Month, DayofMonth | Remove Month, DayofMonth |
|----------|------------------------|--------------------------|
| Accuracy | 0.935                  | 0.933                    |

However, to be fair to me, there isn’t enough data in the sample to really test this, as there’s less than a year’s worth of data.  Time of day might be important as well.  The point here is to experiment, test and iterate to a good result, and knowledge of the business data and maths combined with some scientific rigour is needed for that.
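One simple way to apply a bit of rigour when comparing two accuracy figures is to estimate how much an accuracy could move by chance on a test set of a given size. The sketch below uses the standard normal approximation for a proportion. The test set size `n_test` is a hypothetical number chosen for illustration (the real size isn't stated here), and because both models would be scored on the same test rows this is only a rough guide, not a proper significance test.

```python
import math


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n test rows."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width


n_test = 10_000  # hypothetical test set size, NOT taken from the experiment
low, high = accuracy_interval(0.935, n_test)

# Does the other result (0.933) fall inside the interval around 0.935?
overlaps = low <= 0.933 <= high
print(f"{low:.4f} to {high:.4f}, overlaps: {overlaps}")
```

If the second figure sits inside the interval, as it would for a test set of this assumed size, the difference is too small to say that one set of columns is really better than the other.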

Quantize Module Use:

I explained and used quantization in my experiment (the process of putting a continuous variable like temperature into several buckets of ranges), as this is in the Azure ML sample experiment on Flight Delay.  Amy decided to test whether this was improving the accuracy of the model by removing it:

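|          | Keep Month, DayofMonth + No Quantize Module | Remove Month, DayofMonth + Quantize Module |
|----------|---------------------------------------------|--------------------------------------------|
| Accuracy | 0.935                                       | 0.930                                      |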

So it turns out it’s better not to do that for this scenario.  Amy then wondered whether other two-class algorithms in ML worked better with quantised data:

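|                        | Without Quantize Module | With Quantize Module |
|------------------------|-------------------------|----------------------|
| Boosted Decision Tree  | 0.935                   | 0.933                |
| Logistic Regression    | 0.927                   | 0.916                |
| Support Vector Machine | 0.924                   | 0.916                |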

So it’s not helping here at all.  Amy did see a slight reduction in processing time using quantisation and you might argue that a small drop in accuracy might be acceptable to really speed up processing but the improvement
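For anyone unfamiliar with what the quantization step in this section actually does to a column, here is a minimal Python sketch that puts a continuous variable into equal-width buckets. This is an assumption made for illustration: the binning mode used in the sample experiment may well be different, and the `equal_width_bins` function and the temperature values are invented.

```python
def equal_width_bins(values, n_bins):
    """Replace each value with the index (0..n_bins-1) of its equal-width bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bucket.
        return [0] * len(values)
    # The maximum value would compute to n_bins, so clamp it into the last bucket.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical temperatures in degrees Celsius.
temps = [-5.0, 0.0, 3.5, 12.0, 18.0, 25.0, 31.0, 35.0]
buckets = equal_width_bins(temps, 4)
print(buckets)  # [0, 0, 0, 1, 2, 3, 3, 3]
```

The model then sees only the bucket number rather than the exact temperature, which is why some detail (and here a little accuracy) can be lost.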

Conclusion

I learnt three things from this feedback as well as the specific errors above:

1.  Expertise. It is worth talking to and learning from a real data scientist and sharing your work with them. Azure ML makes this easy by allowing me to invite others into my workspace.

2.  Do good science.  Don’t slavishly follow other people’s work; be curious but sceptical, and test your experiments rigorously.

3. Document your work.  If you want to share your work you'll need to explain your hypothesis and working like any good scientist. Azure ML does allow comments, but it doesn’t have a way of storing other metadata about an experiment the way a solution in Visual Studio does.