This post is by Hang Zhang and Chirag Dhull of the Microsoft Information Management & Machine Learning team.
Each year the KDD Cup brings the data science community together to compete for a coveted leaderboard spot, awarded at the annual KDD conference. This year's challenge, the KDD Cup 2015, requires participants to predict the likelihood of a student dropping out from a MOOC platform, XuetangX. This is a typical customer churn analysis problem. Gaining a better understanding of customer churn is a top priority for not just MOOC platforms but almost all businesses.
Most of you on this blog are familiar with Azure ML, Microsoft’s cloud-based tool that helps developers and data scientists easily build and operationalize their predictive analytics solutions. In this blog post we discuss two sample experiments that we have published in the Azure ML Gallery and that serve as a starting point for those of you looking to participate in the latest KDD challenge. Use them as a baseline and build on them in the Azure ML Studio environment, where you can drag and drop an extensive set of battle-tested algorithms or use your own custom R or Python scripts.
The two sample experiments are:
The two experiments – Low and High – differ in complexity, performance, and running speed. The Low version is a subset of the High. Here are the key differences between the two experiments:
The Low experiment is less complex than the High: it creates fewer features and trains a smaller gradient boosted decision tree model. For instance, the Low version uses a maximum of 50 leaves per tree and 600 trees, vs. 100 leaves and 1200 trees in the High version.
The Low experiment scores lower, with AUC = 0.853 on the public leaderboard, whereas the High experiment reaches AUC = 0.873.
The Low experiment, because it’s simpler, runs faster. It can complete in as little as 18 minutes, whereas the High experiment can take 2 hours.
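To make the trade-off concrete, here is a minimal Python sketch that records the settings quoted above and works out the ratios between the two versions. The class and field names are hypothetical (they are not Azure ML module parameters), and the running times are the approximate figures from this post, so treat the output as a rough comparison rather than a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSettings:
    """Figures quoted in this post; field names are illustrative only."""
    name: str
    max_leaves_per_tree: int
    num_trees: int
    public_auc: float
    approx_runtime_minutes: int  # approximate, as reported in the post


LOW = ExperimentSettings("Low", 50, 600, 0.853, 18)
HIGH = ExperimentSettings("High", 100, 1200, 0.873, 120)


def compare(low: ExperimentSettings, high: ExperimentSettings) -> dict:
    """Return the AUC gain and the size/runtime ratios of high vs. low."""
    return {
        "auc_gain": round(high.public_auc - low.public_auc, 3),
        "runtime_ratio": round(
            high.approx_runtime_minutes / low.approx_runtime_minutes, 1
        ),
        "tree_ratio": high.num_trees / low.num_trees,
        "leaf_ratio": high.max_leaves_per_tree / low.max_leaves_per_tree,
    }


if __name__ == "__main__":
    for key, value in compare(LOW, HIGH).items():
        print(f"{key}: {value}")
```

In other words, doubling both the tree count and the leaves per tree (together with the extra features) buys about 0.02 AUC at roughly six to seven times the running time, which is why the Low version is a reasonable place to iterate before switching to the High version.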
Below are the graphs of the two sample experiments. The highlighted modules are the additions in the High version that help improve the model’s accuracy. Because the Low version is a strict subset of the High version, the step-by-step walkthrough tutorial that we published on GitHub describes only the High version of the experiment.
After you copy either of these samples to your Azure ML workspace, you can click the “Run” button at the bottom of the page. Once the experiment completes, right-click the output port of the “Convert to CSV” module at the bottom right-hand corner of the experiment (circled in red in the following graphs) and select “Download”; the predictions on the test data will be downloaded to your local machine. You need to delete the header line of the downloaded CSV file before you submit it to the competition website for evaluation.
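If you would rather not remove the header line by hand, a few lines of Python can do it. The sketch below is one possible approach, not part of the published experiments: the function name, the file names, and the column names in the demo are all hypothetical, and it assumes the header occupies exactly the first line of the file (no quoted fields containing line breaks).

```python
import os
import tempfile


def strip_header(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without its first line.

    Returns the number of data lines written. Assumes the header is
    exactly one physical line.
    """
    written = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8", newline="") as fin, \
            open(dst_path, "w", encoding="utf-8", newline="") as fout:
        next(fin, None)  # skip the header line, if the file has one
        for line in fin:
            fout.write(line)
            written += 1
    return written


if __name__ == "__main__":
    # Demo on a throwaway file; the column names here are made up.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "predictions_with_header.csv")
        dst = os.path.join(tmp, "submission.csv")
        with open(src, "w", encoding="utf-8", newline="") as f:
            f.write("id,score\n1,0.12\n2,0.87\n")
        print(strip_header(src, dst), "data lines written")
```

Writing to a separate output file leaves the downloaded predictions intact, so you can re-run the step if the submission format turns out to need something different.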
High version (AUC=0.873, running time = ~2 hours)
Low version (AUC=0.853, running time = ~18 minutes)
We hope these experiments kick-start your participation in the KDD Cup and also help you learn more about cloud-based machine learning. Incidentally, Microsoft is the platinum sponsor of this year’s KDD conference in Sydney, and we hope to see some of you there!
Hang & Chirag
You can email Hang here.