End-to-End Data Science Walkthrough with Spark 2.0 on Azure HDInsight Hadoop Clusters

This post is authored by Debraj GuhaThakurta, Senior Data Scientist, and Brad Severtson, Senior Content Developer, at Microsoft.

The data scientists among you will have seen that Spark 2.0, which was released in July 2016, offers several enhancements over Spark 1.6. These enhancements include:

  • Easier ANSI SQL and more streamlined APIs.
  • Faster data processing and summarization.
  • Enhancements to Spark’s ML library, including DataFrame-based primary ML APIs.
  • Easier RFormula-based specification of training formulae, as well as improvements in pipelining and cross-validation features.
  • ML pipeline persistence.
  • A new structured streaming API.

Microsoft Azure released Spark 2.0 on HDInsight (Linux) as a service in September 2016. To help users get a jumpstart with using Spark 2.0 on HDInsight for data science and machine learning, we are providing end-to-end data science walkthroughs using Spark 2.0 on HDInsight.

This is an update to the Spark 1.6-based walkthrough that we published in June 2016 as part of the Team Data Science Process documentation. That release contained a comprehensive walkthrough using pySpark and MLlib to demonstrate how to conduct end-to-end data science on Azure HDInsight Spark 1.6 clusters. Using detailed examples and pySpark code, made publicly available from a GitHub repository, we demonstrated how to:

  • Easily provision a managed Azure HDInsight Spark cluster, use Azure storage blobs for data import and export, and use the Jupyter notebook server on the cluster for development.
  • Upload sample Jupyter notebooks from a public GitHub repository to your Spark cluster and run them in Spark to perform typical data science tasks needed in an end-to-end scenario.
  • Pre-process and transform data, then train ML models using cross-validation and hyper-parameter sweeping.
  • Consume the trained models in production using Spark.
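
For readers new to the idea, cross-validation with hyper-parameter sweeping can be sketched in a few lines of plain Python. This is not the walkthrough's pySpark code: the toy model, the data, and every function name below are made up for illustration. In pySpark's ML library the analogous building blocks are `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning`; whether the notebooks use exactly those classes is an assumption here.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k disjoint, interleaved folds."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]


def fit_slope(xs, ys, degree, reg):
    """Toy model: y ~ w * x**degree, fitted in closed form with an L2 penalty."""
    feats = [x ** degree for x in xs]
    return sum(f * y for f, y in zip(feats, ys)) / (sum(f * f for f in feats) + reg)


def mean_squared_error(xs, ys, degree, w):
    """Average squared prediction error of the toy model on the given rows."""
    return mean((y - w * x ** degree) ** 2 for x, y in zip(xs, ys))


def sweep(xs, ys, param_grid, k=3):
    """Score every grid point by k-fold cross-validation; return (best, all_results)."""
    folds = k_fold_indices(len(xs), k)
    names = sorted(param_grid)
    results = []
    # The grid is the Cartesian product of all candidate values.
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        errors = []
        for held_out in folds:
            # Train on every row outside the held-out fold, score on the fold itself.
            train = [i for i in range(len(xs)) if i not in held_out]
            w = fit_slope([xs[i] for i in train], [ys[i] for i in train], **params)
            errors.append(mean_squared_error(
                [xs[i] for i in held_out], [ys[i] for i in held_out],
                params["degree"], w))
        results.append((mean(errors), params))
    best = min(results, key=lambda r: r[0])
    return best, results


# Made-up, noise-free data (y = 2x) so the outcome is easy to check by hand.
xs = list(range(1, 10))
ys = [2 * x for x in xs]
grid = {"degree": [1, 2], "reg": [0.0, 1.0, 10.0]}
best, results = sweep(xs, ys, grid)
print(best)
```

With 2 x 3 = 6 grid points and 3 folds, the sweep fits 18 models and selects `degree=1, reg=0.0`, the only combination that reproduces the toy data exactly. On a Spark cluster the same pattern applies, with the fits distributed across the cluster rather than run in a local loop.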

So, what has changed in the Spark 2.0 walkthroughs compared with the previously published Spark 1.6 versions? Here are the highlights of what is new and what remains the same:

Things that remain the same in the Spark 2.0 version: 

Things that are new in the Spark 2.0 version:

  • We have consolidated multiple notebooks showing basic and advanced features of ML into one streamlined notebook.
  • We have included a new and well-known dataset, namely the 2011 & 2012 airline on-time departure dataset, to demonstrate the creation and evaluation of binary classification models.

We believe the Spark 2.0 update to the data science walkthrough will be very useful for data scientists and ML practitioners developing and deploying end-to-end data science processes on Spark 2.0 clusters. Do try it out and let us know what you think.

Debraj @d_guhathakurta
Brad @Brad2435150