Latest Rev of Utilities for Microsoft Team Data Science Process (TDSP) Now Available

This post is authored by Hang Zhang, Senior Data Science Manager, Xibin Gao, Data Scientist, and Wei Guo, Data Scientist, of Microsoft.

We are excited to announce the release of version 0.12 of the Microsoft Team Data Science Process utilities. We first released the Team Data Science Process (TDSP) in September 2016, along with a set of data science utilities (version 0.1), to help boost the productivity of data scientists. In this blog post, we share our latest feature additions and enhancements.

New Features

IDEAR in Microsoft R Server, for Big Data

Microsoft R Server (MRS) is the enterprise-class analytics platform for R. It supports exploring, visualizing and analyzing big data on a single machine or on Hadoop or Spark clusters. The previously released IDEAR, in open source R, is constrained by memory size because data is loaded into memory before exploration. We have now released IDEAR in MRS, which allows R users to explore and analyze big data interactively and generate data reports automatically. These changes are mostly under the hood and not necessarily visible in the user interface. In other words, IDEAR in MRS offers the same user experience as IDEAR in open source R, with the added ability to handle big data. Microsoft offers a free Microsoft R Server Developer Edition. If you are using an Azure Data Science Virtual Machine (DSVM), the MRS Developer Edition comes pre-installed, so you can start using IDEAR in MRS right away.

IDEAR in Python 3

Since Python 2.7 will not be maintained past 2020, it makes sense to develop IDEAR in Python 3. The newly released IDEAR in Python runs on both Python 3.5 and Python 2.7. Future versions of IDEAR will target Python 3.x only, and IDEAR in Python 2.7 will be deprecated.

IDEAR in Python 3 on Azure Notebooks Services

We also released an Azure Notebooks version of IDEAR in Python 3.5, named IDEAR-Python-AzureNotebooks.ipynb. Using the Azure Notebooks service saves you the time and trouble of setting up Jupyter Notebook servers and installing the necessary libraries. IDEAR-Python-AzureNotebooks.ipynb reads both data and YAML files from Azure Blob Storage. The interactive data exploration, analysis and visualization capabilities are the same as those of IDEAR in Jupyter Notebooks (IDEAR.ipynb). The only difference is that IDEAR-Python-AzureNotebooks.ipynb does not have functions to generate reports automatically.

Feature Enhancements

Checking Missing Values in IDEAR in R

Data scientists pay close attention to missing values because they are an important data quality consideration in any analysis. We now provide a feature to assess and visualize the severity of missing values in your data. This helps users identify which variables have the highest rates of missing values, and where the missing values occur (e.g., in which segments of rows).
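To make the idea concrete, here is a minimal sketch of the two summaries described above: missing rate per variable, and missing rate per segment of rows. It is written in plain Python for illustration only and is not IDEAR's implementation (this feature ships in IDEAR in R). The function names, the sample rows and the use of None to mark a missing value are all assumptions made for this example.

```python
def missing_rate_by_column(rows, columns):
    """Return {column: fraction of rows with a missing value}, highest rate first.

    `rows` is a list of dicts; a value of None (or an absent key) counts as missing.
    """
    if not rows:
        return {}
    n = len(rows)
    rates = {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


def missing_rate_by_segment(rows, column, segment_size):
    """Return the missing rate of one column within consecutive segments of rows."""
    rates = []
    for start in range(0, len(rows), segment_size):
        segment = rows[start:start + segment_size]
        rates.append(sum(1 for r in segment if r.get(column) is None) / len(segment))
    return rates


# Hypothetical sample data, loosely modeled on a bike rental table.
sample_rows = [
    {"temp": 0.3, "humidity": None, "season": "winter"},
    {"temp": None, "humidity": None, "season": "winter"},
    {"temp": 0.5, "humidity": 0.6, "season": "spring"},
    {"temp": 0.7, "humidity": 0.4, "season": None},
]

column_rates = missing_rate_by_column(sample_rows, ["temp", "humidity", "season"])
segment_rates = missing_rate_by_segment(sample_rows, "humidity", segment_size=2)
```

In this toy data set, humidity has the highest missing rate, and all of its missing values fall in the first segment of rows, which is the kind of pattern the visualization is meant to surface.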


Principal Component Analysis on Mixed Data Types, in IDEAR for Open Source R

Most data sets contain both numerical and categorical variables, and sometimes categorical variables even dominate a data set. In this release, we use PCAmixdata to handle a mixture of categorical and numerical variables. The image below shows a clear clustering pattern, colored by the variable season, obtained by applying IDEAR with the PCAmixdata library to the Bike Rental sample data shipped with the utilities.


Numerical Variable Histograms Grouped by Categorical Variable Levels, in IDEAR in MRS

This enhancement lets users easily compare how the distribution of a numerical variable differs across the levels of a categorical variable.
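The computation behind such a plot can be sketched as follows. This is a plain Python illustration, not the MRS code used by IDEAR. The function name, the sample values and the choice of equal-width bins shared by all groups are assumptions made for this example.

```python
from collections import defaultdict


def grouped_histograms(values, groups, n_bins):
    """Count `values` into n_bins equal-width bins, separately for each group.

    All groups share the same bin edges so that their histograms are comparable.
    Returns {group: [count_in_bin_0, ..., count_in_bin_(n_bins - 1)]}.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bin
    counts = defaultdict(lambda: [0] * n_bins)
    for value, group in zip(values, groups):
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[group][index] += 1
    return dict(counts)


# Hypothetical rental counts with the season in which each was observed.
rental_counts = [10, 12, 14, 40, 42, 48]
seasons = ["winter", "winter", "winter", "summer", "summer", "summer"]

histograms = grouped_histograms(rental_counts, seasons, n_bins=2)
```

Here all winter observations land in the lower bin and all summer observations in the upper bin, so the two per-season histograms differ clearly even though a single pooled histogram would hide which season drives which part of the distribution.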


Numerical Interactions Grouped by Categorical Variables, in IDEAR in MRS

The interaction between two numerical variables can be influenced by a third, categorical variable. You now have the option to view the scatterplot of two numerical variables grouped by the levels of a categorical variable.
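A small numeric example shows why grouping matters. The sketch below uses the Pearson correlation per group as a stand-in for what a grouped scatterplot shows visually. It is an illustration in plain Python rather than IDEAR's own code, and the function names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_by_group(xs, ys, groups):
    """Return {group: Pearson correlation of x and y within that group}."""
    buckets = defaultdict(lambda: ([], []))
    for x, y, group in zip(xs, ys, groups):
        buckets[group][0].append(x)
        buckets[group][1].append(y)
    return {group: pearson(gx, gy) for group, (gx, gy) in buckets.items()}


# Hypothetical data: y rises with x in group "a" and falls with x in group "b".
x_values = [1, 2, 3, 1, 2, 3]
y_values = [2, 4, 6, 6, 4, 2]
labels = ["a", "a", "a", "b", "b", "b"]

overall = pearson(x_values, y_values)
per_group = correlation_by_group(x_values, y_values, labels)
```

In this constructed data the pooled correlation is zero, while each group on its own shows a perfect linear relationship (positive in "a", negative in "b"). A scatterplot grouped by the categorical variable exposes exactly this kind of structure.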


Next Steps

You can download and try out these new features in the data science utilities, and send us your feedback or feature requests via the comments section below, on the issues tab of our GitHub repository, or on Twitter at @zenlytix. We continue to improve this toolset to better serve your data science project needs, and we look forward to hearing from you.

Hang, Xibin & Wei
