Announcing Data Science Utilities Version 0.11, for the Team Data Science Process

This post is authored by Hang Zhang, Senior Data Scientist Manager, Gopi Kumar, Principal Program Manager, and Xibin Gao, Data Scientist, at Microsoft.

Back in September 2016, we released an early public preview of Team Data Science Process (TDSP), with the goal of supporting secure collaboration within enterprise data science organizations, with capabilities such as versioning, knowledge management and more. TDSP helps you structure your data science projects by providing a standardized set of Git repositories, document templates and utilities that are relevant at different stages of your development lifecycle. One of the key elements of TDSP is a repository of utilities that we’ve specifically created to boost data science productivity. Along with our September 2016 release (version 0.1), we published two R-based data science utilities to a public GitHub repository, namely:

  • Interactive Data Exploration, Analysis and Reporting (IDEAR), which helps data scientists to explore and visualize a data set interactively, and
  • Automated Modeling and Reporting (AMAR), which facilitates baseline model training, model sweeping, and parameter sweeping.

Motivated by your feedback and suggestions, and based on our understanding of the needs of the data science community, we are now pleased to announce TDSP Data Science Utilities version 0.11. This new version, which was officially released during the recent holiday season (on December 22nd, 2016), includes several new features and enhancements that we describe in this blog post.

New Features

IDEAR Now Runs in Python

According to the KDnuggets 2016 poll, R and Python are the top two languages for analytics and data science. To better support the data science community, we are now releasing IDEAR in Jupyter Notebooks (Python 2.7). Data scientists who prefer Python can now explore and visualize data with functionality similar to what IDEAR already provided in R. Users can upload the IDEAR Jupyter Notebook to a Jupyter Notebook server, configure the working directory in the notebook, and start investigating data sets. More detailed instructions can be found in the GitHub repository. The interactivity is enabled by the ipywidgets library in Python.

IDEAR in R Now Extracts Date Time Components Automatically from Datetime Fields

Datetime is a common data type encountered in business applications such as customer churn, fraud detection and demand forecasting. Data scientists usually write code to extract datetime components such as year, month, weekday, week of year or hour, and then use them as extra variables for further analysis and modeling. This new feature of IDEAR in R extracts these datetime components automatically and adds them directly to the original dataset, with column names ending with _autogen_year, _autogen_month, etc. IDEAR in R works with this enhanced dataset, allowing data scientists to visualize and obtain insights on how these date and time components impact the target variable. Users only need to specify which columns are datetime columns and provide their format in the YAML configuration file. To try this feature, use the UCI_Bike_Rental data by passing the para-bike-rental-hour.yaml file to IDEAR. This data has a datetime column dteday in the format “YYYY-mm-dd”, which has been specified under DateTimeColumns in the YAML file.
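
For readers who want a feel for what this kind of extraction involves, here is a minimal sketch in plain Python. It is not the IDEAR implementation (which is written in R); the function name add_datetime_components and the toy rows are hypothetical, and only the _autogen_year and _autogen_month suffixes come from the description above. The remaining suffixes are assumptions made for illustration.

```python
from datetime import datetime


def add_datetime_components(rows, column, fmt):
    """Return copies of `rows` (a list of dicts) with datetime parts added.

    `column` names the datetime field and `fmt` is its strptime format,
    mirroring the two things the user specifies in the YAML file.
    """
    enriched = []
    for row in rows:
        parsed = datetime.strptime(row[column], fmt)
        new_row = dict(row)  # leave the input rows untouched
        # Suffixes named in the post:
        new_row[column + "_autogen_year"] = parsed.year
        new_row[column + "_autogen_month"] = parsed.month
        # Suffixes below are assumed for this sketch:
        new_row[column + "_autogen_weekday"] = parsed.isoweekday()  # 1 = Monday
        new_row[column + "_autogen_weekofyear"] = parsed.isocalendar()[1]  # ISO week
        new_row[column + "_autogen_hour"] = parsed.hour
        enriched.append(new_row)
    return enriched


# Toy rows for illustration only; these are not taken from the bike rental data.
toy_rows = [{"dteday": "2012-07-04", "cnt": 10}]

# "YYYY-mm-dd" corresponds to "%Y-%m-%d" in Python's strptime notation.
result = add_datetime_components(toy_rows, "dteday", "%Y-%m-%d")
print(result[0])
```

Because the sample value carries no time of day, the hour component comes out as 0; a format that includes hours would populate it.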

IDEAR in R Now Runs in Visual Studio with R Tools for Visual Studio (RTVS)

With our September 2016 release, IDEAR in R needed to be run on RStudio. For data scientists who prefer Visual Studio with RTVS as their data science IDE, we now have the option to run IDEAR in R in Visual Studio. You can do so simply by changing the option “Shiny pages browser” to External in Visual Studio. More detailed instructions can be found here: Instructions for using IDEAR in R.

Enhanced Features

We are also pleased to offer the following enhancements to our earlier features:

  • Slices in pie charts of individual categorical variable visualizations are sorted by the frequencies of the categorical variable levels. Sorting slices in this manner makes pie charts more readable, especially when individual variables have a large number of levels (e.g. week number, day of month, etc.). The bar chart of the categorical variable still takes the ordinal order of the levels, to provide a view complementary to pie charts sorted by frequencies.
  • Enhanced readiness to run IDEAR in both R and Python on the Azure Data Science Virtual Machine (DSVM). DSVM, by default, ships with a Jupyter Notebook server, Anaconda Python and Microsoft R Open. The most recent release of DSVM carries all the libraries necessary to run IDEAR in both R and Python. After you clone the Data Science Utilities repository to your DSVM, you can use IDEAR in R on Visual Studio with RTVS, or use IDEAR in Jupyter Notebooks (Python 2.7) after simply launching your Jupyter Notebook server. For instructions on launching the Jupyter Notebook server on the Azure DSVM, see this article: Ten Things You Can Do On the Data Science Virtual Machine.
  • IDEAR in R now has a more consistent coding style, following the guidance provided in Hands-On Data Science Sharing R Code — With Style. This makes source code more readable.
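
The two orderings described in the first enhancement above (pie slices by frequency, bars by the order of the levels) can be illustrated with a small Python sketch. This is not IDEAR's plotting code; the function names pie_order and bar_order and the toy values are hypothetical, and the sketch only computes the order in which levels would be drawn.

```python
from collections import Counter


def pie_order(values):
    """Levels with counts, most frequent first (ties broken by level)."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def bar_order(values):
    """Levels with counts, in the natural order of the levels."""
    counts = Counter(values)
    return sorted(counts.items())


# Toy categorical variable (weekday numbers), for illustration only.
weekdays = [1, 3, 3, 5, 3, 1, 7]

print(pie_order(weekdays))  # [(3, 3), (1, 2), (5, 1), (7, 1)]
print(bar_order(weekdays))  # [(1, 2), (3, 3), (5, 1), (7, 1)]
```

Seeing both orderings side by side shows why the views complement each other: the first surfaces the dominant levels, while the second preserves the sequence of the levels themselves.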

Next Steps

Go ahead and try these new utilities by cloning the GitHub repository. Two sample datasets are included as well, so you can try the utilities on them or on your own datasets.

We hope you get a chance to use these tools and the Team Data Science Process in your next data science project. Send us your feedback as always – you can use the comments feature below, go to the issues tab of our GitHub repository above, or tweet to @zenlytix. We’re always looking for ways to improve our tools and make them even more useful across an even broader range of analytics scenarios.

Hang, Gopi & Xibin