This post is authored by Xibin Gao, Data Scientist, Debraj GuhaThakurta, Senior Data Scientist, Gopi Kumar, Principal Program Manager, and Hang Zhang, Senior Data Science Manager, at Microsoft.
When presented with a new dataset, the first questions data scientists need to answer include:
- What does the data look like? What’s the schema?
- What’s the quality of the data? What’s the severity of missing data?
- How are individual variables distributed? Do I need to do variable transformation?
- How relevant is the data to the machine learning task? How difficult is the machine learning task itself?
- Which variables are most relevant to the machine learning target?
- Is there any specific clustering pattern in the data?
- How will ML models on the data perform? Which variables are significant in the models?
Data scientists typically spend a significant amount of time writing code seeking answers to the above questions. Although datasets differ between projects, much of the code can be generalized into data science utilities that can be reused across projects, thus helping with productivity. Additionally, such utilities can help data scientists work on specific tasks in a project in a guided mode, ensuring consistency and completeness of the underlying tasks.
We are therefore excited to announce the public availability of two data science utilities which we believe will help boost your productivity:
- Interactive Data Exploration, Analysis and Reporting (IDEAR), and
- Automated Modeling and Reporting (AMAR).
These two utilities, which run in CRAN-R, can be accessed from this GitHub site. They were published as part of the Team Data Science Process (TDSP), which we launched at the Microsoft Machine Learning & Data Science Summit in Atlanta last month and discussed in our blog post last week.
The Interactive Data Exploration, Analysis and Reporting tool, or IDEAR, helps data scientists explore, visualize and analyze data, providing insights into the data in an interactive manner. The interactivity is powered by the Shiny library from RStudio. When you see visualizations or analysis results that could be helpful in data discussions with others, you can click a button to export the R scripts that generated them to an R log file. When you click the “Generate Report” button in IDEAR, the R log file is run to generate the data report automatically. You can use this report directly for in-depth data discussions with teammates, a data provider, or your client, for instance.
Features of IDEAR we’d like to highlight include:
Automatic Variable Type Detection
This feature is helpful when a data scientist is handed a wide dataset without any documentation of the variable types. IDEAR automatically detects variable types based on the number of unique values and the average number of observations per unique value. Detection results are written to a YAML file, which you can review and correct until the declared types match your understanding of the data.
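IDEAR itself is implemented in R and its exact detection rules are in the GitHub repository; purely to illustrate the kind of heuristic described above, here is a minimal Python sketch. The thresholds `max_unique` and `min_obs_per_level` are made-up assumptions for the example, not IDEAR's actual cutoffs.

```python
def detect_variable_type(values, max_unique=20, min_obs_per_level=3):
    """Guess whether a column is categorical or numerical.

    Heuristic (an assumption, not IDEAR's actual rule): a column with few
    unique values, each repeated several times on average, is treated as
    categorical; otherwise, anything parseable as a number is numerical.
    """
    non_missing = [v for v in values if v is not None]
    unique_vals = set(non_missing)
    obs_per_level = len(non_missing) / max(len(unique_vals), 1)
    if len(unique_vals) <= max_unique and obs_per_level >= min_obs_per_level:
        return "categorical"
    try:
        for v in unique_vals:
            float(v)
        return "numerical"
    except (TypeError, ValueError):
        return "categorical"

# Toy columns: many distinct ages vs. a two-level code repeated often.
ages = [23, 45, 31, 52, 23, 40, 61, 33, 27, 45, 50,
        38, 29, 44, 36, 48, 55, 42, 30, 26, 39, 47]
sexes = ["M", "F", "M", "M", "F", "F", "M", "F"] * 3

print(detect_variable_type(ages))   # numerical
print(detect_variable_type(sexes))  # categorical
```

A real tool would also distinguish dates, free text, and ID-like columns; the point here is just that unique-value counts alone already separate the common cases.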
Variable Ranking and Target Leaker Identification
IDEAR ranks numerical and categorical independent variables based on the strength of their association with the target variables. If some variables have significantly stronger associations with the target than the rest, IDEAR alerts you that they may actually be target leakers. This feature also serves to evaluate the relevance of the data to the ML task.
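The ranking-and-leaker idea can be shown outside R. The Python sketch below ranks numerical variables by absolute Pearson correlation with a binary target and flags near-perfect associations; the `leak_threshold` of 0.95 is our illustrative assumption, not a threshold taken from IDEAR.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_variables(columns, target, leak_threshold=0.95):
    """Rank numerical variables by |correlation| with the target and flag
    suspiciously strong associations as potential target leakers."""
    ranked = sorted(
        ((name, abs(pearson_r(vals, target))) for name, vals in columns.items()),
        key=lambda kv: kv[1], reverse=True)
    return [(name, r, r >= leak_threshold) for name, r in ranked]

target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "hours_per_week": [20, 30, 45, 50, 25, 60, 22, 48],
    "leaky_copy":     [0.0, 0.1, 1.0, 0.9, 0.0, 1.0, 0.1, 0.9],  # nearly the target itself
}
for name, r, is_leaker in rank_variables(columns, target):
    print(f"{name}: |r|={r:.2f}" + ("  <-- possible target leaker" if is_leaker else ""))
```

For categorical variables one would substitute an association measure such as Cramér's V or mutual information, but the ranking logic is the same.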
The figure below shows the top ranked numerical and categorical variables for the target “IsIncomeOver50K” in the UCI adult income dataset.
Visualizing High-Dimensional Data
Visualizing high-dimensional data is often a challenge, but it can be very helpful in identifying clustering patterns in the data. For ML tasks, building separate models for different clusters of observations can significantly improve the performance of the predictive models. Customer clustering and segmentation, for instance, are common practices in marketing and CRM.
IDEAR projects the high-dimensional numerical matrix into a 2-D or 3-D principal component space. In 3-D principal component spaces, you can change the view angle to visualize the data in different perspectives, which may be helpful in revealing clustering patterns.
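IDEAR relies on R's PCA facilities for this projection; to show what the projection amounts to, here is a self-contained Python sketch that finds the top two principal components by power iteration with deflation and projects the data onto them. It is a pedagogical illustration, not IDEAR's implementation.

```python
import math
import random

def pca_2d(rows, iters=200, seed=0):
    """Project rows (equal-length numeric lists) onto the top two
    principal components, via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # center
    # Sample covariance matrix (d x d).
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]

    rng = random.Random(seed)
    def top_eigvec(M):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(M[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        return v

    v1 = top_eigvec(C)
    lam1 = sum(v1[a] * sum(C[a][b] * v1[b] for b in range(d)) for a in range(d))
    # Deflate: subtract the first component's contribution, then repeat.
    C2 = [[C[a][b] - lam1 * v1[a] * v1[b] for b in range(d)] for a in range(d)]
    v2 = top_eigvec(C2)
    return [(sum(x * u for x, u in zip(r, v1)),
             sum(x * u for x, u in zip(r, v2))) for r in X]

# Two well-separated toy clusters in 3-D; after projection to 2-D the
# separation shows up along the first principal component.
clusters = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0],
            [10, 10, 10], [11, 10, 11], [10, 11, 11], [11, 11, 10]]
proj = pca_2d(clusters)
```

In IDEAR the analogous 2-D/3-D scatter is drawn interactively, which is what makes rotating the 3-D view to hunt for clusters practical.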
The Automated Modeling and Reporting tool, or AMAR, is a customizable tool to train machine learning models with hyper-parameter sweeping, compare the accuracy of those models, and look at variable importance. A parameter input file is used to specify which models to run, what part of the data is to be used for training and testing, the parameter ranges to sweep over, and the strategy for best parameter selection (e.g. cross-validation, bootstrapping etc.).
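AMAR runs in R and reads its own parameter file format (documented in the GitHub repository). Purely to illustrate the sweep-and-compare loop the paragraph above describes, here is a minimal Python sketch: a made-up config dict (field names are hypothetical) drives a grid search over one hyper-parameter of a toy threshold classifier, scored by k-fold cross-validation.

```python
from itertools import product
from statistics import mean

# Hypothetical parameter spec, loosely mirroring the role of AMAR's input
# file; the actual file format and field names belong to the tool itself.
config = {
    "cv_folds": 3,
    "param_grid": {"threshold": [0.3, 0.5, 0.7]},
}

def accuracy(threshold, data):
    """Toy classifier: predict 1 when the score exceeds the threshold."""
    return mean(1.0 if (x >= threshold) == bool(y) else 0.0 for x, y in data)

def cv_sweep(data, config):
    """Grid-search the parameter combinations, scoring each by k-fold
    cross-validated accuracy, and return the best (params, score) pair."""
    k = config["cv_folds"]
    folds = [data[i::k] for i in range(k)]
    grid = config["param_grid"]
    results = []
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        fold_scores = [accuracy(params["threshold"], folds[i]) for i in range(k)]
        results.append((params, mean(fold_scores)))
    return max(results, key=lambda r: r[1])

# (score, label) pairs that a 0.5 threshold separates perfectly.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0), (0.45, 0), (0.15, 0),
        (0.55, 1), (0.6, 1), (0.65, 1), (0.75, 1), (0.8, 1), (0.9, 1)]
best_params, best_score = cv_sweep(data, config)
print(best_params, best_score)
```

A real sweep would train a model on the remaining folds before scoring the held-out one; the config-driven loop over a grid and a selection strategy is the part this sketch is meant to show.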
When this utility completes running, a standard HTML model report is output with the following information:
- A view of the top few rows of the dataset used for training.
- The training formula used to create the models.
- The accuracy of the various models (AUC, RMSE, etc.) and, if multiple models are trained, a comparison across them.
- Variable importance ranking.
For example, when you run this utility on the same UCI adult income dataset, the three plots below show:
- A comparison of ROC/AUC, sensitivity and specificity (evaluated on the held-out folds of 3-fold cross-validation) for random forest, xgboost, and glmnet.
- ROC curves, with AUC, on a test dataset.
- Feature importance ranking.
With virtually no setup and coding effort, the modeling report can provide an initial assessment of the prediction accuracy of several commonly used machine learning approaches, as well as guidance for the subsequent steps of feature selection and modeling. You also have the freedom to add more algorithms to AMAR and to change the hyper-parameter sweeping grids.
You can try these utilities by cloning the GitHub repository. Two sample datasets are included with the utilities; you can use these to get a quick sense of the two tools, or try the utilities on your own dataset.
We hope you will try these tools and the Team Data Science Process as part of your next data science project. Be sure to send us your feedback and thoughts, either via the comments feature below, or on the issues tab of our GitHub repository, or via tweet to @zenlytix. We are always looking for ways to improve our tools and make them even more useful for a broad range of analytics scenarios.
Xibin, Debraj, Gopi & Hang