This post is by Jacob Spoelstra, Data Science Director, Hang Zhang, Senior Data Scientist Manager, and Gopi Kumar, Principal Program Manager in the Data Science Team at Microsoft.
Are you building a data science team but unsure how to make the team productive? Are you concerned that the lack of collaboration or consistent processes could hinder project success? Are you doing too many routine data science tasks manually? Do you face challenges capturing or reusing knowledge from data initiatives across your teams?
Microsoft is happy to introduce the Team Data Science Process (TDSP) – a methodology and set of practices for collaborative data science. TDSP is designed to help you fully realize the promise of data science for your business, and addresses each of the issues above.
What is TDSP?
TDSP is an agile, iterative, data science methodology to improve collaboration and team learning. It has the following components:
- A data science lifecycle definition.
- A standard project structure, including a well-defined directory hierarchy and a list of output artifacts in a standard document template structure that are stored in a versioned repository.
- A shared and distributed analytics infrastructure.
- Productivity tools and utilities for data scientists. These simplify adherence to the process by automatically producing project artifacts and providing scripts for common tasks such as the creation and management of repositories and shared analytics resources.
More information on each of these components is in the sections that follow.
Data Science Lifecycle
The data science lifecycle is a systematic set of steps that starts with a firm understanding of the business problem or question at hand. It also includes the development of predictive analytics models and their deployment as predictions in intelligent applications. Data science is a highly iterative discovery process with an emphasis on evaluating and validating each step along the way, followed by refining the hypothesis and the models, leading to a sound solution.
Standard Project Structure
Having projects share a common directory structure and having project documents use a similar template make it easier for the team to find information about past projects.
The project structure also drives quality by ensuring that all aspects of a project, as listed in the document template, get addressed in a checklist-like fashion. All artifacts (documents and code) are stored in a version control system such as Git, Team Foundation Server (TFS), or Subversion, allowing the team to collaborate easily. Tracking tasks and features in an Agile project tracking system such as Jira, Rally, or Visual Studio Team Services (VSTS) ties code more closely to individual features and helps teams improve their estimates of the overall effort needed. Our data science team at Microsoft uses VSTS for its Git repository support and for tracking Agile tasks and sprints.
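To make the idea of a shared directory hierarchy concrete, here is a minimal Python sketch that scaffolds a new project. The folder names and the `scaffold_project` helper are illustrative assumptions, not the published TDSP template; substitute whatever layout your team standardizes on.

```python
from pathlib import Path

# Hypothetical layout for illustration only; not the official TDSP template.
PROJECT_LAYOUT = [
    "code/data_preparation",
    "code/modeling",
    "code/deployment",
    "docs/project",
    "docs/data_report",
    "docs/model_report",
    "sample_data",
]


def scaffold_project(root):
    """Create the standard folders under `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

Running `scaffold_project("my_project")` once per new project and committing the result gives every repository the same starting shape, which is what makes past projects easy to navigate.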
Shared and Distributed Analytics Infrastructure
TDSP provides recommendations for managing shared analytics and storage infrastructure, including cloud file systems for storing datasets, databases, Big Data clusters (Hadoop, Spark), machine learning services, etc., both on the cloud and on-premises. This is where raw and processed datasets are stored, enabling reproducible analysis. It also avoids duplication, which could lead to inconsistencies and additional infrastructure costs. Scripts are provided to provision the shared resources, track them and allow each team member to connect to those resources securely. Our data science team uses the Microsoft Data Science Virtual Machine as our cloud development environment. This is useful to ensure a consistent configuration across the project team, for validating experiments and for saving us time in setting up the environment.
Productivity Tools and Utilities
Introducing new processes in an organization can be a challenging task. By providing specific tools for aspects of the process lifecycle, we get not only the benefit of added productivity, but also consistency in the adoption of, and adherence to, new processes.
Here are two utilities we provide that will jump-start your adoption of TDSP and automate common tasks in the data science lifecycle:
- The first utility, called the Interactive Data Exploration Analytics and Reporting (IDEAR) tool, is designed to help you explore data in an interactive and flexible manner.
- The second utility is the Automated Modeling and Reporting tool, which provides a customizable, semi-automated tool to train and evaluate single or multiple ML models with hyper-parameter sweeping, and helps you compare the accuracy of those models.
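As a rough illustration of what sweeping hyper-parameters across multiple models and comparing their accuracy involves, here is a small, self-contained Python sketch. It is not the Automated Modeling and Reporting utility itself: the toy data, the two simple models, and all function names are assumptions made up for this example.

```python
# Toy, hand-made data for illustration: (feature, label) pairs.
TRAIN = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
         (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
VALID = [(0.8, 0), (1.8, 0), (2.4, 0), (2.9, 1), (3.2, 1), (4.2, 1)]


def threshold_model(train, threshold):
    """Predict 1 when the feature exceeds a fixed threshold (ignores `train`)."""
    return lambda x: int(x > threshold)


def knn_model(train, k):
    """Predict by majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(2 * votes > k)
    return predict


def accuracy(predict, data):
    """Fraction of (feature, label) pairs that `predict` gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)


def sweep(train, valid, candidates):
    """Evaluate every (model, parameter value) pair; best accuracy first."""
    results = []
    for name, (build, param, values) in candidates.items():
        for value in values:
            model = build(train, **{param: value})
            results.append((accuracy(model, valid), name, {param: value}))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Each entry: model name -> (builder, hyper-parameter name, values to try).
CANDIDATES = {
    "threshold": (threshold_model, "threshold", [1.0, 2.0, 2.5, 3.0]),
    "knn": (knn_model, "k", [1, 3, 5]),
}

report = sweep(TRAIN, VALID, CANDIDATES)
for score, name, params in report:
    print(f"{name:10s} {params} accuracy={score:.2f}")
```

The point of the sketch is the shape of the workflow: every model and setting is scored on the same held-out data, and the results land in one sorted report that the team can compare side by side.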
We will soon post a detailed blog on these two utilities.
TDSP also provides a mechanism for individuals to contribute tools and utilities to their team’s shared code repository, so they can be used by other projects across your organization.
We have published the TDSP guidelines, tools and project structure on GitHub and invite you to check these out at the links below:
Do send us your feedback and suggestions for other tools that would help you implement a Team Data Science Process that is customized to your organization. You can comment below, start a new thread on the issues tab of this TDSP GitHub repository, or tweet to @zenlytix.
Jacob, Hang and Gopi