This post is authored by Jacob Spoelstra, Data Science Director, Hang Zhang, Senior Data Scientist Manager, and Gopi Kumar, Senior Program Manager, at Microsoft. The authors are speaking at the upcoming Microsoft Data Science Summit on September 26-27 in Atlanta.
As enterprise software development projects have grown in complexity and scale, the industry has adopted processes to enable better collaboration and manage quality. These processes include the use of version control, code review, and task-tracking systems, as well as the now-ubiquitous Agile process.
With the new emphasis on analytics for enterprise-wide, data-driven decision making, data science projects have similarly grown in complexity, yet they are often executed in an ad hoc manner. Matters get worse when an organization's data scientists come from varied backgrounds such as statistics, computer science, and physics, and consequently approach projects in significantly different ways. Such heterogeneous processes within an organization introduce obstacles to collaboration. And while public clouds provide access to virtually unlimited compute power and facilitate global collaboration, they introduce challenges when it comes to efficiently tracking projects and building institutional knowledge.
At Microsoft, we have developed the Team Data Science Process (TDSP) to address these challenges. The key feature is a set of git-based repositories with templates providing a central archive with a standardized project structure, document templates, and utility scripts for all projects, independent of the execution environment, to allow scientists to use multiple cloud resources as needs dictate. We use Visual Studio Team Services (VSTS) to manage team tasks and execution cadence, control access, and maintain repositories containing work items.
Why Have a Process?
A process provides a detailed sequence of the activities necessary to perform specific business services, and it is used to standardize procedures and establish best practices. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the de facto standard process for data science. It describes the typical stages of a project, from business understanding to final deployment of a solution, and it highlights the iterative nature of data science work. A 2014 KDnuggets poll found that it is still the most widely used methodology. But CRISP-DM does not prescribe specific formats for project artifacts; the TDSP aims to fill that gap.
Figure 1. The Cross-Industry Standard Process for Data Mining (CRISP-DM), illustrating the common phases in the execution of a data science project (source: Wikipedia)
Who Is This For?
The TDSP is primarily for data science teams developing the analytic components of a predictive analytics solution, specifically teams using cloud-based assets for compute and storage. When working on the cloud, virtual machines become disposable compute, to be added to projects as needs dictate. There is a many-to-many relationship between data scientists, VMs, and projects: a large, complex project might have several people working on it, each running tasks on several VMs, while in other cases several smaller projects can be hosted on a single VM. Data is typically not stored on the VM but accessed from other cloud stores, such as blob storage, a database, or a cluster. In this world, modeling and analysis are done on working copies, while project artifacts are permanently stored in central Git repositories.
Figure 2. An illustration of a data science team executing on multiple parallel projects, collaborating on the cloud. Each data scientist may provision one or more virtual machines (VMs); one or more VMs are attached to storage assets that contain the data. Templates, as well as project code and documents, are maintained in a central Git repository.
What Do You Get?
The TDSP prescribes a process for setting up and executing projects on the cloud and has two concrete components: the first comprises templates and utilities aimed at making data scientists more productive; the second involves tracking tasks and organizing artifacts in Git repositories hosted on a VSTS server.
We provide templates for the folder structure and required documents. The folder structure organizes files such as code for data exploration, feature extraction, and model iterations in standard locations, which makes it easier for team members to understand work done by others and for new people to join teams. Document templates in Markdown format are easy to view and update, and they serve as checklists to ensure that key questions are answered consistently for each project. Examples include a project charter to document the business problem and scope, a data report to document the structure and statistics of the raw data, and model reports to ensure that derived features are documented and that model performance metrics such as ROC curves or MSE are reported consistently.
Figure 3. Folder structure provided by the project template. It includes templates for common documents.
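As a rough illustration of the idea, a project scaffold in this style could be generated with a few lines of code. Note that the folder and document names below are hypothetical stand-ins, not the exact names used by the TDSP template:

```python
from pathlib import Path

# Hypothetical TDSP-style layout; the actual template's folder and
# document names may differ from these placeholders.
FOLDERS = ["Code/DataPrep", "Code/Model", "Docs/Project", "Docs/Model"]
DOCS = [
    "Docs/Project/Charter.md",
    "Docs/Project/DataReport.md",
    "Docs/Model/ModelReport.md",
]

def scaffold(root: str) -> Path:
    """Create the standard folders and stub document templates under root."""
    base = Path(root)
    for folder in FOLDERS:
        (base / folder).mkdir(parents=True, exist_ok=True)
    for doc in DOCS:
        path = base / doc
        # Each stub starts with a heading so the checklist is easy to fill in.
        path.write_text(f"# {path.stem}\n\n(TODO: fill in)\n")
    return base
```

Seeding every new project from one script like this is what makes "files are where you expect them to be" hold across a whole team.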
One goal for the TDSP is to reduce the time from first access to data to having a baseline model built. The first task in any project is to explore and understand the data. The TDSP data report utility is an R script that interactively explores a data set and then auto-generates a report. Another script automatically explores a number of models to create a baseline model. Other utility scripts automate common tasks such as provisioning and attaching cloud file systems to virtual machines.
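The TDSP data report utility itself is an R script; as a minimal Python sketch of the underlying idea, a per-column summary of a data set might look like the following (the function and field names here are illustrative, not part of the TDSP):

```python
import statistics

def summarize(rows):
    """Compute simple per-column statistics for a list of record dicts:
    non-missing count, missing count, and mean for numeric columns."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows if r.get(col) is not None]
        entry = {"count": len(values), "missing": len(rows) - len(values)}
        # Only compute a mean when every observed value is numeric.
        if values and all(isinstance(v, (int, float)) for v in values):
            entry["mean"] = statistics.mean(values)
        report[col] = entry
    return report

rows = [
    {"age": 34, "city": "Atlanta"},
    {"age": 28, "city": None},
    {"age": 31, "city": "Boston"},
]
print(summarize(rows))
```

A real utility would render such a summary into a formatted report; the point is that automating this first look at the data shortens the path from data access to a baseline model.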
VSTS was developed to provide a collaboration and build environment for software development projects, but it turns out to be ideal for organizing data science projects as well. First, it provides a way to define teams and organize multiple Git repositories. The TDSP maintains a central utilities repository as a way to share the basic helper scripts and to capture institutional knowledge as these get refined and teams build more of them over time. For new projects, you can create new repositories and seed them with the template. The work-tracking feature in VSTS allows teams to execute projects following the Scrum approach to Agile development: defining a backlog of tasks, then prioritizing and executing them in sprints. A neat feature is that work items can automatically create Git branches, and pull requests are automatically associated with work items.
What Do You Gain?
The goals of the TDSP are to help managers organize and standardize the work products on data science projects and to make individual data scientists more productive. So what exactly does each of these stakeholders get?
As a manager, the benefits of your team standardizing on the process are:
- Organization: A single place to go to find code, documentation, and artifacts for all your team’s projects
- Standardization: Code, data, and documents are organized the same way, so when a second pair of eyes is required, or a new member joins a team, files are where you expect them to be, and documents share the same naming and structure.
- Knowledge Accretion: One of the biggest challenges for data science teams in which individuals or small groups work largely independently on a variety of projects is accumulating and sharing learnings, tools, and best practices. The TDSP provides a central shared utilities repository and a methodology for individual projects to contribute to this team-wide resource. Also, because the team has a documented process, improvements can be recorded and spread to individuals executing subsequent projects.
- Security: VSTS provides for detailed role-based access control on repositories and work items.
As an individual data scientist, your benefits are:
- Productivity: Out of the box, the TDSP provides various utilities for data exploration and baseline model construction, and as the team customizes and builds up domain-specific scripts and utilities over time, these can be easily shared. In addition, document templates serve as a guide to the information required and provide basic formatting.
- Collaboration: Teams can seamlessly collaborate on distributed compute (VMs), with a simple way to share data assets and run large analytics jobs without contending for resources, all the while contributing to a single repository so that the final product reads as the work of a single scientist.
How Do You Adopt It?
A key tenet when developing the TDSP was that it should be easy to adopt. It is flexible enough to fit multiple scenarios and can be customized to accommodate existing practices; the benefit comes from formalizing those practices.
Figure 4. Swim lanes for members of a data science team adopting TDSP for the first time.
The diagram above illustrates the steps required for a team to adopt the process. The group manager creates an account on VSTS, configures it so that team leads have the appropriate access, and populates the initial repositories with the templates from our GitHub account. Each team lead sets up an environment for their team on the server, including users and initial repositories. To initiate a project, the project lead creates a new repository and provisions shared cloud assets such as databases and compute clusters. Each individual joining the project then creates VMs, attaches the shared assets, clones the common and project repositories, and starts work, checking code and documents into the project repository.
The TDSP is a framework used by Microsoft to efficiently execute predictive analytics projects. We are treating this as a product and will continue to develop it. Some items on the roadmap include more customization of VSTS to better fit the stages and work items of data science projects; additional utilities to automate more steps of the process, from resource provisioning to document generation; and better integration and versioning of notebooks.
We will be talking about the TDSP at the upcoming Microsoft Data Science Summit on September 26-27. We hope you can join us there to find out more.
Jacob, Hang & Gopi