Recent Updates to the Microsoft Data Science Virtual Machine

Posted by Gopi Kumar, Principal Program Manager in the Microsoft Data Group.

It’s been over 9 months since we first released the Data Science Virtual Machine (DSVM), a custom virtual machine image we published in the Azure Marketplace with a host of popular data science tools pre-installed and pre-configured. We’ve made a few updates since then, and now offer the DSVM in both Windows and Linux editions. There’s been a tremendous response to this offering by the data analytics community across the globe and we continue to iterate and improve the experience. This post provides a quick update on some of the newer features that should further improve your productivity and let you accomplish more with the DSVM.

Windows Edition
We now have the SQL Server 2016 Developer edition replacing the SQL Server 2014 Express edition on the VM. SQL Server 2016 Developer is a full-featured edition, for development/test purposes only, of Microsoft’s industry-leading OLTP database and top-performing data warehouse. 

It also includes R Services that support in-database analytics using Microsoft R, enabling large-scale analytics to be run closer to your data using ScaleR, Microsoft’s distributed scalable library in R that is fully compatible with open source R packages and supports parallel algorithms.

The DSVM also packages an end-to-end data science tutorial featuring SQL Server R Services as a Jupyter notebook along with a preloaded dataset in the SQL database. You can also run R Server standalone outside the database.

In addition to libraries for working with Azure ML, the VM also includes several popular open source ML and deep neural network/AI toolkits, such as xgboost, Vowpal Wabbit, Rattle, CNTK and mxnet, with samples to get you started.

Other notable updates to the VM include the Azure CLI and Visual Studio Community 2015 Update 3, which comes with several language tools including R, Python and Node.js, as well as pre-installed plugins that make it easier to work with data and analytics technologies, including SQL Server, Azure HDInsight (Hadoop) and Azure Data Lake.

You can run several Linux command line tools, such as awk, sed, find, wget and perl, right in the Windows command prompt or in Git Bash. Data movement tools on the VM support the movement of data to and from relational databases, Azure storage accounts, Azure DocumentDB and Azure Data Lake. The Microsoft Data Management Gateway installed on the VM allows you to set up data pipelines from on-premises systems to the cloud using Azure Data Factory.

Linux Edition

Microsoft R Server Developer edition, for non-production use only, is now available on the Linux DSVM, allowing you to build models at scale in R using Microsoft’s ScaleR libraries. Previously we supported only Microsoft R Open, which uses Open Source libraries that can only process data that fit in memory.
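
To make the in-memory limitation concrete, here is a minimal Python sketch of the general idea behind out-of-memory (chunked) processing: the data is read one block at a time, each block is reduced to a small summary, and the summaries are merged. This is an illustration only, not ScaleR's actual API or implementation. The function names, the chunk size and the "value" column are all hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean_variance(values, chunk_size=1000):
    """Return (count, mean, sample variance), holding only one chunk in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in iter_chunks(values, chunk_size):
        n_b = len(chunk)
        mean_b = sum(chunk) / n_b
        m2_b = sum((x - mean_b) ** 2 for x in chunk)
        # Merge this chunk's summary into the running summary
        # (pairwise update for combining means and sums of squares).
        delta = mean_b - mean
        total = count + n_b
        mean += delta * n_b / total
        m2 += m2_b + delta * delta * count * n_b / total
        count = total
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def read_column(file_obj, column):
    """Lazily yield one numeric column from a CSV file object."""
    for row in csv.DictReader(file_obj):
        yield float(row[column])


if __name__ == "__main__":
    # Hypothetical stand-in for a large CSV file; "value" is an assumed column name.
    sample = io.StringIO("value\n" + "\n".join(str(v) for v in range(1, 11)))
    print(chunked_mean_variance(read_column(sample, "value"), chunk_size=3))
```

Because only the running count, mean and sum of squared deviations are kept between chunks, memory use depends on the chunk size rather than on the size of the dataset. The same merge step would also let separate workers summarize different chunks in parallel.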

Another major update on the Linux VM is our support for JupyterHub, a multi-user server for Jupyter notebooks. In our experience, JupyterHub has been particularly useful in education and training scenarios, where a single VM instance can support multiple users, each working independently on their own single-user notebook server instance with OS authentication.

We have also added support for working with the Julia language, both from the command line and as a Jupyter notebook kernel. All the ML tools mentioned above in the Windows section, with the exception of mxnet, are also available on the Linux DSVM.

The slide below captures the key software components available in each of the DSVM editions: 

[Figure: DSVM Edition Side by Side Comparison]

With the data science VM you have a comprehensive set of tools to perform a whole range of data science activities including data movement, data storage, data exploration/visualization, modeling with ML and AI algorithms, and operationalization using multiple languages in both Linux and Windows environments.

There is lots more information at the resources listed below. Do give the DSVM a spin for your next data science or analytics project or training session. As always, we’d love hearing your feedback so we can continue to improve your experience.



Windows Edition:

Linux Edition: