Running Compiled Code on Azure ML in R and Python

This post is by Max Kaznady, Data Scientist in the Microsoft Data Group.

Introduction

Azure ML automates a large number of machine learning tasks in the cloud, including scaling ML experiments and publishing trained models as a RESTful web service. ML models can be applied to data using Azure ML modules or using custom modules in which the user provides their own ML algorithm training and scoring implementations; the latter currently support R and Python, along with a large variety of open source libraries.

In addition, Azure ML has a built-in capability to run R and Python scripts in special R and Python script modules. Each Azure ML R and Python script module can take up to two dataframes as input, along with a zipped folder which contains other dependencies. The output is an optional dataframe along with a graphics device for any plots generated.

But what if your library or custom implementation is not available on Azure ML? If it's written in R or Python then you can easily source it into one of the script modules using the "Script bundle (Zip)" handle shown in Figure 1.

Figure 1: Linking external dependencies to R and Python script modules.

But what if your code is written in a compiled language, such as C, C++ or even Fortran?

Execution

In this post, we focus on sourcing R and Python's external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.

To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder:
https://github.com/hajimes/mmh3 (Python)
https://github.com/cran/hashFunction (R)

Hashing is a fundamental technique in computer science. For example, MurmurHash underlies the implementation of the Microsoft Research "Distributed Robust Algorithm for Count-based Learning" (DRACULA) algorithm.

To simplify things, we can install both packages using 'install.packages("hashFunction")' for R and 'pip install mmh3' for Python. You can also git-clone each package and build it from source. For the compiled code to run on Azure ML, the modules have to be built on a 64-bit Windows operating system, version 8.1 or higher (or cross-compiled from the cloned source code). Before installing the packages, download and install the Microsoft Visual C++ compiler package. If you don't have a compiler installed already, a pop-up should appear during package installation prompting you to install one.

Below is an example of Python's 32-bit MurmurHash top-level function, which is invoked from mmh3/mmh3module.cpp (line 39):

Figure 2: Sample MurmurHash entry point in Python module.
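Since Figure 2 is an image, here is a rough companion in text form: a minimal pure-Python sketch of the 32-bit MurmurHash3 algorithm (the x86_32 variant that mmh3 wraps). The names murmur3_32 and to_signed32 are my own and are not part of either package. The sketch only illustrates what the compiled function computes and is far slower than the C++ code.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit MurmurHash3 (x86_32) of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed one."""
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    for text in ("abc", "abd", "abe"):
        print(text, murmur3_32(text.encode("utf-8")))
```

As far as I know, mmh3.hash returns a signed 32-bit integer by default, so to_signed32 is the conversion to apply before comparing against it.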

After installation, you can find the location of each package using the .libPaths() command in R and "import site; site.getsitepackages()" in Python. My installed R package, using Microsoft R Open for Revolution R Enterprise, is in

"C:/Users/maxkaz.NORTHAMERICA/Documents/Microsoft/MRO-for-RRE/8.0/R-3.2.2/library"

and for Python, using the Continuum Anaconda suite, is in

"C:/Users/maxkaz.NORTHAMERICA/AppData/Local/Continuum/Anaconda2/lib/site-packages".
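On the Python side, the lookup can also be scripted. The sketch below is one possible way to do it: package_location is a hypothetical helper name of mine, built on the standard importlib.util.find_spec call. For a package it returns the package folder, and for a single-file module (such as a compiled .pyd) it returns the file itself.

```python
import importlib.util
import site
from pathlib import Path
from typing import Optional


def package_location(name: str) -> Optional[Path]:
    """Return the folder (package) or file (single module) that `name` loads from.

    Returns None when the module cannot be found or has no file on disk.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None or spec.origin in ("built-in", "frozen"):
        return None
    origin = Path(spec.origin)
    # Packages are reported through their __init__.py; return the folder instead.
    return origin.parent if origin.name == "__init__.py" else origin


if __name__ == "__main__":
    print(site.getsitepackages())    # all site-packages folders
    print(package_location("json"))  # a standard-library package, as a demo
    print(package_location("mmh3"))  # None unless mmh3 is installed
```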

Next, copy the built packages, which contain the DLLs (more specifically, the R and Python equivalents of DLLs), out of the R and Python library folders and create a zip archive for each.

Figures 3 and 4 show the contents of the R and Python .zip dependencies that I'm attaching to the R and Python Azure ML script modules. Notice the separate .pyd file in the PythonDep folder in Figure 4; it also needs to be copied.

Figure 3: Contents of RDep.zip.

Figure 4: Contents of PythonDep.zip.
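The copy-and-zip step can be done by hand, but it is also easy to script. The following is a minimal sketch using only the Python standard library. The helper name build_bundle is hypothetical, and the layout (a single top-level folder such as PythonDep inside the archive) is an assumption on my part about what Figures 3 and 4 show.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def build_bundle(sources: Iterable[Path], folder_name: str, out_dir: Path) -> Path:
    """Copy files and folders into `folder_name` and zip it as out_dir/folder_name.zip."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as staging:
        target = Path(staging) / folder_name
        target.mkdir()
        for src in map(Path, sources):
            if src.is_dir():
                # Package folders are copied recursively.
                shutil.copytree(src, target / src.name)
            else:
                # Standalone files (for example a .pyd) are copied next to them.
                shutil.copy2(src, target / src.name)
        archive = shutil.make_archive(
            str(out_dir / folder_name), "zip", root_dir=staging, base_dir=folder_name
        )
    return Path(archive)
```

A call might look like build_bundle([site_packages / "mmh3.pyd"], "PythonDep", Path("bundles")), where site_packages is the folder found in the previous step and the exact file names depend on the installed package version.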

Next, we need to test our experiment. The Azure ML test input is a simple Manual Input module containing the strings "abc", "abd" and "abe", shown in Figure 5. It is the very first module executed in the flow in Figure 8, and it is replaced by the Web service input module when the experiment is deployed as a RESTful web service. The R and Python scripts should map each input string to the same bucket, because both scripts reference the same algorithm (albeit through different implementations).

Figure 5: Sample input data used to test the RESTful web service.

Figures 6 and 7 show the sample R and Python code that goes inside the corresponding Azure ML script modules and uses the dependencies.

Figure 6: Contents of R script module.

Figure 7: Contents of Python script module.
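Because the figures are images, here is a minimal text sketch of what a Python script module along these lines might look like. It is not a transcription of Figure 7. It assumes that Azure ML unpacks the attached zip into a folder named "Script Bundle", that the archive contains a top-level PythonDep folder, and that the first input column holds the strings. The output column name python_hash and the helper hash_column are hypothetical.

```python
import os
import sys

# Assumptions: the zip is unpacked into "Script Bundle" next to the script,
# and the archive has a top-level "PythonDep" folder holding the .pyd file.
BUNDLE_DIR = os.path.join(".", "Script Bundle", "PythonDep")


def hash_column(values, hash_fn):
    """Apply hash_fn to every value (converted to a string) and return a list."""
    return [hash_fn(str(v)) for v in values]


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point called by the Azure ML Python script module."""
    # Make the compiled dependency importable before importing it.
    if BUNDLE_DIR not in sys.path:
        sys.path.append(BUNDLE_DIR)
    import mmh3  # compiled module shipped in the zip bundle

    result = dataframe1.copy()
    first_col = result.columns[0]
    # "python_hash" is a hypothetical output column name.
    result["python_hash"] = hash_column(result[first_col], mmh3.hash)
    # The module expects a sequence of dataframes, hence the trailing comma.
    return result,
```

Importing mmh3 inside azureml_main, after the path has been extended, keeps the script from failing at load time when the bundle folder is not yet on the path.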

But what if the dependencies depend on one another, for example another library which requires MurmurHash? Once we add the new library location to the path in R and Python, it propagates to all modules: any other module sourced by the Azure ML R and Python code knows where to find the newly attached dependencies. The effect is the same as modifying the R library path or the PYTHONPATH environment variable, except that we do it directly from within the source code.
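The propagation is easy to check locally in Python. In the sketch below, two throwaway modules are written to a temporary folder: base_hash stands in for the compiled MurmurHash module, and bucketizer is a second library that imports it. Both names are hypothetical, and the toy hash is not MurmurHash. Adding the folder to sys.path once is enough for the nested import to resolve.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_path_propagation() -> int:
    """Import a module whose own import only resolves via the extended sys.path."""
    with tempfile.TemporaryDirectory() as dep_dir:
        # Stand-in for the compiled hash module (a toy hash, not MurmurHash).
        Path(dep_dir, "base_hash.py").write_text(
            "def h(s):\n    return sum(s.encode('utf-8')) % 16\n"
        )
        # A second library that depends on the first one.
        Path(dep_dir, "bucketizer.py").write_text(
            "import base_hash\n\ndef bucket(s):\n    return base_hash.h(s)\n"
        )
        sys.path.append(dep_dir)
        try:
            importlib.invalidate_caches()
            bucketizer = importlib.import_module("bucketizer")
            return bucketizer.bucket("abc")
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(dep_dir)
            sys.modules.pop("bucketizer", None)
            sys.modules.pop("base_hash", None)


if __name__ == "__main__":
    print(demo_path_propagation())
```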

We can now publish the built modules on Azure ML as a RESTful web service and interact with the compiled code using a scalable cloud web service. The complete experiment is shown in Figure 8.

Figure 8: Complete Azure ML experiment.

In this experiment, we feed the same input to both the R and Python scripts, which should produce identical results when we test the web service. As expected, the outputs of the R and Python scripts, merged by the Add Columns module, are identical in Figure 9.

Figure 9: Output of both script modules, merged together for comparison.
This is the output produced by supplying the sample input from Figure 5.

Conclusion

We successfully demonstrated the invocation of compiled code through Azure ML R and Python Script Modules. This technique generalizes to running user-written code in compiled languages on Azure ML, including code which requires static initialization with fixed datasets. This convenience feature harnesses the power of open source compiled code by making it execute within the Azure ML cloud framework, where such compiled code can be easily deployed as a scalable RESTful web service.

Max