By Basim Majeed, Cloud Solution Architect at Microsoft
The data science process has recently received increasing attention as a methodology for governing the enterprise-scale effort that goes into developing, deploying and maintaining data analytics. Data scientists have no shortage of tools for developing their algorithms, but when it comes to deploying their solutions, especially in a hybrid environment, the available tools have not been flexible enough. Data scientists love to write their Python and R scripts using tools such as PyCharm and RStudio, which let them work interactively with samples of the data and build the analytics algorithm gradually.
Most of these data analytics scripts must be deployed in a complex environment that may span both cloud and on-premises resources. Having the right tools to govern and deploy these scripts within a workflow that meets business needs is a major part of making the data science effort a success.
Microsoft Azure offers several services that allow flexible deployment and integration of data analytics, and that allow the algorithms developed by data scientists to be made available in a number of ways depending on the usage scenario. In this article I will show how to host a Python script in Azure Container Instances and how to then integrate the container into a workflow using Azure Logic Apps. Using Logic Apps makes it easy to ensure the container is created only on demand and then removed, so that cost is incurred only when necessary.
This scenario requires the Python script to run on demand based on a trigger event (e.g. when new data becomes available). The script retrieves data from an Azure SQL database, operates on the data and then writes the results back to the database as shown in the diagram below. A Docker container image hosts the Python script and is registered with the Azure Container Registry. The Logic Apps instance controls the workflow and is instantiated by the trigger signal, creating a container group with a single container based on the image stored in the registry. The container runs the Python script and on completion it is destroyed by the Logic App.
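As a rough illustration of the read, transform and write-back pattern described above, here is a minimal sketch. The table names (measurements, results), the column names and the squaring step are hypothetical placeholders, not the script used in this scenario. The function accepts any DB-API connection that uses "?" parameter markers, so it runs here against an in-memory SQLite database; inside the container you would instead pass a connection opened with pyodbc.connect and your Azure SQL connection string.

```python
import sqlite3  # stand-in driver so the sketch runs without a database server


def process_rows(conn):
    """Read raw values, transform them, and write the results back.

    `conn` is any DB-API 2.0 connection that uses `?` parameter markers
    (pyodbc and sqlite3 both do). Table and column names are hypothetical.
    """
    cur = conn.cursor()
    cur.execute("SELECT id, value FROM measurements")
    rows = cur.fetchall()

    # Placeholder "analytics" step (an assumption): square each value.
    results = [(row[0], row[1] * row[1]) for row in rows]

    cur.executemany("INSERT INTO results (id, score) VALUES (?, ?)", results)
    conn.commit()
    return len(results)


def demo():
    """Exercise process_rows against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
    conn.execute("CREATE TABLE results (id INTEGER, score REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)", [(1, 2.0), (2, 3.0)]
    )
    written = process_rows(conn)
    stored = conn.execute("SELECT id, score FROM results ORDER BY id").fetchall()
    conn.close()
    return written, stored


if __name__ == "__main__":
    print(demo())
```

Keeping the database logic in a function that takes the connection as an argument means the same code can be tested locally and then run unchanged against Azure SQL from the container.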
The first step is to use Docker to build a container image that can run the Python script. Since it is necessary for the script to interact with the SQL database, we need to make sure that the Dockerfile used to build the container image contains the necessary reference to the pyodbc library. A complete Dockerfile can be found here, though you will need to add the necessary command to include your Python script as part of the Dockerfile. For example, to include the Python script “my_script.py” you will need to add the following (note: modify the following two lines based on where you want to place the scripts in the container image):
ADD my_script.py /
CMD [ "python", "./my_script.py" ]
After you have created your Dockerfile you can use the Docker Command Line Interface (CLI) “build” command to build your container image, where the -t option sets the name (and optionally the tag) of the resulting image:
docker build -t <image-name> .
The second step is to upload the container image to the Azure Container Registry, making sure to tag the image with information such as the image version. You can use the Azure CLI to achieve this as described here.
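The tag-and-push sequence usually comes down to three commands: az acr login, docker tag and docker push. The sketch below only composes those commands so you can inspect them before running anything. The registry name, repository name and version tag are hypothetical, and the "<registry>.azurecr.io" login server is the usual form for the Azure public cloud, so check your own registry's login server before relying on it.

```python
import shlex


def acr_push_commands(registry, repository, tag, local_image):
    """Return the CLI commands that tag and push a local image to ACR.

    Assumes the registry login server is "<registry>.azurecr.io".
    """
    remote = f"{registry}.azurecr.io/{repository}:{tag}"
    return [
        ["az", "acr", "login", "--name", registry],  # authenticate Docker
        ["docker", "tag", local_image, remote],      # add the registry tag
        ["docker", "push", remote],                  # upload the image
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in acr_push_commands("myregistry", "my-script", "v1", "my-script"):
        print(shlex.join(cmd))
```

To execute the commands rather than print them, each list can be passed to subprocess.run(cmd, check=True).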
The third step is to build the workflow using Azure Logic Apps. With the recent addition of Container Instance Group connectors, Logic Apps can create Container Instances inside container groups, monitor the container state to detect successful execution, and then delete the container and its container group. Because the container is active only for as long as it takes to complete the task, charges are minimised.
Many trigger types can be used to start the Logic App, including webhooks, HTTP notifications and timed events, allowing the workflow to tie the Python script execution to external events. In the Logic App instance shown in the diagram below, the trigger is set as a timed event. When the Logic App receives the timer event it creates a Container Group and a Container inside the group based on the image retrieved from the registry. A loop is then started that monitors the state of the Container Group until it has succeeded (indicating that the Python script has completed). The last step is to delete the Container Group.
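The create, poll and delete sequence that the Logic App performs can be sketched in plain Python to make the control flow explicit. This is not how Logic Apps is implemented. The three callables stand in for the connector actions and are hypothetical, the state strings "Succeeded" and "Failed" are assumed values, and the timeout, the failure check and the always-delete cleanup are additions of this sketch rather than steps described above.

```python
import time


def run_container_job(create, get_state, delete,
                      poll_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Create a container group, wait for it to succeed, then delete it.

    `create`, `get_state` and `delete` are hypothetical callables standing
    in for the corresponding Logic Apps connector actions.
    """
    create()
    try:
        for _ in range(max_polls):
            state = get_state()
            if state == "Succeeded":  # the script finished
                return state
            if state == "Failed":     # assumed failure state
                raise RuntimeError("container group reported Failed")
            sleep(poll_seconds)
        raise TimeoutError("container group did not finish in time")
    finally:
        # Delete the group on every exit path so no charges keep accruing.
        delete()


if __name__ == "__main__":
    # Simulated run with canned states instead of real Azure calls.
    states = iter(["Pending", "Running", "Succeeded"])
    log = []
    result = run_container_job(
        create=lambda: log.append("create"),
        get_state=lambda: next(states),
        delete=lambda: log.append("delete"),
        sleep=lambda seconds: log.append("wait"),
    )
    print(result, log)
```

Placing the delete call in a finally block mirrors the cost argument made earlier: the container group is removed whether the script succeeds, fails or times out.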