Data Transformations “By Example” in the Azure Machine Learning Workbench

This post is authored by Ranvijay Kumar, Senior Program Manager at Microsoft.

In this post, I talk about the Derive Column By Example transformation – an unexpected, powerful and super-efficient way to perform complex data transformations in the Azure Machine Learning Workbench.

Imagine you’re head chef at a boutique restaurant where meals are prepared by robots. You’ve just invented this amazing new recipe for sautéing vegetables and are now wondering how best to teach this new recipe to Sunny, the new robot on the team. As you approach Sunny, it says, “Just show me the ingredients and the final dish, and I’ll be all set”. Sure enough, Sunny scans the list of ingredients, analyzes the final dish, and then figures out the recipe all by itself. Shortly after, you see Sunny merrily sautéing away several kinds of vegetables using your latest recipe, to demonstrate their learning.

Sound like science fiction?

Well, the new Azure Machine Learning Workbench does no less when it comes to your data preparation.

Think of your input data as the raw ingredients, and your intended output as the final dish. All you need to do is invoke our Derive Column by Example transform, select the input columns, and provide examples of the output. The tool analyzes the examples and synthesizes a program to perform the transformation you intended. The program that it creates is generic enough to work not only on the example inputs you provided but also on other similar inputs.

Here is a video of the Derive Column by Example experience in the Azure Machine Learning Workbench. Notice how the values in the EventTime column are mapped to their corresponding 30-minutes time-periods only using an example.

Video: “Derive Column by Example” experience in the Azure Machine Learning Workbench.

What’s the Secret Sauce?

The underlying technology, known as PROgram Synthesis using Examples (PROSE) within Microsoft, is a new frontier in AI that brings together advances in logical reasoning and machine learning based methods. Logical reasoning -based search techniques are used to efficiently search for programs within an underlying domain-specific language (DSL) that are consistent with the examples. Machine learning -based ranking techniques are used to pick an intended program from among the many programs that are consistent with the examples. The DSL for data transformation in the Azure Machine Learning Workbench is designed to perform common data transformation tasks for data professionals. If the intended task can be expressed by a program in the underlying DSL, the tool will synthesize one such program using examples. Often one example is sufficient to do the intended transform; however, sometimes you may need to provide more examples.

The technology is a result of many years of research at Microsoft and has come a long way since it was first released in Excel 2013 (some of you may be familiar with the Excel Flash Fill experience).

You can visit the Microsoft PROSE SDK Resources page to learn more about the latest advancements in this technology.

What Transformations are Possible “By Example”?

If you are familiar with Excel Flash Fill, you will recall that it is limited to string transformations. However, the new advancements in the technology have also added number, date and composite transformations to the PROSE suite. What’s more, PROSE generates programs which, when combined with your other data preparation steps, can be operationalized using the scale-out capabilities of Azure Machine Learning.

Here are some transforms you can do today using Derive Column by Example:

  • String Transformations: Substring; Concatenation; Case Manipulation; Dictionary Mapping (e.g. Male -> 0, Female -> 1, Unknown -> 2).
  • Date Transformations: Extracting Date from Strings; Date Format Change; Extracting Date Parts; Mapping Datetime to Time Periods.
  • Number Transformations: Rounding; Floor; Ceiling; Bucketing; Padding; Scaling by 1000.
  • Composite Transformations: Any combination of the above transformations.

Here’s one of my favorite examples of a composite transformation – this showcases a few of the above transformations all in one go. In this [rather complex] scenario, three input-output examples were needed:


Figure 1: Performing a composite transformation “by example”. 

Here, we are using the name, mass (g), year, and GeoLocation columns as inputs to compute the output column. Notice the following:

  • This is a composite transformation:
    • Several fields are being transformed simultaneously to produce the output.
  • Case conversion on the name column:
    • Aachen -> AACHEN.
  • Division by 1000 on mass (g) column:
    • 21 -> 0.021.
  • Extraction of date part from year column:
    • 1880-01-01 00:00:00 -> 1880.
  • Number rounding in the GeoLocation column:
    • (50.775000, 6.083330) -> (50.8, 6.1).
  • Concatenation of constants strings:
    • “The meteorite named…”

You can see more examples of Derived Column By Example transformations in the product documentation.

Situations Where “By Example” Does Not Help

Of course, Derive Column by Example does not support every type of transformation. For example, it does not do generic arithmetic manipulations by example. Also, transformations such as mapping N to Nth Prime number are too contrived, and perhaps not common. In such cases, the tool aims to fail fast to indicate the its inability to synthesize a program from the examples.

When the “by Example” Transformation does not help, the ready-to-use transformations in the Azure Machine Learning Workbench come to rescue. You can see a list of ready-to-use transformations that can be invoked from the Transformations menu here.

If none of those satisfy your needs, you can write custom transformations in Python for variety of tasks such as adding a derived column based on values from other columns, filtering rows, manipulating the entire dataset, or writing the output to a custom destination. See the documentation for the extensibility features in the Azure Machine Learning Workbench Data Preparation tool.


Figure 2: Invoking a script based transformation from the Transformations menu. 

Summary, Next Steps

In this post, we introduced a powerful new approach to data transformations – one that truly has the promise to revolutionize the data preparation workflow for data scientists. We saw the power of the Derive Column by Example transform through an example of a composite transformation involving multiple simultaneous operations. Finally, we discussed the fallback mechanisms in the Workbench that let you gracefully handle situations where “by Example” transformations do not help.

By way of next steps, you can:

We are eager to know about your experience with “by Example” transformations – what worked, what did not, and what more you would like to see. Share your stories and scenarios with us by writing to prose-contact@microsoft.com or by using the Comments feature at the bottom of this blog post, You can also send us a Smile or Frown from the Workbench Application.

Ranvijay
@ranviz