By Joseph Sirosh, Corporate Vice President and CTO of AI, and Sumit Gulwani, Partner Research Manager, at Microsoft.
There are an estimated 250 million “knowledge workers” in the world, a term that encompasses anybody engaged in professional, technical or managerial occupations. These are individuals who, for most part, perform non-routine work that requires the handling of information and exercising the intellect and judgement. We, the authors of this blog post, count ourselves among them. So are a majority of you reading this post, regardless of whether you’re a developer, data scientist, business analyst or manager.
Although a majority of knowledge work tends to be non-routine, there are, nevertheless, many situations in which knowledge workers find ourselves doing tedious repetitive tasks as part of our day jobs, especially around tasks that involve manipulating data.
In this blog post, we take a look at Microsoft PROSE, an AI technology that can automatically produce software code snippets at just the right time and in just the right situations to help knowledge workers automate routine tasks that involve data manipulation. These are generally tasks that most users would otherwise find exceedingly tedious or too time consuming to even contemplate.
Details of Microsoft PROSE can be obtained from GitHub here: https://microsoft.github.io/prose/.
Examples of Tedious Everyday Knowledge Worker Tasks
Let’s take a couple of examples from the familiar world of spreadsheets to motivate this problem.
Figures 1a (above), 1b (below): A couple of examples of “data cleaning” tasks,
and how Excel “Flash Fill” saves the user a ton of tedious manual data entry.
Look at the task being performed by the user in the Excel screen in Figure 1a above. If you see the text the user is entering in cell B2, it looks like they have modified the data in the corresponding column A, to fit a certain desired format for phone numbers. You can also see them starting to attempt an identical transformation manually in the next cell below, i.e. cell B3.
Similarly, in cell E2 in Figure 1b above, it seems like the user is transforming the first and last names fields available in columns C and D, changing them into a format with just the last name followed by comma and capitalized first initial. They next attempt to accomplish an identical transformation, manually, in cell E3 which is right below it.
Excel recognizes that the user-entered data in cells B2 and B3 represents their desired “output” (i.e. for a certain format of telephone numbers) and that it corresponds to the “input” data available in column A. Similarly, Excel recognizes that the user-entered data in cells E2 and E3 represents a transformed output of the corresponding input data present in columns C and D. Having recognized the desired transformation pattern, Excel is able to display the [likely] desired user output – shown in gray font in the images above – in all the cells of columns B and E, in these two examples.
Regular Excel users among you will readily recognize this as Excel Flash Fill – a feature that we released five years ago and which has collectively saved our users millions of tedious hours of data grunge work.
Introduction to Microsoft PROSE
PROSE is short for Programming Synthesis using Examples, and it’s the technology underpinning of Excel Flash Fill.
PROSE has been through many major enhancements since its initial release in Excel. These new capabilities have since been released in many other products including Power BI, PowerShell and SQL Server Management Studio and are increasingly finding their way into many scenarios that involve big data and AI, including in Azure Log Analytics and Azure Machine Learning, where PROSE-generated scripts can be executed on very large datasets, including via the Azure Spark runtime.
In this post, we describe how PROSE works and some of the exciting new scenarios where its being applied. In many cases, PROSE delivers productivity gains that are well in excess of 100x.
How Does Microsoft PROSE Work?
PROSE works by automatically generating software programs based on input-output examples that are provided at runtime, usually by a user who is just going about their everyday tasks.
Given such input-output examples, PROSE generates a ranked set of software programs that are consistent with the examples provided. It then applies the output of its “best” program, with a view to help the user complete their broader task. This workflow is illustrated below.
Figure 2: How Microsoft PROSE works, under the covers.
To go back to the examples in Figure 1, what Excel is doing is displaying the output of the best PROSE-generated program using the gray colored font. The Excel user can accept these suggestions simply by hitting the Enter key. At this point, the user could provide additional examples, such as a correction they may apply to one of the auto-generated outputs. In such a situation, PROSE will try to further refine its final program, adapting it to the newest example provided. It will once again update the entire output column to reflect the updated ‘best program’.
A key technical challenge for PROSE is to search for programs in an underlying domain-specific language that are consistent with the user-provided examples. Our real-time search methodology leverages logical reasoning techniques and neural-guided heuristics to solve this issue.
Another challenge is to resolve the ambiguity that may be present in the user-provided examples since many programs can satisfy a few examples. Our Machine Learning -based ranking techniques often help us select an intended program from among the many that satisfy the examples. We also use active learning -based user interaction models that resemble an interactive conversation with the user, to iterate and arrive at the desired output.
The Microsoft PROSE SDK exposes these generic search and ranking algorithms, allowing advanced developers to construct PROSE capabilities for new task domains.
In the rest of this post, we look at a few additional scenarios where data scientists and developers and knowledge workers can use PROSE technology to get their tasks done faster and in a much more enjoyable manner. You can also look at a video overview of these scenarios.
Customer Use Cases and Microsoft PROSE Benefits
In this section, we highlight the benefit of using PROSE in the following scenarios:
- In data preparation, for use by data scientists.
- In Python Code Accelerator, for use by data scientists.
- For generating code snippets, for use by software developers.
- In code transformation, for use by software developers.
- For table extraction from PDF files, for use by knowledge workers.
Scenario 1. Data Preparation
Although it may still be the sexiest job of the 21st century, being a data scientist sure involves spending lots of time on mundane data organization and analysis. In fact, it is estimated that data scientists end up spending as much as 80% of their time transforming data into formats that are more suitable for machine learning and AI.
This is where PROSE comes to the rescue. PROSE can automate several data manipulation tasks including string transformations (already seen in the Excel example above), in column-splitting, field extraction from log files and web pages, and normalizing semi-structured data into structured data. To take one example, consider the dataset in Figure 3a below, which reports raw temperature measurements.
Figure 3a: Raw temperature measurements
Rather than using these as-is, a data scientist may want to map these temperatures to different bins as part of featurization exercise. Unlike in the world of Excel, doing so manually in the world of big data is nigh impossible, therefore their best bet is to write a complex custom script.
They now have a much easier and faster alternative, which is to use PROSE to derive the new column based on a user-provided example, as shown in Figure 3b below.
Figure 3b: Transforming raw temperature measurements into interval
bands via the power of Microsoft PROSE plus a couple of user-provided examples.
As seen in the figure, as soon as the user types their desired output (or example) in the second column of row 2, PROSE determines the user’s intent, automatically generates the relevant code snippet, and uses it to correctly populate all the remaining rows, with the output of the PROSE-generated code snippet shown in gray colored font. Voila!
Scenario 2. Python Code Accelerator in Notebooks
PROSE, in general, requires user intent and sample data to generate code. Notebooks, because of their partial execution capability, are great platforms for interactive program synthesis using PROSE. A user typically develops script in Notebook one cell at a time, executing and evaluating the cell, and deciding on the next steps as she goes. After execution of each cell, new states are created, or old states are updated. At that time, user may decide to write code for the next cell on her own or invoke PROSE Code Accelerator which takes the user’s intent and the current state of the Notebook to synthesize code on user’s behalf. The code is readable and modifiable, like what the user might have written herself perhaps after spending much more time.
Figure 4a: Microsoft PROSE -powered Python Code Accelerator generating code to load a CSV file.
Notice in the above figure how PROSE analyzes the content of the file and generates Python code using libraries that the user may already familiar with. By using PROSE, user has saved several minutes of frustration and effort that she can now spend on more useful tasks.
Figure 4b: Microsoft PROSE -powered Python Code Accelerator generating code to fix the datatypes in a Python DataFrame.
Python users often struggle with wrong data type in data frames. PROSE intelligently analyzes the data and generates code to parse the values to the right data types and handle exception cases. Depending on the number of columns, it can be a huge time saver for Data Scientists.
Scenario 3. Generation of Code Snippets for Text Transformations
Consider a developer who needs to write a function to transform text inputs, but – rather than writing code – they want to just show the desired transformation via an example. Say, for instance, that they need to transform names from the format [First name] [Last name] to [Last name], [First initial]. E.g. if “Joseph Sirosh” was the input provided, they would want “Sirosh, J” as the desired output.
We did a fun implementation of this scenario in partnership with Stack Overflow where we created a chatbot for developers, one that uses PROSE behind the scenes to generate lots of different programs and figures out the best fit for a given example provided by the developer. Figure 5 below shows a Stack Overflow chatbot session that captures such an interaction.
Figure 5: Stack Overflow bot, powered by Microsoft PROSE. The bot provides code snippets in response to requested input/output transformation patterns.
This example showed pseudocode, but we could just as easily emit Python or Java.
Scenario 4. For Large Scale Code Transformation
PROSE has extensive applicability in scenarios that involve repetitive code transformations, including code reformatting and refactoring. In certain application migration scenarios, it is estimated that developers could end up spending as much as 40% of their entire time refactoring old code.
Take the example in Figure 6a below, where a SQL query written by another developer happens to use a different convention for naming a column than the one your organization prefers (this is called aliasing). For instance, the column aliasing for ExpectedShipDate is done using the “=” (equals to) operator, but your preference is to use “AS” for the same.
Figure 6a: Old code that needs to be reformatted.
Fortunately, you have the PROSE extension in your IDE (Integrated Development Environment) and, by giving a single example of the SQL transformation you wish to perform, i.e. by correcting just the one line of code with ExpectedShipDate as below:
DATEADD(DAY, 15, OrderDate) AS ExpectedShipDate,
… the IDE calls PROSE to take care of the rest, as shown in Figure 6b.
Figure 6b: Transformed code. Microsoft PROSE has correctly interpreted the developer’s intent,
correctly transforming all the column aliases to use AS instead of the “=” (equals to) operator.
Scenario 5. Table Extraction from Images and PDF Files
As knowledge workers, we frequently encounter tabular data that is rendered as an image or appears in a PDF file, rendering it useless for any fresh data analysis.
Luckily for us, PROSE is not limited to text and can take a variety of input formats, including images and PDFs.
Figure 7a: Table in a PDF file.
PROSE supports OCR which allows it to process this sort of scenario seamlessly. All the user needs to do is perform a selection operation to indicate the bounds of the table, and, using a technique called predictive synthesis, PROSE extracts the table into a corresponding “live” spreadsheet, as shown in Figure 7b. This is a capability provided by the PDF connector in Microsoft Power BI. It allows users to perform computations and analysis that were either inaccessible or would have required tedious manual data reentry.
Figure 7b: Table in Figure 7a extracted using the PDF connector in Microsoft Power BI.
Microsoft PROSE, or Program Synthesis by Example, is pre-defined suite of technologies applicable in a variety of tasks, including the cleaning and pre-processing of data into formats that are amenable to analysis.
The Microsoft PROSE SDK includes:
- The Flash Fill example described above, currently available in Excel and PowerShell.
- Data extraction from text files by examples, available in PowerShell and Azure Log Analytics.
- Data extraction and transformation of JSON, by examples.
- Predictive file-splitting technology, which splits a text file into structured columns without any examples.
As humans, we thrive in tasks that exercise our creativity and intellect and prefer avoiding tasks that are exceedingly tedious and repetitive. By successfully predicting user intent and automatically generating code snippets to automate everyday tasks involving data, Microsoft PROSE has saved our users millions of hours of manual work.
We may have named it PROSE, but for the knowledge workers who are saving tons of time and boosting their productivity, this AI technology is more like sweet poetry!