This post is authored by Sudarshan Raghunathan, Principal Software Engineering Manager, Darren Edge, UX Architect, and Jonathan Larson, Principal Data Architect, at Microsoft.
Azure Machine Learning Studio provides a Swiss Army knife of tools for working with text datasets robustly and efficiently. For instance, there is a suite of built-in modules for lower-level tasks such as language detection and for common text pre-processing steps such as case normalization, stop word removal, stemming and lemmatization. Building on top of these is a collection of modules that convert pre-processed text into numerical features based on N-grams and skip-grams, either via hashing or with metrics such as TF-IDF. Once a set of numerical features has been constructed, you can use any of the existing learning algorithms in Azure ML to build classification, regression, recommendation or clustering models as necessary.
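To make the feature-construction steps above concrete, here is a minimal sketch in plain Python of case normalization, stop word removal, N-gram extraction, feature hashing and TF-IDF weighting. It is not the code behind the Azure ML modules: the tiny stop word list, the use of MD5 as the hash function, the 10-bit feature space and all function names are assumptions chosen purely for illustration.

```python
import hashlib
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list Azure ML uses).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hashed_features(tokens, num_bits=10, max_n=2):
    """Map all 1..max_n-grams into a fixed-size count vector via hashing."""
    vec = [0] * (1 << num_bits)
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            # MD5 is a stand-in hash for this sketch.
            digest = hashlib.md5(gram.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "little") % len(vec)
            vec[index] += 1
    return vec

def tf_idf(docs_tokens):
    """Compute unigram TF-IDF weights, one dict per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "The markets rallied and the markets closed higher",
    "Storms closed the airport",
]
tokens = [preprocess(d) for d in docs]
features = [hashed_features(t) for t in tokens]
weights = tf_idf(tokens)
```

Hashing keeps the feature vector at a fixed size regardless of vocabulary, at the cost of occasional collisions, while TF-IDF down-weights terms that appear in every document (here, "closed" receives a weight of zero because it occurs in both documents).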
Besides training models using N-gram features, you can also use a set of powerful modules for tasks such as entity and key-phrase extraction that are backed by robust pre-trained models and, in turn, use them to build different kinds of features.
Azure ML leverages the powerful Vowpal Wabbit library (VW) for many of its text analytics capabilities. For example, VW is used in the Latent Dirichlet Allocation module for building topic models on large datasets. Since VW has a large number of algorithmic knobs and is well-suited for a wide variety of learning tasks, advanced users of VW can also directly use our wrappers around its command-line interface, which expose all of its options for maximum flexibility.
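For readers who want to drive VW directly, the sketch below shows one way to turn raw documents into VW's plain-text input format, where each unlabeled example is a line of the form `| feature:value feature:value ...`. The helper name `to_vw_line` and the letters-only tokenization are assumptions made for this illustration; they are not part of VW or of the Azure ML wrappers.

```python
import re
from collections import Counter

def to_vw_line(text):
    """Format one document as an unlabeled VW example: '| token:count ...'."""
    # Keep only runs of letters so that tokens can never contain the
    # characters VW treats specially (space, '|' and ':').
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return "| " + features

docs = [
    "Oil prices rise as oil supply falls",
    "Storms closed the airport",
]
vw_lines = [to_vw_line(d) for d in docs]
# vw_lines[0] == "| as:1 falls:1 oil:2 prices:1 rise:1 supply:1"
```

A file of such lines could then be passed to the `vw` executable with its `--lda` option, which takes the number of topics; the remaining tuning options are best taken from the VW documentation rather than from this sketch.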
The open source ecosystem in R and Python also has a wide variety of tools for processing text, including reading and parsing text in different (often domain-specific) formats. For example, the tm package in R provides functions for text pre-processing such as case normalization and stemming, and the NLTK module in Python provides extensive functionality for all aspects of text analysis, from pre-processing to part-of-speech tagging to building classification and clustering models. Azure ML makes it very convenient to access these functions from the larger ecosystem in your experiments. For instance, the Python 2.7.11 and 3.5 environments already come pre-configured with all the corpora and models in NLTK.
One of the most powerful aspects of all these capabilities is that the user can compose them in arbitrary ways to construct very flexible machine learning pipelines on text data. Thanks to Azure ML’s operationalization capabilities, these pipelines can then be turned into production-ready web services in a few clicks for real-time and batch scoring.
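As a rough sketch of what real-time scoring against such a web service can look like from client code, the snippet below builds (but does not send) a request in the JSON shape used by classic Azure ML Studio request-response services. The URL, API key, input name `input1` and column name `text` are all hypothetical placeholders; the actual values depend on how your experiment was published and should be copied from the service's own API help page.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the values shown for your service.
SCORING_URL = ("https://example.invalid/workspaces/WORKSPACE_ID"
               "/services/SERVICE_ID/execute?api-version=2.0")
API_KEY = "YOUR_API_KEY"

def build_scoring_request(texts, url=SCORING_URL, api_key=API_KEY):
    """Build a POST request that scores one row per input text."""
    payload = {
        "Inputs": {
            "input1": {                      # assumed web service input name
                "ColumnNames": ["text"],     # assumed input column
                "Values": [[t] for t in texts],
            }
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )

request = build_scoring_request(["Markets rallied on strong earnings."])
# To actually call a deployed service:
#   response = urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes the payload easy to inspect and test before any network call is made.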
The Bing News PowerBI Solutions Template
Given this large ensemble of tools, how does a data scientist go about building an end-to-end solution for solving real-world problems?
In this post, we describe the approach for building the Bing News template for PowerBI. The Bing News solution template helps match your interests with relevant articles from hundreds of different news providers. It does this by chaining Azure services into an automated pipeline, giving customers a turnkey solution for analyzing news articles. The true power of this workbook shows when all the text analytics are combined using cross-filtering. For example, in the Bing News template, a user can quickly understand the gist of a topic by selecting it and reviewing the related keyphrases and associated named entities. Using these AI techniques together creates a powerful way to navigate across large document repositories and to quickly discover articles of interest.
The template combines four different machine learning techniques to provide a high-fidelity analysis. The architecture of the template is shown in the flowchart below.
The core of the Bing News template starts with an Azure Logic App, which polls the Bing News API for news articles on a preset schedule (every 5 minutes) for a list of user-specified topics. As the data makes its way through the Logic App, the actual news article text is retrieved and sent through a series of Azure Functions for basic data transformation. Next, the Microsoft Text Analytics Cognitive Service is used for keyphrase and sentiment extraction over the text body. These text enrichments could alternatively be performed in the Azure ML portion of the pipeline using the “Extract Key Phrases from Text” module. At this point, the data, along with these basic enrichments, is stored in an Azure SQL database.
We then use a separate, periodically invoked Logic App to call several Azure ML web services, which perform the more complex tasks of Vowpal Wabbit topic clustering and named entity recognition (NER). These machine learning outputs are then written back into the Azure SQL database as the final enrichments on the data. PowerBI is wired up to this Azure SQL database directly and updates itself whenever the user refreshes the workbook.
An advantage of building the pipeline in this fashion is that it allows end-users to quickly customize it to their own needs. If the customer deploying the solution template wishes to add another machine learning tag (e.g. language detection), they can simply plug in an extra Azure ML or Cognitive Service to provide the additional enrichment.
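The pluggable design described above can be illustrated with a small, self-contained Python sketch. Everything in it is a stand-in: canned articles replace the Logic App polling step, an in-memory SQLite database replaces Azure SQL, and the toy enrichment functions replace the Cognitive Services and Azure ML calls. The two-stage flow of the real template is also folded into a single pass for brevity. None of the names below come from the actual template.

```python
import sqlite3

def fetch_articles(topics):
    """Stand-in for the Logic App polling step; returns canned articles."""
    return [
        {"id": 1, "topic": topics[0], "body": "Great  results lifted the markets today."},
        {"id": 2, "topic": topics[0], "body": "Terrible storms closed the airport."},
    ]

def clean(article):
    """Stand-in for the Azure Functions step: collapse repeated whitespace."""
    return {**article, "body": " ".join(article["body"].split())}

def toy_sentiment(article):
    """Placeholder enrichment; a real pipeline would call a sentiment service."""
    words = set(article["body"].lower().replace(".", "").split())
    score = len(words & {"great", "good"}) - len(words & {"terrible", "bad"})
    return "sentiment", str(score)

def toy_language(article):
    """Placeholder for an extra tag such as language detection."""
    return "language", "en"

def run_pipeline(conn, topics, enrichers):
    """Store each cleaned article, then one tag row per enrichment."""
    conn.execute("CREATE TABLE IF NOT EXISTS articles "
                 "(id INTEGER PRIMARY KEY, topic TEXT, body TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS tags "
                 "(article_id INTEGER, kind TEXT, value TEXT)")
    for article in map(clean, fetch_articles(topics)):
        conn.execute("INSERT INTO articles VALUES (?, ?, ?)",
                     (article["id"], article["topic"], article["body"]))
        for enrich in enrichers:
            kind, value = enrich(article)
            conn.execute("INSERT INTO tags VALUES (?, ?, ?)",
                         (article["id"], kind, value))
    conn.commit()

conn = sqlite3.connect(":memory:")
# Adding a new tag is just a matter of appending another enrichment function.
run_pipeline(conn, ["finance"], [toy_sentiment, toy_language])
rows = conn.execute(
    "SELECT article_id, kind, value FROM tags ORDER BY article_id, kind"
).fetchall()
```

Because every enrichment writes to the same tag table, a reporting layer that reads from the database picks up a newly added tag without any change to the earlier steps.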
Building and deploying robust AI-powered apps that involve features generated from raw multilingual text data often requires deep domain expertise, the ability to assemble the output of several disparate tools which might not compose well, and good pre-trained models for tasks such as entity extraction. As we have shown in this post, Azure ML combines a suite of built-in text analytics modules with the ability to call seamlessly into external tools, be it NLTK or Microsoft Cognitive Services, and to package and deploy the entire workflow as a single REST endpoint. Together, these greatly reduce the friction in constructing, deploying and re-training real-world ML-powered applications such as the PowerBI Bing News solution template highlighted in this article.
To learn more about text analytics-based apps in Azure ML, visit our documentation page. There, you’ll not only find detailed help on using the modules but also pointers to a set of complete end-to-end examples for document classification, finding related items, and building a model for sentiment analysis. In addition, be sure to check out the Cortana Intelligence Gallery for more user-contributed samples, and feel free to post your questions to our MSDN forum.
Sudarshan, Darren & Jonathan