This post was co-authored by Muhammad Raza Khan & Akshay Mehra (both former interns), with Mohamed Abdel-Hady & Debraj GuhaThakurta (Senior Data Scientist Leads) and Wei Guo & Zoran Dzunic (Data Scientists) at Microsoft.
We recently published two real-world scenarios demonstrating how to use Azure Machine Learning alongside the Team Data Science Process (TDSP) to execute AI projects involving Natural Language Processing (NLP) use-cases, namely, sentiment classification and entity extraction. This blog post provides a summary of these two samples, which are available through public GitHub repositories. The samples use a variety of Azure data platforms, such as Data Science Virtual Machines (DSVMs) to train DNN models for sentiment classification and entity extraction using GPUs, and HDInsight Spark for data processing and word embedding model training at scale. The samples show how domain-specific word embeddings, generated from domain-specific and labeled training data sets, outperform generic word embeddings trained on general, unlabeled data, leading to improved accuracy in classification and entity extraction tasks.
The samples show several capabilities of the Azure ML Workbench including:
- Instantiation of Team Data Science Process (TDSP) structure and templates.
- Execution of Python scripts on different compute environments.
- Run history tracking for Python scripts.
- Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters.
- Execution of jobs in remote GPU DSVMs on Azure.
- Easy operationalization of deep learning models as web services on Azure Container Services (AKS).
Sentiment Classification Using Supervised Word Embeddings
We demonstrate the use of two word embedding methods, Word2Vec and Sentiment Specific Word Embedding (SSWE), to predict Twitter sentiment, and we use TDSP with Azure ML to execute this project. The data used in this project is the Sentiment140 dataset, which contains the text of tweets (with emoticons removed) along with the polarity of each tweet (positive or negative; neutral tweets are removed for this project). The Sentiment140 dataset was labeled using the concept of distant supervision, as explained in the paper Twitter Sentiment Classification Using Distant Supervision. This project is executed using TDSP templates, which consist of the following stages:
- Data acquisition and understanding.
- Feature engineering.
- Model creation.
- Model evaluation.
- Deployment.
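As a toy illustration of the feature-engineering stage, the sketch below normalizes raw tweet text before embedding lookup. The specific cleaning steps (lowercasing, stripping URLs, mentions, and punctuation) are hypothetical, not taken from the sample's actual pipeline:

```python
import re

def clean_tweet(text: str) -> str:
    """Normalize a raw tweet for feature engineering (illustrative steps only)."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)      # strip URLs
    text = re.sub(r"@\w+", " ", text)         # strip user mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and emoticon residue
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user Loving the new #AzureML release!! :) http://aka.ms/x"))
# -> loving the new azureml release
```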
A few key highlights from this sample:
- The sample runs on the latest Azure ML Workbench.
- Model training is performed in Azure DSVM with GPU.
- Word embeddings are generated using Word2Vec and SSWE.
- Deep learning frameworks and packages such as TensorFlow, Cognitive Toolkit (CNTK) and Keras are applied in this project.
- Four models using different word embedding methods and classification modeling techniques are trained and compared.
- The most accurate trained model is deployed to a web service on Azure Container Service.
The Word2Vec algorithm is based on the paper Mikolov, Tomas, et al. “Distributed Representations of Words and Phrases and their Compositionality”, Advances in Neural Information Processing Systems, 2013. Skip-gram is a shallow neural network that takes the target word, encoded as a one-hot vector, as input and uses it to predict nearby words. The skip-gram based architecture is shown in the following figure.
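To make the skip-gram objective concrete, the sketch below generates the (target, context) training pairs that the network learns from, each word predicting its neighbors within a fixed window. This is a minimal illustration, not the sample's implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for the skip-gram model:
    each word is used to predict the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "love", "azure"], window=1))
# -> [('i', 'love'), ('love', 'i'), ('love', 'azure'), ('azure', 'love')]
```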
The Sentiment Specific Word Embedding (SSWE) algorithm, proposed in Tang, Duyu, et al. “Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification”, ACL (1), 2014, tries to overcome a weakness of the Word2Vec algorithm: words with similar contexts but opposite polarity can end up with similar word vectors, so Word2Vec may not perform very accurately on tasks such as sentiment analysis. The SSWE algorithm addresses this weakness by incorporating both the sentence polarity and the word’s context into its loss function.
We are using a simplified variant of SSWE here implemented as a convolutional neural network (CNN) designed to optimize the cross-entropy of sentiment classes as the loss function. As we will see below, the performance of this SSWE embedding is better than the Word2Vec embedding in terms of sentiment classification accuracy. The SSWE CNN model that we use in this sample is shown in the figure below.
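The cross-entropy loss over sentiment classes that this simplified SSWE variant optimizes can be sketched in a few lines of NumPy. The logits and labels here are made-up illustrative values, not outputs of the actual CNN:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy between predicted sentiment classes and true labels."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Two tweets; class 0 = negative, class 1 = positive (illustrative values)
logits = np.array([[2.0, -1.0],   # confidently negative
                   [0.5,  1.5]])  # leaning positive
labels = np.array([0, 1])
print(cross_entropy(logits, labels))
```

Training the CNN drives this quantity down, which pushes words of opposite polarity apart in the embedding space even when their contexts are similar.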
We compared Logistic Regression and Gradient Boosted Decision Tree models using Word2Vec and SSWE embeddings as features. Results show that the Gradient Boosted Tree model with SSWE embeddings performs best. This model is deployed to a web service on Azure Container Service.
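A comparison of this kind can be sketched with scikit-learn. The synthetic 50-dimensional features below stand in for the tweet embedding vectors; the real sample trains on embeddings computed from Sentiment140:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tweet embedding features (50-dim vectors)
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for name, model in [("logistic_regression", LogisticRegression(max_iter=1000)),
                    ("gradient_boosted_trees", GradientBoostingClassifier())]:
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, round(results[name], 3))
```

In the actual sample, the same comparison is repeated for each of the four embedding/model combinations, and the best model is the one operationalized.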
We would love to hear your feedback on this classification sample – you can send us your feedback and comments via the GitHub issues page.
Entity Extraction from Biomedical Unstructured Text
Entity extraction is a subtask of information extraction, and is also known as Named-Entity Recognition (NER), entity chunking and entity identification. The aim of this real-world scenario-based sample is to highlight how to use Azure ML and TDSP to execute a complicated NLP task such as entity extraction from unstructured text. The sample shows:
- How to train a neural word embeddings model on a text corpus of about 18 million PubMed abstracts using Spark Word2Vec implementation.
- How to build a deep Long Short-Term Memory (LSTM) recurrent neural network model for entity extraction on a GPU-enabled Azure DSVM.
- How a domain-specific word embedding model can outperform generic word embedding models in the entity recognition task.
- How to train and operationalize deep learning models using Azure ML Workbench.
Named entity recognition is a critical step for complex NLP tasks in the biomedical field, such as:
- Extracting the mentions of named entities such as diseases, drugs, chemicals and symptoms from electronic medical or health records.
- Drug discovery.
- Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship.
Our use case scenario focuses on how a large corpus of unstructured text, such as Medline PubMed abstracts, can be analyzed to train a word embedding model. The output embeddings are then used as automatically generated features to train a neural entity extractor.
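The "embeddings as features" step amounts to a lookup: each token in a sentence is replaced by its embedding vector, producing a feature matrix the entity extractor consumes. The tiny vocabulary and random 3-dimensional vectors below are placeholders for the real Word2Vec model trained on PubMed abstracts:

```python
import numpy as np

# Hypothetical 4-word vocabulary; real vectors come from the Word2Vec
# model trained on the ~18M PubMed abstracts.
vocab = {"aspirin": 0, "inhibits": 1, "cyclooxygenase": 2, "<unk>": 3}
embeddings = np.random.RandomState(0).randn(len(vocab), 3)

def featurize(tokens):
    """Map a token sequence to its embedding feature matrix (seq_len x dim)."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return embeddings[ids]

features = featurize(["aspirin", "inhibits", "cyclooxygenase"])
print(features.shape)  # -> (3, 3)
```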
The figure below shows the architecture used to process the data and train models.
The DNN model architecture used across all the experiments and for comparison is presented below. The parameter that changes for different datasets is the maximum sequence length (613 here).
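Because the LSTM consumes fixed-length inputs, every token sequence is padded (or truncated) to the maximum sequence length before training. A minimal sketch of that step, using the 613-token length quoted above and an assumed pad id of 0:

```python
def pad_sequence(token_ids, max_len=613, pad_id=0):
    """Pad (or truncate) a token-id sequence to the model's fixed input
    length; 613 is the maximum sequence length for this dataset."""
    return (token_ids + [pad_id] * max_len)[:max_len]

padded = pad_sequence([4, 17, 8])
print(len(padded))  # -> 613
```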
Our results show that the biomedical entity extraction model trained on domain-specific word embedding features outperforms the model trained on generic features (word embeddings from Google News data). The domain-specific model correctly detects 7,012 entities (out of 9,475) with an F1-score of 0.73, compared to 5,274 entities and an F1-score of 0.61 for the generic model. Comparing the two feature types, 1) word embeddings trained on PubMed abstracts and 2) word embeddings trained on Google News, the in-domain model clearly outperforms the generic one, so a domain-specific word embedding model is much more helpful than a generic one.
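The reported counts pin down the model's recall directly, and the F1-score is the harmonic mean of precision and recall. The precision value used below is illustrative (chosen to be consistent with the reported F1 of 0.73), not a figure from the sample:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recall implied by the reported counts: 7,012 of 9,475 entities detected.
recall = 7012 / 9475
print(round(recall, 2))  # -> 0.74

# With a precision around 0.72 (illustrative), F1 lands near the reported 0.73.
print(round(f1_score(0.72, recall), 2))  # -> 0.73
```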
We would love to hear your feedback on this entity extraction sample – you can send us your feedback and comments via the GitHub issues page.
We showed how to apply cloud platforms such as Azure DSVMs and HDInsight Spark clusters, along with tools such as Azure ML and the Team Data Science Process (TDSP), in two real-world NLP use-cases, with corresponding end-to-end samples. We also showed that customizing the word embedding approach, or using domain-specific data sets for word embeddings, can improve the accuracy of downstream tasks such as classification and entity extraction. We hope these samples help you efficiently design, develop and deploy end-to-end NLP solutions in your business domains.
Raza, Akshay, Mohamed, Debraj, Wei & Zoran