Evaluating Failure Prediction Models for Predictive Maintenance

This post is by Shaheen Gauher, PhD, Data Scientist at Microsoft.


Predictive Maintenance is about anticipating failures and taking preemptive action. Recent advances in machine learning and cloud storage have created a tremendous opportunity to use the gamut of data coming from factories, buildings, machines, sensors, and more, not only to monitor equipment health but also to predict when something is likely to malfunction or fail. However, as simple as it sounds in principle, in reality it is often hard to come by all the data needed to make such predictions, and to make them in a timely manner. The data that is collected is often incomplete, or there’s simply not enough of it, which makes it unsuitable for modeling.


In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless data collection has been taking place over a long period of time, the data will contain very few of these events or, in the worst case, none at all. Ideally, the data should contain hundreds or even thousands of failures. But even then, the ratio of failure to non-failure data is highly skewed. Additionally, the data should be collected from all of the relevant parts and should capture the complete timeline of events leading up to the failure. Collecting partial information leads to incomplete learning and imprecise predictions. For example, if we wanted to predict when a car's brakes will fail, we should collect data not just from the brake pads but also from the wheels, along with the complete maintenance record of the car, when the wheels were replaced, when the brake pads were replaced, the make and model of the car, when it was purchased, the history of how and where the car was driven, and more, all over a long period of time. A model that learns from rich data like this can find patterns and identify dependencies that would otherwise not be obvious, and is better placed to predict a brake failure in advance. As the field matures and the art of machine learning becomes better understood, businesses will start collecting data more strategically.

Modeling Imbalanced Data

Modeling for Predictive Maintenance falls under the classic problem of modeling with imbalanced data, where only a small fraction of the data constitutes failure. This kind of data poses several issues. While normal-operation (non-failure) records, which make up the majority of the data, tend to resemble one another, failure records can differ considerably from each other. Standard methods for feature selection and feature construction do not work well on imbalanced data. Moreover, the metrics used to evaluate the model can be misleading. For example, in a classification model for a dataset with more than 99% non-failure data and less than 1% failure data, near-perfect accuracy could be achieved simply by assigning all instances to the majority (non-failure) class. Such a model is useless, however, because it never learned to predict a failure. More appropriate metrics for evaluating these types of models are precision, recall, AUC etc. Instead of conventional accuracy, the accuracy per class should be computed and the mean of these accuracies should be reported. For details on how to compute these evaluation metrics, see here.
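To make the accuracy trap concrete, here is a minimal sketch in plain Python. The data is made up for illustration (990 non-failures, 10 failures), and the helper name `per_class_accuracy` is my own, not something from Azure ML. It scores a "model" that always predicts the majority class.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's instances predicted correctly}."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical data: 990 non-failures (0) and 10 failures (1).
y_true = [0] * 990 + [1] * 10
# A "model" that assigns every instance to the majority class.
y_pred = [0] * 1000

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
class_accuracy = per_class_accuracy(y_true, y_pred)
mean_class_accuracy = sum(class_accuracy.values()) / len(class_accuracy)

print(overall_accuracy)     # 0.99
print(class_accuracy)       # {0: 1.0, 1: 0.0}
print(mean_class_accuracy)  # 0.5
```

The overall accuracy looks excellent, while the mean of the per-class accuracies shows that the model is no better than a coin toss at telling the two classes apart.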

When building models, a clear understanding of the business requirement and of the tolerance for false negatives and false positives is necessary. For some businesses, failing to predict a malfunction can be catastrophic (e.g. aircraft engine failure) or exorbitantly expensive (e.g. production shutdown in a factory), in which case we must tune our model for high recall. These businesses would rather the model err on the side of caution, because a maintenance checkup in response to a false prediction costs far less than a full-blown shutdown. On the other hand, falsely predicting a failure when there is none can be a problem for other businesses because of the time and resources spent addressing it, in which case the model should be tuned for high precision. In the language of statistics, this is what we call misclassification cost. The business can estimate the actual dollar amount associated with a false prediction by taking into account repair costs (parts as well as labor) and by quantifying the effect on brand and reputation, customer satisfaction etc. This should be the driving factor for tuning the model for cost-sensitive learning.
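As a rough illustration of how misclassification cost can drive the choice, the sketch below compares two operating points of the same model. Every number in it (the dollar costs and the error counts) is hypothetical and chosen only to show the arithmetic.

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total misclassification cost for a given number of each error type."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Hypothetical costs: an unnecessary checkup versus an unplanned shutdown.
COST_FALSE_ALARM = 500
COST_MISSED_FAILURE = 20_000

# Hypothetical error counts for two operating points on the same test set.
high_recall_point = {"fp": 40, "fn": 2}     # many false alarms, few misses
high_precision_point = {"fp": 3, "fn": 8}   # few false alarms, more misses

cost_high_recall = total_cost(
    high_recall_point["fp"], high_recall_point["fn"],
    COST_FALSE_ALARM, COST_MISSED_FAILURE,
)
cost_high_precision = total_cost(
    high_precision_point["fp"], high_precision_point["fn"],
    COST_FALSE_ALARM, COST_MISSED_FAILURE,
)

print(cost_high_recall)     # 60000
print(cost_high_precision)  # 161500
```

With these made-up costs, the high-recall operating point is cheaper despite its many false alarms. Swap the two cost constants and the conclusion reverses, which is why the costs have to come from the business.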

There are several ways we can circumvent the problems of modelling with imbalanced data. Below I will briefly describe three ways to deal with imbalanced data within the Azure ML framework.

1) As mentioned above, with imbalanced data, a classification algorithm's performance is biased against the minority class. Hence the first step is to balance the dataset through resampling. There are various sampling techniques available, each with its own advantages and disadvantages. You can find a brief description here. In Azure ML, the SMOTE module allows us to upsample, i.e. increase the number of minority (failure) instances, by synthesizing new examples. The Partition and Sample module allows us to do simple random sampling or stratified random sampling and can be used to downsample the majority (non-failure) class. The Split Data module can also be used to downsample the majority class.

2) For most machine learning algorithms, we need to provide some hyperparameters (e.g. for Boosted Decision Tree we need to provide values for Maximum number of leaves per tree, Minimum number of samples per leaf node, Number of trees constructed etc.). These values strongly influence how well the model performs. We also need to specify the metric (e.g. Accuracy, Recall, AUC etc.) used to decide which set of parameters is best. The Sweep Parameters module in Azure ML allows us to do just that. By selecting Recall or Precision as the metric to optimize, the resulting model can be tuned for high recall or high precision.

3) The classification models in ML, besides predicting a positive or negative, also output a score, which is a real number between 0 and 1. Scores above 0.5 (the default threshold) are labelled positive and those below are labelled negative. The choice of this threshold decides the predicted label and thus the related metrics. For example, choosing a threshold, or operating point, of 0.7 means all instances with scores greater than 0.7 will be labelled positive and the rest negative. By adjusting this threshold, we can tweak the predictions to produce high recall or high precision.
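The third option, moving the threshold, is easy to sketch in plain Python. The labels and scores below are invented for illustration, and the helper names `label_at` and `precision_recall` are my own rather than Azure ML functions.

```python
def label_at(scores, threshold):
    """Label a score positive (1) when it is strictly above the threshold."""
    return [1 if s > threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels (1 = failure) and model scores, in the same order.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.75, 0.6, 0.3, 0.65, 0.4, 0.2, 0.1, 0.05, 0.02]

results = {}
for threshold in (0.7, 0.5, 0.25):
    results[threshold] = precision_recall(y_true, label_at(scores, threshold))
    print(threshold, results[threshold])
```

On this toy data, a threshold of 0.7 gives precision 1.0 with recall 0.5, the default 0.5 gives 0.75 for both, and 0.25 pushes recall to 1.0 while precision drops to about 0.67. The model is the same in all three cases; only the operating point changes.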

Some Use Cases

In the section below I will discuss two Predictive Maintenance scenarios. I will briefly describe the business requirement and how to build the model keeping our requirements in mind.

  1. A manufacturing line for circuit boards for electronic products needed to detect a faulty board early in the production line. Flagging a failure early on and washing the board didn’t cost very much, even when the failure turned out to be a false alarm, whereas missing a defective board and mounting components on it only to scrap it later would cost a substantial amount. The business was therefore tolerant of false positives but not of false negatives. In this case the requirement was to tune the model to catch as many failures as possible. With less than 1% of the data constituting failure, conventional evaluation metrics such as accuracy are meaningless. We balanced the data by upsampling the minority class (failures) using SMOTE and tuned the model parameters for high AUC. Once the model was trained, we chose a high-recall ROC operating point. The same trained model could also be used for a different production line with a different tolerance for false positives by adjusting the ROC operating point to produce different recall values. Despite limited failure data and noise in the data, the resulting model caught failures 75% of the time.


  2. A wind farm needed to predict whether a wind turbine generator would fail within the next few months. The turbines were in remote locations. A failure would require technicians to travel to a remote farm, spend substantial amounts of time on inspections and ultimately do the repairs. The costs incurred by stalled operations, replacing expensive parts and associated labor were quite steep. The model needed to be tuned to pick up only extremely likely failures. The business was more tolerant of a false negative than of a false positive. With only 1% of the data constituting failure, we upsampled the minority class and tuned the model parameters for high F1 score. The resulting model yielded an F1 score of 0.07, which was more than three times better than what a random model would have produced (0.02).
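Both scenarios above upsample the minority class. The sketch below shows the core idea behind SMOTE in plain Python: pick a failure record, pick one of its nearest failure neighbours, and create a new point somewhere on the line between them. It is a simplified illustration and not the implementation used by the Azure ML SMOTE module. The function name `smote_like`, its parameters and the two-feature data are all hypothetical.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by interpolating between a randomly chosen
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + u * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical failure records with two numeric features each.
failures = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote_like(failures, n_new=6)

print(len(failures) + len(new_points))  # 10 failure records after upsampling
```

Because each synthetic point lies between two real failure records, the new examples stay inside the region the failures already occupy instead of simply duplicating existing rows. Only the training data should be resampled this way; the test set should keep its original class ratio so the reported metrics remain honest.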

Next I will show the results from a binary classifier for an imbalanced data set (~25k rows) with failures constituting just 1% of the data. 70% of the data was used for training and the rest for testing. As mentioned above, accuracy is a misleading metric for a classifier on an imbalanced data set, and AUC on its own does not show how the model behaves at a chosen operating point. To evaluate the model, Precision, Recall and F1 score are some of the metrics to look at, as they indicate how well the model predicts a rare failure. The metrics obtained from the model are shown below. I will compare these metrics to those of a baseline random classifier to confirm that the model is doing a better job than just making lucky guesses. More information, such as the definitions and methods used to compute baseline metrics, is available here. I will also demonstrate what is referred to as the trade-off between precision and recall.

Tuning the model for high recall yielded a recall of 0.33 i.e. 33% of the actual failures were caught by the model (for threshold of 0.5). The precision obtained was 0.65.

A random weighted guess model would have produced a recall of 0.01. We can increase the recall to 0.5 by choosing a threshold of 0.01, at the cost of precision sliding to 0.12. As you can see, there is an inverse relationship between precision and recall. Higher recall catches more of the real failures but also flags more false ones. Higher precision, on the other hand, flags fewer false failures but misses more of the real ones. Even so, the precision of 0.12 in this case is more than ten times better than what a random model would have produced!

Tuning the model for high precision yields 0.86 precision, i.e. 86% of the predicted failures were true failures. A random model would have produced a precision of 0.01. By further adjusting the threshold we could achieve an even higher precision.

Baseline metrics put the model's results in context, particularly when the raw numbers look unimpressive at first glance. Looking at the metrics in isolation can lead to an incomplete evaluation of the model, especially in terms of the value added by the predictive model.
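The exact baseline definitions are in the post linked earlier. As a rough sketch, assume a random classifier that labels each instance positive independently with some fixed probability. Its expected precision then equals the failure rate in the data, and its expected recall equals the probability of predicting positive. The function name `random_baseline` and the two variants shown are my own framing for illustration.

```python
def random_baseline(prevalence, positive_rate):
    """Expected (precision, recall, F1) for a classifier that labels each
    instance positive independently with probability positive_rate, on data
    where a fraction `prevalence` of instances are true positives."""
    precision = prevalence     # flagged instances are a random sample of the data
    recall = positive_rate     # each true failure is flagged with this probability
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


FAILURE_RATE = 0.01  # 1% failures, as in the data set described above

# Weighted guess: predict "failure" as often as failures actually occur.
weighted_guess = random_baseline(FAILURE_RATE, FAILURE_RATE)
# Coin flip: predict "failure" half of the time.
coin_flip = random_baseline(FAILURE_RATE, 0.5)

print([round(v, 4) for v in weighted_guess])
print([round(v, 4) for v in coin_flip])
```

Under these assumptions the weighted guess gives precision, recall and F1 of about 0.01 each, while the coin flip gives precision 0.01, recall 0.5 and F1 just under 0.02. Against numbers like these, a model precision of 0.12 or 0.86 is clearly well above chance, even though 0.12 might look poor in isolation.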

Predicting failures is one of the many problems in the Predictive Maintenance domain. Predictive Maintenance solutions include forecasting error counts for machines as a precursor to a failure, anomaly detection and root cause analysis to name a few. Businesses can use the Cortana Intelligence Suite as a starting point in their long term Predictive Maintenance strategy. Cortana Intelligence opens new possibilities in the Predictive Maintenance space, including data ingestion, data storage, data processing and advanced analytics components. The Cortana Intelligence Predictive Maintenance for Aerospace Solution Template provides all the essential elements for building an end to end Predictive Maintenance solution. You can find a more in-depth discussion of Predictive Maintenance solutions including industrial best practices around data and machine learning in the playbook here.