Almost every application of machine learning (ML) involves uncertainty. For example, if we are classifying images according to the objects they contain, some images will be difficult to classify, even for humans. Speech recognition too, particularly in noisy environments, is notoriously challenging and prone to ambiguity. Deciding which movies to recommend to a user, or which web page they are searching for, or which link they will click on next, are all problems where uncertainty is inevitable.
Uncertainties are a source of errors, and we are therefore tempted to view uncertainty as a problem to be avoided. However, the best way to handle uncertainty is to approach it head on and treat it as a first-class citizen of the ML world.
To do this we need a mathematical basis for quantifying and manipulating uncertain quantities, and this is provided by probability theory. We often think of probabilities in terms of the rate of occurrence of a particular event. For example, we say that the probability of a coin landing heads is 50% (or 0.5) if the fraction of heads in a long series of coin flips is a half. But we also need a way to handle uncertainty for events which cannot be repeated many times. For example, our particular coin might be badly bent, in which case there is no reason to be sure that heads and tails are equally likely. The rate at which it will land heads is itself an uncertain quantity, and yet there is only one instance of this bent coin. This more general problem of quantifying uncertainty in a consistent way has been studied by many researchers, and although various different schemes have been proposed, those that satisfy some natural consistency requirements turn out to be equivalent to probability theory.
Image Classification Example
To see how probabilities can be valuable in practice, let’s consider a simple example. Suppose we have been asked to build a system to detect cancer as part of a mass screening programme. The system will take medical images (for instance X-rays or MRI images) as input and will provide as output a decision on whether or not the patient is free of cancer. The judgement of human experts will be treated as ‘ground truth’ and our goal is to automate their expertise to allow screening on a mass scale. We will also imagine that we have been supplied with a large number of training images each of which has been labelled as normal or cancerous by a human expert.
The simplest approach would be to train a classifier, such as a neural network, to assign each new image to either ‘cancer’ or ‘normal’. While this appears to offer a solution to our problem, we can do much better by training our neural network instead to output the probability that the image represents cancer (this is called the inference step), and then subsequently using this probability to decide whether to assign the image to the normal class or the cancer class (this is called the decision step).
If our goal is to misclassify as few images as possible, then decision theory tells us that we should assign a new image to the class to which our neural network assigns the higher probability. For instance, if the neural network says the probability that a particular image represents cancer is 30% (and therefore that the probability that it is normal is 70%) then the image would be classified as normal. At this point our two-stage approach is equivalent to a simple classifier.
Of course we would not feel happy with a screening programme that gave an all-clear to someone with a 30% chance of having cancer, because an error in which cancer is misclassified as normal is far more costly (the patient could develop advanced cancer before it is detected) than an error in which a normal image is misclassified as cancer (the image would be sent to a human expert for assessment, and the expert's time would be wasted). In a screening programme we might therefore require the probability of cancer to be lower than some very low threshold, say 1%, before we are willing to allow the system to classify an image as normal. This can be formalised by introducing cost values for the two types of misclassification. Decision theory then provides a simple procedure for classifying an image so as to minimise the average cost, given the probabilities for the two classes.
The key point is that if there are changes to the cost values, for example due to a change in the cost of human time to assess images, then only the trivial decision step needs to be changed. There is no need to repeat the complex and expensive process of retraining the neural network.
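To make the decision step concrete, here is a minimal sketch in Python. The function names and the cost values (99 for a missed cancer, 1 for a false alarm) are illustrative assumptions rather than figures from a real screening programme; they are chosen only because they reproduce the 1% threshold mentioned above. We also assume that correct decisions cost nothing.

```python
def decide(p_cancer, cost_missed_cancer=99.0, cost_false_alarm=1.0):
    """Pick the class with the lower expected cost (hypothetical cost values)."""
    # If we say 'normal', we pay only when the image is actually cancer.
    expected_cost_normal = p_cancer * cost_missed_cancer
    # If we say 'cancer', we pay only when the image is actually normal.
    expected_cost_cancer = (1.0 - p_cancer) * cost_false_alarm
    return "normal" if expected_cost_normal < expected_cost_cancer else "cancer"


def cancer_threshold(cost_missed_cancer, cost_false_alarm):
    """Probability of cancer above which 'cancer' is the cheaper decision."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_cancer)


print(decide(0.30))                  # asymmetric costs: flagged as cancer
print(decide(0.30, 1.0, 1.0))        # equal costs: classified as normal
print(cancer_threshold(99.0, 1.0))   # threshold implied by the costs
```

With equal costs the rule reduces to picking the more probable class, so the 30% image is classified as normal. With the asymmetric costs the same image is flagged as cancer. Changing the costs changes only the arguments passed to `decide`; the network that produced `p_cancer` is untouched.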
Cancer is Rare
In screening programmes we are typically looking for rare events. Let’s suppose that only 1 in 1,000 of the people being screened in our example have cancer. If we collected 10,000 images at random and hand-labelled them then typically only around 10 of those images would represent cancer – hardly enough to characterise the wide variability of cancer images. A better approach is to balance the classes and have, say, 5,000 images each of normal and cancer. Again decision theory tells us how to take the probabilities produced by a network trained on balanced classes and correct those probabilities to allow for the actual frequency of cancer in the population. Furthermore, if the system is applied to a new population with a different background rate of cancer, the decision step can trivially be modified, again without needing to retrain the neural network.
Incidentally, failure to take account of these so-called prior class probabilities for rare events lies at the heart of the prosecutor’s fallacy, a statistical blunder which has been responsible for numerous major miscarriages of justice in the courtroom.
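The correction for class frequencies described above can be sketched in a few lines of Python. This is a sketch under two assumptions: that the network's output is a good estimate of the probability of cancer under the balanced training set, and that cancer and normal images look the same in the population as they did in training. The function name and default values are ours, using the 1 in 1,000 rate from the example.

```python
def correct_for_prior(p_balanced, train_prior=0.5, true_prior=0.001):
    """Re-weight a probability from a balanced training set to the real cancer rate.

    p_balanced:  probability of cancer output by the network
    train_prior: fraction of cancer images in the training set
    true_prior:  fraction of cancer cases in the screened population
    """
    # Divide out the training prior and multiply in the true prior for each class.
    weight_cancer = p_balanced * true_prior / train_prior
    weight_normal = (1.0 - p_balanced) * (1.0 - true_prior) / (1.0 - train_prior)
    # Renormalise so that the two class probabilities sum to one.
    return weight_cancer / (weight_cancer + weight_normal)


# A network trained on balanced classes reports 90% for some image.
print(round(correct_for_prior(0.9), 4))                    # 0.0089
# A population with a higher background rate needs only a different argument.
print(round(correct_for_prior(0.9, true_prior=0.01), 4))   # 0.0833
```

A 90% output from the balanced network corresponds to a little under 1% once the rarity of cancer is taken into account. Moving to a population with a different background rate means passing a different `true_prior`, with no retraining.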
Improving Accuracy by Rejection
Finally, in our screening example we might imagine a further improvement to the system in which the neural network only classifies the ‘easy’ cases, and rejects those images for which there is significant ambiguity. The downside is that a human must then examine the rejected images, but we would expect that the performance of the neural network on the remaining examples would be improved. This intuition turns out to be correct, and decision theory tells us that we should reject any image for which the higher class probability is below some threshold. By changing this threshold we can change the fraction of images which are rejected, and hence optimise the trade-off between improving system performance and minimising human effort.
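The reject rule can be sketched as follows. The function names, the threshold of 0.9, and the four example probabilities are illustrative assumptions, not values from a real system.

```python
def classify_with_reject(p_cancer, threshold=0.9):
    """Return 'cancer', 'normal', or 'reject' when neither class is confident enough."""
    p_normal = 1.0 - p_cancer
    if p_cancer >= p_normal:
        label, confidence = "cancer", p_cancer
    else:
        label, confidence = "normal", p_normal
    # Reject when the higher class probability falls below the threshold.
    return label if confidence >= threshold else "reject"


def rejection_rate(probabilities, threshold):
    """Fraction of images that would be passed to a human expert."""
    decisions = [classify_with_reject(p, threshold) for p in probabilities]
    return decisions.count("reject") / len(decisions)


outputs = [0.97, 0.02, 0.60, 0.45]   # made-up network outputs
print([classify_with_reject(p) for p in outputs])
print(rejection_rate(outputs, 0.9))  # half of these images go to a human
print(rejection_rate(outputs, 0.5))  # a threshold of 0.5 rejects nothing
```

Raising the threshold sends more images to the human expert and leaves the network with only the cases it is most sure about. With two classes, a threshold of 0.5 or below rejects nothing, since the higher class probability is never less than 0.5.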
We have seen some of the numerous benefits of training classifiers to generate probabilities rather than simply make decisions. Furthermore, many other ML tasks can benefit from a probabilistic approach, including regression, clustering, recommendation, and forecasting. You can find out more about how to use probabilities in ML in Chapter 1 of Pattern Recognition and Machine Learning. You can also try out the Two-Class Bayes Point Machine Classifier in the Microsoft Azure Machine Learning service as an example of a classifier that generates probabilities as opposed to decisions.
Next week we’ll see how to take probabilities to the next stage and use them to describe the uncertainty in the parameters of our learning model itself.