Machine Learning, meet Computer Vision – Part 2

This blog post is co-authored by Jamie Shotton, Antonio Criminisi and Sebastian Nowozin of Microsoft Research, Cambridge, UK.

In our last post, we introduced you to the field of computer vision and talked about a powerful approach, classifying pixels using decision forests, which has found broad application in medical imaging and Kinect. In this second post we will look at some of the recent excitement around deep neural networks and their successes in computer vision, followed by a look at what might be next for computer vision and machine learning.

Deep Neural Networks

The last few years have seen rapid progress in the quality and quantity of training datasets we have access to as vision researchers. The improvements are to a large extent due to the uptake of crowdsourcing, which has allowed us to scale our datasets to millions of labeled images. One challenging dataset, ImageNet, contains millions of images annotated with image-level labels across tens of thousands of categories.

After a few years of slow progress in the community on the ImageNet dataset, Krizhevsky et al. rather rocked the field in 2012. They showed how general-purpose GPU computing paired with some seemingly subtle algorithmic changes could be used to train convolutional neural networks much deeper than before. The result was a remarkable step change in accuracy in image classification on the ImageNet 1000-category test. This also garnered a lot of attention in the popular press and even resulted in some very large start-up buyouts. Since then “deep learning” has become a very hot topic in computer vision, with recent papers extending the approach to object localization, face recognition, and human pose estimation.

The Future

While clearly very powerful, are deep convolutional networks the end of the road for computer vision? We’re sure they’ll continue to be popular and push the state of the art in the next few years but we believe there’s still another step change or two to come. We can only speculate as to what these changes will be, but we finish up by highlighting some of the opportunities as we see them.

Representations: These networks learn to predict a relatively simple representation of the image contents. There’s no deep understanding of where individual objects live in the image, how they relate to one another, or the role of particular objects in our lives (e.g. we couldn’t easily combine the cue that a person’s hair looks slightly glossy with the fact that they are holding a hair-dryer to get a more confident estimate that their hair is wet). New datasets such as Microsoft COCO may help push this forward by providing very detailed labeling of individual object segmentations in “non-iconic” images, i.e. images that contain more than one object and where the objects are not front-and-center.

Efficiency: While the evaluation of a deep network on a test image can be performed relatively quickly through parallelization, neural networks don’t have the notion of conditional computation that we encountered in our last post: every test example ends up traversing every single node in the network to produce its output. Furthermore, even with fast GPUs, training a network can take days or weeks, which limits the ability to experiment rapidly.
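To make the point about conditional computation concrete, here is a minimal sketch in Python. It assumes a tiny fully connected network with made-up weights (the names `forward` and `toy_network` are hypothetical, chosen for this illustration, and the weights come from no trained model). It simply counts how many units are evaluated for two very different inputs.

```python
import math
from typing import List, Sequence, Tuple

# One (weights, bias) pair per unit; a network is a list of such layers.
Layer = List[Tuple[List[float], float]]


def forward(layers: Sequence[Layer], x: Sequence[float]) -> Tuple[List[float], int]:
    """Run a fully connected network and return (output, units_evaluated)."""
    activations = list(x)
    units_evaluated = 0
    for layer in layers:
        next_activations = []
        for weights, bias in layer:
            z = sum(w * a for w, a in zip(weights, activations)) + bias
            next_activations.append(math.tanh(z))
            units_evaluated += 1  # every unit is computed, whatever the input
        activations = next_activations
    return activations, units_evaluated


# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output unit.
toy_network: List[Layer] = [
    [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4), ([-0.6, 0.1], 0.0)],
    [([0.2, -0.5, 0.7], 0.05)],
]

total_units = sum(len(layer) for layer in toy_network)
_, count_a = forward(toy_network, [1.0, 0.0])
_, count_b = forward(toy_network, [-3.0, 2.5])
```

Both inputs touch all of the network's units: the amount of work does not depend on the example, which is exactly what conditional computation would avoid.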

Structure learning: Deep convolutional networks currently have a carefully hand-designed and rather rigid structure that has evolved over many years of research. Changing, say, the size of a particular layer or the number of layers can have undesirable consequences for the quality of the predictor. Beyond simple brute-force parameter sweeps to optimize the form of the network, we hope there is an opportunity to learn a much more flexible network structure directly from data.

Recently, we have been taking baby steps towards addressing the latter two of these opportunities. We’re particularly excited by our recent work on decision jungles: ensembles of rooted decision DAGs. You can think of a decision DAG as a decision tree in which child nodes have been merged together so that nodes are allowed to have multiple parents. Compared to decision trees, we’ve shown that they can reduce memory consumption by an order of magnitude while also resulting in considerably improved generalization. A DAG also starts to look a lot like a neural network, but has two important differences: firstly, the structure is learned jointly with the parameters of the model; and secondly, the DAG retains the idea from decision trees of efficient conditional computation: a single test example follows a single path through the DAG, rather than traversing all nodes as would be the case with a neural network. We’re actively investigating whether decision jungles, perhaps in conjunction with other forms of deep learning including stacking and entanglement, can offer an efficient alternative to deep neural networks.
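The structure of a decision DAG can be sketched in a few lines of Python. This is only an illustration of evaluation, not of how decision jungles are trained: the `Node` class, the thresholds, and the labels below are all hypothetical, and the DAG is wired up by hand rather than learned from data. The node called `shared` has two parents, which is the merging that distinguishes a DAG from a tree.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    # A split node compares one feature to a threshold; a leaf carries a label.
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None


def predict(root: Node, x: Sequence[float]) -> Tuple[str, int]:
    """Follow a single root-to-leaf path; return (label, nodes_visited)."""
    node, visited = root, 1
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
        visited += 1
    return node.label, visited


def count_nodes(root: Node) -> int:
    """Count distinct nodes, visiting merged (shared) nodes only once."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.label is None:
            stack.extend([node.left, node.right])
    return len(seen)


# Hand-built toy DAG (hypothetical thresholds and labels).
leaf_a = Node(label="background")
leaf_b = Node(label="hand")
shared = Node(feature=2, threshold=0.5, left=leaf_a, right=leaf_b)  # two parents
left_child = Node(feature=1, threshold=0.3, left=leaf_a, right=shared)
right_child = Node(feature=1, threshold=0.7, left=shared, right=leaf_b)
root = Node(feature=0, threshold=0.5, left=left_child, right=right_child)

print(predict(root, [0.2, 0.6, 0.9]))  # ('hand', 4)
print(predict(root, [0.8, 0.6, 0.1]))  # ('background', 4)
print(count_nodes(root))               # 6
```

Each example follows one path through the DAG, and both of the paths above pass through `shared`. Because the merged subtree and the leaves are stored once instead of once per parent, the model needs fewer nodes than the equivalent tree.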

If you’re interested in trying out decision jungles for your problem, the Gemini module in Azure ML will let you investigate further.

Overall, computer vision has a bright future thanks in no small part to machine learning. The rapid recent progress in vision has been fantastic, but we believe the future of computer vision research remains an exciting open book.

Jamie, Antonio and Sebastian