This blog post is authored by John Platt, Deputy Managing Director and Distinguished Scientist at Microsoft Research.
I just returned from the Neural Information Processing Systems (NIPS) 2014 conference, which was held this year in Montreal, Canada. NIPS is one of the two main machine learning (ML) conferences, the other being ICML.
NIPS has broad coverage of many ML sub-fields, including links to neuroscience (hence the name). I thought that the program chairs and committee created a conference which appealed to many different ML specialists – excellent job!
I want to share three exciting trends that I saw at NIPS this year:
Continued rapid progress in deep learning and neural networks
Making large-scale learning more practical
Research into constraints that arise in the real practice of ML
Deep Learning
Deep learning is the automatic construction of deep models from data. The models are called “deep” because they compute the desired function in multiple steps, rather than trying to solve the problem in one or two steps. Deep learning is typically accomplished using neural networks, which are models that build their functions from matrix multiplications and non-linearities.
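To make "matrix multiplication and non-linearities" concrete, here is a minimal sketch of a forward pass through a small network. The layer sizes, the ReLU non-linearity, and the helper names `init_layers` and `forward` are assumptions chosen for illustration, not anything specific presented at NIPS.

```python
import numpy as np

def relu(x):
    # Element-wise non-linearity (an illustrative choice; others work too).
    return np.maximum(x, 0.0)

def init_layers(sizes, seed=0):
    # One (weights, bias) pair per layer, with small random weights.
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    # Each hidden layer is one "step": a matrix multiply, then a non-linearity.
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W, b = layers[-1]
    return h @ W + b  # the last layer is left linear

layers = init_layers([4, 8, 8, 3])   # 4 inputs, two hidden layers, 3 outputs
out = forward(np.ones((2, 4)), layers)
print(out.shape)
```

Adding more entries to the size list makes the model deeper: the same function is then computed in more steps.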
Progress in deep learning since 2011 has been amazingly rapid. For example, on a benchmark of recognizing objects in images, the error rate has fallen by about 40% per year in relative terms. Deep learning has also become more broadly applicable than just classifying images.
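A relative decrease compounds: each year's error rate is 60% of the previous year's. The short sketch below works through that arithmetic. The 25% starting error rate is a hypothetical number picked only to make the calculation concrete; it is not a benchmark result.

```python
def error_after(initial_error, relative_drop, years):
    # Each year the error is multiplied by (1 - relative_drop).
    return initial_error * (1.0 - relative_drop) ** years

# Hypothetical starting point of 25% error, falling 40% (relative) per year.
for year in range(4):
    print(year, round(error_after(0.25, 0.40, year), 4))
```

Under these assumptions, three years of 40% relative improvement leave roughly a fifth of the original error (0.6 cubed is 0.216).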
One challenging problem in ML is the co-estimation of outputs that are strongly coupled. For example, when translating a sentence from one language to another, you don’t want to translate word-by-word. You have to think about the entire sentence you would produce.
Previously, when ML algorithms estimated coupled outputs, they would explicitly use inference, which can be slow at run time. Recently, there’s been some exciting work on having neural networks do the inference implicitly. At NIPS, Ilya Sutskever showed that you can use a deep LSTM model to do machine translation (MT) and perform almost as well as the state-of-the-art MT system. Ilya’s system is more general: it can map input sequences to output sequences. There was also other work at NIPS on coupling outputs across large amounts of space or time. For example, Jason Weston presented a workshop paper in which a neural network used a content-addressable memory to perform question answering. The “Neural Turing Machine” uses a similar idea.
Given the successes of deep learning, researchers are trying to understand why it works. Ba and Caruana had a NIPS paper which showed that, once a deep network is trained, a shallow network can learn the same function from the outputs of the deep network. The shallow network can’t learn the same function directly from the data. This indicates that deep learning could be an optimization/learning trick.
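The idea of training a shallow network on a deep network's outputs can be sketched in a few lines. This is a toy illustration, not the procedure from the paper: the "teacher" is a fixed random deep network standing in for a trained one, the student is a single-hidden-layer network fit by plain gradient descent on squared error, and all names and sizes are assumptions.

```python
import numpy as np

def make_teacher(rng, d_in=5, width=16, depth=3, d_out=2):
    # Stand-in for an already-trained deep network (weights are random here).
    sizes = [d_in] + [width] * depth + [d_out]
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]
    def teacher(X):
        h = X
        for W in weights[:-1]:
            h = np.tanh(h @ W)
        return h @ weights[-1]
    return teacher

def train_student(X, targets, hidden=32, steps=500, lr=0.05, seed=1):
    # Shallow network: one tanh hidden layer, trained to match the targets.
    rng = np.random.default_rng(seed)
    n, d_in = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, hidden))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, targets.shape[1]))
    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W1)
        err = H @ W2 - targets
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared error, up to a constant factor.
        grad_W2 = H.T @ err / n
        grad_W1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2)) / n
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return (W1, W2), losses

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
teacher = make_teacher(rng)
# The student never sees labels, only the teacher's outputs.
student, losses = train_student(X, teacher(X))
print(losses[0], losses[-1])
```

The point of the design is that the student's training targets are the teacher's real-valued outputs rather than the original labels.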
Many people (including us!) have used middle layers in deep neural networks as feature detectors in related tasks. There was a wonderful talk at NIPS, where the authors did a set of careful experiments that examined this pattern. They trained a deep network on one set of 500 visual categories, kept the first N layers, and then retrained on a different set of 500. They found that, if you use middle layers and retrain on top, you lose some accuracy due to the sub-optimality of training. They also found that, if you use the highest layers, you lose some accuracy due to the features being too specific. Fine-tuning everything recovers all of the lost accuracy. Very useful to know.
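The "keep the first N layers, retrain the rest" setup can be sketched as a small bookkeeping step. The function name `transfer_layers`, the `fine_tune` flag, and the representation of a network as a list of weight matrices are all assumptions made for illustration; a real framework would express the same thing through its own mechanism for freezing parameters.

```python
import numpy as np

def transfer_layers(pretrained, n_keep, rng, fine_tune=False):
    """Copy the first n_keep layers and re-initialise the rest.

    Returns the new layers and a parallel list of flags saying which
    layers should be updated when training on the new task.
    """
    layers, trainable = [], []
    for i, W in enumerate(pretrained):
        if i < n_keep:
            layers.append(W.copy())        # reuse learned features
            trainable.append(fine_tune)    # frozen unless fine-tuning
        else:
            layers.append(rng.normal(0.0, 0.01, size=W.shape))
            trainable.append(True)         # always retrained
    return layers, trainable

rng = np.random.default_rng(0)
pretrained = [rng.normal(size=(4, 4)) for _ in range(4)]
layers, trainable = transfer_layers(pretrained, n_keep=2, rng=rng)
print(trainable)
```

Passing `fine_tune=True` marks every layer as trainable, which corresponds to the "fine-tuning everything" setting described above.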
Large-Scale Learning
Large-scale training (of all sorts of models) has continued to be an interesting research vein. While not that many people have training sets above 1TB, the models that use data at that scale tend to be commercially very valuable.
Training in machine learning is a form of parameter optimization: an ML model can be viewed as having a set of knobs that are adjusted to make the model perform well on a training set. Large-scale training then becomes large-scale optimization. Yurii Nesterov, a famous optimization expert, gave an interesting invited talk about how to solve certain optimization problems that arise from ML in time that is logarithmic in the number of parameters.
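As a small illustration of "adjusting the knobs", the sketch below fits a linear model by gradient descent on squared error. This is the generic textbook method, not the technique from the invited talk, and the synthetic data, step size, and function name `fit_linear` are assumptions.

```python
import numpy as np

def fit_linear(X, y, lr=0.1, steps=200):
    # The "knobs" are the entries of w; each step nudges them downhill
    # on the mean squared error over the training set.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])   # synthetic "ground truth" knobs
y = X @ true_w
w = fit_linear(X, y)
print(np.round(w, 3))
```

With three parameters this is trivial; the large-scale versions of the problem have the same shape but vastly more parameters and data.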
When ML training is distributed across many computers, it is challenging to minimize the amount of communication between the computers. Training time is typically dominated by communication time.
One very nice NIPS talk described a method of performing distributed feature selection which only requires two phases of communicating models between all of the nodes. This looks promising.
Practice of Machine Learning
One quite positive trend I saw at NIPS was algorithmic and theoretical researchers examining issues that ML practitioners frequently encounter.
In the last few years, adversarial training has been a topic of research interest. In adversarial training, you don’t try to model the world as a probability distribution, but rather as an adversary who is trying to make your algorithm perform poorly. You then measure your performance relative to the best possible model that could be trained from the adversarial data, in hindsight.
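Measuring performance "relative to the best possible model in hindsight" is usually called regret. The sketch below computes it for the Hedge (multiplicative weights) algorithm choosing among a handful of experts. Hedge is a standard textbook algorithm used here as a stand-in, not a method from any particular NIPS paper; the random losses, the step size `eta`, and the function name are assumptions.

```python
import numpy as np

def hedge_regret(losses, eta=0.5):
    """losses: array of shape (T, K) with entries in [0, 1].

    Returns the algorithm's total expected loss minus the total loss of
    the single best expert chosen in hindsight.
    """
    T, K = losses.shape
    log_w = np.zeros(K)
    total = 0.0
    for t in range(T):
        # Play a distribution over experts before seeing this round's losses.
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        total += float(p @ losses[t])
        # Down-weight experts in proportion to the loss they just suffered.
        log_w -= eta * losses[t]
    best_in_hindsight = losses.sum(axis=0).min()
    return total - best_in_hindsight

rng = np.random.default_rng(0)
T, K, eta = 200, 5, 0.5
losses = rng.uniform(size=(T, K))
regret = hedge_regret(losses, eta)
bound = np.log(K) / eta + eta * T / 8.0   # standard worst-case Hedge bound
print(regret, bound)
```

The bound holds for any loss sequence in [0, 1], including one chosen by an adversary, which is what makes this kind of guarantee a worst-case one.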
A lot of the work in adversarial training has been quite interesting. At this NIPS, I saw some work that showed its practicality. It’s the nature of adversarial training to provide worst-case bounds. If you have an algorithm that is adapted to “easy data”, you normally lose the worst-case guarantees. A paper in the main conference showed that you can have your cake (perform well on easy data) and eat it too (get a worst-case guarantee). Drew Bagnell gave a clear talk at one of the Reinforcement Learning workshops that illustrated how adversarial learning is required in order to learn control policies in the real world (because you should treat your own mistaken decisions as an adversary).
There was a delightful workshop about Software Engineering for Machine Learning. Speakers from LinkedIn, Microsoft, Netflix, and Facebook talked about their experiences in putting ML into production. Some Google engineers produced a very trenchant paper about the technical debt incurred by putting ML into production. I highly recommend reading it if you are planning to put ML into production yourself.
Between the progress in deep and large-scale learning, and the theoretical focus on practical issues, I learned a lot at NIPS. I’ve gone every year that the conference has existed, and I’m looking forward to the next one.