Now Available on Azure ML – Criteo's 1TB Click Prediction Dataset

This post is by Misha Bilenko, Principal Researcher in Microsoft Azure Machine Learning.

Measurement is the bedrock of all science and engineering. Progress in the field of machine learning has traditionally been measured against well-known benchmarks such as the many datasets available in the UCI-ML repository, in the KDDCup and Kaggle contests and on ImageNet.

Today, we are delighted to announce the availability of the largest ML dataset ever publicly released. Produced by our friends at Criteo, it is now hosted by Microsoft Azure Machine Learning. This new benchmark allows us to compare the performance of supervised learning algorithms on a realistic dataset representing an industry-defining multi-billion dollar task – namely, advertisement click prediction.

The scale of the data that ML systems are expected to consume keeps growing steadily. In domains such as computational finance and online advertising, thousands of training points arrive every second. Developing ML algorithms and systems that can learn from large-scale data and deliver high-throughput predictions is a key challenge. Yet few public benchmarks exist that provide a realistic snapshot of a revenue-critical predictive problem.

An earlier competition by Criteo, as well as contests by Yandex, Tencent and Avazu, have all brought attention to variants of the same task: predicting the probability of a user clicking on an item (e.g. an advertisement or webpage link) based on attributes describing various properties of the context, the item and the user. However, these datasets tended to be small samples of realistic production datasets, often small enough to fit in the RAM of a high-end modern workstation.

By contrast, the newly available Criteo 1TB dataset provides over 4 billion examples with binary labels (click vs. no-click). There are 156 billion total (dense) feature values and over 800 million unique attribute values. While this record-breaking scale may seem formidable for classic ML algorithms, the emergence of cloud ML platforms makes it straightforward for every data scientist to train predictive models on such datasets from their laptop, using distributed learning techniques such as Learning with Counts (codenamed Dracula). We will present a detailed experiment with this technique on this dataset in a follow-up post.
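This post does not describe how Learning with Counts works internally. As a rough illustration only, the sketch below assumes the general idea behind count-based featurization: each high-cardinality attribute value is replaced by its per-label counts and a smoothed log-odds of a click. The function names, the `prior` smoothing parameter and the toy data are hypothetical and are not the Dracula API or the Criteo schema.

```python
import math
from collections import defaultdict


def build_count_table(rows, column):
    """Count no-clicks and clicks for each value of one categorical column."""
    # value -> [no_click_count, click_count]
    table = defaultdict(lambda: [0, 0])
    for attributes, label in rows:
        table[attributes[column]][label] += 1
    return table


def count_features(table, value, prior=1.0):
    """Return (no_clicks, clicks, smoothed log-odds) for one attribute value.

    `prior` is an assumed additive smoothing term; values never seen in
    training fall back to zero counts and a log-odds of 0.
    """
    no_clicks, clicks = table.get(value, (0, 0))
    log_odds = math.log((clicks + prior) / (no_clicks + prior))
    return no_clicks, clicks, log_odds


# Toy examples: (attributes, label), where label 1 = click and 0 = no-click.
rows = [
    ({"site": "news"}, 1),
    ({"site": "news"}, 0),
    ({"site": "news"}, 0),
    ({"site": "shop"}, 1),
]

table = build_count_table(rows, "site")
news_features = count_features(table, "news")
unseen_features = count_features(table, "blog")
```

Under this assumption, the appeal at Criteo's scale is that the count table can be accumulated independently on separate partitions of the data and then summed, and the resulting features are a handful of numbers per attribute rather than hundreds of millions of sparse indicator columns.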

We salute Criteo for giving our field a practical baseline that allows us to quantify progress in “big learning”. And we really look forward to all the new algorithms and systems that this dataset will motivate and inspire you to create.