Solve the Big Data Problems of the Future: Join Microsoft Research’s Naiad Project

Posted by Tara Grumm
Senior Manager, Worldwide Marketing and Operations

In this decade we will collect more scientific data than in all of prior human history. Soon it will be impossible to do any kind of analysis without powerful computational tools that can make sense of the flood of data coming from sources like satellites, internet-connected sensors, and massive computer simulations.

Naiad researchers (from left) Michael Isard, Derek Murray, and Frank McSherry at Silicon Valley TechFair, April 2014

Michael Isard, Derek Murray, and Frank McSherry of Microsoft Research are trying to address the data deluge with Naiad, an open-source, .NET-based platform for high-throughput, low-latency data analysis. It includes tools built atop Microsoft Azure to deliver interactive analyses of huge data sets.

We recently caught up with the team to get their perspectives on how Naiad can solve emerging big data problems and how interested programmers and analysts can get involved in this open source project.

Tell us about Naiad. What big data problem does it solve?

Over the past decade, general-purpose big data platforms like Hadoop have brought distributed computing into the mainstream. As people have become accustomed to processing their data in the cloud, they have become more ambitious, wanting to do things like graph analysis, machine learning, and real-time stream processing on their huge data sources.

Naiad is designed to solve this more challenging class of problems: it adds support for a few key primitives – maintaining state, executing loops, and reacting to incoming data – and provides high-performance infrastructure for running them in a scalable distributed system.
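To make those three primitives concrete, here is a minimal sketch in plain Python. It is not Naiad's API (Naiad is a .NET platform), and the names `IncrementalWordCount`, `on_batch`, and `run_to_fixed_point` are hypothetical, chosen only for illustration. The class keeps state between batches and reacts to each new batch as it arrives; the helper function runs a loop until the result stops changing.

```python
from collections import Counter


class IncrementalWordCount:
    """Hypothetical operator: keeps per-word counts and updates them per batch."""

    def __init__(self):
        # Maintained state: survives from one batch to the next.
        self.counts = Counter()

    def on_batch(self, words):
        """React to one incoming batch; return only the counts that changed."""
        changed = {}
        for word in words:
            self.counts[word] += 1
            changed[word] = self.counts[word]
        return changed


def run_to_fixed_point(step, value):
    """Execute a loop: apply `step` repeatedly until the value stops changing."""
    while True:
        next_value = step(value)
        if next_value == value:
            return value
        value = next_value


if __name__ == "__main__":
    wc = IncrementalWordCount()
    print(wc.on_batch(["naiad", "data", "naiad"]))
    print(wc.on_batch(["data"]))
    print(run_to_fixed_point(lambda x: x // 2, 10))
```

A single process like this only shows the shape of the primitives; the hard part a platform has to solve is running them at high performance across a scalable distributed system.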

The result is the best of both worlds. Naiad runs simple programs just as fast as existing general-purpose platforms, and complex programs as fast as specialized systems for graph analysis, machine learning, and stream processing. Moreover, as a general-purpose system, Naiad lets you compose these different applications together, enabling mashups (such as computing a graph algorithm over a real-time sliding window of a social media firehose) that weren’t possible before.
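As a rough illustration of the kind of mashup described above, the following Python sketch computes connected components over a sliding window of the most recent edges in a stream. It is a single-machine toy and not Naiad code; `SlidingWindowComponents`, `on_edge`, and the window size are assumptions made for the example. It also recomputes the components from scratch on every new edge, which a production system would want to avoid.

```python
from collections import deque


def connected_components(edges):
    """Label every node with the smallest node id in its component."""
    labels = {}
    for u, v in edges:
        labels.setdefault(u, u)
        labels.setdefault(v, v)
    # Loop until no label changes: repeated min-label propagation.
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
    return labels


class SlidingWindowComponents:
    """Hypothetical operator: components over the last `window_size` edges."""

    def __init__(self, window_size):
        # The deque silently drops the oldest edge once the window is full.
        self.window = deque(maxlen=window_size)

    def on_edge(self, u, v):
        """React to one new edge and return the current component labels."""
        self.window.append((u, v))
        return connected_components(list(self.window))


if __name__ == "__main__":
    stream = SlidingWindowComponents(window_size=2)
    for edge in [(1, 2), (2, 3), (4, 5)]:
        print(stream.on_edge(*edge))
```

Once the window slides past the edge `(1, 2)`, node 1 drops out of the result and nodes 2 and 3 form their own component, which is the behavior a sliding window is meant to capture.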

Who should use Naiad?

We’ve designed Naiad to be accessible to a variety of different users. You can get started right away with Naiad by writing programs using familiar declarative operators based on SQL and LINQ.

For power users, we’ve created low-level interfaces that make it possible to extend Naiad without sacrificing any performance. You can plug in optimized data structures and algorithms, and build new domain-specific languages on top of Naiad. For example, we wrote a graph processing layer on top of Naiad whose performance is comparable to (and often better than) that of specialized systems designed only to process graphs.

Why was it important that Naiad be an open source project?

Research is not only about discovering new things, but also about communicating these discoveries back to the community. While research papers are a great mechanism for delivering high-level ideas, they can’t capture all the fine detail that makes up a complex system like Naiad. Opening the source lets people look at the code, and play with it to see what happens when they change the parameters or tweak the algorithms.

In addition, we want people to use Naiad to solve their big data problems. Getting feedback from real users is a great way to tell what we should be doing next, and releasing the code under a license that makes people feel comfortable adopting it is an important part of that. We also designed Naiad to support a rich ecosystem of libraries layered over it, and making it open starts the ball rolling.

What has the reaction been like since the launch of Naiad?

We have had a great response from the research community. The work has been cited a lot, and we know of at least two university groups that have started building on the Naiad code base for their own research. We’re just getting started with customer engagement, and are working on proof-of-concept trials with a few users, both inside and outside Microsoft.

What did the Microsoft Azure platform bring to the project?

Microsoft Azure makes it easy to try out Naiad at scale. Azure HDInsight recently introduced support for YARN, which lets new distributed frameworks run on a Hadoop cluster. We added YARN support to Naiad, so now you can use it even if you don’t have a Windows cluster on premises.

Naiad is one of the first .NET-based big data frameworks. Microsoft Azure is a great platform for running .NET applications, thanks to the smooth integration with Visual Studio tools, ASP.NET for data visualization, Azure Storage for static data, and Azure Service Bus for streaming data feeds.

What would you like to see Naiad become?

We would love to see Naiad in broad use, as part of the standard toolkit for processing big data. Naiad should raise the bar for performance expectations, and at the same time lower the barrier to entry for new programmers. We also designed Naiad to be transparent enough that you can see its moving parts, so it would make a great system for teaching some of the principles of distributed, data-parallel computing.

Beyond Naiad, what’s your favorite technology innovation within Microsoft Research?

While there is a lot to choose from, right now we’re most excited about Roslyn, the .NET Compiler Platform. It’s going to be a great tool for analyzing the code written for a system like Naiad and applying interesting optimizations to the code that runs on the cluster. And we’re delighted that it’s open source!

To learn more, please visit Microsoft Research’s Naiad site, or download Naiad today from GitHub (source code) or NuGet.org (binary packages). Let us know what you think in the comments.