Over the past two years, we have seen growing interest from the scientific community in using public clouds for research. As part of the original Cloud Research Engagement Initiative in 2010, Microsoft partnered with funding agencies all over the world to make awards to more than 75 research teams for projects using Microsoft’s Windows Azure cloud. The research covers topics in computer science, biology, physics, chemistry, social science, geology, ecology, meteorology and drug discovery. More details about these projects can be found here.
From an informal survey of these projects, we learned that researchers value the concept of using an on-demand, scalable compute resource over acquiring, deploying and managing a dedicated resource. Ninety percent of these researchers were pleased with the return on investment (ROI) from using cloud services to build their applications and would use cloud resources again. Of course, this sample is biased. These researchers are, for the most part, leading-edge risk takers and early adopters.
The most enthusiastic responses to our survey came from research teams with limited access to large-scale computational resources. These were also the research teams with a community of users who had a pressing need for help from the cloud. For example, a research team at the Geography of Natural Disasters Laboratory at the University of the Aegean in Greece built a cloud application that can be used to simulate wildfire propagation. The end users of the service are primarily emergency responders, including the fire service, fire departments and civil protection agencies that must deal with wildfires on the island of Lesvos.
The idea of using the cloud to help broader communities extends to scientific disciplines. The vast majority of scientists don’t want to learn to program a cluster, a supercomputer or a cloud. They want to get on with their science.
The problem faced by these researchers results from the avalanche of data from instruments, digital records, surveys and the World Wide Web hitting every research discipline. We are witnessing the birth of the fourth paradigm of science, in which large-scale data analysis now stands as a peer of theory, experiment and simulation. Advanced visualization used to be the only way to spot the trends in massive amounts of scientific data. Now machine learning can recognize patterns in data collections that are far too large or complex to visualize. Advanced data analytics and machine learning are now being used widely in the commercial sector to understand how the economy works, how to make our online searches more relevant and how to help hospitals deal with high re-admission rates.
Science is brimming with opportunities for us to use these techniques on our exploding data collections. For example, techniques such as those developed by David Heckerman and his Microsoft Research team are being used to understand the genetic causes of diseases. They used 27,000 computing cores on Windows Azure for a genome-wide association study (GWAS) of a large population. While this is a remarkable computational achievement, one exciting outcome is that the results of the analysis are now freely available as a cloud service (Epistasis GWAS for 7 common diseases) on the Windows Azure Marketplace.
Many public scientific data collections are growing so fast that there is no way for an individual researcher to download them. And, as their personal data collections grow, researchers are putting pressure on overstressed university resources to house and maintain them. Funding agencies around the world are starting to insist that data from publicly funded research should be made available to the public, but it is not clear how we will financially sustain all these data collections.
Fortunately, high-quality data is very valuable, provided it is easy to access and analyze. Given a highly valuable research collection and user-friendly cloud analysis tools such as the Epistasis GWAS service, the research community may be able to sustain such a collection through modest subscriptions. A great example of this model is the Michigan Inter-university Consortium for Political and Social Research (ICPSR), a high-quality data collection run by an expert curation team that has existed for 50 years and is funded by a combination of grants and dues paid by hundreds of member institutions.
Working with Internet2, we have already begun discussions with research community members about how we can build cloud services for science. Our long-term goal is to create self-sustaining scientific research ecosystems of data and services based on community data and community-supported tools. Science is driven by a combination of trusted open-source tools and high-quality commercial software. We can make these widely available in the cloud as services. In addition to supporting the data and computation, we believe the cloud will give expert users a platform to create and market indispensable research services. In the cloud world, we have “Infrastructure as a Service”, “Platform as a Service” and “Software as a Service.” Why not “Research as a Service”?