Is High Performance Computing naturally Open Source (i.e., for tinkerers)?

by anandeep on June 18, 2008 11:17am

I have always been fascinated by clusters.  Some people picture desktops or workstations when they think of “working with computers”.  For me, working with computers always meant a large collection of computers in a back room somewhere.  And how cool if you could make all those computers collaborate with each other on cool things like genome mapping, movie special effects, simulations of car crashes or simulations of molecules being formed!

So you can imagine I jumped at the chance to work with the Windows High Performance Computing team.  This is the same team that builds Windows HPC Server 2008.

I think most of the people working on the team are from the “large collection of computers in a back room somewhere” school. It would be really different in the Mac software division, I assume!

I work with the Open Source Software Lab, and we are all things “Open Source” to the rest of the company.  The HPC Server team wanted us to make sure that their product played nice with Linux infrastructure and vice versa.  The usual suspects like AD, Samba, LDAP, CIFS, etc. were involved.  We had to make sure that these recurrent interoperability themes were addressed in the HPC environment.  I also got a chance to dig into ROCKS, OSCAR, MPI stacks, job schedulers and more.

This was a very rewarding experience, not only for the technology exposure I got but also for how pervasive knowledge of Open Source was within the team.  They were far ahead of the other product groups in this regard and “got” the Open Source ethos. In fact, prior to my interactions with them they had released an MPI stack based on Argonne National Lab’s open source MPI implementation.
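
Since that MPI stack comes up, here is a minimal MPI “hello world” in C, just a sketch of the portability point (Argonne’s implementation is the well-known MPICH): the same source should build against MPICH or an MPICH-derived stack, with only the build and launch commands differing from platform to platform.

    /* A minimal MPI "hello world" in C. The same source should compile
       against MPICH or an MPICH-derived stack; only the build and
       launch commands differ. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count   */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down cleanly     */
        return 0;
    }

On an MPICH-style stack this would typically be built with mpicc and launched with something like mpiexec -n 4; the Windows side has its own mpiexec launcher, but the application code itself does not change.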

The other reason was that a lot of their customers were relentlessly open source!  The conventional wisdom is that HPC applications and infrastructure require a lot of tinkering.  Of course, there are some applications like FEM and CFD that are well understood, but the general feeling was that complete control of, and access to, the underlying infrastructure is a must for getting the most performance out of a cluster.  And performance is the main thing in “High Performance Computing”.

HPC customers see Linux as providing that access, and there is a large base of Linux for HPC in academia, the national labs and other institutions that use large clusters for doing their thing.

But is this really true?

I think that HPC has gone through a typical evolution.  It starts with a few people who have a pressing need.  A cross-disciplinary team forms, builds software to do their job, and a community grows around it.  The community reaches critical mass and people start building tools to make it more convenient.  ROCKS is an example of this.  Great skill, knowledge and ability are needed to get the job done.

However, these skilled people then become overloaded.  The tools and the infrastructure they created become so popular that everyone, including people without the background that was assumed before, wants to use them for their own ends.  So the community responds and builds standardized, easy-to-use infrastructure pieces that start to fit seamlessly together.  Some control is lost, but ease of use is the primary focus.

The infrastructure for HPC has reached that stage (Rolls with ROCKS). Windows HPC Server 2008 is built for this ease of use too.

However, the applications have not reached the stage of ease of use.  They have to be coded with a lot of domain knowledge and have to be built from scratch to truly scale while running on clusters.  That means that application writers demand more control of the underlying infrastructure, and more access to it, than its users and maintainers want to give.
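
To make that concrete, here is a small sketch in C with MPI of the kind of decision an application writer makes by hand: how the work is split across processes and how partial results are combined.  The problem (a partial harmonic sum) and the size N are purely illustrative; the point is that the decomposition is coded explicitly rather than chosen by the infrastructure.

    /* A sketch, not production code: each process sums its own slice of a
       range and the partial results are combined with MPI_Reduce. N is an
       arbitrary illustrative size; the work split is the application
       writer's decision, which is the low-level control discussed above. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000L   /* total number of terms, chosen arbitrarily */

    int main(int argc, char *argv[])
    {
        int rank, size;
        long i, chunk, start, end;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Hand-written work split: rank r takes terms [start, end). */
        chunk = N / size;
        start = (long)rank * chunk;
        end   = (rank == size - 1) ? N : start + chunk;

        for (i = start; i < end; i++)
            local += 1.0 / (double)(i + 1);    /* partial harmonic sum */

        /* Combine the partial sums on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum of first %ld terms: %f\n", N, total);

        MPI_Finalize();
        return 0;
    }

Every choice here, from the block decomposition to the reduction, could be made differently for a different machine, which is exactly why application writers want to see all the way down the stack.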

I am going out on a limb and making a prediction here: soon end users will be able to specify applications instead of coding them, be it genome comparison or physics simulation.  This is similar to accountants discovering spreadsheets.  There will probably be a few different models for different types of applications, but that stage will come pretty quickly.

The infrastructure that runs these user-specified applications will be adaptive: it will take the specifications and automatically tune them for high performance on the cluster.

This is where the perception that you need control down to the lowest levels will become moot.  The best adaptive infrastructure will be the one that gets adopted.

Bold enough for you?