Windows HPC Server 2008

The software previously known as Compute Cluster Server v2 is now available as beta 1 on https://connect.microsoft.com.

HPC Server 2008 contains some pretty significant innovations. I'll summarize them here and then produce some more detailed entries on each of them. Also check out the videos about v2 on https://edge.technet.com, (mostly) courtesy of yours truly :-)

Noteworthy innovations in v2 for IT Pros:

1. It runs on Longhorn Server (oops, Windows Server 2008) only. There is no upgrade process. Wipe and replace.

2. It uses Windows Deployment Services (WDS), not RIS, which makes that "wipe and replace" much less painful :-) Multicast deployment is supported :-)

3. The administration console offers a one-stop shop for deployment, administration, diagnostics and reporting. I know that many of you will be thinking, "just like Ganglia!". Well, that's the general idea. It is not a full System Center, but it is a very functional and efficient way to manage an HPC cluster.

4. It will offer head node fail-over from beta 2 onwards, thus eliminating a worrying single point of failure. This feature uses Server 2008 fail-over clustering, so it requires Enterprise edition or better for the head node.

5. As a consequence of (4), we will support installing the head node on a SQL Server 2005 cluster. In fact, we include SQL Server 2005 Express with the product but also support installations on pre-existing SQL Server 2005 servers. You need not install the head node services on the SQL machine either.

6. We have devised a new networking API to run alongside Winsock Direct, called Network Direct. The idea is to enable verb-based interaction with low-latency networking hardware, thus shaving off another couple of microseconds of latency, much as happens with MVAPICH. In this release the only consumer of Network Direct is MS-MPI. We're working with OEMs to write Network Direct providers / drivers.

7. PowerShell scripting is used for administration of common operations, along with the old v1 commands. In fact, those still work perfectly because we have maintained 100% compatibility with the v1 COM API. Of course the v2 API exposes new functionality, but that deserves a post in itself.

8. The scheduler has been significantly enhanced for scalability and optimization. It deserves a post in itself, but here are some significant changes:

- ability to dynamically grow and shrink the pool of resources allocated to running jobs

- enforcing constraints on the basis of job templates, not just filters

- use of different units of allocation: core, CPU slot, node, depending on what your application needs

- biasing allocation algorithm towards memory or CPUs

- from beta 2 onwards, pre-emption of running tasks!!
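
To make the "units of allocation" point a bit more concrete, here is a toy Python sketch of how the same request size can translate into very different core counts depending on the unit. This is not the scheduler's code or API: the `Node` class, the `allocate` function and the greedy node order are all made up for illustration, every node is assumed to be idle, and I'm reading "CPU slot" as a processor socket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of an idle compute node (illustration only)."""
    name: str
    sockets: int
    cores_per_socket: int

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket


def allocate(nodes, unit, count):
    """Greedily reserve `count` units from idle nodes, in list order.

    `unit` is "core", "socket" or "node". Returns {node name: cores reserved}.
    This is an assumed, simplified policy, not the real allocation algorithm.
    """
    if unit not in ("core", "socket", "node"):
        raise ValueError(f"unknown unit: {unit!r}")
    grant = {}
    remaining = count
    for node in nodes:
        if remaining == 0:
            break
        if unit == "core":
            take = min(remaining, node.cores)
            cores = take
        elif unit == "socket":
            take = min(remaining, node.sockets)
            cores = take * node.cores_per_socket
        else:  # "node": the job gets the whole machine
            take = 1
            cores = node.cores
        grant[node.name] = cores
        remaining -= take
    if remaining:
        raise RuntimeError("not enough idle resources for this request")
    return grant


if __name__ == "__main__":
    cluster = [Node("n1", sockets=2, cores_per_socket=4),
               Node("n2", sockets=2, cores_per_socket=4)]
    for unit in ("core", "socket", "node"):
        print(unit, allocate(cluster, unit, 2))
```

Asking for 2 cores, 2 sockets or 2 nodes on this made-up cluster reserves 2, 8 and 16 cores respectively, which is why picking the unit that matches your application (say, whole nodes for memory-hungry jobs) matters.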

9. Last but not least by any means, we are working with partners to support clustered file systems. CXFS, Melio and StorNext FS come to mind, being available on Windows now (2008 support is in the works). More are coming.

Of course this is a beta product and the usual caveat applies: features you see may not make it into the final product, may not be fully functional, etc. Still, v2 is definitely worth a try, because of the great improvements.

Note that I have spoken just of those topics of interest to IT Pros. There is more in the works for developers, like the ability to interface with the cluster using WCF and schedule WCF services.

By the way, we are running a program called HPCPAL for those of you interested in trying it out. We offer help with design of either infrastructure or software architecture followed by 4 days on site in Redmond with us, working on your code. All we ask for is a reference. If you're interested, send a note to hpcpal@microsoft.com and we'll take it from there. 

I hope I've not forgotten anything important... I'm jet-lagged in Barcelona, where I'm speaking at TechEd IT Forum. If you're around and want to chat about CCS, drop me a note.

In the meantime, check out our website, where you'll also find whitepapers describing HPC Server 2008 in more detail.