Hi, I’m Doug and I design & run production Windows Compute Cluster Server clusters for the HPC team at Microsoft. I often get asked “what hardware configuration should I choose for my cluster?” The stock answer is: “it depends.” WCCS will run just fine on two desktop machines connected by an Ethernet mini-hub, but that may not be sufficient for your application’s performance needs. For customers with demanding applications and larger budgets, a hardware purchasing decision can be quite complex.
The HPC Test team bought a 256-node SDR-Infiniband-based cluster, named Rainier, one year ago. Rainier achieved a ranking of #116 on the November 2007 Top500 list at www.top500.org after having been in operation for less than 2 weeks. It is primarily used for testing new builds of the product as we work on the next version of Windows Compute Cluster Server. People are curious about the hardware configuration of the Rainier cluster and the decisions that go into making a sizeable cluster purchase.
At today's prices, a 256+ node cluster with high-speed network interconnects is likely to cost in excess of $1M when all hardware, cabling, facilities modifications, and installation labor are factored into the total cost.
The three primary and most difficult challenges in planning hardware for a large HPC cluster are:
1.) Where will you put it?
2.) How will you power it?
3.) How will you cool it?
Answering these three questions is a significant challenge for those of us who build HPC clusters in lab environments here at Microsoft and for customers I have talked to. Many modern datacenters are not built to the power and cooling specifications necessary to achieve the kinds of hardware densities that are available with today's server hardware. A per-rack power-consumption spec for the average datacenter constructed in the past 10 years is typically ~2-6kW. If you are building a new datacenter to support HPC workloads, a design target for power consumption per-42-U-rack of ~15 kW will support many of the higher-end blade-based configurations for some time to come.
How well you can power and cool the server and network hardware helps greatly to determine
a.) How densely you can populate the server racks and thus
b.) Which server form-factor is an option for deployment and thus
c.) What the most cost-effective physical network design will be
Power and cooling will dictate whether or not it is feasible to fully populate a 42-U rack with blade-based or traditional 1-U-server-based form factors.
Server blades typically offer
a.) the greatest density of CPU cores per-U
b.) streamlined management and monitoring (no need for KVM switches & KVM cables, for example)
c.) ease of replace-ability for individual components or servers
d.) Power consumption of 2-4 kW per fully-populated blade chassis.
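To make the power constraint concrete, here is a rough Python sketch that estimates how many blade chassis a rack can hold when limited by either its power budget or its physical space. The 2-4 kW per chassis and the 2-6 kW and 15 kW per-rack figures come from this post; the 7-U chassis height and the function name `chassis_per_rack` are assumptions made only for illustration, so substitute your vendor's numbers.

```python
import math

def chassis_per_rack(rack_power_kw, chassis_power_kw,
                     rack_units=42, chassis_units=7):
    """Estimate blade chassis per rack, limited by power or by space.

    chassis_units=7 is an assumed chassis height, not a figure from the post.
    """
    by_power = math.floor(rack_power_kw / chassis_power_kw)  # power-limited count
    by_space = rack_units // chassis_units                   # space-limited count
    return min(by_power, by_space)

# Compare an older datacenter (2-6 kW per rack) with a 15 kW design target,
# for chassis drawing 2 kW and 4 kW when fully populated.
for rack_kw in (2, 6, 15):
    for chassis_kw in (2, 4):
        count = chassis_per_rack(rack_kw, chassis_kw)
        print(f"{rack_kw:>2} kW rack, {chassis_kw} kW chassis: {count} chassis")
```

Under these assumptions, a 6 kW rack holds only one 4 kW chassis regardless of how much physical space is left, which is why the power and cooling questions come before the form-factor question.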
1-U form-factor systems typically offer:
a.) lower CPU core density per-U than blades
b.) more overhead in managing out-of-band configuration
c.) better cable management solutions
d.) Greater total-memory and local disk expansion per-server
e.) Greater expansion options (co-processors/GPU's, etc.)
With the ever-increasing server densities in hardware racks, weight is a fourth consideration that cannot be ignored. In some of our older lab facilities (housed on upper floors of standard office buildings), we are prevented from fully populating 42-U racks with blades or 1-U servers due to weight restrictions.
The HPC Test team settled on a Dell PowerEdge 1955 blade configuration for the Rainier cluster (See http://download.microsoft.com/download/4/7/8/478f369c-f530-4a1f-a9d8-2d219d42c297/Windows%20HPC%20Server%202008%20Top500%20Datasheet_11-07.pdf for details). The choice was driven by pricing and the availability of the then-new quad-core Xeon processors. We manage two other smaller “production” clusters: an HP cluster based on the 1-U DL145G2 platform, and an IBM 1350 cluster (purchased as a Linux cluster and converted to Windows two years ago). All 3 hardware vendors have blade and 1-U server offerings. Every vendor approaches system management slightly differently, so you will want to evaluate their management frameworks in the broader context of how they will fit within your existing server management infrastructure (for a complete list of Microsoft HPC partners visit: http://www.microsoft.com/windowsserver2003/ccs/partners/partnerlist.mspx).
Understanding application requirements before making a server platform decision is useful because it prevents over- or under-sizing the choice of CPU and memory. Unfortunately, when you do not know what the application requirements will be for the cluster, or when a cluster will be used for many different current and future applications, planning an optimal hardware configuration is a lot harder. For a general-purpose cluster supporting multiple applications, a baseline rule of thumb is to have no less than 2GB of RAM per CPU core. So for a dual-processor/quad-core compute node, you should expect to have a minimum of 16GB of RAM.
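The rule of thumb above is simple arithmetic, and a small Python helper makes it easy to apply to other node configurations. The function name `min_ram_gb` is hypothetical; the 2GB-per-core default is the baseline from this post, not a hard requirement for any particular application.

```python
def min_ram_gb(sockets, cores_per_socket, gb_per_core=2):
    """Baseline RAM per compute node using a GB-per-core rule of thumb."""
    return sockets * cores_per_socket * gb_per_core

# Dual-processor, quad-core node: 2 sockets * 4 cores * 2 GB per core.
print(min_ram_gb(2, 4))  # prints 16
```

If you know a particular application needs more memory per core, pass a larger `gb_per_core` rather than relying on the baseline.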
One of the problems we ran into with deployment of server blades in our lab datacenter facility was insufficient cooling. Without adequate cooling, the facilities owners were not willing to let our team fully populate a 42-U rack with blade chassis. Instead, we were allowed only to populate to a maximum of 80% of rack capacity. If we had opted for 1-U systems, we would likely have realized between 36 and 42 systems per rack. In our case, then, the total number of processors per rack was roughly the same regardless of platform choice.
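The comparison above can be sketched in a few lines of Python. The 80% fill limit and the 36-42 figure for 1-U systems come from this post; the chassis geometry (7 U per chassis, 10 blades per chassis) and the function name `blades_per_rack` are assumptions for illustration only, and the sketch treats "80% of rack capacity" as 80% of the rack units.

```python
def blades_per_rack(fill_fraction, rack_units=42,
                    chassis_units=7, blades_per_chassis=10):
    """Blades per rack when only a fraction of the rack may be populated.

    chassis_units and blades_per_chassis are assumed values, not from the post.
    """
    usable_units = int(rack_units * fill_fraction)   # whole rack units allowed
    whole_chassis = usable_units // chassis_units    # only full chassis fit
    return whole_chassis * blades_per_chassis

print(blades_per_rack(1.0))  # fully populated rack
print(blades_per_rack(0.8))  # rack capped at 80% of capacity
```

With these assumed numbers the capped rack holds 40 blades, which falls inside the 36-42 range quoted for 1-U systems. If both platforms use the same number of processors per server, the processor count per rack comes out roughly equal, as described above.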
One side-effect of spreading the servers out into a greater number of less-densely-populated racks was that it complicated the network design. Fewer racks == shorter cable runs to each rack from a network switching equipment location. Copper cables have stringent distance limitations, and different server platforms and network adapters support different maximum cable lengths. In the case of the Dell PowerEdge 1955 blades, each blade supported a maximum cable length of 5 m for copper Infiniband cables. I have two recommendations based on this experience:
1.) If at all possible, avoid the use of copper cable for high-speed networks in large clusters. Copper is fine for a standard Ethernet management/monitoring network within the cluster. But for dedicated high-speed application networks, fiber-optic cabling is easier to work with, label, identify, replace, etc. I was surprised just how heavy and unwieldy Infiniband cables become when there are in excess of 500 in a confined space. Fiber optic cabling can be expensive, but Infiniband cables are already expensive; the cost premium of fiber is worth it. At least one vendor now offers a fiber-optic cable with standard connectors for existing copper switches and host adapters.
2.) If possible, streamline the network design by using larger centralized switches. It is easier to centralize the switch design with a small number of densely-populated racks. Fewer switches means fewer possible points of failure and less hardware overall to manage. The Rainier cluster design, unfortunately, consists of a dedicated Infiniband switch per blade chassis. This design was a necessary compromise due to cost and cable length restrictions. One note of caution: avoid lower-cost technologies which make use of oversubscription at the switch level and (in the case of blades) at the port-concentrator level. A cluster that routinely has compute nodes competing for the same individual bandwidth is a cluster that will not have fully-utilized CPUs.
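To judge whether a switch or port-concentrator design is oversubscribed, compare the total bandwidth facing the compute nodes with the total uplink bandwidth. The following Python sketch shows that calculation; the function names, port counts, and the 10 Gb/s link speed are all hypothetical values chosen for illustration, not the Rainier configuration.

```python
def oversubscription_ratio(node_ports, uplink_ports,
                           node_gbps=10.0, uplink_gbps=10.0):
    """Ratio of node-facing bandwidth to uplink bandwidth.

    1.0 means non-blocking; anything above 1.0 means nodes can contend
    for uplink bandwidth when they all transmit at once.
    """
    return (node_ports * node_gbps) / (uplink_ports * uplink_gbps)

def worst_case_node_gbps(node_ports, uplink_ports,
                         node_gbps=10.0, uplink_gbps=10.0):
    """Per-node bandwidth off the switch when every node transmits at once."""
    ratio = oversubscription_ratio(node_ports, uplink_ports,
                                   node_gbps, uplink_gbps)
    return node_gbps / max(ratio, 1.0)  # never exceeds the node's own link speed

# Hypothetical example: 16 node ports sharing 8 uplinks of the same speed.
print(oversubscription_ratio(16, 8))   # 2.0, i.e. 2:1 oversubscribed
print(worst_case_node_gbps(16, 8))     # 5.0 Gb/s per node under full load
```

A ratio of 1.0 is the non-blocking case. The further above 1.0 a design sits, the more often communication-heavy jobs will leave CPUs waiting on the network.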
There are multiple options for high-speed network technology. Traditionally, the two choices in HPC have been Infiniband or Myrinet. We run production clusters here at Microsoft based on both technologies. Each has its own advantages and specific configuration settings, but they are both relatively easy to support on the Windows platform. It should be noted that major network vendors are also offering 10Gb-E solutions now, and network switch vendors are making great improvements in switch density to help drive down costs. My recommendation regardless of technology is to choose the vendor that you have the best relationship with, who provides the best hardware and driver support for your specific application and system needs, and who will work with you to spec your final design prior to purchase.
Hope this helps.