The Case for Blades in High-Performance Computing

By Scott Farrand, RLX Technologies

Abstract

A blade server is a complete computer system with multiple "blades" residing in one enclosure. Blade servers offer high efficiency because of their density, cable consolidation, ease of deployment, homogeneous software component revisions, utilization management, and ability to reconfigure workloads. This article describes the role of blade servers in cluster computing and highlights how blade server clusters support high-performance computing (HPC) workloads.

Not all clusters are created equal. While all clusters share a set of common characteristics, cluster evolution has produced significant variations on a basic idea. A basic cluster calls for a computing system with the following characteristics:

  • Multiple low-cost compute nodes, each typically a fully functioning computer with its own memory, CPU, and perhaps some storage. Often a high-volume, low-cost CPU is used, such as those made by Intel or AMD. Each node runs its own instance of a low-cost, highly efficient operating system, such as Linux or Windows.
  • Low-cost, high-bandwidth, low-latency interconnects (e.g., InfiniBand) connecting the cluster nodes
  • A permanent, high performance data store, often shared, for input/output repository and scratch cache
  • A distributed resource manager to arbitrate and schedule compute jobs (e.g. LSF, PBS, SGE)
  • Message Passing Interface (MPI) to parallelize application algorithms
  • Applications that embody parallel algorithms. Those algorithms execute across the cluster nodes.

Improvements to that basic pattern are often motivated by the desire to provide the highest possible throughput at the lowest price point. The proper balance between price and performance depends on a cluster's architecture as well as the type of workload placed on that cluster: a cluster architecture may yield excellent price/performance under certain workload types, but only average price/performance under other types of load. Optimized cluster architectures also aim at favorable price/performance scalability: increasing a cluster's workload should increase the cost of building and running that cluster as little as possible.

High-Performance Computing (HPC) Workloads

The requirements of high-performance computing (HPC) illustrate the architectural choices that affect a cluster's price-to-performance characteristics. HPC workloads address numerically intensive problems. HPC plays an important role in fields as diverse as life sciences, geological sciences, fluid mechanics, chemical research, weather forecasting, astronomy, cinematic production, finance, and digital media. In these fields, scientists and professionals face problems that require repetitive computation to arrive at modeling, simulation, and analysis solutions.

Executing numerical algorithms with the highest possible performance is the primary HPC goal. Scaling an HPC architecture means increasing the computational throughput to reduce the time it takes to execute numerical algorithms. To scale computational throughput beyond the capacity of a single CPU, many CPUs must work on a problem simultaneously. Thus, to achieve scale, software algorithms must be able to benefit from the work of multiple CPUs in parallel. The more efficiently a parallel software algorithm harnesses the power of multiple CPUs, the better it scales.

Many algorithms have been optimized for HPC over the years, including Monte Carlo simulation, root finding, the solution of linear systems of equations, interpolation, partial differential equations, finite differences, and finite element analysis. As parallel algorithms were implemented and increasingly optimized, an informal terminology emerged that captures the relative importance of bandwidth (for message passing), latency, and CPU load:

  • Embarrassingly parallel: Applications that need little or no inter-CPU communication to execute their algorithms. On a cluster, these applications typically run across the compute farm entirely without having to coordinate their progress among nodes through message passing.
  • Moderately parallel: These applications' algorithms do require some coordination between compute nodes to control their progress. However, if the communication frequency is modest, so are the network bandwidth and latency requirements. 100 Mb/s or 1 Gb/s Ethernet can sometimes satisfy these applications' message passing requirements. Overall application throughput may instead be bound by compute time or disk I/O, not network latency.
  • Highly parallel: These applications need a relatively large amount of coordination between processors. A highly parallel application cannot afford the performance penalty of communicating over Ethernet, with its generally high host CPU utilization, high latency, and relatively poor bandwidth. Specialized interconnects, such as Quadrics and Myrinet, address this problem. A newer entrant to the specialized interconnect market is InfiniBand. InfiniBand not only offers the low latency and high bandwidth desirable for messaging, it also relieves the host CPU of much of the burden of setting up and executing the communication. That frees the host CPU to run the application's calculations.
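
To make the first category concrete, the following is a minimal Python sketch of an embarrassingly parallel Monte Carlo estimate of pi. It is an illustration, not code from any particular cluster: the function names, the chunk count, and the sample sizes are assumptions chosen for the example. Each chunk depends only on its own seed, so the chunks could be dispatched to separate nodes by a resource manager; here they simply run one after another.

```python
import random


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between chunks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(chunks=8, samples_per_chunk=20_000):
    """Estimate pi from independent chunks of work (hypothetical sizes)."""
    # Each chunk needs only its seed and sample count, so the chunks could be
    # handed to separate nodes in any order. The built-in map() runs them serially.
    hits = list(map(count_hits, range(chunks), [samples_per_chunk] * chunks))
    # The only "communication" is this final reduction: one integer per chunk.
    return 4.0 * sum(hits) / (chunks * samples_per_chunk)


if __name__ == "__main__":
    print(f"pi is roughly {estimate_pi():.3f}")
```

A moderately or highly parallel algorithm differs in that the chunks would need to exchange intermediate results while running, which is where interconnect bandwidth and latency start to matter.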

Traditionally, advanced system architectures in support of scalable parallel algorithms were expensive to develop and build (and, consequently, expensive to purchase). Over the past decade, large-scale manufacturing and marketplace economics have allowed Intel to popularize powerful and comparatively low-cost CPUs with supporting chip sets for 1-way, 2-way and 4-way designs. Combining these low-cost CPUs in multiple standalone, "shared nothing" compute nodes, and then aggregating those nodes into clusters, allows computational power to scale at the most effective price points currently available. The key metric for these systems' cost effectiveness is the price-to-performance ratio of the entire cluster, not just the price/performance of a single node.

Up vs. Out: The Cost of Scaling

The factors that most influence a cluster's throughput can be summarized as a function of five terms:

Throughput = f(CPUs, E, IPC, Storage I/O, Jobs)

where:

  • CPUs = total number and speed of CPUs executing the parallel algorithm
  • E = efficiency of the parallel algorithms
  • IPC = efficiency of the inter-process communication between compute nodes
  • Storage I/O = frequency and size of input data reads and output data writes
  • Jobs = efficiency of the scheduling of the jobs across the cluster resources

Therefore, the better a cluster's total CPU capacity, IPC efficiency, storage I/O, and scheduling, the higher its total throughput, given the same parallel algorithm. To improve any of these factors, a cluster must scale up, scale out, or both. Scaling up increases the total throughput of each cluster node. Scaling out adds stand-alone, shared-nothing computer systems to increase total system throughput.

The first Beowulf clusters consisted of uniprocessor compute nodes. Such nodes represent the purest form of stand-alone, "shared nothing" design, in which the main memory of one node is not shared with other nodes. In recent years, dual-processor designs have become available at low cost, driving the adoption of dual-processor compute nodes. The chip sets for these nodes control memory access and handle interrupts in a way that allows symmetric multiprocessing (SMP) on the local node. In rarer instances, a compute node may be 4-way, 8-way, or greater.

As the number of processors in a unit increases, the per-node cost grows quickly, making 4- and 8-way nodes less attractive from a price-to-performance perspective. For example, approximate market pricing at the time of this writing for a 2-way 3 GHz Intel Xeon based server with 2 GB of memory ranges between $4K and $5K. A 4-way 2.5 GHz Intel Xeon with 4 GB of memory costs between $16K and $18K. An 8-way Intel Xeon with 8 GB of memory can cost $70K to $100K. As this pricing example shows, cost grows non-linearly with scale-up (multi-way or SMP) computers.

Scale-out compute clusters are collections of single stand-alone systems combined to reduce the time needed to solve large parallel compute problems. Currently, such compute complexes can execute large jobs with throughput in excess of 5 Tflops. Scale-out incurs a linear cost factor by definition (2x as many nodes cost 2x as much money). These cost characteristics favor scale-out compute clusters over SMPs. To maintain that cost advantage, however, scale-out clusters must provide efficient inter-node I/O, minimize the space taken up by the cluster nodes, and make it possible to manage a large number of nodes cost-effectively.

Blade servers emerged in response to these scale-out cluster needs. A blade server is a complete computer system with multiple "blades" residing in one enclosure. Enclosures on the market today hold as few as 6 and as many as 24 servers.
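
The pricing example above can be worked through as cost per CPU. The short Python sketch below uses only the approximate price ranges quoted in this article; the function names are made up for illustration, and the comparison deliberately ignores the differences in clock speed and memory between the configurations as well as interconnect costs, so it is a rough indication rather than a full price/performance model.

```python
# (CPUs per node, low price, high price) in US dollars, taken from the
# approximate market figures quoted in the article.
NODE_PRICES = {
    "2-way": (2, 4_000, 5_000),
    "4-way": (4, 16_000, 18_000),
    "8-way": (8, 70_000, 100_000),
}


def cost_per_cpu(cpus, low, high):
    """Return the (low, high) price range divided by the CPU count."""
    return low / cpus, high / cpus


def scale_out_cost(total_cpus, cpus_per_node=2, node_price=5_000):
    """Cost of reaching total_cpus with identical nodes (linear in node count)."""
    nodes = -(-total_cpus // cpus_per_node)  # ceiling division
    return nodes * node_price


if __name__ == "__main__":
    for name, (cpus, low, high) in NODE_PRICES.items():
        per_cpu_low, per_cpu_high = cost_per_cpu(cpus, low, high)
        print(f"{name}: ${per_cpu_low:,.0f} to ${per_cpu_high:,.0f} per CPU")
    print(f"8 CPUs from 2-way nodes: ${scale_out_cost(8):,}")
```

Under these figures, eight CPUs bought as four 2-way nodes cost at most $20,000, against $70,000 or more for a single 8-way node.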

What's in a U

As a cluster scales to incorporate an increasing number of nodes, those nodes are often mounted in racks. Rack mounting requires a uniform form factor for all cluster node enclosures. Traditionally, the industry converged on a form factor that sets a node's height at 1.75 inches, about the height of a pizza box. When measuring rack space, an enclosure with that height occupies one unit, or "1U." A typical server rack is 42U: it can accommodate 42 pizza box-sized servers.

Figure 1: Rack-mounted servers
While "U"s measure the amount of rack space taken up by a server enclosure, density measures the amount of computing power concentrated in 1U. Traditional "pizza box" enclosures offer 1 server per enclosure. In blade servers, densities range from 1 node per 1U to 8 nodes per 1U. Ultra-dense blade servers can accommodate a density of as many as 24 blades in a 3U enclosure. A server blade is inserted into a chassis that supplies redundant power, with redundant cooling fans amortized across all the blades in the chassis. The chassis are often offered with built-in switch technology that aggregates the cabling for simplified external connections. Blade configurations are usually homogeneous, and management software can be optimized for provisioning or loading standard cluster software stacks on the nodes.
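
The density figures above translate directly into nodes per rack. The following Python sketch does the arithmetic for a 42U rack; the function name is hypothetical, and the calculation assumes the whole rack is filled with identical enclosures, ignoring any space reserved for external switches or power distribution as well as power and cooling limits.

```python
RACK_UNITS = 42  # height of a typical server rack, per the article


def nodes_per_rack(nodes_per_enclosure, enclosure_units, rack_units=RACK_UNITS):
    """Number of nodes that fit when the rack holds only whole enclosures."""
    enclosures = rack_units // enclosure_units
    return enclosures * nodes_per_enclosure


if __name__ == "__main__":
    print("1U pizza boxes:", nodes_per_rack(1, 1))
    print("24 blades per 3U enclosure:", nodes_per_rack(24, 3))
```

With these assumptions, a rack of 1U servers holds 42 nodes, while fourteen 3U enclosures of 24 blades each hold 336.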
Figure 2: A blade chassis (front view)
Figure 3: A blade chassis (rear view)
Architecturally, server blades are similar to industry-standard servers with some improved functional characteristics. Market-leading blade servers possess state-of-the-art components: high-performing CPUs (e.g., dual Intel Xeon), memory controllers addressing up to 8 GB of memory, third-level caches ranging from 512 KB to 2 MB, modest on-board storage (because of limited board "real estate"), two or three Ethernet interfaces, mezzanine expansion cards for Fibre Channel storage and high-speed interconnects, and management-instrumented designs such as the Intelligent Platform Management Interface (IPMI).
Figure 4: The anatomy of a blade
Server blades are typically plugged into a chassis containing a mid-plane that supplies power to the blades and provides the conductors to route I/O out of the chassis. Typical I/O options include Gigabit Ethernet, Fibre Channel, and other high-speed, low-latency interconnects (e.g., InfiniBand). Often, the mid-plane I/O is routed to an on-board switch to eliminate much of the cabling, contributing to higher blade density. The chassis also supplies redundant cooling and power to the blades. That amortizes the power supply and cooling equipment costs across the blades in the chassis.

Power and thermal characteristics are critical design requirements for blade servers. Enough air must flow through the blades and peripherals in the chassis to cool the blades properly. Air flow requirements are supported by streamlined mechanical designs, controllable fans, and temperature monitoring of the critical devices.

Because of their unique design, server blades squeeze a lot of compute capacity into limited floor space. The close match between the requirements of compute clusters and the characteristics of blade servers has led to the rapid adoption of blades for compute clusters. Market analysis suggests that over 25% of all new servers sold will be blade servers by the year 2006.

Blade servers are ideal for embarrassingly parallel and moderately parallel HPC workloads. Blades with InfiniBand capability can tackle highly parallel algorithms because of InfiniBand's low CPU load and low latency. A homogeneous environment in which to run HPC jobs, and the ease of managing large cluster configurations, are two of the most important advantages blades provide for HPC tasks.

Homogeneity

Many HPC parallel algorithms run best on a cluster when all cluster nodes are homogeneous. Having nodes with significantly differing resources (e.g., CPUs or memory) prevents convenient capacity planning. Reallocating a variant node to a cluster can change the behavior of the cluster, and may cause HPC workloads to run in a sub-optimal fashion. The blade architecture helps maintain node homogeneity: it helps ensure that the CPUs, the amount of memory, the cluster interconnect, the permanent storage and I/O and, most important, the software remain the same for all cluster nodes.

In a blade cluster, each compute node runs its own full instance of an operating system. The critical role of the operating system in scheduling the execution of processes, managing permanent store I/O, managing multiple threads of execution, supporting inter-node messaging (MPI), and facilitating user access cannot be overstated.

A popular operating system choice for blades is Linux. The Linux kernel has undergone several advances in recent years that make it an efficient cluster compute node OS. Support for symmetric multiprocessing and a multithreaded TCP/IP networking stack, introduced in kernel version 2.4, optimized the Linux kernel for multi-way server platforms. Because the Linux source code is readily available, further customizations can be applied. For example, single-node process management has been extended to manage a single system image process across many compute nodes via the Linux utility bproc. The extensibility of Linux through loadable modules allows host access to instrumentation data on blade servers. Critical temperatures, power and fan speeds can be monitored at runtime on blades to ensure favorable operating characteristics.
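
One simple way to think about homogeneity is as a check that every node reports the same configuration. The Python sketch below flags nodes that deviate from the most common configuration in an inventory. It is only an illustration: the inventory format, the attribute names, the node names, and the values are all assumptions made for the example, and it does not represent any particular management product.

```python
from collections import Counter


def find_variant_nodes(inventory):
    """Return names of nodes whose configuration differs from the most common one.

    inventory maps a node name to a dict of attributes, for example CPU count,
    memory size and kernel version. Attribute values must be hashable.
    """
    # Turn each configuration into a hashable, order-independent signature.
    signatures = {name: tuple(sorted(cfg.items())) for name, cfg in inventory.items()}
    if not signatures:
        return []
    # Treat the most frequent signature as the cluster's baseline.
    baseline, _ = Counter(signatures.values()).most_common(1)[0]
    return sorted(name for name, sig in signatures.items() if sig != baseline)


if __name__ == "__main__":
    # Hypothetical inventory; names and values are made up for illustration.
    inventory = {
        "blade01": {"cpus": 2, "memory_gb": 2, "kernel": "2.4.20"},
        "blade02": {"cpus": 2, "memory_gb": 2, "kernel": "2.4.20"},
        "blade03": {"cpus": 2, "memory_gb": 1, "kernel": "2.4.20"},
    }
    print(find_variant_nodes(inventory))
```

In this made-up inventory, the node with less memory is the one reported as a variant.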

Managing Large Clusters

Clusters, by definition, are collections of compute nodes running algorithms distributed across the cluster node farm. While each node needs to be manageable on its own, the collection of nodes forms a larger entity. A view into the throughput, health, and status of that overall entity reduces diagnosis time and quickly reveals abnormalities that may hinder the timely completion of jobs. Visibility into the efficiency and performance of the cluster also supports better planning of the physical resource capacity required for maximum cluster throughput.

Figure 5: Blade server management software (image courtesy of RLX Technologies, Inc.)
Blade management software treats the blade cluster as a whole and includes visibility into and control of job schedulers. In addition, management software allows file systems to be mounted and partitioned en masse. Automated formatting and mounting shortens the time needed to provision storage and integrate it into a large farm of server blades.

Conclusion

Scale-out compute clusters constructed from commodity, off-the-shelf components currently offer the most advantageous platform for running high-performance computing workloads. Blade servers evolved from simple shared-nothing commodity clusters to offer high density, and to solve many of the problems inherent in scale-out clusters. Blades provide best-of-breed server technology, a homogeneous environment to execute jobs, and a wealth of software tools that ease the management of large clusters.

About the author

Scott Farrand is Vice President of Systems Engineering at RLX Technologies.

Resources

  • Beowulf Home Page
  • InfiniBand Trade Association
  • Myrinet cluster interconnects
