Cplant: The Largest Linux Cluster

Neil Pundit
Sandia National Laboratories

Faced with the possibility that supercomputer vendors may vanish for lack of profitability, an effort commenced at Sandia National Laboratories in 1997 to build a cluster entirely from commodity, mass-market parts. The cluster should be growable as well as prunable - much like a living plant. Additionally, we started to consider supercomputing as a utility similar to electric power or gas. With these goals in mind, we named the project Computational Plant, or Cplant for short, implying the flexibility and power of a living organism, as well as the utility of a power plant.

Based chiefly on price/performance characteristics, we selected DEC Alpha CPUs, Myrinet interconnects, and the Linux operating system. With those building blocks, we started constructing clusters in Albuquerque, New Mexico, and Livermore, California - Sandia's main locations. In the process, we were guided by our success with the Sandia/Intel ASCI Red TFLOPS machine in much of the design, architecture, and runtime system.

While ASCI Red proved one of the most successful massively parallel, high-performance computers ever built, its large massively parallel processor (MPP) design has several drawbacks. Most notable: The performance advance of commodity hardware components quickly outpaces that of custom-built hardware. MPP systems built from specialty hardware do not enjoy the lower price/performance benefits of mass-market components.

On the other hand, very large-scale systems require specialized knowledge and research. Volume vendors are neither the best organizations to provide that specialized knowledge, nor are they in a position to serve niche markets. Software applications requiring high compute performance will continue to grow in size, variety, and complexity. Large-scale systems must be able to adapt to those applications over their lifetime.

While clusters have firmly established themselves as the architecture choice for small- and medium-scale installations, few current cluster technologies support scaling to the level of compute performance, usability, and reliability afforded by large MPP systems. For a cluster to scale to thousands of nodes, it must address the following architecture challenges:

  • Limit - or entirely eliminate - the use of technologies that do not scale. For instance, tools like rsh (remote shell), or protocols such as NFS (network file system), have inherent scalability limitations.
  • Keep the complexity of maintaining a cluster from increasing as the cluster grows. Scalable cluster management and maintenance are critical.
  • Ensure system usability. Users should not be required to have detailed system configuration knowledge in order to use the cluster effectively.
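To make the first point concrete, the sketch below compares two ways of reaching every node in a cluster: a single host contacting nodes one at a time (as a loop over rsh would), and a tree-style fan-out in which every node already reached forwards the request. This is only a counting model under simplified assumptions; the function names and the fan-out parameter are illustrative and are not taken from CPSS.

```python
def serial_rounds(num_nodes: int) -> int:
    """Rounds needed when one host contacts a single node per round."""
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    return num_nodes


def tree_rounds(num_nodes: int, fanout: int = 1) -> int:
    """Rounds needed when every reached participant contacts `fanout`
    new nodes per round (hypothetical model, not the CPSS launcher)."""
    if num_nodes < 0 or fanout < 1:
        raise ValueError("num_nodes must be >= 0 and fanout >= 1")
    reached = 1  # the launching host itself
    rounds = 0
    # The launcher is not a target, so compare reached - 1 to num_nodes.
    while reached - 1 < num_nodes:
        reached += reached * fanout
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (16, 256, 2500):
        print(n, serial_rounds(n), tree_rounds(n, fanout=1))
```

In this model the serial approach grows linearly with the node count, while the fan-out grows logarithmically, which is the kind of behavior a cluster of thousands of nodes needs.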

Cplant addresses those design goals, and is centered around the following key concepts:

  • A large system is constructed from many independent building blocks
  • The independent building blocks can be partitioned by changing only a small number of system characteristics
  • System partitions interact only through a limited number of capabilities
  • The system is composed of conceptually separate components, each offering specialized functionality
  • The system dedicates significant resources to system monitoring and configuration.

We have built many clusters over the years on those foundations. The largest cluster today is in Albuquerque, consisting of over 2,500 Alpha nodes. It is configured such that the compute power of the central section, consisting of 1,500-plus nodes, can be switched to serve any of four subclusters. Each of those subclusters is dedicated to a distinct mode of operation: development, open, restricted, and classified (see Figure 1). The development subcluster has 128 nodes, and the other three each consist of 256 nodes. Each subcluster works independently with its own support system of administrative and I/O nodes.


Subclusters in Cplant

Figure 1.
Subclusters in Cplant. Compute nodes in the central section (purple) swing between red, black, and green.
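The switching arrangement can be modeled in a few lines. The sketch below is a toy bookkeeping model only: the class and method names are hypothetical, and the swing section is set to 1,500 nodes as a stand-in for the "1,500-plus" figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class Subcluster:
    name: str
    base_nodes: int  # nodes permanently attached to this subcluster


class SwingCluster:
    """Toy model: one central section serves one subcluster at a time."""

    def __init__(self, subclusters, swing_nodes):
        self.subclusters = {s.name: s for s in subclusters}
        self.swing_nodes = swing_nodes
        self.swing_owner = None  # central section starts unattached

    def swing_to(self, name):
        if name not in self.subclusters:
            raise KeyError(f"unknown subcluster: {name}")
        self.swing_owner = name

    def capacity(self, name):
        base = self.subclusters[name].base_nodes
        extra = self.swing_nodes if self.swing_owner == name else 0
        return base + extra


cplant = SwingCluster(
    [
        Subcluster("development", 128),
        Subcluster("open", 256),
        Subcluster("restricted", 256),
        Subcluster("classified", 256),
    ],
    swing_nodes=1500,  # assumed value; the article says "1,500-plus"
)
cplant.swing_to("classified")
print(cplant.capacity("classified"), cplant.capacity("open"))
```

Because the central section has exactly one owner at any moment, the other subclusters fall back to their permanently attached nodes whenever it swings away.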

Cplant system utilization today is in the range of 80% of total system capacity - a practical limit, since the system doesn't typically have just the right set of job requests on hand to fill the machine capacity. A variety of applications vie for their turn for scheduling as the system continues to be oversubscribed.
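A small worked example shows why a busy machine can still sit below full utilization. The sketch assumes a strict first-come, first-served policy and made-up job sizes; the article does not describe the actual Cplant scheduling policy, so both are illustrative assumptions.

```python
def fcfs_utilization(total_nodes, requests):
    """Start queued jobs in order until the next one does not fit,
    then report the fraction of nodes in use (hypothetical policy)."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be positive")
    used = 0
    for nodes in requests:
        if used + nodes > total_nodes:
            break  # strict FCFS: later jobs wait behind the blocked one
        used += nodes
    return used / total_nodes


if __name__ == "__main__":
    # 512 + 384 nodes start; the 256-node job does not fit in the rest.
    print(fcfs_utilization(1000, [512, 384, 256, 128]))
```

Here 104 nodes stay idle even though jobs are waiting, because none of the queued requests matches the remaining space.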

Cplant System Software (CPSS): An open-source release

After much debate within our group and with our sponsors (U.S. Department of Energy), we joined the open-source movement in the hope that subsequent modifications and wider use would enrich our software and serve the technical community at large. In June 2001, a complete package of Cplant System Software (CPSS) was released under the open-source GNU General Public License. That software allows a cluster to function as a general-purpose supercomputing resource. CPSS is downloadable from the Cplant project website. Over 1,200 downloads have been made to date.

Consisting of over 300,000 lines of source code, initially written for the Alpha CPU using Myrinet interconnects, CPSS has been ported to work with x86 and IA64 CPUs as well. In addition, efforts are under way to provide SMP and thread support, OS bypass, and support for the Quadrics interconnect. The software has been commercially licensed to Unlimited Scale Inc., which plans to introduce a supported commercial product based on it.

Overview of CPSS

CPSS is designed to provide a scalable, full-featured environment for cluster computing on commodity hardware components. It emphasizes scalability: CPSS provides a scalable message passing layer, scalable runtime utilities, and scalable debugging support. Cplant is targeted for installations configured as a System Support Hierarchy: Scalable units of workstations are arranged as leaves on a tree of system support nodes that provide file system and administrative support (see Figure 2).


Cplant System Support Hierarchy

Figure 2.
System support hierarchy in Cplant.
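The sketch below gives a rough feel for how such a hierarchy grows. It assumes one support node per scalable unit and a fixed fan-out between support levels; the unit size and fan-out used in the example are hypothetical values chosen for illustration, not Cplant's actual configuration.

```python
def support_levels(compute_nodes, unit_size, fanout):
    """Return the number of support nodes at each level of the tree,
    starting with the level directly above the scalable units."""
    if compute_nodes < 1 or unit_size < 1 or fanout < 2:
        raise ValueError("need compute_nodes >= 1, unit_size >= 1, fanout >= 2")
    # Integer ceiling division: one support node per (possibly partial) unit.
    levels = [-(-compute_nodes // unit_size)]
    while levels[-1] > 1:
        levels.append(-(-levels[-1] // fanout))
    return levels


if __name__ == "__main__":
    # Hypothetical sizing: 64-node units, 8 children per support node.
    print(support_levels(2048, unit_size=64, fanout=8))
```

Under these assumptions the number of support levels grows with the logarithm of the node count, so adding scalable units rarely adds depth to the tree.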

CPSS has been designed to support a functional partition model initiated by ASCI Red: Cluster nodes are divided into Service, Compute, and I/O partitions (see Figure 3). These partitions also provide the main organizational units for the CPSS code base.



Figure 3.
Conceptual partitioning of Cplant.

CPSS is distributed as source code that can be built for a specific processor architecture, and configured for a specific hardware installation. The source code consists of operating system code (in the form of Linux modules and drivers), application support libraries, compiler tools, a port of MPI (Message Passing Interface), integrated runtime utilities (node allocator, job launcher, process control thread, query tool, batch system), support for application debugging, and scripts for configuring and installing the software.

CPSS is coded mainly in ANSI C, and is built on a Linux system using the GNU C compiler. Support for parallel applications programming is provided for C, C++, and Fortran code using MPI. Application code can be built with GNU compilers, but in the case of an Alpha-based installation the more robust Tru64 and Compaq Alpha-Linux compilers are supported. Support for compiling and installing code is based largely on the make utility along with Perl and Bash scripts.

Installation support is provided in the context of the Cplant Virtual Machine (or VM). A Cplant VM is a logical partition of hardware components: a single hardware installation can run multiple independent virtual machines. To support that concept, the built code base is installed in a separate VM structure for each virtual machine. Code in a given VM is then configured to run on a specific hardware subset.
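One way to picture the VM concept is as a table that maps each virtual machine to its own install location and its own subset of nodes. The sketch below is an illustration only: the class, the field names, the paths, and the rule that VMs must not share nodes are assumptions made for the example, not the actual CPSS configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    install_root: str  # hypothetical per-VM install directory
    nodes: frozenset   # hardware subset this VM is configured for


def assign_nodes(vms):
    """Map each node to the VM that owns it; reject overlapping VMs."""
    owner = {}
    for vm in vms:
        for node in vm.nodes:
            if node in owner:
                raise ValueError(
                    f"node {node} claimed by both {owner[node]} and {vm.name}"
                )
            owner[node] = vm.name
    return owner


if __name__ == "__main__":
    vms = [
        VirtualMachine("production", "/cplant/vm/production", frozenset(range(0, 96))),
        VirtualMachine("test", "/cplant/vm/test", frozenset(range(96, 128))),
    ]
    owners = assign_nodes(vms)
    print(owners[0], owners[127])
```

Keeping a separate install root per VM mirrors the idea that the same built code base is installed once for each virtual machine and then configured for its own hardware subset.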

The CPSS code base is maintained with the CVS version management tool. The CVS repository for Cplant is organized largely according to Cplant's main functional partitions (service, compute, and I/O). In addition, there are directory trees for components that correspond to configuration, compilation, documentation, release management, regression testing, shell scripts, header files, tools, and support, among others.

The Cplant system today is not of industrial strength: it has not gone through a productization cycle. While we continue to fix bugs and introduce modifications, the system serves a wide range of applications in physics, chemistry, combustion, nuclear, structures, and bioinformatics research.

Work in Progress

Work toward enhancing Cplant consists of providing support for SMP and threads, as well as OS bypass. OS bypass lets data transfer take place between user space and the network card without cooperation or direction from the operating system. A research effort is under way to use the Quadrics interconnect under Portals 3. Quadrics is an attractive interconnect because it offers the possibility of running user-level threads directly on the network card's processor.

An important thrust is to make Cplant more robust: A major effort is to provide a layer of reliability in message passing with a reasonable overhead, particularly as the cluster gets larger. Modifications and bug fixes will be routinely incorporated into future open source releases.



Resources

The Cplant project page

Sandia/Intel ASCI Red

The TOP500 fastest supercomputers

J. Otto and Neil Pundit, Cplant System Software (CPSS): A Complete System Software Package for Clusters, Proceedings of the 2002 IEEE International Conference on Cluster Computing, Cluster2002, 2002

D. S. Greenberg, R. B. Brightwell, L. A. Fisk, A. B. Maccabe, and R. E. Riesen, A System Software Architecture for High-End Computing, Proceedings of SuperComputing 97, 1997

N. Boden, D. Cohen, R. E. Felderman, C. L. Seitz, and J. N. Seizovic, Myrinet: a gigabit-per-second local area network, IEEE Micro, 1995

Quadrics interconnect

Unlimited Scale, Inc.

R. B. Brightwell, T. B. Hudson, R. E. Riesen, and A. B. Maccabe, The Portals 3.0 Message Passing Interface, Sandia National Laboratories, Tech. Rep. SAND99-2959, 1999

W. Gropp, E. Lusk, N. Doss, and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Computing, 1996

G. R. Luecke, J. Yuan, and S. Spanoyannis, Performance and scalability of MPI on PC clusters

Portable Batch System (PBS)
