John-Paul Navarro
Narayan Desai
Remy Evard
Dan Nurmi
Argonne National Laboratory
Chiba City's primary mission is to be a scalability testbed (see Figure 1), built from open source components, for the high-performance computing and computer science communities. It includes 256 dual PIII computation user nodes, 32 visualization nodes, 10 storage and file server nodes, 4 login nodes, 12 management nodes, Fast and Gigabit commodity Ethernet, and a high-performance Myrinet interconnect for applications. As a scalability testbed, Chiba City is dedicated to the research, development, and testing of architectures, algorithms, software, and protocols that push the scalability boundary of clusters and the software that runs on them.
Although Chiba City was built and operates primarily using open source software, the goal is to support installation and operation of any open or closed source operating system on user nodes. To that end, the Chiba development team created a cluster administration toolkit, the City Toolkit, designed specifically to support unattended installation of arbitrary operating systems. Largely because of the testbed mission of Chiba City I, nodes need to be rebuilt and reconfigured frequently and quickly. Combining a large number of nodes and a high rebuild and reconfiguration rate makes scalable cluster management design critical.
Versions of many City Toolkit components are available from the Chiba City Web site. Although our goal is to release our software as open source, we have not yet spent significant time packaging and documenting it for public consumption.

Figure 1. Chiba City: The Argonne scalable cluster
Towns, mayors, and the president
Chiba City hardware is divided into blocks of physically contiguous nodes, or towns. A town consists of between 8 and 32 managed nodes, a management node called a mayor, an Ethernet network switch, serial concentrators connecting the nodes' serial ports to the mayor, and network-addressable remote power controllers.
Most Chiba City towns fit into two racks. A town contains all the hardware necessary to operate as an independent subcluster, thereby supporting management scalability. Figure 2 illustrates the self-contained design of a town, showing serial console connectivity of town nodes to the town mayor. Although each town has its own Ethernet network, all of Chiba City is a single flat IP space.

Figure 2. Design of a Chiba City town. The town nodes connect to the mayor via serial consoles
Towns are not merely a physical partitioning scheme: The concept defines how management software and services -- required to build, configure, and operate the nodes in a town -- work. An example of that physical and management link is the boot-and-build process for nodes. When a node boots, it sends its boot loader prompt (LILO or GRUB) over its serial port. The town mayor, which runs serial port monitoring software, detects that boot loader prompt and sends a response, telling the node either to boot or build. The build tools used to build a node and the software packages installed on that node are accessed from the mayor via NFS or HTTP/FTP.
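The boot-or-build decision described above can be sketched in a few lines. This is only an illustration, not the City Toolkit's code: the prompt patterns, the `build` and `boot` reply strings, and the `build_queue` set are all assumptions introduced for the example.

```python
import re

# Assumed prompt patterns. Real LILO and GRUB prompts depend on how the
# boot loader is configured, so treat these as placeholders.
BOOT_PROMPT = re.compile(r"(boot:|grub>)")


def respond_to_console(node, line, build_queue):
    """Return the reply to send to a node's serial console, or None.

    node        -- hypothetical node name, e.g. "ccn12"
    line        -- one line of text read from the node's serial console
    build_queue -- set of node names that are flagged for a rebuild
    """
    if not BOOT_PROMPT.search(line):
        return None  # not a boot loader prompt, nothing to send
    if node in build_queue:
        build_queue.discard(node)  # rebuild once, then boot normally
        return "build\n"
    return "boot\n"
```

A console monitor on the mayor would call a function like this for every line it reads from each serial port and write any non-empty reply back to that port.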
The mayor provides the following services:
- Relay or proxy access to the master cluster database
- Static DHCP service
- TFTP access to boot images
- Console monitoring, logging, and interactive access
- NFS root file system for the node build and debug environment
- NFS access to shared user software
- NFS, HTTP, or FTP access to software packages installed during a build
- NFS-based file relay service for copying files to or from the nodes before and after job runs (Chiba doesn't have a global general-purpose file system)
- Relay service for global management commands
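As an illustration of the static DHCP service in the list above, the following sketch generates one fixed host declaration per node. It assumes ISC dhcpd-style configuration syntax, which the article does not specify, and the host names, MAC addresses, and IP addresses in the usage note are invented.

```python
def dhcp_host_entries(nodes):
    """Build static DHCP host declarations (ISC dhcpd style, an assumption).

    nodes -- iterable of (hostname, mac_address, ip_address) tuples
    """
    blocks = []
    for hostname, mac, ip in sorted(nodes):
        blocks.append(
            f"host {hostname} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"}}"
        )
    return "\n".join(blocks)
```

For example, `dhcp_host_entries([("ccn1", "00:00:00:00:00:01", "10.0.0.1")])` yields a single `host ccn1 { ... }` block that a mayor could include in the configuration for the nodes of its town.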
Managing the town mayors, login machines, file servers, and a batch scheduler server is a master management server: the president. The president behaves exactly like a mayor to the nodes it manages. The president and all the nodes it manages form a logical town, although physically the hardware is distributed. That is, the machines in this town aren't isolated in their own set of contiguous racks like normal towns: the president manages machines physically located in many different racks.
The most significant difference between mayors and the president is that the president contains the master copy of the images (OS + configuration), layered software packages, and the master cluster configuration database. From the president, software is distributed to mayors with rsync. Access to the master configuration database on the president is available through proxies on each mayor.
We chose the three-level hierarchical design because we felt it could extend architecturally to more levels. With a ratio of 32 managed nodes per management server, one could build a 1,024-node cluster with the three levels currently in Chiba. Using the same ratio, one could build a cluster with four hierarchical levels consisting of over 32,000 nodes.
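The node counts above follow from raising the fan-out to the number of management tiers. A minimal sketch of that arithmetic, assuming a uniform fan-out of 32 at every level and counting only the leaf (user) nodes:

```python
def max_managed_nodes(levels, fanout=32):
    """Leaf nodes reachable in a hierarchy with the given number of levels.

    Level 1 is the president, so there are (levels - 1) fan-out steps
    between the president and the user nodes.
    """
    if levels < 2:
        raise ValueError("need at least a president and one level below it")
    return fanout ** (levels - 1)


print(max_managed_nodes(3))  # 32 * 32 = 1024
print(max_managed_nodes(4))  # 32 * 32 * 32 = 32768
```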
Scalability Successes in the Chiba Architecture
We found that many aspects of the Chiba City management architecture worked very well on our 314-node cluster. The following sections describe aspects of the architecture that we feel could apply to the management of much larger clusters.
Dedicated management servers (mayors)
An important aspect of our design is that management servers are not available to user applications. Although that design decision seems obvious to us, many people have questioned the added hardware cost. The separation is particularly important in our environment because most cluster applications expect dedicated access to nodes. In addition, the software requirements for the management services are often different from those of the application software. Our cluster build-and-configure software was designed to run on Linux machines: for example, our management software currently runs under Red Hat 7.1 but is capable of installing other versions of Red Hat, entirely different distributions such as Mandrake, and other operating systems such as FreeBSD.
A master management server (president)
Once dedicated management servers have been chosen, one has to determine how to distribute services among them. The Chiba City approach gives a single machine the role of master management server, or "president." The president is the authoritative source for all software and configuration information, the recipient of all status information, and the point from which all administrative functions are issued. We found that model to be straightforward and easy to use. From the president, we are able to build, configure, and update other management servers. Management operations issued on the president are forwarded to the mayors responsible for the target nodes. We believe that having a single canonical source for all software, from which all management commands can be issued, should scale to any size of cluster. For that scale-up to work, the management software must divide and delegate operations to subordinate management servers.
Using the master management server as home for the master configuration and state database has worked well from a conceptual point of view. (Some of the database performance issues will be discussed below.)
Rebuildable management nodes
Once we realized the need for the president, we were able to build procedures to automate rebuilding all other management servers. If done correctly, it's even possible to rebuild a management node and then transfer the president role to it -- in effect, upgrading the president.
Remote power control
Remote network-based power control is an essential component of hands-off administration. Without it, some administrative and operational activities would require people to walk up and down aisles of computer racks pushing buttons. That is impractical, not just because of the time and cost, but because people might push the wrong buttons.
Remote console
We found remote consoles to be useful tools for node identification, boot control, network failure diagnosis, and recovery. Nevertheless, other approaches make remote consoles nonessential on very large clusters. One approach would be to issue SNMP queries to the cluster switches to determine which nodes are attached to which switch ports and then to use switch ports instead of serial ports to identify hosts.
Parallel management algorithms
Early in the City Toolkit design we realized that, along with parallel hardware management servers independently managing a subset of nodes, we also needed parallel management software algorithms. For instance, we relied heavily on the very effective tool pdsh (Parallel Distributed Shell).
Another essential design point in scalable management software is delegation. By taking a centrally issued command, splitting it into parallel components, and delegating those to other management servers, one can scale arbitrary management operations to thousands of nodes and still execute them quickly. We see parallelizing management applications and commands as an essential technique for scalable cluster management.
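The split-and-delegate pattern can be sketched as follows. This is not the City Toolkit implementation: the `mayor_of` mapping and the `run_on_mayor` callable are hypothetical stand-ins for the cluster database lookup and for whatever transport actually carries a command to a mayor.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_mayor(nodes, mayor_of):
    """Split a target node list into one sublist per responsible mayor."""
    groups = defaultdict(list)
    for node in nodes:
        groups[mayor_of[node]].append(node)
    return dict(groups)


def delegate(command, nodes, mayor_of, run_on_mayor):
    """Issue one command centrally and fan it out to the mayors in parallel.

    run_on_mayor(mayor, command, members) is a hypothetical callable that
    asks one mayor to run the command on its own nodes and returns a result.
    """
    groups = group_by_mayor(nodes, mayor_of)
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {
            mayor: pool.submit(run_on_mayor, mayor, command, members)
            for mayor, members in groups.items()
        }
        return {mayor: future.result() for mayor, future in futures.items()}
```

The president does work proportional to the number of mayors rather than the number of nodes, and each mayor can in turn run its share in parallel.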
Scalability Problems with the Chiba Architecture
Based on our experience, we have identified several scalability problems with the current Chiba City architecture:
Single point of failure
The master management server (the president) is a single point of failure. When that server fails, it may be impossible to issue administrative commands or to perform any operations that depend on the master configuration services. One possible solution is to configure president services on two or more high-availability servers.
Management server hardware configuration
Finding the appropriate specs for management nodes is more important than we had originally thought. Care should be taken that appropriate amounts of RAM, CPUs, disk space, I/O bandwidth, and network bandwidth are available on management servers.
Hard dependence between user nodes and a single management node
The least scalable and most problematic aspect of the Chiba management architecture is the direct, hard-wired dependence between the nodes in a town and a specific management server (mayor). As a consequence, when all the nodes in a town are rebuilt, that town's mayor is heavily loaded while all the other mayors in Chiba City may be idle. This hard management link also affects node availability, since the failure of a mayor renders all the nodes managed by that mayor unusable.
Console access through management nodes
Console access tied to mayors meant that if a particular mayor became unavailable, all the nodes it managed could be unusable if those nodes depended on a management service (such as software mounted via NFS). Also, we could not rebuild or reconfigure those nodes, since only that failed management server could provide management services needed to rebuild a node.
All management servers providing most services
Putting all management services on every mayor created a difficulty for protocols that could otherwise be serviced by a single machine (for example, DHCP and the configuration database). To address that problem, we resorted to database proxies and multiple DHCP servers serving fixed subsets of machines, a design we found less than ideal.
Limitations of a hierarchy
Some management services are difficult to install and operate hierarchically (and don't require or benefit from such a configuration). Examples include NTP, SYSLOG, and DHCP.
The right management-server-to-managed-node ratio
Since each management service exhibits different scalability characteristics, it's not possible to pick a single "right" ratio of mayors to managed nodes. Any fixed ratio is therefore less than optimal for some services and leads to inefficient mayor utilization. For example, management services like HTTP and FTP are often used to serve hundreds of megabytes of software packages needed during a machine build. A small set of nodes building in parallel can swamp a single HTTP/FTP server. At the other end of the spectrum are management services like DHCP, SYSLOG, and NTP, which can handle hundreds or even thousands of individual client requests quickly. These two examples demonstrate that different management services scale differently, depending on factors that include the amount of processing required to service a given request, the rate at which clients make requests, and hardware limitations such as network bandwidth.
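A back-of-the-envelope estimate shows why bulk package delivery saturates a server long before lightweight request/response services do. The figures in the example call are hypothetical: 32 nodes building at once, 300 MB of packages per node, and a server limited by a 100 Mbit/s Fast Ethernet link. Protocol overhead and disk speed are ignored, so the result is only a lower bound.

```python
def seconds_to_serve(builders, package_mb, server_mbit_per_s):
    """Lower bound on the time for one server to feed all building nodes.

    builders          -- number of nodes building in parallel
    package_mb        -- megabytes of packages each node downloads
    server_mbit_per_s -- server network bandwidth in megabits per second
    """
    total_megabits = builders * package_mb * 8  # 8 bits per byte
    return total_megabits / server_mbit_per_s


# 32 nodes * 300 MB * 8 bits / 100 Mbit/s = 768 seconds, about 13 minutes
print(seconds_to_serve(32, 300, 100))
```

Under the same assumptions, halving the number of nodes per server halves the bound, which is one reason a single ratio cannot suit both HTTP/FTP and services such as NTP.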
Hierarchical configuration and software push
Pushing software and configuration changes down the three-level hierarchy is a time-consuming process. Having to deal with four or five levels could be a serious problem. Mayors have a cached copy of the software and configuration files used to build the nodes they manage. Since the authoritative copy of these files is on the president, whenever the authoritative copy changes we have to push the changes to the second layer of mayors before they can be used by clients. Instead of the 15-minute push we have today, a four- or five-level hierarchy would require a series of pushes potentially taking 30 to 45 minutes.
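The 30-to-45-minute figure follows if each additional management tier adds one more sequential push of roughly the same duration as today's. A small sketch of that estimate, where both the sequential ordering and the constant 15 minutes per push are simplifying assumptions:

```python
def total_push_minutes(levels, minutes_per_push=15):
    """Estimated time to propagate a change to the lowest management tier.

    Pushes happen only between management tiers, so a hierarchy with
    `levels` levels needs (levels - 2) sequential pushes; the final
    level consists of the managed nodes themselves.
    """
    if levels < 3:
        raise ValueError("need a president, at least one mayor tier, and nodes")
    return (levels - 2) * minutes_per_push


for levels in (3, 4, 5):
    print(levels, total_push_minutes(levels))  # 15, 30, and 45 minutes
```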
Another aspect of the problem is that our software and configuration repository is hundreds of megabytes in size and is expected to grow as the number of user node configurations we support grows. We push all of that software and configuration data to all mayors, regardless of whether the nodes served by each cache ever use those files.
Database scalability
Many node build and configuration operations require access to the central cluster configuration MySQL database. That database contains node identification and configuration information including information about user node access rules that vary depending on which user jobs are scheduled to run on which user nodes. As we've expanded the database's use, it has become a major bottleneck. Currently our primary problem appears to be slow database connection performance.
Future Work
The Chiba City I scalable management approach has taught us several lessons about what works well and what doesn't. Moving forward, we have started to design a new management architecture that combines the Chiba City I successes with ideas on how to address the problems.
For example, to address the problems associated with a multilevel hierarchy while taking advantage of the master configuration server concept, we are starting to explore a fixed three-layer architecture (see Figure 3).

Figure 3. New design for the Chiba City cluster
At the top level is the president: the master and authoritative configuration and state server. The middle tier contains a collection of management servers and devices that are directly responsible for providing the network services used to build the non-management nodes. The final tier contains all the managed nodes used by users and applications.
At first glance, the new design appears very similar to the current Chiba City I architecture. It differs, however, in the following respects: First, the new design has only three levels regardless of cluster size. Second, the middle tier can itself be thought of as a cluster of mayors dedicated to running management services/applications. That tier may include many different server configurations dedicated to specific management functions.
No hard ties exist between management nodes and user nodes. Instead, the new design uses either intelligent network hardware or dynamically reconfigurable protocols between managed and management nodes. An example of intelligent network hardware might be a load balancing router similar to those used by large Web server farms.
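One way to picture the absence of hard ties is a selection step that runs for every request instead of a fixed node-to-mayor mapping. The sketch below picks the least-loaded healthy server from a pool. It is purely illustrative: the `load` and `healthy` dictionaries stand in for whatever monitoring data a load-balancing router or a dynamic protocol would actually use.

```python
def pick_server(servers, load, healthy):
    """Choose a management server for one request.

    servers -- names of the servers offering the required service
    load    -- mapping of server name to current load (lower is better)
    healthy -- mapping of server name to True if the server is usable
    """
    candidates = [s for s in servers if healthy.get(s, False)]
    if not candidates:
        raise RuntimeError("no healthy management server available")
    # Break ties on the name so the choice is deterministic.
    return min(candidates, key=lambda s: (load.get(s, 0), s))
```

With this kind of selection, a failed server is simply skipped, and a burst of rebuilds in one group of racks spreads across the whole management tier.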
We also see the possibility of using parallel programming techniques, such as MPI, to write parallel management services. The number and type of nodes for each kind of management service would be flexible and based on individual protocol requirements: Some services may require a single machine for almost any size cluster (for example, NTP or DHCP), while other services may require multiple machines behaving as a load-balanced minicluster (for example, FTP or HTTP).
The new architecture, high-availability president services, intelligent packages, configuration data caching, and management cluster load distribution techniques are some of the elements that we hope will enable simple and efficient management of future clusters with tens of thousands or even hundreds of thousands of nodes.
Acknowledgments
This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.
Resources
Mathematics and Computer Science Division, Argonne National Laboratory
The Chiba City Web site
Chiba City Toolkit
Parallel Distributed Shell (PDSH)
rsync