James Kaufman
Toby Lehman
IBM Almaden Research Center
Just ten years ago, computational scientists were content to concentrate on either very small computer model problems or "regular size" problems with a very coarse resolution. For example, rather than work on a complex molecular chain, they may have had to confine their study to a single small molecule. Or, rather than simulate a full car crash test using the desired 1 cubic centimeter resolution, they may have had to simulate only a portion of a car at, say, a 10 cubic centimeter resolution. They simply did not have the compute power to work on larger problems or use a finer resolution.
Today the computing capabilities have grown considerably, but, interestingly, the collective appetite of the computational scientists has grown even more and at a faster rate. Now, it's not just studying molecular chains, but solving the protein-folding problem and calculating the structures of complex assemblies of macromolecules (see Resources). Or, it's not simulating a full car at a cubic centimeter resolution, it's simulating any full-size vehicle at the cubic millimeter or icrometer resolution. It appears that the demands on grid computing systems will outweigh their actual capacities for quite some time. Furthermore, the number of classes of problems will grow as well, as grid computing moves from the purely scientific to the commercial arena.
We have chosen cellular automata (CA) and finite element model (FEM) problems as our main focal point for the OptimalGrid project. Current CA or FEM packages are either specialized single node programs (like Algor, Altair, Ansys, Comsol, Cosmos/m, EDS, and others) or they are toolkits for building your own distributed application (like Globus). There have already been many surveys of computing grid systems and computing grid problems (see Resources), but none describe a system like OptimalGrid. There are no general-purpose packages available today that can take a described CA or FEM problem, partition it, then distribute it on a grid and optimally manage the running of the system dynamically. With the current trend of increasing appetites for larger scale FEM problem solutions, we predict that systems like OptimalGrid will become mainstream tools for computational scientists.
In this paper we describe the OptimalGrid system. In Section 2.0 we describe the complexities involved in parallelizing Finite Element problems. Furthermore, we show how existing grid systems do not meet the needs of FEM problems. In Section 3.0 we present the OptimalGrid architecture. In Section 4.0 we present our experimental results on the first implementation of OptimalGrid. In Section 5.0 we present our conclusions and describe the project's status.
The Challenge of Parallelizing Cellular Problems
All finite element model (FEM) problems are solved numerically by partitioning space into "small" finite elements where "small" is ideally defined by the smallest natural scale in a problem. Figure 1 shows a simple depiction of a solid (continuous) object being turned into a discrete set of nodes, each with specific properties. In practice the overall scale of a problem to be studied, along with the available compute resource, will limit the resolution of an FEM calculation (thus imposing a lower bound on the finite element size). Approximate solutions typically assert that at the length scales defined by the finite elements spatial gradients tend to zero (or a constant).

Figure 1:
Partitioning a finite element problem: A solid (continuous) object being turned into a discrete set of nodes.
To appreciate the power of this computational approach, consider the enormously complex problems of biochemistry. As we know, all life is cellular. Every living cell contains all of the instructions for its replication and operation. All of the functions of a multicellular organism are performed by transporting molecular messages, establishing concentration gradients across membranes, and by local physical (e.g., protein folding) and chemical (e.g., metabolic) reactions. At the algorithmic level, there are no new breakthroughs required to solve, in principle, problems as complex as understanding the chemistry of life. The solution to such a
problem requires significant advances in our understanding of protein chemistry, and a system that can efficiently partition FEM problems across a sufficiently large collection of computing resources. While conceptually easy to parallelize, the problem becomes difficult because of the need to manage communication between elements.
To understand the complexity introduced by requiring communication between adjacent cells in an FEM class problem it is useful to consider a simple example. We will introduce the OptimalGrid terminology and the structure used to represent FEM and CA problems. A FEM problem is represented by a discrete set of objects, which we refer to as original problem cells (OPCs). In Figure 2, an OPC is represented by the smallest unit in this 2 dimensional matrix.

Figure 2:
The smallest unit in the two-dimensional matrix represents an original problem cell (OPC), the basic object of a FEM problem.
A set of OPCs that might be sent to a compute agent for problem simulation is known as a variable problem partition (VPP) - it is variable because the number of OPCs varies depending on the compute capacity of the compute agent. An OPC has neighbors (in this example, just the four neighbors (North, South, East, West), but an OPC could have more neighbors if the problem
required it). The neighbors are typically used to compute the next state of an OPC. A VPP is self-contained, except for the OPCs on its edge - meaning, a compute agent can compute the state of all of its OPCs (the internal OPCs), but it must communicate the states of those OPCs on the edges to the compute agents holding the neighboring VPPs.
Figure 3 shows a simple problem with four VPPs. In this simple 2 dimensional problem, each VPP has only two neighbors, and thus has only two edges to communicate. VPP number 1 must communicate one OPC collection edge to VPP 2 and the other OPC collection edge to VPP 3.
The trick to running a problem at peak efficiency is to perform the internal VPP state iteration in parallel with the edge communication - having all of the VPPs swap their edges in roughly the same time that they compute the state of their inner cores. If any compute agents take too long to compute their internal VPP state or take too long to communicate their edges, then the overall problem distribution must be adjusted in order to balance the compute agents'
iteration and communication times.

Figure 3:
A FEM problem with four variable problem partitions. VPP number 1 communicates with neighbor VPPs
A simple demo
For demonstration purposes, we created a simple cellular automata problem - the "Game
of Life" based on a modified Eden Model for bacterial growth (see Resources). In this problem three entities representing three types of bacteria (A, B, C) are growing in a two dimensional space. We use Cartesian coordinates so the smallest "elements" or "cells" in this space are squares with sides of unit length. The data contained in any cell is simply the type of bacteria entity present. Each cell also contains the methods or rules that describe how the entities interact on site and how they spread or propagate to other sites. The entities contain information about to interact with other entities.
In this example, the interaction method causes A to eat B, B to eat C, and C to eat A. The propagation method describes how the bacteria spread. Propagation to adjacent sites or cells requires that each cell also store pointers to nearest neighbors. These pointers define a graph and together with the propagation method add the communication requirement to the problem. In the example, the propagate method causes, at every iteration, an entity at site (i,j) to spread to it's nearest neighbors at sites (i+1,j),(i-1,j),(i,j+1),(i,j-1) each with probability 0.5 (the classic Eden model for bacterial growth).

Figures 4a-4d (clockwise from top left):
Progression of bacterial growth
Figures 4a, 4b, 4c and 4d show the time progression of a single compute agent computing a complete instantiation of the "Game of Life". The three bacteria start in different locations (4a), then they start to grow and interact (4b), then their interaction begins to form a spiral pattern (4c) and finally this spiral pattern takes over the entire virtual petri dish (4d).
Now suppose that we wish to run a larger version of this problem that doesn't fit on a single computer. We would need to partition the problem space into sections, and those sections would have to communicate with each other to share information about the bacteria as it crosses the section boundaries. Figure 5 shows one problem iteration running on four compute agents. From this visual representation of the problem, it is clear that the outer edge of each VPP is needed by the neighboring VPPs in order to compute their next state.

Figure 5:
A problem divided over four compute agents
How would we solve this problem on a computing grid? First, the problem solving system infrastructure must provide for efficient communication between neighboring problem pieces. Furthermore, since the state of one part of the overall problem depends (transitively) on all adjacent parts, it is necessary to keep the parts (relatively) synchronized. If the compute agent responsible for one part of the problem were to fall behind, then the other compute agents could not proceed any faster than the slow one. From the literature on current Grid systems (see Resources), we see that current work
addresses issues in access control, security, messaging and file transfer. They do not address the issue of inter-communication problem pieces or managing the load of the individual nodes to balance the progress of all the compute agents. We created OptimalGrid to address those issues.
The OptimalGrid architecture
The Almaden OptimalGrid project is an autonomic (self-configuring) Grid Computing management infrastructure designed to solve interconnected problems, such as cellular automata or finite element model problems. It targets the compute landscape of the Internet, where the machines being used are not necessarily dedicated to just one task, nor are they necessarily in the same administrative domain. The overall system architecture is shown in Figure 6.

Figure 6:
The OptimalGrid architecture
Given an initial problem, such as the skull shown here, OptimalGrid will automatically partition the problem into OPC collections, based on the problem complexity. Then, those OPC collections are grouped into variable problem partitions, based on the user's budget (the more compute credits you have, the more machines you can use) and available compute resources.
In this particular example, the skull is composed of about 14,000 solid elements (OPCs), using a common 3-dimensional tetrahedral mesh. For a given problem complexity, we might turn that into 40 collections of 350 OPCs each. Then, depending on available compute resources, those 40 collections could be grouped into 1, 2, 10 or even 40 VPPs (and sent to an equal number of compute agents), although a more likely number would be in the 2 to 10 range.
OptimalGrid uses a standard schema for problem cell state, structure, and inter-cell interaction. With that information, the problem management software can assign the problem pieces to computing agents based on actual resource/performance metrics. It also reassigns (and even restructures) the problem pieces to balance the computing load.
As shown in Figure 6, the main coordinator component of the OptimalGrid system is the Autonomic Program Manager (APM). The APM does a number of things:
- It manages the compute agents and the pieces of the problem;
- It is responsible for invoking a problem builder (the component that creates the initial problem); and
- It assigns the initial distribution of the problem given all of the information maintained on problems, compute agents and the general computing environment.
The APM also invokes the various pluggable Rule Engines that track the events of the OptimalGrid system and store the lessons learned for optimizing the problem computation. Compute Agents (CA) perform the actual computation (they are the compute nodes) - they receive a VPP from the APM and they work on it.
The Micro-Payment Broker (MPB) tracks the compute cycles performed by each of the compute agents, and sends a payment into the compute agent's account for each successful compute sequence. The MPB uses a multi-part key that includes the original computation, the specific computation sequence, the VPP, the Compute Agent and a problem verification code - to ensure that the payments are not faked or stolen.
Finally, to bring everything together, the distributed communication middleware, TSpaces, provides flexible communication, event notification, transactions, database storage, queries and access control.
Autonomic problem restructuring
An important aspect of the inter-cell relationship of cellular automata computation is that the overall computation cannot progress if part of the computation is behind. Assume that a different Compute Agent (CA) processes each VPP. The efficiency of a compute agent depends on network bandwidth as well as processing power, memory, and perhaps storage performance. In the absence of communication, faster compute agents would process VPPs at a higher frequency.
However, because the VPPs (and therefore the compute agents) must communicate, the fastest compute agent can be at most N time cycles ahead of the slowest agent, where N is the number of neighbors separating the fastest and slowest agents. So, to compute the overall state of cellular automata as quickly as possible, it is necessary to keep the ratio of VPP complexity to computing resources capability as closely matched as possible for all problem pieces. That can be especially challenging, given that the idle compute nodes on a Grid can change status frequently (from idle, to partially busy, to very busy) and can disappear altogether.
Thus, the only way to keep the system nearly balanced is to track the progress of all of the compute nodes constantly and reapportion the problem as needed. This is the essence of the autonomic feature of the OptimalGrid system: constant monitoring of system measurements for all of the pieces, a mechanism to change parameters and reapportion the system, and a place to remember which situations require which changes.
Using OptimalGrid
To use OptimalGrid, a user must supply the code for several classes. These classes must include an OPC class and a Problem Builder class.
The OPC class defines the user's problem or problem space. We described OptimalGrid using Cellular Automata and Finite Element Problems as an example. However, the system is much more general. As long as a user's problem can be expressed as a graph where the nodes of a graph contain data, methods, and pointers to neighbors, the OptimalGrid middleware can handle the problem. The OPCs describe the data, methods, and pointers to neighbors that constitute the problem space. As well, an application developer may use multiple OPC classes (or just one).
The Problem Builder defines the initial problem, populates the OPCs with data if necessary, and divides the problem into pieces. SmartGrid provides several implementations of an abstract ProblemBuilder class. One application generates a generic problem using rectilinear coordinates in a space of any dimensionality. The user merely specifies the overall volume of space.
OPCs within the space may be (optionally) initialized using a very simple VppDataInitializer class. Alternatively, OptimalGrid provides a simple XmlProblemBuilder class and XML schema allowing a user to define and initialize virtually any problem space. System pararamters are set in a simple system configuration (text file) and or by command line options. A user-configurable problem configuration infrastructure makes it simple for an application developer to specify and access application configuration settings in a text-based problem configuration file.
Experimental results
We tested the performance and scalability of OptimalGrid using the modified Eden model (see Resources) as an archetypal cellular problem. We tested the system performance as a function of problem size and number of heterogeneous Compute Agents (up to 25 at the time of this paper - experiments on up to 70 machines are planned for the near future).
Since all OPC and VPP data are wrapped in messages that also contain the performance and requirements data, along with the complexity of the problem piece, evaluation of performance and system scaling are a natural part of system operation. For comparison, we studied the scaling behavior of the same biological model running on a single machine (all components running on one machine) up to full parallelism, with each major component running on a different machine and each of 1, 2, 4, 9, 16, and 25 compute agents running on a separate machine. The compute agents comprised a heterogeneous collection of computers, including various models of IBM ThinkPad T20s, IntelliStation Servers, and desktop machines of various sizes. We have tested the system with Compute Nodes running a heterogeneous collection of operating systems including Win9x, WinNT, Win2000, and Linux.
Scalability
For a single machine running an undivided problem, computation time grows with the square of linear problem size, or linearly with the number of OPCs. For very small problems, performance is limited by the overhead associated with, for example, rendering graphics. Performance data, as a function of problem size, comparing a computation on a single machine with a distributed OptimalGrid computation running on various numbers of machines, is depicted in Figures 7a, 7b.

Figure 7(a):
The cycle time (ms) as a function of problem size (number of OPCs) for a problem distributed on 1, 4, and 9 compute agents

Figure 7(b):
The cycle time per OPC (ms/opc) as a function of problem size (number of OPCs) for problem distributed on 1, 4, and 9 compute agents
For any problem size, with or without communication, there is always system overhead. Figure 7b shows that increasing a problem size amortizes the overhead over a larger and larger number of problem pieces so that the cost of adding of one additional problem piece (one additional OPC)
decreases to some asymptotic value as the problem size grows.
Distribution of a problem into multiple VPPs adds communication overhead that grows both as the number of pieces, as well as the size. Clearly, if a problem is very small, it is not beneficial to distribute it. This is evident in Figure 7b, where optimum performance for a very small problem (below 10,000 OPCs) is obtained by running on a single machine (the curve in red). As the problem grows it is increasingly advantageous to use multiple machines, if they are available. For figures 7a, 7b, and for the largest problem sizes run, we achieved a net speedup of 10x using 25 machines (though greater performance is achievable by additional tuning and by turning off the graphics display and other overhead). Of course, this problem size
(almost 1 million OPCs) would not have fit on one machine. For some large problems, the size, not the speedup, could be the primary reason to use OptimalGrid - the problem might not fit on any "regular" dedicated cluster.
Dynamic improvement
OptimalGrid embodies two main principles: first, to provide the grid middleware to run an FEM or CA problem from start to finish; and second, to optimize the apportionment of the problem over the computing landscape so that the computation of the problem remains balanced despite changes in the compute agents running the problem. We ran some initial experiments to demonstrate the advantages of dynamic
reapportionment.
Our initial experiments involved two simple load balancing procedures. Using the simple "Game of Life" cellular automata problem, we reoptimized the problem by one of two schemes every five program cycles (the graph and further explanation is omitted for space reasons, but they will soon be available on our project website,
http://www.almaden.ibm.com/software/dsm/OptimalGrid.
The first scheme was to simply move the harder problem pieces (VPPs) to the faster machines (sorting each problem piece by performance measure and complexity respectively). This gave us an immediate 40% gain from the initial random allocation in the first 5 cycles. Thereafter, performance increased linearly with a repartioning step as a local optimization. This is done by removing individual OpcCollection tiles from slow machines and adding them to nearby fast machines.
Memory and other costs
We have also measured the fixed incremental memory overhead of running a problem under the OptimalGrid system. The incremental memory cost, including all the performance metrics and communication components is only 1K/OPC. With a Java Virtual Machine limited to 128MB of memory, the largest VPP that could be run on a single Compute Agent contained about 60,000 OPCs (a VPP size of 240x240). Larger workstations could, of course, handle much larger problem pieces but this constituted a practical limit for a small desktop machine. However, since overall performance was optimal with a VPP size of about 140x140 OPCs, memory considerations would not limit the computation, since a larger problem would more be efficiently distributed across a larger number of compute agents.
Another cost associated with distributing a problem on a Grid is the cost of collecting the problem state data from the various compute agents. In order to autonomically re-optimize a problem, the VPPs (which collectively define the entire problem) must occasionally be redefined and redistributed to the active compute agents. Note that the read cost overhead reflects not only the time to read back problem pieces (VPPs), but also the time each CA spent waiting for the Autonomic Program manager to redefine and reassign the new VPPs to the active CAs. For both VPP read and VPP write operations, the communication overhead decreases as the problem is distributed over a greater number of CAs (into smaller VPPs). Depending upon the actual application, both of these costs could be amortized over many computation cycles. The cost of logging, reoptimizing and redistributing the VPPs contributes less than 5% of the total computation time.
Conclusion and project status
Grid computing is still in its infancy. There are numerous tools available for building various types of grid applications, but as yet, no one size fits all. Although problems solved by grid computing are often very sophisticated, the problem management software today is still quite primitive. Existing grid software manages the movement of problem pieces from machine to machine (or server to machine). However, it does not provide for sophisticated management of the problem pieces, does not account for correlations between problem pieces, does not provide a representation of problem piece requirements, and does not adapt the problem itself to dynamic changes in available computing resources.
We have demonstrated with the OptimalGrid system a middleware that effectively manages a complex distributed computation on a heterogeneous grid. Philosophically, OptimalGrid makes the assumption that grid middleware does not have full administrative control over the available compute nodes. As a result, the nodes may be subject to unknowable and unpredictable loads, changing costs and different usability requirements.
Given any problem that requires communication between compute nodes to proceed (any interconnected parallel problem), a single slow compute agent can slow down an entire computation. Similarly, a failed node can halt the computation altogether. On the Grid, recovery from a failed node amounts to the same problem as recovery from a node that has become so busy it is no longer useful. A slightly slower compute agent must be assigned a smaller or "easier" piece to solve. A failed of heavily loaded machine should simply be replaced. OptimalGrid is designed to deal with both situations.
We have described some OptimalGrid applications and described some initial experimental results. We are optimistic that we can build an autonomic grid system that removes most of the burden of solving cellular or FEM problems on a distributed system, and that our system will scale to hundreds of nodes. Clearly, those organizations able to afford dedicated parallel machines or large homogeneous clusters may not require a Grid system. However, many institutions are driven to find a way to use "free" or idle cycles in their organization or intranet. Furthermore, the growing appetite for solving increasingly large engineering or scientific simulations suggests the time is right to provide computing services via a grid utility model. OptimalGrid will allow users to specify a problem, and then run it on a network of machines, without having to write a single line of "parallel" or "distributed" code.
OtimalGrid also manages multiple computations, and gives users an operating console to monitor, graph, aggregate and update system parameters of their choice. Finally, cellular automata provide a fascinating archetypal problem for Grid Computing. The communication complexity involved in distributing a cellular or Finite Element problem can be thought of in terms of partitioning an abstract graph. The connections in the graph are defined by the pointers to neighbors contained in an OPC. Thought of in this way, the same system used to distribute or partition a cellular problem can be used to partition other problems, such as non-spatial flow-like problems (financial flows, business flows, web page flows, etc), provided the graph can be efficiently segmented.
We have completed Version 3 of our system, which includes a working APM, a sets of compute agents, a library of problem builders, a master console, a centralized communications component, a general-purpose event mechanism, a general-purpose application configuration mechanism, and the infrastructure to autonomically repartition and reapportion FEM and CA problems. We've run OptimalGrid on three different problems (applications), and on up to 30 machines. In the near future we expect to be running a batch of new problems on our new 70-machine Linux cluster (made from commodity parts).
Resource:
The OptimalGrid project home page
Paul Shread, Even Small Companies Can Benefit From Grid Computing