Skip navigation.
Home

Parallel Virtual Machine - Tuning PVM 3.4 for Large Clusters

Al Geist
Oak Ridge National Laboratory

PVM's programming model begins with the concept of a virtual machine. That machine is composed of one or more hosts that communicate with each other using standard TCP/IP network protocols. The hosts can be any heterogeneous collection of workstations, supercomputers, PCs, and even laptops. In this article we will assume the hosts are nodes in a large PC cluster. From the programmer's view, the virtual machine is a single large distributed-memory parallel computer. The PVM programming model allows an unlimited number of hosts in the virtual machine (although the actual implementation limit is 4,096 hosts, each of which can be multiprocessors with up to 2,048 nodes). No one has come close to reaching that host limit. The largest PVM
virtual machines to date are on 1,000-processor Beowulf clusters. The programming model allows hosts to be added and deleted dynamically from the virtual machine.

In the PVM model the hosts are assumed to have sufficient memory to run a given application. This means it is up to the user not to exceed the virtual memory execution limits on any host.

Processes - or threads - running on the virtual machine are called tasks. The model assumes that many tasks can run on each host and that any task can communicate with any other task. The PVM programming model supports a dynamic task pool where tasks can be added or killed inside a larger running application. The only design limitation on the number of tasks inside a virtual machine is the 32 bit task ID. This number is quite a bit higher than the practical limit on the number of tasks, which is limited by the available memory on the hosts.

The PVM model also supports dynamic groups: sets of tasks that logically work together. Communication between groups of tasks, such as broadcast, use the group model. Tasks can join and leave groups without synchronizing with any other group members. Using group names, it is possible for independently started PVM tasks to communicate with each other.

PVM and MPI (Message Passing Interface)

Since the release of the Message Passing Interface (MPI) specification, developers of parallel applications often wonder whether to use MPI or PVM. Each package has its strengths and weaknesses. MPI was designed to provide a standard message passing API for parallel computers. Since vendors are developing and supporting high performance MPI implementations for their systems, MPI can be assumed to have the best communication performance for applications that run predominantly within a single vendor's multiprocessor. On the other hand, on Beowulf clusters PVM and MPI have very comparable communication performance. Another strength of MPI is its rich message passing API. With 128 functions in MPI-1 and another 160 in MPI-2, application developers have a wide range of options in passing messages. MPI programs are portable: An MPI application should be able to run on another vendor's multiprocessor with little or no source changes.

PVM strengths are in cases where the application must run on a heterogeneous network of computers, or the application requires fault tolerance. MPI does not define any concept of a virtual machine - it is strictly a message passing API. In contrast, the PVM model supports the dynamic adding and deleting of both hosts and tasks, and supplies notification functions to determine when a fault occurs. The flexibility of the PVM model comes at the cost of peak communication performance. In the next section we describe ways to tune PVM applications to achieve higher performance.

Improving PVM application performance

PVM provides several options for sending messages, allowing the programmer to tune an application's communication. The standard three-step method to send a message in PVM involves (1) initializing a buffer, (2) packing data into the buffer, and (3) sending the data to the destination task.

The pvm_initsend() call sets packing options on a per-message basis. The options are: PvmDataDefault, PvmDataRaw, and PvmDataInplace. If the virtual machine contains a heterogeneous mix of data formats, then PvmDataDefault must be used. The default method converts all data into XDR format needed for heterogeneity, which slows down packing and unpacking of messages. Since the data formats between sender and receiver are the same on our large PC clusters, we want to turn off the default option. PvmDataRaw avoids the conversion overhead, but the raw method still has the cost of copying the data into the send buffer. The PvmDataInPlace option avoids that overhead: With PvmDataInPlace set, only pointers to the data are placed in the send buffer. At send time the data is copied directly from user memory to the network. There are three restrictions to using PvmDataInPlace: First, only contiguous blocks of data can be packed. Second, sender and receiver must have the same data format. Third, if data is changed after packing, the changed values are sent.

The high performance send options are set in the pvm_setopt() routine, which only needs to be called once in each task. PvmRouteDefault is the default setting. This method is scalable and should work in all possible virtual machines. In the default method messages are routed and buffered through the pvmds (PVM daemons running on each host).

Since we are assuming the virtual machine is a large PC cluster, we have several higher performance options we can select in PVM. Replacing PvmRouteDefault with PvmRouteDirect can result in a four-fold bandwidth increase. The direct method sets up a TCP socket directly between the two tasks bypassing the pvmd. In general, the direct method should be used. But be aware: setting up the TCP socket has a one-time very large latency. It only makes sense if many messages or large volumes of data are going to pass between two tasks.

To get performance equal to vendor's MPI implementations on MPP, applications should use pvm_psend()/pvm_precv() pair. psend combines the three sending steps into one, and matches very closely with vendors' native functions for moving data. For example, on the IBM SP pvm_psend() maps directly to mpi_send(). When run on a PC cluster, psend maps to PvmDataInPlace and pvm_send with PvmRouteDirect.

From an application design standpoint, consider the following two observations: First, limit the size and number of outstanding messages. These messages have to be buffered inside the PVM system and this increases overhead within PVM. Second, numerous tests have been conducted where the fast Ethernet cluster network was replaced with a much faster Myrinet network and shown to boost message bandwidth by 5X or more. But when full applications were run on this cluster they showed little or no improvement in total execution time despite the higher bandwidth. The reason: Execution time is much more sensitive to load balancing the application than improving the communication bandwidth. Put another way, much more time is spent waiting for another task to finish calculating the data than for the data to actually move between tasks. The bottom line: Work on improving the load balance of an application first, then spend time improving the communication performance.

Advanced Features in PVM 3.4

In the remaining part of this article we describe three new extensions to the PVM programming model that can significantly increase the design sophistication of cluster applications. These extensions are: (1) communication context, (2) message handlers, and (3) message box. PVM 3.4 supports all these extensions.

In a typical message passing system, messages are transitive, and the focus is on making their existence as brief as possible by decreasing latency and increasing bandwidth. There are a growing number of situations in today's distributed applications in which programming would be much easier if there was a way to have persistent messages. This is the purpose of the new Message Box feature in PVM 3.4. The Message Box is an internal tuple space in the virtual machine.

The four functions that make up the Message Box in PVM 3.4 are:


    index = pvm_putinfo( name, msgbuf, flag )
            pvm_recvinfo( name, index, flag )
            pvm_delinfo( name, index, flag )
            pvm_getmboxinfo( pattern, matching_names, info )

Tasks can use regular PVM pack routines to create an arbitrary message, and then use pvm_putinfo() to place that message into the Message Box with an associated name. Copies of the message can be retrieved by any PVM task that knows that name. If the name is unknown, or changes dynamically, then pvm_getmboxinfo() can be used to find the list of names active in the Message Box. The flag defines the properties of the stored message, such as, who is allowed to delete the message, does the name allow multiple instances of messages, can a put to the same name overwrite the message? The flag argument also allows extension of this interface as PVM 3.4 users give us feedback on how they use the
features of Message Box.

Here are a few of the many uses for the Message Box feature: A performance monitor can leave its findings in the Message Box for other tools to use. The PVM group server functionality has been implemented in the new message box functions. The Genome Integrated Supercomputer Toolkit (GIST) makes heavy use of PVM 3.4 Message Box to coordinate and track thousands of genome analysis jobs submitted daily to the Genome Channel. GIST creates the analysis pipeline and schedules the tasks on any of a number of large clusters at ORNL. GIST also requires PVM fault tolerance capabilities to ensure that every submitted job completes.

The ability to have persistent messages in a distributed computing environment opens up many new application possibilities in not only high performance computing but also collaborative technologies.

One of the main difficulties of writing libraries for message passing applications is that messages sent inside the application may get intercepted by the message passing calls inside the library. The same problem occurs when two applications want to cooperate - for example, a performance monitor and a scientific application, or an airframe stress application coupled with an aerodynamic flow application. Any time there are two or more programmers writing different parts of the overall message passing application there is the potential that a message will be inadvertently received by the wrong part of the application. The solution to this problem is communication context.

PVM 3.4 adds four new routines to manage communication contexts.


    new_context = pvm_newcontext()
    old_context = pvm_setcontext( new_context )
    info        = pvm_freecontext( context )
    context     = pvm_getcontext()

pvm_newcontext() returns a system-wide unique context tag generated by the local daemon (similar to how the local daemon generates system-wide unique task IDs). Since it is a local operation, pvm_newcontext() is very fast. The returned context can then be broadcast to all the tasks that are cooperating on this part of the application. Each of the tasks calls pvm_setcontext(), which switches the active context and returns the old context tag so that the old context can be restored at the end of the module by another call to pvm_setcontext(). pvm_freecontext() and pvm_getcontext() are used to free memory associated with a context tag and to get the value of the active context tag, respectively. Finally, spawned tasks in PVM 3.4 inherit the context of their parent.

PVM has always had message handlers internally, which were used for controlling the virtual machine. In PVM 3.4 the ability to define and delete message handlers has been raised up to the user level. The two new message handler functions are:


    mhid = pvm_addmhf( src, tag, context, *function );
           pvm_delmhf( mhid );

Once a message handler has been added by a task, whenever a message arrives at this task with the specified source, message tag, and communication context, the specified function executes. That function is passed the message ID so that it may unpack the message if desired. PVM 3.4 places no restrictions on the complexity of the function, which is free to make system calls or other PVM calls. A message handler ID is returned by the add routine, and that ID is then used in the delete message handler routine.

There is no limit on the number of handlers the user can set up, and handlers can be added and deleted dynamically by each application task independently.

By setting up message handlers, users can write programs that dynamically change the features of the underlying virtual machine. For example, message handlers could be added to implement active messages. Thereafter, an application could use this form of communication rather than the typical send/recv. Similar examples exist for almost every feature of the virtual machine.

The ability of an application to adapt features of the virtual machine to meet its present needs is a powerful new capability that hasn't been explored in a heterogeneous distributed computing environment. Work continues in this direction and in a follow-up project to PVM, the PVM team is creating a self-assembling, adapting virtual machine called Harness. But that is a topic for a future article.

Resource
Parallel Virtual Machine
Message Passing Interface (MPI)
MPI Forum

Designer Louis Replica