
User-defined Nonblocking Collectives Must Make Progress

June 11, 2012

Logically collective interfaces are a fantastic abstraction for libraries, allowing parallel concerns to be hidden and optimized behind robust interfaces that ensure the desired level of consistency and make deadlock easy to avoid.

Although some algorithms (e.g. parallel maximal independent set selection) are fully asynchronous and unnatural to implement using collectives, most are still naturally expressed using a collective API. Nonblocking collectives preserve these attractive properties while allowing latency hiding and overlap of communication with computation, a long-promised capability that is finally being delivered thanks to recent advances in hardware and network software. Any algorithmic component entailing more communication than computation can benefit from a nonblocking interface; the following is an example that will be available in MPI-3 [1]:


int MPI_Iallreduce(const void *sendbuf, void *recvbuf, int count,
                   MPI_Datatype datatype, MPI_Op op, MPI_Comm comm,
                   MPI_Request *request)

A critical feature of this interface is that the collective can make progress concurrently with other communication and, when supported by the hardware and/or runtime, concurrently with pure computation. For example, a reduction started using the interface above can execute concurrently with matrix multiplication and preconditioner application (which generally involve only local communication) in a "pipelined" GMRES algorithm that we recently implemented in PETSc, providing up to 3x speedup for strong scaling. MPI-3 adds nonblocking variants of all standard collective operations, but there is still no satisfactory mechanism by which other libraries can implement nonblocking collectives that make progress. I believe this limitation is responsible for the continued proliferation of synchronous interfaces, a trend that is harmful to performance at scale and stifles algorithmic innovation.
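
To make the overlap concrete, here is a minimal sketch of the pattern (not the actual PETSc implementation): the dot-product reduction is started, unrelated local work proceeds, and the result is waited on only when it is needed. The local_spmv() routine is a hypothetical placeholder for the application's operator application.

#include <mpi.h>

/* Minimal sketch of the overlap pattern described above; this is not the
 * PETSc implementation.  local_spmv() is a placeholder for the application's
 * (mostly local) operator application. */
static void local_spmv(void)
{
  /* placeholder: local matrix multiplication / preconditioner work */
}

double overlapped_dot(MPI_Comm comm, const double *x, const double *y, int n)
{
  double local = 0.0, global;
  MPI_Request req;

  for (int i = 0; i < n; i++) local += x[i] * y[i];  /* local part of the dot product */

  /* Start the reduction; it can progress while we do other work. */
  MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

  local_spmv();  /* overlap: apply the operator while the reduction is in flight */

  MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete only when the result is needed */
  return global;
}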

The critical inadequacy of current solutions is that there is no standard mechanism for making progress. Whenever the collective involves multiple rounds of communication, the user's nonblocking implementation is obliged to put the rounds in the Begin() routine, in the End() routine, or in a special function that the caller is required to invoke repeatedly to make each step of progress. This breaks abstractions because we would like a nonblocking collective from one library to make progress while control flow is in another library.
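
The following sketch illustrates that awkward pattern for a hypothetical multi-round collective: progress only happens inside MyColl_Progress(), so if control flow is in another library between Begin() and End(), the remaining rounds are serialized at End(). All names are illustrative, not an existing API.

#include <mpi.h>

/* The state machine for a multi-round user-defined collective is split across
 * Begin/Progress/End, and only calls to MyColl_Progress() move it forward. */
typedef struct {
  MPI_Comm    comm;
  int         round;    /* communication round currently in flight */
  int         nrounds;  /* total number of rounds */
  MPI_Request req;      /* request for the current round */
} MyCollState;

void MyColl_Begin(MyCollState *st, MPI_Comm comm, int nrounds)
{
  st->comm    = comm;
  st->round   = 0;
  st->nrounds = nrounds;
  st->req     = MPI_REQUEST_NULL;
  /* post the first round of nonblocking messages (omitted in this sketch) */
}

int MyColl_Progress(MyCollState *st)  /* the caller must remember to poll this */
{
  if (st->round < st->nrounds) {
    int done = 0;
    MPI_Test(&st->req, &done, MPI_STATUS_IGNORE);
    if (done) {
      st->round++;
      if (st->round < st->nrounds) {
        /* post the next round of nonblocking messages (omitted in this sketch) */
      }
    }
  }
  return st->round == st->nrounds;
}

void MyColl_End(MyCollState *st)
{
  /* If nobody polled while control flow was elsewhere, every remaining round
   * is serialized right here. */
  while (!MyColl_Progress(st)) {}
}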

Examples of user-defined collective operations

I consider three examples of user-defined collective operations for which adding special system support (e.g., in the MPI standard) would be unreasonable, although exposing these under a nonblocking interface would be useful.

TSQR: reductions with side effects

Communication-reducing (or "communication-avoiding") tall-skinny QR algorithms orthogonalize the columns of an m x n matrix A, with m >> n, partitioned by rows, at a latency cost essentially equivalent to one reduction instead of the n reductions required by conventional methods. The algorithm recursively orthogonalizes the local part, producing a small triangular matrix R and an orthogonal matrix Q, with R passed as input to the next level of the reduction. This reduction produces a global R and contributions to Q that are used to stably orthogonalize the local part of Q. The reduction could, in principle, be implemented using MPI_Iallreduce() with a user-defined reduction operation to produce the global R, followed by purely local work to reconstruct a global Q. Unfortunately, that would necessarily discard the intermediate contributions to Q because MPI makes no guarantees about where the user-defined reductions will be executed, causing the orthogonality of the resulting Q to be sensitive to rounding errors. Thus, although an implementation based on MPI_Iallreduce() could take advantage of optimized nonblocking collective implementations, it would be much less useful for purposes such as Krylov methods where orthogonality is crucial.
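
For illustration, here is a rough sketch (not the production TSQR code) of how the R-only part could be phrased as a user-defined reduction with MPI_Op_create() and MPI_Iallreduce(). The local QR kernel is a placeholder, and the Q factor it would compute has nowhere to go, which is precisely the loss of information described above.

#include <mpi.h>

#define N 4  /* number of columns; assumed small */

/* Placeholder: stack R_in on top of R_inout (2N-by-N), compute the QR
 * factorization of the result with a dense kernel, and overwrite R_inout with
 * the new R factor.  The Q factor computed here would be discarded because
 * MPI gives us no way to keep it. */
static void local_qr_of_stacked_R(const double *R_in, double *R_inout)
{
  (void)R_in; (void)R_inout;
}

static void tsqr_combine(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
  (void)len; (void)dtype;  /* len is 1 because R is packed into one derived type */
  local_qr_of_stacked_R((const double *)invec, (double *)inoutvec);
}

void reduce_R(MPI_Comm comm, double R_local[N*N], double R_global[N*N])
{
  MPI_Datatype rtype;
  MPI_Op       op;
  MPI_Request  req;

  /* Treat the whole triangular factor as a single element so that MPI cannot
   * split it when applying the user-defined operation. */
  MPI_Type_contiguous(N*N, MPI_DOUBLE, &rtype);
  MPI_Type_commit(&rtype);
  MPI_Op_create(tsqr_combine, 0 /* not commutative */, &op);

  MPI_Iallreduce(R_local, R_global, 1, rtype, op, comm, &req);
  /* ... unrelated work could overlap here ... */
  MPI_Wait(&req, MPI_STATUS_IGNORE);

  MPI_Op_free(&op);
  MPI_Type_free(&rtype);
}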

Unstructured communication setup

Setting up sparse unstructured communication networks requires multiple MPI calls to discover the number of neighbors, message sizes, and indices of data that need to be sent or received, sometimes with separate steps per topological dimension or ghosting layer. The overall cost of this step is usually strongly limited by communication latency, and there is often other useful work that could be performed while communication is being set up.
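
As a small example of why this phase is latency-bound, the sketch below shows a single setup round in which each rank learns how much data it must send to every other rank; real setups chain several such rounds (sizes, then indices, possibly per ghost layer), each paying a full network latency that could otherwise be hidden. The function name is illustrative.

#include <mpi.h>
#include <stdlib.h>

/* One latency-bound round of setup: each rank knows how many entries it needs
 * from every other rank (need_from) and discovers how many it must send
 * (send_to) with an all-to-all. */
void discover_send_counts(MPI_Comm comm, const int *need_from, int **send_to_out)
{
  int size;
  MPI_Comm_size(comm, &size);
  int *send_to = malloc((size_t)size * sizeof(int));

  /* A full round trip across the machine just to learn message sizes. */
  MPI_Alltoall(need_from, 1, MPI_INT, send_to, 1, MPI_INT, comm);

  *send_to_out = send_to;
  /* Subsequent rounds (exchanging the actual index lists, repeated per ghost
   * layer or topological dimension) each pay another latency cost that could
   * be hidden behind useful work if this were a nonblocking collective. */
}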

Fast multipole method

The fast multipole method is a hierarchical method consisting of independent local work as well as tree work involving multiple rounds of communication with relatively little computation. Since tree work has no dependence on results of local work, it could run concurrently. To execute these tasks concurrently, current implementations need to write special code to interrupt the local work in order to make progress on the tree work. Analysis algorithms that use the tree part of the computation independently from the local part would benefit from having the tree computation exposed as a nonblocking collective.
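
The sketch below (with placeholder functions, not any particular FMM code) shows the kind of manual interleaving this forces: the local work is chopped into chunks so the tree phase can be polled in between.

/* Placeholders standing in for an FMM implementation's tree and local phases. */
static void fmm_tree_begin(void)    { /* would post the first round of tree messages */ }
static int  fmm_tree_progress(void) { return 1; /* would advance the tree phase; 1 = done */ }
static void do_local_chunk(int c)   { (void)c; /* a slice of independent near-field work */ }

/* The local work is chopped into chunks so the tree phase can be polled in
 * between -- the manual interruption described above. */
void fmm_overlapped(int nchunks)
{
  int tree_done = 0;
  fmm_tree_begin();
  for (int c = 0; c < nchunks; c++) {
    do_local_chunk(c);
    if (!tree_done) tree_done = fmm_tree_progress();
  }
  while (!tree_done) tree_done = fmm_tree_progress();  /* drain the tree phase */
}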


How to ensure progress?

User-managed progress threads

An application or client can spawn a thread that polls the network and processes any messages that arrive. If multiple user-defined collectives are executing concurrently, they either need separate threads or must share an event-driven progress engine. Likewise, if multiple libraries want to drive progress, they either need their own threads or must use a common event-driven progress thread. Creating separate threads for each collective is essentially unworkable because of complicated system configuration and the need to control oversubscription, but using an event-driven system requires a standardized interface.
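
A minimal sketch of this approach, assuming MPI was initialized with MPI_THREAD_MULTIPLE: a dedicated thread repeatedly probes the communicator, which in most implementations pokes the progress engine. Coordinating several such threads across independent libraries is exactly the difficulty described above.

#include <mpi.h>
#include <pthread.h>

/* Assumes MPI was initialized with MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...). */
static volatile int keep_polling = 1;

static void *progress_main(void *arg)
{
  MPI_Comm comm = *(MPI_Comm *)arg;
  while (keep_polling) {
    int flag;
    /* In most implementations, probing pokes the progress engine and advances
     * any outstanding communication, including nonblocking collectives. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, MPI_STATUS_IGNORE);
  }
  return NULL;
}

void start_progress_thread(MPI_Comm *comm, pthread_t *tid)
{
  pthread_create(tid, NULL, progress_main, comm);
}

void stop_progress_thread(pthread_t tid)
{
  keep_polling = 0;
  pthread_join(tid, NULL);
}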

MPI generalized requests

MPI-2 added support for generalized requests, which enable the creator of the request to get a callback when the request is polled. Unlike native MPI nonblocking collectives, however, generalized requests are not advanced by the progress engine, so user code becomes responsible for polling to generate progress. In many cases, it is not practical for the user of a user-defined nonblocking collective to be responsible for generating progress; indeed, the user may not even know how many requests are in progress or have an API that gives access to them. Although user-managed threads may be used to poll generalized requests, this approach has all the downsides of user-managed progress threads discussed above. An extension to the generalized request interface [2] came tantalizingly close, providing a progress callback that is invoked when MPI_Test() is called on that request. This requirement of polling the specific request means that progress will not be made while program control is in a different library (or any place ignorant of the specific pending request), and it can lead to deadlock where MPI-3 nonblocking collectives would complete (e.g. if different ranks attempt to complete two pending collectives in a different order).
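
For reference, a minimal sketch of the MPI-2 generalized request interface; note that nothing in it causes MPI to call back into the operation, so the creator must arrange for progress and eventually call MPI_Grequest_complete() on its own.

#include <mpi.h>

/* The three callbacks required by MPI_Grequest_start(); they only answer
 * status queries and clean up -- MPI never calls back to advance the work. */
static int query_fn(void *extra_state, MPI_Status *status)
{
  (void)extra_state;
  MPI_Status_set_elements(status, MPI_BYTE, 0);
  MPI_Status_set_cancelled(status, 0);
  status->MPI_SOURCE = MPI_UNDEFINED;
  status->MPI_TAG    = MPI_UNDEFINED;
  return MPI_SUCCESS;
}
static int free_fn(void *extra_state)                 { (void)extra_state; return MPI_SUCCESS; }
static int cancel_fn(void *extra_state, int complete) { (void)extra_state; (void)complete; return MPI_SUCCESS; }

void start_user_collective(MPI_Request *request)
{
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, request);
  /* ... post the first round of communication for the user-defined collective ... */
  /* Someone must drive the remaining rounds and finally call
   * MPI_Grequest_complete(*request); nothing in MPI will do it for us. */
}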

Common event-driven interfaces

A common event-driven interface would eliminate the problem of library contention and ensure that user-defined nonblocking collectives can have equivalent semantics to native collectives. Such interfaces are partially available in existing implementations of native nonblocking collectives, thus providing a starting point for adding extensibility and eventual standardization. Note that an event-driven interface could allow callbacks to be executed by a separate thread, so the event-driven interface does not rule out the use of progress threads by an implementation.
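
Purely as an illustration of what such an interface might look like (none of these functions exist in MPI or any standard), a library could register a callback to be invoked by the progress engine whenever a matching message arrives:

#include <mpi.h>

/* Hypothetical interface -- these names are for illustration only. */
typedef void (*MPIX_Progress_callback)(void *ctx, int source, int tag,
                                       const void *msg, int len);

/* Register cb to be invoked by the progress engine whenever a message with
 * the given tag arrives on comm, no matter which library currently holds
 * control flow; unregister when the user-defined collective completes. */
int MPIX_Progress_register(MPI_Comm comm, int tag,
                           MPIX_Progress_callback cb, void *ctx);
int MPIX_Progress_unregister(MPI_Comm comm, int tag);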

First-class user-defined nonblocking collectives deserve to be a part of any exascale programming model. The sooner they are available, the sooner we can begin paying off the technical debt acquired in their absence.

[1] MPIX_Iallreduce() and other nonblocking collectives approved for MPI-3 are already available in MPICH2-1.5.

[2] The extended generalized request interface (which is implemented in MPICH2).

 

About the Author

Jed Brown is a postdoc at Argonne National Laboratory, where he works on parallel solvers and discretization schemes for problems involving partial differential equations. He has been a developer of PETSc since 2008 and works with various applications including ice sheet modeling, mantle and lithosphere dynamics, MHD, and shape optimization for engines. Jed received a Dr. Sc. from ETH Zurich in 2011, where his work focused on robust discretization schemes and parallel solvers for ice sheet dynamics. He received an MS in Mathematics from the University of Alaska Fairbanks in 2006, where he was the initial author of the Parallel Ice Sheet Model (PISM). In 2004, he received a BS in Physics and a BS in Mathematics, also from the University of Alaska Fairbanks.