Hybrid Message Passing in Shared-Memory Clusters∗

Vincent W. Freeh† Jin Xu‡ David K. Lowenthal§

Abstract An increasingly popular choice for high-performance machines is clusters of SMPs. Such clusters are popular in department-wide as well as national laboratory settings. On clusters of SMPs, there are two different memory models: shared memory, within a node, and distributed memory, between nodes. It is more difficult to write an efficient program using two memory models, i.e., combining OpenMP and MPI, than one. Therefore, several systems provide a hybrid message passing system, in which an IPC mechanism is used for local messaging and a network protocol (i.e., TCP) is used for remote messaging. This allows message passing programs to run much more efficiently on shared-memory clusters. The dual to this hybrid message passing program is a shared-variable program supported by a distributed shared memory. In theory, the shared-variable program is easier to create; however, in practice such a program is intolerably slow. While clearly reducing the overhead of messaging, a hybrid message passing system does not take full advantage of the shared memory model. This paper describes the Optimized Hybrid Messaging (OHM) package, which allows for highly efficient local messaging in a distributed memory program. Our prototype implementation of OHM is written in Java. It reduces message passing overhead between processes on the same shared-memory node, and it has performance comparable to a multi-threaded version.

1 Introduction

High-speed networking and low cost make distributed-memory clusters very attractive high-performance platforms. Furthermore, clusters have proven to be more scalable than shared-memory multiprocessors. However, small-scale shared-memory parallelism is quite usable and cost effective. Consequently, networks of shared-memory multiprocessors, with up to eight processors per node, are emerging as the dominant architecture for high-performance computers. This is because, in terms of most raw performance metrics, shared-memory clusters are a highly efficient design point. However, this architecture is problematic in terms of software and programming cost, because it has a hybrid memory model consisting of both shared and distributed memory. Compared to other architectures,

∗This research was supported by NSF grants CCR-9876073 and CCR-9733063 and a grant from DARPA HPCS.
†Department of Computer Science, North Carolina State University, Raleigh, NC 27965, [email protected]
‡Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, [email protected]
§Department of Computer Science, University of Georgia, Athens, GA 30602, [email protected]

a hybrid architecture presents additional challenges for developing efficient programs as well as tuning and porting them.

There are three basic programming models for a hybrid machine. The first is shared memory across all processors and nodes with software distributed shared memory (SDSM). The second uses message passing between processors and nodes, and the third uses message passing between nodes and shared memory within nodes. None of these three is ideal; SDSMs are inefficient between nodes and message passing is inefficient within a node. Although hybrid systems can be efficient, the hybrid model requires a user to use two different programming models.

This paper addresses the challenge of efficiently programming shared-memory clusters through the Optimized Hybrid Messaging (OHM) system, which provides several optimized local messaging mechanisms that decrease the cost of message passing. It also presents results in which the performance of OHM programs is nearly equivalent to that of multi-threaded programs on a single SM node.

Many high-performance messaging systems provide optimizations for SM clusters. For example, the major MPI implementations use efficient IPC for local messaging instead of a network protocol (which demonstrates the importance of SM clusters and the utility of efficient local messaging). These implementations provide a shortcut for messages that are destined for another process on the same SM node. However, this is a partial solution because, while it increases efficiency, it does not take full advantage of the opportunities available for message passing using shared memory.

Our prototype implementation of OHM uses Java primarily because of the ease of prototyping within a Java virtual machine. There are two major components. The first is a message passing class that is implemented as a “drop-in” replacement for Java’s native1 messaging methods. The second component provides and manages shared memory buffers that are used to hold message headers, queues, and data.

At first blush, OHM can be viewed as a multiplexer between local and remote messaging methods. If the source and destination are located on the same machine, then OHM uses specialized, highly efficient local message passing. But when the source and destination are located on different machines, then OHM uses the native messaging methods. In this respect, OHM is similar to systems such as LAM MPI. OHM distinguishes itself from these systems by providing several message passing mechanisms chosen for efficiency.

The next section discusses general hybrid message passing. Section 3 introduces a prototype OHM system created as a Java class. Section 4 shows that our OHM prototype is significantly faster than other message passing mechanisms and that it provides excellent SM performance. Then the paper presents related work and conclusions.

2 Hybrid Message Passing

In a hybrid computer, there are two types of messages. A remote message is between processes on different nodes, and a local message is between processes on the same node. Typically, a remote mechanism also supports local communication, but not efficiently. For example, a straightforward implementation of message passing in a hybrid computer uses TCP/IP for both local and remote messages. While it is sufficient, this approach incurs unnecessary overhead for local message passing. In contrast, a hybrid message passing system uses separate mechanisms for local and remote

1The native package is java.net, which supports communication with TCP sockets.

messages. A simple hybrid message passing mechanism uses TCP/IP for remote messages and interprocess communication (i.e., AF_UNIX sockets on Unix machines) for local messages. It is easy to determine whether two communicating processes are local or remote when a connection is being established. Therefore, the proper methods can be dynamically bound to send and receive methods when the connection is created. Thus, the mechanism that establishes a connection between two processes serves as a clearing house for specific, and efficient, implementations of message passing. As a result, the inherent overhead of hybrid messaging is negligible. This simple form of hybrid message passing is supported in some MPI implementations.
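As a concrete illustration, the following Java sketch shows what such connection-time binding might look like; the Channel, LocalChannel, and RemoteChannel names and the same-host test are illustrative assumptions, not part of any particular MPI or OHM interface.

    import java.net.InetAddress;

    // Illustrative sketch of connection-time dispatch in a hybrid messaging
    // layer: the local/remote decision is made once, when the channel is
    // created, so the per-message send/receive paths carry no extra test.
    interface Channel {
        void send(byte[] data);
        byte[] receive();
    }

    final class HybridConnector {
        static Channel connect(InetAddress peer, int port) throws Exception {
            // Simplified same-node test; a real system would be more robust.
            boolean sameNode = peer.equals(InetAddress.getLocalHost());
            return sameNode ? new LocalChannel(port) : new RemoteChannel(peer, port);
        }
    }

    // Placeholder implementations: a real LocalChannel would be backed by
    // shared-memory queues and buffers, and RemoteChannel by TCP sockets.
    class LocalChannel implements Channel {
        LocalChannel(int port) { /* attach to the shared-memory queue pair */ }
        public void send(byte[] data) { /* enqueue a descriptor in shared memory */ }
        public byte[] receive() { return null; /* dequeue and return message data */ }
    }

    class RemoteChannel implements Channel {
        RemoteChannel(InetAddress peer, int port) { /* open a TCP connection */ }
        public void send(byte[] data) { /* write to the socket */ }
        public byte[] receive() { return null; /* read from the socket */ }
    }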

2.1 Shared Memory Message Passing

Messages are typically passed using copy semantics [LH89a]. In copy semantics, the send operation copies message data from an application buffer to an intermediate buffer. Similarly, the receive operation copies data from an intermediate buffer to an application buffer. Because source, destination, and intermediate buffers are all independent, copy semantics provides the simplest message passing interface and semantics, but not the most efficient.

Other message passing semantics exist, but are not as widespread and are more difficult to use. For example, MPI provides the Isend and Irecv interface, which can be more efficient than copy semantics. However, significant effort may be necessary to transform a program from copy semantics to this more efficient form. In particular, because message data is shared (not copied), the programmer must explicitly verify that a buffer is no longer shared before modifying it. In MPI, this involves the insertion of Wait statements before the appropriate subsequent uses of message data.

Simple hybrid message passing uses generic IPC. While this is more efficient than using TCP/IP, it still has significant overhead. There are two primary sources of overhead: (i) crossing the user-kernel boundary and (ii) copying data. This overhead is a consequence of using a kernel-level, general-purpose mechanism. Both of these sources of overhead are eliminated in OHM. Local message passing can be performed very efficiently because memory (and hence message data) can be shared between processes. In the common case, OHM will exchange messages without a system call or other intervention from the kernel. This user-to-user transfer can be performed with very little overhead. Additionally, when the message data originates in shared memory, message passing can often be accomplished without copying buffers.

As described above, the standard method of local send and receive involves two copies. The sender copies from a private buffer into a shared buffer, and the receiver copies from the shared buffer into a private buffer. A 1-copy send method allocates a shared memory buffer, then copies message data from a private buffer into it. It then inserts a descriptor into a queue associated with this connection, indicating that the message was sent. Similarly, a 1-copy receive method removes the descriptor from the queue and copies data from the shared buffer into a private buffer. The sending and receiving processes are free to manipulate the private buffers without affecting program correctness.

When 0-copy versions of send and receive are available in addition to 1-copy versions, there are four ways to pass messages: two-copy, zero-copy, and two ways with a single copy. A 0-copy

send requires that the message data begin in shared memory. A 0-copy receive must be able to use a reference to a shared buffer or a private buffer. Additionally, in both cases, the message buffer must be guarded against writing while it is shared. (In this paper, 0-copy and 1-copy refer to the underlying send or receive mechanism, and two-copy, single-copy, and zero-copy refer to the resultant message passing achieved by pairing a send with a receive.)

With zero-copy message passing, the message data begins in a shared buffer on the sender side and the receiver accesses the data directly from the same shared-memory buffer. Zero-copy mechanisms eliminate the copies between shared and private buffers. Of course, appropriate actions must be taken to provide copy semantics with zero-copy message passing. Zero-copy mechanisms are described in the next section.

There are two ways to perform single-copy messaging. Sender-copy is a straightforward optimization on the receiver side (1-copy send and 0-copy receive). The receiver realizes that the sender has already copied the message data into a shared buffer. Because the sender copied the data, it makes no claim on the data; consequently, the receiver is free to use the buffer. Even though the message data is in a shared buffer, the receiver is the exclusive owner.

The other single-copy mechanism, receiver-copy, occurs when the receiver chooses not to use the shared buffer (0-copy send and 1-copy receive). To the sender, receiver-copy is indistinguishable from zero-copy, described above. The effect is that the 1-copy receive immediately copies message data and decrements the reference count for the shared buffer.
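To make the baseline pairing concrete, the following is a minimal Java sketch of a 1-copy send paired with a 1-copy receive (together, two-copy messaging); the OneCopyChannel and Descriptor names are illustrative, and an in-process blocking queue stands in for the shared-memory descriptor queue described later.

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch of a 1-copy send paired with a 1-copy receive (two-copy messaging).
    // A descriptor referencing the intermediate buffer travels through a queue;
    // an in-process BlockingQueue stands in for the shared-memory queue.
    final class OneCopyChannel {
        private static final class Descriptor {
            final byte[] sharedBuffer;                 // the intermediate copy
            Descriptor(byte[] b) { sharedBuffer = b; }
        }

        private final BlockingQueue<Descriptor> queue = new ArrayBlockingQueue<>(16);

        // 1-copy send: copy private data into a freshly allocated "shared"
        // buffer, then enqueue a descriptor pointing at it.
        void send(byte[] privateData) throws InterruptedException {
            byte[] shared = Arrays.copyOf(privateData, privateData.length);
            queue.put(new Descriptor(shared));
        }

        // 1-copy receive: dequeue the descriptor and copy the shared buffer
        // into a private buffer, so both sides may modify their own copies.
        byte[] receive() throws InterruptedException {
            Descriptor d = queue.take();
            return Arrays.copyOf(d.sharedBuffer, d.sharedBuffer.length);
        }
    }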

2.2 0-copy Mechanisms

The primary optimization is to avoid the most expensive operation—copying message data—especially for large messages. Copying can be eliminated on the sender or receiver side (or both). OHM provides copy-semantics message passing because it is the typical and easier-to-use form of message passing. Furthermore, efficient mechanisms exist to reduce the cost of copy semantics. Therefore, OHM uses such mechanisms, which logically copy message buffers.

With zero-copy, the buffer is shared. Therefore, a write by one process is visible to the other. This violates copy semantics. The usual mechanism that provides zero-copy efficiency yet retains copy semantics is copy-on-write (COW). When a process tries to write to a shared buffer, it first copies the buffer, then writes to this new buffer to which it has exclusive access. The copy is avoided when either (1) no writing occurs or (2) the writing occurs after the buffer is no longer needed by any other process.

In order to perform zero-copy message passing, the message data must originate in shared memory. This means that the object(s) that make up the message data are allocated in shared memory and not private memory. In COW, because both sending and receiving processes share the same buffer, writing to it must be guarded in order to preserve copy semantics. The buffer can be guarded explicitly by inserting a check inline before a write to a protected memory buffer. Alternatively, the buffer can be guarded implicitly by using the mprotect system call, which traps when a process writes to protected memory. In both cases, a copy is effected when a process tries to write to a shared buffer.
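The explicit form of the guard might look like the following Java sketch, which assumes a per-buffer reference count maintained by the messaging layer; the SharedBuffer and BufferHandle names are illustrative only.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative explicit copy-on-write guard. Each shared buffer carries a
    // reference count; the guard is the check that would be inserted (by hand
    // or by a transformer) before the first write after a 0-copy send/receive.
    final class SharedBuffer {
        byte[] data;
        final AtomicInteger refCount = new AtomicInteger(1);
        SharedBuffer(byte[] data) { this.data = data; }
    }

    final class BufferHandle {
        private SharedBuffer buf;
        BufferHandle(SharedBuffer buf) { this.buf = buf; }

        // If another process still holds a logical copy, copy before writing.
        private void ensureExclusive() {
            if (buf.refCount.get() > 1) {
                SharedBuffer priv = new SharedBuffer(buf.data.clone());
                buf.refCount.decrementAndGet();  // release our claim on the shared copy
                buf = priv;                      // later writes go to the private copy
            }
        }

        void write(int offset, byte value) {
            ensureExclusive();                   // preserves copy semantics
            buf.data[offset] = value;
        }
    }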

2.3 User-Level Synchronization

The second overhead of generic message passing is the cost of a system call. This can be eliminated by direct user-to-user synchronization. The expected use of OHM is tightly-coupled, SPMD programs. In such programs, there are no spurious or unanticipated receives. So while data may become available before it is needed, a receive will be issued, usually within a small amount of time. Additionally, the sending process rarely gets too far ahead of the receiver; therefore, the maximum number of buffers in a queue tends to be small (the maximum is almost always 1). Finally, there is usually one worker process for every processor, which means that a busy wait is not unduly costly.

The straightforward implementation of user-level synchronization uses receiver polling. If the message has already been sent, then the receiver will see this and immediately receive the message. In this case, the send and receive were performed without a system call and without delay. When a receive is issued before the corresponding send has completed, the receive must block. The process can block using a system-provided mechanism, or it can delay in user code. The former is heavy weight and expensive. The latter is inexpensive provided the wait interval is relatively short. An efficient solution for this type of environment is to busy wait at the user level for a small period of time before falling back to a system-level blocking mechanism. Short delays are expected to be the overwhelmingly common case.

A sender must obtain a free element in the message queue before sending. Consequently, unless message queues are unbounded, a sender will have to block if it gets too far ahead of the receiver. Because this case is not likely given the anticipated environment, a heavy-weight system mechanism seems most appropriate.
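A sketch of this two-level blocking strategy on the receive path follows; the spin bound and the sleep-based fallback are illustrative choices, not OHM's actual policy.

    import java.util.function.BooleanSupplier;

    // Sketch of two-level blocking on the receive path: spin briefly in user
    // space (the common case in tightly coupled SPMD codes), then fall back
    // to a heavier mechanism.
    final class ReceiveWait {
        private static final int SPIN_LIMIT = 10_000;

        // 'messageAvailable' would poll the head/tail indices of the incoming
        // queue in a real implementation.
        static void awaitMessage(BooleanSupplier messageAvailable) throws InterruptedException {
            for (int i = 0; i < SPIN_LIMIT; i++) {
                if (messageAvailable.getAsBoolean()) return;   // cheap user-level poll
            }
            while (!messageAvailable.getAsBoolean()) {
                Thread.sleep(1);   // heavier fallback; an OS semaphore is another option
            }
        }
    }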

2.4 Discussion

There are tradeoffs between different message passing mechanisms. As usual, these tradeoffs are in terms of time and space. For large messages, the time is dominated by the cost of copying message data. Therefore, as a message gets larger, the two-copy mechanism becomes costly. A one-copy mechanism obviously suffers less from this cost, but it still quickly becomes inefficient. Therefore, for large messages, a zero-copy mechanism is paramount.

The other operations needed for sending and receiving do not grow with message size. These fixed costs are least in the two-copy case and greatest in the zero-copy case. The reduced-copy mechanisms must perform additional operations, such as guarding the buffer and setting reference counters, in order to provide copy semantics. Additionally, there are hidden costs with reduced-copy mechanisms. First, the object that is the message data must be in shared memory before the send. Therefore, the sender must build it in shared memory. In the general case, this may not be very difficult (short of allocating all objects in shared memory). Moreover, allocating from shared memory is more costly than from private memory, and it increases the demand on shared memory. A second hidden cost arises in cases where there is significant interference between sender and receiver, resulting in many buffers being copied on write. In such a situation, the lazy copy on write is more expensive than an eager copy on send (1-copy send).

The space tradeoff is straightforward. Each copy requires an additional buffer. Thus the two-copy mechanism (which requires three buffers) has the greatest space demand. However, there

are two areas of memory in OHM: private and shared. Shared memory has a higher cost than private because a lock must be acquired and buffers might need to be guarded. Because the two-copy mechanism only uses a shared buffer for intermediate storage (with a short lifetime), it has little shared memory pressure. This reinforces the general understanding about message passing: avoid copying big messages.

The final tradeoff is source program complexity. Copy semantics is trivial to support with two-copy. Using an implicit COW mechanism, it can be supported with any reduced-copy mechanism, but this requires OS support. The explicit COW mechanism (which inserts checks into the source code) is much cheaper than the implicit one. However, it cannot always be done and requires analysis of the source code. Another part of our research that is beyond the scope of this paper is an analyzer and transformer that modifies Java bytecode. It automatically creates more efficient hybrid message passing programs from a standard message passing program. Although the prototype described in this paper can be used stand-alone, it was created to be a backend for this transformer.

3 OHM Implementation

This section describes a prototype implementation of Optimized Hybrid Messaging (OHM) in Java. We create a new Java class, Hybrid, that is a drop-in replacement for the native Java communication classes in java.net. Thus converting from a native Java program to a Hybrid program can be as trivial as modifying one import statement. While this will create a hybrid program that is more efficient, OHM has more to offer—OHM is intended to be the target of a compiler or program optimizer. Therefore, the Hybrid interface has additional methods.

This research uses Java for several reasons, which are explained below. But before discussing the implementation, we first describe some important characteristics of Java. As Java is an object-oriented language, a class is the unit of encapsulation in Java. Therefore, it is natural to implement our prototype as a class. Hybrid consists of both Java and C code. The C code is the lower-level code that interfaces directly with the operating system. This C code is incorporated into the Hybrid class (and seamlessly merged with the Java code) using JNI, the Java Native Interface [JNI].

Java source is compiled into bytecode, which is then executed by a Java VM (virtual machine). An OHM program creates a Java VM (JVM) for each processor; therefore, a multiprocessor node will have more than one JVM executing concurrently. The Hybrid class provides efficient messaging between these processes, executing a distributed-memory program on a shared-memory node. OHM creates a section of the virtual memory in each JVM that is shared by all processes. This is contrasted with a shared-memory program that uses a multi-threaded JVM. In the latter case, there is one JVM process with multiple threads that communicate using the memory of the process—which is shared among the threads.

The JVM for OHM is slightly modified to provide shared memory. Each JVM maps the same shared-memory heap into its address space. This heap contains all the data associated with the local message passing mechanisms, including message queues (for passing messages) and message buffers.

Our prototype is implemented in Java 1.2.2. The Hybrid class has been ported to Linux 2.4 and Solaris (SunOS) 5.8. Because Java is (mostly) architecture and OS independent, there are no changes in the Java code. At present there are no differences in the C code because the operating

systems are both Unix-based. However, it is expected that the minor differences in the implementations of sockets between Linux and Solaris will change this.
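As a rough illustration of how the Java and C layers might meet, the following sketch declares JNI native methods for attaching to and allocating from the shared heap; the class name, method names, and library name are hypothetical, not the actual Hybrid interface.

    // Hypothetical sketch of the JNI boundary between the Java and C layers.
    // The names below are illustrative; they are not the Hybrid interface.
    public final class HybridNative {
        static {
            System.loadLibrary("hybrid");   // e.g., libhybrid.so built from the C sources
        }

        // Map the node-wide shared segment into this JVM; returns an opaque handle.
        native long attachSharedHeap(long bytes);

        // Allocate 'size' bytes out of the shared heap; returns an offset/handle
        // that the Java side wraps as a message buffer.
        native long allocShared(long heapHandle, int size);
    }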

3.1 Shared-Memory Message Passing

Hybrid provides streaming, point-to-point communication, just as Java sockets do. Hybrid communication consists of two elements: queues and buffers. Each bi-directional message channel between two processes has a pair of queues—one for each direction of communication. A queue holds message headers that include a reference to a buffer holding the message data. In this way, a send inserts into the queue and a receive removes from it.

A queue in Hybrid is a fixed-size array of message headers. We use a fixed size for speed and simplicity. The downside is an increased potential of blocking a sender when the queue is full. In theory there could be such blocking, but in practice a reasonable SPMD program will never have a large number of pending messages in a queue. Because there is a queue for each direction, for each queue there is exactly one process that inserts all messages and exactly one process that removes all messages. This allows for highly efficient synchronization. Each queue has a head and a tail index. The sender increments the tail index (modulo the queue size) when it inserts. Because the receiver only ever reads this index, there is no need for the sender to acquire a lock before incrementing. Similarly, the receiver exclusively modifies the head index. Consequently, in the common case, messages are passed between processes without a system call.

A send blocks when the queue is full and a receive blocks when it is empty. The queue is empty when the head equals the tail; it is full when the head equals one more than the tail (modulo queue length). Neither situation should be considered bad. A sender that gets too far ahead likely needs to be throttled back. Our prototype uses busy-wait blocking between polling. Each time a process detects a blocking situation, it waits a small amount of time before re-checking. For an SPMD program with one process per processor, there is essentially no benefit in a blocked process yielding the processor because there is no other application that can run. Therefore, this implementation will continue to wait and poll until the blocking condition is eliminated.

The second element in OHM is a shared-memory message buffer, which is a chunk of dynamically allocated shared memory. OHM creates one section of shared memory per node that is used by all JVMs running on that node. The alternative is to create distinct shared memories between each pair of JVMs. The advantage of the alternative is that interference is reduced, as only two processes ever access this memory. The disadvantage is that it creates p(p − 1)/2 distinct shared memories, where p is the number of processes, which greatly complicates the implementation and increases copying because buffers cannot be shared between more than two processes. Our implementation trades off (potentially) increased interference for simplicity and (potentially) decreased copying.

OHM uses a two-level free list to manage the allocation of shared memory. There is one global free list per node and a local free list for each process. Processes acquire large chunks of shared memory from the global free list. Individual buffers are allocated out of the local free lists. Most allocations and deallocations are done on the local free lists, which do not require locking. There are very few expensive global operations, which require locking. Locking of the global free list is

done using semaphores provided by the operating system.

In the programs tested in this paper, the global free list is accessed only during the initialization of the program. This is because the programs have “balanced” communication; that is, they send and receive equal amounts of data. Therefore, in the steady state, shared memory buffers will be enqueued to and dequeued from local free lists. Such balance is not uncommon in SPMD programs. In unbalanced programs, however, the global free list will be accessed throughout the program's lifetime. In such a program, the overhead of managing shared memory buffers will be greater than experienced in the programs described in this paper. However, this cost will still be low because most operations will be on the local free list.

The native Java classes that establish streaming communication are Socket and ServerSocket. A matching pair of instances of these classes are mated to form a TCP channel. The Hybrid implementation determines the location of the process on the other end of the channel. If the process is on a different node, then a native Java channel is created, and the appropriate methods are dynamically bound in this object. Otherwise, a Hybrid channel is created with the appropriate method bindings. In Hybrid, a communication channel consists of a pair of queues, one each for incoming and outgoing messages.
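The queue just described is a single-producer, single-consumer ring buffer. A minimal in-process Java sketch follows; AtomicInteger stands in for index words that would live in the shared segment, and the names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of the single-producer/single-consumer queue described above: a
    // fixed-size circular array of message headers with a head index and a
    // tail index. Exactly one process inserts and exactly one removes, so no
    // lock is needed.
    final class MessageQueue<T> {
        private final Object[] slots;
        private final AtomicInteger head = new AtomicInteger(0); // next slot to remove
        private final AtomicInteger tail = new AtomicInteger(0); // next slot to fill

        MessageQueue(int capacity) { slots = new Object[capacity]; }

        // Sender side: returns false when full, so the caller backs off and retries.
        boolean offer(T header) {
            int t = tail.get();
            int next = (t + 1) % slots.length;
            if (next == head.get()) return false;   // full: one slot is kept empty
            slots[t] = header;
            tail.set(next);                          // publish only after the slot is written
            return true;
        }

        // Receiver side: returns null when empty, so the caller polls again.
        @SuppressWarnings("unchecked")
        T poll() {
            int h = head.get();
            if (h == tail.get()) return null;        // empty: head equals tail
            T header = (T) slots[h];
            slots[h] = null;
            head.set((h + 1) % slots.length);
            return header;
        }
    }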

3.2 Message Passing Mechanisms

OHM provides both one- and zero-copy send and receive mechanisms. The 1-copy send method copies a message into a shared buffer. The method prototype and its basic use are shown below.

    Socket S;
    Object data = new Object();
    data = ...;
    S.send(data);

S is an open socket between two processes and data is a reference to an arbitrary Java object. First, the method must allocate a shared buffer and copy message data into this buffer. Next, the method acquires a message header in the outgoing queue. It modifies a few fields in the header; in particular, it sets a pointer to the shared buffer. As stated above, this operation does not require locking and only blocks when the outgoing queue is full—which is an extremely uncommon case.

The other send mechanism is 0-copy send, which is similar to the above. However, rather than allocating a shared buffer and copying data, this method sets the pointer field in the queue element to point to the message data (i.e., a pointer to the message data sent as a parameter). Because there is one physical copy—but two logical copies—of the message buffer, OHM uses a reference counter for every shared buffer. Before enqueuing the message header into the outgoing queue, the 0-copy send method increments the counter—indicating that there is an additional logical copy of the buffer. Both send methods use the identical interface (i.e., the same parameters).

The two receive methods also share the same interface, shown below.

    Socket S;
    Object data;
    S.receive(data);
    use(data);

As before, S is an open socket between two processes and data is a reference to an arbitrary Java object. The 1-copy receive method blocks if its incoming queue (which is the sender's outgoing queue) is empty. As with sending, this is a busy wait. When the incoming queue is not empty, the receiver dequeues a message header. It copies message data from the shared buffer, using a field in the message header, into a newly allocated buffer. Because the receiver is finished with its logical copy of the message data, it decrements the reference counter. If the count is zero, the buffer is placed on the receiver's local free list.

The 0-copy receive mechanism uses the same interface as above, but there are a few differences in the implementation. First, the reference to the buffer (data) is set to point to the shared data. Second, no buffer is allocated. Third, the reference counter is not decremented.

Hybrid supports copy-on-write explicitly. The explicit method uses a conditional before every first write following a send or receive. It checks the reference count for the buffer and makes a copy if the reference count is greater than one. This method requires additional code to be inserted. Because our system is intended to be the target of a compiler, this requirement is not onerous. This explicit method has less overhead than the implicit method; in particular, it does not require operating system support. However, it is not possible to determine all paths in the code. Therefore, we cannot guarantee that all first accesses will be guarded in general. While an implicit COW does not have this restriction, it requires assistance from the system.

An implicit COW uses the mprotect system call and a signal handler. The message buffer is write-guarded with mprotect, which raises a signal when the buffer is written. The signal handler copies the message data to a new buffer if the reference count is greater than one. This implicit mechanism is not yet incorporated into Hybrid because of the high cost of guarding data. In the time required to execute an mprotect system call, many thousands of bytes can be copied. So this implicit mechanism is only rational for very large messages.
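A sketch of the 0-copy send paired with the two receive variants follows; the ZeroCopyChannel and RefBuffer names, and the in-process queue standing in for the shared-memory structures, are illustrative assumptions.

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a 0-copy send paired with either receive variant. A reference
    // count tracks the logical copies of each shared buffer.
    final class ZeroCopyChannel {
        static final class RefBuffer {
            final byte[] data;
            final AtomicInteger refCount = new AtomicInteger(1);
            RefBuffer(byte[] data) { this.data = data; }
        }

        private final BlockingQueue<RefBuffer> queue = new ArrayBlockingQueue<>(16);

        // 0-copy send: no data copy; record the extra logical copy and enqueue.
        void send(RefBuffer buf) throws InterruptedException {
            buf.refCount.incrementAndGet();
            queue.put(buf);
        }

        // 1-copy receive (receiver-copy): copy out, then release the shared buffer.
        byte[] receiveCopy() throws InterruptedException {
            RefBuffer buf = queue.take();
            byte[] priv = Arrays.copyOf(buf.data, buf.data.length);
            if (buf.refCount.decrementAndGet() == 0) {
                // the buffer would be returned to the local free list here
            }
            return priv;
        }

        // 0-copy receive: alias the shared buffer; the count stays raised until
        // a later write triggers the copy-on-write guard or the buffer is freed.
        RefBuffer receiveAlias() throws InterruptedException {
            return queue.take();
        }
    }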

3.3 Using OHM

There are two memory regions from which a message buffer can be allocated: private and shared. The private region is ordinary heap memory that is exclusive to the JVM. The shared region is used by all JVMs local to the machine; it is created using a system call. Because the message data is communicated through shared memory, the data must be in a shared buffer at the conclusion of a send.

In a 1-copy send, the source buffer may be in either private or shared memory because the mechanism will copy the data into a newly allocated shared memory buffer. However, in a 0-copy send, the source buffer must begin in shared memory. Therefore, in order to use the zero-copy send mechanism, the program must be modified to create message data in shared memory.

There are complementary requirements for the receive methods. A 1-copy receive may copy into a buffer in either memory region. A 0-copy receive does not have to create an object in a shared memory buffer; rather, it must be able to use a shared memory buffer. This is easily accomplished because Java objects are accessed by reference. In either case, the receive method provides a reference to a “new” message buffer.

As stated above, OHM is intended as the target of a compiler. However, it can be used directly. In either case, the following modifications must be made to convert an “ordinary” message passing program into a Hybrid program.

9 1. Insert import statements.

2. Change each native send method to a Hybrid send method.

3. Change each native receive method to a Hybrid receive method.

4. Pre-allocate objects for 0-copy send.

5. Guard message data against write after each 0-copy send or receive.

If the first three actions are done and only 1-copy mechanisms are used, there will be a large benefit. This is very easy to do (manually or automatically). Although the resultant OHM program will be much faster than another message passing program, it is slower than a threaded program on a shared-memory node. To get the full benefit of OHM, and to get performance equivalent to a threaded program, one must use the 0-copy mechanisms. This requires the last two actions, which require analysis and program modifications; a sketch of the resulting code shape follows.
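The fragment below is a hypothetical illustration of a converted program; the HybridChannel and SharedHeap interfaces and the allocShared, sendZeroCopy, and ensureExclusive names are illustrative stand-ins, not the published Hybrid interface.

    // Hypothetical shape of a program fragment after conversion to Hybrid.
    // All names below are illustrative stand-ins.
    final class ConvertedFragment {
        interface HybridChannel { void sendZeroCopy(double[] buf); }
        interface SharedHeap {
            double[] allocShared(int len);
            double[] ensureExclusive(double[] buf);
        }

        static void exchangeRow(SharedHeap heap, HybridChannel ch) {
            double[] row = heap.allocShared(1500);   // step 4: build message data in shared memory
            java.util.Arrays.fill(row, 1.0);
            ch.sendZeroCopy(row);                    // steps 2/3: Hybrid 0-copy send (receive is symmetric)
            row = heap.ensureExclusive(row);         // step 5: guard before the first write after the send
            row[0] = 0.0;                            // safe: this process now owns its copy
        }
    }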

4 Results

This section presents three different tests. First, we show the relative performance of standard message passing mechanisms and compare them to the OHM mechanisms. Next, Java applications using native communications are compared to Java applications using our Hybrid class (which uses the OHM mechanisms). Last, we compare distributed-memory, message passing Java programs using Hybrid to shared-memory, thread-based Java programs. We do not present any tests between nodes in a cluster because our prototype uses TCP for remote communication; because our system does nothing special between nodes, such a test would not measure our work.

There are two testing platforms. One is a Sun SPARCstation with 4 processors and 32 GB of memory; it is a xx MHz machine and runs Solaris. The other is a 4-way Intel machine running Linux.

4.1 Message Passing Mechanisms

Figure 1 shows the cost of five message passing mechanisms relative to the size of the message. The results shown are the time in microseconds for a round-trip message on our Linux platform. The two-copy mechanism copies into an intermediate buffer on send and then into a destination buffer on receive. There are two instances of COW tested. One is the cost when the data is subsequently copied (i.e., the data was copied on write). The other measures the cost when there is no copy. We also tested two native message passing mechanisms, AF_INET, which is TCP, and AF_UNIX, which is IPC.

The COW mechanism that does not copy has a cost that is independent of message size and very small to begin with; its overhead is almost negligible. COW-copy, which performs a single, on-demand copy, is approximately one-half the cost of 2-copy. TCP is approximately ten times slower than 2-copy on large messages. Unix IPC is slower than TCP on large messages because TCP is a native method and we implemented Unix IPC using the JNI interface. While JNI methods are C routines that run natively on the machine, they must interface to the

10 6 10 2−copy cow−copy 5 10 cow−no inet unix

4 10

3 10

2 10

1 10

0 10

−1 10 Time(microseconds per roundtrip)

−2 10

−3 10 1 2 3 4 5 6 10 10 10 10 10 10 message size(bytes)

Figure 1: Pingpong test.

Java VM. The JNI-Java API copies its parameters; therefore, the Unix IPC test has an extra copy. If this copy is subtracted, then Unix IPC properly sits between TCP and 2-copy.

4.2 Distributed-Memory Applications

First, we look at the message passing times and verify that OHM is more efficient than other messaging mechanisms. Figure 2 shows results from our Sun platform. The data size for these programs is 1500 × 1500 per process. The tests scale the size of the problem with the number of processes, so the test size has 1500p rows, where p is the number of processors. The width of the problem is kept the same so that the size of messages is the same for both the 2- and 4-process tests. The message size in this test is 1500 8-byte doubles. (Figure 2 does not show 1-processor times because such a test does not use messages.) The processes communicate in a 1-dimensional layout; therefore, to an individual process there is no change as processes are added. Recall that each process manages its own private free lists. Because the communication is balanced (each process sends exactly as much as it receives), processes never have to access the global free list during the computation. Moreover, there is no interference (due to write sharing) in this test.

Tests were run using Unix IPC and TCP, but the results are much worse than OHM. The TCP times shown in Figure 2 are scaled by one-tenth. Unix IPC was faster than TCP, but always more than an order of magnitude worse than OHM. Looking at Figure 1, we see that for a message of 1500 doubles, the IPC curve is still below TCP. We did not compare directly to MPI because MPI provides a vastly different interface. There is substantially more cost to sending an MPI message than a Hybrid message. However, reading the literature we see that hybrid versions of MPI claim results similar to ours.

Figure 2: Message passing Jacobi iteration test. (Running time in seconds, log scale, versus number of processes for the zero-copy, two-copy, unix, and tcp versions.)

                     Threads          Zero-copy        Two-copy
  Processes/threads  secs (speedup)   secs (speedup)   secs (speedup)
  1                  57.5 (1.00)
  2                  58.3 (1.97)      57.7 (1.99)      60.6 (1.90)
  4                  59.2 (3.89)      64.1 (3.59)      69.0 (3.33)

Table 1: Jacobi iteration (times in seconds, speedup in parentheses).

There are two overheads in sending messages. Comparing the two-copy and zero-copy results shows the savings from avoiding a copy. Using the two-copy mechanism is 5% and 8% slower than zero-copy, for 2 and 4 nodes, respectively. We cannot directly determine the savings due to avoiding a system call because Unix IPC is not native to Java. However, even if it were, the implementation of message passing in the kernel is significantly different from our implementation. Therefore, the comparison would be dubious.

4.3 Message Passing vs. Threads

Now we compare hybrid message passing to a shared-memory program that uses threads, to see whether OHM is competitive with threading on a shared-memory machine. Table 1 shows running times of three versions of Jacobi iteration on our Sun platform. The times for the OHM programs are the same as those shown in Figure 2. The shared-memory program uses native Java threads and the JVM is multi-threaded; its times are shown in the left column of Table 1.

All programs use the same application kernel. Therefore, the major differences in time can be attributed to what happens between iterations. In the threaded program, a barrier is executed. In the message passing program, messages containing rows are sent to and received from each neighbor. Moreover, in the zero-copy program, message data is passed by exchanging references. The messages are sent using COW; because there is no write interference, no copy is ever made. Because

             Threaded         OHM
  Processes  secs (speedup)   secs (speedup)
  1          59.9 (1.00)      61.5 (0.974)
  2          28.3 (2.11)      31.1 (1.92)
  4          13.7 (4.37)      16.2 (3.70)

Table 2: 2-D FFT, 2048 × 2048 (times in seconds, speedup in parentheses).

             Threaded         OHM
  Processes  secs (speedup)   secs (speedup)
  1          40.1 (1.00)
  2          40.7 (1.97)      40.4 (1.99)
  4          50.1 (3.20)      49.6 (3.23)

Table 3: Jacobi, 1024p × 1024 (times in seconds, speedup in parentheses).

the rows are created in shared memory, the message passing program never copies data. Thus the difference between the two programs is the cost of exchanging messages versus the cost of a barrier. Interior processes execute four message operations (2 sends and 2 receives). Each thread executes a barrier, which is implemented as a Java synchronized method.

In this program, one would expect the threaded program to perform very well. And it does, with speedups of 1.97 and 3.89 on 2 and 4 processors, respectively. The OHM programs also have good speedups. Interestingly, the 2-way zero-copy program is faster than the 2-way threaded program, but for the 4-way tests it is the other way around. In the message passing program, the interior nodes send (and receive) two messages each iteration and the edge nodes only send one message. Because there are no interior nodes in the 2-way message passing program, the cost of synchronization between iterations is approximately half what it is in the 4-way test. In the threaded program, the theoretical cost of the barrier also doubles from 2 to 4 processors. However, in practice there is a high fixed cost for a barrier synchronization.

The second test is a 2-dimensional FFT. This program performs N row FFTs in parallel, with an equal number of rows distributed to each process. Then it transposes the 2-D array and performs N more row FFTs. It does the transpose rather than perform column FFTs with data striped across all processes. During the transpose, each process sends a message to every other process containing an n/p × n/p block of the array. Each process transposes the block upon receipt. Because our FFT implementation requires n to be a power of two, we use a fixed-size problem for all tests.

Table 2 shows results from our Sun platform. We make two observations. First, the message passing program runs almost 3% slower on one node even though the program kernels are identical. The only tangible difference between the two programs is that the OHM program allocates its array in shared memory, using OHM buffer management methods. Second, the threaded program scales better; in fact it scales more than linearly. The transpose phase is very communication intensive. While both use the same hardware for communication, the threaded program is able to perform the transpose slightly faster.

Our Linux platform tells a different story. While memory latency grows under contention in

all systems, it is particularly noticeable on our Linux platform. Table 3 shows times and speedups for the threaded and OHM programs on Linux. The overall speedup is less than on the Sun. Moreover, the overhead is greater in the threaded program than in the message passing program. This suggests that differences in the memory hierarchy are as important to performance as the differences between the threaded and OHM versions. We take this as an indication that the performance of OHM is indeed comparable with that of a threaded program on a shared-memory machine.

5 Related Work

There is a significant amount of work on parallel programming on clusters of multiprocessors. There are three primary options: use solely message passing, use solely shared memory, or use a hybrid programming model. This section describes these in turn.

One way to write a program for a multiprocessor cluster is to use message passing between and within the nodes. MPI [MPI96] can be used for this; the problem is that message passing is less efficient than shared memory within a node because of excessive copying. Several groups have designed MPI implementations that reduce copying using shared-memory buffers within a node. These include [TSY99], who showed significant improvements compared to a native MPI implementation. The problem with these implementations is that they lack semantic information that can lead to further optimizations. With the Grinder, OHM is provided with semantic information that enables such optimizations; for example, when there are no more references to an object, it can be transferred to another node.

It is also possible to use MPI between nodes and a shared-memory model within nodes. The shared-memory model can be a thread model or a higher-level model such as OpenMP [opea]. While this can deliver high performance, it forces a two-level programming model, which forces users to learn two completely different sets of concepts. Instead, we provide a single programming model with OHM.

Another way to write programs for multiprocessor clusters is to use distributed shared memory systems (e.g., [KDCZ94, LH89b]). Such systems provide a uniform shared-memory programming model. Systems that provide this include an OpenMP port with an SDSM [HLCZ00], Strings [RC98], SoftFLASH [ENCH96], MGS [YKA96], and HyFi [RL02]. The OpenMP port [Opeb] uses POSIX threads within a node for portability and the TreadMarks SDSM for implicit inter-node communication. Strings [RC98] is a high-performance SDSM for symmetric multiprocessor clusters based on the Quarks SDSM [SSC98]. SoftFLASH [ENCH96] is a kernel-level DSM implementation. MGS [YKA96] was one of the first systems to explore coupling small- to medium-scale shared memory multiprocessors through software to synthesize larger, logically shared memory systems. HyFi [RL02] supports iterative and fork-join parallelism, along with the Filaments SDSM. Other systems use processes to implement parallelism within a node. Each of these processes has a separate address space but has shared memory regions mapped between the processes. These systems include Cashmere-2L [SDH+97] and HLRC-SMP [SBIS98]. Unfortunately, while DSM systems provide a uniform programming model, the implementation is less efficient than a message-passing model. This overhead can be significant for many classes of applications. OHM will deliver the performance of message passing between nodes.

Our work is loosely based on copy-on-write [RTY+87], which was originally developed for efficient copying inside of an operating system. The virtual memory system was used to mark

pages read-only instead of copying, with both address spaces mapping the data. If one process writes, a physical copy is made. One way this was used was in implementing an efficient version of fork.

6 Conclusion

This paper discusses the hybrid message passing model and argues its advantages. It then presents a prototype implementation that decreases the cost of local communication and is competitive with threaded programs on a shared-memory machine.

There are several directions for future work. First, copying is a major source of overhead in remote message passing systems. The mechanisms described here reduce the cost of copying messages between two address spaces that belong to user processes. The same mechanisms can reduce the cost of copying messages between a user's address space and the kernel. We plan to extend this interface to reduce the cost of remote message passing, i.e., we will use an interface like VIA. Writing efficient programs using VIA is not simple; however, the details and difficulties can be hidden in the Hybrid class methods.

Second, COW is only one of many reduced-copy optimizations. We are investigating several other protocols, including some of our own design.

Last, we are building a Java bytecode analyzer and transformer, called the Grinder, that optimizes programs for message passing. It takes the bytecode of an ordinary message passing program, analyzes it, and emits new bytecode. This bytecode uses the Hybrid class; more than that, the program is transformed to use it more efficiently.

References

[ENCH96] Andrew Erlichson, Neal Nuckolls, Greg Chesson, and John Hennessy. SoftFLASH: Analyzing the performance of clustered distributed virtual shared memory. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.

[HLCZ00] Y. Charlie Hu, Honghui Lu, Alan L. Cox, and Willy Zwaenepoel. OpenMP for networks of SMPs. Journal of Parallel and Distributed Computing, 60(12):1512–1530, 2000.

[JNI] Java Native Interface specification. http://java.sun.com/products/jdk/1.1/docs/guide/jni/spec/jniTOC.doc.html.

[KDCZ94] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.

[LH89a] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.

[LH89b] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.

[MPI96] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, 1996.

[opea] OpenMP homepage. http://openmp.org.

[Opeb] OpenMP. http://openmp.org.

[RC98] Sumit Roy and Vipin Chaudhary. Strings: A high-performance distributed shared memory for symmetrical multiprocessor clusters. In Proceedings of HPDC, July 1998.

[RL02] Subramanian Ragavan and David K. Lowenthal. HyFi: Architecture-independent parallelism on a cluster of multiprocessors. October 2002.

[RTY+87] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 31–39, Palo Alto, California, 1987.

[SBIS98] R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-based SVM protocols for SMP clusters: Design and performance. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.

[SDH+97] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 170–183, October 1997.

[SSC98] M. Swanson, L. Stroller, and J. B. Carter. Making distributed shared memory simple, yet efficient. In Proc. of the 3rd Int'l Workshop on High-Level Parallel Programming Models and Supporting Environments, 1998.

[TSY99] Hong Tang, Kai Shen, and Tao Yang. Compile/run-time support for threaded MPI execution on multiprogrammed shared memory machines. In Principles and Practice of Parallel Programming, pages 107–118, 1999.

[YKA96] Donald Yeung, John Kubiatowicz, and Anant Agarwal. MGS: A multigrain shared memory system. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
