Hybrid Message Passing in Shared-Memory Clusters∗

Vincent W. Freeh† Jin Xu‡ David K. Lowenthal§

Abstract An increasingly popular choice for high-performance machines is clusters of SMPs. Such clusters are popular in department-wide as well as national laboratory settings. On clusters of SMPs, there are two different memory models: shared memory, within a node, and distributed memory, between nodes. It is more difficult to write an efficient program using two memory models, i.e., combining OpenMP and MPI, than one. Therefore, several systems provide a hybrid message passing system, in which an IPC mechanism is used for local messaging and a network protocol (i.e., TCP) is used for remote messaging. This allows message passing programs to run much more efficiently on shared-memory clusters. The dual to this hybrid message passing program is a shared-variable program supported by a distributed shared memory. In theory, the shared-variable program is easier to create; however, in practice such a program is intolerably slow. While clearly reducing the overhead of messaging, a hybrid message passing system does not take full advantage of the shared memory model. This paper describes the Optimized Hybrid Messaging (OHM) package, which allows for highly efficient local messaging in a distributed memory program. Our prototype implementation of OHM is written in Java. It reduces message passing overhead between processes on the same shared-memory node, and it has performance comparable to a multi-threaded version.

1 Introduction

High-speed networking and low cost make distributed-memory clusters very attractive high-performance platforms. Furthermore, clusters have proven to be more scalable than shared-memory multiprocessors. However, small-scale shared-memory parallelism is quite usable and cost effective. Consequently, networks of shared-memory multiprocessors, with up to eight processors per node, are emerging as the dominant architecture for high-performance computers. This is because, in terms of most raw performance metrics, shared-memory clusters are a highly efficient design point. However, this architecture is problematic in terms of software and programming cost, because it has a hybrid memory model consisting of both shared and distributed memory. Compared to other architectures,

∗This research was supported by NSF grants CCR-9876073 and CCR-9733063 and a grant from DARPA HPCS.
†Department of Computer Science, North Carolina State University, Raleigh, NC 27965, [email protected]
‡Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, [email protected]
§Department of Computer Science, University of Georgia, Athens, GA 30602, [email protected]

a hybrid architecture presents additional challenges for developing efficient programs as well as tuning and porting them.

There are three basic programming models for a hybrid machine. The first is shared memory across all processors and nodes with software distributed shared memory (SDSM). The second uses message passing between processors and nodes, and the third uses message passing between nodes and shared memory within nodes. None of these three is ideal; SDSMs are inefficient between nodes and message passing is inefficient within a node. Although hybrid systems can be efficient, the hybrid model requires a user to use two different programming models.

This paper addresses the challenge of efficiently programming shared-memory clusters through the Optimized Hybrid Messaging (OHM) system, which provides several optimized local messaging mechanisms that decrease the cost of message passing. It also presents results in which the performance of OHM programs is nearly equivalent to that of multi-threaded programs on a single SM node.

Many high-performance messaging systems provide optimizations for SM clusters. For example, the major MPI implementations use efficient IPC for local messaging instead of a network protocol (which demonstrates the importance of SM clusters and the utility of efficient local messaging). These implementations provide a shortcut for messages that are destined for another process on the same SM node. However, this is a partial solution because, while it increases efficiency, it does not take full advantage of the opportunities available for message passing using shared memory.

Our prototype implementation of OHM uses Java primarily because of the ease of prototyping within a Java virtual machine. There are two major components. The first is a message passing class that is implemented as a “drop-in” replacement for Java’s native1 messaging methods. The second component provides and manages shared memory buffers that are used to hold message headers, queues, and data.

At first blush, OHM can be viewed as a multiplexer between local and remote messaging methods. If the source and destination are located on the same machine, then OHM uses specialized, highly efficient local message passing. But when the source and destination are located on different machines, then OHM uses the native messaging methods. In this respect, OHM is similar to systems such as LAM MPI. OHM distinguishes itself from these systems by providing several message passing mechanisms chosen for efficiency.

The next section discusses general hybrid message passing. Section 3 introduces a prototype OHM system created as a Java class. Section 4 shows that our OHM prototype is significantly faster than other message passing mechanisms and that it provides excellent SM performance. Then the paper presents related work and conclusions.

2 Hybrid Message Passing

In a hybrid computer, there are two types of messages. A remote message is between processes on different nodes, and a local message is between processes on the same node. Typically, a remote mechanism also supports local communication, but not efficiently. For example, a straightforward implementation of message passing in a hybrid computer uses TCP/IP for both local and remote messages. While it is sufficient, this approach incurs unnecessary overhead for local message passing. In contrast, a hybrid message passing system uses separate mechanisms for local and remote

1The native package is java.net, which supports communication with TCP sockets.

messages. A simple hybrid message passing mechanism uses TCP/IP for remote messages and interprocess communication (i.e., AF_UNIX sockets on Unix machines) for local messages. It is easy to determine whether two communicating processes are local or remote when a connection is being established. Therefore, the proper methods can be dynamically bound to send and receive methods when the connection is created. Thus, the mechanism that establishes a connection between two processes serves as a clearing house for specific, and efficient, implementations of message passing. As a result, the inherent overhead of hybrid messaging is negligible. This simple form of hybrid message passing is supported in some MPI implementations.
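As a concrete illustration, the following Java sketch shows what such connection-time binding might look like; the Channel, LocalChannel, and RemoteChannel names and the same-host test are illustrative assumptions, not part of any particular MPI or OHM interface.

    import java.net.InetAddress;

    // Illustrative sketch of connection-time dispatch in a hybrid messaging
    // layer: the local/remote decision is made once, when the channel is
    // created, so the per-message send/receive paths carry no extra test.
    interface Channel {
        void send(byte[] data);
        byte[] receive();
    }

    final class HybridConnector {
        static Channel connect(InetAddress peer, int port) throws Exception {
            // Simplified same-node test; a real system would be more robust.
            boolean sameNode = peer.equals(InetAddress.getLocalHost());
            return sameNode ? new LocalChannel(port) : new RemoteChannel(peer, port);
        }
    }

    // Placeholder implementations: a real LocalChannel would be backed by
    // shared-memory queues and buffers, and RemoteChannel by TCP sockets.
    class LocalChannel implements Channel {
        LocalChannel(int port) { /* attach to the shared-memory queue pair */ }
        public void send(byte[] data) { /* enqueue a descriptor in shared memory */ }
        public byte[] receive() { return null; /* dequeue and return message data */ }
    }

    class RemoteChannel implements Channel {
        RemoteChannel(InetAddress peer, int port) { /* open a TCP connection */ }
        public void send(byte[] data) { /* write to the socket */ }
        public byte[] receive() { return null; /* read from the socket */ }
    }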

2.1 Shared Memory Message Passing

Messages are typically passed using copy semantics [LH89a]. In copy semantics, the send operation copies message data from an application buffer to an intermediate buffer. Similarly, the receive operation copies data from an intermediate buffer to an application buffer. Because source, destination, and intermediate buffers are all independent, copy semantics provides the simplest message passing interface and semantics, but not the most efficient.

Other message passing semantics exist, but are not as widespread and are more difficult to use. For example, MPI provides the Isend and Irecv interface, which can be more efficient than copy semantics. However, significant effort may be necessary to transform a program from copy semantics to this more efficient form. In particular, because message data is shared (not copied), the programmer must explicitly verify that a buffer is no longer shared before modifying it. In MPI, this involves the insertion of Wait statements before the appropriate subsequent uses of message data.

Simple hybrid message passing uses generic IPC. While this is more efficient than using TCP/IP, it still has significant overhead. There are two primary sources of overhead: (i) crossing the user-kernel boundary and (ii) copying data. This overhead is a consequence of using a kernel-level, general-purpose mechanism. Both of these sources of overhead are eliminated in OHM. Local message passing can be performed very efficiently because memory (and hence message data) can be shared between processes. In the common case, OHM will exchange messages without a system call or other intervention from the kernel. This user-to-user transfer can be performed with very little overhead. Additionally, when the message data originates in shared memory, message passing can often be accomplished without copying buffers.

As described above, the standard method of local send and receive involves two copies. The sender copies from a private buffer into a shared buffer, and the receiver copies from the shared buffer into a private buffer. A 1-copy send method allocates a shared memory buffer, then copies message data from a private buffer into it. It then inserts a descriptor into a queue associated with this connection, indicating that the message was sent. Similarly, a 1-copy receive method removes the descriptor from the queue and copies data from the shared buffer into a private buffer. The sending and receiving processes are free to manipulate the private buffers without affecting program correctness.

When 0-copy versions of send and receive are available in addition to 1-copy versions, there are four ways to pass messages: two-copy, zero-copy, and two ways with a single copy. A 0-copy

send requires that the message data begin in shared memory. A 0-copy receive must be able to use a reference to a shared buffer or a private buffer. Additionally, in both cases, the message buffer must be guarded against writing while it is shared. (In this paper, 0-copy and 1-copy refer to the underlying send or receive mechanism, and two-copy, single-copy, and zero-copy refer to the resultant message passing achieved by pairing a send with a receive.)

With zero-copy message passing, the message data begins in a shared buffer on the sender side and the receiver accesses the data directly from the same shared-memory buffer. Zero-copy mechanisms eliminate the copies between shared and private buffers. Of course, appropriate actions must be taken to provide copy semantics with zero-copy message passing. Zero-copy mechanisms are described in the next section.

There are two ways to perform single-copy messaging. Sender-copy is a straightforward optimization on the receiver side (1-copy send and 0-copy receive). The receiver realizes that the sender has already copied the message data into a shared buffer. Because the sender copied the data, it makes no claim on the data; consequently, the receiver is free to use the buffer. Even though the message data is in a shared buffer, the receiver is the exclusive owner.

The other single-copy mechanism, receiver-copy, occurs when the receiver chooses not to use the shared buffer (0-copy send and 1-copy receive). To the sender, receiver-copy is indistinguishable from zero-copy, described above. The effect is that the 1-copy receive immediately copies message data and decrements the reference count for the shared buffer.
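To make the baseline pairing concrete, the following is a minimal Java sketch of a 1-copy send paired with a 1-copy receive (together, two-copy messaging); the OneCopyChannel and Descriptor names are illustrative, and an in-process blocking queue stands in for the shared-memory descriptor queue described later.

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch of a 1-copy send paired with a 1-copy receive (two-copy messaging).
    // A descriptor referencing the intermediate buffer travels through a queue;
    // an in-process BlockingQueue stands in for the shared-memory queue.
    final class OneCopyChannel {
        private static final class Descriptor {
            final byte[] sharedBuffer;                 // the intermediate copy
            Descriptor(byte[] b) { sharedBuffer = b; }
        }

        private final BlockingQueue<Descriptor> queue = new ArrayBlockingQueue<>(16);

        // 1-copy send: copy private data into a freshly allocated "shared"
        // buffer, then enqueue a descriptor pointing at it.
        void send(byte[] privateData) throws InterruptedException {
            byte[] shared = Arrays.copyOf(privateData, privateData.length);
            queue.put(new Descriptor(shared));
        }

        // 1-copy receive: dequeue the descriptor and copy the shared buffer
        // into a private buffer, so both sides may modify their own copies.
        byte[] receive() throws InterruptedException {
            Descriptor d = queue.take();
            return Arrays.copyOf(d.sharedBuffer, d.sharedBuffer.length);
        }
    }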

2.2 0-copy Mechanisms

The primary optimization is to avoid the most expensive operation—copying message data—especially for large messages. Copying can be eliminated on the sender or receiver side (or both). OHM provides copy-semantics message passing because it is the typical and easier-to-use form of message passing. Furthermore, efficient mechanisms exist to reduce the cost of copy semantics. Therefore, OHM uses such mechanisms, which logically copy message buffers.

With zero-copy, the buffer is shared. Therefore, a write by one process is visible to the other. This violates copy semantics. The usual mechanism that provides zero-copy efficiency yet retains copy semantics is copy-on-write (COW). When a process tries to write to a shared buffer, it first copies the buffer, then writes to this new buffer to which it has exclusive access. The copy is avoided when either (1) no writing occurs or (2) the writing occurs after the buffer is no longer needed by any other process.

In order to perform zero-copy message passing, the message data must originate in shared memory. This means that the object(s) that make up the message data are allocated in shared memory and not private memory. In COW, because both sending and receiving processes share the same buffer, writing to it must be guarded in order to preserve copy semantics. The buffer can be guarded explicitly by inserting a check inline before a write to a protected memory buffer. Alternatively, the buffer can be guarded implicitly by using the mprotect system call, which traps when a process writes to protected memory. In both cases, a copy is effected when a process tries to write to a shared buffer.
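The explicit form of the guard might look like the following Java sketch, which assumes a per-buffer reference count maintained by the messaging layer; the SharedBuffer and BufferHandle names are illustrative only.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative explicit copy-on-write guard. Each shared buffer carries a
    // reference count; the guard is the check that would be inserted (by hand
    // or by a transformer) before the first write after a 0-copy send/receive.
    final class SharedBuffer {
        byte[] data;
        final AtomicInteger refCount = new AtomicInteger(1);
        SharedBuffer(byte[] data) { this.data = data; }
    }

    final class BufferHandle {
        private SharedBuffer buf;
        BufferHandle(SharedBuffer buf) { this.buf = buf; }

        // If another process still holds a logical copy, copy before writing.
        private void ensureExclusive() {
            if (buf.refCount.get() > 1) {
                SharedBuffer priv = new SharedBuffer(buf.data.clone());
                buf.refCount.decrementAndGet();  // release our claim on the shared copy
                buf = priv;                      // later writes go to the private copy
            }
        }

        void write(int offset, byte value) {
            ensureExclusive();                   // preserves copy semantics
            buf.data[offset] = value;
        }
    }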

2.3 User-Level Synchronization

The second overhead of generic message passing is the cost of a system call. This can be eliminated by direct user-to-user synchronization. The expected use of OHM is tightly-coupled, SPMD programs. In such programs, there are no spurious or unanticipated receives. So while data may become available before it is needed, a receive will be issued, usually within a small amount of time. Additionally, the sending process rarely gets too far ahead of the receiver; therefore, the maximum number of buffers in a queue tends to be small (the maximum is almost always 1). Finally, there is usually one worker process for every processor, which means that a busy wait is not unduly costly.

The straightforward implementation of user-level synchronization uses receiver polling. If the message has already been sent, then the receiver will see this and immediately receive the message. In this case, the send and receive were performed without a system call and without delay. When a receive is issued before the corresponding send has completed, the receive must block. The process can block using a system-provided mechanism, or it can delay in user code. The former is heavy weight and expensive. The latter is inexpensive provided the wait interval is relatively short. An efficient solution for this type of environment is to busy wait at the user level for a small period of time before falling back to a system-level blocking mechanism. Short delays are expected to be the overwhelmingly common case.

A sender must obtain a free element in the message queue before sending. Consequently, unless message queues are unbounded, a sender will have to block if it gets too far ahead of the receiver. Because this case is not likely given the anticipated environment, a heavy-weight system mechanism seems most appropriate.
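A sketch of this two-level blocking strategy on the receive path follows; the spin bound and the sleep-based fallback are illustrative choices, not OHM's actual policy.

    import java.util.function.BooleanSupplier;

    // Sketch of two-level blocking on the receive path: spin briefly in user
    // space (the common case in tightly coupled SPMD codes), then fall back
    // to a heavier mechanism.
    final class ReceiveWait {
        private static final int SPIN_LIMIT = 10_000;

        // 'messageAvailable' would poll the head/tail indices of the incoming
        // queue in a real implementation.
        static void awaitMessage(BooleanSupplier messageAvailable) throws InterruptedException {
            for (int i = 0; i < SPIN_LIMIT; i++) {
                if (messageAvailable.getAsBoolean()) return;   // cheap user-level poll
            }
            while (!messageAvailable.getAsBoolean()) {
                Thread.sleep(1);   // heavier fallback; an OS semaphore is another option
            }
        }
    }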

2.4 Discussion

There are tradeoffs between different message passing mechanisms. As usual, these tradeoffs are in terms of time and space. For large messages, the time is dominated by the cost of copying message data. Therefore, as a message gets larger, the two-copy mechanism becomes costly. A one-copy mechanism obviously suffers less from this cost, but it still quickly becomes inefficient. Therefore, for large messages, a zero-copy mechanism is paramount.

The other operations needed for sending and receiving do not grow with message size. These fixed costs are least in the two-copy case and greatest in the zero-copy case. The reduced-copy mechanisms must perform additional operations, such as guarding the buffer and setting reference counters, in order to provide copy semantics. Additionally, there are hidden costs with reduced-copy mechanisms. First, the object that is the message data must be in shared memory before the send. Therefore, the sender must build it in shared memory. In the general case, this may not be very difficult (short of allocating all objects in shared memory). Moreover, allocating from shared memory is more costly than from private memory, and it increases the demand on shared memory. A second hidden cost arises in cases where there is significant interference between sender and receiver, resulting in many buffers being copied on write. In such a situation, the lazy copy on write is more expensive than an eager copy on send (1-copy send).

The space tradeoff is straightforward. Each copy requires an additional buffer. Thus the two-copy mechanism (which requires three buffers) has the greatest space demand. However, there

are two areas of memory in OHM: private and shared. Shared memory has a higher cost than private because a lock must be acquired and buffers might need to be guarded. Because the two-copy mechanism only uses a shared buffer for intermediate storage (with a short lifetime), it has little shared memory pressure. This reinforces the general understanding about message passing: avoid copying big messages.

The final tradeoff is source program complexity. Copy semantics is trivial to support with two-copy. Using an implicit COW mechanism, it can be supported with any reduced-copy mechanism, but this requires OS support. The explicit COW mechanism (which inserts checks into the source code) is much cheaper than the implicit one. However, it cannot always be done and requires analysis of the source code. Another part of our research that is beyond the scope of this paper is an analyzer and transformer that modifies Java bytecode. It automatically creates more efficient hybrid message passing programs from a standard message passing program. Although the prototype described in this paper can be used stand-alone, it was created to be a backend for this transformer.

3 OHM Implementation

This section describes a prototype implementation of Optimized Hybrid Messaging (OHM) in Java. We create a new Java class, Hybrid, that is a drop-in replacement for the native Java communication classes in java.net. Thus converting from a native Java program to a Hybrid program can be as trivial as modifying one import statement. While this will create a hybrid program that is more efficient, OHM has more to offer—OHM is intended to be the target of a compiler or program optimizer. Therefore, the Hybrid interface has additional methods.

This research uses Java for several reasons, which are explained below. But before discussing the implementation, we first describe some important characteristics of Java. As Java is an object-oriented language, a class is the unit of encapsulation in Java. Therefore, it is natural to implement our prototype as a class. Hybrid consists of both Java and C code. The C code is the lower-level code that interfaces directly with the operating system. This C code is incorporated into the Hybrid class (and seamlessly merged with the Java code) using JNI, the Java Native Interface [JNI].

Java source is compiled into bytecode, which is then executed by a Java VM (virtual machine). An OHM program creates a Java VM (JVM) for each processor; therefore, a multiprocessor node will have more than one JVM executing concurrently. The Hybrid class provides efficient messaging between these processes, executing a distributed-memory program on a shared-memory node. OHM creates a section of the virtual memory in each JVM that is shared by all processes. This is contrasted with a shared-memory program that uses a multi-threaded JVM. In the latter case, there is one JVM process with multiple threads that communicate using the memory of the process—which is shared among the threads.

The JVM for OHM is slightly modified to provide shared memory. Each JVM maps the same shared-memory heap into its address space. This heap contains all the data associated with the local message passing mechanisms, including message queues (for passing messages) and message buffers.

Our prototype is implemented in Java 1.2.2. The Hybrid class has been ported to Linux 2.4 and Solaris (SunOS) 5.8. Because Java is (mostly) architecture and OS independent, there are no changes in the Java code. At present there are no differences in the C code because the operating

systems are both Unix-based. However, it is expected that the minor differences in the implementations of sockets between Linux and Solaris will change this.
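As a rough illustration of how the Java and C layers might meet, the following sketch declares JNI native methods for attaching to and allocating from the shared heap; the class name, method names, and library name are hypothetical, not the actual Hybrid interface.

    // Hypothetical sketch of the JNI boundary between the Java and C layers.
    // The names below are illustrative; they are not the Hybrid interface.
    public final class HybridNative {
        static {
            System.loadLibrary("hybrid");   // e.g., libhybrid.so built from the C sources
        }

        // Map the node-wide shared segment into this JVM; returns an opaque handle.
        native long attachSharedHeap(long bytes);

        // Allocate 'size' bytes out of the shared heap; returns an offset/handle
        // that the Java side wraps as a message buffer.
        native long allocShared(long heapHandle, int size);
    }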

3.1 Shared-Memory Message Passing

Hybrid provides streaming, point-to-point communication, just as Java sockets do. Hybrid communication consists of two elements: queues and buffers. Each bi-directional message channel between two processes has a pair of queues—one for each direction of communication. A queue holds message headers that include a reference to a buffer holding the message data. In this way, a send inserts into the queue and a receive removes from it.

A queue in Hybrid is a fixed-size array of message headers. We use a fixed size for speed and simplicity. The downside is an increased potential of blocking a sender when the queue is full. In theory there could be such blocking, but in practice a reasonable SPMD program will never have a large number of pending messages in a queue. Because there is a queue for each direction, for each queue there is exactly one process that inserts all messages and exactly one process that removes all messages. This allows for highly efficient synchronization. Each queue has a head and a tail index. The sender increments the tail index (modulo the queue size) when it inserts. Because the receiver only ever reads this index, there is no need for the sender to acquire a lock before incrementing. Similarly, the receiver exclusively modifies the head index. Consequently, in the common case, messages are passed between processes without a system call.

A send blocks when the queue is full and a receive blocks when it is empty. The queue is empty when the head equals the tail; it is full when the head equals one more than the tail (modulo queue length). Neither situation should be considered bad. A sender that gets too far ahead likely needs to be throttled back. Our prototype uses busy-wait blocking between polling. Each time a process detects a blocking situation, it waits a small amount of time before re-checking. For an SPMD program with one process per processor, there is essentially no benefit in a blocked process yielding the processor because there is no other application that can run. Therefore, this implementation will continue to wait and poll until the blocking condition is eliminated.

The second element in OHM is a shared-memory message buffer, which is a chunk of dynamically allocated shared memory. OHM creates one section of shared memory per node that is used by all JVMs running on that node. The alternative is to create distinct shared memories between each pair of JVMs. The advantage of the alternative is that interference is reduced, as only two processes ever access this memory. The disadvantage is that it creates p(p − 1)/2 distinct shared memories, where p is the number of processes, which greatly complicates the implementation and increases copying because buffers cannot be shared between more than two processes. Our implementation trades off (potentially) increased interference for simplicity and (potentially) decreased copying.

OHM uses a two-level free list to manage the allocation of shared memory. There is one global free list per node and a local free list for each process. Processes acquire large chunks of shared memory from the global free list. Individual buffers are allocated out of the local free lists. Most allocations and deallocations are done on the local free lists, which do not require locking. There are very few expensive global operations, which require locking. Locking of the global free list is

done using semaphores provided by the operating system.

In the programs tested in this paper, the global free list is accessed only during the initialization of the program. This is because the programs have “balanced” communication; that is, they send and receive equal amounts of data. Therefore, in the steady state, shared memory buffers will be enqueued to and dequeued from local free lists. Such balance is not uncommon in SPMD programs. In unbalanced programs, however, the global free list will be accessed throughout the program's lifetime. In such a program, the overhead of managing shared memory buffers will be greater than experienced in the programs described in this paper. However, this cost will still be low because most operations will be on the local free list.

The native Java classes that establish streaming communication are Socket and ServerSocket. A matching pair of instances of these classes are mated to form a TCP channel. The Hybrid implementation determines the location of the process on the other end of the channel. If the process is on a different node, then a native Java channel is created, and the appropriate methods are dynamically bound in this object. Otherwise, a Hybrid channel is created with the appropriate method bindings. In Hybrid, a communication channel consists of a pair of queues, one each for incoming and outgoing messages.
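The queue just described is a single-producer, single-consumer ring buffer. A minimal in-process Java sketch follows; AtomicInteger stands in for index words that would live in the shared segment, and the names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of the single-producer/single-consumer queue described above: a
    // fixed-size circular array of message headers with a head index and a
    // tail index. Exactly one process inserts and exactly one removes, so no
    // lock is needed.
    final class MessageQueue<T> {
        private final Object[] slots;
        private final AtomicInteger head = new AtomicInteger(0); // next slot to remove
        private final AtomicInteger tail = new AtomicInteger(0); // next slot to fill

        MessageQueue(int capacity) { slots = new Object[capacity]; }

        // Sender side: returns false when full, so the caller backs off and retries.
        boolean offer(T header) {
            int t = tail.get();
            int next = (t + 1) % slots.length;
            if (next == head.get()) return false;   // full: one slot is kept empty
            slots[t] = header;
            tail.set(next);                          // publish only after the slot is written
            return true;
        }

        // Receiver side: returns null when empty, so the caller polls again.
        @SuppressWarnings("unchecked")
        T poll() {
            int h = head.get();
            if (h == tail.get()) return null;        // empty: head equals tail
            T header = (T) slots[h];
            slots[h] = null;
            head.set((h + 1) % slots.length);
            return header;
        }
    }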

3.2 Message Passing Mechanisms

OHM provides both one- and zero-copy send and receive mechanisms. The 1-copy send method copies a message into a shared buffer. The method prototype and its basic use are shown below.

    Socket S;
    Object data = new Object();
    data = ...;
    S.send(data);

S is an open socket between two processes and data is a reference to an arbitrary Java object. First, the method must allocate a shared buffer and copy message data into this buffer. Next, the method acquires a message header in the outgoing queue. It modifies a few fields in the header; in particular, it sets a pointer to the shared buffer. As stated above, this operation does not require locking and only blocks when the outgoing queue is full—which is an extremely uncommon case.

The other send mechanism is 0-copy send, which is similar to the above. However, rather than allocating a shared buffer and copying data, this method sets the pointer field in the queue element to point to the message data (i.e., a pointer to the message data sent as a parameter). Because there is one physical copy—but two logical copies—of the message buffer, OHM uses a reference counter for every shared buffer. Before enqueuing the message header into the outgoing queue, the 0-copy send method increments the counter—indicating that there is an additional logical copy of the buffer. Both send methods use the identical interface (i.e., the same parameters).

The two receive methods also share the same interface, shown below.

    Socket S;
    Object data;
    S.receive(data);
    use(data);

As before, S is an open socket between two processes and data is a reference to an arbitrary Java object. The 1-copy receive method blocks if its incoming queue (which is the sender's outgoing queue) is empty. As with sending, this is a busy wait. When the incoming queue is not empty, the receiver dequeues a message header. It copies message data from the shared buffer, using a field in the message header, into a newly allocated buffer. Because the receiver is finished with its logical copy of the message data, it decrements the reference counter. If the count is zero, the buffer is placed on the receiver's local free list.

The 0-copy receive mechanism uses the same interface as above, but there are a few differences in the implementation. First, the reference to the buffer (data) is set to point to the shared data. Second, no buffer is allocated. Third, the reference counter is not decremented.

Hybrid supports copy-on-write explicitly. The explicit method uses a conditional before every first write following a send or receive. It checks the reference count for the buffer and makes a copy if the reference count is greater than one. This method requires additional code to be inserted. Because our system is intended to be the target of a compiler, this requirement is not onerous. This explicit method has less overhead than the implicit method; in particular, it does not require operating system support. However, it is not possible to determine all paths in the code. Therefore, we cannot guarantee that all first accesses will be guarded in general. While an implicit COW does not have this restriction, it requires assistance from the system.

An implicit COW uses the mprotect system call and a signal handler. The message buffer is write-guarded with mprotect, which raises a signal when the buffer is written. The signal handler copies the message data to a new buffer if the reference count is greater than one. This implicit mechanism is not yet incorporated into Hybrid because of the high cost of guarding data. In the time required to execute an mprotect system call, many thousands of bytes can be copied. So this implicit mechanism is only rational for very large messages.
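A sketch of the 0-copy send paired with the two receive variants follows; the ZeroCopyChannel and RefBuffer names, and the in-process queue standing in for the shared-memory structures, are illustrative assumptions.

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a 0-copy send paired with either receive variant. A reference
    // count tracks the logical copies of each shared buffer.
    final class ZeroCopyChannel {
        static final class RefBuffer {
            final byte[] data;
            final AtomicInteger refCount = new AtomicInteger(1);
            RefBuffer(byte[] data) { this.data = data; }
        }

        private final BlockingQueue<RefBuffer> queue = new ArrayBlockingQueue<>(16);

        // 0-copy send: no data copy; record the extra logical copy and enqueue.
        void send(RefBuffer buf) throws InterruptedException {
            buf.refCount.incrementAndGet();
            queue.put(buf);
        }

        // 1-copy receive (receiver-copy): copy out, then release the shared buffer.
        byte[] receiveCopy() throws InterruptedException {
            RefBuffer buf = queue.take();
            byte[] priv = Arrays.copyOf(buf.data, buf.data.length);
            if (buf.refCount.decrementAndGet() == 0) {
                // the buffer would be returned to the local free list here
            }
            return priv;
        }

        // 0-copy receive: alias the shared buffer; the count stays raised until
        // a later write triggers the copy-on-write guard or the buffer is freed.
        RefBuffer receiveAlias() throws InterruptedException {
            return queue.take();
        }
    }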

3.3 Using OHM

There are two memory regions from which a message buffer can be allocated: private and shared. The private region is ordinary heap memory that is exclusive to the JVM. The shared region is used by all JVMs local to the machine; it is created using a system call. Because the message data is communicated through shared memory, the data must be in a shared buffer at the conclusion of a send.

In a 1-copy send, the source buffer may be in either private or shared memory because the mechanism will copy the data into a newly allocated shared memory buffer. However, in a 0-copy send, the source buffer must begin in shared memory. Therefore, in order to use the zero-copy send mechanism, the program must be modified to create message data in shared memory.

There are complementary requirements for the receive methods. A 1-copy receive may copy into a buffer in either memory region. A 0-copy receive does not have to create an object in a shared memory buffer; rather, it must be able to use a shared memory buffer. This is easily accomplished because Java objects are accessed by reference. In either case, the receive method provides a reference to a “new” message buffer.

As stated above, OHM is intended as the target of a compiler. However, it can be used directly. In either case, the following modifications must be made to convert an “ordinary” message passing program into a Hybrid program.

9 1. Insert import statements.

2. Change each native send method to a Hybrid send method.

3. Change each native receive method to a Hybrid receive method.

4. Pre-allocate objects for 0-copy send.

5. Guard message data against write after each 0-copy send or receive.

If the first three actions are done and only 1-copy mechanisms are used, there will be a large benefit. This is very easy to do (manually or automatically). Although the resultant OHM program will be much faster than another message passing program, it is slower than a threaded program on a shared-memory node. To get the full benefit of OHM, and to get performance equivalent to a threaded program, one must use the 0-copy mechanisms. This requires the last two actions, which require analysis and program modifications; a sketch of the resulting code shape follows.
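The fragment below is a hypothetical illustration of a converted program; the HybridChannel and SharedHeap interfaces and the allocShared, sendZeroCopy, and ensureExclusive names are illustrative stand-ins, not the published Hybrid interface.

    // Hypothetical shape of a program fragment after conversion to Hybrid.
    // All names below are illustrative stand-ins.
    final class ConvertedFragment {
        interface HybridChannel { void sendZeroCopy(double[] buf); }
        interface SharedHeap {
            double[] allocShared(int len);
            double[] ensureExclusive(double[] buf);
        }

        static void exchangeRow(SharedHeap heap, HybridChannel ch) {
            double[] row = heap.allocShared(1500);   // step 4: build message data in shared memory
            java.util.Arrays.fill(row, 1.0);
            ch.sendZeroCopy(row);                    // steps 2/3: Hybrid 0-copy send (receive is symmetric)
            row = heap.ensureExclusive(row);         // step 5: guard before the first write after the send
            row[0] = 0.0;                            // safe: this process now owns its copy
        }
    }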

4 Results

This section presents three different tests. First, we show the relative performance of standard message passing mechanisms and compare them to the OHM mechanisms. Next, Java applications using native communications are compared to Java applications using our Hybrid class (which uses the OHM mechanisms). Last, we compare distributed-memory, message passing Java programs using Hybrid to shared-memory, thread-based Java programs. We do not present any tests between nodes in a cluster because our prototype uses TCP for remote communication; because our system does nothing special between nodes, such a test would not measure our work.

There are two testing platforms. One is a Sun SPARCstation with 4 processors and 32 GB of memory; it is a xx MHz machine and runs Solaris. The other is a 4-way Intel machine running Linux.

4.1 Message Passing Mechanisms

Figure 1 shows the cost of five message passing mechanisms relative to the size of the message. The results shown are the time in microseconds for a round-trip message on our Linux platform. The two-copy mechanism copies into an intermediate buffer on send and then into a destination buffer on receive. There are two instances of COW tested. One is the cost when the data is subsequently copied (i.e., the data was copied on write). The other measures the cost when there is no copy. We also tested two native message passing mechanisms, AF_INET, which is TCP, and AF_UNIX, which is IPC.

The COW mechanism that does not copy has a cost that is independent of message size and very small to begin with; its overhead is almost negligible. COW-copy, which performs a single, on-demand copy, is approximately one-half the cost of 2-copy. TCP is approximately ten times slower than 2-copy on large messages. Unix IPC is slower than TCP on large messages because TCP is a native method and we implemented Unix IPC using the JNI interface. While JNI methods are C routines that run natively on the machine, they must interface to the

10 6 10 2−copy cow−copy 5 10 cow−no inet unix

4 10

3 10

2 10

1 10

0 10

−1 10 Time(microseconds per roundtrip)

−2 10

−3 10 1 2 3 4 5 6 10 10 10 10 10 10 message size(bytes)

Figure 1: Pingpong test.

Java VM. The JNI-Java API copies its parameters; therefore, the Unix IPC test has an extra copy. If this copy is subtracted, then Unix IPC properly sits between TCP and 2-copy.

4.2 Distributed-Memory Applications

First, we look at the message passing times and verify that OHM is more efficient than other messaging mechanisms. Figure 2 shows results from our Sun platform. The data size for these programs is 1500 × 1500 per process. The tests scale the size of the problem with the number of processes, so the test size has 1500p rows, where p is the number of processors. The width of the problem is kept the same so that the size of messages is the same for both the 2- and 4-process tests. The message size in this test is 1500 8-byte doubles. (Figure 2 does not show 1-processor times because such a test does not use messages.) The processes communicate in a 1-dimensional layout; therefore, to an individual process there is no change as processes are added. Recall that each process manages its own private free lists. Because the communication is balanced (each process sends exactly as much as it receives), processes never have to access the global free list during the computation. Moreover, there is no interference (due to write sharing) in this test.

Tests were run using Unix IPC and TCP, but the results are much worse than OHM. The TCP times shown in Figure 2 are scaled by one-tenth. Unix IPC was faster than TCP, but always more than an order of magnitude worse than OHM. Looking at Figure 1, we see that for a message of 1500 doubles, the IPC curve is still below TCP. We did not compare directly to MPI because MPI provides a vastly different interface. There is substantially more cost to sending an MPI message than a Hybrid message. However, reading the literature we see that hybrid versions of MPI claim results similar to ours.

Figure 2: Message passing Jacobi iteration test. (Running time in seconds, log scale, versus number of processes for the zero-copy, two-copy, unix, and tcp versions.)

                     Threads          Zero-copy        Two-copy
  Processes/threads  secs (speedup)   secs (speedup)   secs (speedup)
  1                  57.5 (1.00)
  2                  58.3 (1.97)      57.7 (1.99)      60.6 (1.90)
  4                  59.2 (3.89)      64.1 (3.59)      69.0 (3.33)

Table 1: Jacobi iteration (times in seconds, speedup in parentheses).

There are two overheads in sending messages. Comparing the two-copy and zero-copy results shows the savings from avoiding a copy. Using the two-copy mechanism is 5% and 8% slower than zero-copy, for 2 and 4 nodes, respectively. We cannot directly determine the savings due to avoiding a system call because Unix IPC is not native to Java. However, even if it were, the implementation of message passing in the kernel is significantly different from our implementation. Therefore, the comparison would be dubious.

4.3 Message Passing vs. Threads

Now we compare hybrid message passing to a shared-memory program that uses threads, to see whether OHM is competitive with threading on a shared-memory machine. Table 1 shows running times of three versions of Jacobi iteration on our Sun platform. The times for the OHM programs are the same as those shown in Figure 2. The shared-memory program uses native Java threads and the JVM is multi-threaded; its times are shown in the left column of Table 1.

All programs use the same application kernel. Therefore, the major differences in time can be attributed to what happens between iterations. In the threaded program, a barrier is executed. In the message passing program, messages containing rows are sent to and received from each neighbor. Moreover, in the zero-copy program, message data is passed by exchanging references. The messages are sent using COW; because there is no write interference, no copy is ever made. Because

             Threaded         OHM
  Processes  secs (speedup)   secs (speedup)
  1          59.9 (1.00)      61.5 (0.974)
  2          28.3 (2.11)      31.1 (1.92)
  4          13.7 (4.37)      16.2 (3.70)

Table 2: 2-D FFT, 2048 × 2048 (times in seconds, speedup in parentheses).

             Threaded         OHM
  Processes  secs (speedup)   secs (speedup)
  1          40.1 (1.00)
  2          40.7 (1.97)      40.4 (1.99)
  4          50.1 (3.20)      49.6 (3.23)

Table 3: Jacobi, 1024p × 1024 (times in seconds, speedup in parentheses).

the rows are created in shared memory, the message passing program never copies data. Thus the difference between the two programs is the cost of exchanging messages versus the cost of a barrier. Interior processes execute four message operations (2 sends and 2 receives). Each thread executes a barrier, which is implemented as a Java synchronized method.

In this program, one would expect the threaded program to perform very well. And it does, with speedups of 1.97 and 3.89 on 2 and 4 processors, respectively. The OHM programs also have good speedups. Interestingly, the 2-way zero-copy program is faster than the 2-way threaded program, but for the 4-way tests it is the other way around. In the message passing program, the interior nodes send (and receive) two messages each iteration and the edge nodes only send one message. Because there are no interior nodes in the 2-way message passing program, the cost of synchronization between iterations is approximately half what it is in the 4-way test. In the threaded program, the theoretical cost of the barrier also doubles from 2 to 4 processors. However, in practice there is a high fixed cost for a barrier synchronization.

The second test is a 2-dimensional FFT. This program performs N row FFTs in parallel, with an equal number of rows distributed to each process. Then it transposes the 2-D array and performs N more row FFTs. It does the transpose rather than perform column FFTs with data striped across all processes. During the transpose, each process sends a message to every other process containing an n/p × n/p block of the array. Each process transposes the block upon receipt. Because our FFT implementation requires n to be a power of two, we use a fixed-size problem for all tests.

Table 2 shows results from our Sun platform. We make two observations. First, the message passing program runs almost 3% slower on one node even though the program kernels are identical. The only tangible difference between the two programs is that the OHM program allocates its array in shared memory, using OHM buffer management methods. Second, the threaded program scales better; in fact it scales more than linearly. The transpose phase is very communication intensive. While both use the same hardware for communication, the threaded program is able to perform the transpose slightly faster.

Our Linux platform tells a different story. While memory latency grows under contention in

all systems, it is particularly noticeable on our Linux platform. Table 3 shows times and speedups for the threaded and OHM programs on Linux. The overall speedup is less than on the Sun. Moreover, the overhead is greater in the threaded program than in the message passing program. This suggests that differences in the memory hierarchy are as important to performance as the differences between the threaded and OHM versions. We take this as an indication that the performance of OHM is indeed comparable with that of a threaded program on a shared-memory machine.

5 Related Work

There is a significant amount of work on parallel programming on clusters of multiprocessors. There are three primary options: use solely message passing, use solely shared memory, or use a hybrid programming model. This section describes these in turn.

One way to write a program for a multiprocessor cluster is to use message passing between and within the nodes. MPI [MPI96] can be used for this; the problem is that message passing is less efficient than shared memory within a node because of excessive copying. Several groups have designed MPI implementations that reduce copying using shared-memory buffers within a node. These include [TSY99], who showed significant improvements compared to a native MPI implementation. The problem with these implementations is that they lack semantic information that can lead to further optimizations. With the Grinder, OHM is provided with semantic information that enables such optimizations; for example, when there are no more references to an object, it can be transferred to another node.

It is also possible to use MPI between nodes and a shared-memory model within nodes. The shared-memory model can be a thread model or a higher-level model such as OpenMP [opea]. While this can deliver high performance, it forces a two-level programming model, which forces users to learn two completely different sets of concepts. Instead, we provide a single programming model with OHM.

Another way to write programs for multiprocessor clusters is to use distributed shared memory systems (e.g., [KDCZ94, LH89b]). Such systems provide a uniform shared-memory programming model. Systems that provide this include an OpenMP port with an SDSM [HLCZ00], Strings [RC98], SoftFLASH [ENCH96], MGS [YKA96], and HyFi [RL02]. The OpenMP port [Opeb] uses POSIX threads within a node for portability and the TreadMarks SDSM for implicit inter-node communication. Strings [RC98] is a high-performance SDSM for symmetric multiprocessor clusters based on the Quarks SDSM [SSC98]. SoftFLASH [ENCH96] is a kernel-level DSM implementation. MGS [YKA96] was one of the first systems to explore coupling small- to medium-scale shared memory multiprocessors through software to synthesize larger, logically shared memory systems. HyFi [RL02] supports iterative and fork-join parallelism, along with the Filaments SDSM. Other systems use processes to implement parallelism within a node. Each of these processes has a separate address space but has shared memory regions mapped between the processes. These systems include Cashmere-2L [SDH+97] and HLRC-SMP [SBIS98]. Unfortunately, while DSM systems provide a uniform programming model, the implementation is less efficient than a message-passing model. This overhead can be significant for many classes of applications. OHM will deliver the performance of message passing between nodes.

Our work is loosely based on copy-on-write [RTY+87], which was originally developed for efficient copying inside of an operating system. The virtual memory system was used to mark

pages read-only instead of copying, with both address spaces mapping the data. If one process writes, a physical copy is made. One way this was used was in implementing an efficient version of fork.

6 Conclusion

This paper discusses the hybrid message passing model and argues its advantages. It then presents a prototype implementation that decreases the cost of local communication and is competitive with threaded programs on a shared-memory machine.

There are several directions for future work. First, copying is a major source of overhead in remote message passing systems. The mechanisms described here reduce the cost of copying messages between two address spaces that belong to user processes. The same mechanisms can reduce the cost of copying messages between a user's address space and the kernel. We plan to extend this interface to reduce the cost of remote message passing, i.e., we will use an interface like VIA. Writing efficient programs using VIA is not simple; however, the details and difficulties can be hidden in the Hybrid class methods.

Second, COW is only one of many reduced-copy optimizations. We are investigating several other protocols, including some of our own design.

Last, we are building a Java bytecode analyzer and transformer, called the Grinder, that optimizes programs for message passing. It takes the bytecode of an ordinary message passing program, analyzes it, and emits new bytecode. This bytecode uses the Hybrid class; more than that, the program is transformed to use it more efficiently.

References

[ENCH96] Andrew Erlichson, Neal Nuckolls, Greg Chesson, and John Hennessy. SoftFLASH: Analyzing the performance of clustered distributed virtual shared memory. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.

[HLCZ00] Y. Charlie Hu, Honghui Lu, Alan L. Cox, and Willy Zwaenepoel. OpenMP for networks of SMPs. Journal of Parallel and Distributed Computing, 60(12):1512–1530, 2000.

[JNI] Java Native Interface specification. http://java.sun.com/products/jdk/1.1/docs/guide/jni/spec/jniTOC.doc.html.

[KDCZ94] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.

[LH89a] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.

[LH89b] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.

[MPI96] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, 1996.

[opea] OpenMP homepage. http://openmp.org.

[Opeb] OpenMP. http://openmp.org.

[RC98] Sumit Roy and Vipin Chaudhary. Strings: A high-performance distributed shared memory for symmetrical multiprocessor clusters. In Proceedings of HPDC, July 1998.

[RL02] Subramanian Ragavan and David K. Lowenthal. HyFi: Architecture-independent parallelism on a cluster of multiprocessors. October 2002.

[RTY+87] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 31–39, Palo Alto, California, 1987.

[SBIS98] R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-based SVM protocols for SMP clusters: Design and performance. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.

[SDH+97] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 170–183, October 1997.

[SSC98] M. Swanson, L. Stroller, and J. B. Carter. Making distributed shared memory simple, yet efficient. In Proc. of the 3rd Int'l Workshop on High-Level Parallel Programming Models and Supporting Environments, 1998.

[TSY99] Hong Tang, Kai Shen, and Tao Yang. Compile/run-time support for threaded MPI execution on multiprogrammed shared memory machines. In Principles and Practice of Parallel Programming, pages 107–118, 1999.

[YKA96] Donald Yeung, John Kubiatowicz, and Anant Agarwal. MGS: A multigrain shared memory system. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
