
Integrating Message-Passing and Shared-Memory: Early Experience

David Kranz, Kirk Johnson, Anant Agarwal, John Kubiatowicz, and Beng-Hong Lim
Laboratory for Computer Science, Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using the hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the thread scheduler, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.

1 Introduction

Researchers generally agree that it is important to support a shared-address space or shared-memory programming model: programmers should not bear the responsibility for orchestrating all interprocessor communication through explicit messages. Implementations of this programming model can be divided into two broad areas depending on the target architecture. With message-passing architectures, the shared-address space is typically synthesized by low-level system software, while traditional shared-memory architectures offload this functionality onto specialized hardware.

In the following section, we discuss some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. We provide a brief overview and comparison of conventional implementation strategies for message-passing and shared-memory architectures. We argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of fundamental importance, because it affords low-overhead, fine-grained data sharing. We point out, however, that some types of communication are handled less efficiently via shared-memory than they might be via message-passing. Building upon this, we suggest that it is reasonable to expect that future large-scale multiprocessors will be built upon what are essentially message-passing communication substrates. In turn, this suggests that multiprocessor architectures might be designed such that processors are able to communicate with one another via either shared-memory or message-passing interfaces, using whichever is likely to be most efficient for the communication in question.

In fact, the MIT Alewife machine [1] does exactly that: interprocessor communication can be effected either through a sequentially-consistent shared-memory interface or by way of a messaging mechanism as efficient as those found in many present-day message-passing architectures. The bulk of this paper relates early experience we've had integrating shared-memory and message-passing in the Alewife machine. Section 2 overviews implementation schemes for shared-memory. Section 3 describes the basic Alewife architecture, paying special attention to the integration of the message-passing and shared-memory interfaces. Section 4 provides results comparing purely shared-memory and message-passing implementations of several components of the Alewife runtime system and two applications. We show that a hybrid scheduler utilizing both shared-memory and message-based communication performs up to a factor of two better than an implementation using only shared-memory for communication. Section 5 discusses related work. Finally, Section 6 summarizes the major points of the paper and outlines directions for future research.

1.1 Contributions of this Paper

In this paper, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms for interprocessor communication. We argue that efficient support for fine-grained data sharing is fundamental, pointing out that traditional message-passing architectures are unable to provide such support, especially for applications with irregular and dynamic communication behavior. Similarly, we point out that while shared-memory hardware can support such communication, it is not a panacea either; we identify several other communication styles which are handled less efficiently by shared-memory than by message-passing architectures. Finally, we (i) describe how shared-memory and message-passing communication is integrated in the Alewife hardware and software systems; (ii) present some preliminary performance data obtained with a simulator of Alewife comparing the performance of software systems using both message-passing and shared-memory against implementations using exclusively shared-memory; and (iii) provide analysis and early conclusions based upon the performance data.

2 Implementing a Shared-Address Space

Implementations of a shared-address space programming model can be divided into two broad areas depending on the underlying system architecture. On message-passing architectures, a shared-address space is typically synthesized through some combination of compilers and low-level system software. In traditional shared-memory architectures, this shared-address space functionality is provided by specialized hardware. This section gives a brief comparison of these implementation schemes and their impact on application performance. We conclude that the efficient fine-grained data sharing afforded by hardware implementations is of fundamental importance for some applications, but also point out that some types of communication are handled less efficiently via shared-memory hardware than they could be via explicit message-passing. In either case, we assume that the physical memory is distributed across the processors.

    shared-address-space-reference(location)
      if currently-cached?(location) then
        // satisfy request from cache
        load-from-cache(location)
      elsif is-local-address?(location) then
        // satisfy request from local memory
        load-from-local-memory(location)
      else
        // must load from remote memory; send remote
        // read request message and invoke any actions
        // required to maintain cache coherency
        load-from-remote-memory(location)

Figure 1: Anatomy of a shared-address space reference.

2.1 Anatomy of a Memory Reference

When executing a multiprocessor application written assuming a shared-address space programming model, the actions indicated by the pseudocode in Figure 1 must be taken for every shared-address space reference. The pseudocode assumes that locally cached copies of locations can exist. (If shared-data caching is not supported, the first test is eliminated.) The most important parts of the code in Figure 1 are the two "local/remote" checks.

The local/remote check is the essence of the distinction between shared-memory and message-passing architectures. In the former, the instruction to reference memory is the same whether the object referenced happens to be in local or remote memory; the local/remote checks are facilitated by hardware support to determine whether a location has been cached (cache tags and comparison logic) or, if not, whether it resides in local or remote memory (local/remote address resolution logic). If the data is remote and not cached, a message will be sent to the remote node to access the data. Although the first test is unnecessary if local caching is disallowed, the second test is still required to implement a shared-address space programming model. Because shared-memory systems provide hardware support for detecting non-local requests and sending a message to fetch the data from the location, a single instruction can be used to access any shared-address space location, regardless of whether it is already cached, resident in local memory, or resident in remote memory.

Message-passing architectures, on the other hand, do not provide hardware support for these local/remote checks. Hence, they cannot use a single instruction to access locations in the shared-address space. Rather, they must implement the pseudocode of Figure 1 entirely in software.

The actions specified in Figure 1 may be performed by the programmer, compiler, runtime system, or hardware. For simplicity, assume that there are two types of applications: static and dynamic. In static applications, the control flow of the program is essentially independent of the values of the data being manipulated. Many scientific programs fit into this category. In dynamic applications, on the other hand, control flow is strongly dependent on the data being manipulated. Most real programs lie somewhere between these two extremes, having some parts that are dynamic and others that are static. Data sharing in programs can also be fine- or coarse-grained.

Through a great deal of hard work, a programmer can often eliminate the overhead incurred by the local/remote checks shown in Figure 1 by sending explicit messages. The fixed costs of such operations are typically high enough, however, that for efficient execution the messages must be made fairly large. The result is that the granularity at which data sharing can be exploited efficiently is quite coarse. For static programs with sufficiently large-grained data sharing, these explicit messages are like assembly language: if done well, they always provide the best performance. However, as with assembly language, it is not at all clear that most programmers are better off if forced to orchestrate all communication through explicit messages. For static applications which share data at too fine a grain, even if the overhead related to local/remote checks can be eliminated, performance suffers because messages typically aren't large enough to amortize any fixed overheads inherent in message-based communication. It is worth noting, however, that for architectures like iWarp [4], which in effect support extremely low-overhead "messaging" for static communication patterns, this is not a problem.

Optimizations similar to those a programmer might use to eliminate the overhead of local/remote checks can be applied by a compiler. There has been a great deal of work in this area for scientific programs written in various dialects of FORTRAN and targeted at message-passing machines [5, 11, 14, 21, 24]. Of course, the constraints are the same as when the optimizations are applied by the programmer: applications must be primarily static, and only coarse-grained data sharing can be exploited efficiently.

For dynamic applications, the compiler can't help much. It is usually not possible to know a priori (at compile time) whether or not a particular reference will be to local or remote memory. In such a case, the only way to achieve a shared-address space paradigm on message-passing architectures is essentially by executing something like the code in Figure 1 for every shared-address space reference in an application. This is typically done by having a low-level software layer that synthesizes a global address space, possibly also maintaining a cache of some sort. This software layer adds significant overhead to every shared-address space reference, even when no communication is necessary. Herein lies the primary advantage of shared-memory architectures: unlike shared-address space implementations on message-passing architectures, a shared-memory architecture will perform well on dynamic programs that re-use data, because all data is automatically cached. If the data is not in the cache but is in local memory, it will be automatically fetched into the cache. To get good performance on a message-passing architecture, the programmer or compiler must handle caching explicitly in order to generate the proper code, i.e. the program must be static. It is for these reasons that we believe that some hardware support for shared-memory programs is important; the fundamental hardware mechanism is coherent caching of global addresses. Note that from this perspective, systems like [22] and Active Messages [23] provide just an efficient messaging interface; an implementation of a shared-address space programming model on top of such systems will face the same problems as those described above for implementations on message-passing architectures.
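To make the cost of such a software layer concrete, the sketch below shows roughly what a software-synthesized global address space must do on every read when no hardware support is available. It is an illustrative sketch only, not Alewife code or any particular system's API; the global address encoding, the software cache, and the remote_read_request helper are all hypothetical.

    /* Hypothetical global address: high bits name the home node,
     * low bits give the offset within that node's local memory. */
    typedef unsigned long gaddr_t;

    #define NODE_OF(ga)   ((int)((ga) >> 48))
    #define OFFSET_OF(ga) ((ga) & 0xffffffffffffUL)

    extern int   my_node;                            /* this node's id             */
    extern char *local_mem;                          /* base of local memory       */
    extern int   sw_cache_lookup(gaddr_t, long *);   /* software-managed cache     */
    extern void  sw_cache_fill(gaddr_t, long);
    extern long  remote_read_request(gaddr_t);       /* blocking request/reply msg */

    /* Every shared-address space read pays for these checks, even when
     * the data turns out to be local (compare Figure 1). */
    long shared_read(gaddr_t ga)
    {
        long value;

        if (sw_cache_lookup(ga, &value))             /* "currently cached?" check  */
            return value;

        if (NODE_OF(ga) == my_node)                  /* "is local address?" check  */
            return *(long *)(local_mem + OFFSET_OF(ga));

        value = remote_read_request(ga);             /* send message, await reply  */
        sw_cache_fill(ga, value);
        return value;
    }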

2.2 Defects of Shared-memory

The previous section discussed the aspects of application behavior that benefit from hardware support for shared-memory: dynamic memory reference behavior and fine-grain data sharing. The support for these behaviors has a cost, however, when the data sharing is coarse-grained or when something is known statically about how shared data will be accessed. We enumerate below three scenarios in which we expect shared-memory to be less effective than message-based communication:

Coarse granularity.  When a large chunk of data is to be shared, the cost of transferring the data from one processor to another must be incurred. The most efficient mechanism is a block transfer where all of the data is sent in a single message. If the copy is done with loads and stores of shared-memory, more network and memory bandwidth is required because of the fixed overhead associated with each shared-memory transaction. Section 4 contains several examples and suggests why it might sometimes be better to use shared-memory for copying in spite of these costs.

Known communication patterns.  Coherent caches on a multiprocessor work well when data is either read-only or is accessed many times on a single processor before being accessed by other processors. Frequent writes on different processors can cause poor performance because they make caching ineffective and lead to additional overhead due to invalidations or updates that must be sent over the network. Furthermore, in many cache coherence protocols, in order to acquire a cache line that is dirty in another processor's cache, that data must be communicated through a home or intermediate node instead of being passed directly to the requester. This is less efficient than communicating via direct point-to-point messages, as is possible when communication patterns are well known.

Combining synchronization with data transfer.  A purely shared-memory implementation typically generates separate messages for communicating synchronization events and for transferring any associated data. Such transfers often occur during producer-consumer communication. While the latency associated with data transfer can often be tolerated through mechanisms like weak ordering and prefetching, the latency associated with the synchronization itself is hard to overlap. The central problem in overlapping the latency of synchronization signals lies in the inability of the consumer to predict exactly when to request a synchronization object; premature prefetching of the synchronization object can lead to even worse performance. A message from the producer to the consumer bundling both synchronization and data, and informing the consumer of the availability of the data, is the natural mechanism in such situations. Section 4.3 provides an example of this.

We argue that to get good performance with a tractable programming model on a wide variety of applications, an architecture must integrate hardware support for both shared-memory and message-passing. The software system and/or programmer can then choose the appropriate mechanism based on cost. Such an architecture has been designed at MIT. The next section describes how shared-memory and message-passing are integrated in the Alewife machine. We then investigate the extent to which we can use Alewife's message-passing functionality to address the defects of shared-memory described above.

3 Architectural Interfaces for Messages

Most distributed shared-memory machines are built on top of an underlying message-passing substrate [1, 9, 13]. Therefore, any additional cost to support message-passing in addition to shared-memory need only be paid at the processor-network interface. In this section, we describe the manner in which Alewife integrates shared-memory and message-passing communication facilities; we provide this detail in part to demonstrate that the complexity of integrating both message-passing and shared-memory is not unreasonable. See [12] for additional details.

Figures 2 and 3 show typical interfaces provided by message-passing multicomputers and shared-memory machines, respectively.

Figure 2: A message-passing interface. (The processor is connected directly to the network.)

Figure 3: A shared-memory interface. (Shared-memory hardware is interposed between the processor and the network.)

Figure 4: Alewife's integrated interface. (The processor has direct access to the network as well as to the shared-memory hardware.)

Figure 5: Packet descriptor format. (Operands 0 through m-1, followed by address-length pairs 0 through n-1.)

In the message-passing version, the processor has direct access to the network and can send (or receive) packets directly into (or from) the network. Messages from the network are placed into queues accessible by the processor; the contents of these queues can then be accessed via conventional loads and stores or register operations, depending on whether the queues are memory- or register-mapped.

Shared-memory interfaces, on the other hand, provide a hardware layer between the network and the processor. This layer typically translates certain loads and stores into messages to other nodes; it implements the shared-memory abstraction by hiding the underlying communication layer. The presence of this layer prevents the processor from directly accessing the network.

Recognizing that shared-memory machines are built on top of a message-passing substrate, Alewife integrates direct network access with the shared-memory framework as depicted in Figure 4. In a sense, the machine allows controlled breaking of the shared-memory abstraction by the programmer, runtime system, or compiler when doing so can yield higher performance. In fact, Alewife's coherence protocol, LimitLESS directories [6], relies on the processor's ability to directly send and receive coherence packets into the network.

With Alewife's integrated interface, a message can be sent with just a few user-level instructions. A processor receiving such a message will trap and respond either by rapidly executing a message handler or by queuing the message for later consideration when an appropriate message handler gets scheduled. Scheduling and queuing decisions are made entirely in software.

The Alewife hardware provides the following high-level interface for integrating messages and coherent shared-memory to the software system. The runtime system provides other, higher-level abstractions built upon this interface (e.g. remote thread invocation and barrier synchronization) to compiler writers and application programmers.

1. Coherent shared-memory loads and stores: These operations, implemented as single processor instructions, comprise the traditional shared-memory operations. They invoke suitable coherence actions when multiple cached copies of a memory location exist.

2. Processor-to-processor message: On the source side, this operation allows a processor to write packets directly into a network queue; it is implemented using a sequence of processor instructions described below. At the destination, message arrival interrupts the processor and invokes a handler which can examine the data in the packet. It takes 5 cycles to get into the message handler. These messages can be used for efficient register-to-register communication between processors.

3. Source-and-destination-coherent data transfer: A processor executing this operation describes a region of memory on the local (source) node to be sent in a single message to some region of memory on a remote (destination) node. The data transfer is achieved in a way that leaves the source and destination caches consistent with their respective local memories. Such a transfer takes no action on copies of the source or destination data present in other caches. These messages can be used for coarse-grained bulk (memory-to-memory) data transfer and invoke DMA facilities at the source and destination.

The processor-to-processor message and the data-transfer mechanisms are both implemented using a single low-overhead interface to the network. In fact, the interface describes a generic message with the combined characteristics of processor-to-processor messages and bulk data transfer.

The network interface includes specifications at the source and at the destination. At the source, the network interface permits messages to be sent through a two-phase process: describe, then launch. Description proceeds by writing directly to registers on Alewife's network coprocessor, the Communications and Memory-Management Unit (CMMU). These writes proceed at the same speed as cached writes. The resulting descriptor, depicted in Figure 5, can be up to 16 words long. The descriptor consists of a variable number of explicit operands that will be placed at the head of the message, followed by a number of address-length pairs describing data to be taken directly from memory and concatenated to the end of the packet. The first operand must specify the destination and a message type, while the remaining words are software-defined.

Once a packet has been described, it is launched via an atomic, single-cycle instruction. The encoding of the launch instruction specifies both the number of explicit operands and the number of address-length pairs. Processor-to-processor messages specify only explicit operands, while bulk data transfer messages utilize only address-length pairs. The multiple address-length pairs allow multiple source regions of memory to be sent in a single packet. A single message with both explicit operands and multiple address-length pairs can also be sent. Both the user and supervisor are permitted to send messages.
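As a rough illustration of the describe-then-launch sequence, the sketch below shows how a short processor-to-processor message and a bulk transfer might be composed. The operation names (cmmu_write_operand, cmmu_write_addr_len, cmmu_launch) and the header encoding are hypothetical stand-ins for the actual CMMU register writes and launch instruction, which are not named here; only the overall structure (header operand, optional address-length pairs, atomic launch) follows the description above.

    /* Hypothetical wrappers around the CMMU descriptor registers and the
     * launch instruction; the names are illustrative only. */
    extern void cmmu_write_operand(int index, unsigned long word);
    extern void cmmu_write_addr_len(int index, void *addr, unsigned long len);
    extern void cmmu_launch(int n_operands, int n_addr_len_pairs);  /* atomic */

    #define MSG_WORD_PAIR 0x12     /* software-defined message types (made up) */
    #define MSG_BULK_COPY 0x13

    /* Short processor-to-processor message: a header operand naming the
     * destination and type, followed by two data operands. */
    void send_word_pair(int dest_node, unsigned long a, unsigned long b)
    {
        cmmu_write_operand(0, ((unsigned long)dest_node << 8) | MSG_WORD_PAIR);
        cmmu_write_operand(1, a);
        cmmu_write_operand(2, b);
        cmmu_launch(3, 0);       /* 3 explicit operands, no address-length pairs */
    }

    /* Bulk transfer: header operand plus one address-length pair; the CMMU
     * appends the named memory region to the end of the packet via DMA. */
    void send_block(int dest_node, void *buf, unsigned long nbytes)
    {
        cmmu_write_operand(0, ((unsigned long)dest_node << 8) | MSG_BULK_COPY);
        cmmu_write_addr_len(0, buf, nbytes);
        cmmu_launch(1, 1);       /* 1 explicit operand, 1 address-length pair */
    }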

On the destination side, for efficient reception of messages, the Alewife interface provides a 16-word sliding window into the network input queue. On reception of a message, the CMMU interrupts the processor (unless the processor has masked message interrupts) while making the first 16 words of the packet visible in the reception window. The processor can examine words within this window by reading coprocessor registers; as with the output interface, these reads complete at the speed of a cached memory access.

Once the processor has examined the packet, it can execute a special coprocessor storeback instruction to remove data from the window. User code may dispose of user-generated messages. Two separate fields are encoded directly in the storeback instruction. First is the number of words to be discarded from the head of the window. Second is the number of words (following those discarded) to be stored to memory via DMA. If this option is chosen, the processor must write the starting address for DMA to a special controller register before issuing the storeback instruction. Multiple storeback instructions can be issued for a single packet to scatter it to memory. Either of these storeback fields can contain a special "infinity" value which denotes "until the end of the packet".
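For concreteness, a message handler on the receiving side might look roughly like the following. Again, the window-read and storeback operations (cmmu_window_read, cmmu_set_dma_address, cmmu_storeback) are hypothetical names for the coprocessor accesses described above, and the packet layout is assumed to match the sender sketch earlier.

    /* Hypothetical wrappers for the reception window and the storeback
     * instruction; names are illustrative only. */
    extern unsigned long cmmu_window_read(int index);       /* word i of window */
    extern void cmmu_set_dma_address(void *addr);
    extern void cmmu_storeback(int discard_words, int dma_words);
    extern void *lookup_receive_buffer(unsigned long header);  /* hypothetical  */

    #define STOREBACK_REST (-1)   /* stands in for the "infinity" field value */

    /* Interrupt-level handler for a bulk-copy packet: examine the header,
     * then scatter the payload into a preallocated buffer via DMA. */
    void bulk_copy_handler(void)
    {
        unsigned long header = cmmu_window_read(0);   /* destination/type word */
        void *buffer = lookup_receive_buffer(header);

        cmmu_set_dma_address(buffer);
        /* Discard the one-word header, DMA the remainder of the packet. */
        cmmu_storeback(1, STOREBACK_REST);
    }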

The Alewife CMMU has been completely implemented and is in the final stages of testing. We expect to have small prototype Alewife systems (four to 16 nodes) operational sometime this year. These systems will utilize nodes clocked at 33 MHz, coherent caches, and a two-dimensional mesh interconnect.

4 Results

This section provides results comparing the performance of the shared-memory and message-passing implementations of several routines, our runtime system, and one application.

4.1 Experimental Environment

The results presented in this paper were obtained through the use of a detailed machine simulator. This simulator provides cycle-by-cycle simulation of all components of Alewife. Our original simulator implementation targeted desktop SPARC- and MIPS-based workstations; we have also developed a version of the simulator which runs on Thinking Machines' CM-5 multiprocessors. In the latter implementation, each CM-5 node simulates the processor, memory, and network hardware of one or more Alewife nodes. The CM-5 port of our simulator has proved invaluable, especially for running simulations of large Alewife systems (64 to 512 nodes).

4.2 Combining Tree Barriers

In order to implement barrier synchronization on Alewife, we use a combining tree scheme [16]. In such a scheme, a k-ary tree with n leaves is laid out across the n processors participating in the barrier such that exactly one tree leaf resides on each processor. Upon entering the barrier, a processor activates the leaf node of the tree, which in turn sends an arrival signal to its parent. Each internal node issues an arrival signal to its parent after arrival signals have been received from all its children. When the root of the tree has received arrival signals from all its children, it then issues wakeup signals to all its children. Internal nodes pass such wakeup signals on to all of their children. Once a wakeup signal arrives at a leaf node, the user thread which had been running on that processor before the barrier is allowed to continue execution.

In principle, only a single message need be sent to signal each combining tree arrival or wakeup. In a cache-coherent shared-memory implementation, arrivals and wakeups are signaled via memory writes; even in a carefully tuned implementation, these writes require several messages each. By utilizing explicit messaging, the ideal of a single message per event is readily achieved. The benefit of doing so is substantial. On a 64-processor machine, our best shared-memory barrier implementation (utilizing a six-level binary tree carefully crafted to minimize the total number of message exchanges) executes in about 1650 cycles (50 μsec); a direct message-based implementation (utilizing a two-level eight-ary tree) takes only 660 cycles (20 μsec). By comparison, typical software implementations (e.g. Intel DELTA and iPSC/860, Kendall Square KSR1) take well over 400 μsec [9].
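A message-based combining tree can be sketched as follows. This is an illustrative reconstruction, not the Alewife runtime code: send_to_node stands in for a processor-to-processor message send (as in Section 3), my_barrier_node and resume_local_thread are hypothetical, and the handlers run when such messages arrive.

    /* Hypothetical message-based combining-tree barrier node. */
    struct barrier_node {
        int parent;        /* node id of parent, or -1 at the root          */
        int children[8];   /* up to eight children (two-level 8-ary tree)   */
        int n_children;
        int arrived;       /* arrival signals seen so far in this episode   */
    };

    extern struct barrier_node my_barrier_node;   /* this processor's tree node */
    extern void send_to_node(int node, void (*handler)(void));  /* one message  */
    extern void resume_local_thread(void);

    void barrier_wakeup(void);

    /* Called directly when the local thread enters the barrier, and at
     * message-interrupt level when a child's arrival message comes in. */
    void barrier_arrive(void)
    {
        struct barrier_node *b = &my_barrier_node;

        if (++b->arrived < b->n_children + 1)       /* children + local thread */
            return;
        b->arrived = 0;                             /* reset for the next barrier */
        if (b->parent >= 0)
            send_to_node(b->parent, barrier_arrive);    /* one message per arrival */
        else
            barrier_wakeup();                           /* root: begin wakeups */
    }

    /* Runs when the parent's wakeup message arrives (or at the root). */
    void barrier_wakeup(void)
    {
        struct barrier_node *b = &my_barrier_node;

        for (int i = 0; i < b->n_children; i++)
            send_to_node(b->children[i], barrier_wakeup);   /* one message each */
        resume_local_thread();    /* let the waiting user thread continue */
    }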

4.3 Remote Thread Invocation

To invoke a thread on a remote processor, a pointer to the thread's code and any arguments must be placed atomically on the task queue of another processor. To do so via shared-memory, the invoking processor must first acquire the remote task queue lock (which requires at least one network round-trip, if not more) and then modify and unlock the queue using shared-memory reads and writes, each of which can require multiple network messages. A message-based implementation is substantially simpler: all of the information necessary to invoke the thread is marshaled into a single message which is unpacked and queued atomically by the receiving processor. In this manner, we combine synchronization and data transfer in a single message, as suggested in Section 2.2.

Figure 6: Remote thread invocation. (Timeline showing T_invoker, measured from the start of the invocation until the invoking processor is free to proceed, and T_invokee, measured from the start of the invocation until the invoked thread begins running.)

We characterize the performance of these two implementation schemes by measuring two intervals: T_invoker, the time from when the invoking processor starts the operation until it is free to proceed with other work, and T_invokee, the time from when the invoking processor starts the operation until the invoked thread begins running (see Figure 6). With our best shared-memory implementation, these times are 353 and 805 cycles, respectively (10.7 and 24.4 μsec). With the message-based implementation, both times are reduced drastically, to 17 and 244 cycles, respectively (0.5 and 7.4 μsec). Note that for both implementations, these times were measured in the context of a complete thread scheduling and migration system; T_invokee would be substantially smaller in a minimal system implementation.
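The contrast between the two schemes can be sketched roughly as follows; the helper names (acquire_lock, queue_push, send_message, enqueue_local) and the thread descriptor layout are hypothetical, and the shared-memory variant is abbreviated, but the structure mirrors the description above: a lock acquisition plus several remote reads and writes versus one marshaled message.

    /* Hypothetical thread descriptor marshaled into a single message. */
    struct thread_msg {
        void (*entry)(void *);    /* pointer to the thread's code */
        void *arg;                /* argument block               */
    };

    struct task_queue { int lock; /* ... queue storage ... */ };

    extern struct task_queue remote_queue[];   /* one per node, in shared memory */
    extern void acquire_lock(int *lock);       /* >= 1 network round-trip        */
    extern void release_lock(int *lock);
    extern void queue_push(struct task_queue *q, void (*entry)(void *), void *arg);
    extern void send_message(int node, void (*handler)(struct thread_msg *),
                             struct thread_msg *msg, unsigned long len);
    extern void enqueue_local(void (*entry)(void *), void *arg);

    void thread_invoke_handler(struct thread_msg *m);

    /* Shared-memory version: lock, update, and unlock the remote task queue;
     * each step can require one or more network messages. */
    void invoke_remote_sm(int node, void (*entry)(void *), void *arg)
    {
        acquire_lock(&remote_queue[node].lock);
        queue_push(&remote_queue[node], entry, arg);
        release_lock(&remote_queue[node].lock);
    }

    /* Message-based version: marshal everything into one message; the
     * destination's handler unpacks and enqueues the thread atomically. */
    void invoke_remote_msg(int node, void (*entry)(void *), void *arg)
    {
        struct thread_msg m = { entry, arg };
        send_message(node, thread_invoke_handler, &m, sizeof m);
    }

    /* Runs on the destination at message-arrival time. */
    void thread_invoke_handler(struct thread_msg *m)
    {
        enqueue_local(m->entry, m->arg);   /* no lock: the handler itself provides
                                              atomicity w.r.t. the local scheduler */
    }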

4.4 Bulk Data Transfer

Memory-to-memory copy of data blocks between processors, as might be used to perform buffered disk I/O or when relocating large data objects, can be implemented using either shared-memory or message-passing. Figure 7 compares the performance of three different implementations of memory-to-memory copy. The first two implementations (no-prefetching and prefetching) utilize hand-coded inner loops to copy data between buffers using doubleword (eight-byte) loads and stores through the shared-memory interface (no-prefetching uses a simple copy loop; prefetching uses essentially the same loop but also prefetches one cache block (16 bytes) ahead). The third implementation (message-passing) copies data between buffers using a single message and the CMMU's DMA facilities.

Figure 7: Memory-to-memory copy performance; see Section 4.4 for details. (Copy time in cycles versus block size in bytes, for the no-prefetching, prefetching, and message-passing implementations.)

As expected, message-passing performs substantially better than no-prefetching or prefetching, even for relatively small block sizes. With 256-byte blocks, message-passing is roughly 1.5 and 2.4 times faster than no-prefetching and prefetching, respectively (17.3 vs. 11.7 and 7.3 Mbyte/sec). The benefit of message-passing grows with larger block sizes, reaching a peak rate with four-kilobyte blocks more than 3 and 6 times faster than no-prefetching and prefetching, respectively (55.4 vs. 16.4 and 8.6 Mbyte/sec).
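The two shared-memory inner loops can be sketched roughly as follows; the prefetch operation is shown as a hypothetical prefetch_block intrinsic, since the actual instruction sequences are not given here. The message-passing version simply describes the source buffer and launches a single bulk-transfer message, as in the Section 3 sketch.

    /* Hypothetical prefetch intrinsic: bring one 16-byte cache block toward
     * the cache without blocking. */
    extern void prefetch_block(const void *addr);

    #define CACHE_BLOCK 16   /* bytes per cache block */

    /* no-prefetching: simple doubleword copy loop over shared memory. */
    void copy_noprefetch(double *dst, const double *src, long nbytes)
    {
        for (long i = 0; i < nbytes / (long)sizeof(double); i++)
            dst[i] = src[i];
    }

    /* prefetching: same loop, but stay one cache block ahead of the loads. */
    void copy_prefetch(double *dst, const double *src, long nbytes)
    {
        long n = nbytes / (long)sizeof(double);

        for (long i = 0; i < n; i++) {
            prefetch_block((const char *)&src[i] + CACHE_BLOCK);
            dst[i] = src[i];
        }
    }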

In some situations, when an application fetches a large block of data from a remote node, it will immediately "consume" that data. A particularly simple example of this is accum, a toy application which computes the sum of a linear array of integers which resides on a remote node. Figure 8 compares the performance of shared-memory and message-passing implementations of accum. The shared-memory implementation uses a straightforward inner loop which prefetches one cache block ahead. The message-passing implementation first transfers the entire array into local memory (using the memory-to-memory copy mechanism described above) and then performs the necessary computation entirely out of local memory.

Figure 8: accum performance; see Section 4.4 for details. (Running time in cycles versus block size in bytes, for the shared-memory and message-passing implementations.)

In this case, the message-passing implementation runs substantially slower than the shared-memory implementation, ranging from roughly twice as slow for small blocks to about 1.3 times slower for large blocks. Given that the message-passing implementation serializes communication and computation, this is hardly surprising. To be fair, it might be reasonable to discount the time spent transferring data from the remote node, for during that time the local processor is idle and could be doing other useful work. Discounting this time amounts to subtracting the message-passing curve of Figure 7 from that in Figure 8; doing so provides a curve which rides just below the shared-memory curve in Figure 8. This is also not surprising, given that the "compute" phase of the message-passing implementation uses almost exactly the same inner loop as the shared-memory implementation, the only difference being that the shared-memory implementation has one additional instruction (a prefetch) per loop iteration.

From this last observation we can conclude that even if we were able to utilize the processor idle time during message transfer (perhaps by breaking large data blocks into smaller blocks and pipelining the transfer of and computation on those data blocks), the message-passing implementation might perform better than the shared-memory implementation, but only by a very small amount. We observe this effect with accum because:

- accum doesn't store the remote data it fetches for later use; array elements are only fetched such that they can be accumulated into the global sum.

- accum is a "static" application which accesses data in a particularly regular fashion; this allows data to be prefetched into the cache such that virtually all actual accesses to remote data hit in the cache, effectively hiding all communication latency.

4.5 Thread Scheduler

The Alewife runtime system is based on lazy task creation [17]. This system performs the tasks of load balancing and dynamic partitioning of programs that cannot be statically partitioned effectively. As part of this research, we have implemented a version of this runtime system that uses message-based communication in both searching for work and thread migration. When compared to the original (shared-memory only) implementation, this hybrid (shared-memory + message-passing) implementation drastically cuts down on the communication overheads incurred by thread scheduling and load balancing operations. We benchmarked both the original shared-memory and the hybrid implementations using one synthetic application (grain) and an adaptive numerical integration code (aq).

As described in [17], the grain application enumerates a complete binary tree of depth n and sums the values found at the leaves using a recursive divide-and-conquer structure. Before obtaining the value found at each leaf, a delay loop of l cycles in duration is executed. Figure 9 compares the speedup for grain obtained with the two scheduler implementations running on 64 processors for n = 12 and a wide range of l. By choosing n = 12, we assure a sufficient amount of parallelism (4,096 leaf tasks for 64 processors); by varying l we can vary the "grain size" of that parallelism.

Figure 9: grain performance on 64 processors; see Section 4.5 for details. (Speedup versus delay loop duration in cycles, for the hybrid and shared-memory-only implementations.)

For very small grain size (l = 0, corresponding to a sequential running time¹ of 7.1 milliseconds), the lower searching and thread migration overheads afforded by the hybrid implementation yield considerable benefit: grain runs almost twice as fast under the hybrid implementation (speedups of 12.0 vs. 6.3 for the hybrid and shared-memory implementations, respectively). As grain size l is increased, the relative benefit of the hybrid implementation decreases. This is as one would expect, since the fraction of time spent in "overhead" decreases with increased grain size. Even so, for relatively large grain size (l = 1000 cycles, corresponding to a sequential running time of 131.2 milliseconds), grain running under the hybrid scheme still runs around 33 percent faster than under the original shared-memory implementation (speedups of 48.6 and 36.4, respectively).

The aq application computes a numerical integration of a bivariate function over a rectangular domain. Like grain, aq utilizes a recursive divide-and-conquer structure, dividing space and recursing more deeply in those regions that are not sufficiently smooth at the current scale. Since the scale at which many integrands become sufficiently smooth can vary significantly across the domain of integration, the call tree resulting from this recursion is often relatively irregular. In the results presented here, we hold the integrand and domain of integration fixed; problem size is increased by changing the threshold for what is to be considered sufficiently smooth. Figure 10 compares the speedup for aq obtained with the two scheduler implementations running on 64 processors; sequential running time is used on the x-axis as a measure of problem size.

Figure 10: aq performance on 64 processors; see Section 4.5 for details. (Speedup versus sequential running time in milliseconds, for the hybrid and shared-memory-only implementations.)

Once again, the hybrid scheduler implementation outperforms the shared-memory implementation by roughly a factor of two for small problem sizes. As problem size (and thus grain size) is increased, the relative advantage of the hybrid implementation decreases, although for the largest problem size shown in Figure 10 (sequential running time of around 800 milliseconds), the hybrid scheduler implementation still yields over 20 percent better performance than the original shared-memory implementation.

¹ In this paper, the sequential running time of an application is taken to be the running time of that application when compiled for and run on a single node. Thus sequential running time doesn't include any scheduler or runtime system overhead. The speedup results shown in Figure 9 and Figure 10 are computed with respect to sequential running time.

4.6 Jacobi SOR

Finally, we have developed both shared-memory and message-passing implementations of jacobi, a simple block-partitioned Jacobi SOR solver. In this application, processors only communicate with one another to exchange new values for their border elements.

In the shared-memory implementation, this communication is effected through conventional shared-memory operations (without prefetching); in the message-passing implementation, all communication is accomplished through the message-based memory-to-memory copy mechanism described in Section 4.4. Figure 11 compares the performance obtained with both implementations for grid sizes of 32x32, 64x64, and 128x128 running on 64 processors.

Figure 11: Jacobi SOR performance on 64 processors; see Section 4.6 for details. (Cycles per iteration versus grid side length, for the shared-memory and message-passing implementations.)

In jacobi, the amount of data communicated with each neighboring processor on each iteration is proportional to the side length of the local grid block. Given the memory-to-memory copy and accum results from Section 4.4, it is not surprising that the performance of the shared-memory and message-passing implementations differs by such a small amount. With small grid sizes, the shared-memory implementation of jacobi performs slightly better than the message-passing implementation. Since only a small amount of data is transferred between neighboring processors when grid size is small, this follows directly from Figure 7, where we see that using shared-memory to copy small blocks of data between nodes is more efficient than using message-passing. For large grid sizes, the message-passing implementation of jacobi wins out by a small amount. In this case, much of the benefit of using messages to copy large blocks of data is masked by increasing computation-to-communication ratio: because the computation-to-communication ratio in jacobi scales with problem size, efficient communication becomes less important for large problems.
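As a rough sketch of what each iteration's communication looks like, the message-passing variant might exchange borders as follows; the grid layout, neighbor table, and send_block/wait_for_borders helpers are hypothetical, with send_block standing in for the bulk-transfer mechanism of Section 4.4.

    /* Hypothetical block-partitioned grid: each node owns an N x N block
     * surrounded by a one-element ghost ring for neighbors' border values. */
    #define N 64
    extern double grid[N + 2][N + 2];
    extern int    neighbor[4];    /* node ids: north, south, east, west; -1 if none */

    extern void send_block(int node, void *buf, unsigned long nbytes);  /* bulk msg */
    extern void wait_for_borders(void);   /* blocks until all border messages land  */

    /* One iteration's border exchange in the message-passing variant: each
     * border is gathered into a contiguous buffer and shipped as one message. */
    void exchange_borders(void)
    {
        static double edge[4][N];

        for (int d = 0; d < 4; d++) {
            if (neighbor[d] < 0)
                continue;
            for (int i = 0; i < N; i++)
                edge[d][i] = (d < 2) ? grid[d == 0 ? 1 : N][i + 1]   /* north/south row  */
                                     : grid[i + 1][d == 2 ? N : 1];  /* east/west column */
            send_block(neighbor[d], edge[d], sizeof edge[d]);
        }
        wait_for_borders();   /* message handlers fill the ghost ring as data arrives */
    }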

5 Related Work

There have been several papers comparing the performance of message-passing and shared-memory for particular applications. Martonosi and Gupta [15] compare the performance of a message-passing and a shared-memory implementation of a standard cell router. However, their comparison focused on the two programming styles, and not on the architectural mechanisms provided by the two styles. In contrast, our work focuses on mechanism, not programming style. We argue for the use of a shared-memory programming model, but consider a machine which integrates shared-memory communication mechanisms with a fast message mechanism. Our performance results compare the sole use of the shared-memory mechanism with an integrated approach that allows the use of both shared-memory and message-passing mechanisms.

To our knowledge, there are no existing machines that support both a shared-address space and a general fine-grain messaging interface in hardware. In some cases where we argue messages are better than shared-memory, such as the barrier example, a similar effect could be achieved by using shared-memory with a weaker consistency model. For example, the Dash multiprocessor [13, 10] has a mechanism to deposit a value from one processor's cache directly into the cache of another processor, avoiding shared-memory overhead. This mechanism might actually be faster than using a message because no interrupt occurs, but a message is much more general.

Some shared-memory machines have implemented message-like primitives in hardware. For example, Beck, Kasten, and Thakkar [2] describe the implementation of SLIC, a system link and interrupt controller chip, for use with the Sequent Balance system. Each SLIC chip is coupled with a processing node and communicates with the other SLIC chips on a special SLIC bus that is separate from the memory system bus. The SLIC chips help distribute signals such as interrupts and synchronization information to all processors in the system. Although similar in flavor to this kind of interface, the Alewife messaging interface is built to allow direct access into the same scalable interconnection network used by the shared-memory operations.

Another example of a shared-memory machine that also supports a message-like primitive is the BBN Butterfly, which provides hardware support for block transfers. In an implementation of distributed shared-memory on this machine, Cox and Fowler [7] conclude that an effective block transfer mechanism was critical to performance. They argue that a mechanism that allows more concurrency between processing and block transfer would make a bigger impact. It turns out that Alewife's messages are implemented in a way that allows such concurrency when transferring large blocks of data. Furthermore, the Butterfly's block transfer mechanism is not suited for more general uses of fine-grain messaging because there is no support in the processor for fast message handling.

Finally, it is worth noting that some message-passing machines also provide limited support for the shared-address space programming model. For example, the J-machine [8] provides a global name space for objects and an object name cache which speeds up (but does not eliminate) the local/remote checks suggested in Figure 1.

6 Conclusions and Future Work

In this paper, we argue that multiprocessors should provide efficient support for both shared-memory and message-passing communication styles. We contend that implementations of a shared-address space programming model on traditional message-passing architectures can only efficiently execute static applications that exhibit coarse-grained data sharing. Shared-memory architectures with hardware support for coherent caching of global data circumvent these problems, allowing efficient execution of dynamic applications and applications that share data at a fine grain.

Unfortunately, in traditional shared-memory architectures, shared-memory is the only communication mechanism available, even if a compiler, runtime system, or programmer knows that an operation could be accomplished more efficiently via explicit messaging; we identify several such scenarios.

We describe how shared-memory and message-passing communication are integrated in the Alewife hardware and software systems. We present preliminary results comparing the performance of various runtime system primitives and one complete application when implemented using exclusively shared-memory or a hybrid approach in which communication is effected through message-passing when appropriate. These results show performance gains on the order of two- to ten-fold for some of the runtime system primitives and up to a factor of two for applications running under the hybrid thread scheduler. We also show that in some cases message-based communication is not as efficient as that through shared-memory. From these observations, we draw the following general conclusions about the suggested "defects of shared-memory" of Section 2.2:

- When transferring large blocks of data between two nodes, using message-based communication to do so can yield a factor of three or more performance improvement over doing so via shared-memory, even if prefetching is used. If transferred data is consumed immediately in a regular fashion and not stored for later use (e.g. accum), judicious use of prefetching can eliminate any advantage of using messaging instead of shared-memory for data transfer.

- Message-based communication is only significantly more effective than shared-memory for applications with known communication patterns when (i) problem size can be scaled such that messages are large enough to amortize any (even modest) fixed messaging overhead and (ii) scaling problem size in such a manner does not increase the computation-to-communication ratio to the point where communication costs become insignificant (e.g. jacobi).

- When message-based communication can be used to bundle synchronization operations with data which must be operated on after successful synchronization, doing so is likely to yield substantial performance gains over shared-memory implementations of the same functionality (e.g. remote thread invocation).

As of this writing, we have only used explicit messaging in parts of the Alewife runtime system and software libraries. We plan to continue investigating further integration, including compile-time "communication optimizations" and programming systems which provide limited programmer access to both the shared-memory and message-passing interfaces. In addition, we note that a shared-object space with messages is the basis for implementing a parallel object-oriented language. In this sense shared-memory and message-passing might be integrated at the language level by integrating object-oriented and procedural programming as in T [18, 19], CLOS [3] and, more recently, Dylan [20].

7 Acknowledgments

The research reported on herein was funded in part by NSF grant # MIP-9012773, in part by DARPA contract # N00014-87-K-0825, and in part by an NSF Presidential Young Investigator Award. We would also like to thank Alan Mainwaring, Dave Douglas, and Thinking Machines Corporation for their generosity and assistance in porting our simulation system to the CM-5.

References

[1] Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. In Proceedings of the Workshop on Scalable Shared Memory Multiprocessors. Kluwer Academic Publishers, 1991. An extended version of this paper has been submitted for publication, and appears as MIT/LCS Memo TM-454, 1991.

[2] Bob Beck, Bob Kasten, and Shreekant Thakkar. VLSI Assist for a Multiprocessor. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Washington, DC, October 1987. IEEE.

[3] Daniel G. Bobrow, Linda G. DeMichiel, Richard P. Gabriel, Sonya E. Keene, Gregor Kiczales, and David A. Moon. Common Lisp Object System Specification. ACM SIGPLAN Notices, 23, September 1988.

[4] S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb. Supporting Systolic and Memory Communication in iWarp. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 70-81, June 1990.

[5] David Callahan and Ken Kennedy. Compiling Programs for Distributed-Memory Multiprocessors. Journal of Supercomputing, 2(151-169), October 1988.

[6] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 224-234. ACM, April 1991.

[7] A. Cox and R. Fowler. The Implementation of a Coherent Memory Abstraction on a NUMA Multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 32-44, December 1989. Also as Univ. of Rochester TR-263, May 1989.

[8] William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. In IFIP Congress, 1989.

[9] Thomas H. Dunigan. Kendall Square Multiprocessor: Early Experiences and Performance. Technical Report ORNL/TM-12065, Oak Ridge National Laboratory, March 1992.

[10] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, New York, June 1990. IEEE.

[11] K. Knobe, J. Lukas, and G. Steele Jr. Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines. Journal of Parallel and Distributed Computing, 8(2):102-118, 1990.

[12] John Kubiatowicz and Anant Agarwal. Anatomy of a Message in the Alewife Multiprocessor. Submitted for publication. Also available as an MIT Laboratory for Computer Science Tech Memo, 1993.

[13] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford Dash Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.

[14] J. Li and M. Chen. Compiling Communication-Efficient Programs for Massively Parallel Machines. IEEE Transactions on Parallel and Distributed Systems, 2:361-376, July 1991.

[15] M. Martonosi and A. Gupta. Tradeoffs in Message Passing and Shared Memory Implementations of a Standard Cell Router. In Proceedings of the 1989 International Conference on Parallel Processing, pages III 88-96, 1989.

[16] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[17] E. Mohr, D. Kranz, and R. Halstead. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.

[18] J. Rees and N. Adams. T: A Dialect of LISP. In Proceedings of the Symposium on Lisp and Functional Programming, August 1982.

[19] J. Rees, N. Adams, and J. Meehan. The T Manual, Fourth Edition. Technical report, Yale University, Computer Science Department, January 1984.

[20] Apple Computer Eastern Research and Technology. Dylan: An Object-Oriented Dynamic Language. Apple Computer, Inc., 1992.

[21] Anne Rogers and Keshav Pingali. Process Decomposition through Locality of Reference. In SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.

[22] Alfred Z. Spector. Performing Remote Operations Efficiently on a Local Computer Network. Communications of the ACM, 25(4):246-260, April 1982.

[23] Thorsten von Eicken, David Culler, Seth Goldstein, and Klaus Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In 19th International Symposium on Computer Architecture, May 1992.

[24] H. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A Tool for Semi-Automatic MIMD/SIMD Parallelization. Parallel Computing, 6(1), 1988.