
Integrating Message-Passing and Shared-Memory: Early Experience

David Kranz, Kirk Johnson, Anant Agarwal, John Kubiatowicz, and Beng-Hong Lim
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using the hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.
1 Introduction

Researchers in parallel computing generally agree that it is important to support a shared-address space or shared-memory programming model: programmers should not bear the responsibility for orchestrating all interprocessor communication through explicit messages. Implementations of this programming model can be divided into two broad areas depending on the target architecture. With message-passing architectures, the shared-address space is typically synthesized by low-level system software, while traditional shared-memory architectures offload this functionality into specialized hardware.

In the following section, we discuss some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. We provide a brief overview and comparison of conventional implementation strategies for message-passing and shared-memory architectures. We argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of fundamental importance, because it affords low-overhead, fine-grained data sharing. We point out, however, that some types of communication are handled less efficiently via shared-memory than they might be via message-passing. Building upon this, we suggest that it is reasonable to expect that future large-scale multiprocessors will be built upon what are essentially message-passing communication substrates. In turn, this suggests that multiprocessor architectures might be designed such that processors are able to communicate with one another via either shared-memory or message-passing interfaces, using whichever is likely to be most efficient for the communication in question.

In fact, the MIT Alewife machine [1] does exactly that: interprocessor communication can be effected either through a sequentially-consistent shared-memory interface or by way of a messaging mechanism as efficient as those found in many present-day message-passing architectures. The bulk of this paper relates early experience we've had integrating shared-memory and message-passing in the Alewife machine. Section 2 overviews implementation schemes for shared-memory. Section 3 describes the basic Alewife architecture, paying special attention to the integration of the message-passing and shared-memory interfaces. Section 4 provides results comparing purely shared-memory and message-passing implementations of several components of the Alewife runtime system and two applications. We show that a hybrid thread scheduler utilizing both shared-memory and message-based communication performs up to a factor of two better than an implementation using only shared-memory for communication. Section 5 discusses related work. Finally, Section 6 summarizes the major points of the paper and outlines directions for future research.

1.1 Contributions of this Paper

In this paper, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms for interprocessor communication. We argue that efficient support for fine-grained data sharing is fundamental, pointing out that traditional message-passing architectures are unable to provide such support, especially for applications with irregular and dynamic communication behavior. Similarly, we point out that while shared-memory hardware can support such communication, it is not a panacea either; we identify several other communication styles which are handled less efficiently by shared-memory than by message-passing architectures. Finally, we (i) describe how shared-memory and message-passing communication is integrated in the Alewife hardware and software systems; (ii) present some preliminary performance data, obtained with a simulator of Alewife, comparing the performance of software systems using both message-passing and shared-memory against implementations using exclusively shared-memory; and (iii) provide analysis and early conclusions based upon the performance data.

2 Implementing a Shared-Address Space

Implementations of a shared-address programming model can be divided into two broad areas depending on the underlying system architecture. On message-passing architectures, a shared-address space is typically synthesized through some combination of compilers and low-level system software. In traditional shared-memory architectures, this shared-address space functionality is provided by specialized hardware. This section gives a brief comparison of these implementation schemes and their impact on application performance. We conclude that the efficient fine-grained data sharing afforded by hardware implementations is of fundamental importance for some applications, but also point out that some types of communication are handled less efficiently via shared-memory hardware than they could be via explicit message-passing. In either case, we assume that the physical memory is distributed across the processors.

    shared-address-space-reference(location)
      if currently-cached?(location) then
        // satisfy request from cache
        load-from-cache(location)
      elsif is-local-address?(location) then
        // satisfy request from local memory
        load-from-local-memory(location)
      else
        // must load from remote memory; send remote
        // read request message and invoke any actions
        // required to maintain cache coherency
        load-from-remote-memory(location)

Figure 1: Anatomy of a shared-address space reference.
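On a message-passing machine, the checks in Figure 1 must be synthesized in software on every shared reference. The following C sketch shows one way such a software reference path might look; it is illustrative only, and the helper routines (cache_lookup, is_local, local_ptr, home_node, send_read_request, wait_for_reply) are hypothetical names assumed for this sketch, not interfaces of Alewife or any other system discussed here.

    #include <stdbool.h>

    typedef unsigned long gaddr_t;  /* global (shared-address space) address */

    /* Hypothetical helpers, assumed to be provided by the runtime. */
    extern bool  cache_lookup(gaddr_t loc, long *value); /* currently-cached?  */
    extern bool  is_local(gaddr_t loc);                  /* is-local-address?  */
    extern long *local_ptr(gaddr_t loc);                 /* map to local memory */
    extern int   home_node(gaddr_t loc);                 /* node owning loc     */
    extern void  send_read_request(int node, gaddr_t loc);
    extern long  wait_for_reply(gaddr_t loc);

    long shared_load(gaddr_t loc)
    {
        long value;
        if (cache_lookup(loc, &value))    /* satisfy request from cache */
            return value;
        if (is_local(loc))                /* satisfy request from local memory */
            return *local_ptr(loc);
        /* Must load from remote memory: send a read-request message to
           the home node and block until the reply (and any actions
           required to maintain cache coherency) completes. */
        send_read_request(home_node(loc), loc);
        return wait_for_reply(loc);
    }

On a shared-memory machine, by contrast, both checks are performed by cache tags and address-resolution hardware, so none of this code appears in the instruction stream; a single load instruction suffices.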
2.1 Anatomy of a Memory Reference

When executing a multiprocessor application written assuming a shared-address space programming model, the actions indicated by the pseudocode in Figure 1 must be taken for every shared-address space reference. The pseudocode assumes that locally cached copies of locations can exist. (If shared-data caching is not supported, the first test is eliminated.) The most important parts of the code in Figure 1 are the two "local/remote" checks.

The local/remote check is the essence of the distinction between shared-memory and message-passing architectures. In the former, the instruction to reference memory is the same whether the object referenced happens to be in local or remote memory; the local/remote checks are facilitated by hardware support to determine whether a location has been cached (cache tags and comparison logic) or, if not, whether it resides in local or remote memory (local/remote address resolution logic). If the data is remote and not cached, a message will be sent to the remote node to access the data. Although the first test is unnecessary if local caching is disallowed, the second test is still required to implement a shared-address space programming model. Because shared-memory systems provide hardware support for detecting [...] others that are static. Data sharing in programs can also be fine- or coarse-grained.

Through a great deal of hard work, a programmer can often eliminate the overhead incurred by the local/remote checks shown in Figure 1 by sending explicit messages. The fixed costs of such operations are typically high enough, however, that for efficient execution the messages must be made fairly large. The result is that the granularity at which data sharing can be exploited efficiently is quite coarse. For static programs with sufficiently large-grained data sharing, these explicit messages are like assembly language: if done well, they always provide the best performance. However, as with assembly language, it is not at all clear that most programmers are better off if forced to orchestrate all communication through explicit messages. For static applications which share data at too fine a grain, even if the overhead related to local/remote checks can be eliminated, performance suffers because messages typically aren't large enough to amortize the fixed overheads inherent in message-based communication. It is worth noting, however, that for architectures like iWarp [4], which in effect support extremely low-overhead "messaging" for static communication patterns, this is not a problem.
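To see why message granularity matters, consider a simple illustrative cost model; the numbers below are assumptions for the sake of the example, not measurements of Alewife or any other machine. Suppose sending an n-byte message costs a fixed overhead T0 plus a bandwidth term:

    T(n) = T0 + n/B

The effective transfer rate n/T(n) reaches half of the peak bandwidth B only when n = T0 * B. With, say, T0 = 10 microseconds of fixed per-message overhead and B = 100 megabytes per second, a message must carry on the order of 1000 bytes before the fixed cost is amortized; data shared at the granularity of a few words pays nearly the entire fixed cost on every transfer, which is exactly the regime where hardware-supported fine-grained sharing wins.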