A UPC Runtime System Based on MPI and POSIX Threads


Zhang Zhang, Jeevan Savant, and Steven Seidel
Computer Science Department, Michigan Technological University
Houghton, MI 49931, USA
{zhazhang, jvsavant, steve}@mtu.edu

Abstract

MuPC is a portable runtime system for Unified Parallel C (UPC). A modified version of the EDG C/C++ front end translates the user's UPC program into C and turns UPC-specific language features into calls to MuPC runtime functions. MuPC implements each UPC thread (process) as two Pthreads, one for the user program and private memory accesses, and the other for remote memory accesses. Remote memory is accessed by two-sided MPI message passing. MuPC performance features include a runtime software cache for remote accesses and low latency access to shared memory with affinity to the issuing thread. MuPC is a useful platform for experimenting with current and future UPC language features and investigating UPC performance. This paper describes the internal design of MuPC and compares its performance to several other available platforms.

Keywords: Unified Parallel C, shared memory, parallel programming, MPI, POSIX threads

1. Introduction

Unified Parallel C (UPC) is an extension of C for programming multiprocessors with a shared address space [10, 8]. UPC provides a common syntax and semantics for explicit parallel programming in C, and it directly maps language features to the underlying architecture. UPC is an example of the partitioned shared memory programming model, in which shared memory is partitioned among all UPC threads (processes). This partition is formally represented in the programming language. Each thread can access any location in shared memory using the same syntax, but in many implementations the locations in each thread's own partition of shared memory are accessed more quickly.

UPC has been gaining interest from academia, industry, and government labs. A UPC consortium [10] has been formed to foster and coordinate UPC development and research activities.

The UPC group at Michigan Technological University developed MuPC, a publicly available implementation of UPC based on MPI-1 and Pthreads. This paper discusses the implementation of the MuPC runtime system, focusing on its internal design and the issues affecting its performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews the UPC language. Section 3 is a high-level overview of MuPC; it also describes other UPC implementations. Section 4 describes the runtime API and its two-threaded structure. Section 5 discusses details of the runtime system internals. Section 6 discusses the performance of MuPC, reporting results for two simple synthetic benchmarks and summarizing earlier measurements for the NAS Parallel Benchmarks. Finally, Section 7 concludes the paper.

2. UPC background

UPC is a superset of ANSI C (as per ISO/IEC 9899 [13]). UPC programs adopt the single program multiple data (SPMD) execution model. Each UPC thread is a process executing a sequence of instructions. Data objects in a UPC program can be either private or shared. A UPC thread has exclusive access to the private objects that reside in its private memory. A thread also has access to all of the objects in the shared memory space. UPC provides a partitioned view of shared memory by introducing the concept of affinity. Shared memory is equally partitioned among all threads. The region of shared memory associated with a given thread is said to have affinity to that thread. The concept of affinity captures the reality that on many parallel architectures the latencies of accessing different shared objects are different. It is assumed that an access to a shared object that has affinity to the thread performing the access is faster than an access to a shared object to which the thread does not have affinity.
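To make the notions of affinity and the SPMD execution model concrete, the following minimal UPC sketch (written for this summary, not taken from the paper) declares a shared array distributed round-robin across threads and lets each thread update only the elements that have affinity to it; the array size is arbitrary.

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 4

    /* With the default block size of 1, element a[i] has affinity
       to thread i % THREADS. */
    shared int a[N * THREADS];

    int main(void)
    {
        int i;

        /* The affinity expression &a[i] restricts each thread to the
           iterations whose target element it owns. */
        upc_forall (i = 0; i < N * THREADS; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;

        if (MYTHREAD == 0)
            for (i = 0; i < N * THREADS; i++)
                printf("a[%d] has affinity to thread %d\n", i, a[i]);
        return 0;
    }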

UPC provides two types of pointers to shared objects. A pointer-to-shared is a pointer whose target is a shared data object; the pointer itself resides in the private memory of some thread. A shared pointer-to-shared is a pointer whose target is a shared data object and which itself also resides in shared memory.

UPC provides strict and relaxed memory consistency modes. This choice affects accesses to shared objects. The strict mode does not allow reordering of shared object accesses. The relaxed mode allows reordering and coalescing of shared object accesses between synchronization points, as long as data dependencies within each thread are preserved. The relaxed mode offers opportunities for compiler and run time system optimizations.

3. MuPC overview and related work

3.1. MuPC overview

The MuPC runtime system was developed with help from the High Performance Technical Computing Division at Hewlett-Packard Company. The most recent release of MuPC [15] contains three components: the runtime system, the UPC-to-C translator, and a reference implementation of the collective functions. The runtime system implements an API defined by the translator, using MPI and the POSIX standard thread library. Each UPC thread at run time is mapped to two Pthreads, the computation Pthread and the communication Pthread. This paper focuses on the internal structure of the runtime system, but an overview of the other two components is first given below.

The translator is a modified version of the EDG C/C++ V3.4 front end [7]. Translation is a two-phase process. The first phase parses UPC source code and produces an IL (intermediate language) tree structure. The second phase lowers the IL tree to produce ANSI C code. In the lowering phase, all UPC features are represented using equivalent C constructs or are translated into calls to functions defined in a runtime API. The runtime API was originally designed by HP and has since been modified to better fit MuPC's purposes.

A set of collective functions [20] is an integral part of the UPC language. MuPC provides a reference implementation for all collectives, in which data movement is implemented using UPC's one-sided communication primitives such as upc_memcpy(), upc_memput() and upc_memget(). Although a lower-level implementation may give better performance, a reference implementation is useful for users to understand the semantics of the collectives and for language developers to test new designs.

The MuPC runtime system is portable to any platform that supports MPI and POSIX threads. This means that a wide range of architectures and operating systems can run MuPC. However, since the translator is released as a binary executable, a user who downloads a MuPC release cannot automatically port the release to any platform. The home page of MuPC (www.upc.mtu.edu) currently provides translator executables for two platforms, the HP AlphaServer/Tru64 and Intel x86/Linux. Executables for other platforms are available upon request.

3.2. Other UPC implementations

The first UPC compiler was written for the Cray T3E [4]. This implementation no longer meets the UPC specification. Cray now incorporates support for UPC into the C compiler on the Cray X1 platform [6].

The most widely used public domain UPC compiler is Berkeley UPC [18]. It is highly portable because it provides a multi-layered system design that interposes the GASNet communication layer between the runtime system and the network hardware [19]. The core and extended GASNet APIs can be mated with a variety of runtime systems, such as UPC and Titanium, on one side, and with various types of network hardware, such as Myrinet, Quadrics, and even MPI, on the other side. Other components of the Berkeley UPC compiler are analogous to those of MuPC, including a UPC-to-C translator (based on the Open64 open source compiler) and a platform-independent runtime system. Unlike MuPC and the HP compilers, Berkeley UPC does not provide a runtime cache for remote references. Many translator-level optimizations for Berkeley UPC are described in [5].

Intrepid Technology provides a UPC compiler [12] as an extension to the GNU GCC compiler (GCC 3.3.2). It supports only shared memory and SMP platforms, such as the SGI Irix, the Cray T3E, and Intel x86 uniprocessors and SMPs. It is freely available under the GPL license.

Hewlett-Packard offered the first commercially available UPC compiler [11]. The current version of this compiler targets Tru64 UNIX, HP-UX, and XC Linux clusters. Its front end is also based on the EDG source-to-source translator. The Tru64 runtime system uses the Quadrics network for off-node remote references. Runtime optimizations include a write-through runtime cache and a trainable prefetcher.

4. Runtime system design

The MuPC runtime system is a static library that implements a UPC runtime system API. The front end translates a UPC program into an ANSI C program in which UPC-specific features are represented using structures and function calls defined in the API. For example, shared pointers are translated into UPCRTS_SHARED_POINTER_TYPE structures.
Remote memory reads and writes are translated into calls to UPCRTS_Get() and UPCRTS_Put(), respectively. The translated code is compiled using a C compiler such as icc and then linked with the runtime system library to generate an executable. This section describes the runtime system API and discusses some high-level decisions made in its design.

4.1. The runtime system API

The runtime system API used by MuPC is based on an API originally defined by the UPC development group at Hewlett-Packard. Despite the similarity between the two APIs, their implementations are completely different. While MuPC implements the API using MPI and Pthreads, HP's implementation is based on the NUMA architecture of the SC-series platform and uses the Elan communication library and the Quadrics switch to move data between nodes. Compared with MPI, the Elan library provides much lower level message passing primitives and higher performance. On the other hand, the MPI+Pthreads implementation gives MuPC greater portability. MPI also makes the implementation simpler because it already guarantees ordered message delivery in point-to-point communication. In addition, process startup, termination, and buffer management are straightforward in MPI.

THREADS and MYTHREAD are runtime constants representing the number of UPC threads and the ID of a particular running thread, respectively. They are available in the runtime library through two global constants, UPCRTS_gl_cur_nvp and UPCRTS_gl_cur_vpid.

The API defines a C structure, UPCRTS_SHARED_POINTER_TYPE, to represent pointers-to-shared. A pointer-to-shared contains three fields: address, affinity, and phase. The address and affinity fields together specify the location of an object in shared memory. The phase field specifies the relative offset of an element within a shared array. The phase field is useful only in pointer arithmetic, which is handled in the UPC-to-C translation step; in other words, translated code does not include any pointer arithmetic. The runtime system needs only address and affinity to retrieve and store shared data.

4.1.1 Shared memory accesses

Shared memory read and write operations for scalar data are implemented using corresponding get and put functions defined in the runtime API. One key design feature of the runtime system is that all get operations are synchronous, while put operations are asynchronous. A get routine initiates a request to get data from a remote memory location. Each request is then completed by one of the getsync routines (depending on the data type), which guarantees that the data is ready to use once it returns. In contrast, there are no routines to complete put operations. A thread considers a put operation to be complete as soon as the operation returns, although the data to be written are likely still in transit to their destination. This permits the MuPC runtime system to issue multiple put operations at a time, but only one get at a time. It is desirable to have the runtime issue multiple gets at a time and complete them all with just one getsync routine, but this requires a dependence analysis within a block to determine which gets can be issued simultaneously: only a sequence of gets with no WAR (write-after-read) hazards between them can be issued together. This feature is not currently implemented in MuPC.

Another set of runtime functions is defined for moving non-scalar data between shared memory locations. Non-scalar types, such as user-defined data types, are treated by the runtime system as raw bytes. Both block get and block put operations are synchronous: a block get or block put request is issued first and then completed by a matching completion routine. In the block get case, the completion routine guarantees that the retrieved data are ready to be consumed by the calling thread. In the block put case, the completion routine guarantees that the source location is ready to be overwritten, because the data are already on their way to the destination.

Besides shared memory accesses, UPC provides one-sided message passing functions: upc_memcpy(), upc_memget(), upc_memput() and upc_memset(). These functions retrieve or store a block of raw bytes from or to shared memory. The MuPC runtime supports them directly by providing corresponding runtime functions, which rely on the aforementioned block get and block put functions to accomplish data movement.

4.1.2 Synchronization

The MuPC runtime system directly supports the UPC synchronization functions by providing matching runtime routines. It is noteworthy that a UPC barrier is not implemented using MPI_Barrier(), although it appears natural to do so. UPC barriers come in two variations, anonymous barriers and indexed barriers. An indexed barrier includes an integer argument and can be matched only by barriers on other threads with the same argument or by anonymous barriers. MPI barriers cannot support such synchronization semantics, so the MuPC runtime system implements its own barriers using a binary tree-based algorithm.

A fence in UPC is a mechanism to force the completion of shared accesses. In the MuPC runtime system, fences play a significant role in the implementation of UPC's memory consistency model. More details are discussed in Section 5.1.
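To make the calling convention concrete, the following is a hypothetical sketch of what the translator's output for one remote read and one remote write might look like. The structure name and the UPCRTS_Get()/UPCRTS_Put() names appear in the text above, but the field types, the exact signatures, and the getsync routine's name are invented for illustration; the real MuPC API differs in detail.

    #include <stddef.h>

    /* Illustrative layout only -- the field types are assumptions. */
    typedef struct {
        void *address;    /* location within the owning thread's partition */
        int   affinity;   /* UPC thread the object has affinity to         */
        int   phase;      /* offset within a block; unused by the runtime  */
    } UPCRTS_SHARED_POINTER_TYPE;

    /* Assumed shapes of the scalar get/put entry points. */
    extern void UPCRTS_Get(UPCRTS_SHARED_POINTER_TYPE src, size_t nbytes);
    extern void UPCRTS_GetSyncInt(int *dst);   /* invented getsync name */
    extern void UPCRTS_Put(UPCRTS_SHARED_POINTER_TYPE dst, int value);

    /* UPC source:  x = a[i];  a[j] = y;   (a is a shared int array)
       One possible lowered form emitted by the translator: */
    void lowered_example(UPCRTS_SHARED_POINTER_TYPE a_i,
                         UPCRTS_SHARED_POINTER_TYPE a_j, int *x, int y)
    {
        /* Synchronous get: issue the request ... */
        UPCRTS_Get(a_i, sizeof(int));
        /* ... then complete it with a type-specific getsync routine,
           after which *x is safe to use. */
        UPCRTS_GetSyncInt(x);

        /* Asynchronous put: considered complete as soon as the call
           returns, even though the data may still be in transit. */
        UPCRTS_Put(a_j, y);
    }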

UPC provides locks to synchronize concurrent accesses to shared memory. Locks are shared opaque objects that can only be manipulated through pointers. In MuPC, the lock type is implemented using a shared array of size THREADS with block size 1. Each lock manipulation routine, such as lock, unlock, and lock_attempt, has a corresponding runtime function.

4.1.3 Shared memory management

Static shared variables are directly supported using static variables of ANSI C. For static shared arrays distributed across threads, the translator calculates the size of the portion on each thread and allocates a static array of that size on each thread. For static shared scalars and static shared arrays with indefinite block size, the translator replicates them on all UPC threads, but only the copy on thread 0 is used. This ensures that on each thread the corresponding elements of a shared array that spans multiple threads always have the same local addresses, a desirable feature that greatly simplifies the implementation of remote references.

The memory module of the runtime system manages the shared heap. At start-up, the memory module reserves a segment of the heap on each thread to be used in future memory allocations. The size of this segment is a constant that can be set at run time. The starting address of this reserved segment is the same on all threads, which guarantees that corresponding elements of an allocated array have the same local addresses on each thread. The memory module uses a simple first-fit algorithm [17] to keep track of allocated and free space on the reserved segment of each thread at run time.

4.1.4 Interface extension

One goal of MuPC is to provide an experimental platform for language developers to try out new ideas. Thus the runtime API is extensible. One extension currently being investigated is the implementation of atomic shared memory operations, which are not part of the latest UPC language specifications. Section 5.4 contains more details about implementing atomic operations.

4.2. The communication layer

Bonachea and Duell [1] presented five requirements for communication layers underlying the implementation of GAS languages such as UPC:

1. Support remote memory access (RMA) operations (one-sided communication).

2. Provide low latency for small remote accesses.

3. Support nonblocking communication.

4. Support concurrent and arbitrary accesses to remote memory.

5. Provide or support the implementation of collective communication and synchronization primitives.

MPI was chosen to implement the MuPC runtime API for two important reasons: (1) the prevalence of MPI guarantees the portability of the runtime system, and (2) MPI libraries provide high-level communication primitives that enable fast system development by hiding the details of message passing. MuPC is built on top of only a handful of the most commonly used MPI routines, such as MPI_Isend() and MPI_Irecv().

However, MPI does not meet all five of the requirements listed above. Specifically, MPI-1.1 supports only two-sided communication. The one-sided get/put routines defined in the MuPC runtime API must be simulated using message polling because the receiver does not always know when a sender will send a message. The MPI-2 standard supports one-sided communication with some restrictions, but those restrictions make the one-sided message passing routines incompatible with the requirements of UPC [1]. In addition, MPI-2 has not been as widely implemented as MPI-1.1. Therefore, MuPC is currently implemented with MPI-1.1.

The MuPC runtime system constructs its own MPI communicator: all UPC threads are members of the communicator MUPC_COMM_WORLD. MuPC uses only asynchronous message-passing functions, to achieve message overlap and to minimize message overhead.

4.3. The two-threaded implementation

MuPC uses a two-threaded design to simulate one-sided communication, or remote memory access (RMA), using MPI's two-sided communication primitives. This design uses POSIX threads (Pthreads) because of their wide availability. Each UPC thread spawns two Pthreads at run time, the computation Pthread and the communication Pthread. The computation Pthread executes the compiled user code, which contains calls to MuPC runtime functions, and delegates communication tasks to the communication Pthread. The communication Pthread plays two basic roles. First, it handles the read and write requests of the current UPC thread by posting MPI_Irecv() and MPI_Isend() calls and completing them using MPI_Testall(). Second, it responds to the read and write requests from remote threads by sending the requested data using MPI_Isend() or storing the arriving data directly into memory. Since the communication Pthread has to respond to remote read and write requests in real time, it implements a polling loop (by posting persistent MPI receive requests) that listens to all other UPC threads. Figure 1 depicts the operation of the communication Pthread.
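The polling loop just described can be sketched as follows. This is a hypothetical illustration, not MuPC source code: the request format, the tag, and the helper functions service_request() and runtime_is_shutting_down() are invented, while the persistent-receive pattern (MPI_Recv_init, MPI_Start, MPI_Testany) uses standard MPI-1 calls.

    #include <mpi.h>
    #include <stdlib.h>

    #define REQ_TAG 17                  /* assumed tag for remote requests */

    typedef struct {                    /* invented request format */
        int  op;                        /* e.g. GET or PUT */
        long offset;                    /* location in the shared segment */
        char data[1024];                /* payload for puts */
    } rt_request_t;

    extern void service_request(const rt_request_t *r, int from, MPI_Comm c);
    extern int  runtime_is_shutting_down(void);

    void *comm_pthread(void *commp)
    {
        MPI_Comm comm = *(MPI_Comm *)commp;       /* e.g. MUPC_COMM_WORLD */
        int nthreads, me;
        MPI_Comm_size(comm, &nthreads);
        MPI_Comm_rank(comm, &me);

        rt_request_t *bufs  = calloc(nthreads, sizeof *bufs);
        MPI_Request  *rreqs = malloc(nthreads * sizeof *rreqs);

        /* One persistent receive per peer thread (self slot left inactive). */
        for (int src = 0; src < nthreads; src++) {
            rreqs[src] = MPI_REQUEST_NULL;
            if (src != me) {
                MPI_Recv_init(&bufs[src], sizeof bufs[src], MPI_BYTE,
                              src, REQ_TAG, comm, &rreqs[src]);
                MPI_Start(&rreqs[src]);
            }
        }

        /* Polling loop: test for an arrived request, service it, re-arm. */
        while (!runtime_is_shutting_down()) {
            int idx, flag;
            MPI_Status st;
            MPI_Testany(nthreads, rreqs, &idx, &flag, &st);
            if (flag && idx != MPI_UNDEFINED) {
                service_request(&bufs[idx], st.MPI_SOURCE, comm);
                MPI_Start(&rreqs[idx]);  /* persistent request: restart it */
            }
        }

        for (int src = 0; src < nthreads; src++)
            if (rreqs[src] != MPI_REQUEST_NULL)
                MPI_Request_free(&rreqs[src]);
        free(bufs);
        free(rreqs);
        return NULL;
    }

In MuPC the computation Pthread would start such a routine with pthread_create() and coordinate with it through the mutexes, condition variables, and shared buffers described in Section 5; those details are omitted here.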

Figure 1. The communication Pthread

The two Pthreads synchronize with each other using a set of mutexes and condition variables; they communicate with each other using a set of global buffers and queues. Some MPI implementations are not thread-safe, so the MuPC runtime system encapsulates all MPI function calls in the communication Pthread.

5. Runtime system internals

The runtime system kernel is organized into three modules. The main module implements all runtime API functions, as well as the communication Pthread and the computation Pthread. A memory module defines the memory allocation and de-allocation mechanisms that underlie UPC's dynamic shared memory allocation functions. A cache module performs remote reference caching to help reduce the long latencies of referencing remote scalar shared variables.

This section describes details of the implementation of the memory consistency model and atomic operations. Also discussed are some performance-oriented features of MuPC.

5.1. Fences and the memory consistency model

MuPC relies on fences to implement the strict and relaxed shared memory access modes. A fence forces the completion of outstanding shared memory accesses and thus imposes an ordering between the accesses that precede it and the accesses that follow it. For each strict access, the translator inserts a fence immediately before it to ensure that the access does not occur earlier than any prior accesses. For a strict write, another fence is needed immediately after it to ensure that subsequent accesses do not start until the write is complete. There is no need for a fence immediately after a strict read because a read is a synchronous operation; subsequent accesses cannot start until the read returns. No fences are inserted for relaxed accesses.

Reordering relaxed accesses may lead to performance improvement; for example, starting long-latency operations early helps hide the waiting time. At this stage MuPC is not capable of reordering program statements. Because of the non-overtaking property of MPI messages [16], shared memory accesses to a target thread always complete in program order, as long as remote memory caching is not used. Therefore, in the absence of caching, a fence requires soliciting an acknowledgment from every other thread that the calling thread has written to since the last fence, and the fence blocks until the appropriate number of acknowledgments arrive. When caching is used, however, a fence additionally requires the cache to be invalidated and dirty cache lines to be written back. Fences are also implied at lock, unlock, lock_attempt, and barriers.

5.2. Optimization for local shared accesses

Local shared accesses are accesses to shared variables that have affinity to the accessing thread. There is no communication involved in this type of access, but the latency still tends to be longer than the latency of accessing private variables [2, 3, 5, 9]. The extra cost comes from the effort of handling UPC's shared keyword. For example, the UPC-to-C translator initially translated shared references mechanically into calls to the corresponding runtime get/put functions, and the overhead of those function calls accounts for the extra cost.

MuPC's UPC-to-C translator has been optimized to distinguish local from remote shared references. Tests are inserted for each shared reference to determine at run time whether the reference is local, by comparing the thread field of the address with MYTHREAD. Once identified, a local shared reference is converted to a regular local memory reference to avoid an expensive runtime function call.

5.3. Remote memory caching

The runtime implements a software caching scheme to hide the latency of remote accesses. Each UPC thread maintains a noncoherent, direct-mapped, write-back cache for scalar references made to remote threads. The cache is allocated before execution begins. The length of a cache line and the number of cache lines can be set at run time using a configuration file.

The cache on each thread is divided into THREADS-1 segments, each dedicated to a unique remote thread.
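As a rough illustration of this organization, the following hypothetical C sketch (not MuPC source) shows one way such a per-remote-thread, direct-mapped, write-back cache could be laid out; the structure, field names, and lookup function are invented.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t tag;          /* which remote block the line holds          */
        int      valid;
        int      dirty;        /* written back at a fence or on eviction     */
        uint8_t *bytes;        /* line_size bytes of cached data             */
        uint8_t *written;      /* per-byte write mask (see false sharing below) */
    } cache_line_t;

    typedef struct {
        size_t        line_size;         /* set at run time from the config file  */
        size_t        lines_per_segment;
        cache_line_t *lines;             /* (THREADS-1) * lines_per_segment lines */
    } remote_cache_t;

    /* Map a remote reference to its line: the segment is chosen by the
       owning thread, the line within the segment by the block number
       (direct mapping).  Local references never reach the cache. */
    static cache_line_t *cache_lookup(remote_cache_t *c, int owner,
                                      int mythread, uint64_t remote_addr)
    {
        int      segment = (owner < mythread) ? owner : owner - 1;
        uint64_t block   = remote_addr / c->line_size;
        size_t   index   = (size_t)(block % c->lines_per_segment);
        return &c->lines[(size_t)segment * c->lines_per_segment + index];
    }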

Either a read miss or a write miss triggers the load of a cache line. Subsequent references are made to the cache line until the next miss. Writes to remote locations are stored in the cache and are actually written back later, at a cache invalidation or when the cache line is replaced. This mechanism helps reduce the number of MPI messages by combining frequent small messages into less frequent large messages.

Cache invalidation takes place at a fence, with dirty cache lines being written back to their sources. Dirty cache lines are collected and packed into raw-byte packages destined for the appropriate remote threads. The packages are transferred using block put operations. The communication Pthread of a receiver unpacks each package and updates the corresponding memory locations with the values carried by the package.

A cache line conflict occurs if more than one memory block is mapped to one cache line. The line already in the cache is replaced, and written back if it is dirty. The runtime system calls the block put function to perform the write-back of the replaced cache line.

The runtime system also provides a victim cache to mitigate the penalty of conflict misses. The victim cache has the same structure as the main cache, except that it is much smaller. When a conflict occurs, the replaced line is stored in the victim cache; it is written back when it is replaced again in the victim cache. A cache hit in the victim cache brings the line back to the main cache, and a miss in both the main cache and the victim cache triggers the load of a fresh cache line.

The false sharing problem is handled by associating with each cache line a bit vector that keeps track of the bytes written by a thread. The bit vector is transferred together with the cache line and is used to guide the byte-wise updating of the corresponding memory locations.

5.4. Atomic operations

Atomic shared memory operations currently implemented in MuPC include Compare and Swap, Double Compare and Swap, Fetch and Operate, and Masked Swap. The runtime system supports them directly using corresponding runtime functions.

Atomic operations provide safe lock-free accesses to shared memory. When an atomic operation is performed on a shared variable by multiple UPC threads, only one can succeed; all other attempts fail and have no effect.

The runtime system starts an atomic operation by building a request (an MPI message) to be sent to the thread with which the targeted memory location has affinity. Upon receiving this message, the thread appends the request to a cyclic queue. Thus multiple concurrent accesses from different threads are serialized by the queue. The access that arrives first is performed successfully; later accesses only see that the value of the targeted memory location has already changed, and they all fail. A reply message is then built for each access request, specifying success or failure, and sent back to each requester.

6. Performance characteristics

To characterize the performance of the MuPC runtime system, this section reports the performance of two synthetic microbenchmarks. Also summarized are performance measurements made earlier [21] for the NAS Parallel Benchmarks. We run MuPC on two platforms, an HP AlphaServer SC SMP cluster and a Linux cluster connected with a Myrinet network. For comparison purposes, the same benchmarks are also run using Berkeley UPC and HP UPC on the Linux cluster and the AlphaServer SC cluster, respectively.

The AlphaServer SC cluster has 8 nodes with four 667MHz Alpha 21264 EV67 processors per node. The Linux cluster has 16 nodes with two 1.5GHz Pentium processors per node. The UPC compilers used are MuPC V1.1, Berkeley UPC V2.0, and HP UPC V2.2. In the experiments, each UPC thread is mapped to one processor. The cache in MuPC on each thread is configured to have 256×(THREADS-1) blocks with 1024 bytes per block. The cache in HP UPC on each thread is configured to be 4-way associative, with 256×(THREADS-1) blocks and 1024 bytes per block.

The synthetic microbenchmarks used are:

• Streaming remote access measures the transfer rates of remote memory accesses issued by a single thread while the other threads are idle. Four access patterns are measured: stride-1 reads and writes, and random reads and writes.

• Natural ring is similar to the streaming remote access test, except that all threads form a logical ring and read from or write to their neighbors at approximately the same time. This benchmark is similar to the all-processes-in-a-ring test implemented in the HPC Challenge Benchmark Suite [14].

The results for the streaming remote access tests are presented in Figures 2 and 3. These tests reveal the performance of fine-grained accesses with a lightly loaded communication network. When remote reference caching is not used, MuPC has performance comparable to Berkeley UPC: Berkeley UPC is slightly better in read operations, but MuPC outperforms it in write operations. Caching clearly helps MuPC in stride-1 accesses, whose transfer rates are significantly better than their no-caching counterparts. However, caching hurts the performance of random accesses, although not significantly, due to cache miss penalties. Also note that caching helps HP UPC in stride-1 reads but not in stride-1 writes; this is because the cache in HP UPC is a write-through cache. MuPC achieves only 10-25% of HP UPC's performance in random accesses.
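As a concrete illustration of the access pattern being measured, the following minimal UPC sketch shows the kind of stride-1 remote read loop the streaming test exercises. It is not the benchmark code itself; the array size, the neighbor choice, and the timing method are arbitrary.

    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define N 4096

    /* Block size N: each thread owns one contiguous chunk of N elements. */
    shared [N] int data[N * THREADS];

    int main(void)
    {
        int neighbor = (MYTHREAD + 1) % THREADS;
        long sum = 0;
        struct timeval t0, t1;
        int i;

        upc_barrier;
        if (MYTHREAD == 0) {            /* a single thread issues all traffic */
            gettimeofday(&t0, NULL);
            for (i = 0; i < N; i++)     /* stride-1 reads from a remote chunk */
                sum += data[neighbor * N + i];
            gettimeofday(&t1, NULL);
            double secs = (t1.tv_sec - t0.tv_sec)
                        + 1e-6 * (t1.tv_usec - t0.tv_usec);
            printf("read %d remote ints in %f s (sum=%ld)\n", N, secs, sum);
        }
        upc_barrier;
        return 0;
    }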

Figure 2. The streaming remote access results for platforms without caching
Figure 3. The streaming remote access results for platforms with caching

The results for the natural ring tests are presented in Figures 4 and 5. The communication network is much more heavily loaded in these tests than in the streaming remote access tests. Both MuPC and Berkeley UPC suffer a performance degradation by a factor as large as 10, while HP UPC suffers only a slight degradation. The observations about remote reference caching still hold in these tests.

A full study of MuPC's performance on the NAS Parallel Benchmarks was carried out in [21], comparing MuPC with all available UPC compilers on multiple platforms. Five benchmarks were studied: embarrassingly parallel (EP), conjugate gradient (CG), fast Fourier transform (FT), integer sort (IS), and multi-grid solver (MG). The results show that MuPC achieved performance comparable to other UPC implementations for EP, FT and IS. Most implementations exhibited erratic performance for MG because of the non-scalable access patterns in the UPC implementation of that benchmark. On the other hand, MuPC's performance for CG was much worse than the others. Figure 6 compares MuPC's performance for CG with Berkeley UPC on the Linux cluster and with HP UPC on the AlphaServer SC cluster. The CG benchmark features irregular, fine-grained shared memory accesses, and the slowness of MuPC for CG shows MuPC's limitation in handling such accesses.

7. Summary

MuPC is an open source runtime system for UPC that is based on MPI and POSIX threads. Each UPC thread is represented by two Pthreads at run time, one for computation and one for interthread communication. MuPC is portable to most distributed memory and shared memory parallel systems because it is based on commonly available libraries. Performance results presented here compare several current UPC compilers using synthetic and application benchmarks. The front end translator and the runtime system contain several optimizations that make MuPC competitive with other open source and commercial UPC compilers. MuPC's extensibility is currently facilitating the study of new UPC language features such as atomic memory operations.

Acknowledgments

The authors wish to thank Brian Wibecan and the UPC group at Hewlett-Packard, who gave much assistance in MuPC development. The authors would also like to thank Dan Bonachea and the Berkeley UPC development group for valuable discussions. This work is partially supported by DoD contract MDA904-03-C-0483.

Figure 4. The natural ring test results for platforms without caching
Figure 5. The natural ring test results for platforms with caching

Figure 6. NPB CG benchmark performance (Mops/sec versus number of threads; Berkeley UPC and MuPC on the Linux cluster, HP UPC and MuPC on the AlphaServer)

References

[1] D. Bonachea and J. Duell. Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations. International Journal on High Performance Computing and Networking, 2003.
[2] F. Cantonnet and T. El-Ghazawi. UPC Performance and Potential: A NPB Experimental Study. In Proceedings of Supercomputing 2002, Baltimore, Maryland, Nov. 2002.
[3] F. Cantonnet, Y. Yao, S. Annareddy, A. Mohamed, and T. El-Ghazawi. Performance Monitoring and Evaluation of a UPC Implementation on a NUMA Architecture. In Proceedings of the International Parallel and Distributed Processing Symposium, 2004.
[4] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and Language Specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, May 1999.
[5] W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A Performance Analysis of the Berkeley UPC Compiler. In Proceedings of the 17th Annual International Conference on Supercomputing (ICS), 2003.
[6] Cray Inc. Cray X1 System Overview. Cray Inc., 2003. http://www.cray.com/craydoc/manuals/S-2346-22/html-S-2346-22/z1018480786.html.
[7] Edison Design Group, Inc. Compiler Front Ends for the OEM Market. http://www.edg.com.
[8] T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick. UPC: Distributed Shared Memory Programming. John Wiley & Sons, 2005.
[9] T. El-Ghazawi and S. Chauvin. UPC Benchmarking Issues. In Proceedings of ICPP 2001, 2001.
[10] George Washington University. Unified Parallel C Home Page, 2004. http://hpc.gwu.edu/~upc.
[11] Hewlett-Packard. Compaq UPC for Tru64 UNIX, 2004. http://www.hp.com/go/upc.
[12] Intrepid Technology. Intrepid UPC Home Page, 2004. http://www.intrepid.com/upc.
[13] ISO/IEC. Programming Languages - C, ISO/IEC 9899, May 2000.
[14] P. Luszczek, J. Dongarra, et al. Introduction to the HPC Challenge Benchmark Suite. In Supercomputing 2005 (submitted), Nov. 2005.
[15] Michigan Technological University. UPC Projects at MTU. http://www.upc.mtu.edu.
[16] M. Snir and W. Gropp. MPI: The Complete Reference. The MIT Press, 2nd edition, 1998.
[17] A. Tanenbaum. Modern Operating Systems. Prentice Hall, 2nd edition, 2001.
[18] UC Berkeley. Berkeley Unified Parallel C Home Page, 2004. http://upc.nersc.gov.
[19] UC Berkeley. GASNet Home Page, 2004. http://www.cs.berkeley.edu/~bonachea/gasnet.
[20] E. Wiebel, D. Greenberg, and S. Seidel. UPC Collectives Specification 1.0, Dec. 2003. http://www.gwu.edu/~upc/docs/UPC_Coll_Spec_V1.0.pdf.
[21] Z. Zhang and S. Seidel. Benchmark Measurements for Current UPC Platforms. In Proceedings of IPDPS'05, 19th IEEE International Parallel and Distributed Processing Symposium, Apr. 2005.
