Memory-Mapping Support for Reducer Hyperobjects
The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Citation: I-Ting Angelina Lee, Aamir Shafi, and Charles E. Leiserson. 2012. Memory-mapping support for reducer hyperobjects. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures (SPAA '12). ACM, New York, NY, USA, 287-297.
As Published: http://dx.doi.org/10.1145/2312005.2312056
Publisher: Association for Computing Machinery (ACM)
Version: Author's final manuscript
Citable link: http://hdl.handle.net/1721.1/90259
Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike
Detailed Terms: http://creativecommons.org/licenses/by-nc-sa/4.0/

Memory-Mapping Support for Reducer Hyperobjects

I-Ting Angelina Lee* (MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139 USA), [email protected]
Aamir Shafi† (National University of Sciences and Technology, Sector H-12, Islamabad, Pakistan), aamir.shafi@seecs.edu.pk
Charles E. Leiserson (MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139 USA), [email protected]

This work was supported in part by the National Science Foundation under Grant CNS-1017058. Aamir Shafi was funded in part by a Fulbright grant during his visit at MIT.
* I-Ting Angelina Lee is currently affiliated with Intel Labs, Hillsboro, OR.
† Aamir Shafi is currently an Assistant Professor at NUST SEECS and was a Visiting Fulbright Scholar at MIT during the course of this research.

ABSTRACT
Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach in which a hash table is used to map reducer objects to their local views. The overhead of the hash table is costly — roughly 12× overhead compared to a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system, which uses TLMM to implement a reducer mechanism that supports a reducer lookup using only two memory accesses and a predictable branch, roughly a 3× overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4× faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.

Categories and Subject Descriptors
D.1.3 [Software]: Programming Techniques—Concurrent programming; D.3.3 [Software]: Language Constructs and Features—Concurrent programming structures.

General Terms
Design, Experimentation, Performance.

Keywords
Cilk, dynamic multithreading, memory mapping, reducers, reducer hyperobjects, task parallelism, thread-local memory mapping (TLMM), work stealing.

1. INTRODUCTION
Reducer hyperobjects (or reducers for short) [12] have been shown to be a useful linguistic mechanism for avoiding determinacy races [11] (also referred to as general races [28]) in dynamic multithreaded programs. Reducers allow different logical branches of a parallel computation to maintain coordinated local views of the same nonlocal variable. Whenever a reducer is updated — typically using an associative operator — the worker thread on which the update occurs maps the reducer access to its local view and performs the update on that local view. As the computation proceeds, the various views are judiciously reduced (combined) by the runtime system using an associative reduce operator to produce a final value.

Although existing reducer mechanisms are generally faster than other solutions for updating nonlocal variables, such as locking and atomic update, they are still relatively slow. Concurrency platforms that support reducers, specifically Intel's Cilk Plus [19] and Cilk++ [25], implement the reducer mechanism using a hypermap approach in which each worker employs a thread-local hash table to map reducers to their local views. Since every access to a reducer requires a hash-table lookup, operations on reducers are relatively costly — about 12× overhead compared to an ordinary L1-cache memory access on an AMD Opteron 8354. In this paper, we investigate a memory-mapping approach for supporting reducers, which leverages the virtual-address translation provided by the underlying hardware to yield close to 4× faster access times.

A memory-mapping reducer mechanism must address four key questions:
1. What operating-system support is required to allow the virtual-memory hardware to map reducers to their local views?
2. How can a variety of reducers with different types, sizes, and life spans be handled?
3. How should a worker's local views be organized in a compact fashion to allow both constant-time lookups and efficient sequencing during reductions?
4. Can a worker efficiently gain access to another worker's local views without extra memory mapping?
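For concreteness, the hypermap lookup that these questions aim to improve upon can be sketched as follows. This is a minimal illustrative sketch, not the actual Cilk Plus implementation; the type names View and Worker and the function lookup are hypothetical. Each worker owns a hash table from a reducer's address to that worker's local view, so every reducer access pays for a hash-table lookup:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// A local view of one reducer; here simply an accumulator cell.
struct View { long value = 0; };

struct Worker {
    // The "hypermap": reducer address -> this worker's local view.
    std::unordered_map<std::uintptr_t, View *> hypermap;

    // Every reducer access goes through this hash-table lookup. A miss
    // lazily creates an identity view, as reducer semantics require.
    View *lookup(void *reducer) {
        auto key = reinterpret_cast<std::uintptr_t>(reducer);
        auto it = hypermap.find(key);
        if (it != hypermap.end()) return it->second;
        View *v = new View();        // identity view for this worker
        hypermap.emplace(key, v);
        return v;
    }
};
```

Under reducer semantics, the views created on different workers are later combined by the runtime system using the reducer's associative reduce operation; the hashing cost on every access is what the memory-mapping approach eliminates.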
Our memory-mapping approach answers each of these questions using simple and efficient strategies.

1. The operating-system support employs thread-local memory mapping (TLMM) [23]. TLMM enables the virtual-memory hardware to map the same virtual address to different views in the different worker threads, allowing reducer lookups to occur without the overhead of hashing.
2. The thread-local region of the virtual-memory address space holds only pointers to local views, not the local views themselves. This thread-local indirection strategy allows a variety of reducers with different types, sizes, and life spans to be handled.
3. A sparse accumulator (SPA) data structure [14] is used to organize the worker-local views. The SPA data structure has a compact representation that allows both constant-time random access to elements and efficient sequencing through the elements stored in the data structure.
4. By combining thread-local indirection with the SPA data structure, a worker can efficiently transfer a view to another worker. This support for efficient view transferal allows workers to perform reductions without extra memory mapping.

[SPAA'12, June 25–27, 2012, Pittsburgh, Pennsylvania, USA. Copyright 2012 ACM 978-1-4503-1213-4/12/06 ...$10.00.]

Figure 1 (bar chart; categories: L1-memory, memory-mapped, hypermap, locking; x-axis: normalized overhead, 0 to 14): The relative overhead for ordinary L1-cache memory accesses, memory-mapped reducers, hypermap reducers, and locking. Each value is calculated by normalizing the average execution time of the microbenchmark for the given category by the average execution time of the microbenchmark that performs L1-cache memory accesses.

(a)
    bool has_property(Node *n);
    std::list<Node *> l;
    // ...
    void walk(Node *n) {
      if (n) {
        if (has_property(n))
          l.push_back(n);
        cilk_spawn walk(n->left);
        walk(n->right);
        cilk_sync;
      }
    }

(b)
    bool has_property(Node *n);
    list_append_reducer<Node *> l;
    // ...
    void walk(Node *n) {
      if (n) {
        if (has_property(n))
          l->push_back(n);
        cilk_spawn walk(n->left);
        walk(n->right);
        cilk_sync;
      }
    }

Figure 2: (a) An incorrect parallel code to walk a binary tree and create a list of all nodes that satisfy a given property. The code contains a race on the list l. (b) A correct parallelization of the code shown in Figure 2(a) using a reducer hyperobject.

We implemented our memory-mapping strategy by modifying Cilk-M [23], a Cilk runtime system that employs TLMM to manage the "cactus stack," to make it a plug-in replacement for Intel's Cilk Plus runtime system. That is, we modified the Cilk-M runtime system to replace the native Cilk runtime system shipped with Intel's C++ compiler by making Cilk-M conform to the Intel Cilk Plus Application Binary Interface (ABI) [17]. We then implemented the memory-mapping strategy in the Cilk-M runtime system and compared it, on code compiled by the Intel compiler, to Cilk Plus's hypermap strategy.

Figure 1 graphs the overheads of ordinary accesses, memory-mapped reducer lookups, and hypermap reducer lookups on a simple microbenchmark that performs additions on four memory locations in a tight for loop, executed on a single processor. The memory locations are declared to be volatile to preclude the compiler from optimizing away the memory accesses. A memory-mapped reducer lookup requires only two memory accesses and a predictable branch, which is more efficient than a hypermap reducer lookup. An unexpected byproduct of the memory-mapping approach is that it provides greater locality than the hypermap approach, which leads to more scalable performance.

The rest of the paper is organized as follows. Section 2 provides the necessary background on reducer semantics, including the linguistic model for dynamic multithreading and the reducer interface and its guarantees. Section 3 reviews the hypermap approach to supporting the reducer mechanism. Sections 4, 5, 6, and 7 describe the design and implementation of the memory-mapped reducers, where each section addresses one of the four questions mentioned above in detail. Section 8 presents the empirical evaluation of the memory-mapped reducers, comparing them to the hypermap reducers. Section 9 summarizes related work. Lastly, Section 10 offers some concluding remarks.
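To make the claimed lookup cost concrete, the cost model of a memory-mapped lookup can be sketched as follows. This is a hedged illustration, not Cilk-M's actual implementation: in Cilk-M, TLMM maps the page holding the view pointers at the same virtual address in every worker thread, whereas this portable stand-in uses a C++ thread_local array to give each thread its own slots behind one name. The function lookup and the constant MAX_REDUCERS are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MAX_REDUCERS = 64;

// Stand-in for the TLMM region: an array of view pointers that every
// thread addresses identically but that resolves to per-thread storage.
thread_local void *views[MAX_REDUCERS] = {};

// A lookup loads the slot (one memory access) and tests it with a highly
// predictable branch; the caller's update of the view is the second access.
void *lookup(std::size_t id, void *(*make_identity_view)()) {
    void *v = views[id];
    if (v == nullptr)                     // rarely taken after first access
        views[id] = v = make_identity_view();
    return v;
}
```

Compared with the hash-probe sequence of a hypermap lookup, this slot load plus predictable branch is what yields the roughly 3× overhead relative to an ordinary L1-cache access reported in the abstract.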