
Shoal: Smart Allocation and Replication of Memory for Parallel Programs

Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris∗
Systems Group, Dept. of Computer Science, ETH Zurich    ∗Oracle Labs, Cambridge, UK

Published in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15), July 8–10, 2015, Santa Clara, CA, USA.
https://www.usenix.org/conference/atc15/technical-session/presentation/kaestle

Abstract

Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program's access patterns. However, sub-optimal allocation can significantly impact the performance of parallel programs.

We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program, and (ii) a manually annotated version of the PARSEC Streamcluster benchmark.

1 Introduction

Memory allocation in NUMA multi-core machines is increasingly complex. Good placement of and access to program data is crucial for application performance and, if not carefully done, can significantly impact scalability [3, 13]. Although there is research (e.g. [7, 3]) in adapting to the concrete characteristics of such machines, many programmers struggle to develop software applying these techniques. We show an example in Section 5.1.

The problem is that it is unclear which NUMA optimization to apply in which situation and, with rapidly evolving and diversifying hardware, programmers must repeatedly make manual changes to their software to keep up with new hardware performance properties.

One solution to achieve better data placement and faster data access is to rely on automatic online monitoring of program performance to decide how to migrate data [13]. However, monitoring may be expensive due to missing hardware support (if pages must be unmapped to trigger a fault when data is accessed) or insufficiently precise (if based on sampling using performance counters). Both approaches are limited to a relatively small number of optimizations (e.g. it is hard to incrementally activate large pages or switch to using DMA hardware for data copies based on monitoring or event counters).

We present Shoal, a system that abstracts memory access and provides a rich programming interface that accepts hints on memory access patterns at runtime. These hints can either be manually written or automatically derived from high-level descriptions of parallel programs such as domain specific languages. Shoal includes a machine-aware runtime that selects optimal implementations for this memory abstraction dynamically during buffer allocation based on the hints and a concrete combination of machine and workload. If available, Shoal is able to exploit not only NUMA properties but also hardware features such as large pages and DMA copy engines. Our contributions are:

• a memory abstraction based on arrays that decouples data access from the rest of the program,
• an interface for programs to specify memory access patterns when allocating memory,
• a runtime that selects from several highly tuned array implementations based on access patterns and machine characteristics and can exploit machine-specific hardware features,
• modifications to Green-Marl [20], a graph analytics language, to show how Shoal can extract access patterns automatically from high-level descriptions.

[Figure 1: Architecture of a modern multi-core machine. Four NUMA nodes, each with local RAM, memory controllers (MC), L3 cache, and DMA engines, connected by interconnect links (IC); accelerators (a Xeon Phi and GPGPUs with GDDR memory) are attached via 16x PCI Express links.]

2 Motivation

Modern multi-core machines have complex memory hierarchies consisting of several memory controllers placed across the machine – for example, Figure 1 shows such a machine with four host main memory controllers (one per processor socket). Memory latency and bandwidth depend on which core accesses which memory location [8, 10]. The interconnect may suffer congestion when access to memory controllers is unbalanced [13]. Future machines will be more complex: they may not provide global cache coherence [21, 25], or even shared global physical addresses [1]. Even today, accelerators like Intel's Xeon Phi, GPGPUs, and FPGAs have higher memory access costs for parts of the physical memory space [1]. Such hardware demands even more care in application data placement.

This poses challenges to programmers when allocating and accessing memory. First, detailed knowledge of hardware characteristics and a good understanding of their implications for algorithm performance is needed for efficient scalable programs. Care must be taken when choosing a memory controller to allocate memory from, and how to subsequently access that memory.

Second, hardware changes quickly, meaning that design choices must be constantly re-evaluated to ensure good performance on new hardware. This imposes high engineering and maintenance costs. This is worthwhile in high-performance computing or niche markets, but general-purpose machines have too broad a hardware range for this to be practical for many domains. The result is that programs typically fall back on the operating system's default memory allocation strategy. Memory is not allocated directly when calling malloc, but mapped only when the corresponding memory is first accessed by a thread. The resulting page fault causes Linux to back the faulting page from the NUMA node of the faulting core.

A surprising consequence of this choice is that on Linux the implementation of the initialization phase of a program is often critical to its memory performance, even though programmers rarely consider initialization as a candidate for heavy optimization, since it almost never dominates the total execution time of the program.

To see why, consider that memset is the most widely used approach for initializing the elements of an array. Most programmers will spend little time evaluating alternatives, since the time spent in the initialization phase is usually negligible. An example is as follows:

    // ---- Initialization (sequential) -------
    void *ptr = malloc(ARRSIZE);
    memset(ptr, 0, ARRSIZE);
    // ---- Work (parallel, highly optimized) -----
    execute_work_in_parallel();

The scalability of a program written this way can be limited. memset executes on a single core and so all memory is allocated on the NUMA node of that core. For memory-bound parallel programs, one memory controller will be saturated quickly while others remain idle since all threads (up to 64 on the machines we evaluate) request memory from the same controller. Furthermore, the interconnect close to this memory controller will be more susceptible to congestion.

There are two problems here: (i) memory is not allocated (or mapped) when the interface suggests (memory is not allocated inside malloc itself but later in the execution) and (ii) the choice of where to allocate memory is made in a subsystem (the OS kernel) that has no knowledge of the intended access patterns of this memory.

This can be addressed by tuning algorithms to specific operating systems. For example, we could initialize memory using a parallel for loop:

    // ---- Initialization (parallel) -------
    void *ptr = malloc(ARRSIZE);
    #pragma omp parallel for
    for (int i = 0; i < ARRSIZE; i++)
        ((char *)ptr)[i] = 0;

This works, but it ties the program to the specifics of the operating system and must be revisited whenever allocation policies change. Furthermore, it also requires correct setup of OpenMP's CPU affinity to ensure that all cores participate in this parallel initialization phase in order to spread memory equally on all memory controllers. Finally, we might do better: allocate memory close to the cores that access it the most.

Beyond simple placement of data, ideas and techniques from traditional distributed systems like replication and partitioning can help to improve memory management [31]. Replication localizes data access by storing several copies of the same data, distributing load and reducing communication costs. Replication carries the cost of maintaining consistency when updating data, as well as increasing the program's memory footprint. Partitioning chunks data and places these blocks onto different nodes. This balances load and, if work is scheduled close to the data it is accessing, also localizes array accesses. The key challenge in applying these techniques is that the choice of a good distribution strategy depends critically on concrete combinations of machine and workload.

Our work starts with the observation that memory access patterns of applications are often encoded in high-level languages or known by programmers. We show how this information can be used to tune memory placement and access without programmers having to understand the characteristics of the machine at hand.

Automatic annotations from high-level DSLs. A trend we exploit is the emergence of high-level domain specific languages (DSLs) [20, 37, 41]. These languages are known for their ease of programming since their semantics closely match a specific application domain. DSLs typically compile to a low-level language (such as C), possibly with several backends depending on the target machine to execute the program. DSLs can provide us with memory access patterns directly from the input program with relatively simple modifications to high-level compilers. Listing 1 shows an example program for the Green-Marl graph analytics DSL.

    Procedure pagerank(/* arguments */) {
        // .. initialization here
        Do {
            diff = 0.0; cnt++;
            Foreach (t: G.Nodes) {
                Double val = (1-d) / N + d *
                    Sum(w: t.InNbrs) {
                        w.pg_rank / w.OutDegree()
                    };
                diff += | val - t.pg_rank |;
                t.pg_rank <= val @ t;
            }
        } While ((diff > e) && (cnt < max));
    }

    Listing 1: Excerpt from Green-Marl's PageRank

Memory access patterns from two of Green-Marl's high-level constructs in the PageRank example can be determined as follows: (i) Foreach (t: G.Nodes) means the nodes array will be accessed sequentially, read-only, and with an index, and (ii) Sum(w: t.InNbrs) implies read-only, indexed accesses on the in-neighbors array.

We argue that memory access patterns encoded in DSLs present a significant performance opportunity, and should be passed to the runtime to enable automatic tuning of memory allocation and access. Since low-level code is generated by the DSL compiler, it is also relatively easy to change the programming abstractions used by the generated code for accessing memory. Only the compiler (rather than the input program) must be changed in such a case.

Manual annotations. Even without a DSL, programmers often know data access patterns when writing a program. They understand the semantics of their programs and, hence, how memory is accessed, but have no way of passing this knowledge to the runtime to guide data placement and access. Existing interfaces intended to enable this are coarse-grained and inflexible. One example is libnuma's [34] NUMA-aware memory allocation, which allows a client to specify which node memory should be allocated from, but does not allow combining this with other allocation options (such as large pages), and requires a programmer to manually integrate this with parallel task scheduling.

In Shoal, we automatically tune data placement and access based on memory access patterns and hints provided by a high-level compiler or by programmers. We introduce (i) a new interface for memory allocation, including a machine-aware malloc call that accepts hints to guide placement, and (ii) an abstraction for data access based on arrays. For these arrays, we provide several implementations including data distribution, replication and partitioning. All implementations can be interchanged transparently without the need to change programs. Our abstraction also admits implementations that are tuned to hardware features (such as DMA engines) or accelerators (Xeon Phi). The Shoal library automatically selects array implementations based on array access patterns and machine specifications. We currently support adaptations based on the NUMA hierarchy, DMA engines, and large MMU pages.

The result is that Shoal allows programmers to write programs that achieve good performance without (i) having to understand machine characteristics and (ii) needing to constantly rewrite applications in order to keep up with hardware changes. We demonstrate Shoal using the Green-Marl graph DSL, and the Streamcluster low-level C program from the PARSEC benchmark suite.
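As a concrete illustration of such a hint-carrying allocation, the following sketch shows what generated (or hand-written) allocation code for two of the PageRank arrays could look like, using the shl__malloc_array() call introduced in Section 3 (Listing 2). The variable names and flag values are illustrative only; this is not the actual output of the modified Green-Marl compiler.

    // Illustrative only: allocations carrying access-pattern hints, following
    // the shl__malloc_array(size, ro, indexed, used) signature of Listing 2.
    // num_nodes, num_edges, and node_t are assumed names, not from the paper.
    shl_array<double> *pg_rank =
        shl__malloc_array<double>(num_nodes,
                                  /* ro */ false,   // updated every iteration
                                  /* indexed */ true,
                                  /* used */ true);
    shl_array<node_t> *in_neighbors =
        shl__malloc_array<node_t>(num_edges,
                                  /* ro */ true,    // graph structure is read-only
                                  /* indexed */ true,
                                  /* used */ true);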

3 Shoal's array abstraction

Shoal's memory abstraction is based on arrays. We found this sufficient for the workloads we have been looking at in the context of this research, but we expect to add more data types in the future. We provide several array implementations, but all of them implement the same interface. This allows Shoal to select an implementation transparently to the programmer. The optimal choice depends on machine characteristics and memory access patterns.

    // allocate an array
    template<class T>
    shl_array<T>* shl__malloc_array(size_t size,
                                    bool ro, bool indexed, bool used);
    // get element at position i
    T get(size_t i);
    // set element at position i to v
    void set(size_t i, T v);
    // number of elements
    size_t get_size(void);
    // copy from another Shoal array
    int copy_from_array(shl_array<T> *src);
    // initialize every element with value
    int init_from_value(T value);
    // copy from a non-Shoal array
    void copy_from(T* src);
    // calculate the CRC checksum
    unsigned long get_crc(void);
    // initialize the current thread
    void shl__thread_init(void);
    // synchronize replicas
    void shl__repl_sync(void* src, void **dest,
                        size_t num_dest, size_t size);

    Listing 2: Interface of Shoal

3.1 Interface and programming model

Listing 2 illustrates Shoal's programming interface, which decouples computation and memory access, allowing transparent selection of different array implementations. Besides the usual set() and get() operators, we provide a collection of high-level functions to initialize memory and copy between Shoal arrays.

Thread initialization. In OpenMP, Shoal uses builtin functions to determine the thread ID and the corresponding replica to be used. Per-thread array pointers can otherwise be set up by manually calling shl__thread_init() on each thread.

Array allocation. shl__malloc_array allocates Shoal arrays and selects the best implementation for the machine it is running on based on memory access pattern hints given as arguments. Shoal always maps all pages of an array to guarantee memory allocation and avoid non-determinism.

Data operations. Reads and writes to arrays are performed with get() and set(), but we also provide optimized high-level array operations for initializing and copying arrays. These provide relaxed consistency guarantees: the order in which elements are initialized or copied is not specified, allowing these operations to be parallelized and offloaded to DMA engines in an asynchronous fashion. Writes to replicas can be realized by writing to the master copy and propagating the changes to all replicas using shl__repl_sync(). This makes it possible to re-initialize replicated arrays, for example to reuse otherwise read-only buffers in streaming applications.

3.2 Array types

We currently provide four array implementations.

Single-node allocation. Allocates the entire array on the local node. While limited in scalability, performance is independent of the OS since memory is guaranteed to be mapped in the allocation phase. Single-node arrays are rarely used for parallel programs.

Distribution. Distributed arrays allocate data equally across NUMA nodes. The precise distribution is not specified and depends on the implementation. This reduces pressure on memory controllers, but can lead to high latency or congestion if many accesses are remote. The performance of distributed arrays can be non-deterministic, as data is scattered semi-randomly and might vary between program executions.

Replication. Several copies of the array are allocated. We currently always place one replica on each memory controller. All data is then accessed locally. In addition to distributing load across the system, this reduces pressure on the interconnect at the cost of increased memory footprint.

Partitioning. Partitioning is a form of distribution where data is spread out in the machine such that work units can be executed local to where their data is allocated. If done carefully, array accesses are local as with replication, but without the increased memory footprint. This implies a scheduling challenge, since the working set for each thread must be known and the jobs scheduled accordingly.
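As a usage sketch (not taken from the paper's code), the following fragment accesses a Shoal array from an OpenMP loop using only the operations of Listing 2. get(), set(), and get_size() are assumed to be member functions of shl_array, and the array is assumed to have been created earlier with shl__malloc_array().

    // Sketch: scale every element of an existing Shoal array in parallel.
    void scale(shl_array<double> *arr, double factor)
    {
        size_t n = arr->get_size();
        #pragma omp parallel
        {
            shl__thread_init();   // per-thread setup (see Section 3.1)
            #pragma omp for
            for (size_t i = 0; i < n; i++)
                arr->set(i, arr->get(i) * factor);
        }
    }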
3.3 Selection of arrays

In selecting an array implementation, we try to: (i) maximize local access to minimize interconnect traffic, (ii) load-balance memory on all available controllers to avoid points of contention, and (iii) transparently use hardware features when available (e.g. DMA and large pages).

We show our policy for selecting array implementations in Figure 2: 1. we use partitioning if the array is only accessed via an index, 2. we enable replication if the array is read-only and fits into every NUMA node of the host machine, and 3. we otherwise use a uniform distribution among all available memory controllers.
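The decision logic above is small enough to state directly in code. The following sketch mirrors the three rules of Figure 2; the enum and predicate names are illustrative, not Shoal's internal API.

    // Sketch of the array-selection policy of Figure 2 (illustrative names).
    enum array_kind { ARRAY_PARTITIONED, ARRAY_REPLICATED, ARRAY_DISTRIBUTED };

    array_kind select_array(bool indexed_only,   // accessed only via the loop index?
                            bool read_only,      // never written after initialization?
                            bool fits_all_nodes) // one replica fits on each NUMA node?
    {
        if (indexed_only)
            return ARRAY_PARTITIONED;
        if (read_only && fits_all_nodes)
            return ARRAY_REPLICATED;
        return ARRAY_DISTRIBUTED;    // default: spread across all memory controllers
    }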

We only replicate read-only arrays, as we found that the cost of maintaining consistency dominates the performance benefits in current NUMA machines (Section 5.4) – however, we plan to revisit this for more complex NUMA hierarchies. In case of limited RAM, where the increase in working set size with replication is not tolerable, we can selectively activate replication based on the cost function of memory accesses extracted from the high-level program.

[Figure 2: Array selection. Decision flow: if the array is accessed only via an index, it is partitioned; otherwise, if it is read-only and fits on all NUMA nodes, it is replicated; otherwise it is distributed.]

Large pages. If available, we use large pages. This is not always optimal, and the impact of using large pages is hard to anticipate, but on average enabling large pages improves performance (in the future, we plan to enable large pages per-array). Most current multi-core machines have independent TLBs for different page sizes, suggesting it might be useful to retain some arrays on normal pages. Also, the TLB's coverage for randomly accessed large arrays is still not sufficient to prevent TLB misses. One approach would use large pages for mostly-sequential array access, and 4k pages for randomly accessed data. We also plan to use huge pages (typically 1GB), which may keep most of the working set covered by the TLB even for big workloads.

Overall, we have found this policy to be simple, but effective.

4 Implementation

The Shoal runtime library is structured in two parts: (i) a high-level array representation as defined in Section 3 based on C++ templates and (ii) a low-level, OS-specific backend. We now describe the work-flow invoked when Shoal is used together with a high-level DSL, and then describe our low-level backends.

Figure 3 shows how Shoal is used with high-level, parallel languages.

[Figure 3: Shoal system overview. A high-level compiler translates the high-level program into low-level code (e.g. C) annotated with access patterns; at run time, the Shoal library combines these access patterns with a hardware specification/config file to choose array implementations.]

High-level program. The input program is written in a high-level parallel language, which in this paper is Green-Marl, a DSL for graph analysis. Many other high-level languages such as SQL [18] and OptiML [37] provide similar resource usage information to the kind we extract from Green-Marl; we show an example of a Green-Marl expression of PageRank in Listing 1.

High-level compiler. High-level DSLs often encode access patterns in an intuitive way: for instance, the Green-Marl DSL features language constructs to access all nodes in a graph (see Foreach (t: G.Nodes) in Listing 1). From the language specification, we know that this represents a read-only and sequential data access to the nodes array. The high-level compiler translates the input into low-level code, and in this translation such access information is normally lost. Our modifications to the Green-Marl compiler extract this knowledge about array access patterns from the input source code and make it accessible to the generated code, which uses Shoal's array abstraction.

Low-level code with array abstractions. The generated code – here, C++ – uses Shoal's abstraction to allocate and access memory. At compile time the concrete choice of array implementation is not made; this happens later at runtime based on hardware specifications.

Access patterns. In high-level languages, memory accesses are usually implicit and translated into simple load and store instructions. In Shoal, however, we use the compiler to generate important information about load/store patterns which is then used by the runtime library.

Firstly, we capture the read/write ratio. The number of reads and writes by a program is workload specific, but Shoal can still in many cases extract a formula estimating the number of reads and writes for a given input. For example, the number of reads of Green-Marl's PageRank rank array is k·E + k·N (where E is the number of edges, N the number of nodes, and k the number of iterations). Currently, Shoal derives these formulas but only uses them to determine if the array is read-only; we expect more sophisticated uses of this information in the future. Secondly, we infer if all accesses to an array are based solely on the loop variable of a parallel loop. tmp indicates the array is used for temporary values, ro that it is read-only, etc.

An example of the automatically extracted information for Green-Marl's PageRank can be seen in Table 1.

Hardware specification. The Shoal runtime takes the hardware configuration of the system into account when selecting array implementations.

Currently, we consider the following hardware features: (i) NUMA topology: Shoal has a NUMA-aware array allocation function that attempts to distribute load on all memory controllers and localize access to reduce pressure on interconnects. (ii) RAM size: available memory can limit which arrays can be replicated. (iii) Page size: modern systems offer various page sizes and often provide a dedicated TLB for each size. The use of large and huge pages is useful in reducing TLB misses [17] in large working sets. (iv) DMA engines: some CPUs have integrated DMA engines [22]. We make use of these for copy and initialization.

[Table 1: Shoal's extracted array properties for PageRank (arrays begin, r_begin, r_node_idx, pg_rank, pg_rank_nxt; extracted properties include access counts in N and E, ro, std, tmp, and indexed).]

Shoal program. The Shoal library takes care of selecting array implementations based on the extracted access patterns and the hardware specification of the machine. The executable generated by the compiler is a program binary which links against the Shoal library.

OS-specific backends. To improve portability, we separate high-level array implementations from low-level, OS-dependent functions which mediate access to the memory allocation facilities or DMA devices. Currently, we run on the Linux and Barrelfish [6] OSes to demonstrate portability.

Topology information. Shoal needs to obtain information about the system architecture including the number of NUMA nodes, their sizes and corresponding CPU affinities. On Linux, this information can be obtained using libnuma [34]. In Barrelfish, hardware information is stored in the system knowledge base [33].

Scheduling. For replication and partitioning, Shoal must map threads to cores. On Linux we pin threads by setting the affinities, whereas on Barrelfish we directly create threads on specific cores. Given a concrete data distribution, scheduling can be optimized to execute work units close to where data is accessed. To date, Shoal is not fully integrated with the OpenMP runtime and we use a static OpenMP schedule for partitioning to ensure that work units are executed close to the partitions they are working on. This works well for balanced workloads, but can lead to significant slowdown compared to dynamic schedules if the cost of executing work units is non-uniform. In the future, we plan to design and integrate our own OpenMP runtime to provide fine-grained control of scheduling without losing performance for unbalanced workloads. An alternative approach would schedule work units on partitions using OpenMP 4.0's team statement.

Memory allocation. We want to provide strong guarantees on where memory is allocated, but allocation policies are not consistent across OSes – indeed, they even change between different versions of the same OS. Linux, for instance, implements a first-touch allocation policy, which causes confusion about where and when memory will actually be allocated. Libraries such as libnuma provide an interface which gives more control, but this lacks support for large and huge pages. Barrelfish [7] gives the user the ability to manage its own address space via self-paging [19]: an application requests memory explicitly from a specific NUMA node and maps it as it wishes.

These systems provide different trade-offs between complexity, portability, and maintainability of application code and efficient use of the memory system: an explicit, flexible interface imposes an additional burden on the client. We believe that programmers should not have to deal with this complexity and want to avoid manual tuning to adapt programs to new machines.

5 Evaluation

Our goal in this section is to show that programs scale and perform significantly better with Shoal than with a regular memory runtime. We also compare our array implementations, analyze Shoal's initialization cost, and finally investigate the benefits of using a DMA engine for array copy.

Table 2 shows the machines used for our evaluation. Our results on both the 8x8 AMD Opteron and the 4x8x2 Intel Xeon are similar. For brevity, we focus on results from our 8x8 AMD Opteron unless stated otherwise. We use two workloads: Green-Marl and PARSEC Streamcluster.

Green-Marl. The Green-Marl compiler comes with a variety of programs of which we have selected three graph algorithms to demonstrate the performance characteristics of Shoal: (i) PageRank [30] iteratively calculates the importance of each node in the graph as a sum of the rank of all incoming neighbors divided by the number of outgoing edges they have, (ii) hop-distance calculates the distance of every node from the root using Bellman-Ford, and (iii) triangle-counting, which counts the number of triangles in the input graph. This is implemented as a triple loop: for all nodes in the graph, it looks at all combinations of nodes reachable from it and checks if there is an edge connecting them.

For our evaluation we used two graphs: (i) the Twitter graph [23] with 41M nodes and 1468M edges. The total working set size is 2.459 GB with Green-Marl configured to 64-bit node and edge types (excluding unused arrays). We were not able to run triangle-counting on the Twitter graph on our system, hence we fall back on (ii) the LiveJournal graph [5] for that workload. LiveJournal has 4M nodes and 69M edges with a total working set of 392 MB in Green-Marl.

6 268 2015 USENIX Annual Technical Conference USENIX Association machine 8x8 AMD Opteron 4x8x2 Intel Xeon 2x10 Intel Xeon CPU AMD Opteron 6378 Intel Xeon E5-4640 Intel Xeon E5-2670 v2 micro architecture Piledriver Sandy Bridge Ivy Bridge #nodes/#sockets 8/4 @ 2.4 GHz 4/4 @ 2.4 GHz 2/2 @ 2.5 GHz L1 data size 16K /thread 32K /core 32K /core L2/L3 size 2048K / 6144K 256K / 20480K 256K / 25600K memory 512 GB (64 GB per node) 512 GB (128 GB per node) 256 GB (128 GB per node) Table 2: Machines used for evaluation (L2 shared by core, L3 shared by socket)

PARSEC – Streamcluster. Streamcluster [9] solves the online clustering problem. Input data is given as an array of multi-dimensional points. We manually modified it to use Shoal for memory allocation and accesses.

5.1 Scalability

In highly parallel workloads, scalability is one of the key concerns. In this section we show the benefits of using Shoal over unmodified versions of the workloads, and that allocating memory based on access patterns, if available, is favorable over online methods.

Green-Marl. We evaluated the scalability of three Green-Marl workloads on an 8x8 AMD Opteron, comparing Shoal against the original Green-Marl implementation and Carrefour [13]. Figure 4 shows that Shoal clearly outruns the original implementation by almost 2x and also performs better than the online method as a result of an optimized memory placement. Furthermore, our results show that an online method can harm performance in case pages are getting migrated back and forth (hop-distance). Overall, except for triangle-counting, all implementations scale well. Note that we do not include the graph loading time and Shoal initialization.

We also executed the same measurements on Barrelfish to show Shoal's portability. Our intention is not to show that either operating system is faster than the other, but rather to show their comparability. On Barrelfish, only static OpenMP schedules are supported due to implementation limitations. This negatively impacts the performance for triangle-counting. However, Shoal still performs better than the original implementation, which uses dynamic OpenMP schedules.

PARSEC – Streamcluster. In contrast to Green-Marl, Streamcluster is implemented in C and hence there is no automatic method of extracting access patterns. We modified Streamcluster to use Shoal's array abstraction to demonstrate that using Shoal directly by programmers can improve scalability with little effort for manual annotation. To make Streamcluster work with Shoal, we had to (i) abstract access to arrays using Shoal's get and set methods, and (ii) initialize each thread using shl__thread_init() and change the array allocation to use shl__malloc_array() instead of malloc().

[Figure 5: Scalability of PARSEC Streamcluster on 8x8 AMD Opteron (runtime vs. number of threads for the original implementation, Shoal, and Carrefour).]

Since Streamcluster is a streaming application, arrays for input coordinates are reused for each chunk of new streaming data, but are otherwise read-only. We use (iii) shl__repl_sync() to synchronize the master copy of the array to its replicas once after a new chunk has been read. We compare the original Streamcluster implementation with Carrefour and Shoal (Figure 5). Our results confirm the bad scalability of Streamcluster due to the use of memset() after allocating arrays [13, 17], which causes all memory to be allocated on a single NUMA node. This leads to congestion of the interconnect and memory controllers of that node. Shoal achieves a 4x improvement over the original implementation. Shoal's annotated allocation function outperforms Carrefour's online method. We want to emphasize here that we replaced only one of the used arrays with a Shoal array (large pages and replication) and did not apply further optimizations.

5.2 Comparison of array implementations

We conducted a detailed analysis of Shoal's different array implementations using all physical cores of our machines. In this section we show that Shoal achieves better performance than the original Green-Marl implementation regardless of which array configuration we use. Figure 6 shows our results normalized to the original Green-Marl implementation and Figure 8 shows the breakdown into initialization and computation times. The measurements were executed on the 8x8 AMD Opteron using all 32 physical cores.

[Figure 4: Scalability on 8x8 AMD Opteron for PageRank, hop-distance, and triangle-counting (runtime vs. number of threads for the original OpenMP Green-Marl on Linux, Shoal on Linux, Shoal on Barrelfish, and Carrefour). Workload: Twitter (LiveJournal for triangle-counting). For Shoal, we chose the best configuration for each workload. We omit results for 64 threads: using SMT threads has similar performance to using only the 32 physical cores.]

[Figure 6: Comparison of various combinations of array implementations on 8x8 AMD Opteron: distribution (-d), replication (-r), partitioning (-p), large pages (-l); runtime normalized to stock Green-Marl.]

In the following, we give explanations for each configuration and relate them to our performance counter observations (Figure 7).

Distribution. The original Green-Marl implementation already initializes memory for storing the graphs with an OpenMP loop to distribute memory in the machine. However, this is not done for dynamically allocated arrays (e.g. the rank_next array in PageRank). With Shoal, all arrays are ensured to be distributed among the nodes, resulting in a more even distribution of memory and better performance across all workloads. Initialization of distributed arrays relies on an OpenMP loop to allocate memory evenly across all NUMA nodes. Initializing memory is hence executed in parallel, which results in small initialization cost compared to other array types. We show this in Figure 8.

Our claims are supported by measurements of each memory controller's read and write throughput (Figure 7): compared to the original implementation (i), where all reads and writes are executed on socket 0, enabling distribution (ii) results in an evenly distributed load on all memory controllers. However, in both cases, the memory controllers are not saturated. Memory throughput suffers from the lower bandwidth of the interconnect links, i.e. 9.6 GB/s for QPI. With random distribution of memory, only 1/4 of all memory accesses are expected to be local.

Distribution + replication. In contrast to distribution, replication is applied only to read-only data. In our workloads, the graph itself is not altered by the program and hence replicated among the nodes. This results in an increased fraction of locally served memory accesses and lower interconnect traffic: memory accesses are evenly distributed among all memory controllers as shown in (iv) of Figure 7. Note that enabling replication without distribution allocates non-read-only arrays into single-node arrays, resulting in unbalanced memory access for that part of the working set, see (iii) of Figure 7. Initialization costs for replicated arrays are higher than for distributed arrays because more memory needs to be allocated. We force correct allocation by touching each replica on its designated node. Finally, copying the master array to the other replicas causes some additional overhead when initializing such an array (Figure 8).

Partitioning. Replication of data increases the memory footprint of the application. Partitioning tries to preserve the locality of replication without increasing the memory footprint. Our current implementation requires a static OpenMP schedule for partitioning to ensure scheduling of work units to the right partitions. However, static schedules potentially lead to imbalance of work among the execution units as workloads may be skewed (e.g. in triangle-counting). Even though the same amount of memory has to be allocated as with distributed arrays, its initialization is more complex: using Linux's first-touch policy, Shoal ensures memory is touched on the correct node by migrating a thread to where memory should be allocated and touching each page from there. This results in similar initialization time as with replication, but slightly less time to copy the data.
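A minimal sketch of this first-touch placement on Linux, assuming libnuma is available (link with -lnuma). It is not Shoal's implementation: it simply migrates the calling thread to each node in turn and touches that node's share of the buffer so that the kernel maps those pages locally.

    // Sketch (not Shoal's code): back each partition of buf on its NUMA node
    // by touching its pages from a thread that currently runs on that node.
    #include <numa.h>
    #include <stddef.h>

    static const size_t PAGE = 4096;

    void place_partitions(char *buf, size_t size, int num_nodes)
    {
        size_t part = size / num_nodes;
        for (int node = 0; node < num_nodes; node++) {
            numa_run_on_node(node);                 // migrate this thread to node
            size_t off = (size_t)node * part;
            size_t len = (node == num_nodes - 1) ? size - off : part;
            for (size_t i = 0; i < len; i += PAGE)
                buf[off + i] = 0;                   // first touch maps the page here
        }
        numa_run_on_node(-1);                       // allow scheduling on any node again
    }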

Large pages. Modern CPUs support various page sizes and have a distinct TLB for each page size. A miss in the TLB forces the CPU to do a full page table walk, which drastically increases the access time. Shoal supports large pages for its arrays. Enabling large pages for PageRank and triangle-counting results in slightly better performance, while hop-distance runtime increases slightly. Gaud et al. [17] reached similar conclusions in their experiments with large pages. Enabling large pages reduces the total number of pages used and therefore the number of required first touches in the allocation process. This results in a decrease of the allocation time (Figure 8).

We conclude that despite the additional overhead of allocation and initialization, the total runtime with Shoal is still reduced. However, we want to emphasize that we do not consider initialization time a main target of optimization, as typically the time spent for computation dominates the program execution. Nevertheless, allocation could be improved by (i) maintaining a cache of pre-allocated pages on each node, (ii) applying a smarter page mapping strategy, or (iii) initializing Shoal while input data (e.g. the graph) is loaded.

[Figure 8: Shoal initialization and runtime on 8x8 AMD Opteron for various array configurations using PageRank with the Twitter workload (allocation, copy, and compute time for the OpenMP baseline and Shoal with +distribution, +replication, +partitioning, +large pages, -partitioning).]

5.3 Use of DMA engines

Modern CPUs have integrated DMA engines, which provide a rich set of memory operations. For instance, recent Intel server CPUs provide integrated CrystalBeach 3 DMA engines [22]. We evaluate the use of DMA engines for initialization and copy operations on a 2x10 Intel Xeon (our 8x8 AMD Opteron and 4x8x2 Intel Xeon do not have DMA engines). We run these experiments on Barrelfish, as user-level support for DMA engines is already integrated and requires no additional setup.

We now compare the raw copy performance of DMA controllers to CPU memcpy() and further evaluate how DMA engines can be used in Shoal. Asynchronous memory operations offered by DMA controllers can free the CPU from the burden of copying data around and provide cycles to do actual work. Shoal offers an interface to start an asynchronous memory copy and to check for completion of the operation.

[Figure 9: Comparison of parallel OpenMP copy and DMA copy on 2x10 Intel Xeon for large buffers (≫ cache size); throughput in GB/s vs. number of threads/channels for DMA and memcpy on one and two nodes.]

Comparing raw memory throughput. Our results of a raw throughput evaluation (Figure 9) show that the use of DMA engines does not necessarily improve the sequential performance, especially for blocking copy operations as used by PageRank. However, to outperform the DMA controller, all threads of the CPU have to be used for memory copying and hence no other computational task can be executed in the meantime.

We now show that DMA engines are useful if only a few threads are available for synchronously copying arrays, or for asynchronous copies. For example, we are planning to evaluate the use of DMA engines to propagate writes to replicas in the background.

[Figure 10: Initialization cost for copying data into Shoal arrays using the Twitter working set with replication on Barrelfish; copy time vs. the fraction of data copied by DMA transfers, with 20 physical threads and with 40 SMT threads.]

DMA engines for initialization. We benchmark the initialization phase of PageRank where data is copied from the graph's memory into the Shoal arrays. We copy a certain ratio of the array using DMA engines asynchronously while using parallel OpenMP loops to copy the remaining elements.
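The hybrid copy in this experiment can be sketched as follows. Shoal's asynchronous copy interface is not named in the paper, so shl__memcpy_async() and shl__memcpy_wait() below are placeholder declarations rather than real API; the OpenMP loop copies the remaining elements on the CPU.

    // Placeholder declarations for an asynchronous copy interface (assumed).
    void *shl__memcpy_async(void *dst, const void *src, size_t bytes);
    void  shl__memcpy_wait(void *request);

    // Sketch of the hybrid DMA + OpenMP copy used in the Figure 10 experiment.
    void hybrid_copy(double *dst, const double *src, size_t n, double dma_ratio)
    {
        size_t dma_n = (size_t)(n * dma_ratio);      // share handled by the DMA engine

        void *req = shl__memcpy_async(dst, src, dma_n * sizeof(double));

        #pragma omp parallel for                      // CPU copies the rest in parallel
        for (size_t i = dma_n; i < n; i++)
            dst[i] = src[i];

        shl__memcpy_wait(req);                        // wait for DMA completion
    }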

[Figure 7: Memory throughput (read and write, in GB/s over time) for sockets 0 and 1 on 4x8x2 Intel Xeon, for four configurations: (i) no optimizations, (ii) distributed, (iii) replicated, (iv) distributed + replicated. In the first 35 seconds, the graph is loaded from disk. For replication, we show the replica initialization cost in green. Note: sockets 2 and 3 behave like socket 1 and are left out for readability.]
Table 3: Writeable replicas on 8x8 AMD Opteron. Workload: 5.4 Writeable replication hop-distance with -d -r -h configuration Finally, we look at the issue of write-shared arrays. 6 Related work Efficiently maintaining consistency of replicated data Our work was originally inspired by recent research in is difficult; updates must be propagated to all replicas. domain specific languages. Such languages are based This can be achieved by issuing writes to all replicas on the observation that it is hard to write efficient code or by applying techniques such as double-buffering and for a wide-range of different systems, as algorithms need asynchronous copies. Both relax consistency guarantees, to be tuned to a concrete machine in order to achieve but are strong enough for use with OpenMP loops, where good performance. DLSs express algorithms in a rich

DSLs express algorithms in a rich and intuitive way. The Green-Marl [20] graph analytics DSL, OptiML [37], a machine learning DSL, as well as SPL [41], a signal processing language, all provide a rich set of powerful high-level operators. They use a compiler that generates code that is highly tuned to the target machine and makes heavy use of data parallelism. While all of these languages encode memory access patterns in their high-level languages, none uses them to adapt memory allocation at runtime.

Modern machines are becoming inherently complex: Baumann et al. argued that computers are already a distributed system in their own right [8]. They proposed a multikernel approach [7] which avoids sharing of state among OS nodes by replication and applying techniques from distributed systems. Similarly, Wentzlaff et al. [40] apply partitioning to OS services. Techniques from distributed systems are beneficial not only on an OS level, but also for applications. Multimed [31] replicates database instances within a single multi-core machine. Salomie et al. showed that congestion of memory controllers and interconnects impacts the overall performance. Carrefour [13] attempts to reduce the contention on interconnects and memory controllers by online monitoring of memory accesses and auto-tuning NUMA-aware memory allocation. This approach can be applied to any application without modifications to the program code, but is less fine-grained. While Shoal derives a program's semantics from annotations or DSL compiler analysis, these approaches need to guess programmers' intentions in retrospect.

Systems such as SGI's Origin 2000 [35] use page-level migration and replication of data. Hardware monitors detect the access patterns to pages (e.g., which processors tend to access the page, and whether these are reads or writes). Based on the gathered data, pages are replicated or migrated towards a frequently accessing CPU.

With highly parallel workloads, efficient synchronization is crucial for application performance [14]. Lock cohorting [15] implements NUMA-aware locks by taking the cache hierarchy and NUMA topology into account. Shoal's treatment of memory is analogous to these systems' treatment of locks.

Access to large and huge MMU pages is provided by services and libraries like libhugetlbfs [4]. The latter, however, requires static setup of a large page pool, among other issues.

Cache coherence protocols like MOESI [2] allow cache lines to be in a shared state, which is a form of hardware-level replication. This is only effective for workloads with good locality and a small working set, which is not the case for our graph workloads. Research systems, such as Stanford FLASH, have provided software control over this form of replication [36].

The Solaris operating system provides a madvise [28] operation to let an application give hints on future accesses to a memory region, which results in a distributed or local allocation to the calling thread. Shoal extends this approach with a wider range of possible usage patterns, and infers appropriate settings to use.

Li [24] attempts to find the best algorithm for a specific task depending on machine characteristics and workload based on empirical search. In contrast, we decide a priori based on additional information extracted from high-level languages or given by manual annotations. Franchetti et al. [16] automatically tune FFT programs to multi-core machines. They argue that programming such machines is increasingly complicated, which increases the burden for programmers and makes a case for automatic tuning. Atune-IL [32] auto-tunes applications, including the number of threads etc. It explores all possible parameters, but tries to reduce the search space.

However, tuning data placement and parallelism individually is not optimal, because data and threads may not end up on the same node. Hence, affinity of threads and data needs to be enforced in order to improve the performance of OpenMP programs [38].

Finally, PGAS languages such as UPC [39], Co-Array Fortran [27], X10 [12], Chapel [11], and Fortress [29] provide an abstraction of shared arrays which can be implemented across a distributed system. Code iterating over an array can execute on the node holding the portion of the array being accessed.

In high-performance computing, array abstractions [26] have been used to simplify programming while still providing good performance and scalability. They support high-level operations on arrays, e.g. matrix multiplications or atomic operators.

7 Conclusion

In this paper, we presented Shoal, a library that provides an array abstraction and rich memory allocation functions that allow automatic tuning of data placement and access depending on workload and machine characteristics. Tuning is based on memory access patterns. These are either (i) given by manual annotation or, ideally, (ii) extracted automatically by modified compilers of high-level languages. We have shown that we can use this additional information to automatically choose array implementations that increase performance on today's NUMA systems. We report an up to 2x improvement for Green-Marl, a high-level graph analytics workload, without changing the Green-Marl input program. We found our memory abstraction as well as the simple policy for selecting the array implementation sufficient for current workloads and machines, but believe that future machines can benefit from a more fine-grained selection of array implementations.

References

[1] ACHERMANN, R. Message Passing and Bulk Transport on Heterogenous Multiprocessors. Master's thesis, ETH Zurich, 2014. http://dx.doi.org/10.3929/ethz-a-010262232.

[2] ADVANCED MICRO DEVICES. AMD64 Architecture Programmer's Manual Volume 2: System Programming. Online, 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf. Publication Number 24593, Revision 3.23.

[3] APPAVOO, J., SILVA, D. D., KRIEGER, O., AUSLANDER, M., OSTROWSKI, M., ROSENBURG, B., WATERLAND, A., WISNIEWSKI, R. W., XENIDIS, J., STUMM, M., AND SOARES, L. Experience Distributing Objects in an SMMP OS. ACM Transactions on Computer Systems 25, 3 (Aug. 2007).

[4] ARAVAMUDAN, N., LITKE, A., AND MUNSON, E. libhugetlbfs. Online, 2015. http://libhugetlbfs.sourceforge.net/.

[5] BACKSTROM, L., HUTTENLOCHER, D., KLEINBERG, J., AND LAN, X. Group Formation in Large Social Networks: Membership, Growth, and Evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2006), KDD '06, ACM, pp. 44–54.

[6] BARRELFISH PROJECT. The Barrelfish Operating System. Online, 2015. www.barrelfish.org.

[7] BAUMANN, A., BARHAM, P., DAGAND, P.-E., HARRIS, T., ISAACS, R., PETER, S., ROSCOE, T., SCHÜPBACH, A., AND SINGHANIA, A. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, ACM, pp. 29–44.

[8] BAUMANN, A., PETER, S., SCHÜPBACH, A., SINGHANIA, A., ROSCOE, T., BARHAM, P., AND ISAACS, R. Your computer is already a distributed system. Why isn't your OS? In Proceedings of the 12th Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2009), HotOS'09, USENIX Association, pp. 12–12.

[9] BIENIA, C., KUMAR, S., SINGH, J. P., AND LI, K. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2008), PACT '08, ACM, pp. 72–81.

[10] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI'08, USENIX Association, pp. 43–57.

[11] CHAMBERLAIN, B., CALLAHAN, D., AND ZIMA, H. Parallel Programmability and the Chapel Language. International Journal of High Performance Computing Applications 21, 3 (Aug. 2007), 291–312.

[12] CHARLES, P., GROTHOFF, C., SARASWAT, V., DONAWA, C., KIELSTRA, A., EBCIOGLU, K., VON PRAUN, C., AND SARKAR, V. X10: An Object-Oriented Approach to Non-Uniform Cluster Computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (New York, NY, USA, 2005), OOPSLA '05, ACM, pp. 519–538.

[13] DASHTI, M., FEDOROVA, A., FUNSTON, J., GAUD, F., LACHAIZE, R., LEPERS, B., QUEMA, V., AND ROTH, M. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 381–394.

[14] DAVID, T., GUERRAOUI, R., AND TRIGONAKIS, V. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 33–48.

[15] DICE, D., MARATHE, V. J., AND SHAVIT, N. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2012), PPoPP '12, ACM, pp. 247–256.

[16] FRANCHETTI, F., VORONENKO, Y., AND PÜSCHEL, M. FFT Program Generation for Shared Memory: SMP and Multicore. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2006), SC '06, ACM.

[17] GAUD, F., LEPERS, B., DECOUCHANT, J., FUNSTON, J., FEDOROVA, A., AND QUEMA, V. Large Pages May Be Harmful on NUMA Systems. In 2014 USENIX Annual Technical Conference (USENIX ATC 14) (2014), USENIX Association.

[18] GICEVA, J., SALOMIE, T.-I., SCHÜPBACH, A., ALONSO, G., AND ROSCOE, T. COD: Database / Operating System Co-Design. In CIDR (2013), www.cidrdb.org.

[19] HAND, S. M. Self-paging in the Nemesis operating system. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (1999), pp. 73–86.

[20] HONG, S., CHAFI, H., SEDLAR, E., AND OLUKOTUN, K. Green-Marl: A DSL for Easy and Efficient Graph Analysis. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS XVII, ACM, pp. 349–362.

[21] HOWARD, J., DIGHE, S., VANGAL, S. R., RUHL, G., BORKAR, N., JAIN, S., ERRAGUNTLA, V., KONOW, M., RIEPEN, M., GRIES, M., ET AL. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling. IEEE Journal of Solid-State Circuits 46, 1 (2011), 173–183.

[22] INTEL CORPORATION. Intel Xeon Processor E5-1600/E5-2600/E5-4600 v2 Product Families, Datasheet – Volume One of Two. Online, 2014. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v2-datasheet-vol-1.pdf. Document Number: 329187-003.

[23] KWAK, H., LEE, C., PARK, H., AND MOON, S. What is Twitter, a Social Network or a News Media? In WWW '10: Proceedings of the 19th International Conference on World Wide Web (New York, NY, USA, 2010), ACM, pp. 591–600.

[24] LI, X., GARZARÁN, M. J., AND PADUA, D. A Dynamically Tuned Sorting Library. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (Washington, DC, USA, 2004), CGO '04, IEEE Computer Society, p. 111.

[25] MATTSON, T. G., VAN DER WIJNGAART, R., AND FRUMKIN, M. Programming the Intel 80-core network-on-a-chip Terascale Processor. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Piscataway, NJ, USA, 2008), SC '08, IEEE Press, pp. 38:1–38:11.

[26] NIEPLOCHA, J., HARRISON, R. J., AND LITTLEFIELD, R. J. Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers. The Journal of Supercomputing 10, 2 (June 1996), 169–189.

[27] NUMRICH, R. W., AND REID, J. Co-Array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1–31.

[28] ORACLE CORPORATION. madvise() in Solaris 10. Online, 2015. http://docs.oracle.com/cd/E19253-01/817-0547/whatsnew-updates-72/index.html.

[29] ORACLE LABS. Fortress. Online, 2015. https://projectfortress.java.net.

[30] PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[31] SALOMIE, T.-I., SUBASU, I. E., GICEVA, J., AND ALONSO, G. Database Engines on Multicores, Why Parallelize When You Can Distribute? In Proceedings of the 6th Conference on Computer Systems (New York, NY, USA, 2011), EuroSys '11, ACM, pp. 17–30.

[32] SCHAEFER, C. A., PANKRATIUS, V., AND TICHY, W. F. Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing (Berlin, Heidelberg, 2009), Euro-Par '09, Springer-Verlag, pp. 9–20.

[33] SCHÜPBACH, A., PETER, S., BAUMANN, A., ROSCOE, T., BARHAM, P., HARRIS, T., AND ISAACS, R. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems (2008).

[34] SILICON GRAPHICS INTERNATIONAL CORPORATION. libnuma. Online, 2015. http://oss.sgi.com/projects/libnuma/.

[35] SILICON GRAPHICS INTERNATIONAL CORPORATION. Origin and Onyx2 Theory of Operations Manual. Online, 2015. http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=bks&srch=&fname=/SGI_Developer/OrOn2_Theops/sgi_html/ch02.html.

[36] SOUNDARARAJAN, V., HEINRICH, M., VERGHESE, B., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Washington, DC, USA, 1998), ISCA '98, IEEE Computer Society, pp. 342–355.

[37] SUJEETH, A., LEE, H., BROWN, K., ROMPF, T., CHAFI, H., WU, M., ATREYA, A., ODERSKY, M., AND OLUKOTUN, K. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), pp. 609–616.

[38] TERBOVEN, C., AN MEY, D., SCHMIDL, D., JIN, H., AND REICHSTEIN, T. Data and Thread Affinity in OpenMP Programs. In Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem? (New York, NY, USA, 2008), MAW '08, ACM, pp. 377–384.

[39] UPC CONSORTIUM. UPC Language and Library Specifications, November 2013. Version 1.3. Online. http://upc.lbl.gov/publications/upc-spec-1.3.pdf.

[40] WENTZLAFF, D., AND AGARWAL, A. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Operating Systems Review 43, 2 (Apr. 2009), 76–85.

[41] XIONG, J., JOHNSON, J., JOHNSON, R., AND PADUA, D. SPL: A Language and Compiler for DSP Algorithms. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (New York, NY, USA, 2001), PLDI '01, ACM, pp. 298–308.
