
Shoal: Smart Allocation and Replication of Memory for Parallel Programs

Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris∗
Systems Group, Dept. of Computer Science, ETH Zurich    ∗Oracle Labs, Cambridge, UK

Published in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15), July 8–10, 2015, Santa Clara, CA, USA.
https://www.usenix.org/conference/atc15/technical-session/presentation/kaestle

Abstract

Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program's access patterns. However, sub-optimal allocation can significantly impact the performance of parallel programs.

We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program, and (ii) a manually annotated version of the PARSEC Streamcluster benchmark.

1 Introduction

Memory allocation in NUMA multi-core machines is increasingly complex. Good placement of and access to program data is crucial for application performance and, if not carefully done, can significantly impact scalability [3, 13]. Although there is research (e.g. [7, 3]) in adapting to the concrete characteristics of such machines, many programmers struggle to develop software applying these techniques. We show an example in Section 5.1.

The problem is that it is unclear which NUMA optimization to apply in which situation and, with rapidly evolving and diversifying hardware, programmers must repeatedly make manual changes to their software to keep up with new hardware performance properties.

One solution to achieve better data placement and faster data access is to rely on automatic online monitoring of program performance to decide how to migrate data [13]. However, monitoring may be expensive due to missing hardware support (if pages must be unmapped to trigger a fault when data is accessed) or insufficiently precise (if based on sampling using performance counters). Both approaches are limited to a relatively small number of optimizations (e.g. it is hard to incrementally activate large pages or switch to using DMA hardware for data copies based on monitoring or event counters).

We present Shoal, a system that abstracts memory access and provides a rich programming interface that accepts hints on memory access patterns at runtime. These hints can either be manually written or automatically derived from high-level descriptions of parallel programs such as domain specific languages. Shoal includes a machine-aware runtime that selects optimal implementations for this memory abstraction dynamically during buffer allocation based on the hints and a concrete combination of machine and workload. If available, Shoal is able to exploit not only NUMA properties but also hardware features such as large pages and DMA copy engines. Our contributions are:

• a memory abstraction based on arrays that decouples data access from the rest of the program,
• an interface for programs to specify memory access patterns when allocating memory,
• a runtime that selects from several highly tuned array implementations based on access patterns and machine characteristics and can exploit machine-specific hardware features,
• modifications to Green-Marl [20], a graph analytics language, to show how Shoal can extract access patterns automatically from high-level descriptions.

[Figure 1: Architecture of a modern multi-core machine. Four NUMA nodes, each with local RAM, memory controllers (MC), L3 cache, and DMA engines, connected by interconnect links (IC); accelerators (a Xeon Phi and GPGPUs with GDDR memory) are attached via 16x PCI Express links.]

2 Motivation

Modern multi-core machines have complex memory hierarchies consisting of several memory controllers placed across the machine – for example, Figure 1 shows such a machine with four host main memory controllers (one per processor socket). Memory latency and bandwidth depend on which core accesses which memory location [8, 10]. The interconnect may suffer congestion when access to memory controllers is unbalanced [13]. Future machines will be more complex: they may not provide global cache coherence [21, 25], or even shared global physical addresses [1]. Even today, accelerators like Intel's Xeon Phi, GPGPUs, and FPGAs have higher memory access costs for parts of the physical memory space [1]. Such hardware demands even more care in application data placement.

This poses challenges to programmers when allocating and accessing memory. First, detailed knowledge of hardware characteristics and a good understanding of their implications for algorithm performance is needed for efficient scalable programs. Care must be taken when choosing a memory controller to allocate memory from, and how to subsequently access that memory.

Second, hardware changes quickly, meaning that design choices must be constantly re-evaluated to ensure good performance on new hardware. This imposes high engineering and maintenance costs. This is worthwhile in high-performance computing or niche markets, but general-purpose machines have too broad a hardware range for this to be practical for many domains. The result is that programs typically fall back on the operating system's default memory allocation strategy. Memory is not allocated directly when calling malloc, but mapped only when the corresponding memory is first accessed by a thread. The resulting page fault causes Linux to back the faulting page from the NUMA node of the faulting core.

A surprising consequence of this choice is that on Linux the implementation of the initialization phase of a program is often critical to its memory performance, even though programmers rarely consider initialization as a candidate for heavy optimization, since it almost never dominates the total execution time of the program.

To see why, consider that memset is the most widely used approach for initializing the elements of an array. Most programmers will spend little time evaluating alternatives, since the time spent in the initialization phase is usually negligible. An example is as follows:

    // ---- Initialization (sequential) -------
    void *ptr = malloc(ARRSIZE);
    memset(ptr, 0, ARRSIZE);
    // ---- Work (parallel, highly optimized) -----
    execute_work_in_parallel();

The scalability of a program written this way can be limited. memset executes on a single core and so all memory is allocated on the NUMA node of that core. For memory-bound parallel programs, one memory controller will be saturated quickly while others remain idle since all threads (up to 64 on the machines we evaluate) request memory from the same controller. Furthermore, the interconnect close to this memory controller will be more susceptible to congestion.

There are two problems here: (i) memory is not allocated (or mapped) when the interface suggests (memory is not allocated inside malloc itself but later in the execution) and (ii) the choice of where to allocate memory is made in a subsystem (the OS kernel) that has no knowledge of the intended access patterns of this memory.

This can be addressed by tuning algorithms to specific operating systems. For example, we could initialize memory using a parallel for loop:

    // ---- Initialization (parallel) -------
    void *ptr = malloc(ARRSIZE);
    #pragma omp parallel for
    for (int i = 0; i < ARRSIZE; i++)
        ((char *)ptr)[i] = 0;

This works, but it ties the program to the specifics of the operating system and must be revisited whenever allocation policies change. Furthermore, it also requires correct setup of OpenMP's CPU affinity to ensure that all cores participate in this parallel initialization phase in order to spread memory equally on all memory controllers. Finally, we might do better: allocate memory close to the cores that access it the most.

Beyond simple placement of data, ideas and techniques from traditional distributed systems like replication and partitioning can help to improve memory management [31]. Replication localizes data access by storing several copies of the same data, distributing load and reducing communication costs. Replication carries the cost of maintaining consistency when updating data, as well as increasing the program's memory footprint. Partitioning chunks data and places these blocks onto different nodes. This balances load and, if work is scheduled close to the data it is accessing, also localizes array accesses. The key challenge in applying these techniques is that the choice of a good distribution strategy depends critically on concrete combinations of machine and workload.

Our work starts with the observation that memory access patterns of applications are often encoded in high-level languages or known by programmers. We show how this information can be used to tune memory placement and access without programmers having to understand the characteristics of the machine at hand.

Automatic annotations from high-level DSLs. A trend we exploit is the emergence of high-level domain specific languages (DSLs) [20, 37, 41]. These languages are known for their ease of programming since their semantics closely match a specific application domain. DSLs typically compile to a low-level language (such as C), possibly with several backends depending on the target machine to execute the program. DSLs can provide us with memory access patterns directly from the input program with relatively simple modifications to high-level compilers. Listing 1 shows an example program for the Green-Marl graph analytics DSL.

    Procedure pagerank(/* arguments */) {
        // .. initialization here
        Do {
            diff = 0.0; cnt++;
            Foreach (t: G.Nodes) {
                Double val = (1-d) / N + d *
                    Sum(w: t.InNbrs) {
                        w.pg_rank / w.OutDegree()
                    };
                diff += | val - t.pg_rank |;
                t.pg_rank <= val @ t;
            }
        } While ((diff > e) && (cnt < max));
    }

    Listing 1: Excerpt from Green-Marl's PageRank

Memory access patterns from two of Green-Marl's high-level constructs in the PageRank example can be determined as follows: (i) Foreach (t: G.Nodes) means the nodes array will be accessed sequentially, read-only, and with an index, and (ii) Sum(w: t.InNbrs) implies read-only, indexed accesses on the in-neighbors array.

We argue that memory access patterns encoded in DSLs present a significant performance opportunity, and should be passed to the runtime to enable automatic tuning of memory allocation and access. Since low-level code is generated by the DSL compiler, it is also relatively easy to change the programming abstractions used by the generated code for accessing memory. Only the compiler (rather than the input program) must be changed in such a case.

Manual annotations. Even without a DSL, programmers often know data access patterns when writing a program. They understand the semantics of their programs and, hence, how memory is accessed, but have no way of passing this knowledge to the runtime to guide data placement and access. Existing interfaces intended to enable this are coarse-grained and inflexible. One example is libnuma's [34] NUMA-aware memory allocation, which allows a client to specify which node memory should be allocated from, but does not allow combining this with other allocation options (such as large pages), and requires a programmer to manually integrate this with parallel task scheduling.

In Shoal, we automatically tune data placement and access based on memory access patterns and hints provided by a high-level compiler or by programmers. We introduce (i) a new interface for memory allocation, including a machine-aware malloc call that accepts hints to guide placement, and (ii) an abstraction for data access based on arrays. For these arrays, we provide several implementations including data distribution, replication and partitioning. All implementations can be interchanged transparently without the need to change programs. Our abstraction also admits implementations that are tuned to hardware features (such as DMA engines) or accelerators (Xeon Phi). The Shoal library automatically selects array implementations based on array access patterns and machine specifications. We currently support adaptations based on the NUMA hierarchy, DMA engines, and large MMU pages.

The result is that Shoal allows programmers to write programs that achieve good performance without (i) having to understand machine characteristics and (ii) needing to constantly rewrite applications in order to keep up with hardware changes. We demonstrate Shoal using the Green-Marl graph DSL, and the Streamcluster low-level C program from the PARSEC benchmark suite.
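As a concrete illustration of such a hint-carrying allocation, the following sketch shows what generated (or hand-written) allocation code for two of the PageRank arrays could look like, using the shl__malloc_array() call introduced in Section 3 (Listing 2). The variable names and flag values are illustrative only; this is not the actual output of the modified Green-Marl compiler.

    // Illustrative only: allocations carrying access-pattern hints, following
    // the shl__malloc_array(size, ro, indexed, used) signature of Listing 2.
    // num_nodes, num_edges, and node_t are assumed names, not from the paper.
    shl_array<double> *pg_rank =
        shl__malloc_array<double>(num_nodes,
                                  /* ro */ false,   // updated every iteration
                                  /* indexed */ true,
                                  /* used */ true);
    shl_array<node_t> *in_neighbors =
        shl__malloc_array<node_t>(num_edges,
                                  /* ro */ true,    // graph structure is read-only
                                  /* indexed */ true,
                                  /* used */ true);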

3 Shoal's array abstraction

Shoal's memory abstraction is based on arrays. We found this sufficient for the workloads we have been looking at in the context of this research, but we expect to add more data types in the future. We provide several array implementations, but all of them implement the same interface. This allows Shoal to select an implementation transparently to the programmer. The optimal choice depends on machine characteristics and memory access patterns.

    // allocate an array
    template<class T>
    shl_array<T>* shl__malloc_array(size_t size,
                                    bool ro, bool indexed, bool used);
    // get element at position i
    T get(size_t i);
    // set element at position i to v
    void set(size_t i, T v);
    // number of elements
    size_t get_size(void);
    // copy from another Shoal array
    int copy_from_array(shl_array<T> *src);
    // initialize every element with value
    int init_from_value(T value);
    // copy from a non-Shoal array
    void copy_from(T* src);
    // calculate the CRC checksum
    unsigned long get_crc(void);
    // initialize the current thread
    void shl__thread_init(void);
    // synchronize replicas
    void shl__repl_sync(void* src, void **dest,
                        size_t num_dest, size_t size);

    Listing 2: Interface of Shoal

3.1 Interface and programming model

Listing 2 illustrates Shoal's programming interface, which decouples computation and memory access, allowing transparent selection of different array implementations. Besides the usual set() and get() operators, we provide a collection of high-level functions to initialize memory and copy between Shoal arrays.

Thread initialization. In OpenMP, Shoal uses builtin functions to determine the thread ID and the corresponding replica to be used. Per-thread array pointers can otherwise be set up by manually calling shl__thread_init() on each thread.

Array allocation. shl__malloc_array allocates Shoal arrays and selects the best implementation for the machine it is running on based on memory access pattern hints given as arguments. Shoal always maps all pages of an array to guarantee memory allocation and avoid non-determinism.

Data operations. Reads and writes to arrays are performed with get() and set(), but we also provide optimized high-level array operations for initializing and copying arrays. These provide relaxed consistency guarantees: the order in which elements are initialized or copied is not specified, allowing these operations to be parallelized and offloaded to DMA engines in an asynchronous fashion. Writes to replicas can be realized by writing to the master copy and propagating the changes to all replicas using shl__repl_sync(). This makes it possible to re-initialize replicated arrays, for example to reuse otherwise read-only buffers in streaming applications.

3.2 Array types

We currently provide four array implementations.

Single-node allocation. Allocates the entire array on the local node. While limited in scalability, performance is independent of the OS since memory is guaranteed to be mapped in the allocation phase. Single-node arrays are rarely used for parallel programs.

Distribution. Distributed arrays allocate data equally across NUMA nodes. The precise distribution is not specified and depends on the implementation. This reduces pressure on memory controllers, but can lead to high latency or congestion if many accesses are remote. The performance of distributed arrays can be non-deterministic, as data is scattered semi-randomly and might vary between program executions.

Replication. Several copies of the array are allocated. We currently always place one replica on each memory controller. All data is then accessed locally. In addition to distributing load across the system, this reduces pressure on the interconnect at the cost of increased memory footprint.

Partitioning. Partitioning is a form of distribution where data is spread out in the machine such that work units can be executed local to where their data is allocated. If done carefully, array accesses are local as with replication, but without the increased memory footprint. This implies a scheduling challenge, since the working set for each thread must be known and the jobs scheduled accordingly.
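As a usage sketch (not taken from the paper's code), the following fragment accesses a Shoal array from an OpenMP loop using only the operations of Listing 2. get(), set(), and get_size() are assumed to be member functions of shl_array, and the array is assumed to have been created earlier with shl__malloc_array().

    // Sketch: scale every element of an existing Shoal array in parallel.
    void scale(shl_array<double> *arr, double factor)
    {
        size_t n = arr->get_size();
        #pragma omp parallel
        {
            shl__thread_init();   // per-thread setup (see Section 3.1)
            #pragma omp for
            for (size_t i = 0; i < n; i++)
                arr->set(i, arr->get(i) * factor);
        }
    }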
3.3 Selection of arrays

In selecting an array implementation, we try to: (i) maximize local access to minimize interconnect traffic, (ii) load-balance memory on all available controllers to avoid points of contention, and (iii) transparently use hardware features when available (e.g. DMA and large pages).

We show our policy for selecting array implementations in Figure 2: 1. we use partitioning if the array is only accessed via an index, 2. we enable replication if the array is read-only and fits into every NUMA node of the host machine, and 3. we otherwise use a uniform distribution among all available memory controllers.
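The decision logic above is small enough to state directly in code. The following sketch mirrors the three rules of Figure 2; the enum and predicate names are illustrative, not Shoal's internal API.

    // Sketch of the array-selection policy of Figure 2 (illustrative names).
    enum array_kind { ARRAY_PARTITIONED, ARRAY_REPLICATED, ARRAY_DISTRIBUTED };

    array_kind select_array(bool indexed_only,   // accessed only via the loop index?
                            bool read_only,      // never written after initialization?
                            bool fits_all_nodes) // one replica fits on each NUMA node?
    {
        if (indexed_only)
            return ARRAY_PARTITIONED;
        if (read_only && fits_all_nodes)
            return ARRAY_REPLICATED;
        return ARRAY_DISTRIBUTED;    // default: spread across all memory controllers
    }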

We only replicate read-only arrays, as we found that the cost of maintaining consistency dominates the performance benefits in current NUMA machines (Section 5.4) – however, we plan to revisit this for more complex NUMA hierarchies. In case of limited RAM, where the increase in working set size with replication is not tolerable, we can selectively activate replication based on the cost function of memory accesses extracted from the high-level program.

[Figure 2: Array selection. Decision flow: if the array is accessed only via an index, it is partitioned; otherwise, if it is read-only and fits on all NUMA nodes, it is replicated; otherwise it is distributed.]

Large pages. If available, we use large pages. This is not always optimal, and the impact of using large pages is hard to anticipate, but on average enabling large pages improves performance (in the future, we plan to enable large pages per-array). Most current multi-core machines have independent TLBs for different page sizes, suggesting it might be useful to retain some arrays on normal pages. Also, the TLB's coverage for randomly accessed large arrays is still not sufficient to prevent TLB misses. One approach would use large pages for mostly-sequential array access, and 4k pages for randomly accessed data. We also plan to use huge pages (typically 1GB), which may keep most of the working set covered by the TLB even for big workloads.

Overall, we have found this policy to be simple, but effective.

4 Implementation

The Shoal runtime library is structured in two parts: (i) a high-level array representation as defined in Section 3 based on C++ templates and (ii) a low-level, OS-specific backend. We now describe the work-flow invoked when Shoal is used together with a high-level DSL, and then describe our low-level backends.

Figure 3 shows how Shoal is used with high-level, parallel languages.

[Figure 3: Shoal system overview. A high-level compiler translates the high-level program into low-level code (e.g. C) annotated with access patterns; at run time, the Shoal library combines these access patterns with a hardware specification/config file to choose array implementations.]

High-level program. The input program is written in a high-level parallel language, which in this paper is Green-Marl, a DSL for graph analysis. Many other high-level languages such as SQL [18] and OptiML [37] provide similar resource usage information to the kind we extract from Green-Marl; we show an example of a Green-Marl expression of PageRank in Listing 1.

High-level compiler. High-level DSLs often encode access patterns in an intuitive way: for instance, the Green-Marl DSL features language constructs to access all nodes in a graph (see Foreach (t: G.Nodes) in Listing 1). From the language specification, we know that this represents a read-only and sequential data access to the nodes array. The high-level compiler translates the input into low-level code, and in this translation such access information is normally lost. Our modifications to the Green-Marl compiler extract this knowledge about array access patterns from the input source code and make it accessible to the generated code, which uses Shoal's array abstraction.

Low-level code with array abstractions. The generated code – here, C++ – uses Shoal's abstraction to allocate and access memory. At compile time the concrete choice of array implementation is not made; this happens later at runtime based on hardware specifications.

Access patterns. In high-level languages, memory accesses are usually implicit and translated into simple load and store instructions. In Shoal, however, we use the compiler to generate important information about load/store patterns which is then used by the runtime library.

Firstly, we capture the read/write ratio. The number of reads and writes by a program is workload specific, but Shoal can still in many cases extract a formula estimating the number of reads and writes for a given input. For example, the number of reads of Green-Marl's PageRank rank array is k·E + k·N (where E is the number of edges, N the number of nodes, and k the number of iterations). Currently, Shoal derives these formulas but only uses them to determine if the array is read-only; we expect more sophisticated uses of this information in the future. Secondly, we infer if all accesses to an array are based solely on the loop variable of a parallel loop. tmp indicates the array is used for temporary values, ro that it is read-only, etc.

An example of the automatically extracted information for Green-Marl's PageRank can be seen in Table 1.

Hardware specification. The Shoal runtime takes the hardware configuration of the system into account when selecting array implementations.

Currently, we consider the following hardware features: (i) NUMA topology: Shoal has a NUMA-aware array allocation function that attempts to distribute load on all memory controllers and localize access to reduce pressure on interconnects. (ii) RAM size: available memory can limit which arrays can be replicated. (iii) Page size: modern systems offer various page sizes and often provide a dedicated TLB for each size. The use of large and huge pages is useful in reducing TLB misses [17] in large working sets. (iv) DMA engines: some CPUs have integrated DMA engines [22]. We make use of these for copy and initialization.

[Table 1: Shoal's extracted array properties for PageRank (arrays begin, r_begin, r_node_idx, pg_rank, pg_rank_nxt; extracted properties include access counts in N and E, ro, std, tmp, and indexed).]

Shoal program. The Shoal library takes care of selecting array implementations based on the extracted access patterns and the hardware specification of the machine. The executable generated by the compiler is a program binary which links against the Shoal library.

OS-specific backends. To improve portability, we separate high-level array implementations from low-level, OS-dependent functions which mediate access to the memory allocation facilities or DMA devices. Currently, we run on the Linux and Barrelfish [6] OSes to demonstrate portability.

Topology information. Shoal needs to obtain information about the system architecture including the number of NUMA nodes, their sizes and corresponding CPU affinities. On Linux, this information can be obtained using libnuma [34]. In Barrelfish, hardware information is stored in the system knowledge base [33].

Scheduling. For replication and partitioning, Shoal must map threads to cores. On Linux we pin threads by setting the affinities, whereas on Barrelfish we directly create threads on specific cores. Given a concrete data distribution, scheduling can be optimized to execute work units close to where data is accessed. To date, Shoal is not fully integrated with the OpenMP runtime and we use a static OpenMP schedule for partitioning to ensure that work units are executed close to the partitions they are working on. This works well for balanced workloads, but can lead to significant slowdown compared to dynamic schedules if the cost of executing work units is non-uniform. In the future, we plan to design and integrate our own OpenMP runtime to provide fine-grained control of scheduling without losing performance for unbalanced workloads. An alternative approach would schedule work units on partitions using OpenMP 4.0's team statement.

Memory allocation. We want to provide strong guarantees on where memory is allocated, but allocation policies are not consistent across OSes – indeed, they even change between different versions of the same OS. Linux, for instance, implements a first-touch allocation policy, which causes confusion about where and when memory will actually be allocated. Libraries such as libnuma provide an interface which gives more control, but this lacks support for large and huge pages. Barrelfish [7] gives the user the ability to manage its own address space via self-paging [19]: an application requests memory explicitly from a specific NUMA node and maps it as it wishes.

These systems provide different trade-offs between complexity, portability, and maintainability of application code and efficient use of the memory system: an explicit, flexible interface imposes an additional burden on the client. We believe that programmers should not have to deal with this complexity and want to avoid manual tuning to adapt programs to new machines.

5 Evaluation

Our goal in this section is to show that programs scale and perform significantly better with Shoal than with a regular memory runtime. We also compare our array implementations, analyze Shoal's initialization cost, and finally investigate the benefits of using a DMA engine for array copy.

Table 2 shows the machines used for our evaluation. Our results on both the 8x8 AMD Opteron and the 4x8x2 Intel Xeon are similar. For brevity, we focus on results from our 8x8 AMD Opteron unless stated otherwise. We use two workloads: Green-Marl and PARSEC Streamcluster.

Green-Marl. The Green-Marl compiler comes with a variety of programs of which we have selected three graph algorithms to demonstrate the performance characteristics of Shoal: (i) PageRank [30] iteratively calculates the importance of each node in the graph as a sum of the rank of all incoming neighbors divided by the number of outgoing edges they have, (ii) hop-distance calculates the distance of every node from the root using Bellman-Ford, and (iii) triangle-counting, which counts the number of triangles in the input graph. This is implemented as a triple loop: for all nodes in the graph, it looks at all combinations of nodes reachable from it and checks if there is an edge connecting them.

For our evaluation we used two graphs: (i) the Twitter graph [23] with 41M nodes and 1468M edges. The total working set size is 2.459 GB with Green-Marl configured to 64-bit node and edge types (excluding unused arrays). We were not able to run triangle-counting on the Twitter graph on our system, hence we fall back on (ii) the LiveJournal graph [5] for that workload. LiveJournal has 4M nodes and 69M edges with a total working set of 392 MB in Green-Marl.

6 268 2015 USENIX Annual Technical Conference USENIX Association machine 8x8 AMD Opteron 4x8x2 Intel Xeon 2x10 Intel Xeon CPU AMD Opteron 6378 Intel Xeon E5-4640 Intel Xeon E5-2670 v2 micro architecture Piledriver Sandy Bridge Ivy Bridge #nodes/#sockets 8/4 @ 2.4 GHz 4/4 @ 2.4 GHz 2/2 @ 2.5 GHz L1 data size 16K /thread 32K /core 32K /core L2/L3 size 2048K / 6144K 256K / 20480K 256K / 25600K memory 512 GB (64 GB per node) 512 GB (128 GB per node) 256 GB (128 GB per node) Table 2: Machines used for evaluation (L2 shared by core, L3 shared by socket)

PARSEC – Streamcluster. Streamcluster [9] solves the online clustering problem. Input data is given as an array of multi-dimensional points. We manually modified it to use Shoal for memory allocation and accesses.

5.1 Scalability

In highly parallel workloads, scalability is one of the key concerns. In this section we show the benefits of using Shoal over unmodified versions of the workloads, and that allocating memory based on access patterns, if available, is favorable over online methods.

Green-Marl. We evaluated the scalability of three Green-Marl workloads on an 8x8 AMD Opteron, comparing Shoal against the original Green-Marl implementation and Carrefour [13]. Figure 4 shows that Shoal clearly outruns the original implementation by almost 2x and also performs better than the online method as a result of an optimized memory placement. Furthermore, our results show that an online method can harm performance in case pages are getting migrated back and forth (hop-distance). Overall, except for triangle-counting, all implementations scale well. Note that we do not include the graph loading time and Shoal initialization.

We also executed the same measurements on Barrelfish to show Shoal's portability. Our intention is not to show that either operating system is faster than the other, but rather to show their comparability. On Barrelfish, only static OpenMP schedules are supported due to implementation limitations. This negatively impacts the performance for triangle-counting. However, Shoal still performs better than the original implementation, which uses dynamic OpenMP schedules.

PARSEC – Streamcluster. In contrast to Green-Marl, Streamcluster is implemented in C and hence there is no automatic method of extracting access patterns. We modified Streamcluster to use Shoal's array abstraction to demonstrate that using Shoal directly by programmers can improve scalability with little effort for manual annotation. To make Streamcluster work with Shoal, we had to (i) abstract access to arrays using Shoal's get and set methods, and (ii) initialize each thread using shl__thread_init() and change the array allocation to use shl__malloc_array() instead of malloc().

[Figure 5: Scalability of PARSEC Streamcluster on 8x8 AMD Opteron (runtime vs. number of threads for the original implementation, Shoal, and Carrefour).]

Since Streamcluster is a streaming application, arrays for input coordinates are reused for each chunk of new streaming data, but are otherwise read-only. We use (iii) shl__repl_sync() to synchronize the master copy of the array to its replicas once after a new chunk has been read. We compare the original Streamcluster implementation with Carrefour and Shoal (Figure 5). Our results confirm the bad scalability of Streamcluster due to the use of memset() after allocating arrays [13, 17], which causes all memory to be allocated on a single NUMA node. This leads to congestion of the interconnect and memory controllers of that node. Shoal achieves a 4x improvement over the original implementation. Shoal's annotated allocation function outperforms Carrefour's online method. We want to emphasize here that we replaced only one of the used arrays with a Shoal array (large pages and replication) and did not apply further optimizations.

5.2 Comparison of array implementations

We conducted a detailed analysis of Shoal's different array implementations using all physical cores of our machines. In this section we show that Shoal achieves better performance than the original Green-Marl implementation regardless of which array configuration we use. Figure 6 shows our results normalized to the original Green-Marl implementation and Figure 8 shows the breakdown into initialization and computation times. The measurements were executed on the 8x8 AMD Opteron using all 32 physical cores.

[Figure 4: Scalability on 8x8 AMD Opteron for PageRank, hop-distance, and triangle-counting (runtime vs. number of threads for the original OpenMP Green-Marl on Linux, Shoal on Linux, Shoal on Barrelfish, and Carrefour). Workload: Twitter (LiveJournal for triangle-counting). For Shoal, we chose the best configuration for each workload. We omit results for 64 threads: using SMT threads has similar performance to using only the 32 physical cores.]

[Figure 6: Comparison of various combinations of array implementations on 8x8 AMD Opteron: distribution (-d), replication (-r), partitioning (-p), large pages (-l); runtime normalized to stock Green-Marl.]

In the following, we give explanations for each configuration and relate them to our performance counter observations (Figure 7).

Distribution. The original Green-Marl implementation already initializes memory for storing the graphs with an OpenMP loop to distribute memory in the machine. However, this is not done for dynamically allocated arrays (e.g. the rank_next array in PageRank). With Shoal, all arrays are ensured to be distributed among the nodes, resulting in a more even distribution of memory and better performance across all workloads. Initialization of distributed arrays relies on an OpenMP loop to allocate memory evenly across all NUMA nodes. Initializing memory is hence executed in parallel, which results in small initialization cost compared to other array types. We show this in Figure 8.

Our claims are supported by measurements of each memory controller's read and write throughput (Figure 7): compared to the original implementation (i), where all reads and writes are executed on socket 0, enabling distribution (ii) results in an evenly distributed load on all memory controllers. However, in both cases, the memory controllers are not saturated. Memory throughput suffers from the lower bandwidth of the interconnect links, i.e. 9.6 GB/s for QPI. With random distribution of memory, only 1/4 of all memory accesses are expected to be local.

Distribution + replication. In contrast to distribution, replication is applied only to read-only data. In our workloads, the graph itself is not altered by the program and hence replicated among the nodes. This results in an increased fraction of locally served memory accesses and lower interconnect traffic: memory accesses are evenly distributed among all memory controllers as shown in (iv) of Figure 7. Note that enabling replication without distribution allocates non-read-only arrays into single-node arrays, resulting in unbalanced memory access for that part of the working set, see (iii) of Figure 7. Initialization costs for replicated arrays are higher than for distributed arrays because more memory needs to be allocated. We force correct allocation by touching each replica on its designated node. Finally, copying the master array to the other replicas causes some additional overhead when initializing such an array (Figure 8).

Partitioning. Replication of data increases the memory footprint of the application. Partitioning tries to preserve the locality of replication without increasing the memory footprint. Our current implementation requires a static OpenMP schedule for partitioning to ensure scheduling of work units to the right partitions. However, static schedules potentially lead to imbalance of work among the execution units as workloads may be skewed (e.g. in triangle-counting). Even though the same amount of memory has to be allocated as with distributed arrays, its initialization is more complex: using Linux's first-touch policy, Shoal ensures memory is touched on the correct node by migrating a thread to where memory should be allocated and touching each page from there. This results in similar initialization time as with replication, but slightly less time to copy the data.
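A minimal sketch of this first-touch placement on Linux, assuming libnuma is available (link with -lnuma). It is not Shoal's implementation: it simply migrates the calling thread to each node in turn and touches that node's share of the buffer so that the kernel maps those pages locally.

    // Sketch (not Shoal's code): back each partition of buf on its NUMA node
    // by touching its pages from a thread that currently runs on that node.
    #include <numa.h>
    #include <stddef.h>

    static const size_t PAGE = 4096;

    void place_partitions(char *buf, size_t size, int num_nodes)
    {
        size_t part = size / num_nodes;
        for (int node = 0; node < num_nodes; node++) {
            numa_run_on_node(node);                 // migrate this thread to node
            size_t off = (size_t)node * part;
            size_t len = (node == num_nodes - 1) ? size - off : part;
            for (size_t i = 0; i < len; i += PAGE)
                buf[off + i] = 0;                   // first touch maps the page here
        }
        numa_run_on_node(-1);                       // allow scheduling on any node again
    }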

Large pages. Modern CPUs support various page sizes and have a distinct TLB for each page size. A miss in the TLB forces the CPU to do a full page table walk, which drastically increases the access time. Shoal supports large pages for its arrays. Enabling large pages for PageRank and triangle-counting results in slightly better performance, while hop-distance runtime increases slightly. Gaud et al. [17] reached similar conclusions in their experiments with large pages. Enabling large pages reduces the total number of pages used and therefore the number of required first touches in the allocation process. This results in a decrease of the allocation time (Figure 8).

We conclude that despite the additional overhead of allocation and initialization, the total runtime with Shoal is still reduced. However, we want to emphasize that we do not consider initialization time a main target of optimization, as typically the time spent for computation dominates the program execution. Nevertheless, allocation could be improved by (i) maintaining a cache of pre-allocated pages on each node, (ii) applying a smarter page mapping strategy, or (iii) initializing Shoal while input data (e.g. the graph) is loaded.

[Figure 8: Shoal initialization and runtime on 8x8 AMD Opteron for various array configurations using PageRank with the Twitter workload (allocation, copy, and compute time for the OpenMP baseline and Shoal with +distribution, +replication, +partitioning, +large pages, -partitioning).]

5.3 Use of DMA engines

Modern CPUs have integrated DMA engines, which provide a rich set of memory operations. For instance, recent Intel server CPUs provide integrated CrystalBeach 3 DMA engines [22]. We evaluate the use of DMA engines for initialization and copy operations on a 2x10 Intel Xeon (our 8x8 AMD Opteron and 4x8x2 Intel Xeon do not have DMA engines). We run these experiments on Barrelfish, as user-level support for DMA engines is already integrated and requires no additional setup.

We now compare the raw copy performance of DMA controllers to CPU memcpy() and further evaluate how DMA engines can be used in Shoal. Asynchronous memory operations offered by DMA controllers can free the CPU from the burden of copying data around and provide cycles to do actual work. Shoal offers an interface to start an asynchronous memory copy and to check for completion of the operation.

[Figure 9: Comparison of parallel OpenMP copy and DMA copy on 2x10 Intel Xeon for large buffers (≫ cache size); throughput in GB/s vs. number of threads/channels for DMA and memcpy on one and two nodes.]

Comparing raw memory throughput. Our results of a raw throughput evaluation (Figure 9) show that the use of DMA engines does not necessarily improve the sequential performance, especially for blocking copy operations as used by PageRank. However, to outperform the DMA controller, all threads of the CPU have to be used for memory copying and hence no other computational task can be executed in the meantime.

We now show that DMA engines are useful if only a few threads are available for synchronously copying arrays, or for asynchronous copies. For example, we are planning to evaluate the use of DMA engines to propagate writes to replicas in the background.

[Figure 10: Initialization cost for copying data into Shoal arrays using the Twitter working set with replication on Barrelfish; copy time vs. the fraction of data copied by DMA transfers, with 20 physical threads and with 40 SMT threads.]

DMA engines for initialization. We benchmark the initialization phase of PageRank where data is copied from the graph's memory into the Shoal arrays. We copy a certain ratio of the array using DMA engines asynchronously while using parallel OpenMP loops to copy the remaining elements.
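The hybrid copy in this experiment can be sketched as follows. Shoal's asynchronous copy interface is not named in the paper, so shl__memcpy_async() and shl__memcpy_wait() below are placeholder declarations rather than real API; the OpenMP loop copies the remaining elements on the CPU.

    // Placeholder declarations for an asynchronous copy interface (assumed).
    void *shl__memcpy_async(void *dst, const void *src, size_t bytes);
    void  shl__memcpy_wait(void *request);

    // Sketch of the hybrid DMA + OpenMP copy used in the Figure 10 experiment.
    void hybrid_copy(double *dst, const double *src, size_t n, double dma_ratio)
    {
        size_t dma_n = (size_t)(n * dma_ratio);      // share handled by the DMA engine

        void *req = shl__memcpy_async(dst, src, dma_n * sizeof(double));

        #pragma omp parallel for                      // CPU copies the rest in parallel
        for (size_t i = dma_n; i < n; i++)
            dst[i] = src[i];

        shl__memcpy_wait(req);                        // wait for DMA completion
    }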

[Figure 7: Memory throughput (read and write, in GB/s over time) for sockets 0 and 1 on 4x8x2 Intel Xeon, for four configurations: (i) no optimizations, (ii) distributed, (iii) replicated, (iv) distributed + replicated. In the first 35 seconds, the graph is loaded from disk. For replication, we show the replica initialization cost in green. Note: sockets 2 and 3 behave like socket 1 and are left out for readability.]
Table 3: Writeable replicas on 8x8 AMD Opteron. Workload: 5.4 Writeable replication hop-distance with -d -r -h configuration Finally, we look at the issue of write-shared arrays. 6 Related work Efficiently maintaining consistency of replicated data Our work was originally inspired by recent research in is difficult; updates must be propagated to all replicas. domain specific languages. Such languages are based This can be achieved by issuing writes to all replicas on the observation that it is hard to write efficient code or by applying techniques such as double-buffering and for a wide-range of different systems, as algorithms need asynchronous copies. Both relax consistency guarantees, to be tuned to a concrete machine in order to achieve but are strong enough for use with OpenMP loops, where good performance. DLSs express algorithms in a rich

DSLs express algorithms in a rich and intuitive way. The Green-Marl [20] graph analytics DSL, OptiML [37], a machine learning DSL, as well as SPL [41], a signal processing language, all provide a rich set of powerful high-level operators. They use a compiler that generates code that is highly tuned to the target machine and makes heavy use of data parallelism. While all of these languages encode memory access patterns in their high-level languages, none uses them to adapt memory allocation at runtime.

Modern machines are becoming inherently complex: Baumann et al. argued that computers are already a distributed system in their own right [8]. They proposed a multikernel approach [7] which avoids sharing of state among OS nodes by replication and applying techniques from distributed systems. Similarly, Wentzlaff et al. [40] apply partitioning to OS services. Techniques from distributed systems are beneficial not only on an OS level, but also for applications. Multimed [31] replicates database instances within a single multi-core machine. Salomie et al. showed that congestion of memory controllers and interconnects impacts the overall performance. Carrefour [13] attempts to reduce the contention on interconnects and memory controllers by online monitoring of memory accesses and auto-tuning NUMA-aware memory allocation. This approach can be applied to any application without modifications to the program code, but is less fine-grained. While Shoal derives a program's semantics from annotations or DSL compiler analysis, these approaches need to guess programmers' intentions in retrospect.

Systems such as SGI's Origin 2000 [35] use page-level migration and replication of data. Hardware monitors detect the access patterns to pages (e.g., which processors tend to access the page, and whether these are reads or writes). Based on the gathered data, pages are replicated or migrated towards a frequently accessing CPU.

With highly parallel workloads, efficient synchronization is crucial for application performance [14]. Lock cohorting [15] implements NUMA-aware locks by taking the cache hierarchy and NUMA topology into account. Shoal's treatment of memory is analogous to these systems' treatment of locks.

Access to large and huge MMU pages is provided by services and libraries like libhugetlbfs [4]. The latter, however, requires static setup of a large page pool, among other issues.

Cache coherence protocols like MOESI [2] allow cache lines to be in a shared state, which is a form of hardware-level replication. This is only effective for workloads with good locality and a small working set, which is not the case for our graph workloads. Research systems, such as Stanford FLASH, have provided software control over this form of replication [36].

The Solaris operating system provides a madvise [28] operation to let an application give hints on future accesses to a memory region, which results in a distributed or local allocation to the calling thread. Shoal extends this approach with a wider range of possible usage patterns, and infers appropriate settings to use.

Li [24] attempts to find the best algorithm for a specific task depending on machine characteristics and workload based on empirical search. In contrast, we decide a priori based on additional information extracted from high-level languages or given by manual annotations. Franchetti et al. [16] automatically tune FFT programs to multi-core machines. They argue that programming such machines is increasingly complicated, which increases the burden for programmers and makes a case for automatic tuning. Atune-IL [32] auto-tunes applications, including the number of threads etc. It explores all possible parameters, but tries to reduce the search space.

However, tuning data placement and parallelism individually is not optimal, because data and threads may not end up on the same node. Hence, affinity of threads and data needs to be enforced in order to improve the performance of OpenMP programs [38].

Finally, PGAS languages such as UPC [39], Co-Array Fortran [27], X10 [12], Chapel [11], and Fortress [29] provide an abstraction of shared arrays which can be implemented across a distributed system. Code iterating over an array can execute on the node holding the portion of the array being accessed.

In high-performance computing, array abstractions [26] have been used to simplify programming while still providing good performance and scalability. They support high-level operations on arrays, e.g. matrix multiplications or atomic operators.

7 Conclusion

In this paper, we presented Shoal, a library that provides an array abstraction and rich memory allocation functions that allow automatic tuning of data placement and access depending on workload and machine characteristics. Tuning is based on memory access patterns. These are either (i) given by manual annotation or, ideally, (ii) extracted automatically by modified compilers of high-level languages. We have shown that we can use this additional information to automatically choose array implementations that increase performance on today's NUMA systems. We report an up to 2x improvement for Green-Marl, a high-level graph analytics workload, without changing the Green-Marl input program. We found our memory abstraction as well as the simple policy for selecting the array implementation sufficient for current workloads and machines, but believe that future machines can benefit from a more fine-grained selection of array implementations.

References

[1] ACHERMANN, R. Message Passing and Bulk Transport on Heterogenous Multiprocessors. Master's thesis, ETH Zurich, 2014. http://dx.doi.org/10.3929/ethz-a-010262232.

[2] ADVANCED MICRO DEVICES. AMD64 Architecture Programmer's Manual Volume 2: System Programming. Online, 2013. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf. Publication Number 24593, Revision 3.23.

[3] APPAVOO, J., SILVA, D. D., KRIEGER, O., AUSLANDER, M., OSTROWSKI, M., ROSENBURG, B., WATERLAND, A., WISNIEWSKI, R. W., XENIDIS, J., STUMM, M., AND SOARES, L. Experience Distributing Objects in an SMMP OS. ACM Transactions on Computer Systems 25, 3 (Aug. 2007).

[4] ARAVAMUDAN, N., LITKE, A., AND MUNSON, E. libhugetlbfs. Online, 2015. http://libhugetlbfs.sourceforge.net/.

[5] BACKSTROM, L., HUTTENLOCHER, D., KLEINBERG, J., AND LAN, X. Group Formation in Large Social Networks: Membership, Growth, and Evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2006), KDD '06, ACM, pp. 44–54.

[6] BARRELFISH PROJECT. The Barrelfish Operating System. Online, 2015. www.barrelfish.org.

[7] BAUMANN, A., BARHAM, P., DAGAND, P.-E., HARRIS, T., ISAACS, R., PETER, S., ROSCOE, T., SCHÜPBACH, A., AND SINGHANIA, A. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, ACM, pp. 29–44.

[8] BAUMANN, A., PETER, S., SCHÜPBACH, A., SINGHANIA, A., ROSCOE, T., BARHAM, P., AND ISAACS, R. Your computer is already a distributed system. Why isn't your OS? In Proceedings of the 12th Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2009), HotOS'09, USENIX Association, pp. 12–12.

[9] BIENIA, C., KUMAR, S., SINGH, J. P., AND LI, K. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2008), PACT '08, ACM, pp. 72–81.

[10] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI'08, USENIX Association, pp. 43–57.

[11] CHAMBERLAIN, B., CALLAHAN, D., AND ZIMA, H. Parallel Programmability and the Chapel Language. International Journal of High Performance Computing Applications 21, 3 (Aug. 2007), 291–312.

[12] CHARLES, P., GROTHOFF, C., SARASWAT, V., DONAWA, C., KIELSTRA, A., EBCIOGLU, K., VON PRAUN, C., AND SARKAR, V. X10: An Object-Oriented Approach to Non-Uniform Cluster Computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (New York, NY, USA, 2005), OOPSLA '05, ACM, pp. 519–538.

[13] DASHTI, M., FEDOROVA, A., FUNSTON, J., GAUD, F., LACHAIZE, R., LEPERS, B., QUEMA, V., AND ROTH, M. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 381–394.

[14] DAVID, T., GUERRAOUI, R., AND TRIGONAKIS, V. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 33–48.

[15] DICE, D., MARATHE, V. J., AND SHAVIT, N. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2012), PPoPP '12, ACM, pp. 247–256.

[16] FRANCHETTI, F., VORONENKO, Y., AND PÜSCHEL, M. FFT Program Generation for Shared Memory: SMP and Multicore. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2006), SC '06, ACM.

[17] GAUD, F., LEPERS, B., DECOUCHANT, J., FUNSTON, J., FEDOROVA, A., AND QUEMA, V. Large Pages May Be Harmful on NUMA Systems. In 2014 USENIX Annual Technical Conference (USENIX ATC 14) (2014), USENIX Association.

[18] GICEVA, J., SALOMIE, T.-I., SCHÜPBACH, A., ALONSO, G., AND ROSCOE, T. COD: Database / Operating System Co-Design. In CIDR (2013), www.cidrdb.org.

[19] HAND, S. M. Self-paging in the Nemesis operating system. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (1999), pp. 73–86.

[20] HONG, S., CHAFI, H., SEDLAR, E., AND OLUKOTUN, K. Green-Marl: A DSL for Easy and Efficient Graph Analysis. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS XVII, ACM, pp. 349–362.

[21] HOWARD, J., DIGHE, S., VANGAL, S. R., RUHL, G., BORKAR, N., JAIN, S., ERRAGUNTLA, V., KONOW, M., RIEPEN, M., GRIES, M., ET AL. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling. IEEE Journal of Solid-State Circuits 46, 1 (2011), 173–183.

[22] INTEL CORPORATION. Intel Xeon Processor E5-1600/E5-2600/E5-4600 v2 Product Families, Datasheet – Volume One of Two. Online, 2014. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v2-datasheet-vol-1.pdf. Document Number: 329187-003.

[23] KWAK, H., LEE, C., PARK, H., AND MOON, S. What is Twitter, a Social Network or a News Media? In WWW '10: Proceedings of the 19th International Conference on World Wide Web (New York, NY, USA, 2010), ACM, pp. 591–600.

[24] LI, X., GARZARÁN, M. J., AND PADUA, D. A Dynamically Tuned Sorting Library. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (Washington, DC, USA, 2004), CGO '04, IEEE Computer Society, p. 111.

[25] MATTSON, T. G., VAN DER WIJNGAART, R., AND FRUMKIN, M. Programming the Intel 80-core network-on-a-chip Terascale Processor. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Piscataway, NJ, USA, 2008), SC '08, IEEE Press, pp. 38:1–38:11.

[26] NIEPLOCHA, J., HARRISON, R. J., AND LITTLEFIELD, R. J. Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers. The Journal of Supercomputing 10, 2 (June 1996), 169–189.

[27] NUMRICH, R. W., AND REID, J. Co-Array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1–31.

[28] ORACLE CORPORATION. madvise() in Solaris 10. Online, 2015. http://docs.oracle.com/cd/E19253-01/817-0547/whatsnew-updates-72/index.html.

[29] ORACLE LABS. Fortress. Online, 2015. https://projectfortress.java.net.

[30] PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[31] SALOMIE, T.-I., SUBASU, I. E., GICEVA, J., AND ALONSO, G. Database Engines on Multicores, Why Parallelize When You Can Distribute? In Proceedings of the 6th Conference on Computer Systems (New York, NY, USA, 2011), EuroSys '11, ACM, pp. 17–30.

[32] SCHAEFER, C. A., PANKRATIUS, V., AND TICHY, W. F. Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing (Berlin, Heidelberg, 2009), Euro-Par '09, Springer-Verlag, pp. 9–20.

[33] SCHÜPBACH, A., PETER, S., BAUMANN, A., ROSCOE, T., BARHAM, P., HARRIS, T., AND ISAACS, R. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems (2008).

[34] SILICON GRAPHICS INTERNATIONAL CORPORATION. libnuma. Online, 2015. http://oss.sgi.com/projects/libnuma/.

[35] SILICON GRAPHICS INTERNATIONAL CORPORATION. Origin and Onyx2 Theory of Operations Manual. Online, 2015. http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=bks&srch=&fname=/SGI_Developer/OrOn2_Theops/sgi_html/ch02.html.

[36] SOUNDARARAJAN, V., HEINRICH, M., VERGHESE, B., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors. In Proceedings of the 25th Annual International Symposium on Computer Architecture (Washington, DC, USA, 1998), ISCA '98, IEEE Computer Society, pp. 342–355.

[37] SUJEETH, A., LEE, H., BROWN, K., ROMPF, T., CHAFI, H., WU, M., ATREYA, A., ODERSKY, M., AND OLUKOTUN, K. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), pp. 609–616.

[38] TERBOVEN, C., AN MEY, D., SCHMIDL, D., JIN, H., AND REICHSTEIN, T. Data and Thread Affinity in OpenMP Programs. In Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem? (New York, NY, USA, 2008), MAW '08, ACM, pp. 377–384.

[39] UPC CONSORTIUM. UPC Language and Library Specifications, November 2013. Version 1.3. Online. http://upc.lbl.gov/publications/upc-spec-1.3.pdf.

[40] WENTZLAFF, D., AND AGARWAL, A. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Operating Systems Review 43, 2 (Apr. 2009), 76–85.

[41] XIONG, J., JOHNSON, J., JOHNSON, R., AND PADUA, D. SPL: A Language and Compiler for DSP Algorithms. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (New York, NY, USA, 2001), PLDI '01, ACM, pp. 298–308.
