Arxiv:1806.11128V4 [Cs.DC] 7 Jan 2019 a NUMA-Aware Provably-Efﬁcient Task-Parallel Platform Based on the Work-First Principle

2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. You can find the published work on IEEE Xplore by following this link. arXiv:1806.11128v4 [cs.DC] 7 Jan 2019 A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle Justin Deters Jiaye Wu Yifan Xu I-Ting Angelina Lee Washington University in St. Louis fj.deters, jiaye.wu, xuyifan, [email protected] Abstract—Task parallelism is designed to simplify the task of Most concurrency platforms, including ones mentioned parallel programming. When executing a task parallel program above, schedule task parallel computations using work steal- on modern NUMA architectures, it can fail to scale due to the ing [14]–[17], a randomized distributed protocol for load phenomenon called work inflation, where the overall processing time that multiple cores spend on doing useful work is higher balancing. Work stealing, in its classic form, provides strong compared to the time required to do the same amount of work on theoretical guarantees. In particular, it provides asymptotically one core, due to effects experienced only during parallel executions optimal execution time [14]–[17] and allows for good cache such as additional cache misses, remote memory accesses, and locality with respect to sequential execution when using pri- memory bandwidth issues. vate caches [18]. In practice, work stealing has also been One can mitigate work inflation by co-locating the computation demonstrated to incur little scheduling overhead and can be with its data, but this is nontrivial to do with task parallel programs. First, by design, the scheduling for task parallel pro- implemented efficiently [5]. grams is automated, giving the user little control over where the Shared memory on mod- computation is performed. Second, the platforms tend to employ Socket 0 Socket 1 ern parallel systems is often 0 1 2 3 8 9 10 11 work stealing, which provides strong theoretical guarantees, but its 4 5 6 7 12 13 14 15 realized with Non-Uniform randomized protocol for load balancing does not discern between LLC LLC Memory Access (NUMA), MainMemory work items that are far away versus ones that are closer. MC QPI MC QPI Main Memory In this work, we propose NUMA-WS, a NUMA-aware task where the memory latency parallel platform engineered based on the work-first principle. can vary drastically, depend- Socket 2 Socket 3 By abiding by the work-first principle, we are able to obtain 16 17 18 19 24 25 26 27 ing on where the mem- 20 21 22 23 28 29 30 31 a platform that is work efficient, provides the same theoretical guarantees as a classic work stealing scheduler, and mitigates ory access is serviced. Fig- LLC LLC MainMemory work inflation. We have extended Cilk Plus runtime system ure1 shows an example of a MC QPI MC QPI Main Memory to implemented NUMA-WS. Empirical results indicate that the NUMA system and its mem- Fig. 1. An example of a 32-core NUMA-WS is work efficient and can provide better scalability by ory subsystem. This system four-socket system, where each socket mitigating work inflation. consists of four sockets, with has its own last-level L3 cache and Index Terms—work stealing, work-first principle, NUMA, lo- memory banks. cality, work inflation eight cores per socket, and each socket has its own last-level cache (LLC), memory controller, and memory banks (DRAM). Each LLC is shared I. INTRODUCTION among cores on the same socket, and the main memory consists of all the DRAMs across sockets, where each DRAM is Modern concurrency platforms are designed to simplify the responsible for a subset of the physical address range. On such task of writing parallel programs for shared-memory parallel a system, when a piece of data is allocated, it can either reside systems. These platforms typically employ task parallelism on physical memory managed by the local DRAM (i.e., on (sometimes referred to as dynamic multithreading), in which the same socket) or by the remote DRAM (i.e., on a different the programmer expresses the logical parallelism of the com- socket). When accessed, the data is brought into the local LLC putation using high-level language or library constructs and and its coherence is maintained by the cache coherence protocol lets the underlying scheduler determine how to best handle among LLCs. Thus, the memory access latency can be tens of synchronizations and load balancing. Task parallelism provides cycles (serviced from the local LLC), over a hundred cycles a programming model that is processor oblivious, because the (serviced from a local DRAM or a remote LLC), or a few language constructs expose the logical parallelism within the hundreds of cycles (serviced from a remote DRAM). application without specifying the number of cores on which the application will run. Examples of such platforms include A task parallel program can fail to scale on such a NUMA OpenMP [1], Intel’s Threading Building Blocks (TBB) [2], [3], system, as a result of a phenomenon called work inflation, various Cilk dialects [4]–[9], various Habanero dialects [10], where the overall processing time that multiple cores spend on [11], Java Fork/Join Framework [12], and IBM’s X10 [13]. doing useful work is much higher compared to the time required to do the same amount of work on one core, due to effects This research was supported in part by National Science Foundation under experienced only during parallel executions. Multiple factors grant number CCF-1527692 and CCF-1733873. can contribute to work inflation, including work migration, parallel computations sharing a LLC destructively, or accessing To measure work efficiency, an important metric is TS, the data allocated on remote sockets. execution time of serial elision, obtained by removing the One can mitigate work inflation by co-locating computations parallel control constructs or replacing it with its sequential that share the same data or co-locate the computation and its counter part. Serial elision should perform the exact same data on the same socket, thereby reducing remote memory algorithm that the parallel program implements but without accesses. These strategies are not straightforward to implement the parallel overhead. Thus, one can quantify work efficiency in task parallel code scheduled using work stealing, however. by comparing T1 against TS to measure parallel overhead. First, by design the scheduling of task parallel programs is Assuming a work-efficient platform, one can obtain parallel automated, which gives the programmer little control over code whose ratio between T1 and TS is close to one. where the computation is executed. Second, the randomized We have implemented a prototype system by extending Intel protocol in work stealing does not discern between work items Cilk Plus, which implements the classic work stealing algo- that are far away versus ones that are closer. rithm, and empirically evaluated it. The empirical results indi- Ideally, we would like a task parallel platform that satisfies cate that NUMA-WS is work efficient, scalable across different the following criteria: number of cores, and can mitigate NUMA effects. Specifically, • provide the same strong theoretical guarantees that a classic we show that, NUMA-WS incurs negligible parallel overhead work stealing scheduler enjoys; (i.e., T1=TS), comparable to that in Cilk Plus. Moreover, we • be work efficient, namely, the platform does not unneces- compare the parallel execution times across benchmarks when sarily incur scheduling overhead that causes the single-core running on Intel Cilk Plus versus running on NUMA-WS on a execution time to increase; four-socket 32-core system. NUMA-WS was able to decrease • support a similar processor-oblivious model of computation: work inflation compared to Cilk Plus without adversely impact- assuming sufficient parallelism, the same program for a given ing scheduling time. Across benchmarks, NUMA-WS obtains input should scale as the number of cores used increases; and much better speedup than that in Cilk Plus. • mitigate work inflation. Critically, to achieve the theoretical bounds and practical Even though many mechanisms and scheduling policies have efficiency, the design and engineering of NUMA-WS abides been proposed to mitigate work inflation in task parallel by a principle called the work-first principle proposed by [5], programs [19]–[29], none of the proposed solutions satisfy which states that one should minimize the overhead borne by all criteria simultaneously. In particular, many of them are the work term (T1) and move the overhead onto the span term not work efficient nor do they provide a provably efficient (T1). Intuitively, a scheduler must incur some scheduling over- scheduling bound (see SectionVI). head due to the extra bookkeeping necessary to enable correct In this paper, we propose NUMA-WS, a task parallel plat- parallel execution or to mitigate NUMA effects. Within the form that satisfies these criteria simultaneously. NUMA-WS context of a work-stealing scheduler, worker threads (surrogates employs a variant of work stealing scheduler that extends the of processing cores) load balance by “stealing” work when classic algorithm with mechanisms to mitigate NUMA effects. necessary. The work-first principle states that it’s best to incur NUMA-WS achieves the same theoretical bounds on execution scheduling overhead on the control path that can be amortized time and additional cache misses for private caches as the against successful steals. To put it differently, whenever a classic work stealing algorithm, albeit with a slightly larger choice can be made to incur overhead on a thief stealing versus constant hidden in the big-O term.

Load more