OS and Runtime Support for Efficiently Managing Cores in Parallel Applications

by

Kevin Alan Klues

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Eric Brewer, Chair
Professor Krste Asanović
Professor Brian Carver

Summer 2015

OS and Runtime Support for Efficiently Managing Cores in Parallel Applications

Copyright 2015 by Kevin Alan Klues

Abstract

OS and Runtime Support for Efficiently Managing Cores in Parallel Applications

by

Kevin Alan Klues

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Eric Brewer, Chair

Parallel applications can benefit from the ability to explicitly control their scheduling policies in user-space. However, modern operating systems lack the interfaces necessary to make this type of “user-level” scheduling efficient. The key component missing is the ability for applications to gain direct access to cores and keep control of those cores even when issuing I/O operations that traditionally block in the kernel. A number of former systems provided limited support for these capabilities, but they have been abandoned in modern systems, where all efforts are now focused exclusively on kernel-based scheduling approaches similar to Linux’s. In many circles, the term “kernel” has actually become synonymous with Linux, and its Unix-based model of process and thread management is often assumed as the only possible choice.

In this work, we explore the OS and runtime support required to resurrect user-level threading as a viable mechanism for exploiting parallelism on modern systems. The idea is that applications request cores, not threads, from the underlying system and employ user-level scheduling techniques to multiplex their own set of threads on top of those cores. So long as an application has control of a core, it will have uninterrupted, dedicated access to it. This gives developers more control over what runs where (and when), so they can design their algorithms to take full advantage of the parallelism available to them. They express their parallelism via core requests and their concurrency via user-level threads.

We frame our work with a discussion of Akaros, a new, experimental operating system we developed, whose “Many-Core Process” (MCP) abstraction is built specifically with user-level scheduling in mind. The Akaros kernel provides low-level primitives for gaining direct access to cores, and a user-level library, called parlib, provides a framework for building custom threading packages that make use of those cores. From a parlib-based version of pthreads to a port of Lithe and the popular Go programming language, the combination of Akaros and parlib proves itself as a powerful medium to help bring user-level scheduling back from the dark ages.

Contents

Contents

List of Figures

List of Tables

1 Introduction
   1.1 Background and Motivation
   1.2 Akaros
       1.2.1 The Many-Core Process
       1.2.2 Resource Provisioning and Allocation
   1.3 Contributions and Roadmap

2 Parlib
   2.1 Abstraction Overview
   2.2 Thread Support Library
       2.2.1 Vcores
       2.2.2 Uthreads
       2.2.3 Synchronization Primitives
   2.3 Asynchronous Events
       2.3.1 Akaros vs. Linux
       2.3.2 The Event Delivery API
       2.3.3 Event Notifications
   2.4 Asynchronous Services
       2.4.1 System Calls
       2.4.2 The Alarm Service
       2.4.3 POSIX Signal Support
   2.5 Other Abstractions
       2.5.1 Memory Allocators
       2.5.2 Wait-Free Lists
       2.5.3 Dynamic TLS
   2.6 Evaluation

       2.6.1 The upthread Library
       2.6.2 The kweb Webserver
       2.6.3 Experiments
   2.7 Discussion

3 Lithe
   3.1 Architectural Overview
   3.2 The Lithe API
       3.2.1 Lithe Callbacks
       3.2.2 Lithe Library Calls
       3.2.3 Lithe Synchronization Primitives
   3.3 The Fork-Join Scheduler
   3.4 Evaluation
       3.4.1 SPQR Benchmark
       3.4.2 Image Processing Server
   3.5 Discussion

4 Go
   4.1 Go Overview
   4.2 Porting Go to Akaros
   4.3 Results
   4.4 Discussion and Future Work

5 Related Work
   5.1 Threading Models
   5.2 Concurrency vs. Parallelism
   5.3 Contemporary Systems

6 Conclusion

Bibliography

List of Figures

1.1 A high-level comparison between a traditional 1:1 process and an MCP
1.2 Three different core allocation schemes, based on (a) allocating cores strictly for future use, (b) allocating cores based solely on real-time demand, and (c) provisioning cores for future use, but allowing others to make use of them during periods of low demand

2.1 The collection of abstractions provided by parlib
2.2 Parlib and the APIs it exposes in support of building parallel runtimes based on user-level scheduling
2.3 This figure shows a direct comparison of threads running under a traditional 1:1 threading model, compared to parlib's decoupled threading model. Under parlib, all threading is done in user-space, and processes request cores (not threads) from the kernel, which pop up on a transition context represented by the vcore abstraction
2.4 This figure shows the process of passing control from uthread context to vcore context and back. Parlib provides the mechanisms to easily pass from one context to another. Scheduler writers use these mechanisms to implement their desired scheduling policies
2.5 This figure shows the layering of the vcore and uthread abstractions with respect to the bi-directional API they provide to scheduler writers
2.6 Average context switch latency/core of upthreads vs. the Linux NPTL. For upthreads, we show the various components that make up the total context switch latency. For the Linux NPTL, we show the latency when context switching a single thread vs. two. Error bars indicate the standard deviation of the measurements across cores. In all cases, this standard deviation is negligible
2.7 Total context switches per second of upthreads vs. the Linux NPTL for an increasing number of cores and a varying number of threads per core
2.8 Thread completion times of a fixed-work benchmark for different scheduling algorithms. The benchmark runs 1024 threads with 300 million loop iterations each, averaged over 50 runs

2.9 Speedup and average runtime of the OpenMP-based benchmarks from the NAS Parallel Benchmark Suite on upthreads vs. the Linux NPTL. Figure (a) shows the average runtime over 20 runs when configured for both 16 and 32 cores. Figure (b) shows the speedup. The red bars represent the speedup when all benchmarks are run on 16 cores. The blue bars represent the speedup when all benchmarks are run on 32 cores
2.10 Webserver throughput with 100 concurrent connections of 1000 requests each. Requests are batched to allow 100 concurrent requests/connection on the wire at a time. After each request is completed, a new one is placed on the wire. After each connection completes all 1000 requests, a new connection is started
2.11 Results from running the kweb throughput experiment on Akaros. The results are much worse due to a substandard networking stack on Akaros. The Linux NPTL results are shown for reference

3.1 Representative application of the type targeted by Lithe. Different components are built using distinct parallel runtimes depending on the type of computation they wish to perform. These components are then composed together to make up a full application
3.2 The evolution of parallel runtime scheduling from traditional OS scheduling in the kernel (a), to user-level scheduling with parlib (b), to user-level scheduling with Lithe (c). Lithe bridges the gap between providing an API similar to parlib's for building customized user-level schedulers and allowing multiple parallel runtimes to seamlessly share cores among one another
3.3 Lithe interposing on parlib to expose its own API for parallel runtime development. Figure (a) shows a single parallel runtime sitting directly on parlib. Figure (b) shows Lithe interposing on the parlib API to support the development of multiple, cooperating parallel runtimes
3.4 An example scheduler hierarchy in Lithe. Figure (a) shows an example program invoking nested schedulers. Figure (b) shows the resulting scheduler hierarchy from this program. The Graphics and AI schedulers originate from the same point because they are both spawned from the same context managed by the upthread scheduler
3.5 The process of requesting and granting a hart in Lithe. A child process first puts in a request through lithe_hart_request(), which triggers a hart_request() callback on its parent. The parent uses this callback to register the pending request and grant a new hart to its child at some point in the future. The initial request is initiated in uthread context, but all callbacks and the subsequent lithe_hart_grant() occur in vcore context
3.6 The software architecture of the SPQR benchmark on Lithe. Cores are shared between OpenMP and TBB at different levels of the SPQR task tree

3.7 SPQR performance across a range of different input matrices. We compare its performance when running on the Linux NPTL (for both an 'out-of-the-box' configuration and a hand-tuned configuration based on optimal partitioning of cores) to running directly on parlib-based upthreads and Lithe. In all cases, Lithe outperforms them all
3.8 Integrating the Flickr-like image processing server into kweb. Static file requests serviced by pthreads must share cores with thumbnail requests serviced by a mix of pthreads and OpenMP
3.9 Mixing static file requests and thumbnail generation in kweb

4.1 The evolution of Go from running on Linux, to running on Akaros with upthreads, to running on Akaros with vcores directly

List of Tables

2.1 Time to read and write the TLS base address via different methods on x86_64
2.2 CPU usage and bandwidth consumed by different variants of kweb and nginx when configured for up to 7 CPUs (i.e. when all configurations are at their peak)

3.1 The full set of Lithe library calls and the corresponding callbacks they trigger. The calls are organized by those that pertain to hart management, scheduler management, and context management respectively. The callbacks are annotated with the particular scheduler on which they are triggered (i.e. child, parent, or current scheduler)
3.2 A side-by-side comparison of library calls and callbacks from the new Lithe API and the original Lithe API as presented in [3]

Acknowledgments

This dissertation has been a long time coming. It would not have been possible without the help of a multitude of people, including professors, colleagues, co-workers, and friends over the last 10+ years.

First, I'd like to thank professors Adam Wolisz and Vlado Handziski from the Technical University of Berlin for first introducing me to the world of academic research. If it weren't for my experience working with them so many years ago, I would have never considered doing a Ph.D. at all.

Next, I'd like to thank Dr. Chenyang Lu from Washington University in St. Louis for advising me through the process of completing my Master's thesis. It was under his guidance that I published my first few academic papers, without which I never would have been accepted to the Ph.D. program at Berkeley.

Next I'd like to thank professor Phil Levis of Stanford University for giving me the opportunity to work with him for a year between completing my Master's degree and starting my Ph.D. I'm fairly certain his letter of recommendation played a central role in my acceptance to Berkeley, and for that I am forever grateful.

Next I'd like to thank professors Eric Brewer, Krste Asanović, Ion Stoica, David Wessel, and Brian Carver for serving on my qualifying exam and dissertation committees. Unfortunately, David passed away before getting the chance to see the final result of this work. He will be sorely missed.

Last, but not least, I'd like to thank Barret Rhoden, David Zhu, Andrew Waterman, Ben Hindman, Steve Hofmeyr, Eric Roman, Costin Iancu, Ron Minnich, Andrew Gallatin, Kevin Kissel, Keith Randall, Russ Cox and others at Berkeley, LBNL and Google, with whom I've had endless discussions over the years about this work. It simply wouldn't have come together as it has without you.

I cannot end without thanking my family (this includes you, Amy!), whose constant support both morally and financially I have relied on for most of my academic career. It is to them that I dedicate this work.

Chapter 1

Introduction

Cores are the lifeline of an application. Without them, applications would never be able to make forward progress. However, modern operating systems lack the interfaces necessary to allow applications to directly control how they make use of cores. Access to cores is hidden behind the abstraction of a thread, and operating system kernels dictate where and when an application's threads will be scheduled on the various cores available in the system. Hints can be given to help the kernel make these decisions, but limited support exists to ensure true isolation from other threads running in the system.

We argue that parallel applications can benefit from the ability to gain direct access to cores and schedule their own "user level" threads on top of them. From the kernel's perspective, treating cores this way essentially transforms a difficult thread scheduling problem into an easier core allocation problem. But it leaves applications asking: how do I make efficient use of these cores once I have access to them?

In this thesis, we take a closer look at this problem and present a full-stack solution for efficiently managing cores in parallel applications. We focus our efforts specifically on applications in the datacenter. Our basic approach is simple – make cores an allocatable resource akin to memory, and build the necessary software infrastructure around this abstraction in order to make it useful to applications. The idea is that applications request cores, not threads, from the underlying system and employ user-level scheduling techniques to multiplex their own set of threads on top of those cores. So long as an application has control of a core, it will have uninterrupted, dedicated access to it (not even the kernel will interrupt it without prior agreement). This gives developers more control over what runs where (and when), so they can design their algorithms to take full advantage of the parallelism available to them. They express their parallelism via core requests and their concurrency via user-level threads.

In joint work with Barret Rhoden et al. [1], we introduced a new research OS, called Akaros, whose primary goal is to provide better support for parallel applications in the datacenter. As part of Akaros's design, we explicitly provide the interfaces necessary to make the core management capabilities described above possible. In this thesis, we leverage these core management capabilities and expand on the design and implementation of the user-level libraries that sit on top of them. The details of the Akaros kernel itself can be found in Barret's dissertation [2].

At the heart of these user-level libraries is parlib – a framework for building efficient parallel runtimes based on user-level scheduling. Parlib leverages Akaros's ability to explicitly allocate cores to an application and provides a set of higher-level abstractions that make working with these cores easier. These abstractions are the subject of Chapter 2 and provide the foundation for the rest of this thesis. In the chapters that follow, we expand on our use of parlib to provide a complete rewrite of Lithe [3, 4] (a framework for composing multiple user-level schedulers in a single application) and a port of the Go programming language [5] to Akaros. Moving forward we envision these libraries (and those like them) to play a central role in future datacenter applications. We elaborate on the significance of these contributions in Section 1.3 of this chapter.

In the following section, we provide some background and motivation for this work. Why are datacenters organized as they are today? Why is it important for datacenter applications to take direct control of their cores? Why aren't existing core "allocation" strategies based on kernel thread scheduling good enough? The following discussion helps to answer some of these questions and more.

1.1 Background and Motivation

In the late 1990s systems research in parallel computing hit a major setback when people began using clusters of commodity machines to perform many of the tasks once reserved exclusively for supercomputers [6]. This "solution" to problems plaguing researchers for decades came not in the form of a truly technological breakthrough, but rather for economic reasons. Once commodity machines were both fast enough and cheap enough to perform most computing jobs with better economies of scale than supercomputers, it made sense for all but those in the HPC community to abandon supercomputers altogether. Research in parallel computing continued, but a large fraction of researchers began to switch their focus to the field of distributed computing, where the problem was no longer "how do I exploit parallelism in a single machine with multiple CPUs?", but rather "how do I exploit parallelism distributed across multiple machines, each with a single CPU?". Ironically, economics, not technology, is now driving us back in the opposite direction, as these clusters of commodity machines (or datacenters as they are called today) are being built with an ever increasing number of CPUs and forcing us to reexamine many of the techniques used on parallel machines in the past. The world is no longer a cluster of commodity single-core machines, but rather a datacenter of commodity multi-core machines.

This shift towards multi-core CPUs, which began in 2004 and continues through today, was not a welcome change. On the contrary, if it were up to chip manufacturers like Intel and AMD, everyone would still be running single-core machines and the fight for the best processor would be about pushing clock speeds to their maximum potential. The truth of the matter, however, is that continuing to increase clock speeds has a number of unwanted side effects that make it no longer economically feasible. These side effects include increased heat (making it expensive to cool), increased power consumption (making it expensive to operate), and increased leakage current (making it expensive to design and build these chips in the first place) [7]. At a certain point it stops making sense to keep pushing anymore.

Industry's solution to this problem has been simple – replicate a large number of reasonably clocked CPUs on the same die and link them together over a high-speed interconnect. The promise of CPU scalability makes this approach seem reasonable on the surface [8]. However, it has one very important drawback: all code written for these machines must be explicitly parallel. In fact, legacy code written for single-threaded execution actually slows down in many cases because the individual CPUs they run on are now slower.

This major shift in processor technology has forced us to reexamine a fundamental question put on hold for over a decade: how do we write software to efficiently utilize these increasingly parallel machines? In other words, what do we do with all these cores? Although researchers in the HPC community have been looking at a variant of this problem for quite some time [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], examining it in the context of the datacenter requires a fresh perspective because of the different constraints placed on the applications that exist in this environment. Unlike HPC workloads, which typically follow a run-to-completion model with no specific latency requirements (e.g.
genome sequencing, weather simulations, etc.), datacenters are dominated by long-running interactive services that have a strict low-latency component (e.g. webservers, search engines, etc.). No one expects the results of a weather simulation to be completed without some minimal delay, but they do expect search results to be updated instantaneously as you type.

The interactive nature of these datacenter services places a variable demand on the number of resources required to satisfy their users over time (of which the most important resources are cores and memory). Their usage patterns are characterized by long periods of low-to-average demand interspersed with bursts of high demand (e.g. everyone searching at once when a high-profile natural disaster occurs). This is in contrast to HPC-style workloads that typically require a constant (or at least predictable) number of resources from start to finish.

To compensate for these bursts, datacenters must be over-provisioned to support the peak resource demands of all services they run simultaneously. That means more machines and more cost. Unchecked, a large portion of these resources would simply sit idle a majority of the time. However, datacenters make up for this excess capacity by doing some amount of batch processing in the background. These batch jobs follow the standard HPC-style run-to-completion model and typically consist of either MapReduce jobs [20] or some type of "big data" analytics. They simply use the resources not required by the primary interactive services when running below peak demand.

The end result is that datacenters require a much more fluid transfer of resources between their various jobs compared to traditional HPC workloads. However, individual nodes in the datacenter are not in charge of allocating resources to applications themselves. Instead, all allocation decisions are made by a higher-level control system or "cluster OS" (e.g. Google Borg or Mesos) [21, 22], whose job it is to examine the state of the entire datacenter, figure out the best allocation of resources to applications, and spread those applications across tens of thousands of nodes. Individual nodes simply enforce the allocations that have been imposed on them by the cluster OS. In practice, interactive jobs get more resources and batch jobs get fewer.

Under this model, existing cluster OSs follow strict allocation policies when handing out resources to applications. They continuously monitor the resource usage of all applications and allocate resources to them based on relative priorities and estimated future demand. Every time a new allocation decision needs to be made, a global reshuffling of resources takes place and nodes adjust themselves accordingly (usually by killing whatever batch jobs are running and handing their resources over to the interactive services that need them). Because this reshuffling procedure tends to be fairly heavy-weight, cluster OSs try to avoid making these allocation decisions very often. At one extreme, the cluster OS triggers a global reshuffling of resources every time there is a minimal change in resource demand. At the other extreme, each application is allocated all of the resources required to meet its peak demand up front, and most resources sit idle a majority of the time. In practice, cluster OSs tend to follow a policy that falls somewhere between these two extremes. The overhead of a global reshuffling algorithm is just too high.
However, this is in conflict with the "fluid" transfer of resources required to keep all nodes in the datacenter busy at all times (only a "choppy" transfer of resources is possible at best). The only way to achieve true fluidity of resources would be to modify the global reshuffling algorithm to make some number of node-local decisions as well. We've come up with such a scheme based on the provisioning interfaces we've built into Akaros (which are discussed further in Section 1.2.2). The idea is to have the cluster OS no longer perform resource allocation directly. Instead, it would make global decisions about how to provision resources for guaranteed future use by an application, but leave the real-time allocation of those resources up to individual nodes themselves. Global decisions would be based on provisioning resources to an application for peak demand; node-local decisions would be based on using these provisions to allocate resources to an application given the real-time demand. In practice, all resources in the datacenter would be provisioned exclusively to interactive jobs, and all remaining batch jobs would simply pick up the slack when real-time demand was low.

As a global policy, provisioning resources this way is much lighter-weight than allocating them directly. In the common case, changing an application's resource provision amounts to little more than changing a number in a resource vector and does not immediately affect its real-time allocation. This would be a major win over current algorithms, where even minor allocation adjustments force a node to immediately start killing jobs and moving resources around as instructed.

However, individual nodes in the datacenter currently lack the ability to enforce this more sophisticated policy. They tend to run a relatively traditional OS, and any code running on them is limited to the abstractions provided by that OS. For example, the desire for a cluster OS to "allocate" cores to an application is misaligned with the capabilities that actually exist on a node to enforce this allocation (what does it even mean to allocate a core in a traditional OS?). Moreover, there is no way for a cluster OS to actually set up a provision of resources so that node-local decisions could take advantage of it (traditional OSs simply lack the interfaces to support this).

Luckily, these problems are not inherent to datacenter nodes in general, but rather an artifact of history. As fate would have it, the original push towards single-core clusters in the late 1990s coincided with the growth in popularity of Linux, and over time Linux has become the de facto standard OS for nodes in the datacenter. However, Linux was originally designed specifically with uniprocessor systems in mind, and any multi-core support that has been added over the years still contains remnants of this original design. The kernel itself has been modified to scale past previous bottlenecks, but its design is still rooted in a model that conflates the notions of cores and threads into a single unified abstraction. In order to "allocate" cores to an application, Linux relies on kernel scheduling techniques to try and isolate an application's threads to run on a specific subset of cores in the system. "Allocation" is therefore achieved when one application's threads are all scheduled on one core and another application's threads are all scheduled on another. However, there is no way to truly guarantee isolation between threads or prevent the kernel from interrupting an application at an inopportune time if it has some periodic task it would like to run. Even with the use of CPU sets, which purport to provide shielding from other processes and interrupts, Linux gives no guarantees that it will not periodically run one of its own tasks on a shielded core. In the context of the datacenter, this means that an interactive job may end up being temporarily descheduled without prior knowledge, degrading its performance and damaging the overall end-user experience.

By explicitly allocating cores to an application and providing guarantees for true dedicated access, this problem can mostly be avoided. Indeed, we ran a variant of the FTQ benchmark [23] for measuring OS 'noise' on Linux vs. Akaros, and found Akaros to have an order of magnitude less noise and fewer periodic signals on its cores than Linux [2]. Moreover, Rasmussen et al. [24] have demonstrated that, in general, parallel applications running on Linux tend to run far below their peak performance capabilities (partially due to OS noise and partially due to an inability to directly manage memory at the application level). By removing Linux from the picture, they were able to fine-tune a set of machines to execute a custom sorting algorithm called TritonSort 60% faster than the previous winner of the Indy GraySort sorting benchmark – all with only 1/6 the number of nodes required by Linux.

Given these results, one has to ask: why are we putting our large-scale, performance-sensitive datacenter applications in the hands of an OS that was never designed to run these types of applications in the first place? Solutions have been bolted on over time to deal with some of the problems faced by these applications (CPU sets being a prime example), but any solution that does not directly align the desires of the cluster OS with the capabilities of the OS running on each node will ultimately fall short. Akaros has been designed specifically with these concerns in mind.

1.2 Akaros

Over the past 7 years we have been building an experimental operating system called Akaros [1], whose goals are directly in line with providing better support for parallel applications in the datacenter. Akaros itself was born out of the desire to rethink operating systems for large-scale shared-memory multi-core machines, with a specific focus on (1) providing better support for parallel applications for both high-performance and general-purpose computing and (2) scaling the kernel to thousands of cores. Although not initially intended specifically for datacenters, Akaros's design matches well with both datacenter applications as well as datacenter hardware. These applications require both high, predictable performance as well as low latency, and they are important enough to warrant engineering effort at all levels of the software stack. With Akaros we ask: what improvements can we make to traditional operating system abstractions to better support these types of applications?

The general philosophy behind Akaros is to:

Expose as much information about the underlying system to applications as possible and provide them with the tools necessary to take advantage of this information to run more efficiently.

Following this philosophy, significant work has gone into designing and building all of the abstractions that distinguish Akaros from a traditional OS. During its development we have been very careful not to simply build things that we eventually plan to throw away. Ultimately, we would like the actual code base of Akaros adopted as a useful datacenter operating system rather than simply viewing its implementation as an academic exercise. To this end, we have designed Akaros in such a way that it can be incrementally deployed in a datacenter, rather than forcing users to reinstall it on every one of their nodes. This means that jobs that have not yet been ported to take advantage of Akaros's features can continue to run on legacy OSs throughout the datacenter without any ill effects.

Akaros currently runs on x86-64 and RISC-V [25], an experimental multi-core architecture under development at UC Berkeley. We have ported GNU libc to Akaros, support dynamically and statically linked ELF executables, and provide thread-local storage (TLS). We have basic in-memory filesystem support (initrd and ext2) and a simple networking stack. The OS feels like Linux and runs busybox and pthread programs.

However, Akaros differs from traditional systems in several ways, most importantly in its treatment of parallel processes and the way they manage cores. In Akaros, we have augmented the traditional process with a new abstraction called the many-core process (MCP). The kernel gives cores (rather than threads) to an MCP, allowing applications to schedule their own user-level threads on top of those cores. Coupled with Akaros's asynchronous system call interface, MCPs never lose their cores, even when issuing a system call that traditionally blocks in the kernel (e.g. read, write, accept). These cores are also gang-scheduled [26] and appear to the process to be a virtual multiprocessor, similar to scheduler activations [27].

In addition to handing out cores to an MCP, Akaros also takes advantage of having a large number of cores by specializing a few of them for certain tasks. The main distinction we make is between cores given to MCPs and cores used for everything else. These cores differ in both their workloads and in how long they spend on a specific task. Thus we denote cores as either coarse-grained (CG) for cores given to an MCP, or low-latency (LL), in reference to their short scheduling quanta. LL cores tend to run interrupt handlers, kernel background work, daemons, and other single-core processes. CG cores, on the other hand, are relatively simple, and are designed to provide an MCP with computation free from interference. These cores never run kernel tasks, nor do they receive any undesired interrupts or other sources of jitter. Ultimately, applications have control over their CG cores until they are either voluntarily yielded back to the system or explicitly revoked by the kernel because a higher priority application needs them. In practice, physical core 0 is the only core we currently designate as an LL core, and the kernel schedules tasks on it with a time quantum of 5ms (vs. no scheduling quantum for CG cores).

In this thesis, we focus specifically on CG cores and explore techniques for managing them efficiently in parallel applications. In the following sections, we expand on the ideas behind the MCP as well as provide an overview of our interfaces for resource provisioning and allocation on Akaros. Provisioning provides the foundation for true dedicated access to cores.

1.2.1 The Many-Core Process

In Akaros, we have augmented the traditional process model with a new abstraction called the many-core process (MCP). Similar to traditional processes, MCPs serve as the executing instance of a program, but differ in a few important ways:

• The kernel gives cores (rather than threads) to an MCP, allowing applications to multiplex their own user threads on top of these cores.
• All cores are gang-scheduled [26], and appear to the MCP as a virtual multiprocessor, similar to scheduler activations [27].
• There are no kernel tasks or threads underlying each thread of a process, unlike in a 1:1 or M:N threading model.
• All system calls in Akaros are asynchronous, so MCPs never lose their cores due to a system call, even if that system call blocks in the underlying kernel (a sketch of this flow follows this list).
• An MCP is always aware of which cores are running, and will not be preempted or have any of its cores revoked without first being notified. Provisioning interfaces also exist to prevent revocation when desired.
• All faults are redirected to and handled by an MCP rather than by the kernel. Typically this is done via user-level library code (similar to Nemesis [28, 29]). We discuss this further in the following chapter on parlib.
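To make the asynchronous system call model above concrete, the self-contained sketch below simulates it in ordinary C: a "syscall" is issued and returns immediately, a stand-in kernel thread completes it in the background, and the core keeps doing other work until a completion flag flips. All of the names (async_sysc, fake_kernel) are invented for this illustration; real Akaros syscalls use shared syscall structures and parlib's event delivery rather than a pthread and a busy-wait loop.

    /* Illustrative only: simulates the "issue now, complete later" syscall
     * flow described above.  Compile with: gcc -pthread sketch.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    struct async_sysc {
        atomic_int done;    /* flipped by the "kernel" on completion */
        long retval;
    };

    /* Stand-in for the kernel servicing a slow, blocking operation. */
    static void *fake_kernel(void *arg)
    {
        struct async_sysc *sysc = arg;
        struct timespec ts = { 0, 100 * 1000 * 1000 };  /* ~100 ms of "I/O" */
        nanosleep(&ts, NULL);
        sysc->retval = 42;
        atomic_store(&sysc->done, 1);
        return NULL;
    }

    int main(void)
    {
        struct async_sysc sysc = { .done = 0 };
        pthread_t kthread;

        /* "Issue" the syscall: the call returns immediately, so this core
         * is never lost to the kernel. */
        pthread_create(&kthread, NULL, fake_kernel, &sysc);

        /* A user-level scheduler would run other uthreads here while the
         * issuing thread is "blocked" in user space. */
        long other_work = 0;
        while (!atomic_load(&sysc.done))
            other_work++;

        printf("syscall returned %ld after %ld units of other work\n",
               sysc.retval, other_work);
        pthread_join(kthread, NULL);
        return 0;
    }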

Figure 1.1 shows a side-by-side comparison of the high-level differences between a traditional process and an MCP. In a traditional process, each user thread is launched with a backing kernel thread, and the kernel is in charge of all thread scheduling decisions. In an MCP, user threads are both launched and scheduled by user space according to a two-level scheduling hierarchy; they are multiplexed on top of cores granted to them by the underlying kernel. The kernel then schedules MCPs as a single unit, with all cores granted to an MCP gang-scheduled together.

(a) Traditional 1:1 Process (b) The MCP
Figure 1.1: A high-level comparison between a traditional 1:1 process and an MCP

By design, MCPs have good performance isolation. They take the set of resources allocated to them and run in isolation from other MCPs in the system. This isolation allows for better cache utilization and faster synchronization between cores compared to traditional processes. Moreover, since MCPs rely completely on asynchronous system calls and user-level threading, they tend to run faster than traditional processes (due to smaller context switch times) as well as scale better (due to the low overhead per thread). There is no need to oversubscribe threads to cores, and programmers are free to run custom scheduling policies based on the particular needs of their application. They can implement policies such as "one thread for every I/O", without worrying that these threads might thrash in the kernel. Support also exists to ensure that lock holders always run first (even if preempted), which is critical for the tail latency of operations involving non-wait-free synchronization. These capabilities, and the implications they have on scheduling, are discussed further in Chapter 2.

1.2.2 Resource Provisioning and Allocation

Traditionally, resource allocation is done for one of two reasons:

• As a "reservation" mechanism, whereby resources are set aside for future use by an application
• As a "real-time assignment" mechanism, whereby resources are requested by an application just before they are actually needed

When used as a reservation mechanism, applications are always prepared with enough resources to satisfy their peak demand, but they may end up leaving those resources unused a large percentage of the time if real-time demand for them is low. On the other hand, when used as a real-time assignment mechanism, applications never have to worry about leaving resources unused, but they run the risk of not getting a resource when needed because it is already allocated to another application. Examples of these two situations in the context of allocating cores are depicted in Figures 1.2a and 1.2b.

Akaros's provisioning interface decouples these two ideas. It provides separate mechanisms to first reserve a resource for guaranteed future use, and then (in a separate step) allocate that resource just before it is actually needed. In the meantime, provisioned resources can be used by other applications with the implicit contract that they may be revoked at any time. Once revoked, they are given immediately to the application that provisioned them. In the context of cores, provisioning provides the means for true dedicated access to cores, with no opportunity for interruption from the underlying system (barring any minimal delay to revoke the core from another process at the outset). Figure 1.2c shows how provisioning cores in this way compares to the two pure allocation policies discussed above.

The primary advantage of this interface is that it allows applications to reserve resources to meet some minimal latency requirement, while still allowing those resources to be allocated to other processes when not currently in use. Moreover, this interface allows applications to allocate more than their provisioned resources (when available) in order to compensate for unforeseen spikes in activity. When going over a provision in this way, the application is able to continue running at peak performance while the system adjusts the necessary provisions in the background. These capabilities are particularly important in the context of the datacenter to enable fluid sharing of resources between interactive jobs and batch jobs, as discussed in Section 1.1 of this chapter. For the purposes of this thesis, however, we are interested exclusively in the ability of this interface to provide true dedicated access to cores. We leave it as future work to exploit these capabilities in a modified resource allocation policy for a cluster OS.
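To make the distinction between provisioning and allocation concrete, the self-contained toy model below mimics the scenario of Figure 1.2c for a hypothetical 20-core node. Everything in it (the process names, core counts, and the allocate() helper) is invented for this sketch and is not the Akaros interface: P1 borrows idle cores out of P0's provision, and those cores are revoked and handed back the moment P0's own demand spikes.

    /* Toy model of provisioning vs. allocation (cf. Figure 1.2c).
     * All names and numbers are illustrative, not the Akaros API. */
    #include <stdio.h>

    #define TOTAL_CORES 20

    struct proc {
        const char *name;
        int provisioned;   /* cores reserved for guaranteed future use */
        int allocated;     /* cores actually held right now            */
    };

    static struct proc p0 = { "P0", 16, 0 };
    static struct proc p1 = { "P1",  0, 0 };

    static int idle_cores(void)
    {
        return TOTAL_CORES - p0.allocated - p1.allocated;
    }

    /* Grant up to 'want' cores: first from idle cores, then by revoking
     * cores that the other process borrowed out of our provision. */
    static void allocate(struct proc *me, struct proc *other, int want)
    {
        while (me->allocated < want && idle_cores() > 0)
            me->allocated++;
        while (me->allocated < want && me->allocated < me->provisioned &&
               other->allocated > other->provisioned) {
            other->allocated--;        /* revoke a borrowed core */
            me->allocated++;
        }
        printf("%s: provisioned=%d allocated=%d (wanted %d)\n",
               me->name, me->provisioned, me->allocated, want);
    }

    int main(void)
    {
        allocate(&p1, &p0, 8);   /* P1 borrows idle cores from P0's provision */
        allocate(&p0, &p1, 16);  /* P0's demand spikes; borrowed cores revoked */
        allocate(&p1, &p0, 8);   /* P1 is left with the unprovisioned leftovers */
        return 0;
    }

Running it prints P1 first holding 8 cores, then being cut back to 4 once P0 claims its full provision of 16, matching the before/after states shown in Figure 1.2c.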

1.3 Contributions and Roadmap

With Akaros in place, we can start to see how applications could take advantage of the MCP and its provisioning interfaces to improve the overall performance of applications in the datacenter. Indeed, Rhoden provides a number of experiments in his dissertation [2] showing compelling evidence of this claim. From demonstrating that Akaros can provide an order of magnitude less noise on its cores than Linux to showing Akaros's provisioning interfaces in action, these experiments suggest that (at the very least) we are on the right track. However, many of these experiments are based on micro-benchmarks or contrived examples that are meant to demonstrate a particular kernel feature in isolation. There is still much work to be done before Akaros will be able to compete with Linux in terms of running any real-world datacenter applications. These applications often go through years of fine-tuning, and Akaros simply isn't mature enough to exploit the multitude of Linux-specific features that have been added to them over the years (nor is that necessarily desirable).

P0 P1 P0 P1 Allocated: 8 12 Allocated: 8 12 Wants: 0 10 Wants: 10 10

(a) Strict “Reservation” Allocation

P0 P1 P0 P1 Allocated: 8 8 Allocated: 10 10 Wants: 8 8 Wants: 12 12

(b) Strict “Real-time Assignment” Allocation Revoked From P1

P0 P1 P0 P1 Provisioned: 16 0 Provisioned: 16 0 Allocated: 0 8 Allocated: 16 4 Wants: 0 8 Wants: 16 8

(c) Allocation with Provisioning

Figure 1.2: Three different core allocation schemes, based on (a) allocating cores strictly for future use, (b) allocating cores based solely on real-time demand, and (c) provisioning cores for future use, but allowing others to make use of them during periods of low demand. P0 and P1 indicate different processes trying to allocate cores over time. Time flows from left to right, with the allocation state changing from the state on the left to the state on the right as time moves forward. In (a), P0 starts off with 8 allocated cores with a real-time demand of 0, and P1 starts off with 12 allocated cores and a real-time demand of 10. As time moves forward, P0's real-time demand increases to 10, but only 8 of these cores can be satisfied by its current allocation. The remaining 2 cores are allocated to P1, and there is no chance for P0 to make use of them even though they are just sitting idle. They must remain idle because of the strict "reservation" scheme in use. In (b), both P0 and P1 start off with 10 cores each, but their real-time demand increases to 12 as time moves forward. Since a real-time allocation scheme is in use, neither P0 nor P1 can meet their new real-time demand; both processes end up falling short by 2 cores. In (c), P0 provisions 16 cores for future use, but starts off with a real-time demand of 0. P1, on the other hand, has 0 cores provisioned, but is able to meet its real-time demand of 8 by "borrowing" 4 of its cores from P0's provision. As time moves forward, P0's real-time demand increases, causing it to allocate all 16 of its cores at once. When this occurs, 4 cores are preempted from P1, and its allocated core count is reduced to 4 instead of 8. By provisioning cores in this way, P0 takes priority in allocating the cores it has provisioned, but allows P1 to use them during periods of low demand.

The rest of this work is organized as follows:

Chapter 2: Parlib

Parlib is a user-level library we developed that helps applications exploit the core management capabilities provided by Akaros. It is the user-space counterpart to the Akaros kernel. Akaros provides the low-level primitives for gaining direct access to cores, and parlib provides a framework for building custom user-level schedulers that make use of those cores. Parlib itself does not define an explicit user-level scheduler, but rather provides a set of abstractions that ease the development of user-level schedulers built on top of it. In this chapter, we discuss each of these abstractions in detail and present our canonical user-level pthreads implementation built on top of these abstractions.

As part of our development, we have also built a port of parlib for Linux to act as a compatibility layer to compare user-level schedulers running on both systems. The Linux port is limited in its support for true dedicated access to cores, but is useful for providing a performance comparison between applications built with parlib-based schedulers and those running directly on the Linux NPTL. We provide this comparison and show that parlib-based schedulers are able to outperform the Linux NPTL across a number of representative application workloads. These results include a 3.4x faster thread context switch time, competitive performance for the NAS parallel benchmark suite, and a 6% increase in throughput over nginx for a simple thread-based webserver we wrote. Collectively, these benchmarks provide substantial evidence in favor of our approach to user-level scheduling – on both Akaros and Linux, alike.

Chapter 3: Lithe

With parlib in place, higher-level libraries can be developed that expand on its basic capabilities. One such capability is the ability to use multiple user-level schedulers inside a single application. Such functionality is useful, for example, when leveraging code developed by third-party developers from different frameworks or languages (e.g. OpenMP, TBB, or Go). Where parlib provides abstractions to build individual user-level schedulers, Lithe provides abstractions that allow multiple user-level schedulers to interoperate in harmony. Lithe was originally proposed by Pan et al. in 2010 [3, 4] and all work presented here is an extension of this previous work.

Although not originally proposed as such, Lithe can be thought of as a natural extension to parlib, with the added capability of coordinating access to cores among multiple user-level schedulers. In this chapter, we present our port of Lithe to parlib and discuss some results based on this new design. Most notably, we add support for asynchronous syscalls (which we get for free from parlib), as well as provide a new generic "fork-join" scheduler on top of which all of our Lithe-aware schedulers are now based. We also provide a general cleanup of all of Lithe's outward facing interfaces. In terms of results, we include a re-run of the SPQR benchmark [30] found in the original Lithe paper [3], as well as an extension to the Flickr-Like Image Processing Server presented there as well. Our results corroborate previous results, and show new evidence that supports Lithe's ability to efficiently manage cores among multiple user-level schedulers.

Chapter 4: Go on Akaros

Go [5] is a modern programming language from Google that lends itself well to the development of parallel applications in the datacenter. It has been steadily rising in popularity in recent years in part due to its elegant code composition model and in part due to its concurrency model [31]. Go's concurrency model is based on launching light-weight user-level threads, called goroutines, and synchronizing access to them through a language primitive known as a "channel". This concurrency model matches well with the user-level threading model promoted by Akaros, Parlib, and Lithe.

In this chapter, we present our port of Go to Akaros. The goal of the port is twofold. First, it demonstrates the completeness of the Akaros kernel in terms of supporting a fully functioning runtime environment that matches well to its underlying threading model. Second, it opens up a new set of applications to Akaros that otherwise wouldn't be available. Additionally, Go includes an extensive test suite that tests a wide array of functionality, from spawning subprocesses to launching HTTP servers and handling file descriptors properly in the OS. By ensuring that Akaros can pass this test suite, we provide a fairly complete regression of many of Akaros's core features. At the time of this writing, Akaros passes 100% of the 2,551 required tests from this test suite (though it still skips 151 unrequired tests). Moving forward we plan to add support for these remaining tests and ultimately integrate our port with the official upstream Go repository.

Chapter 5: Related Work

Our work on Akaros and its related user-level libraries is grounded in a large body of previous work. In this chapter, we review this work and compare it to the approaches we've adopted in Akaros. The chapter is broken into multiple sections including: Threading Models, Concurrency vs. Parallelism, and Contemporary Systems. In each section, we begin with a discussion of previous work, followed by how our work relates to it.

Chapter 6: Conclusion

In this final chapter, we provide a summary of everything discussed in this dissertation and provide some insight into the future directions of this work.

Chapter 2

Parlib

Parlib is a parallel runtime development framework. It was originally developed in conjunction with Akaros to provide useful abstractions for building user-level schedulers to run inside an MCP. It has since become the dumping ground for any abstraction that eases parallel runtime development. Ports exist for both Akaros and Linux. It is written in a combination of assembly and C.

The primary goal of parlib is to resurrect user-level threading as a viable mechanism for exploiting parallelism in modern-day systems. Parlib itself consists of a set of abstractions which developers can use to ease the development of new parallel runtimes built around the notion of user-level scheduling. Moreover, existing parallel runtimes such as OpenMP, TBB, and even the Linux NPTL can benefit from the abstractions provided by parlib.

The idea of user-level threading itself is not new. Indeed, systems such as Scheduler Activations [27] and Capriccio [32] make heavy use of user-level threads and have shown them to perform quite well in practice. However, both of these systems have inherent limitations that have kept them from wide-spread adoption in the past. For example, scheduler activations introduces complexities into both user-space and the kernel with only minimal performance benefits on non-compute-bound workloads. These complexities make the overhead of coordinating threads across the user/kernel boundary prohibitively expensive whenever a blocking system call is made. Many believe this overhead (and the complexities it brings with it) outweigh any benefits gained by being able to control the thread scheduling policy in user-space [33]. Moreover, scheduler activations has been found to scale poorly to a large number of cores. NetBSD actually introduced scheduler activations as its supported threading model in 2003 [34], only to remove it in 2009 in favor of a 1:1 model that scaled better [35]. Likewise, Linux had a set of patches for scheduler activations that were ultimately abandoned as well [36].

Capriccio improves upon scheduler activations by reducing the costs associated with blocking system calls [32]. Instead of allowing user threads to block in the kernel, Capriccio relies on asynchronous I/O to return control to user-space without the overhead of starting a heavy-weight activation. Capriccio simply intercepts each blocking call and transforms it into its asynchronous counterpart. Threads are then "blocked" in user-space, giving the application the opportunity to schedule another thread in its place. Although this threading model leaves lots of room for improved performance, Capriccio also suffered from scalability problems, as it was only designed to work on a single core. Additionally, it only supported cooperative thread scheduling policies, and had no good mechanisms for directing POSIX signals to individual user threads. It also provided no explicit API for developers to customize their thread scheduling policies on an application-specific basis. Any custom schedulers would have to be built into Capriccio itself, limiting its outward facing API to the pthreads interface.

In many ways, parlib can be thought of as the multi-core extension to Capriccio, with a number of added capabilities. Just like Capriccio, parlib makes heavy use of user-level threads and relies on asynchronous system calls to decouple the threading models in use by user-space and the kernel. However, parlib itself does not define an explicit user-level scheduler.
Instead, it provides a set of abstractions which ease the development of user-level schedulers built on top of it. As an improvement over Capriccio, these abstractions provide support for both cooperative and preemptive scheduling policies, as well as support for directing POSIX signals at individual user threads. Using these abstractions, developers can build custom user-level schedulers that expose any outward facing API they wish to present to an application. For example, we have used parlib to build a user-level variant of pthreads that is API compatible with the Linux NPTL (you still need to recompile). Additionally, we have built a webserver that includes an application-specific thread scheduler that operates independently of any underlying threading package. Both of these schedulers are discussed in more detail in Section 2.6 alongside a set of experiments that evaluate how well they perform compared to the Linux NPTL.

One key thing that distinguishes parlib from its predecessors is its reliance on the ability to request "cores" from the underlying system instead of threads. This capability, combined with a unified asynchronous event delivery mechanism, provides better core-isolation as well as improved performance across a number of different metrics on representative application workloads. Moreover, recent advancements at the hardware level make parlib's threading model more feasible for adoption than it has been in the past. Specifically, the addition of the rdfsbase and wrfsbase instructions on recent Intel platforms allows for fast user-level context switching with full support for thread-local storage (TLS). We explore the implications of having these capabilities further in Section 2.6.
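As a rough illustration of why these instructions matter, the short program below reads and writes the FS base register (which holds the TLS pointer on x86_64) directly from user space via the rdfsbase/wrfsbase compiler intrinsics. This is a sketch under several assumptions (a CPU with the FSGSBASE extension, an OS that enables it for user space, and compilation with gcc -mfsgsbase); it is not parlib code.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* Read the current FS base (the TLS pointer) without a syscall. */
        unsigned long long fs = _readfsbase_u64();
        printf("current FS base: %#llx\n", fs);

        /* A user-level context switch can install another thread's TLS
         * just as cheaply; here we simply write the same value back. */
        _writefsbase_u64(fs);
        return 0;
    }

Without these instructions, changing the TLS base from user space generally requires a system call (arch_prctl on x86_64 Linux, for example), which is a large part of what made per-thread TLS expensive for user-level threading in the past.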

2.1 Abstraction Overview

Parlib itself is not a parallel runtime. Instead, it is a collection of abstractions that developers can use to build their own parallel runtimes. Specifically, these abstractions ease the development of custom user-level thread schedulers and provide the necessary support to make user-level scheduling efficient.


Figure 2.1: The collection of abstractions provided by parlib

Figure 2.1 shows an overview of all of the abstractions provided by parlib. These abstractions include:

1. A thread support library that provides the building blocks for developing arbitrary user-level threading libraries. This includes the vcore abstraction (for requesting/yielding cores), the uthread abstraction (for working with user-level threads, i.e. creating, running, yielding a thread, etc.), and a set of synchronization primitives for integrating blocking mutexes, futexes, condition variables, semaphores, and barriers into a user-level scheduler. A common set of atomic memory operations and various spinlock implementations are included as well.

2. An asynchronous event delivery mechanism that allows an application to interact with a set of underlying asynchronous services through a common API. Applications register callbacks for events associated with each service and then wait for those callbacks to be triggered at the appropriate time. The existence of this mechanism and the services it supports enable thread schedulers to retain control of their cores whenever operations are being performed on their behalf.

3. Asynchronous services for system calls, alarms, and POSIX signal delivery. The system call service provides interposition to ensure that all calls that normally block in the kernel will be issued in a non-blocking manner. Threads can then "block" themselves in user-space and wait for a callback to wake them up when their syscall completes. The POSIX signal delivery service provides a way of directing signals at an individual user thread rather than a kernel one. Alarms are useful for building preemptive scheduling policies as well as many other timer-based abstractions.

Figure 2.2: Parlib and the APIs it exposes in support of building parallel runtimes based on user-level scheduling.

4. A collection of other useful abstractions that developers can use in support of building efficient parallel runtimes. These abstractions include pool and slab allocators for dynamic memory management, a simplified wait-free list abstraction (WFL), and dynamic user-level thread-local storage (DTLS). The slab allocator, WFLs, and DTLS are also used within parlib itself.

Figure 2.2 shows how these abstractions interact with one another in support of building custom parallel runtimes. Each abstraction provides a set of bi-directional APIs that other components are able to take advantage of. On one side of these APIs are library calls, implemented by the abstraction itself. On the other side are callbacks, which are declared by the abstraction, but implemented by the consumer of that abstraction. Library calls provide a way to trigger some operation to be performed. Callbacks provide a way to notify when those operations have completed, or otherwise branch back into the calling component to allow it to perform some custom operation. As an example, consider a user-level scheduler that wants to request direct access to a core. It first puts in a core request through the vcore API provided by the thread support library. The thread support library then submits this request to the system and returns. When a core eventually becomes available, it first pops up into code owned by the thread support library, which then triggers a callback into the user-level scheduler. At this point, the core is controlled by the user-level scheduler and is able to execute whatever code it wants. We expand on this idea of bi-directional APIs as we explain each abstraction in detail in the following sections.
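To make the split between library calls and callbacks concrete, the sketch below shows the general shape of such a bi-directional API in C. All of the names here (core_alloc_request(), core_granted(), etc.) are hypothetical and exist only to illustrate the pattern; the real vcore and uthread APIs are presented in the sections that follow.

    #include <stdio.h>

    /* Callback type: declared by the abstraction, implemented by its
     * consumer (e.g. a user-level scheduler). */
    typedef void (*core_granted_cb_t)(int core_id);

    /* Library call (stubbed out here): in a real system this would submit
     * the request and return immediately; the callback would fire later,
     * once per granted core, when the system hands the core to the process. */
    static void core_alloc_request(int num_cores, core_granted_cb_t cb)
    {
        for (int i = 0; i < num_cores; i++)
            cb(i);              /* stand-in for "core i comes online" */
    }

    /* Consumer side: decide what to run on the newly granted core. */
    static void core_granted(int core_id)
    {
        printf("core %d granted to the scheduler\n", core_id);
    }

    int main(void)
    {
        core_alloc_request(4, core_granted);
        return 0;
    }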

There are currently two ports of parlib, one for Akaros, and one for Linux. Although the ultimate goal is to provide a common, OS-independent API, the API is still in flux and currently differs slightly between the Akaros and Linux ports. Most of these differences arise from the fact that parlib was originally designed as the user-space counterpart to the Akaros kernel (much like glibc is to Linux). The parlib API is therefore designed to directly expose some of the abstractions we've built into Akaros. In order to provide a similar API, the Linux port has to fake some of these abstractions. Most notably, Linux has no way of providing true dedicated access to cores, and Linux's asynchronous system call mechanism is much heavier weight than Akaros's. Where emulation becomes prohibitive in terms of performance, we break the abstraction to keep both systems running fast. For developers, this means that you can't just take a scheduler written against the Linux parlib API and run it directly on Akaros (or vice versa). However, the effort to port them has proved to be minimal (on the order of hours), and the difference between the APIs is continually getting smaller. In the following sections, we walk through each of the abstractions listed above, and explain them in detail. Unless otherwise noted, each abstraction is described in an OS-independent manner, and the differences between the Akaros and Linux ports are pointed out, where appropriate.

2.2 Thread Support Library

The thread support library provides developers with a set of mechanisms for building custom user-level threading packages. Its API exposes operations common to existing threading packages such as creating, running and yielding threads, as well as not-so-common operations, such as requesting direct access to an underlying core. The library itself is built on top of a unique underlying threading model. Unlike traditional threading models, which provide either a 1:1, N:1, or a hybrid M:N mapping of user-threads to kernel-threads, parlib strives to provide a threading model that completely decouples user-threads from kernel-threads. At a high level, this is achievable, so long as the following two mechanisms are available:

• The kernel hands out cores, not threads, as the allocatable processing unit available to an application.

• All system calls are completely asynchronous.

In Akaros, both of these capabilities are inherently available, as the Akaros kernel API was designed to explicitly provide this functionality. On Linux, these mechanisms are only available in a limited sense, but we are able to "fake" them well enough that the threading model on both systems looks the same from the perspective of any code running on top of it. It simply runs less efficiently on Linux than it does on Akaros.

Using this threading model, processes request cores, not threads, from the underlying system. As these cores come online, there is a single entry point they pop into every time the kernel passes control of them to user-space (with one notable exception¹). This is true whether it is the first time the core is coming online, the core is being notified of a syscall completion event, or a core has been sent an interrupt (or fault) that is being reflected back to user-space. The entry point itself runs within a special transition context, which serves to bootstrap the core and figure out what it should do next (usually handle some event and find a user-level thread to run and resume it). It can be thought of as the user-space counterpart to interrupt context in the kernel. Combining this mechanism with a completely asynchronous system call interface means that a process will always get its core back after a system call has been issued. It typically uses this opportunity to save off the context that issued the system call and find another thread to run while the system call completes. Any threads that the kernel wishes to launch in order to complete the system call or do other kernel-specific tasks are decoupled from the user-level threads that may have triggered them. Figure 2.3 shows how this threading model compares to that of a traditional 1:1 threading model from the perspective of an application built on top of it.

Figure 2.3: This figure shows a direct comparison of threads running under a traditional 1:1 threading model (a) and under parlib's decoupled threading model (b). Under parlib, all threading is done in user-space, and processes request cores (not threads) from the kernel, which pop up on a transition context represented by the vcore abstraction.

¹ The only time the kernel doesn't pass control to user-space via this entry point is when the code running on a core voluntarily enters the kernel via a syscall. In this case, the kernel will pass control directly back to whatever context issued the syscall.

The two most important abstractions provided by parlib for expressing this threading model are the vcore and uthread abstractions. Vcores provide uninterrupted, dedicated access to physical cores, and uthreads provide user-level threads to multiplex on top of those cores. Together, these abstractions provide a bi-directional API, on top of which developers can implement custom runtimes with varying user-level scheduling policies. On one side of the API, library calls provide common operations that all user-level schedulers will likely wish to perform (e.g. requesting access to a core, launching a thread, etc.). On the other side, scheduler writers implement callbacks that are triggered by these library calls whenever a customized scheduling decision needs to be made. As discussed below, most of these callbacks are set at compile time in a global sched_ops struct, but a few are passed as arguments directly to the API calls that trigger them.

To understand how code on both sides of this bi-directional API interacts, it is important to define distinct notions of vcore context and uthread context. A context is defined to include a stack, some register state, and an optional thread-local storage (TLS) region. Every vcore or uthread in the system has its own unique context. At any given time, an underlying physical core is executing on behalf of exactly one of these contexts. When executing code on behalf of a vcore, it executes in vcore context. When executing code on behalf of a uthread, it executes in uthread context. Vcore context is the transition context that a core first pops into, and serves to run code related to handling events from the kernel, scheduling uthreads to run, etc. It is the user-space counterpart to interrupt context in the kernel. Uthread context, on the other hand, runs the actual application logic inside a thread, and yields to vcore context whenever a new scheduling decision needs to be made. It is the job of the scheduler to coordinate the passing from one context to another through the callbacks triggered by the bi-directional API. Parlib provides the mechanisms to easily pass from one context to another, leaving the actual policy of how and when to do so up to the scheduler.

As an example, consider Figure 2.4, which demonstrates the process of switching back and forth between uthread and vcore context via the uthread_yield() and run_uthread() library calls. On the uthread side, a user-level thread voluntarily yields to vcore context, passing a callback to be executed once the yield completes. On the vcore side, this callback is passed a reference to the thread that yielded, and calls run_uthread() to resume that uthread wherever it left off. Parlib takes care of all the logic required to package up the uthread state, swap contexts, and trigger the callback.

Some library calls can only be called from vcore context, while others can only be called from uthread context (and some can be called from either). However, all callbacks are only ever triggered from vcore context. This imposes certain restrictions on the code that can be executed in these callbacks, but it also simplifies the overall design of the bi-directional API. In the following sections we go into greater detail on the actual APIs provided by the vcore and uthread abstractions, and discuss the implications of running code in one context vs. the other.

    /* uthread context */
    function() {
        ...
        uthread_yield(callback);
        ...
    }

    /* vcore context */
    callback(uthread) {
        ...
        run_uthread(uthread);
    }

Figure 2.4: This figure shows the process of passing control from uthread context to vcore context and back. Parlib provides the mechanisms to easily pass from one context to another. Scheduler writers use these mechanisms to implement their desired scheduling policies.

2.2.1 Vcores

The term vcore stands for "virtual core". Despite its name, a vcore is not meant to virtualize access to an underlying physical core. On the contrary, vcores are meant to provide uninterrupted, dedicated access to physical cores. The term "virtual" only refers to the virtual naming scheme used when handing out cores to a process. In this way, process 0 might have vcores 0-3 mapped to physical cores 0-3, while process 1 has vcores 0-3 mapped to physical cores 4-7. As discussed in Section 1.2.1, this notion of a vcore is built directly into the Akaros kernel API. User-space programs request vcores from the system, and once granted, have exclusive access to them until they decide to return them (barring revocation as discussed later on). Contrast this with traditional systems, which attempt to multiplex tasks from different processes onto the same physical cores. In Akaros, the kernel scheduler knows nothing of such tasks, keeping the threading models between the kernel and user-space completely decoupled. All that Akaros knows is that the process has asked for access to a core and it grants this request if possible. The vcore abstraction is simply the user-space counterpart to the Akaros kernel API.

When a process successfully requests a vcore, Akaros finds a free physical core, assigns it a vcore id, and then pops up on it in the address space of the requesting process (using a per-process, vcore-specific stack). It also switches into a TLS region specific to that vcore. The kernel knows where to look for this stack and TLS region because pointers to them exist in some per-vcore data structures shared with the kernel (i.e. procdata). All memory for these data structures is preallocated (by userspace) in one giant chunk the first time a vcore request is made to the system. Once a vcore becomes live, it starts running in vcore context, and is ready to begin executing code on behalf of the process.

On Linux, this notion of a vcore is not supported directly. APIs exist to limit a process to run on a specific set of cores (using Linux containers), as well as to pin threads to individual cores within that set (using CPU affinities), but there is no coordinated way to give a process uninterrupted access to a core or fluidly pass cores back and forth between processes (the notions of cores and threads are too intimately intertwined in Linux). One could imagine modifying the kernel to provide such an abstraction, or running an external process to mediate the flow of cores between processes, but that still wouldn't provide complete isolation, as nothing prevents Linux itself from running a kernel task on one of your cores.

In light of this inherent limitation, we take a simple approach to "fake" the vcore abstraction on Linux. We don't even attempt to provide a system-wide abstraction to coordinate access to cores across processes. Instead, within each process, we launch one pinned Linux task per physical core to act as a proxy for the vcore. We then park each task on a futex and wait for them to be requested by the process. When a vcore request comes in, we simply find a task that is parked and wake it up. Just as in Akaros, the vcore pops up on the core it is pinned to, running on a dedicated stack, with a private TLS region. This works well if there is only one process running on the machine.
However, it breaks down as soon as you have multiple processes since the kernel is not involved in handing out vcores on a system-wide basis (though each process will likely perform no worse than it otherwise would as a normal Linux process). The kernel will simply multiplex corresponding vcores from each process on top of the same physical cores. We leave it as future work to figure out how to improve this situation for Linux.
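The sketch below reconstructs this proxy-task idea under stated assumptions; it is not parlib's actual implementation. One pthread per "vcore" is pinned to a physical core and parked on a futex, and a request simply flips a flag and wakes the corresponding task. All of the names (fake_vcore(), vcore_request_fake(), etc.) are invented for illustration; in the real port, the woken task would enter vcore context rather than print a message.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>

    #define NUM_FAKE_VCORES 4

    /* 0 = parked, 1 = granted */
    static atomic_int vcore_flags[NUM_FAKE_VCORES];

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void *fake_vcore(void *arg)
    {
        int id = (int)(long)arg;

        /* Pin this proxy task to "its" physical core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Park until the process requests this vcore. */
        while (atomic_load(&vcore_flags[id]) == 0)
            futex(&vcore_flags[id], FUTEX_WAIT_PRIVATE, 0);

        /* In the real port, this is where vcore context would begin
         * (i.e. something like vcore_entry() would run). */
        printf("vcore %d online on its pinned core\n", id);
        return NULL;
    }

    /* "Grant" a vcore by unparking its proxy task. */
    static void vcore_request_fake(int id)
    {
        atomic_store(&vcore_flags[id], 1);
        futex(&vcore_flags[id], FUTEX_WAKE_PRIVATE, 1);
    }

    int main(void)
    {
        pthread_t tasks[NUM_FAKE_VCORES];
        for (long i = 0; i < NUM_FAKE_VCORES; i++)
            pthread_create(&tasks[i], NULL, fake_vcore, (void *)i);

        for (int i = 0; i < NUM_FAKE_VCORES; i++)
            vcore_request_fake(i);

        for (int i = 0; i < NUM_FAKE_VCORES; i++)
            pthread_join(tasks[i], NULL);
        return 0;
    }

Compiled with -pthread, this mirrors only the single-process case, which is exactly the limitation described above.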

Vcore Context

Although the vcore abstractions provided by Akaros and Linux differ in terms of true dedicated access to cores, they look identical from the perspective of the code that runs on top of them. This means that any callbacks that get triggered will have the same semantics, independent of the OS they are running on (since callbacks always run in vcore context). In general, vcore context is meant as the user-space counterpart to interrupt context in the kernel. As such, a certain set of rules apply to code running in vcore context.

• Vcore context is non-interruptible. As discussed in Section 2.3, parlib provides an asynchronous event delivery mechanism to pass messages back and forth between vcores. When running in uthread context, it's possible to receive a notification that will interrupt what you are doing and drop you into vcore context in order to process the event. Just as hardware interrupts are disabled when running in interrupt context in the kernel, notifications are disabled when running in vcore context in parlib. However, checks are in place to make sure we never miss the arrival of a new event, possibly returning back to vcore context immediately upon restoring a uthread, if necessary.

• You always enter vcore context at the top of the vcore stack. This makes it so that a full context switch is not necessary to drop into vcore context (since none of its register state needs to be restored). Instead, we simply swap in the vcore's TLS, jump to the top of its stack, and execute a specified start routine (vcore_entry(), by default). Referring back to Figure 2.4, only one full context switch actually takes place in this code. Swapping into and out of vcore context is extremely lightweight.

• Vcore context must never return. From the time you enter vcore context at the top of the stack, to the time you exit it, the only options available are to find a uthread context to switch into or yield the vcore back to the system.

In general, the rules listed above must be kept in mind by scheduler writers when implementing any callbacks triggered through parlib's bi-directional scheduler API. Furthermore, each individual callback may have its own set of semantics that need to be considered. In the following section, we introduce each of the actual API calls and explain their semantics in detail.

The Vcore API

The vcore API gives developers the ability to request and yield vcores, as well as query the system about the state of those vcores. The API is still in flux, but the most important calls, which are unlikely to change, are listed below. Each call is listed in relation to the callback that it ultimately triggers.

    Library Call                    Callback
    vcore_lib_init()                -
    vcore_request(count)            vcore_entry()
    vcore_reenter(entry_func)       entry_func()
    vcore_yield()                   -
    vcore_id()                      -
    max_vcores()                    -
    num_vcores()                    -

vcore_lib_init: This call sets up the internal data structures required by the vcore subsystem, and initializes the system to be able to accept incoming vcore requests. On both Akaros and Linux, vcore_lib_init() is called automatically upon the first vcore request if it has not been called already.

vcore_request: This call puts in a request for a specified number of vcores. As vcores pop up, they each begin executing the body of a globally defined vcore_entry() callback. By default, the vcore_entry() callback is implemented by the uthread abstraction discussed in the following section. In theory, it can be overridden, but it most likely won't be so long as the uthread abstraction is in use.

vcore_reenter: This call jumps back to the top of the current vcore stack and starts running fresh inside the specified callback function. This call should only ever be made from vcore context (or after you have safely saved off your current uthread context), because the state of the calling context is lost. This call is used internally as part of yielding a uthread, as well as in other places within vcore context where it is known that the state of previous stack frames is no longer needed.

vcore_yield: Yield the current vcore back to the system. It's possible that the yield might fail (if it detects that there are pending notifications, for example). When this happens, the vcore is restarted at the top of its stack in vcore_entry(). Once the vcore has truly been yielded, it will only come back online as the result of a subsequent vcore_request().

vcore_id: This simply returns the id of the calling vcore. Its value is stored in the vcore's TLS and remains unchanged throughout the lifetime of the vcore.

num_vcores: This call returns the number of vcores currently allocated to the process. This call is inherently racy, and there is no guarantee that the vcore count does not change between calling this function and making any other vcore API calls (i.e. don't rely on this value to determine an exact count of how many vcores you already have before putting in a request for additional ones).

max_vcores: This call returns the maximum number of vcores that can ever be requested. On SMT systems, this count includes both hyperthreaded cores.
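As a rough illustration of how these calls fit together, the fragment below sketches a runtime bringing half the machine online. It is a sketch only: the prototypes shown are restatements of the calls described above (their exact signatures and headers may differ in the real library), and the overridden vcore_entry() here does nothing useful before yielding.

    /* Assumed to come from parlib's headers in a real build; the exact
     * signatures are approximations based on the descriptions above. */
    void vcore_lib_init(void);
    void vcore_request(int count);
    void vcore_yield(void);
    int  max_vcores(void);

    /* Overriding the default entry point (normally left to the uthread
     * layer, see Section 2.2.2). Runs in vcore context and never returns. */
    void vcore_entry(void)
    {
        /* ... do per-core work in vcore context ... */
        vcore_yield();          /* give the core back when done */
    }

    int main(void)
    {
        vcore_lib_init();                 /* optional: done lazily otherwise */
        vcore_request(max_vcores() / 2);  /* ask for half the machine */
        /* The main context continues; granted cores pop into vcore_entry(). */
        for (;;)
            ;                             /* placeholder for real work */
    }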

When a process first starts up it is not technically running on a vcore. Only after the main thread makes its first call to vcore_request() does it transition to running on top of a vcore. On Akaros, this transition happens via a system call which transforms the process from an SCP to an MCP, and then allocates a vcore for the main thread to resume on. On Linux, the main context is simply packaged up, a vcore is unparked from its futex, and the main context is passed off to it. In both cases, the allocated vcore starts executing the body of the vcore_entry() callback, whose job it is to resume the main context and continue executing the process.

In general, vcore_entry() is the single entry point into vcore context from the perspective of the underlying system. By default, vcore_entry() is implemented by the uthread abstraction. Its full implementation can be seen below:

    void vcore_entry() {
        handle_events();
        if (current_uthread)
            run_current_uthread();
        sched_ops->sched_entry();  // should not return
        vcore_yield();
    }

The general logic flows as follows:

1. Handle any outstanding events that have been posted since the last time we dropped into vcore context.

2. If the current_uthread TLS variable is set, then we must have dropped into vcore context via a notification (as opposed to a voluntary yield). Since we just handled all outstanding events, go ahead and restart the interrupted uthread.

3. If current_uthread is not set, then defer the logic of what to do next to the user-level scheduler via its sched_entry() callback.

4. Finally, yield the vcore back to the system if sched_entry() ever returns (which it should not).

From a scheduler's perspective, sched_ops->sched_entry() is the entry point where an actual scheduling decision needs to be made. As discussed in the following section, all of the other callbacks into the scheduler are made in support of this decision (e.g. initializing a thread, placing a thread on a queue, etc.). The scheduler writer must take care to implement this callback such that they do not violate any of the rules that apply to vcore context.

Caveats

Currently, the vcore API treats all cores as equal and there is no way to specify a request to a specific underlying physical core. Moreover, the current API only allows you to request vcores in an "additive" manner – there is no way to specify an absolute number of vcores that you want for the entire process. The biggest drawback of the additive interface is that there is no way of "canceling" a request once it has been submitted. The system will simply mark down each request and eventually grant you the sum of all of the vcores you have ever asked for (even if you don't need them anymore). With an absolute interface you would be able to increase or decrease your desired number of vcores at any time, giving the system better information about your real-time needs. We are currently looking at ways to make these types of requests more expressive.

As alluded to in Section 2.2.1, it is possible for a vcore to be revoked from a process on Akaros without prior notice. Whenever this happens, the process needs some way of safely recovering the context of whatever was previously running on that vcore. In general, you can ask to be notified (via interrupt), or periodically poll to see if any of your vcores have been taken offline. However, there are many subtle issues related to actually recovering the context that was interrupted and restarting it without any unwanted side effects. On Linux, vcore preemption is a non-issue, because it is impossible for a vcore to be revoked in our "faked" vcore abstraction. All vcores persist throughout the lifetime of the process because there is no underlying system that is capable of revoking them. We refer the reader to Section 6 of Barret Rhoden's dissertation [2] for details on how we deal with this on Akaros.

Finally, the Linux implementation of vcores currently pulls in an undesired dependence on -lpthread, polluting the pthread namespace and preventing us from providing an ABI-compatible (as opposed to API-compatible) user-level pthread library on top of parlib. Conceptually, this dependence should be unnecessary: you can emulate vcores on top of raw Linux tasks created via the clone system call. Doing so, however, has undesired consequences when calling into arbitrary glibc code on those vcores. For example, glibc's low-level-lock implementation relies on accessing a per-thread TLS variable in order to make its library calls thread-safe. The pthread library initializes this TLS variable to a unique value by passing the kernel a reference to it when the clone() system call is made (through the ptid parameter). In theory, we could do something similar, except that the required TLS variable is stored at a specific offset inside of the opaque struct pthread type internal to glibc. Emulating this same behaviour would therefore require exposing the struct pthread type to parlib (which could have unforeseen consequences on different versions of glibc). Furthermore, pulling in -lpthread sets up a bunch of other global state required for glibc to be thread-safe (including setting up a function pointer to pthread_mutex_lock(), so that all necessary locking operations can be carried out). Without pulling in -lpthread, we would have to provide this functionality ourselves.

Akaros doesn't suffer from these problems since it builds a customized glibc with its own low-level-lock implementation (which we currently just implement with spinlocks).
Additionally, Akaros doesn't emulate vcores using kernel-supplied threads: they are actual dedicated physical cores (which is why using a spinlock for the low-level-lock is ok). Even if we were to build our own parlib-aware glibc for Linux (or integrate parlib directly into glibc), we would still require some pthread-like abstraction to emulate our vcores and provide proper thread-based synchronization between them. We would just need to give all the functions in this library a different name to keep from polluting the pthread namespace. We leave this as future work.

2.2.2 Uthreads

Uthreads provide the base "user-level" threading abstraction, which scheduler writers use to multiplex threads on top of vcores. All application logic is run inside a uthread. Figure 2.5 shows the layering of both the vcore and uthread abstractions with respect to a user-level scheduler they both interact with. In general, schedulers make library calls through both abstractions, but callbacks from the vcore abstraction get reflected through the uthread abstraction (unless they are passed as parameters to the actual library call). As we saw with vcore_entry(), this reflection is necessary in order to run uthread-specific code before ultimately triggering the callback to the scheduler (e.g. via sched_ops->sched_entry()).

The uthread abstraction itself consists of three primary components: a set of library calls, a set of callbacks, and a base uthread_t data structure. The library calls serve to provide common functionality that all user-level threading implementations will likely require (e.g. initializing a thread, running a thread, yielding a thread, etc.). The callbacks are triggered whenever a scheduler-specific action needs to be taken on behalf of a uthread. For example, consider Figure 2.4 again. The uthread_yield() call does the heavy work of packaging up the current context and yielding to the vcore. However, it doesn't actually do anything with the context itself.

Figure 2.5: This figure shows the layering of the vcore and uthread abstractions with respect to the bi-directional API they provide to scheduler writers.

Instead, that context is passed to a scheduler-defined callback which is able to decide what to do with it next. It may decide to resume the context immediately (as is the case in the example), or save it off and run a previously saved context in its place.

The uthread_t data structure is used to hold the uthread context as it gets passed around between the different library calls and callbacks. However, the uthread library calls never allocate memory for uthread_t structs themselves. Instead, it is up to the scheduler writer to allocate space for them before passing them as parameters to the various uthread library calls. Leaving this allocation up to the scheduler gives it the opportunity to "extend" the uthread_t struct to contain custom fields that would otherwise not be available. It does so by declaring a new struct type and placing the uthread_t struct as the first field. Using standard C casting operations, pointers to this extended struct can now be passed around and treated as either type. This is most useful in the callbacks, where accessing the custom fields associated with a uthread would otherwise be a cumbersome operation.

Each uthread maintains a pointer to its uthread_t struct in a TLS variable called current_uthread. This gives them easy access to the contents of this struct inside the body of the various uthread library calls. Likewise, a vcore's value of current_uthread is updated whenever a new uthread is started on it. In addition to helping us catch bugs, this allows us to determine which (if any) uthread was interrupted in the case that we drop into vcore context via an external interrupt. We've already seen this usage of current_uthread in the implementation of vcore_entry() discussed in the previous section.
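The sketch below illustrates this first-field embedding trick. The struct uthread stand-in shown here exists only to keep the example self-contained; in a real runtime it would come from parlib's headers, and the custom fields (priority, start_routine, etc.) are purely illustrative.

    #include <stdlib.h>

    /* Stand-in for parlib's struct uthread (context, TLS pointer, internal
     * fields, ...); included here only so the sketch compiles on its own. */
    struct uthread { int internal_state; };

    /* Scheduler-specific thread type: the uthread must be the FIRST field so
     * that a pointer to my_thread can be treated as a struct uthread *. */
    struct my_thread {
        struct uthread uthread;      /* must come first */
        void *(*start_routine)(void *);
        void *arg;
        int   priority;              /* example of a custom scheduler field */
    };

    /* The scheduler allocates the (extended) struct itself; the uthread
     * library never allocates uthread_t memory on its own. */
    static struct my_thread *my_thread_alloc(void *(*fn)(void *), void *arg)
    {
        struct my_thread *t = calloc(1, sizeof(*t));
        if (!t)
            return NULL;
        t->start_routine = fn;
        t->arg = arg;
        /* uthread_init(&t->uthread) would be called here in a real runtime. */
        return t;
    }

    /* Inside a callback that only receives a struct uthread *, the scheduler
     * recovers its custom fields with a cast back to the extended type. */
    static int thread_priority(struct uthread *u)
    {
        return ((struct my_thread *)u)->priority;
    }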

Uthread Callbacks

The uthread abstraction distinguishes between two types of callbacks: call-site specific and externally triggerable. Call-site specific callbacks are explicitly passed to uthread library calls and are triggered in direct response to making those calls. For example, in our user-level pthreads implementation, pthread_yield() and pthread_exit() are both implemented with a call to uthread_yield(), but each passes a different callback to be triggered once the uthread has been yielded. Externally triggerable callbacks, on the other hand, are global callbacks that are triggered in response to external events occurring (e.g. vcores coming online, external components such as mutexes blocking one of your uthreads, etc.). The full set of externally triggerable callbacks are defined in a global schedule_ops data structure, which has the following signature:

    struct schedule_ops {
        void (*sched_entry)();
        void (*thread_blockon_sysc)(uthread, syscall);
        void (*thread_refl_fault)(uthread, num, err, aux);
        void (*thread_paused)(uthread);
        void (*thread_has_blocked)(uthread, flags);
        void (*thread_runnable)(uthread);
    };

Scheduler writers instantiate an instance of this struct with their custom callbacks in place and assign it to a global variable called sched_ops. The uthread library then uses the sched_ops variable to trigger these scheduler-defined callbacks at the appropriate time. The first four callbacks (sched_entry(), thread_blockon_sysc(), thread_refl_fault(), and thread_paused()) are all triggered independent of making an explicit uthread library call. Together, they represent the set of externally-triggerable callbacks which need to be implemented on behalf of the underlying system. They are discussed below. The remaining two callbacks (thread_has_blocked() and thread_runnable()) are only ever triggered as the direct result of making a library call. External components such as mutexes and condition variables, which are not necessarily part of a scheduler (but work on its behalf), trigger these callbacks whenever they block or unblock (i.e. make runnable) a uthread. The thread_paused() callback can also be triggered by a library call in addition to being triggered externally. These calls are discussed in the following section, alongside the library calls that trigger them.

sched_entry: The sched_entry() callback is simply a reflected version of the vcore_entry() callback discussed in the previous section. It is triggered whenever a new scheduling decision needs to be made on a vcore. Its primary purpose is to decide which thread to run next, and run it. If there are no threads to run, it can either spin in a loop waiting for a thread to become available, or yield the vcore back to the system. This function should never return. If a scheduler writer decides to spin in sched_entry(), care must be taken to call handle_events() as part of the loop. Since vcores are non-interruptible, failure to call handle_events() could result in deadlock since some threads may only become available as the result of processing one of these events (e.g. the completion of an asynchronous system call). Details on the handle_events() call are discussed in Section 2.3.

thread_blockon_sysc: As mentioned previously, the threading model provided by parlib relies on the existence of a fully asynchronous system call interface. This callback is triggered whenever the kernel returns control of a core to user-space after an asynchronous system call has been issued. Although each system call is non-blocking from the perspective of the kernel, any user-level threads that issue them will likely need to block in user-space while they wait for them to complete. This callback provides a scheduler-specific way of deciding what to do with these threads while they wait.

thread_refl_fault: This callback is triggered whenever the currently running uthread generates a processor fault (e.g. page fault, divide by 0, etc.). It gives the process the opportunity to patch things up and continue running the faulting thread, kill the faulting thread, or abort the process completely. On Akaros, these faults trap directly into the OS before being reflected back to user space on the vcore they occurred on.
As discussed in the following section, this reflection happens inside the body of run_current_uthread() as called by vcore_entry(). On Linux, we don't yet support reflected faults, but the eventual plan is to first intercept them via the signals that are generated for them, and then reflect them back to the vcore in a manner similar to Akaros.

thread_paused: This callback is triggered whenever something external to the scheduler has caused a uthread to pause execution for some reason (but the uthread does not need to block). It gives the scheduler the opportunity to reschedule the paused thread for execution. On Akaros, this callback is triggered whenever a uthread is interrupted as the result of a preemption event taking a core away. As discussed below, this callback is also triggered by some of our synchronization primitives if they drop into vcore context but no longer desire to actually block the thread for whatever reason. Such functionality is required by the semantics of a futex, for example.
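To tie the pieces together, the sketch below shows how a simple FIFO scheduler might instantiate these callbacks. It is a sketch under assumptions, not parlib's or any of our schedulers' actual code: the run-queue type and its enqueue/dequeue helpers are hypothetical, and only two of the callbacks are filled in.

    /* Hypothetical FIFO run queue protected by a parlib pdr spinlock. */
    static struct uthread_queue runq;          /* e.g. a TAILQ of uthreads */
    static struct spin_pdr_lock runq_lock;

    /* Entry point for every scheduling decision; runs in vcore context and
     * must never return. */
    static void fifo_sched_entry(void)
    {
        while (1) {
            handle_events();                   /* never spin without this */
            spin_pdr_lock(&runq_lock);
            struct uthread *next = runq_dequeue(&runq);   /* hypothetical */
            spin_pdr_unlock(&runq_lock);
            if (next)
                run_uthread(next);             /* does not return */
            /* No work: either keep spinning or vcore_yield() the core. */
        }
    }

    /* Called (indirectly) by uthread_runnable(): enqueue for later. */
    static void fifo_thread_runnable(struct uthread *t)
    {
        spin_pdr_lock(&runq_lock);
        runq_enqueue(&runq, t);                /* hypothetical */
        spin_pdr_unlock(&runq_lock);
    }

    struct schedule_ops fifo_sched_ops = {
        .sched_entry     = fifo_sched_entry,
        .thread_runnable = fifo_thread_runnable,
        /* thread_blockon_sysc, thread_refl_fault, etc. omitted here */
    };
    struct schedule_ops *sched_ops = &fifo_sched_ops;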

Uthread Library Calls

In the previous section, we presented the set of scheduler callbacks triggered by the underlying system. Scheduler writers must implement these callbacks to do something meaningful for the scheduling policy they are trying to enforce. In this section, we present the set of uthread library calls which most schedulers will likely need to call at some point. These library calls are typically made inside a scheduler-defined API call (e.g. pthread_create(), pthread_yield(), etc.), or within one of the callbacks triggered on the scheduler. Some of the library calls trigger callbacks themselves, which, in turn, make additional uthread library calls. The table below lists each of these library calls in relation to the scheduler callbacks they ultimately trigger.

    Library Call                            Callback
    uthread_lib_init(main_thread)           -
    uthread_init(uthread)                   -
    uthread_cleanup(uthread)                -
    run_uthread(uthread)                    -
    run_current_uthread()                   -
    uthread_yield(yield_func, yield_arg)    yield_func(yield_arg), sched_entry()
    uthread_has_blocked(uthread, flags)     thread_has_blocked(uthread, flags)
    uthread_paused(uthread, flags)          thread_paused(uthread, flags)
    uthread_runnable(uthread)               thread_runnable(uthread)

uthread_lib_init: This library call initializes the uthread abstraction itself. It is called by the main thread of the process and does the work of packaging up the main context, putting in the initial request for a vcore, and preparing the main context to be resumed on that vcore once it pops up. It takes a reference to a pre-allocated uthread_t as a parameter, so that it has a place to store the main context during this transition.

uthread_init: This library call initializes a new uthread for use in the system. It is typically called as part of a scheduler-specific thread creation call (e.g. pthread_create()). It takes a reference to a pre-allocated uthread_t as a parameter and takes care of allocating and initializing TLS for the uthread and initializing any internal fields used by the uthread library. The scheduler allocates space for (an "extended" version of) the uthread_t struct and passes it to uthread_init() before initializing any custom fields itself. Once this call returns, the uthread is ready to be executed via a call to run_uthread(), as explained below.

uthread_cleanup: This library call cleans up a previously initialized uthread. It should only be called after the uthread has completed and has no more work to do. Once this call returns, it is safe to free any memory associated with the uthread.

run_uthread: This library call resumes execution of a previously initialized uthread. It takes a reference to this uthread as a parameter, and simply restores its context on the current vcore.

run_current_uthread: This library call performs the same task as run_uthread(), except that it restores the most recently running uthread in cases where we drop into vcore context via an interrupt of some sort. We special-case this call because interrupted contexts are saved in a special location that this call knows to look in. Additionally, since we only ever call this function when we detect that we've been interrupted, we overload it to trigger the thread_refl_fault() callback and restart the core in cases where this is necessary. Faults themselves are detected by inspecting the trapframe of the saved context to see if one has occurred. Fault detection is currently only supported on Akaros.

uthread_yield: This library call pauses execution of the current uthread and transitions the core into vcore context so a new scheduling decision can be made. It first triggers the yield_func() callback passed to it as an argument before directly calling vcore_entry() to (essentially) restart the core from scratch.

From here, sched_entry() will be triggered, and a new scheduling decision will be made. This call is typically made directly by a scheduler-specific yield function (e.g. pthread_yield()), or by a scheduler-aware synchronization primitive (e.g. the blocking mutexes, condition variables, etc. discussed in Section 2.2.3). The yield_func() callback receives a reference to the uthread that yielded, as well as a custom argument passed via the yield_arg parameter to uthread_yield(). We pass yield_func() as an argument to uthread_yield() (rather than installing a global callback in the sched_ops struct) because yielding can occur for many different reasons, and the callback that should be run is often call-site specific. For example, a blocking mutex will want to do something very different with the yielding thread than a simple pthread_yield() will.

uthread_has_blocked: This library call informs the uthread subsystem that an external component has blocked a uthread from further execution. When called, it triggers the thread_has_blocked() callback in the global sched_ops struct in order to notify the scheduler that this blocking has occurred. A subsequent call to uthread_runnable() is necessary to make this thread schedulable again. As we will see in Section 2.2.3, this call is used heavily by the various user-level synchronization primitives provided by parlib.

uthread_paused: This library call is similar to uthread_has_blocked(), except that it notifies the uthread subsystem that an external component has briefly paused the uthread, but would like for it to remain schedulable. When called, it triggers the thread_paused() callback in the global sched_ops struct. However, no subsequent call to uthread_runnable() is needed, and the scheduler is free to schedule the uthread again as it sees fit. As discussed in Section 2.2.3, this call is currently used in the implementation of some of our synchronization primitives, such as futexes.

uthread_runnable: As mentioned previously, this library call is meant to inform the scheduler that a uthread is now schedulable. It is typically only called either immediately after a uthread has first been initialized, or after a previous call to uthread_has_blocked(). It triggers the thread_runnable() callback in the global sched_ops struct. This callback is typically implemented to put the uthread on some sort of queue, so that sched_entry() can find it and resume it when it makes its next scheduling decision.
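Putting a few of these calls together, the sketch below shows how a scheduler-level create call (a simplified analogue of pthread_create()) might be built. It assumes the extended my_thread struct from earlier and parlib's uthread declarations; how the thread's start routine gets wired into its initial context is elided, and error handling is minimal.

    /* Sketch only: uthread_init()/uthread_runnable() are assumed to come from
     * parlib's headers; struct my_thread is the extended type shown earlier. */
    int my_thread_create(struct my_thread **out,
                         void *(*fn)(void *), void *arg)
    {
        struct my_thread *t = calloc(1, sizeof(*t));
        if (!t)
            return -1;
        t->start_routine = fn;
        t->arg = arg;

        /* Set up TLS and the internal uthread fields. In a real runtime the
         * uthread's initial context would also be pointed at a trampoline
         * that eventually calls fn(arg); that detail is elided here. */
        uthread_init(&t->uthread);

        /* Hand the thread to the scheduler: this triggers thread_runnable(),
         * which typically enqueues it so sched_entry() can pick it up later. */
        uthread_runnable(&t->uthread);

        *out = t;
        return 0;
    }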

Caveats

The uthread API presented above is identical on both Akaros and Linux. However, its underlying implementation differs in one very important aspect: every Linux uthread has a heavy-weight pthread backing it up. At the time uthread_init() is called, the uthread subsystem makes an internal call to pthread_create() and associates the newly created pthread one-to-one with the uthread being initialized. However, this pthread is not actually used to run the core logic of the uthread. Instead, it serves two primary purposes:

• To set up a thread-local storage (TLS) region for use by the uthread

• To help emulate asynchronous system calls by servicing blocking system calls on behalf of the uthread it backs.

Originally, we were using glibc's undocumented _dl_allocate_tls() to allocate TLS for each uthread (which is what we still do on Akaros). However, just as with our emulated vcores, uthreads suffer the same drawbacks discussed in Section 2.2.1 related to calling glibc functions in a thread-safe manner. By creating a full-blown pthread and then parking it on a futex, uthreads are able to "steal" their backing pthread's properly initialized TLS and use it as their own. Furthermore, some external mechanism (i.e. not a uthread running on a vcore) is required to emulate servicing asynchronous system calls on behalf of a uthread. We overload the backing pthread to provide this functionality as well. More information on the exact mechanism used to achieve this is discussed in Section 2.4.1.

Another caveat worth pointing out is the lack of a callback for signaling syscall completion through the uthread API itself. The API provides the thread_blockon_sysc() callback to notify a scheduler when a system call has been issued and a thread should be blocked. However, no callback exists to actually tell us when that system call has completed and the thread can be resumed. This functionality obviously exists; we just don't currently expose it through the uthread API as a callback. Instead, we rely on the event delivery mechanism discussed in Section 2.3 to notify us of syscall completion directly. However, it could (and arguably should) be wrapped in a scheduler callback to keep it symmetric with the thread_blockon_sysc() callback. We leave this as future work.

2.2.3 Synchronization Primitives

Any reasonable threading library will require some set of synchronization primitives in order to synchronize access to critical sections of code. For basic synchronization, simple atomic memory operations will often suffice. Beyond that, spinlocks and higher-level synchronization constructs such as mutexes, barriers, and condition variables are often necessary.

Atomic Memory Operations

At the lowest level, parlib provides a set of standard atomic memory operations for synchronizing access to individual words in memory. Most of these operations are just wrappers around gcc's __sync* operations (e.g. atomic_read(), atomic_add(), atomic_cas()), but others provide a bit more functionality (e.g. atomic_add_not_zero()). These atomic memory operations provide the foundation on top of which all other synchronization primitives are built. They are also useful in their own right for implementing constructs such as reference counters, getting and setting flags atomically, etc. Both Linux and Akaros provide the same set of primitives.
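As a small, self-contained example of the kind of construct these operations enable, the sketch below builds a reference counter directly on the gcc __sync builtins that parlib's wrappers are described as wrapping. The function names are illustrative, not parlib's own.

    #include <stdio.h>

    typedef struct { volatile long count; } refcnt_t;

    static void refcnt_init(refcnt_t *r, long initial)
    {
        r->count = initial;
    }

    static void refcnt_get(refcnt_t *r)
    {
        __sync_fetch_and_add(&r->count, 1);
    }

    /* Returns 1 if this put dropped the last reference. */
    static int refcnt_put(refcnt_t *r)
    {
        return __sync_sub_and_fetch(&r->count, 1) == 0;
    }

    /* In the spirit of atomic_add_not_zero(): only take a reference if the
     * object is still live (count != 0). Returns 1 on success. */
    static int refcnt_get_not_zero(refcnt_t *r)
    {
        long old;
        do {
            old = r->count;
            if (old == 0)
                return 0;
        } while (!__sync_bool_compare_and_swap(&r->count, old, old + 1));
        return 1;
    }

    int main(void)
    {
        refcnt_t r;
        refcnt_init(&r, 1);
        refcnt_get(&r);
        printf("dropped last ref? %d\n", refcnt_put(&r));      /* 0 */
        printf("dropped last ref? %d\n", refcnt_put(&r));      /* 1 */
        printf("revive dead object? %d\n", refcnt_get_not_zero(&r));  /* 0 */
        return 0;
    }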

Spinlocks, MCS Locks, and PDR Locks

Beyond simple atomic memory operations, parlib also provides spinlocks to protect access to short critical sections of code. In total, parlib provides four different variants of spinlocks: basic spinlocks, mcs locks, and a new type of lock called a pdr lock, for which there are both spinlock and mcs lock variants. At a fundamental level, these operations all rely on the existence of an atomic test_and_set() operation in the underlying hardware.

Basic spinlocks are the simplest. A single variable is used to represent a lock, and contenders of this lock spin on this variable while waiting to acquire it. Mcs locks are similar, except that each contender spins on a local copy of the lock to reduce cache coherency traffic under high contention. Contenders themselves are maintained in a linked list, and control of the lock is passed from one contender to the next in the order in which they arrived at the lock. In general, spinlocks and mcs locks are useful for protecting access to the low-level data structures used when building higher-level synchronization primitives for uthread context. Moreover, they are essential to protect all critical sections of code when executing in vcore context. Recall that vcore context is the user-space counterpart to interrupt context in the kernel and therefore can't rely on yield-based synchronization constructs for protection. There is no other context for the vcore to yield to.

However, spinlocks alone are not sufficient protection in an environment where vcores can be preempted for arbitrary amounts of time (as is the case with Akaros). If the vcore holding a lock is preempted, all other vcores will spin indefinitely, waiting for the preempted vcore to come back online and release the lock (which may never happen). To solve this problem, Akaros provides a new type of lock called a "pdr" lock, which stands for "Preemption Detection and Recovery". These locks are, in a sense, vcore "aware" and can detect when a vcore holding a lock goes offline and recover from it.

Pdr locks operate by atomically acquiring a lock and grabbing a reference to the lock holder at the same time. Contenders of the lock use this reference to continuously check who the lock holder is as part of their busy-waiting loop. Using a kernel-provided flag (VC_PREEMPTED), contenders are able to see if the lock holder is still online or not. If it is, they just continue spinning. If it is not, they attempt to "become" that vcore by issuing a special sys_change_vcore() syscall. This call essentially takes the calling vcore offline and restores the preempted vcore in its place. This allows the restored vcore to continue running and eventually release the lock. At the same time, the vcore issuing the sys_change_vcore() is marked as preempted, and its VC_PREEMPTED flag is set accordingly. This step is essential, because that vcore may hold other locks that need to be recovered in the same way. Eventually, all preempted vcores that hold locks will be allowed to run, and the system will reach steady state again. The details of these locks and the issues around solving all their potential races can be found in Chapter 6.2 of Barret Rhoden's dissertation [2].

As mentioned previously (and discussed in detail in Section 2.3), parlib provides the ability to asynchronously "notify" a vcore and cause it to drop immediately into vcore context (if it isn't already). By default, notifications are enabled while running in uthread context,
meaning that a uthread could be forced offline at any time. As such, leaving notifications enabled could prove problematic when holding locks to protect critical sections of code when running in uthread context. To solve this problem, pdr locks are overloaded to provide irq_save-like features when called from uthread context. This functionality is accomplished by disabling notifications before attempting to acquire the lock, and reenabling notifications after releasing the lock.

As discussed in Section 2.3, a uthread_disable_notifs() function exists to both disable notifications and ensure that a uthread is pinned to its vcore the entire time that notifications are disabled (which is essential for the pdr locks to work). Additionally, both uthread_disable_notifs() and its notification-enabling counterpart maintain a "disable depth" so that nested locks can be acquired without re-enabling notifications or unpinning a uthread from its vcore prematurely.

On Linux, none of this "pdr" functionality is necessary because vcores can never be preempted by the underlying system. However, the irq_save()-like features are essential on both systems. Because of this, we adopt the same API as Akaros even though the locks have no actual "pdr" capabilities under the hood. In terms of correctness, pdr locks must always be used when acquiring a lock in vcore context, but are optional in uthread context so long as you don't need notifications disabled and you are careful not to nest them.
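As a brief, hedged illustration of the usage pattern (assuming parlib's spin_pdr_lock type and calls, whose exact declarations may differ), a short critical section entered from uthread context looks like this:

    /* Assumed to come from parlib's headers. */
    static struct spin_pdr_lock counter_lock;
    static long shared_counter;

    static void bump_counter(void)
    {
        /* From uthread context, acquiring a pdr lock also disables
         * notifications and pins the uthread to its vcore for the duration
         * of the hold; on Akaros it additionally recovers from a preempted
         * lock holder. */
        spin_pdr_lock(&counter_lock);
        shared_counter++;
        spin_pdr_unlock(&counter_lock);
        /* Notifications are re-enabled (and the pin released) on unlock. */
    }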

Higher-Level Synchronization Primitives

In addition to the low-level atomic memory operations and spinlocks discussed above, parlib also provides higher-level synchronization primitives for use within a uthread. Specifically, it provides mutex, futex, barrier, semaphore, and condition variable primitives. These primitives are useless on their own, and must be integrated with a user-level scheduler to provide any meaningful functionality. In contrast to spinlocks, these higher-level constructs attempt to "block" or "yield" a uthread instead of simply spinning in a busy loop waiting to acquire a lock. However, under the hood they all use spinlocks to protect access to their underlying data structures. Although the exact logic of each primitive is different, they all follow a similar pattern when blocking a uthread:

1. Attempt to acquire a lock based on some condition

2. If successful, continue

3. Otherwise, yield to vcore context, place the yielded context on a block queue, and inform the user-level scheduler that the uthread has been blocked.

As an example, consider the code for locking a mutex below (argument checking and the logic for recursive mutexes are removed for brevity):

    int uthread_mutex_lock(uthread_mutex_t *mutex)
    {
        spin_pdr_lock(&mutex->lock);
        while (mutex->locked) {
            uthread_yield(block_callback, mutex);
            spin_pdr_lock(&mutex->lock);
        }
        mutex->locked = 1;
        spin_pdr_unlock(&mutex->lock);
        return 0;
    }

    void block_callback(struct uthread *uthread, void *arg)
    {
        uthread_mutex_t *mutex = (uthread_mutex_t *)arg;

        uthread_has_blocked(uthread, UTH_BLK_MUTEX);
        STAILQ_INSERT_TAIL(&mutex->queue, uthread, next);
        spin_pdr_unlock(&mutex->lock);
    }

This code first acquires a spinlock to protect access to the higher-level mutex lock (mutex->lock) and its block queue (mutex->queue). It then checks this lock to see if there is a current lock holder or not. If there is, it yields the current uthread via a call to uthread_yield(), passing block_callback() and the mutex as arguments. If there is not, it becomes the lock holder by setting the mutex lock, unlocking the spinlock, and returning. In the case where it yields, the block_callback() first informs the scheduler that the uthread has blocked by calling uthread_has_blocked() and then places the yielded context on a block queue. Finally, it releases the spinlock. Under the hood, the call to uthread_has_blocked() triggers the global thread_has_blocked() callback. The scheduler only uses the thread_has_blocked() callback for bookkeeping and otherwise forgets about the uthread until it is unblocked at some point in the future. As such, the block queue itself is local to the mutex instance as opposed to some queue maintained by the scheduler.

It's also worth pointing out the usage of the spinlock here. The spinlock is acquired in the body of uthread_mutex_lock(), but released in the body of block_callback(). This technique is used to ensure that the mutex lock is accessed atomically with placing the uthread on the block queue. This pattern is used pervasively. Additionally, returning from uthread_yield() does not necessarily guarantee that you become the lock holder. Instead, it follows Mesa-like monitor semantics, forcing you to try and obtain the lock again [37]. As such, the call immediately following the return from uthread_yield() grabs the spinlock and rechecks the value of the mutex lock in a loop. Only once it actually is able to acquire the lock does it return from the call; otherwise it just yields again.

Although not present in this example, there may be times when a synchronization primitive decides to yield, but never actually blocks the context because the condition it used to decide to block has now become false (as is done in the futex primitive). In these cases, uthread_paused() is called instead of uthread_has_blocked(), allowing the scheduler to re-enqueue the uthread for execution instead of forgetting about it.
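A hedged sketch of that futex-style case is shown below. This is not parlib's futex code: the wait-queue type, its helper, and the flag values are hypothetical, and only the yield callback is shown. The point is the branch between uthread_paused() and uthread_has_blocked(), depending on whether the watched value has already changed.

    struct fut_waiter_arg {
        struct fut_queue *q;    /* hypothetical per-address wait queue */
        int *uaddr;             /* address the caller is waiting on */
        int expected;           /* value observed before deciding to wait */
    };

    static void futex_block_callback(struct uthread *uthread, void *arg)
    {
        struct fut_waiter_arg *w = arg;

        spin_pdr_lock(&w->q->lock);
        if (*w->uaddr != w->expected) {
            /* Condition already changed: don't block; thread_paused() is
             * triggered and the scheduler re-enqueues the thread. */
            spin_pdr_unlock(&w->q->lock);
            uthread_paused(uthread, 0 /* flags; actual value assumed */);
            return;
        }
        /* Still need to wait: notify the scheduler and park the thread. */
        uthread_has_blocked(uthread, 0 /* flags; actual value assumed */);
        fut_queue_insert(w->q, uthread);    /* hypothetical helper */
        spin_pdr_unlock(&w->q->lock);
    }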

When unblocking, a reverse pattern is followed by all synchronization primitives:

1. Unlock the lock

2. Find a uthread in the block queue

3. Inform the scheduler that it should wake it up

Again, consider the example below; this time for unlocking a mutex:

    int uthread_mutex_unlock(uthread_mutex_t *mutex)
    {
        spin_pdr_lock(&mutex->lock);
        mutex->locked = 0;
        upthread_t upthread = STAILQ_FIRST(&mutex->queue);
        if (upthread)
            STAILQ_REMOVE_HEAD(&mutex->queue, next);
        spin_pdr_unlock(&mutex->lock);

        if (upthread != NULL) {
            uthread_runnable((struct uthread *)upthread);
        }
        return 0;
    }

As before, this code first acquires a spinlock to protect access to the higher-level mutex lock and its block queue. It then releases the mutex lock and attempts to find a uthread in its block queue. If there is one, it removes a single uthread from the queue, releases the spinlock, and then unblocks that uthread via a call to uthread_runnable(). Internally, this triggers the thread_runnable() callback on the scheduler, allowing it to reschedule the uthread for execution (with emphasis on being rescheduled rather than switched-to, as consistent with Mesa-like monitor semantics). If no uthread was found, it simply releases the spinlock and returns.

As mentioned previously, all of the high-level synchronization primitives provided by parlib follow a similar pattern for blocking and unblocking a uthread. The details may be different, and the conditions for blocking may be different, but they all use a similar set of mechanisms to achieve the same goal. They all yield via the uthread_yield() call, inform the scheduler about the yield via either uthread_has_blocked() or uthread_paused(), and they all resume a uthread via uthread_runnable(). The details of these remaining primitives are omitted for brevity.

As an aside, a user-level scheduler may wish to provide the ability to kill a uthread at an arbitrary point in time (including while it is blocked inside one of these synchronization primitives). As presented, this would cause problems for mutexes because only a single uthread is woken up by a uthread_mutex_unlock() operation, and the uthread chosen might be one that is already dead (meaning it will not be able to actually grab the lock and unlock it again so another thread can be woken up). A simple fix for this would be to change the implementation of uthread_mutex_unlock() to support a Mesa-like broadcast() operation, since uthread_mutex_block() is already set up to handle Mesa-like blocking semantics. However, this operation could prove prohibitively expensive if there are many uthreads blocked on the mutex. A better solution would be to disallow threads from being killed except at specific synchronization points (e.g. following semantics similar to that of pthread_cancel()). Such semantics would need to be enforced inside the user-level scheduler itself rather than inside any of the individual synchronization primitives, e.g. by only killing a uthread at its next scheduling boundary. We leave it as future work to explore the various possibilities available here.

2.3 Asynchronous Events

In addition to the thread support library presented in the previous section, another key abstraction provided by parlib is its asynchronous event delivery mechanism. This abstraction provides a standardized way for the underlying system to communicate asynchronously with vcores. Moreover, vcores themselves can use this mechanism to communicate asynchronously with each other. In practice, we use this mechanism to signal system call completions, signal alarm events, and direct POSIX signals to individual uthreads, among other uses.

At a high level, this mechanism follows a basic producer/consumer model, whereby event messages are posted into queues that vcores asynchronously consume and process. Each queue can be configured such that IPI-like notifications will be sent to a vcore whenever a producer posts a message to it. The event messages themselves are modeled after active messages [38], allowing you to send payloads with each message and execute an arbitrary handler when that message arrives. Using active messages also keeps the logic of the consumers relatively straightforward (the entirety of the consumer logic is contained within a single handle_events() call). Producers can either be a well-defined service provided by the underlying system (e.g. asynchronous syscalls, alarms, etc.), or any arbitrary vcore that wishes to send another vcore a message. Each asynchronous service is statically assigned a unique event type, which it uses to tag event messages that it sends to an event queue (e.g. EV_SYSCALL, EV_ALARM, etc.). Processes use this event type to register unique handlers that execute when messages of that type are consumed by a vcore. Vcores consume messages by running a special handle_events() function, which knows how to cycle through all event queues currently associated with a vcore, pull messages off of them, and execute their registered event handlers. Vcore-to-vcore messages have no assigned event types, and can use any free message type they wish (or no message type at all) when passing messages back and forth. In the future, we will likely add some notion of pluggable services that can be assigned a unique event type at runtime.
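To make the consumer side concrete, the hedged sketch below shows the general shape of registering a handler for one event type and the kind of handler it runs. Apart from handle_events(), EV_ALARM, and vcore_entry(), which are named in the text, everything here (the message layout and the register_ev_handler() call in particular) is an assumed placeholder rather than parlib's exact API.

    /* Illustrative message layout; the real one differs. */
    struct event_msg {
        unsigned int ev_type;   /* e.g. EV_SYSCALL, EV_ALARM */
        void *payload;
    };

    /* Handlers run in vcore context from within handle_events(): keep them
     * short, never block, and never resume an interrupted uthread from
     * here (vcore_entry() takes care of that). */
    static void handle_alarm(struct event_msg *msg)
    {
        /* ... wake whichever uthread armed this alarm ... */
    }

    static void setup_alarm_events(void)
    {
        /* register_ev_handler() is a placeholder name for the per-event-type
         * registration call described above. */
        register_ev_handler(EV_ALARM, handle_alarm);
    }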

One of the key features provided by this mechanism is the ability to send IPI-like notifications to a vcore to inform it of an event that has just been posted. As discussed in Section 2.2.1, these notifications cause a vcore to drop into vcore context and start executing at the top of its vcore_entry() function (assuming it wasn't already in vcore context). This function, in turn, immediately calls handle_events(), triggering all outstanding events for the vcore to be processed. Once all events have been processed, the vcore is free to resume the uthread that was interrupted (if there is one), or call out to the user-level scheduler to decide what to do next. For convenience, the code for vcore_entry() is reproduced here for reference.

void vcore_entry()
{
    handle_events();
    if (current_uthread)
        run_current_uthread();
    sched_ops->sched_entry();
    vcore_yield();
}

2.3.1 Akaros vs. Linux

The event delivery mechanism described above was born directly out of the event delivery subsystem we built for Akaros. Because of this, most asynchronous services currently built for Akaros originate in the kernel and post messages to queues set up for them in userspace. For example, syscall completion events and alarm events are both sent by the kernel. The event delivery subsystem itself provides a rich set of features that allow a user to control basically every aspect of event delivery you could imagine. For example, event queues (i.e. mailboxes in Akaros terms) can be set up as message endpoints on either a global basis, a per-vcore basis, a per-event basis, or any combination thereof. Moreover, each queue can be configured with a set of flags that indicate if-and-how to notify a set of vcores when a producer posts messages to that queue. For example, a user-level scheduler may choose to set a global queue as the endpoint for all syscall completion events, and then indicate that all messages posted to that queue should send notifications to vcore 0 for processing. Alternatively, queues could be set up on a per-vcore basis, and notifications could be directed back to whatever vcore the syscall was originally issued from.

Because vcores can appear and disappear arbitrarily in an Akaros MCP, a majority of the features provided by Akaros's event delivery subsystem exist solely to ensure that messages never get lost in the system. Moreover, because this subsystem allows for direct communication between the kernel and userspace, a special type of queue called a UCQ is used to ensure the safety of all communication across the user/kernel boundary. Full details on this subsystem can be found in Chapter 5 of Barret Rhoden's dissertation [2].

On Linux, we don't support all of these features provided by Akaros. In fact, only a single configuration exists for all available services. One queue per vcore is set up when a process is first started, and those queues are the only endpoint for all event messages from all services throughout the lifetime of a process. Moreover, all services that send event messages originate in userspace, which allows us to implement these event queues as simple linked lists protected by a lock. A service simply chooses which vcore it would like to send a message to and sends it. All messages are followed by a notification, ensuring that the vcores they are destined for will eventually wake up and process those events. The reason for this limited setup is threefold:

1. Linux vcores can never disappear, making it impossible for messages to get lost in the system due to vcores going offline. Even if a vcore is currently asleep, notifications will ensure that it wakes up to process an event that has been sent to it. Any of the features present in Akaros to deal with these issues are unnecessary on Linux.

2. Akaros's event delivery subsystem is extremely complex, and it's unclear what (if any) benefit would be gained from porting all of its features to Linux. Moreover, whatever features we did port would only be emulated in userspace, limiting their effectiveness.

3. It keeps the interface simple. In practice, this simple configuration seems to work well for all of the use cases we've encountered on Linux so far. We can add more features to it later if needed.
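Concretely, the event queues used on Linux reduce to something like the sketch below: one lock-protected linked list per vcore, with send_event() appending a message and then notifying the target vcore. The struct layout, the MAX_VCORES constant, and the notify_vcore() helper are illustrative assumptions, not parlib's exact code.

struct event_msg_node {
    struct event_msg msg;               /* active-message payload */
    unsigned int ev_type;               /* e.g. EV_SYSCALL, EV_ALARM */
    struct event_msg_node *next;
};

struct event_queue {
    spinlock_t lock;                    /* a simple lock; producers all run in userspace */
    struct event_msg_node *head, *tail;
};

static struct event_queue evqs[MAX_VCORES];   /* one queue per vcore */

void send_event(struct event_msg *msg, unsigned int ev_type, int vcoreid)
{
    struct event_msg_node *n = malloc(sizeof(*n));
    n->msg = *msg;
    n->ev_type = ev_type;
    n->next = NULL;

    spinlock_lock(&evqs[vcoreid].lock);
    if (evqs[vcoreid].tail)
        evqs[vcoreid].tail->next = n;
    else
        evqs[vcoreid].head = n;
    evqs[vcoreid].tail = n;
    spinlock_unlock(&evqs[vcoreid].lock);

    notify_vcore(vcoreid);   /* SIGUSR1-based notification, see Section 2.3.3 */
}

In this setup, handle_events() simply pops messages off the calling vcore's list and dispatches each one through ev_handlers[ev_type].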

2.3.2 The Event Delivery API

The event delivery API is designed to allow user-level schedulers to gain access to all of the event delivery features described above. In general, these features include setting up queues for event delivery, registering handlers for different event types, enabling / disabling notifications, etc. However, this API differs significantly between Akaros and Linux. On Akaros, only a small shim layer exists, which exposes userspace to the full set of features provided by the underlying system. In total, this API consists of over 20 different library calls, which control everything from enabling events on a particular event type, to registering event queues for that event type, to extracting a single message from an event queue and processing it. Applications (or more appropriately, user-level schedulers) use this API to customize the way in which they receive events on a service-by-service basis. In contrast, the full Linux event delivery API can be seen below:

Global Variables
ev_handlers[MAX_NR_EVENT];

Library Calls
event_lib_init()
send_event(ev_msg, ev_type, vcoreid)
handle_events()
enable_notifs(vcoreid)
disable_notifs(vcoreid)

On Linux, the only configuration necessary to start receiving events for a particular event type is to register a handler for it in the global ev_handlers array. For example, user-level schedulers typically register a handler with the EV_SYSCALL event type in order to process syscall completion events sent by the underlying system, i.e.:

ev_handlers[EV_SYSCALL] = handle_syscall;

Once a handler is registered, events can be sent to a specific vcore by calling send_event() with the proper set of parameters. In practice, this call is never made by a user-level scheduler itself, but rather by the various services registered with each event type. When calling this function, a service uses the event_msg parameter to pass a payload specific to the event type being posted. The handler registered for each event type knows how to parse this payload and process it appropriately. Continuing with the example above, system call completion events are sent as follows:

send_event(event_msg, EV_SYSCALL, vcoreid);

In this case, the event_msg payload will contain a pointer to the syscall that has completed (which in turn contains a pointer to the uthread that has blocked on it). The handler registered for this event type (i.e. handle_syscall()) can then use this syscall pointer to unblock the uthread waiting for the syscall to complete (or do whatever it wants).
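As an illustration, an EV_SYSCALL handler registered this way might look roughly like the following. The handler prototype and the field names inside the event message and syscall struct are assumptions based on the description above, not parlib's exact definitions.

static void handle_syscall(struct event_msg *ev_msg, unsigned int ev_type)
{
    /* The payload carries a pointer to the completed syscall struct... */
    struct syscall *sysc = ev_msg->sysc;
    /* ...which in turn points at the uthread that "blocked" on it. */
    struct uthread *uthread = sysc->blocked_uthread;

    /* Hand the uthread back to the user-level scheduler. */
    uthread_runnable(uthread);
}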

Contrast this with the code necessary to achieve something similar on Akaros.

enable_kevent(EV_USER_IPI, 0, EVENT_IPI | EVENT_VCORE_PRIVATE);
register_ev_handler(EV_SYSCALL, handle_syscall, 0);
sysc_mgmt = malloc(sizeof(struct sysc_mgmt) * max_vcores());
mmap_block = (uintptr_t)mmap(0, PGSIZE * 2 * max_vcores(),
                             PROT_WRITE | PROT_READ,
                             MAP_POPULATE | MAP_ANONYMOUS, -1, 0);
for (int i = 0; i < max_vcores(); i++) {
    sysc_mgmt[i].ev_q = get_big_event_q_raw();
    sysc_mgmt[i].ev_q->ev_flags = EVENT_IPI | EVENT_INDIR | EVENT_FALLBACK;
    sysc_mgmt[i].ev_q->ev_vcore = i;
    ucq_init_raw(&sysc_mgmt[i].ev_q->ev_mbox->ev_msgs,
                 mmap_block + (2 * i) * PGSIZE,
                 mmap_block + (2 * i + 1) * PGSIZE);
}

As with Linux, a method is called to register a handler for syscall completion events, i.e.

register_ev_handler(EV_SYSCALL, handle_syscall, 0);

However, much more work goes into setting up the queues to receive these events. Recall that Akaros directs all event messages at queues (not vcores), because more than one queue might be associated with a vcore, and each service sets up its queues independently with different configuration settings, etc. The code above allocates space for these event queues, associates each with a specific vcore, and configures them such that message delivery will occur in a manner similar to Linux (i.e. one queue per core, notifications enabled, etc.). However, it does nothing to associate the actual syscall completion events with those queues. As explained in Section 2.4.1, this happens on a syscall-by-syscall basis at the time the syscall itself is issued (to avoid sending an event if not actually necessary). In general, so long as the kernel knows which queue to send an event to, it sends it via its own internal send_event() call, i.e.:

send_event(proc, event_q, event_msg, vcoreid);

As with Linux, the event_msg parameter is used to pass a payload specific to the event type being sent. However, this event type is now embedded in the event_msg itself, as opposed to being passed as its own unique parameter as is done on Linux. Additionally, since messages are directed at queues instead of vcores, the vcoreid parameter is only used as a fallback for deciding which vcore to notify in cases where notifications are enabled, but no vcore is explicitly associated with the event queue. Once this event arrives at its destination, it is processed by its registered handler (i.e. handle_syscall()) in a manner similar to how it is done on Linux.

In addition to setting up queues on a service-by-service basis as shown above, Akaros also provides two per-vcore event queues that services can use by default. One queue is for private messages, which will only ever be processed by the vcore they were sent to (useful for uthreads to manually drop themselves into vcore context to process events immediately), and the other is for public messages, which can be stolen and processed by other vcores if necessary (i.e. because a vcore has been preempted and taken offline). These queues are stored in the procdata region of an MCP and are initialized at process startup. The most notable use of these queues is by Akaros's mechanism for sending vcore-to-vcore messages via calls to sys_self_notify(), i.e.:

sys_self_notify(vcoreid, ev_type, event_msg, priv)

This call says to send an event_msg with a specific ev_type to vcoreid, using its private event queue if priv is true and its public event queue if priv is false. This call is also useful for "re-notifying" a vcore when it misses a notification event, as discussed below.

2.3.3 Event Notifications

Event notifications are parlib's way of delivering software interrupts to userspace. They operate on the principle that vcores in userspace are analogous to physical cores in the kernel. Hardware interrupts force physical cores to bundle up whatever context they are running and drop into IRQ context to run a predefined IRQ handler from an IRQ table. Likewise, event notifications force vcores to bundle up whatever context they are running and drop into vcore context to handle any outstanding events. They are the userspace counterpart to hardware interrupts in the kernel.

Just as interrupts are disabled when running interrupt handlers in the kernel, so are notifications disabled when running code in vcore context in userspace. This ensures that vcore context is non-reentrant and can be reasoned about as such. Uthreads, on the other hand, can be interrupted by notifications, but functions exist to temporarily enable and disable notifications when necessary. While notifications are disabled, no interrupt will be delivered to the vcore that uthread is running on. However, we keep track of the fact that a notification was attempted, and force a vcore to "re-notify" itself as soon as notifications are re-enabled (if necessary).

On Akaros, notifications are implemented using an interprocessor interrupt (IPI) multiplexed on top of the kernel's routine messaging mechanism. Notifications are directed at vcores, but they cause IPIs to be generated on the physical core that a vcore is currently mapped to. When one of these IPIs comes in, the physical core packages up whatever context was running, drops into interrupt context in the kernel, and schedules a routine message for sending the actual notification. When this routine message runs (usually as a result of simply exiting interrupt context), it first checks to see if notifications are enabled for the vcore or not. If they are, it puts the saved context into a special notification slot in procdata and jumps to the vcore_entry() function of the vcore. From there, handle_events() is called, followed by the restoration of the saved context via a call to run_current_uthread(). However, if notifications are disabled, it sets a per-vcore notif_pending flag to indicate that a notification was attempted, restores the saved context, and continues where it left off.

On Linux, the mechanism is similar except that it uses the SIGUSR1 signal as a proxy for the IPI generated on Akaros. Likewise, the SIGUSR1 handler acts as a proxy for the routine kernel message that triggers the actual notification. The SIGUSR1 signal itself is triggered by calling pthread_kill() on the pthread backing the vcore being notified. Once this signal fires, it first checks to see if notifications are enabled or not. If they are disabled, it sets a per-vcore notif_pending flag (similar to Akaros) and simply returns from the signal handler (leaving the kernel to take care of restoring the interrupted context for us). However, if notifications are enabled, it has to do something quite different from Akaros to trigger the actual notification (since the entire mechanism is implemented in userspace instead of the kernel). It explicitly calls uthread_yield() from within the signal handler to package up the interrupted uthread and manually drop us into vcore context. When this context is later restored, it will simply exit the signal handler normally and continue where it left off.
Because we are yielding from within a signal handler, however, the callback we pass to uthread_yield() has to do a number of non-standard things to ensure that we are able to properly restore the uthread later (a sketch of this path follows the list below):

• It makes a call to sigaltstack() to swap out the current pthread's signal handler stack with a new one. Since we yield from within the signal handler, we end up bundling the current signal stack with the uthread context being yielded. This step is necessary so that subsequent signal handlers have a unique stack to run on.

• It manually unmasks the SIGUSR1 signal for this pthread. Normally, this is done automatically by returning from the signal handler. However, we do not technically return from the signal handler (from Linux's perspective) until the uthread that has yielded is restored at some point in the future. Unmasking is safe at this point because notifications are disabled while running this callback in vcore context (meaning we will simply return to where we left off instead of handling the notification right away if a signal actually does manage to come in).

• It manually calls vcore_entry() (which never returns) instead of leaving it up to the uthread library to do this for us upon returning from the callback. This step is necessary to ensure that the uthread that has yielded is still considered running, and that the value of current_uthread is still set. In essence, this emulates Akaros's process of interrupting the uthread via an IPI and putting its context in the procdata notification slot. From here, handle_events() is called, followed by the restoration of the saved context via a call to run_current_uthread() as per usual.
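The sketch below pulls this Linux notification path together. The helper names (notifs_disabled(), set_notif_pending()) and the exact uthread_yield() callback signature are assumptions for illustration; the real parlib code differs in its details.

/* Callback run in vcore context after uthread_yield() packages up the
 * uthread that was interrupted by SIGUSR1. */
static void __sigusr1_yield_cb(struct uthread *uthread, void *arg)
{
    /* 1. Give future signal handlers a fresh stack; the old one was bundled
     *    up with the yielded uthread's context. */
    stack_t ss = { .ss_sp = malloc(SIGSTKSZ), .ss_size = SIGSTKSZ, .ss_flags = 0 };
    sigaltstack(&ss, NULL);

    /* 2. Manually unmask SIGUSR1: we never return from the handler until the
     *    yielded uthread is eventually restored. */
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_UNBLOCK, &set, NULL);

    /* 3. Enter vcore context directly; the uthread stays "running" and
     *    current_uthread remains set. */
    vcore_entry();   /* never returns */
}

static void sigusr1_handler(int sig)
{
    if (notifs_disabled(vcore_id())) {
        set_notif_pending(vcore_id());         /* "re-notify" later */
        return;                                /* kernel restores the interrupted context */
    }
    uthread_yield(__sigusr1_yield_cb, NULL);   /* package up the uthread, drop into vcore context */
}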

As mentioned above, both Akaros and Linux set a per-vcore notif_pending flag to indicate that a notification was attempted, but not delivered (because notifications were disabled). In general, this flag is used to allow a vcore to "re-notify" itself upon re-enabling notifications, and thus drop back into vcore context to check for more outstanding events. Akaros "re-notifies" by issuing a sys_self_notify() on the current vcore; Linux resends the SIGUSR1 signal. However, this flag does not necessarily mean that a vcore must always "re-notify" itself in order to process a missed notification. Notifications themselves are only meant to ensure that handle_events() is called as soon as possible in order to process any outstanding events. If a vcore is already in vcore context and notices that the notif_pending flag is set, it can simply call handle_events() and unset the notif_pending flag itself. In fact, the body of handle_events() actually unsets this flag just before it begins to process events. When returning from vcore context, the mechanism to "re-notify" is used solely to solve the race between calling handle_events() while still in vcore context, and deciding to return to uthread context (which automatically re-enables notifications). However, "re-notification" is necessary when explicitly re-enabling notifications from uthread context because the uthread will not have had the opportunity to call handle_events() itself.

The functions to explicitly enable and disable notifications from uthread context are enable_notifs() and disable_notifs(), respectively. However, these functions are never called directly by uthread code itself. Instead, two other functions exist that wrap these functions for use by uthreads (uthread_disable_notifs() and uthread_enable_notifs(), respectively). These differ from the regular enable_notifs() and disable_notifs() calls in two important respects (a sketch follows the list):

1. They ensure that a uthread doesn’t migrate o↵of its vcore between the time that any local copies of vcore id() are made and notifications are actually disabled. 2. They keep track of a “disable depth” so that nesting calls that disable notifications work properly (i.e. you up the depth when you call disable, down the depth when you call enable, and only re-enable notifications once the depth reaches 0).

Within parlib itself, these calls are used by the uthread abstraction to protect critical library calls from being interrupted by notifications (e.g. uthread_yield()). They are also used as a building block for parlib's low-level-lock primitives (in a manner similar to an irq-save lock in the kernel). These locks are discussed in more detail in Section 2.2.3.

2.4 Asynchronous Services

The event delivery mechanism discussed in the previous section provides the foundation for a rich set of asynchronous services to be built on top of parlib. The most important of these is the asynchronous system call service discussed below. Additionally, a number of other services exist, including general-purpose alarms and POSIX signal emulation. Each of these is discussed below as well.

2.4.1 System Calls

Traditionally, I/O-based system calls such as read(), write(), and accept() block in the kernel until their corresponding operations complete. Such a model is acceptable when the kernel is the one responsible for managing all concurrency in the system: it will simply find another thread to run in its place and execute it. However, this model breaks down under parlib, where a thread blocking in the kernel is tantamount to a thread blocking access to an entire core. To prevent this situation, parlib ensures that all system calls are only ever issued to the kernel in an asynchronous manner. In cases where a system call would normally have blocked, control is immediately returned to userspace, and some external entity is handed the responsibility of completing the syscall at a later time.

Once the system call completes, an EV_SYSCALL event message is generated and sent to a process via the asynchronous event delivery mechanism described in the previous section. User-level schedulers set up event queues and register an event handler for EV_SYSCALL messages, enabling them to respond to these syscall completion events appropriately. Uthreads that wish to "block" on one of these syscalls do so in userspace by packaging up their state and passing it to a user-level scheduler via the thread_blockon_sysc() callback. The handler for EV_SYSCALL messages then takes care of waking those uthreads once their system calls complete. Unsurprisingly, the exact method used to support this type of "user-blockable" system call differs significantly between Akaros and Linux.

On Akaros, all system calls are inherently asynchronous, making this process relatively straightforward. When issuing a system call, uthreads allocate a special syscall struct on their stack, which persists for the lifetime of the system call. The kernel and user-space share access to this syscall struct, which contains fields for all of the syscall arguments, as well as a field for the return value and a flag indicating if the syscall has completed yet or not. Additionally, it contains a pointer to the event queue to use when posting its syscall completion event. When a syscall is first issued, the kernel parses its arguments and attempts to complete the syscall immediately. If it completes, it simply sets the return value in the syscall struct with the proper value, sets the flag to indicate completion, and returns control to userspace.

If it cannot complete the syscall immediately (e.g. because it needs to wait on some sort of I/O operation), it still returns control to userspace (this time without setting the syscall completion flag), but it also spawns off an internal kthread to complete the syscall later. Once back in userspace, the uthread checks the completion flag to see what it should do next. If the flag is set, the uthread simply extracts the return value from the syscall struct and continues. If it is not set, the uthread makes an explicit call to uthread_yield(), returning control to the user-level scheduler. The uthread passes thread_blockon_sysc() as the callback to uthread_yield(), which gives the scheduler the opportunity to associate the uthread with its syscall struct so it knows which uthread to wake up once the syscall completes. It also sets the event queue pointer inside the syscall struct to an event queue set up previously by the user-level scheduler. At some point later, the spawned kthread completes the syscall and posts an EV_SYSCALL event message to this event queue. The process then handles this event inside one of its vcore's handle_events() functions, triggering it to resume the uthread blocked on the syscall. Details on the inner workings of this mechanism, including how to avoid races on the value of the syscall completion flag, can be found in Sections 3.2 and 4.4 of Barret Rhoden's dissertation [2].

On Linux, the process of ensuring that system calls do not block in the kernel is not as straightforward. We rely on an unmodified version of glibc to issue our system calls, and as such, are limited in our techniques to make them appear asynchronous. In fact, we currently only guarantee non-blocking support for a handful of syscalls we explicitly wrap in a special non-blocking syscall wrapper (namely, read(), write(), and accept()). We also wrap the open() and socket() syscalls with a different wrapper, but only in support of making the read(), write(), and accept() syscalls non-blocking. The general logic behind this scheme is as follows (a sketch of the resulting read() wrapper appears after the list):

1. Intercept calls to open() and socket() and force them to open their respective file descriptors with the O_NONBLOCK flag set.

2. Intercept calls to read(), write(), and accept() (whose file descriptors are now all set as O_NONBLOCK), and attempt to issue the original, non-wrapped version of the syscall.

3. If the syscall completes successfully (or errors with an errno other than EWOULDBLOCK), continue on.

4. Otherwise, voluntarily yield to the user-level scheduler, passing it a callback which both triggers thread_blockon_sysc() (similar to Akaros) and knows how to complete the syscall "out-of-band" (i.e. on a Linux pthread other than those being used to emulate vcores). This latter process simulates the operation of completing the syscall in an Akaros kthread.
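Putting these steps together, the read() wrapper looks roughly like the sketch below. The __read() alias, the job structure, and the block_on_syscall() yield callback are illustrative names rather than the exact ones used in parlib; the real wrapper also deals with partial reads and other corner cases.

struct uth_syscall_job {
    int fd;
    void *buf;
    size_t count;
    ssize_t retval;
    int err;
};

ssize_t read(int fd, void *buf, size_t count)
{
    /* Fast path: the fd was opened O_NONBLOCK, so just try the real syscall. */
    ssize_t ret = __read(fd, buf, count);
    if (ret >= 0 || errno != EWOULDBLOCK)
        return ret;

    /* Slow path: "block" in userspace and let an out-of-band pthread finish
     * the call (select() followed by __read(), as described below). */
    struct uth_syscall_job job = { .fd = fd, .buf = buf, .count = count };
    uthread_yield(block_on_syscall, &job);   /* resumes once the job completes */

    errno = job.err;
    return job.retval;
}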

As mentioned at the end of Section 2.2.2, the "out-of-band" pthread used to service a uthread's syscall is the same pthread created to initialize its TLS. We simply wake this pthread from the futex it is sleeping on and pass it a reference to a structure similar to the Akaros syscall struct (the main difference being that it also contains a function pointer to a differently wrapped version of the syscall). Once the pthread wakes up, it calls the differently wrapped syscall, which first issues a select() call to wait for data on the O_NONBLOCK file descriptor, followed by the syscall itself (which is now guaranteed to complete since select() was called first). As with Akaros, it then generates an EV_SYSCALL event and posts it to the event queue of whichever vcore the syscall originated from. That vcore is then forced into vcore context (if it wasn't already) and handles this event inside its handle_events() function, triggering it to reschedule the uthread blocked on the syscall.

Although the uthread and its backing pthread share the same TLS, only one or the other of them will be accessing that TLS at any given time: we only spin up the backing pthread to service a syscall once its corresponding uthread has been blocked in the user-level scheduler. In fact, we want the pthread issuing the syscall to have the same TLS as the original uthread because this may have implications when calling into the glibc wrappers around each syscall. By servicing syscalls using the backing pthread, we avoid such problems altogether.

Unfortunately, there is no good way of wrapping all potentially blocking system calls in Linux short of modifying glibc or rolling our own stdlib of some sort. Even wrapping the ones we have is not done in a uniform manner. For example, glibc weak-aliases read(), write(), and open() to versions of themselves prefixed with a double underscore (e.g. __read()). To wrap these, we simply had to override the read(), write(), and open() symbols and call the double-underscore versions of those functions where appropriate. However, accept() and socket() provide no such weak symbols, forcing us to use the --wrap option provided by ld instead. The major disadvantage of this approach is that wrappers provided via --wrap don't automatically override syscalls inside the parlib library itself. Instead, applications have to manually pass the --wrap option at link time, indicating which syscalls they would like to wrap and the name of the wrapper function to use. We leave it as future work to figure out how to improve this situation.

2.4.2 The Alarm Service

Another important service based on parlib's event delivery mechanism is the alarm service. This service provides microsecond-precision alarms for use within an application. Its API can be seen below:

Library Calls
init_awaiter(alarm_waiter, callback(alarm_waiter))
set_awaiter_abs(alarm_waiter, abs_time)
set_awaiter_abs_unix(alarm_waiter, abs_time)
set_awaiter_rel(alarm_waiter, usleep)
set_awaiter_inc(alarm_waiter, usleep)
start_alarm(alarm_waiter)
stop_alarm(alarm_waiter)
reset_alarm_abs(alarm_waiter, abs_time)

Individual alarms are created by allocating space for a unique alarm_waiter struct and initializing it with a callback to be fired once the alarm expires (using the init_awaiter() function listed above). Once initialized, different methods exist to set the expiration time of the alarm, including:

set_awaiter_abs: Set an alarm to expire at an absolute time relative to the time since the machine first booted.

set_awaiter_abs_unix: Set an alarm to expire at an absolute time relative to Unix time (i.e. POSIX time or Epoch time).

set_awaiter_rel: Set an alarm to expire "x" microseconds in the future (i.e. relative to the current time).

set_awaiter_inc: Add "x" microseconds to the current time set for the alarm to expire. The value of "x" can be negative if desired.
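For example, a one-shot alarm that fires 100 microseconds in the future could be set up as sketched below. The alarm_waiter struct layout and the callback prototype are assumptions for illustration, and start_alarm()/stop_alarm() are described next.

static void timeout_handler(struct alarm_waiter *waiter)
{
    /* Runs once the alarm expires (on vcore 0 for the default service). */
}

void arm_timeout(struct alarm_waiter *waiter)
{
    init_awaiter(waiter, timeout_handler);
    set_awaiter_rel(waiter, 100);    /* expire 100 us from now */
    start_alarm(waiter);
}

void cancel_timeout(struct alarm_waiter *waiter)
{
    if (stop_alarm(waiter)) {
        /* Guaranteed: the callback will never fire. */
    } else {
        /* The alarm already fired, is firing, or will fire very soon. */
    }
}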

Once the expiration time is set, the alarm can be started via a call to start_alarm(), or stopped prematurely via a call to stop_alarm(). To both stop the alarm and start it again with a different absolute expiration time, reset_alarm_abs() can be used. Once the alarm expires, the callback that the alarm was initialized with will be fired.

Both stop_alarm() and reset_alarm_abs() return booleans indicating whether the alarm was successfully stopped or reset, respectively. This information is useful for keeping track of the exact number of outstanding alarms in the system. If a call to stop_alarm() returns true, then it is guaranteed that the alarm will not fire and that its callback will never be triggered. If it returns false, it is guaranteed that the alarm has either already triggered, is in the process of triggering, or will trigger very soon. In both cases, you have exactly one code path that you can use to keep an exact reference count for the alarm. Resetting is similar except that true indicates that the alarm will fire at the new time, and false indicates that it will fire at the old time. Either way, you only have a single code path for keeping track of the alarm.

On Akaros, this alarm service is implemented by multiplexing each alarm onto a single underlying hardware timer provided by the kernel. The kernel exposes this timer as a dedicated "alarm device", which userspace can set to fire at some absolute time in the future. Alarm completion is then signaled via the event delivery mechanism described in the previous section with event type EV_ALARM. Inside parlib, the alarm service registers a single event queue on vcore 0 to handle all alarm completion events. It also registers a handler which knows how to demultiplex these completion events and trigger the proper alarm callbacks. One side effect of multiplexing alarms in this way is that alarms started via the API do not map 1-1 with alarms started on the kernel alarm device. More importantly, it means that all alarm callbacks are triggered on vcore 0 rather than the vcore they were originally started from. This has implications for using these alarms for things like preemptive schedulers, which require every vcore to be interrupted at some periodic interval.

For this we provide a special "per-vcore-alarm service" (i.e. pvc-alarms), which allows you to specify which vcore you would like an alarm callback to fire on. Under the hood, this service attaches each vcore to a separate instance of the kernel alarm device, allowing it to multiplex alarms destined for each vcore separately.

On Linux, we took a simpler approach to implementing the alarm service. When an alarm is started, we simply launch an "out-of-band" pthread to act as a proxy for the kernel alarm device and call usleep() on it. When the usleep() returns, we post an alarm completion event to whatever vcore was used to start the alarm. Although less efficient, this approach has the nice property that every alarm started via the API maps 1-1 with each completion event generated by the service. There is no multiplexing and demultiplexing going on, and alarm completion events are always directed back to the vcore they originated from. This avoids the problems mentioned above that motivated the creation of the pvc-alarm service on Akaros. To make things slightly more efficient, we actually use a pool of pthreads sleeping on futexes rather than launching a brand new pthread each time an alarm is started. In the future we could also imagine using actual Linux alarms instead of our pthread proxy, but so far this hasn't proven necessary.

2.4.3 POSIX Signal Support

POSIX signals provide a standardized means of performing inter-process communication on POSIX-compliant operating systems [39]. Moreover, they provide a mechanism for the operating system to directly notify a process of some outstanding condition. Examples of POSIX signals include SIGHUP, SIGINT, SIGKILL, and SIGSEGV. On Linux, POSIX signal support is built directly into the OS. On Akaros, signals are superseded by the existence of its unified event delivery subsystem. Akaros does not support the POSIX signal API directly. However, many applications that we would like to support on Akaros are built to rely on the existence of POSIX signals (most notably bash and Go). In theory, we could rewrite these applications to run on top of events instead, but doing so would hurt portability and would add (arguably unnecessary) maintenance costs down the road. Because of this, parlib on Akaros includes a generic signaling service that a user-level scheduler can use to provide POSIX signal support to an application. This service provides:

• Default handlers for all 64 POSIX signals

• Standard signal(), sigaction(), and sigmask() API calls for use within an application

• A method of automatically triggering a signal handler for signals generated as part of inter-process communication

• A trigger_posix_signal() function to explicitly trigger a POSIX signal handler from within a user-level scheduler for all other types of signals

In general, a user-level scheduler uses the signal service to trigger POSIX signals in each of the following scenarios:

1. After receiving an EV_POSIX_SIGNAL event from the kernel. This event is designed for sending inter-process signals over Akaros's event delivery subsystem (e.g. via the kill command). The signal library itself takes care of registering event queues for this event type and will automatically trigger any POSIX signal handlers as these events come in. The actual POSIX signal number associated with each event is passed in the event_msg payload.

2. After receiving a hardware fault from the kernel. In Akaros, all hardware faults are reflected to user-space. This gives the process the chance to handle the fault before the kernel takes some default action (usually just killing the process). As discussed in Section 2.2.2, a user-level scheduler is notified about these faults via the thread_refl_fault() callback. Using this callback, a scheduler can trigger a fault's equivalent POSIX signal handler using the trigger_posix_signal() function mentioned above.

3. After receiving a POSIX signal directly, using some scheduler-specific intra-process signaling mechanism (e.g. pthread_kill()).

For inter-process signaling via the EV_POSIX_SIGNAL event, no specific uthread is the target of the signal being generated. For faults and intra-process signals, however, a specific uthread is associated with the signal (i.e. faults occur because a uthread has performed some illegal operation, and intra-process signals have a specific uthread to which they are targeted). A user-level scheduler must be careful to distinguish between these two cases.

In general, the details of how to use the POSIX signal service are left up to user-level scheduler writers themselves. However, it is instructive to walk through an example of how to use this service, since there are many pitfalls you must be aware of. For example, our user-level pthread scheduler on Akaros chooses to run handlers for EV_POSIX_SIGNAL events and faults as soon as possible, but waits to run signal handlers triggered by pthread_kill() until the next time their target uthreads are scheduled.

Originally, we thought we'd be able to provide these semantics by simply triggering the POSIX signal handlers to run from vcore context at the appropriate time. When an EV_POSIX_SIGNAL event or fault came in, we were already in vcore context and would just trigger the signal handler directly (restarting the offending uthread afterwards in the case of a fault). If a signal was sent via pthread_kill(), it was registered in a bitmask on its target uthread, and the handlers for all pending signals were run in a loop just before the next time that uthread was scheduled to run again. Triggering signal handlers in this way makes intuitive sense, because vcore context already has an alternate stack for the handlers to run on. However, there are a couple of problems with this approach. First, signals sent to a particular uthread typically assume their handlers have access to the thread-local variables of that thread. Running from vcore context causes these handlers to access the wrong thread-local variables since they are running with the vcore's TLS (not the uthread's). Second, signal handlers may make calls that attempt to yield. Yielding while running on the vcore stack can cause memory corruption, since the stack will be reused by the vcore as soon as the yield completes.

For the case of an EV_POSIX_SIGNAL event, we were able to address these problems by simply setting up a transient uthread to run the handler code. We don't have to worry about TLS because there is no particular uthread associated with the signal, and, as a uthread, it can be yielded and restarted as many times as it likes.

For signals triggered by faults or pthread_kill(), we tried a couple of different approaches before settling on a working solution. First, we tried to simply switch into the target uthread's TLS and jump to an alternate stack before calling each signal handler. This approach seems reasonable on the surface, but problems arise because we don't properly exit vcore context before running the handlers. Specifically, notifications are still disabled, and proper synchronization with vcore preemption is not preserved. The correct solution was to create a new user context for the signal handler to run in, and restart the uthread with this context instead of its original one. In this way, the uthread's TLS is preserved, but we are able to set the context up with a new stack and point it directly at the signal handler.
The signal handler, in essence, hijacks the uthread, so it can access its TLS variables and yield freely as if it were the original uthread itself. Once the signal handler completes, the original uthread context is restored and rescheduled to run.

2.5 Other Abstractions

Up to this point, all of the abstractions discussed have been essential to the development of parlib itself. Without them, parlib would not be able to provide a fully featured parallel runtime development framework based on user-level scheduling. However, parlib also provides a number of other tools, which are not as fundamental, but provide useful abstractions that developers may choose to integrate into their parallel runtimes if desired.

2.5.1 Memory Allocators

In addition to the default memory allocators provided by glibc (i.e. malloc, posix_memalign, etc.), and the mmap capabilities provided by the kernel, parlib provides a few special-purpose memory allocators that developers may wish to take advantage of. Specifically, it provides a general-purpose slab allocator [40] as well as a specialized fixed-size pool allocator. Both of these memory allocators are designed to reduce memory fragmentation in cases where fixed-size objects are continually being allocated and freed (but not reinitialized). Allocating and freeing from these allocators is fast because all memory blocks are the same size. They simply maintain a list of free blocks and allocate them instantly upon request. Moreover, object initialization only needs to occur the first time a block is allocated. From then on the object can be used immediately. Examples of such objects include thread structures, event messages, TLS data, etc.

The primary difference between the two approaches is that the size of the slab allocator can grow dynamically, while the size of the fixed-size pool allocator cannot. For the slab allocator, you simply initialize an instance of it with the size of the objects you would like to allocate, and it takes care of dynamically growing the memory required to back those allocations over time. With the fixed-size pool allocator, enough memory to satisfy N * sizeof(object) allocations is requested from the system at initialization time, and no more memory will be added to satisfy allocations beyond those limits.

A fixed-size pool allocator has the potential to waste more memory than a slab allocator if outstanding allocations are significantly smaller than the size of the preallocated pool. Moreover, a fixed-size pool will not be able to handle a large number of real-time allocations that grow beyond the size of the preallocated pool. Slab allocators are able to dynamically compensate for each of these situations at the cost of more complicated code (i.e. longer, more time consuming code paths), and the need to periodically make expensive calls to the system to add more backing memory if the number of real-time allocations gets too large.

Currently, we use slab allocators within parlib for maintaining data associated with the Dynamic TLS (DTLS) abstraction discussed in Section 2.5.3. On Linux, we also use them to allocate and free job structures associated with the "out-of-band" pthreads used by the alarm service discussed in Section 2.4.2. Pools were used at one time in various ways, but are no longer actively used anywhere.

2.5.2 Wait-Free Lists

Parlib provides a simple wait-free-list abstraction (WFL), with semantics similar to that of a concurrent bag [41]. The interface to the WFL can be seen below:

Library Calls
wfl_init(list)
wfl_cleanup(list)
wfl_insert(list, data)
wfl_insert_into(list, slot, data)
wfl_remove(list)
wfl_remove_from(list, slot)
wfl_remove_all(list, data)
wfl_capacity(list)
wfl_size(list)

Traditional wait-free data structures are complicated by the desire to preserve some notion of ordering, prevent concurrent list mutations, and manage the memory of their internal data structures on the fly. The WFL relaxes these constraints, allowing for a simple, high-performance data structure, at the cost of no guaranteed ordering, a more restrictive consistency model, and a little wasted memory. Items themselves are inserted and removed from a WFL atomically, but no guarantee is made that a removal will find an item if it is inserted concurrently.

The WFL abstraction is useful anytime you need to maintain a cache of similarly typed objects, but you don't care which particular object you get back when you pull from that cache. We currently use WFLs as the basis for thread pools and other data structures inside our pthread scheduler, as well as for internal data structures maintained by our POSIX signal support library and futex abstractions.

Internally, a WFL is implemented as a simple linked list of slots capable of holding arbitrary data items. Empty slots contain a NULL pointer, and occupied slots contain a pointer to the item they are holding. Whenever a new item is added to a WFL, it is either atomically inserted into a pre-existing NULL slot, or space for a new slot containing the data item is allocated and atomically appended to the end of the list. Whenever an item is removed from a WFL, the list is searched, and the element from the first non-empty slot is atomically removed and returned. If no item is found, we return NULL. Once created, a slot is never freed until wfl_cleanup() is called and the entire WFL is destroyed. In practice, the number of slots tends to remain low, and the fact that any new slots are only ever appended to the end of the list means that a simple CAS loop is sufficient to add them. Additionally, since item insertions and removals are not synchronized, they can each be performed using a simple atomic swap.

Although not the normal usage of a WFL, the API also provides hooks to let you insert and remove items from a particular slot in a WFL (wfl_insert_into() and wfl_remove_from(), respectively). References to these slots are obtained via the return values of the normal wfl_insert() and wfl_remove() calls. At one time we planned to use these hooks to help implement per-vcore run queues without the use of locks, but this idea has since been abandoned, and we don't currently use this functionality anywhere.
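For example, a scheduler might use a WFL as a cache of preallocated thread structures, as sketched below. The struct wfl type name and the upthread_alloc() fallback are illustrative assumptions; the wfl_* calls are those listed above.

static struct wfl thread_cache;          /* initialized once with wfl_init() */

struct upthread *get_cached_thread(void)
{
    /* May return NULL even if an insert is happening concurrently. */
    struct upthread *t = wfl_remove(&thread_cache);
    return t ? t : upthread_alloc();     /* fall back to a fresh allocation */
}

void cache_thread(struct upthread *t)
{
    wfl_insert(&thread_cache, t);        /* no ordering guarantee, which is fine here */
}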

2.5.3 Dynamic TLS

Similar to the POSIX pthread_getspecific() and pthread_setspecific() API calls, parlib provides a Dynamic-TLS (DTLS) abstraction for managing thread-local storage at runtime. Unlike pthread_getspecific() and pthread_setspecific(), which are designed specifically for the POSIX pthread scheduler, the DTLS abstraction is designed as a building block for providing runtime-managed TLS by custom user-level schedulers. The interface to the Dynamic-TLS abstraction can be seen below:

Library Calls
dtls_key_create(destructor)
dtls_key_delete(key)
set_dtls(key, storage)
get_dtls(key)
destroy_dtls()

Using DTLS, multiple software components can dynamically set up unique thread-local storage for use within a thread. Each software component is assigned a unique key via a call to dtls_key_create(), and individual threads use this key to access the thread-local storage associated with that component. The size and layout of each component's thread-local storage is typically defined inside a struct, which individual threads must allocate memory for before making any DTLS API calls. A thread calls set_dtls(key, storage) to associate newly allocated thread-local storage with a key and get_dtls(key) to retrieve it.

DTLS is mostly provided for legacy code that relies on dynamically allocatable thread-local storage. It is generally accepted that statically allocated TLS with integrated compiler support (e.g. __thread in gcc) is more performant than runtime-allocated TLS. Currently, the DTLS abstraction is used by our user-level pthread implementation to implement its pthread_getspecific() and pthread_setspecific() API calls. Additionally, our Lithe port of Intel's Thread Building Blocks library (TBB) uses it inside its custom memory allocator as well as across several other components (as discussed in the following chapter).

When adding DTLS capabilities to a custom user-level scheduler, the API calls it provides should be able to just wrap most of the core DTLS calls directly. Care must be taken, however, to ensure that each thread calls destroy_dtls() just before exiting. Once called, all internal memory associated with DTLS for the exiting thread will be freed, and each destructor callback passed as a parameter to dtls_key_create() will be executed. A scheduler can either dictate that threads must make a call to destroy_dtls() themselves, or embed a call to it inside each thread's exit function in the scheduler. In our ports of pthreads and TBB, we take the latter approach.
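As an example, a library component could give each thread its own scratch buffer as sketched below. The dtls_key_t type name and the buffer size are illustrative assumptions; the API calls are those listed above.

static dtls_key_t scratch_key;

void component_init(void)
{
    /* free() runs on each thread's buffer when that thread calls destroy_dtls(). */
    scratch_key = dtls_key_create(free);
}

char *component_scratch_buffer(void)
{
    char *buf = get_dtls(scratch_key);
    if (!buf) {
        buf = malloc(4096);              /* first use on this thread */
        set_dtls(scratch_key, buf);
    }
    return buf;
}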

2.6 Evaluation

Collectively, the abstractions presented above constitute a complete framework for developing efficient parallel runtimes based on user-level scheduling. The primary goal of these abstractions is to allow an application to take control of its own thread scheduling policy, rather than leave all scheduling decisions up to the kernel. These thread scheduling policies should be both easy to develop and highly performant. In this section, we evaluate the ability of these abstractions (and parlib as a whole) to meet this primary goal.

As mentioned at the beginning of this chapter, parlib was born out of the need to provide a user-space counterpart for the abstractions we built into Akaros; the Linux port came later. As such, many of these abstractions were designed to provide functionality lacking in traditional systems (including Linux). For example, the vcore abstraction and event delivery mechanism map 1-1 with abstractions provided directly by the Akaros kernel. Linux is not able to provide direct support for these abstractions and relies on workarounds to provide a similar API as Akaros. However, these "fake" abstractions only work well under a limited set of assumptions (e.g. running only one application on the system at a time, using a limited set of system calls, etc.).

Despite these limitations, most of our experiments to evaluate parlib have been conducted exclusively on Linux. Ideally, we would run all of our experiments on Akaros and only use the Linux port of parlib as a point of comparison. However, Akaros is still a relatively young system, and as such, has not yet been optimized for high performance. Because of this, it's often hard to distinguish whether a particular negative result can be attributed to a problem with parlib or to an artifact of a poorly performing system underneath. The infrastructure is there, but many of Akaros's subsystems still need quite a bit of work before they come close to providing the same performance as Linux. For example, Akaros's networking stack has scalability problems for a large number of connections and is generally much less performant than the Linux networking stack. As we show later in this section, a webserver built on top of parlib performs quite well on Linux, but very poorly on Akaros due to this substandard networking stack. By running most of our experiments on Linux, we are able to isolate parlib more effectively and evaluate it independent of the problems that exist in disjoint parts of Akaros. However, this comes at the cost of only evaluating parlib under the limited setup dictated by Linux's "faked" abstractions. For now, this is sufficient.

The rest of this section goes into the details of the exact experiments we use to evaluate parlib. All of these experiments depend on a reference implementation of the POSIX thread API we built on top of parlib. We call this library upthread (short for user-level pthreads). Additionally, the final set of experiments rely on a thread-based webserver we wrote, which integrates a custom user-level scheduler directly into the application. We call this webserver kweb. The details of the upthread library and kweb are discussed below, followed by the details of the experiments themselves.

2.6.1 The upthread Library

The upthread library is our canonical example of how to build a user-level threading package on top of parlib. It provides a version of the standard POSIX thread API, but replaces all instances of pthread in the API with instances of upthread (i.e. upthread_create(), upthread_join(), etc.). Ideally, we would provide the pthread API directly, but we were forced to make this name change for the reasons outlined in Sections 2.2.1 and 2.2.2. However, we do provide a header file that #defines the upthread names back to their pthread counterparts. This allows applications and libraries written against the standard pthreads API to, at least, re-compile against upthreads without modification. In this way upthreads is API compatible with pthreads, but not ABI compatible. However, not all functions from the API are available (e.g. those related to canceling and some of the lesser-used attribute functions), but a non-trivial subset is supported. The salient features of the upthread library include:

• The inclusion of per-vcore run queues with the ability to steal from other queues when a local queue falls empty. All queues cycle through their available threads in a round-robin fashion.

• The ability to set the scheduling period of the library via an extended API call. All vcores are interrupted at this scheduling period and forced to make a new scheduling decision.

• Integration with the high-level synchronization primitives discussed in Section 2.2.3. These form the basis for the mutex, barrier, and condition variable interfaces that are part of the standard pthread API. We also provide support for futexes.

• Integration with the syscall service discussed in Section 2.4.1. This allows us to properly handle threads which have "blocked" in userspace and are waiting for a syscall to complete. While those threads wait, we are free to schedule other threads in their place.

• A rudimentary affinity algorithm, whereby threads will always be added back to the vcore queues they were previously assigned to whenever they are yielded.

• Careful cache alignment of all data structures associated with vcores and pthreads. All per-vcore data is allocated in a single array and cache aligned at the bottom of a page. All memory for an individual pthread (i.e. its stack, uthread struct, etc.) is allocated in one big chunk and cache aligned at the top of a page with a random offset of rand() % max_vcores() * ARCH_CACHE_LINE_SIZE. Aligning to the top of the page minimizes cache interference with the per-vcore data at the bottom, and adding a random offset minimizes cache interference with other pthreads (especially if they all run the same function and hit their stacks at the same place at the same time).

In addition to the standard pthreads API, we also provide a few custom API calls for the upthread library:

upthread_set_sched_period: This function takes a period parameter in microseconds and uses it to set a scheduling period for the upthread library. Every period microseconds, every online vcore will be interrupted and forced to make a new scheduling decision. This function can be called at any time, but each vcore will only adjust its preempt period the next time it drops into vcore context. This can result in some amount of skew across vcores. The preempt period is set to 0 by default, indicating that no alarms should be set and no preemption should occur (i.e. it is configured for cooperative scheduling only).

upthread_can_vcore_request: This function takes a boolean indicating whether automatic requesting and yielding of vcores should be enabled or disabled. With this feature enabled, the upthread library runs a basic algorithm to decide when to request more cores when new work becomes available (i.e. request a vcore whenever a new thread is created or unblocked in some manner). Moreover, it yields vcores back to the system when it detects that a vcore has no more work to do (i.e. its run queue falls empty). With this feature disabled, vcores can only be explicitly requested and yielded by an application, and will sit in its sched_ops->sched_entry() loop waiting for work to become available. This feature is enabled by default.

upthread_can_vcore_steal: This function takes a boolean indicating whether or not one vcore is allowed to steal threads from another vcore's run queue when its own run queue falls empty. As part of stealing, the preferred affinity of a thread is updated to its new queue. The work stealing algorithm follows a simple procedure whereby it picks two queues at random and then chooses the one that has the most threads in it. It then steals half of the threads in that queue. If both of the queues it chooses are empty, it cycles through all queues until it finds one that has threads in it. It then steals half of that one. No guarantee is made for stealing threads that become available concurrently with the run of the algorithm: it simply gives up after one complete pass. This feature is enabled by default.

upthread_set_num_vcores: This function takes an integer number of vcores (nr_vcores) to use as a hint for the total number of vcores an application plans to have once it is up and running.
This feature is useful for precisely allocating threads to vcores before you've actually put in a vcore request. By default, the upthreads library will cycle through all online vcore queues as new threads are created and need to be added to a queue. With this feature, it will cycle through nr_vcores queues instead (including vcores that are still offline). This feature is usually used in conjunction with upthread_can_vcore_request(FALSE) and upthread_can_vcore_steal(FALSE), in order to have better control of all thread placement. The default value of nr_vcores is max_num_vcores().
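For example, an application that wants fully manual control over its core counts and thread placement might configure the library as sketched below. The vcore_request() call and the exact prototypes are assumed from the descriptions above rather than quoted from the upthread headers.

void configure_upthreads(void)
{
    upthread_can_vcore_request(FALSE);   /* only explicit vcore requests */
    upthread_can_vcore_steal(FALSE);     /* keep threads on their assigned queues */
    upthread_set_num_vcores(8);          /* spread new threads across 8 queues */
    upthread_set_sched_period(10000);    /* preempt every 10 ms (0 = cooperative) */
    vcore_request(8);                    /* now actually ask the system for the cores */
}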

The upthread library provides the basis for all of the experiments discussed throughout this section. For most of these experiments we compare the performance of pthread-based applications running on upthreads vs. running on the Linux NPTL. Additionally, we have hooked our upthread library into versions of libgomp and TBB, allowing OpenMP and TBB-based applications to run on top of upthreads as well. All of these experiments are discussed in detail in Section 2.6.3.

2.6.2 The kweb Webserver

Kweb is a thread-based webserver we wrote to test parlib's ability to integrate custom user-level schedulers directly into an application. Different variants of kweb can be compiled that allow us to hook in either the Linux NPTL, upthreads, or an application-specific scheduler that optimizes the thread scheduling policy specifically for webserver workloads. Additionally, kweb is written such that it can be compiled for both Linux and Akaros. Akaros obviously doesn't provide an implementation of the Linux NPTL, but its pthread library is similar to upthread, and its application-specific scheduler runs the same logic as on Linux.

Like many thread-based webservers, kweb uses a single, dedicated thread to accept incoming TCP connections and a pool of worker threads to service those connections. However, kweb does not maintain a traditional "one-thread-per-connection" model. Instead, it uses a pool of threads to process multiple requests from the same connection concurrently. Paul Turner from Google comments on the superiority of such a model in his talk at the Linux Plumbers Conference in 2013 [42]. Under this model, the accept thread sits in a tight loop: accepting new connections, enqueueing them at the tail of a connection queue, and waking up worker threads to service them. Once awake, a worker thread pulls a connection off of the connection queue, extracts a single request from it, and then re-enqueues the connection for another thread to process. Sometimes they re-enqueue connections at the head of the queue, and other times at the tail, depending on the value of a configurable "burst-length" setting. We set the burst-length to 10 by default.
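The worker side of this model reduces to a loop like the one sketched below. All of the queue, pool, and request helpers are illustrative names rather than kweb's actual functions.

void worker_loop(struct worker *self)
{
    for (;;) {
        thread_pool_sleep(self);                 /* wait until the accept loop (or a peer) wakes us */

        struct connection *c = conn_queue_pop(&connq);
        if (!c)
            continue;

        struct request req;
        if (!extract_request(c, &req)) {
            close_connection(c);
            continue;
        }

        /* Re-enqueue the connection (head or tail, depending on the
         * burst-length) and wake another worker to keep draining the queue. */
        conn_queue_push(&connq, c);
        thread_pool_wake_one();

        serve_request(c, &req);                  /* then go back to the pool */
    }
}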

In addition to the basic logic required to run a webserver, kweb also provides a url_cmd interface, which lets you run custom commands inside the webserver by accessing a specific url. These commands are useful for externally triggering the webserver to take some action without needing to restart the webserver itself. These commands exist purely for research purposes and would never be included in a real webserver. Examples include the ability to request or yield a vcore, start and stop debug output, and terminate the webserver completely.

When running on the Linux NPTL or upthreads, most of kweb's application logic is identical, with a few configuration differences to account for setting timeouts on socket reads and writes. Additionally, upthreads turns off preemptive scheduling and explicitly sets upthread_can_vcore_request(FALSE) and upthread_can_vcore_steal(TRUE). It does this to ensure that all of the vcores it requests stay online at all times, but can steal from each other's work queues if theirs become empty. In both of these setups, the main thread acts as the accept loop and all worker threads are pulled from a centralized thread pool to process incoming requests. The accept loop places connections onto a global connection queue and pulls a worker thread from the thread pool to start processing them. When a worker thread starts up, it grabs a connection from the connection queue, extracts a request from it, and then re-inserts it back into the connection queue (as described above). As part of the re-enqueue process, it pulls another thread from the thread pool and wakes it up. In this way, we ensure that there is always at least one thread awake to process requests so long as connections still exist in the connection queue. When a thread is done processing a request, it simply re-inserts itself into the thread pool and waits to be woken up again. All threads (including the accept thread) are scheduled according to the policy of the underlying threading library they are running on.

The logic for the application-specific scheduler is somewhat different. First, only even-numbered physical cores are ever used to run any application logic. This keeps hyperthreaded pairs from potentially interfering with one another. As vcores are requested, odd-numbered cores may still trickle in, but they are immediately halted in user-space and a new request is put in (hopefully returning a vcore that maps to an even-numbered physical core this time). The main thread is still responsible for the accept loop, but it is pinned to vcore 0 and never migrated from it. Furthermore, this thread is the only thread that ever runs on vcore 0: all worker threads are spread out among the other cores. When a new connection comes in, the accept thread places it into a per-vcore connection queue, and a thread is pulled from a per-vcore thread pool in order to service it. Connections are assigned to vcores in a round-robin fashion according to the order in which they arrive. On a given vcore, multiple threads may be used to service requests from a single connection concurrently, but they are never migrated to other vcores for processing. Because of this, no locks are needed to protect operations performed on behalf of a request, and it's likely that threads will maintain better cache affinity since they are never migrated. In Section 2.6.3 we discuss an experiment that compares the performance of all of these variants of kweb to a highly optimized instance of nginx [43] running on Linux.
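The worker-pool logic described above can be summarized with the following sketch. It is not kweb's actual code: the connection-queue and thread-pool helpers are hypothetical stand-ins (declared but not defined here), and the treatment of the burst-length setting is one plausible interpretation of the behavior described in the text.

struct connection;
struct request;

/* Hypothetical helpers standing in for kweb's internal data structures. */
extern struct connection *conn_dequeue(void);
extern void conn_enqueue_head(struct connection *c);
extern void conn_enqueue_tail(struct connection *c);
extern int conn_requests_extracted(struct connection *c);
extern struct request *extract_request(struct connection *c);
extern void serve_request(struct request *r);
extern void wake_one_worker(void);
extern void thread_pool_sleep(void);   /* re-insert self into the pool and wait */

#define BURST_LENGTH 10

static void worker_loop(void)
{
    for (;;) {
        struct connection *c = conn_dequeue();
        struct request *r = extract_request(c);

        /* Re-enqueue the connection so another worker can service its next
         * request concurrently.  After every BURST_LENGTH requests the
         * connection goes to the tail of the queue instead of the head. */
        if (conn_requests_extracted(c) % BURST_LENGTH == 0)
            conn_enqueue_tail(c);
        else
            conn_enqueue_head(c);
        wake_one_worker();

        serve_request(r);
        thread_pool_sleep();
    }
}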
For both upthreads and the application-specific scheduler, we adjust the number of vcores used to process incoming requests with a special vcore request url command. In the future, we plan to update kweb to adjust its vcore counts automatically for these schedulers based on real-time demand.

2.6.3 Experiments

In this section we discuss the exact experiments used to evaluate parlib. These experiments attempt to evaluate parlib as a whole, rather than evaluate each of its individual abstractions in isolation. To do this, we rely on the upthread library to provide a direct comparison between running applications on a parlib-based thread scheduler vs. running them on the Linux NPTL. Additionally, we use kweb to evaluate the impact of integrating a parlib-based scheduler directly into an application. Collectively, these experiments measure the ability to run both compute-bound and I/O-bound workloads on top of parlib. The experiments themselves are listed below:

Context Switch Overhead: In this experiment, we measure the overhead of performing a context switch with upthreads vs. the Linux NPTL on a varying number of cores. The goal of this experiment is to see how much faster context switches on upthreads can be performed, as well as see a breakdown of the various components that make up each context switch. We vary the number of cores to see if performing context switches on one core has any effect on simultaneously performing context switches on another core.

Flexibility in Scheduling Policies: In this experiment, we compare the performance of the Linux NPTL to different variations of our upthread library when running a simple compute-bound workload. Each upthread variant differs only in the frequency with which it makes a new scheduling decision. The goal of this experiment is to show the effect of varying the scheduling frequency, as well as demonstrate the flexibility in scheduling policies that parlib allows. Although we only vary the scheduling frequency here, these results demonstrate the overall flexibility of changing the underlying scheduling policy itself.

NAS Benchmarks: In this experiment, we measure the impact of running the full set of OpenMP benchmarks from the NAS Parallel Benchmark Suite [44] on top of upthreads vs. the Linux NPTL. The goal of this experiment is to show that parlib-based schedulers are robust enough to run well established benchmarks, rather than just the simple benchmarks we construct ourselves. Moreover, this experiment demonstrates that upthreads is capable of being integrated successfully with a general-purpose parallel runtime, i.e. OpenMP.

Kweb Throughput: In this experiment we measure the impact of running kweb on top of the Linux NPTL, upthreads, and a custom user-level scheduler built directly into kweb. Specifically, we measure the maximum achievable throughput in each of these setups (in requests/sec) when presented with a throughput-oriented workload. We also compare these results to running kweb on Akaros, as well as to running an optimized version of nginx on Linux. The goal of this experiment is to show how well parlib-based schedulers perform in the face of I/O-bound workloads. Additionally, we demonstrate the impact of integrating a scheduler directly into an application, rather than relying on a predefined underlying threading library.

All of these experiments have been run on a machine with the following specifications (here on out referred to as c99):

• 2 Ivy Bridge E5-2650v2 8-core 2.6GHz CPUs (16 cores, 32 hyperthreads total)
• 8 16GB DDR3 1600MHz DIMMs (128GB total)
• 4 Intel i350 1GbE interfaces
• 1 1TB Hard Disk Drive

Additionally, all experiments were performed on an x86_64-based Linux kernel. Specifically, we used the 3.13.0-32-generic kernel with a set of patches provided by Andi Kleen [45] to enable calling the x86 rdfsbase and wrfsbase instructions from userspace. Although not yet enabled in Linux by default, these instructions are essential for performing high-speed, user-level context switches on x86_64-based hardware. Since all of our experiments are run on this hardware, we provide an overview of why these instructions are so important below. Thereafter, we discuss the results of each of the experiments themselves.

Swapping TLS on x86

As mentioned briefly in Section 2, the introduction of the rdfsbase and wrfsbase instructions marks a significant milestone toward realistic support for user-level threading on x86 platforms. These instructions provide a fast path in user-space for reading and writing the base address of the fs-segment register on an x86 cpu. Modern (64-bit) x86 compilers use the base address of the fs-segment register to provide thread-local storage (TLS) capabilities to an application. The base address of a TLS region maps 1-1 with the base address of the fs-segment register. Since part of the work in swapping threads involves swapping TLS, speeding up access to the base address of the fs-segment register translates to a direct speedup in thread context switch times. These instructions are available in all new and upcoming x86 cpus (ivy-bridge and beyond) [46].

Nowadays, TLS is so ingrained in libraries such as glibc that not supporting it would prohibit applications from using these libraries effectively. For example, the low-level locks in glibc rely on TLS to keep track of the current lock holder and protect access to every reentrant library call (e.g. malloc(), read(), write(), etc.). If a user-level threading package did not provide TLS it would either have to roll its own locks to protect access to these calls at the application level, or modify glibc to change the low-level lock implementation to not rely on TLS. Moreover, things like errno are stored in TLS, and many applications make use of TLS variables themselves.

Previously, the only way to access the base address of the fs-segment register was with a privileged cpu instruction. This model is fine so long as the kernel is the only one in charge of reading or writing this address. However, this model breaks down as soon as you have user-level threading packages that wish to read and write this address as well. As a workaround, Linux provides the arch_prctl() system call, which gives userspace this ability, but at the cost of making a system call for every read or write. The table below shows the cost of calling the rd/wrfsbase instructions directly, compared with using the arch_prctl() system call. We measured these costs by running each operation in a loop for 10 seconds and calculating the average cycles/operation over that interval. The measurements were taken on a single core of c99 (core 0). They have been normalized to remove the overhead associated with the loop variable required to keep track of how many times the operation was performed.

                        rd/wrfsbase instruction    arch_prctl()
Write TLS base addr     8 ns  (23 cycles)          395 ns (1025 cycles)
Read TLS base addr      2 ns  (6 cycles)           387 ns (1005 cycles)

Table 2.1: Time to read and write the TLS base address via different methods on x86_64

As the table shows, writing the base address is 45x faster via the wrfsbase instruction, and reading it is 167x faster via the rdfsbase instruction. These numbers are not surprising given that every call to arch_prctl() incurs the cost of a system call. The write is slower than the read because wrfsbase is a serializing instruction and thus stalls while flushing the cpu pipeline between subsequent writes. These results demonstrate just how important it is to have access to the rd/wrfsbase instructions for building user-level threading packages on x86 platforms.

It is worth pointing out that the need to introduce the rd/wrfsbase instructions is a highly x86-specific problem. Other architectures, such as ARM and RISC-V, simply dedicate a register to holding the TLS base address (i.e. CP15 and tp, respectively). Swapping out the TLS is then a simple load operation as with any other general-purpose register. x86 chose not to adopt this model because of the limited number of general-purpose registers it has available. Additionally, not every user-level threading package actually requires TLS capabilities (Go being a very noteworthy example). However, parlib's reliance on glibc, and our desire to support generic applications written against glibc, force us to provide proper TLS support for any threading packages built on top of parlib.
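To make the two paths concrete, the sketch below shows both ways of setting the TLS base address on x86_64 Linux. It is a minimal illustration, assuming a CPU and toolchain with FSGSBASE support (compiled with -mfsgsbase) and a kernel that permits the instruction from userspace; error handling is omitted.

#define _GNU_SOURCE
#include <immintrin.h>      /* _readfsbase_u64() / _writefsbase_u64() */
#include <sys/syscall.h>
#include <asm/prctl.h>      /* ARCH_SET_FS / ARCH_GET_FS */
#include <unistd.h>
#include <stdint.h>

/* Fast path: a single unprivileged instruction (requires FSGSBASE). */
static void set_tls_base_fast(uint64_t base)
{
    _writefsbase_u64(base);
}

/* Slow path: every call traps into the kernel. */
static void set_tls_base_slow(uint64_t base)
{
    syscall(SYS_arch_prctl, ARCH_SET_FS, base);
}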

Context Switch Overhead

In this experiment, we measured the overhead of performing a context switch with upthreads vs. the Linux NPTL on a varying number of cores. The goal of this experiment was to see how much faster context switches on upthreads could be performed, as well as see a breakdown of the various components that make up each context switch. We varied the number of cores to see if performing context switches on one core had any effect on simultaneously performing context switches on other cores. Additionally, we varied the number of threads/core to see if this had any effect as well.

We set up the experiment to run either 1, 2, 4, 8, 16, 32, or 64 threads/core, with the number of cores ranging from 1 to 32. Each thread sat in a tight loop, constantly yielding control back to the scheduler and counting the number of times it performed a context switch. The experiment ran for a duration of 5s, and at the end, we summed the total number of context switches performed across all threads on all cores. We then took this sum and divided it by the duration of the experiment to calculate the total number of context switches/second that were performed. Additionally, we took the inverse of this value and divided it by the number of cores to calculate the average context switch latency/core.

For the Linux NPTL, we pinned threads to cores via sched_setaffinity(), and then ran the experiment as described above. For upthreads, we set up per-vcore run queues and placed the appropriate number of threads into each queue before starting the experiment. For upthreads, we also ran this experiment multiple times to get a breakdown of the different components that make up a full context switch. The overhead of a full context switch is defined as the time from which pthread_yield() is called on one thread, to the time that pthread_yield() returns on a newly scheduled thread. We identify the components that make up a full context switch below.

• save-fpstate: The cost of saving the floating point state of the current thread and restoring the floating point state of the thread we switch into.
• swap-tls: The cost of swapping the TLS, which involves little more than calling wrfsbase with the TLS base address of the thread we are switching into.
• use-queue-locks: The cost of acquiring and releasing a spin_pdr_lock to access one of upthreads's per-vcore run queues. This cost represents the uncontended case of acquiring the lock every time.
• handle-events: The cost of running the handle_events() function upon dropping into vcore context. This cost represents the case when there are no events to handle.
• raw-ctxswitch: The cost of doing an entire context switch, minus each of the components defined above. This includes the overhead in calling pthread_yield(), choosing a new thread to run, saving and restoring all general-purpose registers, etc.
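The measurement loop driving this experiment is straightforward: each thread yields in a tight loop and counts how many switches it has performed. The sketch below is a minimal version of that loop; the per-thread counter array and the duration flag are assumptions about how the bookkeeping was done, not the exact benchmark code.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>

extern volatile int experiment_running;   /* cleared once the 5s window ends */
extern uint64_t switch_count[];           /* one counter per thread */

static void *yield_loop(void *arg)
{
    uint64_t id = (uint64_t)arg;
    while (experiment_running) {
        pthread_yield();                  /* the full switch path being measured */
        switch_count[id]++;
    }
    return NULL;
}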

Figure 2.6 shows the average context switch latency/core for 1, 8, 16, and 32 cores (i.e. a single core, one full socket without hyperthreading, both sockets without hyperthreading, and the full machine with hyperthreading). For upthreads, the components of the stacked bar graph represent the different components that make up a full context switch, as listed above. For the Linux NPTL, the different components of the stacked bar graph represent the cost of context switching only a single thread in and out of the kernel and context switching between two distinct threads. We distinguish between these two cases because the Linux NPTL provides a fast path when it detects that it doesn't need to actually perform a full context switch. Parlib provides a similar fast path, but it has been disabled for the purpose of this experiment, since the multiple thread case is more interesting.

Figure 2.6: Average context switch latency/core of upthreads vs. the Linux NPTL. For upthreads, we show the various components that make up the total context switch latency. For the Linux NPTL, we show the latency when context switching a single thread vs. two. Error bars indicate the standard deviation of the measurements across cores. In all cases, this standard deviation is negligible.

The major takeaways from this graph are:

• Without hyperthreading, context switches in upthreads are 3.4x faster than context switches in the Linux NPTL.
• With hyperthreading, context switches in upthreads are 3.9x faster than context switches in the Linux NPTL.
• Saving the floating point state adds a significant overhead to the overall context switch time in upthreads.
• The cost of calling the handle_events() function is negligible when there are no events to process.
• The cost of having queue locks is noticeable in the context switch overhead, even when these locks are uncontended.
• The cost of performing a TLS swap via wrfsbase is consistent with what we measured previously (~10ns).

Figure 2.7: Total context switches per second of upthreads vs. the Linux NPTL for an increasing number of cores and a varying number of threads per core.

• In the Linux NPTL, yielding a single thread is about 1.5x faster than context switching between threads (though both are significantly slower than doing a context switch on upthreads).
• Both the Linux NPTL and upthreads experience a 1.5x increase in average context switch latency when all hyperthreaded cores are used (i.e. the 32 core case).

Overall, these results are not surprising. A context switch in parlib is very lightweight and it makes sense that it would outperform Linux by such a high margin. Traditionally, user-level context switches have used the getcontext() and setcontext() calls from glibc to save and restore a context. However, these calls require you to save and restore every cpu register, as well as make a system call to save and restore a context's sigmask. On parlib, we do not need to pay these extra costs because all context switches are performed via a function call (meaning we only need to save and restore registers according to the AMD64 System V ABI), and saving and restoring the sigmask is unnecessary because POSIX signal handling is emulated in user-space (as described in Section 2.4). The end result is that context switches on parlib (and consequently upthreads) are extremely fast.
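For reference, the traditional glibc path looks roughly like the sketch below. This is a minimal illustration of the getcontext()/makecontext()/swapcontext() approach that parlib avoids, not code from parlib or upthreads.

#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];

static void worker(void)
{
    /* ... do some work, then switch back to main ... */
    swapcontext(&worker_ctx, &main_ctx);
}

int main(void)
{
    getcontext(&worker_ctx);                  /* snapshot full register state + sigmask */
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    /* Each swapcontext() saves/restores every register and makes a
     * sigprocmask system call to save/restore the signal mask. */
    swapcontext(&main_ctx, &worker_ctx);
    return 0;
}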

To demonstrate this further, Figure 2.7 shows the total number of context switches/second that can be performed on upthreads vs. the Linux NPTL. These results are gathered from the same experiment above, but presented in a different format. The major takeaways from this graph are:

• In all cases, upthreads is able to perform more context switches/second than the Linux NPTL, regardless of the number of cores in use.
• For both upthreads and the Linux NPTL, the number of context switches/second increases linearly with the number of cores.
• For both upthreads and the Linux NPTL, the slope of this linear increase is steeper between 1 and 16 cores than between 17 and 32 cores (i.e. when cores begin to use their hyperthreaded pair).
• On both sides of this knee, the slopes of the upthreads results are much steeper than the slopes of the Linux NPTL results.
• For the Linux NPTL, having only 1 thread per core is much better than having multiple threads per core.
• For both upthreads and the Linux NPTL, having more threads per core decreases the number of context switches/second that can be performed. This effect is more pronounced on upthreads than it is for the Linux NPTL.

It's not surprising that adding hyperthreads affects the slope of the linear increase, since hyperthreads naturally interfere with one another and compete for cpu time on the same core. It's also not surprising that the number of threads/core affects the number of context switches/core that can be performed. The more threads you run per core, the higher the chances are that their stacks will collide in the cache when swapping among them. All things considered, these results correspond to upthreads consistently performing between 5.5 and 5.7 times more context switches/second than the Linux NPTL under all scenarios measured. We couldn't have asked for better results.

Flexibility in Scheduling Policies

In this experiment, we compare the performance of the Linux NPTL to different variations of our upthread library when running a simple compute-bound workload across a large number of threads. We create 1024 threads to do a fixed amount of work in a loop and measure the time at which each of these threads completes. We run this experiment for the Linux NPTL as well as six variations of our upthread library. Each upthread variant differs only in the frequency with which it makes a new scheduling decision. The goal of this experiment is to show the effect of varying the scheduling frequency, as well as demonstrate the flexibility in scheduling policies that parlib allows. The work loop run by every thread can be seen below:

for (int i = 0; i < 300000; i++) {
    for (int j = 0; j < 1000; j++)
        cmb();
#ifdef WITH_YIELD
    pthread_yield();
#endif
}

Each thread runs for 300 million loop iterations, broken up across two nested loops. The outer loop contains 300,000 iterations and the inner loop contains 1000 iterations. We make this separation in order to force some of our experiments to voluntarily yield back to the scheduler every 1000 iterations via a call to pthread_yield(). We do this by setting the WITH_YIELD flag at compile time. The number of iterations was chosen arbitrarily, such that the full benchmark would take around 30s to complete for all experiments.

The application itself only calls pthread_create() 1023 times, with the main function acting as one of the overall threads. A barrier is used to make sure that every thread starts up and is sitting at a point just before we are ready to start taking measurements. Once all threads have reached the barrier, we start the clock, and whatever thread happens to be on each core starts running its work loop. In each of the upthread variants, threads are striped evenly across all of the vcores in the system before the experiment begins. With 1024 threads and 32 cores, each vcore is responsible for 32 threads. For the Linux NPTL, we simply create all 1024 threads and let the underlying scheduler figure out what to do with them. The details of each setup are described below.

upthread-no-preempt: This is a variant of upthreads set up with per-vcore run queues, vcore yielding disabled, and stealing turned off. No preemption is used, and all threads run to completion before starting the next thread. Also, the WITH_YIELD flag is not set, so the application will never voluntarily yield to the scheduler during its work loop.

upthread-{1000,500,100,10}ms: These are preemptive versions of the setup described above, with preempt periods of 1000ms, 500ms, 100ms, and 10ms respectively. Each variant runs with the same configuration as upthread-no-preempt, except that when their preemption period is up, all of their vcores are interrupted and forced to schedule another thread from their queues. Each interrupted thread moves to the end of its respective queue.

upthread-yield-1000: This variant of upthreads is identical to upthread-no-preempt, except that the WITH_YIELD flag is set at compile time, forcing the application to yield back to the scheduler every 1000 loop iterations.

Linux-NPTL: This is the native Linux NPTL implementation. We determined the scheduling period of this algorithm to be 10ms by calling sysconf(_SC_CLK_TCK). This value is set at the time the kernel is compiled, and we did not change it from its default value.
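As a concrete illustration, the snippet below shows how one of the preemptive upthread variants could be configured using the custom API from Section 2.6.1, alongside the sysconf() call used to determine the NPTL's 10ms tick. The upthread prototype is an assumption about the call's signature.

#include <unistd.h>

extern void upthread_set_sched_period(unsigned int usec);  /* assumed signature */

static void setup_scheduling_periods(void)
{
    /* The upthread-100ms variant: preempt every 100,000 microseconds. */
    upthread_set_sched_period(100 * 1000);

    /* The Linux NPTL tick rate: _SC_CLK_TCK ticks per second.
     * 100 ticks/sec corresponds to the 10ms period reported above. */
    long ticks_per_sec = sysconf(_SC_CLK_TCK);
    (void)ticks_per_sec;
}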

Figure 2.8: Thread completion times of a fixed-work benchmark for different scheduling algorithms. The benchmark runs 1024 threads with 300 million loop iterations each, averaged over 50 runs.

Figure 2.8 shows the results of running this experiment. The x-axis is time and the y-axis is the total number of threads that have completed by that time. The results themselves are an average over 20 runs. The major takeaways from this graph are:

• All upthread variants complete the full benchmark before the Linux NPTL does.
• The upthread-no-preempt implementation completes threads in a "stair-step" pattern, with 32 threads (1 thread per core) completing every 650ms. The stair-step pattern flattens out over time because the benchmark was averaged over 20 runs, and the slight variations in start times on each core compound on each iteration. The total duration of the benchmark is 650ms * 1024/32 = 20.8s.
• The upthread-1000ms implementation follows a "stair-step" pattern up to 16s and then ramps up, completing its remaining threads very quickly. The full benchmark finishes at about the same time as the no-preempt case.
• The upthread-500ms and upthread-100ms implementations show a similar pattern to the "ramp-up" phase of upthread-1000ms, but without the initial "stair-step" pattern. The ramp-up phase starts later with a preempt period of 100ms than with 500ms. In both cases, the full benchmark finishes at about the same time as the no-preempt case.

• The upthread-yield-1000 implementation is (mostly) a straight vertical line, indicating that all threads complete at roughly the same time. The overhead of explicitly yielding every 1000 iterations (300,000 yields in total) pushes the completion time of the benchmark to 23s instead of 20.8s.
• The Linux NPTL performs the worst, but follows a similar curve as the upthread-10ms implementation, which interrupts its cores at the same scheduling frequency. Although context switches are faster on upthreads, each of these context switches is paired with a high-overhead signal that makes the overall context switch cost similar to that of the Linux NPTL. The completion time of the benchmark in both of these cases is approximately 27.5s.

In general, these results are not surprising. A longer preempt period means fewer context switches, which translates to less overhead and a faster running benchmark. The upthread-no-preempt implementation has no overhead whatsoever, and is able to complete the benchmark in exactly the amount of time it takes to cycle through all threads on the number of cores available. In the end, all threads get the same amount of CPU time, but the fairness of the scheduling algorithm is not very good. Likewise, the upthread-yield-1000, upthread-10ms, and Linux-NPTL cases are not surprising either. The act of explicitly yielding in upthread-yield-1000 injects overhead into each thread's computation, which pushes the benchmark completion time out a bit. For upthread-10ms and Linux-NPTL, the yields are not explicit, but the effect is similar. In all three cases, most threads finish at roughly the same time, demonstrating a much fairer scheduling algorithm.

The upthread-1000ms, upthread-500ms, and upthread-100ms cases are a bit more interesting. The upthread-1000ms implementation follows a "stair-step" pattern up to 16s and then ramps up, completing its remaining threads very quickly. Each step is 1000ms wide, with 64 threads/sec being started, but only 32 threads/sec actually running to completion. As with the upthread-no-preempt case, the first 32 threads are able to complete their full 650ms computation within the 1000ms time window. However, the remaining 32 threads are only able to complete 350ms of their computation before being interrupted and put to the back of the queue. After 16s, all threads will have started (1024/64 = 16s), but only half of them will have completed. The ones that remain only have 350ms of computation left and are all able to complete quickly within their next scheduling quantum. Over the lifetime of the computation, each core is only interrupted 20 times, so the overhead of the timer interrupt ends up being negligible and the full benchmark finishes at about the same time as the no-preempt case.

The upthread-500ms and upthread-100ms implementations show a similar pattern to the "ramp-up" phase of upthread-1000ms, but without the initial "stair-step" pattern. No stair-step pattern exists because it takes 650ms for any round of threads to finish its computation, but the preemption timer fires before this time is reached. With a preemption period of 500ms, 64 threads/sec will be started (as before) but each will only have completed 500ms of computation before being moved to the back of the queue. After 16s, all threads will have started, but none of them will have completed yet. Each thread is then able to quickly complete its remaining 150ms of computation during its next scheduling quantum. Similar logic follows for a preempt period of 100ms, except that we start 320 threads/sec, and each thread goes through 6 rounds of scheduling before being able to complete its full 650ms of computation. As seen on the graph, this results in the "ramp-up" time starting at 1024/320 * 6 = 19.2s instead of 16s. In both of these cases, the preemption timer overhead is still negligible, as the full benchmark completes at approximately the same time as the no-preempt case.

NAS Benchmarks

In this experiment, we measured the impact of running a set of benchmarks from the NAS parallel benchmark suite on top of upthreads vs. the Linux NPTL. Specifically, we ran the full set of OpenMP benchmarks with problem sizes in class "C". Class "C" problems are the largest of the "standard" test sizes, and provide a good tradeoff between being computationally intensive and still being able to complete in a reasonable amount of time. The full set of benchmarks we ran are listed below with the type of computation they perform:

• cg - Conjugate Gradient
• bt - Block Tri-Diagonal Solver
• is - Integer Sort
• ua - Unstructured Adaptive Mesh
• lu - Lower-Upper Gauss-Seidel Solver
• mg - Multi-Grid on a Sequence of Meshes
• ep - Embarrassingly Parallel
• ft - Discrete 3D Fast Fourier Transform
• sp - Scalar Penta-Diagonal Solver

The goal of this experiment is to show that parlib-based schedulers are robust enough to run well established benchmarks, rather than just the simple benchmarks we construct ourselves. Moreover, this experiment demonstrates that upthreads is capable of being integrated successfully with a general purpose parallel runtime, i.e. OpenMP.

Because upthreads is not binary compatible with the Linux NPTL, we had to extract libgomp from gcc and recompile it against our upthread library in order to run this experiment. Additionally, we configured upthreads with upthread_can_vcore_request(FALSE), upthread_can_vcore_steal(FALSE), and upthread_set_num_vcores(OMP_NUM_THREADS). This configuration allowed us to ensure that threads created by OpenMP would be evenly distributed across exactly OMP_NUM_THREADS vcores, and that those threads would never migrate across cores. OMP_NUM_THREADS is an environment variable which tells OpenMP how many threads it should create to manage its parallel tasks. Typically this value is set to the number of cores you wish to make available to OpenMP (even when using the Linux NPTL). Under the hood, OpenMP will pin these threads to cores and multiplex its tasks on top of them. For the purpose of these experiments, we ran each benchmark with OMP_NUM_THREADS set to both 16 and 32 (i.e. all the cores on the system with and without hyperthreading).
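A minimal sketch of this configuration appears below. The upthread prototypes are assumed signatures based on the API names from Section 2.6.1, and reading OMP_NUM_THREADS out of the environment is an assumption about how the value was obtained.

#include <stdbool.h>
#include <stdlib.h>

/* Assumed prototypes for the custom upthread API. */
extern void upthread_can_vcore_request(bool enable);
extern void upthread_can_vcore_steal(bool enable);
extern void upthread_set_num_vcores(int nr_vcores);

static void configure_upthreads_for_openmp(void)
{
    const char *env = getenv("OMP_NUM_THREADS");
    int nr_vcores = env ? atoi(env) : 1;

    upthread_can_vcore_request(false);   /* no automatic vcore requests/yields */
    upthread_can_vcore_steal(false);     /* keep threads on their assigned vcores */
    upthread_set_num_vcores(nr_vcores);  /* stripe OpenMP's threads across nr_vcores queues */
}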

(a) Average Runtime (b) Speedup

Figure 2.9: Speedup and average runtime of the OpenMP-based benchmarks from the NAS Parallel Benchmark Suite on upthreads vs. the Linux NPTL. Figure (a) shows the average runtime over 20 runs when configured for both 16 and 32 cores. Figure (b) shows the speedup. The red bars represent the speedup when all benchmarks are run on 16 cores. The blue bars represent the speedup when all benchmarks are run on 32 cores.

For each value of OMP_NUM_THREADS we ran each benchmark 20 times and calculated its average runtime. Additionally, we calculated the percent speedup of running each benchmark on upthreads vs. the Linux NPTL. Figure 2.9 shows the results of these benchmarks. Figure (a) shows the runtimes on both 16 and 32 cores. Figure (b) shows the speedups. The red bars represent the speedup when all benchmarks are run on 16 cores. The blue bars represent the speedup when all benchmarks are run on 32 cores.

The major takeaways from these graphs are:
• For all benchmarks, the slowest configuration is the Linux NPTL on 32 cores.
• For all but 2 of the benchmarks, upthreads provides the fastest configuration (usually with 16 cores, but for sp it was 32).
• In the cases where the Linux NPTL is faster, upthreads is within a couple percent difference (i.e. 1.6% for bt and 0.3% for ep).
• For all benchmarks, upthreads is significantly faster on 32 cores than the Linux NPTL on 32 cores (between 21.6% and 75.6% faster). Most of the time this is true for 16 cores as well, but the effect is much less pronounced (between -1.6% and 7.6% faster).
• Most benchmarks on the Linux NPTL tend to run significantly faster on 16 cores than 32 cores. The only exception is the sp benchmark, where the Linux NPTL doesn't differ much between 16 and 32 cores (although 16 cores is still faster).
• For upthreads, the gap between 16-core and 32-core performance is sometimes big and sometimes small. In all but the sp benchmark, however, the 16-core performance is better.

As with the previous experiments, these benchmarks only tested upthread's ability to run computationally expensive workloads. Unlike those experiments, however, these benchmarks do not exploit the advantages of having a user-level threading library; the number of threads they use is always less than or equal to the number of cores on the machine. As such, upthreads is not the clear winner in this experiment as it was for the previous two experiments. These results leave it inconclusive as to whether or not it is better to run OpenMP on upthreads vs. the Linux NPTL. We do, however, exercise parlib's barrier and semaphore paths extensively, as well as demonstrate that upthreads is at least competitive with the Linux NPTL when running OpenMP code.

Probably one of the more interesting results to come from this experiment is how well upthreads performed on 32 cores compared to the Linux NPTL. For whatever reason, the Linux NPTL falls apart as soon as these benchmarks start to use hyperthreaded cores. The upthread library likely does better because of the alignment tricks it uses to keep threads from interfering with one another in the cache. In practice, this property could be used to better align application data structures with those used by the threading library. Moreover, the threading library itself could be replaced with one that better matched the needs of the application itself. We expand on this idea in our discussion of the Lithe port of OpenMP in the following chapter.

Kweb Throughput

In this experiment we measured the maximum achievable throughput of a thread-based webserver we wrote, called kweb. Specifically, we compared the throughput of different variants of kweb when running on the Linux NPTL, upthreads, and a custom user-level scheduler built directly into kweb itself. We also compared these results to running an optimized version of nginx with the same workload (i.e. an event-based webserver, which is currently popular for serving static content on the web). The goal of this experiment was to show how well parlib-based schedulers perform in the face of I/O-bound workloads. Additionally, we demonstrate the impact of integrating a parlib-based scheduler directly into an application, rather than relying on an underlying threading library. Although the primary experiment is run on Linux, we provide results from running kweb on Akaros as well.

A number of different webserver architectures have emerged over the past 20 years that argue for either threads, events, or some hybrid of the two [47, 48, 49, 43]. In recent years, however, the popularity of thread-based webservers has diminished in favor of event-based servers due to performance and scalability concerns. Events themselves are not inherently better than threads, but system support for high-performance threading is lacking in systems of today, artificially favoring event-based architectures over threaded ones. One of the goals of parlib is to fix this system support so that thread-based servers can, once again, be competitive with their event-based counterparts. Just as Capriccio used the knot webserver to demonstrate high performance and scalability for thread-based servers in the early 2000s, parlib does the same with kweb today.

For our experiment, we started each variant of kweb with a maximum of 100 threads and let it run according to the policy described for each variant in Section 2.6.2. For nginx, we followed the advice of several blog posts [50, 51] to try and configure it in the most optimized way possible. Specifically, we configured it to run with epoll instead of select, the sendfile system call, logging disabled, file-caching enabled, multi-accept enabled, etc. Additionally, we configured each setup to restrict itself to a specific number of cores for the duration of each experiment. In this way, we were able to test the scalability of each webserver by varying the number of cores used to service any incoming HTTP requests. For upthreads and the custom-kweb scheduler we explicitly requested n vcores before the start of each experiment. For the Linux NPTL we used the taskset command to restrict the application to only run on the first n cores in the system. For nginx, we modified the worker_processes variable in nginx.conf to restrict it to n worker processes (with 1 event loop per process).

We ran this experiment by blasting kweb and nginx with requests from a simple http-request generator we wrote. We configured our http-request generator to operate as follows:

• Start 100 TCP connections in parallel and prepare to send 1000 http-requests/connection.
• Batch requests to allow 100 concurrent requests/connection on the wire at a time.
• After each request receives a response, place a new request on the wire.
• After each connection completes all 1000 requests, start a new connection.

For each webserver variant at each core count, we ran the http-request generator for 10 minutes and recorded the number of requests that completed every 5 seconds. We then took the average of these values over the complete 10-minute period to compute the total throughput of the webserver in requests/second.

With this setup there are always approximately 10,000 outstanding requests on the wire at any given time. Requests themselves consist of an HTTP 1.1 GET request for an empty file. By requesting a file of minimum size, we maximize the throughput we can measure before hitting the bandwidth limitations of the network card. For each request, the server sends a response consisting of a 242-byte HTTP 1.1 response header and a 0-byte body. The 242-byte header is standard for nginx, and we customized the kweb header to match this size for a fair comparison.

This workload is unrealistic in terms of requests/connection, but is great for stress-testing the webserver in terms of how fast it can process raw requests. Sending a more realistic workload would have required a more complicated setup involving multiple servers and would not have contributed much to our overall goal of demonstrating how well parlib-based schedulers are able to perform in the face of I/O-bound workloads. The results of this experiment can be seen in Figure 2.10. The major takeaways are:

• The kweb-linux-upthread and kweb-linux-custom-sched webservers have the highest peak throughput of all webserver variants measured.
• The peak throughput of these webservers is 6% greater than nginx.

Figure 2.10: Webserver throughput with 100 concurrent connections of 1000 requests each. Requests are batched to allow 100 concurrent requests/connection on the wire at a time. After each request is completed, a new one is placed on the wire. After each connection completes all 1000 requests, a new connection is started.

• The Linux-NPTL based webserver has, by far, the worst throughput of all of the webserver variants (2.7x less than upthreads/custom-sched and 2.5x less than nginx).
• In general, all of the webservers operate at their peak throughput with 7 cores, and either stay the same, or get worse with additional cores.
• The Linux-NPTL webserver is the only one to "recover" after this, steadily increasing its throughput for every new core added beyond 10.
• The kweb-linux-upthread webserver starts to see decreases in throughput after 16 cores (i.e. once hyperthreading comes into play).
• The kweb-linux-custom-sched webserver starts to see fluctuations in throughput beyond 25 cores (it is unclear exactly what causes this).

We could not have asked for better results. Our two parlib-based webservers were able to outperform an optimized event-based webserver by 6%. Even though their performance starts to fall off after a certain point, this problem is artificial, as nginx actually only spins up a maximum of 12 processes even when configured for up to 32. Kweb could easily be made to throttle itself similarly and the line on the graph would continue straight as with nginx.

Unfortunately, we are not able to surpass the peak values we see for kweb-linux-upthread, kweb-linux-custom-sched, and nginx because we start to approach the 1Gb/sec limit of our network card at these peak values. Table 2.2 shows the TX bandwidth of each webserver at its peak value when running on 7 cores (as obtained by iftop).

                             Bandwidth
kweb-linux-custom-sched      923 Mbps
kweb-linux-upthread          920 Mbps
nginx                        900 Mbps
kweb-linux                   330 - 400 Mbps

Table 2.2: CPU usage and bandwidth consumed by different variants of kweb and nginx when configured for up to 7 CPUs (i.e. when all configurations are at their peak).

We use 7 cores as our point of reference because all webservers are at their peak value with this number of cores (though it only really matters for kweb-linux-NPTL that we pick this exact value). The kweb-linux-NPTL webserver likely peaks at this value because 7 cores is just shy of utilizing one full socket without hyperthreading. Exactly why this seems to matter or why 8 isn't the magic number is currently unknown (though it's likely due to some weird scheduling constraints imposed by the Linux kernel scheduler in relation to network interrupts, etc.). Additionally, it's unclear why kweb-linux-NPTL is able to recover slightly as additional cores are added beyond 10. It is likely an artifact of how we chose to add additional cores via the taskset command. If we had started at core 8 instead of 0, for example, the curve might have looked different (though the peak would likely not have been much different than it is currently).

It's possible we could achieve higher throughputs if we ran the experiment on a NIC with a higher bandwidth limitation. However, it is unclear why nginx does not actually peak at the same value as kweb-linux-upthread and kweb-linux-custom-sched. It's possible that we have actually hit the limit of nginx (although this is doubtful). It's more likely that we simply haven't optimized nginx as well as we possibly could.

What's more interesting, however, is how well kweb-linux-upthread performs compared to kweb-linux-NPTL. The application code for these 2 variants of kweb is almost identical, varying only in the way it sets timeouts on its socket read() and write() calls. These results truly show the power of user-level scheduling combined with asynchronous system calls to improve the performance of thread-based applications with I/O-bound workloads.

For reference, Figure 2.11 shows the results of running this experiment on Akaros with its version of upthreads and the custom-kweb scheduler. We only show the results from running the experiment on the first 16 cores, as beyond this value the throughput falls to 0. We also show the kweb-linux-NPTL curve as a point of reference to see just where the throughput on Akaros falls in comparison to Linux. As mentioned previously, the poor results on Akaros aren't the fault of kweb or parlib, but rather a poorly performing networking stack under the hood.

Figure 2.11: Results from running the kweb throughput experiment on Akaros. The results are much worse due to a substandard networking stack on Akaros. The Linux NPTL results are shown for reference.

One thing we are able to take away from this experiment, however, is that kweb-akaros-custom-sched has a higher peak throughput than kweb-akaros. This shows that the optimizations we put into this custom scheduler actually do serve to help the performance of kweb overall. We weren't able to see this on Linux because we hit the limitations of the networking card before these results could become obvious. As the Akaros networking stack improves, we expect these results to improve as well.

2.7 Discussion

In this chapter we presented parlib, a framework for developing efficient parallel runtimes based on user-level scheduling. The two most important abstractions provided by parlib are the vcore and uthread abstractions. Vcores provide uninterrupted, dedicated access to physical cores, and uthreads provide user-level threads to multiplex on top of those cores. These abstractions, combined with a fully asynchronous system call interface and a unified event delivery mechanism, make it possible to construct highly efficient parallel runtimes on modern day multi-core architectures.

Ports of parlib exist for both Linux and Akaros, though both systems currently have some deficiencies. On Linux, many of parlib's abstractions are "faked" due to limitations of the underlying OS. On Akaros, disjoint parts of the system adversely affect parlib's performance, making it hard to evaluate. Despite these deficiencies, the results presented in Section 2.6 affirm that parlib is a system that warrants further exploration.

To underscore this further, it's useful to point out a few experiments from Barret Rhoden's dissertation that evaluate parts of Akaros that have a direct impact on the abstractions provided by parlib. Specifically, we discuss the results from his FTQ, "pdr", and provisioning/allocation benchmarks.

The Fixed Time Quantum (FTQ) benchmark was used to evaluate how well Akaros could isolate cores compared to Linux. This benchmark indirectly shows the effectiveness of the vcore abstraction in terms of handing out dedicated cores to an application. The less noise on a core, the more isolated it is when Akaros hands it out. All things equal, the FTQ benchmark shows that Akaros has an order of magnitude less noise and fewer periodic signals on its cores than Linux. The noise that does exist can mostly be attributed to improperly initialized hardware and unavoidable interrupts from System Management Mode (SMM) on the x86 hardware used in the experiment.

The "pdr" benchmark was used to evaluate the "Preemption Detection and Recovery" locks discussed in Section 2.2.3. Preemption itself is only superficially discussed in this chapter, but handling it properly is crucial for full support of the vcore abstraction and its ability to provide true dedicated access to underlying cores. Overall, the results of these experiments show that these locks are, in fact, able to properly recover from preemption. Additionally, they are shown to have a modest overhead compared to traditional spinlocks.

Finally, the provisioning/allocation benchmark evaluates Akaros's ability to fluidly pass cores back and forth between applications according to a specific core provisioning policy. Provisioning itself was briefly discussed in Section 1.2.2 and is crucial to the overall goal of efficiently managing cores in future systems. However, none of the user-level schedulers we have built so far are smart enough to make intelligent use of provisioned cores. Provisioning itself is most useful for managing cores between low-latency and batch jobs, and future user-level schedulers we build will need to take these considerations into account to automatically adjust their core requests to accommodate these types of workloads.

In general, the benchmarks described above all demonstrate features of parlib that Akaros is able to support, but Linux is not.
Although it’s conceivable that Linux could add support for true dedicated access to cores and the preemption capabilities that come with it, it’s likely that this will not happen any time soon. In the end, we believe Akaros is the way forward, but Linux provides a good stepping stone for demonstrating the ideas in their current form. As a final note, it’s necessary to point out that parlib is designed specifically for appli- cations that rely on a single runtime to manage all of its parallelism. But what happens when an application requires multiple runtimes? How do these runtimes coordinate access to cores requested through the vcore abstraction? How do they enforce unique scheduling policies for their own particular set of threads? Our answer to these questions is Lithe,and it is the subject of the following chapter. 75

Chapter 3

Lithe

This chapter presents Lithe: a framework for composing multiple parallel runtimes. Where parlib provides abstractions to build individual parallel runtimes, Lithe provides abstractions that allow multiple parallel runtimes to interoperate in harmony. Lithe was originally proposed by Pan, et al. in 2010 [3, 4], and all work presented here is an extension of this previous work.

Figure 3.1 shows a representative application of the type targeted by Lithe. This application consists of four distinct components, each of which is developed using a different parallel runtime. Such an application is typical in today's open source community, where highly-tuned libraries designed for specific tasks are readily available. Application writers are able to leverage these libraries instead of writing their own from scratch.

However, arbitrarily composing libraries from a disparate set of developers can sometimes lead to performance problems (even if each individual library is highly-tuned itself). Different developers have different preferences about which parallel runtime they prefer to use. Moreover, different parallel runtimes provide different programming models that tailor themselves to certain types of computation. By choosing the best runtime for the job, individual libraries have the potential to perform extremely well when run in isolation. However, combining runtimes from multiple libraries can cause performance problems if those runtimes interfere with one another by obliviously using the same physical cores. Individual parallel runtimes tend to assume they are the only runtime in use by an application and configure themselves to run on all cores in the system. Indeed, Intel's Thread Building Blocks Library (TBB), OpenMP, and Google's Go language runtime all configure themselves to run this way by default. As such, using multiple parallel runtimes in a single application can lead to over-subscription of cores and sub-optimal scheduling of threads mapped to those cores. This problem is exacerbated by the fact that many parallel computations rely on barriers and other synchronization primitives, where waiting an arbitrary amount of time for any given thread can have major consequences on the performance of the overall computation. Hooks typically exist to statically partition cores to each runtime, but tuning them properly for each application can be time consuming and error prone. Moreover, the optimal number of cores required by each runtime may change throughout the lifetime of an application, so creating fixed partitions may cause more harm than good.

Figure 3.1: Representative application of the type targeted by Lithe. Different components are built using distinct parallel runtimes depending on the type of computation they wish to perform. These components are then composed together to make up a full application.

For example, consider an application written primarily in Go that makes calls to the Math Kernel Library (MKL) [52] to execute an optimized linear algebra routine. Under the hood, MKL invokes TBB, which spreads its computation across all the cores it sees available in the system. Meanwhile, Go schedules goroutines on all of the cores it sees available, unaware that TBB is trying to do the same. Both runtimes operate independently and provide no opportunity for coordinating access to cores, or otherwise communicating with each other. Neither traditional OS interfaces, nor the interfaces provided by parlib, provide facilities to directly address the problems that can arise from this setup.

Parallel runtimes based on traditional OS threading abstractions (e.g. Linux NPTL) typically spawn one pinned OS thread per core and multiplex whatever parallel tasks they have on top of them (similar to how parlib emulates the vcore abstraction on Linux). However, the existence of multiple parallel runtimes causes thrashing in the OS as it must multiplex threads from all runtimes on top of the same set of underlying cores. Static partitioning alleviates this problem somewhat, but causes another set of problems. Some of the cores partitioned to one runtime might end up sitting idle while other runtimes could have used them to do useful work. Furthermore, one runtime might want to voluntarily lend some of its cores to another runtime in order to speed up a computation being performed on its behalf. Such a feature would have been useful in the example of Go and TBB provided above.

[Figure 3.2 diagram: (a) traditional parallel runtimes scheduled directly by the operating system on physical cores; (b) parallel runtimes multiplexed over vcores by a user-level scheduler built on parlib; (c) parallel runtimes sharing harts through Lithe, which sits on top of parlib.]

Figure 3.2: The evolution of parallel runtime scheduling from traditional OS scheduling in the kernel (a), to user-level scheduling with parlib (b), to user-level scheduling with Lithe (c). Lithe bridges the gap between providing an API similar to parlib's for building customized user-level schedulers, and allowing multiple parallel runtimes to seamlessly share cores among one another.

As discussed in the previous chapter, parlib provides developers with a set of tools to gain direct access to cores and schedule user-level threads on top of them. Using these tools, individual runtimes could be modified to implement their own user-level threading abstractions on top of parlib. Moving each runtime's thread scheduler into the runtime itself helps to solve the problem of multiplexing threads from different runtimes on the same underlying physical cores. However, different runtimes still have no way of communicating with each other in order to coordinate the sharing of cores. One runtime could monopolize all the cores on the system, oblivious to the fact that other runtimes would like to use them. Alternatively, all runtimes could be configured to run on top of a single underlying user-level scheduler (e.g. our upthread library described in Section 2.6.1). However, this would only serve to push the problem seen on existing systems up a level, as thrashing would now occur inside the user-level scheduler instead of in the OS.

Lithe bridges the gap between providing an API similar to parlib's for building customized user-level schedulers and allowing multiple parallel runtimes to seamlessly share cores among one another. Using this API, each runtime is able to maintain full control of its own thread scheduling policy, while keeping the underlying physical cores from being over (or under) subscribed. Figure 3.2 shows the difference between running multiple parallel runtimes in a traditional OS, running them directly on parlib, and running them on Lithe.

Using Lithe’s API, legacy runtimes can be ported into the Lithe framework to provide bolt-on composability without the need to change existing application code. For example, when used in the context of a multi-frontal, rank-revealing sparse QR factorization algorithm (SPQR) [30], Lithe has been shown to outperform versions of the code that hand-tuned the static allocation of cores to each of its underlying parallel runtimes [3] (i.e. TBB and OpenMP). Lithe not only allowed the application to perform better, but also removed the need for fragile manual tuning. Similarly, Lithe has been shown to improve the performance of a Flickr-Like Image Processing Server [4] and a Real-Time Audio Application [3]. Although not originally proposed as such, Lithe can be thought of as a natural extension to parlib, with the added capability of coordinating access to cores among multiple runtimes. In Lithe terminology, the most fundamental primitive it provides is the hart abstraction. Harts (short for HARdware-Thread contexts) represents physical processing resources that can be explicitly allocated and shared within an application. Typically a hart is just a core, but it could be one half of a hyper-threaded core or a single lane in a multi-lane vector processor. In the context of parlib, harts map 1-1 with vcores. Lithe also defines primitives for contexts and schedulers,whichmap1-1withparlib’snotionofuthreadsanduser-level schedulers respectively. However, Lithe’s original reference implementation was written before parlib existed, and little e↵ort was made to ensure that it could easily be ported to run on any system other than Linux. Moreover, not all of Lithe’s proposed features were included in this reference implementation (e.g. the ability to interpose on system calls to ensure that a scheduler doesn’t lose its core when blocking system calls are made). The idea was that Lithe would be the lowest-level library that developers would port, leaving the implementation of its hart, context and scheduler abstractions intimately intertwined. With the introduction of parlib, we are now able to separate those features which pertain only to hart and context management from Lithe’s primary innovation: it’s extended scheduler API. We map harts 1-1 with vcores, contexts 1-1 with uthreads, and Lithe’s scheduler API becomes a superset of the scheduler API provided by parlib. In this chapter we present our port of Lithe to run on top of parlib. In the process, we have cleaned up many of Lithe’s outward facing interfaces and made its usage semantics much more consistent. The result is a better designed, cleaner implementation of Lithe that runs on any platform that supports parlib (i.e. Linux and Akaros). We begin with an overview of Lithe, including the details of how we have re-architected it to run on top of parlib. We then delve into the details of Lithe itself and discuss the new API it presents to developers (and how this di↵ers from both the original Lithe API and the API provided by parlib). Next, we discuss the development of a generic “Fork-Join” scheduler on top of which the OpenMP, TBB, and a new “lithe-aware” version of upthreads are now all based. Finally, we evaluate our port of Lithe by re-running the SPQR benchmark and an expanded version of the Flickr-like image processing application presented in the original Lithe paper [4]. Just as with parlib, we perform these experiments exclusively on Linux to provide a better comparison with existing systems. 
In both experiments our new version of Lithe performs extremely well. CHAPTER 3. LITHE 79

[Figure 3.3 diagram: parlib's Thread Support Library and Asynchronous Services (vcore API, uthread API, synchronization primitives, syscalls, alarms, POSIX signals, and the event API) shown (a) directly beneath a single parallel runtime, and (b) beneath Lithe, which exposes its Hart, Context, and Scheduler APIs to multiple parallel runtimes.]

Figure 3.3: Lithe interposing on parlib to expose its own API for parallel runtime development. Figure (a) shows a single parallel runtime sitting directly on parlib. Figure (b) shows Lithe interposing on the parlib API to support the development of multiple, cooperating parallel runtimes.

3.1 Architectural Overview

Lithe itself is not a parallel runtime. However, from the perspective of parlib, Lithe acts as a parallel runtime, making library calls and implementing the necessary callbacks to request and yield vcores, manage uthreads, etc. Lithe does this in order to interpose on the API presented by parlib and expose its own API for parallel runtime development. Where parlib's API supports the development of a single parallel runtime, Lithe's API supports the development of multiple parallel runtimes that inter-operate in harmony. Figure 3.3 shows this interposition.

Lithe defines three basic primitives: harts, contexts, and schedulers. In terms of parlib, harts map 1-1 with vcores, contexts map 1-1 with uthreads, and schedulers map 1-1 with the notion of a user-level scheduler embedded in a parallel runtime. Just as with parlib, "lithe-aware" schedulers request access to harts and multiplex contexts on top of them. However, multiple schedulers can be active at a time (each with their own unique set of contexts), and Lithe facilitates the sharing of harts between them.

Schedulers are arranged in a hierarchy, with Lithe itself implementing a simple base scheduler at the root. The main thread of an application is associated with the base scheduler, and new schedulers are added to the hierarchy as function calls are made into libraries managed by different "lithe-aware" runtimes.

[Figure 3.4 diagram: (a) an example program in which main() enters an upthread scheduler and spawns renderScene() and playMusic() contexts, which in turn call into Graphics (OpenCL), AI (TBB), Audio (custom scheduler), and Physics (OpenMP) runtimes; (b) the resulting scheduler hierarchy rooted at Lithe's base scheduler.]

Figure 3.4: An example scheduler hierarchy in Lithe. Figure (a) shows an example program invoking nested schedulers. Figure (b) shows the resulting scheduler hierarchy from this program. The Graphics and AI schedulers originate from the same point because they are both spawned from the same context managed by the upthread scheduler.

In contrast to normal function calls, these calls can be thought of as "parallel" function calls because they trigger a new scheduler to be spawned and will not return until that scheduler has completed whatever task it was created to do. In the meantime, the newly spawned scheduler is free to request harts and multiplex contexts on top of those harts in order to complete its computation. Once the computation is complete, the scheduler removes itself from the hierarchy and returns from the original function call. The Lithe API exists to help developers implement their parallel runtimes such that they operate in this "lithe-aware" fashion. This is discussed in more detail in the following section.

The scheduler hierarchy itself can grow arbitrarily deep (and arbitrarily wide) as contexts from one parallel runtime continually make calls into libraries governed by other parallel runtimes. Figure 3.4 shows an example of such a hierarchy, based on the schedulers depicted in Figure 3.1. At any given time exactly one scheduler is active on a given hart. As new schedulers are added to the hierarchy, they implicitly inherit whatever hart they are spawned from and begin executing immediately. Additional harts can be requested by each scheduler, but, unlike with parlib, these requests are not made to the system directly.

[Figure 3.5 diagram: pseudo-code traces on two harts, showing a child context calling lithe_hart_request() in uthread context, yielding to vcore context so the parent's hart_request() callback can record the pending request, and the parent later calling lithe_hart_grant() from its hart_enter() callback to hand a hart to the child.]

Figure 3.5: The process of requesting and granting a hart in Lithe. A child scheduler first puts in a request through lithe_hart_request(), which triggers a hart_request() callback on its parent. The parent uses this callback to register the pending request and grant a new hart to its child at some point in the future. The initial request is initiated in uthread context, but all callbacks and the subsequent lithe_hart_grant() occur in vcore context.

Instead, child schedulers request harts from their parent, and parents explicitly grant harts to their children. The only exception is the base scheduler, which actually does request harts directly from the system through parlib's vcore API. In this way, parent schedulers are able to precisely control how many harts they would like to lend to their children in order to perform computations on their behalf.

Just as with parlib, individual schedulers use the harts they have been granted to spawn contexts and schedule them for execution. As such, harts transition between running code in vcore context and running code in uthread context, with the additional notion of a current scheduler attached to the code being executed. Lithe stores the value of the current scheduler in a TLS variable associated with the hart and updates it every time a hart changes hands in the scheduler hierarchy. Tracking the value of current_scheduler is important because Lithe needs to know which scheduler to trigger a callback on whenever a context blocks or a notification comes in through parlib's asynchronous event delivery mechanism. Once a scheduler is done with a hart, it yields it back to its parent. In this way, harts trickle up and down the scheduler hierarchy as children continually request and yield harts back to their parent. Eventually harts make their way back to the base scheduler and are yielded to the system through parlib's vcore API.

Individual schedulers do not make direct calls into other schedulers. Instead, all calls are made through the Lithe API, and Lithe dispatches those calls to the appropriate scheduler in the form of callbacks. For example, consider Figure 3.5, which shows the process of requesting a hart through the Lithe API. First, a context running on behalf of a child scheduler issues a call to lithe_hart_request() to request an additional hart. Internally, Lithe implements this call to yield to vcore context and trigger a hart_request() callback on the scheduler's parent. Since the parent does not own the current hart (the child does), the parent uses this callback to simply register the request in a hash table and returns. At some point later, a hart becomes available in the parent, and it checks to see if it has any pending requests from its children. If it does, it calls lithe_hart_grant() to transfer control of the hart to its child. Internally, Lithe updates the value of current_scheduler and triggers the hart_enter() callback on the child. The child is then free to do with the hart as it pleases (i.e. use it for its own computation or pass it down to its own child schedulers if it has any). In this way, the hart_enter() callback is analogous to the sched_entry() callback in parlib.

Just as with parlib, some of Lithe's API calls can be made from either uthread context or vcore context, but all callbacks run exclusively in vcore context. As we saw in the example above, a library call made in uthread context will transition to vcore context before the callback is issued. This differs from the original Lithe design, which had no notion of vcore context. Transition stacks were sometimes used to execute certain callbacks, but their use was inconsistent and non-uniform. By running all callbacks in vcore context, developers have a consistent notion of what they can and can't do in these callbacks (see Section 2.2.1 for a reminder of the restrictions placed on code running in vcore context). In the following sections, we elaborate on the full set of API calls provided by Lithe.

3.2 The Lithe API

Lithe provides a bi-directional API similar to parlib's. Developers use this API to build "lithe-aware" runtimes that expose an outward facing API of "parallel" function calls (of the type described in the previous section). On one side of this API are library calls, implemented by Lithe, that parallel runtimes call in order to have Lithe perform an operation on their behalf. On the other side are callbacks, which are declared by Lithe, but implemented by each of the schedulers that get spawned whenever a "parallel" function call is invoked. Depending on the library call, these callbacks are triggered on either the current scheduler, its parent, or one of its children in the hierarchy of schedulers maintained by Lithe.

Many of Lithe's library calls and callbacks mimic those defined by parlib's vcore and uthread APIs. Examples include the ability to request and yield harts (vcores), start and stop contexts (uthreads), etc. However, Lithe provides additional calls and callbacks that allow individual schedulers to track other schedulers in the scheduler hierarchy and explicitly transfer harts among them. Table 3.1 shows the full set of Lithe library calls and the corresponding callbacks they trigger. Each callback is annotated with the particular scheduler on which it is triggered (i.e. parent, child, or current scheduler).

Hart Management

  Library Call                                   Callback
  lithe_hart_request(amt)                        parent->hart_request(amt)
  lithe_hart_grant(child)                        child->hart_enter()
  lithe_hart_yield()                             parent->hart_return()
                                                 parent->hart_enter()

Scheduler Management

  Library Call                                   Callback
  lithe_sched_init(sched, callbacks, main_ctx)   -
  lithe_sched_enter(child)                       parent->child_enter(child)
                                                 child->sched_enter()
  lithe_sched_exit()                             parent->child_exit(child)
                                                 child->sched_exit()
  lithe_sched_current()                          -

Context Management

  Library Call                                   Callback
  lithe_context_init(ctx, entry_func, arg)       -
  lithe_context_recycle(ctx, entry_func, arg)    -
  lithe_context_reassociate(ctx, sched)          -
  lithe_context_cleanup(ctx)                     -
  lithe_context_run(ctx)                         -
  lithe_context_self()                           -
  lithe_context_block(callback, arg)             current_scheduler->context_block(ctx)
                                                 callback(ctx, arg)
                                                 current_scheduler->hart_enter()
  lithe_context_unblock(ctx)                     current_scheduler->context_unblock(ctx)
                                                 current_scheduler->hart_enter()
  lithe_context_yield()                          current_scheduler->context_yield(ctx)
                                                 current_scheduler->hart_enter()
  lithe_context_exit()                           current_scheduler->context_exit(ctx)
                                                 current_scheduler->hart_enter()

Table 3.1: The full set of Lithe library calls and the corresponding callbacks they trigger. The calls are organized by those that pertain to hart management, scheduler management, and context management respectively. The callbacks are annotated with the particular scheduler on which they are triggered (i.e. child, parent, or current scheduler).

Before explaining each of these calls in detail, it is useful to walk through an example of how a parallel runtime can use them to implement a “parallel” function call. The pseudo code below demonstrates this:

void parallel_function() {
  lithe_context_t main_ctx;
  sched = sched_alloc();
  lithe_sched_init(sched, callbacks, &main_ctx);
  lithe_sched_enter(sched);
  for (int i = 0; i < num_contexts; i++) {
    /* The loop body and the remainder of this function are reconstructed from
     * the description that follows: create and enqueue the contexts, request
     * harts, join, and exit the scheduler. Helper names are illustrative. */
    ctx = context_alloc();
    lithe_context_init(ctx, do_work, NULL);
    enqueue_context(sched, ctx);
  }
  lithe_hart_request(num_contexts);
  join_all_contexts(sched);
  lithe_sched_exit();
}

The basic logic flows as follows:

• Enter a new scheduler: Allocate a new scheduler, initialize it, and enter it. Upon entering the scheduler, Lithe will transfer control of the current hart over to it, add it to the scheduler hierarchy, and set the value of current_scheduler to it. The scheduler itself is initialized with a full set of callback functions and a place to store the current context in case it is blocked at some point in the future.

• Create and enqueue num_contexts contexts: Allocate num_contexts contexts, initialize them to perform the do_work() function, and add them to a queue associated with the newly allocated scheduler. This code assumes that the scheduler's hart_enter() callback is implemented such that it knows how to dequeue contexts from this queue and execute them (a sketch of one such hart_enter() callback follows this list).

• Request num_contexts harts: As these harts trickle in, their corresponding hart_enter() callbacks will be called and they will dequeue contexts and execute them.

• Join on all outstanding contexts: This code assumes that the join operation will block the currently executing context until all other contexts have completed. Upon blocking, Lithe will trigger the current hart's hart_enter() callback, giving it the opportunity to dequeue a different context and execute it. After all other contexts have completed, the blocked context will be unblocked and allowed to continue executing.

• Exit the current scheduler: Internally, Lithe will remove the current scheduler from the scheduler hierarchy and set the value of current_scheduler to that scheduler's parent (i.e. the original scheduler that first initiated the parallel function call).

• Return from the parallel function call.
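The hart_enter() callback assumed by this example might look like the following sketch. This is illustrative only: the "extended" example_sched_t type, its run queue, and the dequeue_context() helper are assumptions (not part of the Lithe API), and the callback signature is shown in the simplified form used throughout this chapter.

static void example_hart_enter(void)
{
  /* Runs in vcore context whenever this scheduler is handed a hart (or
   * regains one) and must decide what that hart should do next. */
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  lithe_context_t *ctx = dequeue_context(&self->run_queue);

  if (ctx != NULL)
    lithe_context_run(ctx);   /* start or resume the next runnable context */
  else
    lithe_hart_yield();       /* nothing left to run: return the hart to our parent */
}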

3.2.1 Lithe Callbacks

A lithe_sched_funcs struct is declared to hold the full set of callbacks that schedulers spawned from a "lithe-aware" runtime must implement.

struct lithe_sched_funcs {
  void (*hart_enter)();
  void (*hart_request)(child, amt);
  void (*hart_return)(child);
  void (*child_enter)(child);
  void (*child_exit)(child);
  void (*sched_enter)(child);
  void (*sched_exit)(child);
  void (*context_block)(context);
  void (*context_unblock)(context);
  void (*context_yield)(context);
  void (*context_exit)(context);
};

Conceptually, these callbacks serve a similar purpose as the callbacks defined in parlib's global sched_ops struct. However, unlike sched_ops, which can only be implemented by a single scheduler, Lithe allows multiple schedulers to coexist, each with their own definitions of the lithe_sched_funcs callbacks. Lithe itself implements parlib's global sched_ops struct and serves to dispatch the lithe_sched_funcs callbacks to the proper scheduler on a given hart. As we saw in the example above, these callbacks are associated with a scheduler at the time it is initialized through a call to lithe_sched_init().
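To make this dispatch role concrete, the sketch below shows how Lithe's implementation of one of parlib's callbacks might forward to the current scheduler. The function name and the exact signatures are simplified assumptions rather than Lithe's actual source; the point is the per-hart current_scheduler lookup followed by an indirect call through the funcs table.

/* Lithe registers its own implementation of parlib's sched_ops once; every
 * callback it receives is forwarded to whichever lithe-aware scheduler
 * currently owns the hart. */
static void lithe_vcore_entry(void)
{
  /* current_scheduler is the per-hart TLS variable maintained by Lithe. */
  current_scheduler->funcs->hart_enter();
}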

Callback Types

In general, these callbacks are triggered by Lithe under one of three conditions:

1. To transfer control of a hart from one scheduler to another.

2. To update state in a scheduler other than the current scheduler without actually transferring control of the hart to it.

3. To yield control to the current scheduler in vcore context when blocking, yielding, or exiting a lithe context.

When a scheduler makes a call that transfers control of a hart to another scheduler, Lithe internally updates the value of current_scheduler and triggers hart_enter() on the scheduler receiving the hart. The library calls that trigger this type of callback are lithe_hart_grant() (to transfer control from parent to child) and lithe_hart_yield() (to transfer control from child to parent). The original version of Lithe aptly referred to this type of callback as a transfer callback.

When a scheduler makes a call to update state in another scheduler, Lithe internally drops into vcore context, triggers the callback on the other scheduler, and then restores the original calling context on the hart after the callback completes. An example of this type of callback is hart_request(), which is triggered on a parent scheduler in response to a child issuing a lithe_hart_request() call. As we saw in the example from Figure 3.5, this callback is implemented by the parent to simply mark down the request for later and return. Only once an actual hart becomes available at some point in the future is a transfer of control initiated. Another example is the child_enter() callback, which informs a parent scheduler when one of its contexts spawns a new child scheduler in response to making a "parallel" function call. This callback gives the parent scheduler the opportunity to track the child until a subsequent child_exit() callback is triggered. The original version of Lithe aptly referred to this type of callback as an update callback.

Callbacks that trigger a transfer of control to the current scheduler include context_block(), context_yield(), and context_exit(). Each of these callbacks is triggered in direct response to lithe_context_block(), lithe_context_yield(), and lithe_context_exit() respectively. After each of these callbacks is triggered, an implicit call to current_scheduler->hart_enter() is triggered to force the current scheduler to make a new scheduling decision.

Handling Asynchronous Events

In general, all of Lithe's callbacks are paired with one or more library calls and are only ever triggered in direct response to issuing one of those calls. However, Lithe may also periodically trigger the context_unblock() callback in response to an asynchronous event from one of parlib's asynchronous services, independent of issuing a direct library call.

Normally the context_unblock() callback is issued just like any other callback: in response to a lithe_context_unblock() library call. However, Lithe also triggers it in response to syscall completion events and events notifying Lithe about core preemption as well. Recall from Section 2.3 that syscall completion events can be used to wake uthreads that have "blocked" on a specific syscall in user-space. Likewise, core preemption events hold a reference to whatever uthread was running on a core at the time it was preempted (assuming it was running in uthread context at all).

Lithe implements handlers for these events to trigger the context_unblock() callback for the uthread associated with each of these events. For syscall completion events, Lithe unblocks the context waiting on the syscall; for preemption events, Lithe unblocks the preempted context. In both of these cases, the event handlers may end up running on a hart whose current scheduler is different from the scheduler associated with the context being unblocked. When this occurs, the value of current_scheduler is temporarily set to the scheduler associated with the blocked context before issuing a context_unblock() callback on that scheduler. Much like the uthread_runnable() callback from parlib, context_unblock() is an update callback and is used to unblock a context and make it schedulable again (e.g. put it on a queue for later execution, but not start running it right now). Once this callback returns, the value of current_scheduler is restored, and the hart continues where it left off in the original scheduler.

This use of the context_unblock() callback is a special case in Lithe, and is the only callback that resembles something like the externally triggerable callbacks from parlib. In general, Lithe itself takes care to implement all of the externally-triggerable callbacks from parlib and only issues callbacks to "lithe-aware" schedulers in response to direct library calls. In the following section we discuss each of these callbacks alongside the library calls that actually trigger them.
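As an illustration of this event-handling path, a syscall-completion handler inside Lithe might look roughly like the sketch below. The handler signature, the lookup helpers, and the way the callback is invoked are all assumptions made for the sake of the example; only the temporary current_scheduler switch mirrors the behavior described above.

/* Hedged sketch: unblocking a context from an asynchronous event handler.
 * The event may arrive on a hart owned by a different scheduler, so the
 * current_scheduler TLS variable is switched for the duration of the
 * update callback and then restored. */
static void handle_syscall_complete(struct event_msg *msg)
{
  lithe_context_t *ctx = context_waiting_on(msg);   /* hypothetical lookup */
  lithe_sched_t *owner = context_scheduler(ctx);    /* scheduler that owns ctx */
  lithe_sched_t *saved = current_scheduler;

  current_scheduler = owner;
  owner->funcs->context_unblock(ctx);   /* update callback: re-enqueue only */
  current_scheduler = saved;            /* resume where the hart left off */
}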

3.2.2 Lithe Library Calls

Table 3.1 lists the full set of Lithe library calls. As we saw in the example of implementing a "parallel" function call above, lithe-aware runtimes use these library calls to create new schedulers, request/yield harts, spawn new contexts, etc. As such, we separate our discussion of these calls into scheduler management, hart management, and context management respectively. All calls are made in the context of the current scheduler, but depending on the call, one or more callbacks may be triggered on either the current scheduler, its parent, or one of its children in the scheduler hierarchy. We discuss each of these library calls and the callbacks they trigger below.

For reference, we also provide a table that summarizes which library calls in the new Lithe API correspond to library calls in the original Lithe API from 2010 [3]. All of the original calls have corresponding calls in the new API that implement similar functionality, but many of their names have changed to more adequately describe the functionality they provide. Additionally, the new API has been expanded to include calls that didn't exist in the original API. This information is provided without further discussion in Table 3.2 at the end of this section.

Scheduler Management

The scheduler management functions serve to initialize new lithe-aware schedulers, as well as enter and exit them every time a new "parallel" function call is made. To facilitate this, Lithe defines a basic lithe_sched_t type that it operates on internally. However, Lithe never allocates this structure itself. Instead, Lithe follows a pattern similar to the one used for the uthread_t data structure defined by parlib. Parallel runtimes implemented on top of Lithe "extend" the lithe_sched_t struct, allocate space for it themselves, and then pass it to the various scheduler management functions discussed below. In this way, arbitrary schedulers can be defined with their own custom fields, but references to them can be passed to Lithe and treated as basic lithe_sched_t types. When these schedulers are later passed to their corresponding runtimes via callbacks, they can be cast to the proper type and operated on accordingly. Lithe defines something similar for its contexts with the lithe_context_t type, which, in turn, is already extended from parlib's basic uthread_t type.
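The example_sched_t type assumed in the earlier hart_enter() sketch illustrates this "extension" pattern; it might be declared as follows. All of the runtime-specific fields (the lock, run queue, child list, and pending-request bookkeeping) are illustrative assumptions; the only requirement the pattern imposes is that the embedded lithe_sched_t and lithe_context_t come first so pointers can be safely cast back and forth.

/* A hypothetical runtime-specific scheduler and context that "extend" the
 * basic Lithe types. Lithe itself only ever sees the embedded first fields. */
typedef struct example_sched {
  lithe_sched_t sched;          /* must come first */
  spinlock_t lock;
  context_queue_t run_queue;    /* runnable contexts for this scheduler */
  child_list_t children;        /* child schedulers, for hart_request tracking */
  pending_harts_t pending;      /* outstanding hart requests from children */
} example_sched_t;

typedef struct example_context {
  lithe_context_t context;      /* itself "extended" from parlib's uthread_t */
  void *task;                   /* runtime-specific per-context state */
} example_context_t;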

We saw a limited version of these structures being allocated in the pseudo code for the example "parallel" function call discussed previously. A more complete example will be seen with the definition of the lithe_fork_join_sched_t and lithe_fork_join_context_t structures of the "Fork-Join" scheduler discussed in Section 3.3.

  Library Call                                   Callback
  lithe_sched_init(sched, callbacks, main_ctx)   -
  lithe_sched_enter(child)                       parent->child_enter(child)
                                                 child->sched_enter()
  lithe_sched_exit()                             parent->child_exit(child)
                                                 child->sched_exit()
  lithe_sched_current()                          -

lithe_sched_init: This function serves to initialize a new lithe-aware scheduler for insertion into the scheduler hierarchy. As parameters, it takes references to the various structures required to later "enter" the scheduler via a call to lithe_sched_enter(). These parameters consist of an "extended" lithe_sched_t struct, a set of callbacks, and an "extended" lithe_context_t struct. The "extended" lithe_sched_t struct is used to hold a reference to the runtime-specific scheduler being initialized, and the callbacks structure defines the runtime-specific callback functions associated with that scheduler. The "extended" lithe_context_t structure is necessary for "promoting" the current context to the proper context type once the scheduler is "entered" later on. Without this promotion, the current context would continue to exist as a context type associated with the parent scheduler (which would still be extended from a basic lithe_context_t type, but may have a different size, set of fields, etc.). This would have obvious negative consequences if passed to the various context-related callbacks discussed later on. No callbacks are triggered as a result of this call.

lithe_sched_enter: This function causes a previously initialized scheduler to be added to the scheduler hierarchy. Internally, Lithe sets the value of current_scheduler to the new scheduler, adds it to the scheduler hierarchy, and promotes the current context to the context struct passed in when the scheduler was initialized. As a result of this call, two callbacks are issued on the current hart: one on the parent and one on the child. The parent receives the child_enter() callback (to indicate that a child has been added to the scheduler hierarchy) and the child receives the sched_enter() callback (to indicate that the current scheduler has just been added to the scheduler hierarchy). In practice, the parent uses child_enter() to add the child to a list of children so it can keep track of future hart requests that come in from that child. The child uses sched_enter() to do any runtime initialization necessary before actually taking control of the hart and beginning to execute in the context of the new scheduler.

lithe_sched_exit: This function exits the current scheduler and removes it from the scheduler hierarchy. Internally, Lithe undoes the "promotion" operation it performed previously: it reassociates the current context with the context struct of the parent that was taken over upon entering the scheduler. It then sets the current scheduler back to the parent, waits for all harts from the child to be yielded, and removes the child from the scheduler hierarchy. Just as lithe_sched_enter() triggered two callbacks (one on the parent and one on the child), lithe_sched_exit() follows a similar pattern. The parent receives the child_exit() callback (to indicate that a child has been removed from the scheduler hierarchy) and the child receives the sched_exit() callback (to indicate that the current scheduler has just been removed from the scheduler hierarchy). In practice, the parent uses child_exit() to remove the child from its list of children, and the child uses sched_exit() to clean up any initialization done in sched_enter() (e.g. free any allocated memory, etc.).

lithe_sched_current: This function simply returns a pointer to the current scheduler on the current hart. It is nothing more than an API wrapper around the current_scheduler TLS variable on the hart. No callbacks are triggered as a result of this call.
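For illustration, a parent scheduler's child_enter() and child_exit() callbacks might do nothing more than maintain the child list mentioned above. This is a sketch under assumed names (the example_sched_t type, its list helpers, and the lock are not part of the Lithe API, and the callback signatures are shown in simplified form):

/* Hedged sketch: bookkeeping callbacks on the parent side. The child list
 * recorded here is what lets the parent honor later hart_request() callbacks. */
static void example_child_enter(lithe_sched_t *child)
{
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  spinlock_lock(&self->lock);
  child_list_add(&self->children, child);
  spinlock_unlock(&self->lock);
}

static void example_child_exit(lithe_sched_t *child)
{
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  spinlock_lock(&self->lock);
  child_list_remove(&self->children, child);
  spinlock_unlock(&self->lock);
}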

Hart Management

The hart management functions serve to transfer control of harts between child schedulers and their parents. Calls exist to request harts from a parent, grant them down to a child, and yield them back to their parent. Each of these calls and the callbacks they trigger are discussed in detail below.

  Library Call               Callback
  lithe_hart_request(amt)    parent->hart_request(amt)
  lithe_hart_grant(child)    child->hart_enter()
  lithe_hart_yield()         parent->hart_return()
                             parent->hart_enter()

lithe_hart_request: This call is used to incrementally request new harts from a parent. However, its semantics differ slightly from both the original version of Lithe and its vcore_request() counterpart in parlib. Rather than being a true incremental request (i.e. give me x more harts if they become available), lithe_hart_request() can also accept negative values as its request amount. By allowing negative values, Lithe gives schedulers the ability to incrementally change their total desired hart count to match their true real-time need (even if that need is now less than it was before). With the old interface, there was no way to encode this information, causing unneeded harts to continuously trickle into a child even if it no longer needed them. In practice, schedulers call lithe_hart_request() with an amount that reflects an adjustment in the number of runnable contexts that it currently has. That is, they call it with a positive value when new contexts are first created or unblocked, and they call it with a negative value whenever a context blocks or exits. As such, these calls are typically made in the context_block() and context_unblock() callbacks discussed later on.

As discussed previously, a call to lithe_hart_request() will trigger a hart_request() callback on the current scheduler's parent. Internally, Lithe will pass this callback a pointer to the scheduler making the request as well as the incremental request amount. It is up to the parent to make sure it honors this request according to the semantics described above. As discussed above, the hart_request() callback is an update callback, so the current scheduler will retain control of the hart after the callback has completed. If the call was issued from uthread context, that context will also be restored and continue where it left off.

lithe_hart_grant: This call explicitly grants a hart from a parent to a child. As parameters, it takes the child scheduler to grant the hart to as well as a call-site-specific callback that will be triggered once Lithe has completed the transfer internally. Lithe will first update the value of current_scheduler to the new child scheduler, trigger the call-site-specific callback, and then trigger the hart_enter() callback on the child. In practice, this call is only ever made in the body of hart_enter() when a parent detects that it has children in need of a hart (as demonstrated in Figure 3.5). The call-site-specific callback is typically used to implement the bottom-half of any scheduler specific synchronization needed to protect the transfer of the hart (i.e. release locks, down reference counts, etc.).

lithe_hart_yield: This call yields control of the current hart back to the parent. It can be called from either uthread context or vcore context, but in practice it tends to only be called from the bottom of the hart_enter() callback once it decides it has nothing else to do. Once issued, Lithe triggers the hart_return() callback on the parent scheduler to inform it that a hart is about to be returned to it. The parent can use this callback to update its total hart count and otherwise prepare for the return of the hart. Once this call completes, Lithe will internally update the value of current_scheduler to the parent and jump to its hart_enter() callback. We separate the hart_return() and hart_enter() callbacks in order to maintain the invariant that hart_enter() is the single entry point into a hart (much like vcore_entry() is in parlib) as well as to allow Lithe to internally update some state between triggering hart_return() and hart_enter().
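Putting these pieces together, the request/grant flow of Figure 3.5 might be implemented along the lines of the sketch below. The pending-request bookkeeping, the queue helpers, and the exact callback signatures are assumptions; the grant call is shown with the call-site callback described above passed as NULL for brevity.

/* Child side: adjust the desired hart count as contexts come and go. */
static void example_context_unblock(lithe_context_t *ctx)
{
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  enqueue_context(&self->run_queue, ctx);
  lithe_hart_request(1);    /* one more runnable context: ask for one more hart */
}

/* Parent side: record incremental requests in hart_request(), then grant
 * harts later from hart_enter() when one becomes available. A real scheduler
 * would typically combine this with running its own contexts, as in the
 * earlier hart_enter() sketch. */
static void example_hart_request(lithe_sched_t *child, int amt)
{
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  pending_harts_adjust(&self->pending, child, amt);   /* amt may be negative */
}

static void example_parent_hart_enter(void)
{
  example_sched_t *self = (example_sched_t *)lithe_sched_current();
  lithe_sched_t *child = pending_harts_next(&self->pending);

  if (child != NULL) {
    pending_harts_adjust(&self->pending, child, -1);
    lithe_hart_grant(child, NULL);    /* hand the hart down; does not return */
  }
  lithe_hart_yield();                 /* nothing to do ourselves: return the hart */
}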

Context Management

The context management functions serve to initialize new contexts, start and stop them, block and unblock them, etc. Internally, Lithe references these contexts using a basic lithe_context_t struct that is "extended" from parlib's underlying uthread_t type. Just as with lithe_sched_t structs, Lithe never allocates space for these structs itself. Parallel runtimes implemented on top of Lithe "extend" the lithe_context_t type to contain runtime specific fields and then allocate space for them before passing them to the various library calls discussed below. When these contexts are passed back to their corresponding runtimes via callbacks, they can be cast to the proper type and operated on accordingly.

  Library Call                                   Callback
  lithe_context_init(ctx, entry_func, arg)       -
  lithe_context_recycle(ctx, entry_func, arg)    -
  lithe_context_reassociate(ctx, sched)          -
  lithe_context_cleanup(ctx)                     -
  lithe_context_run(ctx)                         -
  lithe_context_self()                           -
  lithe_context_block(callback, arg)             current_scheduler->context_block(ctx)
                                                 callback(ctx, arg)
                                                 current_scheduler->hart_enter()
  lithe_context_unblock(ctx)                     current_scheduler->context_unblock(ctx)
                                                 current_scheduler->hart_enter()
  lithe_context_yield()                          current_scheduler->context_yield(ctx)
                                                 current_scheduler->hart_enter()
  lithe_context_exit()                           current_scheduler->context_exit(ctx)
                                                 current_scheduler->hart_enter()

lithe_context_init: This call is used to initialize a context with a start function (and its argument) so that it can be started at some point in the future. As parameters, it takes a reference to an "extended" lithe_context_t struct and the start function and argument it should execute once it begins to execute. It is assumed that memory for the context has already been allocated and that the stack field of the context is already pointing to a valid stack before being passed to this function. As part of initializing the context, Lithe assigns it a unique context id and calls uthread_init() under the hood, which will (among other things) create a new TLS region for the context to run with. A subsequent call to lithe_context_run() will trigger this context to begin executing on its stack at the start function specified with the newly allocated TLS installed. This call should be paired with the lithe_context_cleanup() function described below so Lithe has a chance to clean up any internal state associated with the context. No callbacks are triggered as a result of this call.

lithe_context_recycle: This call is used to recycle a previously initialized context. Just like lithe_context_init(), it also takes a start function and argument as parameters, and a subsequent call to lithe_context_run() will cause it to begin executing at this start function. However, lithe_context_recycle() does not call uthread_init() under the hood (leaving its TLS unchanged), and the context id remains the same as it was previously. Schedulers that use this library call must be careful to reinitialize any relevant TLS variables as desired. In practice, we use this call in our Lithe port of OpenMP because applications tend to make assumptions about the TLS not being reinitialized in threads used across different OpenMP parallel regions (even though the spec doesn't guarantee this). Whether this is a bug or a feature is up for debate, but the lithe_context_recycle() call exists nonetheless to allow these semantics to persist. A context can be recycled as many times as desired before ultimately calling lithe_context_cleanup() once it is no longer needed. No callbacks are triggered as a result of this call.

lithe_context_reassociate: This call is used to reassociate a context with a new scheduler. Typically contexts are created specifically for use by a particular scheduler instance and then destroyed at some point before that scheduler is exited. This call allows a blocked context to be passed to a new scheduler instance so that when it resumes it will begin executing on behalf of that new scheduler. This call is typically paired with a subsequent call to lithe_context_recycle() in order to reinitialize the context within the new scheduler (though this is not strictly necessary). In practice, we use this call in our port of OpenMP to reuse threads across different OpenMP parallel regions. No callbacks are triggered as a result of this call.

lithe_context_cleanup: This call cleans up all internal state associated with a previously initialized context. It should be called exactly once for each context initialized via a call to lithe_context_init(). After this call completes no subsequent calls should be made to lithe_context_recycle() or lithe_context_reassociate(), and it is safe to free all memory associated with the context (including its stack). No callbacks are triggered as a result of this call.

lithe_context_run: This call is used to start executing a previously initialized context. It is used to both start a newly initialized context as well as resume a context that has been yielded in some way. No callbacks are triggered as a result of this library call.

lithe_context_self: This call is used to return a unique id associated with a context. The actual type returned is opaque (though in practice it is just a pointer to the context's underlying lithe_context_t struct). No callbacks are triggered as a result of this library call.

lithe_context_block: This call is used to block the currently executing context. As a parameter, it takes a call-site-specific callback to execute once the context has been blocked. Under the hood Lithe calls uthread_yield() in order to yield the context to the current scheduler in vcore context. It then triggers the context_block() callback on the current scheduler before executing the call-site-specific callback that was passed in. Once both callbacks have completed, Lithe triggers the hart_enter() callback, giving the scheduler the opportunity to make a new scheduling decision.

lithe_context_unblock: This call is used to unblock a previously blocked context. Under the hood Lithe calls uthread_runnable() in order to trigger parlib's thread_runnable() callback internally. Lithe then reflects this callback through context_unblock() so the current scheduler can reenqueue the context for later execution.

lithe_context_yield: This call yields the currently executing context to the current scheduler, triggering a context_yield() callback as a result. Just like lithe_context_block(), it calls uthread_yield() under the hood, but passes a different call-site-specific callback in order to reflect context_yield() back to the scheduler instead of context_block().

lithe_context_exit: This call is identical to lithe_context_yield() except that it triggers the context_exit() callback instead of context_yield().
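To tie these calls back to the example "parallel" function call from earlier, the join operation assumed there could be built on lithe_context_block() and lithe_context_unblock() roughly as sketched below. The join_state structure, its locking, and the helper names are assumptions; the pattern of releasing the lock from the block callback (which runs in vcore context after the context has fully stopped) is the point being illustrated.

/* Hedged sketch: joining on a set of worker contexts. */
struct join_state {
  spinlock_t lock;
  int outstanding;              /* worker contexts still running */
  lithe_context_t *waiter;      /* context blocked in join_all(), if any */
};

static void join_block_cb(lithe_context_t *ctx, void *arg)
{
  /* Runs in vcore context once ctx is fully blocked; safe to publish it. */
  struct join_state *js = arg;
  js->waiter = ctx;
  spinlock_unlock(&js->lock);
}

void join_all(struct join_state *js)
{
  spinlock_lock(&js->lock);
  if (js->outstanding > 0)
    lithe_context_block(join_block_cb, js);   /* hart_enter() picks other work */
  else
    spinlock_unlock(&js->lock);
}

void worker_done(struct join_state *js)
{
  lithe_context_t *waiter = NULL;
  spinlock_lock(&js->lock);
  if (--js->outstanding == 0)
    waiter = js->waiter;
  spinlock_unlock(&js->lock);
  if (waiter != NULL)
    lithe_context_unblock(waiter);   /* make the joiner schedulable again */
}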

3.2.3 Lithe Synchronization Primitives

We provide the same synchronization primitives as the original version of Lithe (i.e. mutexes, condition variables, and barriers) with additional functionality provided for futexes as well. Providing these synchronization primitives was trivial because we were able to leverage parlib's implementations of them instead of reimplementing them specifically for Lithe. All we had to do was provide wrappers around them, e.g.:

int lithe_mutex_lock(lithe_mutex_t *mutex)
{
  return uthread_mutex_lock(mutex);
}

int lithe_mutex_unlock(lithe_mutex_t *mutex)
{
  return uthread_mutex_unlock(mutex);
}

The reason this works so nicely is that all Lithe contexts are actually uthreads under the hood, and Lithe implements the uthread_has_blocked(), uthread_paused(), and uthread_runnable() callbacks from parlib's global sched_ops struct to simply reflect them back through corresponding lithe_sched_funcs callbacks on the current scheduler. Specifically, Lithe implements uthread_has_blocked() to reflect context_block(), and uthread_paused() and uthread_runnable() to reflect context_unblock().

void lithe_thread_has_blocked(uthread_t *uthread)
{
  current_scheduler->funcs->context_block(current_scheduler, uthread);
}

void lithe_thread_runnable(uthread_t *uthread)
{
  current_scheduler->funcs->context_unblock(current_scheduler, uthread);
}

void lithe_thread_paused(uthread_t *uthread)
{
  /* A paused uthread is treated the same as one being made runnable again
   * (function name reconstructed from the surrounding text). */
  lithe_thread_runnable(uthread);
}

Since all of parlib's synchronization primitives only ever trigger one or the other of these callbacks, our job is done from the perspective of Lithe.

  New Lithe Library Call                            Original Lithe Library Call
  lithe_hart_request(nharts)                        lithe_sched_request(nharts)
  lithe_hart_grant(child)                           lithe_sched_enter(child)
  lithe_hart_yield()                                lithe_sched_yield()
  lithe_sched_init(sched, callbacks, main_ctx)      lithe_sched_register(callbacks)
  lithe_sched_enter(sched)                          lithe_sched_register(callbacks)
  lithe_sched_exit()                                lithe_sched_unregister()
  lithe_sched_current()                             -
  lithe_context_init(ctx, entry_func, arg)          lithe_ctx_init(ctx, stack)
  lithe_context_recycle(ctx, entry_func, arg)       -
  lithe_context_reassociate(ctx, sched)             -
  lithe_context_cleanup(ctx)                        lithe_ctx_fini(ctx)
  lithe_context_run(ctx)                            lithe_ctx_run(ctx, fn) / lithe_ctx_resume(ctx)
  lithe_context_block(callback, arg)                lithe_ctx_block(ctx)
  lithe_context_unblock(ctx)                        lithe_ctx_unblock(ctx)
  lithe_context_yield()                             lithe_ctx_pause(fn)
  lithe_context_exit()                              -
  lithe_context_self()                              -
  Builtin thread support or set_dtls(key, storage)  lithe_ctx_setcls(cls)
  Builtin thread support or get_dtls(key)           lithe_ctx_getcls()

  New Lithe Callback                                Original Lithe Callback
  hart_enter()                                      enter()
  hart_request(child, amt)                          request(child, nharts)
  hart_return(child)                                -
  child_enter(child)                                register(child)
  child_exit(child)                                 unregister(child)
  sched_enter(child)                                -
  sched_exit(child)                                 -
  context_block(context)                            block(ctx)
  context_unblock(context)                          unblock(ctx)
  context_yield(context)                            yield(child)
  context_exit(context)                             -

Table 3.2: A side-by-side comparison of library calls and callbacks from the new Lithe API and the original Lithe API as presented in [3].

3.3 The Fork-Join Scheduler

The Lithe "fork-join" scheduler provides a baseline scheduler to help implement the parallel function call pattern presented in Section 3.2. It provides both reasonable defaults for common scheduling operations, as well as a number of API calls that make it easily extendable for schedulers that need some level of custom functionality. So long as the runtime being ported is fairly well behaved, basic parallel function calls can be implemented without customization, as depicted below.

void parallel_function() {
  lithe_fj_context_t main_ctx;
  sched = lithe_fj_sched_create(main_ctx);
  lithe_sched_enter(sched);
  for (int i = 0; i < num_contexts; i++) {
    /* The loop body and the remainder of this function are reconstructed from
     * the surrounding description; the lithe_fj_* helper names below are
     * assumed rather than taken from the actual fork-join API. */
    lithe_fj_context_create(sched, do_work, NULL);
  }
  lithe_fj_sched_join_all(sched);
  lithe_sched_exit();
  lithe_fj_sched_destroy(sched);
}

3.4 Evaluation

To evaluate our new version of Lithe we recreated two experiments from the original Lithe paper [4]: the SPQR benchmark and the Flickr-like image processing server.

SPQR [30] is a sparse QR factorization algorithm used in the linear least-squares method to solve a variety of problems, including geodetic surveying, photogrammetry, tomography, structural analysis, surface fitting, and numerical optimization. It is used to quickly solve for x in the equation Ax = b, where x and b are column vectors, and A is a sparse matrix. For our experiments we measure the time it takes to calculate x using the qrdemo application from the SuiteSparse package [53]. This application takes the input matrix A as its only argument and sets b to a column vector of all 1s before solving for x. The SPQR algorithm itself is highly parallelizable and the SuiteSparse implementation relies on a mix of OpenMP and TBB to process its input matrix. As such, it relies on these libraries to interoperate (i.e. share cores) with as little interference from one another as possible.

The Flickr-like image processing server is an application that parallelizes the generation of multiple thumbnail images from a single input image. In total, nine thumbnail images are generated, and their sizes are consistent with those produced by Flickr when making thumbnail requests through its standard web API [54]. All thumbnails are generated in parallel (using pthreads), and each individual thumbnail is created using a parallel image processing algorithm from GraphicsMagick [55] (which uses OpenMP under the hood). As an extension to the original experiment, we have integrated the image processing server into our kweb web server (discussed in Chapter 2.6.2) and serve web traffic alongside processing each of the thumbnail generation requests. Input images are accepted via POST requests and all nine thumbnails are returned in a gzipped tarball.

In the sections that follow, we evaluate the performance of these applications and compare the results of running them natively on Linux to running them on Lithe. As with parlib, all experiments were run exclusively on Linux (rather than some combination of Akaros and Linux) to better isolate the effects of Lithe, and more effectively evaluate it independent of any problems that may exist in disjoint parts of Akaros. Likewise, all of our experiments were run on 'c99', with the Linux kernel patches from Andi Kleen to expose the rdfsbase and wrfsbase instructions to user space (as described in Chapter 2.6.3). All hyperthreading was disabled for the duration of these experiments (leaving us with a total of 16 cores).

[Figure 3.6 diagram: the SPQR column elimination tree (its sibling tasks parallelized with OpenMP) and the frontal matrix factorizations within each task (parallelized with TBB) sit on top of Lithe, parlib, and the operating system, with harts handed to OpenMP or TBB as needed.]

Figure 3.6: The software architecture of the SPQR benchmark on Lithe. Cores are shared between OpenMP and TBB at different levels of the SPQR task tree.

Running our experiments exclusively on Linux comes at the cost of only evaluating Lithe under the limited setup dictated by Linux's "faked" vcore abstractions (as discussed in detail in Chapter 2.2.1). However, so long as only one application is running in the system, this setup is sufficient to demonstrate the advantages of using Lithe over existing kernel-based scheduling approaches. For now, this is acceptable.

3.4.1 SPQR Benchmark

The SPQR algorithm operates by creating a tree of tasks, with each task responsible for performing a set of linear algebra matrix operations. At each level of the task tree, sibling tasks are parallelized using OpenMP, and the matrix operations within each task are parallelized using MKL (which uses TBB under the hood). Figure 3.6 shows how these tasks are broken up into their parallelizable components.

Without Lithe, cores must be statically partitioned between TBB and OpenMP to keep them from interfering with one another. Optimal partitioning is highly dependent on the input matrix, and some level of trial-and-error is required to find the best configuration to use for each input. By design, Lithe removes the need for this static partitioning, simplifying the process of finding the optimal operating point for any given input matrix. Moreover, statically partitioning cores has the limitation of possibly leaving some cores unutilized in one library when they could be doing useful work in the other. As we show below, Lithe is able to overcome this limitation, outperforming versions of the code that have been hand-tuned with the best static partitioning possible.

Below is a list of the input matrices we used to evaluate SPQR on Lithe:

            Landmark         DeltaX             ESOC               Rucci1
  Size      71,952 x 2,704   68,600 x 21,961    327,062 x 37,830   1,977,885 x 109,900
  Nonzeros  1,146,868        247,424            6,019,939          7,791,168
  Domain    surveying        computer graphics  orbit estimates    ill-conditioned least-square

These are the same matrices used by the SPQR evaluation in the original Lithe paper. They have been chosen to evaluate SPQR across a range of input sizes and application domains. For each matrix we ran it through the qrdemo application and measured the time it took to compute x. We compared the results from four different setups:

• Out of the Box - Linux NPTL: this setup is based on running SPQR directly on Linux without modification. Cores are split evenly between OpenMP and TBB (i.e. 8 cores each on our non-hyperthreaded 'c99' machine).

• Manually Tuned - Linux NPTL: this setup is based on partitioning cores between OpenMP and TBB in an optimal configuration. For each input matrix, we ran SPQR over every possible combination of splitting cores between OpenMP and TBB and chose the combination that was able to complete the fastest.

• Out of the Box - upthread: this setup is based on linking in our parlib-based upthread library as a drop-in replacement for the Linux NPTL. As with the "Out of the Box - Linux NPTL" configuration, cores are split evenly between OpenMP and TBB (i.e. 8 cores each on our non-hyperthreaded 'c99' machine). We still use the default implementations of OpenMP and TBB for this configuration.

• Lithe: this setup is based on replacing the default OpenMP and TBB libraries with their "lithe-aware" counterparts and linking them together with Lithe. No partitioning of cores is necessary, and Lithe takes care of automatically sharing cores between the two libraries over time.

Figure 3.7 shows the results of this experiment. From a 2.3% speedup on the landmark matrix to a 10% speedup on deltaX, SPQR on Lithe outperforms the "Manually Tuned - Linux NPTL" setup by a significant margin on all input matrices. Lithe's ability to dynamically share cores between TBB and OpenMP provides these performance improvements without the need for static partitioning. These results reinforce the power of Lithe to efficiently manage cores between disparate parallel libraries in a single application.

However, Lithe is not the only setup worthy of note. The "Out of the Box - upthread" setup is only slightly different from the "Out of the Box - Linux NPTL" setup, but performs significantly better than it (in most cases better than the "Manually Tuned - Linux NPTL" setup as well). These results underscore the performance improvements possible just by moving over to a user-level scheduling approach, independent of the advantages introduced by Lithe. This is especially important for applications that use libraries that have not yet been ported over to Lithe.

Figure 3.7: SPQR performance across a range of different input matrices. We compare its performance when running on the Linux NPTL (for both an 'out-of-the-box' configuration and a hand-tuned configuration based on optimal partitioning of cores) to running directly on parlib-based upthreads and Lithe. In all cases, Lithe outperforms them all.

So long as these libraries use pthreads, they can take advantage of our new "lithe-aware" version of upthreads to coexist with other "lithe-aware" libraries in a single application. The image processing server discussed below demonstrates this.

3.4.2 Image Processing Server

As mentioned previously, we reimplemented the Flickr-like image processing server from the original Lithe paper as an add-on to our kweb webserver discussed in Chapter 2.6.2. We overloaded kweb's url_cmd interface to accept POST requests of an image file and return a gzipped tarball of nine thumbnails generated using GraphicsMagick. All nine thumbnails are generated in parallel using pthreads, and GraphicsMagick uses OpenMP to further parallelize the generation of each thumbnail at an algorithmic level. In this way, pthreads and OpenMP must share access to underlying cores in order to generate thumbnails efficiently. Moreover, since the server is now integrated with kweb, static file requests being serviced by pthreads must share access to these cores as well. The goal is to show that our "lithe-aware" implementations of upthreads and OpenMP are able to share cores more efficiently than running both libraries directly on the Linux NPTL. The experiment in the original Lithe paper was unable to demonstrate this capability because our "lithe-aware" version of upthreads had not yet been developed. Figure 3.8 shows the software architecture of this modified version of kweb.

[Figure 3.8 diagram: kweb's static file requests (handled by upthread) and thumbnail requests (handled by upthread and OpenMP together) sharing harts through Lithe, which sits on top of parlib and the operating system.]

Figure 3.8: Integrating the Flickr-like image processing server into kweb. Static file requests serviced by pthreads must share cores with thumbnail requests serviced by a mix of pthreads and OpenMP.

To demonstrate our goal, we configured kweb to run under three different threading models: the Linux NPTL, parlib-based upthreads, and "lithe-aware" versions of upthreads and OpenMP. In all cases, we configured kweb to run with all of the cores in the system (i.e. 16 cores on 'c99' with hyperthreading disabled). We then measured the performance of kweb for each of these threading models when sending it a mix of static file requests and thumbnail requests. As with our original kweb experiment, our static file requests were generated from a simple http-request generator we wrote. We configured our http-request generator to operate as follows:

• Start 100 TCP connections in parallel and prepare to send 1000 http-requests/connection.

• Batch requests to allow 100 concurrent requests/connection on the wire at a time.

• After each request receives a response, place a new request on the wire.

• After each connection completes all 1000 requests, start a new connection.

With this setup there are always approximately 10,000 outstanding static file requests on the wire at any given time. Each request consists of an HTTP 1.1 GET request for an empty file. Each response consists of a 242-byte HTTP 1.1 response header and a 0-byte body. By requesting a file of minimum size, we maximize the throughput we can measure before hitting the bandwidth limitations of the network card. The 242-byte header was chosen to match the header size used in the original kweb throughput experiment (as dictated by nginx's standard response header size).

For our thumbnail requests, we wrote a separate thumbnail-request generator that we configured as follows:

• Generate TCP connections every 1/100th of a second, with 1 request per connection.

• Throttle our total number of outstanding connections to 32.

With this setup, thumbnail requests are spaced apart in order to give the server some time to process the previous request before starting on the next one. Moreover, we limit the maximum number of outstanding requests to 32 to keep from overloading the server with requests (since each thumbnail request takes much longer than a simple static file request). Requests themselves consist of a 3.96MB input image, and responses contain a 5.6MB gzipped tarball of all the generated thumbnail images.

We used the request generators described above to generate a mix of traffic to kweb in 5 minute intervals. The first interval consisted of only thumbnail requests, the second consisted of both thumbnail and static file requests, and the third consisted of only static file requests. We also included fourth and fifth intervals that repeated the first two intervals in reverse order to keep everything symmetric. By mixing requests in this way, we were able to see the effect that each of the request types had on the overall performance of kweb. Moreover, by including the reverse intervals, we were able to demonstrate kweb's ability to dynamically adapt to a mix of requests over time.

In total, we generated this traffic for three separate experiments: one for each of the different threading models we configured kweb to run under (i.e. the Linux NPTL, parlib-based upthreads, and "lithe-aware" versions of upthreads and OpenMP). We then computed the average throughput and average latency of each request type in 10 second increments for the duration of each experiment. Figure 3.9 shows the result of these experiments. The graph in this figure is organized as follows:

• Horizontally, data is broken up into five 5-minute intervals, according to the traffic generation pattern described above. The first interval consists of only thumbnail requests, the second consists of both thumbnail and static file requests, and the third consists of only static file requests. The fourth and fifth intervals repeat the first two intervals in reverse order to keep everything symmetric.

• Each line in the graph represents either the average throughput or the average latency of a particular request type (i.e. static files, or thumbnails) for a given kweb threading model (i.e. the Linux NPTL, parlib-based upthreads, or "lithe-aware" versions of upthreads and OpenMP).

• The top 2 boxes contain the average throughput for each request type at 10 second intervals (first is thumbnails, second is static files).

• The bottom 2 boxes contain the average latency for each request type at 10 second intervals (first is thumbnails, second is static files).

Figure 3.9: Mixing static file requests and thumbnail generation in kweb. Horizontally, data is broken up into five 5-minute intervals. The first interval consists of only thumbnail requests, the second consists of both thumbnail and static file requests, and the third consists of only static file requests. The fourth and fifth intervals repeat the first two intervals in reverse order to keep everything symmetric. Each line in the graph represents either the average throughput or the average latency of a particular request type (i.e. static files, or thumbnails) for a given kweb threading model (i.e. the Linux NPTL, parlib-based upthreads, or "lithe-aware" versions of upthreads and OpenMP). The top 2 boxes contain the average throughput for each request type at 10 second intervals (first is thumbnails, second is static files). The bottom 2 boxes contain the average latency for each request type at 10 second intervals (first is thumbnails, second is static files).

The major takeaways from this graph are:

• The throughput and latency for thumbnail generation during the first interval is comparable for all three threading models measured. The throughput is steady at around 3.0 requests/sec, and the latency is steady at around 5.1 sec/request. There is an unexplained spike at the beginning for both upthread and upthread-lithe, but it smooths out after that.

• With static file requests added during the second interval, the throughput and latency of static file requests is comparable for all three models at around 180,000 requests/sec and 60 msec/request, respectively. However, the throughput of thumbnail requests drops by 50% for the Linux NPTL, and its latency increases by 50%. The throughput of thumbnail requests for upthread and upthread-lithe only drops by about 6%, and the latency remains roughly the same, with a slight increase in variance.

• With thumbnail generation removed during the third interval, the throughput of static file requests for upthread and upthread-lithe jumps by 63% to around 492,000 requests/sec (which is the same result we saw in the previous chapter in our kweb throughput test). Likewise, the latency for upthread and upthread-lithe drops by 6% to about 20 msec/request with a variance of essentially 0. The throughput of the Linux NPTL increases by only about 6% and its latency drops by 6% as well.

• As expected, the fourth and fifth intervals mirror the first two intervals in reverse order. However, the startup effects are missing, which is good since it means they really were startup effects and not some bigger issue.

These results demonstrate that all three threading models are able to process thumbnail requests at a similar rate, so long as those are the only type of traffic they are processing. Once static file traffic is introduced, the Linux NPTL has trouble keeping up with upthread and upthread-lithe. These configurations see minimal degradation in thumbnail performance, while still being able to serve static file traffic at a rate comparable to the Linux NPTL. Interestingly, upthread and upthread-lithe are able to keep pace with each other every step of the way, demonstrating that user-level threading alone is the dominant component in achieving these performance improvements. The dynamic sharing of cores across parallel function calls between upthreads and OpenMP appears to have minimal impact for this particular application.

3.5 Discussion

In this chapter, we presented Lithe: a framework for composing multiple user-level schedulers in a single application. Lithe was originally proposed in 2010, and the work in this chapter serves to improve and extend its original capabilities. As part of this effort, we ported Lithe to run on parlib and cleaned up many of its outward-facing interfaces. Porting Lithe in this way helps to separate those features which pertain only to hart and context management from Lithe's primary innovation: its extended scheduler API. It also served to provide certain features lacking in the original Lithe implementation (e.g. proper handling of asynchronous syscalls). As part of this work, we also introduced a common “fork-join” scheduler on top of which our OpenMP, TBB, and a new “lithe-aware” version of upthreads are now based. We used these libraries to rerun the SPQR and Flickr-like image processing benchmarks from the original Lithe paper. Our results corroborate those found in that paper, as well as demonstrate Lithe's new ability to support pthread applications.

Moving forward, we plan to extend Lithe's capabilities even further. Specifically, we plan to add APIs for preemptive scheduler development. With the existing API, there is no way for one runtime to forcibly remove a hart from another runtime, or to be woken up and passed a hart based on a timer interrupt. Such functionality is essential to support popular parallel runtimes such as MPI. We have some ideas on how to add this functionality, and will be working to add it soon.

Chapter 4

Go

Go [5] is a modern programming language from Google that lends itself well to the development of parallel applications in the datacenter. It has been rising in popularity in recent years, in part due to its elegant code composition model and in part due to its concurrency model [31]. Go's concurrency model is based on launching light-weight user-level threads, called go routines, and synchronizing access to them through a language primitive known as a channel. This concurrency model matches well with the user-level threading model promoted by Akaros, Parlib, and Lithe.

In this chapter, we present our port of Go to Akaros. The port is currently frozen at Go 1.3, but we have plans to upgrade it soon. The goal of the port is two-fold. First, it demonstrates the completeness of the Akaros kernel in terms of supporting a fully functioning runtime environment that matches well to its underlying threading model. Second, it opens up a new set of applications to Akaros that otherwise wouldn't be available. Go is a fairly self-contained ecosystem with a small OS-dependent layer, and by porting Go to Akaros, we are able to leverage applications written against its standard library and be assured of their correct functionality. Moreover, it provides external developers a fast path for testing new applications on Akaros without the need to learn a whole new system or deal with incompatibilities with glibc.

Our current port is based on emulating the notion of a “kernel” thread using user-level threads from our canonical parlib-based pthread library (upthreads). Porting Go this way was easier as a first pass, because other OSs, including Linux, have a version of Go that multiplexes go routines directly on top of pthreads. We simply mimic the Linux port and make sure the Go runtime hooks into our upthreads package properly. Ideally, we would bypass the upthread layer and instead multiplex go routines directly on top of vcores. However, there are several challenges with porting Go in this way, and we discuss these challenges further in Section 4.2.

The port itself took 1.5 years to complete, with two-thirds of that time spent hunting down bugs in Akaros and adding new features to Akaros that Go required. For example, the Go port drove the addition of the SYS_DUP_FDS_TO syscall, which allows an application to map file descriptor 'x' in process A to file descriptor 'y' in process B (so long as B is a child of A).

Go requires this functionality in the implementation of its StartProcess() abstraction. On Linux (and any OS that relies on the fork/exec model), Go has to jump through a large number of hoops in order to achieve this functionality. However, since Akaros follows a create/run model rather than a fork/exec model, we are able to call SYS_DUP_FDS_TO to set a child's fd map between the time the child is created and the time it is actually run.

Go also includes an extensive test suite that tests a wide array of functionality, from spawning subprocesses to launching http servers and handling file descriptors properly in the OS. By ensuring that Akaros can pass this test suite, we provide a fairly extensive regression of many of Akaros's core features. Akaros currently passes 100% of the 2,551 required tests from the Go 1.3 test suite (though it still skips 151 unrequired tests). Moving forward, we plan to add support for these remaining tests and ultimately integrate our port with the official upstream Go repository.

In the sections that follow, we relate our experience porting Go to Akaros. We begin with an overview of Go itself, with a focus on the design and implementation of its runtime and other internal details pertinent to our port to Akaros. We then provide specifics of the port itself, followed by a discussion of the results we've obtained in evaluating our port. We conclude with a discussion of future directions, including our plans for eventual inclusion in the upstream Go repository and integration with Lithe.

4.1 Go Overview

Development of the Go programming language began in 2007 as a research project at Google. It has since become a well-established language used by a number of high-profile companies including Google, Docker, Uber, Zynga, Heroku, Facebook, and Dropbox [56]. Its primary goal is to help developers overcome many of the problems faced when working with large-scale software projects on modern computing systems. Specifically, its simplified C-like syntax, package-based component model, and short compile times make managing millions of lines of code across thousands of developers more tractable than with other popular languages such as C++, Java, or Python [57]. Moreover, many of its language primitives have been designed specifically with multi-core CPUs, networked systems, and datacenters in mind. As such, developer effort is significantly reduced when writing applications that both run efficiently and scale well on large systems.

Generally, Go programs are fairly self-contained, with no external dependencies on libraries written in other languages. Indeed, the Go standard library provides much of the functionality provided by other standard libraries, such as glibc. However, Go also provides a special 'CGo' library to allow Go programs to call out to arbitrary C code compiled with gcc. We make heavy use of this CGo library in our port of Go to Akaros.

One of the most important language primitives provided by Go (and the one that bears its name) is the use of the go keyword to launch light-weight user-level threads, called go routines. Applications use go routines to launch hundreds (or thousands) of concurrent operations with significantly reduced overhead compared to traditional threads.
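As a concrete illustration of these primitives, the following self-contained program launches several go routines with the go keyword and collects their results over a channel. It is a generic example of the language features described above, not code from the Akaros port or from this thesis.

package main

import "fmt"

// square computes n*n and sends the result over the channel.
func square(n int, results chan<- int) {
	results <- n * n
}

func main() {
	results := make(chan int)
	for i := 1; i <= 4; i++ {
		go square(i, results) // launch one go routine per task
	}
	sum := 0
	for i := 0; i < 4; i++ {
		sum += <-results // receive one result per go routine launched
	}
	fmt.Println("sum of squares:", sum) // prints 30
}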

No interfaces exist to launch traditional threads in Go, and all concurrent operations are carried out via go routines. Internally, however, Go's runtime library still uses traditional threads as a vessel for running go routines. It maintains (roughly) one thread per core and multiplexes any runnable go routines on top of them. The details of how this operates are important, as they directly influence our port of Go to Akaros (which has no native support for traditional threads itself).

During normal operation, go routines cooperatively yield themselves on every function call, freeing up the thread they were running on to schedule another go routine in its place. So long as no blocking I/O calls are made, these threads happily multiplex go routines forever, following a simple cooperative scheduling algorithm defined by the Go runtime. However, as soon as a go routine makes a blocking I/O call, it is locked to the thread it is executing on, and a new thread is allocated to schedule go routines in its place. Once the I/O call completes, its associated go routine is reinserted into the runtime scheduler and the thread it was running on is destroyed. In this way, the number of actively running threads is always equal to the number of cores on the system, plus the number of go routines with outstanding I/O calls. In practice, a pool of threads is used to reduce the overhead of creating and deleting threads, but conceptually, the process works as described above.

For better or worse, this algorithm assumes the existence of traditional threads, and its internal data structures and scheduling algorithm are intimately intertwined with this assumption. All go routines are backed by a light-weight structure known as a G (short for go routine), all threads are backed by a heavy-weight structure known as an M (short for machine thread), and all cores on the system are backed by an even heavier-weight structure known as a P (short for processor). As such, we often refer to go routines as Gs, threads as Ms, and cores as Ps. It is important to understand the role of these data structures, as they influence the way in which we ultimately design our port of Go to Akaros.

At program start, exactly n Ps are created to hold all per-core data associated with executing the Go runtime (e.g. per-core run-queues, a pointer to the current M running on a core, etc.). Likewise, n Ms are created and mapped 1-1 with each of the Ps that have been created. Whenever an application launches a go routine, the Go runtime finds the P with the shortest run-queue and adds the go routine's associated G to that queue. The M associated with that P then pulls the G off of the P's run queue and executes it. If a G ever makes a blocking I/O call, it is locked to its current M, and the associated P is handed off to a newly allocated M so it can continue executing Gs. In this way, Ps are passed around like tokens among Ms, and only Ms that have an associated P are able to run Gs. Since the number of Ps is limited to the number of cores on the system, only while Gs are blocked on I/O is the system temporarily oversubscribed with more running Ms than Ps.

Though not ideal, this model actually works quite well so long as only one Go application is running on the system at a time. However, it is explicitly designed to accommodate operating systems such as Linux, where no abstract notion of a core is available. As such, it suffers from the classic problem of oversubscribing cores when multiple applications are run simultaneously.
By porting Go to Akaros, we are able to avoid this problem altogether.
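The sketch below restates the G/M/P relationships from the preceding paragraphs as simplified Go struct definitions. The field and function names are illustrative only; the real runtime's g, m, and p structures carry far more state, and this is not code from the Go runtime or from our port.

// Simplified, illustrative restatement of the runtime structures described above.
package sketch

// G: a go routine -- cheap, created per `go` statement.
type G struct {
	stack   []byte // small, growable user-level stack
	blocked bool   // true while waiting on a blocking I/O syscall
	lockedM *M     // set when the G is locked to its M during blocking I/O
}

// M: a machine (kernel) thread that executes Gs.
type M struct {
	curG *G // the G currently running on this thread, if any
	p    *P // the P this M holds; an M without a P cannot run Gs
}

// P: a processor token -- roughly one per core.
type P struct {
	runq []*G // per-core run queue of runnable Gs
	m    *M   // the M currently holding this P
}

// handoff models what happens when an M's current G blocks in the kernel:
// the P is passed to a fresh (or pooled) M so other Gs can keep running on
// that core, while the blocked M keeps only its locked G until the syscall
// returns.
func handoff(blockedM, freshM *M) {
	p := blockedM.p
	blockedM.p = nil
	freshM.p = p
	p.m = freshM
}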

4.2 Porting Go to Akaros

Normally when a Go program starts, it uses whatever facility is provided by the OS to launch M threads and begin multiplexing go routines on top of them. For example, Go on Linux uses the clone() system call, Windows uses CreateThread(), and plan9 and BSD both use fork(). On Akaros, however, we have no convenient abstraction that maps well to Go's notion of an M thread. The closest thing we have is the vcore, but that actually maps better to a P than an M, and even then, it's not quite a 1-1 mapping.

Luckily, we are able to leverage Go's CGo library to emulate the notion of an M thread using our parlib-based upthread library. Regardless of the OS, whenever an application links in CGo, its default M thread implementation is replaced with pthreads to ensure that any C library calls are run inside a properly initialized pthread (as required by glibc). We simply leverage this capability and link in our upthreads library as the pthread library of choice. There are some caveats to this involving the proper handling of POSIX signals and issuing futex calls properly, but for the most part, Go is able to use our upthreads library without modification. With our upthread library in place, the rest of the port mostly involved brute engineering work: making Go issue our syscalls properly and hooking any Akaros-specific functionality into existing Go code, where appropriate.

Ideally, we would bypass the upthread layer and instead multiplex go routines directly on top of vcores. However, this would require substantial changes to the Go runtime, forcing us to somehow consolidate the notion of the M and the P into a single abstraction that mapped nicely to our notion of a vcore. In essence, we would need to create an Akaros-specific go routine scheduler that was independent of the default go routine scheduler used by other OSs. Although this is certainly possible (parlib itself is designed to make building user-level schedulers easy), discussions with Russ Cox (one of the lead developers of Go) led us to believe that this is highly undesirable. He is quoted as saying that Go is not in the business of supporting multiple schedulers, nor does it want to be. It is likely we will still try this out at some point in the future, but for now, we plan to keep improving our upthread-based implementation instead.

One downside of using upthreads instead of running directly on vcores is that it limits Go's ability to directly control which go routines run on which cores (and when). Scheduling of M threads is dictated by the upthread library rather than somewhere inside Go itself. However, this problem exists for Go regardless, since it normally leaves the scheduling of its M threads to some underlying kernel scheduler. We would only be improving on its existing implementation by adding more control. Figure 4.1 shows a comparison of Go on Linux, Go on Akaros with upthreads, and our ideal setup of Go on Akaros, with go routines mapped directly to vcores.

In general, the combination of Akaros and parlib, with its low-level notion of a vcore, still helps to alleviate the problem of oversubscribing cores to multiple applications running on a system at a time. Even with our initial port of Go to Akaros, which leverages upthreads instead of running go routines directly on vcores, the upthread library still requests and yields vcores on an application's behalf rather than oversubscribing cores. As Rhoden demonstrates in Chapter 7 of his dissertation [2], operating this way can provide substantial benefits over traditional kernel-based scheduling approaches.
Figure 4.1: The evolution of Go from running on Linux, to running on Akaros with upthreads, to running on Akaros with vcores directly. (a) Go on Linux with go routines on kernel threads; (b) Go on Akaros with go routines on upthreads; (c) Go on Akaros with go routines directly on vcores.
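The following generic CGo example illustrates the hook described above. Linking any such file pulls in cgo, which routes the runtime's M-thread creation through the pthread interface; on Akaros this is the point where the parlib-based upthread library is substituted for the system pthreads. The C function here is a trivial placeholder for illustration only and is not code from the port.

package main

/*
#include <pthread.h>

// Return the calling thread's pthread id, simply to show that the call runs
// on a properly initialized pthread (an upthread, when the parlib-based
// library is linked in on Akaros).
static unsigned long self_id(void) {
	return (unsigned long)pthread_self();
}
*/
import "C"

import "fmt"

func main() {
	fmt.Printf("C call ran on pthread %#x\n", uint64(C.self_id()))
}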

4.3 Results

Unlike most of the other OSs supported by Go, Akaros does not yet support the ability to build native Go binaries itself. Instead, the entire Go setup is hosted on a non-Akaros machine, and all binaries are cross compiled. These binaries are then shipped over to an Akaros machine and executed. As of Go 1.3, we are able to leverage the use of a go_$GOOS_$GOARCH_exec script to invoke Go commands on our host machine but execute any resulting Go binaries on a remote Akaros machine. So long as this script is in our path, calls to things like “go run arguments...” or “go test arguments...” do the job of compiling and building Go binaries for our desired target. However, instead of just executing the resulting binary, the path of the binary along with its arguments is passed to the go_$GOOS_$GOARCH_exec script, and that script is executed instead. We have simply implemented the Akaros version of this script to ship the resulting binary to a remote Akaros machine and invoke it via an RPC call (using listen1 [58] on the remote Akaros machine and netcat [59] on a local Linux machine).

While testing, our remote Akaros machine typically runs inside an instance of qemu on the same physical machine as the host. However, the remote machine is specified by IP address, so running Go on a remote physical machine is relatively straightforward as well. All results presented here were obtained both in qemu as well as on our standard 'c99' test machine described in Section 2.6.3.

With our go_akaros_amd64_exec script in place, we are able to run the following command to run the full test suite from Go's standard library on Akaros. This command can be found in the run.bash file of the Go source repository.

go test std -short -timeout=300s

The default timeout is 120s, but we had to increase it to 300s due to certain inefficiencies in Akaros. Specifically, the net/http package will time out early if we use the default timeout. We have not yet determined the exact cause of this particular inefficiency, but we are actively investigating it. To run this command, we simply invoke run.bash directly, and it automatically calls our go_akaros_amd64_exec script for us. As part of this, it also runs a set of regression tests in addition to the standard test suite. The results of both the standard test suite and the regression tests are presented below:

Standard Tests            Regression Tests
  Total     2412            Total     290
  Passed    2261            Passed    290
  Skipped    151            Skipped     0
  Failed       0            Failed      0

These results demonstrate that we have provided a complete port of Go to Akaros. We pass all of the required tests from the standard test suite, skipping only a few of the non-required tests that some of the other OSs skip as well (mostly plan9 and nacl). Moreover, we pass all of the regression tests. However, these results say nothing about the actual performance of Go on Akaros. Good performance results would obviously be desirable, but they were not the primary reason for porting Go to Akaros. Our primary goal was to demonstrate the completeness of the Akaros kernel in terms of supporting a fully functional runtime environment that maps well to its internal threading model. In the future, we plan to provide a more complete evaluation of Go on Akaros, including performance results.

As a more qualitative result of porting Go to Akaros, we provide the following anecdotal evidence of just how difficult it was to debug our port at times. Our port of Go to Akaros took 1.5 years to complete, and stories such as the following help to demonstrate why.

Example Debug Session

We wrote a multithreaded Go networking benchmark similar to netperf in order to test the performance of our networking stack on Akaros. When running this application, we sometimes observed it non-deterministically crashing under heavy load. The crashes were always due to the application not being able to open a file. The file was normally one contained in /net/tcp/, but could also be /prof/mpstat, or any other file the application was opening. When printing the file name inside the kernel, we observed that it was often corrupt (filled with nulls, UTF-8, or otherwise unprintable characters). However, the same file name, when printed from the application (and inside the Go runtime), was fine.

The initial diagnosis was kernel memory corruption. The Akaros open() code snapshots the path via user_strdup() near the beginning of open(). After lots of debugging, we observed that the snapshotted value (t_path) often did not match the contents of the userspace pointer (path). The first theory was heap corruption of the kmalloced buffer where user_strdup() stored t_path. However, we found cases where both the user pointer and the snapshotted path were corrupt.

The path to a solution was forged when we finally observed that the contents of the userspace pointer sometimes changed between entering and exiting the kernel. We initially suspected stack corruption, but Akaros's internal debugging tools helped us verify that the address was contained in a valid range of the application's memory map (i.e. not on the stack). Digging further, we realized that the syscall code in the Go runtime was copying the path string into a temporary variable just before issuing the actual syscall. Looking at the code, we became suspicious that, since this temporary variable was not used anywhere else, and the syscall was using an “unsafe” access to it, the garbage collection system might think that the temporary variable was unreferenced, and garbage collect it. We found backing for this theory in the Go docs [60], and eventually found a commit to the syscall code in a newer version of Go that fixed a similar issue on Solaris. Once we applied this commit, all of our problems went away.

So, in the end, a multithreaded Go program was garbage collecting its syscall arguments out from under the kernel, and filling them with new contents. Most of the time, the kernel would win the race and snapshot the path before its contents changed, but the kernel would occasionally lose the race, and the path would be changed before it was snapshotted, causing the observed corruption. In this particular case, the bug turned out to be in userspace, but working with an unstable kernel with incomplete debugging tools meant that the debug cycle was much longer than working with, say, Linux. The fix itself was a simple one-line change, but the process of tracking it down took several days.
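The sketch below is an illustrative reconstruction of the bug class described above, not the actual Go 1.3 runtime code or the upstream fix (which took a different form). If the only remaining reference to the path buffer is a uintptr by the time the kernel reads it, the collector is free to reclaim, and the allocator to reuse, that memory, so the kernel can see a corrupted path. The safe version keeps a real Go pointer to the buffer live across the call; note that runtime.KeepAlive postdates Go 1.3, and SYS_OPEN here is a placeholder syscall number.

package main

import (
	"runtime"
	"syscall"
	"unsafe"
)

const SYS_OPEN = 2 // placeholder syscall number (open on Linux amd64)

// badOpen mirrors the buggy pattern: the buffer is reachable only through a
// uintptr when the raw syscall runs, so the GC may consider it dead.
func badOpen(path string) (uintptr, error) {
	buf, _ := syscall.ByteSliceFromString(path) // NUL-terminated copy of path
	addr := uintptr(unsafe.Pointer(&buf[0]))
	// Nothing below refers to buf itself, only to addr -- the race observed
	// in the debug session above.
	fd, _, errno := syscall.RawSyscall(SYS_OPEN, addr, uintptr(syscall.O_RDONLY), 0)
	if errno != 0 {
		return 0, errno
	}
	return fd, nil
}

// goodOpen keeps the buffer demonstrably live until the syscall has returned.
func goodOpen(path string) (uintptr, error) {
	buf, _ := syscall.ByteSliceFromString(path)
	fd, _, errno := syscall.RawSyscall(SYS_OPEN,
		uintptr(unsafe.Pointer(&buf[0])), uintptr(syscall.O_RDONLY), 0)
	runtime.KeepAlive(buf) // buf must outlive the kernel's read of the path
	if errno != 0 {
		return 0, errno
	}
	return fd, nil
}

func main() {
	_, _ = badOpen("/etc/hostname")
	_, _ = goodOpen("/etc/hostname")
}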

4.4 Discussion and Future Work

In this chapter we have presented our port of Go to Akaros. Our initial port is based on leveraging our parlib-based upthread library instead of multiplexing go routines directly on top of vcores. Porting Go in this way was much easier as a first pass, as it let us focus on the other aspects of Go that we needed to bring in line with Akaros to get it working. It also let us keep most existing Go code unchanged, which will make it easier for us to eventually get our changes pushed upstream. Ultimately, we would like to provide a port of Go to Akaros that runs directly on vcores, but for now we are focusing exclusively on our upthreads-based port. The goal of this port was not to attempt to outperform Go on any existing platforms, but rather to demonstrate the completeness of the Akaros kernel in terms of supporting a fully functioning runtime that maps well to its underlying threading model. Our results demonstrate that we have succeeded in providing such a port.

As mentioned in the previous chapter, Go could also benefit from a direct port to Lithe. Using CGo, Go applications that make calls into OpenMP, TBB, or any other “lithe-aware” schedulers could benefit from such a port. A typical example would be calling into the MKL to perform an optimized linear algebra routine rather than implementing it from scratch in Go. Providing such a port is relatively straightforward, and work has already begun in supporting it. We hope to get these changes pushed upstream soon.

Chapter 5

Related Work

Our work on Akaros, Parlib, and Lithe is grounded in a large body of prior work, stretching back as far as the early 1960s. From systems such as Psyche [61] and Scheduler Activations [27], which helped shape parlib's notion of vcore context and its overall method of handling asynchronous events, to seminal work from Lauer and Needham [62] on the duality of threads vs. events, our work has been motivated and influenced by a large number of previous systems. In this chapter, we provide a brief overview of these systems and show how they compare to our work. For a detailed overview of all the work that has helped to shape the ideas behind Akaros and its associated user-level libraries, we refer the reader to Chapter 2 of Barret Rhoden's dissertation [2].

5.1 Threading Models

Threading models are typically categorized as either 1:1, M:1, or M:N, indicating the relation of user threads to kernel threads. In each of these models, some number of user threads are mapped directly onto, and managed by, some number of underlying kernel threads. As mentioned in Chapter 2, parlib and Akaros operate under a unique threading model, which looks similar to a hybrid of the M:1 and M:N models, but instead relies on direct access to cores rather than underlying kernel threads. We discuss each of these models below.

In an M:1 model, a single kernel thread is created to represent a process, and any user-level threads that get created are multiplexed on top of that single kernel thread. This is one of the earliest threading models, as all notions of threading are isolated completely to user-space. The single kernel thread represents the entire process and its address space, and it is unaware of any threading done on top of it. The M:1 model dates as far back as the early 1960s with work on coroutines [63], and more contemporary systems such as Green Threads and Java [64] continue to follow this model today.

The M:N model is similar to the M:1 model, except that N kernel threads are created to represent a process instead of just 1. In an ideal M:N model, the kernel would schedule its N kernel threads independent of the M user threads being multiplexed on top of them.

However, in order to accommodate blocking I/O calls, some method of coordinating threads across the user-kernel boundary must be in place. This model was popularized by Scheduler Activations [27] and Sun's Light Weight Processes (LWPs) [65] in the early 1990s. More recently, Apple's Grand Central Dispatch (GCD) [66] and Windows User-mode Scheduling (UMS) [67] have also emerged in the M:N space. Likewise, as described in Chapter 4, a form of M:N threading is used by the Go language runtime [5] to schedule go routines on top of its machine thread abstraction.

As its name suggests, the 1:1 threading model maps each user thread to its own backing kernel thread. Unlike the M:1 or M:N models, the 1:1 model does not require a user-level scheduler to multiplex user threads on top of kernel threads. All scheduling is handled by the kernel. Likewise, no special coordination is needed to accommodate blocking system calls across the user-kernel boundary because each user thread simply blocks on its corresponding kernel thread. Because of its simplicity and relative efficiency, the 1:1 model is the most popular threading model in use today by systems such as Linux and BSD. Early systems to adopt the 1:1 model include Cedar [68, 69], Topaz [70], Mach [71], and System V [72].

Each of the threading models described above has certain advantages and disadvantages relative to the others. However, the one characteristic they all share is the notion that a user thread must somehow be tied to a backing kernel thread at the time it issues a blocking system call. Because of this restriction, complicated mechanisms are necessary to ensure that a process is able to continue scheduling threads while one of its threads is blocked on I/O. In an M:1 model this is impossible: whatever thread happens to be running when the system call is made will block the entire process until that system call completes. This is highly undesirable on modern multi-core architectures and should mostly be avoided for efficient operation. In an M:N model, complex mechanisms are required to create a new backing kernel thread to represent the process while the thread issuing the system call is blocked in the kernel. Likewise, when a system call completes, another complicated mechanism is required to inform a process that it is done and allow it to continue running the thread that was blocked. Many believe these complexities (and the overheads they bring with them) outweigh any benefits gained by being able to control the thread scheduling policy in user-space [33]. In a 1:1 model, most of these problems are avoided, but less control is given to a process in terms of how and when any of its threads will ultimately be scheduled.

With Akaros and parlib, we do away with blocking system calls altogether. This allows us to maintain control of scheduling our threads in user-space, but avoid any complexities with blocking those threads in the kernel. Instead, our model relies on the ability of the kernel to hand out cores to an application rather than threads, so as to completely decouple the threading done at each level. In this way, we are similar to an M:1 model, in that the kernel schedules its threads unaware that any threading may be going on in user-space. But we also maintain the nice property of the M:N model, whereby multiple threads can be executing simultaneously and a process is always notified upon syscall completion. In a sense, we are an M:C model, where we multiplex M user threads on top of C cores handed directly to us by the kernel.
This notion of decoupling user threads from kernel threads was popularized by Capriccio [32] in the early 2000s, and the model we adopt expands it for multi-core.

5.2 Concurrency vs. Parallelism

One of the biggest problems pervasive in both past and present systems, which we try to solve, is the lack of a good abstraction for expressing parallelism inside an application. Modern systems typically conflate the notions of concurrency and parallelism, providing a single abstraction to express both: the abstraction of a thread. However, concurrency and parallelism are two distinct concepts and each requires its own abstraction. On Akaros, the kernel provides vcores for parallelism and parlib provides threads for concurrency.

Concurrency refers to the ability to break apart a computation into separate tasks, such that each task can run independently. Parallelism refers to the ability to run each of these tasks simultaneously, if possible. By giving an application the ability to express its parallelism via vcore requests, we give it explicit control (and awareness) over how many concurrent threads it is able to execute in parallel. Moreover, we give it the freedom to choose whether to use threads or events as its primary method of expressing concurrency [62, 73, 49, 47, 74].

In the past, different techniques have been used to try and express this type of parallelism, with varying degrees of success. For example, Mach was the first Unix-based system to enable intelligent thread scheduling across multiple cores in the system. In previous systems, such as Unix, all processes were single threaded, and any additional threading was completely a user-space construct. This restricted threads to the single core on which a process was currently scheduled. Mach expanded this process model to make the kernel aware of each of its threads and schedule them on cores independently. It is the father of the modern day 1:1 threading model.

In terms of actual abstractions, Psyche's virtual processor abstraction looks very similar to the “faked” vcore abstraction we provide for parlib on Linux, including the limitations that come with it. Virtual processors are represented by pinned kernel threads, and applications multiplex their user threads on top of them. However, every application has its own view of the virtual processor space, and the underlying kernel multiplexes corresponding virtual processors from each application onto the physical cores they are pinned to. In this way, true parallelism is not expressible by the application because the kernel intervenes to multiplex virtual processors whenever it deems necessary.

Scheduler Activations introduced the notion of a virtual multiprocessor, which improves on the virtual processor concept, but has other problems which limit its usefulness. Using the virtual multiprocessor abstraction, applications are actually able to request dedicated access to a subset of cores on the machine. Each core is backed by an activation, on top of which an application can multiplex a set of user threads. However, multiple activations are used to represent each core over the lifetime of an application, making it difficult to maintain per-core data in any meaningful way (e.g. to implement per-core run queues or explicitly communicate between two cores via some kind of messaging mechanism). This occurs because the activation used to represent each core is essentially a kernel thread and can block in the kernel whenever a user thread running on it makes a blocking system call. When this happens, a new activation is created in its place, and passed back to the application on the same core where the blocking system call was issued.
Just as in parlib, control of the core is maintained by the application at all times, but the activation used to represent that core has changed, limiting its usefulness. Moreover, the process of coordinating threads across the user/kernel boundary is more expensive than with parlib, or even with an optimized 1:1 threading model. This overhead (and the complexities it brings with it) tends to outweigh any benefits gained by being able to control the thread scheduling policy in user-space [33].

Both Psyche and Scheduler Activations follow an M:N threading model, where it's somewhat natural to think about expressing parallelism by mapping M threads onto N abstractions that represent a core. However, even on Linux, with its 1:1 threading model, tasksets (or cpusets combined with sched_setaffinity()) can be used to express a limited form of parallelism. All thread scheduling decisions are still left up to the kernel, but tasksets allow an application to restrict which cores those threads will ultimately be scheduled on. Whenever those threads execute, they are isolated to those cores alone. In this way, an application is able to express the amount of parallelism it desires, but leave all thread scheduling decisions up to the kernel. As with Psyche and our vcore implementation on Linux, however, there is no way of providing dedicated access to the cores in a taskset, limiting the effectiveness of expressing parallelism in this way.

Relatedly, a technique known as “handoff scheduling” can be used in conjunction with tasksets to implement a form of cooperative scheduling under a 1:1 model. This technique was first introduced by Mach in the late 1980s, but Google has recently introduced a patch to the Linux kernel which brings a form of handoff scheduling to Linux as well [42]. At the root of this patch is a fast-path system call, called switch_to, which allows an application to synchronously transfer control of one thread to another, leaving the original thread unscheduled. Using this call, user-space is able to precisely control which of its threads are scheduled at any given time. Moreover, the switch_to call bypasses all of the scheduling logic of the kernel, making it 16x faster than a traditional futex call (which one could imagine using to implement similar functionality, albeit in an ad-hoc manner).

The biggest advantage of such an approach is that it is a complete solution for cooperative user-level scheduling which can easily be retrofit into the existing Linux kernel. Likewise, it is fully compatible with existing code and tools that assume the 1:1 threading model of Linux. The biggest disadvantage is that it perpetuates the conflation of threads and cores and does nothing to improve core isolation. Nor does it solve the classic problem of oversubscribing threads to cores when running multiple applications on the system at a time. Additionally, it does not allow an application to dictate the order in which each of its schedulable threads will run; that is still left completely up to the kernel. The combination of Akaros and parlib solves each of these problems and more.

5.3 Contemporary Systems

Our work on Akaros has been highly influenced by many of the principles first set forth by the Exokernel [75]. Both systems expose information about the underlying hardware resources to allow application writers to make better decisions. However, the Exokernel takes a more extreme view and advocates for full application control of both policy and mechanism. We do not want to “exterminate all abstractions” [76]; rather, we want new abstractions that do not hide the performance aspects of the system from the application. Applications should not need to write their own libOS to utilize the system. Instead, they should use the APIs we provide to control their resources, with sensible defaults for resources that are not critical for performance. Likewise, we advocate the separation of resource management from protection domains as advocated by Banga et al. [77]. Their resource containers provide proper accounting of resources, independent of the execution context to which they are currently tied. Akaros's threading model, which separates the notion of threads from cores, provides the perfect vessel in which to do such accounting.

Many of Akaros's other features are related to previous work as well. We agree with Drepper [78] that an asynchronous system call interface is necessary for zero-copy I/O, and with Soares et al. [79] that it will help with cache behavior and IPC. We further believe that system calls ought to be asynchronous because it allows applications to maximize outstanding I/Os and not lose their cores simply because a thread makes a system call that might block. Likewise, our application-level event handlers are similar to work done with Active Messages [38], and we may end up building a system similar to Optimistic Active Messages [80] if we want handlers that can block. In fact, within the Akaros kernel, our method of spawning kernel threads on demand is similar in principle to OAM: plan on not blocking, then deal with it if it happens.

Our model for resource provisioning discussed in Section 1.2.2 is also grounded in a large body of prior work, mostly generated by research on multi-core operating systems and the real-time community [81, 82, 83, 84]. In particular, our model provides an abstraction similar to that of a Software Performance Unit (SPU) [85], with a few distinctions. SPUs rely on processes to cooperatively share unused resources with background tasks, instead of allowing the system to manage these resources itself. Moreover, we do not grant a process access to the resources contained in its resource partitions until the process explicitly requests them. Instead, resource partitions serve as a provision of resources similar to the notion of a “reserve” [86]. Reserves, however, were designed for predictable applications that have a predetermined schedule, whereas our provisioning interfaces were not. Our abstractions go beyond the initial steps put forth by these previous systems, and integrate these ideas into a single cohesive system.

Taken as a whole, custom OSs similar to Akaros have thrived in high-performance computing (HPC) for many years [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. For example, Kitten [87] is a lightweight kernel that typically runs on the compute nodes of supercomputers. Kitten focuses on reducing kernel interference and maximizing performance for batch-style HPC workloads. Likewise, the IBM Blue Gene series of supercomputers [88, 89, 90] make use of
a very minimal operating system, called CNK, which gives applications full access to the memory and cpu resources available on its compute nodes. Many of Akaros's abstractions (especially the many-core process) would work well for HPC applications and could be built into these HPC OSs. Likewise, Akaros could run alongside these OSs in an HPC cluster, as long as the appropriate hardware and network protocols were supported.

A number of other operating systems designed for many-core architectures have also emerged over the last decade. Corey [91] is an exokernel that improves on kernel scalability by reducing sharing to the bare minimum desired by an application. Helios [92] focuses on seamlessly running on heterogeneous hardware. We do not explicitly focus on NUMA or heterogeneity, but there is nothing that prevents our techniques from working in that environment. Like Akaros, the fos [93] system is an operating system designed for the datacenter. Unlike Akaros, however, fos presents a single-image abstraction for an entire datacenter of machines, and would require the entire datacenter to run fos to be effective. Akaros strives to provide compatibility at a network protocol level, allowing incremental deployment in datacenters. Notably, we envision data-intensive and processing-intensive nodes running Akaros, while the nodes on the control plane (e.g. the GFS master or Hadoop scheduler) run commodity OSs.

Likewise, Barrelfish and its multikernel [94] form an operating system designed for the datacenter. Barrelfish assumes the existence of heterogeneous hardware and structures the OS as a distributed system, with message passing and replicated operating system state. We do not treat the OS as a distributed system; instead we have a small set of cores (the low-latency cores) that make decisions for other cores. Moreover, a large amount of state is global in Akaros, much like in Linux. Although global state sounds like a potential problem, researchers have analyzed Linux's VFS and determined a few bottlenecks that can be removed [95]. More importantly, Linux has its own set of VFS scalability patches that have been in the works for a while [96, 97]. Kernel scalability is a large concern of ours, but unlike these other systems, we focus more on abstractions for enabling high-performance parallel applications in the datacenter. Furthermore, Akaros should scale much better than systems with more traditional process models because there are no kernel tasks or threads underlying each thread of a process and because we partition resources.

Other systems that relate to our work include Mesos [21] and Google Borg [22]. As discussed in Section 1.1, these are cluster management systems (or cluster OSs) that decide when and where to allocate resources to applications across an entire datacenter. Our focus on resource isolation and two-level resource management should provide better support for these systems and others like them. Specifically, we can use Akaros's provisioning interfaces to make node-local decisions that enforce the resource allocations set up by these systems. This should both simplify the global resource allocation algorithm and reduce network communication.

Chapter 6

Conclusion

The overarching theme for all work presented in this thesis is the promotion of user-level scheduling as a viable means for exploiting parallelism in modern day systems. The move towards many-core CPUs, multi-lane high-speed networking cards, hundreds of gigabytes of memory, and terabytes of disk per machine provides a unique opportunity to redesign the software stack of modern systems to run more efficiently. Our work on Akaros attempts to tackle this problem head-on, and this thesis focuses on one specific aspect of this task: building user-level libraries to take advantage of Akaros's ability to explicitly allocate cores to an application.

We began this thesis with a discussion of modern systems architectures, with a focus on the evolution of datacenters and why we believe a new operating system is warranted in this computing environment. We then provided a brief overview of Akaros, with a focus on its primary “many-core” process abstraction (MCP) and its resource provisioning interfaces. These abstractions provide the foundation for explicit core allocation on top of which the rest of the work in this thesis is based. We leverage these core management capabilities and present the design and implementation of the user-level libraries that sit on top of them.

At the heart of these user-level libraries is parlib: a framework for building efficient parallel runtimes based on user-level scheduling. Parlib leverages Akaros's ability to explicitly allocate cores to an application and provides a set of higher-level abstractions that make working with these cores easier. These abstractions make it possible to construct highly efficient parallel runtimes on modern day multi-core architectures.

However, parlib alone is insufficient to share cores efficiently between competing software components inside an application. Such functionality is essential in today's open source community, where highly-tuned libraries built with competing parallel runtimes are readily available. Application writers should be able to leverage these libraries instead of writing their own or manually tuning them to operate together efficiently. To provide this functionality, we provide a reimplementation of Lithe on top of parlib. Although not originally proposed as such, Lithe can be thought of as a natural extension to parlib, with the added capability of coordinating access to cores among multiple user-level schedulers in a single application. Our work on Lithe extends and improves upon the original version of Lithe first proposed by Pan et al. in 2010.

Finally, our port of Go to Akaros demonstrates the completeness of the Akaros kernel in terms of supporting a fully functioning runtime environment that matches well to its underlying threading model. By ensuring that Akaros can pass Go's extensive test suite, we provide a fairly complete regression of many of Akaros's core features.

Overall, our work is grounded in the assertion that traditional OS abstractions are ill-suited for high-performance and parallel applications, especially on large-scale SMP and many-core architectures. By customizing Akaros to provide a few, well-defined mechanisms for fine-grained management of cores, this work demonstrates that it is possible to construct a user-level software stack that ensures scalability with increased performance and predictability. We are still a long way from providing a fully-functioning system that is capable of handling production workloads on production servers, but the work presented here helps bring us one step closer to this goal.

Bibliography

[1] Barret Rhoden et al. “Improving per-node efficiency in the datacenter with new OS abstractions”. In: SOCC ’11: Proceedings of the 2nd ACM Symposium on Cloud Computing. New York, NY, USA: ACM, Oct. 2011, 25:1–25:8.
[2] Barret Rhoden. “Operating System Support for Parallel Processes”. Ph.D. dissertation. University of California, Berkeley, 2014.
[3] Heidi Pan. “Cooperative Hierarchical Resource Management for Efficient Composition of Parallel Software”. Ph.D. dissertation. Massachusetts Institute of Technology, 2010.
[4] Heidi Pan, Benjamin Hindman, and Krste Asanovic. “Composing parallel software efficiently with lithe”. In: PLDI (2010), pp. 376–387.
[5] The Go Programming Language. URL: https://golang.org/.
[6] David E Culler et al. “Parallel computing on the Berkeley NOW”. In: 9th Joint Symposium on Parallel Processing. 1997.
[7] Herb Sutter. “The free lunch is over: A fundamental turn toward concurrency in software”. In: Dr. Dobb's Journal 30.3 (2005), pp. 202–210.
[8] Andrew A Chien and Vijay Karamcheti. “Moore's Law: The First Ending and A New Beginning”. In: Computer 12 (2013), pp. 48–53.
[9] Ron Brightwell et al. “A Performance Comparison of Linux and a Lightweight Kernel”. In: IEEE International Conference on Cluster Computing. 2003, pp. 251–258. DOI: 10.1109/CLUSTR.2003.1253322.
[10] Hakan Akkan, Michael Lang, and Lorie Liebrock. “Understanding and isolating the noise in the Linux kernel”. In: Int. J. High Perform. Comput. Appl. 27.2 (May 2013), pp. 136–146. ISSN: 1094-3420. DOI: 10.1177/1094342013477892. URL: http://dx.doi.org/10.1177/1094342013477892.
[11] Kazutomo Yoshii et al. “Extending and benchmarking the “Big Memory” implementation on Blue Gene/P Linux”. In: Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers. ROSS ’11. Tucson, Arizona: ACM, 2011, pp. 65–72. ISBN: 978-1-4503-0761-1. DOI: 10.1145/1988796.1988806. URL: http://doi.acm.org/10.1145/1988796.1988806.

[12] P. De, R. Kothari, and V. Mann. “Identifying sources of Operating System Jitter through fine-grained kernel instrumentation”. In: Cluster Computing, 2007 IEEE International Conference on. 2007, pp. 331–340. DOI: 10.1109/CLUSTR.2007.4629247.
[13] Pradipta De, Vijay Mann, and Umang Mittaly. “Handling OS jitter on multicore multithreaded systems”. In: Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. 2009. DOI: 10.1109/IPDPS.2009.5161046.
[14] R. Gioiosa, S.A. McKee, and M. Valero. “Designing OS for HPC Applications: Scheduling”. In: Cluster Computing (CLUSTER), 2010 IEEE International Conference on. 2010, pp. 78–87. DOI: 10.1109/CLUSTER.2010.16.
[15] M. Giampapa et al. “Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene's CNK”. In: High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for. 2010. DOI: 10.1109/SC.2010.22.
[16] Terry Jones. “Linux kernel co-scheduling and bulk synchronous parallelism”. In: Int. J. High Perform. Comput. Appl. 26.2 (May 2012), pp. 136–145. ISSN: 1094-3420. DOI: 10.1177/1094342011433523. URL: http://dx.doi.org/10.1177/1094342011433523.
[17] William TC Kramer. “PERCU: A Holistic Method for Evaluating High Performance Computing Systems”. PhD thesis. University of California, Berkeley, Nov. 2008.
[18] Stefan Wächtler and Michal Sojka. “Operating System Noise: Linux vs. Microkernel”. In: Proceedings of the 14th Real-Time Linux Workshop. Open Source Automation Development Lab (OSADL), Oct. 2012.
[19] Adam Hammouda, Andrew Siegel, and Stephen Siegel. “Overcoming Asynchrony: An Analysis of the Effects of Asynchronous Noise on Nearest Neighbor Synchronizations”. In: Solving Software Challenges for Exascale. Ed. by Stefano Markidis and Erwin Laure. Vol. 8759. Lecture Notes in Computer Science. Springer International Publishing, 2015, pp. 100–109. ISBN: 978-3-319-15975-1. DOI: 10.1007/978-3-319-15976-8_7. URL: http://dx.doi.org/10.1007/978-3-319-15976-8_7.
[20] Jeffrey Dean, Sanjay Ghemawat, and Google Inc. “MapReduce: simplified data processing on large clusters”. In: OSDI ’04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.
[21] Benjamin Hindman et al. Mesos: A platform for fine-grained resource sharing in the data center. Tech. rep. UC Berkeley, 2010.
[22] Abhishek Verma et al. “Large-scale cluster management at Google with Borg”. In: Proceedings of the Tenth European Conference on Computer Systems. ACM. 2015, p. 18.
[23] Matthew Sottile and Ronald Minnich. “Analysis of microbenchmarks for performance tuning of clusters”. In: Cluster Computing, 2004 IEEE International Conference on. IEEE. 2004, pp. 371–377.

[24] Alexander Rasmussen et al. “TritonSort: A Balanced Large-Scale Sorting System”. In: Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation. ACM, 2011.
[25] Andrew Waterman et al. The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA. Tech. rep. May 2011.
[26] Dror G Feitelson and Larry Rudolph. “Gang Scheduling Performance Benefits for Fine-Grain Synchronization”. In: Journal of Parallel and Distributed Computing 16 (1992), pp. 306–318.
[27] T E Anderson et al. “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”. In: Proc. of the ACM Symp. on Operating Systems Principles (SOSP). 1991.
[28] Steven M Hand. “Self-paging in the Nemesis operating system”. In: Proceedings of the Third Symposium on Operating Systems Design and Implementation. Berkeley, CA, USA: USENIX Association, 1999, pp. 73–86.
[29] I Leslie et al. “The design and implementation of an operating system to support distributed multimedia applications”. In: IEEE Journal on Selected Areas in Communications 14 (1996), pp. 1280–1297.
[30] T A Davis. “Multifrontal multithreaded rank-revealing sparse QR factorization”. In: ACM Trans. Math. Softw. (under submission, 2008).
[31] Matt Asay. Google's Go Programming Language: Taking Cloud Development By Storm. 2015. URL: http://readwrite.com/2014/03/21/google-go-golang-programming-language-cloud-development.
[32] Rob Von Behren et al. “Capriccio: Scalable threads for internet services”. In: SOSP ’03. 2003.
[33] Ingo Molnar. 1:1 Threading vs. Scheduler Activations. URL: https://lkml.org/lkml/2002/9/24/33.
[34] Nathan J Williams. “An Implementation of Scheduler Activations on the NetBSD Operating System”. In: USENIX Annual Technical Conference, FREENIX Track. 2002, pp. 99–108.
[35] Mindaugas Rasiukevicius. “Thread scheduling and related interfaces in NetBSD 5.0”. 2009.
[36] Vincent Danjean, Raymond Namyst, and Robert D Russell. “Integrating kernel activations in a multithreaded runtime system on top of Linux”. In: Parallel and Distributed Processing. Springer, 2000, pp. 1160–1167.
[37] Butler W Lampson and David D Redell. “Experience with processes and monitors in Mesa”. In: Communications of the ACM 23.2 (1980), pp. 105–117.
[38] Thorsten von Eicken et al. Active Messages: a Mechanism for Integrated Communication and Computation. Tech. rep. Berkeley, CA, USA, 1992.

[39] Donald Lewine. POSIX Programmer's Guide. O'Reilly Media, Inc., 1991.
[40] Jeff Bonwick et al. “The Slab Allocator: An Object-Caching Kernel Memory Allocator”. In: USENIX Summer. Vol. 16. Boston, MA, USA. 1994.
[41] Joseph Albert. “Algebraic Properties of Bag Data Types”. In: VLDB. Vol. 91. Citeseer. 1991, pp. 211–219.
[42] Paul Turner. User-level threads...... with threads. Presented at the Linux Plumbers Conference, New Orleans, Louisiana. 2013.
[43] Will Reese. “Nginx: the high-performance web server and reverse proxy”. In: Linux Journal 2008.173 (2008), p. 2.
[44] David H Bailey et al. “The NAS parallel benchmarks”. In: International Journal of High Performance Computing Applications 5.3 (1991), pp. 63–73.
[45] Andi Kleen. x86: Add support for rd/wr fs/gs base. Published as a set of kernel patches to the Linux Kernel Mailing List. April 2014. URL: https://lkml.org/lkml/2014/4/28/582.
[46] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. 2014.
[47] J Robert Von Behren, Jeremy Condit, and Eric A Brewer. “Why Events Are a Bad Idea (for High-Concurrency Servers)”. In: HotOS. 2003, pp. 19–24.
[48] Matt Welsh, David Culler, and Eric Brewer. “SEDA: an architecture for well-conditioned, scalable internet services”. In: ACM SIGOPS Operating Systems Review 35.5 (2001), pp. 230–243.
[49] John Ousterhout. “Why threads are a bad idea (for most purposes)”. Presentation given at the 1996 USENIX Annual Technical Conference. San Diego, CA, USA. 1996.
[50] Stefanie Forrester (Dakini). Tuning Nginx for Best Performance. Personal blog. April 2012. URL: http://dak1n1.com/blog/12-nginx-performance-tuning.
[51] Martin Fjordvald. Optimizing NGINX for High Traffic Loads. Personal blog. April 2011. URL: http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/.
[52] Intel. Intel Math Kernel Library. 2007.
[53] SuiteSparse. URL: http://faculty.cse.tamu.edu/davis/suitesparse.html.
[54] Flickr. Photo Source URLs. URL: https://www.flickr.com/services/api/misc.urls.html.
[55] GraphicsMagick. URL: http://www.graphicsmagick.org/.
[56] The Go Authors. GoUsers. URL: https://github.com/golang/go/wiki/GoUsers.

[57] Rob Pike. “Go at Google”. In: Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. SPLASH ’12. Tucson, Arizona, USA: ACM, 2012, pp. 5–6. ISBN: 978-1-4503-1563-0. DOI: 10.1145/2384716.2384720. URL: http://doi.acm.org/10.1145/2384716.2384720.
[58] The Plan9 Authors. The plan9 man pages: listen1. URL: http://man.cat-v.org/p9p/8/listen1.
[59] The Linux Authors. The Linux man pages: nc. URL: http://linux.die.net/man/1/nc.
[60] The Go Authors. Go 1.3 Release Notes. URL: https://golang.org/doc/go1.3.
[61] M L Scott and T J LeBlanc. Psyche: A General-Purpose Operating System for Shared-Memory Multiprocessors. Tech. rep. Rochester, NY, USA, 1987.
[62] Hugh C Lauer and Roger M Needham. “On the duality of operating system structures”. In: ACM SIGOPS Operating Systems Review 13.2 (1979), pp. 3–19.
[63] Melvin E Conway. “Design of a separable transition-diagram compiler”. In: Communications of the ACM 6.7 (1963), pp. 396–408.
[64] Sun Microsystems Inc. JDK 1.1 for Solaris Developer's Guide. 2000. URL: http://docs.oracle.com/cd/E19455-01/806-3461/book-info/index.html.
[65] Ben Catanzaro. Multiprocessor System Architectures. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994. ISBN: 0-13-089137-1.
[66] Kazuki Sakamoto and Tomohiko Furumoto. “Grand central dispatch”. In: Pro Multithreading and Memory Management for iOS and OS X. Springer, 2012, pp. 139–145.
[67] Windows User-Mode Scheduling. URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx.
[68] Butler Lampson. A description of the Cedar language. Tech. rep. CSL-83-15, Xerox Palo Alto Research Center, 1983.
[69] Daniel C Swinehart et al. “A structural view of the Cedar programming environment”. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 8.4 (1986), pp. 419–490.
[70] Paul R McJones and Garret F Swart. Evolving the UNIX system interface to support multithreaded programs. Vol. 28. Digital Equipment Corporation Systems Research Center, 1987.
[71] Mike Accetta et al. “Mach: A new kernel foundation for UNIX development”. 1986.
[72] Berny Goodheart and James Cox. The Magic Garden Explained: The Internals of UNIX System V Release 4: an Open Systems Design. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994. ISBN: 0-13-098138-9.
[73] Maxwell N Krohn, Eddie Kohler, and M Frans Kaashoek. “Events Can Make Sense”. In: USENIX Annual Technical Conference. 2007, pp. 87–100.

[74] Frank Dabek et al. "Event-driven programming for robust software". In: Proceedings of the 10th Workshop on ACM SIGOPS European Workshop. ACM. 2002, pp. 186–189.
[75] D R Engler, M F Kaashoek, and J O'Toole. "Exokernel: An Operating System Architecture for Application-Level Resource Management". In: Proc. of the ACM Symp. on Operating Systems Principles (SOSP). 1995.
[76] Dawson R Engler and M Frans Kaashoek. "Exterminate All Operating System Abstractions". In: The 5th IEEE Workshop on Hot Topics in Operating Systems, Orcas Island. IEEE Computer Society, 1995, pp. 78–83.
[77] Gaurav Banga, Peter Druschel, and Jeffrey C Mogul. "Resource containers: A new facility for resource management in server systems". In: Proc. of the ACM Symp. on Operating Systems Design and Implementation (OSDI). 1999.
[78] Ulrich Drepper. "The need for asynchronous, zero-copy network I/O". In: Ottawa Linux Symposium. Citeseer. 2006, pp. 247–260.
[79] Livio Soares and Michael Stumm. "FlexSC: flexible system call scheduling with exception-less system calls". In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–8.
[80] Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation. 1995.
[81] Amit Gupta and Domenico Ferrari. "Resource partitioning for real-time communication". In: IEEE/ACM Trans. Netw. 3.5 (1995), pp. 501–508.
[82] Aloysius K Mok, Xiang Feng, and Deji Chen. "Resource partition for real-time systems". In: Real-Time Technology and Applications Symposium, 2001. Proceedings. Seventh IEEE. IEEE. 2001, pp. 75–84.
[83] Kyle J Nesbit et al. "Multicore resource management". In: IEEE Micro 3 (2008), pp. 6–16.
[84] Derek Wright. "Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor". In: Conference on Linux Clusters: The HPC Revolution. 2001.
[85] Ben Verghese, Anoop Gupta, and Mendel Rosenblum. "Performance isolation: sharing and isolation in shared-memory multiprocessors". In: ASPLOS-VIII: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: ACM, 1998, pp. 181–192.
[86] Clifford Mercer, Stefan Savage, and Hideyuki Tokuda. "Processor Capacity Reserves for Multimedia Operating Systems". In: Proceedings of the IEEE International Conference on Multimedia Computing and Systems. 1994.

[87] J Lange et al. "Palacios and Kitten: New High Performance Operating Systems for Scalable Virtualized and Native Supercomputing". In: IPDPS '10: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society, 2010.
[88] Ruud A Haring et al. "The IBM Blue Gene/Q compute chip". In: IEEE Micro 32.2 (2012), pp. 48–60.
[89] Alan Gara et al. "Overview of the Blue Gene/L system architecture". In: IBM Journal of Research and Development 49.2.3 (2005), pp. 195–212.
[90] Gheorghe Almasi et al. "Overview of the IBM Blue Gene/P project". In: IBM Journal of Research and Development 52.1-2 (2008), pp. 199–220.
[91] Silas Boyd-Wickizer et al. "Corey: an operating system for many cores". In: Proc. of the ACM Symp. on Operating Systems Design and Implementation (OSDI). 2008.
[92] Edmund B Nightingale et al. "Helios: heterogeneous multiprocessing with satellite kernels". In: SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2009, pp. 221–234.
[93] David Wentzlaff and Anant Agarwal. "Factored operating systems (fos): the case for a scalable operating system for multicores". In: SIGOPS Oper. Syst. Rev. 43.2 (2009), pp. 76–85.
[94] Adrian Schüpbach et al. "Embracing diversity in the Barrelfish manycore operating system". In: Proc. of the Workshop on Managed Many-Core Systems (MMCS). 2008.
[95] Silas Boyd-Wickizer et al. "An Analysis of Linux Scalability to Many Cores." In: OSDI. Vol. 10. 13. 2010, pp. 86–93.
[96] Jonathan Corbet. JLS: Increasing VFS Scalability. url: http://lwn.net/Articles/360199/.
[97] VFS Scalability Patches in 2.6.36. url: http://lwn.net/Articles/401738/.