Operating System Support for Parallel Processes
Barret Rhoden
Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2014-223
http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-223.html

December 18, 2014

Copyright © 2014, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Operating System Support for Parallel Processes

by Barret Joseph Rhoden

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley.

Committee in charge:
Professor Eric Brewer, Chair
Professor Krste Asanović
Professor David Culler
Professor John Chuang

Fall 2014

Operating System Support for Parallel Processes
Copyright 2014 by Barret Joseph Rhoden

Abstract

Operating System Support for Parallel Processes
by Barret Joseph Rhoden
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor Eric Brewer, Chair

High-performance, parallel programs want uninterrupted access to physical resources. This characterization is true not only for traditional scientific computing, but also for high-priority data center applications that run on parallel processors. These applications require high, predictable performance and low latency, and they are important enough to warrant engineering effort at all levels of the software stack. Given the recent resurgence of interest in parallel computing as well as the increasing importance of data center applications, what changes can we make to operating system abstractions to support parallel programs?

Akaros is a research operating system designed for single-node, large-scale SMP and many-core architectures. The primary feature of Akaros is a new process abstraction called the "Many-Core Process" (MCP) that embodies transparency, application control of physical resources, and performance isolation. The MCP is built on the idea of separating cores from threads: the operating system grants spatially partitioned cores to the MCP, and the application schedules its threads on those cores. Data centers typically have a mix of high-priority applications and background batch jobs, where the demands of the high-priority application can change over time. For this reason, an important part of Akaros is the provisioning, allocation, and preemption of resources, and the MCP must be able to handle having a resource revoked at any moment.

In this work, I describe the MCP abstraction and the salient details of Akaros. I discuss how the kernel and user-level libraries work together to give an application control over its physical resources and to adapt to the revocation of cores at any time, even when the code is holding locks. I show an order of magnitude less interference for the MCP compared to Linux, more resilience to the loss of cores for an HPC application, and how a customized user-level scheduler can increase the performance of a simple webserver.

Contents

1 Introduction
2 Background on Operating Systems
  2.1 System Calls
  2.2 Threads, Concurrency, and Parallelism
  2.3 Scheduling and Competition
  2.4 Synchronization
  2.5 Performance Isolation
3 The Basics of the Many-Core Process and Akaros
  3.1 MCP Basics
  3.2 Akaros Details
4 Life for a Many-Core Process
  4.1 Procinfo and Procdata
  4.2 Virtual Cores
  4.3 MCPs from Creation to Entry
  4.4 User-level Scheduling
  4.5 Impact of Dedicated Cores
5 Event Delivery
  5.1 Concurrent Queues for Payloads
  5.2 The Event Delivery Subsystem
  5.3 Never Miss an Important Event
6 Virtual Core Preemption
  6.1 Preemption Basics
  6.2 Preemption Detection and Recovery Locks
  6.3 Handling Preemption
7 Applications
  7.1 Fluidanimate
  7.2 Webserver
8 Conclusion
Bibliography

List of Figures

4.1 FTQ Raw Output, c89, Linux, Core 7
4.2 FFT @ 3000 Hz, c89, Linux, Core 7
4.3 FFT @ 500 Hz, c89, Linux, Core 7
4.4 FFT @ 100 Hz, c89, Linux, Core 7
4.5 FFT @ 100 Hz, c89, Old Akaros, Core 7
4.6 FFT @ 100 Hz, c89, Akaros, Core 7
4.7 FFT @ 500 Hz, c89, Akaros, Core 7
4.8 FFT @ 3000 Hz, c89, Akaros, Core 7
4.9 FFT @ 3000 Hz, c89, Linux, Core 31
4.10 FFT @ 3000 Hz, c89, Akaros, Core 31
4.11 FFT @ 500 Hz, c89, Linux, Core 31
4.12 FFT @ 500 Hz, c89, Akaros, Core 31
4.13 FFT @ 100 Hz, c89, Linux, Core 31
4.14 FFT @ 100 Hz, c89, Akaros, Core 31
4.15 FTQ Raw Data, c89, Akaros, Core 7
4.16 FTQ Raw Data, c89, Akaros, Core 31
4.17 FFT @ 3000 Hz, hossin, Linux, Core 7
4.18 FFT @ 3000 Hz, hossin, Akaros, Core 7
4.19 FFT @ 500 Hz, hossin, Linux, Core 7
4.20 FFT @ 500 Hz, hossin, Akaros, Core 7
4.21 FFT @ 100 Hz, hossin, Linux, Core 7
4.22 FFT @ 100 Hz, hossin, Akaros, Core 7
4.23 FTQ Raw Data, hossin, Akaros, Core 7
4.24 FTQ Raw Data, hossin, Linux, Core 7
6.1 Akaros Spinlocks, 4 Threads, Acquisition Latency
6.2 Linux Spinlocks, 31 Threads, Acquisition Times
6.3 MCS locks, 31 Threads, Acquisition Times
6.4 Akaros, Basic MCS locks, 31 Threads, Lock Throughput
6.5 Akaros, MCS-PDRO locks, 31 Threads, Lock Throughput
6.6 Akaros, MCS-PDRN locks, 31 Threads, Lock Throughput
7.1 Fluidanimate, Akaros, 64 Threads, 31 Cores
7.2 Fluidanimate Performance on Akaros and Linux with Preemption
7.3 Fluidanimate Relative Performance on Akaros and Linux with Preemption
7.4 Akaros Kweb, Custom-2LS, Throughput vs. Workers
7.5 Kweb and Fluidanimate Throughput with Preemption

List of Tables

4.1 Average Thread Context-Switch Latency (nsec)
4.2 Akaros Context-Switch Latency (nsec)
6.1 Spinlock Acquisition Latency (in ticks)
6.2 MCS Lock Acquisition Latency (in ticks)
7.1 Fluidanimate Runtime (seconds)
7.2 Kweb Throughput, 100 connections of 100,000 calls each

Acknowledgments

I have been extremely fortunate over the past seven years. Not only did I have the opportunity to pursue a project I found interesting, but I did so while working with excellent people. Both my research opportunity and the group environment are the product of my advisor: Eric Brewer. I would like to thank Eric for his inspiration and guidance, especially his implicit guidance as a role model. Although research is important, it is less important than a work-life balance and being "safe in your own skin."
Eric gave me the breathing room necessary to work on Akaros and allowed me to "do the right thing," even if it took longer.

I would also like to thank my committee, especially David Culler, for providing a much-needed historical perspective for my work. Additionally, David Wessel's music application was the original inspiration for the separation of provisioning and allocation in Akaros.

The bulk of my time was spent with other graduate students and developers, and my work would not have been possible without them. Kevin Klues has been my partner in crime for a number of years now. It's hard to believe that the end of our era in Berkeley is finally here. David Zhu and I have had discussions since our first semester here together. Andrew Waterman is an impressive coder who tends to know what I am thinking before I say it. Paul Pearce developed Akaros's initial network drivers, and Andrew Gallatin vastly improved the more recent Plan 9 network stack. I would be lucky to work with any of them in the future.

Ron Minnich and I met in 2009, but we did not work together until our serendipitous pairing as host and intern at Google in 2013. He was instrumental both in porting the guts of Plan 9 to Akaros and in using FTQ. Ron breathed new life into Akaros, greatly increasing its chances of having a life beyond graduate school. More importantly, Ron is the real deal: authentic and generous, and I hope to be more like him.

My family has been very supportive over the years. They have always provided a home in Pittsburgh, complete with nieces and nephews, for whenever I was in town and needed to relax.

Finally, I would like to thank my wife, Tamara Broderick. Not only did she help with FTQ and R, but she has been my biggest supporter. Whether it was coining the name "Akaros" or designing giraffe-themed merchandise, she has always been there for me, and I look forward to returning the favor.

Chapter 1

Introduction

There are limitations to what we can do with computers. System designers make trade-offs, and over time certain designs should be reevaluated, especially when there is a change in the landscape of computing. One such change is the resurgence of interest in parallel computing due to the industry's failure to increase uniprocessor performance [5].