Improving the Performance of User-level Runtime Systems for Concurrent Applications

by Saman Barghi

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2018
© Saman Barghi 2018

Examining Committee Membership

The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.

External Examiner: Carsten Griwodz
Professor, Department of Informatics, University of Oslo, Oslo, Norway
Chief Research Scientist, Simula Research Laboratory, Lysaker, Norway

Supervisor(s): Martin Karsten
Associate Professor, Cheriton School of Computer Science, University of Waterloo

Internal Member: Peter A. Buhr
Associate Professor, Cheriton School of Computer Science, University of Waterloo
Ken Salem
Professor, Cheriton School of Computer Science, University of Waterloo

Internal-External Member: Werner Dietl
Assistant Professor, Dept. of Electrical and Computer Engineering, University of Waterloo

Author's Declaration

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract

Concurrency is an essential part of many modern large-scale software systems. Applications must handle millions of simultaneous requests from millions of connected devices. Handling such a large number of concurrent requests requires runtime systems that efficiently manage concurrency and communication among tasks in an application across multiple cores. Existing low-level programming techniques provide scalable solutions with low overhead, but require non-linear control flow.
Alternative approaches to concurrent programming, such as Erlang and Go, support linear control flow by mapping multiple user-level execution entities across multiple kernel threads (M:N threading). However, these systems provide comprehensive execution environments that make it difficult to assess the performance impact of user-level runtimes in isolation.

This thesis presents a nimble M:N user-level threading runtime that closes this conceptual gap and provides a software infrastructure to precisely study the performance impact of user-level threading. Multiple design alternatives are presented and evaluated for the scheduling, I/O multiplexing, and synchronization components of the runtime. The performance of the runtime is evaluated in comparison to event-driven software, system-level threading, and other user-level threading runtimes. An experimental evaluation is conducted using benchmark programs, as well as the popular Memcached application. The user-level runtime supports high levels of concurrency without sacrificing application performance.

In addition, the user-level scheduling problem is studied in the context of an existing actor runtime that maps multiple actors to multiple kernel-level threads. In particular, two locality-aware work-stealing schedulers are proposed and evaluated. It is shown that locality-aware scheduling can significantly improve the performance of a class of applications with a high level of concurrency.

In general, the performance and resource utilization of large-scale concurrent applications depend on the level of concurrency that can be expressed by the programming model. This fundamental effect is studied by refining and customizing existing concurrency models.

Acknowledgements

I would like to take this opportunity to thank all those who helped me to make this thesis possible.
First and foremost, I would like to express my sincere thanks to my supervisor, Martin Karsten, for his regular advice, guidance, encouragement, and perseverance throughout the course of my PhD. This thesis is only made possible with his unlimited help and support. I would also like to thank Peter A. Buhr for his guidance, help, and support throughout the course of my PhD. I thank the members of my PhD committee for their valuable and constructive feedback: Carsten Griwodz, Werner Dietl, Ken Salem, and Peter A. Buhr. Finally, I would like to thank the many friends and colleagues I have met during my studies for their support and friendship over the past few years.

Dedication

This is dedicated to my beloved parents and family for their love, endless support, encouragement, and sacrifices.

Table of Contents

List of Tables
List of Figures
Abbreviations

1 Introduction
  1.1 Thesis Statement
  1.2 Main Results

2 Background and Related Work
  2.1 Multicore Processors
  2.2 Concurrent and Parallel Programming Models
    2.2.1 Thread Programming
    2.2.2 Event Programming
    2.2.3 Actor-Model Programming
    2.2.4 Other Programming Models
  2.3 User-level Runtime Systems
    2.3.1 Input/Output (I/O)
    2.3.2 Scheduling
    2.3.3 User-level Threading Libraries

3 User-level Threading Runtime
  3.1 Overview
  3.2 Scheduling
    3.2.1 Placement
    3.2.2 Scheduler Design
    3.2.3 Idle Management
  3.3 Synchronization
  3.4 Synchronous I/O Interface
    3.4.1 I/O Subsystem
    3.4.2 I/O Polling
    3.4.3 Polling Mechanism
  3.5 Implementation Details

4 Evaluation
  4.1 Environment and Tools
  4.2 Queue Performance
  4.3 I/O Subsystem Design
    4.3.1 Yield Before Read
    4.3.2 Lazy Registration
  4.4 I/O Polling
  4.5 Benchmark Evaluation
    4.5.1 Scalability
    4.5.2 Efficiency
    4.5.3 I/O Multiplexing
  4.6 Application Evaluation
    4.6.1 Memcached Transformation
    4.6.2 Performance Evaluation

5 Actor Scheduling
  5.1 Related Work
  5.2 Characteristics of Actor Applications
  5.3 Locality-Aware Scheduler (LAS)
    5.3.1 Memory Hierarchy Detection
    5.3.2 Locality-aware Work-stealing
  5.4 Experiments and Evaluation
    5.4.1 Benchmarks
    5.4.2 Experiments
    5.4.3 Chat Server

6 Utilization Model
  6.1 Probabilistic Model
  6.2 Analytic Model
    6.2.1 Experimental Verification

7 Conclusion and Future Work
  7.1 Summary and Conclusion
  7.2 Future Work
References

APPENDICES
A Other User-level Threading Libraries
B Intel Results

List of Tables

2.1 M:N user-level threading frameworks with I/O support
4.1 Cache Layout

List of Figures

2.1 Shared Ready Queue
2.2 Shared Ready Queue and Local Queues
2.3 Local Queues
2.4 Deque Used in Work-stealing
3.1 An example of a setup with 6 processors and 3 clusters
3.2 Scheduler and I/O subsystem per cluster
3.3 Cluster Scheduler
3.4 Barging spectrum for different locks
3.5 Blocking mutex with baton passing
3.6 User-Level I/O Synchronization
3.7 Design with a poller fibre per cluster
3.8 Central polling mechanism using a dedicated poller thread
3.9 Polling frequency spectrum
3.10 (a) Fixed-size stack, (b) Segmented stack
4.1 Scalability of Different Queues
4.2 Effect of yielding before read on web server performance
4.3 Effect of lazy registration on short-lived connections using a web server
4.4 Throughput and Latency of I/O polling mechanisms
4.5 Scalability (32 cores, 1024 fibres, 32 locks). Throughput is normalized by the respective throughput for a single thread/fibre
4.6 Efficiency (N² fibres, N=4 locks, 20µs loops), lower is better
4.7 Web Server (12,000 connections)
4.8 Web Server (16 cores)
4.9 Web Server (16 cores)
4.10 Memcached - Core Scalability (6,000 connections, write ratio 0.1)
4.11 Memcached - Connection Scalability (16 cores, write ratio 0.1)
4.12 Memcached - Read/Write Scalability With Blocking Mutex (16 cores, 6,000 connections)