Effective cooperative scheduling of task-parallel applications on multiprogrammed parallel architectures

GEORGIOS VARISTEAS

Doctoral Thesis in Information and Communication Technology
KTH Royal Institute of Technology
Stockholm, Sweden, 2015
TRITA-ICT 2015:14
ISBN: 978-91-7595-708-1
KTH, SE-100 44 Stockholm, Sweden

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Philosophy on 2015-11-09 at KTH Kista.

© Georgios Varisteas

Printed by Universitetsservice US AB

Abstract

Emerging architecture designs include tens of processing cores on a single chip die; it is believed that the number of cores will reach the hundreds in not so many years from now. However, most common parallel workloads cannot fully utilize such systems. They expose fluctuating parallelism, and do not scale up indefinitely, as there is usually a point after which synchronization costs outweigh the gains of parallelism. The combination of these issues suggests that large-scale systems will either be multiprogrammed or have their unneeded resources powered off.

Multiprogramming leads to hardware resource contention and, as a result, application performance degradation, even when there are enough resources, due to negative share effects and increased bus traffic. Most often this degradation is quite unbalanced between co-runners, as some applications dominate the hardware over others. Current Operating Systems blindly provide applications with access to as many resources as they ask for. This leads to over-committing the system with too many threads, memory contention, and increased bus traffic. Because an application has no insight into system-wide resource demands, most parallel workloads will create as many threads as there are available cores. If every co-running application does the same, the system ends up with N times as many threads as cores. Threads then need to time-share cores, so the continuous context switching and cache line evictions generate considerable overhead.

This thesis proposes a novel solution across all software layers that achieves throughput optimization and uniform performance degradation of co-running applications. Through a novel, fully automated approach (DVS and Palirria), task-parallel applications can accurately quantify their available parallelism online, generating a meaningful metric as parallelism feedback to the Operating System. A second component in the Operating System scheduler (Pond) uses such feedback from all co-runners to effectively partition available resources.

The proposed two-level scheduling scheme ultimately achieves having each co-runner degrade its performance by the same factor, relative to how it would execute with unrestricted isolated access to the same hardware. We call this fair scheduling, departing from the traditional notion of equal opportunity, which causes uneven degradation; some experiments show at least one application degrading its performance 10 times less than its co-runners.

Acknowledgements

I would like to give special thanks to... Mats Brorsson for giving me clear perspective

Karl-Filip Faxén for being patient and helpful with all sorts of engineering troubles

Christian Schulte for giving me the initial inspiration to follow systems engineering

Mothy Roscoe, Tim Harris, and Dorit Nuzman for support that gave me confidence to come through the other end.

Sandra Gustavsson Nylén and Marianne Hellmin for handling all procedure headaches on a moment’s notice any given day

Ananya Muddukrishna and Artur Podobas for accompanying me on this difficult ride

Georgia Kanli for being my sounding board daily, my biggest supporter and second harshest critic

everyone who listened to my questions and helped me along the way

and last but not least Avi Mendelson, Willy Zwaenepoel, Mats Dam, Håkan Grahn, and Markus Hidell for giving this journey its deserved ending

Contents

List of Algorithms

List of Figures

List of Tables

I Prologue

1 Introduction and Motivation
  1.1 Motivation
  1.2 Background
    1.2.1 Application level
    1.2.2 System level
  1.3 Problem statements
  1.4 Research methodology
  1.5 Thesis Organization

2 Parallel Architectures
  2.1 Introduction
  2.2 Anatomy of Multi-Core Chips
  2.3 Many-core architectures
  2.4 Cache Hierarchies
  2.5 Interconnect Networks

3 Parallel Programming Models
  3.1 Introduction
  3.2 Task-centric programming models
  3.3 Characterization of task-based applications
  3.4 Scheduling task-parallel applications
    3.4.1 Work-distribution overview
  3.5 Work-stealing
    3.5.1 Task queues
    3.5.2 Scheduling Policies
    3.5.3 Victim Selection
    3.5.4 Discussion
    3.5.5 Existing work-stealing task schedulers

4 Operating System Scheduling
  4.1 Introduction
  4.2 Basic concepts
    4.2.1 Resource partitioning strategies
    4.2.2 Interrupts
    4.2.3 Context switching
    4.2.4 Thread management
  4.3 System-wide scheduling issues
    4.3.1 Selecting resources
    4.3.2 Fairness
    4.3.3 Efficiency
    4.3.4 Requirements estimation trust
    4.3.5 Scalability: computability and overhead
    4.3.6 Use case 1: a first attempt on the Barrelfish OS
    4.3.7 Use case 2: The evolution of the Linux scheduler
  4.4 System-wide scheduling principles

II Solution Space

5 Adaptive Work-Stealing Scheduling
  5.1 Introduction
  5.2 Parallelism quantifying metrics
  5.3 Deciding allotment size satisfiability
    5.3.1 ASTEAL
  5.4 Computing estimation requirements
  5.5 DVS: Deterministic Victim Selection
    5.5.1 Description
    5.5.2 Formal definition
    5.5.3 Implementation
    5.5.4 Discussion and Evaluation
  5.6 Palirria: parallelism feedback through DVS
    5.6.1 Description
    5.6.2 Definition and discussion
    5.6.3 Implementation
    5.6.4 Discussion and Evaluation

6 Pond: Exploiting Parallelism Feedback in the OS
  6.1 Introduction
  6.2 Restructuring the problem
  6.3 Process model
  6.4 Pond-workers: Threads as a service
    6.4.1 Pond over a monolithic kernel
    6.4.2 Pond over a micro-kernel
  6.5 Scheduling
    6.5.1 Architecture modeling
    6.5.2 Space sharing
    6.5.3 Time Sharing
  6.6 Experimental Methodology
  6.7 Evaluation
    6.7.1 Adversarial Scenario
    6.7.2 Trusted Requirements Estimation
    6.7.3 Limitations
  6.8 Related Work

III Epilogue

7 Summary and Future Work
  7.1 Thesis summary
  7.2 Contributions
  7.3 Future work

Bibliography

List of Algorithms

1  Sequential Fibonacci
2  Task parallel Fibonacci
3  DVS Classification: Main function to construct the victims sets of all nodes
4  DVS Outer Victims: Find a node's outer victims
5  DVS Inner Victims: Find a node's inner victims
6  DVS source Victims set: Construct the source node's victims set
7  DVS Core X victims set: Construct the core X victims set extension
8  DVS Z Class victims set: Construct the Z class victims set extension
9  Palirria Decision Policy: Check DMC and change allotment respectively
10 Incremental DVS Classification: Adapted function to construct the victims sets of nodes from some zones. Code segments identical to algorithm 3 are omitted
11 get_distance_of_units: implementation for a mesh network
12 get_units_at_distance_from: implementation for a mesh network
13 get_distance_of_units: implementation for a point-to-point link network
14 get_units_at_distance_from: implementation for a point-to-point link network
15 get_distance_of_units: implementation for a ring network
16 get_units_at_distance_from: implementation for a ring network
17 GetNextSource: Pick worker in biggest idle area as next source
18 MinWeightCore: Select worker furthest away from all existing sources
19 Increase: Add workers, most idle and closest to the source
20 Decrease: Remove workers furthest from the source
21 Run: Decides next application to run

List of Figures

1.1 Avoiding contention and over-commitment through space-sharing OS scheduling

2.1 Combined results from SPEC CPU 1995, 2000, and 2006
2.2 The free lunch is over
2.3 Diagram of a generic dual core processor
2.4 Block diagrams of Intel's Core 2 Duo and Pentium D, and AMD's Athlon 64 X2
2.5 Example SMT and CMP block diagrams
2.6 Block diagram of the Sandy Bridge microarchitecture, Intel's first mainstream SMT CMP
2.7 Block diagram of the Xeon E5-2600 chip, implementing the Sandy-Bridge microarchitecture
2.8 Block diagram of a 4 module, 8 core, AMD Bulldozer microarchitecture, aimed to compete with Sandy-Bridge
2.9 Cache hierarchy implementations for various processor designs
2.10 AMD Opteron series interconnect designs
2.11 Simplified diagrams of Intel's QuickPathInterconnect (QPI) implementations in variable sized Xeon based systems
2.12 Block diagram of EZ Chip's 100 core Tile architecture
2.13 Simplified block diagram of Intel's Xeon Phi interconnect network

3.1 Recursive Fibonacci parallelized with tasks
3.2 Task tree of FIB(5) as shown in algorithm 2
3.3 A task-parallel application's complete task tree can quantify parallelism properties
3.4 A subtree can visualize the maximum potential for sequential work on the core where the root was placed
3.5 Visualization of task granularity and parallelism potential onto a task tree
3.6 Work distribution strategies
3.7 Simple visualization of sequential selection
3.8 Visualization of leapfrogging and transitive leapfrogging

4.1 Performance degradation variance using non-adaptive runtime on Linux
4.2 Performance degradation variance using adaptive runtime on Linux


4.3 A snapshot of the BIAS system-wide resource manager over Barrelfish
4.4 Co-running processes A and B over CFS

5.1 DVS virtual mesh grid topology for 64 physical cores
5.2 DVS worker placement on the virtual topology
5.3 DVS distance definition through hop-count
5.4 25 worker DVS classification
5.5 Task relocation as load flow through work-stealing
5.6 Two example allotments on two different architectures, classified as per the DVS rules
5.7 Workload input data sets for DVS and Palirria evaluation on Barrelfish and Linux
5.8 First successful steal latency evaluation
5.9 Evaluation of successful steals with DVS on Barrelfish and Linux
5.10 Evaluation of per worker useful execution time for Sort on real hardware
5.11 Standard deviation of useful time among workers with DVS
5.12 DVS execution time evaluation
5.13 Outer victims
5.14 Workers of class Z can be characterized as peak and side, based on their location relative to the topology's diamond shape
5.15 Ideal DVS model distance constraints
5.16 Results on execution time, wasted cycles, and adaptation on Barrelfish (simulator)
5.17 Useful and non-useful normalized time per worker ordered by zone on Barrelfish (simulator)
5.18 Results on execution time, wasted cycles, and adaptation on Linux (real hardware)
5.19 Useful and non-useful normalized time per worker ordered by zone on Linux (real hardware)

6.1 Visualization of Pond as a departure from the traditional process model
6.2 Pond consists of four modules
6.3 The proposed threading model could be considered as either 1:1 or N:M
6.4 Action sequence for loading and running applications over Pond
6.5 Pond over Barrelfish
6.6 Two examples of modeling a typical quad-core, 8 hardware thread, Intel iCore processor
6.7 State machine for time-sharing a worker between 2 applications
6.8 Performance degradation variance using non-adaptive runtime on Linux and Pond
6.9 Performance degradation variance using adaptive runtime on Linux and Pond
6.10 Performance degradation variance and relative worst degradation on Linux and Pond with dishonest estimations
6.11 Per-application performance with dishonest requirements estimations on Linux and Pond

6.12 Performance degradation variance and relative worst degradation on Linux and Pond
6.13 Per-application performance on Linux and Pond

List of Tables

2.1 Cross-node communication latencies in cycles for an 8-node AMD Opteron 6172 system
2.2 Cross-node communication latencies in cycles for a 2-node Intel Xeon E5-2697 system

6.1 Application input data sets for the evaluation of Pond
6.2 Application ideal allotment size in number of worker threads


Part I

Prologue


Chapter 1

Introduction and Motivation

1.1 Motivation

Two-level scheduling — cooperative scheduling between operating system and applications — has emerged as a viable solution for dynamically adaptive placement of parallel applications on shared memory platforms, like modern multi- and many-core architectures, also called Chip MultiProcessors (CMPs). However, current solutions that address scheduling of such applications on multiprogrammed systems heavily favor overall throughput and not individual application performance. Multiprogramming leads to hardware resource contention and, as a result, application performance degradation, even when there are enough resources, due to negative share effects and increased bus traffic. Most often this degradation is quite unbalanced between co-runners, as some applications dominate the hardware over others. This thesis proposes a novel solution across all software layers that achieves throughput optimization and uniform performance degradation of co-running applications.

The universal shift to CMP architectures has radically altered the scheduling requirements at both the application and the system level. What used to be a fight for a timeslice on a single CPU becomes a fight for exclusive access to numerous CPUs. The former situation was conceptually a simple trade-off: the OS would give applications equal access to the processor, delegating solely to the application's runtime the responsibility of effective use of that CPU time. A single CPU also meant uniform access to memory, and although per-application memory footprint was an issue for capacity reasons, it was as manageable as on modern CMPs with banked cache hierarchy levels.

Today parallel applications spread their computation over arbitrarily many processing cores with non-uniform memory access, due to the cache being spread over multiple banks across the topology; a cache coherency protocol handles replication and synchronization across those banks, while it has been shown that these protocols are hard to scale up. Hence hardware parallelism has created several performance-degrading circumstances, such as bus and cache contention.

The aim of current Operating Systems on multiprogrammed CMPs is to partition

resources among running applications in a way that minimizes contention and other negative effects while maximizing throughput. This last statement, however, is troublesome: throughput maximization does not require the simultaneously optimal execution of all applications. If each application can fully utilize the whole system, it is in principle acceptable to run them one after the other. In this thesis we axiomatically assume that multiprogramming necessitates maintaining per-application performance on par with isolated execution. In other words, co-runners should be expected to perform as if the system, or a subset of it, were dedicated to them; negative share effects should be minimized; and if resources are not enough for all co-runners and performance degradation is unavoidable, they should all degrade uniformly.

Practical example scenarios, where co-runners run with equal priority and are expected to execute to completion as fast as possible, would be: a multi-user system, where each user is executing parallel workloads expecting the shortest possible execution time, or a multi-tasking single user shifting focus between parallel applications running simultaneously in the background. In each scenario the simultaneously running applications are expected to run concurrently, and thus need to fight for access to the hardware.

This thesis redefines fair scheduling as having all co-runners execute in proportionally the same time relative to each one's execution time if run in isolation on the same hardware. In other words, every co-runner should degrade its performance by the same positive factor, relative to its execution time with isolated, unrestricted access to the same hardware. Of course, the degradation factor can also be 0 if the available hardware can accommodate the complete requirements of all co-runners.¹

Current Operating Systems blindly provide applications with access to as many resources as they ask for. This leads to over-committing the system with too many threads, memory contention, and increased bus traffic. Because an application has no insight into system-wide resource demands, most parallel workloads will create as many threads as there are available cores. If every co-running application does the same, the system ends up with N times as many threads as cores. Threads then need to time-share cores, so the continuous context switching and cache line evictions generate considerable overhead.

A naive approach to improve this situation assigns resources proportionally to the parallelism available in the application, avoiding system over-commitment and unnecessary context switching, as shown in figure 1.1.

The naive approach fails to account for certain hardware- and software-related properties, which can have a significant effect. Application resource requirements are rarely fixed throughout the workload's execution; most real-life algorithms expose fluctuating parallelism. Thus the problem of estimating those requirements is stochastic. Another issue is heterogeneity in both hardware and software: on one hand, available cores can offer varying capacity or functionality; on the other hand, not all parallel jobs have equal demands, as there can be I/O-, compute-, or memory-bound portions.

¹We assume that the comparison is done relative to the optimum performance in isolation, hence degradation cannot be negative.


Figure 1.1: (a) Current OS schedulers allow applications to request arbitrarily many resources, causing excess overhead through context switching and contention. (b) Alternatively, the OS has an opportunity to minimize contention and other negative effects by space sharing resources proportionally to the true requirements of running applications.

I have devised a completely automated method to achieve the goal of fair scheduling as defined above, targeting mainstream CMP systems like workstations, desktops, and handheld devices, where the user is not expected to tune the workloads and the vast hardware fragmentation negates any tuning possibility from the developer's side. This method spans the whole stack, with components in the Operating System scheduler, the application's runtime, and the application itself. Nevertheless, each component is self-contained and their implementations are completely decoupled, all following a generic API. Prototype implementations of my proposed designs have shown up to 72% reduction in performance degradation variance between co-runners, combined with up to 171% average performance improvement in the same multi-application configurations. These results stem from combining three components: DVS and Palirria provide a novel approach to application-driven instant parallelism quantification, and Pond is an OS-level scheduler which leverages application-provided parallelism feedback to effectively partition available resources between them.

1.2 Background

The efficient placement of parallel application code on the distributed components of modern CMPs is characterized by the achieved application performance: its execution time, where least is best. Graceful execution of any application depends on the availability of two primary types of hardware resources: CPU time and cache memory. To minimize execution time, allotted CPU time and memory must be consecutive and uninterrupted, which means no time-sharing of the processor and no sharing of the cache. These are the two major issues that scheduling needs to address to make placement more efficient. For parallel applications there is a third factor, locality: the allotted resources must be close together to reduce data transfer time between threads.

When a CMP system is multiprogrammed with many such parallel applications, minimizing each one's execution time becomes an NP-complete scheduling problem. This problem is the sum of two requirements, which correspond to the two problems stated at the beginning of this section: on one hand, identifying the amount of parallelism in the application is necessary for knowing its requirements in terms of hardware resources; on the other hand, the system must know at every moment the true available capacity of all physical resources. The rest of this section gives a brief introduction to the background of these two requirements; the first is a responsibility of the applications, the second that of the operating system.

1.2.1 Application level

A real-life parallel workload seldom exposes static parallelism throughout its execution. Fully characterizing or profiling such an application for every input is an impossible and tedious task. Alternatively, the runtime could employ a recurring online estimation mechanism, in order to dynamically approximate requirements over time. Development of a proper requirements estimation algorithm consists of three components: i) a metric to quantify available parallelism, ii) conditions over that metric to evaluate the suitability of given resources, and iii) a computation method that is cost-effective and accurate.

Plenty of related projects have explored using hardware counters to profile application behavior [19, 156, 28, 33, 43, 138]. However, there are specific problems with such methods. For one, using hardware counters by definition quantifies past behavior, which is not guaranteed to persist. Hardware events are not portable; they are not standard even between processor models of the same series, forcing software libraries that handle them to be platform dependent. Finally, and most importantly, mapping raw hardware data to abstract program behavior is not easily generalized; we have found that there is no clear correlation between hardware events and contention. One example regards the md benchmark from the SPEC OMP 2012 suite, which ran faster when collocated with the art benchmark; moreover, md collocated with any other application suffered an almost doubled cache miss ratio, without, however, any effect on its overall performance. In those tests, each benchmark was given exclusive access to the same number of cores, both when running in isolation and when collocated. We attributed these results to the co-runner forcing the CPUs to throttle up. Consequently, reasoning about application behavior from raw hardware-derived data is complicated and needs to be custom tailored to the profile of the monitored application.

A proper parallelism metric must quantify future work independently of hardware architecture and past utilization. Such a metric could be derived from the semantics of the programming model. Furthermore, per-thread hardware profiling is not a lightweight method at large scale: although gathering data is quite fast, parsing it and making decisions consume CPU time, and at very large scale this overhead could prove considerable.

A metric that estimates future requirements without considering past performance can at best be an approximation. Moreover, the error margin can only be evaluated retroactively. Hence the decision policy must be opportunistic at a very fine granularity, in order to capture sudden fluctuations. A common problem is that the absence of enough parallelism can hinder the generation of more parallelism; thus the decision policy must also be optimistic, asking for a bit more than what it deems currently sufficient.

Executing a decision policy repeatedly at a rather small interval — a few milliseconds at most — can potentially produce considerable overhead at large scale. It is obvious that as the number of threads grows, so does the amount of profile data, and subsequently the time to process it all. Making the execution time independent of the size of the data requires either constraining the size of the data by profiling only a subset of threads, or reducing the processing of profiling data as their size increases.
In the first case, the statistical significance of a sample can be evaluated only retroactively. In the second, reducing the computation size could potentially increase the estimation's per-thread error margin; lower estimation accuracy would potentially lead to either loss of performance (under-estimating) or unnecessary costs (over-estimating).
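As a concrete illustration of the hardware-counter approach criticized above — and of why it is tied to the platform — the following minimal sketch reads a single hardware event through Linux's perf_event interface. It is not the estimation mechanism used in this thesis; the profiled loop is only a placeholder, and which events are available, and what they mean, varies between processor models.

```c
/* Minimal sketch of per-thread hardware-counter profiling via Linux
 * perf_event. Illustrative only; not the mechanism proposed in this thesis. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_open(struct perf_event_attr *attr, pid_t pid, int cpu) {
    /* group_fd = -1, flags = 0: a standalone counter */
    return syscall(__NR_perf_event_open, attr, pid, cpu, -1, 0);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_open(&attr, 0 /* this thread */, -1 /* any cpu */);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: the code whose behavior is being profiled ... */
    volatile long sink = 0;
    for (long i = 0; i < 50 * 1000 * 1000; i++) sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) misses = 0;
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```

Mapping such a raw count back to "available parallelism" or "contention" is exactly the step that, as argued above, does not generalize across applications and processor models.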

1.2.2 System level

A multiprogrammed system can exploit online parallelism feedback from applications to assist in improved partitioning of resources. The crucial criteria would be selecting the appropriate resources that benefit each application, while fairly stifling applications to control contention. Algorithmically, the distribution of resources among many applications can be reduced to two-dimensional bin packing, a well established combinatorial NP-hard problem; the reduction makes the problems equivalent, and by extension resource partitioning is also NP-hard. Unfortunately these problems do not enjoy a polynomial-time approximation scheme.

In the case of resource distribution on multiprogrammed systems, we must accept that the common case is for the available resources to be either just enough or not enough at all. In terms of the bin packing problem, that means that the given objects do not all fit in the bin. Thus the amount of space wasted by the imperfect fitting of objects is the primary property for judging solution optimality. Polynomial-time approximation solutions do not provide any guarantee on the optimization of that property, which requires either exhaustive search or heavy computation of complicated constraints; neither is applicable to system scheduling, which needs to consume as few resources as possible. A better strategy is to change the problem being solved into something manageable.

There are two primary strategies for partitioning CPU time among running applications: time-sharing and space-sharing. The former implies that a single processing core is shared by many threads, belonging to the same or different applications; each thread is given a time slice, that is, a predefined time window of unrestricted access to that core, after which it is interrupted and execution switches to another thread. Slices are kept small enough to create the illusion of concurrency. However, switching — formally called context switching — has an overhead, as it requires saving the state of the currently running thread, restoring the state of the one to run, and flushing the TLB. Space-sharing avoids such overheads by partitioning cores into disjoint sets, although it might constrain applications to fewer than the optimal number of cores. These strategies are not necessarily mutually exclusive, and a hybrid approach is quite common.
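As a toy illustration of space-sharing driven by parallelism feedback — and of the uniform-degradation goal stated earlier — the sketch below divides a fixed number of cores proportionally to what each co-runner reports. It is purely hypothetical and greatly simplified; it is not the Pond algorithm, and the names (space_share, app_t) are invented for the example.

```c
/* Hypothetical sketch: proportional space-sharing of cores among
 * applications that report their currently usable parallelism. */
#include <stdio.h>

typedef struct {
    const char *name;
    int parallelism;   /* parallelism feedback reported by the runtime */
    int allotment;     /* cores granted by the system scheduler */
} app_t;

static void space_share(app_t *apps, int napps, int total_cores) {
    int demand = 0;
    for (int i = 0; i < napps; i++) demand += apps[i].parallelism;

    if (demand <= total_cores) {
        /* Enough cores: every application gets what it asked for. */
        for (int i = 0; i < napps; i++) apps[i].allotment = apps[i].parallelism;
        return;
    }
    /* Not enough cores: scale every request by the same factor, so all
     * co-runners are throttled uniformly (but at least one core each). */
    int granted = 0;
    for (int i = 0; i < napps; i++) {
        apps[i].allotment = (apps[i].parallelism * total_cores) / demand;
        if (apps[i].allotment < 1) apps[i].allotment = 1;
        granted += apps[i].allotment;
    }
    /* Hand out cores left over by integer truncation. */
    for (int i = 0; granted < total_cores; i = (i + 1) % napps, granted++)
        apps[i].allotment++;
}

int main(void) {
    app_t apps[] = { {"A", 24, 0}, {"B", 6, 0}, {"C", 12, 0} };
    space_share(apps, 3, 32);
    for (int i = 0; i < 3; i++)
        printf("%s: parallelism %d -> %d cores\n",
               apps[i].name, apps[i].parallelism, apps[i].allotment);
    return 0;
}
```

On Linux, the resulting disjoint core sets could then be enforced with CPU affinity masks (e.g., sched_setaffinity), which is the basic mechanism behind space-sharing. Note that the minimum-one-core clamp can slightly overcommit when there are more applications than cores.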

The schedulers currently implemented in mainstream operating systems are not equipped to handle the issues of multiprogramming with highly parallel applications. They prioritize time-sharing, after providing applications with access to any number of processors, thus making overcommitment unavoidable. Furthermore, their notion of fairness is not truly fair considering the implications; for example, the Linux Completely Fair Scheduler will split CPU time uniformly across running applications, regardless of their parallelism. This tips the scale significantly in favour of the less parallel applications. Consequently, the potential variance of performance degradation under heavy load is quite high; in other words, while one application executes close to its ideal time (as if running in unrestricted isolation on the same hardware), others will degrade their performance considerably.

Nevertheless, there has been extensive prior work on contention-aware schedulers. Their shortcoming is non-portability, which takes two forms: either depending on hardware event measurements, which are not always mappable to application behaviour, or exploiting certain non-universal hardware properties, like cache organization. In contrast, this thesis aimed from the start for a just-works method that is both hardware agnostic, and thus portable, and highly scalable by being deterministic with low computational complexity. The unavoidable trade-off is increased implementation complexity and a requirement for a certain resource abundance that is not yet mainstream.

1.3 Problem statements

In this thesis I explore methods for cooperative scheduling between the system and application layers, addressing the following problems:

• Collocating multiple parallel applications with diverse and fluctuating hardware resource requirements can cause negative share effects that unpredictably and non-uniformly degrade application performance.

• Current Operating System scheduling favors overall throughput rather than individual application performance.

• Applications can provide information on their resource requirements, while the OS can monitor system-wide demands and availability. Both pieces of information are necessary for resource partitioning when multiprogramming parallel architectures. However, there are currently the following issues:

  – Runtime schedulers, and particularly those for task parallelism, fall short in providing a portable and scalable method for extracting resource requirements dynamically without prior profiling.

  – OS schedulers are not yet capable of leveraging application parallelism feedback in a portable and efficient way.

  – The currently accepted notion of scheduling fairness penalizes either the less dominant or the more demanding applications, instead of uniformly spreading the unavoidable contention penalty across all running processes.

• Related work on estimating resource requirements and providing parallelism feedback has not explored methods for handling dishonesty. In effect, such approaches freely service requests, potentially over-committing the system in adversely competitive environments.

1.4 Research methodology

The contributions presented in this thesis are the outcome of combining several methodologies. Initially, related work was explored by implementing prototypes and experimentally evaluating them in the specific context targeted by this thesis. This experimentation made it possible to identify inefficiencies and limitations, which resulted in the design of a custom model whose properties matched the desired objectives; the model was first analyzed theoretically, mathematically proving several of its properties. Using the developed theory as a foundation, an extensive design was sketched out covering all affected layers of the software stack, followed by the evolutionary implementation of several prototypes. Each prototype generation was experimentally evaluated, exposing opportunities for further optimization and refinement of both theoretical and engineering aspects.

1.5 Thesis Organization

The remainder of this thesis is organized as follows.

Chapter 2
The universal turn to multi-core and many-core architectures has shifted software development toward new programming paradigms. This chapter introduces an anatomy of modern processor designs and expands on the implications of hardware components for parallel application development and the scheduling of such applications.

Chapter 3
This chapter discusses the landscape of parallel programming models. It presents their trade-offs and concentrates on task parallelism, the focus of this thesis. The properties of the exposed parallelism are identified, as defined by different programming models; the chapter then moves on to discuss how these properties shape the possible runtime scheduling strategies.

Chapter 4
This chapter introduces the issue of resource partitioning on multiprogrammed systems. It starts by identifying the conceptual aim of such scheduling, then discusses different approaches, and concludes with a list of specific principles which the optimal OS scheduler should embody. Parts of this work were presented at:

• Georgios Varisteas, Karl-Filip Faxén, and Mats Brorsson, "Resource management for task-based parallel programs over a multi-kernel. BIAS: Barrelfish Inter-core Adaptive Scheduling", in Proceedings of the Second Workshop on Runtime Environments, Systems, Layering and Virtualized Environments (RESoLVE'12), London, UK, March 3, 2012.

Chapter 5
This chapter expands on adaptive scheduling strategies for task-parallel applications, as they relate to the modern hardware architectures from chapter 2 and the programming model properties from chapter 3. It details a method for estimating application resource requirements via a novel non-random, online, hardware-agnostic algorithm. A novel metric for the quantification of available parallelism, and the adaptation decision policy, are formally defined through a rigorous mathematical foundation and evaluated against related projects. This work was published in:

• Georgios Varisteas and Mats Brorsson, "Automatic Adaptation of Resources to Workload Requirements in Nested Fork-join Programming Models", Technical Report, KTH Royal Institute of Technology, December 1, 2012.

• Georgios Varisteas and Mats Brorsson, "DVS: Deterministic Victim Selection to Improve Performance in Work-Stealing Schedulers", in Proceedings of Programmability Issues for Heterogeneous Multicores (MULTIPROG'14), Vienna, Austria, January 22, 2014.

• Georgios Varisteas and Mats Brorsson, "Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing", in Proceedings of Programming Models and Applications on Multicores and Manycores (PMAM'14), Orlando, FL, USA, February 7, 2014.

• Georgios Varisteas and Mats Brorsson, "Palirria: Accurate On-line Parallelism Es- timation for Adaptive Work-Stealing" (extended), in Concurrency and Computation: Practice and Experience, August 15 2015.

Chapter 6
This chapter focuses on the operating system layer, detailing the requirements for efficient hardware resource partitioning on multiprogrammed systems. The operating system, knowledgeable of resource availability and system-wide utilization, requires a method for fair multi-criteria partitioning of said resources among multiple running applications. Such a method depends on per-application feedback of requirements. The potential for dishonesty in that feedback is discussed, and a method for negating it is detailed. This work is written in the following manuscript:

• Georgios Varisteas, Karl-Filip Faxén, and Mats Brorsson, "Pond: Exploiting Application Feedback for Fair Dynamic Resource Management on Multiprogrammed Multicore Systems", (Conference submission, under review).

Chapter 7
This chapter summarizes the thesis, emphasizes important conclusions, and lists prospective future directions.

Chapter 2

Parallel Architectures

2.1 Introduction

In recent years there has been a radical turn in hardware design toward multi-core. The continuous increase in single-thread performance hit a dual wall — ILP and power — by the turn of the century, forcing the exploration of alternative methods for performance improvement. Around 2004 Intel canceled its single-core projects Tejas and Jayhawk [143, 55], never to revive them, focusing solely on dual-core chips.

As figure 2.1 shows, the improvement of both integer and floating-point performance slowed radically with models released during 2004, never to regain its momentum. What changed was the ability to scale instruction level parallelism (ILP), while at the same time power leakage hindered further clock frequency increases.

Figure 2.1: Combined results from SPEC CPU '95, '00, and '06: (a) integer and (b) floating-point adjusted results. Images courtesy of [128].

Until 2004 processor transistor count would double every 18 months, as would clock frequencies and power consumption. This was achieved primarily through miniaturization, consistently moving to ever smaller lithography with every new processor generation. Eventually the Power Wall was hit, where leakage power relative to dynamic CMOS power dissipation became far from negligible [88].

Dynamic power is given as $P \propto C V^2 A f$, where $C$ is the switched capacitance, $V$ the supply voltage, $A$ the activity factor, and $f$ the clock frequency. Supply voltage $V$ (usually denoted $V_{dd}$) dropped steadily with each processor generation, allowing more and more transistors to be pushed into each chip without major changes in overall power dissipation.

Leakage power, given by $P_L = V \left(k\, e^{-qV_{th}/(a k_a T)}\right)$, used to be insignificant in early models. Affected by temperature and density, leakage power increased 7.5 times with every technology generation [27], eventually accounting for more than 20% of total chip power consumption [88] and becoming a primary constraint in hardware design. The power wall resulted in plateauing clock frequency improvements (as shown in figure 2.2), as the resulting high temperatures would have an exponential effect on leakage power.

Figure 2.2: The free lunch is over. Image courtesy of [139].

The ILP wall came as a result of continuously diminishing returns from ever increasing complexity in the processor issue logic. After fully exploiting super-pipelining, superscalar execution, and eventually out-of-order superscalar execution, every further optimization would incur a significant increase in issue logic complexity; the higher complexity required larger chip area and power consumption.

Both the ILP and Power walls contributed to the universal move of the hardware industry toward multi-core processor designs. The extent of this move is such that single-core products are no longer available. Multi-core processors transitioned computing from faster to more; instead of faster computing we now only have the option of more computing

(scale out the problem to maximize throughput, or scale out the solution to reduce execution time). Parallelism came with a variety of hardware design changes that greatly affect programming models. The rest of this chapter introduces the primary components of modern multi-core chips and how they affect software development.
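To make the trade-off behind this shift concrete, here is a small worked example (not from the thesis) based on the dynamic power relation $P \propto CV^2Af$ quoted above, assuming perfect parallel scaling and ignoring leakage and uncore power:

```latex
% Two slower cores vs. one fast core, at roughly equal dynamic power.
\begin{align*}
  \text{1 core at } (V,\, f):\quad
    & P_{1} \propto C V^{2} A f = 1 \quad \text{(normalized)}\\
  \text{2 cores at } (0.8V,\, 0.8f):\quad
    & P_{2} \propto 2\, C (0.8V)^{2} A (0.8f) = 2 \times 0.512 \approx 1.02\\
  \text{Peak throughput:}\quad
    & 2 \times 0.8f = 1.6f \quad \text{vs. } 1.0f \text{ for the single core}
\end{align*}
```

Roughly the same power budget buys 1.6 times the peak throughput — provided the software can actually use both cores, which is the software problem the rest of this thesis is concerned with.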

2.2 Anatomy of Multi-Core Chips

The simplest design of a multi-core chip is shown in figure 2.3. Although multi-core processors did not introduce cache hierarchies — single-core chips such as the Pentium II already implemented multi-level cache banks — they effectively put them to work. Cache hierarchies comprise multiple levels of variably configured cache banks. Designs with three such levels are established today: the L1 is close to each core, private to it, and has very small communication latency, and is thus necessarily small in size; the L2 is significantly larger and further away, thus inflicting higher communication latencies; finally, there may be a large L3 as a compromise between the exorbitant latency of reading from main memory and the concession of chip area. There have been processors with an optional off-chip L3, like the IBM Power 5.


Figure 2.3: Diagram of a generic dual core processor. Dedicated L1 cache per core, shared L2 cache and front side bus.

Figure 2.4 shows the designs of a few early mainstream multi-core chips from Intel and AMD. As discussed in [123], these chips explored redesigns of the cache architecture and coherency mechanisms, departing from the norms of the time. Cache coherency is the hardware mechanism for maintaining a common view of shared data across cores; it is thus a mechanism that updates and discards cache lines so that each core sees the latest values. Intel's Core 2 Duo incorporated a shared L2 and employed a MESI coherence protocol internally between the two private L1 caches. The Pentium D incorporated private L1 and L2 caches, and used a variation of MESI with update-based coherency via the off-chip front side bus (FSB). Early AMD multi-core models used the MOESI protocol for coherence between private L1 and L2 caches.

Figure 2.4: The first commercial CMPs experimented with different cache hierarchy designs and interconnect technologies: (a) Intel Core 2 Duo, (b) Pentium D, (c) AMD Athlon 64 X2.

Inter-core data and message exchange, within the same or across remote processors, is necessary especially for cache coherency. These communication channels form a network called the interconnect. Early multi-core chips built upon Symmetric MultiProcessor (SMP) technologies, using the generic PCI or FSB system buses borrowed from single-core systems. However, these were quickly replaced by dedicated, more efficient buses like Sun Niagara's crossbar. The AMD Athlon 64 X2 processors brought the crossbar into the mainstream and introduced AMD's HyperTransport technology.

Fast forward a few years, and these early models were replaced by much more elaborate and complicated designs. The first implementations of multi-cores, consisting of 2 to 4 single-core chips glued together, gave way to architectures designed from scratch as multi-cores. Cache hierarchies became more elaborate, as did coherency protocols, better supporting advanced parallel programming constructs and sporting up to 16 or even more cores. Specialized interconnect networks replaced system buses for inter-core communication across the board, enabling fast data exchange and more scalable coherency. Intel introduced QuickPathInterconnect (QPI) to compete with AMD's HyperTransport.

Simultaneous MultiThreading (SMT) (figure 2.5a) [56, 142] is another important aspect of multi-core chips, where multiple independent pipelines are implemented over the same issue logic. Each of these pipelines is called a hardware thread. In an SMT core, all threads share the same register file — much larger than in non-SMT designs — and cache hierarchy, among other resources. Unfortunately this scheme is not really scalable, due to severe constraints on cache bandwidth and size. Moreover, distinct pipelines cannot be powered or clocked down independently of the others. Hence power consumption can be disproportionate to utilization.

Chip MultiProcessors (CMP) (figure 2.5b) [121, 122] addressed the scalability and tuning issues of SMT designs. Maintaining the same transistor count, hardware threads are made independent — called cores — each given its own dedicated L1 cache (if not L2 too)

(a) Simultaneous MultiThreading (SMT) allows multiplexing instructions from different processes, each in its own pipeline, sharing registers and the cache hierarchy.

(b) Chip MultiProcessors (CMP) addressed the scalability issues of SMT by reducing the hardware threads per core, but increasing the cores per chip.

Figure 2.5: Multi-core chips expose even more hardware parallelism by providing multiple independent pipelines, in addition to their ILP.

and register file. Apart from the reduced resource sharing, and the latency it causes, another important benefit of CMP designs is the ability to power or clock each core independently of the others, maintaining a power budget with improved proportionality to utilization. A core within a CMP chip can also implement SMT, having more than one hardware thread per core and more than one core per chip; such a design is called SMT CMP.

Intel introduced its SMT technology marketed as HyperThreading (HT); it had 2 hardware threads sharing the execution engine, caches, the system-bus interface, and firmware. Intel's first SMT CMP was named Sandy Bridge, succeeding Nehalem/Westmere. AMD's SMT technology, informally named Clustered Multi-Thread (CMT), is similar, with the exception of distinct integer clusters and L1 data cache per thread. As an example, figure 2.8 shows the block diagram of the Bulldozer microarchitecture.

As a use case, figures 2.7 and 2.8 show two chips targeting the server market, both

Figure 2.6: Block diagram of the Sandy Bridge microarchitecture, Intel’s first mainstream SMT CMP.

Figure 2.7: Block diagram of the Xeon E5-2600 chip, implementing the Sandy-Bridge microarchitecture 2.2. ANATOMY OF MULTI-CORE CHIPS 19

Figure 2.8: Block diagram of a 4 module, 8 core, AMD Bulldozer microarchitecture, aimed to compete with Sandy-Bridge.

released around March 2012 by Intel and AMD respectively. They implemented all the technologies discussed so far in this section. The Intel chip had 8 HT cores, each with private L1 and L2 caches, and a large banked L3 cache shared among all cores. A bidirectional full ring interconnect (QPI) handled coherency and data exchange within the chip, including a specialized QPI Agent module for cross-chip coherency and communication. The AMD chip had 4 CMT cores, and a HyperTransport crossbar interconnect for inter-core communication both locally and cross-chip.

Multi-core is the prevalent processor design today, adopted for all types of systems from high-demand servers to low-power handheld and wearable devices. Every modern multi-core chip is composed of at least one SMT core or two non-SMT cores, an interconnect network, and a cache hierarchy. Developing software that is efficiently parallelized for such architectures requires awareness of the implications of those components.

2.3 Many-core architectures

The term many-core is used to describe not-yet-mainstream designs comprised of tens of cores. No formal threshold has been defined for the separation of multi-core and many-core, although 32 is a good approximation. There have been several prototypes of many-core chips with 64, 72, 100, or more cores. Examples are the 64-core Tile64 [16] and 100-core Tile-MX100 [58] architectures from EZChip (formerly Tilera), the 64-core Epiphany IV architecture [72, 1] from Adapteva, and Intel's 48-core Single-chip Cloud Computer (SCC) [76] and 50-core Xeon Phi [37].

The large number of cores in many-core systems requires, among other things, special attention to the interconnect network. At many-core scale, inter-core communication latencies can be quite significant, either due to traffic or simply distance. Examples of the elaborate interconnect designs employed to address these issues are mesh networks and bidirectional rings. EZChip's (fig. 2.12) and Adapteva's products are examples of the former, while Intel has chosen to implement the latter (fig. 2.13).

2.4 Cache Hierarchies

Implementing elaborate cache hierarchies was not a new idea introduced by multi-cores. However, with multi-core chips they became the standard for processor design. Each core has at least a private L1 cache bank; there have also been implementations where the L2 is private as well, a common choice for SMT chips, since a single L1 might be constraining for two hardware threads.

The most significant aspect of cache hierarchies on multi-cores is cache coherency. Implemented in hardware, coherency is responsible for maintaining a single view of shared data across multiple private cache banks. Hardware cache coherency has been shown to be hard to scale over arbitrarily large numbers of cores [35, 152, 31, 99]. Much research has been performed on various methods for improving scalability [36, 112, 132, 108, 31, 82, 89]. Martin et al. have also proposed that the high complexity of scaling up coherency is manageable with simple concessions [108]. Nevertheless, the issue remains that application parallelism over shared memory might be constrained due to the prohibitive complexity of maintaining coherence with either hardware- or software-based protocols.

A second aspect is the non-uniform latency of accessing shared cache banks (NUMA); for example, the L3 can be banked (as in fig. 2.7), with varied access latency per core per bank. L3 being shared means that a cache line will have only one copy across all banks. Usually the owner bank will be local to the core that first accessed the line from main memory (first-touch policy). As a simplified example, a core accessing a cache line from L3 will copy it from the bank it resides on into its L1 and L2. If that cache line gets re-accessed after having been evicted, it will be transferred from the same remote L3 bank again.

Cache organization plays an important role in scheduling parallel computations. Specific criteria are:

• Capacity contention: multiple user threads sharing a cache bank might cause excessive evictions due to lack of capacity.

• False sharing: multiple threads change distinct variables that reside within the same cache line, causing excessive, unnecessary invalidations (a short code sketch illustrating this effect follows the list).

• Data locality: hardware threads changing the same datum cause frequent invalidations and excess traffic for updating values. If these threads are far away from each other — relative to the chip topology — these transfers are considerably costlier than if the computations had been placed on closely located hardware threads.
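The false-sharing effect is easy to reproduce. The following is a minimal, hedged sketch (not taken from the thesis): two POSIX threads increment two distinct counters; placing the counters in the same cache line forces that line to bounce between cores, while padding them onto separate lines avoids it. The 64-byte line size is an assumption typical of x86 parts.

```c
/* False-sharing demonstration: same workload, different data layout. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L
#define LINE  64   /* assumed cache line size in bytes */

/* Two counters in the SAME cache line: updates by different threads
 * invalidate each other's copy even though the data is distinct. */
static struct { volatile long a; volatile long b; } shared_line;

/* Two counters padded onto SEPARATE cache lines: no false sharing. */
static struct { volatile long a; char pad[LINE - sizeof(long)]; volatile long b; } padded;

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void run(volatile long *x, volatile long *y, const char *label) {
    pthread_t ta, tb;
    double t0 = now();
    pthread_create(&ta, NULL, bump, (void *)x);
    pthread_create(&tb, NULL, bump, (void *)y);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("%-35s %.2f s\n", label, now() - t0);
}

int main(void) {
    run(&shared_line.a, &shared_line.b, "same cache line (false sharing):");
    run(&padded.a, &padded.b,           "padded to separate lines:");
    return 0;
}
```

On most CMPs the padded version runs several times faster, even though both versions perform exactly the same number of updates.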

Figure 2.9 depicts various cache hierarchy implementations (adapted from [121]). Conventional microprocessors introduced hierarchical caches to leverage the capacity-versus-latency trade-off (figure 2.9a). Since then, different CMP designs have implemented a variety of shared and private hierarchies (figures 2.9b–2.9d). Sharing the L2, and even sharing the L1 in SMT designs, was a design choice favoring throughput: the execution of as many fine-grained tasks as possible, with small memory footprints, typically encountered in server and network switching settings.

Figure 2.9: Cache hierarchy implementations for various processor designs: (a) conventional, (b) simple CMP, (c) shared-cache CMP, (d) shared-cache SMT CMP.

It is the responsibility of software to place computations in such a way that the aforementioned issues are mitigated. Such optimizations can be performed by the operating system scheduler, the application runtime scheduler, the application itself, or by all of them in cooperation. Delegating this responsibility to each layer brings a different trade-off. Placement algorithms need to be tailored to the specific architecture and topology they are executed on; hence burdening applications with this responsibility would necessitate a different implementation per available topology, encumbering development. The operating system is best equipped to analyze the hardware and monitor its resources; however, it remains disconnected from applications, unable to identify their actual requirements and inter-thread interaction patterns. An application-level scheduler faces the exact opposite problem, having access to all application-related information, but being unable to monitor system-wide resource availability and utilization levels.

2.5 Interconnect Networks

At large scale, the interconnect network of the architecture affects the efficiency of data exchange between threads running on different cores. These effects grow if cores are far apart, sharing no L1 or L2 cache banks. Random placement of computation with respect to data can lead to increased bus traffic and significant latencies. The non-uniformity of memory accesses in multi-cores is called NUMA, and the property of closeness between data and computation is called data locality. The negative effects of NUMA grow as processors scale up, making the preservation of data locality an ever more important attribute of scheduling parallel programs [70, 19, 42, 67].

In recent years the market has seen products implementing a variety of different interconnect networks, each with its own benefits and limitations. At multi-core scale, Intel and AMD have opted for non-uniform networks of variable diameter.

Figure 2.10: Dual processor (2P) blade and quad processor (4P) rack topologies: (a) 2P Max Perf, (b) 4P Max I/O, (c) 4P Max Perf, and (d) 4P Modular 2+2. Image adapted from [41].

Figure 2.10 [41] depicts various implementations for the Opteron processor series at different scales. Sub-figure (a) shows the optimal configuration for a dual 2-core processor (2P) system, with non-uniform link latencies, heavily dependent on AMD's XFire technology [40, 90] for uniform traffic distribution. The 2P Max Perf configuration is the only fully connected topology among the given configurations. The quad 2-core processor (4P) variants alternate between optimizing I/O and performance using different connectivity schemes, tuning the average and maximum diameter. Either configuration produces variable latencies in data exchange between cores of different processors, and sometimes even within the same processor.

node    0   1   2   3   4   5   6   7
   0   10  16  16  22  16  22  16  22
   1   16  10  22  16  22  16  22  16
   2   16  22  10  16  16  22  16  22
   3   22  16  16  10  22  16  22  16
   4   16  22  16  22  10  16  16  22
   5   22  16  22  16  16  10  22  16
   6   16  22  16  22  16  22  10  16
   7   22  16  22  16  22  16  16  10

Table 2.1: Cross-node communication latencies in cycles for an 8-node AMD Opteron 6172 system; it has 2 nodes per processor and 6 cores per node. Data gathered using the numactl tool.

Table 2.1 presents the latencies in cycles for inter-core communication on a quad-processor AMD Opteron 6172 system. Each processor has 2x6 cores; each node of 6 cores is a fully connected plane, but with variable connectivity across nodes, as in figure 2.10c.
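Such a node-distance matrix can also be read programmatically. The sketch below is only an illustration (it is not part of the thesis tooling): it uses Linux's libnuma, whose numa_distance() call reports the same relative node distances that numactl --hardware prints (SLIT units, where 10 denotes a local access). Compile with -lnuma.

```c
/* Print the NUMA node-to-node distance matrix of the current machine. */
#include <numa.h>    /* libnuma: link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }
    int nodes = numa_max_node() + 1;   /* highest node id + 1 */

    printf("node");
    for (int j = 0; j < nodes; j++) printf("%5d", j);
    printf("\n");
    for (int i = 0; i < nodes; i++) {
        printf("%4d", i);
        for (int j = 0; j < nodes; j++)
            printf("%5d", numa_distance(i, j));  /* 10 = local node */
        printf("\n");
    }
    return EXIT_SUCCESS;
}
```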

node    0   1
   0   10  20
   1   20  10

Table 2.2: Cross-node communication latencies in cycles for a 2-node Intel Xeon E5-2697 system; it has 1 node per processor and 6 HT cores (12 threads) per node. Data gathered using the numactl tool.

Intel’s solution into point-to-point signaling — called QuickPathInterconnect (QPI) — follows a similar architecture to AMD’s HyperTransport; however QPI has implemented only fully connected topologies even at high scale, with each processor connected to all others, as shown in figure 2.11. Table 2.2 presents the latencies in cycles for inter-core communication on a 2 processor Intel Xeon E5-2697 based system. Each processor has 1 node with 6 HyperThreading cores, making a total of 24 threads. Intra node communica- tion is 10 cycles, while 20 cycles inter-node (off-chip). The overall inter-core communication latencies for both Intel’s and AMD’s solutions have a variance around 100%, meaning that some link end-points suffer double the latency of others. Such chips, primarily targeted for mainstream usage where performance tuning is minimal or generic, require special attention to data-locality. Moving on to many-core systems we see more advanced interconnect solutions. Most prominent designs are bidirectional rings and mesh networks. Either design has a much larger average diameter than HT and QPI. A 10x10 2 dimensional mesh would have an 24 CHAPTER 2. PARALLEL ARCHITECTURES


Figure 2.11: Simplified diagrams of Intel’s QuickPathInterconnect (QPI) implementations in variable sized Xeon based systems.

A 10x10 two-dimensional mesh has an average diameter of 10/2 = 5. The average diameter of a 50-node ring is even larger, at 50/4 = 12.5.

Figure 2.12: Block diagram of EZchip's 100-core Tile architecture.

A mesh interconnect resembles a typical large-scale network, where each node has a specialized switch engine for forwarding messages appropriately, with packets able to travel in every direction, thus reducing traffic per path. The Tile architecture's implementation [140], shown in figure 2.12, comprises four distinct channels in order to isolate different types of traffic, optimizing efficiency and effectiveness. Furthermore, each node has its own private L1 and shared L2 caches. It is important to observe that since inter-core communication between non-neighbouring nodes is not a direct link, the possibly long distance can incur significant latencies; hence compact placement of data and computation across multiple nodes becomes a primary scheduling criterion.

Figure 2.13: Simplified block diagram of Intel’s Xeon Phi interconnect network.

A ring interconnect requires simpler routing of packets. Packets are initially placed on the ring in the direction that minimizes the travel distance, and each node only needs to check whether it is the destination. On the other hand, traffic is worse since all packets follow the same path. Intel released the Xeon Phi [37] series of many-core chips with up to 61 cores and a total of 244 hardware threads. A simplified block diagram is shown in figure 2.13.

Chapter 3

Parallel Programming Models

3.1 Introduction

Since the introduction of the first SMP systems, many different parallel programming models have been proposed [111, 78, 135], with MPI [115] being the most prominent and widely used. The significant latency in exchanging data between processors made sharing of data inefficient; removing that latency was the immediate benefit brought by CMPs and shared cache hierarchies. Modern programming models have shifted toward sharing memory between threads [57], enabling much finer grained parallelism [23, 21]. The purpose of a parallel programming model is twofold: on one hand it adheres to legacy standards and practices, while on the other it enables new types of code decomposition as required by emerging parallel architectures. In other words, it exposes modern hardware in a way that is familiar from legacy methods.

3.2 Task-centric programming models

To properly compare different methods for building parallel applications, a crucial criterion is how they express parallelism via the programming model and map it to hardware via the runtime scheduler. For the first part, the most frequent obstacle is overcoming dependencies between parallelizable blocks of code, which require data synchronization. Synchronization inherently serializes the execution, as one thread has to wait for others to complete or catch up. Task-centric models [78, 111, 135] have been shown to express parallelism nicely and are capable of handling very fine-grained loads. A task is an independent, self-contained, and variably-grained block of code, implemented as a closure: a function coupled with the appropriate input. A set of tasks can therefore, and should, be executed in parallel. Furthermore, tasks allow great flexibility in expressing nested parallelism, since it is simple to create a new instance from within another. Functions are marked as tasks either through annotations, preprocessor macro directives, or syntactic keywords that extend the programming language used; some implementations automatically treat every function as a task if it is used as such anywhere in the code. Either approach requires specific compiler support.


Algorithm 1 Sequential Fibonacci

    function FIB(n)
        if n < 2 then
            return n
        end if
        a = FIB(n - 1)
        b = FIB(n - 2)
        return a + b
    end function

Algorithm 2 Task parallel Fibonacci

    function FIB(n)
        if n < 2 then
            return n
        end if
        b = spawn FIB(n - 1)    ▷ create task
        a = FIB(n - 2)
        sync                    ▷ wait for task
        return a + b
    end function

Figure 3.1: Recursive Fibonacci parallelized with tasks. Only a few code changes are needed to transform the sequential version into the task-parallel one.

The API of such a programming model is small and simple. Fundamentally it consists of only two actions: spawn and sync (also called fork and join, from the fork/join design pattern). The former creates a new instance of a task. The latter waits for the task to finish and collects its return value, if any. A sync action can be explicit or implied through the definition of specific code blocks; different implementations of the model have their own semantics.

As an example we present the recursive implementation of a Fibonacci solver called fib, shown in pseudocode in figure 3.1. This algorithm is a very inefficient way to calculate Fibonacci numbers in comparison to a loop-based implementation; however, it is an embarrassingly parallel algorithm, with very fine-grained tasks and no data dependencies. Thus it scales linearly, and is commonly used to evaluate task-parallel runtime schedulers under ideal conditions.

The execution of the task-based FIB program in algorithm 2 will create several instances of the FIB task. Unfortunately the algorithm will spawn many identical tasks, meaning recursive calls to FIB() with the same value for n; however, each will descend from a distinct parent task. These parent-child relationships between tasks are commonly visualized as a tree graph, called the task tree. The task tree for FIB(5) is shown in figure 3.2.

A task tree is a useful tool as it reveals several characteristics of a task-based application. For one, it quantifies the available parallelism; it shows the task generation potential, and helps identify bounds on the number of tasks available during different points of the execution. For the specific example of figure 3.2, the rate is 2 since every non-leaf task will generate two more; furthermore, the total amount of parallelism is monotonically increasing; thus, given appropriate input, FIB can fully utilize any number of processors, which is why it is called embarrassingly parallel.
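For concreteness, the same kernel might be written against Cilk Plus, one of the runtimes discussed in section 3.5.5; this is an illustrative sketch, not code from the thesis's prototypes.

    #include <cilk/cilk.h>

    long fib(long n) {
        if (n < 2) return n;
        long b = cilk_spawn fib(n - 1);  /* child task, may be stolen by another worker */
        long a = fib(n - 2);             /* the parent continues with the other half    */
        cilk_sync;                       /* wait for the spawned child before using b   */
        return a + b;
    }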

    FIB(5)
    ├─ FIB(4)
    │  ├─ FIB(3)
    │  │  ├─ FIB(2)
    │  │  │  ├─ FIB(1) = 1
    │  │  │  └─ FIB(0) = 0
    │  │  └─ FIB(1) = 1
    │  └─ FIB(2)
    │     ├─ FIB(1) = 1
    │     └─ FIB(0) = 0
    └─ FIB(3)
       ├─ FIB(2)
       │  ├─ FIB(1) = 1
       │  └─ FIB(0) = 0
       └─ FIB(1) = 1

Figure 3.2: Task tree of FIB(5) as shown in algorithm 2.

Other programming constructs, like loops, can also be mapped to tasks indirectly through support from the runtime. Iterations are split into chunks, and each chunk becomes a task executable in parallel. The runtime needs to calculate the total number of iterations, which constrains the types of loops that can be parallelized this way; this shortcoming, however, is common to many parallel programming models apart from the task model. Moreover, there must be no data dependencies between chunks, since tasks execute asynchronously. Some task programming models provide mechanisms to specify such dependencies [125, 50], usually via code annotations, so that the underlying runtime can schedule data-dependent tasks appropriately.
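As a hedged sketch of this loop-to-task mapping, assuming Cilk Plus keywords and a loop body with no cross-iteration dependencies, the chunking can look roughly as follows; the chunk size and function names are illustrative.

    #include <cilk/cilk.h>
    #include <stddef.h>

    /* Independent per-element work on one chunk of the iteration space. */
    static void apply_chunk(double *a, size_t begin, size_t end) {
        for (size_t i = begin; i < end; i++)
            a[i] *= 2.0;
    }

    void parallel_apply(double *a, size_t n, size_t chunk) {
        for (size_t begin = 0; begin < n; begin += chunk) {
            size_t end = begin + chunk < n ? begin + chunk : n;
            cilk_spawn apply_chunk(a, begin, end);  /* each chunk becomes a task */
        }
        cilk_sync;                                   /* wait for all chunks */
    }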

3.3 Characterization of task-based applications

The task-tree is an important tool for the characterization of task-parallel applications. There is a direct correlation between certain properties of the tree and the behaviour of the application, which makes it an excellent visualization tool for discussing aspects of a task-parallel application's parallelism profile. At first view, the maximum dimensions, width and depth, reveal the maximum simultaneous parallelism of the workload and the critical path respectively, as shown in figure 3.3. The width is a good approximation of the maximum number of cores that the application can utilize throughout its execution; using more processing cores than that risks excess synchronization overhead and eventually slowdown. The critical path is the longest path, or maximum depth, of the tree; assuming execution of the critical path on a single core, it is the minimum sequentially executable part of the application. Finally, the total number of tasks can be used in conjunction with the critical path to approximate the exposed parallelism, which is also a loose upper bound on the achievable speedup:

P = \frac{T_1}{T_\infty} = \frac{\text{serial execution time of all tasks}}{\text{critical path execution time}}
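As an illustrative calculation, under the simplifying assumption that every task costs one unit of time, the FIB(5) tree of figure 3.2 contains 15 tasks and its critical path is 5 tasks long, so T_1 = 15, T_\infty = 5 and P = 15/5 = 3; at most a threefold speedup is then achievable regardless of the number of cores.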

Figure 3.3: A task-parallel application's complete task-tree can quantify its parallelism properties.

Figure 3.4: A subtree can visualize the maximum potential for sequential work on the core where the root was placed.

The same generalized analysis can be performed on a subtree basis (fig. 3.4). Assuming the root of the subtree is placed for execution on a specific core, the complete subtree below that root approximates the potential for that core to run sequentially. The subtree's critical path is the minimum execution time, while all branches can be assigned for execution in parallel on other cores. A more elaborate version of the task-tree can also show advanced features of task-parallelism, like data dependencies. Data-dependent tasks are spawned and marked non-ready for execution until all their data dependencies are resolved, thus avoiding stalling a processor until the data become available. Indirectly, data dependency resolution also helps in placing the computation close to the data, minimizing transfers and contention. There are more properties which characterize individual tasks, most importantly granularity and parallelism potential. The former is a reference to the task's execution time relative to an accepted standard or to other tasks in the same application, and can be fine (for short) or coarse (for long); for example, a finer-grained task executes in only 30% of the time other tasks need, while a fine-grained task has a very short execution time of just a few thousand cycles on modern hardware. A task's parallelism potential is an approximation of, or an upper bound on, its ability to produce further parallelism by spawning more tasks; it can be quantified as the number of directly descended tasks or the complete size of the subtree rooted at it.

Figure 3.5: A task has a certain granularity, which is a reference to its execution time relative to the platform standard or to other tasks. Its parallelism potential includes the amount of parallelism it produces by spawning more tasks.

Figure 3.5 depicts both these properties on a task tree. An important task characteristic which cannot be depicted on a task tree is the task generation rate. This is the rate at which a task's parallelism potential is materialized, or in other words how quickly a task makes its children available. A task may employ tail recursion, meaning that its children are not available before the end of its execution; for example, a task-parallel quicksort implementation will first partition (the computationally meaningful part of the task's execution) and only afterwards create two new tasks, hence the children cannot be executed in parallel with the parent task and the amount of parallelism is reduced. On the other hand, a task could spawn multiple tasks early on, to segment the load, and then work locally on one of them; for example, in a task-parallel matrix multiplication algorithm a task starts by segmenting the matrix, making all segments immediately available as parallelism, and then proceeds to work locally on one of those segments. Hence the children tasks can be executed in parallel with the parent task; in that case the maximum parallelism is greater than the tree's width. The task generation rate can help identify sources of lost performance, but it can also signify parallelism fluctuations, and more specifically increases.
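A small sketch of the late-generation pattern mentioned above, assuming Cilk Plus keywords: the children only become available after the parent's partitioning work, so they cannot overlap with it.

    #include <cilk/cilk.h>

    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Lomuto partition: the bulk of the parent task's own work. */
    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap_int(&a[i++], &a[j]);
        swap_int(&a[i], &a[hi]);
        return i;
    }

    void par_qsort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);        /* sequential work happens first...  */
        cilk_spawn par_qsort(a, lo, p - 1);  /* ...children are spawned only now  */
        par_qsort(a, p + 1, hi);
        cilk_sync;
    }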

3.4 Scheduling task-parallel applications

In recent years parallel programming has seen a sudden evolution [20]. Existing paradigms were enriched [44, 24, 102, 100] while many new ones were developed [68, 127, 126]. The underlying runtime schedulers can be separated into two main families: shared memory and message passing.

Message passing is based on the axiom that no two execution units should share memory. Parallelism is exposed by executing multiple cooperating processes, while all data are copied between them via messages. Data structures are kept synchronized through algorithms adapted from the field of distributed systems. A good analogy for this model is a network of individual systems. As with any such network, message passing based parallel programs scale well once the proper protocols for data exchange are in place. However, developing such protocols involves a certain design complexity that can be disproportionately larger than the complexity of the problem being solved.

Shared memory based programs are built using independent execution units that access the same memory directly. Data synchronization is constrained by cache coherency, a function often provided by the hardware. The immediate benefit of shared memory models is lower code complexity, resulting in faster development. However, such models often fall victim to their own design: shared memory read-write accesses require protection in order to avoid data corruption or races, and these protection schemes need careful implementation so as not to decrease parallelism, since they serialize the execution. This work focuses exclusively on the shared memory model, although the contributions are independent of it; hence the heuristics and algorithms presented here could be adapted to message passing as well.

3.4.1 Work-distribution overview

Shared-memory programs exploit the construct of threads. In Linux, and most other modern operating systems, a thread is a lightweight process (LWP) implemented on top of a kernel thread. Thus the OS kernel can schedule each thread individually on hardware. All threads share the same address space.

In its most primitive form, a shared memory program would create a new thread for each parallel task. To avoid resource over-commitment and excessive context switching, there would be a cut-off preventing parallelism from growing disproportionately larger than the concurrency available in hardware. Unfortunately such mechanisms were hard to decouple from application code and were quickly substituted. In its more refined form, a shared memory program creates a pool of threads, which are assigned parallel tasks via specific mechanisms. These threads are commonly referred to as workers. The thread that initiates the application and creates the workers is referred to as the master.

The work distribution techniques can be summed up in two categories: work-dealing and work-stealing. With work-dealing, new tasks are offloaded to other threads; the primary goal is balancing the load across workers. This paradigm can be inefficient: it is probable that the worker spawning new tasks is already overloaded, yet it is burdened further with having to find the workers to offload these new tasks to. Work-stealing, on the other hand, lets the idle threads find new tasks from other sources. Since these threads are idle, the search overhead has little if any negative effect on application progress.

With either distribution technique there can be different implementations of how to store tasks. There could be one central queue storing all newly spawned tasks, minimizing scheduling complexity and synchronization overhead but scaling poorly; I have not considered such a scheme in this project at all. Alternatively, each worker can have its own queue. Workers then place new tasks in their own queue; if it is full they can either execute the task immediately as a local function call or deal it to an emptier queue. If a worker becomes idle with an empty task queue, it steals tasks from remote queues.

(a) Centralized queue: high lock contention and interconnect traffic. (b) Work-dealing: busy workers must search for empty queues. (c) Work-stealing: CPU time wasted in searching for non-empty queues.

Figure 3.6: The three primary schemes for work distribution in a task-parallel runtime scheduler. Red spheres are worker threads, while the containers above them represent task-queues, with blue depicting their content size.

Dealing tasks with per-worker queues places the responsibility of work distribution on already overloaded workers. Stealing tasks, on the contrary, employs idle workers, letting busy workers focus only on progressing the execution of the application. Of course work-stealing is not without its own problems. If the application's parallelism is not enough to utilize all workers, the idle ones will spin trying to find work when there is none, wasting plenty of CPU time and power. Because most systems run more than one application at a time, such scenarios can cause performance degradation due to unnecessary contention. This project addresses exactly this issue.

3.5 Work-stealing

The work-stealing methodology comes with some intricate terminology surrounding the action of stealing. The thief is the worker stealing the task, the task-queue is the queue that holds the task being stolen, and the victim is the worker that owns the task-queue in question. A work-stealing scheduler's implementation consists of two main components: a task-queue and a victim selection protocol. The task-queue must be able to accommodate multiple readers and writers, since thieves must mark tasks as stolen and then write the return value. The victim selection protocol specifies how to pick the next victim; a naive approach iterates the list of all workers each time (sequential selection); more advanced techniques employ randomness (random selection); there are further optimizations that exploit the work-stealing scheduling semantics for an even more educated selection (e.g. leapfrogging during a blocked SYNC).

A task tree, like the example of figure 3.2, can be split into task subtrees (or just subtrees). When the parallelism of a program is regular and increasing, like FIB, there will be a point during the execution after which each worker works on tasks from distinct subtrees. That means that every worker eventually processes a task whose descendant tasks can keep it busy for the rest of the execution of the program, negating any need to steal. This pattern is referred to as sequential task execution.

3.5.1 Task queues

The task-queue of a worker must accommodate multiple readers and writers: the owner of the queue adds and processes tasks, while other workers steal tasks and update the queue accordingly. Because these two families of actors are independent, task-queues are typically implemented as double-ended queues (deques); thieves steal from one end, while the owner adds tasks at the other. An element of the queue is removed only after being synced. Picking up a task, either through stealing or by the owner of the queue, requires marking the task's descriptor as being under processing to avoid double execution. Upon completion the return value must be written back and the descriptor marked as completed. When a worker needs to sync a task that is not completed yet, it can pause the parent task and try to find other work to avoid idleness. Thus the child task won't be removed from the queue until the worker resumes its parent and completes the syncing. If, however, the runtime allows another worker to resume the parent task, gaps can be created in the task queue.

If the runtime allows migrating paused tasks across workers, elements of a task deque might not be removed in order, even though it conceptually operates in FIFO order. A task queue can contain tasks from completely independent subtrees due to arbitrary stealing. Even though parent tasks can usually not be synced, and thus removed, before their child tasks are synced first, tasks of different subtrees impose no ordering constraints on each other. Unordered removal of queue elements will either increase the algorithmic complexity of the deque, or require iterating over all elements to find the size of the queue. A task-queue implemented as a non-overwriting array-based circular buffer can optimize unordered removal, at the cost of O(N) complexity for the push operation.

To illustrate the two cases above, assume worker wi is executing task t1, which creates child task t2. t2 gets stolen and wi cannot proceed with syncing; thus it pauses t1 and proceeds to steal and execute another task t3. t2 and its complete subtree eventually complete and thus t1 can be synced. If task migration is not supported, wi will resume t1 only after it has completed every other task following t2 in its queue; after syncing, t2 is removed and the tail is effectively t1 again. On the other hand, it could be that t1 is resumed by another worker while wi is busy on an independent subtree and has placed more tasks after t2. Then t2 will be synced and removed from the queue, and a gap will be created.
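The sketch below illustrates the deque discipline described above with a simple mutex-protected, fixed-size task queue: the owner pushes and pops at the bottom while thieves steal from the top. It is only an illustration of the access pattern under stated assumptions; the function names are illustrative, the task type is a bare function pointer rather than a full task descriptor, and production runtimes avoid a per-operation lock through carefully designed synchronization protocols.

    #include <pthread.h>
    #include <stdbool.h>

    #define QCAP 1024

    typedef void (*task_fn)(void *);
    typedef struct { task_fn fn; void *arg; } task_t;

    typedef struct {
        task_t buf[QCAP];
        int top, bottom;            /* thieves take from top, the owner works at bottom */
        pthread_mutex_t lock;
    } taskq_t;

    void taskq_init(taskq_t *q) {
        q->top = q->bottom = 0;
        pthread_mutex_init(&q->lock, NULL);
    }

    bool push_bottom(taskq_t *q, task_t t) {      /* owner only */
        bool ok = false;
        pthread_mutex_lock(&q->lock);
        if (q->bottom - q->top < QCAP) { q->buf[q->bottom++ % QCAP] = t; ok = true; }
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    bool pop_bottom(taskq_t *q, task_t *out) {    /* owner only */
        bool ok = false;
        pthread_mutex_lock(&q->lock);
        if (q->bottom > q->top) { *out = q->buf[--q->bottom % QCAP]; ok = true; }
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    bool steal_top(taskq_t *q, task_t *out) {     /* any thief */
        bool ok = false;
        pthread_mutex_lock(&q->lock);
        if (q->bottom > q->top) { *out = q->buf[q->top++ % QCAP]; ok = true; }
        pthread_mutex_unlock(&q->lock);
        return ok;
    }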

3.5.2 Scheduling Policies

There have been two prominent strategies for handling newly spawned tasks. On one hand, the work-first policy has the worker switch immediately to executing the newly spawned task, leaving the continuation of the parent as a single stealable task in its task-queue. In contrast, with help-first a worker executes each task to conclusion, pushing any spawned tasks onto its task-queue. Each of the two policies has certain benefits and drawbacks.

Work-first employs a depth-first walk over the task-tree, where a worker advances to the final leaf before backtracking to execute tasks higher up in the tree. This policy makes sure that most of the application's critical path is executed locally on one worker, and it also means that the parallel code has the same stack requirements as the serial version, reducing the possibility of blowing the stack. However, it employs a technique similar to continuation-passing-style transformations by packaging an executing function's continuation as a stealable task. This technique is not compatible with standard calling conventions and needs compiler support. In effect it requires a special type of stack implementation called a cactus stack [38, 66, 97]. Using a cactus stack makes the work-stealing implementation vulnerable to unordered removal of tasks, increasing the complexity of interacting with the task-queue.

Help-first works the exact opposite way, conducting a breadth-first walk over the task-tree. Every task is executed to completion before the worker picks another task, which means that all children become immediately available. This technique makes concurrency available quickly and allows its faster distribution across workers; with help-first, tasks can only be removed in order, making task-queue interaction much less complex. On the other hand it can lead to blowing the stack, as it allows an execution order that does not follow the serial order.

To exemplify the scheduling differences, assume a task q that spawns three children x, y, and z. With help-first they are all pushed onto the task-queue immediately; when q terminates, the worker picks a new task from its task-queue (probably one of x, y, z) or steals if it is empty, and in the meantime three other workers could have stolen those tasks simultaneously. With work-first, the worker switches immediately to executing x while leaving the continuation of q in its task-queue; tasks y and z have not been spawned yet, so only one other worker has the opportunity to be utilized. In conclusion, the help-first policy reduces per-worker latency until the first successful steal by accelerating spawns, benefiting wider task-trees; work-first contributes to load-balancing by avoiding mass spawning of tasks by a single worker, benefiting narrow deep trees.

3.5.3 Victim Selection

Sequential victim selection is the naive approach. There is a shared ordered list of all workers, and to select a victim a worker iterates that list in sequence. Implementations usually maintain a distinct starting index per worker w; this index can be fixed to the beginning of the list (always 0), or to w + 1, or dynamically set to the latest index that resulted in a successful steal.

(a) First steal session. (b) Second steal session.

Figure 3.7: Simple visualization of sequential selection where the starting index for each session is set to the value of the last successful steal.

Random victim selection was first introduced by Blumofe et al. in [25], which provided theoretical bounds on its efficiency. In summary, if a worker makes M steal attempts then the delay before succeeding is at most M; in other words, the number of random steal attempts required for discovering a stealable task is finite. That work eventually led to the Cilk++ programming model [26], described in more detail in section 3.5.5. A common optimization is to combine random and sequential victim selection to reduce calls to the expensive rand() function. The starting index is set dynamically at random; workers are then iterated sequentially until a complete circle back to the starting index, at which point rand() is invoked again and a new sequential session is started.
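A sketch of this combined random/sequential selection is shown below; try_steal_from() stands in for the runtime's actual steal attempt and is a hypothetical hook, as are the other names.

    #include <stdlib.h>
    #include <stdbool.h>

    bool try_steal_from(int victim);   /* hypothetical: one steal attempt on a victim's deque */

    /* One steal session for worker `self` out of `nworkers` workers. */
    bool steal_session(int self, int nworkers) {
        int start = rand() % nworkers;              /* random starting index    */
        for (int k = 0; k < nworkers; k++) {        /* then a sequential sweep  */
            int victim = (start + k) % nworkers;
            if (victim == self) continue;           /* never steal from oneself */
            if (try_steal_from(victim)) return true;
        }
        return false;   /* full circle with no success; the caller re-randomizes */
    }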

Figure 3.8: Visualization of leapfrogging and transitive leapfrogging. w1 is blocked on syncing task1 and directly picks w2, the thief, to steal task2 from. On a second block while syncing task1, w2's task-queue is empty, so w3 is transitively picked as the next victim.

The SYNC action can block if the task in question has been stolen but not yet completed. To avoid unnecessary idleness, the work-stealing scheduler initiates a steal session and revisits the sync later. Leapfrogging [149] is an optimization used during a blocked SYNC phase, where the victim selected is the thief of the stolen task. An extension called Transitive Leapfrogging, implemented in WOOL [59], allows a second leapfrogging step: a worker transitively selects the thief of the thief when the latter's task queue is found empty. (Transitive Leapfrogging is documented in the distribution of WOOL, available at http://www.sics.se/kff/wool.) Figure 3.8 depicts both methods. Intuitively, there is a high chance that the worker which stole the task to be synced has spawned more tasks, thus increasing the chances of finding stealable tasks belonging to the same subtree. Both properties, a lower chance of failed steal attempts and a higher chance of stealing descendants, can lead to improved performance and saner handling of the stack, since workers will be stealing tasks on the same path of the task-tree.

3.5.4 Discussion

There are special considerations regarding the applicability of work-stealing to task-parallel applications. The first has to do with the proportionality of tasks and worker threads. Of course there is no benefit in having more workers than tasks; the excess workers will only spin trying to find tasks when there are none. If the amount of tasks vastly exceeds the number of workers, then there might be a lost opportunity for transforming concurrency into parallelism and improving performance. A good approximation for deciding the ideal number of workers is the maximal width of the application's task-tree, which can differ depending on the application's input data set.

Nevertheless, work-stealing does incur an overhead, usually through arbitrary transfers of data across different cache banks. Workers placed on cores that do not share the first levels of the cache hierarchy exchange tasks of the same task subtree, which requires transferring data back and forth for syncing those tasks, stalling the processor and suffering unnecessary delays. The tree depth is another important factor to consider. Ideally, all workers should, at some point of the execution, be able to work serially over a small task subtree, so that there is no performance penalty from excessive stealing. It is difficult to achieve balance between these three quantities, namely the task-tree width and depth defining the available parallelism over time, and the best number of workers for processing that tree. Dynamic adaptation of the worker pool size reduces the effect of a wrong initial choice. Adaptation can be opportunistic: when the rate of consuming tasks becomes smaller than the rate of producing tasks, it is a good hint to add more workers; similarly, the opposite situation, where tasks are consumed much faster than produced, is a good sign to remove workers. When the depth of the tree is unpredictable, one can start with a safe small number of workers and allow the allotment to expand dynamically; this technique will indeed suffer a penalty if the required extra workers are not added quickly enough. Optimally, the scheduler would be clairvoyant, identifying an imminent increase in parallelism soon enough to have added workers before it becomes available. Certain techniques have moved toward that path by using offline profiling or static scheduling; the impracticality of those methods is that the characteristics of the parallelism profile change with the application's input data set, so the profiling needs to be repeated.

Correlating task-queue size with parallelism fluctuations seems reasonable and has been investigated before [30, 144]. Nevertheless, when reasoning over a pool of workers there has to be a threshold to differentiate between one state and another. Setting such a threshold is not simple considering the potentially unbalanced state that random victim selection can cause. If workers steal tasks from arbitrary victims, it becomes possible to end up with a worker pool where a few workers are idling without work while others are overwhelmed with full task queues. Such a scenario poses a scheduling contradiction. Also, it is wrong to disregard distribution disparity and consider only the total amount of tasks; this could be a cause of inaccuracy in the estimations, since randomness provides no guarantee on how long it would take to distribute work to underutilized and newly added workers.
In conclusion, using the amount of tasks as a metric for the quantification of parallelism requires their uniform distribution across the allotment of workers. Furthermore, without a clairvoyant scheduler or static analysis it is impossible for the work-stealing runtime to identify any task's granularity (expected running time); thus a second point of speculation is the threshold for considering a certain amount of available stagnant tasks (spawned but not yet processed) as an indicator of a parallelism increase, and even then the correct amount of utilizable extra workers remains unanswered. It can be argued that an allotment of X busy workers with X + Y stagnant tasks can in all cases utilize a further Y workers, even including a delay for making those workers available: because the current workers are already busy, it will take some time before they require any of the X + Y stagnant tasks; assuming that the extra Y workers steal Y of those tasks, there are still X tasks to utilize the original X workers. The problem with this scenario is that random victim selection provides no guarantee that all Y workers will be able to discover a stagnant task before all are completed. The double negative outcome is then that: i) an opportunity for improving performance is lost, and ii) hardware resources were unnecessarily spent without progressing execution. In summary, efficient adaptive work-stealing requires uniform and deterministic placement of tasks in order to effectively determine parallelism fluctuation thresholds and maximally capitalize on the gains of adaptation.

3.5.5 Existing work-stealing task schedulers

This thesis discusses features and capabilities of work-stealing runtime schedulers. It does not investigate work-stealing in its entirety, but rather specific aspects. All implemented prototypes were built upon already established work-stealing schedulers with documented performance profiles. This section presents these schedulers.

WOOL

WOOL is a tasking library developed by Karl-Filip Faxén [61]. The programming model involves defining functions and loops to be executed in parallel and includes three operations: call, spawn and sync. The first two invoke a task: the former sequentially, like any function; the latter places it in the queue to be stolen or executed later. Sync retrieves the return value of the latest spawned task; if the synced task has been stolen but its execution is not finished, the worker will leapfrog, i.e. try to steal a task from the thief of the task to be synced.

WOOL spawns and steals tasks at very low cost, resulting in very good overall performance; with fine-grained parallel workloads it can achieve linear speedup [60]. This runtime has been the foundation of the experimental implementations in this thesis.

Cilk++ and Intel Cilk Plus

Cilk++ is a tasking library developed by R.D. Blumofe et al. [24], originating from the Cilk-5 runtime scheduler [66]. Its programming model is very similar to WOOL, although it employs a different scheduling strategy. Upon spawning a task, it is executed immediately by the worker; a continuation of the task that made the spawn is put in the task queue and can be stolen. To that end it uses a cactus stack and requires a custom compiler, because the continuation-passing style does not adhere to standard calling conventions. Intel later acquired Cilk++ and developed a new version called Cilk Plus, which abandoned the cactus stack for standards compliance. Support for Cilk Plus has been integrated into the Intel C++ Compiler (ICC) since the 2010 release and into the GNU Compiler Collection (GCC) as of version 4.8. Cilk++ and Cilk Plus have been investigated as platforms for the algorithms proposed in this thesis, but no implementation has been completed as of this writing. They are, however, the foundation of other closely related projects.

OpenMP tasks

OpenMP (Open Multi-Processing) is a compiler-aided language extension supported by library routines; it provides shared memory parallel programming constructs in C, C++, and Fortran. It has enjoyed much commercial adoption and multi-platform integration, primarily due to its focus on providing a simple and flexible programming interface rather than raw performance. OpenMP supports many different constructs for parallel execution, with tasks being one of them since version 3.0; others include work-sharing and nested parallelism.

The OpenMP task implementation in GCC does not use a work-stealing scheduler but a centralized queue scheme. Nevertheless, several projects have proposed alternative schedulers [9, 10, 51] as well as work stealing and hierarchical scheduling [120]. The ICC implementation uses such a work-stealing runtime. The OpenMP tasking implementation supports several quite advanced features which are not straightforward to implement on a work-stealing scheduler due to lack of control. Among them are tied and untied tasks: a tied task is bound to a specific thread, making it non-stealable even if switched out, while an untied task can be resumed on a different thread. In comparison, WOOL tasks are tied and Cilk++ tasks are untied.
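As a brief sketch of the OpenMP tasking constructs mentioned above (C, OpenMP 3.0 or later), the Fibonacci kernel can be expressed with task and taskwait; the untied clause here marks the child task as resumable on a different thread. This is an illustration, not code from the thesis.

    #include <stdio.h>

    static long fib(long n) {
        long a, b;
        if (n < 2) return n;
        #pragma omp task untied shared(a)
        a = fib(n - 1);                 /* child task, may run on another thread */
        b = fib(n - 2);                 /* parent's own work                     */
        #pragma omp taskwait            /* wait for the child before using a     */
        return a + b;
    }

    int main(void) {
        long r;
        #pragma omp parallel
        #pragma omp single              /* one thread seeds the task tree */
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }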

Intel Threading Building Blocks (TBB)

Intel's TBB is a C++ template library primarily for scalable data-parallel programming and nested parallelism through tasks [92, 124, 130, 129]. It employs work-stealing similarly to Cilk, and offers advanced features like data dependency resolution with the use of algorithmic skeletons, flow graphs for improved task placement, speculative locks, a custom distributed memory allocator, and others. TBB was released in 2006, well before C++11 introduced parallel execution features, and thus still enjoys a vibrant community. It was also one of the first parallelism libraries to fully support C++, resulting in commercial adoption in the software industry.

Other languages

Many established programming languages have recently included constructs for task-parallelism. Since version 1.7, Oracle Java supports the fork/join pattern using classes in the java.util.concurrent package, namely the abstract ForkJoinTask class; the RecursiveAction and RecursiveTask classes are two implementations of the abstract class, differing only in that the former cannot provide a return value. These tasks are processed by a pool of workers created from a ForkJoinPool class, employing work-stealing for work distribution.

Microsoft C# also supports task parallelism through the Task Parallel Library (TPL), as of version 4.5. The System.Threading.Tasks package provides the Task and Parallel classes for explicit and implicit execution of tasks respectively. The default scheduling in TPL is work-stealing, although several schedulers are available through the System.Threading.Tasks.TaskScheduler class. Tasks in TPL are also untied by default (borrowing the OpenMP term).

Erlang, a language designed for building distributed software, also uses work-stealing for load-balancing threads. With version OTP R11B, Erlang's SMP scheduler moved from a single global run-queue to per-scheduler run-queues. In Erlang terminology a scheduler is conceptually similar to a worker thread. Erlang supports up to 1024 schedulers active simultaneously. A scheduler will steal work from remote run-queues when idling.

Chapter 4

Operating System Scheduling

4.1 Introduction

This section discusses placement of task-parallel processes from the perspective of the Operating System. In the context of two-level scheduling, having the theoretically perfect requirements-estimation runtime is pointless if the other side is unable to efficiently utilize said estimations. The subject of partitioning resources among processes by the operating system has been studied extensively [47, 48, 110, 154, 114, 131, 109, 53, 54, 98]. However, much of that work expects either too much information or too little in the context of shared memory parallel processes. An example of information that is unrealistic to expect is per-job execution time; in this project's scope, jobs are assumed to be very fine-grained tasks and not whole processes, and profiling at that granularity would impose prohibitive overhead either as online computation or offline training. Furthermore, two-level scheduling can potentially provide information previously considered unattainable, such as the amount of a process' current or future parallelism.

Dynamic online information on a process' parallelism is characterized as Instantaneous Parallelism. There has been prior work on scheduling based on instantaneous parallelism, and it has been shown that it can lead to misallocation of resources [134] if the requirements-estimation mechanism is not tuned adequately. Even assuming an optimally accurate mechanism, there are several points of inherent uncertainty which can lead to misallocation. Most prominent among them is the estimation interval, coupled with the uncertainty in approximating a task's execution time: a very large interval could miss significant changes in parallelism, while a very short interval could force the system to react to insignificant parallelism changes. A second point of uncertainty is the suitability of the resources allocated; the memory footprint of a process may fluctuate between dominating the caches and barely using them at all. Such scenarios would call for different strategies in selecting task placement.

Threads were invented to distinguish independent activities within a single application, those being either multiple instances of the same code or separate modules with interleaving execution.

However, the recent abundance of processing cores in multicore and future manycore architectures bestows hardware-related connotations on threads as well. The operating system's scheduler will ideally try to place one thread per core; thus the number of threads spawned is indirectly transformed into a request for a certain number of cores. Current OS schedulers treat application threads as black boxes scheduling-wise [43]; they provide each with equal opportunity to hardware resources, regardless of the application's actual requirements. Moreover, most parallel applications exhibit suboptimal inter-thread synchronization, due to hardware latencies and contention. Consequently the current process model allows an application to dictate the terms of hardware resource utilization, possibly overcommitting the system and increasing contention, forcing the OS to act only retroactively to hazardous situations, often unsuccessfully [157]. Employing such a scheduling scheme on multicore systems can cause loss of performance or underutilization of resources; most importantly though, it can lead to arbitrarily non-uniform loss of performance on multiprogrammed systems.

4.2 Basic concepts

This section discusses basic concepts surrounding a system-level scheduler and primary responsibilities therein.

4.2.1 Resource partitioning strategies

The first and foremost responsibility of a system-level scheduler is the partitioning of available hardware resources between running processes. There are two primary strategies for this: time-sharing and space-sharing.

Time-sharing is when processes share the same hardware resources consecutively, each for a set interval. This strategy originally came about to provide an illusion of concurrency on single processor systems. Each concurrent computation is given a time-slice, an amount of CPU time, to run before being replaced by another computation for the next slice, and so forth. Time-sharing is still employed in modern CMP systems to accommodate the fact that there are usually many more processes, and even more threads, running than there are physical cores. This imbalance is not going to change even with the introduction of many-core systems. The negative aspect of time-sharing is that switching between processes has a certain overhead which, if incurred often enough, can be non-negligible.

Alternatively, space-sharing places each process on a disjoint subset of hardware resources. Under the assumption that modern CMP systems have more cores than the parallel scaling capabilities of many algorithms, it is arguably sufficient to split resources among time-critical workloads, leaving non-critical applications to share a bare minimum amount. With exclusive space-sharing distribution, processes run uninterrupted as if in isolation on a machine smaller than the one actually present. Apart from greatly reducing application switching on the same core, space-sharing also cuts down on contention for secondary resources like the interconnects and caches.

Time- and space-sharing strategies are not necessarily mutually exclusive. As hinted above, modern operating systems employ both heuristically to maximize performance, based on the computational intensiveness of the running processes.

4.2.2 Interrupts

The system scheduler is activated and run periodically. At the point of its activation the running user process is paused. The system scheduler then decides which application to run next and activates it from a similar paused state. This sequence of steps is orchestrated using interrupts, special signals handled by an interrupt handler. For example, on a time-shared single processor system the hardware clock issues an interrupt at the end of each time-slice; the handler then pauses the running user process and activates the OS scheduler, which in turn issues another interrupt to wake up the selected user process. The act of forcibly switching processes is called preemption. A similar interrupt-based approach is used on modern CMP systems to time-share each core. However, on such systems there is a second layer of scheduling which decides the placement of threads on cores. This scheduler runs on one of the available cores like any other process, and wakes up at a fixed interval using the same interrupt method.

4.2.3 Context switching

Switching between threads to run on a processing unit incurs a context switch, which means removing and storing the state data of the exiting thread and restoring the state of the entering thread. A context switch carries a performance cost due to running the scheduler and handling the TLB, but also due to cache contention.

Elaborate cache hierarchies on modern CMP chips necessitate the use of a translation lookaside buffer (TLB), a fixed-size table of correspondences from virtual to physical or segment addresses, which greatly improves virtual memory address translation. Upon executing a context switch this table needs to be flushed; the new process starts with an empty table, which means the first occurrence of every subsequent translation will miss, resulting in a page table walk either in hardware or software. Both the flush and the subsequent misses generate a performance loss for the application. Context switching between threads of the same process does not require a TLB flush, since they share the same virtual address space; the previous entries of the table remain valid for the new thread.

4.2.4 Thread management

An operating system might support two different types of threads: kernel threads and user threads. The former type is recognizable by the OS kernel and consequently schedulable by its scheduler. The latter, implemented and visible only in userspace, is called a user thread. There can be multiple user threads spawned over a single kernel thread; the application's runtime is then responsible for scheduling them and handling their state. A user thread cannot be placed on a processor unless a kernel thread is already scheduled on it. Hence kernel threads are a representation of hardware contexts in the kernel, while user threads are application contexts running over whatever allotment of processors the application received from the OS through kernel threads. In reality things are not that fluid nor dynamic. In total there can be three different implementation combinations:

• 1:1: 1 kernel thread for each user thread. This is the default threading implementation on modern mainstream operating systems. Called a Lightweight Process (LWP), each thread can be scheduled and placed on hardware independently by the OS scheduler. Each process starts as a single LWP and every thread it spawns thereafter is its own LWP. In this configuration the notion of a process exists only for administrative purposes, primarily to signify that a group of LWPs use a single address space, which is used in memory protection and management. Since threads are dependent on a kernelspace counterpart, all management actions need to inform the kernel, requiring a system call and incurring some overhead.

• N:1: N user threads over 1 kernel thread was the default implementation on single processor systems. The application could split its computation among virtual contexts, which it then had to manage itself, all running over a single processor. The absence of true parallelism made multiple kernel threads redundant, as the single kernel thread was only needed to store kernel-sensitive state. A benefit of this configuration is that thread management happens solely in userspace, without any system call.

• N:M: N user threads over M kernel threads is a combination of the above solutions. Such a scheme increases implementation complexity but enjoys the benefits of both parallelism (M kernel threads allow execution on M physical processors) and lightweight user thread management without system calls. This configuration was first introduced by Anderson et al. through Scheduler Activations [7].

The fourth possible combination 1:M is of course nonsensical since a single userspace computation context cannot be split among M kernel threads — and in effect M physical processors — at the same time.

4.3 System-wide scheduling issues

4.3.1 Selecting resources

The task of the operating system scheduler is to divide resources between all running processes. These resources are assigned directly or indirectly. For example, a processing unit is directly assigned by placing computation on it; cache banks, on the other hand, are indirectly assigned as a consequence of processor assignment. An ideal OS scheduler must take both groups of resources into consideration. In this project we considered only processing cores and caches as schedulable resources, because we target primarily compute-bound, non-I/O-bound processes.

Scheduling the directly assignable resources is simple, because their performance is easily quantifiable and monitored. Processing cores are distinct entities whose utilization status is measurable in isolation per process. It is relatively easy to count the amount of cycles a thread consumed on a specific core, even if said core is shared by many threads of the same or other processes. Furthermore, CPU time can be thought of as a brokered commodity: consuming CPU time is a stochastic process happening progressively during the execution. A thread is placed on a core with the promise of a time slice; in other words the commodity is not consumed in advance. Dynamically changing the schedule does not require rearranging CPU time beyond notionally switching ownership of the forthcoming slice. Assuming that data are readily available in the cache, context switching incurs relatively low overhead.

Indirectly assignable resources are not as easy to reason about, especially since they are commonly shared by many directly assignable resources and consumed in advance. That means that scheduling mistakes increase contention, with a potentially high reparation cost. Once a thread is given access to a certain amount of cache, the scheduler must assume that it is fully utilized and that the owning process depends on it. If a co-tenant is introduced later on, either due to time-sharing or due to sharing the cache hierarchy with other cores, cache lines will be evicted to clear up space. If the newer co-tenant has passive cache usage, it might suffer continuous misses as the older tenant thrashes the cache, due to how the replacement policies work.

It becomes apparent that resource partitioning is largely a concern of the indirectly assigned resources such as caches. As a result, architecture topology should be a major concern.

4.3.2 Fairness

Traditionally, scheduling fairness in Operating Systems has been defined as equal opportunity. This stems from single processor machines, where there was a direct correlation between time-slicing and total execution time: CPU time was divided into equal slices and processes alternated between them to provide an illusion of parallelism. Varying the size of the slice did not have any effect on the performance of a process other than the aggregated overhead of context switching.

On multicore architectures equal opportunity induces unfairness. A direct result of how parallel processes synchronize and concurrent entities collaborate is that one process can dominate the performance of another if both are given uniform access to the hardware. Under the assumption that all running processes are of equal importance, collocating a highly parallel process with a less parallel one means that, if both are given access to the same number of cores, the former is constrained and the latter potentially wasteful. Comparing individual performance degradation, relative to running in isolation, finds the highly parallel process suffering much more than the less parallel one. Alternatively, the resources given to each process should be sized proportionally to its true requirements given the constraints of system-wide demand. Such a configuration would result in relatively equal performance degradation among processes, arguably a much fairer partitioning.

[Plot: degradation variance (%) versus worst performance slowdown factor, for groups of 2, 3, and 4 collocated applications.]

Figure 4.1: Multiprogramming using a non-adaptive runtime. The Y axis is in log-10 scale. The X axis considers the worst performance degradation for each group.

[Plot: degradation variance (%) versus worst performance slowdown factor, for groups of 2, 3, and 4 collocated applications.]

Figure 4.2: Multiprogramming using an adaptive runtime. The Y axis is in log-10 scale. The X axis considers the worst performance degradation for each group.

Our own experiments have shown variable and unfair performance degradation when processes are multiprogrammed on a multicore processor, even when using dynamic adaptive resource management. Figures 4.1 and 4.2 show the performance degradation variance in percent, in log-10 scale (Y axis), against the group performance slowdown factor (X axis) for groups of 2, 3, or 4 collocated processes on a multicore processor. We ran 50 combinations in total from a set of 6 processes. A variance of 0 means that all co-runners degraded their performance equally, which is the desired case here. The group slowdown factor on the X axis refers to the worst slowdown within each group. We consider high variance as unfairness in resource management. A high variance does not necessarily translate into poor performance; thus we also use the worst slowdown within each group as another primary metric for evaluating the effectiveness of resource management. The desired outcome of this work is to provide low variance and negligible slowdown. For figure 4.1 we used a non-adaptive runtime scheduler. It is obvious that unfairness under Linux is quite high: several groups resulted in close to 12x process slowdown while variance was around 300%. Using an adaptive runtime, as in figure 4.2, drastically improved results, although there is still significant unfairness; many groups suffer a 5x slowdown and a variance of 100%.
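One way to make these metrics concrete is the following illustrative formalization (under the assumptions above; not necessarily the exact definition used for the plots). For each process i in a group,

    s_i = \frac{T_i^{\text{co-run}}}{T_i^{\text{isolated}}}
    \qquad
    \text{unfairness} = \frac{\max_i s_i - \min_i s_i}{\min_i s_i} \times 100\%

where T_i^{co-run} and T_i^{isolated} are the execution times of process i when collocated and when run alone on the same hardware; a perfectly fair partitioning yields identical slowdowns s_i for all co-runners and therefore zero spread.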

4.3.3 Efficiency

Allocation of hardware resources to processes needs to be efficient. If a process is given access to a set of resources, which will then be powered accordingly, the process must be able to sufficiently utilize them. What counts as sufficient can vary between different systems and use cases; for example, x86 processors could benefit from dynamic frequency scaling and accept 50% core utilization as sufficient. It is beyond the scope of this project to iterate and define all possible scenarios; thus, for simplicity, sufficient utilization is defined as simply being non-zero.

Utilization alone is not an ideal metric either. A parallel process can fully utilize a core without contributing progress to the application's execution. An example is a worker thread attempting to steal while choosing victims whose task-queues are empty, perhaps because the application's parallelism is not enough to occupy the total number of workers. Consequently, an OS scheduler needs to be able to cost-effectively remove resources from processes which are not able to utilize them productively. It is not necessary at this point to define a method for achieving this; it suffices to accept that awareness of requirements through parallelism feedback is the only prerequisite. In an arbitrary scenario where a process' parallelism fluctuates continuously, the OS scheduler should grant and remove access to a certain number of hardware resources proportionally to the fluctuation. This could be handled in the following alternative ways:

1. Threads are destroyed upon losing resources, and new ones created and initialized upon gaining access.

2. Threads are migrated between time-sharing and space-sharing configurations based on the amount of available resources.

3. Threads are put to sleep and woken up again when needed.

The above options are sorted by the amount of overhead they produce, based on their typical implementation as available in modern standard libraries. Thread creation has a significant cost which in some cases might exceed the running time of fine-grained tasks. Migrating threads does not solve the problem of lack of parallelism, as the feedback the process sends is actually in terms of threads and not processing cores; when such feedback says the parallelism has decreased, that means the process can no longer efficiently utilize the number of threads it currently has. Thus the best solution is to simply put a thread to sleep.

What about the creation of threads, and when should a thread be created? A process might be started with X threads, only to have the runtime later decide dynamically that it can utilize N * X threads. The threads should be created on demand but not destroyed, only put to sleep, to avoid paying the same overhead for recurring parallelism spikes; nevertheless it is possible that creation time could exceed the duration of the increased parallelism, making their creation not only pointless but also a missed opportunity for scaling out. Alternatively, the process could bootstrap with as many threads as there are cores but initially use only a reasonable amount (and wait for the initialization of only that small subset); the rest would be readily available to wake up on demand, making the delay from request to service quite small.

In essence, minimizing wasteful utilization and handling idleness are important tasks of a system scheduler; however, handling the various parallel entities requires special consideration and the collaboration of the process' runtime scheduler.
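A minimal sketch of the sleep/wake approach, assuming POSIX threads; surplus workers park on a condition variable and are woken when the allotment grows. The names and the id-based policy are illustrative, not the thesis's implementation.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  wake;
        int             active_target;   /* how many workers may currently run */
    } pool_ctl_t;

    void pool_init(pool_ctl_t *ctl, int initial) {
        pthread_mutex_init(&ctl->lock, NULL);
        pthread_cond_init(&ctl->wake, NULL);
        ctl->active_target = initial;
    }

    /* Called by worker `my_id` between scheduling iterations. */
    void park_if_surplus(pool_ctl_t *ctl, int my_id) {
        pthread_mutex_lock(&ctl->lock);
        while (my_id >= ctl->active_target)      /* surplus worker: sleep cheaply */
            pthread_cond_wait(&ctl->wake, &ctl->lock);
        pthread_mutex_unlock(&ctl->lock);
    }

    /* Called when the allotment changes (grow or shrink). */
    void pool_set_active(pool_ctl_t *ctl, int n) {
        pthread_mutex_lock(&ctl->lock);
        ctl->active_target = n;
        pthread_cond_broadcast(&ctl->wake);      /* wake any parked workers */
        pthread_mutex_unlock(&ctl->lock);
    }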

4.3.4 Requirements estimation trust

Two-level scheduling [64] addresses the inability of the OS to keep track of per-process hardware requirements, and the inability of processes to infer system-wide resource availability. However, processes have little or no gain from being honest with their estimations. Truthful and thus frequent allotment adaptation potentially reduces contention but involves extra overhead; being greedy potentially increases contention but allows instant access to resources. From a game-theoretical perspective, in the simple scenario of two processes A and B, there are the following cases:

Case 1: both A and B are dishonest.

Case 2: only one of the two, A (or B), is honest.

Case 3: both A and B are honest.

Either process loses the most in the second case: by being honest it might suffer contention from a dishonest co-tenant while also paying the overhead of adapting its resources. Individual payoff is maximized in the third case, but there is uncertainty regarding the actions of other processes; hence each process must concede that case two is the probable outcome of honesty and reject case three completely. Consequently, case one becomes the best strategy for all. This reasoning extends to any number of processes running at the same time without losing validity. There has been related work on providing stronger incentives for truthfulness in cloud computing [150, 79]. However, those solutions impose computational overhead which should not be introduced inside an operating system scheduler, where processes could have sub-second running time, making any cycles spent by the scheduler significant. Other related work has embraced self-adaptation of resources by the process. Self-adaptation, besides failing to provide motivation for honesty, is also unaware of contention; on a heavily multiprogrammed system, avoiding hazardous contention might require allotting resources below the desired amount.
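The argument can be visualized with an illustrative payoff matrix; the numbers below are hypothetical, chosen only to mirror the ordering described above (mutual honesty pays off the most, an honest process facing a dishonest co-tenant fares worst), and are not measurements or values from this thesis.

% Illustrative payoff matrix (hypothetical numbers); entries are
% (payoff of A, payoff of B), higher is better. Honesty is individually
% best only if the co-tenant is also honest, so the risk-free choice
% for both is dishonesty.
\begin{array}{c|cc}
                    & B\ \text{honest} & B\ \text{dishonest} \\ \hline
A\ \text{honest}    & (4,4)            & (1,3)               \\
A\ \text{dishonest} & (3,1)            & (2,2)               \\
\end{array}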

4.3.5 Scalability: computability and overhead

Theoretical computer science has long used scheduling, job assignment, and bin packing as definitive examples of NP-complete problems. Optimal scheduling of multiprogrammed systems falls into the same group of problems. A job comes with a set of hard requirements on hardware locality and capacity, and potentially a soft or hard completion deadline too. These constraints need to be adhered to by the scheduler for every job. In a multiprogrammed system it should be expected that n ∈ N such jobs can be introduced simultaneously. Consequently the resulting problem is trivial to reduce to bin packing (using a Karp reduction), making it computationally demanding. Indeterminism stems from the desire to achieve an optimal solution in combination with the multiplicity of the problem; it is the fact that n jobs need to be scheduled simultaneously which makes an optimal solution deterministically unattainable. Otherwise called Global Optimization [81, 80], this field has enjoyed much research, with several computational methods developed [6, 118]. Nevertheless they suffer from one or more of the issues listed next, making them unsuitable components for any OS scheduler. Global Optimization methods might require quantified job characteristics and other information which in reality are unknown or too costly to acquire; online profiling is impossible to perform in a timely manner in the context of fine-grained task-parallelism with sub-second task execution time and unknown task-tree depth and width. Learning methods also move slightly out of scope of a general computing OS scheduler, as they depend on repeated execution of identical workloads. Moreover, offline profiling of a process before every run is orthogonal to the desired goal of a "just-works, fully automated, generic scheduling solution". Other methods might simply be too computationally intensive to qualify for use within an OS scheduler. Nevertheless, the types of processes targeted by this project do not have hard deadlines, making suboptimal scheduling solutions acceptable. An operating system scheduler's primary goal is to facilitate each process in achieving its best performance, while considering global demand and avoiding contention. However, in the effort to do so, any such scheduler could be working against the process. Every time the scheduler is invoked it steals CPU time from the process in order to decide where and how much further CPU time to give it; if this decision process is not short, then even if user time is optimal the real time might be dismal. In other words, any time spent in the scheduler is time lost from progressing computation. An operating system scheduler should not stand in the way of the process but remain unnoticeable in the background.

4.3.6 Use case 1: a first attempt on the Barrelfish OS

Barrelfish is a novel approach to the old idea of a distributed operating system. It follows the multi-kernel model [13], according to which there is one micro-kernel per physical core in the system. These kernels work as one distributed operating system. No data is shared, only copied using message passing; all services are shared by being distributed and having their status replicated. This design provides two great features right away: portability — since it can be deployed on any architecture as long as each available processor is supported — and scalability. The latter is an outcome of the OS' most basic characteristic of assuming no shared memory and employing message passing for inter-core communication; hence there is no dependence on complicated cache coherence mechanisms. Thread scheduling in Barrelfish is rather simplistic; each kernel uses the RBED algorithm [29] to order the execution of the threads that are time-sharing its underlying core. There is no automated mechanism for migrating threads between cores, since that would require kernels to stop being independent and partake in a distributed monitoring scheme. Nevertheless, each process is allowed to create threads arbitrarily on a specific malleable set of cores, called the domain, and migrate those threads within the domain on demand. Thus there is no inter-core scheduling done by the system, or in other words there is no centralized common control over the distribution of these threads onto the cores of a program's domain. So, very frequently the combined load becomes highly unbalanced, without system-wide knowledge and control, occasionally under-utilizing the system.


Figure 4.3: A snapshot of the BIAS system-wide resource manager over Barrelfish with two processes sharing all resources and time-sharing one core.

This project investigated methods for providing system-wide resource management in the Barrelfish OS [146], with a prototype implementation called BIAS. Figure 4.3 depicts a simplistic overview of a running system that uses BIAS. In the figure's example there are 4 cores and 2 task-parallel processes running; 3 of the cores are split between the processes, while the 4th core is time-shared by both. The CPU drivers are minimal kernels, one per core, assumed homogeneous in this example. BIAS implements adaptive scheduling in the application layer, with the application's runtime forwarding parallelism feedback to a new system-wide scheduler. The system scheduler space-shares cores based on the received feedback. Upon contention (insufficient cores to satisfy all requests), threads of different processes are allowed to time-share a core, until one of the processes can win exclusive access.

Although BIAS did produce positive evaluation results, it suffered from several irreparable shortcomings and was eventually abandoned. The issues were: i) inability to scale, ii) unfair resource distribution, leading to iii) uncontrollable performance degradation. Scalability issues were the outcome of having an N×M problem to solve, N being the aggregated number of threads and M the amount of available resources. Even though BIAS employed a segmentation mechanism to split the management of resources into independent sections, the necessity for cross-segment status replication and centralized decision making incurred total delays proportional to the aforementioned complexity. In essence BIAS suffered from the disadvantages of using global optimization. BIAS distributed resources unfairly, resulting in non-uniform performance degradation of co-runners, because it blindly serviced their resource requests. Under heavy system load with no unassigned processors, if a process requested extra cores BIAS would service the request by removing exclusive access to the same number of cores from another process. As an example, collocating multiple embarrassingly parallel workloads would eventually result in having all cores time-shared between all of them; although this scenario does not deviate from the current reality in OS scheduling, the overhead of thread management combined with the small extra overhead of BIAS would completely outweigh any benefits.

4.3.7 Use case 2: the evolution of the Linux scheduler

In recent years the Linux kernel has gone through three schedulers, each implementing different strategies with distinct trade-offs. With version 2.4 Linux used the O(N) scheduler, which was a straightforward implementation of the time-slicing logic. CPU time was divided into epochs, within which each task could execute up to its individual time-slice. If the epoch's length was not enough for a task's time-slice, or the task was interrupted for other reasons, then its next time-slice would be 150% in length. During every scheduling event, the O(N) scheduler would iterate the complete list of tasks, applying goodness criteria to order their execution, eventually selecting the one to run next. This scheduling approach suffered from fairness and scalability issues, leading to its abandonment. With version 2.6 came the O(1) scheduler, designed to address many of the previous scheduler's issues. Primarily it fixed the execution complexity of scheduling events, improving scalability and efficiency. It was based on the idea of the runqueue, meaning that a scheduling event consisted of a single dequeue operation. Heuristics were employed to identify process requirements, such as whether they were I/O or compute bound, so as to order them in the runqueue. Nevertheless, these heuristics became large in number, each custom-tailored to a specific fine-grained pattern, making the scheduler unmaintainable and difficult to extend. Linux version 2.6.23 brought a long anticipated scheduler change in the form of the Completely Fair Scheduler (CFS), which was partly inspired by the Rotating Staircase Deadline Scheduler (RSDL), itself proposed for inclusion with version 2.6.21. CFS uses the self-balancing red-black tree data structure to provide fairness, operating in O(log n); fairness was defined through the uniformity of a process' access to the processor, relative to all other processes sharing it. Quickly after, group scheduling was added for improved fairness toward highly parallel applications; fairness was extended to individual tasks instead of only between processes. The virtual runtimes of all tasks spawned by a process are shared across the group in a hierarchy; consequently each task receives relatively the same scheduling time as the rest of its group. Nevertheless, group scheduling logic delegates configuration control to the administrator and is far from an automated load balancing method; its default functionality is not far from a per-process fairness model, where all get uniform access to the hardware. Internally the CFS scheduler functions based on a model of an ideal processor with infinite capacity. A fair clock runs at a floating fraction of the real CPU's clock speed. The per-process time-slice is calculated by dividing the wall time by the number of waiting processes. A process' waiting time is stored and used as a metric for ranking, where the longest waiting time gets the highest rank; this waiting time represents the time the process would have used on the ideal processor. When a process is scheduled — the one with the highest rank — its stored waiting time starts decreasing proportionally to the CPU time being spent; concurrently, the waiting time of the other processes increases. When any process' waiting time becomes the highest such metric in play — higher than that of the one currently running — the running process is pre-empted and switched out.
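The bookkeeping just described can be illustrated with a toy model. The sketch below is for intuition only and is not the kernel's CFS implementation (which keys a red-black tree on per-entity virtual runtimes); the wait field, the accrual rate, and the tick()/pick_next() helpers are inventions of the sketch.

/* Toy model of the "ideal processor" accounting: each runnable process
 * accumulates waiting time while others run; the scheduler always picks the
 * longest-waiting process and preempts it as soon as another overtakes it. */
#include <stddef.h>

struct proc {
    const char *name;
    double wait;            /* time "owed" on the ideal processor */
};

static struct proc *pick_next(struct proc *rq, size_t n)
{
    struct proc *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || rq[i].wait > best->wait)
            best = &rq[i];
    return best;            /* highest rank = longest waiting time */
}

/* Account one slice of real CPU time, dt, given to 'running'. */
static void tick(struct proc *rq, size_t n, struct proc *running, double dt)
{
    for (size_t i = 0; i < n; i++) {
        if (&rq[i] == running)
            rq[i].wait -= dt;               /* consumes its credit          */
        else
            rq[i].wait += dt / (double)n;   /* accrues its fair share       */
    }
    /* Preemption point: if someone now waits longer, pick_next() switches. */
}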
With group scheduling the logic above is applied first on a per-group basis and then within each group. Thus a process A with several times more scheduling entities (such as threads, fig. 4.4a) won't dominate a process B having far fewer (fig. 4.4b). Each process will be given a fair amount of CPU time, within which the scheduler will try to be fair in scheduling the individual threads. Practically, CPU time will be divided equally between the processes, meaning that the many scheduling entities of process A will be given the same total time as the few of process B. The CFS scheduler provides fairness in the form of equal, uniform access to the hardware; each process gets the same slice. However, on modern CMPs multiprogrammed with parallel applications, this notion of fairness is not always fair. To showcase that, assume that in isolation — unrestricted access to the same hardware with no co-runners — both processes A and B have the exact same execution time X. Let us further assume that the available processors are as many as the scheduling entities of A, while B has exactly half. When co-run, processors will be divided equally between A and B, with the former having two scheduling entities per processor and the latter only one. With some level of simplification it can be argued that B would execute in X time, as if in isolation; however, the execution time of A would be degraded significantly by an excess Y time, as shown in figure 4.4c. If processes A and B belong to different users, then only the owner of A perceives any slowdown, which is not really fair. To capture the notion of fairness in scheduling multiprogrammed systems, this project introduces the metric of performance degradation variance (pdv); that is, the difference in performance degradation between co-runners. The aim of a truly fair scheduler ought to be to minimize pdv simultaneously with minimizing the performance degradation of each co-runner individually. In other words, CPU time should be partitioned so that each application runs as fast as possible with as little contention as possible.


(a) Process A executes in X time with 4 scheduling entities on 4 cores. (b) Process B executes in X time with 2 scheduling entities on 2 cores.

(c) Co-running processes A and B over CFS.

Figure 4.4: Assume processes A and B, with 4 and 2 scheduling entities respectively. Co-running them over the Linux CFS scheduler means that process B runs without performance degradation in X time, while A runs in X + Y time. With truly fair scheduling both should need the same CX time to execute, C > 0.

To achieve this goal, instead of completely uniform access to hardware, processes should be given a slice proportional to their ability to utilize the machine as a whole.
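One way to make the metric concrete — the notation below is a hedged formalization for this discussion, not necessarily the thesis' formal definition — is to express each co-runner's degradation as the ratio of its co-run time to its isolated time and take pdv as the spread of those ratios.

% Degradation of co-runner i and the resulting pdv (illustrative formalization):
d_i = \frac{T_i^{\text{co-run}}}{T_i^{\text{isolated}}}, \qquad
\mathit{pdv} = \max_i d_i - \min_i d_i
% In the example of figure 4.4: d_A = (X+Y)/X, d_B = X/X = 1, hence pdv = Y/X,
% whereas a truly fair partitioning would give d_A = d_B = C and pdv = 0.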

4.4 System-wide scheduling principles

Having discussed the major issues with scheduling multiprogrammed systems, this section tries to formulate a set of principles to follow in designing a solution. These are to be seen as abstract results of a thought process and not as a recipe; it might very well be that complete satisfiability of these principles is infeasible. In summary, the OS scheduler must provide a fair partitioning of resources, custom-tailored and proportional to the specific hardware requirements of each process; it needs to be diligent in handling idleness and parallelism fluctuations in the processes; to do those two things the scheduler needs to be aware of the requirements of the process, but it also needs to address the possibility of dishonest feedback and to accommodate processes that do not provide such information at all; finally, said scheduler must be deterministic with negligible computational overhead while providing sufficiently good solutions. In all practicality this list does look infeasible as a whole; thus much of the following content discusses the threshold for accepting the partial satisfaction of said principles. In the context of multi-criteria, multi-objective optimization, where this scheduling problem belongs, no solving method can meet the computability and overhead constraints stated, as they require O(n^y), y ∈ N*, time [119]. Considering many-core architectures of one hundred cores or more, multiprogrammed with tens of processes, such complexity becomes prohibitive, even if it could be argued that fine-grained processing of tasks is likely to be computationally lighter. Some of the constraints need to be relaxed, either by accepting suboptimal solutions, increasing the scheduler's computation window, or restructuring the problem so that indeterminism lies where it is not as significant. The first two alternatives are opposed to the initial objectives, meaning the only viable option is to restructure the problem. The multiple scheduling criteria are inherent to the problem and cannot be changed; however, this is not true for the objectives and the computation method. So far the problem at hand has been considered as a single problem to be solved in one pass as a whole. Moreover, the scheduling objectives as listed above seem intertwined, but in reality they don't have to be. System-wide scheduling can be restructured to address a certain list of objectives asynchronously and independently of one another. Below we present an abstract reformulation but postpone discussing an actual solution until chapter 6.

• Fairness: given limited resources, all simultaneously running processes should sacrifice no more than the same percentage of their expected performance.
• Efficiency: the amount of underutilized resources of processes should be minimized.
• Distrust: all processes must be required to prove their requirements in the presence of contesting co-runners.
• Cost-Effectiveness: scheduling actions should execute to completion in time significantly smaller than the desired scheduling interval.
• Near-Optimality: schedules must satisfy criteria enough to evolve to optimal in one or two more iterations.

Expected performance is defined as the execution time of the same workload run in isolation with unrestricted access to all available hardware resources. Multiprogramming could potentially result in processes being allotted fewer resources than they could utilize, degrading their performance. The first criterion is concerned with the ability of the resource manager to award each process enough resources that the percentage of performance degradation is the same for all co-runners. The second criterion encourages conservation by allotting only the necessary amount of resources, allowing whatever remains idle to be powered off. The third criterion demands that requests get satisfied over time but not immediately; processes should be given a promise of a certain allotment and an opportunity to win its realization; the inherent delay produced by such a method could be a significant overhead under high system load, thus the mechanism implementing it should be lightweight, computationally simple, and deterministic. The last two criteria concern the implementation and were discussed in the beginning of this section.

Part II

Solution Space


Chapter 5

Adaptive Work-Stealing Scheduling

5.1 Introduction

This section investigates adaptive work-stealing scheduling of fine-grained task-parallel applications. As summarized in the first chapter, adaptation requires a method for estimating available parallelism. Such estimations are expressed as a number of worker threads, which directly translates into processing cores. The placement of these threads onto specific cores indirectly selects caches too. For example, an application whose tasks have large memory footprints might benefit from not sharing caches between workers, to avoid capacity conflicts; on the other hand, an application doing multi-step processing of shared data might gain the most by sharing caches. In conclusion, appropriate selection of cores for placing the workers determines the placement of data in the caches. However, in the context of two-level scheduling the application's runtime should not handle thread placement or any other aspect of thread scheduling whatsoever; any application is rather ill-suited to do so. Its primary responsibility — other than the placement of tasks — must be to estimate the number of threads it can optimally utilize and to forward that metric to the system level. The estimation method should adhere to three principles:

1. the metric of parallelism must be history-independent and portable, thus hardware agnostic;
2. the decision policy must be optimistic, by loosely defining the requirements' upper bound, and opportunistic, by capturing sudden short-lived changes in parallelism;
3. the computation method must be scalable, needing a monotonically non-increasing amount of profile data that eventually converges to a fixed size irrespective of further parallelism increments.


5.2 Parallelism quantifying metrics

The cornerstone of every requirements estimation mechanism is the metric used to quantify those requirements. It must reflect available parallelism, and in the context of adaptive scheduling it must reflect future requirements. The former translates into a workload's ability to fully utilize a certain number of processing cores simultaneously; the latter demands identifying an application's parallel computation before it becomes ready for execution. These two criteria will be referred to as parallelism and potential respectively, or an application's parallelism potential as a whole. For an automatic online estimation mechanism that does not increase the development effort, identifying parallelism must not be delegated to the developer. Thus the mechanism must be completely decoupled from the application and encapsulated fully in the runtime. From that perspective there are two strategies for profiling an application: measure its impact on the hardware it is running on, or leverage the semantics of the programming model. ASTEAL by Agrawal et al. [3, 4] uses a hybrid of both, measuring the hardware effects of certain scheduling actions. Many other projects have explored profiling the effects of execution on the hardware to infer application behaviour. Some predictively model cache usage patterns [28, 22, 32, 49], or monitor cache events and map them to application behaviour [106, 11, 104, 136, 137], or leverage both for contention-aware scheduling [86, 63, 34]. SCAF [43] leverages performance monitoring and runtime adaptability feedback to partition resources. Iancu et al. [83] explored resource oversubscription to address fluctuating requirements. Weng and Liu [151] used CPImem to estimate resource requirements, as feedback for improving resource partitioning. The biggest obstacles to depending on hardware counter measurements are portability and interpretation. Various processors implement different events, or the same events differently; this makes it hard to formulate a stable, universally compatible foundation for an estimation mechanism, which would instead require constant refinement and improvement. Furthermore, even if measurements of the same specific hardware events can be acquired transparently, interpreting them as application behaviour cannot be concretely unambiguous. For example, an increase in CPI is considered a good indication of increased cache misses; nevertheless, it cannot be easily inferred whether it is by design (the application starts working on data not previously touched) or not. In [155] Zhang et al. argue that their CPI-based approach overcomes the variability of CPIs in workloads after gathering an extensive history of data. In [105] Mars et al. argue that their characterization method needs prior knowledge of the behaviour of the applications being profiled and training on the target architecture; both tasks cannot be performed in the context of mainstream computing. This thesis has explored a completely hardware-agnostic metric, leveraging work-stealing semantics by using a worker's task-queue size as the parallelism potential metric. Tasks placed in a worker's task-queue are ready for execution but not yet assigned; since they are ready, all dependencies have been resolved; thus they directly correspond to future computation, executable in parallel. Tasks are logical, not bound to hardware. The same metric has been explored by Cao et al. with promising results [30].
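As a minimal sketch of the queue-length idea — the deque layout and sampling function below are assumptions of the sketch, not WOOL's or this thesis' actual data structures — the runtime can periodically sum the ready tasks sitting in each worker's deque and forward the aggregate as its parallelism-potential feedback.

/* Sketch: ready (spawned but not yet executed or stolen) tasks per worker,
 * aggregated across the allotment as a hardware-agnostic parallelism metric. */
#include <stdatomic.h>

struct deque {
    atomic_long top;        /* index of the oldest ready task */
    atomic_long bottom;     /* index past the newest task     */
};

struct worker {
    struct deque dq;
};

/* Ready tasks currently queued on one worker. */
static long queued_tasks(struct worker *w)
{
    long b = atomic_load(&w->dq.bottom);
    long t = atomic_load(&w->dq.top);
    return b > t ? b - t : 0;
}

/* Aggregate parallelism potential over the allotment; this number (or a
 * per-worker breakdown of it) is what gets forwarded as feedback. */
long parallelism_potential(struct worker *workers, int nworkers)
{
    long total = 0;
    for (int i = 0; i < nworkers; i++)
        total += queued_tasks(&workers[i]);
    return total;
}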

5.3 Deciding allotment size satisfiability

A proper parallelism potential metric must be accompanied by a concrete policy for deciding the satisfiability of the given allotment size. By allotment we mean the set of resources that have already been assigned to the application; in the context of a work-stealing scheduler that is the number of worker threads. This decision policy should comprise at least the following three states.

• Underutilized: workers are not being utilized enough; the allotment size could decrease without affecting performance.
• Overutilized: workers are utilized fully, and there is more parallelism available; the allotment size could increase without affecting worker utilization levels.
• Balanced: workers are utilized and no excess parallelism is detected.

This policy is generic and complete; it includes all possible states an allotment can be in, each prompting an adaptation action, or inaction if balanced. Moreover, these are the states that any selected metric's values should map onto. Labeling an allotment with any of the three aforementioned states implies uniformity of utilization across all workers. An allotment's status when individual workers are split between over- and under-utilization levels is undecidable. Such cases could be addressed by accepting a majority-based decision: if more than half the workers are overutilized, more should be added. However, what if the rest are not simply underutilized but even starving? This scenario could result in significant wastefulness of CPU time. It could also be the case that the total parallelism is not enough for the current allotment's size if the over-utilization is borderline, negating any usefulness from adding extra workers. From the perspective of evaluating worker utilization status, a fundamental issue of work-stealing is its inherent randomness. This randomness disallows any correlation between a specific worker and its ephemeral utilization level. The lack of task placement control means that any quantification can be performed only for the allotment as a whole, with no guarantees on per-worker consequences.
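The mapping from per-interval observations to these three states can be sketched as follows; the inputs and thresholds are illustrative placeholders of this sketch and are not the decision rule of DVS or Palirria.

/* Illustrative three-state classification from two coarse observations. */
enum allot_state { UNDERUTILIZED, BALANCED, OVERUTILIZED };

enum allot_state classify_allotment(int idle_workers, long queued_tasks,
                                    int nworkers)
{
    if (idle_workers > 0 && queued_tasks == 0)
        return UNDERUTILIZED;           /* shrinking would not hurt        */
    if (idle_workers == 0 && queued_tasks > nworkers)
        return OVERUTILIZED;            /* growth could absorb excess work */
    return BALANCED;                    /* leave the allotment as is       */
}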

5.3.1 ASTEAL

ASTEAL is discussed in detail as it was selected as a good representative of a related but different approach to requirements estimation. Furthermore, ASTEAL was fully implemented and evaluated comparatively against the contributions of this thesis. ASTEAL monitors the cycles spent by workers conducting actions that do not help progress execution; in the context of work-stealing that is stealing time, including both failed and successful attempts, but excluding the execution of tasks. The argument is that if the time spent on these actions surpasses a certain threshold — the allotment is then marked as inefficient — it signifies absence of parallelism and the number of workers should be decreased. ASTEAL measures satisfiability on a second dimension, keeping track of the previously requested allotment size and comparing it to the current size; if they match then the request is satisfied, otherwise deprived. The specific decision policy of ASTEAL is as follows.

• Inefficient: decrease the allotment. The workload is unable to fully utilize its allotted workers.
• Efficient and Satisfied: increase the allotment. The workload was allotted the desired resources and successfully utilized them.
• Efficient and Deprived: keep the allotment unchanged. The workload was allotted less than the desired resources and successfully utilized them; the system is probably congested.

Parallelism quantification is done on the allotment as a whole. On every interval the aggregated wasteful cycles are compared against the adjusted length of that interval; thus any conclusion has lost information on the distribution of utilization levels across the allotment. The above policy does not directly correspond to our initial policy: it misses a balanced state. A hidden detail is that the Efficient and Deprived state still corresponds to an increment request, repeating the previously unsatisfied request. The absence of a balanced state means that ASTEAL is over-optimistic; it will pro-actively request more resources than it has proof it can utilize; the added workers remain underutilized, resulting in their immediate removal. These unnecessary allotment changes are wasteful; moreover, they risk missing an actual parallelism increment. Although ASTEAL can be argued to be opportunistic, its unprompted increments can lead to inaccuracy and eventually loss of performance. Another downside of ASTEAL is that its metric quantifies past behaviour. Although the authors claim that their algorithm does not correlate job history with future requirements [4], it is obvious that any decision is provoked by effects during the previous interval. Consequently ASTEAL could fail to address a sudden, short-lived change in parallelism; in other words, it is not opportunistic.
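The three-way policy above can be sketched as a small decision function; the threshold and the ±1 steps are illustrative assumptions of this sketch and do not reproduce A-STEAL's actual desire-adjustment arithmetic.

/* Sketch of the inefficient / efficient+satisfied / efficient+deprived
 * decision, returning the next requested allotment size ("desire"). */
int next_desire(double steal_cycles, double interval_cycles,
                int desire, int allotted)
{
    const double inefficiency_threshold = 0.2;   /* illustrative value */

    if (steal_cycles > inefficiency_threshold * interval_cycles)
        return desire > 1 ? desire - 1 : 1;      /* inefficient: shrink   */
    if (allotted >= desire)
        return desire + 1;                       /* satisfied: grow       */
    return desire;                               /* deprived: ask again   */
}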

5.4 Computing estimation requirements

Having covered basic principles for quantifying parallelism and deciding satisfiability, it remains to define a basis for parsing those data. On one hand, the method must be arbitrarily scalable. Exhaustively profiling every worker thread translates into a dataset that grows monotonically with the allotment size; this attribute means that a straightforward parsing algorithm would suffer monotonically increasing execution time. A common way of bounding the execution time would be to profile only a sample of the workers, or process only a sample of the data set. In either case the work-stealing algorithm does not provide a way to score candidacy among workers. The primary problem is the randomness of the scheduling algorithm: due to random victim selection, worker behaviour is not guaranteed to persist. Hence deciding the sample would have to be done on every interval. This requirement adds an extra monitoring level, beyond the profiling data themselves. This thesis chose to follow the path of changing the work-stealing algorithm to provide persistent worker behaviour, which in turn can be used to cap the sample size, without altering the algorithm's semantics.

A second part needing special attention is the application of those requirements, specifically the removal of workers. Adding new workers is an unproblematic action: new threads are created and placed on the added cores, and the necessary bookkeeping is straightforward. Decrementing the allotment size, however, requires electing specific workers for removal. One might argue that the most underutilized workers should be the ones selected. Once again, work-stealing's randomness mandates iterating over the status of each worker every time this selection needs to be performed. At an arbitrarily large scale this overhead could become prohibitive, considering also that this task cannot be easily distributed and data would need to be transferred across the processor topology, possibly off chip.

5.5 DVS: Deterministic Victim Selection

5.5.1 Description

Deterministic Victim Selection (DVS) was developed in order to overcome the issues discussed in the previous sections. It fundamentally changes the work-stealing scheduling logic, guaranteeing a predictable, uniform distribution of tasks across workers by removing all randomness; victim selection is a self-contained part of the work-stealing method, thus replacing said mechanism in any such runtime does not impose further implications. Moreover, DVS does not hurt performance: experiments conducted with DVS in WOOL and Cilk++ have shown the same or better performance than the original versions of those libraries. DVS does not include a mechanism for monitoring allotment status, nor for estimating requirements; however, its properties enable such actions, as discussed in a later section. This section introduces and describes DVS in an informal way. Section 5.5.2 formally defines DVS and mathematically proves all of its properties.


Figure 5.1: DVS virtual mesh grid topology for 64 physical cores (a). Every application has at least one thread called the source (b).

The DVS algorithm is deterministic. It uses specific, strict rules to predefine the set of possible victims for each worker, thus removing all randomness. Applying DVS requires worker threads to be pinned on their respective cores; ideally there should be only one worker per core. In effect the pinned workers define a metric topology based on the underlying interconnect. DVS however does not use that topology; it creates a virtual 2-dimensional topology resembling a mesh grid, as in the 64-core arrangement of figure 5.1a. Every application has at least one worker, which will be referred to as the source worker. DVS marks that worker at a central place in the topology, as in figure 5.1b.

(a) The allotment consists of 5 workers. (b) The allotment consists of 13 workers. (c) The allotment consists of 9 workers.

Figure 5.2: Placement of workers is done in a spiral sequence, horizontally and vertically from the source, starting on top.

Every worker added thereafter is placed on the grid, adjacent to the source horizontally or vertically, in a spiral sequence. Figure 5.2a shows an allotment of 5 workers and figure 5.2b one of 13. The virtual diamond shape being constructed does not need to be complete for DVS to work, as in the allotment of figure 5.2c with only 9 workers. The virtual topology is a non-Euclidean 2-dimensional metric space, where nodes are connected vertically and horizontally, but not diagonally nor by wrapping around the edges. The distance between nodes is computed as the least number of hops between them and equals the hop-count. As shown in figure 5.3, assume horizontally adjacent nodes x and y; their distance is 1. Now assume node z on top of y; its distance from x is 2. Moreover, distances are stated relative to the source. Assuming the source to be at distance 0, saying a worker is at distance 1 means a worker 1 hop away from the source. Hence workers at distance 2 are further away, while being closer to the source means being further inside the allotment.


(a) Assume 3 nodes on the virtual grid topology. (b) x - y distance is 1, x - z distance is 2.

Figure 5.3: Node hop-count horizontally and vertically defines their distance.
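Hop counts on this virtual grid are plain Manhattan distances, and the spiral placement fills one diamond ring after another. The short C sketch below illustrates both, under the assumption that ring r holds 4r positions; the traversal order inside a ring is left out, as the text treats it as a mere convention.

/* Sketch: virtual grid distance and the ring a worker lands on when the
 * allotment is filled in spiral order (index 0 is the source). */
#include <stdlib.h>

/* Manhattan distance between two virtual grid positions. */
int hop_count(int x1, int y1, int x2, int y2)
{
    return abs(x1 - x2) + abs(y1 - y2);
}

/* Ring (distance from the source) of the i-th placed worker; rings 1..r
 * together hold 4 + 8 + ... + 4r = 2r(r+1) workers, so index i > 0 belongs
 * to the smallest r with i <= 2r(r+1). */
int ring_of_index(int i)
{
    int r = 0;
    while (i > 2 * r * (r + 1))
        r++;
    return r;
}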

The position of a worker on the virtual topology is unrelated to its physical position and is also malleable. Workers are placed in order using a spiral sequence that maintains the same distance from the source, starting directly above the source (although this detail is just a convention without significance) and jumping a further distance of 1 when a diamond ring is completed, until no more workers are available in the allotment. Assume an adaptive allotment that had worker x on core X removed, and later replaced by worker y on core Y; worker y will take the place of worker x on the grid. If worker x is later re-added, it will take a new place. Workers acquire properties according to their position relative to the source. Since their position can change, the rest of this section discusses properties by referring to workers as nodes, meaning any worker located at a specific position relative to the source. As mentioned earlier, the source worker is the thread which initiates the application. The rest of the workers are unambiguously separated into certain classes. The classification is defined by the location of a worker with respect to the source. There are three classes: X, Z, and F. The following list describes these classes; earlier rules take precedence over later ones. Figure 5.4 illustrates these classes over a 64-core topology, assuming a 25-worker allotment.


Figure 5.4: 25 worker allotment, classified as per the DVS classification, over a virtual mesh grid topology.

• Class Z includes those workers located at the single maximum communication distance from the source.

• Class X includes those workers on the axes that span horizontally and vertically from the source, excluding those at maximum distance.

• Class F includes the remaining workers.

The DVS classification controls victim selection by defining the potential victims for each worker. Each worker has a predefined victims set, which changes only if the allotment changes. Members of each class construct their victims set differently. In general, stealing is allowed only between adjacent nodes, with few exceptions up to the hard limit of a distance of 2. Also, each thief iterates its victims set in a strict order. The order is defined by the relative position of the victims to the thief. Hence DVS enforces a directed exchange of tasks between predefined pairs. In the end the DVS rules create a closed flow of tasks across the virtual topology. Each class performs a different function toward realizing that flow. A summary of the actual rules is as follows.

• Nodes on the axes (class X) are responsible for transferring load outward, by stealing primarily from victims nearer to the source.

• Members of class F relocate load back inwards, by stealing primarily from nodes further away.

• Members of class Z balance the load across quadrants. Workers in Z first steal from within their own class (diagonally left and right); only upon failing that, they’ll search for new tasks from the inner parts of the allotment.

The three classes do not exist at all times. For example, an allotment of up to 5 members consists of the source and an X class, as shown in figure 5.5a. Adding nodes up to 13 will introduce class Z; as shown in figure 5.5b, all nodes on the outer ring are classified as Z, including those on the axes, as per the higher priority of the Z classification rule. Class F is introduced after extending the allotment with nodes at distance 3 and further (figure 5.5c-d).


Figure 5.5: Stealing relocates tasks; reimagine it as a flow of the load through the topology. Each arrow abstractly represents the directed relocation of tasks from the victim to the thief. White for outward flow and gray for inward reflow. Subfigures b-d show only one quadrant to avoid clutter.

5.5.2 Formal definition

This section formally defines the DVS policy. However, several properties are not included in this section; they are presented as part of the requirements estimation algorithm that makes use of them.

Definitions

The communication distance between any two workers wi and wj is referred to as the hop-count, hc(wi, wj); this is the shortest communication path between the physical cores where the two worker threads are located. The allotment of workers for a specific workload is defined as I, and its source worker as s. The maximum hop-count between the source s and any other worker in I is defined as d.

Worker Classification

DVS classifies each worker in I based on the following rules:

Class Z is defined as the set of workers at distance d from the source.

Z_n = \{\, w_j \in I : hc(w_j, s) = n \,\}

Class X consists of those workers that neighbor exactly one worker located one hop closer to the source.

X = \left\{\, w_j \in I \;:\; \exists!\, w_i \in I,\ hc(w_i, w_j) = 1 \,\wedge\, hc(w_j, s) = hc(w_i, s) + 1 \,\right\}

Class F consists of the remaining workers excluding the source s.

F = I \setminus (X \cup Z \cup \{s\})

Victim selection rules

Workers in each class construct their victims set according to the following rules.

Definition 1 (Victims set). Each worker wi ∈ I can steal tasks from a subset of workers called the victims set Vi. Members of each class populate this set differently. It is defined as the union of two independently defined sets Va and Vb. The former is equal for all workers and includes the immediate neighbors (at distance 1). The rules defining the second set depend on the owner’s class, its distance from the source and the value of d.

V_i = V_{ia} \cup V_{ib}

V_{ia} = \{\, w_j \in I : hc(w_i, w_j) = 1 \,\}

V_{ib} = \begin{cases}
  \emptyset & w_i \in F \\
  \emptyset & w_i \in Z,\ d = 2 \\
  \{\, w_j \in Z : hc(w_i, w_j) = 2 \,\} & w_i \in Z,\ d > 2 \\
  \{\, w_j \in X : hc(w_i, w_j) = 2 \,\} & w_i \in X,\ hc(w_i, s) = 1
\end{cases}

It is important to note that when d is 1, classes X and Z conceptually overlap; thus class Z is introduced only when d is 2 or more. Similarly, class F exists only when d is 3 or higher. Finally, in the special case of a one-dimensional topology (all cores placed in a row), F can never be introduced but is also not needed.
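The definitions above can be turned into a compact reference sketch over virtual grid coordinates. The C code below is illustrative only: it classifies class X by axis membership (which coincides with the formal rule on the diamond topology), omits the ordering of Definition 2, and is not the WOOL-DVS implementation of section 5.5.3.

/* Sketch of DVS classification and victims-set construction on virtual
 * coordinates relative to the source, which sits at (0,0). */
#include <stdlib.h>

enum dvs_class { CLASS_S, CLASS_X, CLASS_Z, CLASS_F };

struct node { int x, y; };                   /* virtual position            */

static int hc(struct node a, struct node b)  /* hop count on the grid       */
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

static int dist_s(struct node a)             /* distance from the source    */
{
    return abs(a.x) + abs(a.y);
}

enum dvs_class classify(struct node w, int d)
{
    if (dist_s(w) == 0) return CLASS_S;
    if (dist_s(w) == d) return CLASS_Z;        /* outermost ring            */
    if (w.x == 0 || w.y == 0) return CLASS_X;  /* on an axis, not at max d  */
    return CLASS_F;
}

/* Fill 'victims' with indices of w's victims among 'nodes', per Definition 1:
 * all neighbours at distance 1 (Via) plus the class-specific distance-2
 * extension (Vib). Returns the number of victims found. */
int victims_of(int wi, const struct node *nodes, int n, int d, int *victims)
{
    int count = 0;
    enum dvs_class c = classify(nodes[wi], d);

    for (int j = 0; j < n; j++) {
        if (j == wi) continue;
        int h = hc(nodes[wi], nodes[j]);
        if (h == 1) {
            victims[count++] = j;                         /* Via            */
        } else if (h == 2) {                              /* Vib extension  */
            enum dvs_class cj = classify(nodes[j], d);
            if (c == CLASS_Z && d > 2 && cj == CLASS_Z)
                victims[count++] = j;
            else if (c == CLASS_X && dist_s(nodes[wi]) == 1 && cj == CLASS_X)
                victims[count++] = j;
        }
    }
    return count;
}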

Victims prioritization

The victims set is an ordered set according to the following ordering rules.

Definition 2 (Victim prioritization). We define a partial order P over the victims set Vi of each worker wi in allotment I as follows.

• For members of class X

1. Priority increases inversely proportionally to the victim's distance from s, thus giving higher priority to victims located closer to s than wi.

• For members of classes Z and F

1. Priority increases proportionally to the victim's distance from s, thus giving higher priority to victims located further away from s than wi.

The ordering rules for members of X and of Z ∪ F are opposed: the former prioritize victims closer to the source, while the latter prioritize those further away. These rules allow certain ordering conflicts, as there can be multiple victims at the same distance from the source; these are intentional, as there is no benefit in resolving them one particular way. As mentioned, members of class Z steal primarily from within their class; thus they will exhaustively process a subtree before proceeding to another. In the same spirit, members of F steal primarily from Z, thus they have a higher probability of stealing leaves or small subtrees; also, F will quickly take higher-up tasks from Z, meaning that the latter nodes will mostly be processing leaves. However, F nodes can steal from X too, resulting in stealing large subtrees. DVS was designed this way to overcome the major adaptation issue of choosing nodes to remove without having to monitor their utilization; this property is discussed in section 5.6, which focuses on adaptation. To adequately support that claim, it is important to measure the ratio between victims and thieves per node. Assume a complete diamond-shaped allotment, of maximum distance at least 3 so that all classes are included. Nodes in Z are divided into the peak (top of each quadrant) and the side nodes. A peak node has a total of 3 possible victims, 2 in Z and one in X. A side node has a total of 4 victims, 2 in Z and 2 in F. The same nodes are thieves to them too, with nodes in Z and F primarily stealing from Z. Thus a random Z node will have at least 3 other nodes primarily stealing from it. Let that node steal a task that is the root of a large task subtree; its descendant tasks will quickly be stolen back. If there are more than 3 children, that node will continue with a child, moving down the task tree. If not, it will primarily try to steal from Z or F, where chances are that it will get a task even further down the task tree. In conclusion, even if a member of Z steals a task high up on the task tree, the descendants will quickly be disseminated inward again, leaving that node to work mostly on small subtrees and leaf tasks.
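Definition 2 boils down to a per-class comparison on distance from the source. A minimal sketch of such a comparator follows; it reuses the class notion of the earlier sketch and deliberately leaves ties unresolved, as the text allows.

/* Sketch: order two victims for a given thief class by distance from s. */
enum dvs_class { CLASS_S, CLASS_X, CLASS_Z, CLASS_F };

/* Returns a negative value if a victim at distance dist_a from the source
 * should be tried before one at distance dist_b. Members of X prefer
 * victims closer to the source; Z and F prefer victims further away. */
int victim_priority_cmp(int dist_a, int dist_b, enum dvs_class thief_class)
{
    if (thief_class == CLASS_X)
        return dist_a - dist_b;      /* closer to s first            */
    return dist_b - dist_a;          /* further from s first (Z, F)  */
}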

5.5.3 Implementation

This section presents pseudocode for implementing DVS. It consists of algorithms for constructing each node's victims set per DVS class. The constrained DVS victims sets have a maximum size of 8 elements and an average of 5. Hence the memory complexity of implementing DVS is Θ(N), and O(N^2) when N ≤ 8.

Algorithm 3 DVS Classification: Main function to construct the victims sets of all nodes.
function DVSInitAllotment(Allotment)
    idx ← 0
    DVSInitS(Allotment[0])
    for 1 < i < d do
        lx ← 1
        for 0 < j < d do
            idx ← idx + 1
            if idx == ‖Allotment‖ then return
            DVSInitVI(Allotment[idx], lx, i, False)
            DVSInitX(Allotment[idx], i)
            DVSInitVO(Allotment[idx], lx, i, True)
            lx ← lx + 1
            for 0 < k < i − 1 do
                idx ← idx + 1
                if idx == ‖Allotment‖ then return
                DVSInitVO(Allotment[idx], 0, i, False)
                DVSInitVI(Allotment[idx], 0, i, True)
            end for
        end for
    end for
    lx ← 1, lf ← 0
    for 0 < j < d do
        idx ← idx + 1
        if idx == ‖Allotment‖ then return
        if d == 0 then
            DVSInitVI(Allotment[idx], lx, i, False)
            DVSInitX(Allotment[idx], i)
            lx ← lx + 1
        else
            DVSInitZ(Allotment[idx], d, (lx == 1))
            DVSInitVI(Allotment[idx], lx, d, False)
            lx ← lx + 1
        end if
        for 0 < k < d − 1 do
            idx ← idx + 1
            if idx == ‖Allotment‖ then return
            DVSInitZ(Allotment[idx], d, False)
            DVSInitVI(Allotment[idx], lf, d, True)
            lf ← lf + 1
        end for
    end for
end function

Constructing Victims sets

Implementing the DVS classification comes down to constructing each class' victims set, since it has no other implications. Algorithm 3 presents the primary function, which iterates over all nodes, calling a specialized function for each class. The total iterations of this algorithm are N, thus time complexity is O(N) with Θ(1) space complexity. The input array Allotment is assumed to hold all nodes consecutively, with index 0 being the source. The input array is iterated sequentially and each element is placed on the virtual DVS topology in a spiral sequence, starting north of the source.

Algorithm 4 DVS Outer Victims: Find a node's outer victims.
function DVSInitVO(Node, i, hc, inX)
    v ← Node.idx + (4d + i)
    if v ∈ Allotment and v ≠ Node.idx then
        Node.Victims ← Allotment[v]
    end if
    v ← v + 1
    if v ∈ Allotment and v ≠ Node.idx then
        Node.Victims ← Allotment[v]
    end if
    if inX then
        v ← v + (i == 0 ? 1 : 0) ∗ (4d + i) − 1
        if v ∈ Allotment and v ≠ Node.idx then
            Node.Victims ← Allotment[v]
        end if
    end if
end function

Algorithm 5 DVS Inner Victims: Find a node's inner victims.
function DVSInitVI(Node, i, hc, inForZ)
    v ← Node.idx − 4 ∗ (d − 1) − (d == 1 ? i : 0)
    if v ∈ Allotment and v ≠ Node.idx then
        Node.Victims ← Allotment[v]
    end if
    if inForZ then
        v ← v − 1
        if v ∈ Allotment and v ≠ Node.idx then
            Node.Victims ← Allotment[v]
        end if
    end if
end function

Algorithm 6 DVS source Victims set: Construct the source node's victims set.
function DVSInitS(Node)
    for 1 < v < min(‖Allotment‖, 5) do
        Node.Victims ← Allotment[v]
    end for
end function

Algorithm 7 DVS Core X victims set: Construct the core X victims set extension.
function DVSInitX(Node, hc)
    if hc > 1 then return
    end if
    v ← Node.idx
    loop
        v ← v + 1
        if v > 4 then
            v ← 1
        end if
        if v == Node.idx then
            break
        end if
        if v ∈ Allotment and v ≠ Node.idx then
            Node.Victims ← Allotment[v]
        end if
    end loop
end function

Algorithm 8 DVS Z Class victims set: Construct the Z class victims set extension.
function DVSInitZ(Node, hc, isPeak)
    v ← Node.idx + 1
    Node.Victims ← ∅
    if v ∈ Allotment and v ≠ Node.idx then
        Node.Victims ← Allotment[v]
    end if
    v ← Node.idx + (isPeak ? 1 : 0) ∗ 4 ∗ hc − 1
    if v ∈ Allotment and v ≠ Node.idx then
        Node.Victims ← Allotment[v]
    end if
end function

For every node, functions DVSInitVI (alg. 5) and DVSInitVO (alg. 4) populate the victims set with the node's inner (closer to the source) and outer (further away from the source) victims at distance 1. For classes X and Z there are special extensions with more victims at distance 2; these are found and placed in the node's set by functions DVSInitX (alg. 7) and DVSInitZ (alg. 8) respectively. Finally, function DVSInitS (alg. 6) populates the source's victims set, which includes the 4 adjacent nodes in class X; these nodes are at array positions 1 through 4. All aforementioned helper functions have Θ(1) time and space complexity.

5.5.4 Discussion and Evaluation

This section discusses the beneficial properties and limitations of DVS as commentary to a rigorous evaluation process. For evaluating DVS we have used several applications from various sources, faithfully ported to the WOOL work-stealing scheduler. Among them are FFT, nQueens, Sort and Strassen from the BOTS benchmark suite [52], selected as distinctive and popular examples of specific workloads. Their parallelism profiles range from the fine-grained nQueens, with a wide and balanced tree, to the quite irregular and coarser-grained Strassen. FFT and Sort thrash the caches, with the latter also being irregular. We have also included some micro-benchmarks: recursive Fibonacci (Fib) is embarrassingly parallel and rather finely grained, which makes it scale linearly; Stress strains the runtime by varying the grain size; Skew is an adaptation of Stress producing an unbalanced task tree. nQueens, although highly parallel, comprises multiple tasks of varying granularity, scaling sub-linearly with a small cut-off. We ran our experiments on two different platforms, one simulated and one on real hardware. In the Simics v4.6 simulator we modeled an ideal parallel platform running the Barrelfish OS. Simics is a full system simulator [103]. We have modeled a 32-core, 8x4 mesh topology where each instruction takes one cycle, including memory operations. The simulated model purposefully does not include a memory hierarchy, in order to isolate the behaviour of the estimation algorithms. The simulated results should not be correlated with those on real hardware; they are added to support that performance gains are not due to caches or other hardware-related indeterminism. Barrelfish is based on an interesting design which might be well suited for manycore architectures; it provides scalability and portability [133], major benefits when combined with widely distributed architectures [14], like most tiled manycore prototypes [101, 2, 71]. Also, it has no migration of execution between cores, making worker threads pinned by default, while also allowing programs to execute in true isolation from other processes and even the OS. All system services and auxiliary functions were executed on cores 0 and 1, which were never allotted to our test programs. The test environment of the two schedulers is controlled and the victim selection algorithm is the only difference in implementation; thus it can be argued that victim selection is the only factor responsible for changes in the workload's behaviour. The second platform is based on Linux (v2.6.32) running on real hardware. The architecture is an Opteron 6172 (AMD Magny-Cours) ccNUMA system with a total of 48 cores. There are 4 sockets, holding 2 NUMA nodes each. A node has 6 processing cores and 8GB of locally controlled RAM.



Figure 5.6: Two example allotments on two different architectures, classified as per the DVS rules. (a) shows a 27-worker allotment on an 8x4 mesh grid; all classes are incomplete. (b) shows a 35-worker allotment on an 8x6 mesh grid, with core 28 being the source; classes X and Z are incomplete.

Workload input data sets (BFish = Barrelfish):

Workload   BFish input    BFish cut-off   Linux input    Linux cut-off
FFT        32*1024*512    N.A.            32*1024*1024   N.A.
FIB        40             N.A.            42             N.A.
nQueens    13             3               14             3
Skew       10000,20,1,1   N.A.            10000,44,3,1   N.A.
Sort       32*1024*1024   (2*1024),20     32*1024*1024   (2*1024),20
Strassen   1024,32        64,3            1024,32        64,3
Stress     10000,20,1     N.A.            10000,44,3     N.A.

Figure 5.7: Workload input data sets for Barrelfish and Linux. Input lists the basic parameters, while the cut-off controls the maximum recursion depth; comma-separated values correspond to multiple parameters.

Threads were pinned using pthread affinity while the system was running a minimum of necessary services. Our implementation is based on an established work-stealing scheduler, WOOL. We implemented and evaluated two versions of our scheduler, one being the original WOOL implementation [59] (WOOL-LF), which employs semi-random victim selection. For the second version (WOOL-DVS) we replaced only the victim-selection algorithm with DVS; the structure of the workers, the task-queues, and the spawn, sync and steal operations were left as is. On both platforms we tested allotments of varying sizes: namely 5, 12, 20 or 27 cores for Barrelfish and 5, 13, 24 or 35 cores for Linux. It is important to note that due to the topology's geometry, these allotments are not complete with respect to the classes as defined by DVS. Figure 5.6 visually presents the classification for the largest allotment on each platform. The input dataset used for each program can be viewed in table 5.7. The input field corresponds to basic parameters, while the cut-off controls the maximum recursion depth and has a significant impact on the produced parallelism. We have selected small cut-off values to controllably induce worker starvation and diversify the available parallelism profiles: constraining the number of generated tasks imposes worker idleness and makes adaptation possible. On the Linux platform we used larger inputs to minimize the effect of interference from the operating system and other services, without altering the parallelism profile.

Benefits

The benefits of DVS — assuming a task tree of sufficient depth run by a help-first work-stealing scheduler — are: i) reduced worst first-steal latency, ii) uniform worker utilization, iii) inward load concentration, and iv) same or better performance. An in-depth explanation of these properties requires the formal definition of DVS, available in section 5.5.2; the rest of this section explains the underlying ideas, hinting at the rule-specific implications. The first positive property of DVS is the reduction of the worst first successful steal latency across all workers. The initial phase of executing a task-parallel application is always the same: the master worker has spawned some tasks and all other, idling workers need to discover it in order to steal successfully. In the meantime they will be picking other workers, whose task-queues are empty, resulting in failed steal attempts. Some workers might be lucky (or optimized) enough to find the master worker fast, quickly spawning more tasks and increasing the number of effective victims; at some point in time the allotment converges and all workers have managed to find work. The time a worker spends before its first successful steal is called first-steal latency. The average and worst such latency is an insightful metric for evaluating a victim selection algorithm. Reducing the worst first-steal latency means that all workers get utilized sooner, contributing more to progressing application execution; consequently there is a higher chance for the first task stolen to be the root of a much larger subtree, which in turn provides a worker with more locally spawned tasks to execute. If the available parallelism is much more than the allotment size, these workers will be running sequentially sooner. No stealing translates into immediate performance gains.


Figure 5.8: First successful steal latency, normalized to WOOL-LF (as 100%). FIB, Skew and Stress provide high parallelism very early, favoring random victim selection. DVS reduces the worst latency.

Figure 5.8 shows experimental results of average and worst first steal latency. DVS exhibits considerably worse average results for the very parallel benchmarks (FIB and Stress) while it is on par with RVS for all others. However, the worst first steal latency is measurably smaller with DVS, especially for irregular applications and those that are not highly parallel.


Figure 5.9: Successful steals, normalized to WOOL-LF (as 100%) per allotment size. DVS results in fewer steals with fewer workers; thus workers are able to steal tasks higher on the task tree, running sequentially thereafter (stealing from themselves).

Figure 5.10: Per worker useful execution time for Sort on real hardware. Useful is the aggregated time spent processing and successfully stealing tasks. The first worker is the source, which includes some initialization.

Completely uniform worker utilization for a work stealer means that no time is spent on failed steal attempts. Thus all workers continuously execute application code and terminate at the same time. In that respect, completely uniform utilization is infeasible by definition with either RVS or DVS. The former suffers from work discovery latencies when parallelism is not much more than the allotment size. The latter, DVS, has an inherent latency in spreading tasks across distance zones. DVS does manage to improve the uniformity of worker utilization. To evaluate this, DVS was implemented in an optimized version of the WOOL scheduler (WOOL-DVS) and compared to the original version supporting leapfrogging (WOOL-LF); all code other than victim selection, including the applications, was the same. Figure 5.10 shows the useful utilization of workers for the Sort benchmark, with various allotment sizes; all WOOL-DVS workers spend the same relative amount of cycles executing application code. This result is supported by figure 5.9, which shows that DVS workers performed far fewer steals (median values) for all benchmarks; the results shown are relative to the RVS result, where the amount of total RVS steal attempts with 5 workers per benchmark is set to 100% and all other measurements are relative to it. Figure 5.11 shows the standard deviation of worker useful time for all benchmarks; DVS worker useful utilization is significantly more uniform than with RVS. The uniformity allows inferring the status of all workers by monitoring only a sample.
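The uniformity metric of figure 5.11 can be computed directly from the per-worker useful-cycle counters; the following sketch shows one way to express the standard deviation as a percentage of the median, assuming the counters are already collected (names are illustrative, not WOOL's).

    #include <math.h>
    #include <stdlib.h>
    #include <stdint.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Standard deviation of per-worker useful cycles, as % of the median. */
    static double useful_time_spread(const uint64_t *useful, int n)
    {
        uint64_t sorted[n];
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) { sorted[i] = useful[i]; mean += useful[i]; }
        mean /= n;
        for (int i = 0; i < n; i++) var += (useful[i] - mean) * (useful[i] - mean);
        var /= n;
        qsort(sorted, n, sizeof(uint64_t), cmp_u64);
        double median = (double)sorted[n / 2];
        return 100.0 * sqrt(var) / median;
    }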


Figure 5.11: Standard deviation of useful time among workers as a percentage of the median value, for different allotment sizes (y axis).

As discussed earlier, this is an important requirement for a scalable requirements estimation mechanism. The third property, inward load concentration, is a direct consequence of the victim prioritization rules. As discussed in the relevant subsection, nodes in class Z will primarily be processing leaves, with the bulk of the task-tree being processed by classes X and F.


Figure 5.12: Performance measurements on the two platforms. Each plot shows the execution time in cycles (Y-axis) over several allotment sizes (X-axis).

This property has no consequence toward performance with a fixed pool size; on the contrary, it might even increase the number of steal attempts, increasing the scheduling overhead. However, this overhead has been measured to be immaterial to the overall performance, while it eases adaptation significantly. The final property of DVS is its performance. Figure 5.12 shows total execution time in cycles for the original and DVS implementations of WOOL, for various fixed allotment sizes. For the regular, highly parallel applications (FIB, Skew, Stress) both exhibit identical performance. For more irregular applications, however, DVS improves with larger allotment sizes. One exception is the Sort benchmark on Barrelfish (simulator), which only with 20 workers performed 9% slower. Since this is not mirrored on real hardware, it is attributed to an algorithmic mismatch between said benchmark and DVS scheduling, whose negative effects are mitigated by the introduction of a cache hierarchy. A final discussion point on DVS is the significance of task granularity. Nodes depend on neighboring nodes to spawn new tasks in order to provide them work. If a node is involved in a long-running serial task, no such tasks will be spawned, and the further away nodes will starve. If this happens it can be argued that the available parallelism does not support the number of workers. Let us deconstruct this problem slowly. Assume the trapped node is in Z; then the starving node will be in either Z or F. By construction, Z nodes are immaterial to feeding F nodes, which can start stealing from X. Furthermore, starving Z nodes only add to the claim of reducing the allotment size. Let us assume the starving node is in F. That means that both the Z and X classes have been incapable of providing enough tasks, supporting once again the claim to reduce the allotment size. The same logic applies to starving X nodes. However, what if other nodes have full task-queues located beyond the starving nodes' reach? In the case of fine-grained tasks or tasks producing plenty of child tasks, neighbors will quickly go through them and disseminate them to others. If these stagnant tasks are long-running serial tasks, then with a help-first scheduler DVS could fail in disseminating the contents of said task-queue; a problem a work-first scheduler wouldn't have, as these child tasks could potentially be spawned by different nodes, ending up in different task-queues. In the case of a help-first RVS scheduler, any starving worker would potentially be able to steal those tasks, although at high scale they would have to be really lucky in selecting that one task-queue out of all others.

Limitations

DVS effectively controls the placement of tasks on an allotment to enable easy and accurate estimation of future parallelism, without imposing changes on the programming model. Nevertheless, such control brings forth certain limitations. DVS depends on the recursiveness of the fork/join parallelism model to feed further away workers with tasks. The longer a worker's distance from the source, the higher the recursion depth required before it can steal its first task. Assume a worker a at distance 3. The source starts the main function and spawns some tasks. These are stolen by workers at distance 1 (and 2). Worker a has to wait for a neighboring worker at distance 2 (or 1) to spawn a task before it can make its first successful steal. Thus, for a worker at distance 3 to be able to steal at least a leaf task, the task tree must have a depth of at least 3. This limitation means that DVS would fail with a wide but flat tree. A tree depth of only 1 could utilize workers only up to distance 2 from the source (13 workers in total) due to the fixed virtual topology. However, it can be argued that a completely flat tree has little to gain from an elaborate work-stealing scheduling algorithm altogether. Since all tasks would be generated by the source and placed in its own task queue, a scheduler employing a centralized task-queue (like the OpenMP for-loop construct) would be more effective in both performance and memory. A second limitation of DVS pertains only to the Cilk scheduler, specifically its continuation-passing mechanism. With Cilk, a worker switches execution to the spawned task, leaving the continuation of the parent as stealable by other workers. This means that a worker's task spawning rate is potentially slowed considerably. DVS depends on the fact that a certain task will potentially have more children than its parent, thus being able to feed a larger number of further away neighbors. With Cilk's continuation passing, a worker spawns a single task before it switches to a new task, possibly having a preamble of computation before spawning more children. Although the potential performance penalty is considerable, our experiments (presented in the next section) have shown that performance is similar to the original RVS-based Cilk.

5.6 Palirria: parallelism feedback through DVS

5.6.1 Description

Palirria (Greek παλίρροια, meaning tide) builds upon Deterministic Victim Selection (DVS) to define a method for dynamic adaptation of a work-stealing scheduler's allotment size. It consists of: i) a method for estimating available parallelism, ii) a policy for deciding the suitability of the current number of workers relative to said estimation, and iii) a mechanism for dynamically adjusting the number of worker threads. The rest of this section intuitively describes each of these components; they are formally defined in the next section, followed by implementation algorithms, and then discussion and evaluation in section 5.6.4. The cornerstone of Palirria, like any such mechanism, is its chosen metric for quantifying parallelism potential. Palirria exploits the properties of DVS which enable the use of a worker's task-queue size as that metric. The dependence of Palirria on DVS is not a taxing requirement, since victim selection is an independent component of any work-stealing scheduler. Palirria makes use of the controlled distribution and concentration of tasks that DVS creates to infer the utilization potential of the allotted workers. Based on a predefined threshold and the DVS classification of workers (see figure 5.4), the decision policy is as follows:

• Under-utilized: decrease the allotment, if the size of the task queue of each worker belonging to class Z is 0.

• Over-utilized: increase the allotment, if the size of the task queue of each worker belonging to class X is above a threshold L.

• Balanced: leave the allotment unchanged, if the previous conditions are both false or both true.

Palirria’s decision policy claims that the state of class X is sufficient for inferring potential increase in parallelism. Going back to the DVS ruleset, X is responsible for transferring most of the load away from the source. Thus they steal tasks high up on the task-tree. If those tasks end up being leaves, spawning no other tasks, no other class could have pending tasks that are not leaves. It is important to note the sampling totality of both conditions, that all members of Z or X respectively need to evaluate each positively. For X this includes those nodes directly adjacent to the source. If the source cannot provide large subtrees, there is going to be no more exploitable parallelism. However, class X members having empty task-queues is not a sufficient condition for evaluating lack of parallelism. Contrary if their task-queue’s size is beyond a threshold 80 CHAPTER 5. ADAPTIVE WORK-STEALING SCHEDULING

Figure 5.13: The set of outer victims Oi for each worker wi is quantitatively equal to the set of thieves primarily stealing from wi.

L, overall parallelism must have increased. L is defined as an integer value larger than the number of thieves of an X member node located further away than itself; this set is later defined as Oi for worker wi, visualized in figure 5.13. L is the single configuration variable of Palirria, directly controlling the sensitivity of the algorithm. By definition L is different for each node in X; however, by equating the number of thieves with the number of victims, each worker can set its L value independently of others, relative to its own Oi. Intuitively this means that if a node in X has spawned enough tasks for all of its thieves, but these are not getting stolen, then those thieves are otherwise occupied. Thieves of members of X are primarily members of F, but they primarily steal from Z; that means that Z is able to feed F enough so that they do not need to steal from X. Consequently, these tasks are unnecessarily stacking up in X and more workers could be employed. The first condition speaks toward reducing the size of the allotment due to lack of parallelism. Again, only a subset of all nodes is sufficient to infer such a claim. If all nodes of Z have no potential parallelism (their task-queues are empty), then first, F will not be fed, and second, X has not been able to provide them with large subtrees. Workers in F will turn to X for acquiring more tasks and Z will starve. Hence class Z is redundant and its members can be removed. DVS has been constructed in such a way that members of Z are the first to starve in the absence of sufficient parallelism. This property enables selecting those workers for removal irrespective of the current utilization level of any worker. Upon reducing the size of the allotment, workers will be reclassified, victims sets reconstructed, and quickly the load will be redistributed. If it is not, then parallelism has decreased, urging further shrinkage of the allotment size.
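Each worker can derive its own sensitivity threshold from its outer victims set; the lines below are a minimal sketch assuming a hypothetical worker descriptor that already stores |O_i| (the choice of +1 is the minimum suggested here, not a fixed constant of Palirria).

    struct worker {
        int num_outer_victims;   /* |O_i|, from the DVS classification */
        int L;                   /* per-worker over-utilization threshold */
    };

    /* The threshold must exceed |O_i|; larger offsets make Palirria less sensitive. */
    static void set_threshold(struct worker *w, int offset)
    {
        w->L = w->num_outer_victims + (offset > 0 ? offset : 1);
    }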

Another point is that the condition does not account for the current status of those workers, meaning what kind of task they are currently processing, if any. A non-empty task-queue means that the owning worker is utilized with some task. Other workers might be utilized too. Palirria quantifies future parallelism and thus does not need to know whether the current allotment size was sufficient for the parallelism that was available. If both conditions evaluate to true simultaneously, it means that a sudden lack of parallelism could render Z idle while X is still providing parallelism. Thus, if parallelism levels were to be plotted, such a situation occurs at an inflection point, ending a concave upward portion of the graph. Since parallelism is being increased there is no need to change the allotment size at that time. Finally, both conditions evaluating to false simultaneously means the opposite situation, where X workers produce no more parallelism than Z members. For Z to have stagnant tasks, both Z and F class members are probably utilized enough. This case could signify the end of a parallel section, although there is no evidence that a new parallel section does not follow quickly after. On this scenario's parallelism plot this case would be an inflection point, entering a concave downward part. A first remark on the decision policy is that the source worker and class F members are not considered. Recalling the virtual diamond-shaped topology of DVS, the size of class F expands faster than the X and Z classes as the total allotment size increases. With every allotment size increment, the added nodes to be sampled are the size of the new Z class plus 4 more nodes in X minus the size of the previous Z class. This amount is fixed at 8. Hence as the allotment size increases, the sample size difference becomes less and less significant, until it eventually converges. This results in monotonically reduced overhead and improved scalability. Another remark is that the decision policy includes a well defined neutral state. This means that the allotment size is not unnecessarily changed without evidence of a respective change in parallelism. The size of the task-queue is a value which most work-stealing schedulers have readily available. Since any deque implementation needs to keep pointers to both ends, calculating the size can be just a single subtraction, making the computational requirements of Palirria's decision policy rather small. Moreover, the notion of time needs to be defined as perceived in the model Palirria uses. Theoretically, Palirria's conditions treat the allotment as frozen in time, thus conceptually there is no duration inherent in the model. It is as if a snapshot were taken, which Palirria then analyzes. This concession is necessary since no historical data are being used. For example, it is not a false conclusion if a worker is marked as busy while it is actually just starting a stealing session. Stealing is an unavoidable step of the work-stealing mechanics, and any thief is arguably busy until stealing has completed unsuccessfully. Thus the thieving worker has not become idle yet and any stealable task in its victims' queues is future work, or in other words available parallelism. This brings forth the notion of stealing as a process and of when Palirria treats it as completed. DVS bounds any worker's number of steal attempts to the constrained size of its victims set.
Until this set has been exhausted by one steal attempt on each member, DVS anticipates the possibility of success and a lack of parallelism cannot be argued. Palirria, however, does not monitor the actions of the workers. Instead it checks the amount of available tasks. Since there is no notion of duration accounted for by Palirria's conditions, a non-stolen task corresponds to work that will be performed in the future, by either the owner or a thief. To measure the excess work, Palirria simply counts the number of potential thieves and assumes that all are currently in the process of stealing, regardless of their actual status. If there is more than one task available for each potential thief, then the allotment should be increased. Moreover, this judgment is not made using totals, but locally for certain workers, because of the inability to disseminate tasks freely at any distance from the source. Finally, Palirria's conditions are theoretically applied continuously, as opposed to at a discrete sample. Of course that is infeasible and an interval must be set. Nevertheless, the interval's length and the conditions are not conceptually connected or co-dependent in any way. That means that Palirria is not judging the status for the upcoming quantum, but only a specific instant. This does not affect the validity of Palirria; nevertheless, tuning the interval would affect the overall accuracy the same way sampling frequency affects the quality of sound.

5.6.2 Definition and discussion

The validity of the claim that Palirria infers global status via just a subset of workers must be proved, especially since this subset's growth is inversely proportional to the growth of the total number of workers. This section formally defines parts of Palirria and provides a formal description of its functionality, to help the reader proceed to the proofs of the validity of the decision conditions (Claim 3).

Definitions

Before proceeding with defining and proving specific conditions, it is necessary to include a few more definitions extending those in section 5.5.2. We define the communication distance between two workers wi and wj as the hop-count (hc(wi, wj)); this is the shortest communication path between the physical cores where the two worker threads are located. We define as diaspora (d) the maximum distance between the source worker s and any other worker in its allotment I. Finally, the estimation conditions will be referred to as the Diaspora Malleability Conditions, or DMC for short. A zone is a subset of workers located at the same distance (hc) from the source. A zone's members can be of different classes. Each allotment consists of multiple zones. For any worker w, its inner zone is one hop less, while its outer zone is one hop more. The outermost zone of an allotment includes all workers at distance d from the source. Zones are important because they indirectly define the virtual topology; adding or removing a zone would impose reclassification of workers, while completing a zone (by adding missing workers) would only expand workers' victims sets. Moreover, upon deciding on the reduction of an allotment, the workers removed are those of the outermost zone; since members of the outermost zone are not always of class Z, the zone is the more appropriate unit.

Definition 3. We define a Distance Zone Nc(wi) from wi as the set of nodes at distance c from node wi. Thus:

Nc(wi) = {wj ∈ W : hc(wj, wi) = c} (5.1)

We further define a Zone Nc as the set of nodes at distance c from the allotment's source s. Thus:

Nc = Nc(s) = {wj ∈ W : hc(wj, s) = c} (5.2)

Lemma 1 (Zone size). The total size of each Zone Nc is given relative to the topology’s dimensions m as:

$$\|N_c\| = 2mc^{m-1}, \quad \text{where } 0 \le c \le d,\ m \in \mathbb{N}$$

For DVS's 2-dimensional topology we can simplify this equation as:

$$\|N_c\| = 4c, \quad \text{where } 0 \le c \le d \qquad (5.3)$$

As a direct result there is a formula for calculating the complete size of the thieves set, meaning all nodes but the source.

Corollary 1 (Thieves set size). Given lemma 1, the total number of thieves in allotment Is is:

$$\|T_s\| = \sum_{k=1}^{d} \|N_k\| = \frac{(2d+1)^m - 1}{m}$$

For a 2-dimensional topology this equation is:

$$\|T_s\| = \frac{(2d+1)^2 - 1}{2} = 2d(d+1) \qquad (5.4)$$

Corollary 2 (Allotment size). The maximum number of nodes in allotment Is, with an m-dimensional topology, is:¹

$$\|I_s\| = \|T_s \cup \{s\}\| \qquad (5.5)$$

All workers in an allotment are divided into classes according to their geometrical position in the allotment and in relation to the source. The reader is encouraged to revise the definitions of those classes in section 5.5.2. Furthermore, certain members of a worker's wi victim set Vi are also grouped as Oi ⊂ Via ⊂ Vi.²
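For concreteness, the closed-form sizes above can be evaluated directly; the sketch below computes the 2-dimensional case (for example, a diaspora of d = 5 yields 2·5·6 = 60 thieves and 61 nodes in total). This is only an illustration of equations (5.3)-(5.5), not code from the Palirria prototype.

    #include <stdio.h>

    static int zone_size(int c)      { return 4 * c; }               /* ‖N_c‖, eq. (5.3) */
    static int thieves_size(int d)   { return 2 * d * (d + 1); }     /* ‖T_s‖, eq. (5.4) */
    static int allotment_size(int d) { return thieves_size(d) + 1; } /* ‖I_s‖, eq. (5.5) */

    int main(void)
    {
        for (int d = 1; d <= 5; d++)
            printf("d=%d: outermost zone %d, thieves %d, allotment %d\n",
                   d, zone_size(d), thieves_size(d), allotment_size(d));
        return 0;
    }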

Definition 4. We define the Outer Victims set Oi as the set of the victims of worker wi at distance 1 located in its outer zone.

Oi = {wj ∈ Via : hc(wj, s) = hc(wi, s) + 1}

¹ This formula is a direct result of the topology's geometrical properties and the proof is considered trivial.
² For a formal definition of Vi and Via please see section 5.6.2.

Since members of Oi are at distance 1 from wi, wi is their victim just as they are its victims. Thus Oi includes those workers from whom wi steals but to whom it also provides work. For the remainder of this section we use µ(·) as the typical measure of a set, evaluating to the set's number of members, with µ(∅) = 0.

Definition 5. Task-queue size Qi is defined as the set of tasks placed in a worker’s queue. These are tasks spawned but not yet stolen or processed.

Theorem 1. Each worker wi ∈ I is a victim to as many workers as there are in its own victims set Vi.

Thus we write µ(Vi) and mean both the number of victims of worker wi and the number of other workers that steal from wi.

Proof. Assume any worker wi. Its victims set is Vi = Via ∪ Vib. Each set will be approached individually:

• Set Via includes all workers at distance 1. This rule applies to all workers. It is self-evident that all workers in Via will have wi in their respective Via set due to the same rule.

• Set Vib applies only to workers in Z and a special subset of those in X. Again this set is defined based on distance and contains victims of the same class as the owner of Vib; the same rule applies to all members of Vib, so all workers in this set will also have wi in their respective Vib set.

Definition 6 (Stagnant task). Assume allotment I, with source s and current diaspora value d, and any worker wi ∈ I with queue of tasks Qi and victims set Vi. All tasks in Qi in excess of µ(Vi) are called stagnant. Stagnant tasks correspond to work beyond what is required to satisfy all the thieves that depend on wi.

Definition 7 (Task generation rate ∆Gi). Assume allotment I, with source s and current diaspora value d, and any worker wi ∈ I. Let Gi symbolize the tasks generated by worker wi. Then ∆Gi is the number of tasks spawned between the current timepoint and the moment wi started processing its current task.

Definition 7 above captures the number of new tasks a worker created while still being busy with a task, i.e., the amount of parallelism it made available while still processing that task.

Definition 8. Worker’s parallelism slack Slacki. Assume allotment I, with source s, current diaspora value d and any worker wi ∈ I. Slacki is the amount of stagnant tasks in the task-queue of wi at a specific time-point. It is defined as the worker’s task queue size minus the number of its outer victims. Thus:

Slacki = µ(Qi) − µ(Oi) ≤ ∆Gi

Since wi’s victims are also thieves primarily stealing from wi, Slacki quantifies the minimum amount of parallelism slack available by wi at that instant. It is minimum because of the assumption that all those thieves would be simultaneously idle and stealing from wi. Definition 9. Worker’s work potential N(i). Assume allotment I, with source s, current diaspora value d and any worker wi ∈ I. N(i) corresponds to the potential of a worker to have work and is defined as its own parallelism slack plus the aggregated slack from members of its victims set. X N(i) = Slacki + Slackj

j∈Vi

Notice that a positive Slack denotes an excess of parallelism, while a negative one denotes a lack of it. Furthermore, a positive work potential for the workers of class Z speaks to the workload's utilization stability. In other words, it guarantees that workers in Z will have work. Workers in Z steal first from within class Z. Workers in F also steal first from within Z. Thus the existence of stagnant tasks in Z means that members of Z ∪ F have work. Moreover, if members of X did not have work they would steal from Z or F. Thus the existence of stagnant tasks in Z means their entire innermost zone has enough work. This reasoning can be iterated down to the source, allowing the claim to be generalized to the whole allotment.

Claim 1 (Utilization stability). If N(i) > 0 ∀wi ∈ Z then the allotment is utilized.
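As noted earlier, the task-queue size falls out of the deque's end pointers, so Slack_i and the work potential N(i) of definitions 8 and 9 are cheap to compute; the sketch below assumes a hypothetical array-based deque and worker descriptor, not WOOL's exact layout.

    #include <stdint.h>

    struct deque {
        volatile int64_t top;       /* index of the next task to be stolen */
        volatile int64_t bottom;    /* index one past the owner's last push */
    };

    struct worker {
        struct deque q;
        int num_outer_victims;      /* µ(O_i) */
        struct worker **victims;    /* V_i */
        int num_victims;            /* µ(V_i) */
    };

    /* µ(Q_i): tasks spawned but not yet stolen or processed */
    static int64_t queue_size(const struct worker *w)
    {
        int64_t n = w->q.bottom - w->q.top;
        return n > 0 ? n : 0;
    }

    /* Slack_i = µ(Q_i) − µ(O_i); positive means stagnant (excess) tasks */
    static int64_t slack(const struct worker *w)
    {
        return queue_size(w) - w->num_outer_victims;
    }

    /* N(i) = Slack_i plus the slack of every member of V_i (definition 9) */
    static int64_t work_potential(const struct worker *w)
    {
        int64_t n = slack(w);
        for (int j = 0; j < w->num_victims; j++)
            n += slack(w->victims[j]);
        return n;
    }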

Malleability of Diaspora

This section presents and explains the conditions which can be used for estimating the utilization level of a workload. An increase in workers should be performed when the amount of already produced work is enough to utilize the added resources; when the outermost workers (in Z) are found under-utilized they can be removed without risking performance loss.

Claim 2 (WSC). Workload Status Conditions:

• There is over-utilization when the size of the task queue Qi of each worker wi in X increases beyond L.

∀wi ∈ X : µ(Qi) > L > µ(Oi) ⇒ ∀wj ∈ I : Slackj > 0 (5.6)

• There is under-utilization when the task queue of each worker in Z is empty.

∀wi ∈ Z : µ(Qi) = 0 ⇒ ∀wi ∈ Z : Slacki < 0 (5.7)

µ(Oi) is the number of workers in the outer zone that can steal tasks from wi and is different for every worker. L is theoretically bounded below by µ(Oi); using different values for L (like µ(Oi) + 1, but not a constant) can tune the tolerance of the model. Conceptually

L guarantees that, at that moment, there is enough work for the new workers to steal immediately. If these tasks are leaves, the allotment will most probably shrink in the next quantum. If not, the slack will be distributed outwards very fast, generating new stealable tasks further from the source.

Claim 3 (DMC). Diaspora Malleability Conditions: the following conditions evaluate the WSC to decide whether the allotment size should be changed.

• Increase the diaspora value, and in effect the allotment size, when:

d+ ⇔ (∀wi ∈ X : µ(Qi) > L > µ(Oi)) ∧ (∃wj ∈ Z : µ(Qj) > 0) (5.8)

• Decrease the diaspora value, and in effect the allotment size, when:

d− ⇔ (∀wi ∈ Z : µ(Qi) = 0) ∧ (∃wj ∈ X : µ(Qj) ≤ L > µ(Oj)) (5.9)

• Balanced allotment size exists when:

d= ⇔ ¬ ((∀wi ∈ Z : µ(Qi) = 0) ⊕ (∀wj ∈ X : µ(Qj) > L > µ(Oj))) (5.10)

DMC validity on an ideal model

Before proceeding with formally proving the claims above, it is better to discuss certain implications regarding how DVS distributes tasks. For clarity, an ideal model is used first, where the task tree is complete, steal attempts are instantaneous, non-leaf task granularity is homogeneous, and parallelism potential is at least 3. Although this model is close to ideal, it does cover a multitude of real application patterns. Nevertheless, the model is extended in the next section with several properties accounting for possible irregularity.


Figure 5.14: Workers of class Z can be characterized as peak and side, based on their location relative to the topology’s diamond shape.

The DMC check the status of X and Z, disregarding class F. The latter primarily steals from Z, so one might argue that its members have the greatest probability to steal leaves. However, workers in class F will have reverted to other victims (potentially in X) well before Z workers have spawned stealable tasks. For Z to spawn stealable tasks it is required that d nesting levels of parallelism have been exposed. To elaborate, the root spawns tasks that feed class X; it takes d − 1 members before an X worker neighboring Z will spawn tasks, resulting in Z stealing tasks that are at depth d in the task tree. Let us assume this process takes d − 1 time units to complete, since task granularity is homogeneous. Z is executing tasks after d time units. At that time the Z workers which have tasks are the three comprising a peak of the diamond shape (fig. 5.14); focusing on the two side workers between them, they each have one F worker neighbor which is also as far away from the source as F can be. These F workers will then be able to steal a task at a depth of d + 1; in other words, at least d + 1 time units have passed before they can execute a task stolen from Z. It takes much longer for F workers close to the source to steal tasks through that path.

Corollary 3. It will be at least d + 1 time units before an F class worker executes a task stolen from a Z class worker.

Workers in F also steal from X and from workers within F. Considering that Z won't be able to feed tasks to F until after d time units, workers at the side boundaries of F will have stolen from X before that, with a slightly smaller probability than X workers would, since they attempt X victims only after having failed at Z and at F in further away zones, as defined by the DVS rules. The F worker that is the furthest away from the source is at distance d − 1. That means that it takes at least d − 1 units for a worker in F to execute a task stolen from X without passing through Z. Thus, considering the task-tree, even the furthermost worker in F will have a larger sub-tree to work with.


Figure 5.15: DVS imposes distance-based constraints on the distribution of tasks. In the ideal model, Z workers need at least d recursion levels to get a task; F workers need d − 1.

Corollary 4. After at least d − 1 time units all F workers are executing a task.

Corollary 5. Workers in Z have the highest probability of getting the smallest subtrees of any class.

The worker to receive a task last could therefore be a Z worker at the sides of the diamond. It was shown that the peak-side Z workers need to wait for a task at depth d. Thus their non-peak Z neighbors would steal a task at depth d + 1 from them, exactly as F would; if a Z quadrant comprises more than one non-peak worker, the time units required for them to steal a task through Z grow linearly. These workers however won't remain idle. Once the furthermost F workers spawn tasks, they will be stealing from them, since Z does steal from F, just not primarily. It was shown that the furthermost F worker will be executing its first task at depth d − 1. Thus all non-peak Z workers have the opportunity to steal a task at depth d, which takes the same number of time units as for the peak Z workers.

Corollary 6. All workers in Z have to wait the same d time units before their first successful steal.

At this point there are different outcomes depending on the total depth of the tree:

• If the depth of the tree is d, then all Z workers are executing leaves. All F workers will execute non-leaf tasks.

• If the depth of the tree is larger than d, then all workers get a non-leaf task.

• If the depth of the tree is d − 1, then Z workers do not get a task and they are the only idling workers.

• If the depth of the tree is d − 2, then Z workers and the furthermost F and X workers (part of zone Nd−1) do not get a task. Zone Nd−1 would become the Z zone if the actual Z workers were removed.

Corollary 7. If the depth of the tree is d − n, n ∈ [1, d), then all workers in zones Nj, j ∈ (d − n, d], cannot steal any task.

Corollary 7 exemplifies the meaning of zones and why Palirria changes the allotment size in steps of full zones. It shows that Z is the first zone to be under-utilized due to lack of parallelism. The analysis also showed that X is the class of workers to acquire the largest subtrees. Thus it is intuitive to conclude that if Z workers have empty task-queues, the execution of the task tree has already reached its final depth and parallelism will start to shrink. If however all X workers still have enough tasks for all the thieves they feed, there is probably some fluctuation which could change shortly; the decision is to do nothing. If X workers do not have enough tasks in their queues, the previous conclusion is probably valid and the allotment size should shrink. If Z workers have tasks in their task queues, the task tree depth is larger than d and, since F workers are already occupied with non-leaf tasks, the allotment size should increase.

DMC validity on a realistic model

Task-trees are seldom complete, with no fixed structure on task parallelism potential (number of children) nor on granularity; realistically it should be assumed that tasks have differing granularity and arbitrarily many children. Furthermore, steal attempts are not instantaneous, nor even uniform in duration due to NUMA. A realistic model for validating Palirria cannot make any assumptions on these characteristics. Nevertheless, one part of the ideal model analysis still holds, namely that each class and zone can steal tasks only after a certain task tree depth. A difference of the non-ideal model, however, is the absence of a uniform sequence of events and of time unit homogeneity. Varying granularity can cause delays and force alternate stealing paths for workers. An unbalanced tree could also cause local starvation. This section focuses on these issues and analyzes the validity of Palirria with respect to each and all of them. The purpose of this analysis is to show that even though certain irregularities could potentially create false positives for Palirria, the eventual changes to allotment sizes are not detrimental to the completion of the program and do not negatively affect performance. The individual cases to be investigated are long-running non-leaf tasks, long-running leaf tasks, and unbalanced task trees where leaves can occur at any depth. In the first case, a worker P might be consumed in a sequential computation part of a long-running non-leaf task. The spawning of child tasks could happen arbitrarily during its execution: spread across uniform or non-uniform intervals, or clustered close to the beginning or end of the execution. If the children are spawned early there is no issue for the allotment, since the parallelism potential is immediately available to other workers. This scenario is unaffected by allotment changes; shrinking by Palirria does not preempt workers: they are forbidden to steal but can finish their current task and their own task queue. If tasks are spawned after the time necessary for all thieves dependent on P to become idle, those thieves will be required to find alternative victims. This diversion might require these workers to get tasks much deeper in the task tree than they would normally get, according to the distance-dependent scheme of DVS. Thus at this point a distinction has to be made on the task-tree depth. Let P be at distance k < d from the source. If the task-tree depth is d or less, then there is a high probability that the thieves dependent on P will starve; we will come back to this scenario later on. If it is d + x, then every such worker at distance n ∈ [2, d] from P, on P's axis from the source, would get a subtree of depth sd = (d + x) − (k + n + 2), where sd ≤ d + x and k + n < d. This formula predicts starvation when it evaluates to zero or less. Assuming d is a known constant and solving the system gives:

$$\begin{cases} k + n + 2 = d + x \\ k + n < d \end{cases} \;\Rightarrow\; \begin{cases} k = d + x - n - 2 \\ d + x - n - 2 + n < d \end{cases} \;\Rightarrow\; x < 2$$

Thus there would be localized starvation only if the task-tree depth is d + 1. This starvation would affect only the portion of class Z that is on the axis of the source and P. The rest of Z would find only leaves, due to corollary 6 which still applies. Thus the under-utilization WSC will evaluate to true. Since the depth is d + 1, an X worker at distance 1 will spawn a subtree of depth d − 1. The furthermost X worker wi will spawn a subtree of depth 1, i.e. the leaves that feed the Z class workers neighboring it. If the number of children is enough to evaluate the over-utilization WSC positively, the DMC evaluating to true will be the balanced state and no new workers will be added. If the number of children

is a multiple of µ(Oi), the allotment will be balanced and utilized. If not, the over-utilization WSC will immediately evaluate to false; combined with the under-utilization WSC, which as discussed is always true, the decrement DMC will fire and Z will be removed. Every DMC evaluation round thereafter will result in reducing the allotment size. Conceptually this is exactly what should happen, since a d + 1 task tree is enough to utilize the outermost zone with only one recursion level. The discussion above has assumed a uniform tree (not necessarily complete) where every branch has the same depth. If the tree is non-uniform the logic is the same as before, except that a stolen task may offer no parallelism potential at any point of its execution. Suppose a certain worker P, anywhere but in class Z, happens to execute a leaf. Workers on the axis of P and the source will behave as in the previous case. Nevertheless, all other workers can find stealable tasks from other victims, with the same parallelism potential. Conclusively, the aforementioned situation can produce starvation only if the task-tree depth is at most d + 1. Considering realistic numbers, a typical parallel application would need a depth of at least 5 to see benefits from work-stealing, as opposed to flat parallelism and a centralized queue. With DVS, a d value of 5 translates into 61 nodes. The real-life applications that have been shown to scale their speedup to that many cores are embarrassingly parallel applications that generate millions of tasks and quite deep task trees. Even with traditional random-based work-stealing, many applications do not scale well [52, 65]. Thus having a requirement in the programming model that connects scale with task-tree depth is not unreasonable. It could also be viewed as a useful tool.

Corollary 8. A task-parallel application using a DVS-supporting runtime, should have a maximum diaspora value of "task-tree depth - 2".

However, one important aspect has been overlooked: that of the tree width. So far it has been presumed that tasks spawn enough children to feed thieves at all zones. This is, however, not a necessity and depends on the width of the task-tree at the respective depth. This width may or may not be uniform. A non-uniform width could very easily cause localized starvation. If the overall tree depth is also borderline for utilizing the allotment, this starvation could be irreparable. Thus corollary 8 can be strengthened further:

Corollary 9. A DVS zone Nc cannot be fully utilized if the task tree at the respective depth c has width less than the zone's size 4c (from lemma 1).

Conclusively, effective use of DVS, and in effect of Palirria, depends on constraints over the amount and recursive depth of the application's parallelism. Nevertheless, only those cases involving quite small amounts of both are problematic, as discussed above; it should also be emphasized that small parallelism recursion depth and small task parallelism potential affect randomness-based work-stealing methods too; on one hand DVS makes the distribution of that potential deterministic, on the other hand the issues with lacking depth are irreparable. Randomness-based victim selection (RVS) can theoretically leverage increased parallelism potential despite lacking depth through its unconstrained stealing reach. However, experimental evaluation showed DVS outperforming those techniques due to its much faster and deterministic distribution. With DVS fewer workers can be utilized when the tree is shallow; however, all utilizable workers will receive the same or a larger subtree than with RVS, reducing the amount of stealing.

Time Related Considerations

Palirria and its conditions are meant to be applied periodically. Ideally they should be applied after every successful steal and spawn operation. This is utterly impractical, so the period length has to be relaxed to something much larger and more manageable. The model as described above makes two assumptions: i) that every task's execution time is strictly longer than any single steal operation, and ii) that every task's execution time is at least equal to the duration of as many steal attempts as there are members in the largest victims set in the allotment. For DVS the largest possible victims set size is 6, and a successful steal attempt takes more cycles than a failed one. On average, 6 steal attempts where the last one is successful would take from a few hundred to a few thousand cycles at most. Thus the aforementioned constraints are reasonable in defining a lower bound for Palirria's periodicity. An upper bound is arbitrary and can be set as tightly to the lower bound as the platform supports and the overall overhead constraints permit. The reasoning behind the assumptions above is that for the DMC to hold there must be enough time between invocations to allow a change in the status. Because the conditions include no notion of time, invocation within less time than the duration of a single task or a steal session could potentially produce a series of conflicting results, each negating the previous one.

Proof of Worker Increase Condition's Validity

To prove the validity of condition (5.8) it is required to prove its necessity and sufficiency. The first part is straightforward: show that when the condition is true, there is indeed an increase in workload sufficient to utilize one more zone of workers. For the following proofs, tree depth is considered uniform but individual task parallelism potential, and thus tree width, are not. The existence of a positive parallelism slack available to each worker can be expressed as follows:

µ(Qi) > L > µ(Oi), ∀wi ∈ X ∧ ∃wj ∈ Z : µ(Qj) > 0 ⇒ ∀wk ∈ I : Slackk > 0

(part 1: sufficiency).

Assume µ(Qi) > µ(Oi) ⇒ Slacki > 0 ∀wi ∈ X

⇒ ∆Gi > µ(Oi) ∀wi ∈ X

Now assume that: ∀wj ∈ I \ X ⇒ Slackj ≤ 0

⇒ ∆Gj ≤ µ(Oj) ∀wj ∈ I \ X

Moreover, since all X workers have excess tasks, it can be assumed (by depth uniformity) that F workers at the same distance have an equivalent amount. This covers all F workers. Thus all non-Z workers are busy or have the potential to get work from neighbors. The initial condition needs at least one worker in Z to have a non-empty task-queue. That means that at least one Z worker has received a subtree of depth larger than 1. Due to the tree uniformity property, all Z workers then potentially have the same subtree available. In effect, the stagnant tasks in the Z class are not required to feed non-Z workers. Moreover, Z workers are already busy. Consequently, more workers directly neighboring the current Z class/zone could be utilized at that exact instant.

However, the condition requires µ(Qi) > µ(Oi); thus at least one more stagnant task is required to justify an allotment increase. Practically, it is left up to the implementer of this model to decide on a value for the L threshold, with a minimum of µ(Oi) + 1. The second and last part of the proof deals with the necessity of the condition. In other words: Slackj > 0, ∀wj ∈ I ⇒ ∀wi ∈ X : µ(Qi) > L > µ(Oi) (part 2: necessity).

∀wj ∈ I : Slackj > 0 ⇒ ∀wj ∈ I : ∆Gj > µ(Oj)

⇒ ∀wj ∈ I : µ(Qj) = ∆Gj > µ(Oj)

⇒ ∀wi ∈ X : µ(Qi) > µ(Oi)

Proof of Worker Decrease Condition's Validity

To prove the condition for decreasing an allotment, it must be shown that it verifies the absence of excess workload. (part 1: sufficiency).

Assume ∀wi ∈ Z : µ(Qi) = 0 ⇒ ∆Gi ≤ µ(Oi)

⇒ ∀wi ∈ Z : ∀wj ∈ Vi : ∆Gj ≤ 0

⇒ ∀wi ∈ Z : Slacki < 0

Finally, it must be shown that the absence of workload is concentrated in the members of Z. (part 2: necessity).

∀wi ∈ Z ⇒ ∀wj ∈ Vi : ∆Gj = 0

⇒ ∀wi ∈ Z : µ(Qi) = ∆Gj − µ(Vi)

However, lack of excess tasks means that Slacki < 0 thus:

∀wi ∈ Z : Slacki < 0 ⇒ ∆Gj < µ(Vi)

⇒ (since µ(Qi) ≥ 0)  ∀wi ∈ Z : µ(Qi) = 0

5.6.3 Implementation

This section presents pseudocode for implementing Palirria. It assumes the existence of DVS, where each worker thread has a victims set as per the DVS classification. The Palirria implementation consists of an algorithm for evaluating the DMC; implications of altering the allotment size are also discussed. The latter actions are highly dependent on the work-stealing scheduler, thus the points touched upon are kept at a high level.

Computing DMC

Algorithm 9 Palirria Decision Policy: Check DMC and change allotment respectively.
1: ParallelismIncrease ← True
2: ParallelismDecrease ← True
3: for all xi ∈ X do
4:   if |TaskQueuei| < |Oi| then
5:     ParallelismIncrease ← False
6:   end if
7: end for
8: for all zi ∈ Z do
9:   if |TaskQueuei| > 0 then
10:     ParallelismDecrease ← False
11:   end if
12: end for
13: if ParallelismIncrease ⊕ ParallelismDecrease then
14:   c ← GetLastZoneSize(d)
15:   m ← GetLastZoneMissingPercentage(d, c)
16:   if ParallelismIncrease then
17:     IncreaseAllotment(c + (m > 0.5 ? ‖Nd+1‖ : 0))
18:   else
19:     DecreaseAllotment(c + (m > 0.5 ? 0 : ‖Nd−1‖))
20:   end if
21: end if
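The decision policy above maps directly onto a small C routine; the following is a hedged sketch assuming hypothetical accessor names (queue_size, per-class worker arrays) rather than WOOL's actual interfaces, and it returns the decision instead of resizing the allotment itself.

    enum dmc_decision { DMC_BALANCED, DMC_INCREASE, DMC_DECREASE };

    struct worker;                                     /* opaque worker descriptor */
    extern int queue_size(const struct worker *w);     /* |TaskQueue_i| */
    extern int outer_victims(const struct worker *w);  /* |O_i|         */

    /* Evaluate the DMC over class X and class Z members (algorithm 9). */
    enum dmc_decision evaluate_dmc(struct worker *const *X, int nx,
                                   struct worker *const *Z, int nz)
    {
        int increase = 1, decrease = 1;

        for (int i = 0; i < nx; i++)        /* over-utilization: all X above threshold */
            if (queue_size(X[i]) < outer_victims(X[i]))
                increase = 0;

        for (int i = 0; i < nz; i++)        /* under-utilization: all Z queues empty */
            if (queue_size(Z[i]) > 0)
                decrease = 0;

        if (increase == decrease)           /* both or neither true: balanced state */
            return DMC_BALANCED;
        return increase ? DMC_INCREASE : DMC_DECREASE;
    }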

Algorithmically checking the utilization status of an allotment means evaluating the DMC against it. Algorithm 9 sketches out the process, which can be split into three parts. The

first part checks the task-queue size of the members of class X. This step comprises at most 4d iterations, if all zones are complete. The second step evaluates the task-queue size of the members of class Z. This step comprises at most 4d iterations if zone Nd is complete, according to equation (5.3). Hence the total number of iterations is linear in d. Consequently the algorithm has O(√N) time and Θ(1) space complexity, since the following holds for the total size of the allotment:

$$2d(d+1) + 1 > 4d, \quad \forall d \in \mathbb{N}$$

Palirria changes the allotment's size based on zones. Upon incrementing the allotment, if less than half of the outermost zone (class Z) is currently allotted, the added nodes will be those needed to complete said zone; if more than half is already allotted, the added nodes will extend into the next zone as well. Similarly, upon decrementing, if less than half of the outermost zone is allotted, both the incomplete zone and the immediately inner zone are removed. As an allotment scales up, the number of nodes sampled by DMC increases by a constant maximum step of 8 iterations, whereas the maximum increase in allotment size is 4d + 1 (as per equation (5.3)). The ratio of the sample's size over the total allotment's size, $\frac{8d}{2d(d+1)+1}$, is monotonically decreasing. Hence as the allotment scales up, the sample's size increases less and less; at infinite scale, the sample's relative size approaches 0.
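The zone-granular resize rule can be expressed compactly; the sketch below computes how many workers to request or release given the current diaspora d and how many nodes of the outermost zone are already allotted (function and variable names are illustrative, not the prototype's).

    /* ‖N_c‖ = 4c for the 2-D topology (equation 5.3). */
    static int zone_nodes(int c) { return 4 * c; }

    /* Workers to add when the DMC says "increase": complete the outermost
     * zone, and if more than half of it was already allotted, take the
     * next zone as well. */
    static int workers_to_add(int d, int allotted_in_outer)
    {
        int missing = zone_nodes(d) - allotted_in_outer;
        int more_than_half = allotted_in_outer * 2 > zone_nodes(d);
        return missing + (more_than_half ? zone_nodes(d + 1) : 0);
    }

    /* Workers to remove when the DMC says "decrease": drop the (possibly
     * incomplete) outermost zone, and if less than half of it was allotted,
     * drop the inner zone too. */
    static int workers_to_remove(int d, int allotted_in_outer)
    {
        int less_than_half = allotted_in_outer * 2 < zone_nodes(d);
        return allotted_in_outer + (less_than_half ? zone_nodes(d - 1) : 0);
    }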

Increasing an allotment

The action of adding more worker threads is straightforward and in all respects the same as the initial bootstrapping of the allotment. The necessary steps are listed below:

1. Allocate per thread descriptor structures and populate with necessary values, except the victims set.

2. Reclassify the whole allotment, placing new workers as nodes in the virtual topology and reconstructing their victims sets as per the DVS rules and algorithms presented in section 5.5.3.

3. Start new workers.

The first action is part of the work-stealing scheduler's stock bootstrapping process and is beyond the scope of this report. Since new worker threads are added, the victims sets of all DVS nodes need to be updated by executing the algorithms from section 5.5.3. This process can be optimized according to the following cases:

• If the added nodes complete the existing outermost zone Zd, only a few nodes are affected; in the worst case this includes up to the 4d nodes of the outermost zone and the 4(d − 1) nodes of the previous-to-last zone. In total that is 8d − 4 reclassifications. It can be shown that this is a monotonically decreasing fraction of the total number of nodes for diaspora values greater than 1.

Proof. Using Lemma 2 we need to show that if $(8d - 4) = \lambda(2d(d+1) + 1)$ and $\lambda = f(d)$, then $1 > f(d_i) > f(d_j) > 0,\ \forall d_j > d_i > 1,\ d_j, d_i \in \mathbb{N}$.

– Let $d_2 = 2$; then $\lambda = \frac{12}{13} < 1$, and the initial hypothesis is true.

– Assume that for $d_k > d_2$ it is true that $f(d_2) > f(d_k) \Rightarrow \frac{12}{13} > \frac{8d_k - 4}{2d_k(d_k+1)+1}$.

– For $d_{k+1} = d_k + 1 > d_k$ we have that

$$f(d_{k+1}) = \frac{8d_{k+1} - 4}{2d_{k+1}(d_{k+1}+1)+1} = \frac{8(d_k+1) - 4}{2(d_k+1)(d_k+2)+1} \;\Rightarrow\; f(d_{k+1}) < \frac{8d_k - 4}{2d_k(d_k+1)+1}$$

$$\Rightarrow f(d_{k+1}) < f(d_k)$$

• If the addition introduces a new zone Zd′, the added nodes can be up to 6d′ − 3³ and the preexisting affected nodes are 6d − 10 as before, for a total of 12d − 13. This is again a monotonically decreasing fraction of the total nodes. The proof is similar to the previous case and thus omitted.

Algorithm 3 can be adapted to iterate over only selected consecutive nodes, as shown below in algorithm 10.

Decreasing an allotment

Decreasing an allotment is not a straightforward process. The primary issues are: i) the selection of the workers to be removed, and ii) how to handle their task-queues if non-empty. DVS preemptively handles both issues. Nodes in the outermost zone are kept utilized just enough, thus being the first to starve in the absence of sufficient parallelism. Moreover, the allotment is decreased if and only if the task-queues of all nodes in the outermost zone are empty; thus there is a good chance they won't have the opportunity to populate them with tasks between evaluating the DMC and enforcing removal. The Palirria prototype employs what we call lazy worker removal. Workers labeled for removal (removed-workers) are forbidden from stealing, but are allowed to empty their task-queues and to be stolen from. The first time a removed-worker becomes idle, it immediately shuts down by blocking on a condition variable. This state of idleness does not include the situation of having to sync a stolen task that is still executing. Removed-workers will then pick a task from their own queue, busy wait, or poll the state of the task to be synced until it is

³ There are at most 2d′ − 1 nodes from the incomplete Zd′−1 zone and up to 4d′ nodes in the new zone Zd′.

Algorithm 10 Incremental DVS Classification: Adapted function to construct the victims sets of nodes from some zones. Code segments identical to algorithm 3 are omitted.
1: function DVSINITALLOTMENT(Allotment, StartZone, EndZone)
2:   idx ← 2·StartZone·(StartZone − 1) + 1
3:   idxFinal ← 2·EndZone·(3·EndZone − 1)
4:   DVSInitS(Allotment[0])
5:   for StartZone < i < EndZone do
6:     ...
7:     if idx > idxFinal or idx ∉ Allotment then return
8:     ...
9:   end for
10:   lx ← 1
11:   lf ← 0
12:   for 0 < j < d do
13:     ...
14:     if idx > idxFinal or idx ∉ Allotment then return
15:     ...
16:   end for
17: end function

ready. There are many ways to implement this adaptive part and Palirria does not require any specific method. Lazy removal's duration might surpass the interval of evaluating the DMC. Also, the allotment is not reclassified until all removed-workers have blocked; hence subsequent DMC evaluations could be performed on blocked nodes. In case the under-utilized result persists, the selection of removed workers will be the same, having no negative effect. If the result of evaluating the DMC changes to balanced it is discarded; the reasoning is that events could overlap, producing a balanced state because workers blocked, which was the desired effect of decreasing the allotment size. Palirria does not spend cycles monitoring the removal process; thus, not knowing whether all removed-workers have blocked, it discards the new balanced DMC evaluation. If the result is over-utilized, then the state of the allotment has changed or too many workers were removed. At that point lazy removal is reversed: removed-workers are unlabeled and blocked ones are restarted. No new workers are added to the allotment, though. A method for removing workers is not strictly part of the requirements estimation mechanism. However, it is a necessary part of any adaptive work-stealing scheduler. The task-queue is commonly on the stack, thus the owning thread cannot do much with it without considerable cost. Cilk++ and Cilk Plus, on the other hand, implement detachable task-queues, which can be passed along to other workers. This feature would enable immediate worker removal. This project has not explored this implementation direction, although the Palirria method can fully support it.
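Lazy removal reduces to a flag checked in the worker's main loop plus a condition variable to park on; the sketch below is one possible shape of that loop, with invented names, and is not the WOOL/Palirria prototype code (it also omits the case of waiting on an outstanding sync).

    #include <pthread.h>
    #include <stdbool.h>

    struct worker {
        volatile bool removed;     /* set by the scheduler on a decrease decision */
        bool blocked;              /* true once the worker has parked itself */
        pthread_mutex_t lock;
        pthread_cond_t resume;     /* signalled if the removal is reversed */
    };

    extern bool pop_own_task_and_run(struct worker *w);  /* drain local queue */
    extern bool steal_and_run(struct worker *w);         /* regular stealing path */

    static void worker_loop(struct worker *w)
    {
        for (;;) {
            if (pop_own_task_and_run(w))
                continue;                    /* always finish the local queue first */
            if (!w->removed) {
                steal_and_run(w);            /* normal operation: go steal */
                continue;
            }
            /* removed and idle: park until the scheduler resumes or reuses us */
            pthread_mutex_lock(&w->lock);
            w->blocked = true;
            while (w->removed)
                pthread_cond_wait(&w->resume, &w->lock);
            w->blocked = false;
            pthread_mutex_unlock(&w->lock);
        }
    }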

5.6.4 Discussion and Evaluation

Evaluation of Palirria was conducted using the same benchmarking applications, on the same Barrelfish and Linux platforms, running on the same Simics simulator and real hardware as DVS; all details, including workload input data sets, are discussed in section 5.5.4. Our evaluation strategy is based on a comparison between three implementations. First is the original non-adaptive version of the runtime scheduler (referred to as WOOL), using different fixed sets of workers. The same scheduler is augmented with each of the resource estimation algorithms, ASTEAL and Palirria, as described in their respective sections. WOOL is used as a baseline for all comparisons, revealing the unavoidable overhead introduced by any resource estimation algorithm and the adaptation mechanism, and providing a baseline for the expected performance. Additionally, we have implemented ASTEAL over DVS victim selection, referred to as DSTEAL in the evaluation. Via DSTEAL we aim to further isolate the accuracy of the adaptive algorithm between ASTEAL and Palirria. DSTEAL's base scheduler is the same as Palirria's (WOOL with DVS victim selection), although the adaptive algorithm is that of ASTEAL. The operating system's scheduler removes and adds workers in sets. These sets are the zones as defined in section 5.6.2; they are of fixed size, so the total number of workers scales sequentially between them. With ASTEAL, DSTEAL and Palirria, each application starts with the minimum set of 5 workers (zone 1 plus the source) and is left to adapt automatically. Using WOOL, we executed each application with multiple fixed allotment sizes, namely 5, 12, 20 and 27 workers for the 32-core simulated platform and 5, 13, 24, 35, 42 and 45 workers on the 48-core real machine. The OS scheduler always satisfies allotment increment requests, up to the total number of cores available. The absence of external constraints on the expansion of the allotment evaluates the ability of the adaptive algorithm to recognize the existing parallelism while also conserving resources. We perceive these properties as fundamental and necessary to evaluate before investigating true multiprogramming scenarios. The adaptive implementations (ASTEAL, DSTEAL, Palirria) were started with 5 workers and left to expand automatically. Highly parallel workloads like FIB lose performance this way. However, this strategy can be beneficial if resource conservation is the primary goal. To maximize performance such workloads can start with a big allotment and shrink automatically. Note that due to the topology's geometry, most of the larger allotments are not complete with respect to the zones and classes as defined by DVS. Figure 5.6 visually presents the classification of a 4-zone allotment on each platform. Since work-stealing schedulers don't include a master thread, we have implemented a helper-thread mechanism to execute the estimation and decision algorithms asynchronously to the workers. Each application has a single such thread, invoked at a regular interval. These helper threads are pinned to core 1, where no worker is running. With this design, changes in the allotment happen in the background with minimal interruption of the actual execution. Note that the data used by these algorithms are generated by the workers themselves, constituting overhead to the execution.
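The helper-thread arrangement can be sketched as a pinned pthread that wakes up at a fixed interval and runs the estimation and decision step; the names and the 2 ms period below are illustrative choices, not the values used in the experiments.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <time.h>
    #include <stdbool.h>

    extern int  evaluate_dmc_and_decide(void);   /* e.g. the DMC routine sketched earlier */
    extern void apply_decision(int decision);    /* grow or shrink the allotment */
    extern volatile bool app_running;

    static void *palirria_helper(void *arg)
    {
        (void)arg;
        cpu_set_t set;                           /* pin to core 1, away from the workers */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        const struct timespec period = { .tv_sec = 0, .tv_nsec = 2 * 1000 * 1000 };
        while (app_running) {
            apply_decision(evaluate_dmc_and_decide());
            nanosleep(&period, NULL);            /* evaluation interval */
        }
        return NULL;
    }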

[Figure 5.16 plot area; performance measurements on Barrelfish (simulator): one panel row per benchmark (FFT, FIB, nQueens, Skew, Sort, Strassen, Stress); series ASTEAL, DSTEAL, Palirria, WOOL; columns show normalized execution time, % wastefulness, and worker count over time.]

Figure 5.16: Results for fixed 5, 12, 20, 27 workers, and self-adapting schedulers. (a) Total execution time normalized to the first column. (b) Average percentage of wasted cycles. (c) Adapted number of workers over time.

[Figure 5.17 plot area; useful time (normalized to WOOL-27) on Barrelfish (simulator): one panel row per benchmark (FFT, FIB, nQueens, Skew, Sort, Strassen, Stress); columns WOOL-27, ASTEAL, DSTEAL, Palirria; X axis: workers.]

Figure 5.17: Useful (blue bar bottom) and non-useful (red bar top) normalized time per worker ordered by zone. Useful are the cycles spent successfully stealing and processing tasks. Plots per application (row) are normalized to the first bar (source worker) of the corresponding WOOL-27 run.

Also, core 0 is reserved for the system scheduler which controls thread creation and destruction. Because thread creation is expensive and dependent on the threading library used, we have chosen to remove thread control from the applications, to make comparison with fixed-allotment runs fair. The system scheduler provides a pool of threads in advance, one per core, and is also responsible for pausing and resuming them when an application is adapting. Thus, initialization time and thread creation are excluded from all time measurements. Finally, the results reported are from the second-best run out of 10. A single run produces a large set of correlated, codependent numbers; averaging those across runs would not be meaningful on real hardware and could distort conclusions where differences are small. On the simulator, results are always the same down to the last cycle; that is due to the complete absence of a memory hierarchy, which eliminates any indeterminism and showcases the algorithm. On real hardware, we sometimes got unreasonably slow runs due to interference, which would unfairly shift the median toward a worse result. For most configurations, the standard deviation between all runs was between 0% and 2%; thus we refrained from plotting error bars to avoid unnecessary clutter. Time measurements are in cycles, acquired by reading the time-stamp counter of the underlying core via the rdtsc instruction. Total execution time is measured by the source worker, placed on the same core for all experiments.
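For reference, a minimal sketch of such a cycle measurement on x86-64; this is a generic inlined rdtsc read, not the instrumentation actually used in WOOL.

/* Sketch: cycle-level timing via rdtsc; x86-64 with GCC/Clang inline asm. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = read_tsc();
    /* ... region of interest ... */
    uint64_t end = read_tsc();
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}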

Benefits

Palirria aims to maintain, if not improve, performance while using fewer threads, by reducing cycles spent wastefully. In other words, it strives to maximize the effective utilization of its workers: workers are not added unless immediately utilizable, and not removed unless already idling. Figures 5.16 and 5.18 quantify this claim on the simulator and on real hardware respectively. Their column (c) shows a comparison of execution time in cycles (X axis) versus the total number of workers used (Y axis). Separating the benchmarks into highly parallel (FIB, Skew, Stress) and irregular (FFT, nQueens, Sort, Strassen) ones, observe that for the former group a fixed allotment always performs better. Nevertheless, among the adaptive runtimes, Palirria achieves a smaller area below its plot on the simulator, resulting in less execution time, fewer resources, or both. On real hardware all runtimes perform the same. Column (a) focuses on execution time to ease analysis of the results. Column (b) shows the median wasteful cycles per runtime; in all runs on both platforms Palirria reduces this metric the most, the Strassen benchmark being a partial exception, although even there Palirria outperforms all others considerably. Figures 5.17 and 5.19 present wasteful cycles in even more detail, showing one bar per thread. Due to space limitations, only the results for the largest fixed allotment are shown. All plots are normalized relative to that fixed allotment's result; hence 100 is the total execution time of that run, a runtime ending at 110 is 10% slower, and one ending at 90 is 10% faster. These plots directly show how Palirria utilizes fewer workers for irregular workloads without performing worse than the other adaptive runtimes, and in some cases it even improves performance. Consequently, it can be claimed that Palirria both removes idling workers faster, reducing failed steal attempts, and transforms available concurrency into parallelism more effectively.

[Figure 5.18 plot area; performance measurements on Linux (real hardware): one panel row per benchmark (FFT, FIB, nQueens, Skew, Sort, Strassen, Stress); series ASTEAL, DSTEAL, Palirria, WOOL; columns show normalized execution time, % wastefulness, and worker count over time.]

Figure 5.18: Results for fixed 5, 13, 24, 35, 42, 45 workers, and self-adapting schedulers. (a) Total execution time normalized to the first column. (b) Average percentage of wasted cycles. (c) Adapted number of workers over time.

[Figure 5.19 plot area; useful time (normalized to WOOL-42) on Linux (real hardware): one panel row per benchmark; columns WOOL-42, ASTEAL, DSTEAL, Palirria; X axis: workers.]

Figure 5.19: Useful (blue bar bottom) and non-useful (red bar top) normalized time per worker ordered by zone. Useful are the cycles spent successfully stealing and processing tasks. Plots per application (row) are normalized to the first bar (source worker) of the corresponding WOOL-42 run.

Limitations

Palirria does not account for task granularity. It can be argued that very fine-grained tasks might not offset the cost of adding more worker threads. However, the tasks being measured already exist and are pending execution; irrespective of granularity, the application at that time could indeed utilize more workers. Thread creation cost is highly platform dependent and could potentially be made very small. Thus, in all evaluations it has been excluded by creating one thread for every available processor in advance; idle and removed threads are blocked on condition variables, and it only takes a signal to add them back. Alternatively, Palirria could be extended to account for task granularity if such information could be profiled statically or dynamically. A product of that metric could then be mapped to the cost of thread creation on the current platform. Such an addition is not a small task, and constitutes significant future work.

Chapter 6

Pond: Exploiting Parallelism Feedback in the OS

6.1 Introduction

Current and older operating systems have traditionally allowed processes to request access to as many hardware resources as they see fit, and then load balance to reduce share contention. Legacy systems with a single or just a few processors resorted to time-sharing resources between processes or threads; hence it was never required to restrict the amount of either. Emerging many-core architectures have allowed for space-sharing resource management, where each workload receives an exclusive allotment of resources. However, most parallel workloads do not have static resource requirements; hence allotting fixed amounts of resources would lead either to loss of performance for the application or to under-utilization of the resources. Being able to adapt the allotment is fast becoming a requirement. The previous chapter explored the ability of individual workloads to estimate their parallelism, in terms of processing cores. It also discussed how the application's runtime can adapt to a dynamically expanding or shrinking allotment, emphasizing the selection of the resources best suited for removal. This chapter explores how to select which cores to place an application on. A space-sharing strategy in a multiprogrammed system poses constraints on such selection. Related work has embraced self-adaptation of resources by the application. Self-adaptation, besides failing to motivate honesty, is also unaware of contention. On a heavily multiprogrammed system, avoiding hazardous contention might require allotting fewer resources than desired. Regarding requirements estimation, we have found that insight into the application's runtime scheduling is sufficient for accuracy. Given the right metric for quantifying future work (instead of making statistical assumptions based on historical data), the runtime scheduler is by definition the broker of the available work units; with every programming model, every action related to the generation of concurrent entities, be that new threads through pthreads, tasks in Cilk or OpenMP, or otherwise, includes a call to the runtime. Thus the runtime is in the ideal position to monitor and measure available concurrency. Even in the most complicated situation of dependency-bound data-flow execution, it is the runtime which is given the dependency tree and has to analyse it. Related work has presented improvements in performance and accuracy through such mechanisms [145, 30, 75, 74, 33].


(a) Parallel application anatomy: data, data processing tasks, and runtime; all components can be decoupled. (b) Traditional process model: application and system do their own scheduling in disconnect. (c) Process model where the runtime is decoupled from computation tasks and offered as a service by the OS; trusted requirements estimations are used in contention-aware system-wide resource management.

Figure 6.1: From the perspective that the runtime can be completely decoupled from the application, we depart from the traditional process model and migrate all thread control into an OS-provided system service.

Instead of incentivizing honesty in requirements estimation, we propose transferring said mechanism out of the application's control altogether. To that end, we decouple the runtime scheduler from application code and provide it as a system service. Figure 6.1c illustrates our proposed scheme as a departure from the traditional process model in figure 6.1b. Section 6.4 elaborates on the specifics. We view parallel applications as the union of three components: data, data processing code, and runtime scheduling (see figure 6.1a). We need to show that runtime scheduling code can be decoupled from data processing code. We make the leap and say that this is by default true when the runtime scheduler is bundled as an external library, which is the case for every programming model. Think of the lowest level of abstraction being a threading library (like pthreads): although primitive, there is still a runtime, which can be emulated through Pond; pthread_create() would create a new task, assigned to a worker thread of the OS's choosing.
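To make the last point concrete, here is a toy C sketch of the emulation idea; pond_submit_task() and the wrapper are hypothetical names for illustration, not part of any real Pond or pthreads API.

/* Sketch: a pthread_create-like call becomes a task submission. In a real
 * system pond_submit_task() would enqueue the task for an OS-chosen worker;
 * here it simply runs the task inline to keep the example self-contained. */
#include <stdio.h>

typedef void *(*start_routine_t)(void *);

struct task { start_routine_t fn; void *arg; };

static void pond_submit_task(struct task t) { t.fn(t.arg); }   /* hypothetical */

static int my_pthread_create(start_routine_t fn, void *arg)
{
    struct task t = { fn, arg };
    pond_submit_task(t);      /* instead of spawning a kernel thread */
    return 0;
}

static void *hello(void *arg) { printf("task ran: %s\n", (const char *)arg); return NULL; }

int main(void) { return my_pthread_create(hello, "emulated pthread"); }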

6.2 Restructuring the problem

Section 4.4 expanded on a list of principles for the ideal OS scheduler of parallel applications over a many-core architecture. These principles were: Fairness in distribution, Efficiency in management, Distrust in requirement servicing, computational Cost-Effectiveness, and Near-Optimality of schedules. The previous section 4.3 explained why it is probably impossible to adhere to all these principles in solving the problem at hand, which is scheduling n ∈ N jobs arriving simultaneously, with multiple criteria and multiple objectives. Nevertheless, the aforementioned principles can be met by solving an equivalent problem which transfers the indeterminism onto properties that do not matter. The prohibitive factor is the number of dimensions the problem has. One concession that could be made is to reduce one of these dimensions by serializing the scheduling process. Instead of having the scheduler work periodically on a batch of queued-up requests, it can be designed as an asynchronous, event-driven mechanism. Scheduling actions are driven by applications, one at a time, on demand. There are three primary such actions:

1. Start an application on X amount of resources,

2. Expand an allotment by Y amount of resources,

3. Reduce an allotment by Z amount of resources.

These three actions do not need to be periodic, as long as the OS orchestrates them while retaining the power to adapt the requested amounts. That means that the variables X, Y, Z which an application requests are subject to approval by the OS scheduler, depending on real-time availability and the overall distribution strategy. So requests are serialized and serviced opportunistically. Such a strategy is not novel, and it has been shown that it can lead to severe mismanagement and significantly unfair performance degradation [134]. Any application could simply ask for all resources, forcing either starvation of subsequent requests or constant reallocation; both are negative scenarios. However, the OS scheduler can be especially distrusting and frugal. It can do that by enforcing a mixture of space- and time-sharing strategies, where the latter is used temporarily, allowing processes to win exclusive access from co-runners. Such a method can provide the first three principles: having processes win access to resources in a controlled, refereed manner (Distrust) means that each has the opportunity to get what it can actually utilize (Fairness), while losing all that it cannot (Efficiency). The rest of this chapter explores exactly how such a method was designed. Nevertheless, implementation remains an important issue. Not only does the scheduler need to be lightweight so as not to induce overhead; by serializing requests, potential indeterminism in servicing one request can severely delay all subsequent ones. Thus the serial computational parts must be deterministic and have bounded execution time. The method described does not derive from traditional scheduling theory; it is in fact opportunistic and stochastic. Thus optimality cannot be directly evaluated, although it is trivial to show that the resulting schedules are not optimal.

6.3 Process model

At the system level, resource management decisions have to consider the status of all the physically available resources to avoid over- or under-committing them. Traditionally, the allocation of resources to a workload has been agnostic of the requirements of the workload [134]: either the system blindly follows the requests of the application, or it allots a smaller amount, thus delegating adaptation responsibility to the application. However, an allocation of resources to a workload is the union of two sets: the set of physical resources the workload is allowed to utilize, plus the set of threads to run on them. Altering an allocation can mean modifying the size of either set. Current techniques have exclusively dealt with distributing physical resources among running processes. By prioritizing a space-sharing strategy, it is crucial that each thread runs on its own core exclusively. When resources are removed, the runtime must optimally remove equally many threads. When resources are added, more threads must be created. Parallelism fluctuations happen at the millisecond scale [65], making either function cost-ineffective when repeated often. Scheduling that disables and re-enables threads is also not optimal [7, 107, 141]. For a workload whose parallelism has been mapped to tasks, these tasks are the only currency to be brokered. Ideally, there should be as many workers as there are tasks. In reality, the runtime must use the resources that the system provides. Thus, from the application's perspective, resources should be threads. To some extent this is conceptually already in effect, although the responsibility of matching the thread count to the allotted resources is left up to the application itself. Considering a configuration where allotments are malleable at a rate of a few milliseconds, this model is not productive. The allotment of resources by the system should include the capacity to execute tasks without requiring the intervention of the application. Moreover, there is a duality of roles in current parallel applications. Tasks represent an algorithm solving an arbitrary problem or performing a certain processing of data, which can be independent of the execution platform. The runtime's threads schedule the tasks for execution, with the effectiveness of the scheduling heavily dependent on the intrinsic aspects of the execution platform. Effective system-wide resource management requires a new system software model. This model would replace the thread and process concepts with a more lightweight execution abstraction of tasks. A process (or application) would then be represented by a collection of tasks and an allotment of cores (homogeneous or heterogeneous). The runtime executing the tasks would be offered by the system as part of the allotment; no scheduling logic should be included in the application itself. The size of the allotment should depend on the amount of parallelism the application can currently use, on the load of the system, and possibly on the priority of the application. The main advantages of this approach would be:
• Better resource utilization. No application can over-subscribe resources, as only potential parallelism is expressed. An application only requests resources it can efficiently use.
• Compositional parallelism. Since the task abstraction expresses potential parallelism and is the only way of expressing concurrency, different parallel software components can naturally be composed to utilize even more parallelism.

6.4 Pond-workers: Threads as a service

Figure 6.2: Pond consists of four modules.

Figure 6.3: The proposed threading model could be considered as either 1:1 or N:M.

In this section we will discuss the general layout of Pond, focusing on the implications of our proposed variations to the process model. Figure 6.2 depicts the anatomy of Pond; it consists of 4 components:

1. architecture-dependent model: inspects and models the underlying architecture; it is consulted for selecting cores given sharing and contention effects.

2. architecture-independent distance-based resource manager: handles application initial placement and allotment malleability requests.

3. per-core worker: a pinned execution container for application code, one per core. Workers are persistent and shared among all applications.

4. (optional) application monitoring module: performs application-level scheduling and, through self-monitoring, estimates resource requirements.

The first two components handle the distribution of resources among running applications. The first component records available resources into a customizable architectural model; hardware threads are modeled directly and mirrored in the execution environment through workers (the third component); caches and cache-sharing effects are handled indirectly through specific modeling constructs. Section 6.5 presents the architectural modeling, algorithms and logic in detail. Workers are execution containers (for now think of them as kernel threads) which are pinned one per physical hardware thread. Workers are persistent, in the sense that they are created upon booting Pond; they are also shared by all applications, meaning that they can host code segments irrespective of process ownership. The last module monitors the application's runtime and estimates requirements. It can be optional if the application provides such a mechanism itself using the Pond API. In the absence of a fast mechanism for system calls, it might be prudent not to migrate the runtime into the OS; the Pond resource manager does not require it. It could be argued that the proposed process structure follows either the 1:1 or the N:M threading model. As shown in figure 6.3, the actual thread anatomy is split into three layers: i) the system threading support (i.e. pthreads as in the figure), ii) the one-per-core OS-level part of the worker, which can be shared by iii) M processes, and thus M different runtime schedulers. The first and second layers are unique per thread and encapsulated by a single LWP (lightweight process). The third layer has an arbitrary multiplicity, each instance with its own address space. However, from the perspective of each application there is only one thread per LWP; thus from the application's perspective the model is 1:1. Nevertheless, from the system perspective there are multiple user threads per LWP, which are, however, managed and scheduled by the OS and not by userspace (as with pure N:M implementations, like in [7]). Consequently Pond departs from the traditional way of reasoning about this concept.
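A data-structure sketch of this layering is given below; the C types and field names are illustrative assumptions, not Pond's actual definitions.

/* Sketch of the four Pond modules and the per-core worker shared by M
 * runtimes; declarations only illustrate the relationships described above. */
#include <stddef.h>

struct arch_model   { int num_units;            /* units, groups, distances    */ };
struct resource_mgr { struct arch_model *model; /* placement and malleability  */ };

struct runtime_ctx  {                 /* one per application sharing a worker  */
    void *address_space;              /* distinct per process (own ASID)       */
    void *task_queues;
};

struct pond_worker {                  /* one per hardware thread, pinned,      */
    int core_id;                      /* persistent across applications        */
    struct runtime_ctx *tenants;      /* M co-tenants: N:M from the system,    */
    size_t num_tenants;               /* 1:1 from each application's view      */
};

struct app_monitor  { int interval_ms; /* optional requirements estimation     */ };

int main(void) { return 0; }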

6.4.1 Pond over a monolithic kernel

[Figure 6.4 diagram: kernel-protected runtime memory, application memory, architecture model, resource manager, and workers; action sequence: 1. issue app, 2. consult model, 3. allot, 4. load, 5. start, 6. execute & monitor, 7. escape, 8. switch.]

Figure 6.4: Action sequence for loading and running applications over Pond.

Pond-workers are designed to provide a selection of runtime scheduling techniques arbitrarily. When an application is started over Pond, it has the opportunity to select the runtime scheduling. Pond will create the process and its address space and load both the runtime library and the application code and data into that address space. However, the runtime and the requirements estimation mechanism need to be protected from the application; we do that by isolating application code and data through a sandbox. The sandbox is implemented using privilege modes, as proposed by Belay et al. [15]. Assuming an x86 architecture, the runtime, monitoring, and estimation code run in ring 0, with the supervisor bit set on all their memory entries; application code runs in ring 3. Figure 6.4 provides an overview of this process, assuming sharing of workers between many applications. When a Pond-worker is time-shared, it needs to perform context switches at a very fine granularity. To optimize this function we employ address space identifiers (ASIDs): TLB entries are tagged with a unique per-process identifier, negating the need for flushing upon a switch and reducing its cost [147]. ASIDs are implemented by Intel, AMD, and ARM, while most server-class RISC architectures support tagged TLBs. Intel calls their ASID implementation process context identifiers (PCID); on ARM it is part of the Context ID register. Even so, time-sharing is meant to be avoided, as detailed in section 6.5.3. The usage of tagged TLBs for lightweight context switching has been investigated by several projects [96, 147, 148, 153, 95]. Tagging TLB entries and not flushing the table upon switching makes that table a shared resource. As such, the size of the table becomes a much more significant property. Intuitively, as the size of the table grows the number of TLB misses is reduced; however, the average page walk latency increases. The type of tags and the memory access patterns of the workload are also important factors which greatly affect overheads. Venkatasubramanian et al. [147] evaluated tagged-TLB usage on the Simics simulator and found the performance benefits to vary significantly due to these factors; the specific overhead reduction was between 1% and 25%. Interaction between Pond-workers and the application runtime is performed through a generic API of upcalls which can implement the functions of most programming models. We separate this API into three categories: initialization, execution, and termination¹. The first category includes functions for initializing runtime entities, like constructing the thread descriptor if pthreads is used. Execution refers to the scheduling mechanics of the runtime library for placing parallel activities, like stealing and syncing for a work-stealing programming model such as Cilk. Finally, termination includes the necessary calls for terminating runtime entities, like managing state when an entity is forcibly removed due to adaptation.
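A sketch of what such an upcall table could look like in C; the struct and member names are assumptions made for illustration, not the actual Pond interface.

/* Sketch of a three-category upcall table (initialization, execution,
 * termination) between a Pond-worker and a runtime; names are assumptions. */
struct worker;   /* opaque: the hosting Pond-worker */

struct runtime_upcalls {
    /* initialization: e.g. construct a thread descriptor if pthreads is used */
    void (*init_entity)(struct worker *w, void *entity);

    /* execution: scheduling mechanics, e.g. steal/sync for work stealing */
    int  (*schedule)(struct worker *w);

    /* termination: e.g. save state when an entity is forcibly removed */
    void (*terminate_entity)(struct worker *w, void *entity);
};

int main(void) { return 0; }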

6.4.2 Pond over a micro-kernel

In addition to implementing Pond on traditional monolithic OSs, we have also explored the possibility of a microkernel. One interesting aspect is the need to have all of Pond's components outside the kernel. This becomes complicated due to the requirement of two-layer protection: Pond workers need access to the memory of all running applications, meaning one unified address space, while they also need to be protected from application runtime code, which as before needs to be protected from application code. One way to implement such a scheme is by having Pond and all applications as a single process, but manually assigning a distinct ASID to each one, then sandboxing the application as before. Although limited, ASIDs can be requested and reserved by Pond. Intel's PCID is 12 bits long, providing enough ids. An ASID on ARM architectures, though, is only 8 bits long, giving a non-negligible, but also not detrimental, limit of 256 simultaneous applications. As a use case we explored the Barrelfish OS [14, 13], as seen in figure 6.5. Currently Barrelfish requires significant changes in order to extend its dispatchers (a notion similar to Linux's kernel threads) with a unified memory address space across multiple kernels, and equal capabilities. An application is started on a single dispatcher (kernel) and extended to more through child dispatchers. One example of a limitation is the inability to open files through the child dispatchers, disallowing them from hosting independent Pond-workers.

¹ Due to space limitations we only include a high-level description of this API, although a full specification is readily available upon request.

Figure 6.5: On Barrelfish, app 2 interrupts to the kernel, which then invokes Pond’s custom dispatcher; the time-sharing logic switches to app 1.

Furthermore, a method to load binaries into existing address spaces is another necessary extension. Delikoura [46] has worked toward a new dispatcher with said features, although her implementation is not part of the Barrelfish main tree. By design, when an application interrupts, Barrelfish will switch to the kernel scheduler, which then resumes a dispatcher instead of the application segment; Pond has the opportunity to employ custom dispatchers which (in place of workers) can execute the time-sharing logic and context switch to any application sharing the underlying physical core. This process involves 3 context switches although through tagged TLBs the cost is reduced significantly.

6.5 Scheduling

Pond is a resource manager that automatically load balances multiprogrammed systems. Our method provides the tools for automatic placement of jobs, and for adaptation of their resources throughout their execution. The latter assumes the workload can continuously provide estimations of its resource requirements, truthful or not. Building upon the idea of two-level scheduling, Pond consists of the system-wide scheduler managing all processing units, and the program's runtime which advises on usage efficiency. In this chapter we assume the latter and focus on the system side. An adaptive work-stealing runtime requests a change in its allotted resources via the following API:

• INCREASE(amount): request amount more cores.
• DECREASE(amount): request that amount cores be released.

The system does not need to verify the aforementioned requests; it assumes that the application can effectively estimate its own requirements. To meet them, Pond tries to space-share first, falling back to time-sharing under certain conditions. Cores are allotted to programs optimistically, meaning that the system expects the application to be greedy and ask for a few more cores than it can guarantee to keep utilized. This assumption leverages unavoidable accuracy errors in the online estimations. The last allotted cores function as the testing ground for further expansion of the allotment; they are allowed to be time-shared. For the selection of cores to add to or take from an allotment, a model of the underlying architecture is consulted. Depending on the chip's design the actual criteria might differ; however, the model transforms them into an abstract geometric space where an architecturally agnostic distance measure defines and quantifies candidacy. The following sections present each part of the scheduling method in detail.
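As an illustration of how a runtime might drive this API, a small hedged sketch; pond_increase and pond_decrease are stand-in names that only print, since the real interface and its signatures are not specified here.

/* Sketch: an adaptive runtime issuing allotment requests. The two calls are
 * stand-ins that print; a real implementation would trap into the scheduler,
 * which may grant fewer cores than requested. */
#include <stdio.h>

static int pond_increase(int amount) { printf("request +%d cores\n", amount); return amount; }
static int pond_decrease(int amount) { printf("release %d cores\n", amount);  return amount; }

static void adapt(int estimated, int current)
{
    if (estimated > current)
        pond_increase(estimated - current);   /* granted opportunistically */
    else if (estimated < current)
        pond_decrease(current - estimated);   /* always honored            */
}

int main(void) { adapt(8, 5); adapt(3, 8); return 0; }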

(a) HW threads that share caches grouped together. (b) HW threads of the same core grouped separately.

Figure 6.6: Two examples of modeling a typical quad-core, 8 hardware thread, Intel iCore processor.

6.5.1 Architecture modeling

We categorize hardware resources as direct and indirect; the former refers to hardware threads which a program is directly given access to; as indirect we refer to all those resources which a program is passively assigned access to due to its placement, like certain cache banks, interconnect channels, and memory buses. We use two modeling constructs to organize hierarchies of said resources:

• a unit, which represents a single physical hardware thread and in effect an assignable Pond-worker.
• a group, a cluster of units or other groups.

A unit is the scheduling currency; per-application resource allotments consist of these units. Groups are meant to place together units that share (or explicitly should not share) a certain indirect resource. Pond strictly prioritizes allotting units within the same group. There are cases where hyper-threading might harm performance, hence hardware threads of the same core can be grouped separately, to place each application on workers of pairwise different cores. On the other hand, a group can mirror the cache hierarchy exactly, when sharing low-level caches is beneficial. For example, figures 6.6a and 6.6b show how an Intel quad-core hyper-threading processor is modeled with either strategy. This hierarchical organization creates a multidimensional geometric space where the distance between units quantifies the candidacy for co-assigning one with the other. Each type of interconnect network defines this distance differently; it could relate to the actual link latency, or it could be a completely virtual metric that satisfies certain scheduling necessities. The following API is provided to the higher, hardware-agnostic layers of the system scheduler:

• get_distance_of_units(a, b): returns the numerical distance between two units a and b, given the active architectural model.

• get_units_at_distance_from(a, dist): returns a set of units at distance dist from unit a, given the active architectural model.

During this project we studied some major types of chip interconnect networks currently available. Next we describe how we defined a geometric space for each one.

Mesh network: TILEPro64

EzChip's mesh network maps directly to a metric space, where core location is defined by a Cartesian coordinate system and geodesic distance provides the metric. Moreover, resources are not shared directly: each core sees the sum of all remote L2 banks as one big L3. A user can define a coherency domain over a set of cores; this domain is kept aligned with the per-application allotment and does not require mirroring in the architectural model.

Algorithm 11 get_distance_of_units: implementation for a mesh network.
1: function GET_DISTANCE_OF_UNITS(a, b)
2:   dist := 0
3:   if a ≠ b then dist := |a.x − b.x| + |a.y − b.y|
4:   return dist
5: end function

Algorithm 12 get_units_at_distance_from: implementation for a mesh network.
1: function GET_UNITS_AT_DISTANCE_FROM(a, dist)
2:   buffer ← ∅
3:   max := 4 ∗ (dist − 1) + 4
4:   vx := a.x + dist
5:   vy := a.y
6:   stepx := −1
7:   stepy := 1
8:   for all i ∈ [0, max) do
9:     if vx, vy within bounds then
10:      v := (vy − 1) ∗ MaxPerRow + vx − 1
11:      if v ≠ a and v within bounds then
12:        buffer ← v
13:      end if
14:    end if
15:    vx := vx + stepx
16:    vy := vy + stepy
17:    if vx = a.x − dist then stepx := −stepx
18:    if vy = a.y + dist or vy = a.y − dist then stepy := −stepy
19:    if vx = a.x + dist and vy = a.y then break
20:  end for
21:  return buffer
22: end function

Algorithms 11 and 12 present the implementation of the model API for a mesh architecture. Their complexity is O(1) and O(4 ∗ dist) respectively. The former uses the typical Manhattan distance on Cartesian coordinates, while the latter iterates directly over the periphery using dist as the radius. In the second algorithm the buffer array should be allocated with size 4 ∗ dist, which is the maximum possible number of units at that distance. The coordinates of each unit are assumed to be stored in a global array of bitfield structures. The algorithm also assumes known topology dimensions in order to calculate the arithmetic bounds.
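For concreteness, a C transcription of the two mesh routines under the stated assumptions (known topology dimensions, unit coordinates available). It visits the same diamond perimeter, though in a simpler order than algorithm 12, and is a sketch rather than the thesis implementation.

/* Sketch: mesh-network model API in C. Topology size and coordinates are
 * illustrative; assumes dist >= 1 so the origin unit is never included. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 8
#define COLS 8

struct unit { int x, y; };            /* 1-based Cartesian coordinates */

static int unit_id(int x, int y)   { return (y - 1) * COLS + (x - 1); }
static int in_bounds(int x, int y) { return x >= 1 && x <= COLS && y >= 1 && y <= ROWS; }

/* Manhattan distance between two units: O(1). */
static int get_distance_of_units(struct unit a, struct unit b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

/* Collect ids of all units on the diamond of radius dist around a: O(4*dist). */
static int get_units_at_distance_from(struct unit a, int dist, int *buf)
{
    int n = 0;
    for (int dx = -dist; dx <= dist; dx++) {
        int dy = dist - abs(dx);
        int ys[2] = { a.y + dy, a.y - dy };
        int count = (dy == 0) ? 1 : 2;          /* avoid counting the tips twice */
        for (int k = 0; k < count; k++)
            if (in_bounds(a.x + dx, ys[k]))
                buf[n++] = unit_id(a.x + dx, ys[k]);
    }
    return n;
}

int main(void)
{
    struct unit a = { 4, 4 }, b = { 6, 7 };
    int buf[4 * 3];                             /* at most 4*dist entries */
    printf("distance: %d\n", get_distance_of_units(a, b));
    printf("%d units at distance 3\n", get_units_at_distance_from(a, 3, buf));
    return 0;
}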

Point-to-point link: HyperTransport and QPI

P2P link networks usually have complicated designs. Cores are grouped into nodes, a socket consists of multiple nodes, and a system might have many sockets. Even within a socket, not all nodes are connected to all others. Based on the grouping logic (as in figures 6.6a and 6.6b), all units (cores) within the same group are set at distance 1. Thereafter, pairwise group distances are calculated following the inter-NUMA-node access time, normalized so that the smallest value is 2. Algorithms 13 and 14 present the implementation of the model API for a point-to-point link architecture. Their complexity is O(1) and O(N + M) respectively, where N is the number of groups (hardware nodes) and M the number of units within each group (cores per node). The algorithms assume that the total number of groups (NumberOfNodes) and the maximum number of units within each group (CoresPerNode) are known. Finally, it is assumed that a two-dimensional array holds the distances between all groups; this array is referenced as distances in the algorithms.

Algorithm 13 get_distance_of_units: implementation for a point-to-point link network.
1: function GET_DISTANCE_OF_UNITS(a, b)
2:   nodea := a.id / CoresPerNode
3:   nodeb := b.id / CoresPerNode
4:   return distances[nodea][nodeb]
5: end function

Algorithm 14 get_units_at_distance_from: implementation for a point-to-point link network.
1: function GET_UNITS_AT_DISTANCE_FROM(a, dist)
2:   buffer ← ∅
3:   nodea := a.id / CoresPerNode
4:   for all i ∈ [0, NumberOfNodes) do
5:     if distances[nodea][i] = dist then
6:       for all j ∈ [0, CoresPerNode) do
7:         v := i ∗ CoresPerNode + j
8:         if v within bounds then buffer ← v
9:       end for
10:    end if
11:  end for
12:  return buffer
13: end function

Bidirectional ring: Intel MIC architecture

A ring network places all cores in a 1-dimensional closed loop, with bidirectional data transfers. Thus cache transfer latency is proportional to the on-chip distance. A unit consists of 1 HW thread and a group contains all units within a physical core. Distance within a group is 0. Cross-group distance is set to the shortest number of hops between the two nodes. Intuitively, we want to keep each allotment packed into as small an on-chip area as possible.

Algorithm 15 get_distance_of_units: implementation for a ring network.
1: function GET_DISTANCE_OF_UNITS(a, b)
2:   nodea := a.id / CoresPerNode
3:   nodeb := b.id / CoresPerNode
4:   return min(|nodea − nodeb|, NumberOfNodes − |nodea − nodeb|)
5: end function

Algorithms 15 and 16 present the implementation of the model API for a bidirectional ring architecture. Their complexity is O(1) and O(2 ∗ M) respectively, where M is the number of units within each group (cores per node). The algorithms assume that the total number of groups (NumberOfNodes) and the maximum number of units within each group (CoresPerNode) are known.

Algorithm 16 get_units_at_distance_from: implementation for a ring network.
1: function GET_UNITS_AT_DISTANCE_FROM(a, dist)
2:   buffer ← ∅
3:   node := a.id / CoresPerNode
4:   node2 := node − dist
5:   if node2 < 0 then node2 := node2 + NumberOfNodes
6:   v := node2 ∗ CoresPerNode
7:   for all i ∈ [0, CoresPerNode) do
8:     if v within bounds and v ≠ a then buffer ← v
9:     v := v + 1
10:  end for
11:  if dist = 0 then return buffer
12:  node2 := node + dist
13:  if node2 ≥ NumberOfNodes then node2 := node2 − NumberOfNodes
14:  v := node2 ∗ CoresPerNode
15:  for all i ∈ [0, CoresPerNode) do
16:    if v within bounds and v ≠ a then buffer ← v
17:    v := v + 1
18:  end for
19:  return buffer
20: end function

6.5.2 Space sharing

Pond primarily space-shares the available workers by pinning them exclusively on processors. The rigidity of such a scheme requires optimized selection algorithms. The process can be broken down into three tasks: initial placement, allotment size increase, and allotment size decrease. When an application terminates, the same principles apply as for an allotment size decrease.

Initial placement: selecting a source core

Based on our resource management strategy, it is ideal that new programs are started with a high potential for allotment expansion. The initial worker, which we call the source of the allotment, should be placed centrally in a large idle area of the chip's network. We use algorithm 17 to pick the next candidate source as the center point of the largest, most idle area, as defined by the architectural model's metric space. First the set of idle cores is iterated; the core selected has the smallest sum of inverse distances from all active sources, meaning that it is the furthest away from all other active programs. If no idle core exists, the same algorithm is applied to cores allotted to other programs, excluding their sources.

Algorithm 17 GetNextSource: pick the worker in the biggest idle area as the next source.
1: function GETNEXTSOURCE(Idle, Busy, Sources)
2:   Source := MINWEIGHTCORE(Idle, Sources)
3:   if Source ≠ NULL then
4:     return Source
5:   end if
6:   Source := MINWEIGHTCORE(Busy, Sources)
7:   return Source
8: end function

Algorithm 18 MinWeightCore: select the worker furthest away from all existing sources.
1: function MINWEIGHTCORE(Workers, Sources)
2:   MWC := NULL
3:   wmin := MAXIMUM_DISTANCE
4:   for all c ∈ Workers do
5:     wc := Σ_{j ∈ Sources} 1 / MinDistance(c, j)
6:     if wc < wmin then
7:       wmin := wc
8:       MWC := c
9:     end if
10:  end for
11:  return MWC
12: end function

Allotment increase: choosing cores to expand a running allotment with

Much of this project's value stems from facilitating dynamically adaptive resource allotments. We believe that only the OS has the capability to ensure optimal resource selection on a multiprogrammed system. The application only requests a desired number of extra cores, and the system scheduler decides whether it can meet the request and with which specific cores. Algorithm 19 selects the cores that are closest to the source of the requesting program. First it iterates the set of currently idle cores. If the requested amount is not met, it tries to find allotted but time-shareable cores (see section 6.5.3). It may happen that the requested amount is not met.

Allotment decrease: choosing cores to remove from a running allotment

Removing workers from an allotment is a slow process. Firstly, there is no constraint against meeting the requested amount, other than that the source cannot be removed. As per algorithm 20, the workers furthest from the source are selected. They are eventually phased out according to the requirements of the programming model: Pond notifies the runtime, which must enforce starvation upon the worker and eventually remove it. For example, with work-stealing, such workers can finish processing their task queues but cannot steal new tasks, while still being stolen from.

Algorithm 19 Increase: add workers, most idle and closest to the source.
1: function INCREASE(Amount, NotAllotted)
2:   R := ∅
3:   L := ∅
4:   while Amount > |R| and |NotAllotted| > 0 do
5:     c := pop(NotAllotted)
6:     if 0 < assignments(c) < LIMIT then
7:       L := L ∪ {c}
8:     else
9:       if assignments(c) = 0 then
10:        R := R ∪ {c}
11:      end if
12:    end if
13:  end while
14:  while Amount > |R| and |L| > 0 do
15:    R := R ∪ {pop(L)}
16:  end while
17:  return R
18: end function

Algorithm 20 Decrease: remove the workers furthest from the source.
1: function DECREASE(Amount, Allotted)
2:   Output: R, the set of workers to release
3:   R := ∅
4:   while Amount > |R| and |Allotted \ R| > 1 do
5:     R := R ∪ {head(Allotted)}
6:   end while
7:   return R
8: end function
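On the runtime side, the phase-out described above can be as simple as a flag that disables stealing for the departing worker. A hedged C sketch follows; the helpers are illustrative stubs, not WOOL or Pond code.

/* Sketch: lazy phase-out of a worker marked for removal. It may drain its own
 * queue and be stolen from, but must not steal new work. All helpers are stubs. */
#include <stdbool.h>
#include <stdio.h>

struct task   { int id; };
struct worker { bool leaving; int local_tasks; };

static struct task *pop_local(struct worker *w)
{
    static struct task t;
    if (w->local_tasks == 0) return NULL;
    t.id = w->local_tasks--;
    return &t;
}
static struct task *steal(struct worker *w) { (void)w; return NULL; }  /* stub */
static void run(struct task *t) { printf("ran task %d\n", t->id); }

static void worker_loop(struct worker *w)
{
    for (;;) {
        struct task *t = pop_local(w);        /* draining is always allowed    */
        if (!t && !w->leaving)
            t = steal(w);                     /* stealing only while not leaving */
        if (t) {
            run(t);
        } else if (w->leaving) {
            puts("worker starved out, blocking");  /* Pond can reuse the core  */
            return;
        }
    }
}

int main(void)
{
    struct worker w = { .leaving = true, .local_tasks = 3 };
    worker_loop(&w);
    return 0;
}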

6.5.3 Time Sharing

Exclusively space-sharing all cores requires some method for resolving simultaneous requests for the same resources. Traditionally, space-sharing schedulers either allot arbitrarily or follow rigid patterns [5, 157]; in both cases allotments are exclusive. In our scheme of spatial partitioning, time-sharing is employed only on spatially boundary workers, as a testing ground for an application's ability to increase its utilization. Through increase requests, applications can express their ability to fully utilize a time-shared core; the resource manager then has to decide whether to give that application exclusive access to the time-shared worker. If many co-tenants have issued increase requests during the same time quantum, the worker remains shared. Figure 6.7 presents the simple access-state protocol, assuming sharing between only two applications for simplicity.

[Figure 6.7 state diagram: states idle, exclusive-a, exclusive-b, and share-a,b; transitions labelled add/rmv for each of the two applications.]
Figure 6.7: State machine for time-sharing a worker between applications a and b.
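Read as code, the protocol of figure 6.7 is a four-state machine; a hedged C sketch is given below, where the state and event names mirror the figure and everything else is illustrative.

/* Sketch: two-application access-state protocol of figure 6.7. */
#include <stdio.h>

enum state { IDLE, EXCLUSIVE_A, EXCLUSIVE_B, SHARED_AB };
enum event { ADD_A, RMV_A, ADD_B, RMV_B };

static enum state step(enum state s, enum event e)
{
    switch (s) {
    case IDLE:        return e == ADD_A ? EXCLUSIVE_A :
                             e == ADD_B ? EXCLUSIVE_B : IDLE;
    case EXCLUSIVE_A: return e == ADD_B ? SHARED_AB   :
                             e == RMV_A ? IDLE        : EXCLUSIVE_A;
    case EXCLUSIVE_B: return e == ADD_A ? SHARED_AB   :
                             e == RMV_B ? IDLE        : EXCLUSIVE_B;
    case SHARED_AB:   return e == RMV_A ? EXCLUSIVE_B :
                             e == RMV_B ? EXCLUSIVE_A : SHARED_AB;
    }
    return s;
}

int main(void)
{
    enum state s = IDLE;
    s = step(s, ADD_A);   /* exclusive-a         */
    s = step(s, ADD_B);   /* share-a,b           */
    s = step(s, RMV_A);   /* back to exclusive-b */
    printf("final state: %d\n", s);
    return 0;
}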

This sharing mechanism allows Pond to force applications to prove their resource demands even when they are dishonest. By implementing time-sharing only on workers that are spatially the furthest away from the source worker, the spread, and in effect the dominance, of one application over others is constrained. Thus, even if one process' parallelism feedback is honest among dishonest co-runners, it has an opportunity to maintain its desired amount of resources, because the aggressive co-tenants' requests are not automatically met. Nevertheless, Pond's accuracy in identifying dishonesty could be improved with the help of hardware event monitoring; to some extent there can be a simple correlation of cache usage with useful utilization of a core. The reasoning is simple and quite naive: a spinning worker should not be accessing the cache much. Such a mechanism has not been implemented in the current prototype. A worker allotted to more than one application keeps track of its distance from the source of each. A worker selects for execution the application with the smallest allotment and for which it is closest to that application's source. However, every time an application is selected, the worker penalizes it by increasing these values, making it harder for the same application to be selected in the next iteration. Selecting a different application resets the values. The actual pseudo-code is presented in algorithm 21.

Algorithm 21 Run: decide the next application to run.
1: function RUN(Apps)
2:   sz := ∞; hc := ∞
3:   P := NULL; A := NULL
4:   while True do
5:     for all a ∈ Apps : a is active do
6:       if a ≠ A and a.sz ≤ sz and a.hc < hc then
7:         sz := a.sz
8:         hc := a.hc
9:         A := a
10:      end if
11:    end for
12:    sz := sz + 1
13:    hc := hc + 1
14:    A(A.args)
15:  end while
16: end function

6.6 Experimental Methodology

For evaluating Pond we have used a small set of task-parallel applications, exhibiting varying parallelism irregularity and diverse memory requirements. Our aim was to cover significant cases, although our options were limited by the absence of open-source adaptive runtime schedulers for most programming models. The applications we used are: fft, nqueens, sort, and strassen from the BOTS benchmark suite [52], and fib and cholesky from the Cilk-5 package [66]. All applications were ported to the Palirria adaptive runtime scheduler [145].

cholesky: 4000, 10000
fft: 2^27
fib: 46
nqueens: 14, 3
sort: 2^26, 2048, 20
strassen: 4096, 64, 64, 3

Table 6.1: Application input data sets for the evaluation of Pond. Values are given as "problem size arguments, recursion depth cut-off"; some applications do not use a cut-off.

The data input for each application was selected so that it controls the requirements pattern but also provides a run time between 8 and 10 seconds, so as to allow adaptation events to have a measurable effect. Similar execution times across applications were also important for the types of experiments we conducted. These input values are shown in table 6.1. Finally, the requirements estimation was set up with a long interval of 300 ms and a high sensitivity for increments. Such settings reduce adaptation overhead but provide quick and certain convergence to the ideal allotment size. For simplicity we chose to implement everything in a single binary. The implementation is such that the Pond scheduler, the runtime, and the application code are all completely decoupled. Via an integrated prompt the user can choose to execute an application with or without Pond, select the runtime scheduler, and disable thread pinning. One thread per core is created and blocked on a condition variable before any commands can be issued.

Disabling the Pond scheduler and thread pinning, and running only a single application, is equivalent to executing over the normal Linux scheduler; we will call this a Linux run. A multiprogrammed Linux run refers to multiple simultaneous instances of our binary, each running one application. Thus each application can get as many threads as there are physical cores. A Pond run refers to an application running with threads pinned and Pond managing cores. A multiprogrammed Pond run means one process running multiple applications within it; the applications must share threads totaling the number of physical cores. We will write group run when referring generally to a specific group on either Linux or Pond. We executed each application in isolation and in all possible groups of 2, 3, and 4. All results are the median of 10 runs. For each execution we measured the execution time in cycles of each co-runner separately. From this we derived the metrics used in the evaluation and the related figures. First, the execution time of each application in isolation serves as a baseline against its execution time in each group it partakes in, i) on Linux and ii) on Pond; we provide a dedicated plot per application, to compare its performance degradation in absolute numbers. Next, we calculated the performance degradation variance (pdv) for each group run, as the standard deviation of the individual performance degradations of the co-runners. This variance metric is coupled in the same plot with the relative worst degradation (rwd): the worst degradation suffered in that group on Linux relative to the worst degradation on Pond; note that different applications may be the worst-degraded ones on Pond and on Linux. As an example of the goodness criteria, assume a group of 3 applications: the pdv on Pond is 10% and on Linux 5%, with rwd at 1.1. This means that the Linux run had a more uniform performance degradation among co-runners; however, the worst-affected application executed 10% faster on Pond. Indirectly this shows that all other applications must have been faster on Pond too. Since the variance on Linux is 5%, we can infer that all applications on Linux suffered higher performance degradation than on Pond. All runs were performed on Linux (v2.6.32) running on real hardware. The architecture is an Opteron 6172 (AMD Magny-Cours) ccNUMA system with a total of 48 cores. There are 4 sockets, holding 2 NUMA nodes each. Each node has 6 processing cores and 8GB of locally controlled RAM. Threads were pinned using pthread affinity while the system was running a minimum of necessary services. Pond modeled this architecture as in figure 6.6a.
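Under our reading of these definitions, the two metrics reduce to a few lines of arithmetic; a hedged C sketch follows, with made-up degradation values for the example.

/* Sketch of the pdv and rwd computations; the rwd interpretation (worst Linux
 * degradation over worst Pond degradation) follows the text's example. */
#include <math.h>
#include <stdio.h>

/* pdv: standard deviation of the co-runners' degradation factors, in %. */
static double pdv(const double *deg, int n)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += deg[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (deg[i] - mean) * (deg[i] - mean);
    return sqrt(var / n) * 100.0;
}

/* rwd: worst degradation on Linux divided by worst degradation on Pond. */
static double rwd(const double *deg_linux, const double *deg_pond, int n)
{
    double wl = 0.0, wp = 0.0;
    for (int i = 0; i < n; i++) {
        if (deg_linux[i] > wl) wl = deg_linux[i];
        if (deg_pond[i]  > wp) wp = deg_pond[i];
    }
    return wl / wp;
}

int main(void)
{
    double dl[3] = { 3.0, 1.2, 2.0 }, dp[3] = { 1.3, 1.1, 1.2 };   /* example only */
    printf("pdv Linux %.0f%%, pdv Pond %.0f%%, rwd %.2f\n",
           pdv(dl, 3), pdv(dp, 3), rwd(dl, dp, 3));
    return 0;
}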

6.7 Evaluation

cholesky: 38    fft: 42    fib: 48    nqueens: 42    sort: 28    strassen: 24

Table 6.2: Application ideal allotment size in number of worker threads.

Our evaluation strategy consists of two experiments. The first assumes dishonest requirements estimations by having applications request a constant number of threads; we chose the number of threads that gives the best performance in isolation for each application, as shown in table 6.2. With any group, the total number of threads exceeds the available cores. Consequently, a Linux run oversubscribes the system, while Pond has to forcibly resize allotments and time-share workers. We consider this environment adversarial and will refer to it as such. The second scenario uses trusted requirement estimations, either by migrating the runtime to the system or by assuming honest application feedback. Applications are started with 1 thread and left to adapt their allotment progressively. A Linux run still oversubscribes the system, although considerably less. Similarly, a Pond run will produce less time-sharing.

[Figure 6.8 plot area: degradation variance (%) versus worst performance slowdown factor; series Linux and Pond; groups of 2, 3, and 4 apps.]

Figure 6.8: Multiprogramming using the non-adaptive runtime. The Y axis is in log-10 scale. The X axis shows the worst performance degradation for each group.

[Figure 6.9 plot area: degradation variance (%) versus worst performance slowdown factor; series Linux and Pond; groups of 2, 3, and 4 apps.]

Figure 6.9: Multiprogramming using the adaptive runtime. The Y axis is in log-10 scale. The X axis shows the worst performance degradation for each group.

Figures 6.8 and 6.9 reuse the results from chapter 4, figures 4.1 and 4.2 respectively, combined with the results obtained using Pond. To recap, the first experiment emulated adversarial conditions where applications requested only their ideal maximum number of cores. The second assumed trusted requirements estimations, adapting application allotments appropriately. In the first case (figure 6.8) Pond reduced degradation variance from 122% to 50% on average, while achieving less degradation by a factor of 3.2x. In the second experiment, Pond reduced degradation variance from 49% to 18% on average, while co-runners again degraded performance by a factor of 2.9x less, as shown in figure 6.9.

6.7.1 Adversarial Scenario

In this section we discuss the results where applications start with their ideal number of threads and only request that number. Figure 6.10 shows the pdv as the standard deviation of the performance degradation between collocated applications. Pond run results range from 3.3% (fib,sort) to 104% (fib,nqueens,cholesky,strassen), with an average and median value of 50%. Linux run results range from 10% (fft,sort) to 306% (fib,nqueens,cholesky,fft), with an average of 122% but a median of 43%. Thus for most groups Linux resulted in more uniform performance degradation per co-runner. However, looking at per-application performance in figure 6.11, all applications are consistently faster in Pond runs. This observation is supported by the rwd plot, where the median is 3.26, meaning that for most groups the application that degraded the most was impacted 3 times less on Pond. There are no cases where Linux achieved better results; however, there are a few that matched Pond's performance, the leftmost groups in figure 6.10. Interestingly, the combination of fib, nqueens, strassen gave a lower pdv on Linux, meaning that all applications suffered equal slowdown, whereas on Pond the variance is bigger, meaning that some applications executed faster than on Linux. The rwd metric compares the worst degradation; hence a large variance means all other co-runners managed less degradation. Overall, Pond resource management produced better per-application performance and lower degradation variance under adversarial conditions. This experiment serves to show how Pond's time-sharing mechanism dynamically enforces allotment adaptation and on-demand exclusive access to hardware cores even when the applications are not willing to adapt.



Figure 6.10: Performance degradation variance on Linux (red) and Pond (blue) on the left Y axis, and relative worst degradation on the right Y axis (green), using dishonest requirements estimations. The X axis shows the applications in each group.

Figure 6.11: Per-application performance with dishonest requirements estimations. The X axis shows the co-runners.

6.7.2 Trusted Requirements Estimation

In this section we discuss the scenario where applications start with only one thread and are left to adapt autonomously, providing trusted requirements estimations. A first observation is that adaptation improved the Linux results, but it improved the Pond runs as well. Figure 6.12 shows the performance degradation variance, where Pond runs achieve a lower average and median of 19%, versus a 48% average and 39% median for the Linux runs. Of course, a low variance is meaningless if the actual per-application performance is low. The plots in figure 6.13 show the execution time in cycles for each application separately. Pond runs exhibit better performance than Linux overall; the median relative worst degradation for Pond runs is 2.89 times better than Linux. Individual application execution time is also lower in Pond runs, with very few exceptions.

There are two cases where Linux runs gave better rwd and lower variance, namely fib,sort and fib,nqueens. In the former group, fib achieved significantly better performance on Linux than on Pond, while sort's execution favored the Pond run. Sort requires almost half the cores fib does on our test system; moreover, the embarrassingly parallel fib's performance is impacted significantly by every single-core adjustment. Consequently, Linux's equal-opportunity strategy gave fib more CPU time on shared resources, relative to sort, than Pond did.

The fib,sort group is a good example of how Pond could benefit from other techniques, such as lightweight online profiling using hardware counters. Such data could augment the distance and allotment-size conditions a worker uses to select the next application to run. One example is utilization levels, which, coupled with insight into the runtime scheduling, can separate useful from wasted cycles; a sketch of this idea follows at the end of this section.

The second interesting group, fib,nqueens, is actually a false negative for Pond. As is observable in the per-application plots, both fib and nqueens executed faster on Pond. Fib suffered its worst degradation in the Pond run, but nqueens in the Linux run. It so happened that the degradation of nqueens on Linux was less than that of fib on Pond, giving a lower rwd to Linux, even though both applications individually suffered greater degradation on Linux than on Pond. There were no other groups where all applications performed better on Linux; although there are a few cases where pdv is lower, this was due to greater slowdown rather than performance improvement.
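Purely as an illustration of the hardware-counter idea above, the sketch below folds a hypothetical per-application utilization sample into a worker's choice of the next application to serve. The fields, the scoring function, and its weights are all invented for this example; they are not part of Pond's distance and allotment-size conditions.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical view a worker could have of each candidate application. */
struct app_candidate {
    const char *name;
    int    distance;        /* topological distance to the worker's core  */
    int    allotment_size;  /* cores currently granted to the application */
    double utilization;     /* fraction of useful cycles, 0.0 .. 1.0      */
};

/* Lower score is more attractive; the weights are arbitrary placeholders.
 * Subtracting utilization prefers applications whose cycles have been
 * observed to turn into useful work. */
static double score(const struct app_candidate *c)
{
    return 1.0 * c->distance + 0.5 * c->allotment_size - 2.0 * c->utilization;
}

static const struct app_candidate *
pick_next(const struct app_candidate *cands, size_t n)
{
    const struct app_candidate *best = &cands[0];
    for (size_t i = 1; i < n; i++)
        if (score(&cands[i]) < score(best))
            best = &cands[i];
    return best;
}

int main(void)
{
    struct app_candidate cands[] = {
        { "fib",  1, 8, 0.95 },
        { "sort", 2, 4, 0.60 },
    };
    printf("next: %s\n", pick_next(cands, 2)->name);
    return 0;
}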

Figure 6.12: Performance degradation variance on Linux (red) and Pond (blue) on the left Y axis, and relative worst degradation on the right Y axis (green), using trusted requirements estimations. The X axis shows the applications in each group.

Figure 6.13: Per-application performance with trusted requirements estimations. The X axis shows the co-runners.

6.7.3 Limitations

Pond has primarily been designed with space-sharing partitioning as the preferred strategy. By design it therefore assumes an abundance of hardware resources, where there exists at least one physical core per workload. The proposed design also requires that source workers are not shared, which indirectly limits the number of simultaneous applications to the number of physical cores. Nevertheless, this limitation is easily mitigated in a proper implementation of Pond within OS kernel space. The current reason source workers cannot be shared is the inability to escape application code; employing interrupts, however, can easily change this, as the small illustration at the end of this section suggests.

The presented design of Pond works for compute-bound applications with any memory requirements. However, it does not consider I/O or other types of resources besides processing cores and the caches attached to them, arguably an important shortcoming for general usability.

Finally, there is no straightforward method to predict the expected performance degradation (a feature current schedulers also lack), which would be necessary for HPC and datacenter deployment. When performance really matters, as in HPC environments, the preferred solution is to waste hardware by placing applications on dedicated sets of resources [105, 69], even though there has been a lot of research on optimizing collocations.
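The following is a minimal user-space illustration of that idea, not Pond's mechanism: a periodic POSIX timer signal interrupts the source worker, and the handler merely raises a flag that the worker checks at its next safe point in order to hand control back to the runtime.

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t preempt_requested = 0;

/* Signal handler: only sets a flag, which is async-signal-safe. */
static void on_tick(int sig)
{
    (void)sig;
    preempt_requested = 1;
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    /* Deliver SIGALRM every 10 ms. */
    struct itimerval tick = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
    };
    setitimer(ITIMER_REAL, &tick, NULL);

    for (int i = 0; i < 500; i++) {      /* stands in for application code */
        if (preempt_requested) {
            preempt_requested = 0;
            puts("source worker yields to the scheduler");
            /* a real implementation would switch to a scheduler context here */
        }
        usleep(1000);                    /* placeholder for useful work */
    }
    return 0;
}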

6.8 Related Work

Regarding runtime migration to system space, Apple's Grand Central Dispatch [8] follows similar principles to our project in offloading thread control to the OS; applications use system calls to register tasks while the OS decides on thread count and scheduling. The underlying XNU thread-scheduling kernel utilizes varied priority queues to determine task execution urgency. Contrary to Pond, this mechanism depends on the application tagging its tasks appropriately.

Recent OS projects have dealt with multiprogramming manycores [45, 17, 87, 73]; Tessellation OS [39] partitions available resources into containers, Cells, and employs a thread scheduler within each cell. Techniques targeting cache contention could be used alongside Pond to further improve core selection. Such projects either predictively model cache usage patterns [28, 22, 32, 49], monitor cache events and map them to application behavior [106, 11, 104, 136, 137], or leverage both for contention-aware scheduling [86, 63, 34]. SCAF [43] leverages performance monitoring and runtime adaptability feedback to partition resources. Iancu et al. [83] explored resource oversubscription to address fluctuating requirements. Bini et al.'s ACTORS model [18] abstracts threads as virtual resources mapped to physical cores on deadline priorities for embedded systems. Weng and Liu [151] used CPImem to estimate resource requirements, as feedback for improving resource partitioning. Targeting fair QoS, several projects have proposed fairness models [84, 113, 85, 116] as well as hardware and platform features [91, 77, 117].

Finally, there has been work proposing OS contention-aware resource management based on application performance estimation, via a mixture of the aforementioned techniques on a variety of platforms [94, 62, 5, 93].

Part III

Epilogue


Chapter 7

Summary and Future Work

7.1 Thesis summary

Emerging multi- and many-core architectures have transitioned software development to parallel paradigms, bringing forth the issue of efficient job placement for maximizing performance and minimizing contention. Scheduling for multiprogrammed parallel architectures has been explored extensively for SMP configurations. However, modern CMPs enabled new, very fine-grained, compute-bound types of workloads, which have not previously been considered to the same extent. Fine-grained parallelism poses a scheduling problem since it cannot be profiled online in a timely manner, nor offline, since its profile varies with many factors such as platform and architecture, and frequently exhibits fluctuating parallelism during execution.

This thesis embraced two ideas, namely two-level scheduling and task parallelism, as its foundation. Chapters 3 and 4 investigate the trade-offs between various scheduling strategies, contributing a theoretical base on which to build a solution to the problem at hand. Chapters 5 and 6 then propose and evaluate a solution which touches upon all parts of the stack, with almost no changes to the established interfaces between components and layers.

Chapter 2 explores architectural characteristics of modern CMPs and their effect on software. This thesis axiomatically accepts the implications these novel hardware designs bring onto software. The abundance of processing cores, coupled with elaborate cache hierarchies and complicated coherence protocols, increases the complexity of partitioning said resources among running applications. The various interconnect networks employed in connecting the plethora of components add one more dimension of complexity to scheduling, making locality an important criterion. In modern multi- and many-core architectures various components can be shared, inside or outside the processing unit; this sharing can result in significant contention effects, extensively degrading performance when multiprogramming. Hence, an effective scheduling strategy should increase locality within each process's allocation and reduce contention both among threads of the same process and across processes.


Chapter 3 discusses the state of the art in programming paradigms for efficiently utilizing modern CMPs. After a short introduction to the vast landscape of choices, a reasoning is given on why task parallelism is a solid choice to investigate further; task parallelism can accurately model parallelism in most forms and is also able to handle very fine-grained parallelism. The rest of the chapter discusses strategies for runtime scheduling of task-parallel applications, emphasizing work-stealing, the strategy used in this thesis. Current work-stealing schedulers employ randomness in selecting task placement, which hinders uniform placement and, in effect, quantifiable approximation of the available parallelism. The chapter serves to identify a set of required properties for a runtime scheduler in order to enable accurate, lightweight, dynamic adaptation without reducing performance.

Chapter 4 delves into the system layer, establishing the role the operating system scheduler plays in effective resource distribution. It redefines fair scheduling as equal relative performance degradation instead of equal access; the latter, a traditional approach from the era of single-processor machines, proves inefficient on modern CMPs because insufficient application parallelism produces excess synchronization, which wastes CPU time. Instead, applications should be given resources proportional to their true individual requirements for optimal performance, given overall demand and contention. The chapter ends by listing the properties an OS scheduler needs in order to provide the desired benefits.

Moving on to the next part of the thesis, chapter 5 defines specific principles for a dynamic online parallelism estimation method: a hardware-agnostic, history-independent parallelism metric; a complete and relaxed decision policy for the sufficiency of an allotment's size; and a lightweight, scalable implementation that makes the whole method practical. To that end two individual components are presented: DVS and Palirria. DVS is a simple drop-in replacement for any work-stealing scheduler's victim selection algorithm; it provides uniform task placement across workers and performance improvements for special parallelism profiles. Palirria builds upon DVS to provide a mechanism for approximating currently available parallelism; the approximation is based on actual future work, using a sample which converges to a fixed size as the allotment size expands.

Finally, chapter 6 introduces and evaluates Pond, a novel take on the principles of scheduler activations for system-level resource partitioning and thread management. Pond does away with the N:M threading model in favor of 1:1 threads that are shared by all processes; furthermore, it proposes a design for migrating thread control from the application to the OS, enabling trusted parallelism estimation and fast, lightweight context switches without TLB flushes. Resource distribution is performed on a hybrid space- and time-sharing basis.

7.2 Contributions

The contributions attributed to the work performed within the thesis include:

• An original analysis of the trade-offs of different resource partitioning strategies at the system level on multiprogrammed multi- and many-core architectures,

• An original analysis of the implications of different scheduling strategies at the user level for dynamically adaptable task-parallel runtime schedulers,

• A novel deterministic approach to work-stealing scheduling of task-parallel applications, with mathematically proved properties and evaluated performance,

• A novel method for the estimation of the resource requirements of task-based work-stealing parallel applications, with mathematically proved properties and evaluated performance,

• A comparative evaluation of closely related work on semi-random work-stealing scheduling, with and without a cycle-based requirements estimation method,

• A novel implementation design of system provided runtime scheduling for trusted parallelism estimation, with lightweight context switching,

• A space- and time-sharing hybrid resource allocation method for uniform performance degradation of co-runners on multiprogrammed multi- and many-core architectures.

Some work included in this thesis was conducted in collaboration with others.

• The systems group at ETH and Tim Harris of MSR provided sufficient thread support in Barrelfish and established guidelines on building a thread manager on top of the multikernel.

• Karl-Filip Faxén, developer of the WOOL work-stealing runtime, assisted in porting WOOL to Barrelfish, and also advised on integrating the DVS and Palirria implementations into WOOL.

• Artur Podobas, PhD student at KTH ICT, assisted in porting the BOTS benchmarking suite to WOOL.

• Ananya Muddukrishna, PhD student at KTH ICT, assisted in characterizing task parallel applications and specifically those included in the BOTS benchmark suite.

• Chau Ho Bao Le, KTH MSc student, ported the Cilk++ work-stealing runtime library to Barrelfish [12].

• Eirini Delikoura, KTH MSc student, developed a prototype Barrelfish implementation for threads shared between multiple processes.

7.3 Future work

The field of parallel runtime scheduling has been the focus of extensive research for some decades now. Thus there have been a large number of proposed methods with as many implementations. This has led to a fragmented solution space, making it very difficult to devise generic solutions for the cooperation between application and operating system. Moreover, CMPs multiprogrammed with fine-grained parallel applications have only recently started to be thoroughly investigated. Consequently there is plenty of ground to be covered before arriving at a comprehensive solution. The rest of this section lists several improvements to the contributions of this thesis in order to generalize their functionality.

• Data dependency resolution support in Palirria: The current design and implementation of Palirria does not directly support some advanced features of task-parallel runtime schedulers, most prominently data dependencies for emulating dataflow programming. Palirria is designed to count every spawned task as instantly available parallelism; with data dependencies, tasks are not ready until those dependencies are resolved. A mechanism that differentiates between ready and not-ready tasks in a cost-effective manner would allow Palirria to count only the ready tasks. A naive approach would be to implement two task queues: one on which tasks are placed upon spawning, from which they are transferred to the second once ready. Palirria would then iterate only the second task queue (a sketch of this idea is given after this list).

• Dynamic Palirria interval: The interval at which Palirria is invoked to decide parallelism availability is a critical factor in its accuracy. Workloads can exhibit rapid fluctuations which, if captured with a very tight interval, can lead to thrashing by continuously alternating between utilization states. If the interval is long, Palirria might be slow in addressing changes, missing opportunities to improve performance further. Our evaluation used a manually set interval length, the same for all workloads and implementations, which could potentially be suboptimal. To that end, a mechanism which could dynamically adjust the interval length could prove beneficial. The current Palirria implementation already supports a feature that detects rapid fluctuations and discards them until the same non-balanced outcome persists; this mechanism, however, depends on the interval being quite small. A more sophisticated mechanism would be to work with subintervals and a continuously adjusted primary interval length. The subintervals gather samples, while the primary interval (every Nth subinterval) uses the samples to make decisions. By adjusting N, the primary interval's length becomes more or less sensitive to small fluctuations.

• Implement and evaluate Threads-as-a-Service inside an OS kernel: Chapter 6 proposed a design for how user threads could be provided as a service by the OS. This design consists of two major components: 1) providing a thread-like execution environment, one pinned on each physical core, and 2) implementing memory protection through privilege modes. The first component is trivial to accomplish. However, the second requires a radical redesign of memory management. Use of a tagged TLB is mandatory, a feature which is not standard and has been implemented differently by all major processor providers. Barrelfish is an interesting platform on which to implement such features; E. Delikoura's work [46] resolved the first issue of

having OS-provided threads dynamically load user binaries, but it still lacks the memory protection sandbox implementation.

• Extend Pond to handle generic parallel and non-parallel applications: A system scheduler cannot be customized to favour only a certain type of application. For Pond to be truly efficient, it needs to consider contention from applications it is not handling. These can be parallel applications that do not provide parallelism feedback, as well as non-parallel applications. For Pond it does not really matter whether the applications providing feedback are task-parallel, as long as the feedback follows the principles discussed in chapter 6. Support for non-parallel applications is straightforward: such an application is started on its source worker but never requests to expand its allotment. Applications that do not provide feedback will either start with a specific number of threads fixed throughout their execution, or use thread control commands to privately alter their allotment size. These calls can be captured by Pond and treated as requests, which it can either service or reject by having said call return an error code, which is normal behaviour for most threading libraries.

• Improve Pond accuracy by considering hardware monitoring information: Parallelism feedback acquired through the semantics of runtime scheduling cannot provide information on the application's behaviour towards memory. When deciding to time-share cores or share cache banks, knowing the expected cache requirements can be the key to improved placement. Of course, with task-parallel applications this has little gain unless all of the application's tasks have uniform requirements, since a worker could steal any task with a varying memory footprint. Moreover, online hardware monitoring can only profile tasks that have already been executed, and it is highly possible that they will differ from tasks further down the task tree.
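As a concrete illustration of the two-queue idea mentioned in the first item above, the sketch below keeps spawned tasks with unresolved dependencies on a pending queue and promotes them to a ready queue once a (hypothetical) dependency counter reaches zero, so that only the ready queue would contribute to the parallelism estimate. The structures and function names are invented for this sketch; this is not Palirria's implementation.

#include <stdio.h>

struct task {
    int unresolved_deps;            /* dependencies still outstanding */
    struct task *next;
};

struct queue {
    struct task *head;
    int len;
};

static void push(struct queue *q, struct task *t)
{
    t->next = q->head;
    q->head = t;
    q->len++;
}

/* Spawn: ready tasks go straight to the ready queue, others wait. */
static void spawn(struct queue *pending, struct queue *ready, struct task *t)
{
    push(t->unresolved_deps == 0 ? ready : pending, t);
}

/* Called when one of t's dependencies completes; promotes t when it
 * becomes ready (a real runtime would also unlink it from pending). */
static void resolve_dependency(struct queue *ready, struct task *t)
{
    if (--t->unresolved_deps == 0)
        push(ready, t);
}

/* Only the ready queue contributes to the parallelism estimate. */
static int countable_parallelism(const struct queue *ready)
{
    return ready->len;
}

int main(void)
{
    struct queue pending = {0}, ready = {0};
    struct task a = { .unresolved_deps = 0 };
    struct task b = { .unresolved_deps = 1 };

    spawn(&pending, &ready, &a);
    spawn(&pending, &ready, &b);
    printf("ready tasks: %d\n", countable_parallelism(&ready)); /* prints 1 */

    resolve_dependency(&ready, &b);
    printf("ready tasks: %d\n", countable_parallelism(&ready)); /* prints 2 */
    return 0;
}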

Bibliography

[1] Adapteva. E64G401 Epiphany 64-Core Microprocessor Datasheet. Technical re- port, Adapteva, 2014.

[2] Adapteva Inc. Epiphany Architecture Reference (G3), 2011.

[3] Kunal Agrawal, Yuxiong He, Wen Jing Hsu, and Charles Leiserson. Adaptive scheduling with parallelism feedback. In Proceedings of the eleventh ACM SIG- PLAN symposium on Principles and practice of parallel programming - PPoPP ’06, pages 100–109, New York, New York, USA, 2006. ACM Press.

[4] Kunal Agrawal, Charles Leiserson, Yuxiong He, and Wen Jing Hsu. Adaptive work-stealing with parallelism feedback. ACM Transactions on Computer Systems, 26(3):1–32, September 2008.

[5] Kento Aida, Hironori Kasahara, and Seinosuke Narita. Job Scheduling Scheme for Pure Space Sharing among Rigid Jobs. In Dror G. Feitelson and Larry Rudolph, editors, Workshop on Job Scheduling Strategies for Parallel Processing, chapter 4, pages 98–121. Springer Berlin Heidelberg, Orlando, FL, 1 edition, 1998.

[6] George Georgevich Anderson. Operating System Scheduling Optimization. Doc- toral, University of Johannesburg, 2013.

[7] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.

[8] Apple Inc. Grand Central Dispatch A better way to do multicore., 2009.

[9] Eduard Ayguad, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Federico Massaioli, Ernesto Su, Priya Unnikrishnan, and Guansong Zhang. A proposal for task parallelism in OpenMP. In Proceedings of the 3rd International Workshop on OpenMP: A Practical Programming Model for the Multi-Core Era, pages 1—-12, Beijing, China, 2008. Springer-Verlag Berlin Heidelberg.

[10] Eduard Ayguadé, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Fed- erico Massaioli, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang. The


design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009.

[11] Reza Azimi, David K. Tam, Livio Soares, and Michael Stumm. Enhancing operating system support for multicore processors by using hardware performance monitoring. ACM SIGOPS Operating Systems Review, 43(2):56–65, 2009.

[12] Chau Ho Bao Le. Porting Cilk to the Barrelfish OS. Msc thesis, KTH Royal Institute of Technology, 2013.

[13] Andrew Baumann, Paul Barham, P.E. Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, A. Schupbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd symposium on Operating Systems Principles, pages 29–44. ACM, 2009.

[14] Andrew Baumann, Simon Peter, Adrian Schupbach, Akhilesh Singhania, Timothy Roscoe, Paul Barham, and Rebecca Isaacs. Your computer is already a distributed system. why isn’t your OS? In Proceedings of the 12th conference on Hot topics in operating systems, page 12. USENIX Association, 2009.

[15] Adam Belay, Andrea Bittau, and Ali Mashtizadeh. Dune: safe user-level access to privileged cpu features. OSDI ’12: Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, pages 335–348, 2012.

[16] Shane Bell, Bruce Edwards, John Amann, Rich Conlin, Kevin Joyce, Vince Leung, John MacKay, Mike Reif, Liewei Bao, John Brown, Matthew Mattina, Chyi Chang Miao, Carl Ramey, David Wentzlaff, Walker Anderson, Ethan Berger, Nat Fairbanks, Durlov Khan, Froilan Montenegro, Jay Stickney, and John Zook. TILE64 processor: A 64-core SoC with mesh interconnect. Technical report, EZChip, 2008.

[17] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steven D Gribble. Deterministic Pro- cess Groups in dOS. Proceedings of the Ninth USENIX Symposium on Operating Systems Design and Implementation, pages 177–192, 2010.

[18] Enrico Bini, Giorgio Buttazzo, Johan Eker, Stefan Schorr, Raphael Guerra, Gerhard Fohler, Karl-Erik Årzén, Vanessa Romero Segovia, and Claudio Scordino. Resource management on multicore systems: The ACTORS approach. IEEE Micro, 31(3):72–81, 2011.

[19] Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova, and Mohammad Dashti. A Case for NUMA-aware Contention Management on Multicore Systems. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 557–558, 2010.

[20] GE Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, 1996.

[21] Guy Blelloch and John Greiner. Parallelism in sequential functional languages. In Proceedings of the seventh international conference on Functional programming languages and computer architecture - FPCA ’95, pages 226–237, 1995.

[22] Guy E. Blelloch and Phillip B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algo- rithms and architectures - SPAA ’04, page 235, 2004.

[23] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient schedul- ing for languages with fine-grained parallelism. Journal of the ACM, 46(2):281–321, 1999.

[24] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. ACM SigPlan Notices, 30(8):207–216, 1995.

[25] Robert Blumofe and Charles Leiserson. Scheduling multithreaded computations by work stealing. 35th Annual Symposium on Foundations of Computer Science, 1994 Proceedings., 46(5):356–368, September 1994.

[26] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leis- erson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. Proceedings of the fifth ACM SIGPLAN symposium on Principles and prac- tice of parallel programming - PPOPP ’95, 37(1):207–216, 1995.

[27] Shekhar Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23– 29, 1999.

[28] S Boyd-Wickizer, R Morris, and M F Kaashoek. Reinventing scheduling for mul- ticore systems. In Proceedings of the 12th conference on Hot topics in operating systems, page 21, 2009.

[29] S.a. Brandt, S. Banachowski, and T. Bisson. Dynamic integrated scheduling of hard real-time, soft real-time, and non-real-time processes. In Real-Time Systems Sym- posium, 2003. RTSS 2003. 24th IEEE, pages 396–407. IEEE Comput. Soc, 2003.

[30] Yangjie Cao, Hongyang Sun, Depei Qian, and Weiguo Wu. Stable Adaptive Work- Stealing for Concurrent Multi-core Runtime Systems. In 2011 IEEE International Conference on High Performance Computing and Communications, pages 108–115. Ieee, September 2011.

[31] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS directories: A scalable cache coherence scheme. In Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, pages 224–234, 1991.

[32] D Chandra, S Kim, and Y Solihin. Predicting the Impact of Cache Contention on a Chip Multiprocessor Architecture. In IBM T.J. Watson Conference on Interaction between Architecture, Circuits, and Compilers, pages 340 – 351, 2004.

[33] Quan Chen, Minyi Guo, and Zhiyi Huang. CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In ICS ’12 Proceed- ings of the 26th ACM international conference on Supercomputing, pages 163–172, 2012.

[34] Shimin Chen, Phillip B Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C Mowry, and Chris Wilkerson. Scheduling Threads for Constructive Cache Sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures (SPAA ’07), pages 105–115, 2007.

[35] Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching Tsun Chou. De- Novo: Rethinking the memory hierarchy for disciplined parallelism. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pages 155–166, 2011.

[36] Lynn Choi and Pen Chung Yew. Hardware and compiler-directed cache coherence in large-scale multiprocessors: design considerations and performance study. IEEE Transactions on Parallel and Distributed Systems, 11(4):375–394, 2000.

[37] George Chrysos. Intel Xeon Phi Coprocessor - the Architecture. In Hot Chips Conference 2012, pages 1–8, Cupertino, CA, 2012. Intel Corporation.

[38] William D Clinger, Anne H Hartheimer, and Eric M Ost. Implementation strate- gies for continuations. In Proceedings of the 1988 ACM conference on LISP and functional programming - LFP ’88, pages 124–131, 1988.

[39] Juan A. Colmenares, Sarah Bird, Henry Cook, Paul Pearce, David Zhu, John Shalf, Steven Hofmeyr, Krste Asanovic, and John Kubiatowicz. Resource Management in the Tessellation Manycore OS. In HotPar 2nd USENIX Workshop on Hot Topics in Parallelism, pages 1–6. USENIX Association Berkeley, CA, USA, 2010.

[40] Pat Conway and Bill Hughes. The AMD Opteron Northbridge architecture. IEEE Micro, 27(2):10–21, 2007.

[41] Pat Conway and Nathan Kalyanasundharam. Cache hierarchy and memory subsys- tem of the AMD Opteron processor. Micro, IEEE, 30(2):16–29, 2010.

[42] Julita Corbalan, Xavier Martorell, and Jesus Labarta. Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000. In Proceedings of Supercomputing, pages 121–129, 2003.

[43] Timothy Creech, Aparna Kotha, and Rajeev Barua. Efficient multiprogramming for multicores with SCAF. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-46, pages 334–345, New York, New York, USA, 2013. ACM Press.

[44] L Dagum and R Menon. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, 5(October):1–16, 1998.

[45] Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, and Roy H. Campbell. Cu- riOS: Improving Reliability through Operating System Structure. USENIX Sympo- sium on Operating Systems Design and Implementation (OSDI), pages 59–72, 2008.

[46] Eirini Delikoura. Thread Dispatching in Barrelfish. Msc thesis, KTH Royal institute of Technology, 2014.

[47] Xiaotie Deng and Patrick Dymond. On multiprocessor system scheduling. In Pro- ceedings of 8thSymposium on Parallel Algorithms and Architectures (SPAA), vol- ume 392, pages 82–88, 1996.

[48] Xiaotie Deng, Nian Gu, Tim Brecht, and Kaicheng Lu. Preemptive Scheduling of Parallel Jobs on Multiprocessors. SIAM Journal on Computing, 30(1):145–160, 2000.

[49] Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse dis- tance analysis. ACM SIGPLAN Notices, 38(5):245, 2003.

[50] Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. OmpSs: a proposal for programming Heteroge- neous multi-core architectures. Parallel Processing Letters, 21(2):173–193, 2011.

[51] Alejandro Duran, Julita Corbalán, and Eduard Ayguadé. Evaluation of openmp task scheduling strategies. In Rudolf Eigenmann and Bronis R de Supinski, editors, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 100–110. Springer Berlin Heidelberg, West Lafayette, IN, USA, 2008.

[52] Alejandro Duran, Xavier Teruel, and Roger Ferrer. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In ICPP ’09 Proceedings of the 2009 International Conference on Parallel Processing, pages 124–131. IEEE Computer Society Press, 2009.

[53] Jeff Edmonds. Scheduling in the Dark. Theoretical Computer Science, 235(1):109– 141, 2007.

[54] Jeff Edmonds, Donald D. Chinn, Tim Brecht, and Xiaotie Deng. Non-clairvoyant multiprocessor scheduling of jobs with changing execution characteristics. Journal of Scheduling, 6(3):231–250, 2003.

[55] EE Times. Intel cancels Tejas, moves to dual-core designs, 2004.

[56] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, 17(5):12–19, 1997.

[57] Hesham El-Rewini and Mostafa Abd-El-Barr. Advanced Computer Architecture and Parallel Processing. Wiley-Interscience, 1 edition, 2005.

[58] EZChip. EZchip Introduces TILE-Mx100 World's Highest Core-Count ARM Processor Optimized for High-Performance Networking Applications, 2015.

[59] Karl-Filip Faxen. Efficient Work Stealing for Fine Grained Parallelism. In 2010 39th International Conference on Parallel Processing, pages 313–322. IEEE, September 2010.

[60] KF Faxen and John Ardelius. Manycore work stealing. In Proceedings of the 8th ACM International Conference on Computing Frontiers, page 1, New York, New York, USA, 2011. ACM Press.

[61] K.F. Faxén. Wool - a work stealing library. ACM SIGARCH Computer Architecture News, 36(5):93–100, 2009.

[62] Alexandra Fedorova, Sergey Blagodurov, and Sergey Zhuravlev. Managing contention for shared resources on multicore processors. Communications of the ACM, 53(1):49–57, 2010.

[63] Alexandra Fedorova, Margo Seltzer, Christopher Christoper Small, and Daniel Nussbaum. Performance of Multithreaded Chip Multiprocessors And Implications For Operating System Design. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC ’05), pages 26–26, 2005.

[64] DG Feitelson. Job scheduling in multiprogrammed parallel systems (extended version). IBM Research Report RC19790 (87657) 2nd Revision, 16(1):104–113, May 1997.

[65] DG Feitelson and L Rudolph. Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing, pages 1–34. Springer-Verlag New York Inc, 1997.

[66] Matteo Frigo and CE Leiserson. The implementation of the Cilk-5 multithreaded language. In ACM Sigplan Notices, pages 212–223, New York, New York, USA, 1998. ACM Press.

[67] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, pages 87–100, 1999.

[68] Thierry Gautier, Fabien Lementec, Vincent Faucher, and Bruno Raffin. X-Kaapi: a Multi Paradigm Runtime for Multicore Architectures. Technical report, INRIA, 2012.

[69] Roberto Gioiosa, Sally A. McKee, and Mateo Valero. Designing OS for HPC applications: Scheduling. In Proceedings - IEEE International Conference on Cluster Computing, ICCC, pages 78–87, 2010.

[70] Brice Goglin and Nathalie Furmento. Enabling high-performance memory migration for multithreaded applications on linux. In IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium, pages 1–17, 2009.

[71] Philipp Gschwandtner. Performance analysis and benchmarking of the intel scc. In Proceedings of the 2011 IEEE International Conference on Cluster Computing (Cluster 2011), pages 139–149. IEEE, September 2011.

[72] Linley Gwennap. ADAPTEVA: More FLOPS, Less Watts. Technical report, Adapteva, 2011.

[73] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, and Ming Wu. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI '08), pages 43–57, 2008.

[74] Mary W. Hall and Margaret Martonosi. Adaptive parallelism in compiler-parallelized code. Concurrency: Practice and Experience, 10(14):1235–1250, 1998.

[75] Yuxiong He, W.J. Hsu, and C. Leiserson. Provably efficient two-level adaptive scheduling. In Job Scheduling Strategies for Parallel Processing, pages 1–32. Springer, 2007.

[76] Jim Held and Sean Koehl. Introducing the Single-chip Cloud Computer. Technical report, Intel Labs, 2010.

[77] Andrew Herdrich, Ramesh Illikkal, Ravi Iyer, Don Newell, Vineet Chadha, and Jaideep Moses. Rate-based QoS techniques for cache/memory in CMP platforms. In Proceedings of the 23rd international conference on Conference on Supercomputing - ICS '09, pages 479–488, 2009.

[78] Maurice Herlihy and Nir Shavit. The art of multiprocessor programming. Morgan Kaufmann Publishers Inc, 2008.

[79] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11), pages 295–308. USENIX Association, 2011.

[80] Reiner Horst, Panos M Pardalos, and Nguyen Van Thoai. Introduction to global optimization. Springer US, 2 edition, 2000.

[81] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer-Verlag Berlin Heidelberg, 3 edition, 1996.

[82] Jason Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal, David Finan, Gre- gory Ruhl, David Jenkins, Howard Wilson, Nitin Borkar, Gerhard Schrom, Fabrice Pailet, Shailendra Jain, Tiju Jacob, Satish Yada, Sraven Marella, Praveen Salihun- dam, Vasantha Erraguntla, Michael Konow, Michael Riepen, Guido Droege, Joerg Lindemann, Matthias Gries, Thomas Apel, Kersten Henriss, Tor Lund-Larsen, Se- bastian Steibl, Shekhar Borkar, Vivek De, Rob Van Der Wijngaart, and Timothy Mattson. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. Digest of Technical Papers - IEEE International Solid-State Circuits Conference, 53(2):108–109, 2010.

[83] Costin Iancu and Steven Hofmeyr. Oversubscription on multicore processors. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010, pages 1–11, 2010.

[84] Ravi Iyer. CQoS : A Framework for Enabling QoS in Shared Caches of CMP Plat- forms. In Proceedings of the 18th annual international conference on Supercom- puting, pages 257–266, 2004.

[85] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. ACM SIGMETRICS Performance Evaluation Review, 35(1):25–36, 2007.

[86] Yunlian Jiang, Xipeng Shen, Jie Chen, and Rahul Tripathi. Analysis and approx- imation of optimal co-scheduling on chip multiprocessors. In 17th international conference on Parallel architectures and compilation techniques (PACT ’08), pages 220–229, 2008.

[87] Scott F Kaplan, Tongping Liu, Emery D Berger, and J Eliot B Moss. Redline: First Class Support for Interactivity in Commodity Operating Systems. Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI '08), pages 73–86, 2008.

[88] Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for Power-Efficiency, volume 3. Morgan & Claypool Publishers, 2008.

[89] John H. Kelm, Daniel R. Johnson, William Tuohy, Steven S. Lumetta, and Sanjay J. Patel. Cohesion: An adaptive hybrid memory model for accelerators. IEEE Micro, 31(1):42–55, 2011.

[90] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 23(2):66–76, 2003.

[91] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings. 13th International Conference on Par- allel Architecture and Compilation Techniques, 2004. PACT 2004., pages 111–122, 2004.

[92] Wooyoung Kim and Michael Voss. Multicore desktop programming with Intel Threading Building Blocks. IEEE Software, 28(1):23–31, 2011.

[93] Kevin Klues, Barret Rhoden, and Y Zhu. Processes and resource management in a scalable many-core OS. In HotPar’10: Proc. 2nd Workshop on Hot Topics in Parallelism. USENIX Association, 2010.

[94] Rob Knauerhase, Paul Brett, Barbara Hohlt, Tong Li, and Scott Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54– 66, 2008.

[95] Jongmin Lee, Seokin Hong, and Soontae Kim. TLB index-based tagging for cache energy reduction. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 85–90, 2011.

[96] Jongmin Lee and Soontae Kim. Adopting TLB index-based tagging to data caches for tag energy reduction. In Proceedings of the 2012 ACM/IEEE international sym- posium on Low power electronics and design - ISLPED ’12, page 231, 2012.

[97] Charles Leiserson and Aske Plaat. Programming Parallel Applications in Cilk. SINEWS SIAM News, 31(4):122–135, 1997.

[98] Scott T. Leutenegger and Mary K. Vernon. The performance of multiprogrammed multiprocessor scheduling algorithms. ACM SIGMETRICS Performance Evaluation Review, 18(1):226–236, 1990.

[99] David J. Lilja. Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons. ACM Computing Surveys, 25(3):303–338, 1993.

[100] M.D. Linderman, J.D. Collins, Hong Wang, and T.H. Meng. Merge: a programming model for heterogeneous multi-core systems. ACM SIGOPS Operating Systems Review, 42(2):287–296, 2008.

[101] Tilera LTD. Tile processor architecture, 2007.

[102] Shiyi Lu and Qing Li. Improving the Task Stealing in Intel Threading Building Blocks. In Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2011, pages 343–346, Beijing, October 2011. IEEE.

[103] Peter S Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, February 2002.

[104] Jason Mars and Robert Hundt. Scenario based optimization: A framework for stati- cally enabling online optimizations. In Proceedings of the 2009 CGO - 7th Interna- tional Symposium on Code Generation and Optimization, pages 169–179, 2009.

[105] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sen- sible Co-locations. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 ’11, page 248, 2011.

[106] Jason Mars, Neil Vachharajani, Robert Hundt, and Mary Lou Soffa. Contention aware execution: online contention detection and response. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimiza- tion (CGO ’10), pages 257–265, 2010.

[107] BD Marsh and ML Scott. First-class user-level threads. In ACM symposium on Operating systems principles, pages 110–121, New York, New York, USA, 1991. ACM Press.

[108] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache coherence is here to stay. Communications of the ACM, 55(7):78, 2012.

[109] Xavier Martorell, Julita Corbalán, Dimitrios S Nikolopoulos, Nacho Navarro, Eleft- herios D Polychronopoulos, Theodore S Papatheodorou, and Jesús Labarta. A Tool to Schedule Parallel Applications on Multiprocessors: The NANOS CPU Manager. Job Scheduling Strategies for Parallel Processing Proceedings, 1911:87–112, 2000.

[110] Cathy McCann, Raj Vaswani, and John Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Transactions on Computer Systems, 11(2):146–178, May 1993.

[111] Michael McCool, James Reinders, and Arch Robison. Structured Parallel Program- ming: Patterns for Efficient Computation. Elsevier Science, 1 edition, 2012.

[112] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Mem- ory performance and cache coherency effects on an intel nehalem multiprocessor system. In Parallel Architectures and Compilation Techniques - Conference Pro- ceedings, PACT, pages 261–270, 2009.

[113] Miquel Moreto, Francisco J Cazorla, Alex Ramirez, Rizos Sakellariou, and Mateo Valero. FlexDCP: a QoS framework for CMP architectures. ACM SIGOPS Operating Systems Review, 43(2):86–96, 2009.

[114] Rajeev Motwani, Steven Phillips, and Eric Torng. Nonclairvoyant scheduling. The- oretical Computer Science, 130(1):17–47, 1994.

[115] MPI Forum. MPI: A Message-Passing Interface Standard. Technical report, MPI Forum, 2015.

[116] Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. Fair queu- ing memory systems. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, pages 208–219, 2006.

[117] Kyle J. Nesbit, James Laudon, and James E. Smith. Virtual private caches. ACM SIGARCH Computer Architecture News, 35(2):57–68, 2007.

[118] A Neumaier. Complete search in continuous global optimization and constraint satisfaction. In Acta Numerica, volume 13, pages 271–369, 2004.

[119] Frank Neumann. Computational complexity analysis of multi-objective genetic pro- gramming. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference - GECCO ’12, page 799, 2012.

[120] S. L. Olivier, a. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP task scheduling strategies for multicore NUMA systems. International Journal of High Performance Computing Applications, 26(2):110–124, 2012.

[121] Kunle Olukotun and Lance Hammond. The future of microprocessors. Queue - Multiprocessors, 3(7):26–29, 2006.

[122] Kunle Olukotun, Basem a. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. The case for a single-chip multiprocessor. ACM SIGPLAN Notices, 31(9):2– 11, 1996.

[123] Lu Peng, Jih-kwon Peir, Tribuvan K Prakash, Yen-kuang Chen, and David Koppel- man. Memory Performance and Scalability of Intel’s and AMD’s Dual-Core Proces- sors: A Case Study. In Performance, Computing, and Communications Conference, 2007. IPCCC 2007, pages 55–64, New Orleans, LA, 2007. Ieee.

[124] Chuck Pheatt. Intel threading building blocks. Journal of Computing Sciences in Colleges, 23(4):298, 2008.

[125] J. Planas, R. M. Badia, E. Ayguade, and J. Labarta. Hierarchical Task-Based Pro- gramming With StarSs. International Journal of High Performance Computing Ap- plications, 23(3):284–299, 2009.

[126] J Planas and RM Badia. Hierarchical task-based programming with StarSs. International Journal of High Performance Computing Applications, 23(3):284–299, 2009.

[127] Antoniu Pop and Albert Cohen. OpenStream: Expressiveness and data-flow com- pilation of OpenMP streaming programs. ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Archi- tectures and Compilers, 9(4), 2013.

[128] Jeff Preshing. A Look Back at Single-Threaded CPU Performance, 2012.

[129] James Reinders. Intel threading building blocks: outfitting C++ for multi-core pro- cessor parallelism. O’Reilly Media, Sebastopol, CA, 1 edition, 2007.

[130] Arch Robison, Michael Voss, and Alexey Kukanov. Optimization via reflection on work stealing in TBB. IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD- ROM, 2008.

[131] Emilia Rosti, Evgenia Smirni, Giuseppe Serazzi, and Lawrence W Dowdy. Analysis of Non-Work-Conserving Processor Partitioning Policies. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 165–181, 1995.

[132] Jeffrey B Rothman and Alan Jay Smith. Minerva: An adaptive subblock coherence protocol for improved SMP performance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 2327 LNCS, pages 64–77, 2002.

[133] Adrian Schupbach, Simon Peter, Andrew Baumann, Timothy Roscoe, Paul Barham, Tim Harris, and Rebecca Isaacs. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems. Citeseer, 2008.

[134] Siddhartha Sen. Dynamic processor allocation for adaptively parallel work-stealing jobs. PhD thesis, Massachusetts Institute of technology, 2004.

[135] Oliver Sinnen. Task scheduling for parallel systems. Wiley, 2007.

[136] Livio Soares, David Tarn, and Michael Stumm. Reducing the harmful effects of last- level cache polluters with an os-level, software-only pollute buffer. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, pages 258– 269, 2008.

[137] G.E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In Proceedings Eighth International Symposium on High Performance Computer Architecture, 2002.

[138] M. Aater Suleman, Moinuddin K. Qureshi, and Yale N. Patt. Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. SIGARCH Comput. Archit. News, 36:277–286, 2008.

[139] H Sutter. The free lunch is over: A fundamental turn toward concurrency in soft- ware. Dr. Dobb’s Journal, pages 1–9, 2005.

[140] Tilera. Tile Processor Architecture Overview for the TILEPro Series. Technical report, Tilera Corporation, 2013.

[141] A. Tucker and A. Gupta. Process control and scheduling issues for multipro- grammed shared-memory multiprocessors. ACM SIGOPS Operating Systems Re- view, 23(5):159–166, November 1989.

[142] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: Maximiz- ing on-chip parallelism. In Proceedings 22nd Annual International Symposium on Computer Architecture, pages 392–403, 1995.

[143] Ashlee Vance. Intel says Adios to Tejas and Jayhawk chips, 2004.

[144] Georgios Varisteas and Mats Brorsson. DVS: Deterministic Victim Selection to Improve Performance in Work-Stealing Schedulers. In MULTIPROG 2014: Pro- grammability Issues for Heterogeneous Multicores, page 12. KTH Royal Institute of Technology, 2014.

[145] Georgios Varisteas and Mats Brorsson. Palirria: Accurate On-line Parallelism Es- timation for Adaptive Work-Stealing. In PMAM’14 Proceedings of Programming Models and Applications on Multicores and Manycores, pages 120–131. ACM New York, NY, USA, 2014.

[146] Georgios Varisteas, Mats Brorsson, and KF Faxen. Resource management for task- based parallel programs over a multi-kernel. In RESoLVE ’12, Second workshop on Runtime Environments, Systems, Layering and Virtualized Environments, London UK. March 3, 2012, page 7. KTH Royal Institute of Technology, 2012.

[147] Girish Venkatasubramanian, Renato J. Figueiredo, and Ramesh Illikkal. On the performance of tagged translation lookaside buffers: A simulation-driven analysis. In Proceedings of the 19th Annual IEEE International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 139– 149, 2011.

[148] Girish Venkatasubramanian, Renato J. Figueiredo, Ramesh Illikkal, and Donald Newell. TMT: A TLB tag management framework for virtualized platforms. In- ternational Journal of Parallel Programming, 40(3):353–380, 2012.

[149] Dave Wagner and Brad Calder. Leapfrogging: a portable technique for implementing efficient futures. In PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 208–217. ACM, 1993.

[150] Wei Wang, Baochun Li, and Ben Liang. Dominant Resource Fairness in Cloud Computing Systems with Heterogeneous Servers. In Proceedings of IEEE INFOCOM 2014, pages 1–10, Toronto, Canada, July 2013.

[151] Lichen Weng and Chen Liu. On Better Performance from Scheduling Threads According to Resource Demands in MMMP. 2010 39th International Conference on Parallel Processing Workshops, pages 339–345, September 2010.

[152] Yi Xu, Yu Du, Youtao Zhang, and Jun Yang. A composite and scalable cache coherence protocol for large scale CMPs. In Proceedings of the international conference on Supercomputing - ICS '11, page 285, 2011.

[153] Satoshi Yamada and Shigeru Kusakabe. Effect of context aware scheduler on TLB. In IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, page 8, 2008.

[154] Kelvin K. Yue and David J. Lilja. Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the Solaris operating system. Concurrency Computation Practice and Experience, 13(6):449–464, 2001.

[155] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSYS'13), pages 379–391. ACM New York, 2013.

[156] Sergey Zhuravlev, Sergey Blagodurov, and Alexandra Fedorova. Addressing shared resource contention in multicore processors via scheduling. ACM SIGARCH Computer Architecture News, 38(1):129, March 2010.

[157] Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. Survey of Scheduling Techniques for Addressing Shared Resources in Multicore Processors. ACM Computing Surveys (CSUR), 45(1):1–32, 2012.