Scalable System Software for High Performance Large-scale Applications

Alessandro Morari
Department of Computer Architecture
Universitat Politècnica de Catalunya

A thesis submitted for the degree of Doctor of Philosophy

27 May 2014

To my parents Fabrizio and Elena, to my sister Elisa, and to my grandparents Stefano and Vera.

Acknowledgements

I would like to thank my Thesis Director, Prof. Mateo Valero, for his continuous support and prompt attention, even when we were separated by a whole continent. I would also like to acknowledge my mentor Dr. Roberto Gioiosa, for believing in me from the beginning, and for challenging me with his thoughtful remarks. I would like to thank my friends and colleagues at Pacific Northwest National Laboratory, Dr. Oreste Villa and Dr. Antonino Tumeo. Their friendship and teachings changed me radically, transforming my time in the laboratory into an amazing experience. I would like to thank the director of the Center for Adaptive Supercomputer Software, Dr. John Feo, for his incredible heart and honesty, and for supporting me from the first day we met. I would also like to thank my group leader at the Barcelona Supercomputing Center, Dr. Francisco Cazorla, for continuously supporting my research activity and for patiently dealing with my laziness with bureaucratic procedures. I would like to acknowledge my colleagues in the CAOS group at the Barcelona Supercomputing Center, Marco Paolieri, Roberta Piscitelli, Carlos Luque, Carlos Boneti, Bojan Maric, Kamil Kedzierski, Victor Jimenez, Petar Radojkovic, Vladimir Kacarevic, Dr. Eduardo Quiñones, Dr. Jaume Abella and Dr. Miquel Moreto, for the interesting discussions and the beautiful moments shared when I was working in Barcelona. Also, I would like to thank my colleagues at IBM T. J. Watson, Dr. Robert Wisniewski, Dr. Bryan Rosenburg, Dr. Jose Brunheroto, Dr. Chen-Yong Cher and Dr. Alper Buyuktosunoglu, for sharing with me their knowledge and giving me the opportunity of working on extremely interesting problems. Finally, I am deeply grateful to my family for their incredible support, for their love and for giving me the gift of curiosity that has taken me where I stand today.

Abstract

In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The next generation of supercomputers will be able to deliver performance in the order of 10^18 floating-point operations per second (ExaFlop). The design of this new generation of supercomputers, called the Exascale generation, is the object of a large part of current research in HPC. The availability of large-scale models and simulations, together with large instruments like telescopes, colliders and light sources, has generated a growing volume of scientific data. Providing the computing resources to move, analyze and manage this exponentially growing volume of data is becoming the next big challenge, commonly referred to as big data. The steps toward the design of an Exascale supercomputer are increasingly pointing to the necessity of integrating Exascale with big data.

Over the years, the difficulty of efficiently managing large-scale system resources has repeatedly shown the importance of providing scalable system software. System software has evolved to keep pace with the evolution of technology and applications. Research on scalable system software for large-scale systems has historically been driven by two fundamental aspects: on one side, performance, efficiency and reliability; on the other side, the goal of providing a productive development environment.

Considering the challenges posed by next-generation large-scale high-performance systems and applications, it is clear that system software needs to be significantly updated, if not redesigned. In this Thesis, we propose approaches to measure, design and implement scalable system software. Overall, this Thesis proposes three orthogonal approaches that should be integrated into a new system-software layer for next-generation large-scale high-performance computing systems:

• To obtain scalable applications, a detailed measurement and understanding of system software performance and overhead is necessary. We design and implement a methodology to provide detailed measurements of system software interruptions and their effect on application performance.

• The runtime system is the right candidate to provide highly scalable system services and to exploit low-level hardware optimizations. Some system services could be moved from the operating system to the runtime system. A specialized runtime system for a class of applications can leverage a deeper knowledge of the application behaviour and exploit low-level hardware features to improve scalability. We design and implement a runtime system for the class of irregular applications and show a performance improvement of several orders of magnitude with respect to more traditional approaches.

• Exploiting low-level hardware features at user level (i.e., in the application) can lead to considerable performance improvements. We show two instances of optimizations that leverage architecture-specific hardware features and obtain significant performance improvements.

To enhance the performance of an entire class of applications, these approaches should be integrated into a specialized runtime system.

Contents

Contents

List of Figures

I Introduction and Background

1 Introduction
  1.1 Challenges of Next-generation High-performance Large-scale systems
  1.2 Proposed approach and methodologies
  1.3 Thesis contributions
  1.4 Thesis structure

2 Background: current trends in High Performance Computing
  2.1 Technology trends
  2.2 Scientific and large-scale applications
    2.2.1 Big data and exascale computing
    2.2.2 Data-intensive and Irregular Applications
  2.3 System software for large-scale systems
    2.3.1 Operating Systems
    2.3.2 Runtime Systems and Parallel Programming Models
    2.3.3 Trends in System Software

II Operating System Scalability

3 General Purpose Operating Systems
  3.1 Summary
  3.2 Experimental environment


  3.3 Related Work
  3.4 Measuring OS noise
    3.4.1 LTTng-noise
    3.4.2 Tracing scalability
    3.4.3 Analyzing FTQ with LTTng-noise
  3.5 Experimental Results
    3.5.1 Noise breakdown
    3.5.2 Page faults
    3.5.3 Scheduling
    3.5.4 Process preemption and I/O
    3.5.5 Periodic activities
  3.6 Noise disambiguation
    3.6.1 Disambiguation of qualitatively similar activities
    3.6.2 OS noise composition
    3.6.3 Conclusions

4 Light-weight Operating Systems
  4.1 Summary
  4.2 Experimental environment
  4.3 Related Work
  4.4 Memory management in general purpose OS and light-weight kernels
  4.5 Methodology
    4.5.1 Adding support for TLB misses to CNK
    4.5.2 Tracing TLB misses
    4.5.3 TLB noise injection
  4.6 Experimental results
    4.6.1 TLB pressure
    4.6.2 Analysis of TLB overhead at scale
    4.6.3 TLB noise injection
  4.7 Conclusions

III Runtime System Scalability

5 Scalable Runtimes for Distributed Memory Systems
  5.1 Summary
  5.2 Experimental environment


  5.3 Related Work
  5.4 Programming model and API
    5.4.1 PGAS communication model
    5.4.2 Loop parallelism program structure model
    5.4.3 Explicit data and code locality management
    5.4.4 Blocking and Non-blocking semantics
    5.4.5 Explicit synchronization
    5.4.6 Example
  5.5 Runtime architecture
    5.5.1 Overview
    5.5.2 Communication
    5.5.3 Aggregation
    5.5.4 Multithreading
  5.6 Experimental Evaluation
    5.6.1 Micro-benchmarks
    5.6.2 BFS
    5.6.3 Graph Random Walk
    5.6.4 Concurrent Hash Map Access
  5.7 Conclusions

IV User-level Scalability Exploiting Hardware Features

6 Exploiting Hardware Thread Priorities on Multi-threaded Architectures
  6.1 Summary
  6.2 Experimental environment
  6.3 Related Work
  6.4 POWER5 and POWER6 Microarchitecture
    6.4.1 POWER5 and POWER6 Core Microarchitecture
    6.4.2 Simultaneous Multi-Threading
    6.4.3 Software-controlled Hardware Thread Priorities
  6.5 Experimental Setup
    6.5.1 Experimental environment
    6.5.2 The Linux kernel modification
    6.5.3 Running the experiments
    6.5.4 Micro-benchmarks description


      6.5.4.1 Integer micro-benchmarks
      6.5.4.2 Floating point micro-benchmarks
      6.5.4.3 Memory micro-benchmarks
  6.6 Analysis of results
    6.6.1 Default Priorities
    6.6.2 Malleability
    6.6.3 Higher Priority
    6.6.4 Lower Priority
    6.6.5 Maximum priority difference
    6.6.6 Malleability of SPEC CPU2006
  6.7 Use cases
    6.7.1 Use case A - Load Balancing
    6.7.2 Use case B - Transparent threads
  6.8 Conclusions

7 Exploiting cache locality and network-on-chip to optimize sorting on a Many-core architecture
  7.1 Summary
  7.2 Experimental environment
  7.3 Tilera many-core processor
    7.3.1 Processor architecture
    7.3.2 Architectural characterization
  7.4 Background
    7.4.1 Radix sort
    7.4.2 Related work
  7.5 Parallel Radix Sort
    7.5.1 Local histogram
    7.5.2 Offset computation
    7.5.3 Keys distribution
  7.6 Optimizations
    7.6.1 Local Histogram
    7.6.2 Offset computation
    7.6.3 Keys Distribution
  7.7 Experimental evaluation
    7.7.1 Experimental setup
    7.7.2 Optimization effects


    7.7.3 Scaling
    7.7.4 Comparison
  7.8 Conclusions

8 Conclusions

Publications

References


List of Figures

1.1 Next-generation runtime systems will include additional system services and exploit low-level hardware features to optimize an entire class of applications.

2.1 Exponential growth of supercomputing power as recorded by the TOP500 [10]. #1 refers to the first entry in the list, #500 to the 500th, and Sum is the sum over all supercomputers in the list.

2.2 Energy consumption for data communication, difference between today and 2018 [24].

2.3 Percentage of the TOP500 using a given Operating System [10].

2.4 Effect of an OS interruption on a parallel application.

2.5 Programming model schema for current and next-generation runtime systems.

3.1 Measuring OS noise using FTQ and LTTng-noise. Figures 3.1a and 3.2a report the direct OS overhead obtained by multiplying the execution time of a basic operation by the number of missing operations.

3.2 Zoom on noise peak using FTQ and LTTng-noise. Figures 3.1b and 3.2b report time as measured by LTTng-noise.


3.3 FTQ Execution Trace. Figure 3.3a shows a part (75 ms) of the FTQ trace that highlights the periodic timer interrupts (black lines), the page faults (red line), and a process preemption (green line). Figure 3.3b zooms in and shows that the interruption consists of several kernel events. At this level of detail we can distinguish the timer interrupt (2.178 µsec, black) followed by the run_timer_softirq softirq (1.842 µsec, pink), the first part of the schedule (0.382 µsec, orange), the process preemption (2.215 µsec, green), and the second part of the schedule (0.179 µsec, orange).
3.4 OS noise breakdown for Sequoia benchmarks
3.5 Page fault time series
3.6 Page Fault Trace. Figures 3.6a and 3.6b show the execution trace of AMG and LAMMPS, respectively. We filtered out all the events but the page faults (red). The traces highlight the different distributions of page faults for AMG (throughout the execution) and LAMMPS (mainly located at the beginning and the end). Notice that in some regions page faults are very dense and appear as one large page fault when, in fact, there are thousands very close to each other but the pixel resolution does not allow distinguishing them.
3.7 Domain rebalance softirq time series
3.8 Process preemption experienced by LAMMPS. This picture shows the complete execution trace of LAMMPS. We filtered out all events but process preemptions (green). Though the pixel resolution does not allow distinguishing all the process preemptions, it is clear that LAMMPS suffers many frequent preemptions.
3.9 run_timer_softirq time series
3.10 AMG - Synthetic OS noise graph
3.11 Noise disambiguation

4.1 Performance of AMG in weak scaling mode with a non-optimized version of Linux and with CNK on BG/P. This figure shows the effects of Linux system noise on AMG at scale: compared to CNK, Linux shows lower performance (3.62x with 128 cores) and limited scalability.
4.2 Example of virtual-to-physical memory mapping for two MPI processes (SMP mode) in CNK.
4.3 TLB miss trace of 3.52 seconds of AMG execution.


4.4 TLB miss handler execution time distribution for CNK-TLB and Linux.
4.5 TLB miss overhead with varying page sizes and concurrent cores. This graph shows that the overall overhead is below 2% with 16MB pages for all the applications, while 1MB pages only guarantee an overhead below 2% for some applications.
4.6 TLB miss handler execution time with varying page sizes and concurrent cores. The graph shows that the time required to perform a virtual-to-physical translation in CNK-TLB changes depending on the page size (the larger the page, the longer the execution time), especially with pages larger than 64KB. The TLB miss handler execution time influences the applications' overall overhead. Missing 4K page bars refer to experiments that could not complete due to the high overhead with small pages (see Table 4.3). Also, the 1M page experiment with IRS is missing due to the very low number of TLB misses to sample (less than 1 TLB miss on average).
4.7 Empirical relationship between TLB misses/second and TLB handler execution time: when the TLB pressure is low (less than 1 TLB miss/second) the probability of cache misses increases, with the result that the TLB handler takes longer to complete.
4.8 Noise injection with constant noise signature. This graph shows that the overall overhead of injecting TLB misses with a constant signature is below 2% with 16MB pages for all the applications. With 1MB pages, instead, UMT shows considerable overhead.
4.9 Noise injection with random uniform noise signature. This graph shows that the overall overhead of injecting TLB misses with a random uniform signature is below 2% with 16MB pages for all the applications. With 1MB pages, instead, UMT shows considerable overhead.

5.1 Sequential queue-based BFS implementation
5.2 Parallel BFS implementation with GMT
5.3 Architecture overview of GMT
5.4 Bandwidth between two nodes using a single Communication Server and a single worker, with varying message size.
5.5 Aggregation mechanism
5.6 Fine-grain multithreading in GMT.


5.7 Transfer rates of put operations between 2 nodes while increasing concurrency. Multiple lines show the transfer rate with message sizes from 8 bytes to 128 bytes.
5.8 Transfer rates of put operations among 128 nodes (one to all) while increasing concurrency. This is the transfer rate of the outgoing messages from the source node. Multiple lines show the transfer rate with message sizes from 8 bytes to 128 bytes.
5.9 Million traversed edges per second for the GMT implementation of the BFS (weak scaling)
5.10 Million traversed edges per second for the BFS implementation on GMT, UPC, Cray XMT, OpenMP (strong scaling). The horizontal scale for GMT, XMT and UPC represents cluster nodes, while for OpenMP it represents cores.
5.11 Millions of steps per second for the random walk implementation on GMT and MPI (weak scaling)
5.12 Number of strings hashed and inserted per second (millions of accesses/s) for the GMT implementation of the Concurrent Hash Map Access benchmark. In the legend, W refers to the number of tasks and L to the number of accesses performed by each task.
5.13 Number of strings hashed and inserted per second (millions of accesses/s) for the MPI implementation of the Concurrent Hash Map Access benchmark. In the legend, W refers to the number of processes and L to the number of accesses performed by each process.

6.1 POWER5 and POWER6 architecture
6.2 Correlation between the IPC in ST mode normalized to the IPC in SMT mode 4/4 (IPC_ST / IPC_SMT^{4/4}, on the y-axis) and the malleability with priorities 6/2 (IPC_SMT^{6/2} / IPC_SMT^{4/4}, on the x-axis).
6.3 Malleability of the primary thread when its priority is higher than the priority of the secondary thread. The y-axis reports IPC_SMT^{P/Q} / IPC_SMT^{4/4} and the x-axis is the hardware priority pair for the primary and secondary threads (primary-priority/secondary-priority). Please note the different scale for cpu_int_add and ldint_l2.
6.4 Malleability of the primary thread when its priority is lower than the priority of the secondary thread. The y-axis reports IPC_SMT^{P/Q} / IPC_SMT^{4/4} and the x-axis is the hardware priority pair for the primary and secondary threads (primary-priority/secondary-priority).


6.5 Execution time of the primary thread running with priority 6 against a secondary thread with priority 1, normalized to the execution time in single-thread mode. The x-axis is the actual secondary-thread micro-benchmark.
6.6 Malleability of selected SPEC CPU2006 benchmarks using higher priorities for the primary thread. The y-axis reports IPC_SMT^{P/Q} / IPC_SMT^{4/4} and the x-axis is the actual pair of benchmarks running in SMT mode.
6.7 Transparent execution: percentage of the performance in single-thread mode for the foreground and background threads. The y-axis reports (IPC_SMT^{P/Q} / IPC_ST) × 100 and the x-axis is the actual pair of benchmarks running in SMT mode with priorities 6 and 1.

7.1 Read bandwidth.
7.2 Write bandwidth.
7.3 Write bandwidth for larger chunks of data.
7.4 Parallel prefix-sum with 8 threads.
7.5 Histogram computation bandwidth.
7.6 Prefix sum.
7.7 Key distribution bandwidth.
7.8 Throughput with and without optimizations for the TILEPro64 processor.
7.9 Throughput with varying number of threads, 240 MK and striped allocations.
7.10 Throughput with varying number of threads, 240 MK and allocations on specific memory controllers.
7.11 Throughput with varying dataset sizes and 62 threads.


Part I

Introduction and Background


Chapter 1

Introduction

In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available to a large number of scientists and engineers working on many different problems.

1.1 Challenges of Next-generation High-performance Large-scale systems

Supercomputing performance has been growing at a faster pace than Moore's law, thanks to the ability to leverage system scalability [10]. High Performance Computing (HPC) employs large clusters of multicore nodes, interconnected with a high-speed network. Today, scientific research and high-end technological firms leverage HPC computational resources to perform extremely complex computations in order to solve a large variety of problems. HPC is a necessary tool to face important scientific and technological challenges that might have a huge impact on human life. Some of the traditional HPC applications include molecular dynamics, quantum chemistry, fluid dynamics and nuclear simulations. Besides the traditional ones, new applications are emerging, such as improving therapy efficiency, extreme weather forecasting, energy-efficient aircraft design and materials science analysis [25].

Furthermore, an entirely new paradigm of scientific discovery is emerging. The availability of large-scale models and simulations and the availability of very large instruments like colliders and light sources has generated a growing volume of scientific data [48]. This very large volume of data has the potential of further improving scientific research and our understanding of many problems.

Nonetheless, large-scale computing systems were not designed to handle such a high data volume and are currently not efficient at this task. The design of next-generation supercomputers will include traditional HPC requirements as well as new requirements to handle data-intensive computations.

Data-intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. A particular class of data-intensive applications is the class of irregular applications [67]. Applications of this class are called irregular because they may have irregular memory access patterns, irregular network communication and irregular instruction control logic. The unpredictable network and data access patterns have a significant impact on performance. In fact, modern large-scale HPC machines are not optimized for this type of application and their performance is sub-optimal [159]. A large number of relevant applications fall into this class: bioinformatics, big-science applications, complex network analysis, community detection, data analytics, natural language processing, pattern recognition, semantic databases and, in general, knowledge discovery applications. The challenge of supporting this class of applications is affecting the design of next-generation HPC hardware and software. Large projects [153] have been following this approach. Current large-scale systems focus on regular computations and data access patterns and exploit complex cache-based architectures to reduce data access latencies.

The solution to this kind of problem includes a complete redesign of the whole software stack. Being at the bottom of the software stack, the system software is expected to change drastically to support the upcoming hardware and to meet new application requirements [102]. Regarding the operating system (OS), two different approaches have been taken: on one side, adapting a general purpose operating system (e.g. Linux); on the other side, developing a specialized OS for HPC. The former solution leverages the large software ecosystem and experience available for general purpose OSes (GPOS), while the latter guarantees better performance.

One of the drawbacks of using a GPOS is the difficulty of obtaining predictable performance [140]. Linux, for instance, is not designed to provide a constant, predictable execution time to the applications. It features a variety of system services that can interrupt the application at any moment. This phenomenon, called OS noise [137], has been frequently identified as an issue for large-scale systems. OS noise can, indeed, significantly reduce application scalability. With a GPOS, a detailed understanding of OS noise and its effects on application performance is necessary. Once measured, OS noise can be reduced by re-configuring or modifying the OS.

Operating systems that are specialized for HPC do not show significant OS noise, because they are designed to be lightweight (so-called lightweight kernels). Lightweight kernels (e.g. IBM CNK) guarantee predictable performance and optimal scalability [148]. The drawback of lightweight kernels is the lack of many features offered by a GPOS (e.g. full-featured dynamic memory allocation).

Because of the need for a low-overhead and low-noise OS, many features traditionally provided by the OS are being moved into the runtime system. The runtime system, because of its deeper knowledge of the running application, is the best candidate to provide services that depend on application-specific characteristics.
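To make the notion of OS noise measurement concrete, the following is a minimal sketch of a fixed-work-quantum probe in the spirit of FTQ (the benchmark used later in this Thesis for validation). It is only an illustration under simple assumptions: the constants, names and output format are chosen here for clarity and are not the tracing infrastructure developed in Part II.

```c
/* Minimal fixed-work-quantum noise probe (FTQ-like sketch).
 * Counts units of work completed in each fixed time quantum;
 * quanta with fewer completed units reveal time stolen by OS
 * interruptions (noise). */
#include <stdio.h>
#include <time.h>

#define QUANTUM_NS 1000000L   /* 1 ms quantum (illustrative value) */
#define NUM_QUANTA 1000
#define WORK_CHUNK 1000       /* iterations per unit of work */

static long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void) {
    static volatile long sink;
    long counts[NUM_QUANTA];

    for (int q = 0; q < NUM_QUANTA; q++) {
        long start = now_ns(), done = 0;
        /* Execute fixed-size chunks of busy work until the quantum expires. */
        while (now_ns() - start < QUANTUM_NS) {
            for (int i = 0; i < WORK_CHUNK; i++)
                sink += i;
            done++;
        }
        counts[q] = done;
    }
    /* A quantum with a noticeably lower count was interrupted by the OS. */
    for (int q = 0; q < NUM_QUANTA; q++)
        printf("%d %ld\n", q, counts[q]);
    return 0;
}
```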

1.2 Proposed approach and methodologies

Considering the challenges posed by next-generation large-scale high-performance systems and applications, it is clear that system software needs to be significantly updated, if not redesigned. In this Thesis, we propose an approach to measure, design and implement scalable system software. The ideas behind this work are the following:

• Detailed measurement of system software overhead and its effect on scalability is necessary: obtaining scalable system software can be achieved only through a detailed measurement and understanding of system software performance and overhead, and their effect on real applications. Without a deep understanding of the sources of OS noise, scalable system software is difficult to design [89].

• Adaptive system software: system software has to adapt to the changing hardware architecture and applications. The use of a standard, full-featured OS is not a viable solution anymore. Similarly, the exclusive use of a minimal lightweight kernel is not the optimal solution to support the wide range of next-generation applications [135]. These two approaches have to be merged into a modular, adaptive solution that can provide the benefits of both.

• The runtime system is the right candidate to provide highly scalable system services and to exploit low-level hardware optimizations: some of the system services traditionally provided by the OS have to be moved into the runtime system. As shown in Figure 1.1, next-generation runtime systems will include more system services and will exploit low-level hardware features to optimize performance and power [13].


Services like scheduling, power management and data locality management can be effectively provided by the runtime system, with the advantage of specializing them for a particular class of applications (e.g. irregular applications). Moreover, the runtime system is the software layer that can better exploit hardware features provided by modern microprocessors, such as on-chip communications and hardware thread priorities. This approach allows the implementation of highly adaptive system software to face the scalability and power-efficiency challenges of next-generation large-scale machines [58]. Exploiting low-level hardware features such as fine-grain cache locality management, core-to-core communication and hardware thread priorities can significantly improve application performance. Nonetheless, optimizing a single application can require considerable effort. The runtime system, on the other hand, has detailed knowledge of the application behavior and requirements. For this reason, the runtime system is one of the best options to implement the optimization logic that exploits low-level hardware features to optimize an entire class of applications.

Figure 1.1: Next-generation runtime systems will include additional system services and exploit low-level hardware features to optimize an entire class of applications.

1.3 Thesis contributions

This Thesis addresses the problem of evaluating and designing scalable system software for large-scale systems. This work starts by addressing the scalability issues of the OS, first regarding GPOSes and then regarding lightweight kernels. Subsequently, we focus on the runtime system. For shared memory systems, we show how a runtime system can be effectively used to improve data locality on an emerging many-core architecture. Finally, we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. The contributions of this thesis are as follows:

1. Operating System Scalability
We provide an accurate study of the scalability problems of modern Operating Systems for HPC.
OS noise measurement and tracing: We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such as FTQ (Fixed Time Quantum [154]). We provide a case study in which we use our methodology to analyze the OS noise when running benchmarks from the Lawrence Livermore National Laboratory (LLNL) Sequoia applications.
Evaluation of address translation management: We provide an apples-to-apples comparison of different TLB management approaches — dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) — on a real BG/P system with up to 1024 nodes (4096 cores).

2. Runtime System Scalability
We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications.
Design and implementation of a runtime library to efficiently execute irregular applications on a commodity cluster: We design and implement a full-featured runtime system and programming model to execute irregular applications on a commodity cluster. The runtime library is called the Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multithreading, overlapping of communication and computation, and aggregation of small messages to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large-scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than the other solutions on commodity clusters and competitive with custom architectures.

3. User-level Scalability Exploiting Hardware Features
We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system.


Exploiting hardware multi-thread priorities in multithreaded architectures: We evaluate the effects of the controllable hardware-thread prioritization mechanism that controls the rate at which each hardware thread decodes instructions on IBM POWER5 and POWER6 processors.
Exploiting cache locality and network-on-chip to optimize sorting in many-core architectures: We show how to effectively exploit cache locality and the network-on-chip on the Tilera many-core architecture to improve intra-core scalability.
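As a concrete illustration of the prioritization mechanism studied in Chapter 6, the sketch below shows the user-level idiom POWER processors expose for it: special no-op forms of the or instruction change the decode priority of the issuing hardware thread (the same encodings the Linux kernel wraps in its HMT_* macros). This is a generic, hedged illustration, not the experimental framework used in this Thesis; which priority levels are actually available at user level depends on the OS and machine configuration.

```c
/* Hedged sketch: software-controlled SMT thread priorities on POWER5/POWER6.
 * The "or Rx,Rx,Rx" no-ops below change the decode priority of the calling
 * hardware thread (same encodings as the Linux kernel's HMT_* macros).
 * Availability of the higher levels at user level is configuration dependent. */
static inline void smt_priority_low(void)    { __asm__ volatile("or 1,1,1"); }
static inline void smt_priority_medium(void) { __asm__ volatile("or 2,2,2"); }
static inline void smt_priority_high(void)   { __asm__ volatile("or 3,3,3"); }

/* Typical use: drop priority while spinning, so the sibling hardware thread
 * gets more decode slots, then restore normal priority for useful work. */
void spin_wait(volatile int *flag)
{
    smt_priority_low();
    while (!*flag)
        ;                    /* busy wait at low priority */
    smt_priority_medium();   /* back to normal priority */
}
```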

1.4 Thesis structure

The thesis is divided into four parts:

Part I includes the Introduction and Chapter 2, which provides the background for this work.

Part II addresses the scalability problems of modern Operating Systems for HPC. Chapter 3 describes the methodology used to perform a detailed measurement of OS noise for a GPOS, and Chapter 4 describes the evaluation of address translation management for lightweight kernels.

Part III describes the issues of designing a specialized runtime system to improve a specific class of applications. Chapter 5 describes a runtime system and a programming model to efficiently execute irregular applications on a commodity cluster.

Part IV shows that implementing optimizations at user level can require considerable programming effort, suggesting that the runtime system should be used to implement this logic. Chapter 6 shows the potential of hardware thread priorities to improve several target metrics for various applications on POWER5 and POWER6 processors. Chapter 7 shows how a radix sort kernel implemented for an emerging many-core architecture can be optimized using hardware features like data locality management and the network-on-chip. The final chapter concludes by summarizing the work presented in this thesis and proposing possible future work.

Chapter 2

Background: current trends in High Performance Computing

This chapter introduces concepts that will be further expanded in the following chapters. In particular, it describes current technology trends, the characteristics of future exascale and big-data applications, and the system software enabling those applications to scale on future systems.

2.1 Technology trends

The last two decades witnessed an exponential growth of extreme-scale system performance. As Moore's law states, transistor densities double every 18 months. Industry has been following Moore's law to deliver more powerful processors every year. Leveraging Moore's law and employing large-scale parallelism, supercomputer performance has managed to grow even faster than processor performance.

Figure 2.1 shows the exponential growth of supercomputing power as recorded by the TOP500 [10]. From the GFlop systems of the early nineties to current PFlop systems, the sustained growth rate of supercomputer performance has been twice the growth rate of Moore's law. This impressive performance growth has been possible because of several concurrent factors, including a sustained increase of processor performance, a reduction of network latency, and increases in network and memory bandwidth. If this growth rate continues, the next generation of supercomputers should be able to deliver performance in the order of 10^18 floating-point operations per second (ExaFlop). The design of a new generation of supercomputers, called the Exascale generation, is the object of a large part of current research in HPC.


Figure 2.1: Exponential growth of supercomputing power as recorded by the TOP500 [10]. #1 refers to the first entry in the list, #500 to the 500th, and Sum is the sum over all supercomputers in the list.

The joint efforts of several institutions to design an Exascale supercomputer are the driving force behind many new developments in computer architecture, operating systems, runtime systems, programming models and large-scale applications. Exascale systems will be the result of current technology trends meeting application requirements, leveraging the lessons learned from more than two decades of supercomputing. For this reason, observing the direction of technology trends is a necessary approach to doing sensible research in HPC. In the context of these research efforts, the Defense Advanced Research Projects Agency (DARPA) produced the first comprehensive study on the challenges in achieving HPC systems with ExaFlop performance [24]. The study concluded that there are four challenges in building exascale supercomputers:

1. The Energy and Power Challenge: Today there is no technology able to deliver sufficiently powerful systems at connected wattages well below 100 MW. From a political-economic perspective, a power threshold of 20 MW has been suggested, even though an exact power consumption constraint has not yet been defined.

2. The Memory and Storage Challenge: There is a lack of mature memory and storage technologies that will be able to fulfil the I/O bandwidth requirements of future exascale systems within an acceptable power envelope.

3. The Concurrency and Locality Challenge: The end of increasing single-thread processing performance and the trend towards many-core processing elements pose challenges in achieving the expected level of parallelism. Projections for exascale systems indicate that future applications may have to support more than a billion separate threads in order to use the hardware efficiently.

4. The Resilience Challenge: This challenge is related to the explosive growth in the number of components in a supercomputer, as well as the need to use advanced technology at extreme voltage and temperature operating points. Individual devices and circuits will become more and more sensitive to operating environments, and hence resiliency and reliability will be of utmost importance.

The expected technology trends toward exascale computing are summarized in Table 2.1. As shown in the table, the performance of the various components of today's extreme-scale systems is expected to grow at different rates. This is a major game changer for the design and implementation of extreme-scale systems, compared to today's systems. Dennard scaling states that while transistor size decreases (Moore's law), power density remains constant, so the transistor power usage decreases too.


Metric                   2012         2015             exascale       growth
System Peak [PF]         25           200              1000           100
Power [MW]               6-20         15-50            20-80          10
System Memory [PB]       0.3-0.5      5                32-64          100
Memory per Core [GB]     0.5-2        0.2-1            0.1-0.5        1/10
Node Perf. [GF]          160-10^3     500-7x10^4       10^3-10^4      10-10^3
Cores/Node               16-32        100-1,000        10^3-10^4      100-10^3
Node Mem. BW [GB/s]      70           100-10^3         400-4x10^3     100
Number of nodes          10^4-10^5    5x10^3-5x10^4    10^4-10^5      10-100
Total Concurrency        O(10^6)      O(10^7)          O(10^9)        10^4
MTTI                     days         O(1 day)         O(1 day)       1/10

Table 2.1: Technology trends from current systems to ExaFlops systems [25]. The growth column refers to the expected growth from 2012 to exascale.

In the last decade, Dennard scaling ended and it became evident that power density could not be maintained constant as transistor sizes decreased [64]. This implied a paradigm shift, because performance could no longer be obtained through higher clock frequencies. Processor design started to change, from single-core to multi- and many-core architectures. Because of this paradigm shift, modern applications need to exploit thread-level parallelism to compensate for the lack of single-core performance. New processors are expected to have hundreds to thousands of cores and to leverage hardware multi-threading. As a consequence, the available memory per core is decreasing. While performance in older systems depended mostly on computational power (i.e., FLOPS), this is not the case for exascale. Data movement across the system, through the memory hierarchy, and even for register-to-register operations, is expected to be the main contributor to energy consumption. Figure 2.2 shows the energy cost of data communication between several parts of the system. The cost of a register access is lower than the cost of a double-precision floating-point operation. Nonetheless, off-chip communication increases energy consumption by two orders of magnitude.

2.2 Scientific and large-scale applications

Scientific applications are becoming more and more vital for research in a wide variety of scientific fields. Besides experimentation, simulation is today one of the fundamental methodologies of modern scientific research. Through the simulation of physical systems, scientists can validate hypotheses and make predictions with an increasing level of accuracy.


Figure 2.2: Energy consumption for data communication, difference between today and 2018 [24].

Simulations are an opportunity not only for scientific research but also for a large number of industrial and service sectors such as energy, aeronautics, mechanics, finance and health. Scientific applications are commonly developed by domain experts with a basic knowledge of the languages and programming models available on their systems. Most scientific applications end up being very complex codes, because of the necessity to solve complex scientific problems in the shortest possible time.

The European Exascale Software Initiative (EESI) enumerates several grand challenges for exascale computing [25]. These challenges could improve the state of the art of many scientific, medical and engineering problems. Some of the grand challenges identified by the EESI are:

• Therapy efficiency: genetic sequencing is quickly becoming a primary tool to identify and prevent health risks for a growing number of diseases. The improvement of sequencing instruments by a factor of 10^3 to 10^6 will make it possible to integrate genomic data into clinical trials. This will have a huge impact on drug development and therapies. It will be possible only if exascale systems enable the management of exabytes of sequencing data.

• Extreme Weather Forecast: Computational modeling is a necessary tool to integrate with physical earth observations to better understand weather processes. The accuracy of computational models has a direct impact on the ability to predict natural hazards and to foster policy-making for risk mitigation. Uncertainty quantification also plays a central role in evaluating the potential outcomes of extreme natural events. The economic, social and environmental impacts of this challenge are enormous.

• Greening the Aircraft: as energy is becoming a huge issue for every aspect of social and economic life, future air transportation systems will have to reduce their energy consumption and their impact on the environment. These reductions include a reduction of emissions by 50% and a decrease of the external noise level by 10-20 dB (known as the Vision 2020 plan).

• Materials: exascale is expected to bring major changes also in materials science. The focus of simulations is expected to shift from a qualitative description of basic phenomena to the numerical optimization and quantitative analysis of soft-material properties, micro-structural evolution and chemistry-driven problems.

2.2.1 Big data and exascale computing

In the 20th century, simulation came to be referred to as the third paradigm of scientific discovery, theory and experimentation being the first two. Large-scale scientific simulations have in fact enabled a deep understanding of physical systems where experimentation is difficult, hazardous, or very expensive. Enabling this new paradigm of scientific discovery has fostered the design of state-of-the-art HPC systems and software, driving research in computer science and engineering. The availability of large-scale models and simulations and the availability of large instruments like telescopes, colliders and light sources has generated a growing volume of scientific data. Providing the computing resources to move, analyze and manage this exponentially growing volume of data is becoming the next big challenge [48], commonly referred to as big data. The effort to extract knowledge from large scientific datasets is considered today the fourth paradigm of scientific discovery. Because of the increasing velocity, heterogeneity and volume of the data generated, extracting knowledge from large datasets is becoming a key aspect of science.

The steps toward the design of an Exascale supercomputer are increasingly pointing to the necessity of integrating Exascale with big data. The integration of these two aspects of scientific discovery will make it possible to solve scientific problems orders of magnitude more complex than today's. The Department of Energy (DOE) ASCAC report on data-intensive computing identifies the typical knowledge discovery life-cycle as follows [48]:

1. Generation: Data is generated using simulation on large-scale computational facilities, sensor networks and other types of high-throughput instruments. The output of this phase is a raw dataset on the order of petabytes up to exabytes.

2. Processing and organization: The data is then processed to make it more usable. This phase includes reorganization, filtering, subset creation, distribution and several pre-processing methodologies. The output of this phase is a pre-processed, structured dataset.

3. Analytics, Mining and Knowledge discovery: In this phase, domain-specific scalable algorithms explore the data to analyze patterns and statistics, extract and discover information, or perform data mining. The output of this phase is sensible domain-specific knowledge, usually reduced by several orders of magnitude with respect to the original dataset.

4. Action, Feedback and Refinement: In the last phase, the knowledge produced is used to close the loop, validating or rejecting hypotheses, improving models, or taking effective actions in the specific domain.

The life-cycle described above is implemented through a large variety of applications, from high-end large-scale simulations to data-mining and knowledge discovery algorithms. The critical characteristic of applications working with big data is that they have higher data access rates than traditional compute-intensive applications. For this reason, this type of application is called data-intensive, i.e., an application whose performance is dominated by data access rather than computation. To exemplify this type of application, Table 2.2 shows a comparison of performance metrics for compute-intensive and data-intensive benchmarks.

The benchmarks SPECINT and SPECFP [86] are used for integer and floating-point operations, respectively. MediaBench [106] is used for multimedia workloads. TPC-H [7] is a benchmark for decision support applications. MineBench is used as representative of data mining and analytics applications. The performance metrics in Table 2.2 show that SPECINT, SPECFP, MediaBench and TPC-H behave differently from MineBench. In particular, MineBench shows significantly more Data References than the others, resembling a characteristic data-intensive workload. Because of the data references, MineBench also experiences more bus accesses and L2 misses.


Parameter                SPECINT   SPECFP   MediaBench   TPC-H    MineBench
Data References          0.81      0.55     0.56         0.48     1.10
Bus Accesses             0.030     0.034    0.002        0.010    0.037
Instruction Decodes      1.17      1.02     1.28         1.08     0.78
Resource Related Stalls  0.66      1.04     0.14         0.69     0.43
ALU Instructions         0.25      0.29     0.27         0.30     0.31
L1 Misses                0.023     0.008    0.010        0.029    0.016
L2 Misses                0.0030    0.0030   0.0004       0.0020   0.0060
Branches                 0.13      0.03     0.16         0.11     0.14
Branch Mispredictions    0.0090    0.0008   0.0160       0.0006   0.0060

Table 2.2: Performance metrics for compute-intensive and data-intensive benchmarks. Numbers shown are values per instruction [48].

A high rate of data references characterizes data-intensive applications, quantifying the percentage of instructions that access in-memory data. Given the growing cost of data access outside the chip and, even more, outside the local node, the performance of applications with a high rate of data references is going to be dominated by data-access instructions.

2.2.2 Data-intensive and Irregular Applications

As explained in the previous sections, data-intensive applications are going to play a central role in scientific discovery on the coming exascale systems. Besides being characterized by a high data access rate, data-intensive applications often show very poor data locality. The locality principle is a fundamental principle of computer architecture [85], stating that programs tend to reuse data and instructions they have used recently. Locality is mainly of two types, temporal and spatial: the former states that data used recently is likely to be reused, while the latter states that, given a data access, items whose addresses are close are likely to be accessed soon.

Based on the degree of locality available in the data, applications can be divided into regular applications and irregular applications. Regular applications can be programmed to access data in an organized fashion, thus exploiting the high temporal and spatial locality available in their data. Typical examples of regular applications are programs based on dense vectors and matrices, as in matrix factorization and stencil computation. Given the regularity of such problems, the associated algorithms usually employ synchronized blocking collectives to exploit data parallelism.

On the other side of the spectrum we have the so-called irregular applications. Because of their unpredictable data access patterns, these applications have very poor spatial and temporal locality.

Irregular applications thus present unpredictable memory and network access patterns. Moreover, applications in this class often show an unpredictable instruction flow, reducing instruction locality and hence the ability to predict their behaviour.

Big-data and, in general, data-intensive applications often fall into this class. Typically, these are applications that exploit pointer-based data structures, such as linked lists, graphs, unbalanced trees and unstructured grids. Applications of this class are graph exploration algorithms, data mining, social network analysis and, in general, tools related to knowledge discovery from large data sets. Their accesses are usually fine-grained and highly unpredictable. Besides being very large, the datasets are usually very difficult to partition. Although these problems present high inherent parallelism, modern HPC systems are not optimized for them. Besides irregular memory and network access patterns, applications in this class often show an irregular instruction flow, reducing source code locality and the ability to predict their behaviour. In fact, current large-scale systems focus on regular computation and data access patterns. They target flop-intensive scientific applications that are easy to partition, present high arithmetic intensity and can exploit complex cache-based architectures to reduce data access latencies.
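As an illustration of this difference (a hedged sketch, not code from any application studied in this Thesis), the two loops below contrast a regular access pattern, which walks a dense array sequentially and benefits from spatial locality and hardware prefetching, with an irregular pointer-chasing traversal whose next address cannot be predicted:

```c
#include <stddef.h>

/* Regular access: sequential sweep over a dense array.
 * Consecutive elements share cache lines (spatial locality) and the
 * stride-1 pattern is easy for hardware prefetchers to predict. */
double sum_dense(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Irregular access: pointer chasing through a linked structure
 * (e.g., an adjacency list). Each load depends on the previous one and
 * the addresses are scattered, so caches and prefetchers help very
 * little; performance is dominated by memory (or network) latency. */
struct node { double value; struct node *next; };

double sum_list(const struct node *p) {
    double s = 0.0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```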

2.3 System software for large-scale systems

Even if research in system software has sometimes been marginalized, the difficulty of efficiently managing large-scale system resources has repeatedly shown the importance of providing scalable system software. Over the years, system software has evolved to keep pace with the evolution of technology and applications. Research on scalable system software for large-scale systems has historically been driven by two fundamental aspects: on one side, performance, efficiency and reliability; on the other side, the goal of providing a productive development environment. As with most engineering challenges, obtaining the right balance between two requirements such as performance and productivity is not trivial. Computer scientists working in this field have taken two different approaches: on one hand, modifying the software stack used in high-end single-node machines (such as high-throughput servers); on the other hand, developing parts of the stack from scratch, such as specialized runtimes and operating systems. The trends in hardware technologies described in the previous sections will introduce additional complexity and the need for highly adaptable and scalable system software. Moreover, the difficulty of developing large-scale applications with an explicit message-passing programming model (i.e., the Message Passing Interface, MPI), as is commonly done today, will lead to the introduction of newer programming models. These new developments imply radical changes at all levels of the system software stack, in particular for the OS and the runtime system.

2.3.1 Operating Systems

The role of the Operating System (OS) is to manage hardware resources on behalf of the runtime system and the applications. The OS, because of its position at the bottom of the software stack (if we exclude the hypervisor), requires upper layers such as runtimes, libraries and applications to be implemented using its API. For this reason, the development and success of a new OS depends not only on its inherent technical characteristics, but also on the availability of a rich software ecosystem living on top of the new OS.

The choice of a General Purpose Operating System (GPOS), rather than a specialized OS for large-scale systems, has its advantages precisely in its software ecosystem and in the familiarity of application developers with its API. The most used GPOS for large-scale systems, Linux, can count on a wide range of services and libraries for developing any kind of application, scientific applications included. Linux has been ported to most of the processors produced and provides a large hardware-driver base for any kind of device and component. Moreover, the Linux API is probably the most taught system software API at the industrial and academic level, because of the advantages of the open-source development philosophy. For all these reasons, Linux has naturally been the candidate GPOS to run on large-scale high-performance systems. Figure 2.3 shows that first Unix-like OSes, and nowadays Linux, have been the most used operating systems for the supercomputers in the TOP500 list.

Figure 2.3: Percentage of the TOP500 using a given Operating System [10].

As mentioned before, using a GPOS for large-scale high-performance systems has its drawbacks. When HPC systems approached larger scales, it became clear that predictable performance was one of the fundamental requisites. Linux, on the other hand, is not designed to provide a constant, predictable execution time to the applications. Linux has a variety of system services (daemons), scheduling policies and interrupts that provide necessary services for desktop- or server-oriented applications. Those services can be a significant source of noise (so-called OS noise or jitter) and worsen performance predictability.

A more important consequence of OS noise is observed for applications based on periodic collective synchronizations (such as MPI_Waitall). Many of the currently used scientific large-scale applications (e.g. the Sequoia benchmarks [110]) are organized in alternating computation and communication phases, and make intensive use of this type of collective. The delays introduced by OS noise in synchronized collectives may cause noise propagation that significantly increases execution time and reduces performance predictability. Figure 2.4 shows the trace of a parallel application with multiple threads. The computing phases (yellow) are followed by a communication phase (barrier, in black). OS interruptions (in blue) can delay one thread, forcing all the remaining threads to wait for the delayed thread during the communication phase (barrier). This effect is called noise propagation (or amplification), and its frequency increases at scale because of the higher probability of an OS interruption hitting one of the parallel threads.

Figure 2.4: Effect of an OS interruption on a parallel application.

To reduce the overhead and complexity of unnecessary system services and to reduce OS noise, researchers have designed specialized lightweight OSes for large-scale systems (lightweight kernels or microkernels). The lightweight kernel approach was popularized with the development of Mach in 1985 [79]. Since then, lightweight kernels have been a constant presence in the space of system software for supercomputers, even if not the most commonly adopted solution. Today, Sandia's Kitten, IBM CNK and Cray CNL are three of the most well-known examples of this approach running on modern supercomputers. In order to leverage the wide software base and the use of a familiar API, lightweight kernels tend to implement a subset of the standard Linux API. The advantages of using a lightweight kernel with respect to a GPOS are, among others, increased performance, efficiency and reliability.

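The amplification effect described above can be made concrete with a minimal MPI sketch (the function names, iteration counts and synthetic compute loop are illustrative and are not code from the thesis): every rank alternates a fixed compute phase with a barrier, so the slowest, most-interrupted rank dictates the duration of every iteration.

```c
/* Sketch: OS-noise amplification in a bulk-synchronous loop.
 * Each iteration lasts as long as the slowest (most interrupted) rank,
 * because every rank waits for all the others at the barrier. */
#include <mpi.h>
#include <stdio.h>

static void compute_phase(void) {
    /* Stand-in for a fixed amount of per-rank work. */
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)
        x += i * 1e-9;
}

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int iter = 0; iter < 100; iter++) {
        double t0 = MPI_Wtime();
        compute_phase();
        double t_compute = MPI_Wtime() - t0;

        MPI_Barrier(MPI_COMM_WORLD);        /* collective synchronization */
        double t_total = MPI_Wtime() - t0;

        /* (t_total - t_compute) is time spent waiting for slower ranks:
         * with OS noise it tends to grow with the number of ranks. */
        if (rank == 0)
            printf("iter %d: compute %.4f s, compute+wait %.4f s\n",
                   iter, t_compute, t_total);
    }
    MPI_Finalize();
    return 0;
}
```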
2.3.2 Runtime Systems and Parallel Programming Models

The runtime system is the layer of the software stack between the OS and the applica- tions. Its role is to act on behalf of the application to make the best use of the hardware resources exposed by the OS. With respect to the OS, the runtime system has specific knowledge of the application characteristics and requirements. Furthermore, runtime systems are often specialized to support applications with specific characteristics. The presence of a runtime system simplify the development of applications providing com- monly used software abstractions that hide the complexity of the underling OS API. The software abstractions implemented in the runtime system are at the core of the programming model used by the applications. Parallel programming models are the software abstractions used to express the parallelism of a program. The term parallel programming model is often used with different meanings: (a) to describe the

abstract parallel machine, the basic operations performed on the machine and their semantics; (b) to describe the actual implementation of a programming model, including programming languages, compiler directives, libraries and programming environments. It is sometimes difficult to draw a line between the abstract concepts composing a programming model and its actual implementation. We will refer to meaning (a) as the program structure model and communication model, and to meaning (b) as programming models. Three factors are very useful when evaluating the optimal programming model for a given domain: 1. Performance: usually expressed as throughput or execution time.

2. Productivity: the advent of new kinds of applications and the complexity of programming current many-core clusters point to the need for high-productivity programming models.

3. Generality: the third factor is sometimes overlooked, but it is of utmost importance when combined with the other two. In fact, reducing a programming model's generality is the only effective approach to increase both performance and productivity. One of the lessons learned in the last decade with the proliferation of parallel programming models is that the optimal programming model depends on the domain. An efficient approach is to employ runtime systems specialized for the specific domain of the problem, in order to obtain significant advantages in terms of performance and productivity. Research on parallel programming models spans several decades and has generated a myriad of programming languages and libraries. The actual abstractions and ideas behind many implementations are often the same, or a combination of them. In the following, we describe some of the most well-known abstractions and ideas used in modern parallel programming models: the program structure model and the communication model. Program Structure Model:

• SPMD - Single Program Multiple Data is a program structure composed of a fixed number of threads that run the same program on different sets of data. The threads are started and terminated synchronously.

• Fork/Join - a single thread of execution (main thread) is started and then several threads can be dynamically created from the main thread with the fork construct.


The main thread then waits for the children's completion with a join construct. A Fork/Join model supports nested parallelism if the fork construct can be recursively used by the children to create other threads.

• Master/Worker - a single master thread distributes the work to a fixed number of worker threads. This construct is commonly used to provide an efficient use of hardware resources (load balancing).

• Loop Parallelism - a for loop is used as the basic construct to express parallelism. The iterations of the loop are executed in parallel on several threads. This model can support nested parallelism with the implementation of nested loop constructs.

• Dataflow parallelism - dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus exploiting data parallelism and implementing dataflow principles and architecture [142].

Communication Model:

• Message Passing - threads communicate with explicit exchange of messages. In a two-sided communication, the receiver executes a receive operation that is matched by a send operation of the sender. In a one-sided communication, the sender thread executes a put operation to place a message in a predefined location that can be accessed by the receiver.

• Shared Memory - with a shared memory system different threads of execution can communicate through the memory. They communicate by reading and writing on the shared memory using the same memory locations.

• Partitioned Global Address Space - this model is the equivalent of a shared memory model for distributed memory systems (e.g. clusters). Since the underlying hardware does not have a single memory system shared among the threads, the programming model provides an abstraction for it. The abstraction is a global address space that can be accessed by all the threads. System memory is partitioned between local and global memory. Each thread has access to its own local memory and also to the global memory abstraction to communicate with the other threads.

Many programming models express parallelism with one or more program structure models and also more than one communication model (hybrid programming models).

Besides program structure and communication model there are other criteria that can be used to compare programming models [99]:

• Implementation - whether the programming model is implemented with a new programming language, extends a serial programming language (e.g. using compiler directives or adding keywords), or is implemented through programming libraries (API).

• Worker management - this criterion describes the creation and management of execution units, whether they are tasks, threads or processes. Worker management is called explicit when the programmer needs to express the creation and termination of the execution units. Otherwise it is called implicit (a Pthreads sketch after this list illustrates the explicit case).

• Workload partitioning scheme - workload partitioning is the way the workload is partitioned into smaller units of work called tasks. Workload partitioning is explicit when the programmer needs to express how the work is partitioned, while it is implicit when the programmer only has to specify a section of code to be parallel.

• Task-to-worker mapping - it defines how the units of work are mapped onto the workers. It can be explicit, when the programmer has to manually define the mapping, or implicit, when the programming model automatically provides the mapping.

• Synchronization - it defines the order in which the workers access shared data. With explicit synchronization the programmer has to express the access order with synchronization constructs, while with implicit synchronization the programming model does it transparently.
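As a concrete illustration of the explicit end of the worker-management criterion referenced in the list above, the following hedged Pthreads sketch creates and joins each worker by hand; with an implicit model such as OpenMP, the same fork/join is hidden behind a single directive. The worker count and the work performed are arbitrary example choices.

#include <pthread.h>
#include <stdio.h>

#define WORKERS 4

/* Explicit worker management: the programmer creates and joins each thread. */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[WORKERS];

    for (long i = 0; i < WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);   /* explicit creation    */
    for (int i = 0; i < WORKERS; i++)
        pthread_join(tid[i], NULL);                         /* explicit termination */
    return 0;
}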

We survey some of the most relevant parallel programming models for HPC (in alphabetical order): Active Pebbles [166] is a lightweight data-driven programming model based on active messages [164]. Active Pebbles combines a fork/join program structure with a PGAS communication model. It also supports message coalescing and active routing to speed up communication.

CUDA is a parallel programming model developed by NVIDIA [133]. The CUDA runtime is designed to scale transparently with the growing number of cores found in modern GPUs. CUDA allows the programmer to use a high-level C programming language.

The CUDA programming model exploits data parallelism and is particularly effective for regular floating-point intensive computations.

Chapel [46] (Cascade High Productivity Language) is a general-purpose parallel programming language developed by Cray, Inc. under the DARPA High Productivity Computing Systems (HPCS) program. Chapel has been developed for general parallel programming, locality-aware programming, object-oriented programming, and generic programming [50].

Charm++ [96] is a parallel programming system based on a message-driven, migratable-objects programming model. The Charm++ runtime consists of a number of medium-grained processes that communicate with messages. Charm++ supports communication latency tolerance by suspending the blocking process and executing a waiting process.

Cilk [31] is a task-based implementation of a shared memory model through work stealing. This approach is particularly well suited to highly recursive parallel problems. It is implemented as a C/C++ extension and does not support distributed memory systems, but it can easily be combined with MPI.

Coarray Fortran [131] is a Fortran extension for parallel systems. It features a SPMD program structure and provides constructs to specify the data distribution through a PGAS model.

Global Arrays [129] provides a SPMD programming model with a PGAS communication model. The data locality can be managed by the programmer to efficiently exploit local memory. Global Arrays is composed of a number of processes sharing a global address space and communicating with one-sided primitives.

Habanero-Java [41] is an extension of the original Java definition of the X10 language. The Habanero-Java extension is primarily focused on task parallelism and supports asynchronous creation of tasks with a fork/join program structure model and a PGAS communication model.

MPI [121] (Message Passing Interface) can be considered the most widely used programming model for distributed memory systems. It is a pure message passing programming model with an SPMD program structure. MPI is a specification for message passing operations.

Various MPI implementations provide libraries with one-sided and two-sided communication mechanisms to send messages to different nodes in a cluster. The programmer has to explicitly specify each communication message between processes. The execution threads (processes) do not share any data structure and communicate explicitly through messages. Even if the use of explicit messages has advantages for application scalability, explicitly controlling the communication for large parallel applications is extremely complex. Moreover, with MPI alone, SMP systems with shared memory cannot benefit from the ability to share data structures within the same node. Hybrid programming models use MPI for inter-node communication and a shared memory programming model for intra-node communication.
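As a concrete, simplified illustration of the two-sided and one-sided communication mechanisms mentioned above, the following sketch first exchanges one integer with MPI_Send/MPI_Recv and then exposes a window and writes into it with MPI_Put. It is a minimal example, assuming at least two MPI ranks (e.g., mpirun -np 2); error handling is omitted.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0, win_buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-sided: a send on rank 0 must be matched by a receive on rank 1. */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* One-sided: rank 1 exposes win_buf; rank 0 puts data into it directly. */
    MPI_Win_create(&win_buf, (MPI_Aint)sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("received %d, put %d\n", value, win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}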

OmpSs [37] is an extension of OpenMP that integrates features from the StarSs [103] programming model. OmpSs enables asynchronous parallelism through the use of data dependencies between different tasks of the program. Tasks can be created dynamically, and the runtime transparently manages data dependencies and task scheduling.

OpenCL [101] is a parallel, open-standard programming model similar to CUDA. However, OpenCL is more general-purpose than CUDA and can be used on a variety of heterogeneous architectures including CPUs and GPUs. Unlike CUDA, which is primarily focused on data parallelism, OpenCL is also designed to support task parallelism.

OpenMP [134] is, on the other hand, the most widely used shared memory programming model. It provides program structures such as fork/join and loop parallelism and is implemented both as a language extension and as a programming library. OpenMP is easier to program than many other approaches but is restricted to a single multi-core or many-core system.
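A minimal sketch of the fork/join and loop-parallelism structures provided by OpenMP: the pragma forks a team of threads, implicitly partitions the loop iterations among them, combines the partial sums with a reduction, and joins the threads at the end of the loop. The array size is an arbitrary example value.

/* Compile with OpenMP enabled (e.g., -fopenmp with GCC). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Implicit workload partitioning: iterations are split among the
     * threads forked at the parallel region and joined at the loop end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}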

ParalleX (PX) [70] is an experimental execution model that exploits a split-phase, multithreaded, transactional distributed computing methodology that decouples computation and communication [70]. ParalleX supports fine-grain multithreading, a global address space, and overlapping of communication and computation, and it is able to move work to the data. High Performance ParalleX (HPX) is a runtime system based on the ParalleX execution model.

Pthreads [38] is a POSIX implementation of a thread library that enables explicit control of each execution thread and of shared memory access.

Fine-grained control of the thread life cycle introduces a higher level of complexity. Pthreads is often used as a building block for programming models with a higher level of abstraction.

SHMEM [136] allows developers to write a parallel application using a globally accessible shared memory abstraction. The key features of SHMEM are one-sided point-to-point communications, a shared memory view, and atomic operations on globally visible variables.

Titanium [170] is a Java-derived language and system for HPC. It provides a SPMD program structure with a global address space. An important feature of Titanium is the possibility of running unmodified on uniprocessors, shared memory machines, and distributed memory systems.

Unified Parallel C (UPC) [60] is a programming model that employs a global address space abstraction partitioned between all the nodes. UPC is an extension of the C language designed for HPC systems. UPC uses the SPMD program structure with a PGAS communication model.

X10 [47] is a Java-derived, object-oriented parallel language developed by IBM. X10 was designed to enable scalability, high-performance and high productivity. It is based on a PGAS model and implements the Globally Asynchronous Locally Synchronous model (GALS) originally developed for embedded hardware [123].

2.3.3 Trends in System Software

The ongoing changes in large-scale and high-performance technology and applications are forcing traditional system software to evolve. Next-generation system software for HPC is expected to provide system services with improved performance and power efficiency. Currently, several trends are emerging in system software design. One of the clear requirements is that next-generation system software has to be low-overhead. Asynchronous OS noise and high system software overhead will not be acceptable in an exascale system. For this reason, current GPOSes are expected to be heavily redesigned in order to reduce overhead. The availability of a large number of cores can be exploited to completely separate the execution of system software code from the application code. Application and system software executions will be separated in space (on different cores) rather than in time (time-sharing).

Figure 2.5: Programming model schema for current and next-generation runtime systems.

Moreover, heterogeneous architectures with specialized cores for different workloads will allow application and system software code to be executed with higher performance and power efficiency. Another big change will be in the level of parallelism that system software is expected to manage. Current programming models are based on the idea of communicating sequential processes (CSP), while next-generation programming models will be fine-grained, event-driven, and adaptive. Current-generation runtime systems are implemented with a number of parallel processes that progress through execution exchanging (mostly) synchronous messages. In the next generation of runtime systems, on the other hand, there will be a large number of small tasks communicating with each other through asynchronous messages. Figure 2.5 shows a graphical representation of current and next-generation programming models. If we look at new runtime systems such as Chapel, HPX, and X10, and at architectural support such as TERAFLUX [77], we observe that they are in fact developed according to these principles. Given the expected changes in technology and programming models for the future generation of supercomputers, the runtime system is expected to play a central role in the design of a scalable software stack. In fact, because of the need for a lightweight OS, several services will be moved into the runtime system. The runtime system has the advantage of a much closer knowledge of the running application.

Services such as scheduling, memory and locality management, and memory object synchronization can be more efficiently handled by the runtime system. Another important trend is that, to obtain high power efficiency, the system software will feature a much higher level of adaptivity. Based on the application behavior, it will be necessary to adapt the hardware (e.g., turning off cores, dynamic voltage and frequency scaling, power capping) in order to save energy. To obtain adaptivity, a higher level of logic will be included at the runtime system level in order to increase the ability to take resource management decisions while the application is running. All the mentioned trends point out that the runtime system will be one of the most important areas of improvement for next-generation large-scale systems.
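The contrast between coarse-grained communicating processes and the fine-grained, event-driven task model described above can be sketched with a toy task queue. This is purely illustrative and does not correspond to any of the runtimes mentioned in this chapter: tasks are small function/argument pairs pushed into a queue, and instead of blocking on a message a task spawns follow-up tasks that the event loop executes.

/* Toy single-threaded event-driven task queue, for illustration only. */
#include <stdio.h>
#include <stdlib.h>

typedef void (*task_fn)(void *arg);

typedef struct task {
    task_fn      fn;
    void        *arg;
    struct task *next;
} task_t;

static task_t *head = NULL, *tail = NULL;

static void spawn(task_fn fn, void *arg)
{
    task_t *t = malloc(sizeof(*t));
    t->fn = fn; t->arg = arg; t->next = NULL;
    if (tail) tail->next = t; else head = t;
    tail = t;
}

static void consumer(void *arg)
{
    printf("consume %d\n", *(int *)arg);
    free(arg);
}

static void producer(void *arg)
{
    (void)arg;
    /* Instead of blocking on a message, spawn follow-up tasks. */
    for (int i = 0; i < 4; i++) {
        int *v = malloc(sizeof(int));
        *v = i;
        spawn(consumer, v);
    }
}

int main(void)
{
    spawn(producer, NULL);
    while (head) {                 /* event loop: run tasks to completion */
        task_t *t = head;
        head = t->next;
        if (!head) tail = NULL;
        t->fn(t->arg);
        free(t);
    }
    return 0;
}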

Part II

Operating System Scalability


Chapter 3

General Purpose Operating Systems

This part of the thesis addresses the problem of operating system scalability. The OS often plays a fundamental role in the scalability of large parallel systems. Generally, operating systems for large-scale machines fall into two categories: general purpose OSes (e.g., Linux) and lightweight OSes (CNK, Catamount, etc.). In this chapter, we focus on the effect of using a general purpose operating system on a large-scale high-performance system. As mentioned before, general purpose OSes offer the advantage of a well-known programming environment, plenty of features, and a large code base. On the other hand, they also present scalability issues. One of the most studied scalability issues is the so-called OS noise (or jitter). In the following sections, we present a methodology to analyze the scalability of general purpose operating systems through detailed measurement of OS noise on real HPC applications.

3.1 Summary

Operating system (OS) noise (or jitter) is a well-known problem in the High Performance Computing (HPC) community. Petrini et al. [137, 68] showed how OS noise and other system activities (e.g., the network file system server) may limit application scalability, severely reducing performance on large-scale machines. Later studies [68, 148] confirmed the early conclusions on different systems and identified timer interrupts and process preemption as the main sources of OS noise [75, 156]. Most of those studies are “qualitative” in that they do a good job characterizing the overall effect of OS noise on the scalability of parallel applications, but tend not to identify and characterize each kernel noise event.

For example, many studies showed that the timer interrupt is an important source of OS noise, but few of them provided information about which activities are executed during each timer interrupt. Such information would be helpful to developers trying to reduce OS noise. For example, operating systems use timer interrupts to periodically start several kinds of activities, such as bookkeeping, watchdogs, page reclaiming, scheduling, or drivers' bottom halves. Each activity has a specific frequency and its duration varies according to the amount of work to process, and thus different timer interrupts can introduce different amounts of OS noise. Our detailed quantitative analysis allows developers to differentiate and characterize the noise induced by each event triggered on an interrupt and thus address the pertinent sources. Lightweight and micro kernels are common approaches taken to reduce OS noise on HPC systems. Many supercomputers have run a lightweight kernel, such as the Compute Node Kernel (CNK) [71] or Catamount [140]. Lightweight kernels provide functionality for a targeted set of applications (for example, HPC or embedded applications) and usually introduce negligible noise. They usually do not take periodic timer interrupts or TLB misses, but typically provide restricted scheduling and dynamic memory capability. The lightweight approach worked well in the past, when HPC applications only required scientific libraries and a message passing interface (MPI) implementation [121, 69]. The NAS benchmark suite [124] contains examples of this kind of application. Also, the requirements of many other of the largest applications, such as previous Gordon Bell winners [39, 17], most of the Sequoia benchmark suite [110], and others [5, 2], can be satisfied by this approach. However, modern HPC applications are becoming more complex, such as UMT from the LLNL Sequoia benchmarks [110], climate codes [100], and UQ [23]. Moreover, in order to improve performance there is a growing interest in richer shared-memory programming models, e.g., OpenMP [134], PGAS models such as UPC, Charm++, etc. At the same time, dynamic libraries, Python scripts, dynamic memory allocation, checkpoint-restart systems, tracers to collect performance and debugging information, and virtualization [104] are seeing wider use in HPC systems. All these technologies require richer support from the OS and the system software. Moreover, applications being run on supercomputers are no longer coming strictly from classical scientific domains but from new fields, such as finance, data analytics, and recognition, mining, and synthesis (RMS). Current trends indicate that large petascale to exascale systems are going to desire a richer system software ecosystem. Possible solutions to support such complexity include: 1) extending lightweight and

micro kernels to provide support for the needs of modern applications mentioned above; 2) tailoring a general purpose OS, such as Linux or Solaris, to the HPC domain (e.g., the Cray Linux Environment for Cray XT machines); and 3) running a general purpose OS in a virtualized environment (e.g., Palacios [104]). All three scenarios have the potential to induce more noise than yesterday's lightweight kernels. Thus, it will become more important to be able to accurately identify and characterize noise induced by the system software.

We have developed a technique that provides a quantitative, descriptive analysis of each OS noise event. The mechanism allows us to detail all sources of OS noise through precise kernel instrumentation and provides frequency and duration analysis for each event. In addition to providing a much richer set of data, we have integrated the collected data with the Paraver [138] visualization tool, allowing developers to more easily analyze the data and get an intuitive sense of its implications. Paraver is a classical tool for parallel application performance analysis and, integrated with our data, can provide a view of the OS noise introduced throughout the execution of HPC applications.

Overall, in this chapter we describe the following three contributions. First, we describe a methodology whereby detailed quantitative information may be obtained for each OS noise event. As mentioned, our methodology is most useful for HPC OS designers and kernel developers trying to provide a system well suited to run HPC applications. Though not the thrust of this work, we show how we implemented that methodology by augmenting the Linux Trace Toolkit Next Generation (LTTng) [55, 56]. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such as FTQ (Fixed Time Quantum). Second, we provide a case study in which we use our methodology to analyze the OS noise when running benchmarks from the LLNL Sequoia applications. Although our methodology and tools may well be useful for application programmers, in this work we focus on providing the insights that can be gained at the system level. Thus, while we do run real applications, it is not for the purpose of studying those applications, but rather for demonstrating that our methodology allows us to obtain a better understanding of the system while running real applications. Our experiments enrich and expand previous results with our quantitative characterization. Third, we describe how this detailed characterization allows us to disambiguate noise signatures of qualitatively similar events, allowing developers to address the true cause of a given noise event.


3.2 Experimental environment

Table 3.1 shows the experimental environment used in this chapter. The table includes the hardware system, the operating system, the communication library, and the benchmarks used for the experiments.

CPU                     Dual Quad-Core AMD Opteron(tm) processor workstation at 2.1 GHz (2 sockets, 4 cores per socket)
Memory                  64 Gigabyte of RAM
Network                 Ethernet (used only to support NFS)
System size             single node
Operating System        Linux 2.6.33.4
Communication Library   OpenMPI
Benchmarks              Sequoia benchmarks [110] (AMG, IRS, LAMMPS, SPHOT, UMT)

Table 3.1: Experimental environment

3.3 Related Work

Operating system noise and its impact on parallel applications have been extensively studied via various techniques, including noise injection [68]. In the HPC community, Petrini et al. [137] explained how OS noise and other system activities, not necessarily at the OS level, could dramatically impact the performance of a large cluster. In the same work, they observed that the impact of system noise when scaling to 8K processors was large due to noise resonance, and that leaving one processor idle to take care of the system activities led to a performance improvement of 1.87×. Though the authors did not identify each source of OS noise, a following paper [75] identified timer interrupts, and the activities started by the paired interrupt handler, as the main source of OS noise (micro OS noise). Nataraj et al. [125] also describe a technology to measure kernel events and make this information available to application-level performance measurement tools. Others [156, 53] found the same result using different methodologies. The work of De et al. [53] is the most similar to ours. They perform an analysis of OS noise by instrumenting the functions do_IRQ and schedule, obtaining interrupt and process preemption statistics. Our approach is similar, but we instrumented all kernel entry points and activities, thus providing a complete OS noise analysis. As a consequence, we are able to measure events like page faults and softirqs that significantly contribute to OS noise.

The other major cause of OS noise is the scheduler. The kernel can swap HPC processes out in order to run other processes, including kernel daemons. This problem has also been extensively studied [137, 75, 156, 53, 95, 74], and several solutions are available [155, 74]. The effects of three causes of OS noise (kernel memory references, process scheduling, and virtual-to-physical address translation) on the multiprocessor memory subsystem have also been evaluated through the use of real and synthetic traces [76]: such an approach allows for faster evaluations. Studies group OS noise into two categories: high-frequency, short-duration noise (e.g., timer interrupts) and low-frequency, long-duration noise (e.g., kernel threads) [68]. The impact on HPC applications is higher when the OS noise resonates with the application, so that high-frequency, fine-grained noise affects fine-grained applications more, and low-frequency, coarse-grained noise affects coarse-grained applications more [137, 68]. The impact of the operating system on classical MPI operations, such as collectives, is examined in Beckman et al. [22]. There are several examples in the literature of operating systems designed for HPC applications. Solutions can be divided into three classes: micro kernels, lightweight kernels (LWKs), and monolithic kernels. L4 [109] is a family of micro-kernels designed and updated to achieve high independence from the platform and improve security, isolation, and robustness. The Exokernel [62, 61, 63] was developed with the idea that the OS should act as an executive for small programs provided by the application software, so the Exokernel only guarantees that these programs use the hardware safely. K42 [16] is a high-performance, open-source, general-purpose research operating system kernel for cache-coherent multiprocessors that was designed to scale to hundreds of processors and to address the reliability and fault-tolerance requirements of large commercial applications. Sandia National Laboratories has also developed LWKs with predictable performance and scalability as major goals for HPC systems [140]. The IBM Compute Node Kernel for Blue Gene supercomputers [71, 119] is an example of a lightweight OS targeted at HPC systems. CNK is a standard open-source, vendor-supported OS that provides maximum performance and scales to hundreds of thousands of nodes. CNK is a lightweight kernel that provides only the services required by HPC applications, with a focus on providing maximum performance. Besides CNK, which runs on the compute nodes, Blue Gene solutions use other operating systems. The i/o nodes, which use the same processors as the compute nodes, run a modified version of Linux that includes a special kernel daemon (CIOD) that handles i/o operations, while the front-end and service nodes run a classical Linux OS [119]. Lightweight and micro kernels usually partner with library services, often in user space, to perform traditional system functionality, either for design or for performance reasons.

CNK maps hardware into the user's address space, allowing services such as messaging to occur, for efficiency reasons, in user space. The disadvantages of CNK come from its specialized implementation. It provides a limited number of processes/threads, minimal dynamic memory support, and contains no fork/exec support [71]. Moreira et al. [119] show two cases (socket i/o and support for Multiple Program Multiple Data (MPMD) applications) where IBM had to extend CNK in order to satisfy the requirements of a particular customer. Overall, experience shows that these situations are quite common with micro and lightweight kernels. There are several variants of full-weight kernels that are customized general-purpose OSes. ZeptoOS [19, 20, 21] is an alternative, Linux-based OS available for Blue Gene machines. ZeptoOS aims to provide the performance of CNK while providing increased Linux compatibility. Jones et al. [95] provide a detailed explanation of the modifications introduced to IBM AIX to reduce the execution time of MPI_Allreduce. They showed that some of the OS activities were easy to remove while others required AIX modifications. The authors prioritize HPC tasks over user and kernel daemons by adjusting process priorities and alternating periods of favored and unfavored priority. hpl [74] reduces OS noise introduced by the scheduler by prioritizing HPC processes over user and kernel daemons and by avoiding unnecessary cpu migrations. Shmueli et al. [148] provide a comparison between CNK and Linux on BlueGene/L, showing that one of the main items limiting Linux scalability on BlueGene/L is the high number of TLB misses. Although the authors do not address scheduling, by using the HugeTLB library they achieve scalability comparable to CNK (although not with the same performance). Mann and Mittal [54] use the secondary hardware thread of IBM POWER5 and POWER6 processors to handle OS noise introduced by Linux. They reduce OS noise by employing Linux features such as the real-time scheduler, process priority, and interrupt redirection. The OS noise reduction comes at the cost of losing the computing power of the second hardware thread. Mann and Mittal consider SMT interference a source of OS noise.

3.4 Measuring OS noise

The usual way to measure OS noise on a compute node consists of running micro-benchmarks with a known and highly predictable amount of computation per time interval (or quantum), and measuring the divergence in each interval from the amount of time taken to perform that computation.

Figure 3.1: Measuring OS noise using FTQ and LTTng-noise: (a) OS noise as measured by FTQ; (b) synthetic OS noise chart. Figures 3.1a and 3.2a report the direct OS overhead obtained by multiplying the execution time of a basic operation by the number of missing operations.

An example of such a technique is the Fixed Time Quantum (FTQ) benchmark proposed by Sottile and Minnich [154]. FTQ measures the amount of work done in a fixed time quantum in terms of basic operations. In each time interval T, the benchmark tracks how many basic operations were performed. Let

N_max be the maximum number of basic operations that can be performed in a time interval T. Then we can indirectly estimate the amount of OS noise, in terms of basic operations, from the difference N_max − N_i, where N_i is the number of basic operations performed during the i-th interval. Figure 3.1a shows the output of FTQ on our test machine. Each spike represents the amount of time the machine was running kernel code instead of FTQ (obtained by multiplying the number of missing basic operations by the time required to perform a basic operation). Whether the kernel interrupted FTQ one, two, or more times, and what the kernel did during each interruption, is not reported.1 This imprecision is one disadvantage of indirect, external approaches to measuring OS noise. An advantage of this approach is that experiments are simple and provide quick relative comparisons between different versions as developers work on reducing noise. As we stated earlier, our goal was to provide a more detailed quantitative evaluation of each OS noise event. In order to obtain this information in Linux, the instrumentation points include all the kernel entry and exit points (interrupts, system calls, exceptions, etc.) and the main OS functions (such as the scheduler, softirqs, or memory management [36]). To obtain a detailed trace of OS events, we extended LTTng with 1) an infrastructure to analyze and convert the LTTng output into formats useful to analyze OS noise, and 2) extra instrumentation points inside the Linux kernel. We call this extended LTTng infrastructure LTTng-noise. The output from LTTng-noise includes a Synthetic OS Noise Chart and an OS Noise Trace. The former is a graph similar to the one generated by FTQ and provides a view of the amount of noise introduced by the OS. The latter is an execution trace of the application that shows all kernel activities. Figure 3.1b shows the Synthetic OS Noise Chart for FTQ generated by LTTng-noise (as shown in Section 3.6, we can obtain similar graphs for any application). The OS Noise Trace can be visualized with standard trace visualizers commonly used for HPC application performance analysis. The current implementation of LTTng-noise supports the Paraver [138] trace format, but other formats can be generated relatively easily by performing a different offline transformation of the original trace file.

1 It is possible to guess that the small, frequent spikes are related to the timer interrupt, as was pointed out by previous work [75, 156, 26].
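A minimal sketch of the FTQ idea described above (not the original benchmark, which batches work to reduce timing overhead): count how many basic operations fit in each fixed quantum and attribute the shortfall with respect to the best interval to OS noise, i.e., (N_max − N_i) multiplied by the estimated time per basic operation. The quantum length and the choice of basic operation are arbitrary.

/* Simplified FTQ-style noise estimate (illustrative only). */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define QUANTUM_NS  1000000L   /* 1 ms fixed time quantum (arbitrary)   */
#define INTERVALS   1000       /* number of sampled intervals           */

static inline int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void)
{
    static int64_t count[INTERVALS];
    volatile double x = 1.0;          /* the "basic operation" works on x */
    int64_t n_max = 0;

    for (int i = 0; i < INTERVALS; i++) {
        int64_t start = now_ns(), n = 0;
        while (now_ns() - start < QUANTUM_NS) {
            x = x * 1.000001 + 0.000001; /* one basic operation */
            n++;
        }
        count[i] = n;
        if (n > n_max)
            n_max = n;
    }

    /* Noise estimate for interval i: (N_max - N_i) * time per operation. */
    double t_op = (double)QUANTUM_NS / (double)n_max;
    for (int i = 0; i < INTERVALS; i++)
        printf("%d %.0f\n", i, (double)(n_max - count[i]) * t_op);
    return 0;
}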

3.4.1 LTTng-noise

A concern about LTTng-noise was the overhead introduced by the instrumentation. If the overhead is too large, the instrumentation may change the characteristics of the applications, making the analysis invalid. Minimizing introduced noise, or more accurately keeping it below a measurable level, was especially critical for us as we were specifically trying to measure noise. In addition, HPC applications are susceptible to this induced noise causing an additive detrimental effect. Fortunately, LTTng was designed with similar goals [55], and our kernel modifications do not add extra overhead. In particular, LTTng has been designed with the following properties: 1) low system overhead, 2) high scalability for multi-core architectures, and 3) high precision. Low system overhead is obtained with a pre-processing approach, i.e., the kernel is instrumented statically and data are analyzed offline. High scalability is ensured by the use of per-cpu data and by employing lock-less tracing mechanisms [55]. Finally, high precision is obtained using the CPU timestamp counter, providing a time granularity on the order of nanoseconds. The result of the low-overhead design of LTTng and of our careful modifications is an overhead on the order of 0.28% on average (measured among all the LLNL Sequoia applications we tested). Since it is not possible to know in advance which kernel activities affect HPC applications, we collected all possible information. LTTng already provides a wide coverage of the Linux kernel. Even with these capabilities, it is still important to determine which events are needed for the noise analysis and, eventually, add new trace points. To this end, we located all the entry and exit points and identified the components and sections of code of interest. We then modified LTTng by adding extra information to existing trace points and new instrumentation points for those components that were not already instrumented. Another important consideration is which kernel activities should be considered noise (e.g., timer interrupts) and which are, instead, services required by applications and should be considered part of the normal application execution (e.g., a read system call). We consider OS noise all those activities that are not explicitly requested by the applications but that are necessary for the correct functioning of the compute node. Timer interrupts, scheduling, and page faults are examples of such activities. Second, we only account kernel activities as introduced noise when an application process is runnable. During the offline OS noise analysis, we do not consider a kernel interruption as noise if, when it occurs, the process is blocked waiting for communication. Even with these restrictions, the number of kernel activities tracked is considerable.

Since it is not possible to show all the tracked activities, we focus on those that appear more often or have a large impact (such as timer interrupts, scheduling, and page faults). However, developers concerned about specific areas can use our infrastructure to drill down into any particular area of interest by simply applying different filters.1 The second extension we made to LTTng is an offline trace transformation tool. We developed an external LTTng module that generates execution traces suitable for Paraver [138]. Additionally, the module generates a data format that can be used as input to Matlab. We use this to derive the synthetic OS noise chart and the other graphs presented in this chapter. We took particular care of nested events, i.e., events that happen while the OS is already performing other activities. For example, the local timer may raise an interrupt while the kernel is performing a tasklet. Handling nested events is particularly important for obtaining correct statistics.
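The two accounting rules described in this section, counting kernel activity as noise only while the application process is runnable and attributing time correctly when events nest (e.g., a timer interrupt arriving during a tasklet), can be sketched as an offline pass over entry/exit records. The record layout and the per-record runnable flag used below are hypothetical simplifications, not the actual LTTng-noise trace format.

/* Sketch of offline noise accounting with nested kernel events.
 * The trace record layout is hypothetical. */
#define MAX_DEPTH 16

typedef struct {
    long long ts;        /* timestamp (ns)                        */
    int       enter;     /* 1 = event entry, 0 = event exit       */
    int       event_id;  /* e.g. timer irq, softirq, page fault   */
    int       runnable;  /* application process runnable here?    */
} record_t;

/* Per-event accumulated noise, excluding time spent in nested children. */
static long long noise[256];

void account(const record_t *rec, int nrec)
{
    int       depth = 0;
    int       stack_ev[MAX_DEPTH];
    int       stack_runnable[MAX_DEPTH];
    long long stack_ts[MAX_DEPTH];

    for (int i = 0; i < nrec; i++) {
        if (rec[i].enter && depth < MAX_DEPTH) {
            /* Pause the enclosing event so nested time is not double counted. */
            if (depth > 0 && stack_runnable[depth - 1])
                noise[stack_ev[depth - 1]] += rec[i].ts - stack_ts[depth - 1];
            stack_ev[depth] = rec[i].event_id;
            stack_ts[depth] = rec[i].ts;
            stack_runnable[depth] = rec[i].runnable;
            depth++;
        } else if (!rec[i].enter && depth > 0) {
            depth--;
            /* Count the event only if the application could have run instead. */
            if (stack_runnable[depth])
                noise[stack_ev[depth]] += rec[i].ts - stack_ts[depth];
            if (depth > 0)
                stack_ts[depth - 1] = rec[i].ts;   /* resume the parent */
        }
    }
}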

3.4.2 Tracing scalability

The amount of data generated to trace OS events on a single node is not an issue. On the other hand, the application of any tracing methodology to clusters composed of a large number of nodes (i.e., thousands of nodes) faces the challenge of collecting and storing a very large amount of data at run time. This problem arises for any approach that wants to capture OS noise events with a fine granularity in HPC systems. Given that OS noise is inherently redundant across nodes, one of the most effective solutions is to enable tracing only on a statistically significant subset of the cluster's nodes. Another option is to apply data-compression techniques at run time to reduce the data size.

3.4.3 Analyzing FTQ with LTTng-noise

A comparison of Figure 3.1a and Figure 3.1b shows that LTTng-noise captures the OS noise identified by FTQ. The small differences between the two graphs can be explained by realizing that FTQ computes the missing integral number of basic operations and multiplies this by the cost of each operation, thus producing discretized values, while LTTng-noise calculates the measured OS noise between given points using the trace events. In general, the result is that FTQ slightly overestimates the OS noise, since FTQ does not account for partially completed basic operations. Apart from these small differences, the figures show that the data output from these two methods are very similar.

1Filters are common features for HPC performance analysis tools and we provide the same capability for our Matlab module.

Thus, we have a high degree of confidence in comparing results from the different techniques, and therefore that our new technique represents an accurate view of induced OS noise. Unlike FTQ, our new technique allows quantifying and providing details of what contributed to the noise of each interruption. The synthetic OS noise chart in Figure 3.1b shows, for each OS interruption, the kernel activities performed and their durations. For example, at time x = X1, FTQ detects 7.70 µsec of OS overhead (this point is shown in Figure 3.1a). The corresponding point in Figure 3.1b (also highlighted on the chart) shows an overhead of 6.96 µsec, but also shows that the interruption consists of the timer interrupt handler (1), a run_timer_softirq softirq (2), and a process preemption (eventd daemon) (3). Figures 3.2a and 3.2b show details of Figures 3.1a and 3.1b, respectively, centered around point x = X1. Figure 3.2b shows that, indeed, the small frequent spikes are related to the timer interrupt handler but also to the run_timer_softirq softirq, which takes about the same amount of time. Figure 3.2b also shows that there are smaller spikes, to the best of our knowledge not identified in previous work, that are similar to those generated by the timer interrupt handler, but that are not related to timers or other periodic activity. These interruptions are caused by page faults, which are, as we will see in Section 4.6, application-dependent. The synthetic OS noise chart is useful to analyze the OS noise experienced by one process. HPC applications, however, consist of processes and threads distributed across nodes. Execution traces can be used to provide a deeper understanding of the relationships among the different application processes. Moreover, performance analysis tools, such as Paraver, are able to quickly extract statistics, zoom in on interesting portions of the application, remove or mask states, etc. Figures 3.3a and 3.3b show a portion of the execution trace of FTQ. The execution traces generated with LTTng-noise are very dense, even for short applications. Even if performance analysis tools are able to extract statistics and provide 3D views, the visual representation, especially for long applications, is often complex and lossy due to the number of pixels on a screen and the amount of information to be displayed at each time interval. For this reason, we usually zoom in to show small, interesting portions of execution traces that highlight specific information or behavior. Figure 3.3a shows 75 msec of FTQ execution. In this trace, white represents the application running in user mode, while the other colors represent different kernel activities. The trace shows the periodic timer interrupts (black lines) and the frequent page faults (red lines). While, in this case, page faults take approximately the same time, timer interrupts may trigger other activities (e.g., softirqs, scheduling, or process preemption) and, thus, have different sizes.


(a) OS noise as measured by FTQ (zoom)

(b) Synthetic OS noise chart (zoom)

Figure 3.2: Zoom on noise peak using FTQ and LTTng-noise. Figures 3.1b and 3.2b report time as measured by LTTng-noise.

(a) FTQ trace

(b) Zoom of the FTQ trace

Figure 3.3: FTQ execution trace. Figure 3.3a shows a part (75 ms) of the FTQ trace that highlights the periodic timer interrupts (black lines), the page faults (red lines), and a process preemption (green line). Figure 3.3b zooms in and shows that the interruption consists of several kernel events. At this level of detail we can distinguish the timer interrupt (2.178 µsec, black) followed by the run_timer_softirq softirq (1.842 µsec, pink), the first part of the schedule (0.382 µsec, orange), the process preemption (2.215 µsec, green), and the second part of the schedule (0.179 µsec, orange).

In order to better visualize the activities that were performed during a given kernel interruption, Figure 3.3b shows a detail of the previous picture. In particular, Figure 3.3b depicts one interruption caused by a timer interrupt. Paraver provides several pieces of information for each event, such as time duration or internal status, simply by clicking on the desired event. For the detailed execution trace of FTQ, the tool provides the following time durations for each interruption: 2.178 µsec (timer interrupt), 1.842 µsec (run_timer_softirq), 0.382 µsec (first part of the schedule that leads to a process preemption), 2.215 µsec (process preemption), and 0.179 µsec (second part of the schedule that resumes the FTQ process).

3.5 Experimental Results

In this section we present our experimental results based on LTTng-noise and the Sequoia benchmarks [110] (AMG, IRS, LAMMPS, SPHOT, UMT). We ran experiments on a dual Quad-Core AMD Opteron(tm) processor workstation (for a total of 8 cores) with 64 Gigabytes of RAM. We used a standard Linux kernel, version 2.6.33.4, patched with LTTng-noise. In order to minimize noise activity and to perform representative experiments, we removed from the test machine all kernel and user daemons that do not usually run on a HPC compute node. The system is in a private network, isolated from the outside world, that consists of the machine itself and an NFS server.

Most of the i/o operations, including access to the application's executable files and the input sets, are performed through the network interface, as is often the case for HPC compute nodes. In order to obtain representative results, the Sequoia benchmarks were configured to run with 8 MPI tasks (one task per core) and to last several minutes. Section 3.5.1 analyzes the OS noise breakdown for each application. Sections 3.5.2, 3.5.3, 3.5.4, and 3.5.5 analyze in detail the OS activities identified by our infrastructure as the major sources of OS noise. In each sub-section we provide synthetic statistical data and a few examples of more detailed information for selected applications of interest.

3.5.1 Noise breakdown

Each application experiences a different amount of OS noise. Table 3.2 summarizes the total time, the application time, the OS noise time, and the percentage of OS noise.

Table 3.2: Application and Noise time measured in CPU cycles

         Total               Application         Noise            % Noise
AMG      900,436,075,066     892,327,859,639     8,108,215,427    0.90%
IRS      706,800,813,275     699,373,261,625     7,427,551,650    1.05%
LAMMPS   1,044,575,042,629   1,036,931,614,406   7,643,428,223    0.73%
SPHOT    693,157,887,616     692,839,730,149     318,157,467      0.05%
UMT      2,340,648,466,685   2,297,049,612,303   43,598,854,382   1.86%

Each OS interruption may consist of several different kernel activities. Our analysis allows us to break down each OS interruption into kernel activities. For simplicity and clarity, we focus on the kernel activities with larger contributions to OS noise. To this end, we classified kernel activities into 5 categories:

• periodic: the timer interrupt handler and run_timer_softirq, the softirq responsible for executing expired software timers.

• page fault: the page fault exception handler.

• scheduling: the schedule function and the related softirqs (rcu_process_callbacks and run_rebalance_domains).

• preemption: kernel and user daemons that preempt the application's processes.

• i/o: the network interrupt handler, softirqs and tasklets.
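A small sketch of the classification step just described: each traced kernel activity is mapped to one of the five categories before computing per-category totals such as those shown in Figure 3.4. The event names used here are hypothetical stand-ins for the actual trace event identifiers.

#include <string.h>

typedef enum { PERIODIC, PAGE_FAULT, SCHEDULING, PREEMPTION, IO, OTHER } noise_cat_t;

/* Hypothetical mapping from trace event names to the five categories. */
noise_cat_t classify(const char *event)
{
    if (!strcmp(event, "timer_interrupt") || !strcmp(event, "run_timer_softirq"))
        return PERIODIC;
    if (!strcmp(event, "page_fault"))
        return PAGE_FAULT;
    if (!strcmp(event, "schedule") || !strcmp(event, "rcu_process_callbacks") ||
        !strcmp(event, "run_rebalance_domains"))
        return SCHEDULING;
    if (!strcmp(event, "daemon_preemption"))
        return PREEMPTION;
    if (!strcmp(event, "net_irq") || !strcmp(event, "net_rx_action") ||
        !strcmp(event, "net_tx_action"))
        return IO;
    return OTHER;
}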

Figure 3.4 shows the OS noise breakdown for the Sequoia applications. Each Sequoia application experiences OS noise in a different way. For example, page faults represent a large portion of OS noise for AMG and UMT (82.4% and 86.7%, respectively), while SPHOT and LAMMPS are only marginally affected by the same kernel activity (13.5% and 10.2%).

Figure 3.4: OS noise breakdown for Sequoia benchmarks


IRS, SPHOT and, especially, LAMMPS are preempted by other processes during their computation, which represents a considerable component of their total jitter (27.1%, 24.7%, and 80.2%, respectively). Our analysis shows that the applications were interrupted particularly by rpciod, an i/o kernel daemon. Periodic activities (the timer interrupt handler and periodic software timers) are, instead, limited (between 5% and 10%) for all applications but SPHOT.

3.5.2 Page faults

Page faults can be one of the largest sources of OS noise. Memory management functionalities like page-on-demand and copy-on-write can significantly reduce the impact of page faults. Moreover, one of the main advantages of lightweight kernels is to drastically simplify dynamic memory management, thus reducing or even removing (as in CNK [71]) the noise caused by page faults.

Table 3.3: Page fault statistics

         freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG      1,693           4,380        69,398,061   250
IRS      1,488           4,202        4,825,103    218
LAMMPS   231             3,221        27,544       248
SPHOT    25              2,467        889,333      221
UMT      3,554           4,545        50,208       229

As Table 3.3 shows, for some applications, like AMG, IRS, and UMT, the frequency of page faults is even higher than that of the timer interrupt (100 Hz in our configuration). Even more important, the duration of page faults varies widely and differs from application to application. From Table 3.3, we can observe that each application presents a different number of page faults. Moreover, though the minimum duration of a fault is similar across the benchmarks (about 250 nsec), the maximum duration varies from application to application (from 27.5 µsec to 69,398 µsec). OS noise activities that vary so much may limit application scalability on large machines, as shown in previous work [137, 75]. Figures 3.5a and 3.5b1 show the page fault time series for AMG and LAMMPS, respectively. We chose these two applications because they present different distributions. Figure 3.5a (AMG) shows two main peaks (around 2.5 µsec and 4.5 µsec) with a long tail, while LAMMPS (Figure 3.5b) shows a one-sided distribution with a main peak around 2.5 µsec.

1 Time distributions may have a very long tail that could make visualization difficult. To improve the visualization, we cut all the distributions in the histograms at the 99th percentile.

(a) AMG

(b) LAMMPS

Figure 3.5: Page fault time series


(a) AMG

(b) LAMMPS

Figure 3.6: Page fault trace. Figures 3.6a and 3.6b show the execution traces of AMG and LAMMPS, respectively. We filtered out all the events but the page faults (red). The traces highlight the different distributions of page faults for AMG (throughout the whole execution) and LAMMPS (mainly located at the beginning and the end). Notice that in some regions page faults are very dense and appear as one large page fault when, in fact, there are thousands very close to each other but the pixel resolution does not allow distinguishing them.

Figure 3.6 shows another important difference between AMG and LAMMPS. While for LAMMPS page faults, marked in red in the pictures, are mainly located at the beginning (initialization phase), AMG page faults are spread throughout the whole execution, with several accumulation points. From these execution traces we can conclude that page faults are not a main concern for LAMMPS, neither in terms of OS noise overhead (10.2% of the total OS noise, as reported in Figure 3.4) nor in terms of their temporal distribution, as they mainly appear during the initialization phase. For AMG, instead, page faults may seriously affect performance. They represent a considerable portion of the total OS noise (82.4%) and may interrupt the application's computing phases.

3.5.3 Scheduling

Scheduling activities are not only related to the schedule function but also to other kernel daemons and softirqs that are responsible, among other tasks, for keeping the system balanced. Our analysis shows that, indeed, the overhead introduced by the schedule function is negligible and constant, confirming the effectiveness of the new Completely Fair Scheduler (CFS) [118], whose scheduling operations have O(log n) complexity.

Domain balancing, instead, may create several problems in terms of execution time variability and, therefore, scalability [74]. In this section we analyze the effect of the run_rebalance_domains softirq. run_rebalance_domains is triggered when needed from the scheduler tick and is in charge of checking whether the system is balanced and, if not, moving tasks from one cpu to another. Domain rebalancing is described in [74, 34, 36]. Here it is worthwhile to notice that rebalancing domains introduces two kinds of overhead: direct (the time required to execute the rebalance code) and indirect (moving a process to another cpu may require extra time to warm up the local cache and other processor resources). Figures 3.7a and 3.7b show the execution time series of the run_rebalance_domains softirq for UMT and IRS. While IRS shows a fairly compact distribution with a main peak around 1.80 µsec, UMT shows a much wider distribution with an average of 3.36 µsec. Indeed, UMT is a complex application that involves, besides MPI, Python and pyMPI scripts. The OS thus has a much tougher job balancing UMT than IRS.

3.5.4 Process preemption and i/o

The OS scheduler may decide to suspend one or more running processes during the execution of a parallel application (process preemption). The suspended process is not able to progress during that time and the whole parallel application may be delayed. Typically, the OS suspends a process because there is another, higher-priority process that needs to be executed. Kernel and user daemons are classical examples of such processes. The interrupted process may, in turn, be migrated by the scheduler to another cpu. This approach provides advantages if the target cpu is idle, but it may also force time sharing with another process of the parallel application. A discussion of the possible effects of process migration and preemption is provided in [74]. In our experiments the only kernel daemon that is very active is the i/o daemon (rpciod).1 HPC compute nodes do not usually have disks or other i/o devices except, of course, the network adapter. Most HPC networks (such as InfiniBand [146] or the BlueGene torus [12]) allow user applications to send/receive messages without involving the OS, through kernel bypass. All i/o operations (e.g., reading input data or writing final results) are shipped to an i/o node through the network. Once the operation is completed, the i/o node returns the results to the compute node.

1 LTTng-noise also uses a kernel daemon to collect data at run time. This daemon is not supposed to be running in a normal environment; therefore we are not taking it into account in the rest of the discussion. However, we noticed that, though it depends on the application, the LTTng-noise kernel daemon was especially active at the beginning and at the end of the applications. Moreover, as reported in Section 3.4, the overall overhead of LTTng-noise is quite small.


(a) UMT

(b) IRS

Figure 3.7: Domain rebalance softirq time series

In our test environment, the compute node is connected to an NFS server through the rpciod i/o daemon. For most of the applications, rpciod is the only kernel daemon that generates OS noise. UMT is a different case because the application is more complex than the others. In particular, UMT runs several Python processes that may 1) interrupt the computing tasks, and 2) trigger process migration and domain balancing. Figure 3.4 shows that the OS noise experienced by LAMMPS is dominated by process preemption, marked in green. Figure 3.8 shows that, indeed, LAMMPS processes are preempted several times throughout the execution. As opposed to the other benchmarks, LAMMPS performs a considerable amount of i/o operations. Since data are moved to/from the network, the OS suspends LAMMPS processes that issued i/o operations until the data transfer is completed. In Linux, network i/o operations involve the network interrupt handler and the receiver (net_rx_action) and transmission (net_tx_action) tasklets.1 On data transfer completion, the active network tasklet wakes up the suspended processes in the order in which the i/o operations complete and on the cpu that receives the network interrupt. Clearly, that cpu may be running another LAMMPS process (which is preempted) at that time, and thus a load-balancing reschedule may be triggered to move the preempted process to an idle cpu (migration).

Figure 3.8: Process preemption experienced by LAMMPS. This picture shows the complete execution trace of LAMMPS. We filtered out all events but process preemptions (green). Though the pixel resolution does not allow distinguishing all the process preemptions, it is clear that LAMMPS suffers many frequent preemptions.

Table 3.4: Network interrupt events frequency and duration

         freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG      116             1,552        347,902      540
IRS      87              1,666        353,294      521
LAMMPS   11              2,520        356,380      594
SPHOT    21              1,372        341,003      535
UMT      77              1,975        349,288      484

1Tasklets are implemented on top of softirqs and differ from the latter in that tasklets of the same type are always serialized. In other words, two tasklets of the same type cannot run on two different cpus while two softirqs of the same type are allowed to do so[36].


Table 3.5: net rx action frequency and duration

            freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG         53              3,031        98,570       192
IRS         43              4,460        78,236       174
LAMMPS      10              4,707        84,152       199
SPHOT       15              1,987        45,150       207
UMT         22              5,484        75,042       167

Table 3.6: net tx action frequency and duration

            freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG         15              471          8,227        176
IRS         10              504          4,725        176
LAMMPS      2               559          4,392        175
SPHOT       3               409          2,746        200
UMT         9               545          8,902        173

Tables 3.4, 3.5 and 3.6 report the frequency and time duration of the network interrupt handler, the net rx action tasklet, and the net tx action tasklet, respectively. As we can see, the transmission tasklet is faster and more consistent than the receiver tasklet. The reason is that while sending data to the NFS server is an asynchronous operation, receiving data from the NFS server must be done synchronously. In fact, the transmission tasklet can return right after the network DMA engine has been started, because the copy of the data from memory to the network adapter's buffer will be performed asynchronously by the DMA engine.
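To make the statistics in Tables 3.4-3.6 concrete, the following minimal sketch (hypothetical helper code, not the actual LTTng-noise analysis modules) shows how per-event frequency, average, maximum and minimum durations can be accumulated from the entry/exit timestamps of a traced kernel activity such as net rx action.

    #include <stdint.h>
    #include <stdio.h>

    /* Running statistics for one class of kernel activity. */
    struct event_stats {
        uint64_t count;
        uint64_t sum_ns;
        uint64_t min_ns;
        uint64_t max_ns;
    };

    static void stats_init(struct event_stats *s)
    {
        s->count = 0;
        s->sum_ns = 0;
        s->min_ns = UINT64_MAX;
        s->max_ns = 0;
    }

    /* Called once per traced event with its entry and exit timestamps (nsec). */
    static void stats_add(struct event_stats *s, uint64_t entry_ns, uint64_t exit_ns)
    {
        uint64_t d = exit_ns - entry_ns;
        s->count++;
        s->sum_ns += d;
        if (d < s->min_ns) s->min_ns = d;
        if (d > s->max_ns) s->max_ns = d;
    }

    /* Prints one row in the format of Tables 3.4-3.6. */
    static void stats_print(const char *name, const struct event_stats *s, double trace_seconds)
    {
        printf("%-10s freq=%.0f ev/sec avg=%.0f nsec max=%llu nsec min=%llu nsec\n",
               name, s->count / trace_seconds,
               s->count ? (double)s->sum_ns / s->count : 0.0,
               (unsigned long long)s->max_ns, (unsigned long long)s->min_ns);
    }

    int main(void)
    {
        struct event_stats rx;
        stats_init(&rx);
        stats_add(&rx, 1000, 4100);   /* illustrative entry/exit pairs */
        stats_add(&rx, 9000, 9500);
        stats_print("net_rx", &rx, 1.0);
        return 0;
    }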

3.5.5 Periodic activities

Some activities are periodically started by the timer interrupt. With the introduction of high resolution timers in Linux 2.6.18, the local timer may raise an interrupt any time a high resolution timer expires. One such timer is the periodic timer interrupt, which is in charge of accounting for the current process' elapsed time and, eventually, calling the scheduler if the process time quantum has expired. This interrupt is a well-known source of OS noise [75, 156]. In our test machine we set the frequency of this periodic high resolution timer to the lowest possible value (100 Hz) so as to minimize the effect of the periodic timer interrupt. The term timer interrupt is often used to identify both the timer interrupt and the softirq. With our methodology, instead, we are able to distinguish between the timer interrupt (or timer interrupt top half) and the softirq run timer softirq (called bottom half in old Linux releases). The run timer softirq, in particular, may vary

considerably across applications and between two separate instances of the event. This softirq, in fact, is in charge of running all the handlers connected to expired software timers. Each handler may have a different duration, and applications may set software timers according to their needs (for example, to take regular checkpoints). It follows that this activity may show considerable execution time variation. Figures 3.9a and 3.9b show the time series of the run timer softirq softirq for AMG and UMT. As we can see from the figures, and as confirmed by previous studies, the run timer softirq softirq has a long-tail density function.
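To illustrate how application-set software timers feed work into the run timer softirq, the user-space sketch below (purely illustrative; it is not taken from the benchmarks studied here, and whether a given timer is serviced through this softirq depends on the kernel version and timer type) arms a periodic POSIX timer of the kind an application might use for regular checkpoints; each expiration adds a handler for the kernel timer machinery to execute.

    /* Build with: cc timer_example.c -lrt */
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    /* Signal handler invoked on every timer expiration. */
    static void checkpoint_hint(int sig)
    {
        (void)sig;   /* a real application would flag its main loop to take a checkpoint */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = checkpoint_hint;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGRTMIN, &sa, NULL);

        struct sigevent sev;
        sev.sigev_notify = SIGEV_SIGNAL;        /* deliver a signal on expiration */
        sev.sigev_signo = SIGRTMIN;
        sev.sigev_value.sival_ptr = NULL;

        timer_t timer;
        timer_create(CLOCK_MONOTONIC, &sev, &timer);

        struct itimerspec its = { 0 };
        its.it_value.tv_sec = 60;               /* first expiration after 60 seconds */
        its.it_interval.tv_sec = 60;            /* then periodically, every 60 seconds */
        timer_settime(timer, 0, &its, NULL);

        pause();                                /* wait for the first expiration */
        return 0;
    }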

Table 3.7: Timer interrupt statistics

            freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG         100             3,334        29,422       795
IRS         100             6,289        35,734       867
LAMMPS      100             3,763        34,555       1,194
SPHOT       100             1,498        10,204       833
UMT         100             6,451        29,662       982

Table 3.8: Softirq run timer softirq statistics

            freq (ev/sec)   avg (nsec)   max (nsec)   min (nsec)
AMG         100             1,718        49,030       191
IRS         100             3,897        57,663       193
LAMMPS      100             2,242        58,628       256
SPHOT       100             620          32,926       223
UMT         100             3,364        87,472       214

Tables 3.7 and 3.8 show statistical data related to the timer interrupt and the run timer softirq softirq. As expected, the frequency of the timer interrupt is at least 100 events/second (100 Hz) for all the applications. Also, the fact that the frequency is not higher means that the applications do not set any other software timer.

3.6 Noise disambiguation

Analyzing OS noise indirectly through micro benchmarks may hide the true causes of a given noise event and lead the developer in the wrong direction. This section provides two examples of using LTTng-noise to disambiguate similar OS activities.


Figure 3.9: run timer softirq time series for (a) AMG and (b) UMT.

Figure 3.10: AMG - Synthetic OS noise graph

3.6.1 Disambiguation of qualitatively similar activities

Figure 3.10 shows a portion of the synthetic OS noise chart for AMG. The graph shows several page faults (red) and two timer interrupts (blue) which, in turn, trigger the run timer softirq softirq (green). The picture shows that there are two interruptions, a page fault and a timer interrupt (both highlighted in the graph), that have similar durations (2913 nsec and 2902 nsec, respectively). Indirect measurements through micro benchmarks do not make it possible to distinguish between these two kernel activities, thus the developer may think that they are the same activity.

The synthetic OS noise chart provided by LTTng-noise, instead, allows us to disambiguate the two OS interruptions and to clearly identify each activity. In this particular case, the graph shows that the first highlighted interruption is a page fault that takes 2913 nsec, while the second interruption is composed of a timer interrupt handler (2648 nsec) and the run timer softirq softirq (254 nsec), which sum up to a total of 2902 nsec.
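The same decomposition can be expressed programmatically; the snippet below (hypothetical, not the LTTng-noise implementation) folds the paired entry/exit events of one interruption into per-activity durations and their total, reproducing the 2648 + 254 = 2902 nsec breakdown.

    #include <stdint.h>
    #include <stdio.h>

    struct activity {
        const char *name;
        uint64_t entry_ns;
        uint64_t exit_ns;
    };

    int main(void)
    {
        /* Entry/exit pairs of the second highlighted interruption (values from the text). */
        struct activity parts[] = {
            { "timer interrupt (top half)", 0,    2648 },
            { "run_timer_softirq",          2648, 2902 },
        };
        uint64_t total = 0;
        for (unsigned i = 0; i < sizeof parts / sizeof parts[0]; i++) {
            uint64_t d = parts[i].exit_ns - parts[i].entry_ns;
            total += d;
            printf("%-28s %5llu nsec\n", parts[i].name, (unsigned long long)d);
        }
        printf("%-28s %5llu nsec\n", "whole interruption", (unsigned long long)total);
        return 0;
    }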


3.6.2 OS noise composition

Micro benchmarks, such as FTQ, compute the OS noise as missing operations in a given iteration [154]. This means that micro benchmarks are not able to distinguish two unrelated events if they happen in the same iteration. Figure 3.11a shows three iterations of the FTQ micro benchmark. The three spikes are equidistant (which suggests a common periodic activity) but the jitter measured in the first and the third iterations (about 5 µsec) is different from the one measured in the second iteration (7.5 µsec). This different OS noise suggests that the events that occurred during the first and the third iterations are different from those that occurred during the second iteration. Moreover, the spikes in the first and third iterations are similar to the very frequent ones caused by the timer interrupt, while the spike in the second iteration is not similar to anything else. A qualitative analysis would conclude that FTQ experienced a timer interrupt during the first and the third iterations while “something else” happened during the second iteration, contradicting the hypothesis of equidistant events. Our analysis shows that, indeed, a timer interrupt also occurred during the second iteration (Figure 3.11b), confirming the first observation of equidistant events. However, right before that timer interrupt, a page fault occurred. In this case, FTQ was not able to distinguish the two events, which indeed appear as one in its graph. LTTng-noise, instead, precisely shows the two events as separate interruptions, allowing the developer to derive correct conclusions about the nature of the OS noise. This example shows how an internal analysis is more effective than an indirect one and provides information that, otherwise, would be impossible to obtain.
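For reference, the sketch below reproduces the FTQ idea in simplified form (the quantum length and iteration count are illustrative, not the actual FTQ parameters): each fixed time quantum counts how many work units complete, and any missing work is attributed to noise, which is precisely why two unrelated events falling in the same quantum cannot be told apart.

    #include <stdio.h>
    #include <time.h>

    #define QUANTUM_NS  1000000ULL   /* 1 ms fixed time quantum */
    #define ITERATIONS  1000

    static unsigned long long now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    int main(void)
    {
        static unsigned long long work[ITERATIONS];
        volatile unsigned long long sink = 0;

        for (int i = 0; i < ITERATIONS; i++) {
            unsigned long long start = now_ns(), done = 0;
            while (now_ns() - start < QUANTUM_NS) {
                sink += done;                    /* one unit of work */
                done++;
            }
            work[i] = done;                      /* fewer units => more noise in this quantum */
        }

        unsigned long long max = 0;
        for (int i = 0; i < ITERATIONS; i++)
            if (work[i] > max) max = work[i];
        for (int i = 0; i < ITERATIONS; i++)
            printf("%d %llu\n", i, max - work[i]);   /* missing work per quantum */
        return 0;
    }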

3.6.3 Conclusions

OS noise has been studied extensively and it is a well-known problem in the HPC community. Though previous studies succeeded in providing important insights on OS noise, most of those studies are “qualitative” in that they characterize the overall effect of OS noise on parallel applications but tend not to identify and characterize each kernel noise event. As HPC applications evolve towards more complex programming paradigms that require richer support from the OS, we believe that a quantitative analysis of the kernel noise introduced by such operating systems is much needed. In this chapter we presented our technique to provide a quantitative, descriptive analysis of each kernel noise event. We extended LTTng, a low-overhead, high-resolution kernel tracer, with extra trace points and offline modules to analyze the OS noise.

Figure 3.11: Noise disambiguation: (a) OS noise as measured by FTQ; (b) synthetic OS noise chart.

In particular, we developed a module that generates execution traces suitable for Paraver and a data format that can be used as input for Matlab. We demonstrated that our new technique and toolkit (LTTng-noise) accurately reflect the noise measured by other known techniques (such as FTQ). In addition to previous insights, our new technique allows quantifying and providing details on what contributed to the noise of each interruption. Using LTTng-noise, we analyzed the noise introduced by the OS on the LLNL Sequoia applications and showed that 1) each application experiences different jitter (both in terms of overhead and composition), 2) page faults may have an even larger impact than timer interrupts (both in terms of frequency and duration), and 3) some activities have wider time distributions that may lead to load imbalance at scale for particular applications. Moreover, we are able to identify and quantify each single source of OS noise. This capability is very useful to analyze current and future applications and systems. Finally, we showed two case studies where we used our technique to disambiguate kernel noise events. This noise disambiguation would not be possible without the detailed information provided by LTTng-noise. We plan to use LTTng-noise to perform deeper analyses of current and future parallel applications and to quantify how our findings affect the scalability of those applications on large machines with hundreds of thousands of cores.

Chapter 4

Light-weight Operating Systems

This part of the thesis addresses the problem of operating system scalability. In the previous chapter we examined the scalability issues of general purpose operating systems, while in this chapter we focus on the scalability of lightweight OSes. IBM CNK is a lightweight kernel developed for the BlueGene systems that enables high scalability with low overhead. As explained in the introduction, the main limitation of a lightweight kernel, with respect to a general purpose OS, is its reduced set of features. CNK, for instance, lacks support for dynamic memory mapping (i.e., the virtual-to-physical memory mapping is determined at loading time and never changes). In this chapter, we perform a quantitative study of the cost of adding dynamic memory mapping to CNK in terms of application performance and scalability.

4.1 Summary

In order to achieve 10^18 FLOPS within a limited power budget (25MW) [24], exascale computing will require novel technologies and fundamental changes in the software stack [58]. Exascale systems are expected to feature O(1000) more cores than current systems (up to 10^8 - 10^9 cores), but the growth in the volume of memory is not expected to match that of processor cores. Exascale systems are, thus, envisioned to have orders of magnitude less memory per core than current petascale systems, in terms of both total memory per node and cache memory per core. To reduce the complexity posed by the high concurrency level (on the order of billions of threads) and the limited available per-core memory, programmers are already exploring novel programming models and algorithms. Modern parallel applications are becoming more complex: UMT from the LLNL Sequoia benchmarks [110], Climate code [100], and UQ [23] require dynamic library and script language support, as well as richer support from the OS and the system software.

Moreover, there is a growing interest in richer shared-memory programming models, such as OpenMP [134], PGAS (e.g., UPC [60, 161] and GA [128]), Charm++ [96], and Transactional Memory [147, 87, 83], that have the potential to simplify parallel programming and to enable users to extract higher levels of parallelism. Finally, dynamic libraries, Python scripts, dynamic memory allocation, checkpoint-restart systems, tracers to collect performance and debugging information, and virtualization [104] are seeing wider use in HPC systems. Current trends indicate that large petascale to exascale systems will demand a richer system software ecosystem with support for all those technologies and more complex programming models. In this scenario, memory management becomes of predominant importance and, therefore, is one of the operating system components that might undergo a fundamental re-design. Current petascale systems, however, have taken opposite approaches, especially for what concerns memory management. Petascale supercomputers run either general purpose operating systems (OS), such as Linux, adapted to run high performance computing (HPC) applications, or lightweight kernels (LWKs), such as CNK [72] or Catamount [140], specifically designed to run scientific code. Both solutions provide advantages and disadvantages: general purpose OSes provide high flexibility and a large variety of runtime services that virtually allow running any kind of application and make them suitable to run future HPC applications. At the same time, general purpose OSes also introduce considerable noise, which has been identified as one of the sources of poor scalability on large systems [68, 75, 137]. Lightweight kernels are explicitly designed to efficiently run parallel applications (more specifically, MPI and/or OpenMP applications) but often require internal modifications when adding support for a new service [119]. Constantly adding new services and flexibility might increase the level of system noise. We present a study on the impact of the memory management sub-system on modern and future parallel HPC applications, with particular focus on the effect of translation look-aside buffer (TLB) misses at scale, to understand whether a more flexible approach than the current LWK solution is practical from the performance point of view. Although adding support for TLB misses necessarily introduces some runtime overhead, flexible TLB management would allow the development of the richer ecosystem required by exascale applications. The impact of TLB misses on real HPC applications has been studied in the past but only qualitatively [148], mainly because of the difficulties in comparing different solutions using the same OS.

In this study, we provide an apples-to-apples comparison and analyze different TLB management approaches — dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) — on a real BG/P system with up to 1024 nodes (4096 cores). To this end we modified CNK, a noiseless OS designed for BlueGene systems, by adding support for TLB misses. CNK was not originally designed to take TLB misses, thus we implemented the required functionalities, including an exception handler and a replacement policy, to resolve virtual-to-physical mappings. We also implemented a lightweight tracing mechanism to collect per-TLB miss statistics without introducing excessive overhead. We then perform a sensitivity analysis, varying the number of concurrent nodes, the size of the pages (from 4KB to 16MB) and the overhead introduced by the TLB miss handler. Finally, in order to simulate a more complex memory management sub-system, such as one with dynamic memory mapping, and to evaluate the effect of TLB misses at scale while avoiding the noise of other OS components, we inject delays with random and constant signatures each time a TLB miss occurs. We consider that a performance degradation of 2% would be justified by the higher level of thread-level parallelism that a richer system software ecosystem could help achieve. We analyze the TLB pressure and the overall performance degradation of several applications from the LLNL Sequoia [110] and ANL benchmark suites. Our experimental results show that, for several applications (including AMG, IRS, LAMMPS, and GTC), the impact of TLB misses is below 2% with 1MB-pages even when injecting a large artificial TLB miss noise signature. For UMT, a more complex application, using 1MB-pages still introduces considerable overhead, especially when injecting extra TLB miss overhead, while 16MB-pages do not introduce more than 2% performance degradation. This conclusion opens the possibility of implementing richer memory management services at the OS level to satisfy the requirements of future applications and users.


4.2 Experimental environment

Table 4.1 shows the experimental environment used in this chapter. The table includes the hardware system, the operating system, the communication library, and the benchmarks used for the experiments.

CPU                     PowerPC 450 at 850 MHz (4 cores)
Memory                  4 Gigabytes of DDR2 RAM
Network                 3D toroidal custom network
System size             up to 1024 nodes (4096 cores)
Operating System        CNK
Communication Library   MPI (IBM)
Benchmarks              Sequoia benchmarks from LLNL (AMG, IRS, LAMMPS, UMT) and GTC from ANL

Table 4.1: Experimental environment - IBM BlueGene P system

4.3 Related Work

The importance of TLB handling for system performance has long been studied [160, 141]. Anderson et al. [14] showed that the TLB miss handler is one of the most frequently executed kernel functions and that, in corner cases, handling TLB misses can take up to 40% of the execution time [90]. A later work [98] reports that data TLB handling can take up to 10% of the runtime of the SPECCPU 2000 benchmarks. A recent study [28] characterized the TLB behaviour of applications running on chip multiprocessors, confirming that TLB data and instruction misses can hinder system performance significantly, and that between 30% and 90% of TLB misses are either predictable or redundant among multiple threads. McCurdy, Cox and Vetter [116] also studied the TLB profile of the floating point portion of the SPECCPU benchmarks and the HPC Challenge suite. The authors found that the TLB miss profiles of these benchmarks can be significantly different from those of production applications. Moreover, this work shows that a wrong page size choice may result in a performance degradation of up to 50% for production applications. Several studies in the literature propose techniques to reduce the impact of TLB misses. Bhattacharjee and Martonosi [29] propose an inter-core cooperative TLB for chip multiprocessors that uses prefetching to exploit commonalities between TLB miss patterns among cores. The same authors also propose the use of a shared last-level TLB for chip multiprocessors [27], exploiting the fact that multiple threads experience TLB misses on identical address translation entries.

While all the mentioned works address the impact of TLB misses on application and system performance, this work is the first, to the best of our knowledge, that isolates and characterizes the effects of TLB misses on large clusters and real HPC applications.

4.4 Memory management in general purpose OS and lightweight kernels

Memory management is one of the key points on which general purpose and lightweight kernels have taken opposite approaches. Linux-like operating systems support memory protection among different concurrent applications, dynamic memory allocation and remapping, virtualization, etc. The virtual-to-physical mapping can be dynamically modified during the application execution: each process is assigned a set of page tables that store the current virtual-to-physical translation of every page. These features provide high flexibility to the programmers but also introduce overhead and performance variability: since virtual-to-physical mappings can change at run-time, the OS may need to retrieve the page table that contains the physical address corresponding to a given virtual address. On the other hand, lightweight kernels often provide only a reduced set of the previous services and, in particular, may use static virtual-to-physical mapping, i.e., the virtual-to-physical mapping of an application is determined at the beginning of the application and does not change throughout its execution. Translation look-aside buffers are commonly used to reduce virtual address translation latency: a small set of page table entries (from tens to a few thousand) is cached in the TLB and can be quickly accessed to retrieve the physical address associated with the virtual address of the desired page. During a memory access (for either data or instructions), if the virtual-to-physical mapping of the desired page is not in the TLB (TLB miss), the system needs to retrieve the page table entry that maps the virtual address to the corresponding physical one, allocate that entry into the TLB, and finally re-issue the same memory operation (which this time will hit in the TLB). This process can be done by the hardware (hardware-assisted TLB), transparently to the software, or explicitly by the OS (software-managed TLB); a sketch of such a software-managed miss handler follows the list below. In both cases, a TLB miss introduces run-time overhead that, like other system activities, may limit scalability on large machines [148]. In this work we examine three different approaches to map virtual addresses to physical ones:

• Fixed TLBs: With this approach virtual-to-physical mappings are statically determined at the beginning of the application and do not change throughout the execution (static memory mapping). The mappings are stored in the TLB entries and are never replaced (i.e., there are no TLB misses).

• Replaceable TLBs: As in the previous case, virtual-to-physical mappings are statically determined at the beginning of the application and do not change. However, in contrast to the previous approach, TLB misses may occur, i.e., the entries stored in the TLB can be replaced at runtime.

• Dynamic mapping: This approach is the most flexible: virtual-to-physical mappings can change at runtime (dynamic memory mapping). TLB entries store a subset of the current virtual-to-physical mappings and can be replaced at runtime.
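To make the dynamic-mapping case concrete, the sketch below shows what a software-managed TLB miss handler with an in-memory page table could look like (the flat table, the entry count and the tlb_write_entry hook are hypothetical, not CNK or Linux code): the handler must walk a structure kept in memory, so each miss can also pay for data cache misses, unlike the range-checking handler introduced later in Section 4.5.1.

    #include <stdint.h>

    #define PAGE_SHIFT     12                 /* 4KB pages in this illustration */
    #define NUM_TLB_SLOTS  64
    #define NUM_PAGES      (1u << 20)

    struct pte { uint64_t phys; int valid; };

    /* The page table lives in memory: walking it on a miss may itself
     * cause data cache misses, widening the handler's time distribution. */
    static struct pte page_table[NUM_PAGES];
    static unsigned next_slot;

    /* Hypothetical hardware hook that writes one TLB entry. */
    extern void tlb_write_entry(unsigned slot, uint64_t virt, uint64_t phys);

    int software_tlb_miss(uint64_t fault_addr)
    {
        uint64_t vpn = fault_addr >> PAGE_SHIFT;
        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            return -1;                                /* unmapped page: deliver a fault */

        tlb_write_entry(next_slot, vpn << PAGE_SHIFT, page_table[vpn].phys);
        next_slot = (next_slot + 1) % NUM_TLB_SLOTS;  /* simple replacement policy */
        return 0;                                     /* the faulting access is retried */
    }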

With dynamic memory mapping, the virtual-to-physical translations of a process can be modified at runtime by the OS. This approach allows memory-on-demand (the virtual-to-physical mapping is determined at the moment a process first addresses a new page), copy-on-write (the virtual-to-physical mapping of a page is modified when the first write to a page shared by parent and child process occurs), swapping (the physical address of a page temporarily swapped to disk is determined after the page is loaded back to memory), virtualization, etc. All these techniques are commonly employed in general purpose operating systems, such as Linux: they usually provide considerable advantages for desktop or data center systems but also introduce time variability and larger overheads. Several LWKs (e.g., CNK) have, instead, taken the approach of statically mapping page table entries to TLB entries at the beginning of the execution of an application, avoiding TLB misses altogether (fixed TLBs). Replaceable TLBs are less restrictive than fixed TLBs in that, even if the virtual-to-physical translation mapping of an application does not change throughout the execution, the mapping may contain more entries than provided by the hardware; hence, TLB misses occur, allowing more complex mapping layouts than the ones offered by fixed TLB mappings. With static memory mapping (fixed TLBs and replaceable TLBs) pages cannot be remapped at runtime, hence the dynamic techniques described above are not easy to implement. On the other hand, the OS design is simpler and there is minimal runtime overhead. In order to provide the services required by future parallel applications and satisfy, at the same time, the hardware/technology/cost constraints, operating systems may undergo substantial changes in the exascale era. For example, according to common knowledge, TLB misses introduce a considerable amount of OS noise and limit scalability on large clusters. Common solutions to

this problem are static memory mapping (CNK, Catamount) or the use of large pages (HugeTLB). TLB misses are introduced because of the “dynamic” behavior of the OS. This dynamic behavior is caused by OS services (memory protection among different applications, page swapping, dynamic memory, virtualization, etc.) or optimizations (COW, memory on demand, etc.). However, allowing TLB misses does not necessarily imply that all those services/optimizations have to be implemented. We can allow TLB misses but not memory on demand or COW, for example. The opposite, however, is not possible (or would at least be difficult). Deciding which services/optimizations should be implemented requires a separate study. The new generation of exascale supercomputers poses several challenges for OS design. The effective design of a reliable, high-performance memory management sub-system is one of the biggest challenges, and the use of a static or dynamic virtual address translation mechanism is a key decision. On one hand, a static translation mechanism is the safest choice in terms of performance; on the other hand, it considerably worsens memory fragmentation. In order to choose between these two designs, we need to fully understand the impact of a dynamic translation mechanism with respect to the static one. Our goal is to provide enough evidence to support or reject the hypothesis that the performance impact of TLB misses is relevant for a realistic HPC workload.

4.5 Methodology

In order to understand the effect of TLB misses on scientific applications, previous work [148] compared an OS with fixed TLB entries (e.g., CNK) with one that uses dynamic TLB mapping (e.g., Linux). Indeed, the performance difference between a general purpose OS and a LWK specifically designed to run HPC applications may be considerable. Figure 4.1 shows the execution time of the AMG solving phase when running with CNK and with a non-optimized version of Linux on BG/P, varying the number of MPI processes (cores) from 128 to 2048. The figure shows that 1) the overhead of Linux over CNK with 128 cores is already large (3.62x), and that 2) the overhead increases with the number of cores (up to 6.86x), which indicates limited scalability. The problem with this approach is that Linux introduces OS noise that is not necessarily caused by TLB misses (such as process preemption, CPU migration, or interrupts), hence it is difficult to isolate the TLB noise or even understand whether the overhead introduced by TLB misses is a major problem. Our approach, instead, consists of modifying CNK, a noiseless OS designed for


Figure 4.1: Performance of AMG (solving phase time in seconds) in weak scaling mode with a non-optimized version of Linux and with CNK on BG/P, from 128 to 2048 cores. This figure shows the effects of Linux system noise on AMG at scale: compared to CNK, Linux shows lower performance (3.62x with 128 cores) and limited scalability.

BlueGene/P systems, by introducing support for TLB misses and then comparing the original CNK with our modified version (CNK-TLB). With this approach the effect of TLB misses is isolated from the rest of the OS noise introduced by a full-fledged OS such as Linux. For this study, we used a BG/P cluster: the value of using a BG/P over other architectures and/or simulators is twofold. First, we can perform experiments at a much larger scale than in a simulated environment, which provides more representative results for the supercomputing environment. Second, BG/P cores (PowerPC 450) do not feature a hardware page-lookup circuit: TLB misses are handled in software, thus we can explore different TLB miss strategies by specifically designing the proper TLB miss handler.

4.5.1 Adding support for TLB misses to CNK

The overhead of translating a virtual address into the corresponding physical one in our system depends on several components that are difficult to isolate, such as the TLB miss exception handler latency (including kernel/user context switches), the virtual-to-physical translation algorithm, and the TLB replacement policy. To isolate the overhead of TLB miss exceptions from the other components (virtual-to-physical translation algorithm and TLB replacement policy), we implement a low-overhead, range-checking, round-

robin TLB miss handler. The PowerPC 450 has 64 TLB entries per core, out of which 11 are statically fixed and reserved to map the CNK address space; all the other entries store application virtual-to-physical mappings. CNK was not designed to support TLB misses, hence virtual-to-physical mappings are determined at loading time, assigning fixed segments to the application. A TLB miss would normally raise an exception, reporting an error and aborting the running application. CNK uses four memory segments per process, namely text, data, heap/stack and shared. The position of each memory segment in the physical address space is determined and saved into special registers when the binary is loaded. There is one private data and heap/stack segment for each process, while the text and the shared data segments are common to all processes in a node. A schema of the virtual-to-physical memory mapping for two running processes (SMP running mode) is depicted in Figure 4.2. In order to translate virtual addresses we implemented an exception handler that uses special registers to store the beginning of each memory segment and performs a translation based on the range of the virtual address. Furthermore, the standard CNK segments are aligned to 1MB, and using larger pages would cause the segments not to be aligned with the pages. To solve this issue, when handling 16MB-pages, our TLB handler uses a minimum (< 16) number of 1MB-pages to fill the gap and then uses 16MB-pages for the rest of the segment. Every time a TLB miss occurs, the TLB miss handler translates the virtual address as an offset from the proper segment in the physical address space. Since segment positions are stored in special registers, this approach does not require access to memory to retrieve the virtual-to-physical mapping, thus the effect of data cache misses is reduced. Nonetheless, memory accesses are still required for the TLB exception handler context switch.
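A minimal sketch of the range-checking translation just described is shown below (the segment descriptors, entry counts and tlb_write_entry hook are hypothetical names, not the actual CNK-TLB code, and the 16MB alignment trick is omitted): the faulting address is classified by segment range and translated as an offset from that segment's physical base, so no in-memory page table is consulted.

    #include <stdint.h>

    struct segment {
        uint64_t virt_base;   /* start of the segment in the virtual address space   */
        uint64_t size;
        uint64_t phys_base;   /* saved in special registers when the binary is loaded */
    };

    /* text, data, heap/stack, shared */
    static struct segment seg[4];
    static unsigned next_slot;    /* round-robin victim among the replaceable entries */

    extern void tlb_write_entry(unsigned slot, uint64_t virt, uint64_t phys,
                                uint64_t page_size);

    int range_check_tlb_miss(uint64_t fault_addr, uint64_t page_size)
    {
        for (int i = 0; i < 4; i++) {
            if (fault_addr >= seg[i].virt_base &&
                fault_addr <  seg[i].virt_base + seg[i].size) {
                uint64_t off  = (fault_addr - seg[i].virt_base) & ~(page_size - 1);
                uint64_t virt = seg[i].virt_base + off;
                uint64_t phys = seg[i].phys_base + off;
                tlb_write_entry(11 + next_slot, virt, phys, page_size); /* entries 0-10 are fixed */
                next_slot = (next_slot + 1) % 53;                       /* 64 - 11 replaceable slots */
                return 0;    /* the faulting access is re-issued and now hits in the TLB */
            }
        }
        return -1;           /* address outside all segments: genuine fault */
    }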

4.5.2 Tracing TLB misses

Efficient and precise measurement of TLB misses requires considerable memory space, which may introduce additional data cache misses and, therefore, invalidate our results. To guarantee low overhead, we record the start and end time of the TLB miss handler in memory and then dump the information to disk at the end of the application.1 Considering the size of a PowerPC 450 timestamp (8 bytes) and the number of TLB misses per second (up to 10^6 per second with 4KB pages), the size of the data stored in the buffer can reach more than 10MB per second.

1Note that the measured interval does not include the time to save and restore the first five registers used in the TLB handler (context switch).


Figure 4.2: Example of virtual-to-physical memory mapping for two MPI processes (SMP mode) in CNK.

This amount of data would definitely affect the application memory footprint, providing biased results. We thus implemented sampling techniques, at the process and TLB event levels, to reduce the amount of data while maintaining representativeness. A set of MPI processes is selected at the beginning of the application as the ones that will record TLB miss events. With 64KB pages the sampling selects 1 TLB miss out of every 64 to be recorded, while with 4K pages 1 out of every 256. To validate the sampling frequency we compare the data obtained with and without the sampling mechanism. We compare the two measured TLB miss samples by performing a two-sample Kolmogorov-Smirnov test with a 95% confidence interval. Table 4.2 shows the validation for the sampling with 64KB pages, reporting the arithmetic mean, the standard deviation and the result of the statistical test for the TLB duration distributions. Notice that this approach does not reduce accuracy because scientific applications are single program/multiple data (SPMD), i.e., each process performs the same operation on a different input set, and because HPC applications are usually iterative, i.e., it is possible to extract an application's characteristics, such as the TLB miss pattern, from a few iterations. Figure 4.3 shows a partial execution trace of AMG in which we recorded all TLB misses: the y-axis is the duration of the TLB miss and the x-axis represents the timeline (3.52 seconds in total).

Table 4.2: TLB duration distribution: sampling with 64KB pages

                       mean                 std dev             test
Apps      Cores    full      sampled    full      sampled
amg       128      72.63     72.53      16.77     16.37      passed
irs       125      76.43     75.91      20.59     18.09      passed
lammps    128      69.46     69.50      5.36      5.89       passed
umt       128      68.88     68.86      2.48      2.40       passed
gtc       128      71.52     71.55      13.11     12.87      passed

Figure 4.3: TLB miss trace of 3.52 seconds of AMG execution (TLB miss duration in cycles over time).

As we can see from the trace, AMG presents an iterative structure, thus sampling either one complete iteration or enough events throughout the execution does not lead to measurable inaccuracy. We measured the overhead of our tracing system by comparing the instrumented version of CNK-TLB with the non-instrumented version and found it negligible (about 1%, depending on the application, and of the same order as the performance variation across iterations). To provide more accurate results when measuring the execution time overhead, however, we compare the original CNK kernel to the non-instrumented version of CNK-TLB and use the tracing version only when reporting per-TLB miss statistics.
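The buffering and sampling described above can be summarized as follows (a sketch with hypothetical names, not the actual CNK-TLB instrumentation): only one TLB miss out of every 2^k is recorded, and the fixed-size buffer of start/end timestamps is dumped to disk once, when the application terminates.

    #include <stdint.h>

    #define SAMPLE_SHIFT  6            /* 64KB pages: record 1 miss in 64 (8 gives 1 in 256 for 4KB pages) */
    #define BUF_ENTRIES   (1u << 16)

    struct tlb_sample { uint64_t start_cycles; uint64_t end_cycles; };

    static struct tlb_sample buf[BUF_ENTRIES];
    static uint64_t miss_count, sample_count;

    /* Hypothetical hook that ships the buffer to the i/o node at application exit. */
    extern void dump_to_disk(const void *p, uint64_t bytes);

    void record_tlb_miss(uint64_t start_cycles, uint64_t end_cycles)
    {
        if ((miss_count++ & ((1u << SAMPLE_SHIFT) - 1)) != 0)
            return;                                    /* skip non-sampled misses */
        if (sample_count < BUF_ENTRIES) {
            buf[sample_count].start_cycles = start_cycles;
            buf[sample_count].end_cycles   = end_cycles;
            sample_count++;
        }
    }

    void flush_tlb_trace(void)
    {
        dump_to_disk(buf, sample_count * sizeof(struct tlb_sample));
    }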

4.5.3 TLB noise injection

CNK-TLB uses a simple range-checking TLB handler (described in Section 4.5.1) that introduces minimal overhead: it shows a very tight time distribution centered around 70 cycles. In reality, however, operating systems use different techniques to resolve virtual-to-physical mappings, such as paging or segmentation [36]. These techniques require looking up the page table or the segment that contains the virtual-to-physical mapping when a TLB miss occurs. Besides the overhead of computing the address of the correct page table, paging may require several accesses to memory and, eventually, the allocation of new page tables


(in case of the first access to a page) before translating the virtual address. This results in a duration distribution of the TLB miss handler that is not as tight as the one obtained with CNK-TLB: CNK-TLB uses a low-overhead, range-checking TLB miss handler, while the Linux TLB miss handler is affected by several components (page table lookup, cache misses, page table allocation, etc.) that produce a wider time distribution. Figure 4.4 shows the TLB miss handler execution time distribution for CNK-TLB (Figure 4.4a) and for Linux (Figure 4.4b) when running UMT with 1024 MPI processes. As we can see from the figures, the Linux distribution is not as tight as the CNK one and presents several peaks, each of which indicates a different event. While the implementation of paging in CNK is outside the scope of this work, we want to more accurately evaluate the overhead introduced by TLB misses on an HPC application running on a system with paging. To this end we use noise injection [68], a technique that consists of injecting artificial delays with different signatures to simulate the noise introduced by a real OS component (in this case, the memory sub-system). We inject two different signatures in CNK-TLB: a constant signature and a uniformly distributed random signature. In both cases we vary the maximum delay injected on each TLB miss to understand what the maximum per-TLB miss overhead is that guarantees an overall performance slowdown below 2%. At each TLB miss, the TLB handler selects the delay to add, either the constant delay (constant signature) or one random value from the uniform distribution (random signature), and executes a loop to delay the TLB miss handler. To minimize the overhead of selecting the random delay, we load the uniform random distribution into kernel memory at application start time.
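The injection mechanism can be sketched as follows (hypothetical names: read_cycle_counter stands for the core timebase register and uniform_table for the distribution preloaded into kernel memory at application start): the extra delay, constant or drawn from the table, is spun at the end of the range-checking handler, so each miss costs roughly 70 cycles plus the injected delay.

    #include <stdint.h>

    enum signature { SIG_CONSTANT, SIG_RANDOM };

    #define TABLE_SIZE 4096
    static uint32_t uniform_table[TABLE_SIZE];   /* filled at application start, values in [0, T] */
    static uint32_t table_idx;

    extern uint64_t read_cycle_counter(void);    /* e.g., the core timebase register */

    static void spin_cycles(uint64_t cycles)
    {
        uint64_t start = read_cycle_counter();
        while (read_cycle_counter() - start < cycles)
            ;                                    /* busy-wait inside the miss handler */
    }

    void inject_tlb_noise(enum signature sig, uint32_t k_cycles)
    {
        uint64_t delay;
        if (sig == SIG_CONSTANT)
            delay = k_cycles;                                   /* K = 128, 256, 512, or 1024 */
        else
            delay = uniform_table[table_idx++ % TABLE_SIZE];    /* uniformly distributed in [0, T] */
        spin_cycles(delay);
    }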

4.6 Experimental results

In this section we analyze the performance impact of TLB misses on applications from the LLNL Sequoia (AMG, IRS, LAMMPS, and UMT) and ANL (GTC) benchmark suites. In order to isolate the effects of TLB miss overhead, we compare the solving phase execution time of the applications running on CNK, a noiseless OS that does not take TLB misses, to our modified version of the same OS (CNK-TLB), which is capable of handling TLB misses. As described in the previous section, our TLB miss handler uses a low-overhead range-checking approach that does not need to look for page tables in memory to resolve virtual-to-physical mappings. Although this approach is a simplified version of a normal system with memory paging, it provides the minimum overhead caused by a TLB miss, i.e., the overhead of going to kernel mode and back, the execution of the TLB miss handler, the L1 and L2 instruction cache misses, and the overhead of flushing the core pipeline before entering kernel mode and upon resuming the execution of the application.

Figure 4.4: TLB miss handler execution time distribution for CNK-TLB and Linux (histograms of handler duration in cycles, UMT with 1024 MPI processes). (a) CNK-TLB: avg 69.25 cycles, std dev 1.88 cycles, min 68 cycles, max 693 cycles. (b) Linux: avg 111.32 cycles, std dev 49.57 cycles, min 41 cycles, max 1,524 cycles.

Section 4.6.3 analyzes the effect of page table lookup through TLB noise injection.

4.6.1 TLB pressure

Table 4.3 shows the TLB pressure that each application poses on the TLB cache, i.e., the average number of TLB misses/second per core in the system, when varying the number of concurrent MPI processes (cores) and the page size. Each value in Table 4.3 is the average of 4 runs.1 We report the TLB miss frequency rather than the absolute number because each application runs for a different amount of time, thus the absolute number of TLB misses alone is not a valid indicator of an application's TLB pressure. Finally, here and in the rest of the chapter, we focus on the solving phase of each application, i.e., we do not report TLB misses for the initialization and finalization phases. The applications tested in our experiments present different characteristics, such as data locality and memory and communication patterns, that have an impact on the TLB pressure. The result is that each application poses a different pressure on the TLB and the memory sub-system. For example, with 1MB pages and 1024 cores, LAMMPS shows 0.64 TLB misses/second while AMG and UMT show 658.69 and 42,957.35 TLB misses/second, respectively, which indicates that UMT poses higher TLB pressure than the other applications. Also, please note that results for UMT only go up to 2048 cores, because UMT uses only half of the cores in each node. We expect UMT to suffer a higher slowdown on CNK-TLB than the other benchmarks, especially with small pages. On the other hand, the TLB pressure reduces with the page size (the larger the page, the lower the number of TLB misses/second) across all the applications. Table 4.3 shows that the impact of the page size on the number of TLB misses/second per core is very large. When the page size increases from 64KB to 1MB (16x), the number of TLB misses/second may decrease by up to five orders of magnitude (LAMMPS). Finally, Table 4.3 reports that the number of TLB misses/second is roughly constant at scale (except for IRS). This is somewhat expected because we run our experiments in

1In our experiments with CNK-TLB, using 4KB-pages caused several applications not to complete (GTC, UMT) or to generate MPI errors (IRS); many applications could not finish with 4K pages due to the high overhead introduced by TLB misses. We omit these results since, as we will see in the next sections, 4KB-pages do not play a determinant role in our conclusions. Also note that for IRS the data set distribution requires a number of processes that is not a power of 2.

Table 4.3: Number of TLB misses/second per core during the solving phase for the LLNL and ANL applications using CNK-TLB. The different numbers of cores for IRS are due to the design constraint of having a number of parallel processes that is a perfect cube.

                            Page size
Apps      Cores     4K              64K           1M          16M
amg       128       202,718.24      16,580.67     565.95      0.00
amg       256       217,190.96      18,089.83     685.78      0.00
amg       512       229,597.23      17,732.09     622.96      0.00
amg       1024      232,221.82      18,433.50     658.69      0.00
amg       2048      243,566.54      18,460.25     681.54      0.00
amg       4096      NA              18,834.16     724.70      0.00
irs       125       NA1             23,104.27     0.88        0.11
irs       216       NA              24,273.60     0.75        0.09
irs       512       NA              25,035.51     0.59        0.00
irs       1000      NA              25,133.25     0.44        0.05
irs       1728      NA              29,959.76     0.79        0.04
irs       4096      NA              38,506.07     0.00        0.03
lammps    128       1,017,630.05    34,888.56     0.66        0.15
lammps    256       1,061,163.23    37,492.60     0.67        0.15
lammps    512       1,021,721.25    33,275.87     0.66        0.15
lammps    1024      1,085,015.14    33,788.92     0.64        0.15
lammps    2048      1,040,772.45    32,341.94     0.66        0.15
lammps    4096      NA              37,699.52     0.63        0.17
umt       128       NA              260,916.79    50,463.81   1.26
umt       256       NA              265,431.37    45,548.68   1.37
umt       512       NA              269,942.34    44,975.80   1.46
umt       1024      NA              275,927.00    42,957.35   1.44
umt       2048      NA              274,559.97    39,277.12   1.77
gtc       128       NA              5,812.62      108.51      0.00
gtc       256       NA              6,413.69      251.87      0.00
gtc       512       NA              6,386.41      251.27      0.00
gtc       1024      NA              6,537.21      250.00      0.00
gtc       2048      NA              6,501.46      248.61      0.00
gtc       4096      NA              6,462.73      253.83      0.00

weak scaling mode, thus the amount of work (and the memory footprint) per process is supposed to be constant. There is a small percentage variation caused by the imperfect weak scalability of the applications.


4.6.2 Analysis of TLB overhead at scale

Figure 4.5 depicts the performance overhead experienced by the tested applications when varying the number of MPI processes (cores) and the page size.

Figure 4.5: TLB miss overhead with varying page sizes (4KB, 64KB, 1MB, 16MB) and numbers of concurrent cores. The graph shows that the overall overhead is below 2% with 16MB-pages for all the applications, while 1MB-pages only guarantee overhead below 2% for some applications. Bars exceeding the 10% axis limit (4KB-pages) are annotated with their actual values, which range from 17% to 59%.

We compute this execution time overhead as the relative slowdown of our non-instrumented modified version CNK-TLB (replaceable TLBs) with respect to the original CNK (fixed TLBs). The execution time does not include the data initialization phase, i.e., we only instrument the so-called solving phase. The graph shows that with 4KB-pages the overhead introduced by TLB misses is excessively high for both AMG and LAMMPS in all tested configurations (128-4096 cores), similarly to what was reported by Shmueli et al. [148] (although with different applications). With 64KB-pages, the overall slowdown reduces considerably for most of the applications: only UMT and AMG show overhead larger than 2% (8.2% and 5.2%, respectively). We consider 2% as a threshold above which the overhead is too high to take the replaceable TLBs design into consideration for future HPC systems. This threshold may seem too strict, but we should remark that, in the experiments in Figure 4.5, CNK-TLB uses a range-checking TLB handler and that the actual overhead of a page table lookup is larger (see Section 4.6.3). With 1MB and 16MB pages, the TLB noise reduces below 2% also for UMT and AMG (except UMT with 128 cores), with 16MB pages introducing less than 1% slowdown for most of the applications. This result confirms that the performance degradation induced by TLB misses is contained and sustainable for HPC applications if the TLB pressure is not excessively high (1MB and 16MB cases). Figure 4.5 also shows that the performance degradation suffered by each application varies according to the application's characteristics. For example, although LAMMPS

shows a higher number of TLB misses/second (32K-37.5K) than IRS (23K-38K) with 64KB pages, the overall performance degradation is slightly higher for IRS, which means that each TLB miss introduces larger delays on IRS than on LAMMPS. This happens because the impact of TLB misses on the overall application performance depends on several factors: 1) the TLB pressure (reported in Table 4.3), 2) the TLB handler execution time, and 3) the criticality of the TLB miss. The number of TLB misses and the duration of the TLB miss handler determine the direct overhead each process suffers because of TLB misses. This overhead, however, may not directly impact the overall application performance. As we mentioned in the previous section, each application presents different intrinsic characteristics, such as the communication pattern or the synchronization structure. The overhead of a TLB miss may impact each application differently, depending on the current activity of the process experiencing the TLB miss or on whether the process was on the application critical path (critical miss) or not (non-critical miss). For example, if the process that experiences a TLB miss is the last one to reach a barrier, the TLB miss overhead will directly impact the overall application performance. Vice versa, if the process is not the last one to reach the barrier, the overhead of the TLB miss will be completely absorbed and will not slow down the application. This phenomenon is not just related to TLB misses but applies to system software noise in general [137, 75, 68]. For example, a timer interrupt issued while the current process is waiting for an incoming packet does not introduce any delay on the overall application. Timer interrupts that delay computing phases of processes on the critical path, instead, directly impact overall performance. An application's characteristics and the size of the page also influence the duration of each TLB miss handler invocation. In fact, even our range-checking TLB miss handler may suffer different instruction and data cache misses, depending on how frequently TLB misses occur. Figure 4.6 shows that the execution time of a TLB handler varies from application to application and, in general, increases with the size of the page. For example, the average TLB miss handler execution time for IRS running with 216 cores is 75 cycles with 64KB pages, 180 cycles with 1MB pages and 325 cycles with 16MB pages. As shown in Table 4.3, the TLB pressure reduces with the size of the page, thus large pages induce fewer TLB misses/second. This means that there is a higher probability of instruction/data cache misses with larger pages than with small ones. Figure 4.7 shows that, indeed, there is an empirical relationship between the average TLB handler duration and the number of TLB misses/second: the higher the number


Figure 4.6: TLB miss handler execution time (average duration in cycles) with varying page sizes and concurrent cores. The graph shows that the time required to perform a virtual-to-physical translation in CNK-TLB changes depending on the page size (the larger the page, the longer the execution time), especially with pages larger than 64KB. The TLB miss handler execution time influences the applications' overall overhead. Missing 4K-page bars refer to experiments that could not complete due to the high overhead with small pages (see Table 4.3). Also, the 1M-page bar for IRS is missing due to the very low number of TLB misses to sample (less than 1 TLB miss on average).

Figure 4.7: Empirical relationship between TLB misses/second and TLB handler execution time (average TLB miss duration in cycles versus TLB misses per second): when the TLB pressure is low (less than 1 TLB miss/second), the probability of cache misses increases, with the result that the TLB handler takes longer to complete.

of TLB misses/second, the lower the average TLB miss handler execution time. When the number of TLB misses/second is high (i.e., more than 10K TLB misses/second), performing the TLB miss handler takes approximately 70 cycles on average. When the TLB miss frequency is low (less than 1 TLB miss/second), determining the physical address requires more time (up to 325 cycles). Given that the TLB miss handler is a software routine, it is subject to cache misses. With a low TLB miss frequency, the probability of data and instruction cache misses is higher, hence, on average, the TLB miss handler takes longer to resolve a virtual-to-physical translation. Figure 4.6 also explains why the performance slowdown experienced by IRS is higher than that of LAMMPS (Figure 4.5) even though LAMMPS poses a higher TLB pressure than IRS (Table 4.3): the average TLB miss handler execution time of IRS is higher than that of LAMMPS. An interesting observation is that, although the number of TLB misses/second is roughly constant at scale for most of the applications (e.g., AMG, IRS, LAMMPS, UMT, and GTC), the overhead introduced by TLB misses on a given application may vary. For example, the overhead experienced by AMG varies with the number of concurrent processes (Figure 4.5). For AMG with 64KB-pages both the number of TLB misses/second (16K-18.5K, see Table 4.3) and the TLB miss handler execution time (75 cycles, see Figure 4.6) are roughly constant at scale; however, the overall overhead increases from 1%, with 128 processes, to 5%, with 4096 concurrent processes. This is caused by system noise resonance [68], i.e., the effect of system noise for some applications is amplified at scale, depending on the criticality of TLB misses. In fact, previous studies have shown that the effect of system noise at scale is much larger than the one measured on a single node [137, 75, 68, 148].
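A back-of-the-envelope bound helps separate this amplification from the direct cost (a rough estimate assuming every miss costs the average handler duration on the 850 MHz cores of Table 4.1):

\[ \text{direct overhead} \approx \frac{\text{TLB misses/s} \times \bar{t}_{\text{handler}}}{f_{\text{clock}}} \approx \frac{18{,}834 \times 75}{850 \times 10^{6}} \approx 0.17\% \]

for AMG with 64KB-pages at 4096 cores. The measured slowdown of about 5% is therefore dominated by the amplification due to resonance and critical-path effects rather than by the direct cost of the handler.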

4.6.3 TLB noise injection

In the previous sections we analyzed the TLB pressure and the runtime overhead experienced by HPC applications when running on CNK-TLB, a modified version of CNK with replaceable TLBs that implements a range-checking TLB handler. In this section we evaluate the effect of injecting TLB noise while running HPC applications. As described in Section 4.5.3, we introduce two different noise signatures: a constant signature and a uniformly distributed random signature.

Constant signature The worst-case scenario from the TLB noise point of view is one in which, at each TLB miss, each process of an application experiences the largest delay. We simulate this case by injecting a constant large delay of K cycles at every


TLB miss, with K = 128, 256, 512, 1024. Note that this delay adds to the average duration of the TLB handler of approximately 70 cycles.

Figure 4.8: Noise injection with a constant noise signature (K = 128, 256, 512, 1024 cycles): (a) 1MB-pages; (b) 16MB-pages. The graph shows that the overall overhead of injecting a constant per-TLB miss delay is below 2% with 16MB-pages for all the applications. With 1MB-pages, instead, UMT shows considerable overhead.

Figures 4.8a and 4.8b show the overall performance degradation, with respect to the standard CNK, experienced by the tested HPC applications when running on CNK-TLB with a constant delay injected at each TLB miss, for 1MB and 16MB pages, respectively. We do not report values for pages smaller than 1MB because, as Figure 4.5 shows, the overall overhead of TLB misses with 64KB-pages is already above our 2% threshold for some applications (e.g., UMT) even with the simple range-checking TLB miss handler. In these experiments, we vary the constant delay K from 128 cycles to 1024 cycles and the number of MPI processes. Figure 4.8a shows that most of the applications (AMG, IRS, LAMMPS and GTC) present an overall performance degradation

below 2%, even with K = 1024 cycles. For these applications, and with 1MB pages, the TLB pressure is contained (see Table 4.3), thus the applications can sustain a large per-TLB miss overhead. UMT, on the other hand, poses a much higher TLB pressure: even though the range-checking TLB handler evaluated in Section 4.6.2 introduces performance degradation below 2% for 1MB-pages, injecting a delay quickly degrades performance above 2%. With K = 128, UMT already experiences performance degradation of up to 5%; with K = 512 the overhead quickly climbs up to 11.4%. The higher overhead for UMT appears to be a consequence of its number of TLB misses per second. Interestingly, the percentage overhead of UMT shows a decreasing trend as the number of cores increases. Table 4.4 reports the execution times of the solving phase in CNK (no TLB misses), which is the baseline of this experiment. Although we run our experiments in weak scaling mode, the total execution time for UMT increases from 19.51 seconds with 128 cores to 25.61 seconds with 2048. On the other hand, the direct overhead of TLB misses is constant (see Table 4.3 and Figure 4.6), thus we conclude that the overhead of TLB misses overlaps with other sources of performance degradation, causing the relative percentage overhead to decrease at scale.

Table 4.4: Execution time of the solving phase in seconds.

                                  Cores
Apps      128       256       512       1024      2048      4096
amg       9.59      9.44      10.43     10.66     10.97     12.35
irs       27.37     33.46     42.36     58.48     72.13     114.45
lammps    66.05     64.96     65.48     66.44     65.94     67.02
umt       19.51     21.92     22.19     23.09     25.61     NA1
gtc       28.14     28.26     28.97     28.51     28.99     30.29

Although UMT is the only tested application that reports unsatisfactory results with 1MB pages, it is also one of the applications that most closely represent future exascale applications, both in terms of complexity and implementation. The application is programmed with a mix of Fortran and C++ code, combines shared memory (OpenMP) and message passing (MPI) programming models, and uses Python scripts to dynamically load computing modules. This experiment, thus, shows that 1MB pages may not be sustainable by future exascale systems, unless the per-TLB miss overhead is very contained, which may require hardware assistance. With 16MB pages, instead, the overall performance degradation is consistently below 2% for all applications, even with

1UMT uses half of the cores in each node. Hence using 1024 BG/P nodes we only have 2048 cores.


K = 1024 cycles. In fact, the measured performance is so close to that of the original CNK that the differences between the two experiments are of the same order as the measurement error.

Random signature Although on a single node system noise is usually quantified at 1-2%, the performance degradation at scale can be much higher, especially if the system noise enters into resonance with the application [137, 75, 68]. If that happens, the application's performance is determined by the MPI process that experienced the largest delay, even though other processes experience smaller delays or no delay at all. General purpose operating systems may introduce considerable system noise, some of which is related to the memory sub-system and to TLB miss handling. To simulate the random system noise generated by TLB miss handling we artificially introduce a random delay at every TLB miss. The random delay is uniformly distributed between 0 and T, with T = 128, 256, 512, and 1024 cycles. We use the core timebase register as a source of random events and delay the TLB handler accordingly. The artificial delay adds to the duration of the range-checking TLB handler, reproducing a handler with a minimum duration of approximately 70 cycles and a maximum duration of 70+T cycles. Figures 4.9a and 4.9b show the execution time overhead with respect to the standard CNK with no TLB misses. Note that, in Figure 4.9b, some benchmarks show 0% overhead because they do not present TLB misses in the solving phase with 16MB-pages (see Table 4.3). Random delays up to 1024 cycles using 1M pages do not impact the applications' performance significantly, except for UMT, as shown in Figure 4.9a. As for the case of constant noise signature injection, the overhead experienced by UMT with the random noise signature is significant: with T = 256 cycles and 128 cores UMT shows 3.85% performance degradation, and 9.70% with 2048 cores. The other applications all suffer a negligible overhead, as we consider overhead below 1% to be of the same order as the measurement error. Moreover, we observe that UMT's performance degradation with respect to CNK decreases at scale, as reported for the constant signature noise injection. These results confirm that 1M pages may affect application performance, particularly for applications with complex memory access patterns, such as UMT. On the other hand, 16MB-pages may introduce excessive memory fragmentation which, as future systems will have less memory per core, may result in an unsustainable waste of resources. In these experiments we use pessimistic values of K and T (up to 1024 cycles), thus it might not be necessary to move to 16MB pages. If that is necessary, however, reducing the number of TLB misses/second [29] or accelerating the TLB miss handler may reduce the overall overhead even with 1MB-pages.

(a) 1MB pages: overhead w.r.t. standard CNK for random delays of 128, 256, 512 and 1024 cycles (IRS, GTC, UMT, AMG and LAMMPS at increasing core counts)

(b) 16MB pages: overhead w.r.t. standard CNK for the same delays and benchmarks

Figure 4.9: Noise injection with random uniform noise signature. This graph shows that the overall overhead of injecting TLB misses with a random uniform signature is below 2% with 16MB pages for all the applications. With 1MB pages, instead, UMT shows considerable overhead.


4.7 Conclusions

Exposing the level of parallelism demanded by exascale supercomputers requires a richer system software ecosystem capable of supporting the novel programming models, technologies, and runtime systems that are emerging in the HPC community. To this extent, current petascale operating systems might need a fundamental re-design, especially in the memory sub-system. In this chapter we analyze the effects of TLB misses on current and future HPC applications. In order to isolate the effects of TLB misses from the other OS components and to provide an apples-to-apples comparison, we modified CNK, a noiseless OS for Blue Gene systems that implements a fixed TLB design, by adding support for replaceable TLBs. We implemented several TLB miss handlers. First, we designed a low-overhead, range-checking algorithm that resolves virtual-to-physical mappings, translating a virtual address into a physical address with minimum overhead. Second, to evaluate the effects of a more complex memory sub-system (e.g., paging), we varied the duration of the TLB miss handler by injecting artificial delays with random and constant signatures. We then studied the effects of varying the page size from 4KB to 16MB and the scalability implications of taking TLB misses. We consider 2% the maximum performance degradation that a programmer would be willing to pay in order to have a richer system software ecosystem. Our experiments, performed with representative applications from the LLNL Sequoia and ANL benchmarks on a BG/P system with up to 1024 nodes (4096 cores), show that, for most of the applications, the overhead of TLB misses is below 2% even with 1MB pages and with extremely pessimistic delays (1024 cycles). For more complex applications, such as UMT, that pose very high pressure on the TLB, 1MB pages may not satisfy our performance requirements, especially if the TLB miss handler execution time is large. Such complex applications show overall overhead below 2% with 16MB pages even with large artificial delays. However, 16MB pages may introduce higher memory fragmentation, and in a scenario with less memory per core this could be a major issue. In this case, hardware and software techniques should be used to reduce the TLB miss overhead and minimize performance degradation even with smaller page sizes.

In the specific case of Blue Gene systems, the most recent version (BG/Q) provides an MMU to speed up virtual-to-physical mapping.


Part III

Runtime System Scalability


Chapter 5

Scalable Runtimes for Distributed Memory Systems

The previous part of this thesis focused on the scalability of operating systems. In this part we focus on the scalability of the upper layer of system software: the runtime system. Traditional high performance applications used a fairly simple runtime system based on a message passing programming model. As mentioned in the introduction, the advent of more complex hardware fostered the design of new programming models and intelligent, adaptive runtime systems. Applications are changing too, moving from regular to irregular computation and data access patterns (e.g., graph-based applications). The performance of large-scale high performance systems is increasingly a function of data movement rather than floating point computation. A new class of applications, the so-called irregular applications, is gaining increasing interest because of the importance and data-intensive nature of the problems it addresses. In this chapter, we describe the design and implementation of a specialized runtime system and programming model to efficiently execute irregular applications on commodity clusters.

5.1 Summary

Bioinformatics, big data science applications, complex network analysis, community detection, data analytics, language understanding, pattern recognition, semantic databases and, in general, knowledge discovery constitute a new generation of irregular high performance computing applications [48]. The size of the datasets in these applications has already surpassed a petabyte and is growing exponentially.

Only by exploiting large clusters that enable in-memory processing of the entire datasets is there a chance to scale both the size and the performance of these applications. These applications exploit pointer-based data structures such as graphs, unbalanced trees or unstructured grids, which exhibit poor spatial and temporal locality, and have fine-grained, unpredictable data accesses, which lead to ineffective utilization of the available memory and network bandwidth [169]. These applications are also inherently parallel, because they can potentially spawn a concurrent activity for each element of the dataset (e.g., each vertex or each edge in a graph). However, their datasets are difficult to partition without generating load imbalance across the nodes of a cluster, because the elements often are highly interconnected (e.g., power-law graphs). Furthermore, they often present high synchronization intensity, because multiple threads often need to access and/or update the same elements. Modern clusters employ processors with advanced cache architectures and rely on data locality, regular computations and bulk communication to achieve high performance. Implementing irregular applications on these machines requires a significant and highly application-specific optimization effort. However high the effort might be, it often results in poor system resource utilization. The distributed memory architecture of modern clusters further complicates their design, forcing developers to search for efficient ways to partition datasets and minimize communication overheads. Different approaches have been proposed for efficiently executing large-scale irregular applications. They include custom supercomputers, such as the Cray XMT [66], which implements a multithreaded processor to tolerate memory and network access latencies, a scrambled global address space across nodes and full/empty bits associated with each memory word for fine-grain synchronization. However, the reduced market for custom architectures makes them expensive to produce and maintain. On the other hand, distributed graph libraries based on extensions to the MapReduce paradigm are now widely used [80, 111]. Nevertheless, they are not general enough to support irregular applications other than graph processing, and do not scale when graphs have complex structures. The Partitioned Global Address Space (PGAS) programming model appears to be a promising alternative for developing applications with a shared memory abstraction on distributed memory clusters, without neglecting the concept of locality. However, current PGAS languages and libraries target regular applications, where communication across nodes happens through large block transfers. Furthermore, they normally use the Single Program Multiple Data (SPMD) control paradigm, where the program starts with multiple, identical parallel processes.

The SPMD model does not cope well with the dynamic and fine-grained nature of the parallelism in irregular applications. Parallelism in irregular applications is highly dynamic and unbalanced, in many cases requiring dynamic spawning of new concurrent activities as new data is "discovered". In this work we introduce GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates the PGAS locality-aware global data model with the loop parallelism commonly used in single-node multithreaded environments. GMT enables the development of applications with large datasets that span multiple nodes without requiring domain decomposition or data partitioning. GMT exposes to application developers a very simple application programming interface (API) that provides operations such as: allocating and deallocating memory in the global, logically shared address space; reading and writing data in the global memory; and performing synchronization operations on any global memory location. Parallelism is identified through parallel loop constructs, and multiple nested loops are efficiently executed. These constructs enable the expression of the large amount of fine-grained parallelism typically found in irregular applications. Lightweight threading is the main mechanism used to tolerate remote data access latencies, by context switching tasks that are waiting for remote operations. GMT supports millions of lightweight, user-level tasks. Thousands of these tasks are multiplexed on each available core of the cluster. In GMT each CPU core executes a specialized thread. There are three classes of specialized threads: workers, which execute the tasks composing the application; helpers, which manage PGAS operations and synchronization; and communication server threads, which perform the actual remote data transfers. At the core of GMT resides the key concept of remote request aggregation (also known as network message coalescing): combining multiple application-level fine-grained requests to improve network bandwidth utilization. GMT features multiple levels of aggregation: requests are first aggregated in the local core and subsequently aggregated at the node level. Finally, GMT runs on any homogeneous cluster based on x86 processors that supports MPI. We compare GMT to other PGAS models (UPC on GASNet), to hand-optimized MPI code, and to custom machines (Cray XMT) on a set of typical kernels of large-scale irregular applications: Breadth First Search (BFS), Graph Random Walk (GRW) and Concurrent Hash Map Access (CHMA). We demonstrate performance orders of magnitude higher than other solutions for commodity clusters, and competitive performance compared to custom systems. We show that GMT enables high scalability in performance and dataset size as additional nodes are added to the system.


The chapter is organized as follows. Section 5.2 presents the experimental environment. Section 5.3 discusses relevant related work. Section 5.4 describes the programming model and GMT's API. Section 5.5 illustrates the architecture of the runtime system, detailing the design choices behind the most relevant software components. Section 5.6 discusses our experimental evaluation: it initially characterizes the communication performance of the library, and then compares the performance obtained on three irregular kernels with the other implementations.

5.2 Experimental environment

Tables 5.1, 5.2 and 5.3 show the experimental environment used in this chapter. The tables include the hardware systems, the communication libraries, and the benchmarks used for the experiments.

CPU: AMD Opteron 6272 processors (codename "Interlagos") at 2.1 GHz (2 sockets, 8 cores per socket)
Memory: 64 GB of DDR3 RAM
Network: Infiniband QDR (40 Gb/s)
System size: up to 128 nodes (4096 cores)
Operating System: Linux 2.6.32
Communication Library: GMT and OpenMPI 1.5.4
Benchmarks: GMT microbenchmarks, Breadth First Search, Graph Random Walk and Concurrent Hash Map Access

Table 5.1: Experimental environment - PNNL Institutional Computing (PIC) cluster system

CPU: 500 MHz single 64-bit Cray Threadstorm processor (128 threads per processor)
Memory: 1 TB of total RAM (shared memory)
Network: Seastar-2
System size: up to 128 processors
Operating System: Linux and MTK
Communication Library: custom pragmas
Benchmarks: Breadth First Search

Table 5.2: Experimental environment - PNNL Cray XMT system

CPU: 4 AMD Opteron processors "Magny-Cours" at 2.3 GHz (4 sockets, 12 cores per socket)
Memory: 256 GB of DDR3 RAM
Network: -
System size: single node
Operating System: Linux
Communication Library: OpenMP
Benchmarks: Breadth First Search

Table 5.3: Experimental environment - PNNL AMD Many-core workstation


5.3 Related Work

Efficiently executing large-scale irregular applications has been a significant research topic for a while. Due to the complexity of managing (and partitioning) the datasets, irregular applications are better suited to shared memory systems. However, even if modern shared memory multiprocessors reach memory sizes in the order of terabytes, the datasets of relevant real-world scientific and knowledge discovery problems may easily surpass these sizes. Executing irregular applications on clustered systems may allow scaling both the dataset size and the performance. Nevertheless, current high performance clusters are optimized for regular applications and built around distributed memory hardware. Several approaches have been proposed to execute irregular applications on distributed memory systems, ranging from custom hardware systems to full software infrastructures. The Tera MTA [150] and its successors, the Cray MTA-2 [15], XMT [66] and uRiKa [51], are the most relevant examples of custom multi-node supercomputers for irregular applications. They use simple, highly multithreaded (up to 128 hardware threads), cache-less VLIW processors that support a global address space in hardware, interconnected with the high-performance Cray Seastar-2 network. The large number of threads allows tolerating both local and remote memory access latencies. The address space is scrambled across nodes at a fine granularity (64 bytes on the XMT), reducing hot-spot occurrence and obtaining more uniform data access times. Every memory word is associated with full/empty bits that provide fine-grain synchronization. Applications for this category of machines are written in a shared memory paradigm. A custom C/C++ compiler semi-automatically parallelizes the code, exploiting pragmas and analyzing parallel loop nests, thus enabling the data-parallel mapping of the loops' iterations to threads. Another approach to executing irregular applications with large datasets uses modern shared memory nodes with Solid State Disks (SSDs). SSDs enable the management of datasets larger than the available random access memory, significantly reducing swapping penalties. However, this approach only allows scaling the dataset size, not the performance. Our approach, instead, aims at scaling both performance and size as more nodes are added to a clustered system. Furthermore, by supporting latency tolerance and data aggregation, our runtime could also enable better exploitation of the mass storage.

Several software approaches have been proposed to enhance the performance and ease the development of irregular applications on distributed memory machines. Multipol [45] is a library of distributed data structures for irregular problems. It includes stateful structures (hash tables, sets, trees), which exploit replication, partitioning and software-controlled caching for obtaining locality, and scheduling structures (various kinds of queues) for load balancing. The CHAOS/PARTI [52] runtime support library is a set of software primitives that couple partitioners to the application programs, remap data and partition work among processors, and optimize interprocessor communications. These libraries are mainly targeted at optimizing the data partitioning and at reducing communication overheads across processors/nodes through caching mechanisms. The work-to-processor mapping is mostly hand-managed. Like these libraries, our runtime system provides a global address space and optimizes the communication. However, rather than exploiting caching, GMT improves effective bandwidth using data aggregation, and also integrates latency tolerance mechanisms via software multithreading. Finally, exploiting more modern multicore architectures, our library delegates different operations to different threads. The Partitioned Global Address Space (PGAS) programming model provides a more general framework for irregular applications, by realizing an abstracted shared address space across distributed memory systems without neglecting data or thread locality. The PGAS programming model is implemented in languages and libraries such as Unified Parallel C (UPC) [60, 49, 167], Co-Array Fortran (CAF) [132, 93], the Global Arrays (GA) Toolkit [130] and others. These programming models rely on communication runtime libraries that manage the data exchanges between distributed address spaces, among them GASNet [32] and ARMCI [128]. GASNet serves as a communication runtime for several PGAS languages including UPC, Titanium [168], and Co-Array Fortran [93]. X10 [47] is an object-oriented parallel language based on the Asynchronous PGAS (APGAS) model, which extends the PGAS model with the concepts of places and asynchrony via tasks. Chapel [46] is a parallel programming language for distributed memory supercomputers based on ideas from HPF, ZPL and the extensions introduced in the MTA/XMT compiler. Similarly to X10, it supports asynchronous task execution and reasoning about locality via the locale abstraction. Our runtime is also based around the Partitioned Global Address Space concept. However, current PGAS models mainly target regular applications and, except for X10 and Chapel, employ SPMD control models. Instead, we specifically target irregular applications, with a simple task-based, parallel loop control model. Our runtime implements techniques such as latency tolerance through lightweight multithreading, data aggregation and (currently intra-node) load balancing, to cope with the behavior of irregular applications, and hides locality details from the application developers.

We do not aim at providing a full parallel language, but rather a set of simple primitives to implement irregular applications on distributed memory systems by adopting a shared memory abstraction. Code translators and parallel extensions to other languages such as C could use GMT as their target runtime layer. Grappa [126] is a runtime that integrates a PGAS programming model and multithreading for latency tolerance, targeted at increasing the performance of graph crawling on commodity x86 clusters. Compared to our approach, it employs a substantially different architecture: it is based on GASNet, it has limited support for data aggregation, and it does not exploit thread specialization. Active Pebbles [166] includes both a programming model for fine-grained data-driven computations and an execution model that maps the fine-grained expression to an efficient implementation. The programming model exploits the concepts of pebbles and targets. Pebbles are light-weight active messages that operate on targets. Targets are the binding of a data object with a message handler, through a distribution object, and form a global address space. The execution model includes message aggregation, active routing, message reduction and termination detection (because each pebble can trigger new communications). Our objective is not to define a new programming model, but rather to provide a simple API to implement irregular applications, supporting loop parallelism. Our runtime, besides aggregation, also exploits thread specialization. Charm++ [96] is a parallel programming system based on a programming model that exploits message-driven objects (chares). When a programmer invokes a method on an object, the runtime sends a message to the object, which may be local or remote, starting asynchronous operations. The runtime can adaptively assign chares to processors during program execution, and supports latency tolerance by switching from blocked processes to others. With respect to Charm++, our runtime explicitly targets irregular applications, and for this reason supports much finer task granularity for latency tolerance and data aggregation. ParalleX (PX) [70] is an experimental execution model that exploits a split-phase, multithreaded, transactional distributed computing methodology, decoupling computation and communication. ParalleX supports fine-grain multithreading, a global address space and overlapping of communication and computation, and is able to move work to the data. High Performance PX (HPX) is a runtime system based on the ParalleX execution model. Although ParalleX also targets irregular workloads, it is significantly more complex in scope than GMT.

As such, it is also more complex to use, and the current HPX runtime is still incomplete, missing elements such as data aggregation, and developed with an approach that focuses on providing features first, rather than performance. OCR [44] is a framework to explore new methods of high-core-count programming. Its initial focus is on HPC applications and its goal is to improve power efficiency, programmability, and reliability while maintaining performance. OCR provides event-driven tasks, events, memory data blocks, machine description facilities, and more. There are several libraries for graph processing on distributed systems. Among them, the most widely used are Pregel [115], Giraph [3] and GraphLab [111]. They all exploit bulk synchronous parallel models (usually map-reduce) through vertex programs that run on each vertex and interact along edges. Interactions either use messages (Pregel and Giraph) or shared state (GraphLab). However, they only aim at solving graph traversal through a specific API. Our library, instead, targets a wider class of irregular applications. Although map-reduce frameworks have been applied to several irregular problems [107], they cannot deal with the more general cases where the computation of values depends on previously computed values [114]. Examples include queries executed on semantic graph databases through graph pattern matching operations, which may dynamically generate new (parallel) searches or stop the execution depending on the outcome of part of the query.

5.4 Programming model and API

GMT aims at providing an effective way to program large-scale irregular applications on commodity clusters. Considering the three dimensions of productivity, performance and generality, we designed GMT to favor productivity and performance over generality. We designed a productive programming model for irregular applications, and rely on the runtime system to enable high performance and scalability. In our view, the programmer should not think about the details of the underlying distributed memory machine, but rather in terms of an abstract distributed memory system. To this aim, we borrow well-known programming model concepts commonly used in shared memory systems. Table 5.4 summarizes the primitives currently provided by GMT. In the following sections we describe the characteristics of the GMT programming model and then show an example of an irregular kernel implemented with GMT's API.


gmt_array gmt_alloc(size, locality): Allocates a gmt array with the specified data distribution (partitioned, remote, local).
gmt_free(gmt_array): Deallocates a previously allocated gmt array.
gmt_putNB(gmt_array, offset, *data, size): Puts a local buffer into the indicated gmt array starting at the specified offset (non-blocking).
gmt_putValueNB(gmt_array, offset, value, size): Puts a value into the gmt array at the specified offset (non-blocking).
gmt_getNB(gmt_array, offset, *data, size): Gets a portion of a gmt array starting from offset into a local buffer (non-blocking).
gmt_waitCommands(): Waits for completion of previously issued non-blocking operations.
gmt_put(gmt_array, offset, *data, size): Blocking put.
gmt_putValue(gmt_array, offset, value, size): Blocking putValue operation.
gmt_get(gmt_array, offset, *data, size): Blocking get.
gmt_atomicAdd(gmt_array, offset, value, size): Atomically adds a value to the value contained in a gmt array at the specified offset.
gmt_atomicCAS(gmt_array, offset, oldValue, newValue, size): Exchanges a value with the value contained in a gmt array at the specified offset; returns the old value.
gmt_parFor(tot_iters, chunk_size, *tasks, *args, locality): Spawns tasks that execute the iterations, up to the total number of iterations, taking the specified argument buffer as input. Tasks are spawned on all the allocated nodes of the system, locally or remotely, and execute chunk_size iterations per task.

Table 5.4: GMT API summary

5.4.1 PGAS communication model

The PGAS communication model simplifies the application developers' task of partitioning data structures across nodes. Data is partitioned between global data and local data. The programmer allocates the data structures, mostly arrays, in a virtual global address space, and accesses them through get and put operations (see Table 5.4). Global data structures are identified by handles that are passed to the various GMT primitives. Global data is moved into the local space to be manipulated and then written back into the global space. This approach enables the programmer to ignore the actual memory address and cluster node where the data is allocated.
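As a minimal sketch of this model (assuming the GMT declarations are available through a header, here called gmt.h, and using the gmt_data_t handle type that appears in Figure 5.2), the following code allocates a block-distributed global array, writes a local buffer into it with a blocking put and reads it back with a blocking get. The allocation constant follows the naming of Section 5.4.3; Figure 5.2 abbreviates it as GMT_PARTITION.

#include <stdint.h>
#include <string.h>
#include "gmt.h"   /* assumed header name */

void example_put_get(void) {
    /* 1024 64-bit elements, block-distributed across all nodes */
    gmt_data_t g_array = gmt_alloc(1024 * sizeof(uint64_t), GMT_ALLOC_PARTITION);

    uint64_t local[16];
    memset(local, 0, sizeof(local));

    /* write 16 elements starting at byte offset 0 (blocking) */
    gmt_put(g_array, 0, local, 16 * sizeof(uint64_t));

    /* read them back into a local buffer (blocking) */
    uint64_t check[16];
    gmt_get(g_array, 0, check, 16 * sizeof(uint64_t));

    gmt_free(g_array);
}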

5.4.2 Loop parallelism program structure model

In GMT, the programmer expresses parallelism through a parallel loop construct (gmt_parFor() in Table 5.4), a parallel control model typical of shared memory paradigms. In contrast with the SPMD model, this model allows creating tasks efficiently. In GMT a task is a user-defined function executed for several iterations using the gmt_parFor() construct. The parallel loop construct enables the creation of new tasks from iterations of loops over independent individual structure elements (e.g., parallel loops over all vertices or edges of a graph). The application developer can specify how many iterations of the original loop to assign to each task (chunk_size), but the runtime is also capable of dynamically detecting whether the same processing entity should execute more iterations for load balancing purposes. In the current implementation, the calling task is suspended until all the iterations of the parallel loop are completed. Finally, GMT also supports nested parallel loops, enabling programming patterns such as recursive parallel constructs.
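A minimal parallel-loop sketch follows. It assumes the gmt.h header used above, follows the six-argument gmt_parFor() call form of Figure 5.2 (which also passes the size of the argument buffer), and uses a hypothetical task body that doubles every element of a global array; the spawn-policy constant follows the naming of Section 5.4.3.

#include <stdint.h>
#include "gmt.h"   /* assumed header name */

typedef struct { gmt_data_t g_vec; } scale_args_t;   /* hypothetical argument struct */

/* task body: invoked once per iteration, as in Figure 5.2 */
void scale_body(uint64_t iterId, void *args) {
    scale_args_t *a = (scale_args_t *)args;
    uint64_t v;
    gmt_get(a->g_vec, iterId * sizeof(uint64_t), &v, sizeof(uint64_t));
    gmt_putValue(a->g_vec, iterId * sizeof(uint64_t), 2 * v, sizeof(uint64_t));
}

void scale_all(gmt_data_t g_vec, uint64_t n) {
    scale_args_t args = { g_vec };
    /* n iterations, 1 iteration per task, spawned across all allocated nodes */
    gmt_parFor(n, 1, scale_body, &args, sizeof(args), GMT_SPAWN_PARTITION);
}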

5.4.3 Explicit data and code locality management

The GMT allocation primitive offers the ability to control data placement through distribution strategies. The GMT_ALLOC_PARTITION strategy allocates data in a block-distributed manner, so that it is uniformly distributed across all the nodes. The GMT_ALLOC_LOCAL strategy allocates data only in the memory of the local node. Finally, the GMT_ALLOC_REMOTE strategy allocates data on all nodes except the one that executes the primitive. GMT does not expose the physical location of data to the programmer, to avoid explicit management of data pointers and node ranks. Analogously, the GMT task creation policies (GMT_SPAWN_PARTITION, GMT_SPAWN_LOCAL and GMT_SPAWN_REMOTE) control the locality of the tasks created by a parallel loop.

The programmer only controls the locality policy, while the runtime takes care of transparently mapping the tasks to the available cluster resources (i.e., processor cores).
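The sketch below (same assumptions as the previous examples, with the strategy constants named as in this section) simply contrasts the three allocation strategies; the corresponding GMT_SPAWN_* policies are passed to gmt_parFor() in the same way.

#include <stdint.h>
#include "gmt.h"   /* assumed header name */

void locality_examples(uint64_t n) {
    /* block-distributed across all the nodes */
    gmt_data_t g_part  = gmt_alloc(n * sizeof(uint64_t), GMT_ALLOC_PARTITION);
    /* resident only in the memory of the calling node */
    gmt_data_t g_local = gmt_alloc(n * sizeof(uint64_t), GMT_ALLOC_LOCAL);
    /* resident on all nodes except the calling one */
    gmt_data_t g_rem   = gmt_alloc(n * sizeof(uint64_t), GMT_ALLOC_REMOTE);

    gmt_free(g_part); gmt_free(g_local); gmt_free(g_rem);
}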

5.4.4 Blocking and Non-blocking semantics

GMT communication primitives feature both blocking and non-blocking semantics. When using the blocking flavor of the primitives, the task suspends until the operation effectively completes. When the function of a blocking operation returns, the termination of the operation is guaranteed, for both remote and local operations. When using the non-blocking flavor of the primitives, the task continues, and the order of operations is not guaranteed. When the code calls gmt_waitCommands(), the task is suspended until the runtime completes all the pending non-blocking operations. For performance reasons, given the fine-grained nature of communication operations, gmt_waitCommands() does not allow waiting for a specific non-blocking operation.
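For instance, a scatter of many small values can issue non-blocking puts and synchronize once at the end; this sketch relies only on the primitives of Table 5.4 and on the assumed gmt.h header.

#include <stdint.h>
#include "gmt.h"   /* assumed header name */

/* scatter n values with non-blocking puts, then wait for all of them at once */
void scatter(gmt_data_t g_dst, const uint64_t *src, uint64_t n) {
    for (uint64_t i = 0; i < n; i++)
        gmt_putValueNB(g_dst, i * sizeof(uint64_t), src[i], sizeof(uint64_t));
    /* completion is only guaranteed after this call returns */
    gmt_waitCommands();
}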

5.4.5 Explicit synchronization

The programmer explicitly specifies synchronized access to global data structures. GMT provides atomic operations such as gmt_atomicCAS() or gmt_atomicAdd() (see Table 5.4) to enable the implementation of global synchronization constructs.
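The following sketch (same assumptions as above) illustrates two common uses: a shared event counter updated with gmt_atomicAdd(), and a once-only claim of a flag implemented with gmt_atomicCAS(), whose documented return value is the old content of the location.

#include <stdint.h>
#include "gmt.h"   /* assumed header name */

/* atomically increment a shared counter stored at offset 0 of g_ctr */
void count_event(gmt_data_t g_ctr) {
    gmt_atomicAdd(g_ctr, 0, 1, sizeof(uint64_t));
}

/* try to claim element 'idx' of a byte-wide flag array exactly once;
   gmt_atomicCAS returns the previous value, so 0 means this task won the race */
int claim(gmt_data_t g_flags, uint64_t idx) {
    return gmt_atomicCAS(g_flags, idx * sizeof(uint8_t), 0, 1, sizeof(uint8_t)) == 0;
}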

5.4.6 Example

Figure 5.1 shows the code of a simple sequential Breadth First Search (BFS). The code shows the allocation of the graph data structure in Compressed Sparse Row (CSR) form (represented as 1D arrays of offsets and edges), of the current and next iteration queues, and of the map of visited vertices. The queues use the first location to store the total number of elements in the queue. The initialization procedure initializes the graph data structures and puts the root vertex in the exploration queue. The main loop analyzes the vertices in the exploration queue and goes through all the neighbors of each one. If the algorithm did not previously visit a neighbor, it is marked as visited and added to the queue for the next iteration. After visiting all the vertices in the current iteration queue, the algorithm exchanges pointers, so that the queue of vertices to explore in the next iteration becomes the current one. The code also shows the memory deallocation operations. Figure 5.2 shows the parallel code of this implementation under GMT. The data structures are now allocated in the GMT PGAS memory space, using the gmt_alloc construct.

int main() {
  // n = number of total vertices
  // e = number of total edges
  uint64_t *offsets = malloc((n + 1) * sizeof(uint64_t));
  uint64_t *edges   = malloc(e * sizeof(uint64_t));
  uint64_t *q       = malloc((1 + n) * sizeof(uint64_t));
  uint64_t *qnext   = malloc((1 + n) * sizeof(uint64_t));
  uint8_t  *map     = malloc(n * sizeof(uint8_t));

  init(offsets, edges, q, qnext, map);
  uint64_t q_size = q[0];
  while (q_size != 0) {
    uint64_t vid, iterId;
    for (iterId = 1; iterId <= q_size; iterId++) {
      vid = q[iterId];
      for (uint64_t j = offsets[vid]; j < offsets[vid + 1]; j++) {
        uint64_t neighbor = edges[j];
        if (map[neighbor] == 0) {          /* not visited yet */
          map[neighbor] = 1;
          qnext[++qnext[0]] = neighbor;    /* append to the next-level queue */
        }
      }
    }
    uint64_t *qtmp = q;                    /* swap current and next-level queues */
    q = qnext;
    qnext = qtmp;
    q_size = q[0];
    qnext[0] = 0;
  }
  free(offsets); free(edges); free(q); free(qnext); free(map);
}

Figure 5.1: Sequential queue-based BFS implementation


typedef struct args_bfs_tag {
  gmt_data_t g_offset;
  gmt_data_t g_edges;
  gmt_data_t g_map;
  gmt_data_t g_q;
  gmt_data_t g_qnext;
} args_bfs_t;

void body_bfs(uint64_t iterId, void *args);

int main() {
  // n = number of total vertices
  // e = number of total edges
  gmt_data_t g_offset = gmt_alloc((n + 1) * sizeof(uint64_t), GMT_PARTITION);
  gmt_data_t g_edges  = gmt_alloc(e * sizeof(uint64_t), GMT_PARTITION);
  gmt_data_t g_q      = gmt_alloc((1 + n) * sizeof(uint64_t), GMT_PARTITION);
  gmt_data_t g_qnext  = gmt_alloc((1 + n) * sizeof(uint64_t), GMT_PARTITION);
  gmt_data_t g_map    = gmt_alloc(n * sizeof(uint8_t), GMT_PARTITION);

  init(g_offset, g_edges, g_q, g_qnext, g_map);
  uint64_t g_q_size;
  gmt_get(g_q, 0 * sizeof(uint64_t), &g_q_size, sizeof(uint64_t));

  while (g_q_size != 0) {
    args_bfs_t args;
    args.g_edges  = g_edges;
    args.g_map    = g_map;
    args.g_offset = g_offset;
    args.g_qnext  = g_qnext;
    args.g_q      = g_q;

    gmt_parFor(g_q_size, 1, body_bfs, &args, sizeof(args_bfs_t), GMT_PARTITION);

    gmt_data_t g_qtmp = g_q;               /* swap current and next-level queues */
    g_q     = g_qnext;
    g_qnext = g_qtmp;

    gmt_get(g_q, 0 * sizeof(uint64_t), &g_q_size, sizeof(uint64_t));
    gmt_putValue(g_qnext, 0 * sizeof(uint64_t), 0, sizeof(uint64_t));
  }

  gmt_free(g_offset); gmt_free(g_edges); gmt_free(g_q);
  gmt_free(g_qnext);  gmt_free(g_map);
}

void body_bfs(uint64_t iterId, void *args) {
  args_bfs_t *largs = (args_bfs_t *)args;
  gmt_data_t g_offset = largs->g_offset;
  gmt_data_t g_edges  = largs->g_edges;
  gmt_data_t g_map    = largs->g_map;
  gmt_data_t g_q      = largs->g_q;
  gmt_data_t g_qnext  = largs->g_qnext;

  uint64_t v = 0, offsets[2];

  gmt_get(g_q, (iterId + 1) * sizeof(uint64_t), &v, sizeof(uint64_t));
  gmt_get(g_offset, v * sizeof(uint64_t), offsets, 2 * sizeof(uint64_t));

  uint32_t i;
  uint64_t neighbor;
  for (i = 0; i < offsets[1] - offsets[0]; i++) {
    gmt_get(g_edges, (offsets[0] + i) * sizeof(uint64_t), &neighbor, sizeof(uint64_t));
    /* mark unvisited neighbors and reserve a slot in the next-level queue */
    if (gmt_atomicCAS(g_map, neighbor * sizeof(uint8_t), 0, 1, sizeof(uint8_t)) == 0) {
      uint64_t pos = gmt_atomicAdd(g_qnext, 0, 1, sizeof(uint64_t));
      gmt_putValue(g_qnext, (pos + 1) * sizeof(uint64_t), neighbor, sizeof(uint64_t));
    }
  }
}

Figure 5.2: Parallel BFS implementation with GMT

All the data structures are allocated with the GMT_ALLOC_PARTITION strategy, which distributes them in contiguous blocks of the same size on all the allocated nodes. The graph data structure and the queues are stored in the global memory space. For this reason, the algorithm needs to retrieve the number of elements in the queue using a get operation. Control flow is sequential from the main task until the parallel loop construct is encountered; immediately after its completion, control goes back to the main task again. The main loop, which goes through all the elements in the current iteration queue, is substituted with a parallel for (gmt_parFor) construct. The parallel for construct launches as many tasks as the number of vertices in the queue, corresponding to the iterations of the for loop. The gmt_parFor construct uses the PARTITION iteration distribution strategy, which allows spawning tasks on all the executing nodes. A struct containing the references to the global data structures is used to communicate their location to the tasks. The task itself is a function, which takes as input the structure of parameters, as well as an iteration identifier (iterId). The iterId allows addressing the global data structures in a data-parallel manner through offsets and works as a task identifier. Data in the global data structures is accessed through get and put operations, using offsets derived from the iterIds. The parallel implementation uses an atomic compare-and-swap to mark vertices not previously explored and an atomic addition to reserve slots and add them to the queue for the next iteration. References to the queues allocated in global memory are exchanged like pointers. The example demonstrates the simplicity of implementing applications with GMT. Porting from sequential or shared memory implementations is straightforward, and the developer does not need to worry about data partitioning or domain decomposition.

5.5 Runtime architecture

We built GMT around three main "pillars": global address space, latency tolerance through fine-grained software multithreading, and remote data access aggregation (also known as coalescing). As previously discussed, global address space support relieves application developers from having to partition datasets as well as having to orchestrate communication. Message aggregation (coalescing) maximizes network bandwidth utilization, despite the small data accesses typical of irregular applications. Fine-grained multithreading enables applications to perform useful work while communication is in progress, hence hiding latencies for remote data transfers as well as the added latency for aggregation. We followed a bottom-up approach in the design and implementation of GMT. We identified the basic building blocks and, for each one, we evaluated the performance of the alternative design points, selecting the best solutions.


Figure 5.3: Architecture overview of GMT

The following sections describe the most significant components of the runtime and explain the rationale behind them.

For exploring the design choices of GMT's building blocks, and for the overall experimental evaluation, we employed Pacific Northwest National Laboratory's Olympus supercomputer, listed in the TOP500 [1]. Olympus is a cluster of 604 nodes interconnected through a QDR Infiniband switch with 648 ports (theoretical peak of 4GB/s). Each Olympus node features two AMD Opteron 6272 processors (codename "Interlagos") at 2.1 GHz and 64 GB of DDR3 memory clocked at 1600 MHz. Each socket hosts 8 processor modules (two integer cores, one floating point core per module) on two different dies, for a total of 32 integer cores per node. A module includes an L1 instruction cache of 64 KB, two L1 data caches of 64 KB, and a 2 MB L2 cache. Each 4-module die hosts a shared L3 cache of 8 MB. Dies and processors communicate through HyperTransport.

5.5.1 Overview

GMT targets commodity clusters, composed of nodes with multicore processors (such as the latest AMD Opteron or Intel Xeon), interconnected with modern, fast networks such as Infiniband or the Cray high performance interconnects (Gemini, Aries). Figure 5.3 illustrates the high-level design of GMT. GMT realizes a virtual global address space across the nodes of the cluster. Each node executes an instance of GMT and the various instances communicate through commands. Different types of commands exist for GMT operations such as global data read/write, synchronization and thread management. Commands may also include data movement (e.g., gmt_put() and gmt_get()). An instance of GMT, executing in one cluster node, includes three different types of specialized threads:

• Worker: executes the application code, partitioned into tasks, and generates requests, in the form of commands, directed towards both the local node and the remote nodes.

• Helper: manages the global address space and synchronization, handles incoming requests and generates the related outgoing replies, in the form of commands.

• Communication server: the communication endpoint on the network; it manages incoming and outgoing communication at the node level. Workers and helpers send commands to the communication server, which forwards them to the remote nodes.

A GMT node includes multiple workers and helpers, but only a single communication server. We implemented the specialized threads as POSIX threads. Each thread is pinned to a core.
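The sketch below is purely illustrative of this thread layout (it is not GMT's actual code): it creates a POSIX thread and pins it to a given core with the GNU affinity call, which is how one worker, helper or communication server per core could be set up. On a 32-core Olympus node, for example, workers could occupy one set of cores, helpers another, with one core reserved for the communication server.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* create a specialized thread and pin it to the given core */
static pthread_t spawn_pinned(void *(*fn)(void *), void *arg, int core) {
    pthread_t t;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_create(&t, NULL, fn, arg);
    pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
    return t;
}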

5.5.2 Communication

A fundamental design point for GMT is the choice of the underlying communication library. GMT does not use communication libraries that already provide PGAS primitives, such as GASNet [32], because their implicit communication management mechanisms do not provide message aggregation. Hence, we decided to implement our own PGAS primitives, designing them with message aggregation from the ground up. For this reason, the only requirement is a message passing interface optimized for high bandwidth.


message size | 32 proc., no threads | 1 proc., 1 thread | 1 proc., 2 threads | 1 proc., 4 threads
16B    |    9.63 |    4.22 |    2.73 |   0.77
32B    |   19.54 |    9.63 |    5.70 |   1.58
64B    |   39.05 |   19.54 |   10.99 |   3.12
128B   |   72.26 |   39.05 |   19.73 |   6.22
16KB   | 2806.94 | 1924.98 |  646.52 | 269.63
32KB   | 2806.95 | 2250.15 |  892.80 | 469.40
64KB   | 2815.01 | 2559.50 |  794.60 | 566.09
128KB  | 2835.98 | 2709.07 | 1042.01 | 564.87

Table 5.5: Transfer rates in MB/s between two nodes with varying thread and process number.

We selected MPI because it is the de-facto standard for message passing interfaces and supports the broadest variety of architectures. We then determined the number of Communication Servers required to maximize node-to-node bandwidth with MPI. We analyzed several combinations of MPI processes and threads per node to determine the highest bandwidth. Table 5.5 presents a comparison of the transfer rates between two Olympus nodes when transferring a large number of messages from one node to the other and waiting for an acknowledgment from the receiver every 4 messages. We used a slightly modified version of the OSU Micro-Benchmarks 3.9 [6] (in order to support multithreading) to compare MPI (OpenMPI 1.5.4) with multiple processes (32 on the same node) and MPI (MVAPICH 1.9b) with one, two and four threads.1 MPI with multiple threads per process exhibits low transfer rates. The best performance is obtained using large messages with 32 processes per node. Nonetheless, because of the complexity and the high memory footprint of managing different address spaces, we consider the latter an infeasible solution for GMT. As shown in Table 5.5, transfer rates are particularly sensitive to the message size. Even though we show results for Olympus, we observed similar MPI behavior with other processor architectures and network interconnects. These results drove our decision to design GMT with a single communication server, and to rely on message aggregation to maximize the network bandwidth.

1On Olympus we found multi-threading to have better performance with MVAPICH than with OpenMPI.

In our design, each worker (or helper) aggregates commands into large messages (buffers) and forwards them to a single communication server, which in turn performs the MPI call. We consider this design effective if it maximizes the network bandwidth between two nodes. The optimal size of the aggregation buffers is a tradeoff between the bandwidth and the memory footprint of using large buffers. We found a buffer size of 64KB to be a good compromise in our experiments with Olympus.
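A minimal MPI sketch of the measurement protocol described above (stream messages of a fixed size from rank 0 to rank 1 and acknowledge every 4 messages, run with exactly two ranks on two nodes) is shown below; it mimics the modified OSU benchmark only in spirit and uses an illustrative 64KB message size.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int iters = 4096, size = 65536;   /* e.g., 64KB messages */
    char *buf = malloc(size);
    char ack = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            if ((i + 1) % 4 == 0)           /* wait for an ack every 4 messages */
                MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if ((i + 1) % 4 == 0)
                MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%.2f MB/s\n", (double)iters * size / (t1 - t0) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}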

Figure 5.4: Bandwidth between two nodes using a single Communication Server and a single worker with varying message size.

Figure 5.4 shows the bandwidth reached between two nodes when using one worker and one communication server while varying the message size. The maximum bandwidth with this configuration is reached with 64KB messages and is equal to 2630 MB/s, slightly below the measured MPI network bandwidth of 2815.01 MB/s with the same message size. In the next section we describe how we implemented aggregation to coalesce commands into large buffers.

5.5.3 Aggregation

Data aggregation allows efficient exploitation of the available network bandwidth in the presence of the fine-grained data accesses typical of irregular applications. GMT accumulates commands directed towards the same destination nodes and sends them in bulk. These commands are then unpacked and executed at the destination node. To increase the opportunity of aggregating network transfers, GMT uses aggregation queues to collect request or reply commands with the same destination from all the workers and helpers of a node. To this aim, GMT employs high-throughput, non-blocking concurrent aggregation queues. Nevertheless, the cost of concurrent accesses is too high if paid for every generated command.

Hence, GMT implements a pre-aggregation phase. This phase initially collects commands in command blocks, local to each worker or helper, before inserting them into the shared aggregation queues.

Figure 5.5: Aggregation mechanism

Figure 5.5 describes the aggregation mechanism in GMT. When a worker or a helper thread starts generating commands, it requests one of the pre-allocated command blocks from the command block pool (1). Command blocks are re-usable arrays containing several commands. They are pre-allocated and recycled for performance reasons. While a worker executes the application code, it generates commands of various types that are collected into the local command block (2). Helpers also generate commands when creating replies for incoming operations. In the example, a worker generates commands A, C, D and a helper generates commands K, L, N and O. Commands can be any type of remote request (get, put, atomic, etc.). Waiting until the command block is full may increase the latency too much; for this reason, workers and helpers push command blocks into the aggregation queue (3) when one of the following conditions is verified: i) the command block is full; ii) the command block has been waiting longer than a predetermined time interval (clock cycles). A command block is considered full when all the available entries are occupied with commands, or when the size in bytes of the commands, with the attached data, reaches the maximum size of the aggregation buffer to be sent. Aggregation queues are shared among all the workers and helpers in the node, and there is one of them for each destination node.

The aggregation consists of copying commands into an aggregation buffer that is sent over the network to the destination node. When a worker (or a helper) finds that the aggregation queue for a destination node has enough command blocks to fill an aggregation buffer (in terms of number of commands or in equivalent byte size), the actual aggregation starts. Aggregation can also start because the aggregation queue has been waiting longer than a predetermined time interval. When aggregation starts, the worker (or helper) pops command blocks from the aggregation queue to fill the aggregation buffer. GMT uses a fixed pool of aggregation buffers that are recycled to save memory space and eliminate allocation overhead. Multiple commands are copied from their command block into the aggregation buffer at once (5). For commands that require data movement (such as gmt_put() or the reply to a gmt_get()), the data is also copied from memory into the aggregation buffer. In the example in Figure 5.5, the data for the commands A, C, D is represented as dA, dC and dD respectively. After the copy, command blocks are returned to the command block pool (7). The aggregation algorithm continues to pop command blocks until an aggregation buffer is full. When this happens, the worker (or helper) pushes the aggregation buffer into a channel queue (8). Channel queues are high-throughput single-producer single-consumer queues that enable the communication between a worker (or helper) and the communication server. The communication server continuously polls the channel queues, checking whether new filled aggregation buffers are available. If so, the communication server pops a filled aggregation buffer and performs a non-blocking MPI send. It then returns the aggregation buffer to the pool of available aggregation buffers (not represented in the figure). It is worth mentioning that GMT enables further tuning of design parameters such as local command block, queue and pool sizes, or time intervals, to optimize aggregation for different processor and network interconnect architectures.
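A highly simplified, self-contained sketch of the per-worker pre-aggregation logic is shown below. The structure and function names are illustrative, not GMT's, and the ageing check uses a monotonic clock instead of the core timebase register; the byte threshold mirrors the 64KB aggregation-buffer size discussed above.

#include <stdint.h>
#include <time.h>

#define CMDS_PER_BLOCK 64
#define MAX_BUF_BYTES  65536      /* matches the 64KB aggregation buffers */
#define MAX_AGE_NSEC   100000L    /* illustrative ageing threshold */

typedef struct { uint8_t type; uint64_t offset; uint32_t payload; } cmd_t;

typedef struct {
    cmd_t    cmds[CMDS_PER_BLOCK];
    uint32_t num_cmds;
    uint32_t bytes;               /* payload bytes attached to the commands */
    struct timespec first;        /* time of the oldest buffered command */
} cmd_block_t;

/* stand-in for handing the block over to the shared per-destination queue */
static void aggr_queue_push(int dest_node, cmd_block_t *blk) {
    (void)dest_node; (void)blk;
}

static long age_nsec(const struct timespec *t0) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - t0->tv_sec) * 1000000000L + (now.tv_nsec - t0->tv_nsec);
}

/* collect a command locally; flush when the block is full or too old */
void add_command(int dest_node, cmd_block_t *blk, cmd_t c) {
    if (blk->num_cmds == 0)
        clock_gettime(CLOCK_MONOTONIC, &blk->first);
    blk->cmds[blk->num_cmds++] = c;
    blk->bytes += c.payload;

    if (blk->num_cmds == CMDS_PER_BLOCK || blk->bytes >= MAX_BUF_BYTES ||
        age_nsec(&blk->first) > MAX_AGE_NSEC) {
        aggr_queue_push(dest_node, blk);
        blk->num_cmds = 0;
        blk->bytes = 0;
    }
}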

5.5.4 Multithreading

Concurrency, through fine-grained software multithreading, allows tolerating the added latency introduced by aggregating communication operations. We use the term task to identify a function pointer and a hardware execution context (stack, heap, hardware registers, etc.) inside GMT, while we use the term specialized thread (or, simply, thread) to identify either a worker, a helper or the communication server. Each worker executes a set of GMT tasks. The worker switches among tasks' contexts every time it generates a blocking command that requires a remote memory operation.


context switches (N) | 1 task | 8 tasks | 64 tasks | 1024 tasks
1    | 1816.00 | 1500.25 | 1536.81 | 1799.10
100  |  497.31 |  496.71 |  554.25 |  590.91
1000 |  517.14 |  494.56 |  545.00 |  579.13

Table 5.6: Latency (clock cycles) of a context switch when increasing the number of tasks and the number of context switches per task.

The task that generated the command executes again only when the command itself completes (i.e., it gets a reply back from the remote node). In case of non-blocking commands, the task continues executing until it encounters a gmt_waitCommands() primitive. GMT implements custom context switching primitives that avoid some of the lengthy operations (e.g., saving and restoring the signal mask) performed by the standard libc context switching routines [78]. To evaluate the maximum network latency that is potentially tolerable, we measured the cost of context switching among two or more tasks. We performed an experiment that executes an increasing number of context switches among an increasing number of tasks. Each of the T tasks performs N context switches in a for loop. We run experiments where N is 1, 100 or 1000 context switches and T is 1, 8, 64 or 1024 tasks. The latency of a single context switch is measured as the latency to perform the N context switches in one of the tasks divided by N. Table 5.6 shows the latency, in clock cycles, to execute a context switch in this experiment. When increasing the total number of context switches executed from 1 to 1000, we observe the effect of the caches, which avoid retrieving the task context from memory. We also observe that the latency only slightly increases when increasing the number of tasks. The optimal number of concurrent tasks per worker actually depends on the architecture (cache size) and on the workload (amount of work per task). In GMT, the programmer typically generates tasks (except task zero) by calling the gmt_parFor() construct. Figure 5.6 schematically shows how GMT executes a task. A node receives a message containing a spawn command (1) that a worker in a remote node generated when encountering a gmt_parFor() construct. The communication server passes the buffer containing the command to a helper, which parses it and executes the command (2). The helper then creates an iteration block (itb). The itb is a data structure that contains the function to execute, the arguments of the function itself, and the number of tasks that execute the same function. Each task represents a single iteration of the original parFor.

Figure 5.6: Fine grain multithreading in GMT.

This way of representing a set of tasks avoids the cost of creating a large number of function arguments and sending them over the network. In the following step, the helper pushes the iteration block into the itb queue (3). Then, an idle worker pops the iteration block from the itb queue (5), decreases its iteration counter by t and pushes it back into the itb queue (6). The worker then creates t tasks (6) and pushes them into its private task queue (7). Then an idle worker pops a task from its task queue (8). If the worker can execute the task (i.e., all remote requests are completed), it restores the task's context and executes it (9); otherwise, it pushes the task back into the task queue. The task contains user-level application code, which eventually calls one of the GMT primitives. If the GMT primitive is a blocking remote request (e.g., gmt_get()) or an explicit wait (gmt_waitCommands()) that has not yet completed, the task enters a waiting state (10) and is reinserted into the task queue for future execution (11).
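The following self-contained sketch condenses this scheduling policy into a single worker loop over a plain task queue. The types and fields are illustrative, not GMT's actual ones; the real implementation saves and restores full execution contexts and feeds the per-worker queues through the itb mechanism described above.

#include <stddef.h>

typedef enum { TASK_READY, TASK_WAITING, TASK_DONE } task_state_t;

typedef struct task {
    task_state_t state;
    int          pending_ops;              /* outstanding remote requests */
    void       (*resume)(struct task *);   /* restores the task's saved context */
    struct task *next;
} task_t;

typedef struct { task_t *head, *tail; } task_queue_t;

static void push(task_queue_t *q, task_t *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static task_t *pop(task_queue_t *q) {
    task_t *t = q->head;
    if (t) { q->head = t->next; if (!q->head) q->tail = NULL; }
    return t;
}

void worker_loop(task_queue_t *q) {
    task_t *t;
    while ((t = pop(q)) != NULL) {
        if (t->state == TASK_WAITING && t->pending_ops > 0) {
            push(q, t);                    /* replies not back yet: retry later */
            continue;
        }
        t->resume(t);                      /* context switch into the task */
        if (t->state != TASK_DONE)
            push(q, t);                    /* task yielded on a blocking command */
    }
}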

5.6 Experimental Evaluation

As introduced in Section 5.5, we evaluated GMT on PNNL's Olympus supercomputer. GMT can adapt to other systems by tuning configuration parameters defined at installation time.


For this work, we empirically optimized the GMT parameters for the Olympus system. Table 5.7 presents the configuration used in our benchmarks.

Parameter | Configuration
NUM_WORKERS | 15
NUM_HELPERS | 15
NUM_BUF_PER_CHANNEL | 64
SIZE_BUFFERS | 65536
MAX_NUM_TASKS_PER_WORKER | 1024

Table 5.7: GMT configuration parameters for Olympus.

A brief explanation of the selection procedure for each parameter follows:

• NUM_WORKERS determines the number of worker threads per node. Because we assume that each thread is mapped to a core, and one core is taken by the communication server, this value depends on the number of cores per node (NUM_WORKERS = (CORES - 1)/2), as sketched in the code after this list.

• NUM_HELPERS determines the number of helper threads. Same as NUM_WORKERS.

• NUM_BUF_PER_CHANNEL determines how many buffers are allocated to each communication channel. The number of buffers per channel allows the local worker (or helper) to continue sending network buffers even when the remote node is not consuming them. A safe value for this parameter is equal to the total number of nodes, but in case of network congestion a larger value can be used.

• SIZE_BUFFERS determines the size of the aggregation buffer. This depends on the network characteristics. On Olympus we set this parameter to 64KB because it is the smallest message size that maximizes the throughput (see Figure 5.4).

• MAX_NUM_TASKS_PER_WORKER determines the maximum number of tasks concurrently executed by each worker. The optimal number of concurrent tasks should be large enough to overlap the latency of the network communication, but not so large as to incur too many cache misses during context switches. We found the value of 1024 for our architecture through empirical experimentation with GMT micro-benchmarks (see Table 5.6).
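The sketch below summarizes these rules under the assumptions of this list (an illustrative configuration struct, not GMT's actual one): the worker and helper counts follow directly from the number of cores, while the buffer size and task count come from the measurements in Figure 5.4 and Table 5.6.

#include <unistd.h>

typedef struct {
    int num_workers;
    int num_helpers;
    int size_buffers;
    int max_tasks_per_worker;
} gmt_config_t;   /* illustrative struct */

gmt_config_t default_config(void) {
    gmt_config_t c;
    int cores = (int)sysconf(_SC_NPROCESSORS_ONLN);
    c.num_workers = (cores - 1) / 2;   /* one core reserved for the communication server */
    c.num_helpers = (cores - 1) / 2;
    c.size_buffers = 65536;            /* 64KB, see Figure 5.4 */
    c.max_tasks_per_worker = 1024;     /* see Table 5.6 */
    return c;
}

On a 32-core Olympus node this yields 15 workers and 15 helpers, matching Table 5.7.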

Initially, we analyze the peak communication performance of the full GMT infrastructure, focusing on the combined effects of aggregation, for optimizing bandwidth utilization, and of fine-grained software multithreading, for latency tolerance.

We then discuss the performance of three irregular application kernels running on GMT: Breadth First Search (BFS), Graph Random Walk (GRW) and Concurrent Hash Map Access (CHMA). We compare the GMT implementations of these kernels to corresponding implementations using MPI, UPC, the Cray XMT and OpenMP. We execute the GMT and UPC versions on the same cluster, while for the OpenMP version we use a 48-core system with 4 AMD Opteron 6176SE processors ("Magny-Cours" at 2.3 GHz) and 256 GB of DDR3 RAM at 1333 MHz. We compare these substantially different platforms by employing throughput metrics. We maintain consistent input sizes for each comparison. We select the experiments to describe specific aspects and behaviors of GMT, rather than providing an exhaustive comparison of all the implementations and platforms.

5.6.1 Micro-benchmarks

In this section we characterize the peak communication performance of GMT. The aim of this characterization is to quantify the effects of aggregation when performing a large number of basic GMT remote operations. When GMT executes a series of fine-grained put operations, we expect to observe a considerable improvement in bandwidth utilization with respect to sending MPI messages of the same size, because of aggregation. Furthermore, increasing the number of concurrent tasks increases the likelihood of generating communication operations. Thus, we expect the aggregate bandwidth to increase with the number of concurrent tasks in the node. Figure 5.7 shows how transfer rates between two nodes behave when increasing the number of tasks per node in GMT. Every task executes 4096 blocking put operations. All the experiments use 15 workers, but we increase the number of tasks in the node. The graph plots message sizes from 8 to 128 bytes. We verify that increasing the concurrency in the node increases the transfer rates, because there is a higher number of messages that GMT can aggregate. With 1024 tasks, puts of 8 bytes reach a bandwidth of 8.55 MB/s. With 15360 tasks, the bandwidth increases to 72.48 MB/s, a factor of 8.4. Larger messages provide higher bandwidth, because they reduce the network overhead. With messages of 128 bytes and 15360 tasks, GMT reaches almost 1 GB/s, while the best MPI implementation reaches 72.26 MB/s (using 32 processes). At these message sizes, with blocking operations, the task switching time also becomes a factor. In fact, a node should be able to generate as many network references as possible to saturate the effective network bandwidth for small messages. When concurrent tasks emit communication operations in parallel, they increase the injection rate. However, if the task switching time is too high, there is an added latency between one injected network operation and the next that prevents maximizing network utilization, even considering the overheads for packet headers.


Figure 5.7: Transfer rates of put operations between 2 nodes while increasing concurrency. Multiple lines show the transfer rate with message sizes from 8 bytes to 128 bytes.

With more destination nodes, the probability of aggregating enough data to fill a buffer for a specific remote node decreases. To verify the behavior of aggregation when communicating with a higher number of nodes, we executed the same experiment on 128 nodes. Figure 5.8 shows the results. Compared to the previous figure, we observe a slight degradation in performance. However, aggregation is still very effective with respect to MPI send operations of the same size. For instance, GMT with messages of 16 bytes over 128 nodes reaches a bandwidth of 139.78 MB/s, versus the 9.63 MB/s of the MPI send operation (using 32 processes).
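The structure of the put micro-benchmark used for Figures 5.7 and 5.8 can be sketched as follows. The names gmt_put() and gmt_parfor() and their signatures are assumptions used for illustration; they stand for the GMT blocking put primitive and the parallel-loop construct discussed in this chapter, whose real interfaces may differ.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t gmt_data_t;                        /* assumed: opaque handle to a global array        */
void gmt_put(gmt_data_t g, uint64_t off,
             const void *src, size_t n);            /* assumed: blocking fine-grained put              */
void gmt_parfor(uint64_t n,
                void (*body)(uint64_t, void *),
                void *args);                        /* assumed: spawn n tasks executing body()         */

#define NUM_PUTS 4096
extern gmt_data_t remote_array;                     /* globally allocated array owned by the peer node */

static void put_task(uint64_t task_id, void *args)
{
    size_t msg_size = *(size_t *)args;              /* 8 to 128 bytes in the experiments               */
    char buf[128] = {0};                            /* local payload                                   */

    for (int i = 0; i < NUM_PUTS; i++) {
        uint64_t off = (task_id * NUM_PUTS + (uint64_t)i) * msg_size;
        gmt_put(remote_array, off, buf, msg_size);  /* each put is aggregated per destination node     */
    }
}

void run_put_benchmark(uint64_t num_tasks, size_t msg_size)
{
    gmt_parfor(num_tasks, put_task, &msg_size);     /* tasks are multiplexed over the 15 workers       */
}

Increasing num_tasks corresponds to moving right along the horizontal axis of Figures 5.7 and 5.8.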

5.6.2 BFS

BFS is one of the most common graph search kernels, and a building block for many graph analysis applications. As a matter of fact, BFS is part of the Graph500 [81] benchmark suite and a de-facto benchmark for irregular applications. All the implementations exploit parallelism on the vertex queue while progressing through the various levels of exploration.

Figure 5.8: Transfer rates of put operations among 128 nodes (one to all) while increasing concurrency. This is the transfer rate of the outgoing messages from the source node. Multiple lines show the transfer rate with message sizes from 8 bytes to 128 bytes.

The code for GMT, Cray XMT and OpenMP is essentially identical, except that GMT primitives are substituted with Cray XMT proprietary primitives and OpenMP compiler pragmas. The code for UPC, instead, uses several optimizations such as caching the exploration map, aggregating communication at the application code level and using asynchronous gets for the aggregated transfers. Table 5.8 shows the number of lines of code for the GMT, UPC and MPI implementations of the BFS; a sketch of the GMT version follows the table.

implementation    Lines of code
GMT               88
UPC               715
MPI               103

Table 5.8: Lines of code - Breadth First Search implementations
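A level-synchronous BFS along these lines is sketched below. The GMT primitive names (gmt_get, gmt_put_word, gmt_atomic_cas, gmt_atomic_add, gmt_parfor) and their signatures are assumed for illustration; the actual 88-line GMT implementation reported in Table 5.8 may differ in structure and interface.

#include <stdint.h>

typedef uint64_t gmt_array;                                    /* assumed: handle to a global array */
uint64_t gmt_get(gmt_array a, uint64_t i);                     /* assumed: single-word remote read  */
void     gmt_put_word(gmt_array a, uint64_t i, uint64_t v);    /* assumed: single-word remote write */
uint64_t gmt_atomic_cas(gmt_array a, uint64_t i,
                        uint64_t oldv, uint64_t newv);         /* assumed: returns previous value   */
uint64_t gmt_atomic_add(gmt_array a, uint64_t i, uint64_t v);  /* assumed: fetch-and-add            */
void     gmt_parfor(uint64_t n, void (*body)(uint64_t, void *), void *args);

struct bfs_ctx {
    gmt_array row_off, col_idx;       /* CSR graph in global memory                   */
    gmt_array parent;                 /* parent[v], UINT64_MAX if not yet visited     */
    gmt_array cur_q, next_q, next_n;  /* current/next vertex queues and next-q size   */
};

static void expand_vertex(uint64_t i, void *args)
{
    struct bfs_ctx *c = args;
    uint64_t v     = gmt_get(c->cur_q, i);          /* i-th vertex of the current level     */
    uint64_t first = gmt_get(c->row_off, v);
    uint64_t last  = gmt_get(c->row_off, v + 1);

    for (uint64_t e = first; e < last; e++) {
        uint64_t u = gmt_get(c->col_idx, e);        /* single-word access to the adjacency  */
        /* claim u exactly once: the task that swaps UINT64_MAX -> v owns it */
        if (gmt_atomic_cas(c->parent, u, UINT64_MAX, v) == UINT64_MAX) {
            uint64_t slot = gmt_atomic_add(c->next_n, 0, 1);   /* reserve a queue slot      */
            gmt_put_word(c->next_q, slot, u);                  /* append to the next level  */
        }
    }
}

void bfs_level(struct bfs_ctx *c, uint64_t cur_size)
{
    gmt_parfor(cur_size, expand_vertex, c);         /* one task per frontier vertex          */
    /* the caller swaps cur_q/next_q and repeats until the next frontier is empty */
}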

Figure 5.9 shows the weak scaling of the GMT implementation, measured in millions of traversed edges per second. The implementation is the one presented in Figure 5.2, which performs single-word memory accesses on the global graph structure. For this experiment we used randomly generated graphs, increasing the size of the graph by 1 million vertices for each node added. Each vertex has at most 4000 edges connecting to random vertices in the graph.


Figure 5.9: Million traversed edges per second for the GMT implementation of the BFS (weak scaling).

Therefore, the largest graph on the 128-node configuration has 128 million vertices and 258 billion edges, for a total memory footprint of ≈ 2 TB.

Figure 5.10: Million traversed edges per second for the BFS implementation on GMT, UPC, Cray XMT, OpenMP (strong scaling). The horizontal scale represents cluster nodes for GMT, XMT and UPC, while for OpenMP it represents cores.

Figure 5.10 shows the strong scaling of the GMT implementation, comparing it to the equivalent queue-based implementations for UPC, the Cray XMT and OpenMP.

For these experiments, we used a random graph of 10 million vertices and 2.5 billion edges, due to the maximum memory capacities of the platforms (256 GB on the AMD shared memory system and 1 TB on the Cray XMT). An interesting consideration is the amount of parallelism necessary to fully utilize the various platforms. While OpenMP needs a thread per core and the Cray XMT needs 128 threads per processor, GMT requires 1024 user tasks per worker. With 128 nodes and 15 workers per node, GMT needs 2 million tasks to fully utilize the system. Indeed, GMT's performance starts to decrease above 64 nodes, because the available parallelism in the application is not enough. As the graph in Figure 5.10 shows, the BFS implementation for GMT outperforms the other implementations. Even for a relatively small graph, GMT running on a cluster outperforms OpenMP running on a single, large shared memory node. This result is of particular interest, although not obvious, because each node of the cluster running GMT is an SMP node. In fact, the performance of irregular algorithms is mainly limited by the bandwidth for accessing data in unpredictable locations. For an SMP workstation, the bottleneck is thus the memory bandwidth. For a distributed memory system, instead, the main bottleneck is the network interconnection, which usually is significantly slower than the memory subsystem, especially with fine-grained data transfers. Thus, GMT effectively optimizes the network bandwidth utilization, meeting its objective of increasing not only the size of the tractable problems, but also the processing performance as more nodes are added. Finally, we underline that the programming effort for GMT is essentially identical to OpenMP and the Cray XMT, while the UPC version was significantly more challenging to implement and optimize.

5.6.3 Graph Random Walk

Graph Random Walk (GRW) randomly traverses a graph with the purpose of collecting vertex/edge information or of understanding graph properties. Many application areas, such as artificial intelligence, brain research and game theory, exploit the GRW kernel in many algorithms. In a GRW, each task starts from a source node, randomly chooses a neighbor to visit, and continues the walk until it has visited L connected nodes. Our code, given a connected graph of V vertices and E edges, assigns V/2 vertices as source nodes to V/2 parallel tasks. Each task performs a walk of length L. Implementing the GRW in GMT is fairly simple: (i) gmt_parfor() spawns V/2 tasks; (ii) each task performs a random walk of length L, accessing the graph with GMT primitives (a sketch is given below, after Table 5.9). We compare the GMT implementation to a state-of-the-art MPI implementation employed in fast matching algorithms [40]. This approach, rather than making a process retrieve

non-local data, delegates the completion of a walk to the process that locally owns the data. The algorithm employs P processes, divided into one master process and P − 1 slave processes. Given a graph with V vertices and E edges, the algorithm performs the following steps: (1) the master process initializes and distributes V/P vertices to all P processes (including itself); (2) each process starts V/(P ∗ 2) walks (V/2 walks in total for the whole system) of length L from its assigned vertices; (3) if a vertex v is not owned by the current process, it delegates the walk to the process owning v, which continues it; (4) when a process completes all its local walks, it communicates to the master the number of completed walks (i.e., walks that traversed L nodes) and waits for walks to continue from other processes (if there are new walks, it restarts from step 3); (5) when all the V/2 walks are completed, the master sends the quit command to the slave processes. Furthermore, the MPI algorithm exploits message aggregation to reduce fine-grained communication. Whenever a process requires the delegation of walks to other processes, it buffers all the requests for each process and sends them out at once only after completing its local walks (i.e., at the end of step 4). To empirically quantify the complexity of the two implementations, we measured the source lines of code for the GMT and MPI implementations. Table 5.9 shows the number of lines of code of the MPI and GMT implementations of the graph random walks.

implementation    Lines of code
GMT               164
MPI               503

Table 5.9: Lines of code - Graph Random Walks implementations
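A minimal sketch of the GMT version described above (steps (i) and (ii)) follows. gmt_get, gmt_parfor and the gmt_array handle are the same assumed names as in the earlier sketches, and assigning the first V/2 vertices as sources is a simplification.

#include <stdint.h>

typedef uint64_t gmt_array;
uint64_t gmt_get(gmt_array a, uint64_t i);                      /* assumed GMT primitives */
void gmt_parfor(uint64_t n, void (*body)(uint64_t, void *), void *args);

struct grw_ctx {
    gmt_array row_off, col_idx;       /* CSR graph in global memory */
    uint64_t walk_len;                /* L                           */
};

static uint64_t xorshift64(uint64_t *s)             /* tiny local PRNG for neighbor selection  */
{
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

static void walk_task(uint64_t task_id, void *args)
{
    struct grw_ctx *c = args;
    uint64_t v = task_id;                           /* source vertex assigned to this task     */
    uint64_t seed = task_id + 1;

    for (uint64_t step = 0; step < c->walk_len; step++) {
        uint64_t first  = gmt_get(c->row_off, v);
        uint64_t degree = gmt_get(c->row_off, v + 1) - first;
        if (degree == 0)
            break;                                  /* dead end: stop this walk                */
        v = gmt_get(c->col_idx, first + xorshift64(&seed) % degree);  /* random neighbor       */
    }
}

void random_walks(struct grw_ctx *c, uint64_t num_vertices)
{
    gmt_parfor(num_vertices / 2, walk_task, c);     /* (i) spawn V/2 tasks, each walking L edges */
}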

Figure 5.11 shows (in logarithmic scale) the performance of the GRW, measured in millions of traversed edges per second (MTEPS). The experiments use a randomly generated graph of one million vertices per node (weak scaling) with an average of 4000 edges per vertex. The figure shows that GMT is one or more orders of magnitude faster than the MPI implementation.

5.6.4 Concurrent Hash Map Access

Concurrent Hash Map Access (CHMA) is a synthetic benchmark where multiple concurrent activities access a hash map to check the presence of a hashed element (e.g., a string or a signature). If the element is found, it is modified according to a predetermined rule and stored back into the hash map. The behavior of this kernel is typical of streaming applications such as virus scanning, spam filters, natural language processing, and of information retrieval applications that need to store, filter and manipulate large amounts of streaming data.

Figure 5.11: Millions of steps per second for the random walk implementation on GMT and MPI (weak scaling).

In our experiments, we used a pool of 100 million strings with at most 20 characters each to populate a hash map of 10 million entries. After the initialization, W concurrent tasks perform the following operations for L steps: (1) start from a given input string; (2) check whether it is present in the hash map; (3) if it is present, perform a string reverse operation; (4) hash the new string and store it back in the hash map; (5) if it is not present, get a new input string. We compare both an MPI and a GMT implementation. In the MPI implementation, each MPI rank is responsible for a portion of the hash map. Only the process that owns the related portion of the hash map checks and inserts the strings. If the current process does not own the hashed string, it sends the string to its owner. Small MPI messages are very frequent, because a process cannot proceed with a new string until it has finished manipulating the previous one. It is possible to implement partial caching with remote bulk updates, but it requires employing expensive checks and invalidation mechanisms. On the other hand, the GMT implementation is straightforward: the gmt_parfor() construct spawns W tasks, and each task independently performs get/put and atomic compare-and-swap operations on the hash map for L steps (see the sketch after Table 5.10). As for the other two kernels, the MPI solution for CHMA was significantly more complex and difficult to implement than the GMT code. Table 5.10 shows the number of lines of code of the MPI and GMT implementations of the concurrent hash map access.


implementation    Lines of code
GMT               177
MPI               698

Table 5.10: Lines of code - Concurrent Hash Map Access implementations
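A sketch of the GMT implementation of CHMA follows. To keep it short it stores 64-bit string signatures instead of full strings; the helper next_input_string(), which stands in for drawing from the 100-million-string pool, and the GMT primitive names are assumptions, and the real 177-line implementation may differ.

#include <stdint.h>
#include <string.h>

typedef uint64_t gmt_array;
uint64_t gmt_get(gmt_array a, uint64_t i);                        /* assumed GMT primitives        */
void     gmt_put_word(gmt_array a, uint64_t i, uint64_t v);
uint64_t gmt_atomic_cas(gmt_array a, uint64_t i, uint64_t oldv, uint64_t newv);
void     gmt_parfor(uint64_t n, void (*body)(uint64_t, void *), void *args);
void     next_input_string(uint64_t seed, char out[21]);          /* assumed: draws from the pool  */

#define TABLE_SIZE (10UL * 1000 * 1000)            /* 10 million entries, as in the experiments    */

struct chma_ctx {
    gmt_array table;                               /* global hash map of 64-bit signatures         */
    uint64_t steps;                                /* L                                             */
};

static uint64_t hash_string(const char *s)         /* FNV-1a style hash                            */
{
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
    return h;
}

static void reverse_string(char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n / 2; i++) {
        char t = s[i]; s[i] = s[n - 1 - i]; s[n - 1 - i] = t;
    }
}

static void chma_task(uint64_t task_id, void *args)
{
    struct chma_ctx *c = args;
    char str[21];
    next_input_string(task_id, str);                              /* (1) start from an input string */

    for (uint64_t step = 0; step < c->steps; step++) {
        uint64_t sig  = hash_string(str);
        uint64_t slot = sig % TABLE_SIZE;
        if (gmt_get(c->table, slot) == sig) {                     /* (2) present in the hash map    */
            reverse_string(str);                                  /* (3) predetermined modification */
            uint64_t newsig = hash_string(str);
            gmt_atomic_cas(c->table, slot, sig, 0);               /* atomically clear the old entry */
            gmt_put_word(c->table, newsig % TABLE_SIZE, newsig);  /* (4) store the new string back  */
        } else {
            next_input_string(task_id + step + 1, str);           /* (5) not present: new string    */
        }
    }
}

void run_chma(struct chma_ctx *c, uint64_t num_tasks)
{
    gmt_parfor(num_tasks, chma_task, c);           /* W concurrent tasks */
}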

Figure 5.12: Number of strings hashed and inserted per second (Millions of accesses/s) for the GMT implementation of the Concurrent Hash Map Access benchmark. In the legend, W refers to the number of tasks and L to the number of accesses performed by each task.

Figure 5.13: Number of strings hashed and inserted per second (Millions of accesses/s) for the MPI implementation of the Concurrent Hash Map Access benchmark. In the legend, W refers to the number of processes and L to the number of accesses performed by each process.

Figures 5.12 and 5.13 respectively show the throughput, in millions of strings hashed and inserted per second (Millions of accesses/s), of the GMT and MPI implementations, while increasing the number of cluster nodes and varying the number of tasks (or processes) that concurrently access the hash map (W) and the number of steps (L) performed by each task (or process). The performance of the GMT and MPI implementations differs by two or more orders of magnitude, because of the fine-grained communication involved in the kernel.

5.7 Conclusions

In this chapter we presented GMT, a Global Memory and Threading library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple loop parallelism. It provides a simple interface for designing applications with large, irregular data structures, without requiring data partitioning. It allows expressing and extracting the large amounts of fine-grained, dynamic parallelism present in irregular applications through simple parallel loop constructs. GMT's architecture employs specialized threads to realize its functionality: workers, to execute the application; helpers, to support data communication; and a communication server, to perform data transfers. GMT is built around the concepts of lightweight user-level multithreading and data aggregation to reduce

the impact of the fine-grained, unpredictable data accesses typical of irregular applications. GMT tolerates network communication latencies by switching among thousands of tasks on each available worker thread. GMT implements multi-level aggregation to maximize network bandwidth utilization with small messages. This is in contrast with current PGAS infrastructures, which exploit a SPMD control model and are optimized for regular data access and block transfers. GMT is easily portable across different systems: the only prerequisites are MPI support and, currently, x86 processors for the task switching routines. GMT aims at providing a solution to scale irregular applications in performance and size by adding more nodes to a cluster. We characterized the communication performance of GMT, and compared it to UPC and MPI hand-optimized code, as well as to custom machines designed for irregular applications, on a set of typical large-scale, irregular application kernels. We demonstrated speedups of orders of magnitude compared to other solutions for commodity clusters, and performance comparable to custom machines.

Part IV

User-level Scalability Exploiting Hardware Features


Chapter 6

Exploiting Hardware Thread Priorities on Multi-threaded Architectures

The previous part of the thesis described the design and implementation of a specialized runtime system for large-scale high performance applications. This part of the thesis focuses on user-level hardware features that can be exploited by a runtime system to improve application performance. This chapter describes the performance benefits of exploiting the SMT hardware-thread priorities of IBM POWER5 and POWER6 processors. This work is a study to evaluate hardware optimizations that could be integrated into an adaptive runtime system for a specialized class of applications.

6.1 Summary

The limitations in exploiting instruction-level parallelism (ILP) have motivated thread-level parallelism (TLP) as a common strategy to improve processor performance. There are several TLP paradigms which offer different benefits as they exploit TLP in different ways. For example, Simultaneous Multi-Threading (SMT) reduces fragmentation in on-chip resources. In addition to SMT, Chip-Multiprocessing (CMP) is also effective in exploiting TLP with a limited transistor and power budget. This motivates processor vendors to combine both TLP paradigms in their processors. For instance, the Intel i7 as well as IBM POWER5 and POWER6 combine SMT and CMP. Because SMT processors share most of the core resources among threads, some of them implement mechanisms to better partition the shared resources. For instance, the


2-way SMT processors POWER5 and POWER6 improve the usage of resources across threads with hardware mechanisms [97][105] that suspend a thread from consuming more resources when it stalls on a long-latency operation. One of the interesting new features that enables better resource balancing is that POWER5 and POWER6 allow software to control the instruction decode rate of each thread in a core through eight priority levels, from 0 to 7. The higher the priority difference between the two threads in each core, the higher the difference in decode cycles and, hence, in the hardware resources received by the two threads.1 The Operating System (OS) can provide a user interface to change thread priorities such that software can control the speed at which each hardware thread runs with respect to the other hardware thread in a core. The default priority configuration (i.e., both hardware threads having priority 4) is designed to guarantee fair hardware resource allocation between hardware threads. From a software point of view, the main motivation to override the default configuration is to address instances where non-uniform hardware resource allocation is desirable. Several examples can be enumerated, such as virtualization in SMT, the OS idle thread, a thread waiting on a spin-lock, latency-sensitive threads, software-determined non-uniform balance and power management [57][97][149]. In some cases, software-controlled thread priorities can also improve instruction throughput or the execution time of parallel applications [33][34], by optimizing hardware resource allocation. Although software-controlled thread priorities have considerable potential, the lack of quantitative studies limits their use in real-world applications. In this work, we provide a quantitative study of the POWER5 and POWER6 prioritization mechanism. We show that the effect of thread prioritization depends on the characteristics of the two threads running simultaneously in a core. We also show that thread priorities have different effects on applications in POWER5 and POWER6. We analyze the major processor characteristics that lead to this different behavior. In particular, although both processors are dual-core and each core is two-way SMT, their internal architectures are different. While POWER5 has out-of-order cores with many shared hardware resources, POWER6 follows a high-frequency design optimized for performance, leading to a mostly in-order design in which fewer resources are shared between threads. Finally, we show the benefits of software-controlled thread priorities in real-world applications including parallel applications and multi-programmed environments.

1Note that software-controlled hardware priorities are independent of the operating system's concept of process or task prioritization. In fact, task priorities are used to prioritize the scheduling of running tasks among CPUs and, therefore, are a pure software concept.

We define SMT thread malleability (or simply malleability) as the ratio between the performance of a thread with a given priority configuration and its performance with the default priority configuration. To characterize the POWER5 and POWER6 thread prioritization mechanisms we developed a set of micro-benchmarks that stress specific hardware resources such as the data cache, issue queues, and memory bus. Moreover, we measure the malleability of real workloads, represented by some of the SPEC CPU2006 benchmarks [86]. Also, we develop a Linux kernel patch that provides an interface for the user to set all the priorities available in kernel mode. Without the kernel patch, only three of the eight priorities are available to the user.

6.2 Experimental environment

Tables 6.1 and 6.2 show the experimental environment used in this chapter. The tables include the hardware systems, the communication libraries, and the benchmarks used for the experiments.

CPU                    IBM POWER5 processor at 1.65 GHz (1 socket - 2 cores per socket - 2 SMT threads per core)
Memory                 8 Gigabyte of DDR RAM
Network                Ethernet (not used)
System size            single node
Operating System       Linux
Communication Library  MPI
Benchmarks             SPEC CPU2006, NAS Parallel Benchmarks

Table 6.1: Experimental environment - IBM Open Power 710 system

CPU                    IBM POWER6 processor at 4.0 GHz (2 sockets - 2 cores per socket - 2 SMT threads per core - no L3 cache)
Memory                 16 Gigabyte of DDR2 RAM
Network                Ethernet (not used)
System size            single node
Operating System       Linux 2.6.23
Communication Library  MPI
Benchmarks             SPEC CPU2006, NAS Parallel Benchmarks

Table 6.2: Experimental environment - IBM JS22 system


6.3 Related Work

Some previous studies focus on ensuring QoS in SMT architectures. Cazorla et al. introduce a mechanism to force predictable performance in SMT architectures [42]. They manage to run time-critical jobs at a given percentage of their maximum IPC. To attain this goal, they need to control all shared resources of the SMT architecture. Regarding CMP architectures, Rafique et al. propose to manage shared caches with a hardware cache quota enforcement mechanism and an interface between the architecture and the OS to let the latter decide quotas [139]. Nesbit et al. introduce Virtual Private Caches (VPC) [127], which consist of an arbiter to control cache bandwidth and a capacity manager to control cache storage. They show how the arbiter allows meeting QoS performance objectives or fairness. A similar framework is presented by Iyer et al. [92], where resource management policies are guided by thread priorities. Individual applications can specify their own QoS target (e.g., IPC, miss rate, cache space) and the hardware dynamically adjusts cache partition sizes to meet their QoS targets. An extension of this work with an admission mechanism to accept jobs is presented in [82]. Also, previous works show that SMT performance heavily depends on the nature of the concurrently running applications [57][151]. Tuck et al. analyze the performance of a real SMT processor [157], concluding that SMT architectures provide an average speedup over single-thread architectures of about 20% and that, even if the processor is designed to isolate threads, performance is still affected by resource conflicts. Other works propose the use of hardware thread priorities to control thread execution in SMT processors. Many of these proposals implement fetch policies to maximize throughput and fairness by reducing the priority of, stalling, or flushing threads that experience long latencies [43, 158]. Boneti et al. analyze the effect of hardware priorities on POWER5 [33], and use hardware priorities to balance resources in SMT processors [35] and to implement dynamic scheduling for HPC [34]. In this context, the concept of fair CPU utilization accounting for CMP and SMT processors, introduced by Luque et al. [112][113] and by Eyerman and Eeckhout [65], can be used to improve the efficiency of thread prioritization mechanisms. Let us assume a workload composed of several tasks (T_a, T_b, ..., T_n) running on an n-core multicore or n-way SMT processor. The proposed mechanisms [65][112][113] provide an estimation of the execution time that each of these tasks would have if it ran in isolation (T_i_isolation). By measuring the difference between the execution time in CMP/SMT and the execution time in isolation (T_i_cmp/T_i_isolation

or T_i_smt/T_i_isolation), we can determine the slowdown each task is suffering in CMP/SMT. The slowdown (or the relative speed) could be used to guide the SMT prioritization mechanism (or any other prioritization mechanism for multicores, such as the one described by Moreto et al. [120]) to ensure Quality of Service, that is, to ensure that tasks do not suffer a performance degradation greater than a pre-established threshold. To our knowledge, this is part of the first extensive study that quantifies the effect of hardware-thread priorities on two SMT processors with substantially different microarchitectures, such as POWER5 and POWER6.

6.4 POWER5 and POWER6 Microarchitecture

This section provides a brief description of the POWER5 and POWER6 microarchitectures and of the features that are relevant to SMT and thread priorities. A detailed description of the processors can be found in the works of Sinharoy et al. [105] and Le et al. [149].

6.4.1 POWER5 and POWER6 Core Microarchitecture

Figure 6.1 shows a high-level diagram of the POWER5 and POWER6 processors. Both processors have two cores and each core supports 2-thread SMT. In both processors, each core has its own L1 data and instruction cache. In POWER5, the L2 cache is shared among cores, whereas in POWER6 each core has its own L2 cache. In both processors, the off-chip L3 cache is shared. The POWER6 microprocessor has an ultra-high-frequency core and represents a significant change from the POWER5 design. Register renaming and massive out-of-order execution as implemented in POWER5 are not employed in POWER6. However, POWER6 implements limited out-of-order execution for floating point instructions [105].

6.4.2 Simultaneous Multi-Threading

POWER5 has separate instruction buffers for each thread. Based on thread prioritization, up to five instructions are selected from one of the instruction buffers and a group is formed. Instructions in a group are all from the same thread. The POWER6 core implements an independent dispatch pipe with a dedicated instruction buffer and decode logic for each thread. At the dispatch stage, each group of up to five instructions per thread is formed independently. Later, these groups are merged into a dispatch group of up to seven instructions to be sent to the execution units. Several other features


Figure 6.1: POWER5 and POWER6 architecture

have been implemented in POWER6 to improve SMT performance. For instance, the L1 I-cache and D-cache size and associativity have been increased with respect to the POWER5 design. The POWER6 core has dedicated completion tables (GCT) per thread to allow more outstanding instructions [105].

Both processors deploy two levels of resource control among threads: through dynamic resource balancing in hardware and through thread prioritization in software. The POWER5 and POWER6 dynamic hardware resource-balancing mechanisms monitor processor resources to determine whether one thread is potentially blocking the execution of the other thread. Under that condition, the progress of the offending thread is throttled back, allowing the sibling thread to progress (automatic throttling mechanism). For example, POWER5 considers that there is an unbalanced use of resources when a thread reaches a threshold of L2 cache misses or TLB misses, or when a thread uses too many GCT (reorder buffer) entries [149].

6.4.3 Software-controlled Hardware Thread Priorities

In POWER5 and POWER6, software-controlled priorities range from 0 to 7, where 0 means the thread is switched off and 7 means the thread is running in Single Thread (ST) mode (i.e., the other thread is off). Using priority 1 for both threads has the effect of executing the threads in low-power mode. In addition, the execution of one thread with priority 1 while the other has a priority > 1 causes the former to use only the hardware resources left over by the latter. The enforcement of software-controlled priorities is carried out in the decode stage. In general, a higher priority translates into a higher number of decode cycles. In POWER5, assuming a primary thread and a secondary thread1 with priorities P and Q (where P > 1 and Q > 1), decode cycles are allocated as follows:

1. compute R: R = 2^(|P−Q|+1)

2. decode cycle rates:

r_high = (R − 1)/R

r_low = 1/R

where r_high is the decode cycle rate of the thread with higher priority and r_low is the decode cycle rate of the thread with lower priority. The thread with higher priority receives R − 1 out of every R decode cycles, while the thread with lower priority receives 1 out of every R decode cycles. For instance, assuming that the primary thread has priority 6 and the secondary thread has priority 2, R would be 32, so the core decodes 31 times from the primary thread (r_high = 31/32) and once from the secondary thread (r_low = 1/32). Hence, the performance of the process running as primary thread increases to the detriment of the one running as secondary thread. In the special case when threads have the same priority, R would be 2, and each thread alternately receives one slot (r_high = r_low = 1/2).

1 Primary thread and secondary thread are just naming conventions because the two hardware threads are perfectly symmetrical.

The previous formula applies to POWER5, while for POWER6 we assume that the decode cycle rate is a monotonic function of the priority difference:

r_high = f_high(|P − Q|)

r_low = f_low(|P − Q|)
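A small worked example (a sketch, not library code) that evaluates the POWER5 allocation above for a given priority pair:

#include <stdio.h>

/* POWER5 decode-cycle allocation for priorities P, Q > 1: R = 2^(|P-Q|+1);
 * the higher-priority thread decodes in R-1 out of R cycles, the other in 1 of R. */
static void decode_shares(int p, int q)
{
    int diff   = p > q ? p - q : q - p;
    unsigned r = 1u << (diff + 1);
    printf("P=%d Q=%d: r_high = %u/%u, r_low = 1/%u\n", p, q, r - 1, r, r);
}

int main(void)
{
    decode_shares(6, 2);   /* prints r_high = 31/32, r_low = 1/32, as in the text */
    decode_shares(4, 4);   /* equal priorities: r_high = r_low = 1/2              */
    return 0;
}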

Table 6.3 shows the priority values and levels, the required privilege levels, and the instructions used to set the priorities. The supervisor (OS) can set six of the eight priorities, ranging from 1 to 6, while user software can only set priorities 2, 3, and 4. The hypervisor can use the whole range of priorities.

Table 6.3: Thread priorities in POWER5 and POWER6

priority  priority level  privilege level  instruction
0         Thread off      Hypervisor       -
1         Very low        Supervisor       or 31,31,31
2         Low             User/Supervisor  or 1,1,1
3         Medium-Low      User/Supervisor  or 6,6,6
4         Medium          User/Supervisor  or 2,2,2
5         Medium-high     Supervisor       or 5,5,5
6         High            Supervisor       or 3,3,3
7         Very high       Hypervisor       or 7,7,7

Priorities can be set by issuing a pseudo or instruction in the form or X,X,X, where X is a specific register number [73][91]. This operation only changes the thread priority and performs no other operation. In case it is not supported (i.e., when running on older POWER processors), or in case of insufficient privileges, the instruction is simply treated as a nop.
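For illustration, user code can request the user-level priorities directly with the or forms of Table 6.3, for example through GCC inline assembly on POWER (a sketch, not part of any library; with insufficient privileges or on other processors the instruction degenerates to a nop as described above):

/* Priority hints from Table 6.3; the priority-1 form needs supervisor privilege
 * (or the kernel patch of Section 6.5.2) to take effect. */
static inline void hw_prio_medium(void)     { __asm__ volatile("or 2,2,2");    } /* priority 4 (default) */
static inline void hw_prio_medium_low(void) { __asm__ volatile("or 6,6,6");    } /* priority 3           */
static inline void hw_prio_low(void)        { __asm__ volatile("or 1,1,1");    } /* priority 2           */
static inline void hw_prio_very_low(void)   { __asm__ volatile("or 31,31,31"); } /* priority 1           */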

6.5 Experimental Setup

In order to explore the capabilities of the software-controlled priority mechanism in the POWER5 and POWER6 processors, we perform a detailed set of experiments. Our approach consists of measuring the performance of micro-benchmarks running in SMT mode as the priority of each thread is increased or reduced. The performance of a process in an SMT processor is conditioned by the programs running simultaneously on the other hardware thread, and by their phases. Evaluating all the possible programs and all their phase combinations is infeasible. Moreover, the evaluation of a real system, with several layers of running software, OS interference and all the asynchronous services, becomes even more difficult.

For this reason, we use a set of micro-benchmarks that stress particular processor characteristics. While this scenario is not typical of real applications, it is a systematic way to understand the hardware priority mechanism. In fact, this methodology provides a uniform characterization based on specific program characteristics that can be mapped onto real applications. To verify the effects of hardware priorities on real applications, we measure the malleability of a subset of the SPEC CPU2006 benchmarks with different priority configurations. To ensure that all the benchmarks are fairly represented in the final results, we use the FAME (FAirly MEasuring Multithreaded Architectures) methodology [162][163], which requires running the same benchmark pair in SMT mode multiple times until both benchmarks are equally represented in the total execution time. Running all pair-wise combinations of SPEC CPU2006 benchmarks and all priority combinations with the FAME methodology would take too much time to complete.1 In order to reduce the experimentation time, we choose a subset of SPEC CPU2006 as follows: 1) we choose benchmarks such that the spectrum of performance, memory and execution unit characteristics is fairly represented in the subset; 2) following the recommendation of Snavely et al. [151] on symbiotic OS scheduling, we pair high-IPC (CPU intensive) benchmarks with low-IPC (memory-bound) benchmarks, in order to provide efficient utilization of the SMT core. High-IPC benchmarks are bzip, calculix, cactusADM and h264ref. Low-IPC benchmarks are mcf, omnetpp and milc. The resulting combination represents mixes of high-IPC and low-IPC benchmarks as well as integer and floating point benchmarks.

6.5.1 Experimental environment

The results presented are obtained by compiling the benchmarks with gcc version 4.1.2 20070115 (SUSE Linux), Linux kernel version 2.6.23, libpfm-3.8, and mpich2-1.0.8. We executed the experiments on an Open Power 710 (Op710) and on a JS22 IBM server, with the same executable. It is worth noting that the Op710 POWER5 processor is equipped with a third-level (L3) cache while the JS22 POWER6 processor we use does not have the third-level cache.

1Running all pair-wise combinations of the 26 SPEC CPU2006 benchmarks with all combinations of the six priorities amounts to 19,500 possibilities. With an estimated average running time of two hours per possibility using the FAME methodology, a non-subset experiment would take a humbling four and a half years to complete on a single machine.


6.5.2 The Linux kernel modification

Some of the priority levels are not available in user mode (Section 6.4.3). In fact, only three levels out of eight can be used by user-mode applications; the others are only available to the OS or the hypervisor. Modern Linux kernels running on POWER5 and POWER6 processors exploit software-controlled priorities in a few cases, such as reducing the priority of a process when it is not performing useful computation. Basically, the kernel uses thread priorities in three cases:

• The processor is spinning for a lock in kernel mode. In this case the priority of the spinning process is reduced.

• A CPU is waiting for operations to complete. For example, when the kernel requests a specific CPU to perform an operation by means of smp_call_function(), it cannot proceed until the operation completes. Under this condition, the priority of the thread is reduced.

• The kernel is running the idle process because there is no other process ready to run. In this case the kernel reduces the priority of the idle thread and eventually puts the core in Single Thread (ST) mode.

In all these cases the kernel reduces the priority of a hardware-thread and restores it to medium (4) as soon as there is some work to perform. Furthermore, since the kernel does not keep track of the actual priority, to ensure responsiveness it also resets the thread priority to medium every time it enters a kernel service routine (e.g., interrupt, exception handler or system call). This is a conservative choice induced by the fact that it is not clear how and when to prioritize a hardware-thread and what the effect of that prioritization is. In order to explore the entire priority range, we develop a kernel patch that provides an interface to the user to set all the possible priorities available in kernel mode:

• We make priorities 0 to 7 available to the user. As mentioned in Section 6.4.3, only three priorities (4, 3, 2) are directly available to the user. Without this kernel patch, any attempt to use other priorities results in a nop operation. Priorities 0 and 7 (thread off and single-thread mode, respectively) are made available to the user through a hypervisor call.

• We avoid the use of software-controlled priorities inside the kernel; otherwise, experiments would be affected by unpredictable priority changes.

132 • Finally, we provide an interface through the /proc pseudo file system which allows user applications to change their priority.
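A hypothetical usage of this interface is sketched below. The thesis does not give the exact /proc entry, so the path used here is an assumption; the sketch only illustrates the idea of a user process writing the desired priority from user space.

#include <stdio.h>

/* Hypothetical: "/proc/self/hw_prio" is an assumed name for the entry exposed
 * by the kernel patch, not the actual one. */
static int set_hw_priority(int prio)
{
    FILE *f = fopen("/proc/self/hw_prio", "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%d\n", prio) > 0;         /* request a priority between 0 and 7 */
    fclose(f);
    return ok ? 0 : -1;
}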

6.5.3 Running the experiments

To execute the experiments we use the FAME (FAirly MEasuring Multithreaded Architectures) methodology [163][162]. In [163] the authors state that the average accumulated IPC of a program is representative if it is similar to the IPC of that program when it reaches a steady state. The problem is that a benchmark has to run for a long time to reach this steady state. FAME determines how many times each benchmark in a multi-threaded workload has to be executed so that the difference between the obtained average IPC and the steady-state IPC is below a particular threshold. This threshold is called MAIV (Maximum Allowable IPC Variation). The execution of the entire workload stops when all benchmarks have executed as many times as needed to accomplish a given MAIV value. For the experimental setup and micro-benchmarks used in this work, in order to accomplish a MAIV of 1%, each micro-benchmark must be repeated at least 10 times. In our experiments we run two micro-benchmarks, hence each experiment ends when both threads have re-executed at least 10 times. Note that at this point the fastest thread might already have executed more than 10 times.

6.5.4 Micro-benchmarks description

In order to build a basic knowledge of the effect of software-controlled priorities, we used METbench (Minimum Execution Time Benchmark [34]), a micro-benchmark suite designed to stress specific processor characteristics.

Micro benchmark                               Class
cpu int, cpu int add, cpu int mul, lng chain  Integer
cpu fp asm                                    Floating Point
ldint l1, ldint l2, ldint mem                 Memory

Table 6.4: Micro-benchmarks in METbench

We classify the micro-benchmarks into three classes: Integer, Floating Point and Memory, as shown in Table 6.4. In the Integer class there are cpu int, which contains mixed integer instructions (one multiplication every two additions); cpu int add, which contains integer additions; cpu int mul, which contains integer multiplications; and lng chain, which is composed of mixed integer instructions with high inter-instruction


dependency. The latter is designed to limit ILP exploitation on out-of-order processors (i.e., POWER5). In the Floating Point class, cpu fp asm is a benchmark that has a high percentage of mixed-type floating point instructions. In the Memory class, there are three micro-benchmarks: ldint l1, ldint l2 and ldint mem. Micro-benchmarks ldint l1 and ldint l2 are designed to always hit in the L1 and L2 cache, respectively, while ldint mem is designed to always miss in cache. All the micro-benchmarks share the same structure: they implement a for loop with enough iterations to run for at least one second. Hence, the micro-benchmarks differ mainly in the loop body, which contains a different instruction mix according to the desired behavior. We validated the behavior of each micro-benchmark by analyzing performance counters.

6.5.4.1 Integer micro-benchmarks

The four integer micro-benchmarks share a common loop structure. Listing 6.1 shows the main loop code for cpu int. The code for cpu int add is the same except that in the loop there are only additions (for instance instead of c=c*c*it we used c=c+c+it). Analogously the loop for cpu int mul contains only multiplications. Note the macro LOOP UNROL 100, used to repeat the same code 100 times, reducing control-flow instructions in the loop.

for (it = 0; it < micro_it; it = it + 1) {
    LOOP_UNROL_100( c = c*c*it; )
}
Listing 6.1: cpu int main loop

6.5.4.2 Floating point micro-benchmarks

In the Floating Point class, we implement the cpu fp asm micro-benchmark in POWER assembly in order to have better control over its behavior, and hence maximize the use of the floating point unit.

6.5.4.3 Memory micro-benchmarks

The three memory micro-benchmarks share a common loop structure. In the loop, loads are executed using a pointer-chasing technique. In this technique an array is initialized with pointers, so that each element contains the address of the next element to access. So that the chain can be traversed repeatedly, the last element of the array contains the address of the element at the beginning.
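A minimal sketch of this initialization (not the METbench code itself) is shown below; the array size n_elems is what distinguishes ldint l1, ldint l2 and ldint mem.

#include <stdlib.h>

/* Build the pointer-chasing ring described above: element i holds the address
 * of element i+1 and the last element points back to the first, so the loads
 * in Listing 6.2 form an endless dependent chain. */
void **init_chase(size_t n_elems)
{
    void **a = malloc(n_elems * sizeof(void *));
    if (!a)
        return NULL;
    for (size_t i = 0; i + 1 < n_elems; i++)
        a[i] = &a[i + 1];
    a[n_elems - 1] = &a[0];
    return a;                                      /* traverse with: p = *p; */
}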

for (i = micro_it; i > 0; i = i - 1) {
    LOOP_UNROL_100( p = *p; )
}
Listing 6.2: ldint l1 main loop

Listing 6.2 shows the main loop code for ldint l1. The code for ldint l2 and ldint mem is exactly the same, except that the array size varies in order to obtain the desired use of the cache hierarchy. Specifically, ldint l1 uses approximately 25% of the first-level cache and makes all loads hit in L1. The micro-benchmark ldint l2 fills the first-level cache, uses approximately 25% of the second-level cache and makes all loads hit in L2. Finally, ldint mem fills all the cache levels and makes all loads access main memory.

6.6 Analysis of results

In this section we analyze the performance variation obtained with the software-controlled priority mechanism. First we analyze the performance of micro-benchmarks running in SMT mode with default priorities (priorities 4/4), then we analyze the malleability of threads running with higher and lower priorities. Subsequently, we show the effect of using the maximum priority difference in SMT (priorities 6 and 1). Finally, we show the malleability of benchmarks from the SPEC CPU2006 suite.

6.6.1 Default Priorities

When running with default priorities (priorities 4/4), core resources are equally shared between threads. The default priority configuration is used to optimize throughput when knowledge about workload characteristics is not available. Threads running in SMT mode have lower performance compared to running in ST mode. Table 6.5 shows the average instructions per cycle (IPC) decrement of each micro-benchmark running

in SMT mode with default priorities against all other micro-benchmarks, with respect to running in ST mode.

Table 6.5: Average IPC decrement of each micro-benchmark running in SMT mode against all the others, compared to ST mode.

micro-bench.  POWER5  POWER6
cpu int       39.4%   28.0%
cpu int add   69.1%   23.8%
cpu int mul   18.7%   37.6%
lng chain     23.8%   20.2%
cpu fp asm    27.1%   0.8%
ldint l1      11.9%   9.3%
ldint l2      14.2%   8.1%
ldint mem     2.1%    0.01%

For example, the first row shows the average IPC decrement of cpu int running in pairs {[cpu int, cpu int], [cpu int, cpu fp asm], [cpu int, cpu int add], ... [cpu int, ldint mem]} with respect to the IPC when running in isolation. We observe that:

• CPU intensive micro-benchmarks (cpu fp asm, cpu int, cpu int add and lng chain) show more IPC decrement when they run in POWER5 than when they run in POWER6.

• Micro-benchmarks with instruction dependencies or memory-bound behavior (ldint l1, ldint l2, ldint mem) show a less significant IPC decrement, which is quite similar in POWER5 and POWER6.

Based on the results, our main conclusions are:

• CPU intensive micro-benchmarks that exploit out-of-order execution are more affected by SMT execution in POWER5 than in POWER6.

• Micro-benchmarks with instruction dependencies and memory-bound micro-benchmarks cannot fully exploit the execution resources in ST mode due to their intrinsic dependencies. Therefore, two threads of these types can overlap and efficiently use the execution units in SMT mode.

6.6.2 Malleability

Let IPC_ST be the IPC that a given thread has when it runs in ST mode (single-thread mode [149]). In ST mode all core resources are allocated to the only running thread.

Let IPC_SMT^(P/Q) be the IPC of the same thread when it runs in SMT mode with another thread, the first thread having priority P and the second thread having priority Q. For instance, IPC_SMT^(4/4) is the IPC of that thread when it runs with another thread, both having priority 4 (the default priority configuration). We define SMT thread malleability (or simply malleability) as the ratio between the IPC in SMT mode with a given priority configuration and the IPC in SMT mode with the default priority configuration:

malleability = IPC_SMT^(P/Q) / IPC_SMT^(4/4)

The highest IPC achievable by a thread is still IPC_ST, that is, for any priority configuration P/Q we have IPC_SMT^(P/Q) ≤ IPC_ST. Hence, the malleability of a thread is upper-bounded by the IPC in ST mode normalized to the default priority configuration:

IPC_SMT^(P/Q) / IPC_SMT^(4/4) ≤ IPC_ST / IPC_SMT^(4/4)
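As a purely illustrative example with hypothetical IPC values (not measured data): a thread with IPC_ST = 1.0 and IPC_SMT^(4/4) = 0.5 has an upper bound of 2.0 on its malleability; if some priority configuration P/Q raises its SMT IPC to 0.8, the measured malleability is 1.6, consistent with the bound:

\[
  \frac{IPC_{SMT}^{P/Q}}{IPC_{SMT}^{4/4}} = \frac{0.8}{0.5} = 1.6
  \;\le\;
  \frac{IPC_{ST}}{IPC_{SMT}^{4/4}} = \frac{1.0}{0.5} = 2.0
\]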

We consider that the maximum malleability is obtained using priorities 6/2, as we exclude priority 1 because it is designed for low-power execution. Figure 6.2 shows the correlation between the maximum malleability and the IPC in ST mode normalized to the IPC in SMT mode. Namely, the x-axis reports IPC_SMT^(6/2) / IPC_SMT^(4/4), the y-axis reports IPC_ST / IPC_SMT^(4/4), and each dot in the graph represents an actual pair of micro-benchmarks. As Figure 6.2 shows, there is a clear positive correlation between these two variables (coefficient estimates: b = 1.19 and R^2 = 0.97). In fact, as explained before, the maximum performance that a task can obtain with priorities is upper-bounded by the ST performance.

6.6.3 Higher Priority

In this section, we analyze the malleability of a thread when it runs in SMT mode with a higher priority than the other thread. We use priorities in the range 6-2 because priority 1 is used for low-power mode, and it will be examined in detail in Section 6.6.5. The graphs in Figure 6.3 show a higher malleability on POWER5 compared to POWER6 when running CPU intensive micro-benchmarks. In POWER5 the thread speedup with higher priorities is up to 6 times, while in POWER6 it is less than 2 times. We can derive the following conclusions:


Figure 6.2: Correlation between the IPC in ST mode normalized to the IPC in SMT mode on the y-axis (IPC_ST / IPC_SMT^(4/4)), and the malleability with priorities 6/2 on the x-axis (IPC_SMT^(6/2) / IPC_SMT^(4/4)).

Figure 6.3: Malleability of the primary thread when its priority is higher than the priority of the secondary thread, for the micro-benchmarks (a) cpu int, (b) cpu int add, (c) cpu int mul, (d) lng chain, (e) cpu fp asm, (f) ldint l1, (g) ldint l2, (h) ldint mem. The y-axis reports IPC_SMT^(P/Q) / IPC_SMT^(4/4) and the x-axis is the hardware priority pair for the primary and secondary threads (primary-priority/secondary-priority). Please note the different scale for cpu int add and ldint l2.


Figure 6.4: Malleability of the primary thread when its priority is lower than the priority of the secondary thread, for the micro-benchmarks (a) cpu int, (b) cpu int add, (c) cpu int mul, (d) lng chain, (e) cpu fp asm, (f) ldint l1, (g) ldint l2, (h) ldint mem. The y-axis reports IPC_SMT^(P/Q) / IPC_SMT^(4/4) and the x-axis is the hardware priority pair for the primary and secondary threads (primary-priority/secondary-priority).

• The main reason for the lower impact of priorities on POWER6 is that the performance in SMT with priorities 4/4 is already close to the upper bound.

• In POWER5 we observe a high speedup in cpu int (Figure 6.3a) and in cpu int add (Figure 6.3b) when their priorities are increased and they run with cpu int mul. The reason is that cpu int mul executes integer multiplications that take several cycles. Because the rate at which cpu int mul instructions complete is lower than the rate at which they are fetched into the processor, it clogs the issue queue. As a result, cpu int mul stalls the execution of CPU intensive micro-benchmarks like cpu int or cpu int add. When we increase the priority of CPU intensive micro-benchmarks, their decode rate is higher, and hence they are less affected by cpu int mul.

• In POWER5 we can observe a high speedup when CPU intensive micro-benchmarks run with ldint l2 and we increase the CPU intensive micro-benchmarks' priority (Figures 6.3a and 6.3b). The reason is that, with priorities 4/4, ldint l2 fills the load/store queue with high-latency loads, and prevents any other instruction from being dispatched. Consequently, using higher priorities for the CPU intensive micro-benchmarks when they run with ldint l2 results in a high speedup. However, the same behavior cannot be observed when CPU intensive micro-benchmarks run with ldint l1, because load operations in ldint l1 have a lower latency and hence do not clog the load/store queue. This behavior is also not observed when running with ldint mem, because of the automatic throttling mechanism [97] triggered by in-flight L2 misses. The same phenomenon happens when ldint l1 runs together with ldint l2 (Figure 6.3f). Finally, in POWER6 this phenomenon cannot be observed, because it is an in-order design. In fact, instructions in POWER6 can execute in the fixed point units even if the load/store queue is clogged. As a result, in POWER6 ldint l2 does not affect CPU intensive micro-benchmarks as it does in POWER5.

• For POWER6 we observe that the maximum speedup with priorities is obtained when executing two copies of micro-benchmarks that mainly use a single functional unit (cpu int add in Figure 6.3b and cpu int mul in Figure 6.3c).

For cpu fp asm, ldint l1, ldint l2, ldint mem in Figure 6.3 we observe that:

• In POWER6 for most of the micro-benchmarks the speedup is zero, because they reach the upper-bound performance (performance in ST mode) with priorities 4/4 (Figures 6.3e, 6.3g, and 6.3h).


• In POWER6 there is a speedup as we increase the priority of ldint l1 when it runs with cpu int mul (Figure 6.3f), because ldint l1 uses the fixed point unit to compute the effective address [105]. Since cpu int mul uses the fixed point unit with long latency operations, it competes with ldint l1 for this resource. As we increase the priority of ldint l1 it gets access to this resource more frequently, hence improving its performance. This is not observed in POWER5 (Figure 6.3f) because the effective address is computed through a dedicated adder inside the load/store unit.

• In POWER5, as we increase the priority of ldint l1 when it runs with ldint l2 we observe a high speedup (Figure 6.3f). This is because ldint l2 completely fills the L1 cache evicting the data of ldint l1. By increasing the priority of ldint l1 we increase its cache access frequency, hence reducing the effect of ldint l2. This speedup cannot be seen when ldint l1 runs with ldint mem. The main reason is that ldint mem has lower cache access frequency per cycle, as every load has to go to main memory. As a result, the cache lines belonging to ldint l1 are more frequently accessed and thus are not evicted by the least-recently-used (LRU) replacement policy.

• In POWER6 there is a small ldint l2 speedup because in most of the cases with priorities 4/4 we already reach the upper bound (ST mode).

• In POWER5 the speedup of ldint l2 running with ldint mem is due to the fact that ldint mem completely fills the L2 cache, thus increasing the number of L2 misses of ldint l2 and hence reducing its performance under the default priorities.

• In POWER5 the speedup of ldint l2 running with itself is lower than when running with ldint mem. Since ldint l2 uses only 25% of the L2 cache, two running instances of ldint l2 can fit in the L2 cache. On the other hand, because ldint mem completely fills the L2 cache, it considerably affects ldint l2's performance.

• In POWER5 and POWER6, ldint mem (Figure 6.3h) is almost insensitive to a higher priority. This confirms the observation that micro-benchmarks with very low IPC cannot be improved using priorities, since with priorities 4/4 the upper bound (IPC in single-thread mode) is already reached.

6.6.4 Lower Priority

In this section, we present the malleability of a thread running with a lower priority than the other thread. We consider the range of priorities 6-2, while priority 1, because of its special behavior (low-power mode), will be examined in Section 6.6.5. Figure 6.4 shows that lower priorities significantly affect the performance of all micro-benchmarks. Micro-benchmarks cpu int, cpu int add, cpu int mul, cpu fp asm, lng chain and ldint l1 in Figure 6.4 show that the thread slowdowns are of the same order of magnitude in POWER5 and POWER6. Micro-benchmarks ldint l2 and ldint mem in Figure 6.4 show that in POWER6 lower priorities have a smaller impact than in POWER5. Note also the higher impact observed in POWER5 with priorities 3/6 and 2/6 (priority difference ≥ 3) when running with a memory-bound micro-benchmark; this behavior is not observed in POWER6. Based on the results we can conclude the following:

• Low-IPC micro-benchmarks are less affected by changing thread priority. For instance ldint l2 is less affected than ldint l1 and ldint mem is less affected than ldint l2.

• The use of lower priorities with memory-bound benchmarks leads to a smaller impact in POWER6 with respect to POWER5, which confirms the lower thread resource sharing in the POWER6 microarchitecture compared to POWER5.

• In POWER5, a micro-benchmark running against ldint l2 or ldint mem with priorities 3/6 and 2/6 (priority difference ≥ 3) shows a significant slowdown, while this cannot be observed in POWER6.

6.6.5 Maximum priority difference

The maximum priority difference in SMT is obtained when one thread has priority 6 and the other priority 1. The use of priorities 6/1 has an interesting effect: the thread with priority 6 shows performance close to its single-thread mode. This result means that the priority mechanism can be used to provide an SMT configuration in which we can run a background thread with minimal effect on the foreground thread. The graphs in Figure 6.5 show the execution time (y-axis) of the primary thread when running with different secondary threads (x-axis) using priorities 6/1, for POWER5 and POWER6. Values are normalized to the primary thread's ST execution time.


Figure 6.5: Execution time of the primary thread running with priority 6 against a secondary thread with priority 1, normalized to the execution time in single-thread mode, for (a) POWER5 and (b) POWER6. The x-axis is the actual secondary thread micro-benchmark.

In POWER6 the performance impact on the primary thread is almost zero, except when ldint l2 or ldint mem runs with another memory-intensive micro-benchmark, mostly due to interactions at the cache and memory levels. This shows that a thread can run in the background without significantly affecting the primary thread. Table 6.6 shows the performance of the secondary thread when running with priority 1, as a percentage of its single-thread performance. On POWER5, ldint l2 and ldint mem achieve 19.09% and 86.57% of their single-thread performance, while on POWER6, ldint l2 and ldint mem achieve respectively 3.64% and 53.79% of their single-thread performance. For both machines, while CPU intensive micro-benchmarks report low performance with priority 1, ldint mem maintains significant performance even when running with priority 1.

6.6.6 Malleability of SPEC CPU2006

The two primary uses of software-controlled priorities are: providing imbalanced thread execution, as needed by the applications, and improving instruction throughput. In an imbalanced thread execution, software can control core resource allocation to improve a given target metric, for instance enabling faster execution of higher-priority jobs or implementing load balancing [34]. To achieve higher throughput, software can intentionally imbalance SMT resource sharing to improve the performance of the primary thread without significantly reducing the performance of the secondary thread.

Table 6.6: Performance of the background thread with respect to single-thread mode

micro-bench.  POWER5  POWER6
cpu int       6.35%   1.63%
cpu int add   5.92%   0.89%
cpu int mul   2.59%   0.56%
lng chain     9.71%   1.02%
cpu fp asm    7.09%   1.32%
ldint l1      9.58%   1.07%
ldint l2      19.09%  3.64%
ldint mem     86.57%  53.79%

Figure 6.6: Malleability of selected SPEC CPU2006 benchmarks using higher priorities for the primary thread, on (a) POWER5 and (b) POWER6. The y-axis reports IPC_SMT^(P/Q) / IPC_SMT^(4/4) and the x-axis is the actual pair of benchmarks running in SMT mode.

For instance, when a CPU intensive thread is running together with a memory-bound thread, throughput can be improved by providing more resources to the CPU intensive thread. In order to reduce hardware resource contention, high-IPC loads can be paired with low-IPC loads on the same core. As shown in the previous sections, the effect of hardware priorities on memory-bound micro-benchmarks is smaller than the effect on CPU intensive micro-benchmarks. In this experiment, we run pairs in which the primary thread is high-IPC and the secondary thread is low-IPC. Based on the benchmark profiles, we used bzip, cactusADM, calculix and h264ref as high-IPC benchmarks and mcf, milc and omnetpp as low-IPC benchmarks. In these experiments we focus on the effects of higher priorities on the primary thread, assuming that the performance requirements of the secondary thread are subordinate to those of the primary thread. Figure 6.6 shows the speedup of the primary thread as we increase its priority with respect to the secondary thread. Using priorities 6/2 (primary thread priority 6 and secondary thread priority 2), the primary thread in POWER5 obtains a speedup of up to 1.70 times the performance with default priorities, while in POWER6 the speedup is up to 1.18 times the performance with default priorities. Overall, hardware-thread priorities can be used when the threads in a core present different hardware resource usage, in particular when the primary thread is CPU intensive and the secondary thread is memory-bound. In this situation, we increase the primary thread's malleability without affecting the overall throughput.

6.7 Use cases

In this section we present two use cases of hardware thread priorities. Our objective is to show that, even if not for all kinds of workloads, this feature can be effectively used to improve load balancing (use case A) and to implement transparent threads (use case B). The applications we use are taken from two different domains: a benchmark from the NAS Parallel Benchmarks [18] and six benchmarks from the SPEC CPU2006 suite [86].

6.7.1 Use case A - Load Balancing

This use case shows how to use hardware thread priorities to reduce parallel applications' execution time. Block Tri-diagonal (also called BT) is a benchmark from the NAS Parallel Benchmarks. BT is designed to solve discretized versions of the Navier-Stokes equation in

three dimensions and uses a structured discretization mesh. BT Multi-Zone (BT-MZ) [94] is a version of the same benchmark that uses several meshes (also called zones), because a single mesh is often not enough to describe a realistic, complex domain. When BT-MZ runs on POWER5 and POWER6, its MPI (Message Passing Interface) processes are imbalanced: during each iteration, MPI processes have to wait for the last process to complete, thus spending a significant fraction of time in the waiting state without performing any useful work. To balance the application, tasks with high waiting time can be paired with tasks with low waiting time (the bottlenecks) and scheduled on the same SMT core. The hardware thread priorities of the tasks with low waiting time can then be increased, to reduce the overall waiting time. To balance BT-MZ, we run processes 1 and 4 on the first core and processes 2 and 3 on the second core. We found that the best combination of priorities is 4/5 for the first core and 4/6 for the second core. This configuration allows BT-MZ to be better balanced on both architectures.

Table 6.7: BT-MZ running on POWER5 with original and balanced configuration.

original configuration
process  core  priority  running  waiting  others
1        1     4         24.61%   74.31%   1.08%
2        1     4         31.54%   67.23%   1.22%
3        2     4         58.54%   40.81%   0.64%
4        2     4         99.71%    0.13%   0.16%
execution time: 66.80 sec

balanced configuration
process  core  priority  running  waiting  others
1        1     4         78.63%   20.91%   0.45%
2        2     4         65.88%   33.46%   0.65%
3        2     5         63.71%   35.72%   0.57%
4        1     6         99.78%    0.08%   0.15%
execution time: 59.97 sec

Tables 6.7 and 6.8 show the breakdown of MPI states when BT-MZ runs with the original configuration and with the balanced configuration, on POWER5 and POWER6 respectively. The column running refers to the percentage of time the process is effectively running on the core, waiting refers to the percentage of time spent waiting for a synchronization, and others refers to other MPI states with a negligible contribution to the total time.


Table 6.8: BT-MZ running on POWER6 with original and balanced configuration.

original configuration
process  core  priority  running  waiting  others
1        1     4         15.10%   83.84%   1.06%
2        1     4         25.25%   73.52%   1.23%
3        2     4         69.98%   29.44%   0.57%
4        2     4         99.56%    0.15%   0.29%
execution time: 40.05 sec

balanced configuration
process  core  priority  running  waiting  others
1        1     4         69.73%   29.51%   0.76%
2        2     4         63.24%   35.83%   0.94%
3        2     5         63.64%   35.69%   0.68%
4        1     6         99.57%    0.14%   0.29%
execution time: 34.32 sec

The percentage of time a process spends in the waiting state decreases when BT-MZ is executed with the balanced configuration. Consequently, the execution time is reduced by 11.4% on POWER5 and by 16% on POWER6.
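As an illustration of how such a rebalancing could be driven from the application side, the sketch below raises the SMT priority of the bottleneck MPI rank. It is only a sketch: the `or Rx,Rx,Rx` nop encodings are assumed from the Power ISA priority hints (low, medium, high), the rank-to-priority policy is hypothetical and only mirrors the pairing discussed above, and raising priorities above the default may require operating-system support.

```c
/* Minimal sketch (assumed Power ISA priority-hint nops; the policy is
 * hypothetical and only mirrors the BT-MZ pairing discussed above). */
#include <mpi.h>

static inline void smt_priority_low(void)    { __asm__ __volatile__("or 1,1,1"); }
static inline void smt_priority_medium(void) { __asm__ __volatile__("or 2,2,2"); }
static inline void smt_priority_high(void)   { __asm__ __volatile__("or 3,3,3"); }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical policy: the rank identified as the bottleneck of its
     * SMT core pair runs at a higher hardware priority, the others at
     * the default (medium) priority. */
    if (rank == 3)
        smt_priority_high();
    else
        smt_priority_medium();

    /* ... BT-MZ-style iterations with per-iteration synchronization ... */

    MPI_Finalize();
    return 0;
}
```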

6.7.2 Use case B - Transparent threads

Dorai and Yeung [59] propose transparent threads: an SMT resource allocation policy that allows the background thread to use resources not required by the foreground thread. The objective is to obtain minimal performance degradation of the foreground thread compared to when it runs in single-thread mode. On POWER5 and POWER6 this can be achieved using priority 6 for the foreground thread and priority 1 for the background thread. Potential uses include garbage collection, prefetching, virus scanning, file indexing, defragmentation and other low-priority tasks. The characterization with micro-benchmarks described in Section 6.6.5 shows that transparent threading is more effective when the background thread is memory-bound. To this end, we select six benchmarks from the SPEC CPU2006 benchmark suite: three CPU-intensive benchmarks to be used as foreground threads (bzip, cactusADM and calculix) and three memory-bound benchmarks to be used as background threads (mcf, milc and omnetpp).

(a) foreground thread (priority 6)

(b) background thread (priority 1)

Figure 6.7: Transparent execution: percentage of the single-thread performance for the foreground and the background threads. The y-axis reports IPC_SMT^{P/Q} / IPC_ST × 100 and the x-axis the pair of benchmarks running in SMT mode with priorities 6 and 1.

Figure 6.7a reports the performance of the foreground thread under transparent thread execution with respect to its performance when running in isolation on POWER5 and POWER6. As shown in Figure 6.7a, the use of transparent threads is particularly effective on POWER6, with a performance degradation of at most 5.5% for the selected benchmarks. On the other hand, due to the higher level of thread resource sharing, using transparent threads on POWER5 leads to a performance degradation of up to 20.86%. This result confirms the different effect of hardware thread priorities on POWER5 and POWER6 and leads to the conclusion that the POWER6 design is better suited to exploit transparent execution.


Figure 6.7b reports the performance of the background thread under transparent thread execution with respect to its performance when running in isolation. As Figure 6.7b shows, the degradation of the background thread is considerable, especially on POWER6. This should nonetheless not be considered a drawback, given that the purpose of transparent execution is to run in the background a thread that has no performance requirements.

6.8 Conclusions

In this chapter, we characterize the software-controlled hardware priority mechanism of IBM POWER5 and POWER6 through the use of micro-benchmarks. We use a systematic approach in which we execute experiments with all the priority combinations and with different running modes (ST and SMT). With this methodology we obtain several architectural insights that explain the different behaviors of the thread prioritization mechanism on POWER5 and POWER6. The main conclusions are the following:

• The use of priorities generally leads to a smaller performance difference between ST and SMT modes on POWER6 than on POWER5, mostly due to the absence of out-of-order execution on POWER6. Since the per-thread SMT malleability of POWER6 is smaller than that of POWER5, increasing the priority of a thread generally leads to a smaller speedup than on POWER5.

• On both processors we have confirmed the correlation between high IPC and high sensitivity to priorities.

• On POWER5, with a priority difference greater than or equal to 3, memory-bound threads show significant malleability. Therefore, performance tuning using priority differences greater than or equal to 3 should be performed with a good understanding of the workload's memory behavior.

• We empirically measure the correlation of the malleability with the performance variation between SMT and single-thread execution.

• We show that hardware priorities can be used to improve load balancing for parallel applications: the execution of BT-MZ (NAS benchmarks) with a balanced configuration obtains an execution time reduction of 11.4% on POWER5 and of 16% on POWER6.

• We evaluate transparent execution, a mechanism that allows the foreground thread to run in SMT mode with performance close to single-thread mode. With applications from the SPEC CPU2006 benchmark suite, the foreground thread reaches up to 94% of its single-thread performance on POWER5, and up to 99% on POWER6.

Overall, software controlled SMT priorities could be integrated into an adaptive runtime system to improve several metrics for a specialized class of applications.


Chapter 7

Exploiting cache locality and network-on-chip to optimize sorting on a Many-core architecture

This part of the thesis describes how to exploit user-level hardware features that can be integrated into a specialized runtime system. In the previous chapter we examined the use of the hardware thread priorities of the IBM POWER5 and POWER6 processors for regular and computation-intensive applications. In this chapter we study a modern many-core processor and how its hardware characteristics can be effectively exploited to improve irregular and data-intensive applications. We choose radix sort as a basic kernel of irregular, data-intensive applications. In the following sections we describe how the Tilera TILEPro network-on-chip and cache locality can be used to improve performance. As mentioned before, the hardware features studied in this chapter can be effectively exploited by a specialized runtime system to speed up this class of applications.

7.1 Summary

The current trend in computer architectures indicates that Graphic Processing Units (GPUs) and manycore processors, integrating hundreds of simple cores, will be primary design points for power efficient High Performance Computing (HPC) systems. One of the challenges to efficiently implement programs such as radix sort on such architectures

is how to extract parallelism from the algorithm. Radix sort passes are designed to be executed sequentially. During each pass, radix sort builds a global histogram using digits from all keys, so synchronization on this structure is required for parallel execution. Moreover, like any sorting algorithm, it performs many memory operations with low data locality to move keys to their new positions.

Each architecture features specific characteristics that lead to different implementations and optimizations of the radix sort algorithm. On a shared-memory multicore processor, for instance, an efficient implementation should exploit data locality and use caches as a buffer to improve the throughput of irregular memory writes [144]. GPUs and architectures featuring large SIMD units can leverage split operations [30] to locally reorder keys and improve overall performance. Manycore architectures featuring fast and dedicated Network-on-Chip (NoC) interconnects can exploit more scalable core-to-core communications.

The Tilera architecture represents one of the first commercial examples of NoC-based manycore architectures. In particular, the TILEPro [9] family features from 36 to 64 tiles (simple RISC cores) on a chip. Tiles are interconnected through six NoCs, and feature a Dynamic Distributed Cache that improves coherent cache performance for large core counts. While the TILEPro processors lack floating point units and do not appear to be a good fit for classic scientific HPC applications, several vendors are proposing systems that integrate them for integer-based applications such as databases, networking, multimedia and cloud computing, claiming superior performance-per-watt compared to traditional multicore-based servers [11].

In this chapter we present an optimized implementation of radix sort for the Tilera TILEPro64 processor. We detail how we exploited the architectural features of the processor to increase the performance of the algorithm. We provide an in-depth analysis of the optimizations implemented for each phase of the algorithm, and discuss them in relation to the processor's sustained performance and bandwidth. We show the throughput reached by the whole algorithm on several representative datasets (up to 132 million sorted keys per second on 240 million random, uniformly distributed 32-bit keys). Finally, we show how our solution provides performance-per-watt comparable to current high-performance architectures such as Intel x86 multicores and NVIDIA GPUs.

The work proceeds as follows. Section 7.2 describes the experimental environment. Section 7.3 describes the Tilera TILEPro64 processor architecture and provides its performance characterization. Section 7.4 explains the parallel radix sort and related work. Section 7.5 presents the implementation details of the radix sort algorithm

on the Tilera architecture. Section 7.6 presents the algorithm optimizations for the TILEPro64 processor. Section 7.7 presents the experimental evaluation.

7.2 Experimental environment

Table 7.1 shows the experimental environment used in this chapter. The table includes the hardware system, the operating system, the communication library and the benchmarks used for the experiments.

CPU                    Tilera TILEPro64 (1 socket - 64 tiles) at 866 MHz
Memory                 64 Gigabyte of DDR RAM
Network                Ethernet (not used) - Network-on-chip
System size            single node
Operating System       Linux
Communication Library  Tilera Multicore Components API
Benchmarks             Microbenchmarks and Radix Sort

Table 7.1: Experimental environment - Tilera TILEPro64 system.

7.3 Tilera many-core processor

In this section we introduce the Tilera TILEPro64. Section 7.3.1 describes the processor architecture, and Section 7.3.2 provides an architectural characterization of its memory subsystem through the use of micro-benchmarks.

7.3.1 Processor architecture

The TILEPro64 is a manycore processor composed of 64 processing elements (called tiles), interconnected through six independent NoCs. The topology of the NoCs is a two-dimensional mesh. The processor features four on-chip memory controllers and several I/O interfaces (PCIe, GbE, 10 GbE, Flex I/O). Each tile includes: a low-power 3-way VLIW 32-bit processor engine running at 866 MHz; a cache engine containing a direct-mapped 16 KB L1 instruction cache, a 2-way associative 8 KB L1 data cache and a 4-way associative 64 KB L2 cache; and a NoC switch engine for the six functionally independent networks. Each tile also contains a 2D DMA engine that orchestrates data streaming between tiles and external memory, and among tiles. The six NoCs include the I/O Dynamic Network (IDN), which directly connects all the tiles with the I/O interfaces, the Static Network (STN), which can be programmed to perform static routing with predefined end-points, the User Dynamic Network (UDN), which implements a

pipelined, distributed dimension-ordered (X-Y) routing and can be used to perform tile-to-tile message passing at the user level, and three memory networks that communicate with the memory controllers and manage cache coherency.

The Tilera Multicore Components API (TMC) provides primitives that allow fine-grained control over many architectural features of the processor. The cache subsystem of the TILEPro architecture, named Dynamic Distributed Cache (DDC), is very flexible and fully configurable. The L1 caches are private to each tile, while the L2 cache is managed as a hybrid cache. A cache line is said to be "locally homed" in a tile if, on a miss in the local L2 cache, the request is sent directly to DDR memory. On the other hand, a cache line is said to be "remotely homed" if, on an L2 miss, a request is sent to its "home tile". In the latter case, the home tile's cache acts as an L3 cache (with an increased access latency) for tiles that miss in their own L2. Lines are then copied into the remote tile's cache. The L2 caches are coherent, but coherency can be disabled through a software switch. The TILEPro also supports a hash-for-home policy, where pages are homed on different tiles' L2 caches at cache-line granularity (64 bytes). The developer can select which tiles participate in this strategy. Finally, the programmer can allocate memory pages either on memory banks managed by a specific memory controller or in striping mode, where each page is striped across the memory controllers in chunks of 8 KB.
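To make the homing policies concrete, the sketch below allocates one buffer homed on the calling tile and one hash-for-home buffer. The TMC identifiers used (TMC_ALLOC_INIT, tmc_alloc_set_home(), tmc_alloc_map(), TMC_ALLOC_HOME_TASK, TMC_ALLOC_HOME_HASH) are quoted from memory of the Tilera MDE and should be checked against the installed tmc/alloc.h; this is a sketch of the usage pattern, not code from the implementation evaluated later.

```c
#include <stdint.h>
#include <stdio.h>
#include <tmc/alloc.h>   /* TMC allocation interface (assumed header name) */

/* Allocate 'bytes' of memory whose cache lines are homed on the calling
 * tile (local homing). */
static void *alloc_locally_homed(size_t bytes)
{
    tmc_alloc_t alloc = TMC_ALLOC_INIT;
    tmc_alloc_set_home(&alloc, TMC_ALLOC_HOME_TASK);
    return tmc_alloc_map(&alloc, bytes);
}

/* Allocate 'bytes' of memory homed across the tiles at cache-line
 * granularity (hash-for-home). */
static void *alloc_hash_for_home(size_t bytes)
{
    tmc_alloc_t alloc = TMC_ALLOC_INIT;
    tmc_alloc_set_home(&alloc, TMC_ALLOC_HOME_HASH);
    return tmc_alloc_map(&alloc, bytes);
}

int main(void)
{
    uint32_t *local_keys  = alloc_locally_homed(1u << 20);  /* 1 MB block */
    uint32_t *shared_keys = alloc_hash_for_home(1u << 20);  /* 1 MB block */

    if (local_keys == NULL || shared_keys == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... read and write the buffers as in the micro-benchmarks below ... */
    return 0;
}
```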

7.3.2 Architectural characterization

In this section we characterize the memory bandwidth of the TILEPro64 through a set of micro-benchmarks. This allows us to understand how far from the maximum bandwidth the various phases of the radix sort are and how effective the optimizations are. We look at four parallel memory access patterns: sequential read and write (benchmarks read-seq and write-seq), and random read and write (benchmarks read-rand and write-rand). The benchmark read-seq-pref exploits prefetching and loop unrolling to increase the performance of sequential reads. We also evaluate the effects of employing the hash-for-home (Hash) and local homing (Local) policies. The benchmarks use from 1 to 62 threads. Each thread is pinned to a specific tile. Two tiles handle I/O operations and are not available to applications. Each thread allocates a chunk of memory and reads from it or writes 32-bit integers into it. Figure 7.1 shows the bandwidth for read-seq, read-rand, and read-seq-pref with the two different memory homing policies. As the figure reports, the bandwidth with the local homing policy is significantly higher than with hash-for-home (except

Figure 7.1: Read bandwidth (GB/sec) versus thread number for read-seq, read-rand and read-seq-pref with the hash and local homing policies.

Figure 7.2: Write bandwidth (GB/sec) versus thread number for write-seq and write-rand with the hash and local homing policies.

for read-rand), due to the improved data locality on the allocating tile. The results also indicate that loop unrolling and prefetching are essential, leading to higher bandwidth utilization.

Figure 7.2 shows the bandwidth for write-seq and write-rand with the two different memory homing policies. For sequential writes, the behavior is easily understandable: since with local homing each tile owns the memory pages it is writing to, other remote caches are not queried. With hash-for-home, instead, each line is homed on a different cache, so whenever a new line is written, a different tile's L2 cache is accessed. For random writes the behavior is more complex. For low numbers of tiles the hash-for-home policy is better, for two reasons. First, since the memory accesses are completely random, the hashing provides a better balance of the access patterns on the cache and memory NoCs. Second, since the pages are hashed across all the tiles of the system, even if only some of the tiles are actually issuing memory operations, a virtually larger cache is used and cache thrashing is reduced with respect to the local homing policy. However, when the number of tiles increases, the total cache size becomes the same for both policies, and the higher irregularity of the NoC access patterns with the hash-for-home policy reduces the bandwidth.

To further investigate the effects of random writes, we implemented a set of micro-benchmarks that write sequential chunks of 16 (block-write-16), 32 (block-write-32), 64 (block-write-64) and 128 (block-write-128) bytes at random memory positions. Figure 7.3 reports the bandwidth for each of these micro-benchmarks. As expected, block-write-16 has the lowest bandwidth, due to the lowest data locality of the writes. Performing writes with larger sizes (block-write-32, block-write-64 and block-write-128) considerably improves bandwidth utilization. We can also observe that local homing is the best-performing memory allocation policy in most cases.

7.4 Background

This section provides preliminaries to understand the material presented in the rest of the work. Section 7.4.1 explains the reference serial radix sort implementations used in this work. Section 7.4.2 discusses the related work.

7.4.1 Radix sort

Radix sort sorts data by grouping keys by the individual digits which share the same significant position (radix). At each step, the grouping of the digits is usually done through Counting or Bucket Sort. Radix Sort can operate starting from the least

Figure 7.3: Write bandwidth for larger chunks of data (GB/sec) versus thread number for block-write-16/32/64/128 with the hash and local homing policies.

significant digit and moving towards the most significant digit (LSD Radix Sort) or the other way around (MSD Radix Sort).

In the first step, radix sort divides the bits of each key into k groups of l bits called radixes, e.g., 4 radixes of 8 bits form a 32-bit integer key. Then, for each radix, the algorithm performs a pass that orders all the keys according to the value of that radix with the following steps: (1) compute the histogram of occurrences of each l-bit digit for the current radix; (2) compute the offset of each key to order the keys by the current radix; (3) reorder the keys based on the offsets.

After k passes, the algorithm produces a total ordering of the keys. While comparison-based sorting algorithms are lower-bounded by a computational complexity of O(n log n), the computational complexity of radix sort is O(nk), where k is the number of radixes (for fixed-length keys). The number of radixes can vary from 1 to the number of bits used to represent the keys to be sorted. There is a trade-off between the number of passes and the size of the histogram (2^l).
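For reference, a minimal serial LSD radix sort over 32-bit keys, using k = 4 passes of l = 8 bits, can be written as follows. This is a textbook-style sketch, not the tuned parallel implementation evaluated later in the chapter.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Serial LSD radix sort of n 32-bit keys: k = 4 passes of l = 8 bits. */
void radix_sort_u32(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    size_t hist[256], offset[256];

    for (int pass = 0; pass < 4; pass++) {
        int shift = pass * 8;

        /* (1) histogram of the current 8-bit digit */
        memset(hist, 0, sizeof hist);
        for (size_t i = 0; i < n; i++)
            hist[(keys[i] >> shift) & 0xFF]++;

        /* (2) exclusive prefix sum gives the first position of each digit */
        size_t sum = 0;
        for (int d = 0; d < 256; d++) {
            offset[d] = sum;
            sum += hist[d];
        }

        /* (3) scatter keys to their new positions for this pass */
        for (size_t i = 0; i < n; i++)
            tmp[offset[(keys[i] >> shift) & 0xFF]++] = keys[i];

        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
}
```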


7.4.2 Related work

Sorting algorithms have attracted a great deal of research, due to their apparently simple statement and the difficulty of finding efficient solutions. Platforms based on multicore processors, manycore processors and GPUs have become interesting targets for these algorithms. Lately, many works discussing parallel implementations of radix sort on these architectures have appeared.

Sohn and Kodama [152] present a balanced parallel radix sort that first obtains the bin counts of all the processors, and then computes for each processor the number of keys, their source bin and their source processor. Partitioned parallel radix sort [108] further improves upon the previous work by reducing the multiple rounds of data redistribution to one. In [84] an MSD radix sort is used to evaluate scatter and gather primitives for GPUs.

Satish et al. [143] introduce a merge sort and a radix sort for GPUs. The implementations exploit efficient data-parallel primitives such as parallel scan and the high-speed on-chip shared memory provided by NVIDIA's GPU architectures. In particular, for the radix sort implementation, the initial input is divided into tiles, which are assigned to different thread blocks and sorted in the on-chip memory using b iterations of 1-bit split operations [30]. The split operations are Single Instruction Multiple Data (SIMD) friendly, as they can be executed with a SIMD prefix sum and a SIMD subtraction. The radix sort implementation reaches a saturated performance of around 200 Mkeys/s for floating point keys on a GeForce GTX 280. In [144] the previous SIMD-friendly approach for GPUs is compared to a buffer-based scheme for CPUs that collects elements belonging to the same radix into buffers in local storage. Buffers are written from local storage to global memory only when enough elements have been accumulated. The buffer-based scheme reaches up to 240 Mkeys/s on a Core i7 (Nehalem architecture) processor. In [145] the Intel Many Integrated Core (MIC) architecture is also introduced. The SIMD-friendly approach for GPUs is adapted to the Knights Ferry implementation of MIC. The authors show saturated performance at about 325 Mkeys/s for the GeForce GTX 280, 240 Mkeys/s for the Core i7 and 560 Mkeys/s on Knights Ferry.

Merrill and Grimshaw [117] present a family of very efficient parallel algorithms for radix sorting on GPUs. The proposed algorithms exploit an efficient parallel prefix scan "runtime" that includes kernel fusion, multi-scan for performing multiple related, concurrent prefix scans and a flexible algorithm serialization that removes unnecessary synchronization and communication within the various phases. The proposed algorithms are included in the NVIDIA Thrust Library and have a sustained performance

of over 1000 Mkeys/s on the GeForce GTX 480 (Fermi architecture). Wassenberg and Sanders [165] introduce a radix sort algorithm that builds upon a microarchitecture-aware counting sort. It exploits virtual memory by generating as many containers as there are digits and inserting each value into the appropriate container. It exploits write combining by accumulating the values at the appropriate positions in buffers and then writing them directly to memory with non-temporal stores that bypass all the data caches. On a dual Xeon W5580 (Nehalem architecture, 4 cores, 3.2 GHz) this algorithm reaches 657 Mkeys/s. However, the extensive use of virtual memory limits the maximum size of the datasets.

7.5 Parallel Radix Sort

To parallelize radix sort, we divide the unsorted array of N keys into M blocks, where M is equal to the number of threads. Each block block_t is assigned to the thread with rank t (in the following, we use t to denote the rank of both threads and blocks). A second array of size N stores the sorted keys at each pass. This array is also divided into M blocks, where each block block_sorted_t is assigned to thread t. Algorithm 1 shows the main loop of the parallel radix sort.

Algorithm 1 Parallel radix sort main loop.
for pass = 0 ... P do
    hist_t ← do_local_histogram(block_t, pass)
    offset_t ← compute_offset(hist_t)
    distribute_keys(offset_t, block_t, pass)
    block_t ← block_sorted_t
end for

P is the number of passes, which corresponds to the number of radixes. At each pass, thread t performs the following phases:

1. Computes the histogram of occurrences for block_t (DO_LOCAL_HISTOGRAM()).

2. Computes the new positions for the keys in block_t with a global parallel operation (COMPUTE_OFFSET()).

3. Writes the keys from block_t into the block_sorted_{t′} arrays, where t′ ∈ [0, M−1] can be the block owned by any thread (DISTRIBUTE_KEYS()).

4. Then, replaces block_t with block_sorted_t to prepare for the next pass.


Algorithm 2 Local histogram computation.

function do_local_histogram(block_t, pass)
    for i = 0 ... B do
        k ← block_t[i]
        d ← get_digit(k, pass)
        hist_t[d] ← hist_t[d] + 1
    end for
end function

7.5.1 Local histogram

In this phase each thread reads its block of keys and, for each key in the block, extracts the proper radix and increments the histogram of occurrences. Algorithm 2 describes the different steps of this phase. B = N/M is the number of keys in a block. The function GET_DIGIT() performs a shift-and-mask operation to extract the digit from the key, according to the current pass. At the end of this phase, hist_t contains the occurrences of all digits in block_t.
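A possible C rendering of GET_DIGIT() and of the histogram loop is shown below; the shift and width tables are illustrative and correspond to the 7, 7, 6, 6, 6 radix split adopted in Section 7.6.1.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-pass tables for the 7,7,6,6,6 radix split
 * (7 + 7 + 6 + 6 + 6 = 32 bits). */
static const int radix_shift[] = { 0, 7, 14, 20, 26 };
static const int radix_width[] = { 7, 7,  6,  6,  6 };

/* Extract the digit of 'key' for the given pass (shift, then mask). */
static inline uint32_t get_digit(uint32_t key, int pass)
{
    return (key >> radix_shift[pass]) & ((1u << radix_width[pass]) - 1);
}

/* Local histogram over one block of B keys (Algorithm 2). */
void do_local_histogram(const uint32_t *block, size_t B,
                        uint32_t *hist, int pass)
{
    for (size_t i = 0; i < B; i++)
        hist[get_digit(block[i], pass)]++;
}
```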

7.5.2 Offset computation

The purpose of this phase is to compute the offset, in the sorted array, for each key of the block. The offset is used in the next phase by thread t to distribute the keys from block_t into block_sorted_{t′} (where t′ ∈ [0, M−1]). We define offset_t[d] as a local array that stores the new position of a key that has digit d and originally resides in block_t. The offset is expressed with respect to the entire array of keys, i.e., with respect to the first position of block_sorted_0. Moreover, keys with the same digit for the current pass are stored in sequential positions, giving precedence to threads with lower rank. The offset for keys in block_t with digit d is computed by thread t as:

    offset_t[d] = Σ_{t′ < t} hist_{t′}[d] + cumulative_hist[d − 1]        (7.1)

where cumulative_hist[d − 1] is the total number of keys, across all blocks, whose current digit is smaller than d.

At each iteration, the partial sum of the local histograms is forwarded to the threads with higher rank. At the end of the third iteration (iter = 3), each thread t has the sum of the local histograms of the threads t′ < t, and the thread with the highest rank has the global histogram. Theoretically, this approach is not the most efficient way of performing a parallel prefix sum. We explain this design choice in Section 7.6.2.

Figure 7.4: Parallel prefix-sum with 8 threads.
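A direct serial rendering of Equation (7.1) makes the two terms explicit. The sketch assumes that all local histograms are visible in shared memory and uses C99 variable-length array parameters; it is illustrative rather than the implementation used in the experiments, which parallelizes this step as discussed in Section 7.6.2.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial rendering of Eq. (7.1). hist[t][d] is the local histogram of
 * block t; offset[t][d] becomes the global position of the first key of
 * block t whose current digit is d. */
void compute_offsets(size_t M, size_t H,
                     uint32_t hist[M][H], uint32_t offset[M][H])
{
    uint32_t cumulative = 0;             /* keys with digit < d, all blocks */
    for (size_t d = 0; d < H; d++) {
        uint32_t same_digit_before = 0;  /* keys with digit d in blocks t' < t */
        for (size_t t = 0; t < M; t++) {
            offset[t][d] = cumulative + same_digit_before;
            same_digit_before += hist[t][d];
        }
        cumulative += same_digit_before; /* now includes digit d itself */
    }
}
```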

7.5.3 Keys distribution

In this phase, each thread writes keys from block block_t to block_sorted_{t′} (where t′ ∈ [0, M−1]). Algorithm 3 describes the steps performed by a thread for each key. The thread reads the key from the local block and extracts the digit d according to the current pass. Then, it writes the key into block_sorted_{t′} at the position given by offset_t[d]. Finally, it increments offset_t[d] so that the next key with digit d is written into the following position. Since the new position o of the key is expressed as an offset from the beginning of the first block (block_0), the destination block (t′) and the relative offset (o mod B) are also computed. The division to obtain the block number (t′ ← o / B) is rounded towards zero. At the end of this phase, the keys have been partially reordered according to the current radix.

7.6 Optimizations

This section describes the optimizations applied to the radix sort implementation presented above. Although each optimization has been evaluated with all the datasets presented in Section 7.7.1, in this section we show their effects on a uniform distribution of 240 million random 32-bit integers.


Algorithm 3 Keys distribution.

function distribute_keys(block_t, offset_t, pass)
    for i = 0 ... B do
        k ← block_t[i]
        d ← get_digit(k, pass)
        o ← offset_t[d]
        t′ ← o / B
        block_sorted_{t′}[o mod B] ← k
        offset_t[d] ← offset_t[d] + 1
    end for
end function

7.6.1 Local Histogram

As mentioned in Section 7.4, the histogram size H depends on the size of the radix l: H = 2^l. The radix size significantly affects performance, because it determines the size of several data structures and the number of radix sort passes. We found that the best trade-off for our implementation is to use five passes with the following radixes: 7, 7, 6, 6, 6. This results in local histograms of 512 bytes (128 elements of 32 bits) for the first two passes and 256 bytes for the other three passes. Because of their small sizes, the local histograms likely reside in the caches. Since the update of a local histogram corresponds to reading a chunk of sequential data and incrementing values in the cache, the upper bound for the performance of this phase is the memory bandwidth for sequential reads. We implemented two optimizations to improve the performance of this phase: local block caching and data prefetching. The first optimization leverages the flexibility of Tilera's caches by allocating the blocks of the input array and the histograms with the local homing policy. This allows exploiting data locality at the tile level (a thread is pinned to a tile), improving access latency and reducing memory traffic. The second optimization leverages prefetching and loop unrolling to increase bandwidth utilization for contiguous reads from memory, as described in Section 7.3.2. We use the TMC library function tmc_mem_prefetch() to prefetch chunks of data into the cache before they are actually read. To fully exploit prefetching, we unroll the read loop in multiples of the cache line size (64 bytes, or 16 elements). Figure 7.5 shows the bandwidth utilization for the unoptimized (hist) and the optimized (hist-pref) implementations of this phase. As the figure shows, the optimizations for this phase lead to a bandwidth improvement of 1.68 times. As expected, the bandwidth of hist-pref is close to the bandwidth of the micro-benchmark read-seq-pref of Section 7.3.2.
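A sketch of the optimized inner loop is shown below. The unrolling factor corresponds to one 64-byte cache line (16 keys); tmc_mem_prefetch() is the TMC call mentioned above, but its exact signature (address and length) is assumed here and should be checked against the MDE headers; get_digit() is the helper sketched in Section 7.5.1.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmc/mem.h>   /* assumed header providing tmc_mem_prefetch() */

#define LINE_BYTES 64
#define LINE_KEYS  (LINE_BYTES / sizeof(uint32_t))   /* 16 keys per line */

extern uint32_t get_digit(uint32_t key, int pass);   /* from the earlier sketch */

/* Histogram with software prefetch and unrolling over one cache line.
 * B is assumed to be a multiple of LINE_KEYS (blocks are page aligned). */
void do_local_histogram_pref(const uint32_t *block, size_t B,
                             uint32_t *hist, int pass)
{
    for (size_t i = 0; i < B; i += LINE_KEYS) {
        /* Request the next cache line while processing the current one
         * (assumed signature: address and number of bytes). */
        if (i + LINE_KEYS < B)
            tmc_mem_prefetch(&block[i + LINE_KEYS], LINE_BYTES);

        for (size_t j = 0; j < LINE_KEYS; j++)
            hist[get_digit(block[i + j], pass)]++;
    }
}
```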

Figure 7.5: Histogram computation bandwidth (GB/sec) versus thread number for hist and hist-pref.

7.6.2 Offset computation

The offset computation phase performs a prefix sum using the local histograms, and does not involve the original dataset (the keys). Since the local histograms fit in the caches, this is the only phase that is not memory-bound. Our first implementation of this phase was a shared-memory parallel prefix sum algorithm based on [30] (prefix-sum-shared). To further improve this phase, we implemented a message-passing version of the parallel prefix sum as proposed in [88] (prefix-sum-udn), using the UDN (the NoC for user-level inter-tile communication). Figure 7.6 reports the execution times of the two implementations. Since the prefix sum is performed using local histograms of constant size per thread (512 bytes), the execution time increases with the number of threads. As the figure shows, with 62 threads the UDN-based prefix sum is 2.57 times faster than the shared-memory version, even if the former has a higher computational complexity.
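The exchange pattern of the UDN-based prefix sum can be sketched as a log-step scan, in the spirit of Figure 7.4: at step s, thread t forwards its current partial histogram to thread t + s and accumulates the partial histogram received from thread t − s. In the sketch below, udn_send_words() and udn_recv_words() are hypothetical blocking wrappers standing in for the TMC UDN send/receive primitives; they are not actual library calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blocking wrappers standing in for the TMC UDN primitives:
 * send/receive an array of n 32-bit words to/from another thread's tile. */
void udn_send_words(int dest_rank, const uint32_t *words, size_t n);
void udn_recv_words(int src_rank, uint32_t *words, size_t n);

/* Log-step prefix sum over local histograms of H counters. On return,
 * sum[] on thread 'rank' holds the inclusive sum over threads 0..rank;
 * subtracting the local histogram gives the exclusive term of Eq. (7.1),
 * and the highest rank ends up with the global histogram. */
void udn_prefix_sum(int rank, int nthreads, size_t H,
                    const uint32_t *hist, uint32_t *sum)
{
    uint32_t recv[H];                      /* C99 VLA scratch buffer */

    for (size_t d = 0; d < H; d++)
        sum[d] = hist[d];

    for (int step = 1; step < nthreads; step <<= 1) {
        /* Forward the value of the previous step upward... */
        if (rank + step < nthreads)
            udn_send_words(rank + step, sum, H);
        /* ...then fold in the partial sum arriving from below. */
        if (rank - step >= 0) {
            udn_recv_words(rank - step, recv, H);
            for (size_t d = 0; d < H; d++)
                sum[d] += recv[d];
        }
    }
}
```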


Figure 7.6: Prefix sum execution time (seconds) versus thread number for prefix-sum-udn and prefix-sum-shared.

7.6.3 Keys Distribution

Even if the computational complexity of this phase is O(n), the presence of high-latency non-contiguous memory writes makes it the bottleneck of the algorithm. In fact, the write locations for each key, which threads obtain from offset_t[d], are independent and most likely fall into un-cached portions of block_sorted. The probability of having two contiguous writes is inversely proportional to the size of the dataset, hence it is very low for large datasets. Since this phase is the bottleneck of the radix sort, its optimizations have the highest impact on the overall algorithm performance. The memory writes of this phase are analogous to the write-rand operation described in Section 7.3.2. The first optimization is related to the starting address of the blocks. To exploit local homing, each thread must allocate its own block. Every time a thread writes to a block, not necessarily its own, it must retrieve the block's starting address (e.g., from a global structure). However, in our implementation blocks are allocated contiguously in memory through the function tmc_alloc_map_at(). This function allows specifying the virtual address of each block. Because blocks are adjacent in memory, and all have the same size, every memory location can be obtained by computing an offset from the starting address of the first block.

Figure 7.7: Key distribution bandwidth (GB/sec) versus thread number for distrib, distrib-adj and distrib-buf.

For example, position 8 of block 3 can be accessed by adding 8 + (3 × B), where B is the number of elements in a block, to the starting address of block 0. Furthermore, as shown in Section 7.3.2, random write bandwidth can be improved by writing chunks of data larger than 4 bytes. Hence, the second optimization is to implement a write buffer to leverage larger memory writes. This optimization exploits the observation that keys with the same digit are going to be written into consecutive positions. Because each block is allocated using the local homing strategy, keys are most probably going to be written into blocks that are remotely cached. Thus, it is convenient to first accumulate keys with the same digit in the local cache (possibly the L1) and then write them all together to memory. The write buffer is allocated on the local stack, which uses the local homing strategy by default. Figure 7.7 reports the bandwidth utilization for the keys distribution phase: distrib is the basic key distribution phase without optimizations, distrib-adj is the key distribution phase with adjacent blocks in memory, and distrib-buf uses adjacent blocks with a write buffer of 8 elements. Allocating adjacent blocks improves the bandwidth by up to 1.28 times with 62 threads. The write buffer of 8 elements, together with the adjacent block allocation, leads to a performance improvement of 1.55 times.

As expected, the bandwidth of distrib-buf is similar to the bandwidth of the micro-benchmark block-write-64 local shown in Figure 7.3, which performs similar memory operations (a write of 8 elements corresponds to 64 bytes). We found that using a write buffer larger than 8 elements causes a performance degradation. If we compute the total buffer size (given a digit size of 7 bits, the size of the buffers for 8 elements is 128 × 4 × 8 = 4096 bytes), it is easy to observe that a larger buffer would fill the L1 data cache (8 KB), causing conflicts with other structures used in the algorithm.
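A sketch of the buffered distribution loop is shown below. BUF_KEYS matches the 8-element (64-byte) buffer discussed above, sorted_base is assumed to point at the start of the first of the adjacent destination blocks, offset[] holds the values of Equation (7.1), and get_digit() is the helper sketched in Section 7.5.1. It illustrates the buffering scheme only; the actual implementation may differ in details such as the final flush.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDIGITS  128   /* up to 2^7 buckets (7-bit radix; 6-bit passes use 64) */
#define BUF_KEYS 8     /* 8 x 4 bytes = one 64-byte chunk per bucket */

extern uint32_t get_digit(uint32_t key, int pass);  /* from the earlier sketch */

/* Buffered key distribution: keys headed for consecutive positions of the
 * same digit are first gathered in a small, locally homed stack buffer and
 * then written out as one 64-byte chunk. */
void distribute_keys_buffered(const uint32_t *block, size_t B,
                              uint32_t *offset,      /* offset[d] from Eq. (7.1) */
                              uint32_t *sorted_base, /* start of destination block 0 */
                              int pass)
{
    uint32_t buf[NDIGITS][BUF_KEYS];
    int fill[NDIGITS] = { 0 };

    for (size_t i = 0; i < B; i++) {
        uint32_t d = get_digit(block[i], pass);
        buf[d][fill[d]++] = block[i];
        if (fill[d] == BUF_KEYS) {
            /* offset[d] still points at the slot of the first buffered key,
             * so the whole chunk lands in its final positions. */
            memcpy(&sorted_base[offset[d]], buf[d], sizeof buf[d]);
            offset[d] += BUF_KEYS;
            fill[d] = 0;
        }
    }

    /* Flush the partially filled buffers at the end of the block. */
    for (int d = 0; d < NDIGITS; d++) {
        if (fill[d] > 0) {
            memcpy(&sorted_base[offset[d]], buf[d], fill[d] * sizeof(uint32_t));
            offset[d] += fill[d];
        }
    }
}
```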

7.7 Experimental evaluation

This section discusses the experimental results of our radix sort implementation for the Tilera TILEPro64 processor. Section 7.7.1 presents the experimental setup. Section 7.7.2 presents the performance improvement obtained with the optimizations. Section 7.7.3 analyzes the scalability of our implementation. Finally, Section 7.7.4 compares the implementation for TILEPro64 to implementations for other architectures.

7.7.1 Experimental setup

The TILEPro64 is hosted on a TILEmpower platform, which offers 16 GB of DDR2 memory at 800 MHz. However, since our radix sort implementation exploits shared memory among all the tiles, it is constrained by the 32-bit address space (4 GB). The compiler is a Tilera-specific version of GCC 4.4.3, provided with the Tilera MDE 3.0.1 IDE. The code has been compiled with the -O3 optimization flag. We configured the system to expose 62 tiles at user level, while the other 2 are reserved by the operating system for executing the drivers of the communication ports. Experiments are executed with one thread per tile. We used six datasets with varying bit entropy. Lower entropy means a higher probability of similar numbers in the dataset. Dataset 1 is a uniform distribution of pseudo-random numbers generated with the random_r POSIX function. Datasets 2 to 5 are generated by executing multiple bitwise AND operations among uniformly distributed random numbers. Finally, dataset 6 only contains keys with the same value (0); hence, its entropy is zero. This is equivalent to ANDing all the possible random numbers in the considered range. Table 7.2 reports how many numbers have been used in the AND operation and the entropy of each dataset. The experiments use datasets with up to 240 million keys (MK)^1. We repeated each experiment 10 times, and report the arithmetic mean of the execution times.

Table 7.2: Datasets description.

dataset   keys in AND   entropy
1         1             32.00 bit
2         2             25.95 bit
3         3             17.41 bit
4         4             10.78 bit
5         5              6.42 bit
6         inf            0.00 bit
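The low-entropy datasets can be reproduced with a generator along these lines: ANDing several independent uniform draws biases each bit towards zero, lowering the per-bit entropy. The sketch uses rand() (assuming RAND_MAX >= 0xFFFF) rather than the random_r() generator mentioned above, so it only illustrates the construction.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One illustrative 32-bit uniform draw built from two rand() calls
 * (assumes RAND_MAX >= 0xFFFF). */
static uint32_t draw32(void)
{
    return ((uint32_t)(rand() & 0xFFFF) << 16) | (uint32_t)(rand() & 0xFFFF);
}

/* Fill keys[] with n values obtained by ANDing 'factors' independent
 * uniform draws: factors = 1 corresponds to dataset 1, larger values to
 * the lower-entropy datasets 2-5. */
void make_dataset(uint32_t *keys, size_t n, int factors)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = 0xFFFFFFFFu;
        for (int j = 0; j < factors; j++)
            v &= draw32();
        keys[i] = v;
    }
}
```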

7.7.2 Optimization effects

Figure 7.8 reports the throughput (millions of keys sorted per second, MK/sec) of the optimized and non-optimized versions of radix sort on the TILEPro64 processor on dataset 1. The non-optimized version is the basic algorithm, while the optimized version includes all the optimizations previously discussed. As Figure 7.8 shows, the optimizations significantly improve the performance and scalability of the algorithm. The non-optimized algorithm with 62 threads shows a performance degradation similar to the write-rand hash benchmark of Figure 7.2. The reason is that the performance of the algorithm is dominated by the key distribution phase, which in the non-optimized version is analogous to the random write operation described in Section 7.3.2.

7.7.3 Scaling

We initially evaluated the scaling of the algorithm by progressively increasing the number of threads (threads are pinned to tiles), while sorting datasets of fixed size (240 MK). Figure 7.9 shows the throughput of the algorithm on the six datasets with memory allocations striped across the memory controllers. Our implementation has an average throughput of 132 MK/sec with 62 threads on dataset 1. The algorithm scales almost linearly up to 32 threads. Beyond 32 threads the slope decreases, due to higher contention on the memory controllers. The variability across the different entropies is in the range of 10%. Figure 7.10 shows the same set of experiments with allocations on specific memory controllers. For these experiments, memory is allocated through the nearest memory

^1 To obtain adjacent block allocations, each block size is a multiple of the page size (64 KB); thus the exact number of keys is 243,793,920.


Figure 7.8: Throughput (MK/sec) with and without optimizations for the TILEPro64 processor, versus thread number.

controller to the tile that performed the allocation, reducing access latencies for private and local data. In contrast to Figure 7.9, there is a slight change of slope at 16 threads, while performance keeps scaling linearly when increasing the number of threads from 32 to 62. However, the throughput at 32 threads is lower. Since threads are pinned to tiles in an ordered way (i.e., progressively filling rows), with 32 threads only two (out of four) memory controllers are fully utilized. With striped memory allocations, instead, 32 tiles can exploit all the memory controllers. The peak performance with 62 threads is similar in the two cases, since on average all four memory controllers are equally utilized. However, the access patterns to the NoCs and to the memory controllers change, leading to slightly different behaviors across datasets.

Figure 7.11 shows the throughput of the algorithm when changing dataset sizes. As expected, the throughput increases with larger datasets as the utilization of the processor increases. Performance saturates in the range of 180-210 MK and is mainly bounded by the memory bandwidth for scattered writes, as discussed in Section 7.6.3.

170 32.00 25.95 140 17.41 10.78 6.42 0.00 120

100

80 MK/sec 60

40

20

0 2 4 8 16 32 62 thread number

Figure 7.9: Throughput with varying number of threads, 240 MK and striped allocations.

Platform                    Tesla C2070 (GPU)   Tesla C2070 + comm. (GPU)   Xeon W5590 (x86)   TILEPro64 (Tilera)
Throughput [MK/sec]         664                 254                         212                132
TDP [Watt]                  238                 238                         130                50.6
Efficiency [MK/(sec·Watt)]  2.79                1.07                        1.63               2.61

Table 7.3: Comparison of various radix sort implementations in terms of throughput (on 32-bit keys), Thermal Design Power (TDP) and performance-per-watt efficiency. For the GPU, we show performance with and without communication of data across the PCI Express bus.

7.7.4 Comparison

Table 7.3 compares the performance of the radix sort implementation for the TILEPro64 processor presented in this work to implementations for Graphics Processing Units (GPUs) and x86 processors. Since the performance of a sorting algorithm depends on the datasets to which it is applied, rather than considering only the results reported in the literature, we chose to use accessible existing implementations or to reimplement the proposed approaches. For the GPU, we used the radix sort implementation provided in the CUDA Thrust library, which corresponds to [117], and executed it on an NVIDIA Tesla C2070 board (448 streaming processors at 1.15 GHz, 6 GB of GDDR5 at



Figure 7.10: Throughput with varying number of threads, 240 MK and allocations on specific memory controllers.

3 GHz) with CUDA 4.1. For the x86 processor, instead, we re-implemented the cache-optimized radix sort algorithm presented in [144], compiled it with GCC 4.6 at the -O3 optimization level, and executed it on a system with an Intel Xeon W5590 processor (4 cores with 2 threads each, 256 KB of L2 cache per core and a unified 8 MB L3 cache, 3.3 GHz) and 32 GB of DDR3-1066 memory. For all the experiments, we used dataset 1. The table also shows the maximum Thermal Design Power (TDP) in Watts and the performance per Watt for each platform. TDPs have been taken from the board specification [8] for the Tesla C2070, from the Intel website [4] for the Xeon W5590 and from the TILEmpower user guide for the TILEPro64. Although the TDP only represents a possible maximum power dissipation, and actual averages may be lower, we believe that it is a reasonable approximation to understand performance-per-watt behavior without requiring a non-trivial power instrumentation of all the platforms [122]. From the table, we can see that the GPU implementation reaches the highest performance, with 664 MK/sec. It also appears to be the most efficient in terms of performance-per-watt. However, when data communication across the PCI Express bus is considered, the overall performance significantly decreases. Due to the high TDP of the board, when communication is included, the performance-per-watt also becomes the worst of the evaluated platforms.

172 32.00 25.95 140 17.41 10.78 6.42 0.00 120

100

80 MK/sec 60

40

20

0 30 60 90 120 150 180 210 240 MK

Figure 7.11: Throughput with varying dataset sizes and 62 threads. when communication is included, the performance-per-watt also becomes the worst of the evaluated platforms. Consequently, the preferred approach should be to use GPU sorting as a phase of more complex algorithms that execute entirely on the GPU. The x86 implementation is slower than the GPU implementation, even considering data transfers. However, the TDP of the Xeon is lower than the TDP of the Tesla C2070, thus performance-per-watt is higher than the GPU implementation with communica- tions. The TILEPro64 implementation has the lowest performance of the solutions considered. However, its TDP is also, by a large margin, the lowest of the three plat- forms. This leads to a performance-per-watt comparable to the GPU implementation without communication, and higher than the others.

7.8 Conclusions

In this chapter we show how to exploit the hardware features of a many-core processor such as the Tilera TILEPro64 to improve the performance of a fundamental sorting algorithm such as radix sort. The Tilera TILEPro64 processor is one of the first successful low-power manycore architectures. Our implementation employs several optimization techniques that exploit processor features such as the NoCs and

173 7. EXPLOITING CACHE LOCALITY AND NETWORK-ON-CHIP TO OPTIMIZE SORTING ON A MANY-CORE ARCHITECTURE the configurable Dynamic Distributed Cache. We have shown how the optimizations impact the performance of each phase of the algorithm (i.e., local histogram generation, offset computation and keys distribution). We discussed the scalability of the algorithm with respect to the number of tiles (cores) used and the datasets size. We used datasets with varying levels of entropy, showing how keys diversity influences the performance. We compared our optimized radix sort implementation for the TILEPro64 processor with current state-of-the-art implementations for x86 multicores and GPUs, considering throughput and power efficiency. Even if our implementation does not achieve the same peak performance of other architectures, it obtains comparable, or better, results in terms of power efficiency. The hardware features exploited to improve a specific kernel such as the radix sort, could be integrated into a specialized runtime system to improve the performance metrics of an entire class of applications (i.e. irregular applications).

Chapter 8

Conclusions

The last two decades witnessed an exponential growth of supercomputing performance. HPC system software has evolved to keep pace with the evolution of technology and applications. Research on scalable system software for large-scale systems has historically been driven by two fundamental aspects: on one side, performance, efficiency and reliability; on the other side, the goal of providing a productive programming environment.

Exascale systems will be the result of current technology trends meeting application requirements, and of leveraging the lessons learned from more than two decades of supercomputing. The trends in hardware technologies will introduce additional complexity and the need for highly adaptable and scalable system software. Moreover, the difficulty of developing large-scale applications with an explicit message-passing programming model, as is commonly done today, will drive the adoption of newer programming models. These new developments imply radical changes at all levels of the system software stack, and in particular in the OS and the runtime system. The contributions of this thesis are the following:

• Designed and implemented a methodology to provide detailed measurements of system software interruptions and their effect on application performance.

• Evaluated the effect of TLB misses on application performance and artificially simulated the effect of complex memory mapping schemes.

• Designed and implemented a runtime system for the class of irregular applications that shows performance improvements of several orders of magnitude with respect to more traditional approaches.

• Implemented several application optimizations leveraging architecture-specific hardware features and obtaining significant performance improvements. These


optimizations can be integrated into the specialized runtime system to be used by an entire class of applications.

The work described in this thesis is being extended in various directions. A methodology for the detailed measurement of OS noise is being extended to a full large-scale system, and also integrated into the runtime system. Given the complexity of new-generation runtime systems, detailed measurement of the runtime system events and communication patterns is needed. The ability to measure and aggregate performance profiles becomes fundamental as the system scale increases. Part of the future work is to integrate performance profiling into the runtime system, so that the runtime system can automatically adapt to the changing workload using information gathered through online performance profiling. The runtime system can combine information about the specific computation pattern of the applications with the performance profiling provided by the OS and implement effective policies to improve efficiency, scalability and reliability.

The specialized runtime system developed (GMT) is being actively extended and tested in order to improve its performance and reliability. GMT is the candidate runtime system to support several new-generation applications such as analytics, big data and, in general, irregular applications. A large project is using GMT to develop a Semantic Graph Database. In this context, GMT is being extended to provide additional features. New features include the ability to handle parallel I/O on global arrays, and additional primitives to initialize arrays with a given value and to move large chunks of global memory from one position to another. GMT performance is also being improved by redesigning and profiling the context-switching and aggregation mechanisms. GMT is currently available only for the x86 architecture, but an adapted version is being developed to execute on the Tilera many-core processor, to exploit its fast core-to-core communications and fine-grain locality control. Plans are also in place to port GMT to the IBM POWER architecture to exploit hardware multithreading and other hardware features specific to POWER.

Publications

Conference papers:

• A Quantitative Analysis of Operating System Noise - Alessandro Morari, Roberto Gioiosa, Robert Wisniewski, Francisco J. Cazorla, Mateo Valero. In 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Anchorage, Alaska, USA, May 2011.

• Evaluating the Impact of TLB Misses on Future HPC Systems - Alessandro Morari, Roberto Gioiosa, Robert W. Wisniewski, Bryan S. Rosenburg, Todd Inglett, Mateo Valero. In 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2012. This paper won the IPDPS Best Paper Award.

• Efficient Sorting on the Tilera Manycore Architecture - Alessandro Morari, Antonino Tumeo, Oreste Villa, Simone Secchi, Mateo Valero. In 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012.

• Accelerating semantic graph databases on commodity clusters - Alessandro Morari, Vito Giovanni Castellana, Oreste Villa, David Haglin, John Feo, Jesse Weaver, and Antonino Tumeo. In IEEE International Conference on Big Data (IEEE BigData 2013). 2013.

• Scaling Irregular Applications through Data Aggregation and Software Multithreading - Alessandro Morari, Oreste Villa, Antonino Tumeo, Daniel Chavarría-Miranda, Mateo Valero. In IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014.


Journal articles:

• SMT Malleability in IBM POWER5 and POWER6 Processors - Alessandro Morari, Carlos Boneti, Francisco J. Cazorla, Roberto Gioiosa, Chen-Yong Cher, Alper Buyuktosunoglu, Pradip Bose and Mateo Valero. In IEEE Transactions on Computers. 2012.

Workshop papers:

• Toward a Data Scalable Solution for Facilitating Discovery of Scientific Data Resources - Alan Chappell, Sutanay Choudhury, John Feo, David Haglin, Alessandro Morari, Sumit Purohit, Karen Schuchardt, Antonino Tumeo, Jesse Weaver, and Oreste Villa. In Workshop on Data-Intensive Scalable Computing Systems (DISCS), held in conjunction with SC13: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2013.

Extended abstracts:

• Analyzing OS noise for HPC systems - Alessandro Morari, Francesco Piermaria, Emiliano Betti, Marco Cesati, Roberto Gioiosa, Francisco J. Cazorla. In 6th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES), 2010.

• HPC System Software for Regular and Irregular Parallel Applications - Alessandro Morari, Mateo Valero. In 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013.

References

[1] http://www.top500.org/lists/2007/06. 100

[2] AMBER: Assisted model building with energy refinement. http://ambermd.org/. 32

[3] Apache Giraph. http://incubator.apache.org/giraph/. 93

[4] Intel Xeon Processor W5590. Available at http://ark.intel.com/products/41643/. 170

[5] NAMD scalable molecular dynamics. http://www.ks.uiuc.edu/Research/namd/. 32

[6] OSU Micro-Benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/. 102

[7] TPC BENCHMARK H. Available at http://www.tpc.org/tpch/spec/tpch2.15.0.pdf. 15

[8] Tesla C2050 and Tesla C2070 computing processor board. Board specification. Available at http://www.nvidia.com. 170

[9] Tilera TILEPro64. http://www.tilera.com/products/processors/TILEPRO64. 152

[10] TOP500 - top500 supercomputers list. http://www.top500.org/system/177790. xi, 3, 9, 10, 19

[11] Tilera Corporation. Tilera and Quanta unveil the world's most power efficient and highest compute density server. http://www.tilera.com/about_tilera/press-releases/tilera-and-quanta-unveil-worlds-most-power-efficient-and-highest-compute, June 2010. 152


[12] S. Alam, R. Barrett, M. Bast, M. R. Fahey, J. Kuehn, C. McCurdy, J. Rogers, P. Roth, R. Sankaran, J. S. Vetter, P. Worley, and W. Yu. Early evaluation of ibm bluegene/p. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, Piscataway, NJ, USA, 2008. IEEE Press. 49

[13] Saman Amarasinghe, Mary Hall, Pat McCormick, Richard Murphy, Keshav Pingali, Dan Quinlan, Vivek Sarkar, and John Shalf. Exascale programming challenge workshop. DOE ASCR, July 2011. 5

[14] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. The interaction of architecture and operating system design. In Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, ASPLOS-IV, pages 108–120, New York, NY, USA, 1991. ACM. 62

[15] Wendell Anderson, Preston Briggs, C. Stephen Hellberg, Daryl W. Hess, Alexei Khokhlov, Marco Lanzagorta, and Robert Rosenberg. Early Experience with Scientific Programs on the Cray MTA-2. In Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC ’03, pages 46–, New York, NY, USA, 2003. ACM. 90

[16] Jonathan Appavoo, Kevin Hui, Michael Stumm, Robert W. Wisniewski, Dilma Da Silva, Orran Krieger, and Craig A. N. Soules. An infrastructure for mul- tiprocessor run-time adaptation. In WOSS ’02: Proceedings of the first workshop on Self-healing systems, pages 3–8, New York, NY, USA, 2002. ACM. 35

[17] Edoardo Aprà, Alistair P. Rendell, Robert J. Harrison, Vinod Tipparaju, Wibe A. de Jong, and Sotiris S. Xantheas. Liquid water: obtaining the right answer for the right reasons. In SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–7, New York, NY, USA, 2009. ACM. 32

[18] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical report, NASA Advanced Supercomputing (NAS) Division, March 1994. 144

[19] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan. The influence of operating systems on the performance of collective operations at extreme scale. Cluster Computing, IEEE International Conference on, 0:1–12, 2006. 36

[20] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan. Operating system issues for petascale systems. SIGOPS Oper. Syst. Rev., 40(2):29–33, 2006. 36

[21] Pete Beckman, Kamil Iskra, Kazutomo Yoshii, Susan Coghlan, and Aroon Nataraj. Benchmarking the effects of operating system interference on extreme-scale parallel machines. Cluster Computing, 11(1):3–16, 2008. 36

[22] P.H. Beckman, K. Iskra, K. Yoshii, and S. Coghlan. The influence of operating systems on the performance of collective operations at extreme scale. In Proc. of the 2006 IEEE Int. Conf. on Cluster Computing, Barcelona, Spain, 2006. 35

[23] C. Bekas, A. Curioni, and I. Fedulova. Low cost high performance uncertainty quantification. In WHPCF ’09: Proceedings of the 2nd Workshop on High Performance Computational Finance, pages 1–8, New York, NY, USA, 2009. ACM. 32, 59

[24] Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 2008. xi, 11, 13, 59

[25] J. Y. Berthou. Final report on roadmap and recommendations development. Technical report, European Exascale Software Initiative, 2011. 3, 12, 13

[26] Emiliano Betti, Marco Cesati, Roberto Gioiosa, and Francesco Piermaria. A global operating system for HPC clusters. In CLUSTER, pages 1–10, 2009. 38

[27] Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. Shared last-level tlbs for chip multiprocessors. In HPCA, pages 62–63. IEEE Computer Society, 2011. 62

[28] Abhishek Bhattacharjee and Margaret Martonosi. Characterizing the tlb behavior of emerging parallel workloads on chip multiprocessors. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, pages 29–40, Washington, DC, USA, 2009. IEEE Computer Society. 62

[29] Abhishek Bhattacharjee and Margaret Martonosi. Inter-core cooperative tlb for chip multiprocessors. SIGARCH Comput. Archit. News, 38:359–370, March 2010. 62, 80

[30] Guy E. Blelloch. Vector models for data-parallel computing. MIT Press, Cambridge, MA, USA, 1990. 152, 158, 163

[31] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system, volume 30. ACM, 1995. 24

[32] Dan Bonachea. GASNet specification, v1.1. Technical Report CSD-02-1207, UC Berkeley, October 2002. 91, 101

[33] C. Boneti, F. Cazorla, R. Gioiosa, C-Y. Cher, A. Buyuktosunoglu, and M. Valero. Software-Controlled Priority Characterization of POWER5 Processor. In The 35th Int. Symp. on Computer Architecture (ISCA), Beijing, China, June 2008. 122, 124

[34] C. Boneti, R. Gioiosa, F. J. Cazorla, and M. Valero. A dynamic scheduler for balancing HPC applications. In SC ’08: Proc. of the 2008 ACM/IEEE Conference on Supercomputing, 2008. 49, 122, 124, 131, 142

[35] Carlos Boneti, Roberto Gioiosa, Francisco J. Cazorla, Julita Corbalán, Jesús Labarta, and Mateo Valero. Balancing HPC applications through smart allocation of resources in MT processors. In Proceedings of the 22nd international conference on Parallel and distributed processing, IPDPS ’08, pages 1–12, Miami, Florida, USA, 2008. 124

[36] D.P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly, 3rd edition, 2005. 38, 49, 51, 69

[37] Javier Bueno, Luis Martinell, Alejandro Duran, Montse Farreras, Xavier Martorell, Rosa M Badia, Eduard Ayguade, and Jesús Labarta. Productive cluster programming with ompss. In Euro-Par 2011 Parallel Processing, pages 555–566. Springer, 2011. 25

[38] David R Butenhof. Programming with POSIX threads. Addison-Wesley Professional, 1997. 25

[39] Laura Carrington, Dimitri Komatitsch, Michael Laurenzano, Mustafa M. Tikir, David Michea, Nicolas Le Goff, Allan Snavely, and Jeroen Tromp. High-frequency simulations of global seismic wave propagation using SPECFEM3D GLOBE on 62k processors. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, Piscataway, NJ, USA, 2008. IEEE Press. 32

[40] Umit V. Catalyurek, Florin Dobrian, Assefaw Gebremedhin, Mahantesh Halappanavar, and Alex Pothen. Distributed-memory parallel algorithms for matching and coloring. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’11, pages 1971–1980, Washington, DC, USA, 2011. IEEE Computer Society. 114

[41] Vincent Cavé, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. Habanero-java: the new adventures of old x10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, pages 51–61. ACM, 2011. 24

[42] F.J. Cazorla, P.M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable performance in SMT processors: Synergy between the OS and SMTs. IEEE Trans. Comput., 55(7):785–799, 2006. 124

[43] Francisco J. Cazorla, Alex Ramirez, Mateo Valero, Peter M. W. Knijnenburg, Rizos Sakellariou, and Enrique Fernández. QoS for High-Performance SMT Processors in Embedded Systems. IEEE Micro, 24:24–31, July 2004. 124

[44] Intel Open Source Technology Center. Open community runtime. http://01.org/open-community-runtime. 93

[45] Soumen Chakrabarti and Katherine Yelick. Implementing an irregular application on a distributed memory multiprocessor. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPOPP ’93, pages 169–178, New York, NY, USA, 1993. ACM. 90

[46] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the chapel language. Int. J. High Perform. Comput. Appl., 21(3):291–312, 2007. 24, 91

[47] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 519–538, New York, NY, USA, 2005. ACM. 26, 91

[48] J. Chen, A. Choudhary, S. Feldman, B. Hendrickson, C.R. Johnson, R. Mount, V. Sarkar, V. White, and D. Williams. Synergistic challenges in data-intensive science and exascale computing. DOE ASCAC data subcommittee report, March 2013. 3, 14, 16, 85

[49] Guojing Cong, Gheorghe Almasi, and Vijay Saraswat. Fast PGAS connected components algorithms. In Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, PGAS ’09, pages 13:1–13:6, New York, NY, USA, 2009. ACM. 91

[50] Rich Cook, Evi Dube, Ian Lee, Lee Nau, Charles Shered, and Felix Wang. Survey of novel programming models for parallelizing applications at exascale. Technical report, Lawrence Livermore National Laboratory, 2011. 24

[51] Cray Inc. Urika Big Data Graph Appliance. http://www.cray.com/Products/BigData/uRiKA.aspx, April 2013. 90

[52] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. J. Parallel Distrib. Comput., 22(3):462–478, September 1994. 91

[53] P. De, R. Kothari, and V. Mann. Identifying sources of operating system jitter through fine-grained kernel instrumentation. In Proc. of the 2007 IEEE Int. Conf. on Cluster Computing, Austin, Texas, 2007. 34, 35

[54] Pradipta De, Vijay Mann, and Umang Mittaly. Handling OS jitter on multicore multithreaded systems. In IPDPS ’09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1–12, Washington, DC, USA, 2009. IEEE Computer Society. 36

[55] Mathieu Desnoyers and Michel R. Dagenais. The lttng tracer: A low impact performance and behavior monitor for gnu/linux. In Proceedings of the 2006 Linux Symposium, 2006. 33, 39

[56] Mathieu Desnoyers and Michel R. Dagenais. LTTng, filling the gap between kernel instrumentation and a widely usable kernel tracer. In Linux Foundation Collaboration Summit 2009 (LFCS 2009), April 2009. 33

[57] Matthew DeVuyst, Rakesh Kumar, and Dean M. Tullsen. Exploiting unbalanced thread scheduling for energy and performance on a cmp of smt processors. In Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS ’06, pages 140–140, Washington, DC, USA, 2006. 122, 124

[58] Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, et al. The international exascale software project roadmap. International Journal of High Performance Computing Applications, 25(1):3–60, 2011. 6, 59

[59] Gautham K. Dorai and Donald Yeung. Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, PACT ’02, pages 30–, Washington, DC, USA, 2002. 146

[60] Tarek El-Ghazawi and Lauren Smith. Upc: unified parallel c. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 27. ACM, 2006. 26, 60, 91

[61] Dawson R. Engler and M. Frans Kaashoek. Exterminate all operating system abstractions. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS-V), pages 78–83, Orcas Island, Washington, May 1995. IEEE Computer Society. 35

[62] Dawson R. Engler, M. Frans Kaashoek, and James W. O’Toole, Jr. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP ’95), pages 251–266, Copper Mountain Resort, Colorado, December 1995. 35

[63] Dawson R. Engler, M. Frans Kaashoek, and James W. O’Toole, Jr. The operating system kernel as a secure programmable machine. Operating Systems Review, 29(1):78–82, January 1995. 35

[64] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365–376, 2011. 12

[65] Stijn Eyerman and Lieven Eeckhout. Per-thread cycle accounting in smt processors. In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, ASPLOS ’09, pages 133–144, New York, NY, USA, 2009. 124

[66] John Feo, David Harper, Simon Kahan, and Petr Konecny. ELDORADO. In CF ’05: Proceedings of the 2nd conference on Computing frontiers, pages 28–34, New York, NY, USA, 2005. ACM Press. 86, 90

[67] John Feo, Oreste Villa, Antonino Tumeo, and Simone Secchi. Irregular applications: architectures & algorithms. In Proceedings of the first workshop on Irregular applications: architectures and algorithm, IAAA ’11, pages 1–2, New York, NY, USA, 2011. ACM. 4

[68] Kurt B. Ferreira, Patrick G. Bridges, and Ron Brightwell. Characterizing application sensitivity to OS interference using kernel-level noise injection. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press. 31, 34, 35, 60, 71, 75, 77, 80

[69] MPI Forum. MPI: A message passing interface standard. 8, 1994. 32

[70] G.R. Gao, T. Sterling, R. Stevens, M. Hereld, and Weirong Zhu. Parallex: A study of a new parallel computation model. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–6, 2007. 25, 92

[71] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. Experiences with a lightweight supercomputer kernel: Lessons learned from blue gene’s cnk. 2010. 32, 35, 36, 46

[72] M. E. Giampapa, R. Bellofatto, M. A. Blumrich, D. Chen, M. B. Dombrowa, A. Gara, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, B. J. Nathanson, B. D. Steinmacher-Burow, M. Ohmacht, V. Salapura, and P. Vranas. Blue Gene/L advanced diagnostics environment. IBM Journal of Research and Development, 2005. 60

[73] B. Gibbs, B. Atyam, F. Berres, B. Blanchard, L. Castillo, P. Coelho, N. Guerin, L. Liu, C. Diniz Maciel, and C. Thirumalai. Advanced POWER Virtualization on IBM eServer p5 Servers: Architecture and Performance Considerations. IBM Redbook, 2005. 128

[74] R. Gioiosa, S. McKee, and M. Valero. Designing os for hpc applications: Scheduling. Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2010), September 2010. 35, 36, 49

[75] R. Gioiosa, F. Petrini, K. Davis, and F. Lebaillif-Delamare. Analysis of system overhead on parallel computers. In The 4th IEEE Int. Symp. on Signal Processing and Information Technology (ISSPIT 2004), Rome, Italy, December 2004. http://bravo.ce.uniroma2.it/home/gioiosa/pub/isspit04.pdf. 31, 34, 35, 38, 46, 52, 60, 75, 77, 80

[76] R. Giorgi, C.A. Prete, G. Prina, and L. Ricciardi. Trace factory: generating workloads for trace-driven simulation of shared-bus multiprocessors. Concurrency, IEEE, 5(4):54–68, Oct 1997. 35

[77] Roberto Giorgi. Teraflux: exploiting dataflow parallelism in teradevices. In Conf. Computing Frontiers, pages 303–304, 2012. 27

[78] GNU. GNU libc manual. http://www.gnu.org/software/libc/manual/html_node/System-V-contexts.html. 105

[79] David B. Golub, Daniel P. Julin, Richard F. Rashid, Richard P. Draves, Randall W. Dean, Alessandro Forin, Joseph Barrera, Hideyuki Tokuda, Gerald Malan, and David Bohman. Microkernel operating system architecture and mach. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, pages 11–30, 1992. 19

[80] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, OSDI’12, pages 17–30, Berkeley, CA, USA, 2012. USENIX Association. 86

[81] The graph 500 list. http://www.graph500.org, April 2013. 110

[82] Fei Guo, Yan Solihin, Li Zhao, and Ravishankar Iyer. A framework for providing quality of service in chip multi-processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 343–355, Washington, DC, USA, 2007. 124

[83] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2010. 60

[84] Bingsheng He, Naga K. Govindaraju, Qiong Luo, and Burton Smith. Efficient gather and scatter operations on graphics processors. In SC ’07: ACM/IEEE conference on Supercomputing, pages 46:1–46:12, 2007. 158

[85] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann, 2011. 16

[86] John L. Henning. Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, September 2006. 15, 123, 144

[87] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. SIGARCH Comput. Archit. News, 21(2):289–300, May 1993. 60

[88] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, December 1986. 163

[89] T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10), Nov. 2010. 5

[90] Jerry Huck and Jim Hays. Architectural support for translation table management in large address space machines. In Proceedings of the 20th annual international symposium on computer architecture, ISCA ’93, pages 39–50, New York, NY, USA, 1993. ACM. 62

[91] IBM. PowerPC Architecture book: Book III: PowerPC Operating Environment Architecture. 128

[92] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. Qos policies and architecture for cache/memory in cmp platforms. SIGMETRICS Perform. Eval. Rev., 35:25–36, June 2007. 124

[93] Guohua Jin, J. Mellor-Crummey, L. Adhianto, W.N. Scherer, and Chaoran Yang. Implementation and Performance Evaluation of the HPC Challenge Benchmarks in Coarray Fortran 2.0. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1089–1100, May 2011. 91

[94] H. Jin and R.F. Van der Wijngaart. Performance characteristics of the multi-zone nas parallel benchmarks. J. Parallel Distrib. Comput., 66(5):674–685, 2006. 145

[95] Terry Jones, Shawn Dawson, Rob Neel, William Tuel, Larry Brenner, Jeffrey Fier, Robert Blackmore, Patrick Caffrey, Brian Maskell, Paul Tomlinson, and Mark Roberts. Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 10, Washington, DC, USA, 2003. IEEE Computer Society. 35, 36

[96] Laxmikant V Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++, volume 28. ACM, 1993. 24, 60, 92

[97] R. Kalla, B. Sinharoy, and J.M. Tendler. Ibm power5 chip: A dual-core multithreaded processor. IEEE Micro, 24:40–47, 2004. 122, 139

[98] Gokul B. Kandiraju and Anand Sivasubramaniam. Characterizing the d-tlb behavior of spec cpu2000 benchmarks. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 129–139. ACM Press, 2002. 62

[99] Henry Kasim, Verdi March, Rita Zhang, and Simon See. Survey on parallel programming model. In Proceedings of the IFIP International Conference on Network and Parallel Computing, NPC ’08, pages 266–275, Berlin, Heidelberg, 2008. Springer-Verlag. 22

[100] Darren J. Kerbyson and Philip W. Jones. A performance model of the parallel ocean program. International Journal of High Performance Computing Applications, 19, 2005. 32, 59

[101] Khronos Group. The OpenCL Specification, September 2010. 25

[102] Rob Knauerhase, Romain Cledat, and Justin Teller. For extreme parallelism, your os is sooooo last-millennium. In Proceedings of the 4th USENIX conference on Hot Topics in Parallelism, HotPar’12, pages 3–3, Berkeley, CA, USA, 2012. USENIX Association. 4

[103] Jesús Labarta. Starss: A programming model for the multicore era. In PRACE Workshop New Languages & Future Technology Prototypes at the Leibniz Supercomputing Centre in Garching (Germany), 2010. 25

[104] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Z. Cui, L. Xia, P. Bridges, A. Gocke, S. Jaconette, M. Levenhagen, and R. Brightwell. Palacios and kitten: New high performance operating systems for scalable virtualized and native supercomputing. Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010. 32, 33, 60

[105] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O’Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden. IBM POWER6 microarchitecture. IBM J. Res. Dev., 51:639–662, November 2007. 122, 125, 126, 140

[106] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. Mediabench: a tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, MICRO 30, pages 330–335, Washington, DC, USA, 1997. IEEE Computer Society. 15

[107] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with MapReduce: a survey. SIGMOD Rec., 40(4):11–20, January 2012. 93

[108] Shin-Jae Lee, Minsoo Jeon, Dongseung Kim, and Andrew Sohn. Partitioned parallel radix sort. J. Parallel Distrib. Comput., 62:656–668, April 2002. 158

[109] Jochen Liedtke. Improving IPC by kernel design. In SOSP ’93: Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 175–188, New York, NY, USA, 1993. ACM. 35

[110] LLNL. Sequoia benchmarks. https://asc.llnl.gov/sequoia/benchmarks/. 18, 32, 34, 43, 59, 61

[111] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, April 2012. 86, 93

[112] Carlos Luque, Miquel Moreto, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, and Mateo Valero. Cpu accounting in cmp processors. IEEE Comput. Archit. Lett., 8:17–20, January 2009. 124

[113] Carlos Luque, Miquel Moreto, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, and Mateo Valero. Itca: Inter-task conflict-aware cpu accounting for cmps. In Proceedings of the 2009 International Conference on Parallel Architectures and Compilation Techniques, PACT ’09, pages 203–213, Washington, DC, USA, 2009. 124

[114] Zhiqiang Ma and Lin Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. In CLOUD COMPUTING 2010: the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, 2010. 93

[115] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD ’10: ACM International Conference on Management of data, pages 135–146, 2010. 93

[116] Collin McCurdy, Alan L. Cox, and Jeffrey Vetter. Investigating the tlb behavior of high-end scientific applications on commodity microprocessors. In Proceedings of the ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software, pages 95–104, Washington, DC, USA, 2008. IEEE Computer Society. 62

[117] Duane Merrill and Andrew Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02):245–272, 2011. 158, 169

[118] Ingo Molnár. [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]. linux-kernel mailing list, 2007. http://lwn.net/Articles/230501/. 48

[119] José E. Moreira, Michael Brutman, José Castaños, Thomas Engelsiepen, Mark Giampapa, Tom Gooding, Roger Haskin, Todd Inglett, Derek Lieber, Pat McCarthy, Mike Mundy, Jeff Parker, and Brian Wallenfelt. Designing a highly-scalable operating system: the Blue Gene/L story. In SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 118, New York, NY, USA, 2006. ACM. 35, 36, 60

[120] Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Rizos Sakellariou, and Mateo Valero. FlexDCP: a QoS framework for CMP architectures. SIGOPS Oper. Syst. Rev., 43:86–96, April 2009. 125

[121] MPI forum. The Message Passing Interface (MPI) standard. http://www.mpi-forum.org/. 24, 32

[122] A. Munir, S. Ranka, and A. Gordon-Ross. High-performance energy-efficient multicore embedded computing. Parallel and Distributed Systems, IEEE Transactions on, 23(4):684–700, April 2012. 170

[123] Jens Muttersbach, Thomas Villiger, and Wolfgang Fichtner. Practical design of globally-asynchronous locally-synchronous systems. In Advanced Research in Asynchronous Circuits and Systems, 2000 (ASYNC 2000), Proceedings of the Sixth International Symposium on, pages 52–59. IEEE, 2000. 26

[124] NASA. NAS parallel benchmarks. http://www.nas.nasa.gov/Resources/Software/npb.html. 32

[125] Aroon Nataraj and Matthew Sottile. The ghost in the machine: Observing the effects of kernel operation on parallel application performance. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007. 34

[126] Jacob Nelson, Brandon Myers, A. H. Hunter, Preston Briggs, Luis Ceze, Carl Ebeling, Dan Grossman, Simon Kahan, and Mark Oskin. Crunching large graphs with commodity processors. In Proceedings of the 3rd USENIX conference on Hot topic in parallelism, HotPar’11, pages 10–10, Berkeley, CA, USA, 2011. USENIX Association. 92

[127] Kyle J. Nesbit, James Laudon, and James E. Smith. Virtual private caches. In Proceedings of the 34th annual international symposium on Computer architecture, ISCA ’07, pages 57–68, New York, NY, USA, 2007. 124

[128] J. Nieplocha, V. Tipparaju, M. Krishnan, and D. K. Panda. High Performance Remote Memory Access Communication: The ARMCI Approach. Int. J. High Perform. Comput. Appl., 20(2):233–253, 2006. 60, 91

[129] Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, and Edoardo Aprà. Advances, applications and performance of the global arrays shared memory programming toolkit. Int. J. High Perform. Comput. Appl., 20(2):203–231, May 2006. 24

[130] Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, and Edoardo Aprà. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit. Int. J. High Perform. Comput. Appl., 20(2):203–231, 2006. 91

[131] Robert W Numrich and John Reid. Co-array fortran for parallel programming. In ACM Sigplan Fortran Forum, volume 17, pages 1–31. ACM, 1998. 24

[132] Robert W. Numrich and John Reid. Co-arrays in the next Fortran Standard. SIGPLAN Fortran Forum, 24(2):4–17, August 2005. 91

[133] NVIDIA. NVIDIA CUDA Programming Guide 2.0. 2008. 23

[134] OpenMP Architecture Review Board. The OpenMP specification for parallel programming. Available at http://www.openmp.org. 25, 32, 60

[135] Yoonho Park, Eric Van Hensbergen, Marius Hillenbrand, Todd Inglett, Bryan Rosenburg, Kyung Dong Ryu, and Robert Wisniewski. Poster: Fusedos: A hybrid approach to exascale operating systems. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC ’12, pages 1417–, Washington, DC, USA, 2012. IEEE Computer Society. 5

[136] Krzysztof Parzyszek, Jarek Nieplocha, and Ricky A Kendall. A generalized portable shmem library for high performance computing. Technical report, Ames Lab., Ames, IA (US), 2000. 26

[137] F. Petrini, D. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In ACM/IEEE SC2003, Phoenix, Arizona, November 10–16, 2003. 4, 31, 34, 35, 46, 60, 75, 77, 80

[138] Vincent Pillet, Jesús Labarta, Toni Cortes, and Sergi Girona. Paraver: A tool to visualize and analyze parallel code. Technical report, WoTUG-18, 1995. 33, 38, 40

[139] Nauman Rafique, Won-Taek Lim, and Mithuna Thottethodi. Architectural support for operating system-driven cmp cache management. In Proceedings of the 15th international conference on Parallel architectures and compilation techniques, PACT ’06, pages 2–12, New York, NY, USA, 2006. 124

[140] Rolf Riesen, Ron Brightwell, Patrick G. Bridges, Trammell Hudson, Arthur B. Maccabe, Patrick M. Widener, and Kurt Ferreira. Designing and implementing lightweight kernels for capability computing. Concurr. Comput.: Pract. Exper., 21(6):793–817, 2009. 4, 32, 35, 60

[141] Mendel Rosenblum, Edouard Bugnion, Stephen Alan Herrod, Emmett Witchel, and Anoop Gupta. The impact of architectural trends on operating system performance. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 285–298, 1995. 62

[142] Vivek Sarkar and John Hennessy. Partitioning parallel programs for macro-dataflow. In Proceedings of the 1986 ACM Conference on LISP and Functional Programming, LFP ’86, pages 202–211, New York, NY, USA, 1986. ACM. 22

[143] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore gpus. In IPDPS ’09: IEEE International Symposium on Parallel and Distributed Processing, pages 1–10, 2009. 158

[144] Nadathur Satish, Changkyu Kim, Jatin Chhugani, D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort. In SIGMOD ’10: ACM SIGMOD International conference on Management of data, pages 351–362, 2010. 152, 158, 170

[145] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast Sort on CPUs, GPUs and Intel MIC Architectures. Technical report, Intel Labs, 2010. 158

[146] T. Shanley. InfiniBand Network Architecture. Mindshare, Inc., 2002. 49

[147] Nir Shavit and Dan Touitou. Software transactional memory. Distributed Com- puting, 10(2):99–116, 1997. 60

[148] Edi Shmueli, George Almasi, José Brunheroto, José Castaños, Gabor Dozsa, Sameer Kumar, and Derek Lieber. Evaluating the effect of replacing CNK with Linux on the compute-nodes of Blue Gene/L. In ICS ’08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 165–174, New York, NY, USA, 2008. ACM. 5, 31, 36, 60, 63, 65, 74, 77

[149] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4/5):505–521, 2005. 122, 125, 126, 134

[150] Allan Snavely, Larry Carter, Jay Boisseau, Amit Majumdar, Kang Su Gatlin, Nick Mitchell, John Feo, and Brian Koblenz. Multi-processor performance on the Tera MTA. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’98, pages 1–8, Washington, DC, USA, 1998. IEEE Computer Society. 90

[151] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS ’02, pages 66–76, New York, NY, USA, 2002. 124, 129

[152] Andrew Sohn and Yuetsu Kodama. Load balanced parallel radix sort. In ICS ’98: international conference on Supercomputing, pages 305–312, 1998. 158

[153] Marco Solinas, Rosa M. Badia, François Bodin, Albert Cohen, Paraskevas Evripidou, Paolo Faraboschi, Bernhard Fechner, Guang R. Gao, Arne Garbade, Sylvain Girbal, Daniel Goodman, Behran Khan, Souad Koliai, Feng Li, Mikel Luján, Laurent Morin, Avi Mendelson, Nacho Navarro, Antoniu Pop, Pedro Trancoso, Theo Ungerer, Mateo Valero, Sebastian Weis, Ian Watson, Stéphane Zuckermann, and Roberto Giorgi. The teraflux project: Exploiting the dataflow paradigm in next generation teradevices. In Proceedings of the 2013 Euromicro Conference on Digital System Design, DSD ’13, pages 272–279, Washington, DC, USA, 2013. IEEE Computer Society. 4

[154] M. Sottile and R. Minnich. Analysis of microbenchmarks for performance tuning of clusters. In CLUSTER ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 371–377, Washington, DC, USA, 2004. IEEE Computer Society. 7, 38, 56

[155] P. Terry, A. Shan, and P. Huttunen. Improving application performance on HPC systems with process synchronization. Linux Journal, November 2004. http://www.linuxjournal.com/article/7690. 35

[156] D. Tsafrir, Y. Etsion, D.G. Feitelson, and S. Kirkpatrick. System noise, OS clock ticks, and fine-grained parallel applications. In ICS ’05: Proc. of the 19th Annual International Conference on Supercomputing, pages 303–312, New York, NY, USA, 2005. ACM Press. 31, 34, 35, 38, 52

[157] Nathan Tuck and Dean M. Tullsen. Initial observations of the simultaneous multithreading pentium 4 processor. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, PACT ’03, pages 26–, Washington, DC, USA, 2003. 124

[158] Dean M. Tullsen and Jeffery A. Brown. Handling long-latency loads in a simultaneous multithreading processor. In Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, MICRO 34, pages 318–327, Washington, DC, USA, 2001. 124

[159] Antonino Tumeo, Simone Secchi, and Oreste Villa. Designing Next-Generation massively multithreaded architectures for irregular applications. Computer, 45(8):53–61, August 2012. 4

[160] Richard Uhlig, David Nagle, Tim Stanley, Trevor Mudge, Stuart Sechrest, and Richard Brown. Design tradeoffs for software-managed tlbs. ACM Trans. Comput. Syst., 12:175–205, August 1994. 62

[161] UPC Consortium. UPC Language Specifications v. 1.2. www.gwu.edu/~upc/docs/upc_specs_1.2.pdf, May 2005. 60

[162] Javier Vera, Francisco J. Cazorla, Alex Pajuelo, Oliverio J. Santana, Enrique Fernández, and Mateo Valero. Measuring the performance of multithreaded processors. In 2007 SPEC Benchmark Workshop, Austin, TX, USA, 2007. 129, 131

[163] Javier Vera, Francisco J. Cazorla, Alex Pajuelo, Oliverio J. Santana, Enrique Fernández, and Mateo Valero. Fame: Fairly measuring multithreaded architectures. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT ’07, pages 305–316, Washington, DC, USA, 2007. 129, 131

[164] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th annual international symposium on Computer architecture, ISCA ’92, pages 256–266, New York, NY, USA, 1992. ACM. 23

[165] Jan Wassenberg and Peter Sanders. Engineering a multi-core radix sort. In EuroPar ’11: international conference on Parallel processing - Part II, pages 160–169, 2011. 159

[166] Jeremiah James Willcock, Torsten Hoefler, Nicholas Gerard Edmonds, and Andrew Lumsdaine. Active pebbles: parallel programming for data-driven applications. In Proceedings of the international conference on Supercomputing, ICS ’11, pages 235–244, New York, NY, USA, 2011. ACM. 23, 92

[167] Shixiong Xu and Li Chen. Shared work list: hacking amorphous data parallelism in UPC. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM ’12, pages 124–133, New York, NY, USA, 2012. ACM. 91

[168] K. Yelick, P. Hilfinger, S. Graham, D. Bonachea, J. Su, A. Kamil, K. Datta, P. Colella, and T. Wen. Parallel Languages and Compilers: Perspective from the Titanium Experience. Int. J. High Perform. Comput. Appl., 21(3):266–290, August 2007. 91

[169] Katherine A. Yelick. Programming models for irregular applications. SIGPLAN Not., 28:28–31, January 1993. 86

[170] Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phil Colella, et al. Titanium: A high-performance java dialect. Concurrency Practice and Experience, 10(11-13):825–836, 1998. 26
