On the Realization of Fine-Grained Multithreading in Software
Total Page:16
File Type:pdf, Size:1020Kb
Andreas Gr¨avinghoff On the Realization of Fine- Grained Multithreading in Software II III F¨urdie Seerosenprinzessin IV V "After a rare speech at the National Center for Atmospheric Research in Boulder, Colorado, in 1976, programmers in the audience had suddenly fallen silent when Cray offered to an- swer questions. He stood there for several minutes, waiting for their queries, but none came. When he left, the head of NCAR's computing division chided the programmers. 'Why didn't someone raise a hand ?' After a tense moment, one pro- grammer replied, 'How do you talk to God?' " from The SUPERMEN VI Acknowledgments I am grateful to Prof. Keller for our successful cooperation during the last five years, especially for providing the time to complete this work by freeing me from other tasks. In addition, this work would not have been possible without the necessary equipment, e.g. the Compaq workstation that was for development and evaluation of emulated multithreading. I am grateful to Prof. Ungerer from the Universit¨atAugsburg for reviewing this thesis and our interesting discussions during the course of this work. I thank the following people and organizations for their support dur- ing the course of this work: the HLRS in Stuttgart and the NIC in J¨ulich for granting access to their Cray T3E computer systems - software devel- opment and evaluation would have been impossible without access to these machines. The people from the Gastdozentenhaus "Heinrich Hertz" at the Universit¨atKarlsruhe for guarding me while writing the first draft. Peter Bach and Michael Bosch for providing useful comments on an early draft - I am really looking forward to working with them again at ETAS GmbH in Stuttgart. Matthias M¨ullerfrom the Universit¨atKarlsruhe for interesting dis- cussions about the E-registers in the Cray T3E. The Universitat Politecnica de Catalunya in Spain, namely Ernest Artiaga, for providing an implementa- tion of the PARMACS macros based on POSIX threads that was used during the evaluation of emulated multithreading on the Compaq workstation. The Informatikrechnerbereich in Hagen, namely Enno deVries, for providing the DECcampus software used during software development. My colleagues from the chair of Prof. Schiffmann for the good teamwork even in times of stress. My heartfelt thanks go to my family, Lutz, Ulrike and Tina Gr¨avinghoff for their support and encouragement during all these years. Last but not least, I wish to thank my fianc´eeKerstin Bernhardt for her outstanding support even while pursuing her own PhD in psychology. VIII Acknowledgments Contents 1. Introduction :::::::::::::::::::::::::::::::::::::::::::::: 1 1.1 Trends in Sequential Computing . 1 1.1.1 Dependencies & Hazards . 4 1.1.2 Dependency Removal . 7 1.1.3 Caches & Main Memory . 10 1.2 Trends in Parallel Computing . 14 1.3 Latency Tolerance . 18 1.4 Multithreading . 21 1.4.1 Hardware Multithreading . 22 1.4.2 Software Multithreading . 28 1.4.3 Summary. 31 1.5 Outline . 32 2. Emulated Multithreading ::::::::::::::::::::::::::::::::: 33 2.1 Design Preferences . 33 2.1.1 Multithreaded Processor Model . 33 2.1.2 Context Switch Strategies . 35 2.1.3 Context Switch Overhead . 37 2.2 Basic Concept . 39 2.2.1 Assumptions . 40 2.2.2 Data Structures . 40 2.2.3 Emulation Library . 41 2.2.4 Code Conversion . 42 2.3 Performance Issues . 45 2.3.1 Number of Threads . 46 2.3.2 Caches . 47 2.3.3 Branch Prediction . 50 2.3.4 Code Scheduling . 54 2.3.5 Out-of-order Execution. 54 2.4 Architecture Support . 55 X Contents 3. Implementation ::::::::::::::::::::::::::::::::::::::::::: 61 3.1 Introduction . 61 3.2 Design Flow . 62 3.3 High-Level Language Converter . 64 3.3.1 Configuration File . 64 3.3.2 Conversion Tasks . 65 3.3.3 Implementation . 65 3.4 Emulation Library . 67 3.4.1 Thread Initialization Routines . 68 3.4.2 Thread Execution Routines . 69 3.4.3 Communication Routines . 69 3.5 Assembler converter . 71 3.5.1 Configuration . 72 3.5.2 Lexer & Parser . 76 3.5.3 Basic Blocks . 78 3.5.4 Super Blocks . 81 3.5.5 External Calls . 90 3.5.6 Data-Flow Analysis . 93 3.5.7 Register Allocation . 100 3.5.8 Code Conversion . 121 3.5.9 Statistics . 122 3.6 Register Partitioning . 124 3.7 Platform . 125 3.8 Compiler Integration . 129 4. Benchmarks :::::::::::::::::::::::::::::::::::::::::::::: 133 4.1 Benchmark Suites . 133 4.1.1 LINPACK . 134 4.1.2 LFK . 134 4.1.3 ParkBench . 135 4.1.4 NPB . 138 4.1.5 Perfect Club . 139 4.1.6 SPLASH2 . 140 4.1.7 Summary. 140 4.2 SPLASH2 Benchmark Suite. 141 4.2.1 The FFT Kernel . 142 4.2.2 The LU Kernel . 145 4.2.3 The Radix Kernel . 148 4.2.4 The Ocean Application. 151 4.2.5 The Barnes application . 155 4.2.6 The FMM application . 158 Contents XI 5. Evaluation : Compaq XP1000 ::::::::::::::::::::::::::::: 165 5.1 Compaq XP1000 . 166 5.1.1 Processor . 167 5.1.2 Cchip . 168 5.1.3 Dchip . 169 5.1.4 Pchip . 170 5.1.5 Memory . 170 5.1.6 Peripherals . 171 5.1.7 Software Environment . 171 5.2 Methodology . 172 5.3 Code Conversion . 175 5.4 FFT . 179 5.5 LU . 183 5.6 Radix . 186 5.7 Ocean . 190 5.8 Barnes . 193 5.9 FMM . ..