Andreas Grävinghoff

On the Realization of Fine-Grained Multithreading in Software

Für die Seerosenprinzessin

"After a rare speech at the National Center for Atmospheric Research in Boulder, Colorado, in 1976, programmers in the audience had suddenly fallen silent when Cray offered to answer questions. He stood there for several minutes, waiting for their queries, but none came. When he left, the head of NCAR's computing division chided the programmers. 'Why didn't someone raise a hand?' After a tense moment, one programmer replied, 'How do you talk to God?'"
from The SUPERMEN

Acknowledgments

I am grateful to Prof. Keller for our successful cooperation during the last five years, especially for providing the time to complete this work by freeing me from other tasks. In addition, this work would not have been possible without the necessary equipment, e.g. the Compaq workstation that was used for the development and evaluation of emulated multithreading. I am grateful to Prof. Ungerer from the Universität Augsburg for reviewing this thesis and for our interesting discussions during the course of this work. I thank the following people and organizations for their support during the course of this work: the HLRS in Stuttgart and the NIC in Jülich for granting access to their Cray T3E computer systems - software development and evaluation would have been impossible without access to these machines. The people from the Gastdozentenhaus "Heinrich Hertz" at the Universität Karlsruhe for guarding me while writing the first draft. Peter Bach and Michael Bosch for providing useful comments on an early draft - I am really looking forward to working with them again at ETAS GmbH in Stuttgart. Matthias Müller from the Universität Karlsruhe for interesting discussions about the E-registers in the Cray T3E. The Universitat Politecnica de Catalunya in Spain, namely Ernest Artiaga, for providing an implementation of the PARMACS macros based on POSIX threads that was used during the evaluation of emulated multithreading on the Compaq workstation. The Informatikrechnerbereich in Hagen, namely Enno deVries, for providing the DECcampus software used during software development. My colleagues from the chair of Prof. Schiffmann for the good teamwork even in times of stress. My heartfelt thanks go to my family, Lutz, Ulrike and Tina Grävinghoff, for their support and encouragement during all these years. Last but not least, I wish to thank my fiancée Kerstin Bernhardt for her outstanding support even while pursuing her own PhD in psychology.

Contents

1. Introduction
   1.1 Trends in Sequential Computing
      1.1.1 Dependencies & Hazards
      1.1.2 Dependency Removal
      1.1.3 Caches & Main Memory
   1.2 Trends in Parallel Computing
   1.3 Latency Tolerance
   1.4 Multithreading
      1.4.1 Hardware Multithreading
      1.4.2 Software Multithreading
      1.4.3 Summary
   1.5 Outline

2. Emulated Multithreading
   2.1 Design Preferences
      2.1.1 Multithreaded Processor Model
      2.1.2 Context Switch Strategies
      2.1.3 Context Switch Overhead
   2.2 Basic Concept
      2.2.1 Assumptions
      2.2.2 Data Structures
      2.2.3 Emulation Library
      2.2.4 Code Conversion
   2.3 Performance Issues
      2.3.1 Number of Threads
      2.3.2 Caches
      2.3.3 Branch Prediction
      2.3.4 Code Scheduling
      2.3.5 Out-of-order Execution
   2.4 Architecture Support

3. Implementation
   3.1 Introduction
   3.2 Design Flow
   3.3 High-Level Language Converter
      3.3.1 Configuration File
      3.3.2 Conversion Tasks
      3.3.3 Implementation
   3.4 Emulation Library
      3.4.1 Thread Initialization Routines
      3.4.2 Thread Execution Routines
      3.4.3 Communication Routines
   3.5 Assembler Converter
      3.5.1 Configuration
      3.5.2 Lexer & Parser
      3.5.3 Basic Blocks
      3.5.4 Super Blocks
      3.5.5 External Calls
      3.5.6 Data-Flow Analysis
      3.5.7 Register Allocation
      3.5.8 Code Conversion
      3.5.9 Statistics
   3.6 Register Partitioning
   3.7 Platform
   3.8 Compiler Integration

4. Benchmarks
   4.1 Benchmark Suites
      4.1.1 LINPACK
      4.1.2 LFK
      4.1.3 ParkBench
      4.1.4 NPB
      4.1.5 Perfect Club
      4.1.6 SPLASH2
      4.1.7 Summary
   4.2 SPLASH2 Benchmark Suite
      4.2.1 The FFT Kernel
      4.2.2 The LU Kernel
      4.2.3 The Radix Kernel
      4.2.4 The Ocean Application
      4.2.5 The Barnes Application
      4.2.6 The FMM Application

5. Evaluation: Compaq XP1000
   5.1 Compaq XP1000
      5.1.1 Processor
      5.1.2 Cchip
      5.1.3 Dchip
      5.1.4 Pchip
      5.1.5 Memory
      5.1.6 Peripherals
      5.1.7 Software Environment
   5.2 Methodology
   5.3 Code Conversion
   5.4 FFT
   5.5 LU
   5.6 Radix
   5.7 Ocean
   5.8 Barnes
   5.9 FMM
   5.10 Summary

6. Evaluation: Cray T3E
   6.1 Cray T3E
      6.1.1 Processor
      6.1.2 Memory
      6.1.3 Network
      6.1.4 Input/Output
      6.1.5 Software
   6.2 Methodology
   6.3 FFT
   6.4 LU
   6.5 Radix
   6.6 Ocean
   6.7 Barnes
   6.8 Summary

7. Conclusions

A. Alpha Architecture & Implementations
   A.1 Introduction
      A.1.1 VAX Architecture
      A.1.2 Digital RISC Projects
      A.1.3 Design Goals
   A.2 Alpha Architecture
      A.2.1 Architecture State
      A.2.2 Address, Data and Instruction Formats
      A.2.3 Instruction Set
      A.2.4 PALcode
   A.3 Implementations
      A.3.1 Alpha 21064
      A.3.2 Alpha 21064A
      A.3.3 Alpha 21066
      A.3.4 Alpha 21068
      A.3.5 Alpha 21066A
      A.3.6 Alpha 21164
      A.3.7 Alpha 21164A
      A.3.8 Alpha 21164PC
      A.3.9 Alpha 21264
      A.3.10 Alpha 21264A
      A.3.11 Alpha 21264B
      A.3.12 Alpha 21364
      A.3.13 Alpha 21464

B. Cray T3E E-Register Programming
   B.1 E-Register Programming
   B.2 E-Register Routines
      B.2.1 EMUereg_int_get()
      B.2.2 EMUereg_int_load()
      B.2.3 EMUereg_int_put()
      B.2.4 EMUereg_int_cswap()
      B.2.5 EMUereg_int_mswap()
      B.2.6 EMUereg_int_finc()
      B.2.7 EMUereg_int_fadd()
      B.2.8 EMUereg_pending()
      B.2.9 EMUereg_state()
   B.3 Programming Guidelines

List of Figures

1.1 Number of Transistors
1.2 Clock Frequency
1.3 LINPACK Performance (n = 100)
1.4 Number of Function Units
1.5 Internal Cache Size
1.6 Memory Bandwidth according to the STREAM Benchmark
1.7 Performance Improvements for Processors and Memory
1.8 Grand Challenge Problems
1.9 Parallel LINPACK Performance
1.10 Remote Memory Bandwidth in the Cray T3E
1.11 Remote Memory Latency in the Cray T3E
1.12 Latency Tolerance via Multithreading
1.13 Context Switch Strategies for Hardware Multithreading

2.1 Processor Utilization using a Model of Multithreading
2.2 Processor Utilization for R=100 and L=1000

3.1 Design Flow
3.2 Creation of Basic Blocks - Stage I
3.3 Creation of Basic Blocks - Stage II
3.4 Creation of Basic Blocks - Stage III
3.5 Creation of Super Blocks - Main
3.6 Creation of Super Blocks - Stage I
3.7 Creation of Super Blocks - Stage II
3.8 Creation of Super Blocks - Stage III
3.9 Example for worst-case Control-Flow Graph
3.10 Procedure Call Example
3.11 Iterative Data-Flow Algorithm
3.12 Live Range Example
3.13 Live Range Example
3.14 Live Range Example
3.15 Merging of Live Ranges
3.16 Interference Example
3.17 Interference Graph Construction

5.1 Architecture of the Compaq XP1000 workstation
5.2 Original and Modified Instruction Mix for the FFT Benchmark
5.3 Original and Modified Instruction Mix for the LU Benchmark
5.4 Original and Modified Instruction Mix for the RADIX Benchmark
5.5 Original and Modified Instruction Mix for the OCEAN Benchmark
5.6 Original and Modified Instruction Mix for the BARNES Benchmark
5.7 Original and Modified Instruction Mix for the FMM Benchmark
5.8 Results for the FFT Benchmark (64 K Complex Data Points)
5.9 Results for the FFT Benchmark (256 K Complex Data Points)
5.10 Results for the FFT Benchmark (1024 K Complex Data Points)
5.11 Results for the LU Benchmark (512 × 512 Matrix)
5.12 Results for the LU Benchmark (1024 × 1024 Matrix)
5.13 Results for the LU Benchmark (2048 × 2048 Matrix)
5.14 Results for the Radix Benchmark (256 K Integers)
5.15 Results for the Radix Benchmark (512 K Integers)
5.16 Results for the Radix Benchmark (1024 K Integers)
5.17 Results for the Ocean Benchmark (130 × 130 Ocean)
5.18 Results for the Ocean Benchmark (258 × 258 Ocean)
5.19 Results for the Ocean Benchmark (514 × 514 Ocean)
5.20 Results for the Barnes Benchmark (16 K Particles)
5.21 Results for the Barnes Benchmark (64 K Particles)
5.22 Results for the Barnes Benchmark (256 K Particles)
5.23 Results for the FMM Benchmark (16 K Particles)
5.24 Results for the FMM Benchmark (64 K Particles)

6.1 Global Address Calculation - Part I
6.2 Global Address Calculation - Part II
6.3 Results for the FFT Benchmark (64 K Complex Data Points)
6.4 Results for the FFT Benchmark (256 K Complex Data Points)
6.5 Results for the FFT Benchmark (1024 K Complex Data Points)
6.6 Results for the LU Benchmark (512 × 512 Matrix)
6.7 Results for the LU Benchmark (1024 × 1024 Matrix)
6.8 Results for the LU Benchmark (2048 × 2048 Matrix)
6.9 Results for the Radix Benchmark (256 K Integers)
6.10 Results for the Radix Benchmark (512 K Integers)
6.11 Results for the Radix Benchmark (1024 K Integers)
6.12 Results for the Ocean Benchmark (130 × 130 Ocean)
6.13 Results for the Ocean Benchmark (258 × 258 Ocean)
6.14 Results for the Ocean Benchmark (514 × 514 Ocean)
6.15 Results for the Barnes Benchmark (16 K Particles)
6.16 Results for the Barnes Benchmark (64 K Particles)
6.17 Results for the Barnes Benchmark (256 K Particles)

A.1 Alpha Architecture State
A.2 Alpha Architecture Data Formats

A.3 Alpha Architecture Instruction Formats
A.4 Alpha 21064 Internal Architecture
A.5 Alpha 21164 Internal Architecture
A.6 Alpha 21264 Internal Architecture

List of Tables

2.1 Comparison of RISC Architectures

A.1 Integer Memory Instructions
A.2 Integer Control Instructions
A.3 Integer Arithmetic Instructions
A.4 Integer Logical & Shift Instructions
A.5 Floating-Point Memory Instructions
A.6 Floating-Point Control Instructions
A.7 Floating-Point Arithmetic Instructions
A.8 Miscellaneous Instructions
A.9 Byte & Word Extension Instructions
A.10 Multimedia Extension Instructions
A.11 Floating-Point Extension Instructions
A.12 Count Extension Instructions

1. Introduction

This work deals with the design, implementation and evaluation of a multithreading system that enables fine-grained context switches without hardware support. The current chapter explains the rationale behind such a system and starts with an analysis of current trends in computer systems: Trends in sequential computing, i.e. single-processor systems, are described in Section 1.1, while Section 1.2 covers trends in parallel computing, i.e. multi-processor systems. In both cases, one of the most notable trends is the significant and growing gap between the theoretical performance limit of a computer system and the performance achieved in practice. Sections 1.1 and 1.2 describe factors that limit the performance of current single- and multi-processor systems, respectively. It will be shown that one of the most important factors in both cases is the latency of local and remote memory accesses. Since the inherent latency associated with local and remote memory accesses is rather large, latency reduction is only possible in a limited way. Therefore four common ways to tolerate latency instead of decreasing it are introduced in Section 1.3. Multithreading is the most general of these techniques in the sense that it makes the fewest assumptions about the applications; it is investigated further in Section 1.4, which includes a detailed survey of multithreading implementations. Since none of the current commercial processors includes multithreading support in hardware, multithreading has to be implemented in software on these processors. Due to their coarse-grained context switches, current approaches to software multithreading are not suitable for hiding the latency of local or remote memory accesses. Therefore a novel multithreading system is designed that enables fine-grained context switches and is thus able to tolerate the latency of these accesses. Section 1.5 summarizes the current chapter and provides an overview of the remaining chapters.

1.1 Trends in Sequential Computing

In the past 20 years, sequential computing has been dominated by microprocessors, i.e. processors implemented using very large scale or even higher levels of integration. The dramatic advances in semiconductor technology enabled the implementation of ever more complex microprocessors [HJ91]. This trend is illustrated in Figure 1.1, which depicts the number of transistors used in implementations of common processor architectures.

Fig. 1.1. Number of Transistors (number of transistors, in millions, versus year of introduction for implementations of the IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA architectures)

The figure is based on data from the CPU Info Center [Bur01]. An exponential regression on this data reveals that the number of transistors almost doubles each year; other published growth rates range from 60 % to 80 % per year [HP96]. Apart from the higher levels of integration, the advances in semiconductor technology have also enabled a significant increase in the clock frequency of microprocessors. This trend is illustrated in Figure 1.2, which depicts the internal clock frequency for implementations of several common processor architectures. The figure is based on data from the CPU Info Center [Bur01]. An exponential regression on this data reveals that the clock frequency increases by 76 % each year, a trend that will likely continue in the foreseeable future.
The effect of the advances in semiconductor technology on both complexity and speed of microprocessors is illustrated in the above figures. However, the effect on the performance of microprocessors, as measured by the execution time of applications, is less pronounced. The performance of microprocessors is usually compared by measuring the execution time of certain benchmarks. The purpose of benchmarks is to predict the performance of a microprocessor in certain application areas. In other words, the performance of a microprocessor as measured by the execution time of the benchmarks should reflect the performance of the microprocessor running applications from these areas. Benchmarks are usually small- to medium-sized applications that enable comparisons with other microprocessors. A detailed introduction to several popular benchmark suites can be found in Chapter 4.

Fig. 1.2. Clock Frequency (internal clock speed in MHz versus year of introduction for implementations of the IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA architectures)

One of the most popular benchmarks is the LINPACK benchmark, which solves a dense system of linear equations using LU decomposition [Don90]. The performance under the LINPACK benchmark as well as the theoretically possible performance for implementations of several common processor architectures is depicted in Figure 1.3. Although the LINPACK benchmark reflects only a restricted area of applications, it is well suited for long-term comparisons of performance, as LINPACK benchmark results are available for almost every known microprocessor [DS01]. Note that nowadays the SPEC benchmarks are used to compare the performance of microprocessors, as these benchmarks provide a broader view of microprocessor performance.
The LINPACK benchmark results as well as the theoretical performance limits are taken from [DS01]. Based on the data depicted in Figure 1.3, two different trends can be identified: On the one hand, the theoretical performance limit increases by 228 % each year on average. On the other hand, the performance achieved in practice, as measured by the LINPACK benchmark, increases by 204 % each year on average. The reasons behind this significant and growing gap are explained in the following paragraphs.
The maximum possible performance of a microprocessor is the internal clock frequency f_clk times the number of integer and floating-point instructions that can be issued in each cycle, i.e.

p_max = f_clk · (n_int + n_fp)

measured in millions of instructions per second (MIPS). The above equation can be restricted to floating-point performance by considering only the number of floating-point execution units, i.e.

Fig. 1.3. LINPACK Performance (n = 100): LINPACK performance in MFLOPS versus year of introduction for implementations of the IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA architectures

p_max = f_clk · n_fp

measured in millions of floating-point operations per second (MFLOPS). As already illustrated in Figure 1.2, the advances in semiconductor technology impact the clock speed of microprocessors. The increasing number of available transistors depicted in Figure 1.1 impacts the number of execution units, as Figure 1.4 shows.
Note that the above equations represent the theoretical performance limit under the assumption that all issue slots can be utilized in every clock cycle and that all execution units are fully pipelined, i.e. can accept a new instruction in each cycle. In general, this assumption is not realistic, thereby rendering the above equations useless for predicting the performance of microprocessors on applications. One should think of these equations as representing the performance that a microprocessor is guaranteed not to exceed. The following sections describe some of the factors that limit the performance of current microprocessors.
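To make the peak-performance equation concrete, the following small C function evaluates it; the processor parameters in the example call are purely hypothetical and are not taken from any of the systems discussed in this work.

    #include <stdio.h>

    /* Peak performance in MFLOPS: clock frequency (in MHz) times the
       number of floating-point instructions that can issue per cycle. */
    static double peak_mflops(double f_clk_mhz, int n_fp)
    {
        return f_clk_mhz * n_fp;
    }

    int main(void)
    {
        /* Hypothetical processor: 500 MHz clock, two floating-point units. */
        printf("peak performance: %.0f MFLOPS\n", peak_mflops(500.0, 2));
        return 0;
    }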

1.1.1 Dependencies & Hazards

As stated above, the assumption that all execution units can be utilized, i.e. by executing n_int + n_fp instructions in parallel in each clock cycle, is not realistic. The reason is the limited amount of instruction-level parallelism available in an application as well as the difficulty of extracting this parallelism [Wal92][SJH89]. The instruction-level parallelism of a program is limited by instruction dependencies, which can be grouped into data, name, and control dependencies:

Fig. 1.4. Number of Function Units (number of function units versus year of introduction for implementations of the IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA architectures)

• An instruction b is data-dependent on another instruction a if either instruction a produces a result that is used by instruction b, or there exists an instruction c such that b is data-dependent on c and c is data-dependent on a. Note that the data flow between two data-dependent instructions can occur either via registers or via memory locations. The former is easy to detect, but the latter requires alias analysis due to the large number of ways in which two instructions can access the same memory location.
• Two instructions a and b, where a is executed before b in sequential program order, are said to be anti-dependent if instruction b writes a register or memory location that instruction a reads. Sequential program order is the order in which instructions would be executed on a processor with a single function unit and in-order execution. Anti-dependency is one of the two forms of name dependency; the other form is output dependency: two instructions a and b, where a is executed before b, are said to be output-dependent if both instructions write the same register or memory location. The difference between data and name dependencies is the absence of data flow between the two instructions.
• An instruction a is said to be control-dependent on a branch instruction b if instruction b is on at least one path from the program entry point to instruction a in the static control-flow graph of the program. Control dependencies restrict the reordering of instructions in two ways: instructions that are control-dependent on a given branch instruction can usually not be moved before that branch, while instructions that are not control-dependent on a given branch instruction can usually not be moved behind that branch.

Two instructions that are independent of each other, i.e. are neither data-, name-, nor control-dependent, can be executed in parallel. However, the reverse is not true: The existence of a dependency between two instructions does not necessarily imply that executing those two instructions in parallel creates a hazard on a given processor. Dependencies are a property of the processor architecture, but the set of dependencies that lead to hazards is a property of the processor's implementation. There are three different kinds of hazards, i.e. structural, data and control hazards.
Given a set of instructions that would execute in parallel, a structural hazard arises whenever one or more of the instructions cannot issue due to resource conflicts, e.g. lack of available execution units. Therefore structural hazards can arise even in the absence of dependencies. Data hazards arise whenever a data or name dependency exists between instructions and the instructions are issued too close to one another. Given two instructions a and b, where a is executed before b in sequential program order, there are three different cases:
• A read-after-write hazard arises if instruction b reads a register or memory location before it is written by instruction a. If this kind of hazard is not resolved, instruction b will use the old value of the register or memory location.
• A write-after-write hazard arises if instruction b writes a register or memory location before it is written by instruction a. If this kind of hazard is not resolved, the register or memory location will hold the value written by instruction a instead of b.
• A write-after-read hazard arises if instruction b writes to a register or memory location before it is read by instruction a. If this kind of hazard is not resolved, instruction a will incorrectly use the new contents of the register or memory location.
A read-after-read situation, i.e. instruction b reading a register or memory location before it is read by instruction a, is no hazard, since the contents of the register or memory location are not modified by either of the instructions.
Control hazards are caused by control dependencies: Given two instructions a and b, where a is executed before b in sequential program order and b is control-dependent on a, a control hazard arises if b is executed before the outcome of the branch a is known. Note that control dependencies are quite frequent due to the large number of branches in programs, i.e. on average, one out of six instructions is a branch [Wal92][SJH89].
Each of the hazards mentioned above will cause a pipeline stall upon its detection at runtime, i.e. the execution of one or more instructions must be delayed in order to resolve the hazard and ensure proper program semantics. Pipeline stalls can last for one or more clock cycles and affect one or more execution units, such that these function units can perform no useful work for the duration of the stall. As observed in Figure 1.3, pipeline stalls can have a significant impact on the performance of a microprocessor. Therefore the number and duration of pipeline stalls has to be decreased in order to close the gap between theoretical and practical microprocessor performance.
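The following fragment illustrates the three kinds of data hazards on a hypothetical straight-line sequence; the variables r1 through r8 merely stand for registers, and the example is not taken from any of the programs discussed later.

    /* Hypothetical instruction sequence written as C statements. */
    void hazard_example(void)
    {
        int r1, r2 = 1, r3 = 2, r4, r5 = 3, r6 = 4, r7 = 5, r8 = 6;

        r1 = r2 + r3;  /* (a) writes r1                                    */
        r4 = r1 * 2;   /* (b) reads r1 written by (a): data dependence,
                          a read-after-write hazard if issued too early    */
        r2 = r5 - r6;  /* (c) writes r2 read by (a): anti-dependence,
                          a write-after-read hazard if (c) finishes first  */
        r4 = r7 + r8;  /* (d) writes r4 also written by (b): output
                          dependence, a write-after-write hazard if (d)
                          completes before (b)                             */
        (void)r1; (void)r2; (void)r4;  /* suppress unused-value warnings */
    }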

1.1.2 Dependency Removal

As pipeline stalls are caused by hazards, which occur only in the presence of dependencies, with the exception of structural hazards, there are three ways to handle this problem: First of all, the program can be changed such that the number of dependencies is reduced. By removing at least some of the dependencies, all hazards and pipeline stalls that might have been caused by these dependencies are effectively removed. Second, the internal architecture of the microprocessor, i.e. the pipeline, can be changed such that the number of hazards is reduced even in the presence of dependencies. Last, the pipeline can be changed such that the duration of pipeline stalls is reduced. The following paragraphs describe each of these options in more detail.
The removal of dependencies is mostly performed by the compiler, although hardware support is required in some cases. The compiler has to analyze the existing dependencies before removing any of them. This dependency analysis is complicated by the presence of arrays and pointers in modern languages.
Pointers increase the complexity and introduce uncertainty in the analysis of dependencies due to aliasing, thereby causing conservative results. If the microprocessor supports dynamic memory disambiguation, i.e. resolving of conflicts due to aliasing, the compiler can perform more aggressive optimizations by ignoring some of the dependencies that arise from memory operations.
Arrays increase the complexity of dependency analysis by making it hard to determine the dependencies inside loops. In order to determine whether two loop iterations are independent, constrained Diophantine equations based on the array indices have to be solved. Unfortunately, this problem is equivalent to integer programming, which is known to be NP-complete [MHL91]. However, most of the equations that occur in practice are quite simple, thereby enabling the use of simple tests that determine whether a dependency exists. Once the dependencies have been determined, the compiler uses techniques like register renaming, loop unrolling, software pipelining, trace scheduling or speculation to reduce the number of dependencies.
Register renaming [Sim00] eliminates name dependencies by changing the conflicting registers. Note that name dependencies occur via registers as well as memory locations, but renaming is done more easily for register operands. Register renaming can be performed statically by the compiler or dynamically by the microprocessor. However, the ability to remove name dependencies is restricted by the limited number of available registers, hence modern microprocessors provide more registers than the corresponding architecture requires, thus increasing the benefits of register renaming.
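A sketch of the renaming idea on hypothetical code: once the conflicting destination registers of the earlier fragment are renamed to otherwise unused registers, only the true (read-after-write) dependence remains and the last two statements can be reordered or executed in parallel.

    /* The same hypothetical sequence after renaming r2 and r4 to the
       fresh registers r2b and r4b; only the RAW dependence of (b) on
       (a) is left.                                                    */
    void renamed_example(void)
    {
        int r1, r2 = 1, r3 = 2, r4, r5 = 3, r6 = 4, r7 = 5, r8 = 6;
        int r2b, r4b;

        r1  = r2 + r3;   /* (a) unchanged                               */
        r4  = r1 * 2;    /* (b) still data-dependent on (a)             */
        r2b = r5 - r6;   /* (c) renamed: no conflict with (a) any more  */
        r4b = r7 + r8;   /* (d) renamed: no conflict with (b) any more  */
        (void)r1; (void)r4; (void)r2b; (void)r4b;
    }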

Loop unrolling reduces the number of control dependencies by replicating the loop body several times and modifying the loop header accordingly (a small example is sketched below). In addition, loop unrolling increases the potential for instruction scheduling, as the instructions from different unrolled iterations of the loop must be independent, otherwise loop unrolling would not be possible.
Software pipelining [RGSL96] is related to loop unrolling and is sometimes called symbolic loop unrolling. Software pipelining reduces the number of data dependencies inside loop iterations by interleaving instructions from different iterations of the loop. Loop unrolling can be used to increase the size of the loop body, thereby increasing the effectiveness of software pipelining. However, the management of registers in software-pipelined loops can be quite complex.
Trace scheduling [Fis81] reduces the number of control dependencies by selecting a set of basic blocks that are likely to be executed in sequential order and creating a larger basic block by omitting the branches between the original basic blocks. Static branch prediction is used to determine the likelihood that a set of basic blocks is executed in sequential order. In contrast to loop unrolling, trace scheduling is not restricted to loop branches. However, trace scheduling complicates the code due to the need for additional bookkeeping code that handles mispredictions.
Speculation [EGK+94] reduces the number of control dependencies by predicting the outcome of branches and by moving instructions across the controlling branch. However, the speculative code must not destroy program semantics even in the presence of mispredictions, hence the compiler has to make conservative decisions about which instructions to move. The efficiency of speculation is significantly increased in the presence of hardware support, i.e. the microprocessor ensures that speculated instructions do not commit until they are no longer speculative.
Finally, conditional or predicated instructions can be used to transform control dependencies into data dependencies. Conditional instructions evaluate a certain condition and behave like a null operation if this condition is not true. The most common form of conditional instructions is the conditional move between registers, which is supported by all modern architectures. Conditional instructions can be used to eliminate control dependencies by transforming them into data dependencies, and to protect instructions that are moved across controlling branches. However, conditional instructions are often slower than their unconditional counterparts and still consume execution time, even if the tested condition does not hold.
Dependencies that cannot be eliminated by the techniques presented above do not necessarily cause hazards. Apart from instruction scheduling, most of the techniques used to avoid hazards in the presence of dependencies involve changes to the internal architecture of the microprocessor: dynamic scheduling, branch prediction, and caches.
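As a simple illustration of the loop unrolling mentioned above (hypothetical code, not taken from the benchmarks discussed later), the loop body is replicated four times and the loop header is adjusted accordingly; a real compiler would also generate clean-up code for trip counts that are not a multiple of four, which is omitted here by assuming that n is divisible by four.

    /* Original loop: one loop-closing branch per element. */
    void scale(double *a, double s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * s;
    }

    /* Unrolled by a factor of four (assumes n % 4 == 0): one branch per
       four elements, and the four statements in the body are independent
       of each other, so they can be scheduled freely.                    */
    void scale_unrolled(double *a, double s, int n)
    {
        for (int i = 0; i < n; i += 4) {
            a[i]     = a[i]     * s;
            a[i + 1] = a[i + 1] * s;
            a[i + 2] = a[i + 2] * s;
            a[i + 3] = a[i + 3] * s;
        }
    }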

The compiler uses instruction scheduling to rearrange instructions in a basic block such that pipeline stalls are minimized. Instruction scheduling uses the available parallelism between independent instructions to hide the latency of instructions that might cause a pipeline stall. However, it becomes more and more difficult to find the required number of independent instructions inside a basic block as the latency to be hidden grows larger, hence more aggressive scheduling has to be used in these cases.
Dynamic scheduling rearranges the execution of instructions at run-time in order to prevent pipeline stalls. The instructions are issued in sequential order as long as no structural hazards exist, but the individual instructions are executed as soon as their operands become available, i.e. out-of-order. Even if some instructions are stalled due to data dependencies, subsequent instructions can be executed as long as no structural hazards arise. Note that out-of-order execution implies out-of-order completion and is used by almost all recent implementations of modern processor architectures. Out-of-order execution is especially effective when combined with register renaming. However, out-of-order execution requires a lot of resources and complicates the pipeline control logic, e.g. to maintain precise interrupts.
Branch prediction is used by modern microprocessors to reduce the impact of control dependencies: Additional hardware is used to predict the outcome and/or target of branches from previous executions of the branch. A survey of popular branch prediction strategies is included in Section 2.3. Although the accuracy of current branch prediction schemes is quite good, modern microprocessors have to predict several consecutive branches in order to avoid pipeline stalls, as branches occur frequently in typical instruction streams. The combined accuracy for a sequence of branches is given by the product of the accuracies of the predictions for the individual branches, i.e. it decreases significantly for longer sequences.
Caches are used to decrease the latency associated with memory operations. Recall that the latency of instructions is one of the problems that keep instruction scheduling from avoiding pipeline stalls. Caches are part of the memory hierarchy, a chain of memories of increasing size and latency with growing distance from the microprocessor. Section 2.3 describes caches and their properties in detail. Each memory in the hierarchy usually contains a subset of the contents of the memory on the next level, although there are exceptions to this rule, e.g. exclusive caches. Frequently used memory locations should reside in the upper levels of the hierarchy, i.e. next to the microprocessor. Caches exploit the principles of locality observed in programs: It is likely that accessed memory locations will be accessed again in the near future, i.e. temporal locality. It is likely that locations in the vicinity of accessed memory locations will be accessed in the near future, i.e. spatial locality. However, the effectiveness of caches depends on the particular program, e.g. programs with irregular access patterns will seldom benefit from caches.
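The two functions below illustrate how strongly the access pattern affects cache behavior (hypothetical code, independent of the benchmarks discussed later): both compute the same sum over a matrix stored in row-major order, but the first traverses memory sequentially and exploits spatial locality, while the second strides through memory and may touch a different cache line on every access once the matrix no longer fits into the cache.

    #define N 1024

    /* Row-major traversal: consecutive accesses touch adjacent memory
       locations, so each fetched cache line is fully reused.          */
    double sum_row_major(double a[N][N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal of the same array: consecutive accesses
       are N * sizeof(double) bytes apart, defeating spatial locality. */
    double sum_col_major(double a[N][N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }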

Fig. 1.5. Internal Cache Size (internal cache size in KB versus year of introduction for implementations of the IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA architectures)

If the techniques presented above fail and a pipeline stall is unavoidable, the pipeline should be designed such that the length of the pipeline stall is minimized. This is accomplished by result forwarding, i.e. providing bypasses between different stages of the pipeline. Result forwarding reduces the length of pipeline stalls, as the results are passed directly to dependent instructions in the pipeline instead of passing the results via the register file. Another way to avoid or minimize the length of pipeline stalls is to provide enough resources such that structural hazards are avoided. These resources include execution units, register file read and write ports, and cache ports, as well as the units that control the pipeline. However, these resources increase the complexity of the microprocessor, thereby potentially reducing the maximum clock frequency.

1.1.3 Caches & Main Memory

As already mentioned, caches are used to reduce the potential for pipeline stalls arising out of data dependencies involving main memory. As Figure 1.5 illustrates, caches use a significant portion of the transistor budget in current microprocessors, partially because those transistors could not be usefully exploited otherwise. Figure 1.5 depicts the size of the internal caches for implementations of several processor architectures. Although the size of the transistor budget is steadily increasing, the size of the internal caches increases even faster, i.e. the percentage of the transistor budget attributed to the internal caches is increasing. Since caches seem to be an important part of modern microprocessors, the following section provides a detailed look at main memory, whose latency has to be hidden by caches.

Fig. 1.6. Memory Bandwidth according to the STREAM Benchmark (bandwidth in MB/s versus year of introduction for systems based on IA32, PowerPC, SPARC, Alpha, MIPS and HP-PA processors)

Main memory is usually made from DRAM (dynamic random access memory), while caches are usually made from SRAM (static random access memory). DRAMs use a single transistor to store each bit, while SRAMs use between four and six transistors per bit. DRAMs are usually slower than SRAMs and have to be refreshed periodically to maintain the contents of the memory array. On the other hand, DRAMs have a larger capacity, consume less power and are cheaper than SRAMs. Independent of the type of memory, the performance of a memory system can be expressed in two quantities: bandwidth and latency.
Bandwidth. Bandwidth refers to the maximum rate at which the memory can deliver data, usually given in megabytes or gigabytes per second. The STREAM benchmark [McC95] is often used to measure the bandwidth of main memory. Figure 1.6 depicts the bandwidth achieved in practice for computer systems based on implementations of several processor architectures, as measured by the STREAM benchmark. Note that the bandwidth and latency of the main memory system depend on the microprocessor as well as the surrounding system. However, there are several ways to increase the bandwidth of main memory: increasing the width of the memory bank, interleaving multiple banks of memory, and using multiple independent banks of memory. Given a main memory system with a data bus width of n bytes, i.e. the bus can transfer n bytes per clock cycle, the resulting bandwidth b is

b = n · f_clk

bytes per second. According to this equation, multiplying the width of the data bus by k increases the bandwidth of the memory system by a factor of k, as long as the width of the individual memory banks is increased by the same factor. However, the associated cost restricts the practical width of the memory bus: personal computers usually use an 8 byte wide memory bus, while workstations use a memory bus that is up to 16 bytes wide.
Interleaving uses multiple banks of memory that can be accessed in parallel. Each bank has the same width as the memory bus; addresses are usually interleaved at the word level, i.e. the ith bank contains all memory locations with addresses of the form

n · k + i,   ∀ n ∈ N, 0 ≤ i < k.

Such a mapping is useful for sequential accesses, e.g. cache fills, since all banks are accessed in parallel, such that k consecutive words are available simultaneously. However, the consecutive words have to be transferred sequentially, since the width of the memory bus equals the width of each memory bank. A drawback of interleaving is the associated cost: All banks have to be populated, hence a k-way interleaved memory system uses at least k memory modules. A traditional memory system uses fewer modules of larger size for the same amount of memory, which is usually more cost-effective.
Another form of interleaving exploits the internal structure of DRAMs to reduce the latency and thereby increase the bandwidth: Due to pin-count restrictions, DRAMs use a multiplexed address bus to access the internal memory array. Each memory location is accessed by subsequently specifying the corresponding row and column addresses. Since the row address is transferred first, subsequent accesses to locations in the same row only need to specify another column address, thereby reducing the latency of the access. This type of access is called page mode and is supported by all DRAMs since the 1 Mbit generation.
Independent memory banks are a generalization of memory interleaving: Like before, the memory system consists of multiple memory banks. The difference between interleaving and independent memory banks is that the latter uses separate address and data busses for each memory bank, thereby allowing independent accesses to each bank. Note that memory interleaving uses a shared address and data bus, i.e. all memory banks are accessed in parallel at the same address. The mapping of addresses to the independent memory banks is usually the same as in the interleaved approach. As long as memory accesses go to different memory banks, such a memory system is very effective. Given an even number of memory banks, sequential accesses as well as accesses that are separated by an odd number of memory locations will access different banks. However, accesses that are separated by an even number of memory locations will access the same bank if the difference is a multiple of the number of banks. This problem can be solved by providing a large number of memory banks, thereby reducing the chances that

Fig. 1.7. Performance Improvements for Processors and Memory (performance improvement relative to 1980, plotted per year for processors and memory)

two consecutive accesses hit the same bank. The drawback of this approach is the associated cost, hence this approach is only used in some supercomputers. Another solution is to change the access patterns of the program, e.g. by having the compiler expand the size of arrays.
Latency. Latency refers to the length of time that the memory needs to deliver the contents of a single memory location and is usually given in nanoseconds. Despite advances in semiconductor technology, memory access times have not kept up with the improvements in microprocessor clock speed, as Figure 1.7 illustrates. Figure 1.7 is taken from [HP96] and depicts the improvements of microprocessor clock frequency and main memory access times, using 1980's performance as a baseline. Note that the number of clock cycles that elapse during a memory access has increased as well, making it difficult at best to avoid pipeline stalls due to data dependencies involving main memory. The use of a memory hierarchy with one or more caches is a consequence of this performance gap, since caches use replication of frequently used memory locations in order to avoid accesses to main memory.
Given a memory hierarchy with caches, the average memory access time is determined by several parameters: The hit time is the amount of time required to retrieve the contents of a memory location that is found in the cache, while the miss penalty is the amount of time required to retrieve the contents of a memory location that is not found in the cache. The percentage of accesses to the caches that result in a cache miss is the miss rate. According to [HP96], the average memory access time a is given by

a = Hit Time + Miss Rate · Miss Penalty
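As a small numerical illustration (the parameter values are hypothetical and not measurements from the systems evaluated in later chapters), the C function below evaluates this equation; for a hit time of 1 cycle, a miss rate of 5 % and a miss penalty of 100 cycles it yields an average access time of 6 cycles.

    /* Average memory access time according to the equation above,
       with all times given in processor clock cycles.             */
    static double avg_access_time(double hit_time,
                                  double miss_rate,
                                  double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    /* Example: avg_access_time(1.0, 0.05, 100.0) evaluates to 6.0. */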

Based on the above equation, three different ways to decrease the average memory access time can be identified: decreasing the access time in case of cache hits, decreasing the miss rate, and decreasing the penalty associated with cache misses. Several factors affect the miss rate, e.g. cache line size, cache associativity, hardware and software prefetching, and compiler optimizations. Other factors affect the miss penalty as well as the hit time. A detailed discussion of these factors is outside the scope of this chapter; Section 2.3 includes an introduction to caches. Detailed information about the design of caches can be found in Handy's book [Han93].
Caches are an integral part of modern microprocessors, since the processor-memory performance gap is so large that accesses to main memory frequently cause pipeline stalls. The importance of caches is reflected by the fact that a huge portion of the transistor budget in current microprocessors is dedicated to caches. However, the effectiveness of caches largely depends on the software: Some programs show regular access patterns that are cache-friendly, other programs show irregular access patterns. As the processor-memory performance gap will continue to increase for the foreseeable future, the latency of accesses to main memory is one of the major bottlenecks that limit the performance of current microprocessors.

1.2 Trends in Parallel Computing

The performance of microprocessors is steadily increasing, in spite of the limiting factors identified in the previous section. However, the computational performance required by some important applications exceeds the capabilities of even the fastest microprocessors. Some of these so-called Grand Challenge Problems were identified by the Committee on Physics, Mathematics, and Engineering Sciences of the federal Office of Science and Technology Policy (OSTP). Figure 1.8 is taken from [CS99] and summarizes the findings of this committee by characterizing several important applications from science and engineering by their computational demands and storage requirements. Note that new challenges will be added as computer performance increases and new applications become feasible.
Parallel computers use several processors to achieve an aggregate performance that is equal to the number of processors times the performance of a single processor, at least in principle. The Grand Challenge Problems mentioned above require machines with a very large number of processors, so-called massively parallel processors (MPP). These MPPs have traditionally been custom-built machines based on commercial microprocessors that were tightly integrated, such as the Cray T3E [Oed96]. Clusters are a new class of massively parallel processor that use commercial off-the-shelf workstations or servers as computing nodes in connection with a custom-designed network. As the network interfaces are based on standard busses, the microprocessors are less tightly integrated than in traditional MPPs. However,

Fig. 1.8. Grand Challenge Problems (storage requirement in GB versus performance requirement in GFLOPS for applications such as global change, human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor modeling, quantum chromodynamics, vision, structural biology, vehicle signature, pharmaceutical design, 72-hour weather, 48-hour weather, 3D plasma modeling, chemical dynamics, oil reservoir modeling and 2D airfoil)

clusters provide a good price/performance solution, as they are mostly based on commercial off-the-shelf technology. The Compaq AlphaServer SC [Cor00] is an example of such a cluster.
Although current MPPs can support several thousand processors, most installed MPPs use several hundred processors. These large numbers of processors yield impressive maximum performance figures. However, the performance achieved in practice often deviates from the theoretical maximum performance. The situation in parallel computing is therefore similar to the situation in sequential computing that was described in Section 1.1, although the effects are even more pronounced. Figure 1.9 depicts the performance achieved in practice as measured by the LINPACK benchmark as well as the theoretical maximum performance for several parallel computer systems. In contrast to Figure 1.3, Figure 1.9 denotes the performance in GFLOPS instead of MFLOPS.
As the processing nodes of massively parallel processors are built around commercial off-the-shelf microprocessors, part of the performance gap can be attributed to the performance limiting factors described in Section 1.1. However, some performance limiting factors are unique to parallel computing. The three most important factors in this area are parallelization, load balancing, and communication.
If a program is not perfectly parallelizable, i.e. is not always able to utilize all available processors, the performance is limited by Amdahl's law: Assume that a given program has an execution time t_seq on a single-processor system and that a fraction α of the program can be parallelized using all p processors on a multi-processor system. According to Amdahl's law, the speedup s, i.e. the factor by which performance increases from using p processors, is given by

Fig. 1.9. Parallel LINPACK Performance (LINPACK performance in GFLOPS versus number of processors for PVP and MPP systems)

s = 1 / ((1 − α) + α/p)

Note that the speedup is effectively bounded by 1/(1 − α), i.e. independent of the speedup achieved in the parallel section of the program. For example, even if 90 % of the program executes in parallel, the maximum speedup will be less than ten instead of the desired speedup of p. Therefore programs have to be completely parallel in order to yield competitive speedups on large-scale MPPs with hundreds or thousands of processors. This problem applies to parallel computing in general and is independent of the architecture of the parallel computer.
The second problem, improper load balancing, arises even if the program is fully parallelizable: The desired speedup of p can only be achieved if the workload is evenly distributed among the p processors. Therefore load balancing is an important property of parallel programs. Load balancing is complicated in two ways. On one hand, it might be difficult to divide the workload into individual tasks, such that at least one task is assigned to every processor. On the other hand, it might be difficult to determine the complexity of a given task in advance. Nevertheless, several load balancing techniques have been developed that work reasonably well in practice. Like before, this problem applies to parallel computing in general and is independent of the architecture of the parallel computer.
The third problem is the amount and type of communication between processors. Communication arises whenever a processor needs to access data that is stored in the memory of another processor. The obvious way to avoid

Fig. 1.10. Remote Memory Bandwidth in the Cray T3E (bandwidth in MB/s versus message size in bytes for the T3E-900 and the T3E-1200)

communication is to distribute the data across the processors, such that each processor owns the data that it frequently accesses. Even if such a data distribution can be found, parallel algorithms possess an inherent amount of communication. The impact of communication on performance depends on the architecture of the parallel machine, especially the network that connects the individual processors. The performance of the network can be defined in the same terms that were used in Section 1.1 to define the performance of the memory system, i.e. bandwidth and latency. In fact, the memory of another processor can be seen as an additional level of the existing local memory hierarchy. As Figures 1.10 and 1.11 show, the bandwidth and latency of communication links in current massively parallel processors are considerably worse than the bandwidth and latency of the local memory system, i.e. providing an even larger potential for pipeline stalls due to data dependencies.
Figure 1.10 depicts the bandwidth of the communication links as a function of the block size for two different models of the Cray T3E, i.e. the T3E-900 and the T3E-1200. Note that the maximum bandwidth of the communication links is 327 MB/s for the T3E-900 and 399 MB/s for the T3E-1200. Figure 1.10 suggests that parallel programs should use large communication blocks whenever possible to benefit from the higher bandwidth.
Figure 1.11 depicts the latency, in terms of processor clock cycles, of reading a memory location on a remote processor as a function of the machine size for two different models of the Cray T3E, i.e. the T3E-900 and the T3E-1200. Note that the corresponding measurements were performed in batch mode and are affected by the allocation strategy of the NQS batch queuing system: The allocated partitions for a given batch job can be non-contiguous, i.e.

Fig. 1.11. Remote Memory Latency in the Cray T3E (latency in clock ticks versus number of processors for the T3E-900 and the T3E-1200)

the distance between the individual processing elements is larger than the minimum distance for the given number of processors. Unfortunately, it is not possible to influence the allocation strategy short of running the whole machine in dedicated mode.
The processor used in the T3E issues up to four instructions per clock cycle, hence it is unlikely that a program contains enough instruction-level parallelism to avoid pipeline stalls in the presence of latencies as shown in Figure 1.11. Unfortunately, there is an inherent latency for the communication links due to the physical size of the machine: the larger the number of processors, the larger the physical size of the machine and therefore the inherent latency, as electric signals propagate with limited speed.
Section 1.1 identified the latency of main memory as one of the most important bottlenecks that limit the performance of current microprocessors. In the case of parallel computing, the bottleneck due to the latency of the communication network is even more pronounced. As opposed to bandwidth, latency cannot be decreased below the inherent latency by using more resources. Therefore Section 1.3 describes several techniques for reducing and tolerating long-latency events such as main memory accesses or communication.

1.3 Latency Tolerance

The previous two sections have identified the latency of the memory hierarchy as one of the most important bottlenecks that limit the performance of single- and multi-processor systems. There are three ways to reduce the latency of the memory hierarchy: First of all, the access time at all levels of the memory hierarchy can be reduced by careful design. Although the inherent latency cannot be reduced in any way, one can try to design a memory hierarchy that does not exceed the inherent latency too much. The second approach tries to avoid long-latency accesses by using replication, i.e. providing copies of frequently accessed data at lower levels of the memory hierarchy. As long as the application possesses temporal and/or spatial locality, this approach can be very effective. However, not all applications have this property, e.g. irregular applications. Last but not least, the application itself can be restructured so as to reduce the frequency of long-latency accesses and/or improve the access patterns to better suit automatic replication. A survey of latency reducing techniques can be found in [GHG+91]. Although these techniques can be very effective, the remaining latency is still large enough to impact the performance of single- and multi-processor systems.
In addition to reducing the latency as discussed in the previous paragraph, one can try to tolerate the remaining latency, thereby reducing the impact on performance. There are four different ways to tolerate long-latency events: block data transfer, prefetching, asynchronous accesses and multithreading. A survey of latency tolerance techniques can be found in [MCL98][GHG+91].
Block data transfer [WSH94] uses a smaller number of large messages instead of a large number of small messages by combining communication requests whenever possible. This approach benefits from the higher bandwidth that can be achieved for large messages as well as the reduced per-message overhead. Although the latency of the first message word is not reduced at all, the subsequent message words arrive in short increments due to the high bandwidth available for large messages. Block data transfer is effective as long as the application is well-suited for this approach, i.e. has enough potential for combining communication requests. However, even if block data transfer is used, the processors will stall until the first message word is received. The remaining three approaches address this problem as well and are complementary to block data transfer.
Prefetching [LM96][CB94] uses a split-transaction protocol for communication requests: The communication is initiated before the data is actually used, at which point the communication is completed. If the distance between initiation and completion is large enough, the message data is at least partially available before it is used, i.e. the latency of the communication is at least partially hidden by the instructions between initiation and completion. Note that the initiation must not stall the processor, otherwise no latency hiding is possible. The major drawback of prefetching is the fact that the target of the communication request must be known a sufficient amount of time before the data is actually used. The effectiveness of prefetching therefore depends largely on the application. Prefetching can either be implemented in software, i.e. the compiler inserts corresponding prefetching instructions, or in hardware, i.e.
the processor prefetches data automatically. Asynchronous communication [CS99] uses communication primitives that do not stall the processor, i.e. the processor can perform independent work while the communication is pending. Asynchronous communication is similar to prefetching in that it uses independent instructions to hide the latency of communication. However, prefetching finds these independent instructions before the point where the requested data is actually used, while asynchronous communication finds independent instructions after this point. The advantage of asynchronous communication over prefetching is that it is no longer necessary to determine the target of the communication request in advance. But as communications are usually performed just before the data is needed, it may be difficult to find enough independent instructions. Again, the effectiveness of asynchronous communication depends largely on the application. Multithreading [KCA91][BR92] is similar to asynchronous communication in that it uses communication primitives that do not stall the processor until the data is actually needed. In contrast to asynchronous communication, the independent instructions that are used to hide the latency of the communication request are not taken from the same thread of computation, but from another thread on the same processor. Therefore the processor has to execute several threads in parallel and switch between these threads in order to hide long-latency events. Multithreading is less susceptible to application behavior as the independent instructions are taken from other threads, i.e. it is not restricted by the limited amount of instruction-level parallelism that is available in a single thread. All four techniques presented above are effective in hiding long-latency events. However, these techniques have significant costs: As the first technique combines communication requests and the other three techniques use independent instructions to hide the latency, the application has to possess additional parallelism. In addition, all techniques increase the bandwidth requirements of the applications, as the same amount of communication is performed in less time. At least the last three techniques complicate the hardware, e.g. by supporting multiple outstanding requests or multiple threads. It is unlikely that enough independent instructions can be found in a single thread to tolerate the latencies encountered in massively parallel processors. Multithreading is the only technique that addresses this problem and is therefore investigated further in Section 1.4. In addition, multithreading can be combined with block transfer and prefetching to yield even better performance.

1.4 Multithreading

A conventional single-threaded architecture executes a single thread of control, i.e. instructions from the corresponding program are executed in sequential order. The state of such a system consists of two parts: memory state and processor state. The memory state refers to the contents of main memory, i.e. program code, stack, data. The processor state consists of the activity specifier, i.e. stack pointer and program counter, as well as the register context, i.e. the contents of the architected registers. In the case of a single-threaded architecture, the processor state is restricted to the context of a single thread. In contrast, a multithreaded architecture executes instructions from several threads of control, i.e. the instruction to be executed is chosen from several candidates, one from each thread. As a consequence, the processor state of such an architecture consists of several activity specifiers and register contexts, i.e. thread contexts, while the memory state contains several stacks and may contain different program code. Multithreading can be used to tolerate long-latency events by switching to another thread whenever the current thread encounters such an event. With luck, the results of the long-latency operation are available the next time that the thread is activated, thereby effectively hiding the latency by executing other threads in the meantime. Since context switching itself takes some time, the context switch overhead influences the amount of latency that can be hidden effectively. An example of this situation is illustrated in Figure 1.12. In this example, there are four threads named A through D. The upper part of the figure shows how these threads would be executed on a single-threaded architecture. Note that useful work performed by the individual threads is marked in different shades of grey, while white represents idle or stall cycles. The lower part of the figure shows how these threads would be executed on a multithreaded architecture, if context is switched whenever a long-latency operation is encountered and threads are scheduled round-robin. Note that it is not useful to initiate a context switch upon encountering every long-latency event: if the latency of the event is less than two times the context switch time, the latency cannot be hidden by useful work. Multithreading can be implemented in hardware using custom-designed processors or in software using commercial off-the-shelf processors. Section 1.4.1 surveys multithreading systems implemented in hardware, while Section 1.4.2 covers software implementations. Multithreaded processors implement multithreading in hardware by extending conventional processors with support for multithreading, e.g. by adding multiple program counters and/or register contexts. Therefore multithreaded processors follow the control-flow computing model like their conventional counterparts. This restricted definition of multithreaded processors follows the definition in [vRU99] and is used to distinguish these architectures from other architectures that follow the data-flow computing model, e.g. large-grain data-flow or threaded data-flow.

Fig. 1.12. Latency Tolerance via Multithreading


Note that the term multithreaded architecture is often used in both cases. The number of threads that can be executed on a multithreaded processor is usually equal to the number of supported program counters and/or register contexts. However, the number of threads can be extended by using some form of software multithreading on top of the hardware implementation. The capabilities of the hardware threads are determined by the software that creates the individual threads; the hardware itself usually poses no restrictions on thread capabilities.

1.4.1 Hardware Multithreading

The most notable difference between multithreaded processors is the context switch strategy, which can be used to group the individual processors into three classes: interleaved multithreading, block multithreading and simultaneous multithreading. The three different approaches are depicted in Figure 1.13 and are discussed in the following sections. Interleaved Multithreading. Multithreaded processors based on cycle-by-cycle interleaving perform a context switch after each instruction fetch. As long as the number of threads exceeds the number of pipeline stages, there is at most one instruction from each thread in the pipeline, i.e. control and data dependencies are eliminated. Due to the absence of dependencies, the pipeline itself is simple, i.e. no pipeline interlocks or forwarding paths are necessary. Another advantage of the cycle-by-cycle interleaving model is the absence of any context switch overhead, as context switches are well known in advance. The major drawback of this model is the lack of single-thread performance: If only one thread is executed, the maximum performance of the processor is degraded by a factor that is equal to the number of pipeline stages.

Fig. 1.13. Context Switch Strategies for Hardware Multithreading


However, this problem can be addressed by using dependence lookahead or interleaving [LGH94]. The former approach issues sequences of instructions from the same thread as long as these instructions are neither data- nor control-dependent. The instruction sequences are identified by the compiler, and the instruction format uses an additional field to store the length of the sequence. The latter approach also issues several instructions from the same thread in sequence, but extends the pipeline to support forwarding paths and pipeline interlocks instead. The following paragraphs describe several processors using interleaved multithreading. The HEP (Heterogeneous Element Processor) [Smi78][Smi81a] is a shared memory multiprocessor that supports up to 16 processors, 128 memory modules, 4 I/O modules, 4 I/O cache modules, as well as an additional I/O processor. The individual elements are connected by a switch network made from three-port switches. Each processor supports up to 128 threads that share a single register file. Synchronization between threads is supported by full/empty bits for every memory location. The Horizon [TS88][KS88] is a shared memory multiprocessor based on the earlier HEP design that supports up to 256 processors and 512 memory modules. The individual elements are connected by a three-dimensional torus network made up from seven-port switches. Each processor supports up to 128 threads that share a single register file. Synchronization between threads is supported by full/empty bits for every memory location. In contrast to HEP, the Horizon was never implemented. The Tera MTA (Multithreaded Architecture) [ACC+90][AKK+95] is a shared-memory multiprocessor based on the earlier HEP and Horizon designs that supports up to 256 processors, 512 memory modules, 256 I/O cache modules, as well as 256 I/O processors. The individual elements are connected by a three-dimensional torus made from up to 4096 five-port switches. Each processor supports up to 128 threads and is based on a three-issue very-long-instruction-word (VLIW) architecture with explicit dependence lookahead. The Tera MTA is a commercial product, but only one system has been sold. The MASA (Multilisp Architecture for Symbolic Applications) [HF88] processor architecture is targeted at efficient execution of parallel LISP programs, e.g. MultiLisp. The processor is based on a load/store RISC architecture that was extended to support multiple threads and tagged data. Synchronization between threads is supported by full/empty bits for every memory location. The MAP (Multi-ALU Processor) [KDM+98] used in the MIT M-Machine [FKD95] is based on a three-issue VLIW architecture. The Multi-ALU processor consists of three multithreaded clusters; each cluster supports up to four threads in hardware as well as four software threads per hardware thread, i.e. up to 16 threads in total. Processor coupling [KD92] is used for efficient synchronization between threads. The MediaProcessor [Han96] is targeted at multimedia applications and supports up to five threads in hardware. The instruction set architecture is based on a load/store RISC architecture, with additional support for group instructions, e.g. shuffling, swizzling and permutation.
The SB-PRAM [ADK+92][BBF+97] implements the concurrent read, concurrent write (CRCW) parallel random access machine (PRAM) programming model and supports up to 256 processing elements that are connected to the memory modules by a bidirectional butterfly network. Each processing element supports up to 32 threads in hardware, and each thread supports another 32 software threads, i.e. 4096 threads in total. Block Multithreading. Multithreaded processors based on the block interleaved model execute a single thread until a context switch is triggered, usually by a long-latency event. The context switch can be triggered either statically or dynamically. In the static approach, context switches are associated with specific instructions, i.e. a context switch is performed whenever one of these instructions is executed. These instructions can trigger a context switch either explicitly or implicitly: In the former case, special context switch instructions are added to the instruction set and inserted into the instruction stream by the compiler. In the latter case, context switches are associated with certain groups of instructions, e.g. loads, stores, branches. The advantage of the static approaches is the low context switch overhead, as context switches are detected early in the pipeline, i.e. in the instruction fetch stage. In the dynamic approach, context switches are triggered by certain events, e.g. cache misses, signals, or condition codes. As these events are detected rather late in the pipeline, the context switch overhead is increased by the need to flush all subsequent instructions in the pipeline. Compared to the cycle-by-cycle interleaved model, the block interleaved model allows single-thread performance comparable to conventional processors, at the expense of increased complexity due to the traditional pipeline as well as increased context switch overhead in the case of the dynamic approach. The following paragraphs describe several processors using block multithreading. The Sparcle [Aga92] processor implements the APRIL [ALKK90] processor architecture used in the MIT Alewife [ABC+95] distributed shared memory machine. The Sparcle processor is based on the SPARC architecture and was extended to support up to four threads as well as fine-grained synchronization. Context switches are initiated if a remote cache miss or a synchronization event is encountered and have an overhead of 14 clock cycles. The MSPARC [MD96] processor is similar to Sparcle: The processor is based on the SPARC architecture, supports up to four threads and performs a context switch upon cache misses. However, the context switch overhead was reduced to one cycle plus an additional four cycles for refilling the pipeline, i.e. a total of five cycles. The Rhamma [GU96][GU97] processor uses a decoupled architecture [Smi82], i.e. separates execution and load/store units. The execution unit is based on the DLX [HP96] architecture and performs a context switch whenever a load, store, or branch instruction is executed, i.e. uses a static approach. The load/store unit performs a context switch whenever an instruction that is not a load or store instruction is encountered, i.e. uses a static approach as well. In addition, both units perform a context switch if operand values are unavailable or if an explicit switch instruction is encountered and a certain condition is met, i.e. use dynamic approaches as well.
Both units access the same register contexts and are connected by first-in, first-out queues to pass threads. The Rhamma processor uses a context switch buffer to reduce the context switch overhead to one cycle in the case of a switch due to a load or store instruction. In all other cases, the context switch overhead is already zero. The MARS-M (Modular, Asynchronous, Expandable, Multithreaded System) [DW92] is a shared-memory heterogeneous processor that consists of a control processor, a central processing unit, and a multi-ported memory. The control processor supports up to eight threads and contains eight execution units. The central processing unit uses a decoupled architecture, i.e. separate address and execution processors. The address processor supports four threads and contains twelve execution units, while the execution processor supports four threads and contains ten execution units. The MARS-M architecture inspired the MTD (Multithreaded Decoupled Architecture) [DO95] that combines simultaneous multithreading with a decoupled architecture. The MIT J-Machine uses the MDP (Message-Driven Processor) [DFK+92] that consists of a processor, router, as well as internal and external memory. Each memory location is tagged by a four-bit value to support data types and synchronization. Threads are created for every message that arrives in the processor via the router.

The four members of the RS64 processor family [BEKK00][SAB+98] are based on the PowerPC architecture and are among the first commercial microprocessors using multithreading. The processors are optimized for commercial workloads, e.g. databases, and support up to two threads. Context is switched on misses to the first-level cache, i.e. a dynamic approach. Simultaneous Multithreading. Simultaneous multithreading is a combination of the cycle-by-cycle and block interleaving models described above with super-scalar instruction issue. As illustrated in Figure 1.13, a processor that uses simultaneous multithreading issues multiple instructions from multiple threads in each cycle. Simultaneous multithreading combines vertical instruction issue, i.e. filling unused cycles with instructions from other threads, and horizontal instruction issue, i.e. filling unused instruction slots in a cycle with instructions from other threads. Simultaneous multithreaded processors can be implemented on top of conventional super-scalar processors without adding too much complexity by sharing resources between threads: First, the register file has to be enlarged to support the additional register contexts. Second, the instruction fetch and retire stages have to be changed in order to support fetch/retire of instructions that belong to different threads. Finally, each instruction has to be tagged with a thread specifier inside the pipeline, in order to detect and resolve inter- and intra-thread dependencies. The changes are reported to increase the transistor budget by less than 10 % compared to a conventional super-scalar processor [Die99]. The following paragraphs describe several processors using simultaneous multithreading. Dynamic Interleaving [PW91] is one of the first approaches to simultaneous multithreading. Simulations based on a conventional load/store RISC architecture with eight function units and support for up to four threads yielded speedups of 2.5 for the Livermore Loops and several other programs. The MRLP (Media Research Laboratory Processor) [HKN+92] is another early approach to simultaneous multithreading. The processor was based on a load/store RISC architecture, extended by dedicated register files and dispatch units for every thread. In addition, inter-thread synchronization is supported by means of queue registers that can be used to directly pass data from one thread to another. Simulations based on configurations of the architecture with eight threads, six execution units, as well as one or two load/store units, yielded speedups of 3.22 and 5.79 for a ray-tracing application, respectively. The SMT (Simultaneous Multithreading) processor from the University of Washington [EEL+97] is based on an out-of-order super-scalar processor architecture. The processor consists of six integer execution units and three floating-point execution units and supports up to eight threads. There are dedicated program counters and return-address stacks for each thread; all other resources are shared between threads. Thread identifiers are used to distinguish instructions that belong to different threads. Based on simulations, a speedup of 2.5 on the SPEC92 benchmark suite was reported [TEE+95]. An earlier version of the architecture was reported to achieve four-fold speedups, but used an unrealistic number (32) of function units [TEL95].
The Karlsruhe Simultaneous Multithreading processor [vRU99] is a simultaneous multithreading processor based on the PowerPC architecture, although a simplified instruction set is used. The architecture of the processor is designed to be scalable, i.e. it does not limit the number of threads and execution units or the size of the register sets and caches. Simulations based on a processor with four threads, 64 renaming registers and two 8 KB first-level caches on a multithreaded workload yielded an IPC of 4.2. The Alpha 21464 [Die99] processor supports four threads, eight execution units as well as out-of-order issue and efficient thread synchronization. Unlike the other processors described in this section, the 21464 was intended as a commercial product and was expected in the 2003/2004 time frame. However, all work on the 21464 was canceled after the Alpha architecture was discontinued. The SMV (Simultaneous Multithreaded Vector) [EV97] processor combines simultaneous multithreading with vector instructions, i.e. instructions that operate on a large number of scalars in parallel. The proposed processor supports up to eight threads and consists of an integer and a floating-point register file with 128 registers each, as well as four integer and two floating-point execution units. The vector unit consists of a register file with 128 vector registers and four vector execution units, supporting vector lengths up to 128 elements. The SDSP (Superscalar Digital Signal Processor) [LB96][GB96] combines simultaneous multithreading with digital signal processing, emphasizing minimal hardware overhead: Adding simultaneous multithreading only required changes in the instruction fetch and commit stages of the underlying digital signal processor. Based on simulations, speedups between 20 % and 50 % are reported by using up to four threads. The MCMS (Multiple Context Multithreaded Superscalar) [LW00] processor architecture is similar to the SMT architecture from the University of Washington, but differs in implementation details, most notably synchronization between threads. The architecture uses separate register files and reorder buffers; instruction fetch is restricted to a single thread in each cycle. Based on simulations, a four-thread MCMS architecture achieves a speedup of 1.6 on the SPLASH benchmarks. The MDA (Multithreaded Decoupled Architecture) [PG99] processor architecture combines simultaneous multithreading with a decoupled architecture, i.e. separates execution and load/store units. The simulated processor consists of four execution and four load/store execution units and supports up to eight threads. There are dedicated fetch and dispatch stages, register files and instruction and address queues for each thread; only the issue logic, execution units and caches are shared between threads.

Summary. Most of the multithreaded processor architectures presented above are research prototypes that have never been implemented or commercialized. Up to now, the only commercial multithreaded processors are the Tera MTA and the IBM RS64 processor family: The Tera MTA is a supercomputer implemented in gallium arsenide with a single installed system. The RS64 family of processors supports only two threads and is optimized for commercial applications. In addition, these processors are available only in the form of the IBM pSeries servers, i.e. they are not commercial off-the-shelf products. As the majority of systems still uses conventional single-threaded processors, software multithreading as described in the next section is widely used.

1.4.2 Software Multithreading

Due to the numerous tradeoffs between overhead and functionality, the design space of software multithreading is rather large. Early multithreaded systems were implemented in the operating system kernel, e.g. Mach [ABB+86]. User-level implementations are the current state of the art, since moving from user to kernel space and vice versa significantly increases the context switch overhead. Note that some operating systems, e.g. Solaris, use a combination of kernel- and user-level threads, where several user-level threads are executed on the same kernel thread [SG98]. In contrast to hardware implementations, the number and type of the individual threads in software implementations differ significantly: The number of threads supported by software multithreaded systems is usually not fixed, i.e. it is limited only by resource constraints. However, the overhead associated with thread management usually increases with the number of threads, hence the feasible number of threads is limited. Based on type, the threads can be grouped into three classes: lightweight threads, very lightweight threads and run-to-completion threads. Lightweight and very lightweight threads are general in the sense that their capabilities are similar to traditional processes, with the exception of shared resources, e.g. address space, file descriptors. Run-to-completion threads cannot block and cannot be preempted, i.e. context switches are statically incorporated into the thread. However, these restrictions significantly reduce the overhead associated with run-to-completion threads. Note that lightweight and very lightweight threads stem from research in operating systems, while run-to-completion threads stem from research in programming languages. The most notable difference between software multithreading systems is the type of address space supported by the thread system, i.e. shared or distributed. In the former case, all threads share the same address space, while in the latter case some threads may reside in a different address space, i.e. on a different processor. Both alternatives are surveyed in the following sections.

Shared Address Space. C Threads [CD88] is a user-level thread package that enables parallel programming using the C language on the MACH operating system. The library can be built on top of coroutines, threads, or processes, depending on compile-time options. The programming model supports forking and joining of threads as well as mutexes and condition variables for synchronization. PCR Threads [WDH89] is a user-level thread package that distributes the individual threads to multiple processes, instead of a single process as used by most other thread packages. The advantage of this approach is that only a subset of the threads is blocked whenever a process is blocked, i.e. those threads that were assigned to the corresponding process. Note that the processes share the same address space, hence the operating system must support this feature. First-Class User-Level Threads [MSLM91] are part of the Psyche operating system that supports efficient execution of user-level threads. This is achieved by providing a shared memory for asynchronous communication between user- and kernel-level and a scheduling strategy that allows synchronization of user- and kernel-level scheduling. Note that Psyche is not restricted to a specific user-level thread package; several such packages have been modified to take advantage of these features. Osprey [PE96] is a very lightweight thread package that increases efficiency by separating the specification of a thread from its context. As the corresponding data structure is much smaller than the context of a thread, these threads can be efficiently managed, especially if the number of threads is large. Note that the threads in Osprey are general, i.e. are not restricted in any way. POSIX Threads [But97] is a standardized thread package that is similar to C Threads described above. Although POSIX threads are usually implemented as a user-level library [Mue93], there are some kernel-level implementations [Alf94]. Presto [BLL88] is a programming system for object-oriented parallel programming based on the C++ [Str97] language. The corresponding runtime system supports threads as objects for parallel execution as well as locks and condition variables for synchronization. AWESIME (A Widely Extensible Simulation Environment) [Gru91] is an object-oriented library for parallel programming based on the C++ language, similar to Presto described above. The notable difference is the support for global time ordering for the execution of threads, which is useful for simulations, e.g. circuit simulations. The Chorus [EZ93] system is a runtime system that supports the chore programming model. A chore consists of a collection of atoms, i.e. sequential tasks that represent multiple applications of the same function. Execution ordering between atoms in the same chore is supported. The runtime system uses several Presto threads per processor that execute the individual chores from a work-heap. Note that only one of the threads is active at any time. Filaments [LFA96] is a user-level thread package that supports different types of threads, i.e. run-to-completion threads, iterative and fork/join threads. Filaments supports stateless threads without a private stack, small thread descriptors, and scheduling for data locality. The TAM (Threaded Abstract Machine) [CSS+91] is a virtual machine that supports efficient management of processor resources and thread scheduling.
The virtual machine is the target of a parallelizing compiler; the individual threads are run-to-completion. Fast Threads [ALL89] is a user-level thread package that was used to evaluate several thread management and locking alternatives. The package itself is similar to Presto, but is claimed to be an order of magnitude faster. The Fast Threads package was later modified to run on top of scheduler activations [ABLL92] instead of kernel-level threads. Scheduler activations are a technique for efficient scheduling of user-level threads, similar to the approach in the Psyche operating system. A user-level thread package that supports run-to-completion threads and scheduling for cache locality is described in [PEA+96]. The scheduling strategy uses several address hints associated with each thread and computes an execution ordering that minimizes cache misses. The Uniform System [TC88] is a runtime system for shared memory machines that uses a one-to-one mapping of processes to processors to execute individual threads. These threads are created by task generators, i.e. several threads are created at once, and are restricted to the run-to-completion model. Distributed Address Space. NewThreads [FM92] is a user-level runtime system for distributed memory machines that supports multiple threads per processor. The thread subsystem is similar to scheduler activations described above; threads on the same or different processors communicate via message-passing. Note that these operations block the corresponding thread. Chant [HCM94] is a user-level thread package for distributed memory machines that is targeted at latency tolerance. The individual threads communicate via port-to-port messages or remote procedure calls; the programming interface is an extension of POSIX threads with added support to distinguish threads on different processors. Rthreads [DZU98b][DZU98a] is a software distributed shared memory system based on multithreading. Rthreads provides primitives that are similar to the primitives provided by POSIX threads, such that programs written for POSIX threads can be transformed automatically into programs using Rthreads. In addition, both multithreading systems can be mixed in the same program. In contrast to POSIX threads, the Rthreads system supports distributed memory, e.g. a cluster of workstations. All global variables are shared automatically; pointers to such variables are not supported.

The Rthreads system consists of a precompiler that transforms the source code of programs using POSIX threads, and a small library that implements the individual primitives. Rthreads guarantees consistency only for synchronization primitives; there is no restriction on the ordering of normal accesses. In order to ensure sequential consistency, read and write accesses to global variables have to be performed using explicit primitives. These primitives are buffered until a subsequent flush primitive is encountered; afterwards the affected data is transferred in a single operation. The Rthreads system has been implemented on a cluster of workstations running the AIX operating system. Experimental results show that the Rthreads system generates only a slight overhead compared to the PVM system and is superior to the Adsmith distributed shared memory system. UPVM [KCO+94] is a user-level thread package targeted at PVM (Parallel Virtual Machine) applications. Each thread uses private data and heap spaces, in contrast to other user-level thread packages. The programming model is therefore similar to traditional processes, but in contrast to processes, these data and heap spaces are not protected from the other threads. Distributed Filaments [FLA94] is an extension of Filaments that is targeted at networks of workstations. The Distributed Filaments package consists of a multithreaded distributed shared memory system, as well as communication and synchronization primitives. Cilk [BJK+96] is a runtime system for parallel programming on distributed memory machines and networks of workstations based on the C language. A Cilk program is a directed acyclic graph, where the nodes represent the individual threads and a set of threads comprises a procedure. The individual threads are not general, i.e. they use the run-to-completion model. Cilk uses a work-stealing approach, i.e. supports migration of threads between processors. VISA [HB93] is a runtime system for functional languages on distributed memory machines that employs multithreading to tolerate the latency of remote references. The runtime system uses the standard setjmp() and longjmp() routines to implement context switches, hence threads have to be resumed in reverse order. Virtual Processors [NS97] is a runtime system that provides a virtual machine abstraction for fine-grained parallel code. The runtime system is based on multithreading with very lightweight threads. The approach uses communication-based scheduling, tight integration between communication and scheduling, small thread descriptors, and a zero-copy strategy for internal communication.

1.4.3 Summary

Although a large number of software multithreading systems exists, none was specifically targeted at latency tolerance in massively parallel processors. As these machines usually do not support a shared memory model, such a multithreading system must support a distributed memory model. Unfortunately, most software multithreading systems support only the shared memory model. Although there are several multithreading systems that support a distributed memory model, these systems were not targeted at latency tolerance and do not support the corresponding small grain sizes. Chapter 2 describes the basic concept of emulated multithreading, a software multithreading system that supports sufficiently small grain sizes to enable latency tolerance in massively parallel processors.

1.5 Outline

The remainder of this work consists of six chapters and two appendices that are organized as follows: Chapter 2 derives the basic features of a software multithreading system that is designed to enable latency tolerance for remote memory accesses in massively parallel processors. This approach is called emulated multithreading. The current implementation of emulated multithreading, especially the algorithms used to implement this technique, is described in Chapter 3. Emulated multithreading is evaluated on two different platforms: the Compaq XP1000 workstation, a single-processor system based on the Alpha architecture, as well as the Cray T3E, a massively parallel processor that is based on the Alpha architecture. The choice of these systems was influenced by the properties of the Alpha architecture, which is well suited for emulated multithreading for reasons which are outlined in Section 2.4. A description of these platforms, as well as results of the evaluation on a set of benchmarks, are covered in Chapters 5 and 6, respectively. The evaluation uses six different benchmarks from a popular parallel benchmark suite; the choice of benchmarks as well as a detailed description of the individual benchmarks can be found in Chapter 4. Chapter 7 summarizes the individual chapters as well as the results of the evaluation on both platforms and provides an outlook into the future directions of emulated multithreading. The Alpha architecture as well as implementations of this architecture are described in Appendix A. Appendix B contains a very detailed programming guide to the E-registers of the Cray T3E and includes information that is not publicly available elsewhere.

2. Emulated Multithreading

The previous chapter has identified the latency of accesses to remote memory as a major bottleneck in current massively parallel processors. Emulated multithreading is designed to tolerate this latency, thereby increasing performance. This goal is achieved by using a combination of software multithreading and split-transaction communication routines on massively parallel processors. The current chapter covers the implementation-independent aspects of emulated multithreading, while Chapter 3 describes the implementation-specific issues. Section 2.1 explains the rationale behind the preferences for the fundamental design; based on these preferences, the basic concept of emulated multithreading is described in Section 2.2. The interaction between emulated multithreading and the underlying processor architecture is described in Section 2.3. Section 2.4 surveys current processor architectures and evaluates each architecture with respect to emulated multithreading.

2.1 Design Preferences

This section explains the rationale behind the preferences for the design of emulated multithreading. A simple model for multithreaded processors is used to explain the individual design choices. This well-known model is in- troduced in Section 2.1.1. Sections 2.1.2 and 2.1.3 cover the context-switch strategy and the techniques used to reduce context switch overhead, respec- tively.

2.1.1 Multithreaded Processor Model

A simple model for multithreaded processors is used to explain the individual preferences for the design. This model is widely used in the literature to explain the basic issues of multithreaded processors. The model is not accurate enough to allow detailed performance predictions; other models provide much more detail and are better suited for this task, e.g. [SBCvE90][Aga92][BL96][VA00]. However, the model is detailed enough for the following discussion regarding the fundamental design choices for emulated multithreading. The model has four parameters, which are assumed to be constant:

Fig. 2.1. Processor Utilization using a Model of Multithreading


N : the number of threads that are executed by the processor.
R : the run-length, i.e. the number of cycles between long-latency events.
L : the duration of the long-latency events in cycles.
C : the number of cycles required to switch to another thread.

The model assumes that a thread performs useful work for R cycles, then encounters a long-latency event, and switches execution to another thread, i.e. it uses a block-interleaved approach. The processor utilization u, i.e. the ratio of useful work performed to total execution time, can be determined by distinguishing two different cases as described in the following paragraphs. In the first case, depicted in the upper part of Figure 2.1, the number of threads N is large enough to hide the long-latency event completely. In this case, the processor utilization u is given by

u = R / (R + C) = 1 / (1 + C/R).    (2.1)

The number of threads required to completely hide the long-latency events can be determined as follows: The run-length of N − 1 threads and N context switches must be larger than the duration of the long-latency event, i.e.

(N − 1) R + N C ≥ L  ⇔  N ≥ (R + L) / (R + C).    (2.2)

Note that equation 2.1 depends only on R and C, i.e. the context switch time C governs the upper bound on processor utilization, since the run-length R is application-dependent. In the second case, depicted in the lower part of Figure 2.1, the number of threads is too small to hide the long-latency events completely. In this case, the processor utilization u is given by

u = N · R / (R + L) = N · 1 / (1 + L/R),    (2.3)

since N threads perform useful work for R cycles each during a total of R + L cycles.

Fig. 2.2. Processor Utilization for R=100 and L=1000

(Figure 2.2 plots processor utilization against the number of threads for context switch times C = 1, 10 and 100.)

Note that equation 2.3 is linear in the number of threads N. The impact of context switch time on processor utilization is depicted in Fig. 2.2. This figure uses the equations derived above and assumes that R = 100, L = 1000, and C = 1, 10, 100. Since the upper bound on processor utilization is governed by the context switch time, the reduction of this time is essential for achieving good performance. As emulated multithreading is implemented in software, reducing the context switch time without additional hardware support is one of the key challenges. Another observation is the need for split-transaction communication, i.e. the ability to perform other work after initiating a remote access. Unfortunately, the implementation of such protocols is not straightforward on current massively parallel processors.
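For readers who want to experiment with the model, the following C sketch evaluates equations 2.1 through 2.3; it is merely an illustration of the formulas above, and the helper names are made up for this purpose.

    #include <stdio.h>

    /* Minimum number of threads needed to hide the latency completely:
     * (N-1)R + NC >= L  <=>  N >= (R+L)/(R+C), rounded up to an integer. */
    static int min_threads(int R, int L, int C) {
        return (R + L + (R + C) - 1) / (R + C);
    }

    /* Processor utilization according to equations 2.1 and 2.3. */
    static double utilization(int N, int R, int L, int C) {
        if (N >= min_threads(R, L, C))
            return (double)R / (R + C);       /* latency completely hidden */
        return (double)N * R / (R + L);       /* latency only partially hidden */
    }

    int main(void) {
        int R = 100, L = 1000;                /* parameters used in Fig. 2.2 */
        for (int C = 1; C <= 100; C *= 10)
            for (int N = 1; N <= 18; N++)
                printf("C=%3d N=%2d u=%.2f\n", C, N, utilization(N, R, L, C));
        return 0;
    }

With R = 100, L = 1000 and C = 10, for example, the sketch reports that ten threads suffice to reach the upper bound of 1/(1 + C/R) ≈ 0.91.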

2.1.2 Context Switch Strategies

The model of multithreaded processors which has been introduced in the previous section assumes that context is switched to another thread after a long-latency event is encountered, thereby trying to tolerate the latency of the event. Emulated multithreading is targeted at tolerating latencies in massively parallel processors, hence the model matches the intention behind emulated multithreading. The model makes no assumptions about the type of events and the scheduling strategy. Therefore two decisions have to be made regarding the context switch strategy: the type of events that cause a context switch as well as the scheduling strategy. The next paragraphs address these questions.

The selected events should have two properties: First, the latency of the event should be at least twice as large as the context switch overhead. Otherwise it is impossible to tolerate the latency of the event with useful work, since the context switch itself is not considered useful work. Second, the occurrence of the event should be identifiable in an automated fashion with reasonable accuracy. Otherwise the context switch is performed in the absence of the event, i.e. the total overhead is increased although no benefits are possible. The following long-latency events arise in current microprocessors and massively parallel processors: complex instructions, misses at different levels of the local memory hierarchy, and remote memory accesses. Another type of long-latency event, i.e. I/O requests, is not considered, since these events are usually handled by calls to the operating system. These calls are atomic in most cases, i.e. it is not possible to split initiation and completion of I/O requests, which is a prerequisite for latency tolerance. Complex instructions are easily identified by inspecting the assembler source, but the latency of these instructions is on the order of tens of cycles. Since emulated multithreading is implemented in software, it is unlikely that the context switch overhead is small enough to make switching on these events feasible. The same reasoning holds for cache misses that hit in other caches: These misses have a latency on the order of tens of cycles. Memory accesses that miss in all caches, i.e. access local memory, are more interesting due to their higher latency on the order of hundreds of cycles. However, it is very hard to determine in advance whether a given memory access instruction will miss in all caches. Profiling information gathered from previous runs of the program can provide hints in this regard, but requires a tight integration with the compiler. The use of profiling information was estimated to be too complex for the initial implementation of emulated multithreading. Therefore the initial implementation does not switch on these events. This capability can be retrofitted once the necessary level of compiler integration is achieved. Section 3.8 discusses this and other benefits of tighter compiler integration. Apart from profiling, informing memory instructions are able to identify memory accesses that miss in all caches at run time [HmMS98]. These instructions perform an additional operation, e.g. a branch, if the memory access misses in specified levels of the memory hierarchy. As already mentioned in the previous chapter, a software multithreaded system based on informing memory instructions is feasible and leads to encouraging results: Even if a full context switch is performed on every miss in the second-level cache, the simulated results show performance improvements on the order of 10 % to 23 % for several benchmarks [MR99]. Unfortunately, informing memory instructions have been implemented neither in commercial microprocessors nor in research prototypes. In the case of massively parallel processors, accesses to remote memory incur a latency on the order of thousands of cycles. Since these accesses are usually performed using some form of communication library, e.g. shmem [Cra98b], MPI [SOHL+98][GHLL+98], or PVM [GBD+94], these accesses can be identified easily by inspecting the assembler code and looking for calls to certain library routines.
Note that some of these routines will access local memory, but the majority of accesses will be to remote memory: If an operation is guaranteed to access only local memory, the corresponding call should be replaced by a normal load or store instruction, i.e. the application should be changed. In conclusion, emulated multithreading switches contexts on accesses to remote memory. Future implementations might use profiling to enable switching on cache misses as well. As the latency of remote memory accesses is on the order of thousands of cycles, the context switch overhead might seem insignificant. However, it will be necessary to perform context switches even in the absence of long-latency events, as will be shown later. The context switch overhead is therefore important and should be kept as small as possible.
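To make the idea of a split-transaction remote access concrete, the following sketch contrasts the two phases of such an access; the emt_get_init/emt_get_wait names and the memcpy-based completion are purely illustrative stand-ins and are not part of the actual emulation library or of the shmem interface.

    #include <string.h>

    /* Hypothetical split-transaction interface: a real implementation on a
     * massively parallel processor would start a remote transfer here and
     * let other threads run while it is in flight. */
    typedef struct { void *dst; const void *src; size_t len; } emt_handle;

    static void emt_get_init(emt_handle *h, void *dst, const void *src, size_t len) {
        h->dst = dst; h->src = src; h->len = len;   /* only record the request */
    }

    static void emt_get_wait(emt_handle *h) {
        memcpy(h->dst, h->src, h->len);             /* stand-in for completion */
    }

    double consume(const double *remote) {
        double local;
        emt_handle h;
        emt_get_init(&h, &local, remote, sizeof local);  /* initiate, do not stall */
        /* a context switch would occur here, so other threads execute
         * while the remote access is pending */
        emt_get_wait(&h);                                /* complete the access    */
        return local;
    }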

2.1.3 Context Switch Overhead

The overhead associated with a context switch is important for the overall performance of emulated multithreading. This section derives ways to reduce the context switch overhead. This is accomplished by looking at the different elements of a context switch and optimizing each element for performance. A context switch consists of storing the context of the current thread to memory, scheduling the next thread and restoring the context of that thread from memory. The following paragraphs derive a suitable scheduling strategy as well as optimizations of the save and restore operations of thread contexts. There are many different scheduling algorithms; an introduction can be found in [SG98]. The most important scheduling strategies are:

• First-Come, First-Served (FCFS) scheduling is the simplest scheduling algorithm: The first thread that requests the processor is executed. This scheduling strategy is non-preemptive, i.e. the currently executing thread cannot be interrupted, but has to release the processor voluntarily instead. First-come, first-served scheduling is easily implemented using a first-in, first-out queue.
• Round-Robin (RR) scheduling is similar to FCFS scheduling, but adds preemptiveness. Every thread is allocated a time slice; context is switched either if the thread releases the processor voluntarily, or if the corresponding time slice has elapsed. This scheduling strategy can be implemented with a first-in, first-out queue and timer interrupts.
• Shortest-Job-First (SJF) scheduling assigns the processor to the thread with the shortest request for processor time. This scheduling algorithm is proven to be optimal [SG98], but the length of the individual processor requests is usually unknown in advance and has to be estimated. Shortest-Job-First scheduling can be implemented in a preemptive or non-preemptive fashion.
• Priority scheduling assigns a priority to each thread and allocates the processor to the thread with the highest priority. First-come, first-served scheduling is used for threads that have the same priority. Priority scheduling can be implemented in a preemptive and a non-preemptive fashion.

In order to reduce the context switch overhead, the simplest scheduling algorithm, i.e. first-come, first-served scheduling, is chosen for emulated multithreading. This algorithm is easily implemented by using a circular queue: The scheduling algorithm simply traverses the queue until all threads have finished execution. Note that response time is no criterion for the purpose of tolerating long-latency events: Even if a thread uses the processor for a prolonged period of time, the thread still performs useful work. The larger the time spent performing useful work, the more latency for the other threads can be tolerated. Preemptive scheduling is therefore not necessary for emulated multithreading. Hence first-come, first-served scheduling was chosen over round-robin scheduling. The other scheduling algorithms are too complex for the purposes of emulated multithreading, where the running time of the scheduling algorithm is the most important characteristic. As already mentioned above, a context switch needs to store the context of the current thread and restore the context of the next thread. The context of a thread consists of the registers that are defined by the underlying processor architecture, as well as some management information. Since current RISC architectures usually define 64 general-purpose as well as some special registers, the size of the context is dominated by the size of the register set. The latency of the save and restore operation increases with the size of the context, since every register has to be saved or restored by a separate load or store instruction, respectively. The number of general-purpose and special registers is specified by the underlying processor architecture and cannot be changed, therefore it is impossible to reduce the size of the context. However, it is usually not necessary to save and restore the whole context of a thread. Instead, only the necessary parts of the context have to be saved and restored [Wal86][LH86][GN96]. For example, programs that perform mostly integer operations will seldom use any of the floating-point registers, thereby providing ample opportunity to reduce the number of save and restore instructions. In order to exploit this opportunity, two problems have to be solved: The first problem is the identification of those registers that are in use at a given program location. The second problem is the context switch routine itself, which must be able to save and restore only those registers that have previously been identified to be in use. Recall that emulated multithreading uses non-preemptive scheduling, i.e. a thread executes until it voluntarily releases the processor. Regarding the first problem, the identification of such registers can therefore be restricted to specific locations. Note that the number and location of used registers is already known at compile-time: The compiler gathers this information via data-flow algorithms prior to register allocation.
It is therefore possible to determine the number and location of the used registers at compile-time and transfer this information to the context switch routine, e.g. via a register mask. The context switch then uses the register mask to determine which registers have to be saved and restored. However, a generic context switch routine that handles all possible cases would be fairly complex, e.g. contain lots of branches, thereby consuming some or all of the saved overhead. In order to keep the context switch routine simple, it is possible to use several context switch routines that are tailored to each context switch location. The tailored context switch routines have to be called at the start and end of each instruction block, hence corresponding calls have to be inserted at these locations. This approach allows the individual context switch routines to be tailored to the individual context switches, at the expense of code size and an increased number of calls. Note that the latter problem can be solved by integrating the tailored context switch routine into the program code. On the one hand, this approach further increases code size, as the reuse of the context switch routines is no longer possible. On the other hand, this approach no longer requires expensive calls at the start and end of each instruction block. After merging the program code with the tailored context switch routines, even further optimizations are possible by spreading the context switch code along the instruction block, possibly filling empty instruction slots. Due to the large number of registers in current RISC architectures, it is possible to partition the register set between a small number of threads [WW93]. If each thread uses only the registers in the corresponding partition, the save and restore of these registers upon context switch can be omitted. The major drawbacks of this scheme are the increased number of register spills due to the smaller number of registers in each partition, and calls to routines that do not use emulated multithreading. These routines adhere to the standard calling conventions and can therefore not be restricted to one of the partitions, hence all registers have to be saved before and restored after the call, respectively. Otherwise the routine would destroy the contents of registers that belong to another partition, i.e. another thread. Note that every thread has to execute a different version of the program tailored to the corresponding partition, thus further increasing code size. Since register partitioning has some drawbacks and allows only a small number of threads, it will be optional for emulated multithreading.
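As a rough illustration of the register-mask idea, the following C sketch saves only those integer registers that a bit mask marks as live; the names are hypothetical, and the actual tailored routines are generated as straight-line Alpha assembler without any loop or mask test.

    #include <stdint.h>

    #define NUM_IREGS 31   /* e.g. Alpha: 31 integer registers plus a zero register */

    typedef struct {
        uint64_t ireg[NUM_IREGS];
        /* floating-point registers, program counter, management fields ... */
    } thread_descriptor;

    /* Save only the registers marked live in 'mask'; bit i corresponds to
     * register i. A tailored context switch routine would contain just the
     * stores for the set bits, so the loop and the test disappear entirely. */
    static void save_live_registers(thread_descriptor *td,
                                    const uint64_t regs[NUM_IREGS],
                                    uint64_t mask) {
        for (int i = 0; i < NUM_IREGS; i++)
            if (mask & ((uint64_t)1 << i))
                td->ireg[i] = regs[i];
    }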

2.2 Basic Concept

The previous section explains the preferences for the design of emulated multithreading. Based on these preferences, this section covers the basic concept of emulated multithreading. Section 2.2.1 describes the assumptions made about programs that are transformed to use emulated multithreading. Section 2.2.2 discusses the data structures that are used by emulated multithreading, while Section 2.2.4 describes the conversion process that merges application and context switch code. The functionality of the emulation library that supports emulated multithreading is covered in Section 2.2.3.

2.2.1 Assumptions

In order to utilize emulated multithreading, several assumptions are made about the programs: First, the programs must already support multithreading, e.g. by using POSIX threads. In the case of parallel programs, all accesses to remote memory and synchronization operations must use explicit shared memory primitives; an extension to message-passing primitives is not yet implemented. As the emulation library provides primitives that are similar to the primitives provided by POSIX threads, as well as shared memory primitives, programs that meet these conditions can be automatically converted to use emulated multithreading, e.g. by using a precompiler. Such a precompiler is feasible, as similar programs already exist, e.g. as part of the Rthreads package [DZU98b]. Second, the high-level language source code of the converted programs is assumed to be available. The conversion process is based on the assembler sources that are generated from the high-level language sources by the compiler. Note that it is possible to perform the modifications described above on object files or executables, e.g. by using the EEL library [LS95] or the OM [SW93] and ATOM [SE94] tools. However, such an approach would unnecessarily add complexity to the conversion process. Third, the converted program must not be self-modifying, otherwise the code conversion is likely to break the semantics of the program. However, this is not a serious restriction, since self-modifying code is generally considered unfavorable.

2.2.2 Data Structures

Emulated multithreading uses two major data structures: the thread control block and the thread descriptor. The thread control block is used to manage a group of threads, while the thread descriptor contains the context of a thread as well as some management information. The thread control block contains the following elements:
• The ActiveHead and ActiveTail pointers contain the addresses of the thread descriptors at the head and tail of the active thread list, respectively. The active thread list is a circular, doubly-linked list that contains the thread descriptors for all threads that are currently executed.
• The IdleHead and IdleTail pointers contain the addresses of the thread descriptors at the head and tail of the inactive thread list, respectively. The inactive thread list contains the thread descriptors for all threads that are currently suspended, e.g. during a barrier synchronization.

The thread descriptor contains the following elements:
• The NextThread and PreviousThread pointers are used to insert the thread descriptor into a doubly-linked list of thread descriptors.
• The ThreadCtrl pointer contains the address of the thread control block that manages the corresponding group of threads.
• The Number word contains the unique number that is assigned to each thread.
• The Status word contains the current call-depth of the thread.
• The PC pointer contains the address of the instruction block that is to be executed upon the next invocation of the thread.
• The IReg array stores the contents of the general-purpose integer registers; zero source and sink registers are not stored explicitly.
• The FReg array stores the contents of the general-purpose floating-point registers; zero source and sink registers are not stored explicitly.
The elements described above are mandatory for emulated multithreading. Implementations may need to add other implementation-specific elements to the thread descriptor data structure. The size s of the thread descriptor is given by

    s = 4 · sizeof(long) + 2 · sizeof(int) + n_int · sizeof(long) + n_fp · sizeof(double)

where n_int and n_fp are the number of general-purpose integer and floating-point registers, respectively, minus the corresponding number of zero source and sink registers. For example, the Alpha architecture defines 32 integer as well as 32 floating-point registers and uses one zero source and sink register for each register type, i.e. n_int = n_fp = 31. Therefore the thread descriptor for the Alpha architecture consumes 536 bytes. Note that the data structure may be padded in order to align the individual thread descriptors on cache-line boundaries to avoid conflict misses. Apart from the two data structures described above, emulated multithreading uses several other data structures: The thread attribute data structure is used to specify attributes for a group of threads. The thread arguments structure is used to specify the initial arguments for a group of threads. Both data structures are passed to the thread creation routine.
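The following C sketch shows one possible layout of the two data structures; field names follow the description above, but the exact types and any padding are assumptions rather than the layout used by the actual implementation:

    #include <stdint.h>

    #define NINT 31   /* Alpha: 32 integer registers minus the zero register r31 */
    #define NFP  31   /* Alpha: 32 floating-point registers minus f31            */

    typedef struct thread_descriptor {
        struct thread_descriptor *NextThread;     /* doubly-linked list links       */
        struct thread_descriptor *PreviousThread;
        struct thread_control    *ThreadCtrl;     /* owning thread control block    */
        int32_t  Number;                          /* unique thread number           */
        int32_t  Status;                          /* current call depth             */
        void    *PC;                              /* next instruction block         */
        uint64_t IReg[NINT];                      /* saved integer registers        */
        double   FReg[NFP];                       /* saved floating-point registers */
    } thread_descriptor;

    typedef struct thread_control {
        thread_descriptor *ActiveHead, *ActiveTail;  /* circular active thread list */
        thread_descriptor *IdleHead,   *IdleTail;    /* suspended thread list       */
    } thread_control;

On a 64-bit platform this amounts to 4 · 8 + 2 · 4 + 31 · 8 + 31 · 8 = 536 bytes per descriptor, matching the size derived above (before any cache-line padding).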

2.2.3 Emulation Library

Emulated multithreading uses a small support library that is linked to the converted application. This emulation library contains routines for thread initialization and thread execution as well as communication and synchronization routines. Thread initialization routines are used to initialize and manage the thread arguments and thread attribute data structures as well as to create the individual threads. The thread creation routine takes as arguments the thread attribute and arguments structures as well as the entry point for the threads. It creates and initializes a thread control block as well as thread descriptors for all threads. The returned thread control block is subsequently passed to the thread execution routine to start the execution of threads. The thread execution routine executes the individual threads using first-come, first-served scheduling as described in Section 2.1.3. The routine basically consists of a single loop and operates in the following way:
1. Upon entry, the FramePtr register is set to the first thread in the circular list of active threads specified by the thread control block.
2. The program counter is loaded from the current thread descriptor and stored in the ThreadPC register.
3. A call to the address given by the ThreadPC register is performed to execute the corresponding instruction block. The call sets the ReturnPC register to the next instruction, i.e. the fourth step.
4. After the instruction block has finished execution, it returns to the address given by the ReturnPC register and updates the ThreadPC register accordingly. The ThreadPC register is subsequently saved to the current thread descriptor.
5. The status of the current thread is inspected. If the current call depth is zero, i.e. the thread has completed, the thread descriptor is removed from the active list.
6. The FramePtr is loaded from the NextThread field of the current thread descriptor, i.e. points to the next thread. If the updated FramePtr is non-zero, return to step two.
The thread execution routine consists of a single loop and uses three registers: the ThreadPC, ReturnPC and FramePtr registers, respectively. Since the loop is iterated once for every context switch, it should be compact and fast. Fortunately, this loop can be implemented in less than 10 instructions for most RISC architectures. Note that Section 2.3 includes a discussion of performance issues regarding the thread execution routine. The communication routines are largely platform-specific and provide split-transaction memory accesses. The synchronization routines are platform-specific as well and provide barriers and locks. These routines are described in Chapter 3.
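As a rough illustration of this loop, the following C sketch models the six steps and reuses the structures from the sketch at the end of Section 2.2.2; in the actual implementation the loop is a handful of assembler instructions and the instruction blocks are generated code, not C functions, so the function-pointer signature below is purely an assumption for illustration:

    /* An instruction block receives the thread descriptor (FramePtr) and
       returns the address of the block to execute next (ThreadPC). */
    typedef void *(*instruction_block)(thread_descriptor *td);

    static void execute_threads(thread_control *tcb)
    {
        thread_descriptor *fp = tcb->ActiveHead;             /* step 1: FramePtr */
        while (fp != NULL) {
            instruction_block pc = (instruction_block)fp->PC; /* step 2          */
            fp->PC = pc(fp);                                  /* steps 3 and 4:
                                                                 execute block and
                                                                 save new ThreadPC */
            if (fp->Status == 0) {                            /* step 5: finished */
                /* remove fp from the circular active list (not shown); removing
                   the last descriptor empties the list and terminates the loop  */
            }
            fp = fp->NextThread;                              /* step 6: next thread */
        }
    }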

2.2.4 Code Conversion

As already mentioned, emulated multithreading divides the application code into instruction blocks. These blocks are merged with context switch code that is tailored to each instruction block, and transformed into subroutines. This section describes the conversion process in more detail. Note that the following remarks make no assumptions about the size and format of the instruction block. The process of selecting instruction blocks is implementation-specific and therefore covered in Chapter 3. The following paragraphs describe the general conversion process as well as some important special cases.

Given an instruction block B, which is assumed to be identical to a basic block for the sake of the following discussion, the block is transformed as follows: For each instruction I in the current block, the registers that are used by I are identified. If instruction I is the first instruction in the current instruction block to use a register R, allocate a new register R', otherwise use the previously allocated register R'. In addition, if instruction I reads register R, insert a restore instruction before instruction I. The restore instruction loads register R' with the contents of register R as stored in the descriptor of the current thread. The restore instruction can be omitted if register R is only written by instruction I, as the loaded value would be immediately discarded by executing instruction I. If instruction I is the last instruction in the current instruction block to use a register R, a save instruction is inserted after instruction I and register R' that was previously allocated to register R is freed. The save instruction stores the contents of register R' to register R in the descriptor of the current thread. The save instruction can be omitted if register R' is not modified by any of the instructions in the current instruction block. After parsing all instructions and modifying the individual instructions according to the rules above, a return instruction is appended to the instruction block. This return instruction is used to return execution to the thread execution routine after execution of the instruction block has completed.
The conversion process is illustrated by the following example, an instruction block that consists of three instructions. The example is taken from a RISC architecture with a three-address instruction format: The first two register operands specify the source registers, the last register operand specifies the target register.

    addq r1, r2, r3
    subq r4, r5, r6
    addq r3, r6, r3

This instruction block is converted as follows: The first addq instruction reads source registers r1 and r2 and writes the result to destination register r3. As the addq instruction is the first instruction in the instruction block, all three registers are used for the first time and are allocated to registers r1', r2' and r3'. Since registers r1 and r2 are read by the addq instruction, two restore instructions are inserted before the addq instruction. Register r3 does not need to be restored, as it is initialized by the addq instruction. The first addq instruction is the last instruction to use registers r1 and r2, but those registers are not modified by any of the instructions in the instruction block, hence no save instructions need to be inserted and registers r1' and r2' are freed. Register r3' does not need to be saved, since it is used later by other instructions in the instruction block. The subq instruction reads source registers r4 and r5 and writes the result to destination register r6. Although the subq instruction is not the first instruction in the instruction block, all three registers are used for the first time and are allocated to registers r1', r2' and r4'. Since registers r4 and r5 are read by the subq instruction, two restore instructions are inserted before the subq instruction. Register r6 does not need to be restored, as it is initialized by the subq instruction.
The subq instruction is the last instruction to use registers r4 and r5, but those registers are not modified by any of the instructions in the instruction block, hence no save instructions have to be inserted and the registers r1' and r2' are freed. Register r6 does not need to be saved, since it is used later by other instructions in the instruction block. The second addq instruction reads registers r3 and r6 and writes the result to register r3. All registers used by this instruction have been used by previous instructions, hence there is no need to insert any restore operations or allocate any registers. However, the second addq instruction is the last instruction in the instruction block and contains the last reference to registers r3 and r6. As both registers have been modified by this or other instructions in the instruction block, two save instructions have to be inserted after the second addq instruction and the registers r3' and r4' are freed afterwards. The return instruction is appended to the instruction block. The converted instruction block is given below. The ldq Rx, #O(Ry) instructions load the contents of the memory location given by Ry + O into the destination register Rx.

    ldq  r1', #Ireg[r1](FramePtr)
    ldq  r2', #Ireg[r2](FramePtr)
    addq r1', r2', r3'
    ldq  r1', #Ireg[r4](FramePtr)
    ldq  r2', #Ireg[r5](FramePtr)
    subq r1', r2', r4'
    addq r3', r4', r3'
    stq  r3', #Ireg[r3](FramePtr)
    stq  r4', #Ireg[r6](FramePtr)
    ret  ThreadPC, (ReturnPC)

The above description of the conversion process covers only non-control-flow instructions, i.e. instructions that do not change the flow of the program. There are three different types of control flow instructions: conditional, unconditional, and indirect branches. Recall that instruction blocks are identical to basic blocks for the sake of this discussion. As branches end basic blocks, a branch is always the last instruction in an instruction block. Hence branches are replaced by a sequence of instructions that calculate the address of the updated program counter, store this address in the ThreadPC register and return to the thread execution routine. For example, the conditional branch beq r1, target that performs a branch if register r1 is equal to zero, would be replaced by the following instruction sequence:

    ldq  r1', #Ireg[r1](FramePtr)
    lda  r2', target

    lda  ThreadPC, next
    cmoveq r1', r2', ThreadPC
    ret  r31, (ReturnPC)
    next:

The first ldq instruction restores the contents of register r1 from the thread descriptor to register r1', while the first lda instruction loads the target constant into register r2'. The second lda instruction loads the ThreadPC register with the address of the instruction that follows the original branch instruction, i.e. the instruction marked with the next label. The cmoveq instruction checks whether r1' is equal to zero and moves the contents of register r2' to register ThreadPC in that case, i.e. performs a conditional move. If r1' is equal to zero, the ThreadPC register contains the target constant, otherwise the address of the next instruction. The ret instruction performs an indirect branch to the location given in the ReturnPC register, i.e. the fourth step in the thread execution routine. Recall that the thread execution routine automatically stores the contents of the ThreadPC register to the PC field of the current thread descriptor.
Procedure calls are handled differently, depending on the type of the called procedure. If the called procedure (callee) does not use emulated multithreading, i.e. an external call, the procedure call instruction is treated like a normal instruction and converted accordingly. The only difference to a normal instruction is the register assignment, which has to follow the calling conventions of the operating system. This allows the call to be executed in the normal way, i.e. no modifications to the callee are necessary. As a consequence, emulated multithreading can be applied to selected procedures only, handling other procedures like external calls. Therefore the overhead of emulated multithreading can be reduced by applying emulated multithreading only to those procedures that may benefit from the conversion, e.g. procedures that access remote memory. If the callee uses emulated multithreading, the call is the last instruction in the instruction block by definition. Similar to branches, the subroutine call is therefore replaced by a sequence of instructions that store the address of the callee in the ThreadPC register, increment the call depth field of the current thread descriptor, and return to the thread execution routine.
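To make the conversion rules above concrete, the following self-contained C sketch converts a straight-line block of three-address instructions and prints the resulting code. It deliberately simplifies register allocation by giving every original register rN a primed copy rN' instead of reusing a small pool as in the example above, and the instruction encoding and helper functions are illustrative assumptions, not part of the actual conversion tool:

    #include <stdio.h>

    typedef struct { const char *op; int src1, src2, dst; } Insn;

    /* true if register r is not used by any instruction before position i */
    static int first_use(const Insn *b, int i, int r) {
        for (int j = 0; j < i; j++)
            if (b[j].src1 == r || b[j].src2 == r || b[j].dst == r) return 0;
        return 1;
    }

    /* true if register r is not used by any instruction after position i */
    static int last_use(const Insn *b, int n, int i, int r) {
        for (int j = i + 1; j < n; j++)
            if (b[j].src1 == r || b[j].src2 == r || b[j].dst == r) return 0;
        return 1;
    }

    /* true if register r is written by some instruction in the block */
    static int written(const Insn *b, int n, int r) {
        for (int j = 0; j < n; j++)
            if (b[j].dst == r) return 1;
        return 0;
    }

    static void convert(const Insn *b, int n) {
        for (int i = 0; i < n; i++) {
            int src[2] = { b[i].src1, b[i].src2 };
            for (int k = 0; k < 2; k++)   /* restore source registers on first use */
                if ((k == 0 || src[k] != src[0]) && first_use(b, i, src[k]))
                    printf("ldq  r%d', #Ireg[r%d](FramePtr)\n", src[k], src[k]);
            printf("%s r%d', r%d', r%d'\n", b[i].op, b[i].src1, b[i].src2, b[i].dst);
            int used[3] = { b[i].src1, b[i].src2, b[i].dst };
            for (int k = 0; k < 3; k++) { /* save modified registers on last use */
                int r = used[k], seen = 0;
                for (int m = 0; m < k; m++) seen |= (used[m] == r);
                if (!seen && last_use(b, n, i, r) && written(b, n, r))
                    printf("stq  r%d', #Ireg[r%d](FramePtr)\n", r, r);
            }
        }
        printf("ret  ThreadPC, (ReturnPC)\n");
    }

    int main(void) {
        Insn block[] = { {"addq", 1, 2, 3}, {"subq", 4, 5, 6}, {"addq", 3, 6, 3} };
        convert(block, 3);
        return 0;
    }

Run on the three-instruction example block, this sketch reproduces the sequence of restore, compute, and save instructions shown above, differing only in the names assigned to the primed registers.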

2.3 Performance Issues

As mentioned in the previous section, emulated multithreading requires modifications of the original source code. The possible impact of these modifications on the performance, i.e. the overhead associated with emulated multithreading, is discussed in the current section. The discussion emphasizes the interaction of emulated multithreading with the characteristics of modern microprocessors and addresses several issues: the number of threads, instruction and data caches, branch prediction, code scheduling, and out-of-order execution.

2.3.1 Number of Threads

Recall that the loop in the thread execution routine can be implemented in less than 10 instructions, which may not seem like much at first sight. However, the loop contains at least one conditional branch as well as one subroutine call, two instructions that can impact performance if not predicted correctly. The interaction of emulated multithreading and branch prediction is covered in Section 2.3.3. Nevertheless, in order to reduce the overhead associated with emulated multithreading, it is important to reduce the number of loop iterations as much as possible. The next paragraph derives and analyzes a formula for the number of loop iterations. A formula for the number of loop iterations can be derived as follows: Assume that the source code was divided into N instruction blocks, numbered from 1 to N, and that block i is executed n_i times by each thread for a given input set. The values for the n_i can be obtained via profiling, for example. Let us further assume that t threads are used and that all n_i values are independent of the number of threads. Based on these assumptions, the number L of loop iterations is

    L = t · Σ_{i=1}^{N} n_i    (2.4)

Note that this equation is simplistic, since the n_i may be different for each thread and usually depend on the number of threads, but it is accurate enough for the sake of the following discussion. Looking at the above equation, three ways to decrease the number of loop iterations are possible: First of all, reducing the number of threads will reduce the number of loop iterations, since the equation depends linearly on t. However, reducing the number of threads decreases the amount of latency that can be hidden. As a consequence, the number of threads should be as large as needed, but as small as possible. Second, reducing the number of instruction blocks, e.g. via merging, reduces the number of iterations. Since the number of merged blocks is smaller and none of the merged blocks is executed more often than one of the original instruction blocks, instruction blocks should be as large as possible. However, instruction blocks must end at remote memory accesses in order to hide the latency, hence there is an upper limit on the useful size of instruction blocks. Third, reducing the number of executions for individual instruction blocks will reduce the number of loop iterations. Like before, this is possible by increasing the size of instruction blocks: If an instruction block is large enough to contain a whole loop, i.e. head, body, and tail, the number of executions for the instruction block is equal to the number of executions of the head and tail blocks, but independent of the number of executions for the loop body. As a consequence, instruction blocks should be as large as possible, but the restrictions mentioned above still apply.
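As a purely hypothetical illustration of equation (2.4), assume t = 4 threads and N = 3 instruction blocks forming a loop head, body, and tail with n_1 = 1, n_2 = 1000, and n_3 = 1 executions per thread. Then

    L = 4 · (1 + 1000 + 1) = 4008

loop iterations are required. If the three blocks contain no remote memory accesses and can be merged into a single block that contains the whole loop, the merged block is executed only once per thread and L drops to 4 · 1 = 4, which illustrates the second and third points above.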

2.3.2 Caches

Caches reside in the upper levels of the memory hierarchy, i.e. are used to bridge the performance gap between the clock frequency of current microprocessors and the latency of main memory. The importance of caches was already outlined in Chapter 1: Caches are an integral part of current microprocessors, therefore the interaction of emulated multithreading with caches is relevant for the performance of emulated multithreading. The following paragraphs provide a short introduction to caches as well as a discussion of the interactions between caches and emulated multithreading. More detailed information about caches can be found in computer architecture textbooks [HP96][PH98][Fly95][Sto93], or Handy's book [Han93], which is a useful reference to the design of caches.
Cache Functionality. Caches are based on the principles of locality in program behavior: spatial and temporal locality. Spatial locality means that if a program accesses a memory location, it is likely that the program will access nearby locations in the future. Temporal locality means that if a program accesses a memory location, it is likely that the program will access the same location again in the near future. A cache is a type of memory that resides between the processor and main memory and contains a subset of the contents of main memory, i.e. uses replication. Since caches are usually much smaller than main memory, they can be built from faster memory, thus reducing the latency associated with accesses to memory locations found in the cache. Caches consist of a number of cache-lines, where the size of a cache-line is usually a multiple of the machine word size in order to exploit spatial locality. The size of a cache is given by the number of cache-lines times the number of bytes that can be stored in each cache-line. Caches exploit the principles of locality mentioned above in the following way: If the processor issues a read request to a location in main memory, the cache checks whether it contains the contents of the desired location. If the contents are found in the cache (cache hit), they are immediately returned to the processor. In the case of a cache miss, the cache-line that contains the desired location is fetched from main memory and stored in the cache for later references. The contents of the desired location are subsequently forwarded to the processor. If the cache is already full, i.e. each cache-line contains a valid entry, one of the existing entries has to be evicted from the cache. If the cache-line can reside in more than one entry, several strategies to select the entry to be evicted exist. The two most common are random replacement and least-recently-used (LRU). In the case of random replacement, one of the existing entries is selected pseudo-randomly, which is easy to implement. In the case of least-recently-used replacement, the entry that has been unused for the longest time is selected for eviction. Least-recently-used replacement requires recording of accesses to cache-lines and is therefore harder to implement. If the processor issues a write request to a location in main memory, the cache checks whether it contains the contents of the desired location. In the case of a cache hit, the contents in the cache are updated. There are two different ways to subsequently update the location in main memory: write-through and write-back.
If the cache uses the write-through protocol, the contents of main memory are updated simultaneously with the contents of the cache. If the cache uses the write-back protocol, the contents in main memory are only updated in case the location is evicted from the cache. In the case of a cache miss, only the contents of main memory are updated if the cache does not use the write-allocation protocol. Otherwise the contents are fetched from main memory and stored in the cache, evicting other cache-lines as necessary. Afterwards the contents in the cache are updated in the same way as for a write hit.
Cache Organization. One important aspect of caches is the mapping of memory locations to cache-lines, i.e. where a given memory location can be placed in the cache. Three different types of cache organizations exist, namely direct-mapped, n-way set-associative and fully associative. In a direct-mapped cache, each memory location has exactly one place to reside, usually the cache-line that is specified by the address of the memory location modulo the number of cache-lines. Caches of this type are easy to implement, since only one cache-line has to be checked in order to determine whether the contents of a given memory location are already in the cache. In addition, direct-mapped caches do not use one of the replacement strategies mentioned above, as the cache-line to be replaced is unique. However, the miss rates of caches using this organization are usually higher than the miss rates using any of the other organizations. In an n-way set-associative cache, the cache is organized in sets, each set contains n entries. Memory locations are mapped to the individual sets by the address of the memory location modulo the number of sets, but can reside in any of the n entries in the set. Caches of this type are more expensive than direct-mapped caches, since n entries have to be checked in parallel in order to determine whether the contents of a given memory location are already in the cache. The parameter n is therefore usually quite small, i.e. n = 2, 4, 8. As a rule of thumb, the miss rate of a 2-way set-associative cache is the same as for a direct-mapped cache of double the size [HP96]. In a fully associative cache, a location in memory can be placed in every entry of the cache, i.e. there are no restrictions in the mapping of memory locations to cache-lines. Caches of this type are very expensive, as all entries have to be checked in parallel in order to determine whether the contents of a given memory location are already in the cache. Therefore fully associative caches are usually quite small.
Caches can be used to store instructions, data, or a combination of both: An instruction cache holds only memory locations that contain program code, whereas a data cache holds only memory locations that contain data. A unified cache holds both types of memory locations. Current microprocessors usually use separate first-level instruction and data caches and a larger unified second-level cache, sometimes backed by an even larger external third-level cache.
Cache Misses. There are three different kinds of cache misses: compulsory, capacity, and conflict misses [HP96]. Compulsory misses are caused by accesses to memory locations that have never been accessed before, i.e. have never been in the cache.
Capacity misses are caused by accesses to memory locations that have been in the cache before but have been evicted from the cache in the meantime due to the insufficient number of entries in the cache, i.e. the cache is too small to hold all the memory locations needed by the program. Conflict misses are caused by accesses to memory locations that have been in the cache before but have been evicted from the cache in the meantime due to the sharing of cache-lines, i.e. some of the intermediate accesses were to memory locations that mapped to the same cache-line.
Emulated Multithreading and Caches. Caches are essential for achieving high performance on current microprocessors. The interaction between emulated multithreading and caches is therefore important to the overall performance of emulated multithreading. Potential problems include the following:
• Emulated multithreading merges the program code with context switch code, which leads to an increase in code size. This may lead to increased cache miss rates for the instruction caches, as the instruction cache is usually already too small to hold the original code. The impact of this issue will depend on the application code as well as the conversion process.
• Emulated multithreading executes several threads that execute the same program code. If the threads operate in different areas of the program, this may lead to an increased number of conflict and/or capacity misses in the instruction cache. The impact of this issue will depend on the application code, the conversion process, as well as the number of threads.
• Emulated multithreading traverses the loop in the thread execution routine once for every context switch. The code of this loop should therefore always reside in the first-level instruction cache. As the main loop is executed fairly often, this can be expected.
• Emulated multithreading stores the contexts of the individual threads in memory. The converted program, especially the context switch code, will access these data structures fairly often. The data structures are therefore likely to reside in the first-level data cache, which may lead to an increase in the number of conflict and/or capacity misses in the cache. However, the thread execution routine can be extended to prefetch the thread descriptors if the underlying architecture supports prefetching. Again, the impact of this issue will depend on the application code, the working set, the conversion process, the size of the data structures, as well as the number of threads.
Experimental results will be necessary to answer the question whether these or other problems exist, and to which extent they impact the performance of emulated multithreading. This question is discussed in Chapters 5 and 6, which cover the experimental results of emulated multithreading on a set of benchmarks on the Compaq XP1000 workstation and the Cray T3E massively parallel processor, respectively.
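One simple mitigation for the last point, hinted at in Section 2.2.2, is to pad and align the thread descriptors so that descriptors of different threads do not share cache-lines. The following C sketch assumes a 64-byte cache-line and the 536-byte Alpha descriptor; both the line size and the use of aligned_alloc are illustrative assumptions rather than details of the actual implementation:

    #include <stdlib.h>

    #define CACHE_LINE      64     /* assumed first-level cache-line size          */
    #define DESCRIPTOR_SIZE 536    /* Alpha thread descriptor size, Section 2.2.2  */
    #define PADDED_SIZE (((DESCRIPTOR_SIZE + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE)

    /* Allocate t thread descriptors, each padded to a multiple of the cache-line
       size and starting on a cache-line boundary, so that every descriptor
       occupies whole cache-lines of its own.  aligned_alloc requires the total
       size to be a multiple of the alignment, which PADDED_SIZE guarantees. */
    static void *alloc_descriptors(size_t t)
    {
        return aligned_alloc(CACHE_LINE, t * PADDED_SIZE);
    }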

2.3.3 Branch Prediction

In order to obtain high performance on current microprocessors, it is necessary to avoid pipeline stalls, especially for modern multiple-issue and out-of-order processors. Control dependencies are a major source of pipeline stalls as branch instructions are frequently used: in typical integer code, one out of six instructions is a branch [Wal92][SJH89]. In addition, these dependencies can only be resolved late in the pipeline, which forces the flushing of all subsequent instructions in the pipeline if the instruction stream is discontinued. Branch prediction is used to resolve control dependencies early in the pipeline, thus avoiding or at least minimizing pipeline stalls, provided that the prediction is correct. Current microprocessors therefore invest considerable resources in branch prediction logic. Recall that there are three types of control-flow instructions: conditional branches, unconditional branches and indirect branches. Conditional branches test a condition and perform a branch relative to the program counter if the tested condition is true. Unconditional branches perform a branch relative to the program counter without testing any condition. Indirect branches perform an unconditional branch; the target location is specified by a register operand. Indirect branches are mostly used for subroutine calls and returns. Note that conditional branches require the prediction of the branch direction (taken/not-taken) as well as the branch target, whereas the latter is sufficient for unconditional and indirect branches. The following paragraphs describe several branch prediction strategies with a special emphasis on strategies that are found in current microprocessors. More complete surveys of branch prediction strategies can be found in [HP96][HCC89][MH86][LS84][Smi81b].
Branch Direction. Static prediction is the simplest way to predict the direction of a branch: The outcome of a branch is predicted independent of program behavior, i.e. the outcome of previous executions of the branch. The simplest static strategy is to predict all branches as taken, resulting in a prediction accuracy of 41 % to 91 % for a subset of the SPEC92 programs [HP96]. Another variant predicts all forward branches as not-taken and all backward branches as taken. However, as usually more than 50 % of the branches are taken, this strategy is unlikely to achieve a prediction accuracy greater than 60 % [HP96]. Since a prediction accuracy of 50 % corresponds to random prediction, static branch prediction is not very successful, at least for certain types of programs. However, the prediction accuracy for static approaches can be increased to 78 % - 95 % for the same subset of the SPEC92 programs by using profiling information from previous runs to guide the code generation process [HP96][FF92]. Dynamic prediction strategies predict the branch direction based on previous executions of a branch, i.e. the branch history. One such strategy uses a branch history table with m entries, where each entry contains an n-bit saturating counter to record the branch history. Given a specific branch, the direction of the branch is predicted as follows: The address of the branch to be predicted is used to select an entry in the branch history table, usually specified by the branch address modulo m. If the value of the corresponding counter is at least 2^(n-1), the branch is predicted as taken, otherwise it is predicted as not-taken.
After the branch has been resolved, the branch history table is updated: the counter is incremented if the branch was taken and decremented otherwise, saturating at the extreme values. Different branches can map to the same entry in the branch history table, as long as the low-order address bits are identical. The accuracy of these branch predictors does not increase significantly beyond n = 2, hence n = 2 is commonly used. The prediction accuracy for a branch history table with 4096 entries and 2-bit saturating counters ranges between 82 % and 100 % for the SPEC89 programs [HP96]. A more accurate version of dynamic branch prediction based on the scheme described above are two-level or correlated branch predictors [PSR92][YP92]. An (m,n) correlated branch predictor combines 2^m branch history tables using n-bit saturating counters as described above with an m-bit history register that records the outcome of the last m branches. The contents of this history register are used to select a branch history table, afterwards the prediction proceeds as described above. This description covers only the basic properties of this type of branch predictor, several different versions are investigated in [YP93]. The accuracy of the two-level or correlated branch predictors ranges between 89 % and 100 % for a subset of the SPEC89 benchmarks [HP96]. Combining branch predictors [McF93] are an even more complicated branch prediction strategy that is used in recent microprocessors, e.g. the Alpha 21264. The basic idea behind combined predictors is to use three different branch predictors: a local, a global, as well as a choice branch predictor. The choice predictor is organized like a branch prediction table, but the result of the prediction is used to select between the local and global branch predictors. The local and global branch predictors can be of any type, but usually two-level branch predictors are used with a single history register for the global predictor and a branch history table for the local predictor. The average accuracy of the combined branch predictors is 98.1 % for the SPEC89 programs, compared to 97.1 % for the two-level branch predictors [McF93].
Branch Target. The target of a branch is usually predicted with a branch target buffer [LS84]. The branch target buffer is a small cache that contains the address of a branch as well as the most recent target address in each entry. The target of a given branch is predicted as follows using a branch target buffer: If a branch is executed, the branch target buffer is looked up with the address of the branch instruction. If a corresponding entry is found in the branch target buffer and the branch is predicted as taken, the target address stored in the branch target buffer is used to predict the target of the branch. If no corresponding entry can be found, a new entry is allocated. After the branch has been resolved, the branch target buffer is updated with the actual branch target address. The accuracy of branch target buffers depends on the size of the buffer as well as the replacement strategy; prediction accuracies of 95.2 % for a buffer with 512 entries have been reported [LS84]. Return-address stacks are used to predict the target of indirect branches [KE91]. A return-address stack is a small stack that stores return addresses and works as follows: A subroutine call pushes the updated program counter, i.e. the address of the next instruction, onto the return-address stack.
These return addresses are used by the subroutine returns, which pop the topmost return address from the stack and use this address to predict the target address. A special case are coroutine calls, which perform both operations, i.e. pop the topmost entry to predict the call target and push the address of the next instruction onto the stack. As long as the return-address stack is larger than the maximum call-depth, the return-address stack will predict all return addresses correctly. Note that procedure calls themselves can be predicted using branch target buffers with reasonable accuracy.
Emulated Multithreading and Branch Prediction. Emulated multithreading uses a procedure call and return instruction pair to execute an instruction block of a given thread, i.e. it significantly increases the number of executed subroutine calls and returns compared to the original program. The conditional and unconditional branches in the program are either unchanged or replaced by instruction sequences that contain a subroutine return instruction. In the case of floating-point conditional branches, these instruction sequences contain the original conditional branch as well, since there are no conditional moves for floating-point operands. Hence the number of executed subroutine return instructions increases even further. The number of executed unconditional branches will decrease in the same way as these branches are replaced by the instruction sequences mentioned above. The number of executed conditional branches increases due to the conditional branch in the main loop of the thread execution routine that is executed for each instruction block. However, this branch is only taken if a thread has finished execution and is removed from the list of active threads, hence even simple dynamic branch predictors will predict the direction of this branch correctly. Of greater concern are the subroutine return instructions at the end of each instruction block and the subroutine call in the main loop of the thread execution routine. Note that all subroutine returns at the end of an instruction block return to the same location, i.e. the instruction that follows the subroutine call in the main loop. Since the target address never changes, the target address of these subroutine returns can be predicted with reasonable accuracy using a branch target buffer. Using a return-address stack is even more effective: The subroutine return at the end of an instruction block will always be predicted correctly using a return-address stack, since the subroutine call in the main loop pushes the correct address on the stack. Note that code using emulated multithreading has a maximum call depth of one relative to the main loop, hence the size of the return-address stack is of less concern and can only be exceeded by calls to subroutines that do not use emulated multithreading. The subroutine call in the main loop of the thread execution routine is problematic: The target of this call will likely be different upon every execution, unless one or more threads subsequently execute the same instruction block. This behavior renders a branch target buffer useless: Since the subroutine call is executed fairly often, it is likely that a corresponding entry in the branch target buffer will be created. However, the predicted target address will be wrong in most cases, each time causing a mispredict penalty.
Likewise, a return-address stack is useless, since the subroutine calls only push values onto the stack, i.e. a return-address stack is only useful for predicting subroutine returns. A possible alternative is to replace the subroutine call in the main loop as well as the subroutine returns at the end of each instruction block with a coroutine call. In this way, the return-address stack can be used to predict the target of the subroutine call in the main loop, while the subroutine returns are predicted as described above. However, the prediction of the subroutine call will only be correct if there is a single thread that executes a linear sequence of instruction blocks. Otherwise the subroutine return instructions will push the wrong addresses on the return-address stack. This restriction conflicts with the purpose of emulated multithreading, i.e. to use several threads to hide the latency of accesses to remote memory locations. On the other hand, the target of the subroutine call in the main loop of the thread execution routine is already stored in the thread descriptor, i.e. is well known in advance. On architectures that support branch prediction instructions, the subroutine call could be predicted by issuing a branch prediction hint prior to the call. Since branch prediction instructions must be issued a minimum distance ahead of the branch to be effective, these instructions could be used to predict the target of the next but one thread, similar to the prefetching of thread descriptors.
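For reference, the following minimal C sketch shows the 2-bit saturating-counter branch history table described at the beginning of this section; the table size, the simple modulo indexing, and the update on the actual branch outcome follow the textbook scheme and are not a model of any particular processor:

    #include <stdint.h>

    #define BHT_ENTRIES 4096
    static uint8_t bht[BHT_ENTRIES];                 /* 2-bit counters, values 0..3 */

    static int predict_taken(uint64_t branch_addr)
    {
        return bht[branch_addr % BHT_ENTRIES] >= 2;  /* counter >= 2^(n-1), n = 2 */
    }

    static void bht_update(uint64_t branch_addr, int taken)
    {
        uint8_t *c = &bht[branch_addr % BHT_ENTRIES];
        if (taken) { if (*c < 3) (*c)++; }           /* saturate at strongly taken     */
        else       { if (*c > 0) (*c)--; }           /* saturate at strongly not-taken */
    }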

2.3.4 Code Scheduling

Static and dynamic hazards are the cause of pipeline stalls that decrease the performance of pipelined microprocessors, especially in the case of multiple instruction issue. Code scheduling is used by the compiler to reduce the number of static data and structural hazards in the generated code, thereby increasing performance. The importance of code scheduling is illustrated by the fact that all modern compilers employ more or less aggressive forms of code scheduling. Note that code scheduling is usually NP-hard [GJ79], hence most algorithms are based on heuristics. Popular approaches are global instruction scheduling [BR91] and selective scheduling [ME97]. A good introduction to various code scheduling techniques can be found in [Muc97]. Emulated multithreading affects code scheduling, since the conversion process merges tailored context switch code with the application program. The conversion process inserts save and restore instructions into the original program, which is likely to render the previous code scheduling useless, hence the converted code should be scheduled again after the conversion process. If the assembler supports code scheduling optimizations, this rescheduling can be performed by the assembler, otherwise a code scheduling pass has to be added to the conversion program.

2.3.5 Out-of-order Execution

Out-of-order execution is a technique used to extract more instruction-level parallelism from the instruction stream: A processor that supports out-of-order execution issues instructions out of sequential program order to the execution units, provided that the instructions are independent. Some processors combine out-of-order issue with speculation, i.e. issue instructions out-of-order even if the independence has not been established. Note that out-of-order execution implies out-of-order completion, i.e. instructions do not finish in sequential order. The goal of the instruction fetch stage is to find enough independent instructions that can be issued in parallel. The limiting factor is the size of the instruction window, i.e. the number of instructions that are inspected for dependencies. On the one hand, a larger instruction window will increase the potential to find independent instructions. On the other hand, a larger instruction window requires accurate branch prediction in order to fill the instruction window correctly due to the frequent use of branches. The issue stage forwards the instructions to the execution stage, provided that enough execution units are available. The scheduling phase assigns the issued instructions to designated execution units at designated times. There are two different forms of scheduling:
• Control-flow scheduling [Tho70] does not issue instructions until all data and resource dependencies have been resolved and is usually implemented with a central resource called the scoreboard.

• Data-flow scheduling [Tom67] issues instructions immediately to the execution units, where the instructions are stored in buffers until all resources are available. Data-flow scheduling is implemented using distributed resources and is usually combined with register renaming [Sim00].
Emulated multithreading benefits from out-of-order execution, since there are no data dependencies between individual threads: All registers used in an instruction block are initialized upon first use either via a restore operation or via a normal instruction. A processor that uses register renaming will be able to detect the absence of data dependencies. Apart from the data dependencies within the instruction blocks, the only data dependencies between the instruction blocks and the main loop of the thread execution routine exist via the ThreadPC register, since that register is updated at the end of the instruction block and is used inside the main loop. Control dependencies exist between instruction blocks and the main loop via the subroutine return and between other instruction blocks via the main loop. As long as the processor fills the instruction window across predicted branches and the branches are predicted correctly, these control dependencies present no problems. As mentioned above, the subroutine returns at the end of the instruction blocks will be predicted correctly using a return-address stack, but the subroutine call to the next instruction block will always be mispredicted in the absence of specific branch prediction instructions. Unless the subroutine calls can be predicted correctly, out-of-order execution will only cover the current instruction block and the loop in the thread execution routine; it will not cover the next instruction block. Therefore accurate branch prediction is important to the performance of emulated multithreading.

2.4 Architecture Support

The previous section has described the interaction between emulated multithreading and the underlying architecture. During this discussion, a set of architectural characteristics has emerged that improve the performance of emulated multithreading:
• The number and size of the integer and floating-point registers dominates the size of the thread descriptor. Larger thread descriptors could increase problems with the data caches, thereby affecting performance. The number of the integer and floating-point registers should therefore not be too large.
• The number of special registers increases the size of the thread descriptor and affects the performance of the generated code, e.g. if exception barriers are required to save and restore these registers. The number of these registers should therefore be quite small.
• Prefetch instructions can be used to prefetch thread descriptors in the main loop of the thread execution routine, potentially increasing performance. An architecture that supports such instructions is therefore beneficial to emulated multithreading.

• Branch prediction hints are essential to predict the target of the subroutine call in the main loop of the thread execution routine correctly. Since this subroutine call will be executed fairly often, this can have a significant impact on performance. An architecture that supports such instructions is therefore beneficial to emulated multithreading.
• The emulation library contains synchronization routines that require some form of atomic read-modify-write instructions, e.g. load-locked and store-conditional, in order to implement locks and barriers.
• Out-of-order execution potentially improves the performance of programs using emulated multithreading, especially in combination with branch prediction hints. Implementations that support out-of-order execution are therefore beneficial to emulated multithreading, hence such implementations should exist for the underlying architecture.
• Emulated multithreading was designed to hide the latency of remote memory access in massively parallel processors. Such machines should therefore exist for a given architecture.
The characteristics of six currently popular commercial processor architectures are summarized in Table 2.1. The following paragraphs discuss the suitability of each of these architectures in detail before choosing the architecture that is used for the first implementation of emulated multithreading.

Table 2.1. Comparison of RISC Architectures

                              Power           Sparc V9        IA64
    Int Registers (Number)    32              32              128
    Int Registers (Size)      64 bit          64 bit          64 bit
    FP Registers (Number)     32              32              128
    FP Registers (Size)       64 bit          64 bit          82 bit
    Special Registers         5               4               205
    Instruction Format        32 bit          32 bit          128 bit
    Data Formats              8,16,32,64 bit  8,16,32,64 bit  8,16,32,64 bit
    Prefetch Instructions     √               √               √
    Synchronizations          √               √               √
    Branch Hints              (√)             (√)             √
    Out-of-Order Impl.        √               √               –
    MPPs ≥ 64 PEs             √               (√)             –

                              HP-PA 2.0       MIPS IV         Alpha
    Int Registers (Number)    32              32              32
    Int Registers (Size)      64 bit          64 bit          64 bit
    FP Registers (Number)     32              32              32
    FP Registers (Size)       64 bit          64 bit          64 bit
    Special Registers         8               3               1
    Instruction Format        32 bit          32 bit          32 bit
    Data Formats              8,16,32,64 bit  8,16,32,64 bit  8,16,32,64 bit
    Prefetch Instructions     √               √               √
    Synchronizations          √               √               √
    Branch Hints              –               (√)             (√)
    Out-of-Order Impl.        √               √               √
    MPPs ≥ 64 PEs             (√)             √               √

The Power architecture is a 64 bit architecture that defines 32 integer as well as 32 floating-point registers, each 64 bits wide. There are no dedicated zero source and sink registers: Register r0 provides a zero operand only if used as a base register during addressing. There are five special registers: the condition, link and count registers, the integer exception register as well as the floating-point status and control register. Instructions have a fixed size of 32 bits, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions and supports static branch prediction for conditional branches. All implementations of the Power architecture, e.g. Power1, Power2, Power3, have supported out-of-order execution right from the start. The IBM SP2 is a massively-parallel processor based on the Power architecture and supports thousands of processors. Information about the Power and PowerPC architectures can be found in [MSSW94].

The Sparc V9 architecture is a 64 bit architecture that defines 32 integer as well as 32 floating-point registers, each 64 bits wide. Note that the Sparc architecture uses overlapping register windows. Register r0 is a zero source and sink register, all other registers are general-purpose. There are four special registers: the multiply/divide register, the condition code register, the floating-point state register, and the address space identifier. Instructions have a fixed size of 32 bits, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions and supports static branch prediction for conditional branches. The UltraSparc processors from Sun do not support out-of-order execution, but the Fujitsu Sparc64-III is an out-of-order implementation of the Sparc architecture. The Sun UltraEnterprise 10 000 is the largest multiprocessor based on the Sparc architecture and supports up to 64 processors. Moreover, the successor machine based on the UltraSparc-III will support hundreds of processors. Information about the SPARC architecture can be found in [Cat91][WG93].
The Intel IA64 architecture is a 64 bit VLIW architecture that defines 128 integer as well as 128 floating-point registers, each 64 and 82 bits wide, respectively. There is a large number of special registers: 64 single-bit predicate registers, eight branch registers, 128 application registers, the register stack configuration register, the floating-point status register, the loop and epilog count registers, as well as the user mask. Instruction bundles have a fixed size of 128 bits and contain up to three instructions, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions and supports special branch prediction instructions, which can be used to manipulate the return-address stack. The first implementation of the IA64 architecture, the Itanium, does not support out-of-order execution. As the architecture is rather new, there are only workstations and small servers, but no massively-parallel processors available. Information about the IA64 architecture can be found in [Ita01].
The HP-PA 2.0 architecture is a 64 bit architecture that defines 32 integer as well as 32 floating-point registers, each 64 bits wide. Register r0 is a zero source and sink register, all other registers are general-purpose. There are eight special registers: five space registers, the shift amount register, the branch nomination register, as well as the floating-point status register. Instructions have a fixed size of 32 bits, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions. The PA-8x00 line of processors is based on the HP-PA 2.0 architecture and supports out-of-order execution. The HP SuperDome is the largest multiprocessor based on the HP-PA architecture and supports up to 64 processors. Information about the HP-PA architecture can be found in [Kan95].
The MIPS-IV architecture is a 64 bit architecture that defines 32 integer as well as 32 floating-point registers, each 64 bits wide. Register r0 is a zero source and sink register, all other registers are general-purpose. There are three special registers: the high and low registers as well as the floating-point status register. Instructions have a fixed size of 32 bits, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions and supports static branch prediction. The R10 000, R12 000 and R14 000 processors are out-of-order implementations of the MIPS architecture, the SGI Origin 2000 and 3000 are massively parallel systems that support hundreds of processors. Information about the MIPS architecture can be found in [KH92].
The Alpha architecture is a 64 bit architecture that defines 32 integer as well as 32 floating-point registers, each 64 bits wide. Registers r31 and f31 are zero source and sink registers, all other registers are general-purpose. There is only one special register, i.e. the floating-point control register. Instructions have a fixed size of 32 bits, while byte, word, longword, and quadword integers as well as single- and double-precision floating-point formats are supported. The architecture defines prefetch and synchronization instructions and supports static branch prediction. The 21264 processor was the first out-of-order implementation of the Alpha architecture, the Cray T3D, T3E and the Compaq AlphaServer SC are massively parallel systems that support thousands of processors. Information about the Alpha architecture can be found in [Com98].
The Alpha architecture was chosen for the first implementation of emulated multithreading since it is best suited for this technique: The number and size of the integer and floating-point registers is comparable to other RISC architectures, but there is only one special register, which is only infrequently used. The architecture supports prefetch instructions, synchronization instructions, as well as static branch prediction. Implementations of the Alpha architecture make extensive use of multiple issue and out-of-order execution, e.g. the Alpha 21264, and have always been among the world's fastest microprocessors according to the SPEC benchmarks. In addition, several massively parallel processors based on the Alpha architecture exist and are in widespread use, e.g. the Cray T3E and the Compaq AlphaServer SC. The only major drawback of using this architecture is the absence of explicit branch prediction instructions like those of the IA64 architecture. The next chapter discusses the current implementation of emulated multithreading on the Alpha architecture.

3. Implementation

The last chapter has described the basic concept of emulated multithreading. This chapter covers the current implementation of emulated multithreading; special emphasis is placed on the tools and algorithms used in the code conversion process. The chapter is organized as follows: Section 3.1 introduces the capabilities of the implementation, while Section 3.2 describes the design flow and the tasks of the individual tools. These tools are described in detail in Sections 3.3, 3.4 and 3.5. Section 3.6 covers register partitioning, an enhanced optimization technique. Section 3.7 covers platform-specific implementation details, while Section 3.8 discusses the benefits of integrating the individual tools into an existing compiler.

3.1 Introduction

As described in the previous chapter, code conversion is the major element of emulated multithreading. The code conversion process is performed after code generation, i.e. the conversion process operates on assembler code or an equivalent low-level representation. Tools that operate on this kind of representation are called postpass-optimizers [Kae00]. There are two basic ways to implement such postpass-optimizers: They can be integrated into the compiler or they can be implemented as stand-alone tools that operate on assembler source files. The next paragraphs discuss the benefits and drawbacks of both approaches.
Integrating the conversion process into the compiler allows the conversion process to utilize the existing compiler infrastructure, e.g. register allocation, instruction scheduling, optimization passes, as well as the corresponding data structures. In addition, it is possible to instruct other compilation phases to generate code that is better suited for the conversion. This phase-combining approach was shown to be effective in other areas, e.g. integrating scheduling and register allocation [BEH91]. A major drawback of the integrated approach is that the source code of the compiler is usually needed for a successful integration. However, at least for commercial products the source code is not available. The notable exception is the GNU Compiler Collection [LO96], but this compiler suite is not available on all of the platforms used during the evaluation of emulated multithreading. An alternative is the Stanford University Intermediate Format (SUIF) compiler system [WFW+94][Lam99]. The integration of the conversion process into this highly modular compiler system is discussed in Section 3.8.
Implementing the conversion process as a stand-alone tool allows the use of different compilers: As long as the compilers use the same assembler syntax, the same conversion tool can be used, since the compilers generate code for the specified assembler. Even if the compilers use different assemblers on the same platform, at most those parts of the conversion tool that parse the assembler code need to be changed. Another benefit is the portability of separate tools: it is not necessary to port a whole compiler to support a new platform, only the conversion tool itself has to be ported. Even a separate tool can reuse the existing programming environment, e.g. leave the instruction scheduling phase to the native assembler. A major drawback of the stand-alone approach is the need to reimplement at least parts of the compiler structure, e.g. lexer, parser, register allocation. The separate conversion tool is therefore more complex to implement, but is not constrained by the structure of the existing compiler.
The code conversion process was implemented as a stand-alone tool for several reasons: First, the SUIF compiler system, especially the machineSUIF backend, was undergoing major revisions, making it hard to decide whether such an integration would be feasible. For example, the optimization programming interface was totally undocumented at the time. Second, the implementation of emulated multithreading should use the native compiler on each platform in order to ensure that the results are not biased by using another compiler that might generate suboptimal code. The code conversion process is implemented in a stand-alone tool called asmconv. This tool operates on the assembler source generated by the compiler and performs the code conversion process.
The assembler converter is augmented by another converter called hllconv that operates on the high-level language. This converter is used to generate configuration files for the assembler converter. In addition, the high-level language converter performs the required procedure duplication and call substitutions on the high-level language source code, if register partitioning as described in Section 2.1.3 is used. The next section discusses the design flow as well as the capabilities of the converters in detail.

3.2 Design Flow

The code conversion process is based on two stand-alone tools, i.e. the asmconv and hllconv converters, as well as the existing programming environment, i.e. compiler, assembler, linker. The design flow and interaction between these tools is illustrated in Figure 3.1. The left part of the figure depicts the

Fig. 3.1. Design Flow (figure: source code, hllconv, procedure list, compiler, asmconv, project.conf, machine.conf, assembler, linker, libs, EMUlib, executable)
design flow for modules that contain no procedures using emulated multithreading. The right part of the figure shows the design flow for modules that contain at least one such procedure.

In the standard design flow, the source code of the module in a language such as C, C++, or Fortran is passed to the corresponding compiler. The compiler generates a source file in assembly language for the module. This source file is subsequently passed to the assembler, which generates the corresponding object file. After all object files have been generated, the object files as well as the referenced libraries are processed by the linker, which produces the corresponding executable.

The standard design flow described above was slightly modified to incorporate the conversion process: The high-level language converter is executed before the compiler, while the assembler converter is executed after the compiler, but before the assembler. The converted code references routines from the emulation library, hence this library is passed to the linker in addition to the object files and the other libraries. A detailed description of the library can be found in Section 3.4.

The high-level language converter parses the source code, searching for procedure definitions, declarations, or calls to procedures from the list of procedures to be converted. Based on this information, the high-level language converter generates a project-specific configuration file for the assembler converter and performs procedure duplication and call substitution in the case of register partitioning. A detailed description of the converter is given in Section 3.3.

The assembler converter operates on the assembler source generated by the compiler. Based on the information in the platform-specific configuration file (machine.conf) as well as the module-specific configuration file generated by hllconv, the assembler converter performs the code conversion process on selected procedures as described in Section 2.2.4. The converter generates a new assembler source file that contains the converted code and which is transformed into an object file by the assembler. A detailed description of the converter is given in Section 3.5.

3.3 High-Level Language Converter

The high-level language converter hllconv is responsible for generating a module-specific configuration file for the assembler converter as well as pro- cedure duplication and call substitution in the case of register partitioning. Section 3.3.1 describes the structure and contents of this configuration file, while Sections 3.3.2 and 3.3.3 cover the conversion tasks of the high-level language converter as well as the implementation, respectively.

3.3.1 Configuration File

The module-specific configuration file contains information about every procedure to be converted as well as every procedure called by such a procedure. Since the assembler converter obtains information about system and other platform-specific calls from the platform-specific configuration file, the high-level language converter can omit these calls. For each procedure, the configuration file contains an entry that consists of a name and type as well as several register fields.

The name field consists of the name keyword followed by a string enclosed in quotes that contains the name of the procedure. The type field consists of the type keyword followed by a space-separated list of keywords: The intern keyword marks the procedure as internal; the code conversion process will be applied to all internal procedures. External procedures, e.g. system calls, are marked by the extern keyword. Either the intern or the extern keyword has to be specified. Calls to procedures marked with the switch keyword force the call instruction to end an instruction block, i.e. initiate a context switch. The switch keyword is mandatory for internal procedures and optional for external procedures. The fixarg keyword is used for procedures with a fixed number of arguments, variable-argument procedures use the vararg keyword.

A register field can be of several types: The rreq register field consists of the rreq keyword followed by a string enclosed in quotes. The string contains a colon-separated list of register names; the designated registers are required by the call, but are not the argument registers, e.g. frame and stack pointers. The ropt register field consists of the ropt keyword followed by a string enclosed in quotes. The string contains a colon-separated list of register names; the designated registers are optionally expected by the call, but are not argument registers. The rarg register field consists of the rarg keyword followed by a string enclosed in quotes. The string contains a colon-separated list of register names; the designated registers are the argument registers of the call. The string must be empty for procedures that use a variable number of arguments. The rret register field consists of the rret keyword followed by a string enclosed in quotes. The string contains a colon-separated list of register names; the designated registers are the return registers of the call.
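To make the structure concrete, the following is a hypothetical entry for an internal procedure; the procedure name, the register names, and the exact layout of the entry are invented for illustration, only the keywords are taken from the description above:

    name "daxpy"
    type intern switch fixarg
    rreq "$sp:$gp"
    ropt ""
    rarg "$16:$17:$18"
    rret "$0"

An external procedure, e.g. a library call, would use the extern keyword instead of intern and may omit the switch keyword.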

3.3.2 Conversion Tasks

The primary task of the high-level language converter is the creation of the configuration file described above. This configuration file has an entry for each procedure whose name is in the list of user-supplied procedure names. These procedures are called internal procedures because the corresponding assembler code will be converted by the assembler converter. In addition, the configuration file has an entry for every procedure that is called by an internal procedure.

The necessary information can be gathered by inspecting the declaration and body of all internal procedures and the declaration of all procedures called by internal procedures. The procedure declaration contains information about the number and type of arguments and return values. Together with the platform-specific calling conventions, this information can be translated to the four individual register fields described above. Hence the platform-specific calling conventions must be incorporated into the high-level language converter. Inspection of the procedure body reveals all calls to external or internal procedures. The information for the external procedures can be gathered in the same way as for internal procedures, i.e. by inspecting the corresponding procedure declaration.

If register partitioning is used, the high-level language converter has to perform the necessary code duplication and call substitutions: For every internal procedure, the corresponding procedure declaration and body have to be duplicated p times, where p is the number of register partitions. The p additional procedures are distinguished from the original procedure by adding a prefix as well as a postfix to the procedure name: The prefix is a fixed string, while the postfix contains the number of the partition. Note that the original procedure must be preserved, since it may be called by one of the external procedures. Inside the duplicated procedure bodies, calls to internal procedures must be substituted with calls to the partition-specific copies of these procedures. A source-level sketch of this duplication is given below.
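The following sketch illustrates the duplication for an internal procedure foo that calls another internal procedure bar, assuming p = 2 partitions; the EMU prefix and the numeric postfix are hypothetical names chosen only for illustration:

    /* original procedures, preserved for calls from external code */
    void bar(int n);
    void foo(int n) { if (n > 0) bar(n - 1); }

    /* partition-specific copies with substituted calls to internal procedures */
    void EMUbar1(int n);
    void EMUbar2(int n);
    void EMUfoo1(int n) { if (n > 0) EMUbar1(n - 1); }
    void EMUfoo2(int n) { if (n > 0) EMUbar2(n - 1); }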

3.3.3 Implementation

The transformations described above are easy to perform on the call graph:

Definition 3.3.1. The call graph of a program P containing the procedures p1, . . . , pn is the graph G = (N, S, E, r), where N = {p1, . . . , pn} is the set of nodes, S is the set of call sites denoted by byte offsets, r is the node that contains the program entry point, and E ⊂ N × N × S is a set of labeled edges.

Each e = (pi, s, pj) ∈ E represents a call of procedure pj at the call site s in procedure pi.

Constructing such a graph is straight-forward except for two problems: separate compilation and procedure pointers. If the program P consists of multiple modules that are compiled separately, the call graph must be constructed incrementally or all modules have to be processed simultaneously. The second problem is more serious, since the existence of procedure pointers makes the construction of a call graph PSPACE-hard [Wei80], hence usually a conservative approximation of the call graph is used. At the moment, procedure pointers are not allowed.

Given the call graph G, the transformations described above can be performed as follows: Let I be the set of internal procedures given by

I = { p ∈ N | the name of procedure p is in the user-supplied list }

Based on the set I, the set X of external procedures can be described as

X = { pj ∈ N | ∃ (pi, s, pj) ∈ E : pi ∈ I ∧ pj ∉ I }

Note that system calls are omitted by default, since only the set N of procedures in the program itself is inspected. After the sets I and X have been determined, it is sufficient to translate the procedure declaration for each procedure in both sets into an entry in the configuration file. This translation is straight-forward if the calling conventions are known.

If register partitioning is used, procedure duplication and call substitution have to be performed. These operations can be implemented by transforming the call graph as follows: Given the call graph G = (N, S, E, r) and the set I of internal procedures, let p be the number of partitions. Duplicating all internal procedures yields the call graph G' = (N', S', E', r'), where

N' = N ∪ { pi^k | pi ∈ I, 1 ≤ k ≤ p }
S' = S ∪ { sj^k | ∃ (pi, sj, pj) ∈ E : pi ∈ I, 1 ≤ k ≤ p }
E' = E ∪ { (pi^k, s^k, pj) | ∃ (pi, s, pj) ∈ E : pi ∈ I, pj ∉ I, 1 ≤ k ≤ p }
       ∪ { (pi^k, s^k, pj^k) | ∃ (pi, s, pj) ∈ E : pi, pj ∈ I, 1 ≤ k ≤ p }
r' = r

The call graph transformations described above provide the basis for the implementation of the high-level language converter. The converter therefore constructs the call graph of a program, performs the transformation described above on the call graph, and creates the configuration file afterwards. The high-level language converter inspects all modules of the program simultaneously to address the issue of separate compilation.

There are several call graph constructors available; a quantitative survey can be found in [MNL96]. Instead of implementing a call graph extractor in the high-level language converter, the converter uses the output from one of these call graph extractors, namely cflow. This requires only a lexer and parser to translate the output of the extractor into an internal representation. The lexer and parser stage can be implemented using automatic code generators, e.g. lex [LMB92] and yacc [LMB92], respectively.

The translation of procedure declarations into types and register fields can be handled as follows: If the procedure is internal, the intern keyword is added to the type field, otherwise the extern keyword is added. If the procedure expects a variable number of arguments, the vararg keyword is added to the type field, otherwise the fixarg keyword is added. The required and optional register fields are determined by the calling convention.

The argument registers are determined as follows: If the procedure expects a variable number of arguments, the corresponding field is left empty. Otherwise the argument list in the procedure declaration is processed from left to right in order to determine the type, i.e. integer or floating-point, of each argument. Integer and pointer arguments are usually passed in the integer argument registers, while the floating-point arguments are usually passed in the floating-point argument registers. If the number of arguments exceeds the number of argument registers, the remaining arguments are usually passed via the stack.

After the type of all arguments has been determined, the individual arguments are mapped to the corresponding argument registers according to the calling conventions. For example, the Tru64 operating system assigns the ith argument to the ith integer or floating-point argument register, depending on the type of the argument. Note that this assignment uses either the ith integer or the ith floating-point register, but never both at the same time, hence it does not utilize all available argument registers in all situations. The return register field is determined in the same way as the argument register fields by inspecting the return type of the procedure as given by the procedure declaration and assigning the corresponding return value register according to the calling convention.
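The argument mapping can be sketched as follows; the register names and the limit of six argument registers are assumptions based on the common Alpha/Tru64 convention and are not taken from the text, only the rule of assigning the ith argument to the ith integer or floating-point argument register is:

    /* Sketch: map the i-th argument to the i-th integer or floating-point
       argument register; arguments beyond the register limit go on the stack. */
    enum arg_type { ARG_INT, ARG_FP };

    static const char *int_arg_regs[6] = { "$16", "$17", "$18", "$19", "$20", "$21" };
    static const char *fp_arg_regs[6]  = { "$f16", "$f17", "$f18", "$f19", "$f20", "$f21" };

    /* returns the register name for argument i, or NULL if it is passed via the stack */
    const char *map_argument(int i, enum arg_type type)
    {
        if (i >= 6)
            return NULL;
        return (type == ARG_FP) ? fp_arg_regs[i] : int_arg_regs[i];
    }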

3.4 Emulation Library

The emulation library contains the routines that are called by the transformed code and comprises three different groups of routines: thread initialization, thread execution, and communication routines. As the implementation of almost all routines in the library is platform-specific, the following sections describe only the functionality as well as platform-independent implementation issues. The platform-specific details can be found in Section 3.7. Section 3.4.1 covers the thread initialization routines, while Section 3.4.2 describes the thread execution and local synchronization routines. The remote communication and synchronization routines are described in Section 3.4.3.

3.4.1 Thread Initialization Routines

The thread initialization routines initialize the required data structures mentioned in Section 2.2.2, e.g. the thread control block and the thread descriptors. Apart from these data structures, two additional data structures are used during initialization: The thread arguments structure is used to store the arguments for the thread and contains two arrays for the integer and floating-point arguments as well as the number of integer and floating-point arguments. The thread attribute structure is used to pass thread attributes, i.e. stack size and number of threads, to the thread creation routine.

The EMUthread_args_init() routine allocates and initializes a thread argument structure, while the EMUthread_args_destroy() routine frees the memory associated with a thread argument structure. The EMUthread_args_set() routine copies the individual arguments into the given thread argument structure and expects the corresponding integer and floating-point arguments as input. These arguments are copied to the integer and floating-point argument arrays of the thread arguments structure according to the calling convention. Note that the routine currently does not support the passing of arguments via the stack, i.e. all arguments for the thread startup procedure have to be passed via registers.

The EMUthread_attr_init() routine allocates and initializes a thread attribute structure, while the EMUthread_attr_destroy() routine frees the memory associated with a thread attribute structure. The EMUthread_attr_setnumthreads() routine expects the number of threads as an argument and initializes the corresponding field of the specified thread attribute structure, while the EMUthread_attr_getnumthreads() routine returns the number of threads as stored in the specified thread attribute structure. The EMUthread_attr_setstacksize() routine expects the stacksize for the individual threads as an argument and initializes the corresponding field of the specified thread attribute structure, while the EMUthread_attr_getstacksize() routine returns the stacksize as stored in the specified thread attribute structure.

The EMUthread_create() routine is used to create the thread control block as well as the thread descriptors for the individual threads. The routine expects a thread control block, thread argument and attribute structures, as well as a pointer to the thread startup procedure as arguments. The routine initializes the thread control block and allocates and initializes the thread descriptors for each thread. The number of threads as well as the stacksize are taken from the thread attribute structure. In order to reduce the number of memory allocations, the thread descriptors and thread stacks are allocated in two subsequent memory allocations.

Each thread descriptor is assigned a unique number in the range from 0 to t − 1, where t is the number of threads. The integer and floating-point register fields are initialized by copying the arguments from the thread argument structure to the corresponding argument registers. In addition, the stack and frame pointers are set to the thread's stack and the return address register is initialized to the fourth step in the main loop of the thread execution routine. The program counter is set to the value of the procedure pointer. The thread descriptors are organized in a circular, doubly-linked list, and the head and tail pointers of the thread control block are initialized accordingly.
After all thread descriptors have been initialized, the EMUloop() thread execution routine is used to start the execution of the threads. A usage sketch is given below.
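The following sketch shows how these routines might be combined to start a multithreaded computation; the routine names are taken from the text, but the structure types, the exact signatures, and the worker startup procedure are assumptions made for illustration:

    /* hypothetical signatures; only the routine names stem from the text */
    struct EMUthread_args *args = EMUthread_args_init();
    struct EMUthread_attr *attr = EMUthread_attr_init();

    EMUthread_args_set(args, n, x, y);          /* arguments for worker(), passed in registers */
    EMUthread_attr_setnumthreads(attr, 8);      /* eight threads per processor */
    EMUthread_attr_setstacksize(attr, 65536);   /* 64 KB stack per thread */

    EMUthread_create(&tcb, args, attr, worker); /* worker is the thread startup procedure */
    EMUloop(&tcb);                              /* execute all threads to completion */

    EMUthread_args_destroy(args);
    EMUthread_attr_destroy(attr);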

3.4.2 Thread Execution Routines

The second group of routines in the emulation library is used during thread execution and contains the following routines: The EMUloop() routine is the thread execution routine that has already been described in Section 2.2.3. The routine is written in machine language for performance reasons and is therefore platform-specific. The EMUthread_self() routine expects no arguments and returns the unique number that identifies the current thread. The EMUthread_switch() routine expects no arguments and returns immediately without performing any operations. The platform-specific configuration file instructs the assembler converter to force a context switch after each call to this routine.

The EMUthread_cswap() routine performs a conditional swap and expects three arguments, i.e. the address of a memory location as well as the condition and value arguments. If the contents of the specified memory location are equal to the condition value, an atomic read-modify-write sequence is used to store the value at the specified memory location. The routine returns the old contents of the memory location in either case. The EMUthread_cswap() routine is useful for implementing inter-thread synchronization locks, as sketched below.

The EMUthread_barrier() routine expects no arguments and performs a barrier synchronization among all threads. The barrier is not restricted to a single processor, but covers all threads on all processors. Upon entry of the barrier, a thread is removed from the active list and stored in the inactive list instead. After the last local thread has entered the barrier, a system-wide barrier is performed to synchronize all processors in the system. Afterwards the head and tail pointers of the active and inactive lists are simply exchanged, thereby effectively moving all local threads from the inactive to the active list. Recall that each processor maintains its own thread control block with local active and inactive lists.
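As an illustration, a simple spin lock can be built on top of the conditional swap primitive; the argument order of EMUthread_cswap() (address, condition, value) follows the description above, but the exact signature, the return type, and the lock encoding are assumptions:

    /* sketch: 0 = unlocked, 1 = locked */
    void lock_acquire(long *lock)
    {
        /* retry until the location changes atomically from 0 to 1 */
        while (EMUthread_cswap(lock, 0, 1) != 0)
            EMUthread_switch();   /* force a context switch to another thread */
    }

    void lock_release(long *lock)
    {
        *lock = 0;
    }

Calling EMUthread_switch() in the retry loop lets the other threads on the same processor make progress instead of spinning within a single instruction block.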

3.4.3 Communication Routines

The third and last group of routines in the emulation library covers inter-processor communication and synchronization routines. These routines implement a split-transaction protocol, i.e. the initiation and completion of requests is separated in order to support latency tolerance. The separation is based on the concept of E-registers, i.e. external registers, that are used to perform the actual data transfer. These E-registers are either implemented in hardware, e.g. Cray T3E, or in software, e.g. Compaq XP1000. The group consists of the EMUereg_get(), EMUereg_put(), EMUereg_cswap(), EMUereg_mswap(), EMUereg_finc(), EMUereg_fadd(), EMUereg_pending(), and EMUereg_state() routines.

The EMUereg_get() routine is available for several different data types and expects three arguments: the number of an E-register, the address of the memory location at the remote processor, as well as the number of the remote processor. The routine initializes a remote read operation to the specified memory location at the remote processor, storing the result in the specified E-register. The routine returns after initializing the read request, i.e. usually before the actual data transfer has been finished. The only difference between the versions of the EMUereg_get() routine is the amount of transferred data, which is equal to the size of the data type.

The EMUereg_load() routine is available for several data types and expects the number of an E-register as argument. The routine returns the content of the specified E-register converted to the corresponding data type. If the corresponding data transfer has not been finished, the routine waits until the data transfer is completed before returning the content of the E-register. This routine is used in conjunction with the other routines to complete requests that have been issued by one of the other routines.

The EMUereg_put() routine is available for several data types and expects four arguments: the number of an E-register, the address of the memory location at the remote processor, the value to store, as well as the number of the remote processor. The routine stores the specified value at the memory location of the remote processor, using the specified E-register to perform the data transfer. The routine returns after the remote write request has been initialized, i.e. usually before the actual data transfer takes place. Note that it is not necessary to wait for the completion of the remote write request, since such a request returns no result. The only difference between the versions of the EMUereg_put() routine is the amount of transferred data, which is equal to the size of the data type.

The EMUereg_cswap() routine performs a conditional swap operation similar to the EMUthread_cswap() routine described above, but operates on remote memory instead. The routine expects five arguments: the number of an E-register, the address of a memory location, the condition, the value, as well as the number of the remote processor. If the original contents of the memory location at the remote processor are equal to the specified condition, the specified value is stored to this location. The original contents of the remote memory location are returned in either case and are stored in the specified E-register. The routine returns after the conditional swap operation has been initialized, i.e. usually before the actual data transfer has been finished.
The EMUereg_mswap() routine performs a remote swap operation and expects four arguments: the number of an E-register, the address of a memory location, the value, as well as the number of the remote processor. The routine stores the specified value to the remote memory location, simultaneously returning the original contents of the memory location in the specified E-register. The routine returns after the swap operation has been initialized, i.e. usually before the actual data transfer has been finished.

The EMUereg_finc() routine performs a remote fetch-and-increment operation and expects three arguments: the number of an E-register, the address of a memory location, as well as the number of the remote processor. The routine increments the contents of the remote memory location and returns the original contents of the location in the specified E-register. The routine returns after the fetch-and-increment operation has been initialized, i.e. usually before the actual data transfer has been finished.

The EMUereg_fadd() routine performs a remote fetch-and-add operation and expects four arguments: the number of an E-register, the address of a memory location, the addend, as well as the number of the remote processor. The routine adds the addend to the contents of the remote memory location and returns the original contents of the location in the specified E-register. The routine returns after the fetch-and-add operation has been initialized, i.e. usually before the actual data transfer has been finished.

The EMUereg_pending() and EMUereg_state() routines are used to obtain the E-register state: The former routine expects no arguments and returns a value that indicates whether any remote memory accesses are outstanding. The latter routine returns the state of the specified E-register. This is useful for determining whether any remote memory operations using this E-register are still outstanding.
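The split-transaction style is illustrated by the following sketch of a remote read that overlaps the transfer with unrelated work; the routine names follow the text, while the argument order, the data type, and the helper names are assumptions:

    /* initiate a remote read of a 64-bit word into E-register 0 */
    EMUereg_get(0, remote_addr, remote_pe);

    do_independent_work();        /* latency tolerance: overlap the transfer */

    /* complete the request; blocks only if the transfer is still in flight */
    long value = EMUereg_load(0);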

3.5 Assembler Converter

The assembler converter performs the actual code conversion as described in Section 2.2.4. The assembler converter operates on the assembler source generated by the compiler and uses several configuration files and command-line arguments to steer the conversion process. The structure of these configuration files as well as the various configuration options are described in Section 3.5.1.

The assembler converter uses two passes to parse the assembler source, although the first pass is optional, i.e. it is only performed on some platforms. The task of the first pass is therefore described in Section 3.7 along with other platform-specific issues. During the second pass, internal procedures are detected and the corresponding instructions are translated into a sequence of internal data structures. The lexer and parser used to process the assembler source are described in Section 3.5.2.

The sequence of instructions is subsequently grouped into basic blocks: A basic block is a maximal sequence of sequential instructions, i.e. it is bounded by branch instructions and labels. In addition, the maximum size of the basic blocks can be limited via command-line options, thus forcing a new basic block after the current basic block exceeds the maximum number of instructions. The basic block creation process is described in Section 3.5.3.

Basic blocks can be grouped into larger super blocks if the corresponding optimization is enabled. A super block is a set of basic blocks that has a single point of entry, but can have multiple exit points. The super block creation is described in Section 3.5.4. After creating the basic and super blocks, external calls are detected and the corresponding call prolog/epilog sequences are determined. This process includes the merging of basic and super blocks, if these sequences cross basic or super block boundaries. The handling of external calls is described in Section 3.5.5.

After the final shape of the basic and super blocks has been determined, several data-flow analyses are performed and the registers are allocated. These data-flow analyses are covered in Section 3.5.6. The allocation of registers is described in Section 3.5.7. After register allocation, the actual code conversion process is performed by updating the instruction sequence, modifying the individual instructions, and writing the converted procedure to the output file. These steps are covered in Section 3.5.8. The assembler converter produces detailed statistics about original and modified instructions in a basic block, super block, procedure, or module. The individual statistics are covered in Section 3.5.9.

3.5.1 Configuration

The assembler converter is configured via command-line arguments and several configuration files: a platform-specific configuration file and a module-specific configuration file. The individual command-line arguments and the structure and contents of the platform-specific configuration file are covered in the following section. The structure and contents of the module-specific configuration file have already been described in Section 3.3.1.

Command-Line Arguments. The assembler converter supports the following command-line arguments:

• The -f ⟨name⟩ argument is mandatory and specifies the name and location of the platform-specific configuration file.
• The -l ⟨name⟩ argument is mandatory and specifies the name and location of the module-specific configuration file.
• The -o ⟨name⟩ argument is optional and specifies the name and location of the output file, i.e. the converted assembler code is written to this file. If this option is omitted, the converter uses the standard output instead.
• The -g ⟨num⟩ argument is optional and specifies the maximum grainsize in instructions. If the grainsize is larger than zero, the size of basic blocks is forced to be smaller than the specified number of instructions. Otherwise the size of basic blocks is only limited by branch targets and branch instructions.
• The -p ⟨num⟩ argument is optional and specifies the number of register partitions to use. The assembler converter supports between one and four partitions on the currently supported platforms. If this option is omitted, the individual threads share the whole register set.
• The -O ⟨type⟩ argument is optional and enables the specified optimizations. It is possible to specify multiple such arguments in order to enable several optimizations simultaneously. All optimizations are disabled by default; specifying an optimization type with a no prefix disables the corresponding optimization. The following optimization types are supported:
  – The ropt keyword enables or disables the optimization of register stores during the actual code conversion.
  – The rreg keyword enables or disables the random selection of registers during register allocation.
  – The sblk keyword enables or disables the super block optimization.
  – The trap keyword is platform-specific and enables or disables the optimization of traps and exception barriers during the code conversion process.
  – The fpcr keyword is platform-specific and enables or disables the optimization of save and restore operations to the floating-point control register.
  – The arch keyword is platform-specific and enables implementation-specific optimizations as described in Section 3.5.1.
  – The emufix keyword enables or disables the use of fixed registers for the ThreadPC and ReturnPC registers, which is the default. Disabling this option causes these registers to be allocated like normal registers, which can be useful for large basic or super blocks with a high register pressure, as these registers are usually used only at the end of such a block.
• The -d ⟨num⟩ argument is optional and specifies the debug level. Valid debug levels are 1, 2, 4, and 8; the amount of debug messages increases with the debug level. See the description of the -D argument for a way to restrict the amount of debug messages.
• The -D ⟨num⟩ argument is optional and restricts the debug messages produced during register allocation to those messages that concern the register with the specified number.
• The -s ⟨num⟩ argument is optional and specifies the statistics level. Valid levels are 1, 2, 4, 8, and 16, which produce a statistics summary after each program, module, procedure, super block, or basic block, respectively.

Configuration Files. The assembler converter uses two configuration files to steer the code conversion process: the platform-specific configuration file and a module-specific configuration file. The module-specific configuration file is usually produced by the high-level language converter and contains information about the internal and external procedures encountered in a given module. The structure of this configuration file has been described in Section 3.3.1. The platform-specific configuration file is divided into three sections: The first section provides information about available instruction styles, i.e. the syntax of instructions and operands. The second section provides information about individual instructions using several instruction styles. The last section provides information about platform-specific system and library calls.
The structure of the individual entries in the first two sections is described in the following paragraphs; the structure of the entries in the third section is identical to those in the module-specific configuration file.

An instruction style is a string that represents the instruction syntax and is determined by the number and type of operands. The individual entries in the first section of the platform-specific configuration file consist of the name and type fields: The type field consists of the type keyword followed by a non-negative integer and is used to assign a unique number to each instruction style, as these styles are later referenced by this number. The name field consists of the format keyword followed by a string enclosed in quotes, which contains a representation of the instruction syntax. The ⟨exp⟩ and ⟨reg⟩ keywords are used to represent expressions and register names, respectively. For example, the syntax of an indirect addressing mode is represented by the following string:

"⟨reg⟩, ⟨exp⟩(⟨reg⟩)"

The first ⟨reg⟩ keyword represents the name of the source or destination register, the ⟨exp⟩ keyword represents the offset, while the second ⟨reg⟩ keyword represents the name of the base register. Note that the first section must contain entries for all instruction styles that are used by entries in the second section, which is described below.

Each entry in the second section defines an instruction and style pair and consists of the name, type, info, qualifier, original, modified, and tune fields: The name field consists of the name keyword followed by a string enclosed in quotes that specifies the name of the instruction. The type field consists of the type keyword followed by an integer that specifies the instruction style. The number should reference one of the instruction styles that are defined in the first section of the configuration file.

The info field consists of the info keyword followed by a space-separated list of keywords and is used to provide semantic information about the instruction. The following keywords are supported:

• The CT OPi keyword specifies that the ith operand of the instruction/style pair is a constant, i.e. a literal.
• The LD OPi keyword specifies that the ith operand, which must be a register, of the instruction and style pair is read by the instruction.

• The ST OPi keyword specifies that the ith operand, which must be a register, of the instruction and style pair is written by the instruction.
• The LD FPCR and ST FPCR keywords are platform-specific and specify that the instruction reads and writes the floating-point control register, respectively.
• The INT x and FP x keywords are used to group instructions into several sets as described in Section 3.5.9. This information is used for statistical purposes.
• The BR CTRL keyword specifies that the instruction changes the control flow, i.e. is a conditional, unconditional, or indirect branch.
• The BR COND keyword is used in conjunction with the BR CTRL keyword to specify that the instruction is a conditional branch.
• The BR CALL keyword specifies that the instruction is a subroutine call, i.e. a type of indirect branch.
• The NOP keyword specifies that the instruction is a null operation, i.e. performs no useful work.
• The TRAPB keyword specifies that the instruction is a trap barrier. This information is used in conjunction with the trapb optimization.
• The SWITCH keyword specifies that the instruction forces the end of an instruction block, thus causing a context switch.
• The SPECIAL keyword is used for some instructions that are only used by the assembler converter, e.g. procedure prologs.

The qualifier field consists of the qual keyword followed by a string enclosed in quotes. The string contains a list of supported instruction qualifiers for this instruction. An instruction qualifier is used as a postfix to an instruction and changes the behavior of the instruction, e.g. by enabling overflow checking or specific rounding modes.

The original field begins with the original keyword and is limited by curly brackets. Inside the brackets are one or more line fields. Each line field consists of the line keyword followed by a string enclosed in quotes, the style keyword followed by an integer, as well as an optional tag keyword followed by one or more keywords. The string contains an instruction template that is used to print the original instruction. The template string contains several keywords that are substituted during the code conversion process:

• The ⟨qual⟩ keyword is substituted with the actual instruction qualifiers.
• The ⟨expi⟩ keywords are substituted with the ith instruction operand, which must be a literal or expression.
• The ⟨Mregi⟩ keywords are substituted with the reallocated register for the ith instruction operand, which must be a register.
• The ⟨Ooffi⟩ keywords are substituted with the offsets of the original register for the ith instruction operand, which must be a register. The offset is provided for special instructions that save or restore registers to or from the thread descriptor, respectively. The keyword denotes the offset of the corresponding register inside the descriptor structure.

• The ⟨Inum⟩ keyword is substituted with the number of instructions that have been converted so far. This keyword is used to generate unique labels.

The style field specifies the instruction style used by the template string and is required for proper processing of the template string. This field is similar to the type field and is used during the conversion process. The tag field is used to suppress individual lines under several conditions:

• The EMUFIX tag specifies that the corresponding line should be suppressed if the emufix optimization is enabled.
• The EMUVAR tag specifies that the corresponding line should be suppressed if the emufix optimization is disabled.
• The RiZERO tag specifies that the corresponding line should be suppressed if the ith instruction operand is a zero source and sink register.

The modified field begins with the modified keyword and is limited by curly braces. Inside the braces are one or more line fields similar to the ones described above. The only difference is the support for additional keywords in the line field:

• The ⟨Iregi⟩ and ⟨Fregi⟩ keywords are substituted with the name of the ith temporary integer and floating-point register, respectively.
• The ⟨Moffi⟩ keywords are substituted with the offsets of the reallocated register for the ith instruction operand, which must be a register. The offset is provided for special instructions that save or restore registers to or from the thread descriptor and denotes the offset of the corresponding register inside the descriptor structure.

The tune field is optional and provides implementation-specific information about the instruction. The field begins with the tune keyword and is enclosed in curly braces. Inside the braces are three entries that specify the name of the implementation, the execution pipelines that can be used to execute the instruction, as well as the latency of the instruction. If the platform-specific configuration file contains multiple entries for the same instruction and style pair that only differ in their implementation-specific details, one entry is chosen according to the -O arch command-line argument.
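Putting the options together, a typical conversion run might be invoked as follows; the file names are hypothetical and only the options themselves are taken from the description above, while the way the input assembler source is supplied (here assumed as a trailing argument) is likewise an assumption:

    asmconv -f machine.conf -l module.conf -o module_emu.s -g 64 -O sblk -O ropt module.s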

3.5.2 Lexer & Parser

The assembler converter uses two passes to convert the assembler source: The first pass gathers platform-specific information and is therefore described in Section 3.7. The second pass identifies internal procedures and translates the corresponding instructions into an internal representation. Since the syntax of the assembler sources depends on the assembler, the second pass is usually platform-specific as well. Note that this holds even for platforms that use the same processor architecture. However, the platform-specific differences in the second pass concern only low-level details; the same tasks are performed on all platforms. The two passes use a combination of lexer and parser for the lexical and syntactical analysis of the assembler source, which is described in the following paragraphs.

An assembler source file is a sequence of statements and comments which may cross multiple lines. Each statement is either an instruction, a directive, or a label. An instruction is a mnemonic as defined by the instruction set architecture followed by zero or more operands. Note that some assemblers define additional mnemonics for commonly used functions to ease the task of the assembler programmer. A directive is an instruction for the assembler itself. Directives are usually used for bookkeeping, storage reservation, and other control functions. Labels assign a name to a specific location in the program, such that branch instructions can reference the location by that name.

The lexer extracts the individual statements from the assembler source and divides each statement into individual tokens. These tokens represent the elements of a statement, e.g. mnemonics, constants, registers, operators, and punctuation. The individual tokens that make up a statement are passed to the parser, which handles the semantic analysis of the statement.

The assembler converter uses the lex program [LMB92] to construct the lexer for the assembler source. The lex program uses a specification of the individual tokens as regular expressions and produces a routine that is able to identify these tokens. This routine is automatically incorporated into the assembler converter. Note that the lexer must be able to distinguish between identifiers, instructions, directives, and register names in order to properly divide statements into tokens. Therefore the lexer maintains a list of all instructions, directives, and register names to identify these and separate them from the identifiers.

The parser processes the individual tokens from the lexer, checks the syntax of the corresponding statement, and performs statement-specific operations. The assembler converter uses the yacc program [LMB92] to construct a parser for the assembler source. The yacc program takes a grammar in Backus-Naur form, consisting of statements and associated actions, and turns it into a routine written in C. This routine is able to turn a sequence of tokens into a statement from the grammar and perform the corresponding actions.

The parser in the assembler converter operates as follows: By default, all statements are written to the output file until a directive is found that signals a procedure entry point. In this case the next label, i.e. the name of the procedure, is compared with the names of the internal procedures. If the procedure is external, the parser returns to the default mode. Otherwise, i.e.
the procedure is internal, all statements are translated into an internal representation until a directive is found that signals the end of the procedure. After the whole procedure has been translated, the corresponding code is converted and the parser examines the rest of the source file in order to identify other internal procedures.

3.5.3 Basic Blocks

A basic block is usually defined as a maximal sequence of instructions that can be entered only at the beginning and can be left only at the end of the sequence. The first instruction in a basic block is therefore either a branch target or an instruction that immediately follows a branch instruction. Likewise, the last instruction in a basic block is either a branch or an instruction that immediately precedes a branch target. Note that branch targets are usually denoted by labels, hence both terms are used interchangeably.

The assembler converter uses an extended definition of basic blocks: The length of the instruction sequence must be less than or equal to the grainsize g if a non-zero grainsize g was defined via the command line. Calls to internal procedures are treated as branches with respect to the definition above, because these calls require a context switch. Recall from Section 2.3.1 that larger instruction blocks are beneficial to the performance of emulated multithreading. Therefore calls to external procedures are not treated as branches in order to increase the size of the instruction block.

The assembler converter uses a three-stage algorithm to create basic blocks: The first stage, illustrated in Figure 3.2, determines the size of the individual basic blocks. The second stage, illustrated in Figure 3.3, creates the actual basic blocks, while the third stage, illustrated in Figure 3.4, links the basic blocks according to the control-flow graph. The separation between these stages is not strictly necessary, but makes the implementation of the algorithm easier to debug and maintain.

The first stage processes all instructions in sequential order and marks those instructions that end a basic block. An instruction ends a basic block under one of the following conditions:

• The instruction is a branch, but not a call to an external procedure.
• The instruction always causes a context switch.
• The size of the current basic block is equal to the grainsize.
• The instruction immediately precedes a label.

Note that the last instruction in the procedure always ends a basic block.

The individual basic blocks are created during the second stage of the algorithm. The instructions are again processed in sequential order, adding instructions to the current basic block, until an end-of-bblock instruction is encountered. If the basic block has an associated label, this label is stored together with a pointer to the block for later reference during the third stage of the algorithm.

During the third stage, the individual basic blocks are linked according to the control-flow graph. The list of basic blocks created during the second stage is processed in the following way: The last instruction of every basic block is

Fig. 3.2. Creation of Basic Blocks - Stage I

while(inst_ptr = walk_list(proc_ptr->inst_list)) {
  if(inst_ptr->flag & INST_INSTRUCTION) {
    current_grainsize++;

    if((GET_BR_CTRL(inst_ptr) && !GET_BR_CALL(inst_ptr)) ||
       (GET_BR_CTRL(inst_ptr) && GET_BR_CALL(inst_ptr) &&
        GET_CTYPE_SWITCH(inst_ptr)) ||
       (current_grainsize >= max_grainsize) ||
       (inst_ptr->flag & INST_SWITCH)) {
      inst_ptr->flag |= INST_END_BBLOCK;
      current_grainsize = 0;
    }
  } else if(inst_ptr->flag & INST_LLABEL) {
    inst_ptr->prev_ptr->flag |= INST_END_BBLOCK;
    current_grainsize = 0;
  }
}

Fig. 3.3. Creation of Basic Blocks - Stage II

bblock_ptr = new_bblock();

while(inst_ptr = walk_list(proc_ptr->inst_list)) {
  list_append(bblock_ptr->inst_list, inst_ptr);

  if(inst_ptr->flag & INST_LABEL) {
    list_append(bblock_ptr->labels_list, inst_ptr->name);
  } else if((inst_ptr->flag & INST_INSTRUCTION) &&
            (inst_ptr->flag & INST_END_BBLOCK)) {
    if(bblock_ptr->flag & BBL_NOP_ONLY) {
      merge_bblocks(proc_ptr->bblock_list.tail_ptr, bblock_ptr);
    } else {
      list_append(proc_ptr->bblock_list, bblock_ptr);
    }

    bblock_ptr = new_bblock();
  }
}

Fig. 3.4. Creation of Basic Blocks - Stage III

while(bblock_ptr = walk_list(proc_ptr->bblock_list)) {
  inst_ptr = bblock_ptr->last_inst;

  if(GET_BR_CTRL(inst_ptr) && !GET_BR_CALL(inst_ptr)) {
    label = inst_get_label(inst_ptr);

    if(GET_BR_COND(inst_ptr)) {
      link_bblocks(bblock_ptr, bblock_ptr->next_ptr);
    }

    link_bblocks(bblock_ptr, search_label(label));
  } else {
    link_bblocks(bblock_ptr, bblock_ptr->next_ptr);
  }
}

inspected. If the instruction is not a branch or is a conditional branch, the current basic block is linked to the next basic block in the list. Note that the basic blocks are created, stored, and retrieved in sequential program order. If the last instruction is a conditional or unconditional branch, the branch target is looked up in the list of labels created during the second stage of the algorithm. The current basic block is then linked to the basic block that is associated with the label.

Lemma 3.5.1. Let n be the number of instructions in the current procedure and k ≤ n be the number of basic blocks in the current procedure. The algorithm described above has a worst-case runtime of O(n · k) or O(n · log(k)), depending on the type of data structure used to store the labels.

Proof. The first stage of the algorithm processes all n instructions in the procedure and marks some of them as end-of-bblock instructions. Each instruction can be processed in constant time, hence the first stage of the algorithm has a worst-case runtime of O(n).

Like the previous stage, the second stage processes all n instructions in the procedure. Instructions are added to the current basic block until an end-of-bblock instruction is encountered; the block is then stored for later reference during the third stage. As the third stage references the basic blocks in the same order as they were stored in the second stage, a linked list is sufficient to store the individual blocks, hence insertion of a basic block takes constant time. However, if a label is encountered, the label and a reference to the corresponding basic block are stored for later reference during the third stage. If a linked list is used to store the labels, the insertion takes constant time, but each search during the third stage will take linear time. Using more advanced data structures, e.g. red/black trees, both insertion and searching take logarithmic time [CLR90]. Assume that every basic block has at most one label in order to determine the worst-case runtime of the second stage. Hence the worst-case runtime of the second stage is O(k) or O(k · log(k)), depending on the data structure used to store the labels.

The third stage processes all basic blocks and creates links between the basic blocks according to the control-flow graph. It takes constant time to link two basic blocks, but determining the branch target requires a search for the corresponding label. As mentioned above, each search takes linear or logarithmic time, depending on the data structure that is used to store the labels. In the worst case, each basic block requires a search of this data structure, i.e. every basic block ends with a conditional branch instruction. As there are at most n basic blocks, the worst-case runtime of the third stage is O(n · k) or O(n · log(k)), which is also the overall runtime. □

In practice, the number k of basic blocks will be much smaller than n, i.e. k ≪ n. The assembler converter uses a linked list to store the labels, since the runtime of basic block creation is insignificant compared to other parts of the converter, e.g. register allocation.

3.5.4 Super Blocks

The size of the instruction blocks has a significant impact on the performance of emulated multithreading: The larger the instruction blocks, the fewer iterations of the main loop in the thread execution routine are required, especially if the instruction blocks are large enough to contain whole loops.

Basic blocks, as introduced in Section 3.5.3, are not well suited as instruction blocks, since they are often too small: On average, every 6th instruction is a branch, which limits the average size of basic blocks to six instructions [Wal92]. Optimizing compilers use several techniques to extend the size of basic blocks, e.g. loop unrolling and trace scheduling, thereby providing benefits to emulated multithreading. However, instruction blocks should extend across loops in order to decrease the number of iterations in the main loop of the thread execution routine. An instruction block should therefore consist of several basic blocks. These sets of basic blocks must satisfy the following conditions:

• The control-flow graph induced by the sets of basic blocks should form a subgraph of the control-flow graph of the program, i.e. two of these sets are connected if and only if two basic blocks from the corresponding sets are connected in the control-flow graph. This condition ensures that the program semantics are maintained.
• The set of basic blocks should have a single point of entry, i.e. there are no edges from basic blocks outside the set to other basic blocks inside the set in the control-flow graph. This condition ensures that all basic blocks in

the set are reachable via the entry point. This property is exploited during the creation of super blocks and simplifies the integration of context switch code.
• The set must not extend across designated basic blocks. If context switches are performed at each exit point of the set, this condition ensures that context switches are performed after these basic blocks.

There is no restriction on the number of exit points, i.e. such a set must have a single entry point, but may have multiple exit points. Structures with similar properties have been published in the literature, e.g. extended basic blocks, super blocks, minimum and maximum intervals. The following paragraphs describe these structures in detail and discuss their merits with respect to emulated multithreading.

An extended basic block [Muc97] is a maximal sequence of basic blocks such that only the basic block at the beginning can have more than one predecessor. The extended basic blocks of a control-flow graph are constructed using a depth-first-search algorithm in O(n+m) time, where n is the number of basic blocks and m is the number of edges in the flow graph [Muc97]. Extended basic blocks meet the first two conditions from above, but can only include a single loop.

Super blocks are structures that are based on trace scheduling: Trace scheduling divides the flow graph into a set of traces that represent the frequently executed paths in the flow graph [Fis81]. Traces may contain side exits, i.e. branches out of the trace, as well as side entries, i.e. branches from other traces into the middle of the trace. A super block is a trace that has no side entrances [HMC+93]. Super blocks are formed in two steps: First, traces are identified using either profiling information [HMC+93] or static program analysis [HMB+93]. Second, tail duplication [Fis81] is performed to eliminate any side entrances to the trace. A related structure is the hyper block, which uses predication to integrate multiple paths of control in the same hyper block [MLC+92]. The super and hyper blocks meet the first two conditions from above, but cannot include loops, since all side entrances, even those from the trace itself, are removed.

Intervals are used for control-flow analysis and come in two forms: maximal and minimal intervals [Muc97]. A maximal interval with header h is the maximal, single-entry subgraph of the control-flow graph, such that h is the only entry node and all closed paths in the subgraph contain h. A minimal interval is defined to be either a natural loop, a maximal acyclic subgraph, or a minimal irreducible region. A natural loop of a back edge (p, q) in the control-flow graph, where p dominates q, is the subgraph consisting of the set of nodes containing q, all the nodes from which p can be reached without passing through q, as well as the corresponding edges. Note that both types of intervals allow only a single loop per interval.

The above approaches are too restricted for the purpose of emulated multithreading, hence a new structure called a super block is defined (the name was chosen independently of [HMC+93]): A super block is a maximal set of basic blocks that meets the conditions presented above. Apart from the conditions, there are no further restrictions. The designated basic blocks that limit the size of these super blocks are called end-of-sblock blocks, while the entry block in a super block is called start-of-sblock. The assembler converter uses super blocks as instruction blocks if the corresponding optimization is enabled, otherwise basic blocks are used. The algorithm used to identify the super blocks is based on depth-first search (DFS) and is described in the following paragraphs.

Algorithm. The main routine of the algorithm is depicted in Figure 3.5. The algorithm maintains a list of super block headers, i.e. entry points to a new super block, initialized to the entry point of the procedure. Note that all basic blocks in the procedure are reachable via the procedure entry point. The first and last step in the algorithm reset the level information for all basic blocks that belong to the current procedure. This information is used during the discovery of new super blocks as well as by some of the other algorithms in the assembler converter.

The main loop is executed until all basic blocks have been processed. Inside the loop, the first entry in the header list is removed from the list and a new super block is created from this entry. Afterwards links to predecessor super blocks are created for all predecessor basic blocks that are already part of a super block. The remaining links will be completed during the later stages. Three subroutines are used to extend the current super block; the following paragraphs describe these three subroutines in detail.

The first subroutine, illustrated in Figure 3.6, is based on the visit stage of depth-first search and visits all basic blocks that are reachable from the current header. Upon entry to a basic block, the following operations are performed: If the level of the basic block is different from the current level, the level of the basic block is set to the current level and the number of visits is cleared. This is necessary because the number of visits may still contain information from previous traversals of the control-flow graph. Afterwards the number of visits is incremented. If the basic block is visited for the first time and is not an end-of-sblock block, all children of the basic block are visited recursively. The end-of-sblock condition ensures that the third condition above is met.

The first subroutine differs in two ways from the visit stage of the depth-first search algorithm: the initialization of the number of visits in case of left-over information, and the additional check of the end-of-sblock condition before the children are visited recursively.

The second subroutine, illustrated in Figure 3.7, is based on the visit stage of the depth-first search algorithm as well. The subroutine is used to determine the actual size of the super block and updates the end-of-sblock,

Fig. 3.5. Creation of Super Blocks - Main

int level = 1;

/* reset the level information for all basic blocks of the procedure */
while(bblock_ptr = walk_list(proc_ptr->bblock_list)) {
    bblock_ptr->level = 0;
}

list_init(&header_list);
list_append(&header_list, proc_ptr->bblock_list.head_ptr);

while(!list_empty(&header_list)) {

    /* remove the first header and create a new super block for it */
    bblock_ptr = list_remove(&header_list, header_list.head_ptr);
    sblock_ptr = new_sblock();

    /* link all predecessors that already belong to a super block */
    while(parent_ptr = walk_list(bblock_ptr->parent_list))
        if(parent_ptr->sblock_ptr)
            link_sblocks(parent_ptr->sblock_ptr, sblock_ptr);

    visit_sblock(bblock_ptr, level);      /* stage I   */
    size_sblock(sblock_ptr, bblock_ptr);  /* stage II  */
    fill_sblock(sblock_ptr, bblock_ptr);  /* stage III */

    level++;
}

/* reset the level information again for the later stages */
while(bblock_ptr = walk_list(proc_ptr->bblock_list)) {
    bblock_ptr->level = 0;
}

Fig. 3.6. Creation of Super Blocks - Stage I

void visit_sblock(struct bblock *bblock_ptr, int level)
{
    if(bblock_ptr->level != level) {
        /* first visit during this traversal: clear left-over information */
        bblock_ptr->num_visits = 1;
        bblock_ptr->level = level;

        /* visit all children unless this block ends a super block */
        if((bblock_ptr->num_visits == 1) &&
           !(bblock_ptr->flag & BBL_END_SBLOCK))
            while(child_ptr = walk_list(bblock_ptr->childs_list)) {
                visit_sblock(child_ptr, level);
            }
    } else {
        bblock_ptr->num_visits++;
    }
}

Fig. 3.7. Creation of Super Blocks - Stage II

static void size_sblock(struct sblock *sblock_ptr, struct bblock *bblock_ptr)
{
    if(bblock_ptr->level != 0) {
        bblock_ptr->level = 0;

        while(child_ptr = walk_list(bblock_ptr->childs_list)) {
            if(!(bblock_ptr->flag & BBL_END_SBLOCK)) {
                if(child_ptr->sblock_ptr) {
                    if(child_ptr->sblock_ptr != sblock_ptr) {
                        bblock_ptr->flag |= BBL_END_SBLOCK;
                    }
                } else {
                    if(child_ptr->num_visits == child_ptr->parent_list.num_elements) {
                        if(!(child_ptr->flag & BBL_NEW_SBLOCK)) {
                            size_sblock(sblock_ptr, child_ptr);
                        } else {
                            bblock_ptr->flag |= BBL_END_SBLOCK;
                        }
                    } else {
                        bblock_ptr->flag |= BBL_END_SBLOCK;
                        child_ptr->flag |= BBL_NEW_SBLOCK;
                    }
                }
            } else if(child_ptr->sblock_ptr == NULL) {
                child_ptr->flag |= BBL_NEW_SBLOCK;
            }
        }
    }
}

Upon entry to a basic block, the following operations are performed: The subroutine checks the level to ensure that each basic block is processed only once. If the level information is already zero, the basic block has been processed before and the subroutine returns immediately. Otherwise the level is set to zero and all children of the basic block are examined.

If the basic block is an end-of-sblock basic block and the current child is not part of any super block, the child is marked with the start-of-sblock flag. This child will be added to the header list by the third subroutine. Otherwise the child is examined in the following way: If the child already belongs to a different super block, the current basic block is marked with an end-of-sblock flag and the subroutine returns. If the child does not belong to any super block, the number of visits is compared with the number of predecessors of the child. If both numbers are equal, the child can be added to the super block without violating the single-entry property.

Fig. 3.8. Creation of Super Blocks - Stage III

static void fill_sblock(struct sblock *sblock_ptr, struct bblock *bblock_ptr)
{
    bblock_ptr->sblock_ptr = sblock_ptr;
    list_append(&sblock_ptr->bblock_list, bblock_ptr);

    while(child_ptr = walk_list(bblock_ptr->childs_list)) {
        if(!(bblock_ptr->flag & BBL_END_SBLOCK)) {
            if(child_ptr->sblock_ptr) {
                if(child_ptr->sblock_ptr != sblock_ptr) {
                    link_sblocks(sblock_ptr, child_ptr->sblock_ptr);
                }
            } else {
                if(child_ptr->flag & BBL_NEW_SBLOCK) {
                    if(search_bblock(&header_list, child_ptr) == NULL) {
                        list_append(&header_list, child_ptr);
                    }
                } else {
                    fill_sblock(sblock_ptr, child_ptr);
                }
            }
        } else {
            if(child_ptr->sblock_ptr) {
                link_sblocks(sblock_ptr, child_ptr->sblock_ptr);
            } else {
                if(search_bblock(&header_list, child_ptr) == NULL) {
                    child_ptr->flag |= BBL_NEW_SBLOCK;
                    list_append(&header_list, child_ptr);
                }
            }
        }
    }
}

However, if the child is marked with the start-of-sblock flag, it cannot be added to the current super block, hence the basic block is marked with the end-of-sblock flag. Otherwise the second subroutine is recursively called for the child in order to explore other basic blocks that may be added to the super block. If the number of visits is not equal to the number of predecessors, the current child cannot be added to the super block, hence the basic block is marked with the end-of-sblock flag, while the current child is marked with the start-of-sblock flag. The child will be added to the header list by the third subroutine.

Like the previous two subroutines, the third subroutine, illustrated in Figure 3.8, is based on the visit stage of the depth-first search algorithm. This subroutine adds the individual basic blocks to the super block and updates the header list. Upon entry to a basic block, the following operations are performed: The basic block is added to the current super block and all children of the block are examined similar to the second subroutine: If the current basic block is marked with the end-of-sblock flag, it is checked whether the child belongs to another super block. If the child already belongs to another super block, the corresponding links between these two super blocks are created. Otherwise the child is marked with the start-of-sblock flag and added to the header list, if it is not already present.

If the current basic block is not marked with the end-of-sblock flag, it is checked whether the current child belongs to another super block. If the child already belongs to another super block, the corresponding links between these two super blocks are created. If the child does not belong to another super block, but is marked with the start-of-sblock flag, it is added to the header list, if it is not already present. Otherwise, the child is added to the super block by recursively calling the third subroutine.

The assembler converter uses the algorithm described above to create the individual instruction blocks. Recall that these instruction blocks must meet the three conditions described above, i.e. the single-entry, subgraph, and end-of-sblock conditions. The correctness of the algorithm as well as the worst-case runtime for two different versions of this algorithm are proven in the remainder of this section.

Lemma 3.5.2. The super blocks created by the algorithm above have a single point of entry.

Proof. This property is proven by contradiction: Given two different super blocks s1, s2 with headers h1, h2, assume that an edge (u, v) exists, such that the nodes u, v belong to super blocks s1, s2, respectively, and v ≠ h2 holds. Since all basic blocks in the flow graph are reachable from the procedure entry point h, there exists at least one path from h to v. This path does not contain h2, otherwise u and v would be in the same super block: The first subroutine visits all basic blocks that are reachable from h2 and updates the number of visits accordingly. After completing the first stage, the number of visits of v is smaller than the number of predecessors of v, since a path from h to v exists that does not contain h2. Node v can only be added to the super block s2 if both numbers are equal, hence v does not belong to s2. This contradiction completes the proof. □

Lemma 3.5.3. The super blocks created by the algorithm above are disjoint and cover the whole control-flow graph.

Proof. If a basic block is added to a super block, that basic block will neither be added to another super block nor to the header list. Hence, the created super blocks are disjoint.

In the flow graph, each basic block is reachable from the entry point, and is either placed in the header list or added to a super block. Unless a basic block is added to a super block, it will be the header of another super block. Therefore the union of all super blocks covers the whole flow graph. □

Lemma 3.5.4. The links between the super blocks form an abstract flow graph.

Proof. This property is proven by contradiction: Given two super blocks s1, s2 and basic blocks b1, b2 that belong to s1, s2, respectively, assume that (b1, b2) are linked in the flow graph induced by the basic blocks, while (s1, s2) are not linked in the flow graph induced by the super blocks. Without loss of generality, assume that b1 was created before b2. b2 cannot be the header of s2, otherwise (s1, s2) would have been linked in the main routine, since s1 was already present at that time. On the other hand, b2 cannot be any other block in s2, otherwise (s1, s2) would have been linked by the third subroutine during the creation of s2. Therefore b2 cannot belong to s2, which is a contradiction. □

Theorem 3.5.1. The worst-case runtime of the super block algorithm as described above is O(n(n + m)), where n is the number of basic blocks and m is the number of edges in the abstract flow graph induced by the basic blocks.

Proof. The first and last stages in the algorithm, i.e. the clearing of the level information for all basic blocks, have a worst-case runtime of O(n). The first subroutine is executed for every header block and visits all basic blocks that are reachable from the header block without crossing end-of-sblock boundaries. In the worst case, every basic block is a header, hence there are at most n headers. It is possible to construct a flow graph that has n headers and where almost all n basic blocks have to be visited for each header, i.e. a full depth-first search has to be performed. Such a flow graph is depicted in Figure 3.9. Note that only the last basic block is marked with the end-of-sblock flag. A single depth-first search has a worst-case runtime of O(n + m), hence the first stage of the super block algorithm has a worst-case runtime of O(n(n + m)).

The second stage of the algorithm is executed for all header blocks. For each header, all basic blocks that will be added to the corresponding super block in the third stage are visited. Since the individual super blocks do not overlap, each node in the flow graph is visited and each edge is traversed exactly once. The worst-case runtime of the second stage is therefore identical to the worst-case runtime for depth-first search, which is O(n + m).

The third stage of the algorithm is executed for each header block. For each header block, all basic blocks that belong to the corresponding super block are visited. Using the same argument as above, the worst-case runtime for the third stage of the algorithm is O(n + m). □

Fig. 3.9. Example for worst-case Control-Flow Graph


The overall worst-case runtime of the algorithm is dominated by the worst-case runtime of the first stage, i.e. O(n(n + m)). This result can be significantly improved by changing the first stage of the algorithm as described below.

Theorem 3.5.2. The worst-case runtime of the super block algorithm can be improved to O(n + m), where n is the number of basic blocks and m is the number of edges in the abstract flow graph induced by the basic blocks.

As described above, the first stage of the algorithm visits some basic blocks unnecessarily, i.e. nodes that can never be part of the corresponding super block. In order to avoid these visits, the notion of dominance is useful: Given a flow graph G = (V,E) with entry point r, a vertex u is said to be dominated by another vertex v, if every path from r to u contains v. The following lemma provides the connection between dominance and super blocks.

Lemma 3.5.5. All basic blocks in a super block are dominated by the header.

Proof. This lemma can be proven by contradiction: Given a super block s with header h, assume that a basic block b in s exists, such that h does not dominate b. If b is not dominated by h, there exists at least one path from the entry point r to b that does not contain h. This contradicts the single-entry property proven in Lemma 3.5.2. □

Proof. Based on the above lemma, the algorithm is changed by computing all dominators for all basic blocks in the abstract flow graph and changing the first stage in the following way: A child is only visited if it is dominated by the header of the current super block. This modification reduces the worst-case runtime of the first stage to O(n + m), since the first stage now visits all basic blocks and traverses all edges in the flow graph exactly once. The dominators of all nodes in a flow graph can be determined in near-linear time, i.e. O(m α(m, n)), where α(m, n) is a functional inverse of Ackermann's function, i.e. an extremely slowly growing function [LT79]. □
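The modified first stage only differs from the original one in the dominance check before the recursive call. A minimal sketch, assuming that precomputed dominator information can be queried via a hypothetical dominates() predicate (e.g. filled in by the Lengauer-Tarjan algorithm [LT79]):

/* Sketch of the dominance-restricted first stage from Theorem 3.5.2.
 * dominates(a, b) is an assumed helper that returns non-zero if basic
 * block a dominates basic block b in the precomputed dominator tree. */
void visit_sblock_dom(struct bblock *bblock_ptr, struct bblock *header_ptr,
                      int level)
{
    struct bblock *child_ptr;

    if(bblock_ptr->level != level) {
        bblock_ptr->num_visits = 1;
        bblock_ptr->level = level;

        if(!(bblock_ptr->flag & BBL_END_SBLOCK))
            while(child_ptr = walk_list(bblock_ptr->childs_list)) {
                /* visit a child only if it is dominated by the header,
                   i.e. only if it can ever belong to this super block */
                if(dominates(header_ptr, child_ptr))
                    visit_sblock_dom(child_ptr, header_ptr, level);
            }
    } else {
        bblock_ptr->num_visits++;
    }
}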

3.5.5 External Calls

Emulated multithreading distinguishes two types of procedures, i.e. internal and external procedures. An internal procedure is a procedure that uses emulated multithreading, i.e. the corresponding code is modified during the code conversion process. An external procedure is a procedure that does not use emulated multithreading, i.e. the corresponding code is not modified during the code conversion process.

The distinction between internal and external procedures has two advantages: First, emulated multithreading can be applied on a procedure-by-procedure basis, i.e. restricted only to those procedures that benefit from emulated multithreading. Second, calls to procedures for which no source code is available, e.g. system calls, can be executed. However, calls to external procedures must follow the standard calling conventions, which complicates the code conversion process.

The following paragraphs describe the approach used by the assembler converter to handle external calls: Based on a general description of procedure calls, implications for external calls in combination with emulated multithreading are derived. Two solutions that address the corresponding problems are presented, a simple and a complex one. The latter solution is described in detail, including the algorithmic implementation.

Procedure Calls. A procedure call can be divided into three parts: the call prolog, the actual call, and the call epilog. The call prolog assembles the arguments and stores them in the corresponding argument registers or stack locations according to the calling convention. Apart from assembling the arguments, the call prolog is also responsible for saving any caller-save registers to memory as well as calculating the address of the callee. The actual call transfers control to the callee and saves the return address in a register. The call epilog handles the return values and restores any register that may have been destroyed during the call. The calling conventions are platform- as well as operating-system-specific.

An example for a procedure call is depicted in Figure 3.10. This example was taken from a program compiled for the Alpha architecture under the Tru64 operating system. The semantics of the individual instructions and addressing modes are described in Appendix A.

Fig. 3.10. Procedure Call Example

ldq   $27, getopt($gp)      !literal!11
mov   $9, $16
mov   $10, $17
jsr   $26, ($27), getopt    !lituse_jsr!11
ldah  $gp, ($26)            !gpdisp!13
lda   $gp, ($gp)            !gpdisp!13

According to the Tru64 calling conventions [Tru96], the integer registers, i.e. r0-r31, are used in the following way: Arguments are passed in registers r16-r21, return values are returned in register r0. Register r26 holds the return address, while r27 holds the procedure value, i.e. the address of the callee. Register r28 is reserved for the assembler and registers r29, r30 hold the global and stack pointers, respectively. The remaining registers are temporary registers that are used for expression evaluation. Note that only registers r9-r15 and r26 are callee-save, all other registers are not preserved across procedure calls.

The call prolog ranges from the ldq instruction to the second mov instruction, the actual call is performed by the jsr instruction. The call epilog consists of the ldah, lda instruction pair after the actual call. The address of the procedure is assembled in register r27 by the ldq instruction. This register is used in the actual call, i.e. the jsr instruction, to provide the address of the callee. The arguments are assembled in registers r16 and r17, register r30 contains the stack pointer. The call epilog restores the global pointer after the call, since the register used for the global pointer is not callee-save. The procedure return value is returned in register r0.

Procedure calls present a problem for emulated multithreading: On the one hand, the called procedure expects arguments in specified registers, on the other hand all registers are reallocated during the code conversion process. If both the caller and the callee use emulated multithreading, the problem does not exist: The entry point of the callee is the header of a super block by definition. Therefore all internal procedures can only be called via a context switch, such that the basic block that contains the call ends the corresponding super block. Hence the caller automatically saves all modified registers to the thread descriptor, while the callee loads all register values from the same descriptor after one or more context switches. Both the caller and the callee are free to reallocate all registers and ignore the calling conventions, as these conventions are preserved via the context stored in the thread descriptor.

The situation is different for external calls, since the callee still expects the registers at the locations specified by the calling conventions. This problem can be solved by saving all modified registers to the thread descriptor prior to the call and inserting instructions that restore all registers needed by the call from the thread descriptor to the registers specified by the calling conventions. In the same way, instructions to save the return value to the thread descriptor have to be inserted after the call. Otherwise the modified code that follows would use outdated values for these registers.

This solution is easy to implement once the registers that are needed for an individual call are known. Note that this information is already provided in one of the configuration files, as described in Section 3.3.1. Therefore the assembler converter only has to identify calls to external procedures and insert the corresponding save and restore instructions. However, this solution is not very efficient, as it causes a lot of accesses to the thread descriptor. Most of these accesses are unnecessary, because the call prolog already uses all of these registers.
Instead of the simple solution presented above, the assembler converter uses a more sophisticated solution: Inside the call prolog and epilog, all registers that are needed by the call are allocated to themselves, hence the calling conventions are preserved. However, this solution requires the identification of the call prolog and epilog, a complex task if no information besides the assembler source is available. The algorithm used by the assembler converter is described in the following paragraphs.

The algorithm processes all basic blocks in the procedure. For each basic block, all instructions are processed in sequential order. If a procedure call instruction is encountered, the type of the called procedure is determined: In the case of internal procedures, a platform-specific routine is executed for bookkeeping reasons. In the case of external procedures, the register mask c1 is created that contains all argument registers needed by the called procedure. If the call uses a fixed number of arguments, this register mask is obtained from one of the configuration files.

For procedures with a variable number of arguments, a heuristic is used to create the call mask: Beginning with the call instruction, all instructions are examined in reverse order, until an instruction is found that has more than one predecessor. All argument registers that are written by one of these instructions are recorded in the register mask. This heuristic is usually sufficient to find all argument registers that a procedure expects. It is possible to construct a flow graph that causes this heuristic to fail. However, the assembler converter operates on assembler code that was generated by the compiler, i.e. the generated code is "well-behaved", such that this heuristic works in practice. This situation can be resolved if the assembler converter is integrated into the compiler, since the compiler already has the required information.

After all argument registers have been determined, the call mask c1 is updated with the mask of optional registers, which is obtained from one of the configuration files. A second call mask c2 is created that contains all registers modified by the call instruction as well as the required registers. Taken together, the two call masks represent all registers that are expected by the callee. The first mask is used to identify the call prolog as described below, the second call mask is used afterwards to update the first call mask.

The call prolog is identified by processing all instructions in reverse order, starting with the call instruction. Each instruction is examined and all registers that are written by one of these instructions are recorded in the c3 live mask. The processing continues until all registers from the c1 call mask have been recorded, or another call prolog, epilog, or an end-of-sblock block is encountered. In the latter two cases, the call prolog reaches from the call instruction to the last instruction that made a useful contribution to the c3 mask, i.e. writes a register that is present in the c1 mask, but has not been written by one of the previous instructions. The simple solution presented above is used for the remaining registers, i.e. these registers are reloaded just before the call.

All instructions in the call prolog are marked accordingly and the combined c1 and c2 call masks are stored in each instruction for later reference during register allocation.
Inside a call prolog or epilog, the register allocator will allocate all registers present in these masks onto themselves. Note that the algorithm crosses basic block boundaries as long as the basic blocks are sequential, i.e. the current basic block has only one predecessor. These basic blocks as well as the corresponding super blocks are subsequently merged. The merging of super blocks is only necessary if the super block optimization is disabled, i.e. super blocks are identical to basic blocks: The algorithm presented above would have added these basic blocks to the same super block. As a consequence, the final shape of basic and super blocks is only known after all external calls have been processed. For this reason, the original statistics as well as the pre- and post-order traversals of the abstract flow graph induced by the basic blocks are created only after the processing of external calls has been completed.
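The reverse scan for a call with a variable number of arguments can be sketched as follows; the instruction fields, arg_reg_mask() and the stop condition are illustrative assumptions rather than the converter's actual interface, and the predecessor count of the enclosing basic block is used here to approximate the "more than one predecessor" condition:

/* Sketch of the heuristic that builds the argument register mask c1 for
 * an external call with a variable number of arguments. insn->prev,
 * insn->written_regs, insn->bblock_ptr and arg_reg_mask() are assumed
 * names used for illustration only. */
unsigned long build_vararg_call_mask(struct instruction *call_insn)
{
    unsigned long c1 = 0;
    struct instruction *insn = call_insn->prev;

    /* examine instructions in reverse order, stopping as soon as the
       enclosing basic block has more than one predecessor */
    while(insn != NULL &&
          insn->bblock_ptr->parent_list.num_elements <= 1) {
        c1 |= insn->written_regs & arg_reg_mask();
        insn = insn->prev;
    }
    return c1;
}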

3.5.6 Data-Flow Analysis

Data-flow analysis is used to gather information about the way data is manipulated within a program. This information is a prerequisite for most optimization passes in modern compilers. As such, data-flow analysis must provide information that is accurate enough to enable optimizations, yet is conservative enough to prevent the optimization from changing the program semantics.

There are two different forms of data-flow analysis: inter-procedural analysis and intra-procedural analysis. Inter-procedural data-flow analysis is concerned with the flow of data between the individual procedures in a program, while intra-procedural analysis is concerned with the flow of data within procedures. The remainder of this section covers intra-procedural analysis, although the theoretical concepts apply to inter-procedural analysis as well.

One of the most important data-flow problems is the determination of reaching definitions and live variables:

Definition 3.5.1. A definition of a variable, i.e. an assignment, is reaching at a given point in the procedure, if a path in the control-flow graph exists, such that the variable may still have the assigned value at that point. In this case, the goal of data-flow analysis is to determine the reaching definitions at all nodes in the control-flow graph.

Definition 3.5.2. A variable is live at a given point in the procedure, if a path in the control-flow graph from that point to an exit point exists, such that this path contains a use, i.e. a reference, of this variable. In this case, the goal of data-flow analysis is to determine the live variables at all nodes in the control-flow graph.

Data-flow problems can be grouped into three different classes based on the direction of the information flow: forward flow, backward flow, and bidirectional problems. Forward flow problems process information in the direction of program flow, backward flow problems process information in the opposite direction. Bidirectional problems process information in both directions, but are rare in practice.

The remainder of this section introduces the theoretical foundations for data-flow analysis as well as a simple iterative algorithm to solve data-flow problems. The algorithm solves data-flow problems under certain conditions, and the example problems based on the two definitions above are shown to satisfy these conditions. These results will be used to introduce the new data-flow analyses required to determine the shape of the individual live ranges prior to register allocation.

Lattice Theory. Lattice theory provides the foundation for data-flow analysis. Data-flow analysis is performed on elements of a structure called a semi-lattice:

Definition 3.5.3. A semi-lattice is a pair (S, ⊔), where S is a non-empty set and ⊔ is a binary operation on S that is idempotent, commutative, and associative:

⊔ : S × S → S
x ⊔ x = x                    ∀x ∈ S
x ⊔ y = y ⊔ x                ∀x, y ∈ S
x ⊔ (y ⊔ z) = (x ⊔ y) ⊔ z    ∀x, y, z ∈ S

Either the ⊔ or the ⊓ symbol is used for the binary operation: In the former case, the operation is called join, in the latter case, the operation is called meet. This nomenclature is based on the definition of a larger structure called a lattice:

Definition 3.5.4. A lattice is a triple (S, ⊓, ⊔), where S is a non-empty set and (S, ⊓) and (S, ⊔) are semi-lattices.

An example for a lattice is the set of m-bit vectors with the logical and (∧) and or (∨) bit operations as the meet and join operations, respectively. Since there are 2^m different m-bit vectors, the set is non-empty. The fact that the logical bit operations are idempotent, commutative, and associative can be derived from the corresponding properties of the boolean and/or operations.

Given a lattice (S, ⊓, ⊔), the meet and join operations induce partial orderings on the set S, denoted by ⊑ for the meet and ⊒ for the join operation, respectively:

x ⊑ y ⇔ x ⊓ y = x
x ⊒ y ⇔ x ⊔ y = x

Based on the above definition, the operations ⊏ and ⊐ are defined correspondingly. Note that all operations can be defined either in terms of the meet or in terms of the join operation.

Lemma 3.5.6. The operation ⊑ is reflexive, antisymmetric, and transitive:

x ⊑ x                        ∀x ∈ S
x ⊑ y ∧ y ⊑ x ⇒ x = y        ∀x, y ∈ S
x ⊑ y ∧ y ⊑ z ⇒ x ⊑ z        ∀x, y, z ∈ S

Proof. The reflexivity follows directly from the fact that ⊓ is idempotent. The antisymmetry follows from the definition of the ⊑ operation and the fact that ⊓ is commutative:

x ⊑ y ⇔ x ⊓ y = x
y ⊑ x ⇔ y ⊓ x = y    }  ⇒ x = x ⊓ y = y ⊓ x = y

The transitivity follows from the definition of the ⊑ operation and the associativity of ⊓:

x ⊑ y ⇔ x ⊓ y = x
y ⊑ z ⇔ y ⊓ z = y    }  ⇒ x ⊓ z = x, i.e. x ⊑ z  □

A (semi-)lattice can have two unique elements, the bottom element ⊥ and the top element ⊤, that satisfy the following conditions:

x ⊓ ⊥ = ⊥    ∀x ∈ S
x ⊔ ⊤ = ⊤    ∀x ∈ S

A (semi-)lattice is said to be of finite length, if every strictly increasing chain of elements

⊥ = x1 ⊏ x2 ⊏ ... ⊏ xn = ⊤

is finite. Obviously, if S is finite the semi-lattice is of finite length.
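As a small worked illustration of these notions (the concrete values are chosen here purely for exposition), consider the lattice of 3-bit vectors with ⊓ = ∧, ⊔ = ∨, bottom element ⊥ = 000 and top element ⊤ = 111:

011 ⊓ 110 = 010,    011 ⊔ 110 = 111,    010 ⊑ 011 since 010 ∧ 011 = 010

Every strictly increasing chain, e.g. 000 ⊏ 010 ⊏ 011 ⊏ 111, has at most m + 1 = 4 elements, hence this lattice is of finite length.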

Definition 3.5.5. Let L = (S, ⊔) be a semi-lattice of finite length with a bottom element. A set F of functions on S is a monotone operation space associated with L, if and only if each f ∈ F is monotone, F contains the identity operation, F is closed under composition, and F is complete [Hec77]:

∀f ∈ F :      x ⊑ y ⇒ f(x) ⊑ f(y)    ∀x, y ∈ S
∃e ∈ F :      e(x) = x               ∀x ∈ S
∀f, g ∈ F :   f ∘ g ∈ F, where (f ∘ g)(x) = f(g(x))    ∀x ∈ S
∀x ∈ S :      ∃f ∈ F : x = f(⊥)

A monotone operation space associated with L is called distributive if the functions in F are distributive under the meet or join operation:

∀f ∈ F : f(x ⊔ y) = f(x) ⊔ f(y)    ∀x, y ∈ S

Note that the last condition implies that all functions in F are monotone:

x ⊑ y ⇔ x ⊔ y = x ⇔ f(x) = f(x ⊔ y) = f(x) ⊔ f(y) ⇔ f(x) ⊑ f(y)

Definition 3.5.6. A monotone data-flow analysis framework is a triple D = (S, ⊔, F) such that (S, ⊔) is a semi-lattice of finite length with a bottom element and F is a monotone operation space associated with the semi-lattice L = (S, ⊔). A distributive data-flow analysis framework is a monotone data-flow analysis framework D = (S, ⊔, F) where F is distributive.

The connection between a data-flow analysis problem and the previous definitions is provided by the following definition:

Definition 3.5.7. An instance I of a monotone data-flow analysis framework D = (S, ⊔, F) is a tuple I = (G, M), such that G = (N, E, s) is a flow graph with entry point s and M : N → F maps each node in N to a function in F.

The two examples presented above, i.e. reaching definitions and live variables, can be modeled as distributive data-flow analysis frameworks: In both cases, let S be the set of m-bit vectors and let ⊔ be the logical or operation. As mentioned above, (S, ⊔) is a semi-lattice. Since the number of elements in S is 2^m, i.e. finite, each strictly increasing chain of elements from S is finite, hence the semi-lattice is of finite length. The bottom element ⊥ of S is represented by the bit vector of all zeros, since

x ⊔ ⊥ = x ∨ (0, ..., 0) = x

Let F be the set of functions ⟨k, g⟩, such that

⟨k, g⟩(x) = (x ∧ ¬k) ∨ g

where ¬k denotes the bitwise complement of the mask k.

Lemma 3.5.7. The triple REACH = (S, ⊔, F) is a distributive data-flow analysis framework.

Proof. The monotonicity will be handled at the end. The identity operation in F is given by ⟨0, 0⟩:

⟨0, 0⟩(x) = ((x ∧ ¬0) ∨ 0) = x    ∀x ∈ S

Let ⟨k1, g1⟩, ⟨k2, g2⟩ be two functions from F. The composition of these functions is a function in F:

⟨k1, g1⟩(⟨k2, g2⟩(x)) = ⟨k1, g1⟩((x ∧ ¬k2) ∨ g2)
 = (((x ∧ ¬k2) ∨ g2) ∧ ¬k1) ∨ g1
 = (x ∧ (¬k2 ∧ ¬k1)) ∨ ((g2 ∧ ¬k1) ∨ g1)
 = ⟨k2 ∨ k1, (g2 ∧ ¬k1) ∨ g1⟩(x)

For x ∈ S, the function ⟨0, x⟩ meets the completeness condition:

⟨0, x⟩(0) = (0 ∧ ¬0) ∨ x = x

To show that F is distributive, let x, y ∈ S and f = ⟨k, g⟩ ∈ F. For every 1 ≤ i ≤ m, two cases must be considered:

• Suppose the ith bit of x ⊔ y is zero, i.e. the ith bits of both x and y are zero. Therefore the ith bit of f(x), f(y) equals the ith bit of g, as

f(x) = ⟨k, g⟩(x) = (x ∧ ¬k) ∨ g
f(y) = ⟨k, g⟩(y) = (y ∧ ¬k) ∨ g

It follows that the ith bit of f(x) ⊔ f(y) equals the ith bit of g, which equals the ith bit of f(x ⊔ y) by the same reasoning, as the ith bit of x ⊔ y was supposed to be zero.

• Suppose the ith bit of x ⊔ y is one, i.e. at least one of the ith bits of x, y is one. Without loss of generality, assume that the ith bit of x is one. Then the ith bit of

f(x) = (x ∧ ¬k) ∨ g = ¬k ∨ g,

which equals the ith bit of f(x) ⊔ f(y). Applying the same reasoning to f(x ⊔ y) yields that the ith bit of f(x ⊔ y) is ¬k ∨ g.

Note that the distributiveness of F implies the monotonicity. □
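In terms of machine words, such a ⟨k, g⟩ flow function and its composition reduce to a few bit-mask operations. The following is a minimal sketch, assuming that all m registers fit into a single unsigned long; dflow_t and struct flow_fn are illustrative names, not the converter's actual data types:

/* Sketch of the gen/kill flow functions <k, g>(x) = (x & ~k) | g used
 * by the REACH and LIVE frameworks. */
typedef unsigned long dflow_t;

struct flow_fn {
    dflow_t k;    /* kill mask */
    dflow_t g;    /* gen mask  */
};

static dflow_t apply_flow(const struct flow_fn *f, dflow_t x)
{
    return (x & ~f->k) | f->g;
}

/* composition f1 o f2 is again a gen/kill function, cf. Lemma 3.5.7 */
static struct flow_fn compose_flow(struct flow_fn f1, struct flow_fn f2)
{
    struct flow_fn f = { f2.k | f1.k, (f2.g & ~f1.k) | f1.g };
    return f;
}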

Note that the triple LIVE = (S, ⊔, F) is a distributive data-flow analysis framework that uses the same semi-lattice (S, ⊔) and the associated operation space F and solves the live variables problem. The only difference between the two data-flow problems is the mapping function M and the direction: reaching definitions is a forward problem, while live variables is a backward problem.

The desired result of solving data-flow analysis problems is the meet-over-all-paths (MOP) solution:

Definition 3.5.8. Let G = (N, E, s) be a flow graph with starting point s. For every b ∈ N, let Path(b) be the set of all paths from s to b, and let F_b represent the flow function associated with node b. Given a path p = (p1, . . . , pn) ∈ Path(b), F_p is the composition of the individual F_{p_i}:

F_p = F_{p1} ∘ . . . ∘ F_{pn}

The meet-over-all-paths solution is

MOP(b) = ⨆ { F_p(0) | p ∈ Path(b) }

for all b ∈ N.

Unfortunately, it is generally undecidable whether an algorithm exists that computes the MOP solution for all possible flow graphs [Hec77]. The algorithms therefore compute the maximum fixed point (MFP) solution:

Definition 3.5.9. Let G = (N, E, s) be a flow graph with starting point s. For every b ∈ N, let F_b represent the flow function associated with node b and Pred(b) the set of predecessors, and let the nodes in N be numbered from 1 to n, where n is the number of nodes in N, in reverse postorder. The maximum-fixed-point solution is defined as the maximum fixed point of the following equations:

DF(1) = ⊥
DF(i) = ⨆ { F_j(DF(j)) | j ∈ Pred(i) }    2 ≤ i ≤ n

The MFP solution is a solution of the data-flow analysis equations that is maximal in the ordering of S. For distributive data-flow analysis frameworks, the MFP solution is equal to the MOP solution [Kil73].

The algorithm presented in Figure 3.11 computes the MFP solution for a given instance of a monotone data-flow analysis framework [KU75]. The algorithm maintains a queue of nodes to be processed. Upon startup, this queue is initialized with all nodes in the flow graph. Note that the ordering of the nodes is important: The algorithm achieves maximum performance if the queue contains the nodes in reverse post-order, i.e. a node is visited before any of its successors have been visited. The data-flow information that reaches node b ∈ N is given by DF(b) and is initialized to the bottom element of the corresponding lattice.

The algorithm consists of a single loop that is iterated as long as the worklist is not empty. In each iteration, the first element of the worklist is removed from the list and the data-flow information for all predecessors is computed and combined with the meet or join operation. If the combined data-flow information from all predecessors of the current node is different from the data-flow information at the node, the node is updated and all successors of the node are appended to the worklist in order to propagate the

Fig. 3.11. Iterative Data-Flow Algorithm

void iterate_reach(struct procedure *proc_ptr)
{
    list_init(&work_list);

    /* initialize the worklist with all basic blocks in reverse post-order */
    while(bblock_ptr = walk_list(proc_ptr->post_order)) {
        list_append(&work_list, bblock_ptr);
    }

    while(bblock_ptr = walk_list(work_list)) {

        bblock_ptr = list_remove(&work_list, work_list.head_ptr);

        /* combine the data-flow information of all predecessors */
        totalmask = 0;
        while(parent_ptr = walk_list(bblock_ptr->parent_list)) {
            totalmask |= dflow_reach(&work_list, parent_ptr);
        }

        /* special case for single-bblock procedures */
        if((bblock_ptr->parent_list.num_elements == 0) &&
           (bblock_ptr->childs_list.num_elements == 0)) {
            dflow_reach(&work_list, bblock_ptr);
        }

        if(totalmask != bblock_ptr->reach_mask) {
            bblock_ptr->reach_mask = totalmask;

            if(bblock_ptr->childs_list.num_elements != 0) {
                /* add successors of the current bblock to the worklist */
                while(child_ptr = walk_list(bblock_ptr->childs_list)) {
                    list_append(&work_list, child_ptr);
                }
            } else {
                /* special case for childless bblocks */
                dflow_reach(&work_list, bblock_ptr);
            }
        } else if(bblock_ptr->childs_list.num_elements == 0) {
            dflow_reach(&work_list, bblock_ptr);
        }
    }
}

data-flow information. Note that the description above assumes a forward flow problem. The corresponding algorithm for backward flow problems initializes the list in reverse pre-order, computes and combines the data-flow information from all successors, and appends all predecessors to the worklist instead.

The runtime of the algorithm depends on the meet or join operation and the complexity of the flow functions. However, the number of loop traversals is bounded by A + 2, where A is the maximum number of back edges on any path through the flow graph [HU75]. Note that A can be on the order of |N|, i.e. the number of nodes in the flow graph, but usually A ≤ 3. While there are more efficient algorithms, the iterative algorithm presented above is simple to implement and widely used.

3.5.7 Register Allocation

Register allocation is a major component of all compilers: Given a set of values that might reside in registers, such as variables, temporaries, and constants, register allocation determines those values that reside in registers. Since the number of registers is usually much smaller than the number of values that might reside in registers, and instructions operate significantly faster on registers than on memory, register allocation is important for performance. This is especially true for RISC processors, where usually all instructions, except load and store instructions, operate on registers only.

Register allocation should not be confused with register assignment, i.e. determining the actual register for the allocated values. Register assignment is a trivial task for modern RISC architectures, as the register sets are usually divided into two uniform sets, the integer and floating-point registers, but are otherwise general purpose.

Early approaches used local methods such as usage counts [Fre74] or bin-packing [Lev83] to solve the register allocation problem. The former approach counts the number of uses and definitions in each basic block for all potential register-residing values, e.g. variables, temporaries, constants. These usage counts are used together with the loop depth to prioritize the individual register-residing values. Registers are allocated for values in decreasing order of priority. The latter approach divides all values into groups, ranks all values within each group by priority, and tries several permutations to pack values into registers or memory locations. A similar approach is still used in Digital's GEM compiler system [BCD+92].

Global methods based on graph coloring provide a more effective approach to register allocation. Graph coloring determines the minimum number of colors required to color a graph, such that no two adjacent nodes have the same color. Register allocation can be transformed into a graph coloring problem by mapping values to nodes and connecting two nodes whenever the corresponding values interfere, i.e. cannot reside in the same register. However, graph coloring is known to be NP-complete [GJ79], hence powerful heuristics are required for an effective algorithm.

The assembler converter uses a register allocator that is based on graph coloring, but is quite different from any of the other register allocators known from the literature. As the register allocator is a major component of the assembler converter, it is described in detail: The following sections introduce the graph coloring problem as well as two popular approaches to register allocation based on graph coloring, since the register allocator uses elements from both approaches. Afterwards, the data-flow analyses used to identify the allocatable objects, i.e. live ranges, as well as the interference model used in the assembler converter are described. Last, the actual allocation algorithm and the algorithm used to determine the location of save and restore instructions are described.

Graph Coloring.

Definition 3.5.10. Let G = (V, E) be an undirected graph, where V is the set of nodes and E is the set of edges. A k-coloring of such a graph assigns each node one of k different numbers such that all adjacent nodes have different colors. Formally, a k-coloring is a function c : V → {1, . . . , k}, such that c(u) ≠ c(v) for all edges (u, v) ∈ E. The k-graph-coloring problem is to determine a coloring that uses less than or equal to k colors.
The k-graph coloring problem is NP-complete for all k ≥ 3 [GJ79], hence all practical algorithms for computing graph colorings are based on heuristics. Several powerful heuristics exist [BG95]; the two approaches to register allocation based on graph coloring presented below use different sets of heuristics.

Register allocation can be transformed into a graph-coloring problem as follows: Determine the set of allocatable objects, and construct a so-called interference graph, where the nodes of the graph represent the individual allocatable objects. Two nodes are connected by an edge if the corresponding objects interfere, i.e. cannot be allocated to the same register since they are in use simultaneously. After the interference graph has been constructed, it is colored with n colors, where n is the number of available registers, i.e. the size of the register set. If the register set is not uniform, a subset of the colors must be associated with every allocatable object. If an n-coloring is found, allocate each object to the register represented by the assigned color.

There are two major approaches to register allocation based on graph coloring: The first approach is called graph coloring and was developed by Chaitin [CAC+81][Cha82]. The second approach is called priority-based graph-coloring and originates with Chow and Hennessy [CH84][CL90]. Both approaches use graph coloring to solve the register allocation problem, but are quite different otherwise, as the following comparison shows.

Chaitin's approach performs register allocation on a machine-level representation, i.e. after code generation, whereas the latter performs register allocation on an intermediate-level representation, i.e. before code generation. Consequently, the former approach uses machine-level instructions as the unit of coloring, while the latter uses basic blocks as the unit of coloring. Chaitin's algorithm can achieve a lower chromatic number, i.e. a coloring that uses fewer colors, due to the smaller units of allocation. Since Chaitin's algorithm operates on the machine level, all references and definitions of registers are already known, which is not the case for the intermediate representation used in the other approach. Therefore Chaitin's algorithm applies graph coloring to all available registers, whereas the priority-based approach has to reserve several registers that are used during code generation.

Another major difference between the two approaches is the model of allocation: The priority-based approach uses a pessimistic model, i.e. assigns a memory location for every object and tries to allocate some of these objects to registers. The other approach uses an optimistic model, i.e. assumes that all objects start in registers and generates spill code as necessary. Note that Chaitin's approach always spills objects to the corresponding memory location, while the latter spills objects onto the stack. Due to the different models of allocation, the priority-based approach omits local temporaries, while the other approach omits global temporaries from allocation.

The two approaches use a different model for interferences as well: While Chaitin's algorithm uses live ranges that interfere if one live range is live at the definition point of the other, the priority-based algorithm uses live ranges that interfere if both live ranges are simultaneously live. Since Chaitin's algorithm uses live ranges with finer granularity, i.e. machine-level instructions, than the live ranges of the priority-based algorithm, i.e. basic blocks, the latter approach produces an interference graph that may be less precise in certain situations.

Both approaches use cost-saving estimates to steer the allocation of registers. However, Chaitin's algorithm spills those objects that are least costly to spill, while the priority-based algorithm allocates those objects that benefit the most, another consequence of the different allocation models.

Both approaches use heuristics to lower the chromatic number of the interference graph. Chaitin's algorithm spills live ranges to memory, thereby eliminating the corresponding nodes from the interference graph. However, the spilled object must be restored and saved before and after each reference or definition, respectively, introducing several new nodes to the interference graph. In contrast, the priority-based algorithm splits live ranges into two or more smaller ones, effectively reducing the chromatic number of the interference graph. There is no need to add new nodes to the interference graph for spilled objects, as each object has a corresponding memory location. Note that Chaitin's algorithm either spills an object or allocates a register for it, but never splits live ranges.

The assembler converter uses a register allocator that is based on graph coloring and combines elements from both approaches. The fundamental difference between the register allocator in the assembler converter and traditional register allocators is the set of allocatable objects: Traditional approaches allocate objects such as variables, temporaries, and constants to registers, while the assembler converter allocates registers to registers.

Spilling of allocatable objects is not possible, as each register must be reallocated to another register. Note that the thread descriptor provides a memory location for each register, hence the allocation model from the priority-based approach is used. As the assembler converter operates on assembler instructions, i.e. after code generation, the register allocator uses assembler instructions as the unit of allocation, as well as the corresponding interference model. Since spilling of registers is not an option, the register allocator splits live ranges to lower the chromatic number of the interference graph.
Together with the small unit of allocation, this ensures that a coloring without any spills is achieved, as long as live ranges are split early enough. Therefore the register allocation in the assembler converter uses the heuristics from the priority-based graph coloring.

Live Ranges. The register allocator in the assembler converter allocates registers instead of variables, temporaries, and constants like traditional allocators. Therefore some of the definitions need to be translated for this special case:

Definition 3.5.11. A given point in the control-flow graph is a definition of a register, if the register is written by the corresponding machine instruction. A given point in the control-flow graph is a use of a register, if the register is read by the corresponding machine instruction.

Definition 3.5.12. A register is said to be reaching at a particular point in the control-flow graph, if there exists a path in the graph from a definition or use of the register to that particular point. A register is said to be live at a particular point in the control-flow graph, if a path in the graph from that point to an exit node exists that contains a use or definition of the register.

Data-flow analysis can be used to determine the set of nodes in the control-flow graph where a register is live or reaching, respectively. The reaching problem is an instance of the reaching definition problem presented in Section 3.5.6. The only difference is the function M that maps the nodes to a particular flow function: For each node, two m-bit masks k, g are created, where m is the total number of registers. The ith bit of mask g is set if the corresponding instruction contains a use or definition of register number i, cleared otherwise. The ith bit of mask k is only set in the context of external calls. The register masks k, g determine the corresponding flow function ⟨k, g⟩. The same mapping is used to solve the live register problem as an instance of the live variable problem presented in Section 3.5.6. As both problems are distributive data-flow analysis frameworks, the iterative algorithm presented in Section 3.5.6 can be used to compute the meet-over-all-paths solution for the reaching register and live register problems.
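A sketch of how these per-node masks could be derived from a single instruction is given below; reg_uses(), reg_defs(), INS_EXTERNAL_CALL and destroyed_by_call_mask() are assumed helper names, and the treatment of external calls is only one plausible choice:

/* Sketch: derive the <k, g> masks of a node for the reaching and live
 * register problems. All helper names are illustrative assumptions. */
void build_register_masks(struct instruction *insn,
                          unsigned long *k, unsigned long *g)
{
    /* the gen mask records every use or definition of a register */
    *g = reg_uses(insn) | reg_defs(insn);

    /* the kill mask is only set in the context of external calls,
       e.g. for the registers destroyed by the callee */
    *k = (insn->flag & INS_EXTERNAL_CALL) ? destroyed_by_call_mask(insn) : 0;
}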

Definition 3.5.13. A live range of a register is a contiguous group of nodes of the control-flow graph, in which a register is reaching and live.

The current implementation of the assembler converter uses a more restricted form of live ranges that always begin and end at a definition or use of the register. Without this restriction, the efficiency of the allocation may be reduced, as Figure 3.12 illustrates:

Fig. 3.12. Live Range Example

(The figure shows a loop with header B1, a loop body B2 that contains a use of register x, and a tail block B3.)

The register is reaching and live in the loop body, thus the insertion of save and restore instructions at the top and bottom of the loop body is required. Note that moving the save/restore instructions out of the loop body may cause the creation of new basic blocks. As the assembler converter does not support optimizations that require a re-layout of the code, this is not possible in the current implementation. However, once the assembler converter is integrated into a compiler, this restriction no longer applies. Using the restricted definition of live ranges, the save and restore instructions are still inside the loop body, but there is an additional available register outside the restricted live range.

Under the restricted definition, a live range is said to start at a given node in the control-flow graph, if the node is a definition or use of the register and satisfies one of the following conditions:

• The register is live and reaching at this node, but not at any of the predecessor nodes.
• Any acyclic path in the control-flow graph from this node to any other node that satisfies the first condition contains neither a use nor a definition of the register.

Similarly, a live range is said to end at a given node in the control-flow graph if the node is a definition or use of the register and satisfies one of the following conditions:

• The register is live and reaching at this node, but not at any of the successor nodes.
• Any acyclic path in the control-flow graph from this node to any other node that satisfies the first condition contains neither a use nor a definition of the register.

As the following paragraphs show, the list of nodes that start and end a live range can be determined by data-flow analysis.

Fig. 3.13. Live Range Example

(The figure shows two predecessor nodes B1 and B2 whose edges join at node B3.)

Data-Flow Analysis. Let S be the set of pairs of m-bit vectors, i.e. formally

S = { (n, r) | n, r are m-bit vectors }

where m is the number of registers. Given such a pair (n, r), the elements are interpreted in the following way: The ith bit of the r mask is set if register number i is reaching, cleared otherwise. The ith bit of the n mask is set if an acyclic path in the control-flow graph exists that contains neither a use nor a definition of register number i. Based on the above interpretation, the meet operation on S is defined as:

(n1, r1) ⊔ (n2, r2) = ((n1 ∨ ((n1 ∨ n2) ∧ ¬r1 ∧ ¬n1)) ∧ (n2 ∨ ((n1 ∨ n2) ∧ ¬r2 ∧ ¬n2)), r1 ∨ r2)

Note that the meet operation is identical to the join operation for the reaching definition problem if restricted to the r masks. The operations for the n mask can be interpreted as follows: The ith bit of the resulting n mask is set if the ith bit of both n masks is set, or the ith bit is set in either one of the n masks and the ith bit of the other r mask is cleared, respectively. The reasoning behind this is illustrated in Figure 3.13.

If the ith bit of both n masks is set, there exists a path from any of the two predecessor nodes to any node that satisfies the first condition, i.e. register i is live and reaching at that node, but not at any of the preceding nodes. Consequently, no such path exists that starts at node B2. If the ith bit of the n masks at either node B1 or B2 is set, the ith bit of the reaching mask is cleared at node B2 or B1, respectively. Since register i is not reaching at node B2 or B1, respectively, a path that satisfies the second condition cannot exist.
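The n-component of this meet simplifies to (n1 ∧ n2) ∨ (n1 ∧ ¬r2) ∨ (n2 ∧ ¬r1), so the operation translates directly into a few bit-mask operations; a minimal sketch with assumed type names:

/* Sketch: meet of two (n, r) pairs as defined above; nr_pair is an
 * illustrative type, not the converter's actual data structure. */
struct nr_pair { unsigned long n, r; };

static struct nr_pair meet_nr(struct nr_pair a, struct nr_pair b)
{
    struct nr_pair c;
    /* set if both n bits are set, or one n bit is set while the
       other r bit is cleared */
    c.n = (a.n & b.n) | (a.n & ~b.r) | (b.n & ~a.r);
    c.r = a.r | b.r;
    return c;
}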

Lemma 3.5.8. The tuple (S, ⊔) as defined above forms a semi-lattice.

Proof. Note that the proofs for the reach masks are omitted, as the meet operation on these masks is identical to the join operation defined for the reaching definition problem, which was proven to be a semi-lattice in Section 3.5.6. The operation ⊔ is idempotent:

(n1, r1) ⊔ (n1, r1) = ((n1 ∨ ((n1 ∨ n1) ∧ ¬r1 ∧ ¬n1)) ∧ (n1 ∨ ((n1 ∨ n1) ∧ ¬r1 ∧ ¬n1)), r1 ∨ r1)
 = (n1 ∨ ((n1 ∨ n1) ∧ ¬r1 ∧ ¬n1), r1)
 = (n1 ∨ (n1 ∧ ¬r1 ∧ ¬n1), r1)
 = (n1, r1)

The operation ⊔ is commutative:

(n1, r1) ⊔ (n2, r2) = ((n1 ∨ ((n1 ∨ n2) ∧ ¬r1 ∧ ¬n1)) ∧ (n2 ∨ ((n1 ∨ n2) ∧ ¬r2 ∧ ¬n2)), r1 ∨ r2)
 = ((n2 ∨ ((n2 ∨ n1) ∧ ¬r2 ∧ ¬n2)) ∧ (n1 ∨ ((n2 ∨ n1) ∧ ¬r1 ∧ ¬n1)), r2 ∨ r1)
 = (n2, r2) ⊔ (n1, r1)

The operation ⊔ is associative: Abbreviating the n-component of (n1, r1) ⊔ (n2, r2), which simplifies to (n1 ∧ n2) ∨ (n1 ∧ ¬r2) ∨ (n2 ∧ ¬r1), by n' and its r-component r1 ∨ r2 by r', the n-component of ((n1, r1) ⊔ (n2, r2)) ⊔ (n3, r3) is

(n' ∧ n3) ∨ (n' ∧ ¬r3) ∨ (n3 ∧ ¬r')
 = ((n1 ∧ n2) ∨ (n1 ∧ ¬r2) ∨ (n2 ∧ ¬r1)) ∧ (n3 ∨ ¬r3) ∨ (n3 ∧ ¬r1 ∧ ¬r2)
 = (n1 ∧ n2 ∧ n3) ∨ (n1 ∧ n2 ∧ ¬r3) ∨ (n1 ∧ n3 ∧ ¬r2) ∨ (n2 ∧ n3 ∧ ¬r1)
   ∨ (n1 ∧ ¬r2 ∧ ¬r3) ∨ (n2 ∧ ¬r1 ∧ ¬r3) ∨ (n3 ∧ ¬r1 ∧ ¬r2)

This expression is symmetric in the indices 1, 2, 3; expanding the n-component of (n1, r1) ⊔ ((n2, r2) ⊔ (n3, r3)) in the same way yields the identical result. The r-components of both sides equal r1 ∨ r2 ∨ r3, since ∨ is associative. Hence

((n1, r1) ⊔ (n2, r2)) ⊔ (n3, r3) = (n1, r1) ⊔ ((n2, r2) ⊔ (n3, r3))  □

Let the set of operations F be defined as follows:

F = { ⟨k1, g1, k2, g2⟩ | ⟨k1, g1, k2, g2⟩(x1, x2) = ((x1 ∧ ¬k1) ∨ g1, (x2 ∧ ¬k2) ∨ g2) }

The mapping M of nodes in the control-flow graph to operations in F is determined as follows: The ith bit of the k1, g2 masks is set if the node contains a use/definition of register number i, cleared otherwise. The ith bit of k2 is set in connection with external calls, cleared otherwise. The ith bit of g1 is set if the node is a loop header and register number i is reaching and live inside the loop body, but not outside.

Lemma 3.5.9. The triple (S, ⊔, F) is a distributive data-flow analysis framework.

Proof. The monotonicity is implied by the distributiveness shown below. The identity operation in F is given by ⟨0, 0, 0, 0⟩:

⟨0, 0, 0, 0⟩(x1, x2) = ((x1 ∧ ¬0) ∨ 0, (x2 ∧ ¬0) ∨ 0) = (x1, x2)    ∀(x1, x2) ∈ S

Let ⟨k1, g1, k2, g2⟩ and ⟨k1', g1', k2', g2'⟩ be two elements of F. The composition of these elements is an element of F; for the first component (the second component is analogous):

⟨k1, g1⟩(⟨k1', g1'⟩(x1)) = (((x1 ∧ ¬k1') ∨ g1') ∧ ¬k1) ∨ g1
 = (x1 ∧ (¬k1' ∧ ¬k1)) ∨ ((g1' ∧ ¬k1) ∨ g1)
 = ⟨k1' ∨ k1, (g1' ∧ ¬k1) ∨ g1⟩(x1)

For x = (x1, x2) ∈ S, the function ⟨0, x1, 0, x2⟩ meets the completeness condition:

⟨0, x1, 0, x2⟩(⊥) = ((0 ∧ ¬0) ∨ x1, (0 ∧ ¬0) ∨ x2) = x

The operation ⊔ is distributive; for the ith bit of the n-component there are two different cases:

• Suppose that the ith bit of the n-component of x ⊔ y is zero, i.e.

(n1 ∨ ((n1 ∨ n2) ∧ ¬r1 ∧ ¬n1)) ∧ (n2 ∨ ((n1 ∨ n2) ∧ ¬r2 ∧ ¬n2)) = 0

This situation arises in the following three cases:

n1 = 0, n2 = 0    and    n1 = 0, n2 = 1, r1 = 1    and    n1 = 1, n2 = 0, r2 = 1

The last two cases are symmetric, hence only the first two cases are shown.

– The ith bit of f(x), f(y) equals the ith bit of g1, since

((0 ∧ ¬k1) ∨ g1) = g1

As the operator ⊔ is idempotent, the ith bit of f(x) ⊔ f(y) equals the ith bit of g1, which equals the ith bit of f(x ⊔ y). Note that the ith bit of x ⊔ y was supposed to be zero.

– The ith bit of f(x) equals the ith bit of g1 using the same reasoning as above. The ith bit of f(y) equals the ith bit of

((1 ∧ ¬k1) ∨ g1) = (¬k1 ∨ g1)

Therefore the ith bit of f(x) ⊔ f(y) equals

(g1 ∨ ((g1 ∨ (¬k1 ∨ g1)) ∧ ¬r1 ∧ ¬g1)) ∧ ((¬k1 ∨ g1) ∨ ((g1 ∨ (¬k1 ∨ g1)) ∧ ¬r2 ∧ ¬(¬k1 ∨ g1)))
 = g1 ∧ (¬k1 ∨ g1)        (since r1 = 1 in this case)
 = g1

which equals the ith bit of f(x ⊔ y) using the same reasoning as in the first case.

• Assume that the ith bit of the n-component of x ⊔ y is one, i.e.

(n1 ∨ ((n1 ∨ n2) ∧ ¬r1 ∧ ¬n1)) ∧ (n2 ∨ ((n1 ∨ n2) ∧ ¬r2 ∧ ¬n2)) = 1

This situation arises in the following three cases:

n1 = 1, n2 = 1    and    n1 = 0, n2 = 1, r1 = 0    and    n1 = 1, n2 = 0, r2 = 0

The last two cases are symmetric, hence only the first two cases are shown.

– The ith bit of f(x), f(y) equals the ith bit of (¬k1 ∨ g1), as

(1 ∧ ¬k1) ∨ g1 = ¬k1 ∨ g1

As the operator ⊔ is idempotent, the ith bit of f(x) ⊔ f(y) equals the ith bit of ¬k1 ∨ g1 as well. Since the ith bit of x ⊔ y is supposed to be one, the ith bit of f(x ⊔ y) equals the ith bit of ¬k1 ∨ g1 using the same reasoning.

– The ith bit of f(x) equals the ith bit of g1, since

(0 ∧ ¬k1) ∨ g1 = g1

while the ith bit of f(y) equals the ith bit of (¬k1 ∨ g1) as

(1 ∧ ¬k1) ∨ g1 = ¬k1 ∨ g1

The ith bit of f(x) ⊔ f(y) is

((¬k1 ∨ g1) ∨ ((g1 ∨ (¬k1 ∨ g1)) ∧ ¬r2 ∧ ¬(¬k1 ∨ g1))) ∧ (g1 ∨ ((g1 ∨ (¬k1 ∨ g1)) ∧ ¬r1 ∧ ¬g1))
 = (¬k1 ∨ g1) ∧ (g1 ∨ (¬k1 ∧ ¬g1))        (since r1 = 0 in this case)
 = (¬k1 ∨ g1) ∧ (¬k1 ∨ g1)
 = ¬k1 ∨ g1

Since the ith bit of x ⊔ y is supposed to be one, the ith bit of f(x ⊔ y) equals the ith bit of ¬k1 ∨ g1. □

The end points of a live range in the control-flow graph can be determined in the same way: The only difference is the direction of the data-flow analysis and the mapping M that maps a node to the corresponding flow function: The ith bit of the k1, g2 masks is set if the node contains a use or definition of register number i, cleared otherwise. The ith bit of k2 is set in connection with external calls, cleared otherwise. The ith bit of g1 is set if the node is a loop header and register number i is reaching and live inside the loop body, but not outside.

After the four data-flow analyses presented above have been completed, all nodes in the control-flow graph that are the start or end of a live range are marked accordingly. Note that live ranges can only start at nodes, but can end at nodes or edges, as Figure 3.14 shows. Therefore all edges are inspected and those that are an end point for a live range are marked as well. Afterwards the actual live range data structures are created.

Fig. 3.14. Live Range Example

[Control-flow graph with a use of x in block B1 and further uses of x in blocks B2 and B3.]
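The four analyses above all use flow functions that clear some bits and set others. The following is a minimal, illustrative sketch of this general form and its composition on bit vectors; it uses the standard out = (in AND NOT kill) OR gen formulation rather than the exact ⟨k, g⟩ notation of the framework, and all names are invented for this sketch.

    #include <stdint.h>
    #include <stdio.h>

    /* One flow function per node: clears the "kill" bits, then sets the "gen" bits. */
    struct flowfn {
        uint64_t kill;   /* bits killed by the node */
        uint64_t gen;    /* bits generated by the node */
    };

    /* Apply a flow function to the incoming bit vector. */
    static uint64_t apply(struct flowfn f, uint64_t in)
    {
        return (in & ~f.kill) | f.gen;
    }

    /* Compose two flow functions: compose(f1, f2)(x) == apply(f1, apply(f2, x)). */
    static struct flowfn compose(struct flowfn f1, struct flowfn f2)
    {
        struct flowfn r;
        r.kill = f2.kill | f1.kill;
        r.gen  = (f2.gen & ~f1.kill) | f1.gen;
        return r;
    }

    int main(void)
    {
        struct flowfn node1 = { 0x1, 0x2 };
        struct flowfn node2 = { 0x2, 0x4 };
        uint64_t in = 0xF;

        /* Applying the composed function yields the same result as applying
         * the two functions one after another. */
        printf("%llx %llx\n",
               (unsigned long long)apply(compose(node1, node2), in),
               (unsigned long long)apply(node1, apply(node2, in)));
        return 0;
    }

Closure under composition is what allows the iterative solver to summarize whole paths by a single function, as exploited by the framework above.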

Splitting & Merging. The live range data structure contains the following elements:

• A mask of registers that are available for this live range.
• A mask of registers that are forbidden for this live range.
• A priority that is used to steer the allocation of live ranges.
• The number of instructions in the live range as well as the number of uses and definitions of the corresponding register in the live range.
• A list of interfering live ranges, i.e. live ranges that cannot be allocated to the same register as the live range itself.
• A list of subranges, one for each starting point of the live range.

Each subrange contains those instructions of the corresponding live range, in depth-first search order, that are reachable from the starting point of the subrange and are not in any of the other subranges. As the live range is a contiguous set of nodes, the individual subranges connect. The instructions in a subrange form a tree with the corresponding starting point as the root, while the instructions of the whole live range form a forest with one tree for each starting point in the live range. The data structures are created by allocating a new data structure at each starting point of a live range and adding instructions to the live range until an endpoint is encountered on all paths from the starting point. Note that all live ranges created in this way contain only one subrange, hence live ranges overlap if a live range has more than one starting point. The individual live ranges are now split at super block boundaries to ensure that no live range extends across a context switch. Note that this splitting of live ranges is omitted if register partitioning is used. The splitting algorithm exploits the fact that all live ranges contain only one subrange at this point and works in the following way:

The algorithm processes all basic blocks and checks them for the end-of- sblock flag. If the current basic block ends an super block, all children of the block are processed. For each child block, the list of live ranges in the current basic block is compared with the list of live ranges in the child. If a match is found, i.e. a live range that crosses the super block boundary, the live range is split across the corresponding edge. The split is performed in two steps: First, all instructions in the live range that can be reached from the top of the child block are extracted from the original live range and moved to a new live range. Note that the remaining instructions in the original live range are still in depth-first search order, but the live range may no longer end at a use/definition point of the corresponding register. Therefore all loose ends are removed from the list of instructions in the original live range in order to maintain the live range properties. The extracted instructions are used to spawn one or more live ranges. The spawning of live ranges is a special case of the general algorithm described below, therefore a description of the algorithm is omitted here. The only difference between the two algorithms is that the one used during initial splitting of live ranges operates only in one direction, as each live range contains only one subrange, and ignores interferences between live ranges, as the interference graph has not been constructed yet. The splitting pass ensures that all live ranges are contained in super blocks, at least in the absence of register partitioning. But the live ranges still consist of a single subrange and several subranges may overlap. Hence the overlapping live ranges are merged by the algorithm described in Figure 3.15, taken from [Muc97]. For each register, the algorithm separates all live ranges for that register from the total list of live ranges and stores these live ranges in a temporary list. This separation significantly reduces the number of live ranges that have to be processed in each step, thereby increasing performance. Note that only live ranges for the same registers can be merged. The algorithm uses several rounds to merge all live ranges for the current register, initiating a new round as long as the stop flag is cleared. The stop flag is set at the beginning of each round, but is cleared if at least two live ranges were merged, since the merged live range may now overlap with other live ranges. In each round, all pairs of live ranges are processed. If both live ranges in the pair share a use or definition, i.e. overlap, these live ranges are merged. The algorithm does always converge since the number of live ranges is finite and decreases with every merge operation. This condition is checked by processing all instructions in one live range and checking for each use and definition point whether it is present in the other live range. The merging is performed by moving all subranges from one live range to the other live range, such that the individual subranges in the merged live range do not overlap. The remaining empty live range is removed from the temporary list and destroyed afterwards. 3.5 Assembler converter 111
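The live range and subrange data structures described above might be declared roughly as follows. This is an illustrative sketch: the type, field and list names are invented and differ from those used by the actual converter and from the names appearing in Figure 3.15 below.

    #include <stdint.h>

    struct instruction;   /* defined elsewhere in the converter */

    /* A subrange: the instructions reachable from one starting point of the
     * live range, kept in depth-first search order. */
    struct subrange {
        struct instruction **insts;
        int                  n_insts;
        struct subrange     *next;
    };

    /* A live range for one original register. */
    struct live_range {
        uint64_t            available;     /* registers available for this live range */
        uint64_t            forbidden;     /* registers forbidden for this live range */
        double              priority;      /* steers the order of allocation */
        int                 n_insts;       /* number of instructions in the range */
        int                 n_uses;        /* uses of the original register */
        int                 n_defs;        /* definitions of the original register */
        struct live_range **interfering;   /* live ranges that need other registers */
        int                 n_interfering;
        struct subrange    *subranges;     /* one subrange per starting point */
        struct live_range  *next;          /* link in constrained/unconstrained list */
    };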

Fig. 3.15. Merging of Live Ranges

void merge_graph(struct IFgraph *graph_ptr) {

    list_init(&LRlist);

    for(i = 0; i < REG_NUM_REGS; i++) {

        while(lrange_ptr1 = walk_list(&graph_ptr->constrained)) {
            if(lrange_ptr1->orig_reg->pnum == i) {
                list_append(&LRlist, lrange_ptr1);
            }
        }

        stop = 0;
        while(!stop) {
            stop = 1;

            while(lrange_ptr1 = walk_list(LRlist)) {

                lrange_ptr2 = lrange_ptr1->next_ptr;
                while(lrange_ptr2) {

                    while(srange_ptr = walk_list(lrange_ptr1->SRlist) && stop) {

                        while(inst_ptr = walk_list(srange_ptr->inst_list) && stop) {

                            if((inst_ptr->flag & ALC_PD_DEF) ||
                               (inst_ptr->flag & ALC_PD_USE)) {
                                if(search_PDitem1(lrange_ptr2, inst_ptr)) {
                                    merge_lranges(lrange_ptr1, lrange_ptr2);
                                    stop = 0;
                                }
                            }
                        }
                    }

                    lrange_ptr2 = lrange_ptr2->next_ptr;
                }
            }
        }

        while(lrange_ptr1 = list_remove(&LRlist, LRlist.head_ptr)) {
            list_append(&graph_ptr->constrained, lrange_ptr1);
        }
    }
}

Fig. 3.16. Interference Example

[Control-flow graph: block B1 uses left; blocks B2 and B3 define a1, . . . , an and b1, . . . , bn, respectively; block B4 uses left; blocks B5 and B6 use a1, . . . , an and b1, . . . , bn, respectively; block B7 is the exit.]

After all overlapping live ranges for the current register have been merged, the live ranges in the temporary list are moved to the total list of live ranges. Once all live ranges for all registers have been merged, i.e. the final shape of all live ranges is known, the interference graph is constructed as described below. Interferences. The two register allocators described above use different models for the interferences between live ranges: In Chaitin’s algorithm, two live ranges interfere, if they overlap at a definition point of one or both live ranges. A definition point is a node in the control-flow graph that contains a use or definition of the corresponding register. In the priority-based algo- rithm, two live ranges interfere if they overlap at an arbitrary point, hence providing a less precise model of interference. The difference between these two models is illustrated by looking at the example in Figure 3.16 taken from [Muc97]. In the interference model used by the priority-based algorithm, there are interferences between the a1, . . . , an, b1, . . . , bn, and left, i.e. the interference graph has 2n+1 nodes and 2n(2n+1) edges. As block B4 is neither a definition point for the a1, . . . , an, nor for the b1, . . . , bn , the ai do not interfere with the bi at all if the interference model from Chaitin’s algorithm is used. Note that the corresponding interference graph has 2n + 1 nodes and n(n + 1) edges. 3.5 Assembler converter 113
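The two interference tests can be contrasted in a few lines. The following predicates are only an illustrative sketch of the two models with an invented data layout; the actual construction is the one shown in Figure 3.17 below.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-instruction information (illustrative layout). */
    struct inst_info {
        bool     lr1_present, lr2_present;   /* both live ranges overlap here */
        uint64_t use_mask, def_mask;         /* registers used/defined here */
        uint64_t reg1, reg2;                 /* registers of the two live ranges */
    };

    /* Priority-based model: interference as soon as both ranges overlap. */
    bool interfere_overlap(const struct inst_info *p)
    {
        return p->lr1_present && p->lr2_present;
    }

    /* Chaitin's model: interference only if the overlap point contains a use
     * or definition of at least one of the two registers. */
    bool interfere_at_def(const struct inst_info *p)
    {
        uint64_t access = p->use_mask | p->def_mask;
        return p->lr1_present && p->lr2_present &&
               ((access & p->reg1) || (access & p->reg2));
    }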

Fig. 3.17. Interference Graph Construction

void build_graph(struct procedure *proc_ptr, struct IFgraph *graph_ptr) {

    while(bblock_ptr = walk_list(&proc_ptr->pre_order)) {

        while(inst_ptr = walk_list(&bblock_ptr->inst_list)) {

            if(inst_ptr->flag & INST_INSTRUCTION)
                while(lrange_ptr1 = walk_list(&inst_ptr->LRlist)) {
                    reg_mask1 = pnum2mask(lrange_ptr1->orig_reg->pnum);

                    while(lrange_ptr2 = walk_list(&inst_ptr->LRlist)) {
                        reg_mask2 = pnum2mask(lrange_ptr2->orig_reg->pnum);

                        if(((inst_ptr->def_mask | inst_ptr->use_mask) & reg_mask1) ||
                           ((inst_ptr->def_mask | inst_ptr->use_mask) & reg_mask2))

                            if((reg_mask1 & reg_mask2) == 0)
                                if(!(((LRitem_ptr1->flag & ALC_LR_END) &&
                                      (reg_mask1 & inst_ptr->use_mask) &&
                                      (LRitem_ptr2->flag & ALC_LR_NEW) &&
                                      (reg_mask2 & inst_ptr->def_mask) &&
                                      !(reg_mask2 & inst_ptr->use_mask)) ||
                                     ((LRitem_ptr2->flag & ALC_LR_END) &&
                                      (reg_mask2 & inst_ptr->use_mask) &&
                                      (LRitem_ptr1->flag & ALC_LR_NEW) &&
                                      (reg_mask1 & inst_ptr->def_mask) &&
                                      !(reg_mask1 & inst_ptr->use_mask)))) {
                                    link_lranges(LRitem_ptr1->lrange, LRitem_ptr2->lrange);
                                } else if(inst_ptr->use_mask & reg_mask1)
                                    if(inst_ptr->rst_after & reg_mask1) {
                                        inst_ptr->rst_after &= ~reg_mask1;
                                        inst_ptr->rst_before |= reg_mask1;
                                    } else if(inst_ptr->use_mask & reg_mask2)
                                        if(inst_ptr->rst_after & reg_mask2) {
                                            inst_ptr->rst_after &= ~reg_mask2;
                                            inst_ptr->rst_before |= reg_mask2;
                                        }
                    }
                }
        }
    }
}

Given a list of live ranges, the algorithm in Figure 3.17 creates the corresponding interference graph. The algorithm processes all instructions in the procedure in basic block order. For each instruction, each pair of live ranges in the list of live ranges at the instruction is inspected. If the current instruction is a definition point for at least one of the two live ranges, the corresponding interferences are added to both live ranges. However, if one of the live ranges ends in a use, while the other starts with a definition, the two live ranges do not interfere: the save instruction for the live range that ends can be inserted before the current instruction, and no restore instruction is necessary for the live range that starts at the current instruction.

Coloring. The register allocator uses priority-based graph coloring to color the interference graph constructed above. The interference graph consists of three lists: the list of constrained, unconstrained, and finished live ranges. A live range is constrained if the number of interferences is larger than or equal to the number of available registers, unconstrained otherwise. A live range is finished if it has already been colored. The interferences themselves are stored at each live range, i.e. each live range contains a list of interfering live ranges. Note that after construction of the interference graph, all live ranges reside in the constrained list. The coloring algorithm works as described in the following paragraphs:

All live ranges that have only one available register, i.e. are used in an external call prologue or epilogue, are colored. These live ranges are identified by processing all live ranges in the list of constrained live ranges. After coloring these live ranges, the interfering live ranges are updated by marking the allocated register as forbidden. The colored live range is then moved from the constrained list to the finished list.

All unconstrained live ranges are identified and moved to the unconstrained list. The live ranges in the unconstrained list can be easily colored after all constrained live ranges have been colored, as the number of available registers is larger than the number of interferences for these live ranges. Unconstrained live ranges can become constrained and vice versa due to the splitting of live ranges described below.

The following steps are repeated until all constrained live ranges have been colored:

• Compute the priority function p from the priority-based algorithm as a measure of the relative benefits of assigning a register to a given live range:

      p = (# uses ∗ LD SAVE + # defs ∗ ST SAVE) / # instructions

  The priority function uses two machine-specific parameters: LD SAVE is the cost associated with saving a register to the thread descriptor, ST SAVE is the cost associated with restoring a register from the thread descriptor. Note that the priority function is normalized by the number of

instructions in the live range to favor smaller live ranges that provide the same benefit. The live range with the highest priority is colored. The coloring algorithm • uses bit vector operations on the masks of the available and forbidden registers in the live range to determine the set of registers that is available and not forbidden. If the ropt optimization is enabled, one of the registers from the set is selected at random, otherwise the register with the smallest number is selected. After coloring, the live range is removed from the constrained list and • appended to the list of finished live ranges. All live ranges that interfere with the live range that was colored in the • previous step, are updated by marking the allocated register as forbidden. In addition, each live range is checked, whether it should be split in order to lower the chromatic number of the interference graph. A live range is split if either all available registers are forbidden, or the number of interfering live ranges is RT SPLIT times larger than the number of remaining avail- able registers. This heuristic is taken from the priority-based approach and ensures that splitting is initiated early enough to avoid deadlock situations. The RT SPLIT parameter is set to two as suggested by Chow/Hennessy [CL90]. After all constrained live ranges have been colored, the unconstrained live ranges are colored. As the number of available registers is larger than the number of interferences and therefore the number of forbidden registers, these live ranges can be colored without further splitting of live ranges. After coloring an unconstrained live range, the interfering live ranges are updated by marking the allocated register as forbidden. After the interference graph has been colored, all colored live ranges reside in the list of finished live ranges. The results of the coloring are applied to the instructions of the current procedure in the following way: For each live range, all instructions that belong to the live range and contain a use/definition of the corresponding register, are updated with the newly allocated register. Splitting. The splitting process works by spawning one or more live ranges from the top of each subrange in the original live range. Due to the smaller unit of allocation, i.e. assembler instructions, the splitting process is more complex than the one proposed for priority-based graph coloring. The splitting algorithm operates on regions of the original live range: A region is a continuous part of the live range that contains no more than one use or definition of the corresponding register and is bounded by the start or end of a basic block. These regions are maintained in the boundary list as well as two candidate lists. The boundary list contains regions that cannot be added to the spawned live range. The elements of this list are used to spawn other live ranges. The candidate lists contain regions that can still be added to the spawned live range, sorted in decreasing order of priority. 116 3. Implementation
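The priority that steers the coloring loop described above, together with the register selection used by the ropt optimization, might be computed along the following lines; the LD_SAVE and ST_SAVE values and the data layout are assumptions made for illustration, not the converter's actual interface.

    #include <stdint.h>
    #include <stdlib.h>

    #define LD_SAVE 3.0   /* machine-specific cost parameter for uses (assumed value) */
    #define ST_SAVE 3.0   /* machine-specific cost parameter for definitions (assumed value) */

    struct lrange {
        int      n_uses, n_defs, n_insts;
        uint64_t available, forbidden;
    };

    /* Priority of a live range: expected benefit of keeping the register
     * allocated, normalized by the size of the live range. */
    double lrange_priority(const struct lrange *lr)
    {
        return (lr->n_uses * LD_SAVE + lr->n_defs * ST_SAVE) / lr->n_insts;
    }

    /* Pick a register: either the lowest-numbered register that is available
     * and not forbidden, or a random one of those if "ropt" is enabled. */
    int lrange_pick_register(const struct lrange *lr, int ropt)
    {
        uint64_t candidates = lr->available & ~lr->forbidden;
        int      regs[64], n = 0;

        for (int i = 0; i < 64; i++)
            if (candidates & ((uint64_t)1 << i))
                regs[n++] = i;
        if (n == 0)
            return -1;                      /* no register left: split instead */
        return ropt ? regs[rand() % n] : regs[0];
    }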

The corresponding priority function is similar to the one used in the coloring algorithm. Given an instruction that belongs to the original live range, the algorithm spawns new live ranges in the following way: Starting with the given instruction, instructions are examined and removed from the original live range in program-flow order until either a basic block boundary or a use or definition of the corresponding register is found. If a basic block boundary is encountered, regions are created for every child block that belongs to the live range and appended to the boundary list. The algorithm is then recursively called for each item in the boundary list. If a use or definition of the corresponding register is found, a region is created and appended to the candidate list. This ensures that the new live range starts at a use or definition of the corresponding register. The algorithm then repeats the following steps until the candidate list is empty:

• The first region of the candidate list, i.e. the one with the highest priority, is selected and subsequently removed from the list. The instructions that belong to this region are moved to the spawned live range in program-flow or reverse program-flow order, as indicated by the type of the region. The region contains a pointer to the anchor point, i.e. the instruction where the spawned live range and the region connect. Note that the anchor point must already belong to the spawned live range. Since the live range properties must be maintained at all times, moving the instructions may cause the addition of subranges to the spawned live range.

• Since the spawned live range was changed during addition of the region in the previous step, all candidate regions must be reexamined in order to determine whether they can still be added to the spawned live range. Therefore each region is processed in the following way: If the region overlaps completely with the region that was added in the previous step, and both regions use the same direction, the region is discarded. If both regions overlap completely, but use different directions, the subrange associated with the region in reverse program-flow order is folded into the subrange that is associated with the other region. The join operation is necessary to maintain the live range properties, e.g. the ordering of the instructions. If both regions overlap at their boundaries, both regions are marked. If the current region does not overlap completely, the size and the priority of the region are recalculated. The priority p of a region is given by the number of additional interferences that would be caused by adding the region, normalized by the number of instructions in the region, i.e.

      p = 1.0 − # interferences / # instructions

  If the region can still be added to the spawned live range, it is moved to the second candidate list. Otherwise, the region is moved to the boundary list.

After all regions in the candidate list have been evaluated, all regions in the • first candidate list have either been moved to the second candidate list or the boundary list. The second candidate list contains all regions that can still be added to the spawned live range. Hence the two candidate lists are exchanged. Afterwards new candidates that connect at the recently added region are explored as described below. If the current region was added in program-flow order, new regions are explored in three different ways: If the first instruction of the region is at the start of a basic block, all parents of the corresponding block that belong to the original live range are explored. If the last instruction of the region is at the end of a basic block, all children of the corresponding block that belong to the original live range are explored. If the last instruction of the current region is not at the end of an basic block, a new region is explored starting from the next instruction in the corresponding block. In all three cases, the explored regions are added to the candidate list if they can be added to the spawned live range, else the regions are added to the boundary list. If the region was added in reverse program-flow order, new regions are explored in three different ways: If the first instruction of the region is at the end of a basic block, all children of the corresponding block that belong to the original live range are explored. If the last instruction of the region is at the start of a basic block, all parents of the corresponding block, that belong to the original live range are explored. If the last instruction is not at the start of a basic block, a new region is explored starting from the previous instruction in the basic block. In all three cases the explored regions are added to the candidate list if they can be added to the spawned live range. Otherwise the individual regions are added to the boundary list. After all regions in the candidate list have been added to the spawned live range, the spawned live range may violate live range properties, e.g. may not start or end at a use or definition of the corresponding register. In order to maintain live range properties, any dangling heads and tails are removed from the spawned live range. After the final shape of the spawned live range has been determined, the save and restore flags are calculated as described in Section 3.5.7. Since the shape of the original live range was altered and a new live range is added, the interference graph has to be updated. Note that the modifications are restricted to live ranges that interfere with the original live range. Therefore the interference graph is updated by processing all live ranges that interfere with the original live range. There are four different cases: The live range interferes neither with the original nor with the new live • range. Therefore the interference to the original live range can be removed. Since the number of interferences decreases, the live range may become 118 3. Implementation

unconstrained. In that case, the live range is moved from the list of con- strained live ranges to the list of unconstrained live ranges. The live range still interferes with the original live range, but not with the • spawned live range, i.e. no changes are made to the interference graph. The live range interferes with the spawned live range, but no longer in- • terferes with the original live range. Therefore the interference with the original live range is replaced by the interference with the spawned live range. Note that the number of interferences is unchanged for the inter- fering live range, but decreases for the original live range. However, the original live range is deleted after the spawning process, hence there is no need to check whether the original live range is still constrained. The live range interferes with the original as well as the spawned live • range. The live range already interfered with the original live range, hence it is sufficient to add the interference to the spawned live range. As this increases the number of interferences for the live range, the live range may become constrained. In that case, the live range is moved from the list of unconstrained live ranges to the list of constrained live ranges. After all interferences have been updated and the spawned live range is non-empty, it is appended to the list of constrained or unconstrained live ranges depending on the number of interferences. If the live range is empty, i.e. contains no instructions, it is discarded. Finally, all regions in the boundary list are used to recursively spawn new live ranges. This ensures that all uses/definitions in the original live range are covered by the spawned live ranges. Context Switch Code. Save and restore instructions that save and re- store the allocated registers to and from the corresponding locations in the thread descriptor must be inserted at the boundaries of a live range, respec- tively. Therefore every instruction contains four additional register masks, i.e. RLoadBefore, RLoadAfter, RStoreBefore, RStoreAfter. If the ith bit of the RLoadBefore mask is set, a restore operation for register number i is inserted before the current instruction. If the ith bit of the RLoadAfter mask is set, a restore operation for register number i is inserted after the current instruction. If the ith bit of the RStoreBefore mask is set, a save operation for register number i is inserted before the current instruction. If the ith bit of the RStoreAfter mask is set, a save operation for register number i is inserted after the current instruction. The individual masks are updated during the creation of live ranges by processing all instructions of the live range. Recall that live ranges are either created during initial live range creation or by splitting of live ranges during register allocation. The following rules govern the insertion of save and restore instructions: Save and restore instructions are only inserted at use or definition points • of the corresponding register. 3.5 Assembler converter 119

A restore instruction before a use is required if a path from the use to an • instruction outside the live range exists that does not contain any other use or definitions. A save instruction after a use or definition is required if the use or definition • is at the end of the live range or a path from the use or definition to a restore instruction exists, that is inside the live range. A save instruction before a use is used instead of a save instruction after • a use, if the instruction is a branch at the end of a super block, or the live range overlaps, but does not interfere, with another live range at the current instruction. Note that the RLoadAfter mask is currently unused, since the placement of save and restore instruction is not optimized by moving such instructions outside of loops. However, this optimization may cause the creation of ad- ditional basic blocks, thereby requiring a re-layout of the code, which is not supported in the current implementation. Algorithm. The placement of save and restore instructions can be solved by data-flow analysis for each live range. The placement of the restore instruc- tions is a special case (m = 1) of the distributive data-flow analysis framework for reaching definitions defined in Section 3.5.6. The single bit is interpreted as follows: The bit is set if a path from an instruction outside of the live range to the current instruction exists that contains no use or definition of the corresponding register, cleared otherwise. Note that a live range constitutes a flow graph, although it may have multiple entry points. Instead of operating on basic blocks, the data-flow analysis operates on regions that are bounded by live range or basic block boundaries. These two facts require slight modifications to the iterative data- flow algorithm presented in Section 3.5.6. The mapping of regions to functions from the operation space is done as follows: If the start of the region is on a live range boundary, i.e. one of the predecessor instructions is outside the live range, the g bit is set. If the region contains a use or definition of the corresponding register, the k bit is set and the g bit is cleared if it was set by the previous rule. Note that a restore instruction is only inserted if the incoming bit is set and the first occurrence is a use. As described above, the placement of restore instructions can be trans- formed into an instance of the reaching definition problem with m = 1. The placement of save instructions can be transformed into an instance of the live variable data-flow problem with m = 1 in the same way. The only difference is the interpretation of the data-flow information as well as the mapping of regions to functions from the operation space. The g bit is set if a path from an instruction outside of the live range to the current instruction exists that contains no use or definition of the corresponding register, cleared otherwise. If the end point of a region is on a live range boundary, i.e. one of the successor instructions is outside the 120 3. Implementation live range, the g bit is set. If the region contains a use or definition of the corresponding register, the k bit is set and the g bit is cleared if set by the previous rule. A save instruction is only inserted before or after the last use or definition in the region, if the individual data-flow bit is set. 
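A minimal sketch of such a one-bit iterative analysis over the regions of a live range is shown below; the region layout, the predecessor encoding and all names are assumptions made for illustration, not the converter's actual data structures.

    #include <stdbool.h>

    /* One region of a live range, bounded by live range or basic block boundaries. */
    struct region {
        bool g;              /* a predecessor instruction lies outside the live range */
        bool k;              /* the region contains a use or definition */
        bool in, out;        /* data-flow information before and after the region */
        bool first_is_use;   /* first occurrence in the region is a use */
        bool needs_restore;  /* result: insert a restore before the first use */
        int  n_preds;
        int  preds[8];       /* indices of predecessor regions */
    };

    /* Iterative solver: out = (in AND NOT k) OR g, in = OR over predecessor outs. */
    void place_restores(struct region *r, int n)
    {
        bool changed = true;

        while (changed) {
            changed = false;
            for (int i = 0; i < n; i++) {
                bool in = false;
                for (int p = 0; p < r[i].n_preds; p++)
                    in = in || r[r[i].preds[p]].out;
                bool out = (in && !r[i].k) || r[i].g;
                if (in != r[i].in || out != r[i].out) {
                    r[i].in = in;
                    r[i].out = out;
                    changed = true;
                }
            }
        }
        /* A restore is inserted only if the incoming bit is set and the first
         * occurrence in the region is a use. */
        for (int i = 0; i < n; i++)
            r[i].needs_restore = r[i].in && r[i].first_is_use;
    }

The save-instruction placement is the mirror image: the same one-bit analysis run backwards over the regions, with the interpretation of the g and k bits adapted as described above.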
However, if the last use or definition is inside a call prologue and the corresponding register is an argument register of the call, the save instruction is omitted, since the content of the register is destroyed by the call. The individual masks are recalculated for every new live range, either during initial live range creation or during splitting of live ranges. After the interference graph has been colored, the individual register masks are updated in a final pass over all live ranges. This pass identifies all live ranges that overlap, but do not interfere, at a definition point. Recall from Section 3.5.7 that this situation occurs if one live range ends with a use at that point, while the other starts with a definition. If two such live ranges are encountered, the masks are modified in the following way: If the RStoreAfter mask contains an entry for the register corresponding to the live range that ends with a use, the entry is cleared and set in the RStoreBefore mask instead. This moves the corresponding restore instruction before the current instruction. Since the last access of the associated register is a use, this does not change program semantics. The algorithm used to determine such pairs of live ranges operates in the same way as the algorithm used to construct the interference graph. Optimization. If the rstore optimization is enabled, an optional pass over the procedure flow graph removes any restore instructions that are not needed. A restore instruction can be removed if it is the last restore instruction on a path to an exit node and the corresponding register is not used by the calling procedure. This applies to all registers with the exception of return value and callee-save registers. The following algorithm is used to identify such restore instructions: All exit basic blocks of the procedure, i.e. basic blocks with no children, are • identified and put into a worklist. For all registers that are neither return value nor callee-save registers, the corresponding bit is set in the ExitMask associated with each exit block, cleared otherwise. The algorithm processes basic blocks until the worklist is empty. The first • basic block in the worklist is removed and the individual instructions in the block are processed to update the ExitMask. If the ith bit of the mask is set and the block contains a restore instruction for the corresponding register, the bit is cleared and the restore instruction is removed. If the ith bit of the exit mask is set and the first occurrence is a use, the bit is cleared, but the restore instruction is left in place. If the block contains a call prolog, the corresponding bits in the ExitMask for all argument registers are cleared. Otherwise the restores prior to the call would be removed, causing the subsequent loads in the call prologue to restore outdated register contents. 3.5 Assembler converter 121

After the bit mask of the current block is updated, the information is prop- • agated to the parent blocks in the following way: Each parent is appended to the worklist after all children of the parent have been processed and the logical and of the ExitMasks from all children is non-zero. The ExitMask of the parent is set to the logical bit-vector and of all child masks. The algorithm presented above ensures that all unnecessary save instructions are removed on paths to exit nodes. However, the performance impact of this optimization is usually negligible, hence this optimization is not enabled by default.
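A compact sketch of this backward worklist pass is given below; the block layout and the scan_block() helper, which removes the restores and clears the mask bits within one block, are assumptions made for illustration.

    #include <stdint.h>

    #define MAX_BLOCKS 1024

    struct bblock {
        struct bblock *parents[8], *children[8];
        int            n_parents, n_children;
        uint64_t       exit_mask;   /* registers whose final restore may be removed */
        int            done;        /* number of children already processed */
    };

    /* Assumed helper (not shown): walks the block backwards, removes restore
     * instructions for registers whose bit is set, clears bits at uses and at
     * call prologues, and returns the updated mask. */
    uint64_t scan_block(struct bblock *b, uint64_t mask);

    void rstore_pass(struct bblock **exits, int n_exits, uint64_t removable)
    {
        struct bblock *work[MAX_BLOCKS];
        int head = 0, tail = 0;

        /* Seed the worklist with all exit blocks of the procedure. */
        for (int i = 0; i < n_exits; i++) {
            exits[i]->exit_mask = removable;   /* not return-value, not callee-save */
            work[tail++] = exits[i];
        }

        while (head < tail) {
            struct bblock *b = work[head++];
            b->exit_mask = scan_block(b, b->exit_mask);

            /* A parent is appended once all of its children have been processed
             * and the AND of the children's masks is non-zero. */
            for (int i = 0; i < b->n_parents; i++) {
                struct bblock *p = b->parents[i];
                if (++p->done < p->n_children)
                    continue;
                uint64_t mask = ~(uint64_t)0;
                for (int j = 0; j < p->n_children; j++)
                    mask &= p->children[j]->exit_mask;
                if (mask != 0 && tail < MAX_BLOCKS) {
                    p->exit_mask = mask;
                    work[tail++] = p;
                }
            }
        }
    }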

3.5.8 Code Conversion

The actual code conversion process is quite simple, the main tasks were al- ready performed during register allocation. The conversion is performed in three steps: In the first step, the instruction sequence is updated by insert- ing the required instructions, e.g. save and restore operations to the thread descriptor. In the second step, the instructions are modified based on the informations from the platform-specific configuration file. Last but not least, the modified instructions are printed to the output file. The following para- graphs describe the steps in detail. Updating Instructions. The instructions are updated separately for each super block in several steps: In the first and last step, the level information is cleared for all basic blocks that belong to the super block. This is required, since the basic blocks are updated in depth-first order and the level infor- mation is used to mark those basic blocks that have already been updated. Before the individual basic blocks are updated, all special registers that are used in this super block, e.g. the floating-point control register, have to be restored. Therefore instructions that restore the contents of these special reg- isters from the thread descriptor are inserted at the top of the header block. Afterwards the individual basic blocks are updated by traversing the blocks via depth-first search, starting with the header block. Note that the subgraph formed by the basic blocks of a super block is an abstract flow graph of the control-flow graph, i.e. all basic blocks are reachable from the header block. Upon entry to the basic block, the level information is checked. If the level is non-zero, the block has already been updated and the routine returns immediately. If the level is zero, the level is incremented prior to updating all instructions in the basic block in sequential order. After the instructions have been updated, the children of the current block are processed recursively as long as the basic block does not end the current super block. For each instruction, the necessary save and restore instruction are in- serted before and after the instruction in the following way: The live ranges associated with the instruction are examined and the corresponding register mask is determined. This register mask is compared with the RLoadBefore, 122 3. Implementation

RLoadAfter, RStoreBefore and RStoreAfter fields of the instruction that were set up during register allocation. If a match is found between the live mask and the RStoreBefore, RLoadBefore masks, the corresponding save or re- store instructions are inserted before the current instruction, respectively. If a match is found between the live mask and the RStoreAfter, RLoadAfter masks, the corresponding save or restore instructions are inserted after the current instruction, respectively. If the current instruction ends a super block and one of the special regis- ters is used in the super block, instructions to restore these special registers are inserted. If the current instruction ends a super block and the super block contains instructions that may cause a trap, a trap barrier is inserted. If the current instruction is a branch, these instruction are always inserted before the branch. Otherwise, these instructions are inserted after the current in- struction, along with a return instruction that transfers control to the main loop in the thread execution routine. If the last instruction is a branch, this return instruction is already part of the instruction sequence that replaces the branch during modification of instructions. In the second step, all instructions are modified based on the information in the configuration files. To this end, all super blocks in the procedure are processed sequentially. Furthermore, all basic blocks that belong to the cur- rent super block and all instructions that belong to the current basic block are processed sequentially as well. Modifying Instructions. Each instruction is modified in the following way: First of all, either the original or modified templates from the correspond- ing entry in the configuration file are chosen in order to modify the current instruction. The modified templates are only used for branch instructions at super block boundaries as well as special instructions defined by the assembler converter. After the appropriate set of templates has been chosen, the individual instruction templates are transformed into instructions by substituting all keywords with the actual values from the current instruction. The individ- ual keywords have already been described in Section 3.5.1. Afterwards, the current instruction is replaced by the list of transformed instructions. In the third step, the individual instructions are printed to the output file. All instructions in the procedure are processed in the original sequence in order to maintain program semantics without a re-layout of the basic blocks. The instructions are transformed into text strings via the styles defined in the platform-specific configuration file.
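Keyword substitution on an instruction template could look roughly as follows; the keyword spelling (@OPC@, @DEST@, ...) and the template syntax are invented for this sketch and do not reproduce the actual configuration file format described in Section 3.5.1.

    #include <stdio.h>
    #include <string.h>

    /* Values taken from the current instruction (illustrative). */
    struct inst_values {
        const char *opcode, *dest, *src1, *src2;
    };

    /* Replace each keyword in the template with the corresponding value. */
    void expand_template(const char *tmpl, const struct inst_values *v,
                         char *out, size_t outlen)
    {
        struct { const char *key; const char *val; } subst[] = {
            { "@OPC@",  v->opcode },
            { "@DEST@", v->dest   },
            { "@SRC1@", v->src1   },
            { "@SRC2@", v->src2   },
        };
        size_t pos = 0;

        while (*tmpl && pos + 1 < outlen) {
            int matched = 0;
            for (size_t i = 0; i < sizeof(subst) / sizeof(subst[0]); i++) {
                size_t len = strlen(subst[i].key);
                if (strncmp(tmpl, subst[i].key, len) == 0) {
                    pos += snprintf(out + pos, outlen - pos, "%s", subst[i].val);
                    tmpl += len;
                    matched = 1;
                    break;
                }
            }
            if (!matched)
                out[pos++] = *tmpl++;
        }
        if (pos >= outlen)
            pos = outlen - 1;
        out[pos] = '\0';
    }

    int main(void)
    {
        struct inst_values v = { "addq", "$3", "$1", "$2" };
        char buf[128];

        expand_template("\t@OPC@ @SRC1@, @SRC2@, @DEST@", &v, buf, sizeof(buf));
        puts(buf);   /* prints "addq $1, $2, $3" with a leading tab */
        return 0;
    }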

3.5.9 Statistics The code conversion process associated with emulated multithreading mod- ifies all instructions that belong to internal procedures. In order to evaluate the impact of these modifications, the assembler converter maintains two dif- ferent sets of detailed statistics about the conversion process: The first set 3.5 Assembler converter 123 covers the original instruction sequence, i.e. before the code conversion pro- cess. The second set covers the modified instruction sequence, i.e. after the code conversion process. The statistics for the original instructions are gath- ered after the final shape of basic and super blocks has been determined, i.e. after the processing of external calls. The statistics for the modified instruc- tions are gathered during the code conversion process. Depending on the selected level of statistics, information about both statistics is inserted into the modified assembler code after each basic block, super block, procedure, module, or program. Note that the gathered statistics themselves do not depend on the selected level, only the amount of statistics messages is affected. The assembler converter uses counters to maintain both kinds of statistics on the five different levels mentioned above. Each level provides a separate set of counters to record the individual events on this level. If the current basic block, super block, procedure, or module is finished, the contents of the individual counters are added to the corresponding counters on the next level and cleared afterwards. For example, if the current basic block is finished, the statistics for this block are added to the statistics for the current super block and are cleared afterwards in order to process the next basic block. The assembler converter provides a rich set of events that can be easily ex- tended by adding counters and inserting corresponding calls to update these counters into the assembler converter. The following events are supported: The number of read accesses for each integer and floating-point register. • The number of write accesses for each integer and floating-point register. • The number of instructions in the current basic block, super block, proce- • dure, module, or program. The number of instructions of a given type in the current basic block, super • block, procedure, module, or program. The number of basic blocks in the current super block, procedure, module, • or program. The number of super blocks in the current procedure, module, or program. • The number of procedures in the current module, or program. • The number of modules in the current program. • The assembler converter partitions the individual instructions to one of several groups. The assignment is based on keywords in the type field of the corresponding entries in the platform-specific configuration file. The INT MEM keyword designates the instruction as a memory integer instruct- ion, while the INT CTRL keyword designates the instruction as an integer control instruction. The INT ARITH and INT LOGIC keywords are used for integer arithmetic and logical instructions, respectively. Byte and word instructions are designated by the INT BYTE keyword, while multimedia instructions are designated by the INT MEDIA keyword. The FP MEM, FP CTRL, and FP ARITH keywords are used to designate floating-point 124 3. Implementation memory, control and arithmetic instructions, respectively. Last but not least, the INT MISC keyword is used for miscellaneous instructions.
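The hierarchical counters can be organized as one counter set per level, where finishing a level adds its counters to the next level and clears them. A small sketch with invented event and level names:

    #include <string.h>

    enum { EV_INSTRUCTIONS, EV_BASIC_BLOCKS, EV_SUPER_BLOCKS, EV_COUNT };
    enum { LVL_BBLOCK, LVL_SBLOCK, LVL_PROC, LVL_MODULE, LVL_PROGRAM, LVL_COUNT };

    static unsigned long counters[LVL_COUNT][EV_COUNT];

    /* Record an event on the lowest level, e.g. one converted instruction. */
    void stat_count(int event, unsigned long n)
    {
        counters[LVL_BBLOCK][event] += n;
    }

    /* Finish a level: add its counters to the next level and clear them, e.g.
     * stat_finish(LVL_BBLOCK) when the current basic block is complete. */
    void stat_finish(int level)
    {
        if (level + 1 < LVL_COUNT)
            for (int e = 0; e < EV_COUNT; e++)
                counters[level + 1][e] += counters[level][e];
        memset(counters[level], 0, sizeof(counters[level]));
    }

Adding a new event then amounts to adding an enumerator and inserting the corresponding stat_count() calls at the appropriate places in the converter.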

3.6 Register Partitioning

Partitioning the register set and confining the individual threads to their corresponding partitions allows emulated multithreading to omit all save and restore operations to and from the thread descriptor upon a context switch. Partitioning impacts the performance of emulated multithreading in two ways: On one hand, the omitted save and restore operations will reduce the context switch overhead. On the other hand, each thread has only a lim- ited number of registers available, probably increasing the number of register spills. In addition, the small number of threads that can be used with register partitioning may impact the ability to hide the latency of remote memory accesses in massively parallel processors. Support for partitioning requires changes to the high-level language con- verter, the assembler converter, as well as the emulation library. This section covers only the modifications that have to be made to the assembler con- verter, since the modifications for the high-level language converter and the emulation library have already been described in Sections 3.3 and 3.4, re- spectively. In the case of the assembler converter, only a few modifications are neces- sary: First of all, the individual registers in the register set have to be assigned to different partitions. Note that each partition requires its own version of the emulation registers and the global pointer, i.e. four registers in total. The emuvar optimization is useful in conjunction with register partitioning as it reduces the number of emulation registers to one, i.e. the FramePtr register that contains the address of the current thread descriptor. The assembler converter maintains an internal database that contains all registers, hence it is sufficient to update this database with the corresponding partition assignments. Access to the register database is provided by a set of routines that take the partition number into account for every access to the database. The parser inserts a partition-specific prolog instruction in front of each procedure and therefore has to determine the partition for the current proce- dure. This is accomplished by examining the name of the procedure, i.e. the global label, as the high-level language converter adds a general prefix as well as a partition-specific postfix string to the name of all duplicated procedures. The assembler converter determines the partition for the current proce- dure in the same way before the code conversion process is started. The number of the partition is stored in the procedure data structure, hence all routines in the assembler converter have access to the partition number. Enabling partition support actually simplifies the conversion process, as the individual live ranges must no longer be splitted across super block bound- 3.7 Platform 125 aries. Hence the call to the split graph() routine can be omitted if partitioning is enabled. Recall that the save and restore instructions before and after a context switch are no longer required, hence the live ranges can be extended across super block boundaries. Calls to external procedures present a problem in combination with par- titioning: The call prologue and epilogue have to use the registers as specified by the calling convention and can therefore not be confined to a partition. If one of the threads calls an external procedure, the call prologue, epilogue and the callee itself will destroy the contents of registers in other partitions. 
The partition of the thread that calls the external procedure will not be af- fected, as this situation is already handled by the processing of external calls as described in Section 3.5.5. In order to protect the other partitions, all registers have to be saved, e.g. these partitions have to be saved directly before the call prologue and restored right after the end of the call epilogue. Note that the registers can not be saved to the individual thread descriptors as the position of the other threads at the point of the external call cannot be determined at compile- time. Hence it is unknown which registers are actually used and how the registers were allocated. For these reasons, storage for the whole register set has to be provided, e.g. by using a global thread descriptor. Note that only one external call is executed at any time, hence one copy of the register set is sufficient. The save and restore to this storage area is accomplished by inserting corresponding save and restore instructions before and after the call. A drawback of this approach is the impact on performance due to the large number of save and restore operations before and after the call. Calls to external procedures should therefore be avoided if register partitioning is used.
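The protection of the other partitions around an external call can be sketched as follows: every register outside the calling thread's partition is saved to the global storage area before the call prologue and restored after the call epilogue. The register count, the partition encoding and the emitter helpers are assumptions made for illustration.

    #include <stdint.h>

    #define NUM_REGS 32

    /* Global storage for one full register set; a single copy suffices because
     * only one external call is executed at any time. */
    static uint64_t global_save_area[NUM_REGS];

    /* Assumed emitter helpers (not shown): append a store/load instruction that
     * moves register "reg" to/from the given save slot. */
    void emit_save(int reg, uint64_t *slot);
    void emit_restore(int reg, uint64_t *slot);

    /* Emit the protection code around an external call for a thread that is
     * confined to the registers marked in "own_partition". */
    void protect_partitions(uint64_t own_partition)
    {
        for (int r = 0; r < NUM_REGS; r++)
            if (!(own_partition & ((uint64_t)1 << r)))
                emit_save(r, &global_save_area[r]);      /* before the call prologue */

        /* ... call prologue, call, call epilogue ... */

        for (int r = 0; r < NUM_REGS; r++)
            if (!(own_partition & ((uint64_t)1 << r)))
                emit_restore(r, &global_save_area[r]);   /* after the call epilogue */
    }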

3.7 Platform

The implementation of emulated multithreading, i.e. the architecture of the high-level language converter, the assembler converter, and the emulation li- brary is described in the previous sections. So far, this description did not cover platform-specific issues. While most parts of the implementation applies to all platforms, there are some platform-specific issues. The following para- graphs describe these issues for the high-level language converter, assembler converter, and emulation library, respectively. The high-level language converter operates on the high-level language source code and is therefore largely platform independent. However, the project-specific configuration files that are created by the converter are platform-specific: Recall from Section 3.5.1 that each entry in the config- uration file specifies the required, optional, arguments, and return registers for a specific internal or external procedure. These registers are gathered by 126 3. Implementation translating the declaration of the corresponding procedure according to the calling conventions. As these calling conventions depend on the processor architecture and the operating system, i.e. are platform-specific, the trans- lation process has to be adapted to the different calling conventions. This is accomplished by isolating the translation routine in a platform-specific source file and defining the names of the individual registers in a platform-specific header file. There are several platform-specific issues in the assembler converter: The lexer and parser used to process the assembler source are platform-specific as the syntax of the assembler source depends on the processor architecture and the operating system, i.e. the assembler. The lexer must be able to recog- nize the individual tokens, especially identifiers, instruction, directives, and registers. As the namespace of these tokens overlaps partially, the lexer has to maintain a list of all instructions, directives, and register names in order to distinguish identifiers from the other tokens. The lexer definition and a corresponding header file are therefore platform-specific. Recall that the assembler converter maintains an internal register database that contains the names and characteristics of all registers. As the informa- tion in this database depends on the calling convention, i.e. the processor architecture and the operating system, this database is platform-specific as well. The use of this database allows the lexer and parsers for the platform- and project-specific configuration files to be general, as all register names are translated via the register database. Recall that the assembler converter uses two different passes over the assembler source. As the structure of the assembler source depends on the processor architecture and the operating system, the individual lexer and parser definitions are platform-specific. Apart from these platform-specific issues, the second pass is identical on all platforms, i.e. performs the same set of actions. However, the first pass performs platform-specific actions: The assembler converter for the Alpha processor architecture under the Tru64 operating system uses the first pass to detect internal procedures, i.e. the corresponding global labels. If such a global label is encountered, the first local label after the end of the procedure prolog is stored in the list of internal procedure names. 
This information is used to handle an optimization of the compiler used in the Tru64 operating system: If a procedure does not use the stack, i.e. a procedure prolog is not necessary, the compiler uses the first local label after the end of the procedure prolog instead of the procedure name in all calls to this procedure. Note that the procedure prolog is created in all cases. As the local label is usually designated by a four-digit string, the assembler converter has to associate these local labels with the corresponding global label, i.e. the name of the procedure, in order to determine the proper entry point to the procedure. Note that the code conversion process must always begin at the procedure entry point, otherwise the save and restore 3.7 Platform 127 instructions may be placed incorrectly, especially if super block optimization is used. Apart from detecting internal procedures, all procedure calls are detected and the called label is marked accordingly. After parsing the whole source file, all pairs of local and global labels are instructed and those labels that were called at least once are used during the second pass to detect internal procedures. Note that either one of the two labels in a pair will be used, but not both. Another platform-specific issue is the use of relocation operands: The compiler used in the Tru64 operating system specifies the relocation type explicitly, these relocations are translated into the corresponding offsets by the assembler. For example, the following code sequence is used to initialize the global pointer, i.e. the pointer to the global segment that contains all global variables. ldah $gp, ($27)!gpdisp!1 lda $gp, ($gp)!gpdisp!1 Note that the type of the relocation, i.e. global displacement (gpdisp), and a unique number is appended to the instruction, separated by exclamation marks. The unique number is used to link those instructions that belong to the same relocation operation, and is ordered by the increasing line num- bers of the assembler source. The relocation assumes that the base register contains the address of the instruction itself, otherwise the offset will be mis- calculated. Similar instruction sequences are used to load global variables and constants. There are several issues with relocation operands: First of all, the instruct- ion sequences that replace branch instructions at the end of super blocks use relocation sequences to load the address of the target instruction into a regis- ter. The insertion of additional relocation sequences destroys the ordering of the original relocation sequences, i.e. the corresponding relocation numbers must be updated. This is accomplished by updating the relocation numbers during instruction update: Recall that instructions are updated in the same order as they appear in the original source file, i.e. it is sufficient to maintain the number of additional relocation sequences inserted so far and update all other relocation sequences accordingly. All instructions that belong to the same relocation sequence are linked and can therefore be updated at the same time. There is an additional restriction regarding relocation sequences: The cor- responding offsets calculated by the assembler are only correct if the base register used in the relocation sequence is identical to the destination regis- ter of the indirect branch that produces the return address. 
Otherwise the calculated offset is off by 4 bytes times the number of instructions between the indirect branch and the start of the relocation sequence. Unfortunately, the conversion process reallocates all registers and introduces new instruc- tions, e.g. save and restore instructions, hence it is likely that the assembler 128 3. Implementation will not be able to calculate the offsets correctly. Therefore the offsets used in the relocation sequences are updated by counting the number of instructions between the indirect branch and the start of the relocation sequence and multiplying the result by four. This process is performed after all instruction sequences have been updated, i.e. after all save and restore instructions have been inserted into the assembler source. Another issue with relocation sequences is the use of these sequences to restore the global pointer after every procedure call, as the global pointer resides in a register that is not callee-save. Normally the reload operation is performed by a single lda/ldah instruction pair right after the call instruct- ion. However, instruction scheduling may rearrange the instructions such that there are one or more instructions between the call itself and the start of the relocation sequence as well as between the two instructions in the sequence. Note that these instructions can even be moved across basic block boundaries. As all threads share the global pointer, all updates to the corresponding reg- ister must be atomic, i.e. there must not be a super block boundary between the two instructions of the relocation sequence. This problem is handled by moving the second instruction right after the first instruction in these cases. Another problem is the base register of the first instruction in the relo- cation sequence: As mentioned above, this register must be identical with the destination register of the call instruction, otherwise the corresponding offsets will be miscalculated by the assembler. This problem is handled by using a call epilogue that extends from the actual call instruction to the first instruction of the relocation sequence and adding the corresponding register to the call mask. Recall that all registers in the callmask are allocated to themself inside the call prologue and epilogue. For similar reasons, the regis- ter that contains the address of the callee is added to the call mask as well, as the calling convention of the Tru64 operating system requires that register r25 is used for this purpose. The emulation library is completely platform-specific, otherwise most of the routines perform similar actions: The thread attribute routines are iden- tical across all platforms, while the thread argument and thread creation routines depend on the calling conventions of the corresponding operating system. The thread execution routine is written in assembler, hence de- pends on the syntax of the corresponding assembler, otherwise the thread execution routine is identical across all platforms. The same applies to the EMUthread cswap(), EMUthread self(), and EMUthread barrier() routines, although the barrier routine for the Cray T3E calls the system-wide barrier once all threads on the corresponding processor have entered the barrier. The E-register routines are implemented in very different ways: On the Cray T3E, these routines access the E-register hardware directly in order to implement the split transaction communication routines. The individual routines are covered in large detail in Appendix B. 
On the Compaq XP1000, these routines are replaced by macros that load and store values into a static 3.8 Compiler Integration 129 array that represents the E-registers. For example, the EMUereg int get() instruction loads the content of the specified address and stores it in the entry of the array, while the routine returns the entry of the array. Note that this approach only works on single-processor systems or multi-processors systems using symmetric multiprocessing, i.e. shared memory.

3.8 Compiler Integration

The high-level language and assembler converters described above use various techniques that are similar to the ones used in compilers, e.g. control and data-flow analysis, register allocation. The integration of these converters into a compiler is therefore plausible. Integrating the converters into a compiler would have several benefits: The compiler provides access to information that has to be recovered from the high-level language or assembler source at present. Most tasks of the high-level language converter are already performed by the compiler fron- tend, like constructing the control-flow graph and identifying the type and location of procedure calls. The high-level language converter is therefore re- duced to a single pass over the call graph that creates the module-specific configuration file. The assembler converter can use additional information, e.g. profiling, for new optimizations, e.g. during creation of super blocks. In addition, the limitations of the assembler converter with respect to variable argument procedures and procedure values could be resolved. The code conversion process could be improved, as the interaction between different compiler phases usually has a positive impact on the quality of the generated code, e.g. for register allocation and instruction scheduling [BEH91]. In a similar way, the other phases of the compiler could be steered to produce code that is beneficial to emulated multithreading. In addition, new or improved optimizations are possible as well, e.g. optimizations that require a re-layout of the code. Integration also provides access to the existing compiler infrastructure, es- pecially optimization phases. Some of these optimizations could be repeated on the converted code, e.g. instruction scheduling and code layout. In the cur- rent stand-alone implementations, these tasks are left to the assembler. The existing infrastructure would also ease the implementation of the converter, since common components have no longer to be built from scratch. The converters can be more easily ported to new languages and platforms: Once the converters are integrated into the compiler, all high-level languages already supported by the compiler are supported by emulated multithreading as well. In order to support a new high-level language, adding a new com- piler frontend, which is independent of the converters, is sufficient. Similar, emulated multithreading should be ported easily to all platforms that the compiler supports, since the converters operate on internal representations instead of platform-specific assembler code. 130 3. Implementation

Based on the benefits described above, compiler integration would be an important step in the evolution of emulated multithreading. To make such an integration feasible, a compiler system must be found that is modular by design and available in source code. The SUIF2 compiler system [Lam99] satisfies these conditions and is therefore a good candidate for integration with emulated multithreading. The following paragraphs describe the SUIF2 compiler system in detail.

The SUIF2 system is a compiler infrastructure that was developed to support research and development of compilation techniques as part of the National Compiler Infrastructure (NCI) project. The SUIF acronym stands for Stanford University Intermediate Format, the name of the internal representation that is used by the compiler infrastructure. The current implementation of the SUIF system is called SUIF2 [Lam99], a complete redesign of the earlier SUIF1 [WFW+94] implementation.

The SUIF2 compiler architecture contains the following components: intermediate format, kernel, modules, and the compiler driver. The intermediate format is an extensible program representation that is used throughout the compiler to transfer information between different phases of the compiler. The kernel defines and implements the compiler environment, i.e. the program representation, and all modules. Modules constitute the core of the SUIF system. A module can either be a set of nodes in the intermediate representation or a program analysis pass that operates on the intermediate representation of a program. The compiler is steered by the compiler driver that creates the compiler environment, imports all required modules, loads the internal representation of a program, applies a series of transformations, and creates a new representation. The initial representation is created by one of the compiler frontends, the last representation is used by one of the backends to create actual code. The SUIF2 compiler system currently supports frontends for the C and Fortran languages as well as backends for the Alpha and IA32 architectures. The backends are provided by the machineSUIF project described in the next paragraph.

Similar to SUIF, machineSUIF was developed to support research and development of compiler backends. The machineSUIF system is based on SUIF and contains the following components: virtual machine, programming interface, as well as several libraries. The SUIFvm virtual machine is the internal machine-independent representation used by the backends to transfer information between phases. In addition, each backend defines a machine-dependent representation. The optimization programming interface is a machine-independent interface to certain optimization passes that operate either on the SUIFvm representation or on a machine-dependent representation. Machine-independent libraries support manipulation of control-flow graphs, control-flow and data-flow analysis, while machine-dependent libraries add support for specific targets and implement the optimization programming interface.

The typical backend flow consists of several steps: First of all, the intermediate representation in the SUIF format is lowered to the SUIFvm representation by generating code for this virtual machine. Several transformations and optimizations are performed on the intermediate representation at this level. The representation is subsequently realized, i.e. transformed into code for the actual target platform.
Again, several optimization phases are performed on this level. Note that the two different intermediate representations are only dialects of the same intermediate representation, i.e. the same optimization passes can be used in both cases. After all optimizations have been performed at this level, the code is generated and either printed as an assembler source file for use by an external assembler or directly assembled into an object file.

The high-level language and assembler converters can be integrated into the SUIF system by reimplementing the high-level language converter as a SUIF optimization pass and the assembler converter as a machineSUIF optimization pass. Although integration into the SUIF compiler structure seems feasible, it was not an option for the current implementation of the converters: The SUIF system was not mature enough at the time the fundamental design of the converters was chosen. For example, the optimization programming interface was totally undocumented, hence integration into the SUIF system seemed too great a risk. However, the next implementation of emulated multithreading should be integrated into the SUIF system in order to realize the benefits mentioned at the beginning of this section.

4. Benchmarks

In order to investigate the characteristics of emulated multithreading, benchmarks are used in the evaluation of emulated multithreading. To make this evaluation as useful as possible, these benchmarks should satisfy the following conditions:

• The benchmarks should cover a wide range of computational problems and program characteristics in order to provide a solid base for the evaluation.
• The benchmarks should be widely used and should have well-known characteristics. This facilitates comparisons with other approaches and identifying the impact of emulated multithreading on performance.
• Since emulated multithreading was designed to hide latency in massively parallel processors, the benchmarks should be parallel.
• The source code of the benchmarks should be available and be easily portable to all platforms used in the evaluation.
• The benchmarks should have sufficiently small time and space requirements in order to make a thorough evaluation feasible, since the computing time is provided by grants and is therefore limited.

The following sections describe several parallel benchmark suites and determine whether these benchmark suites are suitable with respect to the evaluation of emulated multithreading according to the conditions outlined above. After choosing one of the benchmark suites, the individual benchmarks used during the evaluation of emulated multithreading are described in detail.

4.1 Benchmark Suites

This section describes several popular benchmark suites: The LINPACK benchmark is described in Section 4.1.1, while Section 4.1.2 covers the LFK benchmark suite. Sections 4.1.3 and 4.1.4 cover the ParkBench and NPB benchmark suites, respectively. Section 4.1.5 covers the Perfect Club benchmark suite, while Section 4.1.6 covers the SPLASH2 benchmark suite. Apart from a description of the individual benchmark suites, the suitability of each benchmark suite with respect to the evaluation of emulated multithreading is determined.

4.1.1 LINPACK

The LINPACK [Don90] package is used for solving dense systems of linear equations and consists of a collection of subroutines written in Fortran. The equation systems are solved by decomposition of the corresponding matrix into a product of well-structured matrices. These well-structured matrices are easily solved and the results are combined to solve the original equation system. A popular example of such a decomposition is the LU decomposition, which decomposes the original matrix A into a unit lower-triangular matrix L as well as an upper-triangular matrix U. The equation systems represented by the L and U matrices are easily solved using forward and backward substitution, respectively.

The LINPACK benchmark uses the BLAS (Basic Linear Algebra Subroutines) package. This package contains three different sets of routines named BLAS levels one to three: BLAS level one routines support simple vector-vector operations, while BLAS level two routines support matrix-vector operations and BLAS level three routines support matrix-matrix operations. The LINPACK benchmark suite consists of three different problems:

• The first problem is to solve a dense system of linear equations represented by a 100 × 100 matrix. The algorithm uses LU decomposition as well as forward and backward substitution and is based on level one subroutines from the BLAS package.
• The second problem is to solve a dense system of linear equations represented by a 300 × 300 matrix. Again, LU decomposition and forward and backward substitution is used, but the algorithm is based on level two subroutines from the BLAS package.
• The third problem is to solve a dense system of linear equations represented by a 1000 × 1000 matrix. The algorithm is based on the level three subroutines from the BLAS package, i.e. the specific algorithm used to solve the equation system depends entirely on the BLAS implementation.

The LINPACK benchmark is quite popular, hence benchmark results for almost every computer system are available. Therefore the LINPACK benchmark is often used to track the evolution of computer performance. However, the LINPACK benchmark is restricted to solving dense systems of linear equations, which makes predictions about performance in other areas difficult at best.
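To make the distinction between the three BLAS levels concrete, the sketch below shows one representative operation from each level in plain C: a scaled vector update (level one), a matrix-vector product (level two), and a matrix-matrix product (level three). The routine names follow BLAS conventions, but the code is only an illustration, not the actual Fortran BLAS.

    /* BLAS level 1: vector-vector, y = alpha*x + y (axpy) */
    void axpy(int n, double alpha, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* BLAS level 2: matrix-vector, y = A*x for an n x n matrix in row-major order */
    void gemv(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }

    /* BLAS level 3: matrix-matrix, C = A*B for n x n matrices in row-major order */
    void gemm(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }

The higher the BLAS level, the more computation is performed per memory access, which is why the 1000 × 1000 problem leaves the choice of algorithm to the BLAS implementation.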

4.1.2 LFK

The Livermore Fortran Kernels (LFK) [McM88] consist of 24 different computation kernels that were taken from scientific applications. Kernels are small but important pieces of code. Note that the kernels are sometimes called the Lawrence Livermore Loops (LLL). All kernels are written in Fortran [MR96] and distributed in the form of a single source file that contains the 24 kernels as well as benchmark execution and timing support. The individual kernels range from a few lines of code to a few dozen lines of code in size and perform the following operations:

Loop 1 is a fragment from a hydrodynamic code.
Loop 2 is a fragment from a Cholesky-Conjugate gradient code.
Loop 3 computes the inner product of two vectors.
Loop 4 computes banded linear equations.
Loop 5 is a fragment from a tridiagonal elimination routine.
Loop 6 is a general linear recurrence equation.
Loop 7 computes a state equation.
Loop 8 is a fragment from an implicit integration code.
Loop 9 is a fragment from an integrate predictor code.
Loop 10 is a fragment from a difference predictor code.
Loop 11 computes the first sum of a vector.
Loop 12 computes the first difference of a vector.
Loop 13 is a fragment from a two-dimensional particle code.
Loop 14 is a fragment from a one-dimensional particle code.
Loop 15 performs various matrix computations.
Loop 16 is a fragment from a Monte Carlo code.
Loop 17 performs various vector computations.
Loop 18 is a fragment from a two-dimensional explicit hydrodynamic code.
Loop 19 computes a general linear recurrence equation.
Loop 20 is a fragment from a discrete ordinates transport program.
Loop 21 computes a complex vector equation.
Loop 22 is a fragment from a Planckian distribution code.
Loop 23 is a fragment from a two-dimensional implicit hydrodynamics code.
Loop 24 computes the minimum element in a vector.

The parallel complexity of the individual kernels was analyzed by Feo [Feo88]. The Livermore Fortran Kernels themselves do not use explicit parallelization directives, i.e. parallelization and vectorization is the responsibility of the compiler. Due to the small data sizes, the individual kernels do not scale well for larger numbers of processors, i.e. when executed on a massively parallel processor. Another problem with the Livermore Fortran Kernels is the regular structure, e.g. loops of the kernels, which favors techniques like prefetching.
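As an illustration of the typical size and structure of these kernels, the sketch below renders loop 3, the inner product, in C. The original benchmark is written in Fortran, so this is only a schematic transcription.

    /* LFK loop 3 (inner product of two vectors), rendered schematically in C. */
    double inner_product(int n, const double *x, const double *z) {
        double q = 0.0;
        for (int k = 0; k < n; k++)
            q += z[k] * x[k];
        return q;
    }

The regular, loop-level structure visible here is exactly what favors prefetching and vectorization, and it is the reason the kernels expose little thread-level parallelism.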

4.1.3 ParkBench

The ParkBench [HB94][DH95] (Parallel Kernels and Benchmarks) benchmark suite was developed for distributed memory machines using message-passing. The benchmark suite consists of several different groups of benchmarks: low-level benchmarks, kernel benchmarks, compact applications, and compiler benchmarks. Since there are no benchmarks in the compact applications group and the compiler benchmarks are not targeted towards performance evaluation, the following paragraphs describe only the benchmarks in the first two groups. All benchmarks are written in the Fortran 77 language and use the Parallel Virtual Machine (PVM) library [GBD+94] for communication between processors. Note that PVM has been replaced by the Message Passing Interface (MPI) [SOHL+98][GHLL+98], at least for commercial applications.

Low-Level Benchmarks. The group of low-level benchmarks contains single- and multi-processor benchmarks. The set of single-processor benchmarks contains the following benchmarks:

TICK1 measures the resolution of the clock that is used for timing measurements.
TICK2 determines whether the clock that is used for timing measurements measures wall-clock time.
RINF1 measures the execution time of several loops across different vector lengths and computes the corresponding performance in MFLOPS for the largest vector length, as well as the vector length required to achieve 50% of the maximum performance.
POLY1 measures the memory bottleneck between the processor registers and the cache by repeating 1 000 polynomial evaluations across various vector lengths.
POLY2 measures the memory bottleneck between the processor registers and main memory by performing one polynomial evaluation across several vector lengths. As the caches are flushed before each evaluation, the benchmark will operate out of main memory.

The set of multi-processor benchmarks covers communication performance and contains the following benchmarks:

COMMS1 measures the time to send a message of size n between processors.
COMMS2 is similar to the COMMS1 benchmark, but this time both processors send ping-pong messages to each other simultaneously.
COMMS3 is a generalized version of the COMMS2 benchmark, where all available processors participate in sending and receiving messages: Each processor sends messages to all other processors and subsequently receives messages from all other processors.
POLY3 is similar to the POLY1 benchmark, but stores the data that is used during the polynomial evaluations on another processor. The POLY3 benchmark therefore measures the memory bottleneck between the processor registers and remote memory.
SYNCH1 consists of a single barrier synchronization across all available processors and measures the efficiency of these barriers.

The TICK1, TICK2, RINF1, COMMS1, COMMS2, and SYNCH1 benchmarks were taken from the Genesis benchmark suite [Hey91], while the POLY1 and POLY2 benchmarks were taken from the EuroBench [vdSdR93] and Hockney [Hoc93] benchmark suites, respectively. The COMMS3 and POLY3 benchmarks were written for the ParkBench benchmark suite.

Kernel Benchmarks. The group of kernel benchmarks consists of four different sets of benchmarks: matrix kernels, Fourier transformations, partial differential equation solvers, and miscellaneous. The set of matrix kernels consists of the following benchmarks:

MM multiplies dense block-partitioned matrices.
MT transposes a dense block-partitioned matrix.
LU decomposes a dense matrix into a unit lower-triangular as well as an upper-triangular matrix using LU decomposition.
QR decomposes a matrix into unitary and upper-triangular matrices.
BT computes a tridiagonalization of a block-partitioned matrix.

The set of Fourier kernels contains the following benchmarks:

FFT1D performs a one-dimensional Fast Fourier Transformation (FFT) by computing the forward FFT on two vectors, multiplying the result and computing an inverse FFT on the result.
FFT3D performs a three-dimensional Fast Fourier Transformation by computing the forward FFT of a three-dimensional matrix, multiplying the result several times by exponential factors and computing an inverse FFT on the result.

The set of partial differential equation kernels contains the following benchmarks:

SOR uses successive over-relaxation (SOR) to solve the Poisson equation on a three-dimensional grid by parallel red-black relaxation with Chebyshev acceleration.
MG uses a multigrid algorithm to determine an approximation to the discrete Poisson problem on a three-dimensional grid with boundary conditions.

The set of miscellaneous kernels contains the following benchmarks:

EP generates pseudo-random floating-point values and counts the number of Gaussian deviates inside various squares around the origin.
CG uses the inverse power method to estimate the largest Eigenvalue of a symmetric, positive-definite sparse matrix.
IS sorts a large array of integers. The particular algorithm is not specified, as the use of vendor-supplied sort routines is allowed. However, the initial distribution of the integer array across the processors is specified.
IO measures I/O related parameters like startup time, bandwidth, and latency. The benchmark is provided in a paper-and-pencil fashion, hence the implementation is left to the benchmarker.

The FFT3D, MG, EP, CG, and IS benchmarks are taken from the NAS parallel benchmark suite [BBLS91][BHS+95] described in Section 4.1.4, while the SOR benchmark is taken from the Genesis benchmark suite [Hey91]. The matrix kernels, FFT1D and I/O benchmarks were written for the ParkBench benchmark suite.

The ParkBench benchmark suite covers a wide range of scientific applications. However, the low-level benchmarks are probably too simple to provide useful insights. The kernel benchmarks are better suited in this regard, but most of these benchmarks are taken from the NAS parallel benchmark suite, hence that benchmark suite could be used instead. Another problem with the ParkBench benchmark suite is the use of message-passing primitives for communication, as the emulation library currently supports only shared memory primitives. Therefore porting the benchmarks to the Cray T3E will require a significant amount of work.

4.1.4 NPB

The NAS Parallel Benchmark Suite (NPB) consists of five kernels and three compact applications from the field of computational fluid dynamics. The benchmarks were initially released in paper-and-pencil fashion, i.e. by providing a detailed description of the problems that omitted any implementation-specific details [BBLS91]. A later release of the NPB included implementations for five out of the original eight benchmarks [BHS+95]. These benchmarks are written in the Fortran 77 language and use the MPI standard for communications between processors. Therefore the NPB benchmark suite is highly portable. The suite consists of the following benchmarks:

EP generates pseudo-random floating-point values and counts the number of Gaussian deviates inside various squares around the origin.
CG uses the inverse power method to estimate the largest Eigenvalue of a symmetric, positive-definite sparse matrix.
IS sorts a large array of integers. The particular algorithm is not specified, as the use of vendor-supplied sort routines is allowed. However, the initial distribution of the integer array across the processors is specified.
FT computes a three-dimensional Fast Fourier transformation using a distribution of data around the z-dimension. The benchmark performs a forward three-dimensional FFT as multiple one-dimensional FFTs in the x- and y-dimensions, transposes the corresponding matrix and performs multiple one-dimensional FFTs in the z-dimension.
MG uses a multigrid algorithm to determine an approximation to the discrete Poisson problem on a three-dimensional grid with boundary conditions.
LU decomposes a dense matrix into a unit lower-triangular as well as an upper-triangular matrix using partial pivoting.
SP solves three sets of scalar pentadiagonal systems of equations.
BT solves three sets of block tridiagonal systems of equations.

The NAS parallel benchmark suite uses three different (A, B, C) input sizes for each individual benchmark. However, the last two input sizes are too large to make a thorough evaluation feasible with the limited amount of available computing time. Another problem is the use of message-passing primitives for communication between processors as the emulation library currently supports only shared-memory primitives.

4.1.5 Perfect Club

The Perfect Club benchmark suite is a collection of 13 applications written in the Fortran language. The individual applications originate from various areas of scientific applications:

ADM simulates pollutant concentration and deposition patterns in lakeshore environments by solving systems of hydrodynamic equations.
ARC3D analyzes three-dimensional fluid flow problems by solving Euler and Navier-Stokes equations.
BDNA simulates the molecular dynamics of biomolecules in water using the Biomol package.
DYFESM analyzes symmetric anisotropic structures using a dynamic two-dimensional finite-element method.
FLO52Q analyzes the transonic inviscid flow past an airfoil by solving the unsteady Euler equations.
MDG simulates the molecular dynamics of 343 water molecules in the liquid state at room temperature.
MG3D is used to investigate the geological structure of the earth using a seismic migration code.
OCEAN simulates large-scale ocean movements based on eddy and boundary currents by solving the dynamical equations of a two-dimensional Boussinesq fluid layer.
QCD simulates long-range effects in Quantum Chromodynamics theory.
SPEC77 simulates atmospheric flow using a global spectral model based on solving partial differential equations.
SPICE simulates the behavior of integrated circuits using non-linear direct-current, non-linear transient and linear alternating-current analysis.
TRACK computes the position, velocity, and acceleration of a set of targets by observing the targets at regular intervals.
TRFD simulates two-electron integral transformation based on a series of matrix multiplications.

The individual benchmarks in the Perfect Club benchmark suite cover a wide range of scientific applications. However, the size of the applications, i.e. 50 000 lines of code, makes the porting of the applications infeasible.

4.1.6 SPLASH2

The SPLASH (Stanford ParalleL Applications for SHared memory) parallel benchmark suite [SGL92] was designed to provide a suite of benchmarks for cache-coherent shared-memory multiprocessors that increases the consistency and comparability between different studies. Several limitations of this benchmark suite, namely the limited coverage of algorithms, the limited scaling for larger processor counts, as well as the limited support for caches, led to the release of the SPLASH-2 benchmarks [WOT+95]. The second release addresses these problems and has gained wide-spread acceptance in computer architecture research due to the reasonably small problem sizes that facilitate software simulation of new architectural concepts.

The SPLASH-2 benchmark suite was written for cache-coherent shared-memory systems and uses the PARMACS macro package [BBD+87] for parallelization. The suite consists of four kernels and eight complete applications written in the C [DM91] language:

radix sorts a large set of integers using an iterative algorithm.
fft performs a Fast Fourier transformation using a six-step algorithm.
lu performs a LU decomposition on a dense matrix.
cholesky performs a Cholesky factorization of a sparse matrix.
barnes simulates an n-particle system using the Barnes-Hut method.
fmm simulates an n-particle system using the fast-multipole method.
ocean simulates large-scale ocean movements based on eddy and boundary currents.
radiosity calculates illumination in a three-dimensional scene using an iterative hierarchical algorithm.
volrend renders the volumes of a three-dimensional scene.
water1 calculates molecule dynamics within a system of water molecules using an O(n²) algorithm.
water2 calculates molecule dynamics within a system of water molecules using an O(n) algorithm.

The algorithms used in the radix, fft, and lu benchmarks are described in [BLM+91], [Bai90], and [WSH94], respectively. The algorithms used in the cholesky, barnes, and fmm benchmarks are described in [RG94], [SHT+95], and [SHHG93], respectively. The ocean, radiosity, and volrend benchmarks are described in [SH92], [SHT+95], and [NL92], respectively. Finally, the algorithms used in the water benchmarks are described in [WOT+95].

4.1.7 Summary

Recall that the benchmark suite used during the evaluation of emulated multithreading should satisfy the five conditions mentioned above. The first condition, i.e. wide coverage of application areas, is satisfied by all benchmark suites except LINPACK and LFK. The LINPACK benchmark covers only one application area, i.e. solving dense systems of linear equations. The LFK benchmark suite consists of small kernels from many areas of scientific computing, but is limited to regular structures, i.e. loop-level parallelism. Note that all benchmark suites are restricted to scientific applications. The second condition, i.e. widespread use, is more or less satisfied by all benchmark suites: The LINPACK benchmark is widely used and benchmark results for almost every machine are available. The NPB benchmarks are quite popular for the evaluation of large-scale massively parallel processors. The SPLASH2 benchmarks are widely used in computer architecture research. The third condition, i.e. efficient parallelization, is satisfied by all benchmark suites except LFK and Perfect Club, which are written in Fortran and contain no parallelization directives at all.

The fourth condition, i.e. availability and ease of portability, is only satisfied by the LINPACK, LFK and SPLASH2 benchmark suites. The ParkBench and NPB suites use message-passing communication libraries, e.g. MPI, PVM. This complicates the porting of these benchmarks, as the emulation library currently supports only shared memory primitives. The Perfect Club suite is too large, i.e. 50 000 lines of code, to make porting of the applications feasible. The fifth condition, i.e. time and space requirements, is satisfied by all benchmark suites except the NPB and Perfect Club suites. These suites define input sizes that are too large to make a thorough evaluation with the limited amount of computing time feasible.

Taken together, the only benchmark suite that more or less satisfies all of these conditions is the SPLASH2 suite. This suite contains kernels and applications from a wide range of scientific applications. Although the benchmarks use the obsolete PARMACS macros for parallelization, the individual benchmarks employ a shared memory model, which makes porting to the shmem and emulation libraries feasible. In addition, most of the benchmarks contain hints for useful distribution of data on distributed memory machines like the Cray T3E. The wide-spread use of these benchmarks in the literature makes comparisons with other approaches feasible. For these reasons, a subset of the SPLASH2 benchmarks is used in the evaluation of emulated multithreading: three kernels and three applications have been ported to the Cray T3E.

4.2 SPLASH2 Benchmark Suite

The evaluation of emulated multithreading is based on three kernels as well as three compact applications taken from the SPLASH2 benchmark suite: The fft, lu, and radix kernels described in Sections 4.2.1, 4.2.2, and 4.2.3, respectively, were selected due to their small code size, ease of portability and the existence of data distribution hints. The ocean application described in Section 4.2.4 was selected due to the regular structure, ease of portability and the existence of data distribution hints. The barnes and fmm applications described in Sections 4.2.5 and 4.2.6, respectively, were selected as examples for irregular applications, although considerable effort is required to port these two applications.

All kernels and applications use the PARMACS macros [BBD+87] for parallelization. Unfortunately, this macro package is obsolete and parallel implementations are not available on the Cray T3E. A sequential implementation of the PARMACS macros is available that allows the execution of such programs on single-processor systems or a single node of a multiprocessor system. However, the benchmarks had to be rewritten in order to replace the PARMACS macros by corresponding routines from the shmem and emulation libraries. In addition, all implicit accesses to remote memory had to be identified and replaced by explicit accesses using the communication routines from the shmem and emulation libraries.

Another difference between the PARMACS macros and the shmem and emulation libraries is the programming model: Using the PARMACS macros, only one processor is running at startup time, the other processors have to be activated explicitly. Using the shmem or emulation libraries, all processors are running at startup time, hence some program sections, e.g. I/O, have to be protected to ensure that only one of the processors executes these sections. The PARMACS macros used for timing measurements were replaced with calls to the Performance Counter Library (PCL) [BM98] that provides access to the hardware performance counters. These calls use the PCL_CYCLES event, i.e. the cycle counter, to measure wall-clock time. Note that this event is not available on the Cray T3E, hence a per-process cycle counter is used. This represents no problem, as all benchmark experiments were executed in batch mode, i.e. had exclusive access to the processors they were running on.

In order to use emulated multithreading, some modifications have to be made to the original sources. These modifications concern initialization and inter-thread data distribution and are protected by conditional compilation statements, i.e. benchmarks using the shmem and emulation libraries are derived from the same source code.
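To illustrate the programming-model difference, the sketch below shows the kind of guard that has to surround one-processor-only sections when all processors are running at startup. The surrounding routine and file name are hypothetical; shmem_my_pe() and shmem_barrier_all() are standard SHMEM calls, although the exact names may differ slightly on the Cray T3E (e.g. _my_pe()).

    #include <stdio.h>
    #include <mpp/shmem.h>          /* SHMEM header, assumed location on the T3E */

    void read_input(const char *filename)
    {
        /* With shmem, all processors are running at startup, so sections
           such as I/O must be restricted to a single processor. */
        if (shmem_my_pe() == 0) {
            FILE *f = fopen(filename, "r");
            /* ... read the problem parameters and store them in shared data ... */
            fclose(f);
        }

        /* Make sure no processor proceeds before the input has been read. */
        shmem_barrier_all();
    }

Under the PARMACS model the same code needs no guard, because only the first processor is running until the others are created explicitly.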

4.2.1 The FFT Kernel

The FFT (Fast Fourier Transform) is an efficient algorithm for computing the DFT (Discrete Fourier Transform) that is widely used in numerous applications. The next section describes discrete Fourier transformations and largely follows the presentation by Rockmore [Roc00]. The following sections describe the six-step algorithm used by the fft kernel, the specific implementation, as well as the porting details, respectively.

Discrete Fourier Transform. Given an n-element vector X of complex numbers $X = (x_0, \dots, x_{n-1})$, the discrete Fourier transform computes another n-element vector $Y = (y_0, \dots, y_{n-1})$, where

\[
  y_k = \sum_{j=0}^{n-1} x_j\, w_n^{jk}
\]

The $w_n = e^{2\pi i/n}$ are called roots of unity, since $e^{2\pi i} = 1$.

The basic idea behind the FFT is a transformation of the one-dimensional problem above into a two-dimensional problem. This is accomplished by viewing the n-element vector X as an $n_1 \times n_2$ matrix, where $n = n_1 n_2$. Substituting the indices $j = a n_1 + b$, $k = c n_2 + d$, where $0 \le a, d < n_2$ and $0 \le b, c < n_1$, in the equation above yields

\[
  Y(c,d) = \sum_{b=0}^{n_1-1} w_n^{b(c n_2 + d)} \left( \sum_{a=0}^{n_2-1} X(a,b)\, w_{n_2}^{ad} \right)
\]

Now the FFT can be computed using at most $(n_1 n_2)(n_1 + n_2)$ operations by calculating the second term for all values of $b, d$ in at most $n_1 n_2^2$ operations, and the remaining transformations in another $n_2 n_1^2$ operations. This approach can be extended to yield the desired $O(n \log n)$ result.

Algorithm. The fft kernel organizes the two n-element vectors mentioned above as two $\sqrt{n} \times \sqrt{n}$ matrices, such that every processor is allocated a contiguous set of rows. The kernel uses the following six-step algorithm as described in [Bai90]:

1. Transpose the data matrix.
2. Perform $\sqrt{n}$ one-dimensional FFTs on the rows of the matrix.
3. Apply the roots of unity to the data matrix.
4. Transpose the data matrix.
5. Perform $\sqrt{n}$ one-dimensional FFTs on the rows of the matrix.
6. Transpose the data matrix.

Communication occurs only during matrix transposition in steps 1, 4 and 6. During matrix transposition, the matrix is divided into patches of size $\sqrt{n}/p \times \sqrt{n}/p$ such that every processor transposes a local patch as well as one patch from every other processor. The matrix transposition steps therefore require all-to-all communication. A barrier synchronization is performed after steps 3, 5, and 6.

Implementation. After parsing the command line, the fft kernel sets various global parameters, e.g. the number of processors, the size of the matrix, as well as the number and size of the cache-lines in the first-level cache. Based on these values, the fft kernel calculates the required matrix size. Row padding is used to ensure that the contiguous array of rows that is allocated to one processor starts on page size boundaries and individual rows start on cache-line boundaries. Afterwards four shared arrays are allocated: x, trans, umain, and umain2. The x and trans arrays hold the data matrix and are used as source and target matrices for transpose operations. The umain2 array holds the roots of unity, while the umain array contains the heavily accessed first row of the umain2 array. After initializing the local and shared arrays, the remaining processors are activated and the Fourier transformation begins. Once the simulation has been completed, these processors are deactivated, while the first processor gathers timing statistics and checks the validity of the results.

The Fourier transformation uses the algorithm described above and is implemented in the following way: First of all, the shared umain array is replicated among all processors, and a subsequent barrier synchronization is performed to ensure that all processors have finished initialization. Note that each processor is assigned a contiguous set of rows and all processors work only on the assigned rows.

All processors transpose the matrix from the x array to the trans array. Afterwards, each processor performs a one-dimensional FFT and applies the roots of unity to all columns of the trans matrix. A subsequent barrier synchronization ensures that all processors have finished this step before they transpose the matrix from the trans to the x array. Afterwards, each processor performs a one-dimensional FFT on all columns of the trans matrix that belong to it. A subsequent barrier synchronization ensures that all processors have finished this step before they transpose the matrix from the x to the trans array.

The routines that perform the one-dimensional FFT and apply the roots of unity are straight-forward. The transpose routine is more interesting: The source matrix is divided into patches of size $\sqrt{n}/p \times \sqrt{n}/p$.
Therefore each processor is assigned p patches, since $\sqrt{n}/p$ rows were allocated to each processor. Every processor transposes one patch locally and gets one patch from every other processor. The transpose of a single patch is performed block-wise to take advantage of spatial locality. The size of the blocks depends on the number of matrix elements per cache-line. Note that the transpose routine is the only part in the fft kernel that requires inter-processor communication.

Porting. Porting the fft kernel described above was straight-forward: First of all, the PARMACS macros were either removed or substituted with calls to the corresponding routines from the shmem, emulation, and performance counter libraries. Some data structures, e.g. barrier structures, were removed. All implicit accesses to remote memory were identified and replaced by explicit accesses using communication routines from the shmem and emulation libraries. The shared arrays were distributed as suggested in the original source, i.e. each processor allocates only those parts of the arrays that belong to it. The umain array was replicated on all processors, hence the replication at the start of the transformation could be removed. The initialization of the distributed arrays was parallelized as well, i.e. each processor initializes only those parts of the arrays that belong to it. Apart from the transpose routine, the other functions are basically unchanged. Inside the transpose routine, the transpose of individual matrix elements is now performed with communication routines from the shmem or emulation libraries.

In the case of emulated multithreading, the data distribution remains unchanged, but the work is distributed among the p·t threads on the p processors in the same way as if p·t processors were present. Apart from thread initialization, this is the only major change required to support emulated multithreading. Since all processors are active at startup, some changes were made to the initialization routines to ensure that some code sections, e.g. I/O, are only executed by a single processor.
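A minimal sketch of how a remote patch might be fetched and transposed with explicit get operations is shown below. The array layout, the helper name remote_get(), and the parameter names are illustrative assumptions rather than the actual benchmark code, and the block-wise traversal used to exploit cache-lines is omitted for brevity.

    /* Element-wise get from another processor, e.g. a wrapper around
       shmem_double_get() or the split-transaction E-register routines
       of the emulation library. */
    extern void remote_get(double *dst, const double *src, int pe);

    /* Fetch the patch owned by processor src_pe and store its transpose
       into the local rows of the destination matrix.  patch is the patch
       dimension (sqrt(n)/p); stride is the row length of the full matrix. */
    void transpose_patch(double *dst, const double *src, int src_pe,
                         int patch, int stride)
    {
        for (int i = 0; i < patch; i++) {
            for (int j = 0; j < patch; j++) {
                double v;
                remote_get(&v, &src[j * stride + i], src_pe); /* element (j, i) on src_pe */
                dst[i * stride + j] = v;                      /* stored transposed locally */
            }
        }
    }

With emulated multithreading, each of the p·t threads would execute such a loop for its own share of the patches, so that the long latency of the remote gets can be overlapped with the work of other threads.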

4.2.2 The LU Kernel

The lu kernel decomposes a symmetric, positive-definite matrix A into a unit lower-triangular matrix L and an upper-triangular matrix U such that $A = L \cdot U$ holds. A matrix is symmetric if the matrix is identical to its transposition, i.e. $A = A^T$. A matrix is positive-definite if $x^T A x > 0$ holds for all $x \neq 0$. It can be shown that any symmetric positive-definite matrix is invertible. The decomposition of a matrix A into matrices L, U is useful for solving systems of linear equations. Given a set of linear equations

\[
  a_{i,0} x_0 + a_{i,1} x_1 + \dots + a_{i,n-1} x_{n-1} = b_i, \qquad i = 0, \dots, n-1
\]

this equation system can be rewritten as $Ax = b$, where $A = (a_{ij})$ and $b = (b_i)$. Given a decomposition of the matrix A into the matrices L, U as described above, this equation can be rewritten as $LUx = b$. Due to the regular structure of the L and U matrices, this equation can be easily solved in two steps: In the first step, the equation $Ly = b$ is solved by forward substitution. In the second step, the equation $Ux = y$ is solved by backward substitution. The following sections describe the decomposition algorithm used by the lu kernel, the implementation of the lu kernel, as well as the porting process, respectively.

Algorithm. The lu kernel uses Gaussian elimination to compute the LU decomposition of a symmetric, positive-definite matrix A. Gaussian elimination is a recursive process that uses up to n steps to compute the decomposition, where n is the size of the matrix. In the ith step, multiples of the ith row are subtracted from all rows below such that the ith variable ($x_i$) is removed from the corresponding equations. This process is continued until the remaining matrix A has upper-triangular form. The matrix L is created from the individual factors used during the elimination process. Formally, let $A = (a_{ij})$. Then A can be written as

\[
  A = \begin{pmatrix} a_{11} & w^T \\ v & A' \end{pmatrix}
\]

where $A' = (a'_{ij})$ is an $(n-1) \times (n-1)$ matrix and $v = (a_{i1})$, $w = (a_{1i})$ are $(n-1)$-element vectors. The above equation can be rewritten as

\[
  A = \begin{pmatrix} 1 & 0 \\ v/a_{11} & I_{n-1} \end{pmatrix}
      \begin{pmatrix} a_{11} & w^T \\ 0 & A' - v w^T / a_{11} \end{pmatrix}
\]

Note that the term $v w^T / a_{11}$ is the outer product of $v$ and $w$ divided by the scalar $a_{11}$, i.e. a matrix of size $(n-1) \times (n-1)$. The matrix $A'$ can be recursively decomposed into matrices $L'$, $U'$, which yields the desired decomposition of the matrix A:

\[
  A = \begin{pmatrix} 1 & 0 \\ v/a_{11} & L' \end{pmatrix}
      \begin{pmatrix} a_{11} & w^T \\ 0 & U' \end{pmatrix}
\]

The lu kernel uses a blocked version of Gaussian elimination that is described below.

Implementation. After parsing the command-line, the lu kernel sets the number of processors p, the size of the matrix n, as well as the block size b. Afterwards the number of rows and columns per processor as well as the number of blocks per row and column is calculated. The individual blocks of the matrix are distributed across the processors in cookie-cutter fashion, i.e. block (i, j) of the matrix belongs to processor

    j · (number of rows) + i mod (number of columns)

Instead of using a two-dimensional array, the matrix A is stored in a four-dimensional array: The first two dimensions specify the blocks of the matrix, while the last two dimensions specify the elements inside the block. The four-dimensional matrix is realized as two levels of two-dimensional arrays, where each element in the first-level array contains the address of the corresponding block, i.e. the second-level array of matrix elements.

All blocks that belong to a processor are allocated from a contiguous block of memory and are aligned on page and cache-line boundaries. After initializing the first two-dimensional array with the address of the corresponding blocks, the individual blocks are initialized to form a symmetric, positive-definite matrix. Afterwards the other processors are activated, such that the decomposition algorithm described in the next paragraph is executed by all p processors. Once the decomposition is finished, these processors are deactivated, while the first processor gathers timing statistics and checks the validity of the result.

The decomposition algorithm used by the lu kernel operates on whole subblocks of the matrix instead of single rows. The algorithm consists of a single loop, the number of iterations is equal to the number of blocks per row and column. During each iteration, the following three steps are performed:

• In the first step, the current diagonal block is factorized by the corresponding processor. The diagonal block is factorized column-wise using Gaussian elimination: For each column k of the block, the entries below the diagonal are divided by the diagonal element, and the elements of the row k right of the diagonal are subtracted from all rows j below k, scaled by the corresponding element of column k:

    for (k = 0; k < n; k++) {
      for (j = k+1; j < n; j++) {
        a[k+j*stride] /= a[k+k*stride];
        daxpy(&a[k+1+j*stride], &a[k+1+k*stride], n-k-1, -a[k+j*stride]);
      }
    }

  The resulting factorization is stored in the diagonal block itself. Note that the L, U matrices contain no overlapping entries if the unit diagonal of matrix L is stored implicitly. A subsequent barrier synchronization ensures that none of the other processors proceeds with factorization until the diagonal block has been factorized.

• In the second step, the blocks in the current row and column are updated by the corresponding processors. The blocks in the current column are updated accordingly: For each column k of the block, the elements from row k are subtracted from all rows j below k, scaled by the corresponding element of column k:

    for (k = 0; k < dimk; k++) {
      for (j = k+1; j < dimk; j++) {
        daxpy(&a[j*stride_a], &a[k*stride_a], dimi, -diag[k+j*stride_diag]);
      }
    }

  The blocks in the current row are updated using the factorized diagonal block a: For each column j of such a block c, the element in row k is divided by the corresponding diagonal element and the elements below row k are updated accordingly:

    for (k = 0; k < dimi; k++) {
      for (j = 0; j < dimj; j++) {
        c[k+j*stride_c] /= a[k+k*stride_a];
        daxpy(&c[k+1+j*stride_c], &a[k+1+k*stride_a], dimi-k-1, -c[k+j*stride_c]);
      }
    }

  The updated row and column blocks are used during factorization of the interior blocks in the third step. A subsequent barrier synchronization ensures that none of the other processors proceeds with factorization of the interior blocks before the row and column blocks have been factorized.

• In the third step, the remaining interior blocks are updated by the corresponding processors. The interior blocks are updated column-wise according to the factorization of the current diagonal block: For each column k of the interior block, the elements of row k in the corresponding row block are subtracted from all rows in the current block, scaled by the negative of the corresponding entry in row k of the corresponding column block:

    for (k = 0; k < dimk; k++) {
      for (j = 0; j < dimj; j++) {
        daxpy(&c[j*stride_c], &a[k*stride_a], dimi, -b[k+j*stride_b]);
      }
    }
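All of these fragments rely on the daxpy routine, which performs the scaled vector update y ← y + α·x over n contiguous elements; the argument order matches the calls above. The benchmark ships its own implementation, so the version below is only an illustration.

    /* y[i] += alpha * x[i] for i = 0 .. n-1, matching the calls
       daxpy(dest, src, length, scale) used in the fragments above. */
    void daxpy(double *y, const double *x, int n, double alpha)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }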

4.2.3 The Radix Kernel

The radix kernel sorts integer keys using a parallel version of counting-based radix sort. The radix kernel expects three arguments: the number of keys, the size of the radix, and the size of the maximum key. The default values for these arguments are 256 K, 1024, and 512 K, respectively. The following sections describe the algorithm used by the radix kernel, the implementation, as well as the porting process, respectively.

Algorithm. The radix sort algorithm relies on the interpretation of the keys as m-bit integers, where m is the base-two logarithm of the size of the maximum key. Each key is divided into blocks of r bits, where r is the base-two logarithm of the radix size. The algorithm sorts the keys by iterating $\lceil m/r \rceil$ times through a loop, each time sorting the keys according to the current block of r bits. These intermediate sorts must be stable, i.e. the input ordering must be preserved for all keys with equal value in the current block.

Otherwise the intermediate sort destroys the work of previous intermediate sorts. The intermediate sort determines the rank of all keys according to the current block and permutes the keys correspondingly afterwards.

The keys are distributed to the processors such that each processor holds n/p local keys, where n, p is the number of keys and processors, respectively. Each processor maintains a vector index with $2^r$ elements to store the index and rank of the keys with the corresponding block values. The rank of the individual keys is determined in several steps:

• The $2^r$ elements of the index vector are cleared. Afterwards the histogram of the local keys is computed and stored in the index vector, i.e. the ith element of the vector contains the number of local keys for which i is the value of the current block.
• For all possible block values k, $0 \le k \le 2^r - 1$, the global rank of the first local key with block value k is determined and stored in the kth element of the index vector. The global rank is the sum of the number of keys on all processors with block values less than k plus the number of keys with block value k on all processors with a smaller processor id. The first summand is determined by maintaining an offset and adding the global sum of the kth elements of the index vector in each iteration. The second summand is determined by a parallel prefix computation on the kth elements of the index vector on all processors.
• After the global rank of the first local key has been determined for all possible block values, the global rank of the remaining keys is determined: The global rank of a local key with block value k is the sum of the global rank of the first key with block value k and the number of local keys with block value k encountered so far. The local keys are processed in sequential order to ensure the stability of the intermediate sorting step.

The radix sort algorithm has several advantages: The algorithm is easy to code and maintain, has a good running time in practice, and performs well on short keys. The main disadvantage of the algorithm is the space requirement, i.e. radix sort does not sort in place. However, radix sort requires less memory than other sorting algorithms, e.g. sample sort [BLM+91].

Implementation. After parsing the command-line, the radix kernel initializes the following parameters: the number of processors p, the number of keys n, the size of the radix r, as well as the size of the maximum key M. Afterwards local and shared memory is allocated and initialized: The key partition array is a shared array with p + 1 elements that is initialized such that the ith element of the array contains the number of the first key allocated to processor i. Note that the pth entry contains the total number of keys. The rank array is a shared array with p · (r+1) elements that contains the rank arrays for all p processors. The first p elements of the rank array are initialized such that the ith entry contains the address of the rank array at processor i. The rank_ff array is a local array with r elements that is used as a local rank array. The key0, key1 arrays are shared arrays with n elements that contain the actual key values and are used as source and destination during the intermediate sorts, as radix sort does not sort in place. After the local and shared arrays have been initialized, the remaining p − 1 processors are activated, such that all p processors execute the parallel sorting algorithm.
After all processors have finished sorting the keys, these processors are deactivated again, while the first processor gathers timing statistics and checks the correctness of the result.

Upon startup of the parallel sort, each processor uses a pseudo-random number generator to fill the corresponding part of the key0 array. A barrier synchronization ensures that all processors have finished key generation before the actual sorting begins. As already mentioned above, the parallel sort algorithm consists of a single loop that is traversed $\lceil m/r \rceil$ times, each time sorting the keys according to a block of r bits. During each iteration, the following steps are performed:

• First of all, each processor creates a histogram of all keys that belong to it. This histogram is created in the following way: The part of the global rank array that belongs to the corresponding processor is cleared. Afterwards all local keys are processed by extracting the value of the current block of r bits and incrementing the corresponding entry in the global rank array. In addition, each processor stores the distribution of its keys in the local key distribution array, i.e. the ith element of the array contains the sum of all elements below i and the ith element of the rank array.
• The local histograms are now merged into a global histogram. This is done in two steps: In the first step, a parallel prefix computation is used to accumulate the local histograms. In the second step, the global histogram is broadcasted to all processors in a reverse parallel prefix operation. These parallel prefix computations use a global array of 2p prefix nodes where each node contains two arrays with r elements for distribution and rank, respectively. The parallel prefix computation is implemented by building a binary tree on the p processors. Note that the total number of nodes in this tree is given by

\[
  \sum_{i=0}^{\log(p)} \frac{p}{2^i} = p \cdot \sum_{i=0}^{\log(p)} \left(\frac{1}{2}\right)^i \le 2p
\]

hence the 2p elements in the prefix array. Each prefix node contains a lock variable that is used to ensure proper updates of the individual nodes. As only a small subset of processors is used in the upper levels of the tree, computation of the global histogram does not scale well.
• The parallel prefix computation is completed as each processor determines the global rank of the local keys by adding the ranks from all nodes to its left.

• Each processor permutes the local keys from the source to the destination arrays according to the global ranks computed in the previous step. Afterwards the source and target arrays are exchanged unless the current iteration is the last one.

Porting. Porting the implementation of the radix kernel described above was straight-forward: All PARMACS macros were either removed or substituted with calls to the corresponding routines from the shmem, emulation, and performance counter libraries. The shared arrays were distributed as suggested in the original source: The key and rank arrays were distributed such that each processor stores only the local keys. The prefix-tree array used during parallel prefix computation is replicated on all processors and augmented by an additional structure that maps the individual nodes in the prefix tree to processor numbers. Note that the contents of a given prefix node are only valid on the corresponding processor. All implicit accesses to remote memory were identified and replaced by explicit accesses using the communication routines from the shmem or emulation libraries.

In the case of emulated multithreading, the individual keys are distributed among the p·t threads on the p processors in the same way as if p·t processors were present. Apart from initialization, this is the only major change required to support emulated multithreading.
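A sequential sketch of a single intermediate sorting pass may help to illustrate the histogram-and-rank scheme described above. The variable names are illustrative, and the parallel prefix computation across processors is omitted; each processor of the parallel kernel performs essentially these three phases on its local keys.

    #include <stdlib.h>

    /* One stable counting-sort pass over the r-bit block starting at bit "shift".
       src and dst are the two key arrays; the kernel alternates between them. */
    void radix_pass(const unsigned *src, unsigned *dst, int n, int r, int shift)
    {
        int radix = 1 << r;
        int *index = calloc(radix, sizeof(int));
        int i, k, sum = 0;

        /* histogram of the current block values */
        for (i = 0; i < n; i++)
            index[(src[i] >> shift) & (radix - 1)]++;

        /* exclusive prefix sum: rank of the first key with each block value */
        for (k = 0; k < radix; k++) {
            int count = index[k];
            index[k] = sum;
            sum += count;
        }

        /* permute the keys; processing them in input order keeps the pass stable */
        for (i = 0; i < n; i++)
            dst[index[(src[i] >> shift) & (radix - 1)]++] = src[i];

        free(index);
    }

In the parallel kernel, the prefix sum is computed across all processors with the prefix tree described above, so that each processor learns the global rank of its first key for every block value.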

4.2.4 The Ocean Application

The ocean application studies the role of mesoscale eddies and boundary currents in large-scale ocean movements. The ocean is modeled as two different layers, i.e. the lower and the upper layer. The upper layer is driven by wind stress from the overlying atmosphere. In addition, the influences of bottom friction, i.e. between the ocean floor and the bottom layer, as well as lateral friction, i.e. between both layers, are considered. Both layers are modeled as a cuboidal grid. For each grid point, data is recorded at the middle of both layers as well as at the interface between both layers. The distance between grid points determines the accuracy of the simulation.

Simulation of this wind-driven ocean basin proceeds in several timesteps until a steady state between the eddy currents and the mean ocean flow is reached. In each timestep, the following system of partial differential equations has to be solved [Hol78]:

\[
  \frac{\delta}{\delta t}\nabla^2\Psi_1 = J(f + \nabla^2\Psi_1, \Psi_1) - (f_0/H_1)\,w_2 + F_1 + H_1^{-1}\,\mathrm{curl}_z\tau
\]
\[
  \frac{\delta}{\delta t}\nabla^2\Psi_3 = J(f + \nabla^2\Psi_3, \Psi_3) - (f_0/H_3)\,w_2 + F_3
\]
\[
  \frac{\delta}{\delta t}(\Psi_1 - \Psi_3) = J(\Psi_1 - \Psi_3, \Psi_2) - (g'/f_0)\,w_2
\]

where $\nabla^2$ and $J$ are the Laplacian and Arakawa Jacobian operators, respectively. $H_1$, $H_3$ are the heights of the upper and bottom layers, while $F_1$, $F_3$ are the lateral friction terms for the upper and bottom layers, respectively. $\tau$ is the wind stress influencing the upper layer, $\mathrm{curl}_z\tau$ is the vertical component of the wind-stress curl, and $w_2$ is the vertical velocity at the interface of the two layers. $f$ and $g'$ represent the Coriolis parameter and the gravity, respectively. The $\Psi_1$, $\Psi_2$, $\Psi_3$ are the stream functions at the middle of the upper layer, the interface between the two layers, and the middle of the bottom layer, respectively.

Algorithm. The algorithm used in the ocean application uses the first difference form of the equations presented above to yield a numerical solution. In each time step, the ocean application solves the following equation system [Hol78]:

\[
  \left(\nabla^2 - \frac{H f_0^2}{H_1 H_3 g'}\right)\frac{\delta}{\delta t}(\Psi_1 - \Psi_3) = \gamma_a
\]

\[
  \nabla^2\,\frac{\delta}{\delta t}\left(\frac{H_1\Psi_1 + H_3\Psi_3}{H}\right) = \gamma_b
\]

under the condition that

\[
  \iint \Psi\,\delta x\,\delta y = 0 \qquad\text{and}\qquad \frac{\delta}{\delta t}\,\frac{(H_1\Psi_1 + H_3\Psi_3)}{H} = 0
\]

holds. Note that the latter condition applies only to the boundaries. The right-hand sides of the above equations, i.e. $\gamma_a$, $\gamma_b$, are given by:

\[
  \gamma_a = J(f + \nabla^2\Psi_1, \Psi_1) - J(f + \nabla^2\Psi_3, \Psi_3)
           - \frac{H f_0^2}{H_1 H_3 g'}\,J(\Psi_1 - \Psi_3, \Psi_2)
           + H_1^{-1}\,\mathrm{curl}_z\tau + F_1 - F_3
\]

\[
  \gamma_b = \frac{H_1}{H}\,J(f + \nabla^2\Psi_1, \Psi_1) + \frac{H_3}{H}\,J(f + \nabla^2\Psi_3, \Psi_3)
           + H^{-1}\,\mathrm{curl}_z\tau + \frac{H_1}{H}F_1 + \frac{H_3}{H}F_3
\]

Let $\Psi_a$, $\Psi_b$ be stream functions that satisfy certain boundary conditions; then

\[
  \Psi = \Psi_a + \frac{\iint \psi_a\,\delta x\,\delta y}{\iint \psi_b\,\delta x\,\delta y}\,\Psi_b
\]

is used to aid the solution of the equations above. The algorithm used in the ocean application arranges the above computations such that the computations can be efficiently performed in parallel. Therefore an iterative equation solver was used instead of the direct solver in the original sequential program.

Implementation. After parsing the command line, the ocean application sets various global parameters, e.g. the number of processors, the size of the ocean grid, the distance between individual grid points, as well as the error tolerance. Based on these values, the shared arrays are initialized: The psi1, psi3, psim1, and psim3 arrays represent the mean stream function at the current and previous timesteps, respectively. The psium and psilm arrays represent the stream functions at the interface between both layers, in the middle of the upper layer, and the middle of the lower layer, respectively. The ga, oldga, and gb, oldgb arrays represent the γa, γb functions from the current and previous timesteps, respectively. The tauz array represents the vertical component of the wind-stress curl, while the f array represents the Coriolis parameter. The work1 through work7 and temp arrays are used as temporary arrays during setup of the differential equations. The q multi and rhs multi arrays are used as inputs to the multigrid solver. After these and several local arrays have been initialized, the remaining processors are activated and the ocean is simulated for the specified num- ber of timesteps. Once the simulation has been completed, these processors are deactivated, while the first processor gathers timing statistics. Before the first timestep begins, the psi, psim, psib, psium, psilm, tauz and f arrays are initialized and the integral of the psibi arrays is determined. A subse- quent barrier synchronization ensures that all processors have finished this initialization before the actual simulation starts. The simulation proceeds for the specified number of timesteps, each timestep consists of 10 steps to solve the equation system described in Sec- tion 4.2.4. Barrier synchronizations between the individual steps ensure that all processors have finished the current step before moving to the next step. The first step consists of six different computations: First, the ga and gb • arrays are initialized. Second, the Laplacian of the psi1 array, i.e. Ψ1, is computed and stored in the work1 array. Third, the Laplacian of the psi3 array, i.e. Ψ3, is computed and stored in the work3 array. Fourth, the differ- ence between the psi1 and psi3 arrays, i.e. Ψ1 Ψ3 is computed and stored − in the work2 array. Fifth, the Ψ2 stream function is computed and stored in the work3 array. Last, the psi1 and psi3 arrays are saved to the temp1 and temp3 arrays. The second step consists of three different computations: First, the psim1 • and psim3 arrays are copied to the psi1 and psi3 arrays, respectively. Sec- ond, the Laplacian of the psim1, psim3 array is computed and stored in the work7 arrays. Last, the work1 and work3 arrays are updated with the corresponding values from the f array. Afterwards these arrays contain f + (Ψ1) and f + (Ψ3), respectively. The∇ third step consists∇ of three different computations: First, the Jacobians • of the work1, temp1 and work3, temp3 arrays is computed and stored in the work5 arrays. Afterwards these arrays contain J(f + (Ψ1),Ψ1) and J(f + ∇ (Ψ3),Ψ3), respectively. Second, the original values of the psim1, psim3 ∇ 154 4. Benchmarks

arrays are restored from the temp1, temp3 arrays. Last, the Laplacian of the work7 arrays is computed and stored in the work4 arrays.
• The fourth step consists of two different computations: First, the Jacobian of the work2, work3 arrays is computed and stored in the work6 array. Afterwards the work6 array contains J(f + ∇²(Ψ1 − Ψ3), Ψ2). Second, the Laplacian of the work4 arrays is computed and stored in the work7 arrays. Afterwards these arrays contain the three-fold Laplacian of the original psim arrays and represent the lateral friction terms.
• The fifth step uses the work5, work6, work7 arrays to compute the ga and gb arrays, i.e. γa, γb, according to the equations described above.
• The sixth step initializes the q multi and rhs multi arrays based on the ga array and solves the corresponding partial differential equation using an iterative red-black Gauss-Seidel multigrid solver. The solution is stored in the ga array, which is saved to the oldga arrays afterwards. Recall that the oldga array is used to provide an initial guess during the next iteration.
• In the seventh step, each processor computes the integral of the local part of the ga array and updates the global psibi variable accordingly.
• The eighth step consists of two different computations: First, the ga array is updated according to the above equations. Second, the q multi and rhs multi arrays are initialized based on the gb array and the corresponding partial differential equation is solved using the same multigrid solver as above. The solution is stored in the gb array, which is saved to the oldgb arrays afterwards.
• In the ninth step, the solutions of the partial differential equations stored in the ga, gb arrays are used to update the work2 and work3 arrays.
• In the tenth step, the psi1 and psi3 arrays are updated from the work2 and work3 arrays to prepare for the next iteration.
Porting. Porting the ocean application described above was complicated by the size of the application and the large number of arrays. First of all, the PARMACS macros were either removed or substituted with calls to the corresponding routines from the shmem, emulation and performance counter libraries. Some data structures, e.g. barrier and lock structures, were removed or integrated into the other data structures.
All implicit accesses to remote memory were identified and replaced by explicit accesses using communication routines from the shmem and emulation libraries. The shared arrays were distributed as suggested for distributed shared-memory systems in the original sources, i.e. each processor allocates its local subgrid of the individual arrays. Since all processors are active at startup, some changes were made to the initialization routines to ensure that some code sections, e.g. I/O, are only executed by a single processor.
In the case of emulated multithreading, the data distribution is unchanged, but the work is distributed among the p·t threads on the p processors as if p·t processors were present. Apart from thread initialization, this is the only major change required to support emulated multithreading.
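As an illustration of this work distribution, the following sketch assigns blocks of grid rows to the p·t threads; the blocked row partitioning and all identifiers are assumptions for illustration only, not taken from the actual ocean sources.

/* Minimal sketch: distribute the rows of an n x n grid among the
 * p*t virtual processors (p physical processors, t threads each).
 * Names and the blocked partitioning are illustrative assumptions. */
void thread_work(int my_pe, int my_thread, int p, int t, int n)
{
    int virt  = my_pe * t + my_thread;   /* virtual processor id      */
    int nvirt = p * t;                   /* total number of threads   */
    int rows  = (n + nvirt - 1) / nvirt; /* rows per virtual processor */
    int first = virt * rows;
    int last  = first + rows < n ? first + rows : n;

    for (int row = first; row < last; row++) {
        /* update all grid points in this row ... */
    }
}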

4.2.5 The Barnes application

The barnes application solves the classical N-body problem: Given a set of N particles, where each particle is defined by its mass, position, and velocity, the evolution of the particle system under the influence of gravitational forces is computed. This requires discretizing the time period into small time steps, calculating the gravitational forces between all particles and updating the particle positions, velocities, and accelerations accordingly in each time step. Since the range of the gravitational force is infinite, each particle is influenced by all other particles. As the number of particle pairs is N(N − 1)/2, a straightforward algorithm requires O(N²) time for each time step. The barnes application uses the barnes-hut algorithm to achieve an O(N log N) time bound.
Algorithm. The barnes-hut algorithm [BH86] uses a hierarchical method based on constructing a tree of particles. All nodes in the tree represent a regular space cell of the particle system and the leaves contain at most one particle. The root node represents a space cell that is large enough to contain the whole particle system. Starting with the root node, the tree is constructed by subdividing the corresponding cell into up to eight subcells until the subcells contain no more than one particle. Note that the length of the subcells is one half of the length of the parent cell in all three dimensions.
The size of the root cell is determined in O(N) time by examining the current positions of all particles. The tree is constructed by starting with the empty root cell and subsequently inserting all particles into the tree in arbitrary order. The insertion of a particle leads to subdivisions of tree nodes if there is more than one particle in the corresponding space cell. Empty subcells are not stored, i.e. the tree is adaptive. Since the insertion of a particle requires time proportional to the height of the tree, which is expected to be O(log N), the expected runtime for the tree construction is O(N log N).
For each internal node, the center of masses for all particles in any of the corresponding subcells is calculated. These values are used during the third step to approximate the interaction with the particles contained in these subcells. A traversal of the tree in reverse direction, i.e. starting with the leaves, is used to calculate the center of masses for all leaves and to propagate the corresponding information to all nodes in the tree. Since the expected height of the tree is O(log N), the number of nodes in the tree is O(N) on average, hence the second step has an average runtime of O(N).
The gravitational forces on all particles are computed in the third step. For each particle, the tree is traversed to compute the gravitational forces from all other particles that affect the current particle. The traversal starts at the root node and is governed by the following rules: If the center of masses in one of the subcells of the current node is well-separated from the current particle, the gravitational force caused by the particles in that subcell is approximated by the corresponding center of masses. Otherwise the traversal continues recursively into the subcell. The dimensions of the current cell, the distance between the current particle and the center of masses of the subcell, as well as a threshold are used to determine whether a given particle and subcell are well-separated.
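The well-separation test that governs this traversal is commonly implemented by comparing the cell size with the distance between the particle and the center of masses. The following sketch assumes this standard criterion; the structure layout and the name theta are illustrative and not taken from the barnes sources.

#include <math.h>

struct cell {
    double length;                   /* side length of the space cell            */
    double cm_x, cm_y, cm_z;         /* center of masses of the contained particles */
    /* ... subcells, particle list, total mass ... */
};

/* A particle at (x, y, z) and a subcell are treated as well-separated
 * if length / distance < theta, where theta is the user-defined
 * threshold discussed in the following paragraph. */
int well_separated(const struct cell *c, double x, double y, double z,
                   double theta)
{
    double dx = c->cm_x - x, dy = c->cm_y - y, dz = c->cm_z - z;
    double dist = sqrt(dx * dx + dy * dy + dz * dz);
    return c->length < theta * dist;
}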
The threshold is usually user-defined and determines the accuracy of the approximation along with the duration of the individual timesteps and the accuracy of the mathematical operations. Since the tree is traversed for each particle during this step, the force calculation for each step requires O(log N) time, hence the third step has an overall runtime of O(N log N). Implementation. After parsing the command-line, the barnes application sets various global parameters, e.g. the number of particles, the duration of a timestep, the number of timesteps, and the number of processors. Based on these values, the shared arrays are initialized as follows: The btab array is a shared array that holds the N particles. The mass, position, and velocity of the individual particles are initialized based on a Plummer model. The ctab and ltab arrays are allocated for each processor and hold internal nodes and leaves of the particle tree, respectively. The size s1, s2 of these arrays is the number of particles times the fraction of leaves per particle and the fraction of nodes per particle, respectively, divided by the number of processors p:

s1 = n · f_leaves / p
s2 = n · f_nodes / p

The two fractions can be specified via the command-line, default values are 0.5 and 2.0, respectively.
The three local mybody, myleaf, and mycell arrays are used to hold pointers to elements in the shared btab, ltab, and ctab arrays, respectively. These arrays are required since the distribution of particles, leaves, and nodes to processors is changed in every timestep. The mybody array is initialized such that processor i holds pointers to n/p particles starting from the i · (n/p)th particle. After initializing the local and shared arrays, the remaining processors are activated and the evolution of the particle system is determined for the specified number of timesteps. Once the simulation has been completed, these processors are deactivated, while the first processor gathers timing statistics.
In each timestep, the three phases of the barnes-hut algorithm described above as well as an additional load-balancing step are executed:
• The first processor creates a tree with an empty root node, a subsequent synchronization barrier ensures that all processors start building the particle tree at the same time. Each processor loads the particles from its mybodytab array into the tree as described above. Note that synchronization is required to ensure atomic updates of the tree. Nodes and leaves that are created during insertion of a particle are stored in the ctab and ltab arrays of the processor that inserted the particle.

• The particles are redistributed across the processors using a work-partition scheme, as the amount of work required to calculate the gravitational forces acting upon a particle is non-uniform. The number of interactions with other particles is used as a simple cost measure to reflect the amount of work associated with that particle. Note that the number of interactions is not known before the forces are calculated and changes dynamically across timesteps. However, provided that the time steps are small enough, the particle system evolves slowly between timesteps. Hence the number of interactions in the previous timestep is a useful approximation to the number of interactions in the current timestep. Based on this cost measure, the particles are distributed as follows: First of all, the average cost per processor is calculated as the total cost of all particles divided by the number of processors p:

Cavg = Ctot / p

Note that the total cost of all particles is equal to the cost of the root cell. The total cost is distributed across the processors such that a partition of size Cavg is assigned to each processor. The minimum and maximum cost for processor i, i.e. the start and end points of the corresponding partition, are calculated as follows:

Cmin = Cavg · i
Cmax = Cavg · (i + 1)

Based on the minimum and maximum costs per processor, all processors traverse the particle tree, summing the cost of all particles encountered. Once the sum is larger than the minimum cost, the encountered particles are allocated to the processor, until the sum exceeds the maximum cost for this processor. Note that the particles themselves are not distributed, only the mybodytab array that stores the pointers to the local particles is updated. No synchronization is required during the work distribution phase, as the minimum and maximum cost values as well as the identical traversal of the particle tree ensure that the particle distribution is disjoint.
• During the third step, each processor computes the gravitational forces for its local particles. For each particle, the particle tree is traversed as described above, calculating the number of particle-particle and particle-cell interactions along the way. In addition, the cost of the particle, i.e. the number of interactions, is calculated.
• After calculating the gravitational forces, each processor updates the position, velocity, and acceleration of all its local particles as described above. In addition, the minimum and maximum positions in each dimension are determined. If any of the three local minimum or maximum dimensions exceeds the corresponding global dimensions, the global dimensions are updated accordingly. Note that synchronization is required to ensure proper

updates of the global dimensions. These dimensions are used during the next timestep to create the root node of the particle tree.
Porting. Porting the barnes application described above was complicated by the size of the application and the dynamic assignment of particles to processors. First of all, the PARMACS macros were either removed or substituted with calls to the corresponding routines from the shmem, emulation, and performance counter libraries. Some data structures were removed or integrated into the other data structures.
All implicit accesses to remote memory were identified and replaced by explicit accesses using communication routines from the shmem and emulation libraries. However, as the assignment of particles, nodes, and leaves to processors is dynamic, the target processing element of the remote access has to be stored explicitly for all pointers that are potentially involved in remote memory accesses. Due to the size of the barnes application, the identification of remote accesses and corresponding pointers was quite complex.
The six shared arrays were distributed across the processors as suggested in the original source code: The body array is distributed such that every processor i holds a subarray that contains n/p particles starting with the i · (n/p)th particle. Note that the body array is still shared, i.e. the arrays on all processors start at the same address. The ctab and btab arrays were distributed as shared arrays to the corresponding processors. The mybodytab, mycelltab, and myleaftab arrays were distributed in the same way.
In the case of emulated multithreading, the data distribution is unchanged, but the work is distributed among the p·t threads on the p processors as if p·t processors were present. Apart from thread initialization, this is the only major change required to support emulated multithreading. Since all processors are active at startup, some changes were made to the initialization routines to ensure that some code sections, e.g. I/O, are only executed by a single processor.
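Because the owner of a particle can change between timesteps, every pointer that may refer to remote memory carries the number of the owning processing element explicitly. The following sketch illustrates this idea; the structure layout and the routine emu_get() are assumptions for illustration, the actual shmem and emulation library routines may differ.

#include <stddef.h>
#include <string.h>

/* Assumed explicit communication routine provided by the shmem/emulation
 * libraries; name and signature are illustrative. */
extern void emu_get(void *dest, const void *src, size_t nbytes, int pe);

/* A pointer that may refer to memory on another processing element. */
struct remote_ptr {
    int   pe;      /* processing element that owns the target */
    void *addr;    /* address within that processing element   */
};

/* Fetch nbytes from a possibly remote location into a local buffer. */
void fetch(void *local, const struct remote_ptr *rp, size_t nbytes, int my_pe)
{
    if (rp->pe == my_pe)
        memcpy(local, rp->addr, nbytes);           /* local access           */
    else
        emu_get(local, rp->addr, nbytes, rp->pe);  /* explicit remote access */
}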

4.2.6 The FMM application

Like the barnes application described in Section 4.2.5, the fmm application solves the classical N-body problem. In contrast to the previous approach, the fmm application is restricted to the two-dimensional case and uses the fast multipole algorithm. Note that the algorithm itself is not restricted to two dimensions, but the three-dimensional formulation of the algorithm is considerably more complex. The primary difference between the two algorithms is the calculation of cell-cell interactions in the fast multipole algorithm, while the barnes-hut algorithm calculates only particle-particle and particle-cell interactions. Another major difference between the two algorithms is the definition of well-separatedness and the use of multipole expansions in the fast multipole algorithm compared to centers of masses in the barnes-hut algorithm. The next section describes the fast multipole algorithm in detail.

Algorithm. Similar to the barnes-hut algorithm, the fast multipole algo- rithm uses a hierarchical method based on constructing a tree of particles. All nodes in the particle tree represent a regular space cell of the particle system, the root node represents a space cell that is large enough to con- tain all particles in the system. Note that each leaf in the particle tree can contain several particles instead of the single particle used in the barnes-hut algorithm. Starting with the root node, the tree is constructed by subdividing each cell into four quarter-sized subcells until the corresponding leaf cell contains no more than the maximum number of particles per leaf: The size of the root cell is determined in O(N) time by inspecting the cur- • rent positions of all particles. The tree is constructed in the same way as in the barnes-hut algorithm. The only difference is the maximum number of particles per leaf: In the fast multipole algorithm, a leaf is only subdi- vided if the number of particles in the leaf exceeds the maximum number. Setting the maximum number of particles to one yields the barnes-hut tree-construction algorithm. Each cell is approximated by a linear-order series expansion of particle • properties around the center of the cell, the so-called multipole expansion. The number of terms used in the expansion determines the accuracy of the interpretation, an infinite number of terms would yield the exact result. Note that the accuracy of the computation in the fast multipole algorithm is controlled by the number of terms in the expansions compared to the choice of well-separatedness as in the barnes-hut algorithm. In addition, the accuracy of both algorithms is determined by the length of the individual timesteps as well as the accuracy of the floating-point operations. The multipole expansions are determined in an upward pass of the particle tree, propagating the multipole expansions from the particles in the leaves to the root node. Before calculating the gravitational forces, each node divides all nodes • whose corresponding cells are not well-separated from the parent cell into several lists. Two nodes a, b are said to be well-separated if the distance between the corresponding cells is larger or equal than the length of b. Each of the four lists contains cells that bear a special relationship to the current cell: – The U list of a leaf l contains all leafs that are adjacent to the leaf l. – The V list of a node c contains all well-separated siblings. – The W list of a leaf l contains all descendants of l’s colleagues whose parents are adjacent to l, but which are not themselves adjacent to l. – The X list of a node c contains all nodes that have c in their W list. After constructing these lists for all nodes and leaves in the particle tree, the interaction of each node with all nodes in the corresponding lists is computed. The list construction ensures that no interactions are computed 160 4. Benchmarks

with cells that are well-separated from the parent cell. Note that the interactions with nodes from different lists are different, details can be found in [SHHG93]. Internal nodes compute only interactions with nodes in their V and X lists and store the results as a local expansion series. The local expansion represents the interactions with all well-separated nodes. In addition, leaf nodes compute interactions with the nodes in their U and W lists and update the particles in the leaf accordingly. Note that the interactions between the particles in the leaf node and all nodes in the U list are not approximated in any way, while interactions with nodes in the W list are approximated by the corresponding multipole expansion.
After the interaction lists have been computed for all nodes, the interactions of well-separated nodes on the particles in a leaf cell are represented by the local expansions in the ancestor nodes. These local expansions are propagated to the leaves by a downward pass of the particle tree. The particles in the leaf cells are subsequently updated according to the local expansion. The complexity of the force calculation phase is O(nh²m), where m is the number of terms used in the multipole expansion, n is the number of particles in the tree, and h is the number of levels in the tree [Sin93].
Implementation. The fmm application uses command-line arguments to pass several important parameters such as the number of particles, the type of the particle distribution, the number of processors, the number of terms in the multipole expansion, as well as the number and duration of timesteps. After parsing the command-line, these parameters are used to create the particle distribution. The particles themselves are stored in a shared array, an additional array of the same size stores pointers to the individual particles. The latter array is initialized such that the ith entry contains a pointer to the ith particle. In addition, several static arrays that are used during multipole expansion are initialized. After initialization is completed, the remaining processors are activated, such that the evolution of the particle system is simulated by all p processors. Once the simulation has been completed, these processors are deactivated, while the first processor gathers timing statistics.
At startup, each processor allocates and initializes the local particle array, which contains pointers to particles. The size of the array is the product of the total number of particles and a constant particle distribution factor, divided by the number of processors. The particle distribution factor is used to account for imbalances in the particle distribution. The local particle arrays are initialized such that every processor holds an equal number of particles. In addition, each processor allocates an array of internal nodes used to construct the tree. The number a of these nodes is given by

a = 4/3 · (tol · bdf · n) / (p · occupancy · max particles per node)

Note that the box distribution factor bdf and the tolerance tol are used to account for imbalances in the distribution of cells to processors, while the maximum number of particles per node times the occupancy gives the average number of particles per node.
A subsequent barrier synchronization ensures that all processors have completed initialization before the evolution of the particle system is simulated. The simulation consists of a single loop that is iterated for the specified number of timesteps.
For each timestep, the simulation is performed in six different steps: In the first step, the particle tree is constructed. This step uses a differ- • ent algorithm to construct the particle tree that significantly reduces the amount of synchronization required. First of all the size of the root cell is determined: Each processor calculates the minimum and maximum par- ticle positions in both dimensions by inspecting all particles that belong to it. Afterwards the global minimum and maximum particle positions in both dimensions are determined by merging the local minimum and maxi- mum positions. A barrier synchronization ensures that all processors have calculated their local minimum and maximum particle positions prior to the merge. After the size of the root cell has been determined, each processor initializes the internal cell and particle lists, i.e. destroys the previous particle tree. The particle tree is constructed by building local particle trees on each processor and merging these local trees into a global particle tree after- wards. Each processor constructs its local particle tree as described above by starting with an empty root node and subsequently inserting its local particles into the tree. Note that the root node has identical dimensions in all local particle trees, therefore two nodes in different local trees represent the same subspaces. This property simplifies the merging algorithm, which is described in the next paragraph. The algorithm used to merge a local particle tree into the global tree starts at the root nodes of both trees. Based on the type of these nodes, there are six different cases: – If the global node is empty while the local node is internal, the global node is substituted by the local node. – If the global node is empty while the local node is a leaf, the global node is substituted by the local node. – If both nodes are internal, all children of the local node are merged with their counterparts in the global tree by calling the merging algorithm recursively for each child. – If the global node is internal while the local node is a leaf, the local node is subdivided such that the local node becomes internal and the previous case applies. – If both nodes are leaves, the global node is removed from the tree and the particles in the global node are inserted in the local subtree starting with the local node. Afterwards the local node is inserted into the global tree. 162 4. Benchmarks

– If the global node is a leaf while the local node is internal, the global node is removed from the tree and the particles in the global node are inserted in the local subtree starting with the local node. Afterwards the local node is inserted into the global tree afterwards. Note that the local node is never empty, hence there are only six instead of nine different cases. This algorithm significantly reduces the amount of synchronization, as whole subtrees instead of particles are inserted in a single locking operation. In the barnes application, synchronization is a bottleneck, especially in the upper levels of the tree. During the construction of the particle tree, each processor also constructs the partition lists. There is one partition list for the leaves in the particle tree and another one for each level of the particle tree. After the particle tree has been constructed, these particle lists contain pointers to all leaves or internal cells that belong to the processor. The partition lists are used during the construction of the U, V, W, and X lists described below as well as during the force calculation phase. The U, V, W, and X lists are constructed in two steps: First the lists of • siblings and colleagues are constructed for each cell, afterwards the U, V, and W lists are constructed. Note that the X list is not computed explicitly, as it is the dual of the W list. The two steps are separated by a barrier synchronization to ensure that the siblings and colleagues of each cell have been determined before the U, V, and W lists are constructed. Given a node c, recall that all siblings of the node are all other nodes that have the same parent. Given a node c, recall that all colleagues of the node are its siblings as well as its cousins, i.e. children of siblings of the parent. The lists of siblings and colleagues for each node is constructed in a downward pass of the particle tree: Each processor uses its partition lists to update all nodes in all internal nodes of the tree as well as the leaves. The U, V, and W lists are updated in a similar way, although an upward pass of the particle tree is used to construct the lists. The individual lists are constructed for all cells according to the definitions given above. Similar to the barnes application, the nodes and leaves are redistributed • to processors prior to force calculation. However, the cost of an internal node is represented by the number of nodes in the V list, while the cost of a leaf is represented by the sum of all cycle counts for interactions with all nodes in the U, V, W, and X lists. The cycle count for a given list is approximated by a simple function. This function is parameterized by the number of terms in the multipole expansion as well as the number of particles in the corresponding cell. Note that the partition lists are recon- structed during the repartitioning of the nodes and leaves. A subsequent barrier synchronization ensures that the partitioning is completed before the force calculation phase. The force calculation phase consists of four steps: First of all, the multipole • expansions of all leaves are calculated and propagated to their ancestors 4.2 SPLASH2 Benchmark Suite 163

by an upward pass of the particle tree. Afterwards the U, V, W, and X interactions are computed in another upward pass of the particle tree. A subsequent barrier synchronization ensures that these interactions have been computed for all nodes before the local multipole expansions and updated particle positions are calculated. The local multipole expansion is propagated from the root cell to the de- scendants in a downward pass of the particle tree and evaluated at the leaves. The particle positions are updated according to the interactions with other particles/cells. The different partitions of the particle tree are based on the partition lists that are maintained by each processor. In the last step, each processor updates the list of its local particles by examining all leaves in the corresponding partition list and adding all par- ticles in these leaves to its particle list. Note that the particle tree is re- constructed in every timestep, hence the need to update the particle lists. Porting. Porting the fmm application was complicated by the size of the application and the dynamic partitioning of the particles. First of all, the PARMACS macros were either removed or substituted with routines from the shmem, emulation, and performance counter libraries. Some data structures could be removed, since the shmem and emulation libraries allow synchro- nizations on arbitrary longwords and barriers require no data structure. All implicit accesses to remote memory were identified and replaced by explicit accesses using routines from the shmem and emulation libraries. . Due to the dynamic partitioning of the particle system, the target pro- cessing elements for remote memory accesses have to be stored explicitly for all pointers that are potentially involved in such accesses. As the fmm ap- plication is even larger and more complex than the barnes application, the identification of implicit remote accesses was quite complex. Unfortunately, the decision to store the address and the corresponding processing element in different locations created the potential for race conditions: The tree merging algorithm in the fmm benchmark allows simultaneous accesses to the indi- vidual nodes in the tree as long as only one processor updates the node. Therefore some of the read accesses will return the updated contents of the node, while other read accesses will return the original contents of the node. This is no problem, as the original contents are still valid, i.e. the correspond- ing nodes are still part of the tree, although in a different position. However, if the address and the number of the processor is stored in different locations, some read accesses will return the original address along with the updated processor number and vice versa. This problem was detected during the evaluation of emulated multithread- ing on the Cray T3E as the problem manifests itself only with a large number of particles and multiple processors. Due to resource constraints, the fmm benchmark was tested with smaller problem sizes only during the porting phase. The problem can be solved in two different ways: All read and write accesses to the nodes in the tree are separated by locks. Due to the large 164 4. Benchmarks number of these accesses, this solution will probably have a significant im- pact on the tree merging performance. The second solution is to store the address and the processor number in the same memory location. 
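The second solution can be sketched as follows: the processor number is kept in the otherwise unused upper bits of the 64 bit address, so that both values are read and written as a single word. The bit layout below assumes a 48 bit virtual address space, as implemented by the 21264; the helper names are illustrative and not taken from the fmm sources.

#include <stdint.h>

/* Pack the owning processing element into the unused upper bits of a
 * 64 bit address (48 bit virtual address space assumed). */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static inline uint64_t pack(uint64_t addr, unsigned pe)
{
    return (addr & ADDR_MASK) | ((uint64_t)pe << ADDR_BITS);
}

static inline uint64_t unpack_addr(uint64_t packed)
{
    return packed & ADDR_MASK;
}

static inline unsigned unpack_pe(uint64_t packed)
{
    return (unsigned)(packed >> ADDR_BITS);
}

Because address and processor number live in a single 64 bit word, a concurrent reader sees either the old pair or the new pair, but never a mixture of the two.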
This can be accomplished by storing the processor number in the unused upper part of the address and inserting or extracting both values before or after each access. Due to the large number of accesses to such locations, the second solution requires some work.
The shared particle and particle list arrays are distributed across the processors such that every processor holds an equal number of array elements. The local cell and particle list arrays are allocated at the corresponding processor. In the case of emulated multithreading, the data distribution is unchanged, but the work is distributed among the p·t threads on the p processors in the same way, as if p·t processors were present. Apart from thread initialization, this is the only major change required to support emulated multithreading.

5. Evaluation : Compaq XP1000

As described in Chapter 2, emulated multithreading is designed to tolerate long latency events by using fine-grained multithreading and asynchronous communication. The implementation described in Chapter 3 is targeted at tolerating remote memory references in massively parallel processors, since such accesses cause long latencies and are identified easily. However, a basic understanding about the behavior of programs using emulated multithread- ing as well as the associated overhead can already be obtained from an eval- uation on a single-processor workstation. The Compaq XP1000 workstation used in the single-processor evaluation of emulated multithreading is based on the 21264 implementation of the Alpha architecture. Compared to the 21164 processor used in the Cray T3E, the 21264 is better suited for emulated mul- tithreading: The 21264 has significantly larger first-level caches and supports out-of-order execution and sophisticated branch prediction. Hence the evalu- ation on the Compaq XP1000 workstation and the Cray T3E should provide useful insights into the overhead associated with emulated multithreading on different implementations of the Alpha architecture. In addition, perform- ing the evaluation on a local workstation is more flexible compared to an external system like the Cray T3E, since the workstation can be configured as needed. This chapter covers the evaluation of emulated multithreading on single-processor systems, while the evaluation of multithreading on massively parallel processors is described in Chapter 6. Emulated multithreading is evaluated on the Compaq XP1000 worksta- tion using the six benchmarks described in Chapter 4, i.e. fft, lu, radix, ocean, barnes and fmm. For each benchmark, several experiments are performed to determine the runtime of emulated multithreading across a wide range of thread numbers, problem sizes and code conversion options. These runtimes are subsequently compared with the runtimes obtained via single-threaded execution (base) as well as multithreaded execution using POSIX threads (posix). Section 5.1 describes the architecture and software environment of the Compaq XP1000 workstation. The experimental methodology is covered in Section 5.2, while Section 5.3 describes the impact of the code conversion process on the size and structure of the assembler sources. Sections 5.4 to 166 5. Evaluation : Compaq XP1000

5.9 describe the experimental results for the individual benchmarks, while Section 5.10 summarizes the results for all benchmarks.

5.1 Compaq XP1000

The Compaq XP1000 is a single-processor workstation based on the Alpha 21264 microprocessor and the 21272 chip set. The internal architecture of the system is illustrated in Figure 5.1. The information presented in this section was gathered from several sources, i.e. [Cor99][AXP00a][AXP99].
The 21264 microprocessor operates at 500 MHz and is connected to the 21272 chip set by the 15 bit system address and 72 bit (64 bit data) system data bus. These busses operate at 333 MHz, yielding a maximum bandwidth of 2.6 GB/s for the system data bus. The 4 MB second-level cache is connected by a dedicated 128 bit data bus, a 20 bit tag bus as well as a 24 bit address bus. These busses operate at 166 MHz, yielding a bandwidth of 2.6 GB/s for the cache data bus. The internal architecture of the 21264 microprocessor is described in Section 5.1.1.
The 21272 chip set consists of three different chips, i.e. the Cchip, Dchip and Pchip, and can be used for one-, two-, or four-processor systems. The XP1000 workstation uses a combination of one Cchip, four Dchips and two Pchips. The Cchip is connected to the system address bus of the 21264 microprocessor by two unidirectional, clock-forwarded busses: One of the busses is used to issue commands and addresses from the processor to the Cchip, while the other is used to transfer results and addresses from the Cchip to the processor. Both busses are 15 bit (13 bit data) wide and operate at 333 MHz. However, it takes at least four cycles to transfer one complete command or address across one of these busses. The Pchips are connected to the Cchip by a bidirectional bus that is shared between the two Pchips. This 24 bit bus operates at 83 MHz and is used to transfer commands and addresses from the Cchip to the Pchips and vice versa. Note that it takes at least two cycles to transfer a command or address across this bus. The four Dchips are connected to the Cchip by a shared bidirectional 13 bit wide bus operating at 83 MHz. The Cchip provides two identical copies of this bus, two Dchips share one of the copies. Finally, the Cchip controls the two memory arrays via a unidirectional 13 bit bus operating at 83 MHz. The architecture of the Cchip is covered in Section 5.1.2.
The four Dchips are connected to the system data bus of the 21264 microprocessor by a 72 bit (64 bit data) bidirectional, clock-forwarded bus operating at 333 MHz, yielding a bandwidth of 2.6 GB/s. Each Dchip is connected to an 18 bit (16 bit data) segment of the system data bus. The two memory arrays are connected to the Dchips by a 288 bit (256 bit data) bus operating at 83 MHz, yielding a bandwidth of 2.6 GB/s. Note that the bandwidth of the memory data bus is identical to the bandwidth of the system data bus. Each Dchip is connected to a 72 bit (64 bit data) segment of the memory data bus.

Fig. 5.1. Architecture of the Compaq XP1000 workstation


In addition, the Dchips are connected to the Pchips and the Cchip as described above. The architecture of the Dchip is covered in Section 5.1.3. The Pchips are connected to the Dchips by two bidirectional 36 bit (32 bit data) busses operating at 83 MHz, yielding a bandwidth of 333 MB/s per bus. Each Dchip is connected to the two Pchips by a 9 bit (8 bit data) segment of both busses. The Pchips provide two PCI busses and are described in Section 5.1.4. Section 5.1.5 describes the memory system, while Section 5.1.6 describes the peripherals used in the XP1000 workstation.
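The bandwidth figures quoted in this overview follow directly from the data width and the clock rate of the respective bus:

system data bus: 8 byte × 333 MHz ≈ 2.6 GB/s
second-level cache data bus: 16 byte × 166 MHz ≈ 2.6 GB/s
memory data bus: 32 byte × 83 MHz ≈ 2.6 GB/s
Dchip-Pchip busses: 4 byte × 83 MHz ≈ 333 MB/s per bus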

5.1.1 Processor

The 21264 microprocessor is an implementation of the Alpha architecture, a 64 bit architecture developed by Digital Equipment Corporation. A detailed description of this architecture as well as all implementations can be found in Appendix A. The 21264 microprocessor is a super-scalar processor that can issue up to six instructions (four integer and two floating-point) in each cycle, although the sustained issue rate is limited to four instructions due to the fetch bandwidth. The instructions are issued out-of-order, i.e. as soon as their arguments are available. However, all instructions are retired in program order to ensure sequential program semantics. Register renaming is used to resolve name dependencies, i.e. the 21264 processor contains 80 integer and 72 floating-point registers that are mapped to the 32 integer and 32 floating-point registers defined by the Alpha architecture.

The four integer execution pipelines are organized in two clusters. One execution pipeline in each cluster supports simple integer instructions as well as integer and floating-point load and store instructions. In addition, one of the execution pipelines supports integer multiply instructions, while the other supports motional video instructions. All integer execution pipelines support 64 bit and 32 bit operands. The two floating-point execution pipelines sup- port all arithmetic instructions on single- and double-precision floating-point operands as well as floating-point control instructions. Note that floating- point division is handled by a non-pipelined divider that is associated with one of the execution pipelines. The 21264 microprocessor contains two first-level data and instruction caches and supports an external second-level cache. The data cache is a 64 KB two-way set associative cache with 64 byte cache-lines that uses a write-back, write-allocate protocol. The data cache is dual-ported and supports two in- dependent accesses in each cycle. The instruction cache is a 64 KB two-way set-associative cache with 64 byte cache-lines. The external second-level cache is a direct-mapped cache ranging from 1 to 16 MB in size connected by a ded- icated 128 bit data bus. The Compaq XP1000 uses a 4 MB second-level cache that operates at 166 MHz, yielding a bandwidth of 2.6 GB/s between the cache and the 21264 microprocessor. Note that each integer load that hits in the first-level cache incurs a latency of three cycles, while each floating- point load that hits in the first-level cache incurs a latency of four cycles. Loads that miss the first-level caches, but hit in the second-level cache, incur a latency of 12 cycles or more. In order to increase performance of load and store instructions, the 21264 microprocessor contains a miss address file that buffers and merges outstanding loads as well as a write buffer that buffers and merges stores. The external data bus is 72 bit (64 bit data) wide and operates at 333 MHz, yielding a bandwidth of 2.6 GB/s. Note that the 21264 uses two unidirec- tional address and command busses instead of a single bidirectional bus, one for each direction. The Alpha architecture defines a 64 bit virtual address space, implementations are required to support at least 43 bit virtual address space as long as the unused bits are checked to be zero. The 21264 micropro- cessor implements a 48 bit virtual address space as well as a 44 bit physical address space. Detailed information about the internal architecture of the 21264 microprocessor can be found in Section A.3.9.

5.1.2 Cchip

The Cchip controls the memory and I/O subsystems by receiving requests from the processor and the Pchips and issuing commands to the Dchips and Pchips. The actual data transfers are performed by the Dchips and Pchips, i.e. the Cchip does not perform any data transfers. The Cchip provides two or four independent system address ports that connect to the individual proces- sors. Each system address port consists of two unidirectional, clock-forwarded 5.1 Compaq XP1000 169 busses: One to transfer command and addresses from the Cchip to the pro- cessor, the other to transfer commands and addresses from the processor to the Cchip. Note that it takes up to four cycles to transfer a single command across one of these busses. Each bus is 15 bit wide and operates at 333 MHz. The Cchip provides four memory command and address ports, i.e. can control up to four independent banks of synchronous DRAM. Each of these ports consists of 15 address and 8 control signals and operates at 83 MHz. The memory command and address ports are only used to issue memory requests, the actual data transfers are handled by the Dchips. The Cchip provides a command and address port that is used to control up to two Pchips. The port is shared between the Pchips and consists of 24 address and 11 control signals and operates at 83 MHz. However, it takes at least two cycles to transfer a single command across these busses. The Cchip provides an unidirectional bus to control up to eight Dchips. There are two identical copies of this bus, each one supports up to four Dchips. Each control bus is 13 bit wide and operates at 83 MHz. Finally, the Cchip provides an 8bit˙ bus to handle external interrupts, flash memory and other auxiliary devices. The internal architecture of the Cchip is based on a central arbiter as well as four request queues, one for each memory array. Requests arriving from the processors and Pchips are placed in a central dispatch register before they are forwarded to the appropriate request queue. Even requests that do not access the memory arrays are placed in the request queues for the memory arrays. If a request cannot be placed in the dispatch register, it is temporarily held in so-called skid buffers. The arbitration logic selects requests from the memory request queues and issues up to three commands per request to the processor, memory, Dchips, or Pchips. In addition, the arbitration logic ensures that ordering requirements for the individual requests are maintained.

5.1.3 Dchip

The Dchips are responsible for the actual data transfers initiated by the Cchip. The Dchip provides two memory data ports that connect the Dchip to the memory arrays. Each port is 36 bit (32 bit data) wide and operates at 83MHz.˙ In addition, the ports can be configured as a single 72 bit (64 bit data) port as well as two 18 bit (16 bit data) ports. The Compaq XP1000 uses four Dchips to interface the 288 bit (256 bit data) wide memory arrays, i.e. has to use the single port configuration. The Dchip provides four bidirectional, clock-forwarded processor data ports, one for each processor. Each port is 9 bit (8 bit data) wide and operates at 333 MHz, yielding a bandwidth of 333 MB/s. The ports can be configured as a single 36 bit (32 bit data) or two 18 bit (16 bit data) wide ports. The Compaq XP1000 is a single processor system and contains four Dchips that interface to the 64 bit system data bus, hence the processor data ports on the Dchips are configured as two 18 bit ports, although only one port is used. 170 5. Evaluation : Compaq XP1000

The Dchip provides two bidirectional data ports for the Pchips, each port is 9 bit (8 bit data) wide and operates at 83 MHz. Depending on the number of Dchips and Pchips in a system, these ports can be configured as a single 18 bit (16 bit data) port or two multiplexed half-byte ports. The Compaq XP1000 uses four Dchips and two Pchips, hence the ports are configured a two single-byte ports. As described above, the Dchip is controlled from the Cchip by a bidirectional control bus. The internal architecture of the Dchip is based on a crossbar switch that connects the individual data ports. The Dchip supports transfers between the processors, Pchips and memory arrays as well as between the four processor ports and the two Pchip ports themselves. Note that transfers between the two memory arrays are not supported. In addition, the Dchip provides several queues that enable buffering on the processor, memory, and Pchip ports. The data is not interpreted in any way, i.e. there is no kind of error checking and reporting.

5.1.4 Pchip

The Pchip provides a 64 bit PCI interface operating at 33 MHz. The Pchip is controlled by the Cchip via the Cchip command and address port, a bidi- rectional 24 bit port operating at 83 MHz. The data from and to the PCI interface is transferred to and from the Dchips via a 36 bit (32 bit data) bidi- rectional bus operating at 83 MHz. This bus can be configured as a 36 bit bus that transfers four bytes per cycle or a 40 bit bus that transfers eight nibbles per cycle. The latter configuration is used in systems with eight Dchips. The internal architecture of the Pchip is based on a central PCI bus arbiter and several queues that decouple the PCI bus from the Dchip data bus and the Cchip address bus. There are two separate queues for each direction, one for data and one for addresses, i.e. a total of four queues.

5.1.5 Memory

The Compaq XP1000 workstation supports two independent memory arrays, each array is controlled by one of the memory control ports of the Cchip. Each memory array consists of four 72 bit (64 bit data) memory modules and provides a 288 bit (256 bit data) data bus. Each of the four Dchips is connected to a 72 bit segment of the memory data bus as described above. The individual memory modules are industry-standard PC100 synchronous DRAM operating at 83 MHz.
Although all four modules in an array have to be populated, it is possible to populate only one of the two memory arrays. Interleaving increases performance if both arrays are populated. The Compaq XP1000 supports between 128 MB (four 32 MB modules) and 2 GB (eight 256 MB modules) of memory. The system used during the evaluation of emulated multithreading was configured with 640 MB of memory, i.e. 512 MB in one array and 128 MB in the other array.

5.1.6 Peripherals

As described above, each of the two Pchips in the Compaq XP1000 provides a 64 bit PCI bus operating at 33 MHz. One of the PCI busses is configured as a 32 bit bus and provides two 32 bit PCI slots as well as a PCI-PCI bridge that is used to connect a secondary PCI bus to the primary PCI bus. The secondary PCI bus contains the embedded Ethernet and SCSI controllers. The Intel 21143 Ethernet controller provides a full-duplex 10/100 Mb Ether- net interface, while the Qlogic 1040 SCSI controller provides an ultra-wide SCSI interface. The other PCI bus is configured as a 64 bit bus and provides two 64 bit PCI slots as well as one 32 bit PCI slot. Apart from the PCI slots, the PCI bus contains the Cypress 82C693 multi-function controller that provides two EIDE ports, two USB ports, as well as keyboard and mouse ports. In addition, the controller contains a PCI-ISA bridge, a realtime clock, as well as an interrupt controller. The PCI-ISA bridge provides a 16 bit ISA bus that is connected to a 16 bit ISA slot as well as multi-I/O and sound chips. The multi- I/O chip provides a floppy drive interface, a parallel and two serial ports, while the ESS1887 sound chip provides support for simultaneous playback and record of 16 bit audio data. The Compaq XP1000 workstation used during the evaluation of multi- threading is configured with a 4.3 GB Seagate Cheetah hard disk, a 32-speed CD-ROM drive, as well as a standard floppy drive. The hard disk is connected to the embedded SCSI controller, the CD-ROM drive is connected to one of the two EIDE ports in the multi-function controller, and the floppy drive is connected to the corresponding interface provided by the multi-I/O chip. Graphics is provided via an Elsa Gloria Synergy card based on the 3Dlabs Permedia2 chip and 8 MB of SGRAM. The PCI graphics controller uses one of the 64 bit slots on the first PCI bus.

5.1.7 Software Environment

The Compaq XP1000 workstation runs under the Tru64 UNIX operating system formerly called Digital Unix. The Tru64 operating system is a 64 bit system based on the Mach kernel and is compliant with the UNIX98 and System V Release 4 standards. A detailed description of the Tru64 operating system is outside the scope of this chapter, corresponding information can be found in [Tru01]. The development environment is provided by the Tru64 Developers Toolkit. The toolkit contains a C compiler, debugger, profiling and analysis tools as well as the Spike post-link optimizer. The C compiler is an ANSI-C 172 5. Evaluation : Compaq XP1000 compliant implementation of the C language that produces highly optimized code. The ladebug debugger supports source-level debugging of Ada, C, C++, and Fortran languages as well as debugging of multithreaded applications. The profiling and analysis tools consist of the third degree memory profiling, pixie basic block profiling, and hiprof execution profiling tools. These tools are based on the ATOM framework [SE94] which can be used to implement custom-specific analysis tools. The visual threads tool supports the analysis of multithreaded applications. Finally, the Spike post-link optimizer supports whole-program optimizations and is based on the OM framework [SW93]. The Digital Continuous Profiling Infrastructure (DCPI) [ABD+97] permits continuous profiling of all processes including the kernel with low overhead. The tool is based on statistical sampling of the performance counters in the 21264 microprocessor. Unfortunately, the Compaq XP1000 workstation used during the evaluation of emulated multithreading still contains the original 21264 as opposed to the newer 21264A microprocessor. The latter implemen- tation supports ProfileMe counters [DHW+97] that can be used to gather a complete execution profile of selected instructions instead of the summary statistics provided by traditional performance counters. The Compaq XP1000 workstation used during the evaluation of emu- lated multithreading runs under version 5.0A of the Tru64 operating system, although the operating system was subsequently updated by installing the second aggregate patch cluster. The workstation is configured as a stand- alone system and stripped down to the essential system services: Apart from the ssh daemon, all non-essential daemons and processes were deactivated. The installed version of the development tools consists of Compaq C version 6.4 , Visual Threads version 2.0, Ladebug version 4.0, Graphical Program Analysis Tools version 3.1, Program Analysis Tools version 2.0, and Spike version 5.1. In addition, the DCPI version 3.9.2.2 toolkit was used to gather profiling information via the performance counters.

5.2 Methodology

All three versions of the benchmarks, i.e. base, posix and emulated multi- threading, are based on the original sources as distributed in the SPLASH2 benchmark suite. For each benchmark, the corresponding source and header files are processed by the m4 macro processor [Sei00] in order to replace the original PARMACS macros. The resulting source and header files are subse- quently processed by the compiler. The compiler uses the same set of opti- mization flags for all versions of the benchmarks: -fast and -arch=ev6. The former flag enables a collection of optimizations by assuming ANSI-compliant aliasing, enabling intrinsics, reordering of floating-point instructions, inter-file optimization, and faster math routines. In addition, the implied optimization level enables local optimizations, recognition of common subexpressions, in- line expansion of static and global procedures, as well as global optimizations 5.2 Methodology 173 like code motion, strength reduction, test replacement, split lifetime analy- sis, loop unrolling, code replication, and scheduling. The latter flag enables the use of instructions that belong to one of the Alpha architecture exten- sions supported by the 21264 microprocessor. In addition, the code scheduling phase produces code that is tuned to the 21264. The original sources were modified such that the cycle counter is used for timing measurements. Since the cycle counter is incremented in each clock cy- cle, using this cycle counter enables very accurate timing measurements with high resolution. Access to the cycle counter is provided by the performance counter library (PCL) [BM98], i.e. the PARMACS macros used for timing measurements are substituted by calls to the corresponding routines of this library. The results of the timing measurements are rather large integers and have to be represented by 64 bit integers, hence the formatting used during the output of the timing results had to be adapted as well. In the case of emulated multithreading, the assembler source produced by the compiler is processed by the assembler converter as described in Sec- tion 3.5. The project-specific configuration files are created manually as the high-level language converter has not been implemented yet. The converted assembler sources are subsequently passed to the assembler. The assembler sources are explicitly created for all versions of the benchmarks, even for the base and posix versions. Some compilers perform additional optimizations, e.g. substitution of static arrays by dynamic arrays created at runtime, if they are allowed to create the executable directly. For the base version, all invocations of the PARMACS macros are replaced during m4 macro processing in the following way: All declarations, e.g of locks and barriers, are replaced by integer variables, the other macros are replaced by empty statements. The resulting code can only be used on single-processor machines as all parallelization constructs have been effectively removed. The posix version uses an implementation of the PARMACS macros based on POSIX threads [ANMB97][AMBN98]. This implementation was provided by Ernest Artiaga from the Universitat Polytecnica de Catalunya (UPC) in Barcelona, Spain. All invocations of the PARMACS macros in the original sources are replaced during m4 macro processing by calls to the corresponding routines in the provided library. Note that the implementation used during the evaluation of emulated multithreading is configured to use advanced locks instead of the simpler spin locks. 
For the versions using emulated multithreading, the invocations of the PARMACS macros are replaced by calls to the corresponding routines of the emulation library: The BARRIER() macro is substituted by a call to the EMUthread_barrier() routine, the LOCK() and ALOCK() macros are substituted by macros based on the EMUthread_cswap() routine. All declarations are replaced by integer types, the GMALLOC() macro is replaced by a call to the standard malloc() routine followed by memset() such that the allocated memory blocks are initialized immediately. This is necessary as the SPLASH2 benchmarks frequently assume that GMALLOC() returns an initialized block of memory and therefore fail to initialize such arrays properly.
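The following sketch illustrates in C what these substitutions amount to; the macro argument lists, the macro bodies, and the signatures of the EMUthread routines are assumptions for illustration and do not reproduce the actual m4 definitions used during code conversion.

#include <stdlib.h>
#include <string.h>

/* Assumed signatures of the emulation library routines named above. */
extern void EMUthread_barrier(void);
extern long EMUthread_cswap(volatile long *addr, long oldval, long newval);

/* Illustrative replacements for the PARMACS macros. */
#define BARRIER(name, n)   EMUthread_barrier()
#define LOCK(l)            while (!EMUthread_cswap(&(l), 0, 1)) /* spin */ ;
#define UNLOCK(l)          ((l) = 0)   /* release, assumed counterpart to LOCK */

/* GMALLOC() returns zero-initialized memory, as the benchmarks expect
 * (error handling omitted; size is evaluated twice in this sketch). */
#define GMALLOC(size)      memset(malloc(size), 0, (size))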

3 × (6 × (1 + 5 × (5 + 1))) = 558

experiments were performed, 93 for each benchmark: three problem sizes, six benchmarks, and for each combination one run of the base version plus five thread counts for each of the five versions using emulated multithreading and the posix version. All runtimes reported by the benchmarks were automatically extracted from the corresponding output and reassembled such that there is one file for each combination of benchmark, version, and problem size that contains the corresponding runtimes for all numbers of threads. The automatic extraction is based on a combination of shell [Bli96] and awk [DR97] scripts that were extensively tested in order to ensure the accuracy of the results. There are two different versions of each file: one version contains the runtime measurements as a number of clock cycles, the other contains the runtimes in seconds. Throughout this chapter, only the latter version of the files is used since the corresponding values are more intuitive.

All data files are formatted to allow a direct import into the xmgrace application that was used to create the corresponding figures. The hotlink feature of the xmgrace application was used to link the individual figures to the corresponding data files, i.e. updates to the data files are automatically reflected in the figures. The figures used to illustrate the results of the individual benchmarks are all structured in the same way: The horizontal axis reflects the number of threads, while the vertical axis represents the runtime in seconds. In each figure, seven curves are used to illustrate the results: The circles represent the runtime of the base version, while the squares, diamonds, upward triangles, leftward triangles, downward triangles, and rightward triangles represent the runtime of the g004, g016, g064, bblk, sblk, and posix versions, respectively. Note that a vertical baseline is provided in order to make the base results better identifiable. Three of these figures are provided for each of the six benchmarks, one for each problem size.
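Returning to the macro substitution used for the versions with emulated multithreading, the following minimal sketch illustrates the mapping described at the beginning of this section. Only the routine names EMUthread_barrier() and EMUthread_cswap() and the malloc()/memset() combination are taken from the description above; the prototypes and macro bodies are assumptions made for illustration, not the project's actual macro file:

    #include <stdlib.h>
    #include <string.h>

    /* Assumed prototypes of the emulation library routines (illustrative): */
    extern void EMUthread_barrier(int *bar, int nthreads);
    extern int  EMUthread_cswap(int *addr, int oldval, int newval);

    /* BARRIER(bar, n) -> call into the emulation library */
    #define BARRIER(bar, n)   EMUthread_barrier(&(bar), (n))

    /* LOCK(l)/ALOCK(...) -> spin on the compare-and-swap routine; the
     * 0/1 lock encoding is an assumption made for this sketch */
    #define LOCK(l)           while (!EMUthread_cswap(&(l), 0, 1)) { }
    #define UNLOCK(l)         ((l) = 0)

    /* GMALLOC(size) -> malloc() followed by memset() so that the block is
     * returned in an initialized state, as the SPLASH2 codes expect */
    #define GMALLOC(size)     memset(malloc(size), 0, (size))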

5.3 Code Conversion

This section characterizes the impact of the code conversion process on the size and structure of the assembler sources for all six benchmarks. The corresponding information is gathered from the statistics for the original and modified code provided by the assembler converter. Figures 5.2, 5.3, 5.4, 5.5, 5.6, and 5.7 illustrate the corresponding statistics for the fft, lu, radix, ocean, barnes, and fmm benchmarks, respectively. Note that these statistics only cover the internal procedures of each benchmark as all other procedures are not affected by the code conversion process. All figures are structured in the same way and contain the statistics for the original and modified sources for each of the five different versions of a benchmark. For each version, the left bar represents the instruction mix for the original assembler source. Note that this is the same for all versions of a given benchmark, but is replicated to ease comparison between original and modified instruction mixes. Each bar consists of a stack of segments, where the height of the individual segments represents the number of instructions from the corresponding instruction group.

There are ten different instruction groups: The int memory group contains all instructions that load and store integer data types to and from memory as well as instructions used to initialize constants. The int control group contains direct and indirect branches based on integer data types. The int operate group contains all arithmetic instructions that operate on integer data types, while the int logic group contains logical and shift instructions and conditional moves. The int byte group contains all instructions defined in the byte and word extension to the Alpha architecture as described in Section A.2.3, with the exception of the corresponding load and store instructions. The int media group contains all instructions defined in the motion video extension to the Alpha architecture as defined in Section A.2.3. The int misc group contains all integer instructions that do not belong to any of the other instruction groups. The fp mem group contains all instructions that load and store floating-point data types to and from memory, similar to the int mem group for integer data types. The fp control group contains only conditional branches that operate on floating-point data types, as there are no unconditional or indirect branches of this kind. The fp operate group contains all arithmetic instructions that use floating-point data types. The individual instructions are assigned to one of these instruction groups according to the information provided in the platform-specific configuration file.

Looking at the figures, it is evident that the total number of instructions in the modified assembler sources decreases with growing instruction block size in all cases: The g004, g016, and g064 versions restrict the instruction blocks to 4, 16, and 64 instructions, respectively, while the bblk and sblk versions use basic and super blocks as instruction blocks. Compared to the bblk version, restricting the instruction block size to 64 instructions causes only a small increase in the number of instructions. This indicates that almost all basic blocks are smaller than 64 instructions anyway.

Fig. 5.2. Original and Modified Instruction Mix for the FFT Benchmark

Fig. 5.3. Original and Modified Instruction Mix for the LU Benchmark

In the case of the lu benchmark, basic blocks seem to be no larger than 16 instructions, since the g016, g064, and bblk versions use an almost identical number of instructions. For all other benchmarks, using a restricted instruction block size of 16 instructions causes a slight increase in the total number of instructions compared to the bblk version. Using a restricted instruction block size of 4 instructions causes a noticeable increase in the number of instructions compared to the bblk version.

Fig. 5.4. Original and Modified Instruction Mix for the RADIX Benchmark

Fig. 5.5. Original and Modified Instruction Mix for the OCEAN Benchmark

Restricting the size of the instruction blocks increases the number of instruction blocks as well as the total number of instructions, since context switch code has to be generated for each instruction block. The statistics for the converted assembler sources support this argument: Compared to the bblk version, the g004, g016, and g064 versions of all benchmarks use more instructions from the int mem, fp mem, and int control groups, while the number of instructions from the other groups is not affected by restricting the instruction block size.

Fig. 5.6. Original and Modified Instruction Mix for the BARNES Benchmark

Fig. 5.7. Original and Modified Instruction Mix for the FMM Benchmark

The increase in the number of instructions from the int mem and fp mem groups, i.e. integer and floating-point load and store instructions, is caused by the increased number of save and restore operations due to the larger number of context switches. The increase in the number of instructions from the int control group is caused by the subroutine return instruction that is used at the end of each instruction block.

The super block optimization seems to be quite effective as it reduces the number of instructions noticeably: Compared to the bblk version, the sblk version uses fewer instructions from the int mem, fp mem, int ctrl, and int logic groups. The number of instructions from the other groups is not affected by super block optimization. These changes are caused by merging multiple basic blocks into one super block: Since live ranges are allowed to cover several basic blocks as long as these blocks belong to the same super block, fewer save and restore operations are necessary and the number of load and store instructions is reduced. If the live range belongs to an integer register, the number of instructions from the int mem group is reduced; if the live range belongs to a floating-point register, the number of instructions from the fp mem group is reduced. However, additional save and restore operations may be needed to handle side entrances to the live range if the live range covers more than one basic block. As the total number of instructions from the int mem and fp mem groups still decreases significantly, the latter effect is usually minor. The reduced number of instructions from the int control group is caused by the reduced number of instruction blocks, i.e. the return instruction is placed at the end of each super block instead of each basic block. The instruction sequences used to calculate the target of a branch contain conditional moves, which were assigned to the int logic group. As long as the branch targets are inside the same super block, these instruction sequences can be replaced by a single branch instruction, hence the reduced number of instructions from the int logic group.

Although the super block optimization is quite effective in reducing the number of instructions, the sblk versions of all six benchmarks still use up to twice as many instructions as the corresponding base versions, hence super block optimization should be improved further, e.g. by using profiling information. However, this applies only to the internal procedures, i.e. the increase in the total number of instructions for the whole program depends on the number and size of the internal procedures. The impact of the increased code size on the performance of the individual benchmarks is discussed in the following sections.
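The cost model behind these numbers can be pictured with a small, purely conceptual sketch of the main loop in the thread execution routine. It is not the actual implementation described in Chapter 3, and all names and types are assumptions made for illustration:

    #include <stddef.h>

    /* Conceptual sketch: each instruction block is represented by a small
     * function that restores the registers it needs from the thread
     * descriptor, executes its instructions, saves the live registers back,
     * updates the resume point, and returns to the main loop. */
    typedef struct thread_descr thread_descr;
    typedef void (*instr_block)(thread_descr *t);

    struct thread_descr {
        instr_block next_block;   /* resume point; NULL when finished        */
        long        regs[64];     /* saved integer and floating-point state  */
    };

    /* Main loop of the thread execution routine: visit the threads in
     * round-robin order and execute one instruction block per visit. */
    static void thread_execution(thread_descr *threads, int nthreads)
    {
        int live = nthreads;
        while (live > 0) {
            for (int i = 0; i < nthreads; i++) {
                thread_descr *t = &threads[i];
                instr_block block = t->next_block;
                if (block == NULL)
                    continue;              /* this thread has finished       */
                block(t);                  /* block updates t->next_block    */
                if (t->next_block == NULL)
                    live--;
            }
        }
    }

In the actual implementation the instruction blocks are generated at the assembler level by the converter; the sketch merely illustrates why a smaller instruction block size leads to more iterations of the main loop and to more save and restore instructions.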

5.4 FFT

The project-specific configuration file for the fft benchmark contains three internal and six external procedures; system and library routines are covered in the platform-specific configuration file. The SlaveStart() procedure is the entry point of the parallel algorithm and consists of a call to the FFT1D() procedure as well as some initialization and bookkeeping tasks. The FFT1D() procedure implements the six-step FFT algorithm described in Section 4.2.1, while the Transpose() procedure implements a blocked matrix transpose.

The first two procedures have to be internal since they contain calls to other internal procedures as well as barrier synchronizations, while the last procedure does not need to be internal. However, all inter-processor communication occurs in this procedure, hence the Transpose() procedure has to be internal in the parallel version of the fft benchmark used on the Cray T3E. In order to facilitate comparisons between the benchmarks on both platforms, this procedure is chosen to be internal as well. The super block optimization is quite effective, as the three internal procedures consist of 23, 42, and 43 basic blocks and 6, 9, and 1 super blocks, respectively.

The results of the experiments using the fft benchmark are summarized in Figures 5.8, 5.9, and 5.10. Figure 5.8 illustrates the results using a problem size of 64 K complex data points, while Figures 5.9 and 5.10 illustrate the results using problem sizes of 256 K and 1024 K, respectively. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.8, 5.9, and 5.10, respectively.

Using a problem size of 64 K complex data points, the corresponding array occupies 1 MB of memory, since double-precision floating-point values are 8 bytes long. Note that the fft benchmark aligns each row of the two-dimensional matrix on cache-line and pagesize boundaries, hence the matrix is usually slightly larger than 1 MB. There are three different arrays of this size: Two arrays are used as source and target during matrix transpose and contain the actual data matrix, while the third array holds the roots-of-unity matrix. In addition, the first row of the roots-of-unity matrix is replicated by each thread. All code and data segments of the fft benchmark should fit in the 4 MB second-level cache used in the Compaq XP1000 workstation, at least for the given problem size. According to [WOT+95], the first-level working set of the fft benchmark is one row of one of the matrices, i.e. the square root of the problem size. The second-level working set is one partition of the whole data set, i.e. the size of the whole data set divided by the number of threads. In the case of multithreading, the individual threads share the cache resources, hence the working set as seen from the processor is probably identical to the whole data set.

An analysis of the data presented in Figure 5.8 yields the following results: The measured runtimes increase with the number of threads. On the one hand, each thread uses additional memory for the thread descriptor, thereby potentially increasing the number of cache misses. In the case of emulated multithreading, each thread descriptor requires 544 bytes of storage, at least on the current platform. On the other hand, the number of iterations for the main loop in the thread execution routine increases with the number of threads, thereby increasing the overhead associated with emulated multithreading.
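As a quick check of the sizes quoted above (a back-of-the-envelope calculation, assuming that each complex data point consists of two 8-byte double-precision values, i.e. 16 bytes):

    64 K × 16 bytes = 65 536 × 16 B = 1 MB per matrix
    3 matrices × 1 MB ≈ 3 MB < 4 MB (second-level cache)

This is consistent with the claim that the whole data set fits in the second-level cache for the smallest problem size.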

Fig. 5.8. Results for the FFT Benchmark (64 K Complex Data Points)

Fig. 5.9. Results for the FFT Benchmark (256 K Complex Data Points)

The only exceptions to this rule are the sblk and posix versions, where going from one to two threads decreases the runtime; afterwards the runtime increases with the number of threads. This behavior can be explained in the following way: If one thread is used, the transpose routine will transpose the whole matrix as a single block, accessing the source matrix in column order. As the individual rows are aligned on cache-line boundaries, 256 cache-lines are required to ensure that elements of the next column are already in the cache.

Fig. 5.10. Results for the FFT Benchmark (1024 K Complex Data Points)

Recall that the 21264 microprocessor uses a 64 KB two-way set-associative first-level data cache with 64 byte cache-lines, i.e. 512 cache-lines in each set. As the cache uses a write-allocation protocol, some cache-lines will be allocated by writes to the target matrix, although the writes occur in row order, i.e. along cache-lines. Therefore transposing the matrix with a single thread is likely to cause cache misses in the first-level data cache. However, using p threads will eliminate these cache misses as each thread transposes the matrix in √n/p × √n/p blocks, thereby allocating fewer cache-lines. Note that the above reasoning does not apply if several threads execute the Transpose() procedure simultaneously. If all threads transpose the matrix at the same time, the number of allocated cache-lines is comparable to the single-threaded case. This is supported by the fact that only the sblk version benefits from multiple threads: As the Transpose() procedure consists of a single large super block, no context switches occur during execution of the procedure, hence only one thread performs the transpose at any given time. All other versions using emulated multithreading perform context switches during the matrix transpose. As the transpose is always preceded by a barrier synchronization, it is likely that all threads execute the Transpose() procedure simultaneously. However, if the procedure is external, all versions using emulated multithreading show the same behavior as the sblk version.

The runtimes of the g004, g016, g064, and bblk versions reflect the total number of instructions in the corresponding versions: The bblk and g064 versions have almost identical runtimes, the g016 version is slightly slower, and the g004 version is significantly slower than the base version. Another interesting fact is that the sblk version using one thread has a runtime comparable to the base version, hence the overhead associated with emulated multithreading is negligible in this case. Compared to the posix version, the sblk version is always faster, although the Tru64 operating system includes an efficient implementation of POSIX threads.

Using a problem size of 256 K complex data points, the three data matrices occupy slightly more than 4 MB of memory, hence the whole dataset will no longer fit in the second-level data cache. The size of the first- and second-level working sets quadruples as well, i.e. the first-level working set occupies 16 KB. The corresponding runtimes are illustrated in Figure 5.9. Apart from the overall increased runtime, the results are comparable to the ones using the smaller problem size. The exceptions are the sblk and posix versions, as the reduction in runtime now occurs by using more than four as opposed to more than one thread. This behavior can be explained by the reduced number of misses in the first-level data cache, as the matrix is transposed in blocks of 64 KB or less if at least eight threads are used. Recall that each thread transposes one block locally and p − 1 blocks to other threads, i.e. each thread transposes blocks of size √n/p × √n/p, where n and p are the problem size and the number of threads, respectively.

Using a problem size of 1024 K complex data points, each of the three matrices occupies slightly more than 16 MB of memory. The size of the first- and second-level working sets quadruples again, i.e. the size of the first-level working set is equal to the size of the first-level data cache for the given problem size.
The corresponding runtimes are illustrated in Figure 5.10. Apart from the overall increased runtimes, the results are comparable to the earlier ones using smaller problem sizes. Note that at least 16 threads are required for the runtime reduction mentioned above as the transpose uses 64 KB blocks in this case.

In summary, the results using the fft benchmark are quite encouraging: although there are no references to remote memory to be tolerated, the overhead associated with emulated multithreading is smaller than expected, especially if the sblk version is used. For all problem sizes and numbers of threads, the runtimes for the sblk version are smaller than the runtimes for the base version and are either smaller or comparable to the runtimes of the posix version.
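The thread-count thresholds observed above follow from the blocked transpose; a rough calculation, again assuming 16-byte complex elements and the √n/p × √n/p blocking described earlier:

    n = 256 K:  (√n/p)² × 16 B = (512/8)² × 16 B = 64 KB   for p = 8 threads
    n = 1024 K: (√n/p)² × 16 B = (1024/16)² × 16 B = 64 KB  for p = 16 threads

Only with eight and sixteen threads, respectively, does a single transpose block shrink to the size of the 64 KB first-level data cache, which matches the observation that the runtime reduction requires more than four and at least 16 threads for the two larger problem sizes.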

5.5 LU

The project-specific configuration file for the lu benchmark contains three internal and six external procedures; system and library calls are covered in the platform-specific configuration file. The SlaveStart() procedure is the entry point of the parallel algorithm and contains only a call to the OneSolve() procedure. The latter procedure contains a call to the lu() procedure as well as some initialization and bookkeeping tasks. The lu() procedure implements the decomposition algorithm described in Section 4.2.2 by calling several other procedures that perform the actual work on the individual blocks of the matrix. As the entry point to the parallel algorithm, the SlaveStart() procedure has to be internal; it contains a call to an internal procedure as well. The OneSolve() procedure has to be internal as it contains a call to an internal procedure as well as some barrier synchronizations. Last but not least, the lu() procedure has to be internal as it contains some barrier synchronizations. For these three internal procedures, the super block optimization is quite effective as the procedures consist of 3, 19, and 48 basic blocks and 2, 5, and 6 super blocks, respectively.

The results of the experiments using the lu benchmark are summarized in Figures 5.11, 5.12, and 5.13. Figure 5.11 illustrates the results using a 512 × 512 matrix size, while Figures 5.12 and 5.13 illustrate the results using 1024 × 1024 and 2048 × 2048 matrix sizes, respectively. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.11, 5.12, and 5.13, respectively.

Using a 512 × 512 matrix size, the corresponding array occupies 2 MB of memory. The matrix is partitioned into blocks of size 16 × 16, where each block occupies 2 KB of memory. According to [WOT+95], the first-level working set for the lu benchmark is one such block, while the second-level working set is one partition of the whole data set. In the case of multithreading, the individual threads share the cache resources, hence the working set as seen from the processor is probably identical to the whole data set. As the Compaq XP1000 workstation is configured with a 4 MB second-level cache, the whole data set will fit in this cache, at least for the given matrix size. In addition, a single block of the matrix will easily fit in the 64 KB first-level data cache provided by the 21264 microprocessor.

An analysis of the data illustrated in Figure 5.11 yields the following results: The runtimes increase with the number of threads for the multithreaded versions of the lu benchmark. This behavior was already observed during the analysis of the results for the fft benchmark and can be explained by the increased number of iterations for the main loop of the thread execution routine as well as the memory requirements for the additional thread descriptors. Note that all procedures that access the blocks of the matrix are external, i.e. no context switches occur during the execution of these procedures.

Of all versions using emulated multithreading, the sblk version is the fastest, while the g004 version is the slowest. The g016, g064, and bblk versions show almost identical runtimes, indicating that most basic blocks contain no more than 16 instructions.
In addition, the lu() procedure contains several external calls, hence the instruction blocks containing these calls are probably larger than 16 instructions as the restriction does not hold for these basic blocks.
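For reference, the array and block sizes quoted above follow directly from the 8-byte double-precision matrix elements:

    512 × 512 × 8 B = 2 MB (whole matrix),   16 × 16 × 8 B = 2 KB (one block)
    1024 × 1024 × 8 B = 8 MB   and   2048 × 2048 × 8 B = 32 MB for the larger matrix sizes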

Fig. 5.11. Results for the LU Benchmark (512 × 512 Matrix)

Fig. 5.12. Results for the LU Benchmark (1024 × 1024 Matrix)

The runtimes of the sblk version are slightly faster than the runtimes of the posix version and are almost identical to the base version in the single-threaded case. Although the runtimes increase with growing number of threads, the increase for the sblk version is slower than the increase for the other versions due to the smaller number of iterations for the main loop of the thread execution routine. Even with 16 threads, the sblk version is only 15 % slower than the base version.

Fig. 5.13. Results for the LU Benchmark (2048 × 2048 Matrix)

Using a 1024 × 1024 matrix, the corresponding array occupies 8 MB of memory and hence will no longer fit in the second-level cache. However, the size of a single block is unchanged, i.e. these 2 KB blocks will still fit in the first-level data cache. Apart from the overall increased runtimes, an analysis of the runtimes presented in Figure 5.12 yields the same results as for the smaller matrix size.

Using a 2048 × 2048 matrix, the corresponding array occupies 32 MB of memory, although the size of a single block is unchanged, i.e. these 2 KB blocks will still fit in the first-level data cache. Apart from the overall increased runtimes, an analysis of the runtimes presented in Figure 5.13 yields the same results as for the smaller matrix sizes.

In summary, the results of the experiments using the lu benchmark are quite encouraging, although the versions using emulated multithreading are slightly slower than the base version: Even though there are no remote memory references to tolerate, the overhead associated with emulated multithreading is smaller than expected, especially if super block optimization is used. For all matrix sizes, the sblk version is slightly faster than the posix version, although the Tru64 operating system contains an efficient implementation of POSIX threads. In addition, the runtime of the sblk version in the single-threaded case is almost identical to the runtime of the base version.

5.6 Radix

The project-specific configuration file for the radix benchmark contains one internal and three external procedures; system and library calls are covered in the platform-specific configuration file. The slavesort() procedure is the entry point of the parallel algorithm and implements the radix-sort algorithm as described in Section 4.2.3, while the three external procedures are only used during initialization of the sort array. Apart from being the entry point, the slavesort() procedure has to be internal as it contains several synchronization points, e.g. barriers and spin waits. The super block optimization is quite effective for this procedure, as the 160 basic blocks are reduced to only 26 super blocks.

The results of the experiments using the radix benchmark are summarized in Figures 5.14, 5.15, and 5.16. Figure 5.14 illustrates the results using a problem size of 256 K integers, while Figures 5.15 and 5.16 illustrate the results using problem sizes of 512 K and 1024 K integers, respectively. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.14, 5.15, and 5.16, respectively.

Using a problem size of 256 K integers, the corresponding array occupies 1 MB of memory, since the size of an int is 4 bytes. As the radix-sort algorithm does not sort in place, two of these arrays are needed, thereby occupying 2 MB of memory. In addition, each thread maintains a histogram of the local keys, and the size of the corresponding array depends on the selected radix: Using a radix of 1024, each array occupies 4 KB of memory. Since the Compaq XP1000 workstation uses a 4 MB second-level cache, all code and data for the radix benchmark should fit in this cache, at least for the given problem size. According to [WOT+95], the first-level working set is one of the histograms, while the second-level working set is the local sort array of size n/p, where n and p are the total number of keys and threads, respectively.

An analysis of the data presented in Figure 5.14 yields the following results: The runtimes increase with the number of threads, although this effect is more pronounced than for the fft and lu benchmarks, at least for the g004, g016, g064, and bblk versions of the radix benchmark. For the sblk and posix versions, the increase in runtime is barely noticeable. Like before, this behavior can be explained by the increased number of iterations for the main loop in the thread execution routine for smaller instruction blocks, as well as the memory requirements for the additional thread descriptors. The effect is more pronounced for the radix benchmark in the absence of the super block optimization, as the slavesort() procedure contains several loops with a large number of iterations. For example, each thread sorts all the local keys once in each iteration, i.e. the corresponding loop is traversed n/p times in each intermediate sort. Other loops are used to copy or calculate the histogram, i.e. those loops are traversed r times in each iteration, where r is the radix size. Without super block optimization, at least one traversal of the main loop in the thread execution routine is required for each loop iteration. However, most of these loops do not contain any synchronization points, hence these loops can be integrated into a single super block, thereby reducing the number of iterations of the main loop in the thread execution routine significantly.
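To give a feeling for the loop counts involved, a rough calculation based on the numbers quoted above (n = 256 K keys, radix r = 1024, and p = 16 threads):

    key loop:         n/p = 262 144 / 16 = 16 384 iterations per thread and intermediate sort
    histogram loops:  r = 1024 iterations each

Without super block optimization each of these iterations triggers at least one traversal of the main loop in the thread execution routine, which explains the pronounced slowdown of the g004, g016, g064, and bblk versions.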

Fig. 5.14. Results for the Radix Benchmark (256 K Integers)

Fig. 5.15. Results for the Radix Benchmark (512 K Integers)

The ratio between the runtimes of the g004, g016, g064, bblk, and sblk versions shows a well-known pattern: The sblk version is significantly faster than all other versions, and the g064 and bblk versions have almost identical runtimes. The g016 version is slightly slower than the bblk version, while the g004 version is noticeably slower than the bblk version. This indicates that most of the basic blocks contain no more than 64 instructions, a fact that is supported by the statistics for the original assembler sources.

Fig. 5.16. Results for the Radix Benchmark (1024 K Integers)

Using a problem size of 512 K integers, the two sort arrays occupy 4 MB of memory, hence the code and data of the radix benchmark will no longer fit in the second-level cache for the given problem size. However, the size of the histograms, i.e. 2 KB, is unchanged since the same radix is used for all problem sizes. Apart from the overall increased runtimes, an analysis of the runtimes illustrated in Figure 5.15 yields the same results as for the smaller problem size, although the posix version is now significantly slower than the sblk version.

Using a problem size of 1024 K integers, the two sort arrays occupy 8 MB of memory, while the size of the histograms, i.e. 2 KB, is unchanged since the same radix is used for all problem sizes. Apart from the overall increased runtimes, an analysis of the runtimes illustrated in Figure 5.16 yields the same results as for the smaller problem sizes, and the posix version is again significantly slower than the sblk version.

In summary, the results of the radix benchmark underline the importance of the super block optimization and the impact of the number of loop iterations on performance. Hence the super block optimization should be improved further, e.g. by using profiling information to guide the super block creation. Once again, the results are quite encouraging, as the sblk version is only slightly slower than the base version and is significantly faster than the posix version for the two larger problem sizes.

5.7 Ocean

The project-specific configuration file for the ocean benchmark contains three internal and 11 external procedures; system and library routines are covered in the platform-specific configuration file. The slave() procedure is the entry point for the parallel algorithm and performs one-time initialization as well as top-level flow control for the simulation. The slave2() procedure implements a single timestep of the ocean simulation that is divided into ten different phases, while the multig() procedure implements the multigrid solver used once during initialization and twice in each timestep.

These procedures have to be internal for several reasons: Apart from being the entry point, the slave() procedure contains five barrier synchronizations, one lock, as well as one call to another internal procedure. The slave2() procedure contains ten barrier synchronizations, one lock, as well as two calls to an internal procedure. The multig() procedure has to be internal as it contains five barrier synchronizations and one lock. Note that all those subroutines of the multigrid solver that handle the actual data are external. For these internal procedures, the super block optimization is very effective, as these procedures consist of 525, 785, and 28 basic blocks and 17, 16, and 12 super blocks, respectively.

The results of the experiments using the ocean benchmark are summarized in Figures 5.17, 5.18, and 5.19. Figure 5.17 illustrates the results using an ocean of 130 × 130 grid points, while Figures 5.18 and 5.19 illustrate the results using oceans of 258 × 258 and 514 × 514 grid points, respectively. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.17, 5.18, and 5.19, respectively.

Using an ocean of 130 × 130 grid points, a corresponding array occupies approximately 132 KB of memory. The ocean benchmark uses a fairly large number of these arrays, i.e. 25. Apart from these and several smaller arrays, the two arrays used as input for the multigrid solver consist of seven separate arrays, one for each level. Note that the number of levels is equal to the binary logarithm of the problem size, i.e. 128 in this case. Even for the smallest problem size, the dataset of the ocean benchmark will therefore not fit into any of the caches in the Compaq XP1000 workstation. According to [WOT+95], the first-level working set of the ocean benchmark consists of a few subrows, i.e. a few KB, while the second-level working set is one partition of the whole data set, i.e. the size of the whole data set divided by the number of threads. In the case of multithreading, the individual threads share the cache resources, hence the working set as seen from the processor is probably identical to the whole data set. Although the second-level working set will not fit in any of the caches, the first-level working set will fit in the first-level data cache.

An analysis of the runtimes presented in Figure 5.17 yields the following results: The runtimes increase with the number of threads, although using four threads is slower than using eight threads for all versions using emulated multithreading.

Fig. 5.17. Results for the Ocean Benchmark (130 × 130 Ocean)

Fig. 5.18. Results for the Ocean Benchmark (258 × 258 Ocean)

This sudden increase in runtime originates in the multigrid solver, as the time spent in the solver increases significantly if four threads are used. Subtracting the times spent in the multigrid solver from the overall runtimes yields the slight increase in runtime observed for the other benchmarks. However, the source of the sudden increase in the amount of time spent in the multigrid solver is unclear: As the posix version is not affected, this behavior seems to be caused by emulated multithreading. As only the top-level routine of the multigrid solver is internal and all subroutines are external, it is unlikely that an increase in the number of cache misses is responsible.

Fig. 5.19. Results for the Ocean Benchmark (514 × 514 Ocean)

Another candidate would be the lock in the multigrid solver that is used to sum up the local errors. However, a detailed analysis of the cache misses would allow us to rule out the former and to investigate this phenomenon further. Unfortunately, the performance counters in the 21264 microprocessor do not support the corresponding events, hence it is not possible to perform this analysis on the current platform.

Apart from the phenomenon described above, the runtimes of the g004, g016, g064, bblk, and sblk versions follow a well-known pattern: For all numbers of threads, the sblk version is the fastest, while the bblk and g064 versions have almost identical runtimes. The g004 version is noticeably slower than the bblk version and the g016 version is slightly slower than the bblk version. The sblk version is slower than the posix version as long as fewer than eight threads are used, but is significantly faster for 8 and 16 threads.

Using an ocean of 258 × 258 grid points, each of the 25 arrays occupies approximately 520 KB of memory, i.e. about 13 MB total. The two arrays used as input for the multigrid solver have eight levels and occupy approximately 7 MB together. While the second-level working set will not fit in any of the caches, the first-level working set occupies several KB and will fit in the first-level data cache. Apart from the overall increased runtime, an analysis of the runtimes presented in Figure 5.18 yields the same results as for the smaller problem size, although the posix version is always faster than the sblk version.

Using an ocean of 514 × 514 grid points, each of the 25 arrays occupies approximately 2 MB, i.e. 50 MB total. The two arrays used as input for the multigrid solver have nine levels and occupy approximately 32 MB together.

Again, the second-level working set will not fit in any of the caches, while the first-level working set may still fit in the first-level data cache as each row occupies about 4 KB. Apart from the overall increased runtime, an analysis of the runtimes presented in Figure 5.19 yields the same results as for the smaller problem sizes, although the advantage of the posix version increases slightly.

In summary, the results of the ocean benchmark are a bit disappointing, since the sblk version is slower than the posix version. The increase in runtime when using four threads has yet to be explained; a detailed analysis of the cache behavior would provide useful hints in this regard.

5.8 Barnes

The project-specific configuration file for the barnes benchmark contains five internal and five external procedures; system and library routines are covered in the platform-specific configuration file. The SlaveStart() procedure contains some assignments, a small loop that calls the internal StepSimulation() procedure for the selected number of timesteps, as well as a call to the find_my_initial_bodies() procedure. The latter procedure performs the initial distribution of particles to threads and consists of a loop that updates the particle pointers in the corresponding local array of the thread. The number of loop iterations across all threads is equal to the number of particles. The StepSimulation() procedure implements a single timestep of the particle simulation and consists of a sequence of internal and external calls that implement the different phases of the algorithm described in Section 4.2.5. The final loop is used to update the local particles, hence the number of loop iterations across all threads is equal to the number of particles. The maketree() procedure constructs the particle tree and consists of a single loop that calls the loadtree() procedure for each of the local particles. The hackcofm() procedure determines the center of mass for all leafs and cells in the particle tree that belong to the current thread. The procedure consists of two loops that are used to traverse the corresponding local arrays such that the number of loop iterations across all threads is equal to the number of particles.

Apart from being the entry point, the SlaveStart() procedure has to be internal as it contains two calls to other internal procedures. Due to a barrier synchronization, the find_my_initial_bodies() procedure has to be internal as well. The StepSimulation() procedure has to be internal as it contains two barriers, one lock, as well as calls to other internal procedures. In the same way, the maketree() procedure has to be internal as it contains two barriers, one lock, and a call to another internal procedure. Last but not least, the hackcofm() procedure contains a spin wait and therefore has to be internal. For these internal procedures, the super block optimization is quite effective as the procedures consist of 6, 11, 65, 14, and 19 basic blocks and 5, 2, 9, 4, and 9 super blocks, respectively.

Fig. 5.20. Results for the Barnes Benchmark (16 K Particles)

The results of the experiments using the barnes benchmark are summarized in Figures 5.20, 5.21, and 5.22. Figure 5.20 illustrates the results using a problem size of 16 K particles, while Figures 5.21 and 5.22 illustrate the results using problem sizes of 64 K and 256 K particles, respectively. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.20, 5.21, and 5.22, respectively.

Using a problem size of 16 K particles, the barnes benchmark allocates approximately 2 MB of memory for the global particle array, another 2 MB and 4 MB for the local cell and leaf arrays, as well as approximately 256 KB for the local cell and leaf pointer arrays. According to [WOT+95], the first-level working set of the barnes benchmark is the tree data for one particle, i.e. the leaf that contains the particle as well as all cells on the path from the root cell to that leaf. As the height of the tree is logarithmic and the sizes of a leaf and a cell are 148 and 168 bytes, respectively, the first-level working set is a few KB large and will therefore fit in the first-level data cache. The second-level working set is one partition of the whole data set, i.e. probably the whole data set in the case of multithreaded execution. As the Compaq XP1000 workstation is configured with a 4 MB second-level cache, the second-level working set will not fit in any of the caches.

An analysis of the runtimes presented in Figure 5.20 yields the following results: The runtimes increase with the number of threads, although the difference is almost negligible for the current problem size. The relation between the different versions of the barnes benchmark shows the same well-known pattern as for the other benchmarks: The sblk version is the fastest, the bblk and g064 versions have almost identical runtimes, while the g004 version is the slowest.
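A rough estimate illustrates why this working set is so small; note that the octree branching factor used here is an assumption made for illustration and is not stated above:

    tree height ≈ log₈ 16 384 ≈ 5 levels
    first-level working set ≈ 148 B (one leaf) + 5 × 168 B (cells on the path) ≈ 1 KB

i.e. on the order of a kilobyte and thus comfortably within the 64 KB first-level data cache.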

Fig. 5.21. Results for the Barnes Benchmark (64 K Particles)

Fig. 5.22. Results for the Barnes Benchmark (256 K Particles)

However, the g016 version is different as it is slightly faster than the bblk version. The posix version is slower than all other versions if more than one thread is used. This performance degradation is caused by a significant increase in the time spent in the force calculation phase. The sblk optimization is quite effective, as the runtime for the loop that updates the individual particles is independent of the number of threads, while it increases significantly in the absence of super block optimization due to the larger number of context switches. Compared to the overall runtime, the overhead associated with emulated multithreading is almost negligible.

Using a problem size of 64 K particles, the barnes benchmark allocates approximately 7 MB of memory for the global particle array, another 8 MB and 16 MB for the local cell and leaf arrays, as well as approximately 1 MB for the local cell and leaf pointer arrays. The size of the first-level working set increases only slightly as the size of the working set only increases with the logarithmic height of the particle tree. While the first-level working set will still fit in the first-level data cache, the second-level working set will not fit in any of the caches. Apart from the overall increased runtimes, an analysis of the runtimes presented in Figure 5.21 yields the same results as for the smaller problem size, although the gap to the posix version increases.

Using a problem size of 256 K particles, the barnes benchmark allocates approximately 28 MB of memory for the global particle array, another 16 MB and 32 MB for the local cell and leaf arrays, as well as approximately 4 MB for the local cell and leaf pointer arrays. The size of the first-level working set increases only slightly as the size of the working set only increases with the logarithmic height of the particle tree. While the first-level working set will still fit in the first-level data cache, the second-level working set will not fit in any of the caches. Apart from the overall increased runtimes, an analysis of the runtimes presented in Figure 5.22 yields the same results as for the smaller problem sizes, although the gap to the posix version increases even more.

In summary, the results for the barnes benchmark are quite encouraging, as all versions using emulated multithreading are faster than the posix version. Compared to the base version, the overhead introduced by emulated multithreading is negligible.

5.9 FMM

The fmm benchmark is by far the most complex benchmark used during the evaluation of emulated multithreading. The project-specific configuration file for the fmm benchmark contains 32 internal and 25 external procedures; system and library routines are covered in the platform-specific configuration file. All of these procedures have to be internal as they either contain a synchronization point, e.g. barriers or locks, or a call to another internal procedure. Due to the large number of synchronization points and internal calls, the super block optimization is not very effective for the internal procedures in the fmm benchmark: These procedures consist of 532 basic blocks and 253 super blocks, respectively. The rather large number of super blocks is probably the cause for the highest increase in the number of instructions encountered so far.

The results of the experiments using the fmm benchmark are summarized in Figures 5.23 and 5.24. Figure 5.23 illustrates the results using a problem size of 16 K particles, while Figure 5.24 illustrates the results using a problem size of 64 K particles. The results using a problem size of 256 K particles have been omitted, as the allocated memory exceeds the installed physical memory in the Compaq XP1000 workstation, i.e. the operating system starts to swap. Note that a vertical baseline is provided for the base results in order to make identification of these results easier. The following paragraphs discuss the results illustrated in Figures 5.23 and 5.24, respectively.

Using a problem size of 16 K particles, the fmm benchmark allocates approximately 1.5 MB for the global particle array, 0.5 MB for the local particle pointer arrays, as well as 18 MB for the local cell arrays. According to [WOT+95], the first-level working set for the fmm benchmark consists of the expansion terms, which have a fixed size of less than 640 bytes. As the 21264 microprocessor used in the Compaq XP1000 workstation contains a 64 KB first-level data cache, the first-level working set will easily fit in this cache. The second-level working set is one partition of the whole data set, i.e. probably the whole data set in the case of multithreaded execution. Even for the smallest problem size, the size of the working set exceeds the size of the 4 MB second-level cache used in the Compaq XP1000 workstation.

An analysis of the runtimes presented in Figure 5.23 yields the following results: Compared to the overall runtime, the increase in runtime for using multiple threads is almost negligible. The runtimes of the g004, g016, g064, bblk, and sblk versions follow a well-known pattern: The sblk version is significantly faster than all other versions, and the g064 and bblk versions have almost identical runtimes. The g004 version is significantly slower than the bblk version, while the g016 version is slightly slower than the bblk version. However, the difference between the individual versions is larger than for all other benchmarks. In addition, the overhead associated with emulated multithreading is rather high: Even the sblk version is more than two times slower than the base version. Nevertheless, the posix version is significantly slower than the sblk version if more than one thread is used. Investigating the individual components of the runtime reveals that almost all time is spent in computing the interactions between particles and cells. Unfortunately, the large number of internal procedures makes the analysis rather difficult.
A detailed analysis of the cache behavior will probably reveal the reason for the large overheads in the case of the fmm benchmark. As the increase in runtime due to a larger number of threads is almost negligible, it is unlikely that the large overhead is caused by context switches.

Using a problem size of 64 K particles, the fmm benchmark allocates approximately 6 MB for the global particle array, 2 MB for the local particle pointer arrays, as well as 72 MB for the local cell arrays. As the first-level working set is of fixed size and occupies less than 640 bytes, this working set will still fit in the first-level data cache. Like before, the second-level working set is too large to fit in any of the caches in the Compaq XP1000 workstation. Apart from the overall increased runtime, an analysis of the runtimes presented in Figure 5.24 yields the same results as for the smaller problem size, although the overhead for the bblk and sblk versions is even more pronounced and the sblk version is now slower than the posix version.

Fig. 5.23. Results for the FMM Benchmark (16 K Particles)

Fig. 5.24. Results for the FMM Benchmark (64 K Particles)

Note that the results for the g004, g016, and g064 versions were omitted due to instabilities of the corresponding executables.

In summary, the results for the fmm benchmark are even worse than the results for the barnes benchmark: The overhead for the versions using emulated multithreading is very large, and even the sblk version is significantly slower than the base version. For the larger problem size, the sblk version is slower than the posix version for all numbers of threads. The reason for the large overhead is unclear, as the size of the application code complicates the analysis of the results. A detailed analysis of the cache behavior would provide useful hints in this regard. Unfortunately, it is not possible to perform this analysis on the Compaq XP1000 workstation due to the restricted set of events supported by the performance counters of the 21264 microprocessor.

5.10 Summary

The evaluation of emulated multithreading on the Compaq XP1000 workstation revealed that the overhead associated with emulated multithreading is smaller than expected, especially if the super block optimization is used. In fact, the sblk version is faster than the posix version in most cases. As the Tru64 operating system contains an efficient implementation of POSIX threads, this result demonstrates that emulated multithreading is feasible and that the efforts to reduce the overhead have paid off. In addition, the current implementation of emulated multithreading provides several opportunities for further improvements, e.g. by improving the code conversion process or the emulation library.

Recall that the current implementation of emulated multithreading is targeted at tolerating the latency of remote memory references in massively parallel processors. As these events do not occur on a single-processor workstation like the Compaq XP1000, there is nothing to be gained by latency tolerance on this platform. Therefore the results of the evaluation reflect the overhead associated with emulated multithreading and provide a good foundation for the evaluation on massively parallel processors.

However, the evaluation of emulated multithreading on the Compaq XP1000 workstation should be extended by performing a detailed analysis of the cache behavior. Such an analysis would provide useful insights, especially in connection with an extension of emulated multithreading to cover the latency of main memory accesses as well. For example, the sblk version of the fft benchmark benefits from the reduced number of cache misses due to the smaller working sets if multiple threads are used. As the latency associated with references to main memory is on the order of hundreds of cycles, emulated multithreading could be extended to tolerate these latencies if the context switch locations are chosen carefully.

6. Evaluation : Cray T3E

As described in Chapter 2, emulated multithreading is designed to tolerate long latency events by using fine-grained multithreading and asynchronous communication. The current implementation described in Chapter 3 is targeted at tolerating remote memory accesses in massively parallel processors, since such accesses incur a large latency and are easily identifiable. Emulated multithreading was evaluated on a single-processor workstation in order to determine the characteristics of emulated multithreading as well as the associated overhead. The corresponding results are presented in Chapter 5. This chapter evaluates emulated multithreading on a massively parallel processor, i.e. the Cray T3E. This evaluation was supported by grants for computing time from the John von Neumann Institute for Computing (NIC) at the Forschungszentrum Jülich and the Höchstleistungsrechenzentrum Stuttgart (HLRS).

The two computing centers mentioned above provided access to three different models of the Cray T3E: A 512-processor T3E-1200 and a 512-processor T3E-600 are installed at the NIC, while a 512-processor T3E-900 is installed at the HLRS. All experiments performed during the evaluation were executed on the Cray T3E-1200 at the NIC, while the T3E-900 at the HLRS was primarily used during development and test of the current implementation. The T3E-600 installed at the NIC is reserved for batch jobs with a large number of processors and was therefore not used in the evaluation.

Emulated multithreading is evaluated on the Cray T3E massively parallel processor using five of the six benchmarks described in Chapter 4, i.e. fft, lu, radix, ocean, and barnes. The fmm benchmark is not used in the evaluation due to several race conditions in the ported application that lead to frequent instabilities, especially for larger numbers of processors. A detailed description of these race conditions as well as two workarounds are provided in Section 4.2.6. For each benchmark, several experiments are performed in order to assess the performance of emulated multithreading on a wide range of problem sizes, processor numbers, and thread numbers. These results are compared with the results obtained from single-threaded execution on the same range of problem sizes and processor numbers.

Section 6.1 describes the architecture of the Cray T3E and emphasizes the E-register mechanism that is used for remote memory accesses, while Section 6.2 covers the experimental methodology used in the evaluation.

Sections 6.3, 6.4, and 6.5 discuss the evaluation of emulated multithreading using the fft, lu, and radix benchmarks, respectively. Sections 6.6 and 6.7 cover the evaluation of emulated multithreading using the ocean and barnes benchmarks. In contrast to the previous chapter, the impact of the code conversion process on the assembler sources of the benchmarks is omitted, since the observations and reasoning from the Compaq XP1000 platform also apply to the Cray T3E platform.

6.1 Cray T3E

The Cray T3E [Oed96] is based on the earlier Cray T3D [KFW94] and represents the second generation of massively parallel processors from Cray Research. The Cray T3E supports up to 2048 processing elements based on the Alpha 21164 processor described in Section 6.1.1. The globally addressable memory is physically distributed among the processing elements and is covered in Section 6.1.2. The individual processing elements are connected by a bidirectional three-dimensional torus network that is described in Section 6.1.3. The I/O architecture of the Cray T3E is based on SCI (Scalable Coherent Interconnect) and is described in Section 6.1.4. Finally, Section 6.1.5 describes the unicos/mk operating system as well as the programming environments that are available on the Cray T3E. The information in this section is gathered from several sources, i.e. [Oed96], [AXP96c], [ST96], and [Sco96].

The Cray T3E is available in four different models. The major difference between these models is the clock frequency of the microprocessors used in each processing element: The -600, -900, -1200, and -1350 models use clock frequencies of 300, 450, 600, and 675 MHz, respectively. The system clock frequency is the same for all models, i.e. 75 MHz. Apart from the different models, the Cray T3E is available in two different versions, i.e. liquid cooled and air cooled. The liquid cooled version supports up to eight chassis, where each chassis contains up to 256 user processing elements, 16 support processing elements, as well as one clock module, for a total of 2048 user and 128 support processing elements. Note that the number of processing elements can be increased in steps of eight, i.e. it is not necessary to double the number of processing elements in each step. In each case, two chassis share a heat exchange unit that connects the primary fluorinert and the secondary water cooling circuits. The air cooled version supports up to six chassis, where each chassis contains up to 20 or 24 user or support processing elements and up to one clock module, for a total of 128 user and 8 support processing elements. Note that the number of processing elements can be increased in steps of four instead of eight.

6.1.1 Processor

Each processing element in the Cray T3E is based on an Alpha 21164 processor running at a clock frequency of 300, 450, 600, or 675 MHz, depending on the Cray T3E model. Note that the processor clock frequency is always a multiple of the 75 MHz clock frequency used by the system logic and network. The 21164 microprocessor is an implementation of the Alpha architecture, a 64 bit architecture designed by Digital Equipment Corporation. A detailed description of this architecture as well as all implementations can be found in Appendix A. The Alpha architecture uses a split register file: there are 32 integer and 32 floating-point registers, each 64 bit wide.

The 21164 microprocessor is a super-scalar processor that can issue up to four instructions (two integer, two floating-point) in each cycle. The two integer execution pipelines support all arithmetic instructions on 32 bit and 64 bit integer operands, integer control instructions, as well as load and store instructions for integer and floating-point operands. The two floating-point execution pipelines support all arithmetic instructions on single- and double-precision floating-point operands as well as floating-point control instructions. Note that floating-point division is handled by a non-pipelined divider that is associated with one of the floating-point execution pipelines.

The 21164 microprocessor contains first-level data and instruction caches as well as a unified second-level cache. The first-level data cache is an 8 KB direct-mapped cache with 32 byte cache-lines and uses a write-through, read-allocate protocol. The data cache is dual-ported to allow two independent accesses in each cycle. The first-level instruction cache is an 8 KB direct-mapped cache with 32 byte cache-lines. The second-level cache is a 96 KB 3-way set-associative cache using 32 or 64 byte cache-lines and a write-back, write-allocate protocol. Each instruction that hits in the data cache incurs a latency of two cycles, while instructions that hit in the unified cache incur a latency of eight cycles or more.

The internal caches are backed by an optional external cache. This third-level cache is a direct-mapped cache ranging from 1 to 64 MB in size and using 32 or 64 byte cache-lines. However, the processing elements in the Cray T3E do not use this option in order to decrease the latency of main memory accesses. Stream buffers are used instead to increase the performance of non-unit-stride accesses. The memory system of the processing elements is described in Section 6.1.2. In order to improve the performance of load and store instructions, the 21164 contains a miss address file that buffers and merges outstanding loads as well as a write buffer that buffers and merges stores. Detailed information about the internal architecture of the 21164 can be found in Section A.3.6.

The Alpha architecture defines a 64 bit virtual address space; implementations are required to support at least a 43 bit virtual address space and to check that the remaining bits are zero. The 21164 microprocessor implements the minimum 43 bit virtual address space as well as a 40 bit physical address space. In contrast to the 21064 microprocessor used in the Cray T3D [ACK+95], the size of the physical address space is large enough to cover the total amount of memory supported by the Cray T3E, i.e. 4 TB.
The physical address space is divided into two parts by the most significant address bit: The lower part is cached by the internal caches and contains the local memory of a processing element, while the upper part is uncached and contains the memory-mapped registers used to access remote memory.
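As a minimal illustration of this split, the following sketch tests the most significant bit of the 40 bit physical address space described above; the helper name is hypothetical and the bit position merely follows the description in the text.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helper: returns true if a 40 bit physical address falls into
     * the uncached upper half of the address space (most significant bit set),
     * which holds the memory-mapped E-registers, and false for the cached
     * local-memory half. */
    static bool is_uncached(uint64_t phys_addr)
    {
        return (phys_addr >> 39) & 1;
    }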

6.1.2 Memory

The Cray T3E provides a global address space with local consistency although the memory is physically distributed among the processing elements. Each processing element contains a local memory that resides in the cached address space of the corresponding processor. The local memory system consists of eight independent single-word (64 bit) banks of DRAM; each pair of banks is controlled by a bank control chip. These are connected to the system control logic via a 32 bit data bus, i.e. the bandwidth of the four memory busses is equal to the bandwidth of the 128 bit processor bus. Note that the number of memory banks allows a quasi-parallel access to 64 byte cache-lines.

Instead of an external third-level cache, the processing elements in the Cray T3E use a stream buffer [Jou90] to prefetch cache-lines starting on cache miss addresses: After two misses to consecutive cache-lines, a stream buffer is allocated and starts to prefetch data or instructions by initiating accesses starting at the cache miss address. The stride between the individual accesses is calculated as the difference between the first two addresses that missed in the cache and activated the stream buffer [BC91a]. Note that the misses do not have to be contiguous; the stream detection logic maintains the history of the last eight cache misses in order to identify streams. In contrast to caches, stream buffers are useful for non-unit-stride accesses, e.g. accesses to vectors or matrices. The system control logic in the Cray T3E supports six independent stream buffers, where each stream buffer contains up to two consecutive cache-lines, i.e. 128 byte. The stream buffers are completely managed in hardware, but software can provide hints to guide the allocation of stream buffers. Unfortunately, the initial version of the system control logic had a serious flaw in connection with the stream buffers and cache consistency [Cra96], hence most installations of a Cray T3E-600 disable the stream buffers by default. A detailed description of this flaw as well as possible workarounds are included in Section B.3. Cache coherency between the internal caches, stream buffers, and main memory is maintained in hardware, using the flush coherency protocol supported by the 21164 microprocessor. An external backmap, i.e. a copy of the second-level cache tags, is used to invalidate cache-lines or stream buffer entries in the event of remote writes.

Although the memory is physically distributed among the processing elements, each processor can access all memory locations on any of the other processing elements.
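The allocation rule sketched above can be modelled in software as follows; this is a simplified sketch under the stated assumptions (a buffer is allocated once a miss hits the cache-line following that of an earlier miss, and the stride is the difference of the two miss addresses), and all names are illustrative rather than part of the actual hardware interface.

    #include <stdint.h>

    #define MISS_HISTORY        8    /* the detection logic remembers the last eight misses */
    #define NUM_STREAM_BUFFERS  6    /* the system control logic supports six stream buffers */
    #define LINE_SIZE          64    /* second-level cache-line size in bytes */

    struct stream_buffer {
        int      active;
        uint64_t next_addr;          /* next address to prefetch                */
        int64_t  stride;             /* difference of the first two miss addresses */
    };

    static uint64_t miss_history[MISS_HISTORY];
    static int miss_count;
    static struct stream_buffer buffers[NUM_STREAM_BUFFERS];

    /* Conceptually called on every cache miss: if an earlier recorded miss hit
     * the preceding cache-line, allocate a free stream buffer and start
     * prefetching with the observed stride. */
    void on_cache_miss(uint64_t addr)
    {
        int seen = miss_count < MISS_HISTORY ? miss_count : MISS_HISTORY;
        for (int i = 0; i < seen; i++) {
            uint64_t prev = miss_history[i];
            if (addr / LINE_SIZE == prev / LINE_SIZE + 1) {   /* consecutive lines */
                for (int b = 0; b < NUM_STREAM_BUFFERS; b++) {
                    if (!buffers[b].active) {
                        buffers[b].active    = 1;
                        buffers[b].stride    = (int64_t)(addr - prev);
                        buffers[b].next_addr = addr + buffers[b].stride;
                        break;
                    }
                }
                break;
            }
        }
        miss_history[miss_count % MISS_HISTORY] = addr;
        miss_count++;
    }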

Fig. 6.1. Global Address Calculation - Part I

Note that these remote accesses are performed without any involvement of the remote processor. Although the memory is globally addressable, there is no shared address space: Each memory location is identified by the address of the location in the address space of the corresponding processing element as well as the number of this processing element. Access to remote memory is performed by means of the so-called E-registers. All remote transfers occur between these E-registers and the main memory of the remote processing element. The E-registers reside in the uncached region of the address space and can be read and written by the processor via uncached loads and stores, limiting the bandwidth between these registers and the processor to 600 MB/s. Note that cached load and store instructions can be performed at up to 1200 MB/s.

There are 640 E-registers; 512 of these are available to applications and 128 are reserved for system use. The large number of E-registers facilitates pipelining of remote memory accesses by increasing the number of outstanding operations. These E-registers are used in two different ways: They can either be used as a source-and-destination E-register (SADE) that provides the source or destination operands for global get and put operations, respectively. Besides, they can be used as part of a more-operands-block of E-registers (MOBE) that provides additional operands during global address generation. Given a 50 bit address index, the mask stored in the first E-register of the block is used to extract those bits that form the number of the remote processing element. The indicated bits are compacted into a zero-extended 12 bit number, i.e. the number is large enough to support the maximum number of 2048 user and 128 support processing elements. The remaining bits are compacted and added to the base address stored in the second E-register of the block. The six most significant bits of the result represent a virtual segment, while the remaining 32 bits represent an offset within that segment. These operations are performed by the hardware centrifuge and are illustrated in Figure 6.1. Note that the hardware centrifuge supports sophisticated data distribution models, e.g. distributing an array to other processing elements on cache-line boundaries.

The access to global memory is initiated by a store to a specific address in the uncached region of the address space: The address specifies the type of access as well as the number of the source-and-destination E-register to use during the actual data transfer. The written data specifies the number of the more-operands-block of E-registers to use during address translation, as well as an address index that is translated into a global virtual address via the hardware centrifuge. Note that the more-operands-block of E-registers is initialized only once. All subsequent accesses to the corresponding distributed array can be performed directly.

The second part of the global address translation is depicted in Figure 6.2. After the virtual segment, segment offset, and virtual processing element have been determined by means of the hardware centrifuge, these values are translated into a global virtual address. Such an address consists of a global segment, segment offset, and routing information. The virtual segment number is translated into a global segment number by the segment translation table (STT).
For each virtual segment, this table stores the number of the corresponding global segment, the number of the base processing element in the current partition, as well as protection information. The number of the base processing element is added to the virtual processing element number determined by the hardware centrifuge, yielding the logical number of the target processing element. This number is checked for access violations using the protection information and the size of the current partition stored in the segment translation table. Afterwards the logical processing element number is translated into a physical processing element number, i.e. routing information, by means of the routing lookup table (RLT).

The physical processing element numbers denote the position of the corresponding processing element in the three-dimensional torus network, while the logical number is used to identify all processors that are visible to the operating system and the virtual number represents the processing elements inside the current partition. The virtual numbers range from 0 to p − 1, where p is the number of processing elements in the current partition, while the logical numbers range from 0 to n − 1, where n is the number of processors visible to the operating system. The physical number determines the x, y, and z coordinates of the processing element in the three-dimensional torus network. The number of physical processing elements may be larger than the number of logical processing elements due to redundant processing elements used to replace failed elements. The segment translation table at the source processing element ensures that only authorized global segments on authorized processing elements are accessed, while the routing lookup table allows redundant processing elements to be mapped in place of failed processing elements.
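To make the centrifuge step concrete, the following sketch models the mask-based bit extraction and compaction described above in software; the real translation is performed entirely in hardware, the field widths simply follow the description in the text, and all names are illustrative.

    #include <stdint.h>

    /* Hypothetical model of the hardware centrifuge: the mask selects those bits
     * of the 50 bit address index that form the virtual PE number; the selected
     * bits are compacted into a 12 bit number, the remaining bits are compacted
     * and added to the base address, yielding a 6 bit virtual segment and a
     * 32 bit offset within that segment. */
    typedef struct {
        uint32_t pe;        /* virtual processing element number (12 bit) */
        uint32_t vseg;      /* virtual segment number (6 bit)             */
        uint32_t offset;    /* offset within the segment (32 bit)         */
    } global_addr_t;

    global_addr_t centrifuge(uint64_t index, uint64_t mask, uint64_t base)
    {
        uint64_t pe = 0, rest = 0;
        int pe_bits = 0, rest_bits = 0;

        for (int i = 0; i < 50; i++) {              /* walk the 50 bit address index   */
            uint64_t bit = (index >> i) & 1;
            if ((mask >> i) & 1)
                pe   |= bit << pe_bits++;           /* compact masked bits -> PE number */
            else
                rest |= bit << rest_bits++;         /* compact the remaining bits       */
        }

        uint64_t addr = rest + base;                /* add the base stored in the MOBE  */
        global_addr_t g;
        g.pe     = (uint32_t)(pe & 0xFFF);                 /* 12 bit PE number          */
        g.offset = (uint32_t)(addr & 0xFFFFFFFFu);         /* lower 32 bits: offset     */
        g.vseg   = (uint32_t)((addr >> 32) & 0x3F);        /* next 6 bits: segment      */
        return g;
    }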

Fig. 6.2. Global Address Calculation - Part II

After generating and checking the global virtual address, this address is transferred, along with any data in the case of puts, to the target processing element. At the target element, the global virtual address is translated into a physical address in the cached region of the address space by the global translation buffer (GTB). This buffer supports several page sizes ranging from 64 KB to 128 MB and is completely managed in hardware, hence it does not fault under normal conditions. Any access violations within the global segment are detected during this translation as well. Note that the translation at the target processing element allows each element to manage its own memory independently of other processing elements as long as the global translation buffer is updated accordingly. For example, the system control logic supports background copy operations to local memory without any impact on remote references.

In contrast to the earlier Cray T3D, cache coherence is maintained in hardware: If the remote memory access modifies the memory location, the external backmap is used to invalidate any cache-lines in the internal caches of the 21164 microprocessor as well as the stream buffers. If the remote memory access references a location that contains invalid data, the request is serviced from the valid data in the internal caches. Note that cache coherency is only maintained within a single processing element, i.e. there is no global cache coherency protocol.

In the case of a read request, the requested data is transferred from the target to the source and is stored in the selected source-and-destination E-register, where the data can be retrieved by a simple load instruction. The accesses to the E-registers are synchronized by full/empty bits. A load that accesses an empty E-register, i.e. one that is used in an outstanding E-register operation, will stall until the operation is completed and the E-register contains the requested data. The E-registers support a rich set of communication and synchronization commands:
• The get command returns the contents of a location in local or remote memory, i.e. performs a local or remote memory read.
• The put command updates the contents of a location in local or remote memory, i.e. performs a local or remote memory write.
• The swap command updates the contents of a location in local or remote memory similar to the put command, and returns the original contents of the local or remote memory location.
• The conditional swap command updates the contents of a location in local or remote memory, provided that the original contents meet the specified condition: If the original contents meet the condition, the local or remote memory location is updated and the original contents are returned. Otherwise, the local or remote memory location is unchanged and the original contents are returned.
• The fetch-and-increment command increments the contents of a location in local or remote memory and returns the original contents.
• The fetch-and-add command adds the specified value to the contents of a location in local or remote memory and returns the original contents.
• The send command stores a message to a local or remote memory location. The specified memory location must contain a message queue control word in order to ensure proper delivery of the message.
• The state command returns the contents of the specified state register.
These commands can be combined with several qualifiers; a detailed introduction to E-register programming is provided in Appendix B.
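As an informal illustration of the command semantics listed above, the following sketch expresses the swap, conditional swap, fetch-and-increment, and fetch-and-add commands as plain C functions. It is assumed that each function executes atomically at the target processing element; equality is used as the example condition for the conditional swap, and the names are illustrative rather than part of any actual interface.

    #include <stdint.h>

    /* Each routine models one E-register command; "loc" stands for a location
     * in local or remote memory. Atomic execution at the target PE is assumed. */

    uint64_t e_swap(uint64_t *loc, uint64_t value)                /* swap */
    {
        uint64_t old = *loc;
        *loc = value;
        return old;                         /* original contents are returned */
    }

    uint64_t e_cswap(uint64_t *loc, uint64_t cmp, uint64_t value) /* conditional swap */
    {
        uint64_t old = *loc;
        if (old == cmp)                     /* update only if the condition holds */
            *loc = value;
        return old;                         /* original contents always returned  */
    }

    uint64_t e_finc(uint64_t *loc)                                /* fetch-and-increment */
    {
        uint64_t old = *loc;
        *loc = old + 1;
        return old;
    }

    uint64_t e_fadd(uint64_t *loc, uint64_t value)                /* fetch-and-add */
    {
        uint64_t old = *loc;
        *loc = old + value;
        return old;
    }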

6.1.3 Network

The individual processing elements are connected by a three-dimensional torus network. Each processing element contains one network router that represents a node in the torus. In contrast to the earlier Cray T3D, there is only one processor instead of two per network node. Each router contains a full-duplex processor interface, two full-duplex network links for each dimension, as well as one I/O link. The network links are 14 bit (1 phit) wide and operate at 75 MHz using time-multiplexed transmission: In each clock cycle 5 phits (1 flit) are transmitted, yielding a peak bandwidth of 600 MB/s. Due to protocol overhead, the maximum data bandwidth is reduced to approximately 500 MB/s. Note that the size of one flit is large enough to store a 64 bit quadword as well as some control information.

The network supports the individual E-register commands described above by providing single read- and write-requests, vector read- and write-requests, 8 quadword messages, atomic memory operations, and configuration packets. The individual packets are between two and 10 flits long. The router chip supports five virtual channels [Dal90] for normal packets as well as an additional virtual channel for synchronization packets. Two virtual channels are used to remove turn cycles and request-response cycles, respectively; torus cycles are removed by routing packets in the first order of dimension. The fifth virtual channel provides a non-deterministic adaptive routing network, which does not remove any cycles. However, packets may reenter the deterministic routing network to avoid deadlock situations.

The router is based on a crossbar with additional paths for packets that do not turn, i.e. do not start routing in another dimension. Each network link provides buffers for all virtual channels; the buffers for the channels used in the deterministic network hold up to 12 flits, while the buffers for channels used in the non-deterministic network hold up to 22 flits. Upon entry into the first router, the logical number of the target processing element is used to index the routing lookup table. This table has 544 entries; each entry contains the physical address of the target node as well as the deterministic path to that node. For systems with more than 544 processing elements, i.e. 512 user and 32 support elements, the two least-significant bits are ignored, hence processors are mapped in groups of four, thus supporting up to 2176 processing elements, i.e. 2048 user and 128 support.

Routing in the deterministic network is done in first order of dimension: Routing in positive direction of the x dimension (+x) comes first, followed by routing in the +y, +z, -x, -y, and -z dimensions. Note that the ordering of direction is fixed, but not the ordering of dimension, i.e. a packet with a -x, +y, +z routing tag would be routed in positive direction of the y and z dimensions first, afterwards in the negative direction of the x dimension. This routing scheme adds flexibility by providing multiple routes between two nodes, which is useful for fault tolerance. The flexibility of the routing scheme is further enhanced by allowing initial and final hops: The first hop of a packet can be in positive x, y, or z dimension without changing the routing information in the packet itself. However, the final hop of a packet can only be in negative direction of the z dimension.
Apart from adding flexibility to the routing scheme, initial and final hops enable the use of partially filled planes, thus reducing the minimum number of processing elements that have to be added in each step. Nodes in a partial plane without a direct route can be routed by using an initial hop in the +z dimension as well as a final hop in the -z dimension, as all partial planes are oriented along the z dimension.

Adaptive routing allows requests and responses to be routed around local congestion in the network by allowing packets to turn in every node in a direction that brings them closer to their destination. Note that adaptive routing may cause reordering of packets and may route packets over broken links, hence adaptive routing can be turned off for certain packets and/or destinations. In addition, the adaptive virtual routing network is not guaranteed to be acyclic, hence packets must enter the deterministic, acyclic network in case of deadlocks. Note that the corresponding virtual channel has the lowest priority among all virtual channels, while the virtual channel for synchronization messages has the highest priority.

Instead of the dedicated synchronization network used in the earlier Cray T3D, the Cray T3E embeds the synchronization network into the normal network by using a high-priority virtual channel for synchronization packets. The system control logic on each processing element contains 32 barrier/eureka synchronization units. Each of these units provides a synchronization network for a specified set of processing elements. There are two different synchronization events: The barrier event forms a spanning tree across the processors using logical and, i.e. the event is propagated to the parent synchronization units as soon as all child units have signaled the event. The eureka event forms a spanning tree across the processors using logical or, i.e. the event is propagated to the parent synchronization units as soon as one of the child units has signaled the event. In addition, each router maintains a register for each of the 32 barrier/eureka synchronization units. The contents of the register configure the router as a node in the corresponding spanning tree by identifying the parent and child nodes. This information is used to propagate the synchronization events along the spanning tree.

6.1.4 Input/Output

In contrast to the earlier Cray T3D, which handled all I/O requests via the Cray Y-MP foreground system, the Cray T3E provides its own set of I/O interfaces. As described in Section 6.1.3, each router has a dedicated bidirectional I/O link. The four processing elements on a printed circuit board share a common I/O control unit, which interfaces the corresponding I/O links to a GigaRing channel. Two GigaRing channels are shared by two I/O control units. The GigaRing channel is based on the SCI standard and provides a bidirectional link connected in a ring topology. The GigaRing provides a bandwidth of up to 500 MB/s in each direction; fault tolerance is provided by folding or masking the ring to isolate broken links: In the case of ring folding, the two neighbors of the broken node short-circuit their internal paths to close the ring. In the case of ring masking, one direction in the ring is disabled. Note that the former does not affect the bandwidth of the ring, while the latter decreases the available bandwidth by a factor of two. The GigaRing channels are used to connect the Cray T3E to other Cray computers and/or mass storage units. Fiber-optic cables can be used to extend the length of the GigaRing to 200 m.

6.1.5 Software

The Cray T3E is a self-hosted system that runs under the unicos/mk operating system, a distributed version of the unicos operating system used for the parallel vector processors from Cray Research. Both operating systems are derivatives of the UNIX operating system. The unicos/mk operating system consists of a microkernel as well as several servers and provides a single system image. The microkernel is based on the CHORUS microkernel and is executed on every processing element. Note that the microkernel uses swapping instead of virtual memory and demand paging, i.e. the address space of a process is swapped entirely. There are several types of servers; the individual servers communicate via remote procedure calls and can be replicated for better performance.

The operating system distinguishes four different types of processing elements: command, application, operating system, and redundant processing elements. Command processing elements are used to execute user commands and single-processor applications, while application processing elements are reserved for parallel applications. Operating system processing elements are reserved for the corresponding servers, while redundant processing elements are used to replace failed elements. The two former types are taken from the pool of user processing elements, while the two latter types are taken from the pool of support processing elements.

Parallel applications can be executed in two different ways: interactive or batch. In interactive mode, the application is started from the command-line interface and executed as soon as a suitable partition of the machine is available. In batch mode, the application is started by submitting a corresponding request to the NQS scheduling system. The scheduling can be influenced by specifying several parameters, e.g. the maximum execution time.

The programming environment on the Cray T3E supports the following languages: CRAFT (Cray Research Adaptive Fortran), HPF (High-Performance Fortran), Fortran, C, and C++. CRAFT and HPF are extensions of the Fortran language targeted at efficient execution on distributed memory machines. Both Fortran 77 and 90 are supported as well as the standard C and C++ dialects. The individual compilers are augmented by the cld linker and the cam assembler; the totalview debugger supports the C and Fortran languages and allows source-level debugging of parallel applications.

The Message Passing Toolkit (MPT) contains the following communication libraries: PVM (Parallel Virtual Machine), MPI (Message Passing Interface), and shmem (Shared Memory). PVM [GBD+94] is a message passing library targeted at heterogeneous environments, while MPI [Wal94] is the standard message-passing library used in commercial applications. The shmem library [Cra98b] is proprietary and provides a set of routines for global memory accesses, collective operations, and synchronization.

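As a rough illustration of the shmem programming model used by the benchmarks in this chapter, the following sketch performs a one-sided put to a neighbouring processing element followed by a barrier. It assumes the Cray shmem conventions of 64 bit words and the _my_pe()/_num_pes() inquiry routines; the header name and the exact signatures may differ from the version documented in [Cra98b].

    #include <mpp/shmem.h>   /* assumed header name on the Cray T3E */

    #define N 8

    long source[N];          /* symmetric data objects: same address on every PE */
    long target[N];

    int main(void)
    {
        int me  = _my_pe();      /* number of this processing element */
        int num = _num_pes();    /* number of processing elements     */
        int i;

        for (i = 0; i < N; i++)
            source[i] = (long)me * N + i;

        /* Write the local array into the target array of the right neighbour;
         * the put is one-sided, i.e. the remote PE is not involved. */
        shmem_put(target, source, N, (me + 1) % num);

        shmem_barrier_all();     /* wait until all puts are complete  */
        return 0;
    }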

6.2 Methodology

Apart from the single-threaded base version, two versions using emulated multithreading were created for each benchmark: The bblk version uses basic blocks as instruction blocks, while the sblk version uses super block optimization to merge multiple basic blocks into one super block. All three versions of the benchmarks are derived from the same sources, although the bblk and sblk versions use the asynchronous communication routines from the emulation library instead of the synchronous communication routines from the shmem library. In addition, these two versions contain calls to the EMUthread_switch() routine in case of spin waits. Apart from these differences, all three versions of the benchmarks are built from the same sources. These sources are based on the original sources as distributed in the SPLASH-2 benchmark suite, although the sources had to be modified during the porting process to the Cray T3E. The individual changes made to the sources during this process are described in Chapter 4.

The original sources were modified such that the performance counters are used for timing measurements instead of the operating system timers. The performance counter library (PCL) [BM98] from the Research Center Jülich is used to access the individual performance counters, especially the cycle counter. Note that the cycle counters are updated in each clock cycle, hence the resolution of the timing measurements is very high compared to the measurements using the operating system. Since all timing measurements are recorded as multiples of the processor cycle time, 64 bit integers have to be used in order to avoid overflows. Therefore the format of the timing statistics had to be changed to reflect the larger size of the results. All benchmarks were compiled with the -O2 optimization flag. This flag enables moderate automatic inlining, automatic scalar optimization, and automatic vectorization.

All experiments were performed with three different problem sizes: the default problem size for the given benchmark as well as two and four times the default problem size. Experiments using different machine configurations ranging from 1 to 64 processors were performed for each of the benchmarks. In the case of emulated multithreading, the bblk and sblk versions of each benchmark were executed using different numbers of threads ranging from 1 to 16. In addition to the parallel experiments, the sequential experiments described in Chapter 5 were repeated on the Cray T3E in order to compare the sequential results on the two different platforms. The executables used during the sequential experiments are derived from the same sources that were used during the evaluation on the Compaq XP1000 workstation.
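The timing scheme can be sketched as follows; read_cycle_counter() is a hypothetical stand-in for the PCL call that returns the current cycle count (the actual PCL interface is not reproduced here), and the 64 bit accumulation avoids the overflows mentioned above.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical wrapper around the performance counter library: returns the
     * number of elapsed processor cycles as a 64 bit integer. */
    extern uint64_t read_cycle_counter(void);

    #define CLOCK_RATE 600000000ULL   /* 600 MHz, e.g. the Cray T3E-1200 PEs */

    void timed_region(void (*work)(void))
    {
        uint64_t start = read_cycle_counter();
        work();                                   /* the measured code section */
        uint64_t cycles = read_cycle_counter() - start;

        /* Convert to seconds only for reporting; all bookkeeping stays 64 bit. */
        printf("%llu cycles (%.6f s)\n",
               (unsigned long long)cycles, (double)cycles / (double)CLOCK_RATE);
    }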

Recall from Section 5.10 that the sblk version is faster than all other versions using emulated multithreading for all benchmarks and problem sizes. This statement does not apply only to the Compaq XP1000 platform, but to the Cray T3E platform as well: For all sequential and parallel experiments and all problem sizes, the sblk version is always faster than the bblk version. Hence only the sblk results of the individual benchmarks will be presented in Sections 6.3 to 6.7. The sequential results were omitted as well, since they provide no additional insights compared to the experiments on the Compaq XP1000 platform.

The figures used to illustrate the results of the individual benchmarks are all structured in the same way: The horizontal axis reflects the number of processors, while the vertical axis represents the speedup relative to the runtime of the base version on a single processing element. In each figure, seven curves are used to illustrate the results: The circles represent the speedups for the base version, while the squares, diamonds, upward triangles, leftward triangles, and rightward triangles represent the speedups of the sblk version using 1, 2, 4, 8, and 16 threads, respectively. Last but not least, the dots represent linear speedup and are provided to ease interpretation of the results. For each of the benchmarks three of these figures are provided, one for each problem size.

All experiments were performed in batch mode using the NQS queuing system on the Cray T3E, i.e. the corresponding executables had exclusive access to the processing elements. However, the NQS system assigns non-contiguous processing elements to one partition, i.e. the maximum number of routing steps between processing elements in such a partition is larger than the maximum number of routing steps required for a partition of the given size. These discontinuities may influence the benchmark results, since it is not possible to ensure that all experiments using a certain partition size are executed on partitions of the same structure. Unfortunately, there is no way to determine the structure of the partition for a given benchmark run. However, the number of routing steps is usually only one or two steps higher than the minimum number of routing steps, hence the variations in runtime should be rather small, especially for larger partitions. Discontinuous partitions can only be avoided by running the machine in dedicated mode, which is not possible as the Cray T3Es in Jülich are used by researchers all over Germany.
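For reference, the speedup plotted in all figures of this chapter is computed relative to the single-processor runtime of the base version; in the notation introduced here for clarity, $T_{\mathrm{base}}(1)$ is that runtime and $T_v(p,t)$ is the runtime of version $v$ on $p$ processors with $t$ threads per processor:

    S_v(p, t) = \frac{T_{\mathrm{base}}(1)}{T_v(p, t)}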

6.3 FFT

The project-specific configuration file for the parallel version of the fft benchmark contains three internal and three external procedures; system and library routines are covered in the platform-specific configuration file. The SlaveStart() procedure is the entry point of the parallel algorithm and contains a call to the internal FFT1D() procedure apart from some initialization and bookkeeping tasks. The FFT1D() procedure implements the six-step algorithm described in Section 4.2.1 and contains several external calls as well as an internal call to the Transpose() procedure. The Transpose() procedure transposes the source matrix into the target matrix and is the only procedure that contains references to remote memory. The three procedures have to be internal for the following reasons: Apart from being the entry point, the SlaveStart() procedure contains two barrier synchronizations as well as an internal call and therefore has to be internal. Due to the five barriers and three internal calls, the FFT1D() procedure has to be internal as well. Last but not least, the Transpose() procedure has to be internal as it contains several references to remote memory.

The three procedures consist of 21, 65, and 71 basic blocks; the assembler converter creates 7, 9, and 46 super blocks from these basic blocks. Note that the Transpose() procedure consists of more than one super block in contrast to the results on the Compaq XP1000 platform, which is explained as follows: In the parallel version of the fft benchmark, the Transpose() procedure contains references to remote memory, i.e. calls to the communication routines of the emulation library. These calls force the end of a super block such that a context switch is performed right after the remote memory access has been initiated.

The results of the experiments using the fft benchmark are summarized in Figures 6.3, 6.4, and 6.5. Figure 6.3 illustrates the results for a problem size of 64 K points, while Figures 6.4 and 6.5 illustrate the results using problem sizes of 256 K and 1024 K points, respectively. All three figures have an identical structure as described in Section 6.2.

Using a problem size of 64 K points, the source and target matrices occupy approximately 1 MB of memory each. Recall that the fft benchmark aligns each row of these matrices on cache-line and page-size boundaries, hence the matrices are slightly larger than the 1 MB that is required to store the actual data. There are three different matrices of this size: two matrices are used as source and target during transpose operations, while the third matrix contains the roots-of-unity. In addition, the first row of the latter matrix is replicated by each thread on all processors. According to [WOT+95], the first- and second-level working sets for the fft benchmark are one row of the matrix and one partition of the matrix, respectively. The size of the first-level working set is independent of the number of processors and is 4 KB. As the 21164 microprocessor used in the Cray T3E contains an 8 KB first-level data cache and a 96 KB second-level unified cache, it is likely that the first-level working set will reside in one of the internal caches. The size of the second-level working set is approximately 3 MB/p for all processors, where p is the number of processors. Although the working sets for the individual threads on a processor are even smaller, the working set of all threads on a processor will overlap due to multithreaded execution.
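The quoted sizes follow directly from the data layout, assuming the usual organization of the SPLASH-2 fft kernel in which the $n$ complex points (16 bytes each) form a $\sqrt{n} \times \sqrt{n}$ matrix:

    2^{16} \cdot 16\,\mathrm{B} = 1\,\mathrm{MB} \quad \text{per matrix}, \qquad
    256 \cdot 16\,\mathrm{B} = 4\,\mathrm{KB} \quad \text{per row (first-level working set)}, \qquad
    3 \cdot 1\,\mathrm{MB} / p \approx 3\,\mathrm{MB}/p \quad \text{(second-level working set)}.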

Fig. 6.3. Results for the FFT Benchmark (64 K Complex Data Points)

An analysis of the data presented in Figure 6.3 yields the following results: The sblk version of the fft benchmark is faster than the base version as long as no more than four threads are used. However, the advantage of the sblk version decreases with a growing number of processors, especially for the sblk version using four threads. This behavior is probably caused by the small working set for each thread and processor: Using 64 processors and 4 threads, each processor has a working set of less than 48 KB, i.e. 12 KB for each thread. This reasoning is supported by the fact that the speedups deteriorate significantly more slowly if one of the larger problem sizes is used.

The sblk version is slower than the base version if more than four threads are used. This can be explained by a combination of small working sets per thread and the increase in runtime due to the larger number of threads. Compared to the Compaq XP1000 platform, this effect is more pronounced on the Cray T3E as a context switch is executed in the inner loop of the three loop nests in the Transpose() procedure. In addition, the larger number of thread descriptors increases the number of cache misses as the first-level cache used in the 21164 processor is only 8 KB compared to the 64 KB of the 21264. Note that the results using eight or 16 threads are not complete as the problem size is too small for 1024 threads.

Using a problem size of 256 K points, each of the three matrices occupies approximately 4 MB of memory for a total of 12 MB. The sizes of the first- and second-level working sets increase to 8 KB and 12 MB/p, respectively. Note that the first-level working set is now the same size as the first-level data cache. An analysis of the data presented in Figure 6.4 yields the same results as for the smaller problem size, although all versions using emulated multithreading perform significantly better than for the smaller problem size.

Fig. 6.4. Results for the FFT Benchmark (256 K Complex Data Points)

Fig. 6.5. Results for the FFT Benchmark (1024 K Complex Data Points)

Although some of the results seem to indicate super-linear speedup, this is not the case: All speedups are computed relative to the base version on a single processor, so if the sblk version on a single processor is faster than the base version on a single processor, the sblk curve can exceed the linear speedup line. The result for the sblk version using 16 threads on 64 processors is missing as the problem size is too small for 2048 threads.

Using a problem size of 1024 K points, each of the three matrices occupies approximately 16 MB of memory for a total of 48 MB. The sizes of the first- and second-level working sets increase to 16 KB and 48 MB/p, respectively. Note that the first-level working set is now larger than the first-level data cache. An analysis of the data presented in Figure 6.5 yields the same results as for the smaller problem size, although the sblk versions using two or four threads deteriorate slightly compared to the 256 K problem size. As the sblk version using one thread is not affected, this is probably caused by an increased number of cache misses.

In summary, the results using the sblk version of the fft benchmark are quite encouraging: As long as no more than four threads are used, the sblk version is significantly faster than the base version, especially for the two larger problem sizes. However, the sblk version using eight or more threads is always slower than the base version, probably due to the increased overhead and the smaller working set per processor/thread.

6.4 LU

The project-specific configuration file for the parallel version of the lu benchmark contains six internal and four external procedures; system and library routines are covered in the platform-specific configuration file. Note that the number of internal procedures is larger than for the sequential version used on the Compaq XP1000 platform: The SlaveStart(), OneSolve(), and lu() procedures common to both versions have to be internal since they contain barrier synchronizations or calls to other internal procedures. The remaining three procedures contain references to remote memory, i.e. calls to the communication routines of the emulation library. In order to hide the latency of these references, a context switch is performed right after the reference has been initiated, i.e. these calls force the end of a super block.

The six internal procedures consist of 177 basic blocks; the assembler converter constructs 67 super blocks by merging multiple basic blocks into one super block. Note that the parallel version of the lu benchmark on the Cray T3E uses a larger number of basic blocks than the sequential version of the lu benchmark on the Compaq XP1000 platform. This difference is caused by the additional internal procedures and the different number and structure of basic blocks due to different compiler technology, as none of the three common internal procedures contains references to remote memory that could limit the size of the super blocks.

The results of the experiments using the lu benchmark are summarized in Figures 6.6, 6.7, and 6.8. Figure 6.6 illustrates the results using a matrix size of 512 × 512, while Figures 6.7 and 6.8 illustrate the results using matrix sizes of 1024 × 1024 and 2048 × 2048, respectively. All three figures have an identical structure as described in Section 6.2.

Fig. 6.6. Results for the LU Benchmark (512 × 512 Matrix)

Using a matrix size of 512 × 512, the corresponding matrix occupies 2 MB of memory. The matrix is partitioned into blocks of size 16 × 16, hence each block occupies 2 KB of memory. According to [WOT+95], the first- and second-level working sets for the lu benchmark are one block of the matrix and one partition of the whole data set, respectively. The first-level working set fits in the 8 KB first-level data cache of the 21164 processor, while the second-level working set might fit into the 96 KB second-level cache for larger numbers of processors.

An analysis of the data presented in Figure 6.6 yields the following results: The base version does not scale well, probably due to the small problem size. The sblk version of the lu benchmark is always faster than the base version as long as no more than four threads are used. The sblk version using eight or more threads is significantly slower, reflecting the increased overhead with a growing number of threads. However, the sblk version using four threads is the fastest, striking a good balance between the overhead due to additional threads and the ability to tolerate the latency of remote memory references. Similar to the base version, the speedups of the sblk version deteriorate somewhat if more than 16 processors are used, probably due to the small problem size.

Using a matrix size of 1024 × 1024, the corresponding matrix occupies 8 MB of memory, while the individual blocks are still 2 KB large. Hence the first-level working set will still fit in the 8 KB first-level data cache, while the second-level working set will no longer fit in the second-level cache even for the largest number of processors. An analysis of the data presented in Figure 6.7 yields the same results as for the smaller problem size, although the performance of the sblk version is even better: The speedups of the sblk version deteriorate only if 64 processors are used. This evidence supports the conclusion that the deterioration observed for the smaller problem size was indeed caused by the problem size.

Fig. 6.7. Results for the LU Benchmark (1024 × 1024 Matrix)

Fig. 6.8. Results for the LU Benchmark (2048 × 2048 Matrix)

Using a matrix size of 2048 × 2048, the corresponding matrix occupies 32 MB of memory, while the individual blocks are still 2 KB large. Like before, the first-level working set will still fit in the 8 KB first-level data cache, while the second-level working set will not fit in any of the internal caches. An analysis of the data presented in Figure 6.8 yields the same results as for the smaller problem sizes, although the performance of the sblk and base versions is even better, i.e. the runtimes of both versions scale well up to 64 processors.

In summary, the results for the lu benchmark are quite encouraging: As long as no more than four threads are used, the sblk version is always significantly faster than the base version. The sblk version using four threads seems to strike a good balance between the overhead due to additional threads and the ability to hide the latency of remote memory references.

6.5 Radix

The project-specific configuration file for the parallel version of the radix benchmark contains one internal and three external procedures; system and library routines are covered in the platform-specific configuration file. The internal slavesort() procedure is the entry point of the parallel algorithm and implements the radix sort algorithm described in Section 4.2.3, while the three external procedures are only used during initialization of the sort array. Apart from being the entry point, the slavesort() procedure has to be internal as it contains barrier synchronizations, spin waits, as well as references to remote memory.

The slavesort() procedure consists of 139 basic blocks and 44 super blocks. Note that the number of super blocks on the Cray T3E platform is higher than the number of super blocks in the sequential version of the radix benchmark used on the Compaq XP1000 platform. Apart from different compiler technology, i.e. a different number and structure of basic blocks, this is caused by references to remote memory, i.e. calls to the communication routines of the emulation library. These calls force the end of a super block such that a context switch is performed right after the remote memory access has been initiated.

The results of the experiments using the radix benchmark are summarized in Figures 6.9, 6.10, and 6.11. Figure 6.9 illustrates the results using a problem size of 256 K integers, while Figures 6.10 and 6.11 illustrate the results using problem sizes of 512 K and 1024 K integers, respectively. All three figures have an identical structure as described in Section 6.2.

Using a problem size of 256 K integers, the sort array occupies 2 MB of memory, since the Cray T3E uses 64 bit integers by default. As radix sort does not sort in place, two of these arrays are needed, thereby occupying 4 MB of memory. In addition, each thread maintains a histogram of the local keys; the size of the corresponding arrays depends on the selected radix. Using a radix of 1024, each of these arrays occupies 8 KB in memory. Note that the radix is independent of the problem size, i.e. the same radix was used for all experiments. According to [WOT+95], the first- and second-level working sets for the radix benchmark are a histogram and one partition of the whole data set, respectively.

Fig. 6.9. Results for the Radix Benchmark (256 K Integers)

The size of the first-level working set is independent of the number of processors and threads and is 8 KB. As the 21164 microprocessor used in the Cray T3E contains an 8 KB first-level and a 96 KB second-level cache, the first-level working set might not fit completely into the first-level data cache. The size of the second-level working set is 4 MB/p for each processor as the working sets of all threads on a given processor are likely to overlap due to multithreaded execution.

An analysis of the data presented in Figure 6.9 yields the following results: The base version does not scale well with the number of processors, at least for the given problem size. This is probably caused by the parallel prefix operations used to collect and distribute the global rank arrays, which are not completely parallelizable: The radix benchmark uses an array of 2p prefix nodes to build a binary tree across all processors. Each of the prefix nodes represents a node at a certain level in the tree and is mapped to a processor based on the processor number. Although more than one prefix node is mapped to the same processor, all prefix nodes mapped to the same processor are from different levels in the tree. Note that only one processor is active in all levels of the tree. During the gather operation, each processor copies its local rank array to the corresponding prefix node in the lowest level. Afterwards all processors that are active at a given level wait until both children of the corresponding prefix nodes have been updated and combine these rank arrays. The result of the combination is stored in the current prefix node. The scatter operation is similar to the gather operation, but works in reverse order. These operations do not scale well as most of the processors are idle, especially in the upper levels of the tree.
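The gather step described above can be sketched as follows; this is a schematic model rather than the actual SPLASH-2 code (a power-of-two processor count and a sequentially consistent shared memory are assumed, and all names are illustrative), intended only to show why most processors sit idle in the upper levels of the tree.

    #include <string.h>

    #define RADIX 1024   /* number of buckets in each local rank array */

    /* One slot per processor: a partial rank array plus a counter that gives
     * the size of the subtree already combined into this slot. */
    struct slot {
        volatile int ready;
        int rank[RADIX];
    };

    /* Schematic tree gather over p processors: after log2(p) steps, slots[0]
     * holds the sum of all local rank arrays. At step s only every (2s)-th
     * processor is still active, so most processors spin or drop out near the
     * root - the scalability problem noted in the text. */
    void gather(struct slot *slots, int me, int p, const int *local_rank)
    {
        memcpy(slots[me].rank, local_rank, RADIX * sizeof(int));
        slots[me].ready = 1;

        for (int s = 1; s < p; s *= 2) {
            if (me % (2 * s) != 0)
                return;                          /* drop out of the tree        */
            while (slots[me + s].ready < s)
                ;                                /* spin until the partner's
                                                    subtree of size s is ready  */
            for (int i = 0; i < RADIX; i++)
                slots[me].rank[i] += slots[me + s].rank[i];
            slots[me].ready = 2 * s;             /* subtree of size 2s combined */
        }
    }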

Fig. 6.10. Results for the Radix Benchmark (512 K Integers)

Fig. 6.11. Results for the Radix Benchmark (1024 K Integers)

Number of Processors network. Using the corresponding gather and scatter operations provided by the shmem library should improve the performance of the distribution and broadcast of the local ranks significantly. The sblk version of the radix benchmark is slightly faster than the base version across all numbers of processors, as long as no more than two threads are used. The sblk version using four threads is slightly slower than the base version, at least on more than eight processors. This behavior is probably due 6.6 Ocean 223 to the small working sets for each thread/processor as well as an increased overhead using more threads. Recall that the overhead on the Cray T3E is larger than the overhead on the Compaq XP1000 due to the smaller size of the instruction blocks. The effect is even more pronounced for the sblk version using eight or 16 threads, which do not seem to benefit from additional processors at all. The above reasoning is supported by the fact that all sblk versions scale better in the case of the two larger problem sizes. Using an input size of 512 K integers, each of the two sort arrays occupies 8 MB of memory for a total of 16 MB. The size of the local histogram main- tained by each thread is independent of the problem size, the corresponding arrays still occupy 8 KB of memory. Compared to the smaller problem size, the size of the first-level working set, i.e. a histogram, is the same, while the size of the second-level working set, i.e. a partition of the whole data set, doubles to 16 MB/p. An analysis of the data presented in Figure 6.10 yields the same results as for the smaller problem size, although all versions of the radix benchmark scale slightly better. This is probably due to the increased size of the working sets for each thread/processor. Using an input size of 1024 K integers, each of the two sort arrays oc- cupies 16 MB of memory for a total of 32 MB. Again, the size of the local histogram maintained by each thread is independent of the problem size, the corresponding arrays still occupy 8 KB of memory each. Compared to the smallest problem size, the size of the first-level working set, i.e. a histogram, is the same, while the size of the second-level working set, i.e. a partition of the whole data set, quadruples to 32 MB/p. An analysis of the data presented in Figure 6.11 yields the same results as for the two smaller problem sizes, although all versions of the radix benchmark scale even better than before. This fact provides further evidence that the increased working sets for each thread/processor is responsible for this behavior. In summary, the results for the radix benchmark are quite encouraging, since the sblk version is always faster than the base version as long as no more than four threads are used. Neither the base nor the sblk versions scale well with an increased number of processors. However, the focus of these experi- ments is the evaluation of emulated multithreading, the performance of the base versions is not the primary concern. As outlined above, the performance of the radix benchmark could probably be improved by using the scatter and gather operations provided by the shmem library instead of the hand-coded version found in the original sources.

6.6 Ocean

The project-specific configuration file for the parallel version of the ocean benchmark contains eleven internal and three external procedures; system and library routines are covered in the platform-specific configuration file. Note that the number of internal procedures increases from three for the sequential version of the ocean benchmark to eleven for the parallel version. The three slave(), slave2(), and multig() procedures continue to be internal, since they contain synchronization points or calls to the remaining procedures. The remaining eight procedures contain references to remote memory, i.e. calls to the communication routines of the emulation library. Hiding the latency of these references is only possible if a context switch is performed right after the remote memory access has been initiated, hence the corresponding procedures should be internal. The eleven internal procedures consist of 1892 basic blocks and 523 super blocks.

The results of the experiments using the ocean benchmark are summarized in Figures 6.12, 6.13, and 6.14. Figure 6.12 illustrates the results using an ocean with 130 × 130 grid points, while Figures 6.13 and 6.14 illustrate the results using oceans with 258 × 258 and 514 × 514 grid points, respectively. All three figures have an identical structure as described in Section 6.2.

Using a grid of 130 × 130 points, one of the corresponding arrays occupies approximately 132 KB of memory. Recall that the ocean benchmark uses 25 of these arrays for a total size of approximately 3 MB. Apart from these and several smaller arrays, the ocean benchmark uses two arrays as input to the multigrid solver. Each of these arrays has several levels; the number of levels is equal to the binary logarithm of the problem size. For the given problem size, each of the arrays has seven levels, thereby both arrays occupy approximately 2 MB of memory. According to [WOT+95], the first- and second-level working sets consist of a few subrows of one of the matrices and a partition of the whole data set, respectively. The first-level working set occupies a few KB, while the second-level working set occupies approximately 5 MB/p due to the multithreaded execution. While the first-level working set will probably fit in the 8 KB first-level data cache, the second-level working set will not fit in any of the internal caches unless 64 or more processors are used.

An analysis of the data presented in Figure 6.12 yields the following results: The base version of the ocean benchmark does not scale well; most of the time is spent in the multigrid solver. In particular, the time spent in the multigrid solver increases significantly with a growing number of processors. It is unlikely that this behavior is caused by the small working sets alone, as good speedups have been reported for the ocean benchmark on other platforms. However, the ocean benchmark scaled even worse on previous runs of the experiments; this was traced back to a bug in the assembler: Although the compiler creates correct code, the assembler miscalculated the offset of a critical constant such that unpredictable values were used. This issue could be resolved by renaming the constant, yielding a factor 32 (!) speedup on 64 processors. Due to computing time constraints, it was not possible to determine whether the poor performance of the ocean benchmark is caused by other bugs of the same kind.

Fig. 6.12. Results for the Ocean Benchmark (130 × 130 Ocean): speedup versus number of processors (1-64) for base-ocean and emu-ocean-sblk with 1, 2, 4, 8, and 16 threads.

Unfortunately, the performance of the sblk version is always worse than the performance of the base version, although the version using one thread is close. This indicates that the performance degradation is due to the overhead caused by context switches: The ocean benchmark contains a large number of loops, and most of these loops contain references to remote memory, i.e. a context switch is performed for every iteration of these loops.

Using a grid size of 258 × 258, a corresponding array occupies approximately 520 KB of memory, i.e. a total of 13 MB for the 25 arrays in the ocean benchmark. The two arrays used as input to the multigrid solver occupy approximately 7 MB, as they are now eight levels deep. The first-level working set has doubled in size and is probably too large to fit in the 8 KB first-level data cache of the 21164 microprocessor. However, the working set will still fit in the 96 KB second-level cache. The second-level working set occupies approximately 20 MB/p, where p is the number of processors. Even for the largest configuration, i.e. 64 processors, the second-level working set will not fit in any of the caches. An analysis of the data presented in Figure 6.13 yields the same results as for the smaller problem size, although all versions of the ocean benchmark scale better than before. However, the sblk version is always slower than the base version, independent of the number of threads. Note that the distance between the base and sblk versions increases as well, probably due to the increased number of loop traversals for the larger problem size.

Using a grid size of 514 × 514, a corresponding array occupies approximately 2 MB of memory, i.e. a total of 50 MB for the 25 arrays in the ocean benchmark. The two arrays used as input to the multigrid solver occupy approximately 32 MB, as they are now nine levels deep.

Fig. 6.13. Results for the Ocean Benchmark (258 × 258 Ocean): speedup versus number of processors (1-64) for base-ocean and emu-ocean-sblk with 1, 2, 4, 8, and 16 threads.

Fig. 6.14. Results for the Ocean Benchmark (514 × 514 Ocean): speedup versus number of processors (1-64) for base-ocean and emu-ocean-sblk with 1, 2, 4, 8, and 16 threads.

The first-level working set has doubled in size again; it now occupies some tens of KB and is too large to fit in the 8 KB first-level data cache of the 21164 microprocessor. However, the working set may still fit in the 96 KB second-level cache. The second-level working set occupies approximately 82 MB/p, where p is the number of processors. An analysis of the data presented in Figure 6.14 yields the same results as for the smaller problem size, although all versions of the ocean benchmark scale better than before. However, the sblk version is always slower than the base version, independent of the number of threads. The distance between the base and sblk versions increases as well, probably due to the increased number of loop traversals for the larger problem size.

In summary, the results for the ocean benchmark are disappointing: For all problem sizes, the sblk version is always slower than the base version. In addition, the distance between the two versions increases with growing problem size. This is probably caused by the overhead due to context switches, since the ocean benchmark contains a large number of loops that contain remote memory references. This problem can be addressed by improving the efficiency of the main loop in the thread execution routine and by using static prediction to determine whether a given remote memory reference should force the end of a super block; a possible form of this check is sketched below. In addition, register partitioning could be used to decrease the context switch overhead, especially if a small number of threads is used.
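The check mentioned above can be made concrete with a small sketch. The following C fragment is only an illustration under assumed names and a deliberately simple cost model; it is not the assembler converter's actual heuristic.

/* Hedged sketch: decide whether a remote memory reference should force the end of
 * the current super block. The cost model and names are illustrative assumptions. */

/* latency_cycles: predicted latency of the remote memory reference (from static
 *                 prediction or profiling data)
 * switch_cycles:  cost of the tailored context switch code at this location       */
static int should_end_super_block(double latency_cycles, double switch_cycles)
{
    /* Switching threads only pays off if the latency that can be hidden by
     * executing other threads exceeds the overhead added on every traversal. */
    return latency_cycles > switch_cycles;
}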

6.7 Barnes

The project-specific configuration file for the parallel version of the barnes benchmark contains 13 internal and six external procedures; system and library routines are covered in the platform-specific configuration file. Note that the number of internal procedures increases from five for the sequential version to 13 for the parallel version. The five common internal procedures still have to be internal since they contain synchronization points or calls to other internal procedures. The remaining eight procedures contain references to remote memory, i.e. calls to the communication routines of the emulation library. Hiding the latency of these references requires that a context switch is performed right after the remote memory access has been initiated, hence the corresponding procedures should be internal. The 13 internal procedures in the barnes benchmark consist of 425 basic and 261 super blocks.

The results of the experiments using the barnes benchmark are summarized in Figures 6.15, 6.16, and 6.17. Figure 6.15 illustrates the results using a problem size of 16 K particles, while Figures 6.16 and 6.17 illustrate the results using problem sizes of 64 K and 256 K particles, respectively. All three figures have an identical structure as described in Section 6.2.

Using a problem size of 16 K particles, approximately 2 MB of memory is used for the particle array. In addition, 34 MB and 20 MB of memory is used for the local cell and leaf arrays as well as approximately 16 MB for the local cell and leaf pointer arrays. According to [WOT+95], the first-level working set is the tree data for one particle, i.e. the leaf that contains the particle as well as all cells on the path from the root cell to that leaf. As the height of the tree is logarithmic in the number of particles and the sizes of a leaf and a cell are 264 and 216 bytes, respectively, the first-level working set is a few KB large. Note that the size of the leaf and cell structures on the Cray T3E differs from the size of these structures on the Compaq XP1000 since the Cray T3E uses 64 bit integers by default. The second-level working set is a partition of the whole data set.

Fig. 6.15. Results for the Barnes Benchmark (16 K Particles): speedup versus number of processors (1-64) for base-barnes and emu-barnes-sblk with 1, 2, 4, 8, and 16 threads.

While the first-level working set will probably fit in the first-level data cache, the second-level working set occupies 72 MB/p and will therefore not fit in any of the internal caches.

An analysis of the data presented in Figure 6.15 yields the following results: The base version does not scale well, even when the small problem size is taken into account. A detailed analysis of the runtime components reveals that most of the time is spent in the force computation phase. Due to time and resource constraints, it was not possible to determine whether the barnes benchmark is affected by bugs similar to those in the ocean benchmark. However, in contrast to the ocean benchmark, all sblk versions are faster than the base version.

Using a problem size of 64 K particles, 8.75 MB of memory is used for the particle array. In addition, 135 MB and 80 MB of memory is used for the local cell and leaf arrays as well as 66 MB for the local cell and leaf pointer arrays. Compared to the smallest problem size, the size of the first-level working set increases slightly, while the size of the second-level working set quadruples. While the first-level working set will probably still fit in the first-level data cache, the second-level working set occupies 290 MB/p and will therefore not fit in any of the internal caches. An analysis of the data presented in Figure 6.16 yields the same results as for the smaller problem size, although all versions of the barnes benchmark scale better than for the smaller problem size. The sblk versions are still faster than the base version for all numbers of threads and processors.

Using a problem size of 256 K particles, 35 MB of memory is used for the particle array. In addition, 540 MB and 320 MB of memory is used for the local cell and leaf arrays as well as 264 MB for the local cell and leaf pointer arrays.

Fig. 6.16. Results for the Barnes Benchmark (64 K Particles): speedup versus number of processors (1-64) for base-barnes and emu-barnes-sblk with 1, 2, 4, 8, and 16 threads.

Fig. 6.17. Results for the Barnes Benchmark (256 K Particles): speedup versus number of processors for base-barnes and emu-barnes-sblk with 1, 2, 4, 8, and 16 threads.

Compared to the smallest problem size, the size of the first-level working set increases slightly, while the size of the second-level working set increases by a factor of 16. While the first-level working set will probably fit in the first-level data cache, the second-level working set occupies 1159 MB/p and will therefore not fit in any of the internal caches. An analysis of the data presented in Figure 6.17 yields the same results as for the smaller problem sizes, although all versions of the barnes benchmark seem to scale significantly worse than before. However, the results in Figure 6.17 use the base version on four processors as the baseline, i.e. the results cannot be compared directly with the results for the smaller problem sizes. Note that the barnes benchmark cannot be executed on fewer than four processors due to the large amount of memory required for the given problem size.

In summary, the results of the barnes benchmark are quite encouraging: For all numbers of threads and processors, the sblk version is faster than the base version, especially for a small number of threads. This result could probably be improved by using register partitioning in order to decrease the overhead associated with emulated multithreading.

6.8 Summary

Emulated multithreading is designed to tolerate the latency of remote memory references in massively parallel processors such as the Cray T3E. The evaluation of emulated multithreading on this platform has demonstrated that emulated multithreading is feasible: Apart from the ocean benchmark, some configuration of the sblk version is always faster than the base version, for all numbers of processors. This result is encouraging, especially as the 21164 processor used in the Cray T3E is not particularly well suited for emulated multithreading due to the small size of the first-level caches and the in-order instruction issue. In addition, the overhead on the Cray T3E is higher than the overhead on the Compaq XP1000, as the remote memory accesses limit the size of the super blocks. This problem can be addressed by using profiling information to decide whether the latency of a given remote memory access should be tolerated, i.e. the algorithm that creates the super blocks has to strike a balance between latency tolerance and overhead. Alternatively, register partitioning could be used to reduce the overhead associated with context switches.

Apart from the performance of the executables using emulated multithreading, the stability of these executables is better than expected: Although emulated multithreading changes the assembler code in a complex conversion process, all executables are stable. Note that the fft, lu, and radix benchmarks use automatic self-checks to ensure the validity of the results, while the results for the ocean benchmark were checked manually. No discrepancies between the results of the base and the sblk versions were found, indicating that the current implementation of emulated multithreading is quite mature.

7. Conclusions

The current chapter summarizes this work by providing a brief overview over the previous chapters. Afterwards the open questions and future develop- ments regarding emulated multithreading are addressed. The first chapter contains a detailed analysis of current trends in sequen- tial and parallel computing based on the characteristics of past and present microprocessors. The analysis identified the characteristics of the memory system, i.e. bandwidth and latency, as one of the major bottlenecks that limit the performance of current microprocessors. On the one hand, bandwidth is less of a problem, since several techniques to increase the bandwidth are read- ily available, although the corresponding costs may be prohibitive. On the other hand, the latency of the memory system cannot be reduced below the inherent latency of the memory devices. Unfortunately, the gap between the clock frequency of microprocessors and the latency of the memory devices is steadily increasing, i.e. the latency as seen by the microprocessor in terms of clock cycles increases even faster. In the case of parallel computing, the situation is even worse due to the latency of the network that connects the individual processing elements. For references to remote memory, this latency has to be added to the latency of the local memory system, yielding latencies on the order of thousands of cycles. After identifying the latency of the memory system as one of the ma- jor performance limitations for sequential and parallel computers, several techniques for tolerating instead of reducing latency were presented. Mul- tithreading tolerates the latency by executing several threads of control on the same processor, such that other threads are executed while some threads wait for the completion of outstanding memory references. Multithreading is the most general of the latency tolerance techniques in the sense that it makes the least assumptions about the structure of the application. There- fore a large number of hardware and software approaches to multithreading were described in order to determine whether these approaches can be used to solve the problem mentioned above. However, multithreading has yet to find its way into mainstream microprocessors, i.e. multithreading has to be implemented in software on these microprocessors. Although there is a wide range of software multithreading systems, none was specifically designed to 232 7. Conclusions tolerate the latency of local or remote memory references. Hence a new tech- nique called emulated multithreading was designed for this purpose. Based on the analysis in the first chapter, the second chapter motivated the fundamental design choices behind emulated multithreading using a well- known model of multithreaded execution. These choices enable emulated mul- tithreading to support sufficiently small grain sizes that are needed to tolerate the latency of local or remote memory references: Emulated multithreading uses a static context switch strategy, i.e. context is switched after each initi- ation of a remote memory request, thereby hiding the latency of this request by executing other threads instead. The focus on remote memory references was chosen since these events incur a long latency and are easily identified. In order to reduce the context switch overhead, first-come, first-served schedul- ing was chosen for emulated multithreading. 
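As an illustration of these design choices, the following C sketch shows a thread execution loop with a static context switch strategy and first-come, first-served scheduling. All names and data structures are assumptions made for the example; the emulation library's actual thread execution routine and the tailored, inlined context switch code differ.

#include <stddef.h>

/* Hedged sketch of a first-come, first-served thread execution loop. Each thread
 * descriptor holds a continuation: the function implementing the next instruction
 * block. A block runs, initiates at most one remote memory access, stores its
 * successor in self->resume, and returns. Names are illustrative, not the API. */
typedef struct thread thread;
struct thread {
    int (*resume)(thread *self);   /* returns 0 once the thread has finished       */
    void *state;                   /* thread-private data, saved registers, ...    */
};

void execute_threads(thread threads[], int n)
{
    int live = n;
    while (live > 0) {
        /* Threads are served in the order they became ready; since every thread is
         * ready again right after its context switch, this degenerates to a simple
         * round-robin over the descriptors. */
        for (int i = 0; i < n; i++) {
            if (threads[i].resume == NULL)
                continue;                      /* already finished                  */
            if (!threads[i].resume(&threads[i])) {
                threads[i].resume = NULL;      /* mark as finished                  */
                live--;
            }
        }
    }
}

Since this loop is traversed once per context switch, its efficiency directly contributes to the overhead of emulated multithreading, which is why the evaluation chapters single it out as a target for optimization.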
In addition, the context switch code is tailored to each context switch location and embedded into the ap- plication, thereby reducing the number of save and restore operations. Based on these design choices, the basic concept of the implementation of emulated multithreading is described, e.g. the data structures, support routines, as well as a the conversion process. The individual elements of emulated multithreading were investigated with respect to performance issues in connection with the characteristics of current microprocessors, especially caches, branch prediction, scheduling, and out-of-order execution. This investigation revealed that emulated mul- tithreading benefits from large first-level data and instruction caches, the ability to prefetch memory locations, branch prediction, and out-of-order ex- ecution. Based on these results, several current RISC and CISC architectures were examined in order to determine their suitability with respect to emu- lated multithreading. The Alpha architecture was chosen as the platform for the first implementation of emulated multithreading due to the clear archi- tecture state, the support for prefetching, synchronization, branch hints, as well as the existence of out-of-order implementations and massively parallel processors based on this architecture. The third chapter provided a detailed description of the implementation of emulated multithreading that consists of two separate converters as well as a small library. The high-level language converter is responsible for all code modifications in the high-level language sources as well as the creation of con- figuration files. However, the high-level language converter has not been im- plemented since integration into the frontend of a compiler has been favored. The assembler converter performs all code modifications on the assembler language sources, i.e. creates the tailored context switch code and embeds it into the application code. The assembler converter is quite complex and uses several techniques known from compiler design: Basic block creation parti- tions the sequence of assembler instructions into basic blocks, i.e. maximal sequential sequences. The corresponding algorithm was described in detail and a proof for the worst-case runtime was given. While basic block creation 7. Conclusions 233 is a standard technique used in every compiler, the assembler converter sup- ports the creation of super blocks as well: A super block is a collection of basic blocks with a single entry point and multiple exit points. A survey of similar structures revealed that these structures were too limited for the purposes of emulated multithreading. As the evaluation of emulated multithreading has demonstrated, super blocks are an important optimization for emulated multithreading. Hence the corresponding algorithm was described in detail and proofs for the worst-case runtime and several important properties of super blocks were proven. Data-flow analysis is an integral part of the assembler converter and a prerequisite for the register allocation phase. The assembler converter uses several different data-flow analyses: Apart from the traditional live and reach- ing analysis that were adapted to suit the needs of the assembler converter, two new data-flow analyses used to detect the start and end of the individual live ranges were invented. Using lattice theory, all four data-flow analyses were proven to form distributive data-flow analysis frameworks. 
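For reference, the following C sketch shows the classical iterative data-flow computation, instantiated for liveness over basic blocks. The bit-set representation and block layout are assumptions made for the example; the converter's four analyses use their own, more elaborate frameworks.

#include <stdint.h>

/* Hedged sketch of the iterative data-flow algorithm, instantiated for classical
 * liveness on basic blocks. One bit per register: the 32 integer and 32 floating-
 * point registers of the Alpha architecture fit into a 64 bit set. */
#define MAX_SUCCS 2

typedef struct {
    uint64_t use, def;            /* registers read before written / written        */
    uint64_t live_in, live_out;   /* analysis results, initially zero                */
    int succ[MAX_SUCCS];          /* indices of successor blocks                     */
    int nsucc;
} basic_block;

void compute_liveness(basic_block b[], int n)
{
    int changed = 1;
    while (changed) {                          /* iterate to the least fixpoint      */
        changed = 0;
        for (int i = n - 1; i >= 0; i--) {     /* reverse order speeds convergence   */
            uint64_t out = 0;
            for (int s = 0; s < b[i].nsucc; s++)
                out |= b[b[i].succ[s]].live_in;         /* meet: union of successors */
            uint64_t in = b[i].use | (out & ~b[i].def); /* transfer function         */
            if (in != b[i].live_in || out != b[i].live_out) {
                b[i].live_in = in;
                b[i].live_out = out;
                changed = 1;
            }
        }
    }
}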
Based on these results, the well-known iterative data-flow algorithm was used to com- pute the desired meet-over-all-paths solution for the for data-flow analyses. The register allocator is probably the most complex part of the assembler converter and is used to reallocate all registers in the original source code. The allocation uses a heuristic approach based on graph coloring and reuses techniques from the two major approaches to register allocation via graph coloring. However, the register allocator combines features from these two approaches with new features into a unique algorithm that does not resemble any of the other two allocators. In particular, a new algorithm to split live ranges was invented that applies the priority-based approach to the tradi- tional splitting algorithm. The code conversion itself is quite simple once the results from the data-flow analysis and the register allocation are available. The fourth chapter surveyed several parallel benchmark suites in order to determine a benchmark suite that is well suited for the evaluation of emu- lated multithreading. The survey used five different criteria to determine the suitability of a given benchmarks suite: The individual benchmarks should cover a wide range of computational problems, be widely used and possess well-known characteristics, scale to at least 64 processors, be available in source code, and have modest time and space requirements. Based on these criteria, the SPLASH2 benchmark suite was chosen for the evaluation of emulated multithreading on two different platforms: the Compaq XP1000 workstation and the Cray T3E massively parallel processor. From the four kernels and eight applications in this suite, three kernels and three applica- tions were selected and subsequently ported to the Cray T3E. The algorithm, implementation, as well as the porting process for the six benchmarks was described in detail. The fifth chapter describes the evaluation of multithreading on a single- processor system, i.e. the Compaq XP1000 workstation. After a detailed dis- 234 7. Conclusions cussion of the hardware and software environments as well as the experimen- tal methodology, the impact of the code conversion process on the size and structure of the original sources was examined. This investigation was based on the statistics for the original and converted assembler sources provided by the assembler converter. For all six benchmarks, the converted sources using super block optimization used the least number of instructions, although the converted parts of the code are still up to twice as large than their orig- inal counterparts. The changes in the instruction mix were described and explained in detail, especially the impact of the super block optimization. Seven different versions were compared for every benchmark: base, g004, g016, g064, bblk, sblk, and posix. The latter six versions support multithread- ing and were executed with 1,2,4,8,16 threads in order to determine the im- pact of the number of threads on performance. In addition, each experiment was performed on three different problem sizes, i.e. the default problem size as well as twice and quadruple that size. Recall that there are no references to remote memory on this platform, i.e. no benefit to emulated multithread- ing. 
However, the results of the individual experiments are encouraging: The sblk version of all benchmarks is faster than the posix version in most cases, although the Tru64 operating system contains an efficient implementation of POSIX threads. In addition, the sblk version is only slightly slower than the base version with the exception of the fmm benchmark. These results indi- cate that emulated multithreading could be used to replace POSIX threads on single-processor or multi-processor machines with a shared-memory. For ex- ample, the current version of the widely used Apache http server uses POSIX threads and is an interesting candidate for the conversion to emulated mul- tithreading. The sixth chapter covers the evaluation of emulated multithreading on massively parallel processors, i.e. the Cray T3E. The corresponding experi- ments were performed on a Cray T3E-1200 installed at the Forschungszen- trum J¨ulich. After a detailed description of the hardware and software en- vironments as well as the experimental methodology, the results of the indi- vidual experiments were presented. For all six benchmarks, the same set of experiments that was used during the evaluation of emulated multithread- ing on the Compaq XP1000 were repeated on a single processing element of the Cray T3E. However, only the base, bblk, and sblk versions were taken into account. In the parallel case, the base, bblk, sblk versions of five different benchmarks were compared on a wide range of processor and thread numbers: For each benchmarks, the three versions were executed on 1,2,4,8,16,32, and 64 processors, the bblk and sblk were executed with 1,2,4,8, and 16 threads in each case. In addition, all experiments were performed on three different problem sizes, i.e. the default problem size as well as double and quadruple that size. The results for the fmm benchmark were omitted since all ver- sions of the benchmarks were instable due to race conditions as described in Section 4.2.6. 7. Conclusions 235

The results of the individual experiments are quite encouraging: With the exception of the ocean benchmarks, the sblk version is always faster than the base version as long as no more than four threads are used. In some cases, the sblk version was significantly faster than the base version. The problems with the ocean benchmark might be caused by errors in the porting process, as even the base version scales poorly for larger number of processors and previous performance problems could be traced to an obscure bug in the assembler. The results presented in the sixth chapter show that emulated multi- threading is able to improve the performance of parallel applications by tolerating the latency of remote memory references. However, the current implementation does not even exploit the full potential of emulated multi- threading. There are several ways to further increase the performance: For ex- ample, the main loop in the thread execution routine that is traversed once for each context switch, uses neither hand-coded optimizations nor prefetching. The communication routines in the emulation library are written in straight- forward C code and should be ported to assembler code and optimized for the 21164 processor. The following paragraphs describe other techniques in- tended to decrease the overhead associated with emulated multithreading, thereby increasing performance. Register partitioning divides the register set into multiple partitions and confines each thread to its own partition, thereby reducing the number of save and restore operations: As long as none of the threads executes an external call, the individual threads will not destroy the contents of other partitions, hence it is no longer necessary to save and restore registers from the thread descriptor. However, the number of spill operations might increase due to the smaller number of available registers in a partition, i.e. the increased register pressure. Register partitioning requires changes to the high-level language and assembler converters as well as the thread execution routine. These changes have already been implemented in the case of the assembler converter. The results of the evaluation on single-processor and massively parallel processor machines reveal that super block optimization is important to the performance of emulated multithreading. The current algorithm used to cre- ate the super blocks is quite simple and can be improved by using profiling information to guide the creation of super blocks: Frequently traversed edges in the control-flow graph should not cross super block boundaries, otherwise a context switch has to be performed each time. Techniques known from trace scheduling could be used to implement the use of profiling information in an improved version of the algorithm. Instead of profiling information, static predication could also be used. Another way to increase the size of super blocks is to balance the ability to tolerate latency against the overhead caused by frequent context switches. This can be accomplished by checking for each remote memory reference whether the corresponding latency should be tolerated, i.e. whether the reference should force the end of a super block. 236 7. Conclusions

The current implementation of emulated multithreading would also benefit from tighter integration with a compiler. For example, this approach provides access to the profiling information used to create super blocks as well as the ability to steer the compiler to produce code that is well suited for emulated multithreading. In the case of register allocation and code scheduling, this kind of integration was shown to be useful [BEH91]. The advantages of integrated support for emulated multithreading in a compiler were already outlined in Section 3.8 using the SUIF compiler as an example. Apart from better opportunities for optimizations, the assembler and high-level language converters could reuse the existing compiler infrastructure. The next implementation of emulated multithreading should therefore be integrated into a compiler, preferably the SUIF compiler system.

Before working on the next implementation of emulated multithreading, the current implementation should be used in further evaluations in order to get a more complete picture of the performance characteristics: The number of benchmarks used during the evaluation of emulated multithreading should be increased in order to ensure that the results from Chapters 5 and 6 apply to a wide range of computational problems. The remaining six benchmarks of the SPLASH2 benchmark suite would be suitable candidates for such an evaluation, as these benchmarks could be ported quite easily based on the current experience. In addition, the impact of compiler optimizations on emulated multithreading should be investigated, as some of the traditional optimizations were shown to be counter-productive for multithreaded applications [LEL+99].

Finally, emulated multithreading should be ported to other platforms since the Alpha architecture was discontinued recently. The most promising candidates would be the Power and Sparc architectures, since the MIPS and HP-PA architectures were discontinued some time ago and the IA32 and IA64 architectures are not suitable for emulated multithreading. As parallel machines based on the Sparc architecture currently do not support more than 64 processors, the Power architecture should be used, as large-scale massively parallel processors based on this architecture are available. In the meantime, emulated multithreading could be ported to the AlphaServer SC [Cor00], a cluster of workstations or servers based on the Alpha architecture and the Tru64 operating system. This system combines the advantages of the two platforms used in the current evaluation of emulated multithreading: the 21264 processor with out-of-order execution and large first-level caches, and a massively parallel processor with a fast network that supports asynchronous communication. As the AlphaServer SC provides a software environment that is very similar to the existing platforms, the port should be possible in a short amount of time. In addition, emulated multithreading should be extended to cover MPI one-sided messages instead of the shared memory primitives from the shmem library, as MPI is in much more widespread use than the proprietary shmem library, which is only available on Cray/SGI systems.

A. Alpha Architecture & Implementations

Using the formulation from Amdahl, Blaauw, and Brooks [ABBJ64], the Alpha Architecture Reference Manual [Com98] defines the terms architecture and implementation as follows:

Architecture is defined as the attributes of a computer seen by the machine-language programmer. This definition includes the instruction set, instruction formats, operation codes, addressing modes, and all register and memory locations that may be directly manipulated by a machine-language programmer.

Implementation is defined as the actual hardware structure, logic design, and data-path organization of the computer.

This chapter gives an overview of the Alpha architecture as well as the implementations of this architecture. Section A.1 describes the background of the Alpha architecture, while the architecture itself is described in Section A.2. The characteristics of past, present, and future implementations are specified in Section A.3. Systems based on these implementations are outside the scope of this chapter; information about these systems can be found in various issues of the Digital or Compaq Technical Journals.

A.1 Introduction

Some background information is required in order to understand the individual characteristics of the Alpha architecture: The Alpha architecture was developed by Digital in order to provide an upgrade path to the existing VAX customer base. Section A.1.1 describes this successful CISC (Complex Instruction Set Computer) architecture as well as the problems associated with high-performance implementations. These problems motivated work on several RISC (Reduced Instruction Set Computer) projects within Digital. The various efforts were later merged into a single project (PRISM), which is sketched in Section A.1.2. After this project was canceled, work started on the Alpha architecture. Section A.1.3 specifies the various design goals for the Alpha architecture project, while Section A.2 introduces the architecture itself.

A.1.1 VAX Architecture

The VAX 11/780 was introduced in 1977 and was the first implementation of the VAX (Virtual Address eXtension) architecture. A detailed descrip- tion of the architecture as well as the first implementation was published in [Str78]. The VAX architecture was designed to provide a 32 bit extension to the successful PDP-11 architecture, hence the name. The PDP-11 was a 16 bit system with a 64 Kbyte virtual address space, which started to be a limiting factor. Compatibility with the older system was an important design goal for the VAX architecture: Apart from equivalent and identical instructions, data types, compilers, file systems and I/O busses, the VAX architecture provided a PDP-11 compatibility mode. During execution in this mode, the PDP- 11 registers were mapped to native VAX registers and all addresses were sign-extended to 32 bit. This mode supported the whole PDP-11 instruction set with the exception of floating-point and privileged instructions. The most important difference between the two systems was the 4 Gbyte virtual address space provided by the VAX architecture. The VAX architecture provides sixteen 32 bit general-purpose registers used by all instructions, i.e. there is no dedicated floating-point register file. Four registers have a special meaning: program counter, stack pointer, frame pointer, and argument pointer. The architecture uses four different access modes: kernel, executive, supervisor, and user. The user mode context con- tains the general-purpose registers and the processor status word, other re- sources are only accessible in one of the privileged modes. The processor status word stores the condition codes and is used to enable and disable the various traps. The VAX architecture supports the usual integer and floating-point data types as well as bit fields, character and decimal strings. There are no restric- tions to alignment, i.e. all data types may start on arbitrary byte locations, bit fields may start on arbitrary bit locations. The instruction set contains 304 instructions, later implementations support even more. The supported instructions include the usual integer, floating-point, and control instructions as well as queue, string, and context switch instructions. Almost every com- bination of instruction, data type, and addressing mode is possible, making the VAX instruction set highly orthogonal. This leads to a complex instruct- ion format: one or two bytes of operation code followed by operand specifiers (2 to 10 bytes) for up to three operands. Instruction length therefore varies between 1 and 32 bytes, which complicates instruction decoding. The VAX architecture provides a rich set of addressing modes: immediate, register direct with offset, post-increment or post-decrement, register indirect with offset or post-increment, as well as indexed addressing. Indexed address- ing is only used in combination with one of the aforementioned addressing modes, the final operand address is determined by adding the contents of A.1 Introduction 239 the specified register (multiplied by operand size) to the base register of the original mode. Although the VAX architecture provides a rich functionality, studies show that a lot of this functionality was infrequently used [Wie82][CL82].

A.1.2 Digital RISC Projects

The VAX architecture has a large number of intra- and inter-instruction dependencies, making high-performance implementations difficult: intra-in- struction dependencies complicate pipelining, whereas inter-instruction de- pendencies make super-scalar and out-of-order implementations almost im- possible. Similar restrictions exist in other CISC architectures. During the early 80s, RISC (Reduced Instruction Set Computer) as proposed by Patter- son [PD80] promised to overcome these restrictions by using a fundamentally different approach. The first Digital RISC project started in 1982 to explore the feasibility of an ECL (Emitter Coupled Logic) RISC implementation. A comparison be- tween this implementation and a VAX architecture implementation based on similar technology proved favorable to the RISC approach [BC91b]. Several RISC projects were subsequently started within Digital: SAFE (Streamlined Architecture for Fast Execution), HR-32 (Hudson RISC) and CASCADE. These efforts were later combined into the PRISM (Parallel RISC Machine) project which is described in the next paragraph. The PRISM project defined a 32 bit as well as a 64 bit architecture. Both architectures provide 64 general-purpose registers for integer and floating- point instructions, i.e. there are no split register files. All operations are per- formed between these registers, memory is referenced by load and store in- structions only. Instructions have a fixed length and use up to three operands. There are no condition codes, branch instructions operate on registers in- stead. Data types are limited to VAX longword (32 bit version), VAX quad- word (64 bit version), VAX F and G floating-point formats, i.e. there is no support for byte and word data types. An interesting feature is the support for vector arithmetic with 16 vector registers, each 64 elements wide, and corresponding vector instructions. In summary, the PRISM architecture is a typical load/store RISC architecture with the exception of the combined register file and support for vector arithmetic. A detailed description of the PRISM architecture was published in [BOW+90]. The PRISM project developed an implementation of the 32 bit PRISM architecture intended for Digital’s workstation line. Unfortunately, the MIPS architecture was chosen instead, hence this implementation was never com- mercialized. A brief description of the implementation was published in [CDD+89], the important characteristics are summarized here: The imple- mentation uses a 5-stage pipeline with fixed instruction latencies and in-order instruction issue, but out-of-order completion. The chip was fabricated in a 240 A. Alpha Architecture & Implementations

1.5 µm CMOS process, using approximately 300 000 transistors and a maximum clock frequency of 50 MHz.

A.1.3 Design Goals

One of the first publications regarding the Alpha architecture [Sit92] lists four design goals for the architecture:

• High Performance
• Longevity
• Capability to run both VMS and UNIX operating systems
• Easy migration from VAX and MIPS architectures

Recent publications, e.g. the Alpha Architecture Reference Manual [Com98], list slightly different goals:

• Longevity
• High Performance
• Adaptability
• Scalability

Considering the Alpha architecture's background, the first version seems more accurate, whereas the second version seems to reflect the current goals for the Alpha architecture.

High Performance and Longevity. The Alpha architects set a 15- to 25-year design horizon for the new architecture. Based on their experience with the PDP-11 and VAX architectures, especially the lack of address space, the Alpha architecture was designed as a full 64 bit architecture from the start. Furthermore, the architects tried to avoid anything that might limit future performance: The instruction set is subsettable, i.e. support for specific instruction subsets can be added to or removed from future versions of the architecture. For example, support for VAX floating-point data types is one of these subsets that may be removed in the future. Instruction encodings that are not occupied are implemented as traps or null operations in current implementations, such that code written for future implementations that support additional instructions can still be executed on today's processors via software emulation. In addition, the internal fields of the architecture were generously sized for later expansion.

Since computer performance had increased by a factor of 1000 in the past 25 years, the Alpha architecture was designed to allow a similar performance improvement during its life-time. This improvement was to come from three different sources:

Clock frequency was to provide a factor of ten improvement in performance during the architecture's life-time. At the time of writing, implementations of the Alpha architecture have already achieved a factor of 7

improvement in clock frequency: from 150 MHz for the first implementa- tion (21064) to 1 GHz for the latest implementation (21264B). According to the current road-maps, the goal of ten-fold improvement in clock fre- quency will be achieved in the 2001/2002 time-frame, i.e. significantly before the projected life-time. These performance improvements are not caused by a degenerate baseline, i.e. an initial implementation using a slow clock. Rather, implementations of the Alpha architecture have been using the fastest clock frequencies among all microprocessors until re- cently. Super-scalar execution was to provide another factor of ten improvement by executing multiple instructions in a single clock cycle. In order to en- able this level of super-scalar execution, the Alpha architecture avoids special processor state like branch condition codes and special registers. The only special resources in the Alpha architecture are the dedicated state required by the IEEE 754 floating-point standard [IEE85] for dy- namic rounding of floating-point values as well as the multiprocessing primitives as described below. In particular, the Alpha architecture has no branch delay slots like other RISC architectures, no byte and word memory references, and no precise exceptions. However, byte and word memory references as well as precise exceptions were later added to the architecture. Since the first implementation (21064A) issued up to two instructions per cycle and current implementations issue up to six (four sustained) instructions per cycle, there is only a factor of three improvement up to now. It has proved difficult to extract the available amount of instruction level parallelism, hence these small improvements. Future implementa- tions of the Alpha architecture, e.g. the 21464, were planned to employ simultaneous multithreading to increase the number of instructions that can be issued in parallel. Parallel processing was to provide the remaining factor of ten improve- ment in performance. Therefore the Alpha architecture was designed with multiprocessing capabilities from the outset. The architecture provides load-locked and store-conditional primitives to support atomic reads and writes to shared memory. The underlying memory model uses relaxed consistency with explicit barrier instructions to enforce strict ordering between memory references. Systems based on the Alpha architecture use up to 32 processors in shared memory systems like the AlphaServer GS320 [GSSD00], and up to 2048/4096 processors in distributed mem- ory machines like the Cray T3E [Oed96] and Compaq AlphaServer SC [Cor00]. The next implementation of the Alpha architecture, i.e. the 21364, will provide integrated support for 64-way shared memory mul- tiprocessing. However, the performance improvement in practice is hard to determine since parallel speedup is largely application-specific. 242 A. Alpha Architecture & Implementations

Operating systems and Migration. One of the major goals in devel- oping the Alpha architecture was to provide the existing MIPS/OSF and VAX/VMS customer base with an upgrade path to the new architecture. The Alpha architecture uses PALcode to decouple the architecture from op- erating system-specific issues. PALcode is a set of low-level functions, e.g. page table refill and context switch, that are tailored to the operating sys- tem. These routines are executed in a special environment with disabled interrupts, no memory translation, and access to additional state that is pro- vided by the implementation. The PALcode routines reside in main memory and can therefore be updated at runtime. Another feature for operating system migration is the support for VAX integer and floating-point data formats. The corresponding instructions are subsettable, i.e. these instructions may be removed in future versions of the architecture. Binary translation was used to execute MIPS/OSF and VAX/VMS binaries on the Alpha architecture under the OSF and VMS op- erating systems, respectively. Detailed descriptions of the translation process were published at the launch of the Alpha architecture [SKMR92]. Adaptability and Scalability. As mentioned above, instead of indepen- dence from operating systems and ease of migration, later revisions of the Alpha architecture manual state adaptability and scalability as goals for the Alpha architecture. It is likely that the earlier goals are no longer mentioned since the migration from the VAX and MIPS architectures has been largely finished by now. Adaptability means that the Alpha architecture is largely independent from operating system and implementation details. Therefore PALcode remains an important feature of the architecture. Scalability means that the Alpha architecture can be used in low-cost systems as well as high- performance systems. Early implementations of the Alpha architecture were used in such diverse areas as embedded computing and massively parallel computers, but later implementations have focused on high-performance sys- tems.

A.2 Alpha Architecture

This section describes the Alpha architecture in detail. Section A.2.1 covers the architecture state as seen by the machine-language programmer, while Section A.2.2 covers the supported address, data, and instruction formats. The syntax and semantics of the individual instructions are described in Section A.2.3. Section A.2.4 provides information about the PALcode environment.

A.2.1 Architecture State

The architecture state as seen by the machine-language programmer is depicted in Figure A.1. There are 32 integer registers (R0-R31) and 32 floating-point registers (F0-F31), each 64 bits wide.

Fig. A.1. Alpha Architecture State: the integer registers R0-R31, the floating-point registers F0-F31, the program counter PC, the floating-point control register FPCR, and the per-processor lock address register and lock flag (LOCK_ADDR, LOCK_FLAG).

Registers R31 and F31 are zero source and sink registers, i.e. reads from these registers return zero and values written to these registers are discarded. The program counter (PC) contains the virtual address of the current instruction. The program counter is incremented by every instruction; branch-type instructions can modify the program counter in other ways. The floating-point control register (FPCR) is 64 bits wide, and most of these bits are reserved for future use. The remaining bits contain status and control information for the different floating-point trapping modes as well as the rounding mode for instructions using dynamic rounding mode. Apart from the IEEE +∞ rounding mode, all rounding modes can be specified explicitly in the instruction, therefore the floating-point control register is seldom used during normal operation. Additional state is available in the PALcode environment; the amount of state depends on the implementation. Typical implementations provide additional registers, processor status words, stack pointers, address and data translation buffers, as well as other implementation-specific state.
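For orientation, the user-visible state just described can be summarized as a small C structure. This is purely an illustration of Figure A.1, not a description of how an implementation organizes its state.

#include <stdint.h>

/* Hedged sketch of the user-visible Alpha architecture state from Figure A.1. */
typedef struct {
    uint64_t r[32];      /* integer registers R0..R31; R31 always reads as zero         */
    uint64_t f[32];      /* floating-point registers F0..F31; F31 always reads as zero  */
    uint64_t pc;         /* program counter: virtual address of the current instruction */
    uint64_t fpcr;       /* floating-point control register (trap and rounding control) */
    uint64_t lock_addr;  /* per-processor lock address register used by LDx_L/STx_C     */
    unsigned lock_flag;  /* per-processor lock flag used by LDx_L/STx_C                 */
} alpha_state;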

A.2.2 Address, Data and Instruction Formats

Address Format. The Alpha architecture uses the byte, i.e. 8 bits, as the basic addressable unit. Virtual addresses are 64 bits wide, forming a virtual address space of 16 Exabytes. Implementations may restrict the effective size of virtual addresses by using identical high-order bits, but at least 43 bits must be distinguishable, yielding a maximum virtual address space of 8 Terabyte. Note that all implementations have to check all 64 bits for virtual addresses, independent of the size of the implemented virtual address space. Early implementations support only the minimum virtual address space, while current implementations distinguish up to 48 bits, yielding a maxi- mum virtual address space of 256 Terabyte. The Alpha architecture supports multiple disjoint address spacing by using additional address space numbers 244 A. Alpha Architecture & Implementations

Fig. A.2. Alpha Architecture Data Formats: memory and register layouts of the integer data types (quadword, longword, word, byte) and of the Gfloat, Ffloat, Tfloat, and Sfloat floating-point data types (sign, exponent, and mantissa fields).

(ASN). The size of the physical address space depends on the implementa- tion and is usually a lot smaller: Early implementations support 34 bits while current implementations support up to 44 bits. The default byte ordering is little endian, i.e. the individual bytes within a quadword are numbered from right to left. Big endian support, i.e. numbering the bytes from left to right, is optional in the Alpha architecture. However, all implementations support this option so far. Note that the instruction stream is always accessed in little endian order regardless of the chosen endianness. Data Formats. The most important data types supported by the Alpha architecture are the quadword, longword, word, and byte integers as well as the VAX Ffloat, Gfloat, and the IEEE Sfloat, Tfloat floating-point data types. The memory and register layout of these data types is depicted in Figure A.2. Note that the register format is the same for all data types within the same category. The following paragraphs describe the individual data types in more detail. A byte contains eight contiguous bits starting on an arbitrary byte bound- ary. This data type is only supported by a few instructions, namely load and store, sign extend, and byte manipulation instructions. A word contains two contiguous bytes starting on an arbitrary byte boundary. The address of a word equals the address of the byte containing the least significant bit. Like the byte, the word is only supported by load and store, sign-extend, and byte manipulation instructions. The byte and word data types were not supported in early revisions of the Alpha architecture, this support was added later with the byte and word extension (BWX). A longword contains four contiguous bytes starting on an arbitrary byte boundary. The address of a longword equals the address of the byte containing the least significant bit. Longwords are always treated as two’s-complement A.2 Alpha Architecture 245 integers, i.e. there are no unsigned longwords. The quadword contains eight contiguous bytes starting on an arbitrary byte boundary. The address of a quadword equals the address of the byte that contains the least significant bit. The quadword represents the fundamental data type in the Alpha archi- tecture and is therefore supported by all types of instructions. The VAX Ffloat and Gfloat floating-point format contain four and eight contiguous bytes starting on an arbitrary byte boundary, respectively. The address of these data types equals the address of the byte containing the least significant bit. The Ffloat floating-point data type consists of a sign bit s, an eight bit exponent e represented in base-128 format as well as 24 bits of normalized mantissa m. The Gfloat floating-point data type consists of a sign bit s, an eleven bit exponent e represented in base-1024 format as well as 53 bits of normalized mantissa m. In both cases, the most significant bit is only implied, but not stored in memory. The values v1 represented by a Ffloat and the value v2 represented by a Gfloat floating-point data type are determined by the following formulas:

v_1 = (-1)^s \cdot 2^{e-128} \cdot \left( 1 + \sum_{i=0}^{23} m_{23-i} \cdot 2^{-i} \right)

v_2 = (-1)^s \cdot 2^{e-1024} \cdot \left( 1 + \sum_{i=0}^{52} m_{52-i} \cdot 2^{-i} \right)

Note that there are two special cases: A zero sign bit together with a zero exponent represents the value 0, while a sign bit of one together with a zero exponent is taken as a reserved operand, causing an arithmetic exception if used by a floating-point instruction.

The IEEE Sfloat and Tfloat floating-point formats contain four and eight contiguous bytes starting on an arbitrary byte boundary, respectively. The address of these data types equals the address of the byte containing the least significant bit. The IEEE Sfloat data type consists of a sign bit s, an eight bit exponent e represented in base-127 format as well as 24 bits of normalized mantissa m. The most significant bit is only implied, but not stored in memory. The IEEE Tfloat data type consists of a sign bit s, an eleven bit exponent e represented in base-1023 format as well as 52 bits of normalized mantissa m. The value v1 represented by an Sfloat and the value v2 represented by a Tfloat floating-point data type are determined by the following formulas:

v_1 = (-1)^s \cdot 2^{e-127} \cdot \left( 1 + \sum_{i=0}^{23} m_{23-i} \cdot 2^{-i} \right)

v_2 = (-1)^s \cdot 2^{e-1023} \cdot \left( 1 + \sum_{i=0}^{52} m_{52-i} \cdot 2^{-i} \right)

Note that there are several special cases: A zero exponent together with a zero mantissa represents the values +0 or -0, depending on the sign bit. In the case of a zero exponent and a non-zero mantissa, the represented values are determined by the following formulas:

v_1 = (-1)^s \cdot 2^{-126} \cdot \sum_{i=0}^{23} m_{23-i} \cdot 2^{-i}

v_2 = (-1)^s \cdot 2^{-1022} \cdot \sum_{i=0}^{52} m_{52-i} \cdot 2^{-i}

An exponent of all ones together with a zero mantissa represents +∞ or -∞, depending on the sign bit. An exponent of all ones together with a non-zero mantissa represents a Not-A-Number (NaN). NaNs come in two different forms: Quiet NaNs and Signaling NaNs. The former propagates through almost all operations without generating an arithmetic exception, while the latter signals an invalid operation when used by an arithmetic instruction and may cause an arithmetic exception.

Apart from the data types mentioned above, the Alpha architecture has limited support for the VAX Dfloat floating-point data type, i.e. load, store, and conversion to Gfloat. The three least significant bits of the mantissa are lost during the conversion process. Support for this data type enables the processing of data files produced by legacy applications under VAX/VMS. The IEEE Xfloat floating-point data type is currently only supported in software, but the Alpha architecture defines the necessary register and memory formats. This ensures consistency with possible future implementations that may support this data type.

Instruction Formats. The Alpha architecture uses seven different instruction formats: two memory instruction formats, a branch instruction format, three operate instruction formats, and a PALcode instruction format. All instruction formats are 32 bits wide with a common opcode field located in the most significant six bits of the instruction. The individual instruction formats are depicted in Figure A.3 and are described in the following paragraphs.

The first memory instruction format is used for load and store instructions and indirect branches. Apart from the common 6 bit opcode, the format contains two 5 bit register addresses and a 16 bit displacement. The displacement is used by load and store instructions to specify the effective address: The displacement is sign-extended and added to the contents of the source register Rb, ignoring any overflows. Indirect branches use the displacement to provide hints to the branch prediction logic; the meaning depends on the individual instruction and is explained in Section A.2.3. The second memory instruction format is used for a set of miscellaneous instructions. It differs from the first memory instruction format by substituting the displacement with a function code of the same size.

Fig. A.3. Alpha Architecture Instruction Formats. The seven formats and their field widths in bits, from the most significant end:

Memory1: Opcode (6), Ra (5), Rb (5), Displacement (16)
Memory2: Opcode (6), Ra (5), Rb (5), Function (16)
Branch: Opcode (6), Ra (5), Displacement (21)
PALcode: Opcode (6), PALcode (26)
Operate1: Opcode (6), Ra (5), Rb (5), 0 (3), 0 (1), Function (7), Rc (5)
Operate2: Opcode (6), Ra (5), Literal (8), 1 (1), Function (7), Rc (5)
Operate3: Opcode (6), Fa (5), Fb (5), Function (11), Fc (5)

31 0 PALcode Opcode PALcode 6 26 tuting the displacement with a function code of the same size. The common opcode together with this function code specifies the individual instruction. Some instructions may use less than two register addresses, the unused fields must contain the address of register R31. This principle applies to all instruc- tions defined by the Alpha architecture, floating-point instructions specify the address of register F31 instead. The branch instruction format is used for PC-relative branches and sub- routine calls. Apart from the common 6 bit opcode, it contains one 5 bit register address and a 21 bit displacement. The displacement is used to form the virtual address of the target branch or call location by adding the sign- extended displacement to the contents of the updated program counter. Note that the updated program counter is equal to the address of the current in- struction plus four, the size of a single instruction. The displacement is inter- preted as an instruction offset since the target branch or call location must start on an instruction boundary, hence the displacement is multiplied by four prior to addition. PALcode instructions use the PALcode instruction format. This format contains only the common opcode and a 26 bit function code. The first operate format is used for integer instructions that utilize three register operands. Apart from the common 6 bit opcode, the format contains two 5 bit register addresses Ra and Rb for the source operands, a 5 bit register address Rc for the destination operand, and a 7 bit function code. Bit number 12 distinguishes between this format and the second operate format described below. The three unused bits are zero for all instructions, at least in the current revision of the Alpha architecture. The second operate instruction format is similar to the first one and is used for integer instructions that use two register operands and one literal. This format substitutes the source register address Rb and the three unused bits above with an eight bit literal that is zero-extended before use. 248 A. Alpha Architecture & Implementations

The third operate instruction format is used for instructions that operate on the floating-point register set. Apart from the common 6 bit opcode, it contains two 5 bit source register addresses, one 5 bit destination register address, and an 11 bit function code. This function code together with the common opcode selects the individual instruction.
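As a small illustration of the branch instruction format described above, the following C sketch (added here, not taken from the architecture handbook) computes a branch target exactly as described: the 21 bit displacement is sign-extended, multiplied by four, and added to the updated program counter. The example encoding in main() fills only the displacement field and is purely hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Compute the target of a PC-relative Alpha branch. The 21-bit
 * displacement counts instructions, so it is sign-extended, scaled
 * by four and added to the updated PC (address of the branch + 4). */
static uint64_t branch_target(uint64_t branch_pc, uint32_t instruction)
{
    int64_t disp = instruction & 0x1FFFFF;   /* low 21 bits of the word */
    if (disp & 0x100000)                     /* sign-extend from bit 20 */
        disp -= 0x200000;
    return branch_pc + 4 + (uint64_t)(disp * 4);
}

int main(void)
{
    /* A displacement field of -2 at address 0x12000 targets 0x11FFC
     * (opcode and register fields are left zero in this toy value). */
    printf("0x%llx\n",
           (unsigned long long)branch_target(0x12000, 0x1FFFFE));
    return 0;
}
```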

A.2.3 Instruction Set

The instructions defined by the Alpha architecture can be grouped into several disjoint sets: integer memory, control, arithmetic, logical and shift, as well as floating-point memory, control, and arithmetic instructions. Later revisions of the Alpha architecture define additional instructions in the following four extensions: byte and word extension (BWX), multimedia extension (MVI), floating-point extension (FIX), and count extension (CIX). The following sections describe the individual instructions in the basic sets as well as the current extensions.

Integer Memory Instructions. The integer memory instructions transfer data between memory and integer registers, with the exception of the LDA/LDAH instruction pair. The individual instructions in this set are listed in Table A.1. None of the instructions in this set interprets the transferred data in any way, hence no arithmetic exceptions are possible.

The LDA/LDAH instruction pair is used to form constant values: The LDA instruction places the sign-extended 16 bit displacement in the destination register, while the LDAH instruction performs a 16 bit left shift first. Almost every 32 bit constant can be constructed using this instruction pair, for example 0x12345678 via LDAH Rd, 0x1234(R31) followed by LDA Rd, 0x5678(Rd); since both displacements are sign-extended, some constants require a third instruction. Larger constants should be loaded from memory.

The load and store instructions support longwords (LDL/STL) and quadwords (LDQ/STQ) and come in three different forms: The normal LDx/STx instructions transfer data between the destination register and memory at the virtual address given by adding the sign-extended displacement to the source register. The unaligned LDx_U/STx_U forms ignore the three least significant bits of the virtual address, i.e. they access the aligned quadword that contains the desired location. These instructions are used to replace accesses to unaligned memory locations by a short sequence of instructions.

The LDx_L/STx_C instruction pairs enable atomic updates of longwords and quadwords: If the LDx_L instruction succeeds, it stores the translated virtual address in the per-processor lock address register described in Section A.2.1 and sets the per-processor lock flag. A subsequent STx_C instruction only performs the store operation if the lock flag is still set and the lock address register points to the same address. The outcome of the store instruction is recorded in the source register; the per-processor lock flag is cleared in all cases. Using these instructions, an atomic read-modify-write access can be implemented as follows:

Table A.1. Integer Memory Instructions

  Name    Description
  LDA     Load Address
  LDAH    Load Address High
  LDL     Load Sign-Extended Longword
  LDL_L   Load Sign-Extended Longword Locked
  LDQ     Load Quadword
  LDQ_L   Load Quadword Locked
  LDQ_U   Load Quadword Unaligned
  STL     Store Longword
  STL_C   Store Longword Conditional
  STQ     Store Quadword
  STQ_C   Store Quadword Conditional
  STQ_U   Store Quadword Unaligned

L1:   LDQ_L   R1, disp(R2)
      (modify R1)
      STQ_C   R1, disp(R2)
      BEQ     R1, L1

To ensure that the above sequence eventually succeeds, several programming considerations have to be met: The LDx_L/STx_C instruction pair must access the same address region and the modify sequence must not generate any exceptions. The size of the address range depends on the implementation and ranges from 16 bytes to the size of a physical page. Exception handlers clear the per-processor lock flag, hence the modify sequence should not contain any other memory operations. In addition, the modify sequence should contain fewer than 40 instructions, otherwise the sequence may always fail due to timer interrupts.

Integer Control Instructions. Three different forms of integer control instructions are available: conditional branches, unconditional branches, and jumps. The individual instructions in this set are listed in Table A.2 and are described in the following paragraphs.

Conditional branches test the source register and add the displacement to the contents of the updated program counter if the specified relationship is true, otherwise the program counter remains unchanged. The following relationships are supported: = (BEQ), ≥ (BGE), > (BGT), ≤ (BLE), < (BLT), ≠ (BNE), as well as even (BLBC) and odd (BLBS).

The unconditional branches store the contents of the updated program counter in the specified destination register and update the program counter like their conditional counterparts. The BR and BSR instructions perform identical operations, but differ in their hints to the branch prediction logic: In contrast to BR, BSR pushes the contents of the updated PC onto the return-address stack.

Table A.2. Integer Control Instructions

  Name            Description
  BEQ             Branch Equal to Zero
  BGE             Branch Greater Than or Equal to Zero
  BGT             Branch Greater Than Zero
  BLBC            Branch Low Bit Clear
  BLBS            Branch Low Bit Set
  BLE             Branch Less Than or Equal to Zero
  BLT             Branch Less Than Zero
  BNE             Branch Not Equal to Zero
  BR              Branch
  BSR             Branch to Subroutine
  JMP             Jump
  JSR             Jump to Subroutine
  RET             Return from Subroutine
  JSR_COROUTINE   Jump to Subroutine Return

The indirect branch instructions store the contents of the updated program counter in the specified destination register, while the program counter is loaded from the specified source register. Similar to the unconditional branches above, all indirect branch instructions perform identical operations; they differ only in the hints to the branch prediction logic: The JMP and JSR instructions use the 14 least significant bits of the displacement to form a PC-relative hint that can be used to prefetch the instruction stream at the target location. In addition, the JSR instruction pushes the contents of the updated program counter onto the return-address stack. The RET and JSR_COROUTINE instructions pop the return-address stack and use this value as the predicted target location. In addition, the JSR_COROUTINE instruction pushes the contents of the updated program counter onto the return-address stack.

Integer Arithmetic Instructions. Integer arithmetic instructions include the following operations: add, subtract, and multiply, as well as signed and unsigned comparisons. The individual instructions are listed in Table A.3 and are described in the following paragraphs.

The ADDx and SUBx instructions perform integer addition and subtraction and are available in three different forms: normal, scaled-by-4, and scaled-by-8. The latter two forms multiply the first source operand by 4 or 8, respectively, without checking for overflow. The CMPxx compare instructions can be signed or unsigned and support the = (CMPEQ), ≤ (CMPLE, CMPULE), and < (CMPLT, CMPULT) relations. Note that the CMPEQ instruction can be used for signed and unsigned comparisons and that the missing ≥, > relations are supported by reversing the order of the source operands. The MULx instruction performs integer multiplication. Since the product of two 64 bit integers can be up to 128 bits large, the MULx instruction may cause an arithmetic exception if an overflow occurs, i.e. if the result is larger than 2^63 - 1. In this case the UMULH instruction can be used to generate the upper 64 bits of the product.
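The following C sketch (added here for illustration, not from the original text) shows the computation that the MULQ/UMULH pair performs: the low and the high 64 bits of the full 128 bit unsigned product, computed portably via 32 bit halves.

```c
#include <stdint.h>
#include <stdio.h>

/* Full 64x64 -> 128 bit unsigned multiplication, split into 32-bit
 * halves. *lo corresponds to the MULQ result (low 64 bits), the
 * return value to the UMULH result (high 64 bits). */
static uint64_t umul128(uint64_t a, uint64_t b, uint64_t *lo)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* column sum for bits 32..63, including the carry out of p0 */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    *lo = a * b;                                        /* low 64 bits  */
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);  /* high 64 bits */
}

int main(void)
{
    uint64_t lo, hi = umul128(0xFFFFFFFFFFFFFFFFULL, 16, &lo);
    printf("hi=%llx lo=%llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
```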

Table A.3. Integer Arithmetic Instructions

  Name     Description
  ADDL     Add Longwords
  ADDQ     Add Quadwords
  S4ADDL   Add Longwords, Scaled by 4
  S4ADDQ   Add Quadwords, Scaled by 4
  S8ADDL   Add Longwords, Scaled by 8
  S8ADDQ   Add Quadwords, Scaled by 8
  CMPEQ    Integer Compare, Equal to Zero
  CMPLE    Integer Compare, Less Than or Equal to Zero
  CMPLT    Integer Compare, Less Than Zero
  CMPULE   Integer Compare, Unsigned Less Than or Equal to Zero
  CMPULT   Integer Compare, Unsigned Less Than Zero
  MULL     Multiply Longwords
  MULQ     Multiply Quadwords
  UMULH    Multiply Quadwords, Unsigned High
  SUBL     Subtract Longwords
  SUBQ     Subtract Quadwords
  S4SUBL   Subtract Longwords, Scaled by 4
  S4SUBQ   Subtract Quadwords, Scaled by 4
  S8SUBL   Subtract Longwords, Scaled by 8
  S8SUBQ   Subtract Quadwords, Scaled by 8

Integer Logical and Shift Instructions. The set of integer logical and shift instructions contains boolean operations, conditional moves, and shift instructions. The individual instructions are listed in Table A.4 and are described in the following paragraphs.

The following boolean operations are supported: AND, OR, and XOR. In addition, each of the three operations is supported with the complement of the second operand, i.e. BIC, ORNOT, and EQV. Note that the missing NOT operation can be substituted by ORNOT with a zero source operand.

The conditional moves test the first source register for the specified relation: if the relationship is true, the contents of the second source operand are moved to the destination register. The conditional moves support the same set of relations as the conditional branches: = (CMOVEQ), ≥ (CMOVGE), > (CMOVGT), ≤ (CMOVLE), < (CMOVLT), ≠ (CMOVNE), as well as even (CMOVLBC) and odd (CMOVLBS).

The shift instructions are available in two different forms: arithmetic and logical. The logical shifts use zero to fill the vacated bit positions, while the arithmetic right shift uses the most significant bit. There is no arithmetic left shift, since a logical left shift is sufficient as long as no overflow occurs. If an overflow must be detected, an integer multiply by 2^s should be used instead, where s is the shift distance.
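As an illustration of the conditional moves described above, the following C fragment (my addition, not from the original text) shows the branch-free maximum that a compiler can build from a compare and a conditional move; the instruction mapping named in the comments is one plausible choice, not mandated code.

```c
#include <stdint.h>
#include <stdio.h>

/* Branch-free signed maximum. On Alpha this maps naturally onto a
 * compare followed by a conditional move, for example:
 *     CMPLT   Ra, Rb, Rt      # Rt = (Ra < Rb) ? 1 : 0
 *     CMOVNE  Rt, Rb, Ra      # if Rt != 0 then Ra = Rb
 */
static int64_t max64(int64_t a, int64_t b)
{
    int64_t t = (a < b);    /* comparison result, 0 or 1 */
    if (t != 0)             /* the conditional move step */
        a = b;
    return a;
}

int main(void)
{
    printf("%lld\n", (long long)max64(-5, 3));   /* prints 3 */
    return 0;
}
```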

Table A.4. Integer Logical & Shift Instructions

  Name      Description
  AND       Logical Product
  BIC       Logical Product with Complement
  BIS       Logical Sum
  EQV       Logical Equivalence
  ORNOT     Logical Sum with Complement
  XOR       Logical Difference
  CMOVEQ    Integer Conditional Move, Equal to Zero
  CMOVGE    Integer Conditional Move, Greater Than or Equal to Zero
  CMOVGT    Integer Conditional Move, Greater Than Zero
  CMOVLBC   Integer Conditional Move, Low Bit Clear
  CMOVLBS   Integer Conditional Move, Low Bit Set
  CMOVLE    Integer Conditional Move, Less Than or Equal to Zero
  CMOVLT    Integer Conditional Move, Less Than Zero
  CMOVNE    Integer Conditional Move, Not Equal to Zero
  SLL       Logical Left Shift
  SRA       Arithmetic Right Shift
  SRL       Logical Right Shift

Floating-Point Memory Instructions. The floating-point memory instructions transfer data between memory and floating-point registers. The individual instructions in this set are listed in Table A.5. Note that none of the instructions in this set interprets the transferred data in any way, hence no arithmetic exceptions can occur. The load and store instructions support Ffloats (LDF/STF), Gfloats (LDG/STG), Sfloats (LDS/STS), and Tfloats (LDT/STT). All LDx/STx instructions behave similarly to their integer counterparts and transfer data between the destination register and memory at the virtual address given by adding the sign-extended displacement to the source register. The LDG/STG instruction pair should be used to load and store Dfloat values, since the required reordering of bits is identical for both data types.

Floating-Point Control Instructions. There are neither unconditional nor indirect floating-point branch instructions, since unconditional branches and jumps do not test any source operands. The individual instructions in this set are listed in Table A.6 and operate as follows: The FBxx instructions test the source operand for the specified relationship and perform a PC-relative branch like their integer counterparts if this relationship is true. The floating-point conditional branches support the same set of relations with the exception of the odd and even relations, which make no sense for floating-point operands. Note that the test is only based on the sign bit and whether the rest of the operand is all zero, i.e. the data is not interpreted in any way.

Floating-Point Arithmetic Instructions. The floating-point arithmetic instructions contain floating-point arithmetic, comparison, and conditional move instructions, among others. The individual instructions are listed in Table A.7 and are described in the following paragraphs.

The CPYSx instructions copy the sign and/or exponent between source and destination operands. The MT_FPCR and MF_FPCR instructions provide access to the floating-point control register. Note that the latter two instructions must be enclosed in exception barriers to guarantee proper access to the floating-point control register.

Table A.5. Floating-Point Memory Instructions

  Name   Description
  LDF    Load Ffloat
  LDG    Load Gfloat (Dfloat)
  LDS    Load Sfloat (Longword)
  LDT    Load Tfloat (Quadword)
  STF    Store Ffloat
  STG    Store Gfloat (Dfloat)
  STS    Store Sfloat (Longword)
  STT    Store Tfloat (Quadword)

Table A.6. Floating-Point Control Instructions

  Name   Description
  FBEQ   Floating-Point Branch, Equal to Zero
  FBGE   Floating-Point Branch, Greater Than or Equal to Zero
  FBGT   Floating-Point Branch, Greater Than Zero
  FBLE   Floating-Point Branch, Less Than or Equal to Zero
  FBLT   Floating-Point Branch, Less Than Zero
  FBNE   Floating-Point Branch, Not Equal to Zero

There are four different forms of the ADDx, DIVx, MULx, and SUBx arithmetic instructions, one for each floating-point data type. These instructions perform floating-point addition, division, multiplication, and subtraction, respectively. The rounding and trapping modes can be explicitly specified for every instruction, with the exception of the IEEE +∞ rounding mode, which is only available via the floating-point control register by using the dynamic rounding mode.

The CMPGxx and CMPTxx instructions perform comparisons on Gfloat and Tfloat data types, respectively. These instructions support the same set of relations as the floating-point conditional branches. An exception is the CMPTUN instruction, which can be used to check for NaNs and is only provided for the Tfloat data type, since the VAX floating-point data types do not support NaNs. Note that separate comparison instructions for the Ffloat and Sfloat data types are not necessary, since the register layout is the same for all floating-point data types.

There is a rich set of conversion instructions: The CVTLQ/CVTQL instructions perform integer conversion between longwords and quadwords for integers stored in floating-point registers. The CVTQx instructions convert quadwords into the Ffloat, Gfloat, Sfloat, and Tfloat floating-point data types; the reverse operation is performed by the CVTGQ and CVTTQ instructions for the VAX and IEEE floating-point data types, respectively. The CVTTS and CVTST instructions convert between the IEEE Sfloat and Tfloat data types, while the CVTGF instruction converts between the VAX Gfloat and Ffloat data types.

Table A.7. Floating-Point Arithmetic Instructions

  Name      Description
  CPYS      Copy Sign
  CPYSE     Copy Sign & Exponent
  CPYSN     Copy Sign Negate
  CVTLQ     Convert Longword to Quadword
  CVTQL     Convert Quadword to Longword
  FCMOVEQ   FP Conditional Move, Equal to Zero
  FCMOVGE   FP Conditional Move, Greater Than or Equal to Zero
  FCMOVGT   FP Conditional Move, Greater Than Zero
  FCMOVLE   FP Conditional Move, Less Than or Equal to Zero
  FCMOVLT   FP Conditional Move, Less Than Zero
  FCMOVNE   FP Conditional Move, Not Equal to Zero
  MF_FPCR   Move from Floating-Point Control Register
  MT_FPCR   Move to Floating-Point Control Register
  ADDF      Add Ffloats
  ADDG      Add Gfloats
  ADDS      Add Sfloats
  ADDT      Add Tfloats
  CMPGEQ    Compare Gfloat, Equal to Zero
  CMPGLE    Compare Gfloat, Less Than or Equal to Zero
  CMPGLT    Compare Gfloat, Less Than Zero
  CMPTEQ    Compare Tfloat, Equal to Zero
  CMPTLE    Compare Tfloat, Less Than or Equal to Zero
  CMPTLT    Compare Tfloat, Less Than Zero
  CMPTUN    Compare Tfloat, Unordered
  CVTDG     Convert Dfloat to Gfloat
  CVTGD     Convert Gfloat to Dfloat
  CVTGF     Convert Gfloat to Ffloat
  CVTGQ     Convert Gfloat to Quadword
  CVTQF     Convert Quadword to Ffloat
  CVTQG     Convert Quadword to Gfloat
  CVTQS     Convert Quadword to Sfloat
  CVTQT     Convert Quadword to Tfloat
  CVTST     Convert Sfloat to Tfloat
  CVTTQ     Convert Tfloat to Quadword
  CVTTS     Convert Tfloat to Sfloat
  DIVF      Divide Ffloats
  DIVG      Divide Gfloats
  DIVS      Divide Sfloats
  DIVT      Divide Tfloats
  MULF      Multiply Ffloats
  MULG      Multiply Gfloats
  MULS      Multiply Sfloats
  MULT      Multiply Tfloats
  SUBF      Subtract Ffloats
  SUBG      Subtract Gfloats
  SUBS      Subtract Sfloats
  SUBT      Subtract Tfloats

Note that a conversion between the Ffloat and Gfloat data types is not necessary, since both formats use an identical register layout and an Ffloat always fits in a Gfloat. The CVTGD and CVTDG conversion instructions are used to convert the partially supported VAX Dfloat data type to and from the Gfloat data type. Together with the load and store instructions for this data type, data files produced by legacy applications under VAX/VMS can be processed.

The floating-point conditional moves FCMOVxx behave like their integer counterparts. These instructions support the same set of relations as the floating-point conditional branches. As already pointed out above, the test is only based on the sign bit and whether the rest of the operand is all zero, i.e. the data is not interpreted in any way.

Miscellaneous Instructions. The individual instructions that do not fall into any of the other categories are listed in Table A.8. The AMASK instruction is used to check for the presence of extensions to the Alpha architecture instruction set, namely the BWX, MVI, FIX, and CIX extensions described in later sections. The IMPLVER instruction performs a similar task: it is used to determine the major implementation version of the executing processor. The IMPLVER instruction should be used to make code-tuning decisions, while the AMASK instruction should be used to make instruction set decisions.

The EXCB exception barrier does not issue until all integer and floating-point exceptions and updates to the floating-point control register have been completed. The EXCB is a superset of the TRAPB instruction, which does not issue until all prior instructions are guaranteed to complete without causing arithmetic traps.
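To illustrate how the AMASK instruction described above is typically used, here is a hedged C sketch (my addition): amask() is a hypothetical stand-in for the instruction, which would normally be reached via inline assembly or a compiler builtin, and the feature-bit position chosen for the CIX extension is an assumption made only for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the AMASK instruction: AMASK clears every
 * bit of the requested mask whose extension is implemented, so a zero
 * result means that all requested extensions are present. This stub
 * simply pretends that no extension is implemented. */
static uint64_t amask(uint64_t requested)
{
    return requested;
}

#define FEATURE_CIX (1ULL << 2)   /* assumed bit position, illustration only */

int main(void)
{
    if (amask(FEATURE_CIX) == 0)
        puts("CIX present: use CTPOP/CTLZ/CTTZ directly");
    else
        puts("CIX absent: fall back to a software routine");
    return 0;
}
```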
The MB memory barrier is used to ensure the proper ordering of load and store operations in multiprocessor systems. The barrier guarantees that all prior load and store instructions have accessed memory, as observed by other processors, before subsequent load and store instructions access memory. The MB barrier is a superset of the WMB barrier, which restricts ordering to stores only. Note that the coherence of the instruction stream is managed by the IMB instruction memory barrier, which is realized in PALcode.

The ECB, WH64, FETCH, and FETCH_M instructions provide hints to the memory system about future access patterns: The ECB instruction hints that the specified virtual address will not be accessed in the near future and should therefore be moved to a lower part of the memory hierarchy to allow the reuse of cache resources. The FETCH instruction hints that the aligned 512 byte block around the specified virtual address will be accessed in the near future and should therefore be moved to a higher point in the memory hierarchy in order to reduce memory latency. In addition, the FETCH_M instruction hints that all or part of the 512 byte block will be modified. The WH64 instruction hints that the aligned 64 byte block surrounding the specified virtual address will never be read again, but will be overwritten in the near future.
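The classic use of the MB/WMB ordering guarantee described above is a producer/consumer handshake between two processors. The C sketch below is an illustration added here; memory_barrier() merely stands in for whatever mechanism emits an MB instruction (inline assembly or an OS-provided primitive) and is not code from the thesis.

```c
#include <stdint.h>

/* Hypothetical stand-in for emitting an MB instruction; on a real
 * system this would be inline assembly or a system-provided primitive. */
static void memory_barrier(void) { /* e.g. an "mb" instruction */ }

static volatile uint64_t shared_data;
static volatile uint64_t data_ready;   /* 0 = not ready, 1 = ready */

/* Producer, running on processor A. */
void produce(uint64_t value)
{
    shared_data = value;
    memory_barrier();       /* make the data visible before the flag;
                               a WMB would suffice here, since only
                               stores need to be ordered              */
    data_ready = 1;
}

/* Consumer, running on processor B. */
uint64_t consume(void)
{
    while (data_ready == 0)
        ;                   /* spin until the flag is observed        */
    memory_barrier();       /* order the flag read before the data read */
    return shared_data;
}
```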

Table A.8. Miscellaneous Instructions

  Name       Description
  AMASK      Architecture Mask
  CALL_PAL   Call PALcode Routine
  ECB        Evict Cache Block
  EXCB       Exception Barrier
  FETCH      Prefetch Data
  FETCH_M    Prefetch Data, Modify Intent
  IMPLVER    Implementation Version
  MB         Memory Barrier
  RPCC       Read Processor Cycle Count
  TRAPB      Trap Barrier
  WH64       Write Hint
  WMB        Write Memory Barrier

The RPCC instruction returns the current value of the processor cycle counter, which increments every clock cycle. Note that the counter is only 32 bits wide, therefore care must be taken to detect any wrap-arounds between two accesses to the cycle counter. The CALL_PAL instruction is used to execute PALcode functions as described in Section A.2.4.

Byte and Word extension (BWX). The byte and word extension to the Alpha architecture instruction set contains support for byte and word load, store, and byte-manipulating instructions. The individual instructions are listed in Table A.9 and are described in the following paragraphs.

The LDxU instructions load a byte or word from the specified address in memory and store the zero-extended result in the destination register. Note that the normal LDL instruction sign-extends the result instead. The STx instructions store a byte or word to the specified address in memory.

There is a rich set of byte manipulation instructions: The CMPBGE instruction compares corresponding bytes of the two quadword source operands in parallel and stores the outcome of all comparisons in the least significant byte of the destination register. Starting at an arbitrary byte location within the source quadword, the EXTxx instructions extract a byte, word, longword, or quadword, respectively. The vacated bit positions are zero-filled. The difference between the EXTxL and EXTxH forms is the reference point for the byte boundary: The EXTxL instructions count bytes starting with the least significant byte, while the EXTxH instructions count bytes starting with the most significant byte. The INSxx instructions perform the reverse operation: a byte, word, longword, or quadword from the source operand is inserted at an arbitrary byte position within the target quadword, filling the vacated bit positions with zeros. The MSKxx instructions insert a zero byte, word, longword, or quadword at an arbitrary byte position.
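To make the CMPBGE semantics described above concrete, the following C sketch (an illustration added here, not from the original text) models the instruction and uses it the way a string routine would, testing eight bytes at once for a terminating zero byte.

```c
#include <stdint.h>
#include <stdio.h>

/* Model of CMPBGE Ra, Rb, Rc: compare the eight corresponding bytes of
 * the two source quadwords as unsigned values and collect one result
 * bit per byte (1 if byte of Ra >= byte of Rb) in the low byte of Rc. */
static uint64_t cmpbge(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte_a = (uint8_t)(a >> (8 * i));
        uint8_t byte_b = (uint8_t)(b >> (8 * i));
        if (byte_a >= byte_b)
            result |= 1ULL << i;
    }
    return result;
}

int main(void)
{
    /* CMPBGE with zero as the first operand flags every byte that is
     * zero (0 >= byte only holds for byte == 0), so a whole quadword
     * of a string can be tested for the terminating NUL at once. */
    uint64_t chunk = 0x0000006F6C6C6568ULL;   /* "hello\0\0\0", little-endian */
    uint64_t zero_map = cmpbge(0, chunk);
    printf("zero bytes at positions: 0x%02x\n", (unsigned)zero_map);
    return 0;
}
```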

Table A.9. Byte & Word Extension Instructions

  Name     Description
  CMPBGE   Byte Vector Compare
  EXTBL    Extract Byte Low
  EXTWL    Extract Word Low
  EXTLL    Extract Longword Low
  EXTQL    Extract Quadword Low
  EXTWH    Extract Word High
  EXTLH    Extract Longword High
  EXTQH    Extract Quadword High
  INSBL    Insert Byte Low
  INSWL    Insert Word Low
  INSLL    Insert Longword Low
  INSQL    Insert Quadword Low
  INSWH    Insert Word High
  INSLH    Insert Longword High
  INSQH    Insert Quadword High
  LDBU     Load Byte Unsigned
  LDWU     Load Word Unsigned
  MSKBL    Mask Byte Low
  MSKWL    Mask Word Low
  MSKLL    Mask Longword Low
  MSKQL    Mask Quadword Low
  MSKWH    Mask Word High
  MSKLH    Mask Longword High
  MSKQH    Mask Quadword High
  SEXTB    Sign-Extend Byte
  SEXTW    Sign-Extend Word
  STB      Store Byte
  STW      Store Word
  ZAP      Zero Bytes
  ZAPNOT   Zero Bytes Not

The ZAP/ZAPNOT instruction pair performs a similar operation, but allows arbitrary, e.g. non-contiguous, bytes in a quadword to be filled with zeros. The SEXTx instructions provide sign extension for byte and word operands. These instructions are necessary since the byte and word load instructions perform zero extension instead of sign extension.

Multimedia extension (MVI). Instructions in the multimedia extension are targeted at audio and video algorithms, e.g. MPEG compression and decompression, hence the name (motion video instructions). The individual instructions in this extension are listed in Table A.10 and are described in the following paragraphs.

The MINxB8/MINxW4 instructions compare the corresponding bytes or words of the two source quadwords in parallel and store the minimum value for each comparison in the destination register. The MINSxx instructions perform signed comparisons, while the MINUxx instructions perform unsigned comparisons. The MAXxxx instructions are similar, but store the maximum value for each comparison in the destination register.

Table A.10. Multimedia Extension Instructions

  Name     Description
  MINUB8   Byte Vector Unsigned Minimum
  MINSB8   Byte Vector Signed Minimum
  MINUW4   Word Vector Unsigned Minimum
  MINSW4   Word Vector Signed Minimum
  MAXUB8   Byte Vector Unsigned Maximum
  MAXSB8   Byte Vector Signed Maximum
  MAXUW4   Word Vector Unsigned Maximum
  MAXSW4   Word Vector Signed Maximum
  PERR     Pixel Error
  PKLB     Pack Longwords to Bytes
  PKWB     Pack Words to Bytes
  UNPKBL   Unpack Bytes to Longwords
  UNPKBW   Unpack Bytes to Words

These instructions can be used to perform clamping to maximum or minimum values. The PKxB instructions truncate two longwords or four words of the source operand to bytes and store these two or four bytes in the least significant two/four byte locations of the target operand. The reverse operation is performed by the UNPKxx instructions. The PERR instruction is useful for motion estimation: it compares corresponding bytes in the two source operands in parallel and returns the sum of the absolute differences.

Floating-Point extension (FIX). The floating-point extension contains instructions that transfer data between integer and floating-point registers without accessing memory, as well as square root arithmetic instructions. The individual instructions are listed in Table A.11 and are described in the following paragraph.

The SQRTx instructions provide the square root operation for all supported floating-point data types. The ITOFx and FTOIx instructions transfer data from integer to floating-point registers (ITOFx) and vice versa (FTOIx). These instructions do not interpret the contents of the registers in any way, hence no arithmetic exceptions can occur. The CVTxx instructions should be used to convert between integer and floating-point data types.

Count extension (CIX). The count extension is the latest addition to the Alpha architecture instruction set and contains the three instructions listed in Table A.12. The CTLZ instruction returns the number of leading zeros, i.e. the number of zeros starting from the most significant bit downwards until the first one is encountered, to the destination register. The CTTZ instruction performs the same operation starting from the least significant bit upwards. The CTPOP instruction returns the number of ones in the source operand to the destination register. All instructions operate on quadwords only and generate no exceptions.

Table A.11. Floating-Point Extension Instructions

  Name    Description
  FTOIS   Floating-Point to Integer Register Move, Sfloat
  FTOIT   Floating-Point to Integer Register Move, Tfloat
  ITOFF   Integer to Floating-Point Register Move, Ffloat
  ITOFS   Integer to Floating-Point Register Move, Sfloat
  ITOFT   Integer to Floating-Point Register Move, Tfloat
  SQRTF   Square Root, Ffloat
  SQRTG   Square Root, Gfloat
  SQRTS   Square Root, Sfloat
  SQRTT   Square Root, Tfloat

Table A.12. Count Extension Instructions

  Name    Description
  CTLZ    Count Leading Zero
  CTPOP   Count Population
  CTTZ    Count Trailing Zero
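As a reference for what the CIX instructions compute, the following portable C routines (added here as an illustration, e.g. usable as a fallback when AMASK reports that CIX is absent) produce the same leading-zero, trailing-zero, and population counts on quadword operands.

```c
#include <stdint.h>
#include <stdio.h>

/* Software equivalents of the CIX instructions on 64-bit operands. */

static uint64_t ctpop64(uint64_t x)          /* CTPOP: number of ones */
{
    uint64_t count = 0;
    while (x) {
        x &= x - 1;                          /* clear lowest set bit  */
        count++;
    }
    return count;
}

static uint64_t ctlz64(uint64_t x)           /* CTLZ: leading zeros   */
{
    uint64_t count = 64;
    while (x) {
        x >>= 1;
        count--;
    }
    return count;
}

static uint64_t cttz64(uint64_t x)           /* CTTZ: trailing zeros  */
{
    if (x == 0)
        return 64;
    uint64_t count = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        count++;
    }
    return count;
}

int main(void)
{
    uint64_t v = 0x00F0000000000000ULL;
    printf("popcount=%llu clz=%llu ctz=%llu\n",
           (unsigned long long)ctpop64(v),
           (unsigned long long)ctlz64(v),
           (unsigned long long)cttz64(v));   /* prints 4, 8, 52 */
    return 0;
}
```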

A.2.4 PALcode

PALcode is used to provide a consistent interface for operating systems across different implementations. The individual routines depend on the operating system, but typically include operations such as memory management, interrupt and exception handling, as well as context switching. PALcode resides in memory and is written in standard machine code with some notable extensions: PALcode is executed in an environment that provides full control over the hardware and access to additional state, and it runs in the absence of interrupts or page faults. Since the PALcode is operating-system dependent, it can be switched at boot time to support different operating systems.

A.3 Implementations

The 21064 microprocessor introduced in 1992 was the first implementation of the Alpha architecture. It was soon followed by several derivatives: the 21064A, 21066, 21066A, and 21068 microprocessors. These processors represent the first generation of Alpha architecture implementations. The 21164 microprocessor introduced in late 1994 was the first implementation of the second generation, again followed by several derivatives: the 21164A and 21164PC microprocessors. These processors represent the second generation of Alpha architecture implementations. The first of the third-generation implementations, the 21264 microprocessor, was introduced in 1998 and was followed by the 21264A and 21264B microprocessors. These microprocessors represent the third generation of Alpha architecture implementations. According to the current road-maps, the first fourth-generation implementation will enter the marketplace in late 2001. Unfortunately, this implementation will probably be the last implementation of the Alpha architecture, as the Alpha architecture has recently been discontinued. The following sections discuss the first-, second-, and third-generation implementations in detail and give an outlook on the fourth- as well as the recently abandoned fifth-generation implementations. For information about the corresponding systems, which is outside the scope of this chapter, the reader is referred to issues of the Digital/Compaq Technical Journal.

A.3.1 Alpha 21064

The 21064 microprocessor, code-named EV4, was introduced in 1992 and represents the first full implementation of the Alpha architecture (an earlier integer-only implementation existed for internal use only). The chip operates at up to 200 MHz, executing up to 400 million instructions per second since it can issue up to two instructions in parallel. The 21064 supports the basic instruction set and the IEEE and VAX floating-point subsets as defined by the Alpha architecture, i.e. it does not support any of the instruction set extensions. The memory system includes two 8 KB data and instruction caches as well as support for an external second-level cache. The 21064 supports a virtual and physical address size of 43 and 34 bits, respectively.

The chip contains 1.68 million transistors and was fabricated in a 0.7 µ 3-layer metal CMOS process, yielding a die size of 12.4 × 15.0 mm. Using a supply voltage of 3.3 V, the chip dissipates up to 30 W at 200 MHz. The 21064 is packaged in a ceramic pin grid array (CPGA) package with 431 pins. The internal architecture of the 21064 is sketched in Figure A.4; the following sections describe the individual elements in more detail. The information presented in these sections was gathered from the corresponding hardware reference manual [AXP96a] as well as several articles [DWA+92a][DWA+92b].

Instruction Cache. The instruction cache is an 8 KB direct-mapped cache that uses 32 byte cache lines. Apart from the tag and data fields, each cache-line contains an 8 bit branch history field that is used for branch prediction. The cache is virtually indexed and tagged, since the size of the cache is equal to the page size. There is no mechanism for maintaining coherence with memory, i.e. the software is responsible for issuing the IMB instruction after the instruction stream has been modified. An interesting feature is the stream buffer: On a cache miss, the cache fetches the missing cache-line from memory, but loads the next cache-line into the stream buffer as well. If this line is subsequently requested by the instruction fetch and decode unit, the whole cache-line is transferred to the instruction cache in a single cycle.

The translation between virtual and physical addresses for the instruction stream is performed by two instruction translation buffers: one fully-associative buffer with eight entries for 8 KB pages and another fully-associative buffer with four entries for 4 MB pages. These instruction translation buffers are maintained by PALcode.

Fig. A.4. Alpha 21064 Internal Architecture


Instruction Fetch and Decode Unit. The instruction fetch and decode unit (IBOX) has three primary tasks, i.e. to fetch, decode, and issue instructions. Apart from these tasks, the instruction fetch and decode unit is responsible for controlling the execution pipelines, especially in the presence of exceptions, traps, or interrupts.

Instructions are fetched from the instruction cache by the prefetcher. The prefetcher generates a quadword address for the next access to the instruction cache in each cycle, i.e. it fetches two instructions simultaneously. Branch prediction is used to avoid pipeline stalls due to discontinuities in the instruction stream: The instruction cache stores a single bit of branch history for every branch instruction. This history bit is used to predict whether the branch is taken or not-taken on the next execution. If the branch has never been executed before, static branch prediction is used: backward branches are predicted taken, while forward branches are predicted not-taken. In addition to the branch history table, the instruction fetch and decode unit contains a four-entry return address stack to be used by indirect branch instructions.

After an instruction pair has been fetched, both instructions are decoded and the availability of the required resources is checked in parallel. Instructions are issued in order, i.e. the second instruction does not issue until the first instruction can be issued. If the resources for both instructions are available and certain conditions are met, both instructions can be issued in parallel: a load or store instruction with an integer or floating-point operate instruction, an integer operate instruction together with a floating-point operate instruction, or a branch instruction together with a load/store, integer, or floating-point operate instruction.

There are two exceptions to the dual-issue rules above: integer branches and integer stores as well as floating-point branches and floating-point stores cannot be issued together; likewise, integer stores and branches cannot be issued together with floating-point operate instructions, and floating-point stores and branches cannot be issued together with integer operate instructions. These restrictions reflect the number of read and write ports of the integer and floating-point register files.

Note that dual issue is only possible if both instructions come from the same instruction pair, i.e. if there is only one instruction left from the current instruction pair, the prefetcher does not try to dual-issue the remaining instruction with the first instruction from the next instruction pair. Apart from the dual-issue rules mentioned above, there are several other issue rules; the interested reader is referred to the corresponding hardware reference manual [AXP96a].

Integer Unit. The integer unit consists of the integer execution unit (EBOX) and the integer register file (IRF). The integer register file contains 32 registers, each 64 bits wide, as defined by the Alpha architecture. The integer register file has four read and two write ports: two read and one write port dedicated to the integer execution unit as well as two read and one write port shared between the instruction fetch and decode unit and the load/store unit.
The integer execution unit contains several fully pipelined units: a 64 bit arithmetic-logic unit, a barrel shifter, and a multiplier. All integer arithmetic instructions except multiply have a latency of one or two cycles. The multiplier retires four bits per cycle, yielding a latency of 21 and 23 cycles for longword and quadword operands, respectively.

Floating-Point Unit. The floating-point unit consists of the floating-point execution unit (FBOX) and the floating-point register file (FRF). The floating-point register file contains 32 floating-point registers, each 64 bits wide, as defined by the Alpha architecture. The floating-point register file has three read and two write ports: two read and one write port dedicated to the floating-point execution unit as well as one read and one write port shared between the instruction fetch and decode unit and the load/store unit. Note that a second read port for the latter units is not necessary, since address operands are always stored in an integer register.

The floating-point execution unit supports the IEEE as well as the VAX subsets, with the exception of the IEEE +∞ and -∞ rounding modes and the inexact flag for divide operations; these operations must be performed in software. The floating-point execution unit contains an adder/multiplier that has a latency of six cycles and is fully pipelined, i.e. it can accept new operands in every cycle. Division is handled by a separate non-pipelined divider that can retire a single bit in every clock cycle. The latency for divide operations is therefore up to 34 and 63 cycles for single-precision and double-precision operands, respectively.

Load/Store Unit. The load/store unit (ABOX) provides the execution core with an interface to the data cache and the external bus interface unit. The load/store unit consists of an address generator, a write buffer, and a load silo. The address generator is responsible for calculating the effective virtual address for load and store instructions by adding the sign-extended displacement to the contents of the base register.

The write buffer decouples the execution units from the bus interface unit. Since the core can generate store instructions at a higher rate than the external cache or main memory can accept them, this decoupling is necessary in order to avoid pipeline stalls. To this end, the write buffer contains a four-entry queue, where each entry holds a whole cache-line (32 bytes). New stores are appended to the end of the queue, while stores at the head of the queue are sent to the bus interface unit if certain conditions are met. Instead of simply buffering the stores, the write buffer allows stores to the same cache-line to merge with already existing entries. Note that the merging of stores may change the number as well as the ordering of stores; memory barriers must be inserted in the instruction stream if strict ordering is required. Apart from this reordering by merging, stores are not reordered.

The load/store unit can accept new commands until a data cache fill is required, i.e. until a load miss occurs, since the data cache does not use a write-allocate protocol. In the case of a load miss, instructions using the load/store unit are no longer issued until the load miss has been resolved. Since load misses are detected rather late in the pipeline, there may be up to two instructions destined for the load/store unit in the pipeline.
These load/store instructions are placed in the load silo until the first load miss is resolved and are subsequently replayed in program order. An exception to this rule are loads that hit in the data cache; these are allowed to complete even in the case of an outstanding cache fill.

Data Cache. The data cache is an 8 KB direct-mapped cache that uses 32 byte cache-lines. The cache is virtually indexed and tagged since the size of the cache is equal to the page size. Load instructions that hit in the cache incur a latency of three cycles. A write-through protocol is used to update external caches or main memory. The data cache is kept coherent with main memory by an invalidate bus, i.e. external system logic is responsible for cache coherence. An interesting feature is the pending fill latch: incoming data from previous fill requests is accumulated in this latch while the cache processes other requests, i.e. the cache supports hit-under-miss. Once a whole cache-line has accumulated in the pending fill latch, the whole cache-line is written to the cache in a single cycle.

The translation between virtual and physical addresses for the data stream is performed by the data translation buffer, a fully-associative buffer with 32 entries for up to 512 pages per entry, each 8 KB large. Like the instruction translation buffer, the data translation buffer is maintained by PALcode.

Bus Interface Unit. The bus interface unit connects the internal caches as well as the load/store unit to the external bus interface. A fixed priority scheme is used to schedule requests from these three sources: data cache fills have the highest priority, instruction cache fills come next, while the load/store unit has the lowest priority. In addition, the bus interface unit supports an external direct-mapped cache that ranges from 128 KB to 16 MB in size.

A.3.2 Alpha 21064A

The 21064A microprocessor, code-named EV45, was introduced in late 1993 and is very similar to the 21064 described above. However, several enhancements were made to the original design: The size of the instruction and data caches was doubled to 16 KB each. In addition, error-correcting codes (ECC) were added to both caches. The branch prediction was improved as well, as the instruction cache maintains a two-bit saturating counter for each instruction location. The floating-point execution unit was updated with a new division algorithm that is able to retire 2.4 bits (on average) per cycle instead of the original single bit per cycle. This reduces the latency for division operations to 15 to 31 (19 average) and 22 to 60 (31 average) cycles for single-precision and double-precision operands, respectively. The divider also calculates the inexact flag correctly, therefore traps to software are no longer necessary to calculate this flag.

The 21064A contains 2.8 M transistors due to the larger caches. The chip was fabricated in an improved 0.5 µ 4-layer metal CMOS process, yielding a die size of 10.5 × 14.5 mm² compared to the original 12.4 × 15.0 mm². These process improvements allow the 21064A to operate at a clock frequency of 275 MHz. Using a supply voltage of 3.3 V, the chip dissipates up to 33 W. The 21064A is packaged in the same 431-pin CPGA package as the 21064. In addition, the 21064A uses the same pin assignment. More information on this implementation can be found in the corresponding hardware reference manual [AXP96a].

A.3.3 Alpha 21066

The 21066 microprocessor, code-named LCA4 (Low Cost Alpha), is a derivative of the earlier 21064. The 21066 integrates several new functions around a 21064-based core: a Peripheral Component Interconnect (PCI) bus interface, a cache controller, a memory controller, a simple graphics controller, and a phase-locked loop. The core is almost identical to the 21064, but the external data bus width was decreased from 128 to 64 bits.

Due to the additional functions, the 21066 contains 1.75 M transistors, i.e. slightly more than the original 21064. The chip was fabricated in a 0.675 µ 3-layer metal CMOS process, yielding a die size of 12.3 × 17.0 mm². The maximum operating frequency was reduced to 166 MHz in order to increase yields and lower cost. Unfortunately, the lower clock frequency together with the reduced size of the data bus caused a significant performance loss compared to the 21064, hence the chip was primarily used in embedded applications. More information on this implementation can be found in the corresponding hardware reference manual [AXP96b] or the Digital Technical Journal [MBC+94].

A.3.4 Alpha 21068

The 21068 microprocessor, code-named LCA4S, is identical to the 21066 microprocessor, but the maximum clock frequency is specified as 66 MHz, sometimes as 100 MHz. The chip was fabricated by Samsung, hence the capital S in the internal designation.

A.3.5 Alpha 21066A

The 21066A microprocessor, code-named LCA45, is based on the earlier 21066 microprocessor, but contains some minor improvements: The 21066A uses the advanced floating-point divide algorithm introduced in the 21064A and was fabricated in an improved CMOS process. The 21066A contains 1.8 M transistors and was fabricated in a 0.5 µ 3-layer metal CMOS process, yielding a die size of 10.9 × 14.8 mm². The maximum clock frequency was increased to 233 MHz. More information on this implementation can be found in the corresponding hardware reference manual [AXP96b].

A.3.6 Alpha 21164

The 21164 microprocessor, code-named EV5, was introduced in late 1994 and represents the first second-generation implementation of the Alpha architecture. The chip executes up to 1.2 billion instructions per second due to its capability to issue up to 4 instructions in each clock cycle and its 300 MHz clock frequency. The 21164 supports the byte and word extension of the Alpha architecture instruction set as well as the IEEE and VAX floating-point subsets. The memory system includes two 8 KB data and instruction caches, an internal 96 KB unified second-level cache, as well as support for an external third-level cache. The 21164 supports a virtual and physical address size of 43 and 40 bits, respectively.

The chip contains 9.3 M transistors and was fabricated in a 0.5 µ 4-layer metal CMOS process, yielding a die size of 16.5 × 18.1 mm². Using a supply voltage of 3.3 V, the chip dissipates up to 50 W. Like the earlier implementations, the 21164 is housed in a CPGA package, but the number of pins was increased to 499.

Fig. A.5. Alpha 21164 Internal Architecture


The internal architecture of the 21164 is depicted in Fig. A.5; the following sections describe the individual elements in detail. The information presented in these sections was gathered from the corresponding hardware reference manual [AXP96c] and several journal articles [ERPR95], [ERB+95], [KN95].

Instruction Cache. The instruction cache is an 8 KB direct-mapped cache with 32 byte cache-lines. The cache has a single read and a single write port. Cache coherency must be maintained in software by issuing IMB instructions. Apart from the data and tag arrays, each cache-line contains a branch history table using two-bit saturating counters for each instruction location, similar to the instruction cache in the earlier 21064A processor. To improve yields, the instruction cache has redundant rows that can be used to replace defective rows during wafer probe.

Instruction Fetch and Decode Unit. The instruction fetch and decode unit (IBOX) has five primary tasks: to fetch, decode, slot, and issue instructions, and to predict branches. Apart from these tasks, the instruction fetch and decode unit is responsible for controlling the execution pipelines, especially in the presence of exceptions, traps, or interrupts. The following paragraphs describe the individual tasks in detail.

Instruction fetch is performed by the prefetcher, which fetches an aligned block of four instructions from the instruction cache in each cycle. The instructions in the instruction cache are already partially decoded to ease the task of the instruction fetch and decode unit. Afterwards the instructions are decoded in parallel and stored in one of the two instruction buffers. Each instruction buffer can hold four decoded instructions. Branch prediction is used to determine the next fetch address and the instruction translation buffer is accessed. The instruction cache stores two bits of branch history for every branch instruction. These history bits are used to predict whether the branch is taken or not-taken on the next execution. If the branch has never been executed before, the same static branch prediction as in the 21064 is used. In addition to the branch history table, the instruction fetch and decode unit contains a four-entry return address stack to be used by indirect branches. Note that the branch prediction logic is able to predict up to six branches in each cycle.

Instruction slotting resolves all static conflicts and assigns the individual instructions to appropriate execution pipelines, while dynamic conflicts are handled in the issue stage. The slotting stage is able to slot all four instructions in parallel for instruction mixes that contain one instruction for each execution pipeline. The instruction fetch and decode unit slots instructions in program order, since the 21164 issues instructions in program order. The slotting logic assigns each instruction to one of the four execution pipelines until it encounters an instruction for which no execution pipeline is available. Integer instructions that can be executed by both integer execution pipelines are assigned to the first integer pipeline, unless this pipeline has already been allocated or the next instruction can only be allocated to this pipeline. A similar rule applies to floating-point instructions that can execute in either the add or the multiply pipeline. These instructions are assigned to the add pipeline unless the add pipeline has already been allocated.
Apart from the rules mentioned above, the slotting logic enforces other rules, which are described in detail in the corresponding hardware reference manual [AXP96c]. The slotted instructions advance to the issue stage, while the remaining instructions will be slotted in subsequent cycles. Only instructions within the original group of four instructions are slotted together, i.e. the next group of instructions enters the slotting stage only if all instructions from the previous group have advanced to the issue stage.

The issue stage checks instructions for dynamic conflicts, e.g. operand and resource conflicts. The instructions are issued in program order, i.e. instruction issue stops whenever an instruction with remaining conflicts is encountered, and resumes after these conflicts have been resolved. The issue logic can issue up to four instructions in each cycle in the absence of dynamic conflicts.

Integer Unit. The integer unit consists of an execution unit (EBOX) as well as a register file (IRF). The integer execution unit contains two independent pipelines: the first pipeline is able to handle all arithmetic and logic instructions including multiplication, as well as load, store, branch, and jump instructions. The second pipeline can handle all arithmetic and logic instructions with the exception of multiplication, but cannot handle store, branch, or jump instructions. Note that load instructions can be handled by both pipelines. All integer arithmetic instructions except multiply and conditional moves have a latency of one cycle. Conditional moves have a latency of two cycles, while the multiplier has a latency of 8 and 16 cycles for longword and quadword operands, respectively.

The integer register file contains 40 registers, each 64 bits wide: the 32 integer registers defined by the Alpha architecture as well as 8 additional registers that are only accessible to PALcode. The register file has four read and two write ports: two read and one write port for each of the execution pipelines. Note that the write ports are shared between the execution units and the memory subsystem, i.e. the memory address translation unit.

Floating-Point Unit. The floating-point unit consists of the floating-point execution unit (FBOX) as well as the floating-point register file (FRF). The execution unit contains two separate pipelines instead of the single pipeline used in earlier implementations. One of the pipelines executes multiply instructions, while the other pipeline executes all other instructions. The floating-point unit is therefore able to process up to two instructions in each clock cycle. The floating-point register file supports the 32 registers required by the Alpha architecture, each 64 bits wide. The register file has five read ports as well as four write ports: two read and one write port for each execution pipeline, and one read port and two write ports for the memory address translation unit.

Compared to the 21064, the latency of floating-point operations was reduced to four cycles for all operations except division. The non-pipelined divider uses the same algorithm that was used in the 21064A processor, hence the latency for divide operations is 15 to 31 (19 average) and 22 to 60 (31 average) cycles for single-precision and double-precision operands, respectively. The divider is associated with the add pipeline; instructions can still be issued to the add pipeline even while a division is in progress.
The floating-point unit supports all IEEE and VAX rounding modes in hardware, including the +∞ mode that was left out in earlier implementations.

Memory Address Translation Unit. The memory address translation unit (MBOX) provides the interface between the execution units and the memory system. The unit contains data and instruction translation buffers, a miss address file, as well as a write buffer. Note the absence of the address generation unit present in earlier implementations; this task is now performed by the integer execution unit, i.e. the memory address translation unit already receives virtual addresses from the integer unit.

The data translation buffer is a 64-entry, fully associative memory that stores the virtual to physical address mappings for different page sizes: each entry supports up to 512 pages, each 8 KB large. A not-last-used strategy is used to replace older entries. Compared to the 21064, the data translation buffer is considerably larger and has two read/write ports, since up to two virtual addresses must be translated in each cycle.

The instruction translation buffer is a 48-entry, fully associative memory that stores the virtual to physical address mappings for different page sizes: each entry supports up to 512 pages, each 8 KB large. A not-last-used strategy is used to replace older entries. Compared to the data translation buffer, the instruction translation buffer has only one read/write port.

The miss address file has ten entries and stores the address and data information of loads that miss in the caches. Subsequent load misses check the miss address file and are merged with existing entries under certain conditions. This allows multiple load misses to be serviced with a single fill from the cache control and bus interface unit. The miss address file provides six entries for loads as well as four entries for instruction fetches. Each entry stores the address of a 32 byte cache-line.

The write buffer is similar to the one used in the 21064, but has been increased to six entries. Note that the peak rate for stores is still one per cycle, since stores can only be executed by the first integer execution pipeline.

Data Cache. The data cache is an 8 KB direct-mapped cache with 32 byte cache-lines. The cache has two read ports as well as one write port and uses a write-through, read-allocate protocol. Load instructions that hit in the cache incur a latency of two cycles. The data cache is duplicated for performance reasons; both copies are written simultaneously. Cache coherency is maintained by the cache control and bus interface unit described below. In order to improve yields, the data cache has redundant rows that can be used to replace defective rows during wafer probe.

Secondary Cache. The internal second-level cache is a 96 KB, 3-way set-associative cache with a selectable cache-line size of 32 or 64 bytes. The cache is unified for data and instructions and uses a write-back, write-allocate protocol. Load instructions that hit in the cache incur a latency of eight or more cycles. Up to two outstanding cache requests can be queued. The secondary cache has redundant rows and columns that can be used to replace defective rows or columns during wafer probe in order to improve yields.

Cache Control and Bus Interface Unit. The cache control and bus interface unit connects the memory subsystem to the internal second-level cache, the external third-level cache, as well as the system bus interface.
The unit controls the second- and third-level caches and is responsible for maintaining cache coherency by implementing a modified, exclusive, shared, invalid (MESI) protocol for multiprocessor systems. The external third-level cache is a direct-mapped cache ranging from 1 to 64 MB in size with a selectable cache-line size of 32 or 64 bytes. Like the internal second-level cache, the third-level cache uses a write-back, write-allocate protocol.

A.3.7 Alpha 21164A

The 21164A, code-named EV56, is very similar to the 21164 described above, but is fabricated in an improved 0.35 µm 4-layer metal CMOS process. The 21164A uses the same number of transistors as the 21164, i.e. 9.67 M, which yields a die size of 299 mm² in the improved process. The maximum operating frequency was increased to 625 MHz, at which the chip dissipates a maximum of 62 W. More information on this implementation can be found in the corresponding hardware reference manual [AXP98a].

A.3.8 Alpha 21164PC

The 21164PC, code-named PCA56 (Personal Computer Alpha), was introduced in 1997 and is based on the 21164. Several modifications were made in order to lower the cost of the chip: The internal second-level cache was omitted, the choice of cache coherency protocols was restricted to the flush-based coherency protocol, and the physical address size was reduced to 33 bits. Several improvements were made as well: The 21164PC supports the MVI extension of the Alpha architecture and the size of the internal instruction cache was increased to 16 KB. The chip consists of 3.5 M transistors and was fabricated by Mitsubishi in a 0.35 µm 4-layer metal CMOS process, yielding a die size of 8.65 × 16.28 mm². The 21164PC has a maximum operating frequency of 533 MHz and dissipates up to 32 W using a core supply voltage of 2.5 V. More information on this implementation can be found in the corresponding hardware reference manual [AXP97]. The 21164PC code-named PCA57 is similar to the PCA56 described above, but the design was further improved: The size of the internal instruction cache was increased to 32 KB, and the organization is 2-way set-associative instead of direct-mapped. The size of the internal data cache was increased to 16 KB as well. The number of entries in the write buffer was increased from six to eight. The floating-point execution unit was improved in order to reduce the latency of floating-point multiplies and divisions: A multiply instruction has a latency of three cycles instead of the original four cycles. The division algorithm retires 6 bits in each cycle compared to the original 2.4 bits. The PCA57 consists of 7 M transistors and was fabricated in a 0.28 µm 4-layer metal CMOS process, yielding a die size of 6.7 × 15 mm². The chip has a maximum operating frequency of 583 MHz. More information on this implementation can be found in the corresponding hardware reference manual [AXP98b].

A.3.9 Alpha 21264

The 21264 processor, code-named EV6, was introduced in 1998 and represents the first third-generation implementation of the Alpha architecture.

Fig. A.6. Alpha 21264 Internal Architecture (block diagram: ICache, IBOX, EBOX with execution pipelines E0 to E3 and register files IRF1/IRF2, FBOX with FA and FM pipelines and register file FRF, DCache, CBOX, BCache)

The processor uses advanced techniques like out-of-order issue and is able to execute up to six instructions per clock cycle. The memory system is different from earlier implementations as it uses two 64 KB instruction and data caches as well as an external second-level cache. The 21264 supports the BWX, MVI, and FIX extensions to the Alpha architecture instruction set as well as the IEEE and VAX floating-point subsets. The 21264 supports a virtual and physical address size of 48 and 44 bits, respectively. The chip consists of 15 M transistors and is fabricated in a 0.35 µm 6-layer metal CMOS process, yielding a die size of 310 mm². Using a core supply voltage of 2.2 V, the chip dissipates up to 60 W at the maximum clock frequency of 500 MHz. The 21264 is packaged in a pin grid array (PGA) with 587 pins. The internal architecture of the 21264 is depicted in Fig. A.6; the following sections describe the major elements in detail: instruction cache, instruction fetch/decode unit, integer unit, floating-point unit, data cache, and cache control unit. The information in these sections was gathered from various sources, namely the corresponding hardware reference manual [AXP00a] and several articles [Gwe96][LR97][MBB+98][KMW98][Kes99].
Instruction Cache. The instruction cache (ICache) is a 64 KB two-way set-associative cache that uses a cache-line size of 64 bytes. The cache is virtually indexed, physically tagged, and protected by error-correcting codes. In each cycle, the instruction cache delivers up to four instructions to the instruction fetch and decode unit. In order to reduce fetch bubbles, the instruction cache uses two additional fields in each cache-line to predict the line and set of the next instruction cache access. The line and way prediction fields are trained by the instruction fetch and decode unit, i.e. the branch prediction logic. An additional two-bit saturating counter ensures that the prediction fields are only retrained after multiple wrong predictions in a row.
Instruction Fetch and Decode Unit. The instruction fetch and decode unit (IBOX) has five primary tasks: fetching, decoding, issuing, and retiring instructions as well as branch prediction. Apart from these tasks, the instruction fetch and decode unit is responsible for controlling the execution pipelines, especially in the presence of exceptions, traps, or interrupts. The unit consists of the program counter, branch prediction, register renaming, as well as retiring logic. The program counter logic maintains the addresses of all instructions that are in-flight, i.e. are currently executing. The instruction fetch addresses are stored in a 20-entry queue, hence there can be up to 80 in-flight instructions, as each instruction fetch consists of four instructions. The branch prediction logic was significantly enhanced compared to earlier implementations of the Alpha architecture: The 21264 processor uses a combination of three different branch predictors, i.e. the local, global, and choice predictors. The local predictor uses a two-level table to store branch history information. The first level consists of a branch history table with 1024 entries, each of which stores the outcomes of the ten most recent executions of a branch. This table is indexed with the lower bits of the virtual address of the branch instruction. Note that different branches may share the same entry if their virtual addresses are identical in these low-order bits.
The second level consists of a 1024-entry branch prediction table that is indexed with the history values from the first level. Each entry in the second-level table contains a three-bit saturating counter that is used to predict the outcome (taken/not-taken) of the branch. The global predictor uses a 12-entry global history buffer to store the outcomes of the 12 most recent branches. The contents of this buffer are used to index a 4096-entry branch prediction table. Each entry contains a two-bit saturating counter that is used to predict the outcome (taken/not-taken) of the branch. The choice predictor selects between the local and global predictors. The contents of the global history buffer described above are used to index a 4096-entry branch prediction table. Each entry contains a two-bit saturating counter that is used to select between the predictions from the local and global branch predictors. The register renaming logic maps the architected registers specified by the instructions to the physical registers of the 21264 processor. The integer and floating-point register files contain 80 and 72 physical registers, respectively, instead of the 32 registers defined by the architecture. The register mapping is used to resolve write-after-read and write-after-write dependencies as well as to allow speculative instruction execution. The mapping logic is able to map four instructions in each cycle and stores the complete register mapping for each instruction in an 80-entry buffer. Note that the number of entries corresponds to the number of in-flight instructions. These register mappings are used in the case of branch mispredictions and exceptions, such that the register mapping that corresponds to the offending instruction can be restored. The renaming logic is able to restore the map of an instruction in a single cycle. After renaming, the instructions are forwarded to the integer and floating-point issue queues. In order to maintain the sequential programming semantics, instructions are retired in program order, although they are issued and executed out-of-order. An instruction is retired if all previous instructions have been executed without causing exceptions, e.g. branch mispredictions. The retiring logic is able to retire up to 11 instructions in a single cycle and can sustain a retire rate of eight instructions per cycle.
Integer Unit. The integer unit consists of the integer instruction queue, the four integer execution pipelines (EBOX) organized in two clusters, as well as one copy of the integer register file (IRF1, IRF2) for each cluster. Each copy of the register file contains 80 registers, each 64 bits wide: These registers are used for register renaming of the architected 32 registers as well as 8 registers that are reserved for PALcode. The integer register file was duplicated due to the large number of read and write ports. Each copy supports six write and four read ports: two read and one write port for each of the two execution pipelines in a cluster, two write ports shared between the floating-point unit and the memory unit, as well as two write ports used to synchronize the two copies of the integer register file. Note that it takes one cycle until results from one cluster are available in the other cluster. Each cluster consists of a lower and an upper execution pipeline. The lower execution pipelines in both clusters are identical and can execute simple integer operate instructions as well as integer or floating-point load and store instructions.
The upper execution pipelines are different: Both pipelines execute simple integer operate instructions as well as branch and shift instructions. However, only one pipeline can execute multiply instructions, while the other pipeline can execute motion video instructions. All integer arithmetic instructions, except multiply, motion video, and count instructions, have a latency of one cycle. Motion video and count instructions have a latency of three cycles. The multiply instruction has a latency of seven cycles, both for longword and quadword operands. The integer issue queue stores up to 20 instructions and is able to issue up to four instructions in each cycle to the execution pipelines. The instruction fetch and decode unit stores new instructions in the integer issue queue, provided that four or more entries are available. Upon entry, each instruction is statically assigned to one of the two arbiters: one arbiter handles the lower execution pipelines in both clusters, while the other arbiter handles the upper execution pipelines in both clusters. Each arbiter issues the two oldest instructions in each cycle and slots the instructions to the left or right cluster based upon the position of the instructions in the instruction fetch stream. Note that critical path computations tend to execute on the same cluster due to the additional cycle before results from the other cluster become available.
Floating-Point Unit. The floating-point unit consists of the floating-point execution unit (FBOX) as well as the floating-point register file (FRF). The execution unit contains two separate execution pipelines similar to the floating-point unit in the 21164 processor. One of the pipelines executes multiply instructions, while the add pipeline executes the remaining instructions. The floating-point unit is therefore able to process up to two instructions in each clock cycle. The floating-point register file contains 72 registers, each 64 bits wide, that are used for register renaming of the architected 32 registers. The register file has five read ports as well as four write ports: two read and one write port for each execution pipeline, one read port and two write ports for the memory address translation unit. Compared to the 21164, the latency of floating-point operations was kept at four cycles for all operations except division and square root. Division operations have a latency of 12 and 15 cycles for single-precision and double-precision operands, while square-root operations have a latency of 18 and 30 cycles for single-precision and double-precision operands, respectively. The non-pipelined divider and square-root units are associated with the add pipeline, but instructions can still be issued to the add pipeline even if a division or square-root instruction is in progress. The floating-point unit supports all IEEE and VAX rounding modes in hardware.
Memory Address Translation Unit. The memory address translation unit (MBOX) controls the internal memory system and can execute any combination of two loads and stores in each cycle. In addition, the unit handles up to 32 outstanding loads, 32 outstanding stores, as well as eight cache invalidations. The memory address translation unit consists of the instruction and data translation buffers, load and store queues, as well as the miss address file.
The instruction translation buffer is a 128-entry fully associative memory that stores the virtual to physical address mappings for different page sizes: Each entry supports up to 512 pages, each 8 KB large. A not-last-used strategy is used to replace older entries. In contrast to the data translation buffer, the instruction translation buffer has only one read/write port. The data translation buffer is a 128-entry fully associative memory that stores the virtual to physical address mappings for different page sizes: Each entry supports up to 512 pages, each 8 KB large. A not-last-used strategy is used to replace older entries. Compared to earlier implementations, the data translation buffer is considerably larger and has two read/write ports, since up to two virtual addresses must be translated in each cycle.

The load queue is a 32-entry reorder buffer for outstanding load instructions. The load queue stores load instructions in program order, although the load instructions enter the queue out-of-order. The memory references themselves are performed out-of-order, but loads exit the queue in program order to ensure correct memory behavior. The store queue is a 32-entry reorder buffer for outstanding store instructions. The store queue buffers store instructions in program order, although the store instructions enter the queue out-of-order. The memory references themselves are performed out-of-order, but stores exit the queue in program order to ensure correct memory behavior. The load and store queues use dual-ported content-addressable memories to resolve read-after-read, read-after-write, write-after-read, and write-after-write hazards and support the propagation of store data from the store queue to the load queue. The miss address file has ten entries and stores the address and data information of loads that caused a cache miss. Subsequent load misses check the miss address file and are merged with existing entries under certain conditions. This allows multiple load misses to be serviced with a single fill from the cache control and bus interface unit. The miss address file has six entries for loads as well as four entries for instruction fetches. Each entry stores the address of a 32 byte cache-line.
Data Cache. The data cache is a 64 KB two-way set-associative cache that uses a cache-line size of 64 bytes. The cache is virtually indexed, physically tagged, protected by error-correcting codes, and uses a write-back, write-allocate protocol. Load instructions that hit in the cache incur a latency of three or four cycles for integer and floating-point loads, respectively. The cache is double-pumped, i.e. operates at two times the core frequency in order to support two independent accesses in each cycle.
Backup Cache. The backup cache is an external, direct-mapped cache that ranges from 1 to 16 MB in size. The 21264 processor uses a dedicated 128 bit data bus to interface to the backup cache. The maximum bandwidth of 6.4 GB/s across this bus is obtained by using synchronous double-data-rate static random-access memory (DDR SSRAM). The size, speed, and type of the cache memories are selectable to fit a wide range of price/performance points.
Cache Control and Bus Interface Unit. The cache control and bus interface unit provides the interface between the internal memory system and the external cache and system busses. The external cache interface consists of a 128 bit data bus as well as a 20 bit address bus. The bus speed for both point-to-point connections is adjustable, providing a maximum bandwidth of 6.4 GB/s. The system bus is a point-to-point connection that consists of a 64 bit data bus as well as a 45 bit address bus. The speed of these busses is adjustable as well, providing a maximum bandwidth of 3.2 GB/s.

The cache control and bus interface unit consists of the write buffer, the I/O write buffer, the probe queue, as well as a copy of the tag array from the internal data cache. The write buffer is an eight-entry buffer, where each entry contains a whole cache-line. New stores are appended to the end of the queue, while stores at the head of the queue are sent to the bus interface unit if certain conditions are met. Instead of simply buffering the stores, the write buffer allows stores to the same cache-line to merge with already existing entries. The I/O write buffer is a four-entry buffer that is used to store I/O write requests from the store queue. The probe queue contains eight entries and is used to store cache probe requests from the system bus. The cache control and bus interface unit uses a copy of the tag array from the internal data cache to speed up cache fills and probe requests. The 21264 processor supports a rich set of cache probe requests to support a wide range of cache coherency protocols.

A.3.10 Alpha 21264A

The 21264A, code-named EV67, is very similar to the 21264 described above, but is fabricated in an improved 0.25 µm 6-layer metal CMOS process. The 21264A uses the same number of transistors as the 21264, yielding a die size of 225 mm² in the improved process. The maximum clock frequency was increased to 750 MHz. Using a supply voltage of 2.1 V, the chip dissipates a maximum of 90 W. More information on this implementation can be found in the corresponding hardware reference manual [AXP00b].

A.3.11 Alpha 21264B

The 21264B, code-named EV68, is very similar to the 21264 described in Section A.3.9, but is fabricated in an improved 0.18 µm 6-layer metal CMOS process. The 21264B uses the same number of transistors as the 21264, yielding a die size of 115 mm² in the improved process. The maximum clock frequency was increased to 1 GHz. Using a supply voltage of 2.1 V, the chip dissipates a maximum of 75 W. More information on this implementation can be found in the corresponding hardware reference manual [AXP00c].

A.3.12 Alpha 21364

The 21364 microprocessor, code-named EV7, is expected in 2001/2002 and will represent the first of the fourth-generation implementations of the Alpha architecture. The 21364 supports all extensions of the Alpha architecture as well as the IEEE and VAX floating-point subsets. The chip integrates an improved core that is based on the 21264, a large (1.75 MB) unified second-level cache, two multi-channel RDRAM controllers, network routers, and I/O routers. The network routers offer glue-less support for up to 64 processors in a two-dimensional torus configuration. The information in this section was gathered from several articles [Gwe98][Ban99][MJA+01]. The chip consists of 152 M transistors; the major part (138 M) is consumed by various caches. The chip is fabricated in a 0.18 µm 7-layer metal CMOS process, yielding a die size of 21.1 × 18.8 mm². Note that the routing process was changed from earlier implementations of the Alpha architecture: The 21364 uses routing channels for global busses, while the voltage reference planes were replaced by individual current return lines. Using a core supply voltage of 1.5 V, the chip dissipates up to 120 W at the maximum operating frequency of 1.2 GHz. The 21364 is packaged in a land grid array (LGA) with 1443 pins. The 21364 core is based on the 21264 core; major changes were only made to the cache controller in order to support the integrated second-level cache. The cache controller supports up to 71 outstanding operations: 16 misses from the first-level caches, 16 victims from the second-level cache, 18 probe requests from external devices, 17 requests from other processors, as well as four I/O requests. The second-level cache is a 1.75 MB seven-way set-associative cache using a write-back, write-allocate policy, as well as ECC protection. The cache can deliver up to 16 bytes in each cycle, providing a bandwidth of 19.2 GB/s. The two memory controllers interface directly to external Rambus DRAM modules. Each memory controller supports up to four Rambus DRAM channels for a total of eight channels, thus providing 12.8 GB/s of memory bandwidth. The memory controllers support up to 28 outstanding requests. The network router supports four network links as well as one I/O link, each of which provides a bandwidth of 6.4 GB/s. The network links are used to form a two-dimensional torus network with up to 64 processors. The I/O link connects to an external I/O chip set that provides access to various I/O interfaces. The network and I/O router is implemented as an eight-by-seven crossbar, i.e. eight read ports (four network, one I/O, two memory, and one cache) and seven write ports (four network, one I/O, two memory). The crossbar uses a distributed arbitration scheme and provides buffering for up to 300 packets.

A.3.13 Alpha 21464

The 21464 processor, code-named EV8, represents the first of the fifth-generation implementations of the Alpha architecture and was scheduled to arrive in 2003/2004. As the Alpha architecture was discontinued recently, the EV8 project was canceled. However, some information about this implementation is already available: The 21464 supports four-way simultaneous multithreading as discussed in Section 1.4.1, a radical step from earlier implementations [Die99]. The support for simultaneous multithreading was estimated to consume less than 10 % of the transistor budget, as the multithreading support was built on top of a traditional out-of-order core. Apart from multithreading, the 21464 was planned to have eight function units, a large internal second-level cache (approximately 3 MB), as well as an integrated multi-channel interface to Rambus DRAM. The 21464 was expected to consist of 250 M transistors and would have been fabricated in a 0.125 µm CMOS process.

B. Cray T3E E-Register Programming

The Cray T3E E-registers are a powerful mechanism to tolerate the latency of accesses to local and remote memory. This goal is achieved by significantly increasing the number of outstanding memory operations and pipelining individual requests. For compatibility reasons, the shmem library supports these features only inside function calls, i.e. the hardware is fully utilized only for large block transfers. Only the benchlib supports the full capability of the hardware, but this library was never officially supported and is no longer available to the general public. Therefore, direct programming of the E-registers is required to use the full potential of the hardware. Since the Cray publication [Cra97c] that contains the required programming information is only available under the conditions of a non-disclosure agreement, the information presented in this chapter is gathered from various public sources instead, most notably the Cray T3E optimization guides [ABH97][Cra97b] and the standard header files hwdefs.h [Cra97a], mpphw.h [Cra98a], and shmem.h [Cra98c]. Section B.1 provides basic information about the E-registers from a programmer's point of view. Section B.2 covers the communication routines used in the implementation of the emulation library for the Cray T3E in detail, thereby covering almost all E-register commands. Section B.3 derives some guidelines for programming the E-registers and describes a serious flaw in early T3Es and the implications for programming the E-registers.

B.1 E-Register Programming

Each processing element has two sets of E-registers: a set of 512 user-accessible E-registers and a set of 128 E-registers reserved for system use. This section will cover only the first set, since access to the second set is prohibited for application programmers. The set of 512 E-registers is mapped to memory starting at offset _MPC_E_REG_BASE. Each E-register can be used either as a source-and-destination E-register (SADE) or as part of a more-operands block of four E-registers (MOBE). The source-and-destination E-registers are used for the actual data transfers, while the more-operands blocks provide additional information for E-register commands: The first two E-registers contain the mask and offset for the hardware centrifuge and are required for every E-register command. The third E-register usually contains the stride for vector commands, while the fourth E-register usually contains the addend for fetch-and-add commands. The following conventions are valid for programming the user-accessible set of E-registers:
• E-registers zero to three form a read-only more-operands block that contains a default centrifuge mask, a zero offset, a quadword stride, and a longword addend. This block can be used for most accesses to local or remote memory, thereby eliminating the overhead of initializing a more-operands block prior to each access. These E-registers are located at offset _MPC_E_REG_STRIDE1 from _MPC_E_REG_BASE.
• E-registers four to seven form a partial read-only more-operands block that contains a default centrifuge mask, a zero offset, and a longword stride. The addend can be set to any value, which is useful for fetch-and-add commands. These E-registers are located at offset _MPC_E_REG_STRIDE_LW from _MPC_E_REG_BASE.
• E-registers eight to 15 form two callee-save more-operands blocks. A subroutine may use these E-registers as more-operands blocks only if the contents are saved upon subroutine entry and restored before the subroutine returns. These callee-save blocks are useful if memory arrays with a non-standard distribution are frequently accessed, as the corresponding more-operands blocks do not need to be initialized after each subroutine call. These E-registers are located at offset _MPC_E_REG_SAV_MOBE from _MPC_E_REG_BASE.
• E-registers 16 to 31 form four scratch more-operands blocks that may be used without saving and restoring their contents beforehand and afterwards, respectively. In addition, these E-registers can also be used as source-and-destination registers, as long as operations using these E-registers are guaranteed to complete, i.e. never go to full-fault state. These E-registers are located at offset _MPC_E_REG_SCR_MOBE from _MPC_E_REG_BASE.
• E-registers 32 to 511 may be used as source-and-destination registers without saving or restoring their contents beforehand or afterwards, respectively. In addition, these E-registers can also be used as more-operands blocks, provided that any full-fault state is cleared first. These E-registers are located at offset _MPC_E_REG_SADE from _MPC_E_REG_BASE.
Apart from the E-registers themselves, there is a set of state registers for each set of E-registers. Each state register is 64 bits wide and contains the state for 32 E-registers, since the state for a single E-register occupies two bits. As a consequence, there are 16 state registers for the set of user-accessible E-registers, as well as 4 state registers for the set of E-registers reserved for system use.
The mapping between E-registers and the corresponding state registers is straight-forward: The ith state register contains the state for the E-registers numbered from 32i to 32(i + 1) − 1 in ascending order, i.e. the state for E-register 32i is located in the least-significant bits and the state for E-register 32(i + 1) − 1 is located in the most-significant bits. Since the state for each E-register is two bits wide, there are four possible states:
• The empty state (_MPC_STC_EMPTY) signals that the last E-register command using the corresponding E-register as source-and-destination register is still outstanding.
• The full-fault state (_MPC_STC_FULL_F) signals that the last E-register command using the corresponding E-register as source-and-destination register has completed with errors.
• The full state (_MPC_STC_FULL) signals that the last E-register command using the corresponding E-register as source-and-destination register has completed successfully.
• The full-send-reject state (_MPC_STC_FULL_SR) signals that the last send command using the corresponding E-register as source-and-destination register has been rejected.
The pending register located at _MPC_MR_EREG_PENDING provides a summary of the state of the individual E-registers in the two least-significant bits: The least-significant bit of this register is set if there is at least one outstanding get command and cleared otherwise. The other bit of this register is set if there is at least one outstanding put command and cleared otherwise. The E-register mechanism supports the following set of commands:
• The get command (_MPC_EOP_ERS_GET) returns the contents of a location in local or remote memory, i.e. performs a local or remote memory read.
• The put command (_MPC_EOP_ERS_PUT) updates the contents of a location in local or remote memory, i.e. performs a local or remote memory write.
• The swap command (_MPC_EOP_MSWAP) updates the contents of a location in local or remote memory and returns the original contents of the local or remote memory location.
• The conditional swap command (_MPC_EOP_CSWAP) updates the contents of a location in local or remote memory provided that the original contents meet the specified condition: In this case, the local or remote memory location is updated and the original contents are returned, similar to the swap command. Otherwise, the local or remote memory location remains unchanged and the original contents are returned.
• The fetch-and-increment command (_MPC_EOP_GET_INC) increments the contents of a location in local or remote memory and returns the original contents.
• The fetch-and-add command (_MPC_EOP_GET_ADD) adds the specified value to the contents of a location in local or remote memory and returns the original contents.
• The send command (_MPC_EOP_SEND) stores a message to a local or remote memory location. The specified memory location must contain a message queue control word in order to ensure proper delivery of the message. The message queue control word defines a message queue of arbitrary size. This command is used to implement message-passing communication libraries, e.g. MPI [SOHL+98][GHLL+98] and PVM [GBD+94].
• The state command (_MPC_EOP_ERS_READ) returns the contents of the specified state register.
The individual E-register commands described above can be combined with several qualifiers:
• The local qualifier (_MPC_EOM_LOCAL) can be combined with all commands except the state read if the corresponding command is guaranteed to access a local memory location.
• The vector qualifier (_MPC_EOM_V8) is implied by the send command and can be combined with the get and put commands to transfer blocks of eight E-registers with a single command: A get or put command combined with the vector qualifier uses eight consecutive source-and-destination registers to perform reads or writes to local or remote memory locations. This qualifier is useful for transferring large blocks of data, e.g. arrays.
• The longword qualifier (_MPC_EOM_32BIT) can be combined with all E-register commands except the send and state read commands. This qualifier indicates that the corresponding command accesses longword locations instead of the default quadword locations.
• The bypass qualifier (_MPC_EOM_BYSTT) can be combined with all E-register commands to bypass the segment translation table that is used to generate global addresses. As described in Section 6.1.2, the segment translation table is an integral part of the address generation mechanism, hence this qualifier is usually not used by the application programmer.
• The fixed qualifier (_MPC_EOM_FIXOR) can be combined with all E-register commands and ensures ordering between different E-register commands. All E-register commands using this qualifier are executed in the order in which they were issued.
It is possible to combine several of the above qualifiers with an E-register command, e.g. the get command in conjunction with the local and vector qualifiers. An E-register command is issued by a write to a specific address. Both the address and the written data are used to specify the individual E-register command: The address depends on the command type as well as the used source-and-destination register. The data depends on the address of the local or remote memory location, the number of the local or remote processing element, as well as the used more-operands block. The address is relative to the _MPC_E_REG_BASE base address; the offset is determined as follows: Bits 13 to 21 are used to specify the command type and qualifiers, while bits 0 to 12 are used to store the number of the source-and-destination register. Note that the three least-significant bits are always zero, except for commands that use the vector qualifier. The data uses the following format: Bits 56 to 63 specify the number of the more-operands block, bits 38 to 49 are used to store the number of the local or remote processing element, and the least-significant 38 bits are used to store the virtual address. Note that the number of the local or remote processing element is determined from a combination of the least-significant 50 bits, i.e. the number of the processing element and the virtual address, by the hardware centrifuge described in Section 6.1.2. This centrifuge is useful for distributing arrays across processors in a non-trivial way.
The number of the processing element is taken from bits 38 to 49, if the default centrifuge mask is used.
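The following fragment illustrates how the address and data formats described above fit together. It is an illustrative sketch only: the helper function and its name are not part of the emulation library, the explicit masking of the virtual address is an assumption, and all macros are taken from the standard header files hwdefs.h and mpphw.h cited at the beginning of this chapter; the routines in Section B.2 are the authoritative examples.

/* Illustrative sketch, not part of the emulation library: issue an arbitrary
 * E-register command according to the address and data formats described
 * above. All macros are assumed to come from the standard header files
 * hwdefs.h and mpphw.h; the include directive is therefore omitted here. */
static void ereg_issue(long cmd, int sade, long mobe, int pe, const void *addr)
{
  volatile long *Ecmd;
  long data;

  /* address: base address, command and qualifier bits (bits 13 to 21), and
   * the number of the source-and-destination register (bits 0 to 12,
   * shifted by three bits as in the command macros of Section B.2) */
  Ecmd = (volatile long *)(_MPC_E_REG_BASE | cmd |
                           ((long)(_MPC_E_REG_SADE + sade) << 3));

  /* data: more-operands block (bits 56 to 63), processing element
   * (bits 38 to 49), and virtual address (least-significant 38 bits) */
  data = ((mobe >> 2) << _MPC_BS_EDATA_MOBE)
       | ((long)pe    << _MPC_BS_DFLTCENTPE)
       | ((long)addr & ((1L << 38) - 1));

  _write_memory_barrier();  /* order the setup before the issuing write */
  *Ecmd = data;             /* the write itself issues the command */
  _memory_barrier();        /* order the issue before later E-register reads */
}

For example, a quadword read from a remote location would correspond to ereg_issue(_MPC_EOP_ERS_GET, ereg, _MPC_E_REG_STRIDE1, pe, addr), i.e. the operation performed by the EMUereg int get() routine described in Section B.2.1.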

B.2 E-Register Routines

The emulation library contains communication and synchronization routines that support a split-transaction protocol, i.e. separate issue and completion of local or remote memory requests. On the Cray T3E these routines are implemented by directly accessing the E-registers. As the individual routines use almost all E-register commands, they are well-suited to illustrate the various aspects of E-register programming. This section contains line-by-line descriptions of the individual communication and synchronization routines in the emulation library, i.e. the routines described in Section 2.2.3. Sections B.2.1, B.2.2, and B.2.3 cover the basic get, load, and put routines. The conditional and unconditional swap routines are described in Sections B.2.4 and B.2.5, respectively. Sections B.2.6 and B.2.7 cover the fetch-and-increment and fetch-and-add routines. The routines described in Sections B.2.8 and B.2.9 provide information about the state of E-register operations. Note that these sections are intended to form a reference for the individual E-register commands and therefore often contain similar material.

B.2.1 EMUereg int get()

The EMUereg int get() routine issues a get E-register command to a location in local/remote memory similar to the shmem int get() routine in the shmem library. In contrast to shmem int get(), this routine returns immediately after issuing the E-register command, i.e. without waiting for the local/remote memory transaction to complete.

void EMUereg_int_get(int ereg, const int *addr, int pe) { volatile long *Ecmd; The EMUereg int get() routine expects three arguments: ereg: the number of the source-and-destination register to use. addr: the address of the local or remote memory location. pe: the number of the local or remote processing element. The Ecmd variable is used to issue the E-register command. Ecmd = (volatile long *)(_GET(_MPC_E_REG_SADE + ereg)); The address of the E-register command is initialized. Note that the address specifies the type of the E-register command as well as the E-register to use. The _GET macro is defined as _GET(x) = (_MPC_E_REG_BASE | _MPC_EOP_GET | (x) << 3) where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_GET macro sets the corresponding address bits for a get command. Since all E-register routines in the emulation library count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added.

_write_memory_barrier(); The write memory barrier is an intrinsic for the WMB instruction and separates the initialization of the Ecmd variable above from the subsequent issue of the E-register command. This barrier is required as issuing both writes in reverse order results in a write to an unpredictable address.

*Ecmd = ((_MPC_E_REG_STRIDE1>>2) << _MPC_BS_EDATA_MOBE) | ( (pe) << _MPC_BS_DFLTCENTPE) | ( (long)(addr) ); After the Ecmd variable has been initialized to the correct address, the get command is issued by a write to that address. Note that the address of this write specifies the E-register command as well as the source-and-destination register to use, while the written data specifies the corresponding more-operands block, the address of the memory location, as well as the processing element. The EMUereg int get() routine uses the default more-operands block, which is read-only and provides a zero offset, a quadword stride, and a longword addend for fetch-and-add operations. Since such a block consists of four E-registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the more-operands block. This offset is stored at the eight bits starting from bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting from bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory location is stored in the least-significant 38 bits before the data is written to the address specified by the Ecmd variable.

_memory_barrier(); } This memory barrier is an intrinsic for the mb instruction and separates the issue of the E-register command above from the subsequent load of the source- and-destination register. This barrier is required as issuing both operations in reverse order causes the load of the E-register to return the wrong value. Since the E-register command is issued by a write and the command is completed by a read of the E-register, a write memory barrier is not sufficient.

B.2.2 EMUereg int load()

The EMUereg int load() routine returns the contents of the specified source-and-destination E-register that contains the results of a previous E-register command issued by one of the other E-register routines. int EMUereg_int_load(int ereg) { volatile int *sade = ((volatile int*)_MPC_E_REG_BASE) + _MPC_E_REG_SADE;

sade += ereg; The EMUereg int load() routine expects a single argument: the number of the source-and-destination E-register. The sade variable is initialized to the address of the first of these registers and is subsequently updated to point to the specified register. A simple addition is sufficient for this update, as the number of the register is stored in the least-significant part of the address. _memory_barrier();

return(*sade); } The contents of the specified source-and-destination E-register are retrieved by a load from the address stored in the sade variable and returned immediately. A memory barrier is used to separate the update of the sade variable from the subsequent load of the E-register. Otherwise both operations could be executed in reverse order, causing the load of the E-register to return the contents of the first source-and-destination register instead of the specified E-register.
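The split-transaction character of these routines can be illustrated by a small hypothetical fragment that issues a remote read, overlaps the access with other work, and completes the transaction afterwards. The variable remote_flag, the processing element argument, and the choice of the first source-and-destination E-register are examples only and not part of the emulation library.

/* Hypothetical usage sketch of the get and load routines described above. */
extern int remote_flag;        /* symmetric variable that exists on every PE */

int poll_neighbor(int pe)
{
  /* issue the remote read; EMUereg_int_get() returns immediately */
  EMUereg_int_get(0, &remote_flag, pe);

  /* ... other work can be overlapped with the remote access here ... */

  /* complete the transaction by reading the source-and-destination E-register */
  return EMUereg_int_load(0);
}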

B.2.3 EMUereg int put()

The EMUereg int put() routine performs a local or remote put operation similar to the shmem int put() routine from the shmem library. In contrast to shmem int put(), the source-and-destination register used for the data transfer can be specified for the EMUereg int put() routine.

void EMUereg_int_put(int ereg, const int *addr, const int value, int pe) { volatile int *sade = ((volatile int *)_MPC_E_REG_BASE) + _MPC_E_REG_SADE; volatile long *Ecmd; The EMUereg int put() routine expects four arguments: ereg: the number of the source-and-destination register to use. addr: the address of the local or remote memory location. value: the value to store in the local or remote memory location. pe: the number of the destination processing element. The sade variable is initialized to the address of the first source-and-destination E-register and is subsequently updated to point to the specified E-register, while the Ecmd variable is used to issue the E-register command. Ecmd = (volatile long *)(_PUT(_MPC_E_REG_SADE + ereg)); sade += ereg; The address of the E-register command is initialized. The _PUT macro is defined as _PUT(x) = (_MPC_E_REG_BASE | _MPC_EOP_PUT | (x) << 3) where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_PUT macro sets the corresponding address bits for a put command. Since the EMUereg routines count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added. The sade variable was initialized to the address of the first source-and-destination E-register above and is subsequently updated to point to the specified E-register.

*sade = value;

_write_memory_barrier(); The value to store in the local/remote memory location is stored in the specified E-register by writing the value to the address given by the sade variable. The write memory barrier is required to separate the initialization of the Ecmd and sade variables as well as the update of the E-register from the subsequent issue of the E-register put command. Otherwise, memory operations could be reordered, such that the issued E-register command references an unpredictable address or stores the unpredictable contents of the E-register to the destination.

*Ecmd = ((_MPC_E_REG_STRIDE1>>2) << _MPC_BS_EDATA_MOBE) | ( (pe) << _MPC_BS_DFLTCENTPE) | ( (long)(addr) ); After the Ecmd variable has been initialized to the correct address, the put command is issued by a write to that address. The EMUereg int put() routine uses the default more-operands block, which is read-only and provides a zero offset, a quadword stride, as well as a longword addend for fetch-and- add operations. Since such a block consists of four E-registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the more-operands block. This offset is stored at the eight bits starting from bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting from bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory location is stored in the least-significant 38 bits, before the data is written to the address specified by the Ecmd variable. _memory_barrier(); } The memory barrier is required to separate the issue of the E-register command from subsequent accesses to the specified E-register.
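A corresponding hypothetical sketch for the put routine is given below. The variable remote_flag and the completion handling are examples only; as described above, the routine merely issues the command, hence completion of the remote write has to be checked separately, e.g. with the state routines described in Sections B.2.8 and B.2.9.

/* Hypothetical usage sketch of the put routine described above. */
extern int remote_flag;        /* symmetric variable that exists on every PE */

void signal_neighbor(int pe)
{
  /* store the value 1 in remote_flag on processing element pe, using the
   * first source-and-destination E-register for the data transfer */
  EMUereg_int_put(0, &remote_flag, 1, pe);

  /* the routine returns after issuing the command; the state of the
   * E-register has to be checked before the register is reused */
}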

B.2.4 EMUereg int cswap()

The EMUereg int cswap() routine issues a conditional swap, i.e. the contents of the local or remote memory location are only updated if the original contents are equal to the condition. The original contents of the local or remote memory location are returned in every case. The EMUereg int cswap() routine is similar to the shmem int cswap() routine from the shmem library. In contrast to shmem int cswap(), this routine returns immediately after issuing the E-register command, i.e. without waiting for the local or remote memory transaction to complete.

void EMUereg_int_cswap(int ereg, const int *addr, int cond, int value, int pe) { volatile int *mobe = ((volatile int *)_MPC_E_REG_BASE) + _MPC_E_REG_SCR_MOBE; volatile long *Ecmd; The EMUereg int cswap() routine expects five arguments: ereg: the number of the source-and-destination E-register to use. addr: the address of the local or remote memory location. cond: the value to compare with the contents of the local or remote memory location. value: the value to store in the local or remote memory location. pe: the number of the destination processing element. As the EMUereg int cswap() routine does not use the default more-operands block, the mobe variable points to the first scratch more-operands block. The Ecmd variable is used to issue the E-register command. Ecmd = (volatile long *)(_CSWAP(_MPC_E_REG_SADE + ereg)); The address of the E-register command is initialized. The _CSWAP macro is defined as _CSWAP(x) = (_MPC_E_REG_BASE | _MPC_EOP_CSWAP | (x) << 3) where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_CSWAP macro sets the corresponding address bits for a conditional swap command. Since the EMUereg routines count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added.

*(mobe + 0) = _MPC_E_DFLTCENT; *(mobe + 1) = 0; *(mobe + 2) = cond; *(mobe + 3) = value; The four E-registers in the scratch more-operands block are initialized for the subsequent conditional swap operation. As with all E-register routines, the default centrifuge mask is used together with a zero offset. The corre- sponding values are stored in the first and second E-register of the scratch block, respectively. The condition and value arguments are stored in the third and fourth E-register. Note that the conditional swap operation uses a non- standard allocation for the last two E-registers: The condition is placed in the stride field, while the value is placed in the addend field.

_write_memory_barrier(); The write memory barrier is required to separate the initialization of the Ecmd and mobe variables as well as the update of the more-operands block of E-registers from the subsequent issue of the E-register conditional swap command. Otherwise, memory operations could be reordered, such that the issued E-register command references an unpredictable address, uses an un- predictable condition, or stores an unpredictable value in the local or remote memory location.

*Ecmd = ((_MPC_E_REG_SCR_MOBE>>2) << _MPC_BS_EDATA_MOBE) | ( (pe) << _MPC_BS_DFLTCENTPE) | ( (long)(addr) );

After the Ecmd variable has been initialized to the correct address, the conditional swap command is issued by a write to that address. The EMUereg int cswap() routine uses the first scratch more-operands block of E-registers, which was initialized above. Since such a block consists of four E- registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the block. This offset is stored at the eight bits start- ing from bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting from bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory loca- tion is stored in the least-significant 38 bits, before the data is written to the address specified by the Ecmd variable. _memory_barrier(); } The memory barrier is required to separate the issue of the E-register command from subsequent accesses to the specified E-register.
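The conditional swap routine can be used to build simple synchronization constructs. The following hypothetical fragment acquires a lock by repeatedly trying to replace the value zero with one; the lock variable, the busy-wait loop, and the absence of any back-off are examples only and not part of the emulation library.

/* Hypothetical usage sketch of the conditional swap routine described above. */
extern int lock_word;          /* symmetric lock variable, zero means free */

void acquire_lock(int pe)
{
  int old;

  do {
    /* replace zero by one in lock_word on processing element pe, provided
     * that the original contents are zero; the original contents are
     * returned in the source-and-destination E-register in any case */
    EMUereg_int_cswap(0, &lock_word, 0, 1, pe);
    old = EMUereg_int_load(0);
  } while (old != 0);          /* retry until the lock was observed free */
}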

B.2.5 EMUereg int mswap()

The EMUereg int mswap() routine issues a memory swap, i.e. the contents of the local or remote memory location are updated and the original contents of the local or remote memory location are returned. The EMUereg int mswap() routine is similar to the shmem int mswap() routine from the shmem library. In contrast to shmem int mswap(), this routine returns immediately after issuing the E-register command, i.e. without waiting for the local or remote memory transaction to complete.

void EMUereg_int_mswap(int ereg, const int *addr, int value, int pe) { volatile int *mobe = ((volatile int *)_MPC_E_REG_BASE) + _MPC_E_REG_SCR_MOBE; volatile long *Ecmd; The EMUereg int mswap() routine expects four arguments: ereg: the number of the source-and-destination E-register to use. addr: the address of the local or remote memory location. value: the value to store in the local or remote memory location. pe: the number of the destination processing element. As the EMUereg int mswap() routine does not use the default more-operands block, the mobe variable points to the first scratch more-operands block. The Ecmd variable is used to issue the E-register command. Ecmd = (volatile long *)(_MSWAP(_MPC_E_REG_SADE + ereg));

The address of the E-register command is initialized. The _MSWAP macro is defined as _MSWAP(x) = (_MPC_E_REG_BASE | _MPC_EOP_MSWAP | (x) << 3) where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_MSWAP macro sets the corresponding address bits for a masked swap command. Since the E-register routines in the emulation library count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added.

*(mobe + 0) = _MPC_E_DFLTCENT; *(mobe + 1) = 0; *(mobe + 2) = -1; *(mobe + 3) = value; The four E-registers in the scratch more-operands block are initialized for the subsequent unconditional swap operation. As with all E-register routines, the default centrifuge mask is used together with a zero offset. The corresponding values are stored in the first and second E-register of the scratch block, respectively. The value is stored in the fourth E-register, while the third E-register is initialized to all ones to ensure proper operation of the unconditional swap command. Note that the unconditional swap operation uses a non-standard allocation for the last two E-registers: The stride field is used as a bit mask, while the value is placed in the addend field.

_write_memory_barrier(); The write memory barrier is required to separate the initialization of the Ecmd and mobe variables as well as the update of the more-operands block of E-registers from the subsequent issue of the E-register unconditional swap command. Otherwise, memory operations could be reordered, such that the issued E-register command references an unpredictable address or stores an unpredictable value in the local or remote memory location.

*Ecmd = ((_MPC_E_REG_SCR_MOBE>>2) << _MPC_BS_EDATA_MOBE) | ( (pe) << _MPC_BS_DFLTCENTPE) | ( (long)(addr) ); After the Ecmd variable has been initialized to the correct address, the masked swap command is issued by a write to that address. The EMUereg int mswap() routine uses the first scratch more-operands block of E-registers, which was initialized above. Since such a block consists of four E-registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the block. This offset is stored at the eight bits starting from bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting from bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory location is stored in the least-significant 38 bits, before the data is written to the address specified by the Ecmd variable. _memory_barrier(); } The memory barrier is required to separate the issue of the E-register command from subsequent accesses to the specified E-register.

B.2.6 EMUereg int finc()

The EMUereg int finc() routine issues a fetch-and-increment command, i.e. it increments the contents of the local or remote memory location and returns the original contents. The EMUereg int finc() routine is similar to the shmem int finc() routine from the shmem library. In contrast to shmem int finc(), this routine returns immediately after issuing the E-register command, i.e. without waiting for the local or remote memory transaction to complete.

void EMUereg_int_finc(int ereg, const int *addr, int pe) { volatile long *Ecmd; The EMUereg int finc() routine expects three arguments: ereg: the number of the source-and-destination E-register to use. addr: the address of the local or remote memory location. pe: the number of the destination processing element. The Ecmd variable is used to issue the E-register command. Ecmd = (volatile long *)(_GET_INC(_MPC_E_REG_SADE + ereg)); The address of the E-register command is initialized. Note that the address specifies the type of the E-register command as well as the E-register to use. The _GET_INC macro is defined as _GET_INC(x) = (_MPC_E_REG_BASE | _MPC_EOP_GET_INC | (x) << 3) where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_GET_INC macro sets the corresponding address bits for a fetch-and-increment command. Since the E-register routines in the emulation library count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added.

_write_memory_barrier();

The write memory barrier separates the initialization of the Ecmd variable above from the subsequent issue of the E-register command. This barrier is required as both writes may be issued in reverse order otherwise, causing a write to an unpredictable address.

*Ecmd = ((_MPC_E_REG_STRIDE1>>2) << _MPC_BS_EDATA_MOBE) | ( (pe) << _MPC_BS_DFLTCENTPE) | ( (long)(addr) ); After the Ecmd variable has been initialized to the correct address, the fetch-and-increment command is issued by a write to that address. The EMUereg int finc() routine uses the default more-operands block of E-registers, which is read-only and provides a zero offset, a quadword stride, as well as a longword addend for fetch-and-add operations. Since such a block consists of four E-registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the block. This offset is stored at the eight bits starting from bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting from bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory location is stored in the least-significant 38 bits, before the data is written to the address specified by the Ecmd variable. _memory_barrier(); } This memory barrier separates the issue of the E-register command above from the subsequent load of the E-register. This barrier is required as both operations could be issued in reverse order otherwise, causing the load of the E-register to return the wrong value. Since the E-register command is issued by a write and the command is completed by a read of the E-register, a write memory barrier is not sufficient.
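A typical use of the fetch-and-increment routine is the distribution of unique numbers, e.g. loop iterations or tickets, among processing elements. The counter variable and the choice of processing element zero in the following fragment are hypothetical and only serve as an illustration.

/* Hypothetical usage sketch of the fetch-and-increment routine. */
extern int ticket_counter;     /* symmetric counter, initialized to zero */

int next_ticket(void)
{
  /* increment the counter on processing element 0 and return its
   * original contents as the ticket of the calling processing element */
  EMUereg_int_finc(0, &ticket_counter, 0);
  return EMUereg_int_load(0);
}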

B.2.7 EMUereg int fadd()

The EMUereg int fadd() routine issues a fetch-and-add operation, i.e. returns the contents of the local or remote memory location while updating the con- tents of the local or remote memory location by adding the specified value. The EMUereg int fadd() routine is similar to the shmem int fadd() routine from the shmem library. In contrast to shmem int fadd(), this routine returns immediately after issuing the E-register command, i.e. without waiting for the local or remote memory transaction to complete.

void EMUereg_int_fadd(int ereg, const int *addr, int value, int pe)
{
  volatile int  *mobe = ((volatile int *)_MPC_E_REG_BASE) + _MPC_E_REG_SCR_MOBE;
  volatile long *Ecmd;

The EMUereg_int_fadd() routine expects four arguments:

ereg:  the number of the source-and-destination E-register to use.
addr:  the address of the local or remote memory location.
value: the value to add to the local or remote memory location.
pe:    the number of the destination processing element.

The EMUereg_int_fadd() routine does not use the default more-operands block, hence the mobe variable points to the first scratch more-operands block. The Ecmd variable is used to issue the E-register command.

  Ecmd = (volatile long *)(_GET_ADD(_MPC_E_REG_SADE + ereg));

The address of the E-register command is initialized. The _GET_ADD macro is defined as

  _GET_ADD(x) = (_MPC_E_REG_BASE | _MPC_EOP_GET_ADD | (x) << 3)

where _MPC_E_REG_BASE is the base address for all E-register operations. The _MPC_EOP_GET_ADD macro sets the corresponding address bits for a fetch-and-add command. Since the EMUereg routines count the source-and-destination registers relative to the first source-and-destination register instead of the first more-operands block, the offset from the first source-and-destination register has to be added.

  *(mobe + 0) = _MPC_E_DFLTCENT;
  *(mobe + 1) = 0;
  *(mobe + 2) = -1;
  *(mobe + 3) = value;

The four E-registers in the scratch more-operands block are initialized for the subsequent fetch-and-add operation. As with all E-register routines in the emulation library, the default centrifuge mask is used together with a zero offset. The corresponding values are stored in the first and second E-register of the scratch block, respectively. The value is stored in the fourth E-register, while the third E-register is initialized to all ones to ensure proper operation of the fetch-and-add command. Note that the fetch-and-add operation uses the same non-standard allocation for the last two E-registers as the unconditional swap operation.

  _write_memory_barrier();

The write memory barrier is required to separate the initialization of the Ecmd and mobe variables as well as the update of the more-operands block of E-registers from the subsequent issue of the E-register fetch-and-add command. Otherwise memory operations could be reordered, such that the issued E-register command references an unpredictable address or stores an unpredictable value in the local/remote memory location.

  *Ecmd = ((_MPC_E_REG_SCR_MOBE>>2) << _MPC_BS_EDATA_MOBE) |
          ( (pe) << _MPC_BS_DFLTCENTPE) |
          ( (long)(addr) );

After the Ecmd variable has been initialized to the correct address, the fetch-and-add command is issued by a write to that address. The routine uses the first scratch more-operands block of E-registers, which was initialized above. Since such a block consists of four E-registers, the number of the first E-register in the block is right-shifted by two bits to get the offset of the block. This offset is stored in the eight bits starting at bit _MPC_BS_EDATA_MOBE. The second line stores the number of the processing element in the twelve bits starting at bit _MPC_BS_DFLTCENTPE, since the default centrifuge mask is used. The address of the memory location is stored in the least-significant 38 bits, before the data is written to the address specified by the Ecmd variable.

  _memory_barrier();
}

The memory barrier is required to separate the issue of the E-register command from subsequent accesses to the specified E-register.
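Both the fetch-and-increment and the fetch-and-add command write a data word with the same layout to the command address. The following sketch restates this layout as a small helper function; the helper name and the explicit masking of the address to 38 bits are assumptions for illustration and are not part of the emulation library.

  /* Illustrative sketch only: builds the data word described above.
   * The helper name and the 38-bit address mask are assumptions. */
  static long build_get_command(long mobe, int pe, const void *addr)
  {
    return ((mobe >> 2)  << _MPC_BS_EDATA_MOBE)    /* more-operands block offset            */
         | ((long)(pe)   << _MPC_BS_DFLTCENTPE)    /* destination PE (default centrifuge)   */
         | ((long)(addr) & ((1L << 38) - 1));      /* memory address in the low 38 bits     */
  }

Written this way, the command issue in EMUereg_int_fadd() corresponds to *Ecmd = build_get_command(_MPC_E_REG_SCR_MOBE, pe, addr);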

B.2.8 EMUereg_pending()

The EMUereg_pending() routine reads the _MPC_MR_EREG_PENDING register. Most of the bits in this register are unused; the relevant information is stored in the two least-significant bits: the least-significant bit is set if there are any outstanding get-type operations, the other bit is set if there are any outstanding put-type operations. This register can therefore be used to check for the completion of all outstanding E-register commands, without the need to check the status of each individual E-register. The shmem library does not contain a similar routine, since all outstanding operations must be completed before any of the shmem routines return.

long EMUereg_pending(void)
{
  return(*((volatile long *)_MPC_MR_EREG_PENDING));
}

The EMUereg_pending() routine expects no arguments and immediately returns the contents of the _MPC_MR_EREG_PENDING register. Note that this register is addressed relative to _MPC_MR_BASE, instead of the _MPC_E_REG_BASE used for all other E-register operations.
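Since EMUereg_pending() merely reports whether any get-type or put-type operations are outstanding, a caller that needs blocking semantics has to poll the register itself. The following sketch shows such polling helpers; the helper and macro names are assumptions for illustration only, based on the bit assignment described above.

  /* Illustrative sketch: completion helpers built on EMUereg_pending().
   * The names below are assumptions, not part of the emulation library. */
  #define EMU_PENDING_GETS 0x1L   /* outstanding get-type operations */
  #define EMU_PENDING_PUTS 0x2L   /* outstanding put-type operations */

  static void EMU_wait_gets(void)
  {
    while (EMUereg_pending() & EMU_PENDING_GETS)   /* spin until all gets have completed  */
      ;
  }

  static void EMU_wait_all(void)
  {
    while (EMUereg_pending() != 0)                 /* spin until all gets and puts are done */
      ;
  }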

B.2.9 EMUereg_state()

The EMUereg_state() routine is similar to the EMUereg_pending() routine described above, but provides more detailed information: instead of checking whether there are any outstanding get or put operations, the EMUereg_state() routine checks whether there is an outstanding E-register command for the specified E-register. This is accomplished by reading the corresponding state register and extracting the state for the specified E-register. The shmem library does not contain a comparable routine, since all outstanding operations must be completed before any of the shmem routines return.

long EMUereg_state(int ereg)
{
  volatile long *Ecmd;
  long state;

The EMUereg_state() routine expects a single argument, i.e. the number of the E-register to be checked. The Ecmd variable is used to issue the E-register command, while the state variable is used to store the state of the specified E-register.

  Ecmd = (volatile long *)(_ERS(ereg >> 5));

The Ecmd variable is initialized to the address of the state register that contains the state of the specified E-register. The _ERS macro is defined as

  _ERS(x) = (_MPC_EOP_ERS_READ | _MPC_E_REG_BASE | (x) << 3)

which is based on experimentation, as this macro is not defined by the header files mentioned above. Note that the macro expects the number of the state register instead of the number of the E-register to be checked. However, the mapping between state registers and E-registers is straightforward: every state register contains the state for 32 E-registers, hence the number of the E-register is right-shifted by 5 bits to obtain the number of the corresponding state register.

  _write_memory_barrier();

The write memory barrier is required to separate the initialization of the Ecmd variable from the subsequent issue of the E-register state read command. Otherwise memory operations could be reordered, such that the issued E-register command references an unpredictable address.

  state = ((*Ecmd) >> (2 * (ereg % 32))) & 0x3;

  return(state);
}

The E-register state read is issued by a load from the address stored in the Ecmd variable, which returns the contents of the specified state register. The state of the specified E-register is isolated by right-shifting it to the two least-significant bit positions and subsequently clearing all other bit positions. The two bits of state for each E-register encode the four different states that have been described in Section B.1.
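The mapping from an E-register number to the location of its state bits can be made explicit with a small helper. The sketch below is purely illustrative and follows the description above, i.e. 32 E-registers per state register and two state bits per E-register; the helper name is an assumption.

  /* Illustrative sketch: locate the state bits of an E-register.
   * The helper name is an assumption, not part of the emulation library. */
  static void EMU_state_location(int ereg, int *state_reg, int *bit_pos)
  {
    *state_reg = ereg >> 5;        /* 32 E-registers per state register */
    *bit_pos   = 2 * (ereg % 32);  /* two state bits per E-register     */
  }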

B.3 Programming Guidelines

The E-registers are the only way to access remote memory on the Cray T3E. Therefore all communication packages that are available on the T3E use the E-registers. Apart from remote memory accesses, the E-registers can also be used for single-processor optimizations: since the get and put commands operate in the uncached memory region, E-registers can be used to transfer large amounts of memory without destroying the contents of the cache. Several other useful optimizations using the E-registers are described in the Cray T3E optimization guides [ABH97][Cra97b]. The compiler may also generate code that uses the E-registers, e.g. if the cache_bypass directive is used. In conclusion, the E-registers are not only used by communication libraries, but might also be used by other libraries or the compiler itself. Therefore accesses to the E-registers from several sources must be separated to ensure proper operation.

In order to avoid conflicting accesses to the E-registers, the accesses from different libraries have to be separated in time. However, most libraries restrict accesses to the E-registers to within subroutine calls, i.e. all E-register operations are completed prior to subroutine return. The only exceptions to this rule are the benchlib and the emulation library. Since the benchlib is not available to the general public, this library is not considered further. In order to separate accesses to the E-registers in time, it is therefore sufficient to complete all accesses that were initiated by the emulation library before calling a subroutine from one of the other libraries. This resolves the conflicts between the emulation library and other libraries, but the compiler itself may still generate code that causes conflicting accesses to the E-registers. At present, the compiler only generates code that uses the E-registers if explicitly told to do so, e.g. via the cache_bypass directive. Procedures that use one of these directives should therefore be treated in the same way as library functions that contain accesses to the E-registers. However, future versions of the compiler might generate conflicting code automatically, i.e. without being prompted by directives. The source code of the routines in the emulation library contains directives that forbid the compiler from using those E-registers that are marked as used: the mobe(n) and sade(n) directives mark the first n more-operands blocks and source-and-destination E-registers as used, respectively. Following the rules outlined above will therefore ensure that conflicting accesses to the E-registers are avoided.

Apart from the conflicts mentioned above, there is a serious flaw in early versions of the Cray T3E that may be triggered by using the E-registers directly. As the consequences of this flaw can be fatal, i.e. it is possible to crash the whole machine, extreme caution is required when using the emulation library on these versions of the Cray T3E. The following paragraphs contain a short summary of the flaw as well as possible workarounds; a whitepaper from Cray provides more detailed information [Cra96].

The flaw affects the Cray T3E-600, i.e. the initial version of the T3E using a processor clock frequency of 300 MHz. The problem has been solved in the T3E-900, i.e. the version of the T3E using a processor clock frequency of 450 MHz, and later models. The problem manifests itself if an E-register command references a memory location that is simultaneously cached in one of the stream buffers.
Note that the offending E-register command may originate from the local processing element or any of the remote processing elements. If this situation occurs, the system logic on the affected processing element loses track of cache coherence. This can cause the crash of the affected processing node, the local manager node, and ultimately the whole system via the global manager node. But the problem is even worse: it is not sufficient to avoid simultaneous cached and uncached accesses to the same memory locations via the stream buffers and the E-registers, respectively. Since the stream buffers are used to prefetch data, they may contain memory locations that are ahead of previously accessed memory locations. Therefore simultaneous cached and uncached references to nearby memory locations are sufficient to trigger the flaw. More specifically, uncached references should access memory locations that are below or at least 192 bytes above locations that are simultaneously accessed by cached references. This problem can be addressed in three different ways, i.e. by disabling the stream buffers or by separating cached and uncached references in space and time, respectively.

The first solution does not require any program changes, but may severely impact the performance of the system, especially on vectorizable codes. However, most installations that still use one of the T3E-600 models use this solution. A less radical approach is to enable stream buffers by default, but ensure that all programs disable the stream buffers before entering a program section that may contain cached and uncached references to similar memory locations. This can be accomplished by using the set_d_stream() and quiet_d_stream() routines.

The second solution is to separate conflicting accesses in space. This is accomplished by identifying all arrays that may be in use while uncached references to other arrays occur, and providing sufficient padding between these arrays. The drawback of this solution is the amount of work required to identify and separate such arrays.
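The first two workarounds can be sketched as follows, before turning to the third solution below; the signatures of set_d_stream() and quiet_d_stream() as well as the exact padding layout are assumptions for illustration, based only on the description above.

  /* Illustrative sketch of the first two workarounds.  The signatures of
   * set_d_stream()/quiet_d_stream() and the padding layout are assumptions. */

  /* Workaround 1 (less radical variant): disable the stream buffers around a
   * section that mixes cached references with E-register references. */
  void ereg_section(void)
  {
    set_d_stream(0);        /* assumed: disable the stream buffers           */
    quiet_d_stream();       /* assumed: wait until the streams are quiescent */

    /* ... cached and E-register (uncached) references ...                   */

    set_d_stream(1);        /* assumed: re-enable the stream buffers         */
  }

  /* Workaround 2: separate conflicting arrays in space, here by at least 192
   * bytes, so that prefetching stream buffers cannot reach into the array
   * that is accessed via the E-registers. */
  struct separated_arrays {
    double cached_array[1024];
    char   pad[192];
    double ereg_array[1024];
  };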

The third solution is to separate conflicting accesses in time. This is accomplished by placing synchronization barriers between sections that may contain cached and uncached references to similar memory locations. Alternatively, conflicting data structures can be protected by synchronization locks. The drawback of this solution is the impact on performance as well as the work required to modify the program.

In conclusion, the use of the emulation library on Cray T3E-600 models with stream buffers enabled by default is strongly discouraged, and no liability of any form is assumed if the emulation library is used under these conditions.

Literature

[ABB+86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, and Michael Young. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of the Summer 1986 USENIX Technical Conference, pages 93–112, July 1986. [ABBJ64] G. M. Amdahl, G. A. Blauuw, and F. P. Brooks Jr. Architecture of the IBM System/360. IBM Journal of Research and Development, 8(2):87–101, April 1964. [ABC+95] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, and Donald Yeung. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22th International Symposium on Computer Architecture (ISCA), pages 2–13, June 1995. [ABD+97] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vandevo- orde, Carl A. Waldspurger, and William E. Weihl. Continuous Profiling: Where Have All the Cycles Gone. Technical Report SRC 1997-016a, Digital Systems Research Center, September 1997. [ABH97] Ed Anderson, Jeff Brooks, and Tom Hewitt. The Benchmarker’s Guide to Single-Processor Optimization for Cray T3E Systems. Cray Research Inc., 1997. [ABLL92] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User- Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992. [ACC+90] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera Computer System. In Proceed- ings of Supercomputing, pages 1–6, November 1990. [ACK+95] Remzi H. Arpaci, David E. Culler, Arvind Krishnamurthy, Steve G. Steinberg, and Katherine Yelick. Empirical Evaluation of the Cray-T3D: A Compiler Perspective. In Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), pages 320–331, June 1995. [ADK+92] F. Abolhassan, R. Drefenstedt, J. Keller, W. J. Paul, and D. Scheerer. On the Physical Design of PRAMs. Computer Journal, 36(8):756–762, Decem- ber 1992. [Aga92] Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems, 3(5):525–539, September 1992. [AKK+95] Robert Alverson, Simon Kahan, Richard Korry, Cathy McCann, and Burton Smith. Scheduling on the Tera MTA. In Proceedings of the IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 19–44, April 1995. 300 Literature

[Alf94] Robert A. Alfieri. An Efficient Kernel-Based Implementation of POSIX Threads. In Proceedings of the Summer 1994 USENIX Technical Conference, pages 59–72, June 1994. [ALKK90] Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), pages 104– 114, June 1990. [ALL89] Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The Per- formance Implications of Thread Management Alternatives for Shared Memory Multiprocessors. IEEE Transactions on Computers, 38(12):1631–1644, Decem- ber 1989. [AMBN98] Ernest Artiaga, Xavier Martorell, Yolanda Becerra, and Nacho Navarro. Experiences on the Implementing PARMACS Macros to Run the SPLASH-2 Suite on Multiprocessors. In Proceedings of the 6th Euromicro Workshop on Parallel and Distributed Processing, pages –, January 1998. [ANMB97] Ernest Artiaga, Nacho Navarro, Xavier Martorell, and Yolanda Be- cerra. Implementing PARMACS Macros for Shared Memory Multiprocessor Environments. Technical Report UPC-DAC-1997-07, Polytechnic University of Catalunya, Department of Computer Architecture, January 1997. [AXP96a] Digital Equipment Corp. Alpha 21064 and 21064A Microprocessors Hardware Reference Manual, June 1996. [AXP96b] Digital Equipment Corp. Alpha 21066 and 21066A Microprocessors Hardware Reference Manual, January 1996. [AXP96c] Digital Equipment Corp. Alpha 21164 Microprocessor Hardware Refer- ence Manual, July 1996. [AXP97] Digital Equipment Corp. Alpha 21164PC Microprocessor Hardware Ref- erence Manual (PCA56), September 1997. [AXP98a] Compaq Computer Corp. Alpha 21164A Microprocessor Hardware Ref- erence Manual, December 1998. [AXP98b] Compaq Computer Corp. Alpha 21164PC Microprocessor Hardware Ref- erence Manual (PCA57), December 1998. [AXP99] Compaq Computer Corp. Tsunami/Typhoon 21272 Chipset Hardware Reference Manual, October 1999. [AXP00a] Compaq Computer Corp. Alpha 21264/EV6 Microprocessor Hardware Reference Manual, September 2000. [AXP00b] Compaq Computer Corp. Alpha 21264/EV67 Microprocessor Hardware Reference Manual, September 2000. [AXP00c] Compaq Computer Corp. Alpha 21264/EV68 Microprocessor Hardware Reference Manual, December 2000. [Bai90] David H. Bailey. FFTs in External or Hierarchical Memory. The Journal of Supercomputing, 4(1):23–35, March 1990. [Ban99] Peter Bannon. Alpha 21364: A Scalable Single-chip SMP. In Proceedings of Microprocessor Forum, pages –, October 1999. [BBD+87] James Boyle, Ralph Butler, Terrence Disz, Barnett Glickfeld, Ewing Lusk, Ross Overbeck, James Patterson, and Rick Stevens. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987. [BBF+97] P. Bach, M. Braun, A. Formella, J. Friedrich, Th. Grn, and C. Licht- enau. Building the 4 Processor SB-PRAM Prototype. In Proceedings of the Hawaii 30th International Symposium on System Science (HICSS), pages 14– 23, January 1997. [BBLS91] D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS parallel bench- marks. Technical Report RNR-91-02, NASA Ames Research Center, January 1991. Literature 301

[BC91a] Jean-Loup Baer and Tien-Fu Chen. An Effective On-Chip Preloading Scheme To Reduce Data Access Penalty. In Proceedings of Supercomputing 91, pages 176–186, November 1991. [BC91b] Dileep P. Bhandarkar and Douglas W. Clark. Performance from Archi- tecture: Comparing a RISC and a CISC with Similar Hardware Organization. In Proceedings Fourth Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 310–319, April 1991. [BCD+92] David S. Blickstein, Peter W. Craig, Caroline S. Davidson, R. Neil Faiman, Kent D. Glossop, Richard B. Grove, Steven O. Hobbs, and William B. Noyce. The GEM Optimizing Compiler System. Digital Technical Journal, 4(4):121–136, Fall 1992. [BEH91] David G. Bradlee, Susan J. Eggers, and Robert R. Henry. Integrated Reg- ister Allocation and Instruction Scheduling for RISCs. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 122–131, April 1991. [BEKK00] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6):885–897, November 2000. [BG95] Thomas J. Bergin and Richard G. Gibson, editors. The History of Pro- gramming Languages-II. ACM Press, 1995. [BH86] Josh Barnes and Piet Hut. A hierarchical o(n log n) force-calculation algo- rithm. Nature, 324(4):446–449, December 1986. [BHS+95] D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995. [BJK+96] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Mul- tithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55–69, August 1996. [BL96] Ricardo Bianchini and Beng-Hong Lim. Evaluating the Performance of Multithreading and Prefetching in Multiprocessors. Journal of Parallel and Distributed Computing, 37(1):83–97, August 1996. [Bli96] Bruce Blinn. Portable Shell Programming. Prentice-Hall, 1996. [BLL88] Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. PRESTO: A System for Object-oriented Parallel Programming. Software Practice & Ex- perience, 18(8):713–732, August 1988. [BLM+91] Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plax- ton, Stephen J. Smith, and Marco Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Symposium on Parallel Algorithms and Architectures, pages 3–16, July 1991. [BM98] Rudolf Berrendorf and Bernd Mohr. PCL - The Performance Counter Library. Technical Report IB-9816, Zentrum f¨urAngewandte Mathematik, Forschungszentrum J¨ulich, October 1998. [BOW+90] Dileep P. Bhandarkar, D. Orbits, R. Witek, W. Cardoza, and D. Cutler. High Performance Issue Oriented Architecture. In Proceedings of Compcon Spring 1997, pages 153–160, February 1990. [BR91] David Bernstein and Michael Rodeh. Global Instruction Scheduling for Su- perscalar Machines. In Proceedings ACM Conference on Programming Language Design and Implementations (PLDI), pages 241–255, June 1991. [BR92] Bob Boothe and Abhiram Ranade. Improved Multithreading Techniques for Hiding Communication Latency in Multiprocessors. In Proceedings of the 19th International Symposium on Computer Architecture (ISCA), pages 214– 223, May 1992. 302 Literature

[Bur01] Tom Burd. CPU Info Center: General Processor Information . http://bwrc.eecs.berkeley.edu/CIC/summary/local/summary.pdf, 2001. [But97] David R. Butenhof. Programming with POSIX Threads. Addison Wesley Longman, 1997. [CAC+81] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register Allocation via Coloring. Computer Languages, 6(1):47–57, January 1981. [Cat91] Ben J. Catanzaro, editor. The SPARC Technical Papers. Springer Verlag, 1991. [CB94] Tien-Fu Chen and Jean-Loup Baer. A Performance Study of Software and Hardware Data Prefetching Schemes. In Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA), pages 223–232, April 1994. [CD88] Eric C. Cooper and Richard P. Draves. C Threads. Technical Report CMU-CS-88-154, Carnegie Mellon University, February 1988. [CDD+89] Robert Conrad, Richard Devlin, Daniel Dobberpuhl, Bruce Gieseke, Richard Heye, Gregory Hoeppner, John Kowaleski, Maureen Ladd, James Mon- tanaro, Steve Morris, Rebecca Stamm, Henry Tumblin, and Richard Witek. A 50 MIPS (Peak) 32b/64b Microprocessor. In Digest International Solid-State Circuits Conference (ISSCC), pages 76–77, February 1989. [CH84] Frederick Chow and John Hennessy. Register Allocation by Priority-based Coloring. In Proceedings of the ACM Symposium on Compiler Construction, pages 222–232, June 1984. [Cha82] G J. Chaitin. Register Allocation & Spilling via Graph Coloring. In Proceedings of the ACM Symposium on Compiler Construction, pages 98–105, June 1982. [CL82] Douglas W. Clark and Henry M. Levy. Measurement and Analysis of In- struction Use in the VAX-11/780. In Proceedings Ninth Symposium on Com- puter Architecture, pages 9–17, April 1982. [CL90] Fred C. Chow and Hennessy. John L. The Priority-Based Coloring Ap- proach to Register Allocation. ACM Transactions on Programming Languages and Systems, 12(4):501–536, October 1990. [CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Intro- duction to Algorithms. MIT Press, 1990. [Com98] Alpha Architecture Commitee. Alpha Architecture Reference Manual. Digital Press, third edition, 1998. [Cor99] Workstation Marketing Compaq Computer Corporation. Compaq Profes- sional Workstation XP1000 Key Technologies White Paper. Compaq Computer Corporation, January 1999. [Cor00] High Performance Technical Computing Group Compaq Computer Corpo- ration. AlphaServerSC: Scalable Supercomputing. Compaq Computer Corpora- tion, July 2000. [Cra96] Cray Research Inc. Cray T3E Programming with Coherent Memory Streams, 1996. [Cra97a] hwdefs t3e.h header file. Cray Research Inc., 1997. [Cra97b] Cray Research Inc. Cray T3E C and C++ Optimization Guide, 1997. [Cra97c] Cray Research Inc. System Architecture Kit (Cray T3E System), 1997. [Cra98a] mpphw t3e.h header file. Cray Research Inc., 1998. [Cra98b] intro shmem(3) man page. Cray Research Inc., 1998. [Cra98c] shmem.h header file. Cray Research Inc., 1998. [CS99] David E. Culler and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999. Literature 303

[CSS+91] David E. Culler, Anurag Sah, Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 164–175, April 1991. [Dal90] William J. Dally. Virtual-Channel Flow Control. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), pages 60–68, May 1990. [DFK+92] William J. Dally, J. A. Stuart Fiske, John S. Keen, Richard A. Lethin, Michael D. Noakes, Peter R. Nuth, Roy E. Davison, and Gergory A. Fyler. The Message-Driven Processor. IEEE Micro, 12(2):23–38, April 1992. [DH95] J. J. Dongarra and T. Hey. The ParkBench Benchmark Collection. Super- computer, 11(2–3):94–114, June 1995. [DHW+97] Jeffrey Dean, James E. Hicks, Carl A. Waldspurger, William E. Weihl, and George Chrysos. ProfileMe: Hardware Support for Instruction-Level Pro- filing on Out-of-Order Processors. In Proceedings of the 30th International Symposium on Microarchitecture (MICRO), pages 292–302, December 1997. [Die99] Keith Diefendorff. Compaq Chooses SMT for Alpha. Microprocessor Re- port, 13(16):1–11, December 1999. [DM91] Peter A. Darnell and Philip E. Margolis. C: A Software Engineering Ap- proach. Springer Verlag, 2nd. edition, 1991. [DO95] Mikhail N. Dorojevets and Vojin G. Oklobdzija. Multithreaded Decoupled Architecture. International Journal of High Speed Computing, 7(3):465–480, September 1995. [Don90] J. J. Dongarra. The LINPACK benchmark: an explanation. In Evaluating Supercomputers: Strategies for Exploiting, Evaluating and Benchmarking Com- puters with Advanced Architecture, Unicom Applied Information Technology Reports, pages 1–21. Chapmann & Hall, 1990. [DR97] Dale Dougherty and Arnold Robbins, editors. sed & awk. O’Reilly and Associates, Inc., 2nd. edition, 1997. [DS01] Jack J. Dongarra and Mathematical Sciences Section. Performance of Var- ious Computers Using Standard Linear Equations software. Technical Report CS-89-85, University of Tennessee, July 2001. [DW92] Mikhail N. Dorozhevets and Peter Wolcott. The El’brus-3 and MARS- M: Recent Advances in Russian High-Performance Computing. The Journal of Supercomputing, 6(1):5–48, March 1992. [DWA+92a] Daniel W. Dobberpuhl, Richard W. Witek, Randy Allmon, Robert An- glin, David Bertucci, Sharon Britton, Linda Chao, Robert A. Conrad, Daniel E. Dever, Bruce Gieseke, Soha M. N. Hassoun, Gregory W. Hoeppner, Kathryn Kuchler, Maureen Ladd, Burton M. Leary, Liam Madden, Edward J. McLellan, Derrick R. Meyer, James Montanaro, Donald A. Priore, Vidya Rajagopalam, Sridhar Samudrale, and Sribalan Santhanam. A 200-MHz 64-b Dual-Issue CMOS Microprocessor. IEEE Journal of Solid-State Circuits, 27(11):1555–1567, November 1992. [DWA+92b] Daniel W. Dobberpuhl, Richard W. Witek, Randy Allmon, Robert Anglin, David Bertucci, Sharon Britton, Linda Chao, Robert A. Conrad, Daniel E. Dever, Bruce Gieseke, Soha M. N. Hassoun, Gregory W. Hoeppner, Kathryn Kuchler, Maureen Ladd, Burton M. Leary, Liam Madden, Edward J. McLellan, Derrick R. Meyer, James Montanaro, Donald A. Priore, Vidya Ra- jagopalam, Sridhar Samudrale, and Sribalan Santhanam. A 200-MHz 64-bit Dual-Issue CMOS Microprocessor. Digital Technical Journal, 4(4):35–50, Fall 1992. 304 Literature

[DZU98a] Bernd Dreier, Markus Zahn, and Theo Ungerer. Parallel and Distributed Programming with Pthreads and Rthreads. In Proceedings of the 3rd Interna- tional Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 34–40, March 1998. [DZU98b] Bernd Dreier, Markus Zahn, and Theo Ungerer. The Rthreads Dis- tributed Shared Memory System. In Proceedings of the 3rd International Con- ference on Massively Parallel Computing Systems, pages –, April 1998. [EEL+97] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca Stamm, and Dean M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5):12–19, September 1997. [EGK+94] Kemal Ebciogli, Randy Groves, Ki-Chang Kim, Gabriel Silberman, and Isaac Ziv. VLIW Compilation Techniques in a Superscalar Environment. In Proceedings of the 1994 Symposium on Programming Languages Design and Implementation, pages 36–46, July 1994. [ERB+95] John H. Edmondson, Paul I. Rubinfeld, Peter J. Bannon, Bradley J. Benschneider, Debra Bernstein, Ruben W. Castelino, Elizabeth M. Cooper, Daniel E. Dever, Dale R. Donchin, Timothy C. Fischer, Anil K. Jain, Shekhar Mehta, Jeanne E. Meyer, Ronald P. Preston, Vidya Rajagopalan, Chan- drasekhara Somanathan, Scott A. Taylor, and Gilbert M. Wolrich. Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-Issue CMOS RISC Microprocessor. Digital Technical Journal, 7(1):119–135, Winter 1995. [ERPR95] John H. Edmondson, Paul Rubinfeld, Ronal Preston, and Vidya Ra- jagopalan. Superscalar Instruction Execution in the 21164 Alpha Microproces- sor. IEEE Micro, 15(2):33–43, April 1995. [EV97] Roger Espasa and Mateo Valero. Exploiting Instruction- and Data-Level Parallelism. IEEE Micro, 17(5):20–27, September 1997. [EZ93] Derek L. Eager and John Zahorjan. Chores: Enhanced Run-Time Sup- port for Shared Memory Parallel Computing. ACM Transactions on Computer Systems, 11(1):1–32, February 1993. [Feo88] John T. Feo. An analysis of the computational and parallel complexity of the Livermore Loops. Parallel Computing, 7(2):163–185, June 1988. [FF92] Joseph A. Fisher and Stefan M. Freudenberger. Predicting Conditional Branch Directions From Previous Runs of a Program. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 85–95, October 1992. [Fis81] Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, 30(7):478–490, July 1981. [FKD95] Marco Fillo, Stephen W. Keckler, and William J. Dally. The M-Machine Multicomputer. In Proceedings of the 28th International Symposium on Mi- croarchitecture (MICRO), pages 146–156, November 1995. [FLA94] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Dis- tributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Worksta- tions. In Proceedings of the 1st Symposium on Operating Systems Design and Implementations (OSDI), pages 201–213, November 1994. [Fly95] Michael J. Flynn. Computer Architecture : Pipelined and Parallel Processor Design. Jones and Bartlett Publishers, 1995. [FM92] Edward W. Felten and Dylan McNamee. Improving the Performance of Message-Passing Applications by Multithreading. In Proceedings of the Scalable High Performance Computing Conference, pages 84–89, April 1992. [Fre74] R. A. Freiburghouse. Register Allocation via Usage Counts. Communica- tions of the ACM, 17(11):638–642, November 1974. [GB96] Manu Gulati and Nader Bagherzadeh. 
Performance Study of a Multi- threaded Superscalar Microprocessor. In Proceedings of the 2nd International Literature 305

Symposium on High-Performance Computer Architecture (HPCA), pages 291– 301, February 1996. [GBD+94] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Mancheck, and Vaidy Sunderam. PVM: Parallel Virtual Machine. MIT Press, 1994. [GHG+91] Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA), pages 254–263, May 1991. [GHLL+98] William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Spahir, and Marc Snir. MPI - The Complete Reference, volume 2. MIT Press, 1998. [GJ79] Michael R. Garey and David S. Johnson, editors. Computers and In- tractability: A Guide to the Theory of NP-Completeness. W. H. Freemann and Co., 1979. [GN96] Dirk Grunwald and Richard Neves. Whole-Program Optimization for Time and Space Efficient Threads. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 50–59, October 1996. [Gru91] Dirk Grunwald. A Users Guide to AWESIME: An Object Oriented Par- allel Programming and Simulation System. Technical Report CU-CS-552-91, University of Colorado at Boulder, November 1991. [GSSD00] K. Gharachorloo, M. Sharma, S. Steely, and S. V. Doren. Architecture and Design of AlphaServer GS320. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages –, November 2000. [GU96] Winfried Gr¨unewald and Theo Ungerer. Towards Extremely Fast Con- text Switching in a Block-Multithreaded Processor. In Proceedings of the 22nd Euromicro Conference, pages 592–599, September 1996. [GU97] Winfried Gr¨unewald and Theo Ungerer. A Multithreaded Processor De- signed for distributed Shared Memory Systems. In Proceedings of the Interna- tional Conference on Advances in Parallel and Distributed Computing, pages 209–213, March 1997. [Gwe96] Linley Gwennap. Digital 21264 Sets New Standard. Microprocessor Re- port, 10(14):11–16, October 1996. [Gwe98] Linley Gwennap. Alpha 21364 to Ease Memory Bottleneck. Microproces- sor Report, 12(14):12–15, October 1998. [Han93] Jim Handy. The Cache Memory Handbook. Academic Press, 1993. [Han96] Craig Hansen. MicroUnity’s MediaProcessor Architecture. IEEE Micro, 16(4):34–41, August 1996. [HB93] Matthew Haines and Wim B¨ohm.An Evaluation of Software Mulitthread- ing in a Conventional Distributed Memory Multiprocessor. In Proceedings of the Symposium on Parallel and Distributed Systems, pages 106–113, December 1993. [HB94] Roger Hockney and Michael Berry. Public International Benchmarks for Parallel Computers. Scientific Programming, 3(2):101–146, Summer 1994. [HCC89] Wen-mei W. Hwu, Thomas M. Conte, and Pohua P. Chang. Comparing Software and Hardware Schemes for Reducing the Cost of Branches. In Proceed- ings of the 16th International Symposium on Computer Architecture (ISCA), pages 224–233, May 1989. [HCM94] Matthew Haines, David Cronk, and Piyush Mehrotra. On the Design of Chant: A Talking Threads Package. In Proceedings Supercomputing, pages 350–359, November 1994. 306 Literature

[Hec77] Matthew S. Hecht. Flow Analysis of Computer Programs. Elsevier Science Publishers B.V. (North-Holland), 1977. [Hey91] A. J. G. Hey. The Genesis Distributed-Memory Benchmarks. Parallel Computing, 17(10–11):1275–1283, December 1991. [HF88] Robert H. Halstead and Tetsuya Fujita. MASA: A Multithreaded Processor Architecture for Parallel Symbolic computing. In Proceedings of the 15th In- ternational Conference on Computer Architecture (ISCA), pages 443–451, May 1988. [HJ91] John L. Hennessy and Norman P. Jouppi. Computer Technology and Ar- chitecture: An Evolving Interaction. IEEE Computer, 24(9):18–29, September 1991. [HKN+92] Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki, Akio Nishimura, Yoshimori Nakase, and Teiji Nishizawa. An Elementary Proces- sor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th International symposium on Computer Architecture (ISCA), pages 136–145, May 1992. [HMB+93] Richard E. Hank, Scott A. Mahlke, Roger A. Bringmann, John C. Gyl- lenhaal, and Wen-mei W. Hwu. Superblock Formation Using Static Program Analysis. In Proceedings of the 26th International Symposium on Microarchi- tecture, pages 247–255, December 1993. [HMC+93] Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Oulette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lavery. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. Journal of Supercomputing, 7(1–2):229–248, May 1993. [HmMS98] Mark Horowitz, Margaret martonosi, Todd C. Mowry, and Michael D. Smith. Informing Memory Operations: Memory Performance Feedback Mech- anisms and Their Applications. ACM Transactions on Computer Systems, 16(2):170–205, May 1998. [Hoc93] Roger W. Hockney. Performance parameters and benchmarking of super- computers. In Computer Benchmarks, Advances in Parallel Computing, pages 41–64. Elsevier Science Publishers, 1993. [Hol78] W. R. Holland. The role of mesoscale eddies in the general circulation of the ocean. Journal of Physical Oceanography, 8(3):363–392, May 1978. [HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd. edition, 1996. [HU75] Matthew S. Hecht and Jeffrey D. Ullman. A Simple Algorithm for Global Data Flow Problems. SIAM Journal of Computing, 4(4):519–532, December 1975. [IEE85] IEEE Standard for Binary Floating-Point Arithmetic, 1985. [Ita01] Intel Corp. Intel Itanium Architecture Software Developers Manual, 2001. [Jou90] Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceed- ings of the 17th International Symposium on Computer Architecture (ISCA), pages 364–373, May 1990. [Kae00] Daniel Kaestner. PROPAN: A Retargetable System for Postpass Opti- misations and Analyses. In Proceedings of the ACM Workshop on Languages, Compilers and Tools for Embedded Systems, pages –, June 2000. [Kan95] Gerry Kane. PA-RISC 2.0 Architecture. Prentice-Hall, 1995. [KCA91] Kiyoshi Kurihara, David Chaiken, and Anant Agarwal. Latency Toler- ance through Multithreading in Large-Scale Multiprocessors. In Proceedings of International Symposium on Shared Memory Multiprocessing, pages 341–361, April 1991. Literature 307

[KCO+94] Ravi Konuru, Jeremy Casas, Steve Otto, Robert Prouty, and Jonathan Walpole. A User-Level Process Package for PVM. In Proceedings of the Scalable High Performance Computer Conference, pages 48–55, May 1994. [KD92] Stephen W. Keckler and William J. Dally. Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism. In Proceedings of the 19th International Conference on Computer Architecture (ISCA), pages 202– 213, May 1992. [KDM+98] Stephen W. Keckler, William J. Dally, Daniel Maskit, Nicholas P. Carter, Andrew Chang, and Whay S. Lee. Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor. In Proceedings of the 25th In- ternational Conference on Computer Architecture (ISCA), pages 306–317, June 1998. [KE91] D. R. Kaeli and P. G. Emma. Branch History Table Prediction of Mov- ing Target Branches Due to Subroutine Returns. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA), pages 34–42, May 1991. [Kes99] R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2):24–36, March 1999. [KFW94] R. Kent Koeninger, Mark Furtney, and Martin Walker. A Shared Mem- ory MPP from Cray Research. Digital Technical Journal, 6(2):8–21, Spring 1994. [KH92] Gerry Kane and Joe Heinrich. MIPS RISC architecture. Prentice-Hall, 1992. [Kil73] Gary A. Kildall. A Unified Approach to Global Program Optmization. In Proceedings of the ACM Symposium on Principles of Programming Languages (POPL), pages 194–206, October 1973. [KMW98] R. E. Kessler, E. J. McLellan, and D. A. Webb. The Alpha 21264 Micro- processor Architecture. In Proceedings International Conference on Computer Design (ICCD), pages 90–95, October 1998. [KN95] Michael Kantrowitz and Lisa M. Noack. Functional Verification of a Multiple-issue, Pipelined, Superscalar Alpha Processor - the Alpha 21164 CPU chip. Digital Technical Journal, 7(1):136–144, Winter 1995. [KS88] James T. Kuehn and Burton J. Smith. The Horizon Supercomputing Sys- tem: Architecture and Software. In Proceedings Supercomputing, pages 28–34, November 1988. [KU75] J. B. Kam and Jeffrey D. Ullmann. Monotone Data Flow Analysis Frame- works. Technical Report No. 169, EE Department, Princeton University, 1975. [Lam99] Monica S. Lam. An Overview of the SUIF2 System. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), pages –, May 1999. [LB96] Mat Loikkanen and Nader Bagherzadeh. A Fine-Grain Multithreading Su- perscalar Architecture. In Proceedings of the Conference on Parallel Architec- tures and Compilation Techniques (PACT), pages 163–168, October 1996. [LEL+99] Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen. Tuning Compiler Optimizations for Simultaneous Mul- tithreading. International Journal of Parallel Programming, 27(6):477–503, November 1999. [Lev83] Bruce W. Leverett. Register Allocation in Optimizing Compilers. UMI Research Press, 1983. [LFA96] David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Using Fine-Grain Threads and Run-Time Decision Making in Parallel Computing. Journal of Parallel and Distributed Computing, 37(1):41–54, August 1996. 308 Literature

[LGH94] James Laudon, Anoop Gupta, and Mark Horowitz. Interleaving: A Multi- threading Technique Targeting Multiprocessors and Workstations. In Proceed- ings of the ACM Conference on Architectural Support for Programming Lan- guages and Operating Systems (ASPLOS), pages 308–318, October 1994. [LH86] Tomas Lang and Miquel Huguet. Reduced Register Saving/Restoring in Single-Window Register Files. Computer Architecture News, 14(3):17–26, June 1986. [LM96] Chi-Keung Luk and Todd C. Mowry. Compiler-Based Prefetching for Re- cursive Data Structures. In Proceedings of the 7th International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 222–233, October 1996. [LMB92] Hohn R. Levine, Tony Mason, and Doug Brown. lex & yacc. O’Reilly Associates, Inc., 2nd. edition, 1992. [LO96] Mike Loukides and Andy Oram. Programming with GNU Software. O’Reilly Associates, Inc., 1996. [LR97] Daniel Leibholz and Rahul Razdan. The Alpha 21264: A 500 MHz Out-of- Order Execution Microprocessor. In Proceedings Computer Conference (COM- PCON), pages 28–36, February 1997. [LS84] J. K. L. Lee and A. J. Smith. Branch Prediction Strategies and Branch Target Buffer Design. IEEE Computer, 17(1):6–22, January 1984. [LS95] James R. Larus and Eric Schnarr. EEL: Machine-Independent Executable Editing. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), pages 291–300, June 1995. [LT79] Thomas Lengauer and Robert Endre Tarjan. A Fast Algorithm for Finding Dominators in a Flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121–141, July 1979. [LW00] K. S. Loh and W. F. Wong. Multiple context multithreaded architecture. Journal of Systems Architecture, 46(3):243–258, January 2000. [MBB+98] M. Matson, D. Bailey, S. Bell, L. Biro, S. Butler, J. Clouser, J. Farrell, M. Gowan, D. Priore, and K. Wilcox. Circuit Implementation of a 600 MHz Superscalar RISC Microprocessor. In Proceedings International Conference on Computer Design (ICCD), pages 90–95, October 1998. [MBC+94] Dina L. McKinney, Masooma Bhaiwala, Kwong-Tak A. Chui, Christo- pher L. Houghton, James R. Mullens, Daniel L. Leibholz, Sanjay J. Patel, Del- van A. Ramey, and Rosenbluth Mark B. Digital’s DECchip 21066: The First Cost-focused Alpha AXP Chip. Digital Technical Journal, 6(1):66–77, Winter 1994. [McC95] John D. McCalpin. Sustainable Memory Bandwidth in Current High Performance Computers. http://home.austin.rr.com/mccalpin/papers/bandwidth/bandwidth.html, 1995. [McF93] Scott McFarling. Combining Branch Predictors. Technical Report WRL TN-36, Digital Western Research Laboratory (WRL), June 1993. [MCL98] Todd C. Mowry, Charles Q. C. Chan, and Adley K. W. Lo. Compar- ative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory. In Proceedings of the 4th International Symposium on High- Performance Computer Architecture (HPCA), pages 300–311, January 1998. [McM88] Frank H. McMahon. The Livermore Fortran Kernels Test of the Numer- ical Performance Range. In Performance Evaluation of Supercomputers, pages 143–186. Elsevier Science Publishers B.V. (North-Holland), 1988. [MD96] A. Mikschl and W. Damm. MSparc: A Multithreaded Sparc. In Proceedings of the 2nd International Euro-Par Conference, pages 461–469, August 1996. Literature 309

[ME97] Soo-Mook Moon and Kemal Ebcioˇglu. Parallelizing Nonnumerical Code with Selective Scheduling and Software Pipelining. ACM Transactions on Pro- gramming Languages and Systems, 19(6):853–898, November 1997. [MH86] Scott McFarling and John Hennessy. Reducing the Cost of Branches. In Proceedings of the 13th International Symposium on Computer Architecture (ISCA), pages 396–403, May 1986. [MHL91] Dror E. Maydan, John L. Hennessy, and Monica S. Lam. An Efficient Method for Exact Dependence Analysis. In Proceedings of the 1991 Symposium on Programming Languages Design and Implementation, pages 1–14, July 1991. [MJA+01] Robert O. Mueller, A. Jain, W. Anderson, T. Benninghoff, D. Bertucci, J. Burnette, T. Chang, J. Eble, R. Faber, D. Gowda, J. Grodstein, G. Hess, J. Kowaleski, A. Kumar, B. Miller, P. Paul, J. Pickholtz, S. Russell, M. Shen, T. Truex, A. Varadharajan, D. Xanthopoulos, and T. Zou. A 1.2 GHz Alpha Microprocessor. In Proceedings International Solid State Circuits Conference (ISSCC), pages –, February 2001. [MLC+92] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In Proceedings of the 25th International Symposium on Microarchitecture, pages 45–54, December 1992. [MNL96] Gail C. Murphy, David Notkin, and Erica S.-C. Lan. An Empirical Study of Static Call Graph Extractors. In Proceedings of the 18th International Con- ference on Software Engineering (ICSE), pages 90–99, March 1996. [MR96] Michael Metcalf and John Reid. Fortran 90/95 explained. Oxford Univer- sity Press, 3rd. edition, 1996. [MR99] Todd C. Mowry and Sherwyn R. Ramkissoon. Software-Controlled Multi- threading Using Informing Memory Operations. In Proceedings of the 6th In- ternational Symposium on High-Performance Computer Architecture (HPCA), pages 121–132, January 1999. [MSLM91] Brian D. Marsh, Michael L. Scott, Thomas J. LeBlanc, and Evangelos P. Markatos. First-Class User-Level Threads. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 110–121, October 1991. [MSSW94] Cathy May, Ed Silha, Rick Simpson, and Hank Warren, editors. The PowerPC architecture: A Specification for a new Family of RISC Processors. Morgan Kaufmann Publishers, 2nd. edition, 1994. [Muc97] Steven S. Muchnick. Advanced Compiler Design & Integration. Morgan Kaufmann Publishers, 1997. [Mue93] Frank Mueller. A Library Implementation of POSIX Threads under UNIX. In Proceedings of the Winter 1993 USENIX Technical Conference, pages 29–41, January 1993. [NL92] Jason Nieh and Marc Levoy. Volume Rendering on Scalable Shared Mem- ory MIMD Architectures. In Proceedings of the Boston Workshop on Volume Visualization, pages 17–24, October 1992. [NS97] Richard Neves and Robert B. Schnabel. Threaded Runtime Support for Ex- ecution of Fine Grain Parallel Code on Coarse Grain Multiprocessors. Journal of Parallel and Distributed Computing, 42(2):128–142, May 1997. [Oed96] Wilfried Oed. Cray Research Massiv-paralleles Prozessorsystem Cray T3E. Cray Research GmbH, November 1996. [PD80] David A. Patterson and David R. Ditzel. The Case for the Reduced In- struction Set Computer. Computer Architecture News, 9(6):25–38, September 1980. [PE96] James Philbin and Jan Edler. Very Lightweight Threads. In Proceedings of the 1st International Workshop on High-Level Programming Models and Sup- portive Enviroments, pages 95–104, April 1996. 310 Literature

[PEA+96] James Philbin, Jan Edler, Otto J. Anshus, Doug C. Douglas, and Kai Li. Thread Scheduling for Cache Locality. In Proceedings of the 7th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 60–71, October 1996. [PG99] Joan-Manuel Parcerisa and Antonio Gonzalez. The Synergy of Multithread- ing and Access/Execute Decoupling. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (HPCA), pages 59– 63, January 1999. [PH98] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, 1998. [PSR92] Shien-Tai Pan, Kimming So, and Joseph T. Rahmeh. Improving the Ac- curacy of Dynamic Branch Prediction Using Branch Correlation. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 76–84, October 1992. [PW91] R. Guru Prasadh and Chuan-lin Wu. A Benchmark Evaluation of a Multi- Threaded RISC Processor Architecture. In Proceedings of the International Conference on Parallel Processing, pages 84–91, August 1991. [RG94] Edward Rothberg and Anoop Gupta. An Efficient Block-Oriented Ap- proach to Parallel Sparse Cholesky Factorization. SIAM Journal of Scientific Computing, 15(6):1413–1439, November 1994. [RGSL96] John Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Soft- ware Pipelining Showdown: Optimal vs. Heuristic Methods in a Production Compiler. In Proceedings of the 1996 Symposium on Programming Languages Design and Implementation, pages 1–11, July 1996. [Roc00] Daniel N. Rockmore. The FFT: An Algorithm The Whole Family Can Use. IEEE Computing in Science & Engineering, 2(1):60–64, January 2000. [SAB+98] S. Storino, A. Aippersbach, J. Borkenhagen, R. Eickemeyer, S. Kunkel, S. Levenstein, and G. Uhlmann. A Commercial Multi-threaded RISC Processor. In Proceedings of the International Sold-State Circuits Conference (ISSCC), pages 234–235, February 1998. [SBCvE90] Rafael H. Saavedra-Barrera, David E. Culler, and Thorsten von Eicken. Analysis of Multithreaded Architectures for Parallel Computing. In Proceed- ings 2nd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 169–178, July 1990. [Sco96] Steven L. Scott. Synchronization and Communication in the T3E Multipro- cessor. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 26–36, October 1996. [SE94] Amitabh Srivastava and Alan Eustace. ATOM: A System for Building Customized Program Analysis Tools. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), pages 196–205, June 1994. [Sei00] Rene Seindal. GNU m4 - A Powerful Macro Processor . http://www.seindal.dk/rene/gnu/man/m4 toc.html, 2000. [SG98] Avi Silberschatz and Peter Galvin. Operating System Concepts. Addison- Wesley Longman, Inc., 5th edition, 1998. [SGL92] Jaswinder Pal Singh, Anoop Gupta, and Marc Levoy. SPLASH: Stan- ford Parallel Applications for Shared Memory. Computer Architecture News, 20(1):5–44, March 1992. [SH92] Jaswinder Pal Singh and John L. Hennessy. Finding and Exploiting Paral- lelism in an Ocean Simulation Program: Experience, Results, and Implications. Journal of Parallel and Distributed Computing, 15(1):27–48, May 1992. Literature 311

[SHHG93] Jaswinder Pal Singh, Chris Holt, John L. Hennessy, and Anoop Gupta. A Parallel Adaptive Fast Multipole Method. In Proceedings Supercomputing 93, pages 54–65, November 1993. [SHT+95] Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Annop Gupta, and John L. Hennessy. Load Balancing and Data Locality in Adaptive Hierarchi- cal n-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity. Journal of Parallel and Distributed Computing, 27(2):118–141, June 1995. [Sim00] Dezs¨oSima. The Design Space of Register Renaming Techniques. IEEE Micro, 20(5):70–83, September 2000. [Sin93] Jaswinder Pal Singh. Parallel Hierarchical N-Body Methods and their im- plications for multiprocessors. PhD thesis, Stanford University, February 1993. [Sit92] Richard L. Sites. Alpha AXP Architecture. Digital Technical Journal, 4(4):19–34, Fall 1992. [SJH89] Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on Mul- tiple Instruction Issue. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (AS- PLOS), pages 290–302, April 1989. [SKMR92] Anton Sites, Richard L.and Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. Binary translation. Digital Technical Journal, 4(4):137–152, Fall 1992. [Smi78] Burton J. Smith. A Pipelined, Shared Resource MIMD Computer. In Proceedings of the Conference on Parallel Processing, pages 6–8, August 1978. [Smi81a] Burton J. Smith. Architecture and applications of the HEP multiproces- sor computer system. In Proceedings SPIE Real Time Signal Processing IV, pages 241–247, August 1981. [Smi81b] J. E. Smith. A Study of Branch Prediction Strategies. In Proceedings of the 8th International Symposium on Computer Architecture (ISCA), pages 135–148, May 1981. [Smi82] James E. Smith. Decoupled Access/Execute Computer Architectures. In Proceedings of 9th International Symposium on Computer Architecture (ISCA), pages 112–119, April 1982. [SOHL+98] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI - The Complete Reference, volume 1. MIT Press, 1998. [ST96] Steven L. Scott and Gregory M. Thorson. The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus. In Proceedings HOT Interconnects IV, pages –, August 1996. [Sto93] Harold S. Stone. High-performance computer architecture. Addison-Wesley, 1993. [Str78] W. D. Strecker. VAX-11/780 - A virtual address extension to the DEC PDP-11 family. In Proceedings AFIPS National Computer Conference (NCC), pages 967–980, July 1978. [Str97] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd. edition, 1997. [SW93] Amitabh Srivastava and David W. Wall. A practical system for intermodule code optimization at link-time. Journal of Programming Languages, 1(1):1–18, March 1993. [TC88] Robert H. Thomas and Will Crowther. The Uniform System: An approach to runtime support for large scale shared memory parallel processors. In Pro- ceedings of the International Conference on Parallel Processing, pages 245–254, August 1988. [TEE+95] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of 312 Literature

the 22nd International symposium on Computer Architecture (ISCA), pages 191–202, June 1995. [TEL95] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Mul- tithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd In- ternational symposium on Computer Architecture (ISCA), pages 392–403, June 1995. [Tho70] J. E. Thornton. Design of a Computer: The Control Data 6600. Scott, Foresman and Co., 1970. [Tom67] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25–33, January 1967. [Tru96] Digital Equipment Corporation. Digital Unix Calling Standard for Alpha Systems, 1996. [Tru01] Compaq Computer Corporation. Tru64 UNIX Operating System Version 5.1A QuickSpecs, 2001. [TS88] Mark R. Thistle and Burton J. Smith. A Processor Architecture for Horizon. In Proceedings Supercomputing, pages 35–41, November 1988. [VA00] Vladimir Vlassov and Rassul Ayani. Analytical modeling of multithreaded architectures. Journal of Systems Architecture, 46(13):1205–1230, November 2000. [vdSdR93] A. J. van der Steen and P. P. M. de Rijk. Guidelines for the use of the EuroBen Benchmark. Technical Report TR3, The EuroBen Group, February 1993. [vRU99] Jurij Silc,ˇ Borut Robiˇc, and Theo Ungerer. Processor Architecture. Springer Verlag, 1999. [Wal86] David W. Wall. Global Register Allocation at Link Time. In Proceedings of the ACM Symposium on Compiler Construction, pages 264–275, June 1986. [Wal92] David W. Wall. Limits of Instruction-Level Parallelism. In Proceedings 4th International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 176–188, April 1992. [Wal94] David W. Walker. The design of a standard message passing interface for distributive memory concurrent computers. Parallel Computing, 20(4):657–673, March 1994. [WDH89] Mark Weiser, Alan Demers, and Carl Hauser. The Portable Common Runtime Approach to Interoperability. In Proceedings of the 12th ACM Sym- posium on Operating Systems Principles, pages 114–122, December 1989. [Wei80] William E. Weihl. Interprocedural Data Flow Analysis in the Presence of Pointers, Procedure Variables and Label Variables. In Proceedings of the 7th ACM Symposium on Principles of Programming Languages (POPL), pages 83–94, January 1980. [WFW+94] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau- Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31–37, December 1994. [WG93] David L. Weaver and Tom Germond. The Sparc Architecture Manual Version 9. Prentice-Hall, 1993. [Wie82] Cheryl A. Wiecek. A Case Study of VAX-11 Instruction Set Usage For Compiler Execution. In Proceedings Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 177–184, March 1982. [WOT+95] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Literature 313

Methodological Consideration. In Proceedings of the 22nd International Sym- posium on Computer Architecture, pages 24–36, May 1995. [WSH94] Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantage of Integrated Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the 6th International Conference on Architec- tural Support for Programming Languages and Operating Systems (ASPLOS), pages 219–229, October 1994. [WW93] Carl A. Waldspurger and William E. Weihl. Register Relocation: Flexible Contexts for Multithreading. In Proceedings of the 20th International sympo- sium on Computer Architecture (ISCA), pages 120–130, June 1993. [YP92] Tsu-Yu Yeh and Yale N. Patt. Alternative Implementations of Two-Level Adaptive Branch Prediction. In Proceedings of the 19th International Sympo- sium on Computer Architecture (ISCA), pages 124–134, May 1992. [YP93] Tsu-Yu Yeh and Yale N. Patt. A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In Proceedings of the 20th International Symposium on Computer Architecture (ISCA), pages 257–266, May 1993.