3.15 Speedup and Stall Rate for Livermore Kernels 1 and 7
UvA-DARE (Digital Academic Repository)

On the compilation of a parallel language targeting the self-adaptive virtual processor
Bernard, T.A.M.

Publication date: 2011
Document version: Final published version

Citation for published version (APA):
Bernard, T. A. M. (2011). On the compilation of a parallel language targeting the self-adaptive virtual processor. Print partners Ipskamp.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

INVITATION
to attend the public defense of my thesis, On the Compilation of a Parallel Language targeting the Self-Adaptive Virtual Processor, on Friday, March 11th 2011 at 12:00 PM in the Agnietenkapel, Oudezijds Voorburgwal 231, Amsterdam. Reception afterwards.
Thomas A.M. Bernard

On the Compilation of a Parallel Language targeting the Self-Adaptive Virtual Processor
Thomas A.M. Bernard

ACADEMIC DISSERTATION
to obtain the degree of doctor at the University of Amsterdam, on the authority of the Rector Magnificus, prof. dr. D.C. van den Boom, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel on Friday, 11 March 2011, at 12:00 by Thomas Antoine Marie Bernard, born in Meaux, France.

Doctorate committee:
Supervisor: prof. dr. C.R. Jesshope
Other members: prof. dr. P. Klint, prof. dr. G.J.M. Smit, dr. M. Beemster, dr. C.U. Grelck
Faculty: Faculteit der Natuurwetenschappen, Wiskunde en Informatica (Faculty of Natural Sciences, Mathematics and Computer Science)

The work described in this thesis was carried out in the section of Computer Systems Architecture of the University of Amsterdam, with the financial support of:
• the University of Amsterdam,
• the NWO Microgrids project,
• the European FP-7 Apple-CORE project,
• the Advanced School for Computing and Imaging (ASCI).

© Copyright 2010 by Thomas A.M. Bernard
ISBN 978-90-9026006-8
ASCI dissertation series number 211.
Author contact: [email protected] or [email protected]
Print partners Ipskamp, Enschede

Our greatest glory is not in never falling, but in rising every time we fall.
Confucius

Contents

1 Introduction
  1.1 Classical microprocessor improvements
  1.2 Multicore architectures
  1.3 Exploiting concurrency as a solution
  1.4 Impact of concurrency on software systems
  1.5 Contribution of this thesis
  1.6 Overview of this thesis

I Foundations

2 Background in parallel computing systems
  2.1 Approaches in concurrent execution models
  2.2 Relevant parallel architectures
  2.3 Modeling concurrency in compilers
  2.4 Requirements for a concurrent execution model
3 SVP Execution Model and its Implementations
  3.1 Our approach to multicore programming
  3.2 Presentation of the SVP execution model
  3.3 Hardware implementation: Microgrid
  3.4 Software implementation: µTC language
  3.5 SVP system performance
  3.6 Discussion and conclusion

II Compilation for Parallel Computing Systems

4 From basics to advanced SVP compilation
  4.1 Basics in compiler transformations
  4.2 SVP compilation schemes
  4.3 Under the hood of SVP compilation
  4.4 Conclusion
5 On the challenges of optimizations
  5.1 Hazards with optimizations
  5.2 Investigating some optimizations
  5.3 Discussion and conclusion
6 Implementing the SVP compiler
  6.1 Role of the compiler
  6.2 Compiler design decisions
  6.3 Compilation challenges
  6.4 Discussion and conclusion
7 SVP evaluation
  7.1 Evaluation of SVP compilation
  7.2 Evaluation of SVP computing system
  7.3 Discussion and conclusion

III Discussion and conclusion

8 Discussion and conclusion
  8.1 Thesis overview
  8.2 Limitations
  8.3 Future work
  8.4 Conclusions

A µTC language syntax summary
Summary
Samenvatting
Acknowledgements
Publications
Bibliography
Index

List of Figures

1.1 Partitioning a sequential cooking recipe into tasks
1.2 Communication between concurrent tasks of a cooking recipe
1.3 Synchronization between concurrent tasks of a cooking recipe
1.4 Management of concurrent tasks of a cooking recipe
1.5 Overview of a standard software system
1.6 The bridge between Software world and Hardware world
2.1 Computing system domains
3.1 SVP parallel computing system overview
3.2 Illustration of an SVP family creation
3.3 An SVP family
3.4 SVP inter-thread communication with a global channel
3.5 SVP inter-thread communication with a shared channel
3.6 SVP inter-thread communication channels
3.7 Illustration of an SVP concurrency tree
3.8 Different states of an SVP thread
3.9 SVP register window layout
3.10 Mapping of hardware registers to architectural registers
3.11 µTC example of a simplified reduction
3.12 Thread function definition
3.13 Gray area between create and sync
3.14 Functional diagram of a 16-core Microgrid
3.15 Speedup and stall rate for Livermore kernels 1 and 7
3.16 Speedup of sine function
3.17 Speedup of Livermore kernel 3
3.18 Performance of FFT
4.1 Basic compilation scheme T
4.2 Simplified compilation scheme T for a thread function
4.3 Compilation scheme T for a thread function
4.4 Compilation scheme for µTC create action
4.5 Compilation scheme for µTC break action
4.6 Compilation scheme T involving a C function call
4.7 Call gate inserted instead of function call
4.8 The compilation process as a black box
4.9 A simple compiler
4.10 A classic optimizing three-stage compiler
4.11 A modern optimizing three-stage compiler design
4.12 A work-flow representation of a compilation process
4.13 Composition of a µTC program with concurrent regions
4.14 Creation graph example of a program
4.15 The relationship of a single concurrent region
4.16 Control flows of sequential and concurrent paradigms
4.17 CFG representation of an SVP create block
4.18 DFG of an SVP shared synchronized communication channel
5.1 Example of optimization side-effects on SVP code
5.2 Example of optimization side-effects on communication channels
5.3 SSA transformation
5.4 Example of unreachable code
5.5 Example of valid code removal
5.6 CFG representation of thread function “foo”
5.7 Code example with CSE
5.8 Code example with PRE
5.9 Example of combining instruction
5.10 Example of copy propagation
5.11 Instruction reordering example
5.12 Instruction reordering example with create sequence
5.13 Dependency chain between operations
6.1 Compiler composition of GCC 4.1 Core Release
6.2 Location of changes in GCC-UTC
6.3 Shared object used as a token to enforce sequential constraints
7.1 Instruction mix of Livermore kernels in µTC
7.2 Comparison of code size between unoptimized and optimized code
7.3 Comparison of instruction size between hand-coded and compiled code
7.4 Comparison of execution cycles between hand-coded and compiled Livermore kernels
7.5 Functional diagram of a 64-core Microgrid
7.6 BLAS DNRM2 in µTC
7.7 Performance of DNRM2 on one SVP place
7.8 N/P parallel reduction for the inner product
7.9 IP performance, using N/P reduction
7.10 Performance of the ESF
7.11 Performance of the matrix-matrix product
7.12 Computation kernel for the 1-D FFT
7.13 Performance of the 1-D FFT

List of Tables

3.1 List of SVP instructions which can be added to an existing ISA
3.2 List of SVP register classes
3.3 List of µTC constructs
3.4 List of µTC types
3.5 Create parameters which set up the family definition

Chapter 1
Introduction

I think there is a world market for maybe five computers
Thomas J.