Generating and Auto-Tuning Parallel Stencil Codes

Inaugural dissertation

submitted to the Philosophisch-Naturwissenschaftliche Fakultät (Faculty of Science) of the University of Basel in fulfillment of the requirements for the degree of Doctor of Philosophy

by

Matthias-Michael Christen from Affoltern BE, Switzerland

Basel, 2011

Original document stored on the document server of the University of Basel: edoc.unibas.ch.

This work is licensed under the "Creative Commons Attribution – NonCommercial – NoDerivs 2.5 Switzerland" license. The complete license can be viewed at http://creativecommons.org/licences/by-nc-nd/2.5/ch.

Attribution – NonCommercial – NoDerivs 2.5 Switzerland

You are free:

to Share — to copy, distribute and transmit the work

Under the following conditions:

Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

Noncommercial — You may not use this work for commercial purposes.

No Derivative Works — You may not alter, transform, or build upon this work.

With the understanding that:

Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.

Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.

Other Rights — In no way are any of the following rights affected by the license:
• Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
• The author's moral rights;
• Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the web page http://creativecommons.org/licenses/by-nc-nd/2.5/ch.

Disclaimer — The Commons Deed is not a license. It is simply a handy reference for understanding the Legal Code (the full license) – it is a human-readable expression of some of its key terms. Think of it as the user-friendly interface to the Legal Code beneath. This Deed itself has no legal value, and its contents do not appear in the actual license. Creative Commons is not a law firm and does not provide legal services. Distributing of, displaying of, or linking to this Commons Deed does not create an attorney-client relationship.

Approved by the Philosophisch-Naturwissenschaftliche Fakultät (Faculty of Science) at the request of

Prof. Dr. Helmar Burkhart
Prof. Dr. Rudolf Eigenmann

Basel, 20 September 2011

Prof. Dr. Martin Spiess, Dean

Abstract

In this thesis, we present a software framework, PATUS, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving high performance on the target platform.

A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.

The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.

The PATUS stencil specification DSL allows the programmer to express a stencil computation in a concise way, independently of hardware architecture-specific details. Thus, it increases the programmer's productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.

Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning — which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration — the system can also be used more productively than if the programmer had to fine-tune the code manually.

We show performance results for a variety of stencils, for which PATUS was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.

Contents

Contents i

1 Introduction 3

I High-Performance Computing Challenges 7

2 Hardware Challenges 9

3 Software Challenges 17
3.1 The Laws of Amdahl and Gustafson ...... 18
3.2 Current De-Facto Standards ...... 24
3.3 Beyond MPI and OpenMP ...... 28
3.4 Optimizing Compilers ...... 32
3.5 Domain Specific Languages ...... 42
3.6 Motifs ...... 45

4 Algorithmic Challenges 53

II The PATUS Approach 59

5 Introduction To PATUS 61
5.1 Stencils and the Structured Grid Motif ...... 63
5.1.1 Stencil Structure Examples ...... 64
5.1.2 Stencil Sweeps ...... 68
5.1.3 Boundary Conditions ...... 69
5.1.4 Stencil Code Examples ...... 70
5.1.5 Arithmetic Intensity ...... 71

5.2 A PATUS Walkthrough Example ...... 74
5.2.1 From a Model to a Stencil ...... 75
5.2.2 Generating The Code ...... 76
5.2.3 Running and Tuning ...... 78
5.3 Integrating into User Code ...... 81
5.4 Alternate Entry Points to PATUS ...... 83
5.5 Current Limitations ...... 83
5.6 Related Work ...... 84

6 Saving Bandwidth And Synchronization 89
6.1 Spatial Blocking ...... 89
6.2 Temporal Blocking ...... 91
6.2.1 Time Skewing ...... 92
6.2.2 Circular Queue Time Blocking ...... 97
6.2.3 Wave Front Time Blocking ...... 99
6.3 Cache-Oblivious Blocking Algorithms ...... 101
6.3.1 Cutting Trapezoids ...... 101
6.3.2 Cache-Oblivious Parallelograms ...... 102
6.4 Hardware-Aware Programming ...... 105
6.4.1 Overlapping Computation and Communication ...... 105
6.4.2 NUMA-Awareness and Thread Affinity ...... 106
6.4.3 Bypassing the Cache ...... 108
6.4.4 Software Prefetching ...... 109

7 Stencils, Strategies, and Architectures 111
7.1 More Details on PATUS Stencil Specifications ...... 111
7.2 Strategies and Hardware Architectures ...... 115
7.2.1 A Cache Blocking Strategy ...... 115
7.2.2 Independence of the Stencil ...... 117
7.2.3 Circular Queue Time Blocking ...... 119
7.2.4 Independence of the Hardware Architecture ...... 122
7.2.5 Examples of Generated Code ...... 124

8 Auto-Tuning 127
8.1 Why Auto-Tuning? ...... 127
8.2 Search Methods ...... 131
8.2.1 Exhaustive Search ...... 131
8.2.2 A Greedy Heuristic ...... 132
8.2.3 General Combined Elimination ...... 132

8.2.4 The Hooke-Jeeves Algorithm ...... 133
8.2.5 Powell's Method ...... 135
8.2.6 The Nelder-Mead Method ...... 136
8.2.7 The DIRECT Method ...... 137
8.2.8 Genetic Algorithms ...... 137
8.3 Search Method Evaluation ...... 138

III Applications & Results 147

9 Experimental Testbeds 149
9.1 AMD Opteron Magny Cours ...... 151
9.2 Nehalem ...... 152
9.3 NVIDIA GPUs ...... 153

10 Performance Benchmark Experiments 157
10.1 Performance Benchmarks ...... 157
10.1.1 AMD Opteron Magny Cours ...... 158
10.1.2 Intel Nehalem Beckton ...... 164
10.1.3 NVIDIA Fermi GPU (Tesla C2050) ...... 165
10.2 Impact of Internal Optimizations ...... 168
10.2.1 Loop Unrolling ...... 168
10.3 Impact of Foreign Configurations ...... 170
10.3.1 Problem Size Dependence ...... 170
10.3.2 Dependence on Number of Threads ...... 171
10.3.3 Hardware Architecture Dependence ...... 172

11 Applications 175
11.1 Hyperthermia Cancer Treatment Planning ...... 175
11.1.1 Benchmark Results ...... 178
11.2 Anelastic Wave Propagation ...... 180
11.2.1 Benchmark Results ...... 182

IV Implementation Aspects 187

12 PATUS Architecture Overview 189
12.1 Parsing and Internal Representation ...... 191
12.1.1 Data Structures: The Stencil Representation ...... 191
12.1.2 Strategies ...... 194

12.2 The Code Generator ...... 196
12.3 Code Generation Back-Ends ...... 199
12.4 Benchmarking Harness ...... 202
12.5 The Auto-Tuner ...... 205

13 Generating Code: Instantiating Strategies 207
13.1 Grids and Iterators ...... 207
13.2 Index Calculations ...... 211

14 Internal Code Optimizations 217
14.1 Loop Unrolling ...... 218
14.2 Dealing With Multiple Code Variants ...... 221
14.3 Vectorization ...... 222
14.4 NUMA-Awareness ...... 228

V Conclusions & Outlook 229

15 Conclusion and Outlook 231

Bibliography 239

Appendices 257

A PATUS Usage 259
A.1 Code Generation ...... 259
A.2 Auto-Tuning ...... 261

B PATUS Grammars 265
B.1 Stencil DSL Grammar ...... 265
B.2 Strategy DSL Grammar ...... 266

C Stencil Specifications 269
C.1 Basic Differential Operators ...... 269
C.1.1 Laplacian ...... 269
C.1.2 Divergence ...... 269
C.1.3 Gradient ...... 270
C.2 Wave Equation ...... 270
C.3 COSMO ...... 271
C.3.1 Upstream ...... 271
C.3.2 Tricubic Interpolation ...... 271

C.4 Hyperthermia ...... 272
C.5 Image Processing ...... 273
C.5.1 Blur Kernel ...... 273
C.5.2 Edge Detection ...... 273
C.6 Cellular Automata ...... 274
C.6.1 Conway's Game of Life ...... 274
C.7 Anelastic Wave Propagation ...... 274
C.7.1 uxx1 ...... 274
C.7.2 xy1 ...... 275
C.7.3 xyz1 ...... 276
C.7.4 xyzq ...... 278

Index 281

Acknowledgments

I would like to thank Prof. Dr. Helmar Burkhart and PD Dr. Olaf Schenk for giving me the opportunity to start this project and for their research guidance, their support, advice, and confidence. I would also like to thank Prof. Dr. Rudolf Eigenmann for kindly agreeing to act as co-referee in the thesis committee and for reading the thesis.

I am grateful to the other members of the research group, Robert Frank, Martin Guggisberg, Florian Müller, Phuong Nguyen, Max Rietmann, Sven Rizzotti, Madan Sathe, and Jürg Senn, for contributing to the enjoyable working environment and for stimulating discussions.

I wish to thank the people at the Lawrence Berkeley National Laboratory for welcoming me – twice – as an intern in their research group and for the good cooperation; specifically, my thanks go to Lenny Oliker, Kaushik Datta, Noel Keen, Terry Ligocki, John Shalf, Sam Williams, Brian Van Straalen, Erich Strohmaier, and Horst Simon.

Finally, I would like to express my gratitude towards my parents for their support.

This project was funded by the Swiss National Science Foundation (grant No. 20021-117745) and the Swiss National Supercomputing Centre (CSCS) within the Petaquake project of the Swiss Platform for High-Performance and High-Productivity Computing (HP2C).

Chapter 1

Introduction

The advent of the multi- and manycore era has led to a software crisis. In the preceding era of frequency scaling, performance improvement of software came for free with newer processor generations. The current paradigm shift in hardware architectures towards more and simpler "throughput optimized" cores, which is essentially motivated by the power concern, implies that, if software performance is to go along with the advances in hardware architectures, parallelism has to be embraced in software. Traditionally, this has been done in high performance computing for a couple of decades. The new trend, however, is towards multi-level parallelism with gradated granularities, which has led to mixing programming models and thereby increasing code complexity, exacerbating code maintenance, and reducing programmer productivity.

Hardware architectures have also grown immensely complex, and consequently high performance codes, which aim at eliciting the machine's full compute power, require meticulous architecture-specific tuning. Not only does this obviously require a deeper understanding of the architecture, but it is also a time consuming and error-prone process.

The main contribution of this thesis is a software framework, PATUS, for a specific class of computations — namely nearest neighbor, or stencil, computations — which emphasizes productivity, portability, and performance. PATUS stands for Parallel Auto-Tuned Stencils.

A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This is an important class of computations, occurring frequently in scientific and general purpose computing (e.g., in PDE solvers or in image processing), which justifies the focus on this kind of computation. It was classified as the core computation of one of the currently 13 computing patterns — or motifs — in the often-cited Landscape of Parallel Computing Research: A View from Berkeley [9].

The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages and the auto-tuning methodology. The domain specific language approach enables the programmer to express a stencil computation in a concise way, independently of hardware architecture-specific details such as a low level programming model and hardware platform-specific code optimization techniques, thus increasing productivity. In our framework, we furthermore raise productivity by separating the specification of the stencil from the algorithmic implementation, which is orthogonal to the definition of the stencil.

The use of domain specific languages also implies code reusability: the same stencil specifications can be reused on different hardware platforms, making them portable across hardware architectures. Thus, the combined use of domain specific languages and auto-tuning makes the approach performance-portable, meaning that no performance is sacrificed for generality. This requires, of course, that an architecture-aware back-end exists, which provides the domain-specific and architecture-specific optimizations. Creating such a back-end, however, has to be done only once.

We show that our framework is applicable to a broad variety of stencils and that it provides its user with a valuable performance-oriented tool.
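To make the notion of a stencil computation concrete, the following is a minimal sketch in plain C (not PATUS-generated code) of one sweep of a 7-point, Laplacian-style 3D stencil; the grid size, the coefficients alpha and beta, and the array names are placeholder assumptions for illustration only.

#include <stddef.h>

#define NX 64
#define NY 64
#define NZ 64
/* Linearized index into a 3D grid of size NX x NY x NZ. */
#define IDX(i, j, k) ((size_t)(i) + NX * ((size_t)(j) + NY * (size_t)(k)))

/* One sweep of a 7-point stencil: every interior grid point is updated
   from its own value and the values of its six axis-aligned neighbors. */
void stencil_sweep(double *u_new, const double *u, double alpha, double beta)
{
    for (int k = 1; k < NZ - 1; ++k)
        for (int j = 1; j < NY - 1; ++j)
            for (int i = 1; i < NX - 1; ++i)
                u_new[IDX(i, j, k)] =
                    alpha * u[IDX(i, j, k)] +
                    beta  * (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                             u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                             u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]);
}

A solver would typically perform many such sweeps, swapping the roles of u and u_new between time steps; it is this kind of kernel that a PATUS stencil specification describes independently of the target hardware.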

This thesis is organized in five parts. The first part is a survey of the current challenges and trends in high performance computing, from both the hardware and the software perspective. In the second part, our code generation and auto-tuning framework PATUS for stencil computations is introduced. It covers the specification of stencil kernels, provides some background on algorithms for saving bandwidth and synchronization overhead in stencil computations, and presents ideas on how to implement them within the PATUS framework. The part concludes with a deliberation on auto-tuners and search methods. In the third part, performance experiments with PATUS-generated codes are conducted, both for synthetic stencil benchmarks and for stencils taken from real-world applications. Implementation details of PATUS are discussed in part four, and part five contains concluding remarks and ideas on how to proceed in the future.

PATUS is licensed under the GNU Lesser General Public License. A copy of the software can be obtained at http://code.google.com/p/patus/.

Part I

High-Performance Computing Challenges

Chapter 2

Hardware Challenges

To return to the executive faculties of this engine: the question must arise in every mind, are they really even able to follow analysis in its whole extent? No reply, entirely satisfactory to all minds, can be given to this query, excepting the actual existence of the engine, and actual experience of its practical results. — Ada Lovelace (1815–1852)

In advancing supercomputing technology towards the exa-scale range, which is projected to be reached by the end of the decade, power is both the greatest challenge and the driving force. Today, the established worldwide standard in supercomputer performance is in the PFlop/s range, i.e., in the range of $10^{15}$ floating point operations per second. Realizing that imminent scientific questions can be answered by models and simulations, the scientific world has also come to realize that accurate simulations demand still higher performance; hence the exigence for ever increasing performance. Hence, the next major milestone in supercomputing is reaching one EFlop/s — $10^{18}$ operations per second — subject to a serious constraint: a tight energy budget. A number of studies [21, 94, 168] have addressed the question of how a future exa-scale system may look. There are three main design areas that have to be addressed: the compute units themselves, memory, and interconnects.

Until 2004, performance scaling of microprocessors came at no effort for programmers: each advance in the semiconductor fabrication process reduces the gate length of a transistor on an integrated circuit. The transistors in Intel's first microprocessor in 1971, the 4-bit Intel 4004, had a gate length of 10 μm [103]. Currently, transistor gate lengths have shrunk to 32 nm (e.g., in Intel's Sandy Bridge architecture). The development of technology nodes, as fabrication processes with a certain gate length are referred to, is visualized in Fig. 2.1. The blue line visualizes the steady exponential decrease of gate lengths since 1971 (note the logarithmic scale of the vertical axis).

Overly simplified, a reduction by a factor of 2 in transistor gate length used to have the following consequences: to keep the electric field constant, the voltage V was cut in half along with the gate length. By reducing the length of the gates, the capacitance C was cut in half. The energy, obeying the law E ∝ CV², was therefore divided by 8. Because of the reduced traveling distances of the electrons, the processor's clock frequency could be doubled. Thus, the power consumption of a transistor in the new fabrication process is
\[
P_{\mathrm{new}} = f_{\mathrm{new}} E_{\mathrm{new}} = 2 f_{\mathrm{old}} \cdot \frac{E_{\mathrm{old}}}{8} = \frac{P_{\mathrm{old}}}{4}.
\]
As integrated circuits are produced on 2D silicon wafers, 4 times more transistors could be packaged on the same area, and consequently the (dynamic) power consumption of a chip with constant area remained constant. In particular, doubling the clock frequency led to twice the compute performance at the same power consumption.

The empirical observation that the transistor count on a cost-effective integrated circuit doubles every 18–24 months (for instance, as a consequence of the reduction of transistor gate lengths) is called Moore's Law [115]. Although stated in 1965, today, almost half a century later, the observation still holds. The red line in Fig. 2.1, interpolating the transistor counts of processors symbolized by red dots, visualizes the exponential trend. The number of transistors is shown on the right vertical logarithmic axis.

A widespread misconception about Moore's Law is that compute performance doubles every 18–24 months. The justification for this is that, indeed, as a result of transistor gate length reduction, both clock frequency and packaging density could be increased — and compute performance is proportional to clock frequency. However, reducing the voltage has a serious undesired consequence.
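Restating the classical scaling argument above in display form (this is only a summary of the preceding text, not an additional source): a halving of the gate length implied
\[
V \to \frac{V}{2}, \qquad C \to \frac{C}{2}, \qquad E \propto C V^2 \;\Rightarrow\; E \to \frac{E}{8}, \qquad f \to 2f,
\]
so the per-transistor power P = fE dropped to a quarter while four times as many transistors fit on the same die area; the chip's dynamic power thus stayed constant while the clock frequency, and with it the compute performance, doubled.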

The leakage power in semiconductors increases dramatically relative to the dynamic power, which is the power used to actually switch the gates. Also, semiconductors require a certain threshold voltage to function. Hence, eventually, the voltage could not be decreased any further. Keeping the voltage constant in the above reasoning about the consequences of gate length scaling has the effect that, since the energy is proportional to the square of the voltage, the power is now increased by a factor of 4. The power that can reasonably be handled in consumer chips (e.g., due to cooling constraints) is around 80 W to 120 W, which is the reason for choosing not to scale processor frequencies any further. The green line in Fig. 2.1 visualizes the exponential increase in clock frequency until 2004, after which the curve visibly flattens out. Current processor clock rates stagnate at around 2–3 GHz. The processor with the highest clock frequency ever sold commercially, IBM's z196 found in the zEnterprise System, runs at 5.2 GHz. Intel's cancellation of the Tejas and Jayhawk architectures [61] in 2004 is often quoted as the end of the frequency scaling era.

After clock frequencies stopped scaling, transistor gate length scaling in silicon-based semiconductors will, within a couple of years, necessarily come to a halt as well. Silicon has a lattice constant of 0.543 nm, and it will not be possible to go much further beyond the 11 nm technology node depicted in Fig. 2.1 — which is predicted for 2022 by the International Technology Roadmap for Semiconductors [63] and even as early as 2015 by Intel [88] — since transistors must be at least a few atoms wide. Yet, Moore's Law is still alive thanks to technological advances. Fabrication process shrinks have benefited from advances in semiconductor engineering such as Intel's hafnium-based high-κ metal gate silicon technology applied in Intel's 45 nm and 32 nm fabrication processes [84]. Technological advances such as Intel's three-dimensional Tri-Gate transistors [35], which will be used in the 22 nm technology node, are a way to secure the continuation of the tradition of Moore's Law. Other ideas in semiconductor research include graphene-based transistors [102, 146], which have a cut-off frequency around 3× higher than that of silicon-based transistors; or replacing transistors by novel components such as memristors [151]; or eventually moving away from using electrons towards using photons.

The era of frequency scaling allowed sequential processors to become increasingly faster, and the additionally available transistors were used to implement sophisticated hardware logic such as out-of-order execution, Hyper-Threading, and branch prediction.

Figure 2.1: Development of technology nodes, transistor counts, and clock frequencies. Note the logarithmic scales. Data sources: [87, 174, 175].

Hardware optimizations for a sequential programming interface were exploited to a maximum. To sustain the exponential performance growth today and in the future, the development has to move away from these complex designs, in which the overhead of control hardware outweighs the actual compute engines. Instead, the available transistors have to be used for compute cores working in parallel. Indeed, the end of frequency scaling was simultaneously the beginning of the multicore era. Parallelism is no longer only hidden by the hardware, such as in instruction level parallelism, but is now exposed explicitly at the software interface. Current industry trends strive towards the manycore paradigm, i.e., towards integrating many relatively simple and small cores on one die.

The end of frequency scaling has also brought back co-processors or accelerators. Graphics processing units (GPUs), which are massively parallel compute engines and, in fact, manycore processors, have become popular for general-purpose computing. Intel's recent Many Integrated Core (MIC) architecture [35] follows this trend, as do the designs of many microprocessor vendors such as Adapteva, ClearSpeed, Convey, Tilera, Tensilica, etc.

There are a number of reasons why many smaller cores are favored over fewer, bigger ones [9]. Obviously, a parallel program is required to take advantage of the increased explicit parallelism, but assuming that a parallel code already exists, the performance-per-chip-area ratio is increased. Addressing the power consumption concern, many small cores allow more flexibility in dynamic voltage scaling due to the finer granularity. The finer granularity also makes it easier to add redundant cores, which can either take over when others fail or be utilized as a means to maximize silicon wafer yield: if two cores out of eight are not functional due to fabrication failures, the die can still be sold as a six-core chip; e.g., the Cell processor in Sony's PlayStation 3 is in fact a nine-core chip, but one (possibly non-functional) core is disabled to reduce production costs. Lastly, smaller cores are also easier to design and verify.

Today, throughput-optimized manycore processors are implemented as external accelerators (GPUs, Intel's MIC), but eventually the designs will be merged into a single heterogeneous chip including traditional "heavy" latency-optimized cores and many light-weight throughput-optimized cores. A recent example of such a design was the Cell Broadband Engine Architecture [76], jointly developed by Sony, Toshiba, and IBM.

The main motivation for these heterogeneous designs is their energy efficiency. Going further toward energy efficient designs, special-purpose cores might be included, which are tailored to specific classes of algorithms (signal processing, cryptography, etc.) or which can be reconfigured at runtime, much like Convey's hybrid-core computers, into which algorithm classes ("Personalities") are loaded at runtime and are realized in hardware.

Memory remains the major concern in moving towards an exa-scale system. While microprocessor compute performance used to double every 18–24 months, memory technology evolved, too, but could not keep up at this pace. The consequence is what is often called the memory gap: the latency between a main memory request and the request being served has grown to the order of hundreds of processor cycles. Equally, memory bandwidth has not increased proportionally to compute performance. For a balanced computation, several tens of operations on one datum are required. The consequence is that many important scientific compute kernels (including stencil computations, sparse equation system solvers, and algorithms on sparse graphs) have become severely bandwidth limited. A hierarchy of caches, i.e., a hierarchy of successively smaller but faster memories, mitigates this problem to some extent, assuming that data can be reused after it has been brought to the processing elements. The memory gap also has a new interpretation in the light of energy efficiency. Data movement is expensive in terms of energy, and more so the farther away the data has to be transferred from. Data transfers have become more expensive than floating point operations. Therefore, data locality not only has to be taken seriously because of its impact on performance, but also as an energy concern.

Improvements in the near future include DDR4 modules, which offer higher bandwidth, yet have lower power consumption, higher memory density, and a resilient interface to prevent errors. More interestingly, Hybrid Memory Cubes are a major advance in memory technology, i.e., stacked 3D memory cubes with yet higher bandwidth, lower power consumption, and higher memory density. Other ideas are in the area of bridging the gap between DRAM and hard disk drives by means of flash-type non-volatile memories, thereby addressing application fault tolerance; checkpointing to non-volatile semiconductor-based memory will be a lot faster than checkpointing to hard disks, and therefore could substantially speed up scientific applications which depend on fault tolerance.

The third pillar of high performance computing hardware is the interconnect. The increasing demand for bandwidth has led to increasing data rates. Electrical transmission suffers from frequency-dependent attenuation (the attenuation increases as the frequency is raised), limiting both the frequency and the cable length. Thus, electrical cables are gradually being replaced by optical interconnects (e.g., active optical cables, which convert electrical signals to optical ones for transmission and back to electrical ones so that they can be used as seamless replacements for copper cables [178]). Not only can optical interconnects tolerate higher data rates, but they are also around one order of magnitude more power efficient [24].

As thousands of nodes need to be connected, it is not practical to use central switches. Instead, hierarchical structures can be used. On the other hand, finding good network topologies is a concern, as a substructure of the network is used for a parallel application running on a part of a supercomputer, and thus a good mapping between the application's communication requirements and the actual hardware interconnect has to be set up so as to avert performance deterioration. The currently top-ranked supercomputer (as of June 2011), the K computer installed at the RIKEN Advanced Institute for Computational Science in Japan [161], employs the "Tofu" interconnect, a 6D mesh topology with 10 links per node, into which 3D tori can be embedded. In fact, whenever a job is allocated on the machine, it is offered a 3D torus topology [4].

The consequence of substantially increasing explicit on-chip parallelism is profound. Inevitably, it needs to be embraced in order for applications to benefit from the increased total performance the hardware has to offer. Simultaneously, both network and memory bandwidth per Flop and the memory capacity per compute unit will drop. This means that data can no longer be scaled up exponentially in size, and the work per compute unit decreases as the explicit parallelism increases exponentially. An exa-scale machine is expected to have a total number of cores on the order of $10^8$ to $10^9$ and, therefore, likely a thousand-way on-chip parallelism.

These issues must somehow be addressed in software. Most urgently, the question must be answered of how this amount of parallelism can be handled efficiently. This is the challenge of programmability; other challenges are minimizing communication and increasing data locality — which in the long run means that a way must be found of expressing data locality in a parallel programming language, either implicitly or explicitly — and, lastly, fault tolerance and resilience.

Chapter 3

Software Challenges

It must be evident how multifarious and how mutually complicated are the considerations which the working of such an engine involve. There are frequently several distinct sets of effects going on simultaneously; all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. — Ada Lovelace (1815–1852)

As described in the previous chapter, there are currently two trends in the evolution of hardware architectures: the hardware industry has embraced the manycore paradigm, which means that explicit parallelism is increasing constantly. On the other hand, systems will become more heterogeneous: processing elements specialized for a specific task are far more power efficient than general purpose processors with the same performance.

These trends necessarily need to be reflected in software. High performance computing has been dealing with parallelism almost from the start. Yet, in a way, parallelism was simpler when it "only" had to deal with homogeneous unicore processors. Having entered the multi- and manycore era, not only do we have to address parallelism in desktop computing, but this new kind of on-chip parallelism also needs to be reflected in how we program supercomputers: not only does inter-node parallelism have to be taken care of, but the many-way, explicit, finer-grained intra-node parallelism also has to be exploited.

Also, the specialization of processor components obviously entails specialization at the software level.

With the massive parallelism promised in the near future, several issues become critical that must be addressed in software. They will eventually also influence programming models. Synchronization needs to be controllable in a fine-grained manner; frequent global synchronizations become prohibitively expensive. Data locality becomes increasingly important, calling for control over the memory hierarchy and for communication-reducing and communication-avoiding algorithms. In large parallel systems, statistically a larger absolute number of failures will occur, which must be addressed by fault-tolerant and resilient algorithms.

3.1 The Laws of Amdahl and Gustafson

In the following we give a theoretical example of the impact on a fixed problem when the amount of explicit parallelism is increased. Assume we are given two hypothetical processors, A and B. Let A be a processor with medium-sized cores and B a processor with small cores. We assume that both processors consume the same amount of power; let A have N cores running at a clock frequency of f, and let B have 4N cores running at half the frequency, f/2, and assume that the reduction in control logic reduces the power consumption of each small core by a factor of 2 compared to the medium-sized cores of processor A.

As the industry trend moves towards processors of type B rather than A, as outlined in Chapter 2, we can ask: what are the implications for software when replacing processor A by processor B? How much parallelism does a program need so that it runs equally fast on both processors? Given an amount of parallelism in the program, by how much do we need to increase it so that we can take advantage of the increased parallelism offered by processor B running at a slower speed?

Let P denote the percentage of parallelism in the program under consideration, 0 ≤ P ≤ 1. In view of Amdahl's law [6], which states that, for a fixed problem, the speedup on a parallel machine with N equal compute entities is limited by
\[
S_{\mathrm{strong}}(P, N) = \frac{1}{\frac{P}{N} + (1 - P)},
\]
in order to achieve identical speedups $S_A$ and $S_B$ for both processors A and B in our hypothetical setup, we have
\[
S_A = \frac{2}{\frac{P}{N} + (1 - P)} = \frac{1}{\frac{P}{4N} + (1 - P)} = S_B,
\]
or, after solving for P,
\[
P(N) = \frac{2N}{2N + 1}.
\]

Figure 3.1: Required amount of parallelism and additionally required parallelism when switching from faster, heavier cores to more cores, which are more lightweight and slower.

The graph of this function is shown in Fig. 3.1 (a). If, for some fixed N, the amount of parallelism P is smaller than indicated by the curve in Fig. 3.1 (a), the program will run slower on processor B than on processor A. Note that in our setup, already for N as small as 5, 90% of the program needs to be parallelized. Fig. 3.1 (b) shows the amount of parallelism that has to be added relative to the already existing amount, which is given by
\[
q(P, N) = \frac{2N(P - 1) + P}{(1 - 4N)\,P}.
\]
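To make the figure quoted above concrete, a short worked evaluation of P(N) (not part of the original text):
\[
P(5) = \frac{2 \cdot 5}{2 \cdot 5 + 1} = \frac{10}{11} \approx 0.91,
\]
i.e., for N = 5 roughly 91% of the program must already be parallel for processor B to match processor A, consistent with the 90% statement above.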

In general, if the clock frequency of A is ρ times faster than that of B, but B has c times more cores than A, the amount of parallelism required of a program so that it runs equally fast on both A and B is given by
\[
P(N, c, \rho) = \frac{1}{1 + \frac{c - \rho}{(\rho - 1)\,c\,N}},
\]
and the relative amount of parallelism by which a program has to be increased to run equally fast is given by
\[
q(P, N, c, \rho) = \frac{(\rho - 1)\,c\,N\,(P - 1) + (c - \rho)\,P}{\rho\,(1 - cN)\,P}.
\]

Figure 3.2: Amdahl's law for symmetric (solid lines), asymmetric (dashed lines), and dynamic (dashed-dotted lines) multicore chips with 256 base core equivalents due to Hill and Marty [81] for three different quantities of program parallelism.

Assuming that more powerful cores can be built given the necessary resources, we can ask what the ideal multi- or manycore chip should look like, given the parallel portion of a program. Hill and Marty take this viewing angle [81], again assuming the simple model of Amdahl's law. The chip designer is given N base core equivalents, each with the normalized performance 1. We also assume that r base core equivalents can be fused into a Φ(r)-times more powerful core, i.e., a core which speeds a sequential workload up by a factor of Φ(r). Φ is assumed to be a sub-linear function; if it were linear or super-linear, combining cores would always be beneficial.

In the simplest, symmetric setting, all of the N base core equivalents are combined into equally large cores of r base core equivalents each, thus resulting in a device of N/r cores. Then, as all cores are Φ(r)-times more efficient than a base core equivalent, the speedup is given by
\[
S_{\mathrm{symmetric}}(P, N, r) = \frac{1}{\frac{1 - P}{\Phi(r)} + \frac{P\,r}{\Phi(r)\,N}}.
\]
In an asymmetric — or heterogeneous — setting, Hill and Marty assume that there are small cores, one base core equivalent each, and one larger, Φ(r)-times more powerful core of r base core equivalents. Then, assuming the sequential part is executed by the larger core and the parallel portion by both the larger core and all the N − r small cores, the speedup becomes
\[
S_{\mathrm{asymmetric}}(P, N, r) = \frac{1}{\frac{1 - P}{\Phi(r)} + \frac{P}{\Phi(r) + (N - r)}}.
\]
Furthermore, if the cores could be dynamically reconfigured to become one larger core of r base core equivalents for sequential execution or N small cores for parallel execution, the speedup in this dynamic setting is
\[
S_{\mathrm{dynamic}}(P, N, r) = \frac{1}{\frac{1 - P}{\Phi(r)} + \frac{P}{N}}.
\]

Fig. 3.2 shows speedup curves for the symmetric, asymmetric, and dynamic designs for N = 256 base core equivalents. As in [81], the graphs show the behavior for Φ(r) := √r, which is sub-linear for r ≥ 1. The colors in the figure encode the parallelism quantity: blue for P = 0.5, dark green for P = 0.9, and light green for P = 0.99. The speedup is plotted on the vertical axis; on the horizontal axis the number of base core equivalents r is varied. Thus, in the symmetric case, to the left (for r = 1) all cores are small, and to the right (for r = 256) the chip consists of one large core of 256 base core equivalents. Similarly, in the asymmetric and dynamic cases, the size of the one larger core increases from left to right.

The most striking results that can be inferred from the graphs, apart from the fact that in any case the amount of parallelism is crucial, are that too many small cores are sub-optimal; that the asymmetric configuration leads to greater speedups than the symmetric configuration in any case; and that the dynamic configuration is the most beneficial. The sub-optimality of many small cores is also highlighted by Fig. 3.3.
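As an illustration of the three models above (not code from the thesis), the following small, self-contained C sketch evaluates the speedups for Φ(r) = √r and N = 256, mirroring the parameters used for Fig. 3.2; the function and variable names are arbitrary.

#include <math.h>
#include <stdio.h>

/* Performance of a core built from r base core equivalents (BCEs);
   Hill and Marty use phi(r) = sqrt(r), which is sub-linear for r >= 1. */
static double phi(double r) { return sqrt(r); }

/* Speedup models for a chip budget of n BCEs and parallel fraction p. */
static double s_symmetric(double p, double n, double r) {
    return 1.0 / ((1.0 - p) / phi(r) + p * r / (phi(r) * n));
}
static double s_asymmetric(double p, double n, double r) {
    return 1.0 / ((1.0 - p) / phi(r) + p / (phi(r) + (n - r)));
}
static double s_dynamic(double p, double n, double r) {
    return 1.0 / ((1.0 - p) / phi(r) + p / n);
}

int main(void) {
    const double n = 256.0;                 /* base core equivalents on the chip */
    const double ps[] = { 0.5, 0.9, 0.99 }; /* parallel fractions as in Fig. 3.2 */
    for (int i = 0; i < 3; ++i)
        for (double r = 1.0; r <= n; r *= 2.0)
            printf("P=%.2f r=%3.0f  sym=%7.2f  asym=%7.2f  dyn=%7.2f\n",
                   ps[i], r, s_symmetric(ps[i], n, r),
                   s_asymmetric(ps[i], n, r), s_dynamic(ps[i], n, r));
    return 0;
}

For P = 0.99, for instance, the printed values reproduce the qualitative behavior noted above: the symmetric speedup peaks at an intermediate r, the asymmetric speedup dominates the symmetric one, and the dynamic configuration is the best of the three.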


Figure 3.3: Optimal choices for the number of base core equivalents (r) for program parallelism percentages P in Hill's and Marty's model.

It shows the optimal number of base core equivalents r to be coalesced into the larger cores in the symmetric, asymmetric, and dynamic scenarios for a given parallelism percentage P. In the symmetric setting (the orange curve in Fig. 3.3), if less than 50% of the program is (perfectly) parallelized, the model favors one large core (r = 256). The dashed orange line shows the number of base core equivalents per core that partition the resources without remainder. According to this line, up to P = 65% two cores, up to P = 80% four cores, etc. are required. In the asymmetric setting, the required size of the one large core decreases only slowly as P increases, and only drops sharply, favoring small cores, if P is close to 1. The dynamic configuration is designed to adapt to sequential and parallel regions; thus, clearly, the maximum speedup is reached in a parallel region when there are many small cores and in a sequential region when there is one core which is as powerful as possible, i.e., for maximal r, as conveyed by Figs. 3.2 and 3.3.

Gustafson put the pessimistic prospects of Amdahl's law into perspective [78]. He argued that rather than fixing the problem instance, the time to solve the problem should be fixed: in practice, when scaling a parallel program up on a large parallel machine with many nodes, the problem size is scaled up simultaneously. This scenario is commonly referred to as weak scaling, whereas scaling out a fixed problem on a parallel machine is referred to as strong scaling. In the weak scaling setting, let the time required to solve a problem with a percentage P of parallelism be 1 on a parallel machine with N compute entities. Then, if the same program is executed sequentially, the time required is N·P + (1 − P), as the parallel part takes N times longer to execute. Thus, the speedup is
\[
S_{\mathrm{weak}}(P, N) = \frac{N P + (1 - P)}{1} = N + (1 - N)(1 - P),
\]
which is commonly called the law of Gustafson-Barsis.

Traditionally, weak scaling was indeed what was applied in practice: larger machines enabled solving larger problems. However, with the dramatic increase of explicit on-chip parallelism and the consequential decrease of per-core memory we are facing today and in the near future, we are inevitably forced to leave the realm of weak scaling and are gradually pushed into the realm of strong scaling governed by Amdahl's law.

With his law, Amdahl made a case against massive parallelism; Gustafson could relativize it by assuming that the problem size grows along with the available parallelism. Today we are challenged by having to embrace massive parallelism, but we are no longer in the position in which we can make use of Gustafson's loophole.

Amdahl's law is a model in which simplifying assumptions are made. It may give an upper bound for the speedup, yet the relative amount of parallelism is hardly quantifiable precisely in practice, so Amdahl's law gives a qualitative rather than a quantitative assessment. Furthermore, it might have to be applied to code sections rather than to a whole program, as parallelism can change dynamically. Platforms become more and more heterogeneous, which implies that there can be many types of explicit parallelism. For instance, a trend today is to hybridize MPI codes by inserting OpenMP pragmas so that the software takes advantage of the hardware's architectural features.
Thus, there is a level of coarse-grained parallelism from the MPI parallelization, as well as the more fine-grained OpenMP concurrency layer, which is used, e.g., to explicitly extract loop-level parallelism — a typical application of OpenMP. Another trend is to deliberately enter the regime of heterogeneity by using hardware accelerators: for instance, general purpose GPU computing enjoys constantly growing popularity. In fact, 3 of the world's currently fastest computers are equipped with GPU accelerators [161]. In this case there are even more levels of explicit parallelism.

Thus, the question remains how the additionally required parallelism can be extracted from a program. It can be accepted as a fact that compilers have failed at auto-parallelization. Certainly, compilers can vectorize or automatically parallelize certain types of loops. In fact, automatic loop parallelizers such as Cetus [157] or PLuTo [23] are able to automatically insert an OpenMP pragma at the right place into the loop nest of a stencil computation. But a compiler typically is not able to extract additional parallelism or reduce the bandwidth usage or the synchronization overhead, e.g., by automatically applying one of the algorithms which will be discussed in Chapter 6, given a naïve implementation of a stencil calculation.

Mostly, the inhibiting factor for parallelization in fact lies within the algorithmic design. Coming from a long tradition of sequential computing, algorithms still have sequential semantics, and therefore intuitively are implemented inherently sequentially. Obviously, a compiler can typically not, or only to a very limited extent, make such an algorithm "more parallel" by applying loop transformations. Undoubtedly, most of the compute time is spent in loops, which is a reason why a lot of research has focused on understanding loops. Unfortunately, loop-level parallelism is not sufficient: not enough parallelism might be exposed, or it might have the wrong granularity, thus, e.g., incurring a high synchronization overhead and ultimately resulting in a slow-down instead of a speedup. Typically, a parallel version of an algorithm is in fact a radically different algorithm; we will give a concrete example in Chapter 4.

History teaches us that we must embrace parallelism rather than fight it, even more so as, having forcibly left the frequency scaling era, parallelism has started to permeate "consumer computing": not seldom today, desktop and laptop computers have CPUs with four or even more cores, and general purpose-programmable GPUs are omnipresent. This inevitably leads to the question of how to program these devices.

3.2 Current De-Facto Standards for Parallel Programming Models

Historically, high performance computing has been concerned with parallel processing since the 1960s. In contrast, desktop computing was traditionally sequential (in the sense that algorithms were implemented sequentially), at least until the beginning of the multicore commodity CPU era in 2004, the year of Intel's cancellation of their Tejas and Jayhawk architectures, which is often quoted as the end of the frequency scaling era and therefore the rise of the multicores.

Surprisingly, despite the long history of parallel computing in high performance computing, the programming languages used in both areas are not much different in style. Notably, C/C++, as examples of languages used in both areas, are inherently sequential languages: there are no language constructs for parallel execution.* Parallelism is only offered through external libraries, typically the Message Passing Interface (MPI) and threading libraries such as pthreads (POSIX threads) or the Microsoft Windows threading API. In fact, the notion of parallelism has been known in desktop computing for a while; multi-threading is used to carry out (sequential) compute-intensive operations in the background, which is the natural application of the task parallel model offered by these threading libraries. The tight interaction of threads required when decomposing an algorithm into parallel parts, as required in typical high performance computing tasks, however, can only be achieved with a high programming effort. For instance, data-level parallelism or loop-level parallelism is cumbersome to implement with pthreads, Java threads, or even Java's concurrency library, since the programmer has to take care of subdividing the data or the iteration space manually. Unless additional libraries, such as Intel's Threading Building Blocks, are used on top of the threading libraries, the programmer is responsible for assigning the work to threads and for doing the load balancing. Decomposing data volumes, for instance, involves easy-to-miss translations of thread IDs to index spaces. Furthermore, parallel programming requires constructs for communication and synchronization, which are often not self-explanatory to the parallel programming novice (such as a future in Java). This leads to a steep learning curve and is a reason why parallel programming is said to be difficult and why parallel programming has not been taught in programming classes. In view of today's hardware, sequential programming should really be considered a special case of parallel programming instead of the other way around.

* Fortran offers a data-level parallel notation in the form of vectors, which is one form of parallelism. In Java, threads are part of the runtime system, and Java offers the synchronized keyword. These concepts target another form of parallelism: task-level parallelism.

Both message passing libraries and threading libraries are typically programmed in a way that makes many instances of the same program execute the same instructions, but on different sets of data. This programming model is called the single program multiple data (SPMD) model. It is currently the most common programming model; almost all MPI codes are written in this fashion, and it extends to more modern parallel languages, such as the ones in the PGAS family (cf. Chapter 3.3).
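As an illustration of the SPMD style (not taken from the thesis), the following minimal C/MPI sketch shows the typical pattern: every process runs the same program, derives its share of the iteration space from its rank by hand, and exchanges results with explicit two-sided send/receive calls. The problem size, message tag, and the work performed are placeholder assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank-to-index-space translation done manually, as criticized below. */
    const int n = 1000000;                /* global problem size (arbitrary) */
    int chunk = n / size;
    int begin = rank * chunk;
    int end   = (rank == size - 1) ? n : begin + chunk;

    double local_sum = 0.0;
    for (int i = begin; i < end; ++i)
        local_sum += 1.0 / (1.0 + (double) i);   /* stand-in for real work */

    /* Two-sided communication: each non-root process sends its partial
       result; the matching receive must be posted, lest the program deadlocks. */
    if (rank != 0) {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = local_sum, partial;
        for (int src = 1; src < size; ++src) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}

The algorithm (a simple global sum) is barely visible in the fragments; this is precisely the bookkeeping and obfuscation overhead discussed next.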

The greatest shortcoming of the SPMD model is that it obfuscates the structure of an algorithm by splitting it into program fragments and forcing the programmer to add the required management overhead. The model also makes it hard to express nested parallelism, which might occur naturally, e.g., in divide-and-conquer algorithms.

SPMD is a simple model in the sense that there are few primitives for communication and synchronization, and it offers high execution model transparency. Although MPI offers an abundance of communication styles and functions, they could be emulated using sends and receives. As for the execution model transparency, the programmer knows exactly, by design, how the data is distributed and where code is executed. The simplicity also extends to the compilation and deployment process: a standard sequential compiler is sufficient. However, the simplicity of the model comes at a high overhead for the programmer, placing on her or him the burdens of fragmented-style programming and the error-prone bookkeeping overhead that is associated with it.

So far, MPI [111] is the most prevalent paradigm for programming distributed memory architectures. It has become the de-facto standard for these types of architectures. MPI was conceived in the early 1990s by a consortium from academia and industry. The first MPI standard was released in 1994; currently, the MPI forum is working on the third version of the standard. The fact that many reliable and freely available implementations for a multitude of platforms and many language bindings (C/C++, Fortran, Java, Python, Perl, R) exist, and the fact that it is a well specified standard, have significantly contributed to the success of MPI.

A typical MPI code runs as many program instances simultaneously as there are hardware resources (in concordance with the SPMD model); thus parallelism is expressed at program granularity. Communication and synchronization are done by calls to the MPI library. Communication is typically two-sided, i.e., one process issues a send to the receiver process, which must call the recv function, lest the program hangs in a deadlock.

Today, OpenMP is a popular choice to utilize thread-level parallelism. For shared memory computer systems and for intra-node concurrency, the OpenMP API [127] has become the de-facto standard. OpenMP is a set of compiler directives rather than language level constructs or a library providing support for concurrency. As such, while not being tied to a programming language, it depends on the compiler supporting it. The programming paradigm is quite successful, and it has been adopted in many C/C++ and Fortran compilers. Because OpenMP instructions are directives or pragmas, an OpenMP-instrumented program can still be compiled to a correct sequential program, even if a compiler does not support OpenMP. It therefore also allows the programmer to parallelize a code incrementally.
The beauty of OpenMP is that, in contrast to the fragmenting SPMD style, it gives the programmer a global view of a parallel algorithm. OpenMP can be used to express both loop-level and task-level parallelism. Its major shortcoming is that it is designed for a shared memory environment; as is commonly known, shared memory platforms do not scale above a certain number of processors. It does not offer as fine-grained (and therefore low-level) control over threads as could be achieved with pthreads or operating system-specific functions, such as setting thread priorities, and it offers no sophisticated synchronization constructs such as semaphores or barriers involving subgroups of threads. Instead, the thread-level details are taken care of by the compiler and the OpenMP runtime system, which also relieves the programmer of tedious bookkeeping tasks.

OpenMP has been developed since 1997 by a broad architecture review board with members from industry (hardware and compiler vendors) and academia. Its shortcoming of being restricted to the shared memory domain has been addressed in research as source-to-source translation from OpenMP to distributed memory and accelerator programming paradigms, specifically to SHMEM [96], MPI [16], and CUDA [97]. (CUDA C [123] is the model for programming NVIDIA GPUs; also a programming model in the SPMD style.)

In high performance computing programming, the heterogeneity of the hardware platforms is reflected in the current trend to mix programming models. The different levels of explicit parallelism (inter-node, intra-node, or even accelerators) are addressed by using hybrid programming paradigms, most commonly in the form of MPI+OpenMP or, with accelerators such as GPUs becoming more popular, e.g., MPI+CUDA or even MPI+OpenMP+CUDA. The different programming models are used to address the different levels or types of parallelism. Typically, MPI is used for the coarse-grained inter-node parallelism and OpenMP for the fine-grained intra-node and on-chip parallelism, such as to express loop-level parallelism.
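As an illustration (not from the thesis), a minimal sketch of loop-level parallelism with OpenMP in C: a single pragma distributes the iterations of the loop over the threads of a shared memory node, and without an OpenMP-capable compiler the code still builds as a correct sequential program. The array size and the computation are placeholder assumptions.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];

    for (int i = 0; i < N; ++i) {
        x[i] = (double) i;
        y[i] = 1.0;
    }

    /* Loop-level parallelism: the pragma splits the iterations of the loop
       among the threads; scheduling and thread management are left to the
       compiler and the OpenMP runtime. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}

Compared to a pthreads or MPI version of the same loop, neither the data nor the iteration space has to be decomposed by hand, which is the global-view property referred to above.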

3.3 Beyond MPI and OpenMP

One way to make parallel programming accessible to a broader public is by promoting new languages explicitly targeted at parallel programming. The holy grail of parallel programming is no longer to automatically find a good parallel implementation for a given good sequential implementation of a program, but to provide a language and a compiler that bridge the unavoidable tensions between generality, programmability, productivity, and performance. Obviously, a language targeted at parallel programming has to allow a natural expression of various flavors of parallelism, preferably in a non-obscuring, global-view sort of way, thus allowing concise formulations of algorithms. Parallel programmability means providing useful abstractions of parallel concepts, i.e., of the way concurrency is expressed, data is distributed, and synchronization and communication are programmed. Furthermore, abstraction should not come at the cost of an opaque execution model, giving the programmer no fine-grained control of the mapping between the software and the hardware, and thus ultimately limiting the performance by potentially ignoring native architectural features a hardware platform has to offer. Ideally, a language should also be portable across multiple flavors of hardware architectures. For instance, OpenCL [95] tries to bridge the chasm between latency-optimized multicore designs (CPUs) and throughput-optimized manycore architectures (GPU-like devices). While it works in principle, in practice a programmer striving for good performance (i.e., efficient use of the hardware) still has to carry out hardware-specific optimizations manually.

A number of approaches towards productivity in parallel computing have been proposed. One idea is to reuse sequential program components and to orchestrate data movement between them on a high abstraction level using a coordination language such as ALWAN [28, 79], which was developed at the University of Basel. A recent trend aiming at productivity is the family of partitioned global address space (PGAS) languages. The latest attempt at creating new parallel productivity languages are the projects developed and funded† within DARPA's High Productivity Computing Systems (HPCS) program. The parallel languages which have emerged from there are Chapel (by Cray), X10 (by IBM), and Fortress (by Sun/Oracle).

* DARPA is the Defense Advanced Research Projects Agency, an agency of the United States Department of Defense responsible for the development of new technology for military use.

However, time will tell whether these new languages are adopted by the community and could eventually even replace traditional and established languages such as C/C++, Fortran, and Java, which have matured over the years; many parallel languages have come and silently vanished again.

The underlying idea of PGAS languages, including Unified Parallel C (UPC), Co-Array Fortran, and Titanium, is to model the memory as one global address space, which is physically subdivided into portions local to one process. Any portion of the data can be accessed by any process, but an access time penalty is incurred if non-local data is read or written. This concept obviously increases productivity, relieving the programmer of the need for explicit communication calls; yet UPC, Co-Array Fortran, and Titanium are all still languages in the SPMD model and thus provide no global view. The data is (more or less, depending on the language) partitioned implicitly, while the algorithm still has to be decomposed by the programmer in an explicit way.

Co-Array Fortran [122] is an extension to Fortran which introduces a new, special array dimension for referencing an array across multiple instances of an SPMD program. Such an array is called a co-array. The array size in the local dimensions typically depends on the number of SPMD instances, hence obstructing the global view.

UPC [45] is a C-like language which supports the PGAS idea through the shared keyword, which causes arrays to be distributed automatically in a cyclic or block-cyclic way over the SPMD instances. The programmer is given some control over the data distribution by the possibility to specify block sizes. UPC provides a more global view via a for-all construct with affinity control, i.e., a mechanism to distribute iterations among program instances. The idea of the affinity control is to minimize the (implicit) communication by matching the distributed iteration space to the data subdivision. Data locality is managed by distinct pointer specifiers identifying global or local memory portions as a part of UPC's type system, which enables the compiler to reason about locality statically. Except for the for-all construct, UPC is a language in the SPMD model, which makes it a mix between SPMD and a global view approach.

Finally, Titanium [187] is a Java-like language developed at Berkeley with extensions for the SPMD model (affinity control, and keywords for synchronization and communication). Its main goals are performance, safety, and expressiveness. There is no Java Virtual Machine; instead, Titanium compiles to C code, and therefore only a subset of Java's runtime and utility classes and only a subset of Java's language constructs are supported.

The two languages developed within the DARPA HPCS program, Chapel [32] and X10 [83], are explicitly targeted at productivity. (Fortress was dropped from the HPCS project in November 2006.) Productivity is not quantifiable (certainly the number of lines of a program is not a productivity measure), but it encompasses the notions of performance, programmability, portability, and robustness. The grand challenge for these undertakings was "deliver[ing] 10× improvement in HPC application development productivity over today's systems by 2010, while delivering acceptable performance on large-scale systems" [145].
Chapel's two main design goals were to address productivity through a global view model and to support general parallelism. Traditionally, there have been languages and libraries geared towards data parallelism (High Performance Fortran, ZPL [33], Intel's Ct, which is now Intel's Array Building Blocks, etc.) and others geared towards task parallelism (Charm++, Cilk, Intel Threading Building Blocks, etc.). One of the goals of Chapel is to support both types of parallelism naturally.

Chapel's data parallel features were most influenced by High Performance Fortran, the Cray MTA™ and XMT™ language extensions, and by ZPL, an array-based parallel language supporting global view computation on distributed arrays, and therefore limited to data parallel computations. In Chapel, data parallelism uses domains, which define the size and shape of the data arrays upon which a computation operates, and the parallel iteration is supported by a for-all construct iterating over a domain. Task parallelism is introduced as anonymous threads by the cobegin language construct rather than an explicit fork-join.

Parallelism is understood as may parallelism rather than must parallelism. In this way arbitrary nesting is supported without swamping the system with too many threads, and sophisticated mechanisms for load balancing such as work stealing can be applied.

The global view model allows algorithms and data structures to be expressed in their entirety and thereby allows them to be expressed in a shorter and more concise form. Thus, they become less error-prone to program and more maintainable, as overhead cluttering the actual computation is rendered unnecessary. The program starts as one logical thread of execution, and concurrency is spawned on demand by means of data or task parallel language constructs, i.e., for-all loops or cobegin constructs.

As a modern programming language, Chapel supports object orientation, generic programming, and managed memory. Its syntax is deliberately distinct from existing programming languages, but it is inspired by C, Java, Fortran, Modula, and Ada.

X10 is a purely object oriented language (even primitive data types are treated as classes, but X10 distinguishes reference and value classes) and follows the PGAS paradigm, extending it to a globally asynchronous, locally synchronous model. The syntax is reminiscent of C++, Java, and Scala, also allowing constructs known from functional programming languages. As in Chapel, threads are spawned anonymously, in X10 by the async keyword, which runs a lightweight activity at a certain place. Places are an abstraction for a shared-memory computational context [34]. They can therefore be thought of as abstractions of shared-memory multi-processors. The global view for data parallelism is supported via distributions: mappings from regions, i.e., index sets, to subsets of places. Arrays are mappings from distributions to a base type, thereby implementing a general multi-dimensional array concept inspired by ZPL [33]. Fine-grained synchronization constructs are provided by clocks and atomic sections. Also, reading from remote locations has to be done asynchronously within an activity; such an activity returns a future, which will eventually receive the result. X10 forces the programmer to manage data distribution explicitly, in the belief that data locality and distribution cannot be hidden behind a transparent execution model without sacrificing good performance.

The "traditional" PGAS languages as well as Chapel and X10 address horizontal data locality. As mentioned in Chapter 2, vertical data locality, i.e., the data locality of a datum within the memory hierarchy, will become more and more important as the memory gap continues to widen. Currently, there is a lot of active research in communication minimizing and avoiding algorithms, particularly for linear algebra algorithms [12, 13, 56, 75, 114]. Data transfers are expensive both in time and in energy. High memory latencies and relatively low bandwidths impact performance, and transferring data over long stretches impacts performance per Watt. However, none of the languages presented above addresses this issue. Eventually there will be a need for explicit data placement in both the horizontal and the vertical dimension. A priori, this puts yet another burden on the programmer, unless a model can be found that is both expressive enough to facilitate the implementation of (at least a certain class of) algorithms and provides a data transfer-minimizing mapping to the memory hierarchy. Sequoia [60, 140] takes steps in this direction: it provides language mechanisms to describe data movement through the memory hierarchy and to localize computation.

3.4 Optimizing Compilers

Traditionally, compilers were the magic instrument to achieve both good code performance and code portability. An optimizing compiler was supposed to take care of architecture-specific details and carry out the corresponding code optimizations, and possibly also automatically parallelize the code at the same time.

General purpose languages such as C/C++ or Fortran, which still are the most popular languages in the high performance computing area, have an abstraction level which is high enough so that programs can be compiled to and ported across the instruction set architectures of diverse microprocessor architectures. Code portability is provided by leaving it to the compiler to decide which optimizations work well for the target architecture. (Of course, portability is broken if architecture-specific or compiler-specific optimizations are carried out by the programmer.)

However, as hardware architectures or architecture subsystems become more specialized, traditional languages (C/C++, Fortran) become insufficient or inadequate for expressing mappings of algorithms to the hardware, unless the compiler is able to perform aggressive algorithmic transformations. Recent examples are GPUs, which are designed to be massively (data-) parallel architectures. Although NVIDIA, as an example, proposes C/C++ as the base language, a C for CUDA program translated from a C program differs significantly from the original. Also, Fortran for CUDA requires architecture-specific modification of the original code.

Although we came to recognize that more than an optimizing compiler is needed to be successful in parallel computing, the value of an optimizing compiler and the advances in compiler technology must not be underestimated. In this chapter we discuss some types of compiler optimizations and concentrate on loop transformations, which are in fact the basis for the stencil code generation framework PATUS proposed in Part II.

Depending on the requirements of the environment, there are a number of possible objectives to optimize for. In the area of high performance computing, a popular requirement is to minimize the time to solution, so the goal of the optimization is to maximize program performance. This means: maximizing the use of computational resources, minimizing the number of operations (or clock cycles spent doing the operations), and minimizing data movement, i.e., making the most efficient use of the memory bandwidth between parts of the memory hierarchy. A variant which becomes more and more relevant in high performance computing, and has always been highly relevant in embedded and mobile systems, is to optimize for performance per Watt. Embedded systems with tight memory constraints might also ask for minimizing memory usage (minimizing program size, avoiding duplication of data, etc.).

An optimizing compiler has three tasks: It needs to decide which part of the program is subject to optimization and which transformations are applicable and contribute toward the optimization objective; it needs to verify that the transformation is legal, i.e., conserves the semantics of the original program; and it finally has to actually transform the code portion [10].
Furthermore, the scope of the optimization can be a single statement ("peephole optimization"), a code block, a loop, a perfect loop nest (i.e., a loop nest in which each loop body contains only the next nested loop, except for the inner-most loop), an imperfect loop nest, or a function, or the compiler can apply interprocedural analysis and optimization, which is most expensive but yields the most gain. Current state-of-the-art compilers (e.g., Intel's C/C++ compiler as of version 8 [86], GNU gcc since version 4.1 [173], the PGI compilers [112], Microsoft's C compiler [70]) support interprocedural optimization and can deliver substantial performance gains if this option is turned on.

A typical compiler optimization does not just have a benefit associated with it; rather, it is a tradeoff between two ends. An optimization might reduce computation or increase instruction level parallelism at the cost of register usage, or remove subroutine call or control overhead at the risk of incurring instruction cache capacity misses, or increase data locality at the price of increased control overhead, or remove access conflicts by allowing increased memory usage. Also, compiler optimizations are typically dependent on each other or optimize towards opposite ends. Thus, applying optimizations independently does not guarantee that the global optimum is reached. While applying an optimization might increase the performance of a program part, applying it globally might be harmful. Pan and Eigenmann have proposed a framework, PEAK [129], to break a program into parts for which the best compiler optimization flags are determined individually by using an auto-tuning approach: the performance is measured based on a benchmark executable created from a code section, and the best combination of compiler optimization flags is determined iteratively with a search method. Indeed, we believe that the auto-tuning methodology will play an increasingly important role in compiler technology.

In the following, we give a survey of common compiler optimizations. For a more detailed overview, the reader is referred to [10] or [93].

Types of Compiler Optimizations

There is a range of basic optimizations which are always beneficial, or at least do not de-optimize the code. Such optimizations include dead and useless code removal and dead variable elimination. Partial evaluation types of optimizations (constant propagation, constant folding, algebraic simplification, and strength reduction) are also beneficial in any case. They can also act as enablers for other types of optimizations.

Many algorithms in practice tend to be limited in performance by the bandwidth to the memory subsystem. An optimizing compiler can apply a certain number of memory access transformations, which try to mitigate the problem. One of the most essential tasks in this category is register allocation, i.e., assigning variables to (the limited set of) registers such that loads and stores are minimized. Register allocation can be modeled as a graph coloring problem. Graph coloring is NP-complete, but good heuristics exist. Another aspect of memory access transformations is removing cache set conflicts, bank conflicts, or false sharing of cache lines. In current hardware architectures, e.g., the shared memory of GPUs is organized in banks, and when threads try to write to the same memory bank concurrently, the accesses are serialized, resulting in a severe performance penalty. Array padding, i.e., adding "holes" to an array at the end in unit stride direction, can eliminate or at least mitigate such problems.

As most of the execution time is typically spent in loops, loop transformations are a more valuable type of optimization, which can result in considerable performance gains. This is certainly true for stencil computations. In fact, our stencil code generation framework, which will be presented in Part II, is concerned with applying the kinds of loop transformation techniques which current state-of-the-art compilers do not perform, as they are specific to stencil computations.

There are a number of goals in terms of performance improvements that one tries to achieve by transforming loops involving computations of or access to array elements. General goals of loop transformations are:

• Increasing data locality. One transformation doing this is loop tiling. The idea is to decompose the iteration space of a loop nest into small tiles. The goal is then to choose the tile size such that the data involved in iterating over a tile fits into cache memory, thus avoiding non-compulsory transfers to and from memory further down in the memory hierarchy and optimizing cache usage. Another transformation increasing data locality is loop interchange: sometimes loops can be interchanged so that the innermost loop becomes the loop iterating over the unit stride dimension of an array (see the sketch after this list). This maximizes cache line reuse and reduces potential TLB misses.

• Increasing instruction level parallelism. By replicating the loop body, loop unrolling potentially increases instruction level parallelism and decreases loop control overhead, but, on the downside, it increases register pressure. Loop unrolling introduces a new parameter, the number of loop body replications, which has to be chosen (in an optimizing compiler at code generation time) such that instruction level parallelism is maximized under the constraint that not too many variables are spilled into slower memory and thereby degrade the performance. Software pipelining helps remove pipeline stalls. The operations in one loop iteration are broken into stages $S_1(i), \ldots, S_k(i)$, where $i$ is the index in the original loop. In the pipelined loop, one iteration performs the stages $S_1(j), S_2(j-1), \ldots, S_k(j-k+1)$. Software pipelining comes with some overhead; it requires prologue and epilogue loops (filling and draining the pipeline), each of which requires $k-1$ iterations.

• Decreasing loop control overhead. Loop fusion, i.e., combining multiple loops into one, reduces loop control overhead (and potential pipeline stalls due to branch misprediction), but increases register pressure. Loop unrolling, as mentioned before, also decreases loop control overhead.

• Decreasing computation overhead. Loop invariant code motion moves subexpressions which are invariant with respect to the loops' index variables out of the loop so that they are computed only once. Especially array address calculations (created by either the programmer or the compiler) benefit from this optimization. However, the optimization increases register pressure.

• Decreasing register pressure. Loop fission, i.e., splitting a loop into two or more loops, decreases register pressure and can thereby avoid spilling of variables into slower memory, but the loop control overhead is increased.

• Parallelization. Loop parallelization comes in two flavors: vectorization, exploiting data level parallelism; and a task parallelism-based idea of parallelization, actually assigning different iterates to different units of execution.

• Exposing parallelism. The original loop nest might not be parallelizable. For instance, vectorization may only become applicable after a loop interchange transformation. Loop skewing is a technique that reshapes the iteration space (by introducing new loop indices depending on the original ones) so that loop carried data dependences are removed.

• Enabling other types of loop transformations. Given a loop nest, some loop transformation may not be legal a priori, i.e., if applied, it would change the semantics due to loop carried data dependences. Techniques enabling such transformations include the ones mentioned in the bullet above, as well as loop splitting and loop peeling (which is a special case of loop splitting), which break a loop into several loops, e.g., to separate out a special first iteration.
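As an illustration of the data locality goal in the first bullet above, the following C sketch (illustrative code, not taken from the framework) interchanges the loops of a nest so that the innermost loop traverses the unit stride dimension of a row-major array:

    #define N 1024

    /* Before interchange: the inner loop walks down a column of the
     * row-major array a, touching a new cache line in every iteration.
     * col_sum is assumed to be zero-initialized by the caller. */
    void sum_columns_slow(double a[N][N], double col_sum[N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                col_sum[j] += a[i][j];      /* stride-N accesses */
    }

    /* After interchange: the inner loop runs over j, the unit stride
     * dimension, so consecutive iterations reuse the same cache line.
     * The interchange is legal here because, for each j, the additions
     * to col_sum[j] still happen in increasing order of i. */
    void sum_columns_fast(double a[N][N], double col_sum[N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                col_sum[j] += a[i][j];      /* stride-1 accesses */
    }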

Dependences

The basic tool for loop analysis is the notion of data dependences. Determining the legality of a loop transformation, i.e., under which circumstances, with respect to dependences, a loop transformation does not change the meaning of a program, and, as a consequence, developing tests that could recognize dependences or prove independence, was an active research area from the 1970s to the 1990s [5, 14, 26, 71, 100, 101, 105, 138, 182].

Two statements S1, S2 are data dependent if they contain a variable such that they cannot be executed simultaneously due to a conflicting use of the variable.† In particular, if a variable written by S1 is read by S2, there is a flow dependence; if a variable read by S1 is written by S2, there is an anti-dependence; and if both S1 and S2 write the same variable, there is an output dependence.

Note that anti-dependences and output dependences are not as restrictive as flow dependences: the variable in S2 could be replaced by another variable, which would remove the dependence.

For loop analysis, both inter- and intra-iteration dependences have to be captured. Dependences between iterations are called loop-carried dependences. Consider a general perfect loop nest which reads from and writes to an $m$-dimensional array a:

Listing 3.1: Dependences in a perfect loop nest.

    for i1 = l1 .. u1
      ...
        for ik = lk .. uk
          S1:  a[ f1(i1,...,ik), ..., fm(i1,...,ik) ] = ...
          S2:  ... = a[ g1(i1,...,ik), ..., gm(i1,...,ik) ]

Obviously, if S2 in the iteration $(i_1,\ldots,i_k) = (J_1,\ldots,J_k)$ depends on S1 in another iteration $(i_1,\ldots,i_k) = (I_1,\ldots,I_k)$, it must hold that

$$f_j(I_1,\ldots,I_k) = g_j(J_1,\ldots,J_k) \qquad \forall\, j = 1,\ldots,m. \tag{3.1}$$

If it can be proven that Eqn. 3.1 has an integer solution $(\hat{i}_1,\ldots,\hat{i}_k)$ such that $l_j \le \hat{i}_j \le u_j$ for $j = 1,\ldots,k$, then there is a loop carried dependence. Conversely, if no integer solution exists, the loop carries no dependences. Hence, deciding whether there are loop carried dependences amounts to solving a system of Diophantine equations, which is an NP-complete problem. Optimizing compilers resort to restricting the $f_j$ and $g_j$ to affine functions; if they are not affine, compilers err on the conservative side, and all possible dependences are assumed. Numerous number-theoretical results have been applied to prove or disprove the existence of a solution, including simple observations such as: "If $f_j$ and $g_j$ are different constant functions for some $j$, then Eqn. 3.1 cannot have a solution" (in the literature this is referred to as the "ZIV test," ZIV being the acronym for "zero index variable"), or more sophisticated GCD, Banerjee, λ, and Omega tests [93]. The problem can also be formulated as a linear integer programming problem, or Fourier-Motzkin elimination can be used (the complexity of which is double exponential in the number of inequalities). If simple tests handling special cases fail, the optimizing compiler can gradually proceed to more complex and more general tests.

A solution to Eqn. 3.1, should there be one, gives rise to the notions of distance vectors and direction vectors. For a solution $J = J(I) \in \mathbb{Z}^k$, the distance vector is defined as $d := J - I$. A direction vector is a symbolic representation of the signs of $d$.

† There is also the notion of control dependence: two statements S1, S2 are control dependent if S1 determines whether S2 is executed. This type of dependence will not be discussed in this overview.

Example 3.1: Calculating distance vectors. Consider the loop nest

    for i1 = 2 .. n
      for i2 = 1 .. n-1
        S1:  a[i1,i2] = a[i1-1,i2] + a[i1-1,i2+1]

The statement S1 has a loop carried dependence on itself. We have the following index functions for the left and right hand sides, respectively:

$$f(i_1, i_2) = (i_1, i_2)$$
$$g_1(i_1, i_2) = (i_1 - 1, i_2), \qquad g_2(i_1, i_2) = (i_1 - 1, i_2 + 1)$$

Solving $f(I_1, I_2) = g_1(J_1, J_2)$ gives $I_1 = J_1 - 1$, $I_2 = J_2$, i.e.,

$$J = (I_1 + 1, I_2),$$

and hence the first distance vector is

$$d_1 = \big((I_1 + 1) - I_1,\; I_2 - I_2\big) = (1, 0).$$

Similarly, solving $f(I_1, I_2) = g_2(J_1, J_2)$ yields the second distance vector $d_2 = (1, -1)$.

The distance vectors of the transformed loop nest give information about whether the transformation is legal: a legal distance vector $d$ must be lexicographically positive, denoted $d \succ 0$, meaning that the first non-zero entry in $d$ must be positive. (A negative component in $d$ corresponds to a dependence on an iteration of the corresponding loop with higher iteration count. If the first non-zero component was negative, this would mean that there is a dependence on a statement in a future iteration.)

There are two general models for loop transformations. The polyhedral model [74], applicable to loops with affine index functions and affine loop bounds, interprets the iteration space as a polyhedron, and loop transformations correspond to operations on or affine transformations of that polyhedron. For instance, loop tiling decomposes the polyhedron into smaller sub-polyhedra. The unimodular model associates unimodular matrices, i.e., matrices in $GL_n(\mathbb{Z})$, with loop transformations. For instance, interchanging the two loops in a doubly nested loop would be described by the matrix $\left(\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right)$.

The beauty of the latter framework is that chaining loop transformations corresponds to simple multiplication of the matrices. The matrices also describe how distance vectors are altered under the transformation, namely simply by multiplying the distance vector by the matrix describing the transformation. Thus, the legality of a transformation can be easily checked by testing the corresponding resulting vector for lexicographical positivity. However, the unimodular model is less general than the polyhedral model. For instance, loop tiling cannot be described by a unimodular matrix.

Example 3.2: Determining interchangeability of loops. Interchanging the loops of the loop nest in Example 3.1 is not legal, because the transformed second distance vector $d_2$ would not be lexicographically positive:

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} -1 \\ 1 \end{pmatrix} \not\succ 0$$

Parallelization

The following theorem answers the question under which conditions a loop within a loop nest can be parallelized. The theorem is equivalent to Theorem 3.1 in [180] and [179], and a proof can be found there.

Theorem 3.1. The $j$-th loop in a loop nest is parallelizable if and only if for every distance vector $d = (d_1,\ldots,d_k)$ either $d_j = 0$ or there exists $\ell < j$ such that $d_\ell > 0$.

Example 3.3: Determining which loops can be parallelized. In the loop nest in Example 3.1, the outer loop cannot be parallelized because the first entry of both distance vectors is 1. The inner loop, however, can be parallelized because the second entry of $d_1$ is 0, and the second condition of Theorem 3.1 holds for $d_2$ when setting $\ell = 1$.

Loop Skewing

Wolfe and Lam propose skewing as a powerful enabling transformation for parallelization [180, 179]. In fact, Wolf shows in [179] that any loop nest can be made parallelizable after applying a skewing transformation. Different hardware architectures have different requirements for the granularity of parallelism. For instance, multicore CPU systems favor coarse grained parallelism (parallel loops should be as far out as possible in a loop nest), while GPUs prefer finer grained parallelism. Skewing can transform a loop nest such that the desired granularity of parallelism is exposed. Skewing is an affine transformation of the iteration space, described in the unimodular framework for a doubly nested loop by the matrix

$$\begin{pmatrix} 1 & 0 \\ f & 1 \end{pmatrix}.$$

$f$ is the skewing factor, which has to be chosen such that the distance vectors of the loops to parallelize fulfill the parallelization condition in Thm. 3.1. Skewing is always legal, but the disadvantage is that the loop bounds become more complicated, as, e.g., originally rectangular iteration spaces are mapped to trapezoidal iteration spaces.
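To illustrate how skewing exposes parallelism, consider the following C sketch (illustrative code, not taken from the thesis). The original loop nest carries dependences in both dimensions; after skewing the iteration space along anti-diagonals (wavefronts), all points of one wavefront are independent and the inner loop can be parallelized, e.g., with OpenMP:

    #define N 1024
    #define M 1024

    static inline int imax(int a, int b) { return a > b ? a : b; }
    static inline int imin(int a, int b) { return a < b ? a : b; }

    /* Original nest: a[i][j] depends on a[i-1][j] and a[i][j-1],
     * so neither loop can be parallelized directly. */
    void sweep_sequential(double a[N][M])
    {
        for (int i = 1; i < N; i++)
            for (int j = 1; j < M; j++)
                a[i][j] = 0.5 * (a[i-1][j] + a[i][j-1]);
    }

    /* Skewed (wavefront) version: w = i + j enumerates anti-diagonals.
     * Both operands of a point on wavefront w lie on wavefront w-1,
     * so all points of one wavefront can be updated in parallel. */
    void sweep_wavefront(double a[N][M])
    {
        for (int w = 2; w <= (N - 1) + (M - 1); w++) {
            #pragma omp parallel for
            for (int i = imax(1, w - (M - 1)); i <= imin(N - 1, w - 1); i++) {
                int j = w - i;
                a[i][j] = 0.5 * (a[i-1][j] + a[i][j-1]);
            }
        }
    }

Note the trapezoidal loop bounds of the inner loop, the price of the skewing transformation mentioned above.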

Vectorization

Current CPU architectures support a Single Instruction Multiple Data (SIMD) mode, which carries out an arithmetic instruction on a vector of data rather than only on a single scalar value. In contrast to vector machines, examples of which include the Cray-1 (1976) and the Japanese "Earth Simulator" supercomputer, which was number one on the TOP500 list from June 2002 until June 2004 [161], current SIMD implementations in microprocessors feature short and fixed vector lengths: current Intel and AMD architectures implement the Streaming SIMD Extensions (SSE) instruction set; SSE registers are 128 bits wide, so that 2 double precision or 4 single precision floating point operands can be operated on simultaneously. Similarly, AltiVec [113], implemented in the PowerPC processor family, has 128 bit wide SIMD registers. The newer AVX [85] instruction set, which has 256 bit wide SIMD vectors, is supported as of Intel's Sandy Bridge and Ivy Bridge processor microarchitectures released in 2011 and 2012, and in AMD's Bulldozer released in 2011. Manycore architectures, such as GPUs or Intel's new MIC ("Many Integrated Core") architecture, tend to have wider SIMD vectors.

Vectorization is a special form of parallelization. It transforms a fixed sequence of operations which was previously performed sequentially on individual data elements into a form in which the sequence of operations is performed simultaneously on multiple data elements. An optimizing compiler tries to extract data level parallelism from a loop to leverage SIMD processing.
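For illustration, the following minimal sketch (not part of PATUS; it assumes SSE2, 16-byte aligned arrays, and an even length) expresses an element-wise vector addition explicitly with SSE intrinsics, processing two double precision values per instruction:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* c[i] = a[i] + b[i], two doubles at a time.
     * Assumes n is even and a, b, c are 16-byte aligned. */
    void vadd_sse2(double *c, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);          /* load 2 doubles */
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&c[i], _mm_add_pd(va, vb));  /* add and store  */
        }
    }

In practice, a vectorizing compiler can often generate equivalent code from the plain scalar loop, provided it can prove, or is told (e.g., via restrict qualifiers), that the arrays do not alias.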

Loop Tiling

Loop tiling is a loop transformation that breaks the original iteration space into small tiles. The transformation converts a loop or a loop nest into a loop nest with twice as many nested loops, so that the iteration space is subdivided into smaller iteration subspaces (tiles) over which the inner loop nest iterates, while the outer loop nest controls the iteration over the tiles. Tiling is done to improve data locality for one level of the memory hierarchy, or for multiple levels if applied recursively. Tiling of a pair of loops in a loop nest is legal whenever they can be interchanged legally.

Example 3.4: Rectangular Loop Tiling. This example shows the result of a tiling transformation of the rectangular iteration space $[\mathrm{min}_i, \mathrm{max}_i] \times [\mathrm{min}_j, \mathrm{max}_j] \times [\mathrm{min}_k, \mathrm{max}_k]$. The original loop nest

    for i = min_i .. max_i
      for j = min_j .. max_j
        for k = min_k .. max_k
          ...

becomes

    for ti = min_i .. max_i by tile_i
      for tj = min_j .. max_j by tile_j
        for tk = min_k .. max_k by tile_k
          for i = ti .. min(max_i, ti + tile_i - 1)
            for j = tj .. min(max_j, tj + tile_j - 1)
              for k = tk .. min(max_k, tk + tile_k - 1)
                ...

Compilers try to find the best optimizations based on static reasoning. Their real challenge is to do so without actually executing the code. There is a current research trend away from static optimization towards automatically evaluating the effect of optimizations based on actual program execution, thus determining ideal parameterizations based on benchmarks. This automatic tuning or auto-tuning methodology will be discussed further in Chapter 8.
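The following self-contained sketch illustrates the basic idea of such benchmark-driven tuning (purely illustrative; a real auto-tuner uses more sophisticated search strategies than exhaustive enumeration, and wall-clock timers rather than CPU time): a kernel with one tunable parameter is timed for each candidate value, and the fastest configuration is kept.

    #include <stdio.h>
    #include <time.h>

    #define N 2048
    static double a[N][N], b[N][N];

    /* A kernel with one tunable parameter: a tiled copy-transpose. */
    static void kernel(int tile)
    {
        for (int ti = 0; ti < N; ti += tile)
            for (int tj = 0; tj < N; tj += tile)
                for (int i = ti; i < ti + tile && i < N; i++)
                    for (int j = tj; j < tj + tile && j < N; j++)
                        b[j][i] = a[i][j];
    }

    int main(void)
    {
        const int candidates[] = { 8, 16, 32, 64, 128 };
        int best = candidates[0];
        double best_time = 1e30;

        for (int c = 0; c < (int)(sizeof candidates / sizeof candidates[0]); c++) {
            clock_t t0 = clock();
            kernel(candidates[c]);
            double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %4d: %.4f s\n", candidates[c], elapsed);
            if (elapsed < best_time) { best_time = elapsed; best = candidates[c]; }
        }
        printf("best tile size: %d\n", best);
        return 0;
    }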

3.5 Domain Specific Languages

Another approach aiming at code and performance portability across architectures is the use of domain specific languages (DSLs). As opposed to general purpose languages such as C/C++, Fortran, Java, or the new parallel languages such as Chapel and X10, domain specific languages are designed to express solutions in a very specific problem domain. Well-known examples of DSLs are the database query language SQL; Matlab, a language used for rapid prototyping of numerical algorithms; and HTML for creating websites. Akin to the current trend of specialization at the hardware level, DSLs propose specialization at the software/language level.

The effectiveness of a programming language is measured in terms of generality, productivity, and performance [108]. Generality means providing constructs rich enough so that the language can ideally be used to solve an arbitrary type of problem. Productivity means supporting idioms that are easy to use and understand and provide a high level of abstraction, e.g., through rich data types. Performance means that the language can be compiled to machine code that matches the hardware architecture and makes good use of the hardware resources.

Obviously, it is desirable to have a language that is highly effective, meeting all three of these criteria. However, such a language does not exist, as the three criteria are conflicting. Programming languages sacrifice one goal to some extent for the two others. C/C++, Fortran, Java, or even more platform-specific languages such as OpenCL are general purpose languages which let the programmer achieve good performance, but require a verbose programming style. These languages allow the programmer to express platform-specific optimizations at the cost of maintainability, reusability, portability, error-proneness, and the time spent to implement the optimizations. On the other hand, scripting languages such as Python or Ruby allow the programmer to be productive by providing expressive syntax and good library support, but in general, scripts written in such languages cannot be optimized for high performance. DSLs can take the role of languages sacrificing generality for productivity and performance [31].

Restricting a language to a narrow, well defined domain has several advantages [165]:

• DSLs are designed to be expressive and concise. They enable the domain specialist to express an algorithm or a problem in an idiom that is natural to the domain, typically on a high level of abstraction. This facilitates the maintainability of the code. It also conveys safety: it is easier to produce correct code. Sources of error might be prevented by the design of the language or by an automated analysis of the code.

• Running DSL programs on a new platform requires that the back-end infrastructure is ported once. Porting programs to a new platform therefore does not require re-writing the entire application, emphasizing portability and reusability of the code. By relying on expert knowledge of the new platform, platform-specific optimizations have to be done only once, which also results in performance portability of the application. In particular, switching from a CPU-based system to GPUs, for example, typically requires re-engineering the entire application or even re-designing algorithms to better match the new hardware platform, which can be prevented by using a DSL.

• Due to the fact that the expressiveness of a DSL is restricted to a precisely defined problem domain, domain-specific knowledge enables aggressive optimizations, which can lead to a substantial increase in performance.

On the downside, the user is required to learn a new language, and the developer pays the cost of implementing and maintaining the system, which might include creating a compiler and a programming environment (IDE, debugging facilities, profilers, etc.).

There are different approaches to mitigating these disadvantages. If the DSL is compiled to a lower-level, possibly hardware-specific programming language, there is no need for a dedicated compiler, and the general purpose compiler can be relied on to perform additional general optimizations. Instead of doing a source-to-source translation, another approach is to embed a DSL into a host language. Then, the constructs of the host language, the optimization features, and the programming environment of the host language can be reused. Also, the programmer does not need to learn an entirely new syntax. Furthermore, in the embedded DSL approach, multiple DSLs can interoperate through the host language, which enables seamless combinability of DSLs embedded into the same host language. However, the host language might not provide a direct mapping to specialized hardware. For instance, if a DSL was embedded in Java or a language that compiles to Java Bytecode, the program could not be run on a GPU.

Chafi et al. counter this problem with a concept they call language virtualization [31]. They call a language virtualizable if, among other criteria, performance on the target architecture can be guaranteed, which, as one approach, entails that the embedded program is liftable to an internal representation within the host language, i.e., the embedded program can be represented using an abstract syntax tree within the host language. From this internal representation, performance layer code in a language supported by the target architecture can be synthesized. The language of their choice, which allows this, is Scala [125]. The proposed framework, Delite [31, 126], aims at simplifying the development of DSLs targeting parallel execution. The Delite compilation stages comprise virtualization, which lifts the user program into an internal representation in which domain-specific optimizations (written by the DSL developer) are performed and which represents operations as Delite nodes, thereby mapping them to parallel patterns such as map, reduce, zip, and scan; the Delite compilation, which handles Delite-specific implementation details and performs generic optimizations; and finally the mapping of the domain-specific operations to the target hardware. Liszt [31] is a DSL approach for finite element-based simulations, and therefore loosely related to the framework presented in Part II, built on top of Delite with the goal of expressing simulation code at a high level of abstraction, independently of the parallel target machine. The framework generates code for lower level programming models (MPI and C for CUDA).

Other approaches to bridge the gap between productivity and performance have been proposed. SEJITS [30] is an approach which does a just-in-time compilation of an algorithm or a part of an algorithm specified in a high productivity language (such as a scripting language like Python) to an efficiency-level language, which maps more directly to the underlying hardware. PetaBricks [7] is a framework that tries to compose an algorithm from sub-algorithms selected by an auto-tuner, thereby guaranteeing that the final implementation is hardware-efficient and maps well to the selected hardware architecture.

3.6 Motifs

So far, we have discussed separate efforts in hardware and software/language design. Motifs provide a basis for bringing these research areas together.

Motifs were conceived in a joint effort by interdisciplinary scientists from the Lawrence Berkeley Lab and UC Berkeley [9]. They originated from an idea presented in 2004 by Phillip Colella [43], who identified seven dwarves, methods he believed would be important in the numerics related to the physical sciences for at least the next decade. The seven dwarves are perceived as "well-defined targets from [the] algorithmic, software, and architecture standpoint" [43]. They were subsequently complemented by more dwarves in discussions with experts from various fields of computer and computational sciences. The resulting 13 motifs try to capture and categorize any type of algorithm or numerical method in computer science and the computational sciences.

The main motivation for motifs is to provide a fundament for innovation in parallel computing. They can be viewed as a unified, high-level abstraction framework, spanning algorithms and methods applied in the entire range from embedded computing to high performance computing, which are relevant now and in the long term. Realizing that methods may change over time, more motifs may be added as needed. The high level of abstraction allows reasoning about the requirements of parallel applications without fixing, and thereby limiting oneself to, specific code instances solving a certain problem. Rather, freedom to explore new parallel methods and algorithms is encouraged, while knowledge from lessons learned in the past can be incorporated, e.g., which methods work well on large-scale parallel systems and which do not. Ongoing efforts aim at formalizing and systematizing the motif approach [91].

Traditionally, selecting or designing a hardware infrastructure for a software application was guided by benchmarks such as SPEC [48]. One of the hopes in the development of motifs is to facilitate the decomposition of an application into basic modules describing the computation and communication aspects of the application, which is somewhat obscured in traditional benchmark suites. Conversely, the original paper [9] also maps benchmarks from embedded and general purpose computing and application domain-specific problems to motifs.
We also view them as basic building blocks. The work proposed in this thesis tries to provide a software infrastructure capturing a subset of the structured grid motif: the class of stencil com- putations. In the following, we briefly describe the 13 motifs. The first 7 are the original seven dwarves. We try to describe the quintessential operations within a motif, characterize computation, data movement, and paralleliz- ability.

• Dense Linear Algebra. This motif covers linear algebra operations: products of vectors and matrices and basic vector and matrix operations such as transposition or calculating norms, as encapsulated by BLAS ("Basic Linear Algebra Subroutines") packages, and more sophisticated LAPACK routines such as matrix decompositions (LU, QR, Cholesky, SVD, etc.) and solvers (triangular solves, eigenvalue solvers, etc.). "Dense" refers to the storage format of vectors and matrices: the entire mathematical objects are stored in memory in a contiguous data layout. This implies that dense linear algebra motif instances have very regular data access patterns and furthermore lend themselves well to vectorization and usually also well to parallelization (e.g., a matrix-matrix multiply consists of many independent scalar products). This motif encapsulates a mathematical domain (linear algebra) and works on objects stored in one specific format.

• Sparse Linear Algebra. The operations are the same as in the dense linear algebra motif. However, objects are stored in a sparse format in which zero elements are omitted. This means that indices referencing non-zero elements must be stored as well. There are a number of established sparse matrix storage formats. The richness of the motif comes from the ways sparse matrices are stored, which leads to fundamentally different algorithmic instances of linear algebra operations depending on the storage format. Due to the irregularity and unpredictability of the sparsity pattern it is difficult to exploit the memory hierarchy, and finding a good representation of the matrix data is essential for finding a good mapping to the architecture [188, 17]. The irregularity is also a major challenge for parallelization (most notably for finding a good load balance). Again, this motif deals with one mathematical domain, but, as opposed to the dense linear algebra motif, is generalized to arbitrary data storage formats.

• Spectral Methods. The archetype of the spectral method motif is the discrete Fourier transform. The most popular algorithm for computing discrete Fourier transforms is the Cooley-Tukey algorithm [47], which reduces the time complexity of a transform of length $n$ from $O(n^2)$ to $O(n \log n)$ by using properties of the $n$-th roots of unity $e^{-2\pi i/n}$ in a recursive scheme. Due to the natural recursive structure of the algorithm, it can be decomposed into subproblems that can be solved independently; hence, the algorithm's structure can be exploited for parallelism. Furthermore, multidimensional transforms can be viewed as many independent one-dimensional transforms. One of the challenges of the Cooley-Tukey algorithm is that the array holding the input data is accessed with varying (yet pre-determinable) strides, reducing spatial data locality. There are ways around this, e.g., by reordering data elements by what corresponds to a transpose. Generally, this motif encapsulates mathematical integral transforms.

• N-Body Methods. N-Body methods summarize particle simulations in which any two particles within the system interact. Given $N$ particles, the naïve approach therefore requires $O(N^2)$ operations, rendering the computation processor bound as $N$ becomes large. This approach exhibits natural parallelism, since all particle-particle interactions occur simultaneously and can be computed independently. More sophisticated approaches (e.g., the Barnes-Hut algorithm [15] or the Fast Multipole Method [73]) reduce the compute complexity to $O(N \log N)$ or even to linear complexity. In these approaches, tree structures are needed to manage the computation, since the space is broken recursively into boxes, and thus the overall structure of the computation becomes related to the graph traversal motif. Load balancing becomes an issue in parallelization, which has to be done dynamically at runtime. This motif is specific to one particular technique of physical simulations.

• Structured Grids. The structured grid motif comprises computations updating each grid point of a structured grid with values from neighboring grid points. Typically the neighborhood structure is fixed, in which case it is called a stencil. Hence, this motif is often identified with stencil computations, which, from the point of view of the computation and data movement structure, are the main interest in this thesis. The structured grid motif naturally exposes a high degree of parallelism. (At least, natural parallelism is exposed if distinct grids are read and written; this approach is called Jacobi iteration. The other approach is to update the grid in place (Gauss-Seidel iteration), which initially is an inherently sequential approach, but remedies exist, e.g., skewing the iteration space (cf. Section 3.4) or applying a colored iteration scheme. A minimal sketch contrasting the two schemes follows after this list of motifs.) The parallelism exposed is also data level parallelism, so there is also opportunity for vectorization. However, instances of the structured grid motif are memory bound, as typically only a very limited amount of computation is performed per grid point. Yet, the data access patterns are regular and statically determinable. Motif instances are relatively easy to implement in software (when performance is not an issue) and map well to the hardware architectures of the current trend (GPUs, Intel's MIC architectures). A more complex version of the motif is found in adaptive mesh refinement [19]. This motif is related to the sparse linear algebra motif, as stencil operators are mostly considered to be linear and therefore the updates can be expressed as a multiplication of a highly regular matrix encoding the neighborhood structure (which obviously contains many redundancies due to its regularity) by a vector containing all the grid point values. This motif encapsulates a specific class of computations defined by its regular and localized structure on data stored also in a regular fashion.

• Unstructured Grids. Similarly to the structured grid motif, in the unstructured grid motif grid points are updated with values from the neighboring grid points. In contrast to the structured grid motif, the grids operated on are not given, e.g., as Cartesian grids, but, e.g., as a list of triangles from some triangulation of the model geometry. This obviously destroys the regularity and determinability of the data access pattern. The double indirection used to access grid nodes also destroys data locality. Still, the amount of parallelism is the same as in the structured grid motif, except that it does not lend itself to immediate vectorization. Depending on the desired granularity of parallelization, graph partitioning techniques are often applied for load balancing. This motif is used whenever the model geometry is represented irregularly, such as by a triangulation. It occurs in finite element solvers, for instance. It is related to the sparse linear algebra motif, as the updates (provided that they are linear) can be expressed as a multiplication of a vector containing all the grid point values by a matrix encoding the neighborhood structure. As the sparse linear algebra motif is a generalization of the dense linear algebra motif in terms of data storage, this motif is a generalization of the structured grid motif.

• MapReduce. The MapReduce motif summarizes parallel patterns, most notably the embarrassingly parallel map, i.e., the application of some function to every data element (a name borrowed from functional-style programming languages), and reduce, i.e., combining a set of data elements by some reduction operator. Map and reduce can be generalized to subsets and are arbitrarily composable. The name was coined by Google, which proposed the homonymous algorithm [54].

• Combinational Logic. The archetype of this motif is bitwise logic applied to a huge amount of data, used, e.g., when calculating CRC sums, or for encryption. It can exploit bit-level parallelism. This motif encapsulates operations on bit streams.

• Graph Traversal. Irregular memory accesses and typically only a low amount of computation characterize this motif. Its performance is limited by memory latency.

• Dynamic Programming. The idea of dynamic programming is to decompose a problem recursively into smaller (and therefore more easily solvable) subproblems. Instances of this motif include the Viterbi algorithm and methods for solving the Traveling Salesperson Problem or the Knapsack Problem. The motif encapsulates a specific algorithmic technique typically related to optimization.

• Backtrack and Branch-and-Bound. Branch-and-bound algorithms are used to find global optima in certain types of combinatorial optimization problems such as Integer Linear Programming problems, Boolean Satisfiability, or the Traveling Salesperson Problem by using a divide-and-conquer strategy, which dynamically generates parallelism in a natural way. Again, this motif encapsulates a specific algorithmic technique related to optimization.

• Graphical Models. This motif is related and very similar to graph traversal, although the intention of this motif is to highlight the probabilistic aspects found in Bayesian networks, hidden Markov models, or neural networks. Applications include speech and image recognition. So, in contrast to the more general graph traversal motif, this motif highlights the intended application domain.

• Finite State Machines. Finite state machines model a system with a finite set of states and a set of rules describing under which circumstances to transition between these states. It has been conjectured that finite state machines might be embarrassingly sequential [9]. Finite state machines have applications in electronic design automation, the design of communication protocols, and parsing (and representing languages in general). This motif encapsulates a general discrete modeling technique.
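As announced in the structured grid bullet above, the following minimal C sketch (illustrative only; a one-dimensional 3-point stencil with hypothetical function names) contrasts the two update schemes: the Jacobi sweep writes to a separate output grid and is trivially parallel, whereas the in-place Gauss-Seidel sweep carries a dependence from one iteration to the next.

    /* Jacobi iteration: reads u_old, writes u_new; all updates are independent
     * and the loop can be parallelized or vectorized directly. */
    void jacobi_sweep(double *u_new, const double *u_old, int n)
    {
        for (int i = 1; i < n - 1; i++)
            u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
    }

    /* Gauss-Seidel iteration: updates the grid in place; iteration i reads the
     * value just written in iteration i-1, making the loop inherently sequential
     * (unless the iteration space is skewed or a coloring scheme is applied). */
    void gauss_seidel_sweep(double *u, int n)
    {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }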

From the previous presentation we can see that motifs are a mélange of mathematical domains, of (physical) simulation, modeling, and algorithmic techniques, and are differentiated by data storage patterns.

In certain cases (for specific algorithms or applications) it might not be entirely clear how to categorize the candidate in question. For instance, sorting, one of the fundamental algorithmic problems in computer science, is mentioned in [9], but it is not explicitly classified. Maybe surprisingly, Quicksort is mentioned as an example of graph traversal. Clearly, Quicksort (and other sorting algorithms) are divide-and-conquer algorithms, for which no explicit motif was defined. They could be subsumed under the quite general motif of graph traversal, mapping the recursive nature of the algorithms to trees, or they could be captured by the dynamic programming motif. In the traditional sense, an algorithm is said to be a dynamic programming algorithm if subproblems overlap, and a divide-and-conquer algorithm if the subproblems are independent, which then could be interpreted as a special case of dynamic programming.

It is noteworthy that, while established parallelization styles exist for specific motifs, the concept of parallel patterns (including the high-level structural parallel pattern, the parallelism type (task parallelism, data parallelism), parallel data structures, and the implementation of parallel primitives (barriers, locks, collectives)) is orthogonal to motifs [176].

Software libraries exist for several of the motifs, at least for the initial seven dwarves, the computational science motifs. Also, the auto-tuning methodology has been applied successfully to some of the motifs. The most prominent auto-tuning frameworks within motif boundaries are:

• ATLAS [171, 172] for a subset of the dense linear algebra motif (BLAS) — one of the first frameworks to embrace the auto-tuning methodology for a scientific computing library;

• MAGMA [3, 160], bringing LAPACK and auto-tuning to multicore architectures and GPUs;

• OSKI [166] for sparse linear algebra kernels;

• FFTW [65], which is an auto-tuned framework for Fast Fourier Transforms;

• SPIRAL [139], which does auto-tuning in a more generalized setting than FFTW for many types of signal processing transforms.

Besides supporting software infrastructures for computations in the structured grid motif such as POOMA [2] and FreePOOMA [77] (which provide mechanisms to partition and parallelize over regular grids and automatically take care of ghost zones), or supporting software for actual stencil kernels, such as a set of C++ classes to guide efficient implementations of stencil-based computations using one particular parallelization scheme [154], and besides toolkits for sophisticated finite difference solvers such as the adaptive mesh refinement package CHOMBO [44], there are ongoing projects dedicated to stencil computations which also support some level of tuning [92, 99, 163, 155]. Most notably, the auto-tuning methodology is pursued by Berkeley scientists in [92]; the other projects are compiler infrastructures, further discussed in the related work section in Section 5.6. In this thesis, we propose a software framework for the core computations of the structured grid motif embracing the auto-tuning methodology, with the idea of bringing DSLs into play for user productivity and performance. The auto-tuning process is done on the basis of Strategies: hardware-architecture- and stencil-independent descriptions of the parallelization and optimization methods to be applied.

Chapter 4

Algorithmic Challenges

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. — Ada Lovelace (1815–1852)

Besides the hardware and software pillars, algorithmics is the third pillar of computer science. The increasing complexity and diversity of hardware platforms, and the explicit parallelism in particular, have severe implications for algorithm design. Traditionally, the field of algorithmics concerns itself with the theoretical development of algorithms, for which provable worst-case or average-case performance guarantees are deduced. These predictions are typically based on overly simplified, thus non-realistic, machine models such as the (P)RAM* and on an abstract description of the algorithm, which does not require an actual implementation of the algorithm.

* (P)RAM, which stands for (parallel) random access machine, is a model of a register machine with an unbounded memory, which in the parallel model is also shared, addressable, and uniformly accessible [72].

The ultimate goal of the relatively new discipline of algorithm engineering (the community even speaks of a new paradigm) consists in bridging the gap between algorithmic theory and practice.

 

    

   

   

Figure 4.1: The process of algorithm engineering gap was caused by separating design and analysis from implementation and experimentation [117]. Algorithm engineering promotes tighter inte- gration of algorithm design and analysis with the practical aspects of im- plementation and experimentation as well as inductive reasoning, and is often driven by an application. In the algorithm engineering cycle illus- trated in Fig. 4.1, design and analysis, implementation, experimentation and modeling of realistic hardware architectures are equally important concepts, and the feedback loop suggests that findings from experimen- tation potentially give rise to alterations and improvements of the model and the algorithm design for the next iteration. Thus, algorithm engi- neering, like software engineering, is not a linear process. Emphasizing practical concerns in algorithm engineering has several consequences:

• Algorithms have to be designed based on a model capturing the actual hardware characteristics including deep memory hierarchies and, more importantly, parallelism. Creating the hardware model itself is one of the challenges of algorithm engineering as it has to reflect and adapt to the technological advances in hardware design.

• Simpler algorithms and simpler data structures are favored over more complex ones. Immensely complex algorithms and data structures have emerged from classical algorithmics, as one of its goals is to improve the asymptotic run time complexity of algorithms. As a matter of fact, algorithms were designed which are so complex that they have never been implemented.

• The actual running time matters. Oftentimes, worst case complexities are too pessimistic for input data from a realistic application. Also, the big O notation can hide huge constant factors — notably in the case of many complex algorithms developed for a “good” asymptotic running time complexity — thus rendering an algorithm inefficient for practical purposes.

• Space efficiency can often be traded for time efficiency. Preprocessing can help to make a more time efficient algorithm applicable, given that auxiliary data can be stored. Of course, preprocessing is only viable if the time used for preprocessing can be compensated.

• The movement towards data-driven computation entails the necessity to process exponentially increasing volumes of data. A response to the problem is the active research in one-pass, space-efficient, (sub-)linear time, and approximation algorithms.

• Implementations of algorithms have to be engineered to be robust, efficient, as generally applicable as possible, and therefore reusable.

• Raising experimentation to a first class citizen implies that the experiments be reproducible. As in the natural sciences, experiments should be conducted in a rigorous manner, including the application of sound statistical methods for result evaluation.

Addressing the first point of the above list, which is probably the most important in the light of this thesis, we state that a parallel version of an algorithm typically is in fact a radically different algorithm, albeit one that shares the same preconditions and postconditions as the sequential one. We exemplify this on the basis of the single source shortest path problem. Let G = (V, E, w) be a weighted undirected graph with vertex set V, edge set E ⊆ V × V, and a weight function w : E → ℝ⁺. The sequential textbook algorithm to find the shortest path from a given source vertex s ∈ V to all other vertices in V is Dijkstra's algorithm [58], shown in Algorithm 4.1.

Algorithm 4.1: Dijkstra’s Single Source Shortest Path Algorithm.

Input: Graph G = (V, E ⊆ V × V, w : E → ℝ⁺), source vertex s ∈ V
Output: Path lengths {ℓ_v : v ∈ V}
1: ℓ_s ← 0
2: ℓ_v ← ∞ for v ∈ V \ {s}
3: Priority queue Q ← {(v, ℓ_v) : v ∈ V}
4: while Q ≠ ∅ do
5:   u ← Q.pop
6:   for all v ∈ Adj(u) do
7:     if v ∈ Q and ℓ_u + w(u, v) < ℓ_v then
8:       ℓ_v ← ℓ_u + w(u, v)        ▷ update length
9:     end if
10:  end for
11: end while
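For concreteness, a minimal sequential C sketch of the listing above is shown below. It is purely illustrative and not taken from this thesis: the priority queue is replaced by a simple O(|V|²) extract-min over an array, and the function name and the dense adjacency-matrix representation are assumptions.

#include <stdbool.h>
#include <float.h>

/* Illustrative sequential Dijkstra on a dense adjacency matrix w[u][v]
   (w[u][v] = DBL_MAX if there is no edge). len[v] corresponds to the
   path lengths l_v of Algorithm 4.1; the array-based extract-min stands
   in for the priority queue. */
void dijkstra(int n, const double w[n][n], int s, double len[n])
{
    bool done[n];
    for (int v = 0; v < n; v++) { len[v] = DBL_MAX; done[v] = false; }
    len[s] = 0.0;

    for (int iter = 0; iter < n; iter++) {
        /* extract-min: the not-yet-finalized vertex with the smallest length */
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!done[v] && (u < 0 || len[v] < len[u])) u = v;
        if (u < 0 || len[u] == DBL_MAX) break;
        done[u] = true;

        /* relax all edges leaving u (the for-all loop of Algorithm 4.1) */
        for (int v = 0; v < n; v++)
            if (!done[v] && w[u][v] != DBL_MAX && len[u] + w[u][v] < len[v])
                len[v] = len[u] + w[u][v];
    }
}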

The algorithm requires an efficient implementation of a priority queue Q; the time complexity depends on how “fast” the minimum element can be extracted from the queue. For instance, when choosing a binary min heap for Q, elements can be extracted in O(log |Q|) time. Then, the algorithm's time complexity is O(|E| · log |V|). The for-all loop in Algorithm 4.1 suggests that the algorithm can be easily parallelized. Note, however, that the parallelism is limited to the number of vertices adjacent to u — on average |E|/|V|. Also, an efficient concurrent implementation of a priority queue is required. By observing that the priority queue does not distinguish nodes with equal ℓ_v, the parallelism can be increased by extracting and processing all the vertices with equal ℓ_v at once. However, as the adjacency sets might overlap in this case, the queries and updates then have to be done atomically. These and some further ideas on parallelization are discussed in [72], yet from these brief observations it can be recognized that, in spite of the suggestive for-all loop, this algorithm cannot be parallelized effectively.

Finding efficient algorithms for the shortest path problem is still an active area of research: the ninth DIMACS challenge [62] held in 2006 was dedicated to this problem; one of the recent algorithms, PHAST [55], was published in 2011. However, only two [59, 104] of the twelve DIMACS challenge papers deal with parallel algorithms. Shortest path algorithms are not only of academic interest: they are key ingredients in routing (typical examples including web-based map services and navigation systems) as well as in (social) network analysis (e.g., the betweenness centrality measure, which indicates how well a vertex is connected to the network, is based on shortest paths).

The ideas and implementations presented in one of the parallel DIMACS papers [59] are based on Dijkstra's algorithm, while in the other [104] the Δ-stepping algorithm is used, which offers more opportunity for parallelization and uses a simpler and more efficient data structure than a priority queue. Vertices are filled into buckets, which group vertices with tentative path lengths in a certain range. The path lengths — and therefore the buckets — are updated as the algorithm progresses. PHAST, on the other hand, leverages preprocessing (which has to be done only once per graph) to introduce a hierarchy of shortcut paths, which are utilized in the actual algorithm. Not only has the sequential version of the algorithm been reported to be an order of magnitude faster than Dijkstra's (for a particular graph), but it is also amenable to parallelization and offers better data locality [55]. However, it only works as efficiently for certain types of graphs such as road networks.

Part II

The PATUS Approach

Chapter 5

Introduction To Patus

We may say most aptly, that the Analytical Engine weaves algebraical patterns just as the Jacquard-loom weaves flowers and leaves. — Ada Lovelace (1815–1852)

PATUS is a code generation and auto-tuning framework for the core computations of the structured grid motif: the class of stencil computations [38, 37, 39]. PATUS stands for Parallel Auto-Tuned Stencils. Its ultimate goals are to provide a means towards productivity and performance on current and future multi- and manycore platforms. The conceptual tools to reach these goals are separation and composability: PATUS separates the point-wise stencil evaluation from the implementation of stencil sweeps, i.e., the way the grid is traversed and the computation is parallelized. The goal is to be able to use any of the (non-recursive) stencil-specific bandwidth- and synchronization-saving algorithms, which will be described in Chapter 6, for any concrete stencil instance. The actual point-wise stencil computation is described by the stencil specification, and the grid traversal algorithm and parallelization scheme is described by what we call a Strategy. Also, both stencil specification and Strategy are (ideally) independent of a hardware architecture description, which, together with back-end code generators, forms the basis for the support of various hardware platforms and programming models.

These three concepts are orthogonal and (ideally) composable. The actual stencil computation is typically dictated by the application and therefore has to be implemented as a stencil specification by the PATUS user. A small domain specific language was designed for this purpose. Choosing a Strategy is independent of the specification, and also selecting the hardware platform is naturally independent of the stencil and also of the Strategy.

Thus, PATUS is an attempt at leveraging DSLs as a method for high-level abstraction specifications, implemented exemplarily for the core computations of one of the motifs, and thereby, sacrificing generality, aiming at productivity and performance. The idea of leveraging the auto-tuning methodology is to find the best Strategy, i.e., the Strategy that yields the best performance (GFlop/s, GFlop/s per Watt, or any other performance metric which can be measured in the benchmarking code), for a fixed, application-specific stencil and a fixed hardware platform — the one the code will run on¦. While Strategies are designed to be independent of the hardware architecture, they obviously have an impact on the performance as they, by their nature, define the mapping to the hardware architecture, and they contain some architecture-related optimizations. Yet, by leveraging DSLs once again, Strategies are independent of the programming model used to program the platform, and thus they also provide a basis for experimentation with parallelization schemes and optimization techniques. For the user who only wants to generate and auto-tune a code for an application-specific stencil, PATUS comes with a selection of Strategies which have proven successful in practice. Unlike in a co-design approach, we take the hardware platform as given and try to maximize the performance under these constraints. Currently, traditional CPU architectures (using OpenMP for parallelization) and NVIDIA CUDA-programmable GPUs are supported as target platforms. However, separating the stencil from the grid traversal algorithm and from platform-specific optimizations can make the tool also valuable in a hardware-software co-tuning environment.

¦ In the current state, the auto-tuner does not yet search over Strategies. Instead, Strategies are typically parametrized, and the auto-tuner tries to find the parameter configuration which yields the best performance for a given stencil and a given hardware platform.

5.1 Stencils and the Structured Grid Motif

Stencil computations are the core operation of the structured grid motif. A stencil computation is characterized by updating each point in a structured grid by an expression depending on the values on a fixed geometrical structure of neighboring grid points.

Applications of stencil computations range from simple finite difference-type partial differential equation (PDE) solvers to complex adaptive mesh refinement methods and multigrid methods. As in simple PDE solvers, finite difference adaptive mesh refinement methods use stencils to discretize differential operators. In multigrid methods, the interpolation operators (restriction, smoothing, and prolongation) are stencil operations. Stencil computations also occur in multimedia applications, for instance as filters in image processing such as edge detection, blurring, or sharpening.

A structured grid can be viewed as a graph, but unlike the graphs considered in the unstructured grid motif or the two graph motifs, the only admissible type of graphs in the structured grid motif is such that each interior vertex has the same neighborhood structure and, in particular, the same number of edges.

Although, in theory, the node valence, i.e., the number of a vertex's outgoing edges, can be arbitrary, the most common case is rectilinear grids with 2d edges per interior node, where d is the dimensionality of the grid. The reason for this choice is the resulting simplicity of the underlying mathematics (e.g., the discretization of differential operators) and that the method then can be easily mapped to a programming model and programming language, and, ultimately, to the hardware. However, it is instructive to note that, while certain topological structures — planar surfaces, cubes, or cylinders and tori by gluing appropriate boundaries — can be readily discretized with rectilinear grids, others cannot. For instance, a rectilinear discretization of the surface of a sphere would lead to problems at the poles; an iteratively refined icosahedral mesh would be more suitable in that it would provide a uniform discretization of the sphere. A nice overview is given in [176]. In this work, we restrict ourselves to rectilinear grids.
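The easy mapping to a programming language mentioned above essentially amounts to linearizing the d-dimensional index space. A minimal C sketch of this convention for a 3D rectilinear grid with one ghost layer per face is given below; the macro name and the layout are illustrative assumptions, not necessarily the layout PATUS generates.

/* A rectilinear (X+2) x (Y+2) x (Z+2) grid (one ghost layer per face),
   stored contiguously with x as the unit-stride direction.
   u[IDX(x, y, z, X, Y)] then addresses the grid point (x, y, z). */
#define IDX(x, y, z, X, Y) ((x) + ((X) + 2) * ((y) + ((Y) + 2) * (z)))

/* With this linearization, the 2d = 6 axis-parallel neighbors of an interior
   point are reached by constant index offsets:
     x +/- 1  :  +/- 1
     y +/- 1  :  +/- (X + 2)
     z +/- 1  :  +/- (X + 2) * (Y + 2)
   which is what makes rectilinear grids map so directly to array code. */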

5.1.1 Stencil Structure Examples

The geometrical structure of the actual stencil is interrelated with the grid topology. Although the grid topology, once fixed, dictates the neighborhood structure, there is still freedom to choose the neighbors contributing to the stencil. Table 5.1 shows some examples of typical stencil structures defined on rectilinear grids. These stencil structures all occur in practice when discretizing basic differential operators (upper row) or have been taken from operators found within applications. The images show other aspects in which stencil computations can differ: although all the stencils shown are defined on a rectilinear grid, the values defined over the grid nodes can have different data structures. In simple finite difference-type PDE solvers, stencils represent discretized versions of differential operators, i.e., a stencil is a representation of a differencing operator. Table 5.1 (a) – (c) show representations of finite difference discretizations of three basic differential operators, the Laplacian, the divergence, and the gradient operators. Typical finite difference approximations of these operators in Cartesian coordinates on a uniform grid with mesh size h are shown in Table 5.2. Obviously, the structures of these operators differ: the Laplacian maps functions to functions, the divergence operator maps a vector field to a function and, vice versa, the gradient operator maps a function to a vector field. Therefore, the stencil representations of these operators have structurally different inputs and outputs: the discrete Laplacian operates on a scalar-valued grid, and the result is written back to a scalar-valued grid (the same or a different grid, depending on the grid traversal strategy). The divergence operator takes a vector-valued input grid and writes to a scalar-valued output grid. Conversely, the gradient has a scalar-valued grid as input and a vector-valued grid as output. Table 5.1 (d) visualizes the stencil of the hyperthermia application (cf. Chapter 11.1), which requires many additional coefficients; (e) and (f) are higher-order stencils, which depend on more than the immediate neighboring grid points. The stencils are inspired by stencils occurring in COSMO [64], a weather prediction code. The Wave stencil in (g) visualizes the stencil of an explicit finite difference solver for the classical wave equation, which will be used in Chapter 5.2 as an illustrative example for the usage of PATUS. Unlike the other stencils in Table 5.1, it depends on two previous time steps instead of just one. Sub-figure (h) finally visualizes the 2D stencil from the edge detection and the Game of Life kernels.

(a) Laplacian: 3D 7-point stencil, scalar → scalar
(b) Divergence: 3D 6-point stencil, vector → scalar
(c) Gradient: 3D 6-point stencil, scalar → vector
(d) Hyperthermia: 3D 7-point stencil, scalar + 9 coefficients → scalar
(e) 6th order Laplacian: 3D 19-point stencil, scalar → scalar
(f) Tricubic interpolation: 3D 64-point stencil, scalar → scalar
(g) Wave: 3D 13-point stencil, scalar → scalar, depending on 2 time steps
(h) Edge detection / Game of Life: 2D 9-point stencil, scalar → scalar

Table 5.1: Some examples of typical stencil structures.

Operator      Notation & Cartesian representation       Finite difference approximation
Gradient      ∇ = (…, ∂/∂x_i, …)                         (…, (u(x + h·e_i) − u(x − h·e_i)) / (2h), …)
Divergence    ∇· = Σ_i ∂/∂x_i                            Σ_i (u_i(x + h·e_i) − u_i(x − h·e_i)) / (2h)
Laplacian     Δ := ∇·∇ = Σ_i ∂²/∂x_i²                    (1/h²) Σ_i (u(x + h·e_i) − 2u(x) + u(x − h·e_i))

Table 5.2: Basic differential operators, their Cartesian representation, and a finite difference approximation evaluated on a function u, which is scalar-valued for the gradient and the Laplacian, and vector-valued for the divergence operator.


Image Processing

In image processing applications, stencil computations occur in kernels based on convolution matrices, which assign each pixel in the result image the weighted sum of the surrounding pixels of the corresponding pixel in the input image. The matrix contains the weights, which stay constant within a sweep. Fig. 5.1 shows two examples of image processing kernels and the corresponding convolution matrices: a blur kernel based on the Gaussian blur

    G(x, y) = 1/(2πσ²) · exp( −(x² + y²) / (2σ²) )

with σ = 5 (but cut off to fit into a 5 × 5 matrix) and an edge detection kernel [25].

Figure 5.1: Two examples of image processing filters (original image, blur filter, edge detection) and the corresponding convolution matrices. The blur filter uses the matrix

    1/1000 · [ 36 39 40 39 36
               39 42 43 42 39
               40 43 44 43 40
               39 42 43 42 39
               36 39 40 39 36 ],

and the edge detection filter uses

    [ 1   2  1
      2 -12  2
      1   2  1 ].
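To make the connection to stencils explicit, the following C fragment sketches a convolution filter as a constant-coefficient 2D stencil sweep. It is illustrative only and not PATUS-generated code; the 3 × 3 kernel size, the single-channel image layout with a one-pixel ghost border, and the function name are assumptions.

/* Apply a 3x3 convolution matrix k to a single-channel image `in`
   of size (W+2) x (H+2) (one ghost pixel per border), writing the
   interior of `out`. Row-major layout, x is the unit-stride index. */
void convolve3x3(int W, int H, const float* in, float* out, const float k[3][3])
{
    int stride = W + 2;
    for (int y = 1; y <= H; y++)
        for (int x = 1; x <= W; x++) {
            float s = 0.0f;
            for (int j = -1; j <= 1; j++)        /* stencil rows    */
                for (int i = -1; i <= 1; i++)    /* stencil columns */
                    s += k[j + 1][i + 1] * in[(x + i) + stride * (y + j)];
            out[x + stride * y] = s;
        }
}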

Cellular Automata

Another application of stencil operations are cellular automata. Conceptually, a cellular automaton is an infinite discrete grid of cells, each of which takes one of finitely many states at any given time within a discrete time line. In each time step, a set of rules is applied to each cell, depending on its state and the state of its neighbors. Thus, the global state of the cellular automaton is evolved over time from an initial configuration. The term was coined in the 1960s by von Neumann [144], who was working on models for self-replicating biological systems. Cellular automata have been applied by physicists and biologists as a discrete modeling technique. Cellular automata were broadly popularized by Gardner, who described Conway's Game of Life, a cellular automaton devised in the 1970s [69].

In the Game of Life, each cell — also called an organism in this context — can be either live or dead. Starting from an initial configuration the following “genetic” rules are applied:

• Survival. A live cell with two or three live neighbors remains live.

• Death. A live cell dies from overpopulation if there are more than three live neighbors, or from isolation if there are fewer than two live neighbors.

• Birth. A dead cell becomes live if it has exactly three live neighbors.

Obviously, the rules are nearest neighbor operations, and, in 2D, they can be formulated as a 9-point stencil. Currently, PATUS does not support conditionals, but for the Game of Life they can be converted to an arithmetic expression, which lets PATUS emulate the cellular automaton (cf. Fig. 5.2), albeit inefficiently: if live cells are represented by the value 1 and dead cells by a value close to 0, then the number of live neighboring cells at time step t is

    L := u[x−1, y−1; t] + u[x, y−1; t] + u[x+1, y−1; t] + u[x−1, y; t] + u[x+1, y; t] + u[x−1, y+1; t] + u[x, y+1; t] + u[x+1, y+1; t],

by making use of the fact that on a computer 1 + α = 1 for a sufficiently small α. The rules could then be encoded as the arithmetic stencil expression

    u[x, y; t+1] := 1 / (1 + C · (u[x, y; t] + L − 3) · (L − 3)),

where C ≫ 1 is some large number. The observation is that the polynomial (u[x, y; t] + L − 3)(L − 3) has zeros for L = 3 (i.e., if the number of live neighbors is 3) and, since u[x, y; t] ∈ {0, 1}, for (u[x, y; t], L) ∈ {(0, 3), (1, 2)}. The function ξ ↦ 1/(1 + Cξ) maps 0 to 1 and any other ξ ∈ ℤ to a number close to 0. Hence, in summary, if the number of live neighbors is 3 (i.e., L = 3), u[x, y; t+1] will become 1, regardless of the value of u[x, y; t], in accordance with the second part of the survival rule and the birth rule. If u[x, y; t] = 1 and the number of live neighbors is 2 (L = 2), u[x, y; t+1] will also become 1, in accordance with the first part of the survival rule. In all other cases the organism dies (u[x, y; t+1] ≈ 0).

Figure 5.2: Four iterations of the Game of Life.
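As a sketch, one sweep of this arithmetic emulation could be written in C as follows. It is illustrative only, not the code PATUS generates; the grid layout and the concrete value chosen for the constant C are assumptions.

/* One Game-of-Life sweep using the arithmetic encoding above.
   u and unew are (W+2) x (H+2) grids with a ghost border; live = 1.0f,
   dead = a value close to 0. C is some large constant. */
void game_of_life_sweep(int W, int H, const float* u, float* unew)
{
    const float C = 1.0e6f;
    int stride = W + 2;
    for (int y = 1; y <= H; y++)
        for (int x = 1; x <= W; x++) {
            int c = x + stride * y;
            /* L = number of live neighbors (approximately, since dead ~ 0) */
            float L = u[c - stride - 1] + u[c - stride] + u[c - stride + 1]
                    + u[c - 1]                          + u[c + 1]
                    + u[c + stride - 1] + u[c + stride] + u[c + stride + 1];
            float p = (u[c] + L - 3.0f) * (L - 3.0f);  /* zero iff the cell lives */
            unew[c] = 1.0f / (1.0f + C * p);
        }
}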

5.1.2 Stencil Sweeps

We call the application of a stencil computation to all the interior points in a grid a stencil sweep, or simply, a sweep. Oftentimes, when conducting scientific simulations modeled by PDEs, it is of interest how the observables evolve over time. Such time-de- pendent problems are discretized in time, and advancing the states by one time step corresponds to carrying out one stencil sweep. Examples for such simulations are the applications discussed in Chapter 11. The 5.1. STENCILS AND THE STRUCTURED GRID MOTIF 69

Hyperthermia cancer treatment planning application is a simulation that tries to predict the temperature distribution within the human body dur- ing the treatment, which involves applying heat to the body at the tumor location. The second application is an explicit finite difference solver to conduct earthquake simulations. Another characteristic of the stencil motif is the way the grid is tra- versed. If there are distinct input and output sets — sets of nodes on which data is only read from and other sets of nodes to which data is only written to within one sweep — the order in which the nodes are processed is irrelevant. This type of procedure is called the Jacobi iter- ation. This is the problem structure which we limit ourselves to in this work. However, if, while sweeping through the grid, the same node set is used for both read and write accesses, the order in which the nodes are visited becomes relevant, as the method then prescribes which part of the stencil uses the new, updated values for the evaluation and which part still operates on the old values. Such a method is called a Gauss- Seidel iteration. Such a method is hard to parallelize. One common vari- ant which lends itself to parallelization is a red-black Gauss-Seidel: every other node is colored in the same color and in the first sweep all the red nodes are updated while the black nodes are only read, and in the second sweep the roles of the nodes are exchanged. This is akin to the homony- mous methods in numerical linear algebra [143] to solve sparse systems of linear equations.

5.1.3 Boundary Conditions

Another concern is the handling of nodes at boundaries (i.e., nodes with a neighborhood structure differing from the “standard” structure). Because of the different structure, the stencil might not be applicable to boundary nodes; neighboring nodes might be missing. Also, the continuous mathematical model for the computation might require to handle boundary nodes differently and apply problem-specific boundary conditions. For instance, the simplest case (Dirichlet boundary conditions) is to keep the value on the boundary constant. Other applications might require that the flux through the boundary is conserved (Neumann boundary conditions), or, when modeling waves, a common requirement is that waves are absorbed instead of reflected at the domain boundary (absorbing boundary conditions, e.g., perfectly matched layers [18], since typically only a small subdomain of the otherwise infinite domain is modeled). In this work we choose not to treat boundaries specially for the time being. Instead, since the boundary can be viewed as a sub-manifold, we suggest that a special, separate stencil is defined, implementing the boundary handling.

5.1.4 Stencil Code Examples

In code, a simple, basic stencil computation is shown in Algorithm 5.1. The stencil, a 3D 7-point Laplacian, which depends on all the nearest neighbors parallel to the axes, is evaluated for all the interior grid points indexed by integer coordinates [1, X] × [1, Y] × [1, Z] ∩ ℤ³. The center point is denoted by u[x, y, z; t]. x, y, z are the spatial coordinates, t is the temporal coordinate.

Algorithm 5.1: A Laplacian stencil sweeping over [1, X] × [1, Y] × [1, Z].

1: procedure STENCIL-LAPLACIAN
2:   initialize u[x, y, z; 0] and boundaries u[{0, X+1}, {0, Y+1}, {0, Z+1}; t]
3:   for t = 0 .. T − 1 do                               ▷ iterate over time domain
4:     for z = 1 .. Z do                                 ▷ iterate over spatial domain
5:       for y = 1 .. Y do
6:         for x = 1 .. X do
7:           u[x, y, z; t+1] ← α · u[x, y, z; t] + β · ( ▷ evaluate
8:             u[x−1, y, z; t] + u[x+1, y, z; t] +       ▷ stencil
9:             u[x, y−1, z; t] + u[x, y+1, z; t] +
10:            u[x, y, z−1; t] + u[x, y, z+1; t])
11:        end for
12:      end for
13:    end for
14:  end for
15: end procedure

In practice, however, it is hardly necessary to save every time step. Hence, we can use only two grids in the computation, old and new, as shown in Algorithm 5.2. After each sweep, the roles of old and new are exchanged (line 14), and the values computed in the previous sweep are again accessed from the old grid in the current sweep.

Algorithm 5.2: Modified Laplacian stencil.

1: procedure STENCIL-LAPLACIAN
2:   initialize old[x, y, z] (including boundary points)
3:   for t = 0 .. T − 1 do                               ▷ iterate over time domain
4:     for z = 1 .. Z do                                 ▷ iterate over spatial domain
5:       for y = 1 .. Y do
6:         for x = 1 .. X do
7:           new[x, y, z] ← α · old[x, y, z] + β · (     ▷ evaluate
8:             old[x−1, y, z] + old[x+1, y, z] +         ▷ stencil
9:             old[x, y−1, z] + old[x, y+1, z] +
10:            old[x, y, z−1] + old[x, y, z+1])
11:        end for
12:      end for
13:    end for
14:    swap (new, old)                                   ▷ swap pointers to new and old
15:  end for
16: end procedure
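In plain C, Algorithm 5.2 might look as follows. This is a straightforward, unoptimized sketch (the grid layout with one ghost layer per face and the function name are assumptions), essentially the kind of naïve baseline code that the following chapters improve upon.

#include <stddef.h>

#define IDX(x, y, z, X, Y) ((x) + ((X) + 2) * ((y) + ((Y) + 2) * (size_t)(z)))

/* T Jacobi sweeps of the 3D 7-point Laplacian stencil; grids have one
   boundary layer per face. On return, *pold points to the result grid. */
void laplacian_sweeps(int X, int Y, int Z, int T, float** pold, float** pnew,
                      float alpha, float beta)
{
    float *old = *pold, *new_ = *pnew;
    for (int t = 0; t < T; t++) {
        for (int z = 1; z <= Z; z++)
            for (int y = 1; y <= Y; y++)
                for (int x = 1; x <= X; x++) {
                    size_t c = IDX(x, y, z, X, Y);
                    new_[c] = alpha * old[c] + beta * (
                        old[IDX(x-1, y, z, X, Y)] + old[IDX(x+1, y, z, X, Y)] +
                        old[IDX(x, y-1, z, X, Y)] + old[IDX(x, y+1, z, X, Y)] +
                        old[IDX(x, y, z-1, X, Y)] + old[IDX(x, y, z+1, X, Y)]);
                }
        float* tmp = old; old = new_; new_ = tmp;   /* swap (new, old) */
    }
    *pold = old; *pnew = new_;
}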

5.1.5 Arithmetic Intensity

The arithmetic intensity [131] is a metric which is often used for analyzing and predicting the performance of algorithms and compute kernels.

Definition 5.1. The arithmetic intensity of an algorithm or a compute kernel is defined as the number of floating point operations used for that algorithm divided by the number of data elements that have to be transferred to the compute unit(s) in order to carry out the computation.

Example 5.1: Arithmetic intensity of a matrix-matrix multiplication.

Multiplying two n × n matrices equates to evaluating n² scalar products, each of which requires O(n) floating point operations. The data that need to be brought into memory are the two matrix factors: a data volume of O(n²) data elements. Thus, the arithmetic intensity is O(n³) / O(n²) = O(n).

From a hardware architecture point of view, stencil computations are attractive because of the regularity of the data access pattern. This allows streaming in the data from the main memory to the compute elements, unlike in the unstructured grid motif, where logically related data can be far apart in memory. Yet, due to their typically low arithmetic intensity, the performance of typical structured grid motif kernels is limited by the available memory bandwidth: the number of floating point operations per grid point is constant and is typically low compared to the — also constant — number of memory references. Thus, stencil computations have a constant arithmetic intensity with respect to the problem size, unlike, e.g., a matrix-matrix multiplication, whose arithmetic intensity scales with the problem size as shown in Example 5.1. The upper bounds for the arithmetic intensities of the stencils shown in Table 5.1 are summarized in Table 5.3. The numbers given in Table 5.3 are actually upper bounds on the ideal arithmetic intensities, which are calculated only from compulsory data transfers. Viewed globally, we need to bring the data of the entire d-dimensional domain required for the computation to the compute elements, and write the result back. In addition to the computed data, the data volume that is read has to include the halo layer. Thus, for instance, for the Laplacian, on a domain in which N grid points are computed, we need to bring N + O(d·N^((d−1)/d)) points† (including the halo points) to the compute elements (or into fast memory) and write back N points. Equivalently, (1/N) · ((N + O(d·N^((d−1)/d))) + N) = 2 + O(d·N^(−1/d)) data elements per computed element need to be transferred. In Table 5.3 we neglect the extra term, thus ignoring the extra boundary data that has to be brought in. Therefore the values shown as arithmetic intensities are really upper bounds which cannot be reached in practice. To compute the divergence operator, we have to read 3 grids instead of just one, and write back one grid, hence the “4” in the “number of data elements” column in Table 5.3. Conversely, to compute the gradient, one grid has to be read and 3 grids written back, etc.

† A face of a hypercube-shaped (rectilinear) discrete d-dimensional domain containing N points has approximately N^((d−1)/d) points, and a d-dimensional hypercube has 2d faces, which accounts for the factor d. The latter can be seen inductively from the construction of a (d+1)-dimensional hypercube from two d-dimensional ones: each of the two d-dimensional hypercubes is one face of the new (d+1)-dimensional one, and each of the 2d faces (by induction hypothesis) gives rise to a face in the (d+1)-dimensional hypercube by connecting the (d−1)-dimensional faces of each of the copies of the d-dimensional hypercubes to d-dimensional sub-hypercubes: the faces in the new (d+1)-dimensional hypercube.

Kernel                 Flops   # Data Elements   Arithmetic Intensity
Laplacian                 8            2                  4.0
Divergence                8            4                  2.0
Gradient                  6            4                  1.5
Hyperthermia             16           11                  1.4
6th order Laplacian      22            2                 11.0
Tricubic                318            5                 63.6
Wave                     19            2                  9.5
Blur                     31            2                 15.5
Edge Detection           10            2                  5.0
Game of Life             13            2                  6.5

Table 5.3: Flop counts, compulsory numbers of data elements to transfer, and arithmetic intensities (Flops per transferred data element) for the stencils shown as examples in Table 5.1.

There is also a hardware-related, local-view justification for the assumption that by loading the center point of a stencil the neighboring points are made available automatically: in hardware, data is loaded by blocks rather than by individual elements. For instance, by reading one element on a cache-based CPU architecture, in fact an entire cache line of data is read. The typical shared memory loading pattern on CUDA-programmed GPUs also follows this pattern. The arithmetic intensity is a kernel-specific metric. Its analogue for a hardware platform is sometimes called the compute balance. The compute balance is defined as the number of (floating point) operations the machine can do per transferred data element. To measure realistic compute balances, we could measure the sustained performance of a machine, e.g., by running the LINPACK benchmark or benchmarking a matrix-matrix multiply with two large matrices (or some other compute bound kernel), and divide the number by the available bandwidth, measured, e.g., using the STREAM benchmark [106]. The compute balances for the machines used for our benchmarks in Part III are shown in Table 5.4‡.

‡ The performance numbers were measured by benchmarking a single precision matrix-matrix multiply (SGEMM) of two large square matrices (8192 × 8192) using the MKL on the Intel platform, the ACML on the AMD platform, and CUBLAS on the GPU, respectively, on all available cores. The performance of a double precision matrix-matrix multiply (DGEMM) is half of that of SGEMM. Note that the compute balance in Flops per data element does not change when the data type is altered, as the bandwidth in number of data elements per second is halved when switching from single to double precision. (A single precision number has a length of 4 Bytes, a double precision number requires 8 Bytes.) The bandwidths of the Intel and AMD CPU systems were measured with the STREAM benchmark [106], and the bandwidth of the NVIDIA Fermi C2050 GPU was measured using NVIDIA's bandwidth benchmark with ECC turned on.

Architecture                       Sust. Performance   Sust. Bandwidth   Compute Balance
Intel Xeon E7540 (Nehalem)           155 GFlop/s          35.0 GB/s       17.7 Flop/datum
AMD Opteron 6172 (Magny Cours)       309 GFlop/s          53.1 GB/s       23.3 Flop/datum
NVIDIA Tesla C2050 (Fermi)           618 GFlop/s          84.4 GB/s       29.3 Flop/datum

Table 5.4: Sustained single precision peak performances, sustained DRAM bandwidths, and compute balances for the hardware platforms used in the benchmarks in Part III.

Intuitively, one might argue that if the arithmetic intensity of a kernel is less than the compute balance of a machine, the kernel is memory bound, and if the arithmetic intensity is greater than the compute balance, the kernel is compute bound. Asymptotically, the statement is true, but today's complex architectures introduce subtleties which call for sophistication of the model. In [176], Williams develops the roofline model, which, based on the arithmetic intensity, provides a means to reflect the impact of certain hardware-specific code optimizations. The model allows one to determine which memory-specific or compute-specific optimizations (NUMA awareness, software prefetching, vectorization, balancing multiply and add instructions, etc.) have to be carried out so that the kernel, given its arithmetic intensity, runs most efficiently on a given hardware platform. Note that the notion of the compute balance should even be differentiated with respect to the memory hierarchy, since the bandwidths to the different levels in the memory hierarchy differ — by orders of magnitude. Hence, in reality, there are DRAM, L3, L2, L1, and register compute balances. In Table 5.4 only the DRAM compute balance is shown.
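The basic statement of the roofline model can be written down in a few lines. The following C helper is a simplified sketch that captures only the DRAM roofline and none of the optimization ceilings of the full model; the function name and units are chosen for illustration.

/* Simplified roofline: attainable performance in GFlop/s for a kernel with
   arithmetic intensity `ai` (Flops per transferred data element) on a machine
   with sustained peak `peak_gflops` and bandwidth `bw_gelems` (giga data
   elements per second, i.e., GB/s divided by the element size in Bytes). */
double roofline_gflops(double ai, double peak_gflops, double bw_gelems)
{
    double mem_bound = ai * bw_gelems;             /* memory-bandwidth limit  */
    return mem_bound < peak_gflops ? mem_bound     /* memory bound            */
                                   : peak_gflops;  /* compute bound           */
}

For the 7-point Laplacian with arithmetic intensity 4.0 (Table 5.3) on the Intel system of Table 5.4, the bandwidth of 35.0 GB/s corresponds to 8.75 giga single-precision elements per second, so the bound evaluates to 4.0 · 8.75 = 35 GFlop/s, far below the 155 GFlop/s sustained peak; the kernel is clearly memory bound.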

5.2 A Patus Walkthrough Example

In this section, we give an example of how a PATUS stencil specification can be derived from a mathematical problem description. From this stencil specification, the PATUS code generator generates a C code which implements the compute kernel, along with a benchmarking harness that can be used to measure the performance of the generated code and also shows how the kernel is called from within a user code.

5.2.1 From a Model to a Stencil

Consider the classical wave equation on Ω = [−1, 1]³ with Dirichlet boundary conditions and some initial condition:

    ∂²u/∂t² − c²·Δu = 0    in Ω,
    u = g                  on ∂Ω,                                           (5.1)
    u|_{t=0} = f.

We use an explicit finite difference method to discretize the equation both in space and time. For the discretization in time we use a second-order scheme with time step δt. For the discretization in space we choose a fourth-order discretization of the Laplacian Δ on the structured uniformly discretized grid Ω_h with step size h. This discretization gives us

    (u^(t+δt) − 2·u^(t) + u^(t−δt)) / δt² − c²·Δ_h u^(t) = 0,               (5.2)

where Δ_h is the discretized version of the Laplacian:

    Δ_h u^(t)(x, y, z) = −15/(2h²) · u^(t)(x, y, z)                         (5.3)
        − 1/(12h²) · ( u^(t)(x − 2h, y, z) + u^(t)(x, y − 2h, z) + u^(t)(x, y, z − 2h) )
        + 4/(3h²) · ( u^(t)(x − h, y, z) + u^(t)(x, y − h, z) + u^(t)(x, y, z − h) )
        + 4/(3h²) · ( u^(t)(x + h, y, z) + u^(t)(x, y + h, z) + u^(t)(x, y, z + h) )
        − 1/(12h²) · ( u^(t)(x + 2h, y, z) + u^(t)(x, y + 2h, z) + u^(t)(x, y, z + 2h) ).

Substituting Eqn. 5.3 into Eqn. 5.2, solving Eqn. 5.2 for u^(t+δt), and interpreting u as a grid in space and time with mesh size h and time step δt, we arrive at the stencil expression

    u[x, y, z; t+1] = 2·u[x, y, z; t] − u[x, y, z; t−1] + c²·δt²/h² · (
        −15/2 · u[x, y, z; t]
        − 1/12 · ( u[x − 2, y, z; t] + u[x, y − 2, z; t] + u[x, y, z − 2; t] )
        + 4/3 · ( u[x − 1, y, z; t] + u[x, y − 1, z; t] + u[x, y, z − 1; t] )
        + 4/3 · ( u[x + 1, y, z; t] + u[x, y + 1, z; t] + u[x, y, z + 1; t] )
        − 1/12 · ( u[x + 2, y, z; t] + u[x, y + 2, z; t] + u[x, y, z + 2; t] ) ),

which is visualized in Table 5.1 (g). To actually solve the discretized equation, we need to specify the mesh size h and the time step δt or, equivalently, the number of grid points N and the number of time steps t_max. Choosing a concrete number for the number of time steps, we can transform the above equation almost trivially into a PATUS stencil specification:

Example 5.2: A PATUS stencil specification.

The listing below shows the PATUS stencil specification for the classical wave equation discretized by 4th order finite differences in space and by 2nd order finite differences in time. Note that the maximum N in the domain size specification is inclusive.

stencil wave
{
    domainsize = (1 .. N, 1 .. N, 1 .. N);
    t_max = 100;

    operation (float grid u, float param c2dt_h2)
    {
        u[x, y, z; t+1] = 2 * u[x, y, z; t] - u[x, y, z; t-1] +
            c2dt_h2 * (
                -15/2 * u[x, y, z; t] +
                4/3 * (
                    u[x+1, y, z; t] + u[x-1, y, z; t] +
                    u[x, y+1, z; t] + u[x, y-1, z; t] +
                    u[x, y, z+1; t] + u[x, y, z-1; t]
                )
                - 1/12 * (
                    u[x+2, y, z; t] + u[x-2, y, z; t] +
                    u[x, y+2, z; t] + u[x, y-2, z; t] +
                    u[x, y, z+2; t] + u[x, y, z-2; t]
                )
            );
    }
}

5.2.2 Generating The Code

We feed this stencil specification as an input to PATUS, which will turn it into C code. PATUS expects two other input files: a template defining how the code will be parallelized and how code optimizations will be applied, e.g., how loop tiling/cache blocking is applied. In PATUS lingo, this is called a Strategy. The other input is a description of the hardware architecture. It defines which code generation back-end to use (e.g., the OpenMP paradigm for shared memory CPU systems, or NVIDIA C for CUDA for NVIDIA GPUs), and how arithmetic operations and data types are mapped to the corresponding vector intrinsics and vector data types, for instance.

Example 5.3: Generating the C code from a stencil specification.
java -jar patus.jar codegen --stencil=wave.stc --strategy=cacheblocking.stg --architecture="arch/architectures.xml,Intel 64 SSE" --outdir=output

The command in Example 5.3 will create an implementation for the stencil kernel specified in the file wave.stc and a benchmarking harness for that kernel, and place them in the directory output. The Strategy chosen is a cache blocking strategy defined in the file cacheblocking.stg, which comes with the PATUS software. The hardware architecture for which the code will be generated is specified by the identifier Intel x86 64 SSE, the definition of which can be found in the architecture definition file, arch/architectures.xml. After running the PATUS code generation, the directory output will contain the following files:

• kernel.c — The implementation of the stencil kernel defined in wave.stc.

• driver.c — The benchmarking harness invoking the stencil kernel and measuring the time for the stencil call. It allocates and initializes data with arbitrary values and (by default) validates the result returned by the stencil kernel by comparing it to a naïve sequential implementation.

• timer.c — Functions related to timing and calculating the performance.

• patusrt.h — Header file for the timing functions.

• cycle.h — Functions for counting clock cycles to do the timing. (This code was borrowed from FFTW [65].)

• Makefile — A GNUmake Makefile to build the benchmarking harness.

The benchmarking harness then can be built by typing make on the command line. This will compile and link the generated code and produce the benchmarking executable, by default called bench.

5.2.3 Running and Tuning

The benchmark executable requires a number of parameters to run:

Example 5.4: Starting the benchmark executable.
chrmat@palu1:wave> ./bench
Wrong number of parameters. Syntax:
  ./bench <N> <cb x> <cb y> <cb z> <chunk> <unroll p3>

N corresponds to the N used in the stencil specification for the definition of the domain size. If additional identifiers had been used to define the domain size, they would appear as parameters to the executable as well. The cb x, cb y, cb z, and chunk arguments come from the Strategy. They specify the sizes of the cache blocks in x, y, and z directions and the number of consecutive cache blocks assigned to one thread. PATUS unrolls the inner-most loop nest containing the stencil evaluation. unroll p3 selects one of the unroll configuration code variants: by default, PATUS creates code variants with the loop nest unrolled once (i.e., no unrolling is done) or twice in each direction. Since the example wave stencil is defined in 3 dimensions, there are 2³ = 8 code variants with different unrolling configurations. In Example 5.5, the benchmark executable was run with a domain size of 200³ grid points and an arbitrary cache block size of 16 × 16 × 16 grid points per block, one block in a packet per thread, and the 0th loop unrolling configuration.

Example 5.5: Running the benchmark executable with arbitrary parameters.
chrmat@palu1:wave> ./bench 200 16 16 16 1 0
Flops / stencil call:  19
Stencil computations:  40000000
Bytes transferred:     509379840
Total Flops:           760000000
Seconds elapsed:       0.230204

Performance:           3.301418 GFlop/s
Bandwidth utilization: 2.212731 GB/s
506450156.000000
Validation OK.

The benchmark executable prints the number of floating point operations per stencil evaluation, the total number of stencil evaluations that were performed for the time measurement, the number of transferred bytes and the total number of floating point operations, the time it took to complete the sweeps, and the calculated performance in GFlop/s and the bandwidth utilization. The number below that is a representation of the time spent in the compute kernel, on which the auto-tuner bases the search. (It tries to minimize that number.) The string “Validation OK” says that the validation test (against the naïve, sequential implementation) was passed, i.e., the relative errors did not exceed certain bounds.

The idea is that the user chooses the problem size, N, but all the other parameters, which do not change the problem definition, but rather implementation details which potentially affect the performance, are to be chosen by the auto-tuner. In the current state of the software, the user still needs some knowledge of how to find the performance-specific parameters (in this case cb x, cb y, cb z, chunk, unroll p3). An idea for future work is to encapsulate this knowledge in the Strategy, which defines the parameters, so that an auto-tuner configuration script can be generated in order to fully automate the auto-tuning process.

We choose N = 200. Experience tells us that in cases with relatively small domain size (such as N = 200), a good choice for cb x is N. There are several reasons why it is a bad idea to cut the domain in the x direction — which is the unit stride direction — i.e., choosing cb x < N. Reading data in a streaming fashion from DRAM is faster than jumping between addresses. Utilizing full cache lines maximizes data locality. And the hardware prefetcher is most effective when the constant-stride data streams are as long as possible. In Example 5.6 we let the auto-tuner choose cb y, cb z, chunk, and unroll p3 by letting cb y and cb z range from 4 to N = 200 in increments of 4 (4:4:200), and we want the auto-tuner to try all the powers of 2 between 1 and 16 (1:*2:16!) for the fifth command line argument, chunk, and to try all the values between 0 and 7 (0:7!) for unroll p3, by which all the generated loop unrolling variants are referenced. The exclamation mark specifies that the corresponding parameter is searched exhaustively, i.e., the auto-tuner is instructed to visit all of the values in the range specified.

Example 5.6: Running the PATUS auto-tuner.
java -jar patus.jar autotune ./bench 200 200 4:4:200 4:4:200 1:*2:16! 0:7! "C((\$1+\$3-1)/\$3)*((\$1+\$4-1)/\$4)>=$OMP_NUM_THREADS"

The auto-tuner also allows the user to add constraints, such as the expression in the last argument of the call in Example 5.6. Constraints are preceded by a C and can be followed by any comparison expression involving arithmetic. Parameter values are referenced by a number preceded by a dollar sign $; the numbering starts with 1. In Example 5.6, the condition reads ($1 + $3 − 1)/$3 · ($1 + $4 − 1)/$4 ≥ T. The sub-expression ($1 + $3 − 1)/$3 is the number of blocks in the y-direction, using the integer division (which rounds towards zero) to express the missing ceiling function§. Similarly, ($1 + $4 − 1)/$4 is the number of blocks in the z-direction. Thus, the condition demands that the total number of blocks must be greater than or equal to the number of threads executing the program (assuming the environment variable $OMP_NUM_THREADS is set and controls how many OpenMP threads the program uses). Adding constraints is optional, but they can reduce the number of searches as they restrict the search space. In this case, we exclude the configurations in the search space of which we know that they result in bad performance, but constraints can also be a helpful tool to suppress invalid configurations. For instance, the number of threads per block on a GPU must not exceed a particular number lest the program fail. (The numbers are 512 threads per block on older graphics cards, and 1024 threads per block on Fermi GPUs.)

§ For positive integers a, b and the integer division (defined by a ÷ b := ⌊a/b⌋), it holds that ⌈a/b⌉ = (a + b − 1) ÷ b.

Example 5.7: Output of the auto-tuner.
./bench 200 200 4:4:200 4:4:200 1:*2:16! 0:7!
Parameter set { 200 }
Parameter set { 200 }
Parameter set { 4, 8, ...}
Parameter set { 4, 8, ...}
Parameter set , Exhaustive { 1, 2, 4, 8, 16 }
Parameter set , Exhaustive { 0, 1, 2, 3, 4, 5, 6, 7 }
Using optimizer: Powell search method
Executing [./bench, 200, 200, 4, 4, 1, 0]......

200 200 36 160 1 3 1.5052755E8

Program output of the optimal run:
Flops / stencil call:  19
Stencil computations:  40000000
Bytes transferred:     509379840
Total Flops:           760000000
Seconds elapsed:       0.068421
Performance:           11.107680 GFlop/s
Bandwidth utilization: 7.444774 GB/s
150527550.000000
Validation OK.

5.3 Integrating into User Code

By default, PATUS creates a C source file named kernel.c (The default setting can be overridden with the --kernel-file command line option, cf. Appendix A.) This kernel file contains all the generated code variants of the stencil kernel, a function selecting one of the code variants, and an initialization function, initialize, which does the NUMA-aware data initialization and should preferably be called directly after allocating the data (cf. Chapter 6.4.2).

Example 5.8: The generated stencil kernel code for the example wave stencil. The generated stencil kernel and the data initialization function have the following signatures:

void wave (float** u_0_1_out,
    float* u_0_m1, float* u_0_0, float* u_0_1,
    float c2dt_h2,
    int N,
    int cb_x, int cb_y, int cb_z, int chunk,
    int _unroll_p3);

void initialize (
    float* u_0_m1, float* u_0_0, float* u_0_1,
    float dt_dx_sq,
    int N,
    int cb_x, int cb_y, int cb_z, int chunk);

The selector function, which is named exactly as the stencil in the stencil specification (wave in Example 5.8), is the function which should be called in the user code. Its parameters are:

• pointers to the grid arrays, which will contain the results on exit; these are double pointers and marked with the out suffix;

• input grid arrays, one for each time index required to carry out the stencil computation; e.g., in the wave equation example, three time indices are required: the result time step t 1, which depends on the input time steps t and t ¡ 1; the time index is appended to the grid identifier as last suffix; the m stands for “minus”;

• any parameters defined in the stencil specification, e.g., c2dt h2 in the wave equation example;

• all the variables used to specify the size of the problem domain; only N in our example;

• Strategy- and optimization-related parameters, the best values of which were determined by the auto-tuner and can be substituted into the stencil kernel function call in the user code; the strategy used in the example has one cache block size parameter, (cb x, cb y, cb z) and a chunk size chunk; furthermore, unroll p3 determines the loop unrolling configuration.

The output pointers are required because PATUS rotates the time step grids internally, i.e., the input grids change roles after each spatial sweep. Thus, the user typically does not know which of the grid arrays contains the solution. A pointer to the solution grid is therefore assigned to the output pointer upon exit of the kernel function.

The initialization function can be modified to reflect the initial condition required by the problem. However, the generated loop structure should not be altered. Alternatively, a custom initialization can be done after calling the initialization function generated by PATUS.
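To illustrate the integration, the following C fragment sketches how the generated wave kernel of Example 5.8 might be driven from a user code. The grid allocation size (and thus the assumed halo width), the parameter values passed for c2dt_h2 and dt_dx_sq, and the reuse of the tuned configuration from Example 5.7 are assumptions made for illustration; this is not code shipped with PATUS.

#include <stdlib.h>

/* Signatures as generated into kernel.c (cf. Example 5.8). */
void wave (float** u_0_1_out, float* u_0_m1, float* u_0_0, float* u_0_1,
           float c2dt_h2, int N, int cb_x, int cb_y, int cb_z, int chunk,
           int _unroll_p3);
void initialize (float* u_0_m1, float* u_0_0, float* u_0_1, float dt_dx_sq,
                 int N, int cb_x, int cb_y, int cb_z, int chunk);

int main (void)
{
    int N = 200;
    /* parameter configuration found by the auto-tuner (cf. Example 5.7) */
    int cb_x = 200, cb_y = 36, cb_z = 160, chunk = 1, unroll = 3;

    /* three time levels; (N+4)^3 points per grid is an assumption here,
       corresponding to two ghost layers per face for the 4th-order stencil */
    size_t npts = (size_t) (N + 4) * (N + 4) * (N + 4);
    float* u_m1 = malloc (npts * sizeof (float));
    float* u_0  = malloc (npts * sizeof (float));
    float* u_1  = malloc (npts * sizeof (float));
    float* u_result = NULL;

    /* NUMA-aware initialization; set the actual initial condition afterwards */
    initialize (u_m1, u_0, u_1, 0.1f, N, cb_x, cb_y, cb_z, chunk);

    /* run the stencil kernel; on exit, u_result points to the grid holding the result */
    wave (&u_result, u_m1, u_0, u_1, 0.1f, N, cb_x, cb_y, cb_z, chunk, unroll);

    /* ... use u_result ... */
    free (u_m1); free (u_0); free (u_1);
    return 0;
}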

5.4 Alternate Entry Points to Patus

As the auto-tuner module is decoupled from the code generating system, it can be used as a stand-alone auto-tuner for other codes besides the ones created by PATUS. If the user has a hand-crafted existing parametrized code for which the best configurations need to be determined, the PATUS auto-tuner is a perfectly valid option to find the best set of parameters. The only requirements are that the tunable parameters must be exposed as command line parameters and that the program must print the timing information to stdout as last line of the program output in an arbitrary time measurement unit. Note that the auto-tuner tries to minimize this number, i.e., it must be a time measurement rather than a performance number.
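As a sketch of what such a hand-crafted program has to provide, consider the following minimal C example (entirely hypothetical and unrelated to PATUS-generated code): it exposes one tunable parameter on the command line and prints the measured time as the last line of its output.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy kernel whose runtime depends on a tunable block size. */
static void kernel (int n, int block)
{
    static volatile double sink;
    double s = 0.0;
    for (int i = 0; i < n; i += block)
        for (int j = i; j < i + block && j < n; j++)
            s += (double) j * 1.0e-9;
    sink = s;
}

int main (int argc, char** argv)
{
    if (argc < 2) { fprintf (stderr, "usage: %s <block>\n", argv[0]); return 1; }
    int block = atoi (argv[1]);

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    kernel (100000000, block);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + 1.0e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf ("block size: %d\n", block);
    printf ("%f\n", seconds);   /* timing as the last output line: minimized by the auto-tuner */
    return 0;
}

Such a program could then be driven analogously to Example 5.6, e.g., with java -jar patus.jar autotune ./toy 1:*2:1024! (assuming the executable is called toy).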

5.5 Current Limitations

In the current state, there are several limitations to the PATUS framework:

• Only shared memory architectures are supported (specifically: shared memory CPU systems and single-GPU setups).

• It is assumed that the evaluation order of the stencils within one spatial sweep is irrelevant. Also, all points within the domain are always traversed per sweep. One grid array is read and another array is written to. Such a grid traversal is called a Jacobi iteration. In particular, this rules out schemes with special traversal rules such as red-black Gauss-Seidel iterations.

• No extra boundary handling is applied. The stencil is applied to every interior grid point, but not to boundary points, i.e., the boundary values are kept constant. This corresponds to Dirichlet boundary conditions. To implement other boundary conditions, they could be factored into the stencil operation by means of extra coefficient grids. In the same way, non-rectilinear domain shapes and non-uniform grids can be emulated by providing the shape and/or geometry information encoded as coefficients in additional grids. Alternatively, special (d−1)-dimensional stencil kernels could be specified for the boundary treatment, which are invoked after the d-dimensional stencil kernel operating on the interior. This approach, however, will not allow the use of temporal blocking schemes.

• The index calculation assumes that the stencil computation is carried out on a flat grid (or a grid which is homotopic to a flat grid). In particular, currently no cylindrical or toroidal geometries are implemented, which require modulo index calculations.

• There is no support for temporally blocked schemes yet.

5.6 Related Work

There are a number of research projects related to PATUS. The Berkeley stencil auto-tuner [92] aims at converting Fortran 95 stencil code automatically into a tuned parallel implementation. It generalizes prior work by Williams [176] and Datta [53], who identified potential optimizations for stencil computations and implemented these optimizations manually or with simple code generation scripts specific to a selected stencil. Auto-tuning is applied to select the best blocking, loop unrolling, and software prefetching method configuration. The emphasis of the work of Kamil et al. in [92] is on a fully automated translation and tuning process, including the replacement of the original code by the tuned version. In the user program, stencil loops are annotated and thereby marked as such. The stencil expressions are then parsed and transformed into an abstract syntax tree (AST), on which the optimizations are performed as manipulations of the abstract syntax tree. From the abstract syntax tree, C, Fortran, and CUDA code can be generated. Currently, similarly to PATUS, the Berkeley auto-tuner supports compiler-type optimizations: loop unrolling, cache blocking, arithmetic simplification, and constant propagation. In the future, better utilization of SIMD, common subexpression elimination, cache bypass, and software prefetching will be integrated. The parallelization is based on domain decomposition (blocking), and optimizations relevant on parallel machines (in particular, NUMA-awareness) are addressed. Auto-tuning involves building a hardware-specific benchmarking harness for the extracted loops. In [92] only exhaustive search is mentioned as the search method, which is applied after the search space is pruned based on hardware-specific heuristics. For instance, only block sizes are chosen such that the maximum number of threads is used. Speedups of around 2× over naïve implementations on traditional CPU systems for selected stencil kernels (gradient, divergence, and Laplacian) are reported.

Pochoir [155] is a stencil compiler developed at MIT. The theoretic foundation is derived from a generalization of Frigo's and Strumpen's cache oblivious stencil algorithm [66, 67], cf. Chapter 6. This implies that Pochoir is designed for iterative stencil computations which execute many sweeps atomically. The aforementioned generalization consists in allowing hyperspace cuts, i.e., cutting multiple spatial dimensions at once, while the original papers [66, 67] only did one space cut. This increases parallelism, yet does not impair the cache efficiency of the original algorithm. The stencils are specified in a DSL embedded into C++, which offers the advantage that existing tool chains and IDEs can be used. Additionally, it allows Pochoir to also offer a non-optimized template-based implementation of the stencil which can be compiled without the Pochoir compiler. This feature simultaneously provides the mechanism for error checking and debugging. Similarly to PATUS, the stencil operation is specified in a per-point Pochoir kernel function. Since the Pochoir DSL is embedded into C++, the kernel function can contain arbitrary C++ code. Pochoir also provides a clean mechanism for handling boundary conditions. However, the stencil shape needs to be specified explicitly. The Pochoir compiler is a source-to-source translator, which applies the cache oblivious algorithm and parallelization (threading) using Cilk++ to the stencil specified in the Pochoir EDSL. Optionally, auto-tuning can be used to find the best trapezoid base size, i.e., the size of the trapezoids for which the recursion stops. Currently only one specific algorithm is supported (which requires multiple time steps so that it becomes beneficial), but the Pochoir EDSL was designed in a way flexible enough that it could be exchanged with other back-ends. The authors report large speedups (> 10×) compared to naïvely implemented code for a 2D 5-point stencil on large data sets (5000 × 5000 grid points) and an appropriately large number of time steps (5000).

Other frameworks for dynamically time-blocked stencil computations include CORALS [153] by Strzodka et al., and newer work by the team [152], and the stencil C++ class framework described in [154]. CORALS and other temporal blocking algorithms such as wave front parallelization [170] are described in more detail in Chapter 6.

Mint [163] targets NVIDIA GPUs as hardware platforms and translates traditional, but annotated, C code to CUDA C. The goal is to increase programmer productivity by introducing pragmas similar to those of OpenMP, but keeping them to a limited set of intuitive annotations (parallel for data parallel execution of the annotated code section, for for parallelizing a loop nest and doing the data decomposition, barrier for a global synchronization point, single if only one thread of execution (either the host or one GPU thread) is to execute the annotated section, and copy for copying data into and out of the GPU memory). The programmer is not required to have CUDA C knowledge, and knowledge of CUDA-specific optimizations in particular. The advantage of the pragma approach is that the code can be compiled and run on conventional hardware since a standard compiler ignores any Mint-specific pragmas.
Mint establishes a simple data parallel model, which a priori is not limited to stencil computations; but if the pragmas are inserted into C stencil code programmed in a straightforward fashion, the Mint stencil optimizer carries out domain-specific optimizations when stencil computations are recognized. The optimizations include finding a good data decomposition and thread block layout, usage of shared memory (inferring memory requirements by analyzing the stencil structure), and the use of GPU registers. The authors report that Mint reaches around 80% of the performance of aggressively hand-optimized CUDA code. Other CUDA code generators use the circular queue time blocking method to increase the arithmetic intensity, and elaborate register reuse schemes for data locality optimizations [110].

The STARGATES project [135] is somewhat different in that it is a code generator not only for stencil computations, but for entire simulations using finite difference discretizations of PDEs both in space and time. The finite difference solvers are built from a DSL, a mathematical description of a PDE. Modular back-ends generate code for the Intel Threading Building Blocks, MPI, and CUDA.

Panorama [99, 149] was a research compiler for tiling iterative stencil computations in order to minimize cache misses; the authors established a relation between the skewing vector and cache misses, and a model predicting performance and cache misses, respectively.

More general approaches, not limited to stencil computations, consider tiling of perfectly and imperfectly nested loops in the polyhedral model [141]. Loop transformation and (automatically parallelizing) compiler infrastructures in the polyhedral model include CHiLL [80] and PLuTo [23].

Automatic parallelization, code optimizations, and auto-tuning are also being factored into existing solver packages. For instance, CHOMBO [8, 44] is an adaptive mesh refinement finite difference solver framework, which has been used, e.g., for hydrodynamics simulations. Currently, some auto-parallelization and auto-tuning efforts are undertaken within the software [36].

Chapter 6

Intermezzo: Saving Bandwidth And Synchronization

There are many ways in which it may be desired in special cases to distribute and keep separate the numerical values of different parts of an algebraical formula; and the power of effecting such distributions to any extent is essential to the algebraical character of the Analytical Engine. — Ada Lovelace (1815–1852)

6.1 Spatial Blocking

As we showed previously, the performance of stencil computations (i.e., the number of stencil computations per time unit) is typically limited by the available bandwidth to the memory subsystem. Hence we are interested in maximizing the use of the available bandwidth or in enhancing the arithmetic intensity to increase performance.

The key idea in reducing memory transfers from and to slower memory in the memory hierarchy is to reuse the data that have already been transferred to the faster and closer memory — e.g., the cache memories on a CPU system, or the shared memory on NVIDIA GPUs.

Assume we are given a 3D 7-point stencil dependent only on its immediate neighbors along the axes. Then, if the values for a fixed $z = z_0$ have been loaded into the fast memory and they still reside in fast memory when moving to the next plane $z = z_0 + 1$, all of these data can be reused for the computation of the stencil on the points $(x, y, z_0 + 1)$, which depend on the data points $(x, y, z_0)$. However, if domain sizes are large, it is likely that the points $(x, y, z_0 - 1)$ or even $(x, y, z_0)$ have been evicted from the fast memory before they can be reused in iteration $z = z_0 + 1$, and they have to be transferred once more.
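To get a feeling for the sizes involved, a rough back-of-the-envelope estimate (the grid size below is purely illustrative): for an $N \times N$ plane of double-precision values, keeping the three planes $z_0 - 1$, $z_0$, $z_0 + 1$ resident requires about
$$3 \cdot N^2 \cdot 8\ \mathrm{B} \;=\; 3 \cdot 512^2 \cdot 8\ \mathrm{B} \;=\; 6\ \mathrm{MiB} \qquad \text{for } N = 512,$$
which already exceeds typical per-core L2 caches and is on the order of a shared last-level cache, so without further measures the planes are evicted before they can be reused.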

This can be prevented by working only on small blocks of the domain at a time, with the block sizes chosen such that a sub-plane of data can be kept in fast memory until it is reused, i.e., by tiling or blocking the spatial loop nest. Indeed, tiling has been a well-known method to improve temporal data locality [142, 181].
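As a concrete illustration of such spatial blocking, the following C sketch blocks a 3D 7-point Laplacian-like sweep. This is a hand-written minimal example, not PATUS-generated code; the array layout, the unit coefficients, and the block sizes BX, BY, BZ are illustrative assumptions.

    #include <stddef.h>

    /* u_in and u_out are (NX x NY x NZ) grids including a one-point halo;
     * BX, BY, BZ are the cache block sizes to be tuned. */
    #define IDX(x, y, z) ((size_t)(z) * NY * NX + (size_t)(y) * NX + (x))

    void sweep_blocked(const double *u_in, double *u_out,
                       int NX, int NY, int NZ, int BX, int BY, int BZ)
    {
        for (int zz = 1; zz < NZ - 1; zz += BZ)
          for (int yy = 1; yy < NY - 1; yy += BY)
            for (int xx = 1; xx < NX - 1; xx += BX)
              /* one cache block: chosen small enough that the planes z-1, z, z+1
               * of the block remain resident in cache while z advances */
              for (int z = zz; z < (zz + BZ < NZ - 1 ? zz + BZ : NZ - 1); ++z)
                for (int y = yy; y < (yy + BY < NY - 1 ? yy + BY : NY - 1); ++y)
                  for (int x = xx; x < (xx + BX < NX - 1 ? xx + BX : NX - 1); ++x)
                    u_out[IDX(x, y, z)] =
                          u_in[IDX(x - 1, y, z)] + u_in[IDX(x + 1, y, z)]
                        + u_in[IDX(x, y - 1, z)] + u_in[IDX(x, y + 1, z)]
                        + u_in[IDX(x, y, z - 1)] + u_in[IDX(x, y, z + 1)]
                        - 6.0 * u_in[IDX(x, y, z)];
    }

Parallelizing over the outer block loops (e.g., with an OpenMP pragma on the zz loop) then yields the synchronization-free parallel sweep discussed below.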

Usually it is beneficial to apply this idea recursively with decreasing block sizes to account for the multiple layers in the memory hierarchy (L2 cache, which is possibly shared among multiple cores, per core L1 cache, and registers).

The idea also lends itself to parallelization. In a shared memory environment, introducing a layer of spatial blocking in which each thread is assigned a distinct subset of blocks allows the stencil sweep to be executed in parallel without any synchronization points within the sweep, since the stencil computation on one point is completely independent of the calculation of other points.

Fig. 6.1 shows a possible hierarchical decomposition of a cube-shaped domain into blocks. In level 1, each thread is assigned a thread block. Thread blocks are executed in parallel. Each thread block is decomposed into cache blocks to optimize for highest level cache data locality. When choosing the cache block sizes, there are a number of facts that have to be kept in mind. Typically, the highest level cache is shared among multiple threads (e.g., on an AMD Opteron Magny Cours the six cores on a socket share a 6 MB L3 cache, and each core has its own unified 512 KB L2 cache and its own 64 KB L1 data cache). It is not necessary that the cache holds all of the data in the cache block. It is sufficient if the neighboring planes of the plane being calculated fit into the cache. To avoid loading a plane twice, thread blocks should not be cut in the z direction if this can be avoided.

Figure 6.1: Levels of blocking: decomposition of the domain into thread and cache blocks for parallelization and data locality optimization.

6.2 Temporal Blocking

In iterative or time-dependent schemes, the stencil kernel does multiple sweeps over the entire grid. If intermediate results are of no interest, we can think of not only blocking the stencil computation in space, but also in the time dimension, i.e., blocking the outermost loop and thereby saving intermediate write-backs and loads — thus increasing temporal data locality — as well as reducing the synchronization overhead. Instead of loading, writing, and synchronizing in every time step, these costly operations are only performed every t_local time steps. In this section several ways of parallelizing and handling the data dependences are discussed based on the idea of blocking in the temporal dimension. As temporal blocking methods reduce the number of data transfers, they increase the arithmetic intensity — by a factor of O(t_local).

Although temporal blocking schemes have a high potential to increase performance — both data transfer operations in a bandwidth limited setting and synchronization are expensive operations — there are disadvantages. Firstly, the application must not request the state of the solution after every time step. Secondly, in order to integrate a temporally blocked compute kernel into an existing code, it typically has to be re-engineered, since the outermost loop has to be altered. Also, since boundary conditions have to be applied after every time step, temporal blocking schemes require adaptation to incorporate the boundary conditions dictated by the application, and therefore they cannot be used in a plug-and-play fashion.

The PATUS software framework proposed in this thesis was designed to support temporal blocking schemes conceptually. However, the focus of the work is on generating stencil kernels that can easily be integrated into existing applications, and at the time of writing the PATUS code generator is still missing some essential parts to support temporal blocking schemes, which are planned to be completed in the future.

6.2.1 Time Skewing

Time skewing is a loop transformation which essentially modifies the iteration order such that temporal data locality is increased: by interchanging the temporal loop to an inner position, values computed in previous time steps can be reused through the entire iteration in time. The method requires that the number of time steps is known at runtime; alternatively, the temporal iteration space can be blocked such that a constant number of time steps is performed per time block.

To keep the illustration of the method simple, consider a 1D 3-point stencil ϕ. We assume that the values u[x; 0] at time 0 as well as the boundary values u[0; t], u[X+1; t] are given and set.

Listing 6.1: A 1D example stencil.

1: for t = 0 .. T - 1
2:     for x = 1 .. X
3:         u[x; t+1] = ϕ(u[x-1; t], u[x; t], u[x+1; t])

The t and the x loops cannot be interchanged due to data dependences: solving the equations $f(x_1, t_1) = g_j(x_2, t_2)$ with $f(x_1, t_1) = (x_1, t_1 + 1)$ and $g_j(x, t) = (x + j, t)$ for $j = -1, 0, 1$, as detailed in 3.4, we find the distance vectors
$$D = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right\}.$$

In the unimodular framework, loop interchange is represented by the matrix $\left(\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right)$. Multiplying the third distance vector by this matrix gives a new distance vector $(-1, 1)^\top$, which is not lexicographically positive, which in turn means that interchanging the loops is not legal.

However, we can skew the inner loop first and find a skewing factor such that interchanging becomes legal. The problem is to find $f \in \mathbb{Z}$ such that
$$\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}_{\text{interchange}} \underbrace{\begin{pmatrix} 1 & 0 \\ f & 1 \end{pmatrix}}_{\text{skewing}} D = \left\{ \begin{pmatrix} f+1 \\ 1 \end{pmatrix}, \begin{pmatrix} f \\ 1 \end{pmatrix}, \begin{pmatrix} f-1 \\ 1 \end{pmatrix} \right\} \overset{!}{\succ} 0.$$

Clearly, the condition is fulfilled for $f = 1$. Hence, after skewing, the loop index $x$ is replaced by $x' := t + x$, as
$$\begin{pmatrix} t' \\ x' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} t \\ x \end{pmatrix} = \begin{pmatrix} t \\ t + x \end{pmatrix},$$
and the loop nest takes the following shape:

Listing 6.2: Skewing the spatio-temporal loop nest.

1: for t' = 0 .. T - 1
2:     for x' = t' + 1 .. t' + X
3:         u[x'-t'; t'+1] = ϕ(u[x'-t'-1; t'], u[x'-t'; t'], u[x'-t'+1; t'])

Interchanging the t' and x' loops is now legal, and we arrive at the loop nest shown in the next listing. Care has to be taken in adjusting the loop bounds, however. Since x' = t' + x, for x' we now have 0 + 1 ≤ x' ≤ (T − 1) + X. The more complicated bounds for the inner t' loop are derived by intersecting the original loop ranges.

Listing 6.3: Exchanging temporal and spatial loops.

1: for x' = 1 .. X + T - 1
2:     for t' = max(0, x' - X) .. min(x' - 1, T - 1)
3:         u[x'-t'; t'+1] = ϕ(u[x'-t'-1; t'], u[x'-t'; t'], u[x'-t'+1; t'])
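A C version of this skewed and interchanged loop nest could look as follows. This is a minimal sketch under simplifying assumptions: u stores all T + 1 time levels, the boundary columns x = 0 and x = X + 1 are pre-set for every time level, and phi is an arbitrary example 3-point stencil; a realistic implementation would only keep the sliding window of Fig. 6.3.

    #include <stddef.h>

    static double phi(double l, double c, double r)
    {
        return 0.25 * l + 0.5 * c + 0.25 * r;   /* arbitrary example weights */
    }

    /* u has (T+1) rows of (X+2) values each: u[(X+2)*t + x]. */
    void skewed_sweeps(double *u, int X, int T)
    {
        #define U(x, t) u[(size_t)(X + 2) * (t) + (x)]
        for (int xs = 1; xs <= X + T - 1; ++xs) {          /* xs = x'          */
            int tlo = xs - X > 0 ? xs - X : 0;             /* max(0, x' - X)   */
            int thi = xs - 1 < T - 1 ? xs - 1 : T - 1;     /* min(x'-1, T-1)   */
            for (int t = tlo; t <= thi; ++t) {
                int x = xs - t;                            /* x = x' - t'      */
                U(x, t + 1) = phi(U(x - 1, t), U(x, t), U(x + 1, t));
            }
        }
        #undef U
    }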

Fig. 6.2 shows the order of the iterates before skewing (a) and after skewing (b) for X = 4 and T = 3. Without skewing, one spatial sweep is completed before proceeding to the next time step. After skewing, the spatio-temporal iteration space is traversed in diagonal lines shown in red. Note that the traversal order shown in (b) is indeed legal since the dependences, which are indicated by arrows, are respected.

Figure 6.2: Traversal order before (a) and after (b) skewing. The numbers show the order in which the data points are evaluated, the arrows symbolize the data dependences.

The impact of time skewing on data locality is shown in Fig. 6.3. Assuming that only the values in the last time step are of interest, the intermediate values, shaded in gray, can be discarded as soon as they cannot be reused any longer. The outer iteration proceeds along the horizontal axis. One value, marked in blue and labeled “load”, is loaded from main memory. Having all the values in the area colored in blue in cache (in [107], McCalpin and Wonnacott propose a temporary array to hold these values), the values across all time steps symbolized with red dots and labeled “compute” can be computed in sequence, walking diagonally from bottom to top. Note that only the values in the blue sliding window are needed to satisfy the dependences doing one sweep along the skewed t-axis. As soon as the inner sweep is completed, the oldest, leftmost values are not needed any longer and can be overwritten in the next iteration. The dark red “computed values” are written back to main memory.

The method also works in higher dimensions by interpreting the dots in Figs. 6.2 and 6.3 as lines (for 2D stencils) or planes (for 3D stencils) of data, which are loaded and computed atomically. This approach does skewing in just one direction, which is the approach taken in [107, 184, 185]. Skewing can also be done in all directions [89].

Time skewing can be parallelized [107, 152]. The idea is illustrated in Fig. 6.4. Note that the parallelization cannot be straightforward, i.e., the outer loop cannot be a do-all loop, since no matrix $T \in \mathrm{GL}_2(\mathbb{Z})$ exists such that the first components of all the transformed distance vectors are zero,

$$T \cdot \left\{ \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\} \overset{!}{=} \left\{ \begin{pmatrix} 0 \\ * \end{pmatrix} \right\},$$
which, by Theorem 3.1, would mean that the outer loop were parallelizable. (The first component of the products is only zero if the first row of T is zero, but then, clearly, $T \notin \mathrm{GL}_2(\mathbb{Z})$.) This example shows that finding a good parallelization automatically is far from trivial. Note that the inner loop is parallelizable by Theorem 3.1, so an auto-parallelizing compiler would pick this loop for parallelization. However, because of the high synchronization overhead, the performance will likely be inferior to the performance of the method described below.

Figure 6.3: Temporal data locality in time skewing: many stencil computations are done when only one point is loaded from main memory. The only values that need to be kept in fast local memory are symbolized by the points within the blue area.

The entire spatio-temporal iteration space is subdivided into blocks in the time dimension. In Fig. 6.4 only one time block is shown. A time block of the iteration space is then divided among the processes along the black diagonal lines. Due to this particular decomposition of the iteration space, each process can start computing without any dependences on values held by other processes. Per process the same strategy applies as described above, with only the data within the blue sliding window area being kept in fast memory. The computation of the values within the red trapezoidal area on process p depends on values calculated by process p + 1. Therefore process p has to wait until the values within the dark blue area on process p + 1 have been computed, or, in a distributed memory setting, have been communicated to and received by process p. Note that if the processes work in a sufficiently synchronous manner and the data blocks in the spatial dimension (along the horizontal axis) are sufficiently large, the values on process p + 1 required by process p, being the first values that are computed, are already available when process p reaches the dependence area.

Figure 6.4: Parallel time skewing. A time block is divided along the black diagonal lines among the processes. As in the sequential case, each processor only needs to cache the values within the blue sliding window area. Before computing the values in the red area, process p must wait for process p + 1 to complete the computation of the values in the dark blue area.

Figure 6.5: The parallel circular queue method. Each process computes a number of time steps independently and without synchronizing, at the cost of redundant computation at the artificial boundaries.

6.2.2 Circular Queue Time Blocking

The idea behind the circular queue method is to use a more regular subdivision for parallelization compared to the one used in the parallel time skewing described above. Instead of skewing the spatial iteration space, overlapping “pyramids” of data are computed: starting from a base tile at local time step 0, the tiles of computable values shrink with each time step due to the data dependences, as shown in Fig. 6.5. The uppermost tiles computed in the last local time step have to be aligned such that the spatial domain is seamlessly tiled, which means that there are redundant computations at the inner artificial boundaries [177, 51]. This is the cost for a simpler code structure: no checking for data availability is needed during the computation of a block; the blocks can be worked on completely independently. The 3.5D blocking method described in [121] proposes to stream z-planes (planes orthogonal to the z-axis, the axis with the slowest varying indices) through the local memory. But if circular queue time blocking is implemented reasonably, this is done anyway [40, 41, 42], so the methods are equivalent.

We want to maximize the number of local time steps, as the arithmetic intensity increases about proportionally to the number of local time steps, since we need to load only one input plane and store one output plane per set of local time steps. However, with each additional local time step the amount of redundant computation increases. Additionally, the size of the planes directly limits the number of time steps that can be performed simultaneously on the data, since the size of the intermediate planes decreases with each time step. Therefore, planes should be as large as possible.

Figure 6.6: Memory layout of planes in the time-blocking scheme.

On the other hand, with each additional local time step the number of intermediate planes that need to be kept in fast memory increases. The memory size limits the size of the planes. Hence a trade-off between the increased arithmetic intensity and the redundant computations must be found by determining the best number of local time steps.

In the shared memory environment, no explicit treatment of the artificial block boundary has to be carried out. We simply have to make sure that the overlaps of the input data of adjacent blocks are large enough so that the result planes are laid out seamlessly when written back to main memory. Distributed memory environments require the artificial boundaries to be exchanged, the width of which is multiplied by the number of time steps that are performed on the data between two synchronization points.

Boundary conditions have to be applied after each time step on the blocks that contain boundaries. Dirichlet boundary conditions (boundaries whose values do not change in time) can be easily implemented by simply copying the boundary values on the input plane forward to the planes belonging to the intermediate time levels. Note that, e.g., discretized Neumann or convective boundary conditions can be rewritten as Dirichlet boundary conditions, a process that involves only modifications of the stencil coefficients on the boundary [120].

Fig. 6.6 shows the layout of the planes u in memory for time block size 2. It shows the most compact memory layout that can be explicitly programmed on architectures with a software-managed cache (e.g., the shared memory of GPUs). The lower indices of u denote the spatial indices and the upper indices are the indices in time. The figure illustrates that the spaces in memory are filled as time advances (from top to bottom in the diagram): $u_0^0$, $u_1^0$, $u_2^0$, $u_3^0$ are loaded in succession (the white items symbolize loading of data into fast memory). In the fourth row, $u_0^0$, $u_1^0$, and $u_2^0$ are available, and the first time step $u_1^1$ can be computed. The memory locations that receive calculated data are drawn in black. In the next step, $u_0^0$ is no longer needed and can be overwritten to contain the data of $u_4^0$, while simultaneously $u_2^1$ is calculated from $u_1^0$, $u_2^0$, and $u_3^0$. In the sixth row, after calculating $u_3^1$, the input planes to compute the first plane of the second time step, $u_2^2$, are available. As we want to overlap computation and communication, $u_2^2$ is written back from fast to slower memory only a step later (this can, of course, only be controlled on architectures which support explicit memory transfers). Note that for a given time in the computation (i.e., on a specific row in the diagram), the spatial indices of the planes being calculated are skewed. For example, in the last row of the diagram, plane 4 of time step 1 ($u_4^1$) and plane 3 of time step 2 ($u_3^2$) are calculated.

This time blocking scheme was implemented on the Cell Broadband Engine Architecture [76] for the hyperthermia cancer treatment planning application (cf. Chapter 11.1), and the speedup when applying the circular queue time blocking method to the stencil kernel on that particular architecture has been found to be a factor of 2 [40, 41, 42]. In this application, numerous coefficient fields put heavy pressure on the relatively small local stores, which was the main reason for the speedup limit.
The circular queue method has also been successfully adapted to GPUs [110] by using the shared memory or even devising a scheme to explicitly reuse data through the register file, which is comparatively large on GPUs.

6.2.3 Wave Front Time Blocking

The idea of the wavefront parallelization [170, 162] is to have a team of threads cooperate on a set of planes of data, with each thread computing one time step. Fig. 6.7 illustrates how the algorithm works. Thread 1 reads planes from the input array (depicted in blue) and, after the computation, writes its result plane (drawn in red) into a temporary array.

Figure 6.7: Wavefront parallelization of a 7-point stencil in 3D. A team of 4 threads computes 4 time steps by having thread i perform a sweep on the data previously computed by thread i − 1.

The results produced previously by thread 1 are consumed by thread 2, which simultaneously computes the next time step. The last thread (thread 4 in Fig. 6.7) writes the result to the output array, which can be identical to the input array, since the output lags behind and does not overwrite planes that are still needed as inputs for thread 1. To save temporary array space, intermediate threads can also write to the input array as long as data that is still needed is not overwritten. The algorithm requires that all the threads within a team are synchronized after computing one plane.

If the threads work on planes that are cross-sections of the entire domain, each thread can apply boundary conditions before the result is consumed by the next thread. If the domain is too large to be handled by one thread team, the planes can be decomposed into smaller panels. Then, multiple thread teams can work on a stack of panels each, and there are two levels of parallelism. Handling the interior boundaries makes the algorithm more intricate, and the threads computing the lower local time steps have to load and compute redundantly, similarly as in the circular queue method, or they can avoid redundant computation by setting up a separate data structure holding the values on the artificial boundary. If the redundant computation approach is chosen, a global synchronization point is required after each sweep.

6.3 Cache-Oblivious Blocking Algorithms

An algorithm is cache-oblivious if it is agnostic to concrete cache specifications such as the number of cache levels, the cache and cache line sizes, and yet makes optimal or near-optimal use of the cache hierarchy. The magic behind the concept is often a recursive divide-and-conquer-like recipe.

6.3.1 Cutting Trapezoids

Frigo and Strumpen present a cache-oblivious grid traversal algorithm for stencil computations [66] for a uniprocessor mapping, which works by recursively dividing the trapezoid spanned by the spatio-temporal iteration space into two new trapezoids: either the old trapezoid is cut through its center point in space, if the width is large enough so that both parts are again trapezoids, or it is cut through its center point in the time dimension. Again for the 1D case, let a trapezoid T be given by

$$T = (t_0, t_1, x^0, \dot{x}^0, x^1, \dot{x}^1),$$
where $t_0$ is the lower and $t_1$ the higher coordinate in the time dimension (i.e., $t_0 < t_1$), $x^0$ the lower and $x^1$ the higher spatial coordinate vectors ($x^0_i < x^1_i$ for all $i$), and $\dot{x}^0$ and $\dot{x}^1$ are the slopes at $x^0_i$ and $x^1_i$, respectively. Let $\sigma$ be the maximum slope between two consecutive time steps, e.g., $\sigma = 1$ for a 3D seven-point stencil, or $\sigma = 2$ for a stencil depending on two neighbor points, etc. The exact process of cutting the trapezoids is given in Algorithm 6.1.

Algorithm 6.1: A cache-oblivious blocking algorithm for stencil computations.

 1: procedure WALK(trapezoid T = (t0, t1, x^0, ẋ^0, x^1, ẋ^1))
 2:     Δt ← t1 − t0
 3:     if Δt = 1 then
 4:         evaluate stencil for all points in T
 5:     else
 6:         find the minimum i such that w_i ≥ 2σΔt,
 7:             where w_i := (x_i^1 − x_i^0) + ½(ẋ_i^1 − ẋ_i^0)Δt is the width
 8:             in dimension i
 9:         if i exists then                        ▷ cut in space along dimension i







Figure 6.8: Spatial (left) and temporal (right) cuts of a trapezoid in one spatial dimension.

Algorithm 6.1: A cache-oblivious blocking algorithm for stencil computations. (cont.)

10:             x_new ← ½(x_i^0 + x_i^1) + ¼(ẋ_i^0 + ẋ_i^1)Δt + ½σΔt      ▷ new base pt
11:             walk(T[x_i^1 ← x_new, ẋ_i^1 ← −σ])                        ▷ walk “left” T first
12:             walk(T[x_i^0 ← x_new, ẋ_i^0 ← −σ])
13:         else                                       ▷ cut in the temporal dimension
14:             walk(T[t1 ← t0 + ½Δt])                 ▷ walk the T with lower time
15:             walk(T[t0 ← t0 + ½Δt])                 ▷   coordinates first
16:         end if
17:     end if
18: end procedure

For 1D, cutting a trapezoid is illustrated in Fig. 6.8. The trapezoid on the left is wide enough so that a cut in the spatial dimension is possible (here, σ = 1), whereas the width of the trapezoid on the right is too small; hence a cut in the temporal dimension (vertical axis) is performed, as prescribed by the method.

In [66], Frigo and Strumpen generalize the method to arbitrary dimensions and prove that this method incurs $O\big(\mathrm{Vol}(T)\, Z^{-1/d}\big)$ cache misses for an ideal cache of size Z. In [67, 155] a parallelization of the method is presented. The idea is to extend the spatial cuts such that multiple independent trapezoids and connecting dependent triangular shapes are obtained. The independent trapezoids can then be processed in parallel.
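For reference, a compact C sketch of the 1D recursion with σ = 1, closely following the algorithm published in [66]; kernel(t, x) is a placeholder for the per-point update u[x; t+1] = ϕ(u[x-1; t], u[x; t], u[x+1; t]) and is assumed to be provided elsewhere.

    void kernel(int t, int x);   /* per-point stencil update, assumed given */

    void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
    {
        int dt = t1 - t0;
        if (dt == 1) {
            for (int x = x0; x < x1; ++x)
                kernel(t0, x);
        } else if (dt > 1) {
            if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
                /* wide enough: space cut through the center point */
                int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
                walk1(t0, t1, x0, dx0, xm, -1);
                walk1(t0, t1, xm, -1, x1, dx1);
            } else {
                /* time cut: walk the lower half first */
                int s = dt / 2;
                walk1(t0, t0 + s, x0, dx0, x1, dx1);
                walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
            }
        }
    }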

6.3.2 Cache-Oblivious Parallelograms

In [153], Strzodka et al. propose a cache-oblivious algorithm, CORALS, for temporal blocking of iterative stencil computations by considering the entire space-time iteration space and embedding it within a (d + 1)-dimensional parallelotope, which the algorithm subdivides recursively into $2^{d+1}$ smaller parallelotopes by halving each side. The observations are that parallelism can be extracted from the subdivision, which converges to the ideal speedup number as the number of subdivisions tends to infinity, and that assigning threads in a certain way guarantees good spatial and temporal data locality. Both a static and a dynamic assignment of threads are presented.

For illustration purposes, we show the algorithm for a 1D stencil with the static thread assignment. First, a parallelogram is chosen with width $2^w$ and height $2^h$ for some w and h such that the entire spatio-temporal iteration space of the stencil computation is covered by the parallelogram, as shown in Fig. 6.9 (a). The powers of 2 are enforced to enable maximally many subdivisions.

The sub-figures (b) and (c) in Fig. 6.9 show the recipes proposed in [153] for assigning numbers of threads to sub-parallelograms and the steps of the recursive subdivision. After subdividing a parallelogram, the bottom left and the top right parts are assigned n threads, and the bottom right and top left parts n/2 threads. A parallelogram assigned one thread is not subdivided any further.

The main observation is that the top left and the bottom right sub-parallelograms in Fig. 6.9 (a) can be processed in parallel: the bottom right parallelogram obviously does not have any dependences and can therefore be processed anytime, in particular after the bottom left parallelogram was processed. The top left parallelogram only depends on values from the bottom left parallelogram. Only one of the threads assigned to a parallelogram does the computation. Fig. 6.10 (a) shows the order in which the parallelograms are processed. Sub-parallelograms labeled with the same numbers are processed in parallel. In the example, 4 threads are used; each of the 4 threads processes one of the sub-parallelograms with labels 5, 11, 13, 15, and 21. The remaining sub-parallelograms are not processed with maximum parallelism. However, as the number of subdivision steps tends to infinity, the parallelism becomes optimal. Fig. 6.10 (b) shows a possible assignment of threads to the sub-parallelograms; the labels are thread IDs. However, to process the original rectangular iteration space optimally, a more effective dynamic load balancing strategy is required, as shown in [153].

Figure 6.9: (a) shows the spatio-temporal iteration space covered by a parallelogram. The spatial coordinate varies along the horizontal axis, the time along the vertical axis. (b) shows the recursive subdivision and the assignment of the number of threads when using 2 threads. Similarly, (c) shows the recursive subdivision and assignment for 4 threads.

Figure 6.10: In (a), the sequence of the iteration over the parallelograms is shown when using 4 threads. Note that parallelograms with the same number are executed in parallel. (b) shows an assignment of parallelograms to threads (the numbers are thread IDs).

6.4 Hardware-Aware Programming

In this section, techniques are presented which hardly qualify as algorithms, but rather are hardware-specific code optimizations which can significantly reduce memory bandwidth usage or bandwidth pollution.

6.4.1 Overlapping Computation and Communication

If cores share a common cache, it can be beneficial — especially if few threads already exhaust the available memory bandwidth — to have two classes of threads: one class of threads doing the computation while the other class is responsible for bringing data from main memory into the cache shared with the compute thread(s). This approach guarantees a constant flow of data from the memory to the compute cores and thus maximizes memory throughput. Such a mechanism is built into the C++ framework detailed in [154].

In a distributed memory environment, usually a ghost layer is added at the block borders that replicates the nodes of the adjacent blocks in order to account for the data dependences. If multiple sweeps have to be run on the data, the ghost nodes need to be updated after each sweep. The simplest parallelization scheme simply synchronizes the execution after the sweep and exchanges the values on the artificial boundaries. A more elaborate scheme that allows overlapping the computation and the communication of the boundary values (and thus scales if the block sizes are large enough) is described in Algorithm 6.2.

Algorithm 6.2: Scalable distributed memory parallelization.

 1: for each process do
 2:     for t = 0 .. t_max do
 3:         copy boundary data to send buffers
 4:         initiate asynchronous transfers of the send buffers
 5:             into the neighbors' receive buffers
 6:         compute
 7:         wait for the data transfers from the neighbors to complete
 8:         update the boundary nodes with the data in the recv bufs
 9:         apply boundary conditions
10:     end for
11: end for

The algorithm works with extra send and receive buffers. It is assumed that each process is assigned exactly one block. Before the actual computation, the data which have to be sent to the neighboring processes are copied to the send buffers and an asynchronous transfer is initiated, which copies the data from the send buffers into the neighbors' receive buffers while simultaneously the main computation is done. As soon as both the computation and the data transfers are complete, the nodes at the artificial boundaries are updated with the values from the receive buffers. Note that the update step in line 8 requires that the stencil expression can be separated into single components depending only on neighboring nodes in one direction and that the ghost zones are initialized with zero, so that the nodes at the boundaries are not affected by the ghost zones when applying the regular stencil. Splitting can be done whenever the stencil computation is linear in the node values, e.g., if the stencil expression is a weighted sum over the values of the neighboring nodes. Note that there might be stencils for which the separation is not possible, and hence the algorithm is not applicable.
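A minimal MPI sketch of this overlap for one spatial direction is shown below. The neighbor ranks, buffer sizes, and the helper routines pack_boundary, compute_interior, unpack_and_update, and apply_bc are hypothetical placeholders standing in for the corresponding steps of Algorithm 6.2.

    #include <mpi.h>

    void pack_boundary(double *send_lo, double *send_hi);                  /* placeholder */
    void compute_interior(void);                                           /* placeholder */
    void unpack_and_update(const double *recv_lo, const double *recv_hi);  /* placeholder */
    void apply_bc(void);                                                   /* placeholder */

    void sweep_overlapped(double *send_lo, double *send_hi,
                          double *recv_lo, double *recv_hi, int n,
                          int rank_lo, int rank_hi, MPI_Comm comm)
    {
        MPI_Request req[4];

        pack_boundary(send_lo, send_hi);              /* copy boundary data      */
        MPI_Isend(send_lo, n, MPI_DOUBLE, rank_lo, 0, comm, &req[0]);
        MPI_Isend(send_hi, n, MPI_DOUBLE, rank_hi, 1, comm, &req[1]);
        MPI_Irecv(recv_lo, n, MPI_DOUBLE, rank_lo, 1, comm, &req[2]);
        MPI_Irecv(recv_hi, n, MPI_DOUBLE, rank_hi, 0, comm, &req[3]);

        compute_interior();                           /* overlaps with transfers */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);     /* wait for ghost data     */
        unpack_and_update(recv_lo, recv_hi);          /* update boundary nodes   */
        apply_bc();                                   /* apply boundary cond.    */
    }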

6.4.2 NUMA-Awareness and Thread Affinity

Currently, the majority of CPU clusters are, on the node level, multi-socket designs with cache coherent non-uniform memory architectures (cc-NUMA). This means that each socket, or even a sub-entity of a socket, has its own memory interface and DRAM, which can be accessed by other sockets, although doing so incurs a latency penalty, since the data has to be transferred over the socket interconnect (Intel's QuickPath Interconnect or AMD's HyperTransport), and a bandwidth penalty, since the bandwidth to the DRAM module or the interconnect bandwidth has to be shared.

Ideally, if a NUMA architecture is used, each thread should not be allowed to migrate between sockets, i.e., threads should be pinned to the socket on which they were created, or even to a single core, to prevent cache thrashing and accesses to foreign DRAM modules (since the DRAM pages do not migrate along with the threads), and ideally they should only access the DRAM belonging to the socket they are running on. Also, going forward to exa-scale HPC systems, energy and power consumption are concerns that have to be addressed. On a small scale, placing data in memories as close as possible to the compute element requires less energy than fetching data across the node and therefore is a contribution to energy-aware programming.
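As a Linux-specific illustration of such pinning, the following hypothetical helper uses the sched_setaffinity system call together with OpenMP; the simple thread-i-to-core-i mapping ignores the actual socket and die topology, which a production code would query first.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    void pin_threads(void)
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);       /* thread i -> core i     */
            sched_setaffinity(0, sizeof set, &set);    /* 0 = the calling thread */
        }
    }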

Figure 6.11: NUMA-aware data initialization: single-precision performance (GFlop/s) of the 3D 7-point Laplacian on an AMD Magny Cours for 1–24 threads, without (left) and with (right) NUMA-aware initialization.

Thread affinity can be set programmatically using the operating system API, and most OpenMP implementations also provide options to specify the thread affinity by means of setting environment variables.† Cray's aprun launcher pins threads automatically, by default to cores.

Memory affinity is harder to set programmatically. Typically, operating systems implement the first touch policy, i.e., the data will be placed in a page of the DRAM module belonging to the socket on which the thread that touches the data first is running. Hence it is important to think carefully about the initialization of the data. Ideally, the data on which a thread computes should also be initialized by that thread. In that way, memory affinity is ensured implicitly on systems implementing the first touch policy.

Fig. 6.11 compares the performance of two versions of an example code, once without and once with NUMA-aware data initialization. NUMA-aware initialization is not really a code optimization technique, but if it is possible to employ NUMA-awareness, there is a huge performance benefit on NUMA architectures.

† The GNU OpenMP implementation recognizes the GOMP_CPU_AFFINITY environment variable, which can be assigned a comma-separated core ID list; Intel's OpenMP implementation reads the KMP_AFFINITY environment variable, which can also be assigned a core ID list, or, more conveniently, a list of keywords describing the affinity pattern. In addition, the GOMP_CPU_AFFINITY variable is also recognized.

The left part of the figure shows the version of the code in which no NUMA-aware initialization of the data was done; the right part shows the scaling behavior with NUMA-awareness. The minor horizontal axis shows the number of threads used, and the vertical axis the performance in GFlop/s for the single precision 3D 7-point Laplacian stencil computation.

In the No NUMA version of the code, the grid arrays were set to zero using the C standard library function memset after allocation (equivalently, calloc could have been used to allocate and initialize the data to zero). In contrast, in the NUMA-aware version, relying on the first-touch policy, the initialization was done in a loop nest parallelized identically to the compute loop nest.

The machine on which the benchmark was run is a dual-socket AMD Opteron Magny Cours system, cf. Chapter 9. Since one socket consists of two hexa-core dies, each of which has its own memory interface, there are 4 NUMA domains. One to six threads ran on the same NUMA domain, so both parts of the figure with 1 to 6 threads show the same (linear) scaling behavior. In the result for 12 threads, the threads ran on the two dies of a single socket. Since all the data is initialized by one thread in the No NUMA case, all the data is owned by one NUMA domain, and the second die has to send data requests to the first die, which has to satisfy both the memory requests of its own cores and those of the other die's cores, and the data sent to the foreign cores additionally has to travel over the HyperTransport interconnect. As the bandwidth is exhausted, the result is the slowdown shown in the left part of Fig. 6.11. The situation becomes even worse when 4 dies send their data requests to the single die which has initialized and therefore owns the data. In the right part of the figure this problem does not occur since all the dies operate on data served by their own memory modules.
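The NUMA-aware variant essentially amounts to the following first-touch initialization sketch (a hypothetical helper; in the benchmark the initialization loop nest mirrors the compute loop nest, here abbreviated to a flat loop with a static OpenMP schedule):

    #include <stdlib.h>

    /* Allocate a grid and let each thread touch (and thereby place)
     * the pages it will later compute on (first-touch policy). */
    double *alloc_numa_aware(size_t n)
    {
        double *u = malloc(n * sizeof *u);      /* pages are not yet mapped      */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; ++i)
            u[i] = 0.0;                         /* first touch decides placement */
        return u;
    }

In contrast, a single memset (or calloc) maps all pages into the NUMA domain of the initializing thread, which is exactly the No NUMA case of Fig. 6.11.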

6.4.3 Bypassing the Cache

Another viable optimization is bypassing the cache entirely when data cannot be reused. While data read from main memory can be reused, data written back to main memory (at least in the last sweep) cannot. So-called streaming or non-temporal stores prevent data that is written back to main memory from first being brought from DRAM into the cache, updated there, and then transferred back to main memory again. This optimization is potentially valuable, as without it the so-called write allocation traffic (bringing data from DRAM into the cache before writing it back) doubles the amount of transferred data. Bypassing the cache is especially valuable in cc-NUMA systems because no additional bandwidth is consumed by the cache coherency protocol [176].
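On x86 CPUs, streaming stores are available through compiler intrinsics; the following sketch writes the result array with non-temporal stores. Assumptions: SSE is available, out is 16-byte aligned, n is a multiple of 4, and stencil_value is a placeholder for the actual stencil evaluation.

    #include <xmmintrin.h>   /* _mm_set_ps, _mm_stream_ps, _mm_sfence */

    float stencil_value(const float *in, long i);   /* placeholder */

    void sweep_streaming(const float *in, float *out, long n)
    {
        for (long i = 0; i < n; i += 4) {
            __m128 v = _mm_set_ps(stencil_value(in, i + 3), stencil_value(in, i + 2),
                                  stencil_value(in, i + 1), stencil_value(in, i));
            _mm_stream_ps(&out[i], v);   /* store bypasses the cache (no write allocate) */
        }
        _mm_sfence();                    /* make the streaming stores globally visible   */
    }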

6.4.4 Software Prefetching

The idea of software prefetching is to give the processor hints about the data that will be used in a future computation. The processor might choose to act on these hints and bring the data into the cache in advance. For stencil computations, the data access pattern is known and static, so the data movement can be choreographed easily. However, in current processor architectures the hardware prefetchers have become so sophisticated that explicit software prefetching hints are hardly noticeable in terms of performance benefits [53, 176]. Yet, certain architectures require that data is brought into the local memories explicitly (e.g., into the shared memory of CUDA-based GPUs or into the local store of the Cell Broadband Engine Architecture's processing elements). In this context, software prefetching is also referred to as double buffering, since an additional data buffer is allocated into which the data are streamed before being consumed from the computation buffer by the computation.
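For completeness, explicit prefetch hints can be issued with compiler builtins as in the following 1D sketch (GCC/Clang-style __builtin_prefetch; the prefetch distance PF_DIST is an illustrative tuning knob, and on current CPUs the hardware prefetcher typically makes the hint redundant):

    void sweep_prefetch(const double *in, double *out, long n)
    {
        const long PF_DIST = 64;                          /* elements ahead (~ a few cache lines) */
        for (long i = 1; i < n - 1; ++i) {
            __builtin_prefetch(&in[i + PF_DIST], 0, 0);   /* read access, low temporal locality;
                                                             prefetching past the array end is harmless */
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
        }
    }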

Chapter 7

Stencils, Strategies, and Architectures

We might even invent laws for series or formulæ in an arbitrary manner, and set the engine to work upon them, and thus deduce numerical results which we might not otherwise have thought of obtaining; but this would hardly perhaps in any instance be productive of any great practical utility, or calculated to rank higher than as a philosophical amusement. — Ada Lovelace (1815–1852)

7.1 More Details on Patus Stencil Specifications

In the current version, PATUS stencil specifications consist of 3 parts: the size of the domain, a d-dimensional grid to the points of which the stencil is applied; a specification of the number of time steps (sweeps) that are performed within one stencil kernel call; and, most importantly, the operation: the definition of the actual stencil expression(s), which is applied to each point of the grid(s) over which the computation sweeps. The following listing shows a skeleton of a stencil specification:

    stencil name {
        domainsize = (xmin .. xmax, ymin .. ymax, ...);
        t_max = number of sweeps;

        operation (grid argument and parameter declarations) {
            stencil expressions
        }
    }

Sweeps are not programmed explicitly in the stencil specification; the operation only contains a localized, point-wise stencil expression. Currently, PATUS supports Jacobi iterations, which means that the grids which are written to are distinct in memory from the grids that are read. In the future, other types of grid traversals, e.g., red-black Gauss-Seidel sweeps, may be supported. Given the Jacobi iteration as structural principle, there are no loop carried dependences within the spatial iteration over the grid. Hence, the order in which the grid points are visited and the stencil expressions evaluated does not matter. This fact is actually exploited by the Strategies.

Both the domain size and the number of sweeps can contain symbolic identifiers, in which case these identifiers are added as arguments to the stencil kernel function in the generated code. The dimensionality of the stencil is implicitly defined by the dimensionality of the domain size and the subscripts in the stencil expressions. Currently, it is not possible to mix grids of different dimensionalities. xmin, xmax, . . . can be integer literals, identifiers, or arithmetic expressions. The number of sweeps can be either an integer literal or an identifier if the number of time steps has to be controlled from outside the stencil kernel. Both minimum and maximum expressions in domain size specifications are inclusive, i.e., the domain size (xmin .. xmax, ...) has xmax − xmin + 1 points in x-direction.

There can be arbitrarily many grid arguments to the stencil operation, depending on the application. For instance, discretizing the divergence operator, which maps vector fields to a scalar function, might be implemented in the 3D case with three grids which are read and one grid to which the result is written. Grid arguments (first line in the following listing) and parameter declarations (second line) have the following form:

Listing 7.1: Operation arguments in the stencil specification.

1: [const] (float|double) grid grid name
2:     [(x'min .. x'max, y'min .. y'max, ...)] [[number of grids]]
3: (float|double) param parameter name [[array size]]

Both grids and parameters can be based on either single or double precision floating point data types. If a grid is declared to be const, it is assumed that its values do not change in time, i.e., the grid is read-only. The optional grid size specification after the grid name defines the size of the array as it is allocated in memory. This is useful if the kernel is to be built into an existing code. If no explicit size of the grid is specified, the size is calculated automatically by PATUS by taking the size of the iteration space domainsize, the inner domain, and adding the border layers required so that the stencil can be evaluated on each point of the inner domain. Again, x'min, x'max, . . . can be integer literals, identifiers, which can be distinct from the identifiers used for the domainsize specification, or arithmetic expressions. Any additional identifiers used to specify the grid size will be added as arguments to the stencil kernel function.

Multiple grids can be subsumed in one grid identifier. Instead of using 3 distinct grid identifiers in the above example of the divergence operator, the inputs can be declared as float grid X[3] rather than, e.g., float grid Xx, float grid Xy, float grid Xz. In the generated code, however, an array of grids will be split into distinct grid identifiers. The number of grids has to be an integer literal. Also, parameters can be either scalar identifiers or arrays.

The body of the operation consists of one or more assignment expression statements. The left hand side of an assignment can be a grid access to which the right hand side is assigned, or a temporary scalar variable, which can be used later in the computation. The arithmetic expressions on the right hand side involve previously calculated scalar variables or references to a grid point. In Table 7.1 some flavors of assigning and referencing points in a grid are shown.

Assignment to a grid:
    u[x,y,z; t+1] = ...

Assignment to temporary variables:
    float tmp1 = ...
    double tmp2 = ...

Referencing a float grid u:
    ... = u[x,y,z; t] + ...       // center point
    ... = u[x+1,y,z; t] + ...     // right neighbor
    ... = u[x,y,z; t-1] + ...     // center pt of prev time step

Referencing an array float grid X[3]:
    ... = X[x,y,z; t; 0] + ...    // center point of component 0

Referencing a const float grid c:
    ... = c[x,y,z] + ...          // center point of a const grid

Referencing an array const float grid c[3]:
    ... = c[x,y,z; 0] + ...       // center point of component 0

Table 7.1: Assigning and referencing points in a grid in the stencil specification language.

The center point* of a stencil is always denoted by u[x,y,z; ·], neighboring points by u[x±δx, y±δy, z±δz; ·] if u is the identifier of the grid we wish to access and the stencil is defined in 3 dimensions. 2-dimensional grids would only use x and y as spatial references; for arbitrary-dimensional stencils, spatial reference identifiers x0, x1, ... are defined. The δ must be integer literals, i.e., the neighborhood relationship to the center point must be constant and known at compile time. Non-constant grids, i.e., grids which change their values in the time dimension and are read from and written to, require an additional index denoting the time step, which is interpreted relative to the time step of the corresponding left hand side grid. The temporal reference identifier is always t. Note that no references to future time steps can occur on the right hand side, i.e., the temporal indices of the grid references in expressions on the right hand side must be strictly less than the temporal index on the left hand side. (If this were not the case, the method would be implicit, and a linear solver would be required to solve the problem.)

If an identifier has been declared to be an array of grids, an additional index has to be appended after the temporal one, which determines which array element to use. The index has to be an integer literal.

The complete grammar of the stencil specification DSL is given in Appendix B. Also, more examples of stencil specifications — the ones for which benchmarking results are given in Chapter 10 and the stencils occurring in the applications discussed in Chapter 11 — can be found in Appendix C.

* Here, “center” always refers to the current point within a sweep. Sweeps are not explicitly programmed in the stencil specification, as detailed above.

7.2 Strategy Examples and Hardware Architecture Considerations

A PATUS Strategy is the means by which we aspire to implement parallelization and bandwidth-optimizing methods such as the ones discussed in Chapter 6. In practice, it is usually at least cumbersome to adapt an implementation for a toy stencil (e.g., a 3D 7-point constant coefficient Laplacian) to a stencil occurring in an application, since the implementation of the algorithm most likely depends on the shape of the stencil and the grid arrays used, and very probably on the dimensionality of the stencil. Extending a coded template to incorporate a stencil computation for which it was not designed initially is tedious and error prone.

The idea of Strategies is to provide a clean mechanism which separates the implementation of parallelization and bandwidth-optimizing methods from the actual stencil computation. In this way, the implementation of the algorithm can be reused for arbitrary stencils.

7.2.1 A Cache Blocking Strategy

We start by looking at the implementation of a cache blocking method. The Strategy in Listing 7.2 iterates over all the time steps in the t loop, and within one time step in blocks v of size cb over the root domain u, i.e., the entire domain to which the stencil is to be applied. Both the root domain and the size of the subdomain v are given as Strategy parameters. The blocks v are executed in parallel by virtue of the parallel keyword, which means that the subdomains v are dealt out in a cyclic fashion to the worker threads. The parameter chunk to the schedule keyword defines how many consecutive blocks one thread is given. Then, the stencil is applied to each point in the subdomain v.

The Strategy argument cb has a specifier, auto, which means that this parameter will be interfaced with the auto-tuner: it is exposed on the command line of the benchmarking harness so that the auto-tuner can provide values for cb = (c1, c2, ..., cd), where d is the dimensionality of the stencil, and pick the one for which the best performance is measured.

Listing 7.2: A cache blocking Strategy implementation.

 1: strategy cacheblocked (domain u, auto dim cb,
 2:     auto int chunk)
 3: {
 4:     // iterate over time steps
 5:     for t = 1 .. stencil.t_max
 6:     {
 7:         // iterate over subdomain
 8:         for subdomain v(cb) in u(:; t) parallel schedule chunk
 9:         {
10:             // calculate the stencil for each point
11:             // in the subdomain
12:             for point p in v(:; t)
13:                 v[p; t+1] = stencil(v[p; t]);
14:         }
15:     }
16: }

In the Strategy of Listing 7.2, the parameter cb controls both the granularity of the parallelization, and consequently the load balancing, and the size of the cache blocks. The size of the blocks ultimately limits the number of threads participating in the computation. If the blocks become too large, the entire domain will be divided into fewer subdomains than there are threads available, and the performance will drop. In practice, the auto-tuner will prevent such cases. But the consequence is that a configuration for cb which runs well for a specific number of threads might not perform well for another number of threads. In the Strategy in Listing 7.3, the block v is therefore split into smaller blocks w. Here, the idea is that v is responsible for the parallelization and load balancing, while the inner subdomain w is reserved for cache blocking.

Listing 7.3: Another way of subdividing for cache blocking.

 1: strategy cacheblocked2 (domain u,
 2:     auto dim tb, auto dim cb, auto int chunk)
 3: {
 4:     // iterate over time steps
 5:     for t = 1 .. stencil.t_max
 6:     {
 7:         // parallelization
 8:         for subdomain v(tb) in u(:; t) parallel schedule chunk
 9:         {
10:             // cache blocking
11:             for subdomain w(cb) in v(:; t)
12:             {
13:                 for point pt in w(:; t)
14:                     w[pt; t+1] = stencil(w[pt; t]);
15:             }
16:         }
17:     }
18: }

Although this is currently not done yet, restrictions could be inferred for w, limiting the search space the auto-tuner has to explore: since w is an iterator within the subdomain v, we could restrict cb by cb ≤ tb (where tb is the size of v). Also, we could infer a restriction preventing threads from sitting idle by calculating the number of blocks v: let $(s_1, \ldots, s_d)$ be the size of the root domain u. Then we could require that the following inequality holds:
$$\prod_{i=1}^{d} \left\lceil \frac{s_i}{tb_i} \right\rceil \ge T,$$
where T is the number of threads.
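As a purely illustrative numerical check of this restriction (all sizes below are hypothetical), for a root domain of size $s = (512, 512, 512)$ and block sizes $tb = (64, 128, 256)$ we obtain
$$\left\lceil\frac{512}{64}\right\rceil \cdot \left\lceil\frac{512}{128}\right\rceil \cdot \left\lceil\frac{512}{256}\right\rceil = 8 \cdot 4 \cdot 2 = 64,$$
so any thread count $T \le 64$ can be kept busy, whereas $tb = (512, 512, 256)$ yields only $1 \cdot 1 \cdot 2 = 2$ blocks and would leave most of, say, 24 threads idle.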

7.2.2 Independence of the Stencil

By concept, Strategies are designed to be independent of both the stencil and the concrete hardware platform. The obvious means to achieve independence of the stencil is to not specify the actual calculation in the Strategy, but to have a reference to it instead. This is done by the formal stencil call, which expects a grid reference as argument (actually, a point within a domain) and assigns the result to another grid reference. Strategies therefore also must have the notion of grids, but interpreted as index spaces rather than actual mappings to values. Each Strategy has one required argument of type domain, the root domain, which represents the entire index space over which the Strategy is supposed to iterate. In the Strategy, this domain can be subdivided, defining the way in which the actual stencil grids are traversed. This is done using subdomain iterators, which are constructs that advance a smaller box, the iterator, within a larger one, its parent domain. The size of a Strategy domain is always the size of the computed output, i.e., not including the border layers required for a stencil computation.

Being independent of the stencil means in particular being independent of the stencil's dimensionality. The root domain inherits the dimensionality of the stencil. Subdomains, however, need a notion of the dimensionality in order to specify their size. Strategies provide the data types dim and codim(n), which can be used in Strategy arguments. If the stencil dimensionality is d, a dim variable is a d-dimensional vector, and a codim(n) variable is a (d − n)-dimensional vector (a vector of co-dimension n). Again consider the cache blocking Strategy in Listing 7.2. The size of the subdomain v being iterated over the root domain u is specified by cb, an argument to the strategy of type dim. This means that the subdomain v will have the same dimensionality as the stencil and the root domain.

The instruments provided by Strategies to deal with the unknown dimensionalities are the following:

• subdomain iterators rather than simple for loops,

• dim and codim(n) typed variables,

• subdomain and stencil properties: when a Strategy is parsed, both stencil.dim and w.dim are replaced by the dimensionality of the stencil and of the subdomain w, respectively,

• a subdomain's size property, which is a dim-typed variable containing the size of a subdomain (i.e., w.size is a d-dimensional vector with its components set to the size of w),

• subscripting dim or codim(n) type variables by an index vector. An index vector can be either

– a single integer (e.g., the value of w.size(1) is the first component of the size vector of w);
– a vector of integers (e.g., w.size(1,2,4) returns a 3-dimensional vector containing the first, second, and fourth components of the size of w);
– a range (e.g., w.size(2 .. 4) returns a 3-dimensional vector containing the second, third, and fourth components of the size of w, or w.size(1 .. w.dim-1) returns a vector containing all the components of the size of w except the last one);
– ellipses, which fill a vector in a non-greedy fashion so that the vector is of dimension stencil.dim:

* (a ...) is a d-dimensional vector with each component set to a (a must be a compile-time constant);
* (a, b ...) is a d-dimensional vector with the first component set to a, and all the others to b;
* (a, b ... c) is a vector with the first component set to a and the last set to c, and all the components in the middle set to b;
– any combination of the above.

The size of a subdomain might also depend on the structure or the bounding box of a stencil. This can be achieved by the stencil.box property, which can be used to enlarge a subdomain.

7.2.3 Circular Queue Time Blocking

In the following, we show how a circular queue time blocking method is envisioned as a Strategy implementation. In the current state at the time of writing, the code generator still lacks certain mechanisms required for the generation of the final C code. The algorithm is in the spirit of [40, 41, 121]. For the sake of simplicity of presentation, no pre-loading scheme is implemented. The parameters to tune for are the number of local time steps, timeblocksize, in the inner time-blocked iteration, and the cache block size cb. The implementation assumes that the stencil only requires the data of time step t in order to compute the result at time step t + 1.

Listing 7.4: A circular queue time blocking Strategy implementation.

strategy circularqueue (domain u, auto int timeblocksize,
    auto dim cb)
{
    int lower = -stencil.min(stencil.dim);
    int upper = stencil.max(stencil.dim);

    // number of regular planes
    int numplanes = lower + upper + 1;

    for t = 1 .. stencil.t_max by timeblocksize
    {
        for subdomain v (cb) in u parallel
        {
            // local timesteps 0 .. last-1
            domain pnl (
                v.size(1 .. stencil.dim-1) +
                    timeblocksize * stencil.box(1 .. stencil.dim-1), 1;
                timeblocksize - 1;
                lower + upper + 1);
            // last timestep
            domain pnlL (v.size(1 .. stencil.dim-1), 1);

            // preloading phase and filling phase omitted...

            // working phase
            for z = (timeblocksize-1)*upper + 1 .. v.max(v.dim) - 2
            {
                // load
                memcpy (
                    pnl[:; 0; (z + upper + 1) % numplanes],
                    v(: + timeblocksize * stencil.box(1 .. stencil.dim-1),
                        z + upper + 1)
                );

                // compute
                for t0 = 0 .. timeblocksize - 2
                {
                    int idx = (z - t0 * upper) % numplanes;
                    for point pt in v (
                        : + (timeblocksize - t0 - 1) *
                            stencil.box(1 .. stencil.dim-1), 1;
                        t + t0)
                    {
                        pnl[pt; t0; idx] = stencil(pnl[pt; t0-1; idx]);
                    }
                }
                for point pt in v (:, 1; t + t0)
                {
                    pnlL[pt] = stencil(pnl[
                        pt;
                        timeblocksize - 2;
                        (z - (timeblocksize-1)*upper) % numplanes]);
                }

                // write back
                memcpy (v(:, z - (timeblocksize-1)*upper - 1), pnlL);
            }

            // draining phase omitted...
        }
        // implicit synchronization point from parallel v loop
    }
}

The d-dimensional root domain is cut into (d-1)-dimensional slices orthogonal to the last axis, called planes in the following and in Listing 7.4. The number of planes required is given by the shape of the stencil. For the calculation of one output plane, lower + upper + 1 planes are required, where lower is the number of stencil nodes below the center node and upper the number of nodes above the center. The outer temporal t loop iterates over all the time steps, increasing the global time step by timeblocksize, the number of inner temporal iterations. The actual algorithm applies to the subdomain v. The algorithm is a sort of software pipelining; the pipelined operations are load-compute-store blocks. For brevity, Listing 7.4 only shows the steady-state phase of the pipeline. pnl and pnlL are sets of temporary data buffers. pnlL is the plane into which the last local time step is written, and which is copied back to the main memory. Within the z loop, which processes all the planes in the subdomain v, data is loaded from v into the temporary buffer pnl and consumed in the compute loops. Data at the border is loaded and computed redundantly and in an overlapping fashion with respect to other subdomains v. To express this, the subdomain on which the stencil is evaluated is enlarged by the expression

    v (: + (timeblocksize - t0 - 1) * stencil.box(1 .. stencil.dim-1), 1; t + t0)

which means that the first d-1 coordinates of the bounding box of the stencil are added (timeblocksize - t0 - 1) times to the size of v, while the size in the last dimension remains 1. There is an implicit synchronization point after the parallel v loop.

7.2.4 Independence of the Hardware Architecture

A Strategy's independence of the hardware architecture is given by the notions of general iterators over subdomains and generic data copy operations, and, more importantly, by the liberty of interpreting the parallelism defined in a Strategy as a may parallelism rather than a must parallelism.

The underlying model of the hardware is a simple hierarchical model, to some extent inspired by the OpenCL execution model [95]. The execution units are indexed by multi-dimensional indices, similar to OpenCL's NDRange index spaces. This guarantees that there is an optimal mapping to architectures that have hardware support for multi-dimensional indexing or have multi-dimensional indices built into the programming model, such as CUDA or OpenCL. We call a level in the hierarchy a parallelism level. The dimension of the indices may differ in each parallelism level. Each parallelism level entity can have its own local memory, which is only visible within and below that parallelism level. We allow the data transfers to the local memories to be either implicit or explicit, i.e., managed by hardware or by software. Furthermore, we permit both synchronous and asynchronous data transfers. According to this model, a shared-memory CPU architecture has one parallelism level with local memory (cache) with implicit data transfer, and a CUDA-capable GPU has two parallelism levels, streaming multiprocessors and streaming processors, or thread blocks and threads, respectively. The thread block level has an explicit transfer local memory, namely the per-multiprocessor shared on-chip memory.

An architecture description together with the PATUS back-end code generators specific to a hardware platform are responsible for the correct mapping to the programming model. In particular, nested parallelism within a Strategy is mapped to subsequent parallelism levels in the model. If a hardware platform has fewer parallelism levels than given in a Strategy, the parallel Strategy entities are simply mapped onto the last parallelism level of the architecture and executed sequentially.

Domain decomposition and mapping to the hardware are implicitly given by a Strategy. Every subdomain iterator, e.g.,

    for subdomain v (size_v) in u (:; t) ...

decomposes the domain u into subdomains of the smaller size size_v. When, in addition, the parallel keyword is used on a subdomain iterator, the loop is assigned to the next parallelism level (if there is one available), and each of the iterator boxes is assigned to an execution unit on that parallelism level. All the loops within the iterator also belong to the same parallelism level (i.e., are executed by the units on that level), until another parallel loop is encountered in the loop nest.

If a parallelism level requests explicit data copies, memory objects are allocated for an iterator box as "late" as possible: since local memories tend to be small, the iterator within the parallelism level with the smallest boxes, i.e., as deep down in the loop nest as possible (such that the loop still belongs to the parallelism level), is selected to be associated with the memory objects. The sizes of the memory objects are derived from the box size of that iterator, and data from the memory objects associated with the parallelism level above are transferred.
In the case that the iterator contains a stencil call within a nested loop on the same parallelism level, the iterator immediately above a point iterator

    for point p in v (:; t) ...

is selected, if there is one; if there is no such iterator, the iterator containing the stencil call is selected. The Strategy excerpt

    for subdomain v (size_v) in u (:; t) parallel
        for subdomain w (size_w) in v (:; t) parallel
            for point p in w (:; t) ...

and the resulting data decomposition and the ideal hardware mapping are visualized in Fig. 7.1. The upper layer shows the hierarchic domain decomposition of u into v and w. The bottom layer shows an abstraction of the hardware architecture with two parallelism levels, work groups and work items, both of which can have a local memory.

Figure 7.1: Mapping between data and hardware. Both the hardware architecture (bottom layer) and the data (top layer) are viewed hierarchically: the domain u is subdivided into v and w; the hardware architecture groups parallel execution units on multiple levels together.

By making the v loop in the Strategy parallel, it is assigned to the first parallelism level, labeled work groups in the figure (following the OpenCL nomenclature). By making the nested w loop parallel, this loop, in turn, is assigned to the second parallelism level, the work items within a work group. The points p in w are all assigned to the same work item that "owns" w.

7.2.5 Examples of Generated Code

To conclude this chapter, we show two examples of how a simple Strategy iterator is mapped to code, once for an OpenMP-parallelized shared memory CPU system, and once for a CUDA-programmed NVIDIA GPU. Listing 7.5 shows an example of generated C code using OpenMP for parallelization. It was generated for a 3D stencil from the parallel Strategy subdomain iterator

for subdomain v (cb) in u (:; t) parallel ...

OpenMP provides one level of one-dimensional indexing (by the thread number, omp_get_thread_num()), but the domain to be decomposed is 3-dimensional. Thus, the 3-dimensional index range

    [v_idx_x, v_idx_x_max] x [v_idx_y, v_idx_y_max] x [v_idx_z, v_idx_z_max]

is calculated based on the thread ID. By incrementing the loop index v_idx by the number of threads, the v blocks are dealt out cyclically.

Listing 7.5: C/OpenMP code generated for a 3D stencil.

int dimidx0 = omp_get_thread_num ();
int dimsize0 = omp_get_num_threads ();
int v_numblocks =
    ((x_max+cb_x-1)/cb_x) * ((y_max+cb_y-1)/cb_y) *
    ((z_max+cb_z-1)/cb_z);
for (v_idx = dimidx0; v_idx <= v_numblocks-1;
     v_idx += dimsize0)
{
    tmp_stride_0z = ((x_max+cb_x-1)/cb_x) *
        ((y_max+cb_y-1)/cb_y);
    v_idx_z = v_idx / tmp_stride_0z;
    tmpidxc0 = v_idx - (v_idx_z * tmp_stride_0z);
    v_idx_y = tmpidxc0 / ((x_max+cb_x-1)/cb_x);
    tmpidxc0 -= v_idx_y * ((x_max+cb_x-1)/cb_x);
    v_idx_x = tmpidxc0;
    v_idx_x = v_idx_x*cb_x + 1;
    v_idx_x_max = min (v_idx_x+cb_x, x_max+1);
    v_idx_y = v_idx_y*cb_y + 1;
    v_idx_y_max = min (v_idx_y+cb_y, y_max+1);
    v_idx_z = v_idx_z*cb_z + 1;
    v_idx_z_max = min (v_idx_z+cb_z, z_max+1);

    // inner loops/computation within
    // [v_idx_x, v_idx_x_max] x ...
}

In contrast, CUDA provides two levels of indexing, the thread block and the thread level. Moreover, the indices on the thread block level can be 1 or 2-dimensional (prior to CUDA 4.0) or 1, 2, or 3-dimensional in CUDA 4.0 on a Fermi GPU, and thread indices can be 1, 2, or 3-dimensional. Also, a GPU being a manycore architecture, we assume that there are enough threads to compute the iterates in v completely in parallel. Hence, there is no loop iterating over the domain, but a conditional guarding the "iteration" space instead, as shown in Listing 7.6.

Listing 7.6: C for CUDA code generated for a 3D stencil.

stride_1 = (blockDim.y + y_max - 1) / blockDim.y;
tmp = blockIdx.y;
idx_1_2 = tmp / stride_1;
tmp -= idx_1_2 * stride_1;
idx_1_1 = tmp;
v_idx_x = 1 + (blockDim.x * blockIdx.x + threadIdx.x) * cbx;
v_idx_x_max = v_idx_x + cbx;
v_idx_y = threadIdx.y + idx_1_1 * blockDim.y + 1;
v_idx_y_max = v_idx_y + 1;
v_idx_z = threadIdx.z + idx_1_2 * blockDim.z + 1;
v_idx_z_max = v_idx_z + 1;

if ((((v_idx_x <= x_max) && (v_idx_y <= y_max)) &&
     (v_idx_z <= z_max)))
{
    // inner loops/computation within
    // [v_idx_x, v_idx_x_max] x ...
}

Again, the 3-dimensional index range

    [v_idx_x, v_idx_x_max] x [v_idx_y, v_idx_y_max] x [v_idx_z, v_idx_z_max]

is calculated before the actual calculation, but now based on the 2-dimensional thread block indices (blockIdx.x, blockIdx.y) and the 3-dimensional thread indices (threadIdx.x, threadIdx.y, threadIdx.z). blockIdx and threadIdx are built-in variables in C for CUDA.

Chapter 8

Auto-Tuning

In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selections amongst them for the purposes of a calculating engine. One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation. — Ada Lovelace (1815–1852)

8.1 Why Auto-Tuning?

Automated tuning or auto-tuning is the use of search to select the best performing code variant and parameter configuration from a set of possible versions. This means that a benchmarking executable is built and actually run on the target machine. The code versions can range from valid re-parameterizations that have an impact on cache behavior (in the context of PATUS, finding the best combination of Strategy parameters), to code transformations as described in Chapter 3.4, which can be done using dedicated source-to-source transformation frameworks such as CHiLL [80] or by determining the best configuration for an optimizing compiler [129], to data structure transformations, or even to different algorithmic designs, i.e., substituting the original algorithm with a "smarter" variant which performs better under the given circumstances. In PATUS this would mean exchanging one Strategy for another and finding the one with the best overall performance behavior. Another, famous, example for the latter would be to replace the naive O(N^2) N-body algorithm by a tree code such as the O(N log N) Barnes-Hut algorithm or the O(N) Fast Multipole Method (FMM). All three are approaches to solve the same problem, so the results, given the same input set, will be equivalent. The performance, however, depends on the size of the input set: for small problem sizes the O(N^2) method will win, because in this case Barnes-Hut and FMM have non-negligible overheads associated with them (in other words, albeit being O(N log N) and O(N) algorithms, respectively, the constants are quite large). On the other hand, if N is large, the Fast Multipole Method will win.

Auto-tuning has become a widely accepted technique, at least in the research community, and has been applied successfully in a number of scientific computing libraries, including the pioneers of auto-tuned libraries for scientific computing: ATLAS [172] for dense linear algebra, FFTW [65] and SPIRAL [139] for spectral algorithms, and OSKI [166] for sparse linear algebra. However, the optimization scope of today's auto-tuners encompasses only code transformations and perhaps data structure transformations. As these projects demonstrate, auto-tuning is still a valuable approach for creating high performance codes. Rather than hand-tuning and specifically tailoring code to a specific hardware architecture, which is done, e.g., in vendor libraries such as Intel's Math Kernel Library (MKL), auto-tuning is an attempt at transferring performance across hardware platforms and letting the code adapt to the architecture's peculiarities. Of course, this comes at a cost. While Intel's MKL can simply be copied to a new machine and is ready to be used, the installation of ATLAS requires going through a one-time procedure of adapting the kernels in an offline, time intensive auto-tuning phase. But the benefit is that a single code base can be reused across many different architectures without needing to re-engineer and re-optimize compute kernels for every new emerging architecture in an error prone and time intensive process. The same holds true for the PATUS approach: migrating to another platform requires a one-time offline re-tuning and possibly a preceding code generation phase, but none of the existing code has to be changed.

The three major concepts of auto-tuning are the optimization scope, code generation for the specific optimizations enumerated, and exploration of the code variant and parameter space [176]. In our approach we aim at an integration of all three concepts. The optimization scope is defined by the Strategy set and ultimately the Strategy design; code generation is handled by the PATUS code generator, which depends on both the problem description in the form of the stencil specification and on a Strategy; and Strategies also form the bridge to the PATUS auto-tuning module, which is dedicated to the search space exploration. Other approaches prefer a more composable structure, such as Active Harmony [159], which essentially provides a means to navigate the search space, while code generation is delegated to a decoupled transformation framework (currently, the loop transformation framework CHiLL [80] is used). Both FFTW and SPIRAL are appealing in that the search space essentially is constituted by code variants emerging from transformations within a rigorous mathematical framework, which enables enumerating valid algorithms.

The difficult task of an optimizing compiler is to decide on the potentially best combination of transformations without knowledge of the problem domain and completely obliviously of the data. The output is a single executable, based on a heuristically chosen path in the decision tree. In contrast, the auto-tuning approach encourages a choice of code variants for the tuning phase. Auto-tuners mostly are domain specific: in the code generation phase, variants can be generated which an optimizing compiler has to reject, because it has to err on the conservative side for safety's sake. Also, the resulting optimization configuration returned by an auto-tuner is necessarily data-aware, since an input data set (which ideally is real data or a training data set reflecting the properties of the real data) is required to run the benchmarks. Thus, optimizing compilers aim at best performance while being agnostic of the problem, the execution platform (to some extent), and the data, while auto-tuners aim at best performance for a specific problem domain, a specific hardware platform, and a specific input data set. By performance we do not necessarily mean the maximum number of Flops per time unit, but a more general term, which can indeed include GFlop/s, but can also mean energy efficiency or maximum GFlop/s subject to constraints such as memory usage or total power consumption.

On the other hand, it can be too restrictive to optimize for a single input data set, since the performance behavior can depend sensitively on the input data. Approaches to mitigate this include online auto-tuning, i.e., adapting the code while the program is running. This is an approach taken by Tiwari et al. in [159]. Another way would be to tune for multiple probable input data sets simultaneously, and to choose the best code variant based on a probability distribution of the input data sets.

Another question is how to deal with and navigate the typically enormous search space, which is exponential in the number of parameters, and therefore how to deal with the time required to execute the benchmarks while traversing the search space. A purely theoretical approach is to find a performance model predicting the performance for each parameter configuration.
This would not solve the problem of navigating the search space, but it could give insights into which areas of the search space are "bad" and can be avoided. It might be the case that model-based pruning of the search space is sufficient, so that the remaining configurations are so few in number that they can be examined exhaustively. In the context of our work, it is not clear whether and how this approach is applicable; it would require that a performance model be created (preferably automatically) for each PATUS Strategy.

Navigating the search space is a combinatorial optimization [22, 130] or integer programming [183] problem. Lacking a performance model, the cost function is not given as a mathematical model. Instead, evaluating the cost function for a certain input means executing the benchmarking executable under a specific parameter configuration and measuring the time spent in the kernel. Clearly, gradient-based optimization techniques are not applicable. In the next section, a choice of gradient-free search methods is discussed that could be used to traverse the search space. Gradient-free search methods include direct search methods, which were popularized in the 1960s and 1970s, as well as metaheuristics such as genetic algorithms, which have enjoyed some popularity since the 1990s and can be counted among the state-of-the-art algorithms in integer programming today.

Finally, an essential requirement for the actual auto-tuning benchmarking process is a "clean" environment. When running the benchmarks, it is assumed that no other (significantly compute intensive) process is running besides the application being tuned, so that measurements are not perturbed. Also, it is assumed that background noise (e.g., produced by the operating system) is negligible. Typically, multiple runs are done and the performance is averaged or the median is taken.
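Since the cost function is evaluated by running the benchmark, the core of such an objective function can be sketched in C as follows. The executable path, its command line convention, and the output format ("time: <seconds>") are hypothetical assumptions made for illustration; they do not describe the actual PATUS benchmarking harness:

    #include <stdio.h>

    /* Evaluate the auto-tuning objective for one parameter configuration by
       running the benchmarking executable and reading back the measured
       kernel time. Executable name, argument convention, and output format
       are hypothetical placeholders. */
    double evaluate_configuration (const char *benchmark_exe,
                                   const int *params, int num_params)
    {
        char cmd[1024];
        int len = snprintf (cmd, sizeof (cmd), "%s", benchmark_exe);
        for (int i = 0; i < num_params && len > 0 && len < (int) sizeof (cmd); i++)
            len += snprintf (cmd + len, sizeof (cmd) - len, " %d", params[i]);

        FILE *p = popen (cmd, "r");             /* run the benchmark executable */
        if (p == NULL)
            return 1e30;                        /* treat failures as very bad */

        double time = 1e30;
        char line[256];
        while (fgets (line, sizeof (line), p) != NULL)
            sscanf (line, "time: %lf", &time);  /* parse the reported kernel time */
        pclose (p);

        return time;                            /* smaller is better */
    }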

8.2 Search Methods

Direct search methods have been characterized as derivative-free optimization methods (and therefore are sometimes referred to as zero-order methods), which only require the range of the objective function to be equipped with a partial order [98]. Some of the algorithms developed in the 1960s and 1970s, the golden era of direct search methods, are still in use in practice today, since they have proven to be robust in that they are able to deliver at least a local optimum, and they are simple and therefore easy to implement. Probably the most famous of these algorithms is the Nelder-Mead method [118], also called simplex search (not to be confused with the simplex algorithm for linear programming). In [98], Lewis et al. partition classical direct search methods into three categories: pattern search methods, simplex methods, and adaptive search direction methods. Originally, direct search methods were designed for continuously differentiable objective functions f : Ω ⊆ R^n → R, and in that case convergence proofs can be given. Because they do not require the calculation of derivatives, they can be applied to our discrete setting and actually work well enough for our purpose. All the algorithms presented in pseudo code in this section try to find a local minimum of the objective function f, but they can easily be rewritten to find a local maximum instead. In the text we use the generic term, optimum.

8.2.1 Exhaustive Search

Exhaustive search, i.e., enumeration of the entire search space, is the most basic search method and is guaranteed to find the global optimum. Obviously, due to tight time constraints this method has only limited use in practice, since the search space grows exponentially with the number of parameters. However, it might be valuable to search a very limited number of parameters exhaustively, for which only a small range of values needs to be covered. An example within PATUS could be loop unrolling configurations. On the other hand, if heuristics can be found which can be used to prune the search space so that a manageable set of combinations is left, these combinations can be searched exhaustively. Obviously, exhaustive search is trivially parallelizable. Thus, if a cluster of n equal machines is available, the time required for the exhaustive search can be cut by a factor of n.

8.2.2 A Greedy Heuristic

A simple greedy search method, which is only additive in the number of parameter values instead of multiplicative as the exhaustive search, varies one parameter at a time while the others remain fixed. Specifically, starting from an initial point, all the points parallel to the first axis are enumerated (potentially in parallel). From these points, the best one is picked and fixed, and from this new point, the method proceeds to enumerating all the points parallel to the second axis, and so forth. In Algorithm 8.1 we assume that the entire search space is given by the integers in [x_1^min, x_1^max] x ... x [x_d^min, x_d^max].

Algorithm 8.1: A greedy search method.

Input: Objective function f, starting point x_0
Output: Local minimizer x̂ for f
1: x̂ ← x_0
2: for i = 1, ..., d do
3:   for all ζ = x_i^min, ..., x_i^max do
4:     ξ ← (x_1, ..., x_{i-1}, ζ, x_{i+1}, ..., x_d)   ▷ write x̂ =: (x_1, ..., x_d)
5:     if f(ξ) < f(x̂) then   ▷ if ξ is better than x̂
6:       x̂ ← ξ [atomically]   ▷ accept ξ as new point
7:     end if
8:   end for
9: end for
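A minimal C sketch of Algorithm 8.1 follows; the objective function type and the bound arrays are assumptions made for illustration (e.g., the objective could wrap the benchmark run), not the PATUS implementation:

    /* Greedy axis-by-axis search as in Algorithm 8.1. f is a black-box
       objective over an integer parameter vector; xmin and xmax give the
       per-parameter ranges. */
    typedef double (*objective_t) (const int *params, int num_params);

    void greedy_search (objective_t f, int d, const int *xmin, const int *xmax,
                        int *x /* in: starting point, out: local minimizer */)
    {
        double fbest = f (x, d);
        for (int i = 0; i < d; i++)            /* sweep the axes one by one */
        {
            int best = x[i];
            for (int zeta = xmin[i]; zeta <= xmax[i]; zeta++)
            {
                x[i] = zeta;                   /* vary parameter i only */
                double fx = f (x, d);
                if (fx < fbest) { fbest = fx; best = zeta; }
            }
            x[i] = best;                       /* fix parameter i to its best value */
        }
    }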

8.2.3 General Combined Elimination

This greedy algorithm was proposed by Pan and Eigenmann in [129]. The basic idea is to fix the parameters that have the most beneficial effect on performance one by one and to iterate as long as parameters can be found which have not been fixed and improve the performance. As for the greedy heuristic above, we assume that the d-dimensional search space is given by the integers in [x_1^min, x_1^max] x ... x [x_d^min, x_d^max]. In Algorithm 8.2 we write x̂ = (x_1, ..., x_d), and we use the notation

    x̂|x_i=ξ := (x_1, ..., x_{i-1}, ξ, x_{i+1}, ..., x_d)

to replace the i-th coordinate by ξ.

Algorithm 8.2: General combined elimination.

Input: Objective function f, starting point x_0
Output: Local minimizer x̂ for f
1: x̂ ← x_0
2: I ← {1, ..., d}
3: repeat
4:   ρ_{i,ξ} ← (f(x̂|x_i=ξ) - f(x̂)) / f(x̂)   for all i ∈ I, ξ = x_i^min, ..., x_i^max
5:   X ← {(i, ξ) : ρ_{i,ξ} < 0 and ξ = argmin_ζ ρ_{i,ζ}}
6:   for all (i, ξ) ∈ X ordered ascendingly by ρ_{i,ξ} do
7:     if f(x̂|x_i=ξ) < f(x̂) then
8:       I ← I \ {i}   ▷ remove processed index from index set
9:       x̂ ← x̂|x_i=ξ   ▷ replace the old configuration
10:    end if
11:  end for
12: until X = {}

Algorithm 8.2 iteratively reduces the parameter index set I. In line 4, the relative improvement percentage ρ is calculated for each parameter still referenced by the index set I and for each admissible value in [x_i^min, x_i^max] (in an embarrassingly parallel fashion). ρ is the relative improvement over the currently best parameter configuration x̂. In line 5, the set X of index-value pairs is formed, consisting of parameter configurations which improve the objective function over the old configuration x̂, i.e., ρ < 0; for each index only the value which results in the most improvement of the objective function is kept. Then, in the for loop (lines 6 to 11), all the parameters are iteratively fixed to their best values ξ, replacing the old configuration x̂, and the parameter reference is removed from the index set, provided that the objective function is still improved over the new configuration x̂. The process is repeated as long as there are parameters which improve the objective function, i.e., as long as X is not empty.

8.2.4 The Hooke-Jeeves Algorithm

The idea of the Hooke-Jeeves algorithm [82, 137], shown in Algorithm 8.3, is to make exploratory moves from the current best point by taking discrete steps in each positive and negative direction parallel to the coordinate axes. If an improvement can be found, the new point is accepted and the process is repeated. If no improvement is found, the step size is halved, and the process is repeated. If no improvement can be found and the step size has reached the tolerance, the algorithm stops and the current best point is returned as a local optimum.

Hence, this algorithm performs a local search, trying to improve a starting point by searching in its neighborhood. The algorithm also qualifies as a pattern search: pattern search methods explore the search space by visiting a pattern of points in the search space, which lie on a rational lattice. For integer search, the algorithm terminates as soon as the step size becomes 1.

Algorithm 8.3: The Hooke-Jeeves algorithm.

Input: Objective function f, starting point x_0, initial step width h
Output: Local minimizer x̂ for f
1: x ← x_0
2: while not converged do
3:   repeat
4:     x_c ← x
5:     for i = 1, ..., d do   ▷ try to find improvements at x ± h e_i
6:       if f(x_c + h e_i) < f(x) then
7:         x_c ← x_c + h e_i   ▷ accept improvement
8:       else
9:         if f(x_c - h e_i) < f(x) then
10:          x_c ← x_c - h e_i   ▷ accept improvement
11:        end if
12:      end if
13:    end for
14:    if x ≠ x_c then   ▷ value of f was improved
15:      d ← x_c - x
16:      x ← x_c + d
17:    end if
18:  until x = x_c   ▷ no improvement possible
19:  h ← h/2   ▷ tighten mesh if no improvement can be found
20: end while
21: x̂ ← x
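A compact C sketch of the integer variant of Algorithm 8.3 is given below; the objective function type, the omission of parameter bounds, and the use of a C99 variable-length array are simplifying assumptions for illustration rather than the PATUS implementation:

    #include <string.h>

    typedef double (*objective_t) (const int *params, int num_params);

    /* Integer Hooke-Jeeves: exploratory moves of step width h along each
       axis, followed by a pattern move; the mesh is tightened when no
       exploratory move improves the objective. */
    void hooke_jeeves (objective_t f, int d, int h0, int *x /* in/out */)
    {
        int h = h0;
        int xc[d];                               /* exploratory point (C99 VLA) */

        while (h >= 1)
        {
            int improved;
            do
            {
                memcpy (xc, x, d * sizeof (int));
                double fx = f (x, d);
                for (int i = 0; i < d; i++)      /* exploratory moves along each axis */
                {
                    xc[i] += h;
                    if (f (xc, d) < fx) continue;   /* accept the move by +h */
                    xc[i] -= 2 * h;
                    if (f (xc, d) < fx) continue;   /* accept the move by -h */
                    xc[i] += h;                     /* neither direction helped: undo */
                }
                improved = memcmp (x, xc, d * sizeof (int)) != 0;
                if (improved)
                    for (int i = 0; i < d; i++)     /* pattern move: x <- xc + (xc - x) */
                    {
                        int di = xc[i] - x[i];
                        x[i] = xc[i] + di;
                    }
            } while (improved);
            if (h == 1) break;                   /* integer search: stop at step size 1 */
            h /= 2;                              /* tighten the mesh */
        }
    }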

8.2.5 Powell’s Method

Powell's method [136] is an extension of the greedy search method described above. It starts in the same way as the greedy method, searching for a local optimum along the coordinate axes. However, in contrast to the greedy method, in Powell's method, after completing the run through the axes, a new search direction, ξ_d in Algorithm 8.4, is constructed, which is based on the local optimum found during the iteration and is linearly independent of the other search directions ξ_1, ..., ξ_{d-1}. The process is repeated with the new set of search directions until no improvement can be made. In [136], Powell proves that for a convex quadratic the set of search directions always forms a set of conjugate directions (i.e., assuming that the objective function can be written as f(x) = x^T Q x for a symmetric positive definite matrix Q ∈ R^{d×d}, ξ_j^T Q ξ_k = 0 holds for any distinct search directions ξ_j, ξ_k), and that the method finds the optimum of the quadratic in finitely many steps. Note that, as in the greedy search, line 4 in Algorithm 8.4 can be done by parallel function evaluations.

Algorithm 8.4: Powell’s method.

Input: Objective function f, starting point x_0
Output: Local minimizer x̂ for f
1: ξ_i ← e_i (i = 1, ..., d)   ▷ initialize search directions parallel to axes
2: while f(x_0) < f(x_0^old) do
3:   for i = 1, ..., d do
4:     find λ_i such that f(x_{i-1} + λ_i ξ_i) is a minimum
5:     x_i ← x_{i-1} + λ_i ξ_i
6:   end for
7:   ξ_i ← ξ_{i+1} (i = 1, ..., d-1)   ▷ construct new search directions
8:   ξ_d ← x_d - x_0
9:   find λ_d such that f(x_d + λ_d ξ_d) is a minimum
10:  x_0^old ← x_0
11:  x_0 ← x_0 + λ_d ξ_d
12: end while
13: x̂ ← x_0

8.2.6 The Nelder-Mead Method

The Nelder-Mead method [118], also called simplex search or downhill simplex method, is probably the best known direct search method. One variant of the simplex search is given in Algorithm 8.5. The search works as follows: after choosing d+1 points spanning a non-degenerate simplex in the d-dimensional search space and sorting them according to the function values such that x_1 has the "best" value for the objective function relative to the other points, and x_{d+1} the "worst", four types of transformations are applied to the simplex, by reflecting, expanding, or contracting the point with the worst function value at the centroid of the simplex. If one of these new transformed points happens to be superior to the worst point, the worst point is replaced by a transformed point (lines 5-18), and the process is repeated. If no better point is found, the simplex is shrunk around the best point (line 20). Nelder and Mead give a set of rules determining when the worst point is replaced by which transformed point. This is encoded in the conditionals in lines 6-18 in Algorithm 8.5. Standard values for the parameters occurring in the algorithm (which were also used in PATUS's Nelder-Mead implementation) are α = 1, γ = 2, ρ = 1/2, and σ = 1/2. The search method used in the ActiveHarmony auto-tuning framework [159], for instance, is based on the Nelder-Mead method, extended by a parallel evaluation of the objective function on the vertices after the transformation of the simplex (multiple transformations are done simultaneously).

Algorithm 8.5: One variant of the Nelder-Mead method.

Input: Objective function f
Output: Local minimizer x̂ for f
1: Choose d+1 points x_1, ..., x_{d+1} in the search space spanning a non-degenerate d-dimensional simplex
2: Sort the points such that f(x_1) ≤ ... ≤ f(x_{d+1})
3: while not converged do
4:   x̄ ← (1/d) Σ_{i=1}^d x_i   ▷ compute the centroid of all points except the worst
5:   x^r ← x̄ + α (x̄ - x_{d+1})   ▷ compute reflected point
6:   if f(x_1) ≤ f(x^r) < f(x_d) then
7:     x_{d+1} ← x^r   ▷ accept reflected point
8:   else if f(x^r) < f(x_1) then
9:     x^e ← x̄ + γ (x̄ - x_{d+1})   ▷ compute expansion
10:    if f(x^e) < f(x^r) then
11:      x_{d+1} ← x^e   ▷ accept expansion
12:    else
13:      x_{d+1} ← x^r   ▷ accept reflected point
14:    end if
15:  else
16:    x^c ← x_{d+1} + ρ (x̄ - x_{d+1})   ▷ compute contraction
17:    if f(x^c) < f(x_{d+1}) then
18:      x_{d+1} ← x^c   ▷ accept contraction
19:    else
20:      x_i ← x_1 + σ (x_i - x_1), i = 2, ..., d+1   ▷ reduce simplex
21:    end if
22:  end if
23: end while
24: x̂ ← x_1

8.2.7 The DIRECT Method

The DIRECT method [68, 90] is another pattern search, which, in contrast to the Hooke-Jeeves algorithm, was devised for global optimization. The basic idea is simple: the search space is covered with a rectilinear, initially coarse lattice, and the objective function is evaluated at the centers of the lattice cells. The cells with the best function values are subdivided recursively (each cell is cut into 3 sub-cells along one axis), and the objective function is evaluated at the cell centers of the sub-cells. This subdivide-and-evaluate process is repeated until a potential global optimum is found. For the implementation in PATUS, the original idea has been modified slightly to match the problem structure better: the objective function evaluations are done at the cell corners instead of the cell centers, and cells are recursively bisected instead of trisected.

8.2.8 Genetic Algorithms

An approach to combinatorial optimization radically different from the direct search methods presented above is that of nature-inspired evolutionary algorithms [22]. The main conceptual difference between the direct search methods and evolutionary algorithms is that the former are deterministic methods, whereas the basic principle of the latter is randomization. Genetic algorithms, ant colony optimization, and particle swarm optimization are popular types of evolutionary algorithms. A genetic algorithm is included in PATUS, which is based on the JGAP [109] package.

The idea behind genetic algorithms is natural selection. Roughly speaking, a population (a set of potential solutions) is evolved iteratively towards a "better" population, a set of solutions more optimal than the one before the evolution. Trivially, as the individuals in the population are independent, the population can be evolved in an embarrassingly parallel fashion. The actions driving an evolution are crossover, i.e., interbreeding two potential solutions, mutation, i.e., modifying one or multiple genes of a potential solution, and selection, i.e., selecting the individuals of the population which survive the evolution step. The latter is done by means of the fitness function (the objective function in evolutionary algorithm lingo), which measures the quality of a solution. In the genetic algorithm currently used in PATUS, the crossover operator mixes some of the genes of two individuals, i.e., swaps the values of certain parameters. Mutation is applied to a randomly selected, small set of individuals in the population: mutating means altering the value of a parameter by a random percentage between -100% and 100%.
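A minimal C sketch of these two operators as just described follows; the plain integer array representation of a parameter configuration and the probability choices are assumptions for illustration, while the actual implementation builds on JGAP:

    #include <stdlib.h>

    /* Crossover: swap some parameter values between two individuals. */
    void crossover (int *a, int *b, int d)
    {
        for (int i = 0; i < d; i++)
            if (rand () % 2)                    /* swap this gene with probability 1/2 */
            {
                int tmp = a[i]; a[i] = b[i]; b[i] = tmp;
            }
    }

    /* Mutation: alter one parameter by a random percentage in [-100%, 100%]. */
    void mutate (int *x, int d)
    {
        int i = rand () % d;                            /* pick one gene at random */
        double pct = 2.0 * rand () / RAND_MAX - 1.0;    /* random value in [-1, 1] */
        x[i] = (int) (x[i] * (1.0 + pct));              /* alter it by up to +/-100% */
    }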

8.3 Experiments with the Patus Auto-Tuner — Search Method Evaluation

Fig. 8.1 summarizes the need for auto-tuning and the need for a good search method. It shows the performance distribution of the search space for a specific stencil (the Wave stencil from the example in Chapter 5.2 for a 200^3 problem size) and for a specific Strategy. The Strategy chosen is the chunked cache blocking Strategy presented in Listing 7.2, with loop unrolling enabled. The Strategy has 4 parameters for 3D stencils, the first three ranging from 2 to 200 in increments of 2, the fourth accepting values in {1, 2, 4, 8, 16, 32}. There were 8 loop unrolling configurations. Thus, the total search space contained 100^3 · 6 · 8 = 4.8 · 10^7 points. The data for Fig. 8.1 was obtained by conducting a partial exhaustive search, fixing the first parameter to 200 (a reasonable choice, since the first parameter corresponds to the block size in the unit stride direction, and

by deciding not to cut along the unit stride dimension, the use of the hardware prefetcher is maximized, and data is also loaded in as large contiguous chunks as possible), and letting parameters 2 and 3 accept values between 4 and 200 in increments of 4. Thus, the exhaustive search could be constrained to a tractable number of 120,000 benchmark runs, for which 150 one-hour jobs were allocated on a cluster.

Figure 8.1: Performance distribution over the full search space (a) and restricted to load balanced configurations (b). [Both panels plot, for the single precision Wave stencil on the AMD Opteron with 24 threads, the percentage of configurations and the cumulative distribution function against the attained performance in GFlop/s.]

Fig. 8.1 (a) shows the distribution over this entire search space; Fig. 8.1 (b) restricts the benchmark runs to load balanced ones, i.e., runs in which all of the 24 threads of the AMD Opteron Magny Cours (cf. Chapter 9), the hardware platform on which the experiment was conducted, participated in the computation. (In the Strategy, the size of the blocks determines how many threads can participate; if the block sizes are too large, some threads remain idle.)

The orange lines show how many configurations, in percent, caused the benchmark executable to run at the performance given on the horizontal axis, i.e., they visualize the probability distribution of the random variable mapping parameter configurations to the corresponding performance. The percentages are labeled on the left of the figures. In the total search space, in sub-figure (a), almost one fourth of the configurations ran at around 5 GFlop/s, whereas the maximum attainable performance for this stencil and this Strategy is around 50 GFlop/s. Restricting the block sizes to load balanced configurations results in a more uniform distribution of the configurations across the performance range, although the spike at 5 GFlop/s remains, if less pronounced. The blue areas are cumulative distribution functions with percentages labeled on the right of the figures: given a specific performance number, the upper boundary of the blue area gives the percentage of configurations which achieved at least that performance. Thus, e.g., the probability of picking a load balanced configuration which achieves at least 40 GFlop/s is around 5% for this particular stencil and Strategy on this particular hardware platform.

For the evaluation of the search methods for auto-tuning, two stencil kernels were picked for which tuning had different benefits: the single precision Wave stencil and a double precision 6th order Laplacian, called Upstream in the following charts. On the AMD Opteron Magny Cours, on which the auto-tuning was performed, the Wave stencil benefited mostly from explicit vectorization, and the Upstream stencil, in contrast, from blocking (cf. Chapter 10). Vectorization obviously is not a tunable optimization (it can be either turned on or off), whereas blocking is. Thus, for the Wave stencil we want to verify that the performance does not deteriorate with respect to the baseline when using the auto-tuner with a specific search method, and for Upstream, for which blocking about doubles the performance, we want to test whether the various search methods are able to find blocking configurations which improve the performance.

For the search method evaluation, we added a method that was not discussed in the previous section: random search. This method simply takes a number of random samples (here: 500) from the search space and evaluates them.
For the Wave stencil, we also included the partial exhaustive search described above, labeled "Exhaustive*" in the charts. We did 25 auto-tuning runs with each search method. We only did one run using the DIRECT algorithm, since it is deterministic; all other algorithms have some stochastic elements, at least for picking a starting point. The performance numbers shown for the DIRECT algorithm might be suboptimal, since it exceeded the 3 hour time budget, causing an early termination of the algorithm.

The purple and blue bars in Figs. 8.2 and 8.3 visualize ranges; the lower end of a bar is the minimum and the upper end the maximum value. The more strongly colored middle part of a bar shows the standard deviation around the mean, which is marked by a yellow or green diamond.

Fig. 8.2 (a) shows the performance ranges, the standard deviations, and the average performance numbers achieved by the search methods for both the Wave (left) and the Upstream (right) stencils. In both cases, the genetic algorithm (apart from the exhaustive search) reached the highest average performance. In the Wave example, general combined elimination (GCE) and simplex search reached higher maximum absolute performances, but both methods have a large variance in the performance outcome; so does Hooke-Jeeves. Judging by the standard deviation, the genetic algorithm was still able to outperform both GCE and simplex search. The large variance implies that the methods are sensitive to the choice of the starting point. Apparently, a poor choice of starting point cannot improve performance very far, with the result that the outcome can even be worse than when applying random search. Conversely, if a good starting point is given, both GCE and simplex search yield very good results. Greedy search and the genetic algorithm, both having a more "global view," have a significantly smaller variance. Greedy search, however, is likely to get stuck in a local optimum; thus the genetic algorithm overall delivers higher average and maximum performances, which, however, can be less than the best achieved with GCE and simplex search, as shown in the Wave example.


Figure 8.2: Ranges, standard deviations, and averages of (a) attained performance and (b) percentage of runs with at least 90% of the maximum performance of the respective search method.

For the Upstream stencil, GCE, genetic, and greedy search are about on par; here simplex search performs slightly worse.

15%–30% of all benchmark runs in the genetic algorithm search have a performance which is at least 90% of the maximum of that auto-tuning run. This is shown in Fig. 8.2 (b). This fact is quite surprising, but it means that the chosen mutation and crossover operators are well suited to the problem (at least for the chosen Strategy). Naturally, by construction, exhaustive search and the DIRECT method, which evaluates many points spread over the entire search space, perform many benchmark runs in "bad regions," and therefore only a very low percentage of the runs fall into the high performing category. Equally, random search, agnostic of the run history during the tuning process, does many low performing benchmark runs. For greedy search, the percentage of "good runs" obviously depends on the search space structure.

Fig. 8.3 shows the price that was paid for the good performance outcome of the genetic algorithm. The blue bars represent the results for the Wave stencil, the purple bars for the Upstream stencil. Where Hooke-Jeeves, simplex search, and GCE stayed under 500 benchmark runs on average, more than 1000 (up to 1700) runs were done when applying the genetic algorithm. As a local search method, Hooke-Jeeves terminates quickly (in fact, 5 Hooke-Jeeves runs were carried out, each starting from a different randomly chosen starting point); each auto-tuning run took well under 10 minutes, as shown in Fig. 8.3 (b), one benchmark run taking 2.3 seconds on average. With around 10 and around 20 minutes tuning time, respectively, simplex search (again, 5 runs were done, starting from different randomly chosen simplices) and GCE are the methods of choice if fast tuning is desired, keeping in mind that performance might deteriorate for poor choices of starting points. The running time of greedy search again depends on the structure of the search space; Upstream, with a higher potential of performance increase due to tuning, requires more adjusting than Wave.

Fig. 8.4 shows the convergence of the search methods under study. The lines show the maximum attained performance after 2, 4, 8, ..., 1024 benchmark runs, for Wave in (a) and Upstream in (b). Note the logarithmic scale of the horizontal axis. Also note that the performances shown in the figures are averages over the conducted experiments.

Figure 8.3: Number of benchmark runs in the auto-tuning process (a) and total duration of the auto-tuning process (b). [Both panels show the Wave and Upstream stencils; durations in panel (b) are given in minutes.]

Figure 8.4: Performance convergence of the search methods for (a) the Wave and (b) the Upstream stencils. The performance numbers are averages over all runs. Note the logarithmic scale of the horizontal axis.

The figures show that the genetic algorithm, on average, beats Hooke-Jeeves, GCE, and simplex search, even if not as many as 1000 iterations are carried out. As the figures show, it can in fact be stopped earlier, as the performance increase from 512 to 1024 runs is not as pronounced. On average, the performance during a GCE run seems to increase rather slowly; the convergence rates of Hooke-Jeeves and simplex search are about equal. Perhaps surprisingly, random search shows a steady convergence and rather good performance numbers, but remember that these numbers are statistically expected values; thus, in a specific instance the performance outcome might not be as desired. For the performance benchmark results in Chapter 10, we used the genetic algorithm as the search method.

Part III

Applications & Results

Chapter 9

Experimental Testbeds

It is desirable to guard against the possibility of exaggerated ideas that might arise as to the powers of the Analytical Engine. — Ada Lovelace (1815–1852)

In this chapter, we give an overview of the hardware architectures that were used to conduct the performance benchmark experiments which will be discussed in the next chapter. Table 9.1 shows a summary of the architectural features. The bandwidth was measured using the STREAM triad benchmark [106], and the performance is for a single precision matrix-matrix multiplication of two large (8192 × 8192) square matrices. The compute balance measures how many floating point operations have to be carried out per data element D transferred from main memory to the compute units so that there is no pipeline stall. Table 9.1 shows that the numbers are in the range of 20–30 Flops per datum on these current architectures, a high number which is not reached for typical stencils. This once again highlights the importance of algorithms making efficient use of the cache hierarchy.

                         AMD Opteron 6172    Intel Xeon E7540    NVIDIA Tesla C2050
                         Magny Cours         Nehalem             Fermi
Cores                    2 × 2 × 6           2 × 6               14
Concurrency              24 HW threads       24 HW threads       448 ALUs
Clock                    2.1 GHz             2 GHz               1.15 GHz
L1 Data Cache            64 KB               32 KB               48/16 KB
L2 Cache                 512 KB              256 KB              —
Sh'd Last Level Cache    6 MB                18 MB               768 KB
Avg. Sh'd L3/HW Thd      1 MB                1.5 MB              —
Measured Bandwidth       53.1 GB/s           35.0 GB/s           84.4 GB/s
Measured SP Perf.        209 GFlop/s         155 GFlop/s         618 GFlop/s
Compute Balance          23.3 Flop/D         17.7 Flop/D         29.3 Flop/D

Table 9.1: Hardware architecture characteristics summary.
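As a reading aid for the last row of Table 9.1, assuming that the balance is formed from the measured single precision performance P, the measured bandwidth B, and a data element size of s = 4 bytes (single precision):

    B_c = P / (B / s)   [Flop per data element],

e.g., for the Xeon E7540: 155 GFlop/s / (35.0 GB/s / 4 B) ≈ 17.7 Flop/D. The symbols P, B, and s are notation introduced here for illustration only.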
Figure 9.1: Block diagram of one die of the 12 core AMD Opteron Magny Cours architecture.

9.1 AMD Opteron Magny Cours

AMD's Magny Cours processor integrates two dies of six x86-64 cores each, manufactured in a 45 nm silicon on insulator (SOI) process technology, in one package, totaling around 2.3 billion transistors. The cores are out-of-order, three-way superscalar processors, i.e., they can fetch and decode up to three x86-64 instructions per cycle and dispatch them to independent functional units. Specifically, variable length x86-64 instructions are converted to fixed-length macro-operations, which are dispatched to two independent schedulers, one for integer and one for floating point/multimedia operations. There are three integer pipelines for integer operations and address generation, and three pipelines for floating point and multimedia operations.

Each core has its own two-way associative 64 KB L1 caches (two: one for instructions and one for data) and its own on-chip 16-way associative unified 512 KB L2 cache, and all the cores on a die share a 16-way associative 6 MB L3 cache. The load-to-use latencies are 3 cycles for the L1 cache and 12 cycles for the L2 cache. The L2 cache is a victim cache for the L1 caches, i.e., DRAM fetches are moved directly into L1, and cache lines evicted from the L1 cache are caught by the L2 cache. Similarly, the L3 cache is a victim cache for the L2 caches of the individual cores. A sophisticated algorithm prevents L3 cache thrashing based on the cores' cache efficiency.

AMD's solution to the cache coherence problem in multiprocessor settings in the Magny Cours architecture is HT Assist, also known as the Probe Filter: a cache directory that keeps track of all the cache lines that are in the caches of other processors in the system and therefore eliminates the need for broadcasting probes. If activated, the cache directory reserves 1 MB of the L3 cache for its use [46].

Each of the two dies has two DDR3 memory channels and four HyperTransport 3.0 links, cf. Fig. 9.1. This implies that the package itself is a NUMA architecture: each die is a NUMA node, each having its own memory controller. Two of the per-die HyperTransport links are used to connect the dies internally. Hence, the package exposes four memory channels and four HyperTransport links. The maximum theoretical memory bandwidth per socket is 25.6 GB/s. We measured an aggregate bandwidth of 53.1 GB/s using all 24 threads on all 4 dies of the dual-socket node using the STREAM triad benchmark [106].

The benchmarks were run on one node of a Cray XE6. On the machine, the Cray Linux Environment 3.1 was run as the operating system, and the GNU gcc 4.5.2 C compiler was used.
Figure 9.2: Block diagram of one die of the 6 core Intel Xeon Nehalem architecture.

9.2 Intel Nehalem

The Intel Xeon E7540 Beckton belongs to the family of the recently released Nehalem architectures, which differs drastically from the previous generations in that the frontside bus, which used to connect processors via a north-bridge chip to memory, has been forgone in favor of Quick Path Interconnect (QPI) links, Intel's point-to-point coherence interface, and of on-chip memory controllers. This results in a considerably higher bandwidth compared to older generation systems. Coherency is managed through distributed or directed snoop messages.

The chip is also manufactured in 45 nm technology and contains around 2.3 billion transistors. The system we used is a dual-socket hexa-core architecture running at a 2 GHz clock frequency. We used the CPUs in Hyper-Threading mode, thus there are 24 hardware threads available. The cores are 4-wide out-of-order cores. Each of the cores is equipped with a 32 KB L1 data and a 32 KB L1 instruction cache and a unified 256 KB L2 cache. The six cores on one socket share an 18 MB L3 cache. There are four Scalable Memory Interconnect channels. We measured 35 GB/s of sustained bandwidth from the memory to the sockets; the bandwidth measurements were done using the STREAM triad benchmark [106].
The Caching Agents connect the cores to the un-core interconnect and the L3 cache. The 8-port router manages the QPI layer and is directly connected to the Caching and Home Agents; the remaining four ports are connected to the external QPI ports. The Home Agents handle read and write requests to the memory controllers, as well as data write-backs from L3 cache replacement victims.

For the benchmarks, we used one node of an IBM x3850 M2 system running SUSE SLES 11.1. The benchmarks were compiled with Intel's C compiler, icc 11.1.

Figure 9.3: Block diagram of one die of the NVIDIA Fermi GPU Tesla C2050 architecture.

9.3 NVIDIA GPUs

CUDA, the Compute Unified Device Architecture, is NVIDIA's graphics processing unit (GPU) architecture introduced with the G80 series of their GeForce GPUs in the fall of 2006. The GeForce 8800 was the first CUDA-programmable GPU. In CUDA GPUs, the traditional graphics pipeline consisting of stages of special-purpose compute units was replaced by unified cores, which can carry out any of the graphics-specific operations previously reserved to dedicated geometry, vertex, and pixel shaders. Advantages of the unified architecture are better balanced workloads, with the added benefit that the by nature massively parallel device could be used much more easily for general purpose computations.

(GPGPU, i.e., general purpose computing on GPUs, had been done prior to the launch of unified device architectures; however, graphics knowledge was required, and the computation had to be translated into the graphics metaphor: e.g., running an algorithm would correspond to rendering an image, and to transfer data, the image had to be projected onto a surface. Also, non-IEEE compliant floating point arithmetic could lead to occasional surprises.) Around the same time, AMD-ATI released their version of unified architecture GPUs, initially with their Stream Processor based on the Radeon R580 GPU.

Since then, GPU computing has become increasingly popular, understandably so, since GPUs, as inexpensive commodity devices, are ubiquitous and deliver a high floating point performance. They have been so successful that they have been readily adopted in high performance computing; in fact, 3 of the top five supercomputers at the time of writing (TOP500 [161], June 2011) use NVIDIA GPUs as hardware accelerators.

The high-end version of NVIDIA's first CUDA GPU had 128 unified shaders, so-called streaming processors. The device used for the benchmarks in this thesis, a Tesla C2050, which is explicitly dedicated to computation, has 448. The C2050 is one of the first GPUs in NVIDIA's Fermi [124] architecture. It is fabricated in a 40 nm process and has around 3 billion transistors. The core clock is 1.15 GHz, which is lower than the clock rate of graphics-dedicated GPUs, in favor of increased reliability. The theoretical single precision peak performance therefore is around 1 TFlop/s, and the double precision peak performance is about 500 GFlop/s. In the Fermi architecture, the double precision performance has been greatly improved over the previous GPU generations.

For computation, the GPU is understood as a co-processor connected to the CPU host system over the PCI Express bus. A block diagram of the internals of the GPU is shown in Fig. 9.3. The diagram reveals that the 448 light-weight cores are organized in 14 streaming multiprocessors, containing 32 cores each. From another viewpoint, the C2050 is a 14 core architecture with each core being a 32-way SIMD unit, since all the streaming processors within a scheduling unit carry out the same instruction. The streaming multiprocessors are again organized into graphics processing clusters, grouping 3 or 4 streaming multiprocessors and containing one raster engine.

Each streaming multiprocessor contains 64 KB of shared memory (here "shared" means that the memory is shared among all the streaming processors within a streaming multiprocessor), which can be configured as a 16 KB cache and 48 KB of explicit, software-controlled local memory, or as 48 KB of cache and 16 KB of local memory. Each streaming multiprocessor has a rather large register file: there are 32768 32-bit wide registers. Access latency to the register file is one clock cycle, and a couple of clock cycles to the shared memory. The entire device comes with a modest 768 KB last level cache. The GPU is equipped with 3 GB of on-board ECC-protected GDDR5 memory, for which we measured 84.4 GB/s of sustained bandwidth with ECC turned on, using NVIDIA's bandwidth measurement utility. DRAM latency is in the range of several hundred clock cycles.

Parallel to the hardware, NVIDIA developed C for CUDA [123], the general-purpose language used to program their CUDA GPUs. CUDA C is a slight extension of C/C++.
In particular, there are new specifiers identifying a kernel, the program portion which is executed on the GPU. The CUDA programming model follows the SPMD model; each logical thread executes the same kernel, and a kernel therefore is a thread-specific program, in which special built-in variables have to be used to identify a thread and the portion of the data the thread is supposed to operate on. In correspondence to the hierarchical structure of the hardware, the threads — which are logically executed by streaming processors — are grouped into thread blocks, which are mapped onto streaming multiprocessors. Thread blocks, in turn, are grouped into the grid, which, in the execution model, corresponds to the entire GPU. In the Fermi architecture, both thread blocks and the grid can be one-, two-, or three-dimensional (i.e., indexed by one-, two-, or three-dimensional thread and thread block IDs), which eases index translations for three-dimensional physical simulations, for instance. For the benchmarks, version 4.0 of the CUDA SDK and runtime was used, compiled with nvcc release 4.0, V0.2.1221, wrapping GNU gcc 4.4.3.
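To make the programming model concrete, the following minimal CUDA C sketch (illustrative names; this is not code generated by PATUS) shows a kernel for a 7-point Laplacian-like update, in which the built-in variables threadIdx, blockIdx, and blockDim identify the grid point a thread operates on:

// Minimal CUDA C sketch of the SPMD model: every thread runs the same kernel
// and uses its thread/block IDs to select the grid point it updates.
// All names (nx, ny, nz, IDX, ...) are illustrative.
#define IDX(i, j, k, nx, ny) ((i) + (nx) * ((j) + (ny) * (k)))

__global__ void laplacian_kernel(float* u_new, const float* u,
                                 int nx, int ny, int nz)
{
    // map the 3D thread/block indices to a grid point (i, j, k)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // skip boundary points and out-of-range threads
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
        return;

    u_new[IDX(i, j, k, nx, ny)] =
        u[IDX(i + 1, j, k, nx, ny)] + u[IDX(i - 1, j, k, nx, ny)] +
        u[IDX(i, j + 1, k, nx, ny)] + u[IDX(i, j - 1, k, nx, ny)] +
        u[IDX(i, j, k + 1, nx, ny)] + u[IDX(i, j, k - 1, nx, ny)] -
        6.0f * u[IDX(i, j, k, nx, ny)];
}

// host-side launch with a 3D grid of 3D thread blocks ((3,3)-indexing):
//   dim3 block(16, 4, 4);
//   dim3 grid((nx + 15) / 16, (ny + 3) / 4, (nz + 3) / 4);
//   laplacian_kernel<<<grid, block>>>(d_u_new, d_u, nx, ny, nz);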

Chapter 10

Performance Benchmark Experiments

In studying the action of the Analytical Engine, we find that the peculiar and independent nature of the considerations which in all mathematical analysis belong to operations, as distinguished from the objects operated upon and from the results of the operations performed upon those objects, is very strikingly defined and separated. — Ada Lovelace (1815–1852)

10.1 Performance Benchmarks

In this chapter, we conduct performance benchmark experiments for six selected stencil kernels. For presentation reasons they were divided into two sets: Set 1 contains the low-arithmetic intensity stencils of the basic low-order discretizations of the differential operators Gradient, Divergence, and Laplacian, as described in Chapter 5.1. Set 2 contains the higher-arithmetic intensity kernels Wave, Upstream, and Tricubic. Wave is the 3D 13-point stencil of a fourth-order discretization of the classical wave equation used in the example in Chapter 5.2. Upstream is the 3D 19-point stencil of a sixth-order discretization of the Laplacian shown in Chapter 5.1, and Tricubic is the tricubic interpolation, also shown in Chapter 5.1: a 3D 64-point stencil. Both the Upstream and Tricubic stencils are inspired by a real-world application, the weather code COSMO [64].

All the stencils were auto-tuned and benchmarked both in single and double precision, as some striking optimization impact phenomena can be observed when switching the precision mode. The genetic algorithm was used as the auto-tuner's search method. The benchmarks were all done on a grid with 200³ interior grid points. To obtain the performance numbers, the run time of 5 one-sweep runs was measured after one warm-up run.

Table 10.1 summarizes the arithmetic intensities and the performance bounds for both sets of stencils used in the benchmarks. In contrast to the asymptotic numbers given earlier in Table 5.3, the numbers in Table 10.1 account for the boundary data. Furthermore, for each stencil, two arithmetic intensities in Flops per data element are given: the one in the upper row is the ideal arithmetic intensity, counting the number of compulsory data transfers for the actual problem size. The arithmetic intensities in the lower row are the numbers corresponding to the transfers actually performed by the hardware, including the write-allocate traffic: on a CPU system, before every write-back, the corresponding cache line is first brought from DRAM into the cache, which doubles the transfer volume associated with writes. This results in a significantly lower arithmetic intensity for most of the stencils, and therefore in a lower effective performance bound. The Tricubic stencil is compute-bound on all of the three selected architectures.
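The performance bounds in Table 10.1 follow the usual bandwidth-limit argument. Written out explicitly (a standard roofline-type estimate, stated here for reference): with β the sustained memory bandwidth in Bytes/s from Table 9.1, I the arithmetic intensity in Flops per data element, and s the size of a data element in Bytes (4 in single and 8 in double precision), the attainable performance is bounded by

$$P_{\max} \;=\; \min\!\left( P_{\text{peak}},\; \beta \cdot \frac{I}{s} \right).$$

For the compute-bound Tricubic stencil the first argument of the minimum applies, which is why its entries are marked (CB) in Table 10.1.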

10.1.1 AMD Opteron Magny Cours

On the AMD Opteron Magny Cours, we auto-tuned and benchmarked stencil codes from three cache blocking Strategy variations.

Strategy 1, shown in Listing 10.1, decomposes the total domain into smaller domains for parallelization, assigning one or multiple consecutive subdomains to one thread, and doing cache blocking within the thread's local domain. Both the thread block and the cache block sizes are parameters to be tuned. For a 3D stencil, this Strategy requires tuning of 7 parameters.

Stencil       Arith. Int.         Max. SP Perf. [GFlop/s]      Max. DP Perf. [GFlop/s]
              [Flop/data elt.]     AMD    Intel   NVIDIA        AMD    Intel   NVIDIA
Gradient           1.50            19.9   13.1     31.7         10.0    6.6     15.8
                   0.86            11.4    7.5      —            5.7    3.8      —
Divergence         2.00            26.6   17.5     42.2         13.3    8.8     21.1
                   1.60            21.1   14.0      —           10.6    7.0      —
Laplacian          3.94            52.3   34.5     83.1         26.2   17.2     41.6
                   2.64            35.0   23.1      —           17.5   11.6      —
Wave               9.22           122.4   80.7    194.5         61.2   40.3     97.2
                   6.21            82.4   54.3      —           41.2   27.2      —
Upstream          10.51           139.6   92.0    221.8         69.8   46.0    110.9
                   7.11            94.4   62.2      —           47.2   31.1      —
Tricubic          63.02            (CB)   (CB)     (CB)         (CB)   (CB)     (CB)
                  52.60            (CB)   (CB)      —           (CB)   (CB)      —

Table 10.1: Arithmetic intensities of the stencils in Flops per data element, neglecting (upper row) and accounting for (lower row) write-allocate transfers. The performance bounds for single precision (SP) and double precision (DP) stencils are inferred from the bandwidth measurements in Table 9.1. (CB) means compute bound.

Listing 10.1: Strategy 1: Thread block parallelization and cache blocking.

strategy thdblk (domain u, auto dim tb, auto dim cb, auto int chunk)
{
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomains, parallelization
        for subdomain v(tb) in u(:; t) parallel schedule chunk
        {
            // cache blocking
            for subdomain w(cb) in v(:; t)
                for point pt in w(:; t)
                    w[pt; t+1] = stencil(w[pt; t]);
        }
    }
}

Strategy 2 collapses the parallelization and cache blocking subdivision of Strategy 1 into one level of decomposition. The size of the blocks is to be tuned. In 3D, this Strategy has 3 tuning parameters.

Listing 10.2: Strategy 2: Simple cache blocking.

strategy compact (domain u, auto dim cb) {
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomains
        for subdomain v(cb) in u(:; t) parallel {
            // calculate the stencil for each point in the subdomain
            for point p in v(:; t)
                v[p; t+1] = stencil(v[p; t]);
        }
    }
}

Strategy 3 is a variation of Strategy 2, allowing multiple subsequent blocks to be assigned to one thread (“chunking”). For a 3D stencil, there are 4 parameters to be tuned.

Listing 10.3: Strategy 3: Cache blocking with chunking.

strategy chunked (domain u, auto dim cb, auto int chunk)
{
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomains
        for subdomain v(cb) in u(:; t) parallel schedule chunk
        {
            // calculate the stencil for each point in the subdomain
            for point p in v(:; t)
                v[p; t+1] = stencil(v[p; t]);
        }
    }
}

The benchmark executables were compiled with the GNU C compiler gcc 4.5.2 using the -O3 optimization flag.

All the following charts are organized in the same way: the blue bars visualize the performance numbers for the basic, NUMA-aware parallelization scheme, slicing the domain into equally sized subdomains along the z-axis, the direction in which the indices vary slowest.

Figure 10.1: Set 1 benchmarks on AMD Opteron Magny Cours in single (SP) and double precision (DP) for Strategy 3.

The purple bars on top show the performance increase due to blocking, applying the best blocking size found by the auto-tuner. The red bars visualize the added benefit of explicit vectorization on top of blocking, and the yellow bars show the performance gain from loop unrolling on top of the other optimizations. When shown, the orange plus signs denote the maximum achievable performance, limited by bandwidth or compute capability.

For Set 1, we only show the performance results of Strategy 3; the other Strategies gave almost identical figures. Single precision results (SP) are shown in Fig. 10.1 (a), and double precision results (DP) in Fig. 10.1 (b). The scaling is almost linear in both cases, except when moving from 4 to 6 threads. As indicated by the orange plus signs, the available bandwidth is already exhausted when using 4 threads, so increasing the on-die concurrency further has no added benefit. As the basic parallelization scheme already almost reaches the maximum performance, blocking does not increase performance significantly.

On one die, around 95% of the maximum attainable performance is reached in double precision. In single precision, both the Gradient and Divergence stencils also reach 95%; the Laplacian reaches 90%. Using all 24 threads, around 85% of the peak performance is reached in both single and double precision, except for the single precision Laplacian, which attains 80%.

Figure 10.2: Set 2 benchmarks on AMD Opteron Magny Cours in single (SP) and double precision (DP) for all three Strategies.

For Set 2, the performance numbers for all of the three Strategies are given in Fig. 10.2. The Strategies are varied from top to bottom in the figure; the sub-figures to the left show single precision results (SP), the sub-figures to the right show double precision results (DP).

Although the basic parallelization delivers the same performance in all Strategies, vectorization and loop unrolling do not have as large an effect in Strategy 1 as in the other Strategies, resulting in poorer overall performance of Strategy 1. This is due to the added control overhead; emulating Strategy 2 with Strategy 1 using the best parameter configuration shows that the poorer performance is not an artifact of the auto-tuner. The additional parameter in Strategy 3 gives the stencils in single precision some additional performance boost over the performance achieved with Strategy 2, while the double precision results are not affected.

The figures show that the gains in performance in single precision come mostly from explicit vectorization and from loop unrolling (especially in the compute-bound Tricubic stencil), whereas in the double precision case the explicit SSE code cannot improve the performance gained by optimally blocking the Wave and Upstream stencil codes. In the compute-bound case, only loop unrolling is able to give a performance gain; the effect of blocking is negligible.

In the best case, the Wave stencil reaches 65% of the peak on one die and 60% with all 24 threads, in both single and double precision. For the Upstream stencil, PATUS achieves 65% of the peak in single precision mode; in double precision the numbers are 80% using one die and 70% when all threads are used. The Tricubic stencil runs at 50% of the peak on one die and at 45% using all 24 threads, both in single and double precision modes.

Analyzing the cache performance (cf. Table 10.2) shows that for the double precision Wave stencil, blocking significantly reduces the number of L2 misses, by around 60%, whereas in single precision mode the L2 misses are not decreased by much. Indeed, for the 200³ problem used in the benchmarks, one plane of data, as used in the basic parallelization, has 40,000 grid points. In single precision, three such planes, each having a data volume of 160 KB, fit into the Opteron's 512 KB L2 cache, but in double precision only one plane (320 KB) fits.

The GNU C compiler auto-vectorizes the code, but using explicit SSE intrinsics and forcing aligned loads removes slow non-aligned moves. In SSE, single precision vectors contain 4 data elements, and double precision vectors contain 2. Thus, higher-order stencils with more than the immediate neighbors in the unit stride direction (such as the stencils in Set 2) have higher data reuse in the single precision case: fewer loads are required, as neighboring values are brought into the registers automatically when loading a SIMD vector. Indeed, the cache analysis shows that the number of L1 accesses is halved in the double precision case when turning on explicit vectorization, and divided by 4 in single precision.

Wave, 200³          L1 Accesses   L1 Misses   L2 Accesses   L2 Misses
SP   Basic               615         0.74         18.7         0.74
     Blocked             624         1.95         21.1         0.51
     Vectorized          191         1.52         19.1         0.51
DP   Basic               758         5.13         36.6         5.08
     Blocked             627         6.99         38.1         1.94
     Vectorized          365         4.78         38.5         1.82

Table 10.2: Numbers (in millions) of L1 and L2 cache accesses and misses for the single (SP) and double precision (DP) 200³ Wave problem.
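The data-reuse argument above can be illustrated with a hand-written single precision SSE fragment (a sketch only; the 1D setting, the array names, and the use of unaligned neighbor loads are illustrative and not what PATUS emits). One 4-wide load brings four consecutive values into a register at once, so the neighboring points in the unit stride direction do not require separate scalar loads:

#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, ...

// Sketch: single precision, 4-wide SSE update of a 1D 5-point fragment
//   v[i] = c0*u[i] + c1*(u[i-1] + u[i+1]) + c2*(u[i-2] + u[i+2]).
// u is assumed to point into the interior of a padded, 16-byte aligned
// array, and n is assumed to be a multiple of 4.
void stencil_sse(float* v, const float* u, int n, float c0, float c1, float c2)
{
    __m128 k0 = _mm_set1_ps(c0), k1 = _mm_set1_ps(c1), k2 = _mm_set1_ps(c2);
    for (int i = 0; i < n; i += 4) {
        __m128 center = _mm_load_ps(u + i);        // aligned load, 4 values
        __m128 left1  = _mm_loadu_ps(u + i - 1);   // shifted loads reuse the
        __m128 right1 = _mm_loadu_ps(u + i + 1);   //  same cache lines
        __m128 left2  = _mm_loadu_ps(u + i - 2);
        __m128 right2 = _mm_loadu_ps(u + i + 2);
        __m128 r = _mm_mul_ps(k0, center);
        r = _mm_add_ps(r, _mm_mul_ps(k1, _mm_add_ps(left1, right1)));
        r = _mm_add_ps(r, _mm_mul_ps(k2, _mm_add_ps(left2, right2)));
        _mm_store_ps(v + i, r);                    // aligned store
    }
}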

10.1.2 Intel Xeon Nehalem Beckton

For the experiments on the Intel Xeon platform, the benchmark executables were compiled with the Intel C compiler icc 11.1, again using the -O3 optimization flag. Thus, we could take advantage of Intel's OpenMP implementation, which allegedly has less overhead than GNU's. It also allows us to control thread affinity by setting the KMP_AFFINITY environment variable. We chose the compact scheme, which fills up cores first (the CPU uses Hyper-Threading Technology, so each core accommodates 2 hardware threads) and then sockets. The reverse scheme, i.e., filling up sockets first and then cores, which is implemented by the scatter scheme, might be the obvious choice for bandwidth-limited compute kernels; however, since we want to run the benchmarks with up to 24 threads, i.e., using all available hardware resources, the choice does not really matter as long as we are aware of the fact that at some point (when moving from 1 to 2 threads in the compact scheme, cf. Fig. 10.3, and when moving from 12 to 24 threads in the scatter scheme) the bandwidth does not scale, thus limiting the performance.

Again, both sets of stencils were auto-tuned and benchmarked in both precision modes, but we omit Strategies 1 and 2, which performed more poorly. The results are shown in Fig. 10.3.

On the Intel platform, the effect of explicit vectorization (and also of loop unrolling) is much more pronounced than on AMD, whereas blocking has an almost negligible effect. The reason why vectorization has such a large effect on the Intel platform is that the Intel C compiler does not vectorize this particular code using the -O3 optimization option. It does, however, vectorize when compiling with -fast, but in our tests the performance gains were not substantial. Therefore, explicit SSE intrinsics and forcing data alignment leverage the full power of the Streaming SIMD Extensions. The effect can be seen in all the benchmarks except for the double precision Gradient and Divergence stencils, for which the basic parallelization already gives the maximum yield.

10.1.3 NVIDIA Fermi GPU (Tesla C2050)

As a proof of concept, GPU results are also included here. While CUDA programming can be learned relatively easily, experience shows that optimizing CUDA C code is non-trivial, and a hardware-specific coding style (e.g., making use of shared memory and registers, adhering to data alignment rules, etc.) is absolutely essential for well-performing GPU codes. Thus, these results are to be treated as a first step towards GPU support in PATUS.

The results shown in Fig. 10.4 are again the stencils from Sets 1 (upper row) and 2 (lower row) in single (to the left) and double precision (to the right). We used only one Strategy: the one shown in Listing 10.4.

Listing 10.4: Strategy used on the GPU.

strategy gpu (domain u, auto int cbx) {
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomains
        for subdomain v(cbx, 1 ...) in u(:; t) parallel
            for point pt in v(:; t)
                v[pt; t+1] = stencil(v[pt; t]);
    }
}

It has only one parallelism level, which is assigned to the lowest level in the hierarchy, i.e., to threads. Each thread receives a small subdomain of size (cbx, 1, 1) to process, cbx being a parameter to be determined by the auto-tuner, in addition to the thread block size, which is implicit in this Strategy.

Figure 10.3: Set 1 and 2 benchmarks on Intel Xeon Nehalem in single (SP) and double precision (DP) for Strategy 3.

We use the Strategy in three different indexing modes, though, to determine which of the three yields the best performance result. The indexing modes are plotted as juxtaposed bars in Fig. 10.4. The (1,1) indexing mode maps the 3D domain onto a one-dimensional thread block and a one-dimensional thread index, thus ignoring the fact that thread and thread block indices can have a higher dimensionality. The (2,3) indexing mode uses a 2D grid and 3D thread blocks, which were the highest supported indexing dimensionalities on pre-Fermi cards and in CUDA SDKs before version 4.0. (2,3)-indexing means that the third problem dimension is emulated by the second grid dimension (cf. Chapter 13.2); therefore the index calculations are more complicated than in the (3,3)-indexing mode.

Figure 10.4: Set 1 and 2 benchmarks on NVIDIA Fermi in single (SP) and double precision (DP) for one GPU strategy in different indexing modes.

The blue bars show the performance of arbitrarily chosen default thread block sizes, 16 × 4 × 4 and, for (1,1)-indexing, 200 × 1 × 1 — sizes which exhibited reasonable performance — with cbx set to 1. On top of that, the purple bars show the performance after selecting the best thread block size, and the red bars after choosing the best loop unrolling factor. Loop unrolling included both picking the best value for cbx and picking the best configuration of unrolling the loop in the unit stride dimension (which obviously is only possible if cbx > 1).

Generally speaking, the charts show that the performances are about on par for all indexing modes; in most cases, native (3,3)-indexing performs slightly better than the other modes, as the index calculations are done by the hardware to the largest extent. An exception is the double precision Tricubic stencil, for which two-fold unrolling in (2,3)-indexing in fact triples the performance. There is still room for optimization of the generated code: PATUS currently reaches between 40% and 60% of the peak for the stencils in Set 1 in single precision and between 50% and 70% in double precision. The relative performance of the higher-arithmetic intensity kernels in Set 2 is more disappointing, as only 23%–30% of the maximum is reached.
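The cost difference between the indexing modes lies entirely in how a thread recovers its (i, j, k) position. The following sketch (hypothetical kernel and parameter names, shown only to illustrate the mapping; bounds checks and the actual stencil body are omitted) contrasts the (3,3) and (2,3) index computations:

__global__ void indices_33(int* out, int nx, int ny)
{
    // (3,3)-indexing: 3D grid and 3D thread blocks (Fermi, CUDA >= 4.0);
    // the hardware provides all three block coordinates directly.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    out[i + nx * (j + ny * k)] = i;   // use the computed point (i, j, k)
}

__global__ void indices_23(int* out, int nx, int ny, int num_blocks_y)
{
    // (2,3)-indexing: 2D grid, 3D thread blocks; the y and z block
    // coordinates are packed into blockIdx.y and must be unpacked in
    // software, at the cost of an integer division and a modulo per thread.
    int by = blockIdx.y % num_blocks_y;
    int bz = blockIdx.y / num_blocks_y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = by * blockDim.y + threadIdx.y;
    int k = bz * blockDim.z + threadIdx.z;
    out[i + nx * (j + ny * k)] = i;
}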

10.2 Impact of Internal Optimizations

10.2.1 Loop Unrolling

We examine various loop unrolling configurations for the basic parallelization scheme and for the blocking configuration that was found to give the best performance. The benchmarks for this experiment were run on the AMD Opteron platform using 24 threads. We chose two stencils, the double precision Laplacian and the single precision Tricubic; the first did not seem to profit from loop unrolling (cf. Fig. 10.1), whereas the second did (cf. Fig. 10.2). We want to evaluate by what amount the performance degrades or improves, and how the performances are distributed.

Figs. 10.5 and 10.6 show heat map plots for the Laplacian and the Tricubic, respectively. The loop unrolling factors are varied along the x-, y-, and z-axes as powers of 2. The four plots in the upper row show the performance numbers in the basic parallelization scheme, the lower row the numbers after the best block sizes were picked.

As expected, small loop unrolling factors yield the best performance for the Laplacian, in both the basic and the blocked cases. More precisely, any loop unrolling factor in the unit stride direction (along the x-axis) between 1 and 8 gives an acceptable performance as long as the unrolling factors in the y- and z-directions are 1 or 2. This coincides with the deliberation on spatial data locality along the unit stride direction, which is disrupted when the stride within the inner loop becomes too large (which is the case for larger unrolling factors in non-unit stride directions). In both parallelization schemes, the performance degrades only to around 67% of the maximum for the examined unrolling factors. For the Tricubic stencil, in contrast, the performance degrades down to 20% of the maximum performance, i.e., loop unrolling has a much larger impact on performance.

Figure 10.5: Impact of varying loop unrolling factors on a fixed blocking size. Double precision Laplacian stencil. Basic parallelization in the upper row, and best block sizes in the lower row.

Figure 10.6: Impact of varying loop unrolling factors on a fixed blocking size. Single precision Tricubic stencil. Basic parallelization in the upper row, and best block sizes in the lower row.


Figure 10.7: Performance impact of using the optimal configuration for one problem size for other problem sizes.

The best performance numbers are achieved for an unrolling factor of 2 in the z-direction. As the unrolling factor in the z-direction becomes larger, the performance decreases rapidly due to the increased register pressure.
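For reference, the code transformation examined here simply replicates the stencil expression inside the innermost loop nest. A hand-written sketch of a 2 × 1 × 2 unrolling (factor 2 in x and z) of a 7-point Laplacian could look as follows (illustrative names; boundary handling and remainder loops are omitted, and the trip counts are assumed to be divisible by the unrolling factors):

#include <stddef.h>

#define IDX(i, j, k) ((size_t)(i) + (size_t)nx * ((size_t)(j) + (size_t)ny * (size_t)(k)))
#define LAP(i, j, k) (u[IDX((i)+1,(j),(k))] + u[IDX((i)-1,(j),(k))] + \
                      u[IDX((i),(j)+1,(k))] + u[IDX((i),(j)-1,(k))] + \
                      u[IDX((i),(j),(k)+1)] + u[IDX((i),(j),(k)-1)] - \
                      6.0 * u[IDX((i),(j),(k))])

void laplacian_unrolled(double* u_new, const double* u, int nx, int ny, int nz)
{
    for (int k = 1; k < nz - 1; k += 2)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i += 2) {
                // 2 x 1 x 2 = 4 stencil evaluations per innermost iteration
                u_new[IDX(i,   j, k  )] = LAP(i,   j, k  );
                u_new[IDX(i+1, j, k  )] = LAP(i+1, j, k  );
                u_new[IDX(i,   j, k+1)] = LAP(i,   j, k+1);
                u_new[IDX(i+1, j, k+1)] = LAP(i+1, j, k+1);
            }
}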

10.3 Impact of Foreign Configurations

After tuning the Strategy and code generation characteristics (e.g., block sizes, loop unrolling factors), the resulting configuration works best if the settings are exactly replicated in the production run, in particular, if the problem size is not modified, the number of threads used remains unchanged, and the program is run on the same hardware platform. In this section, we evaluate the consequences when one of these characteristics is changed.

10.3.1 Problem Size Dependence

Fig. 10.7 shows the impact on performance when the problem size is varied for an unchanged parameter configuration obtained for a 200³ problem. We call this the default configuration. We study two stencils, the double precision Laplacian and the single precision Wave kernel. The benchmarks are run on the AMD Opteron with 24 threads. The blue bars show the maximum performance obtained by applying the appropriate parameter configuration; the overlaid red bars show the performance numbers obtained when using the default configuration. The yellow line shows the percentage of the performance that was achieved with the default configuration; the percentages are given on the right vertical axis.

Both stencils exhibit a significantly above-average maximum performance for the 100³ problem: evidently a sweet spot on the given architecture; the line lengths in the unit stride direction are long enough to engage the hardware prefetcher, and the problem is still small enough so that the required planes — after subdividing among the 24 threads — fit into cache, so that temporal data locality can be fully exploited.

Due to the nature of the chosen Strategy, the performance is abysmal for problem sizes smaller than the nominal size: block sizes that are well suited to the original problem size are most likely too large for the smaller problem sizes, and some of the threads are underutilized. The performance is acceptable for problem sizes larger than the nominal one, in particular if the larger problem size is divisible by the original one.

To deal with multiple problem sizes successfully, all the problem sizes could be benchmarked for each of the configurations picked by the search method in the auto-tuning process — provided that the set of admissible problem sizes is limited and the probability distribution over the problem sizes is known — and the objective function to maximize would be modified to be the weighted sum of the performance numbers, with weights corresponding to the probability distribution.
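Written out as a formula (with p denoting a parameter configuration, S the set of admissible problem sizes, w_s the probability of size s, and perf(p; s) the measured performance), the modified objective would be

$$\max_{p} \;\sum_{s \in S} w_s \cdot \mathrm{perf}(p;\, s).$$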

10.3.2 Dependence on Number of Threads

On a fixed architecture, we can alter the number of threads used for the computation. The performance behavior for constant block sizes and varying numbers of threads is shown in Fig. 10.8 for the double precision Laplacian (a) and the single precision Wave stencil (b) for a 200³ problem run on the AMD Opteron.

On the horizontal axis in the sub-figures of Fig. 10.8, the configurations are varied (i.e., the caption "1" identifies the configuration obtained for one thread, etc.), and on the vertical axis the number of threads used to execute the benchmark program is varied. Thus, the original and highest numbers occur on the diagonals. (Interestingly, there is one deviation: according to Fig. 10.8 (b), the Wave stencil ran faster with two threads using the 4-thread configuration than with the native configuration. This is an auto-tuner artifact.)

(a) Double precision Laplacian stencil (performance in GFlop/s):

                      Thread configuration
  # Threads used     1      2      4      6     12     24
        1          1.80   1.75   1.71   1.54   1.56   1.69
        2          2.79   2.80   2.75   2.45   2.54   2.70
        4          3.63   2.80   3.87   3.37   3.72   3.81
        6          3.72   2.85   3.97   4.14   3.97   3.90
       12          4.81   3.02   7.66   6.55   7.69   7.63
       24          4.81   3.09  14.71  12.74  13.03  14.72

(b) Single precision Wave stencil (performance in GFlop/s):

                      Thread configuration
  # Threads used     1      2      4      6     12     24
        1          4.37   4.16   4.27   4.32   4.24   4.26
        2          4.37   7.56   8.04   7.87   7.41   7.06
        4          4.23  11.09  13.00  11.12  11.84  11.21
        6          4.23  12.80  12.54  13.94  13.22  12.35
       12          4.23  21.31  14.16  17.67  25.11  24.67
       24          4.21  31.08  13.96  17.37  46.47  48.55

Figure 10.8: Varying the number of threads for a fixed configuration. Double precision Laplacian stencil (a) and single precision Wave stencil (b).

Ideally, the numbers in a horizontal row would remain constant. We can observe the tendency that the configurations to the right of a diagonal element perform better than the ones to the left. This is again due to the nature of the Strategy, which guides the auto-tuner to pick larger block sizes for lower concurrencies. The lack of granularity when switching to a number of threads larger than the nominal one may cause additional threads to remain idle. In the Laplacian stencil the effect is particularly pronounced when more than two threads run the two-thread configuration, and in the Wave stencil when more than one thread runs the one-thread configuration.

On average, 91% (one thread) down to 66% (24 threads) of the performance of the best configuration is reached in the double precision Laplacian, and 97% (one thread) down to 47% (24 threads) in case of the single precision Wave stencil.

10.3.3 Hardware Architecture Dependence

Finally, the hardware platform itself can be exchanged. We ran all the benchmarks with exchanged configuration sets, i.e., the Intel configurations were applied to benchmarks run on the AMD platform and vice versa.

The two architectures are quite similar. The major difference is the size of the L2 cache, which is 512 KB on the Opteron cores and only 256 KB on the Nehalem cores, which, in addition, accommodate two hardware threads each. Thus, we expect that the Intel configurations run reasonably well on the AMD processor, but the Magny Cours configurations might degrade the performance on the Intel CPU.

Figure 10.9: Using optimal configurations on foreign hardware platforms. Sub-figures (a)–(d) show the Intel configurations applied on the AMD Opteron, sub-figures (e)–(h) the AMD configurations applied on the Intel Xeon, for Sets 1 and 2 in single and double precision.

Indeed, Fig. 10.9 shows acceptable performance matches of the Intel configurations on the AMD Opteron in sub-figures (a)–(d). The matches of the AMD configurations on the Intel CPU in sub-figures (e)–(h) exhibit larger deviations. The blue bars are the native performances in GFlop/s, the overlaid red bars the foreign ones. The yellow lines visualize the relative performance with respect to the native one; the percentages are shown on the right vertical axis.

Generally, the performance numbers on the Magny Cours are no less than 60% of the maximum numbers when using the foreign configurations, 80% on average, in both double and single precision modes and for both sets of stencils. In contrast, on the Intel platform the performance drops to roughly 20% of the original performance in certain cases; on average, 75% of the original performance is reached. Again, the average does not change when switching precision modes or stencil sets. However, we can also observe that the AMD configurations give almost perfect results on Intel in certain cases, e.g., for the single precision Gradient or Upstream stencils, or the double precision Tricubic stencil, where almost 100% of the original performance is recovered for 6, 12, and 24 threads.

Chapter 11

Applications

Without, however, stepping into the region of conjecture, we will mention a particular problem which occurs to us at this moment as being an apt illustration of the use to which such an engine may be turned for determining that which human brains find it difficult or impossible to work out unerringly. — Ada Lovelace (1815–1852)

11.1 Hyperthermia Cancer Treatment Planning

Hyperthermia cancer treatment, i.e., application of moderate heat to the body, is a promising modality in oncology that is used for a wide variety of cancer types (including pelvic, breast, cervical, uterine, bladder, and rectal cancers, as well as melanoma and lymphoma). Both animal and clinical studies have proven that hyperthermia intensifies both radio- and chemo-therapies by factors of 1.2 up to 10, depending on heating quality, treatment modality combination, and type of cancer [57, 164, 186]. Hyperthermia is therefore applied in conjunction with both radio- and chemo-therapies.

An effect of hyperthermia treatment is apoptosis of tumor cells, which have a chaotic vascular structure resulting in poorly perfused regions in which cells are very sensitive to heat. Hyperthermia further makes the tumor cells more susceptible to both radiation and certain cancer drugs. There are a variety of reasons for this. Among others, heat naturally increases blood flow and therefore increases drug delivery to the tumor cells, and also increases the toxicity of some drugs. The effect of radiation is amplified as a result of improved oxygenation due to the increase in blood flow. In our setting, we are dealing with local hyperthermia, where the aim is to focus the energy noninvasively only at the tumor location. This is done by creating a constructive interference at the tumor location using nonionizing electromagnetic radiation (microwaves), thereby aiming at heating the tumor to 42–43 °C; but even lower temperatures can be beneficial.

Figure 11.1: Head and neck hyperthermia applicator HYPERcollar developed at the Hyperthermia Group of Erasmus MC (Daniel den Hoed Cancer Center), Rotterdam, the Netherlands.

Cylindrical applicators, such as the HYPERcollar shown in Fig. 11.1, which was developed for head and neck hyperthermia treatment at the Hyperthermia Group of Erasmus MC (Daniel den Hoed Cancer Center) in Rotterdam, the Netherlands [132, 133], feature a number of dipole antennae arranged on the circumference of the applicator, whose amplitudes and phase shifts can be controlled to create the desired electric field inducing heat. Here the aim is to avoid cold spots within the tumor and hotspots in healthy tissue, in order to maximize tumor heating and limit pain and tissue damage. The water bolus between the patient's skin and the applicator prevents heating of the skin and, more importantly from an engineering point of view, is an efficient transfer medium for the electromagnetic waves into the tissue: it reduces reflections and allows for smaller antenna sizes. In treatment planning, it is therefore highly relevant to determine the therapeutically optimal antenna settings, given the patient geometry. This leads to a large-scale nonlinear, nonconvex PDE-constrained optimization problem, the PDE being the thermal model shown in Eqn. 11.1, which is known as Pennes's bioheat equation [134], or a variant thereof. Another important aspect of treatment planning is simulating the temperature distribution within the human body given the antenna parameters, which has been successfully demonstrated to accurately predict phenomena occurring during treatment [150, 119]. Simulations are helpful in order to determine the correct doses and help to overcome the difficulty of temperature monitoring during treatment. Other benefits include assistance in developing new applicators and training staff. In this chapter, we focus on the simulation only. The thermal model is given by the parabolic partial differential equation

$$\rho C_p \frac{\partial u}{\partial t} \;=\; \nabla \cdot (k\, \nabla u) \;-\; \rho_b\, W(u)\, C_b\, (u - T_b) \;+\; \rho Q \;+\; \frac{\sigma}{2} E^2, \qquad (11.1)$$

which is the simplest thermal model. On the boundary we impose Dirichlet boundary conditions to model constant skin temperature, Neumann boundary conditions, which account for constant heat flux through the skin surface, or convective boundary conditions [120]. In Eqn. 11.1, u is the temperature field, for which the equation is solved, ρ is the density, C_p is the specific heat capacity, k is the thermal conductivity, W(u) is the temperature-dependent blood perfusion rate, T_b is the arterial blood temperature (the subscript "b" indicates blood properties), Q is a metabolic heat source, σ is the electric conductivity, and E is the electric field generated by the antenna. The electric field has to be calculated (by solving Maxwell's equations) before computing the thermal distribution. More elaborate models include, e.g., temperature-dependent material parameters or tensorial heat conductivity to account for the directivity of blood flow [169].

Figure 11.2: Model of a boy and the HYPERcollar applicator. The left image shows the model and FD discretization, the right images show simulation results. Lighter colors correspond to higher temperatures.

We use a finite volume method for discretizing Eqn. 11.1. The blood perfusion is modeled as a piecewise linear function of the temperature. As a discretized version of Eqn. 11.1 we obtain the stencil expression

$$u^{n+1}_{i,j,k} \;=\; u^{n}_{i,j,k} + a_{i,j,k}\, u^{n}_{i,j,k} + b_{i,j,k} + c_{i,j,k} + d_{i,j,k}\, u^{n}_{i+1,j,k} + e_{i,j,k}\, u^{n}_{i-1,j,k} + f_{i,j,k}\, u^{n}_{i,j+1,k} + g_{i,j,k}\, u^{n}_{i,j-1,k} + h_{i,j,k}\, u^{n}_{i,j,k+1} + l_{i,j,k}\, u^{n}_{i,j,k-1}. \qquad (11.2)$$

Note that the coefficients a,...,l depend on their location in space. The PDE in Eqn. 11.1 is solved with a simple explicit Euler time integration scheme. Each evaluation of the stencil expression amounts to 16 floating point operations.
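In C, one sweep of this update could look as follows (a sketch only: the array names, the row-major layout, and the boundary handling are illustrative and not taken from the application code):

// Sketch of one explicit Euler sweep of the discretized bioheat equation
// (Eqn. 11.2). u and u_new hold the temperature field, a..h and l the
// location-dependent coefficient grids; all names are illustrative.
#include <stddef.h>
#define IDX(i, j, k) ((size_t)(i) + (size_t)nx * ((size_t)(j) + (size_t)ny * (size_t)(k)))

void bioheat_sweep(float* u_new, const float* u,
                   const float* a, const float* b, const float* c,
                   const float* d, const float* e, const float* f,
                   const float* g, const float* h, const float* l,
                   int nx, int ny, int nz)
{
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t x = IDX(i, j, k);
                // 16 flops per grid point; 10 grids read, 1 grid written
                u_new[x] = u[x] + a[x] * u[x] + b[x] + c[x]
                         + d[x] * u[IDX(i + 1, j, k)] + e[x] * u[IDX(i - 1, j, k)]
                         + f[x] * u[IDX(i, j + 1, k)] + g[x] * u[IDX(i, j - 1, k)]
                         + h[x] * u[IDX(i, j, k + 1)] + l[x] * u[IDX(i, j, k - 1)];
            }
}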

11.1.1 Benchmark Results

Performance results for the hyperthermia stencil from Eqn. 11.2 on the AMD Opteron Magny Cours and on the NVIDIA C2050 Fermi GPU are shown in Fig. 11.3. The performance numbers and scalings are given for both double (left half of the graphs) and single precision (right half) data types, as the data type can be changed in the solver's user interface.


Figure 11.3: Performance results for the Hyperthermia stencil on the AMD Opteron Magny Cours and on the NVIDIA Fermi GPU.

Usually, single precision is sufficient in practice. The performance results were obtained for a problem domain of 200³ grid points. The orange bars mark the maximum reachable performance for the stencil's arithmetic intensity (0.16 Flop/Byte for double precision and 0.33 Flop/Byte for single precision — or 1.33 Flops per transferred data element, accounting for write-allocate traffic).

The graphs show that in both cases (single and double precision) around 85% of the optimal performance was reached on the AMD Opteron when using all available NUMA domains. Using only one NUMA domain (one to six threads in the figure), around 90% of the maximum performance is reached. Note that using more than 4 threads on one NUMA domain does not increase performance. The two remaining threads could be used to stream data into the caches and thereby potentially increase the performance by perfectly overlapping computation and communication. For double precision, blocking gives a moderate performance increase of 16% over the baseline with 24 threads. In the single precision case, vectorization and blocking help to increase the performance by around 40%. This suggests that padding the grids correctly, so that no unaligned vector loads have to be carried out, yields a major performance gain.

Similarly, on the GPU around 80% of the attainable performance was reached. Interestingly, switching indexing modes did not change the performance. Tuning the thread block sizes brought no relevant performance increase over the defaults (200 in the 1D indexing case and 16 × 4 × 4 in the other cases), meaning that the default choice was already a good one. Indeed, for both indexing modes, the auto-tuner changed the default configuration only slightly, to 16 × 2 × 6 (and alternatively to 72 × 2 × 2, which performed equally well). In the single precision case, however, the thread block size was increased to 40–50 for both indexing modes, so that in both single and double precision modes about the same amount of contiguous memory is accessed. Making explicit use of the GPU's shared memory might increase the performance further.

11.2 Anelastic Wave Propagation

The Anelastic Wave Propagation code AWP-ODC of the Southern California Earthquake Center (SCEC), which was developed by Olsen, Day, Cui, and Dalguer [49], is a scientific modeling code for simulating both dynamic rupture and earthquake wave propagation. It has been used to conduct numerous significant simulations at the SCEC. It provides the capability of simulating the largest earthquakes expected around the San Andreas Fault in southern California at high shaking frequencies (up to 2 Hz in the simulation described below), which allows scientists to understand seismic risks, thereby helping to mitigate the loss of life and property, and lets them gain new insight into the ground motion levels to be expected for a great earthquake along the fault.

The largest simulation done was of a magnitude-8 earthquake rupturing the entire San Andreas Fault from central California to the Mexican border, a fault length of 545 km. The result is shown in Fig. 11.4; a full description and discussion of the simulation can be found in [49]. The simulation required a discretization of an 810 × 405 × 85 km³ volume with a 40 m resolution mesh, resulting in 4.4 · 10¹¹ voxels. The simulation was executed in a 24 hour production run — corresponding to 6 minutes of wave propagation — on around 223,000 cores of Jaguar, a Cray XT5 supercomputer operated at the Oak Ridge National Laboratory in Oak Ridge, Tennessee, USA. On Jaguar, the code achieved a sustained performance of 220 TFlop/s. Simultaneously with this simulation, the code (Fortran, parallelized with MPI) was shown to be highly scalable.

The model's governing elastodynamic equations are the following system of PDEs [49, 50]:

Figure 11.4: Peak ground velocities derived from a magnitude-8 earthquake simulation along the southern California San Andreas Fault, with seismograms and peak velocities for selected locations. Image courtesy: Southern California Earthquake Center, Y. Cui, K. B. Olsen, T. H. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, D. K. Panda, A. Chourasia, J. Levensque, S. M. Day, and P. Maechling [49].

$$\frac{\partial \dot{u}}{\partial t} \;=\; \rho^{-1}\, \nabla \cdot \sigma, \qquad\qquad \frac{\partial \sigma}{\partial t} \;=\; \lambda\, (\nabla \cdot \dot{u})\, I \;+\; \mu\, \bigl(\nabla \dot{u} + (\nabla \dot{u})^{\mathsf{T}}\bigr). \qquad (11.3)$$

The dependent variables are the velocity vector field $\dot{u} = (\dot{u}_x, \dot{u}_y, \dot{u}_z)$ and the stress tensor
$$\sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\ \sigma_{yx} & \sigma_{yy} & \sigma_{yz} \\ \sigma_{zx} & \sigma_{zy} & \sigma_{zz} \end{pmatrix}.$$
λ and μ are the Lamé coefficients, ρ is the density, and I is the identity tensor.

The equations are discretized with the finite difference method — according to [49] the best trade-off between accuracy (fourth-order discretizations of the differential operators in space and second-order discretizations in time are used throughout the code), computational efficiency (stencil codes map well to parallel architectures without large overheads such as the graph partitioning required for finite element codes), and ease of implementation. The velocity-stress wave equations are solved with an explicit scheme on a staggered grid with equidistant mesh points. A split-node approach is used to model dynamic fault ruptures [50]. As absorbing boundary conditions, the code uses perfectly matched layers or dampening within a sponge layer.

11.2.1 Benchmark Results

We selected 4 kernels (corresponding to 4 Fortran routines in the original code) for the benchmarks; these are the 3D stencil routines in which most of the compute time is spent. Table 11.1 shows an overview of the kernels. The corresponding stencil specifications, which are translations of the original Fortran code into the PATUS stencil specification DSL, can be found in Appendix C. The discretization of Eqn. 11.3 can be found in [50] and [49].

The stencils are all fourth-order discretizations of the spatial differential operators. The kernels listed in Table 11.1 are to be understood as prototypes; e.g., the actual code contains three variants of uxx1 — which calculates the velocity component $\dot{u}_x$ in the x-dimension — namely two more for the y- and z-dimensions, which are almost identical to uxx1. We therefore expect a similar performance and scaling behavior for these related kernels. Similarly, there are two other variants of xy1: xy1 computes the σ_xy component of the stress tensor, the other variants compute σ_xz and σ_yz. Note that the stress tensor σ is symmetric. Both xyz1 and xyzq compute the remaining components of the stress tensor, σ_xx, σ_yy, and σ_zz, all in one kernel.

Name    Description                                                 Flops/Stencil   Arith. Int.
uxx1    Velocity in x-dimension                                          20         0.70 Flop/B
xy1     Stress tensor component σ_xy                                     16         0.65 Flop/B
xyz1    Stress tensor components σ_xx, σ_yy, σ_zz                        90         1.58 Flop/B
xyzq    Stress tensor components σ_xx, σ_yy, σ_zz in viscous mode       129         1.22 Flop/B

Table 11.1: Summary of the AWP kernels which were used in the performance benchmarks.

Figure 11.5: AWP kernel benchmarks on the AMD Opteron Magny Cours.

The performance benchmarks were done only in single precision, since the original application only uses that precision mode. Again, the orange markers show the maximum attainable performance for the lower arithmetic intensity kernels. On one die (6 threads) the uxx1 and xy1 kernels reach around 80% of the peak, and around 70% if all 24 threads are used. The theoretical maximum is quite high for the xyz* kernels, and the reason why only a fraction (around 40%–50%) of the maximum was achieved lies in the arithmetic operations: the kernels contain many divisions (18 in both cases), which are notorious for incurring pipeline stalls. In this case, many of the divisions could be removed by pre-computing and storing the inverses of the Lamé parameters. This manual optimization was indeed carried out in an optimized version of the AWP-ODC code.
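A minimal sketch of this optimization (with illustrative names; the actual AWP-ODC arrays and expressions differ) precomputes the reciprocals once, outside the time loop, so that the kernels multiply instead of divide:

#include <stddef.h>

// Sketch: precompute reciprocals of a Lame parameter combination so that
// the stencil kernels multiply instead of divide (illustrative names only).
void precompute_lame_inverses(float* lam_mu_inv, const float* lam,
                              const float* mu, size_t n)
{
    for (size_t x = 0; x < n; x++)
        lam_mu_inv[x] = 1.0f / (lam[x] + 2.0f * mu[x]);
}
// In the kernels, an expression of the form  t / (lam[x] + 2*mu[x])  then
// becomes  t * lam_mu_inv[x], trading one division per point for one
// multiplication and some additional memory traffic for the extra array.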

Figure 11.6: AWP kernel benchmarks on the NVIDIA Tesla C2050 Fermi GPU.

A possible way for PATUS to remove divisions automatically is to rewrite arithmetic expressions by virtue of the equality $\frac{\alpha}{a} + \frac{\beta}{b} = \frac{\alpha b + \beta a}{ab}$. This transformation removes one division, but introduces three multiplications. For a similar transformation of larger sums $\sum_i \frac{\alpha_i}{a_i}$ there is also a tuning opportunity: finding the right trade-off between the number of divisions removed and the number of multiplications added.

The green line shows the performance of the reference Fortran code, which was parallelized by inserting an OpenMP directive above the outermost spatial loop. No NUMA optimization was done, which is evident from the scaling behavior of the reference uxx1 and xy1 kernels. The arithmetic intensities of the xyz* kernels are higher; hence the NUMA effect is mitigated to some extent by the relatively high number of floating point operations, and the performance can increase further when going to 2 and 4 NUMA domains (12 and 24 threads, respectively).

The blue bars show the performances of auto-tuned blocked codes, including the NUMA optimization, and relying on the compiler (GNU gfortran/gcc 4.5.2 with the -O3 optimization flag, for both the reference codes and the generated PATUS codes) to do the vectorization. With the NUMA optimization enabled, the performance scales almost linearly up to 24 threads. Fig. 11.5 shows that explicit use of SSE intrinsics and padding for optimal vector alignment result in a significant performance boost, in particular for the xyz1 kernel, where explicit vectorization gave a performance increase of 150%. Activating and tuning loop unrolling gave another slight gain in performance. Overall, PATUS achieved speedups between 2.8× and 6.6× when 24 threads were used on the AMD Opteron Magny Cours with the NUMA optimization and a cache blocking Strategy.

The GPU performance results are shown in Fig. 11.6. Again, the results are for single precision stencils, and the three indexing modes were used: one-dimensional thread blocks and grids, three-dimensional thread blocks and a two-dimensional grid, and both three-dimensional thread blocks and grids, supported as of CUDA 4.0 on Fermi GPUs. For both the uxx1 and xy1 kernels, the default thread block size of 16 × 4 × 4 threads was an adequate choice, and tuning the thread block sizes increased the performance only slightly. In both cases, the fully 3D indexing mode outperformed the (2,3)-dimensional indexing by a tight margin due to the simplified index calculation code. Surprisingly, the fully 3D indexing delivered worse performance for the xyz* kernels. Loop unrolling could compensate for the loss in performance for the xyz1 kernel — the only case in which loop unrolling really showed a significant benefit. Also in this case we hope to be able to increase the performance in the future by making use of the GPU's shared memory.

Part IV

Implementation Aspects

Chapter 12

Patus Architecture Overview

The distinctive characteristic of the Analytical Engine, and that which has rendered it possible to endow mechanism with such extensive faculties as bid fair to make this engine the executive right-hand of abstract algebra, is the introduction into it of the principle which Jacquard devised for regulating, by means of punched cards, the most complicated patterns in the fabrication of brocaded stuffs. — Ada Lovelace (1815–1852)

PATUS is built from four core components, as shown in the high-level overview in Fig. 12.1: the parsers for the two input files (the stencil definition and the Strategy), the actual code generator, and the auto-tuner. PATUS is written in Java and can therefore be run on any major current operating system, including Linux, MacOS, and Microsoft Windows. PATUS uses Coco/R [116] as the generator for the stencil specification and Strategy parsers, and also for the parser used to interface with the computer algebra system Maxima [20], which is used as a powerful expression simplifier. The Cetus framework [157, 11], a compiler infrastructure for source-to-source transformations, provides the Java classes for the internal representations of both the Strategies and the generated code, i.e., the Strategy parse tree and the abstract syntax tree of the generated code; Cetus also provides the mechanism for unparsing the internal representation of the generated code.

Figure 12.1: A high-level overview of the architecture of the PATUS framework.

The internal representation of the stencil specification consists of the domain size and number-of-iterations attributes and a graph representation of the actual stencil parts described in the stencil operation. This will be discussed in detail in Section 12.1. The Strategy is transformed into an abstract syntax tree that is used as a template by the code generator. These structures are passed as input to the code generator, along with an additional configuration that describes the characteristics of the hardware and the programming model used to program the architecture, and that specifies the code generation back-end to use. The code generator produces C code for variants of the stencil kernel and also creates an initialization routine that implements a NUMA-aware data initialization based on the parallelization scheme used in the kernel routine.

Along with an implementation of the stencil kernel, the code generator also creates a benchmarking harness from an architecture- and programming model-specific template, into which the dynamic memory allocations, the grid initializations, and the kernel invocation are substituted. The benchmarking harness expects the problem-specific parameters related to the domain size (specified in the stencil specification), the Strategy-specific auto parameters, as well as internal code generation parameters (currently loop unrolling factors) to be provided to the benchmarking executable as command line arguments.

Currently, the post-code generation compilation and auto-tuning processes still have to be initiated manually; as of yet, the auto-tuner is a decoupled part of the system. Consequently, the feedback loop, which returns the set of parameters for which the generated stencil kernel performs well to the code generator in order to substitute them into the parametrized code, has not been implemented yet. Note that it might not even be desirable to do so in a fully automated fashion, since the code generator and the auto-tuning system (and consequently the benchmarking executable) might run on different systems. Also, the auto-tuning subsystem can be used independently for tuning existing parametrized codes not generated by PATUS.

12.1 Parsing and Internal Representation

12.1.1 Data Structures: The Stencil Representation

The stencil specification parser turns a stencil specification into an object structure whose class diagram is shown in Fig. 12.2. The StencilCalculation encapsulates the entire specification, including the domain size and the number of iterations, a list of arguments, which will be reflected in the function signature of the generated stencil kernel function, and the actual stencil structure represented by the StencilBundle. As an operation in the stencil specification can contain not only one, but multiple stencil expressions that are computed within a single stencil kernel, the StencilBundle contains a list of Stencil objects, which are both a graph representation of the geometrical structure of the stencil and an arithmetic expression (a Cetus Expression object). The graph structure is represented by sets of StencilNodes: one set of input nodes, i.e., the nodes whose values are read, and one set of output nodes, the nodes to which values are written in the computation.

Figure 12.2: UML class diagram of the internal representation of the stencil specification.

Stencil nodes represent a position in the grid relative to the center point, both in space and time. Additionally, they capture which grid is accessed when a stencil node's value is accessed. In mathematical notation, let
$$\mathcal{N} := \mathbb{Z}^d \times \mathbb{Z} \times \mathbb{N}$$
be the index space consisting of a $d$-dimensional spatial component, a temporal component, and a natural-numbered counting component. Then a stencil node is an element of $\mathcal{N}$. Furthermore, let

$$\alpha : \mathcal{N} \to \mathbb{Z}^d, \qquad \tau : \mathcal{N} \to \mathbb{Z}, \qquad \iota : \mathcal{N} \to \mathbb{N}$$
be the canonical projections onto the spatial, temporal, and counting components, respectively. The spatial component defines the spatial location of the stencil node relative to the center node, the temporal component specifies to which temporal instance of the grid, relative to the current time step, the stencil node belongs, and the counting component distinguishes between different grids. For instance, a stencil for the divergence operator, which maps a vector field to a function, in a $d$-dimensional setting requires a vector-valued input grid with $d$ components, or, equivalently, $d$ input grids, which are numbered by the counting component.

Definition 12.1. A stencil is denoted by a triple $(\Sigma_i, \Sigma_o, \Phi)$, where $\Sigma_i, \Sigma_o \subseteq \mathcal{N}$ are finite subsets of $\mathcal{N}$; $\Sigma_i$ is the set of input stencil nodes and $\Sigma_o$ is the set of output stencil nodes. The mapping
$$\Phi : 2^{\Sigma_i} \longrightarrow \left\{ f : \mathbb{R}^{|\Sigma_i|} \to \mathbb{R} \right\} \times 2^{\Sigma_o}$$
assigns a real-valued function in the input stencil nodes (the actual stencil expression) and a set of output stencil nodes (the nodes to which the value is assigned after evaluation of the function) to a subset of input stencil nodes.

In the UML class diagram in Fig. 12.2, $\Phi$ corresponds to the stencil expression m_edStencilExpression in the Stencil class. Each Stencil instance in the list of stencils in the StencilBundle represents an object $(\Sigma_i^{\ell}, \Sigma_o^{\ell}, \Phi^{\ell})$, $\ell = 1, \dots, k$. The fused stencil, m_stencilFused, contained in the StencilBundle is a stencil structure synthesized from all the stencils within the specification, $\Sigma_{\mathrm{fused}} := \bigcup_{\ell=1}^{k} \Sigma^{\ell}$. Reasoning about which grids are used or transferred is done based on the fused stencil.

The stencil node $n$ with spatial component $0 \in \mathbb{Z}^d$, i.e., $\alpha(n) = 0$, is called the center node; it corresponds to a grid access at [x,y,z] in the stencil specification of a 3-dimensional stencil. For each output stencil node $n_o \in \Sigma_o$ we require that the temporal index is larger than $\max_{n_i \in \Sigma_i} \tau(n_i)$. (Note that if this requirement were not fulfilled, the method would not be an explicit method and would require solving a system of equations.)

Internally, the stencil node components are normalized by shifting the spatial and temporal components in $\Sigma_i$ and $\Sigma_o$ such that the spatial indices of the output stencil nodes are $0 \in \mathbb{Z}^d$ and the largest temporal component is 0, i.e.,

$$\alpha(n) = 0 \quad \forall n \in \Sigma_o, \qquad \max_{n \in \Sigma_o} \tau(n) = 0.$$

The notion of stencil nodes, in conjunction with a Strategy, enables us to map grids, or parts of grids, to arrays which are stored physically in main memory or temporarily in some local memory. In section 13.1 we will discuss in detail how this is done. Also, stencil nodes provide a mechanism to determine the size of an input subdomain required to calculate a result subdomain of a certain size. The fused stencil holds the information for all the subdomains that need to be kept in (local) memory at a given point in time.
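To make Definition 12.1 and the normalization concrete, consider the 1D 3-point stencil $u[x;\,t+1] \leftarrow u[x-1;\,t] + u[x;\,t] + u[x+1;\,t]$ used again in Chapter 14.3. The following instantiation is our own minimal reading of the definition (with the single grid given counting index 0), not a transcript of the internal representation:
$$\Sigma_i = \{(-1;\,-1;\,0),\ (0;\,-1;\,0),\ (1;\,-1;\,0)\}, \qquad \Sigma_o = \{(0;\,0;\,0)\},$$
$$\Phi(\Sigma_i) = \big( (u_{-1}, u_0, u_1) \mapsto u_{-1} + u_0 + u_1,\ \Sigma_o \big).$$
In the normalized form the output node sits at the spatial origin with temporal index 0, and all input nodes carry temporal index $-1$, consistent with the requirement that the output's temporal index exceed that of every input.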

12.1.2 Strategies

The Coco/R-based Strategy parser creates a traditional parse tree composed of Cetus IR objects. To encapsulate the special, PATUS Strategy-specific loop structures, additional IR classes have been added. Specifically, RangeIterators represent traditional for loops, and SubdomainIterators represent loops that iterate a smaller subdomain over a parent domain. Both RangeIterators and SubdomainIterators extend the class Loop, which encapsulates the common attributes as shown in Fig. 12.3: the number of threads, the chunk size (i.e., the number of consecutive iterates assigned to one thread), and the level of parallelism on which the loop was defined. RangeIterators extend the list of attributes by the loop index variable, start and end expressions, and the step by which the index variable is incremented in each iteration. A SubdomainIterator has two special identifier objects representing the parent domain and the subdomain being iterated within the parent domain. A SubdomainIdentifier inherits from the Cetus IR class Identifier and adds specific attributes: spatial and temporal offsets as defined in the Strategy and a list of indices through which an array of subdomain identifiers in the Strategy is accessed. This is not to be confused with a stencil node's counting index; a Strategy subdomain is a placeholder for all the stencil grids (defined in the stencil specification) simultaneously.

Figure 12.3: UML class diagram for Strategy loop IR classes.

For representing the stencil calls and arithmetic expressions, standard Cetus IR class instances are used. The internal representation of the Strategy is used as a template by the code generator into which the concrete stencil computation and the concrete grids are substituted: special Strategy-specific IR objects are gradually replaced by Cetus IR objects that have a C representation. For instance, Strategy-type loops are converted to C for loops, and grid–stencil node pairs are converted to linearly indexed grid arrays.

The Strategy parser exploits the fact that the stencil representation is already constructed when the Strategy is being parsed. Hence, stencil properties (e.g., dimensionality, minimum and maximum spatial stencil node coordinates) and aspects of the Strategy that depend on stencil properties, such as vectors of length d specifying Strategy subdomain sizes, where d is the dimensionality of the stencil, or vector subscripts depending on d, are instantiated concretely. For instance, if d is known to be 3, the Strategy subscript expression (1 ...) becomes (1, 1, 1) at the time of parsing, and a variable b of type dim becomes (b_x, b_y, b_z). Thus, Strategy subscript expressions are resolved at parse time, and the internal representation only holds the concrete instantiations.

12.2 The Code Generator

The objective of the code generator is to translate the Strategy, into which the stencil has been substituted, into the final C code. In particular, it transforms Strategy loops into C loops and parallelizes them according to the specification in the Strategy; it unrolls and vectorizes the inner-most loop containing the stencil calculation if desired; and it determines which arrays to use based on the Strategy structure and the grids defined in the stencil specification and calculates the indices for the array accesses. The code generator classes are contained in the package ch.unibas.cs.hpwc.patus.codegen.

Fig. 12.4 shows the UML class diagram of the code generation package. CodeGeneratorMain is the main entry point of the PATUS code generation module. The method generateCode parses both the stencil and Strategy files and invokes the actual code generator, CodeGenerator. The generate method of the latter first "resolves" the Strategy parallel keyword: by means of the ThreadCodeGenerator, an internal representation for the per-thread code of the kernel function is created based on the internal representation of the Strategy. The ThreadCodeGenerator class provides methods for parallelizing range iterators and subdomain iterators. The resulting AST is processed by the SingleThreadCodeGenerator, which invokes the specialized sub-code generators:

• The class SubdomainIteratorCodeGenerator translates Strategy subdomain iterators to C loop constructs. For an iterator over a d-dimensional subdomain, it creates a loop nest with d loops, inferring the loop bounds from the surrounding subdomain iterator or from the problem domain. The inner-most loop is both unrolled and vectorized if these transformations are enabled in the code generation configuration. More details on unrolling and vectorization are given in Chapters 14.1 and 14.3. The actual code generation is done recursively in the inner CodeGenerator class.

• The LoopCodeGenerator is the counterpart for range iterators, which are simply expanded to C for loops. No unrolling or vectorization is done.

• The StencilCalculationCodeGenerator substitutes the stencil nodes constituting the stencils by accesses to the arrays that contain the grid values. The actual code generation is done within the inner CodeGenerator class, which has two specializations: StencilCodeGenerator, which creates the code used in the stencil kernel function, and InitializeCodeGenerator, which is responsible for the initialization of the grid arrays in the initialization function, thus providing the NUMA-aware initialization. The code generator takes care of stencil node offsets due to loop unrolling and vectorization.

Figure 12.4: UML class diagram of the code generator classes.

• The actual array indices are calculated by the IndexCalculatorCodeGenerator. In the generated code, the index calculations are placed just before the stencil evaluation, i.e., within the inner-most loop (see the sketch after this list). We count on the compiler to perform loop-invariant code motion and pull the bulk of the index calculation out of the loop. All the compilers which have been tested (GNU 4.4, Intel icc 11.1, Microsoft cl 16) were able to perform this transformation. The details of how the indices are calculated are given in Chapter 13.2.

• The FuseMultiplyAddCodeGenerator creates fused multiply-adds for architectures which have a corresponding intrinsic by replacing multiply-add combinations in the internal representation, and

• The DatatransferCodeGenerator creates the code responsible for copying data to and from memory on other levels of the memory hierarchy, e.g., from global to shared memory on a GPU. The actual implementation depends on the architecture and is therefore provided by an instance of a back-end code generator, IBackend.
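To illustrate the index placement described in the IndexCalculatorCodeGenerator item above, the following C sketch computes the linearized index directly inside the inner-most loop of a 3D 7-point stencil sweep and relies on the compiler to hoist the loop-invariant parts; the function, array, and coefficient names are invented for illustration and are not the code PATUS emits.

/* Sketch only: the index is computed just before the stencil evaluation;
   hoisting is left to the compiler (cf. Eqn. 13.1 for the linearization). */
void sweep(double *u_new, const double *u, int X, int Y, int Z,
           double c0, double c1)
{
    for (int z = 1; z < Z - 1; z++)
        for (int y = 1; y < Y - 1; y++)
            for (int x = 1; x < X - 1; x++) {
                int idx = x + X * (y + Y * z);
                u_new[idx] = c0 * u[idx]
                    + c1 * (u[idx - 1]     + u[idx + 1]
                          + u[idx - X]     + u[idx + X]
                          + u[idx - X * Y] + u[idx + X * Y]);
            }
}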

The code generation options are controlled by the settings in CodeGenerationOptions and CodeGeneratorRuntimeOptions. The former contains the settings for the global options (e.g., whether or not to vectorize and in which way, how to unroll the inner-most loops, whether to create a function interface that is compatible with Fortran, etc.). The latter specifies the local options for the current code generation pass. For instance, if the user wants the code generator to unroll loops, there are most likely multiple unrolling configurations, and the local options specify the unrolling configuration which is currently generated. An instance of the CodeGeneratorRuntimeOptions class is passed to the generate methods of the code generators; due to space limits, the arguments to the code generator methods were omitted in the UML class diagram in Fig. 12.4.

All the code generator objects receive an instance of CodeGeneratorSharedObjects when they are constructed. Such an object encapsulates the "common knowledge" of the code generators, in particular the stencil representation, the internal representation of the Strategy, and the hardware architecture description. Identifiers created during code generation are stored in CodeGeneratorData.

12.3 Code Generation Back-Ends

The code generation back-ends are responsible for providing the details to effect the parallelization in the respective programming model, for implementing data movement functions if needed or desired, for converting arithmetic operations and standard function calls into intrinsics, and for mapping the standard floating point data types to native data types if necessary. For instance, if explicit vectorization is turned on and the vectorization is done using SSE, an addition is no longer written using the "+" operator, but using the _mm_add_pd intrinsic function instead; as arguments this function expects two __m128d values, so a double — or rather a SIMD vector of 2 doubles — has to be converted to the SSE data type __m128d.

Also, the code generation back-ends provide implementations of functions not specific to the compute kernel: the benchmarking harness needs to allocate memory for the grids and free it again, or it needs to create metrics expressions for the stencil, such as the number of Flops per stencil evaluation or the expression computing the performance number.

The code generation back-ends consist of multiple interwoven parts. The Java back-end code generator is chosen by the architecture description. The backend attribute of the codegenerator tag (cf. Listing 12.1) defines the back-end ID for which the BackendFactory will return an instance of the corresponding back-end implementation.

Listing 12.1: An excerpt of a PATUS architecture description.


Conversely, the Java back-end implementation (in the package ch.unibas.cs.hpwc.patus.codegen.backend; a UML class diagram is shown in Fig. 12.5) is backed by the architecture description, which provides back-end details such as a mapping between arithmetic operators and the intrinsic functions replacing them, and a mapping between primitive data types and their native representation as shown in Listing 12.1. The baseNames in the intrinsic tags correspond to the methods of IArithmetic. The default implementation in the class AbstractArithmeticImpl converts unary and binary arithmetic operators to function calls with functions named as specified in the architecture description. If certain operators require special treatment (e.g., there is no SSE intrinsic to negate the SSE vector entries), the corresponding IArithmetic method can be overridden in the concrete back-end implementation, which will then be used instead of the default.

Apart from providing architecture-specific bindings for use within the stencil kernel function, the back-end generator also provides mechanisms used to create the benchmarking harness. Details are described in Chapter 12.4.

The IBackend interface is composed of logically separated sub-interfaces:

• IParallel defines code generation methods related to parallelization, such as synchronization constructs.

• IDataTransfer provides code generation related to data movement between memories in the memory hierarchy.

• IIndexing defines how subdomain indices are calculated. In the CPU multicore realm, each thread has a one-dimensional ID. In CUDA, on the other hand, there are two levels of indices: thread and thread block indices, which can be one-, two-, or three-dimensional. IIndexing takes care of mapping these programming model-specific peculiarities.

Figure 12.5: UML class diagram of the code generator back-ends.

• IArithmetic provides the mapping between arithmetic operators and their architecture-specific intrinsic counterparts, as discussed above.

• INonKernelFunctions deals with the architecture-specific functionality related to the benchmarking harness rather than the actual stencil kernel.

AbstractBackend provides a base for concrete back-end implementations and mixes in the methods of AbstractArithmeticImpl and AbstractNonKernelFunctionsImpl. Currently, an OpenMP back-end for shared memory CPU systems and a C for CUDA back-end for NVIDIA graphics processing units are implemented as concrete back-end classes.

12.4 Benchmarking Harness

The benchmarking harness is generated from an architecture-specific template consisting of a set of C (or C for CUDA) files and a Makefile within a directory. The architecture description file specifies the directory in which the code generator will look for the template files to process in the harness-template-dir attribute of the build tag in Listing 12.1. The code generator (ch.unibas.cs.hpwc.patus.codegen.benchmark.BenchmarkHarness) looks for patus pragmas and identifiers starting with PATUS_ in the C files and replaces them with actual code, depending on the stencil specification. A simple benchmarking harness template for OpenMP is shown in Listing 12.2.

Listing 12.2: Sample OpenMP benchmarking harness template.

# pragma patus forward_decls

int main (int argc, char** argv)
{
    // prepare grids
    # pragma patus declare_grids
    # pragma patus allocate_grids

    // initialize
    # pragma omp parallel
    {
        # pragma patus initialize_grids
    }

    long nFlopsPerStencil = PATUS_FLOPS_PER_STENCIL;
    long nGridPointsCount = 5 * PATUS_GRID_POINTS_COUNT;
    long nBytesTransferred = 5 * PATUS_BYTES_TRANSFERRED;

    // run the benchmark
    int i;
    tic ();
    # pragma omp parallel private (i)
    for (i = 0; i < 5; i++) {
        # pragma patus compute_stencil
        # pragma omp barrier
    }
    toc (nFlopsPerStencil, nGridPointsCount, nBytesTransferred);

    // free memory
    # pragma patus deallocate_grids
    return 0;
}

The pragmas create forward declarations for the functions contained in the kernel file, declare the grid arrays, and allocate, initialize, and free them. The pragma compute_stencil is replaced by the call to the kernel function. In the code generator, both the patus pragmas and the PATUS_* identifiers are mapped to INonKernelFunctions methods in the Java part, which are invoked while scanning the template files. The INonKernelFunctions interface defines the functions corresponding to the pragmas and identifiers shown in Listing 12.2, but should additional stencil- or Strategy-dependent code substitutions for the benchmarking harness be necessary for an architecture, methods named after the pragmas or identifiers can simply be added to the corresponding implementation of the back-end code generator (the implementation of INonKernelFunctions), which are called by reflection by the framework.*

* The underscores in the pragmas and identifiers are removed and the names are converted to camel case to meet Java conventions. Thus, e.g., the method in the IBackend implementation corresponding to the compute_stencil pragma should be computeStencil.


Figure 12.6: UML class diagram of the auto-tuner subsystem.

Optionally, the code generator can generate code validating the result of the generated kernel function against a naïve sequential version of the stencil. If validation is turned on, the grid declarations and initializations are duplicated to accommodate both the result computed by the optimized kernel and the reference solution. The validation code, which is inserted directly into the benchmarking harness source, computes the reference solution based on a naïve sequential Strategy, which is constructed internally. Then, it compares the values of the inner grid points of the result returned by the kernel function to the values of the reference grids. More precisely, the relative error is computed, and if it exceeds a certain predefined threshold, the computation implemented in the kernel function is assumed to be faulty, and the validation fails.
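A minimal sketch of such a relative-error check, assuming double precision grids stored as flat arrays; the function name, the guard against division by very small reference values, and the threshold parameter are illustrative choices, not the code PATUS actually inserts.

#include <math.h>

/* Returns 1 if every grid point of the kernel result u agrees with the
   reference solution u_ref up to the relative tolerance tol, 0 otherwise. */
int validate(const double *u, const double *u_ref, int n, double tol)
{
    for (int i = 0; i < n; i++) {
        double denom = fabs(u_ref[i]) > 1e-30 ? fabs(u_ref[i]) : 1.0;
        if (fabs(u[i] - u_ref[i]) / denom > tol)
            return 0;   /* relative error exceeds the threshold */
    }
    return 1;
}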

12.5 The Auto-Tuner

The implementation of the PATUS auto-tuner module is contained in the Java package ch.unibas.cs.hpwc.patus.autotuner. Architecturally, the main parts of the auto-tuner are two Java interfaces, IOptimizer and IRunExecutable, and the OptimizerFactory. The latter instantiates an object implementing the IOptimizer interface, i.e., the class implementing a search method. The optimizer is started by invoking IOptimizer's only method, optimize, and passing it an instance of a class implementing IRunExecutable. This interface does the evaluation of the objective function, i.e., in the context of auto-tuning, runs the benchmarking executable. The default implementation simply starts the executable and reads its output, which must include the time measurement. The time measurement is taken as the value of the objective function for the given set of parameters (objective function arguments). Additional program output is saved (at least for the current best evaluation), which in the case of the default benchmarking harness includes some metrics of the computation (number of grid points, Flops per stencil, total Flops) and the performance in GFlop/s.

The RunExecutable class, which is the default implementation of IRunExecutable, separates the parameter space for the actual program execution from the space of decision variables, which the optimizer classes see. The actual parameter space for the program execution might be an arbitrary list of integers per parameter. The StandaloneAutotuner, which is instantiated when the PATUS auto-tuner is invoked from the command line, accepts parameter ranges that are parts of arithmetic or geometric sequences, or a list of arbitrary integers (cf. Appendix A.2). Each of these ranges is mapped onto a set of consecutive integers, which is what the search methods see.

RunExecutable caches execution times (assuming that the variation in execution time is negligible) and returns the cached value if the executable has been started with the same parameters in a previous run, to save time. Also, before executing, constraints are checked, and if a constraint is violated, a penalty value is returned. Constraints are comparison expressions containing references to the benchmark executable parameters (cf. Appendix A.2). At runtime, the concrete parameter values are substituted for the references, and Cetus's simplifier is used to check whether the comparison expression is satisfied.

Chapter 13

Generating Code: Instantiating Strategies

Those who may have the patience to study a moderate quantity of rather dry details will find ample compensation. . . — Ada Lovelace (1815–1852)

13.1 Grids and Iterators

In Chapters 12.1.2 and 12.2 we discussed how Strategies are represented internally in PATUS and how C code is generated using a Strategy as a template. In this chapter, we want to give some more details on how the stencil specification and the Strategy play together. In particular, we present the mechanism by which a subdomain iterator, together with a stencil node, induces the notion of a memory object: a (possibly local) array which holds the data for the computation or the result of the computation.

The notion of stencil nodes enables us to map grids, or parts of grids, to arrays which are stored physically in main memory or some local memory. In Chapter 6 we discussed how portions of the grids can be laid out in memory, e.g., in cache. It has been argued that it is not necessary to bring an entire cube-shaped subdomain into memory; rather, e.g., for a 7-point stencil in 3D with stencil nodes $(0,0,-1;\,-1;\,\cdot)$, $(0,0,0;\,-1;\,\cdot)$, $(\pm 1,0,0;\,-1;\,\cdot)$, $(0,\pm 1,0;\,-1;\,\cdot)$, $(0,0,1;\,-1;\,\cdot)$, 3 input planes of data points are required to produce one output plane. Assume that the planes are orthogonal to the z-axis. Then these 3 planes, $P_{-1}$, $P_0$, $P_1$, contain the stencil nodes

$$P_{-1} \supseteq \{(0,0,-1;\,-1;\,\cdot)\},$$
$$P_0 \supseteq \{(0,0,0;\,-1;\,\cdot),\ (\pm 1,0,0;\,-1;\,\cdot),\ (0,\pm 1,0;\,-1;\,\cdot)\},$$
$$P_1 \supseteq \{(0,0,1;\,-1;\,\cdot)\}.$$

Conversely, if we know that we want to cut the domain into planes, we can infer the planes required from the set of stencil nodes, and therefore the space in memory required and also the memory layout. We call a portion of a grid stored in an array in some memory a memory object. In general, we define a memory object as follows:

Definition 13.1. Let $n \in \Sigma$ be a stencil node and let $A : \mathbb{Z}^d \longrightarrow \mathbb{Z}^d$ be a projection. The memory object for $n$ under $A$ is the equivalence class $\Sigma/A := \{A(\alpha(n)) : n \in \Sigma\}$.

Formally, we denote the d-dimensional subdomain iterator for subdomain v(s) in u(:; t) ... by a d-dimensional vector $s$ defining the size of the subdomain $v$ being iterated within its parent subdomain $u$, as $v = (s_1, \dots, s_d) \in \mathbb{N}^d$ with $s_1, \dots, s_d \geq 1$.

Definition 13.2. Let $v = (s_1, \dots, s_d)$ be a subdomain iterator. The projection mask $\pi \in \mathbb{Z}^{d \times d}$ associated with $v$ is the diagonal $d \times d$ matrix over $\mathbb{Z}$ defined by
$$\pi_{ii} := \begin{cases} 1 & \text{if } s_i = 1, \\ 0 & \text{otherwise.} \end{cases}$$

In this way, every subdomain iterator gives rise to a projection, which, as per Definition 13.1, can be applied to a stencil node, and we obtain a memory object; in fact, we obtain the memory object containing that stencil node when iterating over the parent domain. The idea behind Definition 13.2 is the following. In 3D, we assume that a subdomain iterator is either a plane (a slice of width 1, no matter how it is oriented, but orthogonal to some coordinate axis — if exactly one coordinate $s_i$ is set to 1), a line (a stick with exactly two coordinates set to 1), or a point (if all 3 coordinates are set to 1). All other configurations are assumed to be boxes which contain at least as many points as are needed to evaluate a stencil using the points within the box.

Thus, if, e.g., $v = (s_1, s_2, 1)$ is a plane subdomain iterator (with the plane orthogonal to the z-axis), the associated projection mask is the projection $\pi = \mathrm{diag}(0, 0, 1)$. Applying the projection to the stencil nodes of the 3D 7-point stencil mentioned above, we retrieve three memory objects, corresponding to the $P_{-1}$, $P_0$, $P_1$ from above. The four stencil nodes $(\pm 1, 0, 0;\,-1;\,\cdot)$, $(0, \pm 1, 0;\,-1;\,\cdot)$, being projected to $(0, 0, 0;\,-1;\,\cdot)$, are "masked out," hence the name of the projection.

Figure 13.1: Memory objects (MOs) and memory object classes obtained by projecting stencil nodes using the projections π and I − ρ induced by the subdomain iterator.

Fig. 13.1 shows an example of a subdomain iterator $v = (s_1, 1, 1)$. The projection mask is $\pi = \mathrm{diag}(0, 1, 1)$, hence there are five memory objects (MOs), $(*, \pm 1, 0)$, $(*, 0, 0)$, $(*, 0, \pm 1)$, as shown in the figure. The stencil nodes $(0, 0, 0;\,-1;\,\cdot)$ and $(\pm 1, 0, 0;\,-1;\,\cdot)$ are collapsed into the single memory object $(*, 0, 0)$. (Although the memory object space $\Sigma/\pi$ is 2-dimensional, we keep the coordinate which was projected to zero in the notation for clarity and mark it by an asterisk.)

By the way subdomain iterators are constructed, we can identify the reuse direction: the direction in which data from memory objects into which data was loaded previously can be reused. The reuse direction is the first direction with a unit step progression. In Fig. 13.1, the x-direction is fully spanned by the subdomain iterator, and the first unit step direction is along the y-axis. Hence, in this example, the reuse direction is the y-direction. This motivates the definition of the reuse mask:

Definition 13.3. The reuse mask $\rho \in \mathbb{Z}^{d \times d}$ for the subdomain iterator $v$ is the diagonal $d \times d$ matrix over $\mathbb{Z}$ defined by
$$\rho_{ii} := \begin{cases} 1 & \text{if } \pi_{ii} = 1 \text{ and } \pi_{jj} = 0 \text{ for } j < i, \\ 0 & \text{otherwise,} \end{cases}$$
i.e., $\rho$ has exactly one non-zero entry, at the first non-zero position in $\pi$.

Applying the projection $(I - \rho) \circ \pi$ to the set of stencil nodes $\Sigma$ gives rise to memory object equivalence classes

$$\Sigma \,\big/\, \big( (I - \rho) \circ \pi \big),$$
within which memory objects can be reused cyclically. $I$ is the identity matrix. The structure labeled "MO classes" in Fig. 13.1 shows the three classes into which the memory objects are divided, $(*, *, \pm 1)$ and $(*, *, 0)$, after applying the projection
$$I - \rho = \begin{pmatrix} 1 & & \\ & 1 & \\ & & 1 \end{pmatrix} - \begin{pmatrix} 0 & & \\ & 1 & \\ & & 0 \end{pmatrix} = \begin{pmatrix} 1 & & \\ & 0 & \\ & & 1 \end{pmatrix}$$
to the memory objects $(*, \pm 1, 0)$, $(*, 0, 0)$, $(*, 0, \pm 1)$. The lower and upper classes $(*, *, \pm 1)$, symbolized with a purple bullet, contain one memory object each, which means that there is no data reuse from the respective memory objects when the subdomain iterator is advanced by one step (which is in the y-direction). The middle class, symbolized by the pink bullet, however, contains three memory objects, $(*, 0, 0)$ and $(*, \pm 1, 0)$. When the iterator progresses one step, two of the memory objects within the class can be reused, namely $(*, 0, 0)$ and $(*, 1, 0)$, and only one, the one with the oldest data, has to be loaded newly. Of course, the same theory also applies to other subdomain iterator shapes.

For overlapping computation and data transfers, the reuse direction gives rise to preload memory objects: memory objects which are pre-loaded with data before they are consumed in the next iteration step. In Fig. 13.1 they are drawn with dotted lines and labeled accordingly. The set of preload memory objects can be found by advancing the spatial coordinates of the stencil nodes by 1 in the reuse direction and subtracting the original set from the union:

$$\mathrm{Preload} := \Big( \big\{\, n + (\rho \mathbf{1};\, 0;\, 0) \;:\; n \in \Sigma \,\big\} \cup \Sigma \Big) \setminus \Sigma \;\Big/\; \pi.$$

Here, $\mathbf{1}$ is the vector $(1, \dots, 1) \in \mathbb{Z}^d$.

13.2 Index Calculations

Another concern in the code generator is calculating indices into subdomains and into memory objects. We declare all the arrays used for the computation as one-dimensional arrays, thus the d-dimensional indexing in the stencil specification and the Strategy has to be converted to one linear index. Internally, subdomains keep their d-dimensional indexing, but when assigning subdomains to parallel entities (threads or thread blocks), the hardware- or programming model-specific indexing has to be converted to the d-dimensional indexing of the subdomains. In OpenMP we have one level of one-dimensional indexing: each thread is assigned a linear index. In CUDA, however, there are two levels of indexing, one for the thread blocks and one for the threads. Moreover, thread block indices are natively two-dimensional prior to version 4 of CUDA (but the programmer could discard the second dimension), and threads are indexed by 1-, 2-, or 3-dimensional indices. Hence, a two-level $(2,3)$-dimensional index ultimately has to be converted to a d-dimensional index. In CUDA 4.0 the grid dimensionality was extended to 3 dimensions for Fermi GPUs, so the source is a $(3,3)$-dimensional index. In this chapter, a general framework for converting between multi-level multi-dimensional indices is established.

A linearized index $i$ is calculated from a d-dimensional index $(i_1, \dots, i_d)$ in a domain sized $(N_1, \dots, N_d)$ by means of the formula
$$i = i_1 + n_1 \big( i_2 + n_2 ( \cdots (i_{d-1} + n_{d-1}\, i_d) \cdots ) \big). \tag{13.1}$$

We do not need multi-level indices to calculate the linear memory object index, since we always know the global coordinates. However, we do need multi-level indices to compute a subdomain target index from a hardware index, such as a thread ID. Let $D$ be the dimensionality of the target index (e.g., 3 for 3-dimensional stencil computations). Assume there are $L$ indexing levels. Let $d^{(\ell)}$ be the minimum of the index dimensionality of the indexing level $\ell$, $\ell = 1, 2, \dots, L$, and the target dimensionality $D$. By taking the minimum, excess hardware indices are simply ignored.

Example 13.1: NVIDIA GPUs
On a pre-Fermi NVIDIA GPU with two indexing levels, thread blocks and the grid of blocks, threads within a block can be addressed with 3-dimensional indices, and the blocks in a grid can be addressed with 2-dimensional indices. So for this example, $L = 2$ and $d^{(1)} = 3$, $d^{(2)} = 2$.

By $\big( i_1^{(\ell)}, i_2^{(\ell)}, \dots, i_{d^{(\ell)}}^{(\ell)} \big) \in \mathbb{N}^{d^{(\ell)}}$ we denote an actual index for indexing level $\ell$, and by $\big( n_1^{(\ell)}, n_2^{(\ell)}, \dots, n_{d^{(\ell)}}^{(\ell)} \big)$ the block size, i.e., the bounds for the indices, $0 \leq i_j^{(\ell)} < n_j^{(\ell)}$ for $j = 1, 2, \dots, d^{(\ell)}$. Also, denote the full domain size by $(N_1, N_2, \dots, N_D)$.

We create the $D$-dimensional target index from the hierarchical source indices, whose dimensions are dictated by the hardware, in two steps. First, we expand the source indices $\big( i_1^{(\ell)}, i_2^{(\ell)}, \dots, i_{d^{(\ell)}}^{(\ell)} \big)$, $\ell = 1, \dots, L$, to indices of dimension $D$. For the source indices with $\ell < L$, we append zeros so that the index has the desired dimensionality. For the last indexing level ($\ell = L$), we emulate the possibly missing dimensions by extrapolating the last dimension. The block sizes are expanded similarly. Secondly, we calculate the total global index from the expanded indices.

In detail, for all indexing levels $\ell = 1, \dots, L-1$, i.e., all but the last, we extend the dimensionalities of the indices to the target dimensionality $D$ by defining
$$\tilde{i}_j^{(\ell)} := \begin{cases} i_j^{(\ell)} & \text{if } 1 \leq j \leq d^{(\ell)}, \\ 0 & \text{if } d^{(\ell)} + 1 \leq j \leq D, \end{cases} \qquad \tilde{n}_j^{(\ell)} := \begin{cases} n_j^{(\ell)} & \text{if } 1 \leq j \leq d^{(\ell)}, \\ 1 & \text{if } d^{(\ell)} + 1 \leq j \leq D. \end{cases}$$

We assume that the block sizes $\big( n_1^{(\ell)}, n_2^{(\ell)}, \dots, n_{d^{(\ell)}}^{(\ell)} \big)$ are given (dictated by the hardware) for $\ell = 1, \dots, L-1$. For the block sizes in the last indexing level $L$, we calculate how many level $L-1$ blocks are needed in each dimension by
$$\tilde{n}_j^{(L)} = \Big\lceil N_j \Big/ \prod_{\ell=1}^{L-1} \tilde{n}_j^{(\ell)} \Big\rceil \quad \text{for } j = 1, \dots, d^{(L)} - 1, \qquad \tilde{n}_{d^{(L)}}^{(L)} = \prod_{k=d^{(L)}}^{D} \Big\lceil N_k \Big/ \prod_{\ell=1}^{L-1} \tilde{n}_k^{(\ell)} \Big\rceil. \tag{13.2}$$

Up to the dimensionality minus 1 of the last indexing level, the number of $(L-1)$-level blocks is just the domain size divided by the size of an $(L-1)$-level block. The size of an $(L-1)$-level block is the product over all levels of the number of blocks. In the last dimension we emulate the dimensions that are possibly missing in the last indexing level, hence the product over the last ($d^{(L)}$) and possibly missing dimensions ($d^{(L)}, \dots, D$).

The index $\big( \tilde{i}_1^{(L)}, \tilde{i}_2^{(L)}, \dots, \tilde{i}_D^{(L)} \big)$ on the last indexing level is calculated for $j = D, D-1, \dots, d^{(L)}+1, d^{(L)}$, in decrementing order, by the recursion
$$\tilde{i}_j^{(L)} = \Bigg\lfloor \bigg( i_{d^{(L)}}^{(L)} - \sum_{k=j+1}^{D} \tilde{i}_k^{(L)} \sigma_{k-1} \bigg) \Big/ \sigma_{j-1} \Bigg\rfloor, \tag{13.3}$$
with strides (i.e., the total number of level $L$ blocks in the grid in the emulated directions $\leq j$)
$$\sigma_j := \begin{cases} 1 & \text{for } 0 \leq j \leq d^{(L)} - 1, \\ \sigma_{j-1} \cdot \Big\lceil N_j \Big/ \prod_{\ell=1}^{L-1} \tilde{n}_j^{(\ell)} \Big\rceil & \text{for } d^{(L)} \leq j \leq D - 1. \end{cases} \tag{13.4}$$

For $1 \leq j \leq d^{(L)} - 1$, we set $\tilde{i}_j^{(L)} = i_j^{(L)}$.

Note that the factor in the definition of $\sigma_j$ corresponds to the number of blocks being multiplied together in Eqn. 13.2. Also note that Eqn. 13.3 solves the equation
$$i_{d^{(L)}}^{(L)} = \sum_{k=d^{(L)}}^{D} \tilde{i}_k^{(L)} \sigma_{k-1}$$
iteratively for the unknowns $\tilde{i}_{d^{(L)}}^{(L)}, \tilde{i}_{d^{(L)}+1}^{(L)}, \dots, \tilde{i}_{D-1}^{(L)}, \tilde{i}_D^{(L)}$. This equation corresponds to the calculation of a linearized index (Eqn. 13.1), with the block sizes $n_j$ written as strides $\sigma_j$. (Indeed, set $n_j = \big\lceil N_j \big/ \prod_{\ell=1}^{L-1} \tilde{n}_j^{(\ell)} \big\rceil$ and cf. Eqn. 13.4.) In this case, however, the linearized index (the left hand side) is given, and the coordinates of the multi-dimensional index are sought.

Now, the total global index $\big( \hat{i}_1, \hat{i}_2, \dots, \hat{i}_D \big)$ is calculated using the expanded indices and block sizes by
$$\hat{i}_j := \tilde{i}_j^{(1)} + \tilde{n}_j^{(1)} \Big( \tilde{i}_j^{(2)} + \tilde{n}_j^{(2)} \big( \cdots + \tilde{n}_j^{(L-1)} \cdot \tilde{i}_j^{(L)} \cdots \big) \Big) \qquad (j = 1, \dots, D).$$

Example 13.2: Calculating a 3D index from a 1D OpenMP thread ID.
The target dimensionality is $D = 3$, and we have one indexing level, hence $L = 1$. Also, $d^{(1)} = \min(1, 3) = 1$, $i_1^{(1)} = \mathrm{thdid}$, and $n_1^{(1)} = \mathrm{numthds}$.
$$\tilde{n}_1^{(1)} = \prod_{k=1}^{3} \lceil N_k / 1 \rceil = N_1 N_2 N_3,$$
$$\sigma_0 = 1, \qquad \sigma_1 = \sigma_0 \lceil N_1 / 1 \rceil = N_1, \qquad \sigma_2 = \sigma_1 \lceil N_2 / 1 \rceil = N_1 N_2,$$
$$z = \tilde{i}_3^{(1)} = \big\lfloor (i_1^{(1)} - 0) / \sigma_2 \big\rfloor = \lfloor \mathrm{thdid} / \sigma_2 \rfloor,$$
$$y = \tilde{i}_2^{(1)} = \big\lfloor (i_1^{(1)} - \tilde{i}_3^{(1)} \sigma_2) / \sigma_1 \big\rfloor = \big\lfloor (\mathrm{thdid} - \tilde{i}_3^{(1)} \sigma_2) / N_1 \big\rfloor,$$
$$x = \tilde{i}_1^{(1)} = \big\lfloor (i_1^{(1)} - \tilde{i}_2^{(1)} \sigma_1 - \tilde{i}_3^{(1)} \sigma_2) / \sigma_0 \big\rfloor = \mathrm{thdid} - \tilde{i}_2^{(1)} \sigma_1 - \tilde{i}_3^{(1)} \sigma_2.$$

Example 13.3: Some concrete numbers...
In Example 13.2, set $N_1 = 4$, $N_2 = 3$, $N_3 = 2$. Then $\sigma_0 = 1$, $\sigma_1 = N_1 = 4$, $\sigma_2 = N_1 N_2 = 12$. The thread with ID 23 is mapped to the coordinates $(3, 2, 1)$: from the formulae in Example 13.2 we have
$$z = \lfloor 23 / 12 \rfloor = 1, \qquad y = \lfloor (23 - 1 \cdot 12) / 4 \rfloor = 2, \qquad x = 23 - 1 \cdot 12 - 2 \cdot 4 = 3.$$
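The de-linearization of Examples 13.2 and 13.3 can be written out directly in C; the following self-contained sketch (variable names are ours, not those of the generated code) reproduces the mapping of thread ID 23 to the coordinates (3, 2, 1).

#include <stdio.h>

int main(void)
{
    const int N1 = 4, N2 = 3;                 /* domain size 4 x 3 x 2 */
    const int sigma1 = N1, sigma2 = N1 * N2;  /* strides; sigma0 = 1 */
    int thdid = 23;                           /* linear thread ID */

    int z = thdid / sigma2;                   /* = floor(23/12) = 1 */
    int y = (thdid - z * sigma2) / sigma1;    /* = floor(11/4)  = 2 */
    int x = thdid - z * sigma2 - y * sigma1;  /* = 23 - 12 - 8  = 3 */

    printf("thread %d -> (%d, %d, %d)\n", thdid, x, y, z);
    return 0;
}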

Example 13.4: 1D indexing on a CUDA GPU for blocks and grids. 3D target.
Here, again $D = 3$ and $L = 2$, but we set $d^{(1)} = d^{(2)} = 1$.
Level 1: $i_1^{(1)} = \texttt{threadIdx.x}$, $n_1^{(1)} = \texttt{blockDim.x}$.
Level 2: $i_1^{(2)} = \texttt{blockIdx.x}$, $n_1^{(2)} = \texttt{gridDim.x}$.
To compute the global index $(x, y, z)$, we calculate
$$\sigma_0 = 1, \qquad \sigma_1 = \sigma_0 \cdot \lceil N_1 / \texttt{blockDim.x} \rceil, \qquad \sigma_2 = \sigma_1 \cdot \lceil N_2 / 1 \rceil = \sigma_1 N_2,$$
$$\tilde{i}_3^{(2)} = \big\lfloor i_1^{(2)} / \sigma_2 \big\rfloor = \lfloor \texttt{blockIdx.x} / \sigma_2 \rfloor,$$
$$\tilde{i}_2^{(2)} = \big\lfloor \big( i_1^{(2)} - \tilde{i}_3^{(2)} \cdot \sigma_2 \big) / \sigma_1 \big\rfloor,$$
$$\tilde{i}_1^{(2)} = i_1^{(2)} - \tilde{i}_2^{(2)} \cdot \sigma_1 - \tilde{i}_3^{(2)} \cdot \sigma_2,$$
$$x = \texttt{threadIdx.x} + \texttt{blockDim.x} \cdot \tilde{i}_1^{(2)},$$
$$y = 0 + 1 \cdot \tilde{i}_2^{(2)} = \tilde{i}_2^{(2)},$$
$$z = 0 + 1 \cdot \tilde{i}_3^{(2)} = \tilde{i}_3^{(2)}.$$

Example 13.5: CUDA GPU with 3D thread blocks and 2D grids. 3D target.
The target dimensionality is specified to be 3, so $D = 3$. We have two levels of indices, threads and blocks, hence $L = 2$. Furthermore, we have $d^{(1)} = \min(3, 3) = 3$ and $d^{(2)} = \min(2, 3) = 2$.
Level 1: $\big( i_1^{(1)}, i_2^{(1)}, i_3^{(1)} \big) = (\texttt{threadIdx.x}, \texttt{threadIdx.y}, \texttt{threadIdx.z})$, $\big( n_1^{(1)}, n_2^{(1)}, n_3^{(1)} \big) = (\texttt{blockDim.x}, \texttt{blockDim.y}, \texttt{blockDim.z})$.
Level 2: $\big( i_1^{(2)}, i_2^{(2)} \big) = (\texttt{blockIdx.x}, \texttt{blockIdx.y})$, $\big( n_1^{(2)}, n_2^{(2)} \big) = (\texttt{gridDim.x}, \texttt{gridDim.y})$.
The global index $(x, y, z)$ is calculated as follows:
$$\sigma_1 = 1, \qquad \sigma_2 = \sigma_1 \cdot \lceil N_2 / \texttt{blockDim.y} \rceil,$$
$$\tilde{i}_3^{(2)} = \big\lfloor i_2^{(2)} / \sigma_2 \big\rfloor = \lfloor \texttt{blockIdx.y} / \sigma_2 \rfloor,$$
$$\tilde{i}_2^{(2)} = \big\lfloor \big( i_2^{(2)} - \tilde{i}_3^{(2)} \cdot \sigma_2 \big) / \sigma_1 \big\rfloor = i_2^{(2)} - \tilde{i}_3^{(2)} \cdot \sigma_2,$$
$$x = \texttt{threadIdx.x} + \texttt{blockDim.x} \cdot \texttt{blockIdx.x},$$
$$y = \texttt{threadIdx.y} + \texttt{blockDim.y} \cdot \tilde{i}_2^{(2)},$$
$$z = \texttt{threadIdx.z} + \texttt{blockDim.z} \cdot \tilde{i}_3^{(2)}.$$

Chapter 14

Internal Code Optimizations

Still it is a very important case as regards the engine, and suggests ideas peculiar to itself, which we should regret to pass wholly without allusion. Nothing could be more interesting than to follow out, in every detail, the solution by the engine of such a case as the above; but the time, space and labour this would necessitate, could only suit a very extensive work. — Ada Lovelace (1815–1852)

In this chapter, code transformations are described which are not controlled by Strategies. They are complementary optimizations, which the user can control on the PATUS command line when the code is generated. Currently, these code optimizations comprise loop unrolling and (explicit) vectorization.

In contrast to the optimization concepts provided by Strategies, internal code optimizations are not parameterizable in the same way as, e.g., cache block sizes. Parameters defined in a Strategy can be incorporated symbolically into the generated code and do not need to be known at code generation time, whereas the internal code optimizations alter the structure of the generated code. Hence, parameters to internal code optimizations need to be known at code generation time. The optimizations described here generate multiple code variants, from which one is selected when the program is run. In the file containing the generated kernels there will indeed be one function for each loop unrolling variant, provided that different loop unrolling configurations are to be generated, along with one master function selecting one of the variants based on a parameter.
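A minimal sketch of what such a master function could look like in the generated C file, assuming two unrolling variants were requested; the function names, signatures, and the selection parameter are invented for illustration and do not reproduce the code PATUS actually generates.

/* One kernel function per unrolling variant plus a dispatching master
   function; the variant index is supplied at runtime, e.g., by the
   auto-tuner via a command line argument of the benchmarking harness. */
void kernel_unroll_1_1_1(double *u_new, const double *u, int X, int Y, int Z);
void kernel_unroll_2_2_2(double *u_new, const double *u, int X, int Y, int Z);

void kernel(double *u_new, const double *u, int X, int Y, int Z,
            int unroll_variant)
{
    switch (unroll_variant) {
    case 0:  kernel_unroll_1_1_1(u_new, u, X, Y, Z); break;
    default: kernel_unroll_2_2_2(u_new, u, X, Y, Z); break;
    }
}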

14.1 Loop Unrolling

Assuming that the stencil on the grid points can be evaluated in any order (e.g., using the Jacobi iteration, the loops sweeping over the spatial domain are fully permutable), we can do loop unrolling not only in one loop, the inner-most, but also unroll the enclosing loops, while collapsing the inner loops and pulling the loop bodies, the actual computation, into the inner-most loop in an unroll-and-jam [29] kind of way. Hence, all d loops sweeping over a d-dimensional block of data are unrolled simultaneously, thus providing another level of blocking. This transformation is also referred to as register blocking [52]. The idea is illustrated in Listing 14.1 for d = 2 and unrolling factors of 2 for each of the loops in the nest.

Listing 14.1: Unrolling a loop nest.

// original loop nest
for i = imin .. imax
    for j = jmin .. jmax
        S(i, j)

// unrolled loop nest
for i = imin .. imax by 2
    for j = jmin .. jmax by 2
        S(i, j)
        S(i, j+1)
        S(i+1, j)
        S(i+1, j+1)

Special treatment is required when the number of iterations is not divisible by the unrolling factor. In that case, cleanup loops have to be added which handle the remaining iterations. Algorithm 14.1 shows the method PATUS uses to unroll loop nests. Creating the unrolled and the cleanup loops is done recursively. Unrolling is only applied to the inner-most subdomain iterator of a Strategy containing a stencil call.

Algorithm 14.1: Creating an unrolled loop nest in PATUS.

function UNROLL(it, config, dim, unroll)
    AST ← {}
    if dim > 0 then
        loop_unrolled ← createLoopHead(it, dim, unroll_dim)                ▷ create unrolled loop
        loop_unrolled.setBody(UNROLL(it, config ∪ {unroll_dim}, dim − 1, unroll))
        AST.add(loop_unrolled)
        if unroll_dim > 1 then                                             ▷ create cleanup loop
            loop_cleanup ← createLoopHead(it, dim, 1)
            loop_cleanup.setBody(UNROLL(it, config ∪ {1}, dim − 1, unroll))
            AST.add(loop_cleanup)
        end if
    else
        AST.add(generateUnrolledLoopBody(it, config))
    end if
    return AST
end function

The function UNROLL expects four parameters: the subdomain iterator it, which we want to unroll; a configuration config, which encodes which loops are being unrolled in the current state of the recursion and for which loops cleanup loops are being generated; the parameter dim, which specifies which dimension of the subdomain iterator it is to be processed; and unroll, the desired unrolling configuration for the loop nest. For instance, if it is a 3-dimensional subdomain iterator and we want to unroll twice in the first dimension and four times in the second and third, unroll would be the vector (2, 4, 4). The initial call to the function is

UNROLL(it, {}, d, desired unrolling configuration),

where d is the dimensionality of the block the subdomain iterator it iterates through, and an empty set is passed as the configuration. An empty local AST is created, to which the loop heads of the unrolled loop and the cleanup loop are added, and which is returned by the function. The bodies of the unrolled and cleanup loops are the ASTs generated recursively by UNROLL with dim decremented, or the actual computation if dim has reached 0 at the leaves of the recursion tree.

The listing in Example 14.1 shows the code generated by UNROLL for a 3-dimensional subdomain iterator v with unroll = (4, 4, 4). Every for loop in which the loop index is incremented by 4 is therefore an unrolled loop; all the loops with increments of one are cleanup loops. Note that in the unrolled loops the original upper loop bounds were reduced by 3 (which is unroll_dim − 1) so that the upper bounds are not overshot, and that the initialization of the loop index in the cleanup loops is omitted so that the cleanup loops pick up where the unrolled loops left off.

Example 14.1: Loop nests produced by unrolling.
This listing shows an inner-most loop nest over points p = (p_idx_x, p_idx_y, p_idx_z) in a 3D subdomain v = (v_idx_x, v_idx_y, v_idx_z):

for (p_idx_z = v_idx_z; p_idx_z < v_idx_z_max - 3; p_idx_z += 4)
{
    for (p_idx_y = v_idx_y; p_idx_y < v_idx_y_max - 3; p_idx_y += 4)
    {
        for (p_idx_x = v_idx_x; p_idx_x < v_idx_x_max - 3; p_idx_x += 4)
        {
            // (4, 4, 4)-unrolled stencil code
        }
        for (; p_idx_x < v_idx_x_max; p_idx_x += 1)
            // (1, 4, 4)-unrolled stencil code
    }
    for (; p_idx_y < v_idx_y_max; p_idx_y += 1) {
        for (p_idx_x = v_idx_x; p_idx_x < v_idx_x_max - 3; p_idx_x += 4)
        {
            // (4, 1, 4)-unrolled stencil code
        }
        for (; p_idx_x < v_idx_x_max; p_idx_x += 1)
            // (1, 1, 4)-unrolled stencil code
    }
}
for (; p_idx_z < v_idx_z_max; p_idx_z += 1) {
    for (p_idx_y = v_idx_y; p_idx_y < v_idx_y_max - 3; p_idx_y += 4)
    {
        for (p_idx_x = v_idx_x; p_idx_x < v_idx_x_max - 3; p_idx_x += 4)
        {
            // (4, 4, 1)-unrolled stencil code
        }
        for (; p_idx_x < v_idx_x_max; p_idx_x += 1)
            // (1, 4, 1)-unrolled stencil code
    }
    for (; p_idx_y < v_idx_y_max; p_idx_y += 1) {
        for (p_idx_x = v_idx_x; p_idx_x < v_idx_x_max - 3; p_idx_x += 4)
        {
            // (4, 1, 1)-unrolled stencil code
        }
        for (; p_idx_x < v_idx_x_max; p_idx_x += 1)
            // (1, 1, 1)-unrolled stencil code
    }
}

14.2 Dealing With Multiple Code Variants

Code generation options which create a multitude of code variants are managed in the internal representation through statement list bundles: each element of the bundle corresponds to one configuration in the product space of the code generation options. This allows us to do only one code generation pass through the Strategy tree instead of as many as there are code variants. Adding a statement to a bundle means adding the statement to all the statement lists within the bundle. Adding a statement specific to one configuration of a code generation option means adding that statement only to the elements in the bundle with the matching configuration of that code generation option.

¦ This example is just for illustration; currently no software prefetching has been implemented in PATUS. 222 CHAPTER 14. INTERNAL CODE OPTIMIZATIONS but regardless of the configuration of the lists software prefetching op- tion tag, i.e., it is added to as many statement lists as there are software prefetching configurations.

14.3 Vectorization

Stencil computations naturally exhibit a large amount of data parallelism: the same arithmetic operations are performed on each point of the input grids (again assuming that the spatial loops are fully permutable). In particular, if we are free to traverse the grid in any order, we may choose to group consecutive elements in unit stride direction, and the computation becomes highly vectorizable. (In a red-black Gauss-Seidel scheme, which computes only every other element within a sweep, vectorization obviously becomes less straightforward, though.) For instance, when vectorized, the scalar assignment statement of a 1D 3-point stencil

$$u[x;\; t+1] \leftarrow u[x-1;\; t] + u[x;\; t] + u[x+1;\; t]$$
becomes

$$u[x : x+3;\; t+1] \leftarrow u[x-1 : x+2;\; t] + u[x : x+3;\; t] + u[x+1 : x+4;\; t]$$
if the length of a SIMD vector is 4, as is the case for the SSE single precision floating point data type. Unlike the general Fortran vector notation, when making use of low-level optimizations such as explicitly generating SSE code, the length of the vectors is dictated by the hardware and fixed.

In practice, there are at least two issues which make vectorization more complicated than illustrated above. Firstly, the number of grid points to compute may not be a multiple of the SIMD vector length. Thus, the excess grid points need to be treated specially in an epilogue loop, which does the computation in a scalar fashion. Secondly, the hardware may impose alignment restrictions. In SSE, if a vector read or written is not aligned at 16 Byte boundaries, which is the width of SSE vectors, there is a performance penalty. In fact, there are two specialized SSE instructions: movaps, which loads only aligned single precision data into an SSE register, and movups, which can also load unaligned data. movaps is fast, but the processor throws an exception if the alignment restriction is violated. movups has several cycles more latency, but it works in any case.

On other hardware, such as the Cell Broadband Engine Architecture, unaligned moves are not possible at all, i.e., trying to bring incorrectly aligned data into a register always raises an exception.

In the above example, if x ≡ 0 (mod 4), u[x : x+3; t] is a SIMD vector aligned at vector boundaries, but both u[x−1 : x+2; t] and u[x+1 : x+4; t] are not. These vectors could be loaded with a slower unaligned move instruction. Shuffle or select instructions, respectively, are another way around this: using such instructions, a new vector can be created from two input operands and a mask that specifies which bits of the two inputs are copied to which bits of the output. PATUS uses this approach and expects a recipe for using shuffle intrinsics in the code generation back-end for the respective architecture. For mixed-type computations involving data types with SIMD vector lengths λ1, ..., λk, we require that the alignment restriction be based on Λ := lcm(λ1, ..., λk).

In multiple dimensions, as the data is stored linearly in memory, it makes sense to do the vectorization only in the unit stride dimension. Alignment considerations, however, apply to any dimension: in 3D, assuming that x is the unit stride dimension and that z is the dimension varying slowest in memory, whether or not u[x, y+δ, z; t] is aligned at SIMD vector boundaries does not depend on δ alone, but on the number of in-memory grid points in the x direction, X. In particular, if X is divisible by the SIMD vector length, then u[x, y+δ, z; t] is aligned for any δ ∈ Z. Note that in that case u[x, y, z+δ; t] is aligned as well. Therefore, PATUS assumes that X is a multiple of the SIMD vector length, so that shuffle instructions or unaligned moves are not necessary in the y and z dimensions. This requirement is not as restrictive as it may seem; it does not require that the number of computed grid points in the unit stride dimension be a multiple of the SIMD vector length, only that the number of elements in memory meets the condition. This can be achieved by padding the grid array in the unit stride direction‡. Compilers, on the other hand, have to err on the conservative side and cannot make such assumptions. Even when a simple stencil code is automatically vectorized by a compiler, the performance benefit is not drastic. In fact, Intel's C compiler, icc v11.1, refuses to vectorize the code when compiled with the -O3 optimizing option. It does vectorize when optimization is set to -fast, but the performance gain is negligible.

‡ The benchmarking harness does not do the padding automatically. If SIMD code is generated and the condition is not met, a warning is displayed, and the benchmarking executable terminates.
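As a concrete illustration of the aligned and unaligned accesses discussed above, the following hand-written C sketch (not PATUS-generated code; the function and variable names are assumptions made for this example) vectorizes the 1D 3-point stencil with SSE. It assumes the arrays are 16-Byte aligned and the padded line length X is a multiple of the vector length 4. The unaligned neighbor vectors are assembled from two aligned loads with SSSE3's palignr instruction, playing the role of the shuffle recipe; the commented-out variant shows the unaligned-move (movups) alternative.

    #include <xmmintrin.h>   /* SSE                          */
    #include <emmintrin.h>   /* SSE2 (integer casts)         */
    #include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8       */

    void sweep_1d(const float *u0, float *u1, int X)
    {
        /* boundary points (x < 4 and x >= X - 4) are assumed to be handled separately */
        for (int x = 4; x < X - 4; x += 4) {
            __m128 left   = _mm_load_ps(&u0[x - 4]);   /* aligned: u0[x-4 .. x-1] */
            __m128 center = _mm_load_ps(&u0[x]);       /* aligned: u0[x   .. x+3] */
            __m128 right  = _mm_load_ps(&u0[x + 4]);   /* aligned: u0[x+4 .. x+7] */

            /* u0[x-1 .. x+2]: last element of "left" and first three of "center",
               combined with a shuffle-type instruction */
            __m128 um1 = _mm_castsi128_ps(_mm_alignr_epi8(
                _mm_castps_si128(center), _mm_castps_si128(left), 12));

            /* u0[x+1 .. x+4]: last three elements of "center" and first of "right" */
            __m128 up1 = _mm_castsi128_ps(_mm_alignr_epi8(
                _mm_castps_si128(right), _mm_castps_si128(center), 4));

            /* alternative using unaligned moves (movups):
               __m128 um1 = _mm_loadu_ps(&u0[x - 1]);
               __m128 up1 = _mm_loadu_ps(&u0[x + 1]); */

            _mm_store_ps(&u1[x], _mm_add_ps(_mm_add_ps(um1, center), up1));
        }
    }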

Figure 14.1: Alignments in native SIMD data type vectorization mode.

PATUS supports two modes of vectorization. In the first, native SIMD data types are used throughout the kernel, which also implies that the stencil kernel function expects grid arrays typed as native SIMD vectors (e.g., __m128 in the case of SSE). The second mode allows grids to be declared as arrays of "regular" data types, which are then re-interpreted as SIMD vectors for the calculation.

Native SIMD Data Types

In this mode, a SIMD vector is treated as an entity, and the inner-most loop in the generated code containing the stencil expression iterates over vectors of grid points. Using native SIMD data types automatically ensures that data is aligned at SIMD vector boundaries. However, the PATUS code generator expects that certain alignment properties are met in unit stride direction (which have to be fulfilled by the code calling the generated stencil kernel): as shown in Fig. 14.1, the start of the interior area must be aligned at a SIMD vector boundary, and the size of the interior of the domain has to be divisible by the SIMD vector width. The domain boundaries and ghost zones have to be padded to full multiples of the SIMD vector length. If parallelization along the unit stride direction is done, PATUS will automatically take care of dividing the domain along SIMD vector boundaries and of adding the appropriate ghost zones.

This version of vectorization carries less overhead: there are no cleanup loops, which are required when dealing with non-native SIMD data types. However, code vectorized in this fashion is not trivial to integrate into existing codes. On the command line, native SIMD data type vectorization can be turned on by setting the option --use-native-simd-datatypes=yes. It is turned off by default. Also, the benchmark results presented in Chapter 10 were not obtained in this mode.
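The following sketch illustrates the shape of a kernel in this mode. It is hand-written for illustration; the function name, parameters, and the placeholder body are assumptions, not actual PATUS output. The grids are passed as __m128 arrays, the inner-most loop advances one SIMD vector at a time, and no prologue or epilogue loops are needed because the caller guarantees the alignment and padding conditions stated above.

    #include <xmmintrin.h>

    /* x_max_v: interior width counted in SIMD vectors; the caller guarantees
       the padded boundaries/ghost zones and the vector alignment of the grids */
    void kernel_native_simd(const __m128 *u_0, __m128 *u_1, int x_max_v)
    {
        for (int x_v = 1; x_v < x_max_v - 1; x_v++) {
            /* vectorized stencil expression; the u[x-1] and u[x+1] grid point
               accesses are realized with shuffle intrinsics operating on
               u_0[x_v - 1], u_0[x_v], and u_0[x_v + 1], as sketched earlier */
            u_1[x_v] = u_0[x_v];   /* placeholder for the generated expression */
        }
    }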

Non-Native SIMD Data Types

In non-native SIMD data type mode, grids are declared as arrays of regular floating point data types (float, double) and passed as such to the stencil kernel function. In this mode, we do not make assumptions on the alignment of inner nodes that are as restrictive as in native SIMD mode. However, we do assume that the number of grid points of the full grid in unit stride direction is divisible by the SIMD vector length. This requirement allows us to forgo shuffle or unaligned move instructions when accessing neighboring points in non-unit stride directions, as discussed above.

Non-native SIMD data type vectorization is associated and tightly integrated with loop unrolling. It only affects the inner-most loop directly containing the stencil call, though. This loop is split into three loops:

• The prologue loop does scalar stencil computations for as many iterations as are needed until the output grid node is aligned at a vector boundary. The number of iterations in the prologue loop is determined by the address of the first output grid node to which data is written and by the SIMD vector width. We assume that all the grids passed to the stencil kernel are aligned at vector boundaries. An example is shown in Example 14.2. Assuming a SIMD vector is 16 Bytes long and idx0 points to the first computed grid point in u, prologend calculates the end index for the prologue loop.

• The main loop is the loop which does the bulk of the computation in a vectorized fashion, and which is also unrolled if the loop unrolling configuration asks to unroll the inner-most loop. In the listing in Example 14.2, the loop is unrolled twice, and hence the index of the main loop is incremented by twice the SSE single precision vector length after each iteration.

• The epilogue loop corresponds to the cleanup loop for loop unrolling. It processes the remaining iterations in a scalar manner. If the main loop was unrolled, the first iterations of the epilogue loop could be vectorized, but this was omitted in favor of less loop control overhead.

Example 14.2: Non-native SIMD data type vectorization. This listing shows the prologue, main, and epilogue loops in non-native SIMD data type vectorization.

    prologend = min(
        v_idx_x + (16 - (((uintptr_t) &u[idx0]) + 15) & 15) / sizeof(float),
        v_idx_x_max);

    // prologue loop
    for (p_idx_x = v_idx_x; p_idx_x < prologend; p_idx_x++) {
        // scalar stencil code
    }

    // main loop
    for ( ; p_idx_x < v_idx_x_max - 7; p_idx_x += 8) {
        // vectorized and unrolled stencil code
    }

    // epilogue loop
    for ( ; p_idx_x < v_idx_x_max; p_idx_x++) {
        // scalar stencil code
    }

The length of data vectors per data type is specified by the architecture description. It also provides a mapping from arithmetic operation/data type pairs to the vector intrinsics replacing the arithmetic operation. Additionally, if there is no intrinsic available for a certain type of operation, the code generator back-end can be modified so as to provide an architecture-specific implementation for that operation. In this way, it is guaranteed that the vectorization can be mapped to any architecture supporting a SIMD mode and to any implementation of the SIMD mode, such as SSE or AVX.

Listing 14.2 is an extract from an architecture description. It shows how the scalar data types float and double are mapped to their respective SSE data types, __m128 and __m128d, of SIMD vector lengths 4 and 2, respectively; and it shows how arithmetic operations (addition, subtraction, multiplication, division, and a function, exemplified by the square root) are mapped onto the double precision SSE intrinsics _mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_div_pd, and _mm_sqrt_pd.

Listing 14.2: Mapping primitive types and operators to SSE equivalents.

(Listing body not reproduced: it defines the type mappings float ↦ __m128 and double ↦ __m128d and maps +, −, *, /, and sqrt to _mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_div_pd, and _mm_sqrt_pd.)
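For illustration, the effect of applying such a mapping to a double precision expression can be sketched by hand as follows (this is not PATUS output; the coefficient names a and b and the helper function are assumptions made for this example): every + and * is replaced by the corresponding _mm_add_pd / _mm_mul_pd intrinsic operating on __m128d vectors of length 2.

    #include <emmintrin.h>   /* SSE2 double precision intrinsics */

    /* a*u[x-1] + b*u[x] + a*u[x+1], expressed with the intrinsics from the
       operator mapping; um1, uc, up1 hold the (shuffled) neighbor vectors */
    static inline __m128d stencil_expr(__m128d um1, __m128d uc, __m128d up1,
                                       __m128d a, __m128d b)
    {
        return _mm_add_pd(_mm_add_pd(_mm_mul_pd(a, um1), _mm_mul_pd(b, uc)),
                          _mm_mul_pd(a, up1));
    }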

As with any form of parallel execution involving floating point arithmetic, it cannot be guaranteed that the result of the computation is the same as if the computation had been carried out sequentially. This is due to the fact that floating point arithmetic is not associative, i.e., the sequence of operations has an influence on the result. In particular, PATUS's stencil specification parser interprets non-bracketed expressions (e.g., a1 + a2 + a3 + ...) in such a way that a balanced expression tree is created, i.e., one of minimum depth. For instance, the expression a1 + a2 + a3 + a4 would be interpreted as ((a1 + a2) + (a3 + a4)). This is done in the hope that as many temporary subexpressions as possible can be calculated independently and that instruction level parallelism is therefore maximized. As binary expressions are translated from this representation to vector intrinsics, the shape of the expression tree dictates the sequence in which the vector intrinsics are executed, which might differ from the sequence a vectorizing compiler would generate or from the sequence in which the expression is executed using scalar values.
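The following small, self-contained program (illustrative only; it is not related to any PATUS code) demonstrates why the shape of the expression tree matters: summing the same four numbers left-to-right and in the balanced grouping ((a1 + a2) + (a3 + a4)) yields different floating point results.

    #include <stdio.h>

    int main(void)
    {
        double a1 = 1.0, a2 = 1.0e16, a3 = -1.0e16, a4 = 1.0;

        /* left-to-right evaluation: ((a1 + a2) + a3) + a4 = 1.0,
           because a1 is absorbed when added to the large value a2 */
        double sequential = ((a1 + a2) + a3) + a4;

        /* balanced expression tree: (a1 + a2) + (a3 + a4) = 0.0,
           because a1 and a4 are each absorbed by the large partial sums */
        double balanced = (a1 + a2) + (a3 + a4);

        printf("sequential = %g, balanced = %g\n", sequential, balanced);
        return 0;
    }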

14.4 NUMA-Awareness

NUMA-awareness of the stencil code is achieved by providing a data initialization function along with the actual stencil compute kernel function; it is parallelized in the same way as the compute kernel and initializes all the grids used by the stencil kernel. Using the same parallelization as the stencil function ensures that the data used for the computation later on is made local to the CPU cores, as detailed in Chapter 6.4.2, as long as the system implements a first touch policy. The user should invoke this initialization function directly after allocating the data. If possible, the initialization routine can also be modified to reflect the actual application-specific initial conditions. This can have a beneficial impact on cache performance: if the data is initialized with the real initial conditions separately and later, the cache data locality established in the initialization function provided by PATUS might be destroyed (in particular, if the real initialization is done sequentially).
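A minimal sketch of such an initialization function, assuming the compute kernel is parallelized with an OpenMP parallel for over the outer-most spatial dimension (the function and parameter names are made up for this example; this is not the code PATUS actually emits): each thread first-touches exactly the part of the grids it will later compute on, so that the pages end up on that thread's local NUMA node.

    #include <omp.h>

    void initialize(float *u_0, float *u_1, int x_max, int y_max, int z_max)
    {
        /* same loop structure and scheduling as the generated compute kernel */
        #pragma omp parallel for schedule(static)
        for (int z = 0; z < z_max; z++)
            for (int y = 0; y < y_max; y++)
                for (int x = 0; x < x_max; x++) {
                    long idx = ((long) z * y_max + y) * x_max + x;
                    u_0[idx] = 0.0f;   /* ideally: the application's real initial condition */
                    u_1[idx] = 0.0f;
                }
    }

The function is called once, directly after allocating u_0 and u_1 and before the first invocation of the stencil kernel.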

Part V

Conclusions & Outlook

Chapter 15

Conclusion and Outlook

While parallel machines of yesteryear may have been constructed of enchanted metals and powered by unicorn tears, today’s parallel machines are as common — and nearly as inexpensive — as hamburgers. Steve Teixeira, April 19, 2010

In this thesis, we presented PATUS, a software framework for generating stencil codes from a high-level, portable specification in a domain specific language (DSL). By using this approach, we expressed our belief that domain specific languages are an effective means to bridge programmer productivity, reusability, and the ability to deliver performance, albeit sacrificing generality of the language.

Developments in hardware have driven us to a point at which it becomes troublesome to write programs which can fully exploit a machine's compute power. Parallelization, sophisticated program transformations, and hardware-specific optimizations are required, which, however, are error-prone and at odds with the need for productivity: carrying out these transformations manually not only renders the code unmaintainable, but also ties it to the particular platform for which it has been crafted, i.e., the code becomes non-portable. Recently, the comeback of hardware accelerators, which show a tremendous performance potential for certain types of code structures and can therefore be highly attractive, has further complicated the picture. Typically they require their own programming model; thus heterogeneous hardware has to be addressed in a heterogeneous coding style.

Traditionally, the burden of transforming a portable code (e.g., written in ANSI C or Fortran, which are still the prevalent languages in scientific computing) into a code adapted to the target hardware platform was laid on the compiler. Yet, we have come to accept that sophisticated transformations, in particular parallelization, which might require new, specifically tailored algorithms, are beyond the compiler's reach. Hence, we believe that concentrating on a specific domain for which we possess the knowledge of applicable, aggressive transformation and optimization techniques, which might not be generally applicable, and encapsulating this domain in a concise domain specific language, is a way to go forward in maintaining both productivity and performance.

While complementary ongoing research focuses on the development of new parallel languages, experience with these new languages, such as Chapel, has shown that embracing productivity and generality comes, at least to some extent, at the expense of performance. For instance, in an experiment conducted with the students of the high performance computing course at the University of Basel, we could observe that Chapel, although it was given good productivity grades by the students, still lags behind other programming models, in particular the DSL approach, in terms of performance [27].

Domain specific languages are robust enough to support portability across hardware platforms, therefore enabling code reusability: once a port of the back-end to the target platform exists, any implementation in the respective domain specific language automatically runs on the new hardware. Thus, when the underlying hardware changes, or when a new architecture emerges, domain specific languages are a way to support it without having to rewrite the entire application.

We implemented this DSL idea exemplarily for the class of stencil computations, the core computation of the structured grid motif in the Berkeley terminology. Especially in scientific computing, but also in consumer codes, stencil computations are an important class of computations.
Indeed, the two applications presented in this thesis, which are based on stencil computations, are targeted at medically, scientifically, and socially relevant questions. Stencil computations arise in finite difference-type PDE solvers, and despite the finite element method being en vogue, finite differences still enjoy considerable popularity for the simplicity of the mathematical method and of its transfer to computer code.

We deliberately separated the algorithmic implementation of the computation from the pointwise specification of the stencil structure, as they are orthogonal concepts. In PATUS, an algorithmic implementation is called a Strategy. Strategies are independent of both the stencil specification and the hardware platform; to guarantee performance portability, auto-tuning is the methodology of choice: given a benchmark executable created from the stencil specification and the Strategy, we employ a search method to find the Strategy parameter combination (and possibly even a Strategy from a predefined set) which delivers the best performance on the hardware platform under consideration. The auto-tuning methodology has been adopted successfully in a number of scientific libraries, and we believe that auto-tuners will play an increasingly important role, complementing and supporting compilers, in a future of increasingly complex and diverse hardware architectures. Using this combined DSL and auto-tuning approach, we were able to present promising performance results for a broad range of stencils on three different hardware platforms.

Looking Forward

The work described in this thesis is but a beginning, and a lot remains to be done. In the near future, technicalities such as CUDA-specific optimizations (taking care of hardware particularities, e.g., alignment of data and making use of shared memory and registers) or a distributed memory back-end need to be addressed, and the code generator needs to be extended to fully support temporally blocked methods.

Boundary conditions are a concern, as they are one of the defining aspects of stencil codes. So far, PATUS can handle boundaries only by defining specific boundary stencils, which are then applied to the boundary data after doing the inner sweep, or by incorporating the boundary calculation into coefficient data. The two real-world application codes incidentally met this assumption. However, multi-time step methods clearly cannot be applied in the former approach. Hence, when going forward to supporting such methods through Strategies, boundary handling within the stencil kernel becomes essential, albeit non-trivial.

The auto-tuner should support tuning multiple stencils or multiple problem sizes at once by tuning over distributions rather than a single stencil and problem size. Also, we need to make advances in automating parts of the tool that have to be done manually for the time being, such as devising a method to generate an auto-tuner configuration script based on the Strategy.

Programming more complex Strategies could prove to be error-prone, and errors might be hard to detect. Sketching, i.e., leaving hard-to-get-right expressions unspecified and relying on a SAT solver to find the correct completions based on a reference implementation, is an appealing option to explore; it was successfully applied to stencil computations in [148], although without the idea of separating the actual stencil expression from the algorithm.

For production use of the tool, tighter integration of PATUS into existing tool chains might be desirable. We chose to define our own domain specific language for stencil specifications. This more lightweight approach has the advantage that the input is guaranteed to be a stencil, rendering additional code analysis to detect stencil expressions unnecessary. It also gives us exact control over the supported computation structures, which facilitates code generator features such as vectorization. However, the internal representation of a stencil is not tied to the stencil specification language. Therefore, PATUS can be extended modularly to support other stencil specification syntaxes by providing the corresponding parser; the actual code generator, the Strategy machinery, and the back-ends do not have to be touched. For instance, a parser could be added which accepts annotated Fortran as input format. This, of course, changes nothing in the philosophy of the approach: the stencil is still specified in a domain specific language, now embedded into a host language (Fortran).

Thinking further in this direction, we may also want to consider composability of DSLs, realizing that real applications have a variety of components which could be captured by (a variety of) DSLs. Language virtualization is an appealing concept which could provide this. However, the performance aspect currently ties the target language to a selection of low-level, close-to-the-metal languages.
Language virtualization does allow a code generation approach by lifting the program to an internal AST-like representation, but an effective mechanism must be found for the interoperability between the host language and the generated code.

In a broader context, as parallel computing becomes the norm rather than the exception, the disciplines of software engineering and high performance computing will have to inspire each other. For instance, today, high performance computing applications are typically monolithic pieces of software; the typical established reuse and composability principles in scientific computing are through library calls, and even when using parallel libraries, technical difficulties need to be overcome: threaded libraries are parallelized with OpenMP, Intel's Threading Building Blocks, Cilk++, or some other threading library, and contain their own resource management. Thus, when an application utilizes multiple threaded libraries, the performance can suffer from destructive interference of the libraries caused by over-subscription of the hardware resources. The problem really lies in the way operating systems virtualize hardware resources. Two or more software threads sharing a single hardware thread force frequent context switches and abet cache thrashing, thus degrading performance. Therefore it is desirable that libraries rely on a common resource management infrastructure such as Lithe [128], which advocates that the caller allocates resources for the callee and provides a mechanism precluding over-subscription of the hardware resources. It proposes the use of actual hardware threads ("harts") rather than virtual software threads.

Altering the scientific questions to be answered by a simulation might entail non-trivial modifications of the simulation software. If simulations could be pieced together from black box components which are accessible through a well-defined interface, they would become faster to build and more maintainable. Efforts in the direction of composable scientific software are addressed by the Common Component Architecture Forum [1]. In consumer and business informatics, such component systems are known in the form of the service-oriented architecture (SOA) architectural pattern. For instance, web services are a concrete instantiation of SOA. The interface of a web service is exposed through a clearly defined description in an interface definition language, the Web Services Description Language (WSDL), which eases the composability of web services without the need of knowing their internals.

Other inter-disciplinary cooperations must include tight integration of the hardware and software design processes. One of the problems encountered when microprocessor manufacturers and software developers each go their own way in optimizing their designs is that processors are optimized based on fixed benchmark codes, or even fixed benchmark binaries, and software is optimized for a given, existing hardware platform. A path is taken in the hardware/software design space that successively builds on results of the complementary domain, and is prone to steer towards a local optimum.

With this difficulty in mind, one of the ideas for implementing an exa-scale system by the end of the decade is an approach which brings both areas together: hardware/software co-design. The hope is that, based on design interaction, a path can be taken towards a global rather than merely a local optimum, and that the development cycle can be substantially accelerated.

In the past, such approaches have been taken. A prominent example is the Gravity Pipe (GRAPE) computer, a hardware specifically designed for computing stellar dynamics calculations. The project was started in 1989 and won the Gordon Bell Prize in the special-purpose machines category in 1995 and in the price/performance category in 1999 [147]. However, GRAPE was overtaken by the brute-force improvements of clock frequency scaling in the commodity market.

Now that frequency scaling has come to a halt, the idea of special-purpose machines has become attractive again. Rather than limiting the problems to those that can be solved on current machines, the relevant question for answering pressing scientific issues, such as climate change, is what machine is needed to address that scientific problem. Many of today's outstanding scientific questions call for substantially increased compute capabilities before they can be answered. Greenflash [168] is a project that works in the direction of a special-purpose hardware platform for climate simulations. A study has been conducted for an application-targeted exa-scale machine implemented using mainstream embedded design processes. The study concluded that, in this particular case, a custom system consisting of Tensilica XTensa processors with small cores, tailored to the computational requirements following an embedded design philosophy, has a considerably lower cost and is significantly more power efficient than an AMD Opteron-based or a Blue Gene/P system with comparable application-specific performance characteristics.

Processor performance emulation based on the target application prior to manufacturing the actual hardware allows efficient design process iterations. However, emulating a system, especially a large parallel system, in software is too slow to be practical. As a remedy, a research accelerator for multi-processors ("RAMP") is proposed in [167]. RAMP is a system which emulates microprocessors based on FPGAs. This reduces the slowdown compared to the actual system to a feasible range of 1 to 2 orders of magnitude. It therefore permits performance assessment of the entire application rather than of simplified kernels. It also allows an effective co-tuning approach: decisions about the hardware design can be made based on performance data from (auto-)tuned code instead of fixed code instances. For a broader context than one specific target application, motifs can be helpful in the co-design process.
The intention behind TORCH [91] is to create a kernel reference testbed to drive hardware/software co-design. To this end, it provides a high-level description of algorithmic instances within motifs, reference implementations, generators for scalable input data sets, and verification mechanisms. As a basis for hardware and software optimization research, it aims at being agnostic of hardware and software instances and of algorithms developed and tuned for these ecosystems.

Whether the target is scientific simulations, which are to run on large supercomputer systems, or personal computing, current hardware trends dictate that parallel computing permeate all disciplines of computer science. Parallelism has grown from a niche reserved to scientific computing and supercomputing specialists to a crosscutting concern; parallelism has become a mainstream matter. Gradually, parallel constructs have begun to seep into standard APIs, e.g., in the form of Java's concurrency package or as new features of Microsoft's Visual Studio in the form of the parallel .NET extensions and parallel libraries, which take advantage of new language constructs to ease parallel programming, such as the lambda expressions defined in the upcoming C++0x standard [158]. Simultaneously, new languages with inherent support for concurrency are being developed.

The ubiquity of parallelism obviously has consequences for education as well. Instead of being taught in advanced classes of the computer science curriculum, parallel programming needs to be integrated as one of the pillars of computer science and treated as self-evident in the undergraduate program, as proposed, for instance, by the NSF-TCPP curriculum initiative [156]. It is encouraging to see that these ideas are gradually being adopted worldwide. Thus, there is hope that advances in other fields of computer science, especially in software and algorithm engineering, can stimulate parallel computing and vice versa, and that the general computer science community can ingest the lessons learned by the parallel computing community.

Bibliography

[1] The Common Component Architecture Forum. http://www.cca-forum. org/. Accessed August 2011. [cited at p. 235]

[2] LANL Advanced Computing Laboratory. POOMA. http://acts.nersc. gov/pooma/. Accessed July 2011. [cited at p. 52]

[3] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical Linear Algebra on Emerg- ing Architectures: The PLASMA and MAGMA Projects. J. Phys. Conf. Ser., 180(1):1–5, July 2009. [cited at p. 51]

[4] Y. Ajima, S. Sumimoto, and T. Shimizu. Tofu: A 6D Mesh/Torus Inter- connect for Exascale Computers. Computer, 42:36–40, November 2009. [cited at p. 15]

[5] J. Allen. Dependence analysis for subscripted variables and its application to program transformations. PhD thesis, Rice University, Houston, TX, USA, 1983. [cited at p. 36]

[6] G. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proc. AFIPS Spring Joint Computer Con- ference 1967, pages 483–485, New York, NY, USA, 1967. ACM. [cited at p. 18]

[7] J. Ansel, Y. Won, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe. Language and Compiler Support for Auto-Tuning Variable-Accuracy Al- gorithms. Technical Report MIT-CSAIL-TR-2010-032, Massachusetts In- stitute of Technology, Cambridge, MA, July 2010. [cited at p. 44]

[8] Applied Numerical Algorithms Group (ANAG), Lawrence Berkeley National Laboratory, Berkeley, CA. Chombo Website. http://seesar.lbl.gov/anag/chombo. Accessed July 2011. [cited at p. 87]

[9] K. Asanović, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: a View from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, December 2006. [cited at p. 4, 13, 45, 46, 50, 51]

[10] D. Bacon, S. Graham, and O. Sharp. Compiler Transformations for High- Performance Computing. ACM Comput. Surv., 26:345–420, December 1994. [cited at p. 33, 34]

[11] H. Bae, L. Bachega, C. Dave, S. Lee, S. Lee, S. Min, R. Eigenmann, and S. Midkiff. Cetus: A Source-to-Source Compiler Infrastructure for Multi- cores. In Proc. Int’l Workshop on Compilers for Parallel Computing (CPC 2009), 2009. [cited at p. 189]

[12] G. Ballard, J. Demmel, and I. Dumitriu. Minimizing Communica- tion for Eigenproblems and the Singular Value Decomposition. CoRR, abs/1011.3077:1–44, 2010. [cited at p. 31]

[13] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Communication- optimal Parallel and Sequential Cholesky Decomposition. SIAM J. Sci. Comp., 32(6):3495–3523, 2010. [cited at p. 31]

[14] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, USA, 1988. [cited at p. 36]

[15] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(6096):446–449, December 1986. [cited at p. 48]

[16] A. Basumallik, S. Min, and R. Eigenmann. Programming Distributed Memory Sytems Using OpenMP. In Proc. IEEE Int’l Parallel & Distributed Processing Symposium (IPDPS 2007), pages 1–8. IEEE Computer Society, 2007. [cited at p. 27]

[17] N. Bell and M. Garland. Efficient Sparse Matrix-Vector Multiplication on CUDA. Technical report, NVIDIA Corporation, Santa Clara, CA, USA, December 2008. [cited at p. 47]

[18] J. Berenger. A Perfectly Matched Layer for the Absorption of Electromag- netic Waves. J. Comput. Phys., 114:185–200, October 1994. [cited at p. 69]

[19] M. Berger and J. Oliger. Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations. J. of Comput. Phys., 53:484–512, 1984. [cited at p. 49]

[20] L. Beshenov. Maxima, a Computer Algebra System. http://maxima. sourceforge.net/. Accessed July 2011. [cited at p. 189]

[21] A. Bhatele, P. Jetley, H. Gahvari, L. Wesolowski, W. D. Gropp, and L. V. Kalé. Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes. In Proc. IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS 2011), May 2011. [cited at p. 9]

[22] C. Blum and A. Roli. Metaheuristics in combinatorial optimization: Overview and conceptual comparison. ACM Comput. Surv., 35:268–308, September 2003. [cited at p. 130, 137]

[23] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A Prac- tical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implemen- tation (PLDI 2008), 2008. [cited at p. 24, 87]

[24] U. Brüning. Exascale Computing - What is required for HW? Presentation, Int'l Supercomputing Conference (ISC 2011), 2011. [cited at p. 15]

[25] W. Burger and M. Burge. Digital image processing: an algorithmic introduc- tion using Java. Springer, 2008. [cited at p. 66]

[26] M. Burke and R. Cytron. Interprocedural dependence analysis and par- allelization. In Proc. SIGPLAN Symposium on Compiler Construction (SIG- PLAN 1986), pages 162–175. ACM, 1986. [cited at p. 36]

[27] H. Burkhart, M. Christen, M. Rietmann, M. Sathe, and O. Schenk. Run, Stencil, Run! A Comparison of Modern Parallel Programming Paradigms. In PARS-Mitteilungen 2011, 2011. [cited at p. 232]

[28] H. Burkhart, R. Frank, and G. Hächler. ALWAN: A Skeleton Programming Language. In P. Ciancarini and C. Hankin, editors, Coordination Languages and Models, volume 1061 of Lecture Notes in Computer Science, pages 407–410. Springer Berlin / Heidelberg, 1996. [cited at p. 28]

[29] S. Carr and Y. Guan. Unroll-and-jam using uniformly generated sets. In Proc. ACM/IEEE Int'l Symposium on Microarchitecture (MICRO 30), pages 349–357. IEEE Computer Society, 1997. [cited at p. 218]

[30] B. Catanzaro, S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox. SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization. In Proc. First Workshop on Programmable Models for Emerging Architecture (PMEA) at the ACM Int'l Conference on Parallel Architectures and Compilation Techniques (PACT 2009), 2009. [cited at p. 44]

[31] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun. Language Virtualization for Heteroge- neous Parallel Computing. In Proc. ACM Int’l Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA 2010), pages 835–847, 2010. [cited at p. 43, 44]

[32] B. Chamberlain, D. Callahan, and H. Zima. Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl., 21:291–312, Au- gust 2007. [cited at p. 30]

[33] B. Chamberlain, S. Choi, E. Lewis, L. Snyder, W. Weathersby, and C. Lin. The Case for High-Level Parallel Programming in ZPL. IEEE Comput. Sci. Eng., 5:76–86, July 1998. [cited at p. 30, 31]

[34] P. Charles, P. Cheng, C. Donawa, J. Dolby, P. Gallop, C. Grothoff, A. Kiel- stra, and F. Pizlo. Report on the Experimental Language X10. Technical report, IBM, 2006. [cited at p. 31]

[35] R. Chau. Integrated CMOS Tri-Gate Transistors. http://www.intel. com/technology/silicon/integrated_cmos.htm. Accessed August 2011. [cited at p. 11, 13]

[36] M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. Van Straalen, and S. Williams. Automatic Thread-Level Parallelization in the Chombo AMR Library. Technical report, Lawrence Berkeley National Laboratory, Berke- ley CA, USA / University of Basel, Switzerland, 2011. [cited at p. 87]

[37] M. Christen, O. Schenk, and H. Burkhart. Automatic Code Generation and Tuning for Stencil Kernels on Modern Microarchitectures. In Proc. Int'l Supercomputing Conference (ISC 2011), volume 26, pages 205–210, 2011. [cited at p. 61]

[38] M. Christen, O. Schenk, and H. Burkhart. PATUS: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures. In Proc. IEEE Int’l Parallel & Distributed Processing Symposium (IPDPS 2011), pages 1–12, 2011. [cited at p. 61]

[39] M. Christen, O. Schenk, and H. Burkhart. PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations. In Proc. Cetus Users and Compiler Infastructure Workshop, pages 1–5, 2011. [cited at p. 61]

[40] M. Christen, O. Schenk, P. Messmer, E. Neufeld, and H. Burkhart. Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures. In High performance and hardware aware computing: Proceedings of the First International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08), pages 47–54. Universitätsverlag Karlsruhe, 2008. [cited at p. 97, 99, 119]

[41] M. Christen, O. Schenk, E. Neufeld, P. Messmer, and H. Burkhart. Parallel Data-Locality Aware Stencil Computations on Modern Micro- Architectures. In Proc. IEEE Int’l Parallel & Distributed Processing Sympo- sium (IPDPS 2009), pages 1–10, May 2009. [cited at p. 97, 99, 119]

[42] M. Christen, O. Schenk, E. Neufeld, M. Paulides, and H. Burkhart. Many- core Stencil Computations in Hyperthermia Applications. In J. Dongarra, D. Bader, and J. Kurzak, editors, Scientific Computing with Multicore and Accelerators, pages 255–277. CRC Press, 2010. [cited at p. 97, 99]

[43] P. Colella. Defining Software Requirements for Scientific Computing. Pre- sentation, 2004. [cited at p. 45]

[44] P. Colella, D. Graves, T. Ligocki, D. Martin, D. Modiano, D. Serafini, and B. Van Straalen. Chombo Software Package for AMR Applications: Design Document. http://davis.lbl.gov/apdec/designdocuments/ chombodesign.pdf. Accessed July 2011. [cited at p. 52, 87]

[45] UPC Consortium. UPC Specifications, v1.2. Technical Report LBNL- 59208, Lawrence Berkeley National Laboratory, 2005. [cited at p. 29]

[46] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30:16–29, 2010. [cited at p. 151]

[47] J. Cooley and J. Tukey. An Algorithm for the Machine Calculation of Com- plex Fourier Series. Math. Comput., 19(90):297–301, April 1965. [cited at p. 47]

[48] Standard Performance Evaluation Corporation. Standard performance evaluation corporation. http://www.spec.org. Accessed July 2011. [cited at p. 45]

[49] Y. Cui, K. Olsen, T. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, D. Panda, A. Chourasia, J. Levesque, S. Day, and P. Maechling. Scalable Earthquake Simulation on Petascale Supercomputers. In Proc. ACM/IEEE Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pages 1–20, Washington, DC, USA, 2010. IEEE Com- puter Society. [cited at p. 180, 181, 182]

[50] L. Dalguer. Staggered-Grid Split-Node Method for Spontaneous Rupture Simulation. J. Geophys. Res., 112(B02302):1–15, 2007. [cited at p. 182]

[51] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Opti- mization and Performance Modeling of Stencil Computations on Modern Microprocessors. SIAM Review, 51(1):129–159, 2009. [cited at p. 97]

[52] K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Auto-tuning Stencil Computations on Diverse Multicore Archi- tectures. In Parallel and Distributed Computing. IN-TECH Publishers, 2009. [cited at p. 218]

[53] K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Auto-tuning Stencil Computations on Multicore and Accelerators. In J. Dongarra, D. Bader, and J. Kurzak, editors, Scientific Computing with Multicore and Accelerators, pages 219–253. CRC Press, 2010. [cited at p. 84, 109]

[54] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51:107–113, January 2008. [cited at p. 50]

[55] D. Delling, A. Goldberg, A. Nowatzyk, and R. Werneck. PHAST: Hardware-Accelerated Shortest Path Trees. In Proc. IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS 2011), pages 1–11, 2011. [cited at p. 56, 57]

[56] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication- Optimal Parallel and Sequential QR and LU Factorizations. Technical Re- port EECS-2008-89, Electrical Engineering and Computer Sciences, Uni- versity of California at Berkeley, 2008. [cited at p. 31]

[57] M. Dewhirst, Z. Vujaskovic, E. Jones, and D. Thrall. Re-setting the Biolog- ical Rationale for Thermal Therapy. Int. J. Hyperthermia, 21:779–790, 2005. [cited at p. 175]

[58] E. Dijkstra. A note on two problems in connexion with Graphs. Numer. Math., 1(1):269–271, December 1959. [cited at p. 55]

[59] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine. Single-Source Shortest Paths with the Parallel Boost Graph Library. In The Ninth DI- MACS Implementation Challenge: The Shortest Path Problem, Piscataway, NJ, November 2006. [cited at p. 56, 57]

[60] K. Fatahalian, D. Horn, T. Knight, L. Leem, M. Houston, J. Park, M. Erez, M. Ren, A. Aiken, W. Dally, and P. Hanrahan. Sequoia: Programming the Memory Hierarchy. In Proc. ACM/IEEE Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC 2006), pages 1–13. ACM, 2006. [cited at p. 32]

[61] L. Flynn. Intel Halts Development of 2 New Microprocessors. The New York Times, May, 8 2004. http://www.nytimes.com/2004/05/08/business/ 08chip.html. [cited at p. 11]

[62] Center for Discrete Mathematics & Theoretical Computer Science. 9th DI- MACS Implementation Challenge — Shortest Paths. http://www.dis. uniroma1.it/˜challenge9/index.shtml, 2006. Accessed August 2011. [cited at p. 56]

[63] International Technology Roadmap for Semiconductors. 2010 Ta- bles, Lithography. http://www.itrs.net/Links/2010ITRS/2010Update/ ToPost/2010Tables_LITHO_FOCUS_D_ITRS.xls, 2011. [cited at p. 11]

[64] Consortium for small-scale modelling. Core Documentation of the COSMO-Model. http://cosmo-model.cscs.ch/content/model/ documentation/core/default.htm. Accessed July 2011. [cited at p. 64, 158]

[65] M. Frigo and S. Johnson. The Design and Implementation of FFTW3. P. IEEE — Special issue on Program Generation, Optimization, and Platform Adaptation, 93(2):216–231, 2005. [cited at p. 52, 77, 128]

[66] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In Proc. ACM Int’l Conference on Supercomputing (ICS 2005), pages 361–366, 2005. [cited at p. 85, 101, 102]

[67] M. Frigo and V. Strumpen. The Cache Complexity of Multithreaded Cache Oblivious Algorithms. In Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2006), pages 271–280. ACM, 2006. [cited at p. 85, 102]

[68] J. Gablonsky. Modifications of the DIRECT Algorithm. PhD thesis, North Carolina State University, 2001. [cited at p. 137]

[69] M. Gardner. The Fantastic Combinations of John Conway’s New Solitaire Game “Life”. Scientific American, 223:120–123, October 1970. [cited at p. 67]

[70] K. Gatlin. Power Your App with the Programming Model and Com- piler Optimizations of Visual C++. http://msdn.microsoft.com/en-us/ magazine/cc163855.aspx#S4, 2005. [cited at p. 33]

[71] G. Goff, K. Kennedy, and C. Tseng. Practical Dependence Testing. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implemen- tation (PLDI 1991), pages 15–29. ACM, 1991. [cited at p. 36]

[72] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2003. [cited at p. 53, 56]

[73] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. J. Comput. Phys., 73:325–348, December 1987. [cited at p. 48]

[74] M. Griebl, C. Lengauer, and S. Wetzel. Code Generation in the Polytope Model. In Proc. ACM Int’l Conference on Parallel Architectures and Compila- tion Techniques (PACT 1998), pages 106–111. IEEE Computer Society, 1998. [cited at p. 39]

[75] L. Grigori, J. Demmel, and H. Xiang. Communication avoiding Gaus- sian elimination. In Proc. ACM/IEEE Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC 2008), pages 1–12, 2008. [cited at p. 31]

[76] M. Gschwind. Chip multiprocessing and the cell broadband engine. In Proc. ACM Conference on Computing Frontiers (CF 2006), pages 1–8. ACM, 2006. [cited at p. 13, 99]

[77] Richard Günther. FreePOOMA — Parallel Object Oriented Methods and Applications. http://www.nongnu.org/freepooma/. Accessed July 2011. [cited at p. 52]

[78] John L. Gustafson. Reevaluating Amdahl’s Law. Commun. ACM, 31:532– 533, May 1988. [cited at p. 22]

[79] G. Hächler and H. Burkhart. Implementing the ALWAN Communication and Data Distribution Library Using PVM. In A. Bode, J. Dongarra, T. Ludwig, and V. Sunderam, editors, Parallel Virtual Machine – EuroPVM '96, volume 1156 of Lecture Notes in Computer Science, pages 243–250. Springer Berlin / Heidelberg, 1996. [cited at p. 28]

[80] M. Hall, J. Chame, C. Chen, J. Shin, Gabe Rudy, and Malik Khan. Loop Transformation Recipes for Code Generation and Auto-Tuning. In Guang Gao, Lori Pollock, John Cavazos, and Xiaoming Li, editors, Languages and Compilers for Parallel Computing, volume 5898 of Lecture Notes in Computer Science, pages 50–64. Springer Berlin / Heidelberg, 2010. [cited at p. 87, 127, 129]

[81] Mark D. Hill and Michael R. Marty. Amdahl’s Law in the Multicore Era. Computer, 41:33–38, July 2008. [cited at p. 20, 21]

[82] R. Hooke and T. A. Jeeves. Direct Search Solution of Numerical and Sta- tistical Problems. J. ACM, 8(2):212–229, April 1961. [cited at p. 133]

[83] IBM. X10: Performance and Productivity at Scale. http://www.x10-lang.org/. Accessed July 2011. [cited at p. 30]

[84] Intel. High-κ and Metal Gate Research. http://www.intel.com/ technology/silicon/high-k.htm. Accessed August 2011. [cited at p. 11]

[85] Intel. Intel® Advanced Vector Extensions Programming Reference. http: //software.intel.com/file/35247/. Accessed June 2011. [cited at p. 41]

[86] Intel. Interprocedural Optimization (IPO) Overview. http: //software.intel.com/sites/products/documentation/studio/ composer/en-us/2009/compiler_c/optaps/common/optaps_ipo_mult.htm. Accessed August 2011. [cited at p. 33]

[87] Intel. Moore’s law — 40th anniversary. http://www.intel.com/ pressroom/kits/events/moores_law_40th/, 2005. Accessed August 2011. [cited at p. 12]

[88] Intel. Petascale to Exascale — Extending Intel’s Commitment. http://download.intel.com/pressroom/archive/reference/ISC_2010_ Skaugen_keynote.pdf, 2010. Presentation, Accessed August 2011. [cited at p. 11]

[89] G. Jin, J. Mellor-Crummey, and R. Fowler. Increasing Temporal Locality with Skewing and Recursive Blocking. In Proc. ACM/IEEE Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC 2001), pages 43–43. ACM, 2001. [cited at p. 94]

[90] D. Jones, C. Perttunen, and B. Stuckman. Lipschitzian Optimization With- out the Lipschitz Constant. J. Optim. Theory Appl., 79:157–181, October 1993. [cited at p. 137]

[91] A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, and E. Strohmaier. TORCH Computational Reference Kernels: A Testbed for Computer Science Research. Technical Report UCB/EECS-2010-144, EECS Department, University of California, Berkeley, December 2010. [cited at p. 45, 237]

[92] S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An Auto-tuning Framework For Parallel Multicore Stencil Computations. In Proc. IEEE Int’l Parallel & Distributed Processing Symposium (IPDPS 2010), pages 1–12, April 2010. [cited at p. 52, 84, 85]

[93] K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architec- tures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. [cited at p. 34, 38]

[94] D. Keyes. Exaflop/s: The Why and the How. C. R. Mec., 339:70–77, 2011. [cited at p. 9]

[95] Khronos OpenCL Working Group. The OpenCL Specification, December, 8 2008. [cited at p. 28, 122]

[96] K. Kusano, M. Sato, T. Hosomi, and Y. Seo. The Omni OpenMP Com- piler on the Distributed Shared Memory of Cenju-4. In R. Eigenmann and M. Voss, editors, OpenMP Shared Memory Parallel Programming, volume 2104 of Lecture Notes in Computer Science, pages 20–30. Springer Berlin / Heidelberg, 2001. [cited at p. 27]

[97] S. Lee and R. Eigenmann. OpenMPC: Extended OpenMP Programming and Tuning for GPUs. In Proc. ACM/IEEE Int’l Conference for High Perfor- mance Computing, Networking, Storage and Analysis (SC 2010), pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society. [cited at p. 27]

[98] R. Lewis, V. Torczon, and M. Trosset. Direct search methods: then and now. J. Comput. Appl. Math., 124(1-2):191–207, 2000. [cited at p. 131]

[99] Z. Li and Y. Song. Automatic Tiling of Iterative Stencil Loops. ACM Trans. Program. Lang. Syst., 26(6):975–1028, November 2004. [cited at p. 52, 87]

[100] Z. Li and P. Yew. Some results on exact data dependence analysis. In Proc.: Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing, pages 374–401, London, UK, 1990. Pitman Publishing. [cited at p. 36]

[101] Z. Li, P. Yew, and C. Zhu. Data dependence analysis on multi-dimensional array references. In Proc. ACM Int’l Conference on Supercomputing (ICS 1989), pages 215–224. ACM, 1989. [cited at p. 36]

[102] Y. Lin, C. Dimitrakopoulos, K. Jenkins, D. Farmer, H. Chiu, A. Grill, and P. Avouris. 100-GHz Transistors from Wafer-Scale Epitaxial Graphene. Sci- ence, 327(5966):662, 2010. [cited at p. 11]

[103] C. Mack. Seeing Double. IEEE Spectrum, 45:47–51, November 2008. [cited at p. 10]

[104] K. Madduri, D. Bader, J. Berry, and J. Crobak. Parallel Shortest Path Al- gorithms for Solving Large-Scale Instances. In The Ninth DIMACS Imple- mentation Challenge: The Shortest Path Problem, Piscataway, NJ, November 2006. [cited at p. 56, 57]

[105] D. Maydan, J. Hennessy, and M. Lam. Efficient and exact data dependence analysis. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 1991), pages 1–14. ACM, 1991. [cited at p. 36]

[106] J. McCalpin. STREAM: Sustainable Memory Bandwidth in High- Performance Computers. http://www.cs.virginia.edu/stream/. Ac- cessed July 2011. [cited at p. 73, 149, 151, 153]

[107] J. McCalpin and D. Wonnacott. Time skewing: A value-based approach to optimizing for memory locality. Technical Report DCS-TR-379, Depart- ment of Computer Science, Rutgers University, 1998. [cited at p. 94]

[108] P. McKenney and M. Michael. Is Parallel Programming Hard, And If So, Why? http://www.cs.pdx.edu/pdfs/tr0902.pdf. Accessed July 2011. [cited at p. 42]

[109] K. Meffert. JGAP — Java Genetic Algorithm Package. http://jgap. sourceforge.net/. Accessed July 2011. [cited at p. 138]

[110] J. Meng and K. Skadron. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations. Int. J. Parallel Prog., 39:115–142, February 2011. 10.1007/s10766-010-0142-5. [cited at p. 86, 99]

[111] Message Passing Interface Forum. The Message Passing Interface (MPI) standard. http://www.mcs.anl.gov/research/projects/mpi/. Accessed July 2011. [cited at p. 26]

[112] D. Miles, B. Leback, and D. Norton. Optimizing Application Performance on x64 Processor-based Systems with PGI Compilers and Tools. Technical report, The Portland Group, 2008. [cited at p. 33]

[113] F. Miller, A. Vandome, and J. McBrewster. AltiVec. Alpha Press, 2010. [cited at p. 41]

[114] M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick. Minimizing Communication in Sparse Matrix Solvers. In Proc. ACM/IEEE Int’l Confer- ence for High Performance Computing, Networking, Storage and Analysis (SC 2009), pages 1–12, 2009. [cited at p. 31]

[115] G. Moore. Cramming more components onto integrated circuits. Electron- ics, 38(8):114–117, April 1965. [cited at p. 10]

[116] H. Mössenböck, M. Löberbauer, and A. Wöß. The Compiler Generator Coco/R. http://www.ssw.uni-linz.ac.at/coco. Accessed July 2011. [cited at p. 189]

[117] M. Müller-Hannemann and S. Schirra, editors. Algorithm Engineering: Bridging the Gap between Algorithm Theory and Practice. Springer-Verlag, Berlin, Heidelberg, 2010. [cited at p. 54]

[118] J. A. Nelder and R. Mead. A simplex method for function minimization. Comput. J., 7:308–313, 1965. [cited at p. 131, 136]

[119] E. Neufeld. High Resolution Hyperthermia Treatment Planning. PhD thesis, ETH Zurich, August 2008. [cited at p. 177]

[120] E. Neufeld, N. Chavannes, T. Samaras, and N. Kuster. Novel conformal technique to reduce staircasing artifacts at material boundaries for FDTD modeling of the bioheat equation. Phys. Med. Biol., 52(15):4371, 2007. [cited at p. 98, 177]

[121] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. In Proc. ACM/IEEE Int’l Conference for High Performance Computing, Network- ing, Storage and Analysis (SC 2010), pages 1–13, Washington, DC, USA, 2010. IEEE Computer Society. [cited at p. 97, 119]

[122] R. Numrich and J. Reid. Co-array Fortran for Parallel Programming. SIG- PLAN Fortran Forum, 17:1–31, August 1998. [cited at p. 29]

[123] NVIDIA. NVIDIA CUDA™ — NVIDIA CUDA C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/ C/doc/CUDA_C_Programming_Guide.pdf. Accessed July 2011. [cited at p. 27, 155]

[124] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architec- ture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/ NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Accessed July 2011. [cited at p. 154]

[125] M. Odersky. Scala. http://www.scala-lang.org. Accessed July 2011. [cited at p. 44]

[126] K. Olukotun, P. Hanrahan, Hassan Chafi, N. Bronson, A. Sujeeth, and D. Makarov. Delite. http://ppl.stanford.edu/wiki/index.php/Delite. Accessed July 2011. [cited at p. 44]

[127] OpenMP Architecture Review Board. The OpenMP® API Specification for Parallel Programming. http://www.openmp.org. Accessed July 2011. [cited at p. 26]

[128] H. Pan, B. Hindman, and K. Asanović. Lithe: Enabling Efficient Composition of Parallel Libraries. In Proc. USENIX Workshop on Hot Topics in Parallelism (HotPar'09), pages 1–11. USENIX Association, 2009. [cited at p. 235]

[129] Z. Pan and R. Eigenmann. PEAK – A Fast and Effective Performance Tun- ing System via Compiler Optimization Orchestration. ACM Trans. Pro- gram. Lang. Syst., 30:1–43, May 2008. [cited at p. 33, 128, 132]

[130] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization. Dover Pub- lications, Inc., 1998. [cited at p. 130]

[131] D. Patterson and J. Hennessy. Computer Organization and Design: The Hardware/Software Interface, chapter 7.1: Roofline: A Simple Performance Model, pages 667 – 675. Morgan Kaufmann, 4 edition, November 2008. [cited at p. 71]

[132] M. Paulides, J. Bakker, E. Neufeld, J. van der Zee, P. Jansen, P. Levendag, and G. van Rhoon. The HYPERcollar: A novel applicator for hyperther- mia in the head and neck. Int. J. Hyperther., 23:567 – 576, 2007. [cited at p. 176]

[133] M. Paulides, J. Bakker, A. Zwamborn, and G. van Rhoon. A head and neck hyperthermia applicator: Theoretical antenna array design. International Journal of Hyperthermia, 23(1):59–67, 2007. [cited at p. 176]

[134] H. H. Pennes. Analysis of Tissue and Arterial Blood Temperatures in the Resting Human Forearm. J. Appl. Physiol., 1(2):93–122, 1948. [cited at p. 177]

[135] D. Playner and K. Hawick. Auto-Generation of Parallel Finite- Differencing Code for MPI, TBB and CUDA. In Proc. Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2011), pages 1–8, 2011. [cited at p. 86]

[136] M. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J., 7(2):155–162, January 1964. [cited at p. 135]

[137] C. Price, B. Robertson, and M. Reale. A hybrid Hooke and Jeeves — Di- rect Method for Non-Smooth Optimization. Adv. Model. Optim., 11, 2009. [cited at p. 133]

[138] W. Pugh. The Omega Test: a Fast and Practical Integer Programming Algorithm for Dependence Analysis. In Proc. ACM/IEEE Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC 1991), pages 4–13. ACM, 1991. [cited at p. 36]

[139] M. Püschel, J. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. P. IEEE — Special issue on Program Generation, Optimization, and Platform Adaptation, 93(2):232–275, 2005. [cited at p. 52, 128]

[140] M. Ren, J. Y. Park, M. Houston, A. Aiken, and W. Dally. A Tuning Frame- work for Software-Managed Memory Hierarchies. In Proc. ACM Int’l Conference on Parallel Architectures and Compilation Techniques (PACT 2008), pages 280–291, 2008. [cited at p. 32]

[141] L. Renganarayanan, D. Kim, S. Rajopadhye, and M. Strout. Parameterized Tiled Loops for Free. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2007), pages 405–414, New York, NY, USA, 2007. ACM. [cited at p. 87]

[142] G. Rivera and C. Tseng. Tiling optimizations for 3D scientific computations. In Proc. ACM/IEEE Conference on Supercomputing (SC 2000), 2000. [cited at p. 90]

[143] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Pub. Co., 1996. [cited at p. 69]

[144] P. Sarkar. A brief history of cellular automata. ACM Comput. Surv., 32:80–107, March 2000. [cited at p. 66]

[145] V. Sarkar. Language and Virtual Machine Challenges for Large-scale Parallel Systems. Presentation, http://www.research.ibm.com/vee04/Sarkar.pdf. Accessed July 2011. [cited at p. 30]

[146] N. Savage. First Graphene Integrated Circuit. IEEE Spectrum, 48, June 2011. [cited at p. 11]

[147] Supercomputing Conference Series. Gordon Bell Prize Winners. http://www.sc2000.org/bell/pastawrd.htm. Accessed July 2011. [cited at p. 236]

[148] A. Solar-Lezama, G. Arnold, L. Tancau, R. Bodik, V. Saraswat, and S. Seshia. Sketching stencils. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2007), pages 167–178. ACM, 2007. [cited at p. 234]

[149] Y. Song and Z. Li. New Tiling Techniques to Improve Cache Temporal Locality. In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 1999), 1999. [cited at p. 87]

[150] G. Sreenivasa, J. Gellermann, B. Rau, J. Nadobny, P. Schlag, P. Deuflhard, R. Felix, and P. Wust. Clinical use of the hyperthermia treatment planning system HyperPlan to predict effectiveness and toxicity. Int. J. Radiat. Oncol. Biol. Phys., 55:407–419, February 2003. [cited at p. 177]

[151] J. Strachan, D. Strukov, J. Borghetti, J. Yang, G. Medeiros-Ribeiro, and R. Williams. The Switching Location of a Bipolar Memristor: Chemical, Thermal and Structural Mapping. Nanotechnology, 22(25):254015, 2011. [cited at p. 11]

[152] R. Strzodka, M. Shaheen, and D. Pajak. Time skewing made simple. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2011), pages 295–296, New York, NY, USA, 2011. ACM. [cited at p. 86, 94]

[153] R. Strzodka, M. Shaheen, D. Pajak, and H. Seidel. Cache oblivious parallelograms in iterative stencil computations. In Proc. ACM Int'l Conference on Supercomputing (ICS 2010), pages 49–59, New York, NY, USA, 2010. ACM. [cited at p. 86, 102, 103]

[154] M. Stürmer. A framework that supports in writing performance-optimized stencil-based codes. Technical report, Universität Erlangen, Institut für Informatik, 2010. [cited at p. 52, 86, 105]

[155] Y. Tang, R. Chowdhury, B. Kuszmaul, C. Luk, and C. Leiserson. The Pochoir Stencil Compiler. In Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2011), pages 117–128. ACM, 2011. [cited at p. 52, 85, 102]

[156] TCPP. NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing Core Topics for Undergraduates. http://www.cs.gsu.edu/~tcpp/curriculum/index.php. Accessed August 2011. [cited at p. 237]

[157] The Cetus Team. Cetus – A Source-to-Source Compiler Infrastructure for C Programs. http://cetus.ecn.purdue.edu/. Accessed July 2011. [cited at p. 24, 189]

[158] S. Teixeira. Visual Studio 2010 Brings Parallelism Mainstream. Dr. Dobb’s, April 19 2010. [cited at p. 237]

[159] A. Tiwari and J. Hollingsworth. Online Adaptive Code Generation and Tuning. In Proc. IEEE Int’l Parallel & Distributed Processing Symposium (IPDPS 2011), May 2011. [cited at p. 129, 130, 136]

[160] S. Tomov, J. Dongarra, and M. Baboulin. Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems. Parallel Computing, 36:232–240, June 2010. [cited at p. 51]

[161] TOP500.org. TOP500 Supercomputer Sites. http://www.top500.org. Accessed July 2011. [cited at p. 15, 23, 40, 154]

[162] J. Treibig, G. Wellein, and G. Hager. Efficient Multicore-Aware Parallelization Strategies for Iterative Stencil Computations. Journal of Computational Science, 2(2):130–137, 2011. Simulation Software for Supercomputers. [cited at p. 99]

[163] D. Unat, X. Cai, and S. Baden. Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C. In Proc. ACM Int'l Conference on Supercomputing (ICS 2011), pages 214–224, New York, NY, USA, 2011. ACM. [cited at p. 52, 86]

[164] J. van der Zee. Heating the Patient: a Promising Approach? Ann. Oncol., 13(8):1173–1184, 2002. [cited at p. 175]

[165] A. van Deursen, P. Klint, and J. Visser. Domain-Specific Languages: An Annotated Bibliography. SIGPLAN Notices, 35(6):26–36, 2000. [cited at p. 43]

[166] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A Library of Automatically Tuned Sparse Matrix Kernels. J. Phys. Conf. Ser., 16(1):521, 2005. [cited at p. 51, 128]

[167] J. Wawrzynek, D. Patterson, M. Oskin, S. Lu, C. Kozyrakis, J. Hoe, D. Chiou, and K. Asanović. RAMP: Research Accelerator for Multiple Processors. IEEE Micro, 27:46–57, March 2007. [cited at p. 236]

[168] M. Wehner, L. Oliker, and J. Shalf. Towards Ultra-High Resolution Models of Climate and Weather. Int. J. High Perform. C., 22(2):149–165, 2008. [cited at p. 9, 236]

[169] S. Weinbaum and L. M. Jiji. A New Simplified Bioheat Equation for the Effect of Blood Flow on Local Average Tissue Temperature. J. Biomech. Eng.-T. ASME, 107(2):131–139, 1985. [cited at p. 178]

[170] G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske. Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. In Proc. IEEE Int'l Computer Software and Applications Conference (COMPSAC 2009), pages 579–586, 2009. [cited at p. 86, 99]

[171] C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. ACM/IEEE Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC 1998), pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society. [cited at p. 51]

[172] R. C. Whaley and A. Petitet. Minimizing Development and Maintenance Costs in Supporting Persistently Optimized BLAS. Softw. Pract. Exper., 35:101–121, February 2005. [cited at p. 51, 128]

[173] GCC Wiki. Interprocedural optimizations. http://gcc.gnu.org/wiki/Interprocedural_optimizations, 2008. Accessed August 2011. [cited at p. 33]

[174] Wikipedia. Microprocessor chronology. http://en.wikipedia.org/wiki/Microprocessor_chronology, August 17, 2011. Accessed August 2011. [cited at p. 12]

[175] Wikipedia. Transistor count. http://en.wikipedia.org/wiki/Transistor_count, June 19, 2011. Accessed August 2011. [cited at p. 12]

[176] S. Williams. Auto-tuning Performance on Multicore Computers. PhD thesis, EECS Department, University of California, Berkeley, December 2008. [cited at p. 51, 63, 74, 84, 109, 129]

[177] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. Scientific computing kernels on the Cell processor. Int. J. Parallel Program., 35(3):263–298, 2007. [cited at p. 97]

[178] D. Wohlfeld, F. Lemke, S. Schenk, H. Froening, and U. Bruening. High Density Active Optical Cable – From a New Concept to a Prototype. In Proc. SPIE Conference on Optoelectronic Interconnects and Component Integration, volume 7944, pages 1–7, 2011. [cited at p. 15]

[179] M. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, Stanford, CA, USA, 1992. [cited at p. 39, 40]

[180] M. Wolf and M. Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IEEE Trans. Parallel Distrib. Syst., 2:452–471, October 1991. [cited at p. 39, 40]

[181] M. Wolf and M. Lam. A data locality optimizing algorithm. SIGPLAN Not., 26:30–44, May 1991. [cited at p. 90]

[182] M. Wolfe. Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, MA, USA, 1990. [cited at p. 36]

[183] L. Wolsey. Integer Programming. John Wiley & Sons, Inc., 1998. [cited at p. 130]

[184] D. Wonnacott. Time skewing for parallel computers. In Proc. Int'l Workshop on Compilers for Parallel Computing (CPC 1999), pages 477–480. Springer-Verlag, 1999. [cited at p. 94]

[185] D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In Proc. IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, 2000. [cited at p. 94]

[186] P. Wust, B. Hildebrandt, G. Sreenivasa, B. Rau, J. Gellermann, H. Riess, R. Felix, and P. Schlag. Hyperthermia in Combined Treatment of Cancer. Lancet Oncol., 3:487–497, 2002. [cited at p. 175]

[187] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A High-Performance Java Dialect. Concurrency–Pract. Ex., 10(11-13):825–836, 1999. [cited at p. 29]

[188] A. Yzelman and R. Bisseling. Cache-Oblivious Sparse Matrix–Vector Multiplication by Using Sparse Matrix Partitioning Methods. SIAM J. Scientific Computing, 31(4):3128–3154, 2009. [cited at p. 47]

Appendices

Appendix A

Patus Usage

A.1 Code Generation

The PATUS code generator is invoked by the following command:

java -jar patus.jar codegen --stencil=<Stencil File>
    --strategy=<Strategy File>
    --architecture=<Architecture Description File>,<Hardware Name>
    [--outdir=<Output Directory>] [--generate=<Target>]
    [--kernel-file=<Kernel Output File Name>]
    [--compatibility={C | Fortran}]
    [--unroll=<UnrollFactor1>,<UnrollFactor2>,...]
    [--use-native-simd-datatypes={yes | no}]
    [--create-validation={yes | no}]
    [--validation-tolerance=<Tolerance>]
    [--debug=<DebugOption1>[,<DebugOption2>[,...,[<DebugOptionN>]...]]]

--stencil=<Stencil File>

Specifies the stencil specification file for which to generate code.

--strategy=<Strategy File>

The Strategy file describing the parallelization/optimization strategy.

--architecture=<Architecture Description File>,<Hardware Name>

The architecture description file and the name of the selected architecture (as specified in the name attribute of the architectureType element).

--outdir=<Output Directory>

The output directory into which the generated files will be written. Optional; if not specified the generated files will be created in the current directory.

--generate=<Target>

The target that will be generated. <Target> can be one of:

• benchmark: This will generate a full benchmark harness. This is the default setting.
• kernel: This will only generate the kernel file.

--kernel-file=<Kernel Output File Name>

Specifies the name of the C source file to which the generated kernel is written. The file suffix is appended or replaced according to the definition in the hardware architecture description. Defaults to kernel.c.

--compatibility={C | Fortran}

Selects whether the generated code has to be compatible with C or Fortran. Fortran compatibility omits the double pointer output type in the kernel declaration; therefore, multiple time steps are not supported. Defaults to C.

--unroll=<UnrollFactor1>,<UnrollFactor2>,...

A list of unrolling factors applied to the innermost loop nest containing the stencil computation. The unrolling factors are applied to all dimensions.

--use-native-simd-datatypes={yes | no}

Specifies whether the native SSE data type is to be used in the kernel signature. If set to yes, this also requires that the fields are padded correctly in unit stride direction. Defaults to no.

--create-validation={yes | no}

Specifies whether to create code that will validate the result. If <Target> is not benchmark, this option will be ignored. Defaults to yes.

--validation-tolerance=<Tolerance>

Sets the tolerance for the relative error in the validation. This option is only relevant if validation code is generated (--create-validation=yes). Defaults to yes.

--debug=<DebugOption1>[,<DebugOption2>[,...,[<DebugOptionN>]...]]

Specifies debug options (as a comma-separated list) that will influence the code generator. Valid debug options (for <DebugOptionI>, I = 1, ..., N) are:

• print-stencil-indices: This will insert a printf statement for every stencil calculation with the index into the grid array at which the result is written.
• print-validation-errors: Prints all values if the validation fails. The option is ignored if no validation code is generated.
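To conclude this section, a typical code generator invocation is shown below. The file names laplacian.stc, cacheblocked.stg, and architectures.xml, as well as the hardware name "Intel x86_64 SSE", are placeholders chosen for this illustration and are not names prescribed by PATUS:

java -jar patus.jar codegen --stencil=laplacian.stc --strategy=cacheblocked.stg \
    --architecture=architectures.xml,"Intel x86_64 SSE" \
    --outdir=generated --generate=benchmark --unroll=1,2,4 --create-validation=yes

This call generates a benchmark harness for the stencil defined in laplacian.stc, parallelized and optimized according to the given Strategy and specialized for the selected hardware description, and writes the resulting files to the directory generated.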

A.2 Auto-Tuning

The PATUS auto-tuner is invoked on the command line like so:

java -jar patus.jar autotune <Executable Filename> <Param1> <Param2> ... <ParamN>
    [<Constraint1> <Constraint2> ... <ConstraintM>] [-m <Method>]

<Executable Filename> is the path to the benchmark executable. The benchmark executable must expose the tunable parameters as command line parameters. The PATUS auto-tuner only generates numerical parameter values; if the benchmark executable requires strings, these must be mapped from numerical values internally in the benchmarking program. The parameters <ParamI>, I = 1, ..., N, define integer parameter ranges and have the following syntax:

<StartValue>:[[*]<Step>:]<EndValue>[!]

or

<Value1>[,<Value2>[,<Value3>...]][!]

The first version, when no asterisk * is specified in front of the <Step>, enumerates all values in

{ a = <StartValue> + k · <Step> : k ∈ ℕ₀ and a ≤ <EndValue> }.

If no <Step> is given, it defaults to 1. If there is an asterisk * in front of the <Step>, the values enumerated are

{ a = <StartValue> · <Step>^k : k ∈ ℕ₀ and a ≤ <EndValue> }.

In the second version, all the comma-separated values <Value1>, <Value2>, ... are enumerated. If the optional ! is appended to the value range specification, each of the specified values is guaranteed to be used, i.e., an exhaustive search is performed for the corresponding parameter.

Example A.1: Specifying parameter ranges.

1:10 enumerates all the integer numbers between and including 1 and 10.
2:2:41 enumerates the integers 2, 4, 6, ..., 38, 40.
1:*2:128 enumerates some powers of 2, namely 1, 2, 4, 8, 16, 32, 64, 128.

Optional constraints can be specified to restrict the parameter values. The syntax for the constraints is

C<ComparisonExpression>

where <ComparisonExpression> is an expression which can contain the arithmetic operators +, -, *, and /, as well as the comparison operators <, <=, ==, >=, >, and !=, and the variables $1, ..., $N, which correspond to the parameters <Param1>, ..., <ParamN>.

Example A.2: Constraint examples.

C$2<=$1 forces the second parameter to be less than or equal to the first.
C($2+$1-1)/$1>=24 forces ($2+$1-1)/$1 = ⌈$2/$1⌉ ≥ 24.

The PATUS auto-tuner supports a range of search methods. The method can be selected by -m <Method>, where <Method> is one of

• ch.unibas.cs.hpwc.patus.autotuner.DiRectOptimizer DIRECT method

• ch.unibas.cs.hpwc.patus.autotuner.ExhaustiveSearchOptimizer exhaustive search

• ch.unibas.cs.hpwc.patus.autotuner.GeneralCombinedEliminationOptimizer general combined elimination

• ch.unibas.cs.hpwc.patus.autotuner.GreedyOptimizer greedy search

• ch.unibas.cs.hpwc.patus.autotuner.HookeJeevesOptimizer Hooke-Jeeves algorithm

• ch.unibas.cs.hpwc.patus.autotuner.MetaHeuristicOptimizer genetic algorithm

• ch.unibas.cs.hpwc.patus.autotuner.RandomSearchOptimizer draws 500 random samples

• ch.unibas.cs.hpwc.patus.autotuner.SimplexSearchOptimizer simplex search (aka Nelder-Mead method)

These are Java class paths to IOptimizer implementations; this makes it easy to extend the range of methods. Please refer to Chapter 8 for a discussion and comparison of the methods. If no method is specified, the greedy algorithm is used by default.
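Putting the pieces together, a complete auto-tuner invocation could look as follows; the executable name ./bench-laplacian, the parameter ranges, and the constraint are made-up values used for illustration only:

java -jar patus.jar autotune ./bench-laplacian 1:*2:128 8:8:64 1,2,4! 'C$2<=$1' \
    -m ch.unibas.cs.hpwc.patus.autotuner.HookeJeevesOptimizer

Here the first parameter is swept over the powers of 2 up to 128, the second over the multiples of 8 up to 64, and the third is searched exhaustively over the values 1, 2, and 4; the constraint keeps the second parameter from exceeding the first, and the Hooke-Jeeves algorithm is used as the search method. (The constraint is quoted so that the shell does not interpret the $ and < characters.)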

Appendix B

Patus Grammars

B.1 Stencil DSL Grammar

In the following, the EBNF grammar for the PATUS stencil specification syntax is given. The grayed-out identifiers have not yet been specified or implemented and will be added in the future.

⟨Stencil⟩ ::= ‘stencil’ ⟨Identifier⟩ ‘{’ [ ⟨Options⟩ ] ⟨DomainSize⟩ ⟨NumIterations⟩ ⟨Operation⟩ ⟨Boundary⟩ [ ⟨Filter⟩ ] [ ⟨StoppingCriterion⟩ ] ‘}’
⟨DomainSize⟩ ::= ‘domainsize’ ‘=’ ⟨Box⟩ ‘;’
⟨NumIterations⟩ ::= ‘t_max’ ‘=’ ⟨IntegerExpr⟩ ‘;’
⟨Operation⟩ ::= ‘operation’ ⟨Identifier⟩ ‘(’ ⟨ParamList⟩ ‘)’ ‘{’ { ⟨Statement⟩ } ‘}’
⟨ParamList⟩ ::= { ⟨GridDecl⟩ | ⟨ParamDecl⟩ }
⟨Statement⟩ ::= ⟨LHS⟩ ‘=’ ⟨StencilExpr⟩ ‘;’
⟨LHS⟩ ::= ⟨StencilNode⟩ | ⟨VarDecl⟩
⟨StencilExpr⟩ ::= ⟨StencilNode⟩ | ⟨Identifier⟩ | ⟨NumberLiteral⟩ | ⟨FunctionCall⟩ | ( ⟨UnaryOperator⟩ ⟨StencilExpr⟩ ) | ( ⟨StencilExpr⟩ ⟨BinaryOperator⟩ ⟨StencilExpr⟩ ) | ‘(’ ⟨StencilExpr⟩ ‘)’
⟨StencilNode⟩ ::= ⟨Identifier⟩ ‘[’ ⟨SpatialCoords⟩ [ ‘;’ ⟨TemporalCoord⟩ ] [ ‘;’ ⟨ArrayIndices⟩ ] ‘]’
⟨SpatialCoords⟩ ::= ( ‘x’ | ‘y’ | ‘z’ | ‘u’ | ‘v’ | ‘w’ | ‘x’ ⟨IntegerLiteral⟩ ) [ ⟨Offset⟩ ]
⟨TemporalCoord⟩ ::= ‘t’ [ ⟨Offset⟩ ]
⟨ArrayIndices⟩ ::= ⟨IntegerLiteral⟩ { ‘,’ ⟨IntegerLiteral⟩ }
⟨Offset⟩ ::= ⟨UnaryOperator⟩ ⟨IntegerLiteral⟩
⟨FunctionCall⟩ ::= ⟨Identifier⟩ ‘(’ [ ⟨StencilExpr⟩ { ‘,’ ⟨StencilExpr⟩ } ] ‘)’
⟨IntegerExpr⟩ ::= ⟨Identifier⟩ | ⟨IntegerLiteral⟩ | ⟨FunctionCall⟩ | ( ⟨UnaryOperator⟩ ⟨IntegerExpr⟩ ) | ( ⟨IntegerExpr⟩ ⟨BinaryOperator⟩ ⟨IntegerExpr⟩ ) | ‘(’ ⟨IntegerExpr⟩ ‘)’
⟨VarDecl⟩ ::= ⟨Type⟩ ⟨Identifier⟩
⟨Box⟩ ::= ‘(’ ⟨Range⟩ { ‘,’ ⟨Range⟩ } ‘)’
⟨Range⟩ ::= ⟨IntegerExpr⟩ ‘..’ ⟨IntegerExpr⟩
⟨GridDecl⟩ ::= [ ⟨Specifier⟩ ] ⟨Type⟩ ‘grid’ ⟨Identifier⟩ [ ‘(’ ⟨Box⟩ ‘)’ ] [ ⟨ArrayDecl⟩ ]
⟨ParamDecl⟩ ::= ⟨Type⟩ ‘param’ ⟨Identifier⟩ [ ⟨ArrayDecl⟩ ]
⟨ArrayDecl⟩ ::= ‘[’ ⟨IntegerLiteral⟩ { ‘,’ ⟨IntegerLiteral⟩ } ‘]’
⟨Specifier⟩ ::= ‘const’
⟨Type⟩ ::= ‘float’ | ‘double’
⟨UnaryOperator⟩ ::= ‘+’ | ‘-’
⟨BinaryOperator⟩ ::= ‘+’ | ‘-’ | ‘*’ | ‘/’ | ‘^’
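As a reading aid for the grammar, the following minimal two-dimensional stencil specification conforms to the productions above; the stencil itself (an arbitrary weighted neighbor sum) is chosen purely for illustration, and complete, real examples are collected in Appendix C:

stencil example
{
    domainsize = (1 .. N, 1 .. N);
    t_max = 1;

    operation (float grid u, float param a)
    {
        u[x, y; t+1] = a * (
            u[x-1, y; t] + u[x+1, y; t] +
            u[x, y-1; t] + u[x, y+1; t]);
    }
}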

B.2 Strategy DSL Grammar

The following EBNF grammar specifies the PATUS Strategy syntax. Again, as the project matures, the specification might change so that aspects of parallelization and optimization methods which are still missing can be expressed as PATUS Strategies.

⟨Strategy⟩ ::= ‘strategy’ ⟨Identifier⟩ ‘(’ ⟨ParamList⟩ ‘)’ ⟨CompoundStatement⟩
⟨ParamList⟩ ::= ⟨SubdomainParam⟩ { ‘,’ ⟨AutoTunerParam⟩ }
⟨SubdomainParam⟩ ::= ‘domain’ ⟨Identifier⟩
⟨AutoTunerParam⟩ ::= ‘auto’ ⟨AutoTunerDeclSpec⟩ ⟨Identifier⟩
⟨AutoTunerDeclSpec⟩ ::= ‘int’ | ‘dim’ | ( ‘codim’ ‘(’ ⟨IntegerLiteral⟩ ‘)’ )
⟨Statement⟩ ::= ⟨DeclarationStatement⟩ | ⟨AssignmentStatement⟩ | ⟨CompoundStatement⟩ | ⟨IfStatement⟩ | ⟨Loop⟩
⟨DeclarationStatement⟩ ::= ⟨DeclSpec⟩ ⟨Identifier⟩ ‘;’
⟨AssignmentStatement⟩ ::= ⟨LValue⟩ ‘=’ ⟨Expr⟩ ‘;’
⟨CompoundStatement⟩ ::= ‘{’ { ⟨Statement⟩ } ‘}’
⟨IfStatement⟩ ::= ‘if’ ‘(’ ⟨ConditionExpr⟩ ‘)’ ⟨Statement⟩ [ ‘else’ ⟨Statement⟩ ]
⟨Loop⟩ ::= ( ⟨RangeIterator⟩ | ⟨SubdomainIterator⟩ ) [ ‘parallel’ [ ⟨IntegerLiteral⟩ ] [ ‘schedule’ ⟨IntegerLiteral⟩ ] ] ⟨Statement⟩
⟨RangeIterator⟩ ::= ‘for’ ⟨Identifier⟩ ‘=’ ⟨Expr⟩ ‘..’ ⟨Expr⟩ [ ‘by’ ⟨Expr⟩ ]
⟨SubdomainIterator⟩ ::= ‘for’ ⟨SubdomainIteratorDecl⟩ ‘in’ ⟨Identifier⟩ ‘(’ ⟨Range⟩ ‘;’ ⟨Expr⟩ ‘)’
⟨SubdomainIteratorDecl⟩ ::= ⟨PointDecl⟩ | ⟨PlaneDecl⟩ | ⟨SubdomainDecl⟩
⟨PointDecl⟩ ::= ‘point’ ⟨Identifier⟩
⟨PlaneDecl⟩ ::= ‘plane’ ⟨Identifier⟩
⟨SubdomainDecl⟩ ::= ‘subdomain’ ⟨Identifier⟩ ‘(’ ⟨Range⟩ ‘)’
⟨Range⟩ ::= ⟨Vector⟩ { ⟨UnaryOperator⟩ ⟨ScaledBorder⟩ }
⟨Vector⟩ ::= ⟨Subvector⟩ [ ‘...’ [ ‘,’ ⟨Subvector⟩ ] ]
⟨Subvector⟩ ::= ( ‘:’ { ‘,’ ⟨ScalarList⟩ } ) | ⟨DimensionIdentifier⟩ | ⟨DomainSizeExpr⟩ | ⟨BracketedVector⟩ | ⟨ScalarList⟩
⟨ScalarList⟩ ::= ⟨ScalarRange⟩ { ‘,’ ⟨ScalarRange⟩ }
⟨DimensionIdentifier⟩ ::= ⟨Expr⟩ [ ‘(’ ⟨Vector⟩ ‘)’ ]
⟨DomainSizeExpr⟩ ::= ⟨SizeProperty⟩ [ ‘(’ ⟨Vector⟩ ‘)’ ]
⟨BracketedVector⟩ ::= ‘(’ ⟨Vector⟩ { ‘,’ ⟨Vector⟩ } ‘)’
⟨ScalarRange⟩ ::= ⟨Expr⟩ [ ‘..’ ⟨Expr⟩ ]
⟨SizeProperty⟩ ::= ( ‘stencil’ | ⟨Identifier⟩ ) ‘.’ ( ‘size’ | ‘min’ | ‘max’ )
⟨ScaledBorder⟩ ::= [ ⟨Expr⟩ ⟨MultiplicativeOperator⟩ ] ⟨Border⟩ [ ⟨MultiplicativeOperator⟩ ⟨Expr⟩ ]
⟨Border⟩ ::= ⟨StencilBoxBorder⟩ | ⟨LiteralBorder⟩
⟨StencilBoxBorder⟩ ::= ‘stencil’ ‘.’ ‘box’ [ ‘(’ ⟨Vector⟩ ‘)’ ]
⟨LiteralBorder⟩ ::= ‘(’ ⟨Vector⟩ ‘)’ ‘,’ ‘(’ ⟨Vector⟩ ‘)’
⟨LValue⟩ ::= ⟨GridAccess⟩ | ⟨Identifier⟩
⟨GridAccess⟩ ::= ⟨Identifier⟩ ‘[’ ⟨SpatialIndex⟩ ‘;’ ⟨Expr⟩ { ‘;’ ⟨Expr⟩ } ‘]’
⟨SpatialIndex⟩ ::= ⟨Identifier⟩ | ⟨Range⟩
⟨Expr⟩ ::= ⟨Identifier⟩ | ⟨NumberLiteral⟩ | ⟨FunctionCall⟩ | ( ⟨UnaryOperator⟩ ⟨Expr⟩ ) | ( ⟨Expr⟩ ⟨BinaryOperator⟩ ⟨Expr⟩ ) | ‘(’ ⟨Expr⟩ ‘)’
⟨FunctionCall⟩ ::= ⟨Identifier⟩ ‘(’ [ ⟨Expr⟩ { ‘,’ ⟨Expr⟩ } ] ‘)’
⟨ConditionExpr⟩ ::= ⟨ComparisonExpr⟩ | ( ⟨ConditionExpr⟩ ⟨LogicalOperator⟩ ⟨ConditionExpr⟩ )
⟨ComparisonExpr⟩ ::= ⟨Expr⟩ ⟨ComparisonOperator⟩ ⟨Expr⟩
⟨UnaryOperator⟩ ::= ‘+’ | ‘-’
⟨MultiplicativeOperator⟩ ::= ‘*’
⟨BinaryOperator⟩ ::= ‘+’ | ‘-’ | ‘*’ | ‘/’ | ‘%’
⟨LogicalOperator⟩ ::= ‘||’ | ‘&&’
⟨ComparisonOperator⟩ ::= ‘<’ | ‘<=’ | ‘==’ | ‘>=’ | ‘>’ | ‘!=’
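As an illustration of the Strategy syntax, the following sketch shows what a simple cache-blocking Strategy might look like. It is meant as a reading aid for the grammar rather than as a normative example; the identifier names are arbitrary, and minor syntactic details may differ from the grammar excerpt above:

strategy cacheblocked (domain u, auto dim cb)
{
    // iterate over the time steps
    for t = 1 .. stencil.t_max
    {
        // decompose the root domain into blocks of the auto-tuned size cb
        // and distribute the blocks among the threads
        for subdomain v(cb) in u(:; t) parallel
        {
            // apply the stencil operation to every point of the block
            for point pt in v(:; t)
                v[pt; t+1] = stencil(v[pt; t]);
        }
    }
}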

Appendix C

Stencil Specifications

C.1 Basic Differential Operators

C.1.1 Laplacian

stencil laplacian
{
    domainsize = (1 .. N, 1 .. N, 1 .. N);
    t_max = 1;

    operation (float grid u, float param alpha, float param beta)
    {
        u[x, y, z; t+1] =
            alpha * u[x, y, z; t] +
            beta * (
                u[x+1, y, z; t] + u[x-1, y, z; t] +
                u[x, y+1, z; t] + u[x, y-1, z; t] +
                u[x, y, z+1; t] + u[x, y, z-1; t]);
    }
}

C.1.2 Divergence

stencil divergence
{
    domainsize = (1 .. x_max, 1 .. y_max, 1 .. z_max);
    t_max = 1;

    operation (
        float grid u (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        const float grid ux (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        const float grid uy (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        const float grid uz (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        float param alpha, float param beta, float param gamma)
    {
        u[x, y, z; t] =
            alpha * (ux[x+1, y, z] - ux[x-1, y, z]) +
            beta * (uy[x, y+1, z] - uy[x, y-1, z]) +
            gamma * (uz[x, y, z+1] - uz[x, y, z-1]);
    }
}

C.1.3 Gradient

stencil gradient
{
    domainsize = (1 .. x_max, 1 .. y_max, 1 .. z_max);
    t_max = 1;

    operation (
        const float grid u (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        float grid ux (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        float grid uy (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        float grid uz (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        float param alpha, float param beta, float param gamma)
    {
        ux[x, y, z; t] = alpha * (u[x+1, y, z] + u[x-1, y, z]);
        uy[x, y, z; t] = beta * (u[x, y+1, z] + u[x, y-1, z]);
        uz[x, y, z; t] = gamma * (u[x, y, z+1] + u[x, y, z-1]);
    }
}

C.2 Wave Equation

stencil wave
{
    domainsize = (1 .. N, 1 .. N, 1 .. N);
    t_max = 100;

    operation (float grid u, float param c2dt_h2)
    {
        u[x, y, z; t+1] = 2 * u[x, y, z; t] - u[x, y, z; t-1] +
            c2dt_h2 * (
                -15/2 * u[x, y, z; t] +
                4/3 * (
                    u[x+1, y, z; t] + u[x-1, y, z; t] +
                    u[x, y+1, z; t] + u[x, y-1, z; t] +
                    u[x, y, z+1; t] + u[x, y, z-1; t]
                ) -
                1/12 * (
                    u[x+2, y, z; t] + u[x-2, y, z; t] +
                    u[x, y+2, z; t] + u[x, y-2, z; t] +
                    u[x, y, z+2; t] + u[x, y, z-2; t]
                )
            );
    }
}

C.3 COSMO

C.3.1 Upstream

stencil upstream_5_3d
{
    domainsize = (1 .. x_max, 1 .. y_max, 1 .. z_max);
    t_max = 1;

    operation (double grid u, double param a)
    {
        u[x, y, z; t+1] = a * (
            -2 * (u[x-3, y, z; t] + u[x, y-3, z; t] + u[x, y, z-3; t]) +
            15 * (u[x-2, y, z; t] + u[x, y-2, z; t] + u[x, y, z-2; t]) +
            -60 * (u[x-1, y, z; t] + u[x, y-1, z; t] + u[x, y, z-1; t]) +
            20 * u[x, y, z; t] +
            30 * (u[x+1, y, z; t] + u[x, y+1, z; t] + u[x, y, z+1; t]) +
            -3 * (u[x+2, y, z; t] + u[x, y+2, z; t] + u[x, y, z+2; t]));
    }
}

C.3.2 Tricubic Interpolation

stencil tricubic_interpolation
{
    domainsize = (1 .. x_max, 1 .. y_max, 1 .. z_max);
    t_max = 1;

    operation (double grid u,
        const double grid a, const double grid b, const double grid c)
    {
        double w1_a = 1.0/6.0 * a[x, y, z] * (a[x, y, z] + 1.0) * (a[x, y, z] + 2.0);
        double w2_a = -0.5 * (a[x, y, z] - 1.0) * (a[x, y, z] + 1.0) * (a[x, y, z] + 2.0);
        double w3_a = 0.5 * (a[x, y, z] - 1.0) * a[x, y, z] * (a[x, y, z] + 2.0);
        double w4_a = -1.0/6.0 * (a[x, y, z] - 1.0) * a[x, y, z] * (a[x, y, z] + 1.0);

        double w1_b = 1.0/6.0 * b[x, y, z] * (b[x, y, z] + 1.0) * (b[x, y, z] + 2.0);
        double w2_b = -0.5 * (b[x, y, z] - 1.0) * (b[x, y, z] + 1.0) * (b[x, y, z] + 2.0);
        double w3_b = 0.5 * (b[x, y, z] - 1.0) * b[x, y, z] * (b[x, y, z] + 2.0);
        double w4_b = -1.0/6.0 * (b[x, y, z] - 1.0) * b[x, y, z] * (b[x, y, z] + 1.0);

        double w1_c = 1.0/6.0 * c[x, y, z] * (c[x, y, z] + 1.0) * (c[x, y, z] + 2.0);
        double w2_c = -0.5 * (c[x, y, z] - 1.0) * (c[x, y, z] + 1.0) * (c[x, y, z] + 2.0);
        double w3_c = 0.5 * (c[x, y, z] - 1.0) * c[x, y, z] * (c[x, y, z] + 2.0);
        double w4_c = -1.0/6.0 * (c[x, y, z] - 1.0) * c[x, y, z] * (c[x, y, z] + 1.0);

        u[x, y, z; t+1] =
            w1_a * w1_b * w1_c * u[x-1, y-1, z-1; t] +
            w2_a * w1_b * w1_c * u[x, y-1, z-1; t] +
            w3_a * w1_b * w1_c * u[x+1, y-1, z-1; t] +
            w4_a * w1_b * w1_c * u[x+2, y-1, z-1; t] +
            ... // etc. for all 64 combinations of w?_a * w?_b * w?_c
                // and u[x+d1, y+d2, z+d3; t], d1,d2,d3 = -1,0,1,2
    }
}

C.4 Hyperthermia

stencil hyperthermia
{
    domainsize = (1 .. x_max, 1 .. y_max, 1 .. z_max);
    t_max = 1;

    operation (
        float grid T (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1),
        const float grid c (0 .. x_max+1, 0 .. y_max+1, 0 .. z_max+1)[9])
    {
        T[x, y, z; t+1] =

            // center point
            T[x, y, z; t] * (c[x, y, z; 0] * T[x, y, z; t] + c[x, y, z; 1]) +
            c[x, y, z; 2] +

            // faces
            c[x, y, z; 3] * T[x-1, y, z; t] +
            c[x, y, z; 4] * T[x+1, y, z; t] +
            c[x, y, z; 5] * T[x, y-1, z; t] +
            c[x, y, z; 6] * T[x, y+1, z; t] +
            c[x, y, z; 7] * T[x, y, z-1; t] +
            c[x, y, z; 8] * T[x, y, z+1; t];
    }
}

C.5 Image Processing

C.5.1 Blur Kernel

stencil blur
{
    domainsize = (1 .. width, 1 .. height);
    t_max = 1;

    operation (float grid u, float param sigma)
    {
        float f0 = 1 / (2 * sigma^2);
        float s0 = exp(0 * f0);
        float s1 = exp(-1 * f0);
        float s2 = exp(-2 * f0);
        float s4 = exp(-4 * f0);
        float s5 = exp(-5 * f0);
        float s8 = exp(-8 * f0);
        float f = 1 / (s0 + 4 * (s1 + s2 + s4 + s8) + 8 * s5);

        u[x, y; t+1] = f * (
            s0 * u[x, y; t] +
            s1 * (u[x-1, y; t] + u[x+1, y; t] +
                  u[x, y-1; t] + u[x, y+1; t]) +
            s2 * (u[x-1, y-1; t] + u[x+1, y-1; t] +
                  u[x-1, y+1; t] + u[x+1, y+1; t]) +
            s4 * (u[x-2, y; t] + u[x+2, y; t] +
                  u[x, y-2; t] + u[x, y+2; t]) +
            s5 * (
                u[x-2, y-1; t] + u[x-1, y-2; t] +
                u[x+1, y-2; t] + u[x+2, y-1; t] +
                u[x-2, y+1; t] + u[x-1, y+2; t] +
                u[x+1, y+2; t] + u[x+2, y+1; t]
            ) +
            s8 * (u[x-2, y-2; t] + u[x+2, y-2; t] +
                  u[x-2, y+2; t] + u[x+2, y+2; t])
        );
    }
}

C.5.2 Edge Detection

stencil edge
{
    domainsize = (1 .. width, 1 .. height);
    t_max = 1;

    operation (float grid u)
    {
        u[x, y; t+1] =
            -12 * u[x, y; t] +
            2 * (u[x-1, y; t] + u[x+1, y; t] +
                 u[x, y-1; t] + u[x, y+1; t]) +
            u[x-1, y-1; t] + u[x+1, y-1; t] +
            u[x-1, y+1; t] + u[x+1, y+1; t];
    }
}

C.6 Cellular Automata

C.6.1 Conway’s Game of Life

stencil game_of_life
{
    domainsize = (1 .. width, 1 .. height);
    t_max = 1;

    operation (float grid u)
    {
        // some large number
        float C = 100000000000000000000;

        // count the number of live neighbors
        float L =
            u[x-1, y-1; t] + u[x, y-1; t] + u[x+1, y-1; t] +
            u[x-1, y; t] + u[x+1, y; t] +
            u[x-1, y+1; t] + u[x, y+1; t] + u[x+1, y+1; t];

        // apply the rules
        u[x, y; t+1] = 1 / (1 + (u[x, y; t] + L - 3) * (L - 3) * C);
    }
}

C.7 Anelastic Wave Propagation

C.7.1 uxx1

stencil pmcl3d_uxx1
{
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid d1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid u1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid xx (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid xy (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid xz (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float param dth)
    {
        float c1 = 9./8.;
        float c2 = -1./24.;

        float d = 0.25 *
            (d1[x, y, z] + d1[x, y-1, z] + d1[x, y, z-1] + d1[x, y-1, z-1]);
        u1[x, y, z; t+1] = u1[x, y, z; t] + (dth / d) * (
            c1 * (
                xx[x, y, z] - xx[x-1, y, z] +
                xy[x, y, z] - xy[x, y-1, z] +
                xz[x, y, z] - xz[x, y, z-1]) +
            c2 * (
                xx[x+1, y, z] - xx[x-2, y, z] +
                xy[x, y+1, z] - xy[x, y-2, z] +
                xz[x, y, z+1] - xz[x, y, z-2])
        );
    }
}

C.7.2 xy1

stencil pmcl3d_xy1
{
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid mu (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid xy (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid u1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid v1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float param dth)
    {
        float c1 = 9./8.;
        float c2 = -1./24.;

        float xmu = 2. / (1./mu[x, y, z] + 1./mu[x, y, z-1]);

        xy[x, y, z; t+1] = xy[x, y, z; t] + dth * xmu * (
            c1 * (
                u1[x, y+1, z] - u1[x, y, z] +
                v1[x, y, z] - v1[x-1, y, z]
            ) +
            c2 * (
                u1[x, y+2, z] - u1[x, y-1, z] +
                v1[x+1, y, z] - v1[x-2, y, z]
            )
        );
    }
}

C.7.3 xyz1

stencil pmcl3d_xyz1
{
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid mu (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid lam (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid u1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid v1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid w1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid xx (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid yy (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid zz (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float param dth)
    {
        float c1 = 9./8.;
        float c2 = -1./24.;

        float b = 8. / (
            1./lam[x, y, z] + 1./lam[x+1, y, z] +
            1./lam[x, y-1, z] + 1./lam[x+1, y-1, z] +
            1./lam[x, y, z-1] + 1./lam[x+1, y, z-1] +
            1./lam[x, y-1, z-1] + 1./lam[x+1, y-1, z-1]
        );

        float a = b + 2. * 8. / (
            1./mu[x, y, z] + 1./mu[x+1, y, z] +
            1./mu[x, y-1, z] + 1./mu[x+1, y-1, z] +
            1./mu[x, y, z-1] + 1./mu[x+1, y, z-1] +
            1./mu[x, y-1, z-1] + 1./mu[x+1, y-1, z-1]
        );

        // find xx stress
        xx[x, y, z; t+1] = xx[x, y, z; t] + dth * (
            a * (
                c1 * (u1[x+1, y, z] - u1[x, y, z]) +
                c2 * (u1[x+2, y, z] - u1[x-1, y, z])
            ) +
            b * (
                c1 * (
                    v1[x, y, z] - v1[x, y-1, z] +
                    w1[x, y, z] - w1[x, y, z-1]
                ) +
                c2 * (
                    v1[x, y+1, z] - v1[x, y-2, z] +
                    w1[x, y, z+1] - w1[x, y, z-2]
                )
            )
        );

        // find yy stress
        yy[x, y, z; t+1] = yy[x, y, z; t] + dth * (
            a * (
                c1 * (v1[x, y, z] - v1[x, y-1, z]) +
                c2 * (v1[x, y+1, z] - v1[x, y-2, z])
            ) +
            b * (
                c1 * (
                    u1[x+1, y, z] - u1[x, y, z] +
                    w1[x, y, z] - w1[x, y, z-1]
                ) +
                c2 * (
                    u1[x+2, y, z] - u1[x-1, y, z] +
                    w1[x, y, z+1] - w1[x, y, z-2]
                )
            )
        );

        // find zz stress
        zz[x, y, z; t+1] = zz[x, y, z; t] + dth * (
            a * (
                c1 * (w1[x, y, z] - w1[x, y, z-1]) +
                c2 * (w1[x, y, z+1] - w1[x, y, z-2])
            ) +
            b * (
                c1 * (
                    u1[x+1, y, z] - u1[x, y, z] +
                    v1[x, y, z] - v1[x, y-1, z]
                ) +
                c2 * (
                    u1[x+2, y, z] - u1[x-1, y, z] +
                    v1[x, y+1, z] - v1[x, y-2, z]
                )
            )
        );
    }
}

C.7.4 xyzq

stencil pmcl3d_xyzq
{
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid mu (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid lam (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid r1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid r2 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid r3 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid xx (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid yy (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid zz (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid u1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid v1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid w1 (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid qp (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid qs (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        const float grid tau (-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float param dh, float param dt, float param dth,
        float param nz)
    {
        float c1 = 9./8.;
        float c2 = -1./24.;

        float d = 8. / (
            1./mu[x, y, z] + 1./mu[x+1, y, z] +
            1./mu[x, y-1, z] + 1./mu[x+1, y-1, z] +
            1./mu[x, y, z-1] + 1./mu[x+1, y, z-1] +
            1./mu[x, y-1, z-1] + 1./mu[x+1, y-1, z-1]
        );

        float a2 = 2 * d;
        float c = a2 + 8. / (
            1./lam[x, y, z] + 1./lam[x+1, y, z] +
            1./lam[x, y-1, z] + 1./lam[x+1, y-1, z] +
            1./lam[x, y, z-1] + 1./lam[x+1, y, z-1] +
            1./lam[x, y-1, z-1] + 1./lam[x+1, y-1, z-1]
        );

        float qpa = 0.125 * (
            qp[x, y, z] + qp[x+1, y, z] +
            qp[x, y-1, z] + qp[x+1, y-1, z] +
            qp[x, y, z-1] + qp[x+1, y, z-1] +
            qp[x, y-1, z-1] + qp[x+1, y-1, z-1]
        );

        float qsa = 0.125 * (
            qs[x, y, z] + qs[x+1, y, z] +
            qs[x, y-1, z] + qs[x+1, y-1, z] +
            qs[x, y, z-1] + qs[x+1, y, z-1] +
            qs[x, y-1, z-1] + qs[x+1, y-1, z-1]
        );

        // (we can't handle indirect grid accesses for the time being)
        // float tauu = tau[((coords1 * nxt + this.x) % 2) +
        //     2 * ((coords2 * nyt + this.y) % 2) +
        //     4 * ((nz + 1 - (coords3 * nzt + this.z)) % 2)];

        float vxx = c1 * (u1[x+1, y, z] - u1[x, y, z]) +
            c2 * (u1[x+2, y, z] - u1[x-1, y, z]);
        float vyy = c1 * (v1[x, y, z] - v1[x, y-1, z]) +
            c2 * (v1[x, y+1, z] - v1[x, y-2, z]);
        float vzz = c1 * (w1[x, y, z] - w1[x, y, z-1]) +
            c2 * (w1[x, y, z+1] - w1[x, y, z-2]);

        float a1 = -qpa * c * (vxx + vyy + vzz) / (2. * dh);

        //float x1 = tauu / dt + 0.5;
        float x1 = tau[x, y, z] / dt + 0.5;
        float x2 = x1 - 1;  // (tauu/dt)-(1./2.)

        // normal stress xx, yy and zz
        xx[x, y, z; t+1] = xx[x, y, z; t] + dth * (c * vxx + (c - a2) *
            (vyy + vzz)) + dt * r1[x, y, z; t];
        yy[x, y, z; t+1] = yy[x, y, z; t] + dth * (c * vyy + (c - a2) *
            (vxx + vzz)) + dt * r2[x, y, z; t];
        zz[x, y, z; t+1] = zz[x, y, z; t] + dth * (c * vzz + (c - a2) *
            (vxx + vyy)) + dt * r3[x, y, z; t];

        float hdh = -d * qsa / dh;
        r1[x, y, z; t+1] = (x2 * r1[x, y, z; t] - hdh * (vyy + vzz) + a1) / x1;
        r2[x, y, z; t+1] = (x2 * r2[x, y, z; t] - hdh * (vxx + vzz) + a1) / x1;
        r3[x, y, z; t+1] = (x2 * r3[x, y, z; t] - hdh * (vxx + vyy) + a1) / x1;

        xx[x, y, z; t+1] = xx[x, y, z; t+1] + dt * r1[x, y, z; t+1];
        yy[x, y, z; t+1] = yy[x, y, z; t+1] + dt * r2[x, y, z; t+1];
        zz[x, y, z; t+1] = zz[x, y, z; t+1] + dt * r3[x, y, z; t+1];
    }
}


Curriculum Vitae

Personal Data

Name: Matthias-Michael Christen
Date of birth: April 17, 1980, Zürich, Switzerland
Parents: Sophie and Walter Christen-Tchung
Nationality: Swiss
Citizenship: Affoltern BE, Switzerland
Family status: unmarried

Education

1987 – 1999: Waldorf School in Adliswil ZH and Basel (Switzerland)
1999 – 2000: Gymnasium am Kirschgarten, Basel (Switzerland)
2000 – 2006: Diploma in Mathematics, minors in Computer Science and Musicology at the University of Basel (Switzerland)
2007 – 2011: Ph.D. candidate in Computer Science at the University of Basel (Switzerland)
Summer 2009 and Fall 2010: Internships at the Lawrence Berkeley National Laboratory, Berkeley CA (USA)
22 Sept. 2011: Ph.D. examination, University of Basel (Switzerland)

I enjoyed attending the lectures of the following professors and lecturers: W. Arlt, H. Burkhart, N. A'Campo, D. Cohen, F.-J. Elmer, M. Grote, M. Guggisberg, M. Haas, L. Halbeisen, H.-C. Im Hof, A. Iozzi, J. Königsmann, H. Kraft, E. Lederer, Y. Lengwiler, D. Masser, D. Müller, O. Schenk, M. Schmidt, A. C. Shreffler, J. Stenzl, C. Tschudin, J. Willimann, C. Wattinger.