Imperial College London Department of Computing

Automated Optimization of Numerical Methods for Partial Differential Equations

Fabio Luporini

September 2016

Supervised by: Dr. David A. Ham, Prof. Paul H. J. Kelly

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London

Declaration

I herewith certify that the material in this thesis that is not my own work has been properly acknowledged and referenced.

Fabio Luporini

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.


Abstract

The time required to execute real-world scientific computations is a major issue. A single simulation may last hours, days, or even weeks to reach a certain level of accuracy, despite running on large-scale parallel architectures. Strict time limits may also be imposed – 60 minutes in the case of the UK Met Office to produce a forecast.

In this thesis, it is demonstrated that by raising the level of abstraction, the performance of a class of numerical methods for solving partial differential equations can be improved with minimal user intervention or, in many circumstances, with no user intervention at all. The use of high level languages to express mathematical problems enables domain-specific optimization via compilers. These automated optimizations are shown to be effective in a variety of real-world applications and computational kernels.

The focus is on numerical methods based on unstructured meshes, such as the finite element method. The loop nests for unstructured mesh traversal are often irregular (i.e., they perform non-affine memory accesses, such as A[B[i]]), which makes reordering transformations for data locality essentially impossible for low level compilers. Further, the computational kernels are often characterized by complex mathematical expressions, and manual optimization is simply not conceivable. We discuss algorithmic solutions to these problems and present tools for their automation. These tools – the COFFEE compiler and the SLOPE library – are currently in use in frameworks for solving partial differential equations.


To Alice


Acknowledgements

In these two pages, I would like to express my gratitude to those who supported me during my studies at Imperial College. My greatest thanks to my supervisors, Paul Kelly and David Ham, for the time they have spent with me. Your guidance and passion have been an incredible source of inspiration for my research. You probably do not even realize how much I have learnt from you, and for this I will be eternally grateful. I have been extremely lucky to work next to Doru Bercea, Luigi Nardi, Florian Rathgeber, Francis Russell, Georgios Rokos, Lawrence Mitchell, Andrew McRae, Thomas Gibson, Miklós Homolya, Michael Lange. It has been a real pleasure to share the office with you. I think we have done some real science together, and all of you have contributed to my personal and professional growth. Thanks. Keep up the hard work, I am sure we will meet again! I met Carlo Bertolli for the first time in 2008; he was a lab helper when I was a second year student in Computer Science at University of Pisa. He co-supervised my Bachelor thesis back in 2009. We kept in touch, and after a few years he suggested that I apply for a PhD position at Imperial. If I am writing these lines, it is also thanks to him and his guidance. So, thanks Carlo. At the very beginning of my studies at Imperial, I met J. "Ram" Ramanujam. We have worked together, shared many interesting stories (and trolls) about the Polyhedral model, and had many pizzas and coffees together. I consider you like my third, "unofficial" supervisor. Thanks! With Emanuele Vespa I have shared my office for only one year, but our

friendship goes back to the good old days in Pisa (when we were young). I will miss you, but I am also sure we will always be in touch. I will really miss two things: 1) our (self-)trolling, especially for all those PhD students telling us that they had deadlines and were publishing stacks of papers everywhere, while we were spending most of our time chasing bugs and "understanding what to do next"; 2) playing Teeworlds. I also thank my Italian friends, those in London (in particular, Francesca, Simone, Federico), those spread all over the world, and those who I talk with almost every day. I am so glad that after all these years we still keep in touch. Finally, my biggest thanks goes to those who have made this experience possible: Alice and our families. I like science, I love what I do, but without your love and support this PhD would have never been possible. My gratitude for these 4 years in London will be eternal.

Contents

1 Introduction ...... 1
  1.1 Thesis Statement ...... 1
  1.2 Overview ...... 1
  1.3 Thesis Outline and Contributions ...... 2
  1.4 Dissemination ...... 5

2 Background ...... 7
  2.1 The Finite Element Method ...... 7
    2.1.1 Weak Formulation ...... 8
    2.1.2 Local and Global Function Spaces ...... 9
    2.1.3 The Reference Element ...... 11
    2.1.4 Assembly ...... 12
    2.1.5 Local Assembly Example: from Math to Code ...... 13
  2.2 Software Abstractions for Partial Differential Equation Solvers ...... 18
    2.2.1 Automating the Finite Element Method ...... 19
    2.2.2 The PyOP2 and OP2 Libraries ...... 23
    2.2.3 Stencil Languages ...... 27
  2.3 Compilers and Libraries for Loop Optimization ...... 29
    2.3.1 Loop Reordering Transformations ...... 29
    2.3.2 Composing Loop Tiling and Loop Fusion ...... 31
    2.3.3 Automation via Static Analysis ...... 33
    2.3.4 Automation via Dynamic Analysis ...... 35
  2.4 Domain-Specific Optimization ...... 35
    2.4.1 Tensor Contraction Engine ...... 36
    2.4.2 Halide ...... 36

    2.4.3 Spiral ...... 37
    2.4.4 Small-scale Linear Algebra ...... 37
  2.5 On the Terminology Adopted ...... 38

3 Automated Sparse Tiling for Irregular Computations ...... 43
  3.1 Motivation ...... 43
  3.2 Context and Approach ...... 45
  3.3 Applying Loop Fusion and Loop Tiling is More Difficult than Commonly Assumed ...... 47
  3.4 Related Work ...... 50
  3.5 The Loop Chain Abstraction for Generalized Inspector/Executor Schemes ...... 52
    3.5.1 Relationship between Loop Chain and Inspector ...... 52
    3.5.2 Definition of a Loop Chain ...... 53
    3.5.3 The Abstraction Revisited for Unstructured Mesh Applications ...... 54
  3.6 Loop Chain, Inspection and Execution Examples ...... 56
  3.7 Data Dependency Analysis for Loop Chains ...... 64
  3.8 Formalization ...... 66
    3.8.1 The Generalized Sparse Tiling Inspector ...... 66
    3.8.2 The Generalized Sparse Tiling Executor ...... 71
    3.8.3 Computational Complexity of Inspection ...... 73
  3.9 Implementation ...... 73
    3.9.1 SLOPE: a Library for Sparse Tiling Irregular Computations ...... 74
    3.9.2 PyOP2: Lazy Evaluation and Interfaces ...... 76
    3.9.3 Firedrake/DMPlex: the S-depth Mechanism for Extended Halo Regions ...... 77
  3.10 Performance Evaluation - Benchmarks ...... 78
    3.10.1 Sparse Jacobi ...... 79
    3.10.2 Airfoil ...... 80
    3.10.3 Outcome ...... 82
  3.11 Performance Evaluation - Seigen: an Elastic Wave Equation Solver for Seismological Problems ...... 83
    3.11.1 Computation ...... 83
    3.11.2 Setup and Reproducibility ...... 87

    3.11.3 Results and Analysis ...... 92
  3.12 Conclusions and Future Work ...... 107

4 Minimizing Operations in Finite Element Integration Loops ...... 109
  4.1 Motivation and Related Work ...... 109
  4.2 Loop Nests, Expressions and Optimality ...... 112
  4.3 Transformation Space: Sharing Elimination ...... 116
    4.3.1 Identification and Exploitation of Structure ...... 117
    4.3.2 Global Analysis of the Expression ...... 118
    4.3.3 The Sharing Elimination Algorithm ...... 120
    4.3.4 Examples ...... 122
  4.4 Transformation Space: Pre-evaluation of Reductions ...... 127
  4.5 Transformation Space: Memory Constraints ...... 129
  4.6 Selection and Composition of Transformations ...... 129
    4.6.1 The Main Transformation Algorithm ...... 129
    4.6.2 The Cost Function θ ...... 132
  4.7 Formalization ...... 133
  4.8 Code Generation ...... 136
    4.8.1 Expressing Transformations with COFFEE ...... 136
    4.8.2 Independence from Form Compilers ...... 137
    4.8.3 Handling Block-sparse Tables ...... 137
  4.9 Performance Evaluation ...... 138
    4.9.1 Experimental Setup ...... 138
    4.9.2 Performance Results ...... 140
  4.10 Conclusions ...... 146
  4.11 Limitations and Future Work ...... 146

5 Cross-loop Optimization of Arithmetic Intensity for Finite Element Integration ...... 149
  5.1 Recapitulation and Objectives ...... 149
  5.2 Low-level Optimization ...... 152
    5.2.1 Padding and Data Alignment ...... 152
    5.2.2 Expression Splitting ...... 153
    5.2.3 Model-driven Vector-register Tiling ...... 156
  5.3 Experiments ...... 158
    5.3.1 Setup ...... 158

    5.3.2 Test Environment ...... 161
    5.3.3 Rationale of the Experimentation ...... 161
  5.4 Experience with Traditional Compiler Optimizations ...... 166
    5.4.1 Loop Interchange ...... 167
    5.4.2 Loop Unrolling ...... 167
    5.4.3 Vector Promotion ...... 168
    5.4.4 Loop Fusion ...... 169
  5.5 Related Work ...... 170
  5.6 Applicability to Other Domains ...... 172
  5.7 Conclusions ...... 175

6 COFFEE: a Compiler for Fast Expression Evaluation ...... 177
  6.1 Overview ...... 177
  6.2 The Compilation Pipeline ...... 178
  6.3 Plugging COFFEE into Firedrake ...... 180
    6.3.1 Abstract Syntax Trees ...... 180
    6.3.2 Integration with Form Compilers ...... 181
  6.4 Rewrite Operators ...... 182
  6.5 Features of the Implementation ...... 184
    6.5.1 Tree Visitor Pattern ...... 184
    6.5.2 Flexible Code Motion ...... 185
    6.5.3 Tracking Data Dependency ...... 186
    6.5.4 Minimizing Temporaries ...... 186
  6.6 On the Compilation Time ...... 187
  6.7 Overview of Applications Exploiting COFFEE ...... 187

7 Conclusions ...... 189
  7.1 Summary ...... 189
  7.2 Limitations ...... 192
  7.3 Future Research Directions ...... 193

List of Tables

3.1 Summary of the notation used throughout the section ...... 66
3.2 The tile data structure ...... 68
3.3 Worst-case computational costs of the main inspection phases ...... 73
3.4 Execution time reductions over the original implementation (in percentage) and speed-ups over the single-threaded tiled implementation for the sparse Jacobi solver with 15 threads ...... 79
3.5 Execution time (in seconds) and speed-ups over the slowest single-threaded implementation for the Airfoil benchmark. Respectively 16 and 24 threads/processes are used on the Sandy Bridge and Westmere machines ...... 80
3.6 Summary of the fusion schemes adopted in Seigen. The S-depth column represents the number of off-process cell layers ...... 88
3.7 Systems specification ...... 90
3.8 Summary of the meshes used for single-node experimentation in Seigen. d is the average number of degrees of freedom assigned to a process for a given variational form ...... 91
3.9 Comparing the Seigen completion time OT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20) ...... 94
3.10 Comparing the Seigen average compute time ACT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20) ...... 95
3.11 Comparing the Seigen average compute and communication time ACCT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20) ...... 95


List of Figures

2.1 Affine mapping from the reference element K̂ to an element K ...... 11

2.2 Distributed-memory parallelism in OP2 and PyOP2. Matching colors represent identical subsets of iterations. The image was inspired by an example in Rathgeber [2014] ...... 26

3.1 Partitioning of the seed loop. The vertices are illustrated to make the connectivity of the mesh clear, although they do not belong to any partition yet ...... 57

3.2 A snapshot of the mesh after tiling L0 ...... 58

3.3 The vertices are written by L0, so a projection must be computed before tiling L1. Here, the projection is represented by the colored vertices ...... 59

3.4 A snapshot of the mesh after tiling L1 ...... 59

3.5 A snapshot of the mesh after tiling L2 ...... 60

3.6 Tiling the program in Listing 7 for shared-memory parallelism can lead to conflicts. Here, the two green tiles eventually become adjacent, creating race conditions ...... 60

3.7 A snapshot of the two mesh partitions on Process 0 and Process 1 after inspecting the seed loop L0 for distributed-memory parallelism. On each process, there are five tiles in total: two in the core region (green and violet), two in the boundary region (red and light blue), and Tne. The boundary tiles can safely cross the owned and exec sub-regions (i.e., the private local iterations and the iterations to be redundantly computed, respectively). However, no tile can include iterations from both the core and the boundary regions ...... 62

3.8 A snapshot of the two mesh partitions on Process 0 and Process 1 at the end of the inspection for distributed-memory parallelism. Tne expands over the boundary region, which minimizes the amount of redundant computation to be performed. At the end of the execution phase, the orange edges will contain "dirty values", but correctness is not affected as the exec region only includes off-process data. The boundary tiles expand over the core region: this is essential for correctness since none of the red and blue entities from [L0, L1, L2] can be executed until the MPI communications have terminated ...... 63

3.9 Representation of an inverse map. The original map shows that the triangular cell 1 is adjacent to three vertices, namely 3, 7, and 9. The inverse map associates vertices to cells. Since the mesh is unstructured, different vertices can be incident to a different number of cells. The array offset determines the distance between two consecutive vertices in the inverse map. For instance, all entries in the inverse map between offset[3] and offset[4] are cells incident to vertex 3 (in the illustrated case, the first of these is cell 1) ...... 70

3.10 Sparse tiling in the Firedrake-PyOP2-SLOPE framework. There are three ways of sparse tiling a loop chain: decorating a Firedrake program (1A), decorating a sequence of loops in a PyOP2 program (1B), or writing both the loop chain and the inspector/executor codes explicitly through calls to SLOPE (1C). Both (1A) and (1B) use the loop chain interface (details in Section 3.9.2). The kernels generated within a loop chain are pre-processed in PyOP2 (2) and forwarded to SLOPE through its Python interface (3). SLOPE now has access to the loop chain, so it can generate an inspector/executor scheme and return it to PyOP2 (4). The inspector is compiled and executed. The result, a schedule (i.e., the output of Algorithm 1), is cached and used as input to the executor. Each time the same loop chain is encountered in a Firedrake/PyOP2 program, the corresponding schedule is reused ...... 75

3.11 Four phases of the explosive source test case in Seigen, following an explosion at a point source ...... 85

3.12 The sub loop chains in the five fusion schemes. Cells, int-facets, and ext-facets represent respectively the cell, interior facet, and exterior facet assembly loops, while mat-vec represents a solver ...... 88

3.13 Search space exploration on Erebus with n = 4 MPI processes. Two meshes (h ∈ {1.0, 1.2}) and four polynomial orders (q ∈ {1, 2, 3, 4}) are shown. Each plot shows the average compute and communication time (ACCT, y-axis) of five fusion schemes and the original version. The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the "tile size factor" (x-axis) and a pre-established multiplier (an integer in the range [1, 4]; full list at Jacobs and Luporini [2016]). With tiling, global maps, chunk partitioning, and the extended boundary region optimization are used ...... 98

3.15 Search space exploration on CX1-Ivy with n = 20 MPI processes. Two meshes (h ∈ {0.6, 0.8}) and four polynomial orders (q ∈ {1, 2, 3, 4}) are shown. Each plot shows the average compute and communication time (ACCT, y-axis) of five fusion schemes and the original version. The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the "tile size factor" (x-axis) and a pre-established multiplier (an integer in the range [1, 4]; full list at Jacobs and Luporini [2016]). With tiling, global maps, chunk partitioning, and the extended boundary region optimization are used ...... 99

4.1 The loop nest implementing a generic bilinear form ...... 114
4.2 Reducing a simple multilinear loop nest to optimal form ...... 116
4.3 The temporary graph for an excerpt of code extracted from a hyperelastic model ...... 120
4.4 Applying sharing elimination to the bilinear form arising from a Poisson equation in 2D. The operation counts are E(f(z0, z1, ...) + IJK·18) (left) and E(f(z0, z1, ...) + I(J·6 + K·9 + JK·4)) (right), with f(z0, z1, ...) representing the operation count for evaluating z0, z1, ..., and common sub-expressions being counted once. The synthesis in Figure 4.4b is globally optimal apart from the pathological case I, J, K = 1 ...... 123
4.5 Using the ILP model to factorize the expression arising from the pressure equation ...... 124
4.6 Applying the strategy selection analysis to the bilinear form arising from a hyperelastic model in 2D ...... 126
4.7 Exposing (through factorization) and pre-evaluating a reduction ...... 127
4.8 The loop nest produced by the algorithm for an input as in Figure 4.1 ...... 131
4.9 Simplified loop nest for a pre-multiplied mass matrix ...... 132
4.10 Performance evaluation for the mass matrix. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler ...... 140

4.11 Performance evaluation for the bilinear form of a Helmholtz equation. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler ...... 141
4.12 Performance evaluation for the bilinear form arising in an elastic model. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler ...... 142
4.13 Performance evaluation for the bilinear form arising in a hyperelastic model. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler ...... 143

5.1 The outer-product vectorization relies on permuting values directly within vector registers ...... 158
5.2 Restoring the storage layout is the last step of the outer-product vectorization. The figure shows how 4 × 4 elements in the top-left block of the element matrix A can be moved to their correct positions. Each rotation, represented by a group of three same-colored arrows, is implemented by a single shuffle intrinsic ...... 158
5.3 The Helmholtz bilinear form. The trial function u and the test function v belong to the space of continuous piecewise-defined polynomials of given degree ...... 160
5.4 The Diffusion bilinear form. The trial function u and the test function v belong to the space of continuous piecewise-defined polynomials of given degree ...... 160
5.5 The Burgers operator. The solution u and the test function v belong to the space of vector-valued continuous piecewise-defined polynomials of given degree. ν is a constant scalar viscosity. A backward Euler time discretization is used ...... 160
5.6 Maximum performance improvement on a Sandy Bridge due to generalized loop-invariant code motion, padding and data alignment, outer-product vectorization, and expression splitting over the original code produced by Firedrake. The unroll and splitting factors – respectively for the outer-product vectorization and expression splitting – were determined empirically ...... 162

5.7 Maximum performance improvement on a Xeon Phi due to generalized loop-invariant code motion, padding and data alignment, outer-product vectorization, and expression splitting over the original code produced by Firedrake. The unroll and splitting factors – respectively for the outer-product vectorization and expression splitting – were determined empirically ...... 166

6.1 The COFFEE pipeline and its interaction with Firedrake ...... 178
6.2 AST representation of a C assignment in COFFEE ...... 181

Chapter 1

Introduction

1.1 Thesis Statement

Maximizing the performance of numerical methods for partial differential equations is a challenging task that can be facilitated by raising the level of abstraction, through which automated optimization becomes possible.

1.2 Overview

In many fields such as computational fluid dynamics, computational electromagnetics and structural mechanics, phenomena are modelled by partial differential equations. Numerical techniques, such as the finite volume method and the finite element method, are widely employed to approximate their solutions. Unstructured meshes are often used to discretize the computational domain, since they allow an accurate representation of complex geometries. The solution is sought by applying suitable numerical operations, or kernels, to the entities of a mesh, such as edges, vertices, or cells. On typical clusters of multi-core processors, a kernel is usually executed sequentially by a process, while parallelism is achieved by assigning mesh partitions to different nodes or processes. This execution model is adopted by many real-world applications and frameworks. The time required to apply the numerical kernels is a major issue, since the equation domain needs to be discretized into an extremely large number of cells to obtain a satisfactory approximation of the solution, possibly of the order of trillions, as in Rossinelli et al. [2013]. For example, it has

been well established that mesh resolution is critical to the accuracy of numerical weather forecasts. However, operational forecast centers have a strict time limit in which to produce a forecast – 60 minutes in the case of the UK Met Office. Producing efficient kernels or improving the mesh iteration has a direct scientific payoff in higher resolution, and therefore more accurate, forecasts. This thesis shows that the use of domain-specific languages for expressing a class of numerical methods based on unstructured meshes plays a crucial role in performance optimization. In the translation of the problem specification into low level code (e.g., C), a stack of compilers applies sophisticated transformations. Ideally, these transformations would be written, specialized, and tuned manually for each application and for each architecture. This is however unrealistic, as all fundamental rules of software engineering, including maintainability, extendibility, and modularity, would be violated. As we shall demonstrate, optimizing applications via domain-specific compilers has significant advantages. First, simplicity: both the program structure and domain properties are captured by the high level syntax, which invaluably simplifies the compiler's job. Second, portability: multiple applications use the same transformation pipeline, and execution on heterogeneous architectures may be supported. Third, separation of concerns: application specialists focus on expressing simulations with a notation close to the mathematics of textbooks, whereas performance optimization is hidden in the lower abstraction layers.

1.3 Thesis Outline and Contributions

This thesis investigates two main topics: (i) the generalization of a compiler transformation known as sparse tiling, whose aim is to improve data locality in sequences of irregular loops; (ii) the optimization of so-called local assembly kernels in finite element methods. The latter spans two research directions: (ii-a) the minimization of the operation count; (ii-b) the low level optimization of the code resulting from (ii-a), through novel and state-of-the-art loop transformations. For both (i) and (ii), automation via domain-specific compilers is a distinguishing feature. The order in which these topics are covered is detailed below.

The thesis comprises seven chapters, including the present introduction and the conclusions. Chapter 2 establishes the foundation for the contributions in the subsequent chapters. The basics of the finite element method, the state of the art in automated code generation for unstructured mesh computations, and compiler optimization are reviewed. Chapters 3, 4, and 5, which respectively treat the aforementioned topics (i), (ii-a), and (ii-b), represent the core contributions of this thesis, as detailed next:

Chapter 3 Sparse tiling is a transformation that fuses and tiles loops performing indirect memory accesses (e.g., A[B[i]]). These loops are common when iterating over unstructured meshes. The research on sparse tiling was characterized by two phases:

1. The initial work was a joint international effort with people from several institutions: Michelle M. Strout, Christopher Krieger, and Catherine Olschanowsky (Colorado State University; fundamental contributions to the first version of the generalized sparse tiling algorithm for shared-memory architectures, performance evaluation of the Jacobi sparse matrix solver benchmark); Carlo Bertolli (IBM T. J. Watson; tiling algorithm, debugging); J. "Ram" Ramanujam (Louisiana State University; formalization); Gheorghe-Teodor Bercea and Paul H. J. Kelly (Imperial College; tiling algorithm, performance evaluation of the Airfoil benchmark). The design and the implementation of the first sparse tiling algorithm in the OP2 library were entirely performed by the author of this thesis.

2. The work was later extended in four directions: a more efficient sparse tiling algorithm for shared-memory architectures; support for distributed-memory architectures, with contributions from Michael Lange (Imperial College); automation in the Firedrake framework through the SLOPE library; extensive performance evaluation with a real-world application (Seigen, an elastic wave equation solver for seismological problems). SLOPE implements sparse tiling and is a by-product of this thesis.

Chapter 4 An algorithm for minimizing the operation count in a class of

local assembly kernels is devised and evaluated. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. Rewrite operators, such as factorization and generalized loop-invariant code motion, as well as cost models, are used to this purpose. The algorithm is implemented in COFFEE, a domain-specific compiler conceived and written by the author of this thesis. COFFEE is integrated with Firedrake, a system capable of solving partial differential equations expressed in mathematical syntax through the finite element method [Rathgeber et al., 2015]. As such, the technology developed in this chapter is in use in a number of projects built on top of Firedrake.

Chapter 5 The low level optimization of the code resulting from the minimization algorithm in Chapter 4 is studied. A number of transformations, both novel (inspired by the kernel structure and/or domain properties) and well-known (although specialized to take advantage of kernel properties), are introduced. SIMD vectorization and register locality on state-of-the-art CPUs are the primary target. The implementation is carried out in COFFEE, hence the technology developed is available to Firedrake users. Some transformations – the ones which are proven to provide systematic performance improvements across a range of problems – are automatically introduced during the default optimization process performed by COFFEE. Extensive performance evaluation provides compelling evidence about the effectiveness of the transformations. In particular, the Intel Xeon Phi experimentation was performed in collaboration with Ana Lucia Varbanescu (Delft University of Technology).

Finally, Chapter 6 describes the conception, architecture and interface of COFFEE. This is a technical chapter, in which implementation choices and optimization strategies are discussed. Summarizing, this thesis advances the state of the art by introducing new performance optimizations for a class of unstructured mesh computations (with emphasis on the finite element method) and by automating them through two new software components: SLOPE and COFFEE.

1.4 Dissemination

The tools developed in this thesis are released under open source licenses. The underlying theory has been exposed to the scientific community through various publications and presentations.

Publications Three main publications derive from this thesis; a further publication is planned for the most recent achievements in sparse tiling. In chronological order:
• M.M. Strout, F. Luporini, C.D. Krieger, C. Bertolli, G.-T. Bercea, C. Olschanowsky, J. Ramanujam, and P.H.J. Kelly. Generalizing run-time tiling with the loop chain abstraction, in Parallel and Distributed Processing Symposium (IPDPS), 2014 (Strout et al. [2014]). This conference paper generalizes sparse tiling to arbitrary sequences of irregular loops (i.e., Chapter 3, first research phase on sparse tiling, as explained in Section 1.3).
• F. Luporini, A.L. Varbanescu, F. Rathgeber, G.-T. Bercea, J. Ramanujam, D.A. Ham, and P.H.J. Kelly. Cross-loop optimization of arithmetic intensity for finite element local assembly, in ACM Transactions on Architecture and Code Optimization (TACO), 2015 (Luporini et al. [2015]). This journal paper describes the work on the low level optimization of local assembly kernels (i.e., Chapter 5).
• F. Luporini, D.A. Ham, P.H.J. Kelly. An algorithm for the optimization of finite element integration loops, submitted to ACM Transactions on Mathematical Software (TOMS), 2016 (Luporini et al. [2016a]). This journal article introduces the operation count minimization algorithm for local assembly kernels (i.e., Chapter 4).
Other publications – the result of collaborations with people at Imperial College – are not reported here since they do not contribute directly to the thesis.

Presentations A number of formal and informal presentations were given at various conferences, workshops, and meetings. The most relevant, in chronological order, are:

• Generalised Sparse Tiling for Unstructured Mesh Computations in the OP2 Framework. Compilers for Parallel Computing (CPC) workshop, 2013.
• COFFEE: an Optimizing Compiler for Finite Element Local Assembly. FEniCS workshop, 2014.
• Optimization of Arithmetic Intensity for Finite Element Assembly. GungHo! workshop, 2014.
• Cross-loop Optimization of Arithmetic Intensity and Data Locality for Unstructured Mesh Applications. Oxford Many-core seminars, 2014.
• Cross-loop Optimization of Arithmetic Intensity for Finite Element Local Assembly. High Performance and Embedded Architecture and Compilation (HiPEAC) conference, 2015.
• Generating high performance finite element kernels using optimality criteria. SIAM Parallel Processing for Scientific Computing (PP) conference, 2016.
• An algorithm for the optimization of finite element integration loops. Platform for Research In Simulation Methods (PRISM) workshop, 2016.

Software The research described in this thesis has led to the development of two software components.
• COFFEE, a compiler for optimising the arithmetic intensity of expressions embedded in loop nests [Luporini, 2014a].
• SLOPE, a library for fusing sequences of loops characterized by indirect memory accesses [Luporini, 2016b].

Chapter 2

Background

The foundations of the thesis are established in this chapter. A top-down approach is followed. The first topic treated is the mathematical theory of the finite element method. How this theory is translated into software is subsequently discussed; in particular, a revolutionary approach based upon automated code generation is explored. The core libraries that make this approach possible are surveyed. The state of the art in performance optimization is then reviewed, with emphasis on the fundamental class of loop reordering transformations. An overview of the successful domain-specific abstractions that have inspired some of the core contributions of this thesis concludes the chapter. Some of the relevant related work is introduced, although an extended literature review is provided in the later chapters.

2.1 The Finite Element Method

Computational methods based upon finite elements are used to approximate the solution of partial differential equations (henceforth, PDEs) in a wide variety of domains. The mathematical abstraction used in the finite element method is extremely powerful: not only does it help reasoning about the problem, but it also provides systematic ways of deriving effective computer implementations. This, however, has a price in terms of complexity. Here, we limit ourselves to reviewing the aspects that are essential for understanding the contributions in Chapters 4 and 5. The content and the notation used in this section are inspired by Rathgeber

[2014], Logg et al. [2012] and Ølgaard and Wells [2010]. For a complete treatment of the subject, the reader is invited to refer to Brenner and Scott [2007].

2.1.1 Weak Formulation

In a boundary value problem, the solution of a (partial) differential equation subject to a set of boundary conditions is sought. The boundary conditions specify the behaviour of the unknown at the domain boundary. We introduce two spaces of functions, U and V, called respectively the trial space and the test space. These function spaces may have desirable properties, such as differentiability. We consider the variational or weak formulation of a linear boundary value problem¹

Find u ∈ U such that a(u, v) = L(v)   ∀ v ∈ V   (2.1)

where u is the sought problem solution. In this formulation, a is a bilinear form (a map U × V → R, linear in each argument separately) and L is a linear form (a linear map V → R). The term "variational" stems from the fact that the test function v can vary arbitrarily. The reader unfamiliar with the theory of Hilbert spaces may find this formulation unusual. The underlying idea of the variational formulation consists of "transferring" certain requirements on the unknown (e.g., differentiability) to v. In the finite element method, the variational problem is discretized by using discrete trial and test spaces, namely U_h ⊂ U and V_h ⊂ V. This leads to restating (2.1) as

Find u_h ∈ U_h ⊂ U such that a(u_h, v_h) = L(v_h)   ∀ v_h ∈ V_h ⊂ V   (2.2)

Let {ψ_j}_{j=1}^N be a set of basis functions spanning U_h and let {φ_i}_{i=1}^N be a set of basis functions spanning V_h. The unknown solution u can be approximated as a linear combination of the basis functions {ψ_j}_{j=1}^N,

u_h = ∑_{j=1}^N û_j ψ_j.   (2.3)

¹ Non-linear problems require refinements that are out of the scope of this review.

The equation in (2.2) can then be reformulated as:

∑_{j=1}^N û_j a(ψ_j, φ_i) = L(φ_i),   i = 1, 2, ..., N   (2.4)

From the solution of the linear system

Au = b (2.5)

where

A_{ij} = a(ψ_j, φ_i)
b_i = L(φ_i)   (2.6)

we determine the set of degrees of freedom u needed to express u_h. The matrix A and the vector b can be seen as the discrete operators arising from the bilinear form a and from the linear form L for the given choice of basis functions. The underlying assumption of this formulation is that the finite dimensional subspaces U_h and V_h can be constructed. As elaborated in Section 2.1.2, U_h and V_h are constructed by decomposing the PDE domain into a set of finite elements and by combining the local function spaces defined on these elements.
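To make the discrete problem (2.4)–(2.6) concrete, the following self-contained sketch (not taken from the thesis) assembles and solves the 1D Poisson problem −u'' = 1 with homogeneous Dirichlet conditions using piecewise-linear basis functions; A and b correspond directly to the operators in (2.6), and the system solved is (2.5).

# A minimal sketch (not from the thesis): Galerkin discretization of
# -u'' = 1 on (0, 1) with u(0) = u(1) = 0, using N piecewise-linear
# ("hat") basis functions on a uniform grid. It builds A_ij = a(psi_j, phi_i)
# and b_i = L(phi_i) as in (2.6) and solves Au = b as in (2.5).
import numpy as np

N = 9                      # interior degrees of freedom
h = 1.0 / (N + 1)          # uniform element size

# Stiffness matrix: a(psi_j, phi_i) = int_0^1 psi_j' phi_i' dx
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = 2.0 / h
    if i > 0:
        A[i, i - 1] = A[i - 1, i] = -1.0 / h

# Load vector: L(phi_i) = int_0^1 1 * phi_i dx
b = np.full(N, h)

u = np.linalg.solve(A, b)  # degrees of freedom of u_h

# The exact solution is u(x) = x(1 - x)/2; u_h coincides with it at the nodes.
x = np.linspace(h, 1.0 - h, N)
assert np.allclose(u, x * (1.0 - x) / 2.0)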

2.1.2 Local and Global Function Spaces

In the finite element method, the domain Ω of the PDE is partitioned into a finite set of disjoint cells {K}; that is, ⋃ K = Ω and ⋂ K = ∅. This forms a mesh.

Finite elements   As originally formalized in Ciarlet [1976], a finite element is a triple ⟨K, P_K, L_K⟩, where:

• K is a cell in the mesh with non-empty interior and piecewise smooth boundary;

• P_K is a finite dimensional "local" function space of dimension n_K;

• the set of degrees of freedom (nodes) L_K is a basis {l_1^K, l_2^K, ..., l_{n_K}^K} for P_K', the dual space of P_K.

This definition allows imposing constraints on the set of basis functions {φ_1^K, φ_2^K, ..., φ_{n_K}^K} spanning P_K. For instance, to enforce a nodal basis for P_K – a particularly useful property for expressing solutions in U_h – we can impose that the relationship

l_i^K(φ_j^K) = δ_{ij},   i, j = 1, 2, ..., n_K   (2.7)

where δ_{ij} is the Kronecker delta, must be satisfied. This allows the expression of any v ∈ P_K as

v = ∑_{i=1}^{n_K} l_i^K(v) φ_i^K.   (2.8)

Each linear functional in L_K is used to evaluate one degree of freedom of v in terms of the chosen nodal basis. In other words, we can refer to both the coefficients u introduced in the previous section and L_K as the degrees of freedom.

Example: the triangular degree 1 Lagrange element   The following example is extracted from Logg et al. [2012]. Consider a triangular cell K and let P_K be the space of polynomials of order 1 on K. Let L_K be the set of bounded linear functionals representing point evaluation at the vertices x^i (i = 1, 2, 3) of K such that

l_i^K : P_K → R
l_i^K(v) = v(x^i)   (2.9)

Since if v is zero at each vertex then v must be zero everywhere, L_K really is a basis for P_K', so what we have defined is indeed a finite element. In particular, if we take x^1 = (0, 0), x^2 = (1, 0), x^3 = (0, 1), we have that the nodal basis is given by:

φ_1(x) = 1 − x_1 − x_2,   φ_2(x) = x_1,   φ_3(x) = x_2.   (2.10)
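As a quick, self-contained illustration (not part of the thesis), the nodal property (2.7) of this basis can be checked numerically: evaluating each φ_i at each vertex x^j yields the identity matrix.

# A minimal sketch verifying the nodal property (2.7) for the degree 1
# Lagrange basis of (2.10): l_j(phi_i) = phi_i(x^j) = delta_ij.
import numpy as np

def phi(i, x):
    """Evaluate the i-th P1 basis function at a point x = (x1, x2)."""
    return [1.0 - x[0] - x[1], x[0], x[1]][i]

vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
M = np.array([[phi(i, v) for v in vertices] for i in range(3)])
assert np.allclose(M, np.eye(3))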

Figure 2.1: Affine mapping x = G_K(X) from the reference element K̂ (vertices X^1 = (0, 0), X^2 = (1, 0), X^3 = (0, 1)) to an element K (vertices x^1, x^2, x^3).

Construction of the global function spaces   A local-to-global mapping allows the finite elements to be patched together to form a global function space, for instance the set of trial functions U_h = span{ψ_j}_{j=1}^N introduced in Section 2.1.1. A local-to-global mapping is a function

ι_K : [1, n_K] → [1, N]   (2.11)

that maps the local degrees of freedom L_K to the global degrees of freedom L. The mappings ι_K, together with the choice of L_K, determine the continuity of a function space or, in simpler words, the continuity of a function throughout the domain Ω. The reader is invited to refer to Logg et al. [2012] for a comprehensive description of this step.

2.1.3 The Reference Element

One of the central aspects of the finite element method is that global function spaces are often defined in terms of a reference finite element ⟨K̂, P̂, L̂⟩ and a set of invertible mappings {G_K}_K from K̂ to each cell K in the mesh, such that K = G_K(K̂). This situation is illustrated in Figure 2.1. For each K, G_K also allows P_K and L_K to be generated. The complexity of this process depends on the mapping itself. In the simplest case, the mapping is affine; that is, expressible as G_K(x̂) = M_K x̂ + w_K, where M_K and w_K are, respectively, some matrix and vector.
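For the triangular reference element of Figure 2.1, the affine map can be written down explicitly from the vertex coordinates of K. The short sketch below (an illustration, not thesis code) builds M_K and w_K and maps the reference vertices onto a target triangle.

# A minimal sketch of the affine map G_K(X) = M_K X + w_K for a triangle.
# The reference vertices X^1=(0,0), X^2=(1,0), X^3=(0,1) are sent to the
# vertices x1, x2, x3 of a mesh cell K.
import numpy as np

def affine_map(x1, x2, x3):
    """Return (M_K, w_K) mapping the reference triangle onto (x1, x2, x3)."""
    x1, x2, x3 = map(np.asarray, (x1, x2, x3))
    M = np.column_stack((x2 - x1, x3 - x1))  # Jacobian of the map
    return M, x1

M, w = affine_map((1.0, 1.0), (3.0, 1.0), (1.0, 2.0))
ref_vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mapped = ref_vertices @ M.T + w
assert np.allclose(mapped, [[1.0, 1.0], [3.0, 1.0], [1.0, 2.0]])
# det(M) plays the role of det G_K' in the quadrature formulas below.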

11 2.1.4 Assembly

The assembly is the phase in which the matrix A and the vector b in (2.6) are constructed. This is accomplished by adding the contributions from each K to A and b. Let us consider the bilinear form a. Since the operator is linear, we can express a as

a = ∑_K a_K   (2.12)

where a_K is an element bilinear form. We can then define the local element matrix

A_i^K = a_K(ψ_{i_1}^K, φ_{i_2}^K)   (2.13)

where i ∈ I_K, the index set on A^K. That is, I_K = {(1, 1), ..., (n_U, n_V)}, with n_U and n_V representing the number of degrees of freedom for the local trial functions ψ^K ∈ U_h^K and the local test functions φ^K ∈ V_h^K. The element matrix A^K is therefore a (typically dense) matrix of dimension n_U × n_V. Now let ι_K^U and ι_K^V be the local-to-global mappings for the local discrete function spaces U_h^K and V_h^K. We can define, for each K, the collective local-to-global mapping ι_K : I_K → I such that

ι_K(i) = (ι_K^U(i_1), ι_K^V(i_2)),   ∀ i ∈ I_K.   (2.14)

This simply maps a pair of local degrees of freedom to a pair of global degrees of freedom. Let T be the subset of the cells in the mesh in which ψ_{i_1} and φ_{i_2} are both non-zero. Note that here we are considering the global functions whose restrictions to K give ψ_{i_1}^K and φ_{i_2}^K. By construction, ι_K is invertible if K ∈ T. At this point, we have all the ingredients to formulate the computation of A as the sum of local contributions from the elements in the mesh:

A_i = ∑_{K ∈ T} a_K(ψ_{i_1}, φ_{i_2}) = ∑_{K ∈ T} a_K(ψ_{(ι_K^U)^{-1}(i_1)}^K, φ_{(ι_K^V)^{-1}(i_2)}^K) = ∑_{K ∈ T} A_{ι_K^{-1}(i)}^K   (2.15)

Similar conclusions may be drawn for the linear form L. We observe that this computation can be implemented as a single iteration over all cells in the mesh. On each cell, A^K is computed and added to A using the corresponding inverse map. This approach is particularly efficient because it only evaluates the non-zero entries in the sparse matrix A. More trivial implementations are possible, although they are rarely used in practice because of their higher computational cost. We conclude with a clarification concerning the terminology. The assembly process is often described as a two-step procedure: local assembly and global assembly. The former consists of computing the contributions of each element (i.e., the element matrices A^K); the latter represents the "coupling" of all A^K into A. As we shall see, one of the main subjects of this thesis is the computational optimization of the local assembly phase.
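The cell-by-cell structure of (2.15) maps directly onto code. The sketch below is illustrative only; names such as local_assembly and cell_to_dofs are hypothetical placeholders, not Firedrake or PyOP2 API. It shows the single loop over mesh cells, with the local-to-global map scattering each element matrix into the global sparse operator.

# A minimal sketch of the assembly loop of (2.15): for each cell, compute the
# dense element matrix (local assembly) and scatter it into the global sparse
# matrix through the local-to-global map (global assembly).
# `cells`, `cell_to_dofs` (the map iota_K) and `local_assembly` are
# hypothetical placeholders standing in for mesh data and a generated kernel.
import numpy as np
from scipy.sparse import lil_matrix

def assemble(cells, cell_to_dofs, local_assembly, num_dofs):
    A = lil_matrix((num_dofs, num_dofs))
    for K in cells:
        dofs = cell_to_dofs[K]          # local-to-global map for cell K
        A_K = local_assembly(K)         # dense element matrix
        for i_local, i in enumerate(dofs):
            for j_local, j in enumerate(dofs):
                A[i, j] += A_K[i_local, j_local]
    return A.tocsr()                    # only non-zero entries are stored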

2.1.5 Local Assembly Example: from Math to Code

Because of its relevance to this thesis, we illustrate local assembly with a concrete example: the evaluation of the element matrix for a Laplacian operator.

2.1.5.1 Specification of the Laplacian operator

Consider the weighted Laplace equation

−∇ · (w ∇u) = 0   (2.16)

Z a(v, u) = w v u dx (2.17) Ω ∇ · ∇ The domain Ω of the equation is partitioned into a set of cells (elements) T such that S T = Ω and T T = ∅. Assuming for simplicity that the sets of trial and test functions are the same and by defining φK as the set of { i } local basis functions spanning U and V on the element K, we can express the local element matrix as

A_{ij}^K = ∫_K w ∇φ_i^K · ∇φ_j^K dx   (2.18)

The element vector L can be determined in an analogous way.

2.1.5.2 Monomials

In this example, the element tensor is expressed as a single integral over the cell domain. In general, it is expressed as a sum of integrals over K, each integral being the product of derivatives of functions from sets of discrete spaces and, possibly, functions of some spatially varying coefficients. We refer to such integrals as the monomials of the form.

2.1.5.3 Assembly with Quadrature

A_{ij}^K is numerically evaluated by means of a quadrature scheme. A quadrature scheme approximates an integral by evaluating the integrand at a set of quadrature points, each point multiplied with some suitable quadrature weight, and summing these values. By mapping the computation to a reference element as explained in Section 2.1.3 and by using the assembly procedure detailed in Section 2.1.4, a quadrature scheme for the Laplacian operator on K is as follows

A_{ij}^K = ∫_K w ∇φ_i^K · ∇φ_j^K ≈ ∑_{k=1}^{N_q} W_k w(x^k) ∇φ_i^K(x^k) · ∇φ_j^K(x^k) |det G_K'(x^k)|,   (2.19)

where {x^1, x^2, ..., x^{N_q}} is the set of N_q quadrature points, and {W_1, W_2, ..., W_{N_q}} the corresponding set of quadrature weights, scaled such that ∑_{k=1}^{N_q} W_k = |K̂|. To compute a local basis function φ_i^K from a reference element basis function Φ_i we exploit the inverse map G_K^{-1}, which allows us to write φ_i^K as φ_i^K = Φ_i ∘ G_K^{-1}. To evaluate the gradient of a basis function φ_i^K at a quadrature point x^k, with x^k = G_K(X^k) and X^k ∈ K̂, we therefore have to compute a matrix-vector product

∇_x φ_i^K(x^k) = ((G_K')^{-1})^T(x^k) ∇_X Φ_i(X^k).   (2.20)

The term (G_K')^{-1} represents the inverse of the Jacobian matrix originating from the change of coordinates.

The resulting scalar-valued expression for each entry A_{ij}^K, assuming Ω to be a domain of dimension d, reads as:

A_{ij}^K = ∑_{k=1}^{N_q} ∑_{α_3=1}^{n} φ_{α_3}^K(X^k) w_{α_3} ∑_{α_1=1}^{d} ∑_{α_2=1}^{d} ∑_{β=1}^{d} (∂X_{α_1}/∂x_β) (∂φ_i^K(X^k)/∂X_{α_1}) (∂X_{α_2}/∂x_β) (∂φ_j^K(X^k)/∂X_{α_2}) det G_K' W^k.   (2.21)

2.1.5.4 Assembly with Tensor Contraction

Consider the case in which G_K : K̂ → K is an affine mapping. Exploiting linearity, associativity and distributivity of the involved mathematical operators, we can rewrite (2.21) as

A_{ij}^K = ∑_{α_1=1}^{d} ∑_{α_2=1}^{d} ∑_{α_3=1}^{n} det G_K' w_{α_3} ∑_{β=1}^{d} (∂X_{α_1}/∂x_β) (∂X_{α_2}/∂x_β) ∫_{K̂} φ_{α_3} (∂φ_i/∂X_{α_1}) (∂φ_j/∂X_{α_2}) dX.   (2.22)

A generalization of this transformation has been proposed in Kirby and Logg [2006]. By only involving reference element terms, the integral in the equation can be pre-evaluated and stored in temporary variables. The evaluation of the local tensor can then be abstracted as

A_{ij}^K = ∑_α A_{ijα}^0 G_K^α   (2.23)

in which the pre-evaluated "reference tensor" A_{ijα}^0 and the cell-dependent "geometry tensor" G_K^α

A_{ijα}^0 = ∫_{K̂} φ_{α_3} (∂φ_i/∂X_{α_1}) (∂φ_j/∂X_{α_2}) dX,   (2.24)

G_K^α = det G_K' w_{α_3} ∑_{β=1}^{d} (∂X_{α_1}/∂x_β) (∂X_{α_2}/∂x_β)   (2.25)

are exposed.
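The structure of (2.23) is a plain tensor contraction, so once the reference tensor is tabulated the per-cell work reduces to a small contraction with the geometry tensor. The sketch below (an illustration with made-up tensor sizes, not code from FFC or the thesis) shows this pattern with numpy.

# A minimal sketch of the tensor contraction mode of (2.23):
# A_K[i, j] = sum_alpha A0[i, j, alpha] * G_K[alpha].
# Shapes are illustrative: n basis functions, m geometry components.
import numpy as np

n, m = 3, 4
rng = np.random.default_rng(0)
A0 = rng.random((n, n, m))      # reference tensor, pre-evaluated once
G_K = rng.random(m)             # geometry tensor, recomputed per cell

A_K = np.einsum('ija,a->ij', A0, G_K)   # element matrix for cell K

# Equivalent explicit loop, as a generated C kernel would express it.
A_ref = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for a in range(m):
            A_ref[i, j] += A0[i, j, a] * G_K[a]
assert np.allclose(A_K, A_ref)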

2.1.5.5 Qualitative Comparison of Quadrature and Tensor Contraction

Depending on form and discretization, the relative performance of the two approaches can vary quite dramatically. The presence of derivatives

or coefficient functions in the input form increases the rank of the geometry tensor, which makes the traditional quadrature mode preferable for "complex" forms, due to a lower operation count. On the other hand, the tensor contraction mode leads to significant speed-ups in a wide class of "simple" forms, in which the geometry tensor remains sufficiently small. The discretization, particularly the relative polynomial order of trial, test, and coefficient functions, also plays a key role in the resulting operation count.

2.1.5.6 Towards a new Algorithm for Local Assembly

The quadrature and tensor contraction modes have been implemented in the FEniCS Form Compiler [Kirby and Logg, 2006], which we review in later sections. In this compiler, a heuristic is used to choose the most suitable mode for a given form. It consists of analysing each monomial in the form, counting the number of derivatives and coefficient functions, and checking whether this number is greater than a constant found empirically [Logg et al., 2012]. In Chapter 4, a novel algorithm that goes beyond the dichotomy between quadrature and tensor contraction modes will be introduced.
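The kind of heuristic described above can be pictured as follows. This is a hypothetical sketch for illustration only; it does not reproduce FFC's actual implementation, attribute names, or threshold.

# A hypothetical sketch of a mode-selection heuristic of the kind described
# above: count derivatives and coefficients per monomial and compare against
# an empirically chosen threshold. Not FFC's actual code.
def choose_mode(monomials, threshold=3):
    """Return 'quadrature' or 'tensor' for a list of monomials.

    Each monomial is assumed to expose `num_derivatives` and
    `num_coefficients` attributes (illustrative placeholders).
    """
    for m in monomials:
        if m.num_derivatives + m.num_coefficients > threshold:
            # A large geometry tensor is expected: quadrature is cheaper.
            return 'quadrature'
    return 'tensor'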

2.1.5.7 Code Examples

A possible C implementation of (2.21) is illustrated in Listing 1. We assume a domain of dimension d = 2 and polynomial order 1 Lagrange elements. The values at the quadrature points of the derivatives of the basis functions are pre-tabulated in the B and D arrays (representing, respectively, the derivatives with respect to the coordinates x and y). The values at the quadrature points of the basis functions spanning the field w are pre-tabulated in the array C. Pre-tabulation, which is made possible by mapping the computation to a reference element, is fundamental to speed up the local assembly phase. The summation over the N_q = 6 quadrature points is implemented by the i loop. The summation over α_3 for expressing w is implemented by the loop r. The summations over the spatial dimensions α_1, α_2 and β have been expanded in the "assembly expression" that evaluates the local element matrix A. The K array includes the four components of the inverse of the Jacobian matrix for the change of coordinates.

LISTING 1: A possible implementation of Equation 2.21, assuming a 2D triangular mesh and polynomial order 1 Lagrange basis functions.

void weighted_laplace(double A[3][3], double **coords, double w[3])
{
  // Compute Jacobian
  double J[4];
  compute_jacobian_triangle_2d(J, coords);

  // Compute Jacobian inverse and determinant
  double K[4];
  double detJ;
  compute_jacobian_inverse_triangle_2d(K, detJ, J);
  const double det = fabs(detJ);

  // Quadrature weights
  static const double W[6] = { 0.5, ... };

  // Basis functions
  static const double B[6][3] = {{ ... }};
  static const double C[6][3] = {{ ... }};
  static const double D[6][3] = {{ ... }};

  for (int i = 0; i < 6; ++i) {
    double f0 = 0.0;
    for (int r = 0; r < 3; ++r) {
      f0 += (w[r] * C[i][r]);
    }
    for (int j = 0; j < 3; ++j) {
      for (int k = 0; k < 3; ++k) {
        A[j][k] += (((((K[1]*B[i][k]) + (K[3]*D[i][k])) *
                      ((K[1]*B[i][j]) + (K[3]*D[i][j]))) +
                     (((K[0]*B[i][k]) + (K[2]*D[i][k])) *
                      ((K[0]*B[i][j]) + (K[2]*D[i][j]))))*det*W[i]*f0);
      }
    }
  }
}


The evaluation of integrals becomes more computationally expensive as the complexity of the variational form grows, in terms of the number of coefficients and differential or algebraic operators involved. A scenario in which the computation of the local element tensor requires hundreds or even thousands of floating point operations is not pathological. An excerpt from one such example is shown in Listing 2: here, in the main assembly expression, 14 distinct arrays are accessed (with the same array referenced multiple times within the expression) along with several other constants. The loop trip counts are also larger due to the different choice of finite elements.

17 LISTING 2: Local assembly implementation for a Burgers problem on a 3D mesh using polynomial order q = 1 Lagrange basis functions.

void burgers(double A[12][12], double **coords, double **w)
{
  // Compute Jacobian
  double J[9];
  compute_jacobian_triangle_3d(J, coords);

  // Compute Jacobian inverse and determinant
  double K[9];
  double detJ;
  compute_jacobian_inverse_triangle_3d(K, detJ, J);
  const double det = fabs(detJ);

  // Quadrature weights
  static const double W[5] = { ... };

  // Basis functions
  static const double B[5][12] = {{ ... }};
  static const double C[5][12] = {{ ... }};
  // 11 other basis function definitions
  ...

  for (int i = 0; i < 5; i++) {
    double f0 = 0.0;
    // 10 other declarations (f1, f2, ...)
    ...
    for (int r = 0; r < 12; r++) {
      f0 += (w[r][0]*C[i][r]);
      // 10 analogous statements (f1, f2, ...)
    }
    for (int j = 0; j < 12; j++) {
      for (int k = 0; k < 12; k++) {
        A[j][k] += ...
          + ((K[5]*F9) + (K[8]*F10))*Y1[i][j]) + ... +
          + (((K[0]*C[i][k]) + (K[3]*D[i][k]) + (K[6]*E[i][k]))*Y2[i][j]))*f11) +
          + (((K[2]*C[i][k]) + (K[5]*D[i][k]) + (K[8]*E[i][k]))*((K[2]*E[i][j]) + ...))) +
          + ...)*det*W[i];
      }
    }
  }
}

2.2 Software Abstractions for Partial Differential Equation Solvers

The performance optimizations studied in this thesis target different layers of abstraction. In this section, we review these abstractions and the established software that implements them.

18 2.2.1 Automating the Finite Element Method

The need for rapid implementation of high performance, robust, and portable finite element methods has led to approaches based on automated code generation. This has proven extremely successful in the context of the FEniCS [Logg et al., 2012] and Firedrake [Rathgeber et al., 2015] projects. In these frameworks, the weak variational formulation of a problem is expressed at a high level by means of a domain-specific language. The mathematical specification is manipulated by a form compiler that generates a representation of the assembly operators. Such a representation may first be transformed for performance optimization and, subsequently, translated into C code, compiled and executed. This entire process occurs at run-time: both FEniCS and Firedrake were indeed written in Python to simplify the analysis of the top-level domain-specific language. When the operators are assembled, a linear system needs to be solved. For this, the PETSc library [Balay et al., 2015] is employed. In the following, we expand on the components of this tool-chain that are relevant to the later chapters. We focus on Firedrake, rather than FEniCS, because all algorithms and techniques developed in this thesis have been integrated with this framework.

2.2.1.1 Problem Specification

Firedrake uses a mathematical language called UFL, the Unified Form Language [Alnæs et al., 2014], to symbolically express the variational formulation of a problem. For this, it provides different kinds of operators, including differential and algebraic operators, as well as elementary functions. The language is unrelated to meshes, function spaces, and solvers. UFL starts by analyzing the input form to collect some useful information; it then applies some preliminary transformations, such as automatic differentiation; eventually, it emits a representation suitable for the underlying compiler. The UFL representation of the weighted Laplace operator shown in (2.17) is given in Listing 3. When constructing a finite element, three pieces of information are specified: family, cell, and polynomial degree. The family represents the element type. UFL supports several families, including the traditional Continuous Galerkin and Discontinuous Galerkin elements as well as mixed elements such as H(div) and H(curl). This makes it possible to solve problems with different requirements on the continuity of the functions, as thoroughly described in Logg et al. [2012].

19 LISTING 3: UFL specification of the weighted Laplace operator defined in (2.17). In orange the keywords of the language.

element = FiniteElement('Lagrange', triangle, 1)

u = TrialFunction(element)
v = TestFunction(element)
w = Coefficient(element)

a = w*dot(grad(v), grad(u))*dx

The cell represents the shape of the reference element: possible values include triangle, quadrilateral and tetrahedron. The polynomial degree drives the number of degrees of freedom in an element. UFL will be used to generate kernels for experimentation; a deep understanding of the language, however, is not required for this thesis.
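As a further illustration of how family, cell, and degree enter the element construction, the following example (written by analogy with Listing 3, not taken from the thesis) declares a vector-valued Lagrange element of degree 2 on tetrahedra and uses it in a mass-matrix-like form.

# An illustrative UFL snippet, by analogy with Listing 3 (not from the thesis).
# It assumes the UFL names are in scope, e.g. via `from ufl import *`.
element = VectorElement('Lagrange', tetrahedron, 2)

u = TrialFunction(element)
v = TestFunction(element)

# Bilinear form of a (vector) mass matrix.
a = dot(u, v)*dx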

2.2.1.2 Form Compilers

The transformed UFL (e.g., after automatic differentiation has been applied) is passed to a form compiler, where a representation of the assembly operators is constructed. The FEniCS Form Compiler, or FFC, was originally used by Firedrake for this task. FFC, which supports the quadrature and tensor contraction modes illustrated in Section 2.1.5, was later modified to emit abstract syntax trees (ASTs) instead of C code. More recently, FFC has been supplanted by the Two-Stage Form Compiler, or TSFC. Just like FFC, TSFC emits ASTs. The optimizations described in Chapters 4 and 5 are implemented by manipulating these ASTs in a lower-level compiler, COFFEE, whose structure will be outlined in Chapter 6. TSFC has two main features:

• It is a structure-preserving compiler, in that it keeps intact the structure of algebraic operations (e.g., index sums, inner products), rather than committing to a specific implementation. This lets the lower-level compiler explore the space of all possible transformations.

• In contrast to FFC, it supports the compilation of complicated forms characterized by extensive use of tensor algebra. TSFC can efficiently identify repeated sub-expressions and assign them to temporary variables, thus drastically reducing the code generation time.

Summarizing, in Firedrake the translation of the problem specification is characterized by a neat separation of concerns:

• UFL is the mathematical language to express weak formulations of problems.

• TSFC progressively lowers the finite element specification by producing a general AST. Some nodes in the ASTs are decorated to keep track of domain properties (e.g., linearity of an expression in test and trial functions), exploitable in later optimization stages.

• COFFEE applies code transformations to improve the performance of the operators returned by TSFC.

The conception and the design of the COFFEE layer, as well as its optimization algorithms, are amongst the main contributions of this thesis. These topics are treated in Chapters 4, 5 and 6.
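The division of labour described above can be sketched as a three-stage pipeline. The snippet below is purely illustrative: compile_form, optimize_ast and generate_c_code are hypothetical placeholder names, not the actual TSFC or COFFEE interfaces.

# A hypothetical sketch of the Firedrake code-generation pipeline described
# above. `compile_form`, `optimize_ast` and `generate_c_code` are placeholder
# names, not the real TSFC/COFFEE APIs; the point is the separation of concerns.
def generate_kernel(ufl_form):
    ast = compile_form(ufl_form)      # form compiler (TSFC): UFL -> kernel AST,
                                      # preserving algebraic structure
    ast = optimize_ast(ast)           # COFFEE: operation-count and low-level
                                      # optimizations on the AST
    return generate_c_code(ast)       # C code, later JIT-compiled and executed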

2.2.1.3 Iteration over the Mesh

Finite element problems require the application of computational kernels over the discretized equation domain. In Firedrake, this is accomplished through PyOP2, a domain specific language embedded in Python relying on just-in-time compilation and execution [Markall et al., 2013]. Several kinds of kernels may need to be executed in a Firedrake program. Computationally expensive kernels arise from the assembly operators presented in Section 2.1.4. Other kernels are required for the application of boundary conditions as well as for the interpolation and projection of fields. The fact that a kernel is a local operation – its application on an element of the mesh is independent of the application on any other element – suits parallel execution naturally. This does not mean, however, that parallelizing a finite element computation is straightforward. As clarified in Section 2.2.2, a kernel can update a field either directly or indirectly. In the latter case, a subset of values, for instance the degrees of freedom at the boundary between two elements, may receive contributions from two different processes/threads, which requires non-trivial coordination.

PyOP2 supports the parallel application of kernels over unstructured meshes, a key requirement for the finite element method. It also provides global data structures such as sparse matrices, which are essential when it comes to solving linear systems. The PyOP2 abstraction implements another separation of concerns. The kernels, which encode the numerics of the finite element formulation of a problem, are produced at higher layers through Firedrake's language and form compiler, while the parallel application of kernels over the mesh is entirely handled by PyOP2. The PyOP2 layer, and its relationship with the OP2 library [Giles et al., 2011], are documented in more detail in Section 2.2.2.

2.2.1.4 Unstructured Meshes

PyOP2 represents meshes by means of sets and fixed-arity maps. Possible sets are topological entities such as cells or degrees of freedom. A map is an object describing the adjacency relationship between the elements of two distinct sets (e.g., a map from cells to degrees of freedom). Sets and maps are constructed at a higher layer, in particular by Firedrake with the help of an additional software module, PETSc's DMPlex [Lange et al., 2015]. DMPlex is a data management abstraction representing unstructured mesh data through directed acyclic graphs. It relieves PyOP2 of the duty of handling complex operations such as partitioning for parallel execution and reordering for efficient memory accesses. The abstraction used in DMPlex is very flexible, so a number of optimizations are enabled. For instance, the fact that partition boundaries can be made arbitrarily deep facilitates communication-computation overlap, as well as the introduction of low level transformations such as loop fusion (a contribution of this thesis, Chapter 3).

2.2.1.5 Solving systems of equations

One of the key ideas behind the success of Firedrake is relying on available software to implement specific components of the finite element method. This philosophy is also applied to the solution of linear systems, for which

the Portable, Extensible Toolkit for Scientific Computation library [Balay et al., 2015], or PETSc, is employed. PETSc is entirely implemented in C, although its Python interface petsc4py makes it accessible to a framework like Firedrake. It provides a wide range of parallel algorithms for solving linear and nonlinear systems, as well as a considerable number of options to drive the solvers. Similarly to Firedrake, PETSc never attempts to reinvent science: many functionalities are implemented on top of existing libraries (e.g., BLAS) or offered via third-party implementations through suitable wrappers.

2.2.2 The PyOP2 and OP2 Libraries

PyOP2 is inspired by and shares many ideas with OP2² [Giles et al., 2011], although it differs from it in a few, yet significant, ways. In this section, we first describe the common features and then focus on the aspects that are peculiar to PyOP2.

2.2.2.1 Programming Model

OP2 offers abstractions for modeling an unstructured mesh, in terms of sets (e.g. vertices, edges), maps between sets (e.g., a map from edges to vertices to express the mesh topology), and datasets associating data to each set element (e.g. 3D coordinates to each vertex).

OP2 programs are expressed as sequences of parallel loops, or op_par_loops. A simplistic example is provided in Listing 4. Each loop applies a computational kernel to every element in a mesh set. These kernels can access data associated to either the loop iteration set (direct access) or to other sets (indirect access) through maps. Maps are implemented as arrays of indices, so an indirect access requires, from a computational viewpoint, an additional memory access (e.g., A[B[i]]).

The running example includes three parallel loops. Let us focus on the first of them. This loop iterates over the edges set (whose declaration is omitted for brevity), as indicated by the first parameter of the op_par_loop function. The loop applies kernel1 to every element in the indicated iteration space, i.e. to each edge. kernel1 reads data associated to an

²The name OP2 indicates that this is the second software engineering iteration of the OPlus library, or Oxford Parallel Library.

LISTING 4: Section of a toy (Py)OP2 program. OP2 syntax is used.

void kernel1 (double *x, double *tmp1, double *tmp2) {
  *tmp1 += *x;
  *tmp2 += *x;
}

// loop over edges
op_par_loop (edges, kernel1,
             op_arg_dat (x,    -1, OP_ID,          OP_READ),
             op_arg_dat (temp,  0, edges2vertices, OP_INC),
             op_arg_dat (temp,  1, edges2vertices, OP_INC))

// loop over cells
op_par_loop (cells, kernel2,
             op_arg_dat (temp,  0, cells2vertices, OP_INC),
             op_arg_dat (temp,  1, cells2vertices, OP_INC),
             op_arg_dat (temp,  2, cells2vertices, OP_INC),
             op_arg_dat (res,  -1, OP_ID,          OP_READ))

// loop over edges
op_par_loop (edges, kernel3,
             op_arg_dat (temp,  0, edges2vertices, OP_INC),
             op_arg_dat (temp,  1, edges2vertices, OP_INC))

edge (direct access) and increments the two adjacent vertices with the read value (indirect access). This information, essential for parallelization and optimization, is indicated to OP2 through the access descriptors, or op_arg_dats. The first access descriptor uses the special keyword OP_ID for the data array x, which simply means that x is being accessed directly, i.e. with the identity function. In addition, it specifies through OP_READ that the access is of type read. The second and third access descriptors express that temp, a dataset associated with vertices, will be incremented (OP_INC) by kernel1, and that this increment occurs indirectly through the map edges2vertices. The index array edges2vertices maps each edge index into the indices of its two incident vertices, with the integer values of 0 and 1 in the access descriptors indicating which of the two vertices should be considered.
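As a concrete illustration of these semantics, the following plain-Python sketch shows what the first op_par_loop effectively computes. The array sizes and values are invented for the example, and the snippet deliberately bypasses the OP2/PyOP2 runtime.

import numpy as np

num_edges, num_vertices = 4, 5
x = np.random.rand(num_edges)           # dataset on edges, read directly (OP_ID, OP_READ)
temp = np.zeros(num_vertices)           # dataset on vertices, incremented indirectly (OP_INC)
edges2vertices = np.array([[0, 1], [1, 2], [2, 3], [3, 4]])  # map of arity 2

for e in range(num_edges):              # iteration over the edges set
    temp[edges2vertices[e, 0]] += x[e]  # increment through map index 0
    temp[edges2vertices[e, 1]] += x[e]  # increment through map index 1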

2.2.2.2 Execution Model for Shared-Memory Parallelism

A fundamental property of parallel loops is that the execution order of iterations does not influence the result. The shared-memory parallelization of a loop in OP2 (and PyOP2) exploits this property. In essence, the iteration space is partitioned and different partitions are executed concurrently by distinct threads, with the restriction that indirect increments to

data values shared by two or more partitions must be serialized.

Consider again the first loop of the running example. The edges iteration set is partitioned. Partitions that share values associated with temp are considered adjacent (from a topological point of view, these are the partitions connected through one or more vertices). The accesses to temp are of type increment, so the adjacent partitions are considered conflicting and are assigned different colors. At run time, partitions with the same color can be executed in parallel, while partitions with different colors will be scheduled serially. The execution of elements inside a partition is serialized, since each partition is executed atomically by a single thread. This scheme prevents race conditions by construction.

This process is applied on a per-loop basis and is implemented exploiting the information provided through the access descriptors. Optionally, external libraries can be used to reorder the partitions so that data locality is improved.
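A minimal sketch of such a per-loop coloring is given below, reusing the edges2vertices map of the previous example; it is a greedy illustration, not the scheme actually implemented by OP2 or PyOP2.

def color_partitions(partitions, edges2vertices):
    """partitions: list of lists of edge indices; returns one color per partition."""
    # Vertices that each partition increments indirectly
    written = [{v for e in part for v in edges2vertices[e]} for part in partitions]
    colors = [None] * len(partitions)
    for p in range(len(partitions)):
        # Colors already taken by conflicting (vertex-sharing) partitions
        taken = {colors[q] for q in range(p) if written[p] & written[q]}
        colors[p] = next(c for c in range(len(partitions)) if c not in taken)
    return colors

# Partitions with the same color can run concurrently; colors are processed in sequence.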

2.2.2.3 Execution Model for Distributed-Memory Parallelism

Distributed-memory parallelism is conceptually simpler than shared-memory parallelism. During the OP2 initialization phase, sets, maps, and datasets are partitioned and then distributed to different processes. For executing a parallel loop, MPI messages may need to be exchanged to update any out-of-date values along the partition boundaries. This phase is overlapped with the execution of a set of local, or "core", iterations. Once both phases have finished, the remaining boundary iterations are computed. To implement this parallelization scheme, the iterations of each locally stored OP2 set are divided into four contiguous regions:

Core Iterations owned that can be processed without reading halo data.

Owned Iterations owned that can be processed only by reading halo data.

Exec halo Off-process iterations that are redundantly executed because they indirectly access owned iterations.

Non-exec halo Off-process iterations that only need to be read to compute the exec halo iterations.

This situation is depicted in Figure 2.2. Clearly, a good partitioning maximizes the ratio between the sizes of the core and non-core regions.


Figure 2.2: Distributed-memory parallelism in OP2 and PyOP2. Matching colors represent identical subsets of iterations. The image was inspired by an example in Rathgeber [2014].
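The overlap of communication with the core computation can be sketched as follows. The region boundaries, the halo object and its methods are hypothetical placeholders introduced for this illustration, not the actual OP2 or PyOP2 interface.

def execute_parallel_loop(kernel, iterations, core_end, exec_end, halo):
    requests = halo.begin_exchange()          # start non-blocking halo updates
    for i in iterations[:core_end]:           # core: no halo data needed
        kernel(i)
    halo.wait(requests)                       # halo values are now up to date
    for i in iterations[core_end:exec_end]:   # owned and exec-halo iterations
        kernel(i)
    # non-exec halo entries are only read while computing the exec halo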

2.2.2.4 PyOP2 Features

PyOP2 distinguishes itself from OP2 in a number of ways.

• An OP2 program is analyzed statically. A source-to-source compiler produces a legal C program, which can subsequently be compiled and executed. PyOP2, on the other hand, is implemented in Python, so code generation occurs at run-time; the generated C code applies a computational kernel over a set of elements. One problem with OP2 is that its source-to-source compiler is quite limited in the analysis of input programs. PyOP2 does not have this problem, because objects like sets, maps and parallel loops are constructed and inspected during the interpretation of the source program itself. This makes PyOP2 easily composable with other layers, as proven in the context of the Firedrake project. A hierarchy of "software caches" is however needed in PyOP2 to minimize the overhead due to redundant actions (e.g., generating code each time the same parallel loop is encountered).

• Despite sharing relevant constructs (e.g., sets, maps), the PyOP2 domain specific language tends to be more compact and expressive than the OP2 counterpart. This is again a consequence of the run-time analysis.

• PyOP2 supports global sparse matrices, basic linear algebra operations and mixed types (e.g., mixed sets), which are essential features for integration with a finite element framework. OP2 has none of these.

• OP2 completely handles distributed-memory parallelism, including partitioning and distribution of data structures as well as mesh renumbering for increased data locality. PyOP2, as explained in Section 2.2.1.4, relies on an external software module, DMPlex, for these functionalities. DMPlex is much more versatile than OP2, and this plays a key role in performance optimization, as explained in Chapter 3.

2.2.3 Stencil Languages

A stencil defines how to access a set of values associated with a mesh element (e.g., vertex, cell) and its neighboring elements. In the literature, the word "stencil" is often used to refer to the special case in which the access rule is an affine function. This is the case of computational methods based on structured meshes, such as the finite difference method. However, since this thesis focuses on unstructured meshes, we generalize this notion. We characterize a stencil as follows:

Stencils for structured meshes Given an element i in the mesh, a stencil is a vector-valued function f(i) = [f_1(i), f_2(i), ..., f_n(i)] which retrieves the n elements that need to be accessed when updating i. Each function f_j is affine and usually takes the form f_j(i) = i ∗ h_j + o_j, with h_j, o_j ∈ ℕ.

Stencils for unstructured meshes Given an element i in the mesh and an access function g, a stencil is a vector-valued operator f(i, g) = [f_1(i, g), f_2(i, g), ..., f_n(i, g)] which retrieves the n elements that need to be accessed when updating i. There usually is no closed form for a function f_j.
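As a toy illustration of the two notions (the connectivity values below are invented):

import numpy as np

# Structured mesh: affine access functions, e.g. a 1D three-point stencil,
# where f_j(i) = i*1 + o_j with offsets o_j in {-1, 0, +1}
def structured_stencil(i):
    return [i - 1, i, i + 1]

# Unstructured mesh: the access is mediated by a map g, with no closed form
cells2dofs = np.array([[0, 1, 4], [1, 2, 4], [2, 3, 4]])   # assumed connectivity
def unstructured_stencil(i, g=cells2dofs):
    return list(g[i])    # f(i, g) = degrees of freedom incident to cell i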

2.2.3.1 Stencil Languages for Unstructured Meshes

OP2 is an example of a language for implementing stencil-based computations on unstructured meshes. The maps in the OP2 language implement non-affine stencils.

Another example of a language for unstructured mesh stencils is Liszt [DeVito et al., 2011]. Like OP2, Liszt supports multiple architectures, including distributed-memory execution via MPI and GPUs. The language is less flexible than OP2's, though. Meshes are constructed by composing the abstract data types of the language, namely vertices, edges, faces and cells, and fields can only be associated with these types. This is on one hand helpful, as the stencils can automatically be inferred (assuming that the mesh topology does not change over time), but on the other hand it makes integration with a finite element framework difficult, or simply unnatural. Consider the case³ in which quadratic or higher-order basis functions on triangles are used. The degrees of freedom are associated with both vertices and edges. In OP2, this is naturally expressible by defining a map between cells and degrees of freedom, whereas in Liszt one needs to manage two different fields (one for the degrees of freedom on the vertices and the other for those on the edges). A computation in Liszt is expressed through sequences of for-comprehensions over mesh elements. The for-comprehensions can be arbitrarily nested. One of the key prerequisites is that each field in a for-comprehension nest has a fixed state, one among read, write, or increment (for local reductions). This allows the compiler to automatically perform useful actions such as scheduling transformations (e.g., by rearranging iterations in a for-comprehension nest) and placement of MPI calls. The Liszt compiler derives data dependency information automatically, while OP2 relies on access descriptors.

2.2.3.2 Stencil Languages for Structured Meshes

The field of domain specific languages for structured mesh computations has received a large number of contributions over the last decade. SBLOCK [Brandvik and Pullan, 2010] is a Python framework for expressing computations using structured stencils. The run-time support

³The example was extracted from Rathgeber [2014].

automatically generates low-level code for multi-node many-core architectures. Mint is a framework based on pragma directives targeting GPUs that has been used to accelerate a 3D earthquake simulation [Unat et al., 2012]. Other stencil languages relying on auto-tuning systems for increasing the performance of the generated code were presented in Zhang and Mueller [2012], Datta et al. [2008], Christen et al. [2011]. An interesting approach is adopted in Pochoir [Tang et al., 2011], in which a compiler translates the high-level specification into cache-oblivious algorithms for multi-core CPUs. Interestingly, an attempt at integrating this system with a higher-level domain-specific language for finite difference methods failed due to constraints on the programming interface [Sun, 2016]. A stencil language explicitly aiming for generation of highly-efficient vector code was presented in Henretty et al. [2013]. In spite of a large research effort, it is however not clear whether and to what extent these languages have been adopted in production code.

2.3 Compilers and Libraries for Loop Optimization

In this section, we review state-of-the-art compilers and libraries for loop optimization. This will provide the necessary background for the contributions presented in Chapters 3 and 5.

2.3.1 Loop Reordering Transformations

The study of high level transformations for the optimization of loops in imperative languages dates back to the sixties. The initial focus was on techniques for improving data locality in loop nests; this was motivated by the observation that many programs spend a considerable fraction of time in executing these regions of code. A survey on the main results achieved between the sixties and the nineties was provided by Bacon et al. [1994]. Over the last twenty years, there has been a great deal of effort in trying to improve the effectiveness of loop transformations, as well as in developing tools for automating them. General-purpose compilers have become increasingly sophisticated (especially the Intel and Cray compilers), although they still often miss powerful optimization opportunities.

This is usually a consequence of codes so complex or unstructured that the analysis becomes too difficult, or of the lack of domain information. Polyhedral compilers, discussed in Section 2.3.3, are particularly effective on relatively simple affine loop nests, but there is no compelling evidence that analogous results could systematically be achieved in real-world applications.

A class of optimizations relevant for this thesis is the one based on the reordering of loop nest iterations. Well-known optimizations belonging to this class are:

Interchange This transformation consists of exchanging the position of two loops in a nest. Possible objectives are exposing a vectorizable dimension as the innermost loop or increasing data locality.

Reversal When a loop is "reversed", the order in which its iterations are executed is flipped. For instance, the reverse of a loop iterating from 0 to n−1 with unitary increment is a loop from n−1 to 0 with unitary decrement. This transformation can enable other transformations or, under certain circumstances, eliminate the need for some temporary variables.

Skewing Loop skewing, sometimes also called “time skewing” or “time tiling”, aims to improve data reuse and to expose parallelism in wave-front computations. In these computations, one or more arrays are updated at every loop iteration and the updates propagate like a wave over the subsequent iterations (e.g., A[i][j] = f(A[i-1][j], A[i][j-1])). We expand on loop skewing in Section 2.3.2.

Fission Sometimes also referred to as loop distribution, loop fission "splits" a loop into a sequence of loops. The new loops have the same iteration space as the original one, but include only a subset of its statements. This may improve data locality in a loop accessing a significant number of datasets. An increase in loop overhead is a negative side effect of the transformation.

Further, this class includes the two fundamental loop reordering transformations studied in Chapter 3 in the context of unstructured stencils:

Fusion A sequence of loops can be fused, or "merged", to improve data locality and reduce loop overhead. In the simplest instance, all loops in a sequence have the same iteration space and, given a statement S1 in a first loop and a statement S2 in a subsequent loop, S2 does not modify any data read by S1. In such a case, applying loop fusion is straightforward (a minimal before/after sketch is given after this list). In general, however, loops can have different bounds and the data dependencies in a set of statements may be non-trivial. In these cases, loop fusion becomes more challenging, or simply not feasible. The loop fusion problem has been tackled formally by Darte [2000].

Tiling Also known as blocking, loop tiling is probably one of the most studied and powerful transformations. Its main goal is improving data locality, although it is also sometimes used to extract parallelism. The basic idea is to "chunk" the iteration space of a loop nest into partitions of a given shape. This requires major changes to the loop nest structure, as shown in Listing 5. In this example, square-shaped tiles are executed with the objective of increasing data reuse along the j dimension. Depending on properties of the loop nest such as data dependency pattern and control flow, automating or even just implementing tiling may pose significant challenges. In particular, tiling a loop nest in the presence of stencils is a non-trivial problem. Early work on this transformation was presented in Irigoin and Triolet [1988] and Ramanujam and Sadayappan [1992]. More recent studies have tackled automation and effectiveness of the tiling algorithms, such as Bondhugula et al. [2008], Acharya and Bondhugula [2015], Klöckner [2015]. The impact of the tile shape has been investigated by Grosser et al. [2014] and Krishnamoorthy et al. [2007b].
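The before/after sketch referenced in the Fusion entry above. The arrays and sizes are arbitrary; the point is only to show how fusion turns reuse of b across two sweeps into locality within a single sweep.

import numpy as np

n = 1000
a, b, c = np.empty(n), np.empty(n), np.random.rand(n)

# Original: two loops over the same iteration space; b is traversed twice
for i in range(n):
    b[i] = 2.0 * c[i]
for i in range(n):
    a[i] = b[i] + 1.0

# Fused: one loop; each b[i] is consumed while still in cache (or a register)
for i in range(n):
    b[i] = 2.0 * c[i]
    a[i] = b[i] + 1.0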

2.3.2 Composing Loop Tiling and Loop Fusion

Reordering transformations can be composed for improving their effectiveness. A case of particular interest for this thesis is the composition of loop fusion and loop tiling, in which a sequence of loops is fused by

⁴The example is partly extracted from www.netlib.org/utk/papers/autoblock/node2.html

LISTING 5: Illustration of loop tiling in a classic matrix-matrix multiplication kernel⁴. The matrices are square of size N × N. If b is chosen small enough to fit some level of cache, data reuse can be achieved.

// Original implementation
for i = 1 to N, 1
  for j = 1 to N, 1
    for k = 1 to N, 1
      C[i][j] += A[i][k]*B[k][j];

// Tiled implementation
for i0 = 1 to N, b
  for j0 = 1 to N, b
    for k0 = 1 to N, b
      for i = i0 to min(i0+b-1, N)
        for j = j0 to min(j0+b-1, N)
          for k = k0 to min(k0+b-1, N)
            C[i][j] += A[i][k]*B[k][j];

grouping multiple blocks of iterations into tiles. Such a transformation can turn data reuse across consecutive loops into data locality, thus optimizing memory bandwidth and memory latency. Composing loop fusion and loop tiling is complicated; automation is even more challenging, especially depending on the kind of stencils that need to be supported. In the domain of computational methods for approximating the solution of PDEs, the composition of loop fusion and loop tiling is often referred to as time tiling. In these codes, a sequence of loops over the spatial dimensions is repeatedly executed within a time loop; time tiling fuses the spatial loops by building tiles crossing the time dimension. Time tiling has extensively been studied for structured stencil codes [Bondhugula et al., 2014, Holewinski et al., 2012, Zhou et al., 2012, Bondhugula et al., 2008]. It has instead received very limited attention in the context of unstructured stencil codes – where it is also known as sparse tiling [Strout et al., 2002] – because of the complexity inherent in performing data dependence analysis. Prior to this thesis, sparse tiling had only been applied "manually" to individual benchmarks, through "ad-hoc" strategies. In Chapter 3, the state-of-the-art is advanced by introducing:

• A technique for applying sparse tiling to arbitrary sequences of loops that are expressible with the loop chain abstraction [Krieger et al., 2013]. This contribution was a joint effort with the authors in Strout et al. [2014].

• A system that automates this technique and enables execution on distributed-memory architectures.

With plain loop tiling (i.e., in the absence of loop fusion), a tile can be represented as a single block of iterations. This has been explained in the previous section and illustrated in Listing 5. Time and sparse tiling generalize this idea. The fused loops may either differ in iteration space or be characterized by non-trivial data dependencies. To construct a legal schedule, tiles need to be assigned multiple blocks of iterations, one for each of the fused loops. For this purpose, there exist two different approaches:

Split tiling Tiles are constructed such that all data dependencies are satisfied and are executed according to a partial ordering. Many split tiling schemes have been studied for structured stencil codes. These schemes, which take the name of the tile shape they end up producing (such as hexagonal or diamond tiling), differ in the achieved trade-off between parallelism and data locality. The geometrical shape of a tile can be visualized by plotting the fused iteration space and by grouping iterations based on the tile they belong to.

Overlapped tiling Different tiles share subsets of iterations. A shared iteration is "owned" by only one tile and is executed redundantly by the set of overlapping tiles. These tiles store intermediate values into "ghost" regions of memory. This approach removes the need for a partial execution ordering, at the price of redundant computation.

Some of these approaches have been reviewed in Grosser et al. [2013]. In this thesis, we will employ a mixed split-overlapped sparse tiling scheme for unstructured stencil codes, suitable for execution on distributed-memory architectures.

2.3.3 Automation via Static Analysis

There are two alternatives for automating loop reordering transformations. Both of them rely on static analysis of the source code:

Graph-based Representation General-purpose compilers analyze the source code and produce an intermediate representation (IR) based on a graph-like data structure (e.g., the static single assignment form, or SSA, in the LLVM compiler). Some kind of data dependence analysis asserts the legality of a transformation, while cost models are used to predict its impact on performance. Some compilers can expose the cost models directly to the user through textual reports. Users have some form of control over the optimization process: pragma directives can be used to choose explicitly how to optimize a loop nest (e.g., to set a specific unroll factor or to enforce vectorization), while compiler parameters allow the tuning of global optimization heuristics.

Polyhedral Model Several research compilers, and more recently also the LLVM compiler through a module called Polly [Grosser et al., 2012], can introduce reordering transformations based on geometric representations of loop nests. To model a loop nest as a polyhedron, two conditions must be satisfied: (i) loop bounds as well as array indices must be affine expressions in the enclosing loop indices and (ii) pointer aliasing must be known at compile-time. These conditions are often necessary in the case of graph-based IRs too, although in a more relaxed fashion (e.g., the Intel compiler can vectorize, to some extent, non-affine loop nests). Polyhedral compilers target parallelism and data locality by composing affine scheduling functions. A schedule defines the order in which the statement instances of a loop nest are executed; a scheduling function can be applied to change the original order. Once the polyhedron is available, different scheduling functions can be compared and applied.

While, on one hand, general-purpose compilers using graph-based IRs have now reached an impressive level of sophistication (our experience with the Cray and Intel autovectorization systems is remarkably positive), there is still quite a lot of debate on the effectiveness of polyhedral compilers. One of the main reasons is the lack of evidence: the performance evaluation of state-of-the-art polyhedral compilers is typically conducted on simplistic benchmarks that do not reflect the complexity of real-world applications. We will expand on this matter in Section 3.3.

The fact that we focus on unstructured stencil codes, which render the

loop nests non-affine, precludes the adoption of polyhedral compilers. Recently, there has been an effort in extending the polyhedral model to non-affine loops [Venkat et al., 2014], but it is far from clear whether this technology will ever be applicable to the real-world applications we are interested in.

2.3.4 Automation via Dynamic Analysis

If a stencil is unstructured, the memory access pattern is characterized by indirect accesses that can only be resolved at execution time. In such a case, loop reordering transformations can be enabled through inspector/executor strategies, as originally proposed by Salz et al. [1991]. Informally, an inspector is an additional piece of code that, at run-time, captures the data dependency pattern of an irregular loop nest into a suitable data structure. An executor is semantically equivalent to the original loop nest, but performs the iterations in a different order exploiting the information produced by the inspector. Examples of reordering transformations through inspector/executor schemes were provided by Strout et al. [2003].

To automate a reordering transformation in the presence of unstructured stencils, a compiler replaces loops with suitable inspector/executor schemes. Prior to this thesis, no compilers were capable of introducing complex transformations such as sparse tiling.
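A minimal, purely illustrative inspector/executor pair might look as follows; the tiling heuristic and all names are invented for the example and bear no relation to the schemes developed in Chapter 3.

import numpy as np

def inspector(indirection_map, num_tiles):
    """Run-time analysis: group iterations into blocks ('tiles') of
    topologically close elements, using a crude locality heuristic."""
    order = np.argsort(indirection_map[:, 0])
    return np.array_split(order, num_tiles)

def executor(kernel, tiles, indirection_map, data):
    """Semantically equivalent to the original loop, but iterates tile by tile."""
    for tile in tiles:
        for i in tile:
            kernel(i, indirection_map, data)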

2.4 Domain-Specific Optimization

By restricting the attention to relatively narrow classes of programs, it is often possible to identify optimization opportunities that will be missed by general-purpose or polyhedral compilers. This is because, for instance, such optimizations rely on special mathematical properties of the problem being tackled. Most of the contributions in Chapters 3, 4 and 5 exploit this observation. In this section, we review some of the domain-specific optimization systems that inspired our work.

2.4.1 Tensor Contraction Engine

The Tensor Contraction Engine (TCE) is a successful project that turns the mathematical structure of expensive computations in the field of quantum chemistry into powerful optimizations [TCE contributors]. These codes need to execute long sequences of tensor contractions, or generalized matrix multiplications, which can easily result in teraflops of computation and terabytes of data for simultaneously storing huge dense matrices. The TCE provides a domain-specific language to express formulae in a mathematical style. The mathematical specification is transformed into low-level code while undergoing several optimization steps. Transformations aimed at reducing the operation count [Hartono et al., 2006, 2009] and finding the best trade-off between redundant computation and data locality [Lam et al., 2011], as well as low-level optimization [Lu et al., 2012], are applied in this framework. Many of these optimizations exploit the mathematical structure inherent in tensor contractions.

In Chapter 4 we use a similar approach for optimizing the operation count of finite element operators – we exploit the mathematical property that these operators are linear in test and trial functions to identify effective factorizations.
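As a generic illustration of the kind of saving such factorizations produce (this is not the algorithm developed in Chapter 4), consider the identity

\[ \sum_{i=1}^{n}\sum_{j=1}^{n} a_i\, b_j \;=\; \Bigl(\sum_{i=1}^{n} a_i\Bigr)\Bigl(\sum_{j=1}^{n} b_j\Bigr). \]

Evaluating the left-hand side term by term costs n² multiplications and n² − 1 additions, whereas the factored right-hand side costs 2(n − 1) additions and a single multiplication.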

2.4.2 Halide

Ragan-Kelley et al. [2013] recently introduced Halide, a high level language for expressing image processing kernels. The run-time optimization system, which relies on auto-tuning to explore the transformation space, has been demonstrated to achieve a performance at least comparable to that of hand-written (and hand-optimized) codes, and in many cases to outperform them. Halide is a successful contribution to the landscape of domain-specific languages, as it is currently employed for development of production code by several companies⁵.

In Halide, an image processing pipeline is a sequence of interconnected stages. Each stage applies a numerical kernel – usually a structured stencil – to its input. Numerical kernels are pure functions applied over a 2D domain representing the image being processed. In realistic cases, an

⁵At least Google and Adobe have declared that some of their groups (more than 30 researchers and developers in total) are actively using Halide.

image processing pipeline can be quite complex and include up to a hundred stages. What makes Halide powerful from the viewpoint of optimization is the fact that the schedules are decoupled from the numerical kernels. A schedule describes aspects like the iteration ordering and the trade-off between temporary values and redundant computation. These optimizations are fundamental in image processing pipelines and, as such, are treated as first-class citizens by Halide. Different schedules can be explored automatically or provided as user input.

2.4.3 Spiral

Spiral is a pioneering project on automated code generation starting from a high level specification of a mathematical problem [Püschel et al., 2005]. The domain of interest is digital signal processing (DSP). Spiral generates highly optimized DSP algorithms, such as the discrete Fourier transform, and autonomously tunes them for the underlying platform. To achieve that, the mathematical specification of a DSP algorithm is first transformed according to a set of rewrite rules. The resulting formulae are translated into an intermediate language, which enables a set of optimizations, including explicit vectorization and parallelization. Finally, low level code is produced, compiled, executed and timed. The last phase provides feedback to the system so that increasingly optimized implementations can be generated.

Spiral leverages the mathematical specification to apply powerful optimizations. The rewrite rules system itself is essential to simplify complex formulae. Another example is the loop fusion framework [Franchetti et al., 2005], which exploits mathematical properties to detect loop fusion opportunities that would be missed by lower level compilers.

2.4.4 Small-scale Linear Algebra

In several fields, such as graphics, media processing and scientific computing, many operations can be cast as small-scale linear algebra operations. By small-scale we mean that the size of some of the involved tensors can be as small as a few units, and only occasionally exceed a few hundred elements. Despite the small size, it is important to optimize these operations because they may be applied iteratively (e.g., in a time-stepping

loop), thus accounting for a significant fraction of the overall execution time.

Libraries for linear algebra are tuned for large-scale problems and they become inefficient when tensors are small. Novel approaches, mostly centred on auto-tuning, have been developed over the last decade. In the domain of scientific computing, it is worth mentioning the technique employed by the finite element code nek5000 [Fischer et al., 2008] to optimize small matrix multiplications [Shin et al., 2010]. A set of highly-optimized routines, each routine specialized for a particular size of the input problem, are generated and tuned for the underlying platform. At run-time, a dispatcher function picks one such routine, based on the size of the input matrices. A higher level approach has recently been presented in Spampinato and Püschel [2014], in which the LGen language is used to write composite linear algebra operations. A set of rewrite rules and a transformation system deeply inspired by Spiral are used for optimization. Other libraries that are optimized for small-scale tensor operations include LIBXSMM [Heinecke et al., 2016] and BLIS [Van Zee and van de Geijn, 2015].

The field of small-scale linear algebra optimization systems is interesting because some techniques could be used for low-level optimization of finite element local assembly, a topic treated by this thesis in Chapter 5.

2.5 On the Terminology Adopted

Throughout the thesis we employ a standard terminology, very close to the one used in reference textbooks such as [Hennessy and Patterson, 2011]. We here review a set of relevant keywords. This will especially be useful when discussing the performance achieved by the proposed optimizations.

Compilers

General-purpose compiler With this term we generically refer to any open-source or commercial compiler capable of translating low level source code (e.g., Fortran, C, C++) into machine code. Examples are the GNU (gcc), Intel (icc), Cray and LLVM compilers. With "general-purpose" we aim to distinguish the aforementioned compilers from all other higher level compilers, such as those based on the polyhedral model or those used for translating domain specific languages.

(Auto)vectorization Vectorization is a well-known paradigm that generalizes computation on scalars to computation on vectors – that is, arrays of contiguous elements. A single instruction, multiple data (SIMD) computation is one that employs vectorization to carry out a sequence of instructions. On SIMD architectures, which are nowadays ubiquitous, vector code arises in two circumstances: (i) sections of a program are explicitly vectorized (e.g., through high level libraries, intrinsic instructions, or assembly code); (ii) a compiler transforms scalar code into vector code. The latter case is often referred to as autovectorization, since SIMD instructions are generated without user intervention. When possible and if demonstrated to be effective, autovectorization should be preferred over explicit vectorization for portability reasons. Autovectorization is typically applied to inner loops, although block vectorization [Larsen and Amarasinghe, 2000] is also supported by more advanced compilers (e.g., Intel's).

Performance and Cost Models

(High) Memory pressure This term is often used to emphasize the fact that one or more levels of the memory hierarchy (e.g., RAM, caches, registers) are stressed by a relatively large number of load/store instructions. A high memory pressure is often responsible for performance degradation.

Memory- and CPU-boundedness A section of code can be either CPU-bound or memory-bound. In the former case, the performance achieved is limited by the operation throughput of the CPU; in the latter case, the memory bandwidth or the memory latency are the limiting factors. The loop reordering transformations reviewed in Section 2.3 tackle memory-boundedness; for example, both tiling and fusion aim to maximize the cache hit ratio, thus reducing latency and memory pressure. Many domain-specific optimizations, as discussed in Section 2.4, target instead CPU-boundedness; for example, the Tensor Contraction Engine and Spiral manipulate mathematical formulae to reduce the operation count of the resulting kernels.

Operational intensity and Roofline Model This parameter defines the ratio of total operations to total data movement (bytes) between the DRAM and the cache hierarchy for a given section of code. The operational intensity, which "predicts the DRAM bandwidth needed by a kernel on a particular computer", is useful to derive roofline plots [Williams et al., 2009]. A roofline plot is particularly helpful to study the computational behaviour of a program, since it provides an insightful means to understand what the performance bottleneck is and, therefore, what kind of optimization is most useful.
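As a small worked example, consider a kernel computing y[i] = a*x[i] + y[i] over double-precision arrays; the peak and bandwidth figures below are arbitrary assumptions made only for the illustration.

flops_per_iter = 2                      # one multiply, one add
bytes_per_iter = 3 * 8                  # load x[i], load y[i], store y[i]
oi = flops_per_iter / bytes_per_iter    # ~0.083 flops/byte

peak = 500e9                            # assumed peak throughput: 500 GFLOPS/s
bandwidth = 50e9                        # assumed DRAM bandwidth: 50 GB/s
attainable = min(peak, oi * bandwidth)  # roofline bound: ~4.2 GFLOPS/s, i.e. memory-bound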

Arithmetic intensity Sometimes, the term arithmetic intensity is used in place of operational intensity. The differences are that only the fraction of arithmetic operations emitted, instead of all operations, is considered and that the total data movement is to be interpreted as between the CPU and the last level of cache.

Miscellanea

Access function An access function specifies how the elements of an array are accessed. Usually, these are functions of one or more loop indices. Access functions can be constant, affine or non-affine, as already shown in Section 2.2.3.

Local and global reductions A reduction is a commutative and associative operation that is applied to a set of values to produce a scalar. For instance, the sum of a set of numbers is a reduction. In mesh-based computations, it is useful to distinguish between local and global reductions. A reduction is local if only applied to a (typically small) subset of mesh elements. A reduction is global if applied to an entire set of elements (e.g., a field associated with a set of degrees of freedom), thus introducing a global synchronization point in the computation.

Communication A communication indicates a generic form of interaction between two or more entities. The most obvious case is when two processes on two different cores communicate explicitly via message passing; if the cores are on the same node the communication occurs via memory, whereas if they are on different nodes both the network and the memory are needed. However, the term can also be used in more general scenarios. We can say, for instance, that two tiles in a blocked iteration space communicate if their execution needs to be synchronized. Intuitive terms like communication-avoiding or communication-computation overlap are often used to classify optimizations that aim to minimize communication.

Real-world applications Throughout the thesis we emphasize the use of realistic applications, as opposed to simplistic benchmarks, to evaluate the impact of the introduced optimizations. By "real-world" we mean that an application is based upon real-life equations, numerical methods, and datasets. By using real-world applications, we aim to provide compelling evidence about the effectiveness and the actual applicability of our contributions.


Chapter 3

Automated Sparse Tiling for Irregular Computations

Many numerical methods for partial differential equations (PDEs) are structured as sequences of parallel loops. This exposes parallelism well, but does not convert data reuse between loops into data locality, since their working set is usually too big to fit in some level of cache. In Section 2.3.3, it was explained that loop fusion and loop tiling may be used to retain some of this potential data locality. This chapter introduces a mixed compiler/library system capable of applying these transformations to sequences of irregular loops, automatically. Challenges deriving from real-world applications are addressed. As reiterated throughout the chapter, parts of this work are the result of a joint collaboration with Strout et al. [2014].

3.1 Motivation

The focus of this chapter is improving data locality in unstructured mesh PDE solvers, such as those based on the finite volume or the finite element methods. Here, the loop-to-loop dependence structure is data-dependent due to indirect references such as A[map[i]] (see Sections 2.2.2 and 2.2.3). The map array stores connectivity information, for example from elements in the mesh to degrees of freedom. A similar pattern occurs in molecular dynamics simulations and graph processing, so both the theory and the tools that we will develop in this chapter are generalizable to these domains.

Three motivating real-world applications for this work are Hydra, Volna and Seigen. Hydra [Reguly et al., 2016] is a computational fluid dynamics application used at Rolls Royce for the simulation of next-generation components of jet engines. Volna [Dutykh et al., 2011] is a computational fluid dynamics application for the modelling of tsunami waves. Seigen aims to solve the elastic wave equation using the discontinuous Galerkin finite element method for energy exploration purposes. All these applications are characterized by the presence of a time-stepping loop, in which several loops over the computational mesh (33 in Hydra, 10 in Volna, 25 in Seigen) are repeatedly executed. These loops are characterized by the aforementioned irregular loop-to-loop dependence structure. We will use Seigen, as well as other simpler benchmarks, for performance evaluation.

Because of the irregular memory access pattern, our approach to loop transformation is based on dynamic analysis, particularly on inspector/executor schemes. Among the possible dynamic loop optimizations, we target sparse tiling. We recall from Section 2.3.2 that sparse tiling aims to exploit data reuse across consecutive loops by composing two transformations, namely loop fusion and loop tiling. Given the presence of significant data reuse across consecutive loops, our hypothesis is that sparse tiling has potential in the class of unstructured mesh PDE solvers that we are targeting. Summarizing, three main issues are tackled in this chapter:

• Previous approaches to sparse tiling were all based upon "ad-hoc" inspector/executor strategies; that is, developed "by hand", per benchmark. We seek a general technique, applicable to arbitrary real-world computations on unstructured meshes.

• Automation is more than a desired feature because application specialists avoid complex optimizations harming source code comprehensibility. We therefore aim for a fully-automated framework, based upon domain-specific languages and a mixed compiler/library approach.

• A few studies have addressed the problem of fusing loops when these need to be interleaved with routines for message passing. We are aware of none for the scenario in which the memory access pattern is irregular. We fill this gap by introducing a sparse tiling scheme that performs redundant computation over partition boundaries to delay communication. The importance of this contribution stems from the fact that most scientific simulations require distributed-memory parallelism.

3.2 Context and Approach

Loop fusion and loop tiling have been widely studied in the literature. A considerable number of techniques for improving and automating these transformations have been proposed over the years. Their evaluation, however, has traditionally been limited to a small set of benchmarks (or "mini-applications") and single-node performance. In particular, it has repeatedly been shown that simple stencil codes arising in finite difference methods [Zumbusch, 2013, Holewinski et al., 2012, Bondhugula et al., 2014], linear algebra routines [Buttari et al., 2008, 2009], and image processing kernels [Ragan-Kelley et al., 2013] can benefit from both loop fusion and loop tiling. This, unfortunately, does not shed light on the impact that these transformations could have in real-world applications, where the loop nests are often deeper, less structured, and characterized by irregular control flow. Since numerical methods for PDEs are often structured as sequences of complex, irregular loops (or "sweeps") over the computational mesh, some obvious questions arise:

Applicability Can sparse tiling be adopted in real-life numerical methods for solving PDEs, and should performance improvements be expected?

Lack of evidence Why, despite decades of research, are non-trivial loop transformations rarely used in scientific simulations?

Challenges What are the theoretical and technical challenges that need to be overcome to automate sparse tiling?

In this chapter, we tackle these problems in the following context:

Irregular codes Unstructured meshes are often used to discretize the computational domain, since they allow for an accurate representation of complex geometries. Their connectivity is stored by means of adjacency lists (or an equivalent data structure). This leads to indirect memory accesses within loop nests. Indirections break static analysis, thus making many compiler-based approaches to loop transformation (e.g., polyhedral optimization) ineffective. Runtime data dependence analysis enables dynamic loop optimization, although this results in additional overhead.

Realistic datasets Complex simulations usually operate on at least terabytes of data, hence execution on multi-node systems is required. Sparse tiling will have to coexist with distributed-memory parallelism.

Automation, but no legacy code Sparse tiling is an "extreme optimization". An implementation in a low level language (e.g., C, Fortran) requires a great deal of effort, as a thoughtful restructuring of the application is necessary. In common with many other low level transformations, it also makes the source code impenetrable, affecting maintenance and extensibility. We therefore aim for a fully automated system based on domain-specific languages, which abstracts sparse tiling through a simple interface (i.e., a single construct to define a scope of fusible loops) and a tiny set of parameters for performance tuning (e.g., the tile size). We are not interested in supporting legacy code, where the key computational aspects (e.g., mesh iteration, distributed-memory parallelism) are usually hidden for software modularity, thus making automation almost impossible.

In this chapter, the problem of automatically applying sparse tiling to irregular codes based on shared- and/or distributed-memory parallelism is decomposed into two tasks:

1. establishing abstractions and devising algorithms to enable sparse tiling in arbitrary sequences of irregular loops (Sections 3.5-3.8);

2. integrating them with a framework that relies on domain-specific languages (Section 3.9).

Before addressing these two tasks, we elaborate on the theoretical and technical challenges that arise when applying loop fusion and loop tiling to real-world applications (Section 3.3), and review the related work (Section 3.4).

3.3 Applying Loop Fusion and Loop Tiling is More Difficult than Commonly Assumed

We show in Listing 6 the "skeleton" of a typical PDE solver based on unstructured meshes. This will be useful throughout the analysis presented in this section. We identify three classes of problems that are often neglected, or at least treated with little emphasis, in the relevant literature.

Fundamental questions We first consider the effectiveness of loop fusion and loop tiling in (un)structured mesh applications.

Computational boundedness Computational methods for PDEs are structured as sequences of loop nests, each loop nest characterized by its own operational intensity. Within the same application, some loop nests may be memory-bound, while others are CPU-bound. This clearly depends on the numerical method itself, particularly on aspects such as the arithmetic complexity of the operators and the type of discretization employed (e.g., polynomial order of function spaces). Obviously, if most loop nests are CPU-bound, sparse tiling's improvements to data locality will provide only marginal gains. Before even thinking about aggressive optimizations, it is fundamental to determine the bottlenecks of an application. This boils down to answering two questions: (i) what fraction of the execution time is due to memory-bound loop nests; (ii) can CPU-boundedness be relieved by applying other optimizations (e.g., vectorization).

Loop tiling vs mesh renumbering Loop tiling and mesh renumbering are two different attempts at solving the same problem: improving the performance of mesh-based computations by increasing data locality. A mesh renumbering provides an effective way of iterating over a mesh by guaranteeing that consecutive iterations correspond to neighboring elements; if these

LISTING 6: The "bare" structure of a numerical method for solving a PDE. Three parallelizable sweeps over the mesh – over cells, nodes, and boundary nodes – are executed within a time-stepping loop. In the cells loop, the invocation of a kernel is shown. First, the (indirectly accessed) input data is "gathered" into suitable buffers. The data passed to the kernel is now contiguous in memory, which maximizes the likelihood of low level optimizations such as vectorization. Then, the kernel is executed. Finally, the computed values are "scattered" from the kernel's output buffer to memory. Distributed-memory parallelism is achieved through MPI, in particular through the MPI_Comm(...) calls that separate different mesh sweeps. Additional calculations, for instance in Calc(...), could also interleave the execution of consecutive loops.

// Time-stepping loop (T = total number of iterations)
for t = 1 to T {

  // 1st sweep over the C cells of the mesh
  for i in C {
    buffer_0 = {0.0}   // will store the kernel output
    buffer_1 = gather_data ( A[f(map[i])], ... )
    ...
    kernel_1 ( buffer_0, buffer_1, ... );
    scatter_data ( buffer_0, f(map[i]) )
  }

  Calc (...);
  MPI_Comm (...);

  // 2nd sweep over the N nodes of the mesh
  for i in N {
    ... // Similar to sweep 1
  }

  // Boundary conditions: sweep over the BV boundary nodes
  for i in BV {
    ... // Similar to sweep 1
  }

  Calc (...);
  MPI_Comm (...);
  ...
}

elements share some data values, data reuse is exploited. Just like loop tiling, a mesh renumbering can be considered a loop reordering transformation (see Section 2.3.1). Unfortunately, the relationship between them is unclear in the context of unstructured meshes. In this chapter, sparse tiling will be applied on top of meshes renumbered with the well-known Reverse Cuthill-McKee (RCM) algorithm.

Technical issues Recent works on fusion and tiling for structured mesh applications have addressed automation (e.g., polyhedral compilers), composition of transformations (e.g., time tiling), and techniques for minimizing communication (e.g., diamond tiling). However, the following aspects were rarely given the attention they actually deserve.

Unstructured meshes Although ad-hoc inspector/executor strategies for some proxy applications had previously been developed, general techniques for arbitrary computations on unstructured meshes have been missing until the work presented in this chapter¹. As already explained, the main problem with unstructured meshes is the presence of indirect memory accesses, which complicates the data dependence analysis needed for applying loop transformations.

Time tiling and distributed-memory parallelism We reiterate the fact that real-world computations require large-scale distributed-memory architectures. As Listing 6 shows, MPI calls usually separate consecutive mesh sweeps. This poses a big challenge to time tiling, because now all tiles close to the boundary of a mesh partition require special handling.

Time tiling and extra code The Calc(...) function in Listing 6 shows that additional computation may be performed between consecutive mesh sweeps. Calc could represent, for instance, checkpointing, I/O, or the resolution of a linear system through a function call to an external library. Moreover, conditional execution of loops (e.g., through if-then-else) may be present. The presence of additional code in between loops makes fusion extremely challenging.

Legacy code is usually impenetrable Loop transformation opportunities are often hidden in existing scientific codes. As explained in Strout [2013], common problems are: 1) potentially fusible or tilable loop nests are separated for code modularity; 2) handling of boundary conditions; 3) source code not amenable to data dependency analysis (e.g., extensive use of pointers, virtual function calls).

¹We reinforce once more that the generalized sparse tiling algorithm is the result of a joint collaboration amongst the authors of [Strout et al., 2014].

Limitations inherent in the numerical method Two loops cannot be fused if they are separated by a global synchronization point. This is often a global reduction, either explicit (e.g., the first loop updates a global variable that is read by the second loop) or implicit (i.e., within an external function invoked between the two loops, as occurs in many iterative solvers for linear systems); a minimal illustration follows this paragraph. By limiting the applicability of many loop optimizations, global synchronization points pose great challenges and research questions. If strong scaling is the primary goal and memory-boundedness is the key limiting factor, then interesting questions are: (i) can the numerical method be reformulated to relieve the constraints on low level optimization (which requires a joint effort between numerical analysts and performance specialists); (ii) can the tools be made more sophisticated to work around these problems; (iii) will the effort be rewarded by significant performance improvements.
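A tiny illustration of such an implicit ordering constraint, with NumPy whole-array operations standing in for two mesh sweeps:

import numpy as np

r = np.random.rand(10**6)
norm = np.sqrt((r * r).sum())   # sweep 1: a global reduction over all elements
r /= norm                       # sweep 2: every iteration reads the reduced value
# The second sweep cannot start on any element before the first sweep has
# visited all of them, so the two sweeps cannot be fused.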

All these issues will be addressed in the upcoming sections.

3.4 Related Work

Loop Chain

The data dependence analysis that we develop in this chapter is based on an abstraction called loop chain, which was originally presented in Krieger et al. [2013]. This abstraction is sufficiently general to capture data dependencies in programs structured as arbitrary sequences of loops. We will detail the loop chain abstraction in Section 3.5.

Inspector/Executor and Sparse Tiling

The loop chain abstraction provides sufficient information to create an inspector/executor scheme for an arbitrary unstructured mesh application. Inspector/executor strategies were first formalized by Salz et al. [1991]. They have been used to exploit data reuse and to expose shared-memory parallelism in several studies [Douglas et al., 2000, Strout et al., 2002, Demmel et al., 2008, Krieger and Strout, 2012].

Sparse tiling, which we reviewed in Section 2.3.4, is a technique based upon inspection/execution. The term was coined by Strout et al. [2002, 2004] in the context of the Gauss-Seidel algorithm and also used in Strout et al. [2003] in the Moldyn benchmark. However, the technique was initially proposed by Douglas et al. [2000] to parallelize computations over unstructured meshes, taking the name of unstructured cache blocking. In this work, the mesh was initially partitioned; the partitioning represented the tiling in the first sweep over the mesh. Tiles would then shrink by one layer of vertices for each iteration of the loop. This shrinking represented what parts of the mesh could be accessed in later iterations of the outer loop without communicating with the processes executing other tiles. The unstructured cache blocking technique also needed to execute a serial clean-up tile at the end of the computation. Adams and Demmel [1999] also developed an algorithm very similar to sparse tiling, to parallelize Gauss-Seidel computations. The main difference between Strout et al. [2002, 2004] and Douglas et al. [2000] was that in the former work the tiles fully covered the iteration space, so a sequential clean-up phase at the end could be avoided.

We reiterate the fact that all these approaches were either specific to individual benchmarks or not capable of scheduling across heterogeneous loops (e.g., one over cells and another over degrees of freedom). These limitations are addressed in this chapter.

Automated Code Generation and Optimization for Mesh-Based Computations

The automated code generation technique presented in Ravishankar et al. [2012] examines the data affinity among loops and performs partitioning with the goal of minimizing inter-process communication, while maintaining load balance. This technique supports unstructured mesh applications (being based on an inspector/executor strategy) and targets distributed memory systems, although it does not exploit the loop chain abstraction and does not introduce any sort of loop reordering transformation.

Automated code generation techniques, such as those based on polyhedral compilers (reviewed in Section 2.3.3), have been applied to structured mesh benchmarks or proxy applications. Notable examples are in Bondhugula et al. [2008], Grosser et al. [2012], Klöckner [2014]. There has been very little effort in providing evidence that these tools can be effective in real-world applications. Time-loop diamond tiling was applied in Bondhugula et al. [2014] to a proxy application, but experimentation was limited to shared-memory parallelism.

Overlapped Tiling

In structured codes, multiple layers of halo, or "ghost" elements, are often used to reduce communication [Bassetti et al., 1998]. Overlapped tiling (see Section 2.3.2) exploits the very same idea: trading communication for redundant computation along the boundary [Zhou et al., 2012]. Several works tackle overlapped tiling within single regular loop nests (mostly stencil-based computations), for example Meng and Skadron [2009], Krishnamoorthy et al. [2007a], Chen et al. [2002]. Techniques known as "communication avoiding" [Demmel et al., 2008, Mohiyuddin et al., 2009] also fall in this category. To the best of our knowledge, overlapped tiling for unstructured mesh applications has only been studied analytically, by Giles et al. [2012]. Further, we are not aware of any prior techniques for its automation.

3.5 The Loop Chain Abstraction for Generalized Inspector/Executor Schemes

In this section, we formalize the loop chain abstraction for unstructured mesh applications and discuss its relationship with inspector/executor schemes.

3.5.1 Relationship between Loop Chain and Inspector

The loop chain is an abstraction introduced in Krieger et al. [2013]. Informally, a loop chain is a sequence of loops with no global synchronization points, with some extra information attached to enable run-time data dependence analysis.

We reiterate that the presence of indirect memory accesses inhibits static loop optimization for data locality. The idea pursued in this chapter is to replace static with dynamic optimization, exploiting the information carried by a loop chain. Loop chains must either be added to the input code manually or derived automatically (e.g., via a domain-specific language). The inspector performs data dependence analysis using the information carried by the loop chain and produces a loop reordering, or sparse tiling schedule. This schedule is used by the executor, a piece of code that replaces the original sequence of loops. Before diving into the description of the loop chain abstraction, it is worth observing that:

• The inspection phase introduces an overhead. In many scientific computations, the data dependence pattern is static – or, more informally, "the topology does not change over time". This means that the inspection cost may be amortized over multiple iterations of the executor. If instead the mesh changes over time (e.g., in the case of adaptive mesh refinement), a new inspection must be performed.

• To adopt sparse tiling in a code there are two options. One possibility is to provide a library and leave the application specialists with the burden of carrying out the implementation (a re-implementation in the case of legacy code). A more promising alternative consists of raising the level of abstraction: programs can be written in a domain-specific language; the loop chain, inspector, and executor can then be derived automatically at the level of the intermediate representation. As we shall see in Section 3.9, the tools developed in this chapter enable both approaches.

These points will be further elaborated in later sections.

3.5.2 Definition of a Loop Chain

In Krieger et al. [2013], a loop chain is defined as follows:

• A loop chain L consists of n loops, L0, L1, ..., Ln−1. There are no global synchronization points in between the loops. Although there may be dependencies between successive loops in the chain, the execution order of a loop's iterations does not influence the result.

• D is a set of m disjoint data spaces, D0, D1, ..., Dm−1. Each loop accesses (reads from, writes to) a subset of these data spaces. An access can be either direct (e.g., A[i]) or indirect (e.g., A[map(i)]).

• R_{Ll→Dd}(i) and W_{Ll→Dd}(i) are access relations for a loop Ll over a data space Dd ∈ D. They indicate which locations in the data space Dd an iteration i ∈ Ll reads from and writes to, respectively. A loop chain must provide all access relations for all loops. For example, if Ll writes to the array A as A[B(i)] = f(...), then the loop chain will have to provide an access relation B_{Ll→A}(i).

3.5.3 The Abstraction Revisited for Unstructured Mesh Applications

Motivated by the issues raised in Section 3.3 and inspired by the programming and execution models of OP2 (reviewed in Section 2.2.2), we revisit the loop chain abstraction. This new definition is more suitable for real-world unstructured mesh applications.

• A loop chain L consists of n loops, L0, L1, ..., Ln−1. There are no global synchronization points in between the loops. Although there may be dependencies between successive loops in the chain, the execution order of a loop's iterations does not influence the result.

• S is a set of m disjoint sets, S0, S1, ..., Sm−1. Sets are used to represent iteration and data spaces. Possible sets are the cells in the mesh or the degrees of freedom associated with a function.

A set S is logically split into three contiguous regions: core (Sc), boundary (Sb), and non-exec (Sne). Given a process P and a set S:

Sc: the iterations of S that exclusively belong to P.

Sb: the boundary region can be seen as the union of two sub-regions, owned (Sowned) and exec (Sexec). As shown in Figure 2.2, Sowned are iterations that belong to P which are redundantly executed by some other processes; Sexec are iterations from other processes which are redundantly executed by P. We will see that redundant computation preserves atomic execution – a property that enables executing tiles without the need for synchronization.

Sne: these are iterations of other processes that are communicated to P because they need to be read to correctly compute Sb.

A set is uniquely identified by a name and the sizes of its three regions. For example, the notation S = (vertices, C, B, N) defines the vertices set, which comprises C elements in the core region (iterations 0 to C−1), B elements in the boundary region (iterations C to C+B−1), and N elements in the non-exec region (iterations C+B to C+B+N−1).

• The depth is an integer indicating the extent of the boundary region of a set. This constant is the same for all sets.

• M is a set of k maps, M0, M1, ..., Mk−1. A map of arity a is a vector-valued function M : Si → Sj^0 × Sj^1 × ... × Sj^{a−1} that connects each element of Si to one or more elements in Sj. For example, if a triangular cell c is connected to three vertices v0, v1, v2, we have M(c) = [v0, v1, v2].

• A loop Li over the iteration space S is associated with d descriptors, D0, D1, ..., Dd−1. A descriptor D is a 2-tuple D = ⟨M, mode⟩. M is either a map from S to some other set or the special placeholder ⊥, which indicates that memory accesses are direct to some data associated with S itself. mode is one of [r, w, i], meaning that a memory access is respectively of type read, write, or increment.

With respect to the original definition, one crucial difference is the presence of sets in place of data spaces. In unstructured mesh applications, a loop tends to access multiple data spaces associated with the same set. The idea is to rely on sets, rather than data spaces, to perform data dependence analysis. This can significantly improve the inspection cost, because typically |S| << |D|. Another crucial difference is the characterization of sets into the three regions core, boundary, and non-exec. This separation is essential for enabling distributed-memory parallelism. The extent of the boundary regions is captured by the depth of the loop chain. Informally, the depth tells how many extra "strips" of elements are provided by the neighboring processes. This allows some redundant computation along the partition boundary and also limits how many loops can be fused. The role of the parameter depth will be clear by the end of Section 3.8.
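To make the revisited abstraction concrete, the following C sketch shows one possible in-memory representation of sets, maps, descriptors, and loops. The struct and field names are illustrative only (they are not SLOPE's actual data structures), and error handling is omitted.

/* A set, split into core, boundary (owned + exec) and non-exec regions. */
typedef struct {
    const char *name;
    int core, boundary, nonexec;   /* sizes of the three contiguous regions */
} lc_set;

/* A map of fixed arity from a source set to a target set, stored as a
   flat array: values[i*arity + k] is the k-th target element of i. */
typedef struct {
    lc_set *from, *to;
    int arity;
    int *values;
} lc_map;

typedef enum { LC_READ, LC_WRITE, LC_INC } lc_mode;

/* A descriptor: how a loop touches a set, either directly (map == NULL,
   playing the role of the placeholder ⊥) or indirectly through a map. */
typedef struct {
    lc_map *map;     /* NULL means direct access to the iteration set */
    lc_mode mode;
} lc_desc;

/* A loop: an iteration set plus its descriptors. */
typedef struct {
    lc_set *iterset;
    lc_desc *descs;
    int ndescs;
} lc_loop;

/* A loop chain: a sequence of loops with no global synchronization. */
typedef struct {
    lc_loop *loops;
    int nloops;
    int depth;       /* extent of the boundary region, same for all sets */
} lc_chain;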

3.6 Loop Chain, Inspection and Execution Examples

LISTING 7: Section of a toy program that is used as a running example to illustrate the loop chain abstraction and show how the tiling algorithm works. Note that all parameters passed to the kernels are pointers.

for t = 0 to T {
  // L0: loop over edges
  for e = 0 to E {
    x = X + e;
    tmp_0 = tmp + edges2vertices[2*e + 0];
    tmp_1 = tmp + edges2vertices[2*e + 1];
    kernel1(x, tmp_0, tmp_1);
  }

  // L1: loop over cells
  for c = 0 to C {
    res = R + c;
    tmp_0 = tmp + cells2vertices[3*c + 0];
    tmp_1 = tmp + cells2vertices[3*c + 1];
    tmp_2 = tmp + cells2vertices[3*c + 2];
    kernel2(res, tmp_0, tmp_1, tmp_2);
  }

  // L2: loop over edges
  for e = 0 to E {
    tmp_0 = tmp + edges2vertices[2*e + 0];
    tmp_1 = tmp + edges2vertices[2*e + 1];
    kernel3(tmp_0, tmp_1);
  }
}

Using the example in Listing 7 – a plain C implementation of the OP2 program in Listing 4 – we describe the actions performed by a sparse tiling inspector. The inspector takes as input the loop chain illustrated in Listing 8. We show two variants, for shared-memory and distributed-memory parallelism. The value of the variable mode at line 19 in Listing 8 determines the variant to be executed.

LISTING 8: Building the loop chain for inspection.

1  inspector = init_inspector(...);
2
3  // Three sets: edges, cells, and vertices
4  E = set(inspector, "edges", core_edges, boundary_edges, nonexec_edges, ...);
5  C = set(inspector, "cells", core_cells, boundary_cells, nonexec_cells, ...);
6  V = set(inspector, "verts", core_vertices, boundary_vertices, nonexec_vertices, ...);
7
8  // Two maps, from edges to vertices and from cells to vertices
9  e2vMap = map(inspector, E, V, edges2vertices, ...);
10 c2vMap = map(inspector, C, V, cells2vertices, ...);
11
12 // The loop chain comprises three loops; each loop has some descriptors
13 // (recall that r and i in a descriptor stand for "read" and "increment")
14 loop(inspector, E, {⊥, r}, {e2vMap, i});
15 loop(inspector, C, {⊥, r}, {c2vMap, i});
16 loop(inspector, E, {e2vMap, i});
17
18 // Now we can run the inspector
19 inspection = run_inspection(mode, inspector, tile_size, ...);
20 return inspection;


Figure 3.1: Partitioning of the seed loop. The vertices are illustrated to make the connectivity of the mesh clear, although they do not belong to any partition yet.

Overview

The inspector starts with partitioning the iteration space of a seed loop, for example L0. Partitions are used to initialize tiles: the iterations of L0 falling in Pi – or, in other words, the edges in partition Pi – are assigned to the tile Ti. Figure 3.1 displays the situation after the initial partitioning of L0 for a given input mesh. There are four partitions, two of which (P0 and P3) are not connected through any edge or cell. These four partitions correspond to four tiles, [T0, T1, T2, T3], with Pi = Ti.

Figure 3.2: A snapshot of the mesh after tiling L0.

As detailed in the next two sections, the inspection proceeds by populating Ti with iterations from L1 and L2. The challenge of this task is guaranteeing that all data dependencies – read after write, write after read, write after write – are honored. The output of the inspector is eventually passed to the executor. The inspection carries sufficient information for computing sets of tiles in parallel. Ti is always executed by a single thread/process and the execution is atomic; that is, it does not require communication with other threads/processes. When executing Ti, first all iterations from L0 are executed, then all iterations from L1, and finally those from L2.
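The following C fragment sketches what executing a single tile amounts to once the inspector has filled its iteration lists: the iterations assigned to the tile are run loop by loop, in loop chain order, without any intermediate synchronization. The names (the tile struct, the kernel dispatch callback) are illustrative and not taken from SLOPE.

/* Illustrative only: a tile stores, for each loop Lj of the chain,
   the list of iterations it must execute. */
typedef struct {
    int nloops;
    int **iterations;   /* iterations[j] = iterations of Lj in this tile */
    int *niterations;   /* niterations[j] = their number */
    int color;
} tile_t;

/* Execute one tile atomically: all its iterations of L0, then L1, ... */
static void execute_tile(tile_t *t,
                         void (*run_iteration)(int loop, int iter))
{
    for (int j = 0; j < t->nloops; j++)
        for (int k = 0; k < t->niterations[j]; k++)
            run_iteration(j, t->iterations[j][k]);
}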

Inspection for Shared-Memory Parallelism

Similarly to OP2, to achieve shared-memory parallelism we use coloring. Two tiles that are given the same color can be executed in parallel by different threads. Two tiles can have the same color if they are not connected, because this ensures the absence of race conditions through indirect memory accesses during parallel execution. In the example we can use three colors: red (R), green (G), and blue (B). T0 and T3 are not connected, so they are assigned the same color. The colored tiles are shown in Figure 3.2. In the following, with the notation Ti^c we indicate that the i-th tile has color c.

To populate [T0^G, T1^B, T2^R, T3^G] with iterations from L1 and L2, we first have to establish a total ordering for the execution of partitions with different colors. Here, we assume the following order: green (G), blue (B), and red (R). This implies, for instance, that all iterations assigned to T1^B must be executed before all iterations assigned to T2^R. By "all iterations" we mean the iterations from L0 (determined by the seed partitioning) as well as the iterations that will later be assigned when tiling L1 and L2. We assign integers to colors to reflect their ordering, where a smaller number means higher execution priority. We can assign, for example, 0 to green, 1 to blue, and 2 to red.

Figure 3.3: The vertices are written by L0, so a projection must be computed before tiling L1. Here, the projection is represented by the colored vertices.

Figure 3.4: A snapshot of the mesh after tiling L1.

To schedule the iterations of L1 to [T0^G, T1^B, T2^R, T3^G], we need to compute a projection for any write or local reduction performed by L0. The projection required by L0 is a function φ : V → T mapping the vertices in V – as indirectly incremented during the execution of L0, see Listing 7 – to a tile Ti^c ∈ T. Consider the vertex v0 in Figure 3.3. v0 has 7 incident edges, 2 of which belong to T0^G, while the remaining 5 belong to T1^B. Since we established that G ≺ B, v0 can only be read after T1^B has finished executing its iterations from L0 (i.e., the 5 incident blue edges). We express this condition by setting φ(v0) = T1^B. Observe that we can compute φ by iterating over V and, for each vertex, applying the maximum function (MAX) to the colors of the adjacent edges.
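For the running example, the projection over vertices can be computed in a single sweep, either over the vertices (using an inverse map, as done later by Algorithm 2) or, equivalently, over the edges, scattering each edge's tile to its two vertices and keeping the tile with the highest color. The sketch below takes the second route; all names are illustrative, not SLOPE's.

/* phi[v] = index of the tile owning vertex v; -1 means "not yet assigned".
   tile_of_edge[e] = tile to which edge e was assigned by the seed partitioning.
   color[t] = execution priority of tile t (0 = green, 1 = blue, 2 = red). */
static void project_vertices(int nedges, const int *edges2vertices,
                             const int *tile_of_edge, const int *color,
                             int *phi)
{
    for (int e = 0; e < nedges; e++) {
        int t = tile_of_edge[e];
        for (int k = 0; k < 2; k++) {
            int v = edges2vertices[2*e + k];
            /* keep the adjacent tile with the highest color (lowest priority) */
            if (phi[v] == -1 || color[t] > color[phi[v]])
                phi[v] = t;
        }
    }
}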

We now use φ to schedule L1, a loop over cells, to the tiles. Consider again v0 and the adjacent cells [c0, c1, c2] in Figure 3.3. These three cells have in common the fact that they are adjacent to both green and blue vertices. For c1, and similarly for the other cells, we compute MAX(φ(v0), φ(v1), φ(v2)) = MAX(B, G, G) = B = 1. This establishes that c1 must be assigned to T1^B, because otherwise (c1 assigned instead to T0^G) a read of v0 would occur before the last increment from T1^B took place. Indeed, we recall that the execution order, for correctness, must be "all iterations from [L0, L1, L2] in the green tiles before all iterations from [L0, L1, L2] in the blue tiles". The scheduling of L1 to tiles is displayed in Figure 3.4.

Figure 3.5: A snapshot of the mesh after tiling L2.

To schedule L2 to [T0^G, T1^B, T2^R, T3^G] we employ a similar process. Vertices are again written by L1, so a new projection over V will be necessary. Figure 3.5 shows the output of this last phase.


(a) After tiling L0 (b) After tiling L1 (c) After tiling L2

Figure 3.6: Tiling the program in Listing 7 for shared-memory parallelism can lead to conflicts. Here, the two green tiles eventually become adjacent, creating race conditions.

Conflicting Colors  It is worth noting how T2^R "consumed" the frontier elements of all other tiles every time a new loop was scheduled. Tiling a loop chain consisting of k loops has the effect of expanding the frontier of a tile by at most k vertices. With this in mind, we re-inspect the loop chain of the running example, although this time employing a different execution order – blue (B), red (R), and green (G) – and a different seed partitioning. Figure 3.6 shows that, by applying the same procedure described in this section, T0^G and T3^G will eventually become adjacent. This violates the precondition that tiles can be given the same color, and thus run in parallel, as long as they are not adjacent. Race conditions during the execution of iterations belonging to L2 are now possible. This problem will be solved in Section 3.8.1.

Inspection for Distributed-Memory Parallelism

In the case of distributed-memory parallelism, the mesh is partitioned and distributed to a set of processes. As shown in Listing 6, neighboring processes may exchange (MPI) messages before executing a loop Lj. A message includes all "dirty" dataset values required by Lj and modified by any Lk, with Lk ≺ Lj. In the running example, L0 writes to vertices, so a subset of values associated with border vertices must be communicated prior to the execution of L1. To apply sparse tiling, the idea is to push all communications to the beginning of the loop chain: as we shall see, this increases the amount of data to be communicated, but also reduces the number of synchronizations (only one synchronization between each pair of neighboring processes per loop chain execution). From Section 3.5.3 it is known that, in a loop chain, a set is logically split into three regions: core, boundary, and non-exec. The boundary tiles, which originate from the seed partitioning of the boundary region, will include all iterations that cannot be executed until the communications have terminated. The procedure described for shared-memory parallelism – now performed individually by each process on a partition of the input mesh – is modified as follows:

1. The core region of the seed loop L0 is partitioned into tiles. Unless aiming for a mixed distributed/shared-memory scheme, there is no need to assign identical colors to unconnected tiles, as a process will execute its own tiles sequentially. Colors are assigned in increasing order, with Ti given color i. As long as tiles with contiguous IDs are also physically contiguous in the mesh, this assignment retains spatial locality when "jumping" from executing Ti to Ti+1.

2. The same process is applied to the boundary region. Thus, a situation in which a tile includes iterations from both the core and the boundary regions is prevented by construction. Further, all tiles within the boundary region are assigned colors higher than those used for the core tiles. This constrains the execution order: no boundary tile will be executed until all core tiles have been computed.

3. We map the whole non-exec region of L0 to a single special tile, Tne. This tile has the highest color and will actually never be executed.


Figure 3.7: A snapshot of the two mesh partitions on Process 0 and Process 1 after inspecting the seed loop L0 for distributed-memory parallelism. On each process, there are five tiles in total: two in the core region (green and violet), two in the boundary region (red and light blue), and Tne. The boundary tiles can safely cross the owned and exec sub-regions (i.e., the private local iterations and the iterations to be redundantly computed, respectively). However, no tile can include iterations from both the core and the boundary regions.


Figure 3.8: A snapshot of the two mesh partitions on Process 0 and Process 1 at the end of the inspection for distributed-memory parallelism. Tne expands over the boundary region, which minimizes the amount of redundant computation to be performed. At the end of the execution phase, the orange edges will contain “dirty values”, but correctness is not affected as the exec region only includes off-process data. The boundary tiles expand over the core region: this is essential for correctness since none of the red and blue entities from [L0, L1, L2] can be executed until the MPI communications have terminated.

From this point on, the inspection proceeds as in the case of shared-memory parallelism. The application of the MAX function when scheduling L1 and L2 makes higher color tiles (i.e., those having lower priority) "expand over" lower color ones.

In Figure 3.7, a mesh is partitioned over two processes and a possible seed partitioning and tiling of L0 is illustrated. We observe that the two boundary tiles (the red and light blue ones) will expand over the core tiles as L1 and L2 are tiled, which eventually results in the scheduling illustrated in Figure 3.8. Roughly speaking, if a loop chain consists of n loops and, on each process, n−1 extra layers of iterations are provided (the exec regions in Figure 3.7), then all boundary tiles are correctly computed.

The schedule produced by the inspector is subsequently used by the executor. On each process, the executor starts by triggering the MPI communications required for the computation of the boundary tiles. All core tiles are then computed, since no data from the boundary region is necessary; hence, computation is overlapped with communication. Once all core tiles have been computed and the MPI communications have terminated, the boundary tiles can finally be computed.

Efficiency considerations  The underlying hypothesis is that the increase in data locality will outweigh the overhead induced by the redundant computation and by the larger volume of data exchanged. This is motivated by several facts: (i) the loops are memory-bound; (ii) the core region is much larger than the boundary region; (iii) the amount of redundant computation is minimized through the special tile Tne, which progressively expands over the boundary region, thus avoiding unnecessary calculations.

3.7 Data Dependency Analysis for Loop Chains

As with all loop optimizations that reschedule the iterations in a sequence of loops, any sparse tiling must satisfy the data dependencies. The loop chain abstraction, which we have described in Section 3.5, provides enough information to construct an inspector which analyzes all of the dependencies in a computation and builds a legal sparse tiling. We recall that one of the main assumptions in a loop chain is that each loop is fully parallel or, equivalently, that there are no loop-carried dependencies.

The descriptors in the loop chain abstraction enable a general derivation of the storage-related dependencies between loops in a loop chain. The storage-related dependencies between loops can be described as either flow (read after write), anti (write after read), or output (write after write) dependencies. In the following, assume that loop Lx, having iteration space Sx, always comes before loop Ly, having iteration space Sy, in the loop chain. Let us identify a descriptor of a loop Li with m^mode_{Si→Sj}: this simply indicates that the loop Li has iteration space Si and uses a map m to write/read/increment elements (respectively, mode ∈ {w, r, i}) in the space Sj. The flow dependencies can then be enumerated by considering pairs of points (i and j) in the iteration spaces of the two loops Lx and Ly:

{ i → j | i ∈ Sx ∧ j ∈ Sy ∧ m^w_{Sx→Sz}(i) ∩ m^r_{Sy→Sz}(j) ≠ ∅ }.

Anti and output dependencies are defined in a similar way. The anti dependencies for all pairs of loops Lx and Ly are:

{ i → j | i ∈ Sx ∧ j ∈ Sy ∧ m^r_{Sx→Sz}(i) ∩ m^w_{Sy→Sz}(j) ≠ ∅ }.

The output dependencies between loops Lx and Ly are:

{ i → j | i ∈ Sx ∧ j ∈ Sy ∧ m^w_{Sx→Sz}(i) ∩ m^w_{Sy→Sz}(j) ≠ ∅ }.

In essence, there is a storage-related data dependence between two iterations from different loops (and therefore between the tiles they are placed in) when one of those iterations writes to a data element and the other iteration reads from or writes to the same data element. There are local reductions, or "reduction dependencies", between two or more iterations of the same loop when those iterations "increment" the same location(s); that is, when they read, modify with a commutative and associative operator, and write to the same location(s). The reduction dependencies in Lx are:

{ i → j | i ∈ Sx ∧ j ∈ Sx ∧ m^i_{Sx→Sz}(i) ∩ m^i_{Sx→Sz}(j) ≠ ∅ }.

The reduction dependencies between two iterations within the same loop indicate that those two iterations must be executed atomically with respect to each other. As seen in the example in Section 3.6, our inspector algorithm handles data dependencies, including those between non-adjacent loops, by tracking projections. In the next section we explain how projections are constructed and used.
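As a concrete illustration of how the access relations drive the analysis, the C sketch below enumerates flow dependencies between two loops Lx and Ly that both access a target set Sz through maps: it first records, for each element of Sz, whether Lx writes to it, and then flags every iteration of Ly that reads one of those elements. This is a naive check written for exposition, not the algorithm actually used by the inspector (which works at the granularity of tiles via projections); all names are illustrative.

#include <stdbool.h>
#include <stdlib.h>

/* Flag iterations of Ly that read (through map_y) an element of Sz
   written by some iteration of Lx (through map_x). */
static void flow_dependent_iterations(int nx, const int *map_x, int arity_x,
                                      int ny, const int *map_y, int arity_y,
                                      int nz, bool *dependent /* size ny */)
{
    bool *written = calloc(nz, sizeof(bool));
    for (int i = 0; i < nx; i++)                 /* writes of Lx */
        for (int k = 0; k < arity_x; k++)
            written[map_x[i*arity_x + k]] = true;
    for (int j = 0; j < ny; j++) {               /* reads of Ly */
        dependent[j] = false;
        for (int k = 0; k < arity_y; k++)
            if (written[map_y[j*arity_y + k]])
                dependent[j] = true;
    }
    free(written);
}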

3.8 Formalization

3.8.1 The Generalized Sparse Tiling Inspector

The pseudo-code for the generalized sparse tiling inspector is shown in Algorithm 1. Given a loop chain and a "seed" tile size, the algorithm produces a schedule suitable for mixed distributed/shared-memory parallelism. In the following, we elaborate on the main steps of the algorithm. The notation used throughout the section is summarized in Table 3.1.

Symbol              Meaning
L                   The loop chain
Lj                  The j-th loop in L
Sj                  The iteration space of Lj
Sj^c, Sj^b, Sj^ne   The core, boundary, and non-exec regions of Sj
S                   A generic set in L
D                   A descriptor of a loop
r, w, i             Possible values for D.mode
T                   The set of all tiles
T[i]                Accessing the i-th tile
φ_S                 A projection φ_S : S → T
Φ                   The set of all available projections
σ_j                 A tiling function σ_j : Sj → T for Lj
ts                  The seed tile size

Table 3.1: Summary of the notation used throughout the section.

Choice of the seed loop  The seed loop Lseed is used to initialize the tiles. Theoretically, any loop in the chain can be chosen as seed. Supporting distributed-memory parallelism, however, is cumbersome if Lseed ≠ L0. This is because more general schemes for partitioning and coloring would be needed to ensure that no iterations in any Sj^b are assigned to a core tile. A limitation of our inspector algorithm in the case of distributed-memory parallelism is that it must be Lseed = L0. In the special case in which there is no need to distinguish between core and boundary tiles (because a program is executed on a single shared-memory system), Lseed can be chosen arbitrarily. If we however pick Lseed in the middle of the loop chain (L0 ≺ ... ≺ Lseed ≺ ...), a mechanism for constructing tiles in the reverse direction ("backwards"), from Lseed towards L0, is necessary.

ALGORITHM 1: The inspection algorithm

Input: The loop chain L = [L0, L1, ..., Ln−1], a tile size ts
Output: A set of tiles T, populated with iterations from L

// Initialization
1  seed ← 0;
2  Φ ← ∅;
3  C ← ⊥;
// Creation of tiles
4  σ_seed, T ← partition(S_seed, ts);
5  seed_map ← find_map(S_seed, L);
6  conflicts ← false;
// Schedule loops to tiles
7  do
8    color(T, seed_map);
9    for j = 1 to n−1 do
10     project(L_{j−1}, σ_{j−1}, Φ, C);
11     σ_j ← tile(L_j, Φ);
12     assign(σ_j, T);
13   end for
14   if has_conflicts(C) then
15     conflicts ← true;
16     add_fake_connection(seed_map, C);
17   end if
18 while conflicts;
// Inspection successful, create local maps and return
19 compute_local_maps(T);
20 return T

In Strout et al. [2014], we propose two "symmetric" algorithms to solve this problem, forward tiling and backward tiling, with the latter using the MIN function in place of MAX when computing projections. For ease of exposition, and since in the fundamental case of distributed-memory parallelism we impose Lseed = L0, we here neglect this distinction2.

2The algorithm implemented in the library presented in Section 3.9.1 supports backwards tiling for shared-memory parallelism.

Field             Possible values
region            core, boundary, non-exec
iterations lists  one list of iterations [Ti]j for each Lj ∈ L
local maps        one list of local maps for each Lj ∈ L; one local map for each map used in Lj
color             an integer representing the execution priority

Table 3.2: The tile data structure.

Tile initialization  Let ts be the user-specified average tile size. The algorithm starts with partitioning S_seed^c into m subsets {P0, P1, ..., Pm−1} such that |Pi| = ts (except possibly for Pm−1), Pi ∩ Pj = ∅, and P0 ∪ P1 ∪ ... ∪ Pm−1 = S_seed^c. Among all possible legal partitionings, we choose the one that splits S_seed^c into blocks of ts contiguous iterations, with P0 = {0, ..., ts−1}, P1 = {ts, ..., 2ts−1}, and so on. We analogously partition S_seed^b into k subsets. We create m + k + 1 tiles, one for each of these partitions and one extra tile for S_seed^ne. We therefore have T = {T0^c, ..., Tm−1^c, Tm^b, ..., Tm+k−1^b, Tm+k^ne}. A tile Ti has four fields, as summarized in Table 3.2.

• The region is used by the executor to schedule tiles in a given order. This field is set right after the partitioning of Lseed, as a tile (by construction) exclusively belongs to S_seed^c, S_seed^b, or S_seed^ne.

• The iterations lists contain the iterations in L that Ti will have to execute. There is one iterations list [Ti]j for each Lj ∈ L. At this stage of the inspection we have [Ti]seed = [Ti]0 = Pi, whereas still [Ti]j = ∅ for j = 1, ..., n−1.

• Local maps may be used for performance optimization by the executor in place of the global maps provided through the loop chain; this will be discussed in more detail in Section 3.11.

• The color gives a tile a scheduling priority. If shared-memory parallelism is requested, adjacent tiles are given different colors (the adjacency relation is determined through the maps available in L). Otherwise, colors are assigned in increasing order (i.e., Ti is given color i). The boundary tiles are always given colors higher than those of the core tiles; the non-exec tile has the highest color. The assignment of colors is carried out by the function color in Algorithm 1; a simple sketch of the distributed-memory case is given after this list.
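The sketch below illustrates the color assignment in the distributed-memory case only, where no two tiles of the same process run concurrently: core tiles receive increasing colors, boundary tiles receive colors higher than any core tile, and the non-exec tile receives the highest color. The shared-memory case, which must additionally avoid giving adjacent tiles the same color, is not shown. Function and variable names are illustrative, not SLOPE's.

/* m core tiles, k boundary tiles, one non-exec tile, stored in that order. */
static void assign_colors_distributed(int m, int k, int *tile_color)
{
    for (int i = 0; i < m; i++)          /* core tiles: colors 0 .. m-1 */
        tile_color[i] = i;
    for (int i = 0; i < k; i++)          /* boundary tiles: colors m .. m+k-1 */
        tile_color[m + i] = m + i;
    tile_color[m + k] = m + k;           /* non-exec tile: highest color */
}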

Populating tiles by tracking data dependencies  To schedule a loop to tiles we use projections. A projection is a function φ_S : S → T. Initially, the projections set Φ is empty. Each time a loop is tiled, new projections may be added to Φ or old projections may be updated. Φ, and consequently the tiling functions for all loops in L, are derived incrementally (within the loop at line 9 in Algorithm 1) starting from σ_seed : S_seed → T, the tiling function of Lseed. In the following, we discuss in detail how projections and tiling functions are constructed.

ALGORITHM 2: Projection of a tiled loop

Input: A loop Lj, a tiling function σj, the projections set Φ, the conflicts matrix C
Result: Update Φ and C

1  foreach D ∈ Lj.descriptors do
2    if D.map == ⊥ then
3      Φ = Φ ∪ σj;
4    else
5      inverse_map ← map_invert(D.map);
6      St, Sj, values, offset ← inverse_map;
7      φ_St ← ⊥;
8      for e = 0 to St.size do
9        for k = offset[e] to offset[e + 1] do
10         Tlast = T[values[k]];
11         max_color ← MAX(φ_St[e].color, Tlast.color);
12         if max_color ≠ φ_St[e].color then
13           φ_St[e] ← Tlast;
14         end if
15       end for
16     end for
17     update(C, T, φ_St);
18     Φ = Φ ∪ φ_St;
19   end if
20 end foreach

Deriving a projection from a tiling function  Algorithm 2 takes as input (the descriptors of) Lj and its tiling function σj : Sj → T to update Φ. The algorithm also updates the conflicts matrix C ∈ N^{m×m}, which indicates whether two tiles having the same color will become adjacent once Lj+1 is tiled.

A projection tells what tile a set element logically belongs to at a given point of the inspection. A new projection φ_S is needed if the elements of S are written by a loop. Let us consider the non-trivial case in which writes or increments occur indirectly through a map M : Sj → St^0 × St^1 × ... × St^{a−1}. To compute φ_St, we first determine the inverse map (an example is shown in Figure 3.9). Then, we iterate over all elements of St and, for each e ∈ St, we determine the last tile that writes to e, say Tlast. This is accomplished by applying the MAX function over the colors of the tiles accessing e. We finally simply set φ_St[e] = Tlast.


Figure 3.9: Representation of an inverse map. The original map shows that the triangular cell 1 is adjacent to three vertices, namely 3, 7, and 9. The inverse map associates vertices to cells. Since the mesh is unstructured, different vertices can be incident to a different number of cells. The array offset determines the distance between two consecutive vertices in the inverse map. For instance, all entries in the inverse map between offset[3] and offset[4] are cells incident to vertex 3 (in the illustrated case, the first of these is cell 1).
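An inverse map of this kind can be built with a standard two-pass, CSR-style construction: count how many source elements touch each target element, prefix-sum the counts into the offset array, then fill the values array. The sketch below, with illustrative names (it is not SLOPE's map_invert), shows this construction in C.

#include <stdlib.h>

/* Invert a map of given arity from a source set of size n_src to a target
   set of size n_tgt. On return, for each target element t, the incident
   source elements are values[offset[t]] .. values[offset[t+1]-1]. */
static void invert_map(int n_src, int arity, const int *map,
                       int n_tgt, int **offset_out, int **values_out)
{
    int *offset = calloc(n_tgt + 1, sizeof(int));
    int *values = malloc((size_t)n_src * arity * sizeof(int));

    for (int i = 0; i < n_src * arity; i++)     /* count incidences */
        offset[map[i] + 1]++;
    for (int t = 0; t < n_tgt; t++)             /* prefix sum into offsets */
        offset[t + 1] += offset[t];

    int *next = calloc(n_tgt, sizeof(int));     /* per-target fill positions */
    for (int i = 0; i < n_src; i++)
        for (int k = 0; k < arity; k++) {
            int t = map[i*arity + k];
            values[offset[t] + next[t]++] = i;
        }
    free(next);
    *offset_out = offset;
    *values_out = values;
}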

Deriving a tiling function from the available projections  Using Φ, we compute σj as described in Algorithm 3. The algorithm is similar to the projection of a tiled loop, with the main difference being that now we use Φ to schedule iterations correctly. Finally, σj is inverted and the iterations are added to the corresponding iteration lists [Ti]j, for all Ti ∈ T.

Detection of conflicts  If C indicates the presence of at least one conflict, say between Ti1 and Ti2, we add a "fake connection" between these two tiles and loop back to the coloring stage. Ti1 and Ti2 are now connected, so they will be assigned different colors.

ALGORITHM 3: Building a tiling function

Input: A loop Lj, the projections set Φ
Output: The tiling function σj

1  σj ← ⊥;
2  foreach D ∈ Lj.descriptors do
3    if D.map == ⊥ then
4      σj ← Φ[Sj];
5    else
6      arity ← D.map.arity;
7      φ_S ← Φ[D.map.Sj];
8      for e = 0 to Sj.size do
9        σj[e] ← ⊥;
10       for k = 0 to arity do
11         adjacent_tile ← φ_S[D.map.values[e*arity + k]];
12         max_color ← MAX(σj[e].color, adjacent_tile.color);
13         if max_color ≠ σj[e].color then
14           σj[e] ← adjacent_tile;
15         end if
16       end for
17     end for
18   end if
19 end foreach
20 return σj

History of the algorithm  A first algorithm for sparse tiling inspection was developed by the author of this thesis in conjunction with researchers from multiple institutions, and was introduced in Strout et al. [2014]. In this section, a new, enhanced version of that algorithm has been presented. In essence, the major differences are: (i) support for distributed-memory parallelism; (ii) use of mesh coloring instead of a task graph for tile scheduling; (iii) speculative inspection with backtracking if a coloring conflict is detected; (iv) use of sets, instead of datasets, for data dependency analysis; (v) use of inverse maps for parallelization of the projection and tiling routines; (vi) computation of local maps. Most of these changes contributed to reducing the inspection cost.

3.8.2 The Generalized Sparse Tiling Executor

The sparse tiling executor is illustrated in Algorithm 4. It consists of four main phases: (i) exchange of halo regions amongst neighboring processes through non-blocking communications; (ii) execution of core tiles (in overlap with communication); (iii) wait for the termination of the communications; (iv) execution of boundary tiles.

ALGORITHM 4: The executor algorithm

Input: A set of tiles T
Result: Execute the loop chain

1  T^c, T^b ← group_tiles_by_region(T);
2  start_MPI_comm();
3  foreach color do
4    foreach T ∈ T^c s.t. T.color == color do
5      execute_tile(T);
6    end foreach
7  end foreach
8  end_MPI_comm();
9  foreach color do
10   foreach T ∈ T^b s.t. T.color == color do
11     execute_tile(T);
12   end foreach
13 end foreach

As explained in Sections 3.6 and 3.8.1, a sufficiently deep halo region enables correct computation of the boundary tiles. Further, tiles are executed atomically, meaning that all iterations in a tile are computed without ever synchronizing with other processes. The depth of the boundary region, which affects the amount of off-process data to be redundantly computed, increases with the number n of loops to be fused. In the example in Figure 3.7, there are n = 3 loops, and three "strips" of extra vertices are necessary for correctly computing the fused loops without tile-to-tile synchronizations.

We recall from Section 3.5.3 that the depth of the loop chain indicates the extent of the boundary region. This parameter imposes a limit on the number of fusible loops. If L includes more loops than the available boundary region allows – that is, if n > depth – then L will have to be split into shorter loop chains, to be fused individually. As we shall see (Section 3.9.3), in our inspector/executor implementation the depth is controlled by Firedrake's DMPlex module.

3.8.3 Computational Complexity of Inspection

Let N be the maximum size of a set in L = [L0, L1, ..., Ln−1] and let M be the maximum number of sets accessed in a loop. If a is the maximum arity of a map, then K = aN is the maximum cost for iterating over a map. K is also the worst-case cost for inverting a map. Let p < 1 be the probability that a conflict arises during inspection in the case of shared-memory parallelism; thus, the expected number of inspection rounds is R = 1/(1 − p). Hence, the worst-case computational costs of the main inspection phases are as in Table 3.3.

Phase         Cost (shared memory)   Cost (distributed memory)
Partitioning  N                      N
Coloring      RK                     N
Projection    R(nMNK²)               nMNK²
Tiling        R(nMNK)                nMNK
Local maps    nM                     nM

Table 3.3: Worst-case computational costs of the main inspection phases.

3.9 Implementation

The implementation of automated generalized sparse tiling is distributed over three software modules.

Firedrake The framework for the automated solution of PDEs through the finite element method (Section 2.2.1).

PyOP2 Firedrake produces numerical kernels to be applied over sets of mesh components. The parallel iteration over the mesh is handled by PyOP2 (Section 2.2.2).

SLOPE A library for writing generalized inspector/executor schemes, with primary focus on sparse tiling. PyOP2 uses SLOPE to apply sparse tiling to loop chains.

There are several reasons that motivate this structuring.

Simplicity of analysis  The abstractions used in Firedrake and PyOP2 drastically simplify the analysis of input programs. For example, from the parallel loop construct in PyOP2 (Section 2.2.2) we can derive the information that we need to construct a loop chain.

Flexibility  Inspector/executor schemes can be expressed at three different layers of abstraction: in Firedrake programs, in PyOP2 programs, or directly in C. It is extremely intuitive and simple to use sparse tiling in a Firedrake program. Once the regions of code to be sparse tiled are identified, the inspector/executor schemes are automatically derived in the PyOP2 implementation. In the case of a pure PyOP2 program, the generation of inspector/executor schemes is instead only partly automated, since the separation of a set into the core, boundary and non-exec regions for distributed-memory parallelism is the user's responsibility. An inspector/executor scheme can also be written from scratch in a plain C program. In this case, the loop chain must be provided explicitly through direct calls to the SLOPE library.

Realistic simulations  We aim to use generalized sparse tiling in real-world programs. The choice was then to use a framework, Firedrake, with a user base that could provide meaningful examples. The implementation effort is higher, but so is the potential scientific impact.

The interplay amongst Firedrake, PyOP2 and SLOPE is outlined in Figure 3.10 and discussed in more detail in the following sections.

3.9.1 SLOPE: a Library for Sparse Tiling Irregular Computations

SLOPE is open-source software that provides an interface to build loop chains and to express inspector/executor schemes for sparse tiling3. The loop chain abstraction implemented by SLOPE has been formalized in Section 3.5.3. In essence, a loop chain comprises some sets (including the separation into core, boundary, and non-exec regions), maps between sets, and a sequence of loops. Each loop has one or more descriptors specifying what and how different sets are accessed. The example in Listing 8 illustrates the interface exposed by SLOPE.

3SLOPE is a contribution of this thesis and is available at https://github.com/coneoproject/SLOPE


Figure 3.10: Sparse tiling in the Firedrake-PyOP2-SLOPE framework. There are three ways of sparse tiling a loop chain: decorating a Firedrake program (1A), decorating a sequence of loops in a PyOP2 program (1B), writing both the loop chain and the inspector/executor codes explicitly in C through calls to SLOPE (1C). Both (1A) and (1B) use the loop chain interface (details in Section 3.9.2). The kernels generated within a loop chain are pre-processed in PyOP2 (2) and forwarded to SLOPE through its Python interface (3). SLOPE now has access to the loop chain, so it can generate an inspector/executor scheme and return it to PyOP2 (4). The inspector is compiled and executed. The result, a schedule (i.e., the output of Algorithm 1), is cached and used as input to the executor. Each time the same loop chain is encountered in a Firedrake/PyOP2 program, the corresponding schedule is reused.

SLOPE implements the algorithms in Section 3.8.1. Further, it provides additional features to estimate the effectiveness and to verify the correctness of sparse tiling:

VTK file generator  For each tiled loop, a file showing the mesh and its partitioning into colored tiles is generated. The file is suitable for visualization in ParaView [Ayachit, 2015].

Inspection summary  The inspector returns useful information concerning the sparse tiling process, including: the number and the average size of tiles, the total number of colors used (which can partly explain the performance of a shared-memory parallelization), and the time spent in the critical inspection phases.

In the case of shared-memory parallelism, the following sections of code are parallelized through OpenMP:

• The projection and tiling algorithms; in particular, the loop at line 8 of Algorithm 2 and the loop at line 8 of Algorithm 3.

• The execution of tiles having the same color; that is, the loops at lines 4 and 10 of Algorithm 4.

3.9.2 PyOP2: Lazy Evaluation and Interfaces

We focus on three relevant aspects of PyOP2: (i) the interface exposed to identify loop chains; (ii) the lazy evaluation mechanism that allows loop chains to be built; (iii) the interaction with SLOPE to build and execute inspector/executor schemes. To apply sparse tiling to a sequence of loops, PyOP2 provides the loop chain interface, as exemplified in Listing 5. This interface is exposed to PyOP2 users, including Firedrake.

LISTING 5: The loop chain interface in PyOP2. The name uniquely identifies a loop chain. Other parameters (most of them optional) are useful for performance evaluation (e.g., logging) and performance tuning. The tile size specifies the initial average size for the seed partitions. The fusion scheme allows one to specify how to break a long sequence of loops into smaller loop chains, which makes it possible to experiment with a full set of sparse tiling strategies without having to modify the source code.

with loop_chain(name, tile_size, fusion_scheme, ...):
    <any Python code here>

PyOP2 exploits lazy evaluation of parallel loops to generate an inspector/executor scheme. The parallel loops encountered during the program execution – or, analogously, those generated through Firedrake – are pushed into a queue, instead of being executed immediately. The sequence of parallel loops in the queue is called the trace. If a dataset f needs to be read, for example because a user wants to inspect its values or a global linear algebra operation needs to be performed, the trace is traversed – from the most recent parallel loop to the oldest one – and a new sub-trace produced. The sub-trace includes all parallel loops that must be executed to evaluate f correctly. The sub-trace can then be executed or further pre-processed. All loops in a trace that were created within a loop chain scope are sparse tiling candidates. In detail, the interaction between PyOP2 and SLOPE is as follows:

1. Listing 5 shows that a loop chain defines a new scope. As this scope is entered, a stamp s1 of the trace is generated. This happens "behind the scenes", because the loop chain is a Python context manager, which can execute pre-specified routines before and after the execution of its body. As the loop chain's scope is exited, a new stamp s2 of the trace is computed. All parallel loops in the trace generated between s1 and s2 are placed into a sub-trace for pre-processing.

2. The pre-processing consists of two steps: (i) "simple" fusion – consecutive parallel loops iterating over the same iteration space that do not present indirect data dependencies are merged; (ii) generation of a loop chain representation for SLOPE.

3. In (ii), PyOP2 inspects the sequence of loops and translates all relevant data structures (sets, maps, loops) into a format suitable for SLOPE's Python interface. C code implementing an inspector for the loops in the loop chain is returned by SLOPE. PyOP2 compiles and executes this code, which results in an inspection for the loop chain.

4. A "software cache" mapping loop chains to inspections is used. This whole process therefore needs to be executed only once for a given loop chain.

5. The executor is built in an analogous way to the inspector.

3.9.3 Firedrake/DMPlex: the S-depth Mechanism for Extended Halo Regions

Firedrake uses DMPlex [Lange et al., 2015] to handle meshes. DMPlex is responsible for partitioning, distributing over multiple processes, and locally reordering a mesh. The MPI parallelization is therefore managed through Firedrake/DMPlex. During the start-up phase, each MPI process receives a contiguous partition of the original mesh from DMPlex. The required PyOP2 sets, which can represent either topological components (e.g., cells, vertices) or function spaces, are created. As explained in Section 2.2.2, these sets distinguish between multiple regions: core, owned, exec, and non-exec. Firedrake initializes the four regions exploiting the information provided by DMPlex.

To support the loop chain abstraction, Firedrake must be able to allocate arbitrarily deep halo regions. Both Firedrake and DMPlex have been extended to support this feature4. A parameter called S-depth (the name has historical origins, see for instance Chronopoulos [1991]) regulates the extent of the halo regions. A value S-depth = n indicates the presence of n strips of off-process data elements in each set. The default value is S-depth = 1, which enables computation-communication overlap when executing a single loop at the price of a small amount of redundant computation along partition boundaries. This is the default execution model in Firedrake.

3.10 Performance Evaluation - Benchmarks

The experimentation with generalized sparse tiling consisted of two phases:

1. Initially, the technique was tried in two benchmarks: a sparse Jacobi kernel and a proxy unstructured mesh application, originally developed as a demo for the OP2 framework. The objectives of this phase were (i) to explore the impact of generalized sparse tiling on performance, (ii) to characterize the circumstances where the approach is profitable, (iii) to identify the potential limitations of the technique in real applications, (iv) to prototype a compiler.

2. Then, a real application developed in Firedrake, Seigen (an elastic wave equation solver for seismological problems), was used for systematic performance evaluation. This is presented in Section 3.11. Thanks to its long sequence of loops, Seigen provided a whole suite of sparse tiling benchmark configurations.

In this section, we focus on phase 1. Both benchmarks are written in C. Sparse tiling was introduced manually using a preliminary version of SLOPE. A previous version of the inspector algorithms presented in this chapter, described in Strout et al. [2014], was used. As detailed next, one of the major drawbacks of this older inspector was its cost, which grew very rapidly with the number of loops in the chain and the number of distinct datasets accessed. Finally, only shared-memory parallelism via OpenMP was supported. In all experiments presented in this section, the optimal tile size (i.e., the one leading to the best execution time) was determined empirically for each combination of architecture and application.

4The implementation was mostly carried out by Michael Lange.

Matrix name   Execution time reduction %   Speed-up
ldoor         40.34                        12.11
pwtk          38.42                        11.98
thermal2      25.78                        11.08
xenon2        20.15                        9.53
audikw_1      13.42                        8.70
nd24k         -151.72                      3.06

Table 3.4: Execution time reductions over the original implementation (in percentage) and speed-ups over the single-threaded tiled implementation for the sparse Jacobi solver with 15 threads.

3.10.1 Sparse Jacobi

The first benchmark was the sparse tiling of a Jacobi sparse matrix solver5. Given a sparse matrix A and a vector f, related by Au = f, each iteration of the sparse Jacobi method produces an approximation to the unknown vector u. In our experiments, the Jacobi convergence iteration loop is unrolled by a factor of two and the resulting two loops are chained together (1000 iterations of the loop chain were executed). Using a ping-pong strategy, each loop reads from one copy of the u vector and writes to the other copy. This experiment was run on an Intel Westmere (dual-socket 8-core Intel Xeon E7-4830 2.13 GHz, 24MB shared L3 cache per socket). The code was compiled using gcc-4.7.0 with options -O3 -fopenmp.

The Jacobi recurrence equation includes a sparse matrix-vector multiplication and is representative of a broad class of sparse linear algebra applications. It is also an effective test-bed because different data dependency patterns can be examined simply by using different input matrices. In these experiments, a set of 6 input matrices, drawn from the University of Florida Sparse Matrix Collection [Davis and Hu, 2011], was used. The matrices were selected so that they would vary in overall data footprint, from 45 MB to 892 MB, and in percentage of non-zeros, from very sparse at 0.0006% to much denser at 0.5539%.

5This section is partly extracted from Strout et al. [2014]; the experiments were conducted by Christopher D. Krieger, Catherine Olschanowsky, and Michelle Mills Strout.
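For reference, one sweep of the Jacobi method over a matrix stored in CSR format has the following shape; the two sweeps obtained by unrolling the convergence loop read from one copy of u and write to the other, which is what the loop chain fuses. This is a generic C sketch with illustrative names, not the benchmark's actual code.

/* One Jacobi sweep: u_new[i] = (f[i] - sum_{j != i} A[i][j]*u_old[j]) / A[i][i],
   with A stored in CSR format (rowptr, colidx, val). */
static void jacobi_sweep(int n, const int *rowptr, const int *colidx,
                         const double *val, const double *f,
                         const double *u_old, double *u_new)
{
    for (int i = 0; i < n; i++) {
        double diag = 0.0, sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            if (colidx[k] == i)
                diag = val[k];
            else
                sum += val[k] * u_old[colidx[k]];
        }
        u_new[i] = (f[i] - sum) / diag;
    }
}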

Architecture   Implementation   Execution time (s)   Speed-up
Westmere       omp              36.87                6.43
               mpi              31.0                 7.66
               tiled            26.49                8.96
Sandy Bridge   omp              30.01                6.65
               mpi              24.42                8.17
               tiled            20.63                9.67

Table 3.5: Execution time (in seconds) and speed-ups over the slowest single-threaded implementation for the Airfoil benchmark. Respectively, 16 and 24 threads/processes are used on the Sandy Bridge and Westmere machines.

Table 3.4 compares the performance of the tiled Jacobi solver to that of a simple blocked version. Both codes use OpenMP parallel for directives to achieve parallelism. The execution time reduction varied from 13% to 47%, with the exception of the nd24k matrix, which showed as much as a 1.52x slowdown when sparse tiled. This matrix is highly connected, thus limiting the number of tiles that can be scheduled in parallel. The greater parallelism available under a blocked approach provides more benefit in this case than the performance improvements due to improved locality from full sparse tiling. Overall, speed-ups of between 8 and 12 times over the single-threaded tiled implementation were observed when using 15 threads; a clear outlier is again the nd24k matrix, which did not scale past 3.2 times the single-thread performance.

The values in Table 3.4 do not include the inspection time necessary to fully sparse tile the loop chain. To break even when this cost is considered, the inspector time must be amortized over between 1000 and 3000 iterations of the executor, depending on the specific matrix being solved. We will further elaborate on this aspect in Section 3.10.3.

3.10.2 Airfoil

The second benchmark was Airfoil [Giles et al., 2005], a proxy finite volume application designed in the context of the OP2 project to represent the 3D viscous flow calculations typical of the Hydra CFD code [Reguly et al., 2016] at Rolls-Royce plc. It is a nonlinear 2D inviscid airfoil code using an unstructured mesh, which solves the 2D Euler equations employing a scalar numerical dissipation. Airfoil has a predictor/corrector time-marching loop consisting of 2000 iterations, and iterates towards the steady state solution by using a control volume approach in each iteration [OP2 contributors, 2011]. At each time iteration, four parallel loops are executed. The first two loops, called adt-calc and res-calc, constitute the bulk of the computation. The adt-calc loop iterates over cells, reads from adjacent vertices and writes to a local dataset, whereas res-calc iterates over edges and exploits indirect mappings to vertices and cells for incrementing indirect datasets associated with cells. These loops share datasets associated with cells and vertices. The res-calc loop computes the flux over the interior edges. The third and fourth loops, called bres-calc and update, are less computationally expensive. The former computes the flux over the boundary edges, while the latter – a direct loop over cells – updates the current solution. Double-precision floating-point arithmetic is used.

Three implementations of Airfoil, omp, mpi and tiled, were compared on two shared-memory machines, an Intel Westmere (dual-socket 6-core Intel Xeon X5650 2.66 GHz, 12MB of shared L3 cache per socket) and an Intel Sandy Bridge (dual-socket 8-core Intel Xeon E5-2680 2.00 GHz, 20MB of shared L3 cache per socket). The code was compiled using the Intel icc 2013 compiler with optimizations enabled (-O3, -xSSE4.2 on the Westmere and -xAVX on the Sandy Bridge).

The omp and mpi versions of Airfoil were implemented in OP2. The effectiveness of these parallelization schemes has been demonstrated in Giles et al. [2011]. The tiled implementation uses an early version of the SLOPE library (the differences with the inspector algorithms shown in Section 3.8.1 are discussed later) and is based on shared-memory parallelism via OpenMP. We manually unrolled the time loop by a factor of two to be able to tile over 6 loops in total.

Table 3.5 shows the runtime reduction achieved by sparse tiling the loop chain on the Westmere and Sandy Bridge architectures. The input unstructured mesh was composed of 1.5 million edges. It is worth noticing that both the omp and tiled versions suffer from the well-known NUMA effect, as threads are always equally spread across the two sockets. Nevertheless, compared to mpi, the tiled version exhibits a peak runtime reduction of 15% on the Westmere and of 16% on the Sandy Bridge. The results shown for tiled do not, however, include the inspector overhead. By also including it, the aforementioned improvements over mpi reduce to roughly 10% on both platforms. As with the sparse Jacobi solver, the slowdown introduced by the inspection overhead is significant.

3.10.3 Outcome

Need for highly optimized inspection  The inspection overhead can significantly affect sparse tiling. The situation may be even worse in real applications, which are usually characterized by longer loop chains, or when the mesh changes over time. This experimental phase led to re-engineering the inspection scheme presented in Strout et al. [2014] into the version described in this chapter. To summarize, the critical differences are: (i) data dependency analysis abstracted to the level of sets, rather than datasets; (ii) optimistic coloring with backtracking in case of conflicts; (iii) parallelization of the projection and tiling routines through the use of inverse maps.

Need for automation The second lesson from this experimentation is that automation is indispensable. Writing the inspector as a sequence of calls to SLOPE is relatively simple, although tedious and error-prone. Integrating the executor is much more complicated, because this requires the rewriting of an entire sequence of loops. This is a severe limitation as it poses an implementation burden on potential users. These considerations led to the multilayer framework detailed in Section 3.9, which automatically generates inspector/executor schemes in Firedrake programs.

Need for distributed-memory support The Airfoil experiments highlighted that shared-memory parallelism over multi-socket architectures is drastically affected by the non-uniform memory access (NUMA) issue. The difference in execution time between the OP2 OpenMP and MPI versions is in this sense remarkable. As discussed in Reguly et al. [2016], the irregular nature of the computation makes it hard to find systematic solutions to the NUMA issue in the case of pure OpenMP parallelism. The sparse tiled implementation was also purely based on OpenMP, so it suffered from the same problem. This led to the definition of a more general loop chain abstraction (Section 3.5.3) and to the introduction of more flexible inspector algorithms (Section 3.8.1), enabling distributed-memory execution. The MPI and the hybrid OpenMP/MPI execution models are naturally better options for unstructured mesh applications.

3.11 Performance Evaluation - Seigen: an Elastic Wave Equation Solver for Seismological Problems

In this section, sparse tiling is applied to Seigen, a computation built on top of Firedrake. The performance analysis presented in the following sections is itself a methodological contribution of the thesis. The objectives of this experimentation were to quantify and to motivate in detail the impact of the optimization in a complex, realistic application. For this, we executed a large set of experiments on multiple architectures, made use of profilers and analytical tools, and studied the correlation with key simulation parameters, such as the polynomial order and the mesh.

3.11.1 Computation

3.11.1.1 The Seismic Model

Seigen is a new seismological modelling framework capable of solving the elastic wave equation on unstructured meshes. Exploiting the well-known velocity-stress formulation [Virieux, 1986], the seismic model is expressible as two first-order linear PDEs, which we refer to as velocity and stress. These governing equations are discretized in space through the discontinuous-Galerkin finite element method. The evolution over time is obtained by using a fourth-order explicit leapfrog scheme based on a truncated Taylor series expansion of the velocity and stress fields. The particular choice of spatial and temporal discretizations has been shown to be non-dissipative [Delcourte et al., 2009]. More details can be found in Jacobs et al. [2016]. Seigen is part of OPESCI, an ecosystem of software for elastic wave modeling [Gorman et al., 2015].
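For reference, in the isotropic case the velocity-stress formulation takes the standard textbook form below (ρ is the density, λ and μ the Lamé parameters, f a body force). The exact coefficient conventions and source terms used in Seigen may differ; this is only an illustration of the structure of the two governing equations.

\rho \frac{\partial \mathbf{v}}{\partial t} = \nabla \cdot \boldsymbol{\sigma} + \mathbf{f},
\qquad
\frac{\partial \boldsymbol{\sigma}}{\partial t} = \lambda \, (\nabla \cdot \mathbf{v}) \, \mathbf{I} + \mu \left( \nabla \mathbf{v} + (\nabla \mathbf{v})^{T} \right).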

3.11.1.2 Choice of the Test Case

The Seigen framework has been implemented using Firedrake by Christian Jacobs6, along with a set of test cases. In this experimentation, we use the explosive source test case. The test cases may differ in various aspects, such as the initial conditions of the system and the propagation of waves. However, they are all based upon the same seismological model; from a computational viewpoint, this means that, in a time step, the same sequence of loops is executed. Consequently, the performance analysis of explosive source in Section 3.11.3 is generalizable to the other test cases. In fact, some of the conclusions will be generalizable to even completely different codes.

3.11.1.3 Implementation

In a time loop iteration, eight linear systems need to be solved, four from velocity and four from stress. Each solve consists of three macro-steps (see also Section 2.1.4): assembling a global matrix A; assembling a global vector b; computing x in the system Ax = b. There are two global “mass” matrices, one for velocity and one for stress. Both matrices are

• Time invariant, so they are assembled before entering the time loop.

• Block-diagonal, as a consequence of the spatial discretization employed; a block belongs to an element in the mesh. The inverse of a block-diagonal matrix is again block-diagonal and is determined by computing the inverse of each block. The solution of the linear system Ax = b, expressible as x = A^{-1}b, can therefore be evaluated by looping over the mesh and computing a “small” matrix-vector product in each element, where the matrix is a block in A^{-1}.
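For illustration, the element-wise “solve” can be sketched in a few lines of NumPy; the array names and the dense per-element storage of the inverted blocks are our own assumptions, not Seigen's actual data layout.

import numpy as np

def blockdiag_solve(A_inv_blocks, b, block_size):
    # A_inv_blocks: (num_elements, block_size, block_size) inverted diagonal blocks.
    # b:            (num_elements * block_size,) global right-hand side vector.
    # Returns x with A x = b, computed one "small" matrix-vector product per element.
    num_elements = A_inv_blocks.shape[0]
    x = np.empty_like(b)
    for e in range(num_elements):                 # loop over mesh elements
        lo, hi = e * block_size, (e + 1) * block_size
        x[lo:hi] = A_inv_blocks[e] @ b[lo:hi]     # block of A^{-1} times local part of b
    return x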

Assembling the global vectors reduces to executing a set of loops over mesh entities, particularly over cells, interior facets, and exterior facets. Overall, twenty-five loops are executed in a time loop iteration. Thanks to the hierarchy of “software caches” employed by Firedrake, the translation from mathematical syntax into loops is only performed once.

6Formerly Imperial College London, now University of Southampton

Figure 3.11: Four phases of the explosive source test case in Seigen, following an explosion at a point source: (a) right after the explosion; (b) wave propagation (1); (c) wave propagation (2); (d) final snapshot.

3.11.1.4 Application of Sparse Tiling

Introducing sparse tiling into Seigen was relatively straightforward. Three steps were required: (i) embedding the time stepping loop in a loop chain context (see Section 3.9.2), (ii) propagating the relevant user input for performance tuning, (iii) creating a set of fusion schemes. A fusion scheme establishes which sub-sequences of loops within a loop chain will be fused and the respective seed tile sizes. If no fusion schemes were specified, all of the twenty-five loops would be fused using a default tile size. As we shall see, operating with a set of small loop chains and heterogeneous tile sizes is much more effective than fusing long sequences of loops. Hence, specifying multiple fusion schemes is likely to have a direct payoff in performance. As an example, one such scheme treats the loops arising in velocity and stress as two different loop chains.
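The mechanism behind step (i) can be illustrated with a self-contained toy: parallel loops issued inside the context are recorded instead of being executed immediately, so that an inspector can later derive a fused, sparse-tiled schedule. The names below (loop_chain, par_loop) echo the PyOP2 interface, but the code is purely illustrative and not the actual Firedrake/PyOP2 API.

from contextlib import contextmanager

_trace = None

@contextmanager
def loop_chain(name, tile_size):
    # Start recording the loops issued in the time-stepping body.
    global _trace
    _trace = []
    yield
    recorded, _trace = _trace, None
    # A real implementation would hand the recorded loops to SLOPE for
    # inspection and then run the fused, tiled executor; here we just run them.
    print(f"loop chain '{name}': {len(recorded)} loops, seed tile size {tile_size}")
    for loop in recorded:
        loop()

def par_loop(fn):
    # Deferred execution while inside a loop chain, immediate otherwise.
    if _trace is not None:
        _trace.append(fn)
    else:
        fn()

with loop_chain("velocity_stress", tile_size=300):
    par_loop(lambda: print("cell assembly"))
    par_loop(lambda: print("interior facet assembly"))
    par_loop(lambda: print("mat-vec solve"))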

3.11.1.5 Validation

Seigen has several mechanisms to validate the correctness of the seismological model and the test cases. The numerical results of all code versions (with and without tiling) were checked and compared. ParaView was also used to verify the simulation output. A snapshot of the simulation output is displayed in Figure 3.11.

3.11.1.6 Parametrization of the Computation

There are two parameters that we can vary in explosive source: the polynomial order of the method and the input mesh.

Polynomial order q We will test the spectrum q ∈ {1, 2, 3, 4}. To test higher polynomial orders, changes to both the spatial and temporal discretizations would be necessary. For the spatial discretization, the most obvious choice would be tensor product function spaces. However, this functionality is still under development in Firedrake.

Mesh We use as input a two-dimensional rectangular domain of fixed size 300 × 150 tessellated with triangles. We will vary the mesh spacing h, which impacts the number of elements in the domain.

3.11.1.7 Computational Analysis of the Loops

Here we discuss computational aspects of the twenty-five fusible loops. The following considerations derive from an analytical study of the data movement in the loop chain, extensive profiling through the Intel VTune Amplifier tool [Intel Corporation, 2016], and roofline models (available in Jacobs et al. [2016]).

Seigen should benefit from sparse tiling Not only does data reuse arise within single loops (e.g., by accessing vertex coordinates from adjacent cells), but also across consecutive loops, through indirect data dependencies. This makes Seigen a natural fit for sparse tiling.

Data reuse across solvers Eight “solver” loops perform matrix-vector multiplications in each mesh element. It is well established that linear algebra operations of this kind are memory-bound. Four of these loops arise from velocity, the others from stress. There is significant data reuse amongst the four velocity loops and amongst the four stress loops, since the same blocks in the global inverse matrices are accessed. We hypothesize performance gains if these loops were fused through sparse tiling.

Exterior facet loops have negligible cost Three loops implement the boundary conditions of the variational problems by iterating over the exterior mesh facets. The impact of these loops on the completion time is negligible, since their iteration space is orders of magnitude smaller than that of the cell and facet loops.

Data reuse across cells and facets loops Six loops iterate over the interior mesh facets to compute so-called facet integrals. Facet integrals ensure the propagation of information between adjacent cells in discontinuous-Galerkin methods. The operational intensity of these loops is much lower than that of cell loops, and memory-boundedness is expected. Consecutive facet and cell integral loops share fields, which creates cross-loop data reuse opportunities. Again, sparse tiling might play an important role in transforming this data reuse into data locality.

Optimizing the operational intensity All computational kernels generated in Seigen are optimized through COFFEE (Chapters 4, 5 and 6). In essence: (i) the operation count is minimized by restructuring expressions and loop nests; (ii) auto-vectorization opportunities are created by using array padding and by enforcing data alignment.

3.11.2 Setup and Reproducibility

3.11.2.1 Fusion Schemes and Attempted Optimizations

The generality of the sparse tiling algorithms, the flexibility of the loop chain interface, and the S-depth mechanism made it possible to experiment with a variety of fusion schemes without changes to the source code. Five fusion schemes were devised, based on the following criteria: (i) amount of data reuse, (ii) amount of redundant computation over the boundary region, (iii) memory footprint (the larger the footprint, the smaller the tile size that fits in cache). The fusion schemes are summarized in Table 3.6 and illustrated in Figure 3.12. The full specification, along with the seed tile size for each sub loop chain, is available at Jacobs and Luporini [2016]. Furthermore, several optimizations, applicable to all fusion schemes, were tried:

Global and local maps Algorithm 1 computes so-called local maps to avoid an extra level of indirection in the executor. We regard this as a potential optimization. Although no data reuse is available for the local maps (each fused loop has its own local maps), there might be benefits from improved hardware prefetching and memory latency. We will experiment with both global (i.e., original, as constructed by Firedrake and provided in the loop chain specification) and local (i.e., computed by SLOPE) maps.

Fusion scheme   Number of sub loop chains   Criterion                                               S-depth
fs1             3                           Fuse only a few cells and facets loops                  2
fs2             8                           Several short loop chains                               2
fs3             6                           fs2 and include all solver loops                        2
fs4             3                           More aggressive than fs3                                3
fs5             2                           Two loop chains: one for velocity and one for stress    4

Table 3.6: Summary of the fusion schemes adopted in Seigen. The S-depth column represents the number of off-process cell layers.

Figure 3.12: The sub loop chains in the five fusion schemes. Cells, int-facets, and ext-facets represent respectively the cell, interior facet, and exterior facet assembly loops, while mat-vec represents a solver.

Tile shape The seed loop partitioning and the mesh numbering determine the tile shape. The simplest way of creating tiles consists of “chunking” the seed iteration space every ts iterations, with ts being the initial tile size (see Section 3.8.1); a minimal sketch of this strategy is given after this list. The chunk partitioning inherits the original mesh numbering. In Firedrake, and therefore in Seigen, meshes are renumbered during initialization applying the reverse Cuthill–McKee (RCM) algorithm. Using chunk partitioning on top of an RCM-renumbered mesh has the effect of producing thin, rectangular tiles. This affects tile expansion, as a large proportion of elements lie on the tile border. There are potential solutions to this problem. The most promising would consist of using a Hilbert curve, rather than RCM, to renumber the mesh. This would lead to more regular polygonal tiles when applying chunk partitioning. Unfortunately, Hilbert curves are at the moment not supported in Firedrake. An alternative consists of partitioning the seed iteration space with a library like METIS [Karypis and Kumar, 2011] before applying RCM. As we shall see, this approach is not exempt from side effects.

Software prefetching In a loop, there is usually more than a single stream of memory accesses amenable to hardware prefetching (e.g., accesses to the indirection maps; direct accesses to data values; indirect accesses to data values if the mesh has a good numbering). Sparse tiling, unfortunately, impairs hardware prefetching for two reasons: (i) the virtual address streams are considerably shorter; (ii) tile expansion introduces irregularities in these streams. Software prefetching can be used together with hardware prefetching to minimize memory stalls. PyOP2 and SLOPE have been extended to emit intrinsic instructions for prefetching the maps and data values of iteration i while executing iteration i − 1.

Extended boundary region The special non-exec tile Tne (see Sections 3.6 and 3.8.1) reduces the amount of redundant computation in long loop chains by expanding over boundary tiles. There are two ways of creating Tne: either an extra layer of data is added to the boundary region (e.g., see Figure 3.7) or, during inspection, by searching for mesh boundaries. The current implementation only supports the first option. Manually deriving Tne would be not only algorithmically complex, but also potentially very expensive.
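A minimal sketch of the chunk partitioning mentioned in the Tile shape item above: the seed loop is simply cut every ts iterations, so the resulting tiles inherit whatever numbering (RCM in Firedrake) the mesh already has. Function and variable names are illustrative.

def chunk_partition(num_iterations, ts):
    # Assign a tile identifier to each seed-loop iteration by grouping
    # ts consecutive iterations; the tile shape follows the mesh numbering.
    return [i // ts for i in range(num_iterations)]

print(chunk_partition(10, 4))   # -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]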

3.11.2.2 Systems Configuration and Execution Modes

Extensive experimentation was performed on two Intel architectures, whose specification is reported in Table 3.7.

System               Erebus                              CX1-Ivy
Node                 1x4-core Intel I7-2600 3.4GHz       2x20-core Intel Xeon E5-2660 2.20GHz
DRAM                 16 GB                               64 GB
STREAM performance   18.5 GB/s (4.6 GB/s per core)       92.0 GB/s (4.6 GB/s per core)
Cache hierarchy      L1=32KB, L2=256KB, L3=8MB           L1=32KB, L2=256KB, L3=25MB
                     (L3 is shared)                      (L3 is shared per socket, 50MB total)
Compilers            Intel icc 16.0.2                    Intel icc 15.1
Compiler flags       -O3 -xHost -ip                      -O3 -xHost -ip
MPI version          Open MPI 1.6.5                      Intel MPI 5.1
MPI options          --bind-to-core                      I_MPI_DEVICE=ssm, I_MPI_PIN=yes,
                                                         I_MPI_PIN_MODE=lib,
                                                         I_MPI_PIN_PROCESSOR_LIST=map=bunch

Table 3.7: Systems specification.

Support for shared-memory parallelism is discontinued in Firedrake, so only distributed-memory parallelism was tested. MPI processes were pinned to cores. Hyperthreading was not used. Erebus and CX1-Ivy were otherwise unloaded, and exclusive access had been obtained when the runs were performed.

3.11.2.3 Meshes

The (average) number of degrees of freedom d assigned to a process mainly depends on the polynomial order q, the mesh spacing h and the number of cores n in a node. Meshes are chosen such that: (i) d fits in main memory; (ii) d exceeds the last level of cache; (iii) the simulation converges. A summary of the meshes used for experimentation is provided in Table 3.8.
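As a back-of-envelope check of Table 3.8, d can be estimated from the cell count and the size of a discontinuous P_q basis on triangles. The cell-count formula for the 300 × 150 domain and the component counts (2 for velocity, 4 for stress) are assumptions made for illustration only.

def dofs_per_process(h, q, components, n, Lx=300.0, Ly=150.0):
    # Rough DG degree-of-freedom count per MPI process on a structured
    # triangulation of an Lx x Ly rectangle with spacing h.
    cells = 2 * int(Lx / h) * int(Ly / h)      # two triangles per square
    dofs_per_cell = (q + 1) * (q + 2) // 2     # scalar P_q basis on a triangle
    return cells * dofs_per_cell * components // n

# Erebus, h = 1.0, q = 1, n = 4: ~135k for a 2-component velocity field,
# in the same ballpark as the 138k reported in Table 3.8.
print(dofs_per_process(1.0, 1, components=2, n=4))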

3.11.2.4 Timing Criteria

For each experiment, we collect three measurements.

                     d velocity                          d stress
System     h      q = 1   q = 2   q = 3   q = 4      q = 1   q = 2   q = 3   q = 4
Erebus     1.0    138k    276k    460k    690k       276k    552k    920k    1379k
(n = 4)    1.2    96k     192k    320k    480k       192k    385k    641k    961k
CX1-Ivy    0.6    78k     157k    262k    393k       157k    314k    525k    787k
(n = 20)   0.8    44k     89k     149k    223k       89k     178k    298k    447k

Table 3.8: Summary of the meshes used for single-node experimentation in Seigen. d is the average number of degrees of freedom assigned to a process for a given variational form.

Overall completion time (OT) Used to compute the maximum application speed-up when sparse tiling is applied.

Average compute time (ACT) Sparse tiling impacts the kernel execution time by increasing data locality. Communication is also influenced, especially in aggressive fusion schemes: the rounds of communication decrease, while the data volume exchanged may increase. ACT isolates the gain due to increased data locality from (i) the communication cost and (ii) any other action performed in Firedrake (executing Python code) between the invocation of kernels. This value is averaged across the processes.

Average compute and communication time (ACCT) In contrast to ACT, ACCT also includes the communication cost. By comparing ACCT and ACT, the communication overhead can be derived.

As we shall see, all of these metrics will be essential for a complete understanding of the sparse tiling performance. It is true that for much larger meshes ACT and ACCT should converge to OT, but such an experiment is not feasible as it would require weeks, possibly months, of compute time. To collect OT, ACT and ACCT, the following configuration was adopted.

• All experiments were executed with “warm cache”; that is, with all kernels retrieved directly from Firedrake's software cache of compiled kernels, so code generation and compilation times are not counted.

• All of the non-tiled explosive source tests were repeated three times. The sparse tiling search space is big (the product of all q, fusion schemes, tile sizes, meshes and a set of additional optimizations) and some tests can last longer than an hour, so for practical reasons the tiled explosive source tests were repeated twice, instead of three times. The minimum times are reported (the variance was negligible).

• The cost of global matrix assembly – an operation that takes place before entering the time loop – is not included in OT. Firedrake needs to be extended to assemble block-diagonal matrices directly into vectors (an entry in the vector would represent a matrix block). Currently, this is instead obtained in two steps: first, by assembling into a CSR matrix; then, by explicitly copying the diagonal into a vector (a Python operation). The assembly per se never takes more than 3 seconds, so it is reasonable to exclude this temporary overhead from our timing.

• The inspection cost due to sparse tiling is included in OT.

• Extra costs were minimized: no check-pointing, only two I/O sessions (at the beginning and at the end of the computation), and minimal logging.

• The time loop has a fixed duration T, while the time step depends on the mesh spacing h to satisfy the Courant–Friedrichs–Lewy necessary condition (i.e., the CFL limit) for numerical convergence. In essence, finer meshes require proportionately smaller time steps to ensure convergence. Furthermore, T cannot be reduced arbitrarily, which makes the performance exploration longer. A (prolonged) transitory period P is necessary so that the waves fully propagate across the domain. During P, the data movement is greatly reduced as most data values are unchanged. The effects of sparse tiling become evident only if T is sufficiently longer than P, so complete simulations had to be executed.
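The h-dependence of the time step follows the usual CFL-style scaling, sketched schematically below; the Courant number and wave speed are placeholders, not values taken from Seigen.

def num_time_steps(T, h, wave_speed, courant=0.5):
    # Schematic CFL limit: dt scales linearly with the mesh spacing h,
    # so halving h roughly doubles the number of time steps for a fixed T.
    dt = courant * h / wave_speed
    return int(round(T / dt)), dt

steps, dt = num_time_steps(T=2.5, h=1.0, wave_speed=1000.0)
print(steps, dt)   # finer meshes (smaller h) => smaller dt, more steps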

3.11.3 Results and Analysis

In this section, we report on hundreds of experiments aimed at quantifying the performance improvements obtainable in Seigen when sparse tiling is applied. Using the explosive source test case, we explore a large set of parameters, which includes polynomial order, mesh, node architecture, fusion scheme, and many others. A generic instance of explosive source optimized with sparse tiling will be referred to as a “tiled version”; otherwise, the term “original version” will be used. The structure of this section is as follows:

1. We start with an overview of the maximum speed-ups obtained through sparse tiling. Some preliminary observations are made.

2. We delineate the structure of the search space exploration for determining the optimal sparse tiling version.

3. We motivate, in detail, through plots and the loop chain analyzer tool (introduced shortly), the performance of the various fusion schemes and optimizations attempted.

4. We provide further evidence from profiler reports.

5. We comment on the performance bottlenecks introduced by sparse tiling.

3.11.3.1 Summary of the Attained Speed-ups

The times that we report below derive from runs with 1 MPI process per physical core. The benefits from sparse tiling (if any) tend to be negligible in other types of runs (e.g., sequential runs). As elaborated in the following sections, this is a consequence of three factors: (i) the memory access latency is only marginally affected when a large proportion of bandwidth is available to the processes; (ii) hardware prefetching is impaired by the small iteration space of tiles; (iii) TLB misses are more frequent due to tile expansion.

Table 3.9 reports the completion times (OT) for the original and tiled versions of Seigen, on both Erebus and CX1-Ivy. For each node architecture, polynomial order q and mesh, the best tiled version time is reported; the search space exploration that led to this number will be detailed in the next section. The speed-up for OT achieved through sparse tiling is indicated as ΩOT. Tables 3.10 and 3.11 show analogous information for ACT and ACCT.

Based on these tables, some preliminary observations can be made.

• Despite a similar amount of available bandwidth per core, an improved core architecture and a better cache hierarchy (see Table 3.7), the speed-ups on CX1-Ivy are generally not as high as on Erebus. Profiling through Intel VTune showed that the proportion of time spent on executing compute-bound loops is higher on CX1-Ivy than on Erebus, but further investigation is required to explain this behaviour. It would be interesting to carry out a similar performance exploration on a many-core architecture, like Intel's Knights Landing, characterized by less powerful cores and a completely different memory system. Testing on multi-node systems is also high priority future work, as this will increase the pressure on the memory system due to inter-node communication.

• The performance improvements are generally greater than 1.10×, and often exceed 1.20×. In light of the limiting factors discussed in the next sections, this is a promising result. While some of these limiting factors are inherent in Seigen and sparse tiling, others are simply due to shortcomings of the current implementation. We therefore envisage ample space for improvements. ΩACT and ΩACCT are, as expected, higher than ΩOT. ACT, in particular, tells us that the benefits from the relieved memory pressure outweigh the redundant computation overhead.

System     h     q    OT (s)    OT tiled (s)    ΩOT
Erebus     1.0   1    687       596             1.15
(n = 4)          2    1453      1200            1.21
                 3    3570      2847            1.25
                 4    7057      5827            1.21
           1.2   1    419       365             1.15
                 2    870       715             1.22
                 3    1937      1549            1.25
                 4    4110      3390            1.21
CX1-Ivy    0.6   1    859       813             1.06
(n = 20)         2    1712      1585            1.08
                 3    3680      3378            1.09
                 4    8051      6962            1.16
           0.8   1    396       390             1.02
                 2    770       721             1.07
                 3    1562      1457            1.07
                 4    3263      2931            1.11

Table 3.9: Comparing the Seigen completion time OT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20).

System     h     q    ACT (s)   ACT tiled (s)   ΩACT
Erebus     1.0   1    533       455             1.17
(n = 4)          2    1252      999             1.25
                 3    3255      2549            1.28
                 4    6585      5401            1.22
           1.2   1    313       267             1.17
                 2    729       579             1.26
                 3    1746      1360            1.28
                 4    3837      3127            1.23
CX1-Ivy    0.6   1    484       448             1.08
(n = 20)         2    1149      1015            1.13
                 3    2740      2471            1.11
                 4    6354      5423            1.17
           0.8   1    191       175             1.09
                 2    486       434             1.12
                 3    1159      1068            1.09
                 4    2654      2307            1.15

Table 3.10: Comparing the Seigen average compute time ACT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20).

System     h     q    ACCT (s)  ACCT tiled (s)  ΩACCT
Erebus     1.0   1    581       503             1.16
(n = 4)          2    1334      1071            1.25
                 3    3438      2712            1.27
                 4    6881      5661            1.22
           1.2   1    340       293             1.16
                 2    776       618             1.26
                 3    1827      1437            1.27
                 4    3987      3265            1.22
CX1-Ivy    0.6   1    638       586             1.09
(n = 20)         2    1468      1336            1.1
                 3    3398      3087            1.1
                 4    7745      6649            1.16
           0.8   1    230       247             0.93
                 2    603       543             1.11
                 3    1378      1278            1.08
                 4    3061      2730            1.12

Table 3.11: Comparing the Seigen average compute and communication time ACCT on Erebus (number of processes n = 4) and CX1-Ivy (n = 20).


• The performance improvements tend to increase with q. This is remarkable, but not surprising. The memory footprint of an iteration grows with q, hence all negative effects caused by sparse tiling, such as impaired hardware prefetching and TLB misses (documented in the next sections), become less notable. This trend can be observed in all problem instances.

3.11.3.2 Structure of the Search Space Exploration

The best tiled version for a specific problem instance was determined empirically, by trying all possible combinations of a set of parameters. The space of parameters was described in Section 3.11.2.1 and is summarized below.

Fusion scheme Five fusion schemes (fsX, X ∈ {1, 2, 3, 4, 5}) were studied (see Table 3.6).

Seed tile size For each fs and q, three or four tile sizes were tested. Four tile sizes were used with q ∈ {1, 2}. The tile sizes were chosen to maximize the likelihood of fitting in L2 or L3 cache. A less extensive experimentation with “more adverse” tile sizes showed that: (i) a very small value causes dramatic slow-downs (up to 8× slower than the original versions); (ii) larger values cause a proportionately greater increase in execution time.

Indirection maps Both local and global maps were tested.

Partitioning Two partitioning strategies were tested: chunk and metis.

Extended boundary region All experiments were repeated with and without the extended boundary region optimization discussed in Section 3.11.2.1.

Software prefetching and a small set of prefetch distances were also tested in a few instances of explosive source, but no significant changes to the execution times were observed. It was decided, therefore, to exclude this parameter from the search space; this made it possible to execute the whole test suite in slightly less than five days.

An explosive source instance is determined by three parameters:

Node architecture The two systems shown in Table 3.7 were used.

Polynomial order q ∈ {1, 2, 3, 4}.

Mesh The meshes shown in Table 3.8 were used (two meshes for each node architecture).

We experimented with 2 × 4 × 2 = 16 explosive source instances. For each of these instances, we tested the original Seigen version and a set of tiled versions. When q ∈ {1, 2}, a set of tiled versions consists of 5 × 4 × 2 × 2 × 2 = 160 instances (otherwise, with q ∈ {3, 4}, the instances are 120). In total, we executed 8 × (160 + 1) + 8 × (120 + 1) = 2256 versions of Seigen on a single node.

Given the large number of problem instances, the results of the search space exploration are organized as follows. We show one plot for each of the sixteen explosive source instances. A plot illustrates the search space exploration for a given node architecture, h and q. The ACCT of the original version and a subset of tiled versions is reported on the y-axis. To make the plots readable, the impact of (i) local/global indirection maps, (ii) normal/extended boundary region, (iii) chunk/metis partitioning strategies is only discussed qualitatively. Fortunately, it is very straightforward to draw general conclusions about the effect of these optimizations. The raw execution times are available at Luporini [2016a].
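The counting above can be reproduced by enumerating the parameter space explicitly; the value lists below mirror those described in Section 3.11.2.1.

from itertools import product

fusion_schemes = ["fs1", "fs2", "fs3", "fs4", "fs5"]
maps = ["local", "global"]
partitioning = ["chunk", "metis"]
boundary = ["normal", "extended"]

def tiled_versions(q):
    num_tile_sizes = 4 if q in (1, 2) else 3
    return len(list(product(fusion_schemes, range(num_tile_sizes),
                            maps, partitioning, boundary)))

total = sum(tiled_versions(q) + 1              # +1 for the original version
            for _ in range(2)                  # two node architectures
            for q in (1, 2, 3, 4)              # four polynomial orders
            for _ in range(2))                 # two meshes per architecture
print(tiled_versions(1), tiled_versions(3), total)   # 160 120 2256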

3.11.3.3 Outcome of the Search Space Exploration

PyOP2 was enhanced with a loop chain analyzer (LCA) capable of estimating the best- and worst-case tile memory footprint, as well as the percentage of data reuse ideally achievable7. We use this tool, as well as Intel VTune, to explain the performance achieved.

7It is part of our future plans to move this machinery into SLOPE, where the effects of tile expansion can be taken into account to provide better estimates.

Figure 3.13: Search space exploration on Erebus with n = 4 MPI processes, for (a) h = 1.0 and (b) h = 1.2. Four polynomial orders (q ∈ {1, 2, 3, 4}) are shown. Each plot shows the average compute and communication time (ACCT, y-axis) of five fusion schemes (fs1-chunk to fs5-chunk) and the original version against the tile size factor (x-axis). The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the “tile size factor” and a pre-established multiplier (an integer in the range [1, 4]; full list at Jacobs and Luporini [2016]). With tiling, global maps, chunk partitioning, and the extended boundary region optimization are used.

Figure 3.15: Search space exploration on CX1-Ivy with n = 20 MPI processes, for (a) h = 0.6 and (b) h = 0.8. Four polynomial orders (q ∈ {1, 2, 3, 4}) are shown. Each plot shows the average compute and communication time (ACCT, y-axis) of five fusion schemes (fs1-chunk to fs5-chunk) and the original version against the tile size factor (x-axis). The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the “tile size factor” and a pre-established multiplier (an integer in the range [1, 4]; full list at Jacobs and Luporini [2016]). With tiling, global maps, chunk partitioning, and the extended boundary region optimization are used.

Figures 3.13 and 3.15 illustrate the search space exploration on Erebus and CX1-Ivy, respectively. For reference, the ACCT of the original version is also reported (in red, at null tile size). The general trend of the search space exploration is as follows.

• fs is, unsurprisingly, the parameter with the largest impact on the ACCT. By influencing the fraction of data reuse convertible into data locality, the amount of redundant computation and the data volume exchanged, fusion schemes play a fundamental role in sparse tiling. This makes automation much more than a desired feature: without any changes to the source code, multiple sparse tiling strategies could be studied and tuned. Automation is one of our major contributions, and this performance exploration justifies the implementation effort.

• There is a non-trivial relationship between ACCT, q and fs. The aggressive fusion schemes are more effective with high q – that is, with larger memory footprints – while they tend to be less efficient, or even deleterious, when q is low. The extreme case is fs5, which fuses two long sequences of loops (twelve and thirteen loops each). In Figure 3.13 (Erebus), fs5 is never a winning choice, although the difference between fs3/fs4 and fs5 decreases as q grows. In Figure 3.16b (CX1-Ivy) this phenomenon is even more pronounced: while with q = 2 fs4 was almost 1.10× faster than the best fs5, with q = 3 the difference was only 1.04×, and with q = 4 fs5 ended up outperforming fs4 by roughly 1.02×. If this trend continued with q > 4, then the gain from sparse tiling could become increasingly larger.

• A non-aggressive scheme fuses only a few small subsets of loops. As discussed in later sections, these fusion schemes, despite affording larger tile sizes than the more aggressive ones (due to the smaller memory footprint), suffer from limited cross-loop data reuse. For fs1, LCA determines that the percentage of reusable data in the three fused loop chains decreases from 25% (q = 1) to 13% (q = 4). The drop is exacerbated by the fact that no reuse can be exploited for the maps. Not only are these ideal values, but also a significant number of loops are left outside of loop chains. The combination of these factors motivates the lack of substantial speed-ups. With fs2, a larger proportion of loops are fused and the amount of shared data increases. The peak ideal reuse in a loop chain reaches 54%, which translates into better ACCTs. A similar growth in data reuse can be appreciated in more aggressive fusion schemes, with a peak of 61% in one of the fs5 loop chains. Nevertheless, the performance of fs5 is usually worse than fs4. As we shall clarify in later sections, this is mainly due to the excessive memory footprint, which in turn leads to very small tiles. Speculatively, we tried running a sixth fusion scheme: a single loop chain including all of the 25 loops in a time iteration. In spite of an ideal data reuse of about 70%, the ACCT was always significantly higher than with all other schemes.

• Figures 3.13 and 3.15 show the ACCT for a “good selection” of tile size candidates. Our approach was as follows. We took a very small set of problem instances and tried a large range of seed tile sizes. Very small tile sizes caused dramatic slow-downs, mostly because of ineffective hardware prefetching and TLB misses. Tile sizes larger than a fraction of L3 cache also led to increasingly higher ACCTs. If we consider q = 4 in Figure 3.13, we observe that the ACCT of fs2, fs3, and fs4 grows when the initial number of iterations in a tile is as big as 70. Here, LCA shows that the tile memory footprint is, in multiple loop chains, higher than 3MB, with a peak of 6MB in fs3. This exceeds the proportion of L3 cache that a process owns (on average), which explains the performance drop.
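The cache argument in the last bullet amounts to a simple estimate: the footprint of a tile is roughly the number of iterations it contains times the bytes each fused loop touches per iteration, and it should stay below the share of last-level cache owned by a process. The per-iteration byte counts below are made-up placeholders; LCA computes the real ones.

def tile_footprint_bytes(tile_iterations, bytes_per_iteration_per_loop):
    # Worst-case footprint: ignores data shared between the fused loops,
    # in the spirit of LCA's upper-bound estimate.
    return tile_iterations * sum(bytes_per_iteration_per_loop)

l3_share = 8 * 1024**2 / 4                                       # Erebus: 8MB of L3, 4 processes
footprint = tile_footprint_bytes(70, [20_000, 18_000, 14_000])   # placeholder byte counts
print(footprint / 1024**2, "MB vs", l3_share / 1024**2, "MB of L3 per process")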

As explained, the tile size and the fusion scheme are the most important parameters of the search space exploration. Fortunately, general conclusions about the impact of the other optimizations can easily be drawn.

Indirection maps Using local maps provides marginal gains with fs1 and fs2; that is, only when there is no or very little reuse for the indirection maps. In all other cases, the original indirection maps, provided through the loop chain abstraction, lead to larger gains, despite the need for an extra level of indirection.

Partitioning The chunk partitioning, applied on top of the RCM-based mesh numbering, always outperforms the metis partitioning. The main reasons, as documented in Section 3.11.3.6, are the increase in TLB misses and a less effective hardware prefetching, caused by a more irregular tile expansion.

Extended boundary region Minimization of redundant computation through expansion of non-exec tiles, implemented by allocating an extra layer of data elements along the boundary region (see Tne in Figure 3.7 and Section 3.11.2.1), almost always improves the ACCT. Further, when it does not, the performance difference tends to be negligible.

Summarizing, a very promising heuristic for a cost model might be: (i) never use local maps; (ii) always use chunk partitioning, ideally on top of a Hilbert curve (see also Section 3.11.3.6); (iii) always minimize redundant computation by means of a non-exec boundary region.
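Such a heuristic is simple enough to be captured as a set of default options; the dictionary keys below are our own names, not SLOPE's actual configuration interface.

def default_tiling_options():
    # Defaults suggested by the search space exploration in this section.
    return {
        "use_local_maps": False,           # (i) never use local maps
        "partitioning": "chunk",           # (ii) chunk, ideally on a Hilbert numbering
        "extended_boundary_region": True,  # (iii) minimize redundant computation
    }

print(default_tiling_options())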

3.11.3.4 Testing the Hypothesis via Profilers

Our hypothesis was that speed-ups could be achieved as sparse tiling turns data reuse into data locality, thus diminishing the memory pressure and improving the memory latency. The combination of three different analyses provides compelling evidence that this was correct.

• The time spent in executing Python code is marginally affected by sparse tiling, irrespective of the number of loops fused. This was verified in several ways. First, we timed the experiments with just one tile covering the whole iteration space, for several types of loop chains, and then compared to the case with no sparse tiling. On one hand, loop fusion reduces the number of “jumps” between Python and just-in-time compiled code proportionately to the loop chain length. On the other hand, the Python processing takes longer when sparse tiling is enabled, as a fused loop must be constructed from the original sequence generated by Firedrake. These appear to neutralize one another's costs. Second, we profiled some test cases with Python's cProfile, which gave us exact measurements of how much time is spent in each individual function. Illustrative graphs were produced from these profiles, and are available at [Luporini, 2016a]. Third, an indirect confirmation derives from the trend of ACT, which excludes the Python processing.

• Several VTune “general-exploration” and “advanced-hotspots” analyses confirmed that significant speed-ups derive from increased data locality. This was evident in all types of fusion schemes. For example, consider fs5 and the stress loop chain. VTune shows that the execution times of the second and third solver loops are much smaller than that of the first loop. In addition, fewer L3 cache misses occur when loading the data values for the matrix-vector multiplication. Similar evidence arises from fusing cell and facet integral loops, although the gains are less pronounced.

• A VTune “bandwidth analysis” displays memory usage as a function of application time (in GB/s). By comparing the tiled and non-tiled versions, we observed that the former consistently consumes much less memory bandwidth (for both reads and writes). This suggests that the processes are able to retrieve data from cache more often.

3.11.3.5 On the Inspection Overhead

As explained in Section 3.11.2.4, the inspection cost is included in OT. In this section, we quantify this overhead in a representative problem instance. In all other problem instances, the overhead was either significantly smaller or essentially identical (for reasons discussed below) to the one reported here.

We consider explosive source on Erebus with h = 1.0, q = 4, and fs5. With this configuration, the time step was t = 481 · 10^-6 (we recall from Section 3.11.2.4 that t is a function of h). Given the fixed simulation duration T = 2.5, in this test case 5198 time steps were performed. A time step took on average 1.15 seconds.

In each time step, twenty-five loops, fused as specified by fs5, are executed. We know that in fs5 there are two sub loop chains, which respectively consist of thirteen and twelve loops. To inspect these two loop chains, 1.4 and 1.3 seconds were needed (average across the four MPI ranks, with negligible variance). Roughly 98% of the inspection time was due to the projection and tiling functions, while only 0.2% was spent on the tile initialization phase (see Section 3.8.1). These proportions are consistent across other fusion schemes and test cases. After 200 time steps (less than 4% of the total), the inspection overhead was already close to 1%. Consequently, the inspection cost, in this test case, was eventually negligible. The inspection cost increases with the number of fused loops, which motivates the choice of fs5 for this analysis.
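The amortization argument can be reproduced directly from the numbers quoted above (inspection cost, average time per step, and total number of steps for this fs5 run).

inspection_cost = 1.4 + 1.3     # seconds: one inspection per sub loop chain
time_per_step = 1.15            # seconds per time step (average)
total_steps = 5198

for steps in (200, total_steps):
    overhead = inspection_cost / (steps * time_per_step)
    print(f"after {steps} steps the inspection overhead is {overhead:.2%}")
# after 200 steps: ~1.2%; over the whole run: ~0.05%.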

3.11.3.6 Performance Limiting Factors

In this section, we discuss the computational and architectural factors that impaired sparse tiling in Seigen. The following considerations derive from a meticulous study of the data movement, the information provided by the loop chain analyzer previously introduced, and extensive profiling through VTune. There are two types of performance limiting factors: (i) those inherent in the application and its loop chains, and (ii) those due to flaws or deficiencies in the sparse tiling implementation. We start with point (i).

S-depth As reiterated, a large boundary region enables fusing loops in the presence of distributed-memory parallelism. The extent of this region grows proportionally with the number n of fused loops. Unfortunately, this has two undesirable effects. First, the redundant computation overhead tends to be non-negligible if some of the fused loops are compute-bound. Sometimes, the overhead could be even larger than the gain due to increased data locality. Second, the size of an S-level (a “strip” of boundary elements) grows with n, as outer levels include more elements than inner levels. This increases not only the amount of redundant computation, but also the volume of data to be communicated.

Limited data reuse across non-consecutive loops In Section 3.11.1, we have explained that Seigen appears to be an excellent sparse tiling candidate, with data reuse arising across consecutive loops and across velocity and stress solvers. Data reuse is however quite limited across non-consecutive, non-solver loops. Given a sequence of three loops [Lx, Ly, Lz], the data reuse between Lx and Ly is usually greater than that between Lx and Lz.

Some solver loops are too far away in the loop chain We know that amongst the velocity solver loops, and analogously amongst the stress solver loops, there is substantial data reuse. Unfortunately, these loops are relatively far from each other in the loop chain. For example, the first and second velocity loops are separated by 8 loops. Turning data reuse into data locality is in this case cumbersome, since long loop chains suffer from the S-depth (discussed earlier) and the constrained tile size (discussed next) issues.

Constrained tile size The memory footprint of a tile grows quite rapidly with the number of fused loops. In particular, the matrices accessed in the velocity and stress solver loops have considerable size. The larger the memory footprint, the smaller the tile size which fits in cache. Allocating small tiles has multiple implications. First, the proportion of iterations lying on the border is often significant, which worsens tile expansion (see Section 3.6), thus affecting data locality. Using small tiles also impairs hardware prefetching, since the virtual address streams are more irregular. Moreover, more tiles are needed to cover an iteration space. In the case of shared-memory parallelism, the larger the number of tiles, the higher the probability of detecting a coloring conflict, which results in higher inspection costs.

Limited data reuse in small loop chains With a large memory footprint, a fusion scheme with multiple, short loop chains (e.g., 2-3 loops each) could appear, at first glance, more promising: the data reuse in consecutive loops could be turned into data locality by using a small S-depth and relatively big tiles, hence mitigating most of the aforementioned issues. However, other problems arise. By limiting the extent of a loop chain, it may not be possible to fuse some of the velocity and stress solver loops. These loops are memory-bound, share a significant volume of data, and have a considerable impact on the execution time, so sparse tiling could provide substantial benefits. A second issue is inherent to the fact that data reuse is much smaller in short loop chains. Consider two loops Lx and Ly, respectively over cells and interior facets. For a given tile, during the execution of iterations from Lx, new fields and indirection maps are loaded from memory into cache. When executing iterations from Ly, new indirection maps need to be fetched from memory (since Ly has a different iteration space than Lx, hence different maps). Locality on the shared data values is then exploitable, but it is evident that the ratio of data reuse over total data movement could, in some cases, be rather small.

It is not unreasonable to expect some of these limiting factors to arise in other applications. Further issues are implementation-related (point (ii) above):

TLB misses A translation lookaside buffer (TLB) miss occurs whenever the CPU cannot retrieve the physical page corresponding to a virtual page. Since the TLB has a hierarchical structure, handling a TLB miss usually requires multiple accesses to memory. Hence, TLB misses are much more costly than cache misses. Sparse tiling increases the TLB miss/hit ratio. This is evident (and more pronounced) when the tile size is small, in which case a TLB miss is likely to occur when jumping to executing a new loop. The problem is exacerbated by metis partitioning, which leads to irregular tile shapes. Here, tile expansion may eventually incorporate iterations living in completely different virtual pages. VTune experimentation with the q = 1 and q = 2 versions of explosive source showed that chunk- and metis-based sparse tiling suffer from an increase in TLB misses of roughly 16% and 35%, respectively. To mitigate this issue, larger virtual pages could be used. Linux's Transparent Huge Pages, if enabled, automatically allocates memory in virtual pages of 2MB (instead of the default 4KB) as long as the base address is properly aligned. As future work, Firedrake could be extended by instantiating 2MB-aligned Numpy arrays (a sketch of this allocation trick is given after this list); this is not straightforward as arbitrary alignment cannot be specified through the default Numpy interface.

Tile shape and expansion The chunk partitioning strategy, applied on top of the reverse Cuthill–McKee (RCM) algorithm used in Firedrake for mesh renumbering, leads to thin, rectangular tiles. As explained, this maximizes the fraction of iterations on the tile border, which in turn affects data reuse. The metis partitioning strategy did not help, for the reasons explained above. A possible solution to this issue would consist of using a Hilbert curve for mesh renumbering, and then chunk partitioning on top of it. This would minimize tile expansion while avoiding most of the negative low level implications (i.e., prefetching, TLB misses).
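The 2MB alignment mentioned in the TLB misses item above can already be obtained with a standard over-allocate-and-slice trick; the sketch below is a generic NumPy recipe, not existing Firedrake functionality.

import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=2 * 1024 * 1024):
    # Return an uninitialized array whose data pointer is aligned to
    # `alignment` bytes (2MB is the transparent huge page size on x86 Linux).
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((1024, 1024))
assert a.ctypes.data % (2 * 1024 * 1024) == 0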

3.12 Conclusions and Future Work

Sparse tiling, and more generally time tiling, aims to turn the data reuse in a sequence of loops into data locality. In this chapter, three main problems have been addressed: the generalization of sparse tiling to arbitrary sequences of irregular loops through the loop chain abstraction, automation via DSLs, and effective support for shared- and distributed-memory parallelism. The major strength of this work lies in the fact that all algorithmic and technological contributions derive from an in-depth study of a realistic application. The performance issues we found through Seigen would never have been exposed by a set of simplistic benchmarks, as often used in the literature.

The thorough performance analysis of Seigen has shed much light on sparse tiling. However, the experimentation can still be extended in several ways:

Multi-node experiments The more aggressive fusion schemes, which make use of extended halo regions, may be affected by, or take advantage of, network communication. Other issues or potential benefits inherent in sparse tiling could be exposed. Multi-node experiments have the highest priority and are now in progress.

3D problems Since many real-world simulations are based on 3D problems, it would be natural to extend the performance evaluation of Seigen in this direction. At the time of writing, however, Seigen still does not include 3D test cases.

Other applications Seigen was attractive for its long sequence of loops, which provides a complex, realistic scenario for testing sparse tiling. It would be particularly interesting to experiment with other explicit high-order discontinuous-Galerkin methods. Another class of appealing loop chains arises in matrix-free solvers.

In the longer term, our ambition is to work around the implementation-related performance limiting factors described in the previous section. We believe that using a Hilbert curve numbering may lead to a better tile shape, thus mitigating the performance penalties due to tile expansion, TLB misses and impaired hardware prefetching.

Shared-memory parallelism was not as carefully tested as distributed-memory parallelism. First of all, we would like to replace the current OpenMP implementation in SLOPE with the MPI Shared Memory (SHM) model introduced in MPI-3. Not only does a unified programming model provide significant benefits in terms of maintainability and complexity, but the performance may also be greater, as suggested by recent developments in the PETSc community. Secondly, some extra work would be required for a fair comparison of this new hybrid MPI+MPI programming model with and without sparse tiling.

The experimentation was carried out on a number of “conventional” Intel Xeon architectures. Intel's Knights Landing and IBM's POWER8 have more bandwidth than a standard Xeon x86, but also a significantly larger number of cores and sockets. General-purpose many-core architectures might play a key role in next-generation high performance computing, so the same performance analysis as in Section 3.11 should be repeated on these systems.

Finally, for a widespread adoption of sparse tiling, a cost model for automatic derivation of fusion schemes and tile sizes is still missing.

Chapter 4

Minimizing Operations in Finite Element Integration Loops

In this chapter, we present an algorithm for the optimization of a class of finite element local assembly kernels. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive experimentation shows significant performance improvements over state-of-the-art finite element code generation systems in almost all cases. This validates the effectiveness of the algorithm presented here, and illustrates its limitations. The algorithm is implemented in COFFEE and is currently in use in the Firedrake framework. This chapter is an extended version of [Luporini et al., 2016a].

4.1 Motivation and Related Work

The need for rapid implementation of high performance, robust, and portable finite element methods has led to approaches based on automated code generation. This has been proven successful in the context of the FEniCS and Firedrake projects. We recall from Section 2.2.1 that, in these frameworks, the weak variational form of a problem is expressed in a high level mathematical syntax by means of the domain-specific language UFL. This mathematical specification is used by a domain-specific compiler, known as a form compiler, to generate low-level C or C++ code for the integration over a single element of the computational mesh of the variational problem's left and right hand side operators. The code for assembly operators must be carefully optimized: as the complexity of a variational form increases, in terms of number of derivatives, pre-multiplying functions, or polynomial order of the chosen function spaces, the operation count increases, with the result that assembly often accounts for a significant fraction of the overall runtime. This aspect has previously been introduced in Section 2.1.5.

As demonstrated by the substantial body of research on the topic, automating the generation of such high performance implementations poses several challenges. This is a result of the complexity inherent in the mathematical expressions involved in the numerical integration, which varies from problem to problem, and the particular structure of the loop nests enclosing the integrals. General-purpose compilers, such as those by GNU and Intel, fail to exploit the structure inherent in the expressions, thus producing sub-optimal code (i.e., code which performs more floating-point operations, or “flops”, than necessary; we show this in Section 4.9). Research compilers, for instance those based on polyhedral analysis of loop nests, such as PLUTO [Bondhugula et al., 2008], focus on parallelization and optimization for cache locality, treating issues orthogonal to the question of minimising flops. The lack of suitable third-party tools has led to the development of a number of domain-specific code transformation (or synthesizer) systems. Ølgaard and Wells [2010] show how automated code generation can be leveraged to introduce optimizations that a user should not be expected to write “by hand”. Kirby and Logg [2006] and Russell and Kelly [2013] employ mathematical reformulations of finite element integration with the aim of minimizing the operation count. In Luporini et al. [2015] (and in Chapter 5, from which the article is extracted), the effects and the interplay of generalized code motion and a set of low level optimizations are analysed. It is also worth mentioning two new form compilers, UFLACS [Alnæs, 2016] and TSFC [Homolya and Mitchell, 2016], which particularly target the compilation time challenges of the more complex variational forms. The performance evaluation in Section 4.9 includes most of these systems.

However, in spite of such a considerable research effort, there is still no answer to one fundamental question: can we automatically generate an implementation of a form which is optimal in the number of flops executed?

In this chapter, we formulate an approach that solves this problem for a particular class of forms and provides very good approximations in all other cases. In particular, we will define “local optimality”, which relates operation count with inner loops. In summary, our contributions are as follows:

• We formalize the class of finite element integration loop nests and we build the space of legal transformations impacting their operation count.

• We provide an algorithm to select points in the transformation space. The algorithm uses a cost model to: (i) understand whether a transformation reduces or increases the operation count; (ii) choose between different (non-composable) transformations.

• We demonstrate that our approach systematically leads to a local optimum. We also explain under what conditions of the input problem global optimality is achieved.

• We integrate our approach with a compiler, COFFEE, which is in use in the Firedrake framework. The structure of COFFEE is discussed in Chapter 6.

• We experimentally evaluate our approach using a broader suite of forms, discretizations, and code generation systems than has been used in prior research. This is essential to demonstrate that our optimality model holds in practice.

In addition, in order to place COFFEE on the same level as other code generation systems from the viewpoint of low level optimization, which is essential for a fair performance comparison:

• We introduce a transformation based on symbolic execution that allows irrelevant floating point operations to be skipped (for example those involving zero-valued quantities).

In Section 4.2 we introduce a set of definitions mapping mathematical properties to the level of loop nests. This step is an essential precursor to the definition of the two algorithms – sharing elimination (Section 4.3) and pre-evaluation (Section 4.4) – through which we construct the space of legal transformations. The main transformation algorithm in Section 4.6 delivers the local optimality claim by using a cost model to coordinate the application of sharing elimination and pre-evaluation. We elaborate on the correctness of the methodology in Section 4.7. The numerical experiments are presented in Section 4.9. We conclude by discussing the limitations of the algorithms presented and future work.

4.2 Loop Nests, Expressions and Optimality

In this section, we characterize global and local optimality for finite element integration. In order to make the chapter self-contained, we start by reviewing basic compiler terminology.

Definition 1 (Perfect and imperfect loop nests). A perfect loop nest is a loop whose body either 1) comprises only a sequence of non-loop statements or 2) is itself a perfect loop nest. If this condition does not hold, a loop nest is said to be imperfect.

Definition 2 (Independent basic block). An independent basic block is a sequence of statements such that no data dependencies exist between statements in the block.

We focus on perfect nests whose innermost loop body is an independent basic block. A straightforward property of this class is that hoisting invariant expressions from the innermost to any of the outer loops or the preheader (i.e., the block that precedes the entry point of the nest) is always safe, as long as any dependencies on loop indices are honored. We will make use of this property. The results of the next sections could also be generalized to larger classes of loop nests, in which basic block independence does not hold, although this would require refinements beyond the scope of this chapter.

We introduce some new concepts by mapping mathematical properties to the loop nest level. We start with the notion of a linear symbol. This will allow us to define a linear loop and, more generally, a (perfect) multilinear loop nest.

Definition 3 (Linear symbol). A symbol a in an expression e is linear if

1. a is an n-dimensional array, and

2. all occurrences of a in e are indexed through the same vector-valued access function f = [f_0, ..., f_{n-1}] (e.g., a[f_0(...)][f_1(...)]), and

3. all sub-expressions of e in which a appears are affine in a.

We discuss some simple examples (to lighten the notation, we write a[i] ≡ a_i). The symbol a is not linear in any of a_i·b + a_k, a_i·b + a_{i+2}, and a_{ji}·b + a_{ki}, as condition 2 is violated. Condition 3 is violated for a in both b/a_i and a_i·b + a_i·a_i. By contrast, a is linear in a_i·b + a_i.

Definition 4 (Linear loop). Let Li be a loop iterating over the space I through the iteration variable i. Li is linear if in its body

1. i only appears as an array index, and

2. all symbols in which i appears are linear symbols.

Definition 5 (Multilinear loop nest). A multilinear loop nest of arity n is a perfect nest composed of n linear loops.

We will show that multilinear loop nests, which arise naturally when translating bilinear or linear forms into code, are important because they have a structure that we can take advantage of to reach a local optimum. We now define two different classes of loops.

Definition 6 (Reduction loop). A loop Li is said to be a reduction loop if in its body

1. i appears only as an array index, and

2. for each augmented assignment statement S (e.g., an increment), arrays indexed by i appear only on the right hand side of S.

Definition 7 (Order-free loop). A loop is said to be an order-free loop if its iterations can be executed in any arbitrary order.

Consider Equation 2.21 and the (abstract) loop nest implementing it, illustrated in Figure 4.1. The imperfect nest Λ = [Le, Li, Lj, Lk] comprises an order-free loop Le (over elements in the mesh), a reduction loop Li

for (e = 0; e < E; e++)
  ...
  for (i = 0; i < I; i++)
    ...
    for (j = 0; j < J; j++)
      for (k = 0; k < K; k++)
        A_ejk += Σ_{w=1..m} α^w_eij · β^w_eik · σ^w_ei

Figure 4.1: The loop nest implementing a generic bilinear form.

(performing numerical integration), and a multilinear loop nest [Lj, Lk] (over test and trial functions). In the body of Lk, one or more statements evaluate the local tensor for the element e. Expressions (the right hand side of a statement) result from the translation of a form in high level matrix notation into code. In particular, m is the number of monomials (a form is a sum of monomials), α_eij and β_eik represent the product of a coefficient function (e.g., the inverse Jacobian matrix for the change of coordinates) with some linear symbols (e.g., test or trial functions), and σ_ei is a function of coefficients and geometry. We do not pose any restrictions on function spaces (e.g., scalar- or vector-valued), coefficient expressions (linear or non-linear), or differential and vector operators, so σ_ei can be arbitrarily complex. We say that such an expression is in normal form, because the algebraic structure of a variational form is intact: products have not yet been expanded, distinct monomials can still be identified, and so on. This brings us to formalize the class of loop nests that we aim to optimize.

Definition 8 (Finite element integration loop nest). A finite element integration loop nest is a loop nest in which the following appear, in order: an imperfect order-free loop, an imperfect (perfect only in some special cases) reduction loop, and a multilinear loop nest whose body is an independent basic block in which expressions are in normal form. Test and trial functions (or derivatives thereof) are the linear symbols of the multilinear loop nest.

We then characterize optimality for a finite element integration loop nest as follows.

Definition 9 (Optimality of a loop nest). Let Λ be a loop nest, and let Γ be a transformation function Γ : Λ → Λ' such that Λ' is semantically equivalent to Λ (possibly, Λ' = Λ). We say that Λ' = Γ(Λ) is an optimal synthesis of Λ if the total number of operations (additions, products) performed to evaluate the result is minimal.

The concept of local optimality, which relies on the particular class of flop-decreasing transformations, is also introduced.

Definition 10 (Flop-decreasing transformation). A transformation which reduces the operation count is called flop-decreasing.

Definition 11 (Local optimality of a loop nest). Given Λ, Λ' and Γ as in Definition 9, we say that Λ' = Γ(Λ) is a locally optimal synthesis of Λ if:

• the number of operations (additions, products) in the innermost loops performed to evaluate the result is minimal, and

• Γ is expressed as a composition of flop-decreasing transformations.

The restriction to flop-decreasing transformations aims to exclude those apparent optimizations that, to achieve flop-optimal innermost loops, would rearrange the computation at the level of the outer loops, causing, in fact, a global increase in operation count.

We also observe that Definitions 9 and 11 do not take into account memory requirements. If the execution of a loop nest were memory-bound – the ratio of operations to bytes transferred from memory to the CPU being too low – then optimizing the number of flops would be fruitless. Henceforth we assume we operate in a CPU-bound regime, evaluating arithmetic-intensive expressions. In the context of finite elements, this is often true for more complex multilinear forms and/or higher order elements.

Achieving optimality in polynomial time is not generally feasible, since the σ_ei sub-expressions can be arbitrarily unstructured. However, multilinearity results in a certain degree of regularity in α_eij and β_eik. In the following sections, we will elaborate on these observations and formulate an approach that achieves: (i) at least a local optimum in all cases; (ii) global optimality whenever the monomials are "sufficiently structured". To this purpose, we will construct:

• the space of legal transformations impacting the operation count (Sections 4.3–4.5)

• an algorithm to explore and select points in the transformation space (Section 4.6)

4.3 Transformation Space: Sharing Elimination

We start by introducing the fundamental notion of sharing.

Definition 12 (Sharing). A statement within a loop nest Λ presents sharing if at least one of the following conditions holds:

Spatial sharing There are at least two identical sub-expressions, possibly just two symbols.

Temporal sharing There is at least one non-trivial sub-expression (e.g., an addition or a product) that is redundantly executed because it is independent of {L_{i1}, L_{i2}, ..., L_{in}} ⊂ Λ.

To illustrate the definition, we show in Figure 4.2 how sharing evolves as factorization and code motion are applied to a trivial multilinear loop nest. In the original loop nest (Figure 4.2a), spatial sharing is induced by the

symbol b_j. Factorization eliminates spatial sharing and creates temporal sharing (Figure 4.2b). Finally, generalized code motion [Luporini et al., 2015], which hoists sub-expressions that are redundantly executed by at least one loop in the nest¹, leads to optimality (Figure 4.2c).

(a) With spatial sharing:
  for (j = 0; j < J; j++)
    for (i = 0; i < I; i++)
      a_ji += b_j*c_i + b_j*d_i

(b) With temporal sharing:
  for (j = 0; j < J; j++)
    for (i = 0; i < I; i++)
      a_ji += b_j*(c_i + d_i)

(c) Optimal form:
  for (i = 0; i < I; i++)
    t_i = c_i + d_i
  for (j = 0; j < J; j++)
    for (i = 0; i < I; i++)
      a_ji += b_j*t_i

Figure 4.2: Reducing a simple multilinear loop nest to optimal form.
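As a quick check of these claims (under the convention, used only here, that every scalar addition and multiplication – including the accumulation into a_ji – counts as one operation), the three variants cost 4·I·J (a), 3·I·J (b), and I + 2·I·J (c) operations, so (b) improves on (a) for all loop sizes, while (c) is never worse than (b) and strictly better whenever J > 1.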

In this section, we study sharing elimination, a transformation that aims to reduce the operation count by removing sharing through the application of expansion, factorization, and generalized code motion. If the objective were reaching optimality and the expressions lacked structure, a transformation of this sort would require solving a large combinatorial

¹Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop.

problem – for instance, to evaluate the impact of all possible factorizations. Our sharing elimination strategy, instead, exploits the structure inherent in finite element integration expressions to guarantee, after coordination with other transformations (an aspect which we discuss in the following sections), local optimality. Global optimality is achieved if stronger preconditions hold. Setting local optimality, rather than optimality, as the primary goal is essential to produce simple and computationally efficient algorithms – two necessary conditions for integration with a compiler.

4.3.1 Identification and Exploitation of Structure

Finite element expressions can be seen as compositions of operations between tensors. Often, the optimal implementation strategy for these operations must be selected out of two alternatives. For instance, consider J^{-T}∇v · J^{-T}∇v, with J^{-T} being the transposed inverse Jacobian matrix for the change of (two-dimensional) coordinates, and v a generic two-dimensional vector. The tensor operation reduces to the scalar expression (a·v_i^0 + b·v_i^1)(a·v_i^0 + b·v_i^1) + ..., in which v_i^0 and v_i^1 represent components of v that depend on Li. To minimize the operation count for expressions of this kind, we have two options:

Strategy 1. Eliminating spatial and temporal sharing through generalized code motion. Spatial sharing is eliminated by capturing common sub-expressions; temporal sharing is eliminated by hoisting expressions out of some loops.

Strategy 2. Eliminating spatial sharing first – through product expansion and factorization – and temporal sharing afterwards, again through generalized code motion.

In the current example, depending on the size of Li, applying Strategy 2 could reduce the operation count, since the expression would be recast as v_i^0·v_i^0·aa + v_i^0·v_i^1·(ab + ab) + v_i^1·v_i^1·cc + ... and some hoistable sub-expressions would be exposed. On the other hand, Strategy 1 would have no effect, as v only depends on a single loop, Li. A second example showing the effect of Strategy 2 was provided in Figure 4.2. In general, choosing between the two strategies is challenging because multiple factors must be taken into account: the loop sizes, the increase in operations due to expansion, the gain due to code motion, and the presence of common sub-expressions. Before addressing this problem (in Section 4.3.2), we need to understand under what conditions Strategy 2 is applicable. For this, we introduce a relevant class of expressions.

Definition 13 (Structured expression). We say that an expression is "structured in a loop nest Λ" if and only if, for every symbol s_Λ depending on at least one loop in Λ, the spatial sharing induced by s_Λ may be eliminated by factorizing all occurrences of s_Λ in the expression.

Proposition 1. An expression is structured in a loop nest Λ if Λ is multilinear.

Proof. This follows directly from Definition 3 and Definition 5, which restrict the number of occurrences of s_Λ in a summand to at most 1.

If Λ were an arbitrary loop nest, a given symbol s_Λ could appear everywhere (e.g., n times in a summand and m times in another summand with n ≠ m, as argument of a higher level function, in the denominator of a division), thus posing the challenge of finding the factorization that maximizes temporal sharing. If Λ is instead a finite element integration loop nest, it is guaranteed by Proposition 1 that any sub-expression including at least two instances of the same linear symbol can be handled using Strategy 2. As discussed in the next sections, this property will allow us to construct the space of flop-decreasing transformations by "composition" of Strategy 1 and Strategy 2.

Finally, we observe that the σ_ei sub-expressions can sometimes be considered "weakly structured". This happens when a relaxed version of Definition 13 applies, in which the factorization of s_Λ only minimizes (rather than eliminates) spatial sharing (for instance, in the complex hyperelastic model analyzed in Section 4.9). Weak structure will be exploited by Algorithm 5 in the attempt to achieve optimality.
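To make Definition 13 and Proposition 1 more tangible, the following is a minimal sketch (plain Python, not COFFEE code) of the structure check: an expression is represented as a sum of products of symbol names, and a symbol passes the check if it occurs at most once in each summand.

from collections import Counter

def is_structured(summands, symbols):
    # summands: list of products, each a list of symbol names,
    # e.g. a_i*b + a_i*c  ->  [["a_i", "b"], ["a_i", "c"]]
    # Returns True if every symbol in `symbols` occurs at most once per
    # summand, in which case factorizing it eliminates its spatial sharing.
    for product in summands:
        counts = Counter(product)
        if any(counts[s] > 1 for s in symbols):
            return False
    return True

print(is_structured([["a_i", "b"], ["a_i", "c"]], {"a_i"}))    # True
print(is_structured([["a_i", "b"], ["a_i", "a_i"]], {"a_i"}))  # False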

4.3.2 Global Analysis of the Expression

To evaluate the global impact of a transformation, we perform a so-called strategy selection analysis on a temporaries graph, which models the input expression. As we shall see, this outperforms a simplistic approach in which sub-expressions are analyzed in "isolation", because common sub-expressions are taken into account.

It is useful to start by introducing the notion of a minimal common sub-expression.

Definition 14 (Minimal common sub-expression). A sub-expression m within an expression e is said to be a minimal common sub-expression (MCS) if:

• m occurs at least twice in e, and

• any m', a sub-expression obtained by removing an operation from m, has no common sub-expressions in e.

For example, in ab + abc + abcd the sub-expression ab is an MCS, whereas abc is not, despite occurring twice.

A temporaries graph is a directed k-partite graph in which vertices are MCSs and edges represent read-after-write dependencies between MCSs

(e.g., an edge from e1 to e2 indicates that e1 is read in e2). The k independent sets of vertices {T_i}_{i=0}^{k-1}, or simply levels, are determined iteratively. Initially, Strategy 1 is applied to the input expression. This produces T_0 and modifies the input expression by replacing the MCSs in T_0 with appropriate temporaries. Then, Strategy 1 is applied to the modified expression; T_1 and the edges from T_0 to T_1 are in this way determined. This process is repeated k times, until no more MCSs can be extracted. An example of a temporaries graph is provided in Figure 4.3. It is worth observing that k can be quite large, as the input is in normal form. This implies, for instance, that the operands of an inner product occurring multiple times in the variational form will be common sub-expressions in the translated scalar expression.

The strategy selection analysis uses a temporaries graph to determine whether applying Strategy 2 to some subsets of levels would lower the global operation count. If the MCSs in the graph are structured expressions, then this analysis can be accomplished in polynomial time (thanks to Proposition 1). Given two levels T_{i1} and T_{i2} such that i1 < i2, one can:

1. Starting with i = i1, calculate the total operation count for computing all sub-expressions in {T_j}_{j=i}^{i2}. The operation count includes the loop sizes.

2. Inline the MCSs in T_i into {T_j}_{j=i+1}^{i2}. The temporaries in T_i are thus discarded.

[Figure: a temporaries graph with two levels. Level L0 holds t0 = z0·a_ik + z2·b_ik and t1 = z1·a_ik + z3·b_ik; level L1 holds t11 = t0·f + t1·g, t12 = t0·h + t1·l and t13 = t0·m + t1·n, each with edges from t0 and t1.]

Figure 4.3: The temporaries graph for an excerpt of code extracted from a hyperelastic model.

3. Apply Strategy 2 to all sub-expressions in {T_j}_{j=i+1}^{i2}, factorizing any linear symbol/temporary.

4. Evaluate and store the new total operation count, which includes the loop sizes.

5. Increment i and repeat this process until i = i2.

It is finally straightforward to determine for which range of levels (if any) the application of Strategy 2 improved the operation count. The cost and the effectiveness of the strategy selection analysis obviously depend on the number n of pairs ⟨i1, i2⟩ that are evaluated (at most k − 1). Since during our experiments setting n = k − 1 never led to unacceptable code generation times², in the following we assume that the strategy selection analysis always considers all legal pairs ⟨i1, i2⟩.
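A minimal sketch of this level-pair search, written in plain Python rather than COFFEE's internal representation: it assumes two hypothetical callbacks returning the operation count of the current Strategy 1 synthesis and of the synthesis obtained by inlining and re-factorizing a given range of levels with Strategy 2.

def strategy_selection(num_levels, baseline_flops, cost_with_strategy2):
    # Evaluate every pair (i1, i2), i1 < i2, and return the range of levels
    # whose Strategy 2 re-factorization minimizes the operation count,
    # or (None, baseline_flops) if no pair beats the Strategy 1 baseline.
    best_pair, best_flops = None, baseline_flops
    for i1 in range(num_levels - 1):
        for i2 in range(i1 + 1, num_levels):
            flops = cost_with_strategy2(i1, i2)  # loop sizes included
            if flops < best_flops:
                best_pair, best_flops = (i1, i2), flops
    return best_pair, best_flops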

4.3.3 The Sharing Elimination Algorithm

Algorithm 5 describes sharing elimination, gathering the considerations from the previous sections. It also introduces a model to minimize the operation count within the innermost loop once the strategy selection analysis has been performed.

Algorithm 5 (Sharing elimination). The input of the algorithm is a tree representation of a finite element integration loop nest.

²Even in the case of complex hyperelastic models, the code generation time never exceeded a few seconds on a relatively old Sandy Bridge architecture.

1. Perform a depth-first visit of the loop tree to collect all MCSs. Apply Strategy 1 to these MCSs. This requires introducing some new temporaries. Repeat this process until there are no more MCSs.

2. Group the temporaries based on the linear loop they depend on. We have two disjoint sets in the case of bilinear forms and, trivially, a single set for linear forms. For each of these sets, build a temporaries graph and apply the strategy selection analysis discussed earlier.

3. Build the sharing graph G = (S, E) for each expression in the body of the multilinear loop nest (these are the expressions transformed by the strategy selection analysis, which now include some temporaries). Each s ∈ S represents a linear symbol or a temporary produced by the previous steps. An edge (s_i, s_j) indicates that a product s_i·s_j would appear if the sub-expressions including s_i and s_j were expanded. Note: this and the following steps will only impact bilinear forms, since otherwise, due to linearity, E = ∅.

4. Partition S into n disjoint sets {S_1, ..., S_n}, such that S_i includes all instances of a given symbol s in the expression. Transform G by merging {s_1, ..., s_m} ⊂ S_i into a unique vertex s (taking the union of the edges), provided that their factorization would not cause an increase in operation count.

5. Map G to an Integer Linear Programming (ILP) model for determining how to optimally apply Strategy 2. The solution is the set of linear symbols/temporaries that will be factorized. Let |S| = n; the ILP model then is as follows (a brute-force illustration of this model is sketched after the algorithm):

   x_i: a vertex in S (1 if the symbol should be factorized, 0 otherwise)
   y_ij: an edge in E (1 if s_i is factorized in the product s_i·s_j, 0 otherwise)
   n_i: the number of edges incident to x_i

   minimize    Σ_{i=1..n} x_i
   subject to  Σ_{j : (i,j) ∈ E} y_ij ≤ n_i·x_i,   i = 1, ..., n
               y_ij + y_ji = 1,   (i, j) ∈ E

6. Perform a depth-first visit of the loop tree and, for each expression independent of the multilinear loop nest, apply the more profitable of Strategy 1 and Strategy 2. Note: this pass speculatively assumes that expressions are (weakly) structured in the reduction loop. If the assumption does not hold, the operation count will generally be sub-optimal because only a subset of factorizations and code motion opportunities may eventually be considered.
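Since the sharing graphs arising in practice are small, the ILP of step (5) can be illustrated – and cross-checked – by brute-force enumeration. The sketch below is not the solver used by COFFEE; it enumerates the 0/1 assignments of the x variables and keeps the cheapest one for which every edge (product) can be assigned to a selected endpoint, which is exactly what the y constraints demand.

from itertools import product as assignments

def solve_factorization_ilp(vertices, edges):
    # vertices: symbol names; edges: pairs (s_i, s_j), one per product.
    # Returns (number of factorized symbols, the symbols themselves).
    best = None
    for choice in assignments([0, 1], repeat=len(vertices)):
        x = dict(zip(vertices, choice))
        # y_ij + y_ji = 1 is satisfiable iff each edge has a selected endpoint
        if all(x[i] or x[j] for (i, j) in edges):
            cost = sum(choice)
            if best is None or cost < best[0]:
                best = (cost, [v for v in vertices if x[v]])
    return best

# The sharing graph of Example 1 in Section 4.3.4: factorizing b_j suffices
print(solve_factorization_ilp(["b_j", "c_i", "d_i"],
                              [("b_j", "c_i"), ("b_j", "d_i")]))
# -> (1, ['b_j'])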

Although the primary goal of Algorithm 5 is operation count minimization within the multilinear loop nest, the enforcement of flop-decreasing transformations (steps (2) and (4)) and the re-scheduling of sub-expressions within outer loops (step (6)) also attempt to optimize the loop nest globally. We will further elaborate on this aspect in Section 4.7.

4.3.4 Examples

In this section, we present a series of examples of increasing complexity.

4.3.4.1 Example 1

Consider again Figure 4.2a. There are no common sub-expressions, so the strategy selection analysis has no effect. The sharing graph is G = ({b_j, c_i, d_i}, {(b_j, c_i), (b_j, d_i)}). The ILP formulation leads to the code in Figure 4.2c.

4.3.4.2 Example 2

In Figure 4.4, Algorithm 5 is executed in a simple yet realistic scenario, which originates from the bilinear form of a Poisson equation in two dimensions. The temporaries graph, which consists of a single level, is used for performing the strategy selection analysis. This leads to the synthesis in Figure 4.4b. The sharing graph is G = ({t0, t1, t2, t3}, {(t0, t2), (t1, t3)}), but since there are no factorization opportunities the ILP formulation has no effect.

4.3.4.3 Example 3

In this example, we focus on the ILP formulation. We consider a bilinear form extracted from a model of temperature-dependent multiphase flow

(a) Normal form:
  for (e = 0; e < E; e++)
    ...
    for (i = 0; i < I; i++)
      ...
      for (j = 0; j < J; j++)
        for (k = 0; k < K; k++)
          A_ejk += (((z0*a_ik + z2*b_ik) * (z0*c_ij + z2*d_ij)) +
                    ((z1*a_ik + z3*b_ik) * (z1*c_ij + z3*d_ij))) * W_i * det

(b) After sharing elimination:
  for (e = 0; e < E; e++)
    z0 = ...
    z1 = ...
    ...
    for (i = 0; i < I; i++)
      t_det = W_i * det
      for (k = 0; k < K; k++)
        t0_k = (z0*a_ik + z2*b_ik)
        t1_k = (z1*a_ik + z3*b_ik)
      for (j = 0; j < J; j++)
        t2_j = (z0*c_ij + z2*d_ij) * t_det
        t3_j = (z1*c_ij + z3*d_ij) * t_det
      for (j = 0; j < J; j++)
        for (k = 0; k < K; k++)
          A_ejk += t0_k * t2_j + t1_k * t3_j

Figure 4.4: Applying sharing elimination to the bilinear form arising from a Poisson equation in 2D. The operation counts are E(f(z0, z1, ...) + IJK·18) (left) and E(f(z0, z1, ...) + I(J·6 + K·9 + JK·4)) (right), with f(z0, z1, ...) representing the operation count for evaluating z0, z1, ..., and common sub-expressions being counted once. The synthesis in Figure 4.4b is globally optimal apart from the pathological case I, J, K = 1.

through porous media [Ølgaard and Wells, 2010], which we informally refer to as the "pressure equation". Although the complete specification of the form is irrelevant for the purpose of this example, it is useful to know that (i) the problem is linear, (ii) there are in total six monomials, and (iii) the gradient and the divergence of test and trial functions appear in some monomials (so several distinct linear symbols are present in the input expression). Figure 4.5a shows the finite element integration loop nest at the end of step (2) in Algorithm 5; that is, once the strategy selection analysis has been performed. The linear temporaries induce spatial sharing in the loop nest. It is not obvious to identify the factorization that would result in a local optimum. One can derive the sharing graph directly from inspection of Figure 4.5a, obtaining G = ({t10_ik, t10_ij, t11_j, t12_j, t13_k, t14_k, t15_k}, {(t10_ik, t10_ij), (t10_ik, t11_j), (t10_ik, t12_j), (t15_k, t11_j), (t15_k, t12_j), (t10_ij, t14_k), (t10_ij, t13_k), (t11_j, t14_k), (t11_j, t13_k), (t12_j, t14_k), (t12_j, t13_k)}). The ILP formulation eventually retrieves three factorization candidates, as shown in Figure 4.5b.

4.3.4.4 Example 4

In the last example, we show how the strategy selection analysis impacts the operation count in a hyperelastic model. This is the same problem

[Figure: (a) Before the ILP-driven factorization – the pressure-equation loop nest in which the expression over the temporaries t10_ik, t10_ij, t11_j, t12_j, t13_k, t14_k, t15_k is left unfactorized inside the j–k loops; (b) After the ILP-driven factorization – three hoisted k-dependent temporaries (t16_k, t17_k, t18_k) are combined in the innermost loop as A_jk += t10_ij*t16_k + t12_j*t17_k + t11_j*t18_k.]

Figure 4.5: Using the ILP model to factorize the expression arising from the pressure equation.

used for performance evaluation in Section 4.9. The code in Figure 4.6a is obtained by repeatedly applying Strategy 1 to the expression in normal form (step (1) in Algorithm 5). The temporaries graph, which consists of 6 levels, can easily be derived. For given values of I and K, the strategy selection analysis determines that applying Strategy 2 to the temporaries between levels T0 and T4 will improve the operation count. Strategy 1 is instead preferable for T5, which comprises the four temporaries t59_k, t60_k, t65_k, t66_k. The resulting synthesis is shown in Figure 4.6b. Intuitively, instead of executing N operations IK times (Figure 4.6a), now M operations are executed I times (Figure 4.6b). It is true that M > N, but the analysis

guarantees that I and K are such that the operation count has improved. In this example, the application of Strategy 2 reduces the operation count by about 1.3×, on average (the actual value clearly depends on I and K). It is not rare, however, to encounter cases (e.g., more complex hyperelastic models) in which the reduction is even larger than 2×. We emphasize that here we are discussing the improvement from applying Strategy 2 "on top of" Strategy 1; the overall gain due to performing the strategy selection analysis is usually much larger.

[Figure: two-panel code listing for the hyperelastic kernel – (a) Normal form and (b) After sharing elimination – showing the long chains of temporaries (c0, ..., c50 and t40, ..., t66_k) produced inside the i and k loops, and how the temporaries of levels T0–T4 are inlined and re-factorized in (b).]

Figure 4.6: Applying the strategy selection analysis to the bilinear form arising from a hyperelastic model in 2D.

4.4 Transformation Space: Pre-evaluation of Reductions

Sharing elimination uses three operators: expansion, factorization, and code motion. In this section, we discuss the role and legality of a fourth operator: reduction pre-evaluation. We will see that what makes this operator special is the fact that there exists a single point in the transformation space of a monomial (i.e., a specific factorization of test, trial, and coefficient functions) ensuring its correctness.

We start with an example. Consider again the loop nest and the expression in Figure 4.1. We pose the following question: are we able to identify sub-expressions for which the reduction induced by Li can be pre-evaluated, thus obtaining a decrease in operation count proportional to the size of Li, namely I? The transformation we look for is exemplified in Figure 4.7 with a simple loop nest. The reader may verify that a similar transformation is applicable to the example in Figure 4.4a.

(a) With reduction:
  for (e = 0; e < E; e++)
    for (i = 0; i < I; i++)
      for (k = 0; k < K; k++)
        A_ek += d_e * b_ik * c_i + d_e * b_ik * d_i

(b) After pre-evaluation:
  for (i = 0; i < I; i++)
    for (k = 0; k < K; k++)
      t_k += b_ik * (c_i + d_i)
  for (e = 0; e < E; e++)
    for (k = 0; k < K; k++)
      A_ek = d_e * t_k

Figure 4.7: Exposing (through factorization) and pre-evaluating a reduction.

Pre-evaluation can be seen as the generalization of tensor contraction (reviewed in Section 2.1.5) to a wider class of sub-expressions. We know that multilinear forms can be seen as sums of monomials, each monomial being an integral over the equation domain of products of (derivatives of) functions from discrete spaces. A monomial can always be reduced to the product between a "reference" and a "geometry" tensor. In our model, a reference tensor is simply represented by one or more sub-expressions independent of Le, exposed after particular transformations of the expression tree. This leads to the following algorithm.

Algorithm 6 (Pre-evaluation). Consider a finite element integration loop

nest Λ = [Le, Li, Lj, Lk]. We dissect the normal form input expression into distinct sub-expressions, each of them representing a monomial. Each sub-expression is then factorized so as to split constants from [Li, Lj, Lk]-dependent terms. This transformation is feasible³, as a consequence of the results in Kirby and Logg [2007]. These [Li, Lj, Lk]-dependent terms are hoisted outside of Λ and stored into temporaries. As part of this process, the reduction induced by Li is computed by means of symbolic execution. Finally, Li is removed from Λ.
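A minimal numerical illustration of Algorithm 6 applied to the loop nest of Figure 4.7, using NumPy as a stand-in for the symbolic execution performed at code-generation time; the array names mirror the figure and are otherwise arbitrary.

import numpy as np

I, K, E = 4, 3, 5
rng = np.random.default_rng(0)
b = rng.random((I, K))        # tabulated basis functions, [Li, Lk]-dependent
c, d_i = rng.random(I), rng.random(I)
d_e = rng.random(E)           # element-dependent coefficient

# Pre-evaluation: the reduction over i is executed once, ahead of time
t = np.einsum("ik,i->k", b, c + d_i)    # t_k = sum_i b_ik * (c_i + d_i)
A_pre = np.outer(d_e, t)                # per element: only A_ek = d_e * t_k

# Reference: the original nest, with the reduction repeated for every element
A_ref = np.array([[sum(d_e[e] * b[i, k] * (c[i] + d_i[i]) for i in range(I))
                   for k in range(K)] for e in range(E)])
assert np.allclose(A_pre, A_ref)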

The pre-evaluation of a monomial introduces some critical issues:

1. Depending on the complexity of a monomial, a certain number, t, of temporary variables is required if pre-evaluation is performed. Such temporary variables are actually n-dimensional arrays of size S, with n and S being, respectively, the arity and the extent (iteration space size) of the multilinear loop nest (e.g., n = 2 and S = JK in the case of bilinear forms). For certain values of t and S, pre-evaluation may dramatically increase the working set, which may be counter-productive for actual execution time.

2. The transformations exposing [Li, Lj, Lk]-dependent terms increase the arithmetic complexity of the expression (e.g., expansion tends to increase the operation count). This could outweigh the gain due to pre-evaluation.

3. A strategy for coordinating sharing elimination and pre-evaluation is needed. We observe that sharing elimination inhibits pre-evaluation, whereas pre-evaluation could expose further sharing elimination opportunities.

We expand on point (1) in the next section, while we address points (2) and (3) in Section 4.6.

³For reasons of space, we omit the detailed sequence of steps (e.g., expansion, factorization), which is however available at https://github.com/coneoproject/COFFEE/blob/master/coffee/optimizer.py in Luporini et al. [2016b].

4.5 Transformation Space: Memory Constraints

We have just observed that the code motion induced by monomial pre-evaluation may dramatically increase the working set size. Even more aggressive code motion strategies are theoretically conceivable. Imagine Λ is enclosed in a time stepping loop. One could think of exposing (through some transformations) and hoisting time-invariant sub-expressions to minimize redundant computation at each time step. The working set size would then increase by a factor of E, and since E ≫ I, J, K, the gain in operation count would probably be outweighed, from a runtime viewpoint, by a much larger memory pressure.

Since, for certain forms and discretizations, hoisting may cause the working set to exceed the size of some level of local memory (e.g. the last level of private cache on a conventional CPU, or the shared memory on a GPU), we introduce the following memory constraints.

Constraint 1. The size of a temporary due to code motion must not be proportional to the size of Le.

Constraint 2. The total amount of memory occupied by the temporaries due to code motion must not exceed a certain threshold, TH.

Constraint 1 is a policy decision that the compiler should not silently consume memory on global data objects. It has the effect of shrinking the transformation space. Constraint 2 has both theoretical and practical implications, which will be carefully analyzed in the next sections.
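A small sketch of how the two constraints might be checked for a candidate set of hoisted temporaries; the representation of a temporary by its shape and an element-dependence flag is a simplification of what the compiler actually tracks.

def respects_memory_constraints(temporaries, threshold_bytes, dtype_size=8):
    # temporaries: iterable of (shape, depends_on_element_loop) pairs, where
    # `shape` collects the extents of the loops the temporary spans.
    total = 0
    for shape, depends_on_element_loop in temporaries:
        if depends_on_element_loop:      # Constraint 1: size must not scale with E
            return False
        size = dtype_size
        for extent in shape:
            size *= extent
        total += size
    return total <= threshold_bytes      # Constraint 2: stay within TH

# Two hoisted J x K tables (J = K = 12) against TH = 256 KB (an L2 cache)
print(respects_memory_constraints([((12, 12), False), ((12, 12), False)],
                                  threshold_bytes=256 * 1024))   # True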

4.6 Selection and Composition of Transformations

In this section, we build a transformation algorithm that, given a memory bound, systematically reaches a local optimum for finite element integration loop nests.

4.6.1 The Main Transformation Algorithm

We address the following two issues:

1. Coordination of pre-evaluation and sharing elimination. Recall from Section 4.4 that pre-evaluation could either increase or decrease the operation count in comparison with that achieved by sharing elimination.

2. Optimizing over composite operations. Consider a form comprising two monomials m1 and m2. Assume that pre-evaluation is profitable for m1 but not for m2, and that m1 and m2 share at least one term (for example some basis functions). If pre-evaluation were applied to m1, sharing between m1 and m2 would be lost. We then need a mechanism to understand which transformation – pre-evaluation or sharing elimination – results in the highest operation count reduction when considering the whole set of monomials (i.e., the expression as a whole).

Let θ : M → Z be a cost function that, given a monomial m ∈ M, returns the gain/loss achieved by pre-evaluation over sharing elimination. In particular, we define θ(m) = θ_pre(m) − θ_se(m), where θ_se and θ_pre represent the operation counts resulting from applying sharing elimination and pre-evaluation, respectively. Thus pre-evaluation is profitable for m if and only if θ(m) < 0. We return to the issue of deriving θ_se and θ_pre in Section 4.6.2. Having defined θ, we can now describe the transformation algorithm (Algorithm 7).

Algorithm 7 (Transformation algorithm). The algorithm has three main phases: initialization (step 1); determination of the monomials that should be pre-evaluated while preserving the memory constraints (steps 2-4); application of pre-evaluation and sharing elimination (step 5).

1. Perform a depth-first visit of the expression tree and determine the set of monomials M. Let S be the subset of monomials m such that θ(m) > 0. The set of monomials that will potentially be pre-evaluated is P = M \ S. Note: there are two fundamental reasons for not pre-evaluating m1 ∈ P straight away: 1) the potential presence of spatial sharing between m1 and m2 ∈ S, which impacts the search for the global optimum; 2) the risk of breaking Constraint 2.

2. Build the set B of all possible bipartitions of P. Let D be the dictionary that will store the operation counts of different alternatives.

3. Discard b = (b_S, b_P) ∈ B if the memory required after applying pre-evaluation to the monomials in b_P exceeds TH (see Constraint 2); otherwise, add D[b] = θ_se(S ∪ b_S) + θ_pre(b_P). Note: B is in practice very small, since even complex forms usually have only a few monomials. This pass can then be accomplished rapidly as long as the cost of calculating θ_se and θ_pre is negligible. We elaborate on this aspect in Section 4.6.2.

4. Take arg min_b D[b].

5. Apply pre-evaluation to all monomials in b_P. Apply sharing elimination to all resulting expressions. Note: because of the reuse of basis functions, pre-evaluation may produce some identical tables, which will be mapped to the same temporary variable. Sharing elimination is therefore transparently applied to all expressions, including those resulting from pre-evaluation.
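The bipartition search of Algorithm 7 is compact enough to be written down directly. The sketch below is schematic, not COFFEE's implementation: it assumes cost callbacks theta_se and theta_pre (operation counts of sharing elimination and pre-evaluation over a set of monomials) and a memory estimate for the pre-evaluated tables.

from itertools import combinations

def choose_preevaluation(monomials, theta_se, theta_pre, memory_after_pre, TH):
    # Step 1: P holds the monomials for which pre-evaluation wins in isolation
    S = [m for m in monomials if theta_pre({m}) - theta_se({m}) > 0]
    P = [m for m in monomials if m not in S]

    best_bp, best_cost = frozenset(), theta_se(set(monomials))  # no pre-evaluation
    # Steps 2-4: enumerate the bipartitions of P, discarding those exceeding TH
    for r in range(1, len(P) + 1):
        for bp in map(frozenset, combinations(P, r)):
            if memory_after_pre(bp) > TH:
                continue
            cost = theta_se(set(S) | (set(P) - bp)) + theta_pre(bp)
            if cost < best_cost:
                best_bp, best_cost = bp, cost
    return best_bp  # step 5: pre-evaluate these, then apply sharing elimination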

The output of the transformation algorithm is provided in Figure 4.8, assuming as input the loop nest in Figure 4.1.

// Pre-evaluated tables
...
for (e = 0; e < E; e++)
  // Temporaries due to sharing elimination
  // (sharing was a by-product of pre-evaluation)
  ...
  // Loop nest for pre-evaluated monomials
  for (j = 0; j < J; j++)
    for (k = 0; k < K; k++)
      A_ejk += F'(...) + F''(...) + ...

  // Loop nest for monomials for which run-time
  // integration was determined to be faster
  for (i = 0; i < I; i++)
    // Temporaries due to sharing elimination
    ...
    for (j = 0; j < J; j++)
      for (k = 0; k < K; k++)
        A_ejk += H(...)

Figure 4.8: The loop nest produced by the algorithm for an input as in Figure 4.1.

4.6.2 The Cost Function θ

We tie up the remaining loose end: the construction of the cost function θ. We recall that θ(m) = θ_pre(m) − θ_se(m), with θ_se and θ_pre representing the operation counts after applying sharing elimination and pre-evaluation, respectively. Since θ is deployed in a working compiler, simplicity and efficiency are essential characteristics. In the following, we explain how to derive these two values.

The most trivial way of evaluating θ_se and θ_pre would consist of applying the actual transformations and simply counting the number of operations. This would be tolerable for θ_se, as Algorithm 5 tends to have negligible cost. However, the overhead would be unacceptable if we applied pre-evaluation – in particular, symbolic execution – to all bipartitions analyzed by Algorithm 7. We therefore seek an analytic way of determining θ_pre.

The first step consists of estimating the increase factor, ι. This number captures the increase in arithmetic complexity due to the transformations exposing pre-evaluation opportunities. For context, consider the example in Figure 4.9. One can think of this as the (simplified) loop nest originating from the integration of the action of a mass matrix. The sub-expression f0*B_i0 + f1*B_i1 + f2*B_i2 represents the coefficient f over (tabulated) basis functions (array B). In order to apply pre-evaluation, the expression needs to be transformed to separate f from all [Li, Lj, Lk]-dependent quantities (see Algorithm 6). By product expansion, we observe an increase in the number of [Lj, Lk]-dependent terms by a factor ι = 3.

for (i = 0; i < I; i++)
  for (j = 0; j < J; j++)
    for (k = 0; k < K; k++)
      A_jk += b_ij * b_ik * (f0*B_i0 + f1*B_i1 + f2*B_i2)

Figure 4.9: Simplified loop nest for a pre-multiplied mass matrix.

In general, however, determining ι is not so straightforward, since redundant tabulations may result from common sub-expressions. Consider the previous example. One may add one coefficient in the same function space as f, repeat the expansion, and observe that multiple sub-expressions (e.g., b_10·b_01·... and b_01·b_10·...) will reduce to identical tables. To evaluate ι, we then use combinatorics. We calculate the k-combinations with repetitions of n elements, where: (i) k is the number of

(derivatives of) coefficients appearing in a product; (ii) n is the number of unique basis functions involved in the expansion. In the original example, we had n = 3 (for B_i0, B_i1, and B_i2) and k = 1, which confirms ι = 3. In the modified example, there are two coefficients, so k = 2, which means ι = 6.

If ι ≥ I (the extent of the reduction loop), we already know that pre-evaluation will not be profitable. Intuitively, this means that we are introducing more operations than we are saving from pre-evaluating Li. If ι < I, we still need to find the number of terms ρ such that θ_pre = ρ·ι. The mass matrix monomial in Figure 4.9 is characterized by the dot product of test and trial functions, so trivially ρ = 1. In the example in Figure 4.4, instead, we have ρ = 3 after a suitable factorization of basis functions. In general, therefore, ρ depends on both form and discretization. To determine this parameter, we look at the re-factorized expression (as established by Algorithm 6), and simply count the terms amenable to pre-evaluation.
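The combinatorial estimate just described fits in a few lines; the following is a sketch of the reasoning rather than COFFEE's cost model verbatim.

from math import comb

def increase_factor(n_unique_basis, k_coefficients):
    # k-combinations with repetition of n elements: the number of distinct
    # [Lj, Lk]-dependent tables produced by expanding the coefficient terms
    return comb(n_unique_basis + k_coefficients - 1, k_coefficients)

def theta_pre_estimate(n_unique_basis, k_coefficients, rho, I):
    # Returns None when iota >= I, i.e. when expansion introduces more terms
    # than pre-evaluating the reduction loop of extent I can pay back
    iota = increase_factor(n_unique_basis, k_coefficients)
    return None if iota >= I else rho * iota

print(increase_factor(3, 1))   # Figure 4.9, one coefficient: iota = 3
print(increase_factor(3, 2))   # with a second coefficient:  iota = 6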

4.7 Formalization

We demonstrate that the orchestration of sharing elimination and pre-evaluation performed by the transformation algorithm guarantees local optimality (Definition 11). The proof re-uses concepts and explanations provided throughout the chapter, as well as the terminology introduced in Section 4.3.3.

Proposition 2. Consider a multilinear form comprising a set of monomials M, and let Λ be the corresponding finite element integration loop nest. Let Γ be the transformation algorithm. Let X be the set of monomials that, according to Γ, need to be pre-evaluated, and let Y = M \ X. Assume that the pre-evaluation of different monomials does not result in identical tables. Then, Λ' = Γ(Λ) is a local optimum in the sense of Definition 11 and satisfies Constraint 2.

Proof. We first observe that the cost function θ predicts the exact gain/loss in monomial pre-evaluation, so X and Y can actually be constructed. Let c_Λ denote the operation count for a synthesis Λ and let Λ_I ⊂ Λ be the subset of Λ's innermost loops (all Lk loops in Figure 4.8). We need to show that there is no other synthesis Λ'' satisfying Constraint 2 such that c_{Λ''_I} < c_{Λ'_I}, and that Λ' is achieved through a sequence of flop-decreasing transformations. This holds if and only if

1. The coordination of pre-evaluation with sharing elimination is optimal. This boils down to proving that
   a) pre-evaluating any m ∈ Y would result in c_{Λ''_I} > c_{Λ'_I}
   b) not pre-evaluating any m ∈ X would result in c_{Λ''_I} > c_{Λ'_I}

2. Sharing elimination leads to (at least) a local optimum.

We discuss these points separately.

1. a) Let T_m represent the set of tables resulting from applying pre-evaluation to a monomial m. Consider two monomials m1, m2 ∈ Y and the respective sets of pre-evaluated tables, T_{m1} and T_{m2}. If T_{m1} ∩ T_{m2} ≠ ∅, at least one table is assignable to the same temporary. Γ, therefore, may not be optimal, since θ only distinguishes monomials in "isolation". We neglect this scenario (see assumptions) because of its purely pathological nature and its – with high probability – negligible impact on the operation count.

   b) Let m1 ∈ X and m2 ∈ Y be two monomials sharing some generic linear symbols. If m1 were carelessly pre-evaluated, there may be a potential gain in sharing elimination that is lost, potentially leading to a non-optimum. This situation is prevented by construction, because Γ exhaustively searches all possible bipartitions in order to determine an optimum which satisfies Constraint 2⁴. Recall that since the number of monomials is in practice very small, this pass can rapidly be accomplished.

2. Consider Algorithm 5. Strategy 1 cannot increase the operation count, as both code motion and common sub-expression elimination are flop-decreasing transformations. Strategy 2, instead, may increase the operation count due to product expansion. If Strategy 2

⁴Note that the problem can be seen as an instance of the well-known Knapsack problem.

were carelessly applied, for instance as a step of a larger transformation, it would be possible to end up with an increased operation count. This is avoided by construction (steps (2) and (4)). The strategy selection analysis, in particular, compares the operation counts of Strategy 2 and Strategy 1 (a flop-decreasing transformation), and the former is retained if and only if the new state shows an improved operation count. As all transformations that are applied to the input expressions in normal form are of a flop-decreasing nature, local optima cannot be pruned from the search space.

The ILP model is derived from the sharing graph of the transformed expressions. This is used to drive Strategy 2 in the minimization of the operation count within the innermost loop (see Definition 11). At this point, proving the optimality of the innermost loop reduces to establishing the correctness of the model, which is relatively straightforward because of its simplicity. The model aims to minimize the operation count by selecting the most promising factorizations. The second set of constraints ensures that all edges (i.e., all multiplications) are selected exactly once. The first set of inequalities allows multiplications to be scheduled: once a vertex s is selected (i.e., once a symbol is decided to be factorized), all multiplications involving s can be grouped.

Throughout the chapter we have reiterated the claim that Algorithm 7 achieves a globally optimal flop count if stronger preconditions on the input variational form are satisfied. We state these preconditions here, in increasing order of complexity.

1. There is a single monomial and only a specific coefficient (e.g., the coordinates field). This is by far the simplest scenario, which requires no particular transformation at the level of the outer loops, so optimality naturally follows.

2. There is a single monomial, but multiple coefficients are present. Optimality is achieved if and only if all sub-expressions depending on coefficients are structured (see Section 4.3.1). This avoids ambiguity in factorization, which in turn guarantees that the output of step (7) in Algorithm 5 is optimal.

3. There are multiple monomials, but either at most one coefficient (e.g., the coordinates field) or multiple coefficients not inducing sharing across different monomials are present. This reduces, respectively, to cases (1) and (2) above.

4. There are multiple monomials, and coefficients are shared across monomials. Optimality is reached if and only if the coefficient-dependent sub-expressions produced by Algorithm 5 – that is, the by-product of factorizing test/trial functions from distinct monomials – preserve structure.

4.8 Code Generation

Sharing elimination and pre-evaluation, as well as the transformation algorithm, have been implemented in COFFEE, the compiler for finite element integration routines adopted in Firedrake. In this section, we briefly discuss the aspects of the compiler that are relevant for this chapter. A complete description of the compiler is provided in Chapter 6.

4.8.1 Expressing Transformations with COFFEE

COFFEE implements sharing elimination and pre-evaluation by composing building block transformation operators, which we refer to as rewrite operators. This has several advantages. The first is extensibility: new transformations, such as sum factorization in spectral methods, could be expressed by composing the existing operators, or built with little effort on top of what is already available. Second, generality: COFFEE can be seen as a lightweight, low level computer algebra system, not necessarily tied to finite element integration. Third, robustness: the same operators are exploited, and therefore tested, by different optimization pipelines. The rewrite operators, whose (Python) implementation is based on manipulation of abstract syntax trees (ASTs), comprise the COFFEE language. A non-exhaustive list of such operators includes expansion, factorization, re-association, and generalized code motion.
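To give a flavour of how such operators compose, the toy sketch below implements two of them – factorization and (generalized) code motion – on a tiny nested-tuple expression representation and applies them to the body of Figure 4.2a. It is a deliberately simplified illustration; COFFEE's operators work on full C-like ASTs and expose richer interfaces.

# Leaves are (name, set_of_loop_indices); inner nodes are (op, left, right).
def indices(e):
    # The loop indices a sub-expression depends on
    if isinstance(e[1], set):
        return e[1]
    return indices(e[1]) | indices(e[2])

def factorize(e, sym):
    # Rewrite operator (one level): sym*x + sym*y  ->  sym*(x + y)
    if (e[0] == "+" and e[1][0] == "*" and e[2][0] == "*"
            and e[1][1] == sym and e[2][1] == sym):
        return ("*", sym, ("+", e[1][2], e[2][2]))
    return e

def hoistable(e, loop):
    # Rewrite operator helper: sub-expressions invariant in `loop`, hence
    # movable out of it by generalized code motion
    if loop not in indices(e):
        return [e]
    if isinstance(e[1], set):                  # a leaf that depends on `loop`
        return []
    return hoistable(e[1], loop) + hoistable(e[2], loop)

b_j, c_i, d_i = ("b", {"j"}), ("c", {"i"}), ("d", {"i"})
expr = ("+", ("*", b_j, c_i), ("*", b_j, d_i))       # Figure 4.2a
expr = factorize(expr, b_j)                          # Figure 4.2b
print(hoistable(expr, "j"))                          # the hoistable (c_i + d_i), i.e. t_i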

136 4.8.2 Independence from Form Compilers

COFFEE aims to be independent of the high level form compiler. It provides an interface to build generic ASTs and only expects expressions to be in normal form (or sufficiently close to it). For example, Firedrake has transitioned from a version of the FEniCS Form Compiler modified to produce ASTs rather than strings, to a newly written compiler (see also Section 2.2.1.2), while continuing to employ COFFEE. Thus, COFFEE decouples the mathematical manipulation of a form from code optimization; or, in other words, relieves form compiler developers of the task of fine scale loop optimization of generated code.

4.8.3 Handling Block-sparse Tables

For several reasons, basis function tables may be block-sparse (e.g., containing zero-valued columns). For example, the FEniCS Form Compiler implements vector-valued functions by adding (blocks of) zero-valued columns to the corresponding tabulations; this enormously simplifies code generation (particularly, the construction of loop nests), but also affects the performance of the generated code due to the execution of "useless" flops (e.g., operations like a + 0). In Ølgaard and Wells [2010], a technique to avoid iteration over zero-valued columns, based on the use of indirection arrays (e.g. A[B[i]], in which A is a tabulated basis function and B a map from loop iterations to non-zero columns in A), was proposed. This technique, however, produces non-contiguous memory loads and stores, which nullify the potential benefits of vectorization. COFFEE, instead, handles block-sparse basis function tables by restructuring loops in such a manner that low level optimization (especially vectorization) is only marginally affected. This is based on symbolic execution of the code, which enables a series of checks on array indices and loop bounds to determine the zero-valued blocks which can be skipped without affecting data alignment.
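The core of the zero-block handling can be conveyed with a short sketch: a symbolic pass inspects the tabulated values, records which contiguous column blocks are entirely zero, and the generated loops then visit only the non-zero ranges, keeping accesses contiguous (and hence vectorizable) instead of going through indirection arrays. The function below only shows the detection step; names and the block size are illustrative.

import numpy as np

def nonzero_column_blocks(table, block):
    # Return the [start, end) column ranges of `table` that contain at least
    # one non-zero entry; fully zero blocks can simply be skipped by the
    # generated loop nests without breaking data alignment.
    ranges = []
    for start in range(0, table.shape[1], block):
        end = min(start + block, table.shape[1])
        if np.any(table[:, start:end]):
            ranges.append((start, end))
    return ranges

# A vector-valued tabulation with interleaved zero-valued column blocks
table = np.zeros((4, 12))
table[:, 0:4] = 1.0
table[:, 8:12] = 1.0
print(nonzero_column_blocks(table, block=4))   # [(0, 4), (8, 12)]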

4.9 Performance Evaluation

4.9.1 Experimental Setup

Experiments were run on a single core of an Intel i7-2600 (Sandy Bridge) CPU, running at 3.4GHz, with a 32KB L1 cache (private), a 256KB L2 cache (private) and an 8MB L3 cache (shared). The Intel Turbo Boost and Intel Speed Step technologies were disabled. The Intel icc 15.2 compiler was used, with the compilation flags -O3 and -xHost; the -xHost flag tells the Intel compiler to generate efficient code for the underlying platform.

The Zenodo system was used to archive all packages used to perform the experiments: Firedrake [Mitchell et al., 2016], PETSc [Smith et al., 2016], petsc4py [Firedrake, 2016], FIAT [Rognes et al., 2016], UFL [Alnæs et al., 2016], FFC [Logg et al., 2016], PyOP2 [Rathgeber et al., 2016b] and COFFEE [Luporini et al., 2016b]. The experiments can be reproduced using a publicly available benchmark suite [Rathgeber et al., 2016a].

We analyze the execution time of four real-world bilinear forms of increasing complexity, which comprise the differential operators that are most common in finite element methods. In particular, we study the mass matrix ("Mass") and the bilinear forms arising in a Helmholtz equation ("Helmholtz"), in an elastic model ("Elasticity"), and in a hyperelastic model ("Hyperelasticity"). The complete specification of these forms is made publicly available⁵.

We evaluate the speed-ups achieved by a wide variety of transformation systems over the "original" code produced by the FEniCS Form Compiler (i.e., no optimizations applied). We analyze the following transformation systems:

quad  Optimized quadrature mode. Work presented in Ølgaard and Wells [2010], implemented in the FEniCS Form Compiler.

tens  Tensor contraction mode. Work presented in Kirby and Logg [2006], implemented in the FEniCS Form Compiler.

auto  Automatic choice between tens and quad driven by a heuristic

⁵https://github.com/firedrakeproject/firedrake-bench/blob/experiments/forms/firedrake_forms.py

(detailed in Logg et al. [2012] and summarized in Section 2.1.5). Implemented in the FEniCS Form Compiler.

ufls  UFLACS, a novel back-end for the FEniCS Form Compiler whose primary goals are improved code generation and execution times.

cfO1  Generalized loop-invariant code motion. Work presented in Luporini et al. [2015], implemented in COFFEE.

cfO2  Optimal loop nest synthesis with handling of block-sparse tables. Work presented in this chapter, implemented in COFFEE.

The values that we report are the average of three runs with a "warm cache"; that is, with all kernels retrieved directly from Firedrake's cache, so code generation and compilation times are not counted. The timing, however, includes the cost of both local assembly and matrix insertion, with the latter minimized through the choice of a mesh (details below) small enough to fit in the L3 cache of the CPU.

For a fair comparison, small patches were written to make quad, tens, and ufls compatible with Firedrake. By executing all simulations in Firedrake, we guarantee that both matrix insertion and mesh iteration have a fixed cost, independent of the transformation system employed. The patches adjust the data storage layout to what Firedrake expects (e.g., by generating an array of pointers instead of a pointer to pointers, or by replacing flattened arrays with bi-dimensional ones).

For Constraint 2, discussed in Section 4.5, we set TH = size(L2); that is, the size of the processor's L2 cache (the last level of private cache). When the threshold had an impact on the transformation process, the experiments were repeated with TH = size(L3). The results are documented later, individually for each problem. Following the methodology adopted in Ølgaard and Wells [2010], we vary the following parameters:

• the polynomial degree of test, trial, and coefficient (or "pre-multiplying") functions, q ∈ {1, 2, 3, 4}

• the number of coefficient functions nf ∈ {0, 1, 2, 3}

while the constants of our study are:

[Figure: a 4×4 grid of bar charts; columns: polynomial degree q = 1, 2, 3, 4; rows: nf = 0, 1, 2, 3; y-axis: speedup relative to baseline; bars: quad, tens, auto, ufls, cfO1, cfO2.]

Figure 4.10: Performance evaluation for the mass matrix. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler.

• the element family: Lagrange

• the mesh: tetrahedral with a total of 4374 elements

• exact numerical quadrature (we employ the same scheme used in Ølgaard and Wells [2010], based on the Gauss-Legendre-Jacobi rule)

4.9.2 Performance Results

We report the results of our experiments in Figures 4.10, 4.11, 4.12, and 4.13 as three-dimensional plots. The axes represent q, nf, and the code transformation system. We show one subplot for each problem instance ⟨form, nf, q⟩, with the code transformation system varying within each subplot. The best variant for each problem instance is given by the tallest bar, which

[Figure: a 4×4 grid of bar charts; columns: polynomial degree q = 1, 2, 3, 4; rows: nf = 0, 1, 2, 3; y-axis: speedup relative to baseline; bars: quad, tens, auto, ufls, cfO1, cfO2.]

Figure 4.11: Performance evaluation for the bilinear form of a Helmholtz equation. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler.

indicates the maximum speed-up over non-transformed code. We note that if a bar or a subplot is missing, then the form compiler failed to generate code because it either exceeded the system memory limit or was otherwise unable to handle the form.

The rest of the section is organized as follows: we first provide insights into the general outcome of the experimentation; we then comment on the impact of a fundamental low-level optimization, namely autovectorization; finally, we motivate, for each form, the performance results obtained.

High level view Our transformation strategy does not always guarantee minimum execution time. In particular, about 5% of the test cases (3

[Figure: a 4×4 grid of bar charts; columns: polynomial degree q = 1, 2, 3, 4; rows: nf = 0, 1, 2, 3; y-axis: speedup relative to baseline; bars: quad, tens, auto, ufls, cfO1, cfO2.]

Figure 4.12: Performance evaluation for the bilinear form arising in an elastic model. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler.

out of 56, without counting marginal differences) show that cfO2 was not optimal in terms of runtime. The most significant of such test cases is the elastic model with [q = 4, nf = 0]. There are two reasons for this. First, low level optimization can have a significant impact on the actual performance. For example, the aggressive loop unrolling in tens eliminates operations on zeros and reduces the working set size by not storing entire temporaries; on the other hand, preserving the loop structure can maximize the chances of autovectorization. Second, the transformation

strategy adopted when TH is exceeded plays a key role, as we will later elaborate.

[Figure: a grid of bar charts; columns: polynomial degree q = 1, 2, 3, 4; rows: nf = 0, 1; y-axis: speedup relative to baseline; bars: quad, tens, auto, ufls, cfO1, cfO2.]

Figure 4.13: Performance evaluation for the bilinear form arising in a hyperelastic model. The bars represent speed-up over the original (unoptimized) code produced by the FEniCS Form Compiler.

Autovectorization We chose the mesh dimension and the function spaces such that the inner loop sizes would always be a multiple of the machine vector length. This ensured autovectorization in the majority of code variants6. The biggest exception is quad, due to the presence of indirection arrays in the generated code. In tens, loop nests are fully unrolled, so standard loop vectorization is not feasible; the compiler reports suggest, however, that block vectorization [Larsen and Amarasinghe, 2000] is often triggered. In ufls, cfO1, and cfO2 the iteration spaces have identical structure, with loop vectorization being regularly applied.

6We verified the vectorization of inner loops by looking at both compiler reports and assembly code.

Mass matrix We start with the simplest of the bilinear forms investigated, the mass matrix. Results are in Figure 4.10. We first notice that the lack of improvements when q = 1 is due to the fact that matrix insertion outweighs local assembly. For q ≥ 2, cfO2 generally shows the highest speed-ups. It is worth noting why auto does not always select the fastest implementation: auto always opts for tens, while for nf ≥ 2 quad tends to be preferable. On the other hand, cfO2 always makes the optimal decision about whether to apply pre-evaluation or not. Surprisingly, despite the simplicity of the form, the performance of the various code generation systems can differ significantly.

Helmholtz As in the case of the mass matrix, when q = 1 the matrix insertion phase is dominant. For q ≥ 2, the general trend is that cfO2 outperforms the competitors. In particular:

nf = 0 Pre-evaluation makes cfO2 notably faster than cfO1, especially for high values of q; auto correctly selects tens, which is comparable to cfO2.

nf = 1 auto picks tens; the choice is however sub-optimal when q = 3 and q = 4. This can indirectly be inferred from the large gap between cfO2 and tens/auto: cfO2 applies sharing elimination, but it correctly avoids pre-evaluation because of the excessive expansion cost.

nf = 2, 3 auto reverts to quad, which would theoretically be the right choice (the flop count is much lower than in tens); however, the generated code suffers from the presence of indirection arrays, which break autovectorization and “traditional” code motion.

The slow-downs (or marginal improvements) exhibited by ufls in a small number of cases can be attributed to the presence of sharing in the generated code.

An interesting additional experiment we performed was relaxing the memory threshold by setting TH = size(L3). We found that this makes cfO2 generally slower for nf ≥ 2, with a maximum slow-down of 2.16× with ⟨nf = 2, q = 2⟩. This effect could be worse when running in parallel, since the L3 cache is shared and different threads would end up competing for the same resource.

Elasticity The results for the elastic model are displayed in Figure 4.12. The main observation is that cfO2 never triggers pre-evaluation, although on some occasions it should. To clarify this, consider the test case ⟨nf = 0, q = 2⟩, in which tens/auto show a considerable speed-up over cfO2. cfO2 finds pre-evaluation profitable in terms of operation count, although it is eventually not applied to avoid exceeding TH. However, running the same experiments with TH = size(L3) resulted in a dramatic improvement, even higher than that obtained by tens. The reason is that, despite exceeding TH by roughly 40%, the saving in operation count is so large (5× in this specific problem) that pre-evaluation would in practice be the winning choice. This suggests that our objective function should be improved to handle the cases in which there is a significant gap between potential cache misses and reduction in operation count. We also note that:

• the differences between cfO2 and cfO1 are due to the perfect sharing elimination and the zero-valued blocks avoidance technique presented in Section 4.8.3.

• when nf = 1, auto prefers tens over quad, which leads to sub-optimal operation counts and execution times.

• ufls often results in better execution times than quad and tens. This is due to multiple factors, including avoidance of indirection arrays, preservation of loop structure, and a more effective code motion strategy.

Hyperelasticity In the experiments on the hyperelastic model, shown in Figure 4.13, cfO2 exhibits the largest gains of all problem instances considered in this section. This is a positive result, since it indicates that our transformation algorithm scales well with form complexity. The fact that all code transformation systems (apart from tens) show quite significant speed-ups suggests two points. First, the baseline is highly inefficient. With forms as complex as those in the hyperelastic model, a trivial

translation of integration routines into code should always be avoided, as even the best general-purpose compiler available (the Intel compiler on an Intel platform at maximum optimization level) fails to exploit the structure inherent in the expressions. Second, the strategy for removing spatial and temporal sharing has a tremendous impact. Sharing elimination as performed by cfO2 ensures a critical reduction in operation count, which becomes particularly pronounced for higher values of q.

4.10 Conclusions

We have developed a theory for the optimization of finite element integration loop nests. The chapter details the domain properties which are exploited by our approach (e.g., linearity) and how these translate to transformations at the level of loop nests. All of the algorithms shown in this chapter have been implemented in COFFEE, a publicly available compiler fully integrated with the Firedrake framework. The correctness of the transformation algorithm was discussed. The performance results achieved suggest the effectiveness of our methodology.

4.11 Limitations and Future Work

We have defined sharing elimination and pre-evaluation as high level transformations on top of a specific set of rewrite operators, such as code motion and factorization, and we have used them to construct the transformation space. There are three main limitations in this process. First, we do not have a systematic strategy to optimize sub-expressions which are independent of linear loops. Although we have a mechanism to determine how much computation should be hoisted to the level of the integration (reduction) loop, it is not clear how to effectively improve the heuristics used at step (6) in Algorithm 5. Second, lower operation counts may be found by exploiting domain-specific properties, such as redundancies in basis functions; this aspect is completely neglected in this chapter. Third, with Constraint 1 we have limited the applicability of code motion. This constraint was essential given the complexity of the problem tackled. Another issue raised by the experimentation concerns selecting a proper threshold for Constraint 2. Solving this problem would require a more sophisticated cost model, which is an interesting question deserving further research.

We also identify two additional possible research directions: a complete classification of the forms for which a global optimum is achieved; and a generalization of the methodology to other classes of loop nests, for instance those arising in spectral element methods.


Chapter 5

Cross-loop Optimization of Arithmetic Intensity for Finite Element Integration

In the previous chapter, a method to minimize the operation count of finite element integration loop nests, or “assembly kernels”, has been developed. Here, the focus is on the same class of kernels, but a complementary issue is tackled: the low level optimization of the resulting code. This chapter is a revised version of Luporini et al. [2015], with extended examples and improved results.

5.1 Recapitulation and Objectives

We know that an assembly kernel is characterized by the presence of an affine, often non-perfect loop nest, in which individual loops are rather small: their trip count rarely exceeds 30, and may be as low as 3 for low order methods. In the innermost loop, a compute intensive expression evaluates an n-dimensional array, or element tensor, which represents the result of local assembly in an element of the discretized domain. More specifically: in bilinear forms, n = 2 and the output is referred to as the element matrix; in linear forms, n = 1 and the output is referred to as the element vector. With such a kernel structure, the objective of the low level optimization is maximizing register locality and SIMD vectorization.
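To make the structure concrete, the sketch below (not produced by any form compiler; sizes, names, and the integrand are placeholders chosen purely for illustration) shows the shape of such a loop nest for a bilinear form: a quadrature (reduction) loop i encloses the fully parallel test and trial function loops j and k, which accumulate into the element matrix.

#define NQ 6   /* quadrature points (placeholder) */
#define NJ 3   /* test functions    (placeholder) */
#define NK 3   /* trial functions   (placeholder) */

/* Stand-in for the problem-specific integrand evaluation. */
static double integrand(int i, int j, int k) {
  return 0.1 * (i + 1) * (j + k + 1);
}

/* Schematic bilinear-form assembly kernel: A is the element matrix. */
void assembly_kernel_sketch(double A[NJ][NK]) {
  for (int i = 0; i < NQ; i++)        /* quadrature loop (reduction) */
    for (int j = 0; j < NJ; j++)      /* test functions */
      for (int k = 0; k < NK; k++)    /* trial functions */
        A[j][k] += integrand(i, j, k);
}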

We aim to maximize our impact on the platforms that are realistically used for finite element applications, so we target conventional CPU architectures rather than GPUs. The key limiting factor to the execution on GPUs is the stringent memory requirements. Only relatively small problems fit in a GPU memory, and support for distributed GPU execution in general purpose finite element frameworks is minimal. There has been some research on adapting local assembly to GPUs, although it differs from ours in several ways, including: (i) not relying on automated code generation from a domain-specific language, (ii) testing only very low order methods, (iii) not optimizing for cross-loop arithmetic intensity (the goal is rather effective multi-thread parallelization). In addition, our code transformations would drastically impact the GPU parallelization strategy, for example by increasing a thread’s working set. For all these reasons, extending the research to GPU architectures is beyond the scope of this work. The related work on the subject is covered in Section 5.5, while intuitions about this research direction are provided in Section 5.6.

Achieving high performance on CPUs is non-trivial. The complexity of the mathematical expressions, which we know to be often characterized by a large number of operations on constants and small vectors, makes it hard to determine a single or specific sequence of transformations that is successfully applicable to all problems. Loop trip counts are typically small and can vary significantly, which further exacerbates the issue. Moreover, the memory access pattern can be either unit-stride (A[i], A[i+1], A[i+2], ...) or multi-stride (A[i], A[i+1], A[i+N], A[i+N+1], ...) – for example, as a consequence of skipping computation on zeros, see Section 4.8.3. We will show that general-purpose compilers, such as GNU’s and Intel’s, fail to maximize the efficiency of the generated code because of this particular structure. Polyhedral-model-based source-to-source compilers, for instance Bondhugula et al. [2008], can apply aggressive loop optimizations, such as tiling, but these are not particularly helpful in our context since they mostly focus on cache locality.

As in Chapter 4, we focus on optimizing the performance of assembly kernels produced through automated code generation, so we seek transformations that are generally applicable and effective. In particular, we introduce and study the following transformations:

Padding and data alignment SIMD vectorization is more effective when the CPU registers are packed (unpacked) by means of aligned load (store) instructions. Data alignment is achieved through array padding, a conceptually simple yet powerful transformation that can result in dramatic reductions in execution time.

Vector-register tiling Tiling at the level of vector registers exploits the peculiar memory access pattern induced by finite element operators (i.e., the linearity of the test and trial function loops) to improve data locality.

Expression splitting Complex expressions are often characterized by high register pressure (i.e., the lack of available registers induces the compiler to “spill” data from registers to cache). This transformation exploits the associativity of addition to distribute, or “split”, an expression into multiple sub-expressions to be computed in separate loop nests.

In addition, we analyze the effects of “more traditional” optimizations: loop unrolling, loop interchange, loop fusion and vector promotion. Some of these are not applied by general-purpose compilers, or are applied in a sub-optimal way; in such a case, they will be applied explicitly through our compiler, COFFEE. To summarize, the contributions of this chapter are:

• The introduction of novel transformations that are demonstrated to improve the performance of assembly kernels. These transformations, implemented in COFFEE, aim to maximize register locality and SIMD vectorization.

• A study on the effectiveness of a set of well-known transformations as well as the introduction of extensions that, taking advantage of assembly kernel properties, maximize their impact.

• Extensive experimentation using a set of real-world forms commonly arising in finite element methods.

• A discussion concerning the generality of the transformations and their applicability to different domains.

5.2 Low-level Optimization

In this section, we describe three novel transformations that target register locality (expression splitting), SIMD vectorization (padding and data alignment), or both (vector-register tiling). Their ideation was directly inspired by the structure of assembly kernels. In particular, preliminary experimentation and profiling suggested that better performance could be achieved by optimizing resource usage within a CPU core. The implementation is carried out in COFFEE. Understanding the potential of these transformations in other computational domains requires a different kind of investigation, which we approach in Section 5.6.

5.2.1 Padding and Data Alignment

The absence of stencils makes it simple for a general-purpose compiler to auto-vectorize the computation of the element tensor1. Nevertheless, vectorization is sub-optimal if data are not aligned to cache-line boundaries or if the innermost loop trip count is not a multiple of the vector length VL (especially when the loops are small, as in assembly kernels). Data alignment is enforced in two steps. First, all arrays (except the element tensor, for reasons discussed shortly) are allocated to addresses that are multiples of VL. This is achieved by adding special attributes to each declaration (e.g., __attribute__((aligned(VL*8))), with 8 being the size of a value in double precision). Second, their innermost dimension is padded by rounding the size up to the nearest multiple of VL. For instance, assume the original size of a basis function array is 3 × 3 and VL = 4 (e.g., an AVX processor, with 32-byte long vector registers and 8-byte double-precision floats); then, the padded version of the array will have size 3 × 4. The compiler is explicitly informed about data alignment through suitable pragmas. For example, in the case of the Intel compiler, adding the annotation #pragma vector aligned to a loop declares that all of the memory accesses performed in its body will be aligned. This allows the compiler to generate aligned load and store instructions (e.g., movapd), which are dramatically faster than unaligned ones (e.g., movupd).

1In practice, the GCC compiler does not auto-vectorize loops if these are too small; this indicates flaws in its cost model, since vectorization, in our hand-crafted experiments, was always found to be a winning choice.
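A minimal sketch of these two steps is given below, assuming AVX (VL = 4 doubles) and following the same ALIGN/padding convention used in the listings of this chapter; the array name, sizes, and the scale_row function are illustrative only.

#define VL 4                                    /* doubles per AVX vector register */
#define ALIGN __attribute__((aligned(VL * 8)))  /* 8 = size of a double in bytes */
#define PAD(n) (((n) + VL - 1) / VL * VL)       /* round up: 3 -> 4, 6 -> 8, ... */

/* A 3x3 basis function table becomes 3x4: aligned start, padded columns. */
static const double B[3][PAD(3)] ALIGN = {{0.0}};

void scale_row(double T[PAD(3)], double k) {
  #pragma vector aligned          /* icc: all accesses in this loop are aligned */
  for (int j = 0; j < PAD(3); j++)
    T[j] = k * B[0][j];
}

The pragma asserts alignment rather than enforcing it, so the caller is expected to pass a buffer declared with the same ALIGN attribute.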

As anticipated, the element tensor requires special attention. In our computational model, the element tensor is a kernel’s input parameter, so the aforementioned transformations cannot be applied directly. We therefore create a “shadow” copy of the element tensor, padded, aligned, and initialized to 0. The shadow element tensor replaces any occurrence of the original element tensor in the kernel. Right before returning to the caller, a loop nest copies the relevant region of the shadow element tensor back into the input array. This process often allows a loop trip count to be safely rounded to the nearest multiple of VL, thus avoiding the introduction of a remainder (scalar) loop by the compiler (which would render vectorization less efficient). We distinguish two cases:

All memory accesses in the kernel are unit-stride It is trivial to verify whether the iterations introduced by rounding up the loop trip count will only write to the padded region of the element tensor. In such a case, the loop trip count can be rounded. Listing 6 illustrates the effect of padding and data alignment on top of generalized code motion in the weighted Laplace kernel which we introduced in Chapter 2, with loop trip counts properly rounded.

Offsets are used when accessing array values Verifying whether the transformation will alter the output is in this case less trivial. The intersection I between the set of extra iterations (in Listing 6, {4}) and the sets of non zero-valued columns in each array is computed. If I ≠ ∅, the loop trip count cannot be rounded. In such a case, or if any offset is not a multiple of VL, the #pragma vector aligned annotation also cannot be added.
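The following is a hedged sketch of that check (COFFEE performs it on its internal representation; the flat-array encoding used here is only for illustration): rounding is legal only when the extra iterations never touch a non-zero column of any array read in the loop body, i.e., when the intersection I is empty.

#include <stdbool.h>

/* extra[0..n_extra)           : column indices added by rounding (e.g., {4} in Listing 6)
 * nonzero[a][0..n_nonzero[a]) : non zero-valued columns of array a                        */
static bool can_round_trip_count(const int *extra, int n_extra,
                                 const int *const *nonzero,
                                 const int *n_nonzero, int n_arrays) {
  for (int a = 0; a < n_arrays; a++)
    for (int e = 0; e < n_extra; e++)
      for (int c = 0; c < n_nonzero[a]; c++)
        if (nonzero[a][c] == extra[e])
          return false;   /* I is non-empty: the extra iterations would change A */
  return true;            /* I is empty: safe to round the trip count */
}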

LISTING 6: The assembly kernel for the weighted Laplace operator in Listing 1 after application of padding and data alignment on top of generalized code motion. An AVX architecture, which implies VL = 4, is assumed.

#define ALIGN __attribute__((aligned(32)))

void weighted_laplace(double A[3][3], double **coords, double w[3]) {
  // K, det = Compute Jacobian (coords)

  // Quadrature weights
  static const double W[6] ALIGN = {0.5, ...};

  // Basis functions
  static const double B[6][4] ALIGN = {{...}};
  static const double C[6][3] ALIGN = {{...}};
  static const double D[6][4] ALIGN = {{...}};

  // Padded buffer
  double PA[3][4] ALIGN = {{0.0}};

  for (int i = 0; i<6; i++) {
    double f0 = 0.0;
    for (int r = 0; r < 3; ++r) {
      f0 += (w[r] * C[i][r]);
    }
    double T0[4] ALIGN;
    double T1[4] ALIGN;
    #pragma vector aligned
    for (int j = 0; j<4; j++) {
      T0[j] = ((K[1]*B[i][j])+(K[3]*D[i][j]));
      T1[j] = ((K[0]*B[i][j])+(K[2]*D[i][j]));
    }
    double T2[4] ALIGN;
    double T3[4] ALIGN;
    #pragma vector aligned
    for (int k = 0; k<4; k++) {
      T2[k] = ((K[1]*B[i][k])+(K[3]*D[i][k]));
      T3[k] = ((K[0]*B[i][k])+(K[2]*D[i][k]));
    }
    for (int j = 0; j<3; j++) {
      #pragma vector aligned
      for (int k = 0; k<4; k++) {
        PA[j][k] += (T0[j]*T2[k] + T1[j]*T3[k])*det*W[i]*f0;
      }
    }
  }
  // Copy back the shadow tensor to the input array
  for (int j = 0; j<3; j++) {
    for (int k = 0; k<3; k++) {
      A[j][k] = PA[j][k];
    }
  }
}

5.2.2 Expression Splitting

In complex kernels, like Burgers in Listing 2, and on certain architectures, achieving effective register allocation can be challenging. If the number of variables independent of the innermost-loop dimension is close to or greater than the number of available CPU registers, poor register reuse is likely. This usually happens when the number of basis function arrays, temporaries introduced by either generalized code motion or pre-evaluation, and problem constants is large. For example, applying code motion to the Burgers example on a 3D mesh requires 24 temporaries for the ijk loop order. This can make hoisting of the invariant loads out of the k loop inefficient on architectures with a relatively low number of registers. One potential solution to this problem consists of suitably “splitting” the computation of the element tensor A into multiple sub-expressions. An example of this idea is given in Listing 7. The transformation can be regarded as a special case of classic loop fission, in which the associativity of addition is exploited to distribute the expression across multiple loops.

LISTING 7: The assembly kernel for the weighted Laplace operator in Listing 1 after application of expression splitting on top of generalized code motion. In this example, the split factor is 2.

void weighted_laplace(double A[3][3], double **coords, double w[3]) {
  // Omitting redundant code
  ...
  for (int j = 0; j<3; j++) {
    for (int k = 0; k<3; k++) {
      A[j][k] += (T0[k]*T0[j])*det*W[i]*f0;
    }
  }
  for (int j = 0; j<3; j++) {
    for (int k = 0; k<3; k++) {
      A[j][k] += (T1[k]*T1[j])*det*W[i]*f0;
    }
  }
  ...
}

Splitting an expression has, however, several drawbacks. Firstly, it increases the number of accesses to A in proportion to the “split factor”, which is the number of sub-expressions produced. Also, depending on how splitting is done, it can lead to redundant computation. For example, the number of times the product det*W[i] is performed is proportional to the number of sub-expressions, as shown in the code snippet. Further, it increases loop overhead, for example through additional branch instructions. Finally, it might affect register locality: for instance, the same array could be accessed in different sub-expressions, requiring a proportional number of loads to be performed; this is not the case in the running example, though. Nevertheless, the performance gain from improved register reuse can still be greater if suitable heuristics are used. Our approach consists of traversing the expression tree and recursively splitting it into

multiple sub-expressions as long as the number of variables independent of the innermost loop exceeds a certain threshold. The optimal threshold has to be determined explicitly for each problem; a good heuristic consists of counting how many distinct variables need to be loaded at every loop iteration and comparing this number with the number of available registers.
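A hedged sketch of this heuristic follows; the operand count is assumed to be gathered from the expression tree, and the reserved margin is an illustrative parameter rather than COFFEE's actual setting.

/* Returns non-zero if the expression should be split further: too many
 * innermost-loop-invariant operands competing for too few registers.    */
static int exceeds_register_budget(int distinct_invariant_operands,
                                   int available_registers) {
  const int reserved = 4;   /* assumed margin for the element tensor and loop indices */
  return distinct_invariant_operands > available_registers - reserved;
}

The splitting described above would then recurse over the resulting sub-expressions for as long as this predicate holds.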

5.2.3 Model-driven Vector-register Tiling

LISTING 8: The assembly kernel for the weighted Laplace operator in Listing 1 after application of vector-register tiling (on top of generalized code motion, padding, and data alignment). In this example, the unroll-and-jam factor is 1.

void weighted_laplace(double A[3][3], double **coords, double w[3]) {
  // Omitting redundant code
  ...
  // Padded buffer (note: both rows and columns)
  double PA[4][4] ALIGN = {{0.0}};

  for (int i = 0; i<3; i++) {
    // Omitting redundant code
    // ...
    for (int j = 0; j<4; j += 4)
      for (int k = 0; k<4; k += 4) {
        // Sequence of LOAD and SET intrinsics
        // Compute PA[0][0], PA[1][1], PA[2][2], PA[3][3]
        // One _mm256_permute_pd per k-loop LOAD
        // Compute PA[0][1], PA[1][0], PA[2][3], PA[3][2]
        // One _mm256_permute2f128_pd per k-loop LOAD
        // ...
      }
    // Scalar remainder loop (not necessary in this example)
  }
  // Restore the storage layout
  for (int j = 0; j<4; j += 4) {
    for (int k = 0; k<4; k += 4) {
      __m256d r0 = _mm256_load_pd(&PA[j+0][k]);
      // LOAD PA[j+1][k], PA[j+2][k], PA[j+3][k] into r1, r2, r3
      __m256d r4 = _mm256_unpackhi_pd(r1, r0);
      __m256d r5 = _mm256_unpacklo_pd(r0, r1);
      __m256d r6 = _mm256_unpackhi_pd(r2, r3);
      __m256d r7 = _mm256_unpacklo_pd(r3, r2);
      r0 = _mm256_permute2f128_pd(r5, r7, 32);
      r1 = _mm256_permute2f128_pd(r4, r6, 32);
      r2 = _mm256_permute2f128_pd(r7, r5, 49);
      r3 = _mm256_permute2f128_pd(r6, r4, 49);
      _mm256_store_pd(&PA[j+0][k], r0);
      // STORE r1, r2, r3 into PA[j+1][k], PA[j+2][k], PA[j+3][k]
    }
  }
  ...
}

One notable problem of assembly kernels concerns register allocation

and register locality. The critical situation occurs when the loop trip counts and the variables accessed are such that the vector-register pressure is high. Since the kernel’s working set is expected to fit in the L1 cache, it is particularly important to optimize register management. Standard optimizations, such as loop interchange, unroll, and unroll-and-jam, can be employed to deal with this problem. Tiling at the level of vector registers represents another opportunity. Based on the observation that the evaluation of an element matrix can be reduced to a summation of outer products along the j and k dimensions, a model-driven vector-register tiling strategy can be implemented. The computation of an element matrix is abstractly expressible as

\[ A_{jk} = \sum_{x \in B' \subseteq B,\; y \in B'' \subseteq B} x_j \cdot y_k \qquad \forall j,k = 0,\ldots,3 \]

where B is the set of all basis functions or temporary variables accessed in the kernel, whereas B' and B'' are generic problem-dependent subsets. Regardless of the specific input problem, by abstracting from the presence of all variables independent of both j and k, the element matrix computation is always reducible to this form. Figure 5.1 illustrates how we can evaluate 16 entries (j, k = 0, ..., 3) of the element matrix using just 2 vector registers, which represent a 4 × 4 tile, assuming |B'| = |B''| = 1. Values in a register are shuffled each time a product is performed. Standard compiler auto-vectorization for both the GNU and Intel compilers, instead, executes 4 broadcast operations (i.e., a “splat” of a value over all of the register locations) along the outer dimension to perform the calculation. In addition to incurring a larger number of cache accesses, it needs to keep between f = 1 and f = 3 extra registers to perform the same 16 evaluations when unroll-and-jam is used, with f being the unroll-and-jam factor.
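The following is a minimal sketch (not COFFEE's generated code) of the register-level outer product of Figure 5.1 on AVX: with x holding four j-values, y holding four k-values, and s a broadcast scalar factor, the 16 products are accumulated into four registers whose lanes follow the rotated layout that must later be restored.

#include <immintrin.h>

/* acc[0..3] accumulate the four "rotated diagonals" of a 4x4 tile:
 * acc[0]: (0,0)(1,1)(2,2)(3,3)   acc[1]: (0,1)(1,0)(2,3)(3,2)
 * acc[2]: (0,2)(1,3)(2,0)(3,1)   acc[3]: (0,3)(1,2)(2,1)(3,0)        */
static inline void outer_product_tile(__m256d x, __m256d y, __m256d s,
                                      __m256d acc[4]) {
  acc[0] = _mm256_add_pd(acc[0], _mm256_mul_pd(s, _mm256_mul_pd(x, y)));
  y = _mm256_permute_pd(y, 0x5);             /* y1 y0 y3 y2 (in-lane swap) */
  acc[1] = _mm256_add_pd(acc[1], _mm256_mul_pd(s, _mm256_mul_pd(x, y)));
  y = _mm256_permute2f128_pd(y, y, 0x1);     /* y3 y2 y1 y0 (swap 128-bit halves) */
  acc[3] = _mm256_add_pd(acc[3], _mm256_mul_pd(s, _mm256_mul_pd(x, y)));
  y = _mm256_permute_pd(y, 0x5);             /* y2 y3 y0 y1 */
  acc[2] = _mm256_add_pd(acc[2], _mm256_mul_pd(s, _mm256_mul_pd(x, y)));
}

In the weighted Laplace kernel, for example, x and y could hold T0[j..j+3] and T2[k..k+3] and s the i-invariant factor det*W[i]*f0, so only two data registers per tile are required, in contrast with the broadcast-based strategy described above.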

The storage layout of PA, however, is incorrect after the application of this outer-product-based vectorization. It can be efficiently restored with a sequence of vector shuffles following the pattern highlighted in Figure 5.2, executed once outside of the ijk loop nest. The pseudo-code for the transformed weighted Laplace assembly kernel is shown in Listing 8.


Figure 5.1: The outer-product vectorization relies on permuting values directly within vector registers.


Figure 5.2: Restoring the storage layout is the last step of the outer-product vectorization. The figure shows how the 4 × 4 elements in the top-left block of the element matrix A can be moved to their correct positions. Each rotation, represented by a group of three same-colored arrows, is implemented by a single shuffle intrinsic.

Generalization In this section, we have considered assembly kernels arising from bilinear forms. The transformation can be generalized to linear forms provided that the loop over mesh elements is introduced in the computational model. The new dimension would induce an outer product with the test functions loop; hence, the element vector could be turned into a “multi-element” matrix, which would enable a similar vector-register tiling. This is however not implemented in COFFEE.

5.3 Experiments

The objective is to evaluate the impact of the code transformations presented in the previous sections on representative problems.

5.3.1 Setup

We use three bilinear forms, which we refer to as (i) Helmholtz, (ii) Diffusion, and (iii) Burgers (the names derive from the PDEs from which the bilinear forms were extracted). The three chosen forms lead to real-life kernels and

comprise the core differential operators in some of the most frequently encountered finite element problems in scientific computing. This is of crucial importance because distinct problems, possibly arising in completely different fields, may employ (subsets of) the same differential operators as our benchmarks, which implies similarities and redundant patterns in the generated code. Consequently, the proposed code transformations have a domain of applicability that goes far beyond that of the three analyzed equations.

The Helmholtz and Diffusion kernels are archetypal second order elliptic operators. They are complete and unsimplified examples of the operators used to model diffusion and viscosity in fluids, and for imposing pressure in incompressible fluids. As such, they are both extensively used in climate and ocean modeling. Very similar operators, for which the same optimizations are expected to be equally effective, apply to elasticity problems, which are at the base of computational structural mechanics. The Burgers kernel is a typical example of a first order hyperbolic conservation law, which occurs in real applications whenever a quantity is transported by a fluid (the momentum itself, in our case). While the two elliptic operators apply to scalar quantities, the chosen Burgers kernel applies to a vector-valued quantity; this impacts the generated code, as explained next. The operators we have selected are characteristic of both the second and first order operators that dominate fluids and solids simulations.

The benchmarks, including the UFL description, are available at [Luporini, 2014b]; for convenience, the studied bilinear operators are also summarized in Figures 5.3, 5.4, and 5.5. The benchmarks were executed over real unstructured meshes through Firedrake. The transformations are applied through COFFEE, which is used by Firedrake. The Diffusion code has already been shown in Listing 1. The Helmholtz equation uses the same differential operators as Diffusion. In the Helmholtz kernel code, the main differences with respect to Diffusion are the presence of additional arrays (the basis functions for the “mass term”) and constants for computing the element matrix. Burgers is a non-linear problem employing differential operators different from those of Helmholtz and relying on vector-valued quantities, which has a major impact on the generated assembly code (see Listing 2), where a larger number of

\[ \int_{\Omega} \nabla u \cdot \nabla v + uv \,\mathrm{d}x \tag{5.1} \]

Figure 5.3: The Helmholtz bilinear form. The trial function u and the test function v belong to the space of continuous piecewise-defined polynomials of given degree.

\[ \int_{\Omega} \nabla u \cdot \nabla v \,\mathrm{d}x \tag{5.2} \]

Figure 5.4: The Diffusion bilinear form. The trial function u and the test function v belong to the space of continuous piecewise-defined polynomials of given degree.

\[ \int_{\Omega} \frac{u^{n+1} - u^{n}}{\mathrm{d}t} \cdot v + ((u^{n+1} \cdot \nabla) u^{n+1}) \cdot v + \nu \nabla u^{n+1} \cdot \nabla v \,\mathrm{d}x \tag{5.3} \]

Figure 5.5: The Burgers operator. The solution u and the test function v belong to the space of vector-valued continuous piecewise-defined polynomials of given degree. ν is a constant scalar viscosity. A backward Euler time discretization is used.

basis function arrays (X1, X2, ...) and constants (F0, F1, ..., K0, K1, ...) are generated.

These problems were studied varying both the shape of the mesh elements and the polynomial order q of the function spaces, whereas the element family, Lagrange, is fixed. As might be expected, the larger the element shape and q, the larger the iteration space. Triangles, tetrahedra, and prisms were tested as element shapes. For instance, in the case of Helmholtz with q = 1, the size of the j and k loops for the three element shapes is, respectively, 3, 4, and 6. Moving to larger shapes has the effect of increasing the number of basis function arrays, since, intuitively, the behaviour of the equation has to be approximated also along a third axis. On the other hand, the polynomial order affects only the problem size (the three loops i, j, and k, and, as a consequence, the size of the X and Y arrays). A range of polynomial orders from q = 1 to q = 4 was tested; higher polynomial orders are excluded from the study because of current Firedrake limitations. In all these cases, the size of the element matrix rarely exceeds 30 × 30, with a peak of 105 × 105 in Burgers with prisms and q = 4.

160 5.3.2 Test Environment

The experiments were run on a single core of an Intel architecture, a Sandy Bridge I7-2600 CPU running at 3.4 GHz, with 32KB of L1 cache and 256KB of L2 cache. The icc 13.1 compiler was used. The compilation flags used were -O2 and -xAVX for auto-vectorization (other optimization levels were tried, but they resulted in higher execution times).

The execution times are the average of three runs with “warm cache”; that is, with all kernels retrieved directly from Firedrake’s cache, so code generation and compilation times are not counted. The timing includes the costs of both matrix insertion and mesh iteration.

5.3.3 Rationale of the Experimentation

We evaluate the impact of four transformations: generalized loop-invariant code motion (licm); padding and data alignment (ap); outer-product vectorization (op-vect); expression splitting (split). We also study licm, despite not considering it a real low level transformation (it is actually a rewrite operator, see Chapters 4 and 6), because measuring the impact of ap, op-vect, and split becomes realistic only when all additional loops and temporaries have been created.

Figure 5.6 shows the maximum speed-ups achieved by composing the four transformations over the original code generated by Firedrake via the FEniCS Form Compiler. The figure is a three-dimensional plot: each subplot is a specific ⟨shape, form, q⟩ problem instance, with q varying within each subplot. By referring to this figure, in the next sections we motivate the performance improvements obtained.

5.3.3.1 Impact of Generalized Loop-invariant Code Motion

In general, the speed-ups achieved by licm are notable, which is in line with what we have observed in Chapter 4. The main reasons are that in the original code, (i) sub-expressions invariant to outer loops are not automatically hoisted, while (ii) sub-expressions invariant to the innermost loop are hoisted, but their execution is not auto-vectorized. These observations derive from inspection of assembly code generated by the compiler.


Figure 5.6: Maximum performance improvement on a Sandy Bridge due to generalized loop-invariant code motion, padding and data alignment, outer-product vectorization, and expression splitting over the original code produced by Firedrake. The unroll and splitting factors – respectively for the outer-product vectorization and expression splitting – were determined empirically.

The gain tends to grow with the computational cost of the kernels: bigger loop nests (i.e., larger element shapes and polynomial orders) usually benefit from the reduction in redundant computation, even though extra memory for the temporary arrays is required. Some discrepancies from this trend are due to less effective auto-vectorization. For instance, on the Sandy Bridge, the improvement at q = 3 is larger than that at q = 4 because, in the latter case, the size of the innermost loop is not a multiple of the vector length, and a remainder scalar loop is introduced at compile time. Since the loop nest is small, the cost of executing the extra scalar iterations can have a significant impact.

162 5.3.3.2 Impact of Padding and Data Alignment

Padding, which avoids the introduction of a remainder loop as described in Section 5.2.1, as well as data alignment, enhances the quality of auto-vectorization. Occasionally, the impact of ap is marginal. This may be due to two reasons: (i) the non-padded element tensor size is already a multiple of the vector length; (ii) the number of aligned temporaries introduced by licm is so large as to induce cache associativity conflicts (e.g., Burgers).

5.3.3.3 Impact of Vector-register Tiling

op-vect requires setting an unroll-and-jam factor. Here, we discuss the speed-ups obtained after a small set of feasible unroll-and-jam factors (between 1 and 4) were tried (i.e., the maximum speed-ups).

The rationale behind these results is that the effect of op-vect is significant in problems in which the assembly loop nest is relatively big. When the loops are short, since the number of arrays accessed at every loop iteration is rather small (between 4 and 8 temporaries, plus the element matrix itself), there is no need for vector-register tiling; extensive unrolling is sufficient to improve register re-use and, therefore, to maximize the performance. However, as the iteration space becomes larger, op-vect leads to improvements of up to 1.4× (Diffusion, prismatic mesh, q = 4 – increasing the overall speed-up from 2.69× to 3.87×).

Using the Intel Architecture Code Analyzer tool [Intel Corporation, 2012], we confirmed that the speed-ups are a consequence of increased register re-use. In Helmholtz with q = 4, for example, the tool showed that when using op-vect the number of clock cycles needed to execute one iteration of the j loop decreases by roughly 17%, and that this is a result of the relieved pressure on both of the data (cache) ports available in the core.

The performance of individual kernels in terms of floating-point operations per second was also measured. The theoretical peak on a single core, with the Intel Turbo Boost technology activated, is 30.4 GFlop/s. In the case of Diffusion using a prismatic mesh and q = 4, we achieved a maximum of 21.9 GFlop/s with op-vect enabled, whereas 16.4 GFlop/s was obtained when only licm and ap were used. This result is in line with the expectations: analysis of assembly code showed that, in the jk loop

nest, which in this problem represents the bulk of the computation, 73% of the instructions are actually floating-point operations.

Application of op-vect to the Burgers problem induces significant slow-downs due to the large number of temporary arrays that need to be tiled, which exceeds the number of logical registers available on the underlying architecture. Expression splitting can be used in combination with op-vect to alleviate this issue; this is discussed in the next section.

5.3.3.4 Impact of Expression Splitting

Expression splitting relieves the register pressure when the element matrix evaluation needs to read from a large number of basis function arrays. As detailed in Section 5.2.2, the price to pay for this optimization is an increased number of accesses to the element matrix and, potentially, redundant computation.

For the Helmholtz and Diffusion kernels, in which only between 4 and 8 temporaries are read at every loop iteration, split tends to slow down the computation, because of the aforementioned drawbacks. Slow-downs of up to 1.4× were observed.

In the Burgers kernels, between 12 and 24 temporaries are accessed at every loop iteration, so split plays a key role, since the number of logical registers available on the Sandy Bridge architecture is only 16. In almost all cases, a split factor of 1, meaning that the original expression was divided into two parts, ensured close-to-peak performance. The transformation negligibly affected register locality, so speed-ups of up to 1.5× were observed. For instance, when q = 4 and a prismatic mesh is employed, the overall performance improvement increases from 1.44× to 2.11×.

The performance of the Burgers kernel on a prismatic mesh was 20.0 GFlop/s from q = 1 to q = 3, while it was 21.3 GFlop/s in the case of q = 4. These values are notably close to the peak performance of 30.4 GFlop/s. Disabling split makes the performance drop to 17.0 GFlop/s for q = 1, 2, 18.2 GFlop/s for q = 3, and 14.3 GFlop/s for q = 4. These values are in line with the speed-ups shown in Figure 5.6.

The split transformation was also tried in combination with op-vect (split-op-vect). Despite improvements of up to 1.22×, split-op-vect never outperforms split. This is motivated by two factors: for small split factors, such

as 1 and 2, the data space to be tiled is still too big, and register spilling affects run-time; for higher ones, sub-expressions become so small that, as explained in Section 5.3.3.3, extensive unrolling already allows a certain degree of register re-use to be achieved.

5.3.3.5 More Registers, Larger Vectors

To assess the impact of the transformations on architectures with different numbers of vector registers and SIMD lanes, we have repeated the same experiments on an Intel Xeon Phi (5110P, running at 1.05 GHz in native mode, with 32KB of L1 cache and 512KB of L2 cache). The icc 13.1 compiler was used. The compilation flag used for optimization was -O3.

In contrast to the Sandy Bridge, which only has 16 256-bit vector registers, the Xeon Phi has 32 512-bit vector registers. Since our transformations focus on register locality and SIMD vectorization, it is useful to study such a different platform. The results of the experimentation are presented in Figure 5.7.

Overall, the trend is similar to that observed on the Sandy Bridge, with a few exceptions. licm is occasionally inconsequential because the remainder loop overhead is more pronounced on the Xeon Phi, where the vector length is twice as long, which leads to proportionately larger fractions of scalar code. The pattern of ap and op-vect, instead, is much more similar to the one on the Sandy Bridge (higher speed-ups were often observed). The impact of split in Burgers is not as pronounced as on the Sandy Bridge, since register spilling is now limited by the presence of 32 logical vector registers.

5.3.3.6 Summary of Results

We have analyzed the impact of three low level optimizations on top of licm. While ap is demonstrated to provide systematic improvements, the impact of op-vect and split is more difficult to predict. op-vect requires relatively large loop nests (the trip count of individual loops must be at least larger than 10) to be effective, as observed in Helmholtz and Diffusion. The largest gains in Burgers derive from using split, because of the high register pressure. Composing op-vect and split in Burgers rarely provided benefits, and most of these were significantly smaller than those achieved


through ap. This was verified on both the Sandy Bridge and the Xeon Phi. It is unclear why, on the Xeon Phi, in spite of a larger number of vector registers, the composite split-op-vect transformation did not result in significant performance improvements. These results suggest that a cost model or an auto-tuning system to select the optimal composite transformation for a given problem would be of great help. We expand on this problem in Section 5.5.

Figure 5.7: Maximum performance improvement on a Xeon Phi due to generalized loop-invariant code motion, padding and data alignment, outer-product vectorization, and expression splitting over the original code produced by Firedrake. The unroll and splitting factors – respectively for the outer-product vectorization and expression splitting – were determined empirically.

5.4 Experience with Traditional Compiler Optimizations

In this section, we report on experimentation with traditional optimizations: loop interchange, loop unrolling, vector promotion and loop fusion.

The following sections also discuss the implementation and the use of these transformations in COFFEE.

5.4.1 Loop Interchange

All loops are interchangeable, provided that temporaries are introduced if the nest is not perfect. For the employed storage layout, the loop permutations ijk and ikj are likely to maximize performance. Conceptually, this is motivated by the fact that if the i loop were in an inner position, then a significantly higher number of load instructions would be required at every iteration. We tested this hypothesis on manually crafted kernels. We found that the performance loss is greater than the gain due to the possibility of accumulating increments in a register, rather than in memory, along the i loop. The choice between ijk and ikj depends on the number of load instructions that can be hoisted out of the innermost dimension. A good heuristic is to choose as outermost the loop along which the number of invariant loads is smaller, so that more registers remain available to carry out the computation of the element tensor. Our experience with the Intel and GNU compilers is contrasting: on one hand, the former applies this transformation following a reasonable cost model, while the latter is generally too conservative, even at the highest optimization level. This behaviour was verified on different variational forms (by looking at assembly code and compiler reports), including the complex hyperelastic model analyzed in Chapter 4.

Loop interchange is implemented in COFFEE. However, it is not applied by any of the default optimization levels (see Section 6.2) due to the lack of a cost model capable of providing systematic performance improvements across a range of problems and general-purpose compilers.
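As a hedged illustration of the heuristic just described (the enum and the counts are placeholders, not COFFEE's API), the choice reduces to a simple comparison:

typedef enum { ORDER_IJK, ORDER_IKJ } loop_order_t;

/* Following the heuristic above: the loop with fewer invariant loads is
 * placed outermost (right after i), leaving more registers available for
 * the computation of the element tensor.                                 */
static loop_order_t choose_loop_order(int invariant_loads_j,
                                      int invariant_loads_k) {
  return (invariant_loads_j <= invariant_loads_k) ? ORDER_IJK : ORDER_IKJ;
}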

5.4.2 Loop Unrolling

We first observe that manual full (or extensive) unrolling is unlikely to be effective, for two reasons. Firstly, the ijk loop nest would need to be small enough that the unrolled instructions do not exceed the instruction cache, which is rarely the case: it is true that in a local assembly kernel the minimum size of the ijk loop nest is 3 × 3 × 3 (triangular mesh and polynomial order 1), but this increases rapidly with the polynomial

order of the method and the discretization employed (e.g., tetrahedral meshes imply larger loop nests than triangular ones), so sizes greater than 10 × 10 × 10, for which extensive unrolling would already be harmful, are in practice very common. Secondly, manual unrolling is dangerous because it may compromise compiler auto-vectorization by either removing loops (most compilers search for vectorizable loops) or losing spatial locality within a vector register.

By comparison with implementations with manually-unrolled loops, we noticed that recent versions of compilers like GNU’s and Intel’s estimate close-to-optimal unroll factors when the loops are affine and their bounds are relatively small and known at compile time, which is the case for our kernels. Our choice, therefore, is to relieve COFFEE from the duty of applying loop unrolling and to leave the back-end compiler in charge of selecting unroll factors.

5.4.3 Vector promotion

Vector promotion is a transformation that “trades” a parallel dimension for space, thus creating SIMD vectorization opportunities for outer loops. In this section, we consider vector promotion for the sub-expressions that only depend on the integration loop (i.e., the i loop). In Listing 9, the evaluation of the coefficient w on the basis functions in C is made vectorizable along the i loop by “promoting” the scalar f0 to an array of size 6. All sub-expressions hoisted at the level of the i loop (which result from applying the techniques described in Chapter 4) can be treated similarly. The cost of this transformation obviously depends on the amount of computation within the i loop: the larger the number of scalars to be turned into arrays, the bigger the resulting working set, which may lead to the same memory issues explained in Section 4.5. Loop tiling may be used to counteract this effect, but the implementation complexity would be significantly higher.

Neither the GNU nor the Intel compiler seems capable of applying this transformation, probably because of its non-trivial impact on the working set. In fact, it is challenging to propose a cost model that works well even just for the class of assembly kernels considered in this chapter. Vector promotion is implemented in COFFEE, but unused by any of the default optimization pipelines (see also Section 6.2).

LISTING 9: The assembly kernel for the weighted Laplace operator in Listing 1 after application of vector promotion on top of generalized code motion.

void weighted_laplace(double A[3][3], double **coords, double w[3]) {
  // Omitting redundant code
  ...
  // Application of vector promotion
  double f0[6] = {0.0};
  for (int i = 0; i<6; i++) {
    for (int r = 0; r < 3; ++r) {
      f0[i] += (w[r] * C[i][r]);
    }
  }
  for (int i = 0; i<6; i++) {
    double T0[3];
    double T1[3];
    for (int j = 0; j<3; j++) {
      T0[j] = ((K[1]*B[i][j])+(K[3]*D[i][j]));
      T1[j] = ((K[0]*B[i][j])+(K[2]*D[i][j]));
    }
    double T2[3];
    double T3[3];
    for (int k = 0; k<3; k++) {
      T2[k] = ((K[1]*B[i][k])+(K[3]*D[i][k]));
      T3[k] = ((K[0]*B[i][k])+(K[2]*D[i][k]));
    }
    for (int j = 0; j<3; j++) {
      for (int k = 0; k<3; k++) {
        A[j][k] += (T0[j]*T2[k] + T1[j]*T3[k])*det*W[i]*f0[i];
      }
    }
  }
}

5.4.4 Loop Fusion

In assembly kernels arising from bilinear forms, the test and trial functions may belong to the same function space. Further, the same subset of operators is sometimes applied to both sets of functions. In such a case, not only do the j and k loops have the same trip count, but sub-expressions that differ only in the j and k array indices also appear. We refer to these as “almost common sub-expressions”. The clone loops created by generalized loop-invariant code motion can easily be fused (they have the same size and no data dependencies are present) and the redundant computation induced by the almost common sub-expressions avoided. The result of this transformation can be observed by comparing Listing 10 to some of the previous code snippets, for instance Listing 6.

LISTING 10: The assembly kernel for the weighted Laplace operator in Listing 1 after application of loop fusion on top of generalized code motion.

void weighted_laplace(double A[3][3], double **coords, double w[3]) {
  // Omitting redundant code
  ...
  for (int i = 0; i<6; i++) {
    double f0 = 0.0;
    for (int r = 0; r < 3; ++r) {
      f0 += (w[r] * C[i][r]);
    }
    double T0[3];
    double T1[3];
    // Fused loop with almost common sub-expressions detected
    for (int r = 0; r<3; r++) {
      T0[r] = ((K[1]*B[i][r])+(K[3]*D[i][r]));
      T1[r] = ((K[0]*B[i][r])+(K[2]*D[i][r]));
    }
    for (int j = 0; j<3; j++) {
      for (int k = 0; k<3; k++) {
        A[j][k] += (T0[j]*T0[k] + T1[j]*T1[k])*det*W[i]*f0;
      }
    }
  }
}

Although loop fusion is applicable by most general-purpose compilers, the detection of almost common sub-expressions is not. There are several possible explanations for this, including: the transformation is of a domain-specific nature; common sub-expression elimination occurs at an earlier stage in the optimization pipeline; common sub-expression elimination cannot distinguish different loop indices (i.e., it cannot detect almost common sub-expressions). We have therefore implemented this special version of loop fusion in COFFEE, with satisfactory results. In our experiments, this transformation results in relatively small performance improvements, ranging between 2% and 7%. The speed-ups, however, are systematic across different forms, discretizations and general-purpose compilers, so loop fusion is performed automatically in the optimization process. This is elaborated on in Chapter 6.

5.5 Related Work

The code transformations presented in Section 5.2 are inspired by standard compiler optimizations and by the structure of the assembly kernels. Expression splitting is an abstract variant of loop fission based on properties of arithmetic operators. The outer-product vectorization is an implementation of tiling at the level of vector registers; tiling, or “loop blocking”, is commonly used to improve data locality (especially for caches). Padding has been used to achieve data alignment and to improve the effectiveness of vectorization; the idea of changing the storage layout by adding “ghost regions” is, however, widespread. A standard reference for the compilation techniques re-adapted in this work is Aho et al. [2007].

Our compiler-based optimization approach is made possible by the top-level DSL, which enables automated code generation. DSLs have been proven successful in auto-generating optimized code for other domains: Spiral [Püschel et al., 2005] for linear transforms in digital signal processing, [Spampinato and Püschel, 2014] for dense linear algebra, or Pochoir [Tang et al., 2011] and SDSL [Henretty et al., 2013] for image processing and finite difference stencils. Similarly, PyOP2 is used by Firedrake to express iteration over unstructured meshes in scientific codes. COFFEE improves automated code generation in Firedrake.

In Markall et al. [2010], the optimization of global assembly, rather than local assembly, has been treated for both CPUs and GPUs. Knepley and Terrel [2013] focus on improving the performance of assembly kernels on GPUs through efficient parallelization. In Krużel and Banaś [2013], variants of the standard numerical integration algorithm have been specialized and evaluated for the PowerXCell processor. A standard reference for the optimization of both local and global assembly on GPUs is [Cecka et al., 2011]. All these studies, despite tackling low level optimization, lack an exhaustive study from the compiler viewpoint; in addition, none of them relies on automated code generation. The optimizations presented in Section 5.2 are also never mentioned.

COFFEE would benefit from a cost model or an auto-tuning system to select, for a given problem, the best combination of low level optimizations. This would relieve users from the burden of performance tuning. A cost model was introduced in Luporini et al. [2015], but in light of the achievements in Chapter 4, it now requires refinements to be effective. Many code generators, like those based on the Polyhedral model [Bondhugula et al., 2008] and those driven by domain knowledge [Stock et al., 2011], make use of cost models. The alternative of using auto-tuning has been adopted by nek5000 [Shin et al., 2010] for small matrix-matrix multiplies, the ATLAS library [Whaley and Dongarra, 1998], FFTW [Frigo and

Johnson, 2005] for fast Fourier transforms, Spiral [Püschel et al., 2005], and LGen [Spampinato and Püschel, 2014].

5.6 Applicability to Other Domains

We have demonstrated that our cross-loop optimizations for arithmetic intensity are effective in the context of automated code generation for finite element integration. In this section, we discuss their applicability in other computational domains and, more generally, their integrability within a general-purpose compiler. There are neither conceptual nor technical reasons that prevent our transformations from being used in other (general-purpose, research, ...) compilers. It is challenging, however, to assess their potential in other computational domains. We here provide insights into this matter. The starting point of our work was the mathematical formulation of a finite element operator, expressible as:

\[ A_{ij} = \sum_{q=1}^{n_1} \sum_{k=1}^{n_2} \alpha_{k,q}(a', b', c', \ldots)\, \beta_{q,i,j}(a, b, c, d, \ldots)\, \gamma_q(w_K, z_K) \qquad \forall K,\ \forall i,j \tag{5.4} \]

The expression represents the numerical evaluation of an integral at n1 points in the mesh element K computing the local element tensor A. Functions α, β and γ are problem-specific and can be intricately complex, involving for example the evaluation of derivatives. We can however abstract from the structure of α, β and γ to highlight a number of aspects:

Optimizing mathematical expressions Expression manipulation (e.g., simplification, decomposition into sub-expressions) opens multiple semantically equivalent code generation opportunities, characterized by different trade-offs in parallelism, redundant computation, and data locality. The basic idea is to exploit properties of arithmetic operators, such as associativity and commutativity, to re-schedule the computation suitably for the underlying architecture. Loop-invariant code motion and expression splitting follow this principle, so they can be re-adapted or extended to any domain involving the numerical evaluation of complex mathematical expressions (e.g., electronic structure calculations in physics and quantum chemistry

relying on tensor contractions, which we reviewed in Section 2.4.1). In this context, we highlight three points.

1. In (5.4), the summations correspond to reduction loops, whereas the loops over indices i and j are fully parallel. In our computational model, a kernel is executed by a single thread, which is likely to be the best strategy for standard multi-core CPUs. On the other hand, we note that for certain architectures (for example GPUs) this could be prohibitive due to memory requirements. Intra-kernel parallelization is one possible solution: a domain-specific compiler could map mathematical quantifiers and operators to different parallelization schemes and generate distinct variants of multi-threaded kernel code. Based on our experience, we believe this approach is necessary for achieving performance portability.

2. The various sub-expressions in β only depend on (i.e. iterate along) a subset of the enclosing loops. In addition, some of these sub-expressions might reduce to the same values when iterating along certain iteration spaces. This code structure motivated the generalized loop-invariant code motion technique; a sketch of the transformation is shown after this list. The intuition is that whenever sub-expressions invariant with respect to different sets of affine loops can be identified, the question arises of whether, where and how to hoist them while minimizing redundant computation. Vector promotion also increases memory requirements due to the need for temporary arrays, so it is possible that for certain architectures the transformation could actually cause slow-downs (e.g. whenever the available per-core memory is small).

3. Associative arithmetic operators are the prerequisite for expression splitting. In essence, this transformation concerns resource-aware execution. In our context, expression splitting has been applied to relieve register pressure. However, the underlying idea of re-scheduling (re-associating) operations to optimize for some generic parameter is far more general. It could be used, for example, as a starting point to perform kernel fission; that is, splitting a kernel into multiple parts, each characterized by less stringent memory requirements (a variant of this idea for non-affine loops in unstructured mesh applications has been adopted in Bertolli et al. [2013]). In (5.4), for instance, not only can any of the functions α, β and γ be split (assuming they include associative operators), but α could be completely extracted and evaluated in a separate kernel. This would reduce the working set size of each of the kernel functions, an option which is particularly attractive for many-core architectures in which the available per-core memory is much smaller than that of traditional CPUs.
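To fix ideas, the following Python sketch shows the effect of the generalized loop-invariant code motion mentioned in point 2 on a kernel shaped like (5.4). The loop structure, array names and sizes are illustrative assumptions rather than code produced by COFFEE; the point is only that the factor alpha[k][q] * gamma[q], which is invariant in i and j, can be hoisted into a temporary evaluated once per (q, k) instead of once per (q, k, i, j), roughly halving the multiplications in this toy case at the cost of a small temporary array.

    import random

    n1, n2, nd = 4, 3, 3   # quadrature points, k-range, local tensor extent (assumed)
    alpha = [[random.random() for _ in range(n1)] for _ in range(n2)]
    beta = [[[random.random() for _ in range(nd)] for _ in range(nd)] for _ in range(n1)]
    gamma = [random.random() for _ in range(n1)]

    # Naive schedule: the (i, j)-invariant product is recomputed in the innermost loops.
    A = [[0.0] * nd for _ in range(nd)]
    for q in range(n1):
        for k in range(n2):
            for i in range(nd):
                for j in range(nd):
                    A[i][j] += alpha[k][q] * beta[q][i][j] * gamma[q]

    # After generalized code motion: the invariant factor is hoisted into a temporary.
    t = [[alpha[k][q] * gamma[q] for q in range(n1)] for k in range(n2)]
    B = [[0.0] * nd for _ in range(nd)]
    for q in range(n1):
        for k in range(n2):
            for i in range(nd):
                for j in range(nd):
                    B[i][j] += t[k][q] * beta[q][i][j]

    assert all(abs(A[i][j] - B[i][j]) < 1e-12 for i in range(nd) for j in range(nd))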

Code generation and applicability of the transformations All array sizes and loop bounds, for example n1 and n2 in (5.4), are known at code generation time. This means that “good” code can be generated: loop bounds can be made explicit, arrays can be statically initialized, and pointer aliasing is easily avoidable. Further, all of these factors contribute to the applicability and the effectiveness of some of our code transformations. For instance, knowing the loop bounds allows both the generation of correct code when applying vector-register tiling and the discovery of redundant computation opportunities. Padding and data alignment are special cases, since they could be performed at run time if some values were not known at code generation time. In theory, they could also be automated by a general-purpose compiler through profile-guided optimization, provided that some form of data-flow analysis ensures that the extra loop iterations over the padded region do not affect the numerical results. A sketch of this zero-padding argument follows.
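The padding argument can be made concrete with a minimal NumPy sketch, under assumed sizes and a vector length of 4; the names and numbers are illustrative, not COFFEE’s generated code. Because the ghost region is zero-filled, the extra iterations over the padded extent contribute nothing to the element tensor.

    import numpy as np

    VL = 4                      # assumed vector length (doubles per SIMD register)
    nq, ndofs = 6, 6            # quadrature points, local degrees of freedom (assumed)
    rng = np.random.default_rng(0)
    basis = rng.random((nq, ndofs))
    weights = rng.random(nq)

    # Unpadded element tensor: A[i, j] = sum_q w[q] * phi[q, i] * phi[q, j]
    A = np.einsum('q,qi,qj->ij', weights, basis, basis)

    # Pad the dof dimension up to a multiple of VL with zeros ("ghost region").
    padded = ((ndofs + VL - 1) // VL) * VL
    basis_p = np.zeros((nq, padded))
    basis_p[:, :ndofs] = basis

    # The padded computation runs over the full padded extent (as vectorized code
    # would), yet the top-left ndofs x ndofs block of the result is unchanged.
    A_p = np.einsum('q,qi,qj->ij', weights, basis_p, basis_p)
    assert np.allclose(A, A_p[:ndofs, :ndofs])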

Multi-loop vectorization Compiler auto-vectorization has become increasingly effective on a variety of codes. However, multi-loop vectorization, in which data are loaded and stored along a subset of the loops characterizing the iteration space (rather than just along the innermost loop), is not supported by available general-purpose compilers. The outer-product vectorization technique presented in this chapter shows that two-loop vectorization can outperform standard auto-vectorization. In addition, we expect the performance gain to scale with the number of vectorized loops and with the vector length (as demonstrated in the Xeon Phi experiments). Although the automation of multi-loop vectorization in a general-purpose compiler is far from straightforward, especially if stencils are present, we believe that it could more easily be achieved in specific domains. The intuition is to map the memory access pattern onto vector registers and then to exploit in-register shuffling to minimize the traffic between memory and processor. By demonstrating the effectiveness of multi-loop vectorization in a real scenario, our research is an incentive towards a systematic study of this technique. A small sketch of the outer-product idea follows.
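The following NumPy sketch illustrates, again under assumed sizes, the outer-product structure that two-loop vectorization exploits: at each quadrature point the rank-1 update np.outer(f1[q], f2[q]) touches a whole block of the element tensor at once, which mirrors the access pattern that a register-blocked, two-loop vectorized kernel materializes in SIMD registers. This is a conceptual sketch, not the vectorized C code generated in this chapter.

    import numpy as np

    nq, ndofs = 6, 8
    rng = np.random.default_rng(1)
    f1 = rng.random((nq, ndofs))
    f2 = rng.random((nq, ndofs))

    # Scalar reference: innermost-loop (j) vectorization is all a general-purpose
    # compiler would typically attempt on the equivalent C loop nest.
    A = np.zeros((ndofs, ndofs))
    for q in range(nq):
        for i in range(ndofs):
            for j in range(ndofs):
                A[i, j] += f1[q, i] * f2[q, j]

    # Outer-product formulation: each update is a rank-1 block update over
    # both i and j, mirroring vectorization across two loops.
    B = np.zeros((ndofs, ndofs))
    for q in range(nq):
        B += np.outer(f1[q], f2[q])

    assert np.allclose(A, B)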

5.7 Conclusions

In this chapter, we have presented a study and systematic performance evaluation of a class of composable cross-loop optimizations for improving arithmetic intensity in finite element local assembly kernels. In the context of automated code generation for finite element local assembly, COFFEE is the first compiler capable of introducing low-level optimizations to simultaneously maximize register locality and SIMD vectorization. Assembly kernels have particular computational characteristics. Their iteration space is usually very small, with the size depending on aspects like the degree of accuracy one wants to reach (the polynomial order of the method) and the mesh discretization employed. The data space, in terms of the number of arrays and scalars required to evaluate the element tensor, grows proportionally with the complexity of the finite element problem. The various optimizations overcome limitations of current vendor and research compilers. The exploitation of domain knowledge allows some of them to be particularly effective, as demonstrated by our experiments on a state-of-the-art Intel platform. The generality and the applicability of the proposed code transformations to other domains have also been discussed.


Chapter 6

COFFEE: a Compiler for Fast Expression Evaluation

Sharing elimination and pre-evaluation, presented in Chapter 4, as well as the low level optimizations in Chapter 5, have been implemented in COFFEE¹, a high-level compiler integrated with Firedrake. In this chapter, the conception, architecture and interface of COFFEE are described. The codebase, which comprises more than 4000 lines of Python, is available at [Luporini, 2014a].

6.1 Overview

We recall from Section 2.2.1 that Firedrake users employ the Unified Form Language to express problems in a mathematical syntax. At run-time, the high-level specification is translated by a form compiler, the Two-Stage Form Compiler (TSFC) [Homolya and Mitchell, 2016], into one or more abstract syntax trees (ASTs) representing assembly kernels². The ASTs are then passed to COFFEE for optimization. The output of COFFEE, C code, is eventually provided to PyOP2 [Markall et al., 2013], where just-in-time compilation and execution over the computational mesh take place. The structure of this tool-chain is outlined in Figure 6.1, along with an overview of the compilation pipeline in COFFEE.

¹COFFEE is the acronym for COmpiler For Fast Expression Evaluation.
²TSFC has recently replaced FFC, the FEniCS Form Compiler, which has been used for performance evaluation in the previous chapters.

[Figure 6.1 sketches the tool-chain: a UFL specification is analysed by TSFC, which produces a kernel AST; within COFFEE, the AST undergoes expression rewriting, sparsity optimization and code specialization; code generation then emits the kernel code consumed by PyOP2.]

Figure 6.1: The COFFEE pipeline and its interaction with Firedrake.

6.2 The Compilation Pipeline

In common with general-purpose compilers, COFFEE provides multiple optimization levels: O0, O1, O2 and O3. Apart from O0, which essentially leaves the ASTs unchanged (useful for debugging), all optimization levels apply ordered sequences of transformations. In essence, the higher the optimization level, the more aggressive (and potentially slower) the AST processing. In the following, when describing aspects of the optimization process common to O1, O2 and O3, we will use the generic notation Ox (x ∈ {1, 2, 3}).
The optimization level Ox can logically be split into three stages:

Expression rewriting Any transformation changing the structure of an expression. These are sharing elimination and pre-evaluation or, more generally, any of the rewrite operators on top of which such higher level transformations are expressed (e.g., generalized code motion, factorization; the full list is provided in Section 6.4).

Sparsity Optimization The iteration spaces are restructured to skip useless operations, such as those involving blocks of zeros in basis function tables. In particular, loops may be fissioned, their bounds changed, and offsets introduced for accessing non-zero array values. More details are presented in Section 4.8.3.

Code Specialization A combination of the low level transformations presented in Chapter 5 is applied.

These three stages are totally ordered: it would be useless to apply any of the low level transformations until all loops and temporaries have been created. The optimization process Ox represents the core phase in COFFEE. Overall, the compilation pipeline consists of three phases:

Analysis During the analysis phase, an AST is visited and information is collected. In particular, candidates for expression rewriting are searched for. These are represented as special nodes in the AST, which we refer to as “expression nodes”. In plain C, we could think of an expression node as a statement preceded by a directive such as #pragma coffee; the purpose of the directive would be to trigger COFFEE’s Ox, similarly to the way loops are parallelized through OpenMP. If no expression nodes are found, we jump directly to the last phase (i.e., code generation).

Optimization (Ox or user-provided) In addition to Ox, users can craft their own optimization pipelines by composing the individual transformations available in COFFEE. The compiler checks that the provided transformation sequence is legal before processing the AST. Specifically, for Ox:

O1 Expression rewriting reduces to generalized code motion, while, amongst the low level optimizations, only padding and data alignment as well as loop fusion are applied.

O2 With respect to O1, there is only one, yet fundamental, change: expression rewriting now performs the whole sharing elimination procedure (i.e., Algorithm 5).

O3 Algorithm 7, which coordinates sharing elimination and pre-evaluation, is executed. This is followed by sparsity optimization, loop fusion, and padding and data alignment.

Code generation All optimizations have been applied, so a string representation of the transformed AST is produced and returned.

6.3 Plugging COFFEE into Firedrake

In this section, we explain how COFFEE interacts with the Firedrake ecosystem.

6.3.1 Abstract Syntax Trees

We start by highlighting some peculiarities of the hierarchy of AST nodes.

Special nodes Some nodes have special semantics. A first example is the expression node described in the previous section. Furthermore, a whole sub-hierarchy of LinAlg nodes is available, with objects such as Invert and Determinant representing basic linear algebra operations. Code generation for these objects can be specialized based upon the underlying architecture and the size of the involved tensors. For instance, a manually-optimized loop nest may be preferred to a BLAS function when the tensors are small³. Yet another special type of node is ArrayInit, used for static initialization of arrays. An ArrayInit wraps an N-dimensional NumPy array [van der Walt et al., 2011] and provides a simple interface to obtain information useful for sparsity optimization, such as the sparsity pattern of the array.
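A minimal sketch of what such a wrapper might look like is given below; the class is an illustrative stand-in, not COFFEE’s actual ArrayInit interface, and only shows the idea of deriving a simple sparsity pattern (here, the extent of non-zero columns per row) from a wrapped NumPy array.

    import numpy as np

    class ArrayInitSketch:
        """Illustrative wrapper: a statically initialized array plus its sparsity."""
        def __init__(self, values):
            self.values = np.asarray(values)

        def nonzero_column_blocks(self):
            """Return, for each row, the (start, end) extent of non-zero columns."""
            blocks = []
            for row in self.values:
                nz = np.flatnonzero(row)
                blocks.append((int(nz[0]), int(nz[-1]) + 1) if nz.size else (0, 0))
            return blocks

    table = ArrayInitSketch([[0.0, 0.0, 1.0, 2.0],
                             [0.0, 0.0, 3.0, 4.0]])
    print(table.nonzero_column_blocks())   # [(2, 4), (2, 4)]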

Symbols A Symbol represents a variable in the code. The rank of a Symbol captures the dimensionality of the variable, with a rank equal to N indicating that the variable is an N-dimensional array (N = 0 implies that the variable is a scalar). The rank is implemented as an N-tuple, each entry being either an integer or a string representing a loop dimension. The offset of a Symbol is again an N-tuple where each element is a 2-tuple. For each entry r in the rank, there is a corresponding entry in the offset. Rank and offset are used as in Figure 6.2 to access specific memory locations. By clearly separating the rank and the offset of a Symbol –

³It is well known that BLAS libraries are highly optimized for large tensors, while their performance tends to be sub-optimal for small tensors, which are very common in assembly kernels.

[Figure 6.2 depicts the C assignment M[i][j] = A[i][j] * B[i][2*k+1] as a COFFEE AST: an Assign node whose left-hand side is the Symbol M, with rank [i, j] and empty offsets, and whose right-hand side is a Prod of the Symbols A and B; B carries the offset (scale=2, stride=1) on the entry corresponding to the 2*k+1 access.]

Figure 6.2: AST representation of a C assignment in COFFEE.

rather than storing a generic expression – the data dependency analysis required by the rewrite operators is greatly simplified. The underlying assumption is that all symbols’ access functions (Section 2.5) are affine in the loop indices. This is definitely the case for the class of kernels in which we are interested.

Building an AST Rather than using a parser, COFFEE exposes the whole hierarchy of nodes so that ASTs can be built explicitly. This is because the compiler is meant to be used as an intermediate representation in a multilayer framework based on DSLs. To ease the construction of ASTs (especially nested loops), a set of utility functions is provided.
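The sketch below mimics this style of explicit AST construction for the statement of Figure 6.2. The node classes are deliberately self-contained stand-ins, not COFFEE’s actual classes or constructor signatures; they only illustrate how a rank of loop indices plus a per-dimension (scale, stride) offset suffice to regenerate a C array access.

    class Symbol:
        def __init__(self, name, rank, offset=None):
            self.name, self.rank = name, rank
            # One (scale, stride) pair per rank entry; (1, 0) means a plain access.
            self.offset = offset or [(1, 0)] * len(rank)

        def gencode(self):
            dims = []
            for r, (scale, stride) in zip(self.rank, self.offset):
                dim = r if scale == 1 else "%d*%s" % (scale, r)
                dims.append(dim if stride == 0 else "%s+%d" % (dim, stride))
            return self.name + "".join("[%s]" % d for d in dims)

    class Prod:
        def __init__(self, left, right):
            self.left, self.right = left, right

        def gencode(self):
            return "%s * %s" % (self.left.gencode(), self.right.gencode())

    class Assign:
        def __init__(self, lvalue, rvalue):
            self.lvalue, self.rvalue = lvalue, rvalue

        def gencode(self):
            return "%s = %s;" % (self.lvalue.gencode(), self.rvalue.gencode())

    stmt = Assign(Symbol("M", ["i", "j"]),
                  Prod(Symbol("A", ["i", "j"]),
                       Symbol("B", ["i", "k"], [(1, 0), (2, 1)])))
    print(stmt.gencode())   # M[i][j] = A[i][j] * B[i][2*k+1];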

6.3.2 Integration with Form Compilers

COFFEE has been integrated with two form compilers: the FEniCS Form Compiler (FFC) and the Two-Stage Form Compiler (TSFC). These form compilers represent assembly kernels using their own internal language. The objective is to turn such a representation into an AST suitable for COFFEE.

FFC and COFFEE FFC was developed in the context of the FEniCS framework. In the early stages of this thesis, FFC’s intermediate representation was modified to construct COFFEE ASTs, rather than C code. The following contributions were made:

• The mathematical expression evaluating the element tensor is represented in FFC as a tree data structure, or FFC AST. A limitation of an FFC AST was that its nodes – symbols or arithmetic operations – were not bound to loops. For instance, the FFC AST node corresponding to the symbol A[i][j] could not separate the variable name A from the loop indices i and j. We have enriched FFC AST symbols with additional fields to capture this kind of information.

• Basis functions in an FFC AST were given a new field storing the dimensionality of the function space. This information is later used to enrich ArrayInit objects with a sparsity pattern, which enables sparsity optimization without requiring COFFEE to “sniff” array values.

The enriched FFC AST is intercepted prior to code generation and forwarded to a new module, where a COFFEE AST is constructed. In this module:

• the strings in the template originally used by FFC for code generation are now COFFEE AST objects;

• the FFC AST is visited and translated into a COFFEE AST by a suitable AST-to-AST converter routine.

TSFC and COFFEE TSFC was conceived to produce an input suitable for COFFEE. This is accomplished through an AST-to-AST converter, which translates the GEM representation (the internal language used by TSFC) of an assembly kernel into a COFFEE AST. The implementation was carried out by Miklós Homolya, the lead designer of TSFC.

6.4 Rewrite Operators

COFFEE implements sharing elimination and pre-evaluation by composing “building-block” operators, or “rewrite operators”. This has several advantages. Firstly, extensibility: novel transformations – for instance, sum-factorization in spectral methods – could be expressed using the existing operators, or with little effort by building on what is already available. Secondly, generality: COFFEE can be seen as a lightweight, low level computer algebra system, not necessarily tied to finite element integration. Thirdly, robustness: the same operators are exploited, and therefore stressed, by different optimization pipelines. The rewrite operators, whose implementation is based on manipulation of the kernel’s AST, essentially constitute the COFFEE language. The most important rewrite operators in COFFEE are:

Generalized code motion A sub-expression that is invariant with respect to one or more loops can be hoisted to the level of an outer loop, thus reducing the operation count. This requires introducing a temporary (a scalar or an array) for each invariant sub-expression and, possibly, adding a “clone” loop to the nest (several examples, e.g. Figure 4.1, have been provided throughout the thesis).

Expansion Exploiting distributivity, this transformation expands a product between two sub-expressions. Expansion has several effects: (i) exposing factorization opportunities (see Chapter 4); (ii) increasing the operation count; (iii) in some circumstances, relieving the register pressure by creating code motion opportunities.

Factorization The impact of factorizing, or collecting, identical symbols is twofold: (i) reducing the number of multiplications to be performed; (ii) more importantly, exposing code motion opportunities, as illustrated through sharing elimination and pre-evaluation.

Symbolic evaluation This operator evaluates sub-expressions that only involve statically initialized, read-only arrays (e.g., basis function tables). The result is stored in a new array, and the AST is modified accordingly.

All these operators are used by both sharing elimination and pre-evaluation (apart from symbolic evaluation, which is only employed by pre-evaluation). The rewrite operators accept a number of options that enable users to steer the transformation process. With code motion, for example, one can specify which kinds of sub-expressions should be hoisted (by indicating the invariant loops) or the maximum amount of memory that may be spent on temporaries. Factorization can be either “explicit” – in which case a list of symbols to be collected, or a loop dimension along which to search for factorizable symbols, is provided – or “heuristic” – in which case groups of recurring identical symbols are collected. The algebraic effect of composing these operators is illustrated by the sketch below.
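As a purely illustrative aside, the algebraic effect of composing expansion, factorization and common sub-expression elimination can be reproduced with SymPy. COFFEE implements its own AST-based operators rather than relying on a computer algebra system, so the snippet below mirrors only the algebra, not the implementation.

    from sympy import symbols, expand, collect, cse

    a, b, c, d = symbols('a b c d')

    expr = (a + b) * c + (a + b) * d
    expanded = expand(expr)            # a*c + a*d + b*c + b*d: exposes factorization
    refactored = collect(expanded, a)  # a*(c + d) + b*c + b*d: fewer multiplications by a
    subs, reduced = cse([expr])        # [(x0, a + b)], [c*x0 + d*x0]: hoistable temporary
    print(expanded, refactored, subs, reduced)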

6.5 Features of the Implementation

This section focuses on the toolkit available in COFFEE for implementing and extending rewrite operators.

6.5.1 Tree Visitor Pattern

The need for a generic infrastructure for traversing ASTs grew rapidly, together with the complexity of the compiler. In the early stages of COFFEE, any time a new transformation (e.g., a rewrite operator) or inspector (e.g., for dependence analysis) was required, one or more full AST traversals had to be implemented. The lack of a common interface for tree traversals also made the code less homogeneous and, as such, more difficult to understand. This led to the introduction of a tree visitor design pattern⁴, whose aim is to decouple the algorithms from the data structure on which they are applied.

As an example, consider an algorithm that needs to perform special actions when a Symbol or a ForLoop node is encountered (e.g., to collect loop dependence information). A tree visitor then only needs to implement three methods, namely visit_Symbol and visit_ForLoop – the actual handlers – as well as visit_Node, which implements the “fallback” action for all other node types (e.g., a propagation of the visit). Tree visitors exploit the hierarchy of AST nodes by always dispatching to the most specialized handler. For example, symbols are simultaneously of type Symbol and Expression, but if a Symbol is encountered and visit_Symbol is implemented, then visit_Symbol is executed, whereas visit_Expression (if any) is ignored. Most of the algorithms in COFFEE exploit the tree visitor pattern. A self-contained sketch of this dispatch mechanism follows.

⁴The tree visitor infrastructure was mainly developed by Lawrence Mitchell, and was inspired by that adopted in UFL, the language used to specify forms in Firedrake.
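The following self-contained Python sketch reproduces the dispatch rule described above: the visitor walks the node’s class hierarchy (its method resolution order) and picks the most specialized visit_ handler available, falling back to visit_Node. The tiny node classes are stand-ins; COFFEE’s own visitor infrastructure differs in its details.

    class Node: pass
    class Expression(Node): pass
    class Symbol(Expression): pass
    class ForLoop(Node):
        def __init__(self, body): self.body = body

    class Visitor:
        def visit(self, node, *args):
            # Walk the class hierarchy: the most specialized handler wins.
            for cls in type(node).__mro__:
                handler = getattr(self, 'visit_' + cls.__name__, None)
                if handler:
                    return handler(node, *args)
            raise NotImplementedError

    class SymbolCollector(Visitor):
        def visit_Symbol(self, node, found):
            found.append(node)
        def visit_ForLoop(self, node, found):
            for child in node.body:
                self.visit(child, found)
        def visit_Node(self, node, found):
            pass   # fallback: nothing to collect, nothing to traverse

    syms = []
    SymbolCollector().visit(ForLoop([Symbol(), Expression(), Symbol()]), syms)
    print(len(syms))   # 2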

6.5.2 Flexible Code Motion

Code motion consists of hoisting a sub-expression out of one or more loops. We already know that this operator is used in different contexts: as a stand-alone transformation (optimization level O1), to implement sharing elimination, and to implement pre-evaluation. When applying the operator, several pieces of information must be known:

• Which sub-expressions should be hoisted; for instance, whether they should be constant or invariant in at most one of the linear loops.

• Where sub-expressions should be hoisted; that is, how far in the loop nest (which affects the size of the temporaries).

• How much memory we are allowed to use in total for the temporaries.

• A list of previously hoisted sub-expressions, for applying common sub-expression elimination.

COFFEE tracks all of the hoisted sub-expressions for quick retrieval. A dictionary mapping temporaries to metadata is employed. For a temporary t, the dictionary records:

• A reference to the hoisted expression e assigned to t.

• A reference to the loop in which e is lifted (if any).

• A reference to the declaration of t.

• Some extra fields.

This dictionary is updated after each invocation of the code motion operator and forms part of the “global state” that COFFEE maintains for each assembly kernel. The dictionary is accessed not only by later calls to the code motion operator, but also by other operators, thus speeding up the AST transformation process. The code motion operator “silently” applies common sub-expression elimination: a look-up in the dictionary tells whether a hoistable sub-expression e has already been assigned to a temporary t by a prior call to the operator; in such a case, e is replaced with t and no useless temporaries are introduced. A sketch of this bookkeeping is given below.
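A minimal sketch of this bookkeeping is shown below, with assumed field names (the metadata COFFEE actually stores includes further fields); expressions are keyed by a canonical string purely for illustration.

    # Hypothetical global state: one entry per temporary created by code motion.
    hoisted = {}          # temporary name -> metadata
    by_expression = {}    # canonical expression -> temporary name (for CSE look-ups)

    def hoist(expr, invariant_loop, fresh=[0]):
        """Return a temporary holding expr, reusing one if expr was hoisted before."""
        if expr in by_expression:              # silent common sub-expression elimination
            return by_expression[expr]
        name = "t%d" % fresh[0]; fresh[0] += 1
        hoisted[name] = {"expr": expr, "loop": invariant_loop, "decl": "double %s;" % name}
        by_expression[expr] = name
        return name

    t1 = hoist("alpha[k][q]*gamma[q]", "loop k")
    t2 = hoist("alpha[k][q]*gamma[q]", "loop k")   # reused, no new temporary
    print(t1, t2, len(hoisted))                     # t0 t0 1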

6.5.3 Tracking Data Dependency

Data dependency analysis is necessary to ensure the legality of transformations. For example:

• We may want to hoist a sub-expression e “as far as possible” in the loop nest, but after the last write to a variable read in e.

• When expanding a product, some terms may be aggregated with previously hoisted sub-expressions. This would avoid introducing extra temporaries and increasing the register pressure. For example, if we have (a + b)*c and both a and b are temporaries created by code motion, we could expand the product and aggregate c with the sub-expressions stored by a and b. Obviously, this is legal only as long as neither a nor b is accessed in other sub-expressions.

• To check whether fusing a sequence of loops is legal (see Section 5.4.4).

• To implement the strategy selection analysis in Section 4.3.2.

COFFEE uses a dependency graph to track data dependencies. The dependency graph has as many vertices as symbols in the code; a directed edge from A to B indicates that symbol B depends on (i.e., is going to read) symbol A. Since COFFEE relies on static single assignment – a property ensuring that each variable is assigned exactly once – such a minimalistic data structure suffices for data dependence analysis. A small sketch of this structure follows.
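A small sketch, under the single-assignment assumption described above, of how such a graph can be built from a list of assignments and then queried, for instance to decide whether a hoisted sub-expression would be moved past a statement it depends on; the statement representation is illustrative.

    from collections import defaultdict

    # Each statement: (defined symbol, list of symbols it reads). Single assignment.
    statements = [("t0", ["alpha", "gamma"]),
                  ("t1", ["beta", "t0"]),
                  ("A",  ["t1", "det"])]

    readers = defaultdict(set)   # edge A -> B: B reads A
    for target, reads in statements:
        for src in reads:
            readers[src].add(target)

    def depends_on(sym, root):
        """True if sym (transitively) reads root."""
        seen, stack = set(), [root]
        while stack:
            cur = stack.pop()
            for nxt in readers[cur]:
                if nxt == sym:
                    return True
                if nxt not in seen:
                    seen.add(nxt); stack.append(nxt)
        return False

    print(depends_on("A", "alpha"))   # True: A reads t1, which reads t0, which reads alpha
    print(depends_on("t0", "beta"))   # False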

6.5.4 Minimizing Temporaries

Both the code motion operator and loop fusion (Section 5.4.4) affect the number of temporaries in an assembly kernel. Once the AST has been transformed, a special routine attempts to remove the unnecessary temporaries. In addition to relieving the register pressure (how much actually depends on how smart the underlying general-purpose compiler is), this has the effect of making the code more readable. The main rule for removing a temporary t storing an expression e is that if t is accessed in only a single statement s, then e is inlined into s and t is removed. Secondly, if some of the transformations in the optimization pipeline reduced e to a symbol, then any occurrence of t is also replaced by e. The sketch below illustrates the first rule.
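A minimal sketch of the single-use inlining rule, on an assumed string-based statement representation (COFFEE operates on its AST rather than on strings; the example only shows the counting-and-substitution logic).

    import re

    # (target, expression) pairs; temporaries are named t0, t1, ...
    stmts = [("t0", "B[i][j]*det"),
             ("t1", "C[i][j] + t0"),
             ("A[i][j]", "t1*w[q]")]

    def uses(name, expr):
        return len(re.findall(r'\b%s\b' % name, expr))

    def inline_single_use(stmts):
        """Inline a temporary read by exactly one statement, then drop it."""
        changed = True
        while changed:
            changed = False
            for i, (target, expr) in enumerate(stmts):
                if not re.fullmatch(r't\d+', target):
                    continue
                readers = [j for j, (_, e) in enumerate(stmts) if uses(target, e)]
                if len(readers) == 1:
                    j = readers[0]
                    tgt2, expr2 = stmts[j]
                    stmts[j] = (tgt2, re.sub(r'\b%s\b' % target, '(%s)' % expr, expr2))
                    del stmts[i]
                    changed = True
                    break
        return stmts

    print(inline_single_use(stmts))
    # [('A[i][j]', '(C[i][j] + (B[i][j]*det))*w[q]')]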

6.6 On the Compilation Time

Firedrake and its various software components are implemented in Python, whereas the generated code is pure C. COFFEE has been written in Python for a natural integration with Firedrake. The process of transforming an AST usually takes a fraction of a second. For more complex forms, on a Sandy Bridge architecture, the AST processing takes no more than approximately 2 seconds. For extremely challenging forms, like those arising in the Holzapfel-Ogden hyperelasticity model, code generation is carried out in 16 seconds; this is the only case, out of dozens, in which the processing time was higher than 3 seconds. We believe, however, that there is still ample room for optimizing the AST processing in COFFEE.

6.7 Overview of Applications Exploiting COFFEE

By default, Firedrake optimizes all local assembly kernels using COFFEE at optimization level O2. This means that all applications written on top of Firedrake seamlessly exploit the technology described in this thesis. A non-comprehensive list of such applications is given below.

Thetis A framework for coastal and estuarine modelling [Thetis contributors, 2016]. Thetis has applications in the simulation of 3D baroclinic river dynamics. It exploits 3D layered meshes (so-called “extruded meshes”) and a large variety of finite element schemes. The lead developer, Tuomas Kärnä (Oregon Health and Science University), has observed that several simulations execute up to 30% faster when COFFEE is enabled (as it is by default).

Gusto A dynamical core package based upon compatible finite element methods [Gusto contributors, 2016], developed at Imperial College London by Colin Cotter, Jemma Shimpton, Hiroe Yamazaki, Andrew McRae and David Ham. The dynamical core is the part of an atmosphere model that actually solves the flow equations. Gusto is one of the key applications built on top of Firedrake.

Stability of jets and vortices Francis Poulin (University of Waterloo) is using Firedrake to study the stability characteristics of oceanic jets and vortices (e.g., the Gulf Stream and Meddies). The focus is on understanding the structures produced and how the energy cascades across the different length scales.

Water waves and structures At the University of Leeds, Firedrake is employed in a number of projects. Anna Kalogirou and Onno Bokhove are studying the generation of nonlinear rogue water waves and how these affect ships. Floriane Gidel and Onno Bokhove use Firedrake to solve the Benney-Luke equations to simulate freak waves (created, for example, when solitary waves encounter oblique walls) [Gidel and Bokhove, 2016]. Tomasz Salwa, Onno Bokhove and Mark Kelmanson have developed a fluid-structure interaction model centered on water waves interacting with offshore wind turbines [Salwa et al., 2016].

r-adaptivity on the sphere Andrew McRae, Chris Budd (University of Bath), Colin Cotter and Jemma Shimpton (Imperial College London) are working with Firedrake on moving-mesh methods for atmospheric problems on the Earth. By solving a Monge-Ampère-type equation on the sphere, the fundamental operation of adapting a mesh to a scalar monitor function is carried out without impacting the mesh connectivity [McRae et al., 2016].

Most of these applications have played a key role in the development of COFFEE, either by providing interesting test cases or by simply exposing flaws in the transformation algorithms.

Chapter 7

Conclusions

In this final chapter, the achievements and the shortcomings of this thesis are reviewed. Future research directions, including an estimate of their potential scientific impact, are also discussed.

7.1 Summary

Novel techniques that improve the performance of numerical methods for solving partial differential equations have been introduced. Our work builds upon three cornerstones:

Solid motivations The performance optimization of a code must always start with an analysis of its bottlenecks and an assessment of its potential. Furthermore, to avoid solving fictitious problems, real-world applications, kernels, and datasets must be used. This approach has been adopted systematically in this thesis.

Automation through high-level compilers Implementing and evaluating optimizations is challenging. Simplistic codes and benchmarks should be avoided for experimentation, since they often provide an incomplete picture of the computational domain. On the other hand, integrating optimizations with real codes is a long, tedious, error-prone task – notably, a task that users should not be expected to carry out on their own. Our solution to this issue is to insert compilers and libraries into frameworks based upon domain-specific languages. In such environments, (i) domain specialists provide the real applications and (ii) performance optimization is automated “behind the scenes” – that is, without the need for user intervention.

Validation of the hypotheses The hypotheses behind any run-time improvements must always be validated. Optimizations, especially radical changes at the algorithmic level, may have unpredictable implications for the low-level performance. For instance, a transformation that reduces the operation count may impair the vectorizability of loops. All performance numbers reported in this thesis have been extensively analyzed through a variety of tools, including compiler reports and profilers.

In this thesis, we have investigated three main problems in the field of numerical methods on unstructured meshes.

Sparse tiling of irregular loops (Chapter 3) The main achievement of this chapter is an inspector/executor scheme for fusing arbitrary sequences of loops expressible by means of the loop chain abstraction. In fact, this technique is so general that any graph-based computation that adheres to the program model of the loop chain abstraction (or an equivalent) is a potential optimization candidate. The first version of the generalized inspector/executor sparse tiling scheme was jointly devised with Strout et al. [2014]. The performance limitations of this inspector, and an in-depth study of the typical requirements of real-world applications (e.g., execution on distributed-memory architectures), motivated the design and implementation of a second generalized inspector/executor scheme, presented in Section 3.8.1. Automation was achieved through the integration of SLOPE with the Firedrake/PyOP2 tool-chain. The extensive performance investigation of the seismological code in Section 3.11 clarifies the limitations and the potential of sparse tiling. To the best of our knowledge, this is, to date, the first study that simultaneously attacks (i) fusion/tiling of irregular loops, (ii) the optimization of real-world applications, and (iii) automation.

Minimizing flops in finite element kernels (Chapter 4) Automated code generation for finite element assembly is especially interesting when the kernels, as a consequence of the operators used in the problem specification, are characterized by the presence of complex mathematical expressions. Hand-writing these kernels requires a great deal of effort; optimizing for the operation count is harder still. This thesis demonstrates that automated code generation and mathematical properties can be leveraged to reduce the operation count (in particular, to reach a local optimum) in finite element assembly kernels. One of the main challenges tackled in this chapter is the coordination of the different rewrite operators (e.g., common sub-expression elimination, factorization, expansion), which are subject to a non-trivial interplay. Our algorithm shows significant performance improvements over state-of-the-art code generation systems for finite element assembly.

Low-level optimization of finite element kernels (Chapter 5) A second question arising when studying finite element kernels concerns the efficiency of the generated code. We have addressed this problem for conventional multi-core architectures. The peculiar structure of the assembly kernels (e.g., small loops, a small working set distributed over a large number of variables/arrays) makes it difficult to find a specific sequence of transformations that maximizes the performance of all problems on all architectures. We have investigated a number of novel and traditional compiler transformations, aimed at creating or improving SIMD vectorization and data locality. Amongst these, the most powerful is padding and data alignment, which exploits the memory access pattern of assembly kernels to increase the effectiveness of SIMD vectorization at the price of a few additional scalar operations. This transformation has been demonstrated to provide systematic improvements in execution time across a range of problems. Although the idea behind the optimization is simple, the implementation conceals several challenges, including the handling of non-unit-stride memory accesses.

The techniques produced in this thesis have been implemented in publicly available software. SLOPE is a C library to express inspector/executor schemes for sparse tiling. The work on finite element kernels is implemented in COFFEE. Both COFFEE and SLOPE are integrated with Firedrake.

7.2 Limitations

The limitations of this research have already been discussed in the previous chapters; here, we emphasize the most relevant.

Performance analysis and tuning of sparse tiling The system described in Chapter 3 automates sparse tiling in Firedrake programs through the loop chain interface. This was a major step, as it made a very complex optimization easily accessible. What is still missing is a cost model, or a system, that facilitates or even automates the performance tuning. Most applications, such as Seigen (Section 3.11), are characterized by long sequences of heterogeneous loops. Multiple computational aspects need to be considered: some loops may be memory-bound, while others are compute-bound; the working set size may vary significantly amongst different subsets of loops; one loop may take much longer to execute than the others; and so on. Altogether, these issues make it difficult to conceive of a system capable of autonomously identifying which particular loops in a loop chain should be fused. Auto-tuning could help, but the implementation would be non-trivial. Moreover, an auto-tuning system could require (i) a significant amount of time to retrieve a configuration and (ii) re-execution whenever some problem parameters change (e.g., the domain discretization). Experimenting with other programs and loops is, however, fundamental before addressing this problem.

Combination of low-level transformations in finite element kernels All of the optimizations presented in Chapter 5 can improve the performance of finite element kernels, but it is difficult to determine the optimal sequence of transformations for a given problem. While, on the one hand, padding and data alignment shows consistent speed-ups across a variety of problems, it is much more challenging to understand how to compose the other transformations. The cost model suggested in Luporini et al. [2015] works well if the expression rewriting stage (Section 6.2) is limited to generalized loop-invariant code motion alone. However, coordinating vector-register tiling and expression splitting, as well as the general-purpose transformations (especially vector promotion; see Section 5.4 for the full list), becomes difficult if the potential of the whole expression rewriting engine is exploited. This is a result of introducing more temporaries and complex sub-expressions in the outer loops, as well as of creating more loops. Many low-level transformations are available in COFFEE, but none of them is applied automatically apart from padding/data alignment and loop fusion. Users, therefore, retain some responsibility for low-level performance tuning. Preliminary work on compiler auto-tuning is available at [Luporini, 2014a].

7.3 Future Research Directions

We recapitulate the open questions that arose from the three main chapters of the thesis and provide insights into possible research directions.

Extensive performance evaluation of sparse tiling The theory and the tools behind generalized sparse tiling are now mature. The next big step will consist of extensive experimentation with computations arising in multiple domains; for instance, in addition to numerical methods on unstructured meshes (not necessarily finite element), molecular dynamics and graph processing. The potential of sparse tiling has not yet been fully explored.

Sparse tiling in general-purpose or research compilers It is unclear whether sparse tiling could be automated in general-purpose or research (polyhedral-based) compilers. Although simple inspector/executor schemes are widespread (e.g., many compilers introduce alternative code paths for handling statically unknown loop bounds), the path towards supporting more advanced transformations is long and intricate. In fact, it is also unclear whether sparse tiling should even be considered a general-purpose transformation.

Sparse tiling and space-filling curves In Section 3.3, we discussed the dichotomy between space-filling curves and loop tiling. Later, in Section 3.11, we hypothesized that applying sparse tiling on top of a space-filling curve may significantly reduce the performance penalties that we are currently subject to (e.g., load imbalance, TLB misses). There is an on-going effort in the DMPlex community that will make space-filling curves available in Firedrake, so our hypotheses could be verified relatively soon. We speculate that space-filling curves can have a great impact on the performance of sparse tiling.

Minimizing operations and sum-factorization in assembly kernels The strategy for minimizing the operation count developed in Chapter 4 targets a specific class of finite element integration loop nests. If the loop nest changes dramatically, a generalization of the strategy will be necessary. This may happen if the chosen function spaces have certain properties. For instance, in spectral element methods basis functions may be constructed from the tensor product of 1D basis functions. This enables the application of sum-factorization, a well-known technique for reducing the operation count (see for example Vos et al. [2010]). In essence, sum-factorization is the mathematical name for what we call generalized code motion in Chapter 4. The relationship between this optimization and our transformation model is yet to be investigated.

Minimizing operations and memory constraints in assembly kernels Two memory constraints have been introduced in Chapter 4. While Constraint 1 was imposed to limit the transformation space, Constraint 2 has a direct impact on the main optimization algorithm. Both constraints leave some interesting questions unanswered. First, the heuristic for choosing the memory threshold is partly flawed: we have proved (Section 4.9) that a drastic reduction in operation count should overrule any constraint on the working set size. It is unclear how to effectively approximate the “cut-off” point, which is architecture- and problem-dependent. Second, Constraint 1 would be overly restrictive if the time dimension were included in the model of a finite element integration loop nest. Geometry-dependent sub-expressions would become generalized code motion candidates; this would make the transformation space even bigger. Supporting this kind of hoisting could dramatically improve the execution time of an application; for example, as explained in Section 3.11, Seigen would benefit from it. Further investigation in this direction is likely to be very rewarding.

Extensions to many-core architectures In this thesis, we have focused on the platforms that are typically used for unstructured mesh computations, that is, conventional CPU architectures. Two of the key factors limiting execution on many-core platforms (e.g., GPUs) are the stringent memory requirements of these computations and the lack of support from third-party libraries (e.g., PETSc). It would not be surprising if this scenario evolved over the next few years. The upcoming generation of many-core architectures is promising in this sense, with traditional “accelerators” slowly becoming first-class platforms (e.g., nodes that can run a full-fledged operating system, larger memory systems, tighter coupling with the CPU through a shared address space), such as Intel’s Knights Landing. The COFFEE-related work is sufficiently mature for a generalization to non-standard platforms. Automated generation of efficient code for finite element kernels on GPUs has not been addressed.

Compiler auto-tuning for assembly kernels As explained in the previous section, the low-level optimization of assembly kernels could greatly benefit from compiler auto-tuning. Auto-tuning techniques have proven successful in a variety of frameworks based upon domain-specific languages, such as Spiral and Halide. This would require some new technology in COFFEE. Our hypothesis is that a minimalistic auto-tuning system, performing a brute-force search over a few tens of promising code syntheses, could result in significant speed-ups. This consideration is the fruit of preliminary work on the subject, with experimentation limited to relatively simple problems.


Bibliography

A. Acharya and U. Bondhugula. Pluto+: Near-complete modeling of affine transformations for parallelism and locality. In ACM SIGPLAN Notices, volume 50, pages 54–64. ACM, 2015.

M. F. Adams and J. Demmel. Parallel multigrid solver algorithms and implementations for 3D unstructured finite element problem. In Pro- ceedings of SC99, Portland, Oregon, November 1999.

A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, editors. Compilers: principles, techniques, and tools. Pearson/Addison Wesley, Boston, MA, USA, second edition, 2007. ISBN 0-321-48681-1. URL http://www.loc.gov/catdir/toc/ecip0618/2006024333.html.

M. S. Alnæs. Uflacs - ufl analyser and compiler system. https:// bitbucket.org/fenics-project/uflacs, 2016.

M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N. Wells. Unified Form Language: a domain-specific language for weak formu- lations of partial differential equations. ACM Trans Math Software, 40 (2):9:1–9:37, 2014. doi: 10.1145/2566630. URL http://dx.doi.org/10. 1145/2566630.

M. S. Alnæs, A. Logg, G. Wells, L. Mitchell, M. E. Rognes, M. Homolya, A. Bergersen, D. A. Ham, J. Ring, chrisrichardson, and K.-A. Mardal. ufl: The unified form language, Apr. 2016. URL http://dx.doi.org/ 10.5281/zenodo.49282.

197 U. Ayachit. The ParaView Guide (Full Color Version): A Parallel Visual- ization Application. Kitware, Incorporated, paraview 4.3 edition, Jan. 2015. ISBN 1930934300. URL http://www.amazon.com/exec/obidos/ redirect?tag=citeulike07-20&path=ASIN/1930934300.

D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Comput. Surv., 26(4):345–420, Dec. 1994. ISSN 0360-0300. doi: 10.1145/197405.197406. URL http: //doi.acm.org/10.1145/197405.197406.

S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, and H. Zhang. PETSc Web page. http://www.mcs.anl.gov/petsc, 2015. URL http://www.mcs. anl.gov/petsc.

F. Bassetti, K. Davis, and D. J. Quinlan. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. In Proceedings of the Second International Sym- posium on Computing in Object-Oriented Parallel Environments, ISCOPE ’98, pages 107–118, London, UK, UK, 1998. Springer-Verlag. ISBN 3-540- 65387-2. URL http://dl.acm.org/citation.cfm?id=646894.709707.

C. Bertolli, A. Betts, N. Loriant, G. Mudalige, D. Radford, D. Ham, M. Giles, and P. Kelly. Compiler optimizations for industrial unstruc- tured mesh cfd applications on gpus. In H. Kasahara and K. Kimura, editors, Languages and Compilers for Parallel Computing, volume 7760 of Lecture Notes in Computer Science, pages 112–126. Springer Berlin Heidel- berg, 2013. ISBN 978-3-642-37657-3. doi: 10.1007/978-3-642-37658-0 8.

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practi- cal automatic polyhedral parallelizer and locality optimizer. In Proceed- ings of the 2008 ACM SIGPLAN Conference on Programming Language De- sign and Implementation, PLDI ’08, pages 101–113, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-860-2. doi: 10.1145/1375581.1375595. URL http://doi.acm.org/10.1145/1375581.1375595.

U. Bondhugula, V. Bandishti, A. Cohen, G. Potron, and N. Vasilache. Tiling and optimizing time-iterated computations on periodic domains.

198 In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pages 39–50, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2809-8. doi: 10.1145/2628071.2628106. URL http://doi.acm.org/10.1145/2628071.2628106.

T. Brandvik and G. Pullan. Sblock: A framework for efficient stencil- based pde solvers on multi-core platforms. In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, CIT ’10, pages 1181–1188, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4108-2. doi: 10.1109/CIT.2010.214. URL http: //dx.doi.org/10.1109/CIT.2010.214.

S. Brenner and R. Scott. The Mathematical Theory of Finite Element Meth- ods. Texts in Applied Mathematics. Springer New York, 2007. ISBN 9780387759333. URL https://books.google.co.uk/books?id=ci4c_ R0WKYYC.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. Parallel tiled qr fac- torization for multicore architectures. In Proceedings of the 7th Interna- tional Conference on Parallel Processing and Applied Mathematics, PPAM’07, pages 639–648, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540- 68105-1, 978-3-540-68105-2. URL http://dl.acm.org/citation.cfm? id=1786194.1786268.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput., 35(1):38–53, Jan. 2009. ISSN 0167-8191. doi: 10.1016/j.parco.2008.10.002. URL http://dx.doi.org/10.1016/j.parco.2008.10.002.

C. Cecka, A. J. Lew, and E. Darve. Assembly of finite element methods on graphics processors. International Journal for Numerical Methods in Engineering, 85(5):640–669, 2011. ISSN 1097-0207. doi: 10.1002/nme. 2989. URL http://dx.doi.org/10.1002/nme.2989.

L. Chen, Z.-Q. Zhang, and X.-B. Feng. Redundant computation partition on distributed-memory systems. In Algorithms and Architectures for Par- allel Processing, 2002. Proceedings. Fifth International Conference on, pages 252–260, Oct 2002. doi: 10.1109/ICAPP.2002.1173583.

199 M. Christen, O. Schenk, and H. Burkhart. PATUS: a code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE Inter- national Parallel & Distributed Processing Symposium, IPDPS ’11, pages 676–687, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4385-7. doi: 10.1109/IPDPS.2011.70. URL http://dx.doi. org/10.1109/IPDPS.2011.70.

A. T. Chronopoulos. s-step iterative methods for (non)symmetric (in)definite linear systems. SIAM Journal on Numerical Analysis, 28(6): 1776–1789, 1991. doi: 10.1137/0728088. URL http://dx.doi.org/10. 1137/0728088.

P. G. Ciarlet. Numerical analysis of the finite element method. Les Presses de l’Universite de Montreal, 1976.

A. Darte. On the complexity of loop fusion. Parallel Comput., 26(9):1175– 1193, Aug. 2000. ISSN 0167-8191. doi: 10.1016/S0167-8191(00)00034-X. URL http://dx.doi.org/10.1016/S0167-8191(00)00034-X.

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Pat- terson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 4:1–4:12, Piscataway, NJ, USA, 2008. IEEE Press. ISBN 978-1-4244-2835-9. URL http://dl.acm.org/citation.cfm?id=1413370.1413375.

T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1:1 – 1:25, 2011.

S. Delcourte, L. Fezoui, and N. Glinsky-Olivier. A high-order Discon- tinuous Galerkin method for the seismic wave propagation. ESAIM: Proceedings, 27:70–89, 2009. doi: 10.1051/proc/2009020. URL https: //hal.archives-ouvertes.fr/hal-00868418.

J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick. Avoiding com- munication in sparse matrix computations. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, 2008.

200 Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen, F. Ham, A. Aiken, K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan. Liszt: a domain specific language for building portable mesh-based pde solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 9:1–9:12, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0771- 0. doi: 10.1145/2063384.2063396. URL http://doi.acm.org/10.1145/ 2063384.2063396.

C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiß. Cache Optimization for Structured and Unstructured Grid Multigrid. Electronic Transactions on Numerical Analysis, pages 21–40, February 2000.

D. Dutykh, R. Poncet, and F. Dias. The VOLNA code for the numerical modeling of tsunami waves: Generation, propagation and inundation. European Journal of Mechanics - B/Fluids, 30(6):598 – 615, 2011. Special Issue: Nearshore Hydrodynamics.

Firedrake. petsc4py: the Python interface to petsc, Apr. 2016. URL http: //dx.doi.org/10.5281/zenodo.49283.

P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier. nek5000 Web page, 2008. http://nek5000.mcs.anl.gov.

F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 315–326, New York, NY, USA, 2005. ACM. ISBN 1-59593-056-6. doi: 10.1145/1065010.1065048. URL http://doi.acm.org/10.1145/1065010.1065048.

M. Frigo and S. G. Johnson. The design and implementation of FFTW3. In Proceedings of the IEEE, pages 216–231, 2005.

F. Gidel and O. Bokhove. Rogue wave variational modelling through the interaction of two solitary waves. In EGU General Assembly Conference Abstracts, volume 18 of EGU General Assembly Conference Abstracts, page 14548, Apr. 2016.

201 M. Giles, D. Ghate, and M. Duta. Using automatic differentiation for adjoint cfd code development. 2005.

M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. Kelly. Per- formance analysis of the op2 framework on many-core architectures. SIGMETRICS Perform. Eval. Rev., 38(4):9–15, Mar. 2011. ISSN 0163- 5999. doi: 10.1145/1964218.1964221. URL http://doi.acm.org/10. 1145/1964218.1964221.

M. B. Giles, G. R. Mudalige, C. Bertolli, P. H. J. Kelly, E. Laszlo, and I. Reguly. An analytical study of loop tiling for a large-scale un- structured mesh application. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC ’12, pages 477–482, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4956-9. doi: 10.1109/SC.Companion.2012.68. URL http://dx.doi.org/10.1109/SC.Companion.2012.68.

G. Gorman, M. de Aguiar, D. Ham, F. Herrmann, C. Jacobs, P. Kelly, M. Lange, C. Pain, M. Piggott, S. Sherwin, F. V. Zacarias, and M. Warner. OPESCI project web page. http://www.opesci.org, 2015.

T. Grosser, A. Groesslinger, and C. Lengauer. Polly — perform- ing polyhedral optimizations on a low-level intermediate represen- tation. Parallel Processing Letters, 22(04):1250010, 2012. doi: 10. 1142/S0129626412500107. URL http://www.worldscientific.com/ doi/abs/10.1142/S0129626412500107.

T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. The promises of hybrid hexagonal/classical tiling for gpu. Research Report RR-8339, INRIA, July 2013. URL https://hal.inria.fr/hal-00848691.

T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege. Hybrid hexagonal/classical tiling for gpus. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pages 66:66–66:75, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2670-4. doi: 10.1145/2544137.2544160. URL http://doi. acm.org/10.1145/2544137.2544160.

Gusto contributors. The Gusto Project. http://firedrakeproject.org/ gusto/, 2016.

202 A. Hartono, Q. Lu, X. Gao, S. Krishnamoorthy, M. Nooijen, G. Baum- gartner, D. E. Bernholdt, V. Choppella, R. M. Pitzer, J. Ramanujam, A. Rountev, and P. Sadayappan. Identifying cost-effective common subexpressions to reduce operation count in tensor contraction eval- uations. In Proceedings of the 6th International Conference on Computa- tional Science - Volume Part I, ICCS’06, pages 267–275, Berlin, Heidel- berg, 2006. Springer-Verlag. ISBN 3-540-34379-2, 978-3-540-34379-0. doi: 10.1007/11758501 39. URL http://dx.doi.org/10.1007/11758501_39.

A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baum- gartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sa- dayappan. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry†. The Journal of Physical Chemistry A, 113(45):12715–12723, 2009. doi: 10.1021/jp9051215. URL http://pubs.acs.org/doi/abs/10.1021/jp9051215. PMID: 19888780.

A. Heinecke, G. Henry, M. Hutchinson, and H. Pabst. LIBXSMM: ac- celerating small matrix multiplications by runtime code generation. In Proceedings of the International Conference for High Performance Com- puting, Networking, Storage and Analysis, SC ’16, pages 84:1–84:11, Pis- cataway, NJ, USA, 2016. IEEE Press. ISBN 978-1-4673-8815-3. URL http://dl.acm.org/citation.cfm?id=3014904.3015017.

J. L. Hennessy and D. A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2011. ISBN 012383872X, 9780123838728.

T. Henretty, R. Veras, F. Franchetti, L.-N. Pouchet, J. Ramanujam, and P. Sadayappan. A stencil compiler for short-vector simd architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, pages 13–24, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2130-3. doi: 10.1145/2464996.2467268. URL http://doi.acm.org/10.1145/2464996.2467268.

J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In Proceed- ings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 311–320, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1316-

203 2. doi: 10.1145/2304576.2304619. URL http://doi.acm.org/10.1145/ 2304576.2304619.

M. Homolya and L. Mitchell. TSFC - two-stage form compiler. https: //github.com/firedrakeproject/tsfc, 2016.

Intel Corporation. Intel architecture code analyzer (IACA), 2012. [Online]. Available: http://software.intel.com/en-us/articles/intel-architecture- code-analyzer/.

Intel Corporation. Intel VTune Performance Analyzer. https://software.intel.com/en-us/intel-vtune-amplifier-xe, 2016.

F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming lan- guages, pages 319–329. ACM, 1988.

C. T. Jacobs and F. Luporini. Seigen - an elastic wave equation solver for seimological problems. http://https://bitbucket.org/ctjacobs/ elastic-wave-equation/branch/assemble-into-dat-flatten, 2016.

C. T. Jacobs, M. Lange, F. Luporini, and G. J. Gorman. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. 2016. In preparation.

G. Karypis and V. Kumar. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 5.0. http://www.cs.umn.edu/ ˜metis, 2011.

R. C. Kirby and A. Logg. A compiler for variational forms. ACM Trans. Math. Softw., 32(3):417–444, Sept. 2006. ISSN 0098-3500. doi: 10. 1145/1163641.1163644. URL http://doi.acm.org/10.1145/1163641. 1163644.

R. C. Kirby and A. Logg. Efficient compilation of a class of variational forms. ACM Trans. Math. Softw., 33(3), Aug. 2007. ISSN 0098-3500. doi: 10.1145/1268769.1268771. URL http://doi.acm.org/10.1145/ 1268769.1268771.

A. Klöckner. Loo.py: Transformation-based code generation for gpus and cpus. In Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY’14, pages 82:82–82:87, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2937-8. doi: 10.1145/2627373.2627387. URL http://doi.acm.org/10.1145/2627373.2627387.

A. Klöckner. Loo.py: From Fortran to performance via transformation and substitution rules. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2015, pages 1–6, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3584-3. doi: 10.1145/2774959.2774969. URL http://doi.acm.org/10.1145/2774959.2774969.

M. G. Knepley and A. R. Terrel. Finite element integration on gpus. ACM Trans. Math. Softw., 39(2):10:1–10:13, Feb. 2013. ISSN 0098-3500. doi: 10. 1145/2427023.2427027. URL http://doi.acm.org/10.1145/2427023. 2427027.

C. D. Krieger and M. M. Strout. Executing optimized irregular applications using task graphs within existing parallel models. In Proceedings of the Second Workshop on Irregular Applications: Architectures and Algorithms (IA3) held in conjunction with SC12, November 11, 2012.

C. D. Krieger, M. M. Strout, C. Olschanowsky, A. Stone, S. Guzik, X. Gao, C. Bertolli, P. Kelly, G. Mudalige, B. Van Straalen, and S. Williams. Loop chaining: A programming abstraction for balancing locality and parallelism. In Proceedings of the 18th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), Boston, Massachusetts, USA, May 2013.

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. SIGPLAN Not., 42(6):235–244, June 2007a. ISSN 0362-1340. doi: 10.1145/1273442.1250761. URL http://doi.acm.org/10.1145/1273442.1250761.

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In ACM SIGPLAN PLDI 2007, 2007b.

F. Krużel and K. Banaś. Vectorized OpenCL implementation of numerical integration for higher order finite elements. Comput. Math. Appl., 66(10):2030–2044, Dec. 2013. ISSN 0898-1221. doi: 10.1016/j.camwa.2013.08.026. URL http://dx.doi.org/10.1016/j.camwa.2013.08.026.

C.-C. Lam, T. Rauber, G. Baumgartner, D. Cociorva, and P. Sadayappan. Memory-optimal evaluation of expression trees involving large objects. Comput. Lang. Syst. Struct., 37(2):63–75, July 2011. ISSN 1477-8424. doi: 10.1016/j.cl.2010.09.003. URL http://dx.doi.org/10.1016/j.cl.2010.09.003.

M. Lange, L. Mitchell, M. G. Knepley, and G. J. Gorman. Efficient mesh management in Firedrake using PETSc-DMPlex. CoRR, abs/1506.07749, 2015. URL http://arxiv.org/abs/1506.07749.

S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pages 145–156, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. doi: 10.1145/349299.349320. URL http://doi.acm.org/10.1145/349299.349320.

A. Logg, K.-A. Mardal, G. N. Wells, et al. Automated Solution of Differential Equations by the Finite Element Method. Springer, 2012. ISBN 978-3-642-23098-1. doi: 10.1007/978-3-642-23099-8.

A. Logg, M. S. Alnæs, M. E. Rognes, G. Wells, J. Ring, L. Mitchell, J. Hake, M. Homolya, F. Rathgeber, F. Luporini, G. Markall, A. Bergersen, L. Li, D. A. Ham, K.-A. Mardal, J. Blechta, G.-T. Bercea, T. Airaksinen, N. Schlömer, H. P. Langtangen, C. J. Cotter, O. Skavhaug, T. Hisch, mliertzer, J. B. Haga, and A. T. T. McRae. ffc: the FEniCS form compiler, Apr. 2016. URL http://dx.doi.org/10.5281/zenodo.49276.

Q. Lu, X. Gao, S. Krishnamoorthy, G. Baumgartner, J. Ramanujam, and P. Sadayappan. Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions. J. Parallel Distrib. Comput., 72(3):338–352, Mar. 2012. ISSN 0743-7315. doi: 10.1016/j.jpdc.2011.09.006. URL http://dx.doi.org/10.1016/j.jpdc.2011.09.006.

F. Luporini. COFFEE code repository. https://github.com/coneoproject/COFFEE, 2014a.

F. Luporini. Helmholtz, Advection-Diffusion, and Burgers UFL code. https://github.com/firedrakeproject/firedrake/tree/pyop2-ir-perf-eval/tests/perf-eval, 2014b.

F. Luporini. Seigen - performance evaluation. https://github.com/FabioLuporini/phd-thesis/tree/master/sparsetiling, 2016a.

F. Luporini. SLOPE code repository. https://github.com/coneoproject/SLOPE, 2016b.

F. Luporini, A. L. Varbanescu, F. Rathgeber, G.-T. Bercea, J. Ramanujam, D. A. Ham, and P. H. J. Kelly. Cross-loop optimization of arithmetic intensity for finite element local assembly. ACM Trans. Archit. Code Optim., 11(4):57:1–57:25, Jan. 2015. ISSN 1544-3566. doi: 10.1145/2687415. URL http://doi.acm.org/10.1145/2687415.

F. Luporini, D. A. Ham, and P. H. J. Kelly. An algorithm for the optimization of finite element integration loops. Submitted to ACM Transactions on Mathematical Software (TOMS), 2016a.

F. Luporini, L. Mitchell, M. Homolya, F. Rathgeber, D. A. Ham, M. Lange, G. Markall, and F. Russell. coffee: a compiler for fast expression evaluation, Apr. 2016b. URL http://dx.doi.org/10.5281/zenodo.49279.

G. R. Markall, D. A. Ham, and P. H. Kelly. Towards generating optimised finite element solvers for GPUs from high-level specifications. Procedia Computer Science, 1(1):1815–1823, 2010. ISSN 1877-0509. doi: http://dx.doi.org/10.1016/j.procs.2010.04.203. URL http://www.sciencedirect.com/science/article/pii/S1877050910002048. ICCS 2010.

G. R. Markall, F. Rathgeber, L. Mitchell, N. Loriant, C. Bertolli, D. A. Ham, and P. H. J. Kelly. Performance portable finite element assembly using PyOP2 and FEniCS. In Proceedings of the International Supercomputing Conference (ISC) ’13, volume 7905 of Lecture Notes in Computer Science, June 2013. In press.

A. T. T. McRae, C. J. Cotter, and C. J. Budd. Optimal-transport-based mesh adaptivity on the plane and sphere using finite elements. ArXiv e-prints, Dec. 2016.

J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pages 256–265, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-498-0. doi: 10.1145/1542275.1542313. URL http://doi.acm.org/10.1145/1542275.1542313.

L. Mitchell, D. A. Ham, F. Rathgeber, M. Homolya, A. T. T. McRae, G.-T. Bercea, M. Lange, C. J. Cotter, C. T. Jacobs, F. Luporini, S. W. Funke, H. Büsing, T. Kärnä, A. Kalogirou, H. Rittich, E. H. Mueller, S. Kramer, G. Markall, P. E. Farrell, G. McBain, and A. N. Riseth. firedrake: an automated finite element system, Apr. 2016. URL http://dx.doi.org/10.5281/zenodo.49284.

M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick. Minimizing communication in sparse matrix solvers. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 36:1–36:12. ACM, 2009. ISBN 978-1-60558-744-8. doi: http://doi.acm.org/10.1145/1654059.1654096.

K. B. Ølgaard and G. N. Wells. Optimizations for quadrature representations of finite element tensors through automated code generation. ACM Trans. Math. Softw., 37(1):8:1–8:23, Jan. 2010. ISSN 0098-3500. doi: 10.1145/1644001.1644009. URL http://doi.acm.org/10.1145/1644001.1644009.

OP2 contributors. OP2 implementation of Airfoil. https://github.com/OP2/OP2-Common/tree/master/apps/c/airfoil, 2011.

M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation”, 93(2):232–275, 2005.

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. URL http://doi.acm.org/10.1145/2491956.2462176.

J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for multicomputers. J. Parallel Distrib. Comput., 16(2):108–120, 1992. doi: 10.1016/0743-7315(92)90027-K. URL http://dx.doi.org/10.1016/0743-7315(92)90027-K.

F. Rathgeber. Productive and efficient computational science through domain-specific abstractions. PhD thesis, Imperial College London, 2014.

F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G. Bercea, G. R. Markall, and P. H. J. Kelly. Firedrake: automating the finite element method by composing abstractions. CoRR, abs/1501.01809, 2015. URL http://arxiv.org/abs/1501.01809.

F. Rathgeber, F. Luporini, and L. Mitchell. firedrake-bench: firedrake bench optimality paper release, Apr. 2016a. URL http://dx.doi.org/10.5281/zenodo.49290.

F. Rathgeber, L. Mitchell, F. Luporini, G. Markall, D. A. Ham, G.-T. Bercea, M. Homolya, A. T. T. McRae, H. Dearman, C. T. Jacobs, gbts, S. W. Funke, K. Sato, and F. Russell. PyOP2: framework for performance-portable parallel computations on unstructured meshes, Apr. 2016b. URL http://dx.doi.org/10.5281/zenodo.49281.

M. Ravishankar, J. Eisenlohr, L.-N. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Proc. Intl. Conf. on High Perf. Comp., Net., Sto. & Anal., pages 72:1–72:11, 2012. ISBN 978-1-4673-0804-5.

I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. Acceleration of a full-scale industrial CFD application with OP2. IEEE Transactions on Parallel and Distributed Systems, 27(5):1265–1278, May 2016. ISSN 1045-9219. doi: 10.1109/TPDS.2015.2453972.

M. E. Rognes, A. Logg, D. A. Ham, M. Homolya, N. Schlömer, J. Blechta, A. T. T. McRae, A. Bergersen, C. J. Cotter, J. Ring, and L. Mitchell. fiat: the finite element automated tabulator, Apr. 2016. URL http://dx.doi.org/10.5281/zenodo.49280.

D. Rossinelli, B. Hejazialhosseini, P. Hadjidoukas, C. Bekas, A. Curioni, A. Bertsch, S. Futral, S. J. Schmidt, N. A. Adams, and P. Koumoutsakos. 11 PFLOP/s simulations of cloud cavitation collapse. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 3:1–3:13, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2378-9. doi: 10.1145/2503210.2504565. URL http://doi.acm.org/10.1145/2503210.2504565.

F. P. Russell and P. H. J. Kelly. Optimized code generation for finite element local assembly using symbolic manipulation. ACM Transactions on Mathematical Software, 39(4), 2013.

T. Salwa, O. Bokhove, and M. Kelmanson. Variational modelling of wave-structure interactions for offshore wind turbines, June 2016. URL http://eprints.whiterose.ac.uk/96800/.

J. H. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603–612, 1991.

J. Shin, M. W. Hall, J. Chame, C. Chen, P. F. Fischer, and P. D. Hovland. Speeding up Nek5000 with autotuning and specialization. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, pages 253–262, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0018-6. doi: 10.1145/1810085.1810120. URL http://doi.acm.org/10.1145/1810085.1810120.

B. Smith, S. Balay, M. Knepley, J. Brown, L. C. McInnes, H. Zhang, P. Brune, sarich, stefanozampini, D. Karpeyev, L. Dalcin, tisaac, markadams, V. Minden, VictorEijkhout, vijaysm, K. Rupp, F. Kong, and SurtaiHan. PETSc: portable, extensible toolkit for scientific computation, Apr. 2016. URL http://dx.doi.org/10.5281/zenodo.49285.

D. G. Spampinato and M. Püschel. A basic linear algebra compiler. In International Symposium on Code Generation and Optimization (CGO), 2014.

K. Stock, T. Henretty, I. Murugandi, P. Sadayappan, and R. Harrison. Model-driven SIMD code generation for a multi-resolution tensor kernel. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pages 1058–1067, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4385-7. doi: 10.1109/IPDPS.2011.101. URL http://dx.doi.org/10.1109/IPDPS.2011.101.

M. Strout, F. Luporini, C. Krieger, C. Bertolli, G.-T. Bercea, C. Olschanowsky, J. Ramanujam, and P. Kelly. Generalizing run-time tiling with the loop chain abstraction. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 1136–1145, May 2014. doi: 10.1109/IPDPS.2014.118.

M. M. Strout. Compilers for regular and irregular stencils: Some shared problems and solutions. In Proceedings of Workshop on Optimizing Stencil Computations (WOSC), October 27, 2013.

M. M. Strout, L. Carter, J. Ferrante, J. Freeman, and B. Kreaseck. Combining performance aspects of irregular Gauss-Seidel via sparse tiling. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC). Springer, July 2002.

M. M. Strout, L. Carter, and J. Ferrante. Compile-time composition of run-time data and iteration reorderings. In Proc. ACM SIGPLAN Conf. Prog. Lang. Des. & Impl. (PLDI), New York, NY, USA, June 2003. ACM.

M. M. Strout, L. Carter, J. Ferrante, and B. Kreaseck. Sparse tiling for stationary iterative methods. International Journal of High Performance Computing Applications, 18(1):95–114, February 2004.

T. Sun. Opesci-FD: automatic code generation package for finite difference models. 2016.

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 117–128, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0743-7. doi: 10.1145/1989493.1989508. URL http://doi.acm.org/10.1145/1989493.1989508.

TCE contributors. The tensor contraction engine project. http://www.csc.lsu.edu/~gb/TCE/.

Thetis contributors. The Thetis Project. http://thetisproject.org, 2016.

D. Unat, J. Zhou, Y. Cui, S. B. Baden, and X. Cai. Accelerating a 3D finite-difference earthquake simulation with a C-to-CUDA translator. Computing in Science and Engg., 14(3):48–59, May 2012. ISSN 1521-9615. doi: 10.1109/MCSE.2012.44. URL http://dx.doi.org/10.1109/MCSE.2012.44.

S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science Engineering, 13(2):22–30, March 2011. ISSN 1521-9615. doi: 10.1109/MCSE.2011.37.

F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software, 41(3):14:1–14:33, June 2015. URL http://doi.acm.org/10.1145/2764454.

A. Venkat, M. Shantharam, M. Hall, and M. M. Strout. Non-affine extensions to polyhedral code generation. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pages 185:185–185:194, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2670-4. doi: 10.1145/2544137.2544141. URL http://doi.acm.org/10.1145/2544137.2544141.

J. Virieux. P-SV wave propagation in heterogeneous media: velocity-stress finite-difference method. Geophysics, 51(4):889–901, 1986. ISSN 0016-8033. doi: 10.1190/1.1442147. URL http://geophysics.geoscienceworld.org/content/51/4/889.

P. E. J. Vos, S. J. Sherwin, and R. M. Kirby. From h to p efficiently: Implementing finite and spectral/hp element methods to achieve optimal performance for low- and high-order discretisations. J. Comput. Phys., 229(13):5161–5181, July 2010. ISSN 0021-9991. doi: 10.1016/j.jcp.2010.03.031. URL http://dx.doi.org/10.1016/j.jcp.2010.03.031.

R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, Supercomputing ’98, pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society. ISBN 0-89791-984-X. URL http://dl.acm.org/citation.cfm?id=509058.509096.

S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, Apr. 2009. ISSN 0001-0782. doi: 10.1145/1498765.1498785. URL http://doi.acm.org/10.1145/1498765.1498785.

Y. Zhang and F. Mueller. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 155–164, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206-6. doi: 10.1145/2259016.2259037. URL http://doi.acm.org/10.1145/2259016.2259037.

X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 207–218, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206-6. doi: 10.1145/2259016.2259044. URL http://doi.acm.org/10.1145/2259016.2259044.

G. Zumbusch. Vectorized higher order finite difference kernels. In Proceedings of the 11th International Conference on Applied Parallel and Scientific Computing, PARA’12, pages 343–357, Berlin, Heidelberg, 2013. Springer-Verlag. ISBN 978-3-642-36802-8. doi: 10.1007/978-3-642-36803-5_25. URL http://dx.doi.org/10.1007/978-3-642-36803-5_25.
