Design and Optimisation of Scientific Programs in a Categorical Language
Total Page:16
File Type:pdf, Size:1020Kb
Design and Optimisation of Scientific Programs in a Categorical Language Thomas James Ashby Doctor of Philosophy Institute of Computing Systems Architecture School of Physics and School of Informatics University of Edinburgh 2005 Abstract This thesis presents an investigation into the use of advanced computer languages for scientific computing, an examination of performance issues that arise from using such languages for such a task, and a step toward achieving portable performance from compilers by attacking these problems in a way that compensates for the complexity of and differences between modern computer architectures. The language employed is Aldor, a functional language from computer algebra, and the scientific computing area is a. subset of the family of iterative linear equation solvers applied to sparse systems. The linear equation solvers that are considered have much common structure, and this is factored out and represented explicitly in the lan- guage as a framework, by means of categories and domains. The flexibility introduced by decomposing the algorithms and the objects they act on into separate modules has a strong performance impact due to its negative effect on temporal locality. This necessi- tates breaking the barriers between modules to perform cross-component optimisation. In this instance the task reduces to one of collective loop fusion and array contrac- tion. Traditional approaches to this problem rely on static heuristics and simplified machine models that do not deal well with the complex trade-offs involved in targeting modern computer architectures. To rectify this we develop a technique called iterative collective loop fusion that empirically evaluates different candidate transformations in order to select the best available. We apply our technique to programs derived from the iterative solver framework to demonstrate its effectiveness, and compare it against other techniques for collective loop fusion from the literature, and more traditional approaches such as using Fortran, C and/or high-performance library routines. The use of a high-level categorical language such as Aldor brings important ben- efits in terms of elegance of expression, comprehensibility, and code reuse. Iterative collective loop fusion outperforms the other collective loop fusion techniques. Ap- plying it to the iterative solver framework gives programs with performance that is comparable with the traditional approaches. Acknowledgements Firstly I would like to thank my two long-suffering supervisors Mike O'Boyle (com- puter science) and Tony Kennedy (physics), for their motivation, academic guidance, financial support and patience above and beyond the call of duty. Doing interdisci- plinary research is by no means easy, but I hope that ultimately the venture was worth it for all concerned. I would also like to thank anyone who helped me in my work or commented on the thesis to help me improve it, including my viva panel Mark Bull (internal - EPCC) and Paul Kelly (external - Imperial College). On a more general note, my gratitude goes out to the administrative and support staff at the University of Edinburgh, and the people who built all the tools that I have relied on to do my work. This includes the creators of the dictation packages ViaVoice (IBM) and Dragon Naturally Speaking (virtually all the text in this document has been dictated), which, as well as being useful, have also helped to keep me amused with recognition errors, some of which were uncannily perceptive. My favourites include "much hilarity" for "modularity", "Al Gore/Algol" for "Aldor", and "awful banality" for "orthogonality". Secondly, I would like to thank my friends - you know who you are. I especially need to mention two whose companionship along the torturous route to a doctorate meant a great deal to me - Henrik, for providing me with a regular dose of reality (there is no spork), and Björn, for coping with sharing an office with me as well as being a friend. Lastly, there is my family and my partner, without whom I certainly would not have finished. Their support was always given without question, even when they had far, far greater burdens to carry than my own. Words fail me now, as they always will. kyj Dedicated to every unwitting innocent who has ever had the misfortune to ask me: "So, what is it that you actually do then?" Vi Table of Contents 1 Introduction 1 1.1 Computational Science Domain ....................2 1.2 Language ................................3 1.3 Compiler Optimisations .........................4 1.4 Linear Systems .............................6 1.5 Thesis Outline ............................. 7 1.6 Contributions ..............................9 2 Aldor 11 2.1 Fundamentals of the Language ......................11 2.1.1 Domains and categories ....................11 2.1.2 General features ........................14 2.1.3 The core language and the abstract machine ..........15 2.1.4 Basic libraries ...........................17 2.1.5 Storage model .........................18 2.1.6 Purity and overloading .....................18 2.2 Abstract Machine and Compilation Model ...............19 2.2.1 Libraries and whole program optimisation ...........20 2.2.2 FOAM types and variables ...................21 2.2.3 Uniform representation rule ..................21 2.2.4 Memory model .........................22 2.2.5 Aldor domains and generators in FOAM ............22 2.3 Compiler Implementation ........................23 Vii 2.3.1 Pre-existing optimisations 24 2.4 Summary ................................27 3 The Iterative Solvers 29 3.1 Notation .................................29 3.2 Overview ................................30 3.2.1 The form of the problem .....................30 3.2.2 Krylov subspaces ........................30 3.2.3 Halting condition ........................31 3.2.4 Orthogonality conditions ....................31 3.2.5 Reduced (projected) system as interface ............32 3.2.6 Operator structure .......................33 3.3 Generating Krylov Subspaces ......................34 3.3.1 The Arnoldi relation ......................34 3.3.2 Long and short recurrences ...................35 3.3.3 The two-sided Lanczos method ................36 3.4 Orthogonality Conditions and Projected Systems ...........38 3.4.1 Orthogonality conditions and orthogonal Krylov bases . 38 3.4.2 Orthogonality conditions and non-orthogonal Krylov bases 41 3.4.3 Orthogonality conditions and breakdowns ...........41 3.5 Solving the Projected System and Recovering the Solution ......42 3.5.1 Search vectors ..........................43 3.6 The Common Algorithms ........................46 3.6.1 Initial guess ...........................46 3.6.2 Calculating the recurrence residual ...............46 3.6.3 Putting it all together ......................48 3.6.4 Preconditioning .........................49 3.7 Summary ................................50 4 Functional and Algebraic Language Optimisation 51 4.1 Compilation of Functional Languages .................51 4.1.1 Fine-grained function composition and recursion .......52 VII, 4.1.2 Higher order control flow analysis . .............. 52 4.1.3 Polymorphism, boxing and modules ..............54 4.1.4 Arrays ..............................56 4.1.5 Pure languages and the management of state ......... 57 4.1.6 Compiler implementation ...................58 4.1.7 Fusion and loop restructuring in functional languages ..... 59 4.2 Compilation of Numerical Computer Algebra Systems .........60 4.3 Summary ................................61 S Algorithm Framework 63 5.1 Category Hierarchy ...........................63 5.1.1 Linear algebra categories ....................67 5.1.2 Problem specific categories ...................71 5.2 Domain Implementation ........................76 5.2.1 Index functions, recurrences, and infinite sequences ......76 5.2.2 Krylov space recurrence ....................77 5.2.3 Matrix of basis vectors .....................80 5.2.4 Tridiagonal matrix of recurrence coefficients .........80 5.2.5 QR decomposition .......................81 5.2.6 Banded upper triangular factor .................82 5.2.7 Lazy vector domain ........................82 5.2.8 Search vector recurrence ....................82 5.2.9 Matrix of search vectors ....................83 5.2.10 Glue code ............................83 5.3 Evaluation of Framework Design .....................85 5.3.1 Remaining issues ........................87 5.4 Summary ................................89 6 Linear Systems 91 6.1 Partial Differential Equations and Their Discretisation .........91 6.1.1 Grid numbering, matrix layout and stencils ..........92 6.2 Example Operators ............................94 ix 6.2.1 The Laplacian-like simple operators ..............94 6.2.2 The Wilson-Dirac operator ................... 95 6.3 Domain Implementation ........................98 6.3.1 The scalar domains ........................98 6.3.2 The simple stencil operator and vector domains ........100 6.3.3 Subdomains for the Wilson-Dirac problem ...........101 6.3.4 Aggregate structures of subdomains ...............102 6.3.5 The Wilson-Dirac Operator and Vector Domains .......105 6.4 Design Issues ...............................106 6.4.1 Boundary conditions and indexing ...............106 6.5 Summary ..................................107 7 Optimisation across Components 109 7.1 Basic Terminology and Formalisms ..................110 7.1.1 Loops ..............................110 7. L2 Dependencies between loop iterations .............111 7.1.3 Statements