Auto-Tuning Compiler Options for HPC
Total Page:16
File Type:pdf, Size:1020Kb
University of Bath PHD Auto-tuning compiler options for HPC Jones, Jessica Award date: 2019 Awarding institution: University of Bath Link to publication Alternative formats If you require this document in an alternative format, please contact: [email protected] General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal ? Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Download date: 27. Sep. 2021 Citation for published version: Jones, J 2019, 'Auto-tuning compiler options for HPC', Ph.D., University of Bath. Publication date: 2019 Document Version Publisher's PDF, also known as Version of record Link to publication Publisher Rights GNU GPL University of Bath General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Download date: 16. Jul. 2019 Auto-tuning compiler options for HPC submitted by Jessica R. Jones for the degree of Doctor of Philosophy of the University of Bath Department of Computer Science August 2018 COPYRIGHT Attention is drawn to the fact that copyright of this thesis rests with the author. A copy of this thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that they must not copy it or use material from it except as permitted by law or with the consent of the author. This thesis may be made available for consultation within the University Library and may be photocopied or lent to other libraries for the purposes of consultation. Signature of Author . Jessica R. Jones ACKNOWLEDGEMENTS I must frankly own, that if I had known, beforehand, that this book would have cost me the labour which it has, I should never have been courageous enough to commence it. Isabella Beeton There are many people who should be thanked for their patience, understanding and time. These include, but are not limited to, my supervisors, James Davenport, John ffitch and Guy McCusker, my partner, Ed White, my sisters, Eve Jones, Zoe¨ Jones and Rebecca Bonome, and my friends (in no particular order), Martin Brain, Florian Schanda, David Stewart, Mark Cahill, Alaric Snell-Pym, Anita Faul, Mesar Hameed and Ben Pring. All have helped in various ways, from proof-reading to giving advice at odd hours on Sundays, to bringing me cups of tea and making sure some shred of sanity still remains at the end of this. I could not have done it without your support. The HPC support staff at the University of Bath and at the University of Southampton are also due thanks; without the systems they are responsible for, this work would not have been possible. I would also like to thank Jason Ansel, author of OpenTuner, and Clint Whaley, author of ATLAS, for the advice they gave both in person and by email. They both helped me to avoid wasting precious time, and made my life a little bit easier. Considerable thanks are due also to my examiners, who gave up their time to read my work and make suggestions for its improvement. Having acknowledged that I am probably one of the many who “do not know how to write” [121], I know that their insightful comments and thought-provoking questions have greatly aided in achieving the clarity this document requires. I hope I have done justice to their hard work. Last, but not least, I must thank my colleagues at Cray UK Ltd, who put up with my ramblings and kindly agreed to cope without me for weeks while I battled to finish this. The contribution of time is more valuable than you know. Contents List of Figures . 9 List of Tables . 10 List of Algorithms . 11 List of Prototypes . 12 Summary . 13 1 Introduction 14 1.1 Motivation . 16 1.2 Aims of this thesis . 17 1.3 Document overview . 18 2 The landscape 20 2.1 Computational Hardware . 20 2.1.1 Purpose-built supercomputers . 21 2.1.2 Floating point hardware . 22 2.1.3 Multiply-Accumulate . 22 2.1.4 Vectorisation . 23 2.1.5 Accelerators . 23 2.1.6 Computational Speed and FLOPs . 24 2.2 Interconnect . 25 2.3 The power consumption/performance trade-off . 26 2.4 Memory Hardware . 27 2.4.1 Virtual Memory . 27 2.4.2 Caches and Translation Look-Aside Buffers (TLBs) . 29 2.4.3 Caches on modern CPUs . 29 2.5 Symmetric Multi-Processor (SMP) systems . 30 2.5.1 Simultaneous Multi-Threading (SMT) . 31 2.5.2 Non-Uniform Memory Access (NUMA) Architectures . 31 2.5.3 Pipelining . 31 2.6 Compilers . 32 2.6.1 Compilers in practice . 32 2.6.2 Today’s HPC Compilers . 34 2.7 Libraries . 36 2.7.1 Dense Linear Algebra . 36 2.7.2 Fast Fourier Transform (FFT) . 38 2.8 Software . 38 2.9 Operating Systems . 39 3 Related Work 41 4 Contents 3.1 Auto-tuning . 42 3.2 Tuning the choice of compiler flags . 43 3.2.1 A note about optimization levels of compilers . 44 3.3 Distributed compilation . 45 4 Initial Experiences 47 4.1 Case study: When cache matters, a comparison of ATLAS and GotoBLAS . 48 4.1.1 Comparing BLAS library performance . 48 4.1.2 Observations of TOP500 list . 49 4.1.3 Comparing ATLAS and GotoBLAS . 49 4.1.4 How much does the translation look-aside buffer (TLB) really matter (to GotoBLAS)? . 51 4.2 Towards a deeper understanding . 51 4.3 Early motivation . 52 5 Requirements Analysis 53 5.1 Initial High Level Use Case . 53 5.2 Requirements Specification . 54 5.2.1 Initial Requirements . 55 5.2.2 The Prototypes . 56 6 The convergence of benchmarks 61 6.1 Introduction . 61 6.1.1 Background . 62 6.1.2 Method . 63 6.1.3 Environmental constraints . 63 6.1.4 Results . 63 6.1.5 Homogeneous machines are not always uniform . 72 6.1.6 Conclusion . 72 7 Design Specification 74 7.1 Lessons drawn from prototyping . 74 7.1.1 Language choice . 74 7.1.2 Choice of optimization algorithm . 76 7.1.3 Repetition of benchmark . 78 7.1.4 Choice of pseudo-random number generator (PRNG) . 78 7.1.5 Fault Tolerance . 78 7.2 Key design decisions . 79 7.2.1 Language choice . 80 7.2.2 Distribution of work . 80 7.2.3 Search algorithm . 80 7.2.4 Categorisation of flags . 80 7.2.5 Choice of optimization algorithm . 80 7.2.6 Fitness evaluation . 82 7.2.7 Repeated runs of the benchmark . 83 7.2.8 Choice of PRNG . 84 7.3 Robustness . 86 7.3.1 Recording progress and resuming after termination . 86 5 Contents 7.3.2 Terminating cleanly . 86 7.3.3 Robustness against compiler bugs . 86 7.3.4 Fault tolerance . 89 7.3.5 Non-homogeneity . 90 7.4 Conclusion . 90 8 Design Implementation 92 8.1 Component break-down and encapsulation . 93 8.1.1 Unit testing . 93 8.2 Alterations to target software . 96 8.3 Wrapper scripts . 96 8.4 Signal handling . 97 8.5 Difficulties encountered in third-party software . 97 8.5.1 Overly-helpful libraries . 97 8.5.2 Bugs in SQLite . 98 8.6 Conclusion . ..