Research Collection

Doctoral Thesis

Data Movement Optimization for High-Performance Computing

Author(s): Gysi, Tobias

Publication Date: 2019

Permanent Link: https://doi.org/10.3929/ethz-b-000403518

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 26410

DATA MOVEMENT OPTIMIZATION FOR HIGH-PERFORMANCE COMPUTING

A dissertation submitted to attain the degree of Doctor of Sciences of ETH Zurich (Dr. sc. ETH Zurich)

presented by
Tobias Gysi
Dipl. Informatik-Ing. ETH, ETH Zurich
born on the 7th of July 1980
citizen of Buchs AG, Switzerland

accepted on the recommendation of
Prof. Dr. Torsten Hoefler, examiner
Prof. Dr. Thomas C. Schulthess, co-examiner
Dr. Albert Cohen, co-examiner

2019

Tobias Gysi: Data Movement Optimization for High-Performance Computing, © 2019

ABSTRACT

Tuning codes to make efficient use of high-performance computing systems is known to be hard. Programmers have to schedule their computations to thousands of compute cores while keeping the compute and data movement costs in mind. The necessary code transformations – for example, to overlap computation and inter-node communication – are well known. But the complex interplay of hardware and software often prevents programmers from identifying performance bottlenecks and selecting good code transformations. This dissertation introduces compilation frameworks, performance tools, and programming models to tackle these programmability challenges.

Over the last decades, compute performance improved at a much faster pace than memory performance. Data-movement optimizations to reduce the communication and memory access costs thus became much more pressing. We address the problem by automating the selection of data-locality transformations (absinthe) and by adapting the programming model (dcuda) to overlap computation and inter-node communication automatically. The performance models needed to automate the tuning (absinthe and haystack) also provide programmers with valuable insights when optimizing codes manually.

An important algorithmic motif in high-performance computing is the sequential execution of multiple but different stencils. Our compilation framework (absinthe) automates the selection of data-locality transformations for such stencil programs. It has three main components: 1) a transformation algebra, 2) a performance model, and 3) an optimizer. The transformation algebra (modesto) defines the space of possible code transformations, and the learned performance model (absinthe) guides the selection of good code transformations.

In summary, this dissertation contributes compilation frameworks, performance tools, and programming models to foster the application of data movement optimizations in high-performance computing. In particular, we automate the selection of data-locality transformations for stencil programs. We believe our work lays the foundation for future compilation frameworks that support even broader application domains.


ZUSAMMENFASSUNG

It is well known that optimizing codes to make efficient use of high-performance computing systems is difficult. Programmers have to distribute their computations across thousands of compute cores while taking compute and data transfer costs into account. The necessary code transformations – for example, to overlap inter-node communication with computation – are well known. However, the complex interplay of hardware and software often prevents programmers from identifying performance problems and selecting good code transformations. This dissertation presents compilation frameworks, performance tools, and programming models to improve the programmability of high-performance computers.

Over the last decades, compute performance has improved considerably faster than memory performance. Optimizations that reduce communication and memory access costs have therefore become more pressing. We address the problem by automating the selection of data-locality transformations (absinthe) and by adapting the programming model (dcuda) to automatically overlap inter-node communication with computation. The performance models required for this automation (absinthe and haystack) also give programmers valuable insights when optimizing code manually.

An important algorithmic motif in high-performance computing is the sequential execution of multiple but different stencil computations. Our compilation framework (absinthe) automates the selection of code transformations that improve the data locality of such stencil programs. It consists of three main components: 1) a transformation algebra, 2) a performance model, and 3) an optimizer. The transformation algebra (modesto) defines the space of possible code transformations, and the learned performance model (absinthe) guides the selection of good code transformations.

In summary, this dissertation contributes to the development of compilation frameworks, performance tools, and programming models that foster the application of data-locality optimizations in high-performance computing. In particular, we automate the optimization of stencil programs. We believe that our work lays the foundation for future compilation frameworks that support even broader application domains.

ACKNOWLEDGEMENTS

I want to thank Torsten Hoefler for supervising my Ph.D. studies here at ETH Zurich. His scientific guidance was essential to address the right research questions and helped to widen the scope of my work. I am also grateful to Torsten Hoefler and Thomas Schulthess for providing me with the opportunity to return to academia after spending multiple years in industry. I furthermore want to thank my co-examiners, Albert Cohen and Thomas Schulthess, for their effort and their valuable feedback. Special thanks go to Tobias Grosser, who was co-supervising my Ph.D. studies and contributed key ideas to my research.

I value the contributions of all my co-authors, collaborators, and students. I especially enjoyed working with Tobias Grosser, Jeremia Bär, Laurin Brandner, Grzegorz Kwasniewski, Aditya Konduri, Siddharth Bhat, and Alain Denzler on published and yet to be published works. The contributions of my co-authors were essential for the success of my projects. I also want to thank the entire group for the fun birthday parties at the lake and many other memorable moments.

Last but not least, I want to thank my family for their support and my dance friends for many good moments and great dances that were a welcome change from my research work.


PUBLICATIONS

Publications that form the basis of this thesis:

• Tobias Gysi, Tobias Grosser, and Torsten Hoefler. “MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures.” ICS 2015 [1].

• Tobias Gysi, Tobias Grosser, and Torsten Hoefler. “Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot.” PACT 2019 [2].

• Tobias Gysi, Tobias Grosser, Laurin Brandner, and Torsten Hoefler. “A Fast Analytical Model of Fully Associative Caches.” PLDI 2019 [3].

• Tobias Gysi, Jeremia Bär, and Torsten Hoefler. “dCUDA: Hardware Supported Overlap of Computation and Communication.” SC 2016 [4].

Additional publications not part of this thesis:

• Tobias Gysi, Carlos Osuna, Oliver Fuhrer, Mauro Bianco, and Thomas C. Schulthess. “STELLA: A Domain-specific Tool for Structured Grid Methods in Weather and Climate Models.” SC 2015 [5].

• Oliver Fuhrer, Carlos Osuna, Xavier Lapillonne, Tobias Gysi, Ben Cumming, Mauro Bianco, Andrea Arteaga, and Thomas C. Schulthess. “Towards a Performance Portable, Architecture Agnostic Implementation Strategy for Weather and Climate Models.” Supercomputing Frontiers and Innovations, April 2014 [6].


CONTENTS

1 Introduction
  1.1 Automatic Data-Locality Optimization
  1.2 User-Guided Data Movement Optimization
  1.3 Importance of the Programming Model
  1.4 Thesis Contributions
2 A Stencil Algebra
  2.1 Stencil Algebra
    2.1.1 Definition of a Stencil Program
    2.1.2 Example
    2.1.3 Data Locality Transformations
    2.1.4 Stencil Algebra Definition
    2.1.5 Performance Modeling
    2.1.6 Stencil Program Analysis
  2.2 Case Study
    2.2.1 STELLA
    2.2.2 Stencil Program Optimization
  2.3 Evaluation
  2.4 Related Work
  2.5 Summary of the Approach
3 A Learned Performance Model
  3.1 Background
    3.1.1 Architecture Overview
    3.1.2 Stencil Sequences
    3.1.3 Data-Locality Transformations
  3.2 Modeling
    3.2.1 Stencil Sequences
    3.2.2 Data-Locality Transformations
    3.2.3 Performance Model
    3.2.4 Learning the Performance Model
  3.3 Optimization
    3.3.1 Linearizing Multiplications
    3.3.2 Modeling Stencil Groups
  3.4 Evaluation
    3.4.1 Setup & Methodology
    3.4.2 Implementation
    3.4.3 Learning the Target Systems
    3.4.4 Tuning the Application Kernels
    3.4.5 Comparison with Halide and PolyMage
  3.5 Related Work
  3.6 Summary of the Approach
4 An Analytical Cache Model
  4.1 Background
    4.1.1 Hardware Model
    4.1.2 Cache Misses
    4.1.3 Integer Sets and Maps
    4.1.4 Static Control Programs
  4.2 Cache Model
    4.2.1 Computing the Stack Distance
    4.2.2 Counting the Capacity Misses
    4.2.3 Eliminating Non-Affine Terms
    4.2.4 Counting the Compulsory Misses
    4.2.5 Computational Complexity
  4.3 Evaluation
    4.3.1 Setup and Methodology
    4.3.2 Accuracy Overview
    4.3.3 Performance Overview
    4.3.4 Comparison to PolyCache and Dinero
    4.3.5 Performance for Tiled Codes
  4.4 Related Work
  4.5 Summary of the Approach
5 A Communication-Hiding Programming Model
  5.1 Programming Model
    5.1.1 Distributed Memory
    5.1.2 Combining MPI & CUDA
    5.1.3 Example
    5.1.4 Discussion
  5.2 Implementation
    5.2.1 Architecture Overview
    5.2.2 Communication Control
    5.2.3 Performance Optimization
    5.2.4 Discussion
  5.3 Evaluation
    5.3.1 Experimental Setup & Methodology
    5.3.2 Microbenchmarks
    5.3.3 Mini-applications
  5.4 Discussion
  5.5 Related Work
  5.6 Summary of the Approach
6 Conclusions and Future Work
  6.1 Future Work
    6.1.1 Automatic Performance Model Design
    6.1.2 Scaling to Real-World Applications
    6.1.3 Cache Optimal Programs

Bibliography

1 INTRODUCTION

Over time, the cost of data movement steadily gained importance and started to dominate the overall computational cost – both in terms of execution time and energy consumption. Analyzing and reducing the cost of data movement became an important concern in high-performance computing [7]. But applying data movement optimizations [8–10] increases the code complexity and often requires non-trivial parameterization. Domain-specific tools [5, 11–13] or compilers [14, 15] may hide the increased code complexity, but choosing optimal parameters remains hard. Examples for the necessary parameterization are tile size and fusion choices or the ratio of the inner to the outer domain when overlapping computation and inter-node communication.

At the same time, the hardware landscape in high-performance computing became much more diverse. Heterogeneous systems equipped with accelerators more and more supersede the previously dominating multi-core machines. The resulting diversity complicates the development of a single source code that performs well on today's and tomorrow's systems, especially since data-locality transformations are very much system-specific. As a result, the model of having expert programmers that in a heroic effort tune the codes for every new hardware generation does not scale anymore.

In this thesis, we use the COSMO atmospheric model [16] as our primary motivational example. The code is used for operational weather forecasting [17] and climate modeling [18]. COSMO consists of more than 300,000 lines of code that mostly implement stencil computations (finite differences). Its low arithmetic intensity in combination with its sheer size makes this real-world application a good testbed for data-locality transformations. The current version of COSMO already targets both CPU and GPU systems [6] using a single source code. To this end, the dynamical core of COSMO was rewritten in the domain-specific language STELLA [5], which implements target-specific data-locality transformations for both CPU and GPU. STELLA automates the code generation but still requires manual interventions to select the tile sizes and to fuse stencils. The goal of this thesis is to enable the development of domain-specific languages that provide true performance portability by automatically selecting good data-locality transformations.

Auto-tuning is the swiss-army knife for the selection and parameterization of good code transformations. It relies on the empirical evaluation of different implementation variants to select good code transformations. Existing auto-tuning frameworks [19–22] implement ready-to-use search strategies that enable the efficient exploration of the search space. Yet, compiling and running different implementation variants requires target system access during compilation and can become prohibitively expensive depending on the size of the search space.

An alternative is offered by optimization frameworks [13, 23–25] that rely on heuristics and analytical models. In this thesis, we focus on the development of analytical performance models and optimization strategies for selecting data-locality transformations. We show that analytical performance models enable fast search space exploration while being accurate enough to guide the optimization. Model-based optimization is a promising approach that in many cases provides a good set of initial transformations that, if required, may be further refined using auto-tuning.

1.1 Automatic Data-Locality Optimization

All programming models and domain-specific languages that aim at performance portability have to apply target system-specific code transformations. Loop fusion and tiling [8, 26, 27] are examples of important data-locality transformations in high-performance computing. The fusion space alone is exponential in the number of loop nests, and for every fused loop nest, a range of possible tile sizes exists. The large search space and the interdependence of the parameter choices – fusion and tile size need to be considered in tandem – make the optimization challenging.

An automatic optimization framework has to formalize the space of possible code transformations and select the most beneficial ones. Several approaches exist to describe the search space for polyhedral programs [28–30], high-level programs written using algorithmic skeletons [31], and domain-specific programs [12, 32, 33]. Possible techniques to select code transformations are heuristics [34], performance models [24, 35], and empirical evaluation [19, 20]. We focus on the performance model guided selection of data-locality transformations. The mathematical structure and the fast evaluation of a well-designed performance model can considerably accelerate the search space exploration compared to empirical tuning. A downside of the model-based approach is the limited model accuracy that prevents us from guaranteeing optimality for the selected code transformations.

Figure 1.1: An optimization problem that implements the transformation algebra (program variants P) and the performance model (learned model y ≈ f(x)) in a single optimizer (min f(x) s.t. x ∈ P) enables the automatic selection of data-locality transformations.

Data-locality transformations adapt the program schedule to improve the spatial and temporal locality of the computation. After the optimization, computations that access the same data are ideally scheduled to the same compute core, and the shared data is stored in cache or registers. At the same time, increasing data locality often reduces the available parallelism. We thus need to balance data locality and parallelism [36, 37] while scheduling the computations.

Figure 1.1 shows the main components of a typical automatic optimization framework: 1) an algebra that defines the space of possible code transformations, 2) a performance model to evaluate the effects of the selected code transformations, and 3) an optimizer to explore the search space. All components need to interface with the code generation to automate the optimization process.

transformation algebra The core of every automatic optimization framework is a transformation algebra that formalizes the space of possible code transformations. We defined the stencil algebra modesto to enumerate the space of possible data-locality transformations for stencil programs. The algebra specifies the execution order and the fusion and tile size choices for all stencils of the program. Every element of the algebra represents a program variant that performs the same computation but has different performance characteristics.


performance model The performance model estimates the cost of the different program variants. We developed the stencil program optimizer absinthe and the analytical cache model haystack. Both of them can be used to select code transformations. absinthe models the latency and throughput of the stencil loop nests. The simple linear model enables efficient program optimization using linear integer programming. haystack models fully associative caches with a least recently used replacement policy. The model is more complex but can compute the cache misses for arbitrary polyhedral programs.

optimizer An automatic optimization framework searches for the program variant that minimizes the cost, for example, by integrating the transformation algebra and the performance model into a single optimization problem. absinthe uses linear integer programming to select optimal code transformations with respect to the performance model. We thereby rely on an existing state-of-the-art solver [38] that is guaranteed to find the optimal solution.

In general, data-locality transformations such as tiling and fusion are interdependent. For example, the optimal tile size of a fused loop nest typically decreases due to the larger memory footprint after fusion. The optimization problem formulated by absinthe performs single shot fusion and tile size choices to account for the interdependence of the different data-locality transformations. The simplicity of both transformation algebra and performance model additionally enables the formulation of a linear optimization problem that can be solved efficiently. We thus believe the development of an automatic optimization framework requires a holistic view of transformation algebra, performance model, and optimization method.
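To make the interplay of the three components concrete, the following sketch enumerates a toy space of fusion and tile-size choices and picks the variant with the lowest cost under an invented linear cost model. It replaces the integer linear program used by absinthe with plain enumeration, and all coefficients and candidate tile sizes are made up for illustration.

#include <algorithm>
#include <cstdio>
#include <vector>

// One candidate program variant: fuse the two stencils or not, and a tile size.
struct Variant { bool fuse; int tile; };

// Invented linear cost model: a memory traffic term plus a halo/overhead term
// that grows when fusion enlarges the working set.
double cost(const Variant& v) {
  double traffic = v.fuse ? 1.0 : 2.0;              // fusion removes one pass over the data
  double overhead = (v.fuse ? 2.0 : 1.0) / v.tile;  // larger footprint -> larger halo cost
  return 100.0 * traffic + 400.0 * overhead;
}

int main() {
  std::vector<Variant> space;
  for (bool fuse : {false, true})
    for (int tile : {16, 32, 64, 128})
      space.push_back({fuse, tile});          // enumerate the transformation space
  const Variant best = *std::min_element(space.begin(), space.end(),
      [](const Variant& a, const Variant& b) { return cost(a) < cost(b); });
  std::printf("best variant: fuse=%d tile=%d cost=%.1f\n",
              best.fuse, best.tile, cost(best));
  return 0;
}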

1.2 User-Guided Data Movement Optimization

Automatic tuning is only available for some application domains, and the results may be suboptimal. In these cases, programmers have to fall back to manual optimization.

Figure 1.2: Screenshot of haystack analyzing matrix multiplication.

But the large space of possible data-locality transformations and the complexity and heterogeneity of the available hardware architectures make manual tuning challenging. Tools that help programmers to identify performance bottlenecks and to evaluate the effectiveness of their code transformations are thus an important concern.

Almost all modern processors rely on caches to reduce data movement and to hide memory access latencies. Caches not only improve performance but unfortunately also make understanding the memory access cost hard. Programmers need to know the state of the cache to estimate the memory access cost. As a result, the cost of data movement depends on global state and does not compose. The state of the cache depends on the exact memory access history, and minor changes of the memory access history can have significant effects on the cache efficiency. Let us assume a least recently used replacement policy. If a program implements two identical loop nests that access an array with the size of the cache, then all accesses of the second loop nest are cache hits. But if the program performs an additional memory access between the two loop nests, then all memory accesses of the second loop nest are cache misses. This example illustrates that understanding the cache state requires an exact rather than an approximate understanding of the memory access history.

Programmers may have a notion of the relative cost of different implementation variants. But having the exact memory access history in mind is hard. We thus developed the analytical cache model haystack that computes exact cache miss information to provide programmers the means to estimate the cost of data movement.

The screenshot in Figure 1.2 shows the output of haystack for generalized matrix multiplication. The tool analyzes the cache misses for every memory access of the program and prints the percentage of compulsory and capacity misses relative to the total number of memory accesses. This fine-grained analysis allows programmers to estimate the memory access cost of individual loop nests and statements. The tool also computes the total number of compulsory and capacity misses. These absolute numbers are helpful when comparing different implementation variants. For example, we may determine if a tiling is effective by running the tool on a tiled and an untiled implementation variant of the same program.
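The loop-nest example above can be reproduced with a few lines of code. The following sketch simulates a fully associative cache with least recently used replacement, using an invented cache size of eight lines, and counts the misses of the second loop nest with and without the additional access in between.

#include <cstddef>
#include <cstdio>
#include <list>

// Simulate a fully associative cache with LRU replacement and count the misses.
struct LruCache {
  std::size_t capacity;            // number of cache lines
  std::list<long> lines;           // most recently used line at the front
  long misses = 0;
  explicit LruCache(std::size_t c) : capacity(c) {}
  void access(long line) {
    for (auto it = lines.begin(); it != lines.end(); ++it) {
      if (*it == line) { lines.erase(it); lines.push_front(line); return; }  // hit
    }
    ++misses;                      // miss: insert and evict the least recently used line
    lines.push_front(line);
    if (lines.size() > capacity) lines.pop_back();
  }
};

int main() {
  const long n = 8;                // the array exactly fills the cache
  for (bool disturb : {false, true}) {
    LruCache cache(n);
    for (long i = 0; i < n; ++i) cache.access(i);   // first loop nest
    if (disturb) cache.access(n);                   // one additional access in between
    long before = cache.misses;
    for (long i = 0; i < n; ++i) cache.access(i);   // second loop nest
    std::printf("disturbing access: %d -> misses in second loop: %ld\n",
                disturb, cache.misses - before);
  }
  return 0;
}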

1.3 Importance of the Programming Model

A high-performance computing programming model ideally abstracts low-level implementation details of the target system to improve performance portability. At the same time, the programming model should provide the necessary control to attain optimal performance. These two targets are conflicting and sometimes result in design choices that make specific hardware features inaccessible.

Overlapping computation and inter-node communication [10, 39] is an important data movement optimization in distributed memory computing. We can manually implement it by splitting the compute domain of every node into an inner and an outer domain. We may then overlap the inter-node communication with the computation on the inner domain. The manual application of this optimization is tedious, and its effectiveness depends on the size of the two domains. The computation on the inner domain has to take long enough to overlap the communication. At the same time, both the inner and the outer domain have to be large enough to avoid performance penalties due to the reduced parallelism.

In GPU computing, sufficient amounts of parallelism are a prerequisite to attain optimal performance [40]. Splitting the compute domain to overlap computation and inter-node communication may thus harm the overall performance. Instead, it seems desirable to use the built-in hardware latency hiding – over-subscription and hardware threading – to hide the inter-node communication. But the existing GPU programming models do not provide communication primitives that benefit from this hardware latency hiding.

We developed the dcuda programming model that implements device-side communication primitives to take advantage of the hardware latency hiding. If a thread waits for incoming data, the GPU immediately proceeds with the execution of another thread that is ready for execution and thus automatically hides the inter-node communication latency. The project demonstrates the importance of an expressive programming model that provides access to all relevant hardware features.
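For contrast, the manual overlap described above roughly takes the following shape on the host side. The sketch assumes hypothetical update_inner() and update_outer() functions that trigger the computation on the two domains (for example by launching GPU kernels), and shows the halo exchange for a single neighbor pair only; dcuda removes the need for this manual domain split.

#include <mpi.h>

void update_inner();   // hypothetical: compute the inner domain (e.g., launch a GPU kernel)
void update_outer();   // hypothetical: compute the outer domain once the halos arrived

// One time step with manual overlap of computation and inter-node communication.
void timestep(double* send_buf, double* recv_buf, int halo_count,
              int up, int down, MPI_Comm comm) {
  MPI_Request requests[2];
  // post the halo exchange with the neighboring nodes
  MPI_Irecv(recv_buf, halo_count, MPI_DOUBLE, up, 0, comm, &requests[0]);
  MPI_Isend(send_buf, halo_count, MPI_DOUBLE, down, 0, comm, &requests[1]);
  // compute the inner domain while the messages are in flight
  update_inner();
  // wait for the halos and compute the remaining outer domain
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
  update_outer();
}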

1.4 Thesis Contributions

This thesis makes the following main contributions:

• In Chapter 2, we present the stencil algebra modesto that defines the space of possible data-locality transformations for stencil programs. A stencil program executes a sequence of different stencils that depend on each other.

• In Chapter 3, we introduce the stencil program optimizer absinthe that learns a performance model to select good data-locality transformations. A single integer linear program optimizes loop fusion and loop tiling in tandem.

• In Chapter 4, we discuss the analytical cache model haystack that computes the cache misses for polyhedral programs. The model provides both programmers and compilers with exact cache miss information to support the choice of suitable data-locality optimizations.

• In Chapter 5, we introduce the programming model dcuda that enhances the CUDA programming model with device-side inter-node communication. This extension enables the automatic overlap of communication and computation.

We believe that optimizing the data movement of high-performance computing codes requires better tools, compilers, and programming models. This thesis contributes to all three domains and also motivates further research. The ultimate goal is the development of code generators that learn the performance of the target system and then lower domain-specific code to optimized code that makes optimal use of the hardware. absinthe demonstrates the feasibility of this vision for the stencil domain.

2 A STENCIL ALGEBRA

Stencil computations on regular domains are one of the most important algorithmic motifs in embedded, high-performance, and scientific computing. Applications range from climate modeling [41], seismic imaging [42], fluid dynamics, heat diffusion and electromagnetic simulations [43] through image processing [12] to machine learning. Given their importance, numerous optimization strategies [44–46] and domain-specific languages [11, 12, 22] exist. Yet, most of these schemes consider the optimization of a single stencil in isolation. Many applications, however, require nested stencils [35] that are applied in succession. The data dependencies of these nestings can form complex directed acyclic stencil graphs where multiple stencils need to be optimized in tandem to achieve highest performance.

Stencil programs perform element-wise computations on a fixed neighborhood called the stencil. Such stencils often have low arithmetic intensity because they have a fixed number of operations per loaded value. The biggest challenge is to map stencil programs to modern architectures with a growing gap between memory and compute bandwidth. Such architectures require data-centric optimizations that arrange data accesses to efficiently use the available memory bandwidth. Complex stencil graphs can be optimized using various techniques such as loop fusion, tiling, and various communication strategies on subgraphs. We model all possible combinations of optimizations for a particular stencil program (graph) using a stencil algebra and apply mathematical optimization techniques to find the best combination specific to an abstract hierarchical machine model.

Since typical stencil programs contain hundreds of stencils arranged in paths with dozens of stages and several input arrays, manual tuning of all options is infeasible. In fact, the number of stencil program variants is usually exponential in the number of stencils. In addition, the optimal stencil program variant is specific to each architecture. We show how to fully automate the optimization and implement it in our open-source tool modesto, a model driven stencil optimization framework. We demonstrate the efficacy of our method using the real-world application COSMO [41], a numerical weather prediction and regional climate model used by more than 10 national weather services and many other institutions. The dynamical core of COSMO, a central part of its implementation, applies more than 150 stencils, each operating on 13 arrays on average.

This most performance-critical code has been rewritten using the STELLA library and was carefully tuned by experts for optimal performance. modesto-optimized stencil graphs match or improve upon the expert-tuned code by a factor of 1.0-1.8x. This demonstrates how our technique enables next generation stencil libraries that completely abstract optimizations from the library interface. Hence, we are able to improve usability as well as performance portability compared to state-of-the-art stencil libraries such as STELLA [6] or Halide [12]. In summary, we make the following contributions:

• We introduce a set of data-centric code transformations, an algebraic formulation of the transformation space, and a compile-time performance model that enables the automatic optimization of complex stencil programs.

• We evaluate our approach by modeling the optimization of stencil codes written using the STELLA library and successfully tune kernels of a real-world application.

• We formulate the automatic tuning of stencil programs as a mathematical optimization problem and solve it using dynamic programming techniques.

2.1 Stencil Algebra

Although the stencil motif appears in a wide range of codes from various application domains, common patterns can be identified. Using them, we introduce a stencil algebra that formalizes stencil computations and facilitates their analysis and optimization.

2.1.1 Definition of a Stencil Program

The following core elements describe a stencil program:

A field F defines a dense, multi-dimensional and commonly hyperrectangular set of data values, which can be read and modified.

A stencil S is a computation that derives a value located in an output field from a set of input field values located within bounded distance to the output value. It is described by the triple (ops, out, in). The first element, ops, specifies the (possibly approximated) computational cost of executing this stencil. The second element, out, is the output field of the stencil. The third element, in, is a set that defines the input elements of the stencil. The elements of the input set are so-called “named vectors” that are named according to the field the input is read from, and the vector itself describes the location of the input element as a relative offset to the position of the element the stencil computes. The set of input elements in is not allowed to contain elements of the output field. We define an example stencil s that computes the value F0(i, j) from the inputs F1(i, j), F1(i, j + 1) and F2(i, j) with 5 computational operations using the following notation:

s := (5, F0, {F1(0, 0), F1(0, 1), F2(0, 0)})

A stencil program P = T ∪ O consists of a set of temporary stencils T as well as a set of output stencils O, where the results computed by the output stencils form the result of the stencil program, but the results of the temporary stencils are not made available outside of the stencil program. All stencils and fields have the same dimensionality.

The program definition just introduced is formulated in a minimalistic way and with a strong focus on stencil graphs. As a consequence, it omits aspects that in the context of this work are of limited importance, e.g., boundary conditions, variable input field dimensionality, as well as complex dynamic control flow. However, programs that use such concepts can, in many cases, still be modeled. For example, stencils with varying input sets, due to the use of special boundary conditions, can often be modeled with an over-approximated input set, and iterative stencil computations can be modeled by (partially) unrolling the relevant time loop.
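As an illustration only, the triple (ops, out, in) maps directly onto a small data structure. The C++ sketch below is one possible encoding of the definition (restricted to two-dimensional stencils) and is not the representation used by modesto.

#include <array>
#include <string>
#include <vector>

// A "named vector": the field an input is read from plus its relative offset.
struct Access {
  std::string field;
  std::array<int, 2> offset;   // two-dimensional stencils for simplicity
};

// A stencil s = (ops, out, in) as defined above.
struct Stencil {
  int ops;                     // (approximate) computational cost per evaluation
  std::string out;             // output field
  std::vector<Access> in;      // input accesses relative to the output position
};

// The example stencil s := (5, F0, {F1(0, 0), F1(0, 1), F2(0, 0)}).
const Stencil s{5, "F0", {{"F1", {0, 0}}, {"F1", {0, 1}}, {"F2", {0, 0}}}};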

2.1.2 Example

We now present an example stencil program which is derived from a horizontal diffusion kernel used by the COSMO atmospheric model [41]. We define the stencil program Phd in terms of the temporary stencils slap, sfli, and sflj necessary to evaluate the output stencil sout. A data dependency either refers to an input field loaded by the stencil program, such as in or wgt, or to a temporary field computed by the corresponding temporary stencil, such as fli, flj, or lap:

Figure 2.1: Horizontal diffusion dependency graph annotated with stencil (c) and stencil program (a, b, and d) access patterns, where d = (a ∪ b) ⊕ c

slap := (5, lap, {in(−1, 0), in(1, 0), in(0, −1), in(0, 1), in(0, 0)})

sfli := (1, fli, {lap(1, 0), lap(0, 0)})

sflj := (1, flj, {lap(0, 1), lap(0, 0)})

sout := (5, out, {fli(−1, 0), fli(0, 0), flj(0, −1), flj(0, 0), wgt(0, 0)})

Phd := {slap, sfli, sflj} ∪ {sout}

Figure 2.1 illustrates the data flow of the stencil program using a directed graph, whose black and white nodes represent input fields and stencils, respectively. Arrows that do not point to a node and consequently exit the stencil graph model the outputs of the stencil program. A directed edge in the graph corresponds to a flow dependency between two nodes. We annotate each incoming edge of a stencil with the access pattern necessary for a single stencil evaluation. For instance, a single evaluation of the lap stencil accesses the in field at the five offsets shown by c. In addition, we annotate all outgoing edges of a stencil or an input field with the accumulated access pattern necessary to evaluate the out stencil at a single position. E.g., the lap stencil is evaluated at the positions defined by the union of the sets a and b. We compute the accumulated in field access pattern d as the Minkowski sum d = (a ∪ b) ⊕ c, with a ⊕ b = {a′ + b′ | a′ ∈ a, b′ ∈ b}.

Figure 2.2 shows a naive implementation of the horizontal diffusion kernel, which executes each stencil using a separate loop nest. While such an implementation may be straightforward to write, it is not efficient in terms of data locality, memory usage, or parallelism.

// allocate temporary storage
Field<double> lap(ibegin, iend), fli(ibegin, iend), flj(ibegin, iend);
// apply the lap stencil
for (int i = ibegin - 1; i < iend + 1; ++i)
  for (int j = jbegin - 1; j < jend + 1; ++j)
    lap(i, j) = in(i - 1, j) + in(i + 1, j) + in(i, j - 1) + in(i, j + 1) - 4.0 * in(i, j);
// the fli, flj, and out stencils are applied by analogous separate loop nests (omitted)

Figure 2.2: Naive implementation of the simplified horizontal diffusion example used by the COSMO [41] atmospheric model

2.1.3 Data Locality Transformations

To improve the data locality of stencil programs, we discuss code transformations that combine loop tiling and loop fusion. While tiling sub-divides the loop domain into typically hyperrectangular tiles of limited size, fusion substitutes a sequence of loops by a single loop. Applied to stencil codes, we divide the stencil evaluation domain into tiles and apply multiple stencils tile-by-tile. Consequently, we can store temporary values in smaller buffers that hold the working set of a single tile instead of the full evaluation domain.

While tiling increases the data locality, it causes additional synchronization efforts at the tile boundaries. As shown in Figure 2.1, a single stencil evaluation depends on one or more input or temporary fields accessed in a local neighborhood. When combining multiple stencils, the neighborhoods grow depending on the stencil access patterns and the longest path in the dependency graph. We call all dependencies outside of the tile domain the halo points of a tile. In addition, we suggest three halo strategies that trade off parallelism against computation. Figure 2.3 shows the iteration space of one dependency path in the horizontal diffusion example, once without any tiling and then with different tiles as they result from the suggested halo strategies. Shaded regions mark the points that belong to a specific tile.

Figure 2.3: Tile shapes (shaded) for different tilings (no tiling, of, hp, and hs) applied to a subset of the horizontal diffusion example projected to the i-dimension

computation on-the-fly (of) satisfies all halo point dependencies using redundant computation at the tile boundaries. Hence, we load input fields and evaluate temporary stencils in an extended domain covering the tile itself as well as its halo points. Using computation on-the-fly, we can update different tiles independently postponing synchronization at the cost of additional computation. As shown by Figure 2.3, computation on-the-fly results in overlapping tiles and is therefore often referred to as overlapped tiling [8, 9, 12].

halo exchange parallel (hp) satisfies all halo point dependencies using communication with neighboring tiles. More precisely, we update all tiles in parallel and perform at least one halo exchange communication per edge in the longest dependency chain of the stencil dependency graph. Hence, halo exchange parallel avoids redundant computation at the cost of additional synchronizations.

halo exchange sequential (hs) modifies the tile shape such that all unsatisfied halo point dependencies point in one direction. By iterating over the tiles in reverse dependency direction, we can update all tiles sequentially using a single sweep. While halo exchange sequential in general applies to one-dimensional tilings only, we can complement it with other halo strategies to support higher dimensional tilings. In summary, halo exchange sequential avoids redundant computation and synchronizations at the cost of being sequential.

As the surface to volume ratio decreases with increasing tile size, we preferably update small tiles using halo exchange communication and large tiles using computation on-the-fly. Depending on the hardware architecture, high synchronization costs make computation on-the-fly attractive. Overall, choosing the optimal data locality transformations is not straightforward and motivates the use of a performance model.

Figure 2.4: Stencil dependency graph of the horizontal diffusion example annotated with two tiling hierarchy levels ((256, 256) / of at the bottom of the hierarchy and (32, 32) / hp for the two nested stencil groups)

2.1.4 Stencil Algebra Definition

Using the data locality transformations introduced in the previous section, we are able to generate a large number of stencil program implementation variants. In particular, we can repeatedly apply our tiling transformations to obtain a hierarchical tiling that leverages multiple levels of the memory hierarchy. By combining our data locality transformations, we are therefore able to cover most of the established stencil implementation techniques. Next, we formally define a stencil algebra whose elements express different stencil program implementation variants and show how to enumerate them.

Figure 2.4 shows an implementation variant of the horizontal diffusion example, introduced in Section 2.1.2, annotated with two tiling hierarchy levels. Each white node corresponds to a stencil and each black node to a storage region that buffers either an input or a temporary field.

We extend the dependency graph with boxes that represent the tiling hierarchy. More precisely, the boxes form a tiling tree where each box corresponds to a tiling that executes all contained boxes respectively stencils. Finally, we annotate each box with the tile size and the halo strategy of the tiling. In Figure 2.4 we employ an on-the-fly tiling at the bottom of the tiling hierarchy with two nested halo exchange parallel tilings.

In order to specify an element of our stencil algebra, we initially define a tiling hierarchy. More precisely, we define a tile size t^l ∈ Z^n for each level l of the tiling hierarchy. In case of the horizontal diffusion example we define two tiling hierarchy levels:

t_hd^1 = (256, 256)    t_hd^2 = (32, 32)

Next, we specify a stencil program implementation variant as a bracket expression. We put all stencils that correspond to a specific tiling hierarchy level into brackets. A hierarchical tiling thus results in a nested bracket expression with the outermost bracket term representing the bottom of the tiling hierarchy. We can define the horizontal diffusion implementation variant shown by Figure 2.4 using a twofold nested bracket expression.

[[slap, sfli], [sflj, sout]]

In the following, we call each bracket term, representing one tiling hierarchy level, a stencil group. A stencil group can be seen as a node of the tiling tree containing nested stencils or stencil groups that as a whole define the stencil program implementation variant. Let g be a stencil group, then g.child is the set of all children of the stencil group g, where a child is either a stencil or a nested stencil group. In addition, g.sten is the set of all stencils in the subtree defined by the stencil group g. Finally, g.in and g.out define the input and output sets of a stencil group g, where an input and an output correspond to an incoming respectively to an outgoing data dependency. As an example, we provide the stencil properties of the horizontal diffusion example shown in Figure 2.4.

g0 = [g1, g2]    g1 = [slap, sfli]    g2 = [sflj, sout]

First, we define the tree properties:

g0.child = {g1, g2}    g0.sten = {slap, sfli, sflj, sout}

Next, we define the external data dependencies:

g0.in = {in, wgt}    g0.out = {out}

We enumerate all stencil program implementation variants using two operations: 1) shuffle the stencils respecting their topological order and 2) group stencils on different tiling hierarchy levels.

2.1.5 Performance Modeling

In order to understand the performance characteristics of a stencil program implementation variant, we next introduce a performance model. Similar to the Roofline model [47], we estimate the execution time based on the peak compute and communication throughput of the target hardware. In addition, we do not only distinguish between cached and global memory accesses but model additional memory hierarchy levels.

To model our target hardware we use an abstract machine that is built around a processing unit that performs computations on a limited set of local registers. All data is by default stored in a global memory (e.g., DRAM) with limited bandwidth to the processing unit. Data is transferred from global memory to local registers before any computation is performed, and the results of a computation are transferred back to global memory before becoming externally visible. Between global memory and local registers there is a set of additional hierarchically organized memory levels, each with limited size, but increasing bandwidth to the processing unit.

When mapping a parallel hardware architecture to our model, the bandwidth of a given memory hierarchy level is the combined bandwidth of all (possibly multiple) memories at this level. The size of a memory hierarchy level is not the combined size, but the size of an individual memory at this level. E.g., assuming there are multiple L1 caches, we consider the size of a single L1 cache. Finally, assuming sufficient parallelism to simultaneously use all processing resources, the compute throughput of our model is the combined peak compute throughput of the hardware architecture.

We now consider again Figure 2.4, an illustration of a stencil program implementation variant with two tiling hierarchy levels that was introduced in the previous section. Each tiling hierarchy level targets one specific level of the memory hierarchy, such as the DDR memory or the L1 cache of a CPU. We assume all input and temporary values of a stencil group are stored in the associated memory hierarchy level. Whenever a stencil program communicates data from one tiling hierarchy level to the next higher one, we model the communication time using the bandwidth of the associated memory hierarchy level. Therefore, we define a communication bandwidth V^l ∈ R as well as a memory capacity M^l ∈ Z for each level l of the tiling hierarchy.

Figure 2.5: The time estimation for the horizontal diffusion example (stencil times t_lap, t_fli, t_flj, and t_out, stencil group times t_[lap,fli] and t_[flj,out], and the total time t_[[lap,fli],[flj,out]])

In addition to this vertical communication, a stencil code might also perform lateral halo exchange communication between neighboring tiles of the tiling hierarchy. Hence, we define a lateral communication bandwidth L^l ∈ R for each level l of the tiling hierarchy. Typical representatives of lateral communication links are interconnect networks or the scratch pad memory of a GPU. Finally, we define the compute throughput C ∈ Z of the target architecture. Thereby, we define storage sizes in terms of floating point values instead of bytes. In case two nested tiling hierarchy levels are associated to the same memory hierarchy level, we set the vertical communication bandwidth to infinity. Just like the Roofline model, we assume that we can overlap communication and computation on all communication links respectively compute units of the system. When modeling the performance of a stencil program, we assume that the arithmetic intensity remains constant during the execution of a single stencil. On the other hand, the arithmetic intensities of different stencils might vary.

Figure 2.5 illustrates the time estimation for the horizontal diffusion implementation variant shown by Figure 2.4. At the top of the tiling hierarchy, black boxes denote the stencil execution times. Below, gray boxes (with flashes) denote the communication times between parents and children in the tiling hierarchy. Furthermore, white boxes denote the stencil group execution times computed as the sum of the maximum between stencil execution times and communication times. In particular, we estimate the execution time t_s of a stencil s that performs c_s floating point operations as the time needed to compute the stencil without considering any communication cost:

t_s = c_s / C

Using the child execution time t_c of a child stencil or stencil group c that causes v_c vertical and l_c^1, ..., l_c^l lateral data movements, we compute the execution time t_g of a stencil group g that corresponds to level l of the tiling hierarchy as the sum of the maximum of the child execution times, the vertical communication between the stencil group and its children, and the lateral communication necessary to update the halo points of the temporary fields. We thereby optimistically assume the lateral communication overlaps with the child execution, which assumes the lateral communication is sufficiently balanced over the stencil group execution:

t_g = ∑_{c ∈ g.child} max(t_c, v_c / V^l, l_c^1 / L^1, ..., l_c^l / L^l)

We model the performance of an entire stencil program as the estimated execution time of the stencil group at the bottom of the tiling hierarchy. Furthermore, we complement the performance estimation with a feasibility check that compares the storage requirements of the stencil program to the available memory capacity on all tiling hierarchy levels.
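The two formulas translate directly into a recursive cost computation over the tiling tree. The following sketch is a simplified illustration that models only a single lateral data movement per child; it is not the modesto implementation.

#include <algorithm>
#include <memory>
#include <vector>

// A node of the tiling tree: either a stencil (leaf) or a stencil group.
struct Node {
  double ops = 0;                      // c_s: floating point operations (leaf only)
  double vertical = 0;                 // v_c: data moved to/from the parent level
  double lateral = 0;                  // l_c: halo exchange volume on this level
  std::vector<std::unique_ptr<Node>> children;
};

struct Machine {
  double C;                            // compute throughput
  std::vector<double> V;               // vertical bandwidth per tiling level
  std::vector<double> L;               // lateral bandwidth per tiling level
};

// t_s = c_s / C for leaves; t_g = sum over children of max(t_c, v_c/V^l, l_c/L^l).
double estimate(const Node& n, const Machine& m, int level) {
  if (n.children.empty()) return n.ops / m.C;
  double time = 0;
  for (const auto& c : n.children) {
    double tc = estimate(*c, m, level + 1);
    time += std::max({tc, c->vertical / m.V[level], c->lateral / m.L[level]});
  }
  return time;
}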

2.1.6 Stencil Program Analysis

In order to evaluate our performance model, we analyze stencil programs using the mathematical concept of affine sets and affine maps. In particular, we show how to count the number of floating point operations, data movements, and storage locations required during the stencil program execution. Using the performance model introduced in Section 2.1.5, our analysis finally allows us to estimate the execution time and the feasibility of a stencil program.

2.1.6.1 Affine Sets and Maps

An affine set S = {i⃗ | i⃗ ∈ Z^n ∧ cons(i⃗)} is a set of n-dimensional integer vectors, where the elements of the set are constrained by a Presburger formula cons(i⃗). Presburger formulas consist of comparisons (<, ≤, =, ≠, ≥, >) between expressions (quasi-)affine in vector dimensions and external parameters that are combined by Boolean operations (∧, ∨, ¬). For affine sets, set operations such as union, intersection, subtraction, projection as well as cardinality are defined.

An affine map M = {i⃗ → j⃗ | i⃗ ∈ Z^n, j⃗ ∈ Z^m ∧ cons(i⃗, j⃗)} is a relation that relates n-dimensional input (domain) vectors with m-dimensional output (range) vectors. The elements are again constrained by a Presburger formula cons(i⃗, j⃗).

Besides the normal set operations, there exist map-specific operations such as the application of a map m on a set s (m(s)), the composition of two maps (m0 ◦ m1), or the inverse of a map (m^−1), which switches input and output of a map. We define the following set of important map operations in more detail. The range product of two maps R1 and R2 is defined as:

R1 ×_ran R2 = {i⃗ → (j⃗1, j⃗2) | i⃗ → j⃗1 ∈ R1 ∧ i⃗ → j⃗2 ∈ R2}

The range intersection of a map R with a set S is:

R ∩_ran S = {i⃗ → j⃗ | i⃗ → j⃗ ∈ R ∧ j⃗ ∈ S}

The range projection of a map R projects the n output dimensions of a map onto the first k + 1 output dimensions:

P^ran_[0−k](R) = {i⃗ → (j_0, ..., j_k) | ∃ x_{k+1}, ..., x_{n−1} ∈ Z : i⃗ → (j_0, ..., j_k, x_{k+1}, ..., x_{n−1}) ∈ R}

R+ is the transitive closure of R:

R+ = {i⃗ → j⃗ | ∃ m ≥ 0 : j⃗ = (R ◦ · · · ◦ R)(i⃗)}, where R is composed m times.

We use |S| to specify the cardinality of a set and |R| to specify the cardinality of a map, where the cardinality of a map is defined as the number of related domain and range pairs. We also define named sets and named maps as affine sets and maps that contain so-called “named vectors”. The elements of these sets can either be written as tuples of a string and a vector, for example {(“A”, i⃗), (“B”, j⃗) | i⃗ ∈ Z^n, j⃗ ∈ Z^m}, or as named vectors {A(i⃗), B(j⃗) | i⃗ ∈ Z^n, j⃗ ∈ Z^m}. Named sets (maps) allow differently named elements to have vectors of different dimensionality. On named sets and maps, the operations introduced above are applied individually to subsets or submaps that share the same name and dimensionality. To extract a set from a named set S, we define a bracket operator S[“x”] = {(“x”, i⃗) | (“x”, i⃗) ∈ S}. The bracket operator applied on a map filters the map according to the name of its domain: R[“x”] = {(“x”, i⃗) → (name, j⃗) | (“x”, i⃗) → (name, j⃗) ∈ R}. Computations on integer sets can be performed with isl [48] and counting of integer sets is possible using the Barvinok algorithm [49].

2.1.6.2 Data Dependencies

Given a stencil program P, the set of flow dependencies in P can be derived from the stencil data dependencies. To obtain them, we define for each stencil s ∈ P a map D_s that associates the stencil evaluations to the corresponding input data dependencies:

D_s = {s.out(u⃗) → d(u⃗ + v⃗) | d(v⃗) ∈ s.in}
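For instance, for the lap stencil of the horizontal diffusion example in Section 2.1.2, the definition unfolds to:

D_slap = {lap(i, j) → in(i − 1, j); lap(i, j) → in(i + 1, j); lap(i, j) → in(i, j − 1); lap(i, j) → in(i, j + 1); lap(i, j) → in(i, j)}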

Next, we define the union of all stencil data dependencies.

D = ⋃_{s ∈ P} D_s

2.1.6.3 Stencil Tiling Maps

We model the tiling transformations discussed in Section 2.1.3 using affine maps that relate the stencil evaluation domain to the tile domain. More precisely, we define for each stencil a tiling map that maps each point in the n-dimensional stencil evaluation domain to an n-dimensional tile identifier, such that all points that belong to the same tile are associated with a common tile identifier. We initially consider only a single tiling level and later generalize the concept to nested tilings.

Given a multi-dimensional tile size vector t⃗ = (t_0, ..., t_{n−1}) ∈ Z^n, we define a hyperrectangular tiling of a single stencil s as a named map T_s^□ that associates each point i⃗ = (i_0, ..., i_{n−1}) ∈ Z^n of the stencil evaluation domain with exactly one tile identifier:

T_s^□ = {(s, i⃗) → (⌊i_0/t_0⌋, ..., ⌊i_{n−1}/t_{n−1}⌋)}
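For example, with tile size t⃗ = (32, 32), the evaluation of a stencil s at position i⃗ = (70, 40) is mapped to the tile identifier (⌊70/32⌋, ⌊40/32⌋) = (2, 1).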

Depending on size and alignment of tiles and stencil evaluation domains, such a tiling may yield truncated tiles at the stencil evaluation domain boundaries. In case a given dimension of the stencil evaluation domain should not be tiled (indicated by tile size ∞), the corresponding dimension of the tile identifiers is set to zero.

We represent the tiling of a stencil group g by computing a named map that contains a tiling map for each stencil of the stencil group. We distinguish here between the three halo strategies introduced in Section 2.1.3.

Computation on-the-fly satisfies halo point dependencies using redundant computation. The corresponding tiling map is therefore a relation which maps the halo point stencil evaluations at the tile boundaries to multiple overlapping tiles. Given a stencil group g, we construct a tile map T_g in two steps.

First, all output stencils of g are tiled with a rectangular tiling map. This does not yet introduce any redundant computation. Next, we compute for each tile all stencil evaluations that are required to compute the output points already assigned to this tile. We do this by first defining the set of data dependencies D_g that are local to g and then composing the inverse transitive hull of D_g with the tiling map already defined for the output stencils. The resulting map connects the temporary stencil evaluations via the dependent output stencil evaluation to the corresponding tile identifier. This map may now possibly relate one temporary stencil evaluation to multiple tiles and can consequently introduce redundant computation:

T_g = ⋃_{s ∈ g.out} T_s^□ ◦ (D_g^+)^−1

Halo exchange parallel satisfies halo point dependencies using communication. We therefore assign each point in the stencil evaluation domain to exactly one tile and use tiles of identical size, shape and alignment for all stencils in our stencil group. The tiling map T_g describes such a tiling for a stencil group g:

T_g = ⋃_{s ∈ g.sten} T_s^□

Halo exchange sequential is a variant of halo exchange parallel, whose tiling map is constructed accordingly. In contrast to halo exchange parallel, we shift the stencil tiling maps such that all unsatisfied halo point dependencies between tiles point in one direction. Figure 2.3 illustrates the tile shape of shifted stencil tiling maps and their halo point dependencies. We define a shifted tiling map by subtracting the shift offset from the stencil evaluation domain before computing the associated tile identifiers.

nested tilings We now describe the construction of nested tilings, tilings that result from recursively applying the previously introduced tiling transformations. To give a first intuition of such tilings, Figure 2.6 shows the different nested tilings that can be constructed from combining on-the-fly and halo exchange parallel tiling on two tiling levels. It shows for each combination one full outer tile, one full inner tile, and, using dashed lines, the remaining inner tiles placed inside the outer tile. Most combinations are rather straightforward, but it is interesting to note that, in case of on-the-fly tiling being nested inside halo exchange parallel tiling, the redundant computation of the on-the-fly tiles may require the computation of points located outside of the surrounding tile.

Figure 2.6: Tile shapes (shaded) for a nested tiling (of-of, hp-hp, hp-of, and of-hp) applied to a subset of the horizontal diffusion example projected to the i-dimension

As visible in the illustration just discussed, we identify each nested tile with a tile vector whose first and second entry correspond to the tile identifiers of the first and second tiling level, respectively. Hence, we can model a nested tiling with l tiling hierarchy levels with a tiling map that relates each point in the n-dimensional stencil evaluation domain to a tile identifier with n · l dimensions. To construct such a map for a given stencil group g nested in another stencil group p, we first define tiling maps for the output stencils of g. These tiling maps are formed by combining for each stencil the tiling map T_p[s] that we derive for this stencil from p (not considering any nested groups) with an additional hyperrectangular tiling that uses the tile sizes specified for g. We define the tiling map T_{g,s} of such a stencil s as the range product of the tiling map T_s^□ with the recursively computed parent tiling map T_p[s]:

T_{g,s} = T_p[s] ×_ran T_s^□
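For instance, with the two tiling hierarchy levels t_hd^1 = (256, 256) and t_hd^2 = (32, 32) of the horizontal diffusion example, a stencil evaluation at position (300, 40) obtains the nested tile identifier (⌊300/256⌋, ⌊40/256⌋, ⌊300/32⌋, ⌊40/32⌋) = (1, 0, 9, 1).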

When computing the tiling map of a nested stencil group T_g, we adapt the previously introduced on-the-fly and halo exchange tiling maps to use T_{g,s} instead of T_s^□. The resulting tiling maps for halo exchange parallel and on-the-fly tiling are:

T_g = ⋃_{s ∈ g.sten} T_{g,s}    and    T_g = ⋃_{s ∈ g.out} T_{g,s} ◦ (D_g^+)^−1

We can now define for each stencil a tiling map T_s that maps each evaluation of this stencil to a tile identifier with l · n dimensions, which identifies for all levels of the tiling hierarchy the tiles the stencil evaluation is assigned to.

We obtain T_s by extracting the tile map that corresponds to s from the tile map of the stencil group g at the top of the tiling hierarchy that contains s:

T_s = T_g[s]

When constructing hierarchical tilings that involve halo exchange sequential, we inherit the shift offsets introduced by the sequential execution to all nested tiling hierarchy levels. Thereby, we align the nested tiles to the parent tile boundaries.

2.1.6.4 I/O Maps

While the tiling maps alone allow the analysis of computational aspects, we introduce auxiliary maps that support the analysis of data movements and storage usage. First, we define for each stencil s an input map I_s that relates a set of inputs (stencil evaluations or input fields) used by a certain evaluation of s to the tile(s) this evaluation is assigned to. The construction of I_s is similar to the construction of the on-the-fly tiling. We compose the stencil tiling map T_s with the reversed stencil data dependencies D_s^{-1}. Furthermore, we define the input map of an entire stencil group g as the union of all nested stencil input maps.

I_s = T_s ∘ D_s^{-1}    and    I_g = ⋃_{s ∈ g.sten} I_s

Second, we define for each child stencil or stencil group c an output map Oc that relates the set of outputs written by the child to the tiles they are assigned to. In case the parent stencil group applies halo exchange communication, we define the output map Oc as the union of the child output stencil tiling maps.

O_c = ⋃_{s ∈ c.out} T_s

In case the parent stencil group applies computation on-the-fly, we compute the output map by following the data dependencies starting from the parent stencil group output stencils. While this construction is similar to the computation of the on-the-fly stencil evaluation tiling map, it differs by the fact that we only consider the data dependencies of the stencils executed after the child stencil or stencil group. Thereby, we make sure we do not consider internal dependencies between the output stencils of the

child stencil group. Initially, we define the partial input map Ip,c of a parent stencil group p and a child stencil or stencil group c considering all input dependencies of children executed after the child c.

I_{p,c} = ⋃_{c_i ∈ p.child, c_i ≻ c} I_{c_i}, where c_i ≻ c denotes the children executed after c.

Then the output map Oc of a child stencil or stencil group is the union of all partial input and parent output dependencies.

O_c = ⋃_{s ∈ c.out} ( I_{p,c}[s] ∪ ⋃_{o ∈ p.out} T_{p,o}[s] )

2.1.6.5 Tile Selection

We analyze the characteristics of a stencil program by counting stencil evaluations, data movements, or storage requirements on a limited domain. As we are interested in the relative rather than the absolute performance and as our performance model does not consider low hardware utilization due to strong scaling, we can choose an arbitrary but limited domain size. We therefore perform our analysis on the origin tile of the lowest tiling hierarchy level. Assuming m tiling hierarchy levels, we select the origin tile of the lowest tiling hierarchy level using the tile selection set S that contains all tile identifiers with the first n dimensions fixed to zero.

S = {(x_0, ..., x_{n-1}, y_n, ..., y_{nm-1}) | x_i = 0 ∧ y_j ∈ ℤ}
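As a concrete instance with purely illustrative sizes that are not taken from the text: a stencil program with n = 2 dimensions and m = 2 tiling hierarchy levels has four-dimensional tile identifiers, and the selection set becomes

S = {(0, 0, y_2, y_3) | y_2, y_3 ∈ ℤ},

which pins the origin tile of the lowest hierarchy level while still ranging over all tiles of the nested level.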

When analyzing the storage requirements, we want to make sure a single tile fits the memory capacity of the corresponding memory hierarchy level. We thus define an additional tile selection set S∗ that selects the origin tile on all levels of the tiling hierarchy.

S^* = {(x_0, ..., x_{nm-1}) | x_i = 0}

In order to limit the domain of a tiling map, we finally intersect the range of the tiling map with a selection set.

2.1.6.6 Analysis

Relying on the previously introduced stencil program formulation, we now discuss the analyses we use to obtain the program properties needed for

evaluating the performance model introduced in Section 2.1.5. Using the previously introduced maps, we count the points that correspond to the number of stencil evaluations, the amount of data moved, and the amount of storage used when evaluating a given stencil program on a limited domain.

computation In order to analyze the amount of computation performed by a stencil program, we count the stencil evaluations associated to the origin tile of the lowest tiling hierarchy level. We obtain these evaluations by intersecting the range of the stencil evaluation tiling map with the origin tile selection set S. We then count all stencil evaluations associated to the remaining tile identifiers. Hence, we define the amount of computation c_s performed by a stencil s as the cardinality of the constrained tiling map times the number of floating point operations performed by a single stencil evaluation.

c_s = |T_s ∩_{ran} S| · s.ops
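For example, with purely illustrative numbers that are not taken from the text: if the origin tile of the lowest hierarchy level covers 8 × 8 × 8 evaluations of a stencil s that performs s.ops = 5 floating point operations per evaluation, the analysis yields

c_s = 8 · 8 · 8 · 5 = 2560.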

vertical communication As discussed in Section 2.1.5, vertical communication refers to the data movements between a parent stencil group and its child stencils or stencil groups. We therefore analyze the number of loads and stores performed by a child stencil or stencil group when executed by a parent stencil group. We analyze the vertical communication on a restricted domain that corresponds to the origin tile of the lowest tiling hierarchy level. In order to compute the number of loads performed by a stencil or stencil group c, we count the elements in the constrained input map of c. More precisely, we intersect the range with the origin tile selection set and project out any dimension above the parent stencil group tiling hierarchy level l. Due to the projection, the points in the resulting map describe all elements loaded by the child stencil or stencil group not considering redundant stencil evaluations on nested tiling hierarchy levels. Hence, we define the number of loads r_c performed by a child stencil or stencil group c as the cardinality of the constrained and projected child input map.

r_c = ∑_{s ∈ c.in} |P^{ran}_{[0,nl)}(I_c[s] ∩_{ran} S)|

Accordingly, we define the number of stores w_c performed by a child stencil or stencil group c as the cardinality of the constrained and projected child output map.

w_c = ∑_{s ∈ c.out} |P^{ran}_{[0,nl)}(O_c[s] ∩_{ran} S)|

Finally, we define the total amount of vertical communication of a child stencil or stencil group c as the sum of its loads and stores.

v_c = r_c + w_c

lateral communication Lateral communication refers to the halo exchange communication between neighboring tiles of the same tiling hierarchy level. We therefore compute the lateral communication performed by a stencil group as the difference between the amount of computed and the amount of consumed temporary values, which corresponds to the unsatisfied halo point dependencies between the children of the stencil group. We analyze the lateral communication on a restricted domain that corresponds to the origin tile of the lowest tiling hierarchy level. We compute the amount of lateral communication necessary to update the outputs of a child stencil or stencil group as the difference of the elements used by subsequent children and the elements written by the child itself. We thus intersect the range of this difference with the origin tile selection set and project out any dimensions above the parent stencil group tiling hierarchy level l. Hence, we define the amount of halo points l_c communicated by a child stencil or stencil group c as the cardinality of the difference between the projected and constrained partial input and output maps.

l_c = ∑_{s ∈ c.out} |P^{ran}_{[0,nl)}((I_{p,c}[s] \ O_c[s]) ∩_{ran} S)|

In case multiple nested tiling hierarchy levels employ halo exchange communication, we possibly run lateral communication on all these levels. By projecting out one level after the other, we assign the lateral communication to the different levels of the tiling hierarchy. Thereby, we get the sum of the lateral communication on the remaining tiling hierarchy levels not yet projected out. By computing the difference of adjacent levels, we finally get the lateral communication assigned to exactly one level.

storage requirements We analyze the feasibility of a stencil program by computing an upper bound for the storage necessary in order to execute a single tile on each level of the tiling hierarchy. We therefore analyze the storage requirements on a restricted domain that corresponds to the origin tile on all levels of the tiling hierarchy. In case the upper bound exceeds the capacity of one memory hierarchy level, we say a stencil program is infeasible.

We compute the storage requirement of a stencil group as the amount of storage necessary to evaluate the stencil group on a single tile. As shown by Figure 2.4, we reserve storage for each input and temporary field used during the evaluation of the stencil group. In contrast, output fields are immediately written to storage managed outside of the stencil group. We overestimate the storage requirement, for example, since the lifetime of some fields might allow sharing a common buffer. We evaluate the storage requirements using the input map intersected with the tile selection set S^*. Furthermore, we project out any dimension above the parent stencil group tiling hierarchy level l. Hence, we define the amount of storage m_p required by a parent stencil group p as the cardinality of the constrained and projected input maps.

m_p = ∑_{c ∈ p.child} ∑_{s ∈ c.in} |P^{ran}_{[0,nl)}(I_p[s] ∩_{ran} S^*)|

In order to determine the feasibility of a stencil program, we compare the memory requirements of each stencil group to the available memory capacity.

2.2 case study

We evaluate our approach using the real-world application COSMO. Its dynamical core was recently rewritten using the STELLA [6] stencil library, which exposes the possibility to manually fuse or split stencils on multiple tiling hierarchy levels. In this case study we show how to automatically tune STELLA programs.

2.2.1 STELLA

STELLA is a domain-specific embedded language for finite difference methods that is designed to separate the stencil specification from the hardware architecture specific implementation strategy. When executing a stencil program STELLA uses two levels of parallelism: 1) coarse-grained parallelization that decomposes the stencil evaluation domain into blocks executed on different processing units and 2) fine-grained parallelization that executes the individual blocks on a single processing unit possibly using vectorization and hardware threads. STELLA supports stencil fusion on three different tiling hierarchy levels. We can apply consecutive stencils using a single loop over a block, using multiple separate loops over a block, or using multiple separate loops over the full domain.

Hierarchy   Vertical   Tile Size        Strategy
1           DDR        (256, 256, 64)   of
2           L2         (8, 8, 64)       of

Table 2.1: CPU tiling hierarchy

Hierarchy   Vertical/Lateral    Tile Size        Strategy
1           GDDR/-              (256, 256, 64)   of
2           GDDR/-              (64, 4, 64)      of
3           Register/Register   (∞, ∞, 1)        hs
4           Register/Shared     (1, 1, 1)        hp

Table 2.2: GPU tiling hierarchy

At compile-time, STELLA generates target architecture specific loop code using C++ template meta-programming. With two available backends, STELLA can currently target CPU and GPU architectures using the OpenMP and CUDA programming models, respectively. Thereby, STELLA employs a fixed but platform specific tiling hierarchy, which we will model using our stencil algebra. We model the CPU backend of STELLA using the two tiling hierarchy levels shown by Table 2.1. As discussed in Section 2.1.6, we compute all stencil program performance characteristics for the origin tile of the base tiling hierarchy level. Therefore, we introduce a first tiling hierarchy level that represents the stencil program evaluation domain. A second tiling hierarchy level models the coarse-grained parallelism of STELLA. Currently, the CPU backend does not implement fine-grained parallelism. Hence, there is no need to model the third tiling hierarchy level of STELLA. We model the GPU backend of STELLA using the four tiling hierarchy levels shown by Table 2.2. Just as in the case of the CPU backend, we introduce two tiling hierarchy levels to model the stencil program evaluation domain and the coarse-grained parallelism. We also add two additional tiling hierarchy levels to represent the fine-grained parallelism. The GPU backend allocates one thread per ij-position (tiling hierarchy level 4) that iterates over all points in the k-dimension (tiling hierarchy level 3). The different threads communicate via shared memory, while consecutive loop

iterations executed by the same thread communicate via registers. Tile size infinity indicates that there is no tiling in the corresponding dimension.

2.2.2 Stencil Program Optimization

When implementing a stencil program using STELLA, we have multiple degrees of freedom. As discussed in Section 2.1.4, we can change the stencil evaluation order and fuse or split the execution of successive stencils on multiple levels of the tiling hierarchy. We therefore split the optimization in two steps and apply different optimization methods: 1) we optimize the stencil evaluation order using brute force search and 2) we optimize the tiling for a given stencil evaluation order using dynamic programming. During our optimization we do not consider tile size choices, but rely on the tile sizes that are used by COSMO and have proven robust for a wide range of stencil programs and their implementation variants. In order to optimize the stencil evaluation order, we enumerate all topological sorts of the stencil dependency graph using brute force search. In general, a graph may have up to O(n!) valid topological orders. However, due to its data dependency chains a typical stencil dependency graph has fewer topological orders, resulting in a much smaller search space. In a second step, we search the optimal tiling given a stencil evaluation order. Using a tiling hierarchy and an abstract machine model, we search for a tiling with minimal estimated execution time and a storage requirement that fits all levels of the memory hierarchy. We estimate execution time and storage requirements using the analysis introduced in Section 2.1.6. In order to enumerate the search space, we fuse all pairs of subsequent stencils on all levels of the tiling hierarchy. Thereby, we assume the subsequent stencils are executed by nested stencil groups that represent the full tiling hierarchy. Given m tiling hierarchy levels and n stencils, up to m tiling hierarchies can be split between each pair of neighboring stencils. Overall, this means there are O(m^n) ways to split the stencil program. Given the set of stencil program implementation variants I and the functions t(x) and m_l(x) that estimate the execution time and the maximal storage requirement at the level l of the tiling hierarchy, respectively, we define the following optimization problem:

minimize_{x ∈ I}  t(x)
subject to  m_l(x) ≤ M_l,  l = 1, ..., m

Figure 2.7: Example kernel stencil dependency graphs (hd: 5 inputs, 1 output, 3 stencils; uv: 2 inputs, 2 outputs, 4 stencils; div: 8 inputs, 1 output, 8 stencils; uv&div: 10 inputs, 3 outputs, 11 stencils)

We can either solve the optimization problem using brute force search or employ our dynamic programming approach, reducing the search space from O(m^n) to O(m n^4) elements. We can apply dynamic programming as the problem has optimal substructure. In particular, we compute for each tiling hierarchy level an n × n matrix that contains the optimal stencil group executing a continuous subset of the stencil program. Thereby, one matrix dimension corresponds to the start index and the other matrix dimension to the stop index of the subset. We compute a matrix entry using a second dynamic programming algorithm¹ that constructs the optimal stencil group using a combination of the previously computed optimal child stencil groups. More precisely, we compute the optimum for a given start and stop index either using the optimal child stencil group containing all stencils or using a child stencil group containing all stencils from an intermediate index to the stop index plus the recursively computed optimum from the start index to the intermediate index. By increasing the intermediate index step-by-step and storing partial solutions, we compute a single entry of our n × n matrix using O(n^2) steps.
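The following C++ sketch illustrates the structure of this outer dynamic program. It is purely illustrative: the hypothetical child_cost stub stands in for the estimated execution time of a previously computed optimal child stencil group, and the sketch tracks cost values only, whereas the actual implementation also materializes the optimal stencil group per entry (which is why a single entry costs O(n^2) there).

#include <algorithm>
#include <vector>

// Stub: estimated execution time of the optimal child stencil group that
// executes the contiguous stencils lo..hi at the given hierarchy level.
double child_cost(int /*level*/, int lo, int hi) {
  return 1.0 + 0.1 * (hi - lo);  // placeholder cost model
}

// Fill one n x n matrix for a tiling hierarchy level: opt[lo][hi] holds the
// minimal estimated time for executing the contiguous stencils lo..hi.
std::vector<std::vector<double>> optimize_level(int level, int n) {
  std::vector<std::vector<double>> opt(n, std::vector<double>(n, 0.0));
  for (int hi = 0; hi < n; ++hi) {
    for (int lo = hi; lo >= 0; --lo) {
      // Option 1: one child stencil group spans the whole subset lo..hi.
      double best = child_cost(level, lo, hi);
      // Option 2: a child group covers mid..hi and the prefix lo..mid-1
      // reuses the recursively computed optimum.
      for (int mid = lo + 1; mid <= hi; ++mid)
        best = std::min(best, opt[lo][mid - 1] + child_cost(level, mid, hi));
      opt[lo][hi] = best;
    }
  }
  return opt;
}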

2.3 evaluation

We evaluated our framework using three example kernels from the COSMO atmospheric model. In addition to the horizontal diffusion kernel “hd” introduced in Section 2.1.2, we use two kernels that are part of the most

1 Our nested dynamic programming step is not guaranteed to find the optimal solution. For all four example kernels discussed in Section 2.3, exhaustive search based tests confirmed the optimality of the dynamic programming results for several stencil evaluation orders.

Hierarchy   Vertical (V)   Memory (M)
1           26 GB/s        ∞
2           768 GB/s       512 KB

Table 2.3: Intel Core i5-3330

Hierarchy   Vertical (V)   Lateral (L)   Memory (M)
1           208 GB/s       -             ∞
2           208 GB/s       -             ∞
3           ∞              1174 GB/s     4096 Registers
4           ∞              ∞             32 Registers

Table 2.4: Nvidia Tesla K20c

time-consuming component in COSMO, the sound wave forward integration. More precisely, the “uv” kernel updates the horizontal wind velocity components by computing the horizontal pressure gradient, whereas the “div” kernel computes the divergence of the three-dimensional wind field. Figure 2.7 illustrates all kernels used during the evaluation including a combination of the “uv” and “div” kernels. We perform our experiments using adapted standalone kernels: 1) we replace divisions by multiplications to increase the numerical stability on random input data and 2) we replace one-dimensional constant fields by scalar constants as our framework does only support n-dimensional fields. We implement for each kernel three different variants: 1) “no fusion” refers to a naive implementation without loop fusion, 2) “hand-tuned” refers to a manually tuned implementation as used in production by COSMO, and 3) “optimized” refers to an automatically tuned version using modesto. All kernel variants are written using STELLA and therefore are parallel and employ tiling. Similar to the production configuration, we run our experiments using a (256, 256, 64) point domain that provides sufficient parallelism to fully utilize the hardware. We measure the performance of our example kernels using two target architectures: 1) an Intel Core i5-3330 CPU with a dual channel DDR3-1600 memory interface and 2) a Nvidia Tesla K20c GPU. Table 2.3 and Table 2.4 define the machine model of the target architectures for the STELLA tiling hierarchy discussed in Section 2.2.1. We thereby use the peak bandwidth of the individual memory hierarchy levels except for the register file and the shared memory used to buffer lateral communication. Since every lateral communication triggers a write and a read operation, we divide the peak bandwidth of these memories by two. We also underestimate the capacity of the GPU register file since it is not uniquely used to buffer lateral communication. We finally set the peak performance C of the target architectures to 48 Gflops and 585 Gflops, respectively (without fused multiply-add).

Figure 2.8: Comparison of measured and estimated execution time for (a) the CPU and (b) the GPU; the fitted trend lines are m ≈ 1.6e (CPU) and m ≈ 1.5e (GPU), where m denotes the measured and e the estimated execution time in milliseconds.

To evaluate the accuracy of our performance model, we compare the measured execution time of our example kernels to the modeled execution time. Figure 2.8 shows the accuracy of the model for both target architectures. Using linear regression, we fit trend lines that show a close correlation of modeled and measured performance. Hence, the relative performance of modeled and measured execution times for different kernels are in accordance, which is of key importance for our approach. However, we consistently overestimate the absolute performance as the kernels can not leverage the peak performance of both target architectures. Our performance model shows that our kernels are heavily memory bandwidth limited. Consequently, the correlation factors of 1.5x respectively 1.6x can be attributed to the fact that the kernels attain only a fraction of the peak main memory bandwidth. Figure 2.9 shows the speedup of hand-tuned and automatically tuned implementation variants for both target architectures. As discussed in Section 2.2.2, modesto optimizes topological order and stencil fusion. Overall, modesto achieves the same or better performance compared to

Figure 2.9: Speedup of hand-tuned and optimized kernels ((a) CPU, (b) GPU)

the hand-tuned kernels used by COSMO. Starting from a naive STELLA implementation, we are able to improve the performance by a factor 2.0x– 3.1x. The first three experiments achieve optimal performance by fusing all stencils on the highest level of the tiling hierarchy. In contrast, for the last experiment fusing all stencils exceeds the memory capacity. Hence, the optimization splits the stencils in two separate groups. To verify this decision, we implemented an additional variant of the last experiment that fuses all stencils. On CPU and GPU fusing all stencils results in a 10% and 8% performance reduction, respectively.

2.4 related work

Optimal and close-to-optimal stencil arrangements have been investigated for several decades. Many approaches rely on empirical methods to derive efficient implementations. Datta et al. [50] optimize an example stencil for a wide range of hardware architectures using autotuning. Patus [22] is a DSL autotuning framework for single stencil computations on multi-core CPUs and single GPUs. Zhang et al. [51] present an iterative compilation approach for single stencil computations on single and multi GPU systems which focuses on deriving optimal block sizes. Overtile [9] is a DSL code generator for iterative stencils that uses overlap tiling to generate efficient GPU code also relying on iterative compilation. There is also a cache-oblivious tiling strategy for iterative stencil computations [45] for which the number of expected cache misses has been analytically computed and empirically evaluated for single CPU systems and one caching level.

For stencil graphs, there is Halide [12], a DSL based approach focused on image processing. Halide uses again compilation based autotuning to choose stencil program implementation variants considering a set of tiling strategies and further optimizations. PolyMage [13] is an image processing DSL that guides the optimization using a model-driven heuristic. Basu et al. [44] perform loop fusion, overlapped tiling and wave front execution for optimizing a geometric multigrid stencil graph. They do not consider hierarchical tiling and do not use any analytical model. Olschanowsky et al. [46] optimize an iterative, but multi-kernel stencil computation resulting from solving partial differential equations and study different inter-loop optimizations using empirically evaluation on multi-core CPUs. There has also been work that discusses analytical performance models. There is work not limited to stencil computations that provides lower bounds for tile sizes selection [52]. Renganarayana et al. [53] use geometric optimization to model tiling and related problems on one and multiple levels and to derive optimal tile sizes. Zhou et al. [8] present work on hierarchical overlapped tiling and optimize OpenCL programs for multi- core CPUs. They provide basic performance models for the number of stencils to fuse into one tile focusing on (possibly unrolled) kernels that process only one stencil repeatedly and do not consider varying tiling and fusion strategies. Finally, Wahib et al. [35] take arbitrary stencil graphs from larger scientific applications and present an analytical performance model for choosing an optimal execution strategy. Even though closely related, they limit themselves to kernel fusion using computation on-the-fly only considering shared memory and apply their work on NVIDIA GPUs only.

2.5 summary of the approach

With modesto we have presented an approach for modeling and automatically selecting efficient implementation strategies for stencil programs. Focusing not only on single, possibly iterative applications of stencils, but on directed acyclic graphs of stencils, we consider the effects of three different tiling strategies in combination with different fusion choices, all applied on possibly multiple hierarchy levels. We model the effects of these implementation strategies on the use of both lateral and vertical memory bandwidth, and estimate the cost of possibly redundant computation by using an analytical model that allows us to predict the amount of data transfer and computation for a given stencil program implementation variant. In combination with a given CPU or GPU model we estimate the relative

performance of the different implementation variants and show using a combination of exhaustive search and dynamic programming how to choose the best implementation variant. We evaluated modesto by means of the STELLA stencil library that implements different stencil program transformations for CPU and GPU architectures. In particular, we successfully model the tiling hierarchy of STELLA and automatically tune kernels of the COSMO atmospheric model. We thereby achieve speedups of 2.0–3.1x against naive and speedups of 1.0–1.8x against expert-tuned implementation variants.

3 a learned performance model

The cost of data movement in terms of energy and time has long exceeded the cost of computation. Thus, data locality recently became the most im- portant optimization target for performance engineers [7]. Today, most programmers either rely on the compilation toolchain or manually optimize data locality by tiling and fusing loops. Manual loop optimizations are tedious and require a high porting effort to exploit different architectures efficiently because tiling and fusion parameters need to be adjusted for each target system. Various frameworks such as Halide [12] and Polymage [13] fo- cus their tuning on this parameter selection, but they either apply heuristics or optimize tiling and fusion separately to control the exponential search space. However, fusion and tiling are inherently linked—for optimizing one, one needs to assume a specific configuration for the other. For example, the optimal tile size depends on the memory footprint of the loop, which changes with fusion. This missing modularity of the problem requires us to consider tiling and fusion in tandem. Stencil computations on regular grids are ubiquitous in scientific com- puting applications such as climate modeling [16], seismic imaging [42], and electromagnetic simulations [43]. In this work, we use the COSMO atmospheric model [16], which is used in operational weather forecasting in most of Europe [17] as well as in large-scale climate modeling [18], as a motivating example. The 300, 000 lines of code contain more than 16, 000 loops, most of which implement single stencils. These stencils logically form complex producer-consumer relationships, called stencil programs [1]. We select three representative COSMO stencil programs to evaluate the effectiveness of our approach. Due to the very low arithmetic intensity of every single stencil, tiling and fusion are crucial for achieving good performance for stencil programs. We show an example in Figure 3.1—COSMO’s fastwaves stencil program which implements parts of the sound wave forward integration. The di- rected graphs show the data-flow (edges) between the stencils (nodes) of the fastwaves program. Our optimization framework, absinthe, uses an automatically learned performance model to guide the program optimiza- tion. The figure plots the model prediction versus the measured execution


Figure 3.1: absinthe optimization example

time for the tile size (annotated) and fusion (shaded shapes) choices of absinthe compared to auto-tuning and an unfused tiled implementation. absinthe consists of three main pieces: (1) a model learner, (2) an opti- mizer, and (3) a code generator. The model learner generates a performance model specific to each target architecture. The optimizer derives an integer linear program encoding the structure of the stencil program and the per- formance model to tune tiling and fusion together. The code generator then emits an implementation with the optimal tiling and fusion parameters returned by the integer linear programming solver. In this way, absinthe combines automated model learning with integer linear programming to control the exponential search space and to automatically find the best configuration for each target architecture. In summary, we make the following key contributions:

• A linear formulation of parametric tiling for bound tile sizes (assumed to be non-linear in general).

• A linear performance model that learns the target system characteristics and enables the use of integer linear programming to explore the search space.


Figure 3.2: absinthe architecture overview

• A single holistic optimization problem which applies the linear perfor- mance model to derive optimal fusion and tile size selection choices for stencil codes.

3.1 background

The execution of stencils in succession provides plenty of opportunities for data locality improvements.

3.1.1 Architecture Overview

absinthe lowers stencil programs written in a high-level domain-specific language (DSL) to efficient C++ code. An automatically learned performance model drives the selection of target system-specific code transformations. Figure 3.2 shows the interplay of the absinthe components. The model learner (1) runs once for every target system to learn the model parameters. The optimizer (2) combines the model parameters with the memory access patterns of the stencil program to instantiate a target-specific performance model. An integer linear programming (ILP) solver searches the optimal data-locality transformations with respect to the performance model. The code generator (3) applies the optimal data-locality transformations to the high-level stencil program representation and generates tuned C++ code. absinthe targets three-dimensional stencil programs and optimizes them to utilize all processors of the target system, assuming exclusive system access. Our implementation has the following limitations: 1) we support only three-dimensional arrays, 2) we do not optimize the boundary conditions, and 3) we tile the codes only for one memory hierarchy level.

for(int x=xbeg; x<=xend; ++x)
  for(int y=ybeg; y<=yend; ++y)
    for(int z=zbeg; z<=zend; ++z)
      s0(x,y,z) = 0.5 * (i0(x+1,y,z) + i0(x,y,z));
for(int x=xbeg; x<=xend; ++x)
  for(int y=ybeg; y<=yend; ++y)
    for(int z=zbeg; z<=zend; ++z)
      s1(x,y,z) = i1(x,y,z) * (s0(x,y+1,z) - s0(x,y-1,z));

Figure 3.3: Example stencil sequence with length N = 2

3.1.2 Stencil Sequences

A stencil is an element-wise computation with a position independent access pattern. Every stencil evaluation accesses the input arrays at fixed offsets relative to the updated output array element. We assume that every stencil writes a single array. We apply stencils to all array elements except for a constant width halo at the array boundary which prevents out-of-bounds accesses. A stencil sequence is a program formed of several subsequent stencil applications. Figure 3.3 shows an example stencil sequence with length N = 2. The short example sequence allows us to illustrate our approach with less complexity compared to the fastwaves kernel introduced in Figure 3.1.

3.1.3 Data-Locality Transformations

absinthe combines rectangular tiling with redundant computation at the tile boundaries to satisfy the data dependencies of fused stencils. This overlapped tiling [8] enables major performance improvements. The tuned fastwaves kernel shown by Figure 3.1 executes 1.5× faster compared to the unfused tiled implementation variant. Loop tiling decomposes the domain into hyper-rectangular tiles of equal size. To increase the data-locality, we evaluate the stencil on the entire tile before proceeding with the next one. We thus introduce an additional outermost loop that iterates over all tiles. To support arbitrary domain sizes, we cut the tiles at the domain boundary.
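A minimal sketch of such a tile loop (illustrative only, not absinthe's generated code; the tile sizes and domain sizes are assumed parameters) shows how the last tile along each dimension is cut at the domain boundary:

#include <algorithm>

// Iterate over all tiles, then evaluate the stencil on the entire tile before
// proceeding with the next one; std::min cuts tiles at the domain boundary.
void apply_tiled(int Dx, int Dy, int Dz, int tx, int ty, int tz) {
  for (int zb = 0; zb < Dz; zb += tz)
    for (int yb = 0; yb < Dy; yb += ty)
      for (int xb = 0; xb < Dx; xb += tx) {
        int ze = std::min(zb + tz, Dz);
        int ye = std::min(yb + ty, Dy);
        int xe = std::min(xb + tx, Dx);
        for (int z = zb; z < ze; ++z)
          for (int y = yb; y < ye; ++y)
            for (int x = xb; x < xe; ++x) {
              // stencil evaluation at (x, y, z) goes here
            }
      }
}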


Figure 3.4: absinthe ILP parameters and components

Loop fusion replaces the tile loops of consecutive stencils with a single tile loop that evaluates one stencil after another before proceeding with the next tile. After fusion, the data dependencies of producer-consumer stencils cross the tile boundaries. To enable the parallel tile execution, we extend the loop bounds of the producer stencils to perform redundant computation at the tile boundaries which satisfies all data dependencies locally. The combination of fusion and tiling effectively increases the spatial and temporal locality for stencils with overlapping working sets. The code gener- ator introduces one tile loop for every group of fused stencils and allocates temporary storage to buffer intra-tile data dependencies with minimal memory footprint.

3.2 modeling

The optimizer automatically instantiates an integer linear program (ILP) to find good data-locality transformations. Figure 3.4 shows the main components of the ILP: the parameters component captures the stencil sequence and target system properties that provide the basis for the optimization, the data-locality transformations component defines the optimization variables that span the space of possible transformations, and the performance model component estimates the execution time for the selected code transformations. At optimization time, the ILP solver searches the code transformations with minimal estimated execution time. We present the ILP for three-dimensional stencils, but the formulation generalizes to stencils with different dimensionality. If not mentioned otherwise, the variables are positive and integer-valued, while lowercase and uppercase identifiers distinguish optimization variables and constants, respectively. Table 3.1 lists important constants and variables.

constants
N                                number of stencils in the stencil sequence
D^x, D^y, D^z                    domain sizes
H^x, H^y, H^z                    halo widths
T                                number of processors
C                                cache capacity
P^f, B^f                         fast memory peel & body parameters
P^b, B^b                         slow memory peel & body base parameters
P^v, B^v                         slow memory peel & body variable parameters

variables
g_i                              group index
n_i^x, n_i^y, n_i^z              tile counts
p_i^f, b_i^f                     fast memory peel & body cost
p_i^b, b_i^b                     slow memory peel & body base cost
p_i^v, b_i^v                     slow memory peel & body variable cost
e_i^{x+}, e_i^{y+}, e_i^{z+}     evaluation boundary widths (positive axis direction)
e_i^{x-}, e_i^{y-}, e_i^{z-}     evaluation boundary widths (negative axis direction)

Table 3.1: Important constants and variables

3.2.1 Stencil Sequences

The optimizer requires an analysis of the stencil access patterns to instantiate the ILP shown by Figure 3.4. The access patterns provide the basis to compute the data-flow and to estimate the performance of the stencil sequence. We use positive indexes to number the stencils in execution order and negative indexes to identify the input arrays. For example, the indexes [0, 1] refer to the stencils [s0, s1] and the indexes [−1, −2] to the input arrays [i0, i1] of the example stencil sequence shown by Figure 3.3. The stencil indexes also map one-to-one to the output arrays since every stencil writes precisely one output. The resulting index space thus uniquely identifies the input and output arrays of the stencil sequence. To specify the data access patterns, we define for every stencil i the access set Ai holding (index, offset) tuples that define the array and the three-dimensional relative offset of every input element access. The access sets

A_0 = {(−1, (1, 0, 0)), (−1, (0, 0, 0))},

A_1 = {(−2, (0, 0, 0)), (0, (0, 1, 0)), (0, (0, −1, 0))} include all accesses of the example stencils. We also compute minimal bounding boxes that contain all access offsets. To represent the bounding boxes, we define for every stencil i and dimension d the bounds set B_i^d holding (index, range) tuples that specify the array and the minimal and maximal access offset along the dimension. The bounds sets

B_0^x = {(−1, (0, 1))},   B_1^y = {(−2, (0, 0)), (0, (−1, 1))} contain all accesses of the example stencils along the selected dimensions. To execute the stencil sequence, we define for every dimension d the constant domain size D^d and the constant halo width H^d along the dimension. The domain sizes determine the stencil loop bounds, while the halo sizes together with the domain sizes specify the array allocation size. For example, we may execute the example stencils on the domain

D^x = 64,  D^y = 64,  D^z = 60

and select the halo widths

H^x = 1,  H^y = 1,  H^z = 0

to accommodate the transitive stencil access offsets, which results in the array allocation size 66 × 66 × 60.

3.2.2 Data-Locality Transformations

The optimizer also defines the optimization variables that span the space of possible data-locality transformations and introduces constraints to exclude solutions that suffer from load imbalance or exceed the cache capacity. To model loop tiling, we select for every stencil i ∈ [0, N) and every dimension d the tile count n_i^d along the dimension from the range [1, D^d]. For example, the tile counts

n_0^x = 2,  n_0^y = 2,  n_0^z = 2

split the domain of the first stencil in the example stencil sequence into two tiles along every dimension. To model loop fusion, we select for every stencil i ∈ [0, N) the group index g_i and fuse stencils with the same group index. We set the group index of the first stencil to zero and increment the group index with every additional group along the stencil sequence. For every stencil, we thus have the choice to retain or increment the group index of the preceding stencil, which spans an exponential search space in the number of stencils. For example, the group index tuples (g_0, g_1) ∈ {(0, 0), (0, 1)} enumerate all possible group assignments for the example stencil sequence. The group indexes g_0 = 0, g_1 = 0 assign the stencils to the same group to model fusion while the group indexes g_0 = 0, g_1 = 1 assign the stencils to different groups that execute consecutively. To quantify the redundant computation, we also extend for every stencil i ∈ [0, N) and for every dimension d the tile size with the evaluation boundary widths e_i^{d+} and e_i^{d−} along both directions. For example, the evaluation boundary widths

e_0^{y+} = 1,  e_0^{y−} = 1

extend the tile size of the first stencil to satisfy all data dependencies of the example stencil sequence locally. We define the tile sizes for every stencil, but the stencils of each group share the same tile loop and tile size. We thus enforce tile count equality for succeeding stencils with the same group index. To guarantee data-locality, we exclude tile sizes that exceed the cache capacity C (L2 cache). To estimate the cache utilization, we multiply for every stencil group the tile size with the number of accessed arrays. This approximation optimistically models a fully associative cache with a least recently used cache replacement policy and does not consider the accesses at the tile boundaries. We thus enforce the cache utilization for a single tile to be lower than one-third of the cache capacity. This choice compensates for our optimistic cache modeling and ensures that not only the current but also the next and the previous tile executed by the same processor mostly fit the cache. As a result, the data-locality improves since the overlapping boundaries of consecutive tiles stay in cache. To guarantee parallel efficiency, we enforce a total number of tiles within 5% of an integer multiple of the number of processors T and for every dimension a tile count within 2% of an integer multiple of the domain size.
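As a worked example with purely illustrative numbers that are not from the text: fusing the two example stencils into one group with tile size 64 × 4 × 5 touches the four arrays i0, i1, s0, and s1, so the estimated cache utilization in double precision is

64 · 4 · 5 · 4 · 8 B = 40 960 B ≈ 40 KB,

which has to stay below one-third of the cache capacity C, about 85 KB for a hypothetical 256 KB L2 cache.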

3.2.3 Performance Model

The optimizer finally instantiates the performance model based on the stencil sequence and target system parameters shown by Figure 3.4. The performance model distinguishes two cost components: (1) the peel cost models the latency and (2) the body cost models the throughput of the innermost loop executions. In other words, the peel cost accounts for loop startup overheads – examples are the over fetch at the loop boundaries or the execution of scalar peel loops – while the body cost models the steady-state of the loop execution. For both components, we model the memory accesses for two memory hierarchy levels: (1) the fast memory (L2 cache) and (2) the slow memory (L3 cache or DDR memory). For every data element, we assume the slow memory handles the first and the fast memory all subsequent accesses during the tile execution. To estimate the execution time, the performance model multiplies the number of memory accesses with the learned model parameters. The loop startup overheads make long tiles along the innermost loop dimension more efficient. To model this effect, we distinguish the peel cost proportional to the number of innermost loop executions (peel domain) and the body cost proportional to the number of innermost loop iterations (body domain). This separation allows us to assign a higher cost to memory accesses executed during the loop startup. The two cost components and the goal to employ efficient integer linear programming solvers result in linear cost functions of the form Px + By that sum the peel cost Px and the body cost By. The variables x and y denote the number of memory accesses for the peel and body domains, respectively. The learned model parameters P and B convert the memory accesses to execution times. Section 3.2.4 details how the model learner determines the model parameters. The performance model combines multiple cost functions to estimate the stencil sequence execution time. To define the cost functions, we next introduce the peel and body functions that compute the weighted size of the peel and body domains, respectively. Figure 3.5 shows the computation of the peel domain (left) and the body domain (right) for a simplified two-dimensional domain (middle) with all weights set to one. The peel domain counts the blue points (squares) while the body domain counts all points (squares and circles). The peel and body functions extend this computation with additional terms and factors to model our three-dimensional domain and parametric weights.

Figure 3.5: Illustration of the peel and body domain computation for domain size 8 × 8 split into 2 × 2 tiles with boundary width one along the positive axis directions.

peel function The peel cost is proportional to the number of innermost loop executions. Without loss of generality, we assume the innermost loops execute along the x-dimension, which means the number of innermost loop executions corresponds to the size of the tiles projected to the yz-plane. The product D^y D^z of the domain sizes is equal to the sum of the tile domains and the products D^z n_i^y and D^y n_i^z of the tile counts with the perpendicular domain size approximate the tile boundaries. To sum the tiles along the innermost loop dimension, we multiply the terms with the tile count along the x-dimension. This approximation is exact except for the tile corners. To evaluate the peel cost, we define for every stencil i ∈ [0, N) the peel function

f_i^p(w, w^y, w^z) = n_i^x (D^y D^z w + D^z n_i^y w^y + D^y n_i^z w^z)

which scales the inner domain and the boundary terms with the inner weight w and the boundary weights w^y and w^z, respectively. For example, we set the inner weight to one and the boundary weights to the evaluation boundary widths to count the innermost loop executions.

body function The body cost is proportional to the number of innermost loop iterations scaled with cost function-specific weights. The number of innermost loop iterations is equal to the sum of the tile volumes. To compute the volume of the overlapping tiles, we add the product D^x D^y D^z of the domain sizes to the tile counts multiplied with the perpendicular domain sizes. This approximation again includes the tile domains and the tile boundaries without the tile corners. To evaluate the body cost, we define for every stencil i ∈ [0, N) the body function

f_i^b(w, w^x, w^y, w^z) = D^x D^y D^z w + D^y D^z n_i^x w^x + D^x D^z n_i^y w^y + D^x D^y n_i^z w^z

which scales the inner domain and the boundary terms with the inner weight w and the boundary weights w^x, w^y, and w^z, respectively. For example, we set the inner weight to one and the boundary weights to the evaluation boundary widths to count the innermost loop iterations. The peel and body functions next allow us to define the cost functions for the two memory hierarchy levels.

fast memory The fast memory model counts the memory accesses to estimate the stencil execution time. We assume that every evaluation of the stencil i loads the entire access set A_i and stores the result. The stencil i thus performs 1 + |A_i| memory accesses per evaluation. To count the memory accesses, we set for every stencil i ∈ [0, N) the weight

c_i = 1 + |A_i| to the number of memory accesses per stencil evaluation and for every dimension d the boundary weight c_i^d = (1 + |A_i|)(e_i^{d−} + e_i^{d+})

to the number of memory accesses per stencil evaluation scaled with the evaluation boundary widths. The multiplication reflects that the stencils are evaluated at every evaluation boundary line. We then set for every stencil i ∈ [0, N) the peel cost p_i^f and the body cost b_i^f of the fast memory model to the products

p_i^f = P^f f_i^p(c_i, c_i^y, c_i^z),   b_i^f = B^f f_i^b(c_i, c_i^x, c_i^y, c_i^z)

which evaluate the peel and body functions to obtain the number of memory accesses for the peel and body domains, respectively. The learned model parameters P^f and B^f convert the memory accesses to execution times.
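As a worked example with assumed tile counts (illustrative only): for the first example stencil s0 we have c_0 = 1 + |A_0| = 3 and, with the evaluation boundary widths e_0^{y−} = e_0^{y+} = 1 and all others zero, the boundary weights c_0^x = c_0^z = 0 and c_0^y = 3 · 2 = 6. For the domain D = (64, 64, 60) and tile counts n_0 = (1, 16, 1), the fast memory peel cost evaluates to

p_0^f = P^f f_0^p(3, 6, 0) = P^f · 1 · (64 · 60 · 3 + 60 · 16 · 6 + 64 · 1 · 0) = 17 280 · P^f.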

slow memory The slow memory model determines the communica- tion volume to estimate the execution time. We observe that the memory throughput improves with the number of parallel access streams. To model this behavior, we sum two cost functions that estimate the base cost and the variable cost with respect to the number of access streams. We assume that every stencil group loads and stores an array only once. Repeated accesses of the same array hit the fast memory and are not relevant for the slow memory model. We compute the slow memory loads and stores based on the group indexes. A stencil only loads an array from slow memory if the group index of the stencil that accessed the array last differs. Otherwise, the array was already loaded to the fast memory. A stencil only stores an array to slow memory if the group index of the last stencil that accesses the array differs. Otherwise, the array is not used outside of the stencil group, and storing to slow memory is not necessary. To estimate the base cost, we set for every stencil i ∈ [0, N) the weight mi to one if the stencil loads or stores at least one array and to zero otherwise. We also set for every dimension d the boundary weight

m_i^d = m_i (e_i^{d−} + e_i^{d+}) to the weight times the evaluation boundary widths. We then set for every stencil i ∈ [0, N) the peel cost p_i^b and the body cost b_i^b of the base cost to the products

p_i^b = P^b f_i^p(m_i, m_i^y, m_i^z),   b_i^b = B^b f_i^b(m_i, m_i^x, m_i^y, m_i^z)

which evaluate the peel and body functions to obtain the number of stencil evaluations that access at least one array for the peel and body domains, respectively. The learned model parameters P^b and B^b convert the stencil evaluations to execution times. To estimate the variable cost, we set for every stencil i ∈ [0, N) the weight s_i to the number of accessed arrays and for every dimension d the boundary weights s_i^d to the sum of the array access boundary widths along the dimension. We consider only arrays and boundary lines that have not been accessed by a preceding stencil of the same stencil group. To compute access boundary widths, we extend for every data dependency (j, (B^−, B^+)) ∈ B_i^d the evaluation boundary widths with the access bounds B^− and B^+. We then set for every stencil i ∈ [0, N) the peel cost p_i^v and the body cost b_i^v of the variable cost to the products

p_i^v = P^v f_i^p(s_i, s_i^y, s_i^z),   b_i^v = B^v f_i^b(s_i, s_i^x, s_i^y, s_i^z)

which evaluate the peel and body functions to obtain the number of access streams for the peel and body domains, respectively. The learned model parameters P^v and B^v convert the access streams to execution times. The slow memory model finally sums the base cost and the variable cost to estimate the execution time. To estimate the overall stencil execution time, we assume that the fast memory and the slow memory accesses overlap. We thus compute for every stencil the maximum peel cost and the maximum body cost of the two memory hierarchy levels. The sum

∑_{i=0}^{N−1} max(p_i^f, p_i^b + p_i^v) + max(b_i^f, b_i^b + b_i^v)

accumulates the individual stencil execution times to obtain the execution time of the entire stencil sequence. The term

∑_{i=0}^{N−1} 2 · 3 (B^b + B^v) n_i^x n_i^y n_i^z

emulates the slow memory access cost to load the two precomputed tile loop bounds for all three dimensions to account for the tile execution overheads. We include this term in the estimated execution time to favor implementation variants with fewer tiles. Together, the estimated execution time and the tile execution overheads define the objective function of the integer linear program.

3.2.4 Learning the Performance Model

The model learner adapts the performance model parameters to the performance characteristics of the target system. To learn the parameters, we

implemented training stencils that either stress the slow or the fast memory and measure their execution time for different tile sizes. We then compute the model parameters using least absolute deviations (LAD) [54] regression, which compared to least squares regression has better outlier robustness. As the performance depends on the tile shape, we benchmark the training stencils with tile sizes ranging from 10 to 80 elements along the x-dimension and from 1 to 55 elements along the other dimensions. We exclude tiles with a volume below 500 or above 2000 elements to ensure that the tiles fit the fast memory (L2 cache). In total, we run 103 tile size configurations. When learning the fast memory model, the fast memory accesses have to dominate the execution times of the training stencils. We used three training stencils that access 12, 16, and 20 array positions. We always connect nine identical stencils that access the same input array to one training sequence. The repeated accesses of the same input array guarantee that the fast memory accesses dominate the execution time. We benchmark the three training sequences for all tile size configurations. For every run r ∈ [0, R), we collect the measured execution time tr and compute the number of fast memory accesses xr and yr for the peel and body domain, respectively. The LAD regression

(P^f, B^f) = argmin_{(P,B) ∈ ℝ^2} ∑_{r ∈ [0,R)} |(P x_r + B y_r) − t_r|

then selects the fast memory model parameters P f and B f that minimize the L1-norm of the prediction error. When learning the slow memory model, the slow memory accesses have to dominate the execution times. We used nine training stencils that access 1, 2, or 3 input arrays with access boundary width 0, 1, or 2. The stencils access the input arrays at three diagonal offsets to avoid unnecessary fast memory accesses. We always connect nine identical stencils that access different input and output arrays to one training sequence. The many loaded and stored arrays guarantee that the slow memory accesses dominate the execution time. We benchmark the nine training sequences for all tile size configurations. For every run r ∈ [0, R), we collect the measured execution time tr. To learn the base cost, we compute the number of stencil evaluations xr and yr that perform slow memory accesses for the peel and body domain, respectively. 3.3 optimization 51

To learn the variable cost, we compute the number of slow memory accesses ur and vr for the peel and body domain, respectively. The LAD regression

(P^b, B^b, P^v, B^v) = argmin_{(P′,B′,P″,B″) ∈ ℝ^4} ∑_{r ∈ [0,R)} |(P′ x_r + B′ y_r + P″ u_r + B″ v_r) − t_r|

then selects the slow memory model parameters P^b, B^b, P^v, and B^v that minimize the L1-norm of the prediction error. All training sequences are synthetic and differ from the application kernels tuned in Section 3.4.4.
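For reference, and not spelled out in the text: each LAD fit can itself be solved as a linear program by introducing one nonnegative residual variable d_r per run, for example for the fast memory model

minimize  ∑_{r ∈ [0,R)} d_r
subject to  −d_r ≤ (P x_r + B y_r) − t_r ≤ d_r,  d_r ≥ 0,  r ∈ [0, R),

so an off-the-shelf LP solver covers the model learning step as well.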

3.3 optimization

The number of possible data-locality transformations defined in Section 3.2.2 makes the manual tuning of stencil programs difficult. To automate the process, we could exhaustively search for the optimal data-locality transformations according to the performance model introduced in Section 3.2.3. However, for stencil sequences of length N the search space contains O(2^N N D^x D^y D^z) implementation variants which decompose into 2^N fusion choices multiplied with up to N stencil groups and D^x D^y D^z tile size choices. This large search space motivates advanced optimization methods. To explore the search space, we rely on the well established mixed-integer linear programming (MILP) approach, which finds or approximates the optimal solution within some predefined objective function gap. The optimizer translates the performance model and the space of data-locality transformations to an MILP that defines the optimization problem. We next detail the automatic translation of the performance model to linear constraints.

3.3.1 Linearizing Multiplications

The performance model multiplies the tile count variables with other variables. Linear programs cannot directly express the product of two integer variables. An implementation trick [55] nevertheless allows us to multiply two variables x and y with known upper bounds X and Y.

We first observe that the product of the binary variable b and the variable x with known upper bound X translates to three constraints. The constraint 0 ≤ p ≤ x limits the product p to the range [0, x], while the constraints

p − Xb ≤ 0 and p − x − Xb ≥ −X

force the product to zero if b is zero and to x otherwise. To express the product of two variables x and y with the known upper bounds X and Y, we next encode the variable y with the sum

y = ∑_{i=0}^{⌊log_2(Y)⌋} 2^i y_i

where the binary variables yi represent the digits of y. Then the product p = xy corresponds to the sum

p = ∑_{i=0}^{⌊log_2(Y)⌋} 2^i x y_i

of the binary products x y_i scaled with the power of two associated with the respective digit. All binary products are translated to the constraints introduced before. The optimizer implements the performance model by introducing binary representations for all tile count variables and lowers the products as shown above. This solution works since we know that for every dimension d the range [1, D^d] limits the tile count variables.
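As a small worked instance with illustrative bounds: for a variable y with upper bound Y = 5, the encoding uses the three binary digits y = y_0 + 2 y_1 + 4 y_2, and the product with a variable x bounded by X becomes p = p_0 + 2 p_1 + 4 p_2, where each partial product p_i = x y_i is linearized with the three constraints 0 ≤ p_i ≤ x, p_i − X y_i ≤ 0, and p_i − x − X y_i ≥ −X.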

3.3.2 Modeling Stencil Groups

The number of stencil groups is an optimization variable not known dur- ing the generation of the optimization problem. Allocating one variable per stencil group to store group properties such as the tile count is thus not possible. Instead, we model stencil group properties with the help of stencil specific variables. At optimization time, the group index variables of Section 3.2.2 allow us to compute stencil group properties and to assign them to all stencil specific variables of the group. The group indexes increase monotonically along the stencil sequence. The constraint g0 = 0 sets the group index of the first stencil to zero. To limit the remaining group indexes, we define for every stencil i ∈ [0, N − 1) the constraint 0 ≤ gi+1 − gi ≤ 1, which sets the group index difference of succeeding stencils to zero or one for fusion and no fusion, respectively. 3.4 evaluation 53

STENCILS = {
  "s0": "auto res = 0.5*(i0(x+1,y,z)+i0(x,y,z));",
  "s1": "auto res = i1(x,y,z)*(s0(x,y+1,z)+s0(x,y-1,z));"}

Figure 3.6: absinthe version of the example stencil sequence

With the help of the group indexes, we define constraints that apply to the stencil groups. For example, the tile counts have to be equal within the stencil group. To enforce equality, we define for every stencil i ∈ [0, N − 1) and for every dimension d the constraints

n^d_{i+1} − n^d_i + D^d (g_{i+1} − g_i) ≥ 0,
n^d_{i+1} − n^d_i − D^d (g_{i+1} − g_i) ≤ 0

which limit the tile count difference to zero if the stencils have the same group index. Otherwise, the group index difference g_{i+1} − g_i is positive since the indexes increase along the stencil sequence. Then the group index difference multiplied with the upper bound D^d for the tile count difference n^d_{i+1} − n^d_i disables the constraints for all possible tile count assignments. The upper bound follows from the observation that the tile counts range from one to the domain size D^d. The optimizer uses the group index variables to model the tile counts, the cache utilization, and the number of slow memory accesses.
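A sketch of this constraint generation, again with PuLP and illustrative sizes (the real optimizer emits these constraints together with the linearized performance model and objective):

from pulp import LpProblem, LpVariable, LpMinimize, LpInteger

N = 4                               # number of stencils (illustrative)
D = {"x": 64, "y": 64, "z": 60}     # domain sizes per dimension (illustrative)
prob = LpProblem("stencil_groups", LpMinimize)

# Group index g_i and tile counts n_i^d for every stencil i.
g = [LpVariable(f"g{i}", lowBound=0, upBound=N - 1, cat=LpInteger) for i in range(N)]
n = {(i, d): LpVariable(f"n{i}_{d}", lowBound=1, upBound=D[d], cat=LpInteger)
     for i in range(N) for d in D}

prob += g[0] == 0                                   # first stencil starts group zero
for i in range(N - 1):
    prob += g[i + 1] - g[i] >= 0                    # group indexes increase monotonically
    prob += g[i + 1] - g[i] <= 1                    # by zero (fusion) or one (no fusion)
    for d in D:
        # Tile counts must match within a group; the constraints are disabled
        # whenever g_{i+1} - g_i = 1 because D^d bounds the tile count difference.
        prob += n[(i + 1, d)] - n[(i, d)] + D[d] * (g[i + 1] - g[i]) >= 0
        prob += n[(i + 1, d)] - n[(i, d)] - D[d] * (g[i + 1] - g[i]) <= 0
print(f"{len(prob.constraints)} constraints for {N} stencils")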

3.4 evaluation

To validate our approach, we learn the performance model for three target systems and compare application kernels tuned with absinthe to heuristically tuned, hand-tuned, and auto-tuned implementation variants.

3.4.1 Setup & Methodology

The target systems feature Xeon E5-2695 v4, Xeon Phi 7210, and Power8NVL sockets. We configure the Xeon Phi sockets with two NUMA domains, each of them with 32 processors and three DDR channels, and run the experiments on one of the two NUMA domains. We optimize the linear programs with CPLEX 12.6.3 and compile the generated C++ codes with GCC 5.3 on the Xeon and Xeon Phi systems and with GCC 5.4 on the Power system.

#pragma omp parallel for schedule(static)
for(int idx = 0; idx < 1 * 3 * 12; ++idx) {
  // views of the input and output arrays
  loop_info l = _tiles_group0[idx];
  array_view_3d i1(&__i1(l.xbeg, l.ybeg, l.zbeg));
  array_view_3d i0(&__i0(l.xbeg, l.ybeg, l.zbeg));
  array_view_3d s1(&__s1(l.xbeg, l.ybeg, l.zbeg));
  // stack allocated temporary arrays
  tarray0_3d ___s0;
  tarray0_view_3d s0(&___s0(HX, HY, HZ));

  { // apply s0 stencil
    int xbeg = _loops_s0[idx].xbeg;
    int xend = _loops_s0[idx].xend;
    int ybeg = _loops_s0[idx].ybeg;
    int yend = _loops_s0[idx].yend;
    int zbeg = _loops_s0[idx].zbeg;
    int zend = _loops_s0[idx].zend;
    for(int z = zbeg; z < zend; ++z)
      for(int y = ybeg; y < yend; ++y)
        #pragma omp simd
        for(int x = xbeg; x < xend; ++x) {
          auto res = 0.5*(i0(x+1,y,z) + i0(x,y,z));
          s0(x,y,z) = res;
        }}

  { // apply s1 stencil
    int xbeg = _loops_s1[idx].xbeg;
    int xend = _loops_s1[idx].xend;
    int ybeg = _loops_s1[idx].ybeg;
    int yend = _loops_s1[idx].yend;
    int zbeg = _loops_s1[idx].zbeg;
    int zend = _loops_s1[idx].zend;
    for(int z = zbeg; z < zend; ++z)
      for(int y = ybeg; y < yend; ++y)
        #pragma omp simd
        for(int x = xbeg; x < xend; ++x) {
          auto res = i1(x,y,z) * (s0(x,y+1,z) + s0(x,y-1,z));
          s1(x,y,z) = res;
        }}
}

Figure 3.7: Optimized code for the example stencil sequence

To perform the experiments, we set the domain size to 64 × 64 × 60 elements with 3 × 3 × 3 halo elements similar to the COSMO [16] production configuration. All experiments are performed using double-precision floating-point numbers. We set the number of processors to the available cores T = 18, T = 32, and T = 10 for the Xeon, Xeon Phi, and Power systems, respectively. To measure the execution time, we repeat every experiment 64 times and discard the first 16 measurements to warm up the memory hierarchy. Before every run, except when learning the fast memory model, we flush the L1 and L2 caches with dummy data. As we assume exclusive system access, we run one thread per processor. We time only the stencil executions, which excludes the initialization logic and the boundary conditions. All plots show median values and nonparametric 95% confidence intervals [56] to visualize the distribution of the measurements.

3.4.2 Implementation

absinthe provides a high-level stencil DSL to implement stencil programs. Figure 3.6 and Figure 3.7 show the DSL version and the generated code for the example stencil sequence introduced in Section 3.1.2, respectively. absinthe parses the DSL to extract the access patterns. Based on this analysis, the optimizer derives the integer linear program and determines the optimal solution using the CPLEX solver [38]. After the optimization, the code generator emits C++ code that implements the fusion and tile size choices of the optimal solution. The code generator performs overlapped tiling [8] with one tiling hierarchy level and periodic boundary conditions. In addition to the stencil sequence, we also generate the boilerplate necessary to execute, benchmark, and verify the stencil sequence. The verification compares the results of the parallel implementation to naive sequential code. The code generator utilizes the Jinja2 template engine to specialize a generic stencil sequence template with the program-specific logic.

3.4.3 Learning the Target Systems

absinthe learns the performance model parameters once for every target system and then tunes all stencil programs using the same parameter set. Section 3.2.4 discusses the performance model learning. We next evaluate the quality of the learned model parameters.

Figure 3.8: Measured (polygons) and estimated (lines) execution times for the fast memory (p=positions) and slow memory (i=input arrays and b=boundary width) training stencils and variable tile sizes along the x-dimension, shown for the Xeon, Xeon Phi, and Power systems.

Figure 3.9: Measured and estimated execution times for the optimal (triangle), selected (squares), and random (dots) implementation variants of the fastwaves (N = 9), diffusion (N = 16), and advection (N = 8) kernels on the Xeon, Xeon Phi, and Power systems.

To improve the noise robustness, we use all 48 measurements per experiment when learning the model parameters using LAD regression [54]. We use the median of the repeated measurements when computing the R^2 values. Figure 3.8 compares for the tile sizes 5 × 5 × x the measured execution times of the training stencils to the learned fast memory and slow memory models. We observe that for the shown tile sizes, the execution times increase almost linearly with the tile size along the x-dimension with model predictions close to the measured execution times of the training stencils. The annotations mark the different training stencils. For example, the annotation p = 12 refers to the training stencil that accesses twelve positions, and the annotation i = 3, b = 1 refers to the training stencil that accesses three input arrays with boundary width one. The R^2 values of 0.87, 0.95, and 0.94 for the fast memory model and of 0.96, 0.96, and 0.90 for the slow memory model confirm the quality of the learned model parameters for the Xeon, Xeon Phi, and Power systems, respectively.

3.4.4 Tuning the Application Kernels

Existing benchmark suites such as PolyBench [57] often contain stencil programs that iterate only one stencil instead of multiple different stencils. To evaluate the quality of our fusion and tile size selection choices, we thus implement three stencil sequences from the COSMO atmospheric model [16]. These real-world benchmark kernels contain one-, two-, and three-dimensional stencils from first to fifth order. The fastwaves kernel consists of nine stencils that compute the pressure gradient, update the horizontal wind speeds, and compute the wind divergence. The diffusion kernel consists of sixteen stencils that update the pressure and the wind speeds. The advection kernel consists of eight stencils that transport the horizontal wind speeds. The two-dimensional advection and diffusion stencils access only neighbor elements in the horizontal xy-plane, while the fastwaves stencils perform three-dimensional accesses.

To perform the experiments, we adapt the COSMO stencils to match the current implementation of our code generator, which supports only three-dimensional arrays and periodic boundary conditions. We thus replace the original boundary conditions and remove accesses to lower-dimensional arrays.

Figure 3.9 shows the performance of absinthe for all application kernels and target systems. We compare the measured and estimated execution times of the optimal solution found by absinthe to selected and random implementation variants with group index and tile size constraints. Data points close to the diagonal imply good model prediction. The dashed lines delimit the region with 20% prediction error. The timings include the stencil computation without boundary conditions.

Additionally, we add the following selected implementation variants: the min and max heuristics combine minimal and maximal fusion with the absinthe tile size selection, the hand approach reproduces the hand-tuned fusion and tile size choices of the COSMO production code, and the auto-tuning approach combines tile size auto-tuning with the absinthe fusion choices. As the hand and auto-tuning variants may violate the cache size or load imbalance constraints, their estimated execution times are possibly invalid.

The optimal solutions for the three application kernels contain at most four stencil groups. To sample random implementation variants, we select 20 random group index assignments with at most four groups and repeat the optimization with constraints that fix the group indexes. To examine different tile sizes, we also introduce tile size constraints that enforce smaller or larger tiles along one dimension. In total, we sample 60 random implementation variants. The auto-tuning exhaustively searches for every stencil group the tile sizes 1, 2, 4, 12, 30, and 60 in the z-dimension and the powers of two in the xy-plane. The tuning of isolated stencil groups does not consider the cache reuse of consecutive stencil groups. However, the approach circumvents the joint evaluation of all stencil group tile size combinations. We use the absinthe fusion strategy to avoid auto-tuning the exponential fusion search space.

fastwaves The optimal solution for all target systems splits the fastwaves kernel into two groups (Figure 3.1 shows the optimal solution for the Xeon system). The tile shapes reflect the three-dimensional access patterns detailed in Figure 3.10.
diffusion The optimal solutions split the diffusion kernel into two, one, and four groups of equal sizes with tile size 64 × 13 × 1, 64 × 16 × 1, and 64 × 32 × 1 for the Xeon, Xeon Phi, and Power systems, respectively. These choices reflect the two-dimensional access pattern of the stencils and the different L2 cache capacities. Most implementation variants perform better than expected. We attribute this bias to the peel cost of the slow memory model, which does not consider the cache reuse of consecutive innermost loop executions that span the full domain.

Figure 3.10: Data-flow graph of the fastwaves kernel. All edges are annotated with the non-center access offsets that the stencils read in addition to the center position (i,j,k).

advection The optimal solution of the advection kernel fuses all stencils with tile size 64 × 16 × 1, 64 × 32 × 1, and 64 × 32 × 1 for the Xeon, Xeon Phi, and Power systems, respectively. The fast memory model dominates the predicted execution time of the compute-intensive seven-point stencils. Especially for the Xeon Phi and Power systems, the fast memory model tends to underestimate the measured execution times, for example because the cache accesses may not fully overlap with the actual stencil computation.

The auto-tuned versions of the fastwaves, diffusion, and advection kernels perform 6.5%, 0.8%, and 3.4% faster than absinthe for the Xeon system, 0.7%, 7.8%, and 7.1% faster than absinthe for the Xeon Phi system, and 6.1%, 2.5%, and 1.7% faster than absinthe for the Power system, respectively. The small performance penalty compared to the much slower auto-tuning and the relative agreement of estimated and measured execution times demonstrate the effectiveness of our approach for different stencils and hardware architectures. The hand-tuned kernels perform well, but the manual optimization of large codes is tedious and time-consuming. The combination of fusion heuristics with absinthe demonstrates the challenge of independent fusion and tile size selection. The auto-tuning approach always works best since it does not depend on the performance model assumptions. For example, the auto-tuned tile sizes violate the cache capacity constraints of the Power system, which means tiles fitting the L2 cache are not optimal for this architecture.

Auto-tuning generates 277 implementation variants for every stencil group, which on the Xeon system results in 40 minutes search time for the diffusion kernel. Extending the auto-tuning to the 2^15 fusion choices increases the search time beyond 10,000 hours. absinthe explores the full search space in 40 seconds.

3.4.5 Comparison with Halide and Polymage

To compare absinthe with related frameworks, we implement the application kernels with Polymage [13] (git:a8a101b) and Halide [12] (git:3af2386). We optimize the stencil sequences with the built-in auto-schedulers [23, 58], compile the absinthe and Polymage kernels with GCC 5.3, and adapt the scheduling parameters to match the processor count of the Xeon system.


Figure 3.11: Execution times of the absinthe, Halide, and Polymage tuned application kernels for domain size 256 × 256 × 60 on the Xeon system (slowdowns relative to absinthe).

Figure 3.11 compares the execution times for the absinthe, Polymage, and Halide tuned application kernels. We set the domain size to 256 × 256 × 60 elements since Halide and Polymage do not perform well for small domains. absinthe and Polymage apply the same code transformations and use the same compiler, which makes the results comparable. Halide compiles with LLVM and performs loop reordering and stencil inlining, which limits the significance of the comparison. absinthe performs best for all kernels, which emphasizes the quality of the fusion and tile size selection choices.

3.5 related work

Tile size selection is a well-researched topic with two main directions: purely analytic approaches [59–68] and empirical approaches [52, 69–71] that search different configurations for optimal performance. Yuki et al. [72] learn machine-specific static tile size selection models. Artificial neural networks are also effective for both instruction throughput prediction [73] and tile size selection [74]. Cociorva et al. [75] observe that program scheduling and tile size selection are intertwined and propose a dynamic programming based approach for combined scheduling and tile size selection specific to tensor sequences. Their work does not consider stencil computations. Qasem and Kennedy [76] propose a model-guided empirical approach for loop fusion and tiling. Beaugnon et al. [77] also combine analytical modeling and empirical search space exploration. They present an analytical model to compute a lower bound for the execution time of partially-specified program variants that allows them to prune the search space early on. None of these works provide a linear programming formulation.

There exist several approaches for generating code for iterative stencils. Patus [22] is a code generator for iterative stencils on CPUs and GPUs. Henretty et al. [78] introduce a code generator for iterative multi-statement stencils that implements the DLT [79] data layout transformation. Both code generators rely on tile size auto-tuning. Pochoir [11] is an iterative stencil compiler that uses cache-oblivious tiling techniques to avoid the tile size selection problem. Prajapati et al. [80] manually derive tile size selection models for single-statement stencils executed on GPUs. They require non-linear integer programming which takes hours to terminate and commonly does not guarantee optimal solutions.

STELLA [5] is a domain-specific language for climate modeling, while Halide [12] and Polymage [13] are domain-specific languages for image processing pipelines. All approaches support the optimization of stencil programs with data-locality transformations. MODESTO [1] is an analytic performance model to derive optimal fusion patterns for stencil programs based on memory bandwidth estimates. However, the model does not consider loop overheads and other metrics important for good tile size selection. For Polymage, Jangda and Bondhugula [58] employ dynamic programming to explore the space of fusion choices according to a cost function. During the optimization, a heuristic selects suitable tile sizes. For Halide, Liao et al. [81] and Mullapudi et al. [23] suggest cost functions and custom optimization strategies to perform automatic scheduling, which covers tile size selection. These solutions do not integrate the fusion and tile size selection choice in a single linear model. absinthe thus provides the first holistic integer linear programming formulation that simultaneously schedules stencil programs and chooses matching tile sizes.

3.6 summary of the approach

absinthe instantiates an optimization problem that evaluates a learned performance model to select target system-specific data-locality transformations for stencil codes. Surprisingly, six performance model parameters are sufficient to capture the relevant target system characteristics. The evaluation of the performance model is fast and requires no complex operations. We also demonstrate how to linearize the performance model for stencil

codes with known domain sizes of limited range. These properties facilitate the efficient exploration of the exponential search space with the help of powerful linear solvers. Our empirical evaluation provides strong evidence that learning a target-specific performance model is a competitive alternative to auto-tuning.

4 AN ANALYTICAL CACHE MODEL

Most programmers know the time complexity of their algorithms and tune codes by minimizing computation. Yet, ever increasing data-movement costs urge them to pay more attention to data-locality as a prerequisite for peak performance. When considering different implementation variants of an algorithm, we typically have a good understanding of which variant performs less computation or can be vectorized well. Selecting the optimal tile size or deciding which loop fusion choice is optimal is far less intuitive. Essentially, we lack a perception of the cache state that allows us to reason about data movement.

Data-locality optimizations are often pushed to the end of the development cycle when the code is available for benchmarking. But at this stage eliminating fundamental design flaws may be hard. We believe a cache model responsive enough to be part of the day-to-day workflow of a performance engineer can provide the necessary guidance to make good design choices upfront. After the completion of the development, the very same model could provide the necessary data for accurate model-driven automatic memory tuning.

We present haystack, the first cache model for fully associative caches with least recently used (LRU) replacement policy which is both fast and accurate. At the core of our model, we calculate the LRU stack distance [82] (also called reuse distance [83–85]) symbolically for each memory access. The stack distance counts the distinct memory accesses between two subsequent accesses of the same memory location. All memory accesses with distance shorter than the cache size hit a fully associative LRU cache.

We show in Figure 4.1 the scaling of haystack compared to the Dinero IV [86] cache simulator for increasing problem sizes. The simulation times are proportional to the problem size since simulators [86–89] enumerate all memory accesses. We use the Barvinok algorithm [49] to count the cache misses. The algorithm avoids explicit enumeration by deriving symbolic expressions that evaluate to the cardinality of the counted affine integer sets and maps. As demonstrated by the flat GEMM scaling curve, this symbolic counting makes the model execution time problem-size independent. Even for Cholesky factorization, with its known


Figure 4.1: Scaling of the cache model compared to simulation. The plot shows the execution time of haystack (analytical model) and Dinero IV (simulation) for the cholesky and gemm kernels over an increasing number of memory accesses.

non-linearities [90] that prevent full symbolic counting, the scaling of the execution time remains flat compared to simulation. While computing stack distances for static control programs is a well-known technique, reducing stack distance information for all dynamic memory accesses to a single cache miss count is difficult. Beyls et al. [90] show that stack distances in general are non-affine. The divisions introduced when modeling cache lines add even more non-affine constraints. While symbolic summation over affine constraint sets is possible with the Barvinok algorithm, symbolic counting over non-affine constraints is considered hard in general. In this work, we show that this generally hard problem can in practice become surprisingly tractable if non-linearities are carefully eliminated by either specialization or partial enumeration. As a result, we contribute:

• The first efficient cache model to accurately predict static affine programs on fully associative LRU caches.

• An efficient hybrid algorithm that combines symbolic counting with partial enumeration to reduce the asymptotic cost of the cache miss counting.

• A set of simplification techniques that exploit the regular patterns induced by the cache line structure to make the stack distance polynomials affine.

• An exhaustive evaluation which shows that our cache model performs well in practice with large speedups compared to existing approaches while achieving high accuracy compared to measurements on real hardware.

4.1 background

We first introduce our hardware model, provide background on cache misses, explain the concept of affine integer sets and maps, and discuss the set of considered programs.

4.1.1 Hardware Model

A cache implements various complex and sometimes undisclosed policies that define the exact behavior. We deliberately model a generic cache with full associativity and LRU replacement policy. When writing, we assume the caches allocate a cache line and load the memory reference if necessary (write-allocate) and then forward the write to all higher-level caches (write-through). We parametrize our cache model with the cache line size L and the cache size C in bytes. When modeling multiple cache hierarchy levels, we assume inclusive caches and specify the cache size for every hierarchy level. These design choices avoid an overly detailed model that is only correct in a very controlled environment with known data alignment and allocation. As shown in Section 4.3.2, we still model enough detail to produce actionable and accurate results in practice.

4.1.2 Cache Misses

We assume that the modeled programs run in isolation and that their execution starts with an empty cache. We count data accesses and ignore instruction fetches. According to Hill [91], we distinguish three types of cache misses: 1) compulsory misses happen if a program accesses a cache line for the first time, 2) capacity misses happen if a program accesses too many distinct cache lines before accessing a cache line again, and 3) conflict misses happen if a

program accesses too many distinct cache lines that map to the same cache set of an associative cache before accessing a cache line again. We model fully associative caches and thus compute only compulsory and capacity misses.

Not every access of a program variable translates into a cache access, as the compiler may place scalar variables in registers. Compiler and hardware techniques such as out-of-order execution also change the order of the memory accesses. We assume all scalar variables are buffered in registers and count only array accesses in the order provided by the compiler front end.

The cache misses measured when profiling a program depend on many factors generally unknown to an analytical cache model; for example, concurrent programs or the operating system may pollute the caches or the hardware prefetchers may load more data than necessary. We do not consider this system noise and instead provide an approximate but deterministic cache model.
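For intuition, the two miss classes we model can be reproduced with a brute-force reference that replays a trace on a fully associative LRU cache (a toy sketch; the analytical model never enumerates the trace):

from collections import OrderedDict

def classify_accesses(trace, cache_lines):
    """Classify every access of a cache line trace on a fully associative LRU cache."""
    lru = OrderedDict()              # cached lines, ordered from least to most recently used
    seen = set()                     # lines touched so far, to detect compulsory misses
    kinds = []
    for line in trace:
        if line in lru:
            lru.move_to_end(line)
            kinds.append("hit")
            continue
        kinds.append("compulsory" if line not in seen else "capacity")
        seen.add(line)
        lru[line] = None
        if len(lru) > cache_lines:
            lru.popitem(last=False)  # evict the least recently used line
    return kinds

print(classify_accesses(["A", "B", "A", "C", "B", "A"], cache_lines=2))
# ['compulsory', 'compulsory', 'hit', 'compulsory', 'capacity', 'capacity']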

4.1.3 Integer Sets and Maps

We use sets and maps of integer tuples to count the cache misses. We next define the relevant set and map operations necessary for the model implementation. These operations are a subset of the functionality provided by the integer set library (isl) [48]. An affine set

S = {(i_0, ..., i_n) : con(i_0, ..., i_n)}

defines the subset of integer tuples (i_0, ..., i_n) ∈ ℤ^n that satisfy the constraints con(i_0, ..., i_n). The constraints are Presburger formulas that combine affine expressions with comparison operators, boolean operators, and existential quantifiers. Presburger arithmetic [92] also admits floor division and modulo with a constant divisor. An affine map

R = {(i_0, ..., i_n) → (j_0, ..., j_m) : con(i_0, ..., i_n, j_0, ..., j_m)}

defines the relation from integer tuples (i_0, ..., i_n) ∈ ℤ^n to integer tuples (j_0, ..., j_m) ∈ ℤ^m that satisfy the constraints con(i_0, ..., i_n, j_0, ..., j_m) where the constraints have the same restrictions as the set constraints. The domain R_dom defines the set of the integer tuples (i_0, ..., i_n) of the input dimensions for which a relation exists, and conversely the range R_ran defines the set

int sum = 0;
for(int i=0; i<4; ++i)
S0:  M[i] = i;
for(int j=0; j<4; ++j)
S1:  sum += M[3-j];

Figure 4.2: Example program used for illustration.

of integer tuples (j_0, ..., j_m) of the output dimensions for which a relation exists. Both sets and maps support the set operations intersection S_1 ∩ S_2, union S_1 ∪ S_2, projection, and cardinality |S|. The domain intersection R ∩_dom S intersects the domain of the map R with the set S. Maps also support the map operations composition R_2 ◦ R_1 and inversion R^-1. The operator

lexmin(R) = {(i_0, ..., i_n) → (m_0, ..., m_m) : ∄ (i_0, ..., i_n) → (j_0, ..., j_m) ∈ R s.t. (j_0, ..., j_m) ≺ (m_0, ..., m_m)}

computes for every input tuple (i_0, ..., i_n) the lexicographically smallest output tuple (m_0, ..., m_m) of all tuples (j_0, ..., j_m) related to the input tuple. A named set or map prefixes the integer tuples with names that convey semantic information. For example, we prefix the array element M(2) with the array name and the statement instance S0(1) with the statement name. We use statement names starting with the letter S and array names starting with any other letter. The names are semantically equivalent to an additional tuple dimension.

4.1.4 Static Control Programs

Our cache model analyzes affine static control programs consisting of loop nests with known loop bounds that perform array accesses with affine index expressions. Figure 4.2 shows an example program with two statements: the statement S0 initializes an array M and the statement S1 accumulates the array elements. Before analyzing a program, we extract the sets and maps that specify the statement execution order and the memory access offsets. The iteration domain

I = {S0(i) : 0 ≤ i < 4; S1(j) : 0 ≤ j < 4}

Figure 4.3: The statement instances and the related schedule values (schedule S) and memory accesses (access map A) are sufficient to compute the cache misses of a program.

defines the set of all executed statement instances. For the two statements of the example program, the loop variables i and j are limited to the range zero to three. To define the execution order, the schedule

S = {S0(i) → (0, i); S1(j) → (1, j)} ∩_dom I

maps the statement instances to a multi-dimensional schedule value. The statement instances then execute according to the lexicographic order of the schedule values. The intersection with the iteration domain I limits the schedule domain to the program loop bounds. The access map

A = {S0(i) → M(i); S1(j) → M(3 − j)}

maps the array accesses of the statement instances to the accessed array elements. The iteration domain I, the schedule S, and the access map A capture all relevant program properties necessary to evaluate the cache model. Figure 4.3 shows how the schedule S and the access map A relate statement instances, schedule values, and memory locations.
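These three objects translate directly into an integer set library. A minimal sketch with the islpy Python bindings (an assumption on our side; isl writes named tuples with brackets instead of parentheses):

import islpy as isl

# Iteration domain, schedule, and access map of the example program of Figure 4.2.
I = isl.UnionSet("{ S0[i] : 0 <= i < 4; S1[j] : 0 <= j < 4 }")
S = isl.UnionMap("{ S0[i] -> [0, i]; S1[j] -> [1, j] }").intersect_domain(I)
A = isl.UnionMap("{ S0[i] -> M[i]; S1[j] -> M[3 - j] }").intersect_domain(I)

# The composition A . S^-1 relates every schedule value to the accessed array element.
print(S.reverse().apply_range(A))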

4.2 cache model

Our cache model computes for every memory access the stack distance parametric in the loop variables and counts the instances with a stack distance larger than the cache capacity to determine the capacity misses. All memory accesses with undefined backward stack distance access the cache line for the first time and count as compulsory misses.

Figure 4.4 shows the computation of the capacity misses for the example program introduced by Figure 4.2: (1) enumerates the statement instances according to the schedule S and (2) applies the access map A to the statement instances to compute the memory trace. Assuming the array element size is equal to the cache line size, the stack distance corresponds to the cardinality of the set {M(1), M(2), M(3)}, which contains the array elements accessed between and including the two subsequent accesses of M(1). The second access of M(1) hits the cache if the cardinality of the set is lower than or equal to the cache capacity.

(1) S0(0) S0(1) S0(2) S0(3) S1(0) S1(1) S1(2) S1(3)
(2) M(0) M(1) M(2) M(3) M(3) M(2) M(1) M(0)

cache hit = 1 if |{M(1), M(2), M(3)}| ≤ cache size, and 0 otherwise

Figure 4.4: The (1) statement instance and the (2) memory access trace of the example program allow us to compute if the access M(1) of the statement S1(2) hits the cache.

4.2.1 Computing the Stack Distance

The stack distance computation counts the number of distinct memory accesses between subsequent accesses of the same memory location. We determine for every memory reference the last access to the same memory location and count the set of memory accesses since this last access to obtain the stack distance parametric in the loop variables. For our example program, the stack distance of the memory access in statement S1 is equal to the loop variable j plus one. We can thus express the stack distance of the memory access with the map

D = {S1(j) → j + 1 : 0 ≤ j < 4}

limited to the statement iteration domain. As the statement S0 accesses all array elements for the first time, its backward stack distance is undefined and the accesses count as compulsory misses. Our discussion of the stack distance computation initially assumes that every statement performs at most one access of a one-dimensional array with an element size equal to the cache line size. At the end of this section, we show how to overcome these limitations.
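A brute-force reference that scans the memory trace of Figure 4.4 reproduces these distances and can serve as a sanity check for the symbolic computation (an illustrative sketch only):

def backward_stack_distances(trace):
    """Distinct locations accessed between and including subsequent accesses of
    the same location; None marks a first access, i.e., a compulsory miss."""
    last = {}                                # location -> index of its previous access
    dists = []
    for k, loc in enumerate(trace):
        j = last.get(loc)
        dists.append(None if j is None else len(set(trace[j:k + 1])))
        last[loc] = k
    return dists

trace = ["M0", "M1", "M2", "M3", "M3", "M2", "M1", "M0"]   # S0(0..3) followed by S1(0..3)
print(backward_stack_distances(trace))
# [None, None, None, None, 1, 2, 3, 4] -> the accesses of S1(j) have distance j + 1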

The memory accesses execute according to the statement execution order defined by the schedule. The map

L_≺ = {(i_0, ..., i_n) → (j_0, ..., j_n) : (i_0, ..., i_n) ≺ (j_0, ..., j_n) ∧ (i_0, ..., i_n), (j_0, ..., j_n) ∈ S_ran}

relates the schedule values (i_0, ..., i_n) to all lexicographically larger schedule values (j_0, ..., j_n) and the map

L_⪯ = {(i_0, ..., i_n) → (j_0, ..., j_n) : (i_0, ..., i_n) ⪯ (j_0, ..., j_n) ∧ (i_0, ..., i_n), (j_0, ..., j_n) ∈ S_ran}

relates the schedule values (i_0, ..., i_n) to all lexicographically larger or equal schedule values (j_0, ..., j_n). Later on, we use these helper maps to filter relations by execution order. The stack distance computation first identifies all accesses to the same array element. The equal map

E = S ◦ A^-1 ◦ A ◦ S^-1

relates each schedule value to all schedule values that access the same array element. The concatenation A ◦ S^-1 maps the schedule values to the accessed array elements and its reverse S ◦ A^-1 maps the accesses back to the schedule values. For our example program, the composition

A ◦ S^-1 = {(0, i) → M(i) : 0 ≤ i < 4; (1, j) → M(3 − j) : 0 ≤ j < 4}

relates the schedule values to the accesses of the array M. The equal map then relates all schedule values that access the same array element. For example, the relations (0, i) → M(i) and (1, j) → M(3 − j) access the same array element if i is equal to 3 − j. The resulting equal map

E = {(0, i) → (0, i) : 0 ≤ i < 4; (1, j) → (1, j) : 0 ≤ j < 4;
     (0, i) → (1, j) : j = 3 − i ∧ 0 ≤ i < 4; (1, j) → (0, i) : i = 3 − j ∧ 0 ≤ j < 4}

Figure 4.5: The relations of the forward map F and the backward map B for the statement instance S1(2) of the example program (the forward map F corresponds to the concatenation of the blue backward arrow and the black forward arrows). The map intersection defines the statement instances between and including the two accesses of M(1). The concatenation with the map A yields the related memory accesses.

contains the relation (0, i) → (1, j) with j = 3 − i and its reverse, but also the self relations of the schedule values. The lexicographically shortest relations of the equal map denote the subsequent accesses to the same array element which are closest in time. The next map

N = S^-1 ◦ lexmin(L_≺ ∩ E) ◦ S

intersects the equal map E with the map L_≺ to filter out all backward-in-time and self relations, and the lexmin operator removes all forward-in-time relations except for the shortest ones. We compose the result with S and S^-1 to convert the schedule values to statement instances. The next map consequently relates every statement instance to the next statement instance that accesses the same array element. For our example program, the equal map contains only the forward relation (0, i) → (1, j) which means the lexmin operator has no effect since there is only one relation per statement instance. The next map

N = {S0(i) → S1(j) : j = 3 − i ∧ 0 ≤ i < 4}

thus relates the instances of statement S0 to the instances of statement S1 that access the same array element. The next map contains subsequent statement instances that access the same array element but not the statement instances executed in between. To compute them, we intersect the set of statement instances executed after the first access with the set of statement instances executed before the second access of the same array element. Figure 4.5 illustrates this intersection. The backward map

B = S^-1 ◦ L_⪯^-1 ◦ S

relates the statement instances to all statement instances with lexicographically smaller or equal schedule value. The maps S and S^-1 convert from statement instances to schedule values and back. The forward map

F = (S^-1 ◦ L_⪯ ◦ S) ◦ N^-1

relates the statement instances to all statement instances with lexicographically larger or equal schedule value than the statement instance that last accessed the same array element. We reverse the next map N to compute the statement instance that accessed the array element last. The intersection of the forward map and the backward map contains all statement instances executed between subsequent accesses of the same array element.
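Putting the maps together, the core of the construction can be sketched with islpy (a simplified, self-contained sketch under the assumption that the listed isl operations are exposed by the bindings; the actual implementation additionally counts the resulting map with the Barvinok algorithm to obtain the stack distance polynomials):

import islpy as isl

S = isl.UnionMap("{ S0[i] -> [0, i] : 0 <= i < 4; S1[j] -> [1, j] : 0 <= j < 4 }")
A = isl.UnionMap("{ S0[i] -> M[i] : 0 <= i < 4; S1[j] -> M[3 - j] : 0 <= j < 4 }")

vals = S.range()
Lt = vals.lex_lt_union_set(vals)     # L_<: strictly lexicographically smaller schedule values
Le = vals.lex_le_union_set(vals)     # L_<=: lexicographically smaller or equal schedule values

E = S.reverse().apply_range(A).apply_range(A.reverse()).apply_range(S)   # same array element
N = S.apply_range(Lt.intersect(E).lexmin()).apply_range(S.reverse())     # next access of the element
B = S.apply_range(Le.reverse()).apply_range(S.reverse())                 # executed before
F = N.reverse().apply_range(S).apply_range(Le).apply_range(S.reverse())  # executed after the last access
stack = F.intersect(B).apply_range(A)   # statement instance -> elements accessed since the last access
print(stack)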

Figure 4.5 shows the forward and backward map relations for the statement instance S1(2) of the example program that accesses the array element M(1). The forward map F corresponds to the concatenation of the blue backward arrow and the black forward arrows. The intersection of the two maps contains the statement instances executed between the subsequent accesses of the array element M(1). We finally concatenate this intersection with the access map A to obtain the stack distance map that relates every statement instance to the array accesses performed since the last access of the same array element.

The number of related array elements defines the stack distance of the statement instances in the stack distance map. We use the isl [48] implementation of the Barvinok algorithm [49] to count the relations symbolically. The algorithm computes the map cardinality by counting the points of the range related to every point of the domain. The results of the computation are quasi-polynomials parametric in the input dimensions of the map that evaluate to the number of related range points. As the domain is not always homogeneous, the algorithm splits the map domain into pieces that consist of a quasi-polynomial and the subdomain of the map domain where the polynomial is valid. After counting the stack distance map, the distance set

D = {|A ◦ (F ∩ B)|}

contains pieces with quasi-polynomials parametric in the schedule input dimensions that for a subdomain of the iteration domain evaluate to the stack distance. The pieces do not overlap and together cover the full iteration domain. For our example program, the distance set

D = {S1(j) → j + 1 : 0 ≤ j < 4}

contains one piece with the polynomial S1(j) → j + 1 and the domain 0 ≤ j < 4 covering the entire iteration domain.

cache lines and multi-dimensional arrays An adapted access map A that relates statement instances to cache lines instead of array elements suffices to support cache lines and multi-dimensional arrays. Let us assume our example program initializes the diagonal elements of a two-dimensional array M(i, i). Then the access map

A = {S0(i) → M(i, c = ⌊i · E/L⌋)}

models the accessed cache lines given the size of the array elements E and cache line size L in bytes. We replace the innermost dimension of the array

access with the cache line index c, which multiplies the array index with the element size and divides the result by the cache line size. As a result, accesses of neighboring array elements map to the same cache line. The outer dimensions of the array index remain unchanged since we assume the innermost dimension is cache line aligned and padded to an integer multiple of the cache line size. This restriction can be lifted at the expense of a more complex formulation.
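For example, with 8-byte elements and 64-byte cache lines the mapping collapses eight consecutive indices onto one cache line (a toy illustration of the ⌊i · E/L⌋ term):

E, L = 8, 64                                   # element size and cache line size in bytes
print([(i, (i * E) // L) for i in range(12)])  # indices 0..7 map to line 0, indices 8..11 to line 1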

multiple memory accesses per statement An extension of the schedule S and the access map A with an additional schedule dimension that orders the memory accesses of the statements allows us to model more than one memory access per statement. Let us assume the statement S0 of the example program reads the array element I(i) and writes the result to the array element M(i). We then extend the schedule

S = {S0(i, a) → (0, i, a); S1(j, a) → (1, j, a)}

with the access dimension a that orders the memory accesses of the statement. Then the access map

A = {S0(i, 0) → I(i); S0(i, 1) → M(i); S1(j, 0) → M(3 − j)}

assigns every array access to a unique statement instance since the access dimension enumerates the array accesses of every statement in the order provided by the compiler front end. The extended schedule executes only one array access per statement instance and thus requires no further modifications of the stack distance computation. The output of the stack distance computation is a set of polynomials that defines the backward stack distance for every array access of the static control program.

4.2.2 Counting the Capacity Misses

All memory accesses with stack distance larger than the cache size count as capacity misses. As discussed in Section 4.2.1, the stack distance computation splits the iteration domain into pieces. Each piece defines the stack distance for a subdomain of the iteration domain. To obtain the capacity misses, we count for every piece the points of the subdomain for which the polynomial evaluates to a stack distance larger than the cache size.

Figure 4.6: To count the non-affine piece P, we project out the affine i-dimension to obtain the enumeration domain E. We next bind the j-dimension of the piece P to the j-values in the enumeration domain and separately count the cache misses for the resulting affine pieces P_{j=0}, P_{j=1}, and P_{j=2}.

The piece with polynomial S1(j) → j + 1 and domain 0 ≤ j < 4 defines the stack distance for the entire iteration domain of our example program. The cache miss set

M = {S1(j) : j + 1 > C ∧ 0 ≤ j < 4}

contains all points of the piece with stack distance larger than cache size C, which means the cardinality of the cache miss set |M| is equal to the number of capacity misses. Assuming cache size two, the cache miss set contains the statement instances S1(2) and S1(3) that cause two capacity misses.

The distance set specifies the stack distance for all program statements. To count the capacity misses per statement, we split the distance set by statement and compute the cache misses separately. Without loss of generality, we discuss the cache miss computation for a statement S0.

The Barvinok algorithm also computes the set cardinality by counting the points symbolically. We use the algorithm to count affine cache miss sets and resort to explicit enumeration for non-affine sets. As explicit enumeration is expensive, we only enumerate the non-affine polynomial dimensions and count the affine dimensions symbolically. This partial enumeration technique splits cache miss sets into pieces with affine lower-dimensional polynomials. Figure 4.6 demonstrates the technique for an example polynomial with non-affine j-dimension. Section 4.2.3 discusses further techniques to split non-affine pieces into multiple affine pieces.

Algorithm 1: counting the capacity misses
input:     D distance set of pieces
output:    T total number of cache misses
parameter: C cache size

 1  T ← 0
 2  foreach P in D do
 3      if isPieceAffine(P) then
 4          T ← T + countAffinePiece(P, C)
 5      else
 6          E ← getNonAffineDomain(P)
 7          foreach pt in E do
 8              P_pt ← bindNonAffineDimensions(P, pt)
 9              T ← T + countAffinePiece(P_pt, C)
10          end
11      end
12  end
13  return T

Algorithm 1 counts the total number of cache misses T given the distance set D of the program. The algorithm enumerates all pieces P of the distance set (lines 2-12). Every piece P consists of a polynomial and a domain that define the stack distance of a memory access for a subdomain of the iteration domain. If the polynomial of the piece P is affine, we count the cache misses symbolically (lines 3-4), otherwise the partial enumeration projects the non-affine dimensions out of the domain of the piece P and enumerates all points of the resulting non-affine enumeration domain E (lines 6-9). For every such point pt, we bind the non-affine dimensions of the piece P to the coordinates of the point pt and count the cache misses of the affine piece P_pt symbolically. Figure 4.6 illustrates the splitting of non-affine pieces (lines 6-9).

The method countAffinePiece counts the cache misses of the piece P with affine stack distance polynomial. A polynomial is affine if its degree is zero or one. We first compute the cache miss set

M = {S0(i_0, ..., i_n) : P_p(i_0, ..., i_n) > C ∧ (i_0, ..., i_n) ∈ P_D}

where P_p denotes the polynomial and P_D the domain of the piece P. The cache miss set contains all memory accesses with stack distance larger than cache size C. To count the cache misses, we compute the cardinality |M| using the Barvinok algorithm.

The method getNonAffineDomain projects all points of the piece P to the non-affine dimensions to obtain the enumeration domain E. For example, Figure 4.6 projects the piece

P = {S0(i, j) → i + j^2 : 0 ≤ i < 3 ∧ 0 ≤ j < 3}

which contains the quadratic term j^2. We project the points to the non-affine j-dimension to compute the enumeration domain E = {j : 0 ≤ j < 3}. The enumeration always spans all dimensions with degree larger than one. But the polynomial may also contain product terms with multiple dimensions. We then greedily select the dimensions that conflict with most other dimensions. For example, if the polynomial contains the products ij and ik, we enumerate the i-dimension since it conflicts with both other dimensions.

The method bindNonAffineDimensions binds the non-affine dimensions of the piece P to the values of the point pt. For example, Figure 4.6 binds the j-dimension of the piece

P = {S0(i, j) → i + j^2 : 0 ≤ i < 3 ∧ 0 ≤ j < 3}

to the value two and obtains the piece

P_{j=2} = {S0(i) → i + 4 : 0 ≤ i < 3}

which we can count with the method countAffinePiece. The counting algorithm works for all static control programs and avoids complete enumeration unless all dimensions are non-affine.
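The effect of the partial enumeration can be mimicked numerically on the example piece of Figure 4.6 (a toy sketch; the real implementation counts the affine i-range with the Barvinok algorithm instead of a closed-form expression):

def count_capacity_misses(C, I=3, J=3):
    """Count the points with i + j**2 > C by enumerating the non-affine j-dimension
    and counting the remaining affine i-range in closed form."""
    total = 0
    for j in range(J):                       # partial enumeration of the j-dimension
        lo = max(0, C - j * j + 1)           # for fixed j the condition i > C - j**2 is affine
        total += max(0, I - lo)              # feasible i in [0, I)
    return total

print(count_capacity_misses(C=2))            # counts the pieces P_{j=0}, P_{j=1}, and P_{j=2}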

4.2.3 Eliminating Non-Affine Terms

Many stack distance polynomials contain non-affine terms that prevent fast symbolic counting. We develop rewrite strategies that eliminate non-affine terms containing floor expressions. The floor expressions themselves are quasi-affine but often appear in products with other non-constant operands modeling effects such as the stack distance variation for different cache line offsets. We specialize the stack distance polynomials for different cache line offsets to make them affine, which enables the efficient symbolic counting.

The floor expressions of some polynomials differ only by a constant offset. For example, the piece

P = {S0(i, j) → (⌊(1 + i)/3⌋ − ⌊i/3⌋) j : 0 ≤ i < 3 ∧ 0 ≤ j < 2}

contains the floor expressions ⌊(1 + i)/3⌋ and ⌊i/3⌋. The two floor expressions are equal except if i modulo three is equal to two. Then the first floor expression is larger by one. The difference of the two floor expressions thus evaluates to zero for the first two elements and to one for the last element of every cache line. Figure 4.7 shows how to introduce simplified polynomials for the first two and the last element of every cache line. This equalization technique splits the cache line into multiple regions that typically contain more than one element.

Figure 4.7: Equalization replaces the non-affine piece P with the affine pieces P_{i%3<2} and P_{i%3=2} to model a stack distance that varies at the last cache line offset.

Figure 4.8: Rasterization replaces the non-affine piece P with the affine pieces P_{i%3=0}, P_{i%3=1}, and P_{i%3=2} to model a stack distance that varies at every cache line offset.

The polynomials may also contain terms with the plain variable and other terms which compute the floor of the variable. For example, the piece

P = {S0(i, j) → (i − 3 ⌊i/3⌋)j : 0 ≤ i < 3 ∧ 0 ≤ j < 2} contains the floor expression 3 ⌊i/3⌋ which is equal to i except for a constant that depends on the cache line offset. Figure 4.8 shows how to replace the polynomial with one simplified polynomial per cache line offset. This rasterization technique enumerates all cache line offsets. We apply the two floor elimination techniques in the order of presentation and only keep the results if the degree of at least one simplified polynomial is lower than the degree of the original polynomial.

4.2.4 Counting the Compulsory Misses

All memory accesses that touch a cache line for the first time are compulsory misses. As the array M of our example program is initialized by the statement S0, the first map

F = {M(i) → S0(i) : 0 ≤ i < 4}

relates every array element to the statement instance that accesses the element first, which means the cardinality |F_dom| of the first map domain counts the compulsory misses. The compulsory misses are the memory accesses with lexicographically minimal schedule value. The first map

F = S^-1 ◦ lexmin(S ◦ A^-1)

to intersect the range of the first map with the iteration domain of the individual statements to count the compulsory misses per statement. For our example program, the composition

S ◦ A^-1 = {M(i) → (0, i) : 0 ≤ i < 4; M(j) → (1, 3 − j) : 0 ≤ j < 4}

contains two accesses for every array element. The lexmin operator removes the second access due to the lexicographically larger schedule value. After the composition with the inverse schedule S^-1, we use the Barvinok algorithm to count the compulsory misses |F_dom|.
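The same computation can be sketched with islpy for the example program (again an illustrative sketch; we print the first map instead of counting it with the Barvinok algorithm):

import islpy as isl

S = isl.UnionMap("{ S0[i] -> [0, i] : 0 <= i < 4; S1[j] -> [1, j] : 0 <= j < 4 }")
A = isl.UnionMap("{ S0[i] -> M[i] : 0 <= i < 4; S1[j] -> M[3 - j] : 0 <= j < 4 }")

first = A.reverse().apply_range(S).lexmin().apply_range(S.reverse())  # F = S^-1 . lexmin(S . A^-1)
print(first)            # relates every array element M[i] to its first access S0[i]
print(first.domain())   # four array elements, i.e., four compulsory misses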

4.2.5 Computational Complexity

All compute-heavy parts of our cache model perform Presburger arithmetic that in general is known to have very high computational complexity [92, 93]. The established complexity bounds range from polynomial time decidable [94] for expressions with fixed dimensionality and only existential quantification to double exponential [95] for arbitrary expressions. Haase [92] presents further results that show a complexity increase with the dimensionality and the number of quantifier alternations of the Presburger expression.

The Presburger relations computed by our cache model have only existential quantification and the dimensionality is limited by the loop depth, suggesting polynomial complexity. Yet, the cache model may introduce further variables to model divisions or modulo operations, making the complexity exponential in the number of dimensions. Although the cache model has exponential worst-case complexity, the empirical performance evaluation presented in Section 4.3.3 shows that our cache model performs well for typical input programs. The dimensionality of the observed Presburger relations remains limited since most real-world programs do not make extensive use of branch conditions and index expressions that result in integer divisions or modulo operations.

4.3 evaluation

We next evaluate the performance of haystack and compare its accuracy to simulated and measured results.

Figure 4.9: Cache misses and hits predicted by haystack compared to the measured cache misses (median of 10 measurements) for the PolyBench kernels with the prediction error relative to the number of memory accesses on top; (a) L1 cache and (b) L2 cache.

4.3.1 Setup and Methodology

We evaluate on a test system with two 18-core Intel Xeon Gold 6150 processors. Every core has a 32KiB L1 cache (8-way set associative) and an inclusive 1MiB L2 cache (16-way set associative). The non-inclusive 18x1.375MiB L3 cache (11-way set associative) is shared among all cores. A non-inclusive cache may duplicate cache lines stored by the lower-level caches, while an inclusive cache has to. All caches load the cache line before writing (write-allocate) and forward the write only if the cache line is evicted (write-back). We compile with GCC 6.3 and use the Dinero IV cache simulator [86] to compute and the PAPI-C library [96] to measure the number of cache misses.

We evaluate the model for a number of different kernels. PolyBench 4.2.1-beta [57] is a collection of static control programs that implement algorithmic motifs from scientific computing. If not stated otherwise, the PolyBench experiments use the default configuration (large) and the model emulates fully associative L1 and L2 caches with the capacities of the test system. All performance measurements run single-threaded using only one core of the test system. To quantify measurement noise, the execution times show the median and the non-parametric 95% confidence intervals [56] of 10 measurements.

4.3.2 Accuracy Overview

All mathematical models are a trade-off between accuracy and complexity. A static cache model cannot predict dynamic measurement noise, for example due to concurrent code execution. We aim at an accurate prediction of the cache misses without modeling too many implementation details.

A comparison to measurements on a real system is the main benchmark for every cache model. To measure the cache misses, we compile the PolyBench [57] kernels with PAPI [96] support using GCC optimization level O2. PolyBench [57] flushes the caches before every kernel execution, which allows us to measure compulsory and capacity misses. We collect the counters PAPI_L1_DCM and PAPI_L2_DCM that sum the data cache misses for the L1 and L2 caches, respectively. Figure 4.9 compares the sum of the compulsory and capacity misses predicted by haystack to the measured cache misses shown by black lines. Most kernels cause more cache misses than predicted, which is expected since we model idealized fully associative caches with LRU instead of pseudo-LRU replacement policy. We also do not consider possible overfetch due to the hardware prefetchers.

[Figure 4.10: bar charts of simulated cache hits and misses versus measured accesses for each PolyBench kernel; panel (a) L1 cache (fully associative), geometric mean error 0.6% (partly due to associativity); panel (b) L1 cache (8-way associative), geometric mean error 0.4%.]

Figure 4.10: Cache misses and hits simulated by Dinero IV compared to the measured cache misses (median of 10 measurements) for the PolyBench kernels with the prediction error relative to the number of memory accesses on top.

[Figure 4.11: per-kernel execution-time breakdown (stack distances, capacity misses, other) with the number of counted pieces on a secondary axis; annotated totals range from 0.1s to 19.5s.]

Figure 4.11: Execution times for the main components of haystack compared to the number of separately counted pieces for the PolyBench kernels sorted by execution time.

caches with LRU instead of pseudo-LRU replacement policy. We also do not consider possible overfetch due to the hardware prefetchers. To quantify the error, Figure 4.9 shows for every kernel the prediction error relative to the total number of memory accesses computed by the model. Most kernels have low single-digit prediction errors with a geometric mean error of 0.6% and 0.2% for the L1 cache and the L2 cache, respectively. Only doitgen and gramschmidt have prediction errors above 10%.

We also execute the PolyBench kernels with Dinero IV [86] to simulate the number of cache misses with full associativity and with the associativity of our test system. Figure 4.10 compares the sum of the simulated compulsory, capacity, and conflict misses to the measured cache misses shown by black lines. We observe that the simulation results for the fully associative L1 cache qualitatively agree with the model. All simulation results are within 0.1% of the model for the L1 cache and within 3% of the model for the L2 cache (relative to the total number of memory accesses). We conclude that our design decisions of padding the innermost dimension of multi-dimensional arrays, discussed in Section 4.2.1, and modeling only array accesses and not scalar accesses, discussed in Section 4.1.2, have no significant impact on the accuracy of the model.

The simulation results with test system associativity eliminate the error for the doitgen kernel. We conclude that modeling set associativity is only relevant for one of the PolyBench kernels. The error of the remaining kernels is dominated by other error sources, such as the difference between LRU and pseudo-LRU replacement policy, that are neither considered by the simulator nor by the model.

[Figure 4.12: per-kernel execution times for the medium (M), large (L), and extra large (XL) problem sizes with the number of counted pieces for each size on a secondary axis.]

Figure 4.12: Execution times for the extra large (XL), large (L), and medium (M) problem sizes of PolyBench compared to the number of counted pieces.

haystack reproduces the simulation results for full associativity, and the associativity mismatch compared to the test system does not dominate the modeling error.

4.3.3 Performance Overview

We next analyze the performance of haystack and its sensitivity to model parameters such as the problem size or the number of cache hierarchy levels. Two components dominate the model execution time: 1) the stack distance computation discussed in Section 4.2.1 and 2) the capacity miss counting discussed in Section 4.2.2. Figure 4.11 shows the cost of the two components compared to the total model execution times for the PolyBench kernels. The analysis of most kernels terminates within 5 seconds (jacobi-1d to heat-3d) while the more expensive kernels take up to 20 seconds (adi to cholesky). The capacity miss counting dominates the cost of the expensive kernels.

When counting the capacity misses, the partial enumeration and, to a lesser extent, the equalization and rasterization, discussed in Section 4.2.3, split the iteration domain into pieces with affine stack distance polynomials that support symbolic counting. The solid line in Figure 4.11 shows the number of counted pieces. We observe that the expensive kernels require more splits due to non-affine stack distance polynomials and that the counting costs correlate with the number of pieces.

Unlike for a cache simulator, the model execution time is not proportional to the number of memory accesses. Figure 4.12 shows the model

[Figure 4.13: per-kernel execution times when modeling one (L1 only), two (plus L2 overhead), or three (plus L3 overhead) cache hierarchy levels.]

Figure 4.13: Comparison of the execution times when modeling one, two, or three cache hierarchy levels.

execution times for the three largest PolyBench problem sizes. The large (L) and the extra large (XL) problem size perform roughly 100 and 1000 times more memory accesses than the medium (M) problem size, respectively. Yet, the execution times remain constant for a majority of the kernels. Only the execution times of the expensive kernels increase since the partial enumeration requires more splits. The number of counted pieces, shown by the solid, dashed, and dotted lines in Figure 4.12, correlates with the cost increase for the larger problem sizes. Even for the expensive kernels, the increase of the execution time is not proportional to the number of memory accesses since we enumerate only the non-affine dimensions of the stack distance polynomials.

When counting the cache misses for multiple cache hierarchy levels, we reuse the stack distance polynomials and enumerate the non-affine dimensions only once. The counting of the individual pieces is the only step repeated for every cache size. As the Barvinok algorithm [49] supports parametric counting, we can count the capacity misses parametrically in the cache size, which avoids any additional overhead when modeling additional cache hierarchy levels. We benchmark the non-parametric version of the code as it runs faster even when modeling three cache hierarchy levels. Figure 4.13 shows minor increases of the total execution time for two and three cache hierarchy levels.

The partial enumeration, discussed in Section 4.2.2, combines enumeration of the non-affine dimensions with symbolic counting of the affine dimensions. Figure 4.14 compares partial enumeration to the explicit enumeration of all points. When considering only kernels with non-affine stack distance polynomials, we measure a geometric mean speedup of 12.4x with pieces

[Figure 4.14: per-kernel speedups for equalization (without rasterization, 1.9x geometric mean), rasterization (1.9x geometric mean), and partial enumeration (12.4x geometric mean).]

Figure 4.14: Speedup due to equalization, rasterization, and partial enumeration. All kernels without speedup (gray bars) are not included in the geometric mean. Only few kernels run fast without any optimization (gray labels).

3mm adi cholesky correlation covariance deriche gemver lu ludcmp mvt nussinov
0d-affine 3 4 3 7
1d-affine 7 11 3 3 6 4 48 52 2 18
2d-affine 1 59 85 1 1 27 20 48

Table 4.1: Number of non-affine polynomials with zero, one, or two affine dimensions.

that contain 4,400 points on average. The more points per piece, the bigger the efficiency gain due to our hybrid counting approach. We still require explicit enumeration for all non-affine polynomials without affine dimension. Table 4.1 shows that most non-affine polynomials have at least one affine dimension. For these polynomials, partial enumeration reduces the asymptotic complexity of the capacity miss counting.

As discussed in Section 4.2.3, the floor elimination techniques simplify non-affine stack distance polynomials with fewer splits than partial enumeration but are less generic and do not apply to all polynomials. Figure 4.14 shows the speedups for equalization compared to a baseline without equalization and rasterization. We disable both techniques since otherwise rasterization optimizes the polynomials normally handled by equalization. We observe a geometric mean speedup of 1.9x for the kernels that benefit. Figure 4.14 also compares the speedups for rasterization to a baseline without rasterization. We measure a geometric mean speedup of 1.9x for cholesky, lu, ludcmp, nussinov, and seidel-2d. Overall, the floor elimination techniques reduce the number of counted pieces by more than 80%, which results in bigger pieces with better counting performance.

A majority of the kernels perform well independent of problem size and number of cache hierarchy levels. Yet, the model execution times for kernels with non-affine polynomials are higher and problem size dependent. We mitigate this with efficient enumeration and floor elimination techniques.
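For reference, the stack distance that haystack computes symbolically has a simple operational definition: the number of distinct cache lines accessed since the previous access to the same line, so that a fully associative LRU cache with C lines hits exactly when this distance is below C. The trace-based sketch below only illustrates this definition; it is not the analytical model, and its cost is proportional to the trace length.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Computes, for every access in a trace of cache-line addresses, the stack
// distance: the number of distinct cache lines touched since the previous
// access to the same line (-1 for first accesses, i.e. compulsory misses).
// With a fully associative LRU cache of C lines, an access hits iff its
// stack distance is smaller than C.
std::vector<int64_t> stack_distances(const std::vector<uint64_t>& trace) {
  std::vector<uint64_t> lru;                      // most recently used at the back
  std::unordered_map<uint64_t, size_t> position;  // current index in lru
  std::vector<int64_t> result;
  for (uint64_t line : trace) {
    auto it = position.find(line);
    if (it == position.end()) {
      result.push_back(-1);                       // compulsory miss
    } else {
      size_t pos = it->second;
      result.push_back(static_cast<int64_t>(lru.size() - 1 - pos));
      lru.erase(lru.begin() + pos);               // O(n); fine for a sketch
      for (size_t i = pos; i < lru.size(); ++i) position[lru[i]] = i;
    }
    position[line] = lru.size();
    lru.push_back(line);
  }
  return result;
}

int main() {
  // Access pattern a b c a: the reuse of a has stack distance 2 (b and c in between).
  for (int64_t d : stack_distances({0, 1, 2, 0})) std::cout << d << ' ';
  std::cout << '\n';                              // prints: -1 -1 -1 2
}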

[Figure 4.15: per-kernel speedups of haystack; panel (a) compared to PolyCache (1 core vs 1024 cores, 21x geometric mean); panel (b) compared to Dinero IV (370x geometric mean).]

Figure 4.15: Speedup of haystack compared to PolyCache and Dinero for the PolyBench 3.2 and 4.2.1 kernels, respectively.

4.3.4 Comparison to PolyCache and Dinero

The polyhedral cache model PolyCache [97] and the cache simulator Dinero IV [86] are alternative cache modeling tools. We compare their performance to haystack.

PolyCache models set associative caches with an LRU replacement policy. We compare to the published results that show the performance for the default problem size of PolyBench 3.2 and adapt the configuration of our model to match the cache sizes of the published experiments (32KiB of L1 cache and 256KiB of L2 cache). The only difference is that we model fully associative caches instead of 4-way associative caches. Figure 4.15a shows an average speedup of 21x (geometric mean) of haystack compared to PolyCache, even though PolyCache computes the cache misses for all 1024 cache sets in parallel.

Dinero IV is a trace-driven cache simulator, which means the expected simulation costs are proportional to the number of memory accesses (Figure 4.1). Figure 4.15b shows the speedup of haystack compared to the Dinero IV simulation times that include the trace generation with QEMU [98]. Dinero IV simulates the associativity of our test system while we model fully associative caches. As simulation and model run single core, the execution times are comparable. We measure an average speedup of 370x (geometric mean) for the large problem size that would be even bigger for the extra large problem size. Simulating full associativity further increases the average simulation time by a factor of 2.2x (geometric mean).

PolyCache models cache behavior in-depth, which allows developers to analyze the effects of set associativity and different write policies, but its high accuracy can make it costly to compute. Dinero IV works for small problem sizes but the cost increase for realistic problem sizes is dramatic.

4.3.5 Performance for Tiled Codes

A tiled code decomposes the iteration domain into tiles and executes tile-by-tile to improve the spatial locality. Tiling can double the loop nest depth, which allows us to evaluate our approach for more complex codes. At the same time, estimating the benefits of tiling or even selecting optimal tile sizes is an important application for a cache model.

We employ the PPCG [15] source-to-source compiler to tile all PolyBench kernels with tile size 16. We limit the sum of all scheduling coefficients to one and disable loop fusion to obtain a rectangular tiling without loop skewing (time-tiling).

[Figure 4.16: per-kernel execution-time breakdown (stack distances, capacity misses, total) for the tiled kernels; annotated totals range from 0.4s to 256.5s.]

Figure 4.16: Execution times for the main components of haystack for tiled versions of the PolyBench kernels. A few kernels (gray labels) have no rectangular tiling.

All kernels except for jacobi-1d, durbin, seidel-2d, and nussinov have a rectangular tiling. Figure 4.16 shows the model execution times for the tiled kernels. Tiling makes the cache miss computation more expensive. Especially the stack distance computation of the heat-3d kernel runs long. We attribute the cost increase to the more complex iteration domains and memory access patterns. Tiling increases the model execution times, but for a majority of the kernels the cache miss computation still takes only a few seconds.
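To illustrate what such a transformation does to the loop structure, the following hand-written sketch tiles a simple two-dimensional loop nest with tile size 16; it is not PPCG output and only shows why tiling doubles the loop nest depth.

#include <vector>

constexpr int N = 1000, M = 1000, T = 16;     // problem and tile sizes (illustrative)

double f(double x) { return 2.0 * x + 1.0; }  // placeholder point-wise update

// Original loop nest: depth two.
void update(std::vector<std::vector<double>>& A) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
      A[i][j] = f(A[i][j]);
}

// Rectangularly tiled with tile size 16: the loop nest depth doubles, and
// consecutive iterations now reuse a 16x16 working set.
void update_tiled(std::vector<std::vector<double>>& A) {
  for (int ti = 0; ti < N; ti += T)
    for (int tj = 0; tj < M; tj += T)
      for (int i = ti; i < ti + T && i < N; ++i)
        for (int j = tj; j < tj + T && j < M; ++j)
          A[i][j] = f(A[i][j]);
}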

4.4 related work

Cache behavior analysis is a prerequisite when tuning for the memory hierarchy. We distinguish three main approaches: 1) simulation, 2) profiling, and 3) analytical modeling.

simulators Dinero [86] and CASPER [87] are examples of trace-based cache simulators that compute the cache misses for the full memory hierarchy. Sniper [88] and gem5 [89] have a broader scope and simulate the full system including the caches. All simulators execute the program to count the cache misses, which means the simulation costs are proportional to the number of executed memory accesses.

profiling Multiple works discuss the analysis of memory access traces to extract locality metrics. Mattson et al. [82] compute the stack distance

using a linked list and derive the cache hit rate for different cache sizes. Tree-based implementations [99–101] reduce the cost of the stack distance computation. Kim et al. [102] apply hashing and approximation to increase the efficiency. Ding et al. [84] discuss tree-based approximate algorithms that reduce the time and space complexity of the stack distance computation and predict the stack distance histogram for arbitrary problem sizes given training inputs for a few different problem sizes. Eklov et al. [103] sample the reuse distance for a few memory accesses and employ statistics to estimate stack distances and the cache miss ratio. Xiang et al. [85] discuss five different locality metrics and show how to derive miss rate and reuse distance given a single measure called the average footprint, which they compute with an efficient linear-time algorithm [104]. A disadvantage of the profiling approaches is the acquisition and the handling of the large program traces. Chen et al. [105] sample the reuse time during compilation, which allows them to estimate the cache miss ratio of complex loop nests.

analytical models Agarwal et al. [106] develop an analytical model that uses parameters extracted from the program trace. Harper et al. [107] model set associative caches for regular loop nests. Cost models [14, 36, 108] allow compilers to decide if data-locality transformations are beneficial. All of these models only approximate the number of cache misses.

Ferdinand et al. [109] use abstract interpretation to model set associative LRU caches. Model checking [110, 111] increases the accuracy of this analysis, which distinguishes always hit, always miss, and not classified. Touzeau et al. [112] show how to attain high accuracy without costly model checking. The abstract interpretation approaches are complementary to our cache model since they support dynamic control flow but approximate the cache misses of loop nests by classifying all instances of a memory access at once.

Ghosh et al. [113] derive cache miss equations to count the cache misses for perfect loop nests with data dependencies represented by reuse vectors [27]. Assuming an LRU replacement policy, a cache miss occurs if the number of solutions to a cache miss equality exceeds the cache associativity. Counting the solutions for every point of the iteration domain is expensive. Vera and Xue et al. [114, 115] thus sample the iteration domain to speed up the cache miss computation, which allows them to perform approximate whole-program analysis. Cascaval et al. [116] compute the stack distance histogram symbolically for perfect loop nests with uniform data dependencies. They model fully associative caches with an LRU replacement policy and use statistics to model set associative caches. Chatterjee et al. [117] use

Presburger formulas to express the set of compulsory and capacity misses of imperfect loop nests for associative caches. At the time, their approach was limited to small problem sizes and low associativity since the computation of analytical results for realistic hardware and even small benchmark kernels was prohibitively complex. While Beyls et al. [90] did not address the cache miss problem, they use analytically computed stack distances to generate cache hints at runtime. Their stack distance computation, extended by our cache miss counting technique for non-affine polynomials, is the foundation of our cache model.

PolyCache [97] presented the first analytical approach fast enough to compute the cache behavior of static control programs for interesting benchmark kernels and realistic hardware parameters. Its analytical model relates, for every cache set, successive accesses of distinct cache lines and repeatedly removes the shortest relations to model set associativity with an LRU replacement policy. While PolyCache also uses symbolic counting techniques to avoid a complete enumeration of the computation, its complexity increases with high associativity. Our work provides a fast analytical model for fully associative caches and shows that fully associative models introduce only small errors compared to measurements on actual hardware.

4.5 summary of the approach

As memory behavior depends on the cache state, understanding the cost of memory accesses is much more difficult than understanding the cost of arithmetic instructions. With haystack, we close this gap by providing developers with accurate information about the interaction of memory accesses with the large and deep cache hierarchy of modern processors. haystack allows the programmer to predict memory access costs accurately and to develop programs well optimized for the memory hierarchy. When striving for ultimate performance, both a good baseline and an accurate surrogate model accelerate empirical tuning. As a result, cache-aware program optimization becomes accessible.

Responsiveness is key for the adoption of any cache model. We demonstrate excellent, often problem-size-independent response times that for the first time make analytical cache modeling practical. In addition, the cache-size-independent costs allow our model to easily scale to future hardware. We show the practicality of our deliberate decision against high fidelity and in favor of a generic fully associative cache model. The proposed model is robust to memory layout choices and hardware implementation details

and yet reaches very high accuracy on real hardware across a wide range of computations.

5 A COMMUNICATION-HIDING PROGRAMMING MODEL

Today we typically target GPU clusters using two programming models that separately deal with inter-node and single-node parallelization. For example, we may use MPI [118] to move data between nodes and CUDA [119] to implement the on-node computation. MPI provides point-to-point communication and collectives that allow synchronizing concurrent processes executing on different cluster nodes. Using a fork-join model, CUDA allows offloading compute kernels from the host to the massively parallel device. MPI-CUDA programs usually combine the two programming models by alternating between on-node kernel invocations and inter-node communication. While being functional, this approach also entails serious disadvantages.

The main disadvantage is that application developers need to know the concepts of both programming models and understand several intricacies to work around their inconsistencies. For example, the MPI software stack has in the meantime been adapted to support direct device-to-device [120] data transfers. However, the control path remains on the host, which causes frequent host-device synchronizations and redundant data structures on host and device. On the other hand, the sequential execution of on-node computation and inter-node communication inhibits efficient utilization of the costly compute and network hardware. To mitigate this problem, application developers can implement manual overlap of computation and communication [10, 39]. In particular, there exist various approaches [121, 122] to overlap the communication with the computation on an inner domain that has no inter-node data dependencies. However, these code transformations significantly increase code complexity, which results in reduced real-world applicability.

High-performance system design often involves trading off sequential performance against parallel throughput. The architectural difference between host and device processors perfectly showcases the two extremes of this design space. Both architectures have to deal with the latency of hardware components such as memories or floating point units. While the host processor employs latency minimization techniques such as prefetching and out-of-order execution, the device processor employs latency hiding techniques such as over-subscription and hardware threads.

To avoid the complexity of handling two programming models and to apply latency hiding at cluster scale, we introduce the dcuda (distributed CUDA) programming model. We obtain a single coherent software stack by combining the CUDA programming model with a significant subset of the remote memory access capabilities of MPI [123]. More precisely, a global address space and device-side put and get operations enable transparent remote memory access using the high-speed network of the cluster. We thereby make use of over-decomposition to over-subscribe the hardware with spare parallelism that enables automatic overlap of remote memory accesses with concurrent computation. To synchronize the program execution, we additionally provide notified remote memory access operations [124] that after completion notify the target via a notification queue.

We evaluate the dcuda programming model using a stencil code, a particle simulation, and an implementation of sparse matrix-vector multiplication. To compare performance and usability, we implement dcuda and MPI-CUDA versions of these mini-applications. Two out of three mini-applications show excellent automatic overlap of communication and computation. Hence, application developers not only benefit from the convenience of device-side remote memory access. More importantly, dcuda enables automatic overlap of computation and communication without costly, manual code transformations. As dcuda programs are less network latency sensitive, our development might even motivate more throughput oriented network designs. In brief, we make the following contributions:

• We implement the first device-side communication library that provides MPI-like remote memory access operations and target notification for GPU clusters.

• We design the first GPU cluster programming model that makes use of over-subscription and hardware threads to automatically overlap inter-node communication with on-node computation.

5.1 programming model

The CUDA programming model and the underlying hardware architecture have proven excellent efficiency for parallel compute tasks. To achieve high performance, CUDA programs offload the computation to an accelerator device with many throughput optimized compute cores that are over-subscribed with many more threads than they have execution units. To overlap instruction pipeline latencies, in every clock cycle the compute cores try to select among all threads in flight some that are ready for execution. To implement context switches on a clock-by-clock basis, the compute cores split the register file and the scratchpad memory among the threads in flight. Hence, the register and scratchpad utilization of a code effectively limit the number of threads in flight. However, having enough parallel work is of key importance to fully overlap the instruction pipeline latencies.

Little's law [125] states that this minimum required amount of parallel work corresponds to the product of bandwidth and latency. For example, we need 200kB of data on-the-fly to fully utilize a device memory with 200GB/s bandwidth and 1µs latency. Thereby, 200kB translate to roughly 12,000 threads in flight, each of them accessing two double precision floating point values at once. We show in Section 5.3 that the network of our test system has 6GB/s bandwidth and 19µs latency. We therefore need 114kB of data or roughly 7,000 threads in flight to fully utilize the network. Based on the observation that typical CUDA programs make efficient use of the memory bandwidth, we conclude there should be enough parallelism to overlap network operations. Consequently, we suggest using hardware-supported overlap of computation and communication to program distributed memory systems.
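Written out, Little's law gives the required data in flight W as the bandwidth-latency product, and dividing by the 16 bytes occupied by two double precision values yields the thread counts quoted above:

\[
  W = B \cdot L, \qquad
  W_{\text{mem}} = 200\,\mathrm{GB/s} \cdot 1\,\mu\mathrm{s} = 200\,\mathrm{kB}, \qquad
  W_{\text{net}} = 6\,\mathrm{GB/s} \cdot 19\,\mu\mathrm{s} = 114\,\mathrm{kB},
\]
\[
  \frac{200\,\mathrm{kB}}{2 \cdot 8\,\mathrm{B}} = 12{,}500 \approx 12{,}000 \text{ threads},
  \qquad
  \frac{114\,\mathrm{kB}}{2 \cdot 8\,\mathrm{B}} \approx 7{,}100 \approx 7{,}000 \text{ threads}.
\]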

5.1.1 Distributed Memory

One main challenge of distributed memory programming is the decomposition and distribution of program data to the different memories of the machine. Currently, most distributed memory programming models rely on manual domain decomposition and data synchronization since handling distributed memory automatically is hard.

Today, MPI is the most widely used distributed memory programming model in high-performance computing. Many codes thereby rely on two-sided communication that simultaneously involves sender and receiver. This combination of data movement and synchronization is a bad fit for extending the CUDA programming model. On the one hand, CUDA programs typically perform many data movements in the form of device memory accesses before synchronizing the execution using global barriers. On the other hand, CUDA programs over-subscribe the device by running many more threads than there are hardware execution units. As sender and receiver might not be active at the same time, two-sided communication is hardly practical. To avoid active target synchronization, MPI alternatively

provides one-sided put and get operations that implement remote memory access [123] using a mapping of the distributed memory to a global address space. We believe that remote memory access programming is a natural extension of the CUDA programming model.

Finally, programming large distributed memory machines requires more sophisticated synchronization mechanisms than the barriers implemented by CUDA. We propose a notification based synchronization infrastructure [124] that after completion of the put and get operations enqueues notifications on the target. To synchronize, the target waits for incoming notifications enqueued by concurrent remote memory accesses. This queue-based system makes it possible to build linearizable semantics.

5.1.2 Combining MPI & CUDA

CUDA programs structure the computation using kernels, which embody phases of concurrent execution followed by result communication and synchronization. More precisely, kernels write to memory with relaxed consistency, and only after an implicit barrier synchronization at the end of the kernel execution do the results become visible to the entire device. To expose parallelism, kernels use a set of independent thread groups called blocks that provide memory consistency and synchronization among the threads of the block. The blocks are scheduled to the different compute cores of the device. Once a block is running, its execution cannot be interrupted as today's devices do not support preemption [126]. While each compute core keeps as many blocks in flight as possible, the number of concurrent blocks is constrained by hardware limits such as register file and scratchpad memory capacity. Consequently, the block execution may be partly sequential and synchronizing two blocks might be impossible as one runs after the other.

MPI programs expose parallelism using multiple processes called ranks that may execute on the different nodes of a cluster. Similar to CUDA, we structure the computation in phases of concurrent execution followed by result communication and synchronization. In contrast to CUDA, we write the results to a distributed memory and we use either point-to-point communication or remote memory accesses to move data between the ranks.

Today's MPI-CUDA programs typically assign one rank to every device and whenever necessary insert communication in between kernel invocations. However, stacking the communication and synchronization mechanisms of two programming models makes the code unnecessarily complex.

[Figure 5.1 illustration: block scheduling timelines for MPI-CUDA (traditional) and dCUDA; legend: device, compute core, active block.]

Figure 5.1: Block scheduling for MPI-CUDA and dcuda.

Therefore, we suggest combining the two programming models into a single coherent software stack. dcuda programs implement the application logic using a single CUDA kernel that performs explicit data exchange during execution. To enable synchronization, we limit the over-subscription to the maximal number of concurrent hardware threads supported by the device. To move data between blocks, no matter if they run on the same or on remote devices, we use device-side remote memory access operations. We identify each block with a unique rank identifier that allows addressing data on the entire cluster. We map MPI ranks to CUDA blocks, as they represent the most coarse-grained execution unit that benefits from automatic latency overlap due to hardware threading. Hereafter, we use the terms rank and block interchangeably. To synchronize the rank execution, we implement remote memory access operations with target notification. An additional wait method finally allows synchronizing the target execution with incoming notifications.

Figure 5.1 compares the execution of an MPI-CUDA program to its dcuda counterpart. We illustrate the program execution on two dual-core devices, each of them over-subscribed with two blocks per core. We indicate communication using black arrows and synchronization using black lines. Both programs implement sequential compute and communication phases. While the dcuda program uses over-subscription to automatically overlap

 1  __shared__ dcuda_context ctx;
 2  dcuda_init(param, ctx);
 3  dcuda_comm_size(ctx, DCUDA_COMM_WORLD, &size);
 4  dcuda_comm_rank(ctx, DCUDA_COMM_WORLD, &rank);
 5
 6  dcuda_win win, wout;
 7  dcuda_win_create(ctx, DCUDA_COMM_WORLD,
 8      &in[0], len + 2 * jstride, &win);
 9  dcuda_win_create(ctx, DCUDA_COMM_WORLD,
10      &out[0], len + 2 * jstride, &wout);
11
12  bool lsend = rank - 1 >= 0;
13  bool rsend = rank + 1 < size;
14
15  int from = threadIdx.x + jstride;
16  int to = from + len;
17
18  for (int i = 0; i < steps; ++i) {
19    for (int idx = from; idx < to; idx += jstride)
20      out[idx] = -4.0 * in[idx] +
21          in[idx + 1] + in[idx - 1] +
22          in[idx + jstride] + in[idx - jstride];
23
24    if (lsend)
25      dcuda_put_notify(ctx, wout, rank - 1,
26          len + jstride, jstride, &out[jstride], tag);
27    if (rsend)
28      dcuda_put_notify(ctx, wout, rank + 1,
29          0, jstride, &out[len], tag);
30
31    dcuda_wait_notifications(ctx, wout,
32        DCUDA_ANY_SOURCE, tag, lsend + rsend);
33
34    swap(in, out); swap(win, wout);
35  }
36
37  dcuda_win_free(ctx, win);
38  dcuda_win_free(ctx, wout);
39  dcuda_finish(ctx);

Figure 5.2: Stencil program with halo exchange communication.

[Figure 5.3 illustration: the in and out arrays distributed across memory 1 and memory 2, with window, halo, and overlap regions marked.]

Figure 5.3: Overlapping windows of four ranks in two memories.

the communication and compute phases of competing blocks, the MPI-CUDA program leaves this optimization potential unused.

5.1.3 Example

Figure 5.2 shows an example program that uses dcuda to implement a two-dimensional stencil computation. Using pointers adjusted to rank local memory, the program reads from an "in" array and writes to an "out" array. To distribute the work, the program performs a one-dimensional domain decomposition in the j-dimension. To satisfy all data dependencies, in every iteration the program exchanges one halo line with the left and right neighbor rank. For illustration purposes, the program listing highlights all methods and types implemented by the dcuda framework. The calling conventions require that all threads of a block call the framework methods collectively with the same parameter values. To convert the example into a working program, we need additional boilerplate initialization logic that, among other things, performs the input/output and the domain decomposition.

On line 2, we initialize the context object using the "param" kernel parameter that contains framework information such as the notification queue address. The context object stores the shared state used by the framework methods. On lines 3–4, we get the size and identifier of the rank with respect to the world communicator. A communicator corresponds to a set of ranks that participate in the computation. Each rank has a unique identifier with respect to this communicator. We currently provide two predefined communicators that represent either all ranks of the cluster, called the "world communicator", or all ranks of the device, called the "device communicator".

On lines 6–10, we create two windows that provide remote memory access to the "in" and "out" arrays. When creating a window, all participating ranks register their own local memory range with the window. The individual window sizes may differ and windows of shared memory ranks might overlap. We use windows to define a global address space where rank, window, offset tuples denote global distributed memory addresses. Figure 5.3 illustrates the overlapping windows of the example program. Each cell represents the memory of one j-position that stores "jstride" values in the i-dimension. Colors mark cells that belong to the domain boundaries of the rank. More precisely, the windows of shared memory ranks overlap and the windows of distributed memory ranks allocate additional halo cells that duplicate the domain boundaries.

On lines 24–30, we move the domain boundaries of the "out" array to the windows of the neighbor ranks. We address the remote memory using the window, rank, and offset parameters. Once the data transfer completes, the put operation additionally places a notification in the notification queue of the target rank. We can mark the notification with a tag that, in case of more complex communication patterns, allows disentangling different notification sources. On lines 31–32, we wait until the notifications of the neighboring ranks arrive in the notification queue. Our program waits for zero, one, or two notifications depending on the values of the lsend and rsend flags. We consider only notifications with specific window, rank, and tag values. To match multiple notifications at once, we can optionally use wildcard values that allow us to match notifications with any window, rank, or tag value. On lines 37–39, we free the window objects to clean up after the program execution.

Overall, our implementation closely follows the MPI remote memory access specification [123]. On top of the functionality demonstrated by the example, we implement the window flush operation that allows waiting until all pending window operations are done. Furthermore, we can not only put data to remote windows but also get data from remote windows. Finally, the barrier collective allows globally synchronizing the rank execution.

5.1.4 Discussion

Compared to MPI-CUDA, dcuda slightly simplifies the code by moving the communication control to the device. For example, we have direct access to the size information of dynamic data structures and there is less need for separate pack kernels that bundle data for the communication phase. While the distributed memory handling causes most of the code complexity for both programming models, with dcuda we remove one synchronization layer and implement everything with a distributed memory view. We may thereby generate redundant put and get operations in shared memory, but our runtime can optimize them out.

5.2 implementation

Moving the MPI functionality to the device side raises multiple challenging implementation questions. To our knowledge, there is so far no device-side MPI library, which might be partly attributed to the fact that calling MPI from the kernel conflicts with multiple CUDA mantras. On the one hand, the weak consistency memory model prevents shared memory communication during kernel execution. On the other hand, the missing block scheduling guarantees complicate the block synchronization.

5.2.1 Architecture Overview

Our research prototype consists of a device-side library that implements the actual programming interface and a host-side runtime system that controls the communication. More precisely, we run one library instance per rank and one runtime system instance per device. Connected via MPI, the runtime system instances control data movement and synchronization of any two ranks in the system. However, the data movement itself either takes place locally on the device or uses direct device-to-device communication [120].

While a design without host involvement seems desirable, existing attempts to control the network interface card directly from the device are not promising [127] in terms of performance and system complexity. Furthermore, the host is a good fit for the synchronization required to order incoming notifications from different source ranks. To avoid device-side synchronization, we go even one step further and loop device local notifications through the host as well.

Moving data between ranks running on the same device requires memory consistency. In CUDA, atomics are the only coherent memory operations at device level. However, we did not encounter memory inconsistencies on Kepler devices with disabled L1 cache (which is the default setting). When

[Figure 5.4 illustration: host-side event handler and block managers connected to the device-side library context via command, acknowledgment, notification, and logging queues, with MPI connecting the hosts.]

Figure 5.4: Architecture overview of the dcuda runtime system.

polling device memory, we additionally use the volatile keyword to make sure the compiler issues a load instruction for every variable access.

To implement collectives such as barrier synchronization [128], all participating ranks have to be scheduled actively. Otherwise, the collective might deadlock. As discussed in Section 5.1.2, hardware constraints, such as the register file size and the lack of preemption, result in sequential block execution once we exceed the maximal number of concurrent hardware threads. Our implementation thus limits the number of blocks to the maximum the device can have in flight at once. However, we might still encounter starvation as there are no guarantees regarding the hardware thread scheduling implemented by the compute cores. For example, the compute cores might only run the threads that are waiting for notifications and pause the threads that send notifications.

Figure 5.4 illustrates the software architecture of the dcuda runtime system. A host-side event handler starts and controls the execution of the actual compute kernel. To communicate with the blocks of the running kernel, we create separate block manager instances that interact with the device-side library components using queues implemented as circular buffers. The event handler dispatches incoming remote memory access requests to the matching target block manager and continuously invokes the block manager instances to process incoming commands and pending MPI requests. More precisely, using the command queue the device-side library triggers block manager actions such as window creation, notified remote memory access, and barrier synchronization. To guarantee progress using a single worker thread, the block manager implements these actions

[Figure 5.5 illustration: message flow between the origin device library, origin block manager, target event handler, target block manager, and target device library for steps 1) to 7) of a notified put.]

Figure 5.5: Sequence diagram of a notified distributed memory put.

using non-blocking MPI operations. Once the pending MPI request signals completion, the block manager notifies the device-side library using separate queues to acknowledge commands and to post notifications. An additional logging queue allows printing debug information during kernel execution. The device-side library uses a context object to store shared state such as queue or window information. Most of the time, the device-side library initiates actions on the host and waits for their completion. However, all remote memory accesses to shared memory ranks are directly executed on the device. We thereby perform no copy if source and target address of the remote memory access are identical, which commonly happens for overlapping shared memory windows. Furthermore, the device-side library implements the notification matching.

5.2.2 Communication Control

Figure 5.5 shows the end-to-end control flow for a distributed memory put operation with target notification. Initially, the origin device-side library assembles a tuple containing data pointer, size, target rank, target window and offset, tag, and flush identifier of the transfer and 1) sends this meta information to the associated block manager. Using two non-blocking sends, 2) the origin block manager forwards the meta information to the target event handler and 3) copies the actual data directly from the origin device memory to the target device memory. Once the MPI requests signal that the send buffers are not in use anymore, 4) the origin block manager frees the meta information and updates the flush counter on the device. Using pre-posted receives, the target event handler waits for meta information arriving from an arbitrary origin rank and 5) immediately forwards the

incoming meta information to the associated target block manager. Finally, 6) the target block manager posts a non-blocking receive for the actual data transfer and 7) after completion notifies the target device-side library and frees the meta information.

The control flow for shared memory is simpler. Initially, the origin device-side library performs the actual data transfer. We thereby perform no copy in case source and target pointers are identical. Finally, we notify the target device-side library via the origin block manager.

The device-side library uses a counter to generate unique window identifiers. These counters get out of sync whenever only a subset of the ranks participate in the window creation. The block manager thus uses a hash map to translate the device-side identifiers to globally valid identifiers. Similarly, we use a counter to generate unique flush identifiers for remote memory access operations. The block manager keeps a history of the processed remote memory access operations and updates the device about the progress using a single variable set to the flush identifier of the last processed remote memory access operation whose predecessors are done as well.
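The flush identifier bookkeeping can be pictured with a small sketch; the types and names below are hypothetical, and the actual block manager tracks outstanding MPI requests rather than a plain set.

#include <cstdint>
#include <set>

// Tracks completed remote memory access operations and exposes a single
// monotonically increasing value: the largest flush identifier whose
// predecessors have all completed as well. The device only ever polls this
// one variable to learn how far the operation stream has progressed.
class FlushTracker {
 public:
  uint64_t next_id() { return next_id_++; }        // ids are assigned in issue order

  void mark_done(uint64_t id) {
    done_.insert(id);
    // Advance the contiguous prefix of completed operations.
    while (!done_.empty() && *done_.begin() == completed_prefix_ + 1) {
      completed_prefix_ = *done_.begin();
      done_.erase(done_.begin());
    }
  }

  uint64_t completed_prefix() const { return completed_prefix_; }

 private:
  uint64_t next_id_ = 1;            // flush identifiers start at 1
  uint64_t completed_prefix_ = 0;   // 0 means no operation has completed yet
  std::set<uint64_t> done_;         // completed ids beyond the prefix
};

For example, if operations 1, 2, and 3 are issued and operation 2 completes first, the exposed prefix stays at 0; once operation 1 completes, the prefix jumps to 2 in a single update.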

5.2.3 Performance Optimization

As an efficient host-device communication is of key importance for the performance of our runtime system, we spend considerable effort on optimizing queue design and notification matching. Due to the Little's law assumption, we focus on throughput rather than on latency optimizations.

memory mapping To move large amounts of data between host and device, the copy methods provided by the CUDA runtime are the method of choice. However, the DMA engine setup causes a considerable startup latency. Alternatively, we can directly access the device memory mapped in the address space of the host memory and vice versa. This approach is a much better fit for the small data transfers prevalent in queue operations. While CUDA out of the box provides support to map host memory in the device address space, we can map device memory in the host address space using an additional kernel module [129].
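For the host-to-device direction, the mapping relies on standard CUDA runtime calls; the sketch below illustrates the idea for a single integer. The reverse direction, mapping device memory into the host address space, goes through the gdrcopy kernel module and its own interface, which we do not reproduce here.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Allow mapping of pinned host allocations into the device address space.
  cudaSetDeviceFlags(cudaDeviceMapHost);

  // Allocate pinned host memory that is also visible to the device.
  void* host_buf = nullptr;
  cudaHostAlloc(&host_buf, sizeof(int), cudaHostAllocMapped);

  // Obtain the device-side alias of the same allocation; kernels can poll or
  // write through this pointer directly over PCI-Express without a cudaMemcpy.
  void* device_buf = nullptr;
  cudaHostGetDevicePointer(&device_buf, host_buf, 0);

  *static_cast<int*>(host_buf) = 42;
  std::printf("host value: %d, device alias: %p\n",
              *static_cast<int*>(host_buf), device_buf);

  cudaFreeHost(host_buf);
  return 0;
}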

queue design On today's machines the PCI-Express link between host and device is a major communication bottleneck. We therefore employ a circular buffer based queue design that provides an enqueue operation with an amortized cost of a single PCI-Express transaction. To facilitate efficient polling, we place the circular buffer including its tail pointer in receiver memory. For example, Figure 5.4 shows that we allocate the notification queue in device memory and the command queue in host memory. To implement the enqueue operation using a single PCI-Express transaction, we embed an additional sequence number with every queue entry. The receiver then determines valid queue entries using the sequence number instead of the head pointer. Furthermore, we use a credit-based system to keep track of the available space. The sender starts with a free counter that is set to the queue size. With every enqueue operation, we decrement the free counter until it is zero. To recompute the available space, we then load the tail pointer from the receiver memory. The number of free counter updates depends on queue size and queue utilization. Overall, every enqueue operation requires one PCI-Express transaction to write the queue entry including its sequence number and an occasional PCI-Express transaction to update the free counter. We thereby assume queue entry accesses using a single vector instruction are atomic. On our test system, we never encountered inconsistencies when limiting the queue entry size to the vector instruction width.

notification matching The notification matching is the most complex device-side library component. Two methods allow us to wait or test for a given number of incoming notifications. We can thereby filter the notifications depending on window identifier, source rank, and tag [124]. The matching happens in the order of arrival and after completion we remove the matched notifications. To fill potential gaps, we additionally compress the notification queue starting from the tail. Our implementation performs the matching using eight threads that work on separate four byte notification chunks. We read incoming notifications using coalesced reads and, once the sequence number matches, each thread compares the assigned notification chunk to a thread private query value. We initialize the query value depending on the thread index position with the window identifier, the source rank, the tag, or with a wildcard value. To determine if the matching was successful, we reduce the comparison result using shuffle instructions. In case of a mismatch, each thread buffers its notification chunk in a stack-allocated array. Otherwise, we increment a counter that keeps track of the successful matches. Finally, we remove the processed notification from the queue and repeat the procedure until we have enough successful matches. Once this is done, we copy the mismatched notifications

back from the stack-allocated array to the queue. We thereby assume the number of mismatched notifications is low enough for the stack-allocated array to fit in the L1 cache.
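A host-side toy version of the queue design described above, with ordinary memory standing in for the PCI-Express mapped buffers and all names hypothetical, could look as follows; the point is only to show the sequence numbers and the credit-based free counter, not the real host/device split.

#include <cstdint>
#include <vector>

// Single-producer/single-consumer circular queue: each entry carries a
// sequence number so the receiver can detect valid entries without a head
// pointer, and the sender tracks free space with a credit counter that is
// only refreshed from the receiver's tail when the credits run out. Entries
// are assumed small enough to be written atomically.
struct Entry {
  uint64_t sequence;
  uint64_t payload;
};

class Queue {
 public:
  explicit Queue(size_t capacity)
      : buffer_(capacity, Entry{0, 0}), credits_(capacity) {}

  bool enqueue(uint64_t payload) {            // sender side
    if (credits_ == 0) {
      credits_ = buffer_.size() - (head_ - tail_);  // occasional credit refresh
      if (credits_ == 0) return false;              // queue is full
    }
    Entry e{head_ + 1, payload};              // sequence = absolute slot index + 1
    buffer_[head_ % buffer_.size()] = e;      // the one "PCI-Express transaction"
    ++head_;
    --credits_;
    return true;
  }

  bool dequeue(uint64_t* payload) {           // receiver side
    const Entry& e = buffer_[tail_ % buffer_.size()];
    if (e.sequence != tail_ + 1) return false;      // slot not (yet) valid
    *payload = e.payload;
    ++tail_;                                  // tail lives in receiver memory
    return true;
  }

 private:
  std::vector<Entry> buffer_;
  uint64_t head_ = 0;     // sender-private write position
  uint64_t tail_ = 0;     // receiver-owned read position, read by sender on refresh
  size_t credits_ = 0;    // remaining enqueues before the next tail read
};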

5.2.4 Discussion

To make our programming model production ready, additional modifications may be necessary. For example, we partly rely on undocumented hardware behavior and we could further optimize the performance. To develop a more reliable and efficient implementation, we suggest the following improvements to the CUDA environment.

scheduling of computation & communication Our programming model packs the entire application logic in a single kernel. As discussed in Section 5.2.1, this approach conflicts with the scheduling guarantees and the weak memory consistency model of CUDA. For example, we might encounter starvation because the scheduler does not consider the ranks that are about to send notifications, or we might work with outdated data since there is no clean way to guarantee device-level memory consistency during kernel execution. We suggest an execution model with one master thread per rank that handles the communication using remote memory access and notifications. Similar to the dynamic parallelism feature of CUDA, the master thread additionally launches parallel compute phases in between the communication phases. To guarantee memory consistency, our execution model clears the cache before every compute phase. To prevent starvation, we suggest a yield call that guarantees execution time for all other ranks running on the same processing element. Hence, the master thread can yield to the other ranks while waiting for incoming notifications. Currently, the compute phase with maximal register usage limits the available parallelism for the entire application. With the proposed execution model, we can adapt the number of threads for every compute phase and increase the overall resource usage.

notification system An effective and low overhead notification system is crucial for the functioning of our programming model. Despite our optimization efforts, the current notification matching discussed in Section 5.2.3 increases register pressure and code complexity and consequently may impair the application performance. We suggest at least partly integrating the notification infrastructure with the hardware. On the one hand, the network may send data and notifications using a single transmission. Low level interfaces, such as uGNI [130] or InfiniBand Verbs, already provide the necessary support. On the other hand, the device may provide additional storage and logic for the notification matching or hardware support for on-chip notifications.

communication control While we move data directly from device to device, we still rely on the host to control the communication. We expect that moving this functionality to the device improves the overall performance of the system. Mellanox and NVIDIA recently announced a technology called GPUDirect Sync [131] that will enable device-side communication control.

5.3 evaluation

To analyze the performance of our programming model, we implement a set of microbenchmarks that measure latency, bandwidth, and the overlap of computation and communication for compute and memory bound tasks. We additionally compare the performance of mini-applications implemented using both dcuda and MPI-CUDA.

5.3.1 Experimental Setup & Methodology

We perform all our experiments on the Greina compute cluster at the Swiss National Supercomputing Center CSCS. In total, Greina provides ten Haswell nodes equipped with one Tesla K80 GPU per node and connected via 4x EDR InfiniBand. Furthermore, we use version 7.0 of the CUDA toolkit, the CUDA-aware version 1.10.0 of OpenMPI, and the gdrcopy kernel module [129]. We run all experiments using a single GPU per node with the default device configuration. In particular, auto boost remains active, which makes device-side time measurements unreliable.

To measure the performance of dcuda and MPI-CUDA codes, we time the kernel invocations on the host side and collect the maximum execution time found on the different nodes. In contrast, dcuda programs pack the application code in a single kernel invocation that also contains a fair amount of setup code such as the window creation. To get a fair comparison, we therefore time multiple iterations and subtract the setup time estimated by running zero iterations. Furthermore, we repeat each time measurement multiple times and compute the median and the nonparametric confidence

[Figure 5.6: put-bandwidth as a function of packet size (0–500kB), saturating at 5724.6MB/s for distributed memory ranks and 1057.9MB/s for shared memory ranks.]

Figure 5.6: Put-bandwidth of shared and distributed memory ranks.

interval. More precisely, we perform 20 independent measurements of 100 and 5,000 iterations for the mini-applications and the microbenchmarks, respectively. Our plots visualize the 95% confidence interval using a gray band.

The dcuda programming model focuses on multi-node performance. To eliminate measurement noise caused by single-node performance variations, we use the same launch configuration for all kernels (208 blocks per device and 128 threads per block), and we limit the register usage to 63 registers per thread, which guarantees that all 208 blocks are in flight at once.

5.3.2 Microbenchmarks

To evaluate latency and bandwidth of our implementation, we run a ping-pong benchmark that in every iteration moves a data packet back and forth between two ranks using notified put operations. We either place the two ranks on the same device and communicate via shared memory, or we place the ranks on different devices and communicate via the network. We then derive the latency as half the execution time of a single ping-pong iteration and divide the packet size by the latency to compute the bandwidth. Figure 5.6 plots the put-bandwidth for shared and distributed memory ranks as a function of the packet size. The low put-bandwidth of shared memory ranks can be explained by the fact that a single block cannot saturate the memory interface of the device. However, in real-world scenarios hundreds of blocks are active concurrently resulting in high

[Figure 5.7: execution time versus number of Newton iterations per exchange for the compute & exchange, halo exchange only, and compute only configurations.]

Figure 5.7: Overlap for square root calculation (Newton-Raphson).

aggregate bandwidth. For empty data packets, we measure a latency of 7.8µs and 19.3µs for shared and distributed memory, respectively. Hence, the latency of a notified put tops the device memory access latency [132] by one order of magnitude. We are aware that these numbers motivate further tuning. However, in the following we demonstrate that the dcuda programming model is extremely latency agnostic.

The dcuda programming model promises automatic overlap of computation and communication. To measure this effect, we design a benchmark that iteratively executes a compute phase followed by a halo exchange phase. To determine the overlap, we implement runtime switches that allow us to separately disable the compute and halo exchange phases. We use runtime switches to avoid code generation effects that might influence the overall performance of the benchmark. We expect that the execution time of the full benchmark varies between the maximum of compute and halo exchange time for perfect overlap and the sum of compute and halo exchange time for no overlap. To investigate the effect of different workloads, we additionally implement square root calculation (Newton-Raphson) and memory-to-memory copy as examples for compute-bound and memory bandwidth-bound computations. To demonstrate the overlap of computation and communication, Figure 5.7 and Figure 5.8 compare the execution time with and without halo exchange for increasing amounts of computation. An additional horizontal line marks the halo exchange only time. We run all experiments on eight nodes of our cluster. Each halo exchange moves 1kB packets, each copy iteration moves 1kB of data, and each square

Figure 5.8: Overlap for memory-to-memory copy; execution time [ms] over the number of copy iterations per exchange for compute & exchange, halo exchange, and compute only.

We measure perfect overlap for memory bandwidth-bound workloads and good overlap for compute-bound workloads. We explain the slightly lower overlap for compute-bound workloads by the fact that the notification matching itself is relatively compute-heavy.
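One way to condense these measurements into a single number (a hypothetical metric for illustration, not the one used in the thesis) is the fraction of hideable time that is actually hidden, which is 1 if the full run takes the maximum of the two phases and 0 if it takes their sum:

#include <algorithm>
#include <cstdio>

double overlap_fraction(double compute_ms, double exchange_ms, double full_ms) {
  const double no_overlap = compute_ms + exchange_ms;     // execution time without any overlap
  const double hidden = no_overlap - full_ms;             // time actually hidden by overlap
  return hidden / std::min(compute_ms, exchange_ms);      // fraction of the hideable time
}

int main() {
  // illustrative numbers only
  std::printf("overlap = %.2f\n", overlap_fraction(800.0, 300.0, 850.0));
  return 0;
}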

5.3.3 Mini-applications

To evaluate the absolute performance of our programming model, we compare MPI-CUDA and dCUDA variants of mini-applications that implement a particle simulation, a stencil program, and sparse matrix-vector multiplication, three algorithmic motifs that are prevalent in high-performance computing. The main loops of the MPI-CUDA variants run on the host, invoke kernels, and communicate using two-sided MPI, while the main loops of the dCUDA variants run on the device and communicate using notified remote memory access. Otherwise, the implementation variants share the entire application logic and the overall structure. None of them implements manual overlap of computation and communication.

particle simulation Our first mini-application simulates particles in a two-dimensional space that interact via short-range repulsive forces. We integrate the particle positions using simplified Verlet integration, considering only forces between particles that are within a parameterizable cutoff distance. Just like the particle-in-cell method used for plasma simulations [133], we decompose our wide rectangular domain into cells that are aligned along the wide edge of the domain.

Figure 5.9: Weak scaling for the particle simulation example; execution time [ms] over the number of nodes for MPI-CUDA, dCUDA, and the halo exchange time.

Furthermore, we choose the cell width to be less than or equal to the cutoff distance and consequently only compute forces between particles that are either in the same cell or in neighboring cells. After each integration step we update the particle positions and move them to neighboring cells if necessary. We organize the data using a structure of arrays that hold position, velocity, and acceleration of the particles. We thereby assign the cells to fixed-size, non-overlapping index ranges and use additional counters to keep track of the number of particles per cell. To deal with non-uniform particle distributions among the cells, we allocate four times more storage than necessary to fit all particles. To support distributed memory, we decompose the arrays and allocate an additional halo cell at each sub-domain boundary.

The main loop of the particle simulation performs the following steps: 1) we perform a halo cell exchange between neighboring ranks, 2) we compute the forces and update the particle positions, 3) we sort out the particles that move to a neighbor cell, 4) we communicate the particles that move to a neighbor rank, and 5) we integrate the particles that arrived from a neighbor cell. To copy the minimal amount of data, the MPI-CUDA variant continuously fetches the bookkeeping counters into host memory. In contrast, the main loop of the dCUDA variant runs on the device and has direct access to all data. Each rank registers one window per array that spans the cells assigned to the rank plus two halo cells. The windows of neighboring shared memory ranks physically overlap, which means,

as in the case of the MPI-CUDA variant, actual data movement only takes place for distributed memory ranks. However, in contrast to MPI-CUDA, the synchronization is much more fine-grained, enabling overlap of computation and communication.

Figure 5.9 shows weak scaling for both implementation variants as well as the halo exchange time measured by the MPI-CUDA variant. We thereby use a constant workload of 416 cells and 41,600 particles per node. Typically, the simulation would be compute-bound, but as we are interested in communication, we reduce the cutoff distance so that there are very few particle interactions. Consequently, the simulation becomes more memory bandwidth-bound. We perform two memory accesses in the innermost loop that computes the particle distances. Aggregated over 100 iterations and assuming a total execution time of 200ms, we get an estimated bandwidth requirement of roughly 100GB/s compared to 240GB/s peak bandwidth. This estimate shows that our implementation utilizes the available bandwidth well, especially considering the code also performs various other steps. While the two implementation variants perform similarly up to three nodes, the dCUDA variant clearly outperforms the MPI-CUDA variant for higher node counts. The scaling costs of the MPI-CUDA variant roughly correspond to the halo exchange time, while the dCUDA variant can partly overlap the halo exchange costs. However, the particle simulation is dynamic and load imbalances evolve during execution. For example, the minimal and maximal halo exchange times measured on eight nodes differ by a factor of two. We therefore do not expect an entirely flat scaling.
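The structure-of-arrays cell layout described above can be sketched as follows; the layout, the fixed-size per-cell index ranges, the per-cell counters, and the four-fold over-allocation follow the text, while the field names and the helper are illustrative and not taken from the mini-application sources:

#include <vector>

// Structure-of-arrays particle storage with fixed-size, non-overlapping
// index ranges per cell and a particle counter per cell.
struct ParticleStorage {
  int cells;               // cells in this sub-domain, including halo cells
  int slots_per_cell;      // fixed capacity per cell (four times the average occupancy)
  std::vector<float> pos_x, pos_y;   // positions
  std::vector<float> vel_x, vel_y;   // velocities
  std::vector<float> acc_x, acc_y;   // accelerations
  std::vector<int> count;            // particles currently stored per cell

  ParticleStorage(int cells_, int avg_particles_per_cell)
      : cells(cells_), slots_per_cell(4 * avg_particles_per_cell),
        pos_x(cells_ * slots_per_cell), pos_y(cells_ * slots_per_cell),
        vel_x(cells_ * slots_per_cell), vel_y(cells_ * slots_per_cell),
        acc_x(cells_ * slots_per_cell), acc_y(cells_ * slots_per_cell),
        count(cells_, 0) {}

  // index of the s-th slot of cell c, valid for every array
  int slot(int c, int s) const { return c * slots_per_cell + s; }
};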

stencil program Our second mini-application iteratively executes a simplified version [1] of the horizontal diffusion kernel derived from the COSMO atmospheric model [16]. The kernel consists of four dependent stencils that are applied to a three-dimensional regular grid with a limited number of vertical levels. The stencils themselves are small and consume between two and four neighboring points in the horizontal ij-plane. Our implementation organizes the data using five three-dimensional arrays that are stored in column-major order. We perform a one-dimensional domain decomposition along the j-dimension and extend the sub-domains with a one-point halo in both j-directions. Consequently, the halos consist of one contiguous storage segment per vertical k-level.

The main loop of the stencil program contains three compute phases, each of them followed by a halo exchange. In total, we execute four stencils and communicate four one-point halos per loop iteration. To apply the stencils, we assign each block to an ij-patch that covers the full i-dimension.

Figure 5.10: Weak scaling of the stencil program example; execution time [ms] over the number of nodes for MPI-CUDA, dCUDA, and the halo exchange time.

For each array, the dCUDA variant registers a window that spans the ij-patch assigned to the rank plus one halo line in each j-direction. The windows of neighboring shared memory ranks overlap, and data movements only take place between distributed memory ranks. To improve the performance, the MPI-CUDA variant additionally copies the data to a contiguous communication buffer, which allows it to wrap the entire halo exchange in a single message.

Figure 5.10 shows weak scaling for both implementation variants as well as the halo exchange time measured by the MPI-CUDA variant. We choose a domain size of 128×320×26 grid points per device. The stencil program accesses eight different arrays per iteration. Aggregated over 100 iterations and assuming a total execution time of 70ms, we compute an approximate bandwidth requirement of 100GB/s compared to 240GB/s peak bandwidth. Hence, the overall performance of our implementation is reasonable. While both implementation variants have similar single-node performance, the dCUDA variant excels in multi-node setups. The scaling costs of the MPI-CUDA variant roughly correspond to the halo exchange time, while the dCUDA variant can completely overlap the significant halo exchange costs. This is possible since the stencil program is perfectly load balanced and the halo exchange costs are the only contribution to the scaling costs. To achieve better bandwidth, OpenMPI by default stages messages larger than 30kB through the host. The MPI-CUDA variant sends one 26kB message per halo, while the dCUDA variant sends 26 separate 1kB messages (one per vertical layer).

Hence, with the given configuration both implementation variants perform direct device-to-device communication. However, introducing additional vertical layers improves the relative performance of the MPI-CUDA variant as it benefits from the higher bandwidth of host-staged transfers.
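The halo layout follows directly from the column-major storage order; a small sketch of the index arithmetic (array sizes follow the experiment, variable names are ours) illustrates why a one-point halo at a fixed j position is one contiguous segment per vertical k-level and therefore results in one message per layer:

#include <cstdio>

int main() {
  const int ni = 128, nj = 320 + 2, nk = 26;   // +2 for the one-point halos in j
  // column-major order: i runs fastest, then j, then k
  auto index = [=](int i, int j, int k) { return i + ni * (j + nj * k); };
  const int j_halo = 0;                        // first halo line in j
  for (int k = 0; k < nk; ++k) {
    const int first = index(0, j_halo, k);
    const int last = index(ni - 1, j_halo, k);
    std::printf("k=%2d: halo line occupies elements [%d, %d]\n", k, first, last);
  }
  return 0;
}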

sparse matrix-vector multiplication Our last mini-application implements sparse matrix-vector multiplication followed by a barrier synchronization. The barrier synchronization emulates possible follow-up steps that synchronize the execution, the worst case for dcuda's overlap philosophy. One example is the normalization of the output vector performed by the power method.

We store the sparse matrix using the compressed row storage (CSR) format and distribute the data using a two-dimensional domain decomposition that splits the matrix into square sub-domains. Furthermore, we store the sparse input and output vectors along the first row and the first column of the domain decomposition, respectively. To process the matrix sub-domains, we assign each row of the matrix patch to exactly one block.

The main loop of the application performs the following steps: 1) we broadcast the input vector along the columns of the domain decomposition, 2) each rank locally computes the matrix-vector product, 3) we aggregate the result vectors along the rows of the domain decomposition, and 4) we synchronize the execution of all ranks. We manually implement the broadcast and reduction collectives using a binary tree communication pattern. The dCUDA variant over-decomposes the problem along the columns of the domain decomposition. Hence, the depth of the broadcast tree is higher, while the message size matches that of the MPI-CUDA variant. In contrast, along the rows of the domain decomposition the reduction tree has the same depth, while the dCUDA variant sends more but smaller messages.

Figure 5.11 shows the weak scaling for both implementation variants as well as the communication time measured by the MPI-CUDA variant. We run our experiments using a 10,816×10,816 element matrix per device and randomly populate 0.3% of the elements. Our matrix-vector multiplication performs roughly a factor of two slower than the cuSPARSE vendor library. While the MPI-CUDA variant performs slightly better for small node counts, the dCUDA variant seems to catch up for larger node counts. However, due to the tight synchronization, we do not observe relevant overlap of computation and communication. The scaling cost for both implementation variants corresponds roughly to the communication time. We therefore conjecture that the short and tightly synchronized compute phases provide not enough room for overlap of computation and communication.

Figure 5.11: Weak scaling of the sparse matrix-vector example; execution time [ms] over the number of nodes for MPI-CUDA, dCUDA, and the communication time.

Furthermore, the MPI-CUDA variant stages the reduction messages through the host, while the dCUDA variant, due to the higher message rate, uses direct device-to-device communication. Therefore, the dCUDA variant suffers from lower network bandwidth, which might outweigh potential latency-hiding effects. We show this example to demonstrate that, even in the worst case of very limited overlap, dcuda performs comparably to MPI-CUDA. Advanced algorithmic methods could be used to enable automatic overlap even in Krylov subspace solvers [134].
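The local matrix-vector product of step 2 operates on one CSR sub-domain per rank; a minimal sequential sketch (in the mini-application each row is processed by one block, and the partial results are later reduced along the rows of the domain decomposition):

#include <vector>

// Compressed row storage (CSR) for one matrix sub-domain.
struct CSR {
  std::vector<int> row_ptr;    // size rows + 1
  std::vector<int> col_idx;    // size nnz
  std::vector<double> val;     // size nnz
};

// y = A * x for the local sub-domain.
void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
  const int rows = static_cast<int>(A.row_ptr.size()) - 1;
  for (int r = 0; r < rows; ++r) {
    double sum = 0.0;
    for (int idx = A.row_ptr[r]; idx < A.row_ptr[r + 1]; ++idx)
      sum += A.val[idx] * x[A.col_idx[idx]];
    y[r] = sum;
  }
}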

5.4 discussion

To further improve the expressiveness and the performance of the dcuda programming model, we briefly discuss possible enhancements.

collectives Over-decomposition makes collectives more expensive as their cost typically increases with the number of participating ranks. We suggest implementing highly efficient collectives that leverage shared memory [135, 136]. Furthermore, one can imagine non-blocking collectives that run asynchronously in the background and notify the participating ranks after completion.

multi-dimensional storage Our implementation currently only supports one-dimensional storage, similar to dynamically allocated memory in C programs. We suggest adding support for multi-dimensional storage as it commonly appears in scientific applications. For example, we could provide a variant of the put method that copies a rectangular region of a two-dimensional array.
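Such a two-dimensional put could, for example, be layered on top of the existing one-dimensional notified put; the sketch below uses hypothetical, simplified signatures that are not part of the current dcuda API:

#include <cstddef>

// Assumed, simplified one-dimensional notified put (placeholder declaration).
void put_notify(int window, int target_rank, std::size_t target_offset,
                const double* origin, std::size_t count, int tag);

// Copy a rows x cols block of a pitched two-dimensional array into the
// target window, one notified put per row. A real implementation would
// batch the transfers and emit a single notification for the whole block.
void put_notify_2d(int window, int target_rank, std::size_t target_offset,
                   const double* origin, std::size_t rows, std::size_t cols,
                   std::size_t origin_pitch, std::size_t target_pitch, int tag) {
  for (std::size_t r = 0; r < rows; ++r)
    put_notify(window, target_rank, target_offset + r * target_pitch,
               origin + r * origin_pitch, cols, tag);
}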

shared memory With our programming model, hundreds of ranks work on the same shared memory domain. We suggest adding functionality that makes better use of shared memory. For example, we could provide a variant of the put method that transfers data only once and then notifies all ranks associated with the target memory.

host ranks To fully utilize the compute power of host and device, we suggest extending our programming model with host ranks that, like the device ranks, communicate using notified remote memory access.

5.5 related work

Over the past years, various GPU cluster programming models and approaches have been introduced. For example, the rCUDA [137] virtualization framework makes all devices of the cluster available to a single node. The framework intercepts calls to the CUDA runtime and transparently forwards them to the node that hosts the corresponding device. Consequently, CUDA programs that support multiple devices require no code changes to make use of the full cluster. Reaño et al. [138] provide an extensive list of the different virtualization frameworks currently available.

Multiple programming models provide some sort of device-side communication infrastructure. The FLAT [139] compiler transforms code with device-side MPI calls into traditional MPI-CUDA code with host-side MPI calls. Consequently, the approach provides the convenience of device-side MPI calls without actually implementing them. GPUNet [140] implements device-side sockets that enable MapReduce-style applications with the device acting as server for incoming requests. Key design choices of dcuda, such as the circular-buffer-based host-device communication and the mapping of ranks to blocks, are inspired by GPUNet. DCGN [141] supports device-side as well as host-side compute kernels that communicate using message passing. To avoid device-side locking, the framework introduces the concept of slots that limit the maximum number of simultaneous communication requests. In a follow-up paper [142] the authors additionally discuss different rank-to-accelerator mapping options. GGAS [143] implements device-side remote memory access using custom-built network adapters that enable device-to-device communication without host interaction. However, the programming model synchronizes the device execution before performing remote memory accesses and therefore prevents any hardware-supported overlap of computation and communication. GPUrdma [144] was developed in parallel with dcuda and implements device-side remote memory access over InfiniBand using firmware and driver modifications that enable device-to-device communication without host interaction.

Multiple works discuss technology aspects that are relevant for the programming model design. Oden et al. [127] control InfiniBand network adapters directly from the device without any host interaction. Their implementation relies on driver manipulations and system call interception. However, the host-controlled communication nevertheless excels in terms of performance. Furthermore, Xiao and Feng [128] introduce device-side barrier synchronization, and Tanasic et al. [126] discuss two different hardware preemption techniques.

5.6 summary of the approach

With dcuda we introduce a unified GPU cluster programming model that follows a latency hiding philosophy. We enhance the CUDA programming model with device-side remote memory access functionality. To hide memory and instruction pipeline latencies, CUDA programs over-decompose the problem and run many more threads than there are hardware execution units. In case there is enough spare parallelism, this technique enables efficient resource utilization. Using the same technique, dcuda additionally hides the latency of remote memory access operations. Our experiments demonstrate the usefulness of the approach for mini-applications that implement different algorithmic motifs. We expect that real-world applications will draw significant benefit from automatic cluster-wide latency hiding and overlap of computation and communication, especially since implementing manual overlap results in seriously increased code complexity. Overall, dcuda stands for a paradigm shift away from coarse-grained sequential communication phases towards more fine-grained overlapped communication. Although the high message rate of fine-grained communication is clearly challenging for today's networks, our experiments show that the potential of the approach outweighs this drawback.

6

CONCLUSIONS AND FUTURE WORK

In this thesis, we present compiler framework, performance tool, and programming model advances that enable data movement optimizations. Our main contribution is the automatic end-to-end optimization of stencil programs. We automatically learn a performance model to make fusion and tile size choices that minimize the predicted execution times. The presented techniques lay the foundation for domain-specific tools that automatically select good high-level code transformations while providing a smooth user experience similar to existing compilers. But automatic compilation does not work in all cases. We thus also discuss tool and programming model support to enable manual data movement optimizations.

Developing performance-portable high-performance computing applications is a long-standing open problem. Data movement optimizations are an important aspect of performance portability. We contribute three projects to tackle the data movement challenge: 1) absinthe is a stencil program optimizer that learns performance characteristics of the target system, 2) haystack is an analytical cache model that helps programmers to understand the cost of data movement, and 3) dcuda is a communication-hiding programming model that enables the automatic overlap of computation and inter-node communication. All of our tools support the development of performance-portable applications. But none of them solves the full problem since our tools are either domain-specific or solve only a subproblem. Solving the general problem independent of the application domain remains challenging. We believe domain-specific approaches are the method of choice to perform target system-specific code transformations for the time being.

6.1 future work

Our work motivates further research mainly to extend the scope of our tools to a broader class of programs. A prominent research direction is the development of compilers that learn a performance model to guide the selection of data-locality transformations.


6.1.1 Automatic Performance Model Design

absinthe learns parameters that model latency and throughput of the innermost loop executions. The approach works well for the tested architectures. But some architectures may require performance model adaptations that go beyond choosing different parameters. For example, modeling the available parallelism is essential for architectures that hide the memory access latency using over-subscription and hardware threading. In these cases, learning the entire performance model function and not only parameters or weights is desirable, especially considering the increasingly diverse space of hardware architectures.

The manual performance model design is challenging. We typically perform three main steps: 1) identify the program features that determine the execution time, 2) identify the hardware components that limit the execution time, and 3) derive a function that estimates the execution time given the program features and the performance characteristics of the relevant hardware components. The learning of the performance characteristics may also require the design of benchmarks that stress the individual hardware components. The key to a successful performance model design is then to find the right balance of model accuracy and model complexity. Repeating this process for many different hardware architectures is very time-consuming. We thus believe an automatic performance model design will enhance the model availability and improve the model quality.

The rise of machine learning over the last decade further motivates the automatic performance model learning. Modern machine learning algorithms, provided sufficient amounts of data, demonstrate excellent results for many challenging tasks. The ability of automatic optimization frameworks to enumerate many implementation variants will help us to collect enough data to enable the automatic performance model learning. Adams et al. [25] sample random programs to learn an artificial neural network that derives weight coefficients for twenty-seven hand-designed performance model terms. Their approach is an important step towards learning the entire performance model function but still relies on the manual design of the individual performance model terms. Once efficient learning techniques are in place, they will quickly supersede the manual performance model design.
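To make the distinction between learning weights and learning the full model function concrete, the following toy sketch fits two weight coefficients for hand-designed terms (streamed bytes and loop iterations) with ordinary least squares; the feature choice, the fitting method, and all numbers are illustrative and do not reproduce absinthe's actual model:

#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // one row per benchmarked kernel: {streamed bytes, loop iterations} and measured time [s]
  std::vector<std::array<double, 2>> features = {{1e6, 1e5}, {4e6, 2e5}, {2e6, 8e5}};
  std::vector<double> times = {1.2e-3, 4.1e-3, 3.0e-3};
  // accumulate the 2x2 normal equations A * w = b
  double a00 = 0, a01 = 0, a11 = 0, b0 = 0, b1 = 0;
  for (std::size_t i = 0; i < features.size(); ++i) {
    a00 += features[i][0] * features[i][0];
    a01 += features[i][0] * features[i][1];
    a11 += features[i][1] * features[i][1];
    b0 += features[i][0] * times[i];
    b1 += features[i][1] * times[i];
  }
  const double det = a00 * a11 - a01 * a01;
  const double w0 = (b0 * a11 - b1 * a01) / det;   // learned cost per streamed byte
  const double w1 = (a00 * b1 - a01 * b0) / det;   // learned cost per loop iteration
  std::printf("w = (%.3g s/byte, %.3g s/iteration)\n", w0, w1);
  return 0;
}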

6.1.2 Scaling to Real-World Applications

We evaluated our tools on benchmark kernels that perform significant amounts of work but that are small compared to real-world applications. Scaling our stencil optimizer absinthe or the analytical cache model haystack to full applications is challenging due to their computational complexity. Scaling our tools to large programs is thus an interesting research direction.

Splitting the applications into smaller kernels and processing the individual kernels is one way of dealing with large programs. The approach sacrifices optimality, but if the kernels are well-chosen, the results should remain relevant. The codes often contain natural optimization barriers anyway, such as calls to external libraries to perform I/O or inter-node communication. If this is not the case, then methods that split programs into smaller pieces that can be optimized well are of interest. We could, for example, consider the data flow graph between the loop nests and search for splits with minimal data flow.

dynamic programming The space of possible data-locality transformations is exponential in the number of stencils. Instead of optimizing the whole program, we may compute optimal fusion and tile size choices for kernels up to a maximal size and combine the results using dynamic programming, as sketched below. The approach limits fusion to the maximal kernel size but otherwise produces optimal results.

surrogate model Presburger arithmetic is known to be computationally expensive. Instead of running the cache model directly on large programs, we may thus learn a surrogate model. The cache model allows us to generate a lot of training data for different programs and problem sizes. We could use this data to train a neural network that approximates the cache misses given the input program and problem size. The approach sacrifices accuracy to speed up the cache miss computation.
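A minimal sketch of the dynamic programming combination step referenced above, for a chain of stencils: best[i] is the minimal predicted cost of the suffix starting at stencil i, and optimize_kernel(i, j) stands for the expensive per-kernel optimizer (fusion and tile size selection); the function is a placeholder, not an existing absinthe interface:

#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

double optimize_chain(int num_stencils, int max_kernel,
                      const std::function<double(int, int)>& optimize_kernel) {
  std::vector<double> best(num_stencils + 1,
                           std::numeric_limits<double>::infinity());
  best[num_stencils] = 0.0;                          // the empty suffix costs nothing
  for (int i = num_stencils - 1; i >= 0; --i)        // solve suffixes right to left
    for (int j = i + 1; j <= std::min(num_stencils, i + max_kernel); ++j)
      best[i] = std::min(best[i], optimize_kernel(i, j) + best[j]);
  return best[0];                                    // predicted cost of the whole chain
}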

6.1.3 Cache Optimal Programs

Already the current version of haystack can be used to derive cache-optimal programs. But there are two areas of possible improvement: 1) the model execution times hinder exhaustive search space exploration, and 2) the model does not consider the parallel program execution.

parametric stack distance An automatic optimization framework does not necessarily need to evaluate the entire cache model. Instead, we can compute the stack distance for all data dependencies local to a tile and its neighbor tiles. We then introduce optimization constraints that ensure the stack distances are less than or equal to the cache size. The solution guarantees that all data dependencies local to the tile and its neighbor tiles are in cache. The advantage of the approach is that the stack distance can be computed parametrically in the tile size. The parametric formulation has the potential to accelerate the tile size optimization significantly. We thus believe the stack distance is an interesting measure for future optimization frameworks.
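For illustration, such a tile size selection problem could be written as follows (the symbols are our notation, not taken from the thesis): with tile sizes $t_1, \dots, t_d$, a predicted execution time $T$, the parametric stack distance $D_r$ of a data dependence $r$, and the cache capacity $C$ measured in cache lines,

\begin{align*}
  \text{minimize}\quad & T(t_1, \dots, t_d) \\
  \text{subject to}\quad & D_r(t_1, \dots, t_d) \le C
    \quad \text{for every dependence } r \text{ local to a tile and its neighbor tiles,} \\
  & t_k \ge 1 \quad \text{for } k = 1, \dots, d.
\end{align*}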

parallel cache model Modeling the cache behavior of parallel programs is a long-standing open problem. We may use a sequential cache model to analyze the memory accesses of individual tiles that execute on a single core. But the sequential analysis fails to detect the data movement between tiles that execute in parallel on different cores. Unwanted effects such as false sharing (two cores access seemingly different data stored in the same cache line) may result in cache line ping-pong and other effects that hinder the parallel execution. A parallel cache model can capture such effects but typically faces the challenge that the parallel schedule is unknown. Assuming we know the schedule, the model can compute the stack distance to detect the cache misses and additionally identify the data movement between different cores. We can then repeat the analysis for several random schedules or determine a worst-case (fully parallel schedule) and a best-case (sequential schedule) scenario to gather statistics for the extreme cases. If the inter-core data movement heavily depends on the schedule, the program likely suffers from unwanted parallelization effects such as cache line ping-pong. A tool that helps programmers detect such problems would be an important contribution to tackle the data movement challenge.

BIBLIOGRAPHY

1. Gysi, T., Grosser, T. & Hoefler, T. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architec- tures in Proceedings of the 29th ACM on International Conference on Supercomputing (ACM, Newport Beach, California, USA, 2015), 177. http://doi.acm.org/10.1145/2751205.2751223 (cit. on pp. ix, 37, 63, 116). 2. Gysi, T., Grosser, T. & Hoefler, T. Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot Accepted at the 28th International Conference on Parallel Architectures and Compilation Techniques. 2019 (cit. on p. ix). 3. Gysi, T., Grosser, T., Brandner, L. & Hoefler, T. A Fast Analytical Model of Fully Associative Caches in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (ACM, Phoenix, AZ, USA, 2019), 816. http : / / doi . acm . org / 10 . 1145 / 3314221.3314606 (cit. on p. ix). 4. Gysi, T., Bär, J. & Hoefler, T. dCUDA: Hardware Supported Overlap of Computation and Communication in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, Salt Lake City, Utah, 2016), 52:1. http://dl.acm. org/citation.cfm?id=3014904.3014974 (cit. on p. ix). 5. Gysi, T., Osuna, C., Fuhrer, O., Bianco, M. & Schulthess, T. C. STELLA: A Domain-specific Tool for Structured Grid Methods in Weather and Climate Models in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, Austin, Texas, 2015), 41:1. http://doi.acm.org/10.1145/2807591.2807627 (cit. on pp. ix, 1, 63). 6. Fuhrer, O., Osuna, C., Lapillonne, X., Gysi, T., Cumming, B., Bianco, M., Arteaga, A. & Schulthess, T. Towards a Performance Portable, Architecture Agnostic Implementation Strategy for Weather and Cli- mate Models. Supercomput. Front. Innov.: Int. J. 1, 45. http://dx.doi. org/10.14529/jsfi140103 (2014) (cit. on pp. ix, 1, 10, 28).


7. Unat, D., Dubey, A., Hoefler, T., Shalf, J., Abraham, M., Bianco, M., Chamberlain, B. L., Cledat, R., Edwards, H. C., Finkel, H., Fuerlinger, K., Hannig, F., Jeannot, E., Kamil, A., Keasler, J., Kelly, P. H. J., Leung, V., Ltaief, H., Maruyama, N., Newburn, C. J. & Pericás, M. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems 28, 3007 (2017) (cit. on pp. 1, 37). 8. Zhou, X., Giacalone, J.-P., Garzarán, M. J., Kuhn, R. H., Ni, Y. & Padua, D. Hierarchical Overlapped Tiling in Proceedings of the Tenth International Symposium on Code Generation and Optimization (ACM, San Jose, California, 2012), 207. http : / / doi . acm . org / 10 . 1145 / 2259016.2259044 (cit. on pp. 1, 2, 14, 35, 40, 55). 9. Holewinski, J., Pouchet, L.-N. & Sadayappan, P. High-performance Code Generation for Stencil Computations on GPU Architectures in Proceedings of the 26th ACM International Conference on Supercomputing (ACM, San Servolo Island, Venice, Italy, 2012), 311. http://doi.acm.org/10. 1145/2304576.2304619 (cit. on pp. 1, 14, 34). 10. Shimokawabe, T., Aoki, T., Muroi, C., Ishida, J., Kawano, K., Endo, T., Nukada, A., Maruyama, N. & Matsuoka, S. An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Computer Society, Washington, DC, USA, 2010), 1. https://doi.org/10.1109/SC.2010.9 (cit. on pp. 1, 7, 97). 11. Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K. & Leiserson, C. E. The Pochoir Stencil Compiler in Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (ACM, San Jose, California, USA, 2011), 117. http://doi.acm.org/ 10.1145/1989493.1989508 (cit. on pp. 1, 9, 63). 12. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F. & Ama- rasinghe, S. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language De- sign and Implementation (ACM, Seattle, Washington, USA, 2013), 519. http://doi.acm.org/10.1145/2491956.2462176 (cit. on pp. 1, 2, 9, 10, 14, 35, 37, 61, 63). 13. Mullapudi, R. T., Vasista, V. & Bondhugula, U. PolyMage: Automatic Optimization for Image Processing Pipelines in Proceedings of the Twen- tieth International Conference on Architectural Support for Programming bibliography 129

Languages and Operating Systems (ACM, Istanbul, Turkey, 2015), 429. http://doi.acm.org/10.1145/2694344.2694364 (cit. on pp. 1, 2, 35, 37, 61, 63). 14. Bondhugula, U., Hartono, A., Ramanujam, J. & Sadayappan, P. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer in Pro- ceedings of the 29th ACM SIGPLAN Conference on Programming Lan- guage Design and Implementation (ACM, Tucson, AZ, USA, 2008), 101. http://doi.acm.org/10.1145/1375581.1375595 (cit. on pp. 1, 94). 15. Verdoolaege, S., Carlos Juega, J., Cohen, A., Ignacio Gómez, J., Ten- llado, C. & Catthoor, F. Polyhedral Parallel Code Generation for CUDA. ACM Trans. Archit. Code Optim. 9, 54:1. http://doi.acm.org/ 10.1145/2400682.2400713 (2013) (cit. on pp. 1, 92). 16. Baldauf, M., Seifert, A., Förstner, J., Majewski, D., Raschendorfer, M. & Reinhardt, T. Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. Monthly Weather Review 139, 3887 (2011) (cit. on pp. 1, 37, 55, 58, 116). 17. Consortium for Small-scale Modeling http://www.cosmo-model.org/ (2018) (cit. on pp. 1, 37). 18. Fuhrer, O., Chadha, T., Hoefler, T., Kwasniewski, G., Lapillonne, X., Leutwyler, D., Lüthi, D., Osuna, C., Schär, C., Schulthess, T. C. & Vogt, H. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0. Geoscientific Model Development 11, 1665. https://www.geosci-model-dev.net/ 11/1665/2018/ (2018) (cit. on pp. 1, 37). 19. Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J., O’Reilly, U.-M. & Amarasinghe, S. OpenTuner: An Extensible Framework for Program Autotuning in Proceedings of the 23rd International Confer- ence on Parallel Architectures and Compilation (ACM, Edmonton, AB, Canada, 2014), 303. http://doi.acm.org/10.1145/2628071.2628092 (cit. on p. 2). 20. ¸Tapu¸s,C., Chung, I.-H. & Hollingsworth, J. K. Active Harmony: Towards Automated Performance Tuning in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (IEEE Computer Society Press, Balti- more, Maryland, 2002), 1. http://dl.acm.org/citation.cfm?id= 762761.762771 (cit. on p. 2). 130 bibliography

21. Tiwari, A., Chen, C., Chame, J., Hall, M. & Hollingsworth, J. K. A scalable auto-tuning framework for compiler optimization in 2009 IEEE International Symposium on Parallel Distributed Processing (2009), 1 (cit. on p. 2). 22. Christen, M., Schenk, O. & Burkhart, H. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures in Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IEEE Computer Society, Washington, DC, USA, 2011), 676. https://doi.org/10.1109/IPDPS. 2011.70 (cit. on pp. 2, 9, 34, 63). 23. Mullapudi, R. T., Adams, A., Sharlet, D., Ragan-Kelley, J. & Fatahalian, K. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 83:1. http://doi.acm.org/10.1145/2897824. 2925952 (2016) (cit. on pp. 2, 61, 63). 24. Low, T. M., Igual, F. D., Smith, T. M. & Quintana-Orti, E. S. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 12:1. http://doi.acm.org/10.1145/2925987 (2016) (cit. on p. 2). 25. Adams, A., Ma, K., Anderson, L., Baghdadi, R., Li, T.-M., Gharbi, M., Steiner, B., Johnson, S., Fatahalian, K., Durand, F. & Ragan-Kelley, J. Learning to Optimize Halide with Tree Search and Random Programs. ACM Trans. Graph. 38, 121:1. http://doi.acm.org/10.1145/3306346. 3322967 (2019) (cit. on pp. 2, 124). 26. McKinley, K. S., Carr, S. & Tseng, C.-W. Improving Data Locality with Loop Transformations. ACM Trans. Program. Lang. Syst. 18, 424. http://doi.acm.org/10.1145/233561.233564 (1996) (cit. on p. 2). 27. Wolf, M. E. & Lam, M. S. A Data Locality Optimizing Algorithm in Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (ACM, Toronto, Ontario, Canada, 1991), 30. http://doi.acm.org/10.1145/113445.113449 (cit. on pp. 2, 94). 28. Chen, C., Chame, J. & Hall, M. CHiLL: A framework for composing high-level loop transformations tech. rep. (2008) (cit. on p. 2). 29. Bastoul, C., Cohen, A., Girbal, S., Sharma, S. & Temam, O. Putting Polyhedral Loop Transformations to Work in LCPC (2003) (cit. on p. 2). bibliography 131

30. Baghdadi, R., Ray, J., Romdhane, M. B., Del Sozzo, E., Akkas, A., Zhang, Y., Suriana, P., Kamil, S. & Amarasinghe, S. Tiramisu: A Poly- hedral Compiler for Expressing Fast and Portable Code in Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (IEEE Press, Washington, DC, USA, 2019), 193. http: //dl.acm.org/citation.cfm?id=3314872.3314896 (cit. on p. 2). 31. Steuwer, M., Fensch, C., Lindley, S. & Dubach, C. Generating Perfor- mance Portable Code Using Rewrite Rules: From High-level Func- tional Expressions to High-performance OpenCL Code. SIGPLAN Not. 50, 205. http://doi.acm.org/10.1145/2858949.2784754 (2015) (cit. on p. 2). 32. Puschel, M., Moura, J. M. F., Johnson, J. R., Padua, D., Veloso, M. M., Singer, B. W., Jianxin Xiong, Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R. W. & Rizzolo, N. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE 93, 232 (2005) (cit. on p. 2). 33. Spampinato, D. G. & Püschel, M. A Basic Linear Algebra Compiler in Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (ACM, Orlando, FL, USA, 2014), 23:23. http://doi.acm.org/10.1145/2581122.2544155 (cit. on p. 2). 34. Kennedy, K. Fast Greedy Weighted Fusion. Int. J. Parallel Program. 29, 463. https://doi.org/10.1023/A:1012241830762 (2001) (cit. on p. 2). 35. Wahib, M. & Maruyama, N. Scalable Kernel Fusion for Memory-bound GPU Applications in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, New Orleans, Louisana, 2014), 191. https://doi.org/10.1109/SC. 2014.21 (cit. on pp. 2, 9, 35). 36. Kennedy, K. & McKinley, K. S. Optimizing for Parallelism and Data Locality in Proceedings of the 6th International Conference on Supercom- puting (ACM, Washington, D. C., USA, 1992), 323. http://doi.acm. org/10.1145/143369.143427 (cit. on pp. 3, 94). 37. Bondhugula, U., Baskaran, M., Krishnamoorthy, S., Ramanujam, J., Rountev, A. & Sadayappan, P. Automatic Transformations for Communication-minimized Parallelization and Locality Optimization in the Polyhedral Model in Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Com- piler Construction (Springer-Verlag, Budapest, Hungary, 2008), 132. 132 bibliography

http://dl.acm.org/citation.cfm?id=1788374.1788386 (cit. on p. 3). 38. IBM. IBM ILOG CPLEX Optimization Studio CPLEX User’s Manual (2011) (cit. on pp. 4, 55). 39. Phillips, J. C., Stone, J. E. & Schulten, K. Adapting a Message-driven Parallel Application to GPU-accelerated Clusters in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (IEEE Press, Austin, Texas, 2008), 8:1. http://dl.acm.org/citation.cfm?id=1413370.1413379 (cit. on pp. 7, 97). 40. Hong, S. & Kim, H. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. SIGARCH Comput. Archit. News 37, 152. http://doi.acm.org/10.1145/1555815. 1555775 (2009) (cit. on p. 7). 41. Doms, G. & Schättler, U. The nonhydrostatic limited-area model LM (Lokal-Model) of the DWD. Part I: Scientific documentation tech. rep. (German Weather Service (DWD), Germany, 1999). http : / / www . cosmo-model.org/ (cit. on pp. 9, 11, 13). 42. McMECHAN, G. A. Migration by extrapolation of time-dependent boundary values*. Geophysical Prospecting 31, 413. http://dx.doi. org/10.1111/j.1365-2478.1983.tb01060.x (1983) (cit. on pp. 9, 37). 43. Taflove, A. Review of the formulation and applications of the finite- difference time-domain method for numerical modeling of electro- magnetic wave interactions with arbitrary structures. Wave Motion 10. Special Issue on Numerical Methods for Electromagnetic Wave In- teractions, 547. http://www.sciencedirect.com/science/article/ pii/0165212588900121 (1988) (cit. on pp. 9, 37). 44. Basu, P., Venkat, A., Hall, M., Williams, S., Van Straalen, B. & Oliker, L. Compiler generation and autotuning of communication-avoiding operators for geometric multigrid in 20th Annual International Conference on High Performance Computing (2013), 452 (cit. on pp. 9, 35). 45. Frigo, M. & Strumpen, V. The Memory Behavior of Cache Oblivious Stencil Computations. J. Supercomput. 39, 93. http://dx.doi.org/10. 1007/s11227-007-0111-y (2007) (cit. on pp. 9, 34). bibliography 133

46. Olschanowsky, C., Strout, M. M., Guzik, S., Loffeld, J. & Hittinger, J. A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, New Orleans, Louisana, 2014), 793. https : / / doi . org / 10 . 1109/SC.2014.70 (cit. on pp. 9, 35). 47. Williams, S., Waterman, A. & Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52, 65. http://doi.acm.org/10.1145/1498765.1498785 (2009) (cit. on p. 17). 48. Verdoolaege, S. isl: An Integer Set Library for the Polyhedral Model in Proceedings of the Third International Congress Conference on Mathematical Software (Springer-Verlag, Kobe, Japan, 2010), 299. http://dl.acm. org/citation.cfm?id=1888390.1888455 (cit. on pp. 20, 68, 75). 49. Verdoolaege, S., Seghir, R., Beyls, K., Loechner, V. & Bruynooghe, M. Counting Integer Points in Parametric Polytopes Using Barvinok’s Rational Functions. Algorithmica 48, 37. http : / / dx . doi . org / 10 . 1007/s00453-006-1231-0 (2007) (cit. on pp. 20, 65, 75, 88). 50. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J. & Yelick, K. Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (IEEE Press, Austin, Texas, 2008), 4:1. http://dl.acm.org/citation.cfm?id=1413370. 1413375 (cit. on p. 34). 51. Zhang, Y. & Mueller, F. Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters. IEEE Transactions on Parallel and Distributed Systems 24, 417 (2013) (cit. on p. 34). 52. Shirako, J., Sharma, K., Fauzia, N., Pouchet, L.-N., Ramanujam, J., Sadayappan, P. & Sarkar, V. Analytical Bounds for Optimal Tile Size Selection in Proceedings of the 21st International Conference on Compiler Construction (Springer-Verlag, Tallinn, Estonia, 2012), 101. http://dx. doi.org/10.1007/978-3-642-28652-0_6 (cit. on pp. 35, 62). 53. Renganarayana, L. & Rajopadhye, S. Positivity, Posynomials and Tile Size Selection in ACM/IEEE Conf. on Supercomputing, Proc. of (IEEE Press, Austin, Texas, 2008), 55:1. http://dl.acm.org/citation.cfm? id=1413370.1413426 (cit. on p. 35). 134 bibliography

54. Cade, B. S. & Richards, J. D. Permutation Tests for Least Absolute Deviation Regression. Biometrics 52, 886 (1996) (cit. on pp. 50, 58). 55. ILP Part 6 – Faster multiplication https://blog.adamfurmanek.pl/ 2015/09/26/ilp-part-6/. 2015 (cit. on p. 51). 56. Hoefler, T. & Belli, R. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, Austin, Texas, 2015), 73:1. http://doi.acm.org/10.1145/2807591.2807644 (cit. on pp. 55, 84). 57. Pouchet, L.-N. Polybench: The polyhedral benchmark suite https : / / sourceforge.net/projects/polybench/. 2012 (cit. on pp. 58, 84). 58. Jangda, A. & Bondhugula, U. An Effective Fusion and Tile Size Model for Optimizing Image Processing Pipelines in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (ACM, Vienna, Austria, 2018), 261. http://doi.acm.org/10.1145/ 3178487.3178507 (cit. on pp. 61, 63). 59. Lam, M. D., Rothberg, E. E. & Wolf, M. E. The Cache Performance and Optimizations of Blocked Algorithms in Proceedings of the Fourth Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems (ACM, Santa Clara, California, USA, 1991), 63. http://doi.acm.org/10.1145/106972.106981 (cit. on p. 62). 60. Dongarra, J. & Schreiber, R. Automatic Blocking of Nested Loops tech. rep. (Knoxville, TN, USA, 1990) (cit. on p. 62). 61. Coleman, S. & McKinley, K. S. Tile Size Selection Using Cache Orga- nization and Data Layout in Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (ACM, La Jolla, California, USA, 1995), 279. http://doi.acm.org/10.1145/ 207110.207162 (cit. on p. 62). 62. Hsu, C.-h. & Kremer, U. A Quantitative Analysis of Tile Size Selection Algorithms. J. Supercomput. 27, 279. https://doi.org/10.1023/B: SUPE.0000011388.54204.8e (2004) (cit. on p. 62). 63. Sarkar, V. & Megiddo, N. An Analytical Model for Loop Tiling and Its Solution in Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software (IEEE Computer Society, Washington, DC, USA, 2000), 146. http://dl.acm.org/citation. cfm?id=1153923.1154542 (cit. on p. 62). bibliography 135

64. Mitchell, N., Högstedt, K., Carter, L. & Ferrante, J. Quantifying the multi-level nature of tiling interactions. International Journal of Parallel Programming 26, 641 (1998) (cit. on p. 62). 65. Yotov, K., Xiaoming Li, Gang Ren, Garzaran, M. J. S., Padua, D., Pingali, K. & Stodghill, P. Is Search Really Necessary to Generate High-Performance BLAS? Proceedings of the IEEE 93, 358 (2005) (cit. on p. 62). 66. Esseghir, K. Improving data locality for caches PhD thesis (Rice Univer- sity, 1993) (cit. on p. 62). 67. Rivera, G. & Tseng, C.-W. A Comparison of Compiler Tiling Algorithms in Proceedings of the 8th International Conference on Compiler Construction, Held As Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS’99 (Springer-Verlag, Berlin, Heidelberg, 1999), 168. http://dl.acm.org/citation.cfm?id=647475.727610 (cit. on p. 62). 68. Mehta, S., Beeraka, G. & Yew, P.-C. Tile Size Selection Revisited. ACM Trans. Archit. Code Optim. 10, 35:1. http://doi.acm.org/10.1145/ 2541228.2555292 (2013) (cit. on p. 62). 69. Whaley, R. C. & Dongarra, J. J. Automatically Tuned Linear Algebra Software in SC ’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (1998), 38 (cit. on p. 62). 70. Knijnenburg, P. M. W., Kisuki, T., Gallivan, K. & O’Boyle, M. F. P. The Effect of Cache Models on Iterative Compilation for Combined Tiling and Unrolling: Research Articles. Concurr. Comput. : Pract. Exper. 16, 247. http://dx.doi.org/10.1002/cpe.v16:2/3 (2004) (cit. on p. 62). 71. Fraguela, B. B., Carmueja, M. G. & Andrade, D. Optimal tile size selection guided by analytical models. parameters 10, 14 (2005) (cit. on p. 62). 72. Yuki, T., Renganarayanan, L., Rajopadhye, S., Anderson, C., Eichen- berger, A. E. & O’Brien, K. Automatic Creation of Tile Size Selection Models in Proceedings of the 8th Annual IEEE/ACM International Sym- posium on Code Generation and Optimization (ACM, Toronto, Ontario, Canada, 2010), 190. http://doi.acm.org/10.1145/1772954.1772982 (cit. on p. 62). 73. Mendis, C., Renda, A., Amarasinghe, S. & Carbin, M. Ithemal: Ac- curate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks (2018) (cit. on p. 62). 136 bibliography

74. Rahman, M., Pouchet, L.-N. & Sadayappan, P. Neural Network Assisted Tile Size Selection in (2010) (cit. on p. 62). 75. Cociorva, D., Wilkins, J. W., Lam, C., Baumgartner, G., Ramanujam, J. & Sadayappan, P. Loop Optimization for a Class of Memory-constrained Computations in Proceedings of the 15th International Conference on Su- percomputing (ACM, Sorrento, Italy, 2001), 103. http://doi.acm.org/ 10.1145/377792.377814 (cit. on p. 62). 76. Qasem, A. & Kennedy, K. Profitable Loop Fusion and Tiling Using Model- driven Empirical Search in Proceedings of the 20th Annual International Conference on Supercomputing (ACM, Cairns, Queensland, Australia, 2006), 249. http://doi.acm.org/10.1145/1183401.1183437 (cit. on p. 62). 77. Beaugnon, U., Pouille, A., Pouzet, M., Pienaar, J. & Cohen, A. Op- timization Space Pruning Without Regrets in Proceedings of the 26th International Conference on Compiler Construction (ACM, Austin, TX, USA, 2017), 34. http://doi.acm.org/10.1145/3033019.3033023 (cit. on p. 62). 78. Henretty, T., Veras, R., Franchetti, F., Pouchet, L.-N., Ramanujam, J. & Sadayappan, P. A Stencil Compiler for Short-vector SIMD Architectures in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ACM, Eugene, Oregon, USA, 2013), 13. http://doi.acm.org/10.1145/2464996.2467268 (cit. on p. 63). 79. Henretty, T., Stock, K., Pouchet, L.-N., Franchetti, F., Ramanujam, J. & Sadayappan, P. Data Layout Transformation for Stencil Compu- tations on Short-vector SIMD Architectures in Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint Eu- ropean Conferences on Theory and Practice of Software (Springer-Verlag, Saarbrücken, Germany, 2011), 225. http : / / dl . acm . org / citation.cfm?id=1987237.1987255 (cit. on p. 63). 80. Prajapati, N., Ranasinghe, W., Rajopadhye, S., Andonov, R., Djidjev, H. & Grosser, T. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils in Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (ACM, Austin, Texas, USA, 2017), 163. http://doi.acm.org/10. 1145/3018743.3018744 (cit. on p. 63). 81. Liao, S. W., Tsai, S. J., Yang, C. H. & Lo, C. K. Locality-Aware Scheduling for Stencil Code in Halide in 2016 45th International Conference on Parallel Processing Workshops (ICPPW) (2016), 72 (cit. on p. 63). bibliography 137

82. Mattson, R. L., Gecsei, J., Slutz, D. R. & Traiger, I. L. Evaluation Techniques for Storage Hierarchies. IBM Syst. J. 9, 78. http://dx.doi. org/10.1147/sj.92.0078 (1970) (cit. on pp. 65, 93). 83. Beyls, K. & D’Hollander, E. H. Reuse Distance as a Metric for Cache Behavior in In Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems (2001), 617 (cit. on p. 65). 84. Ding, C. & Zhong, Y. Predicting Whole-program Locality Through Reuse Distance Analysis in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (ACM, San Diego, California, USA, 2003), 245. http://doi.acm.org/10.1145/781131. 781159 (cit. on pp. 65, 94). 85. Xiang, X., Ding, C., Luo, H. & Bao, B. HOTL: A Higher Order Theory of Locality in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ACM, Houston, Texas, USA, 2013), 343. http://doi.acm.org/10. 1145/2451116.2451153 (cit. on pp. 65, 94). 86. Elder, J. & Hill, M. D. Dinero IV Trace-Driven Uniprocessor Cache Simula- tor http://www.cs.wisc.edu/~{}markhill/DineroIV/. 2003 (cit. on pp. 65, 84, 86, 92, 93). 87. Iyer, R. On modeling and analyzing cache hierarchies using CASPER in 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems, 2003. MASCOTS 2003. (2003), 182 (cit. on pp. 65, 93). 88. Carlson, T. E., Heirman, W. & Eeckhout, L. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-core Simulation in Proceedings of 2011 International Conference for High Performance Com- puting, Networking, Storage and Analysis (ACM, Seattle, Washington, 2011), 52:1. http://doi.acm.org/10.1145/2063384.2063454 (cit. on pp. 65, 93). 89. Binkert, N., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., Hestness, J., Hower, D. R., Krishna, T., Sardashti, S., Sen, R., Sewell, K., Shoaib, M., Vaish, N., Hill, M. D. & Wood, D. A. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 1. http://doi.acm. org/10.1145/2024716.2024718 (2011) (cit. on pp. 65, 93). 90. Beyls, K. & D’Hollander, E. H. Generating Cache Hints for Improved Program Efficiency. J. Syst. Archit. 51, 223. http://dx.doi.org/10. 1016/j.sysarc.2004.09.004 (2005) (cit. on pp. 66, 95). 138 bibliography

91. Hill, M. D. Aspects of Cache Memory and Instruction Buffer Performance AAI8813907. PhD thesis (1987) (cit. on p. 67). 92. Haase, C. A Survival Guide to Presburger Arithmetic. ACM SIGLOG News 5, 67. http://doi.acm.org/10.1145/3242953.3242964 (2018) (cit. on pp. 68, 82). 93. Nguyen Luu, D. The Computational Complexity of Presburger Arithmetic PhD thesis (UCLA, 2018) (cit. on p. 82). 94. JR. H.W., L. Integer Programming with a Fixed Number of Variables. Report 81-03, Mathematisch Instituut Amsterdam (1981) 8 (1983) (cit. on p. 82). 95. Fischer, M. J. & Rabin, M. O. SUPER-EXPONENTIAL COMPLEXITY OF PRESBURGER ARITHMETIC tech. rep. (Cambridge, MA, USA, 1974) (cit. on p. 82). 96. Terpstra, D., Jagode, H., You, H. & Dongarra, J. Collecting Performance Data with PAPI-C in Tools for High Performance Computing 2009 (eds Müller, M. S., Resch, M. M., Schulz, A. & Nagel, W. E.) (Springer Berlin Heidelberg, Berlin, Heidelberg, 2010), 157 (cit. on p. 84). 97. Bao, W., Krishnamoorthy, S., Pouchet, L.-N. & Sadayappan, P. Ana- lytical Modeling of Cache Behavior for Affine Programs. Proc. ACM Program. Lang. 2, 32:1. http://doi.acm.org/10.1145/3158120 (2017) (cit. on pp. 92, 95). 98. Bellard, F. QEMU, a Fast and Portable Dynamic Translator in Proceed- ings of the Annual Conference on USENIX Annual Technical Conference (USENIX Association, Anaheim, CA, 2005), 41. http://dl.acm.org/ citation.cfm?id=1247360.1247401 (cit. on p. 92). 99. Bennett, B. T. & Kruskal, V. J. LRU Stack Processing. IBM J. Res. Dev. 19, 353. http://dx.doi.org/10.1147/rd.194.0353 (1975) (cit. on p. 94). 100. Olken, F. Efficient methods for calculating the success function of fixed space replacement policies (1981) (cit. on p. 94). 101. Sugumar, R. A. & Abraham, S. G. Efficient Simulation of Caches Under Optimal Replacement with Applications to Miss Characterization in Pro- ceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (ACM, Santa Clara, California, USA, 1993), 24. http://doi.acm.org/10.1145/166955.166974 (cit. on p. 94). bibliography 139

102. Kim, Y. H., Hill, M. D. & Wood, D. A. Implementing Stack Simulation for Highly-associative Memories in Proceedings of the 1991 ACM SIGMET- RICS Conference on Measurement and Modeling of Computer Systems (ACM, San Diego, California, USA, 1991), 212. http://doi.acm.org/ 10.1145/107971.107995 (cit. on p. 94). 103. Eklov, D. & Hagersten, E. StatStack: Efficient modeling of LRU caches in 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS) (2010), 55 (cit. on p. 94). 104. Xiang, X., Bao, B., Ding, C. & Gao, Y. Linear-time Modeling of Program Working Set in Shared Cache in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (IEEE Computer Society, Washington, DC, USA, 2011), 350. http://dx.doi. org/10.1109/PACT.2011.66 (cit. on p. 94). 105. Chen, D., Liu, F., Ding, C. & Pai, S. Locality Analysis Through Static Parallel Sampling in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (ACM, Philadelphia, PA, USA, 2018), 557. http : / / doi . acm . org / 10 . 1145 / 3192366 . 3192402 (cit. on p. 94). 106. Agarwal, A., Hennessy, J. & Horowitz, M. An Analytical Cache Model. ACM Trans. Comput. Syst. 7, 184. http://doi.acm.org/10.1145/ 63404.63407 (1989) (cit. on p. 94). 107. Harper, J. S., Kerbyson, D. J. & Nudd, G. R. Analytical Modeling of Set-Associative Cache Behavior. IEEE Trans. Comput. 48, 1009. https: //doi.org/10.1109/12.805152 (1999) (cit. on p. 94). 108. Carr, S., McKinley, K. S. & Tseng, C.-W. Compiler Optimizations for Improving Data Locality in Proceedings of the Sixth International Con- ference on Architectural Support for Programming Languages and Op- erating Systems (ACM, San Jose, California, USA, 1994), 252. http: //doi.acm.org/10.1145/195473.195557 (cit. on p. 94). 109. Ferdinand, C., Martin, F., Wilhelm, R. & Alt, M. Cache Behavior Prediction by Abstract Interpretation. Sci. Comput. Program. 35, 163. http://dx.doi.org/10.1016/S0167-6423(99)00010-6 (1999) (cit. on p. 94). 110. Chattopadhyay, S. & Roychoudhury, A. Scalable and precise refine- ment of cache timing analysis via path-sensitive verification. Real- Time Systems 49, 517. https://doi.org/10.1007/s11241-013-9178-0 (2013) (cit. on p. 94). 140 bibliography

111. Touzeau, V., Maïza, C., Monniaux, D. & Reineke, J. Ascertaining Un- certainty for Efficient Exact Cache Analysis in (2017), 22 (cit. on p. 94). 112. Touzeau, V., Maïza, C., Monniaux, D. & Reineke, J. Fast and Exact Analysis for LRU Caches. Proc. ACM Program. Lang. 3, 54:1. http: //doi.acm.org/10.1145/3290367 (2019) (cit. on p. 94). 113. Ghosh, S., Martonosi, M. & Malik, S. Cache Miss Equations: A Com- piler Framework for Analyzing and Tuning Memory Behavior. ACM Transactions on Programming Languages and Systems 21 (2000) (cit. on p. 94). 114. Vera, X. & Xue, J. Let’s Study Whole-Program Cache Behaviour An- alytically in Proceedings of the 8th International Symposium on High- Performance Computer Architecture (IEEE Computer Society, Washing- ton, DC, USA, 2002), 175. http://dl.acm.org/citation.cfm?id= 874076.876456 (cit. on p. 94). 115. Xue, J. & Vera, X. Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior. IEEE Trans. Comput. 53, 547. http://dx.doi.org/10.1109/TC.2004.1275296 (2004) (cit. on p. 94). 116. CaBcaval, C. & Padua, D. A. Estimating Cache Misses and Locality Using Stack Distances in Proceedings of the 17th Annual International Conference on Supercomputing (ACM, San Francisco, CA, USA, 2003), 150. http://doi.acm.org/10.1145/782814.782836 (cit. on p. 94). 117. Chatterjee, S., Parker, E., Hanlon, P. J. & Lebeck, A. R. Exact Analysis of the Cache Behavior of Nested Loops in Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (ACM, Snowbird, Utah, USA, 2001), 286. http://doi.acm.org/10. 1145/378795.378859 (cit. on p. 94). 118. Forum, M. MPI: A Message-Passing Interface Standard. Version 3.1 http: //www.mpi-forum.org. 2015 (cit. on p. 97). 119. Nickolls, J., Buck, I., Garland, M. & Skadron, K. Scalable Parallel Programming with CUDA. Queue 6, 40. http://doi.acm.org/10. 1145/1365490.1365500 (2008) (cit. on p. 97). 120. Potluri, S., Hamidouche, K., Venkatesh, A., Bureddy, D. & Panda, D. K. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs in Proceedings of the 2013 42Nd International Conference on Parallel Processing (IEEE Computer Society, Washington, DC, USA, 2013), 80. http://dx.doi.org/10. 1109/ICPP.2013.17 (cit. on pp. 97, 105). bibliography 141

121. White III, J. B. & Dongarra, J. J. Overlapping Computation and Communication for Advection on Hybrid Parallel Computers in 2011 IEEE International Parallel & Distributed Processing Symposium (2011), 59 (cit. on p. 97).
122. Phillips, E. H. & Fatica, M. Implementing the Himeno benchmark with CUDA on GPU clusters in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (2010), 1 (cit. on p. 97).
123. Hoefler, T., Dinan, J., Thakur, R., Barrett, B., Balaji, P., Gropp, W. & Underwood, K. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 2, 9:1. http://doi.acm.org/10.1145/2780584 (2015) (cit. on pp. 98, 100, 104).
124. Belli, R. & Hoefler, T. Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IEEE Computer Society, Washington, DC, USA, 2015), 871. http://dx.doi.org/10.1109/IPDPS.2015.30 (cit. on pp. 98, 100, 109).
125. Little, J. D. C. A Proof for the Queuing Formula: L = λW. Oper. Res. 9, 383. http://dx.doi.org/10.1287/opre.9.3.383 (1961) (cit. on p. 99).
126. Tanasic, I., Gelado, I., Cabezas, J., Ramirez, A., Navarro, N. & Valero, M. Enabling Preemptive Multiprogramming on GPUs in Proceedings of the 41st Annual International Symposium on Computer Architecture (IEEE Press, Minneapolis, Minnesota, USA, 2014), 193. http://dl.acm.org/citation.cfm?id=2665671.2665702 (cit. on pp. 100, 121).
127. Oden, L., Fröning, H. & Pfreundt, F. Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU in 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (2014), 976 (cit. on pp. 105, 121).
128. Xiao, S. & Feng, W. Inter-block GPU communication via fast barrier synchronization in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (2010), 1 (cit. on pp. 106, 121).
129. Rossetti, D. A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology. https://github.com/NVIDIA/gdrcopy. 2015 (cit. on pp. 108, 111).
130. Cray Inc. Using the GNI and DMAPP APIs. Ver. S-2446-52. http://docs.cray.com/. 2014 (cit. on p. 111).

131. Goldenberg, D. Co-Design Architecture. http://slideshare.net/insideHPC/co-design-architecture-for-exascale. 2016 (cit. on p. 111).
132. Micikevicius, P. GPU Performance Analysis and Optimization. http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf. 2012 (cit. on p. 113).
133. Dawson, J. M. Particle simulation of plasmas. Rev. Mod. Phys. 55, 403. http://link.aps.org/doi/10.1103/RevModPhys.55.403 (2 1983) (cit. on p. 115).
134. Ghysels, P. & Vanroose, W. Hiding Global Synchronization Latency in the Preconditioned Conjugate Gradient Algorithm. Parallel Comput. 40, 224. http://dx.doi.org/10.1016/j.parco.2013.06.001 (2014) (cit. on p. 119).
135. Graham, R. L. & Shipman, G. MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives in Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface (Springer-Verlag, Dublin, Ireland, 2008), 130. http://dx.doi.org/10.1007/978-3-540-87475-1_21 (cit. on p. 119).
136. Li, S., Hoefler, T. & Snir, M. NUMA-aware Shared-memory Collective Communication for MPI in Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing (ACM, New York, New York, USA, 2013), 85. http://doi.acm.org/10.1145/2493123.2462903 (cit. on p. 119).
137. Duato, J., Igual, F. D., Mayo, R., Peña, A. J., Quintana-Ortí, E. S. & Silla, F. An Efficient Implementation of GPU Virtualization in High Performance Clusters in Proceedings of the 2009 International Conference on Parallel Processing (Springer-Verlag, Delft, The Netherlands, 2010), 385. http://dl.acm.org/citation.cfm?id=1884795.1884840 (cit. on p. 120).
138. Reaño, C., Mayo, R., Quintana-Ortí, E. S., Silla, F., Duato, J. & Peña, A. J. Influence of InfiniBand FDR on the performance of remote GPU virtualization in 2013 IEEE International Conference on Cluster Computing (CLUSTER) (2013), 1 (cit. on p. 120).

139. Miyoshi, T., Irie, H., Shima, K., Honda, H., Kondo, M. & Yoshinaga, T. FLAT: A GPU Programming Framework to Provide Embedded MPI in Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (ACM, London, United Kingdom, 2012), 20. http://doi.acm.org/10.1145/2159430.2159433 (cit. on p. 120).
140. Kim, S., Huh, S., Zhang, X., Hu, Y., Wated, A., Witchel, E. & Silberstein, M. GPUnet: Networking Abstractions for GPU Programs in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (USENIX Association, Broomfield, CO, 2014), 201. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/kim (cit. on p. 120).
141. Stuart, J. A. & Owens, J. D. Message Passing on Data-parallel Architectures in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IEEE Computer Society, Washington, DC, USA, 2009), 1. https://doi.org/10.1109/IPDPS.2009.5161065 (cit. on p. 120).
142. Stuart, J. A., Balaji, P. & Owens, J. D. Extending MPI to Accelerators in Proceedings of the 1st Workshop on Architectures and Systems for Big Data (ACM, Galveston Island, Texas, USA, 2011), 19. http://doi.acm.org/10.1145/2377978.2377981 (cit. on p. 120).
143. Oden, L. & Fröning, H. GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters in 2013 IEEE International Conference on Cluster Computing (CLUSTER) (2013), 1 (cit. on p. 120).
144. Daoud, F., Watad, A. & Silberstein, M. GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels in Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers (ACM, Kyoto, Japan, 2016), 6:1. http://doi.acm.org/10.1145/2931088.2931091 (cit. on p. 121).