
Efficient Code Generation for Hardware Accelerators by Refining Partially Specified Implementation

Defended by Ulysse BEAUGNON on 10 June 2019. Prepared at École Normale Supérieure.
École doctorale n° 386: Sciences Mathématiques de Paris Centre. Specialty: Computer Science.

Composition of the jury:
Francesco Zappa Nardelli, Research Director, INRIA (President)
Rastislav Bodik, Professor, University of Washington (Reviewer)
Christophe Dubach, Professor, University of Edinburgh (Reviewer)
Anton Lokhmotov, CEO, Dividiti (Examiner)
Jacques Pienaar, Engineer, Google (Examiner)
Albert Cohen, Researcher, Google & ENS (Advisor)
Marc Pouzet, Professor, École Normale Supérieure (Advisor)

Abstract

Software programmable hardware accelerators, such as Graphics Processing Units (GPUs), are specialized processors designed to perform specific tasks more efficiently than general purpose processors. They trade off generality against specialized data paths and massive parallelism, providing a raw processing power that is orders of magnitude higher than that of contemporary multicore CPUs.

Unfortunately, finding an efficient implementation for a function on a hardware accelerator is a complex problem. It requires making careful decisions for mapping computations to the appropriate levels of parallelism and for making data movements across the different memory spaces explicit, in addition to choosing among the many possible thread-local optimizations. While the set of possible optimizations is usually well known, complex interactions between them make it hard to find a global optimum. Indeed, anticipating downstream transformations and deriving profitability information from intermediate compilation steps is a challenge. Transformations may not commute and some optimization opportunities may only become available after applying so-called “enabling” transformations. Conversely, some transformations may hinder further optimizations. As a result, the production of highly tuned implementations remains a critical challenge to achieve competitive performance.

This dissertation introduces the concept of candidate to formally define, represent and explore spaces of possible implementations. A candidate is a partially specified implementation with some decisions fixed while others are left open. It represents a whole set of possible implementations of the same function. Candidates expose all potential decisions upfront and ensure they are commutative. Taking a decision always restricts the set of possible implementations. This defines a well-behaved optimization space; in particular, it allows search algorithms looking for the best implementations to make the most performance-impacting decisions first and to have a global knowledge of which optimizations may feature in implementations. We provide a framework that automatically generates code to represent and manipulate candidates, from a declarative description of available choices and their interactions. This description is independent of the function to implement. We instantiate our concept of candidate to generate efficient code for linear algebra functions on GPUs. This shows our approach is expressive enough to model interacting decisions with a fundamental impact on the structure of the generated code, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, parallelization, and orchestration of data movements across the memory hierarchy.

We develop a model capable of computing a lower bound for the execution time of any implementation that derives from a candidate. We show that this model provides actionable information, even after taking only a few decisions, and that it enables pruning the implementation space, reducing its size by several orders of magnitude. We propose a simple search algorithm to illustrate our approach. It combines the lower bound performance model and actual evaluation on the hardware with statistical exploration to drive the search towards the most efficient implementations. Our experiments show that it generates code that is competitive with hand-tuned libraries for linear algebra functions on GPUs. They also demonstrate that taking the most important decisions first helps find better implementations faster, thus showing how the concept of candidate empowers search algorithms.

Acknowledgements

Many people helped me along the path leading to this dissertation. I am dedicating the next few paragraphs to thank them.

First and foremost, I am grateful for my PhD advisors, Albert and Marc, who put their expertise to the service of my scientific and human development. In particular, I would like to honor the trust Albert put in me and the energy he gave for my ideas to succeed. And to honor Marc's thoroughness and the life he injects into the Parkas team, be it with skiing, the brioches or the rest.

This dissertation would not have been possible without Jacques, who helped me develop the initial idea and create the first prototype. Thank you for your guidance during that time. I look forward to working with you once again.

The role played by Basile, Nicolas and Andi was equally crucial. They poured their ideas and their technical knowledge into my prototype to turn it into something that actually works. The careful proofreading of this dissertation by Basile and Andy was also essential. Thank you for your patience and your dedication.

I would also like to thank the members of my PhD jury, especially the reviewers, for reading my dissertation and providing constructive feedback.

The years I spent working in the Parkas team are full of happy memories. Thank you Tim, Francesco, Paul, Adrien, Guillaume, Nath, Guillaume, Lélio, Chandan, Adila and all the others. A special mention for Tim and his legendary helpfulness.

I made my first steps into the world of during an internship with Anton. Obviously, I was convinced. Thank you for your kind welcome and your guidance.

Finally, thank you Florian, Yoann, Éric, Patricia for your support along these years and Anaël for all the courage you gave me.

Contents

1 Introduction
    1.1 Hardware Accelerators
    1.2 Challenges of Code Generation for Hardware Accelerators
    1.3 Our Solution: Partially Specified Implementations
    1.4 Key Contributions
    1.5 Organization

2 Motivation: Partially Specified Schedules
    2.1 Problem Statement: Scheduling a Basic Block
    2.2 Background: Constraint Satisfaction Problems
    2.3 Partially Specified Schedule
    2.4 Optimistic Model of Partially Specified Schedules
    2.5 Summary

3 Representing Candidate Implementations
    3.1 Candidate Implementations
    3.2 Decision Space Definition
    3.3 Candidate Representation Description Language
    3.4 Example: Instructions Scheduling Within a Basic Block
    3.5 Constraint Propagation Code Generation
    3.6 Discussion
    3.7 Summary

4 Application: Linear Algebra on GPUs
    4.1 Background: GPUs Architecture
    4.2 Kernel Representation
    4.3 Decisions Encoding
    4.4 Code Generation
    4.5 Discussion
    4.6 Summary

5 An Optimistic Performance Model of Candidates
    5.1 Model Overview
    5.2 Single Thread Model
    5.3 Instantiation of the Model for Kepler GPUs
    5.4 Empirical Evaluation
    5.5 Discussion
    5.6 Summary

6 Scaling the Search with the Properties of Candidates
    6.1 Search Algorithm
    6.2 Sample Kernels
    6.3 Empirical Evaluation
    6.4 Discussion
    6.5 Summary

7 Conclusion
    7.1 Principal Contributions
    7.2 Long Term Perspectives

A Proofs of the Performance Model
    A.1 Notations and Basic Properties
    A.2 Program Order
    A.3 Execution Graph Properties
    A.4 Virtual and Concrete Paths
    A.5 Abstraction of the Execution Graph

Chapter 1

Introduction

Hardware accelerators are specialized processors that are orders of magnitude more efficient than traditional processors for some classes of computation. This dissertation addresses the generation of optimized code for such specialized hardware. However, starting from a high level implementation of a computationally intensive algorithm, finding the right combination of code transformations to fully exploit this computing power is a complex problem. The main limitation of existing approaches is that they make it hard to anticipate downstream transformations and to extract profitability information from early compilation steps.

The question this dissertation addresses is: which combination of implementation decisions leads to the fastest code, given a set of available optimizations? The originality of our approach is that we expose the compilation process as a progressive refinement of a candidate implementation, which represents a whole set of potential implementations. Internally, a candidate implementation (candidate for short) is a partially specified implementation, with some implementation choices fixed while others are left open. It explicitly lists all potential implementation decisions, and makes sure they commute. It thus precisely defines potential downstream transformations, which in turn allows for deriving pertinent information from candidates early in the compilation process.

In this dissertation, we first introduce and formalize the concept of candidate. We then apply it to linear algebra on Graphics Processing Units (GPUs) and show how this concept helps find better implementations faster. We develop a performance model that extracts actionable information from candidates, even when they contain several open choices. We also develop a simple search algorithm to illustrate our claims on the benefits of candidates.

Introduction Outline. Section 1.1 gives some background on hardware accelerators, Sections 1.2 and 1.3 respectively lay out our motivations and the high-level ideas behind our approach, while Sections 1.4 and 1.5 detail our key contributions and the organization of this thesis.

1.1 Hardware Accelerators

Software programmable hardware accelerators, such as GPUs, are specialized processors designed to perform some specific tasks more efficiently than what is possible with general purpose processors. They trade generality against specialized data paths and massive parallelism, providing a raw processing power that is orders of magnitude higher than the performance of contemporary multicore CPUs. In some domains, adequately taking advantage of this power can be game-changing, allowing computations that would otherwise be too time or energy consuming. For instance, the recent success of deep learning methods can largely be attributed to the performance of GPU-accelerated libraries [1, 2].

Hardware accelerators perform small computation-intensive functions, called kernels. Unfortunately, writing or generating efficient code implementing a kernel is a complex problem. Hardware accelerators usually offer multiple levels of parallelism, with different trade-offs between the number of threads, the cost of communication and synchronization and the coherency of memory accesses. They also often feature domain-specific instructions and a complex memory hierarchy, with multiple caches and scratchpad memories of different sizes and properties.

Finding an efficient implementation requires making careful decisions for mapping computations to the appropriate levels of parallelism, for explicitly orchestrating data movements across the different memory spaces and for exploiting domain-specific instructions, in addition to choosing among the many possible thread-local optimizations. While the set of potential optimizations is usually well known, complex interactions between them make it hard to find the best global implementation. Applying one transformation can alter the low-level structure of the code and can change the pressure on constrained resources (e.g. the memory available or the number of threads that can execute in parallel), which in turn can render other decisions invalid or inefficient. These interactions are critical for performance; for instance, parallelizing too many loops is harmful if the threads end up doing too few computations.

Moreover, the best combination of transformations depends on the kernel to implement, the size of the input data and the targeted accelerator. Different computations and input sizes may alter the trade-offs between bottlenecks. Implementations may not even be portable across accelerators implementing the same hardware abstraction. For example, different families of GPUs have memory spaces of different sizes and properties and the code for one GPU may not run on another [3]. It is thus difficult to reuse implementations or heuristics across kernels, input sizes and targets.

Writing optimized implementations by hand is infeasible when considering more than a few kernels and a few input sizes. However, automatically generating efficient implementations is also not trivial, as the following section explains.

1.2 Challenges of Code Generation for Hardware Accelerators

The problem of picking a good combination of implementation decisions in an interaction-heavy optimization space is not specific to hardware accelerators and occurs on any architecture complex enough that exhaustive enumeration is impossible. The literature already proposes multiple approaches to represent and explore the space of potential implementations, such as SPIRAL [4] for DSP codes, LGen [5] for basic linear algebra, or LIFT [6] for more general computations expressed using combinators. These approaches all rely on rewrite rules, transforming and incrementally lowering an intermediate representation of the program. They start from an unoptimized implementation and iteratively replace expression patterns with purportedly more efficient ones, or code closer to the target instruction set architecture (ISA) and model of parallelism. A well-defined set of rewrite rules allows for the construction of a search tree, whose nodes are implementations and edges represent the application of rewrite rules. The compiler, or an expert programmer, can then navigate through the tree looking for the best implementation using domain-specific knowledge, hand-tuned heuristics or statistical exploration.

Exhaustive evaluation of all possible implementations is usually infeasible. The implementation space grows exponentially with the number of decisions. Even small functions can have billions of possible implementations and just enumerating them, without evaluation, can already be too costly for most applications.

We argue that existing approaches make it hard to anticipate possible downstream transformations and to derive profitability information from intermediate compilation steps. While the list of directly applicable transformations at intermediate steps is usually well known, further transformations may only become available after applying so-called “enabling” transformations. For example, the contraction and promotion of an array into registers may only happen after fusing the producer and consumer loops of the array. Conversely, some transformations may hinder further optimizations. For example, interchanging the nesting order of two loops may forbid loop fusion. Moreover, transformations may not commute.

As a result, the production and the specialization to a given target or problem size of highly tuned kernel implementations remains a critical challenge. Consequently, programmers often have to rely on proprietary libraries provided by hardware vendors to achieve competitive performance. Libraries are only available for a limited set of functions and are usually only optimized for a few problem sizes.

1.3 Our Solution: Partially Specified Implementations

This dissertation addresses the problem of anticipating downstream transformations and deriving profitability information from intermediate compilation steps. It exposes the compilation process of a kernel as the progressive refinement of a candidate. A candidate is a partially specified implementation where

some implementation choices are fixed, while others are left open. It represents a whole set of implementations of the same kernel. For example, a candidate may expose loop transformation opportunities, data mapping alternatives and memory transfer optimizations. This representation is compact and efficient to manipulate as it does not individually list each possible implementation. The compilation process starts with all choices open and iteratively specifies them until reaching a fully specified implementation.

Driving the Search Towards Efficient Implementations. The key benefit of candidates compared to existing approaches is that they expose all potential decisions upfront and ensure that they are commutative. Making a decision always restricts the set of possible implementations. This defines a well-behaved optimization space; in particular, it allows search algorithms looking for the best implementations to make the most performance-impacting decisions first and expert programmers to manually force any decision. It also precisely defines which transformations might be present in the final implementation from intermediate compilation steps, allowing for deriving profitability information before fixing all choices. Our approach is not tied to a particular framework to express the objective function and to drive the search. It enables solutions combining actual evaluations on the hardware, performance models, heuristics and statistical approaches. This is critical as solvers cannot encode the full complexity of the hardware.

Candidate Representation. Not all combinations of decisions are valid. Candidates define a Constraint Satisfaction Problem (CSP) to model interactions between decisions, respect the semantics of the kernel and enforce hardware limitations. Since manually writing code to enforce constraints is a hard and error-prone process, we propose a Domain Specific Language (DSL) to describe implementation choices and express constraints between them. A compiler then automatically generates code to represent and manipulate candidates from the DSL. The DSL is generic, as it is not tied to a particular program representation. It enables us to iterate on different implementation space representations faster and makes it easier for external users to use our approach for different domains of application or hardware targets.

1.4 Key Contributions

This dissertation introduces the concept of candidate and studies how it enables search algorithms to find better implementations faster. We illustrate our ideas by applying them to implement linear algebra kernels on GPUs. This last point is a contribution in itself, as it provides a new way to generate efficient code, competitive with or outperforming hand-tuned libraries.

Candidate Implementations. We propose an approach to formally define, represent and explore spaces of possible implementations using implementation candidates. We model the interactions between decisions as a constraint satisfaction problem. Candidates explicitly list open implementation choices and make sure they are commutative. This defines a well-behaved optimization space; in particular, the search algorithms can be organized to take the most performance-impacting decisions first. We provide a compiler generator that automatically generates code to represent and manipulate candidates, starting from a high-level declarative description.

Implementation Space Exploration. We study how the candidate approach empowers implementation space exploration techniques. We show that they allow precise information to be derived from sub-spaces of implementations, even before fixing all implementation choices. In particular, we develop a performance model that computes a lower bound of the execution time of a candidate. Formally, if c is a candidate, ⟦c⟧ is the set of actual implementations that derive from c, B(c) is the lower bound provided by the model and T(x) is the execution time of x on the target accelerator, then

∀x ∈ ⟦c⟧. B(c) ≤ T(x).    (1.1)

We combine this model with actual runs of implementations on the hardware in a branch-and-bound algorithm. This algorithm prunes entire regions of the implementation space at once without discarding the best implementation.

In order to illustrate our approach and show how it performs in practice, we propose a simple algorithm that searches for good implementations of a kernel. This algorithm combines the lower bound performance model and actual evaluation on the hardware with statistical exploration to drive the search towards the most efficient implementations. We show that both the commutativity of decisions and the ability to derive information from partially specified candidates are critical for guiding the exploration towards efficient implementations.

Candidates for Linear Algebra on GPUs. We instantiate our approach to optimize linear algebra kernels on GPUs. We provide a representation of candidates that exposes the optimization decisions required to generate efficient code for this application domain and target hardware. We then provide and evaluate a lower bound performance model for this representation of candidates. This demonstrates that our approach is sufficiently expressive to represent complex decisions, with a fundamental impact on the structure of the generated code. This includes compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestration of data movements across the memory hierarchy. We validate this work on the generation of optimized linear algebra kernels. We obtain code that is competitive with the performance of vendor-provided hand-tuned libraries, and even outperforms them in some cases.

1.5 Organization

The rest of this thesis comprises a motivating chapter that presents the ideas on a simple example, followed by four chapters that formalize our approach and apply it to GPUs. The last chapter presents future research directions and concludes the dissertation.

Chapter 2: Motivation: Partially Specified Schedules. The motivation chapter showcases our ideas on a simple problem: the scheduling of instructions within a basic block. It presents concrete questions and solutions that motivate the following chapters. In particular, it states the problem we solve, explains how we can encode partially specified schedules and shows how we can use a performance model to prune whole classes of solutions at once when looking for the best implementation.

Chapter 3: Representing Candidate Implementations. This chapter formalizes the concept of candidate. It represents the set of potential implementations as the solutions of a constraint satisfaction problem. It then presents a high-level declarative domain-specific language to define the available implementation decisions and their interactions. Finally, it explains how to automatically generate code to represent and manipulate candidate implementations from the declarative description.

Chapter 4: Application: Linear Algebra on GPUs. This chapter instantiates the ideas of Chapter 3. It proposes an encoding of possible implementations for linear algebra kernels on GPUs. This encoding supports both thread-local and global implementation decisions, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestrating data movements across the memory hierarchy.

Chapter 5: An Optimistic Performance Model of Candidates. This chapter presents an analytical model of a lower bound for the execution time of a candidate. It then instantiates the performance model for GPUs. The model heavily relies on the nature of candidates to provide exact information early in the compilation process.

Chapter 6: Scaling the Search with the Properties of Candidates. This chapter puts together the ideas developed in Chapters 3, 4 and 5 to efficiently explore the implementation space. The goal is to discover how to specify the open choices of a candidate in order to maximize performance. We define a hybrid approach that combines the lower bound performance model with a Monte Carlo Tree Search algorithm driving the search towards the best decisions.

Chapter 7: Conclusion. Finally, the last chapter concludes this thesis with a summary of our work and presents our research perspectives to improve the expressiveness of our implementation spaces and develop new search algorithms.

We have implemented the ideas developed in this thesis in a framework named Telamon. This framework, written in Rust, is available under the Apache 2.0 License at the following address.

https://github.com/ulysseb/telamon.

Chapter 2

Motivation: Partially Specified Schedules

The goal of this dissertation is to find combinations of implementation decisions for a kernel leading to efficient code. In the general case, this involves a wide variety of decisions, interacting with each other. This chapter introduces, on a simpler problem, the questions and the ideas developed in subsequent chapters. Its goal is to hint at the solutions and the formalism we later present, to help the reader understand the general direction of the dissertation.

This chapter studies the problem of scheduling instructions within a basic block, that is, a block of straight-line code with a single entry point and a single exit point. To simplify things further, we also assume a simplistic hardware target evaluating instructions in order and on which we can easily evaluate a schedule. Even if it only serves as an example here, the problem of scheduling basic blocks is an actual problem encountered by compilers. The number of potential schedules grows exponentially with the number of instructions, making exhaustive exploration of possible schedules impossible.

We propose to reason about whole classes of schedules in order to tame the size of the search space. For this, we start from a partial order that we iteratively constrain until reaching a total order. Each intermediate step is a partially specified schedule from which we can extract information to drive the search towards the best solutions.

The idea to manipulate partially specified schedules matches the concept of candidate implementations, which is central in this dissertation. The treatment of the scheduling problem follows the ideas that we develop in the subsequent chapters of this thesis. In particular, it formulates the scheduling problem as a Constraint Satisfaction Problem (CSP). This formalism is crucial to represent partially specified solutions for more complex problems. It also builds a performance model that gives a lower bound for the execution time of any implementation derived from a partially specified schedule. This model suggests that partially specified implementations allow for the development of global heuristics that extract pertinent information from intermediate compilation steps.

This chapter is in the scope of superoptimization. This technique, pioneered by H. Massalin [7], optimizes a computation by trying every potential implementation in a class of programs. In our case, we consider every permutation of the initial instructions. Superoptimization typically relies on a combination of empirical tests and a theorem prover to ensure generated implementations are correct a posteriori [8]. Instead, we rely on constraint propagation to avoid considering any invalid implementation in the first place. This is crucial to later consider bigger and more complex implementation spaces, outside of the scope of superoptimization, without spending too much time finding valid implementations.

Outline. Section 2.1 gives a formal definition of the scheduling problem and discusses the difficulties encountered to find the best schedule. Section 2.2 presents some background on Constraint Satisfaction Problems. Section 2.3 leverages this background to reformulate the scheduling problem. It shows how we represent and manipulate partially specified schedules to search for a good solution. Then, Section 2.4 defines a performance model that extracts accurate information for partially specified schedules and explains how we use it to safely prune the schedule space. Finally, Section 2.5 summarizes the ideas of this chapter and explains how we develop them in the rest of this dissertation.

Note that this first chapter only serves to illustrate the problems we propose to solve and the solutions we provide in subsequent chapters. It does not claim to provide a better way to schedule instructions within a basic block, nor does it provide any actual evaluation, as it only considers an abstract, idealized hardware target. Subsequent chapters develop these ideas on actual hardware with more complex implementation spaces.

2.1 Problem Statement: Scheduling a Basic Block

The principal question this dissertation answers is: which combination of implementation decisions leads to the fastest code, given the set of available optimizations? Here, we consider a simpler problem, where the kernel is composed of a straight-line sequence of instructions, without any loops or parallelism. We assume that the only way to act on performance is to reorder instructions. This corresponds to the problem of scheduling instructions within a basic block. We also assume a simplistic hardware target, which executes instructions in parallel but issues them in order, at a maximum rate of one instruction per cycle.

2.1.1 Problem Formalization

Formally, let I be a set of instructions to schedule and ≺ be a partial order on I that indicates data dependencies. Let l(a) be the latency of each instruction a ∈ I. Any total order < that respects ≺ is a valid schedule. From a schedule <, we define the time t(a) ∈ N when an instruction a ∈ I starts executing as the smallest integer such that t respects the following constraints.

• Instructions are issued in order.

∀a, b ∈ I. a < b =⇒ t(a) ≤ t(b)    (2.1)

• Only one instruction can be issued per cycle.

∀a, b ∈ I. a ≠ b =⇒ t(a) ≠ t(b)    (2.2)

• Instructions wait on their data dependencies.

∀a, b ∈ I. a ≺ b =⇒ t(b) ≥ t(a) + l(a)    (2.3)

Finally, we define the total execution time of the schedule T_< as

T_< = max_{a ∈ I} (t(a) + l(a)).    (2.4)

For a schedule, T_< corresponds to the time when the last instruction of the schedule has finished executing. The goal is to find a valid schedule < that minimizes T_<.
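To make this formalization concrete, the following sketch shows how a given schedule could be evaluated on the idealized machine described above, directly following equations (2.1)-(2.4). It is illustrative code of ours and not part of Telamon; the names (execution_time, latency, deps, schedule) are hypothetical and instructions are identified by indices.

    // Illustrative sketch: evaluating a valid schedule on the idealized in-order machine.
    // `latency[i]` is l(i), `deps[i]` lists the predecessors of instruction i (a ≺ i),
    // and `schedule` is a permutation of 0..latency.len() respecting the dependencies.
    fn execution_time(latency: &[u64], deps: &[Vec<usize>], schedule: &[usize]) -> u64 {
        let mut t = vec![0u64; latency.len()]; // issue cycle of each instruction
        let mut last_issue: Option<u64> = None;
        for &i in schedule {
            // In-order issue, at most one instruction per cycle: (2.1) and (2.2).
            let mut cycle = last_issue.map_or(0, |c| c + 1);
            // Wait for data dependencies: (2.3).
            for &p in &deps[i] {
                cycle = cycle.max(t[p] + latency[p]);
            }
            t[i] = cycle;
            last_issue = Some(cycle);
        }
        // Total execution time (2.4): completion time of the last finishing instruction.
        schedule.iter().map(|&i| t[i] + latency[i]).max().unwrap_or(0)
    }

Enumerating every valid schedule and evaluating it with such a function is exactly the exhaustive approach that the next subsection argues is intractable.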

2.1.2 Finding the Best Schedule

The simplest approach to finding an optimal schedule is to evaluate them all. However, the number of valid schedules grows exponentially with the number of instructions. The time required to enumerate valid schedules quickly makes exhaustive enumeration intractable, even if evaluating a schedule were instantaneous. The alternatives are either to use hand-tuned heuristics to build a single schedule or to use statistical algorithms that drive the exploration of the schedule space toward the best solutions while only evaluating a fraction of valid schedules. Hand-tuned heuristics offer acceptable performance, but are unable to adapt to the subtleties of each input program. Statistical approaches are more versatile as they learn from trial and error, but still suffer from the exponential nature of the problem: the expected number of evaluations required to find a near-optimal solution grows with the size of the search space.

To generate a good schedule, we propose to start from the partial order representing data dependencies and iteratively constrain it until reaching a total order. Each intermediate step is a partially specified schedule, from which we can extract information to drive the search towards the best solutions. A partial order represents all the total orders that derive from it. The search algorithm can thus reason about whole parts of the schedule space without enumerating them. We now introduce the concept of CSP that we later use to model partially specified schedules.

16 2.2 Background: Constraint Satisfaction Problems

A Constraint Satisfaction Problem (CSP) models a system as a finite set of variables that take values in a finite universe. Constraints between variables encode their relationships. A solution of the problem is an assignment of the variables that satisfies the constraints. CSPs are widely used to model complex combinatorial problems, for example for scheduling, planning, vehicle routing, network management and bioinformatics [9]. Finding a solution to a CSP is an NP-hard problem. In particular, the Boolean satisfiability problem (SAT) translates into a CSP. Constraint Satisfaction Systems combine heuristics and combinatorial search to find solutions to CSPs. In our specific case, finding a solution to the problem is trivial. The CSP formalism allows us to easily reason about and manipulate partial orders and, later, fully fledged partially specified implementations.

This section follows the notations and definitions presented in the thesis of G. Tack [10]. We first give a denotational semantics of CSPs. We then explain how to enforce constraints and how to search for a solution of a CSP.

2.2.1 Variables, Domains and Constraints

CSPs are defined with respect to a finite set of variables X and a finite set of values V. A solution of a CSP assigns a value to each variable. A constraint restricts the set of valid assignments. The following definition captures the concepts of assignments and constraints.

Definition 2.1 (Assignment and Constraint). An assignment a : X → V is a mapping from variables to values. A constraint is a set of assignments c ∈ P(X → V). Any assignment a ∈ c is a solution of c.

In practice, constraints are usually expressed as logic formulae that only involve a few variables. For a : X → V, x ∈ X and v ∈ V, we denote by a[v/x] the assignment a′ where a′(x) = v and a′(y) = a(y) for all y ≠ x. We define the significant variables of a constraint c, vars(c), as the variables that have an impact on whether an assignment is in c.

vars(c) := {x ∈ X | ∃v ∈ V. ∃a ∈ c. a[v/x] ∉ c}    (2.5)

Next, we define domains that indicate the set of values each variable can take.

Definition 2.2 (Domain). A domain is a function d : X → P(V) that maps variables to sets of values. The set of values d(x) for a particular variable x ∈ X is the variable domain of x. A domain represents a set of assignments (i.e. a constraint) defined as

con(d) := {a ∈ X → V | ∀x ∈ X. a(x) ∈ d(x)}.    (2.6)

We say that an assignment a ∈ con(d) is licensed by d. We call Dom := X → P (V ) the set of all domains.

While domains each represent a constraint con(d), the converse is not true. Domains are a Cartesian representation that is limited to conjunctions of unary constraints, that is, constraints with a single significant variable. We define a subset order on domains as the subset order on the induced constraints:

∀d1, d2 ∈ Dom. d1 ⊂ d2 ⇐⇒ con(d1) ⊂ con(d2).    (2.7)

When any variable domain d(x) is empty, con(d) = ∅. In that case, we say the domain is failed. We say that a domain corresponding to a single assignment or a variable domain with a single value is fixed. A CSP is composed of a domain that restricts the initial values variables can take and of constraints that specify the relations between variables.

Definition 2.3 (Constraint Satisfaction Problem). A Constraint Satisfaction Problem is a pair ⟨d, C⟩ of a domain d ∈ X → P(V) and a set of constraints C ⊆ P(X → V). A solution of ⟨d, C⟩ is an assignment licensed by d that respects all the constraints in C.

sol(⟨d, C⟩) := {a ∈ con(d) | ∀c ∈ C. a ∈ c}    (2.8)
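As a small illustration of Definition 2.3 and equation (2.8), the following sketch checks whether an assignment is a solution. It is an illustrative encoding of ours: for brevity, a constraint is represented by its characteristic function over assignments rather than extensionally as a set of assignments.

    use std::collections::{HashMap, HashSet};

    // Illustrative encoding: variables and values are plain identifiers, a domain maps
    // each variable to its set of allowed values, and a constraint is represented by
    // its characteristic function over assignments.
    type Var = &'static str;
    type Val = i32;
    type Assignment = HashMap<Var, Val>;
    type Constraint = fn(&Assignment) -> bool;

    /// Equation (2.8): `a` is a solution of <d, C> when it is licensed by `d`
    /// (a(x) ∈ d(x) for every variable) and satisfies every constraint of `C`.
    fn is_solution(d: &HashMap<Var, HashSet<Val>>, cs: &[Constraint], a: &Assignment) -> bool {
        let licensed = d.iter().all(|(x, dom)| a.get(x).map_or(false, |v| dom.contains(v)));
        licensed && cs.iter().all(|c| c(a))
    }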

2.2.2 Constraint Propagation and Search

The basis of a constraint satisfaction system is a search procedure that instantiates variables with values from their domain. Enforcing constraints directly from their extensional definition is infeasible, as it would require enumerating all the assignments licensed by the domain. Instead, constraint satisfaction systems employ propagators that decide if an assignment is valid and prune the domain to rule out invalid assignments.

Definition 2.4 (Propagator). A propagator is a function p : Dom → Dom such that:

• p is contracting: p(d) ⊆ d for any domain d ∈ Dom, and

• p is sound: for any domain d ∈ Dom and assignment a ∈ con(d), p({a}) ⊆ p(d).

We say that a propagator p : Dom → Dom prunes a domain d ∈ Dom if p(d) is strictly included in d, that is p(d) ⊂ d. The propagator p induces a constraint c_p defined by the set of assignments not pruned by p:

c_p := {a ∈ X → V | p({a}) = {a}}    (2.9)

Soundness is a weak form of monotonicity. It expresses that the pruning decisions realized by a propagator are consistent. A propagator never prunes assignments that satisfy its induced constraint. Propagators are the operational equivalent of constraints. Each propagator induces a single constraint, that is exactly realized by its pruning procedure. We now define propagation problems to realize CSPs using propagators.

Definition 2.5 (Constraint Propagation Problem). A propagation problem is a pair ⟨d, P⟩ of a domain d ∈ Dom and a set of propagators P ⊆ Dom → Dom. The induced constraint satisfaction problem of a propagation problem ⟨d, P⟩ is the CSP ⟨d, {c_p | p ∈ P}⟩. The solutions of a propagation problem are the solutions of the induced CSP:

sol(⟨d, P⟩) := sol(⟨d, {c_p | p ∈ P}⟩)    (2.10)

The solutions of a propagation problem ⟨d, P⟩ are the assignments licensed by d that are not pruned by any propagator. This is obtained by just applying the definition of the induced constraints to the definition of the solution of a propagation problem.

sol(⟨d, P⟩) = {a ∈ con(d) | ∀p ∈ P. p({a}) = {a}}    (2.11)

In practice, the constraint satisfaction system automatically generates the propagators from the constraint definitions. The search procedure for a propagation problem ⟨d, P⟩ works as follows. It starts from the initial domain and instantiates the value of a variable by restricting its domain to a single element. It then prunes the domain with propagators until reaching a fixed point. The resulting domain falls into three categories (a code sketch of the whole procedure follows the list):

• If each variable domain contains a single value, then the search is finished and the domain represents a single valid assignment.

• If any variable domain is empty, then the domain is failed and the search procedure backtracks.

• Otherwise, some variable domain still contains multiple values. In that case, the search procedure calls itself recursively on the pruned domain.
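The sketch below summarizes this propagate-and-branch procedure. It is illustrative only and not tied to any particular constraint solver or to Telamon: domains are maps from variables to sets of remaining values, propagators are functions on domains that report failure, and the names are hypothetical.

    use std::collections::{BTreeMap, BTreeSet};

    // Illustrative sketch of the propagate-and-branch search procedure of Section 2.2.2.
    type Var = usize;
    type Val = u32;
    type Domain = BTreeMap<Var, BTreeSet<Val>>;
    // A propagator prunes the domain; it returns false when some variable domain
    // becomes empty, i.e. when the domain is failed.
    type Propagator = fn(&mut Domain) -> bool;

    /// Runs all propagators until a fixed point is reached.
    /// Returns false if the resulting domain is failed.
    fn propagate(d: &mut Domain, props: &[Propagator]) -> bool {
        loop {
            let before = d.clone(); // naive fixed-point detection, enough for a sketch
            for p in props {
                if !p(d) {
                    return false;
                }
            }
            if *d == before {
                return true;
            }
        }
    }

    /// Enumerates all solutions by instantiating one variable at a time.
    fn search(mut d: Domain, props: &[Propagator], solutions: &mut Vec<Domain>) {
        if !propagate(&mut d, props) {
            return; // failed domain: backtrack
        }
        // Pick a variable whose domain still contains several values.
        let open = d.iter()
            .find(|(_, values)| values.len() > 1)
            .map(|(var, values)| (*var, values.clone()));
        match open {
            None => solutions.push(d), // every variable is fixed: a solution
            Some((var, values)) => {
                for value in values {
                    let mut child = d.clone(); // branch on each remaining value
                    child.insert(var, BTreeSet::from([value]));
                    search(child, props, solutions);
                }
            }
        }
    }

Calling search on the initial domain enumerates every solution; the schedule search developed in the rest of this chapter only differs in that it explores this tree selectively rather than exhaustively.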

Running all propagators at each step of the search procedure would be too costly. Therefore, constraint satisfaction systems try to avoid running a propagator when they know it will not prune the domain. In particular, they only run a propagator if the domain of any significant variable of the induced constraint has changed.

Two propagators inducing the same constraint may prune domains differently. Definition 2.4 only requires that they enforce the induced constraints on fixed domains that represent a single assignment. Propagators may keep invalid values in other domains as long as they are sound. An optimal propagator

p_opt inducing a constraint c only keeps the values of a domain d that appear in at least one solution of ⟨d, {c}⟩:

∀x ∈ X. p_opt(d)(x) = {v ∈ d(x) | ∃a ∈ c ∩ con(d). a(x) = v}    (2.12)

Using optimal propagators is impractical in the general case. They may require enumerating all possible assignments. Moreover, the fixed point of two optimal propagators inducing constraints c1 and c2 might be weaker than the optimal propagator for c1 ∩ c2. Considering constraints independently instead of as a single conjunction alters propagation strength but makes it possible to generate faster propagators and reduces the number of significant variables. Constraint satisfaction systems must thus consider the trade-off between strong and fast propagators. Since propagators are exact on fixed domains, using weaker propagators does not alter correctness. Backtracking handles the cases that constraint propagation misses.

We now go back to the problem of scheduling a basic block and leverage the CSP formalism we just presented to model partially specified schedules.

2.3 Partially Specified Schedule

One way to specify a valid schedule is to start from the partial order representing data dependencies and to iteratively refine it by specifying the order between pairs of instructions, until the order is total. With this approach, a partial order is a partially specified schedule that represents all the total orders that derive from it. We model this problem as a CSP whose variables are the pairwise orderings between instructions.

2.3.1 Formalization as a Constraint Satisfaction Problem

Formally, a valid schedule is an assignment of a function order that maps pairs of distinct instructions to the values before and after,

order : I × I → {before, after}    (2.13)

and that respects the following constraints.

• The order enforces data dependencies.

∀a, b ∈ I. a ≺ b =⇒ order (a, b) = before (2.14)

• The order is antisymmetric.

∀a, b ∈ I. order (a, b) = before ⇐⇒ order (b, a) = after (2.15)

• The order is transitive.

∀a, b, c ∈ I. order (a, b) = before ∧ order (b, c) = before =⇒ order (a, c) = before (2.16)

This specifies a constraint satisfaction problem, with order being an uninterpreted function that defines the CSP variables.

X := {order(a, b) | a, b ∈ I}    V := {before, after}    (2.17)

Formulae (2.15) and (2.16) respectively translate into one constraint for each pair and each triplet of instructions, while (2.14) defines the initial domain d.

d : X → P({before, after})
order(a, b) ↦ {before} if a ≺ b,  {after} if b ≺ a,  {before, after} otherwise    (2.18)

Following the search procedure sketched in Section 2.2.2, the search for a schedule starts from the initial domain and iteratively specifies the value of the order between pairs of instructions. In this specific case, we can generate propagators that are strong enough to prevent backtracking. After propagation, order represents a partial order.
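Written by hand, the initial domain of equation (2.18) and propagators for antisymmetry (2.15) and transitivity (2.16) could look like the sketch below. It is an illustrative encoding of ours, not generated code (Chapter 3 shows how such code is produced automatically from a declarative description); instructions are indices and the dependencies are assumed to form an acyclic partial order, so propagation never produces a failed domain.

    use std::collections::HashMap;

    // Illustrative sketch of the pairwise-ordering CSP of Section 2.3.1.
    // The domain of order(a, b) is a subset of {before, after}: `Open` stands for the
    // full set and `Before`/`After` for the two singletons.
    #[derive(Clone, Copy, PartialEq, Debug)]
    enum OrderDom { Before, After, Open }

    type Domain = HashMap<(usize, usize), OrderDom>;

    /// Initial domain of equation (2.18) for `n` instructions, where `deps` lists the
    /// data dependencies a ≺ b, with propagation already applied.
    fn initial_domain(n: usize, deps: &[(usize, usize)]) -> Domain {
        let mut d: Domain = (0..n)
            .flat_map(|a| (0..n).filter(move |&b| b != a).map(move |b| ((a, b), OrderDom::Open)))
            .collect();
        for &(a, b) in deps {
            d.insert((a, b), OrderDom::Before);
            d.insert((b, a), OrderDom::After);
        }
        propagate(n, &mut d);
        d
    }

    /// Propagates antisymmetry (2.15) and transitivity (2.16) until a fixed point.
    fn propagate(n: usize, d: &mut Domain) {
        let mut changed = true;
        while changed {
            changed = false;
            for a in 0..n {
                for b in (0..n).filter(|&b| b != a) {
                    if d[&(a, b)] != OrderDom::Before {
                        continue; // only propagate from decided "a before b" facts
                    }
                    // Antisymmetry: order(b, a) must be `after`.
                    changed |= restrict(d, (b, a), OrderDom::After);
                    // Transitivity: a before b and b before c implies a before c.
                    for c in (0..n).filter(|&c| c != a && c != b) {
                        if d[&(b, c)] == OrderDom::Before {
                            changed |= restrict(d, (a, c), OrderDom::Before);
                        }
                    }
                }
            }
        }
    }

    /// Restricts a variable domain to a single value; returns true if it changed.
    fn restrict(d: &mut Domain, var: (usize, usize), val: OrderDom) -> bool {
        let old = d.insert(var, val);
        debug_assert!(old == Some(OrderDom::Open) || old == Some(val), "failed domain");
        old != Some(val)
    }

For instance, with three instructions and a single dependency a ≺ b, initial_domain(3, &[(0, 1)]) fixes order(a, b) and order(b, a) and leaves the four variables involving the third instruction open.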

2.3.2 Schedule Search Tree

The traditional goal of CSPs is to find solutions to a problem. In our case, finding a solution is trivial: a topological sort of the instructions following data dependencies would suffice. Instead, we are looking for the best solution we can obtain in a limited amount of time. Ideally, we would generate all possible schedules and return the best one. For this, we would build a binary tree where each node is a partial order and the children of a node are the possible orderings between a pair of unordered instructions. This tree corresponds to the possible instantiations of the pairwise orderings by the CSP search procedure, with variables instantiated in a fixed order.

[Figure 2.1: A Schedule Search Tree for Basic Blocks With Three Instructions]

To build the tree, we start from the root representing the full domain. The two children of a node are the domains obtained by respectively assigning before and after to the next unspecified variable and propagating constraints. The leaves of the tree are fully specified schedules for which we can compute an execution time, while the nodes of the tree represent the set of schedules in their sub-tree. Figure 2.1 presents a schedule search tree for a basic block with three instructions (a, b and c) and no data dependencies. Each level of the tree specifies the order between two instructions, with two possible outcomes. Building the entire tree would, of course, be infeasible when the number of instructions grows. Thus, we only build the parts of the tree in which we expect to find the best implementations. The idea is that we can reason on partially specified schedules to detect uninteresting nodes early in the search tree, thus avoiding building large parts of the tree.

2.4 Optimistic Model of Partially Specified Schedules

We now present a performance model of partially specified schedules to prune the search tree. The performance model gives a lower bound B(d) on the execution time of any schedule licensed by the domain d ∈ Dom. If S is the set of all valid schedules, then:

∀s ∈ S. s ∈ con(d) =⇒ T_s ≥ B(d)    (2.19)

From the point of view of the search tree, the bound obtained for an internal node is a lower bound for the execution time of all the schedules in the subtree rooted in the node. The performance model relies on a graph that represents timestamp constraints between instructions. Given a domain d, the vertices are the set of instructions I and the edges between any two instructions a, b ∈ I are defined as follows.

• If a is scheduled before b, that is d(order(a, b)) = {before}, then E contains an edge from a to b with weight 1. This corresponds to the constraint imposed by equation (2.1).

• If b depends on a (a ≺ b), then E contains an edge from a to b with weight l(a). This corresponds to the constraint imposed by equation (2.3).

We compute B(d) as the weight of the longest path in the graph plus the latency of the last instruction on that path. This corresponds to equation (2.4). Constraints (2.15-2.16) ensure there are no cycles in the graph, so the longest path weight is well-defined. When the domain is fixed, the bound is equal to the execution time. Otherwise, the graph only contains edges that appear in the graph of every schedule licensed by the domain, and the longest path in the graph is a lower bound of possible execution times.

This bound allows us to prune the search tree using a branch and bound strategy. The search algorithm discards branches for which the bound B is bigger than the current best candidate. Because this model works on partially specified implementations, we can prune whole parts of the schedule space at once. We also ensure that we never prune any optimal schedules.
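The following sketch shows one way such a bound could be computed; it is illustrative code of ours, not the model evaluated later in this thesis. It builds the edge weights described above and computes the longest path with a memoized depth-first traversal, relying on the acyclicity guaranteed by constraints (2.15)-(2.16).

    // Illustrative sketch of the lower bound B(d) of Section 2.4.
    // `before[a][b]` is true when d(order(a, b)) = {before}, `dep[a][b]` is true when
    // a ≺ b, and `latency[a]` is l(a).
    fn lower_bound(latency: &[u64], before: &[Vec<bool>], dep: &[Vec<bool>]) -> u64 {
        let n = latency.len();
        // Edge weights: l(a) for a dependence edge (2.3), 1 for an issue-order edge (2.1);
        // when both edges exist, only the heavier one can lengthen a path.
        let weight = |a: usize, b: usize| -> Option<u64> {
            let w_dep = if dep[a][b] { Some(latency[a]) } else { None };
            let w_issue = if before[a][b] { Some(1) } else { None };
            w_dep.max(w_issue)
        };
        // Longest path ending at each vertex, by memoized depth-first search.
        // Constraints (2.15)-(2.16) guarantee the graph is acyclic, so this terminates.
        fn longest(v: usize, n: usize, weight: &dyn Fn(usize, usize) -> Option<u64>,
                   memo: &mut Vec<Option<u64>>) -> u64 {
            if let Some(known) = memo[v] {
                return known;
            }
            let mut best = 0;
            for u in 0..n {
                if let Some(w) = weight(u, v) {
                    best = best.max(longest(u, n, weight, memo) + w);
                }
            }
            memo[v] = Some(best);
            best
        }
        let mut memo = vec![None; n];
        // B(d): longest path ending at v, plus the latency of v itself (equation (2.4)).
        (0..n).map(|v| longest(v, n, &weight, &mut memo) + latency[v]).max().unwrap_or(0)
    }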

2.5 Summary

This chapter presented a solution to the problem of scheduling instructions within a basic block. For this, it first formalized the problem and gave some background on CSPs. It then reformulated the problem of scheduling instructions as a CSP. Intermediate steps of the CSP resolution represent partial orders, while solutions of the CSP represent fully specified implementations of the basic block. In practice, scheduling instructions in straight-line code is only one of the many aspects of choosing an implementation for a kernel.

Chapter 3 builds on the idea of representing sets of potential implementations using the CSP formalism. It presents a generic framework that defines a CSP for each kernel from a declarative description of implementation choices and constraints. Chapter 4 then applies this framework to represent implementation spaces of linear algebra kernels on GPUs, supporting complex transformations. It shows that while the formalism of this chapter may seem overly complex, it is critical when considering several classes of optimizations, interacting with each other.

This chapter also proposed to build a performance model to prune parts of the schedule space. Chapter 5 extends this idea to compute a lower bound for the execution time of any implementation that derives from a partially specified implementation. We show that such models can extract pertinent information early in the compilation process, greatly improving the performance of the search for efficient implementations.

Chapter 3

Representing Candidate Implementations

This chapter introduces and formalizes the concept of candidates, which are partially specified implementations with some implementation decisions left open while others are fixed. A candidate thus represents a whole set of potential implementations of the same kernel, up to semantics-preserving implementation decisions. The compilation process starts with all choices open and iteratively specifies them until reaching a fully specified implementation.

Not all combinations of decisions are valid. Candidates define a Constraint Satisfaction Problem (CSP) to model interactions between decisions, respect the semantics of the kernel and enforce hardware limitations. Because manually writing the code to enforce constraints is a hard and error-prone process, we propose a domain specific language (DSL) to describe implementation choices and express constraints between them. This defines a decision space that we can instantiate on different kernels to represent candidates. A dedicated compiler transforms a specification written in this DSL into code that we then use to construct the search tree. This consists of generating a data structure that holds the list of available decisions, the code that builds this data structure from the kernel description and the code that propagates constraints on decisions.

The DSL and its compiler make our approach independent of a particular encoding of transformations or of a particular kernel representation. It enables us to easily try out different decision spaces and allows external users to define their own decision space for different application domains.

Contributions. This chapter focuses on the formalism and the generic framework to define candidate implementations, while Chapter 4 explains how we use our framework to build a decision space for linear algebra kernels. Specifically, this chapter makes the following contributions.

• It introduces the concept of candidate implementation.

• It presents a formalism to define candidates independently of a particular kernel.

• It presents a declarative language and an associated compiler that implement the formalism. Together, they make it easier to extend the implementation space with new decisions or to apply our approach to other domains of application.

Outline. Section 3.1 gives a definition of candidate implementations and explains how we expose potential decisions. Section 3.2 then explains how we can specify the decisions and constraints that make up candidates independently of kernel instances. This section introduces the concept of decision space, that lists generic choices and constraints that can be instantiated on any kernel. It also introduces the concept of lowering, that allows us to lower some high-level constructs in the kernel definition when certain conditions are met. Section 3.3 defines a declarative language that defines decision spaces. Section 3.4 illustrates our approach by applying our formalism to the problem of Chapter 2. Then, Section 3.5 explains how we generate code from our domain specific language to represent and manipulate candidates. Finally, Sections 3.6 and 3.7 compare our work to existing approaches and discuss how our framework helps find efficient implementations.

3.1 Candidate Implementations

We represent the problem of choosing among possible implementations of a computational kernel as a CSP. Each variable of the CSP materializes an implementation choice. The initial domain of such a variable contains one value for each possible outcome of the choice. Not all combinations of decisions are valid, hence constraints ensure decisions are coherent. An implementation is a solution of the CSP, that is, an assignment from choices to outcomes that satisfies the constraints. This section assumes that the encoding of the implementation space into a CSP ⟨d_K, C_K⟩ for any kernel K is given. Sections 3.2 and 3.4 and Chapter 4 later explain how we build this encoding from a kernel-independent description.

3.1.1 Representation of Potential Implementations

Following the motivating example of Chapter 2, we use domains to represent and manipulate sets of possible implementations.

Definition 3.1 (Candidate and Implementation). A candidate is a pair (K, d), composed of a kernel K to implement and a domain d for the CSP ⟨d_K, C_K⟩. The domain d is thus a more constrained version of the initial domain d_K of the CSP: d ⊆ d_K. An implementation of (K, d) is a solution of ⟨d, C_K⟩. The implementation space ⟦(K, d)⟧ of a candidate (K, d) is the set of all solutions of ⟨d, C_K⟩.

The domain of a candidate lists possible decisions for each choice. A search algorithm enforces a decision by restricting the domain of the corresponding choice. Constraints may forbid some combinations of decisions. Following the formalism of CSPs, we use propagators that implement the constraints to prune the domains from invalid values. To maximize the information about the validity of decisions, we only consider candidates with domains on which propagators have reached a fixed point. We thus run propagators at the creation and after each modification of the domain. Even then, some invalid values may remain undetected until further choices are specified. In fact, it might be possible that a candidate does not represent any valid implementation at all if one of the variable domains only contains invalid values. CSP solvers solve this problem by backtracking.

In our case, Section 4.5.2 shows that the constraints we use are simple enough to generate accurate propagators that almost never miss invalid values, and thus accurately represent possible decisions. Our search strategy, which we present in Chapter 6, already backtracks when it encounters candidates that only lead to inefficient implementations. The impact of backtracking due to suboptimal propagators is negligible compared to that of backtracking due to the search strategy. The definition of propagators (Section 2.2) ensures no invalid value remains in fixed domains, thus ensuring that fully specified candidates are valid.

3.1.2 Interface With Search Algorithms

The objective of a search algorithm is to find an assignment of variables that satisfies the constraints and minimizes the execution time of the kernel on a particular input. Algorithms operate on the domain of variables while the kernel representation remains mostly untouched. From the point of view of a search algorithm, a candidate is thus a partially instantiated decision vector, with the domain of each choice. The only actions the algorithm can perform are the following.

1. It can restrict the domain of a variable in the decision vector. This automatically triggers propagators to recursively prune incompatible values from other variable domains.

2. It can copy the decision vector, to branch the search and try different decisions. It does not need to copy the constraints that define the CSP, as they solely depend on the kernel instance which is unchanged by this operation.

A search algorithm typically combines these actions to build a tree whose nodes are candidates. The root is the fully unrestricted candidate (K, dK ) and each node is a strictly more constrained version of its parent. The leaves of the tree are fixed candidates, corresponding to actual implementations, for which the algorithm can generate code and measure its execution time.
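This interface can be summarized by a small API in the spirit of the following sketch. The trait and the enumeration procedure are hypothetical illustrations of ours, not the actual Telamon API; they only capture the two actions above plus the ability to inspect the open choices.

    // Illustrative sketch (not the actual Telamon API) of the interface a search
    // algorithm sees: a candidate is a cloneable decision vector whose choices can
    // only be restricted, with constraints propagated after each restriction.
    trait Candidate: Clone {
        type Choice;
        type Value;

        /// Remaining alternatives for a choice; a single value means the choice is fixed.
        fn domain(&self, choice: &Self::Choice) -> Vec<Self::Value>;

        /// Restricts a choice to one of its remaining values and propagates constraints.
        /// Returns false if propagation empties some domain (the candidate is dead).
        fn restrict(&mut self, choice: &Self::Choice, value: Self::Value) -> bool;

        /// Choices whose domain still contains several values.
        fn open_choices(&self) -> Vec<Self::Choice>;
    }

    /// Naive depth-first enumeration of the implementations deriving from a candidate.
    /// An actual search would instead order the branches, e.g. with the performance
    /// model of Chapter 5 or the statistical exploration of Chapter 6.
    fn enumerate<C: Candidate>(candidate: C, implementations: &mut Vec<C>) {
        let open = candidate.open_choices();
        match open.first() {
            None => implementations.push(candidate), // fully specified implementation
            Some(choice) => {
                for value in candidate.domain(choice) {
                    let mut child = candidate.clone(); // branching copies the decision vector
                    if child.restrict(choice, value) {
                        enumerate(child, implementations);
                    }
                }
            }
        }
    }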

3.1.3 Key Benefits of Candidates

This approach based on candidates offers three advantages. First, domains define a well-behaved set of actions that search algorithms can use to generate implementations. Since an algorithm can only restrict the domain of choices, decisions commute and the set of potential decisions decreases after each action. This means that the search algorithm is aware of all potential actions upfront and can make the most performance-impacting decisions first.

Second, it allows for manipulating intermediate steps of the decision process. This is in opposition to black-box solvers, such as ILP or SMT solvers, that optimize for a single objective function. Such solvers cannot fully encode the complexity of the target architecture. In contrast, our approach allows mixing actual evaluations on the hardware, performance models, heuristics and statistical approaches.

Finally, it allows for the definition of functions on candidates that extract information from the domains and guide the search. For instance, a function could estimate the execution time of the best implementation that derives from a candidate, or compute a probability distribution on the decisions. These functions reason about entire sets of candidates at once and have a global knowledge of upcoming decisions, giving them access to the potential characteristics of final implementations. A limitation is that these functions operate on the domain, which is a Cartesian over-approximation of valid assignments.

The definition of candidates we gave in this section assumes we have a CSP ⟨d_K, C_K⟩ for each kernel K. The next section introduces the concept of decision space, that describes d_K and C_K independently of the kernel K.

3.2 Decision Space Definition

As stated in Definition 3.1, candidates are composed of two parts: a representation of the kernel and a domain. The kernel representation is mostly constant throughout the search, although some decisions – such as the introduction of temporary variables – can interact with the representation in a limited way. This section explains how we derive, from a kernel instance, the list of decisions that must appear in the domain and the associated constraints. It presents a formalism to define a generic CSP that can be instantiated for any kernel instance. This formalism interprets the kernel representation as a set of objects that respect different properties, such as x is an instruction or x depends on y. While the objects vary from one kernel to the other, the list of properties remains the same, allowing for the description of the implementation space independently of the kernel instance. Our formalism describes generic decisions and constraints that we instantiate for every tuple of objects verifying given properties. For instance, the generic CSP may induce a choice indicating whether a loop is unrolled, for each object x respecting the property x is a loop. We first explain how we specify the choices and constraints independently of the kernel representation in Section 3.2.1 and then how we allow extending the implementation space with new choices and constraints when lowering high-level constructs in Section 3.2.2.

3.2.1 Formalization of the CSP Definition

This section explains how we define the CSP ⟨dK, CK⟩ relative to a kernel K. For that, we first formalize the concept of kernel. We then define generic choices as choices that we instantiate for each kernel, and a decision space as a collection of generic choices and constraints. Finally, we define the CSP associated to a kernel K as the instantiation of the generic choices and constraints for K.

Kernel Representation Let Ω be a set of identifiers representing objects, let Π be a set of identifiers called properties and let a : Π → N denote their arity. Let Ω∗ be the set of tuples of objects of any size. We define a kernel instance as a mapping from properties to sets of tuples of objects.

Definition 3.2 (Kernel Instance). A kernel instance is a function K mapping properties to the objects satisfying them:

K : Π → P(Ω∗),  p ↦ K(p) ⊆ Ω^a(p)

Generic Choices Let D be a set of domains, where each domain is a (distinct) set of identifiers. We define a generic choice as a choice in a given domain that is instantiated for all tuples of objects respecting a set of properties.

Definition 3.3 (Generic Choice). A generic choice with arity k is a tuple c = ((p0, . . . , pk−1), D) ∈ Π^k × D.

A generic choice c = ((p0, . . . , pk−1), D) ∈ Π∗ × D indicates a choice among the values of D for each tuple of distinct objects verifying the properties (p0, . . . , pk−1). For instance, the choice could be how each loop should be implemented, with k = 1, p0 = is a loop and D = {regular, unrolled, parallel}. Similarly, the choice could be the pairwise ordering between pairs of instructions, with k = 2, p0 = p1 = is an instruction and D = {before, after}.

Instantiated for a kernel K, a generic choice c = ((p0, . . . , pk−1), D) ∈ Π∗ × D defines an actual choice – that is, a CSP variable – for each combination of objects that respect the properties, that is, for each tuple of distinct objects in K(p0) × · · · × K(pk−1). In our examples, that would respectively be for each loop and for each pair of distinct instructions.

Decision Space Next, we define a decision space in order to represent sets of possible implementations. A decision space is the combination of a set of generic choices defining the list of open decisions and a set of formulae expressing constraints on the decisions. Constraints are universally quantified over objects so that they are independent of the kernel instance, similarly to generic choices.

Definition 3.4 (Decision Space). A decision space is the pair ⟨C, F⟩ of a set C ⊆ Π∗ × D of generic choices, and a set of first-order formulae F called constraint formulae. The alphabet of the constraint formulae contains one predicate symbol with arity a(p) for each property p ∈ Π, and one function symbol with arity Σi a(pi) for each choice c = ((p0, . . . , pk−1), D) ∈ C.

CSP Associated to a Kernel A decision space ⟨C, F⟩ defines a CSP for each kernel K. The initial domain maps each variable defined by a choice c = ((p0, . . . , pk−1), D) to the full domain D. Solutions are the assignments licensed by the decision vector that respect the constraints specified by F. The definition of decision vector below captures the domains of this CSP.

Definition 3.5 (Partially Instantiated Decision and Decision Vector). A partially instantiated decision for a choice c = ((p0, . . . , pk−1), D) and a kernel K is a function χc that assigns a set of possible values to each instance of the generic choice:

χc : K(p0) × · · · × K(pk−1) → P(D).

A decision vector for a decision space ⟨C, F⟩ and a kernel is a vector with one partially instantiated decision χc for each choice c ∈ C.

Partially instantiated decisions are partial in the sense that they restrict the possible values, but do not necessarily select one. They are the domains associated to each generic choice instantiated for a kernel, while a decision vector represents the full domain. In practice, their representation stores the value of the application of χc to each of its inputs. We are now ready to define the constraint satisfaction problem ⟨dK, CK⟩ associated to a kernel K that we mention in Definition 3.1.

Definition 3.6 (CSP Associated to a Kernel). The CSP associated to a kernel K for the decision space ⟨C, F⟩ is the pair ⟨dK, CK⟩ where:

• dK is the decision vector that maps each variable defined by the generic choice c = ((p0, . . . , pk−1),D) ∈ C to the full domain D, and

• CK is the set of assignments licensed by dK for which all constraint formulae hold when interpreting the predicate symbol for property p with K(p) and the function symbol for choice c with χc.
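To illustrate Definition 3.6 on a concrete (hypothetical) instance, consider a kernel K with two instructions, K(is an instruction) = {i1, i2}, and a single generic choice order = ((is an instruction, is an instruction), {BEFORE, AFTER}) – this choice is only introduced later, in Section 3.3.2. The initial decision vector then contains one variable per ordered pair of distinct instructions:

    dK(order) :  (i1, i2) ↦ {BEFORE, AFTER},   (i2, i1) ↦ {BEFORE, AFTER}

and, if F contains an antisymmetry formula, CK only licenses the two assignments in which order(i1, i2) and order(i2, i1) take opposite values.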

3.2.2 Lowering of High-Level Constructs

This formalism allows for adding objects to the kernel representation during the search. Indeed, extending the sets of objects that respect properties adds decisions and constraints without invalidating pre-existing ones. This is useful to lower high-level constructs upon reaching specific decisions. For example, this allows for inserting the appropriate load and store instructions after deciding to allocate a temporary array in memory. Lowering a construct may only add new objects or add properties to existing ones.

It is important to note that it is always possible to express directly in the initial CSP the choices and constraints that construct lowering introduces. Indeed, if a construct lowering triggers when a condition p is satisfied and introduces a choice c, we can add a value ABSENT to the initial domain of c and a constraint ¬p ∨ c = ABSENT so that we can expose c directly in the initial CSP. Similarly, we can add the disjunct ¬p to the constraints the lowering introduces. This ensures that our claim that all decisions can be available upfront is respected. The reasons to rely on the lowering of high-level constructs when defining the decision space are:

• to avoid cluttering the decision vector with decisions that will not affect the generated code, making constraint propagation more efficient;

• and to intentionally delay some decisions, to focus search algorithms on more impactful choices first.

A hybrid approach is to use construct lowering but to pre-compute the transitive closure of the choices that lowering can add to the CSP and to expose them to search heuristics only.

To lower a high-level construct, we operate in four steps.

1. First, the CSP solver detects that the lowering condition is met. We describe such conditions using first-order logic formulae that are independent of the kernel instance, similar to the ones in F.

2. Then, the solver copies the kernel representation and adds objects to the sets representing properties.

3. Then, the solver extends the decision vector to accommodate the new variables induced by the new objects.

4. Finally, the solver enforces the new constraints obtained by instantiating the formulae of F on the new objects.

This section gave a formal definition of the concept of decision space, which describes the choices and constraints that make up a candidate, independently of kernels. The following section explains how we specify decision spaces in practice.

3.3 Candidate Representation Description Language

This section presents a domain specific language (DSL) that implements the formalism presented in Section 3.2. It declaratively defines a decision space. In particular, it specifies how to interact with the kernel and defines generic choices, constraints and lowering conditions. This language serves two distinct purposes. First, it precisely defines the decision space in a concise way. This is essential to understand and debug the constraints. Second, it simplifies changing the encoding of the decision space. This makes it possible to try out different approaches or to build a decision space for a different domain with low effort. Indeed, we have implemented a compiler, hereafter referred to as the DSL compiler, that generates code to create, represent and manipulate candidates from the DSL. In particular, it outputs code to store the decision vector, compute the initial domain, propagate constraints, and perform construct lowering.

Outline. The specification of a decision space may contain five kinds of statements, each detailed in a dedicated section:

• ⟨set⟩ statements that specify the properties the kernel exposes, and how to access the set of objects that respect each property, defined in Section 3.3.1,

• ⟨choice⟩ statements to define generic choices, defined in Section 3.3.2,

• ⟨constraint⟩ statements to specify the constraint formulae, defined in Section 3.3.3,

• ⟨lowering⟩ statements to indicate when and how to lower high-level constructs, defined in Section 3.3.4, and

• ⟨quotient⟩ statements to represent classes of equivalent objects, for example fused dimensions, defined in Section 3.3.5.

3.3.1 Kernel Interface Definition

Our DSL defines a decision space on top of an existing kernel representation. Following our formalism, a kernel representation exposes a set of properties Π, and the sets of objects that respect each property p ∈ Π. A ⟨set⟩ statement defines a single property p ∈ Π and specifies how to access the corresponding set K(p) for each kernel K.

⟨specification⟩ ::= (⟨set⟩ | ⟨choice⟩ | require ⟨constraint⟩ | ⟨lowering⟩ | ⟨quotient⟩)*

⟨set⟩ ::= 'set' ⟨set-ident⟩ ⟨set-arg⟩? ⟨superset⟩? ':' ⟨set-disjoint⟩? (⟨entry-name⟩ '=' ⟨code-snippet⟩)* 'end'
⟨set-arg⟩ ::= '(' ⟨ident⟩ 'in' ⟨set-ident⟩ ')'
⟨superset⟩ ::= 'subsetof' ⟨set-ident⟩
⟨set-disjoint⟩ ::= 'disjoint' ':' list(⟨set-ident⟩, ',')
⟨entry-name⟩ ::= 'type' | 'id_type' | 'id_getter' | 'iterator' | 'getter' | 'from_superset' | 'var_prefix'

Figure 3.1: Top-Level and Set Statements Syntax

Sets Definition Syntax. Figure 3.1 presents the syntax of set definitions in Backus-Naur form. The list(A, B) operator indicates a list of As separated by Bs. A set definition includes the set name, an optional argument, an optional superset, a list of other disjoint sets and several entries that indicate how to generate code that manipulates the set. Each ⟨entry-name⟩ may appear only once in a set definition. When generating code from snippets, the DSL compiler substitutes variables starting with $. It replaces $fun by a pointer to the kernel, $item by a pointer to the current object and $id by the identifier of the current object, where it makes sense.

set Instructions:
  type = "ir::Instruction"
  iterator = "$fun.instructions()"
  id_type = "ir::InstId"
  id_getter = "$obj.id()"
  item_getter = "$fun.instruction($id)"
  var_prefix = "inst"
end

Listing 3.1: Instructions Set Definitions

For example, Listing 3.1 defines a property Instructions that lists the instructions of the kernel. Besides the name of the set, this definition has a list of entries that map attributes to snippets of code. The DSL compiler embeds these snippets in the generated code to manipulate the set and its objects. Mandatory entries are:

type : the type of the set objects,

iterator : an expression that returns an iterator on the set,

id_getter : an expression that returns the unique identifier of an object,

id_type : the type of the unique identifier, and

getter : an expression that returns the object associated to a unique identifier.

Additionally, the optional var_prefix entry specifies a prefix to use when naming variables holding objects of the set. It makes it easier to debug the code generated by the DSL compiler.

Relations Between Sets. Listing 3.2 shows how we define relations between sets with subset and disjointness directives. The DSL compiler uses these directives to type-check constraints and to determine when to run propagators. It automatically assumes that sets with no common superset are disjoint. While limited in expressiveness, these relations nicely match the concepts of type and subtype in programming languages. Listing 3.2 defines two sets Loads and Stores that correspond to the ir::LoadInst and ir::StoreInst types in the kernel representation. Both are subtypes of ir::Instruction and thus the Loads and Stores sets are subsets of Instructions. Subsets have an additional entry, from_superset, to test if an object of the superset is in the subset. This entry maps to an expression that returns the object downcast to the correct type if it is part of the set, using Rust's Option type.

set Loads subsetof Instructions:
  type = "ir::LoadInst"
  from_superset = "$item.as_load()"
  ... // Elided for the sake of brevity
end

set Stores subsetof Instructions:
  disjoint: Loads
  type = "ir::StoreInst"
  from_superset = "$item.as_store()"
  ... // Elided for the sake of brevity
end

Listing 3.2: Subset Definition

Parametric Sets. Finally, our DSL supports parametric sets. Parametric sets define a set for each object respecting a property. For example, Listing 3.3 defines a set containing the arrays accessed for each load instruction. In our formalism, this translates to a property on tuples of objects: the parameters and the objects of the set.

set AccessedArray($inst in LoadInsts) subsetof Arrays:
  ... // Elided for the sake of brevity
end

Listing 3.3: Parametric Set Definition

Note that we specify the set parameter as a variable binding: $inst in LoadInsts. This pattern occurs each time we quantify a definition for all objects in a set. The variable may appear in code snippets inside the definition. The DSL compiler replaces it with an expression pointing to the object of the currently considered set.

3.3.2 Choices Definition

Next, we present the syntax to define generic choices in Figure 3.2. We distinguish three kinds of choice declarations, depending on the type of domain they use (enum, integer or counter). Each declares the name of the choice and the objects it applies to in ⟨choice-decl⟩. We first present the ⟨choice-decl⟩ syntax and then discuss the specificities of each type of domain.

⟨choice⟩ ::= 'choice' 'enum' ⟨choice-decl⟩ ':' ⟨enum-stmt⟩* 'end'
  | 'choice' 'integer' ⟨choice-decl⟩ ':' ⟨code snippet⟩ 'end'
  | 'choice' ('half')? 'counter' ⟨choice-decl⟩ ':' ⟨counter-def⟩ 'end'
⟨choice-decl⟩ ::= ⟨choice-ident⟩ '(' list(⟨ident⟩ 'in' ⟨set⟩, ',') ')'
⟨set⟩ ::= ⟨set-ident⟩ ('(' ⟨ident⟩ ')')?
⟨enum-stmt⟩ ::= 'value' ⟨variant-name⟩ ':' ('requires' ⟨constraint⟩)*
  | 'alias' ⟨variant-name⟩ '=' list(⟨variant-name⟩, '|') ':' ('requires' ⟨constraint⟩)*
  | 'symmetric'
  | 'antisymmetric' ':' (⟨variant-ident⟩ '<->' ⟨variant-ident⟩)*
⟨counter-def⟩ ::= 'sum' ⟨counter-val⟩
  | 'mul' ⟨counter-val⟩
  | 'forall' ⟨ident⟩ 'in' ⟨set-ident⟩ ':' ⟨counter-def⟩
⟨counter-val⟩ ::= 'forall' ⟨ident⟩ 'in' ⟨set⟩ ':' ⟨counter-val⟩
  | ⟨code snippet⟩ 'when' ':' ⟨condition⟩*

Figure 3.2: Domain Specification Syntax

A ⟨choice-decl⟩ statement defines the name of the choice and a list of sets corresponding to properties p0, . . . , pk−1 ∈ Π. Following the formalism of Section 3.2, this defines a CSP variable for each combination of objects respecting the properties, that is, for each combination of objects in the sets. For example, Listing 3.4 defines a generic choice cache_level that applies to each instruction in the LoadInsts set. In the generated code, it defines a partially specified decision from load instructions to the level of cache to use:

cache_level : LoadInsts → P({L1, L2, READ_ONLY, NONE})   (3.1)

choice enum cache_level($inst in LoadInsts):
  value L1:        // Use L1+L2 caches
  value L2:        // Use L2 cache
  value READ_ONLY: // Use read-only cache
  value NONE:      // Do not use caches
end

Listing 3.4: Choice Definition for Cache Directives

Similarly to set parameters, a ⟨choice-decl⟩ declares its sets with a variable binding. This variable may appear in code snippets in domain-specific directives; in the generated code, the DSL compiler replaces it with the actual object of the set being considered.

Enum Choices Enum choices have a domain D composed of a small set of predefined values. In fixed implementations, they could be represented with C enums. The enum definition specifies the list of values the choice can

take, with optional constraints for each of the values. For example, Listing 3.4 defines one value for each level of cache. The domain Dcache of the cache_level choice is defined as follows.

Dcache := {L1, L2, READ_ONLY, NONE} (3.2)

Symmetry. Our DSL supports the specification of symmetry and antisymmetry constraints for enum choices defined with two similar set arguments. In that case the decision vector only stores half of the partially instantiated decision for the choice; the other half is deduced from the first on demand. Symmetry information also helps generate stronger propagators. In the case of antisymmetric choices, a directive specifies the mapping of values to their inverse.

choice enum order($lhs in Instructions, $rhs in Instructions):
  value BEFORE:
  value AFTER:
  antisymmetric: BEFORE <-> AFTER
end

Listing 3.5: Enum Choice for Ordering Instructions

Listing 3.5 provides an example of an antisymmetric choice order between pairs of instructions. This definition also hints at why we only instantiate generic choices for combinations of distinct objects instead of all tuples, including those with multiple occurrences of the same object: most choices would not make sense when applied to a pair referencing the same object twice. In our example, this avoids expressing the order between an instruction and itself.
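A plausible sketch of this half-storage scheme follows, assuming objects carry totally ordered identifiers; for brevity it stores a single value per pair rather than a domain, and the names are illustrative, not the generated code.

    use std::collections::HashMap;

    #[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
    struct InstId(usize);

    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    enum Order { Before, After }

    impl Order {
        // Inverse mapping declared by the `antisymmetric` directive.
        fn inverse(self) -> Order {
            match self {
                Order::Before => Order::After,
                Order::After => Order::Before,
            }
        }
    }

    // Only pairs (a, b) with a < b are stored; the other half is derived on demand.
    struct OrderStore(HashMap<(InstId, InstId), Order>);

    impl OrderStore {
        fn get(&self, a: InstId, b: InstId) -> Option<Order> {
            if a < b {
                self.0.get(&(a, b)).copied()
            } else {
                self.0.get(&(b, a)).map(|o| o.inverse())
            }
        }
    }

Storing a single canonical key per unordered pair halves the memory footprint and guarantees that the two directions of the choice can never disagree.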

Integer Choices Integer choices select a positive integer value from a small universe specific to each instance of the choice. Their definition includes a code snippet that retrieves the universe from the kernel representation.

choice integer tiling_factor($loop in Loops):
  "$loop.possible_tiling_factors()"
end

Listing 3.6: Integer Choice for Loops Tiling Factor

For example, Listing 3.6 defines an integer choice that specifies the tiling factor of each loop. A code snippet defines a different list of possible factors for each instance of $loop.

Counter Choices Hardware limitations often impose constraints on sums or products of quantities, such as the sum of all array sizes (to ensure they fit in memory) or the product of the sizes of nested thread dimensions (to ensure it does not exceed the maximal number of threads). Counter choices track sums or products of values, to allow for writing constraints that reference their value. They are fully determined by the values of other choices. The DSL compiler inserts code to update counter domains when needed. For example, Listing 3.7 defines a counter choice that tracks the amount of local memory used by arrays. It sums the value of the $array.size() expression for each $array that is allocated in local memory. The DSL compiler represents counter domains by storing their minimal and maximal values. Most constraints on counters actually only need the minimum. In that case, programmers can prefix the counter definition with the half keyword to avoid computing and storing the maximal value the counter can take.

choice counter local_mem_used():
  forall $array in Arrays:
    sum "$array.size()" when mem_space($array) is LOCAL
end

Listing 3.7: Counter for the Local Memory Usage

3.3.3 Constraints

Constraints are first-order logic sentences on the choices, quantified over objects in specific sets. Currently, they are universally quantified disjunctions of conditions on zero, one or two choices: boolean constants, restrictions of a variable to specific values or comparisons between two variables. This is done to simplify the implementation, and we never needed more; it is not an inherent limitation of the system.

⟨constraint⟩ ::= 'forall' ⟨ident⟩ 'in' ⟨set⟩ ':' ⟨constraint⟩
  | list(⟨condition⟩, '||')
⟨condition⟩ ::= ⟨decision⟩ 'is' list(⟨variant-name⟩, '|')
  | ⟨decision⟩ 'is not' list(⟨variant-name⟩, '|')
  | ⟨decision⟩ ⟨cmp-op⟩ ⟨decision⟩
  | ⟨decision⟩ ⟨cmp-op⟩ ⟨code snippet⟩
  | ⟨code snippet⟩
⟨decision⟩ ::= ⟨decision-ident⟩ '(' list(⟨ident⟩, ',') ')'
⟨cmp-op⟩ ::= '==' | '!=' | '<' | '>' | '<=' | '>='

Figure 3.3: Constraint Specification Syntax

Figure 3.3 presents the syntax of constraints. A constraint comprises variable bindings, introduced by the forall keyword, around a disjunction of ⟨condition⟩s. The DSL compiler instantiates the conditions for each combination of distinct objects in the sets. A ⟨condition⟩ can be:

• a snippet of code evaluating to a boolean, for example to test if a loop may be parallelized, "$loop.has_loop_carried_dependencies()"

• a restriction of a choice to specific values, for example to indicate an array cannot be stored in global memory, memory_space($array) is not GLOBAL

• a comparison between a choice and the value returned by a snippet of code, for example to limit the amount of local memory the kernel allocates: local_mem_used() <= "gpu.local_mem_size()"

• a comparison between two choices, for example to test if the orders of two pairs of instructions are the same: order($a, $b) == order($b, $c)

require forall $a in Instructions:
  forall $b in Instructions:
    forall $c in Instructions:
      order($a, $c) is AFTER
        || order($a, $b) is not AFTER
        || order($b, $c) is not AFTER

Listing 3.8: Order Transitivity Constraint

Listing 3.8 gives an example of a constraint that ensures ordering decisions are transitive. The same constraint applies to all combinations of objects in the corresponding sets. Snippets of code appearing in ⟨condition⟩s allow for parameterizing constraints relative to specific objects or to the hardware target, for example to parametrize the bound on the amount of memory used, or to selectively restrict a constraint so that it only applies to instructions with side effects. The value returned by snippets of code must remain constant throughout the entire implementation space exploration.

3.3.4 Kernel Representation Lowering

The DSL specifies when to lower high-level constructs with trigger statements. Figure 3.4 gives the syntax for triggers. A ⟨lowering⟩ statement comprises a snippet of code and a conjunction of conditions surrounded by forall statements. It defines a trigger that calls the code snippet when the decision vector meets the condition, for all combinations of objects in the sets.

⟨lowering⟩ ::= 'trigger' ⟨lower-forall⟩
⟨lower-forall⟩ ::= 'forall' ⟨ident⟩ 'in' ⟨set⟩ ':' ⟨lower-forall⟩
  | ⟨code snippet⟩ 'when' list(⟨condition⟩, '&&')

Figure 3.4: Lowerings Syntax

The example in Listing 3.9 defines a trigger that runs when the decision vector indicates that a variable should be stored in memory, that is, when the domain of is_in_memory($var) only contains TRUE. Triggers run a callback that either adds properties to existing objects or creates new objects. In our example, the callback creates a store and a load instruction, which imply new choices for the cache level they should use. The DSL compiler generates code to extend the decision vector with the new decisions and to enforce the new constraints induced by the objects added to the sets representing each property.

trigger forall $var in Variables:
  "$fun.gen_store_load($var)" when is_in_memory($var) is TRUE

Listing 3.9: Lowering Specification to Generate Shared Memory Layouts

Triggers are only allowed to add objects to properties. They cannot remove objects, so that previous domains and constraints remain valid. Moreover, we require that callbacks commute, so that the order in which we apply triggers does not impact the final implementations. It is important to understand that triggers are not mandatory. One can add objects to properties upfront and restrict the domain of choices to a special value if the objects they relate to do not appear in the generated code. In our example, constraints could force cache directives to NONE in implementations where the lowering condition is not satisfied.

3.3.5 Quotient Sets

Finally, we introduce ⟨quotient⟩ statements to manipulate classes of equivalent objects, with regard to a symmetric enum choice that defines an equivalence relation. A ⟨quotient⟩ statement defines a set of objects that contains a single representative of each equivalence class that respects a certain property. For example, a quotient set may contain one loop per class of fused loops that are nested outside a specific instruction. The DSL compiler generates code to automatically add objects to the set when decisions introduce a new equivalence class.

⟨quotient⟩ ::= 'quotient' ⟨set-ident⟩ ⟨set-arg⟩? 'of' ⟨ident⟩ 'in' ⟨set⟩ ':'
  ⟨choice-ident⟩ '=' ⟨condition⟩ '/' ⟨choice-ident⟩ 'is' ⟨variant-name⟩
  (⟨entry-name⟩ '=' ⟨code-snippet⟩)*
  'end'

Figure 3.5: Quotient Sets Syntax

Figure 3.5 gives the syntax of quotient definitions. A quotient definition contains the elements of a regular set definition (set name, set argument and mapping of attributes to snippets of code) with additional information specifying the original set of the objects, the condition for being part of the quotient and the equivalence relation. It also defines a generic choice, taking values in {TRUE, FALSE}, that indicates whether an object is the representative of an equivalence class. This generic choice is redundant but helps when writing constraints, as it allows for testing the inclusion of an element in the quotient set.

// Enum choice defining an equivalence relation.
choice enum is_fused($lhs in Dimensions, $rhs in Dimensions):
  symmetric
  value TRUE:
  value FALSE:
end

quotient OuterLoops($inst in Instructions) of $loop in Loops:
  is_outer_loop = order($loop, $inst) is OUT / is_fused is TRUE
  ... // Elided
end

Listing 3.10: Quotient Set for Fused Loops Nested Outside an Instruction

For example, Listing 3.10 defines a quotient set that contains one loop per equivalence class of fused loops nested outside of $inst. It defines a redundant generic choice is_outer_loop to test if a loop is the representative of an equivalence class.

3.4 Example: Instructions Scheduling Within a Basic Block

We now present a complete example of candidate representation that exposes potential schedules for the instructions of a basic block. For this, we translate the representation of partially specified schedules of Chapter 2 into the candidate formalism. Listing 3.11 shows the interface of the kernel representation, written in Rust. It defines a structure InstId to uniquely identify instructions and two interfaces, Instruction and Kernel, to represent a single instruction and the whole kernel. An instruction exposes an identifier and a method to test if another instruction is one of its dependencies. The kernel exposes a list of instructions. Listing 3.12 defines the decision space for schedules. It first specifies how to interact with the list of instructions, then gives the definition of the order choice. This definition includes the antisymmetry constraint and ensures that decisions respect data dependencies. Finally, it adds a constraint to ensure that order is transitive.

struct InstId(usize);

trait Instruction {
    fn id(&self) -> InstId;
    fn depends_on(&self, other: &Instruction) -> bool;
}

trait Kernel {
    fn instruction(&self, id: InstId) -> &Instruction;
    fn instructions(&self) -> impl Iterator;
}

Listing 3.11: Interface of the Kernel Representation

set Instructions:
  type = "Instruction"
  id_type = "InstId"
  id_getter = "$obj.id()"
  item_getter = "$fun.instruction($id)"
  iterator = "$fun.instructions()"
end

choice enum order($lhs in Instructions, $rhs in Instructions):
  value BEFORE:
  value AFTER:
    // Enforce data dependencies.
    requires "!$lhs.depends_on($rhs)"
  antisymmetric: BEFORE <-> AFTER
end

// Ensure the order is transitive.
require forall $lhs in Instructions:
  forall $mid in Instructions:
    forall $rhs in Instructions:
      order($lhs, $rhs) is BEFORE
        || order($lhs, $mid) is not BEFORE
        || order($mid, $rhs) is not BEFORE

Listing 3.12: Implementation Space Description
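As a usage sketch, a toy kernel mirroring the interface of Listing 3.11 could look as follows. It is adapted slightly so that it compiles as standalone Rust (dependencies are stored explicitly and the iterator type is made concrete); it is an illustration, not code from the actual system.

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct InstId(usize);

    struct Inst {
        id: InstId,
        deps: Vec<InstId>, // identifiers of the instructions this one depends on
    }

    impl Inst {
        fn id(&self) -> InstId { self.id }
        fn depends_on(&self, other: &Inst) -> bool { self.deps.contains(&other.id) }
    }

    struct ToyKernel { insts: Vec<Inst> }

    impl ToyKernel {
        fn instruction(&self, id: InstId) -> &Inst { &self.insts[id.0] }
        fn instructions(&self) -> impl Iterator<Item = &Inst> + '_ { self.insts.iter() }
    }

    fn main() {
        // i2 depends on i0 and i1: the data-dependency `requires` clause of
        // Listing 3.12 forbids any schedule that contradicts these dependencies.
        let kernel = ToyKernel {
            insts: vec![
                Inst { id: InstId(0), deps: vec![] },
                Inst { id: InstId(1), deps: vec![] },
                Inst { id: InstId(2), deps: vec![InstId(0), InstId(1)] },
            ],
        };
        assert!(kernel.instruction(InstId(2)).depends_on(kernel.instruction(InstId(0))));
        assert_eq!(kernel.instructions().count(), 3);
    }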

3.5 Constraint Propagation Code Generation

After providing an example of candidate representation, we now explain how the DSL compiler generates Rust code to represent and manipulate candidates from the description of the decision space. Specifically, it generates:

• a data structure to represent the decision vector and access individual variable domains,

• a function init that instantiates a candidate from a kernel instance and enforces propagators and triggers on the initial domain,

init : Kernel → Candidate (3.3)

• data structures to represent decisions, and

• a function apply_decisions that applies a list of decisions to the decision vector and runs propagators and triggers until reaching a fixed point.

apply_decisions : Decision list → Candidate → Candidate (3.4)

The DSL compiler was critical to conduct our experiments. It made it easier to write, modify and debug decision spaces. This allowed us to try different encodings of the optimizations while only changing a few lines of code in the candidate representation. The DSL compiler also makes our approach generic: one may write their own candidate representation, based on another kernel representation, for a different domain of application, different optimizations and different hardware targets. The main originality of the DSL compiler is that it generates static code to represent the decision vector and enforce constraints. It leverages the generic choices and constraints to generate code that is independent of the kernel instance. It also avoids storing the list of instantiated constraints in memory. Instead, it deduces them on the fly from the kernel instance when necessary.

Outline. Section 3.5.1 explains how we represent the decision vector in memory. Then, Sections 3.5.2 and 3.5.3 respectively present how we implement counters and quotients. Finally, Section 3.5.4 details how we generate propagators and Section 3.5.5 how we use them to enforce constraints and support high-level constructs lowering.

3.5.1 Decision Vector

The DSL compiler first generates code to represent the individual domains of variables, that is, types to represent a single enum, integer or counter instance. It encodes enum domains in a bit-field with one bit per possible value. Similarly, it encodes integer domains with a fixed number of bits, each bit corresponding to a position in the array that specifies the possible values. Finally, it encodes counter domains as intervals with their minimal and maximal possible values (only the minimal value for half counters).

The compiler then generates a data structure for the decision vector. This data structure comprises a hash map for each generic choice, which maps tuples of objects to the domain of the generic choice instantiated on that tuple of objects. The decision vector relies on shared pointers with a copy-on-write mechanism to share unmodified hash maps among decision vector copies. Finally, it puts together the decision vector and a kernel to form a candidate. The programmer must provide the kernel representation. Here again, a copy-on-write mechanism avoids copying the kernel, unless a high-level construct is lowered by a trigger.
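For illustration, a four-value enum domain such as cache_level from Listing 3.4 could be packed into a byte as sketched below; the names and layout are illustrative, and the actual generated representation may differ.

    // Illustrative bit-field domain for a four-value enum choice such as
    // `cache_level` (one bit per value; this is not the generated code).
    #[derive(Clone, Copy)]
    struct CacheLevelDomain(u8);

    impl CacheLevelDomain {
        const L1: u8 = 1 << 0;
        const L2: u8 = 1 << 1;
        const READ_ONLY: u8 = 1 << 2;
        const NONE: u8 = 1 << 3;
        const ALL: CacheLevelDomain = CacheLevelDomain(0b1111);

        /// Restricting a domain is a bitwise intersection.
        fn restrict(self, allowed: u8) -> CacheLevelDomain {
            CacheLevelDomain(self.0 & allowed)
        }
        /// The candidate is invalid if a domain becomes empty.
        fn is_failed(self) -> bool { self.0 == 0 }
        /// The choice is fixed once a single bit remains.
        fn is_fixed(self) -> bool { self.0.count_ones() == 1 }
    }

    fn main() {
        // Forbid the read-only cache and the no-cache option.
        let d = CacheLevelDomain::ALL
            .restrict(CacheLevelDomain::L1 | CacheLevelDomain::L2);
        assert!(!d.is_failed() && !d.is_fixed());
    }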

3.5.2 Counters Implementation

Let K be a kernel instance. A counter generic choice c = ((p0, . . . , pk−1), N) ∈ C tracks a distinct sum or a product of values for each combination of objects x ∈ K(p0) × · · · × K(pk−1). To simplify notations, this section assumes that c computes a sum. The implementation is similar for products. A counter

instance c(x) tracks the sum over all combinations of objects y ∈ Ω∗ that respect a list of properties pk, . . . , pm−1 ∈ Π and a formula f(x, y). Specifically, c(x) is defined as:

c(x) = Σ_{y ∈ K(pk) × · · · × K(pm−1)} δ(x, y) · v(x, y)   (3.5)

where δ(x, y) equals 1 if f(x, y) evaluates to true and 0 otherwise. To implement counters, the DSL compiler first creates a new generic choice with a boolean domain to materialize δ. Per the syntax of counter choices in Section 3.3.2, f(x, y) is a conjunction of conditions:

f(x, y) = f0(x, y) ∧ · · · ∧ fn−1(x, y) (3.6) The compiler translates (3.6) into the following constraints. These new constraints follow the structure of regular constraints in the DSL and the compiler treats them as such.

δ(x, y) ∨ ¬f0(x, y) ∨ · · · ∨ ¬fn−1(x, y) (3.7)

∀i < n. ¬δ(x, y) ∨ fi(x, y) (3.8)

The decision vector represents counters with their minimum and maximum possible values, or just the minimum for half counters. The DSL compiler inserts code to update the minimum or the maximum when the domain of an instance of δ changes to 0 or 1. In the case where v depends on the value of another choice c′, the compiler also inserts code to update the counter domain when the minimum or maximum value of c′ changes. The introduction of δ enables us to use propagators that only manipulate a few CSP variables at once instead of all the variables appearing in the sum expression (3.5). The compiler modifies the code of propagators that would restrict the value of the counter to act on δ instead. If a propagator would limit the maximum value of c(x) to M, the generated code forces to 0 the instances of δ that would make c(x) greater than M if set to 1. Conversely, it forces to 1 the instances of δ required for the value of c(x) to reach the minimum valid value. If v depends on another choice c′, propagators also limit the possible values of c′. When a trigger runs, it adds new objects to the sets representing properties. Adding these objects can add terms to the sum (3.5) and increase the maximum possible value of the counter. This could in turn make previously invalid decisions valid, and thus violate the commutativity of decisions. To avoid this, we force δ to 0 for new terms.
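The following sketch illustrates this δ-based bookkeeping for a half counter (minimum only). The HalfCounter and Delta names and the enforce_max method are illustrative, not the generated code.

    // Illustrative half-counter: tracks only the minimum of a sum of terms,
    // each guarded by a boolean delta variable (not the generated code).
    #[derive(Clone, Copy, PartialEq)]
    enum Delta { False, True, Unknown }

    struct HalfCounter {
        terms: Vec<(u64, Delta)>, // (value of the term, state of its delta)
    }

    impl HalfCounter {
        /// Minimum possible value: sum of the terms whose delta is forced to 1.
        fn min(&self) -> u64 {
            self.terms.iter()
                .filter(|(_, d)| *d == Delta::True)
                .map(|(v, _)| *v)
                .sum()
        }

        /// Enforce `counter <= max`: force to 0 every undecided delta whose term
        /// would push the counter past the bound if it were set to 1.
        fn enforce_max(&mut self, max: u64) {
            let min = self.min();
            for (v, d) in &mut self.terms {
                if *d == Delta::Unknown && min + *v > max {
                    *d = Delta::False;
                }
            }
        }
    }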

3.5.3 Quotients Implementation

Let K be a kernel instance. A quotient set Q accounts for classes of equivalent objects x ∈ Ω∗ respecting a property p ∈ Π and a condition f(x), up to a relation r. Q must contain one representative of each equivalence class. The DSL compiler generates code to automatically populate the set. For this, it counts, for each object x ∈ K(p), the number of class representatives in Q that may be equivalent to x. When this number reaches 0 and f(x) evaluates to true for all possible values in the decision vector, the code adds x to Q.

3.5.4 Propagators

We now explain how we generate propagators that enforce constraints. The main difference with classical CSP solvers is that we apply the same constraints to all combinations of objects respecting the same properties. We can thus statically generate code that runs a propagator for a single combination of objects and call it with different combinations of objects. Listing 3.13 gives an example of a propagator that enforces the transitivity of the order generic choice. It implements the constraint expressed in Listing 3.8. The same propagator applies to all triplets of instructions.

A key corollary of statically generating the propagation code is that we do not need to store the constraints in memory. Instead, we generate code that iterates over the sets of objects and calls the propagator code when necessary. This decreases the overhead for copying candidates, as we only need to copy the decision vector and, in a few cases, the kernel representation (when a trigger runs). The CSP linked to a candidate is implicitly deduced from the kernel.

fn propagate_order_0(
    a: &Instruction,
    b: &Instruction,
    c: &Instruction,
    candidate: &Candidate,
) -> Order {
    let order_a_b = candidate.domain.get_order(a, b);
    let order_b_c = candidate.domain.get_order(b, c);
    // Compute possible values for `order_a_c`.
    let mut order_a_c = Order::ALL;
    if order_a_b == Order::AFTER && order_b_c == Order::AFTER {
        order_a_c = Order::AFTER;
    }
    return order_a_c;
}

Listing 3.13: Propagator Enforcing the Order Transitivity

Propagator Generation In the general case, the optimal propagator for a pair of constraints is stronger than the fixed point of the propagators for the individual constraints. That is, the propagator for the combined constraints may detect invalid values earlier than the combination of the propagators for the individual constraints. It is thus beneficial to group constraints together to generate propagators. However, increasing the number of significant variables of a propagator causes it to execute more frequently when restricting a domain. The compiler workflow to generate propagators, laid out in Figure 3.6, works as follows.

1. It normalizes constraints so that each CSP variable appears at most once in a single constraint.

2. It transforms constraints into propagators, with one propagator for each significant variable of the constraint. For now, a propagator has the form of an implication with the condition referencing the restricted variable on the right-hand side of the implication and other conditions on the left-hand side.

3. It applies symmetry rules to introduce new propagators, from existing ones.

4. It groups propagators with similar significant variables. This way, we can later generate propagators combining constraints without increasing the frequency at which they must run.

5. It combines propagators within a group. To this end, it generates a truth table that lists the possible values of the restricted variable depending on the values of the significant variables. The truth table only instantiates the values of enum choices. Other domains have either too many possible values (for counter domains) or values that depend on the kernel instance (for integer domains). Each cell of the truth table thus contains a partially evaluated expression listing possible values. Creating the truth table allows the compiler to evaluate propagators on fixed domains, where they have maximal precision, and to combine propagators independently in each cell of the truth table.

6. It converts the truth tables into propagators. For that, it builds trees of if-conditions that compute the union of the cells of the truth table corresponding to the assignments of the input variables licensed by the current domain. A small worked example of such a truth table is given after this list.
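For instance, considering only the constraint of Listing 3.8, the truth table for the propagator that restricts order($a, $c) from the values of order($a, $b) and order($b, $c) would be:

    order($a, $b)   order($b, $c)   allowed order($a, $c)
    AFTER           AFTER           {AFTER}
    AFTER           BEFORE          {BEFORE, AFTER}
    BEFORE          AFTER           {BEFORE, AFTER}
    BEFORE          BEFORE          {BEFORE, AFTER}

The if-tree of step 6 then computes, for the values still present in the domains of the two input choices, the union of the corresponding cells; Listing 3.13 shows the resulting code for this propagator.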

3.5.5 Propagation Loop

To enforce propagators, update counters and run triggers, the DSL compiler generates a function propagate that maintains a set U of the CSP variables whose domain has been updated. This function, outlined in Algorithm 1, iteratively picks an updated variable v ∈ U and runs the propagators that

depend on v. At the same time, it also updates affected counters and runs triggers whose condition is enabled by the modification of v. It adds back to U any variable that it modifies in the process. The propagate function runs each time the search algorithm restricts a domain.

To initialize the decision vector in the first place, we use a different strategy. Directly using propagate would lead to the execution of the same propagator once for each of its significant variables. Instead, the initialization procedure, init, calls all propagators at once, computes the initial value of counters and tests if each trigger should run. It then runs all propagators a second time and calls propagate with U only containing the list of variables whose domains were restricted during the second run of propagators.

Running a trigger adds new objects to the sets representing the kernel properties. These objects induce new CSP variables and new constraints. When this occurs, propagate calls a function partial_init to extend the decision vector and to enforce the new constraints. It follows the same strategy as the init procedure, but only applies to the new CSP variables and constraints.

Figure 3.6: Propagators Generation Steps (AST constraints → Normalize → Convert to Implications → Propagators → Apply Symmetry Rules → Group Propagators with Similar Inputs and Outputs → Propagator Groups → Instantiate Enum Values → Truth Table → Generate an If-Tree → Optimized Propagators → Propagators Code)

3.6 Discussion

The key innovation of our approach is to work on classes of implementations up to semantics-preserving implementation decisions. We first discuss in Section 3.6.1 how this obviates the problem of optimization ordering and in Section 3.6.2 how this enables global heuristics aware of all potential optimizations. We then explain in Section 3.6.3 how our system differs from other encodings of compilation problems as operations research problems.

3.6.1 Partial Implementations

Ordering decisions is a common problem in compilers: transformations may enable, disable or even undo others. Some domain-specific approaches tame the issue with a carefully curated set of rewrite rules that provides a well-defined implementation space. They build and explore a tree (or graph) whose nodes are implementations and whose edges are rule applications.

Algorithm 1: Constraint Propagation Procedure propagate
Data: U, the list of modified variables
while U ≠ ∅ do
    remove v from U;
    foreach propagator p depending on v do
        run p and insert the variables it modifies in U;
    end
    foreach counter c whose value depends on v do
        update the counter and add c to U;
    end
    foreach trigger t where v appears in the triggering condition f do
        if the change to v makes f evaluate to true then
            update the kernel representation;
            call partial_init(U);
        end
    end
end

For example, Spiral [4] uses this approach for fast Fourier transforms, LGen [5] for linear algebra kernels and TCE [11] for large tensor contractions in computational chemistry and physics simulation. More recently, LIFT [6] applied rules to rewrite a functional representation down to low-level OpenCL constructs and generate efficient GPU code. However, the initial problem is still present: some transformations may be hidden behind others. We do not see how one could access the full set of potential downstream decisions from an early point of the compilation process without building the full search tree. An alternative is to use algorithmic skeletons [12] and to map them to the hardware using a fixed strategy that leverages domain-specific information. In their simplest form, they can be provided as parametric libraries, such as Thrust [13] and SkelCL [14]. Otherwise, they take the form of high-level functional operators in a domain-specific language such as Delite [15], Copperhead [16], Accelerate [17] or NOVA [18]. While these systems allow for optimizations, such as the fusion of operators, they rely on a fixed strategy that makes it hard to adapt to different hardware, different input sizes and new kinds of computations. The Sea of Nodes approach [19] places instructions outside of basic blocks when possible, effectively representing classes of implementations up to scheduling decisions. However, it is limited to scheduling decisions.

3.6.2 Global Heuristics

The partially instantiated decision vector offers a complete view of potential implementations. This allows for the definition of global heuristics that are aware of what a fully optimized implementation may look like. The lower bound performance model mentioned in Chapter 5 could not work if it only had access to an intermediate implementation in the compilation process. We had previously introduced a similar performance model that relies on ad-hoc partial implementations [20]. We generalize the idea by encoding partial implementations as a CSP on top of a semantic backbone.

Our approach is close to Equality Saturation [21], with similar benefits. Equality Saturation uses rewrite rules but keeps both the original and the rewritten pattern. However, the number of patterns can grow exponentially with the number of rules applied. In contrast, extending the decision space with new decisions only adds a new partially specified decision to the decision vector. The fixed-size decision vector also makes it easier to extract features for machine learning algorithms.

Frameworks that dissociate the algorithm from the schedule, such as Halide [22] for image processing and TVM [23] for machine learning, arguably also deal with partial implementations. The algorithm is akin to our kernel representation and the schedule to our decision vector. However, they do not have an easy way to reason about partial schedules. TVM applies machine learning techniques, but only on fully specified implementations. An interesting idea would be to use our approach on top of their

representation to explore schedules. This is also true for other script-based transformation tools such as UTF [24] or URUK [25] that start from a fully specified implementation, but lift it into a mathematical representation that abstracts the initial schedule.

3.6.3 Encoding as an Operations Research Problem

The idea of encoding compilation problems in logical frameworks is not new. Polyhedral compilation encodes transformations on loop nests as Integer Linear Programming [26, 27]. Superoptimization techniques also use CSP [28] and SAT [29] solvers to generate optimal sequences of instructions. The originality of our approach is to use CSPs to expose potential decisions, instead of using them to find a solution (Section 4.5.2 shows that finding a solution is easy in our case). This allows us to manipulate whole sets of potential implementations, which is the main motivation for using CSPs: it is easier to guide the search through the domains, while SAT and ILP solvers often act more like black boxes. ILP-based approaches maximize a single metric (such as data locality in polyhedral schedules [30]) that does not reflect the complexity of the architecture. Since we do not try to embed performance constraints in the logical framework, we are more flexible and can use a combination of custom heuristics, actual evaluations and statistical search.

One way of solving this problem while staying in an ILP framework is to find schedules for a single loop level at a time, starting from the outermost loop [30]. This enables more complex heuristics and incremental evaluation at each level, going beyond the expressive power of linear objective functions [31]. However, loop levels must still be considered in a fixed order. It would be interesting to use our approach with a similar encoding. Polyhedral compilers have been designed with search algorithms complementing or replacing linear programming. Pouchet et al. proposed custom genetic operators on affine schedules to search for dependence-preserving loop nest optimizations [32, 33]. Vasilache et al. constructed a two-level optimization strategy where the lower tier is a gray box based on integer linear programming, exposing a fixed set of strategies and parameters to a higher-tier genetic algorithm [34].

3.7 Summary

This chapter presented a novel approach for program optimization. The compiler progressively restricts a candidate that represents a whole set of potential implementations. This contrasts with traditional approaches that iteratively apply transformations to a single implementation. Our approach exposes optimization decisions upfront and makes sure that they are commutative, obviating the problem of optimization ordering. A search algorithm can make the most performance-impacting decisions first and expert programmers can manually specify any decision if necessary.

At the core of our representation of candidates is a decision vector listing all possible decisions for each choice. It allows for the definition of global heuristics that operate on sets of potential implementations. They are aware of what the final implementation may look like and can derive pertinent information, such as profitability estimates to guide the search and performance bounds to prune the space, long before all decisions have been made. Chapter 5 presents how to use our representation to define a performance model that provides a lower bound on the performance achievable by a candidate implementation. Chapter 6 integrates our approach into actual search algorithms.

Decision Space Formalism. We based our concept of candidate on Constraint Satisfaction Problems. However, instead of using CSPs as optimization problems, we use them as a representation of the available implementation decisions. In particular, we have no analytical objective function. An external algorithm drives the search towards the best implementation using empirical evaluation. Section 4.5.2 shows that finding a single implementation is not a problem, as CSP variable domains rarely contain invalid values. The formalism we propose for candidates defines a decision space as a generic CSP to instantiate for each kernel instance. This way, the programmer can specify the decision space once and automatically apply it to different kernel instances. The formalism supports lowering high-level constructs, when some condition is met, without invalidating other decisions. This is useful, for example, to generate store and load instructions after deciding to store a variable in memory. While relying on the lowering of high-level constructs in the decision space is useful to avoid cluttering the implementation space with decisions that may not appear in the final implementation, we can always expose the choices and constraints

41 they introduce directly in the initial candidate. This ensures our claim that we can expose any decision upfront is not invalidated.

Candidate Representation Generation from the DSL. Starting from a declarative description of implementation choices and their interactions, we built a tool that generates code to represent and manipulate candidates. This tool facilitates building, debugging and experimenting with the decision space, and porting our approach to different architectures or domains of application. We could implement our tool on top of a standalone CSP framework; however, apart from the need for a domain-specific front-end to provide the above-mentioned independence from the kernel instance, this would involve several optimizations and customizations to reproduce the search strategy and match the efficiency of our approach, motivating a custom design. In particular, we use the fact that the CSP definition is independent of the kernel instance to avoid storing constraint instances in memory. Instead, we statically generate code that we call for different objects. This reduces the memory footprint of the CSP to the sole decision vector, making it compact, easy to serialize for logging and fast to copy. Chapter 4 applies our tool to the construction of a decision space for linear algebra kernels running on GPUs. We use this candidate representation with a search algorithm in Chapter 6. Overall, our domain-specific language for defining decisions and constraints proved useful for developing a decision space for linear algebra, facilitating the design of and experimentation with numerous decisions and constraints. We are not aware of any other approach reaching the same level of automation while remaining competitive with native GPU libraries.

Chapter 4

Application: Linear Algebra on GPUs

Graphics Processing Units (GPUs) offer massive parallelism, with a raw processing power an order of magnitude above contemporary multicores. For dense linear algebra, this parallelism is game changing. For instance, the recent success of deep learning methods can largely be attributed to the performance of GPU-accelerated linear algebra and convolution libraries [1, 2]. Unfortunately, as explained in Chapter 1, writing or generating efficient code for GPUs is a complex problem. It requires making careful decisions, such as mapping computations to the appropriate level of parallelism and moving data across the memory hierarchy and through memory spaces of different sizes and properties, in addition to the many possible thread-local optimizations. While the set of potential implementation decisions is usually well known, complex interactions between them make it hard to find the best global implementation. Producing highly tuned libraries, and specializing them to a given target or problem size, remains a critical challenge.

This chapter proposes a candidate representation for linear algebra kernels on GPUs, following the ideas and the formalism of Chapter 3. It shows that our approach is sufficiently expressive to model the interaction between complex transformations that fundamentally change the structure of the code. We present both the kernel representation and the decision space, using the DSL presented in Section 3.3 for the latter. We chose to focus on the transformations that matter most for performance. Rather than directly exposing switches to enable or disable a particular optimization, our representation exposes many small knobs that encode the local structure of the code. From there, supporting a transformation translates to ensuring that the knobs can represent both the original and the transformed implementation.

Contributions. This chapter explains how we encode the implementation space, while Chapters 5 and 6 show how to use this encoding to derive pertinent information early in the compilation process and to find good implementations. Specifically, this chapter makes the following contributions.

• It demonstrates that the candidate formalism is expressive enough to represent complex decisions that fundamentally change the structure of the generated code, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestrating data movements across the memory hierarchy.

• It studies the design of decision spaces, based on the experience we gained from the decision space we built for linear algebra kernels on GPUs. In particular, it shows that finding a valid implementation does not require a complex solver beyond constraint propagation. This confirms the assertion made in Chapter 3 that individual decision domains accurately represent valid decisions.

• It provides a novel program representation for optimizing linear algebra kernels for GPUs. Chapter 6 shows that this representation allows for the generation of code that is competitive with carefully hand-tuned reference libraries, and that outperforms them in some cases.

Outline. Section 4.1 first gives some background on GPU architectures. Section 4.2 then presents how the kernel representation encodes the semantics of the code to generate, Section 4.3 shows how we encode transformations in our formalism and Section 4.4 explains how we generate code from a fixed candidate. Finally, Section 4.5 discusses the specificities of our representation of GPU programs and the insights it provides on the design of candidate representations, while Section 4.6 summarizes the chapter and explains how we use our candidate representation in subsequent chapters.

4.1 Background: GPUs Architecture

This section provides some background on the CUDA programming model [35] and the architecture of CUDA GPUs. Other GPUs rely on a similar programming model and architecture [36]. We first present the parallelism model of GPUs in Section 4.1.1 and their memory hierarchy in Section 4.1.2.

4.1.1 Parallelism Model

Thread Hierarchy. A CUDA program defines a function that is executed N times in parallel, by N different threads. Each thread has a unique identifier composed of two 3-dimensional vectors: a thread index (tx, ty, tz) ∈ N³ and a block index (bx, by, bz) ∈ N³. These vectors identify threads as points in a 6-dimensional hyper-rectangle. Threads sharing the same block index form a block, while the different blocks form a grid. Thread indexes identify threads within a block while block indexes identify blocks within the grid.

We call (Tx, Ty, Tz) the number of threads along each dimension of the thread index and (Bx, By, Bz) the number of blocks along each dimension of the block index. Every (bx, by, bz) ∈ N³ such that bx < Bx, by < By and bz < Bz indexes a block, while every (tx, ty, tz) ∈ N³ such that tx < Tx, ty < Ty and tz < Tz indexes a thread in each block. Figure 4.1 gives an example of a 2-dimensional grid with 2-dimensional blocks. In that case, (Tx, Ty, Tz) = (2, 2, 1) and (Bx, By, Bz) = (3, 2, 1).
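For instance, in the configuration of Figure 4.1, the total number of threads in the grid is a straightforward consequence of the definitions above (it is not a value stated in the figure itself):

    (Bx · By · Bz) · (Tx · Ty · Tz) = (3 · 2 · 1) · (2 · 2 · 1) = 6 · 4 = 24.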


Figure 4.1: CUDA Thread Hierarchy in 2D
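As a concrete illustration of this identifier scheme (this sketch is not part of our code generator and only restates the CUDA numbering convention), the linear index of a thread within its block is obtained by treating the thread index as a mixed-radix number, with tx varying fastest; warps, introduced below, group threads with consecutive linear indexes.

WARP_SIZE = 32

def linear_thread_index(tx, ty, tz, Tx, Ty):
    # CUDA's intra-block numbering: tx varies fastest, then ty, then tz.
    return tx + Tx * (ty + Ty * tz)

def warp_of(tx, ty, tz, Tx, Ty):
    return linear_thread_index(tx, ty, tz, Tx, Ty) // WARP_SIZE

# With the blocks of Figure 4.1, (Tx, Ty, Tz) = (2, 2, 1): the four threads of a
# block receive linear indexes 0..3 and all belong to warp 0.
assert [linear_thread_index(tx, ty, 0, 2, 2) for ty in range(2) for tx in range(2)] == [0, 1, 2, 3]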

Threads within a block share the same L1 cache, have access to a fast shared memory and have fast synchronization primitives. In contrast, communication and synchronization across blocks are expensive.

Warps. Within a block, CUDA groups consecutive threads into warps, following the lexicographical order on thread indexes. A warp contains up to 32 threads. On the GPUs we consider, threads within a warp execute in lockstep; that is, they all execute the same instruction at the same time on different data. Different warps may execute in any order.

Figure 4.2: CUDA GPUs Architecture (two SMXs, each with its warps, execution units, L1 cache and shared memory, sharing an L2 cache and the RAM)

Mapping Threads to the Hardware. Figure 4.2 presents the architecture of CUDA GPUs. A GPU processor contains multiple Streaming Multiprocessors (SMX for short) that execute warps in parallel. This figure features two SMXs, but a GPU may contain more. CUDA dispatches blocks across SMXs. A block may need to wait for other blocks to finish if all blocks do not fit on the SMXs at once. The number of parallel blocks an SMX supports depends on the amount of memory and registers each block allocates. An SMX contains a multitude of small cores that provide different kinds of execution units. At each cycle, an SMX can schedule a few warps on the cores and execute one or two instructions for each thread in these warps. Cores are pipelined, so the SMX can hide instruction latency by scheduling other warps while stalled warps complete.

4.1.2 Memory Hierarchy

GPUs have two memory spaces: global memory and shared memory. Global memory is stored in RAM and is accessible by any thread, while shared memory is stored within SMXs and is private to each block. In practice, CUDA also exposes a constant memory space and a memory space local to each thread, but both are implemented on top of the global memory.

Global Memory. Figure 4.2 shows the memory hierarchy of CUDA GPUs. Global memory accesses may use an L1 and an L2 cache. The L1 is not coherent, so it is mostly used for accesses to constant data or to data local to a thread. Load and store instructions to global memory can choose to bypass the L1 or both caches. Memory units coalesce accesses within a warp: they load L1 cache lines only once, even if multiple threads access them. This is critical for performance, as most memory accesses bypass the L1. When threads access different cache lines, performance drops as the SMX needs to perform more memory operations.
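The following minimal sketch (an illustration, not part of the thesis' tooling; the 128-byte cache-line size is an assumption typical of the GPUs discussed here) shows why coalescing matters: it counts how many distinct cache lines a warp touches for a given per-thread stride.

LINE_SIZE = 128  # bytes; assumed line size
WARP_SIZE = 32

def lines_touched(stride_bytes):
    # Each thread t of the warp accesses address t * stride_bytes.
    addresses = [t * stride_bytes for t in range(WARP_SIZE)]
    return len({addr // LINE_SIZE for addr in addresses})

print(lines_touched(4))    # unit-stride 4-byte accesses: 1 line, fully coalesced
print(lines_touched(128))  # one line per thread: 32 memory transactions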

Shared Memory. Shared memory is stored within an SMX. It provides fast access but has a limited size and is only shared between the threads of the same block. It does not have any caches as it is already close to execution units. Reusing data across threads using the shared memory is crucial to reduce the global memory footprint.

4.2 Kernel Representation

A candidate contains two parts: a kernel representation that encodes the semantics of the code to generate and a decision vector that encodes already fixed and open implementation choices. This section presents the kernel representation we defined for linear algebra on GPUs.

The main goal of the kernel representation is to define the semantics of the kernel without imposing a particular structure on the final code, leaving room for implementation decisions to reorganize computations. The kernel representation thus defines the operations to be executed without mentioning the loop structure nor the evaluation order. Since we focus on dense linear algebra, we assume that all loops are for-loops with data-independent bounds (i.e., loop bounds only depend on the input size) and that memory accesses are linear accesses to n-dimensional tensors.

We currently do not support any particular input language to define the kernel representation. Instead, a programmer must create the kernel representation with API calls. We intend to solve this in the future, but the kernels we consider are still sufficiently small to build them with API calls, thus allowing us to focus on the kernel representation and the implementation space representation.

A kernel is translated to a single function in the generated code. The kernel representation defines a function with a name, a list of arguments and a body that specifies the computations the kernel performs. The rest of this section explains how we encode the body of a kernel. Section 4.2.1 explains how we represent the computations and their iteration space, Section 4.2.2 discusses how we represent communication between instructions and Section 4.2.3 shows how we deal with memory accesses and arrays.

4.2.1 Instructions and Iteration Dimensions

Our goal is to represent dense linear algebra programs. Such programs are usually composed of multiple loop nests, with instructions nested inside the loops. However, in our case, the concept of loop is not directly applicable as the structure of the code is not yet specified. Instead, each instruction has an iteration space that specifies how it should be repeated. We first define the notion of iteration space, then explain how we specify instructions and finally detail how we handle strip-mining.

Iteration Space. Let's consider a loop nest of depth n in a traditional imperative programming language. Each iteration of the loop nest corresponds to an integer point in an n-dimensional space, specified by the vector composed of the index along each loop. In our case, we only consider loops with data-independent bounds, so that the iterations form an n-dimensional hyper-rectangle. Each instruction nested inside the loop nest is executed once for each point in the hyper-rectangle. We call this hyper-rectangle the iteration space of the instruction. For example, Figure 4.3 gives the iteration space of an instruction in a loop nest of depth 2.

for i in 0..6:
    for j in 0..4:
        x = foo()

(a) Loop Nest with an Instruction        (b) Iteration Space of x (a 6 × 4 grid of points along i and j)

Figure 4.3: 2-Dimensional Iteration Space

As mentioned earlier, the concept of loop is not directly applicable to our case, as the structure of the code may not be specified yet. Instead, we describe the iteration space of each instruction by defining the dimensions of the hyper-rectangle. To that end, we introduce the concept of iteration dimension. An iteration dimension has a size, potentially fixed by implementation decisions as explained below. An iteration space is the product of any number of iteration dimensions. The size of the iteration space is the product of the sizes of its dimensions. The same dimension may appear in the iteration space of multiple instructions. Decisions may specify to implement iteration dimensions in various ways as long as the generated code executes the instruction the correct number of times along each dimension. For example, a dimension could be implemented as a regular counted loop, or mapped to a parallel dimension of the GPU.

Instructions. The kernel body defines a list of instructions to execute, each with its iteration space. An instruction can be an arithmetical operation on one of the basic data types supported by the GPU, a move, a cast operation or a memory access. Instruction operands can be values produced by another instruction, constants, values of inputs of the kernel, indexes along one of the dimensions of the iteration space of the instruction, or addresses of elements from in-memory tensors. Section 4.2.3 details the specificities of memory access instructions and memory address operands.

Logical Iteration Dimensions. We expose strip-mining opportunities by creating multiple iteration dimensions for each dimension of the original problem. A dimension d strip-mined k times becomes a set of dimensions {d0, . . . , dk}. Dimensions d1, . . . , dk each have a list of possible sizes that correspond to the strip-mining factors, while the size of d0 is inferred from the size of the other dimensions, so that the loop nest of d0, . . . , dk matches the size of the original dimension. A strip-mining with a factor of 1 corresponds to no strip-mining. In the rest of this chapter, we call logical dimension the dimension d of the original problem.

The kernel representation maintains both the list of logical dimensions and the list of actual iteration dimensions. The logical dimension lists the possible strip-mining factors of the original dimension, that is, the possible values of the product size(d1) × · · · × size(dk). This is useful to ensure that the size of the logical dimension is a multiple of the strip-mining factor.

Example: Vector-Scalar Multiplication. Table 4.1 gives the representation of a kernel that computes a vector-scalar multiplication. This kernel takes a scalar α and two arrays, X and Y, of size n as input and computes Y = α × X.

Instruction                       Iteration Space
x = load X[⟨d0, d1, d2⟩]          d0 × d1 × d2
y = x * α                         d0 × d1 × d2
store y in Y[⟨d0, d1, d2⟩]        d0 × d1 × d2

(a) List of Instructions

Name   Size   Dimensions      Strip-Mining Factors
d      n      {d0, d1, d2}    {1, 2, 4, 8, 16}

(b) List of Logical Dimensions

Dimension   Possible Sizes
d0          n / (size(d1) · size(d2))
d1          {1, 2, 4, 8}
d2          {1, 2, 4, 8}

(c) List of Dimensions

Table 4.1: Kernel Representation for the Vector-Scalar Multiplication

This table shows the list of instructions, the list of logical dimensions and the list of actual iteration dimensions. The kernel has three instructions that respectively load a value of X, perform the multiplication by α and store the result in Y. Each instruction has the same iteration dimensions d0, d1 and d2. The expression ⟨d0, d1, d2⟩ in the memory accesses converts the current index along dimensions d0, d1 and d2 into a unidimensional index.

⟨d0, d1, d2⟩ := index(d2) + size(d2) · (index(d1) + index(d0) · size(d1))    (4.1)

Iteration dimensions d0, d1 and d2 form a single logical dimension d of size n. The sizes of d1 and d2 both take values in {1, 2, 4, 8}, but the total strip-mining factor size(d1) × size(d2) cannot exceed 16, as imposed by the list of possible strip-mining factors of d.
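As an illustration of how this constraint prunes joint size assignments (a sketch only; the real pruning is done by the constraint propagators of Chapter 3), the following Python fragment enumerates the (size(d1), size(d2)) pairs of Table 4.1 whose product is an allowed strip-mining factor of d.

possible_sizes = {"d1": [1, 2, 4, 8], "d2": [1, 2, 4, 8]}
allowed_factors = {1, 2, 4, 8, 16}

valid = [(s1, s2)
         for s1 in possible_sizes["d1"]
         for s2 in possible_sizes["d2"]
         if s1 * s2 in allowed_factors]
print(valid)  # e.g. (8, 4) is rejected because 32 is not an allowed factor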

4.2.2 Data Communication

We now explain how we represent data communication between different iteration spaces, and between different points of the same iteration space.

Point-to-Point Communication. Let a and b be two instructions such that a produces a value x and b consumes it. Instruction a produces a value for x at each point of its iteration space. By default, b reads the value of x produced within the same iteration of the common iteration dimensions of a and b and at the last iteration of the remaining dimensions. For example, if the iteration space of a is {d0, d1} and the iteration space of b is {d0, d2}, then the iteration (i, j) of b will read the value produced at iteration (i, size(d1) − 1) of a.

Let's now assume d1 and d2 have the same size. The operand can specify that it maps dimension d1 to d2. In that case, iteration j of d2 will use the values produced at iteration j of d1. Each operand that reads the value produced by another instruction has a list of mapped dimensions, implementing point-to-point communication between iteration dimensions. Decisions can choose to implement point-to-point communication by fusing the source and destination iteration dimensions, or by temporarily storing the data in arrays allocated in registers or in memory.
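The reading rule can be summarized by the following sketch (illustration only; the function and its arguments are hypothetical helpers, not part of the actual representation): for each dimension of the producer, the consumer reads the same iteration if the dimension is shared, the mapped iteration if it is explicitly mapped, and the last iteration otherwise.

def producer_iteration(consumer_iter, producer_dims, sizes, mapped=None):
    mapped = mapped or {}                      # consumer dim -> producer dim
    inverse = {p: c for c, p in mapped.items()}
    point = []
    for d in producer_dims:
        if d in consumer_iter:                 # shared dimension: same iteration
            point.append(consumer_iter[d])
        elif d in inverse:                     # mapped dimension: mapped iteration
            point.append(consumer_iter[inverse[d]])
        else:                                  # unrelated dimension: last iteration
            point.append(sizes[d] - 1)
    return tuple(point)

sizes = {"d0": 4, "d1": 3, "d2": 3}
# a iterates over {d0, d1}, b over {d0, d2}: by default b(i, j) reads a(i, size(d1) - 1)...
print(producer_iteration({"d0": 1, "d2": 0}, ["d0", "d1"], sizes))                # (1, 2)
# ...unless d2 is mapped to d1, in which case b(i, j) reads a(i, j).
print(producer_iteration({"d0": 1, "d2": 0}, ["d0", "d1"], sizes, {"d2": "d1"}))  # (1, 0)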

Loop Carried Variables. Alternatively, an operand of an instruction a can take the value produced by a at the previous iteration along a given set of dimensions. In that case, the operand also specifies an instruction that initializes the value at the first iteration. In particular, this construct allows for the definition of computations that perform a reduction along a set of iteration dimensions. Note that the operand may still specify point-to-point communication between the initialization and the reduction instructions.

As an example, let a and b be two instructions with respective iteration spaces {d0, d1} and {d0}. Let's assume that instruction a consumes the value it produces at the previous iteration of d1, initialized with b. The result is that iteration (i, j) of a will read the value produced by iteration (i, j − 1) of a if j > 0, and iteration i of b otherwise.

4.2.3 Arrays and Memory Accesses

In addition to instructions and dimensions, the kernel representation also specifies a list of arrays. An array can either be an input of the kernel or an internal array that holds temporary values exchanged between two instructions. In the latter case, the size and layout of the array are not fixed by the signature of the function, but by decisions. The kernel thus exposes two kinds of arrays: regular arrays with a fixed size and layout, and partially specified arrays that only specify the element size and the list of potential dimensions.

Memory access instructions can specify the stride of memory accesses along each iteration dimension. Without this information, the performance model ignores the instruction and some optimizations are forbidden. The access patterns our kernel representation can express with strides correspond to linear accesses to n-dimensional tensors.

Linear algebra kernels often perform linear accesses to n-dimensional tensors. This results in memory address computations that have the following form:

    a = a0 + Σ i<k  si · index(di)    (4.2)

where for all i < k, di is a dimension from the iteration space of the current instruction and si ∈ N a constant stride. Generating optimized code for such computations is straightforward. Therefore, we provide a special construct to express addresses of linear accesses into tensors without exposing the internal details of the computation. This avoids polluting the implementation space with decisions about code that can be generated following a fixed strategy. The code generator lowers these expressions once all decisions are fixed.
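The computation of Equation 4.2 is simple enough to be fixed by the code generator. The following Python sketch (illustration only) just restates the formula: the address is the base address plus one stride per iteration dimension, multiplied by the current index along that dimension.

def linear_address(base, strides, indexes):
    # strides and indexes map each iteration dimension d_i to s_i and index(d_i).
    return base + sum(strides[d] * indexes[d] for d in strides)

# Row-major access A[i][j] to a matrix of 4-byte elements with 64 columns:
print(linear_address(0x1000, {"i": 64 * 4, "j": 4}, {"i": 2, "j": 3}))  # 0x1000 + 524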

4.3 Decisions Encoding

We now describe the second part of candidates for linear algebra on GPUs: the decision vector and its associated constraints and triggers. We give the definition of all the generic choices that compose the decision vector using the DSL presented in Section 3.3. Overall, generic choices focus on exposing small decisions rather than entire transformations. We detail the optimizations each generic choice enables.

To define generic choices, we first need to declare the sets of objects exposed by the kernel representation. This comprises the three sets described in Section 4.2, but also some derivatives of these sets, such as the list of instructions that respect a certain property. For the sake of clarity, we omit the attributes that specify how to generate code from set definitions. The elided code is indicated with three dots.

We also expose some constraints, when they help understand how we enforce the correctness of the final implementation and how decisions interact with each other. For the full list of constraints, the reader should refer to the decision space definition in the source code of our code generator for linear algebra kernels on GPUs. Constraints serve the following purposes.

• They enforce the coherency between decisions, e.g. that the instruction order is transitive.

• They ensure that the implementation respects the semantics of the kernel representation, e.g. that it respects data dependencies.

• They enforce hardware-specific limitations, for example that the size of the arrays allocated in a memory space does not exceed the size of that memory space.

Finally, we specify triggers that lower high-level constructs in the kernel representation. This only concerns the implementation of point-to-point communication between iteration spaces through arrays.

Section 4.3.1 first details the choices and constraints relative to iteration dimensions. Next, Section 4.3.2 explains how decisions specify the structure of the generated code: the organization of loops and the order of instructions. Then, Section 4.3.3 explains how our decision space maps computations to hardware threads and Section 4.3.5 presents memory spaces and caching decisions. Finally, Section 4.3.4 details how we handle point-to-point communication.

4.3.1 Iteration Dimensions

This section first explains how we expose iteration dimensions in the decision space, then how we choose to implement them, and last how we specify and constrain strip-mining factors.

Dimension Sets. Listing 4.1 shows the declaration of the Dimensions set that lists all iteration dimensions. It is a subset of Statements, later defined in Section 4.3.2. Within Dimensions, we distinguish two kinds of dimensions: static dimensions whose size is known during code generation and dynamic dimensions whose size depends on the kernel inputs. The StaticDims set lists the iteration dimensions with a static size. Such dimensions also return true to the $dim.is_static() method call.

set Dimensions subsetof Statements: ... end

set StaticDims subsetof Dimensions: ... end

Listing 4.1: Dimension Sets Declaration

Dimension Implementation. Listing 4.2 defines the dim_kind generic choice that specifies how to implement iteration dimensions. An iteration dimension d of size n may be implemented as:

• a regular counted loop (dim_kind = LOOP),

• a parallel loop mapped to a block (dim_kind = BLOCK) or thread dimension (dim_kind = THREAD) of the GPU,

• a fully unrolled loop, by repeating its body n times in the generated code (dim_kind = UNROLL),

49 • a fully vectorized loop, by replacing instructions iterated over d by vector instructions of width n (dim_kind = VECTOR).

choice enum dim_kind($dim in Dimensions):
  // $dim is a regular for loop.
  value LOOP:
  // $dim is a hardware block dimension.
  value BLOCK:
    requires "$dim.is_parallelizable()"
  // $dim is a hardware thread dimension.
  value THREAD:
    requires "$dim.is_static() && $dim.is_parallelizable()"
  // $dim is totally unrolled.
  value UNROLL:
    requires "$dim.is_static()"
  // $dim is vectorized.
  value VECTOR:
    requires "$dim.is_static() && $dim.is_parallelizable()"
end

Listing 4.2: Dimensions Decisions Encoding

THREAD, UNROLL and VECTOR decisions only apply to static iteration dimensions. In the case of THREAD, this is to ensure we do not exceed the maximal number of threads supported by CUDA in a block. For UNROLL and VECTOR, this is for the code generator to know how many times to replicate the body of the loop or which vector instruction width to use. BLOCK, THREAD and VECTOR also require the loop to be parallelizable, that is, to be free of loop-carried dependencies.

Strip-Mining. As explained in Section 4.2.1, the kernel representation offers strip-mining possibilities by splitting a single logical dimension of the original problem into multiple iteration dimensions, with a size fixed by decisions.

choice integer size($dim in StaticDims):
  "$dim.possible_sizes()"
end

set LogicalDims: ... end
set StripedDim($logical_dim in LogicalDims) subsetof Dimensions: ... end
set StripDims($logical_dim in LogicalDims) subsetof StaticDims: ... end

choice counter strip_mining_factor($logical_dim):
  forall $dim in StripDims($logical_dim):
    mul size($dim)
end

require forall $logical_dim in LogicalDims:
  strip_mining_factor($logical_dim) = "$logical_dim.possible_strip_minings()"

Listing 4.3: Strip-Mining Decisions Encoding

Listing 4.3 shows the definitions that support strip-mining decisions. It first defines a size generic choice that specifies the size of static dimensions. This choice takes values in a list of possible sizes.

For a dimension that is not part of a strip-mined logical dimension, this list contains a single value. Otherwise, it contains the potential tiling factors.

Next, Listing 4.3 shows the definition of three sets (LogicalDims, StripedDim and StripDims), a counter (strip_mining_factor) and a constraint that limits the strip-mining factor of dimensions to valid values. The LogicalDims set lists logical dimensions, while StripedDim and StripDims list the dimensions that compose a given logical dimension. Let d be a logical dimension of size n composed of dimensions d0, . . . , dk, where d0 is the dimension with the outermost index and dk the dimension with the innermost index. The content of StripedDim and StripDims differs depending on whether n depends on the kernel inputs.

• If n depends on the kernel inputs, then StripedDim = {d0} and StripDims = {d1, . . . , dk}. The size of d1, . . . , dk is fixed by the size choice, while the size of d0 is inferred at code generation from the size of the other dimensions. The strip_mining_factor counter computes the strip-mining factor applied to d0 and limits it to the values listed by $logical_dim.possible_strip_minings().

• If n is static, then StripedDim is empty and StripDims contains all dimensions. In that case, the size choice specifies the size of all dimensions and strip_mining_factor computes the total size of d0 × · · · × dk. We set $logical_dim.possible_strip_minings() to n so that the product of the dimension sizes is equal to the size of the logical dimension.

4.3.2 Ordering, Fusion and Nesting

We now explain how we specify the control flow structure of the generated code. As detailed in Section 4.3.1, we implement iteration dimensions as loops. These loops may be sequential, parallel, unrolled or vectorized, but all have the same basic structure: a body composed of statements that is executed for each iteration of the loop. We first define the order choice that specifies the sequential order, nesting and fusion of loops and instructions. We then give the constraints that ensure that the generated control flow respects the semantics of the kernel. Finally, we explain how we represent the iteration space of each instruction.

Order Choice. We define a statement as either an instruction or an iteration dimension. We further define the order generic choice between pairs of statements as an antisymmetric relation that dictates the sequential ordering and the nesting of statements, as well as loop fusion. Listing 4.4 gives the definition of the Statements set and the order choice, with Instructions and Dimensions subsets of Statements. Let x, y be two statements. The order between x and y can take the following values.

• BEFORE indicates that x executes before y.

• AFTER is the antisymmetric counterpart of BEFORE.

• IN indicates that y is an iteration dimension with x nested inside the body of y.

• OUT is the antisymmetric counterpart of IN.

• MERGED indicates x and y are iteration dimensions implemented as fused loops.

For example, Listing 4.5 shows the effects of setting the relative order of two iteration dimensions x and y to either BEFORE, OUT or MERGED. The latter case assumes that x and y have the same size. Note that the relative placement of do_x_body and do_y_body is not fixed by the sole order between x and y. The only constraint here is that they must be inside their respective loops.

The order choice gives the flexibility to perform several control flow optimizations. In particular, it can encode the following transformations.

• It encodes loop invariant code motion and rematerialization by setting the relative order of a statement and a dimension to BEFORE, IN or AFTER.

• It encodes loop interchange by setting the relative order of two dimensions to IN or OUT.

• It encodes loop distribution and fusion by setting the relative order of two dimensions to BEFORE, AFTER or MERGED.

• It allows reordering statements by setting their relative order to BEFORE or AFTER.

set Statements: ... end
set Instructions subsetof Statements: ... end
set Dimensions subsetof Statements: ... end

choice enum order($lhs in Statements, $rhs in Statements):
  // $lhs executes before $rhs
  value BEFORE:
  // $rhs executes before $lhs
  value AFTER:
  // $lhs is nested inside $rhs
  value IN:
  // $rhs is nested inside $lhs
  value OUT:
  // $lhs and $rhs are fused dimensions
  value MERGED:
  antisymmetric:
    BEFORE <-> AFTER
    IN <-> OUT
end

Listing 4.4: Control-Flow Related Definitions

(a) x Ordered Before y:

for i in 0..size(x):
    do_x_body()
for j in 0..size(y):
    do_y_body()

(b) x Ordered Outer y:

for i in 0..size(x):
    for j in 0..size(y):
        do_x_body()
        do_y_body()

(c) x Merged With y:

# size(x) == size(y)
for i in 0..size(x):
    do_x_body()
    do_y_body()

Listing 4.5: Effect of Ordering on Loops

Ordering Constraints. Below are the constraints that ensure the order is coherent and respects data dependencies.

• The order is sequentially transitive.

    order(x, y) = BEFORE ∧ order(y, z) = BEFORE =⇒ order(x, z) = BEFORE    (4.3)

• The statements nested inside a dimension inherit the order between that dimension and the statements that are not nested in it. This ensures loops have a tree-like structure in implementations.

    order(x, d) = IN ∧ order(y, d) ≠ IN =⇒ order(x, y) = order(d, y)    (4.4)

• Instructions can only be before, nested inside or after other statements.

    ∀x ∈ Instructions. order(x, y) ∈ {BEFORE, AFTER, IN}    (4.5)

• If a statement x depends on a value produced by y, then x executes after y. This includes the case where x is an instruction that consumes a value produced by y, but also the case where x reads the last value of a variable produced in the loop y.

    x depends on y =⇒ order(x, y) = AFTER    (4.6)

• Merged dimensions have the same order with respect to other statements.

    order(x, y) = MERGED =⇒ ∀z ∈ Statements. order(x, z) = order(y, z)    (4.7)

These are only the most important constraints for understanding how order works. The encoding has many more constraints involving the order, for example to ensure that only dimensions with the same size and dim_kind are merged, or that BLOCK dimensions are outermost and VECTOR dimensions are innermost.
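To make the role of propagators on partially specified candidates concrete, the following Python sketch (an illustration, not our actual propagation engine) applies the transitivity rule (4.3) to domains of still-possible order values; a real propagator would also maintain the antisymmetric counterparts and the other constraints.

def propagate_transitivity(domains, statements):
    # domains maps each ordered pair of statements to the set of possible values.
    changed = True
    while changed:
        changed = False
        for x in statements:
            for y in statements:
                for z in statements:
                    if len({x, y, z}) < 3:
                        continue
                    if domains[(x, y)] == {"BEFORE"} and domains[(y, z)] == {"BEFORE"}:
                        if domains[(x, z)] != {"BEFORE"}:
                            domains[(x, z)] &= {"BEFORE"}  # an empty set signals a dead end
                            changed = True
    return domains

stmts = ["a", "b", "c"]
full = {"BEFORE", "AFTER", "IN", "OUT", "MERGED"}
domains = {(x, y): set(full) for x in stmts for y in stmts if x != y}
domains[("a", "b")] = {"BEFORE"}
domains[("b", "c")] = {"BEFORE"}
propagate_transitivity(domains, stmts)
print(domains[("a", "c")])  # {'BEFORE'}: the decision is forced before being taken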

Instructions Iteration Dimensions. We define a quotient set IterationDims that contains one representative of each class of merged iteration dimensions nested outside an instruction. Initially, IterationDims($inst) contains the dimensions of the iteration space of $inst. When decisions nest $inst into a new class of dimensions, a representative of this class is automatically added to the set.

quotient IterationDims($inst in Insts) of $dim in Dimensions:
  is_iter_dim = order($dim, $inst) is OUT / order is MERGED
  ...
end

choice half counter threads_per_blocks($inst in Instructions):
  forall $dim in StaticDims:
    mul size($dim) when:
      is_iter_dim($inst, $dim) is TRUE
      dim_kind($dim) is THREAD
end

require forall $inst in Instructions:
  threads_per_blocks($inst) <= "target.max_threads_per_block"

Listing 4.6: Instructions Iteration Dimensions Set

Listing 4.6 defines IterationDims as a set that contains one representative of each class of merged dimensions nested outside $inst. This listing also shows an example of how the decision space uses this quotient to limit the size of dimensions nested outside instructions: the threads_per_blocks counter limits the size of thread dimensions nested outside an instruction so that it does not exceed the maximal number of threads supported by CUDA. We use the same approach to also limit the number of block dimensions, the number of thread dimensions and the number of times an instruction is repeated in unrolled loops.
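The counter itself only multiplies dimension sizes. A minimal Python sketch of the check (illustration only; the 1024-thread limit stands in for target.max_threads_per_block and is an assumption typical of recent CUDA GPUs):

MAX_THREADS_PER_BLOCK = 1024  # assumed value of target.max_threads_per_block

def threads_per_block(iteration_dims, dim_kind, size):
    # Multiply the sizes of the iteration dimensions implemented as THREAD.
    total = 1
    for d in iteration_dims:
        if dim_kind[d] == "THREAD":
            total *= size[d]
    return total

dims = ["d0", "d1", "d2"]
kinds = {"d0": "LOOP", "d1": "THREAD", "d2": "THREAD"}
sizes = {"d0": 1024, "d1": 32, "d2": 16}
assert threads_per_block(dims, kinds, sizes) <= MAX_THREADS_PER_BLOCK  # 512 threads: valid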

4.3.3 Block and Thread Dimensions

Supporting BLOCK dimensions is simple: constraints force them to be nested outside every other statement so that they span the entire computation. They translate into GPU block dimensions that parallelize the entire computation. THREAD dimensions are more complex as threads can synchronize within a block. We first explain how we represent synchronizations in the decision space and then how we decide how to map iteration dimensions to hardware thread dimensions.

Thread Dimensions Structure. We encode thread synchronization by expressing multiple loop nests of thread dimensions in the decision vector. Each loop nest of thread dimensions represents a piece of code executed in parallel between two synchronization instructions. At code generation, we fuse the loop nests together and introduce synchronizations between them. This transforms the loop structure so that thread dimensions are outer parallel dimensions, matching the programming model of GPUs, while keeping the original execution order. If two loop nests have different numbers of iterations, we introduce if statements to disable part of the threads.

(a) Loop Structure in the Decision Vector:

for i in 0..n:
    parallel_for j in 0..m:
        do_a()
    do_b()
    parallel_for j in 0..m:
        do_c()

(b) Generated Pseudo-Code:

parallel_for j in 0..m:
    for i in 0..n:
        do_a()
        synchronize_threads
        if j == 0:
            do_b()
        synchronize_threads
        do_c()
        synchronize_threads

Listing 4.7: Translation from Thread Dimensions to Synchronizations

Listing 4.7 shows an example of how we generate parallel code for a given loop structure. It features a sequential outer loop that contains two parallel loops executing instructions a and c, separated by an instruction b. In the generated code, the parallel loop becomes the outermost loop. We introduce synchronizations so that the execution order remains the same. An if statement ensures that b does not execute in parallel.

Mapping to Hardware Dimensions. Decisions specify how to map thread dimensions from the different thread loop nests to hardware thread dimensions. For this, we define the thread_mapping choice between pairs of dimensions in Listing 4.8. This antisymmetric choice on pairs of static dimensions (dynamic dimensions cannot be threads) gives the relative order of thread dimensions, with regard to the hardware dimensions they map to. Let x and y be two dimensions. thread_mapping(x, y) can take the following values.

• NOT_THREADS indicates that either x or y is not a thread dimension.

• MAPPED indicates that x and y are fused dimensions or two thread dimensions from different loop nests mapped to the same hardware dimension.

• OUT indicates that x and y are two thread dimensions mapped to different hardware dimensions and that the hardware dimension of x is nested outside the hardware dimension of y. Dimensions x and y may be in the same or in different loop nests.

• IN is the antisymmetric counterpart of OUT.

Constraints ensure that thread_mapping defines a total quasiorder on thread dimensions and that it respects the nesting specified by order. A quotient set counts the classes of mapped dimensions to ensure that their number does not exceed the amount of hardware thread dimensions.

choice enum thread_mapping($lhs in StaticDims, $rhs in StaticDims):
  // Either $lhs or $rhs is not a thread dimension.
  value NOT_THREADS:
  // Map $lhs and $rhs to the same hardware thread dimension.
  value MAPPED:
  // Map $lhs and $rhs to different hardware thread dimensions, with
  // $lhs outer than $rhs.
  value OUT:
  // Map $lhs and $rhs to different hardware thread dimensions, with
  // $lhs inner than $rhs.
  value IN:
  antisymmetric:
    OUT <-> IN
end

Listing 4.8: Thread Dimension Mapping Choice Definition

4.3.4 Point-to-Point Communication

As explained in Section 4.2.2, the kernel representation can express point-to-point communication between loop nests by mapping a dimension to another. Decisions can choose to implement the communication between two dimensions in the following ways.

• Merging both dimensions removes the communication, as the producer and consumer instructions end up in the same loop.

• Unrolling or vectorizing both dimensions allows the producer and consumer code to use different registers for each iteration of the loops. This way, the data produced can live for more than one iteration and reach the consumer loop.

• Mapping both dimensions to the same hardware thread ensures that matching iterations of the producer and consumer dimensions map to the same thread. This is similar to merging dimensions except that it introduces a synchronization instruction between them.

• Temporarily storing the data into an array that lives until the consumer loop reads it.

Temporary Array Generation. A trigger extends the kernel representation with a temporary array, a store instruction and a load instruction to implement point-to-point communication when decisions rule out the other options. Point-to-point communication typically involves multiple pairs of dimensions in the producer and the consumer loop nests, but only pairs of dimensions that are not merged should appear in the temporary array. At the time the trigger introduces the array, the list of merged dimensions might not be specified yet. This makes it impossible to fix the size and layout of the array upfront. Instead, a counter tracks the product of the sizes of non-merged dimensions and a second trigger exposes potential layouts for the array once the relevant loop fusion choices are fixed.

We have a prototype running that exposes the layout as a regular decision. The results we present in the rest of this dissertation, however, are based on a version where the trigger generates one candidate per potential layout. It behaves as if it was instantiating a decision, but without the possibility for constraint propagators to detect beforehand whether a layout is valid.
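The size tracked by this counter is simply the product of the sizes of the non-merged dimension pairs, as the following sketch illustrates (names are hypothetical and the example only restates the rule above):

def temp_array_elements(mapped_pairs, merged, size):
    # mapped_pairs: (producer dim, consumer dim) pairs carrying the communication.
    total = 1
    for prod, cons in mapped_pairs:
        if (prod, cons) not in merged:   # merged pairs need no storage
            total *= size[prod]
    return total

sizes = {"d0": 128, "d1": 32}
pairs = [("d0", "d3"), ("d1", "d4")]
print(temp_array_elements(pairs, merged={("d0", "d3")}, size=sizes))  # 32: only the d1/d4 pair remains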

Applications. In most cases, it is possible to avoid point-to-point communication by reusing the same dimensions in the loop nests of the different instructions. However, using separate dimensions for different instructions offers flexibility for the control structure of potential implementations. For example, consider the code in Listing 4.9a that produces values x_i for i < 4 in a dimension d0 and consumes them in a dimension d1. The notation x.i indicates point-to-point communication between the iterations i of d0 and d1. This communication enables the following optimizations, among others.

• It allows implementing the producer and consumer dimensions with different dim_kind, such as in Listing 4.9b with the producer vectorized and the consumer unrolled. This is particularly important as CUDA only allows the vectorization of load and store instructions.

• It allows distributing the producer and consumer loops to hide the latency of the producer instruction, as in Listing 4.9c.

• It allows copying a chunk of data to shared memory, as in Listing 4.9d. This is useful to limit the accesses to global memory if the code reuses the data.

(a) Original Code to Implement:

# Dimension d0
for i in 0..4:
    x.i = load X[i]
# Dimension d1
for i in 0..4:
    y_i = x.i * 2

(b) Different dim_kind for d0 and d1:

# Dimension d0 vectorized
x0, x1, x2, x3 = vload X[0..3]
# Dimension d1 unrolled
y0 = x0 * 2
y1 = x1 * 2
y2 = x2 * 2
y3 = x3 * 2

(c) Distributed Loops:

# Dimension d0 unrolled
x0 = load X[0]
x1 = load X[1]
x2 = load X[2]
x3 = load X[3]
# Dimension d1 unrolled
y0 = x0 * 2
y1 = x1 * 2
y2 = x2 * 2
y3 = x3 * 2

(d) Communication Through a Temporary Array T:

# Allocate T in shared memory
T = allocate(int, 4, SHARED_MEM)
# Dimension d0, load X into T
for i in 0..4:
    x = load X[i]
    T[i] = x
# Dimension d1, read x from T
for j in 0..4:
    x = load T[j]
    y_j = x * 2

Listing 4.9: Different Implementations of Point-to-Point Communication

4.3.5 Memory Accesses

Listing 4.10 shows the definition of memory-related sets and generic choices. It declares three sets: Arrays, MemInsts and AccessedArrays. These sets respectively list the arrays allocated in memory, the memory access instructions and the arrays accessed by each memory access. The listing then defines two generic choices: mem_space, which assigns a memory space to each array, and cache, which indicates the level of cache to use for each memory access. Constraints ensure that cache directives are coherent with memory spaces. A counter also ensures that the size of the arrays allocated in shared memory does not exceed the size of the shared memory.

We have now detailed our decision space for linear algebra kernels on GPUs. This decision space exposes six generic choices: dim_kind, size, order, thread_mapping, mem_space and cache. We now explain how we generate code that follows those choices for a fixed candidate.

4.4 Code Generation

This section explains how we generate GPU code from a fixed candidate. The code generation procedure works in four steps, each detailed in a dedicated section. First, the procedure annotates statements and arrays with the decisions relative to them (Section 4.4.1). It then builds a tree encoding the loop structure hierarchy (Section 4.4.2) and generates variables to hold the values produced by instructions (Section 4.4.3). Finally, it prints the CUDA code following the loop tree (Section 4.4.4).

set Arrays: ... end
set MemInsts subsetof Instructions: ... end
set AccessedArrays($inst in MemInsts) subsetof Arrays: ... end

choice enum mem_space($array in Arrays):
  value GLOBAL:
  value SHARED:
end

choice enum cache($inst in MemAccesses):
  value L1:        // Use L1 + L2 caches
  value L2:        // Use L2 cache
  value READ_ONLY: // Use read-only cache
  value NONE:      // Do not use caches
end

Listing 4.10: Memory Placement and Cache Decisions

4.4.1 Statements and Arrays Annotation

The first step to generate code is to extract information from the fixed decision vector to annotate statements and arrays. This step adds the following annotations.

• It annotates instructions with cache directives and vectorization factors.

• It computes groups of merged iteration dimensions and annotates them with their common size and dim_kind.

• It computes groups of thread dimensions mapped to the same hardware thread dimension, and annotates each group with the corresponding hardware dimension.

• It annotates arrays with their size and memory space. Note that in fixed candidates, all arrays have a fixed layout and a fixed set of dimensions. The size of an array is thus just the product of the sizes of its dimensions.

This step is also the occasion to implement the computation of addresses for linear accesses to tensors. To this end, the code generation procedure translates the pointer increments along iteration dimensions into induction variables. It then annotates the groups of merged dimensions with the induction variables they must compute.

4.4.2 Control Flow Tree

The second step of code generation is to build a tree that encodes the nesting structure and the sequential order of statements. The root of this tree represents the entire program, internal nodes represent loops and leaves represent instructions. The children of a node are the statements directly nested inside the loop (or in the whole program) it represents. A node lists its children in sequential order.

Initial Tree Construction. To build the tree, we define an order ≺ on statements that indicates if a statement starts executing before another.

    x ≺ y ⇐⇒ order(x, y) ∈ {BEFORE, OUT}    (4.8)

We then use ≺ to sort the instructions and one representative of each group of merged dimensions. Applied to that subset of statements, ≺ is the total order that corresponds to the pre-order traversal of the control flow tree.

Algorithm 2 shows the algorithm that builds the tree. It iterates over the sorted list of statements S and maintains a pointer to the current innermost loop x. To simplify notations, we consider the entire kernel as the body of a loop of size one. For each statement s ∈ S, the algorithm tests if order(s, x) = IN. If that is not the case, it walks through the ancestors of x until it finds a loop that is nested outside s and updates the pointer x. It then appends s to the body of x.

Algorithm 2: Control Flow Tree Construction
Data: a list S of statements ordered by ≺
Result: a control flow tree
T ← empty tree;
x ← T.root;
foreach s ∈ S do
    while s is not nested in x do
        x ← x.parent;
    end
    y ← new tree node labeled by s;
    x.children.push_back(y);
    if s is a dimension then
        x ← y;
    end
end
return T
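For readers who prefer executable code, the following Python sketch implements Algorithm 2 (an illustration only; the actual generator is not written in Python, and nested_in and is_dimension stand for the order(s, x) = IN and dimension tests of the fixed candidate).

class Node:
    def __init__(self, stmt=None, parent=None):
        self.stmt, self.parent, self.children = stmt, parent, []

def build_control_flow_tree(sorted_statements, nested_in, is_dimension):
    root = Node()            # the whole kernel acts as a loop of size one
    current = root
    for s in sorted_statements:
        # Walk back up until we reach the innermost loop that contains s.
        while current.parent is not None and not nested_in(s, current.stmt):
            current = current.parent
        node = Node(s, parent=current)
        current.children.append(node)
        if is_dimension(s):  # dimensions open a new loop level
            current = node
    return root

Statements for which nested_in never holds end up as children of the root, mirroring the convention that the entire kernel behaves as a loop of size one.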

Parallel Dimension Erasure. After building the tree, the code generation erases parallel dimensions to obtain the code of a single thread. Since instruction annotations already contain the vectorization factor, vector dimensions are redundant and are erased as well. It keeps the order and size of block and thread dimensions aside to generate the code that launches the CUDA kernel. When erasing thread dimensions, it inserts thread synchronization instructions after each nest of thread dimensions. It also inserts if conditions to disable the hardware thread dimensions that are not used by the current loop nest of thread dimensions. Specifically, it disables threads that have an index greater than 0 along an unused hardware thread dimension.

4.4.3 Variables Generation

The third step of code generation is to allocate variables that hold the values produced and consumed by instructions. This only concerns the values stored in registers, as the point-to-point communications that go through memory are already lowered into load and store instructions to buffers during the decision phase. Register allocation is performed by the CUDA driver. This step can thus use an arbitrary number of variable names.

Relation Between Variables and Instructions. Due to loop unrolling and vectorization, a single instruction of the kernel representation may store its result in multiple variables. Indeed, an unrolled loop creates multiple instances of the same instruction, each with its own result variable, while vectorized memory accesses store their result in multiple variables at once. Conversely, we implement a loop-carried variable x, initialized with an instruction y, by storing the result of x and y at the same location. Thus, different instructions or different instances of the same instruction may store their result in the same variable.

Listing 4.11 gives an example of a kernel to implement and the pseudo-code of a potential implementation, with the variable names assigned. The kernel contains three variables x, y and z; x is used in a different loop nest with point-to-point communication, and z is a loop-carried variable initialized with y. The potential implementation assumes both dimensions d1 and d2 are unrolled. Variable x is assigned a different name for each iteration of d1 in order to allow point-to-point communication with d2. Variables y and z are merged into a single one so that the loop-carried variable is correctly initialized.

Instruction to Variable Mapping. When assigning the variables that hold the return values of instructions, the code generator iterates over instructions in sequential order. For each instruction i, it computes the set of dimensions X along which i is instantiated. That is the set of dimensions in the iteration space of i that are either unrolled or vectorized and do not carry the instruction result across iterations.

(a) Kernel to Implement:

for d0 in 0..4:
    for d1 in 0..2:
        x(d1) = load A[(d0, d1)]
    y = 0
    for d2 in 0..2:
        z = phi(y, z) + x(d2)

(b) Pseudo-Code of a Potential Implementation:

for d0 in 0..4:
    x0 = load A[4*d0 + 0]
    x1 = load A[4*d0 + 1]
    y = 0
    y += x0
    y += x1

Listing 4.11: Example of Variable Names Assignation

It then assigns a variable to each iteration in the product of the dimensions of X. If the variable is a loop-carried variable, it reuses the variables assigned to the initialization instruction. Otherwise, it introduces new variables.

4.4.4 Code Printer

The code generator prints the code of a single GPU thread. It targets the PTX instruction set architecture, which is akin to assembly but with an infinite number of registers. PTX offers finer control over the final executable than CUDA and avoids the cost of compiling CUDA code to PTX. The generated code starts with the function declaration, then the declaration of the variables and the buffers, the loading of the parameters into registers and finally the body of the function. All buffers are allocated statically, since the constraints ensure their size is known at code generation.

Printing the body of the function is simply a matter of walking the control flow tree and outputting the loop headers and footers and the instructions. The code generator walks unrolled loops once for each iteration, to instantiate their code. Figure 4.4 shows an example of a control flow tree with the corresponding PTX code. The control flow tree contains a dimension d0 implemented as a for-loop, a vectorized dimension d1 replaced with a vector instruction in the PTX and an unrolled dimension d2 whose body is instantiated in the PTX.

(a) Control Flow Tree:

Root
└── d0: LOOP
    ├── d1: VECTOR
    │   └── x(d0) = ld A[(d0, d1)]
    ├── y = 0
    └── d2: UNROLL
        └── y += x(d1)

(b) Generated PTX Code:

// For-loop d0 header.
mov.i32  %d0, 0;
mov.i64  %a0, %A;
D0:
add      %a0, %a0, 16;
// Vectorized d1.
ld.global.v2.f32 {%x0, %x1}, [%a0];
mov.f32  %y, 0;
// Unrolled d2.
add.rn.f32 %y, %x0, %y;
add.rn.f32 %y, %x1, %y;
add.i32  %d0, 1;
// For-loop d0 footer
setp.lt.i32 %p, %d0, 4;
@%p bra.uni D0;

Figure 4.4: PTX Code Generation from a Control Flow Tree

The code generator can also print C code that executes on the CPU, in order to call the CUDA driver and launch the computation on the GPU with the right number of blocks and threads. This is only used to return the best implementation found by a search algorithm to the user. During the search, the code generator makes the calls to the CUDA driver itself in order to avoid the cost of compiling and linking the C code.

4.5 Discussion

This section reviews the specificities of our candidate representation for linear algebra kernels on GPUs, compared to other representations of possible optimizations in the literature. Section 4.5.1 first discusses the expressiveness of our representation and how its design applies to the representation of transformations with candidate representations in general. Section 4.5.2 then shows that our claim that candidates accurately represent valid decisions is well founded. Finally, Section 4.5.3 explores the relation between our representation and the polyhedral representation of loop nests.

4.5.1 Expressiveness of the Decision Space

The first quality of a candidate representation is its expressiveness, and in particular the code transformations it supports. In the context of candidates, supporting a transformation is the capacity to represent an implementation with and without the transformation applied in the same implementation space. We first recall the transformations our candidate representation supports. We then compare our candidate representation with annotation-based code transformations and discuss our encoding of the control flow structure. Finally, we take a step back from our domain of application (linear algebra on GPUs) to discuss the insights we gained about the design of candidate representations in general.

Supported Transformations. The decision space this chapter defines is sufficiently expressive to support many of the transformations required to generate optimized code on GPUs. It supports strip-mining with the size choice; loop fusion and loop interchange with the order choice; loop unrolling and vectorization with the dim_kind choice; mapping of computations to the different levels of parallelism with the combination of the dim_kind and order choices; and orchestration of data movements across the memory hierarchy with the cache and mem_space choices.

Annotation-Based Transformation Representation. All the generic choices except order are relative to a single object. They are similar to compiler directives that specify when to unroll, vectorize or parallelize a loop, or where to allocate an array. There are entire frameworks based on this concept. For example, OpenMP [37] allows programmers to parallelize a program by adding directives in front of loops. Following our terminology, an OpenMP candidate would be an OpenMP program with a list of possible directives at every place they may be added. Another similar approach is to expose options in a fixed template. This is, for example, the path taken by the ATLAS framework [38] on CPUs and the MAGMA framework on GPUs [39, 40]. These approaches, however, are limited to orthogonal decisions. In contrast, we expose interacting decisions, such as loop fusion and loop interchange, or strip-mining factors (acting on the size of arrays) and memory placement.

Control Flow Structure. The main originality of our decision space is its encoding of the control flow structure. The order choice simultaneously specifies the sequential order, the nesting and the fusion of instructions and loops, enabling a great flexibility in the code structures our candidates can represent. Indeed, it provides local ordering information that can be tuned separately for each pair of statements. This flexibility also applies to the classes of implementations we can represent. Indeed, the structure of loops only needs to be locally coherent before the candidate is fixed. Thus, the global loop structure does not need to be a tree for partially specified candidates. Table 4.2 shows an example for the domain of the order choice between three instructions a, b and c and two dimensions dm and dn that compute the outer product of arrays A and B into a variable c. Instruction a is nested in dm, b in dn and c in both dimensions. For a fixed candidate, this loop structure would be invalid. As Listing 4.12 shows, either dn must contain dm and a, or dm must contain dn and b. For partially specified candidates, this is not a problem as both solutions are still possible.

Statement          dm            dn
a = load A[dm]     IN            IN, BEFORE
b = load B[dn]     IN, BEFORE    IN
c = a * b          IN            IN
dm                 /             IN, OUT
dn                 IN, OUT       /

Table 4.2: Possible Orders Between Statements and Dimensions for the Outer Product Kernel

(a) dn and b Nested Inside dm:

for i in 0..m:  # d_m
    a = load A[i]
    for j in 0..n:  # d_n
        b = load B[j]
        c = a * b

(b) dm and a Nested Inside dn:

for j in 0..n:  # d_n
    b = load B[j]
    for i in 0..m:  # d_m
        a = load A[i]
        c = a * b

Listing 4.12: Possible Implementations for the Outer Product Kernel

The ability to represent a superposition of implementations without instantiating the code itself (demonstrated here by the superposition of potential loop structures) is one of the key benefits of our encoding. This contrasts with other approaches, such as Equality Saturation [21], that keep the different versions of the code.

Insights on the Design of Candidate Representations. In theory, candidate representations could support any transformation: we just need a boolean choice at every place where it could apply, to enable or disable it. However, this approach causes several problems:

• Constraints will be complex, making propagators unable to prune invalid values before a candidate is fully specified. This makes it difficult to find any valid implementations, let alone an efficient one.

• Heuristics will not be able to model the interaction between transformations.

• Choices will need to consider transformations on the code generated by other transformations. Computing the transitive closure of all possible transformations may not be possible, or at least time and memory consuming.

For these reasons, this chapter adopted a different strategy to support transformations. Instead of focusing on the transformations themselves, the choices specify the structure and the parameters of the final code. The kernel representation specifies the semantics of the kernel and the computation to perform, while choices map them to the hardware. An example of this is the order choice, which specifies the relative order of statements rather than transformations such as loop interchange or loop invariant code motion.

To support a new transformation, the first step is to decompose it into atomic modifications of the generated code that only affect a couple of objects at most. For example, changing the memory space of an array will a) affect the allocation of the array and b) affect the cache directives of each instruction that accesses the array. We thus generate two distinct choices, mem_space and cache. Similarly, the order choice decomposes control flow transformations into pairwise ordering decisions.

A corollary of encoding the implementation space with many small decisions that directly specify the structure and the properties of the final code is that the compiler does not need to be aware of high-level transformations to implement them. They are just a particular combination of decisions among others. This approach gives more flexibility on the transformations we support and makes it easier to combine them, as they are all expressed using the same choices.

4.5.2 Effectiveness of Propagators

By providing information about the legality of implementations, constraints play a crucial role. However, constraint propagators working on fully specified implementations are not enough. Without constraint propagation on partially specified candidates, search algorithms would have to specify all decisions before deciding if a candidate is valid. Propagators implementing constraints should thus be strong enough to detect invalid decisions early.

We measured the effectiveness of propagators by evaluating the probability of ending up with only decisions leading to a failed candidate when taking random decisions. Table 4.3 shows the results for the implementation spaces of the kernels listed below.

axpy: computes z = α.x + y, where α is a scalar and x, y and z are vectors of size n = 2^26.

matmul m × n × k: computes C = A · B where A and B are matrices of size m × k and k × n.

strided matmul: computes matmul 1024 × 1024 × 1024 with a stride of 32 between the elements of A.

batched matmul: computes 512 matrix multiplications of size 32 × 32 × 64.

reuse matmul: is the same as batched matmul but reuses the same matrix B for all matrix multiplications.

Note that we consider matrices of different sizes for matmul, as a smaller size can restrict the potential strip-mining factors. For more information about the implementation space of these kernels and the transformations they enable, see Section 6.2.

Kernel                         Probability of Dead-End
axpy                           0.1% ± 0.02
matmul 256 × 256 × 32          14% ± 0.7
matmul 1024 × 1024 × 1024      2.8% ± 0.3
strided matmul                 2.3% ± 0.3
batched matmul                 17% ± 0.7
reuse matmul                   31% ± 0.9

Table 4.3: Probability to Backtrack Because of Constraints

The results show that the ratio of trials that needed to backtrack to find a valid implementation is below a few percent for simple kernels and below a third of trials for the most complex one. This is to be put in perspective with the fact that the search algorithm we use in Chapter 6 already backtracks when it detects that the current candidate cannot lead to efficient code. The amount of backtracking due to failed candidates is negligible compared to other sources of backtracking. The impact of having suboptimal propagators on the exploration time is thus minimal. Still, we are working on the decision space and on the propagators to improve these numbers.

4.5.3 Relation With the Polyhedral Representation of Loop Nests

Section 3.6.3 already mentions the relation between candidates and polyhedral compilation [26, 27]. Both approaches encode an implementation as a vector of choices that respects logical constraints. In the case of polyhedral compilation, the vector contains the coefficients of an affine function that specifies the instructions schedule. We first discuss polyhedral code generators dedicated to GPUs and then explain how we could combine polyhedral approaches with candidates.

Polyhedral Code Generation for GPUs. Compilers based on polyhedral representations and targeting GPUs have shown promising results [41, 34]. In particular, Pouchet et al. define a convex optimization space for loop transformations [33]. The main limitation of their polyhedral representation compared to our approach is that they can only express the schedule of instructions, while our candidates also expose orthogonal decisions, such as cache directives, memory placement, parallelization and unrolling decisions.

Existing polyhedral compilers rely on pre- or post-processing passes for these decisions. This makes it more complex to explore the space of potential implementations, to derive information early in the compilation process and to pick the most important decisions first. For example, Diesel [42] is a recent framework from Nvidia applying an ILP-based scheduler to a specific domain, specializing it for each kernel with parameters specific to a GPU micro-architecture (tile sizes, mapping strategies), complemented with target-specific transformations (e.g., instruction-level and register-level optimizations). This framework reaches competitive performance levels, matching or outperforming native libraries. However, this involves deep knowledge of the target architecture to manually tune the optimization strategy for each kernel.

Polyhedral Candidates. The polyhedral representation of loop transformations is actually compatible with the candidate approach. The decision vector could list the coefficients of a polyhedral schedule alongside other choices and use propagators to enforce affine constraints. In future work, we hope to apply this idea to support more kernels, such as convolutions, and more transformations, such as pipelining.

The difficulty of using affine schedules with our approach is to avoid losing too much expressiveness in the classes of schedules a candidate can represent. Affine schedules are more expressive when looking at final implementations, but a partially specified affine schedule cannot encode arbitrary partial orders. Moreover, it is sometimes complex to understand the implication of a single coefficient on the final code before fixing other coefficients. This makes it harder to develop heuristics working on the decision vector.

4.6 Summary

This chapter defines a candidate representation for GPU linear algebra kernels. This representation supports both thread-local and global implementation decisions, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestrating data movements across the memory hierarchy. It shows that the ideas and formalism presented in Chapter 3 are sufficiently expressive to encode the interaction between complex transformations that fundamentally change the structure of the code. The structure of the implementation space is simple enough that finding a valid implementation does not require a complex solver apart from constraint propagation. This confirms the assertion made in Chapter 3 that individual decision domains accurately represent legal decisions.

Besides providing information about the legality of decisions, the candidate representation also helps understand what the final implementation may look like. Chapter 5 demonstrates how we leverage this information to give a lower bound on the execution time achievable by a candidate. This model works early in the compilation process, even when most choices are still open. It is still unclear how the design of the decision space affects such heuristics.

The purpose of the candidate representation is to help search algorithms find more efficient implementations of linear algebra kernels for GPUs, faster. Chapter 6 applies a simple search algorithm on top of our candidate representation and shows it allows for the generation of code that is competitive with hand-tuned libraries, and outperforms them in some cases. The representation assumes that all loops are for-loops with data-independent bounds (i.e., loop bounds only depend on the input size) and that memory accesses are linear accesses to n-dimensional tensors. While this is sufficient for most dense linear algebra computations, we intend to broaden the class of computations the representation supports, in particular to convolutions. A promising research direction for that is to combine polyhedral compilation with the candidate formalism.

Chapter 5

An Optimistic Performance Model of Candidates

The benefits of using candidates come from the information they expose about the potential implementations they represent, and in particular about the best of these implementations. This chapter demonstrates how candidates can provide such information, even when only a few choices are fixed. It develops an analytical performance model that computes a lower bound B(c) on the execution time of any implementation that derives from a candidate c. In the following, we denote T(x) the execution time of implementation x on the hardware. The bound B(c) respects the following equation.

∀x ∈ ⟦c⟧. T(x) ≥ B(c)   (5.1)

The model we provide is generic in the sense that it supports architectures with various depths of parallelism hierarchies and different resource constraints. It can also adapt to different decision spaces. The only choices it directly depends on are order and, to a lesser extent, dim_kind. Other choices impact the performance of individual statements, but do not affect the structure of the model.

Lower Bound. The first distinctive feature of our model is that it computes a lower bound for the execution time of implementations rather than trying to estimate the exact execution time. Of course, the bound should be as close as possible to the execution time; but in case of uncertainty, our model adopts conservative abstractions instead of trying to guess a value. This lower bound approach has two advantages. First, it avoids misleading a search algorithm with inaccurate information when the model is uncertain about the execution time. Instead, the model gives a conservative bound. This way, the search algorithm can reliably use the bound to avoid parts of the implementation space that only contain inefficient implementations. Second, in the presence of open choices, the idealistic performance achievable by a candidate is a proxy for the performance of the best implementation that derives from the candidate. Trying to accurately estimate the execution time of the best implementation would be impossible without enumerating all possible implementations, and extracting information about the average performance of the implementations that derive from the candidate would be meaningless, as the search algorithm only looks for the best ones.

Open Choices. The second distinctive feature of our model is that it works even when some implementation choices are left open. Generating and inspecting the code of all the implementations that derive from the candidate is infeasible due to the potentially large number of implementations to consider. Instead, the model works directly with the sets of potential decisions. When facing open choices, the model strategy is to make conservative assumptions that minimize the execution time. The assumptions are local and independent for each hardware resource. Two assumptions may assume different values for the same choice or incompatible values for two different choices. While making such inconsistent assumptions may reduce the accuracy of the model, it allows for efficiently computing a lower bound on the execution time without enumerating the implementations. This source of inaccuracy only affects candidates with open choices and it vanishes as more decisions are specified.
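To make the effect of these local assumptions concrete, the toy computation below (not part of the thesis model; all names and numbers are invented) bounds a kernel with a single open tile-size choice that affects two resources in opposite directions. The per-resource minima assume different tile sizes, yet their maximum remains below the true best execution time.

    # Toy illustration of locally inconsistent, optimistic assumptions.
    # Hypothetical cycle counts charged to each resource for each tile size.
    costs = {
        16: {"compute": 100, "memory": 40},
        32: {"compute": 60, "memory": 90},
    }
    # True best: pick a tile size, then pay for the most contended resource.
    true_best = min(max(c.values()) for c in costs.values())  # -> 90
    # Model bound: each resource independently assumes its most favorable tile
    # size (compute assumes 32, memory assumes 16), a combination no single
    # implementation can achieve.
    bound = max(min(c[r] for c in costs.values()) for r in ("compute", "memory"))
    assert bound <= true_best  # 60 <= 90: a valid, if conservative, lower bound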

Contributions. This chapter defines a generic performance model that computes a lower bound for the execution time of any implementation that derives from a candidate. Its main contributions are the following.

• It demonstrates that it is possible to extract actionable information from candidate implementations, even with only a few choices fixed. The concepts developed to handle open decisions are not limited to the performance model: they could be applied to define other heuristics on partially specified candidates.

• It defines a performance model that gives a lower bound on the execution time of any implementation that derives from a candidate. This performance model is not specific to GPUs. It can be instantiated for other architectures with different bottlenecks and a different parallelism hierarchy.

• It instantiates this model for Nvidia GPUs and evaluates the pertinence of the resulting bound. It shows that the model allows for pruning candidates early in the compilation process and reducing the implementation space size by several orders of magnitude.

Outline. Section 5.1 first gives an overview of how the performance model works. Section 5.2 then dives into the details of how the model computes the latency of a single thread. Next, Section 5.3 instantiates the generic performance model for CUDA GPUs and Section 5.4 evaluates it on actual candidates. Finally, Section 5.5 discusses the specificities of the model and compares it to the literature, while Section 5.6 concludes the chapter.

5.1 Model Overview

This section presents a generic version of the performance model that makes minimal assumptions about the hardware and the statement-specific choices the decision space exposes. Let c be a candidate implementation with a set of instructions I and a set of dimensions D. Let S := I ∪ D be the set of statements. We denote x ∈ ⟦c⟧ an implementation x that derives from c. The goal of the model is to provide a lower bound B(c) for the execution time of any implementation x ∈ ⟦c⟧, for a fixed value of the kernel parameters. Let T(x) be the execution time of x ∈ ⟦c⟧. Since B(c) is a lower bound for T(x), it satisfies the following equation.

∀x ∈ ⟦c⟧. T(x) ≥ B(c)   (5.2)

To compute a lower bound B(c) on the execution time of any implementation x ∈ ⟦c⟧ that derives from c, the performance model proceeds in four steps (a toy sketch of these steps follows the list).

1. It first analyzes every statement in isolation to obtain a lower bound for its latency and on the hardware resources it consumes.

2. It then accounts for the total usage of hardware resources generated by the statements enclosed in each loop to compute a lower bound for the execution time of a single iteration of that loop.

3. It combines the bounds for the execution time of loop iterations, the data dependencies between instructions and the order decisions between statements to compute a lower bound for the latency of a single thread.

4. It finally combines the latency of a single thread, the usage of hardware resources generated by statements at each level of the parallelism hierarchy and the parallelism available at these levels to compute the final bound for the execution time of any implementation that derives from the candidate.
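To fix ideas, the following toy sketch strings the four steps together on a caricature candidate (one loop of multiply-adds replicated over blocks). The data layout, helper names and numbers are illustrative placeholders, not the actual model.

    import math

    def toy_bound(candidate):
        # Step 1: per-statement analysis (optimistic usage of each resource).
        usage = candidate["usage"]
        # Step 2: one loop iteration is bounded by the most contended resource.
        iter_time = max(
            sum(usage[s].get(r, 0) for s in usage) / cap
            for r, cap in candidate["resource"].items())
        # Step 3: single-thread latency: the loop runs `trip` iterations.
        b0 = iter_time * candidate["trip"]
        # Step 4: parallelism hierarchy: eta blocks, at most mu in parallel.
        return b0 * math.ceil(candidate["blocks"] / candidate["parallel_blocks"])

    toy = {"usage": {"fma": {"alu": 1}}, "resource": {"alu": 1.0},
           "trip": 1024, "blocks": 64, "parallel_blocks": 16}
    print(toy_bound(toy))  # -> 4096.0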

The third step, computing a lower bound for the latency of a thread, is by far the most complex. For now, we assume this model exists; we detail it later in Section 5.2. Section 5.1.1 defines the abstraction of the hardware the model relies on. Section 5.1.2 presents the analysis of individual statements, Section 5.1.3 the analysis of loop iterations and Section 5.1.4 explains how we compute the final bound, taking into account the parallelism hierarchy.

5.1.1 Hardware Abstraction

We define a hardware abstraction that supports architectures with various depths of parallelism hierarchy and different resource constraints. This abstraction allows us to apply the model to different architectures by only providing the depth of the parallelism hierarchy k, the list of hardware resources to model R, a function that computes the individual impact of statements on hardware resources and a function that computes the number of threads that can execute simultaneously at each level of the parallelism hierarchy.

Parallelism Hierarchy. A parallelism hierarchy of depth k defines a parallelism level for each i ∈ ⟦0, k⟧. Parallelism level 0 sequentially executes a single thread. Then, each level i > 0 instantiates the computation at level i − 1 a fixed, implementation-dependent, number of times. For example, on GPUs, parallelism level 0 corresponds to a single thread, parallelism level 1 to a block of threads and parallelism level 2 to the computation on the entire GPU. For an implementation x and i > 0, we denote η_i the number of instances of the computation of level i − 1 that execute at level i. We also denote µ_i the number of such instances that can execute in parallel. The amount of parallelism η_i available at level i depends on the size of the parallel dimensions that map to level i in x, while µ_i typically depends on the amount of memory and registers allocated by computations at level i − 1. Some levels of parallelism induce synchronizations at the beginning and at the end of each dimension mapped to these levels. When a thread encounters a synchronization instruction, it waits for all other threads in the same instance of the corresponding parallelism level. For example, this is the case for dimensions mapped to a thread dimension on GPUs, as explained in Section 4.3.3.
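As an illustration, a GPU-like hierarchy of depth k = 2 could be described to the model as sketched below. The structure and the η/µ values are hypothetical placeholders: in the real model they are derived from the candidate and from the target.

    from dataclasses import dataclass

    @dataclass
    class Level:
        name: str
        eta: int     # η_i: instances of the level-(i-1) computation it executes
        mu: int      # µ_i: instances that may execute in parallel
        syncs: bool  # whether this level produces synchronization operations

    # Level 0 is a single sequential thread; the levels below replicate it.
    hierarchy = [
        Level("block", eta=256, mu=128, syncs=True),   # level 1: threads per block
        Level("device", eta=64, mu=16, syncs=False),   # level 2: blocks on the GPU
    ]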

Hardware Resources. The set of hardware resources R lists the bottlenecks considered by the model. These can be resources local to a single core, such as floating point units, or global resources, such as the memory bandwidth. For i ∈ ⟦0, k⟧ and r ∈ R, we denote resource(i, r) the maximal amount of resource r that parallelism level i can process per cycle.

resource : ⟦0, k⟧ × R → ℝ⁺   (5.3)

Each instruction in an implementation has a latency and a usage of each of the resources. The latter counts the number of units of a resource an instruction consumes to execute.

5.1.2 Local Statements Analysis

The first step of the performance model is to analyze statements in isolation, in order to compute the latency and the usage they induce on each hardware resource. This step defines two functions, usage and iter_usage_overhead. The first one indicates the usage of a single instance of each statement, while the second indicates the overhead caused by a single iteration of the implementation of a dimension (e.g. the overhead of a loop header if the dimension is implemented as a regular counted loop).

usage : S × R → ℝ   (5.4)
iter_usage_overhead : D × R → ℝ   (5.5)

The model also computes the latency of instructions and the minimal latency of a single iteration of a dimension, independently of the statements its body may contain.

inst_latency : I → ℝ   (5.6)
iter_min_latency : D → ℝ   (5.7)

For example, iter_min_latency(d) can be the time required to increment the loop index and test if it has exceeded the size of the loop. In the case of our decision space for GPUs, it depends on the possible decisions for the dim_kind choice. For an iteration dimension d ∈ D, usage(d), iter_usage_overhead(d) and iter_min_latency(d) only account for the overhead per thread of the loop implementing d. They do not account for the statements nested inside d.

Modeling Open Decisions. Although the statement analysis is target-dependent, the strategy to handle open decisions remains the same. The model enumerates potential combinations of decisions relative to a particular statement and takes the minimal usage or latency over all alternatives. The decisions that lead to the minimum can differ for each statement and each resource. This strategy works because the number of decisions relative to each statement is usually relatively small. In the rare cases where a statement’s performance depends on a large number of choices, we use ad hoc models that conservatively estimate the behavior of the statement; the very simple cache model presented in the following is one such example. In the worst case, these models can just return zero.
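A sketch of this strategy, with hypothetical inputs: the decisions still open for one statement are enumerated, and the minimal usage over all combinations is kept for the requested resource.

    from itertools import product

    def min_usage(open_choices, usage_of, resource):
        """open_choices: dict choice name -> values still in the domain.
        usage_of(resource, assignment): usage for one full assignment."""
        return min(usage_of(resource, dict(zip(open_choices, values)))
                   for values in product(*open_choices.values()))

    # Example: a load whose cache directive and vector width are still open.
    best = min_usage(
        {"cache": ["L1", "L2", "none"], "vector_width": [1, 2, 4]},
        lambda r, a: {"mem_transactions": 4 / a["vector_width"]}[r],
        "mem_transactions")
    print(best)  # -> 1.0, the most optimistic alternative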

5.1.3 From Usage to Latency

After computing the usage induced by individual statements, the model analyzes how they affect the execution times between instructions. If the statements that execute between two instructions a, b ∈ I induce a usage u of resource r ∈ R, then the execution time t between a and b respects the following equation.

t ≥ u / resource(0, r)   (5.8)

Ideally, the model would consider the usage of the hardware resources between each pair of statements. In practice, there are too many such pairs to consider all of them. Instead, the model only considers the usage between the beginning and the end of loop iterations. This reduces the granularity at which we consider bottlenecks but makes the model tractable. We first introduce the concept of loop level in order to represent the loops of potential implementations, even when the loop structure imposed by the candidate is not fixed yet. We then explain how the model counts the number of instances of a statement inside a loop level. Finally, we discuss how we compute a lower bound for the latency of any iteration of a loop level.

Loop Level

The model accounts for the impact of the usage induced by statements at the granularity of the loops’ bodies. In a fixed candidate, a loop corresponds to a group of merged dimensions. This definition, however, makes it impossible to represent the body of the innermost loop of a loop nest when the nesting order of the dimensions that compose the loop nest is left unspecified. For example, let d1, d2 ∈ D be two dimensions that are nested with each other in a not-yet-specified order. Then, neither d1 nor d2 can be considered as the innermost loop by the model. This is a problem, as the body of the innermost loop is guaranteed to be executed size(d1) × size(d2) times, and therefore has a bigger impact on performance than the bodies of the dimensions d1 and d2 taken in isolation. We thus introduce the concept of loop level to represent the innermost iteration of loop nests.

Definition 5.1 (Loop Level). A loop level is a set L ⊆ D of dimensions that are either merged or nested with each other. We denote L the set of all loop levels.

L := {L ⊆ D | ∀d1, d2 ∈ L. d1 = d2 ∨ order(d1, d2) ⊆ {MERGED, IN, OUT}} (5.9)

For a set of statements S ⊆ S, we define the restriction S|L of S to L as the statements of S necessarily nested in all dimensions of L.

S|L := {s ∈ S | ∀d ∈ L. order(d, s) = {OUT}} (5.10)

We define the body S|L of L as the set of all statements restricted to L. A loop level L ∈ L represents the innermost loop in the loop nest composed of the dimensions d ∈ L. If all the dimensions of L are merged with each other, L represents a single loop. The empty loop level L = ∅ represents the entire computation, as S|∅ = S. For example, Listing 5.1 shows the possible implementations of a kernel with two dimensions d0 and d1 such that d0 contains instructions x and z and d1 contains instructions y and z. The loop level L = {d0, d1} represents the innermost loop of the loop nest composed of d0 and d1, independently of the nesting order. The body S|L = {z} of L contains the instructions that are common to both d0 and d1. The number of loop levels may be exponential in the number of iteration dimensions. Sections 5.1.4 and 5.2.3 show that in practice, the performance model only considers a few selected loop levels.

for i in 0..size(d_0):
    x = 2 * i
    for j in 0..size(d_1):
        y = 3 * j
        z = x + y

for j in 0..size(d_1):
    y = 3 * j
    for i in 0..size(d_0):
        x = 2 * i
        z = x + y

Listing 5.1: Possible Implementations of a Loop Nest with Permutable Dimensions

We extend iter_usage_overhead and iter_min_latency to work on loop levels instead of dimensions. For every L ∈ L, the usage iter_usage_overhead(L) and the latency iter_min_latency(L) of L are the minimal usage and the minimal latency over all the dimensions d ∈ L that compose L.

∀r ∈ R. iter_usage_overhead(L)(r) := min_{d∈L} iter_usage_overhead(d)(r)   (5.11)

iter_min_latency(L) := min_{d∈L} iter_min_latency(d)   (5.12)

In both cases, we consider the minimum over an empty set to be equal to 0.
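A direct transcription of equations (5.11) and (5.12), assuming the per-dimension overheads and latencies are given as plain dictionaries (hypothetical inputs):

    def iter_usage_overhead_level(level, per_dim_overhead, resource):
        # Minimal per-iteration overhead over the dimensions that may implement L.
        return min((per_dim_overhead[d][resource] for d in level), default=0)

    def iter_min_latency_level(level, per_dim_latency):
        # Minimal per-iteration latency; an empty loop level costs nothing.
        return min((per_dim_latency[d] for d in level), default=0)

    # Example: one of the two dimensions may be fully unrolled (latency 0).
    print(iter_min_latency_level({"d0", "d1"}, {"d0": 2, "d1": 0}))  # -> 0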

Counting Statement Instances

We now explain how the model exploits the information available in the domain in order to compute a lower bound N_L^i(s) on the number of instances of a statement s ∈ S that are executed at parallelism level i ∈ ⟦0, k⟧ within a single iteration of loop level L ∈ L.

The first difficulty is to account for merged statements. In a fixed candidate, merged statements form equivalence classes. In the general case, however, statements may end up in different classes depending on further decisions, and the final number of classes may vary. The model computes a lower and an upper bound for the number of classes by defining two sets S̲ and S̄ that respectively exhibit at most and at least one statement of each class for each implementation x ∈ ⟦c⟧. On a fixed candidate, S̲ = S̄ exhibits exactly one statement per class of merged statements. In the general case, however, finding a set S̲ of maximal size is NP-hard.¹ Instead, we define a conservative approximation of S̲. Let < be an arbitrary total order on statements. For example, it can be the order in which the kernel representation lists the statements. The model defines S̄ and S̲ for a set of statements S ⊆ S as follows.

S̄ := {s ∈ S | ∀s' ∈ S. s' < s ⟹ order(s, s') ≠ {MERGED}}   (5.13)

S̲ := {s ∈ S | ∀s' ∈ S. s' < s ⟹ MERGED ∉ order(s, s')}   (5.14)

The model then relies on these sets to bound the size of loop levels, that is, the number of iterations of the loop nests they represent. In fixed candidates, the size of dimensions is a function of the kernel parameters, which are known to the model. In the general case, however, the size of dimensions also depends on the size choice. The model thus computes a lower and an upper bound for the size of loop levels based on the possible values of size. Specifically, it computes the lower bound sizemin(L) (respectively the upper bound sizemax(L)) of the size of L ∈ L as follows.

1. It replaces L by L̲ (respectively L̄) to account for merged dimensions.

2. It groups dimensions into logical dimensions when possible. Recall that logical dimensions are groups of dimensions that correspond to a single dimension of the original problem. Contrary to regular dimensions, they have a fixed size.

3. It computes the minimal (respectively maximal) possible size of each remaining dimension, based on the size choice. It then computes the product of all these values and of the sizes of the logical dimensions to obtain the total minimal (respectively maximal) size of L.

The model applies the same strategy to compute the least common multiple sizelcm(L) of the potential sizes of L ∈ L, based on the least common multiple of the potential sizes of each dimension d ∈ L.

¹It is possible to encode a graph coloring problem as the computation of S̲: each node of the graph is a statement and MERGED ∈ order(s1, s2) if and only if there is an edge between s1 and s2. Then each statement in S̲ is a color.
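The following sketch computes the three size summaries of a loop level, assuming each dimension simply carries the set of sizes its size choice still allows and ignoring merged and logical dimensions for brevity (hypothetical input format):

    import math

    def size_bounds(level_sizes):
        """level_sizes: dict dimension -> set of possible sizes."""
        size_min = math.prod(min(s) for s in level_sizes.values())
        size_max = math.prod(max(s) for s in level_sizes.values())
        size_lcm = math.prod(math.lcm(*s) for s in level_sizes.values())
        return size_min, size_max, size_lcm

    # A dimension that may be tiled by 4 or 8, nested with a fixed dimension of 32.
    print(size_bounds({"d0": {4, 8}, "d1": {32}}))  # -> (128, 256, 256)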

For i ∈ ⟦0, k⟧, let D_i ⊆ D be the set of dimensions that may map to parallelism level i. The model computes the number of instances N_L^i(s) of a statement s ∈ S, in a loop level L ∈ L at parallelism level i ∈ ⟦0, k⟧, as the size of the loop nest L' composed of the dimensions that are nested outside s, belong to L, and map to a parallelism level j ≤ i.

N_L^i(s) := sizemin({d ∈ L | order(s, d) = {IN} ∧ ∃j ≤ i. d ∈ D_j})   (5.15)

Note that equation (5.15) accounts for all the dimensions that may map to a parallelism level j ≤ i, instead of only the dimensions that must map to such a parallelism level j. Section 5.1.4 explains the justification for this choice.

Loop Level Iteration Latency

We now explain how the model computes a lower bound for latency(i, L), the execution time of a single iteration of loop level L, based on the hardware resources available at parallelism level i. To this end, the model first defines iter_usage(i, L, r) as the usage of resource r ∈ R generated by a single iteration of L within parallelism level i.

iter_usage(i, L, r) := iter_usage_overhead(L, r) + Σ_{s ∈ S|L} N_L^i(s) × usage(s, r)   (5.16)

The model then defines the lower bound latency(i, L) as the maximum of the latency incurred by the usage of each resource and the minimum latency of a loop iteration.

latency(i, L) := max( max_{r∈R} iter_usage(i, L, r) / resource(i, r), iter_min_latency(L) )   (5.17)

In the rest of this section, we only use latency at the loop level L = ∅, which represents the usage of the entire computation. Section 5.2, however, uses it at other loop levels to model the execution time of a single thread.

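The computation of equations (5.16) and (5.17) boils down to a weighted sum per resource followed by a maximum, as in the sketch below (all inputs are hypothetical placeholders):

    def iteration_latency(body, n_instances, usage, overhead, resource, min_latency):
        """Lower bound on one iteration of a loop level at a parallelism level.
        body: statements of the loop level; n_instances[s]: N_L^i(s);
        usage[s][r], overhead[r]: per-statement and per-iteration usage;
        resource[r]: units of r processed per cycle; min_latency: iter_min_latency(L)."""
        per_resource = [
            (overhead.get(r, 0) + sum(n_instances[s] * usage[s].get(r, 0) for s in body))
            / resource[r]
            for r in resource]
        return max(max(per_resource, default=0), min_latency)

    # Example: 4 multiply-adds and 1 load per iteration on a 1-FMA/cycle core.
    print(iteration_latency(["fma", "load"], {"fma": 4, "load": 1},
                            {"fma": {"alu": 1}, "load": {"mem": 1}},
                            {}, {"alu": 1.0, "mem": 0.5}, min_latency=1))  # -> 4.0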
5.1.4 Global Bound

We now show how the model uses latency to compute a lower bound B_i on the execution time T_i of each parallelism level i ∈ ⟦0, k⟧. For this, we assume that a separate model, detailed in Section 5.2, gives a lower bound B_0 on the execution time of a single thread. Let i ∈ ⟦1, k⟧. Let us assume for a moment that the parallelism at levels j > i is fixed. Following the hardware abstraction of Section 5.1.1, parallelism level i computes η_i instances of parallelism level i − 1, with at most µ_i instances executing in parallel. The model thus defines B_i as follows for fixed candidates.

B_i := max( B_{i−1} · ⌈η_i / µ_i⌉, latency(i, ∅) ) ≤ T_i   (5.18)

The first term of the maximum accounts for the execution time of lower parallelism levels, while the second term accounts for the usage of the hardware resources available at level i. In the general case, the exact values of η_i and µ_i are unknown except for fixed candidates. Computing the maximal value µ_i^max that µ_i can take is easy, since it only depends on the minimal amount of memory and/or registers that level i − 1 allocates. Unfortunately, this approach fails with η_i. Indeed, independently minimizing η_i and B_{i−1} produces a bound that is far too inaccurate when the mapping of dimensions to parallelism levels is not fixed yet: dimensions that may map to both parallelism levels i and j < i would be ignored.

Parallelism-Mapping Agnostic Bound. For D ⊆ D_i, we denote B_{i−1}^D and η_i^D the values of B_{i−1} and η_i assuming that the dimensions that map to parallelism level i are exactly the dimensions of D (recall that D_i is the list of dimensions that may map to parallelism level i). With this definition, B_{i−1}^D is maximal when D is minimal and η_i^D is maximal when D is maximal. We can then bound T_i by computing the minimum of the first term of equation (5.18) over all the possible values of D:

T_i ≥ min_{D⊆D_i} B_{i−1}^D · ⌈η_i^D / µ_i^max⌉   (5.19)

Let η_i^lcm := sizelcm(D_i) be a common multiple of the possible values of η_i^D, across every possible value of D and every possible assignment of the size choice. Since the possible sizes of a dimension often are multiples of each other, this usually corresponds to the product of the maximal size of each dimension of D_i. Then, for every D ⊆ D_i, there exists α^D ∈ ℕ such that η_i^lcm = α^D · η_i^D. We use α^D and the fact that ∀k ∈ ℕ. k·⌈x⌉ ≥ ⌈k·x⌉ to rewrite equation (5.19) as follows.

T_i ≥ min_{D⊆D_i} B_{i−1}^D · (α^D · η_i^D / η_i^lcm) · ⌈η_i^D / µ_i^max⌉   (5.20)
    ≥ min_{D⊆D_i} B_{i−1}^D · (η_i^D / η_i^lcm) · ⌈α^D · η_i^D / µ_i^max⌉   (5.21)
    ≥ min_{D⊆D_i} B_{i−1}^D · (η_i^D / η_i^lcm) · ⌈η_i^lcm / µ_i^max⌉   (5.22)

We then notice that B_{i−1}^D · η_i^D is minimal when D is minimal. Indeed, let D, D' ∈ L be such that D ⊆ D'. The assumptions the model makes about the mapping to parallelism levels of the dimensions {d ∈ D' | d ∉ D} when computing B_{i−1}^D do not impact the overhead of dimensions nor the individual usage and latency of statements: these are computed based on independent assumptions in Section 5.1.2. Assumptions about a dimension d ∈ D'\D can only impact η_j for j < i and the number of iterations of d at each parallelism level j < i. In the worst case, mapping the dimension to parallelism level j causes the entire computation to be sequentially repeated size(d) times, either by having j = 0, by increasing the factor ⌈η_j/µ_j⌉ in (5.18) or by increasing the usage at parallelism level j. Thus, for all D, D' ∈ L,

D ⊆ D' ⟹ B_{i−1}^D · η_i^D ≤ B_{i−1}^{D'} · size(D'\D) · η_i^D   (5.23)
                              ≤ B_{i−1}^{D'} · η_i^{D'}   (5.24)

Let B_{i−1}^max be the value of B_{i−1}^D with D minimal. The combination of equations (5.20) and (5.24) gives the following lower bound for B_i, where η_i^min is a lower bound for the value of η_i.

T_i ≥ B_{i−1}^max · (η_i^min / η_i^lcm) · ⌈η_i^lcm / µ_i^max⌉   (5.25)

η_i^min := sizemin({d ∈ D_i | ∀j < i. d ∉ D_j})   (5.26)

This bound is independent of D but still assumes that the mapping of dimensions to levels above i is fixed. We now explain how the performance model recursively uses this equation to bound the execution time of the entire kernel, and why the condition that parallelism levels above i are fixed is respected.

Bound on the Execution Time of a Kernel. The performance model recursively applies equation (5.25) to compute a lower bound B_i on the execution time of each parallelism level i ∈ ⟦1, k⟧, assuming that all the dimensions map to a parallelism level j ≤ i when possible, and to the lowest possible parallelism level above i otherwise. When i = k, this assumption imposes no constraints on the mapping of dimensions to parallelism levels. Thus B_k is a lower bound on the execution time of the entire computation, independently of any assumption on the mapping of dimensions to parallelism levels. For each i ≤ k, the assumption ensures that B_{i−1} = B_{i−1}^max in equation (5.25), and that the parallelism level of all the dimensions above i is fixed. The performance model can thus apply this equation to compute B_i. Moreover, the definition of latency, through equation (5.15), already follows the assumption. We thus define B_i as follows.

B_i := max( B_{i−1} × (η_i^min / η_i^lcm) × ⌈η_i^lcm / µ_i^max⌉, latency(i, ∅) )   (5.27)

Equation (5.27) accounts for the parallelism and the usage at each level of parallelism i ∈ ⟦1, k⟧. We now explain how we compute a lower bound B_0 on the execution time of a single thread.
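Folding the hierarchy into the final bound then amounts to the simple recursion below, a sketch of equation (5.27) with hypothetical per-level inputs:

    import math

    def global_bound(b0, levels):
        """levels: one dict per parallelism level i = 1..k, with keys
        eta_min, eta_lcm, mu_max and latency (the value of latency(i, ∅))."""
        b = b0
        for lv in levels:
            scaled = b * (lv["eta_min"] / lv["eta_lcm"]) \
                       * math.ceil(lv["eta_lcm"] / lv["mu_max"])
            b = max(scaled, lv["latency"])
        return b

    # Example: 256 threads per block (128 in flight), 64 blocks (16 in flight).
    print(global_bound(100.0, [
        {"eta_min": 256, "eta_lcm": 256, "mu_max": 128, "latency": 150.0},
        {"eta_min": 64, "eta_lcm": 64, "mu_max": 16, "latency": 0.0},
    ]))  # -> 800.0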

5.2 Single Thread Model

Section 5.1 explained how the performance model computes a lower bound on the execution time of a partially specified candidate. The explanation assumed that a bound B0 on the execution time of a single thread was provided. We now explain how this bound is computed. Following the hypothesis of Section 5.1.4, we consider that dimensions map to the lowest parallelism level possible. Any dimension that may be sequential is thus considered sequential by the model. The exceptions are the dimensions that may be vectorized, as we consider vector dimensions as part of parallelism level 0.

Overview. The model expresses the execution time of a thread as the weight of the longest path in an execution graph, which represents all instances of each statement and the latency constraints between them. This graph is too big to build in practice, as it grows exponentially with the number of loops, but it makes a convenient abstraction to define the bound that the model computes and to prove that it is indeed a lower bound for the execution time of any implementation c' ∈ ⟦c⟧ that derives from c. The difficulty of computing a lower bound for the execution time of a thread is that the structure of loops may be only partially specified. In particular, it might be unclear in which iteration dimensions a statement will be nested, or how many instances of that statement the final code will execute. We solve this problem in four steps.

1. We define the execution graph G_c of a candidate c to account for all potential instances of statements.

2. We define the concept of concrete statement instance to represent the statement instances that are actually executed. This concept only makes sense for fixed candidates. We show that if c is a fixed candidate, the weight of any path in G_c that only visits concrete statement instances is a lower bound for the execution time of a thread.

3. We show that in the case where c is a fixed candidate (i.e. an implementation), any path in the execution graph G_c has a longer path that only visits concrete statement instances. This shows that we can compute B_0 from the weight of any path in G_c.

4. We extend the previous result to the case where c is a partially specified candidate. For that, we show there exists an injection from G_c to the execution graph of any fixed candidate c' ∈ ⟦c⟧ that derives from c. This implies that any path in G_c has a longer path in the execution graph G_{c'} of every c' ∈ ⟦c⟧. The weight of any path in G_c is thus a lower bound for the execution time of a thread for any implementation that derives from c.

As stated earlier, the graph Gc is far too big to be constructed explicitly. Instead, our algorithm builds a conservative abstraction of the paths in Gc, and computes the longest path in this abstraction.

Outline. Section 5.2.1 first defines the concept of execution point to represent the potential points in the execution trace of implementations. This concept refines the notion of statement instance. Section 5.2.2 then defines the execution graph Gc of a candidate c and proves that the weight of any path in Gc is a lower bound on the execution time of a thread. Finally, Section 5.2.3 explains how the model builds a conservative abstraction of the paths in Gc in order to compute B0.

5.2.1 Execution Points

This section defines the concept of execution point in order to represent points in the execution trace of implementations. To that end, it first defines the concept of program point to represent points in the code of implementations. It then defines the program order to specify the order in which program points appear in the code of implementations. Finally, it defines an execution point as an instance of a program point.

Program Points

Let c be a candidate. A program point of c represents a point in the code of the implementations that derive from c. It can be an instruction (in which case it represents the point just before the instruction), the beginning of the iteration of a loop level or the end of the iteration of a loop level.

Definition 5.2 (Program Point). Let L ∈ L be a loop level. We define E_L as the entry point of L and X_L as the exit point of L. Together with the instructions, the entry and exit points of loop levels form the program points. We denote P the set of all program points.

P := I ∪ {E_L | L ∈ L} ∪ {X_L | L ∈ L}   (5.28)

The reasons why we consider loop levels instead of single dimensions are already detailed in Section 5.1.3. In particular, they allow considering the innermost body of a loop nest even when the nesting of dimensions is left unspecified. The loop level ∅ represents the entire computation. Thus, E_∅ and X_∅ represent the beginning and the end of the computation respectively. In fixed candidates, only the loop levels that are classes of merged dimensions correspond to loops in the code. We reflect that with the concepts of concrete and virtual program points. Let c be a fixed candidate. A program point p is concrete if and only if it is an instruction, or the entry point or exit point of a loop level L ∈ L such that L exactly matches a class of merged dimensions. This includes the entry and exit points of the ∅ loop level. A program point that is not concrete is virtual. Note that this terminology does not make sense for partially specified candidates, as the classes of merged dimensions may not be fixed yet.

Program Order

Let c be a candidate. We define the program order of c as a partial order on program points that reflects the order in which they appear in the code of any implementation that derives from c. Unordered program points indicate unspecified ordering choices. To define this order, we first define three functions, before, current and after, that indicate the position of a program point p ∈ P relative to statements. First, before lists the statements that necessarily execute before p. These are the dimensions whose exit points are before or equal to p and the instructions that are before p. Formally, we define before as follows.

before(p) :=
  {s ∈ S | s ≠ i ∧ order(s, i) = {BEFORE}}                       if p = i ∈ I
  ⋃_{d∈L} {s ∈ S | s ∉ L ∧ order(s, d) = {BEFORE}}                if ∃L ∈ L. p = E_L
  ⋃_{d∈L} {s ∈ S | s ∈ L ∨ order(s, d) ⊆ {BEFORE, IN}}            if ∃L ∈ L. p = X_L
(5.29)

Similarly, after lists the statements that necessarily execute after p. These are the dimensions whose entry points are after or equal to p and the instructions that are after p. Formally, we define after as follows.

after(p) :=
  {s ∈ S | s ≠ i ∧ order(s, i) = {AFTER}}                        if p = i ∈ I
  ⋃_{d∈L} {s ∈ S | s ∈ L ∨ order(s, d) ⊆ {AFTER, IN}}             if ∃L ∈ L. p = E_L
  ⋃_{d∈L} {s ∈ S | s ∉ L ∧ order(s, d) = {AFTER}}                 if ∃L ∈ L. p = X_L
(5.30)

Finally, current lists the statements that execute when p executes. This includes the dimensions nested outside p and p itself.

current(p) :=
  {s ∈ S | s = i ∨ order(s, i) = {OUT}}                           if p = i ∈ I
  ⋃_{d∈L} {s ∈ S | s ∈ L ∨ order(s, d) ⊆ {OUT, MERGED}}            if ∃L ∈ L. p = E_L ∨ p = X_L
(5.31)

We define the program order from the relative inclusion of before, current and after. A program point p1 is before another program point p2 if the statements that execute before p1 also execute before p2, the statements that execute at the same time as p1 execute at the same time as or before p2, the statements that execute after p2 execute after p1, and the statements that execute at the same time as p2 execute after or at the same time as p1. This definition ensures that the program order is a partial order even if the domain of the order choice is partially specified and contains invalid values.

Definition 5.3 (Program Order). The program order is the partial order ⪯ on program points such that for all p1, p2 ∈ P, p1 ⪯ p2 if and only if the conditions below are satisfied.

before(p1) ⊆ before(p2) (5.32a)

before(p1) ∪ current(p1) ⊆ before(p2) ∪ current(p2) (5.32b)

after(p1) ⊇ after(p2) (5.32c)

after(p1) ∪ current(p1) ⊇ after(p2) ∪ current(p2) (5.32d)

p1 ≠ X_∅ ∨ p2 ≠ E_∅   (5.32e)

Section A.2.1 proves that ⪯ is indeed a partial order. Equation (5.32e) ensures it remains a partial order even if the kernel contains no statements. Section A.2.2 proves that if c is a fixed implementation, ⪯ defines a total order on the concrete execution points that matches the control flow structure of the code of c. The following proposition, proven in Section A.2.3, extends this property to partially specified candidates. It shows that the ordering constraints exposed by ⪯ for a candidate c remain valid for any implementation c' ∈ ⟦c⟧ that derives from c. The program order of a partially specified candidate c thus reflects ordering constraints on the code of any implementation c' ∈ ⟦c⟧ that derives from c.

Proposition 5.1. Let c be a candidate and c' ∈ ⟦c⟧ be a fixed candidate that derives from c. Let P and P' be the program points and ⪯ and ⪯' the program orders of c and c', respectively. Then P ⊆ P' and ⪯ implies ⪯':

∀p, q ∈ P. p ⪯ q ⟹ p ⪯' q.   (5.33)

We could actually prove that ⪯ is monotone when specifying decisions, by making assumptions about the strength of the propagators that enforce the constraints on the domain of the order choice. This would show that the ordering constraints exposed by ⪯ remain valid for any candidate, fixed or not, that derives from c. These assumptions are respected in practice, but would make our model less generic, and the monotonicity is not needed to prove the correctness of the bound. We now introduce the concept of execution point to represent instances of program points during the execution of an implementation.
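As a concrete illustration of Definition 5.3 before turning to execution points, the sketch below checks the inclusion conditions on before/current/after sets given as plain Python sets; the example points are hypothetical and would, in the real model, be derived from the order choice.

    def program_order_leq(p1, p2):
        """p1, p2: dicts with keys 'name', 'before', 'current', 'after' (sets)."""
        return (p1["before"] <= p2["before"]                                    # (5.32a)
                and p1["before"] | p1["current"] <= p2["before"] | p2["current"]  # (5.32b)
                and p1["after"] >= p2["after"]                                  # (5.32c)
                and p1["after"] | p1["current"] >= p2["after"] | p2["current"]  # (5.32d)
                and not (p1["name"] == "X_empty" and p2["name"] == "E_empty"))  # (5.32e)

    # Two instructions with order(a, b) = {BEFORE} in a kernel with no dimensions.
    a = {"name": "a", "before": set(), "current": {"a"}, "after": {"b"}}
    b = {"name": "b", "before": {"a"}, "current": {"b"}, "after": set()}
    print(program_order_leq(a, b), program_order_leq(b, a))  # -> True False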

Execution Points

During the execution of an implementation, each program point p ∈ P is executed once for each point in the iteration space composed of the dimensions nested outside p and of the dimensions that p represents (in the case where p = E_L or p = X_L with L ∈ L). We thus introduce the concept of execution point to represent the different instances of each program point. An execution point is the combination of a program point and an index that specifies the current iteration along each loop level. The following definition first formalizes the notion of index.

Definition 5.4 (Index). An index I is a mapping from loop levels to natural numbers, such that the number assigned to a loop level L is less than the minimal potential size of L. We denote I the set of all indexes. An index I ∈ I thus has the following form.

I : L → ℕ
    L ↦ i ∈ ⟦0, sizemin(L) − 1⟧   (5.34)

For two indexes I1, I2 ∈ I, I1 ≤ I2 ⟺ ∀L ∈ L. I1(L) ≤ I2(L). Note that indexes count iterations orthogonally along each loop level, even if some loop levels may overlap. This makes it easier to model the impact of a loop level without considering the others. We later extend the concepts of concrete and virtual program points to indexes, in order to filter out the indexes that do not make sense in the final code.

Definition 5.5 (Execution Point). An execution point is the combination (p, I) of a program point p ∈ P and of an index I ∈ I. We denote E := P × I the set of all execution points.

An execution point (p, I) ∈ E represents the instance of the program point p that the execution of an implementation reaches after iterating I(L) times along each loop level L ∈ L. In particular, I(L) = 0 before entering the dimensions of L and I(L) = sizemin(L) − 1 after exiting the dimensions of L. We extend the concepts of concrete and virtual program points to execution points in order to filter out program point instances that do not appear in the execution of the final code. Let c be a fixed implementation. An execution point (p, I) ∈ E is concrete if p is concrete and, for all L ∈ L, I(L) respects the following conditions.

• If L does not match a class of merged dimensions, then I(L) = 0. This includes the case where L contains dimensions from multiple classes, but also the case where L is only a subset of a class of merged dimensions. It ensures that concrete execution points are only iterated along the loop levels that correspond to a single loop of the final code.

• Otherwise, if p ≺ E_L, then I(L) = 0. This states that I(L) is zero before entering the loop represented by L.

• Otherwise, if X_L ≺ p, then I(L) = size(L) − 1. This states that I(L) is maximal after exiting the loop represented by L.

An execution point that is not concrete is virtual. Note that while the concepts of concrete and virtual execution points only apply to fixed candidates, the concept of execution point is defined for any candidate, even when some decisions are left open. We now define execution graphs in order to represent latency constraints between execution points.

5.2.2 Execution Graph

We define the execution graph G_c of a candidate c to represent the latency constraints between execution points of c. We then prove that the weight of any path in G_c is a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧ that derives from c. For that, we first prove this property on fixed candidates and then generalize it to any candidate.

Definition

An execution graph links execution points with edges that indicate latency constraints. For p1, p2 ∈ E, an edge p1 →^w p2 indicates that p2 cannot execute less than w cycles after p1. An edge p1 →^0 p2 indicates that p2 must execute after or at the same time as p1. Edges between execution points with the same index reflect data dependencies, ordering dependencies and the latency induced by the usage of loop levels. Edges across indexes reflect the ordering of iterations, point-to-point communications and loop-carried dependencies. The execution graph forbids iterating along multiple loop levels that contain the same dimension. This ensures each loop iteration is only accounted for the correct number of times. We say that a loop level L ∈ L is exclusive for an index I ∈ I if, for every loop level L' ∈ L distinct from L that contains a dimension in common with L, or a dimension merged with a dimension of L, I(L') = 0. We are now ready to give the definition of the execution graph.

Definition 5.6 (Execution Graph). Let c be a candidate. We define the execution graph Gc of c as the weighted directed graph whose vertices are the execution points E and whose edges are defined as follows.

1. The execution order of program points at a fixed index follows the program order. Let p1, p2 ∈ P be such that p1 ⪯ p2 and let I ∈ I. Then G_c has an edge (p1, I) →^0 (p2, I).

2. The usage generated by a single iteration of a loop level L ∈ L imposes a latency constraint between the entry and the exit points of L (recall that this latency includes the usage of all the statements that execute in an iteration of L). Let i ∈ ⟦0, k⟧ be a parallelism level that produces synchronization operations (including the sequential case, where i = 0) and let I1, I2 ∈ I be such that I1 ≤ I2. Then, G_c has the following edge.

∀L ∈ L. (E_L, I1) →^{latency(i,L)} (X_L, I2).   (5.35)

3. Data dependencies between pairs of instructions induce latency constraints from the source instructions to the destination instructions. Let a, b ∈ I be two instructions such that b reads the value that a produces, with point-to-point communication from dimensions (d_i)_{i<n} to dimensions (d'_i)_{i<n}. Let I, I' ∈ I be two indexes such that, for every i < n, I(L_i) = I'(L'_i) or L'_i is exclusive in I', where L_i and L'_i are loop levels containing d_i and d'_i respectively. Then, G_c has the following edge.

(a, I) →^{inst_latency(a)} (b, I').   (5.36)

Note that when n = 0, this implies an edge with a constant index I = I'. On a fixed candidate, for i < n, L_i and L'_i are the subsets of the classes of merged dimensions of d_i and d'_i respectively. For i < n, the condition I(L_i) = I'(L'_i) or L'_i is exclusive in I' ensures that the index along L'_i only increases if L'_i is exclusive.

4. The exit point of an iteration of a sequential loop level L ∈ L is before the entry point of the following iteration. Following the hypothesis that dimensions are mapped to the lowest parallelism level possible, we consider that L is sequential if L ⊆ D_0 and no dimension of L may be vectorized.

Let I ∈ I be such that L is exclusive and I(L) + 1 < sizemin(L). Let L_IN be a set of loop levels ordered within L (that is, ∀L' ∈ L_IN. E_L ≺ E_{L'} ≺ X_{L'} ≺ X_L). Let I' ∈ I be defined as follows.

∀L' ∈ L. I'(L') :=
  I(L) + 1   if L' = L
  0          if L' ∈ L_IN
  I(L')      otherwise
(5.37)

Then, G_c contains an edge (X_L, I) →^0 (E_L, I'). The definition of I' accounts for the fact that the loop levels of L_IN may execute in full for each iteration of L.

5. An instruction a ∈ I that consumes the value it produces at the previous iteration of loop level L ∈ L induces a latency constraint from one iteration of L to the next. Formally, let I ∈ I be such that L is exclusive and I(L) + 1 < sizemin(L), let L_IN be a set of loop levels ordered within L and let I' ∈ I be defined as follows.

∀L' ∈ L. I'(L') :=
  I(L) + 1   if L' = L
  0          if L' ∈ L_IN
  I(L')      otherwise
(5.38)

Then, G_c contains an edge (a, I) →^{inst_latency(a)} (a, I').

6. Let L ∈ L be a sequential loop level and (p, I) ∈ E an execution point such that I(L) + 1 < sizemin(L). Then (p, I) is executed before the following instance of p along loop level L. Thus, G_c has an edge (p, I) →^0 (p, I[L/I(L) + 1]).

The rest of this section shows that the weight of any path in G_c (and in particular the longest path) is a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧. The following proposition, proved in Section A.3.1, first shows that the concept of longest path in G_c is well-defined.

Proposition 5.2. The execution graph Gc of a candidate c is acyclic.

We now show that the weight of this path is indeed a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧. For that, we first study fixed candidates before generalizing the result to partially specified candidates.

Execution Graphs of Fixed Candidates

Let us suppose for the moment that c is a fixed candidate. We want to show that the weight of any path in G_c is a lower bound for the execution time of a thread of c. For this, we first state that this is true for paths that only visit concrete execution points. We then generalize this result to any path in G_c.

Proposition 5.3. Let c be a fixed implementation and e_0 →^{w_0} … →^{w_{n−1}} e_n be a path in G_c such that for all i ≤ n, e_i is a concrete execution point. Then, the weight w = Σ_i w_i of the path is a lower bound on the execution time of a thread of c.

The proof of Proposition 5.3 comes from the fact that each edge e_1 →^w e_2 between two concrete execution points e_1, e_2 ∈ E in G_c indicates that e_2 cannot execute less than w cycles after e_1. The next proposition is the central idea of our performance model. It states that for any path in G_c, potentially visiting virtual execution points, there is a longer path that only visits concrete execution points.

Proposition 5.4. Let c be a fixed implementation and p a path of weight w in the execution graph G_c of c. Then, there exists a path p' of weight w' ≥ w in G_c that only visits concrete execution points.

We prove Proposition 5.4 in Section A.4. Combined with Proposition 5.3, this proposition gives the following theorem.

Theorem 5.1. Let c be a fixed implementation and P a path of weight w in Gc. Then w is a lower bound for the execution time of a thread of c.

In practice, the model tries to consider the longest path possible in Gc, in order to obtain a more accurate bound for the execution time of a thread of c. We now extend the result of Theorem 5.1 to candidates with open choices.

Execution Graphs of Candidates

Let c be a candidate, potentially with open choices. The following proposition shows that the paths in G_c are preserved when specifying decisions.

Proposition 5.5. Let c be a candidate and c' ∈ ⟦c⟧ be a fixed candidate that derives from c. Then, there exists a function f : E → E' from the vertices of G_c to the vertices of G_{c'} that preserves the edges:

∀e_1, e_2 ∈ E. e_1 →^w e_2 ∈ G_c ⟹ ∃w' ≥ w. f(e_1) →^{w'} f(e_2) ∈ G_{c'}   (5.39)

Section A.3.2 explains how to define such a function f. It results from this proposition that any path in G_c has a corresponding path of greater or equal weight in the execution graph G_{c'} of every implementation c' ∈ ⟦c⟧ that derives from c. Combined with Theorem 5.1, Proposition 5.5 thus provides a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧.

Theorem 5.2. The weight of any path in the execution graph G_c of a candidate c is a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧ that derives from c.

This theorem generalizes Theorem 5.1 to candidates with open choices. In particular, it shows that the longest path in the execution graph provides a lower bound for the execution time of a thread of c. In practice, building G_c to find its longest path would be too slow for our application, as it contains far too many nodes. Instead, the next section shows how the performance model computes the length of a path in G_c, based on abstractions of the paths in G_c. While this path might not be the longest, it still provides a correct and accurate enough bound.

5.2.3 Abstraction of the Execution Graph

The number of execution points in E = P × I is far too large to directly manipulate execution graphs. The reasons for the large size of E are twofold. First, the number of loop levels may be exponential in the number of dimensions, leading to an exponential number of program points in P. Second, each program point is instantiated for each index in I.

The model solves the first problem by only considering a carefully selected subset L_0 ⊆ L of all loop levels. It thus operates on a restricted set of program points P_0 ⊆ P that only contains the instructions and the entry and exit points of the selected loop levels.

P_0 := I ∪ {E_L | L ∈ L_0} ∪ {X_L | L ∈ L_0}   (5.40)

The model overcomes the second problem with the observation that, under certain conditions detailed below, the paths operating at a fixed index I_0(L) along loop level L ∈ L are the same for any value of I_0(L). This makes it possible to abstract away the indexes in the definition of execution points, and to significantly reduce the number of vertices to consider when computing the longest path. We first assume that the loop levels L_0 to consider are given, and we explain how the model abstracts the paths in G_c. We then explain how these abstractions fit in the computation of the lower bound. Finally, we explain how we pick L_0.

Abstraction of the Execution Graph.

For each D ⊆ D, we define a graph G_c^D that abstracts the paths of G_c that only iterate along the loop levels L ⊆ D included in D. The vertices of each G_c^D are the program points P_0. The goal is to build the graph for the full set of dimensions, which considers the paths along any loop level. For p1, p2 ∈ P_0, a path from p1 to p2 in G_c^D indicates that there exist I1, I2 ∈ I and a path from (p1, I1) to (p2, I2) of the same weight in G_c.

We define each G_c^D based on the graphs G_c^{D'} for D' ⊂ D. The performance model thus starts from the graph G_c^∅ that only considers the paths within a fixed index. It then iteratively builds graphs that consider iterations along more loop levels until it reaches the graph for the full set of dimensions.

When accounting for the effects of iterating along a loop level L ∈ L_0, the performance model computes the length of the paths starting from a program point p ∈ P_0 ordered before E_L and ending at the last iteration of L. For that it needs to know the length of the paths from p to the first iteration of L and the length of the paths within a single iteration of L. The weight of these paths is maximal when the model has already accounted for the effects of iterating along the dimensions that are ordered before or nested inside L. We thus define D_L as the dimensions that are ordered before or nested inside L and compute the effect of L based on D_L.

D_L := {d ∈ D | X_{{d}} ≺ X_L}   (5.41)

We are now ready to give the definition of G_c^D for each D ⊆ D. In practice, the model only computes G_c^D for a few selected D. Informally, a path from p1 to p2 in G_c^D represents the paths in G_c that start at the last instance of p1 along the loop levels L ⊆ D, and end at the last instance of p2 along the same loop levels.

Definition 5.7. Let D ⊆ D. We define G_c^D as the graph whose vertices are P_0 and whose edges are defined as follows.

1. G_c^D reflects the edges of G_c that originate from the ordering constraints and from the usage of loop level iterations. These edges operate within a fixed index, and are thus present independently of the value of D. G_c^D thus contains the following edges, where i ∈ ⟦0, k⟧ is a parallelism level that produces synchronization operations (including the sequential case, where i = 0).

∀p1, p2 ∈ P_0. p1 ≺ p2 ⟹ p1 →^0 p2   (5.42)

∀L ∈ L_0. E_L →^{latency(i,L)} X_L   (5.43)

This corresponds to items 1 and 2 of the execution graph definition.

2. G_c^D considers the latency between two data-dependent instructions a, b ∈ I whenever the dimensions (d_i)_{i<n} and (d'_i)_{i<n} of the point-to-point communication between them satisfy the conditions of item 3 of the execution graph definition. In that case, G_c^D contains the following edge.

a →^{inst_latency(a)} b   (5.44)

This corresponds to item 3 of the execution graph definition.

3. G_c^D accounts for the paths that iterate along the loop levels L ∈ L_0 such that both L and D_L are included in D (L ∪ D_L ⊆ D) and every dimension in L is sequential. Let p ∈ P_0 be a program point that is before the entry point E_{L'} of any loop level L' ∈ L that contains a dimension in common with L, or a dimension that can be merged with a dimension of L. Then, G_c^D has an edge that reflects the paths that start from p, complete the first iteration of L and then iterate along L until they reach the entry of the last iteration of L. Let w1 and w2 be the weights of the longest paths from p to X_L and from E_L to X_L respectively in G_c^D. Then G_c^D contains the following edge.

sizemin(L) ≥ 2 ⟹ p →^{w1 + w2·(sizemin(L) − 2)} E_L   (5.45)

This corresponds to item 4 of the execution graph definition.

4. G_c^D accounts for the paths that follow loop-carried data dependencies along any loop level L ∈ L_0 such that both L and D_L are in D (L ∪ D_L ⊆ D). Let p ∈ P_0 be a program point that is before the entry point E_{L'} of any loop level L' ∈ L that contains a dimension in common with L, or a dimension that can be merged with a dimension of L. Let a ∈ I be an instruction that depends on the value it produces at the previous iteration of L. Then, G_c^D has an edge that reflects the paths that start from p, go to the first instance of a along L and then follow the data dependency up to the last iteration of L. Let w be the weight of the longest path between p and a in G_c^{D_L}. Then G_c^D has the following edge.

p →^{w + inst_latency(a)·(sizemin(L) − 1)} a   (5.46)

This corresponds to item 5 of the definition of the execution graph.

The last two points of Definition 5.7 consider paths in G_c that start from a program point p that is before E_L. This ensures that the execution points (p, I) that p represents in G_c^D have I(L) = 0, and thus that L has not already been iterated. Equation (5.45) handles the first and the last iterations of L differently than the others, in order to account for the data dependencies from before L to the first iteration of L and for the data dependencies from the last iteration of L to after L.

Proposition 5.6. Let D ⊆ D and p1, p2 ∈ P_0 be such that there exists a path from p1 to p2 of weight w in G_c^D. Let I1, I2 ∈ I be such that I1 ≤ I2 and the following conditions are respected for every loop level L ∈ L.

∀i ∈ {1, 2}. L ⊆ D ∧ p_i ⊀ E_L ⟹ I_i(L) = sizemin(L) − 1   (5.47)

∀i ∈ {1, 2}. L ∩ D ≠ ∅ ∧ p_i ≺ E_L ⟹ I_i(L) = 0   (5.48)

L ∩ D = ∅ ⟹ I_1(L) = I_2(L)   (5.49)

Then, there exists a path from (p1, I1) to (p2, I2) of weight w in G_c.

Equations (5.47) and (5.48) show that G_c^D represents the paths between the last possible instances of the program points along the dimensions of D. Note that this includes the paths that start at the entry point of the kernel E_∅, at index 0, and finish at the exit point of the kernel X_∅, at the maximal index. Equation (5.49) indicates that the paths represented by G_c^D are valid for any fixed index along the loop levels that do not intersect D. We give the proof of Proposition 5.6 in Section A.5. This proposition shows, among other things, that the weight of any path in G_c^D is less than the weight of the longest path in G_c. Combined with Theorem 5.2, it gives us the following theorem.

Theorem 5.3. The weight of any path in G_c^D is a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧ that derives from c. In particular, the weight of the longest path from E_∅ to X_∅ in G_c^D is a lower bound for the execution time of a thread of any implementation c' ∈ ⟦c⟧ that derives from c.

This theorem gives an algorithm to compute the lower bound from G_c^D. Moreover, for any D ⊆ D, the definition of G_c^D only depends on the graphs G_c^{D_L} such that D_L ⊂ D. The performance model thus computes G_c^D by computing G_c^{D_L} for each L ∈ L_0, starting from G_c^∅ and then following the inclusion order of the loop levels.

Each G_c^D has a number of vertices that is linear in the number of instructions and the number n of loop levels considered. The following section shows that in practice, the number of loop levels the model considers is bounded by the square of the number of statements. Contrary to an algorithm building G_c, the time complexity of the performance model is thus polynomial in the number of statements, and independent of the number of instances of each statement. In our experiments, the execution time of the performance model was always negligible compared to the time required for constraint propagation and to the time required for the evaluation of the implementations on the hardware.
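Since each abstracted graph is a small DAG over P_0, the longest path can be computed with a standard topological-order pass. The sketch below is a generic illustration of that step; the adjacency lists and node names are hypothetical, not the thesis data structures.

    from graphlib import TopologicalSorter

    def longest_path(edges, source, target):
        """edges: dict node -> list of (successor, weight); the graph is acyclic."""
        preds = {n: [] for n in edges}
        for u, succs in edges.items():
            for v, w in succs:
                preds.setdefault(v, []).append((u, w))
        order = TopologicalSorter({n: [u for u, _ in ps] for n, ps in preds.items()})
        dist = {source: 0.0}
        for n in order.static_order():
            for u, w in preds.get(n, []):
                if u in dist:
                    dist[n] = max(dist.get(n, float("-inf")), dist[u] + w)
        return dist.get(target)

    # Entry point, one 4-cycle instruction, exit point, plus 0-weight order edges.
    g = {"E": [("a", 0.0), ("X", 0.0)], "a": [("X", 4.0)], "X": []}
    print(longest_path(g, "E", "X"))  # -> 4.0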

Loop Levels Selection.

We previously assumed that the set L_0 of loop levels to consider was given. We already mentioned that L_n = ∅. We now explain how the performance model chooses the other loop levels to consider. Picking the right loop levels is important for the accuracy of the bound. However, increasing the number of loop levels to consider also increases the number of graphs G_c^{D_L} to compute, and the number of vertices in each of them. We first make some observations on how to choose the loop levels to consider and then detail the algorithm that picks them.

Pertinent Loop Levels. The set of loop levels L_0 to consider limits the paths in G_c that the performance model considers. Fortunately, not every loop level is pertinent when looking for the longest path in G_c. Some loop levels incur no additional latency while others are redundant with combinations of smaller loop levels. In particular, if a loop level L ∈ L is nested inside another L′ ∈ L, then considering the combined loop level L ∪ L′ does not bring any additional insight on the performance of the candidate. Indeed, the paths that iterate along L ∪ L′ all have equivalent paths that iterate separately along L for each iteration of L′. Moreover, the usage of the hardware resources generated by L ∪ L′ is already accounted for by L. The algorithm we present below picks enough loop levels to consider paths that iterate along any combination of dimensions of the iteration spaces of statements, while avoiding redundant loop levels. It always picks a number of loop levels that is smaller than the square of the number of statements.

Algorithm. In the following, we denote by I_s the set of dimensions that compose the iteration space of statement s ∈ S. Note that I_s ∈ L for any s ∈ S, as the dimensions that are part of the same iteration space are necessarily nested in each other. Moreover, there exists s_0 ∈ S such that I_{s_0} = ∅: indeed, at least one statement is not nested inside any dimension (excluding the case where S = ∅, in which case the lower bound for the execution time is 0).

The algorithm builds a directed acyclic graph whose nodes are the (I_s)_{s ∈ S}. An edge I_{s1} → I_{s2} indicates that any dimension d_2 ∈ I_{s2} is either nested inside I_{s1} or part of I_{s1}:

∀d_2 ∈ I_{s2}.  d_2 ∈ I_{s1} ∨ ∀d_1 ∈ I_{s1}. order(d_2, d_1) = {IN}.    (5.50)

This indicates that the performance model can account for the iteration space I_{s2} by considering the loop level I_{s2} \ I_{s1}, and that the remaining dimensions in I_{s1} will be accounted for by the edges on the paths from ∅ to I_{s1}. In practice, considering all the edges that respect equation (5.50) is unnecessary. If s_1, s_2, s_3 ∈ S are such that I_{s1} → I_{s2} and I_{s2} → I_{s3}, then the edge I_{s1} → I_{s3} is redundant, as the iteration space of s_3 is already accounted for by the first two edges. The algorithm thus computes the transitive reduction of the graph composed of all the edges that satisfy equation (5.50). It then computes the set L_0 of loop levels to consider as the loop levels I_{s2} \ I_{s1} induced by each edge I_{s1} → I_{s2}.
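The sketch below illustrates this selection procedure under the assumptions stated in its comments; `is_nested_in` stands for the order(d2, d1) = {IN} test and is a hypothetical helper, not part of the dissertation's implementation.

```python
from itertools import permutations

def select_loop_levels(iteration_spaces, is_nested_in):
    """iteration_spaces: one frozenset of dimensions per statement.
    is_nested_in(d2, d1): True when d2 must be nested inside d1 (hypothetical)."""
    nodes = set(iteration_spaces)
    # Edges allowed by equation (5.50): every dimension of b is in a, or nested in all of a.
    edges = {(a, b) for a, b in permutations(nodes, 2)
             if all(d2 in a or all(is_nested_in(d2, d1) for d1 in a) for d2 in b)}
    # Transitive reduction: drop an edge when a two-edge path already implies it.
    reduced = {(a, b) for (a, b) in edges
               if not any((a, c) in edges and (c, b) in edges
                          for c in nodes if c not in (a, b))}
    # One loop level per remaining edge: the dimensions of b not already in a.
    return {frozenset(b - a) for (a, b) in reduced}
```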

5.3 Instantiation of the Model for Kepler GPUs

Instantiating the performance model for a particular hardware target requires specifying the hardware resources, the parallelism levels and the amount of each resource they offer. It also requires providing

functions that compute the usage and latency of statements as well as the amount of parallelism available at each level. In this section, we instantiate the performance model for the Kepler family of Nvidia GPUs. Section 5.3.1 first explains how we expose the parallelism hierarchy of GPUs to the model. Then Section 5.3.2 presents the hardware resources we consider.

5.3.1 Parallelism Levels

As detailed in Section 4.3.1, our decision space for linear algebra on GPUs exposes three kinds of parallel dimensions: vector dimensions, thread dimensions and block dimensions. We define two parallelism levels (k = 2) to model thread and block dimensions. We model vector dimensions together with sequential dimensions in parallelism level 0.

Vector Dimensions. Vector dimensions can only contain a single instruction. In the model, we treat them as part of parallelism level 0, but without dependencies between iterations. In practice, this translates into ignoring potential vector dimensions in the single-thread latency model, but not in the computation of resource usage.

Thread Dimensions. Thread dimensions map to parallelism level 1. The maximal number of GPU threads that can execute in a block is 1024, thus µ_1^max = 1024. Multiple thread dimensions in different loop nests may map to the same hardware thread dimension. We thus modify how we compute η_1^min and η_1^lcm to only account once for the dimensions mapped to the same hardware thread dimension. In practice, this translates into considering classes of dimensions mapped to the same hardware dimension instead of classes of dimensions merged with each other. As Section 4.1 explains, Kepler GPUs execute threads in groups of 32, called warps. When the number of threads is not a multiple of 32, the unused slots of a warp still consume some resources. We thus correct the usage that each block of threads generates based on the minimal ratio of unused threads in warps.
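As an illustration, one plausible form of this correction is sketched below: resources are consumed per warp, so a block whose thread count is not a multiple of 32 is charged for the unused warp slots. The exact correction used by the model may differ; the factor below is an assumption made for the example.

```python
import math

WARP_SIZE = 32  # Kepler executes threads in warps of 32

def warp_usage_factor(threads_per_block):
    """Hypothetical usage correction: a block of t threads occupies
    ceil(t / 32) full warps, so per-thread usage is scaled accordingly."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    return warps * WARP_SIZE / threads_per_block

# Example: 48 threads occupy 2 warps, so usage is scaled by 64 / 48 ≈ 1.33.
```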

Block Dimensions. Block dimensions map to parallelism level 2. Unlike thread dimensions, they do not require a special handling of η_2^min and η_2^lcm. The amount of parallelism available at the block level, µ_2^max, depends on the minimal amount of shared memory allocated per block and on the maximal number of parallel blocks supported by the GPU.

5.3.2 Statements Usage and Latency

The hardware resources we model are the issue units, the ALU units, the load/store units, the L1 bandwidth, the L2 bandwidth and the number of synchronization instructions. Note that global memory accesses use both the L1 and the L2 bandwidth, even for accesses bypassing the caches. We obtained the latency and the usage of instructions, as well as the amount of resources available at each parallelism level, by combining micro-benchmarks with public data from the CUDA documentation. Except for memory accesses, all instructions have a fixed latency and usage. The usage and latency of memory accesses depend on the number of L1 and L2 cache lines or the number of shared memory banks each warp accesses. We developed ad hoc models to compute lower bounds of these values based on the decisions already fixed. These models rely on the fact that memory accesses in our area of interest are dominated by linear access patterns. The model returns the minimal possible usage and latency of a memory access for non-linear accesses.
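For instance, a lower bound on the number of cache lines a warp touches with a linear access pattern can be derived from the address span of its 32 accesses, as in the sketch below. The 128-byte line size and the helper name are assumptions for the example, not the exact ad hoc model of the dissertation.

```python
def min_cache_lines(stride_elems, elem_bytes, warp_size=32, line_bytes=128):
    """Lower bound on distinct cache lines touched by one warp performing a
    linear access: thread i reads address base + i * stride (illustrative)."""
    span = (warp_size - 1) * stride_elems * elem_bytes + elem_bytes
    return max(1, -(-span // line_bytes))  # ceiling division

# Example: coalesced float loads (stride 1) fit in a single 128-byte line,
# while a stride of 32 floats touches at least 32 lines.
```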

Sources of Inaccuracy. There are a few bottlenecks we do not model, either by lack of information or by lack of time to support them. In particular, our instantiation of the performance model has the following limitations.

• It does not model the overhead induced by threads, blocks of threads and by the kernel launch.

• It assumes that all ALUs support all operations. In practice, different kinds of ALUs may support different operations. For example, some ALUs may only support single precision, double precision or integer operations.

• It does not model cache misses. For this, it would need to be aware of the cache replacement policy and of the warp and block scheduling algorithms. As a consequence, the model ignores the RAM bandwidth. This is less of a problem than on CPUs, as the L2 bandwidth is already limited and the L1 does not cache global memory accesses.

• It does not model the number of registers allocated by threads and blocks of threads.

5.4 Empirical Evaluation

We evaluated the pertinence of the performance model, instantiated for GPUs, on actual candidates and implementations. The goal was to show that it provides actionable information that helps drive the search towards the most efficient implementations. To that end, we generated 35,000 random implementations of the matmul kernel, described in Section 6.2. For each sample, we started from the candidate representing the full implementation space and specified choices following a fixed order: first the memory layout, then size, dim_kind, thread_mapping, memory_space, order and finally cache. For each choice, we uniformly picked a random decision amongst the possible values listed in the domain of the choice. We recorded the lower bound computed by the performance model both on the final implementation and on intermediate candidates. Section 5.4.1 first studies the accuracy of the performance model on fully specified candidates. It shows that, although the performance model does not compute precise execution times, it still finds the right order of magnitude. Then, Section 5.4.2 shows this information is enough to prune the implementation space by several orders of magnitude. Finally, Section 5.4.3 shows our model can derive pertinent information from candidates even when only a few decisions are set.

5.4.1 Accuracy of the Performance Model

We first study the accuracy of the performance model on fully specified candidates. For this, we compare the execution time of implementations with the lower bound the model provides.

Methodology. Figure 5.1 plots the cumulative distribution, from 0 to 100%, of the ratio of the execution time of fully specified implementations to the lower bound the model provides for them. Note that the x axis is logarithmic. This figure excludes the samples for which the execution on the GPU timed out, as we do not have their exact execution time. This occurs when the kernel takes more than a couple of seconds to execute, instead of a few milliseconds for the best implementation. In our case, they accounted for 10% of the samples. These samples are not representative of the candidates that search algorithms manipulate, as the performance model tends to prune them early on.

Interpretation. The figure shows that all the bounds the model provides fall within two orders of magnitude of the actual execution time and that 70% of the bounds fall within one order of magnitude. This should be put in perspective with the large differences in execution time between implementations on GPUs: the best implementation runs in a few milliseconds while the worst implementations can take several seconds to execute. It also shows that our model still leaves much room for improvement. The next section shows that the information it provides is actually enough to reduce the implementation space size by several orders of magnitude.

5.4.2 Pruning Opportunities

The performance model allows for reducing the size of the implementation space without missing the best implementation. Indeed, if a search algorithm has found an implementation x with an execution time T(x), then it can safely prune any implementation y with a lower bound B(y) such that B(y) ≥ T(x). The performance model guarantees that T(y) ≥ B(y) ≥ T(x), and thus that y will not perform better than x.
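The pruning rule itself is straightforward; the sketch below spells it out, with `lower_bound` standing for the model B(·) and `best_time` for T(x) (names chosen for the example).

```python
def prune(candidates, lower_bound, best_time):
    """Keep only the candidates that may still beat the best implementation
    found so far: those with B(y) >= T(x) can be discarded safely."""
    return [y for y in candidates if lower_bound(y) < best_time]
```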

Figure 5.1: Accuracy of the Performance Model on Fully Specified Implementations. The plot shows the cumulative distribution of the execution time / lower bound ratio, on a logarithmic scale.

We conducted experiments to study the likelihood of a sample being pruned before being executed on the hardware. For that, we compared the bound computed by the performance model for each implementation to the execution time of the other samples.

Methodology. For each sample a, we computed the probability p(a) of the existence of another sample b that has an execution time lower than the bound provided by the performance model for the fully specified implementation of a. The value p(a) is thus the probability that a can be pruned by a randomly chosen implementation b. Figure 5.2 shows the number of samples that can be pruned by at least one out of x samples. A point (x, y) indicates that one out of x samples allows reducing the number of samples to evaluate by a factor of y without missing the best implementation.

Figure 5.2: Pruning Opportunities per Number of Samples. The plot shows the pruning factor as a function of the number of samples, both on logarithmic scales.

Interpretation. Figure 5.2 shows that one sample out of 10 allows for pruning other samples. After that, the pruning power of samples grows polynomially. One sample out of 1,000 allows for pruning the samples by a factor of 37, and one sample in 10,000 allows for pruning by a factor of 420. This experiment shows that the performance model allows for reducing the number of fully specified implementations to evaluate on the hardware based on the execution time of previous evaluations. This is the case even if the number of previous evaluations is small. Further evaluations keep improving the pruning power of the model. In this experiment, we did not use the performance model to drive the random samples towards good implementations. An actual search algorithm could find more efficient implementations faster, and thus prune more efficiently. For experiments on the performance model in the context of a search algorithm, see Section 6.3.2. Another limitation of this experiment is that we only considered the bound of fully specified implementations. The following section shows that the performance model allows for pruning some samples earlier, before all the choices have been fixed.

5.4.3 Early Information on the Performance

The performance model is able to compute a lower bound for the execution time of any implementation that derives from a candidate. This property allows for pruning samples before all the choices are fixed. We now show that the lower bound computed by the performance model is pertinent even in the early steps of the generation process of the samples, when only a few choices have been made.

Methodology. We computed the number of samples that could be pruned by the performance model when considering only the first n steps of the generation process of each sample, that is, the candidates obtained after making at most n decisions for each sample. We assumed a candidate could be pruned if its lower bound was bigger than the execution time of the best implementation across all samples. Figure 5.3 shows the number of samples that may be pruned for a given value of n. A point (x, y) indicates that the performance model can reduce the number of samples to consider by a factor of y after making x decisions. Note that the pruning factor axis uses a logarithmic scale.

Figure 5.3: Pruning Opportunities after a Given Number of Decisions. The plot shows the pruning factor (logarithmic scale) as a function of the number of decisions made.

Interpretation. The figure shows that the performance model allows for pruning early in the generation process of each sample, even when only a few decisions are made. 55% of the samples can be pruned after 10 decisions and 94% after 20 decisions. For comparison, the average number of decisions for our samples is 44.8. Note that this does not necessarily mean that the performance model is not accurate enough in previous steps, as faster implementations might have been derivable from the early candidates of the generation process. The plateau at the beginning of the graph corresponds to the size choices, which have a significant impact on the accuracy of the performance model. Before these choices are fixed, the lower bound is not accurate enough to prune more than a few candidates. Ongoing work is targeting this problem to improve the efficiency of the performance model even more.

Overall, this experiment validates our claim that candidate implementations expose actionable information early in the compilation process. They enable heuristics that reason about an entire set of implementations, without enumerating the implementations that compose the set. They also allow search algorithms to estimate the benefit of a decision independently of subsequent decisions.

5.5 Discussion

This section discusses the specificities of our performance model. Section 5.5.1 first compares our model to existing work. Section 5.5.2 then shows how our model can justify the bounds it computes, providing insights on how to achieve better performance. Finally, Section 5.5.3 discusses the sources of inaccuracy in the model.

5.5.1 Comparison with Other Models

The two distinctive features of our model are (a) that it works on candidates with decisions left open and (b) that it computes a lower bound for the execution time rather than trying to model precisely the performance of implementations. We first discuss models that come close to these features before extending our discussion to performance models for fully specified implementations.

Lower Bound Models. The performance model that comes the closest to our approach is the Roofline Model [43], which bounds the performance of implementations depending on their computational intensity and the optimization techniques they use. The roofline model computes a lower bound for the execution time by considering the usage of each hardware resource in isolation. Lai et al. demonstrate the effectiveness of this approach on highly parallel architectures and deep memory hierarchies, where parallelism, memory bandwidth, or another dominating factor hides secondary bottlenecks. They provide a bound within 77% of the actual execution time of the matrix multiplication kernel just by analyzing the different throughput constraints of a highly optimized code [44]. Our model builds on the same idea, but at a finer granularity. In particular, it reflects the impact of parallelization at each level of the parallelism hierarchy and it can model the interaction between bottlenecks, provided they show up at the granularity of loop iterations. Moreover, our model allows for considering whole classes of implementations. Note that roofline-based approaches abstract away implementation details, but that does not make them applicable to partially specified candidates: for example, the exact arithmetic intensity is not defined until register allocation decisions are taken.

Fully Specified Implementation Models. Other analytical performance models have been developed to drive the search for efficient implementations on GPUs [45, 46]. These models might be more accurate than the one presented here. However, they address a different problem, as they try to estimate the execution time rather than provide a lower bound for it. Code generators based on such models, such as [47], are highly dependent on the accuracy of the model they use. This can be a problem when hardware vendors keep micro-architectural details private. Statistical performance models solve these problems by automatically deriving such details [48]. However, these models only work on fully specified implementations, making it impossible to derive information from early compilation steps.

5.5.2 Bound Justification

One of the particularities of our model is that it justifies the bounds it computes. For that, it tracks the origin of all the intermediate bounds it computes and passes this information down to the global bound. These intermediate bounds are, for example, the bound B_i on the execution time of a parallelism level or the bound for the latency between two program points in the performance model of a single thread.

The places where a bound may have multiple origins are either minimum or maximum operations. The model thus tracks the argmin or argmax of such expressions.

⟨bound⟩ := (ℝ, ⟨explanation⟩)

⟨explanation⟩ := Dependency
               | Usage(ℕ, R)
               | Chain(⟨bound⟩, P, ⟨bound⟩)
               | LoopLevel(⟨bound⟩, L, ℕ)
               | ParallelismLevel(⟨bound⟩, η^min, η^max, µ^max)

Figure 5.4: Bound Explanation Grammar

Figure 5.4 gives the grammar of bounds and their justifications. Any bound is a bound for the latency between two program points a and b. For the bounds (B_i)_{i≤k} on whole parallelism levels, these program points are E_∅ and X_∅. A bound is a pair of a value and a justification. The justification depends on the origin of the bound, which can be any of the following.

• It can be a data or ordering dependency between a and b.

• It can be the usage of a hardware resource r ∈ R at parallelism level i ∈ N, originating from equation (5.17).

• It can be a chain of two bounds between program points a and c and c and b, originating from computing the longest path in the execution graph.

• It can be the iteration over a bound, originating from computing the effect of n ∈ ℕ iterations of a loop level L ∈ L in one of the graphs G_c^L abstracting the paths of the execution graph.

• It can be a bound B_i of the execution time of parallelism level i > 0, when computed from the execution time B_{i−1} of parallelism level i − 1, by equation (5.27).

The justification allows the user to interpret the bound and act accordingly to widen the implementation space with new optimizations. For example, if resource R is limiting at level i, the user can either reduce the usage of R by using different instructions, improve the parallelism at level i, or use an accelerator with more of resource R at level i. We strongly believe that a search algorithm could also use this information to efficiently drive the search.
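A possible in-memory rendering of the grammar of Figure 5.4 is sketched below. Constructor names mirror the figure; the field names and types are invented for the illustration and do not come from the dissertation's implementation.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Bound:
    value: float          # latency bound between two program points
    why: "Explanation"    # justification for the bound

@dataclass
class Dependency:         # data or ordering dependency between the two points
    pass

@dataclass
class Usage:
    level: int            # parallelism level where the resource is limiting
    resource: str         # name of the hardware resource

@dataclass
class Chain:
    first: Bound          # bound from a to an intermediate point c
    point: str            # the intermediate program point c
    second: Bound         # bound from c to b

@dataclass
class LoopLevel:
    body: Bound           # bound iterated by the loop level
    level: str            # the loop level
    iterations: int       # number of iterations accounted for

@dataclass
class ParallelismLevel:
    inner: Bound          # bound B_{i-1} of the level below
    eta_min: int
    eta_max: int
    mu_max: int

Explanation = Union[Dependency, Usage, Chain, LoopLevel, ParallelismLevel]
```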

5.5.3 Sources of Inaccuracy

The bound B(c) on a candidate c should be as close as possible to the execution time of the fastest implementation x ∈ ⟦c⟧ to help the search algorithm find more efficient implementations faster. However, the lower bound should, first of all, be correct and fast to compute, even when some decisions are left open. To that end, the performance model makes optimistic assumptions and conservative abstractions of the hardware and of the set of possible implementations.

Abstraction Over the Hardware. Hardware vendors tend to keep the micro-architectural details of their accelerators private. As a consequence, it is impossible to model the hardware exactly, except by reverse engineering. We overcome this limitation by focusing on a conservative abstraction of the hardware, based on parallelism levels and resource constraints. The bound remains valid even if some hardware resources are not modeled, if the amount of some resources available is overestimated, or if the available parallelism is overestimated. Even if the hardware details were public, cycle-accurate simulators are too slow for our application. Instead, our model is an analytical function that does not need to iterate over the loops.

Steady State Assumption. The performance model only considers the interaction between bottlenecks at the granularity of loop levels. The underlying assumption is that the limiting bottleneck only changes at, or near, the frontier of loops. This is a reasonable optimistic assumption for the highly parallel architectures we consider. The wide parallelism of such architectures hides secondary bottlenecks and facilitates the emergence of steady states.

Abstraction Over the Set of Implementations. The performance model handles candidates with choices left open by abstracting the set of potential implementations. It makes local assumptions without enforcing any form of global coherency between them. First, it does not enforce the constraints between decisions. It relies instead on the Cartesian over-approximation of possible implementations that the domain of each candidate provides. Second, the performance model may make different assumptions for each of the intermediate values it computes. For example, it can assume that a load instruction accesses shared memory when computing the usage of the global memory bandwidth, and assume that the same instruction accesses global memory when computing the usage of the shared memory bandwidth.

The abstraction of the hardware and the steady state assumption affect both fixed and partially specified candidates. The information they miss must be recovered via other heuristics or empirically by the search algorithm. On the other hand, abstractions on the set of implementations only affect candidates with open choices. The inaccuracies they induce disappear as decisions are specified. Their impact is thus limited. Even if the effect of a decision is not reflected in the bound before a few more decisions are specified, it will not lead to evaluations on the hardware that could have been avoided.

5.6 Summary

This chapter presented a performance model that computes a lower bound for the execution time of any implementation that derives from a candidate. This model provides a correct lower bound even if some decisions are left open, allowing a search algorithm to make motivated decisions early in the compilation process. In order to handle open choices, the performance model makes conservative abstractions of the hardware and of the set of implementations. In particular, it makes local optimistic assumptions that minimize the execution time.

The model is generic in the sense that it supports architectures with various depths of parallelism hierarchies and different resource constraints. It can also adapt to different implementation spaces. The only choices it depends on are order and, to a lesser extent, dim_kind. Other choices impact the performance of individual statements but do not affect the structure of the model.

We instantiated the model for linear algebra kernels on GPUs. Our experiments showed that, while the model does not precisely predict the execution time, it is accurate enough to prune the implementation space by several orders of magnitude. The model provides actionable information early in the compilation process, even after making only a few decisions. This validates our claim that the candidate approach allows for the definition of global heuristics, aware of the transformations that may feature in final implementations. Providing a lower bound for the execution time of a partially specified candidate is only possible because the model is aware of all potential decisions. We do not see how such a model could work with an iterative rewriting of a fully specified implementation. The ideas developed to handle open decisions are not limited to the performance model and could apply to define other heuristics on partially specified candidates.

We also discovered that the performance model provides key insights to the designer of a code generator and auto-tuner. It can pinpoint which performance bottlenecks should be addressed to achieve better performance, hinting at missing decisions in the implementation space. This point has been of tremendous help when building and refining the decision space for linear algebra kernels presented in Chapter 4. We believe that it can also help support design space exploration, to select the most appropriate platform for a given computational domain.

Chapter 6

Scaling the Search with the Properties of Candidates

This chapter illustrates the benefits of the “candidate” concept by developing a simple search algorithm. It supports our claim that candidates empower search algorithms. The algorithm generates code optimized for specific features of a given input (matrix sizes, model hyper-parameters) and a given hardware target (GPU microarchitecture). It starts from a single candidate, representing the initial implementation space, and returns the fastest implementation it can find within a fixed amount of time.

Ideally, the performance model of Chapter 5 alone would be enough to guide the search towards the most efficient implementation. This is indeed the case for simple kernels, such as the vector addition kernel, where a search algorithm can use the lower bounds to prune large regions of the implementation space and exhaustively evaluate the remaining implementations on the hardware. However, this approach does not scale to bigger implementation spaces, as too many implementations remain to evaluate. We thus propose an algorithm that combines a traditional Monte Carlo Tree Search with the lower bound performance model and actual evaluation on the hardware. This algorithm leverages the unique features of candidates: it relies on the commutativity of decisions to prioritize the choices we think are the most important and uses the performance model to derive profitability information from early compilation steps.

We apply our algorithm to find implementations of a few linear algebra kernels for GPUs, using the representation of candidates defined in Chapter 4. We show that the generated code is competitive with hand-tuned libraries and even outperforms them in some cases. We also run dedicated experiments to study the impact of specifying the most discriminant choices first and the impact of using the performance model to prune the implementation space early in the compilation process. These experiments show how the commutativity of decisions and the information exposed by candidates early in the compilation process benefit search algorithms. Note that the algorithm is only a first step towards efficient search strategies that fully exploit the benefits of candidates. Its principal purpose is to put the ideas of previous chapters into the context of a search procedure.

Contributions. This chapter presents a search algorithm that operates on candidates in order to find efficient implementations of linear algebra kernels on GPUs. It combines the idea of candidates from Chapter 3, the encoding of implementation spaces from Chapter 4 and the performance model from Chapter 5 to show how they help find better implementations. Precisely, this chapter makes the following contributions.

• It defines a search algorithm to find efficient implementations of a kernel. This algorithm combines a lower bound for the execution time of partially specified candidates with actual evaluations to drive the search towards good candidates.

• It shows that this algorithm can generate code competitive with hand-tuned reference libraries, outperforming them in some cases.

• It demonstrates how the structure of candidates helps the search algorithm find better implementations. In particular, it shows how the commutativity of decisions affects search performance.

• It shows that deriving information early, after making only a few decisions, is critical to drive the search algorithm towards the best implementations. This validates the experiments of Section 5.4 in an actual search procedure. Together with the previous point, this shows the value of the candidates approach for optimizing compilation.

Outline. Section 6.1 first defines our search algorithm. Section 6.2 presents the kernels we use to evaluate our approach. It details the computations they perform, how we express them in our kernel representation and the characteristics of their implementation spaces. Section 6.3 evaluates the search algorithm on the kernels and shows how the performance model and the order of decisions affect the performance of the search. Finally, Section 6.4 places our algorithm in the context of existing work and Section 6.5 discusses how our experiments with the search algorithm show the benefits of candidate implementations.

6.1 Search Algorithm

The core of our algorithm is similar to the algorithm that we developed in Section 2.4 to find the best schedule of instructions within a basic block. Indeed, it builds a search tree whose nodes are candidates and whose leaves are implementations. One key difference with Section 2.4 is that the tree is far too big for exhaustive enumeration, even with a performance model to prune parts of it. Our algorithm thus combines two components: a statistical component that we develop in Section 6.1.1, and a performance model based component that we develop in Section 6.1.2. The statistical component drives the search towards the most efficient implementations, based on previous evaluations of implementations on the hardware, while the performance model component prunes the search tree and drives the search in the absence of previous evaluations in a sub-tree.

6.1.1 Statistical Exploration

The exploration of the implementation space is driven by a Monte-Carlo Tree Search (MCTS) algorithm. We use a variant of Threshold Ascent on Graph (TAG), which was previously applied to Spiral [49], a code generator for fast Fourier transforms. In our case, the algorithm builds a tree whose nodes are candidates. The root of the tree is the candidate representing the entire implementation space.

The algorithm starts with the tree containing only the root node and iteratively selects a leaf to expand by using evaluation statistics from previous iterations. The leaf expansion instantiates an open choice, creating one child per possible value. Then, the MCTS performs a Monte-Carlo simulation to set the remaining decisions. It runs the resulting implementation on the GPU and adjusts the statistics along the selected path with the execution time. The order in which choices are specified is fixed upfront. We manually select an order that specifies the most important decisions first. Section 6.3.3 discusses the impact of the order on exploration performance.

To select the leaf to expand, the algorithm iteratively builds a path from the root of the tree. At each node along the path, it applies the TAG formula to select the next node on the path among the children of the current node. The TAG formula works as follows. For each node, the algorithm stores the T best execution times obtained when expanding a leaf in the sub-tree rooted at the node. Let us consider a node with k children. For i < k, we denote by n_i the number of times the algorithm visited the i-th child of the node and by s_i the number of evaluations in branch i that lead to an execution time among the T best. Then, the algorithm selects the next child to visit with:

argmax_i  (s_i + α + √(2 s_i α + α²)) / n_i    (6.1)
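The sketch below shows one way to implement this selection step; the reconstruction of formula (6.1) and the handling of unvisited children follow the TAG reference [49], and the value of α is left as a parameter since its definition is not repeated here.

```python
import math

def tag_score(s_i, n_i, alpha):
    """Score of equation (6.1): s_i top-T evaluations out of n_i visits."""
    if n_i == 0:
        return math.inf  # unvisited children are tried first
    return (s_i + alpha + math.sqrt(2 * s_i * alpha + alpha * alpha)) / n_i

def select_child(stats, alpha):
    """stats: list of (s_i, n_i) pairs, one per child; returns the child index."""
    return max(range(len(stats)), key=lambda i: tag_score(*stats[i], alpha))
```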

6.1.2 Performance Model

We complement the TAG algorithm with the performance model defined in Chapter 5, which provides a lower bound for the execution time of any implementation deriving from a candidate. We use the bound in two ways:

• When selecting a leaf to expand, we ignore children with a lower bound that is higher than the execution time of the current best implementation.

• During Monte-Carlo simulations, when choosing amongst candidates (X_i)_i, we pick X_i with probability:

p(X_i) ∼ max(T − b(X_i), 0)    (6.2)

where T is the execution time of the best implementation so far and b(X_i) is the lower bound given by the performance model for the candidate X_i. The idea behind equation (6.2) is that we select a candidate proportionally to its potential improvement over the best execution time found so far. In particular, p(X_i) = 0 when b(X_i) > T: we ignore candidates with a lower bound higher than the current best implementation.

In both cases, we use the performance model to ignore regions of the implementation space which cannot possibly improve on the best implementation already found. This is similar to how we prune the search tree in the exhaustive algorithm of Section 2.4. The difference here is that we also use the model to influence the probability of choosing a candidate in the Monte-Carlo simulation.
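A minimal sketch of this biased simulation step, assuming `bound` implements b(·) and `best_time` is T, is given below; in the real search the simulation also handles the case where every child is pruned.

```python
import random

def pick_candidate(candidates, bound, best_time):
    """Draw the next candidate with probability proportional to its potential
    improvement, as in equation (6.2); pruned candidates get weight zero."""
    weights = [max(best_time - bound(c), 0.0) for c in candidates]
    if sum(weights) == 0.0:
        return None  # all children are pruned; the search must backtrack
    return random.choices(candidates, weights=weights, k=1)[0]
```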

6.2 Sample Kernels

We now present a few kernels and their related implementation spaces that we use to run experiments on the search algorithm. We first describe the computations the kernels perform and the implementation space we expose to implement them in Section 6.2.1. We then study the characteristics of these implementation spaces in Section 6.2.2.

6.2.1 Kernels Description

We give below the list of kernels we consider. We assume that input and output matrices are stored in column-major order. We created the implementation space of each kernel following the same procedure. We placed each instruction in its own loop nest, with point-to-point communication between them. We then strip-mine the dimensions in each loop nest. With this procedure, the search algorithm is free to reorder, fuse, unroll or vectorize loops, or to map them to the different levels of parallelism. It can implement point-to-point communications using registers or by allocating temporary arrays in shared memory, and choose the level of cache to use for each memory access. It can also pick the strip-mining factors of dimensions, as long as they divide the logical dimension size and fall in the intervals we give below for each kernel.

axpy: computes z = α·x + y, where α is a scalar and x, y and z are vectors of size n. We strip-mine n twice, with factors in ⟦2, 4⟧ and ⟦2, 1024⟧.

matmul: computes C = A · B, where A and B are matrices of size m × k and k × n respectively. We strip-mine m and n twice, with factors in ⟦2, 32⟧ and ⟦2, 4⟧.

strided matmul: is the same as matmul, but with consecutive elements of A stored with a stride of 32, that is, with one value per line of cache.

batched matmul: computes b matrix multiplications in parallel. We strip-mine each dimension once: m, n and k with a factor in ⟦2, 64⟧ and b with a factor in ⟦2, 128⟧. We instantiate the size of dimensions in the generated code, except for the batch size.

reuse matmul: is the same as batched matmul, but reuses the same matrix B for all matrix multiplications.

The benchmarks are representative of both compute-intensive (matmul and its derivatives) and bandwidth-bound (axpy) kernels. They span a variety of memory access patterns, including strides at different dimensions, transposed layouts, and batched versions, all of which are typical for higher dimensional tensor algebra in computational chemistry, simulation codes, and machine learning [11]. We compare the implementations found for each kernel with a reference implementation. Except for strided matmul, the reference implementation calls CuBLAS [50], Nvidia's hand-tuned implementation of basic linear algebra kernels. The strided matmul kernel showcases the advantages of our approach for kernels that are not covered by GPU vendor libraries. Indeed, CuBLAS does not support such strided accesses. Instead, the reference is a naive implementation that computes one element of C per thread.
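For reference, the computations performed by these kernels can be written in a few lines of numpy, as sketched below; this is only a specification of the outputs, not the kernel representation actually used to build the implementation spaces.

```python
import numpy as np

def axpy(alpha, x, y):
    return alpha * x + y                      # z = alpha * x + y

def matmul(A, B):
    return A @ B                              # C = A . B (strided matmul computes the same C)

def batched_matmul(A, B):
    return np.einsum("bmk,bkn->bmn", A, B)    # one product per batch entry

def reuse_matmul(A, B):
    return np.einsum("bmk,kn->bmn", A, B)     # the same B is shared by all entries
```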

6.2.2 Implementation Spaces Characteristics

We now study the characteristics of the implementation space of each kernel. For this, we first explain how we measure the size of implementation spaces and then present these sizes and the average number of decisions towards a fully specified implementation.

Implementation Space Size Estimation. Computing the exact size of the implementation spaces is intractable due to the large number of possible choices. To estimate their size, we turn to probabilistic methods which have been used to estimate the size of search trees in constraint solvers [51]. The precise method we use was described by Chen [52] and is a generalization of an earlier method by Knuth [53]. The Knuth algorithm starts at the root and performs a random descent until reaching a leaf. It then estimates the size of the tree by using the branching factors along the path. The Chen algorithm, called heuristic sampling, adds the concept of strata: classes of sub-trees estimated by the algorithm designer to be structurally similar. The algorithm maintains a queue containing strata, each represented by a single sub-tree in the stratum, along with an estimate of its size. When a sub-tree is discovered, the corresponding stratum in the queue is updated with its count, and the sub-tree can randomly be selected as the new representative for the stratum. The total size estimate is then the sum of the estimates for the leaves encountered. A partial order on the strata which is strictly decreasing along the tree is required to ensure well-formedness. This algorithm was chosen for its balance between simplicity and performance. As a stratifier, we use a lexicographic pair of the depth in the tree and the number of remaining choices (assuming none gets forced through propagation), which provides a simple proxy for the sub-tree size to guide the algorithm.
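For illustration, the Knuth estimator that heuristic sampling generalizes can be written as below; `children(node)` is a hypothetical helper returning the children of a node in the search tree.

```python
import random

def knuth_estimate(root, children, runs=100_000):
    """Estimate the number of leaves of a tree: each random descent multiplies
    the branching factors met along the path, and the results are averaged."""
    total = 0.0
    for _ in range(runs):
        node, weight = root, 1.0
        while True:
            succ = children(node)
            if not succ:              # a leaf, i.e. a fully specified implementation
                total += weight
                break
            weight *= len(succ)       # the path was chosen with probability 1 / weight
            node = random.choice(succ)
    return total / runs
```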

Kernel          | Input Size                   | Search Space Size | Average Depth
axpy            | n = 2^26                     | 1.1e11 ±1.7e10    | 21.9 ±0.1
matmul          | m = n = 256, k = 32          | 1.83e21 ±3.3e20   | 41.0 ±0.1
matmul          | m = n = k = 1024             | 3.5e21 ±1.8e21    | 41.0 ±0.1
strided matmul  | m = n = k = 1024             | 6.0e20 ±2.0e20    | 40.9 ±0.1
batched matmul  | b = 512, m = n = 32, k = 64  | 2.5e28 ±1.3e28    | 57.4 ±0.2
reuse matmul    | b = 512, m = n = 32, k = 64  | 7.1e25 ±4.5e25    | 56.2 ±0.2

Table 6.1: Kernels Characteristics

Per Kernel Characteristics. Table 6.1 presents the input sizes we consider for each kernel, along with the size and depth of the resulting implementation spaces. As both the reported size and depth of the implementation spaces are derived from statistical estimations, we provide the numbers with their 95% confidence interval. As the list of valid tiling factors depends on the dimension sizes, both the size and the depth of the implementation space depend on the input size. For matmul, we consider two input sizes: m = n = 256 with k = 32 and m = n = k = 1024. In the rest of this dissertation, we denote them matmul 256 and matmul 1024 respectively. For computing the size of implementation spaces, we ran 1,000 iterations of the Chen method and 100,000 iterations of the Knuth method, and selected the result that had the best confidence interval.

For some of the largest implementation spaces, we performed additional iterations to bring the confidence interval to the correct order of magnitude. The reported depth is the average number of decisions to make to obtain a fully specified implementation when randomly selecting the outcomes of choices. Note that this is not the average depth of a uniformly selected implementation. Indeed, different decisions, each selected with the same probability, may lead to regions of the implementation space with different sizes.

6.3 Empirical Evaluation

This section presents experiments, based on the kernels defined in Section 6.2, that study how the properties inherent to candidates benefit search algorithms. Section 6.3.1 first studies how the generated code compares to reference implementations. Then, Section 6.3.2 shows how candidates help the search algorithm by exposing actionable information early in the compilation process. Finally, Section 6.3.3 shows how the fact that decisions commute improves the search performance. All experiments were conducted on a Linux machine equipped with a 12-core Xeon E5-2620v2 processor, 64GB of RAM and a Quadro K4000 Kepler GPU, with the CUDA toolkit version 10.0.

6.3.1 Search Performance

We first show that our search algorithm based on candidate implementations generates code that is competitive with hand-tuned implementations supplied by GPU vendors. For this, we ran the search algorithm multiple times on the implementation space of each kernel. We then compare the execution time of the best implementation the algorithm found at each run to the reference implementation.

Kernel          | Avg. Runtime    | Reference | Speedup
axpy            | 7.06ms ±0.02ms  | 9.16ms    | 1.30 ±0.005
matmul 256      | 29.4µs ±2.22µs  | 77.4µs    | 2.63 ±0.18
matmul 1024     | 4.34ms ±0.18ms  | 3.71ms    | 0.85 ±0.03
strided matmul  | 10.1ms ±0.59ms  | 637ms     | 66.7 ±3.9
batched matmul  | 578µs ±3.8µs    | 400µs     | 0.69 ±0.04
reuse matmul    | 557µs ±155µs    | 396µs     | 0.71 ±0.09

Table 6.2: Implementation Space Exploration Results

Methodology. We ran the search algorithm on the implementation space of each kernel for the minimum of 1 hour or 10,000 evaluations on the GPU. We reproduced each experiment 10 times. We measured the performance of the best implementation found by each run of the search algorithm and of the reference implementation of each kernel by evaluating them 40 times on the hardware. Each time, the 95% confidence interval was within ±0.5% and is thus omitted in the following results. Table 6.2 summarizes the performance of the implementations found by the algorithm for each kernel. The execution times we report only include the execution on the GPU, without the transfers of data from and to the host CPU.

• The Avg. Runtime column reports the average across the 10 experiments of the runtime of the best implementation found by the search algorithm. It also reports the maximal variation between this average and the runtime obtained for each experiment.

• The Reference column reports the runtime of reference implementations.

• The Speedup column reports the average speedup across the 10 experiments, as well as the maximal variation between this average and the speedup for each experiment.

Interpretation. Below are the main lessons we can derive from the experiments on each kernel.

• axpy and matmul 256 show that our approach is able to outperform hand-tuned implementations by finding better implementation decisions.

• matmul 1024, batched matmul and reuse matmul compare against implementations that almost reach the peak performance of the GPU. We have little chance of beating them. Indeed, the literature shows that manual register allocation is needed to avoid register bank conflicts [44]. Still, we generate code that achieves 85% of the performance of Nvidia's implementation for matmul 1024, and around 70% of this performance for batched matmul and reuse matmul.

• strided matmul shows the benefits of our approach for kernels unavailable in hand-tuned libraries, with a 66× speedup over a naive implementation. While not reaching the peak theoretical performance of the GPU, it is within a factor of 2.7 of CuBLAS's non-strided version (as shown by the ratio between the average runtime of strided matmul and the reference runtime of matmul 1024 in Table 6.2) and thus can be useful.

The variance on the execution time of the best implementation among explorations of the same search space shows that we still have room for better search algorithms. However, this is beyond the scope of this dissertation, which focuses on how to expose the implementation space rather than on how to design search algorithms.

6.3.2 Discriminant Information Exposed by Candidates

We have already shown in Section 5.4 that our representation of the implementation space allows for extracting pertinent information early in the compilation process, when only a few decisions are set. We now confirm those results in the context of the search algorithm, and with more kernels considered.

Methodology. To measure the impact of the information that candidates expose early in the compilation process on the search performance, we examine how the lower bound performance model prunes the search tree in its first levels. For this, we expand the entire search tree up to depth 10. We then apply the performance model to prune these first 10 levels only and estimate how many nodes and implementations remain in the tree. To prune the search tree, we cut nodes whose lower bound was higher than the average runtime of the best implementation obtained from the experiments of Section 6.3.1.¹ We only pruned nodes up to depth 10, but pruning a node also removes all the candidates below it.

Table 6.3 reports the estimated number of implementations in the search tree, before and after pruning the first 10 levels. For this, it relies on the algorithm estimating the implementation space size from Section 6.2.2. Due to the large size of the implementation spaces of batched matmul and reuse matmul, we could not obtain a meaningful estimation of the size of the pruned space for these two kernels. The table also reports the number of nodes of depth 10 in the tree, both before and after pruning. Contrary to the number of implementations, we obtained these numbers through exhaustive enumeration up to depth 10.

Interpretation. These experiments show that we are able to eliminate large portions of search spaces early on. For most kernels, we reduce the size of the space by two or three orders of magnitude in the first 10 levels. The implementation space of reuse matmul is more complex than the implementation spaces of the other kernels and would require pruning deeper in the tree to reach a higher proportion of pruned nodes. We still manage to cut about 40% of the nodes at depth 10.

6.3.3 Impact of the Decision Order

Finally, we show how commutative decisions empower the search algorithm. For this, we ran the search algorithm with different priorities on the choices to specify first. The idea is that specifying the most

¹ Our performance model does not account for the overhead of the driver in the bounds it computes. As such, we used the runtime measured by the GPU performance counters – which does not account for the time spent by the driver before and after executing the kernel – to cut the nodes. The cut threshold is thus slightly lower than the value we report in Table 6.2 for the average runtime of the best implementation.

Kernel         | Implementations in the Search Tree        | Nodes at Depth 10
               | Before          | After           | Ratio | Before  | After  | Ratio
axpy           | 1.1e11 ±1.7e10  | 2.9e7 ±1.1e7    | 3793  | 86,587  | 121    | 715
matmul 256     | 1.83e21 ±3.3e20 | 3.5e18 ±2.2e18  | 523   | 134,060 | 20,126 | 6.7
matmul 1024    | 3.5e21 ±1.8e21  | 9.5e17 ±4.6e17  | 3884  | 161,980 | 6,738  | 24
strided matmul | 6.0e20 ±2.0e20  | 8.1e17 ±4.5e17  | 740   | 142,780 | 9,512  | 15
batched matmul | 2.5e28 ±1.3e28  | —               | —     | 344,464 | 1,946  | 177
reuse matmul   | 7.1e25 ±4.5e25  | —               | —     | 69,805  | 43,402 | 1.6

Table 6.3: Effects of Pruning the Search Tree Up to Depth 10

important decisions allows both the performance model and the MCTS to discriminate closer to the root of the tree and to examine fewer branches. Note that the order of choices only affects the structure of the search tree and not the implementations that compose the implementation space.

Methodology. For the experiments we present in Sections 6.3.1 and 6.3.2, the search algorithm specifies the decisions in the following order: first the memory layout, then size, dim_kind, thread_mapping, memory_space, order and finally cache. This order was obtained from manual experiments and experience, to prioritize the most discriminative decisions. Here, we compare the performance of the search both with this order and its opposite, on the matmul 1024 kernel. In both cases, we computed the ratio of nodes that the performance model can prune in the first d levels of the tree, assuming the cut threshold is the average runtime reported in Table 6.2. The number of candidates with a depth ≤ d varies greatly with the order of decisions. To have comparable results, we took, for each order, the first d such that the number of candidates of depth d is above 10^5. This resulted in d = 10 with 1.6×10^5 nodes for the direct order and d = 16 with 1.1×10^5 nodes for the reverse order.

Interpretation. With the direct order, the performance model reduces the tree size by a factor of 24. This factor decreases to 1.8 for the reverse order. This yields 13 times more branches to consider. We also tried running the search algorithm using the reverse order, but it ran out of memory due to an explosion in the number of branches. While we illustrated the impact of the decision order with the performance model, it also applies to other algorithms: picking the most discriminant decisions first helps focus the search.

6.4 Discussion

We now compare our algorithm to existing search algorithms, both in the context of code optimization and in the general context of optimizing an objective function. We first discuss genetic algorithms and then explain how our algorithm relates to reinforcement learning. This section suggests directions for future research that we further develop in Section 7.2.3.

Genetic Algorithms. A common approach to find a good implementation of a program is to use genetic algorithms to apply rewrite rules [54]. Genetic algorithms maintain a set of potential implementations. At each step of the algorithm, they select a part of the set (the selection phase) on which they apply random transformations (the mutation phase) to generate the content of the set at the next iteration. The algorithm may also apply rules (crossover rules) to combine multiple implementations together to create new ones for the next step.

Pouchet et al. proposed custom genetic operators for affine schedules, to search for dependence-preserving loop nest transformations [32, 33]. They operate on a vector of decisions – the coefficients of the affine schedule. However, their algorithm requires the implementation space to be convex. More recently, genetic algorithms were combined with a machine-learning based performance model to optimize tensor computations for performance [23]. The authors used the model on fully specified implementations to select the implementations for evaluation on the hardware after the mutation phase. An interesting idea would be to use our approach on top of their representation to explore schedules.

In contrast to what candidates offer in the early compilation steps, genetic algorithms lack a global understanding of what transformations are possible. Instead, they are limited to local transformations. In our case, genetic algorithms could help finding a local optimum after generating a kernel. Vasilache et al. recently showed that a similar approach yields good results [34]. In their approach, the position of the genetic algorithm in the decision process is inverted: a genetic algorithm selects high level strategies and parameters, followed by a gray box optimization based on integer linear programming.

Reinforcement Learning. The problem of choosing actions to perform (e.g. implementation decisions) in an environment (e.g. the decisions made so far or the domain of the current candidate) to maximize a value (e.g. the performance of the kernel) is not limited to optimizing compilation. The process of automatically finding the actions to perform based on the impact of previous actions is called reinforcement learning. The statistical component of our algorithm is a variant of Threshold Ascent on Graph (TAG), which was previously applied to Spiral [49], a code generator for Fast Fourier Transforms. This work was itself based on a general solution of the multi-armed bandit problem [55], where an algorithm must choose between a fixed number of possible outcomes of a choice, to maximize the expected value of an objective function.

Several breakthroughs have since been achieved in the field of reinforcement learning, in particular for algorithms playing games, such as the Alpha Go algorithm [56], which was able to beat human players at the game of Go based solely on the game rules. While there are similarities between such algorithms and our problem, reusing them is non-trivial. Indeed, most of the research on this type of approach focuses on adversarial contexts, where other players also take actions. In particular, our search is totally deterministic, while adversaries introduce a random component. Another difference is that these algorithms optimize for the average outcome of the actions, while we only care about the best implementation found.

6.5 Summary

This chapter presented a simple search algorithm that combines a statistical approach with a performance model of partially specified implementations. The performance model prunes the implementation space and biases the search towards promising regions, while the statistical component focuses the exploration based on actual evaluations. This search algorithm is only a first iteration, but it already shows promising results, generating code that is competitive with reference hand-tuned libraries and outperforms them in some cases.

More importantly, this simple algorithm demonstrates how our approach of representing the implementation space with candidates helps search algorithms find more efficient implementations faster. We show that taking the most discriminative decisions first has a considerable impact on the search performance. This is only possible because we expose possible decisions upfront and because they are commutative. We also show that candidates expose actionable information early in the compilation process, even when only a few decisions have been made. This information is critical to reduce the size of the implementation space, pruning whole regions of the implementation space at once. As explained in Chapter 5, this is only possible because of the specificities of our representation.

So far, we limited ourselves to a relatively simple search algorithm with a few remaining hardwired decisions. More work is needed to fully exploit the benefits of our approach. For example, we currently use a fixed priority to select the choices to instantiate next. The algorithm could instead automatically search for the best order. We could also share information about the best implementations between the branches of the search tree when considering the same choices. Our representation makes such sharing easy, as choices have fixed positions in the decision vector.

Chapter 7

Conclusion

We presented a novel approach to generate efficient code for hardware accelerators. Specifically, we focused on the problem of choosing amongst the many potential optimizations that may be applied to the initial program. For that, we proposed to expose the compilation process as the progressive refinement of a candidate implementation. A candidate is a partially specified implementation with some implementation choices left open while others are fixed. The originality of this approach is that all potential decisions are made explicit upfront and that they commute. This makes it possible to anticipate downstream decisions and to extract profitability information from the early compilation steps.

The concept of candidate is based on the Constraint Satisfaction Problem (CSP) formalism. A candidate has a domain that lists the possible decisions for each choice. Constraints restrict the valid combinations of decisions to ensure they are coherent and respect the semantics of the kernel and the limitations of the hardware. An implementation is a candidate whose domain contains a single value for each choice and satisfies the constraints. A candidate thus represents the whole set of implementations that can be obtained by restricting its domain. This allows for manipulating entire regions of the implementation space at once and for extracting information about them without enumerating individual implementations.

We introduced a formalism to describe the choices and constraints that compose the implementation spaces independently of the kernel instance. It defines a generic decision space that can then be instantiated for each kernel. We proposed a Domain Specific Language (DSL) implementing the formalism. It allows the programmer to define a decision space in a declarative form and to automatically generate code to represent, instantiate and manipulate the corresponding candidates.

We applied the DSL to build a candidate representation for linear algebra kernels on GPUs. The decision space supports the most common transformations for this domain of applications, with both thread-local and global implementation decisions, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestrating data movements across the memory hierarchy. This encoding serves both to illustrate our approach and to provide a new generator for high-performance linear algebra kernels on GPUs.

The unique properties of candidates allow for extracting reliable information from the early steps of the compilation process. We presented a performance model that computes a lower bound for the execution time of any implementation that derives from a partially specified candidate. Our experiments showed that the model allows for pruning candidates after only a few decisions are made, and that it allows for reducing the implementation space size by several orders of magnitude.

Finally, we evaluated our approach in the context of a search algorithm. For this, we developed an algorithm that combines the performance model with actual evaluation and statistical search. This algorithm is only a first step towards building a search strategy that fully exploits the benefits of candidates. We applied our search algorithm on implementation spaces for linear algebra kernels. We showed that it generates code that is competitive with hand-tuned libraries, and that the properties of candidates have a significant impact on the performance of the search.
In particular, we showed that being able to make the most discriminative decisions first is critical to drive the search algorithm towards the most efficient implementations.
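
To fix intuitions, the following Python sketch illustrates the shape of a candidate: a map from choices to the set of decisions still allowed, where taking a decision only restricts the domain and an implementation is a candidate with a single value left per choice. The class and the choice names are purely illustrative assumptions, not the API of our framework.

    # Illustrative sketch of a candidate as a CSP domain (hypothetical names).
    class Candidate:
        def __init__(self, domain):
            # domain: maps each implementation choice to its still-possible decisions
            self.domain = {choice: set(values) for choice, values in domain.items()}

        def decide(self, choice, value):
            """Taking a decision only restricts the domain; it never grows it."""
            assert value in self.domain[choice]
            self.domain[choice] = {value}

        def is_implementation(self):
            """An implementation has a single value left for every choice."""
            return all(len(values) == 1 for values in self.domain.values())

    # A toy decision space with two interacting choices.
    candidate = Candidate({
        "order(load, mul)": {"BEFORE", "AFTER"},
        "unroll(loop_k)": {1, 2, 4},
    })
    candidate.decide("unroll(loop_k)", 4)
    print(candidate.is_implementation())  # False: the order choice is still open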

Outline. The rest of this conclusion chapter is organized as follows. Section 7.1 first revisits the main contributions of this dissertation. It explains their importance, discusses the current limitations of our work, and gives our short-term research directions to improve them. Section 7.2 then presents our long-term research perspectives towards new contributions.

7.1 Principal Contributions

Our main contributions fall into four categories that we develop in dedicated sections: the introduction and formalization of the concept of candidate in Section 7.1.1, the DSL to describe decision spaces in Section 7.1.2, the instantiation of our approach for linear algebra on GPUs in Section 7.1.3, and finally the lower bound performance model in Section 7.1.4.

7.1.1 Concept of Candidate Implementation

Our first contribution is the concept of candidates, which allows for the formal definition, representation and exploration of spaces of possible implementations. The representation is compact as it does not list individual implementations. Instead, it exposes potential decisions as the domain of a constraint satisfaction problem. This domain defines the possible values for each implementation choice.

Benefits for Search Algorithms. Candidates help search algorithms find more efficient implementations faster. First, a candidate decision vector defines a well-behaved set of actions the algorithm can take to generate implementations. In particular, decisions commute and the set of potential decisions decreases after each action. This means that the search algorithm is aware of all potential actions upfront and can make the most performance-impacting decisions first. Our experiments showed that the order of choices had a significant impact on the search performance.

Second, our approach is not tied to a particular framework to express the objective function and to drive the search. Instead, it allows interacting with the intermediate steps of the decision process and mixing actual evaluations on the hardware, performance models, heuristics and statistical approaches. This is in opposition to black-box solvers, such as ILP or SMT solvers, that optimize for a single objective function. Such solvers cannot fully encode the complexity of the target architecture.

Finally, candidates allow for the extraction of actionable information early in the compilation process. Indeed, the decision vector offers a Cartesian over-approximation of all possible combinations of decisions. As a result, we can define global heuristics aware of which optimizations may feature in the final implementations. As a proof of concept, we built a model that computes a lower bound of the execution time of the possible implementations of a candidate. Our experiments showed that the model provided pertinent information, even after taking only a few decisions. Such heuristics could also be a probability distribution on the actions to take next, or a probability that an implementation is better than another.
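
As an illustration of how these properties can be exploited, here is a minimal branch-and-bound sketch reusing the toy Candidate interface sketched in the conclusion above (domain, decide, is_implementation). The other helpers (lower-bound model, hardware evaluation, choice-ordering heuristic and constraint propagation) are hypothetical parameters, not an existing API.

    import copy

    def branch_and_bound(candidate, lower_bound, evaluate, pick_choice, propagate, best):
        """Depth-first refinement of a candidate, pruning with the lower-bound model.
        `best` is a one-element list holding the best execution time found so far."""
        if lower_bound(candidate) >= best[0]:
            return  # every implementation in this region is provably too slow: prune
        if candidate.is_implementation():
            best[0] = min(best[0], evaluate(candidate))
            return
        choice = pick_choice(candidate)  # ideally the most discriminative open choice
        for decision in candidate.domain[choice]:
            child = copy.deepcopy(candidate)
            child.decide(choice, decision)
            if propagate(child):  # discard children whose constraints become unsatisfiable
                branch_and_bound(child, lower_bound, evaluate, pick_choice, propagate, best)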

Candidates Formalism. We provided a formal definition of candidates and of the implementation space they represent. This formalism exposes the objects that make up the kernel representation and the choices and constraints that define the decision space. Importantly, the definition of a decision space is independent of a particular kernel instance. This enables defining the decision space once and instantiating it for different kernels.

Even though candidates can expose all possible decisions upfront, they can also purposefully hide some decisions until a condition is satisfied. This is useful when lowering high level constructs, for example when lowering a point-to-point communication between two loop nests into a store and a load instruction. In that case, it makes sense to delay decisions about the load and store instructions until after the decision to lower the point-to-point communication into memory instructions. This avoids cluttering the decision space with auxiliary decisions that may have no impact on the final code. Such decisions can still be pre-computed so that heuristics working on partially specified candidates are still aware of all potential choices.

7.1.2 Decision Space Description Language

Our second contribution is the decision space description language and its associated compiler. The DSL allows programmers to provide a declarative specification of the properties that constitute the kernel representation, and of the choices, the constraints and the lowering conditions that constitute the decision space. From this specification, the compiler automatically generates code to instantiate, represent and manipulate candidates.

The declarative nature of the DSL makes it concise and simple to use. A programmer can add an outcome to a choice or add a new constraint with just a few lines of code. The DSL compiler then generates the code for the corresponding candidate representation. This simplicity played a central role in the design of the decision space for linear algebra kernels on GPUs, enabling us to try out different decision spaces faster. It should also enable new candidate representations for other domains of applications or for other hardware targets.

The DSL describes the decision space independently of the kernel instance. For this, it relies on sets of objects exposed by the kernel representation. It defines the same choices, constraints and triggers for every combination of objects in the same sets.

Retrospectively, we could have designed a simpler DSL by matching the candidates formalism more closely. While sets accurately match unary properties of the formalism, properties with a higher arity are cumbersome to encode. The reason is that the formalism was only finalized after the DSL. We intend to fix this limitation in future work.
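
To give a flavour of the idea of a declarative decision-space description, the sketch below encodes a toy version in plain Python data structures: one generic order choice declared over pairs of statements and a single constraint, instantiated for a small kernel. This does not reproduce the syntax of our DSL; the choice values, the constraint and the toy kernel are hypothetical examples.

    from itertools import permutations

    DECISION_SPACE = {
        "choices": {
            # one generic `order` choice per ordered pair of statements
            "order": {"over": ("Statements", "Statements"),
                      "values": {"BEFORE", "AFTER", "MERGED", "IN", "OUT"}},
        },
        "constraints": [
            # hypothetical rule: two instructions can never be merged
            lambda kind1, kind2, values: values - {"MERGED"}
            if kind1 == "inst" and kind2 == "inst" else values,
        ],
    }

    def instantiate(statements):
        """Build the initial domain of a kernel instance from the generic description."""
        domain = {}
        for s1, s2 in permutations(statements, 2):
            values = set(DECISION_SPACE["choices"]["order"]["values"])
            for constraint in DECISION_SPACE["constraints"]:
                values = constraint(statements[s1], statements[s2], values)
            domain[("order", s1, s2)] = values
        return domain

    # Toy kernel: two instructions and one loop dimension.
    print(instantiate({"load": "inst", "mul": "inst", "loop_k": "dim"}))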

7.1.3 Candidates Representation for Linear Algebra on GPUs

Our third contribution is the application of our approach to linear algebra kernels on GPUs, providing a new way to automatically generate code that is competitive with hand-tuned libraries, outperforming them in some cases. Our decision space exposes complex decisions that fundamentally change the structure of the generated code, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels of parallelism, and orchestrating data movements across the memory hierarchy.

The first limitation of our framework for the generation of efficient implementations of linear algebra kernels for GPUs is that it lacks an input language. Programmers must currently build the kernel representation using API calls. The reason for this is that we preferred to focus on the implementation space encoding rather than the kernel representation. In the future, we intend to write a compiler that translates an existing language into our representation.

The second limitation is that our representation only supports a limited class of kernels. In particular, it only supports loop bounds that are functions of the kernel parameters and point-to-point communication between loop nests. We intend to address this problem using a polyhedral representation of communications and loop structures. Indeed, while polyhedral representations are usually used in conjunction with ILP solvers, we can also express them as constraint satisfaction problems. In that case, the variables of the CSP are the coefficients of the polyhedral encoding. We believe that such a representation would also improve the efficiency of our representation, as the order choice currently has a quadratic memory footprint and its transitivity constraints have cubic time complexity.
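
As a rough illustration of the scalability concern mentioned above, the following sketch counts how the pairwise order choice grows with the number of statements: a quadratic number of order variables and a cubic number of transitivity constraints. The numbers are a back-of-the-envelope illustration, not measurements of our implementation.

    from math import comb

    def order_choice_cost(num_statements: int):
        variables = comb(num_statements, 2)    # one order value per pair of statements
        transitivity = comb(num_statements, 3) # one transitivity constraint per triple
        return variables, transitivity

    for n in (10, 100, 1000):
        variables, constraints = order_choice_cost(n)
        print(f"{n} statements: {variables} order variables, "
              f"{constraints} transitivity constraints")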

7.1.4 Lower Bound Performance Model of Partially Specified Candidates

Our last principal contribution is the lower bound performance model. We provided both a generic model and its instantiation for linear algebra kernels on GPU. This model demonstrates two properties of candidates: first, that it is possible to compute a correct lower bound for the execution time of an implementation; and second, that it is possible to do so even when some choices are left open, thus allowing information to be derived early in the compilation process.

The generic model makes minimal assumptions about the hardware and the decision space. It only supposes a hierarchy of parallelism levels and that the structure of the code is specified with the order choice. It handles unspecified decisions by making local optimistic assumptions, independently for each hardware bottleneck. The instantiation of the performance model for GPUs allows for pruning the implementation space, reducing its size by several orders of magnitude.

The main limitation when building the model was the lack of public information about the GPU microarchitecture, such as the internals of the memory system. We could certainly build a more accurate model with access to more information.
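
The following toy sketch illustrates the general shape of such a bound under the "local optimistic assumption" idea: each hardware bottleneck is bounded independently, and the execution time cannot be lower than the maximum of these per-bottleneck bounds. The bottleneck names and numbers are hypothetical and do not come from the actual GPU model.

    def lower_bound(pressure_per_bottleneck, throughput_per_bottleneck):
        """Lower bound = max over bottlenecks of (minimal work / peak throughput)."""
        return max(
            pressure_per_bottleneck[b] / throughput_per_bottleneck[b]
            for b in pressure_per_bottleneck
        )

    bound = lower_bound(
        # minimal amount of work that every remaining implementation must issue
        {"flops": 2e9, "dram_bytes": 1e8, "issue_slots": 5e8},
        # peak hardware throughput for each bottleneck
        {"flops": 1e12, "dram_bytes": 5e11, "issue_slots": 1e12},
    )
    print(f"lower bound: {bound * 1e3:.3f} ms")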

7.2 Long Term Perspectives

This final section presents our long-term research perspectives, including new contributions concerning candidates and their application in search algorithms. Section 7.2.1 presents an idea to encode complex control flow structures without relying on a fixed control flow graph. Section 7.2.2 details how we could enrich the lower bound performance model to help pick which choices to instantiate. Finally, Section 7.2.3 lays out our perspective towards building better search algorithms.

7.2.1 Predicate-Based Control Flow Encoding

The decision space for GPU implementations of linear algebra kernels is limited to counted loops with no data-dependent bounds. Section 7.1.3 already proposed to use the polyhedral representation of loop nests to support a broader class of programs. This section proposes an orthogonal approach to represent partially specified control flows. This new approach is compatible with the polyhedral model as it encodes the local structure of the control flow, while the polyhedral model encodes the global structure of loop nests.

We first present Control Flow Graphs (CFGs), the usual approach to represent control flow in optimizing compilers. We then introduce our idea to represent partially specified control flows. Finally, we discuss the current limitations of this approach and the work we intend to do to improve it.

Control Flow Graphs. The traditional approach to represent the control flow of programs is to rely on CFGs. The nodes of a CFG represent basic blocks, that is, blocks of straight-line code with a single entry and exit point, while the edges represent jumps in the control flow. Figure 7.1 shows an example of CFG for an imperative code.

(a) Original Imperative Code:

    a = 42
    if b > 0:
        c = 1
    else:
        c = 2
    d = a + c
    return d

(b) Corresponding Control Flow Graph (the graph itself is not rendered in this text version; its basic blocks contain the same statements, with one block per branch of the condition).

Figure 7.1: Example of Control Flow Graph

The problem of control flow graphs is that they impose a fixed structure on the code. We do not see how we could define a partially specified control flow graph. The sea-of-nodes approach tries to solve this problem by relaxing the relation between the CFG and the instructions [57]. Instead of fixing the order of instructions within basic blocks, it represents them in a separate data-flow graph, with control dependencies from the CFG to the instructions. However, the structure of the control flow is still fixed.

Predicate-Based Control Flow. We propose to represent the control flow of a loop-free code by predicating each instruction with a boolean variable (or its negation) computed earlier in the code. We then place the instructions in a graph that represents dependencies. Figure 7.2 shows the example program of Figure 7.1 represented with our approach. Alongside the dependency graph, we maintain a list of relations between predicates of the form p = ⊔i<k pi. Such a relation indicates that:

p ⇐⇒ ∃i < k. pi        ∀i, j < k. i ≠ j =⇒ ¬pi ∨ ¬pj        (7.1)

We combine these relations to answer queries about the disjointness of two predicates (that is, whether they can be simultaneously true or not), and the implication of a predicate by a disjunction of others.

The dependency graph of Figure 7.2 contains the following predicated instructions (dependency edges are not rendered in this text version):

    p = b > 0     when true
    c = 1         when p
    c = 2         when not p
    a = 42        when true
    d = a + c     when true
    return d      when true

Figure 7.2: Example of Predicate-Based Control Flow Representation

In our case, the only relation we have is the tautology ⊤ = p ⊔ ¬p, but relations can get more complex with more predicates. Note that these relations perfectly encode the information we could obtain from the CFG of the original code. Indeed, each relation between a predicate p and predicates (p_i)_{i<k} corresponds to a point of the original CFG where the control flow splits.
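
To make the encoding more tangible, the following Python sketch stores the predicated instructions of Figure 7.2 together with the single relation ⊤ = p ⊔ ¬p and answers a toy disjointness query. The data layout and helper names are illustrative assumptions; the thesis does not prescribe a concrete implementation.

    # Each instruction carries the predicate under which it executes ("true" = always).
    instructions = [
        ("p = b > 0", "true"),
        ("c = 1",     "p"),
        ("c = 2",     "not p"),
        ("a = 42",    "true"),
        ("d = a + c", "true"),
        ("return d",  "true"),
    ]

    # Relations of the form  q = q_1 ⊔ ... ⊔ q_k  (pairwise-disjoint disjunction).
    relations = {"true": ["p", "not p"]}

    def disjoint(q1, q2):
        """True if the two predicates can never hold simultaneously (toy version).

        In this toy example the only source of disjointness is membership in the
        same ⊔-relation; the full query would combine several relations."""
        return any(q1 in parts and q2 in parts and q1 != q2
                   for parts in relations.values())

    print(disjoint("p", "not p"))   # True: the two branch sides never overlap
    print(disjoint("true", "p"))    # False: they can hold at the same time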

Control Flow Graph Reconstruction. Our representation offers flexibility to reorder instructions and move instructions in and out of if-conditions. Any order that satisfies data dependencies is valid. A decision space based on this representation could expose a choice predicate that selects the predicate of each instruction and a choice order that specifies the pairwise ordering of instructions, similar to the order choice presented in Chapter 4. We have developed an algorithm that builds a control flow graph given the predicate of each instruction and a total order of instructions. The strength of this approach is that it can encode the structure of any loop-free control flow graph with the right instructions, predicates and order. In particular, it can represent irregular control flows that cannot be expressed with nested if-then-else conditions.

Limitations and Future Work. This work builds on a long-standing research effort to unify data and control flow into a single data-flow intermediate representation. The Program Dependence Graph [58] tried combining data-flow and control by introducing predicates in a program representation. However, it cannot handle all irregular control flow graphs. More recently, Gated SSA extends a data-flow representation to explicitly choose between two possible versions of a variable where the control flow merges [59, 60]. While it simplifies code analysis, Gated SSA fails to regenerate efficient imperative code as it relies on lazy evaluation for its semantics. To solve this problem, the Program Dependence Web adds control information where the control flow splits. However, it suffers from the same issues with irregular control flow graphs as the Program Dependence Graph [60]. While our representation supports arbitrary acyclic control flow graphs, further work is required to understand how to support loops. One idea would be to specify the loop structure using the polyhedral model and the local structure of the code with the predicates approach. The difficulty is to represent the control flow at the border of iterations without limiting ourselves to hierarchical blocks of code with a single entry point and a single exit point.

7.2.2 Performance Model Gradient

Chapter 5 presented a performance model that computes a lower bound for the execution time of every implementation that derives from a candidate. We propose to modify this model so that it also indicates which decision is likely to increase the bound the most. The idea is that such decisions are more discriminative and that search algorithms should thus instantiate the corresponding choices first.

Finding the exact decision that would increase the bound the most is infeasible in practice. We can, however, find a decision that is likely to increase the bound more than most others. To this end, we annotate every intermediate variable x that the performance model computes with a decision d_x that would increase it and a lower bound δ_x of the impact of d_x on x. If x directly depends on a set of decisions, we separately instantiate each of these decisions to find how they impact x. Otherwise, we can write x = f(x_0, ..., x_{n−1}) where (x_i)_{i<n} are intermediate variables that are already annotated, and derive the annotation of x from theirs. Depending on the quantity being computed, the reported decision can originate from several places (a small sketch of this propagation follows the list below).

• When computing latency, the decision with the most impact on the bound can either originate from the individual latency and usage of hardware resources of statements, the fusion of statements (in equations (5.13) and (5.14)), the size of dimensions (in equation (5.15)) or from the nesting order of statements (in equations (5.15) and (5.17)).

• When computing the lower bound B_i on the execution time of parallelism level i, the decision can originate from the amount of hardware parallelism µ_i available at level i, from the amount of software parallelism available at level i (η_i^min and η_i^max), or from the bound B_{i−1} on the execution time at level i − 1.

• When computing the bound B_0 on the execution time of a single thread, the model computes longest paths in graphs. The same strategy as above applies to each intermediate step of the longest-path algorithm to find the decision that is likely to have the most impact on the longest path. The decision can originate from a modification of the weight of an existing edge between program points or from an additional edge. In the first case, the decision may originate from latency or from the individual latency of instructions, while in the second case it originates directly from the order choice.
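
As announced above, the following Python sketch illustrates, under simplifying assumptions, how (d_x, δ_x) annotations could be propagated through a monotone combinator such as a sum. The data types, the example decisions and the choice of combinator are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Annotated:
        value: float     # current value of the intermediate variable x
        decision: str    # decision d_x likely to increase x the most
        delta: float     # lower bound δ_x on the increase of x caused by d_x

    def annotated_sum(*xs: Annotated) -> Annotated:
        """Propagate annotations through x = x_0 + ... + x_{n-1} (monotone in each x_i),
        keeping the argument decision with the largest guaranteed impact."""
        best = max(xs, key=lambda x: x.delta)
        return Annotated(value=sum(x.value for x in xs),
                         decision=best.decision,
                         delta=best.delta)

    # Toy use: latency of two fused statements, each annotated with its own decision.
    load = Annotated(value=4.0, decision="size(loop_k) >= 64", delta=2.0)
    mul  = Annotated(value=1.0, decision="order(load, mul) = BEFORE", delta=0.5)
    total = annotated_sum(load, mul)
    print(total.decision, total.delta)  # the search would instantiate this choice first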

7.2.3 Search Algorithms

Finally, we discuss our ideas to build a better search algorithm that fully exploits the benefits of candidates. We first mention how genetic algorithms could help us find local optima and how a statistical model could predict the performance of a partially specified candidate. We then discuss how we could improve the tree search algorithm and, finally, how we could automatically adapt the order of decisions to the context, to make the most discriminative decisions first.

Neighbors of an Implementation and Genetic Algorithms. A first idea to improve the performance of the generated code is to leverage genetic algorithms to find local optima. While our lower bound performance model is able to detect the most discriminative choices early, it is unable to differentiate between decisions that only have a small, incremental effect on the performance: small variations of the bound are dwarfed by the inaccuracy of the bound.

The problem for applying genetic algorithms to candidates is to define the concept of neighbor of an implementation, that is, the implementations that can be reached by making incremental changes to another implementation. Mutating the outcome of a single choice would most likely lead to an invalid candidate. A first idea would be to iteratively reset the domains of the choices that appear in constraints invalidated by the mutation until all constraints are respected, and then to randomly specify choices again. The order in which to reset domains could be randomized, so that domains have an equal opportunity to be reset when multiple resets may satisfy a constraint.
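
A minimal sketch of this mutation-and-repair procedure is given below. The candidate methods (force, violated_constraints, reset_domain, specify_randomly) are hypothetical placeholders for the corresponding framework operations, so this is an outline of the idea rather than working library code.

    import random

    def mutate(implementation, choice, new_value, rng=random):
        """Return a neighbor of `implementation` by mutating one decision, then
        repairing the candidate until all constraints are satisfied again."""
        candidate = implementation.copy()
        candidate.force(choice, new_value)          # may violate some constraints
        while True:
            violated = candidate.violated_constraints()
            if not violated:
                break
            # reset the domain of one choice involved in a violated constraint,
            # chosen at random so every choice has a chance to be relaxed
            constraint = rng.choice(violated)
            candidate.reset_domain(rng.choice(constraint.choices))
        candidate.specify_randomly(rng)             # close the remaining open choices
        return candidate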

Predictive Performance Modeling. A second idea would be to develop a machine-learning-based performance model that models the expected value of the execution time rather than a lower bound. This model could operate either on fully specified implementations, to avoid evaluating slow implementations on the hardware, or on partially specified implementations, to guide the search. The development of such a model raises several questions. The first is which features it should rely on. A solution would be to directly use the domains of choices as input of the statistical model. However,

these attributes have no structure, making it hard to derive a bound. Another solution would be to rely on a combination of attributes from the intermediate results of the lower bound performance model, the domains, and the structure of the code.

The second question is how to deal with open choices. Indeed, existing statistical models, such as Ithemal [48] for the performance of single basic blocks, tend to follow the structure of the code. But with open choices, this structure might only be partially specified. A solution might be to do what the performance model of Chapter 5 already does: only require the code structure to be locally consistent and consider groups of dimensions (loop levels, following our terminology) to account for unordered loop nests.

The last question is how to develop a performance model that is independent of the kernel to implement. Indeed, a model that depends on the kernel would need to be learned at each run of the search algorithm, thus slowing down the search. A solution might be to rely on domain adaptation techniques to transfer knowledge from one kernel to the others.

Reinforcement Learning. The problem of choosing actions to perform (e.g. implementation decisions) in an environment (e.g. the decisions made so far or the domain of the current candidate) to maximize a value (e.g. the performance of the kernel) is not limited to optimizing compilation. The process of automatically finding the actions to perform based on the impact of previous actions is called reinforcement learning. A third direction of research would be to exploit recent breakthroughs in the field of reinforcement learning to drive the search towards the most efficient implementations. The first difficulty is that most of the research in the area focuses on adversarial contexts, where other players also take actions. In particular, our search is totally deterministic while adversaries introduce a random component. Another difference is that these algorithms optimize for the average outcome of the actions while we only care about the best implementation found.

Decisions Ordering. Finally, an algorithm could better exploit the commutativity of decisions. Section 6.3.3 already showed that the order of decisions has a significant impact on the performance of the search. Unfortunately, the order currently used is fixed upfront and thus cannot adapt to the kernel or to the previous decisions. Moreover, it operates at the granularity of generic choices instead of individual decisions. Ideally, the search algorithm should automatically detect the most discriminative decisions and instantiate them first. For this, it could measure the variance relative to each decision on previous evaluations of implementations. Alternatively, it could rely on the performance model gradient, as explained in Section 7.2.2.

Bibliography

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in Neural Information Processing Systems 25. 2012.

[2] NVIDIA cuDNN GPU accelerated deep learning. https://developer.nvidia.com/cudnn.

[3] Ulysse Beaugnon, Alexey Kravets, Sven van Haastregt, Riyadh Baghdadi, David Tweed, Javed Absar, and Anton Lokhmotov. Vobla: A vehicle for optimized basic linear algebra. In ACM SIGPLAN Notices, volume 49. ACM, 2014.

[4] Markus Puschel, José MF Moura, Jeremy R Johnson, David Padua, Manuela M Veloso, Bryan W Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, et al. SPIRAL: Code generation for DSP transforms. IEEE, 93(2), 2005.

[5] Daniele G Spampinato and Markus Püschel. A basic linear algebra compiler. In Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). ACM, 2014.

[6] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. LIFT: A functional data-parallel IR for high-performance GPU code generation. In Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2017.

[7] Henry Massalin. Superoptimizer: a look at the smallest program. In ACM SIGARCH Computer Architecture News, volume 15, pages 122–126. IEEE Computer Society Press, 1987.

[8] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. Scaling up superoptimization. In ACM SIGARCH Computer Architecture News, volume 44, pages 297–310. ACM, 2016.

[9] Francesca Rossi, Peter Van Beek, and Toby Walsh. Handbook of constraint programming. Elsevier, 2006.

[10] Guido Tack. Constraint propagation - models, techniques, implementation. PhD thesis, Saarland University, Germany, 2009.

[11] G. Baumgartner, A. Auer, D.E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R.J. Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, C. Lam, Q. Lu, M. Nooijen, R.M. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276–292, January 2005.

[12] Murray I Cole. Algorithmic skeletons: structured management of parallel computation. Pitman London, 1989.

[13] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for cuda. In GPU computing gems Jade edition, pages 359–371. Elsevier, 2011.

[14] Michel Steuwer, Philipp Kegel, and Sergei Gorlatch. Skelcl-a portable skeleton library for high-level gpu programming. In Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1176–1182. IEEE, 2011.

[15] Arvind K Sujeeth, Kevin J Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Transactions on Embedded Computing Systems (TECS), 13(4s):134, 2014.

[16] Bryan Catanzaro, Michael Garland, and Kurt Keutzer. Copperhead: compiling an embedded data parallel language. ACM SIGPLAN Notices, 46(8):47–56, 2011.

[17] Manuel MT Chakravarty, Gabriele Keller, Sean Lee, Trevor L McDonell, and Vinod Grover. Accel- erating haskell array codes with multicore gpus. In Proceedings of the sixth workshop on Declarative aspects of multicore programming, pages 3–14. ACM, 2011.

[18] Alexander Collins, Dominik Grewe, Vinod Grover, Sean Lee, and Adriana Susnea. Nova: A func- tional language for data parallelism. In Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, page 8. ACM, 2014.

[19] Cliff Click and Keith D Cooper. Combining analyses, combining optimizations. ACM Transactions on Programming Languages and Systems (TOPLAS), 17(2):181–196, 1995.

[20] Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. Optimization space pruning without regrets. In Proceedings of the 26th International Conference on Compiler Construction, CC 2017. ACM, 2017.

[21] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Equality saturation: a new approach to optimization. In ACM SIGPLAN Notices, volume 44, pages 264–276. ACM, 2009.

[22] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. 2012.

[23] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: end-to-end optimization stack for deep learning. CoRR, 2018.

[24] Wayne Kelly and William Pugh. A framework for unifying reordering transformations. Technical report, 1998.

[25] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and mem- ory hierarchies. International Journal of Parallel Programming, 34(3):261–317, 2006.

[26] Paul Feautrier. Some efficient solutions to the affine scheduling problem. i. one-dimensional time. International journal of parallel programming, 21(5):313–347, 1992.

[27] Paul Feautrier. Some efficient solutions to the affine scheduling problem. part ii. multidimensional time. International journal of parallel programming, 21(6):389–420, 1992.

[28] Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. Chlorophyll: Synthesis-aided compiler for low-power spatial architectures. In ACM SIGPLAN Notices, volume 49, pages 396–407. ACM, 2014.

[29] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. Com- binatorial sketching for finite programs. ACM Sigplan Notices, 41(11):404–415, 2006.

[30] Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Acm Sigplan Notices, vol- ume 43, pages 101–113. ACM, 2008.

[31] Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC), Vienna, Austria, 2018.

[32] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. Iterative optimization in the polyhedral model: part ii, multidimensional time. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pages 90–100, 2008.

[33] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, Jagannathan Ramanujam, Ponnuswamy Sadayappan, and Nicolas Vasilache. Loop transformations: convexity, pruning and optimization. In ACM SIGPLAN Notices, volume 46, pages 549–562. ACM, 2011.

[34] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018.

[35] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, March 2008.

[36] John E Stone, David Gohara, and Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering, 12(1-3):66–73, 2010.

[37] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard api for shared-memory programming. IEEE Comput. Sci. Eng., 5(1):46–55, 1998.

[38] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. In Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.

[39] Stanimire Tomov, Jack Dongarra, and Marc Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36(5-6):232–240, June 2010.

[40] Yinan Li, Jack Dongarra, and Stanimire Tomov. A note on auto-tuning GEMM for GPUs. In International Conference on Computational Science, ICCS'09. Springer, May 25-27 2009.

[41] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4), January 2013.

[42] Venmugil Elango, Norm Rubin, Mahesh Ravishankar, Hariharan Sandanagobalane, and Vinod Grover. Diesel: Dsl for linear algebra and neural net computations on gpus. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 42–51, New York, NY, USA, 2018. ACM.

[43] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 2009.

[44] Junjie Lai and André Seznec. Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 1–10. IEEE Computer Society, 2013.

[45] Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ACM SIGARCH Computer Architecture News, volume 37, pages 152–163. ACM, 2009.

[46] Sara S Baghsorkhi, Matthieu Delahaye, Sanjay J Patel, William D Gropp, and Wen-mei Hwu. An adaptive performance modeling tool for GPU architectures. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), volume 45, pages 105–114. ACM, 2010.

[47] Mehrzad Samadi, Amir Hormati, Mojtaba Mehrara, Janghaeng Lee, and Scott Mahlke. Adaptive input-aware compilation for graphics engines. ACM SIGPLAN Notices, 47(6):13–22, 2012.
[48] Charith Mendis, Saman Amarasinghe, and Michael Carbin. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. arXiv preprint arXiv:1808.07412, 2018.

[49] Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko, and Markus Puschel. Bandit-based optimization on graphs with application to library performance tuning. ICML '09. ACM, 2009.

[50] NVIDIA cuBLAS GPU accelerated linear algebra. https://developer.nvidia.com/cublas.

[51] Philip Kilby, John Slaney, Sylvie Thiébaux, and Toby Walsh. Estimating search tree size. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI'06, pages 1014–1019. AAAI Press, 2006.

[52] Pang C Chen. Heuristic sampling: A method for predicting the performance of tree searching programs. SIAM Journal on Computing, 21(2):295–315, 1992.

[53] Donald E. Knuth. Estimating the efficiency of backtrack programs. Mathematics of Computation, 29(129):121–136, 1975.

[54] Keith D Cooper, Philip J Schielke, and Devika Subramanian. Optimizing for reduced code space using genetic algorithms. In ACM SIGPLAN Notices, volume 34, pages 1–9. ACM, 1999.

[55] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

[56] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

[57] Cliff Click and Keith D Cooper. Combining analyses, combining optimizations. ACM Transactions on Programming Languages and Systems (TOPLAS), 17, 1995.

[58] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9, 1987.

[59] Robert Cartwright and Mattias Felleisen. The semantics of program dependence. In Conference on Programming Language Design and Implementation, PLDI '89, 1989.

[60] Karl J. Ottenstein, Robert A. Ballance, and Arthur B. MacCabe. The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Conference on Programming Language Design and Implementation, PLDI '90, 1990.

Appendix A

Proofs of the Performance Model

This appendix provides the proofs of the properties and theorems stated in Section 5.2, when presenting the performance model of a single thread. We first introduce some notations and demonstrate basic lemmas in Section A.1. We then give the proofs relative to the program order in Section A.2 and the proofs of the properties of execution graphs in Section A.3. Finally, we show that any path in Gc has a path of the same weight that only visits concrete execution points in Section A.4, and that the performance model algorithm indeed computes the weight of a path in the execution graph in Section A.5.

A.1 Notations and Basic Properties

In the following sections, we consider c a candidate and c′ a fixed candidate that derives from c. We denote I and I′ the instructions, D and D′ the dimensions, S and S′ the statements, L and L′ the levels, P and P′ the program points, I and I′ the indexes, E and E′ the execution points, and ⪯ and ⪯′ the program orders of c and c′ respectively. We also denote order′ the domain of the order choice for c′. Before jumping to the proofs, Section A.1.1 points out some useful results about the relation between c and c′. Then, Section A.1.2 discusses the properties we can expect from order.

A.1.1 Relation Between c and c′

Candidates follow the formalism of Chapter 3. In this formalism, specifying decisions can only restrict the domain of choices. Restricting the domain of a choice can trigger the lowering of a high level construct, but this can only add new objects to the set representing each property that the kernel exposes. We thus have the following remark.

Remark A.1. Specifying decisions cannot suppress statements, instructions or dimensions. Thus I ⊆ I′, D ⊆ D′ and S ⊆ S′. Moreover, decisions may only restrict domains. Thus, for any distinct s1, s2 ∈ S, order(s1, s2) ⊆ order′(s1, s2).

The first corollary of this is that the set of levels increases. Indeed,

L = {L ⊆ D | ∀d1, d2 ∈ L. d1 = d2 ∨ order(d1, d2) ⊆ {MERGED, IN, OUT}}        (A.1)
  ⊆ {L ⊆ D′ | ∀d1, d2 ∈ L. d1 = d2 ∨ order′(d1, d2) ⊆ {MERGED, IN, OUT}}        (A.2)
  = L′        (A.3)

By definition of program points, this gives us the following lemma.

Lemma A.1. The sets of levels and program points increase when specifying decisions. That is, L ⊆ L′ and P ⊆ P′.

A.1.2 Properties of the order Choice

Section 4.3.2 imposes constraints on the order choice. While propagators may not fully enforce constraints on partially specified choices, they are always exact on fixed domains. In particular, if c is a fixed candidate, then the following lemma holds.

Lemma A.2. Let s1, s2, s3 ∈ S be three distinct statements. If the domain of order is fixed, then it respects the following implications.

order(s1, s2) ⊆ {BEFORE, IN, MERGED} ∧ order(s2, s3) ⊆ {BEFORE} =⇒ order(s1, s3) ⊆ {BEFORE}        (A.4)
order(s1, s2) ⊆ {BEFORE} ∧ order(s2, s3) ⊆ {BEFORE, IN, MERGED} =⇒ order(s1, s3) ⊆ {BEFORE, IN}        (A.5)
order(s1, s2) ⊆ {IN, OUT} ∧ order(s2, s3) ⊆ {BEFORE, IN, MERGED} =⇒ order(s1, s3) ⊆ {BEFORE, IN, OUT}        (A.6)
order(s1, s2) ⊆ {BEFORE, IN, OUT, MERGED} ∧ order(s2, s3) ⊆ {BEFORE} =⇒ order(s1, s3) ⊆ {BEFORE, OUT}        (A.7)
order(s1, s2) ⊆ {BEFORE, IN, OUT, MERGED} ∧ order(s2, s3) ⊆ {BEFORE, IN, OUT, MERGED} =⇒ order(s1, s3) ⊆ {BEFORE, IN, OUT, MERGED}        (A.8)
order(s1, s2) ⊆ {BEFORE, OUT, MERGED} ∧ order(s2, s3) ⊆ {BEFORE, OUT} =⇒ order(s1, s3) ⊆ {BEFORE, OUT}        (A.9)
order(s1, s2) ⊆ {BEFORE} ∧ order(s2, s3) ⊆ {BEFORE, OUT, MERGED} =⇒ order(s1, s3) ⊆ {BEFORE}        (A.10)
order(s1, s2) ⊆ {BEFORE, OUT, MERGED} ∧ order(s2, s3) ⊆ {BEFORE, OUT, MERGED} =⇒ order(s1, s3) ⊆ {BEFORE, OUT, MERGED}        (A.11)
order(s1, s2) ⊆ {OUT, MERGED} ∧ order(s2, s3) ⊆ {OUT} =⇒ order(s1, s3) ⊆ {OUT}        (A.12)

These properties are obtained by enumerating all possible assignments of order(s1, s2), order(s2, s3) and order(s1, s3) and filtering out the ones that do not respect constraints (4.3), (4.4) and (4.7). When applying constraints, we use the fact that order is antisymmetric to filter out more assignments.

A.2 Program Order

This section gives the proofs for Section 5.2.1. Section A.2.1 first shows that ⪯ is indeed a partial order. Section A.2.2 then shows that it is a total order on concrete execution points. Finally, Section A.2.3 shows that the ⪯ relation only imposes ordering constraints that are valid for any implementation c′ ∈ ⟦c⟧ that derives from c.

A.2.1 The Program Order is a Partial Order

We first show that ⪯ is a partial order, that is, a reflexive, transitive and antisymmetric relation. The reflexivity derives directly from the definition of ⪯. We prove below that it is transitive and antisymmetric.

Transitive. Let p1, p2, p3 ∈ P such that p1 ⪯ p2 and p2 ⪯ p3. Let us show that p1 ⪯ p3. For that, we need to show that it respects the five conditions of equation (5.32). The first four conditions applied to p1 ⪯ p2 and p2 ⪯ p3 give us the first four conditions for p1 ⪯ p3:

before(p1) ⊆ before(p2) ⊆ before(p3) (A.13a)

before(p1) ∪ current(p1) ⊆ before(p2) ∪ current(p2) ⊆ before(p3) ∪ current(p3) (A.13b)

after(p1) ⊇ after(p2) ⊇ after(p3) (A.13c)

after(p1) ∪ current(p1) ⊇ after(p2) ∪ current(p2) ⊇ after(p3) ∪ current(p3) (A.13d)

For the last condition, we reason by contradiction and assume it is violated. Let us assume p1 = X∅ and p3 = E∅. This gives us before(p1) = S and before(p3) = ∅. But then, according to (A.13a), before(p1) ⊆ before(p3). Thus S = ∅ and P = {E∅, X∅}. Thus p2 ∈ {p1, p3}. This is a contradiction: as p2 ∈ {p1, p3}, p1 ⪯ p2 and p2 ⪯ p3 give p1 ⪯ p3, while our assumption implies that (p1, p3) does not respect the last condition.

Antisymmetric. Let p1, p2 ∈ P such that p1 ⪯ p2 and p2 ⪯ p1. Let us show that p1 = p2. For this, we first show how we can uniquely identify a program point p ∈ P from before(p), current(p) and after(p).

• If p is an instruction i ∈ I, then current(p) ∩ I = {i}; otherwise current(p) ∩ I = ∅. Thus, p = i ∈ I if and only if current(p) ∩ I = {i}.

• If p is the entry point EL of a loop level L ∈ L, then after(p) ∩ current(p) = L. Otherwise after(p) ∩ current(p) = ∅. Thus, for L ≠ ∅, p = EL if and only if after(p) ∩ current(p) = L.

• Similarly, if p is the exit point XL of a loop level L ∈ L, then before(p) ∩ current(p) = L. Otherwise before(p) ∩ current(p) = ∅. Thus, for L ≠ ∅, p = XL if and only if before(p) ∩ current(p) = L.

It is then a matter of applying these identification techniques to p1 and p2. By definition of ⪯, we have the following equalities.

before(p1) = before(p2) (A.14)

after(p1) = after(p2) (A.15)

before(p1) ∪ current(p1) = before(p2) ∪ current(p2) (A.16)

after(p1) ∪ current(p1) = after(p2) ∪ current(p2) (A.17)

By combining equations (A.16) and (A.17) with the fact that ∀p ∈ P. before(p) ∩ after(p) = ∅, we obtain the following equations:

(before(p1) ∪ current(p1)) ∩ (after(p1) ∪ current(p1)) = current(p1) (A.18)

= (before(p2) ∪ current(p2)) ∩ (after(p2) ∪ current(p2)) = current(p2) (A.19)

We then distinguish three cases depending on the value of p1.

• If p1 = i ∈ I, then current(p2) ∩ I = current(p1) ∩ I = {i}. Thus p2 = i = p1.

• If there exists L ∈ L such that p1 = EL, then after(p2) ∩ current(p2) = after(p1) ∩ current(p1) = L. Thus, if L ≠ ∅, then p2 = EL = p1. Otherwise, p2 = E∅ or p2 = X∅. Because p2 ⪯ p1, condition (5.32e) in the definition of ⪯ ensures that p2 ≠ X∅. Thus, p2 = E∅ = p1.

• Similarly, if there exists L ∈ L such that p1 = XL, then p1 = p2.

A.2.2 The Program Order is Total on Concrete Execution Points

Let us assume for this section that c is a fixed candidate. We want to prove that ⪯ restricted to concrete execution points is a total order that follows the structure of the generated code. For that, we consider p1, p2 ∈ P two concrete program points. We show that p1 and p2 are ordered by ⪯, and that this order matches their order in the generated code. We first treat the special cases where p1 or p2 are the entry point E∅ or the exit point X∅ of the entire kernel before handling the general case.

Entry and Exit Points. If p1 is the entry point E∅ of the entire kernel, then p1 ⪯ p2. Indeed, in that case, before(p1) = current(p1) = ∅ and after(p1) = S. Similarly, if p1 is the exit point X∅ of the entire kernel, then p2 ⪯ p1, as before(p1) = S and current(p1) = after(p1) = ∅. The same results apply if p2 = E∅ or p2 = X∅.

Merged Dimensions. We now assume p1 and p2 distinct from E∅ and X∅. Since they are both concrete execution points, they both relate to a single class of merged statements. We denote S1,S2 ⊆ S these classes of statements. We now consider five cases depending on the relative order of S1 and S2.

1. If S1 = S2, then either a) p1 = p2, b) S1 ∈ L, p1 = ES1 and p2 = XS1, or c) S1 ∈ L, p1 = XS1 and p2 = ES1. Case a) is trivial and case c) is the symmetric of b). We thus focus on b) and suppose p1 = ES1 and p2 = XS1. We then have before(p1) ⊆ before(p2) as shown below.

before(p1) = {s ∈ S | s ∉ S1 ∧ ⋃d∈S1 order(s, d) = {BEFORE}}        (A.20)
           ⊆ {s ∈ S | s ∈ S1 ∨ ⋃d∈S1 order(s, d) ⊆ {BEFORE, IN}}        (A.21)
           ⊆ before(p2)        (A.22)

By symmetry, we also have after(p1) ⊇ after(p2). Moreover current(p1) = current(p2). Thus the entry point of the loop level S1 is correctly ordered before its exit point.

2. If S1 is ordered before S2, we show that before(p1) ⊆ before(p2) and before(p1) ∪ current(p1) ⊆ before(p2) ∪ current(p2). By symmetry, we will also have the conditions on after and current, showing that p1 ⪯ p2. Since S1 is before S2, we have:

∀s1 ∈ S1, s2 ∈ S2. order(s1, s2) = {BEFORE} (A.23)

Let s ∈ before(p1). Then, either:

• s ∈ S1, in which case, we can directly apply (A.23) to obtain s ∈ before(p2), or

• for every s1 ∈ S1, order(s, s1) ⊆ {BEFORE, IN}. In particular, since S1 ≠ ∅, there exists such an s1. Combined with (A.23) and (A.4), this gives us:

∀s2 ∈ S2. order(s, s2) = {BEFORE} (A.24)

Thus, s ∈ before(p2).

Thus before(p1) ⊆ before(p2).

Similarly, let s ∈ current(p1). Then either,

• s ∈ S1, in which case, we can directly apply (A.23) to obtain s ∈ before(p2), or

• for every s1 ∈ S1, order(s, s1) ⊆ {OUT, MERGED}. In particular, since S1 ≠ ∅, there exists such an s1. Combined with (A.23) and (A.9), this gives us:

∀s2 ∈ S2. order(s, s2) = {OUT} ∨ order(s, s2) = {BEFORE} (A.25)

Thus, s ∈ before(p2) ∪ current(p2).

Thus before(p1) ∪ current(p1) ⊆ before(p2) ∪ current(p2). Thus p1 ⪯ p2.

3. If S1 is ordered after S2, this is the symmetric of case 2.

4. If S1 is nested inside S2, then S2 ∈ L and either p2 = ES2 or p2 = XS2. We only treat the case p2 = ES2, as the case p2 = XS2 is similar. We show that in this case, we correctly have p2 ⪯ p1. For that, we show that before(p2) ⊆ before(p1) and before(p2) ∪ current(p2) ⊆ before(p1) ∪ current(p1). By symmetry, we will also have the conditions on after and current, showing that p2 ⪯ p1. Since S1 is nested inside S2, we have:

∀s1 ∈ S1, s2 ∈ S2. order(s2, s1) = {OUT} (A.26)

Let s ∈ before(p2). Then, either:

• s ∈ S2, in which case we can directly apply (A.23) to obtain s ∈ before(p1), or

• for every s2 ∈ S2, order(s, s2) = {BEFORE}. In particular, since S2 ≠ ∅, there exists such an s2. Combined with (A.26) and (A.10), this gives us:

∀s1 ∈ S1. order(s, s1) = {BEFORE} (A.27)

Thus, s ∈ before(p1).

Thus before(p2) ⊆ before(p1).

Similarly, let s ∈ current(p2). Then either,

• s ∈ S2, in which case we can directly apply (A.23) to obtain s ∈ before(p1), or

• for all s2 ∈ S2, order(s, s2) ⊆ {OUT, MERGED}. In particular, since S2 ≠ ∅, there exists such an s2. Combined with (A.23) and (A.12), this gives us:

∀s1 ∈ S1. order(s, s1) = {OUT} (A.28)

Thus, s ∈ before(p1) ∪ current(p1).

Thus before(p2) ∪ current(p2) ⊆ before(p1) ∪ current(p1). Thus p2 ⪯ p1.

5. If S1 is ordered outside of S2, this is the symmetric of case 4

A.2.3 The Program Order of c Remains Valid for c′

Finally, we show that ⪯ respects the program order of any implementation c′ ∈ ⟦c⟧ that derives from c. Specifically, we want to prove that

∀p1, p2 ∈ P. p1 ⪯ p2 =⇒ p1 ⪯′ p2.        (A.29)

Let p1, p2 ∈ P such that p1 ⪯ p2. We already showed in Lemma A.1 that P ⊆ P′. Thus ⪯′ is well defined for p1 and p2. We now show that p1 and p2 respect the five conditions of Definition 5.3. We prove conditions (5.32a) and (5.32b). We omit the proofs for (5.32c) and (5.32d) as they are similar to the first two. Condition (5.32e) is the same for c and c′ and thus already enforced. To summarize, we need to show the following equations.

before′(p1) ⊆ before′(p2)        (A.30)
before′(p1) ∪ current′(p1) ⊆ before′(p2) ∪ current′(p2)        (A.31)

If p1 = E∅, then before′(p1) = current′(p1) = ∅, and both properties are trivial. If p1 = X∅, then before(p1) = S and current(p1) = after(p1) = ∅. Thus, p2 ⪯ p1. But ⪯ is antisymmetric, thus p1 = p2 and both properties are also trivial. We now suppose that p1 ≠ E∅ and p1 ≠ X∅. We distinguish three cases depending on the value of p2.

• If there exists i ∈ I such that p2 = i, then:

p2 = i ∈ current(p2) ⊆ current(p2) ∪ after(p2) ⊆ current(p1) ∪ after(p1) (A.32)

Then either p1 = i, in which case the conditions are trivial, or i ∈ after(p1). We distinguish three additional cases depending on the value of p1.

– If p1 ∈ I, then order′(p1, p2) = {BEFORE}. The conditions then follow from the properties (A.4) and (A.9) of Lemma A.2.
– If there exists L ∈ L such that p1 = EL, then ∀d ∈ L. order′(d, p2) ⊆ {BEFORE, OUT}. In particular, there exists such a d since L ≠ ∅. The conditions then follow from the properties (A.10) and (A.11) of Lemma A.2.
– If there exists L ∈ L such that p1 = XL, then ∀d ∈ L. order′(d, p2) = {BEFORE}. In particular, there exists such a d since L ≠ ∅. The conditions then follow from the properties (A.4) and (A.9) of Lemma A.2.

113 • If there exists L ∈ L such that p2 = EL, then:

L ⊆ after(p2) ⊆ after(p1) (A.33)

We then distinguish three additional cases depending on the value of p1.

– If p1 = i, then ∀d ∈ L. order′(p1, d) = {BEFORE}. The conditions then follow from the properties (A.4) and (A.9) of Lemma A.2.
– If there exists L′ ∈ L such that p1 = EL′, then ∀d ∈ L. d ∈ L′ ∨ ∀d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, OUT}. In particular, since L′ is not empty, ∀d ∈ L. d ∈ L′ ∨ ∃d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, OUT}. The conditions then follow from the properties (A.10) and (A.11) of Lemma A.2.
– If there exists L′ ∈ L such that p1 = XL′, then ∀d ∈ L, d′ ∈ L′. d ≠ d′ ∧ order′(d′, d) = {BEFORE}. In particular, since L′ is not empty, there exists such a d′. The conditions then follow from the properties (A.10) and (A.7) of Lemma A.2.

• If there exists L ∈ L such that p2 = XL, then:

L ⊆ current(p2) ⊆ current(p2) ∪ after(p2) ⊆ current(p1) ∪ after(p1) (A.34)

We then distinguish three additional cases depending on the value of p1.

– If p1 ∈ I, then ∀d ∈ L. order′(p1, d) ⊆ {IN, BEFORE}. The conditions then follow from the properties (A.5) and (A.8) of Lemma A.2.
– If there exists L′ ∈ L such that p1 = EL′, then ∀d ∈ L. d ∈ L′ ∨ ∀d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, OUT, MERGED}. In particular, since L′ is not empty, ∀d ∈ L. d ∈ L′ ∨ ∃d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, OUT, MERGED}. The conditions then follow from the properties (A.10) and (A.11) of Lemma A.2.
– If there exists L′ ∈ L such that p1 = XL′, then ∀d ∈ L. d ∈ L′ ∨ ∀d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, IN, MERGED}. In particular, since L′ is not empty, ∀d ∈ L. d ∈ L′ ∨ ∃d′ ∈ L′. order′(d′, d) ⊆ {BEFORE, IN, MERGED}. The conditions then follow from the properties (A.5) and (A.6) of Lemma A.2.

The conditions are thus respected for any value of p1 and p2.

A.3 Execution Graph Properties

This section provides the proofs relative to the properties of execution graphs. Section A.3.1 proves Proposition 5.2, and Section A.3.2 proves Proposition 5.5.

A.3.1 The Execution Graph is Acyclic

We want to prove that Gc is acyclic. For that, we first define a partial order < on execution points and then show that every edge e1 −w→ e2 in Gc goes from an execution point e1 to a greater execution point e2.

Execution Order. We define the execution order < on execution points to match the order in which they execute. For this, we first define an order on indexes.

Let ≤ be an arbitrary total order on L such that for all L1, L2 ∈ L, L1 ≤ L2 =⇒ EL1 ⪯ EL2. Such an order exists as ⪯ is a partial order. We then define the partial order ⊴ on indexes as the lexicographical order on the (I(L))L∈L ordered by ≤. Thus ⊴ is the lexicographical order on the tuples (I(L1), ..., I(Ln)) ∈ ℕⁿ, where ∀i < n. Li < Li+1 and n = |L|. We then define ≤ on execution points (p, I) ∈ E as the lexicographical order on first the index I and second the program point p. That is, for (p1, I1), (p2, I2) ∈ E:

(p1, I1) ≤ (p2, I2) ⇐⇒ I1 ◁ I2 ∨ (I1 = I2 ∧ p1 ⪯ p2)        (A.35)

Edges Follow the Execution Order. Let (p1, I1) −w→ (p2, I2) be an edge in Gc. We want to show that (p1, I1) < (p2, I2). For that, we consider the different possible definitions of an edge in the definition of Gc. Each case below corresponds to the case with the matching number in Definition 5.6.

1. In the first case, p1 ≺ p2 and I1 = I2. Then, by definition of <, (p1,I1) < (p2,I2).

2. In the second case, there exists L ∈ L such that p1 = EL and p2 = XL, and I1 ≤ I2. Thus, p1 ≺ p2 and I1 ⊴ I2, so (p1, I1) < (p2, I2).

3. In the third case, p1 and p2 are instructions such that p2 consumes the value that p1 produces, and I1 ≤ I2. Then I1 ⊴ I2 and, since p2 reads the value that p1 produces, p1 ≺ p2. Thus (p1, I1) < (p2, I2).

4. In the fourth case, there exists L ∈ L such that p1 = XL and p2 = EL. There also exists LIN ⊂ L, a set of loop levels ordered within L, such that I2 respects the following definition.

∀L′ ∈ L. I2(L′) := I1(L) + 1 if L′ = L,
                    0          if L′ ∈ LIN,
                    I1(L′)     otherwise.        (A.36)

Let L′ ∈ L such that I1(L′) > I2(L′). Then, L′ ∈ LIN per the definition of I2. But then EL ≺ EL′ since LIN contains loop levels nested inside L. Thus ∀L′ ∈ L. I1(L′) ≤ I2(L′) ∨ L < L′. Moreover I1(L) < I2(L). Thus, per the definition of ⊴, I1 ◁ I2, and since p2 reads the value p1 produces, p1 ⪯ p2. Thus (p1, I1) < (p2, I2).

5. In the fifth case, p1 = p2 and there exists L ∈ L and LIN ⊂ L a set of loop levels ordered within L such that I2 respects the following definition.

∀L′ ∈ L. I2(L′) := I1(L) + 1 if L′ = L,
                    0          if L′ ∈ LIN,
                    I1(L′)     otherwise.        (A.37)

Then, for the same reasons as in case 4, (p1, I1) < (p2, I2).

6. The last case is the same as case 5, but with LIN = ∅.

A.3.2 The Execution Graph Increases when Specifying Decisions

This section proves that there exists a function f : E → E′ that preserves the edges from Gc to Gc′:

∀e1, e2 ∈ E. e1 −w→ e2 ∈ Gc =⇒ ∃w′ ≥ w. f(e1) −w′→ f(e2) ∈ Gc′.        (A.38)

For that, we define a function g : I → I′ and define f : E → E′ for (p, I) ∈ E as f(p, I) = (p, g(I)). We thus need to prove

(p1, I1) −w→ (p2, I2) ∈ Gc =⇒ ∃w′ ≥ w. (p1, g(I1)) −w′→ (p2, g(I2)) ∈ Gc′.        (A.39)

Note that P ⊆ P′, so that (p1, g(I1)) and (p2, g(I2)) are indeed included in E′. Moreover, I1 < I2 =⇒ g(I1) < g(I2). For every I ∈ I, we define g(I) as follows.

∀L ∈ L′. g(I)(L) := I(L) if L ∈ L,
                    0    otherwise.        (A.40)

This definition ensures that g(I) ∈ I′. Indeed, for every L ∈ L, I(L) < sizemin(L) ≤ size′min(L). We now prove that f preserves the edges for each case of the definition of Gc. Let (p1, I1), (p2, I2) ∈ E such that Gc has an edge of weight w from (p1, I1) to (p2, I2). Each case below corresponds to the case with the matching number in Definition 5.6.

1. In the first case, p1 ≺ p2, I1 = I2 and w = 0. Since p1 ≺ p2, Gc′ has an edge of weight w′ = 0 from (p1, g(I1)) to (p2, g(I1)) = (p2, g(I2)). Thus the edge is preserved.

2. In the second case, I1 ≤ I2 (thus g(I1) ≤ g(I2)) and there exists L ∈ L such that p1 = EL, p2 = XL and w = latency(i, L), where i < k is a parallelism level that produces synchronization instructions. Then L ∈ L′ since L ⊆ L′. Thus, there exists an edge of weight latency′(i, L) from (p1, g(I1)) to (p2, g(I2)) in Gc′. Moreover, latency′(i, L) ≥ latency(i, L) = w, thus the edge is preserved.

3. In the third case, p1, p2 ∈ I, I1 ≤ I2 and there exist two series of dimensions (di)i

4. In the fourth case, w = 0 and there exists L ∈ L such that p1 = XL, p2 = EL and L is exclusive in I1. There also exists a set of loop levels LIN that are all nested inside L such that I2 is defined as follows.

∀L′ ∈ L. I2(L′) := I1(L) + 1 if L′ = L,
                    0          if ∃i < n. L′ = Li,
                    I1(L′)     otherwise.        (A.41)

By definition of g, this equation also holds if we replace I1 by g(I1) and I2 by g(I2), and L is also exclusive in g(I1). Thus, there exists an edge of weight 0 from (p1, g(I1)) to (p2, g(I2)) in Gc′ and the edge is preserved.

5. In the fifth case, p1 = p2 ∈ I, w = inst_latency(p1) and there exists L ∈ L such that p1 reads the value it produces at the previous iteration of L and L is exclusive in I1. There also exists a set of loop levels LIN that are all nested inside L such that I2 is defined as follows.

∀L′ ∈ L. I2(L′) := I1(L) + 1 if L′ = L,
                    0          if ∃i < n. L′ = Li,
                    I1(L′)     otherwise.        (A.42)

By definition of g, this equation also holds if we replace I1 by g(I1) and I2 by g(I2), and L is also exclusive in g(I1). Thus, there exists an edge from (p1, g(I1)) to (p2, g(I2)) in Gc′ and the edge is preserved.

6. In the last case, p1 = p2, w = 0 and there exists L ∈ L such that I2 = I1[L/I1(L) + 1]. Thus there is an edge of weight w′ = 0 from (p1, g(I1)) to (p1, g(I1)[L/g(I1)(L) + 1]) = (p2, g(I2)) in Gc′ and the edge is preserved.

A.4 Virtual and Concrete Paths

This section proves Proposition 5.4. Let us suppose c is a fixed candidate and let H be the loop levels that represent classes of merged dimensions in c. We want to show that if P is a path in Gc, then there exists a path P′, still in Gc, that only visits concrete execution points. For that, we first show that we can put P in a canonical form. We then show how to transform a canonical path P into a path P′ that only iterates on loop levels of H, and finally how to transform this path P′ into a path that only visits concrete execution points.

Definition A.1 (Canonical Path). Let P be a path in Gc. We say that P is canonical if every execution point (p, I) and every edge from (p1,I1) to (p2,I2) it visits respect the following conditions. 1. p is exclusive in I.

2. I is only positive on loop levels L ∈ L that start before p: I(L) > 0 =⇒ EL ⪯ p.

3. If the edge was defined from the second case of the definition of Gc, then I1 = I2,

4. If the edge was defined from the third case of the definition of Gc, then I2 is minimal. Concretely, if (Li)i<n and (L′i)i<n are the loop levels containing the dimensions on which the point-to-point communication occurs, then I2 is defined as follows.

∀L ∈ L. I2(L) =
    I1(Li)    if ∃i < n. L = L′i
    I1(L)     otherwise        (A.43)

5. The edge cannot be defined from the last case of the definition of Gc.
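The sketch below paraphrases the pointwise conditions 1–2 of Definition A.1; the helpers is_exclusive (for "p is exclusive in I") and starts_before (for EL ⪯ p) are assumptions of this illustration and are not defined in the thesis text.

    # Hedged sketch of conditions 1-2 of Definition A.1 for a single execution point (p, I).
    # `is_exclusive(p, index)` and `starts_before(level, p)` are assumed helper predicates.
    def point_is_canonical(p, index, is_exclusive, starts_before):
        exclusive = is_exclusive(p, index)                 # condition 1
        only_started = all(starts_before(level, p)         # condition 2
                           for level, it in index.items() if it > 0)
        return exclusive and only_started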

Proposition A.1. Let P be a path in Gc. Then there exists a canonical path P′ of same weight in Gc that visits the same program points. Moreover, if for every execution point (p, I) visited by P, I is only positive on loop levels of H, then this property is preserved in P′.

The following proposition shows that it is possible to translate the iterations along arbitrary loop levels into iterations along loop levels of H.

Proposition A.2. Let P be a canonical path in Gc. Then there exists a path P′ of same weight that only visits concrete program points and such that, for every execution point (p, I) visited by P′, I is only positive on loop levels of H.

Finally, the last proposition shows that if the path P obtained in Proposition A.2 is canonicalized, it is possible to modify P so that it only visits concrete execution points, thus proving that any path in Gc has a corresponding path of same weight in Gc that only visits concrete execution points.

Proposition A.3. Let P be a canonical path in Gc that only visits concrete program points and such that, for every execution point (p, I) it visits, I is only positive on loop levels of H. Then, there exists a path P′ of same weight in Gc that only visits concrete execution points.

We now prove each proposition in a dedicated section: Proposition A.1 in Section A.4.1, Proposition A.2 in Section A.4.2, and Proposition A.3 in Section A.4.3.

A.4.1 Path Canonicalization

Let P be a path of weight w in Gc. We show, by induction on the length n of P, how to generate a canonical path P′ of same weight in Gc that visits the same program points as P. Moreover, we prove that if for every execution point (p, I) visited by P, I is only positive on loop levels of H, then this property is preserved in P′. Finally, to help with the induction, we add the constraint that if I is the final index of P and I′ the final index of P′, then I′ ≤ I.

The case where n = 0 is trivial. Indeed, if P = (p, I), then P′ = (p, I0), where I0 ∈ I is defined such that ∀L ∈ L. I0(L) = 0, is a solution. We now focus on the case where n > 0 and assume the property true for any path of length < n. Let (p1, I1) −w1→ (p2, I2) be the last edge of P. The induction hypothesis gives us a canonical path P′ of weight w − w1 that ends in (p1, I′1) with I′1 ≤ I1. We now consider the different cases of the definition of the edge from (p1, I1) to (p2, I2) in the definition of Gc. Each case below corresponds to the same case in Definition 5.6.

First Case: Program Order Constraints. In that case, I1 = I2, p1 ≺ p2 and w1 = 0. Thus, there exists an edge of same weight from (p1, I′1) to (p2, I′1). Moreover, I′1 respects the conditions of canonical paths according to the induction hypothesis. Thus, the path P′ extended with (p2, I′1) is a solution.

Second Case: Loop Level Latency. In that case, there exists L ∈ L and i ∈ ⟦0, k⟦ a parallelism level that produces synchronization operations such that p1 = EL, p2 = XL, I1 ≤ I2 and w1 = latency(i, L). But then, there exists an edge from (p1, I′1) to (p2, I′1) of same weight. Moreover, I′1 respects the conditions of canonical paths according to the induction hypothesis. Thus, the path P′ extended with (p2, I′1) is a solution.

Third Case: Data Dependencies. In that case, p1, p2 ∈ I are instructions such that p2 reads the value that p1 produces, with point-to-point communication from dimensions (di)i<n to dimensions (d′i)i<n, contained in loop levels (Li)i<n and (L′i)i<n respectively, and w1 = inst_latency(p1). Let I′2 ∈ I be defined as follows.

∀L ∈ L. I′2(L) =
    I′1(Li)    if ∃i < n. L = L′i
    I′1(L)     otherwise        (A.44)

With this definition, I′2 ≤ I2 and for every i < n, I′1(Li) = I′2(L′i). We now prove that I′1 ≤ I′2 and that for every i < n either I′1(Li) = I′2(L′i) or L′i is exclusive in I′1. We already know that for every loop level L that is not one of the L′i, I′1(L) = I′2(L). We now consider i < n and prove that I′1(L′i) ≤ I′2(L′i). For that, we distinguish two cases depending on whether L′i is exclusive in I1.

• If L′i is exclusive in I1, then L′i is also exclusive in I′1 since I′1 ≤ I1. Moreover, two dimensions with point-to-point communication are either merged or sequentially ordered. We thus have the following cases.

  – If L′i = Li, then I′1(L′i) = I′1(Li) = I′2(L′i).
  – If L′i is merged with Li, then I1(Li) = 0 since L′i is exclusive, thus I′2(L′i) = I′1(Li) = 0 and I′1(L′i) ≤ I1(L′i) = 0. Thus I′1(L′i) = I′2(L′i).
  – Otherwise, L′i is after Li. Thus EL′i ⊀ p1 and I′1(L′i) = 0 according to the induction hypothesis. Thus I′1(L′i) ≤ I′2(L′i).

• Otherwise, I1(Li) = I2(L′i) and the induction hypothesis ensures that I1(L′i) = 0. But I′1 ≤ I1 and I′2 ≤ I2, thus I′1(L′i) = 0 = I′2(L′i), which proves both properties.

This proves that there is an edge from (p1, I′1) to (p2, I′2) of weight w1 = inst_latency(p1) in Gc. Moreover, the index only increases along the loop levels (L′i)i<n for which I′2(L′i) > 0. Finally, p2 is nested inside each d′i, thus, for every i < n, EL′i ⪯ p2. Thus, the path P′ extended with (p2, I′2) is a solution.

Fourth Case: Loop Iteration. In that case, there exists L ∈ L exclusive in I1 such that p1 = XL, p2 = EL, w1 = 0 and there exists a set of loop levels LIN such that:

∀L′ ∈ L. I2(L′) =
    I1(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I1(L′)        otherwise        (A.45)

Let I′2 ∈ I be defined as follows.

∀L′ ∈ L. I′2(L′) =
    I′1(L′) + 1    if L′ = L
    0              if L′ ∈ LIN
    I′1(L′)        otherwise        (A.46)

Then, there exists an edge from (p1, I′1) to (p2, I′2) of weight 0 in Gc. Moreover, the index only increases along L, which is exclusive and verifies EL ⪯ p2, and I′2 ≤ I2. Thus, the path P′ extended with (p2, I′2) is a solution.

Fifth Case: Loop-Carried Data Dependencies. In that case, p1 = p2 ∈ I is an instruction that reads the value it produces at the previous iteration of a loop level L ∈ L, exclusive in I1. There exists a set of loop levels LIN such that:

∀L′ ∈ L. I2(L′) =
    I1(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I1(L′)        otherwise        (A.47)

Let I′2 ∈ I be defined as follows.

∀L′ ∈ L. I′2(L′) =
    I′1(L′) + 1    if L′ = L
    0              if L′ ∈ LIN
    I′1(L′)        otherwise        (A.48)

Then there is also an edge from (p1, I′1) to (p2, I′2) of same weight. Moreover, the index only increases along L, which is exclusive and verifies EL ⪯ p2, and I′2 ≤ I2. Thus, the path P′ extended with (p2, I′2) is a solution.

Sixth Case: Iterations Order. In that case, there exists L ∈ L such that p1 = p2 and I2 = I1[L/I1(L) + 1]. The path P′ is then already a solution.

A.4.2 Path Translation

Let P = (p0, I0) −w0→ ... −wn−1→ (pn, In) be a canonical path in Gc. We define two functions f : P → P and g : I → I such that for every i < n, there exists a path Pi of weight wi from (f(pi), g(Ii)) to (f(pi+1), g(Ii+1)) in Gc and such that for every execution point (p, I) along Pi:

1. p is a concrete program point, and

2. for every L ∈ L, I(L) > 0 =⇒ L ∈ H.

We first give the definition of f and g, and then prove that the path Pi exists for every i < n. This proves Proposition A.2: any canonical path P can be transformed into a path P′ of same weight that only visits concrete program points and such that for every execution point (p, I) visited by P′, I is only positive on loop levels of H.

Definition of f and g. Let (p, I) be an execution point visited by P. We define f(p) as follows.

• If p ∈ I, p = E∅ or p = X∅, then f(p) = p.

• If there exists L ∈ L such that p = EL and L ≠ ∅, let d ∈ L be the innermost dimension of L and Ld be the loop level that corresponds to the class of merged dimensions of d. Then f(p) = ELd.

• Similarly, if there exists L ∈ L such that p = XL and L ≠ ∅, then f(p) = XLd.

We define g(I) such that g(I)(L) = 0 for every loop level L ∉ H. For L ∈ H, there exists at most a single loop level L′ ∈ L such that I(L′) > 0 and L ∩ L′ ≠ ∅. Indeed, I(L′) > 0 implies that L′ is exclusive in I since P is canonical. If there is no such loop level L′, g(I)(L) = 0. Otherwise, let (Hi)i<n be the decomposition of L′ into classes of merged dimensions and let i0 be the position of L in this decomposition; g(I)(L) is then derived from I(L′) so that it corresponds to the index along the i0-th dimension of the loop nest represented by L′.
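The translation f can be sketched as follows; the representation of program points as tagged pairs, of loop levels as frozensets of dimension names, and the helpers innermost and merged_class are assumptions of this illustration and only mirror the case analysis above.

    # Hedged sketch of f: instructions and E_empty / X_empty are unchanged; the entry
    # (resp. exit) of a non-empty loop level goes to the entry (resp. exit) of the class
    # of merged dimensions of its innermost dimension. All helpers are assumed.
    def translate_point(p, innermost, merged_class):
        kind, payload = p
        if kind == "inst" or payload == frozenset():
            return p
        d = innermost[payload]              # innermost dimension of the loop level
        return (kind, merged_class[d])      # E_L -> E_{L_d}, X_L -> X_{L_d}

    # Example with two merged dimensions d0, d1 and the loop level {d0, d2}.
    merged = {"d0": frozenset({"d0", "d1"}), "d2": frozenset({"d2"})}
    inner = {frozenset({"d0", "d2"}): "d2"}
    assert translate_point(("E", frozenset({"d0", "d2"})), inner, merged) == ("E", frozenset({"d2"}))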

Existence of Pi. We now consider an edge (p1, I1) −w→ (p2, I2) in the path P. We show that there exists a path P′ from (f(p1), g(I1)) to (f(p2), g(I2)) of same weight in Gc, such that for any execution point (p, I) along P′:

1. p is a concrete program point, and

2. for every L ∈ L, I(L) > 0 =⇒ L ∈ H.

From the definition of f and g, these properties are already respected by (f(p1), g(I1)) and (f(p2), g(I2)). We thus only need to prove that the path exists and that its internal execution points respect the properties. For that, we consider the different cases of the definition of the edge from (p1, I1) to (p2, I2) in the definition of Gc. Each case below corresponds to the same case in Definition 5.6. The sixth case does not occur since P is canonical.

First Case: Program Order Constraints. In that case, I1 = I2, p1 ≺ p2 and w = 0. Then f(p1) ⪯ f(p2). To prove that, we consider different cases depending on the values of p1 and p2.

• If p1 = E∅ or p2 = X∅, the property is trivial. If p1, p2 ∈ I, then f(p1) = p1 ≺ p2 = f(p2) and the property is also trivial.

• If there exists L ∈ L such that p1 = EL and p2 ∈ I, let H ∈ H be the innermost class of merged dimensions represented in L. Then f(p1) = EH. Since EL ⪯ p2, p2 is either nested inside or after H. Thus f(p1) = EH ⪯ p2 = f(p2). Similarly, if p1 ∈ I and there exists L ∈ L such that p2 = XL, then f(p1) ⪯ f(p2).

• If p1 ∈ I and there exists L ∈ L such that p2 = EL, let H ∈ H be the innermost class of merged dimensions represented in L. Then f(p2) = EH and EL ⪯ EH. Thus f(p1) = p1 ≺ EL ⪯ EH = f(p2). Similarly, if there exists L ∈ L such that p1 = XL and p2 ∈ I, then f(p1) ⪯ f(p2).

• If there exist L, L′ ∈ L such that p1 = EL and p2 = EL′, then either L and L′ have the same innermost class of merged dimensions, in which case f(p1) = f(p2), or, since p1 ≺ p2, the innermost class H ∈ H of L is before or nested outside the innermost class H′ ∈ H of L′. Thus f(p1) = EH ≺ EH′ = f(p2). Similarly, if there exist L1, L2 ∈ L such that p1 = XL1 and p2 = XL2, then f(p1) ⪯ f(p2).

• If there exist L, L′ ∈ L such that p1 = XL and p2 = EL′, let H be the innermost class of merged dimensions of L and H′ the innermost class of merged dimensions of L′. Then f(p1) = XH ⪯ XL ≺ EL′ ⪯ EH′ = f(p2).

• If there exist L, L′ ∈ L such that p1 = EL and p2 = XL′, let H be the innermost class of merged dimensions of L and H′ the innermost class of merged dimensions of L′. Then either H = H′, H and H′ are nested in each other, or H is before H′. In all cases f(p1) = EH ≺ XH′ = f(p2).

Since f(p1) ⪯ f(p2) and g(I1) = g(I2), there exists a path of weight w = 0 (possibly empty) from (f(p1), g(I1)) to (f(p2), g(I2)) in Gc. Thus P′ exists and has no internal execution points.

Second Case: Loop Level Latency. In that case, there exists L ∈ L and i ∈ ⟦0, k⟦ a parallelism level that produces synchronization operations such that p1 = EL, p2 = XL, I1 ≤ I2 and w = latency(i, L). The statements nested in the innermost class of merged dimensions H ∈ H of L are the same as the statements nested in L. Thus latency(i, H) = latency(i, L). Moreover, I1 = I2 since P is canonical. Thus there is an edge from (f(p1), g(I1)) = (EH, g(I1)) to (XH, g(I2)) = (f(p2), g(I2)) of same weight in Gc. Thus P′ exists and has no internal execution points.

Third Case: Data Dependencies. In that case, p1, p2 ∈ I are instructions such that p2 reads the value that p1 produces, with point-to-point communication from dimensions (di)i<n to dimensions (d′i)i<n, contained in loop levels (Li)i<n and (L′i)i<n respectively, and w = inst_latency(p1). Since P is canonical, I2 is minimal:

∀L ∈ L. I2(L) =
    I1(Li)    if ∃i < n. L = L′i
    I1(L)     otherwise        (A.50)

For each i < n, let Hi, H′i ∈ H be the classes of merged dimensions of di and d′i respectively. Since g(I1) is only positive on classes of merged dimensions, H′i is exclusive in g(I1). We now show that g(I1) ≤ g(I2) and that ∀i < n. g(I1)(Hi) = g(I2)(H′i), thus proving that there exists an edge of weight w from (f(p1), g(I1)) to (f(p2), g(I2)) in Gc and that P′ exists with no internal execution points.

We first prove that g(I1) ≤ g(I2). Let L ∈ L. If L ∉ H, g(I1)(L) = 0 = g(I2)(L). Otherwise, L ∈ H. Let L′ ∈ L such that I1(L′) > 0 and L ∩ L′ ≠ ∅. If there is no such L′, then g(I1)(L) = 0 ≤ g(I2)(L). Otherwise, g(I1)(L) only depends on the value of I1(L′). We distinguish two cases to prove that g(I1)(L) ≤ g(I2)(L).

• If there exists i < n such that L′ = L′i, then g(I1)(L) = I1(L′) ≤ I2(L′) since L′i only contains representatives of a single class of merged dimensions. Moreover, I1(L′) > 0, thus I2(L′) > 0, thus L′ is exclusive in I2 (since P is canonical) and g(I2)(L) = I2(L′). Thus g(I1)(L) ≤ g(I2)(L).

• Otherwise, I2(L′) = I1(L′) > 0. But then L′ is exclusive in I2 (since P is canonical). Thus g(I2)(L) depends on I2(L′) in the same way as g(I1)(L) depends on I1(L′). Thus g(I1)(L) = g(I2)(L).

Let i < n. We now prove that g(I1)(Hi) = g(I2)(H′i). If g(I2)(H′i) = 0, then the property is trivial since g(I1) ≤ g(I2). Otherwise, there exists a unique loop level L ∈ L such that I2(L) > 0 and L ∩ H′i ≠ ∅. We distinguish two cases depending on the value of L.

• If L = L′i, then g(I2)(H′i) = I2(L′i) = I1(Li). Moreover, g(I2)(H′i) > 0, thus I1(Li) > 0 and Li is exclusive in I1. Thus g(I1)(Hi) = I1(Li) = I2(L′i) = g(I2)(H′i).

• Otherwise, L is also distinct from every other L′j with j < n and I1(L) = I2(L). Indeed, the dimensions (d′j)j<n are the only ones along which the index increases between I1 and I2. Moreover, since I1(L) = I2(L) > 0, L is exclusive in I1. According to the constraints on point-to-point communication, Hi and H′i are either merged or H′i is after Hi. But H′i cannot be after Hi, thus Hi = H′i. To prove that, we reason by contradiction and suppose H′i after Hi. Then H′i is also after p1. Thus EL ⊀ p1 since L contains a representative of H′i. Since P is canonical, we thus have I1(L) = 0, which is absurd since I1(L) = I2(L) > 0. We thus have Hi = H′i. Then g(I1)(Hi) and g(I2)(H′i) are both computed from I1(L) = I2(L) using the same formula. Thus g(I1)(Hi) = g(I2)(H′i).

Fourth Case: Loop Iteration. In that case, there exists L ∈ L exclusive in I1, such that p1 = XL, p2 = EL, w = 0 and there exists a set of loop levels LIN such that:

∀L′ ∈ L. I2(L′) =
    I1(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I1(L′)        otherwise        (A.51)

Let (Hi)i<n be the decomposition of L into classes of merged dimensions, ordered from the innermost to the outermost, and let i0 < n be the position of the class along which the iteration increments.

Then f(p1) = XH0 ⪯ XHi0 and EHi0 ⪯ f(p2) = EH0. Let L′IN be the union of the classes of merged dimensions represented inside LIN and of the classes of the decomposition of L that are nested inside Hi0.

L′IN := {Hi | i < i0} ∪ {H ∈ H | ∃L′ ∈ LIN. H ∩ L′ ≠ ∅}        (A.53)

Then g(I1) and g(I2) satisfy the following equation, which shows that there exists an edge from (XHi0, g(I1)) to (EHi0, g(I2)) of weight 0 in Gc.

∀L ∈ L. g(I2)(L) =
    0                if L ∈ L′IN
    g(I1)(L) + 1     if L = Hi0
    g(I1)(L)         otherwise        (A.54)

Thus Gc contains the path

Pi := (f(p1), g(I1)) −0→ (XHi0, g(I1)) −0→ (EHi0, g(I2)) −0→ (f(p2), g(I2))        (A.55)

which satisfies the conditions. Indeed, EHi0 and XHi0 are concrete program points, and g(I1) and g(I2) already satisfy the conditions by definition of g.

Fifth Case: Loop-Carried Data Dependencies. In that case, p1 = p2 ∈ I is an instruction that reads the value it produces at the previous iteration of a loop level L ∈ L, exclusive in I1. There exists a set of loop levels LIN such that:

∀L′ ∈ L. I2(L′) =
    I1(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I1(L′)        otherwise        (A.56)

Then, similarly to the previous case, there exists a path of same weight from (f(p1), g(I1)) to (f(p2), g(I2)) in Gc that satisfies the conditions.

A.4.3 Concrete Path Generation

Let P be a canonical path of weight w in Gc such that for every execution point (p, I) ∈ E that it visits, p is a concrete program point and ∀L ∈ L. I(L) > 0 =⇒ L ∈ H. We want to prove that there exists a path P′ of same weight that only visits concrete execution points in Gc. Specifically, we need to show that for every point (p, I) that P′ visits,

1. p is a concrete program point,

2. for every L ∈ L, I(L) > 0 =⇒ L ∈ H,

3. for every L ∈ L, p ≺ EL =⇒ I(L) = 0, and

4. for every L ∈ H, XL ≺ p =⇒ I(L) = size(L) − 1.

The first three points are already satisfied by P. We thus build a path P′ that increases the indexes of the execution points visited by P. For that, we reason by induction on P, starting from its tail. We build P′ such that if (p0, I0) is the first execution point of P and (p′0, I′0) the first execution point of P′, then p′0 = p0 and I0 ≤ I′0. For the initialization case, with a path P of length 0, p′0 = p0 and I′0 defined as follows satisfies the conditions.

∀L ∈ L. I′0(L) :=
    size(L) − 1    if L ∈ H ∧ XL ≺ p0
    I0(L)          otherwise        (A.57)
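The saturation of equation (A.57) can be sketched as follows; the dictionary representation of indexes and the helpers exits_before (for XL ≺ p0) and size are assumptions of this illustration.

    # Sketch of equation (A.57): loop levels of H whose exit precedes p0 are saturated
    # to size(L) - 1; every other entry of the index is kept unchanged.
    def saturate_index(index, p0, merged_classes, exits_before, size):
        return {L: (size(L) - 1 if L in merged_classes and exits_before(L, p0) else it)
                for L, it in index.items()}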

Induction Case

We now focus on the induction case, where P has a strictly positive length. Let (p0, I0) −w0→ (p1, I1) be the first edge of P. Following the induction hypothesis, we can suppose there exists a path P1 of weight w − w0 that starts at (p1, I′1) with I′1 ≥ I1 and only visits concrete execution points. We define I′0 ∈ I as follows.

∀L ∈ L. I′0(L) :=
    size(L) − 1    if XL ≺ p0 ∧ L ∈ H
    I0(L)          otherwise        (A.58)

Then (p0, I′0) is a concrete execution point and I′0 ≥ I0. We now show that there exists a path P0 from (p0, I′0) to (p1, I′1) of weight w0 that only visits concrete execution points. Since we already showed that (p0, I′0) and (p1, I′1) are concrete, we only need to prove it for the internal execution points of P0. We consider separately the different cases of the definition of the edge from (p0, I0) to (p1, I1) in the definition of Gc. Each case below corresponds to the same case in Definition 5.6. The sixth case does not occur since P is canonical.

1. In the first case, I0 = I1, p0 ≺ p1 and w0 = 0. Let (Hi)i<n be the classes of merged dimensions of H such that p0 ≺ XHi ≺ p1, ordered so that XHi ≺ XHi+1 for every i < n − 1. For each i < n, let (Ji^j)j be the intermediate indexes obtained from I′0 by increasing the index along Hi from I0(Hi) to size(Hi) − 1, once the indexes along H0, ..., Hi−1 have reached their maximal value.

Then for each i < n and j such that I0(Hi) ≤ j ≤ size(Hi) − 1, (XHi, Ji^j) is concrete and, if j < size(Hi) − 1, there is an edge of weight 0 from (XHi, Ji^j) to (XHi, Ji^(j+1)) in Gc. Moreover, for each i < n − 1, Ji^(size(Hi)−1) = Ji+1^(I0(Hi+1)) and XHi ≺ XHi+1. Thus, there is an edge of weight 0 from (XHi, Ji^(size(Hi)−1)) to (XHi+1, Ji+1^(I0(Hi+1))) in Gc. The path P0 that starts in (p0, I′0), ends in (p1, I′1) and visits all the (XHi, Ji^j) along the way thus satisfies the conditions.

2. In the second case, there exists L ∈ L and j ∈ ⟦0, k⟦ a parallelism level that produces synchronization operations such that p0 = EL, p1 = XL, I0 ≤ I1 and w0 = latency(j, L). Then, since I′0 ≤ I′1, there is an edge of same weight from (p0, I′0) to (p1, I′1) in Gc.

3. In the third case, p0, p1 ∈ I are instructions such that p1 reads the value that p0 produces, with point-to-point communication from dimensions (di)i<n to dimensions (d′i)i<n, and w0 = inst_latency(p0). The conditions relating I0 and I1 on the corresponding loop levels still hold after replacing I0 by I′0 and I1 by I′1, thus there is an edge of same weight from (p0, I′0) to (p1, I′1) in Gc.

4. In the fourth case, w0 = 0 and there exists L ∈ L exclusive in I0 such that p0 = XL and p1 = EL. There also exists a set of loop levels LIN ordered within L such that:

∀L′ ∈ L. I1(L′) =
    I0(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I0(L′)        otherwise        (A.60)

Let L′IN be the set of all loop levels ordered within L.

L′IN := {L′ ∈ L | EL ≺ EL′ ≺ XL′ ≺ XL}        (A.61)

Then the equation above is still satisfied when replacing I0 with I′0, I1 with I′1 and LIN with L′IN. Thus there exists an edge of weight w0 from (p0, I′0) to (p1, I′1) in Gc.

5. In the fifth case, p0 = p1 ∈ I, w0 = inst_latency(p0) and there exists L ∈ L exclusive in I0 such that p0 is an instruction that reads the value it produces at the previous iteration of L. There also exists a set of loop levels LIN such that:

∀L′ ∈ L. I1(L′) =
    I0(L′) + 1    if L′ = L
    0             if L′ ∈ LIN
    I0(L′)        otherwise        (A.62)

Let L′IN be the set of all loop levels ordered within L. Then the equation above is still satisfied when replacing I0 with I′0, I1 with I′1 and LIN with L′IN. Thus there is an edge of weight w0 from (p0, I′0) to (p1, I′1) in Gc.

A.5 Abstraction of the Execution Graph

This section proves Proposition 5.6. Let L′ be the set of loop levels the lower bound algorithm considers and P′ the set of program points restricted to only consider these loop levels. Let D be a set of dimensions and p1, p2 ∈ P′ such that there exists a path from p1 to p2 of weight w in Gc^D. Let I1, I2 ∈ I such that I1 ≤ I2 and the following conditions are respected for every loop level L ∈ L.

∀i ∈ {1, 2}. L ⊆ D ∧ pi ⊀ EL =⇒ Ii(L) = sizemin(L) − 1        (A.63)
∀i ∈ {1, 2}. L ∩ D ≠ ∅ ∧ pi ≺ EL =⇒ Ii(L) = 0        (A.64)

L ∩ D = ∅ =⇒ I1(L) = I2(L) (A.65)

We want to prove that there exists a path from (p1, I1) to (p2, I2) of weight w in Gc. For that, we reason by induction and suppose Proposition 5.6 is respected for every set of dimensions D′ ⊂ D. We first show that proving Proposition 5.6 in the case where the path from p1 to p2 contains a single edge is enough to prove it for any path. We then prove the property for edges by considering each case of how an edge in Gc^D can be defined.

Restriction of the Problem to Edges. Let us suppose for a moment that Proposition 5.6 is proved for the case where the path from (p1, I1) to (p2, I2) contains a single edge. We now consider a path P = p0 −w0→ ... −wn−1→ pn of weight w in Gc^D and two indexes I0, In that respect the conditions of Proposition 5.6. For each 0 < i < n, we define Ii ∈ I as follows.

∀L ∈ L. Ii(L) =
    sizemin(L) − 1    if L ⊆ D ∧ EL ⪯ pi
    I0(L)             otherwise        (A.66)

For every i < n, the path pi −wi→ pi+1 and the indexes Ii and Ii+1 respect the conditions of Proposition 5.6. Thus, since we assumed the property proved for paths containing a single edge, there exists a path of weight wi from (pi, Ii) to (pi+1, Ii+1) in Gc. By concatenating the paths for each i < n, we obtain a path of weight w from (p0, I0) to (pn, In) in Gc. We now prove the proposition for the case where P contains a single edge.
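The intermediate indexes of equation (A.66) can be sketched as follows; the frozenset representation of loop levels and the helpers starts_before (for EL ⪯ pi) and size_min are assumptions of this illustration.

    # Sketch of equation (A.66): every loop level fully included in D that has already
    # started at p_i is saturated to size_min(L) - 1; other entries copy I_0.
    def intermediate_index(i0_index, p_i, dims_in_D, size_min, starts_before):
        return {L: (size_min(L) - 1 if L <= dims_in_D and starts_before(L, p_i) else it)
                for L, it in i0_index.items()}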

Proof for Edges. Let p1, p2 ∈ P′ such that Gc^D contains an edge of weight w from p1 to p2. Let I1, I2 ∈ I be two indexes that respect the conditions of Proposition 5.6. We prove that there exists a path of weight w from (p1, I1) to (p2, I2) in Gc. For that we consider the different cases of the definition of an edge in the definition of Gc^D. Section A.5.1 gives the proof for the first case, which considers the ordering constraints and the pressure of loop levels. Section A.5.2 gives the proof for the second case, which considers data dependencies. Finally, Section A.5.3 gives the proof for the third case, which considers paths iterating along loops. We omit the proof for the fourth case, which considers loop-carried dependencies, as it is similar to the third case.

A.5.1 First Case: Edges Along a Fixed Index

In the first case, either p1 ⪯ p2 and w = 0, or there exists L ∈ L and i ∈ ⟦0, k⟦ such that p1 = EL, p2 = XL and w = latency(i, L).

• In the first sub-case, p1 ⪯ p2. Thus there exists an edge of weight w = 0 from (p1, I1) to (p2, I1) in Gc, according to item 1 of its definition.

• In the second sub-case, there is also an edge of weight w = latency(i, L) from (p1, I1) to (p2, I1) in Gc, according to item 2 of its definition.

Thus, in both cases, there exists a path from (p1, I1) to (p2, I1) of weight w. According to the hypotheses of Proposition 5.6, I1 ≤ I2. Thus there exists a path from (p2, I1) to (p2, I2) of weight 0 in Gc, that follows the edges defined in case 6 of the definition of Gc. By concatenating the path from (p1, I1) to (p2, I1) and the path from (p2, I1) to (p2, I2), we obtain a path of weight w from (p1, I1) to (p2, I2) in Gc.

A.5.2 Second Case: Data Dependencies

In the second case, p1 and p2 are instructions such that p2 consumes the value that p1 produces, w = inst_latency(p1), and the dimensions (di)i<n and (d′i)i<n on which the point-to-point communication occurs are included in D. Let I′1 ∈ I be the index that coincides with I1 on every loop level except the {d′i}, on which I′1({d′i}) := I1({di}).

Relative Order of I1, I′1 and I2. Let L ∈ L. If there exists i < n such that L = {d′i}, then I2(L) = sizemin({d′i}) − 1. Indeed, p2 is nested inside d′i, thus Ed′i ≺ p2. For the same reason, I1({di}) = sizemin({di}) − 1. Moreover, constraints force dimensions on which point-to-point communication occurs to have the same size. Thus, sizemin({di}) = sizemin({d′i}). Thus,

I1({d′i}) ≤ sizemin({d′i}) − 1 = sizemin({di}) − 1 = I1({di}) = I′1({d′i})        (A.68)
                                                               = I2({d′i})        (A.69)

Thus I1(L) ≤ I′1(L) = I2(L). Otherwise, I1(L) = I′1(L) ≤ I2(L). Thus, in both cases, I1 ≤ I′1 ≤ I2.

Edge From (p1, I′1) To (p2, I′1). I′1 respects the conditions that I′1 ≤ I′1 and that, for every i < n, I′1({d′i}) = I1({di}) = I′1({di}). Thus, according to the third case of the definition of the edges in the definition of Gc, there exists an edge of weight w from (p1, I′1) to (p2, I′1). Prefixing it with a path of weight 0 from (p1, I1) to (p1, I′1) and following it with a path of weight 0 from (p2, I′1) to (p2, I2), both made of edges from the sixth case of the definition of Gc, yields a path of weight w from (p1, I1) to (p2, I2) in Gc.

A.5.3 Third Case: Iteration Along a Loop

In the third case, there exists L ∈ L′ such that L ∪ DL ⊆ D, L only contains sequential dimensions, ∀L′ ∈ L. L ∩ L′ ≠ ∅ =⇒ p1 ≺ EL′, and p2 = EL. There also exists a path from p1 to XL of weight w1 and a path from EL to XL of weight w2 in Gc^DL such that w = w1 + (sizemin(L) − 2)·w2. For every j < sizemin(L), let I2^j, I3^j ∈ I be defined as follows for every loop level L′ ∈ L.

I2^j(L′) :=
    j                    if L′ = L
    sizemin(L′) − 1      if L′ ⊆ DL ∧ EL ⊀ EL′
    0                    otherwise, if L′ ∩ D ≠ ∅ ∧ p1 ≺ EL′
    I1(L′)               otherwise        (A.70)

I3^j(L′) :=
    j                    if L′ = L
    sizemin(L′) − 1      if L′ ⊆ DL ∧ XL ⊀ EL′
    0                    otherwise, if L′ ∩ D ≠ ∅ ∧ p1 ≺ EL′
    I1(L′)               otherwise        (A.71)

We prove that:

1. there exists a path of weight w1 from (p1, I1) to (XL, I3^0),

2. for every j < sizemin(L) − 1, there exists a path of weight 0 from (XL, I3^j) to (EL, I2^(j+1)) in Gc,

3. for every j < sizemin(L), there exists a path of weight w2 from (EL, I2^j) to (XL, I3^j) in Gc, and

4. for every j < sizemin(L), there exists a path of weight 0 from (EL, I2^j) to (p2, I2) in Gc.

By concatenating the first path, then the second for j = 0, then the third and the second for every j with 0 < j < sizemin(L) − 1, then the fourth for j = sizemin(L) − 1, we create a path of weight w = w1 + (sizemin(L) − 2)·w2 from (p1, I1) to (p2, I2) in Gc, thus proving the property.
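The weight bookkeeping of this concatenation can be sketched as follows; the sub-path labels are purely illustrative and only check that the total weight is w1 + (sizemin(L) − 2)·w2.

    # Illustrative sketch of the concatenation above: a prefix of weight w1 reaching
    # (X_L, I_3^0), then alternating back edges (weight 0) and loop-body traversals
    # (weight w2), and finally the weight-0 suffix to (p2, I2).
    def assemble_path(w1, w2, size_min):
        path = [("prefix to X_L, j=0", w1), ("back edge, j=0", 0)]
        for j in range(1, size_min - 1):
            path += [("loop body, j=%d" % j, w2), ("back edge, j=%d" % j, 0)]
        path.append(("suffix to p2, j=%d" % (size_min - 1), 0))
        assert sum(weight for _, weight in path) == w1 + (size_min - 2) * w2
        return path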

Existence of the Paths. We now give the proof that each path exists. The first and the third paths come from the induction hypothesis. The second comes from the definition of edges in Gc. The last follows increasing indexes for a fixed program point. The following lemma helps applying the induction hypothesis to the first and the third paths. The proof of this lemma derives directly from the definition of I2^j and I3^j and the constraints on I1; it is thus omitted here.

Lemma A.3. Let (p, I) ∈ E be an execution point equal to (p1, I1), (EL, I2^j) or (XL, I3^j) for some j < sizemin(L). Then (p, I) respects the first two conditions required to apply Proposition 5.6 to the graph Gc^DL. Specifically, for every loop level L′ ∈ L, (p, I) satisfies the following conditions.

L′ ⊆ DL ∧ p ⊀ EL′ =⇒ I(L′) = sizemin(L′) − 1        (A.72)
L′ ∩ DL ≠ ∅ ∧ p ≺ EL′ =⇒ I(L′) = 0        (A.73)

Existence of a Path from (p1, I1) to (XL, I3^0). We already know that there exists a path from p1 to XL of weight w1 in Gc^DL. We apply the induction hypothesis to obtain a path of same weight from (p1, I1) to (XL, I3^0). For that we need to prove that I1 and I3^0 respect the conditions (5.47–5.49) of Proposition 5.6. Conditions (5.47) and (5.48) are already provided by Lemma A.3. We thus prove that ∀L′ ∈ L. L′ ∩ DL = ∅ =⇒ I1(L′) = I3^0(L′).

Let L′ ∈ L such that L′ ∩ DL = ∅. We distinguish four cases.

• If L′ ⊆ DL, then L′ = ∅, thus I1(L′) = 0 = I3^0(L′).

• If L′ = L, then I3^0(L′) = 0. Moreover, L′ ⊆ D and p1 ≺ EL′, thus I1(L′) = 0. Thus I1(L′) = I3^0(L′).

• Otherwise, if L′ ∩ D ≠ ∅ and p1 ≺ EL′, then I1(L′) = 0 = I3^0(L′).

• Otherwise, by definition of I3^0, I1(L′) = I3^0(L′).

Existence of an Edge from (XL, I3^j) to (EL, I2^(j+1)). Let j < sizemin(L) − 1. We show that (XL, I3^j) and (EL, I2^(j+1)) satisfy the conditions of the fourth case of the edges definition, in the definition of Gc. For that, we define LIN ⊆ L as follows.

LIN := {L′ ∈ L | EL ≺ EL′ ∧ XL′ ≺ XL}        (A.74)

We then need to show that L is exclusive in I3^j and that I2^(j+1) respects the following definition.

∀L′ ∈ L. I2^(j+1)(L′) :=
    I3^j(L) + 1     if L′ = L
    0               if L′ ∈ LIN
    I3^j(L′)        otherwise        (A.75)

We first show that L is exclusive in I3^j. Let L′ ∈ L be a loop level that contains a dimension in common with L, or a dimension that can be merged with a dimension of L. Then p1 ≺ EL′, thus I1(L′) = 0. Moreover, L′ ⊈ DL, thus either I3^j(L′) = 0 or I3^j(L′) = I1(L′) = 0. Thus I3^j(L′) = 0. Thus L is exclusive in I3^j.

We then show that equation (A.75) is satisfied. Let L′ ∈ L. We distinguish several cases, depending on the case in (A.75) and in the definition of I3^j.

• If L′ = L, then I2^(j+1)(L′) = I3^j(L′) + 1. Thus the condition is satisfied.

• If L′ ∈ LIN, then EL ≺ EL′. Moreover, XL′ ≺ XL, thus L′ ⊆ DL. Since EL ≺ EL′, L′ ≠ ∅, thus L′ ∩ DL ≠ ∅. Thus I2^(j+1)(L′) = 0 and the condition is satisfied.

• If L′ ⊆ DL and EL ≺ EL′, then L′ ∈ LIN since L′ ⊆ DL implies XL′ ⪯ XL. Thus this case is already covered.

• If L′ ⊆ DL and EL ⊀ EL′, then XL ⊀ EL′ since EL ≺ XL and ≺ is transitive. Thus I3^j(L′) = sizemin(L′) − 1 = I2^(j+1)(L′).

• Otherwise, if L′ ∩ D ≠ ∅ ∧ p1 ≺ EL′, then I2^(j+1)(L′) = 0 = I3^j(L′).

• Otherwise, I2^(j+1)(L′) = I1(L′) = I3^j(L′).

Existence of a Path from (EL, I2^j) to (XL, I3^j). Let j < sizemin(L). We already know that there exists a path from EL to XL of weight w2 in Gc^DL. We apply the induction hypothesis to obtain a path of same weight from (EL, I2^j) to (XL, I3^j). For that we need to prove that I2^j and I3^j respect the conditions (5.47–5.49) of Proposition 5.6. Conditions (5.47) and (5.48) are already provided by Lemma A.3. We thus prove that ∀L′ ∈ L. L′ ∩ DL = ∅ =⇒ I2^j(L′) = I3^j(L′).

Let L′ ∈ L such that L′ ∩ DL = ∅. We distinguish four cases.

• If L′ ⊆ DL, then L′ = ∅, thus I2^j(L′) = 0 = I3^j(L′).

• If L′ = L, then I2^j(L′) = j = I3^j(L′).

• Otherwise, if L′ ∩ D ≠ ∅ and p1 ≺ EL′, then I2^j(L′) = 0 = I3^j(L′).

• Otherwise, I2^j(L′) = I1(L′) = I3^j(L′).

Existence of a Path from (EL, I2^j) to (p2, I2). Let j < sizemin(L). We prove that for every loop level L′ ∈ L, I2^j(L′) ≤ I2(L′). This proves that there exists a path of weight 0 from (p2, I2^j) = (EL, I2^j) to (p2, I2) in Gc, that follows the edges defined in item 6 of the definition of Gc. Let L′ ∈ L. We distinguish four cases.

• If L′ = L, then EL′ ⪯ EL = p2, thus I2(L′) = sizemin(L′) − 1 ≥ I2^j(L′).

• If L′ ⊆ DL ∧ EL ⊀ EL′, then L′ ⊆ D, thus I2(L′) = sizemin(L′) − 1 ≥ I2^j(L′).

• Otherwise, if L′ ∩ D ≠ ∅ ∧ p1 ≺ EL′, then I2^j(L′) = 0 ≤ I2(L′).

• Otherwise, I2^j(L′) = I1(L′) ≤ I2(L′).

RÉSUMÉ Compilers that aim to improve the efficiency of programs must determine which optimizations will be the most beneficial. This problem is complex, especially during the early steps of compilation, where each decision influences the choices available at subsequent steps. We propose to represent compilation as the progressive refinement of a partially specified implementation. All possible decisions are known upfront and commute. This makes it possible to take the most important decisions first and to build a performance model able to anticipate potential optimizations. We apply this approach to generate linear algebra code targeting GPUs and obtain performance comparable to hand-tuned libraries.

MOTS CLÉS Compilation, Code Optimization, Performance Model, GPU, Constraint Programming

ABSTRACT Compilers looking for an efficient implementation of a function must find which optimizations are the most beneficial. This is a complex problem, especially in the early steps of the compilation process. Each decision may impact the transformations available in subsequent steps. We propose to represent the compilation process as the progressive refinement of a partially specified implementation. All potential decisions are exposed upfront and commute. This allows for making the most discriminative decisions first and for building a performance model aware of which optimizations may be applied in subsequent steps. We apply this approach to the generation of efficient GPU code for linear algebra and yield performance competitive with hand-tuned libraries.

KEYWORDS Compilation, Code Optimization, Performance Model, GPU, Constraint Programming