Autotuning for Automatic Parallelization on Heterogeneous Systems

For the attainment of the academic degree of

Doctor of Engineering (Doktor der Ingenieurwissenschaften)

approved by the KIT Department of Informatics
of the Karlsruhe Institute of Technology (KIT)

Dissertation

by

Philip Pfaffe


Date of the oral examination: July 24, 2019

First reviewer: Prof. Dr. Walter F. Tichy

Second reviewer: Prof. Dr. Michael Philippsen

Abstract

To meet the surging demand for high-speed computation in an era of stagnating increase in performance per processor, systems designers resort to aggregating many and even heterogeneous processors into single systems. Automatic parallelization tools relieve application developers of the tedious and error-prone task of programming these heterogeneous systems. For these tools, there are two aspects to maximizing performance: Optimizing the execution on each parallel platform individually, and executing work on the available platforms cooperatively. To date, various approaches exist targeting either aspect. Automatic parallelization for simultaneous cooperative computation with optimized per-platform execution, however, remains an unsolved problem.

This thesis presents the APHES framework to close that gap. The framework combines automatic parallelization with a novel technique for input-sensitive online autotuning. Its first component, a parallelizing polyhedral compiler, transforms implicitly data-parallel program parts for multiple platforms. Targeted platforms then automatically cooperate to process the work. During compilation, the code is instrumented to interact with libtuning, our new autotuner and second component of the framework. Tuning the work distribution and per-platform execution maximizes overall performance. The autotuner enables always-on autotuning through a novel hybrid tuning method, combining a new efficient search technique and model-based prediction.

Experiments show that the APHES framework can solve the cooperative heterogeneous parallelization problem and that cooperative execution outperforms versions parallelized for a single platform. On benchmarks from the PolyBench benchmark suite, the APHES-transformed programs achieve a speedup of up to 6× compared to program versions generated by state-of-the-art single-platform parallelizers. The libtuning autotuner reduces the search time by up to 30% compared to state-of-the-art autotuning while still finding competitive configurations. Additionally, model-based prediction is able to reduce the search overhead by 99%.


I truthfully declare that I have written the dissertation entitled "Autotuning for Automatic Parallelization on Heterogeneous Systems" independently, that I have used no sources or aids other than those indicated, that I have marked all passages taken verbatim or in content from other sources as such, and that I have observed the rules for safeguarding good scientific practice at the Karlsruhe Institute of Technology (KIT).

Philip Pfaffe, Unterschleißheim, May 22, 2020


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Thesis Objectives
  1.3 Scope
  1.4 Structure of this Dissertation

2 Fundamental Concepts
  2.1 Online and Offline Autotuning
    2.1.1 The Tuning Problem
    2.1.2 Search Algorithms
    2.1.3 Reinforcement Learning
  2.2 Program Analysis and Transformation
    2.2.1 The LLVM Framework
    2.2.2 Parallelism Detection, Dependence Graphs, and Interprocedural Analyses
    2.2.3 The Polyhedral Model

3 An Overview of the APHES Framework
  3.1 The libtuning Autotuner
  3.2 Autotuning with libtuning
  3.3 Autotuning and Automatic Parallelization with APHES
  3.4 Summary

4 Related Work
  4.1 Autotuning
    4.1.1 Tuning Algorithms and Autotuners
    4.1.2 Machine and Performance Models
  4.2 Automatic Parallelization for Accelerators
    4.2.1 Dependence-based Parallelization
    4.2.2 Parallelization of (Mostly) Affine Programs with the Polyhedral Model
    4.2.3 Explicitly Parallel DSLs
    4.2.4 Pattern-based Detection of Parallelism
  4.3 Autotuning Languages and Compilers
    4.3.1 Modeling and Machine Learning for Autotuning Compilers
    4.3.2 Search-based Tuning Compilers
    4.3.3 Languages
  4.4 Summary

5 Hybrid Online Autotuning
  5.1 Hybrid Tuning: Combining Search and Prediction
    5.1.1 Context Sensitivity in Online Autotuning
    5.1.2 Approximating and Observing the Context
    5.1.3 Adapting to Context Changes
  5.2 The libtuning Architecture
    5.2.1 Tuning Parameters
    5.2.2 Indicators
    5.2.3 Fundamental Search Algorithms
  5.3 Hierarchical Search
  5.4 Model-based Prediction
  5.5 Summary

6 Automatic Heterogeneous Parallelization
  6.1 Polyhedral Parallelization and Partitioning
    6.1.1 Mapping and Partitioning the Schedule Tree
    6.1.2 Code Generation
  6.2 Dependence Testing for Parallelization and Partitioning
    6.2.1 Detecting Data Parallelism
    6.2.2 Target Offloading
    6.2.3 Runtime System
  6.3 Offloading Heuristics and Runtime Predictors
  6.4 Summary

7 Experimental Evaluation
  7.1 Benchmarks
  7.2 Ad-hoc Parallelization
  7.3 Polyhedral Parallelization
    7.3.1 Performance Results
    7.3.2 Detailed Analysis of Hierarchical Search
    7.3.3 Discussion
  7.4 Hybrid Autotuning
    7.4.1 Benchmark Selection
    7.4.2 Experiment Setup
    7.4.3 Results
    7.4.4 Discussion
  7.5 Summary

8 Conclusion and Outlook
  8.1 Thesis Objectives
  8.2 Outlook

List of Figures

List of Tables

Bibliography


Chapter 1

Introduction

The era of growing processor speed is over. Moore's law, which promised a doubling in performance every 18 to 24 months, has ended. This is a fact that must be accepted. But it is not an indicator of doom or deterioration of the industry! Instead, it turns out to be an immense opportunity for system architects. Reaching the end of the line in the development of ever more powerful CPUs has ignited the drive to develop the architectures of the future. Today, there are dozens of emerging parallel architectures, often optimized for specific tasks. A recent IEEE Spectrum article¹ quotes David Patterson estimating that there are "at least 45 hardware startups tackling the problem", making our time a "golden time for computer architecture". And these are not dreams of the future: Today's systems are highly parallel already, containing multiple general and special purpose parallel processors. This trend is not limited to the performance-hungry field of high-performance computing, but has reached low- to high-end consumer computers, down to the smartphone in everyone's pocket: Nearly every device is packed with multiple CPUs, GPUs for both graphics and computation acceleration, and specialized parallel compute units for various specific tasks.

The hardware landscape available to us within a single system has changed, and in the wake of this change an enormous challenge for the field of software engineering has arisen. Effectively programming these highly parallel, heterogeneous systems is an intricate and time-consuming task even for highly trained and seasoned developers. Fortunately, various tools are available to ease the design, evaluation, and verification of parallel software. The list includes profilers, debuggers and race detectors, recommender systems for parallelizable program parts, and automatic parallelizers.

¹ Source: https://spectrum.ieee.org/view-from-the-valley/computing/hardware/david-patterson-says-its-time-for-new-computer-architectures-and-software-languages, September 2018. Last accessed: December 29th, 2018.

Fully automatic parallelization for heterogeneous systems is naturally the ultimate goal, because it reduces the required developer effort to zero. To date, automatic parallelization approaches are mostly limited to single-platform targets, e.g., GPUs or CPUs. This is likely because targeting multiple platforms simultaneously is a notoriously hard problem. Simultaneously refers here to executing a program on different platforms cooperatively for distinct parts of the input, and subsequently fusing the results. We refer to this as cooperative heterogeneous parallelization. What makes this problem hard is the need to find an optimal configuration for the distribution of work among platforms and for platform-specific parameters at the same time. Examples of platform-specific parameters are the number of CPU or GPU threads and the mapping of program variables onto the GPU's memory hierarchy. The work distribution and the platform parameters are interdependent. Furthermore, the quality of a configuration for all the parameters is application-, hardware-, and most importantly input-dependent. For example, consider a program performing matrix multiplications. For matrices with 4000 × 4000 elements, a GPU can offer a substantial acceleration. For 10 × 10 elements, however, the overhead we incur for GPU execution is sure to destroy all benefits and drastically slow down the program. While it is sensible to assume that performance-relevant application and hardware properties remain static during the lifetime of a program, the inputs must be assumed to be highly dynamic. For simplicity, we thus focus on input dependence in the following; all arguments can easily be extended to varying performance-relevant application and hardware properties by simply considering them a part of the input.

Given that inputs must be considered dynamic, configurations must consequently be selected at application runtime: Configuring a-priori could not account for varying inputs. This means, however, that the time required for selecting a configuration must be included in the cost of the parallelization. For example, if an optimal heterogeneous distribution for a parallelized program accelerates its execution by a factor of two, but determining that distribution takes as long as executing the program, we have obviously not gained any benefit. The time required to choose an optimal configuration thus determines the performance of a configuring method.


Fundamentally, for a given application, system, and input, there are two basic methods to produce an optimal configuration. These methods exhibit distinct performance characteristics. Obviously, the fastest way to produce a configuration is using an efficiently computable function that maps inputs to parameter values, which can be as simple as a precomputed lookup table. We call this method predictive, because it predicts a configuration for a given input based on a-priori knowledge. Knowledge can either be hand-coded, for example in the form of heuristics designed by the application developer, or learned, for example through machine learning.

Alternatively, configurations can be explored empirically without a-priori knowledge. Empirical exploration means iteratively picking and observing configurations, which adds an interesting dimension to the performance criterion: On top of the time required to determine and apply the update to the parameters, we pay for configurations that are suboptimal. These might even yield program runtimes worse than in the original, unoptimized application. The extra overhead we incur from these configurations must be considered in the evaluation of the configuring method. In other words, we must consider the cumulative time spent sampling the program in relation to running the original program repeatedly. Assume for example the program is run ten times for a given input and empirical exploration determines a configuration which outperforms the original within ten steps. If, however, the exploration process evaluated configurations with worse performance than the original along the way, so that the sum of runtimes over the ten iterations is higher than ten times the original runtime, then it has effectively reduced the performance of the program. Consequently, when optimizing runtime, empirical exploration must not only find a configuration better than the original, it must also amortize quickly enough to provide a net benefit. The accumulated time spent sampling must be lower than the accumulated time for running the original program for the same number of sampling iterations. We call the time required to first reach that point the amortization time.
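Stated as a simple condition (using ad-hoc notation for illustration only): if a single run of the original program takes time $t_{\mathrm{orig}}$ and the configurations sampled by empirical exploration yield runtimes $t_1, \ldots, t_n$, then exploration has amortized after the smallest number of sampling iterations $n$ for which

\[ \sum_{i=1}^{n} t_i < n \cdot t_{\mathrm{orig}} \]

holds; the amortization time is the time elapsed until this point is first reached.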

1.1 Problem Statement

The current state of the art in cooperative heterogeneous parallelization falls short in at least one of two crucial aspects:

1. Parallelization is usually limited to a given platform: Parallelizable code regions of a program are identified and then transformed to run on multiple CPU threads, or (multiple) GPUs, or accelerators such as Intel's Xeon Phi cards. Although the heterogeneous system offers multiple different parallel platforms, the parallelized code executes only on a single platform while others are idle: No work is cooperatively shared across platforms. Only two approaches exist today that produce cooperatively executing code, but they are limited: Only the work distribution is optimized, not the per-platform parameters.

2. Since parallelization is done at compile time, all decisions need to be made heuristically. Heuristics are suboptimal in general: Multiple compilation decisions during parallelization are input dependent. Selecting platforms, determining the share of each platform in the cooperative execution, and choosing values for platform-specific parameters are all key decisions to attain optimal performance. Some existing approaches solve this problem in similar use-cases with the help of models that are either hand-crafted by the compiler or application developer, or are learned from sample inputs. Although effective, the model design and training still require developer time and effort.

Fully automatic parallelization for optimized cooperative execution on heterogeneous systems that requires zero developer involvement is currently still an open problem. This dissertation aims to close that gap.

Based on these observations, we now formulate the requirements for a solution to the cooperative heterogeneous parallelization problem. Generating code for multiple platforms and for adaptive distribution of work across platforms requires a specialized compiler. The compiler may process both implicitly and explicitly parallel languages and use any means to extract parallelism from sequential applications. In both cases, however, it must be able to accurately determine all dynamic memory accesses made by the individual parallel units. If the heterogeneous platforms form a system with distributed memories (as is the case in most systems containing both CPUs and GPUs), data needs to be moved between the individual memories. Hazards arise if data movements are performed based on approximations of the accessed memory ranges.

Because the optimality of a configuration depends on inputs, a runtime component is necessary to supplement the parallelizing compiler, and to configure the transformed application. Parallelizing general applications implies several additional requirements: No assumptions can be made about program inputs or other application and hardware properties.



Figure 1.1: The APHES framework transforms sequential applications for work sharing among multiple platforms, e.g., a GPU and a CPU. The transformed application is instrumented to interact with a runtime tuning component.

This implies that both predictive and empirical configuring must learn from the real hardware and real inputs in the actual deployment context. Learning from real inputs further means learning at application runtime, which requires minimizing the amortization time of the tuning process. Minimizing the amortization time in turn requires avoiding search if possible. However, because the deployment context is unknown a-priori, training the prediction function still needs to obtain training samples during a production run. Choosing these samples randomly or from predictions of an early, imprecise model bears the danger of increasing the amortization time. If search is used, optimizing the amortization time of the search is an additional requirement. This optimization means reducing the overall number of samples required to reach the (local) optimum and reducing the number of configurations sampled with a runtime worse than the original program's runtime. In summary, it is evident that whichever configuring method is used, great care must be taken to balance the drawbacks of either method: To be competitive, both must aim to sample as few poorly performing configurations as possible. For prediction, this means selecting training samples carefully. Search is required to converge as quickly as possible.

1.2 Thesis Objectives

The overarching goal of this thesis is to provide a framework for automatic parallelization for heterogeneous platforms.

The framework shall transform programs to simultaneously offload computations to multiple parallel target platforms. We will define the APHES framework, implementing a compiler for heterogeneous parallelization and a runtime component satisfying the requirements outlined above. The envisioned usage of the framework requires no application developer or user interaction and is depicted in Figure 1.1: The framework's compiler component consumes sequential applications and transforms them for work sharing among multiple heterogeneous platforms. The applications are instrumented by the compiler with the framework's runtime component, which optimizes the transformed applications at runtime in the actual deployment context.

The core of the runtime component is a novel online autotuner. Its primary goal is to provide input-sensitive tuning while simultaneously being attentive to amortization time. Our key idea to achieve this is to combine predictive and efficient empirical configuring into a hybrid tuning method that aims to offer the best of both worlds. We use the APHES framework to examine the following research theses:

Thesis T1 Optimized cooperative parallelization accelerates programs, exceeding the performance of single-platform parallelization.

Thesis T2 The amortization time can be improved in comparison with state-of-the-art autotuning.

Thesis T3 Predictive configuring achieves input sensitivity without sacrificing amortization time.

To evaluate these theses experimentally, we implemented a prototype of the APHES framework. Using established benchmarks composed of scientific programs, we test the theses by investigating the impact of autotuning on the result of the heterogeneous parallelization.

1.3 Scope

The APHES framework is an architecture for automatic parallelization of sequential programs. As inputs, it takes programs in the LLVM intermediate language [LA04]. In order to simplify the required analyses, already parallelized programs are excluded from the scope. As the intermediate representation of compilers, the LLVM intermediate language constitutes a much lower level description of the program than the source language it was written in.

As such, some information that is present in the high-level frontend language is lost and needs to be recovered. One example of lost information is the structure of arrays, which is important for analyzing memory accesses into the array. We accept these limitations to enable us to build upon a mature existing infrastructure for program analysis and transformation. At the time of writing, parallelism is opaque within the LLVM intermediate language and only expressed as function calls into parallel libraries such as the OpenMP or pthreads runtime libraries. There is, however, work underway within the LLVM project to express parallelism at the language level, at which point the APHES framework can easily be extended to incorporate this information into its analysis stage.

We explore fully automatic cooperative parallelization of sequential programs. We restrict the code transformations applied by the framework's compiler component to loop structures such as while and for constructs. Loops must further exhibit static control. This means that the number of loop iterations at runtime must be loop invariant, i.e., either constant or determined by a variable that is not modified within the loop body. APHES is targeted primarily at numerical programs, which exhibit static control. The parallelism classes detected by APHES are data-parallel loops and parallel reductions, which are the predominant patterns found in scientific programs. The heterogeneous parallel platforms APHES is intended to work with are CUDA for the GPU and OpenMP for the CPU.

The primary scientific contribution of this thesis is the hybrid autotuning approach and its application to cooperative parallelization. It is outside the scope of this thesis to invent a new method of extracting parallelism from a program: Decades of research have produced a catalogue of program analysis and parallelization techniques that are adequate to find parallelism and transform it for cooperative execution. The APHES framework reuses existing algorithms and implementations wherever possible.
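To illustrate the static-control restriction described in this section, consider the following sketch (an illustrative example, not taken from the thesis' benchmarks): the first loop has a loop-invariant trip count and is therefore within the scope of APHES, while the second is not.

    // In scope: the loop bound N is not modified inside the loop body,
    // so the number of iterations is loop invariant.
    void scaleVector(float *A, int N, float S) {
      for (int i = 0; i < N; ++i)
        A[i] *= S;
    }

    // Out of scope: the loop bound is modified within the loop body,
    // so the number of iterations is not loop invariant.
    void truncateAbove(float *A, int &N, float Limit) {
      for (int i = 0; i < N; ++i)
        if (A[i] > Limit)
          N = i; // modifies the bound inside the loop
    }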

1.4 Structure of this Dissertation

In the following chapter we discuss the technical foundations behind the design and implementation of the APHES framework and introduce terms and definitions. Subsequently, we provide an overview of the APHES framework and its components, and show how application developers may use the tools we developed.

In Chapter 4 we explore related prior art and assess existing solutions with respect to requirements that arise in cooperative parallelization. Then, we discuss the APHES framework in detail, focusing first on the autotuning component libtuning in Chapter 5 and the aphes compiler in Chapter 6. Chapter 7 presents an experimental evaluation of our approach with regard to the theses posed in Section 1.2 on the basis of our implementation of the APHES framework. In Chapter 8 we summarize the results, re-examine our theses with respect to the findings of our evaluation, and discuss possible future directions in research motivated by our findings.

Chapter 2

Fundamental Concepts

This chapter introduces the fundamental concepts upon which this thesis builds. In particular, we present the concepts of autotuning and automatic parallelization. We begin by defining the autotuning process and present state-of-the-art algorithms and techniques that form the basis for the runtime component of the APHES framework in Section 2.1. Second, we review compiler techniques and tools in Section 2.2. This includes an introduction to analyzing programs for parallelizable parts.

2.1 Online and Offline Autotuning

In today’s research and practice the task of configuring program parameters is referred to as autotuning. In this section we first define the terms and concepts used in autotuning. Subsequently, we introduce two popular search algorithms for model- and derivative-free empirical tuning. We lastly provide an introduction to reinforcement learning, a special form of predictive tuning which forms the basis of the online learning mechanism used in our hybrid autotuner.

2.1.1 The Tuning Problem

Autotuning has been around since the 1990s, and was popularized by the ATLAS library [WD98]. The goal of autotuning is to optimize degrees of freedom in a program for some target function. Usually that target function is program runtime, but alternatives exist.


vector<float> add(vector<float> A, vector<float> B,
                  unsigned NUM_THREADS) {
  vector<float> Result(A.size());

  #pragma omp parallel for num_threads(NUM_THREADS)
  for (size_t I = 0; I < Result.size(); ++I)
    Result[I] = A[I] + B[I];

  return Result;
}

Figure 2.1: Vector addition using OpenMP

In recent years, in particular, interest in optimizing the energy efficiency of programs has grown.¹ The degrees of freedom are called the tunable parameters of the program: program variables whose values do not influence the result of the program, but only the tuning objective, e.g., the runtime. Consider for example the OpenMP code snippet in Figure 2.1 implementing parallel vector addition. The code is simple and straightforward: Using the omp parallel for annotation, it executes the for loop in parallel with the requested number of threads, NUM_THREADS, adding vector elements. The thread count is a model example of a tunable parameter. Changing it does not change the semantics of the function, but clearly affects the performance. Without autotuning, this number is magic: The program developer either needs to guess it by rule of thumb, or needs to evaluate multiple values empirically and pick the best. Autotuning is a tool to automate this empirical process.

An autotuning process using a stylized Tuner tool for the example above might look as shown in Figure 2.2. Instead of hard-coding the thread count, we obtain it from the tuner. We then repeatedly call the add function, measuring its execution time for different thread count values. The Tuner tool produces the sequence of thread count values to be tested, adapting it to the timing feedback.

Autotuning techniques involve both predictive and empirical approaches.

¹ See for example the work of Jordan et al. [Jor+12], Balaprakash et al. [BGW13], or Bao et al. [Bao+16].


vector<float> add(vector<float> A, vector<float> B,
                  unsigned NUM_THREADS); // As above, unchanged

void tune_add(vector<float> A, vector<float> B) {
  Tuner T;
  while (T.keepTuning()) {
    unsigned NUM_THREADS = T.getNumThreads();
    // execute add with arguments and measure execution time
    float Time = time_call(add, A, B, NUM_THREADS);
    T.feedback(Time);
  }
}

Figure 2.2: Example: Tuning the OpenMP thread count

Predictive tuning can be heuristic, meaning that parameters are configured using rules hand-crafted by the application developer. Because manually writing rules is a time-consuming and error-prone task, more sophisticated approaches use model-based prediction. For these, a model is built, mapping application, hardware, and input features to configurations. The model can be constructed either manually, or it can be learned from sample data. The advantage of model-based configuring is that configuring is a one-shot operation: To pick a parameter configuration, the application merely needs to query the model for the current application, hardware, and input. On the flip side, the quality of that configuration is only as good as the design of the model and its training. When a hand-crafted model is inaccurate or when the data seen during training does not reflect the real deployment context, its predictions are suboptimal. Empirical autotuning lifts the burden of model building and training from the application developer. Instead of providing a one-shot answer, empirical tuning iteratively samples configurations. The downside of the iteration is the time spent until a satisfying configuration is found. If the number of possible configurations is large, as it often is in practice, an exhaustive exploration of the configurations becomes infeasible. Hence, practical applications of empirical tuning must resort to approximating the global search, which implies that the produced configuration may only be locally optimal. Therefore, the quality of the result of empirical tuning depends on the time spent tuning, but it can nevertheless reduce the necessary effort by the application developer.


Both forms of tuning can be either white-box or black-box. White-box tuning is tightly coupled with a specific application, and it decides between configurations based on application properties. Empirical white-box tuning systematically explores configurations based on rules defined by the application or tuner developer. The quality of these rules affects the quality of the optimization, as heuristics and hand-crafted models generally do. Black-box tuning, in turn, only observes parameters, the measurement function values, and, in the case of model-based tuning, input and application features. Tuning decisions are made purely based on these observations, which are free of the hazards of developer misconceptions. On the other hand, a black-box tuner might pick suboptimal configurations that could have been avoided using domain knowledge.

Independent of the actual application, the autotuning process always follows the same recurring structure shown in the example in Figure 2.2. Within a loop, which we call the tuning loop, a tuner first configures all tunable parameters with the next configuration to sample. Then it executes the piece of code that is to be optimized with the sample configuration. We refer to this piece of code as the tuning kernel. After executing the kernel, the observed measurements are fed back into the tuner for it to determine the configuration for the next iteration. The tuning loop is repeated until a termination criterion is met.

The tuning loop and its termination criterion occur in various forms in practice. Most often, they are not part of the program being tuned; instead, the whole program is executed in an autotuning environment for a set of example inputs. This approach is called offline tuning. The termination criterion is in that case most often either time passed or a minimum performance threshold. This means tuning continues either for a given time, or until a target program performance is achieved. Alternatively, the tuning loop can be placed within the program, or it may even be a natural part of it. Since the tuning kernel is an important and performance-critical part of the program, it will naturally be executed recurrently.² Autotuning can then piggyback on these naturally occurring tuning loops. In either case, this approach is referred to as online tuning, since optimization happens in a production environment on real inputs.

While the tuner functions the same way in both offline and online tuning, the classes differ in the requirements they impose on the tuner's tuning overhead.

² Otherwise, if the tuning kernel were not performance critical or were executed only rarely, any form of performance tuning could offer little meaningful benefit.

The tuning overhead comprises the time required to compute and apply a new configuration and the time spent sampling configurations. Offline tuning is an operation that is performed once, and the resulting configuration is used indefinitely. Thus, the tuning overhead is rarely of importance, as long as a satisfactory configuration is found. In online tuning, on the other hand, the tuning overhead is visible to the program user. Thus, the tuner is required to minimize the overhead, potentially at the cost of the quality of the configuration it finds. While the sensitivity to tuning overhead puts online tuning at a disadvantage, it is able to work on real inputs and real hardware in an actual deployment context.

Next, we formalize the autotuning problem, following the formalism we introduced in our 2017 article [Pfa+17]. Formally speaking, autotuning is the task of optimizing a value function by modifying only tunable parameters of a program.

A tuning parameter τ_j is a program variable j together with its set of legal values. The tuning parameters of a tuning kernel together form the kernel's search space T:

\[ T = \tau_{j_1} \times \tau_{j_2} \times \dots \times \tau_{j_J}. \]

Autotuning explores the search space to minimize a measurement function m_K : T → R, searching for its global minimum:

\[ C_{\mathrm{opt},K} = \arg\min_{C \in T} m_K(C). \]

Therein, K defines the current dynamic context. The measurement function m_K maps configurations and the current dynamic context onto a tuning target, such as tuning kernel runtime or energy efficiency. The context is composed of the application context K_A and the system context K_S: K = (K_A, K_S). The application context describes dynamic properties of the tuned application, such as the input size. The system context quantifies characteristics such as momentary system load or available hardware features, e.g., the number of cores. For a given configuration C ∈ T, m_K(C) is, for example, a measurement of the execution time of the tuning kernel under the parameter configuration defined by C.
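For the thread count example of Figure 2.2, the formalism might be instantiated as follows (a minimal illustration; the concrete upper bound of 64 threads is an assumption made here, not part of the example):

\[ T = \tau_{\text{NUM\_THREADS}} = \{1, 2, \ldots, 64\}, \qquad m_K(C) = \text{runtime of add with NUM\_THREADS} = C, \]

where the application context K_A could contain, e.g., the length of the input vectors, and the system context K_S the number of available hardware threads.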

We categorize the tuning parameters τ_j according to Stevens' typology [Ste46] into one of the four classes Nominal, Ordinal, Interval, and Ratio. This classification is summarized in Table 2.1. Although a parameter may fit into any one particular class among these four, the most relevant distinction is between nominal and non-nominal parameters. A nominal parameter is one that selects between labels or categories.


Table 2.1: Parameter Classes [Pfa+17].

  Class      Distinguishing Property             Example
  Nominal    Labels                              Choice of algorithm
  Ordinal    Order                               Choice of buffer sizes from a set {small, medium, large}
  Interval   Distance                            Percentage of a maximum buffer size
  Ratio      Natural zero, equality of ratios    Number of threads

Nominal parameters occur frequently when programs may choose between variants of code, such as between different algorithms, or between different execution platforms. Boolean parameters also fall into this category. The key difference between parameters of the nominal class and all others is that within a nominal parameter space there is no notion of distance or direction. This notion is important, however, because it is mandatory for the vast majority of search algorithms used in empirical tuning. Reconsider the previous thread count tuning example in Figure 2.2. We execute the tuning kernel first with 2 threads and then with 4 threads, and for simplicity assume the 4-thread version is twice as fast. Based on this observation, a search algorithm may sensibly assume that more threads are better than fewer threads, and subsequently sample, e.g., an 8-thread version. While this version might turn out not to be faster, for instance if there are fewer than eight hardware threads, the assumption is a valid hypothesis to test. If, on the other hand, we were not optimizing the thread count, but the selection between different implementations of add, the tuning parameter would be nominal. That means, having measured the runtime of an omp_add and an sse_add, the search algorithm cannot make any meaningful assumptions about the runtime of a cuda_add, because there is no means to relate the three with each other.³ Consequently, search on nominal parameters is inherently combinatorial.

2.1.2 Search Algorithms

Having introduced search spaces, tuning parameters, and parameter classes, we will now discuss search algorithms.

³ Assuming black-box tuning.



Figure 2.3: An illustration of the Nelder-Mead operations reflect, expand, contract, and reduce on sampling points, starting from the blue 2D simplex. A reduction operation produces the orange result. The function values of the points are not shown.

After decades of research, an enormous catalogue of different search algorithms exists, which we will not review here in breadth. For a broader overview, we refer to Marthaler [Mar13] for a survey on numerical optimization methods and to Ashouri et al. [Ash+18] for a recent survey on machine learning methods, in particular in the context of compiler autotuning. In this section, we introduce the two algorithms upon which the autotuner presented in this thesis is built, namely Nelder-Mead and ε-Greedy search.

The Nelder-Mead method [NM65] is an extension of the simplex function minimization method originally by Spendley et al. [SHH62]. It maintains a simplex of points in n-dimensional real space, storing the simplex's n + 1 points in sorted order with respect to the measurement function. Figure 2.3 illustrates the sampling behavior in the two-dimensional case, in which a simplex is a triangle. The initial simplex in the example is colored in blue and its worst and best points are marked. At every time step, the algorithm then executes one of the four operations reflect, expand, contract, or reduce:


reflect: Mirror the worst simplex point at the centroid of all but the worst point, with mirroring coefficient α > 0. The coefficient determines how far the new point is moved along the axis through the worst simplex point and the centroid. In Figure 2.3 this operation moves the "worst" blue point in the bottom left to the point labelled reflect along the line through the centroid. In the 2D case the centroid is halfway between the two right-hand points of the blue triangle. After evaluating the measurement function for the new point: if its value lies between those of the best and the worst point, exchange the worst point for the reflected point and start over; else, if the reflected point is a new minimum, perform an expansion; or else, if the reflected point is at least as bad as the worst tracked point, perform a contraction.

expand: The reflected point is a new minimum. Expand the simplex by moving the reflected point further along the same axis as in the reflect step, using the expansion coefficient γ > 1. Figure 2.3 illustrates this step as the move from the point labelled reflect to the point labelled expand. Exchange the worst simplex point for either the expanded or the reflected point, whichever is better, and start over.

contract: The reflected point is at least as bad as the current worst point. Pick whichever of the two is better, and move it toward the centroid by the contraction coefficient 0 < β < 1. In Figure 2.3 we see this step as the move from the point labelled reflect to the point labelled contract. If this produces a better point, exchange it with the worst simplex point and start over. Otherwise, perform a reduction.

reduce: Move every simplex point halfway towards the current minimum. Every new point is evaluated to determine the new best and worst points. Figure 2.3 shows this operation moving from the bottom blue points to the bottom orange points, creating a new simplex shown as the orange triangle.

You et al. [YSD05] extended this original algorithm to better capture the requirements of empirical program optimization. Primarily, they deal with tuning parameters being discrete, such as the thread count parameter in the example above in Figure 2.2. They convert the points in real space produced by Nelder-Mead to integers by floor rounding. Second, they support boundary and interior constraints, both of which are frequently necessary in program autotuning: The set of legal values is usually finite and often defined as closed intervals of values. Additionally, specific combinations of values from within the legal ranges can also produce illegal configurations. The search must be constrained to the boundaries of the value intervals and must avoid the illegal interior points. You et al. handle this through stationary penalty values. That means, if a point p produced by Nelder-Mead search lies outside of the application-specific bounds, such as a negative thread number, they simply return m_K(p) = ∞.


They also note the importance of defining how to start the search. If the points of the starting simplex turn out to lie on a (hyper-)plane, the search is limited to moves on that plane and can never explore in a direction perpendicular to it. To avoid this, You et al. initialize the simplex so that it has a non-empty but random volume.

The second algorithm we present in this section is ε-Greedy. The algorithm is frequently used in solutions to multi-armed bandit problems [SB98]. In this type of problem, an agent must repeatedly choose among a set of options with unknown reward, which resembles playing a slot machine with multiple arms and unknown payouts. Tuning nominal parameters can be seen as a multi-armed bandit problem, which motivated the choice of this particular algorithm. The ε-Greedy algorithm is remarkably simple: With probability ε, select a configuration randomly; otherwise, greedily pick the best configuration found so far. Variants of the algorithm also sometimes decrease ε over time to force the search to converge to a value [KLM96].
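The following is a minimal sketch of one such ε-Greedy selection step for a nominal parameter over a fixed set of configurations; the data layout and function name are illustrative and not part of libtuning:

    #include <random>
    #include <vector>

    // One ε-Greedy selection step over a nominal parameter with
    // BestSoFar.size() configurations. BestSoFar[i] holds the best (lowest)
    // runtime observed so far for configuration i, or infinity if i has not
    // been sampled yet.
    size_t selectEpsilonGreedy(const std::vector<double> &BestSoFar,
                               double Epsilon, std::mt19937 &RNG) {
      std::uniform_real_distribution<double> Coin(0.0, 1.0);
      if (Coin(RNG) < Epsilon) {
        // Explore: pick a configuration uniformly at random.
        std::uniform_int_distribution<size_t> Pick(0, BestSoFar.size() - 1);
        return Pick(RNG);
      }
      // Exploit: pick the configuration with the best observed measurement.
      size_t Best = 0;
      for (size_t I = 1; I < BestSoFar.size(); ++I)
        if (BestSoFar[I] < BestSoFar[Best])
          Best = I;
      return Best;
    }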

2.1.3 Reinforcement Learning

Reinforcement Learning (RL) is a class of machine learning methods for solving control problems. In this section, we provide an overview of the general problem formulation, as well as a selection of solution methods. For a more in-depth introduction we refer to Sutton and Barto [SB98]. All definitions in this section are based on Sutton and Barto unless otherwise noted.

The RL control problem is usually formulated as a Markov Decision Process (MDP), in which an agent iteratively navigates an unknown environment. In every iteration, the agent observes its current state in the environment, and selects and executes an action. It makes this choice based on an action-selection policy. After carrying out the action, it receives a reward from the environment and transitions into the next state. The agent attempts to maximize its accumulated reward over time, and Reinforcement Learning methods are used to determine policies that achieve this. The state transition after executing an action need not be deterministic, i.e., in general an action takes the agent into the next state according to a probability distribution. The sequence of action-transition steps in the process can be either infinite-horizon or episodic. In the former, iterations go on forever, indefinitely continuing state transitions.

In the latter, iteration happens in an episode until some termination state is reached, and is then restarted in an initial state. This case allows for a delayed distribution of rewards only at the end of each episode instead of after every action. The typical example for an episodic MDP with delayed rewards are games. Here, the agent plays a board or computer game against a human or machine opponent, and receives a score once the game is complete. TD-Gammon [Tes95] is a well-known instance of this, where an agent based on Temporal Difference learning (an RL method which we describe further below) learned to play Backgammon by training against world-class human players.

Formally, an MDP is the quadruple (Σ, A, Φ, ρ), where Σ and A are the sets of discrete states and actions, respectively. The action model Φ : Σ × A × Σ → [0, 1] gives the probability Φ(s, a, s′) = P(s′ | s, a) of transitioning into state s′ when performing the action a in state s. The reward function ρ : Σ × A × Σ → R defines the reward received from the environment when going from state s to s′ while performing action a. When applying RL to solve this MDP, Φ is typically not known and is learned along the way.

After three decades of research, there is by now a substantial body of RL methods [Aru+17; SB98]. We focus on the two classical model-free variants: Temporal Difference learning (TD) [Sut88] and Q-Learning [WD92]. Both operate in a similar manner, in that they attempt to maximize the expected future reward. Thus, they assign a value V(s) to every state s. Temporal Difference learning computes this value as follows. After every step, it updates the value of the previous state s according to

\[ V(s) = V(s) + \alpha \left[ \rho(s, a, s') + \gamma V(s') - V(s) \right]. \]

Here, α is called the learning rate, and γ ∈ [0, 1] is a discount factor controlling the influence of expected future rewards. An agent setting γ = 0 is sometimes called myopic, in that it makes decisions purely based on immediate rewards and ignores any possible future gain. TD is an instance of the class of so-called on-policy methods. On-policy methods learn a policy by using that policy. An on-policy agent can hence be considered to be always exploring. In particular, this means the agent cannot be trained using a stochastic policy, e.g., randomly trying actions for a while, and then be switched to a different policy, e.g., one that greedily chooses the action most probably leading to the highest-valued state. Learning about one policy (called the target policy) from data obtained through a different policy (called the behavior policy) is instead the discriminating feature of off-policy methods.


Watkins' Q-learning is an off-policy method for online learning. Instead of computing state-values as in TD, Q-learning approximates the optimal value function through state-action-values. Watkins noticed that the future development of the expected reward only depends on the current state and the chosen action [Wat89]. Thus, his state-action-value function Q : Σ × A → R assigns each state and action pair the expected future reward. A trivial but optimal target policy is then π_t(s) = arg max_a Q(s, a). To train Q, Watkins defines the update after performing an action at time point t as

\[ Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ \rho(s, a, s') + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a) \right]. \]

The parameters α and γ are as above in TD learning. The behavior policy used during training is now another degree of freedom. One popular choice often used in both literature and practice is ε-Greedy, which chooses a random action with probability ε, and the best known action otherwise. Interestingly though, because Q-learning is off-policy, it can also consume data obtained through third-party sources, for example a human controller, or data recorded in previous experiments.

The models and methods discussed so far in this section are subject to a strong restriction: Both action and state spaces are required to be discrete. This is an impactful limitation, because it excludes a wide range of control problems which have continuous states, unless the continuous inputs are proactively discretized. Particularly relevant examples are all problems that use time, which is continuous, as an input. But even if the problem being modeled has discrete states, a practical difficulty arises when the state space is large. Tabular RL methods as described above need to keep the state-value function or state-action-value function as a table in memory. Storing the complete table quickly becomes impractical. An alternative is to approximate the mapping as a function. Early results have shown this to be problematic, such as Baird's well-known counterexample for which Q-learning with function approximation diverges [Bai95]. Recent advances in the theory of machine learning, and stochastic gradient descent in particular, have led to several new gradient-based RL algorithms with much stronger convergence guarantees, such as Greedy-GQ [Mae+10]. Greedy-GQ offers several benefits which make it an interesting candidate for online tuning scenarios: It is online, incremental, and efficient, involving memory and computation costs linear in the number of features. It is, however, limited to linear function approximation. The general framework of the algorithm is as introduced above for tabular Q-learning. Instead of the tabular Q(·, ·), however, it computes Q_θ(s, a) = θ · ϕ(s, a), where ϕ(s, a) ∈ R^d are non-linear features and θ ∈ R^d are the parameters to be learned. Using Sutton's weight-doubling technique [SMS08], the iterative update for an ε-Greedy policy is defined as

\[ \theta_{t+1} = \theta_t + \alpha_t \left[ \delta_t(\theta_t)\,\varphi_t - \gamma\,(w_t \cdot \varphi_t)\,\hat{\varphi}_{t+1}(\theta_t) \right], \]
\[ w_{t+1} = w_t + \beta_t \left[ \delta_t(\theta_t) - \varphi_t \cdot w_t \right] \varphi_t. \]

Here, α and γ are as before, and β is the learning rate for the second weight vector w. The ϕ_t are abbreviations for ϕ_t(s_t, a_t). The temporal difference error δ_t(θ) is analogous to the one in the TD update above. Lastly, ϕ̂ is given by ϕ̂_{t+1}(θ) = ϕ(s_{t+1}, a′) for a′ = arg max_a Q_θ(s_{t+1}, a). Maei et al. prove the convergence of their Greedy-GQ algorithm only under strong conditions, but claim that it proves robust beyond their proof in practice. In particular, Greedy-GQ converges for Baird's counterexample.
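To make the tabular Q-learning update concrete, here is a minimal sketch of an agent that selects actions ε-greedily and applies Watkins' update after each step; the table layout and names are illustrative only:

    #include <random>
    #include <vector>

    struct QLearner {
      std::vector<std::vector<double>> Q; // Q[state][action], initialized to 0
      double Alpha = 0.1;                 // learning rate α
      double Gamma = 0.9;                 // discount factor γ
      double Epsilon = 0.1;               // exploration probability ε
      std::mt19937 RNG{42};

      // ε-Greedy behavior policy: explore with probability ε, otherwise exploit.
      size_t selectAction(size_t S) {
        std::uniform_real_distribution<double> Coin(0.0, 1.0);
        if (Coin(RNG) < Epsilon) {
          std::uniform_int_distribution<size_t> Pick(0, Q[S].size() - 1);
          return Pick(RNG);
        }
        size_t Best = 0;
        for (size_t A = 1; A < Q[S].size(); ++A)
          if (Q[S][A] > Q[S][Best])
            Best = A;
        return Best;
      }

      // Watkins' update: Q(s,a) += α [ρ(s,a,s') + γ max_{a'} Q(s',a') − Q(s,a)]
      void update(size_t S, size_t A, double Reward, size_t SNext) {
        double MaxNext = Q[SNext][0];
        for (double V : Q[SNext])
          if (V > MaxNext)
            MaxNext = V;
        Q[S][A] += Alpha * (Reward + Gamma * MaxNext - Q[S][A]);
      }
    };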

2.2 Program Analysis and Transformation

In this section we provide an introduction to the compiler construction concepts and tools that form the basis of the cooperative parallelization within the aphes compiler. In particular, we introduce the LLVM framework and the analyses we used to extract parallelizable parts from a program. The latter includes both classical dependence testing techniques as well as the more formal polyhedral model.

2.2.1 The LLVM Framework

We implement aphes on top of the LLVM framework [LA04], an infrastructure for building compilers and programming language tools. The framework is the basis for several compilers for multiple languages, such as C++, Java, and Rust. An LLVM-based compiler is generally a pipeline of three components. The frontend translates the programming language into an intermediate representation, the LLVM IR. The middle-end then transforms this IR to optimize the program. Lastly, the backend converts this IR into a second intermediate representation that more accurately reflects the properties of the targeted register machine; this machine IR is then further optimized for the selected target platform and lowered into binary code. Our compiler reuses both the LLVM frontend for the C-family of languages as well as the backends for the platforms we target.


@S = private constant [13 x i8] c"Hello World\0A\00", align 1

declare dso_local i32 @printf(i8*, ...)

define dso_local i32 @main() {
entry:
  %0 = getelementptr [13 x i8], [13 x i8]* @S, i32 0, i32 0
  %1 = call i32 (i8*, ...) @printf(i8* %0)
  ret i32 0
}

Figure 2.4: LLVM intermediate representation of a “hello world” program.

The frontend is part of the clang compiler. Additionally, we use program analyses and transformations from the LLVM middle-end, and will thus give a more detailed overview of the middle-end and the intermediate representation in the following.

The LLVM IR represents the source program in a RISC-assembly-like form. Figure 2.4 shows an example "Hello World" program in LLVM IR. The representation is Static Single Assignment (SSA) based, meaning that every variable, called an SSA register and written with a leading % in LLVM IR, is assigned only once statically. To express multiple assignments to a variable in the source language, the SSA form contains virtual phi nodes that select operands according to control flow. The largest unit of IR is the module, which describes a single compilation unit. The module contains function declarations and definitions, type definitions, global variables, and metadata. In the example, @S is a global variable. Function declarations, such as the one for @printf, contain the function name, the return type, and the argument types. Function definitions, such as the one for @main, additionally define a list of basic blocks. Basic blocks are the nodes of the function's control flow graph, linked via branch edges, and contain the SSA instructions. The @main function contains a single basic block, labelled entry, which is terminated by a return instruction.

Program optimization in the LLVM middle-end is organized into passes, scoped to the different units of IR. A pass can be either an analysis or a transformation. An analysis is an operation that only reads IR and computes information, which is then cached for further use. An example is points-to or alias analysis, which for every pair of pointer variables in a function computes whether they may ever point to the same object or not.

A transformation, in turn, queries analysis results and mutates the IR, potentially invalidating previously cached analyses. An example for this is inlining, which replaces a call to a function with a copy of its body. The middle-end applies sequences of transformations known as pipelines to the IR. As such, it alternates between performing multiple transformations on a function, on a module, or on a loop, for instance.
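As an illustration of this structure, the following is a minimal sketch of a function transformation pass under LLVM's new pass manager; the pass name is made up and the body performs no real transformation, it merely shows how a pass obtains a cached analysis result and reports which analyses it preserves:

    #include "llvm/Analysis/AliasAnalysis.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/PassManager.h"

    using namespace llvm;

    // Skeleton of a transformation pass: query an analysis, mutate the IR,
    // and tell the pass manager whether cached analyses remain valid.
    struct ExamplePass : PassInfoMixin<ExamplePass> {
      PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
        AAResults &AA = FAM.getResult<AAManager>(F);
        (void)AA; // a real pass would consult AA to justify its transformation

        bool Changed = false;
        // ... inspect and transform the basic blocks and instructions of F ...

        return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
      }
    };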

In the remainder of this section, we describe two analyses implemented in LLVM in closer detail. These analyses, AliasAnalysis and ScalarEvolution, are the ones on which our parallelizing compiler relies most strongly.

Alias analysis computes information to disambiguate memory accesses. Whenever memory is loaded or stored through two pointer variables, it decides whether it is possible that the two variables ever refer to the same bytes of memory. In general, a precise answer to that question is impossible, which follows from the halting problem. In that case, the analysis must make worst-case assumptions. In all other cases it can respond with a "yes" or "no" answer. For example, distinct global variables or stack allocations cannot alias, and neither can statically known distinct indices into the same array. Beyond this simple reasoning, LLVM implements more complex, more precise, and more expensive analyses, for example the inclusion-based or unification-based analyses built on the context-free language reachability problem from the works of Zheng et al. and Zhang et al. [ZR08; Zha+13].

Unlike alias analysis, which attempts to describe the evolution and propagation of pointers throughout a function or program, ScalarEvolution analyzes the evolution of scalar variables across the iterations of a loop. It attempts to derive a closed form for the value contained in a variable, dependent on the iteration variables of the containing loops. In essence, the analysis computes this closed form through symbolic evaluation of chains of recurrences, based on the work of Van Engelen and Bachmann et al. [Eng00; BWZ94]. With its help, passes can determine the (symbolic) value stored in a variable at any given loop iteration.
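As a hypothetical illustration (not taken from the thesis), consider the loop below; the comments sketch the closed forms ScalarEvolution derives, written in LLVM's {start,+,step} add-recurrence notation:

    // For the loop below with induction variable i, ScalarEvolution derives:
    //   i      has the closed form {0,+,1}, i.e., its value in iteration i is i,
    //   Offset has the closed form {7,+,2}, i.e., its value in iteration i is 7 + 2*i,
    // and the trip count of the loop is the loop-invariant value N.
    void gather(float *A, float *B, int N) {
      for (int i = 0; i < N; ++i) {
        int Offset = 2 * i + 7;
        B[i] = A[Offset];
      }
    }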


if (loadData()) {
  data = prepareData();
  consumeData(data);
}

Figure 2.5: Control Dependences: Execution of the consumeData task is conditional on the data produced by the loadData task. Data Dependences: The consumeData task requires data produced by the loadData task.

2.2.2 Parallelism Detection, Dependence Graphs, and Interprocedural Analyses

In a parallel program, the fundamental components of parallel execution are tasks and data. Tasks read, modify, or create data. If there are multiple different tasks operating concurrently, this mode of execution is called task parallelism. If, on the other hand, the same task executes concurrently for different data, this is referred to as data parallelism [MRR12].

Parallelizing a sequential program means transforming it so that formerly sequentially executed tasks now operate in parallel. Both task and data parallelism can be produced, in isolation and in combination. A parallelizing compiler detects the potential to produce a parallel program. This can be done either by identifying distinct tasks that may run in parallel, or by finding a task that can be replicated to operate only on parts of the data. The parallel program preserves the original program semantics, but exhibits a different order of execution. For the necessary transformation to be sound, the compiler must prove that it does not violate dependences present in the original program. Dependences can be either data dependences or control dependences [HP11]. A control dependence exists between two tasks if the execution of the second depends on the result of the first. In Figure 2.5, e.g., consumeData() is control dependent on loadData() because of the conditional branch. The consumeData() task may thus only execute once the loadData() task is completed. A data dependence exists when a task requires data produced by another task. In the example, the consumeData() task is data dependent on the prepareData() task because it requires the prepared data.


There are four types of data dependences [GKT91; HP11]. Assume a program in which tasks A and B run in sequence; then the four dependence types are:

Input dependence (read-after-read): Task B reads a variable that task A also reads. This dependence is generally benign, and thus does not prevent parallel execution.

Output dependence (write-after-write): Task B writes a variable that task A also writes. In a parallel execution, this would non-deterministically alter the value any subsequent, non-concurrent tasks observe.

Flow dependence (read-after-write): Task B reads a variable that task A writes. Task B might see the old value in a parallel execution.

Anti dependence (write-after-read): Task B writes to a variable that task A reads. Task A thus might see either of the two values in a parallel execution.
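The following small C sketch (an illustration, not an example from the thesis) exhibits all four dependence types between a task A and a subsequent task B:

    void tasks(void) {
      int a = 1, b = 2, x, y, z;

      /* Task A */
      x = a + b;   // reads a and b, writes x
      y = a * 2;   // reads a, writes y

      /* Task B */
      z = x + a;   // flow dependence (read-after-write): B reads x, which A wrote;
                   // input dependence (read-after-read): both A and B read a
      b = 0;       // anti dependence (write-after-read): B writes b, which A read
      y = z;       // output dependence (write-after-write): both A and B write y
    }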

Dependent tasks (other than Input dependent) cannot be soundly executed concurrently. Automatic parallelization thus requires finding tasks which are not ordered through either control or data dependences. To determine control dependences, compilers compute the post-dominance frontiers for a function [FOW87]. The post-dominance frontier is a function that maps basic blocks to their successors which are not post-dominators. A post-dominator of a basic block is an immediate or indirect successor that is part of every path from the basic block to a function exit. The post-dominance frontier is a static property of the control flow graph and is easy to compute. Data dependences, however, are much harder to determine. This is for two reasons. First, when accessing arrays through pointer variables, the compiler must determine for two accessed variables whether these refer to the same array. Second, it must decide whether the index expressions of the accessed array elements ever assume an identical value, but the index expression may be computed from an arbitrarily complicated expression, or it might be loop variant or even unknown in case the index is loaded indirectly from other variables.

The first problem, whether two pointers refer to the same array, is approached by points-to or alias analyses. Alias analyses make various trade-offs, balancing precision and speed [Hin01]: A flow-sensitive analysis for example considers control flow through a function. A context-sensitive analysis considers the calling context of a function. Precise alias analysis has even been shown to be NP-hard [LR91; Hor97].


1  for (int i = 1; i < N; ++i) {
2    float a = A[i + N];
3    float b = A[i + 2*N - 1];
4    A[i + 2*N] = a + b;
5  }

Figure 2.6: Example: A data dependent computation

Consequently, compilers need to make pessimistic assumptions about which arrays are accessed by a single instruction. Having determined which arrays two memory accesses may refer to, dependence analysis can then be performed to solve the second problem. This analysis often uses a technique called dependence testing. The algorithm by Goff et al. [GKT91] is an instance of this technique. The general idea of the algorithm is to first run a sequence of simple and cheap tests that cover the majority of access patterns found in practical software. Only if those are inconclusive do they defer to more general and expensive analyses such as Banerjee’s inequalities and the GCD test [Ban88; Wol82]. To illustrate the algorithm, consider the example in Figure 2.6. Intuitively, the write in line 4 to A[i+2*N] depends on the value of A[i+N] read in line 2 in iteration i. Given the loop bounds, we easily see that this index is never written in the loop. More generally, for index expressions of the form a∗i + c1 and a∗i′ + c2, the dependence testing algorithm calculates the dependence distance as d = i′ − i = (c1 − c2)/a. The dependence distance is the number of iterations that lie between the two accesses. For the accesses in lines 2 and 4 of Figure 2.6 to access the same bytes of memory, there need to be N iterations between first the write in iteration i and the later read in iteration i + N. Substituting the values from the example into the dependence distance formula, we indeed get d = N. Since d > N − 1, the number of loop iterations, we can disprove dependence for these two accesses. On the other hand, considering the accesses in lines 3 and 4 instead, we get d = 2∗N − (2∗N − 1) = 1 ≤ N − 1, which means there exists a flow dependence here.

Given both the control and data dependences, the compiler can now search for parallelizable tasks. An established data structure facilitating this job is the program dependence graph (PDG) [FOW87]. The PDG is a graph representing a function whose nodes are instructions and whose edges are control or data dependences.
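The core of this distance test is simple enough to sketch in a few lines of C; the function below is an illustration of the computation described above (the names are made up and this is not code from the thesis):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Distance test for a pair of subscripts a*i + c1 and a*i' + c2 in a loop
     * with `iterations` iterations (a must be non-zero). Returns true if a
     * dependence cannot be ruled out and stores the distance in *d. */
    static bool may_depend(long a, long c1, long c2, long iterations, long *d) {
      if ((c1 - c2) % a != 0)
        return false;                 /* the distance is not an integer */
      *d = (c1 - c2) / a;
      return labs(*d) <= iterations;  /* larger distances disprove the dependence */
    }

For the accesses in lines 2 and 4 of Figure 2.6 this yields d = N and, because N exceeds the N − 1 loop iterations, no dependence; for lines 3 and 4 it yields d = 1, so the dependence remains.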


In this representation, detecting both task and data parallelism is straightforward. Any two distinct subgraphs where neither one is reachable from the other describe tasks that can be executed concurrently. Similarly, any task that is not part of a cycle is a candidate for data parallelism. An unfortunate limitation of the PDG is that it only represents a single function. This means that tasks containing function calls cannot be considered for parallelization. Interprocedural analysis, i.e., looking at multiple functions at the same time, offers a way around this limitation. Unfortunately, precise interprocedural analysis is hard [Rep96]. Function summaries as described by Sharir and Pnueli [SP78], however, present a technique to trade precision for speed and to represent function calls within a PDG. The general idea is to produce a summary for functions recursively, starting at the leaves of the call graph, i.e., those functions which contain no calls. The summary describes accesses to arrays not defined within the function. Callers of the function build their own summary by virtually inlining the summaries at call sites. This of course only works if a function’s accesses are computable and the function is not part of a cycle in the call graph; such cycles are present in recursive programs.

2.2.3 The Polyhedral Model

The polyhedral model [FL11] offers a formal approach to reason about programs, by representing program parts in a compact mathematical abstraction. The program parts representable in this model are called static control parts (SCoP), which are connected single-entry-single-exit subgraphs of the control flow graph. Static control requires that the only control constructs in the subgraphs are loops with affine loop bounds, and all memory accesses are either to scalar variables or to arrays with affine subscripts in the loop counters. Figure 2.7 shows a simple 2D matrix multiplication code based on the example by Verdoolaege et al. [Ver+13]. This loop nest exhibits static control: all array access index expressions in the statements S1 and S2 are affine in the loop counters, and the loop bounds are integer parametric constants, and thus also affine. Parametric constants are loop-invariant program variables. Although the loop upper bound N in the example is a program variable, it is considered constant with respect to the program part because it is invariant within it.


1  // Compute NxN matrix product A = B * C
2  for (int i = 0; i < N; i++)
3    for (int j = 0; j < N; j++) {
4      S1: A[i][j] = 0;
5      for (int k = 0; k < N; k++)
6        S2: A[i][j] += B[i][k] * C[k][j];
7    }

Figure 2.7: Matrix multiplication: A three-dimensional loop nest with static control (based on the example by Verdoolaege et al. [Ver+13]). The memory accesses in the statements S1 and S2 are all affine.

2.2.3.1 Modeling Affine Loop Nests

The core of the mathematical representation of affine loop nests are Presburger sets and relations [PW94; Ver16]. A Presburger set {N(i) | cons(i, p), i ∈ Zⁿ, p ∈ Zᵏ} is a set of integer vectors. Here, p ∈ Zᵏ is a vector of parametric constants. N is an optional name, included to aid readability. The constraints cons are Presburger formulas, which are first-order formulas over integer inequalities of quasi-affine expressions. Quasi-affine expressions are sums and differences between variables, parametric constants, integer constants, and quasi-affine products, divisions, or modulo operations. Products are quasi-affine if one operand is constant; divisions and modulo operations are quasi-affine if the divisor is constant. A Presburger relation {N(i) → M(j) | cons(i, j, p), i ∈ Zⁿ, j ∈ Zᵐ, p ∈ Zᵏ} is defined similarly. Using these definitions, we can now model the SCoP of the example in Figure 2.7 (cf. [Ver+13]). The iteration space or iteration domain is defined by

D = {S1(i, j) | 0 ≤ i, j < N, S2(i, j, k) | 0 ≤ i, j, k < N} .

The read access relation is

R = {S2(i, j, k) → A(i, j), S2(i, j, k) → B(i, k), S2(i, j, k) → C(k, j)} .

and the write access relation is

W = {S1(i, j) → A(i, j), S2(i, j, k) → A(i, j)} .

Lastly, a schedule defines an execution order of statements preserving the data dependences. A schedule is a relation mapping statement instances to virtual timestamps, which are executed in lexicographic order.

The schedule for the example is S = {S1(i, j) → (i, j, 0, 0), S2(i, j, k) → (i, j, 1, k)}. An analysis that the model facilitates is dependence analysis. Using the techniques by Feautrier [Fea91], we obtain

∆ = {S1(i, j) → S2(i, j, 0), S2(i, j, k) → S2(i, j, k + 1)} .
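The first pair in ∆ can be read off directly from the access relations and the schedule: according to W, S1(i, j) writes A(i, j); according to R, S2(i, j, 0) reads A(i, j); and under S the writer’s timestamp (i, j, 0, 0) lexicographically precedes the reader’s timestamp (i, j, 1, 0). The second pair follows analogously from the write of A(i, j) in S2(i, j, k) and its read in S2(i, j, k + 1).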

This dependence relation expresses that in every iteration (i, j) of the outer two loops, the statement S2 in iteration k = 0 of the inner loop depends on the statement S1, and in iteration k + 1 on the result of itself from iteration k.

Another interesting analysis the polyhedral model offers is counting. Because the model is a bounded integer polyhedron, we can use Barvinok’s algorithm [Bar08], which counts the number of points in integer polyhedra in polynomial time. The number of points, which in general is parametric in the parametric constants, corresponds to the exact number of instructions executed by the SCoP. For the example in Figure 2.7, the domain D contains N² instances of S1 and N³ instances of S2. From this, we can make accurate predictions of the dynamic runtime of the SCoP, given that accurate estimates for the runtime of the individual instructions are known. Restricting the model to specific instructions further allows additional analyses, such as computing the exact number of bytes read or written in the SCoP.

While the polyhedral model offers a compact and highly versatile abstraction, it is also subject to severe limitations. Modeling a loop nest requires the loop to be (part of) a SCoP. If the task is only parallelization, for example, this restriction is stronger than necessary: For instance, an inner loop with non-affine or even unknown bounds does not prevent outer loop parallelization if the inner loop does not access memory, or accesses only memory that is provably not accessed by other outer loop iterations. Further, to date the polyhedral model is applicable only intraprocedurally. Interestingly, however, there are techniques to create optimistic models, such as those developed by Doerfert et al. [DGH17], which build the model and collect assumptions along the way. An example assumption is the non-aliasing of arrays. The assumptions are then simplified and compiled into runtime checks which only execute transformed code if the assumptions hold.

2.2.3.2 Polyhedral Scheduling

The power of the polyhedral model is that it allows for applying sequences of loop transformations purely on the abstraction, and subsequently generating new code from the transformed representation. Transforming the model means finding alternative schedules that improve performance while preserving data dependencies. This process is called scheduling. Scheduling can increase locality, or expose parallelism. For example, minimizing dependence distance improves locality: The more recently an address was accessed, the more likely it is still in the cache. A dependence distance of 0 even means that accesses happen within the same iteration. Such accesses do not cause loop-carried dependencies, and thus do not inhibit parallelization.

For the matrix multiplication in Figure 2.7, the dependence distances are the differences of the timestamps of statement instances involved in the dependence relations ∆.

For the schedule S, the dependence distances are S(S2(i, j, 0)) − S(S1(i, j)) = (0, 0, 1, 0) and S(S2(i, j, k + 1)) − S(S2(i, j, k)) = (0, 0, 0, 1). In both cases, the entries for the i and j dimensions are zero, indicating that the i and j loops are parallel. Furthermore, because all entries are non-negative, the loops in the example are freely permutable without affecting the program semantics [Aho+07, pp. 864]. Permuting loops (sometimes also called loop interchange) exchanges inner with outer loops. More importantly, though, permutability guarantees that loop tiling is legal. Loop tiling, also called blocking in the literature, is a transformation that decomposes a single loop into two: The inner loop, called the point loop, iterates over the elements of a tile (or block) of elements of the original loop. The outer loop iterates over the equally sized tiles of original elements. The tiling operation is able to increase locality and expose parallelism. We refer to Aho et al. [Aho+07] for a more in-depth introduction.
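To make the tiling transformation concrete, the following hand-written sketch (an illustration, not compiler output; the element type and the fixed tile size T are arbitrary choices) tiles the two parallel outer loops of the matrix multiplication from Figure 2.7; ii and jj are the tile loops, i and j the point loops:

    #define T 32   /* illustrative, fixed tile size */

    void matmul_tiled(int N, float A[N][N], float B[N][N], float C[N][N]) {
      for (int ii = 0; ii < N; ii += T)                    /* tile loops  */
        for (int jj = 0; jj < N; jj += T)
          for (int i = ii; i < ii + T && i < N; ++i)       /* point loops */
            for (int j = jj; j < jj + T && j < N; ++j) {
              A[i][j] = 0;
              for (int k = 0; k < N; ++k)
                A[i][j] += B[i][k] * C[k][j];
            }
    }

Each (ii, jj) tile touches only a T×T block of A, which improves cache reuse, and distinct tiles can be processed in parallel because the i and j dimensions carry no dependences.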

One widely used algorithm for tiling loops and exposing parallelism is the Pluto algorithm [Bon+08]. Using integer linear programming, it produces a tiling schedule which minimizes the dependence distance. The integer linear program (ILP) is formulated from scheduling functions constructed successively to produce non-negative dependence distances. Loop nests can, of course, not generally be scheduled to be fully permutable, which Pluto solves by creating a chain of permutable loops. Each set of permutable loops within this chain is then called a permutable band or tilable band. The linear program then optimizes the schedule function coefficients to minimize dependence distance.

Another frequently cited scheduling approach is PPCG [Ver+13], which extends the Pluto algorithm to apply it to GPU mapping. Mapping is the process of assigning loops to GPU thread grid and block dimensions. Instead of a chain of bands, PPCG builds on top of isl [Ver10], which computes a tree of bands. The PPCG mapping algorithm then works as follows. First, it finds the outermost bands (i.e., highest with respect to the tree) containing parallel loops, for each of which it will create a separate GPU kernel. Every such outermost band is then tiled. Two parallel outermost tile loops are mapped to the blocks of the grid, three parallel outermost point loops are mapped to the threads of a thread block. To maximize locality, the thread mapping occurs inner-dimension first. That is, the outermost thread block dimension is assigned the iterations of the innermost loop.

The concept of a tree of bands has since been generalized to the notion of schedule trees [GVC15]. Grosser et al. implemented this scheduling method in isl. The trees are also used in recent versions of the PPCG mapper. Schedule trees provide a way to represent piecewise schedules and modulo arithmetic, which greatly simplifies polyhedral code generation and circumvents limitations of previous code generators. The tree is composed of eleven types of nodes. The most important ones will be introduced here:

Domain The domain node is the root of the tree and introduces the scheduled statement instances.

Context A context node introduces parametric constants not part of the domain.

Filter Filter nodes select subsets of statement instances that have been introduced and not filtered out by previous Filter nodes. Filters are the only legal children of set and sequence nodes.

Sequence A sequence node schedules its filter children in sequence.

Set A set node schedules its filter children in arbitrary order.

Band The band node represents a partial schedule.

Figure 2.8 shows the schedule tree for the matrix multiplication example. Deriving the schedule S = {S1(i, j) → (i, j, 0, 0), S2(i, j, k) → (i, j, 1, k)} from this is straightforward: The outermost band maps the i and j dimensions of both statements to i and j. The sequence node schedules the statements selected by its filters in sequence. The simplest such schedule thus maps S1 to a constant 0, S2 to a constant 1. The innermost band lastly maps the k dimension of S2 to k.


domain    D = {S1(i, j) | 0 ≤ i, j < N, S2(i, j, k) | 0 ≤ i, j, k < N}
  band      {S1(i, j) → (i, j), S2(i, j, k) → (i, j)}
    sequence
      filter    {S1(i, j)}
      filter    {S2(i, j, k)}
        band      {S2(i, j, k) → (k)}

Figure 2.8: Original schedule tree of the matrix multiplication example in Figure 2.7.


Chapter 3

An Overview of the APHES Framework

The APHES framework is the contribution of this thesis to satisfy the requirements carved out in Section 1.1. To fulfil the thesis goal, it offers a compiler and runtime system to parallelize sequential applications for heterogeneous platforms and to autotune the execution. Autotuning operates by using either empirical search to explore new configurations or prediction to provide configurations for known and unknown context states. The decision between the two configuration methods is made upon every change in the dynamic context. In this chapter, we provide a high-level overview of the framework and its components, and outline how it achieves the thesis goal. The two main components are the libtuning autotuner and the aphes compiler. In the following, we first present the autotuning library and show how application developers may use it for online optimization of their programs. Second, we present the compiler and outline how it integrates the tuner with the application as part of its runtime system.

3.1 The libtuning Autotuner

The main part of the framework is the libtuning autotuner. We use it within the APHES framework to optimize the cooperative execution, but it is in fact a standalone tool. Application developers can use it as a C++ library to automate the optimization of tunable parameters in an application independent of the APHES framework. The library implements a versatile, generic, and easily extensible online autotuner.


1
2   tuning::Tuner OMPTuner;
3   int NumThreads = 8;  // Initialize to some default.
4
5   vector add(vector A, vector B) {
6     vector Result(A.size());
7
8     // Update NumThreads with the next configuration.
9     OMPTuner.start();
10
11    #pragma omp parallel for num_threads(NumThreads)
12    for (size_t I = 0; I < Result.size(); ++I)
13      Result[I] = A[I] + B[I];
14
15    // Report feedback. The API measures time automatically.
16    OMPTuner.stop();
17
18    return Result;
19  }
20
21  int main() {
22    OMPTuner.addParameter(&NumThreads, /*min=*/1, /*max=*/16);
23
24    // Call add(...) repeatedly
25  }

Figure 3.1: Example: Tuning OpenMP vector addition with libtuning.


An application interacts with the autotuner iteratively: For a previously registered set of application parameters, the application requests a new configuration for these parameters, and reports a performance metric for this configuration back to the tuner. Integrating an application with libtuning thus requires two steps: Exposing and registering tunable parameters, and introducing update and feedback calls. A tunable application parameter can be of any type and have arbitrary semantics, as long as it is freely configurable by the autotuner. Updating the application parameters with the next configuration and reporting performance feedback about the configuration back to the tuner enclose the tuning kernel of the application. The update-sample-feedback sequence thus is the body of the tuning loop. Most of the time, the feedback is a runtime measure, although this is not required. The autotuner attempts to find the parameter configuration that minimizes the metric, independent of the metric’s meaning.

Figure 3.1 depicts an OpenMP example using libtuning to optimize the number of threads. The NumThreads parameter is registered by reference with the tuner, using a range of valid values between 1 and 16. Calling OMPTuner.start() in line nine in every call to add(...) automatically updates NumThreads with a new sample value from this range. Reading from the variable thus uses the new configuration. Subsequently calling OMPTuner.stop() then takes care of reporting runtime feedback to the tuner, using the wall clock time elapsed since the last call to start(). The start()/stop() API is provided for convenience. There are alternative APIs for using arbitrary feedback values. Note that this example is not simplified and shows the actual sequence of calls required to set up and perform autotuning, apart from the tuning loop itself.

The libtuning autotuner is both generic and extensible. Its architecture is shown in Figure 3.2. The behavior of the tuning process is entirely customizable: The search algorithms and online learning mechanisms can be altered or replaced by application-supplied plugins. To be fully generic, the tuner imposes no restrictions on the application parameters’ types, ranges, or semantics. At its core, libtuning provides an abstraction between the application and the search engine. The search engine is a driver for arbitrary search algorithms such as the Nelder-Mead algorithm, ε-Greedy search, or genetic algorithms. Application developers pick an algorithm from a small selection of built-in searches or implement their own. Searches are isolated from the application through an adaptation layer, which maps between unconstrained real-valued search spaces and the often constrained and arbitrary-valued application parameter spaces.


[Architecture diagram of libtuning: the application provides feedback to a tabular memory and to the hybrid tuner, which combines Nelder-Mead (NM) and ε-Greedy (eG) search with a nearest-neighbor (NN) model.]

Figure 3.2: The architecture of the libtuning autotuner: Application feedback is the input to both a tabular memory keeping track of the best known configurations and to the hybrid tuner. The hybrid tuner combines model-based prediction and empirical search, both of which are implemented as extensible plugins. Included by default are Nelder-Mead (NM) and ε-Greedy (eG) search, as well as nearest-neighbor (NN) prediction.

The library provides sensible defaults for this mapping for the most common parameter types: Bounded or unbounded integer or floating point values, as well as non-numeric values describing a set of choices. The mapping can be modified by the application developer.

The features outlined so far make libtuning applicable to most tuning scenarios. Besides its usability, libtuning’s key features are mechanisms to improve amortization. The autotuner reduces both the total number of configurations tried and the number of extraordinarily bad configurations tried. It achieves this through three methods:

Hybrid tuning: The performance of the tuning kernel generally not only depends on the tuning parameter configuration, but also on the system and application state. To exploit that fact, libtuning implements a hybrid autotuning method, combining search, memorization, and model-based prediction. An application may register arbitrary “indicators” with the tuner which characterize the dynamic state of the application and the system. An example is an abstract “input size”. Using online learning, libtuning automatically detects significant changes in the dynamic state and reacts to the change: Based on a predictor built through online learning, hybrid tuning can immediately switch to a learned configuration. If the exact same dynamic state was observed in the past, the memorized search result for that state is used. Otherwise, a trained model is queried for a predictably good configuration. The application is not required to provide training samples. The predictor is automatically trained for all observed indicator states from training samples produced during search.

Parameter constraints: Special application parameters often induce constraints on large portions of the parameter space. One example is a parameter controlling the choice of an algorithm. Any tuning parameters contained within one of these algorithms should only be updated if this specific algorithm has been selected. The algorithmic choice parameter thus controls the relevance of the algorithm’s parameters. On the other hand, the parameters controlled by the choice parameter are, in general, not independent of those that are not. They cannot be tuned separately. If such relevance constraints are marked by the application developer, libtuning constructs hierarchical search spaces to decrease dimensionality as much as possible while preserving dependencies.


1
2   tuning::Tuner OMPTuner;
3   int NumThreads = 8;
4   // Control parallel execution:
5   bool Parallel = true;
6
7   vector add(vector A, vector B) {
8     vector Result(A.size());
9
10    OMPTuner.start();
11
12    // Only run in parallel if tuning parameter is set:
13    #pragma omp parallel for num_threads(NumThreads) \
14                             if (Parallel)
15    for (size_t I = 0; I < Result.size(); ++I)
16      Result[I] = A[I] + B[I];
17
18    OMPTuner.stop();
19
20    return Result;
21  }
22
23  int main() {
24    auto N = OMPTuner.addParameter(&NumThreads, 1, 16);
25    // Parallel is a nominal parameter:
26    auto P = OMPTuner.addParameter(&Parallel, {false, true});
27    // Only tune NumThreads if Parallel is true:
28    OMPTuner.recordConstraint(N, P, true);
29
30    // Call add(...) repeatedly
31  }

Figure 3.3: Example: Record dependences between parameters.


This additionally allows using different search algorithms for different parts of the search space. Figure 3.3 shows an example of this feature. Introducing a Parallel parameter allows the addition to also be executed sequentially, in which case the NumThreads parameter becomes irrelevant. Registering this dependency with the tuner allows it to exploit that fact.

Analytical feedback: In many practical applications, the application developer has domain knowledge of parameter configurations and can predict whether or not a configuration is reasonable. In the example of the aphes compiler, several such predictions are possible. For instance, assigning two iterations of a parallel loop to a GPU is essentially guaranteed to result in non-optimal performance. To account for this, applications may reject configurations without measuring them. This way, libtuning can decrease the number of sampled bad configurations to improve amortization time.

To motivate hybrid tuning, reconsider the OpenMP vector addition example in Figure 3.1. The time measurements taken for the execution time of the for-loop tuning kernel depend not only on the NumThreads parameter, but also on the vector size. As the example stands, the NumThreads variable is the only thing the tuner can observe and manipulate. To also make the tuner consider the loop’s iteration count, we extend the example as shown in Figure 3.4. Now explicitly using the hybrid autotuner, we add an indicator in line 25 to capture the input size, which governs the loop’s iteration count. After registering the indicator with the tuner, the tuner automatically discriminates different input sizes. Configurations are searched per input size, and once converged will be used when a known input size is passed. Although in the example we differentiate between inputs that differ only in a single element, libtuning allows for arbitrary resolutions.

The simplest method to map indicators to configurations is a table. In fact, that is what the tuner in the example is using. It is obvious though that a tabular approach is limited: There can be an arbitrary number of different inputs, so the table would reach an impractical size. If inputs are similar, it might not be necessary to distinguish between them. For example, a configuration of the thread count for 2,000 elements will likely be good for 2,001 elements. Additionally, indicators might reflect continuous quantities, which need to be discretized. An application developer can solve both cases by choosing the appropriate resolution, but what is a good resolution?


1
2   tuning::HybridTuner OMPTuner;
3   tuning::Indicator InputSize;
4   int NumThreads = 8;
5
6   vector add(vector A, vector B) {
7     vector Result(A.size());
8     InputSize = Result.size();  // Update the indicator state
9
10    // Update NumThreads with the next configuration.
11    OMPTuner.start();
12
13    #pragma omp parallel for num_threads(NumThreads)
14    for (size_t I = 0; I < Result.size(); ++I)
15      Result[I] = A[I] + B[I];
16
17    // Report feedback. This API measures time automatically.
18    OMPTuner.stop();
19
20    return Result;
21  }
22
23  int main() {
24    OMPTuner.addParameter(&NumThreads, /*min=*/1, /*max=*/16);
25    OMPTuner.addIndicator(InputSize);  // Size is set in add()
26
27    // Call add(...) repeatedly
28  }

Figure 3.4: Hybrid, input-sensitive tuning of OpenMP vector addition with libtuning.

To relieve the developer of that task, we borrow from the field of machine learning, using online learning to construct a model of the indicator-configuration mapping on the fly. However, we will defer an in-depth discussion of this approach and other libtuning concepts, features, and design decisions until Chapter 5.

3.2 Autotuning with libtuning

For application developers, the main entry point into the library is the Tuner, which functions as a controller for both the search algorithms and the search space. It provides the API which drives the tuning process. Essentially, this API in its base form consists of two methods, next() and feedback(double). The former configures all the parameters with the configuration the search algorithms have requested as the next sample. The feedback method passes an application-specific feedback measurement to the search algorithms. The search algorithms immediately return the configuration to be sampled next, but the tuner only applies it at the next call to next(). By decoupling the parameter update from the search’s request, applications are able to modify the parameters without confusing the search: If the parameters were updated upon a call to feedback, any modification made by the application after the update would carry over into the next call to next and taint the measurement passed to the tuner.

On top of this basic API, libtuning provides a set of decorators that extend its core functionality. Application developers may thus compose decorators to create an autotuner that best fits their needs. In the following, we provide an overview of the three most important decorators. They are by default enabled in libtuning’s simplest tuner type, which was shown in the example in Figure 3.1.

Stopwatch Most of the time, applications intend to tune the runtime of some hot part of the code. The Stopwatch decorator implements the time measurement functionality, and exposes it through an alternative start()/stop() API, which is implemented in terms of next() and feedback(). The API measures the time span between calls to the two functions and passes that to the feedback() function.


NestedTuners In general, applications may wish to tune the implementation at different granularities. For example, an implementation of a specific matrix multiplication algorithm can integrate an autotuner optimizing the parameters of this algorithm. An application using this algorithm can itself contain another tuner optimizing all the algorithms’ parameters plus, for instance, the algorithmic choice. To avoid running these tuners concurrently, the NestedTuners decorator allows developers to aggregate tuners, so that at any point in time at most one of them may run. This is particularly important to the aphes compiler, which might parallelize both a function and its (indirect) caller. To prevent confusing the effects of parameter configurations, tuning both functions must be mutually exclusive.

Stabilization The measurements observed during tuning might be noisy. This is especially the case when optimizing runtime. To account for noisy measurements, the Stabilizing decorator reduces the effect of noise by repeating configurations and aggregating measurements. Application developers may request a relative error margin for a fixed 95% confidence interval. The decorator then repeats configurations until that error margin (or a maximum sample count) is reached.
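For comparison with the start()/stop() convenience API used in Figure 3.1, the following minimal sketch drives the tuner through the base next()/feedback() calls directly; done() and runKernel() are hypothetical application functions and not part of libtuning:

    tuning::Tuner MyTuner;
    int BlockSize = 64;
    MyTuner.addParameter(&BlockSize, /*min=*/1, /*max=*/256);

    while (!done()) {
      MyTuner.next();                // apply the configuration requested by the search
      double Runtime = runKernel();  // execute the tuning kernel and measure it
      MyTuner.feedback(Runtime);     // report the metric; the tuner minimizes it
    }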

3.3 Autotuning and Automatic Parallelization with APHES

Within the APHES framework, whose overall architecture is illustrated in Figure 3.5, we use libtuning as part of the runtime system, which is one of the two integral components of the framework. The other component is the aphes compiler, which parallelizes sequential applications and integrates them with the runtime system. Using APHES for an application requires no interaction from the application developer. A successfully parallelized application will automatically execute on the targeted platforms. The runtime system, which has been integrated into the application by the compiler, uses the autotuner to configure the work distribution across platforms and platform-specific parameters.

The compile-time component, the aphes compiler, analyzes and transforms sequential source programs. It performs two primary tasks. First, it searches for parallelizable loops, which we call the source regions of the program. The compiler then transforms the source regions into per-platform (parallel) target regions.


[Architecture diagram: a compile-time side (analysis and parallelism detection, the polyhedral model, code generation through CUDA, OpenMP, and Xeon Phi target plugins, and instrumentation with tuning parameters and indicators) and a runtime side (the tuning loop around the kernel with measurements, indicators, and configurations exchanged with libtuning).]

Figure 3.5: The APHES Framework Architecture: The framework is composed of a compiler and a runtime component. The compiler analyzes the program, transforms it for work sharing among multiple platforms, and instruments it to interact with the runtime system. The platform-specific transformations are implemented by an extensible collection of plugins. The runtime system provides platform management and interfaces with the libtuning autotuner.


[Diagram: source code is translated to LLVM IR with clang -emit-llvm, transformed by aphes -targets=omp,cuda, and linked into a binary that runs on both CPU and GPU.]

Figure 3.6: Program parallelization with the aphes compiler.

To identify the parallel source regions, aphes uses both the polyhedral model and classical dependence testing. Target regions are produced either based on the polyhedral model, if applicable, or using ad hoc transformations. The second task of the aphes compiler is to instrument the transformed program to interoperate with the runtime component. Instrumentation includes introducing tuning parameters to control the work distribution between platforms as well as platform-specific parameters, such as thread counts.

The runtime component can be further subdivided into two main parts. The first part is a small runtime library tightly coupled with the aphes compiler. Its main purpose is bookkeeping and execution management, handling device memory allocations, data transfers between host and target platforms, and most importantly the interaction with the autotuner. The libtuning autotuner is the second part of the runtime component.

The aphes compiler itself is implemented as a plugin system. Compiler developers can add new target platforms as additional plugins. Code generation for new platforms then interoperates with existing platform code generators automatically, and work sharing will happen without additional effort by the developer. When parallelizing a program, the compiler first runs program analyses to determine parallelizable loops. These loops are then passed to the target plugins for them to generate platform-specific code. The compiler supplies the plugins with all necessary dependence information and takes care of the work distribution. A plugin developer can thus focus on implementing the platform-specific parts. Additionally, the compiler maintains a registry of tuner instances, one instance per parallelizable loop. Plugins can register arbitrary tuning parameters with the tuner responsible for the loop currently being transformed. Parameter dependency constraints are automatically generated.


All of the inner workings of the parallelization and optimization are entirely transparent to the application developer, to whom the end-to-end process of parallelization looks as shown in Figure 3.6. Running, for example, the clang compiler on an input program in a supported programming language emits the LLVM intermediate representation of that program. Then, running the aphes compiler on the generated IR produces a complete binary, automatically linked with all required libraries. That includes the APHES runtime library and libtuning, but also any target-specific libraries, such as the OpenMP and CUDA runtimes. The compiler currently does not take care of translating the high-level frontend language into the LLVM representation. The application developer thus has to use a frontend compiler for the high-level language for this task. We discuss further details of the compiler, its analyses, and existing transformation plugins in Chapter 6.

3.4 Summary

In this chapter we presented an overview of the APHES framework and its components. We introduced the libtuning autotuner that is used within APHES to optimize platform-specific tunable parameters and the allocation of work to the platforms. Using a hybrid tuning approach combining empirical search, memorization, and online learning, it offers an efficient and effective way to optimize program configurations while being sensitive to the varying tuning contexts at the same time. Second, we introduced the aphes compiler, which searches for parallelism in input programs and parallelizes them for multiple platforms. Without any application developer involvement, the compiler transforms the program to enable automatic work sharing between heterogeneous platforms and to expose tunable parameters. To optimize the program at runtime, it is instrumented to interoperate with a runtime library which in turn interfaces with the libtuning autotuner for optimization.


Chapter 4

Related Work

This thesis presents a compiler that automatically parallelizes sequential applications. Transformed applications are instrumented to adapt to the actual system and inputs on which they are executed and react to changes therein. This dynamism is achieved through empirical online autotuning and machine learning methods. To put these contributions into context with prior art, we look at them from two angles, which are parallelizing compilers for heterogeneous systems and autotuning. In the field of autotuning, we must further differentiate between general purpose autotuning systems, machine learning and model-based techniques, and finally the intersection of compilers and tuning in the form of tunable parallel programming languages and parallelizing compilers. Since each of the fields on which this thesis touches has a rich history dating back at least two decades, the overview we give in this chapter cannot be complete. We thus focus on the most important or most closely related publications and cover the broad spectrum of the techniques. We will first look at autotuning in Section 4.1. In particular we discuss general purpose autotuners, both based on classical empirical search, and on machine learning and explicit performance models. In Section 4.2 we examine recent advances in automatic parallelization for heterogeneous systems. Lastly, in Section 4.3 we introduce approaches that combine parallelization and tuning through compilers and languages, either by tuning the compilation process itself or by adjoining the compiled program with a tuning system.

We close each section of this chapter by summarizing the individual features of the presented articles and comparing them to the features of APHES and libtuning. We focus on features that contribute to the goal of tuning automatic cooperative multiplatform execution, which we briefly introduce in the following.

The first set of features refers to the automatic parallelization of programs. It describes whether the compiler targets data or task parallelism and analyzes the program statically or dynamically, whether the parallelized program executes heterogeneously targeting a single platform other than the CPU, targeting multiple platforms at the same time, or cooperatively on multiple platforms. To be considered cooperative, the heterogeneous multi-platform execution must use all platforms simultaneously, each working on a part of the problem, and then merge the partial results. If a parallelized program executes on multiple platforms but not cooperatively, this means that only a single platform executes the parallelized region while other platforms are waiting. Such a program is not using the full capacity of the system. Secondly, approaches that somehow optimize the generated program (or the hand-written application in the context of the general purpose autotuners in Section 4.1) can do so either offline or online. Offline optimization refers to the compiler or autotuner optimizing the program given sample inputs before any production runs. Online optimization in turn happens during a production run on inputs observed in production, and can also occur within runtime systems or JIT compilation scenarios. We additionally emphasize whether the tuning is search-based, model-based, or using simple heuristics. If it is model-based, we distinguish between a learned model and one that has been hand-implemented by either the application developer or the compiler developer. Moreover, the model inputs can either be automatically extracted program features or can be supplied by an application developer. Lastly, we consider whether an approach is input sensitive, i.e., whether it is able to react appropriately to varying inputs. Input sensitive approaches are required to react to new inputs at runtime (i.e., online), but the decision making might be trained offline on a priori sample data.

The feature summaries following each section are in tabular form. To ease readability the tables always include all features listed above. Because each of the following sections focuses on works dedicated to particular aspects of autotuning or automatic parallelization, not all features apply in general. In that case, the irrelevant features are marked in gray.


4.1 Autotuning

In this section, we discuss prior art related to the goals of this thesis in the field of autotuning. This will cover two major areas:

1. General purpose autotuning approaches and their applications (Section 4.1.1).

2. Machine learning and performance models to adapt an application to new hardware or new inputs (Section 4.1.2).

4.1.1 Tuning Algorithms and Autotuners

At the turn of the century, the high-performance computing community began to realize that hand-optimizing the same algorithms over and over for every new upcoming system and accelerator hardware as well as different types of inputs was not sustainable. Consequently, interest in automating this tedious task was spawned.

In 1995, Brewer was the first to address selection of algorithms and data layouts at runtime [Bre95]. His system composes a library of algorithm implementations with an autotuner, to automatically select an algorithm and optimize algorithm parameters for given inputs. The autotuner generates generalized least squares models to predict execution time from a set of input features, such as problem size. This system achieves portability across platforms through profiling calibration inputs automatically generated from the algorithms and parameter ranges. Hence, developer effort is limited to the burden of implementing the different variants of the algorithms. Brewer reports a success rate beyond 99% for this prediction on four different systems. Although his work is not the first to explore performance prediction, it can be considered the first applicable to online tuning. Considering the domain of offline tuning, however, it is predated by Alan Sussman’s papers and doctoral thesis, whose parallelizing compiler will be discussed in Section 4.3.1.

The ATLAS system by Whaley and Dongarra [WD98] is a well-known approach combining a BLAS library with autotuning capabilities. During installation, the library automatically self-optimizes loop structures, tiling, cache reuse, and parallelism for the system to which it is being deployed. The tuning is offline, during installation, and samples a number of benchmark programs shipped with the library. The optimization itself is based on a hand-crafted model, whose parameters are configured offline during optimization, and which can then pick the fastest linear algebra algorithm implementation for a given input.

In his doctoral thesis and subsequent research, Richard Vuduc made major contributions to the field of autotuning as a whole and to tuning BLAS algorithms in particular [VD00; VDB01; Vud03]. In 2001, he was the first to advocate empirical search in the space of possible implementations [VDB01]. Using the PHiPAC system (which we discuss in Section 4.3.1), he demonstrates that approximative global optimization can produce configurations that are within 5% of the optimal performance. Approximation is implemented by stopping an exhaustive search early based on a statistical model. To account for the fact that different configurations are optimal given different inputs, the system constructs run-time selection rules using Brewer’s linear regression approach as well as Support Vector Machines.

The year 2002 marks the publication of ActiveHarmony by Ţăpuş et al. [ŢCH02], which is to date one of the most prominent general purpose autotuners. It provides a sophisticated distributed system that integrates with arbitrary client libraries or programs. The performance monitor that observes execution time in the client application relays information to a central server, which responds with tuning decisions. In the central Harmony server, the system employs the Nelder-Mead downhill simplex algorithm.

In addition to ActiveHarmony, the Nelder-Mead algorithm is also applied to ATLAS. You et al. [YSD05] make several modifications to the original algorithm. These are necessary because autotuning differs in several details from the numerical optimization problems it was designed for. In practice, tuning parameters are often integer valued and bounded. Besides the bounds, more complex restrictions may arise from parameter semantics or dependencies between parameters, leading to some possible configurations becoming invalid. The modified algorithm accounts for these through stationary penalty values (i.e., setting the function value for invalid configurations to infinity). The implementation of the algorithm in this thesis is closely based upon You et al., albeit choosing a different strategy to deal with boundaries and invalid points.

With AtuneIL, Schaefer et al. [SPT09] do not develop an autotuner or tuning algorithm, but an annotation language to integrate an autotuner with a C#, C++, or Java program. By inserting pragma statements into an existing program, developers define parameters, parameter types, and domains, as well as time measuring points for the tuner to sample.

Additionally, the annotations may contain constraints encoding dependencies between parameters. Using constraints, AtuneIL can drastically reduce the search space that needs to be explored. In a case study on two applications, Schaefer uses AtuneIL as part of the Atune framework [Sch09] and evaluates search-based autotuning using hill climbing and random sampling. Speedups of three and seven are reported, while the search time for hill climbing was reduced by over 50% due to AtuneIL’s handling of constraints.

With Perpetuum, Karcher and Pankratius [KP11] introduced an always-on online tuner that is integrated into the Linux kernel. Through system calls, client applications interface with the tuner, registering application-level parameters which Perpetuum manipulates directly. Like ActiveHarmony, Perpetuum uses the Nelder-Mead search algorithm.

In 2012, Bergstra et al. [BPC12] develop an approach that combines empirical search with model-based tuning. They identify issues that are closely related to those explored in this thesis. Model-based tuning on the one hand allows for fast exploration of the search space, but suffers from inaccuracies caused by incomplete knowledge about runtime characteristics. Examples are varying inputs and hardware. Empirical tuning on the other hand can evaluate the program with real inputs and real hardware, but the size and dimensionality of the search space render this a time-intensive process. This realization concurs with the motivation for hybrid tuning. As a compromise, Bergstra et al. propose combining a statistically-derived model with empirical search, albeit in a way that is orthogonal to hybrid tuning. The key difference between the two techniques is that hybrid tuning uses empirical search to explore the real search space, whereas Bergstra et al. apply it on their model. Evaluating their approach on the GPU implementation of a filterbank correlation kernel demonstrates it to be competitive compared to pure empirical tuning, but at a fraction of the cost.

Another autotuning technique that pursues goals very similar to those of this thesis is SiblingRivalry [Ans+12]. Targeted towards always-on online autotuning, Ansel et al. devise a scheme to prevent bad configurations from poisoning the amortization time. By splitting compute resources into two equal partitions, they are able to evaluate two configurations at the same time. In one partition, they always execute the currently best known variant. In the other one, a genetic algorithm samples the search space. Whichever variant completes first wins the race; the other one is immediately terminated. Through this strategy, the effect of selecting bad configurations is never visible at runtime.


However, this property comes at the cost of half the execution resources, which ultimately disqualifies this technique from being applicable to the problems investigated in this thesis: we are investigating optimal resource utilization, so diverting parts of the available resources to different tasks contradicts the goal. Moreover, some platforms, such as GPUs, are always allocated to exactly one task. On such platforms SiblingRivalry is thus not applicable.

AtuneRT [Til+14], developed by Tillmann et al., is an online autotuner similar to ActiveHarmony and the successor to the tuning framework Atune. Like ActiveHarmony, AtuneRT uses Nelder-Mead search. In the article, the autotuner is applied to optimizing platform-specific parameters for CUDA applications, that is to say, thread block sizes. For one benchmark, an application-specific parameter is also tuned. The authors report speedups of 4–28%.

Currently, the tuner that is most widely used in various applications across the fields is OpenTuner, also developed by Ansel et al. [Ans+14] in 2014. A simple interface and a Python implementation make it easy to integrate with most applications. Its biggest selling point however is that it relieves the application developer of the need to select a tuning mechanism. In OpenTuner, this hyperparameter is treated as a first-class tuning parameter and is controlled by a meta-tuner. Application developers select a set of search- or model-based techniques from a builtin library, or implement their own. The meta-tuner then orchestrates the searches. The AUC Bandit strategy [Pac+12] selects a search technique in every iteration, advancing a global knowledge database of the search space. Because of this meta-technique, OpenTuner does not work well as an online tuner, since it violates one of the requirements we defined in Section 1.1: Sampling continues indefinitely and there is no notion of convergence.

The most recent general offline tuning framework is ATF [RHG17], published in 2017. Like AtuneIL, interaction with the tuner is implemented with the help of a pragma DSL for C and C++ programs. A key feature of ATF is the ability to specify arbitrary constraints on parameters. To enforce these constraints, ATF exhaustively enumerates all points in the search space. While this is an elegant way to avoid holes in the space that conflict with preconditions of many numerical optimization algorithms, it is not applicable to floating point parameters, which cannot be realistically enumerated. The framework is further generic in the sense that it allows for extension with arbitrary user-provided search algorithms.


Table 4.1: Comparison of the tuning algorithms and autotuners along the features data parallelization, task parallelization, static analysis, dynamic analysis, heterogeneous, multiplatform, cooperative, online tuning, offline tuning, search-based, model-based, input sensitive, learned model, and automatic features. Compared approaches: Brewer, ATLAS, Vuduc, ActiveHarmony, You et al., Atune(IL), Perpetuum, Bergstra et al., SiblingRivalry, AtuneRT, OpenTuner, ATF, and libtuning.

In their paper, Rasch et al. provide a simulated annealing implementation as well as an interface with OpenTuner, giving access to its extensive algorithm library and meta-tuner technique.

In Table 4.1 we summarize the central aspects of the approaches presented in this section and compare them to our libtuning: Ours is the only technique combining search-based and model-based online tuning, learning the model on-the-fly. Of the other online autotuners only Brewer combines search and modeling, which allows him to be sensitive to inputs. Unlike our approach, however, Brewer’s model is linear and constructed offline, which means it cannot capture non-linear effects and adapt to changing inputs or hardware. The same argument applies to the input sensitive and model-based offline tuners. Although SiblingRivalry is not input sensitive, it pursues a goal similar to ours, since its main focus is

to reduce amortization time. The reduction comes, however, at the price of half the compute resources, which requires the compute resources to be partitionable. Both are undesired restrictions. The remaining two online tuners we presented, ActiveHarmony and Perpetuum, both make no attempt to be input sensitive or to optimize amortization time. Note that AtuneIL is missing from this list, because it is not itself an autotuner, but a language to interface between a program and a tuner. Because we are comparing different autotuners here, we grayed out the properties that are specific to parallelization and compilation. Automatic parallelization will be covered in Section 4.2.

4.1.2 Machine Learning and Performance Models

Although general autotuners and tuning algorithms are powerful and available, applications often take a different direction to optimize their parameters. Due to their generality, autotuners need to solve a black-box optimization problem. Because a white-box approach promises more precise tuning, with shorter search times and possibly better results, many applications employ machine or performance models to optimize themselves. The advantage of white-box tuning comes at the cost of having to implement the model. The quality of the model further strongly affects the resulting performance. In this section, we examine related work applying white-box tuning in various forms.

One of the first approaches applying performance models to online tuning was MATE [Mor+03; Mor+04]. It provides a dynamic distributed tuning environment for PVM applications. MATE integrates the optimization of messaging parameters (e.g., aggregation and TCP settings), memory allocation, PVM communication options, as well as application parallelism settings such as the number of worker threads. For all of these components, MATE supplies measuring instrumentation (to detect performance problems), performance models (to detect the problem), and tuning options (to correct the problem). Based on these means, MATE then adapts the distributed application dynamically.

With StarPU, Augonnet et al. offer a runtime system and API to unify heterogeneous tasking [Aug+09]. For a task implementation (e.g., matrix multiplication), developers explicitly express the task’s data dependencies and access behavior. The accessed slice of memory is labeled as read, write, or read-write. The runtime system then takes care of task scheduling onto available platforms according to data and explicit task dependencies and attempts to find an optimal schedule.


It can thus be considered both an online and offline tuning system. Scheduling is guided using predefined policies based on task priorities and per-task performance models. Greedy policies implement a first-come-first-serve behavior, where a pending task is assigned to the first ready platform. Similarly, a weighted random policy chooses a platform at random with user-supplied per-platform weights. Lastly, performance model policies predict the runtime of a task per platform using models provided by the developer or automatically generated by StarPU [ATN09]. Another runtime system managing tasks across heterogeneous systems was developed by Kicherer et al. [Kic+12]. On top of having platform-specific implementations per task, they also support having multiple implementations per platform that differ in features such as floating point precision or vector instruction set. Implementations can also require or favor specific input-set characteristics. For an application, users provide a library of implementations for the different platforms and attributes. At runtime, the management system will then pick the best implementation whose attribute requirements are satisfied. To find the best implementation, Kicherer et al. employ an online learning scheme [KBK11]. Based on user-defined features (e.g., "matrix size"), the classification scheme samples inputs for several feature values. Missing performance values for features and implementations are obtained through linear interpolation, thus assuming a linear relationship between the features and the observed runtime. Finally, the classifier picks the best implementation for observed feature values and determines value ranges for each implementation. In 2013, Balaprakash et al. [BGW13] proposed surrogate modeling using active learning of dynamic trees to automatically derive performance models from observations. The surrogate model predicts the program runtime for a given configuration. With active learning, their approach is able to refine the performance model iteratively, determining the unevaluated parameter configuration to sample next. Balaprakash et al. chose dynamic trees as their surrogate model, which combine classical regression trees with Bayesian inference. They apply this technique to loop optimizations of numerical kernels, modeling their energy consumption, and optimizing an MPI topology. With only a few hundred samples, they show that their models make the correct prediction over 90% of the time. Much like the work by Kicherer et al., Nitro [Mur+14] is an autotuning system for picking optimal code variants depending on inputs. Developers define multiple variants of an algorithm and expose real-valued features that classify the input data.


Table 4.2: Comparison of the model-based tuning approaches. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: MATE, StarPU, Kicherer et al., Balaprakash et al., Nitro, Bao et al., and libtuning.

For example, for a sparse matrix-vector product, such a feature is the average number of non-zero elements per row. Given training inputs provided by the application developer, Nitro builds an SVM model for the inputs' feature vectors and observed runtimes during an offline training phase. At runtime, it will then classify a new input and thus predict the optimal implementation for this particular input. Bao et al. [Bao+16] take a different view on autotuning. Instead of optimizing performance, they optimize energy efficiency. For a given benchmark, various CPUs achieve maximal efficiency at different frequencies. Using the polyhedral model, the approach derives various static properties from the representation, such as the number of (parametric) loop iterations, the data access behavior, as well as cache utilization and miss rate estimations. For a given target CPU, they compute a profile for benchmarks selected to stress individual features. From this profile, they then construct a decision tree to determine optimal frequency settings at runtime. In Table 4.2 we summarize the central aspects of the works presented in this section, and compare them to our autotuner: libtuning is the only approach combining online search and online learning. Balaprakash et al.'s surrogate models

stand out in the comparison because they are the only generic approach. All others are tied to a specific application and thus solve only white-box tuning problems. Note further that while we labeled some approaches as "Offline", the actual adaptation of these applications happens at runtime, as it must because of the input sensitivity; the model, however, is learned a priori on sample benchmarks. StarPU and Kicherer et al.'s approach are the only two here that are capable of online learning. Both construct a linear regression model at runtime from observations, which is a sensible choice in their respective applications. In libtuning, we support more general models and combine them with search to cope with the problem of exploring large search spaces. We marked in gray the properties that are irrelevant in this comparison because they are specific to parallelization and compilation.

4.2 Automatic Parallelization for Accelerators

With the advent of vector machines and array computers in the 1970s, the automatic transformation of sequential codes into parallel programs for high-performance machines became a goal both in research and practice. In more recent years a new class of parallel machines, namely massively parallel accelerators such as GPUs or FPGAs, has reignited the hunt for automatic tools to exploit these platforms optimally. The increased complexity of these platforms, exacerbated by the fact that today's systems have access to multiple accelerators at the same time, has spawned a large new body of research on parallelization tools that target individual accelerators and, most importantly, simultaneous execution on multiple accelerators. The critical component of the parallelization process is identifying parallelizable program parts. Current approaches fall into one of four classes:

1. Automatically based on dependences: Computing data and control dependences between memory accesses, either conservatively or optimistically, yields independent program paths. These paths may be executed in parallel (a minimal example follows this list).

2. Using the polyhedral model: Although parallelization using this model is also based on computing data dependences, it is unique in the way it models control. As a comprehensive algebraic representation of the code, it offers a unified framework for dependence analysis and code transformations.


3. With programmer help: Conservative dependence analysis is often too imprecise to produce meaningful parallelizable regions, whereas an optimistic analysis comes with the hazard of introducing bugs. Instead, parallelization hints by the programmer in the form of explicitly parallel DSLs lift this burden from the compiler.

4. Detection of parallel patterns: Where dependence-based parallelization is too inaccurate and parallel DSLs are too costly in terms of programmer time, pattern-based parallelization can provide a middle ground. Acknowledging the fact that programmers tend to build programs from specific building blocks, parallelizable building blocks can be detected and transformed into well-known parallel design patterns, e.g., pipeline or master-worker constructs.
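To make the dependence criterion of the first class concrete, the following minimal C++ sketch contrasts a loop without loop-carried dependences with one that carries a dependence across iterations. It is an illustrative example only and is not taken from any of the tools discussed in this chapter.

#include <cstddef>
#include <vector>

// No loop-carried dependence: a[i] depends only on b[i] and c[i], so a
// dependence test finds the iterations independent and the loop is a
// candidate for DOALL parallelization.
void independentIterations(std::vector<double>& a, const std::vector<double>& b,
                           const std::vector<double>& c) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = b[i] + c[i];
}

// Loop-carried dependence: iteration i reads a[i - 1], which the previous
// iteration wrote, so the iterations must not run concurrently without
// further transformation.
void dependentIterations(std::vector<double>& a) {
  for (std::size_t i = 1; i < a.size(); ++i)
    a[i] = 0.5 * a[i - 1];
}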

In the following, we will examine works from each of these categories with goals related to this thesis. We limit the scope in this section to approaches that manage load balancing between platforms and platform-specific configurations either statically or through simple heuristics. Dynamically adaptive approaches will be discussed in Section 4.3.

4.2.1 Dependence-based Parallelization

Dependence testing is the fundamental operation in parallelizing loops. To decide parallelizability, memory accesses in a loop are checked for loop-carried data dependences. That is, a memory location written in one loop iteration must not also be read or written in another. The vast majority of approaches for automatic parallelization rely on this test. In the following, we give an overview of the parallelizers from that class published within the last 25 years. We focus on both heterogeneous as well as classical CPU-only targets and limit the scope to tools making tuning decisions statically based on simple heuristics. In 1991, Irigoin et al. [IJT91] published the PIPS framework for parallelization using interprocedural dependence analysis. The framework targets data parallelism using static analysis. Although a handful of compilers predate it, the main goal of those was vectorization and fine-grained parallelism. PIPS addresses coarse-grained parallelism by transforming Fortran77 loops into vectorized or DOALL versions. Based on the findings in 1992 [BE92] on the lack of effectiveness of prevailing auto-vectorizers targeting SMPs, Blume et al. [Blu+96] designed Polaris

for Fortran programs. Polaris, similarly to PIPS, targets data parallelism using static analysis. Its key features were privatization of temporary variables and appropriate handling of induction and reduction variables. The SUIF compiler [Hal+96] was a long-running effort beginning in the mid-90s at Stanford University to provide a compiler intermediate representation and framework. Its authors Hall et al. are likely the first to recognize that with symmetric multiprocessors, cache-oblivious parallelization can actually harm overall performance. SUIF thus explicitly optimizes data layout and accesses to bridge this gap. To make parallelization applicable to programs beyond simple numerical kernels, SUIF performs flow- and context-sensitive interprocedural dependence analysis statically [Hal+05]. Wang et al. [Wan+08] present a fundamentally different approach to dependence-based parallelization. Instead of finding independent iterations of a loop, the authors detect independent sets of instructions within loop iterations. To be precise, they compute backwards slices for instructions within time-intensive regions of the program. The backwards slice for an instruction is the subset of instructions from that region that have an effect on this instruction. Individual slices are executed in parallel. Moreover, slices are built speculatively, meaning that data dependences are computed dynamically from a profile, which might miss existing dependences that occur only rarely. Another approach based on dynamic dependences is by Hammacher et al. [Ham+09], who determine parallelizability optimistically from Java program traces. Optimistic parallelization bears the risk of producing unsound transformations, leading to bugs. As a consequence, the authors do not automatically transform the program, but instead present parallelization suggestions to the developer, relying on their expertise. Implemented on top of PIPS, Guelton et al. [GAC12] use convex array regions to compute the extent of the memory region accessed by a parallel or parallelizable loop. This type of information is required to determine data transfers on parallel NUMA machines, most importantly for targeting GPUs for offloading computations. With Par4All [Ami+12], PIPS and convex array region analysis are combined for code generation and offloading to CUDA or OpenCL GPU targets. DiscoPoP [LJW13] is a tool which goes beyond parallelizing loops. It combines dynamic dependence profiling with an alternative view on data flow. Decomposing the program into read-modify-write sequences of memory locations, the tool

obtains a dependence graph of small computational units. If independent, such units can be executed in parallel. Mapping units onto threads provides control over the granularity of the parallelization, and allows parallelizing loops, (sub-)tasks, or entire functions in a uniform fashion. This method can be applied both to data and task parallelism. In 2014, Sura et al. [SOB14] presented a hardware-software co-design to exploit a form of instruction-level parallelism, automatically mapping sequential innermost loops to multiple cores. They find sequences of instructions independent of each other with respect to memory and control dependences. These are statically dispatched to instruction queues attached to the parallel cores. Because that requires hardware support, performance is evaluated in a simulator. Streit et al. [Str+12; Str+13] developed a tool called Sambamba, treating task parallelization as an optimization problem. The dependence graph of the source program is mapped onto tasks by formulating the mapping as an integer linear program, a technique the authors named generalized task parallelism [Str+15]. A landmark feature of Sambamba is the ability to dispatch tasks adaptively based on heuristics. The runtime system uses a small set of (static) heuristics to decide whether a new task should be spawned, or whether to select the sequential version. The heuristics limit the number and nesting depth of tasks based on the number of cores and the current system load. The sequential version is selected if the system load exceeds 90%, if the number of tasks exceeds twice the number of cores, or if the parallel nesting level exceeds the logarithm of the number of cores. Table 4.3 summarizes the properties of the presented dependence-based parallelization techniques, and compares them to the properties of the APHES framework: Except for Par4All, all compilers create parallel programs for the CPU platform. While Par4All does target the GPU, it only targets a single platform, and thus does not address cooperative multiplatform execution. Another distinction between the APHES compiler and those listed here is that we rely only on static analysis and data parallelism detection. While dynamic program analysis produces more precise results, it is valid only for the analyzed inputs. For us this is insufficient, because being able to optimize performance online and for varying inputs is of little relevance when the transformed program is only valid for a known set of inputs. We further do not support task parallelism because GPUs generally offer very little acceleration in that domain, especially since the number of different tasks is constant and often small.


Table 4.3: Comparison of the dependence-based parallelization approaches. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: PIPS, Polaris, SUIF, Wang et al., Hammacher et al., Par4All, DiscoPoP, Sura et al., Sambamba, and APHES.


In the context of our classification criteria, instruction-level parallelism as it is exploited by Sura et al. can be understood as either data- or task-parallel and thus ticks both columns in the table. The parallelization approaches we discussed in this section do not use autotuning to make optimization decisions, and we have grayed out the properties in which a comparison against APHES is thus impossible. The autotuning features of APHES are shown in the table for completeness.

4.2.2 Parallelization of (Mostly) Affine Programs with the Polyhedral Model

The polyhedral model, also referred to as the polytope model, is a framework for loop representation and optimization that goes back as far as the mid-sixties. The foundations were laid by Karp, Miller, and Winograd [KMW67], who investigated the scheduling of computations based on recurrence equations. Lamport [Lam74] was the first to apply it to automatic parallelization in the seventies. In the decades since then, dozens of systems have been developed which perform automatic parallelization with the help of the polyhedral model in some form. Because of this, we will limit ourselves to the most recent advances, and especially those that target heterogeneous systems. The polyhedral model is a static analysis technique and helps to maximize data parallelism. Consequently, all approaches we discuss satisfy the corresponding classification criteria. Probably the first to apply polyhedral optimization to automatic offloading to GPUs are Baskaran et al. [BRS10]. Because their technique relies on a form of autotuning, however, their work will be discussed in Section 4.3.2. In her doctoral thesis, Jimborean [Jim12] developed a framework for speculative parallelization with the polyhedral model. The VMAD ("Virtual Machine for Advanced Dynamic analysis and transformation") system generates multiple OpenMP-parallelized versions of loops at compile time, and patches the running application dynamically if an applicable parallel version is available. Parallel versions are selected with the help of instrumented versions that profile the code. PPCG by Verdoolaege et al. [Ver+13] is a polyhedral source-to-source compiler which transforms C to CUDA code. It generates GPU code by applying multilevel tiling to account for the levels of parallelism and memory hierarchy of CUDA GPUs. To also parallelize iterative computational kernels that exhibit an outer "time" loop, Grosser et al. extend PPCG with a split tiling mechanism


[Gro+13] and a hexagonal tiling scheme [Gro+14]. PPCG uses variants of the Pluto [Bon+08] algorithm and the CLooG [Bas04] code generator for scheduling and code generation. Whereas in PPCG the GPU mapping occurs fully automatically, the Loo.py [Klö14] programming system gives the developer control over the polyhedral transformations applied to loops. Embedded in Python, it allows the developer to write computational kernels in an array language, and to apply transformations and inspect the results. The KernelGen tool [Mik+14] is similar in spirit to PPCG. The key difference between the two is that KernelGen is based on LLVM, thus supporting a more general range of input languages, and that KernelGen attempts to minimize data transfers between host and device. Instead of eagerly transferring all data at every GPU kernel launch, the runtime system provides a segmentation fault handler which performs the transfers just-in-time. Another work that emphasizes memory transfers and uses the polyhedral model to optimize them is Molly [Kru14a]. Molly is an LLVM-based tool implemented on top of Polly [GGL12], LLVM's polyhedral loop optimizer. The unique goal of this tool is to parallelize lattice QCD simulations on a distributed machine [Kru14b], which requires generating efficient code for the explicit communication between nodes. Closely related to Molly is also the work by Damschen et al. [Dam+15]. Using a just-in-time compiler that is executed as a server on an Intel Xeon Phi accelerator, a client running an application can dispatch parallelizable loops to this server, which then compiles and runs the code. The parallelization itself is implemented again with the help of Polly, which is used to generate OpenMP code. In 2016, Polly itself gained the ability to also generate code for CUDA GPUs [GH16]. Under the name Polly-ACC it uses PPCG as a library to perform polyhedral GPU mapping, and then generates CUDA code as well as the necessary data transfers. Tile sizes, GPU parameters, and memory mappings are decided using PPCG's built-in heuristics. In Table 4.4 we summarize the properties of the polyhedral compilers we presented, and compare them to the APHES framework. Although many approaches exist that offload computations to accelerators such as GPUs or Xeon Phis, none of them targets multiple platforms, neither as alternatives nor for cooperative execution.


Table 4.4: Comparison of the polyhedral parallelization approaches. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: Lamport, Jimborean, PPCG, Loo.py, KernelGen, Molly, BAAR, Polly-ACC, and APHES.


Again, we mark in gray those properties which cannot be compared in this section and show the autotuning features of APHES for completeness.

4.2.3 Explicitly Parallel DSLs

In this section we discuss prior art in the field of explicitly parallel programming languages. Because this thesis deals with multiplatform parallelization, this section is scoped to languages that target multiple platforms simultaneously. Many languages exist that target multiple platforms, but do so explicitly: the developer has to manually choose a target device and program accordingly. Popular languages such as OpenMP and OpenCL belong to this category, where device types and data movements need to be written out. Instead, we focus here on approaches that make data movement and platform selection implicit. The Merge framework [Lin+08] is a library approach to programming a heterogeneous system based on the map-reduce programming pattern. A map-reduce program is built from a sequence of map and reduce operations, which are trivially parallelizable. The framework supports execution of the work on a GPU and a CPU simultaneously, heuristically splitting the workload: the application developer implements function variants for multiple platforms and annotates them accordingly. The framework then distributes tasks (called "work units") from a work queue among the available platforms for which an implementation of the task is available. Whenever two platforms are available and viable, Merge chooses based on preferences hand-encoded by the application developer. Although not specifically a DSL itself, Ocelot [Dia+10] transforms an existing DSL, namely CUDA, for execution on both GPUs and CPUs. Unlike in this thesis, however, an Ocelot-transformed program executes only on one platform at a time, and does not distribute the work. Ocelot is a binary translator, compiling PTX code, an assembly format for CUDA programs, back to LLVM for offloading to various platforms. Compilation happens just-in-time, which in principle enables dynamically adapting to inputs and switching between platforms, although the authors do not investigate this in depth. Ocelot bears similarities to another system, named TwinPeaks [Gum+10]. Instead of targeting PTX, TwinPeaks is based on OpenCL, and compiles OpenCL code targeted to GPUs or CPUs. Although OpenCL is a language built to be target agnostic, optimized programs are often not. This tool hence takes great


Table 4.5: Comparison of the languages and parallelizing compilers. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: Merge, Ocelot, TwinPeaks, OMPSs, and APHES.


care of optimizing the GPU-targeted program for CPU memory and cache layouts as well as vector units.

OpenMP is a well-known domain-specific language that greatly simplifies writing parallel codes by extending sequential code with simple and compact annotations. In 2010, Ferrer et al. proposed the OMPSs programming model [Fer+10], which extends OpenMP with the ability to explicitly offload OpenMP-parallelized regions to accelerators. Originally, OMPSs targeted OpenCL, but it has since been incorporated into the OpenMP 4.5 standard and now supports NVIDIA GPUs as well as other accelerators [SKK17].

We summarize the explicitly parallel languages discussed here in Table 4.5. While all presented languages and compilers generate code for multiple platforms, only the Merge framework supports cooperative execution. The execution is, however, limited to task parallelism, which is orthogonal to what we investigate with APHES. Merge furthermore determines the distribution of work heuristically based on decisions made by the application developer, whereas we achieve this automatically. Marked in gray are the autotuning properties which are not considered in this section.


4.2.4 Pattern-based Detection of Parallelism

To avoid the inaccuracy of dependence-based parallelism detection and the programmer effort mandated by parallel DSLs, some approaches focus on detecting recurring parallel patterns in sequential programs instead. These approaches are discussed in this section. Although the Patty tool by Molitorisz et al. [MMT15] uses a pattern-based parallelism detector, its discussion is postponed until Section 4.3.2 because of its autotuning facilities. Rul et al. [RVD07] claim the first attempt to identify certain patterns in sequential programs at a function scope. While existing parallelizers aim to transform loops in mostly numerical programs, Rul et al. operate at the function level. With the help of data dependence profiling, they construct an interprocedural data flow graph, in which they detect pipeline patterns for parallelization. Another work that extracts pipeline parallelism was presented in 2010 by Tournavitis and Franke [TF10]. Starting from a sequential C program, they iteratively decompose the data dependence graph top-down into pipeline stages. Further, they also identify pipeline stages that allow replication. In principle, the decomposition and replication enable optimizing pipeline throughput automatically; however, the authors do not discuss this possibility. Bones [NC14] by Nugteren et al. is a static pattern-based tool for automatic parallelization for CUDA GPUs. It classifies algorithms in C code into "algorithmic species" based on the memory access pattern, computed from either a polyhedral representation or array reference characterizations. The detected algorithmic species are mapped onto a predefined catalogue of parallel skeletons. With few exceptions, the applicability of automatic parallelization is limited to (multidimensional) array computations, because only in this case is it possible to derive accurate dependence information. In 2015, Aguston et al. [AAH15] presented a technique to tackle parallelization for linked data structures. Finding specific patterns in the source code that correspond to loops traversing linked data structures, they embed the linked data structure into arrays and convert the loop to index-based iteration, a process they call "skeletonization". The result of this is presented to the user as a recommendation, providing parallelization hints. The Distributed Multiloop Language by Brown et al. [Bro+16] is in the narrowest sense of the term not a tool for automatic parallelization. Instead, it serves as an intermediate language for implicitly or explicitly parallel programming languages. The intermediate language provides representations for data as well as nested parallelism, and supports multicore processors, GPUs, and even distributed, heterogeneous systems with non-uniform memory accelerators.


Table 4.6: Comparison of the pattern-based parallelization approaches. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: Rul et al., Tournavitis and Franke, Bones, Aguston et al., DML, and APHES.

We summarize this section in Table 4.6, and compare the presented approaches to the APHES framework. Most of the pattern-based parallelizers target task parallelism, likely because tasks are easily represented as patterns. Of the discussed works, only Bones and DML target heterogeneous platforms. DML also supports multiple different platforms, albeit not cooperatively. Cooperative execution is thus the biggest distinction between the APHES framework and the works we presented in this section.

4.3 Autotuning Languages and Compilers

In this section we review recent advances in combining the fields of autotuning, compilers, and automatic parallelization. In the domain of automatically producing portable application performance, two flavors have emerged. On the one side, programming languages or DSLs were developed in which the performance-relevant degrees of freedom are explicit. The

language's compiler or execution environment configures these parameters, either at compile or execution time. On the other side, the compiler community has invested time and effort in automating the tuning of program transformations for general-purpose languages. That includes traditional compiler optimizations, automatic parallelization, or switching between versions or implementations. Traditionally, compilers make decisions based on heuristics, or on cost or performance models. Models and heuristics must make assumptions about inputs and hardware characteristics. For general applications, the input domains are unknown, making accurate estimates of their characteristics hard. Secondly, hardware evolves over time, which requires constant updates to the models and heuristics. Empirical search has been investigated in addition to heuristics and models to relieve compiler designers of those two difficulties. In the following, we first examine related work in the domain of compilers that automatically parallelize or optimize programs either with the help of performance models and machine learning, or using empirical search. Subsequently, we explore the domain of explicitly tunable programming languages and DSLs.

4.3.1 Modeling and Machine Learning for Autotuning Compilers

Applying execution models to mapping an implicitly data parallel language onto a parallel machine was pioneered by Alan Sussman in 1992 [Sus92]. He builds a compiler for the functional Sisal language. Based on predictions of the parallel runtime obtained from an analytical model, his compiler maps Sisal programs onto systolic array processors. The model is implemented by the compiler and thus does not require the application developer to define the relevant features. Because we focus on compilers in this section, the same holds for all approaches we discuss here. Similarly, we classify all approaches in this section as offline tuning techniques, because they build the model offline. A related approach is taken by Cavazos et al. in 2006 [Cav+06]. Instead of using a hand-crafted analytical model to predict execution time for sequences of optimizations in the SUIF compiler, they train a feed-forward neural network on a set of sample transformations. These probe transformations are applied to the input program. Observing the resulting speedup provides a training sample for the model. They show that only a small number of samples (64) is required to

achieve an error rate better than 10%. In 2007 [Cav+07], they complement this approach by learning a logistic regression model instead. In this case, they build a model not on speedups but directly on dynamic application characteristics. The features that define these characteristics are hardware performance counters. The compiler observes the counters for an input program, and with the help of the model predicts the optimal sequence of optimization transformations. Wang et al. provide an approach that can be seen as an intersection between the work of Sussman and Cavazos et al. [WO09]. They build a compiler for OpenMP programs which optimizes the parallel mapping. In a (simple) OpenMP program, there are two degrees of freedom: the number of threads active in a parallel section and the policy of scheduling chunks of data onto threads. On a set of program features, they train a neural network to predict the optimal number of threads, and a support vector machine to determine the scheduling policy. Relevant program features are static properties, such as instruction or load/store counts, and dynamic features, such as loop iteration counts and cache hit rates. Subsequently, Tournavitis et al. [Tou+09] extend this technique for automatic parallelization. They detect parallelism dynamically, and map it onto OpenMP. SVM training and prediction are applied to optimize the mapping. The Milepost GCC [Fur+11] compiler complements the open-source compiler gcc with machine learning capabilities. From a large set of static program features, such as the number of basic blocks, the CFG structure, or the number of calls, they build a model on training programs and use it to select a set of optimization options for the gcc compiler for an unknown program. An interesting goal is pursued by Fried et al. [Fri+13]. Instead of hunting for optimal compiler optimizations or parallel mappings, they use machine learning to predict which parts of a sequential program should be parallelized. With DiscoPoP they analyze the target program for dependences and extract features, for instance dependence graph properties such as dependence counts between instructions. The paper explores the machine learning algorithms SVM, decision trees, and AdaBoost for prediction and reports a prediction accuracy of up to 92%. Building upon the work by Cavazos et al. [Cav+06; Cav+07], Park et al. construct an optimizer for polyhedral compilation [Par+13]. The polyhedral model provides a unified framework for applying loop transformations. Which transformations to apply and how to parameterize them is considered a tuning problem by Park et al. They consider the transformations loop fusion and distribution, loop and register tiling, wavefronting, thread-level and SIMD-level parallelization, and pre-vectorization.


Table 4.7: Comparison of the model-based autotuning compilers. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: Sussman, Cavazos et al., Wang and Tournavitis et al., Milepost GCC, Fried et al., Park et al., and APHES.

They implement the speedup prediction model by Cavazos et al. with the help of six machine learning algorithms, among them linear regression, neural networks, and SVM. The selected transformations achieve substantial speedups on a variety of benchmarks. Table 4.7 recapitulates the core aspects of the approaches in this section, comparing them to the APHES framework. None of the presented compilers targets heterogeneous platforms. Although all compilers except Sussman's learn their model either offline or online based on automatically extracted features, none of these approaches considers program inputs at runtime. Both of these missing aspects set the APHES framework apart from the compilers in this section.

4.3.2 Search-based Tuning Compilers

Using machine models and machine learning methods to determine optimization sequences or parallelization strategies has a long history. The greatest downside of this approach is that the quality of its prediction is tightly coupled to the quality

of the model. A model that does not accurately reflect the hardware or application characteristics may not provide the compiler with the optimal decisions. Moreover, the model is inherently tied to the applications and hardware properties it was designed for. These properties are, however, by no means static. Especially hardware features change at an astonishing pace, and models must adapt at the same rate. Consequently, there is an interest in the research community to explore decision-making processes based on empirical search as an alternative. The first compilation framework that explored this path was PHiPAC [Bil+97] by Bilmes et al. PHiPAC was initially designed as a catalogue of coding guidelines for BLAS C programs; based on these guidelines, the authors provide a tunable code generator for GEMM kernels. The generator encodes preferences such as preferring local variables over array reads or increasing locality. To configure tuning parameters such as tile sizes, they use a simple random search, recompiling and benchmarking the kernel for every configuration. Almagor et al. [Alm+04] pursue a goal similar to that of the Milepost framework. For their own adaptive compiler they attempt to find alternative optimizer sequences instead of the predefined O2 or O3 levels provided by most established compilers. Using a full exploration of the space of compilation sequences, they show that most local minima an empirical search might produce are within an acceptable range from the global optimum. To find a local minimum, they employ genetic algorithms, hill climbing, or greedy search. In 2008, Chen et al. [CCH08] developed the CHiLL framework. CHiLL is a polyhedral loop transformation framework whose key idea is to control transformation sequences via high-level transformation recipes. Sequences of operations such as permutation, tiling, unrolling, or splitting are defined by a developer. Free parameters in these rules such as unroll factors and tile sizes are determined through a systematic full exploration. In 2009, Tiwari et al. [Tiw+09] extend this to use a more sophisticated search, Parallel Rank Ordering, which bears similarities to the Nelder-Mead simplex algorithm and which can be executed in parallel. Tiwari and Hollingsworth later develop this further into an online tuning scenario [TH11]. Exploiting ActiveHarmony's online tuning capabilities, they produce an adaptive compiler, which generates new code on the fly. By extending CHiLL with GPU-specific loop transformations, Rudy et al. develop CUDA-CHiLL [Rud+10] in 2010. Using CHiLL's polyhedral loop transformation engine and transformations such as tiling, permutation, and data transfers, CUDA-CHiLL generates CUDA code

from sequential inputs. While the transformation sequences are still handwritten, Rudy et al. use autotuning in their evaluation to optimize tile sizes. The ROSE compilation system by Liao et al. [Lia+09] can be seen as a frontend compiler for CHiLL and ActiveHarmony, although it supports alternative optimizers and search engines as well. The primary goal of ROSE is to simplify whole-program tuning. To reduce the complexity of analysing a program entirely, the compiler first extracts candidate kernels into standalone functions. Candidate kernels need to be identified by an application developer. The Cetus automatic parallelizer [Joh+04] is a source-to-source compiler which translates parallelizable loops in C programs into parallel OpenMP loops. In 2010, Dave and Eigenmann [DE10] combine it with an offline tuner. Using Combined Elimination by Pan and Eigenmann [PE06], the tuner provides optimization directives to Cetus. In each iteration, Cetus generates a program version, which is again evaluated by the tuner. Baskaran et al. use the polyhedral model to automatically translate C programs to CUDA [BRS10]. On top of the polyhedral parallelization and optimization algorithm Pluto, their compiler uses autotuning to pick optimal transformations. To select tile sizes that optimize the data movements required for GPU offloading, they formalize the cost of data movement as a constrained non-linear optimization problem and solve it with sequential quadratic programming [Bas+08b]. For tuning the remaining parallelization and transformation process, they employ empirical search. To reduce the complexity of the search space, they drastically prune it using analytical models, representing among others memory loads and stores, data transfer costs, and shared memory accesses [Bas+08a]. INSIEME, an autotuning compiler for parallel languages, represents a hybrid between offline and online tuning [Jor+12]. The compiler component of this approach lowers parallel C or C++ programs into INSPIRE, an intermediate representation able to model the parallel behavior of OpenMP, MPI, or OpenCL. With help from the autotuner, it optimizes the program according to multiple objectives concurrently, such as runtime and energy efficiency. In an offline tuning phase, the tuner determines Pareto-optimal configurations using differential evolution. The resulting different versions of the code are compiled into a single binary program. At execution time, the runtime component then dynamically dispatches between versions according to a policy defined by the application developer.


Table 4.8: Comparison of the search-based autotuning compilers. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: PHiPAC, Almagor et al., CHiLL/ROSE, Cetus, Baskaran et al., INSIEME, Patty, Banerjee and Ranka, and APHES.


The Patty tool by Molitorisz et al. [MMT15] can be seen as an interactive automatic parallelization tool. Integrated into a C# editor, it detects parallelizable segments dynamically. Detected segments that match data-parallel loops, master/worker patterns, or pipeline patterns are presented to the application developer, and, if confirmed, automatically parallelized. Patty does not tune the transformation directly, but generates parallel code annotated with the TADL language, which is an annotation DSL to define tunable parallel architectures. We describe TADL further in Section 4.3.3. The TADL program can be optimized by an offline or online tuner. Banerjee and Ranka present a specialized tuner generating matrix multiplications for a spectral element solver [BR15]. Unlike most compilers discussed in this section, their transformations of the code are based on domain knowledge. Using a genetic algorithm, they optimize the loop structure with the help of CHiLL. We compare the APHES framework to the parallelizing compilers using search-based tuning presented in this section in Table 4.8. There is no support for multiplatform or cooperative execution in any approach except ours, and only CHiLL

and Baskaran et al.'s polyhedral compiler generate GPU code. Unlike APHES, none of the listed compilers produces input-sensitive optimizations.

4.3.3 Languages

While automatic parallelization is a key area in compiler design and research, the result is in practice often inadequate. The primary cause is that compilers need to reason about code statically and thus often must make worst-case assumptions. Transforming code optimistically, on the other hand, can introduce errors and is thus mostly avoided. This bar can be lowered with the help of specialized general-purpose or domain-specific languages. Making a developer explicitly express the assumptions a compiler has to make enables automatic transformation with little developer overhead. Prominent examples of this approach are OpenMP or CUDA. In this thesis, however, we are not primarily concerned with automatic parallelization, but with combining automatic transformation with runtime autotuning. In the following, we thus examine related works from the domain of tunable languages. Voss and Eigenmann's ADAPT system [VE00; VE01] is one of the first approaches to apply this technique. It can be seen as a meta-compiler. Developers describe loop optimizations in the ADAPT language. The ADAPT system applies transformations dynamically according to heuristics defined by the developer using dynamic compilation. It explores the optimization space by exhaustively sampling the runtime of the variants that are applicable according to the developer-defined heuristics. The features of the Qilin programming system by Luk et al. [LHK09] are closely related to the goals of this thesis. Using a C++ API language, developers implement explicitly parallel kernels, also including descriptions of the data layout. This language is JIT-compiled into machine code for CPUs and GPUs. As in the present thesis, the kernel workload is distributed between CPU and GPU dynamically, albeit using a fundamentally different tuning technique. To estimate the optimal distribution, Qilin samples several subdivisions of the work. First, the input is split into two components, one for each platform. Each partition is further subdivided, and then sampled on its respective platform. Qilin assumes that the runtime on each platform is linear in the partition size. It fits linear models to the observed samples, which model the runtime as a function of the portion of the

input assigned to a platform. Qilin can then estimate the optimal distribution by minimizing the sum of the models. Otto et al. [OPT09] presented XJava, an explicitly parallel programming language embedded into Java. Using new language constructs, developers express parallel patterns such as pipelines or master-worker in a concise and compact way. In 2010, Otto et al. extended their compiler and runtime system to automatically tune parameters they infer from the XJava program [Ott+10]. Parameters are for instance the thread count, thread load balancing, or pipeline stage replication factors. The runtime system applies white-box tuning to these parameters. Otto et al. provide heuristics for optimal choices of configurations, for example minimizing the number of idle threads by greedily replicating stateless pipeline stages or by creating nested parallel tasks until worker threads are saturated. This tuning approach is similar to the one adopted in Sambamba. As part of his doctoral thesis [Ans14], Ansel, who also published OpenTuner in 2014, proposed PetaBricks in 2009 [Ans+09]. The PetaBricks language is implicitly parallel and features algorithmic choice. With this feature, the authors tackle the problem of choosing the right implementation of an algorithm for varying inputs. For example, the optimal choice of a sorting algorithm depends strongly on the size of the input. To select an algorithm, PetaBricks learns thresholds that define which algorithm to choose. For this, the PetaBricks autotuner originally used an evolutionary algorithm [Ans+11], which was later supplemented by OpenTuner. As an additional tuning mechanism, Ding et al. [Din+15] train PetaBricks to adapt to inputs with a two-level learning approach. First, inputs are clustered into a fixed number of input classes, for each of which the optimal algorithmic choice is then determined using autotuning. The second level refines the clustering by training a set of classifiers on the clusters of the first stage. Concurrently to XJava, Schaefer et al. [SPT10] developed the Tunable Architecture Description Language in 2010. Like XJava, TADL is a language mechanism to develop architectures from parallel patterns such as pipelines or producer-consumer. An architecture description embeds the API of a C# program into a high-level parallel pattern. For example, the description defines the order of stages in a pipeline, each of which is implemented by a C# method. Algorithmic choice is also supported, modeled as a nominal n-ary decision between alternatives. The description compiler deduces tuning parameters from the architectural composition. As in XJava, example parameters are the number of threads, producer-consumer

buffer sizes, or pipeline stage replication behaviour. TADL shares two co-authors with XJava. The PATUS code generation and auto-tuning framework by Christen et al. [CSB11] targets stencil computations on CPU and GPU platforms. It offers a language to define iterative stencil codes in an array notation. On top of that, the language can be used to define strategies for parallelization or access optimizations, such as blocking. In a strategy, parameters may be marked as auto for autotuning. This specification is then mapped onto parallel hardware using OpenMP and CUDA. The autotuner iterates the mapping process, using Powell search, Nelder-Mead, or evolutionary algorithms to search for parameter configurations. Inspired by Ocelot, the work by Lee et al. [LRG12] compiles PTX for CPU and GPU and distributes the workload. The workload is executed cooperatively, which is why they name their tool Cooperative Heterogeneous Computing (CHC). Most relevant to this thesis, the authors devise a scheme to determine the workload distribution automatically. The distribution scheme is largely similar to Qilin, although CHC does not sample several partitions of the input for every platform. Instead, it measures the extreme points, i.e., the smallest and largest partition. Like Qilin, it then estimates a linear interpolation between the extremes. The work is partitioned according to the intersection of the linear predictions for both platforms. The Halide language by Ragan-Kelley et al. [Rag+13] pursues a goal similar to PATUS, and is another stencil DSL. Unlike PATUS, however, it is primarily targeted at image processing. Using a functional language, developers can compose large and complex image processing pipelines. In a second step, developers then build schedules from building blocks such as tiling, fusion, or sliding windows. This choice is made either manually or with the help of an autotuner. Scheduling additionally includes parallelization or vectorization, and also generates GPU kernels. Using genetic algorithms, the autotuner attempts to optimize the schedule, iterating compilation of the stencil pipeline. With libHawaii [RDP14], Ranft et al. present a library-based approach to cooperative heterogeneous execution. Like Qilin, libHawaii offers building blocks to implement parallel programs. However, it is targeted at stream processing applications and enables developers to implement nested processing pipelines out of task-parallel and data-parallel building blocks. Both task-parallel and data-parallel implementations can execute on any platform in the heterogeneous system. For

data-parallel execution, the work can be partitioned across platforms and executed cooperatively. The library's runtime system optimizes the execution of the pipeline for both energy and runtime efficiency. For that, it exploits the streaming nature: for the data-parallel execution, the relationship between work partition and runtime is assumed to be linear, as in Qilin and CHC. To optimize runtime, the partitioning ratio is updated using a Kalman filter [WB95] after every processed work item. The key difference to Qilin and CHC is that the optimal ratio is refined progressively without sampling a fixed set of partitionings first. Because the ratio is perpetually adapted, libHawaii is also able to adjust to changing inputs automatically by effectively re-tuning its parameters.
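Qilin, CHC, and libHawaii all build on the assumption that runtime is linear in the partition size. The following C++ sketch illustrates the general idea under that assumption: one line per platform is fitted from two sampled partition sizes, and the CPU/GPU split is chosen where the predicted runtimes coincide, mirroring the intersection scheme described above for CHC. The code is a simplified illustration with hypothetical names and is not taken from any of these systems.

#include <algorithm>

// Linear runtime model for one platform: t(n) = intercept + slope * n.
struct LinearModel {
  double slope;      // seconds per work item
  double intercept;  // fixed per-launch overhead in seconds
  double predict(double items) const { return intercept + slope * items; }
};

// Fit a line through two (work size, runtime) samples, e.g. the smallest and
// the largest partition measured on one platform.
LinearModel fit(double n0, double t0, double n1, double t1) {
  double slope = (t1 - t0) / (n1 - n0);
  return {slope, t0 - slope * n0};
}

// Pick the fraction of the total work assigned to the CPU such that the
// predicted CPU and GPU runtimes are equal, i.e., both devices finish together.
double cpuShare(const LinearModel& cpu, const LinearModel& gpu, double total) {
  // Solve cpu.predict(a * total) == gpu.predict((1 - a) * total) for a.
  double a = (gpu.intercept - cpu.intercept + gpu.slope * total) /
             ((cpu.slope + gpu.slope) * total);
  return std::min(1.0, std::max(0.0, a));  // clamp to a valid share
}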

PolyMage [MVB15] is another DSL for image processing. As in Halide, image operators are defined in a functional language. Using a polyhedral compiler, PolyMage optimizes especially the tiling of the stencil loops. Being model-driven, its scheduling algorithms have few parameters, which the compiler explores exhaustively. Using this technique, Mullapudi et al. report significant speedups over hand-tuned and autotuned Halide schedules.

Although not a DSL itself, the LIFT project [SRD17; Hag+18] positions itself as an intermediate representation for DSL compilers. LIFT is a functional data-parallel language for which its compiler generates accelerator code with the aid of ATF and OpenTuner. Primitives of the language are operations such as map, reduce, join, or split. Optimizing a LIFT program means applying rewrite rules to the functional representation. Low-level operations in LIFT that relate to OpenCL features lend themselves to tuning by exposing parameters such as device thread count, tile sizes, or work per thread. These parameters are optimized using ATF and its OpenTuner search.

The most recent contribution to tunable DSLs was made by Facebook AI Research in the form of Tensor Comprehensions [Vas+18]. The TC language is designed for expressing the complex computations and computational graphs involved in building deep neural networks in a compact and simple way inspired by the Einstein notation. In this sense, it is comparable to Halide and PATUS. The key difference is that TC aims to generate GPU code, which in the former two examples is only possible with complicated, hand-built schedules. Using the polyhedral model, the compiler optimizes loop fusion, multi-level parallelism, and the use of GPU memory hierarchies. A genetic algorithm-based autotuner drives the


Table 4.9: Comparison of the autotuning languages. Criteria: Data Parallelization, Task Parallelization, Static Analysis, Dynamic Analysis, Heterogeneous, Multiplatform, Cooperative, Online Tuning, Offline Tuning, Search-Based, Model-Based, Input Sensitive, Learned Model, Automatic Features. Compared approaches: ADAPT, Qilin, XJava, PetaBricks, TADL, PATUS, CHC, Halide, libHawaii, PolyMage, Lift, TC, and APHES.


JIT compiler to optimize the degrees of freedom in, e.g., the parallelism mapping and memory promotion.

Table 4.9 summarizes the languages and compilers presented in this section and compares them to APHES. Of all publications we reviewed in this chapter, Qilin, CHC, and libHawaii offer functionality and features most closely related to our own work. However, because they assume a linear relationship between platform distribution and performance, they are unable to optimize any additional parameters besides the distribution.


4.4 Summary

In this chapter we revisited prior art in the fields of autotuning and automatic parallelization, and investigated approaches combining the two techniques. The works we discussed all fulfil only a subset of the requirements of cooperative heterogeneous parallelization. With few exceptions, existing parallelizing compilers do not approach cooperative execution, because they lack the means to optimize the work distribution and parallelization simultaneously. Both the standalone autotuners and those integrated with the compilers are unfit to solve that particular optimization problem, because they either cannot effectively explore the search space or fail to correlate findings with current inputs. Effectively exploring the search space requires empirical search. To relate configurations and inputs, the empirical search must happen online, and the results of the search must additionally feed a model or lookup table. Although some of the tuning techniques we saw tick both boxes, they are all white-box techniques, which ultimately disqualifies them for use in a multiplatform parallelizing compiler. The works most closely related to this thesis are Qilin, CHC, and libHawaii, which execute parallel kernels on CPU and GPU cooperatively, splitting the workload dynamically. Their approaches, however, are not automatic: developers are required to program in a DSL. Moreover, they assume a linear relationship between workload size and performance and use this assumption to derive the optimal partitioning analytically. While this assumption is sensible in their use case, we cannot assume linearity when optimizing additional tunable parameters.

Chapter 5

Hybrid Online Autotuning

The libtuning library implements the hybrid online autotuning approach as a black-box tuner. Hybrid tuning combines online search and model-based prediction to provide the following three main features:

Context sensitivity: The runtime of the tuning kernel depends not only on the configurations of tuning parameters, but also on the dynamic tuning context. In libtuning, a tabular memory records configurations and tuning context states. When a previously seen state recurs, the previous configuration for that state is used. In addition to the tabular memory, a prediction model can be constructed to predict good configurations for unknown context states.

Effective exploration: To explore the search space and bootstrap the table and prediction model, the tuner uses online search. The set of implemented default search algorithms can be extended by application developers.

Efficient exploration: While state-of-the-art online search finds a (locally) optimal configuration relatively fast, there is room for improvement. The libtuning search algorithm called hierarchical search accelerates the search substantially by automatically exploiting structure in the search space and by avoiding known bad configurations.

In this chapter, we present the hybrid tuning approach and libtuning in detail. We first described the hybrid tuning approach in our publication [Her+19], where it was applied to tuning parallel ray tracing1. The hybrid tuning approach presented in this chapter extends the approach described in the publication by introducing

1Hybrid tuning was contributed to the publication by the author of this thesis.

the underlying formalism and by integrating it with hierarchical search and into our libtuning autotuner. We investigated early designs of the hybrid approach in two Master's theses [Wen16; Kop18], which demonstrated the general feasibility of the approach. Hierarchical search was presented in our publication [PGT19], in which we apply it to polyhedral cooperative parallelization2. The introduction to hierarchical search we give here is based on that article and extends it by describing the full formalism of the approach. In the following, we first introduce hybrid tuning in Section 5.1. Subsequently, Section 5.2 describes the libtuning architecture and its main building blocks. In Section 5.3 we present hierarchical search. Section 5.4 discusses the prediction models used in hybrid tuning. We finally summarize the details presented in this chapter in Section 5.5.

5.1 Hybrid Tuning: Combining Search and Prediction

Two directions exist in current state-of-the-art autotuning designs: Optimization is generally either machine-learning- or model-based, or search-based. The two alternatives offer different benefits and drawbacks. When parameter configurations are produced by a machine-learning or model-based predictor, an immediate reaction to a change in the dynamic context is possible: The predictor is queried for the updated context state and returns a single parameter configuration. With search-based tuning on the other hand, every change in the context entails overhead: To find a new configuration, the search is restarted, sampling new configurations until a new (local) optimum is found. The drawback of using prediction, however, is that the quality of the configuration depends on the accuracy of the predictor. If the predictor was hand-crafted, its accuracy is determined by how precisely its designer modeled the application and the dynamic context. If the predictor is learned, then its accuracy depends on how well the learned model generalizes from training data and how well the data reflected reality. While machine-learning eliminates the need for a meticulous human model designer, it generally requires large amounts of training samples to be accurate.

2Both hierarchical search and polyhedral cooperative parallelization were contributions of the author of this thesis to the publication.



Figure 5.1: The hybrid online tuning workflow (figure based on [Her+19]). Feedback from the application is recorded in the tuning memory and is used to update the prediction and search states. When the application reports a change in the dynamic context, hybrid tuning selects between search and prediction to provide the next configuration.

With offline training, the quality of the learned predictor is governed by how well the provided training data represents the data that will be observed in production. Online training on the other hand can learn from data actually occurring in production, but there is a bootstrapping problem: If the initial model is imprecise, initial predictions are sub-optimal and performance is decreased. With search-based tuning, there is no bootstrapping problem and the quality of the found configuration is independent of any a-priori sample data. With hybrid tuning, we present a compromise between the two directions. Hybrid tuning combines prediction and search, using online search to explore the search space and to continuously provide training data to update the prediction model. Because training happens online, prediction is insensitive to any a-priori sample bias. The hybrid tuning workflow is shown in Figure 5.1: Whenever a change in the dynamic context is detected, hybrid tuning can choose between exploration of the search space using search and exploitation of the trained prediction model. Any performance feedback provided by the application in either mode is used to refine the prediction model. A configuration table additionally stores the best known configuration for observed context states.


To implement hybrid tuning, we model the changing context and the reaction to changes as a Markov Decision Process (MDP). The model represents context changes as process state transitions and uses MDP decision making to choose between exploration and exploitation. In the following section, we briefly introduce our process model. Subsequently we discuss how we implement detection of context changes and our approach to balancing exploration and exploitation.

5.1.1 Context Sensitivity in Online Autotuning

To represent the context sensitivity problem in libtuning we construct an infinite-horizon MDP. As introduced in Chapter 2, an MDP is the quadruple (Σ, A, Φ, ρ). In this model Σ is the state space, and A is the space of possible actions. The state space represents the dynamic context in which the tuner operates. In MDP literature the tuner would be called the "agent". A change in the context thus constitutes a change in the state space, causing the agent to pick an action. We define the action space A = T as the set of possible configurations. The agent picks a configuration according to a policy, usually called π(s), s ∈ Σ, which determines actions for every state. The action model Φ: Σ × A × Σ → [0, 1] describes the probability Φ(s, a, s′) = p(s′ | s, a) of transitioning to the state s′ when performing action a in state s. We cannot define a one-size-fits-all action model, since it is unknown whether a chosen action has any actual influence on state transitions. Consider again the library parallelization example discussed previously: Actions affect the runtime of the called library functions, but we cannot reasonably assume that they have any effect on the context states reported to the library by its client. The action model thus has to be learned by observing state transitions. Finally, the reward function ρ: Σ × A × Σ → ℝ determines the reward r = ρ(s, a, s′) for transitioning from state s to s′ under action a. We set r = −m, i.e., the immediate reward achieved after every action selection is just the negated performance measurement m. The negative sign serves to turn the minimization goal of autotuning into a maximization objective for the agent.

5.1.2 Approximating and Observing the Context

Before the autotuner can adapt dynamically to a context change, it first has to detect it. To achieve this, we first have to model and quantify the context. Thus,

we need to determine the momentary state of both the application and the system. For this, we introduce context indicators I_c(t) ∈ ℝ as "probes" serving to monitor various static and dynamic properties of the context at every time point t. Together, the set of application context indicators I^A_c(t) and the set of system context indicators I^S_c(t) produce a model K̃ = (I^A_0, ..., I^A_a, I^S_0, ..., I^S_s) of the real dynamic context K at a given point in time. In the remainder of this section we will only ever refer to the approximated context, and will thus use the terms interchangeably. Indicators in hybrid tuning may represent both discrete and continuous quantities. While this is necessary to support general applications, it strongly affects the MDP model. Since the context forms the state space of the process model, incorporating continuous indicators turns it into a continuous Markov decision process. Such processes have significantly different solutions and convergence guarantees. The discrete versus continuous indicator distinction also affects change detection. In a discrete context, detecting change just means watching the indicators for value changes. The same is possible for continuous indicators of course, but we define two additional schemes. Continuous quantities often exhibit high frequencies. Consider a system load indicator once more, for instance: Measuring it produces an essentially different value every time, since the system the application is running on is never really at rest. Here, smoothing the response function with some low-pass filter can help, e.g., by averaging multiple indicator samples over time. While the changes of the indicator values may be high-frequency, they are often low-amplitude at the same time: When the system is busy, for example, the system load usually stays in the 100% range. Additionally, it is reasonable to assume that small changes in the indicator values only effect negligible changes in performance, and thus in the optimality of the current configuration. For example, if the system load changes from 98% to 99%, the effect is likely not measurable. Both smoothing and down-sampling can help with this effect. Alternatively, it is possible to detect state changes purely by observing the effect of the indicators: Monitoring the measuring function, we can detect state changes when there are drastic changes in the function's value.
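As an illustration of the smoothing and thresholding schemes just described, the following C++ sketch low-pass filters a continuous indicator with an exponential moving average and only reports a state change when the smoothed value moves by more than a configurable threshold. The class and its parameters are illustrative assumptions, not libtuning's actual indicator implementation.

#include <cmath>

// Sketch: a smoothed, thresholded context indicator. Names and the exact
// filtering scheme are illustrative; libtuning's implementation differs.
class SmoothedIndicator {
public:
  SmoothedIndicator(double Alpha, double Threshold)
      : Alpha(Alpha), Threshold(Threshold) {}

  // Feed a raw sample (e.g., the current system load) and report whether the
  // smoothed value moved far enough to count as a context change.
  bool update(double RawSample) {
    Smoothed = HasValue ? Alpha * RawSample + (1.0 - Alpha) * Smoothed
                        : RawSample;
    HasValue = true;
    if (std::abs(Smoothed - LastReported) >= Threshold) {
      LastReported = Smoothed;
      return true;  // drastic enough to be treated as a state change
    }
    return false;   // high-frequency, low-amplitude noise is ignored
  }

private:
  double Alpha;              // weight of the newest sample in the average
  double Threshold;          // minimum movement that counts as a change
  double Smoothed = 0.0;
  double LastReported = 0.0;
  bool HasValue = false;
};

With a threshold of, say, 0.05 for a load indicator normalized to [0, 1], the 98%-to-99% fluctuation from the example above is filtered out, while a drop from 100% to 10% load triggers a state change.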

5.1.3 Adapting to Context Changes

Having a means to quantify the context and to detect changes, the next step is to react to such a change. Within the framework of our model, reacting to changes is

implemented by the policy π. We borrow from the field of reinforcement learning to develop our policy, which is loosely based on ε-Greedy, a frequently used policy in the field: With a probability of ε, we explore the search space using hierarchical search when encountering a state change. With a probability of 1 − ε, we exploit a predictor which is trained during exploration from the sampled configurations. Because the application-tuner interaction is iterative, neither option is "atomic" in the sense that it is a singular reaction to a state change. Instead, the hybrid tuner can be either in exploration mode or in exploitation mode. While in the former, it will produce configurations as directed by hierarchical search. Hierarchical search is introduced in Section 5.3. While in the latter, it uses the predictor. We discuss prediction in more detail in Section 5.4. In particular, being in exploitation mode implies that there is only a single configuration update, namely when entering that mode. In exploration mode on the other hand, multiple configurations are sampled. Consequently, another state change might occur while the search is still ongoing. In that case, the current search state is saved and the search will be continued when the previous state recurs.
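A minimal sketch of this mode selection is shown below; the class, member names, and the per-state map are illustrative placeholders rather than libtuning's actual code. On every reported state change the tuner draws against ε to decide between exploring (resuming or starting the search saved for that state) and exploiting the predictor.

#include <cstdlib>
#include <map>
#include <vector>

using State = std::vector<int>;            // discretized indicator state
using Configuration = std::vector<double>; // parameter configuration

struct SearchState { /* simplex and epsilon-greedy bookkeeping */ };

// Sketch of the epsilon-greedy-style mode selection; illustrative only.
class HybridPolicy {
public:
  explicit HybridPolicy(double Epsilon) : Epsilon(Epsilon) {}

  // Called whenever the indicators report a context change.
  void onStateChange(const State &S) {
    Current = S;
    Exploring = drawUniform() < Epsilon;   // explore with probability epsilon
  }

  Configuration nextConfiguration() {
    if (Exploring)
      return continueSearch(Searches[Current]); // resume saved search state
    return predict(Current);                    // single predicted config
  }

private:
  double drawUniform() { return std::rand() / (RAND_MAX + 1.0); }
  Configuration continueSearch(SearchState &) {
    return Configuration{};  // placeholder: delegate to hierarchical search
  }
  Configuration predict(const State &) {
    return Configuration{};  // placeholder: query the table or the model
  }

  double Epsilon;
  bool Exploring = true;
  State Current;
  std::map<State, SearchState> Searches; // one saved search per context state
};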

5.2 The libtuning Architecture

In this section we discuss the main components of the libtuning autotuner. Figure 5.2 shows its most important logical units and their interaction, both with each other and with the application. The primary application-facing component is the HybridTuner which encapsulates the other building blocks and manages their interactions. In terms of the sequence of operations, the first inner units the application will interact with are the ParameterSpace (labeled PS in the figure) and the IndicatorSpace (labeled IS). The first manages the application's tunable parameters, the second the application and system indicators. The Hybrid Tuning unit implements the hybrid tuning approach and interoperates with the application only through the ParameterSpace and IndicatorSpace layers. This component can be extended through application developer-provided Search and Model implementations. The former supplement hierarchical search with fundamental search algorithms it can use. The latter provide prediction models to hybrid tuning's prediction component. In the following sections we introduce the components in more detail. In Section 5.2.1 and Section 5.2.2 we discuss the ParameterSpace and IndicatorSpace



Figure 5.2: The ParameterSpace (PS) serves as an adapter between the application and the search algorithm. Feedback from the application is passed through the adapter to hybrid tuning, which computes the next configuration for the ParameterSpace to cache. Cached configurations are passed to the application upon the next call to next(). The IndicatorSpace (IS) is a registry for application-defined indicators.

components, respectively. Subsequently we describe the Search implementations already included in libtuning and used by default by hierarchical search. Hierarchical search itself will be introduced in detail in Section 5.3. We defer the discussion of the builtin Model implementations until Section 5.4, where we introduce the model-based prediction component of hybrid tuning.

5.2.1 Tuning Parameters

Application developers employ autotuning to optimize parameters according to some criterion. The nature of these parameters is entirely specific to the particular application. Black-box autotuners are oblivious to the parameters' semantics. More importantly, the parameters have different domains in practice. For example, some parameters may be integer, other parameters may be floating point or might not even be numbers at all. Some parameters might assume legal values from an interval of [0, 10], others from an interval of [0.0, 1.0]. Having to handle different domains is particularly cumbersome for Search and Model implementations. Hence, libtuning isolates the implementations from these application-specific parameter properties.



Figure 5.3: Mirroring Example: For an integer parameter from the range of [1, 6], the measurement function as observed by search algorithms becomes a periodic step function.

The isolation is provided by the ParameterSpace translation layer, which is an adapter between the application and its parameters on the one side and the search and prediction algorithms on the other. The translation allows search and prediction algorithms to operate on an n-dimensional real-valued unbounded space called the search domain, where n is the number of parameters. When the application reports feedback for a parameter configuration to the adaptation layer, hybrid tuning's state is updated and it is queried for the next configuration sample. This sample is a real-valued vector in the search domain. The configuration is cached within the translation layer. Once the application requests the next parameter update from the HybridTuner component, the cached configuration sample is translated into the application domain and then passed to the application. This mechanism has two positive effects. Firstly, it allows the development of search algorithms independent of the application which uses them. Secondly, it prevents several numerical problems in the search algorithms. For example, we observed that search algorithms get stuck in non-optimal configurations when trying to optimize in integer spaces. The searches never converge but flip-flop between a small set of configurations due to rounding. Through the ParameterSpace component, we move the responsibility for, e.g., rounding and boundary treatment away from the search implementation and into the application domain. When registering parameters with libtuning, applications may define the individual domain and boundary mapping behavior as required. Additionally, libtuning provides a number of sensible default adapters for the parameter types

used most often. These are integer and float parameters, either unconstrained or from an interval, and bag-of-labels nominal parameters. The latter are parameters which may take any value from a set of unordered labels. Through the default adapter, these are embedded as interval-constrained integer parameters, effectively numbering the unordered labels. Integer parameters are by default converted from the real-valued search domain by rounding. To map configurations into legal intervals, the adapter virtually mirrors the measurement function at interval boundaries. Configuration values returned by the search are then mapped into the interval with respect to the mirroring. Figure 5.3 shows an example of the procedure for a parameter that accepts integer values from the range [1, 6]. The figure displays the search space for the single parameter as it appears to the search algorithms. Even though the parameter can only accept values between one and six, the search is free to sample any value. For instance, sampling a configuration value of C = 1.3 would through rounding set the application parameter to a value of one. Sampling a value of C = 10.7 would first apply mirroring and then rounding, setting the application parameter to a value of three. Negative values are treated accordingly, but are for simplicity not shown in the example. This mirroring strategy has a number of interesting properties. Most importantly, it retains all local and global optima: No newly introduced point can exceed the global optimum and all previously local optima remain local. Any local extremum on the boundary of the interval is still local, but its neighborhood is now symmetric. We cannot expect the mirroring strategy to be valid for every possible search algorithm. For instance, it is possible to misinform searches that make assumptions based on the distance of points in the search space. To the search, points may appear to be far apart, even though their measurements were obtained from points mapped to close proximity in the application domain. In this case we expect the application developers to supply a suitable adaptation mechanism with their parameter definition.
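The following C++ sketch shows one possible default adapter for interval-constrained integer parameters: an unbounded real search value is reflected at the interval boundaries and then rounded. The exact placement of the mirror axes is an assumption here, so individual values may map to different integers than in the example above; this is also one reason why applications can override the default adapter.

#include <cmath>

// Sketch: map an unbounded real search value into the integer interval
// [Lo, Hi] by mirroring at the interval boundaries and rounding.
// Illustrative only; libtuning's default adapter may place the mirror
// axes differently.
long mirrorAndRound(double X, long Lo, long Hi) {
  if (Hi <= Lo)
    return Lo;                              // degenerate interval
  const double Low = static_cast<double>(Lo);
  const double Len = static_cast<double>(Hi) - Low;
  const double Period = 2.0 * Len;          // the mirrored function repeats

  double T = std::fmod(X - Low, Period);
  if (T < 0.0)
    T += Period;                            // handle negative search values
  if (T > Len)
    T = Period - T;                         // reflect the descending half back

  return std::lround(Low + T);
}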

5.2.2 Indicators

Compared to tuning parameters, managing application and system indicators is much less complex. The IndicatorSpace component is little more than a registry containing the set of indicators. Besides registration, the component also implements change detection. What constitutes a change can be overridden by an application defining its own indicators. By default, a change happens whenever the numerical value of an indicator changes.


struct Search {
  // (Re)start the search and get the first configuration
  virtual Configuration &start() = 0;

  // Update the measurement and get the next configuration
  virtual Configuration &getNext(double Measure) = 0;

  // Reject the current configuration. Returns a new
  // configuration or null if no better one was found.
  virtual Configuration *reject();
};

Figure 5.4: The Search implementation interface.

Because applications are forced to express their indicators as numerical values, this default is sufficient to realize hybrid tuning. No domain translations as in the ParameterSpace are necessary.

In libtuning, application and system indicators need to be provided by the application developer. When used within the APHES framework, the framework's compiler assumes the role of the application developer and defines the indicators automatically. Developer-provided indicators may be as simple as arbitrary application-side variables. Using program or function inputs as indicators is thus trivial, but arbitrarily complex indicators are possible. For example, an indicator monitoring system load could be defined, which would require querying operating system statistics every time its indicator state is queried. Additionally, depending on the nature of an indicator, observing its values may require down-sampling. For this purpose, the application developer can add a target resolution. This is likely useful, for instance, for an indicator monitoring system load. Whether the load is 99% or 100% likely has no discernibly different effect on tuning kernel performance, but the difference between 10% and 100% is probably noticeable. It is up to the application developer to define a meaningful sampling resolution, since libtuning cannot assume any semantics of a particular indicator.
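A down-sampling step based on such a target resolution could look like the following sketch; the function name is hypothetical and the actual registration API in libtuning differs.

#include <cmath>

// Sketch: quantize a raw indicator reading to a target resolution so that
// near-identical readings map to the same context state. Illustrative only.
double quantizeIndicator(double Raw, double Resolution) {
  if (Resolution <= 0.0)
    return Raw;                      // no down-sampling requested
  return std::round(Raw / Resolution) * Resolution;
}

// Example: with a resolution of 0.25, loads of 0.99 and 1.00 both map to
// 1.00, whereas a load of 0.10 maps to 0.00 and is therefore a distinct state.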


5.2.3 Fundamental Search Algorithms

Because of the isolation and abstraction provided by the ParameterSpace adapter, implementations of search algorithms are required to supply only a minimal interface. Figure 5.4 shows the base class. In particular, the search algorithm is not concerned with actually sampling. Instead, it returns to the calling ParameterSpace layer a sequence of configurations. It is the caller's responsibility to obtain the measurements and pass them back to the search. In the following, we use the phrase "sampling" to mean a single round-trip of returning a configuration to the caller and receiving a measurement in the next iteration. The mandatory operations a search algorithm implementation must provide are start and getNext, which (re-)start the search and obtain the next configuration, respectively. When obtaining the next configuration, callers also provide the measurement for the last configuration as an argument. The search implementation is expected to update its internal state upon a call to getNext and to return the next configuration it wants to sample. Additionally, an implementation may support rejecting individual configurations without measuring them. Callers use this mechanism to filter out illegal configurations or to ignore configurations they do not wish to sample. That can be the case if they use performance models and predict that a configuration will be sub-optimal. When tuning an application, developers can provide their own implementation of a search algorithm or pick a default from libtuning's small catalog of implementations. The library distinguishes between algorithms for nominal and non-nominal parameters. Unless the application developer requests otherwise, the novel hierarchical search algorithm for search in mixed spaces is used. In this section, we describe the two default fundamental search techniques used by hierarchical search. The hierarchical search technique itself will be discussed in Section 5.3. The tuning library includes as fundamental search algorithms an implementation of the Nelder-Mead downhill simplex algorithm, an ε-Greedy algorithm, and random and full exploration. While the latter two are self-explanatory, the variants of Nelder-Mead and ε-Greedy will be discussed in the following.
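The sampling round-trip can be sketched as the following caller loop against the Search interface from Figure 5.4; isLegal() and measure() are hypothetical stand-ins for the filtering and measurement performed by the ParameterSpace layer and the application.

struct Configuration;                   // as used by the Search interface
bool isLegal(const Configuration &C);   // hypothetical legality filter
double measure(const Configuration &C); // hypothetical kernel measurement

// Sketch of the sampling round-trip; error handling is omitted.
void runSearch(Search &S, int Iterations) {
  Configuration *C = &S.start();        // (re)start, obtain the first sample
  for (int I = 0; I < Iterations; ++I) {
    if (!isLegal(*C)) {
      if (Configuration *Alt = S.reject())  // ask for a replacement
        C = Alt;                            // sample the replacement instead
    }
    double M = measure(*C);             // run the tuning kernel
    C = &S.getNext(M);                  // report feedback, get the next sample
  }
}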

5.2.3.1 The Nelder-Mead Algorithm

In libtuning, the fundamental search strategy for non-nominal parameters is an implementation of the Nelder-Mead algorithm [NM65], a heuristic derivative-free



Figure 5.5: The Mealy automaton for the Nelder-Mead algorithm: State transitions are triples M/C; U, where C is the next configuration, M is the optionally constrained measurement feedback m_C′ for the previous configuration C′, and U is an optional update to the Nelder-Mead simplex X.

numerical optimization technique. We introduced the algorithm in Section 2.1.2 and present our implementation here. We implement the algorithm as a small Mealy automaton, a finite-state machine shown in Figure 5.5. The implementation maintains a simplex as an array of search space points X along with their measurement values m_x, x ∈ X. The points in X are sorted in ascending order with respect to m_x. The state machine changes its state in every tuning iteration. A state transition is defined as a triple M/C; U, where C is the next configuration and M is the measurement feedback m_C′ for the previous configuration C′, optionally including a constraint on the measurement value. Lastly, U defines an optional update to the simplex X, which is applied before the next configuration is returned. The state machine is defined by five states with the starting state S. The transitions implement the Nelder-Mead update rules.


From the starting state, the automaton transitions into the initialization state I by initializing the simplex of J + 1 points from a Latin Hypercube sample. The first simplex point is returned as next configuration to the application. Latin Hypercubes [Par94; MBC79] are a technique to obtain space-spanning configurations and are often applied to seed search algorithms [Cha12]. To obtain a simplex based on this technique, every dimension of a bounded J-dimensional space is split into J + 1 equally sized intervals. Then, we draw samples randomly such that no two samples fall into the same interval in any given dimension. Once the application returns the measurement feedback for the last simplex point, the automaton performs the reflect operation by transitioning into the reflection state R and returning the reflected point r. The transition into the reflection state's successor depends on the measurement feedback m_r′ for the last reflected point r′. If m_r′ is less than m_X[0], the measurement value of the best simplex point, the automaton performs an expand operation, returning the expanded point e and transitioning into the expansion state E. If m_r′ lies between the best and worst simplex point, we substitute the worst simplex point X[J] for the previous reflected point r′ and perform another reflection, producing the next reflected point r. Otherwise, if m_r′ is greater than or equal to the measurement value of the worst point, we perform a contract operation and return the contracted point c. The expansion state E transitions back into the reflection state R after receiving the feedback m_e for the expanded point e. The transition substitutes the worst simplex point for the better one among e and r′ and executes another reflect operation producing the next r. Similarly, the automaton performs a reflect operation in the contraction state C if the feedback m_c for the contracted point c is better than the value of X[J]. In that case, X[J] is substituted for c. Otherwise, if m_c is greater than or equal to m_X[J], a reduce operation is performed. This operation forms a new simplex by moving every point towards X[0]. Subsequently, every simplex point except X[0] needs to be sampled, so starting with X[1] the automaton enters the initialization state I. The reflected, contracted, and expanded points r, c, and e are computed as described in Section 2.1.2 using the default parameter values suggested in the original Nelder-Mead article: α = 1, β = 0.5, γ = 2. Our implementation of the Nelder-Mead algorithm does not support rejecting individual configurations.
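A minimal sketch of the Latin Hypercube seeding, assuming a search space already normalized to the unit cube per dimension, is shown below; the actual libtuning initialization operates on the translated search domain and may differ in detail.

#include <algorithm>
#include <random>
#include <vector>

// Sketch: draw J+1 Latin Hypercube samples in a J-dimensional unit cube.
// Every dimension is split into J+1 intervals and each sample falls into a
// distinct interval per dimension. Illustrative; not libtuning's exact code.
std::vector<std::vector<double>> latinHypercube(int J, std::mt19937 &Rng) {
  const int N = J + 1;                           // number of simplex points
  std::vector<std::vector<double>> Samples(N, std::vector<double>(J));
  std::uniform_real_distribution<double> Unit(0.0, 1.0);

  for (int Dim = 0; Dim < J; ++Dim) {
    std::vector<int> Slots(N);
    for (int I = 0; I < N; ++I)
      Slots[I] = I;
    std::shuffle(Slots.begin(), Slots.end(), Rng); // permute interval indices
    const double Width = 1.0 / N;                  // width of one interval
    for (int I = 0; I < N; ++I)
      Samples[I][Dim] = (Slots[I] + Unit(Rng)) * Width; // random point inside
  }
  return Samples;
}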


Table 5.1: Tuning parameters for a simplified heterogeneous mapping example

Parameter   Range                  Semantics
P           {CPU, GPU}             Binary platform choice
T^CPU       [1, 48]                The number of CPU threads
T^GPU       {32t, t ∈ [1, 32]}     The number of GPU threads
SM          {true, false}          The flag to enable shared memory
U^SM        {true, false}          The flag to unroll shared memory copy loops

5.2.3.2 The ε-Greedy Algorithm

For nominal parameters, libtuning implements an ε-Greedy algorithm, which achieved the best performance in an exploratory case study [Pfa+17]. The algorithm keeps track of the best known configuration. At every step, it either samples a random configuration with probability ε or uses the best known configuration with probability 1 − ε. Compared to the 2017 article, our implementation adds another hyper-parameter, controlling the decay of ε over time. We call this parameter γ ∈ (0, 1). The probability to pick a random configuration at step i then becomes ε_i = ε · (1 − γ)^i. Thus, γ offers direct control over the convergence rate. Setting this hyper-parameter to zero returns to the original ε-Greedy behavior. When the application rejects a configuration, the algorithm returns a new random sample.
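A sketch of the decayed ε-Greedy step, with illustrative names, looks as follows; setting Gamma to zero reproduces the undecayed behavior from the 2017 study.

#include <cmath>
#include <functional>
#include <random>
#include <vector>

using Configuration = std::vector<double>;

// Sketch of the decayed epsilon-Greedy step: with probability eps*(1-gamma)^i
// sample a random configuration, otherwise replay the best known one.
// Illustrative only; libtuning's implementation differs in bookkeeping.
class EpsilonGreedy {
public:
  EpsilonGreedy(double Epsilon, double Gamma) : Epsilon(Epsilon), Gamma(Gamma) {}

  Configuration next(const Configuration &BestKnown, std::mt19937 &Rng,
                     const std::function<Configuration()> &RandomSample) {
    std::uniform_real_distribution<double> Unit(0.0, 1.0);
    const double EpsilonI = Epsilon * std::pow(1.0 - Gamma, Step++);
    if (Unit(Rng) < EpsilonI)
      return RandomSample(); // explore: draw a random configuration
    return BestKnown;        // exploit: reuse the best configuration so far
  }

private:
  double Epsilon; // initial exploration probability
  double Gamma;   // decay rate in (0, 1); zero disables the decay
  long Step = 0;  // current tuning step i
};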

5.3 Hierarchical Search

In real-world programs, parameters are not only frequently integer, but also often nominal. Applying a search algorithm to optimize a mixture of the two classes is highly problematic, because it limits the choices for apt search algorithms. Although libtuning does not proactively prevent application developers from picking an unsuitable search algorithm for the parameter mix, it is strongly discouraged. Search algorithms drawing conclusions based on, e.g., neighborhood of configurations will derive information that is purely coincidental. Consider as an illustration tuning the algorithmic choice of a matrix multiplication algorithm. When the nominal tuning parameter A = {ijk, ikj, strassen} is embedded into integer space by simply enumerating the algorithms, the hill climbing algorithm



Figure 5.6: Decompose the global search space of the simplified heterogeneous mapping according to the parameter classes.

can be applied. If the hill climber now observes m(ijk) = 2 and m(ikj) = 1 it will conclude that the space looks promising in the direction of ikj. Independent of whether strassen is actually better than ikj, this conclusion is already a coincidence: The order of the algorithms is irrelevant. As we found in our exploratory case study [Pfa+17], the vast majority of the empirical search algorithms used today are faced with this problem: They operate on a notion of direction or distance. A prominent exception are genetic algorithms, which, when using appropriate crossover and mutation operators, work well on combinatorial optimization problems. In libtuning, we approach search in such mixture spaces in a structured way that is automatic and transparent to the application developer. As a motivating example, consider a simplified version of the heterogeneous parallelization approach: We generate parametric parallel code for the CPU and the GPU and add a tuning parameter to select the platform. In this example, we tune five application parameters in total: P, the binary platform choice, T^CPU, the number of CPU threads, T^GPU, the number of GPU threads, SM, the flag to enable shared memory, and U^SM, the flag to unroll the memory copy loops moving data into shared memory. Table 5.1 summarizes these parameters together with their ranges. Traditional, state-of-the-art tuning solutions embed these parameters into a single five-dimensional space containing 2 · 48 · 32 · 2 · 2 = 12,288 valid points. The first improvement we make is to separate this single space into two smaller spaces by distinguishing between nominal and non-nominal parameters. Figure 5.6 shows the effect graphically: The space on the left now contains only nominal parameters, the one on the right only non-nominal parameters. Note that this separation does not actually reduce dimensionality. There are 2 · 2 · 2 = 8 points in the nominal space and 32 · 48 = 1,536 points in the non-nominal space. Finding configurations in the individual spaces independently is however not legal, since the parameters are not independent: Assume for example that on the GPU the kernel runtime is defined by m_GPU = 0.5 · T^GPU if shared memory is disabled (C_SM == false), and


m_GPU = 1/T^GPU otherwise. Which value of T^GPU is optimal obviously depends on the selected value of SM. Ignoring the selected value of the nominal parameter SM when selecting a value of the non-nominal parameter T^GPU is hence not a viable path. Instead, configurations must be sampled from the cross product of both spaces, which means that the cardinality of the separated space remains unchanged. Nevertheless, with this first decomposition, we are now able to explore the spaces with different search algorithms that are appropriate for the respective parameter class, e.g., the ε-Greedy algorithm on the one hand and the Nelder-Mead algorithm on the other. To retain the dependence between parameters we configure parameters hierarchically: We first determine a configuration for the nominal parameters using the ε-Greedy algorithm and then a configuration for the non-nominal parameters using the Nelder-Mead algorithm. Because we cannot select configurations independently, we maintain one instance of the Nelder-Mead search per nominal configuration. In this example, the memory consumption of our Nelder-Mead implementation is thus increased by a factor of eight. A Nelder-Mead instance consumes O(n²) space where n is the dimension of the search space. The actual increase in memory footprint is however only about 30%, or by a factor of (8 · 2²)/5² to be precise: While the space consumption of Nelder-Mead is increased eight-fold, the number of parameters goes down from five to two. Although a 30% space increase appears to be small, the penalty can actually have a drastic impact, since it is exponential in the dimensions of the nominal search space. On the other hand, having paid this price we have achieved much simpler search spaces for the individual search algorithms and have enabled targeting the individual parameter classes with appropriate algorithms. Additionally, the instantiation of search instances occurs lazily. We thus incur the worst case memory consumption overhead only if we sample every nominal configuration, which in practice happens only for small nominal spaces.
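The per-nominal-configuration replication with lazy instantiation can be sketched as a map from nominal partial configurations to Nelder-Mead states that is only populated on first use; the types below are placeholders for libtuning's internal representation.

#include <map>
#include <vector>

using NominalConfig = std::vector<int>;  // e.g., the values of P, SM, U^SM
struct NelderMeadState { /* simplex of the non-nominal subspace, O(n^2) */ };

// Sketch: one Nelder-Mead instance per nominal configuration, created lazily
// so the worst-case memory overhead is only paid for nominal configurations
// that are actually sampled. Illustrative only.
class PerNominalSearch {
public:
  NelderMeadState &instanceFor(const NominalConfig &C) {
    auto It = Instances.find(C);
    if (It == Instances.end())
      It = Instances.emplace(C, NelderMeadState{}).first; // lazy creation
    return It->second;
  }

private:
  std::map<NominalConfig, NelderMeadState> Instances;
};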

Decomposing the search space into nominal and non-nominal subspaces opens the door for another optimization. In practice, nominal parameters often refer to algorithmic choices and as a consequence lead to different control flow paths in the application. In the example, the parameter P controls whether the CPU or the GPU implementation of the parallel code is to be executed. So transitively, this parameter also controls the relevance of the parameters T^CPU and T^GPU: When running the CPU version of the code, e.g., configuring the GPU thread count has no effect. Using the application developer's context knowledge of the application

parameters' semantics, we can leverage such constraints. In the example above we can identify two more constraints: The number of GPU threads is tied to the selection of the GPU, and unrolling the loop moving data into shared memory is only relevant when shared memory is enabled. Honoring these constraints produces the search space decomposition pictured in Figure 5.7. Unlike the first-level decomposition, this second step does reduce the cardinality of the space. We see that for this example we have achieved the highest degree of dimensionality reduction possible, having only one-dimensional search spaces. That also reduces the size of the sample space, because the constraints exclude irrelevant configurations. The total number of sampleable points is now 48 + 32 + 2 · 32 = 144: There are 48 configurations where C_P == CPU, 32 configurations where C_P == GPU and C_SM == false, and 2 · 32 configurations where C_P == GPU and C_SM == true. Although this example demonstrates that the constraints reduce the number of sampleable points significantly, we in general also incur a loss of information. When separating nominal and non-nominal parameter spaces, we were able to retain hidden dependences between nominal and non-nominal configurations by replicating the non-nominal search state. Since nominal configurations are selected first and non-nominal configurations second, replication is practical. For the second-level decomposition, this approach is infeasible because when two nominal spaces are interdependent, there is no meaningful way to order them after our decomposition. Early experimentation showed however that the constraint-based decomposition is not noticeably sensitive to the impact of missed hidden dependences, unlike the class-based decomposition.

In summary, even for this small example we managed to reduce the size of the search space by two orders of magnitude, while at the same time allowing the tuner to explore parts of the search space using different, appropriate search mechanisms. The size reduction is due to a decomposition of the search space by classifying parameters into their respective classes and based on relevance constraints. In the following, we formalize the algorithm that implements the decomposition and the search in the decomposed space. The general idea is to compute a decomposition into the minimal number of subspaces such that the relevance and class constraints are preserved. The first level of the decomposition is straightforward: We partition the search space T = T^N ∪ T^R into the naturally disjoint subspaces T^N of the nominal parameters and T^R of the non-nominal parameters.



Figure 5.7: Decompose the global search space of the simplified heterogeneous mapping according to the parameter classes and according to application-developer-supplied constraints (based on Pfaffe et al. [PGT19]).

Relevance Constraints. The second level of the decomposition is computed based on the relevance constraints defined by the application developer. We write τ_i −v→ τ_j, where τ_i ∈ T^N, τ_j ∈ T, and v ∈ τ_i, to say that the relevance of the tuning parameter τ_j depends on the configuration value C_{τ_i} = v of the nominal tuning parameter τ_i. When a parameter is relevant, the search must provide a configuration value for it. Otherwise the parameter should not be configured. Note that only nominal parameters may appear on the left hand side of the constraint. Naturally, neither circular nor contradicting constraints are allowed.

Local Search Spaces. We call subspaces of both T^N and T^R local search spaces when all elements of a subspace are subject to the same relevance constraints. To be precise, a subspace L ⊆ T is a local search space if and only if either L ⊆ T^N or L ⊆ T^R, and for all constraints τ_i −v→ τ_j and τ_i −v′→ τ_j′ with τ_i ∈ T^N and τ_j, τ_j′ ∈ L it holds that v = v′.

Search Space Graph. Together, the local search spaces and the relevance constraints form a directed acyclic graph, which we call the search space graph

S = (L_S, E_S). The nodes L_S of the graph are the local search spaces induced by the relevance constraints, which form the edges E_S of the graph: E_S = {(L_i, L_j) | ∃ τ_i −v→ τ_j, τ_i ∈ L_i, τ_j ∈ L_j, v ∈ τ_i}. The roots of the graph are formed by precisely the local search spaces containing all of the unconstrained parameters. Before we proceed with the introduction of the hierarchical search procedure, let us first discuss another, more complete example of the search space graph decomposition. Consider the parameter set T = {A, B, C, D, X, Y, Z, W}, for which A through D are nominal and W through Z are non-nominal.



Figure 5.8: Example: Search space graph for the parameter set T = {A, B, C, D, X, Y, Z, W}, where {A, B, C, D} are nominal and {W, X, Y, Z} are non-nominal. The parameters are subject to the relevance constraints A −b→ B, A −c→ C, B −x→ X, C −y→ Y, C −y→ X, and D −z→ Z.

The first-level decomposition is then T^N = {A, B, C, D}, T^R = {W, X, Y, Z}. Further, let the dependence constraints be A −b→ B, A −c→ C, B −x→ X, C −y→ Y, C −y→ X, and D −z→ Z. A valid search space graph is shown in Figure 5.8. There are two subspaces without incoming constraint edges: The one containing the nominal parameters A and D and the one containing the non-nominal parameter W. The two spaces are disjoint because they contain parameters of different classes. Although X and Y both depend on the same configuration value for C, they are in distinct local spaces because X has another dependency, on B, which Y does not share. To find a configuration in the search space graph decomposition, hierarchical search traverses the graph along edges whose constraints are fulfilled. The process is described in the following.

Local, partial, and maximal partial configurations. Configuring the tuning parameters on the basis of the graph means determining local configurations for all nodes whose constraints are fulfilled. A local configuration for a local search space l is a configuration C_l for the parameters in that local search space. We call a set of local configurations for distinct local spaces a partial configuration C^S for the search space graph S when it contains no contradicting local configurations. Local configurations C_l and C_l′ contradict if there are two paths, one to l and one to l′, containing contradicting constraints. A partial configuration is called a maximal partial configuration if there is no possible local configuration that can be added to the partial configuration without contradicting a local configuration in the set. As an illustration of contradicting configurations, consider the local spaces containing B and C in the example above. The two spaces cannot be part of a partial configuration, because that would require A to be configured with the values b and c simultaneously.



Figure 5.9: Example: Hierarchical search in the search space graph for the parameters {A, B, C, D, X, Y, Z, W}. The active nodes are highlighted in red. The blue-shaded area marks the non-nominal spaces which are searched together by a single instance of Nelder-Mead for the partial configuration C_A = c, C_C = y, C_D = z.

Active nodes. Given a partial configuration C^S, we say a node l ∈ L_S is active for C^S if, for all edges τ_i −v→ τ_j on all paths from a root of the graph to l, C_{τ_i} = v. In particular, any nodes without incoming edges are always active. There can be two such nodes in the graph, one containing all unconstrained nominal parameters and one containing all unconstrained non-nominal parameters. Producing a maximal partial configuration means producing local configurations for all active nodes recursively. Hierarchical search produces maximal partial configurations by successively producing local configurations in active nodes until no additional nodes become active. After obtaining a local configuration for the current node, the search recurses into active neighbors of the current node in a deterministic order. Accumulating the local results into the partial configuration until no active neighbors remain produces a maximal partial configuration. For nominal spaces, hierarchical search exploits that finding nominal parameter configurations is a combinatorial optimization problem. We exploit that fact to enable using an individual search instance on a single active nominal node in isolation. Because parameters are interdependent, the search instance must consider the current partial configuration. For that reason, hierarchical search maintains concurrent search state for all partial configurations it produces. For non-nominal spaces on the other hand, this kind of isolation of the active spaces is not practical. Instead, all active non-nominal

spaces are joined into a single space which is then explored using a Nelder-Mead instance. To account for the dependencies between nominal and non-nominal configurations, that instance is again maintained per partial nominal configuration. Because the non-nominal spaces are the leaves of the search space graph, they are activated last and as a single unit. Consider Figure 5.8 for a step-by-step example of the activation process. Initially, all search state is empty. For the {A, D} node, let the ε-Greedy search select the configuration {b, z}, which according to the constraints activates the nodes {B} and {Z}. Nominal nodes are processed first, hence let the ε-Greedy instance for the current partial configuration {b, z} pick the configuration {x′} for the node {B}. That yields the partial configuration {b, z, x′}, which we use to finally activate the Nelder-Mead instance for the node {Z} to produce a maximal partial configuration. Let the application return a measurement value of 0.7 for this configuration, then the search state after this first tuning iteration is:

Space     Instance       Local Config.   m_C
{A, D}    Ø              {b, z}          0.7
{B}       {b, z}         {x′}            0.7
{Z}       {b, z, x′}     {...}           0.7

In the next tuning iteration, let the search instance for the {A, D} node return the configuration {c, z′}. This partial configuration activates only the node {C}. Assume this node's search instance for this configuration returns {y′} and the application measures a value of 0.9 for the resulting maximal partial configuration. Highlighting the updated values, the search state after the second tuning iteration becomes:

Space     Instance       Local Config.   m_C
{A, D}    Ø              {b, z}          0.7
                         {c, z′}         0.9
{B}       {b, z}         {x′}            0.7
{Z}       {b, z, x′}     {...}           0.7
{C}       {c, z′}        {y′}            0.9

For the third tuning iteration, assume the {A, D} instance returns {c, z}, activating {C}. Let the ε-Greedy instance for {c, z} return {y}. Then, Figure 5.9 shows the current activation state for this partial configuration. By applying two steps of the ε-Greedy algorithm, hierarchical search has produced the partial configuration C_A = c, C_C = y, C_D = z and has thus activated the local spaces {A, D}, {C}, {W}, {Y}, and {Z}, all highlighted in the figure. Although {W} was active concurrently to {A, D} because it has no constraints, it is a non-nominal subspace and is thus only activated once the partial configuration is maximal with respect to the nominal parameters. For the given partial configuration the non-nominal spaces {W}, {Y}, and {Z}, highlighted in blue, are activated together for a single Nelder-Mead instance. Assuming the application reports a measurement of 0.6 for the resulting maximal partial configuration, the search state after the third iteration becomes:

Space       Instance       Local Config.   m_C
{A, D}      Ø              {b, z}          0.7
                           {c, z′}         0.9
                           {c, z}          0.6
{B}         {b, z}         {x′}            0.7
{Z}         {b, z, x′}     {...}           0.7
{C}         {c, z′}        {y′}            0.9
            {c, z}         {y}             0.6
{Y, Z, W}   {c, z, y}      {...}           0.6

Note that at this point, there are two concurrent search instances for the {C} node. Decomposing the search space in this manner does not only help to reduce the size of the space effectively, it also allows rejecting configurations more efficiently. Applications reject configurations when they determine heuristically that a configuration would produce poor performance. Often, however, a combination of only a small number of parameters contributes to this fact. Consider the heterogeneous offloading example (Figure 5.7, page 98): If the parallelized code is trivial or executes only for a small number of iterations, the application may rightfully judge that offloading this computation to the GPU is highly likely to actually decrease performance because of the overhead. The only parameter relevant to this is P, the one controlling the platform selection. Neither of the remaining ones contributes to the offloading decision. With the search space graph, we allow applications to also reject partial configurations. For example, the application is able to reject as soon as a partial configuration setting C_P = GPU is constructed. As a consequence, we enable making rejection decisions early in the configuration process.


Rejecting configurations early further reduces the number of nodes that are ever activated.
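To make the graph-based bookkeeping concrete, the following sketch shows one possible representation of relevance constraints, local search spaces, and the activation test for a node under a partial configuration. The names and the flattened constraint storage are simplifications and assumptions, not libtuning's actual data structures.

#include <map>
#include <string>
#include <vector>

// Sketch of the search space graph bookkeeping; illustrative names only.
struct Constraint {            // tau_i -v-> tau_j
  std::string Source;          // nominal parameter tau_i
  int Value;                   // required configuration value v
};

struct LocalSpace {
  std::vector<std::string> Parameters; // parameters grouped in this node
  std::vector<Constraint> Incoming;    // constraints on the paths to this node
};

using PartialConfig = std::map<std::string, int>; // nominal parameter -> value

// A node is active if every constraint on the paths leading to it is
// satisfied by the current partial configuration; roots have no incoming
// constraints and are therefore always active.
bool isActive(const LocalSpace &Node, const PartialConfig &C) {
  for (const Constraint &Edge : Node.Incoming) {
    auto It = C.find(Edge.Source);
    if (It == C.end() || It->second != Edge.Value)
      return false;
  }
  return true;
}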

5.4 Model-based Prediction

We explore two types of models for the prediction component of hybrid tuning in this thesis. The first one is a tabular nearest-neighbor approach. The second one uses generalized reinforcement learning, using function approximation to model the relationship between indicators, configurations, and runtime. The tabular predictor is simple: Every observed state is recorded in a table, alongside the best known configuration for that state. When the tuner enters a new indicator state, the table is queried for the geometrically closest known state. Although nearest neighbor searches can be implemented efficiently, there are two major drawbacks to this approach. The motivation for nearest-neighbor prediction is of course that the same configuration is likely good for similar states. When a new state is, however, a large distance from all states in the table, the query result can be arbitrarily bad. This problem can be mitigated by not using the query result if the distance is too large, but that requires defining a distance threshold. The second drawback is the table size: Storing information for every possible state is infeasible, especially when states are not discrete. Of course the table could only be populated through exploration or its size could be limited. But that would either disable the possibility to learn during exploitation or it would add another hyper-parameter to limit the size. In libtuning, we offer size limitation as well as control over the indicator resolution to the application developer. A way around the drawbacks of tabular prediction is provided by generalized reinforcement learning. Whereas classical RL techniques are table-based themselves, generalized approaches approximate the tables through functions, as we have discussed in Section 2.1.3. By approximating the relationship between indicators, configurations, and runtime as a function, RL eliminates the need to either store every state or to choose which states to store. To predict a good configuration for a state, the function is used to find the configuration that maximizes an estimate of the reward for the state-configuration pair. Unfortunately, function maximization is expensive, which could bar hybrid tuning from being applicable to online tuning scenarios. Instead, we therefore keep a record of the best configurations found during exploration and consider only those as candidates for the

prediction. Although this record is still potentially large if hybrid tuning explores often enough, it is generally much smaller than a complete state table. In libtuning, we implement a version of the Greedy-GQ variant of Q-Learning as our function approximation predictor. The Greedy-GQ algorithm was introduced in Section 2.1.3. We approximate the table as a function Q_θ(s, a) = θ · ϕ(s, a), where the ϕ are features of the indicators defined by the application developer. As a default, libtuning uses normalized radial basis functions [KA97] as features: It defines ϕ_c(x) = exp(−ω_c (x − µ_c)²) / Σ_i exp(−ω_i (x − µ_i)²), where x = (s, a) is the concatenation of the indicator state vector s and the configuration (or action) a. Compared to regular radial basis functions, which are just RBF_c(x) = exp(−ω_c (x − µ_c)²), the normalized variant promises smoother space boundaries as well as smoothed-out "gaps" between the basis functions [KA97]. The center µ_c and width ω_c are configurable parameters. The standard update equations of the Greedy-GQ algorithm are controlled by three hyper-parameters, α, β, and γ. Both α and β are learning rates. The third parameter, γ, determines the influence of expected future rewards on the update. For our application, however, we assume that influence to always be zero: Accounting for future rewards allows propagating information about future states into the update. However, we assume that there is no correlation between the action we choose now and any subsequent state change. That assumption is of course only valid for applications in which the context state does not depend on tuning parameters. Setting γ to zero eliminates β and simplifies the update equation to

θ_{t+1} = θ_t + α (R_{t+1} − θ_t · ϕ(s_t, a_t)) ϕ(s_t, a_t).
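A sketch of the normalized RBF features together with this simplified update could look as follows; the centers, widths, learning rate, and the flat weight vector are illustrative assumptions about how the configurable parameters are stored.

#include <cmath>
#include <vector>

// Sketch: normalized radial basis features and the simplified (gamma = 0)
// Greedy-GQ weight update. Centers Mu, widths Omega, and the learning rate
// Alpha stand in for libtuning's configurable parameters.
struct RbfModel {
  std::vector<std::vector<double>> Mu; // one center per feature
  std::vector<double> Omega;           // one width per feature
  std::vector<double> Theta;           // learned weights
  double Alpha;                        // learning rate

  static double sqDist(const std::vector<double> &X,
                       const std::vector<double> &C) {
    double D = 0.0;
    for (size_t I = 0; I < X.size(); ++I)
      D += (X[I] - C[I]) * (X[I] - C[I]);
    return D;
  }

  // phi_c(x) = exp(-omega_c |x - mu_c|^2) / sum_i exp(-omega_i |x - mu_i|^2)
  std::vector<double> features(const std::vector<double> &X) const {
    std::vector<double> Phi(Mu.size());
    double Norm = 0.0;
    for (size_t C = 0; C < Mu.size(); ++C) {
      Phi[C] = std::exp(-Omega[C] * sqDist(X, Mu[C]));
      Norm += Phi[C];
    }
    if (Norm > 0.0)
      for (double &P : Phi)
        P /= Norm;
    return Phi;
  }

  // Q_theta(s, a) = theta . phi(s, a), with x the concatenation of s and a.
  double value(const std::vector<double> &X) const {
    const std::vector<double> Phi = features(X);
    double Q = 0.0;
    for (size_t I = 0; I < Theta.size(); ++I)
      Q += Theta[I] * Phi[I];
    return Q;
  }

  // theta <- theta + alpha * (R - theta . phi(x)) * phi(x)
  void update(const std::vector<double> &X, double Reward) {
    const std::vector<double> Phi = features(X);
    const double Error = Reward - value(X);
    for (size_t I = 0; I < Theta.size(); ++I)
      Theta[I] += Alpha * Error * Phi[I];
  }
};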

5.5 Summary

In this chapter we introduced the design of the general-purpose black-box online autotuner libtuning. At its core the tuner is a hybrid combining memorization, model-based prediction, and online search to quickly find configurations for known or unknown application and system states. Relevant aspects of the application and system state are, for example, the program or tuning kernel inputs and the system load, respectively. Using tabular memorization and model-based prediction, the autotuner remembers previously seen application and system states and the configuration it found optimal for those states. To provide a configuration for unknown states, the tuner

uses online empirical search. The search algorithms in libtuning find (locally) optimal configurations, exploiting the structure of the search space and avoiding bad configurations to accelerate the search. By decomposing the global search space into a network of dependent subspaces based on parameter properties, our novel hierarchical search substantially reduces the size of the search space. Recommendations by the application rejecting configurations as "bad" can further reduce the number of sampled configurations which exhibit exceptionally bad performance.


Chapter 6

Automatic Heterogeneous Parallelization

In this chapter we describe the compiler component of the APHES framework in detail. The aphes compiler parallelizes loops in the input program for multiple platforms and instruments the result to interoperate with the libtuning autotuner through a runtime library. To analyze and transform the program we implement two techniques. First, the aphes compiler attempts to analyze the program using the polyhedral model. When this fails due to the constraints the polyhedral model imposes on representable programs, we fall back to classical dependence testing and more ad hoc transformations. In the following, we first introduce the polyhedral model-based parallelization, and then the ad hoc method. The introduction is based on our publication [PGT19], which first presented our polyhedral parallelization approach1. Lastly, we discuss how the compiler and libtuning can interact to also incorporate analytical information that the compiler can derive into tuning decisions.

6.1 Polyhedral Parallelization and Partitioning

Generating code for cooperative heterogeneous execution from the polyhedral model is a three-stage process. The process starts with computing a schedule for the SCoP that maximizes data parallelism. The next step is partitioning the outermost parallelizable loops for the platforms in question, which here are the CPU and a GPU, and then mapping the partitions onto the platforms. Finally,

1The polyhedral parallelization approach was contributed to the publication by the author of this thesis.

the optimized and mapped schedule is translated into actual parallel code for the platforms. We describe our approach to partitioning and mapping as well as code generation, and finally the interoperation with the autotuner, in the following sections.

6.1.1 Mapping and Partitioning the Schedule Tree

Our strategy for heterogeneous parallelization with the polyhedral model is based on Polly-ACC, and thus makes heavy use of the PPCG GPU mapper [Ver+13]. With its help, we first compute a schedule tree optimized for data parallelism from the SCoP. We describe the optimization algorithm in more detail in Section 2.2.3. Before mapping the optimized schedule to the GPU using the PPCG code generator we transform it in two steps, introducing tunable partitioning first and inserting tunable platform mapping second. Fundamentally, partitioning a loop requires duplicating it and then modifying the boundaries of the copies: One loop copy handles the lower part of the iterations, the other one the higher part. To make that tunable, a tuning parameter determines the iteration that separates the lower and higher part. Mapping a partition to platforms is essentially the same operation: produce one instance of the partition for every platform and add a tuning parameter to select the instance dynamically. Note that these two steps only parallelize the outermost loop. Repeating the process recursively handles all data parallel loops. In the following, we describe in more detail how we realize this process within the polyhedral framework. We transform the schedule tree top down recursively, searching for the outermost data parallel loops in the loop nest. To insert the partitioning, we begin by finding the outermost band which has leading parallel dimensions. The set of the leading parallel dimensions corresponds to the set of outermost data parallel loops. Every loop in this set is split into two parts: One for the high indices and one for the low indices. Although the number two appears to suggest itself because two platforms are targeted, the approach supports any number of partitions. We split the outermost parallel dimension of the band by introducing a set node between the band and its parent with two filter children which implement the constraints for the partitions. A newly introduced parameter R_0 defines the split of the outermost parallel dimension of the band. R_0 is subject to the same constraints as the iterator of that dimension. In the partition filters, this parameter becomes


    domain
      D = {S1(i, j) | 0 ≤ i, j < N;  S2(i, j, k) | 0 ≤ i, j, k < N}
        band
          {S1(i, j) → (i, j);  S2(i, j, k) → (i, j)}
            sequence
              filter {S1(i, j)}
              filter {S2(i, j, k)}
                band {S2(i, j, k) → (k)}

Figure 6.1: Original schedule tree of the matrix multiplication example in Figure 2.7. The figure is identical to Figure 2.8 and is repeated here for the reader's convenience.


[Figure 6.2: Example: Partitioning and platform mapping for the schedule tree in Figure 6.1.
(a) Partitioning the outermost band node: First, introduce a new tunable parameter R0. Then insert a set node with two filters before the band that is to be partitioned. The filters select the indices i either less than, or greater than or equal to, this parameter, respectively. The original band node is duplicated.
(b) Selecting the platform for the high-index partition: Between the band and the second partition's filter, introduce another set node layer with two filters, one for each platform. The filters select iterations based on the value of a new tunable parameter, P0^High. The corresponding transformation introducing P0^Low for the low-index partition subtree is omitted for clarity.]

the new upper and lower bound of that iterator to select high and low indices, respectively. The original band node is duplicated and attached to the filters. This process repeats recursively for the remaining leading parallel dimensions of the band. In summary, when partitioning d dimensions, we insert 2^d − 1 set nodes in total, producing 2^d copies of the original band.

To implement platform selection, we follow exactly the same path. For every band node we created, we insert a new sequence node as its immediate parent, between the band node and the set filter that we created during partitioning. We introduce new parameters P_i^Low and P_i^High for the platform selection for the i-th loop, P_i^r ∈ {0, 1}, for both the high and low index partitions. The two filters of the new sequence node implement the constraints P_i^r = 0 to select the CPU and P_i^r = 1 to select the GPU. Again duplicating the band node, we finally parallelize it for the GPU by applying the PPCG mapping algorithm to the band where P_i^r = 1. For its sibling, however, we insert a mark node, which we use during code generation as the starting point for generating OpenMP code for CPU parallelization.

Figure 6.2 illustrates this process for the schedule tree of the example in Figure 2.8, which we repeat here in Figure 6.1 for the reader's convenience. Partitioning the first dimension in the outermost band node of the tree produces the tree shown in Figure 6.2a. It shows the two newly inserted filters, using the partitioning parameter R0 as a parametric lower and upper bound of the iterator, respectively. Figure 6.2b then shows the platform mapping for one of the two partitions. Introducing a new parameter P0, the inserted filters produce a GPU and a CPU path. The PPCG mapping algorithm is then invoked for the GPU path.
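To make the combined effect of partitioning and platform mapping more tangible, the following sketch shows, in plain C, what the transformed control structure of a single partitioned loop amounts to. It is only an illustration under assumed names: the actual transformation operates on the schedule tree, the run_* functions stand in for the band copies generated by PPCG and the OpenMP backend, and R0, P_low, and P_high are the tuner-controlled parameters introduced above.

    /* Illustrative stubs for the generated band copies. */
    void run_low_on_cpu(int lower, int upper);
    void run_low_on_gpu(int lower, int upper);
    void run_high_on_cpu(int lower, int upper);
    void run_high_on_gpu(int lower, int upper);

    /* Conceptual shape of the partitioned and mapped loop nest. */
    void region(int N, int R0, int P_low, int P_high) {
      /* Low-index partition: iterations [0, R0). */
      if (P_low == 1)
        run_low_on_gpu(0, R0);       /* PPCG-mapped band */
      else
        run_low_on_cpu(0, R0);       /* OpenMP-parallelized band */
      /* High-index partition: iterations [R0, N). */
      if (P_high == 1)
        run_high_on_gpu(R0, N);
      else
        run_high_on_cpu(R0, N);
    }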

6.1.2 Code Generation

After all schedule transformations have been applied, we convert the schedule tree back into LLVM IR. This procedure builds on top of the preexisting Polly-ACC [GH16] code generator. Most of the code generation for partitioning and mapping is therefore already in place. We extend the Polly-ACC code generator in two major aspects, because it was designed to deal with GPU offloading only. To incorporate CPU parallelism, we integrated the code generator with an existing, experimental Polly OpenMP backend. This backend emits code for the libgomp OpenMP runtime library. The second extension is the instrumentation of the parallelized program to interoperate with the libtuning autotuner.

During mapping and code generation we extract a number of tunable parameters from the parallel program. First and foremost those are the partition splits, R_i ∈ [0, 1), as well as the platform mappings P_i^r ∈ {CPU, GPU}² for i ∈ [0, d) and r ∈ {High, Low}, where d is the number of outer loops being parallelized. For example, when parallelizing two outer loops of a parallel source region, this yields six tunable parameters: Two partition splits plus platform mapping deciders for the four partitions. Consequently, multiple partitions may be offloaded to the same platform at runtime. This is a potential source of extra overhead, but it is unfortunately unavoidable, because the non-affine constraints that would circumvent this are not expressible in the polyhedral framework.

Besides these work distribution parameters, we are able to elicit additional parameters from the mapping and code generation processes. Code generation for the CPU exposes the parameters T_{i,r}^CPU controlling the number of threads, which are freely configurable at runtime. Tuning the GPU mapping on the other hand is unfortunately not as straightforward. Configuration options that affect mapping must be embedded within the polyhedral framework. An example is the number of threads per GPU block. Scheduling the threads is a tiling operation, i.e., the band is tiled into chunks of the number of threads. The point loop of the tiling then describes the kernel computations per thread, and the tile loop implements the execution of the distinct thread blocks. Making the number of threads parametric would turn this into a parametric tiling problem, which to date has no efficient general solution. Since we are not willing to forfeit tuning the GPU platform parameters, we are forced to follow an alternate route. To make the thread count tunable, we enumerate all configurations and generate the corresponding versions of the target region code statically. We express this directly in the schedule tree as a set node whose filter children represent the individual configurations. Thus, we obtain a single large schedule tree. Using the same approach, we also add parameters to determine the block height (B_{i,r}), to enable or disable promotion of local variables to shared memory (SM_{i,r}), and to unroll the loops moving data in and out of shared or local memory (U_{i,r}^SM, U_{i,r}^T). The block height parameters B_{i,r} directly control the shape of a block of GPU threads. They thus affect the locality of accesses to global memory and the ability to coalesce accesses by the threads of a warp.

² In the schedule trees, parameters need to be integers. We use the symbols GPU and CPU here in place of the integer values to aid readability.


The full list of parameters as well as their default ranges is summarized in Table 6.1. The value ranges of the block size and height parameters are not exhaustive. The chosen ranges are a trade-off between covering a sufficiently large area of the possible space and the number of versions we need to generate.

Table 6.1: Tunable parameters exposed by polyhedral parallelization

    Parameter      Class     Range                                  Semantics
    -- Work distribution --
    R_i            Ratio     R_i ∈ [0, 1) ⊂ R, i ∈ [1, d] ⊂ Z       Partitioning
    P_i^r          Nominal   P_i^r ∈ {CPU, GPU}, r ∈ {High, Low}    Platform Mapping
    -- GPU Platform Parameters --
    T_{i,r}^GPU    Ratio     T_{i,r}^GPU = 32k, k ∈ [1, 8] ⊂ Z      GPU Block Size
    B_{i,r}        Ratio     B_{i,r} = 2^k, k ∈ [0, 5] ⊂ Z          GPU Block Height
    SM_{i,r}       Nominal   SM_{i,r} ∈ {true, false}               Shared Memory
    U_{i,r}^SM     Nominal   U_{i,r}^SM ∈ {true, false}             Unroll Shared Accesses
    U_{i,r}^T      Nominal   U_{i,r}^T ∈ {true, false}              Unroll Tile Accesses
    -- CPU Platform Parameters --
    T_{i,r}^CPU    Ratio     T_{i,r}^CPU ∈ [1, cores] ⊂ Z           CPU Threads

6.2 Dependence Testing for Parallelization and Partitioning

The polyhedral model fails to represent parallelizable loops in some cases. By relaxing its constraints, however, it is sometimes still possible to analyze the loops and detect data parallelism. Two examples for when that happens are function calls and inner loops without static control. Function calls are not modeled by the polyhedral model because the instructions within the function cannot be scheduled. Polyhedral compilers usually first inline functions and erase the function call because of this. For us that would however require speculatively inlining all function calls just to be able to detect parallelism, which would be prohibitively expensive. Loops without static control on the other hand are not representable in a polyhedral framework at all. However, when only parallelization is the goal, such loops might be benign: If such loops contain no memory accesses or only accesses

to memory that is provably local, parallelization remains unaffected. Therefore, when polyhedral modeling fails we fall back onto a second analysis and transformation pipeline, using classical dependence testing and ad hoc transformations. In an exploratory Bachelor's thesis [Bai16], sharing work between CPU and GPU was shown to be worthwhile. In this section we introduce the pipeline, the analyses performed by the aphes compiler to detect parallelism, and the platform-specific program transformations.

6.2.1 Detecting Data Parallelism

The first step in parallelizing a given program is to detect the parts of the program that are parallelizable. In aphes, we focus on finding only data parallel loops during this phase. Böhm [Böh18] extended the scope to also support loops containing reductions, which can be seen as a special case of data parallelism with a relaxed constraint on data dependences. The detection of reduction parallelism is based on the work of Scheirle [Sch17]. On top of the restriction to data parallelism, we impose further constraints on the loops we analyze to simplify the implementation. Most importantly, we only consider top-level loops as candidates. Secondly, we assume loops have a particular structure in the control flow graph. The expected structure is illustrated in Figure 6.3: the loop must be defined by a single-entry-single-exit control flow subgraph, meaning there is a single basic block outside of the loop whose only successor is the loop header block and vice versa for the exit block. Further, there must be a unique exiting block, which branches to the exit and the header blocks. Consequently, the exiting block cannot be the loop header block. Note that this structural restriction is generally weak, since most loops can be normalized to take this form. A noteworthy counterexample however is a loop that has multiple exit blocks, which are basic blocks outside of the loop to which the loop body may jump. There is no universal way to merge these blocks. Lastly, we require that the outermost loop exhibits static control: The trip count, i.e., the number of iterations, must be statically computable. To be precise, this trip count may be either a constant value or a provably loop-invariant variable.

For all loops that pass these checks, aphes performs its parallelism detection analysis. This is a classical dependence analysis, attempting to disprove loop-carried data dependences. The LLVM framework contains various such analyses. The aphes compiler extends the existing facilities of the LLVM framework to support interprocedural analysis. It performs dependence analysis in loops which contain function calls.


    preheader:
      ...

    header:
      %i = phi i64 [ 0, %preheader ], [ %i.next, %exiting ]
      ...

    exiting:                                        ; loop latch
      %i.next = add i64 %i, 1
      %cond = icmp slt i64 %i.next, %N
      br i1 %cond, label %header, label %exit

    exit:
      ...

Figure 6.3: Expected control flow structure of a loop supported by aphes

To describe the effect of a call, we summarize all functions called within the candidate loop. Function summaries are effectively exhaustive lists of all memory accesses within the function that affect its pointer arguments or global variables. These accesses are precisely those that a caller is potentially able to observe. In general, summarizing requires over-approximating the accessed ranges since accurate alias analysis is in practice unavailable: When analyzing a memory access, it is often not exactly clear which argument or variable is read or written. In the worst case, summaries must assume that a function reads or writes every byte of memory in the program. That may for instance happen when accesses occur indirectly, that is, through pointers loaded from arguments or global variables. With the help of summaries, function calls become analyzable by expanding the summary at the call site, thus virtually "inlining" the function. To disprove the existence of any loop-carried dependences, aphes extends LLVM's implementation of the dependence testing algorithm by Goff, Kennedy, and Tseng [GKT91] to support summaries. To widen its applicability, aphes was extended to support a parallelism scheme that in fact exhibits loop-carried dependences: parallel reductions [Böh18].
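As a concrete illustration of a summary, consider the following small function; the name, the caller, and the informal notation are made up for this example and do not reflect the compiler's internal representation.

    void scale_row(double *row, double factor, int n) {
      for (int j = 0; j < n; ++j)
        row[j] *= factor;                /* reads and writes row[0 .. n) */
    }

    /* Summary of scale_row (conceptually):
     *   argument row:  reads  bytes [0, n * sizeof(double))
     *                  writes bytes [0, n * sizeof(double))
     *   no accesses to global variables.
     * In a caller such as
     *   for (int i = 0; i < m; ++i)
     *     scale_row(&A[i * n], 2.0, n);
     * expanding the summary at each call site shows that distinct iterations
     * access disjoint ranges of A, so no loop-carried dependence exists. */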


Parallel reductions differ from simple data parallelism in that they allow for the presence of specific data dependences, for which the values read and written at a conflicting memory location are connected only via associative operations. A typical example of this is the summation of all elements of an array into an output variable. The output variable is read every iteration, incremented by an array element, and written back to the same location. If the increment operator is associative, this operation may be executed in parallel. Because reductions can be seen as a slightly degenerate form of data parallelism, we support them as well.
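The summation example reads as follows in C; the loop-carried dependence on sum involves only the addition operator, which the parallelizer treats as associative, so the iterations can be split across targets and the partial sums combined afterwards.

    double array_sum(const double *A, int n) {
      double sum = 0.0;
      for (int i = 0; i < n; ++i)
        sum += A[i];    /* read-modify-write of the same location every iteration */
      return sum;
    }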

6.2.2 Target Offloading

Having detected parallelizable loops, aphes then generates the parallel code. The code generation is implemented by an extensible collection of target plugins. A target plugin's purpose is to produce target platform specific code as well as the management code required to offload the computation. The aphes compiler then essentially becomes a driver for the transformations implemented by the plugins. Each plugin must thus emit three pieces of code: the prologue, the target region, and the epilogue. The interplay between these sections and the targets is shown in Figure 6.4, which expresses synchronous operations as solid arrows and asynchronous operations as dashed arrows. In the prologue, targets are expected to set up target platform memory and asynchronously execute the target region. This potentially entails allocating space for the data accessed within the target region, including both arrays and scalar variables, and initiating data transfers. If host and target platform share the same physical memory, however, this step may be omitted. The aphes compiler processes the prologues of the individual targets in order, followed by all the epilogues. In the epilogue phase, targets await their target region's completion and post-process the results, if necessary. Post-processing can for example be necessary for parallelized reductions. They may do so asynchronously to overlap compute and data transfers. When transforming loops that contain reductions, the epilogues additionally aggregate the reduction results for the per-target slices of the work.

Currently, we have implemented two primary target plugins, namely for CUDA and OpenMP targets. To demonstrate that this approach also supports distributed systems, Lukas Böhm [Böh15] additionally provides a prototypical Xeon PHI target. In the following, we briefly discuss their design and implementation. Lastly, we describe the target-specific tuning parameters the plugins expose, as well as how we model dynamic workload partitioning.


[Figure 6.4: Code sections generated by target plugins (prologue, kernel/target region, and epilogue per target, driven from the host IR). Solid arrows denote synchronous, dashed arrows asynchronous operations.]

6.2.2.1 The OpenMP Target Plugin

Targeting OpenMP on the CPU is straightforward. Because of the shared memory, no complex management of allocations is required. To interoperate with the OpenMP library API, the plugin needs to outline the target region code into a separate function which will be called by the OpenMP runtime. Outlining mainly involves substituting the original boundary checks for checks against the boundaries of the chunk of iterations per OpenMP thread, and passing scalar variables and array pointers into the target region function. Hence, in the prologue, the target plugin copies scalars and pointers and invokes the OpenMP runtime. The epilogue merely requires waiting for the threads to finish.
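The following C sketch illustrates the outlining concept. It is an approximation under assumed names: the real plugin emits LLVM IR that calls the libgomp entry points directly rather than using a pragma, and the chunking policy shown here is only one plausible choice.

    #include <omp.h>

    typedef struct {
      int *A, *B;          /* array pointers passed into the outlined region */
      long lower, upper;   /* slice of iterations assigned to this target    */
    } RegionArgs;

    /* Outlined target region: each OpenMP thread derives its own chunk of the
     * slice and executes the original body with substituted bounds. */
    static void add_outlined(void *raw) {
      RegionArgs *args = (RegionArgs *)raw;
      long n = args->upper - args->lower;
      long nthreads = omp_get_num_threads();
      long tid = omp_get_thread_num();
      long chunk = (n + nthreads - 1) / nthreads;
      long lo = args->lower + tid * chunk;
      long hi = lo + chunk < args->upper ? lo + chunk : args->upper;
      for (long i = lo; i < hi; ++i)
        args->A[i] += args->B[i];
    }

    /* Prologue sketch: capture scalars and pointers, then run the outlined
     * function in a parallel region; the implicit barrier at its end plays the
     * role of the epilogue's wait. */
    void add_prologue(int *A, int *B, long lower, long upper) {
      RegionArgs args = {A, B, lower, upper};
      #pragma omp parallel
      add_outlined(&args);
    }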

6.2.2.2 The CUDA Target Plugin

In most systems, host and GPU do not share memory. Exceptions are mobile graphics units and system-on-a-chip platforms, which we do not consider further in this thesis. Consequently, a major task of the CUDA target plugin is managing allocations and data transfers between the target platform and the host. Unlike for the OpenMP target, we need to transfer both scalar variables and full arrays to the GPU. The number of scalar variables is statically known, which makes copying easy. On the other hand, both size and location of arrays are

dynamic properties. Worse, both may change at runtime, and although the size of an array may be constant, the amount of data that needs to be transferred is not. On the one hand, the required data amount varies because of changing inputs to the program or parallelized region that affect the size of the array. On the other hand, changes are also caused by changes in the workload distribution due to tuning, which means that the changes are frequent. The CUDA target plugin thus manages memory dynamically through the runtime system further detailed in Section 6.2.3.

Having allocated and copied all data onto the GPU, the prologue then executes the target region, which is implemented as a CUDA kernel. Lastly, the prologue also initiates the transfer of the results back from the GPU device into the host memory. Besides allocations, all operations performed in the prologue are asynchronous with respect to the host, but strictly in order with respect to each other. Thus, CUDA API calls for copying to the device, launching the CUDA kernel, and copying from the device do not wait for the respective operation to complete. Waiting then happens in the epilogue, which synchronizes the host thread with the completion of the final copy.
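A simplified version of this prologue/epilogue pair, written as CUDA C++ against the runtime API, might look as follows. This is only a sketch: the kernel, sizes, and variable names are invented for the example, and the actual plugin allocates through the custom allocator of Section 6.2.3 and launches a JIT-compiled PTX kernel through the driver API.

    #include <cuda_runtime.h>

    __global__ void addKernel(int *a, const int *b, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        a[i] = a[i] + b[i];
    }

    /* Prologue: allocate, then enqueue copies and the kernel launch on one
     * stream. All calls after the allocations return without waiting. */
    void prologue(int *hostA, int *hostB, int n,
                  int **devA, int **devB, cudaStream_t stream) {
      size_t bytes = (size_t)n * sizeof(int);
      cudaMalloc((void **)devA, bytes);
      cudaMalloc((void **)devB, bytes);
      cudaMemcpyAsync(*devA, hostA, bytes, cudaMemcpyHostToDevice, stream);
      cudaMemcpyAsync(*devB, hostB, bytes, cudaMemcpyHostToDevice, stream);
      int threads = 256, blocks = (n + threads - 1) / threads;
      addKernel<<<blocks, threads, 0, stream>>>(*devA, *devB, n);
      cudaMemcpyAsync(hostA, *devA, bytes, cudaMemcpyDeviceToHost, stream);
    }

    /* Epilogue: wait for the final device-to-host copy, then release memory. */
    void epilogue(int *devA, int *devB, cudaStream_t stream) {
      cudaStreamSynchronize(stream);
      cudaFree(devA);
      cudaFree(devB);
    }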

6.2.2.3 The Xeon PHI Target Plugin

As a proof of concept, a target plugin for the Intel Xeon PHI parallel coprocessor was implemented [Böh15]. The Xeon PHI accelerator is an extension card mounted in a host system, running an independent instance of Linux. The host communicates with the Xeon PHI device system through a network stack, e.g., via low-level APIs or high-level MPI. For the aphes compiler, this has two important consequences:

1. Offloading computations to this platform requires a binary running on it natively.

2. Memory transfers and target region invocation require a custom communication protocol.

Generating the native binary is an involved process, because Xeon PHI support in open compilers, and in LLVM in particular, is lacking. Böhm follows Damschen et al. [Dam+15] in their approach of generating C code from the LLVM IR and compiling it with the native Intel compiler. Fortunately, the Xeon PHI runtime framework supports OpenMP, which is used for parallelizing the code.


Table 6.2: Tunable parameters exposed by the dependence-based parallelization

    Parameter   Class    Range                      Semantics
    -- Work distribution --
    P_i         Ratio    P_i ∈ [0, Û]               Partitioning and Platform Mapping
    -- GPU Platform Parameters --
    T^GPU       Ratio    T^GPU ∈ [1, 32] ⊂ Z        Warps per GPU Block
    -- CPU Platform Parameters --
    T^CPU       Ratio    T^CPU ∈ [1, cores] ⊂ Z     CPU Threads

The target plugin reuses our existing OpenMP parallelizer to do so. Lastly, the parallelized code is linked with a runtime environment, which handles the communication as well as target region and memory management. The communication protocol is further described in Section 6.2.3.

Despite these unique requirements, the emitted code deviates only slightly from that of the other two target plugins. Invoking the target region and transferring the data is a single operation that is performed during the prologue. The epilogue fetches the result data, which simultaneously synchronizes the execution. The target region runs asynchronously on the device and is executed immediately once the data arrives.

6.2.2.4 Target-Specific Tuning Parameters

For a given source region, aphes creates a single tuner instance that we refer to as the tuning context for the region. For this instance, it inserts a special prologue and epilogue, which are executed before all target plugin prologues and after all epilogues, respectively. In these sections, the runtime environment invokes the libtuning tuner. The prologue updates the configuration of all tuning parameters in this context and starts the time measurement; the epilogue stops the time measurement and updates the configuration.

The tuning parameters of a context are created either by the aphes driver or the individual target plugins, which can hook into the context and register arbitrary target-specific parameters. Examples are T^CPU, the number of OpenMP threads for the OpenMP and Xeon PHI targets, or T^GPU, the number of full warps per block for the CUDA target. For every target platform P_i, the aphes driver adds a single parameter P_i with the value range [0, Û] for some constant upper bound Û. The full set of extracted parameters is summarized in Table 6.2.

119 Chapter 6 Automatic Heterogeneous Parallelization

ˆ ˆ parameter Pi = [0, U] for some constant upper bound U. The full set of extracted parameters is summarized in Table 6.2.

The parameters Pi simultaneously select the platforms to offload to as well as the amount of work assigned to each one. This is a trick we resort to to avoid having to introduce nominal parameters for mapping partitions onto platforms while at the same time ensuring that the mapping is one-to-one, i.e., a platform is only assigned at most one slice of the work. The objective of these parameters is, for each platform i, to assign it a partition of the program part’s loop iteration range S Pi ⊆ [L, U] = R, such that j Pj = R and Pj ∩ Pk = ∅, ∀j,k : j 6= k. Interaction with the autotuner requires embedding parameters into a static numerical range, ˆ but L and U are dynamic in general. We solve this problem by defining Pi = [0, U] ˆ for some arbitrary but fixed U. Then, every configuration CPi ∈ Pi for the Pi ˆ produces a partitioning of the virtual iteration range R = [0, maxi CPi ]. For every ˆ ˆ Pi ⊆ R we can thus obtain P by simply shifting and rescaling the interval. To ˆ 3 compute Pi = [li, ui] we first sort the sequence of CPi in ascending order . Let σ(i) denote the index of CPi in the sorted sequence. Then we set

    u_i = C_{P_i},
    l_i = 0          if σ(i) = 0,
    l_i = C_{P_j}    if σ(i) = σ(j) + 1.

Figure 6.5 shows an example of this process. Û is set to 256, and the original iteration range is R = [0, 32). The tuner has configured the tuning parameters to (P_CUDA, P_OMP) = (10, 40). This produces the virtual iteration ranges [l_{P_CUDA}, u_{P_CUDA}) = [0, 10) and [l_{P_OMP}, u_{P_OMP}) = [10, 40); rescaling these by the factor 32/40, we obtain the final iteration ranges [0, 8) and [8, 32) for the two targets.
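A compact C++ sketch of this computation is shown below. It assumes integer parameter values and a half-open concrete range [L, U); function and variable names are ours, not those of the runtime system. Called as partition({10, 40}, 0, 32), it reproduces the ranges [0, 8) and [8, 32) from Figure 6.5.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    struct Range { std::int64_t lower, upper; };   // half-open [lower, upper)

    // config[i] is the configured value C_Pi of target i's partitioning parameter.
    std::vector<Range> partition(const std::vector<std::int64_t> &config,
                                 std::int64_t L, std::int64_t U) {
      const std::size_t n = config.size();
      if (n == 0) return {};                        // at least one target expected
      std::vector<std::size_t> order(n);
      std::iota(order.begin(), order.end(), 0);
      // Stable sort preserves the relative order of equal values (footnote 3).
      std::stable_sort(order.begin(), order.end(),
                       [&](std::size_t a, std::size_t b) { return config[a] < config[b]; });

      const std::int64_t maxC = std::max<std::int64_t>(config[order.back()], 1);
      std::vector<Range> result(n);
      std::int64_t prev = 0;                        // l_i = 0 for sigma(i) = 0
      for (std::size_t pos = 0; pos < n; ++pos) {
        const std::size_t i = order[pos];
        // Virtual range [prev, config[i]) rescaled onto the concrete range [L, U).
        result[i].lower = L + (U - L) * prev / maxC;
        result[i].upper = L + (U - L) * config[i] / maxC;
        prev = config[i];                           // l_j = C_Pi for sigma(j) = sigma(i) + 1
      }
      return result;
    }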

6.2.3 Runtime System

The aphes runtime system performs three primary tasks: managing and asynchronously executing target regions, managing memory allocations and transfers, and interfacing between the parallelized application and the libtuning tuner. For the Xeon PHI target [Böh15], the first and second task involve driving the MPI communication with the device-side target binary. The runtime system is shipped together with aphes as a library.

³ If C_{P_j} = C_{P_k}, the elements' relative order is preserved to break ties deterministically.


    /* Original loop (concrete iteration range [0, 32)): */
    void Add(int *A, int *B) {
      for (char i = 0; i < 32; ++i)
        A[i] = A[i] + B[i];
    }

    /* Virtual range:  0 ---- P_CUDA = 10 ---- P_OMP = 40 ---- Û = 256
       Rescaling by 32/40 maps [0, 10) and [10, 40) onto [0, 8) and [8, 32): */

    void AddCUDA(int *A, int *B) {          /* l = 0, u = 8  */
      for (char i = 0; i < 8; ++i)
        A[i] = A[i] + B[i];
    }

    void AddOMP(int *A, int *B) {           /* l = 8, u = 32 */
      for (char i = 8; i < 32; ++i)
        A[i] = A[i] + B[i];
    }

Figure 6.5: Example: Partitioning the work between an OpenMP and a CUDA target. The partitioned virtual index range is mapped onto the concrete range in the input program.

The aphes compiler links it into the generated binaries automatically. Target region management on the host side currently only has to deal with CUDA kernels. The aphes compiler translates the kernel into CUDA PTX, a CUDA-specific assembly code, and embeds it directly into the parallelized binary. To launch the kernel, the runtime system loads the assembly code and JIT-compiles it for the CUDA driver. The result is cached for future executions. Once compiled, the kernel can be launched.

For the Xeon PHI target [Böh15], launching the target region and transferring memory is done in a single, five-step process: The host first sends a region identifier to the device-side binary⁴. Following the identifier, the host sends all scalar variables, then all memory ranges read by the region. Once this transfer is complete, the device-side binary automatically launches the selected region. Upon completion of the region, the device-side binary returns all memory ranges written by the region to the host-side runtime.

Both for the CUDA GPU and the Xeon PHI, device-side memory is managed by a custom dynamic memory allocator. The allocator is part of the aphes runtime and maintains a single contiguous chunk of memory.

⁴ Additionally there is a special identifier that, when sent here, signals program termination.


    /* Managed device chunk, byte offsets:  0   8   16   24   32   40 */

    void Add(char A[32], char *C) {
      char *B = A + 4;
      for (char i = 0; i < 16; ++i)
        C[i] = A[i] + B[i];
    }

Figure 6.6: Example: Allocation mapping for three arrays.

In the prologue, in which the sizes of all required array slices are known dynamically, the allocator is invoked to remap all arrays accessed in the target region into the managed chunk, growing it if required. If host pointers point to overlapping regions of memory, the allocator automatically merges their mappings accordingly.

Figure 6.6 shows an example of this process. The allocator has allocated a contiguous array of 40 bytes. Mapping the arrays A, B, and C into this memory occupies only 34 bytes. Even though the size of A is 32 bytes, and the size of C is unknown, the loop will access only 16 bytes of C, and 20 bytes of A, namely once 16 bytes through A directly at offset 0, and once 16 bytes indirectly through B at offset 4. To compute the mapping, the allocator represents every array A_i as a triple (Addr_{A_i}, Size_{A_i}, Offset_{A_i}). Offset_{A_i} is initially 0. Sorting these triples in ascending lexicographic order, subsequent tuples for A_i and A_j are merged if and only if Addr_{A_i} + Size_{A_i} > Addr_{A_j}. If the inequality holds, then A_i and A_j overlap in memory. Upon merging, we set Offset_{A_j} = Addr_{A_j} − Addr_{A_i} and Addr_{A_j} = Addr_{A_i}. The merged triples can then be trivially mapped onto the contiguous space, growing the allocation if necessary.
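The merge step can be sketched in a few lines of C++. The sketch assumes the triple representation described above; the size update of the merged entry is not spelled out in the text and is added here as one plausible way to keep the mapping consistent.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ArrayMapping {
      std::uintptr_t addr;   // Addr_Ai: host start address of the accessed range
      std::size_t size;      // Size_Ai: bytes accessed through this pointer
      std::size_t offset;    // Offset_Ai: offset into the merged mapping, initially 0
    };

    void mergeOverlapping(std::vector<ArrayMapping> &maps) {
      // Sort triples in ascending lexicographic order (by address, then size).
      std::sort(maps.begin(), maps.end(),
                [](const ArrayMapping &a, const ArrayMapping &b) {
                  return a.addr != b.addr ? a.addr < b.addr : a.size < b.size;
                });
      for (std::size_t j = 1; j < maps.size(); ++j) {
        ArrayMapping &prev = maps[j - 1];
        ArrayMapping &cur = maps[j];
        if (prev.addr + prev.size > cur.addr) {      // A_i and A_j overlap
          cur.offset = cur.addr - prev.addr;         // Offset_Aj = Addr_Aj - Addr_Ai
          cur.addr = prev.addr;                      // Addr_Aj = Addr_Ai
          // Grow the covering entry so that later layout reserves enough space
          // (an assumption of this sketch, not stated in the text).
          prev.size = std::max(prev.size, cur.offset + cur.size);
        }
      }
      // The merged triples can now be laid out contiguously in the device chunk.
    }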

Lastly, the runtime system’s tuning interface implements interoperation with the libtuning library. It maintains a registry of instantiated and running tuners, and provides an interface to that registry to the parallelized program. Finally, it also implements the algorithm described in Section 6.2.2.4 to compute loop iteration bounds for the target region execution.


6.3 Offloading Heuristics and Runtime Predictors

As we highlighted in Chapter 4, the vast majority of approaches to automatic parallelization or automatic offloading rely on heuristics, cost models, and machine and performance models to make tuning decisions. Although the goal of this thesis is to demonstrate the power of black-box online autotuning applied to the problem, the benefits of the models developed in the past should not be ignored. With aphes and libtuning we combine both techniques. The libtuning autotuner allows clients to reject configurations without actually measuring them. In aphes, we use this mechanism to bridge between static performance models and autotuning. For this purpose, aphes includes a small embedded DSL that compiler developers may use to implement rejection rules. An example of such a rule can be as simple as

    C_{P_i} == 1 → ComputeVolume_C > Threshold.

ComputeVolume here is a performance model predicting the number of dynamic instructions executed in the source region. Assuming every instruction's execution takes the same amount of time, the compute volume is an accurate estimate of the time required to execute the source region. When the ComputeVolume models the target region on the other hand, it additionally depends on the tuning parameters. That allows us to make a-priori runtime predictions for the transformed and tuned program. The heuristic implemented by the simple rule above can intuitively be interpreted as "Only execute this region on the GPU if in this configuration it will perform enough computations". As a proof of concept, we implemented the rule above in the aphes compiler.

Besides the ComputeVolume, aphes embeds multiple similar analytical models into the application during transformation. That includes the number of load and store instructions executed, and the ArithmeticIntensity. The latter describes the ratio between the ComputeVolume of the source region and the amount of memory it accesses. The ArithmeticIntensity is thus a metric for the number of operations performed per byte of memory transferred.

In general, the analytical models are parametric. For example, the number of load instructions executed by the source region depends on the number of loop iterations. This number generally depends on the input of the program or the current region. Consequently, aphes cannot evaluate the models statically and has to defer that to the program's execution. For every model, aphes generates

code to compute its value at program runtime. The models are then evaluated at runtime as needed. Additionally, because they are parametric, we can exploit the models for a second important purpose. The ComputeVolume in combination with the numbers of loads and stores provides an approximation of the runtime of the parallelized loop. This means that the parameters in these expressions directly influence the loop runtime. Because the parameters by construction are program variables, they are perfect candidates to serve as indicators for hybrid autotuning. Once the models have been computed, aphes thus extracts all their parameters and registers them with the autotuner accordingly.
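The rejection mechanism can be pictured as a predicate that the tuner evaluates before measuring a candidate configuration. The sketch below is a hypothetical C++ rendering of the example rule, not the embedded DSL itself; Config, the platform encoding, and the threshold are assumptions made for the illustration.

    #include <functional>

    struct Config {
      int platform;        // value chosen for P_i: 0 = CPU, 1 = GPU
      double workShare;    // share of the iteration range assigned to partition i
    };

    // Parametric model evaluated at run time, e.g. the ComputeVolume of the
    // tuned target region as a function of the configuration.
    using Model = std::function<double(const Config &)>;

    // Returns true if the configuration should be rejected without measuring it:
    // GPU mappings are vetoed unless the predicted work exceeds the threshold.
    bool rejectConfiguration(const Config &c, const Model &computeVolume,
                             double threshold) {
      return c.platform == 1 && computeVolume(c) <= threshold;
    }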

Currently, we support two mechanisms for deriving analytical models from the input program: based on the polyhedral model, or based on classical interprocedural data flow analysis if the source region is not representable polyhedrally. In the latter case, we count the dynamic instructions of a region recursively. For instance, the instruction count of a loop is the instruction count of its body multiplied by the number of loop iterations. Often, however, this number does not satisfy the static control criterion. That is, it is neither constant nor a provably invariant variable. In this case, we underestimate the instruction count. As a consequence, evaluating the model does not yield an exact value, but a lower bound on the specific count, which is sufficient for the users of the models. We perform this approximation whenever exact counting is impossible, in particular for for and while loops with undecidable iteration counts, if conditionals with different instruction counts in their branches, and calls to functions that cannot be analyzed interprocedurally. If, however, the polyhedral model is applicable to the source, approximating the count is unnecessary and we can determine exact instruction counts. Using libbarvinok, we directly obtain the analytical models from the polyhedral description of the source and target regions. This library provides an implementation of Barvinok's algorithm, which counts the number of points in an integer polyhedron efficiently.
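The recursive counting scheme for the data-flow-based models can be sketched over a toy region representation as follows. The structure and the choice of under-approximation are illustrative; the actual implementation derives the counts from LLVM IR and emits code that evaluates the resulting piecewise polynomial at run time.

    #include <memory>
    #include <vector>

    struct Region {
      long plainInstructions = 0;   // straight-line instructions in this region
      long tripCount = 1;           // loop trip count; 1 for non-loop regions,
                                    // -1 if not statically computable
      std::vector<std::unique_ptr<Region>> children;   // nested regions
    };

    // Returns a lower bound on the number of dynamically executed instructions.
    long computeVolume(const Region &r) {
      long inner = r.plainInstructions;
      for (const auto &child : r.children)
        inner += computeVolume(*child);
      // Under-approximate unknown trip counts with a single iteration.
      long trips = r.tripCount >= 0 ? r.tripCount : 1;
      return inner * trips;
    }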

Generating code for the analytical model expressions is straightforward. Both the data flow analysis and the polyhedral model produce piecewise polynomial expressions, parametric in the program arguments and, if modeling the target region, also in the tuning parameters.


6.4 Summary

This chapter presented the parallelizing compiler component of the APHES framework. It is implemented as an LLVM-based tool that consumes a program in LLVM intermediate representation and produces an executable binary. If parallelization was successful, the resulting program automatically offloads computations to multiple platforms and distributes the work according to directions by the autotuner presented in the previous chapter.

The program analysis and transformation is built upon two techniques: using the polyhedral model or, alternatively, using dependence testing and ad hoc transformations. The polyhedral transformation first uses the Pluto algorithm in PPCG to optimize the schedule of the transformed loop to maximize parallelism. We then modify the schedule to first partition the work and second map each partition to the target platforms. Partitioning and mapping are both governed by tunable parameters controlled by the libtuning autotuner. Lastly, using Polly-ACC we generate code from the transformed schedule. The dependence testing-based parallelization scheme is applied if the polyhedral model fails to represent the loop. Using standard techniques we detect data parallelism interprocedurally and transform parallel loops using multiple platform-specific target plugins. In addition to the partitioning of the work, both the polyhedral parallelization and the target plugins register additional platform-specific parameters with the autotuner instance.


Chapter 7

Experimental Evaluation

In this chapter we present the experimental evaluation of the APHES framework. We investigate the validity of the theses posed in Chapter 1 in three separate experiments. The validation of thesis T1 comprises two distinct aspects. Thesis T1 states that cooperative parallelization accelerates programs and outperforms the state of the art. First, we demonstrate that the approach followed by the current state of the art is insufficient to achieve the project goals. The current state of the art in cooperative multi-platform execution is represented by Qilin, CHC, and libHawaii. All three approaches use a linear regression model for the cooperative execution. Section 7.2 presents benchmarks using APHES' ad-hoc parallelization to analyze whether linear regression can accurately represent the cooperative execution problem. Second, in Section 7.3 we use polyhedral cooperative parallelization on PolyBench, a widely used scientific benchmark suite. We present the performance improvements achieved with hybrid tuning, in particular using hierarchical search.

We then analyze the behavior of hybrid autotuning in detail to validate thesis T2.

Thesis T2 claims that hybrid autotuning improves on the amortization time of the state of the art in autotuning. Our tuning approach combines two ways to obtain a new configuration for a given dynamic context state: Search and prediction. Of the two alternatives, amortization time is predominantly determined by search. Although the quality of the predicted configurations certainly has an influence on amortization time, that particular aspect is covered specifically by thesis T3.

Validating thesis T2 thus requires analyzing the behavior of the search. The autotuner decides stochastically which of the two alternatives to query. That means that we can analyze both aspects of hybrid tuning in isolation. In Section 7.3 we investigate the convergence behavior of the tuner and compare it to state-of-the-art autotuning. To analyze the amortization behavior, both the configurations

found through search as well as the time required to complete a given number of search steps are compared.

In Section 7.4 we investigate thesis T3. The thesis states that our tuner successfully predicts configurations for unknown context states. For a subset of the benchmarks used in Section 7.3 we analyze and compare the predictions of our models for varying tuning contexts. The contexts are varied by using a set of different tuning kernel inputs generated randomly. Section 7.5 summarizes the results and reviews the thesis objectives with respect to our findings. In the following section, we begin by introducing the main set of benchmarks used throughout this chapter.

7.1 Benchmarks

The benchmarks used in this evaluation were obtained from the PolyBench/C 4.2 benchmark suite¹. PolyBench is a collection of small programs for numerical computations with static control flow. Static control is a prerequisite of the polyhedral model. From this suite we used the blas and kernels sub-packages of the linear algebra benchmarks. From the 13 benchmark programs in those packages we selected 11, which are summarized in Table 7.1. The symm and trmm benchmarks were excluded for reasons discussed further below.

The selected benchmarks perform various vector and matrix operations. Although these operations have a limited variance in the data access patterns, they cover an interesting range of computational patterns. On the one hand there are benchmarks with both quadratic computational and spatial complexity. On the other hand, the benchmarks including general matrix multiplication, such as gemm, 2mm, or 3mm, have cubic computational complexity and quadratic spatial complexity. Moreover, syrk and syr2k are particularly interesting because they have triangular iteration domains. Consequently the partitions created by the parallelizer become asymmetrical. It is therefore not only the size of a partition that determines its computational cost, but also the specific iteration range it covers.

For the polyhedral parallelization evaluation in Section 7.3 we made minor modifications to the selected benchmarks to account for simplifications made in the implementation of the APHES prototype. These involve two changes. First, we modified the memory allocation routines of PolyBench to use page-locked (pinned) memory on the host.

¹ Available at https://sourceforge.net/projects/polybench/. Last accessed: February 28th, 2019.


Table 7.1: The PolyBench benchmarks used in this evaluation

    Benchmark   Description
    2mm         Two matrix multiplications (D = A · B, E = C · D)
    3mm         Three matrix multiplications (E = A · B, F = C · D, G = E · F)
    atax        Matrix transposition and vector multiplication (Aᵀ A x)
    bicg        The BiCG sub-kernel of the BiCGStab solver
    doitgen     Quantum chemistry multi-resolution analysis kernel
    gemm        General matrix multiply (C = αA · B + βC)
    gemver      Vector Multiplication and Matrix Addition
    gesummv     Scalar, Vector and Matrix Multiplication
    mvt         Matrix Vector Product and Transpose
    syr2k       Symmetric rank-2k operations
    syrk        Symmetric rank-k operations

This modification drastically simplifies the implementation of the bookkeeping required in the aphes runtime system. It allows us to use CUDA's own asynchronous memory transfer mechanisms without having to manage threads for this purpose. Note however that this change affects performance: Asynchronous transfers from and to page-locked memory enable the device driver to perform Direct Memory Access (DMA) transfers without having to copy the data to a page-locked buffer first. Data transfers after our change can thus be faster, because they eliminate a copy of the data. Nevertheless, all reference compilers compared against in Section 7.3 are applied to the modified benchmark versions. The potential performance impacts of the change thus do not invalidate the comparisons and conclusions drawn in this evaluation.

The second modification makes minor changes to the loop order in the benchmark kernels. The current prototype implementation only generates a single GPU kernel from a parallel region when using polyhedral parallelization. Some kernels' loop nests involve data dependences which require creating multiple GPU kernels in sequence. To account for this limitation of the implementation, we break the loop nests into multiple regions manually by fusing/fissioning loops and inserting memory fences. Fusing and fissioning are operations also performed by the Pluto optimizer, and the memory fences are no-ops removed during compilation. Nevertheless our change might prevent optimizations that the compiler could apply on the original regions.


However, no missed optimizations were observed and there was no discernible performance difference for the reference compilers compared against in Section 7.3. The modification causes the aphes compiler to be unable to find and parallelize a meaningful loop within the symm kernel. The kernel was therefore excluded from the evaluation. Finally, we also wrapped the calls to every benchmark's kernel into a fixed-length loop to create a tuning loop. This modification forced us to exclude the trmm kernel, because the repeated application of the kernel operation on the same inputs causes the floating point values to denormalize, i.e., assume the special value inf. Denormal floats can have a substantial impact on performance [DK06] that varies across platforms. Since our change would thus cause a change in the general behavior of this benchmark, we removed it from our list.

In addition to PolyBench we use two different benchmarks for the analysis in Section 7.2 of the linear models proposed by Qilin, CHC, and libHawaii. The benchmarks are BlackScholes and Binomial, which were both used in the original articles presenting Qilin and CHC². In total, four benchmark programs are used in both articles, the first of which is gemm, which is also used by libHawaii³ and is already included in the PolyBench benchmarks discussed above. Of the remaining three, only Binomial and BlackScholes were parallelizable by APHES. Both benchmarks compute an options pricing model and are available as CUDA programming examples as part of the CUDA SDK⁴. They contain a sequential reference implementation which we use as input to APHES. The BlackScholes benchmark already contains a natural tuning loop: The benchmark computes the Black-Scholes options pricing formula for 4 million options for 512 iterations. The 512 iterations form the tuning loop. Because the Binomial benchmark does not contain such a loop, we added one manually.

7.2 Ad-hoc Parallelization

We analyzed the performance of the PolyBench benchmarks parallelized with APHES's ad-hoc parallelization. Unfortunately, that experiment exposed a major limitation of the ad-hoc parallelization approach. Being limited to transforming outer loops only, there is little parallelism the approach can extract from the PolyBench benchmarks.

² See references [LHK09] and [LRG12].
³ See reference [RDP14].
⁴ https://docs.nvidia.com/cuda/cuda-samples/index.html Last accessed: March 19th, 2019.


[Figure 7.1: Relationship between runtime and OpenMP thread configurations for the four loops of the gemver benchmark. Four panels plot the loop runtime in milliseconds against thread counts from 1 to 16.]

Without exception, single-platform parallelization for the OpenMP target achieved the best performance. Nevertheless, we can use the results to investigate the linear modeling approach used by the current state of the art in cooperative heterogeneous execution. Both Qilin and CHC, which we introduced in Chapter 4, use a linear regression model to determine the workload distribution across the CPU and GPU platforms. Both approaches differ slightly in how they compute the regression: Qilin samples multiple subsets of the workload on both platforms and measures the execution time, whereas CHC measures only the biggest and smallest possible workload for each platform. The libHawaii autotuner that was also introduced in Chapter 4 uses a control theory approach instead of linear regression. Nevertheless, since the underlying assumption is still a linear model, the following discussion also applies to libHawaii.

Assuming a linear relationship between runtime and workload distribution is sensible when ignoring platform-specific parameters. However, when such parameters as for example the CPU and GPU thread counts are also considered, the PolyBench experiment showed that the linear model is insufficient. Figure 7.1 shows runtime results for the PolyBench gemver kernel which has been parallelized for


OpenMP using APHES. The gemver kernel performs a matrix-vector multiplication followed by a matrix addition. It is composed of four consecutive parallelizable loops. Using the ad-hoc parallelization mechanism in APHES, we transform all four loops, but this time for the OpenMP target only. Then, we measure the runtime of the individual loops for thread counts ranging from 1 to 16. The first loop, shown top left, behaves as one would naively expect: Doubling the number of threads from one to two to four roughly halves the runtime each time. At four threads, the performance peaks. The second loop, shown top right, behaves contrarily: Because it is so cheap to compute, consuming only a twentieth of a millisecond, using more than one thread slows down the execution by a factor of up to 3×. The third loop, shown bottom left, exhibits the most curious progression. The performance is maximal at five threads, after which the performance decreases up until eight threads, after which the runtime is level. There are two important aspects here: First, using either too many or too few threads hurts performance. Second, leveling off after a thread count of eight is an effect of the OpenMP implementation. When exceeding the number of available hardware threads, which is eight on this machine, the libgomp OpenMP library used by APHES changes its scheduling behavior. This is however an implementation detail of that particular library and not standardized. The effect can thus not be relied upon. The fourth gemver loop, lastly, looks again similar to the third, in that performance increases up until three threads and worsens beyond that number. Consequently, we see that we cannot rely on linear regression to optimize thread counts alongside the workload distribution.

All three cooperative parallelization approaches use a fixed configuration for the platform-specific parameters. With Qilin and libHawaii, defining these parameters is up to the application developer, whereas CHC uses a fixed CPU thread count of C − 1, where C is the number of hardware threads. As CHC takes a complete CUDA program as input, the GPU thread configuration is also left to the application developer. To analyze whether a greedy thread configuration akin to CHC is viable for our approach, we apply the ad-hoc parallelization mechanism to two of the benchmarks also used by Qilin and CHC, Binomial and BlackScholes. After parallelizing the applications with APHES, we measured the runtimes of the resulting tuning kernels for a set of manually defined configurations. Since we configure parameters manually, it is sufficient to set the number of tuning loop iterations of the Binomial benchmark to ten, which serves to stabilize our measurements. We ran the benchmarks on a machine equipped with an Intel Core i7 at 3.5 GHz with four physical and eight virtual cores and an NVIDIA GTX 970.


[Figure 7.2: Binomial: Comparison of greedily selected thread configurations; speedup plotted over thread configurations (OMP, CUDA). For the 10%:90% distribution, the speedup reference configuration is (5, 32); for the 90%:10% distribution the reference is (7, 512). The references are the optimal configurations for the respectively other distribution.]

We measure the following parameter configurations. For the workload split, we assign 10% of the loop iterations to OpenMP and 90% to CUDA, and vice versa. Those configurations correspond to CHC's approach to building the linear model. On the CPU, the OpenMP thread counts range from one to eight. The CUDA blocks contain 32, 64, 128, 256, 512, or 1024 threads, while there is one CUDA thread per work-item (i.e., per loop iteration).

Exploring the configurations exhaustively, we find that for the Binomial benchmark and the 10%:90% workload distribution, the optimal configuration is (5, 32) (i.e., 5 OpenMP threads, 32 CUDA threads per block). For the 90%:10% distribution the optimum is at (7, 512) on the other hand. While the second one, with an OpenMP thread count of seven, could have been produced by CHC, the first one could not. Worse, greedily selecting either can have a substantial negative performance impact on other workload distributions. We show that comparison in Figure 7.2. For both workload splits, the plot shows the speedup achieved by all sampled thread configurations over the configuration that was optimal for the other split. The speedup is computed from the median runtimes of the last five tuning loop iterations. The plot highlights the two optima: For the 10%:90% distribution, using the thread configuration (5, 32) over (7, 512) increases performance by almost 40%. Vice versa, using the configuration (7, 512) over (5, 32) for the 90%:10% distribution gives us a performance boost of 30%.


[Figure 7.3: BlackScholes: Comparison of greedily selected thread configurations; speedup plotted over thread configurations (OMP, CUDA). For the 10%:90% distribution, the speedup reference configuration is (7, 64); for the 90%:10% distribution the reference is (3, 32). The references are the optimal configurations for the respectively other distribution.]

We see complementary results for the BlackScholes benchmark. For the 90%:10% distribution, the optimal thread configuration is (7, 64). Interestingly, for the 10%:90% distribution, there is no single optimum when accounting for noise. In fact, the majority of configurations using three OpenMP threads or more have the same performance. The absolute best configuration was (3, 32), which we use in the comparison. The speedup results are shown in Figure 7.3. The maximum speedup for the 90%:10% split is 1.2× over the configuration (3, 32). However, had we chosen the reference configuration differently, for example like CHC using seven CPU threads, we would see a much smaller performance gain. As we can see in the plot, configurations using seven or eight CPU threads are all within a 3% range of the optimum.

In summary, we have seen that the performance observed for a specific workload distribution is sensitive to the configuration of platform-specific parameters. Using exhaustive exploration we found a performance gap of up to 40% that could arise when using a heuristically determined or greedily fixed thread configuration. We also see that even if a linear regression model can predict an optimal workload split when thread configurations are fixed, the result is not globally optimal when they are not. This insight motivates the use of a more sophisticated optimization method such as autotuning.

Moreover, the results offer evidence towards the validity of thesis T1, demonstrating that parameter tuning beyond the workload distribution is necessary to maximize performance.

7.3 Polyhedral Parallelization

In this section we present performance results for a number of scientific benchmarks. We transform the benchmarks using cooperative parallelization and hybrid autotuning. We discuss the performance results in comparison with three single-platform baselines. Subsequently, we analyze hierarchical search in detail and compare it against OpenTuner, a state-of-the-art autotuner.

7.3.1 Performance Results

In the following we present the performance improvements we achieved on the PolyBench benchmarks. For this purpose we compiled four versions of each program. Besides our compiler, we used three reference compilers as baselines: Clang using the polyhedral optimizer (Polly), using the polyhedral OpenMP parallelization (Polly-OMP), and using the polyhedral GPU parallelization (Polly-ACC). For the baselines we use the O3 optimizer level and the default settings for their tunable parameters, such as the OpenMP thread counts. The input sizes of the benchmark programs are fixed per default and can be set during compilation. PolyBench provides five default configurations labeled MINI through EXTRA_LARGE. We chose the largest configuration for every benchmark. For the gemm kernel for instance this setting defines input matrices of 2000-by-2300 and 2300-by-2600. To autotune the programs produced by the aphes compiler we set the fixed length of the tuning loop to 100. For the reference programs, the tuning loop was set to a length of 10 to allow for device initialization and cache warm-ups. The median observed kernel runtime is used as the reference result. We ran all benchmarks on a single cluster node with 24 Intel Xeon CPUs at 2.6 GHz and equipped with an NVIDIA P100 GPU.

In Figure 7.4 we show the observed kernel runtime speedups of our autotuned program over the three reference versions. As baselines we use the median runtimes of twenty repetitions of running the reference programs. The runtime samples for the tuned program are those of the best configuration found within the 100 tuning iterations.


[Figure 7.4: Speedup results for the automatically parallelized and tuned PolyBench programs over three baseline compilers: Polly, Polly-ACC, and Polly-OMP. Box plots of the speedup (logarithmic scale, 0.5× to 64×) per benchmark: atax, bicg, mvt, syrk, 2mm, 3mm, gemm, syr2k, doitgen, gemver, gesummv.]

The box plots were generated from twenty runtime samples for the tuned program. Both for the tuned and the reference versions, "runtime" here means the time required to execute the kernel once. The time is reported by the PolyBench programs directly, and includes all overhead of the autotuning, data transfers, and synchronization.

Polly-ACC is outperformed by the solution produced by our approach for all but two benchmarks, as is visible in the figure. In the remaining two the performance of the tuned binary is on par, however. Our maximum speedup over Polly-ACC is 5.9× (median) for the 2mm benchmark. For doitgen on the other hand we cause a minor slowdown of 17%. This is because that benchmark does not benefit from parallelization in general: The kernel is composed of four nested loops, of which the outer two are data parallel. At 220 and 250, the iteration counts of those two loops yield too little parallel work to benefit from GPU offloading: The best performance for doitgen is achieved by the Polly baseline. The additional decrease in performance of our approach with respect to Polly-ACC is caused by the extra overhead added by the multi-platform management.


In all cases except doitgen, our approach outperforms Polly substantially. We achieve a speedup of up to 56× for the syrk kernel. For doitgen, we incur an 18% slowdown, which again is due to the limited amount of parallelism in that benchmark. Our speedup compared to Polly-OMP is generally the lowest, except for syr2k and doitgen; in fact, we cause slowdowns for five benchmarks. The highest speedup over Polly-OMP, 5× (median), is achieved for the 2mm kernel. For the syr2k and doitgen benchmarks, Polly-ACC is able to outperform the OpenMP parallelization.

In summary, we have seen that our approach is able to improve performance substantially over single-target compilers. However, it is also capable of introducing large slowdowns. In all cases where this occurred, the reason is that the benchmarks exhibit a small degree of data parallelism coupled with little work per parallel loop iteration. That is the case for atax, bicg, gemver, gesummv, and mvt. The exploitable data parallelism allows Polly-OMP to accelerate the benchmark, but since the amount of work per work item is small, the benchmarks do not profit from GPU execution. Our tuner successfully maps all partitions onto a CPU. Nevertheless, there are still always multiple partitions being mapped, which means we incur the thread management overhead of Polly-OMP multiple times. By setting the partition splits appropriately (i.e., to zero), it is generally possible for the tuner to reduce the number of partitions. The tuner is unable to find this particular configuration because there is a discontinuity of the search space at the boundaries of this parameter’s value range. The parameter assumes values from the interval [0, 1). Values slightly different from zero, however, lead to vastly different program behavior than a value of zero: The number of platforms executing the program changes. Effectively, zero is thus a special value that behaves like a nominal parameter controlling the number of active partitions. Instead, the number of partitions could be controlled by a dedicated tuning parameter, expressing the effect more explicitly to the tuner. We expect that this work-around would eliminate the problematic overhead. A similar problem causes the performance degradation in the doitgen benchmark compared to the Polly baseline. The benchmark does not profit from parallelization, but our tuner is unable to execute the original code because the prototype compiler completely replaced it. This could be repaired similarly to the above: Retain the original code and introduce another parameter to select between the original and the transformed execution path.


7.3.2 Detailed Analysis of Hierarchical Search

Having seen that our tuner is able to find adequate configurations in the previous section, we investigate its effectiveness and efficiency in detail in the following. Being effective requires that configurations found are as close to the optimum as possible. Efficiency refers to the time the tuner requires to get to its final configuration. Unfortunately, there is no ground truth we can compare against. The globally optimal configuration is unknown and exhaustively searching for it is infeasible. Instead, we use OpenTuner [Ans+14] to provide reference data. The comparison between hierarchical search and OpenTuner was first presented in our publication [PGT19], but is presented here in greater detail.5 OpenTuner is a general-purpose autotuning tool written in Python. Application developers tune their program by providing a MeasurementInterface implementation. The OpenTuner driver iteratively passes configurations to that interface, for which it is expected to return a measurement value. Configurations and measurements are stored in a central database. OpenTuner’s unique feature is its ability to run multiple search algorithms in cooperation. At every iteration, a search algorithm is selected to provide the next configuration. Measurement results are then shared across all algorithms. Instead of a specific search algorithm or implementation, the application developer chooses a set of them or uses the predefined default. The set can also be generated randomly. Because the tuner is not specifically built as an online tuner, the search is usually run with a time budget or for a limited number of iterations. To compare hierarchical search and OpenTuner, we use OpenTuner to optimize the PolyBench benchmark programs that were used to produce the performance results discussed above. Unfortunately, OpenTuner cannot be integrated into the parallelized programs like libtuning because of its Python implementation, which is designed to act as a driver for the program to be tuned: The intended workflow is to produce a configuration in every tuning iteration, to run the program with that configuration, and to return a runtime measurement. To optimize the PolyBench programs, we thus use OpenTuner as a driver for the binaries: We implement a Python application containing the OpenTuner tuning loop. In each tuning loop iteration, the application configures and runs the binary produced by our compiler. For this purpose we exploit a feature of libtuning that allows loading parameter configurations from a file whenever the program is started.

5 The experimental design and data were contributions to the article by the author of this dissertation.

The loaded configuration is then kept unchanged for every iteration of the tuning kernel. Using this mechanism, OpenTuner can evaluate a configuration by writing it to the file and executing the benchmark program, which then loads the configuration file and prints out the kernel timings. Three kernel timings are produced by setting the tuning loop length accordingly. We observed that the first timing is always substantially worse than the following ones, which we attribute to the necessary initialization of the GPU device and to cache warming effects. The first timing is hence discarded. OpenTuner uses the average of the remaining ones as the runtime result for the sampled configuration.
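The driver follows the usual OpenTuner pattern of subclassing MeasurementInterface. The sketch below illustrates the workflow just described; the parameter names, the configuration file name, and the benchmark binary are placeholders, and the actual tuning space in our experiments is defined by the aphes compiler rather than hard-coded.

    import opentuner
    from opentuner import ConfigurationManipulator, IntegerParameter, MeasurementInterface, Result

    class PolyBenchDriver(MeasurementInterface):
        def manipulator(self):
            # Placeholder parameters; the real tuning space is defined by the aphes compiler.
            m = ConfigurationManipulator()
            m.add_parameter(IntegerParameter('omp_threads', 1, 24))
            m.add_parameter(IntegerParameter('gpu_work_per_thread', 1, 32))
            return m

        def run(self, desired_result, input, limit):
            cfg = desired_result.configuration.data
            # Write the configuration to the file the instrumented binary loads on startup.
            with open('tuning_config.txt', 'w') as f:
                for name, value in cfg.items():
                    f.write('{} {}\n'.format(name, value))
            # Run the benchmark binary; it prints one kernel timing per tuning-loop iteration.
            out = self.call_program('./gemm_aphes')
            timings = [float(t) for t in out['stdout'].decode().split()]
            # Discard the first timing (GPU initialization, cache warm-up), average the rest.
            return Result(time=sum(timings[1:]) / len(timings[1:]))

    if __name__ == '__main__':
        PolyBenchDriver.main(opentuner.default_argparser().parse_args())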

While we are forced to take these measures to be able to optimize our parallelized programs with OpenTuner, they cause multiple subtle differences in how configurations are executed and in the performance results the searches can observe for the same configuration. These differences bias the comparison towards OpenTuner, with one exception: In half of the PolyBench programs, the aphes compiler generates multiple parallel regions in sequence, each of which is associated with its own instance of the libtuning autotuner. When optimizing, each instance observes only the runtime of its associated region. Since OpenTuner is not integrated into the application, however, it is forced to optimize all regions together, thus solving a larger tuning problem. Nevertheless, in the remaining half of the programs there is only one parallel region, allowing for a one-to-one comparison between OpenTuner and hierarchical search. The programs with one parallel region are doitgen, gemm, gesummv, syr2k, and syrk. Furthermore, two additional major differences exist that strongly favor the OpenTuner search. Firstly, the performance measurements observed by libtuning always include all overheads of the search and parameter updates, which are not included in OpenTuner’s timing results. We cannot avoid any effect this bias potentially has on the experiment. Secondly, OpenTuner not only avoids the penalties of its own overhead, it also avoids some overhead that is caused by the parallel execution within the optimized programs: Whenever a GPU kernel is executed for the first time, it is Just-In-Time compiled for the GPU device. The cost associated with that is not observed by OpenTuner because the first runtime measurement containing this cost is discarded. With libtuning, we do not discard any configurations by default and thus incur the full kernel initialization overhead. We found the effect of this bias to be substantial, in particular when the additional cost is large compared to the actual execution time of the kernel.



Figure 7.5: Comparison between the configurations found by hierarchical search and OpenTuner after 100 search steps. The plot shows the best runtimes observed during these steps.

Therefore, we take steps to prevent this disparity between OpenTuner and hierarchical search in the experiment we conduct: Using the Stabilization mechanism in libtuning (see Section 3.2, page 41), we configure the autotuner to sample each configuration twice and discard the first result. As a consequence, we also double the length of the tuning loop when tuning with hierarchical search: For OpenTuner, the PolyBench programs are tuned for 100 iterations, whereas 200 iterations are used for hierarchical search, which corresponds to 100 search steps. In both cases, the search is repeated ten times. To compare the two search algorithms, we focus on two aspects: First, we consider the tuning result produced by either search; as the tuning result we take the best configuration found during the search. Second, we analyze the total time required to complete a fixed number of search steps. We first consider the best configurations found by the search algorithms. Figure 7.5 depicts the runtimes of the best configurations among the tuning runs of the respective searches: We see that the two algorithms find configurations that achieve comparable performance for half the benchmarks. The data further shows that hierarchical search’s tuning results outperform OpenTuner by a small percentage in the atax, bicg, gemver, and mvt benchmarks. What these four have in common is that they are both cheap (on the order of milliseconds or tens of milliseconds) and contain multiple parallel regions in sequence.



Figure 7.6: Runtime of the configurations found by hierarchical search normalized to the best configuration found by OpenTuner across all tuning runs.

Because the benchmarks are cheap, the search space contains many bad configurations, in particular those using the GPU. Additionally, the multiple parallel regions render the search space more complex for OpenTuner, as discussed above. On the other hand, the only benchmark where the OpenTuner results are overall ahead is syr2k. For this benchmark we observed that OpenTuner finds configurations close to its optimum after the first quarter of the search, whereas hierarchical search takes at least twice as long, if it finds those configurations at all. This shows that OpenTuner’s more eager exploration is helpful and necessary to find the optimum in this particular benchmark. To visualize the worst-case loss in performance we incur with hierarchical search, Figure 7.6 shows the performance of the search results relative to the absolute best configuration found by OpenTuner across all of its tuning runs. For five benchmarks, namely atax, bicg, doitgen, gemver, and mvt, the performance is on par with or better by a small margin than the OpenTuner result. For the remaining benchmarks, the configurations found by OpenTuner outperform those of hierarchical search. The geometric mean of the relative performance is 0.85 (the median is 0.96), meaning that in the worst case, hierarchical search results are on average 15% slower than those of OpenTuner.



Figure 7.7: The time required to complete 100 search steps using OpenTuner or hierarchical search. Both search techniques require the same amount of time for the 2mm, 3mm, gemm, and syr2k benchmarks. Hierarchical search is faster in the remaining ones.


Figure 7.8: The time required to execute 100 search iterations using hierarchical search relative to the average time required by OpenTuner. The median speedup is 1.22.


Figure 7.9: Execution timeline of 100 search steps with OpenTuner and hierarchical search for (a) gemm and (b) syrk. Dots are median times required to reach an iteration, the ribbon boundaries are the minimum and maximum. [PGT19]

The key claim of hierarchical search is that it accelerates the search. To assess that claim we compare the cumulative runtime of the configurations sampled by OpenTuner and hierarchical search. Figure 7.7 shows the comparison. The plot shows the cumulative runtime required to complete 100 search steps. It includes only the kernel runtimes, which comprise all data transfers, the thread management, and, for hierarchical search, all overhead of the search implementation. The plot shows that hierarchical search completes 100 iterations faster than OpenTuner for seven benchmarks. For 2mm, 3mm, gemm, and syr2k, no acceleration is achieved. For a better visualization of the relative search performance, we refer to Figure 7.8. This plot shows the speedup of hierarchical search over OpenTuner’s average search time. As before, the runtime measurements for our tuner include all tuning overhead, which is excluded in the OpenTuner measurements. The best performance is achieved in the syrk benchmark, for which we reach a median speedup of up to 1.49× and a maximum speedup of 2×. The 2mm, 3mm, and syr2k benchmarks, on the other hand, exhibit the worst relative performance with median speedups between 0.94 and 0.95. Overall, the median speedup of hierarchical search compared to OpenTuner is 1.22 (the geometric mean is 1.13). In other words: On average, hierarchical search reduces the search time by over 10% compared to OpenTuner, and by over 30% in the best case. To illustrate the differences in the search behavior between OpenTuner and hierarchical search, Figure 7.9 shows the cumulative runtimes for two of the benchmarks in greater detail. The selected benchmarks are gemm, as a representative for those cases where our search did not accelerate the search, and syrk, for which we achieved the best results. Both benchmarks contain only a single parallel region and therefore allow for a direct comparison between hierarchical and OpenTuner search.

The dots in the plots denote the median time required to reach the iteration on the abscissa, the ribbon boundaries are the minimum and maximum. The data points at iteration 100 are hence the basis for the plots in Figure 7.8. For gemm, the lower ribbon boundary shows OpenTuner searches which quickly find configurations with excellent performance and then retain those. After about iteration 16, the lower ribbon boundary is made up of just two search instances, which trade places in iteration 72. Within both instances we see little variance in the kernel runtimes, which indicates that the search ensemble behaves greedily for the majority of the iterations, varying configuration values only slightly, and only rarely makes larger exploratory steps. Contrasting the two searches in the plot with each other, we furthermore see the difference in the search behavior: While OpenTuner may take exploratory steps at any iteration, hierarchical search is explorative in the beginning and then becomes more greedy over time. The latter effect is directly controlled by the γ parameter, which governs the stepwise decrease of the ε in the ε-Greedy search. Another interesting effect we see in the hierarchical search curve is the upward bend the upper ribbon boundary exhibits around iteration 70. This effect is produced by a search instance that becomes stuck in a local extremum that it cannot escape, even though that extremum has a worse performance value than configurations sampled previously. Our implementation takes no measures to avoid such situations, but the effect can be easily mitigated in a practical deployment: Once the tuner has converged, the best known configuration can be applied instead of the one sampled last. Comparing gemm with syrk, we see that the median and lower ribbon boundaries of hierarchical search progress almost identically. The upper ribbon boundary of syrk is formed by the outlier visible in Figure 7.8. Here, hierarchical search converges to a local extremum again. Hierarchical search is unable to escape the local extremum, causing the poor cumulative runtime. However, unlike above, the runtime at the extremum is not any worse than the configurations sampled before. To mitigate this issue, it is therefore necessary to re-run the search to improve the configuration, but in the context of hybrid tuning that already happens automatically: When the hybrid tuner chooses to explore, a new search is started. Considering the OpenTuner progression for syrk highlights OpenTuner’s explorative nature. Although we see both ribbon boundaries and the median level off briefly at around 25 iterations, they all rise steeply shortly after. The tuner leaves good configurations behind to eagerly explore more of the space, whereas hierarchical tuning more conservatively zeros in on good configurations.
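To make the role of these two hyper-parameters concrete, the following Python sketch shows a plain ε-Greedy selection step over a nominal parameter with a multiplicative ε-decay. It is an illustration only: the function and state names are not taken from libtuning, and the exact decay schedule used by our implementation may differ.

    import random

    def epsilon_greedy_step(values, state, measured_runtime=None):
        """One selection step over a nominal parameter with epsilon decay.

        values:           admissible nominal values (e.g., platform choices)
        state:            dict with 'epsilon', 'gamma', 'last', and 'runtimes'
        measured_runtime: runtime observed for the previously returned value, if any
        """
        # Record the measurement for the value chosen in the previous step.
        if measured_runtime is not None and state['last'] is not None:
            old = state['runtimes'].get(state['last'])
            if old is None or measured_runtime < old:
                state['runtimes'][state['last']] = measured_runtime

        if not state['runtimes'] or random.random() < state['epsilon']:
            choice = random.choice(values)                              # explore: random draw
        else:
            choice = min(state['runtimes'], key=state['runtimes'].get)  # exploit: best value so far

        state['epsilon'] *= (1.0 - state['gamma'])  # gamma = 0 keeps epsilon constant
        state['last'] = choice
        return choice

    # Default hyper-parameters used in the experiments: epsilon = 0.05, gamma = 0.1.
    state = {'epsilon': 0.05, 'gamma': 0.1, 'last': None, 'runtimes': {}}

With γ = 0 the exploration probability never decays, which corresponds to the classic ε-Greedy algorithm; a larger ε makes random exploratory draws more frequent. Both effects are examined for gemm later in this section.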



Figure 7.10: Comparison between the configurations found by hierarchical search and Nelder-Mead after 100 search steps. The plot shows the best runtimes observed during these steps.

In addition to OpenTuner, a comparison between hierarchical search and our own implementation of the Nelder-Mead algorithm is presented. This comparison provides a direct insight into the effect of the search space decomposition: The Nelder-Mead implementation is also used in the decomposed space in the non-nominal nodes. We can thus see how the dimensionality reduction and the mixed search algorithms simplify the search. A search using only Nelder-Mead corresponds to a hierarchical search where the difference between nominal and non-nominal parameters as well as parameter dependence constraints are ignored. In other words, a Nelder-Mead search using our implementation is identical to a hierarchical search whose search space graph consists of a single non-nominal node. We include this comparison to directly show the effect of our search space decomposition. The Nelder-Mead data discussed in the following was obtained from five tuning runs. Unlike in the performance comparison against OpenTuner, we abstain from using the double-sampling trick for hierarchical search here. Since Nelder-Mead is implemented in libtuning, we can use it to optimize the parallelized regions directly. Hence, it is unnecessary to repeat and discard measurements during hierarchical searches.



Figure 7.11: The time required to complete 100 search steps using Nelder-Mead or hierarchical search. Hierarchical search completes its steps faster than Nelder-Mead without exception.

Figure 7.10 compares the performance of configurations found by both hierarchical search and Nelder-Mead. Except for four cases, the performance is on par. The exceptions are 2mm, doitgen, gemm, and syr2k. This difference is most likely caused by Nelder-Mead being unable to complete its search within the allotted frame of 100 iterations: We observed that in multiple of the tuning runs for those benchmarks the Nelder-Mead search had not converged within 100 steps, and that it takes about twice that number for the search to reach a stable point with little runtime variance. Having shown that hierarchical search does not produce configurations worse than the original Nelder-Mead, we also compare the cumulative runtime of the searches. The runtimes are presented in Figure 7.11. Without exception, hierarchical search completes faster than Nelder-Mead. The median speedup over the average Nelder-Mead search time is 1.9 (the geometric mean is 1.8): Hierarchical search reduces the search time almost by half. This result shows that by not using Nelder-Mead to optimize nominal parameters and by reducing the space complexity, the search can be greatly accelerated. The comparison against OpenTuner revealed that the search space we are facing in the gemm benchmark is especially interesting, in particular since hierarchical search is unable to accelerate the search. Because of this we used gemm as the basis for a deeper analysis in our publication [PGT19], which we will discuss in the remainder of this section.



Figure 7.12: gemm: Sensitivity of hierarchical search to variation of the hyperparameters: A large ε value encourages exploration at the cost of convergence. Disabling the ε-decay by setting γ = 0 produces spikes when sub-optimal configurations are tried more frequently. The three panels use the default ε = 0.05, γ = 0.1 (left), ε = 0.1, γ = 0.1 (middle), and ε = 0.05, γ = 0 (right). (Based on [PGT19])

The following experiments were executed on hardware effectively identical to that used for the experiments above, albeit with a clock speed per core that is 100 MHz lower. Using gemm, we investigate both the influence of the hyper-parameters in hierarchical search and the effect of the rejection rules. The most impactful hyper-parameters are those of the ε-Greedy algorithm used in the nominal nodes in the search space graph: the ε, which controls its propensity to explore, and the γ, which defines the change of ε at every tuning step. Figure 7.12 demonstrates the effects of setting adverse values on the runtimes observed during tuning. Note that the traces shown in the plot pertain to single tuning runs. We do not show aggregates of multiple repetitions here, because aggregation smoothes out the spikes and thus hides the performance impact caused by the ε-Greedy algorithm choosing random configurations. As a consequence, the shown results and their discussion need to be considered an illustrative example. The curve on the left shows a tuning curve using our default parameter values ε = 0.05, γ = 0.1.

In the middle, we set the ε parameter to 10%. Although we can see a macroscopic decreasing trend not unlike in the default case, it is distorted by frequent spikes.



Figure 7.13: gemm: Cumulative time traces for the different parametrizations of hierarchical search (HS) compared with classical Nelder-Mead (NM) and the two parallelizing reference compilers. (Based on [PGT19])

Because of the larger parameter value, the ε-Greedy algorithm is both more likely to choose a random configuration and takes longer to reduce the ε enough to turn into a greedy algorithm. Lastly, in the plot on the right, we use the default ε = 0.05 but set γ to zero. The plot therefore shows the behavior of the original ε-Greedy algorithm as it is used in Reinforcement Learning, for instance. In essence, we see the same decreasing trend as in the leftmost plot. Unlike in the default configuration, however, we see spikes on the tail. In Figure 7.13 we show the effect of the parameter settings with respect to cumulative search time. The curves show the accumulated kernel runtimes (including overhead) of the tuning traces in Figure 7.12 over the progression of the tuner. As baselines we included the two parallelizing reference compilers used in Section 7.3.1, as well as our implementation of the classical Nelder-Mead search. The Nelder-Mead curve was obtained by optimizing the parallelized benchmark using only that search algorithm, treating nominal parameters as if they were simple integer intervals. Comparing the Nelder-Mead curve with the hierarchical search curve using our default configuration, we see that our approach completes 100 iterations in half the time.



Figure 7.14: gemm: Search behavior with small matrix sizes using rejection rules, compared to the reference compilers and to the search without the rules. In the first iteration, the GPU device is being initialized for both our approach and Polly-ACC, causing the spike. (Based on [PGT19])

Moreover, we see that our approach outperforms Nelder-Mead even with less optimal settings, with the exception of the ε = 10% case, for which the two searches change ranks at around iteration 250. All search algorithms find configurations that offer speedup over both reference compiler baselines. This demonstrates that our search space decomposition can be beneficial even with unfortunate parameter settings, but the hyper-parameters still offer room for optimization. Finally, we also use the gemm benchmark to show the effect of the manual rules which are used to reject tuning configurations in the nominal search space graph nodes. Because the inputs we used for the PolyBench programs were sufficiently large in the experiments discussed so far, no configuration was ever rejected. To enforce rejection, we compiled and ran gemm with a smaller configuration. We used the SMALL_DATASET default setting. In Figure 7.14 we show the effect the rules contribute. The plot shows single traces using hierarchical search with and without rules. Again, we analyze individual traces instead of aggregates so as not to average out the negative effects the single traces emphasize. For comparison, we also include the Polly and Polly-ACC baselines. We see that, because of the small input size, using the unparallelized kernel version offers the highest performance.

Even the single-platform Polly-ACC baseline is substantially faster than our solution. This is because our compiler prototype does not implement a fallback to the original unparallelized kernel. The tuner is forced to always select a parallel version. However, the rejection rules help mitigate this issue. Because of the input size, they reject configurations that offload computations to a GPU. Without the rules, the tuner samples configurations which are off the chart here and cause slowdowns of 100× and more. The Polly-ACC curve exposes an interesting detail: The reference compiler adds a similar rejection rule, and it triggers for this benchmark as well. The Polly-ACC rule is much less sophisticated and much easier to compute than ours. Therefore, the overhead that the Polly-ACC plot shows over the Polly baseline can be seen as a lower bound on the overhead introduced by our rejection rules. The upper bound on the overhead is hard to determine. In the second iteration of the trace in Figure 7.14, which is also off the scale, we identify one particular problem of the rejection mechanism. When a configuration is rejected, the search algorithm produces a new one, which for ε-Greedy means drawing a new random sample. The re-sampling continues until an acceptable configuration is found, which in this case causes a slowdown of 600× over the Polly baseline.
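A minimal, self-contained Python sketch of this rejection loop is shown below. The rule it applies, vetoing GPU offloading below a work threshold, only illustrates the kind of domain-knowledge rule described here; neither the rule nor the threshold is the one actually used by the aphes compiler.

    import random

    def make_gpu_rejection_rule(parallel_iterations, threshold=10000):
        # Illustrative rule only: veto GPU offloading when the parallel loop has too little work.
        # The threshold is a made-up number, not the compiler's actual heuristic.
        def reject(platform):
            return platform == 'gpu' and parallel_iterations < threshold
        return reject

    def draw_accepted_value(values, reject, max_tries=100):
        """Re-sample random nominal values until one passes the rejection rule.

        Rejected draws cost no kernel execution, but the repeated re-sampling itself
        is the overhead visible in the second iteration of the trace in Figure 7.14.
        """
        candidate = random.choice(values)
        for _ in range(max_tries):
            if not reject(candidate):
                break
            candidate = random.choice(values)
        return candidate

    # Example: a small input makes the rule reject every GPU-mapped candidate.
    reject = make_gpu_rejection_rule(parallel_iterations=200)
    platform = draw_accepted_value(['cpu', 'gpu'], reject)  # always returns 'cpu' here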

7.3.3 Discussion

The results presented in this section have shown that the APHES framework is able to automatically accelerate data parallel programs. Offloading to multiple parallel platforms outperforms single-platform and non-parallelizing reference compilers. Moreover, we have shown that hierarchical search is able to accelerate the search for configurations. The automatic reduction of the search space size and the use of individual search algorithms for different parameter classes accelerate the search by a factor of up to two and by 1.13 on average. This acceleration was shown against OpenTuner, a current state-of-the-art autotuner. The configurations found by our approach are on par with those found by OpenTuner. Because of that and because of the acceleration of the search, we can conclude that hierarchical search is able to improve amortization time. The rejection rules, which aim to avoid sampling predictably bad configurations of the nominal parameters, have proven to be successful. Despite that, the rules introduce a noteworthy overhead when the parallel workload is insufficient. Being unable to execute the original unparallelized version of the code poses another limitation of our prototypical implementation. Consequently, our compiler can cause degraded performance when the parallel region operates on too small inputs. For a production deployment, our compiler must be able both to fall back to the original code and to handle rule-based rejection more intelligently. A possible solution for the latter would be to identify the parameter values responsible for triggering the rule and exclude the violating values from the continued search.

We conclude that the findings presented in this section substantiate thesis T2 and the open aspects of thesis T1.

7.4 Hybrid Autotuning

In this section we present experimental results for an analysis of the predictive component of the hybrid autotuning approach. Because our autotuner uses a stochastic policy to decide between search and prediction, we are able to analyze the two components in isolation. The results we present generalize easily to full hybrid autotuning in production by taking the probabilistic behavior into account. Since we have analyzed hierarchical search extensively in the previous sections, we focus on predictive autotuning in this section. We analyze the predictors with respect to the quality of the prediction, i.e., we compare the performance of the configurations found by either prediction or search. All experiments were run on a machine equipped with an Intel Core i7 at 3.5 GHz with four physical and eight virtual cores and an NVIDIA GTX 970. In the following sections we first discuss the benchmarks we used as well as our procedure to generate training and validation data for our predictors. Subsequently, our experimental results are presented and discussed.

7.4.1 Benchmark Selection

This section focuses on the programs of PolyBench that use a single autotuner instance, namely doitgen, gemm, gesummv, syr2k, and syrk. The remaining programs all contain at least two parallel regions, each of which is associated with an individual tuner instance. Since we are looking at the behavior of individual tuning procedures, the choice simplifies the analysis and the presentation of results greatly. Moreover, the selected benchmarks are diverse enough to allow for representative conclusions: For doitgen and gesummv, the heterogeneous parallelization was unable to improve the performance (cf. Section 7.3.1). For gemm, syr2k, and syrk, on the other hand, the acceleration was excellent. In the doitgen benchmark our tuner was also unable to achieve a substantial speedup over the OpenTuner search (cf. Section 7.3.2). Therefore, the selected benchmarks capture the extremes.

7.4.2 Experiment Setup

The aim of hybrid autotuning is to improve online autotuning in the face of a dynamic tuning context. To create a dynamic context, we subject the PolyBench benchmarks to different sets of inputs. The input sets further need to contain both inputs that are similar and inputs that are different. To generate the input set, we select inputs randomly with the help of Latin Hypercube sampling6: Each benchmark defines several default inputs ranging from MINI to EXTRA_LARGE. These span a cuboid, which we subdivide into 20 sub-ranges along every dimension. For gemm, for instance, the cuboid is three dimensional (for the row and column numbers of the multiplied matrices). There is only a single input dimension for gesummv. From the 20 sub-ranges we select five according to Latin Hypercube sampling. Within each selected sub-range we then select 20 points at random. This scheme thus produces 100 inputs for each benchmark, out of which twenty at a time are geometrically close to each other. Using all benchmarks and inputs we compute search-based tuning baselines to serve as training and validation data. For every benchmark and input, we use hierarchical search five times to optimize the parallel execution over 100 tuning iterations. The baseline data comprises every sampled configuration and its benchmark kernel runtime. The best configuration observed during the five tuning repetitions also serves as a virtual performance roofline. Of course that is only an approximation of the true roofline, which we cannot know without an exhaustive search. To be able to analyze the quality of the prediction independent of the implementation of the predictor, we perform the experiment similarly to the OpenTuner comparison in Section 7.3.2: The predictors are implemented as a Python program external to libtuning and function as drivers for the benchmark programs.

6 Latin Hypercube sampling [Par94; MBC79] creates space-spanning samples by evenly subdividing every dimension of the sampled space and then placing samples so that there are never two within the same subdivision along any space axis.

After training on the baseline data, the predictors produce configurations for the test inputs and execute the benchmark programs using those configurations. As in the OpenTuner experiment, the benchmark programs’ tuning loop lengths are set to a fixed value of five. The first two runtime measurements of the PolyBench kernels are discarded and the average of the remaining three is used as the benchmark result. To train and evaluate the predictors we apply ten-fold cross-validation: The input set for every benchmark is shuffled and then split into ten equally sized partitions. Each partition is used as a test sample while the remaining nine are used for training. We repeat this process five times. In total we thus train and validate 50 times. Every training iteration is based on 90 inputs, whereas validation is based on 10 inputs, producing 500 data points per benchmark and predictor. For validation, we further use the best baseline configuration for each input as reference.
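The following NumPy sketch reproduces the input-generation scheme described above; the gemm bounds are placeholders spanning the smallest to the largest problem sizes, and the script actually used for the experiments may differ in details such as rounding and seeding.

    import numpy as np

    def generate_inputs(lower, upper, n_subranges=20, n_selected=5, points_per_range=20, seed=0):
        """Latin-Hypercube-style input generation.

        lower, upper: per-dimension bounds of the input cuboid.
        Selects n_selected of the n_subranges per dimension without repeating a sub-range
        index along any dimension (the Latin Hypercube property), then draws
        points_per_range random points inside each selected sub-cuboid.
        """
        rng = np.random.default_rng(seed)
        lower, upper = np.asarray(lower, float), np.asarray(upper, float)
        dims = lower.size
        edges = np.linspace(lower, upper, n_subranges + 1)  # (n_subranges + 1, dims) bin edges
        # One permuted column of sub-range indices per dimension.
        picks = np.stack([rng.permutation(n_subranges)[:n_selected] for _ in range(dims)], axis=1)
        inputs = []
        for idx in picks:  # idx holds one sub-range index per dimension
            lo = edges[idx, np.arange(dims)]
            hi = edges[idx + 1, np.arange(dims)]
            inputs.append(rng.uniform(lo, hi, size=(points_per_range, dims)))
        return np.rint(np.concatenate(inputs)).astype(int)  # 100 integer-valued inputs

    # Hypothetical gemm bounds covering the smallest to the largest problem sizes.
    gemm_inputs = generate_inputs(lower=[20, 25, 30], upper=[2000, 2300, 2600])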

7.4.3 Results

In this section we compare three different variants of the predictor component of the hybrid autotuning approach. The first is the tabular Nearest-Neighbor (NN) predictor. For every benchmark and input, its table records the best-performing configuration. To make a prediction for an unknown input, it returns the table entry for the known input that is closest to the unknown one with respect to Euclidean distance. The remaining two predictors are variants of the Greedy-GQ function approximation algorithm. They differ in the features modeling the inputs and configurations. The first one, which we henceforth call RBF-Model, uses Radial Basis Functions (RBF) as features. The second one, called NRBF-Model, uses the normalized variant of the RBF features. The first step to construct the function approximation models is to define the features. Recall the general definition of the RBF: ϕi(x) = exp(−ω ||x − µi||²), where x = (s, a) is the concatenation of the input and parameter configuration vectors (or state and action vectors in RL terminology). The degrees of freedom are the number of features to use, the centers µi of the individual basis functions, and their widths ω. The NRBF features expose the same degrees of freedom. We set the width to 1 for all i. For the centers, we resort to Latin Hypercube sampling again to place the basis functions quasi-randomly in the space. Because the features quantify both the inputs and the parameter configuration, the number of features should not be fixed.
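A minimal NumPy sketch of the two feature variants and the tabular nearest-neighbor lookup follows. It assumes Gaussian RBFs and NRBF features normalized to sum to one; the function names are illustrative and not part of libtuning.

    import numpy as np

    def rbf_features(x, centers, widths):
        """Gaussian RBF features: phi_i(x) = exp(-omega_i * ||x - mu_i||^2)."""
        sq_dist = np.sum((centers - x) ** 2, axis=1)
        return np.exp(-widths * sq_dist)

    def nrbf_features(x, centers, widths):
        """Normalized RBF features: each phi_i divided by the sum of all phi_j."""
        phi = rbf_features(x, centers, widths)
        return phi / np.sum(phi)

    def nn_predict(table, indicators):
        """Tabular nearest-neighbor prediction: return the stored best configuration
        of the known input closest to the query in Euclidean distance."""
        known = np.array(list(table.keys()))
        best = np.argmin(np.linalg.norm(known - indicators, axis=1))
        return table[tuple(known[best])]

    # Example: three features over a two-dimensional vector x = (s, a).
    centers = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
    widths = np.ones(len(centers))  # width 1 for all i, as in the text
    phi = nrbf_features(np.array([0.4, 0.6]), centers, widths)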


Figure 7.15: Performance results for the RBF- and NRBF-Model for varying feature counts: (a) speedups of the RBF-Model over searching for 100 iterations, (b) speedups of the NRBF-Model over searching for 100 iterations.

We therefore define the number to be a multiple of the number of inputs and parameters. As multipliers we chose 1×, 2×, 3×, and 4×. The effect of the different multipliers is shown in Figure 7.15. The plots show the geo-mean speedups of the predictor over the hierarchical search. Speedup here is with respect to the time required to complete the 100 iterations of the search baselines. For the search, that time is the sum of the individual kernel runtimes per sampled configuration. For the predictor, that time is 100 times the runtime of the predicted configuration. Interestingly, we see that there is no clear winner among the different feature sets across both models. For the RBF-Model, the feature set with the best overall average, given by the 1× multiplier, achieves a speedup of 1.46, which is 2 percentage points ahead of the runner-up. For the NRBF-Model, the best feature set is created using the 2× multiplier. With a speedup of 1.6 it outperforms the second best set by a margin of only 0.09 percentage points. For the following analyses we use the respective best feature set in both models. Having determined an appropriate feature set for the function approximation models, we now compare the different predictors. In Figure 7.16 we show the speedups of the three predictors over hierarchical search for the 100-iteration frame of reference. As expected, the NN predictor achieves the highest acceleration of 1.93 over tuning. In comparison, the RBF- and NRBF-Models achieve 1.46 and 1.6, respectively.
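Written out, the speedup reported in Figures 7.15 and 7.16 relates the cumulative cost of the 100 search iterations to 100 repetitions of the predicted configuration; here tsearch,i denotes the runtime of the configuration sampled in search iteration i and tpred the runtime of the predicted configuration (both symbols are introduced here only for illustration):

    speedup = (tsearch,1 + tsearch,2 + · · · + tsearch,100) / (100 · tpred)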



Figure 7.16: Speedup of using the predicted configuration for 100 iterations over performing search-based tuning.


Figure 7.17: Speedup comparison between the prediction results and the best search results.



Figure 7.18: Reduction of the search overhead using hybrid tuning.

To assess the quality of the predictors we further compare the prediction results with the best search results. In Figure 7.17 we show the speedup of the predicted configurations over the best configuration found during searching. The vast majority of predicted configurations is slower than what the autotuner found using the hierarchical search. Interestingly, there is a small number of points greater than one, which refer to predictions that produced a configuration actually faster than those found during searching. On average, the NN predictor reaches 71.9% of the search result’s performance. The RBF- and NRBF-Models fall behind with 54.4% and 59.6%. While this indicates that we can find substantially better configurations using searching instead of prediction, it does not yet show the benefit of hybrid tuning. The primary goal of our technique is to reduce the overhead introduced by search-based autotuning. To evaluate this reduction, we define the overhead as the increase in cumulative runtime compared to using the best known configuration for 100 iterations. Note that this is a different baseline than the one used in Figure 7.17, which was selected only from the configurations seen during the searches, not during prediction. Otherwise, when the predictors find a better configuration than what searching produced, the overhead would be negative. The results we show here are thus slightly biased towards the search, because they explicitly exclude the predictor’s ability to find better results than the search. Figure 7.18 shows the overhead reduction metric. We see that even in the worst case (gemm), the overhead reduction is 60%.



Figure 7.19: The tuning iteration at which the better configuration found by search-based tuning outweighs the overhead reduction gained by prediction-based tuning.

The geo-mean reduction is even above 98.7% and the median is above 99.6%: Half the time, hybrid autotuning eliminates the entire overhead search-based tuning would incur over 100 iterations, even though search-based tuning finds better configurations. This conclusion, however, is only valid for the frame of 100 iterations we looked at: After 100 iterations the search has converged to a configuration that, as we have seen, is generally better than that of the predictor. As the number of iterations considered in the frame of reference goes to infinity, the relative overhead of the search goes to zero. Therefore, there must be a specific number of iterations after which the overhead reduction becomes negative. That means that at some point the better performance of the configuration found by the search will outweigh the benefit of the predictor. If we assume for simplicity that after the 100 iterations the cumulative runtime grows linearly, we can compute that break-even point. The cumulative runtimes of the search-based tuning and the prediction-based tuning can then be approximated as linear functions. Define Ttuner(i) = ttuner · i + Ttuner,100 for all tuning methods, where ttuner is the runtime of the (converged) tuner’s configuration and Ttuner,100 is the cumulative time taken for the first 100 iterations. Then, the break-even point is the iteration at which the linear functions of the search-based tuning and the prediction-based tuning intersect.
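Equating the two linear functions and solving for i makes the break-even point explicit; the subscripts search and pred used here merely distinguish the two tuning methods:

    tsearch · i + Tsearch,100 = tpred · i + Tpred,100   ⟹   ibreak-even = (Tsearch,100 − Tpred,100) / (tpred − tsearch)

A negative break-even point means the two lines do not cross for any positive iteration count, so whichever method is ahead after the first 100 iterations stays ahead.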

We show the result of this analysis in Figure 7.19. The plot includes only roughly 46% of the data the previous analyses were based upon. The majority (45.3%) was removed because the break-even point was negative: With the exception of five data points, the configuration found by the search was never able to outweigh the overhead reduction. For those five, however, which stem from the doitgen benchmark search using the RBF- and NRBF-Models, search was already faster than prediction over the 100 iterations. The remaining 8.6% were removed for presentation purposes, because their break-even point was in excess of 2500. The median of those 8.6% is at 4933 iterations, the average is at 37930.39 iterations. The 46% of the data shown here thus represent those data points where hybrid tuning performs worst. The geo-mean break-even point for NN is at 256 iterations. For the RBF- and NRBF-Model it is at 164 and 189, respectively. Including also the data points that were removed for presentation purposes, the geo-mean break-even points are 606, 263, and 306, respectively.

7.4.4 Discussion

In this section we presented a performance evaluation of our hybrid tuning approach. The results have shown that the prediction component of the hybrid tuner can reduce over 99% of the overhead of performing a search for 100 iterations. On average, this means prediction can accelerate these 100 iterations almost 2-fold. The nearest-neighbor predictor has been shown to outperform the function approximation predictors by a large margin in all experiments. Although the results for the RBF- and NRBF-Models were nonetheless positive, their benefits in space and runtime complexity do not outweigh the performance difference. The predictors produce slower configurations than search-based tuning. For NN, the average slowdown was 28%, for the RBF-Model it was 46%, and 40% for the NRBF-Model. Therefore, we found that there is a limit to the overhead reduction that prediction-based tuning can provide. When the tuning kernel is repeated often enough without any change in the context after the search has converged, the search overhead and thus the potential for reduction go to zero. At some point, searching will then provide better overall performance since it is able to find better configurations. For the benchmarks we analyzed, we saw that this occurs in 54.2% of the cases. For those, the break-even was on average above 263 iterations. For the NN predictor it was on average as large as 606 iterations.


These results substantiate our thesis T3: The predicted configurations are adequate at 72% of the performance achieved by search, but it takes over 600 iterations on average for the search to amortize against the prediction result.

7.5 Summary

In this chapter we evaluated the performance improvements the APHES framework offers. The results have shown that our hybrid tuning approach can be used to optimize automatically parallelized programs for cooperative heterogeneous execution. Using hierarchical search, hybrid tuning is able to effectively explore the search space and to exploit a model trained online during the search. We have shown that autotuning is necessary to enable optimized cooperative heterogeneous execution. Using the performance model employed by the most closely related prior research, we have shown that the model is too simple to provide adequate results. In the presence of additional tunable parameters the linearity assumptions of the model are violated. With hybrid tuning, programs parallelized by our aphes compiler have been shown to outperform two state-of-the-art single-platform parallelizers. Compared to current state-of-the-art autotuners, the hierarchical search algorithm reduces the search time by over 10% on average, and up to a maximum of 30%. Exploiting the models which are automatically trained while exploring, hybrid tuning reduces the impact of the search overhead further. The models produce a configuration at the first tuning iteration based on indicators that describe the current application and system state. Known states can thus be recognized. Using either a tabular nearest-neighbor approach or function approximation, adequate configurations can be predicted for unknown states. During the evaluation, however, the function approximation approach has shown weaker performance compared to the tabular approach. While the tabular predictor is more expensive to store and to query, its predictions exhibit better performance by up to 47 percentage points on average.


Chapter 8

Conclusion and Outlook

In this dissertation we presented the APHES framework for automatic parallelization for heterogeneous systems. To leverage multiple parallel platforms simultaneously, parallelized programs automatically distribute data parallel work. The work distribution as well as platform-specific tunable parameters are controlled by libtuning, our novel online autotuning library. Parallelized programs are automatically instrumented to interact with the autotuner by the aphes compiler. Together, libtuning and the aphes compiler form the APHES framework. With libtuning, we have designed a general-purpose online autotuning library. It implements our novel tuning technique, hybrid autotuning. Hybrid tuning combines empirical search with online learning. Models constructed while searching can then be exploited to select configurations for given program inputs and application and system states. Inputs and states are quantified through indicators, which are metrics defined by an application developer or, within the APHES framework, by the compiler. Changing indicators point to a change in the inputs or in the application or system state. When changes occur, the tuner autonomously chooses whether to explore the space or to exploit the model. For exploration, the libtuning autotuner uses a novel search algorithm we call hierarchical search. Hierarchical search decomposes the search space by exploiting inter-parameter dependencies and the distinction between nominal and non-nominal parameters. Reducing the high-dimensional search space to several spaces of smaller dimensions simplifies the search in each. Moreover, every space can be explored using different search algorithms. In particular, hierarchical search uses the ε-Greedy algorithm to optimize nominal parameter spaces, and Nelder-Mead in the non-nominal ones. For the nominal spaces, libtuning further supports retrieving the next configuration without measuring the last.

That enables avoiding bad configurations that can be identified a priori based on domain knowledge. The aphes compiler is a parallelizing compiler targeting data parallel regions of input programs. The regions are parallelized for multiple parallel targets such as OpenMP or CUDA. It implements two transformation pipelines. The main pipeline is based on the polyhedral model. The polyhedral model is a mature analysis and transformation framework for loop nests. In particular, it supports data locality optimization and dependence testing, which enable automatic parallelization. The aphes compiler uses the model to generate code for OpenMP and CUDA, and to distribute the workload across the platforms. The libtuning autotuner is used to optimize the distribution of work as well as platform-specific parameters such as OpenMP threads, the amount of work per GPU thread, or whether to use the GPU’s shared memory. The compiler automatically extracts indicators from the program by identifying variables that influence the amount of work processed in the parallelized loop. Using the extracted indicators, hybrid tuning can build the prediction models. Because the polyhedral model has restrictive requirements on the loops it can analyze, we have implemented a second fallback pipeline. The second pipeline provides limited ad-hoc outer-loop parallelization targets based on the program dependence graph. Interprocedural analyses and fewer restrictions on the control flow enable transformation of programs that are not representable in the polyhedral model. The limitation to outer-loop parallelization however reduces the possible acceleration. Our experimental evaluation has shown that with the APHES framework, automatic multi-platform parallelization is possible. The generated programs outperform programs parallelized with two state-of-the-art single-platform compilers. However, the experiments also exposed the approach’s limitations: Outer-loop-only ad-hoc parallelization cannot offer significant speedups in general. Multi-loop parallelization is necessary, but the polyhedral model is subject to restrictions which impede its applicability. Hybrid tuning has been demonstrated to reduce the time required for exploration by over 10% on average and to build adequate models. The best performing model, a tabular nearest-neighbor approach, has been shown to shed at least 98% of the overhead an exploration would incur. Function-approximation models, which are cheaper to store and query, reduce over 84% of the overhead. These results demonstrate that libtuning provides an autotuner that enables always-on online tuning.

Because of the accelerated search and the model predictions, it is a tool that can be deployed with applications into production environments. In the following, we revisit the objectives of this dissertation and discuss the theses we posed in Chapter 1. Lastly, we propose future directions and research questions that arise from our findings.

8.1 Thesis Objectives

In this dissertation, we evaluated our prototypical implementation of the APHES framework to investigate the theses posed in Section 1.2.

Thesis T1: Optimized cooperative parallelization accelerates programs, exceeding the performance of single-platform parallelization. Using polyhedral model-based parallelization, our benchmarks have shown a substantial increase in performance compared to two state-of-the-art single-platform parallelizing compilers. We were able to achieve speedups of up to 6× over both references. However, for benchmarks with a limited amount of computation our transformations slowed down the execution. The cause of the decrease in performance was the management overhead for the parallel execution. We did not allow our autotuner to execute unparallelized or single-platform code, which would eliminate this particular problem. The ad-hoc parallelization method was unable to produce substantial speedups. Because it is restricted to transforming outer loops, the amount of parallelism the method can exploit is severely limited. With ad-hoc parallelization, we were able to demonstrate that the performance models used by current state-of-the-art multi-platform parallelization tools are insufficient when additional platform-specific parameters are taken into account.

Thesis T2: Hybrid autotuning determines configurations of comparable quality with better amortization time than the state of the art. Our experiments demonstrated that hybrid tuning, and hierarchical search in particular, are able to reduce the search time by up to 30% and by over 10% on average. Because of the reduced complexity of the search space, hierarchical search finds an adequate configuration in a shorter time than state-of-the-art autotuning algorithms. The configurations found by hierarchical search are still on par with those of other autotuners.

Although the prediction component of hybrid autotuning produces worse configurations than the search, it does not require sampling configurations and avoids that particular overhead. The amortization time is thus improved compared to the state of the art.

Thesis T3: Hybrid autotuning predicts adequate configurations for given inputs without negatively impacting amortization time. The evaluation has shown that both the nearest-neighbor and function-approximation predictors are able to avoid large portions of the overhead that emerges during heuristic searches. The configurations produced by prediction are adequate; they achieve up to 72% of the performance of configurations found through searching on average. However, although the performance is worse, it takes more than 600 iterations on average for the search to make up the difference. Only after that break-even point does any negative impact of the prediction result on amortization time become apparent.

8.2 Outlook

The positive results we were able to achieve with our approach give rise to additional research questions worth exploring. In particular, our autotuner has proven to be usable for always-on tuning scenarios. But to offer the best possible performance results, great care is still required from the application developer. That burden on the developer can and should be lightened. Similarly, we have demonstrated the power of using prediction models alongside heuristic search in autotuning. Especially in light of the recent interest and advances in machine learning, we believe that our approach can grow beyond its current abilities.

Automatic Indicator Extraction Currently, indicators need to be manually registered with the autotuner. When using the aphes compiler to instrument an application for tuning, this process can be automated because the compiler has a good understanding of performance-critical program properties. For other applications, however, there is no automatic solution, which bears the potential for future research: Using program analysis, can an automatic tool extract indicators from programs instrumented with the tuner?


Automatic Parameter Dependencies Like the indicators, parameter dependencies must be stated explicitly by the application developer. Not only is this tedious, it can also introduce errors: If a parameter dependency is defined that is incorrect, the autotuner’s search can produce much worse results. There is room for automation: The autotuner is able to observe the relationship between parameter configurations and runtimes. Is it possible to deduce any dependencies automatically from observations, potentially even ones the developer was unaware of? Can this analysis be realized without sacrificing the benefits parameter dependencies provide?

Portability of Hybrid Tuning Models Hybrid autotuning produces prediction models over indicators that currently represent only the dynamic application state. Naturally, indicators could additionally model static and dynamic system properties, as well as further static application properties. Potentially, this opens up the possibility of creating portable models. Can models be trained on one system and then be exploited on a different one? Can models make predictions across different applications?
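As a rough illustration of what such extended indicators might look like, the following feature layout combines dynamic application state with static system properties; the concrete fields are assumptions made for this sketch only and are not the features used by the hybrid tuner.

```cpp
#include <cstddef>

// Illustrative feature vector for a portable prediction model. The point of the
// sketch is the split into application- and system-level features; the concrete
// fields are assumptions, not those used by the hybrid tuner.
struct PortableFeatures {
  // Dynamic application state (what the current indicators capture).
  std::size_t input_size;            // e.g. problem size of the tuned kernel
  double arithmetic_intensity;       // flops per byte, if known

  // Static system properties that would have to be added for portability.
  int cpu_cores;
  std::size_t gpu_memory_bytes;
  double host_device_bandwidth_gbps;
};
```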


List of Figures

1.1 The APHES framework...... 5

2.1 Vector addition using OpenMP ...... 10
2.2 Example: Tuning the OpenMP thread count ...... 11
2.3 An illustration of the Nelder-Mead algorithm operations ...... 15
2.4 LLVM intermediate representation of a “hello world” program ...... 21
2.5 Example: Control and Data Dependences ...... 23
2.6 Example: A data dependent computation ...... 25
2.7 Matrix multiplication ...... 27
2.8 Original schedule tree of the matrix multiplication example in Figure 2.7 ...... 31

3.1 Example: Tuning OpenMP vector addition with libtuning ...... 34
3.2 The architecture of the libtuning autotuner ...... 36
3.3 Example: Record dependences between parameters ...... 38
3.4 Hybrid, input-sensitive tuning of OpenMP vector addition with libtuning ...... 40
3.5 The APHES Framework Architecture ...... 43
3.6 Program parallelization with the aphes compiler ...... 44

5.1 The hybrid online tuning workflow ...... 83
5.2 The ParameterSpace ...... 87
5.3 Mirroring Example ...... 88
5.4 The Search implementation interface ...... 90
5.5 The Mealy automaton for the Nelder-Mead algorithm ...... 92
5.6 Decompose the global search space of the simplified heterogeneous mapping according to the parameter classes ...... 95
5.7 Decompose the global search space based on parameter classes and constraints ...... 98
5.8 Example: Search space graph ...... 99


5.9 Example: Hierarchical search...... 100

6.1 Original schedule tree of the matrix multiplication example in Figure 2.7. The figure is identical to Figure 2.8 and is repeated here for the reader’s convenience ...... 109
6.2 Example: Partitioning and platform mapping for the schedule ...... 110
6.3 Expected control flow structure of a loop supported by aphes ...... 115
6.4 Code sections generated by target plugins. Solid arrows denote synchronous, dashed arrows asynchronous operations ...... 117
6.5 Example: Partitioning the work between an OpenMP and a CUDA target ...... 121
6.6 Example: Allocation mapping for three arrays ...... 122

7.1 Relationship between runtime and OpenMP thread configurations for the four loops of the gemver benchmark ...... 131
7.2 Binomial: Comparison of greedily selected thread configurations ...... 133
7.3 BlackScholes: Comparison of greedily selected thread configurations ...... 134
7.4 Speedup results for the automatically parallelized and tuned PolyBench programs over three baseline compilers: Polly, Polly-ACC, and Polly-OMP ...... 136
7.5 Comparison between the configurations found by hierarchical search and OpenTuner ...... 140
7.6 Runtime of the configurations found by hierarchical search normalized to the best configuration found by OpenTuner across all tuning runs ...... 141
7.7 The time required to complete 100 search steps using OpenTuner or hierarchical search ...... 142
7.8 The time required to execute 100 search iterations using hierarchical search relative to the average time required by OpenTuner ...... 142
7.9 Execution timeline of 100 search steps with OpenTuner and hierarchical search ...... 143
7.10 Comparison between the configurations found by hierarchical search and Nelder-Mead search after 100 search steps ...... 145
7.11 The time required to complete 100 search steps using Nelder-Mead or hierarchical search ...... 146


7.12 gemm: Sensitivity of hierarchical search to variation of the hyperparameters ...... 147
7.13 gemm: Cumulative time traces for the different parametrizations of hierarchical search ...... 148
7.14 gemm: Search behavior for small matrix sizes using rejection rules ...... 149
7.15 Performance results for the RBF- and NRBF-Model for varying feature counts ...... 154
7.16 Speedup of using the predicted configuration for 100 iterations over performing search-based tuning ...... 155
7.17 Speedup comparison between the prediction results and the best search results ...... 155
7.18 Reduction of the search overhead using hybrid tuning ...... 156
7.19 The tuning iteration at which the better configuration found by search-based tuning outweighs the overhead reduction gained by prediction-based tuning ...... 157


List of Tables

2.1 Parameter Classes [Pfa+17]...... 14

4.1 Comparison of the tuning algorithms and autotuners ...... 53
4.2 Comparison of the model-based tuning approaches ...... 56
4.3 Comparison of the dependence-based parallelization approaches ...... 61
4.4 Comparison of the polyhedral parallelization approaches ...... 64
4.5 Comparison of the languages and parallelizing compilers ...... 66
4.6 Comparison of the pattern-based parallelization approaches ...... 68
4.7 Comparison of the model-based autotuning compilers ...... 71
4.8 Comparison of the search-based autotuning compilers ...... 74
4.9 Comparison of the autotuning languages ...... 79

5.1 Tuning parameters for a simplified heterogeneous mapping example ...... 94

6.1 Tunable parameters exposed by polyhedral parallelization ...... 113
6.2 Tunable parameters exposed by the dependence-based parallelization ...... 119

7.1 The PolyBench benchmarks used in this evaluation...... 129


Bibliography

[AAH15] Cfir Aguston, Yosi Ben Asher, and Gadi Haber. “Parallelization Hints via Code Skeletonization”. In: IEEE Transactions on Parallel and Dis- tributed Systems 26.11 (Nov. 1, 2015). issn: 1558-2183 (cit. on p. 67). [Aho+07] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, eds. Compilers: Principles, Techniques, & Tools. 2nd ed. Pearson, Addison Wesley, 2007. isbn: 978-0-321-48681-3 (cit. on p. 29). [Alm+04] L. Almagor, Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven W. Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. “Finding Effective Compilation Sequences”. In: Proceed- ings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. LCTES ’04. ACM, 2004. isbn: 1-58113-806-7 (cit. on p. 72). [Ami+12] Mehdi Amini, Béatrice Creusillet, Stéphanie Even, Ronan Keryell, Onig Goubier, Serge Guelton, Janice Onanian Mcmahon, François- Xavier Pasquier, Grégoire Péan, and Pierre Villalon. “Par4All: From Convex Array Regions to Heterogeneous Computing”. In: Second In- ternational Workshop on Polyhedral Compilation Techniques 2012. IMPACT ’12. 2012 (cit. on p. 59). [Ans+09] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. “PetaBricks: A Language and Compiler for Algorithmic Choice”. In: Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI ’09. ACM, 2009, pp. 38–49. isbn: 978-1-60558- 392-1 (cit. on p. 76). [Ans+11] Jason Ansel, Maciej Pacula, Saman Amarasinghe, and Una-May O’Reilly. “An Efficient Evolutionary Algorithm for Solving Incre- mentally Structured Problems”. In: Proceedings of the 13th Annual

Conference on Genetic and Evolutionary Computation. GECCO ’11. ACM, 2011, pp. 1699–1706. isbn: 978-1-4503-0557-0 (cit. on p. 76). [Ans+12] Jason Ansel, Maciej Pacula, Yee Lok Wong, Cy Chan, Marek Ol- szewski, Una-May O’Reilly, and Saman Amarasinghe. “Siblingrivalry: Online Autotuning Through Local Competitions”. In: Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. CASES ’12. ACM, 2012, pp. 91– 100. isbn: 978-1-4503-1424-4 (cit. on p. 51). [Ans+14] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Ama- rasinghe. “OpenTuner: An Extensible Framework for Program Au- totuning”. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques. PACT ’14. ACM, 2014, pp. 303–316. isbn: 978-1-4503-2809-8 (cit. on pp. 52, 138). [Ans14] Jason Ansel. “Autotuning Programs with Algorithmic Choice”. Dis- sertation. Massachusetts Institute of Technology, 2014 (cit. on p. 76). [Aru+17] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. “A Brief Survey of Deep Reinforcement Learning”. In: IEEE Signal Processing Magazine 34.6 (Nov. 2017), pp. 26–38. issn: 1053-5888 (cit. on p. 18). [Ash+18] Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. “A Survey on Compiler Autotuning Using Ma- chine Learning”. In: ACM Computing Surveys 51.5 (Sept. 2018), pp. 1–42. issn: 0360-0300 (cit. on p. 15). [ATN09] Cédric Augonnet, Samuel Thibault, and Raymond Namyst. “Auto- matic Calibration of Performance Models on Heterogeneous Multicore Architectures”. In: Euro-Par 2009 – Parallel Processing Workshops. Ed. by HX. Lin et al. Lecture Notes in Computer Science vol. 6043. Springer, Berlin, Heidelberg, 2009, pp. 56–65. isbn: 978-3-642-14121-8 (cit. on p. 55). [Aug+09] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre- André Wacrenier. “StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures”. In: Euro-Par 2009 – Parallel

174 Processing. Ed. by H. Sips, D. Epema, and HX. Lin. Vol. 5704. Lecture Notes in Computer Science vol. 5704. Springer, Berlin, Heidelberg, 2009. isbn: 978-3-642-03869-3 (cit. on p. 54). [Bai16] Alexander Baier. “Automatic Loop Partitioning for Heterogeneous Systems”. Bachelor’s Thesis. Karlsruhe Institute of Technology (KIT) – IPD Tichy, 2016 (cit. on p. 114). [Bai95] Leemon Baird. “Residual Algorithms: Reinforcement Learning with Function Approximation”. In: Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 30– 37. isbn: 978-1558603776 (cit. on p. 19). [Ban88] Utpal K. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, 1988. isbn: 978-0-89838-289-1 (cit. on p. 25). [Bao+16] Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krish- namoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayap- pan. “Static and Dynamic Frequency Scaling on Multicore CPUs”. In: ACM Transactions on Architecture and Code Optimization (TACO) 13.4 (Dec. 2016), 51:1–51:26. issn: 1544-3566 (cit. on pp. 10, 56). [Bar08] Alexander Barvinok. Integer Points in Polyhedra. European Mathe- matical Society, 2008. isbn: 978-3-03719-052-4 (cit. on p. 28). [Bas+08a] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krish- namoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. “A Compiler Framework for Optimization of Affine Loop Nests for GPG- PUs”. In: Proceedings of the 22Nd Annual International Conference on Supercomputing. ICS ’08. ACM, 2008, pp. 225–234. isbn: 978-1- 60558-158-3 (cit. on p. 73). [Bas+08b] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krish- namoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. “Automatic Data Movement and Computation Mapping for Multi- Level Parallel Architectures with Explicitly Managed Memories”. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’08. ACM, 2008, pp. 1–10. isbn: 978-1-59593-795-7 (cit. on p. 73).


[Bas04] Cedric Bastoul. “Code Generation in the Polyhedral Model Is Easier Than You Think”. In: Proceedings of the 13th International Confer- ence on Parallel Architectures and Compilation Techniques. PACT ’04. IEEE Computer Society, 2004, pp. 7–16. isbn: 978-0-7695-2229-6 (cit. on p. 63). [BE92] William Blume and Rudolf Eigenmann. “Performance Analysis of Par- allelizing Compilers on the Perfect Benchmarks Programs”. In: IEEE Transactions on Parallel and Distributed Systems 3.6 (Nov. 1992), pp. 643–656. issn: 1045-9219 (cit. on p. 58). [BGW13] Prasanna Balaprakash, Robert B. Gramacy, and Stefan M. Wild. “Active-Learning-Based Surrogate Models for Empirical Performance Tuning”. In: Proceedings of the 2013 IEEE International Conference on Cluster Computing. CLUSTER ’13. IEEE, 2013, pp. 1–8. isbn: 978-1-4799-0898-1 (cit. on pp. 10, 55). [Bil+97] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. “Optimizing Matrix Multiply Using PHiPAC: A Portable, High- Performance, ANSI C Coding Methodology”. In: Proceedings of the 11th International Conference on Supercomputing. ICS ’97. ACM, 1997, pp. 340–347. isbn: 978-0-89791-902-9 (cit. on p. 72). [Blu+96] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Jaejin Lawrence Thomas Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. “Parallel Programming with Polaris”. In: Computer 29.12 (Dec. 1996), pp. 78– 82. issn: 0018-9162 (cit. on p. 58). [Böh15] Lukas Böhm. “Compile-Time Code Parallelization for Self-Adapting Acceleration on Intel Xeon Phi”. Bachelor’s Thesis. Karlsruhe Insti- tute of Technology (KIT) – IPD Tichy, 2015 (cit. on pp. 116, 118, 120, 121). [Böh18] Lukas Böhm. “Pattern-Based Heterogeneous Parallelization”. Master’s Thesis. Karlsruhe Institute of Technology (KIT) – IPD Tichy, 2018 (cit. on pp. 114, 115).

176 [Bon+08] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayap- pan. “A Practical Automatic Polyhedral Parallelizer and Locality Op- timizer”. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI ’08. ACM, 2008, pp. 101–113. isbn: 978-1-59593-860-2 (cit. on pp. 29, 63). [BPC12] James Bergstra, Nicolas Pinto, and David Cox. “Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”. In: Proceed- ings of the 2012 Conference on Innovative . InPar ’12. IEEE, 2012, pp. 1–9. isbn: 978-1-4673-2633-9 (cit. on p. 51). [BR15] Tania Banerjee and Sanjay Ranka. “A Genetic Algorithm Based Auto- tuning Approach for Performance and Energy Optimization”. In: Pro- ceedings of the Sixth International Green and Sustainable Computing Conference. IGSC ’15. IEEE, 2015, pp. 1–8. isbn: 978-1-5090-0172-9 (cit. on p. 74). [Bre95] Eric A. Brewer. “High-Level Optimization via Automated Statisti- cal Modeling”. In: Proceedings of the Fifth ACM SIGPLAN Sympo- sium on Principles and Practice of Parallel Programming. PPOPP ’95. ACM, 1995, pp. 80–91. isbn: 978-0-89791-700-1 (cit. on p. 49). [Bro+16] Kevin J. Brown, HyoukJoong Lee, Tiark Romp, Aarvind K. Sujeeth, Christopher De Sa, Christopher Aberger, and Kunle Olukotun. “Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns”. In: Proceedings of the 2016 In- ternational Symposium on Code Generation and Optimization. CGO ’16. ACM, 2016, pp. 194–205. isbn: 978-1-4503-3778-6 (cit. on p. 67). [BRS10] Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan. “Automatic C-to-CUDA Code Generation for Affine Programs”. In: International Conference on Compiler Construction. CC ’10. Ed. by Rajiv Gupta. Lecture Notes in Computer Science vol. 6011. Springer, Berlin, Heidelberg, 2010, pp. 244–263. isbn: 978-3-642-11969-9 (cit. on pp. 62, 73). [BWZ94] Olaf Bachmann, Paul S. Wang, and Eugene V. Zima. “Chains of Re- currences - a Method to Expedite the Evaluation of Closed-Form Functions”. In: Proceedings of the International Symposium on Sym-

bolic and Algebraic Computing. ISSAC ’94. ACM, 1994, pp. 242–249. isbn: 0-89791-638-7 (cit. on p. 22). [Cav+06] John Cavazos, Christophe Dubach, Felix Agakov, Edwin Bonilla, Michael F. P. O’Boyle, Grigori Fursin, and Olivier Temam. “Auto- matic Performance Model Construction for the Fast Software Explo- ration of New Hardware Designs”. In: Proceedings of the 2006 In- ternational Conference on Compilers, Architecture and Synthesis for Embedded Systems. CASES ’06. ACM, 2006, pp. 24–34. isbn: 978-1- 59593-543-4 (cit. on pp. 69, 70). [Cav+07] John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P. O’Boyle, and Olivier Temam. “Rapidly Selecting Good Compiler Optimizations Using Performance Counters”. In: Proceedings of the 2007 International Symposium on Code Generation and Optimization. CGO ’07. IEEE, 2007, pp. 185–197. isbn: 978-0-7695-2764-2 (cit. on p. 70). [CCH08] Chun Chen, Jacqueline Chame, and Mary Hall. CHiLL: A Framework for Composing High-Level Loop Transformations. Technical Report. University of Southern California, 2008 (cit. on p. 72). [Cha12] Kuo-Hao Chang. “Stochastic Nelder–Mead Simplex Method – A New Globally Convergent Direct Search Method for Simulation Optimiza- tion”. In: European Journal of Operational Research 220.3 (Aug. 1, 2012), pp. 684–694. issn: 0377-2217 (cit. on p. 93). [CSB11] Matthias Christen, Olaf Schenk, and Helmar Burkhart. “PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures”. In: Proceedings of the 2011 IEEE International Parallel Distributed Processing Sym- posium. IPDPS ’11. IEEE, 2011, pp. 676–687. isbn: 978-1-61284-372-8 (cit. on p. 77). [Dam+15] Marvin Damschen, Heinrich Riebler, Gavin Vaz, and Christian Plessl. “Transparent Offloading of Computational Hotspots from Binary Code to Xeon Phi”. In: Proceedings of the 2015 Conference on De- sign, Automation and Test in Europe. DATE ’15. IEEE, 2015. isbn: 978-3-9815-3705-5 (cit. on pp. 63, 118).

178 [DE10] Chirag Dave and Rudolf Eigenmann. “Automatically Tuning Parallel and Parallelized Programs”. In: Languages and Compilers for Parallel Computing. Ed. by Guang R. Gao, Lori L. Pollock, John Cavazos, and Xiaoming Li. Lecture Notes in Computer Science vol. 5898. Springer, Berlin, Heidelberg, 2010, pp. 126–139. isbn: 978-3-642-13373-2 (cit. on p. 73). [DGH17] Johannes Doerfert, Tobias Grosser, and Sebastian Hack. “Optimistic ”. In: Proceedings of the 2017 International Sym- posium on Code Generation and Optimization. CGO’17. IEEE, 2017, pp. 292–304. isbn: 978-1-5090-4931-8 (cit. on p. 28). [Dia+10] Gregory Frederick Diamos, Andrew Robert Kerr, Sudhakar Yalaman- chili, and Nathan Clark. “Ocelot: A Dynamic Optimization Frame- work for Bulk-Synchronous Applications in Heterogeneous Systems”. In: Proceedings of the 19th International Conference on Parallel Archi- tectures and Compilation Techniques. PACT ’10. ACM, 2010, pp. 353– 364. isbn: 978-1-4503-0178-7 (cit. on p. 65). [Din+15] Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una- May O’Reilly, and Saman Amarasinghe. “Autotuning Algorithmic Choice for Input Sensitivity”. In: Proceedings of the 36th ACM SIG- PLAN Conference on Programming Language Design and Implemen- tation. PLDI ’15. ACM, 2015, pp. 379–390. isbn: 978-1-4503-3468-6 (cit. on p. 76). [DK06] Isaac Dooley and Laxmikant Kale. “Quantifying the Interference Caused by Subnormal Floating-Point Values”. In: Proceedings of the Workshop on Operating System Interference in High Performance Ap- plications. OSHIPA ’06. 2006 (cit. on p. 130). [Eng00] Robert A. Van Engelen. Symbolic Evaluation of Chains of Recurrences for Loop Optimization. Technical Report. Florida State University, 2000 (cit. on p. 22). [Fea91] Paul Feautrier. “Dataflow Analysis of Array and Scalar References”. In: International Journal of Parallel Programming 20.1 (Feb. 1991), pp. 23–53. issn: 1573-7640 (cit. on p. 28).


[Fer+10] Roger Ferrer, Judit Planas, Pieter Bellens, Alejandro Duran, Marc Gonzalez, Xavier Martorell, Rosa M. Badia, Eduard Ayguade, and Jesus Labarta. “Optimizing the Exploitation of Multicore Proces- sors and GPUs with OpenMP and OpenCL”. In: International Work- shop on Languages and Compilers for Parallel Computing. Ed. by K. Cooper, J. Mellor-Crummey, and V. Sarkar. Lecture Notes in Com- puter Science vol. 6548. Springer, Berlin, Heidelberg, 2010, pp. 215– 229. isbn: 978-3-642-19594-5 (cit. on p. 66). [FL11] Paul Feautrier and Christian Lengauer. “Polyhedron Model”. In: Encyclopedia of Parallel Computing. Springer, Boston, MA, 2011, pp. 1581–1592. isbn: 978-0-387-09766-4 (cit. on p. 26). [FOW87] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. “The Pro- gram Dependence Graph and Its Use in Optimization”. In: ACM Transactions on Programming Languages and Systems (TOPLAS) 9.3 (1987), pp. 319–349 (cit. on pp. 24, 25). [Fri+13] Daniel Fried, Zhen Li, Ali Jannesari, and Felix Wolf. “Predicting Par- allelization of Sequential Programs Using Supervised Learning”. In: Proceedings of the 12th International Conference on Machine Learn- ing and Applications. ICMLA ’13. IEEE, 2013, pp. 72–77. isbn: 978- 0-7695-5144-9 (cit. on p. 70). [Fur+11] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, Francois Bodin, Phil Barnard, Elton Ashton, Edwin Bonilla, John Thomson, Christopher K. I. Williams, and Michael O’Boyle. “Milepost GCC: Machine Learning Enabled Self-Tuning Compiler”. In: International Journal of Parallel Programming 39.3 (Jan. 2011), pp. 296–327. issn: 0885-7458 (cit. on p. 70). [GAC12] Serge Guelton, Mehdi Amini, and Béatrice Creusillet. “Beyond Do Loops: Data Transfer Generation with Convex Array Regions”. In: In- ternational Workshop on Languages and Compilers for Parallel Com- puting. LCPC ’12. Ed. by H. Kasahara and K. Kimura. Lecture Notes in Computer Science vol. 7760. Springer, Berlin, Heidelberg, 2012, pp. 249–263. isbn: 978-3-642-37657-3 (cit. on p. 59).

180 [GGL12] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. “Polly — Performing Polyhedral Optimizations on a Low-Level Intermedi- ate Representation”. In: Parallel Processing Letters 22.04 (Dec. 2012). issn: 0129-6264 (cit. on p. 63). [GH16] Tobias Grosser and Torsten Hoefler. “Polly-ACC Transparent Compi- lation to Heterogeneous Hardware”. In: Proceedings of the 2016 Inter- national Conference on Supercomputing. ICS ’16. ACM, 2016, pp. 1– 13. isbn: 978-1-4503-4361-9 (cit. on pp. 63, 111). [GKT91] Gina Goff, Ken Kennedy, and Chau-Wen Tseng. “Practical Depen- dence Testing”. In: Proceedings of the ACM SIGPLAN 1991 Confer- ence on Programming Language Design and Implementation. PLDI ’91. ACM, 1991, pp. 15–29. isbn: 0-89791-428-7 (cit. on pp. 24, 25, 115). [Gro+13] Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. “Split Tiling for GPUs: Auto- matic Parallelization Using Trapezoidal Tiles”. In: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Process- ing Units. GPGPU-6. ACM, 2013, pp. 24–31. isbn: 978-1-4503-2017-7 (cit. on p. 63). [Gro+14] Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. “Hybrid Hexagonal/Classical Tiling for GPUs”. In: Proceedings of the 2014 International Symposium on Code Generation and Optimization. CGO’14. ACM, 2014, pp. 66–75. isbn: 978-1-4503- 2670-4 (cit. on p. 63). [Gum+10] Jayanth Gummaraju, Ben Sander, Laurent Morichetti, Benedict R. Gaster, Micheal Houston, and Bixia Zheng. “Twin Peaks: A Soft- ware Platform for Heterogeneous Computing on General-Purpose and Graphics Processors”. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. PACT’10. IEEE, 2010, pp. 205–215. isbn: 978-1-4503-0178-7 (cit. on p. 65). [GVC15] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. “Polyhedral AST Generation Is More Than Scanning Polyhedra”. In: ACM Trans-

actions on Programming Languages and Systems (TOPLAS) 37.4 (July 15, 2015), pp. 1–50. issn: 0164-0925 (cit. on p. 30). [Hag+18] Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. “High Performance Stencil Code Generation with Lift”. In: Proceedings of the 2018 International Symposium on Code Generation and Optimization. CGO ’18. ACM, 2018, pp. 100– 112. isbn: 978-1-4503-5617-6 (cit. on p. 78). [Hal+05] Mary W. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. “Interprocedural Parallelization Analysis in SUIF”. In: ACM Transactions on Programming Languages and Sys- tems (TOPLAS) 27.4 (July 2005), pp. 662–731. issn: 0164-0925 (cit. on p. 59). [Hal+96] Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam. “Maximizing Multiprocessor Performance with the SUIF Compiler”. In: Computer 29.12 (Dec. 1996), pp. 84–89. issn: 0018-9162 (cit. on p. 59). [Ham+09] Clemens Hammacher, Kevin Streit, Sebastian Hack, and Andreas Zeller. “Profiling Java Programs for Parallelism”. In: Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering. IWMSE ’09. IEEE, 2009, pp. 49–55. isbn: 978-1-4244-3718-4 (cit. on p. 59). [Her+19] Killian Herveau, Philip Pfaffe, Martin Tillmann, Walter F. Tichy, and Carsten Dachsbacher. “Hybrid Online Autotuning for Parallel Ray Tracing”. In: Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization. Ed. by Hank Childs and Steffen Frey. The Eurographics Association, 2019. isbn: 978-3-03868-079-6 (cit. on pp. 81, 83). [Hin01] Michael Hind. “: Haven’t We Solved This Prob- lem Yet?” In: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. PASTE ’01. ACM, 2001, pp. 54–61. isbn: 1-58113-413-4 (cit. on p. 24).

182 [Hor97] Susan Horwitz. “Precise Flow-Insensitive May-Alias Analysis Is NP- Hard”. In: ACM Transactions on Programming Languages and Sys- tems (TOPLAS) 19.1 (Jan. 1997), pp. 1–6. issn: 0164-0925 (cit. on p. 24). [HP11] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann, 2011. isbn: 978-0-12-383872-8 (cit. on pp. 23, 24). [IJT91] François Irigoin, Pierre Jouvelot, and Rémi Triolet. “Semantical In- terprocedural Parallelization: An Overview of the PIPS Project”. In: Proceedings of the 5th International Conference on Supercomputing. ICS ’91. ACM, 1991, pp. 244–251. isbn: 978-0-89791-434-5 (cit. on p. 58). [Jim12] Alexandra Jimborean. “Adapting the Polytope Model for Dynamic and Speculative Parallelization”. Dissertation. Université de Stras- bourg, 2012 (cit. on p. 62). [Joh+04] Troy A. Johnson, Sang-Ik Lee, Long Fei, Ayon Basumallik, Gautam Upadhyaya, Rudolf Eigenmann, and Samuel P. Midkiff. “Experiences in Using Cetus for Source-to-Source Transformations”. In: Languages and Compilers for High Performance Computing. LCPC ’04. Ed. by Rudolf Eigenmann, Z. Li, and Samuel P. Midkiff. Lecture Notes in Computer Science vol. 3602. Springer, Berlin, Heidelberg, 2004, pp. 1– 14. isbn: 978-3-540-28009-5 (cit. on p. 73). [Jor+12] Herbert Jordan, Peter Thoman, Juan J. Durillo, Simone Pellegrini, Philipp Gschwandtner, Thomas Fahringer, and Hans Moritsch. “A Multi-Objective Auto-Tuning Framework for Parallel Codes”. In: Pro- ceedings of the International Conference on High Performance Com- puting, Networking, Storage and Analysis. SC ’12. IEEE, 2012, pp. 1– 12. isbn: 978-1-4673-0804-5 (cit. on pp. 10, 73). [KA97] R. Matthew Kretchmar and Charles W. Anderson. “Comparison of CMACs and Radial Basis Functions for Local Function Approxima- tors in Reinforcement Learning”. In: Proceedings of the International Conference on Neural Networks. ICNN ’97. IEEE, 1997, pp. 834–837. isbn: 978-0-7803-4122-7 (cit. on p. 104).


[KBK11] Mario Kicherer, Rainer Buchty, and Wolfgang Karl. “Cost-Aware Function Migration in Heterogeneous Systems”. In: Proceedings of the 6th International Conference on High Performance and Embed- ded Architectures and Compilers. HiPEAC ’11. ACM, 2011, p. 137. isbn: 978-1-4503-0241-8 (cit. on p. 55). [Kic+12] Mario Kicherer, Fabian Nowak, Rainer Buchty, and Wolfgang Karl. “Seamlessly Portable Applications: Managing the Diversity of Mod- ern Heterogeneous Systems”. In: ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers 8.4 (Jan. 2012), 42:1–42:20. issn: 1544-3566 (cit. on p. 55). [KLM96] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. “Reinforcement Learning: A Survey”. In: Journal of Artificial Intelli- gence Research 4.1 (May 1996), pp. 237–285. issn: 1076-9757 (cit. on p. 17). [Klö14] Andreas Klöckner. “Loo.Py: Transformation-Based Code Generation for GPUs and CPUs”. In: Proceedings of ACM SIGPLAN Interna- tional Workshop on Libraries, Languages, and Compilers for Array Programming. ARRAY ’14. ACM, 2014, 82:82–82:87. isbn: 978-1- 4503-2937-8 (cit. on p. 63). [KMW67] Richard M. Karp, Raymond E. Miller, and Shmuel Winograd. “The Organization of Computations for Uniform Recurrence Equations”. In: Journal of the ACM 14.3 (July 1967), pp. 563–590. issn: 0004-5411 (cit. on p. 62). [Kop18] Timo Kopf. “Adaptives Online-Tuning für kontinuerliche Zustand- sräume”. Master’s Thesis. Karlsruhe Institute of Technology (KIT) – IPD Tichy, 2018. 96 pp. (cit. on p. 82). [KP11] Thomas Karcher and Victor Pankratius. “Run-Time Automatic Per- formance Tuning for Multicore Applications”. In: Euro-Par 2011 – Parallel Processing. Ed. by Esmmanuel Jeannot, Raymond Namyst, and Jean Roman. Lecture Notes in Computer Science vol. 6852. Springer, Berlin, Heidelberg, 2011, pp. 3–14. isbn: 978-3-642-23399-9 (cit. on p. 51).

184 [Kru14a] Michael Kruse. “Introducing Molly: Distributed Memory Paralleliza- tion with LLVM”. In: CoRR abs/1409.2088 (2014). arXiv: 1409.2088 [cs] (cit. on p. 63). [Kru14b] Michael Kruse. “Lattice QCD Optimization and Polytopic Represen- tations of Distributed Memory”. Dissertation. Université Paris Sud - Paris XI, 2014 (cit. on p. 63). [LA04] Chris Lattner and Vikram Adve. “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation”. In: Proceedings of the 2004 International Symposium on Code Generation and Optimiza- tion: Feedback-Directed and Runtime Optimization. CGO ’04. IEEE, 2004. isbn: 0-7695-2102-9 (cit. on pp.6, 20). [Lam74] Leslie Lamport. “The Parallel Execution of DO Loops”. In: Commu- nications of the ACM 17.2 (Feb. 1974), pp. 83–93. issn: 0001-0782 (cit. on p. 62). [LHK09] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. “Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Map- ping”. In: Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 42. ACM, 2009, pp. 45–55. isbn: 978-1-60558-798-1 (cit. on pp. 75, 130). [Lia+09] Chunhua Liao, Daniel J. Quinlan, Richard Vuduc, and Thomas Panas. “Effective Source-to-Source Outlining to Support Whole Program Em- pirical Optimization”. In: Languages and Compilers for Parallel Com- puting. LCPC ’09. Ed. by Guang R. Gao, John Cavazos, Xiaom- ing Li, and Lori L. Pollock. Lecture Notes in Computer Science vol. 5898. Springer, Berlin, Heidelberg, 2009, pp. 308–322. isbn: 978-3- 642-13373-2 (cit. on p. 73). [Lin+08] Michael D. Linderman, Jamison D. Collins, Hong Wang, and Teresa H. Meng. “Merge: A Programming Model for Heterogeneous Multi- Core Systems”. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS XIII. ACM, 2008, pp. 287–296. isbn: 978-1-59593- 958-6 (cit. on p. 65).


[LJW13] Zhen Li, Ali Jannesari, and Felix Wolf. “Discovery of Potential Paral- lelism in Sequential Programs”. In: Proceedings of the 2013 42nd In- ternational Conference on Parallel Processing. ICPP ’13. IEEE, 2013, pp. 1004–1013. isbn: 978-0-7695-5117-3 (cit. on p. 59). [LR91] William Landi and Barbara G. Ryder. “Pointer-Induced Alias- ing: A Problem Classification”. In: Proceedings of the 18th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Lan- guages. POPL ’91. ACM, 1991, pp. 93–103. isbn: 978-0-89791-419-2 (cit. on p. 24). [LRG12] Changmin Lee, Won W. Ro, and Jean-Luc Gaudiot. “Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hy- brids”. In: Proceedings of the 2012 16th Workshop on Interaction be- tween Compilers and Computer Architectures. INTERACT ’12. IEEE, 2012, pp. 33–40. isbn: 978-1-4673-2613-1 (cit. on pp. 77, 130). [Mae+10] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. “Toward Off-Policy Learning Control with Func- tion Approximation”. In: Proceedings of the 27th International Con- ference on International Conference on Machine Learning. ICML’10. Omnipress, 2010, pp. 719–726. isbn: 978-1-60558-907-7 (cit. on p. 19). [Mar13] Daniel E. Marthaler. “An Overview of Mathematical Methods for Numerical Optimization”. In: Numerical Methods for Metamaterial Design. Ed. by Kenneth Diest. Topics in Applied Physics vol. 127. Springer, Dordrecht, 2013, pp. 31–53. isbn: 978-94-007-6664-8 (cit. on p. 15). [MBC79] M. D. McKay, R. J. Beckman, and W. J. Conover. “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code”. In: Technometrics 21.2 (1979), pp. 239–245. issn: 0040-1706. JSTOR: 1268522 (cit. on pp. 93, 152). [Mik+14] Dmitry Mikushin, Nikolay Likhogrud, Eddy Z. Zhang, and Christo- pher Bergström. “KernelGen – The Design and Implementation of a Next Generation Compiler Platform for Accelerating Numerical Mod- els on GPUs”. In: Proceedings of the 2014 IEEE International Paral- lel and Distributed Processing Symposium Workshops. IPDPSW ’14. IEEE, 2014, pp. 1011–1020. isbn: 978-1-4799-4116-2 (cit. on p. 63).

186 [MMT15] Korbinian Molitorisz, Tobias Müller, and Walter F. Tichy. “Patty: A Pattern-Based Parallelization Tool for the Multicore Age”. In: Pro- ceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores. PMAM ’15. ACM, 2015, pp. 153–163. isbn: 978-1-4503-3404-4 (cit. on pp. 67, 74). [Mor+03] Anna Morajko, Oleg Morajko, Josep Jorba, Tomàs Margalef, and Emilio Luque. “Dynamic Performance Tuning of Distributed Pro- gramming Libraries”. In: Computational Science — ICCS 2003. ICCS 2003. Ed. by Peter M. A. Sloot, David Abramson, Alexander V. Bog- danov, Yuriy E. Gorbachev, Jack J. Dongarra, and Albert Y. Zomaya. Lecture Notes in Computer Science vol. 2660. Springer, Berlin, Hei- delberg, 2003, pp. 191–200. isbn: 978-3-540-40197-1 (cit. on p. 54). [Mor+04] Anna Morajko, Oleg Morajko, Tomàs Margalef, and Emilio Luque. “MATE: Dynamic Performance Tuning Environment”. In: Euro-Par 2004 – Parallel Processing. Ed. by Marco Danelutto, Marco Van- neschi, and Domenico Laforenza. Lecture Notes in Computer Science vol. 3149. Springer, Berlin, Heidelberg, 2004, pp. 98–107. isbn: 978- 3-540-22924-7 (cit. on p. 54). [MRR12] Michael McCool, Arch D. Robison, and James Reinders. Struc- tured Parallel Programming. Morgan Kaufmann, 2012. isbn: 978-0- 12-415993-8 (cit. on p. 23). [Mur+14] Saurav Muralidharan, Manu Shantharam, Mary Hall, Micheal Gar- land, and Bryan Catanzaro. “Nitro: A Framework for Adaptive Code Variant Tuning”. In: Proceedings of the 2014 IEEE International Par- allel and Distributed Processing Symposium. IPDPS ’14. IEEE, 2014, pp. 501–512. isbn: 978-1-4799-3799-8 (cit. on p. 55). [MVB15] Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. “Poly- Mage: Automatic Optimization for Image Processing Pipelines”. In: Proceedings of the Twentieth International Conference on Architec- tural Support for Programming Languages and Operating Systems. ASPLOS ’15. ACM, 2015, pp. 429–443. isbn: 978-1-4503-2835-7 (cit. on p. 78).


[NC14] Cedric Nugteren and Henk Corporaal. “Bones: An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs”. In: ACM Trans- actions on Architecture and Code Optimization (TACO) 11.4 (Dec. 2014), pp. 1–25. issn: 1544-3566 (cit. on p. 67). [NM65] J. A. Nelder and R. Mead. “A Simplex Method for Function Minimiza- tion”. In: The Computer Journal 7.4 (Jan. 1965). issn: 0010-4620 (cit. on pp. 15, 91). [OPT09] Frank Otto, Victor Pankratius, and Walter F. Tichy. “High-Level Mul- ticore Programming with Xjava”. In: Proceedings of the 2009 31st International Conference on Software Engineering - Companion Vol- ume. ICSE ’09. IEEE, 2009, pp. 319–322. isbn: 978-1-4244-3495-4 (cit. on p. 76). [Ott+10] Frank Otto, Christoph A. Schaefer, Matthias Dempe, and Walter F. Tichy. “A Language-Based Tuning Mechanism for Task and Pipeline Parallelism”. In: Euro-Par 2010 – Parallel Processing. Ed. by Pasqua D’Ambra, Mario Guarracino, and Domenico Talia. Lecture Notes in Computer Science vol. 6272. Springer, Berlin, Heidelberg, 2010, pp. 328–340. isbn: 978-3-642-15291-7 (cit. on p. 76). [Pac+12] Maciej Pacula, Jason Ansel, Saman Amarasinghe, and Una-May O’Reilly. “Hyperparameter Tuning in Bandit-Based Adaptive Oper- ator Selection”. In: Applications of Evolutionary Computation. Ed. by Cecilia Chio et al. Lecture Notes in Computer Science vol. 7248. Malaga, Spain: Springer, Berlin, Heidelberg, 2012. isbn: 978-3-642- 29178-4 (cit. on p. 52). [Par+13] Eunjung Park, John Cavazos, Louis-Noël Pouchet, Cédric Bastoul, Al- bert Cohen, and P. Sadayappan. “Predictive Modeling in a Polyhedral Optimization Space”. In: International Journal of Parallel Program- ming 41.5 (Feb. 2013), pp. 704–750. issn: 0885-7458, 1573-7640 (cit. on p. 70). [Par94] Jeong-Soo Park. “Optimal Latin-Hypercube Designs for Computer Experiments”. In: Journal of Statistical Planning and Inference 39.1 (Apr. 1994), pp. 95–111. issn: 0378-3758 (cit. on pp. 93, 152).

188 [PE06] Zhelong Pan and Rudolf Eigenmann. “Fast and Effective Orchestra- tion of Compiler Optimizations for Automatic Performance Tuning”. In: Proceedings of the 2006 International Symposium on Code Gener- ation and Optimization. CGO ’06. IEEE, 2006. isbn: 0-7695-2499-0 (cit. on p. 73). [Pfa+17] Philip Pfaffe, Martin Tillmann, Sigmar Walter, and Walter F. Tichy. “Online-Autotuning in the Presence of Algorithmic Choice”. In: Pro- ceedings of the 2017 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops. IPDPSW ’17. IEEE, 2017, pp. 1379– 1388. isbn: 978-1-5386-3408-0 (cit. on pp. 13, 14, 94, 95). [PGT19] Philip Pfaffe, Tobias Grosser, and Martin Tillmann. “Efficient Hier- archical Online-Autotuning: A Case Study on Polyhedral Accelerator Mapping”. In: Proceedings of the ACM International Conference on Supercomputing. ICS ’19. ACM, 2019, pp. 354–366. isbn: 978-1-4503- 6079-1 (cit. on pp. 82, 98, 107, 138, 143, 147–149). [PW94] William Pugh and David Wonnacott. “Static Analysis of Upper and Lower Bounds on Dependences and Parallelism”. In: ACM Transac- tions on Programming Languages and Systems (TOPLAS) 16.4 (July 1994), pp. 1248–1278. issn: 0164-0925 (cit. on p. 27). [Rag+13] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. “Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputa- tion in Image Processing Pipelines”. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Imple- mentation. PLDI ’13. ACM, 2013, pp. 519–530. isbn: 978-1-4503- 2014-6 (cit. on p. 77). [RDP14] Benjamin Ranft, Oliver Denninger, and Philip Pfaffe. “A Stream Pro- cessing Framework for On-Line Optimization of Performance and En- ergy Efficiency on Heterogeneous Systems”. In: Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium Workshops. IPDPSW ’14. IEEE, 2014. isbn: 978-1-4799-4116-2 (cit. on pp. 77, 130).


[Rep96] Thomas Reps. “On the Sequential Nature of Interprocedural Program- Analysis Problems”. In: Acta Informatica 33.8 (Nov. 1996), pp. 739– 757. issn: 1432-0525 (cit. on p. 26). [RHG17] Ari Rasch, Michael Haidl, and Sergei Gorlatch. “ATF: A Generic Auto-Tuning Framework”. In: Proceedings of the 2017 IEEE 19th In- ternational Conference on High Performance Computing and Commu- nications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems. HPCC/S- martCity/DSS’17. IEEE, 2017, pp. 64–71. isbn: 978-1-5386-2588-0 (cit. on p. 52). [Rud+10] Gabe Rudy, Malik Murtaza Khan, Mary Hall, Chun Chen, and Jacqueline Chame. “A Programming Language Interface to Describe Transformations and Code Generation”. In: Languages and Compil- ers for Parallel Computing. Languages and Compilers for Parallel Computing. Ed. by Keith Cooper, John Mellor-Crummey, and Vivek Sarkar. Lecture Notes in Computer Science vol. 6548. Springer, Berlin, Heidelberg, 2010, pp. 136–150. isbn: 978-3-642-19594-5 (cit. on p. 72). [RVD07] Sean Rul, Hans Vandierendonck, and Koen De Bosschere. “Function Level Parallelism Driven by Data Dependencies”. In: ACM SIGARCH Computer Architecture News 35.1 (Mar. 2007), pp. 55–62. issn: 0163- 5964 (cit. on p. 67). [SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 1st edition. MIT Press, 1998. isbn: 0-262-19398-1 (cit. on pp. 17, 18). [Sch09] Christoph A. Schaefer. “Reducing Search Space of Auto-Tuners Using Parallel Patterns”. In: Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering. IWMSE ’09. IEEE, 2009, pp. 17–24. isbn: 978-1-4244-3718-4 (cit. on p. 51). [Sch17] Bernhard Scheirle. “Musterbasierte Analyse zur Detektion von Par- allelisierungsmöglichkeiten”. Master’s Thesis. Karlsruhe Institute of Technology (KIT) – IPD Tichy, 2017 (cit. on p. 114).

190 [SHH62] W. Spendley, G. R. Hext, and F. R. Himsworth. “Sequential Applica- tion of Simplex Designs in Optimisation and Evolutionary Operation”. In: Technometrics 4.4 (Nov. 1962), pp. 441–461. issn: 0040-1706 (cit. on p. 15). [SKK17] Lukas Sommer, Jens Korinth, and Andreas Koch. “OpenMP Device Offloading to FPGA Accelerators”. In: Proceedings of the 2017 IEEE 28th International Conference on Application-Specific Systems, Archi- tectures and Processors. ASAP ’17. IEEE, 2017, pp. 201–205. isbn: 978-1-5090-4825-0 (cit. on p. 66). [SMS08] Richard S. Sutton, Hamid R. Maei, and Csaba Szepesvári. “A Con- vergent O(n) Temporal-Difference Algorithm for Off-Policy Learning with Linear Function Approximation”. In: Advances in Neural Infor- mation Processing Systems 21. Ed. by D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou. Curran Associates, Inc., 2008, pp. 1609–1616. isbn: 978-1605609492 (cit. on p. 20). [SOB14] Zehra Sura, Kevin OBrien, and Jose Brunheroto. “Using Multiple Threads to Accelerate Single Thread Performance”. In: Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium. IPDPS ’14. IEEE, 2014, pp. 985–994. isbn: 978-1-4799- 3799-8 (cit. on p. 60). [SP78] Micha Sharir and Amir Pnueli. Two Approaches to Interprocedural Data Flow Analysis. Technical Report. New York University, Courant Institute of Mathematical Sciences, ComputerScience Department, 1978 (cit. on p. 26). [SPT09] Christoph A. Schaefer, Victor Pankratius, and Walter F. Tichy. “Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications”. In: Euro-Par 2009 – Parallel Processing. Ed. by Henk Sips, Dick Epema, and Hai-Xiang Lin. Lecture Notes in Computer Science vol. 5704. Springer, Berlin, Heidelberg, 2009, pp. 9–20. isbn: 978-3-642-03868-6 (cit. on p. 50). [SPT10] Christoph A. Schaefer, Victor Pankratius, and Walter F. Tichy. “En- gineering Parallel Applications with Tunable Architectures”. In: Pro- ceedings of the 32Nd ACM/IEEE International Conference on Soft-

ware Engineering - Volume 1. ICSE ’10. New York, NY, USA: ACM, 2010, pp. 405–414. isbn: 978-1-60558-719-6 (cit. on p. 76). [SRD17] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. “LIFT: A Functional Data-Parallel IR for High-Performance GPU Code Gener- ation”. In: Proceedings of the 2017 International Symposium on Code Generation and Optimization. CGO ’17. IEEE, 2017, pp. 74–85. isbn: 978-1-5090-4931-8 (cit. on p. 78). [Ste46] Stanley Smith Stevens. “On the Theory of Scales of Measurement”. In: Science 103.2684 (1946), pp. 677–680. issn: 0036-8075 (cit. on p. 13). [Str+12] Kevin Streit, Clemens Hammacher, Andreas Zeller, and Sebastian Hack. “Sambamba: A Runtime System for Online Adaptive Par- allelization”. In: Compiler Construction. Ed. by Michael O’Boyle. Lecture Notes in Computer Science vol. 7210. Berlin, Heidelberg: Springer, Berlin, Heidelberg, 2012, pp. 240–243. isbn: 978-3-642- 28651-3 (cit. on p. 60). [Str+13] Kevin Streit, Clemens Hammacher, Andreas Zeller, and Sebastian Hack. “Sambamba: Runtime Adaptive Parallel Execution”. In: Pro- ceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems. ADAPT ’13. New York, NY, USA: ACM, 2013, pp. 1–6. isbn: 978-1-4503-2022-1 (cit. on p. 60). [Str+15] Kevin Streit, Johannes Doerfert, Clemens Hammacher, Andreas Zeller, and Sebastian Hack. “Generalized Task Parallelism”. In: ACM Transactions on Architecture and Code Optimization (TACO) 12.1 (Apr. 2015), 8:1–8:25. issn: 1544-3566 (cit. on p. 60). [Sus92] Alan Sussman. “Model-Driven Mapping Onto Distributed Memory Parallel Computers”. In: Proceedings of the 1992 ACM/IEEE Confer- ence on Supercomputing. Supercomputing ’92. IEEE, 1992, pp. 818– 829. isbn: 0-8186-2630-5 (cit. on p. 69). [Sut88] Richard S. Sutton. “Learning to Predict by the Methods of Temporal Differences”. In: Machine Learning 3.1 (Aug. 1988), pp. 9–44. issn: 1573-0565 (cit. on p. 18).

192 [ŢCH02] Cristian Ţăpuş, I-Hsin Chung, and Jeffrey K. Hollingsworth. “Ac- tive Harmony: Towards Automated Performance Tuning”. In: Pro- ceedings of the 2002 ACM/IEEE Conference on Supercomputing. SC ’02. IEEE, 2002, pp. 1–11. isbn: 0-7695-1524-X (cit. on p. 50). [Tes95] Gerald Tesauro. “Temporal Difference Learning and TD-Gammon”. In: Communications of the ACM 38.3 (Mar. 1995), pp. 58–68. issn: 0001-0782 (cit. on p. 18). [TF10] Georgios Tournavitis and Björn Franke. “Semi-Automatic Extraction and Exploitation of Hierarchical Pipeline Parallelism Using Profiling Information”. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. PACT ’10. ACM, 2010, pp. 377–388. isbn: 978-1-4503-0178-7 (cit. on p. 67). [TH11] Ananta Tiwari and Jeffrey K. Hollingsworth. “Online Adaptive Code Generation and Tuning”. In: Proceedings of the 2011 IEEE Interna- tional Parallel Distributed Processing Symposium. IPDPS ’11. IEEE, 2011, pp. 879–892. isbn: 978-1-61284-372-8 (cit. on p. 72). [Til+14] Martin Tillmann, Thomas Karcher, Carsten Dachsbacher, and Walter F. Tichy. “Application-Independent Autotuning for GPUs.” In: Paral- lel Computing: Accelerating Computational Science and Engineering (CSE). Ed. by Michael Bader, Arndt Bode, Hans-Joachim Bungartz, Michael Gerndt, Gerhard R. Joubert, and Frans J. Peters. Advances in Parallel Computing 25. IOS Press, 2014, pp. 626–635. isbn: 978-1- 61499-380-3 (cit. on p. 52). [Tiw+09] Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, and Jef- frey K. Hollingsworth. “A Scalable Auto-Tuning Framework for Com- piler Optimization”. In: Proceedings of the 2009 IEEE International Parallel Distributed Processing Symposium. IPDPS ’09. IEEE, 2009, pp. 1–12. isbn: 978-1-4244-3751-1 (cit. on p. 72). [Tou+09] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael FP O’Boyle. “Towards a Holistic Approach to Auto-Parallelization: In- tegrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping”. In: Proceedings of the 2009 ACM SIGPLAN Con- ference on Programming Language Design and Implementation. PLDI ’09. ACM, 2009, pp. 177–187. isbn: 978-1-60558-392-1 (cit. on p. 70).


[Vas+18] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. “Tensor Comprehensions: Framework- Agnostic High-Performance Machine Learning Abstractions”. In: CoRR abs/1802.04730 (2018). arXiv: 1802.04730 [cs] (cit. on p. 78). [VD00] Richard Vuduc and James W. Demmel. “Code Generators for Auto- matic Tuning of Numerical Kernels: Experiences with FFTW Position Paper”. In: Semantics, Applications, and Implementation of Program Generation. SAIG ’00. Ed. by Walid Taha. Lecture Notes in Com- puter Science vol. 1924. Springer, Berlin, Heidelberg, 2000, pp. 190– 211. isbn: 978-3-540-41054-6 (cit. on p. 50). [VDB01] Richard Vuduc, James W. Demmel, and Jeff Bilmes. “Statistical Mod- els for Automatic Performance Tuning”. In: Computational Science — ICCS 2001. Ed. by Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner, and C. J. Kenneth Tan. Lecture Notes in Computer Science vol. 2073. Berlin, Heidelberg: Springer, Berlin, Heidelberg, 2001, pp. 117–126. isbn: 978-3-540-42232-7 (cit. on p. 50). [VE00] Micheal J. Voss and Rudolf Eigenmann. “ADAPT: Automated De-Coupled Adaptive Program Transformation”. In: Proceedings of the 2000 International Conference on Parallel Processing. ICPP’00. IEEE, 2000, pp. 163–170. isbn: 0-7695-0768-9 (cit. on p. 75). [VE01] Michael J. Voss and Rudolf Eigemann. “High-Level Adaptive Program Optimization with ADAPT”. In: Proceedings of the Eighth ACM SIG- PLAN Symposium on Principles and Practices of Parallel Program- ming. PPoPP ’01. ACM, 2001, pp. 93–102. isbn: 1-58113-346-4 (cit. on p. 75). [Ver+13] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. “Polyhedral Par- allel Code Generation for CUDA”. In: ACM Transactions on Ar- chitecture and Code Optimization (TACO) - Special Issue on High- Performance Embedded Architectures and Compilers 9.4 (Jan. 2013), 54:1–54:23. issn: 1544-3566 (cit. on pp. 26, 27, 29, 62, 108).

194 [Ver10] Sven Verdoolaege. “Isl: An Integer Set Library for the Polyhedral Model”. In: Mathematical Software – ICMS 2010. Ed. by Komei Fukuda, Joris van der Hoeven, Michael Joswig, and Nobuki Takayama. Lecture Notes in Computer Science vol. 6326. Springer, Berlin, Hei- delberg, 2010, pp. 299–302. isbn: 978-3-642-15582-6 (cit. on p. 30). [Ver16] Sven Verdoolaege. Presburger Formulas and Polyhedral Compilation. 2016. url: https://lirias.kuleuven.be/bitstream/123456789/523109/ 3/polycomp-tutorial-v0.02.pdf (visited on 01/11/2020) (cit. on p. 27). [Vud03] Richard Wilson Vuduc. “Automatic Performance Tuning of Sparse Matrix Kernels”. Dissertation. University of California, Berkeley, 2003 (cit. on p. 50). [Wan+08] Cheng Wang, Youfeng Wu, Edson Borin, Shiliang Hu, Wei Liu, Tin- fook Ngai, and Jesse Fang. “New Slicing Algorithms for Parallelizing Single-Threaded Programs”. In: Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures. PESPMA ’08. 2008 (cit. on p. 59). [Wat89] Christopher Watkins. “Learning From Delayed Rewards”. Disserta- tion. King’s College, 1989 (cit. on p. 19). [WB95] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. Technical Report. University of North Carolina at Chapel Hill, 1995 (cit. on p. 78). [WD92] Christopher J. C. H. Watkins and Peter Dayan. “Q-Learning”. In: Machine Learning 8.3-4 (1992), pp. 279–292. issn: 1573-0565 (cit. on p. 18). [WD98] R. C. Whaley and J. J. Dongarra. “Automatically Tuned Linear Alge- bra Software”. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. SC ’98. IEEE, 1998, pp. 38–38. isbn: 0-89791-984-X (cit. on pp.9, 49). [Wen16] André Wengert. “Adaptives Auto-Tuning”. Master’s Thesis. Karlsruhe Institute of Technology (KIT) – IPD Tichy, 2016 (cit. on p. 82).


[WO09] Zheng Wang and Micheal F.P. O’Boyle. “Mapping Parallelism to Multi-Cores: A Machine Learning Based Approach”. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’09. ACM, 2009, pp. 75–84. isbn: 978-1-60558-397-6 (cit. on p. 70). [Wol82] Michael Joseph Wolfe. “Optimizing Supercompilers for Supercomput- ers”. Dissertation. University of Illinois at Urbana-Champaign, 1982 (cit. on p. 25). [YSD05] Haihang You, Keith Seymour, and Jack Dongarra. An Effective Em- pirical Search Method for Automatic Software Tuning. Technical Re- port. University of Tennessee, Innovative Computing Laboratory, 2005 (cit. on pp. 16, 50). [Zha+13] Qirun Zhang, Michael R. Lyu, Hao Yuan, and Zhendong Su. “Fast Al- gorithms for Dyck-CFL-Reachability with Applications to Alias Anal- ysis”. In: Proceedings of the 34th ACM SIGPLAN Conference on Pro- gramming Language Design and Implementation. PLDI ’13. ACM, 2013, pp. 435–446. isbn: 978-1-4503-2014-6 (cit. on p. 22). [ZR08] Xin Zheng and Radu Rugina. “Demand-Driven Alias Analysis for C”. In: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Sym- posium on Principles of Programming Languages. POPL ’08. ACM, 2008, pp. 197–208. isbn: 978-1-59593-689-9 (cit. on p. 22).
