Autotuning for Automatic Parallelization on Heterogeneous Systems
Dissertation by Philip Pfaffe, approved by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) in fulfillment of the requirements for the academic degree of Doktor der Ingenieurwissenschaften.

Date of oral examination: July 24, 2019
First referee: Prof. Dr. Walter F. Tichy
Second referee: Prof. Dr. Michael Philippsen

Abstract

To meet the surging demand for high-speed computation in an era of stagnating growth in per-processor performance, system designers resort to aggregating many, and even heterogeneous, processors into single systems. Automatic parallelization tools relieve application developers of the tedious and error-prone task of programming these heterogeneous systems. For these tools, there are two aspects to maximizing performance: optimizing the execution on each parallel platform individually, and executing work on the available platforms cooperatively. To date, various approaches exist targeting either aspect. Automatic parallelization for simultaneous cooperative computation with optimized per-platform execution, however, remains an unsolved problem.

This thesis presents the APHES framework to close that gap. The framework combines automatic parallelization with a novel technique for input-sensitive online autotuning. Its first component, a parallelizing polyhedral compiler, transforms implicitly data-parallel program parts for multiple platforms. The targeted platforms then automatically cooperate to process the work. During compilation, the code is instrumented to interact with libtuning, our new autotuner and the second component of the framework. Tuning the work distribution and the per-platform execution maximizes overall performance.
The autotuner enables always-on autotuning through a novel hybrid tuning method, combining a new efficient search technique with model-based prediction. Experiments show that the APHES framework can solve the cooperative heterogeneous parallelization problem and that cooperative execution outperforms versions parallelized for a single platform. On benchmarks from the PolyBench benchmark suite, the APHES-transformed programs achieve speedups of up to 6× compared to program versions generated by state-of-the-art single-platform parallelizers. The libtuning autotuner reduces the search time by up to 30% compared to state-of-the-art autotuning while still finding competitive configurations. Additionally, model-based prediction is able to eliminate 99% of the search overhead.

I truthfully affirm that I have written the dissertation entitled "Autotuning for Automatic Parallelization on Heterogeneous Systems" independently, that I have used no sources or aids other than those indicated, that I have marked all passages adopted verbatim or in substance from other works as such, and that I have observed the rules of the Karlsruhe Institute of Technology (KIT) for safeguarding good scientific practice.

Philip Pfaffe
Unterschleißheim, May 22, 2020

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Thesis Objectives
  1.3 Scope
  1.4 Structure of this Dissertation
2 Fundamental Concepts
  2.1 Online and Offline Autotuning
    2.1.1 The Tuning Problem
    2.1.2 Search Algorithms
    2.1.3 Reinforcement Learning
  2.2 Program Analysis and Transformation
    2.2.1 The LLVM Framework
    2.2.2 Parallelism Detection, Dependence Graphs, and Interprocedural Analyses
    2.2.3 The Polyhedral Model
3 An Overview of the APHES Framework
  3.1 The libtuning Autotuner
  3.2 Autotuning with libtuning
  3.3 Autotuning and Automatic Parallelization with APHES
  3.4 Summary
4 Related Work
  4.1 Autotuning
    4.1.1 Tuning Algorithms and Autotuners
    4.1.2 Machine and Performance Models
  4.2 Automatic Parallelization for Accelerators
    4.2.1 Dependence-based Parallelization
    4.2.2 Parallelization of (Mostly) Affine Programs with the Polyhedral Model
    4.2.3 Explicitly Parallel DSLs
    4.2.4 Pattern-based Detection of Parallelism
  4.3 Autotuning Languages and Compilers
    4.3.1 Modeling and Machine Learning for Autotuning Compilers
    4.3.2 Search-based Tuning Compilers
    4.3.3 Languages
  4.4 Summary
5 Hybrid Online Autotuning
  5.1 Hybrid Tuning: Combining Search and Prediction
    5.1.1 Context Sensitivity in Online Autotuning
    5.1.2 Approximating and Observing the Context
    5.1.3 Adapting to Context Changes
  5.2 The libtuning Architecture
    5.2.1 Tuning Parameters
    5.2.2 Indicators
    5.2.3 Fundamental Search Algorithms
  5.3 Hierarchical Search
  5.4 Model-based Prediction
  5.5 Summary
6 Automatic Heterogeneous Parallelization
  6.1 Polyhedral Parallelization and Partitioning
    6.1.1 Mapping and Partitioning the Schedule Tree
    6.1.2 Code Generation
  6.2 Dependence Testing for Parallelization and Partitioning
    6.2.1 Detecting Data Parallelism
    6.2.2 Target Offloading
    6.2.3 Runtime System
  6.3 Offloading Heuristics and Runtime Predictors
  6.4 Summary
7 Experimental Evaluation
  7.1 Benchmarks
  7.2 Ad-hoc Parallelization
  7.3 Polyhedral Parallelization
    7.3.1 Performance Results
    7.3.2 Detailed Analysis of Hierarchical Search
    7.3.3 Discussion
  7.4 Hybrid Autotuning
    7.4.1 Benchmark Selection
    7.4.2 Experiment Setup
    7.4.3 Results
    7.4.4 Discussion
  7.5 Summary
8 Conclusion and Outlook
  8.1 Thesis Objectives
  8.2 Outlook
List of Figures
List of Tables
Bibliography

Chapter 1

Introduction

The era of growing processor speed is over. Moore's law, which promised a doubling in performance every 18 to 24 months, has ended. This is a fact that must be accepted. But it is not an indicator of doom or deterioration of the industry! Instead, it turns out to be an immense opportunity for system architects. Reaching the end of the line in the development of ever more powerful CPUs has ignited the drive to develop the architectures of the future. Today, there are dozens of emerging parallel architectures, often optimized for specific tasks. A recent IEEE Spectrum article¹ quotes David Patterson estimating that there are "at least 45 hardware startups tackling the problem", making our time a "golden time for computer architecture".
And these are not dreams of the future: today's systems are already highly parallel, containing multiple general- and special-purpose parallel processors. This trend is not limited to the performance-hungry field of high-performance computing, but has reached low- to high-end consumer computers, down to the smartphone in everyone's pocket: nearly every device is packed with multiple CPUs, GPUs for both graphics and compute acceleration, and specialized parallel compute units for various specific tasks.

The hardware landscape available to us within a single system has changed, and in the wake of this change an enormous challenge for the field of software engineering has arisen. Effectively programming these highly parallel, heterogeneous systems is an intricate and time-consuming task even for highly trained and seasoned developers. Fortunately, various tools are available to ease the design, evaluation, and verification of parallel software. The list includes profilers, debuggers and race detectors, recommender systems for parallelizable program parts, and automatic parallelizers. Fully automatic parallelization for heterogeneous systems is naturally the ultimate goal, because it reduces the required developer effort to zero. To date, automatic parallelization approaches are mostly limited to single-platform targets, e.g., GPUs or CPUs. This is likely because targeting multiple platforms simultaneously is a notoriously hard problem. Simultaneously refers here to executing a program on different platforms cooperatively for distinct parts

¹Source: https://spectrum.ieee.org/view-from-the-valley/computing/hardware/david-patterson-says-its-time-for-new-computer-architectures-and-software-languages, September 2018. Last accessed: December 29th, 2018