Autotuning Programs with Algorithmic Choice by Jason Ansel
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, February 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Jason Ansel, Department of Electrical Engineering and Computer Science, January 6, 2014
Certified by: Saman Amarasinghe, Professor, Thesis Supervisor
Accepted by: Professor Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students

Submitted to the Department of Electrical Engineering and Computer Science on January 6, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

Abstract

The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally conducted manually by the programmer and often must be repeated when systems, tools, or requirements change.
The overriding goal of this work is to automate this search so that programs can change themselves and adapt to achieve performance portability across different environments and requirements. To achieve this, first, this work presents the PetaBricks programming language, which focuses on ways of expressing program implementation search spaces at the language level. Second, this work presents OpenTuner, which provides sophisticated techniques for searching these search spaces in a way that can easily be adopted by other projects.

PetaBricks is an implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural way of programming. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The PetaBricks compiler autotunes programs by making both fine-grained and algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. PetaBricks also introduces novel techniques to autotune algorithms for different convergence criteria or quality of service requirements. We show that the PetaBricks autotuner is often able to find non-intuitive poly-algorithms that outperform more traditional hand-written solutions.

OpenTuner is an open source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy-to-use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously; techniques that perform well will dynamically be allocated a larger proportion of tests. OpenTuner has been shown to perform well on complex search spaces up to 10^3000 possible configurations in size.
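The ensemble idea described above, where techniques that perform well receive a larger share of tests, can be illustrated with a minimal Python sketch. This is not OpenTuner's API or its actual allocation algorithm; it is a simple credit-proportional scheme written for illustration. The names `run_ensemble`, `random_search`, `local_step`, and `objective`, as well as the toy one-dimensional search space, are all hypothetical.

```python
import random


def run_ensemble(techniques, objective, budget, seed=0):
    """Spend `budget` tests across techniques, favoring those that found improvements.

    techniques: dict mapping a name to a function (rng, best_config) -> candidate config
    objective:  function config -> cost, where lower is better
    """
    rng = random.Random(seed)
    # Every technique starts with one unit of credit so that each can be selected.
    credit = {name: 1.0 for name in techniques}
    uses = {name: 0 for name in techniques}
    best_cfg, best_cost = None, float("inf")

    for _ in range(budget):
        names = list(techniques)
        # Pick a technique with probability proportional to its credit.
        name = rng.choices(names, weights=[credit[n] for n in names])[0]
        uses[name] += 1
        cfg = techniques[name](rng, best_cfg)
        cost = objective(cfg)
        if cost < best_cost:
            # Reward the technique that produced a new global best.
            best_cfg, best_cost = cfg, cost
            credit[name] += 1.0
    return best_cfg, best_cost, uses


def random_search(rng, best):
    """Hypothetical technique: sample uniformly from the toy space [0, 1000]."""
    return rng.randint(0, 1000)


def local_step(rng, best):
    """Hypothetical technique: perturb the best known configuration slightly."""
    if best is None:
        return rng.randint(0, 1000)
    return min(1000, max(0, best + rng.randint(-5, 5)))


def objective(x):
    """Toy cost function with its minimum at x = 700 (an assumption for this demo)."""
    return (x - 700) ** 2
```

Calling `run_ensemble({"random": random_search, "local": local_step}, objective, budget=200)` returns the best configuration found, its cost, and how many tests each technique received. The shared `best_cfg` is what lets one technique build on another's results, which is the main reason to run disparate techniques together rather than one after another.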
Thesis Supervisor: Saman Amarasinghe
Title: Professor

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Contributions
    1.1.1 Language
    1.1.2 Process and Compilation
    1.1.3 Autotuning Techniques
2 The PetaBricks Language
  2.1 Sorting as an Example of Algorithmic Choice
  2.2 Iteration Order Choices
  2.3 Variable Accuracy
    2.3.1 K-Means Example
    2.3.2 Language Support for Variable Accuracy
    2.3.3 Variable Accuracy Language Features
    2.3.4 Accuracy Guarantees
  2.4 Input Features
  2.5 A More Complex Example
    2.5.1 The Choice Space for SeparableConvolution
  2.6 Language Specification
    2.6.1 Transform Header Flags
    2.6.2 Rule Header Flags
    2.6.3 Matrix Definitions
    2.6.4 Matrix Regions
3 The PetaBricks Compiler
  3.1 PetaBricks Compiler
  3.2 Parallelism in Output Code
  3.3 Autotuning System and Choice Framework
  3.4 Runtime Library
  3.5 Code Generation for Heterogeneous Architectures
    3.5.1 OpenCL Kernel Generation
    3.5.2 Data Movement Analysis
    3.5.3 Runtime System
    3.5.4 Memory Management
    3.5.5 GPU Choice Representation to the Autotuner
  3.6 Choice Space Representation
    3.6.1 Choice Configuration Files
  3.7 Deadlocks and Race Conditions
  3.8 Automated Consistency Checking
4 Benchmarks and Experimental Analysis
  4.1 Fixed Accuracy Benchmarks
    4.1.1 Symmetric Eigenproblem
    4.1.2 Sort
    4.1.3 Matrix Multiply
  4.2 Autotuning Parallel Performance
  4.3 Effect of Architecture on Autotuning
  4.4 Variable Accuracy Benchmarks
    4.4.1 Bin Packing
    4.4.2 Clustering
    4.4.3 Image Compression
    4.4.4 Preconditioned Iterative Solvers
  4.5 Experimental Results
    4.5.1 Analysis
    4.5.2 Programmability
  4.6 Heterogeneous Architectures Experimental Results
    4.6.1 Methodology
    4.6.2 Benchmark Results and Analysis
    4.6.3 Heterogeneous Results Summary
  4.7 Summary
5 Multigrid Benchmarks
  5.1 Autotuning Multigrid
    5.1.1 Algorithmic choice in multigrid
    5.1.2 Full dynamic programming solution
    5.1.3 Discrete dynamic programming solution
    5.1.4 Extension to Autotuning Full Multigrid
    5.1.5 Limitations
  5.2 Results
    5.2.1 Autotuned multigrid cycle shapes
    5.2.2 Performance
    5.2.3 Effect of Architecture on Autotuning
6 The PetaBricks Autotuner
  6.1 The Autotuning Problem
    6.1.1 Properties of the Autotuning Problem
  6.2 A Bottom Up EA for Autotuning
  6.3 Experimental Evaluation
    6.3.1 GPEA
    6.3.2 Experimental Setup
    6.3.3 INCREA vs GPEA
    6.3.4 Representative runs
7 Input Sensitivity
  7.1 Usage
  7.2 Input Aware Learning
    7.2.1 A Simple Design and Its Issues
    7.2.2 Design of the Two Level Learning
    7.2.3 Level 1
    7.2.4 Level 2
    7.2.5 Discussion of the Two Level Learning
  7.3 Evaluation
    7.3.1 Input Features and Inputs
    7.3.2 Experimental Results
    7.3.3 Input Generation
    7.3.4 Model of Diminishing Returns with More Landmark Configurations
8 Online Autotuning
  8.1 Competition Execution Model
    8.1.1 Other Splitting Strategies
    8.1.2 Time Multiplexing Races
  8.2 SiblingRivalry Online Learner