UNIVERSITY OF CALIFORNIA, IRVINE
Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors
DISSERTATION
submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
by
Scott W Godfrey
Dissertation Committee: Professor Amelia Regan, Chair; Professor Michael Dillencourt; Professor Emeritus Dennis Volper
2016
© 2016 Scott W Godfrey
DEDICATION
“Lead, follow, or get out of the way.” -Joe [76]
“It’s only after we’ve lost everything that we’re free to do anything.” -Tyler Durden [20]
“The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.” -Bertrand Russell (attributed)
“The degree of one’s emotions varies inversely with one’s knowledge of the facts.” -Bertrand Russell (attributed)
“If you’re going through hell, keep on going.” -Unknown
To my family, committee, friends, employers, and all those who have supported my endeavors unquestioningly and uncompromisingly, I salute you as I depart from this fantastic technicolor fantasyland called ‘academia’. You and all those who have acted, and those who will act, in the name of justice and righteousness are the true heroes of the world.
ROSEBUD
TABLE OF CONTENTS
Page
LIST OF FIGURES vii
LIST OF TABLES ix
ACKNOWLEDGMENTS x
CURRICULUM VITAE xi
ABSTRACT OF THE DISSERTATION xiii
1 Introduction 1
2 Review of Related Literature 6
3 Modern Parallel Hardware Technology 7
3.1 Flynn’s Taxonomy ...... 7
3.2 Development of CMP Multi-Core ...... 8
3.3 Intel Core2 Architecture ...... 11
3.4 Intel i7 Architecture ...... 11
3.5 Hyperthreading/Hardware Threads ...... 12
4 Parallel Performance Models 16
4.1 Models of Parallel Computation ...... 16
4.1.1 Amdahl’s Law ...... 16
4.1.2 Gustafson’s Law ...... 17
4.1.3 Unification of Amdahl’s and Gustafson’s Laws ...... 18
4.1.4 Parallel Speedup ...... 18
4.1.5 Roofline ...... 19
4.2 Algorithmic Computational Models ...... 20
4.3 Drawbacks in the Modern Era ...... 22
4.3.1 Runtime Variability, Performance Uncertainty, and Noise ...... 25
4.3.2 System Symmetry ...... 26
4.3.3 Lack of Hierarchy ...... 27
4.3.4 Continuous Functions ...... 27
5 Operating System Process/Thread Scheduler Effects 29
5.1 Process Affinity ...... 32
5.2 Thread Affinity ...... 32
5.3 Thread Placement ...... 32
5.4 Affinity in Practice ...... 33
5.5 Variation in Performance ...... 33
5.6 Experimental Thread Affinity Effects ...... 34
6 Structure of a Parallel Application 40
6.1 Composition ...... 40
6.2 Decomposition ...... 46
6.3 Parallelization (from the literature) ...... 47
7 Parallel Benchmarking 49
7.1 Relationship between HYDRA benchmark types ...... 52
7.1.1 Concurrent Processes (independent memory address spaces) ...... 52
7.1.2 Concurrent Threads (common memory address space) ...... 53
7.1.3 Individual application task parallel computation ...... 53
7.2 Benchmarking Protocols ...... 55
8 Modular Performance Model 60
8.1 Hardware Parameters ...... 61
8.2 Algorithm Bandwidth ...... 62
8.3 Software Parts ...... 63
8.3.1 Amdahl’s Law ...... 63
8.3.2 Modularity ...... 63
8.3.3 Task Parallelism ...... 64
8.4 Hardware Parts ...... 66
8.4.1 Main Memory Bandwidth ...... 66
8.4.2 Sequential Boost ...... 67
8.4.3 “Virtual” Core Efficiency ...... 67
8.4.4 Lx Space Contention ...... 68
8.4.5 Lx Space Sharing ...... 70
8.5 Contentious Parts ...... 71
8.5.1 H3 Parallel Mutex, Simple ...... 73
8.5.2 H3 Parallel Mutex, Parameterized ...... 73
8.5.3 H2 Sequential Mutex, Parameterized ...... 74
8.5.4 H2 Thread Mutex, Parameterized ...... 74
8.5.5 H1,H2 Model Extension ...... 74
8.6 Operating System Parts ...... 76
8.6.1 Thread Placements ...... 76
8.6.2 Probabilities and Structure of Migrations ...... 79
8.6.3 The Cost of Migrations ...... 83
8.7 Performance Model Implementation ...... 87
9 Experimental Applications 88
9.1 3D Finite-Difference Numerical Integration (FDI) ...... 90
9.1.1 Application Characteristics ...... 90
9.2 3D Surface Reconstruction (SRA) ...... 91
9.2.1 Application Characteristics ...... 92
10 Experimental Toolset 93
10.1 Development Tools ...... 93
10.1.1 Prometheus: Combinatoric Build ...... 93
10.1.2 Ilithyia: Code Generation ...... 94
10.2 Logistics Tools ...... 96
10.2.1 Iris: Distribution and Collection ...... 96
10.2.2 Ponos: Automated Benchmarking ...... 96
10.3 Analysis Tools ...... 97
10.3.1 Pandora: Model Fitting and Cross-Prediction ...... 97
11 Error Analysis 99
11.1 Relevance ...... 99
11.2 Outlier Rejection ...... 100
11.3 Error Metrics and Characterization ...... 100
11.3.1 Total Squared Error, Mean Squared Error ...... 101
11.3.2 Total Absolute Error, Mean Absolute Error ...... 101
11.3.3 Mean Absolute Relative Error ...... 102
11.3.4 Mean Weighted Absolute Relative Error ...... 102
11.3.5 Prediction Methodology ...... 105
12 Optimization 106
12.1 Types of Optimization ...... 106
12.2 Optimization Strategy ...... 107
12.3 Solution Methodology ...... 108
13 Cross-Prediction 112
13.1 Methods and Error Measures ...... 112
13.2 Complications, Caveats, and Limitations ...... 115
14 Predictive Outcomes 117
14.1 Architecture Representation ...... 117
14.2 Model Decomposition ...... 118
14.3 Curve-Fitting Experimental Data ...... 118
14.3.1 Best Fit on Model Parts ...... 118
14.3.2 Best Fit on Model Properties ...... 119
14.3.3 Best Fit on Model ...... 120
14.4 Cross-Prediction ...... 125
14.4.1 Cross-Prediction on Model Parts ...... 125
14.4.2 Cross-Prediction on Model ...... 125
15 Conclusions 138
16 Opportunities for Future Work 142
Bibliography 144
A Data Fitting Results 151
A.1 Fitting Errors Per Model ...... 151
A.1.1 Fitting Errors Per Model, All Data ...... 151
A.1.2 Fitting Errors Per Model, FDI ...... 154
A.1.3 Fitting Errors Per Model, SRA ...... 157
A.2 Fitting Errors, Per Part ...... 160
A.2.1 Fitting Errors, Per Part, Aggregate, Per Architecture ...... 160
A.2.2 Fitting Errors, Per Part, FDI, Per Architecture ...... 164
A.2.3 Fitting Errors, Per Part, SRA, Per Architecture ...... 168
A.3 Fitting Errors, Per Property ...... 172
A.3.1 Fitting Errors, Per Property, Aggregate, Per Architecture ...... 172
A.3.2 Fitting Errors, Per Property, FDI, Per Architecture ...... 191
A.3.3 Fitting Errors, Per Property, SRA, Per Architecture ...... 210
B Cross-Prediction Results 229
B.1 Cross Prediction Relative Errors, Per Part ...... 229
B.1.1 Cross Prediction Relative Errors, All Data, Per Part ...... 230
B.1.2 Cross Prediction Relative Errors, Per Part, FDI ...... 231
B.1.3 Cross Prediction Relative Errors, Per Part, SRA ...... 232
B.2 Cross Prediction Relative Errors, Per Model ...... 233
B.2.1 Cross Prediction Relative Errors, All Data ...... 233
B.2.2 Cross Prediction Relative Errors, FDI, per Architecture ...... 244
B.2.3 Cross Prediction Relative Errors, SRA, per Architecture ...... 253
LIST OF FIGURES
Page
1.1 Computational Scheme ...... 3
1.2 Predictive System Architecture ...... 4
3.1 Intel i7 cache structure ...... 14
3.2 Intel Core 2 cache structure ...... 14
3.3 AMD FX cache structure ...... 15
3.4 Intel Xeon E5335 cache structure ...... 15
4.1 Multi-platform parallel performance comparisons ...... 24
5.1 CPU Utilization 4/8, no affinity control ...... 30
5.2 CPU Utilization 5/8, no affinity control ...... 31
5.3 CPU Utilization 7/8, no affinity control ...... 31
5.4 Core 2 Duo Thread Affinity Effects ...... 35
5.5 Core 2 Quad Thread Affinity Effects ...... 36
5.6 Core i7-4820K Thread Affinity Effects ...... 36
5.7 Core i7-4700MQ Thread Affinity Effects ...... 37
5.8 Core i7-4720HQ Thread Affinity Effects ...... 37
5.9 Core i7-3930K Thread Affinity Effects ...... 38
6.1 Parallel program structure ...... 41
6.2 Parallel contention ...... 42
6.3 Data structure shapes in memory ...... 43
6.4 Wood chipper ...... 44
6.5 CNC router ...... 44
7.1 HYDRA configurations and structure ...... 51
7.2 HYDRA relationships ...... 52
7.3 HYDRA mutexes ...... 54
7.4 HYDRA 3 sample results ...... 56
7.5 HYDRA 1 sample results ...... 57
7.6 HYDRA 1 and 3 composite samples ...... 58
7.7 HYDRA 1 and 3 mean and normalized data ...... 59
8.1 Parallel task blocks ...... 65
8.2 Cache bandwidth partitioning ...... 69
8.3 Thread assignment notation ...... 77
8.4 State migration transition counts ...... 80
8.5 Isomorphic thread configurations ...... 81
8.6 Thread migrations ...... 81
8.7 Heteromorphic state transitions ...... 82
11.1 HYDRA 1 and 3 weighting ...... 104
12.1 Model part-property mapping ...... 110
12.2 Model part-part relations ...... 110
12.3 Model part-property relations ...... 111
14.1 Model Part Representation, Top 25 Best Fit ...... 121
14.2 Model Part Representation, Top 50 Best Fit ...... 122
14.3 Model Part Representation, Top 75 Best Fit ...... 123
14.4 Model Part Representation, Top 100 Best Fit ...... 124
14.5 Predictive Model Complexity ...... 127
14.6 Model Part Representation, Top 25*Archs Cross Prediction ...... 129
14.7 Model Part Representation, Top 50*Archs Cross Prediction ...... 130
14.8 Model Part Representation, Top 75*Archs Cross Prediction ...... 131
14.9 Model Part Representation, Top 100*Archs Cross Prediction ...... 132
14.10 Model Part Representation, Top 12 BEST Cross Prediction ...... 137
LIST OF TABLES
Page
3.1 Intel architectures ...... 13
3.2 AMD architectures ...... 13
8.1 HYDRA mutexes ...... 72
8.2 HYDRA processor counts ...... 72
14.1 Cross Prediction BEST Models, *denotes complete non-MCS groups ...... 135
A.1 Comprehensive Model Fitting Errors (MWARE) ...... 152
A.2 Comprehensive Model Fitting Errors (MWARE) ...... 154
A.3 Comprehensive Model Fitting Errors (MWARE) ...... 157
B.4 Cross Prediction comprehensive relative error ...... 234
B.5 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 236
B.6 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 238
B.7 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 240
B.8 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 242
B.9 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 245
B.10 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 247
B.11 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 249
B.12 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 251
B.13 Cross Prediction per model relative error arch Core2-Arch to Core2-Arch ...... 254
B.14 Cross Prediction per model relative error arch Core2-Arch to i7-Arch ...... 256
B.15 Cross Prediction per model relative error arch i7-Arch to Core2-Arch ...... 258
B.16 Cross Prediction per model relative error arch i7-Arch to i7-Arch ...... 260
ACKNOWLEDGMENTS
I would like to thank my third PhD advisor, Amelia Regan, who has stood by my side and has acted with sterling merit, support, and credibility – magnitudes above her predecessors. To my advancement committee and especially to my PhD committee, Michael Dillencourt and Dennis Volper, who, together, have allowed me to move on and to close this book of my life.
Thanks to Lorenzo Valdevit for the early years of financial support and computer usage.
Thanks to MSC Software Corporation for the supplementary computational support needed to finalize this work in an expedient manner.
Thanks to Bill Fisher, Dennis Volper, and Quicksilver Software, Inc. for access to computing hardware and many healthy exchanges over the years and years to come.
CURRICULUM VITAE
Scott W Godfrey
EDUCATION
Doctor of Philosophy in Computer Science, 2016, University of California, Irvine (Irvine, California)
Master of Science in Computer Science, 2014, University of California, Irvine (Irvine, California)
Master of Science in Aerospace and Mechanical Engineering, 2010, University of California, Irvine (Irvine, California)
Bachelor of Science in Aerospace and Mechanical Engineering, 2009, University of California, Irvine (Irvine, California)
Associate of Science in Mathematics, 2007, Orange Coast College (Costa Mesa, California)
RESEARCH EXPERIENCE
Graduate Student Researcher, 2010–2015, University of California, Irvine (Irvine, California)
Technology Transfer Intern (Intellectual Property), 2013–2014, University of California, Irvine, Office of Technology Alliances (Irvine, California)
TEACHING EXPERIENCE
Teaching Assistant, Reader, 2011–2016, University of California, Irvine (Irvine, California)
PROFESSIONAL EXPERIENCE
Software Performance Engineer, Parallel Architect, 2014–2016, MSC Software Corporation (Newport Beach, California)
Consulting Software Engineer, Parallel Architect, 2011–2013, HRL Laboratories, LLC (Malibu, California)
Senior Software Engineer, Senior Technical Lead, 1999–2014, Quicksilver Software, Inc. (Irvine, California)
REFEREED JOURNAL PUBLICATIONS
“Compressive Strength of Hollow Microlattices: Experimental Characterization, Modeling and Optimal Design,” Journal of Materials Research, 2013
“MEMS resonant load cells for micro-mechanical test frames: Feasibility study and optimal design,” Journal of Micromechanics and Microengineering, 2010
REFEREED CONFERENCE PUBLICATIONS
“A novel modeling platform for characterization and optimal design of micro-architected materials,” 2012 AIAA Structural Dynamics and Materials Conference, Apr 2012
ABSTRACT OF THE DISSERTATION
Cross-System Runtime Prediction of Parallel Applications on Multi-Core Processors
By
Scott W Godfrey
Doctor of Philosophy in Computer Science
University of California, Irvine, 2016
Professor Amelia Regan, Chair
Prediction of the performance of parallel applications is a concept useful in several domains of software operation. In the commercial world, it’s often useful to be able to anticipate how an application will perform on a customer’s machine with a minimal burden to the user. In the same spirit, it’s in the best interest of a user/consumer of computational software to operate it as efficiently as possible. In the super-computing/distributed computing world, being able to anticipate the performance of an application on a set of compute-nodes allows one to more optimally select the set of nodes to execute on. In a large-scale shared computing environment where parallel computational jobs are assigned resources and scheduled for execution, being able to do so optimally can improve overall throughput by decreasing contention. In all cases, being able to anticipate the ideal degree of parallelism to invoke during execution (and to have reasonable expectations for what can be achieved) will lead to more optimal use of all resources involved. For any of this to be possible, a good model (or models) is required which can not only capture an application’s performance on one machine but also predict its behavior on another.
Here, we present a large family of performance models composed of discrete parts, all as combinatoric variations on Amdahl’s Law. We establish a protocol involving thorough benchmarking of the application on a known system. A protocol is also established for the collection of meaningful machine architecture and performance information for the known and target machines. With the resulting high quality models and a single execution of the application on the target system we are able to closely predict its parallel behavior.
We posit that computational applications in need of this kind of treatment are sufficiently sophisticated and, especially in the case of commercial applications, most likely black boxes; we therefore avoid any need to analyze the applications statically and rely expressly on the parallel runtimes of individual executions. The protocols and methods can be implemented by any skilled developer on conceivably any parallel platform without the need for specialized API’s, hardware diagnostic support, or any manner of reverse-engineering of the applications of interest.
Chapter 1
Introduction
The availability and ubiquity of modern parallel processors has led to parallel implementa- tions for many applications. Many applications which are now subject to parallel processing on desktop multi-core systems bear little resemblance to the kinds of applications tradition- ally run on large-scale supercomputers in either form or function.
The need, or rather the opportunity, to schedule parallel tasks or operate parallel applications arises on many occasions. The problem of scheduling parallel tasks presents itself in a manifold of variations on the same general principle: having some quantity of independent tasks to perform and some quantity of resources which allow for multiplexed operation. These tasks may be packed inside an application, all under the hood, or else realized individually.
Scheduling can be performed with a full spectrum of knowledge about the applications and architectures, ranging from blind execution to having perfect knowledge. Modern operating systems have their own internal task schedulers which are preemptive but are entirely blind. Adding knowledge about applications and architectures into the scheduling equation can improve performance. Therefore, many applications embed their own specific scheduler on top of existing infrastructures [12]. However, obtaining the appropriate information and organizing it in an actionable manner can be difficult. Analytical models, which can view a program or the underlying system at a higher level of abstraction than measurement or simulation techniques, can therefore play a complementary role to those methods [1].
The simplified performance models traditionally implemented in most task schedulers are too simple, even simpler than the simplest model we present here, Amdahl’s Law, to make reasonable predictions of runtime and effective use of resources.
While other notions of performance like scalability and speedup may seem interesting, they have little tangible meaning for real-world work and they typically rely on runtime analysis derivatives. “Execution time is by far the most important measure of interest. Therefore performance prediction should be in terms of execution.” [95]. With rare exception, no matter what we’re doing or how we are going about it, we’re always working to minimize runtime in some way even if there are secondary goals to balance.
Here, we present a family of performance models with increasing complexity which are developed based on Amdahl’s Law. Variables in the performance model are either fully abstract quantities (generic parameters of a curve-fit), or quantities inferred as parameters or invariants of the application. Constants pertaining to qualities of the host machine are also used. Integral variables may be specified to select a best-fit value from an array of possibilities. For example, memory speed may be read/write, random/sequential, and pertain to L1/L2/L3/main memory.
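As a concrete illustration of these three kinds of quantities, a minimal sketch follows (hypothetical names and values only, not those of the actual implementation described in later chapters):

from dataclasses import dataclass, field

# Hypothetical sketch of the three kinds of model quantities described above.
@dataclass
class ModelParameters:
    # Fully abstract quantities: generic curve-fit parameters.
    abstract: dict = field(default_factory=lambda: {"a0": 1.0, "a1": 0.0})
    # A quantity inferred as an invariant of the application,
    # e.g. the sequential fraction s of Amdahl's Law.
    sequential_fraction: float = 0.1
    # Constants pertaining to the host machine (measured, not fitted).
    machine: dict = field(default_factory=lambda: {"mem_bw_GBps": 12.0, "cores": 4})
    # Integral variable: index selecting which measured memory speed applies
    # (e.g. 0..3 for L1/L2/L3/main memory, read or write, random or sequential).
    mem_level: int = 3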
These performance models are built for the purpose of cross-predicting parallel application performance over a range of machines with multi-core processors of different architectures. Inferences are made about the underlying applications through fitting the models to experimental data obtained from a variety of machines and architectures (figure 1.1, figure 1.2). It is well known that the characterization of a parallel machine is more complex than a uniprocessor, because of the interaction among processors [74]. Our models are appropriate for single machines which, of course, are the building blocks of multi-computers. Methods to predict scalability accurately are necessary in order to improve throughput and overall efficiency on large-scale machines [9].
Figure 1.1: Explained in detail later, our predictive system relies on application benchmark data, machine-specific information and benchmarks, and varying combinations of model parts to generate predictive models. An optimizer provides solutions for fitting the particular predictive models to the benchmark data, and the resulting fitted models are used for performing cross-predictions between machines.
The primary contributions of our work are that, from our family of models, we can:
1: Infer from local benchmarks the parametrically averaged structure of the application being evaluated
2: Predict the runtime of said application on a target machine of similar architecture and, arising from high quality runtime predictions,
3: Determine the ideal processor count for operating a parallel application on a target machine.
Figure 1.2: The overall system is simple to understand schematically. Parallel software applications and parallel computer systems are fed into a benchmarking system which generates benchmark data for the applications operating on the machines and also benchmark data which specifically characterizes the machines, providing a basis for inter-relationships. All benchmark data is fed into a statistical analysis and optimization system with the modular performance models. This system performs curve-fitting of the models to the data and also evaluates cross-predictions between machines. All outcomes are statistically evaluated and ranked according to least mean error to output a set of validated models.
Systems considered here are specifically shared-memory SMP’s with applications architected with either explicit threads or task-parallel API’s like OpenMP [94]. Network communication and message passing interfaces like MPI[46] are not considered for models at this level.
Users, be they end-users or the actual developers, generally know little about the performance profile of the applications they run and also the machines they operate on. Consequently, they often cannot accurately predict the best number of processors to use, leading to application slowdown and reduced throughput. Knowing how best to operate an application is difficult. The ideal number of processors to use varies with both the application and the specific machine under consideration, and sometimes even the data being evaluated. Predicting the parallel efficiency of applications without first executing them is an enormous challenge [9].
It is, of course, necessary to collect performance data on an application and architectural performance information about the machines it will operate on. Without machine-specific information, cross-prediction will be infeasible and, at best, a matter of luck. Ideally, we will be able to minimize the required information. Rosas and Barnes (2011) both try ‘small’ core counts on what would otherwise be large machines which may not afford extensive testing or ready availability of a large number of cores. They report that low core count runs provide enough information on the fundamental behavior of parallel code and that several program executions on a small subset of the processors are all that is necessary to predict execution time on larger numbers of processors [9]. However, given the complexity of modern processors, we wonder if this approach yields sufficient information for a high quality prediction.
Chapter 2
Review of Related Literature
Because the topic presented in this dissertation is multi-faceted, we choose to present our literature review inline, with discussion in the relevant chapters. We discuss modern parallel processors in Chapter 3, parallel performance models in Chapter 4, operating system in Chapter 5, parallel applications in Chapter 6, performance modeling in Chapter 8, error analysis and metrics in Chapter 11, and optimization in Chapter 12.
Chapter 3
Modern Parallel Hardware Technology
3.1 Flynn’s Taxonomy
Under Flynn’s Taxonomy, the applications we consider fall into the task-parallel (the fork-join model) and multiple-instruction/multiple-data (MIMD) classifications. We also consider multiple-program/multiple-data (MPMD) scenarios operating on separate cores of a common processor. There may be internal, local aspects of any application which may be compiled under the single-instruction/multiple-data (SIMD) data-parallel paradigm, but this is a small part of the applications of interest to our research. Applications compiled with architecture-specific SIMD instruction targeting are necessarily restricted in the machines and architectures they can operate on, and so we don’t break this out as a separate detail; it is a low-level implementation matter. Some experimental applications in this project are compiled in this way and are correspondingly restricted.
Processors of interest to us are those which are general computation main system processors with small numbers of cores on uniform memory-access architectures (UMA), which may involve one or more separate processors such as canonical symmetric multi-processors (SMP). Non-uniform memory access architectures (NUMA) with multiple processor sockets internally networked with separate memory attached to each socket are outside the scope of this work.
3.2 Development of CMP Multi-Core
In 2006, multi-core processors were widely adopted with the advent of Intel Core-2 chips. Currently a broad variety of chips are available from Intel (see table 3.1 for examples), AMD (see table 3.2 for examples), Samsung, Qualcomm, etc., and it’s nearly impossible to acquire a computer without hardware parallel processing through normal consumer channels. Earlier, symmetric multi-processor (SMP) chips and systems existed where every processor was physically identical, separate, and mostly independent of all the others. SMP systems were generally only available through a small number of vendors targeting specific high-performance markets, as operating system and application support was also quite unusual. Modern multi-core chips are characterized by more shared on-chip resources, particularly the multi-level memory cache hierarchy. Shared resources have led to lower and less predictable performance than with older architectures; the cost and complexity is dramatically reduced, however. Architectural designs vary in core count, cache hierarchy size, cache hierarchy depth, and cache coherence and eviction policies.
Since 2006, the number of processors and hardware threads has increased, the processor has absorbed the memory controller (Northbridge chip), and the cache hierarchy has gotten larger and deeper with L3 cache becoming standard and L4 cache coming into the market recently. These advances boost performance, but the gap between memory bandwidth and processor speed (popularly referred to by various names, including “The Memory Wall”) is generally regarded as the single largest factor limiting scalability in parallel applications running on modern processors.
Memory Wall: Due to shared resources in the memory hierarchy, multi-core applications tend to be limited by off-chip bandwidth. [40].
Bandwidth to main memory: While main memory access is supported with a multi-level caching system, it is regarded as the chief limiting factor to performance on modern computers. Other shared resources throughout the system do not typically have as extreme adverse effects on high performance computational systems, but any contentious aspect leads to performance degradation in a parallel computing environment.
Shared memory bandwidth has a negative effect on concurrently executed applications as each application makes unique demands on the memory system. As the operating system schedules alternate execution of applications, large portions of the cache hierarchy must be disrupted to accommodate new tasks and shared between them.
Shared memory bandwidth penalizes parallel applications due to the progressive starvation of increasing parallelism. Parallel applications already suffer from asymptotic speedup due to fractional sequentialization, as demonstrated by Amdahl’s Law (see, for example, [53]). In general, researchers present a consistent message about the state of technology today.
For example, [40] and [93], both operating with large-scale cluster supercomputers composed of multi-core nodes, argue that the Memory Wall is a reality. While Simon [93] specifically assesses application performance on several large-scale computers, Diamond [40] identifies that almost every aspect of the memory hierarchy being shared, be it L3 capacity or off-chip main memory bandwidth, has negative implications for performance. Diamond finds that making full use of a typical quad-core processor is a difficult and rare event and expresses concern for practical utilization of larger-scale chips promised for the near future (circa 2011). To this day, quad-core processors (which, with the addition of SMT hardware threads, become 8-thread machines) are probably the most common performance processors on the market, with few forays into conventional processors with many more cores.
[18], [52], and [109] discuss the fact that the memory wall is real and contention is a huge issue. Gupta [52] performs some very nice experiments physically altering the structure of their computer in order to evaluate two different memory bandwidths for a range of applications and shows higher scalability distinctly tied to increased available bandwidth.
Williams [109] works with floating-point intensive applications and works to not only charac- terize their performance, but to improve it as well. They use as a benchmark the theoretical limit of floating point performance on the particular machine and determine the amount of bandwidth necessary to achieve such. The limiting factor is, of course, the actual bandwidth available on the system. Each application variation is evaluated for its floating point per- formance and bandwidth consumption to establish where in the world of real and feasible performance it lies. Inspired by Williams [109], Chatzopoulos [18] works with statically and dynamically obtained application data in an attempt to determine on-chip and off-chip de- mand to estimate scalability. They find the ratio of on- to off-chip demand to be essentially meaningful.
Interestingly, Sun [96] argues that the memory wall is real, but not such a big issue, in theory. Through some manipulation of Amdahl’s and Gustafson’s Laws and the utilization of some assumptions not valid for current designs, he asserts that whole system architecture needs to be addressed, focused primarily on the memory hierarchy, in order for multi-core performance to improve.
3.3 Intel Core2 Architecture
The first commercial release of 64-bit processors from Intel was in the Core2 product line which arrived in 2006. Multi-core Core2 processors were either from the Core2-Duo [26] or Core2-Quad [34] product lines with either two or four cores in the package. The memory hierarchy here is quite simple, with an L1 cache exclusive to each core and L2 cache shared between each pair of cores on the die. The processors were designed with two cores per die and one or two dies per processor package for the Duo and Quad configurations. No L3 cache was present on these chips. See figure 3.2.
3.4 Intel i7 Architecture
Following the Core2 series of processors, several branded product lines were introduced serving different markets: low-end, mainstream, and high-end/business. These were the i3, i5, and i7 [33] series processors, correspondingly, and were distinctly different from the Xeon [25] series of server processors targeting high performance workstation and server markets. i7 processors came to market in 2008.
With a multi-level cache hierarchy, different levels of the cache are shared by different processor cores. In the case of Intel i7 processors, a single L3 cache is shared by every processor core within the processor package (typically four or six) and each processor core contains an L2 and L1 cache. The L1 cache is then shared between its two logical cores or hardware (Hyper-)threads (two per core with current designs), also known as simultaneous multi-threading (SMT) [Hyper-threading is an Intel proprietary technology]. Each L1 cache is split into two equal parts to serve for data and instructions separately. See figure 3.1.
3.5 Hyperthreading/Hardware Threads
Hardware threads inside a core may not always operate concurrently and instead operate alternately and opportunistically based on the availability of dependent information in the cache hierarchy. While one thread is waiting for a fetch from memory, the other may compute so long as it has the required resources. There are more and less optimal placements for threads on the processor, but the operating system, despite knowing the structure of the processors, often is unable to capitalize on the structure. Noteworthy is that the hardware threads themselves are not different from each other except with regards to the way they are paired and their opportunity to co-execute. They are physically indistinct and are essentially separate execution contexts with substantial portions of the core shared between them. Contrast the relationships of hardware threads to L1 cache in figure 3.1 versus figures 3.2, 3.3, and 3.4.
Realistically, because of opportunistic utilization of resources, processors do not operate symmetrically, despite their physical geometric symmetry. In current architectures, individual hardware threads are identical. For purposes of notation here, where a core has two hardware threads, if only one is active with a software thread it will be considered to be a ‘complete’ core (n_c). When two hardware threads are active in the same core with software threads, one thread will be considered ‘complete’ and the other ‘virtual’ (n_v), with the pair ‘shared’ (n_s). In principle, these are either scheduled or yielding to the other, flipping the notion of complete and virtual between the two. The system performance is to some degree an average of the performance of the two threads. Scogland et al. emphasize that processors are physically symmetric in hardware, and circumstances of execution then lead to substantial asymmetry in behavior [91].
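As an illustration of this notation only, the following small helper (hypothetical, assuming two hardware threads per physical core and counting exactly as stated above) tallies the three quantities for a given placement:

def count_thread_kinds(threads_per_core):
    """Count 'complete' (n_c), 'virtual' (n_v), and 'shared' (n_s) entries for a
    placement, given the number of active software threads (0, 1, or 2) on each
    physical core. A lone thread on a core counts as complete; a doubly occupied
    core contributes one complete thread, one virtual thread, and one shared pair."""
    n_c = n_v = n_s = 0
    for occupancy in threads_per_core:
        if occupancy == 1:
            n_c += 1
        elif occupancy == 2:
            n_c += 1
            n_v += 1
            n_s += 1
    return n_c, n_v, n_s

# Example: 6 software threads packed onto a 4-core, 8-hardware-thread processor.
print(count_thread_kinds([2, 2, 1, 1]))  # (4, 2, 2)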
Computer Name | Intel | Speed | Cores | HT’s | P L1 | P L2 | P L3 | #L1(d+i), 2, 3
CoyoteTango | Core 2 Quad Q8200 [34] | 2.33GHz | 4 | 4 | 256kB | 4MB | NA | 8/2/0
RomeoBlue | Core 2 Duo E7400 [26] | 2.80GHz | 2 | 2 | 128kB | 3MB | NA | 4/1/0
Ares | Core i7-860 [33] | 2.80GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
Styx | Core i7-4820K [32] | 3.70GHz | 4 | 8 | 256kB | 1MB | 10MB | 8/4/1
ChernoAlpha | Core i7-4720HQ [31] | 2.60GHz | 4 | 8 | 256kB | 1MB | 6MB | 8/4/1
Xerxes | Core i7-3930K [29] | 3.20GHz | 6 | 12 | 384kB | 1.5MB | 12MB | 12/6/1
L09473-1 | Core i7-4700MQ [30] | 2.40GHz | 4 | 8 | 256kB | 1MB | 6MB | 8/4/1
QSI-PC | Core i7-2700K [27] | 3.50GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
L09473-2 | Core i7-2820QM [28] | 2.30GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
CrimsonTyphoon | Core i7-860 [33] | 2.80GHz | 4 | 8 | 256kB | 1MB | 8MB | 8/4/1
YourMom | Xeon E5335 [25] X2 | 2.00GHz | 8 | 8 | 512kB | 16MB | NA | 16/4/0
Table 3.1: Some Intel-based architectures used for experiments and comparative consideration.
Computer Name | AMD | Speed | Cores | HT’s | P L1 | P L2 | P L3 | #L1(d+i), 2, 3
StrikerEureka | FX-8350 [24] | 4.00GHz | 4 | 8 | 640kB [128,512] | 16MB | 8MB | 16/8/1
MCMA | Opteron 6134 X4 | 2.30GHz | 32 | 32 | 4MB | 16MB | 80MB | 64/32/8
Table 3.2: AMD-based architectures for comparative consideration.
Figure 3.1: Typical structure of an Intel i7 processor. Four cores (sometimes more) including L1 and L2 cache, each with two hardware threads sharing L1 cache. All cores share a common L3 cache.
Figure 3.2: Intel Core 2 architecture predates the i7. Each core maintains its own L1 cache and pairs of cores share L2 with no L3 at all. The Core 2 Quad consists of two identical processing units in one CPU package.
Figure 3.3: The AMD FX series of processors boasts a memory hierarchy with substantially less contention. L1 and L2 caches are exclusive to each core while L3 is shared.
Figure 3.4: The Intel Xeon E5335 is more reminiscent (and contemporary to) the Core 2 architectures. L1 caches are each dedicated to independent cores with L2 shared between core pairs. L3 is not present.
Chapter 4
Parallel Performance Models
4.1 Models of Parallel Computation
(Practical Domain)
4.1.1 Amdahl’s Law
Amdahl’s Law [4] describes the most basic concept of parallelism by taking a fixed application and distributing the computation portion of it (the parallelizable part, p) over n separate resources (processors). It is generally expressed as:
T_p = T_s (s + p/n),    s + p = 1,   p ∈ [0, 1]
where T_s is the sequential runtime of an application and T_p is the parallel runtime when operated on n processors. As s increases, the opportunity for parallel performance diminishes. Amdahl’s Law deals with what are considered to be ‘fixed-size’ problems, which are expected to get faster with the assignment of more computational resources. Fixed-size problems are prevalent in domains where the size of the computation is either limited by the size of the machines available (and machines with more processors/cores do not necessarily accommodate proportionately more system memory) or are already solved/solvable to a degree which does not demand higher resolution or more work.
As discussed in a recent survey paper by Al-Babtain et al., Amdahl’s Law finds itself extended in a variety of ways to consider different multi-core architectures [3]. These extensions are oriented more towards making hypothetical architectural decisions, rather than working in the software domain and understanding performance.
In application, Shi found in 1996 that the sequential and parallel fractions are not practically obtainable and that they generally neglect further overheads involved in parallelization [92], which, at this point in history, may substantially include behavior of the particular computer architecture and not just software mechanisms.
The types of applications we consider fall under the ‘fixed-size’ problem domain and we use Amdahl’s Law as the starting point for our work.
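For illustration, a minimal sketch of Amdahl’s Law as used here, evaluating the predicted parallel runtime and speedup of a fixed-size problem (the numbers are illustrative only):

def amdahl_runtime(t_seq, p, n):
    """Predicted runtime of a fixed-size problem with parallelizable fraction p
    (s = 1 - p) when run on n processors, per Amdahl's Law: T_p = T_s (s + p/n)."""
    s = 1.0 - p
    return t_seq * (s + p / n)

t_seq, p = 100.0, 0.95          # illustrative: 100 s sequential runtime, 95% parallel
for n in (1, 2, 4, 8, 16):
    t_n = amdahl_runtime(t_seq, p, n)
    print(f"n={n:2d}  T_p={t_n:6.2f} s  speedup={t_seq / t_n:5.2f}")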
4.1.2 Gustafson’s Law
Gustafson’s Law [53] describes a concept of parallelism as follows. As additional computa- tional resources are applied to an application, the size of the parallel computational portion increases proportionately. Therefore, a problem is better solved (higher resolution, smaller error, etc.), but in essentially the same time. These types of problems are referred to as ‘fixed-time’ problems and are prevalent in large-scale computational environments such as weather prediction which are under practical constraints for the availability and utilization of their outcomes. Through the progressive improvement of computer hardware, a particular problem may in practice transition from being a fixed-time to a fixed-size problem.
17 Substantial controversy seems to occur in the literature between these two laws depending on what type of parallel computation is used. Arguments are posed as to which is right while the applicability of these laws depends entirely on the details of the applications at hand.
4.1.3 Unification of Amdahl’s and Gustafson’s Laws
Shi argues that the two laws are essentially the same [92]. Juurlink and Meenderinck [60] attempt to compromise between Amdahl and Gustafson with an enhancement for asymmetric and dynamic multi-cores. Hill and Marty [56] extend Amdahl’s concept with some basic models for more sophisticated multi-core designs. Gunther’s Universal Scalability Law (USL) [51] was developed to unify the two models.
4.1.4 Parallel Speedup
The concept of parallel speedup exists as an evaluation of how much faster an application becomes with the utilization of additional computational resources. Generally the expression Speedup = T(1)/T(n) is relied upon, with T(n) expressing the runtime of an application on n resources, but not without general controversy over the T(1) term. “Since parallel implementations may introduce computations that are unnecessary with respect to serial implementations, T(1) is the time required to execute the task on a single processor using the ‘best’ serial implementation.” [13] Whether T(1) is a sequential application or else a parallel application operated on one thread is highly circumstantial. Serial implementations may not generally exist for general parallel applications for any number of reasons, including budget, time, and lack of further utility. Herein, T(1) means a parallel application operated on a single thread or processor, physically the same executable as used for T(n).
The expression for speedup derived from Amdahl’s Law is Speedup = 1/(s + p/n). The result is diminishing returns through added parallelism and assumes a fixed problem size. Gustafson’s Law resolves to Scaled Speedup = N + (1 − N)s, where N corresponds to both the processor count and the problem size, which vary together. Counter to Amdahl’s speedup, continual improvement is achieved through the corresponding increase of problem size.
It’s worth noting that while speedup is an interesting metric, information is lost. “The continued reliance on speedup as the primary measure of performance may be attributed to the use of execution time as the unit of measure. Using time as a measure of work has several drawbacks. First, it varies with the computer used. Second, it is simply a statistic which does not provide any insight about the algorithm [implementation].” [13]. By predicting runtime we can always use that to generate scalability. If we were to focus only on scalability prediction, the real-world connection, i.e. how long will it actually run, would be missed.
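For contrast, a small sketch evaluating both speedup expressions for an illustrative sequential fraction (the values are arbitrary, chosen only to show the diverging trends):

def amdahl_speedup(s, n):
    """Fixed-size speedup: 1 / (s + (1 - s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_scaled_speedup(s, n):
    """Fixed-time (scaled) speedup: n + (1 - n) * s."""
    return n + (1 - n) * s

s = 0.05  # illustrative sequential fraction
for n in (2, 4, 8, 16, 64):
    print(f"n={n:3d}  Amdahl={amdahl_speedup(s, n):6.2f}  "
          f"Gustafson={gustafson_scaled_speedup(s, n):7.2f}")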
4.1.5 Roofline
Prinslow emphasizes the notion of computational intensity as a major point of interest, discussing how program blocks may be either compute-bound or memory-bound [83]. Williams, Watterman, and Patterson develop the Roofline model motivated towards the diagnosis and improvement of parallel applications [109]. Roofline formalizes this into a performance-analysis framework for optimizing implementations. Roofline relies on measures of memory bandwidth and also computational power (generally giga-flops [GFLOPS]) and their ratio: ‘Operational Intensity’. Roofline allows one to characterize an application as being clearly memory- or compute-bound. Nugteren and Corporaal bring Roofline into the theoretical domain and extend it for the analysis of algorithms [79].
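A minimal sketch of the Roofline bound itself, assuming only the two machine measurements the model requires; the peak figures below are illustrative, not measured values:

def roofline_attainable_gflops(operational_intensity, peak_gflops, peak_bw_gb_s):
    """Attainable performance under the Roofline model:
    min(peak compute, operational intensity x memory bandwidth)."""
    return min(peak_gflops, operational_intensity * peak_bw_gb_s)

peak_gflops, peak_bw = 100.0, 25.0    # illustrative machine measurements
for oi in (0.25, 1.0, 4.0, 16.0):     # FLOPs per byte moved to/from DRAM
    bound = roofline_attainable_gflops(oi, peak_gflops, peak_bw)
    kind = "memory-bound" if bound < peak_gflops else "compute-bound"
    print(f"OI={oi:5.2f} flop/byte  attainable={bound:6.1f} GFLOPS  ({kind})")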
Mega-flops (MFLOPS) [59], millions of floating-point operations per second, has been a very traditional measure for the performance capabilities of computational hardware and a reasonable target for any piece of software to achieve on such a platform. This measure seems to be less often reported but does appear in recent literature in comparisons of architecture performance [11]. The meaningfulness of such a measure is less relevant when the performance of tasks which are specifically non-numerical, or even transitioning away from floating-point, is at hand. It’s also one of the easiest measures to abuse when making performance claims [7]. Frigo and Johnson measure runtime but then express their results in MFLOPS for a wide range of software (where the internal operations are unknown) with the caveat that “The MFLOPS measure should, thus, be viewed as a convenient scaling factor rather than as an absolute indicator of CPU performance” [47].
MIPS (Millions of Instructions Per Second) is another hardware-centric performance metric often encountered. It’s simply a measure of how many machine instructions are processed per second. A drawback of Roofline is that it specifically relies on some performance quality like MIPS or FLOPS. Frigo and Johnson observe that “...there is no longer any clear connection between operation counts and speed, thanks to the complexity of modern computers.”
4.2 Algorithmic Computational Models
(Theoretical Domain)
Algorithms exist simply as concepts and have no tangible performance measure, only theoretical. Theoretical performance is expressed asymptotically, in big ‘O’ notation, as a notion of time relative to the size of the input while the input approaches infinity. Big ‘O’ notation drops all but the most significant terms in the expression and also drops all constant coefficients, yielding typically quite simple expressions. To measure actual performance of an algorithm it must be implemented in a language and operated on an actual computer. We refer to Schatzman: “...we should note that computer languages are neither fast nor slow − only implementations can truly be associated with speeds” [89].
The Random-Access Machine (RAM) [42] is a model for analyzing algorithms on an ideal sequential machine. Parallel Random-Access Machine (PRAM) [43] is an extension to RAM for parallel algorithms on ideal shared-memory machines. The LogP machine model [38] is also a parallel machine model for distributed systems. Bulk Synchronous Parallel (BSP) [102] is another parallel model including more substantial communication concepts for distributed systems. These models are all focused on algorithm analysis on abstract machines. With this focus, they are language independent and know nothing of actual real technologies but rely on their parametric shapes.
Other more advanced models of parallel computation exist which fall into the theoretical domain of system modeling and algorithm performance optimization. Valiant invents the Multi-BSP model, derived from the BSP model, for aiding in the design of ‘portable algorithms’ which may be simply adjusted in a predictable way at compilation or implementation time for ‘optimality’ [103]. Here, both the algorithms and hardware are necessarily white-box and grey-box entities respectively (contents are well or sufficiently known), so predictive capabilities are both preemptive and restricted to situations with explicit knowledge. No attempt is made at system identification, so the model lacks retroaction. Variations in implementation are also outside the scope here.
Where time complexity (the ‘big-O’ notation and its relatives) is used for assessing algorithms, Chellappa et al. describe the serious problems that exist when trying to use any kind of algorithm-oriented model for any kind of actual prediction on real machines: “The O-notation neglects constants and lower order terms; for example, O(n^3 + 100n^2) = O(5n^3). Hence it is only suited to describe the performance trend but not the actual performance itself. Further, it makes a statement only about the asymptotic behavior, i.e. the behavior as n goes to infinity. Thus it is in principle possible that an O(n^3) algorithm performs better than an O(n^2) algorithm for all practically relevant input sizes n.” [19] n, of course, is a very finite quantity due to the structure of real hardware. Further, they observe two orders of magnitude difference in runtime over four different parallel implementations of matrix multiply, each requiring exactly 2n^3 operations, and conclude that correlating actual runtime to time complexity is unlikely. [19]
Singh notes that theoretical models of (parallel) computing like RAM and PRAM are useful for algorithmic analysis but not much else [94].
Regarding algorithmic analysis, Crovella concurs “Although analysis provides the concep- tual tools to predict parallel program performance, most previous work in analysis has not been directly used by programmers to predict performance of real applications for two rea- sons:”, “...alternative implementations of a program may often have the same asymptotic performance function, yet differ in important ways in the values of the associated constants”, “The work required in developing an analytic model can greatly outweigh the effort in sim- ply implementing and measuring a proposed alternative program structure”[37]. The same variation may be true with just the performance of compilers or interpreters for a given codebase.
4.3 Drawbacks in the Modern Era
Except for Roofline and its variants, which are specifically developed to address the constraining effect of memory bandwidth on modern processors, the other models suffer substantially in their applicability and portability in modeling the performance of applications on modern systems. Any method for modeling an algorithm will show wide variation when used for an actual application due to high-level language choice, data structure selection, library/API, and compiler/interpreter effects. With the further complexities of machine hardware, there is little opportunity for such simple models to provide accurate predictions for real applications.
Pre-multi-core, it was already known that neither Amdahl’s Law nor Gustafson’s Law was sufficient to identify invariants of the application such as the sequential (parallel) part [70]. Into the multi-core era, it has become quite clear that, because machines differ, speedups obtained on one machine may not translate to speedups on others [103]. See figure 4.1.
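The fitting exercise reported in Figure 4.1 below can be outlined as follows; this is a sketch with placeholder measurements (not the data of Figure 4.1), finding the parallel part p that minimizes the mean absolute error between Amdahl’s Law and normalized runtimes:

import numpy as np
from scipy.optimize import minimize_scalar

def amdahl_normalized(p, n):
    """Normalized runtime t(n)/t(1) predicted by Amdahl's Law."""
    return (1.0 - p) + p / n

def fit_parallel_fraction(n_values, t_norm):
    """Find p in [0, 1] minimizing the mean absolute error (MAE) against
    measured normalized runtimes, as in Figure 4.1."""
    def mae(p):
        return np.mean(np.abs(amdahl_normalized(p, n_values) - t_norm))
    result = minimize_scalar(mae, bounds=(0.0, 1.0), method="bounded")
    return result.x, result.fun

# Placeholder measurements (NOT the data of Figure 4.1): normalized runtimes
# t(n)/t(1) observed at several thread counts on one machine.
n_values = np.array([1, 2, 4, 8], dtype=float)
t_norm = np.array([1.00, 0.55, 0.33, 0.24])
p, err = fit_parallel_fraction(n_values, t_norm)
print(f"best-fit parallel fraction p = {p:.3f}, MAE = {err:.4f}")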
Figure 4.1: The normalized runtimes t_n = t(n)/t(1) for a parallel solid meshing application are presented for cross-system comparison. Two different machines of similar vintage but different architecture are operated: ‘Yourmom’ is a dual-socket SMP Intel Xeon E5335 [25] and ‘Ares’ a first-generation Intel i7-860 [15][33]. Ares doubles for two platforms with Hyperthreading both enabled (4 cores, 8 hardware-threads) and disabled (4 cores, 4 hardware-threads). The theoretical performance curve of a perfectly parallel application (no sequential part) is also presented. With no variation in either the application or its dataset, nontrivial differences in the actual performance on real hardware are revealed. Where the core concept of Amdahl’s Law, t(n) = t(1)(s + p/n), is describing the invariant proportion of the software which is either sequential or parallel, this demonstrates that Amdahl’s Law alone is not predictive of software performance on modern hardware, nor can it identify the invariants of the implementation. Using the runtime data for these three platforms and minimizing the mean absolute error (MAE) in finding the invariant parallel part p ∈ [0, 1] across that data, we obtain the following results:
Ares (HT off): p = 0.768004, MAE = 0.0022506
Ares (HT on): p = 0.728, MAE = 0.0107833
Yourmom: p = 0.984998, MAE = 0.000749495
If all results are evaluated simultaneously: Composite, p = 0.768004, MAE = 0.0705001.
Evaluation with Yourmom reveals nearly perfectly parallel behavior by both visual inspection and also the derived parallel portion p = 0.984998 : 98.48%. On the other hand, evaluation on either Ares variant suggests substantial deficiency in parallelism with p = 0.768004 : 76.80%. Clearly, the architectural differences between the machines are the cause of the performance variation and a more sophisticated expression is necessary.
4.3.1 Runtime Variability, Performance Uncertainty, and Noise
Variation in any quantity makes individual instances of that quantity more difficult to predict. Barnes observes that operational noise in a system causes random execution time variability which leads to reduced accuracy of scalability models [9]. If the magnitude of variation is sufficiently large, the quality of any discrete prediction will lose its meaning. Barnes notes that significant variability in runtime leads to overall difficulty and reduced accuracy for performance and scalability prediction [9].
Performance variation can arise from several causes. Operating system and kernel opera- tion during parallel computation perturbs program runtimes[78]. A multitude of different services in the operating system will have different perturbing effects with varying duration and magnitude. Even a well-written application in a controlled environment will realize perturbation.
Hardware effects in complicated systems are increasingly interdependent. Hennessy finds that performance-helpful aspects of modern processors may not be universal improvements. For example, microarchitectural features aimed at a specific program behavior could negatively impact some applications [55].
Even before we had access to multi-core machines it was clear that, while the main factors in the performance of parallel programs were the computational workload and the communication required between processes, contention for shared resources and the associated synchronization constructs caused further delays. Delays due to hardware disproportionately impact parallel rather than sequential code due to the specialized synchronization requirements that arise [1].
Program behavior may be unpredictable and unrepeatable due to memory behavior, execution skew across several processors, and measurements which disturb the actual performance [40]. Increasingly popular in parallel development are parallel computing libraries and API’s, often built directly into the compiler, which often do not apply basic concepts like process and thread affinity. While these libraries and API’s have broadened the accessibility of parallelism to many new applications and programmers, they can also introduce further causes for variation.
Necessarily, the variation in the ‘true’ runtimes leads to probabilistic models for prediction [58]. Scalar models are therefore limited and aren’t particularly valuable. To feed and generate a probabilistic model, multiple data points are required. Kramer and Ryan highlight the variability in execution time on distributed systems based on statistically significant performance evaluations on each system using a variety of applications [64].
When cross-prediction is the goal and a probabilistic model is not used, loose bounds seem to be the outcome. Mendes shows upper and lower bounds of factors of nearly 1/4 and 3 [74]. The following year Mendes and Reed show improved upper and lower bounds each consistently within factors of 1/2 and 2 from observed results [75]. In a practical sense, a performance model with no architectural information and no independent machine performance information will yield no viable avenue for cross-prediction. Baker et al. worked with performance optimization on large distributed systems and observed “A general solution is not possible without taking into account the specific target architecture.” [8]. One cannot simply curve-fit analytic expressions.
4.3.2 System Symmetry
Processor hardware is physically symmetric in the geometric sense, but not so much in practice [91]. Studies such as Sun and Chen [96] rely on this symmetry and therefore risk presenting overly optimistic estimates of performance. Most models lack any notion of contention or memory hierarchy, and assume infinite (or at least sufficient) memory bandwidth, implying, intrinsically, that applications are compute-bound. With modern systems and applications, this seems more often a safe assumption for sequential applications. Parallel applications are, of course, more demanding on the system.
4.3.3 Lack of Hierarchy
Flat memory approximation ignores the speed advantage of things cached close to the pro- cessor cores and the slowdown of things stored beyond main memory (disk, network, etc.) [54]. This assumption is also characteristic in the long history of parallel task scheduling in the literature. Performance effects relating to the memory hierarchy may lead to opportu- nities for super-linear speedups even if super-linear behavior is impossible on homogeneous (and symmetric) systems [94].
4.3.4 Continuous Functions
The performance models typically presented in the literature are expressly continuous func- tions. Implicit is the assumption that the parallel work available for computation is infinitely divisible. This assumption is also characteristic of parallel task scheduling in the literature. This assumption is more appropriate to very fine-grained parallelism, balanced work, and unperturbed execution but is not applicable to coarse-grained or task parallelism which are quite prevalent in contemporary systems. Where modern task-parallel API’s are used, even fine-grained loop parallelism is broken into larger task blocks to reduce the overhead of the parallel system.
Both of our implementation case studies involve computations on a voxel space from the beginning to end of an experimental process. Our first case is a time-evolving numerical integration and our second case is a transformation on that data. Both cases are spatially parallel with no opportunity for temporal parallelism [81]. Parallelism is applied, generally, across the x-axis in the 3D space, resulting in a very coarse-grained task parallelization as seen above, which is favorable, or at least minimally antagonistic, to OpenMP [101]. All computations are performed in a synchronous manner with no task communication, with the exception of some trade-offs in feature selection considered later. Tasks are assigned to computational threads dynamically, so communication with the internal parallel API task scheduler is implicit.
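For illustration only, a sketch of the same per-x-slab decomposition follows (the experimental applications themselves are compiled task-parallel codes using OpenMP; the function and sizes here are placeholders):

from concurrent.futures import ThreadPoolExecutor

def process_slab(x):
    """Stand-in for the real per-slab work: operate on the y-z plane at index x."""
    # ... numerical integration or surface-reconstruction work would go here ...
    return x

def run_timestep(nx, num_threads):
    """Coarse-grained task parallelism across the x-axis of the voxel space.
    Slabs are handed to worker threads dynamically, mirroring dynamic task
    assignment by the parallel runtime."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process_slab, range(nx)))

run_timestep(nx=256, num_threads=4)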
Chapter 5
Operating System Process/Thread Scheduler Effects
Multi-tasking is fully preemptive in modern operating systems. Application processes are allocated processor resources dynamically and scheduled slices of time according to some prioritization or fairness criteria (figures 5.1, 5.2, and 5.3 show the transient nature of this scheduling on Microsoft Windows 10). Processes and their threads may be constrained to particular resources with corresponding affinity masks. Generally, they are created unconstrained for greatest flexibility in scheduling [35].
The migration of threads across different processors causes performance problems as a result of processor architecture. Memory performance is already known to be a limiting factor for modern systems; the cache hierarchy exists to bridge the disparity in performance between processors and system memory. When threads are migrated from hardware thread to hardware thread (core to core), extra work must be performed to flush dirty information from the old core's cache (and possibly the new core's) and then refill the cache on the new core with data to serve the new process as well as the necessary machine instructions. Migration has varying effects depending on the structural relationship between cores. The cache flush may reach down to L1, L2, or L3 depending on the destination; refilling always occurs up to L1, of course. There is potential for reusing existing cached memory depending on the circumstances, but reuse of cached information can only occur for threads sharing a memory address space (i.e. in the same process).

Figure 5.1: CPU history shows steady 50% utilization with 4/8 threads running. Despite the steady usage history, all cores show almost random activity. Affinity is clearly absent.
Figure 5.2: CPU history shows steady 79% utilization (Windows 10 sometimes overstates this quantity) with 5/8 threads running. Activity is unsteady across all cores, but fuller than in Figure 5.1 with 4/8 active.
Figure 5.3: CPU history shows steady 100% utilization (Windows 10 sometimes overstates this quantity) with 7/8 threads running. Activity is unsteady, but still more regular than Figure 5.2 with 5/8 active.
5.1 Process Affinity
Process affinity describes the set of processors in a system on which a particular process may be executed. By setting an application's processor affinity, a process may be constrained to a subset of all available processors as it executes on a system. Affinity may be set externally by the OS, by some other agent acting through the OS, or internally by the application as a recommendation for a subset of the processors made available to it by the OS. If unconstrained, the OS is free to schedule and migrate the process amongst different processors (the OS always has ultimate authority on this matter regardless of how an application configures itself).
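As a minimal sketch only (assuming a Windows environment, consistent with the Windows 10 figures above, and an illustrative mask value), an application can volunteer a process-affinity recommendation as follows:

    #include <windows.h>
    #include <iostream>

    // Constrain the current process to logical processors 0-3. The mask is a
    // recommendation; the OS retains ultimate authority over placement.
    int main() {
        DWORD_PTR mask = 0x0F;  // bits 0-3 -> logical processors 0,1,2,3
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
            std::cerr << "SetProcessAffinityMask failed: " << GetLastError() << "\n";
        // ... run the workload under the constrained affinity ...
        return 0;
    }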
5.2 Thread Affinity
Thread affinity describes the set of processors in a system on which a particular thread of execution within a process may be scheduled to execute. Thread affinity is necessarily a subset of process affinity and, again, serves as a strong suggestion to the OS for where to place the thread, but the OS retains ultimate authority on the matter [36].
5.3 Thread Placement
While migrations result in negative performance effects, processor allocations (the placement of threads on processors), even without migrations, will exhibit irregular performance. Even with symmetric and identical cores there exists some degree of contention between them through concurrent access to shared levels of the memory hierarchy. The operating system may or may not consider processor architecture for performance effects, and the application developer may or may not consider it either. Consideration given by the application or the operating system may or may not complement the other, or even be suitable across a range of architectures. Best-case, worst-case, and probabilistic average-case performance can be modeled based on these conflicts.
5.4 Affinity in Practice
While configuring thread affinity should lead to fewer harmful thread migrations during scheduling, the system thread scheduler necessarily has less freedom in scheduling. There exists opportunity, increasing with system load or decreased load balance, for negative performance effects. Where communication or synchronization is required between application threads, spurious deferral of one thread leads to a chain reaction of deferral for dependent (waiting) threads and the temporary idling of those resources. The coarser the parallelism (the larger the tasks), the larger the impact. Fine-grained parallelism with smaller tasks will have shorter idle intervals at the expense of greater task-management overhead. Not all developers set (or know of) affinity, so threads are free for migration. Parallelism is often left fully managed by task-parallel APIs such as OpenMP [39], Intel Cilk Plus [67], Intel TBB [82], etc. Affinity is not universally set by the system.
5.5 Variation in Performance
The variety of possibilities in thread placement and migration, and the dynamic nature of migration alone, can lead to substantial variation in runtime (wall-clock time from start to end including system overhead) for a real application. Transient (and especially continuous) effects within the operating system will only contribute negatively. Any change of execution context will require some degree of flush and fill in the memory hierarchy.
Some substantial efforts have been made toward characterizing and modeling performance in the presence of noise on large-scale computer systems [9]. The degree and variety of transient events, and their effects on individual user-level workstations with off-the-shelf operating systems in real operating environments (home computers, academic and industrial workstations), are innumerable and exceed low-level noise. To counteract transience and also capture variation, our benchmarking relies on a statistically significant number of measurements, outlier rejection, and averaging. Transient events could include, but are not limited to, passive and active user activity (e-mail, Internet, application usage, etc.), system maintenance (updates, disk maintenance), system security (anti-virus, anti-malware, etc.), actual malware (but hopefully not), device drivers, scheduled and recurring events, etc. These events may or may not be detectable.
5.6 Experimental Thread Affinity Effects
This experiment is on a small numerical kernel (a toy) which accesses memory either randomly (upper curves) or sequentially (lower curves). The kernel is run with incrementally more parallel threads in one process, each performing the same work (runtime should remain constant with increased concurrency; this is a ‘fixed-time’ simulation in reference to Gustafson's Law). The memory footprint is intentionally small so as not to emphasize bandwidth limitations.
OpenMP is used for parallelism. Through thread affinity, threads are mapped to the hardware architecture: threads #0,#1,#2,#3,#4,#5,#6,#7 generally map to processors 0,2,4,6,1,3,5,7, where processor pairs 0,1 and 2,3 and 4,5 and 6,7 share L1/L2 cache on the Intel i7 architecture, with specific variations described (figures 5.6, 5.7, 5.8, and 5.9). Where architectures have more or fewer processors (Intel Core2), the pattern is of course compensated appropriately (figures 5.4 and 5.5). Thread affinity is considered in a variety of patterns (a sketch of one such mapping follows the pattern descriptions below):
Figure 5.4: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Duo architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.
‘OFF’ indicates normal system behavior where the OS is free to schedule threads on ANY core.
‘ON’ indicates a strict affinity of one thread per core filling the architecture: #0,#1,#2,#3,#4,#5,#6,#7 mapping to processors 0,2,4,6,1,3,5,7.
‘HALF’ indicates that the first four threads follow the strict mapping of ‘ON’ and the latter four are unconstrained as in ‘OFF’.
‘SMT’ indicates threads map as pairs: #0,#4 to 0,1; #1,#5 to 2,3; #2,#6 to 4,5; #3,#7 to 6,7.
‘HALF SMT’ indicates the first four threads map according to ‘SMT’ and the latter four are unconstrained as in ‘OFF’.
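For illustration only (this is not the dissertation's benchmarking tool), the sketch below shows how an ‘ON’-style mapping could be applied on Windows with OpenMP: software thread k is pinned to a single logical processor following the 0,2,4,6,1,3,5,7 pattern assumed for an 8-logical-processor i7.

    #include <windows.h>
    #include <omp.h>

    // Pin each OpenMP thread to one logical processor per the 'ON' pattern.
    // The mapping array is illustrative and architecture-specific.
    void pin_threads_on_pattern() {
        static const int map[8] = {0, 2, 4, 6, 1, 3, 5, 7};
        #pragma omp parallel num_threads(8)
        {
            int k = omp_get_thread_num();
            if (k < 8) {
                DWORD_PTR mask = static_cast<DWORD_PTR>(1) << map[k];
                SetThreadAffinityMask(GetCurrentThread(), mask);
            }
            // ... per-thread kernel work would follow here ...
        }
    }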
Figure 5.5: Runtime is plotted versus thread count for various thread affinity assignment experiments. Showing negligible variation in runtime, the Core 2 Quad architecture, lacking SMT technology, is essentially ambivalent to these thread affinity experiments.
Figure 5.6: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4820K shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.7: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4700MQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.8: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-4720HQ shows almost flat performance up to four threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Figure 5.9: Runtime is plotted versus thread count for various thread affinity assignment experiments. The Core i7-3930K shows almost flat (but slightly deteriorating) performance up to six threads when affinity is properly configured. Performance beyond that point shows marked degradation.
Architectures considered are either SMT-based (figures 5.6, 5.7, 5.8, and 5.9) or non-SMT (figures 5.4 and 5.5). Non-SMT machines are essentially ambivalent to the matter of thread affinity, as demonstrated: all plots overlay with no outstanding variation. SMT-based architectures all behave similarly to one another, but markedly differently than the non-SMT architectures. Of the five affinity designs considered, only three distinct plots generally arise, because several designs are logically equivalent at the OS scheduler level (minor variation leads to a fourth plot).
For affinity OFF, runtime continually increases with increasing thread count (resource contention and migration are implied). With all other options, runtime is flat for up to half of the maximum thread count (no contention is implied, which agrees with the code). As illustrated previously in figure 3.1, beyond half of the maximum thread count, cache contention for L1 and L2 physically becomes a factor between paired threads on SMT architectures (L3 is always in contention) and extends runtimes accordingly.
Chapter 6
Structure of a Parallel Application
6.1 Composition
As described by Amdahl's Law, a parallel application can be most simply considered as an application having some portion which is entirely sequential and the remainder which is parallelizable. Often, the parallel portion is portrayed as a single block of infinitely divisible work, an abstraction which is quite far from reality. The parallel portions are often composed of multiple parallel sections, each separated by sequential blocks (see figure 6.1). Each parallel section is composed of one or more tasks, and each task may itself be parallelized, or the tasks may be collectively processed simultaneously using the task-parallel model.
Beyond just the sequential and parallel portions, Sun and Ni describe applications composed of computational blocks, each with varying degrees of parallelism and demand within an application [97]. These may result from algorithmic limitations, implementation limitations, or data-size limitations (not enough work to spread around). With any block engaging less than all parallel resources, a reduction of the average parallelism or ‘degree of parallelism’ (DOP) [58] of the application results.
Figure 6.1: Parallel software may be structured with a wide variety of patterns. Frequently encountered patterns which negatively affect parallel performance involve sequential computation. Alternating parallel and sequential sections allow for setup/teardown/transition between parallel parts where parallel operation is inconvenient or even impossible to organize. Parallel sections may also include critical sections which moderate access to contended resources. Access is restricted while any thread is occupying the resource, requiring all other threads to wait until the resource is relinquished.
Threads of execution within a parallel section may require access to resources, software or hardware, which can be accessed by only one thread at a time (i.e. exclusively and therefore sequentially). While memory may be read simultaneously from multiple threads, simultaneous reading and writing is a problem. Any resource in memory subject to being updated or written is subject to this kind of constraint. Protection of mutually exclusive-use resources may be through a variety of constructs and mechanisms, with the net assemblies often regarded as ‘critical sections’ or ‘mutexes’ (see figures 6.1 and 6.2).
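As a minimal, illustrative sketch (not drawn from our case studies), the fragment below shows the canonical shape: independent parallel work followed by an update of a shared accumulator that is serialized behind a critical section.

    #include <omp.h>

    // The critical section makes the exclusive-use resource (the shared total)
    // explicit; every thread must wait its turn to perform the update.
    double sum_of_squares(const double* x, long n) {
        double total = 0.0;
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) {
            double contribution = x[i] * x[i];  // uncontended parallel work
            #pragma omp critical
            total += contribution;              // contended, sequentialized update
        }
        return total;
    }

In practice this particular pattern would be written with a reduction clause; the critical section is retained here only to make the contended resource visible.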
The range of problems available in computing is both wide and deep, as computers serve a multitude of needs in a multitude of environments and disciplines. Some problems are of a quality that can be solved in fixed time and scaled in complexity accordingly (Gustafson). Some problems are of fixed size and do not scale any further for any practical reason (Amdahl). Worlton converges similarly [111]. Some computations are fundamentally small kernels whose demand comes from continually streaming memory while operating on small, predictable local blocks (PDEs, FDI, etc.); at no time is the whole problem addressed in its entirety.
Figure 6.2: The existence of critical sections alone is not a performance liability. The performance impact of any particular critical section is a factor of the contention for its usage (concurrency) and the proportion of the parallel section it occupies, which is in turn a factor of both its physical size and the frequency of its usage in the parallel section. A large section called infrequently may be just as disruptive as a small section called constantly. All parallel portions may end up suspended, stacked front to back in a ‘convoy’, during periods of high demand. Catastrophically, the worst case leaves the entire parallel section sequentialized. Realistically, there also exists some overhead for the sequentialization which extends the parallel runtime beyond that of the equivalent sequential application.
Some computations involve operations on graphs, databases, and other similar data structures which are more distributed and less regular in memory. They suffer from poor locality of reference and a low degree of predictability, with essentially random access patterns. Matrix multiplication and FFT are well-known and important examples here and are more tunable to target particular architectures through decomposition [103]. “Data intensive irregular applications that rely on pointer based data structures, such as graphs, are harder to optimize due to their intrinsic usage of pointers to access data and to their less-predictable pattern of data access.” [44]. Not all of these applications can be decomposed, or else they are not decomposable with reasonable overhead in either time or space.
Parallel work tends to occur on several forms of data (see figure 6.3). Data may be configured in blocks which are carefully arranged for the application and operated on with multiple threads simultaneously; mathematical operations like FFT and matrix multiplication often fall into this category (see figure 6.5). Data also may be arbitrarily complicated structures which are not simplifiable, are scattered throughout memory, and have irregular access patterns; 3D rendering, mesh manipulation, databases, and irregular graph or unstructured-grid processing may be this way. Data may also come in streams occupying sequential and contiguous memory (see figure 6.4). Streams may represent spatial or temporal information; audio, video, and highly structured problems like finite-difference integration are of this type. We specifically address this type of work. Outputs of parallel work may, of course, take on similar structures, not necessarily congruent to the input, but also not necessarily entirely separate structures.

Figure 6.3: Different data structures used by different algorithms may have dramatically different representations in memory. The algorithms in operation will normally access just portions of the data structure, and they may be highly tuned for efficient operation (or not). Pictured is the boundary of all memory (purple) and the memory occupied by a data structure (green) with pointers (black). The moving window of L3 or L2 cache is abstracted as the cyan square. Memory accesses may be very stationary, as in the blocked/partitioned model. In the streaming case, accesses may be highly sequential with new data “sliding in” as older data “slides out”. For an irregular or graph-type structure, memory accesses may be irregular, erratic, and seemingly random. Of course, multiple data structures may be simultaneously accessed, particularly for input and output or multiple structures for either, further complicating the situation.
Problems of the streaming type may be less concerned with the size of a multi-level cache relative to the size of the problem, so long as the data can stream. In the latter case, where streaming cannot be capitalized upon, the size of the cache at different levels, and ultimately its speed, may be of more critical importance. An alternate approach described by Badur et al. (streaming is referred to as ‘naïve’) suggests that parallel operation on L2-sized blocks is a higher-performance option, but demonstrates it to be no more than 13% better on ‘large’ problems [6]. This might require very careful targeting due to the reality of private L1 caches and the prospect of sharing conflicts.

Figure 6.4: The wood-processing equivalent of streaming-type algorithms, the chipper-shredder has a fixed throat size (L2/L3 cache) and processes material from front to back (main memory), with ‘results’ expelled as rapidly as material is ingested. Only a small amount of material relative to the total workload is processed in the throat at one time, but it is completely processed. [110]

Figure 6.5: The CNC milling machine or router exists in contrast to the chipper-shredder. It has a large parameterized workspace (partitioned or tiled data) which is operated on until the product is finished and exchanged when complete; in a degenerate condition the workspace (like the cache) may hold the entire workload [99].
6.2 Decomposition
Regardless of the internal complexities of a parallel application, it is generally necessary to view it more simply and abstractly. Even a parallel application with a simple structure may have strong data dependencies leading to exceedingly difficult static analysis. Referring back to Amdahl's Law, the application can be viewed as aggregates of sequential and parallel operation, with the behavior of those aggregates statistically averaged by necessity. More internal detail can lead to a more detailed model (and hopefully more accurate predictions) of outward behaviors. The worst case is that no internal details are available and the application is a complete black box. This is especially pertinent to commercial off-the-shelf applications or systems which are so opaque or otherwise complicated (or which have strongly data-dependent behavior) that static analysis is utterly infeasible.
The parallel part of an application will most likely be a collection of Np discrete tasks to be performed. Here, we assume that the tasks are identical in nature, at least in the average case, or are so scheduled for work on threads internally through a load balancer. While multiple parallel and sequential portions are in general possible in the operation of an application, we treat them as bulk terms for the total of the sequential and parallel parts and assume a similar task load per parallel portion, at least on average. When Np is small compared to the number of processors n applied to the application, or n is not a numerical factor of Np, the apparent parallelism of the system or its parallel performance may suffer due to the remainder of work left over for various n, resulting in an apparent ‘load imbalance’. If Np is large, the parallel part may be approximated as continuously divisible, as the effect of the remainder on the outcome is small.
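As a hedged illustration of the remainder effect (assuming Np identical tasks of equal duration t and no other overheads, an idealization not asserted elsewhere in this work), the parallel time behaves in a stepwise rather than continuous fashion:

    $$T_{par}(n) \;\approx\; \left\lceil \frac{N_p}{n} \right\rceil t,
    \qquad \text{e.g. } N_p = 10,\ n = 4:\quad
    T_{par} \approx \lceil 10/4 \rceil\, t = 3t
    \;\text{ rather than the ideal } 10t/4 = 2.5t.$$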
6.3 Parallelization (from the literature)
Some variation in terminology exists across different disciplines within the parallel scheduling literature, so here we disambiguate. The scheduling literature was strongest during the 1980s and 1990s, focusing mostly on supercomputing applications and system sharing in large-scale computing environments, which were then the prevalent parallel computing environments. In terms of parallel scheduling, a ‘task’ (user-requested) is specifically an application requiring both resource allocation and also scheduling of time on a system for operation. ‘Resources’ are characteristically processors and, to a lesser extent, system memory.
A sequential task is any task which may strictly be run using only a single resource. A parallel task is simply a task which may be operated on concurrent resources [50]. The quantity of resources which may be utilized may be fixed (predetermined) or bounded (upper or lower). A parallel task with a fixed resource requirement is historically regarded as simply ‘Parallel’ (which is ambiguous contemporarily) or, specifically, ‘Rigid’ [12]. A parallel task which may be configured to operate on a fixed quantity of resources, generally at startup, is regarded as ‘Moldable’. Moldable tasks may accept an arbitrary number of processors or particular allocations conforming to particular constraints (e.g. powers or multiples of two or four). A ‘Malleable’ task is a task which may have its resources dynamically reassigned during operation [77]. Malleable tasks are substantially more complicated, requiring collaboration between the operating environment and the application: when executive control is asserted by the operating environment, an application must either be notified of such a change or otherwise make regular interrogations of the system to observe it, neither of which is common in contemporary practice. Generally, applications are free to operate on a system and assume dominance and priority over resources unless specifically configured otherwise a priori.
We concern ourselves explicitly with moldable parallel tasks which will accept arbitrary configurations. In general, the majority of real-world tasks (task-parallel parallelism) are of (or can be made to be of) this type. Where sufficient degrees of architectural control exist, we consider that a single moldable task may be decomposed internally into a chain of dependent tasks, some of which may be moldable parallel tasks and some strictly sequential.
Chapter 7
Parallel Benchmarking
We seek to avoid invasive measurement methodologies because of difficulties arising from the unbounded complexity of the applications we may be interested in analyzing: black-box code cannot be instrumented, the complexity of an application may be arbitrarily large, instrumentation itself is potentially unsuitable for potential end-users (“...the user is required to have statistical expertise that is not common to parallel programmers.” [73]), source code may be unavailable or inordinately complicated, and hardware instrumentation may be unavailable on certain processors. We therefore actively avoid performance counters and other mechanisms for quantifying ‘symptoms’ of system operation. Several styles of benchmarks are considered for use depending on the application being worked with.
In order to model the parallel structure of an existing application from runtime information, sufficient data is required to fulfill the basic requirements of the numerical model. A model consisting of n parameters requires at least n + 1 data values in order to fit. For a typical moldable parallel application, operable on [1 : n] processors, not more than n data points are available by simply clocking regular execution runs, which can, at best, help to characterize the parallel part. Through concurrent operation, not just parallel operation, more information becomes available and the sequential part also becomes characterizable.
‘HYDRA’ is a self-developed benchmarking tool built into our data collection and data analysis experimental applications Ponos and Pandora. HYDRA collects several benchmark types according to application structure (see figure 7.1):
HYDRA-1 (H1) benchmarks are timings of concurrent executions as separate sequential processes. Any application can be operated in this manner.
HYDRA-2 (H2) benchmarks, similar to H1, are timings of concurrent sequential execution threads inside a single process. Special application design is required for this; threadsafe DLLs are highly appropriate.
HYDRA-3 (H3) benchmarks are timings of individual parallel executions. Any moldable parallel application can be operated in this manner. Applications which automatically set their degree of parallelism (no user control) cannot be used to provide adequate data.
To characterize an application sufficiently, H3 and either H1 or H2 data is required. For this thesis we work with H1 and H3 specifically, describing the characteristics of H2 for completeness.
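HYDRA itself is not reproduced here; the following is only a minimal sketch of how an H1-style measurement could be driven externally, assuming a hypothetical sequential target executable named target_app.exe and wall-clock timing of the full process span.

    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Launch k concurrent copies of a sequential target application and time
    // the total span externally, including process startup and shutdown.
    double time_concurrent_processes(const std::string& cmd, int k) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> launchers;
        for (int i = 0; i < k; ++i)
            launchers.emplace_back([&cmd] { std::system(cmd.c_str()); });
        for (auto& t : launchers) t.join();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(end - start).count();
    }

    int main() {
        const std::string target = "target_app.exe";  // hypothetical target
        for (int k = 1; k <= 8; ++k)
            std::cout << k << " concurrent processes: "
                      << time_concurrent_processes(target, k) << " s\n";
    }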
Figure 7.1: The structural relationship between the several types of HYDRA benchmarks is illustrated here. HYDRA-1 benchmarks are composed of separate processes (green box) each containing a single thread of execution (red box). The thread executes both the sequential work (yellow) and parallel work (blue) in a sequential manner. HYDRA-2 benchmarks are substantially similar to HYDRA-1 in that single threads perform identical work concurrently, with the exception that the threads are contained in a single shared-memory process. HYDRA-3 is the most unique of the three, but also the most conventional with regard to the concept of a parallel application. Like HYDRA-2, all threads are contained in a single process, but parallel work is divided across a group of threads. This is the canonical parallel work model described by Amdahl's Law.
Figure 7.2: In order to model with different types of benchmarks, the relationships between them must be known. HYDRA 3 benchmarks are expected to be timings of the operation of a parallel kernel inside an application. Explicitly sequential portions may or may not be included. HYDRA 2 benchmarks consist of the time to completely operate an application thread including startup and shutdown. Timings are also produced internally. HYDRA 1 benchmarks are the total execution time of separate system processes. Timings are performed by an external application and include process startup and shutdown.
7.1 Relationship between HYDRA benchmark types
7.1.1 Concurrent Processes (independent memory address spaces)
HYDRA-1 benchmarks are measured externally by a driving application and therefore represent the total span of the execution including process startup and shutdown (figure 7.2). Relative to the H3 benchmark, an additional sequential portion may need to be added to represent the extended head and tail together. H1 benchmarking lacks intra-process exclusive resource contention, but contention may exist at the OS level and is expected at the hardware level through resource sharing. Data is collected for each of [1 : nmax] concurrent processes.
52 7.1.2 Concurrent Threads (common memory address space)
HYDRA-2 benchmarks are operated as HYDRA-1 except that threads in a single shared-memory host process are used rather than independent processes (figure 7.2). Within a single process more resource conflicts arise, including mutexes in the application and within dependent libraries. These resource restrictions may lead not only to delays in the parallel parts, but to extended runtimes of the sequential parts too. Contention in the parallel part may be in addition to that of H3 parallel contention (figure 7.3). The worst case is that all concurrent threads are convoyed completely and run entirely sequentially (potentially with further overheads). More complicated models are required to account for this kind of behavior. For these benchmarks to be possible, the application must be built for multi-threaded calls and those calls must be fully threadsafe. This is a higher standard than necessary for either H1 or H3 and is relatively unusual. H2 sequential application threads are operated concurrently for [1 : nmax] threads at a time.
7.1.3 Individual application task parallel computation
HYDRA-3 benchmarks may be subject to some degree of exclusive resource contention in the parallel portion; sequential operation is uncontested and free of this by definition (figure 7.3). H3 benchmarks may represent the operation of the overall application (with the value reported by an external controller) or a meaningful subset of it, such as a computational kernel (reported by the application itself). H3 benchmarks are able to ascertain information about the performance qualities of the parallel part(s) of an application. No such information can be extracted about the sequential part; H1 and H2 fill this information in.
Incremental variations or hybrids between each benchmark style are possible, especially between H1/H3 or H2/H3, such as concurrent multi-threaded executions. Two simultaneous instances of 4-threaded applications and four simultaneous instances of 2-threaded applications are just a couple of such variations. It is not expected that these would yield any additional information beyond that of the existing types.

Figure 7.3: Contention for software resources inside the same memory address space in a parallel application may only exist where there are simultaneous demands placed on that resource. For HYDRA-3, meaningful contention can only exist during parallel operation of the parallel portion of the application. For HYDRA-2, parallelism is not expressed directly and arises through concurrency instead. Contention may occur in the sequential portion of the application, in the parallel portion (same mechanism as H3), and also in the parallel portion but through alternate mechanisms.
7.2 Benchmarking Protocols
It is well known that parallel applications rarely scale perfectly on their own. This is attributable both to qualities of the algorithm and to the operating environment (OS scheduling and system architecture), and it is captured with H3 benchmarks. H1 benchmarks are free of algorithmic influences, so deficient scaling there is entirely an environmental matter. H2 benchmarks are similar to H1, but may manifest other software artifacts as described above.
HYDRA collects a statistically significant number of executions per configuration. For H3, applications are operated exclusively on n = [1 : nmax] threads for no less than 10 executions for every n. Configurations are run in random order to avoid any memorization or cache advantage. Outliers are incrementally rejected until the quotas are full. Each benchmark generates a single run time for a particular number of processors, and runtime generally decreases with increasing processor count. See figure 7.4.
For H1 and H2, each application implementation is either sequential or parallel, supporting up to nmax instances. The run times with increasing concurrency are observed only to deteriorate (lengthen), and the variation among runtimes increases as well (see figure 7.5). We aim to work with, but not capture, the increasing variation and deviation at this time. Benchmarks are run on sequential applications (processes), one application at a time, concurrently for each of [1 : nmax] processors and not less than 10n trials for each. For H1 these instances are separate processes and exist in separate memory spaces, while for H2 each instance is a separate thread in a common memory space.
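The exact rejection rule used by HYDRA is not reproduced here; the sketch below shows one plausible averaging step under an assumed rule (drop samples more than two standard deviations from the mean, then re-average).

    #include <cmath>
    #include <numeric>
    #include <vector>

    // Average a set of runtimes after rejecting outliers (assumed 2-sigma rule).
    double robust_mean(const std::vector<double>& runtimes) {
        double mean = std::accumulate(runtimes.begin(), runtimes.end(), 0.0)
                      / runtimes.size();
        double var = 0.0;
        for (double r : runtimes) var += (r - mean) * (r - mean);
        double sd = std::sqrt(var / runtimes.size());
        std::vector<double> kept;
        for (double r : runtimes)
            if (std::fabs(r - mean) <= 2.0 * sd) kept.push_back(r);
        if (kept.empty()) return mean;  // degenerate case: no rejection possible
        return std::accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
    }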
The relationships between H1 and H3 benchmark results can be seen in figures 7.6 and 7.7.
Figure 7.4: H3 benchmarks are performed at least 10 times for operation on each of 1 to nmax software threads. Variation is observed and the mean value of each is used following statistical rejection of outliers. Runtime generally decreases with increasing processor count up to a certain point which is application and system specific. Here, regression appears to begin at 7 threads on a 12-thread machine and behavior becomes less predictable following that.
Figure 7.5: Both H1 and H2 benchmarks are executed on 1 to nmax concurrent instances either in a single process (H2, shared memory space) or multiple processes (H1, separate memory space). With increasing concurrency there exists increasing contention for hardware resources such as memory bandwidth. Run times are seen to drift longer and also exhibit greater variation with increasing concurrency. H1 is pictured; H2 results would appear quite similar.
Figure 7.6: Both H1 and H3 plotted together reveal their relationships to each other. Starting at nearly the same origin with n=1, increasing thread count on H3 follows the lower curve while for H1 the upper curve is developed. If H2 results were available they should fall somewhere between the two curves, but likely near to the H1 curve for most circumstances.
Figure 7.7: Here, the preceding H1 and H3 data are group-wise averaged (mean) and presented as curve-connected data. H1 data is normalized according to the number of processes. The difference between curves at one processor is (or should be) indicative of the characteristic difference between benchmark types. H3 data may exhibit some logical contention between parallel threads, manifesting as suboptimal scaling. H1 data, on the other hand, is essentially contention free. Should logical contention exist in H3, the curves would deviate from each other progressively more as process/thread count increases. Because we are able to assert that no logical contention exists in H1 (not a software problem) and we know that no contention exists in this H3 data, we can conclude the sub-optimal parallelism is strictly a hardware performance matter.
Chapter 8
Modular Performance Model
In this section we develop our comprehensive parallel performance model. Our model is horizontally decomposed with terms corresponding separately to the application, its runtime environment, and the layers of the hardware architecture [73]. We examine our experimental applications on multiple multi-core parallel machines [61], some of which exhibit structural variation in CPU memory architectures. Some information about the specific machines being operated is necessary. Information about algorithm and operating system behavior is inferred by the models (with one exception). Where called out, $H(x)$ is the Heaviside step function:
$$H(x) = \begin{cases} 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
The parameter n is used circumstantially and interchangeably as a result of conventional constraints. n software threads may be specified (with variation) for an application to operate with. Each software thread will at all times be able to run on at least one core or hardware thread. At no time will n exceed nmax (oversubscription), and at no time will more software threads be affinity-constrained to a core than the number of hardware threads it physically contains.
8.1 Hardware Parameters
There are some requisite system architectural parameters and also some basic performance qualities needed to support the model. For hardware structure we collect:
Number of cores, nc, and hardware threads per core, TPC. TPC > 1 implies the sharing of L1 cache; TPC = 1 or 2 here and for all known architectures at this time. The number of ‘virtual’ cores (as some tend to think of them), those subject to significant contention, is nv = nc(TPC − 1). The total number of logical cores or hardware threads available is nmax = nc + nv = nc · TPC. (A worked instance follows this list.)
L1, L2, L3 cache presence, whether they are shared and to what degree, and their individual (not collective) physical size at level x: ALx. L0 is the notation used for main memory, allowing additional cache levels to be extended beyond L3 without misunderstanding. The L4 caches present on newer processors just starting to become available are not explicitly considered here, but further extension of the concepts is not difficult and should come quite naturally.
Collected but unused are cache line size and NUMA node counts. NUMA architectures with more than one node are beyond the scope of this work. Cache line size is invariant on these systems so no parameterization on that quantity is required or possible.
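As an illustrative worked instance (parameter values chosen to match a representative four-core SMT i7 and a non-SMT Core 2 Quad, respectively, rather than any one benchmarked machine):

    $$n_c = 4,\ TPC = 2:\quad n_v = n_c(TPC-1) = 4,\quad n_{max} = n_c \cdot TPC = 8$$
    $$n_c = 4,\ TPC = 1:\quad n_v = 0,\quad n_{max} = 4$$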
For hardware performance we collect the approximate average memory bandwidth per level in the memory hierarchy by accessing a large volume of data, 50 MB, from a block sized to fit the cache level. At each level Lx, testing is performed for combinations of read (I) and write (O) with sequential (S) and random (R) access. Simon and McGalliard performed benchmarking of the memory hierarchy and also operated the same benchmarks in a concurrent manner to demonstrate contention [93]. Correspondingly, for each of our four combinations we measure the bandwidth for [1 : nmax] affinity-locked threads operating into both a common block (CB) and independent blocks (IB) in shared memory: $V_{\{I,O\},\{S,R\},\{CB,IB\},Lx,n}$. CB measurements are used for H3-based modeling while IB measurements are appropriate to H1 and H2.
The choices between S and R, and between CB and IB, are free variables of the modular model, while I and O are chosen according to the usage.
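For illustration, a minimal single-threaded sketch in the spirit of this measurement is given below: a sequential read (I, S) over an independent block sized to a target cache level, repeated until roughly 50 MB has been accessed. The 64-byte stride is an assumed cache-line size; the full measurement additionally sweeps write, random-access, common-block, and multi-threaded affinity-locked variants.

    #include <chrono>
    #include <cstddef>
    #include <vector>

    // Approximate sequential read bandwidth for a block sized to one cache level.
    double sequential_read_bandwidth(std::size_t block_bytes) {
        const std::size_t total_bytes = 50u * 1024u * 1024u;   // total volume to access
        std::vector<char> block(block_bytes, 1);
        volatile long long sink = 0;                            // defeat optimization
        auto start = std::chrono::steady_clock::now();
        std::size_t touched = 0;
        while (touched < total_bytes) {
            for (std::size_t i = 0; i < block.size(); i += 64)  // one read per assumed cache line
                sink += block[i];
            touched += block.size();
        }
        auto end = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(end - start).count();
        return static_cast<double>(touched) / seconds;          // bytes per second
    }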
8.2 Algorithm Bandwidth
Algorithmic input and output memory demands are very similar to the memory read velocity $V_{I,n}$ and write velocity $V_{O,n}$, with units $[\mathrm{bytes/second}]$. For each we consider the possibility of linear and random access according to algorithmic and data-structure properties. $V_{I,n}$ and $V_{O,n}$ derive directly from the system benchmarks $V_{I,\{S,R\},\{CB,IB\},L0,n}$ and $V_{O,\{S,R\},\{CB,IB\},L0,n}$. The selection of R or S and CB or IB is a parametric matter for model fitting.
Per-thread bandwidth is some interpolation between $V_{I,n}$ and $V_{O,n}$. $\theta$ is the split in bandwidth between input and output, $\theta \in (0, 1)$. Therefore:
$$M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1 - \theta).$$
θ is a free variable of the modular model.
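As a purely illustrative numerical instance (the velocities below are assumed values, not measurements from this work): taking $V_{I,n} = 10\ \mathrm{GB/s}$, $V_{O,n} = 6\ \mathrm{GB/s}$, and $\theta = 0.75$,

    $$M_{T,n} = V_{I,n}\,\theta + V_{O,n}\,(1-\theta) = 10(0.75) + 6(0.25) = 9\ \mathrm{GB/s}.$$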
8.3 Software Parts
8.3.1 Amdahl’s Law