Scalable Optimizations for Improving the Memory System Performance in Multi- and Many-core Processors

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Sanyam Mehta

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Pen-Chung Yew

September, 2014

© Sanyam Mehta 2014. All Rights Reserved.

Acknowledgements

I would first of all acknowledge my Teachers for molding the direction of my life's path, especially when I had entered my Bachelor's and needed such clear guidance the most. Not only was I introduced to research by them, but I was also, more importantly, taught the utility of the education that I was getting. I have ever since been clear about the things to do next that will benefit me and others, and thus contribute positively, in a little way, to society.

As a PhD student, I feel that I was blessed to have Professor Yew as my advisor. Right from the beginning, he encouraged and helped me to choose the right problems to work on, and was always there to provide me with the right guidance at the right moments based on his vast experience of our research area. In addition, I would also like to extend my heartfelt thanks to Professor Zhai for helping me refine my work and make it appealing to the target audience. She has especially been very helpful in teaching me to make presentations for conferences. Also, I would thank Professor Kumar for always being very warm and accommodating throughout my stay at the University of Minnesota and while serving as a member on my committee, and Professor David Lilja, Professor Stephen McCamant, Professor Daniel Boley and Dr. Aamer Jaleel for honoring me with their presence during my thesis defense and asking me insightful questions.

I would also thank my very dear friends, Ankit, Ashish, Ayush, Rajat, Saket, and Shashank, for helping me in various ways to stay focused towards my goal in life. It is because of being happy in their presence that I could work effectively at college.

Finally, I would thank my parents, brother, and also other family members. They first of all cared for me and taught me all that I knew until I grew up and could think about things, and then allowed me to be in a different country for my PhD. They have been very understanding and accommodating in allowing me to do something that I thought would be helpful for all, and have always encouraged me and blessed me to be successful.

Dedication

To my Teacher and Parents

Abstract

The last decade has seen the transition from unicore processors to their multi-core (and now many-core) counterparts. Today, multi-cores are ubiquitous - they form the core fabric of our laptop and desktop PCs, supercomputers, datacenters, and also mobile devices. This transition has brought a renewed focus on compiler developers to extract performance from these parallel processors. In addition to extracting parallelism, another important responsibility of a parallelizing (or optimizing) compiler is to improve the memory system performance of the source program. This is particularly important because all cores on the chip simultaneously demand data from the slow memory, and thus computation ends up waiting for the arriving data. In other words, multi-cores have accentuated the memory wall. These simultaneous requests for data from off-chip memory also lead to contention for bandwidth in the off-chip network (called the bandwidth wall), leading to a further increase in the effective memory latency as seen by the executing program.

While the above responsibilities of a parallelizing compiler are better understood, we identify three key challenges facing compiler developers on current processors. These include: (1) the diverse set of microarchitectures existent at any time, and more importantly, the changes in microarchitecture between generations. We identify that existing compilers have particularly been unable to adapt to some of the important changes to microarchitecture in the last decade, resulting in suboptimal optimizations. (2) The poor showing of compilers on real applications that contain large scopes of statements amenable to optimization. This weakness stems from the lack of a good cost model for deciding which statements to fuse, and a sheer inability to fuse due to ineffective dependence analysis. (3) Unscalability of compilers - this is a traditional limitation of compilers, where they choose to optimize small scopes to contain the compile time and memory requirement, and thus lose optimization opportunities. In this thesis, we make the following contributions to address the above challenges.

1. We revisit three compiler optimizations (loop tiling and loop fusion for enhancing temporal locality, and data prefetching for hiding memory latency) for improving memory (and parallel) performance in light of the various recent advances in microarchitecture, including deeper memory hierarchies, the multithreading technology, the (short-vector) SIMDization technology, and hardware prefetching, and propose generic algorithms implementable in production compilers for a range of processors.

2. We propose wise heuristics in a cost model to choose good statements to fuse, and also improve dependence analysis so as not to lose critical fusion opportunities in application programs when they exist.

3. The final contribution of this thesis is a solution to the unscalability problem. Based on program semantics, we devise a way to represent the entire program with far fewer representative statements and dependences, leading to significantly improved compile time and memory requirement for compilation. Thus, real applications can now be optimized not only effectively, but also at a very low overhead.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
  1.1 Key challenges for compiler developers in the present day
    1.1.1 The diverse world of computer architecture that also keeps on evolving
    1.1.2 Real applications and poor show
    1.1.3 Unscalability: The traditional woe of compilers
  1.2 Contribution of this thesis
  1.3 Organization of the thesis
  1.4 Related Publications

2 Background
  2.1 Loop Tiling and Tile Size Selection
  2.2 Data Prefetching
  2.3 Loop Fusion and Polyhedral Compiler Framework
    2.3.1 Overview of the Polyhedral Framework

3 Tile Size Selection, from Then to Now
  3.1 Introduction
  3.2 Motivation
    3.2.1 Modeling the effect of set-associativity
    3.2.2 Tapping into data reuse at multiple levels of cache
  3.3 Our Approach
    3.3.1 Data reuse in L1 cache
    3.3.2 Data reuse in L2 cache
    3.3.3 Data reuse in both L1 and L2 cache
    3.3.4 Interaction with vectorization
    3.3.5 Interaction with Time Skewing
    3.3.6 Other factors that influence tiling
  3.4 The Tile Size Selection (TSS) Algorithm
    3.4.1 Scope of the algorithm
    3.4.2 The Framework
  3.5 Experimental Setup
  3.6 Experimental Results
  3.7 Related Work
  3.8 Conclusion

4 Coordinated Multi-Stage Prefetching for Present-day Processors
  4.1 Introduction
  4.2 Motivation
  4.3 Coordinated Multi-stage Data Prefetching
    4.3.1 What to Prefetch
    4.3.2 Where to Insert Prefetch Instructions
    4.3.3 When to Prefetch
    4.3.4 Multi-stage Prefetching in multi-threaded environment
  4.4 The Compiler Framework
  4.5 Experimental Results and Discussion
    4.5.1 Experimental Environment
    4.5.2 Results for Multi-stage Prefetching in single-threaded environment
    4.5.3 Results for Multi-stage Prefetching in multithreaded environment
  4.6 Related Work
  4.7 Conclusion

5 Loop Fusion and Real Applications - Part 1
  5.1 Introduction
  5.2 Motivation
    5.2.1 Why not scalar and array expansion (perhaps, followed by contraction)?
    5.2.2 Why not scalar and array renaming across loop nests?
  5.3 Background
    5.3.1 Some key concepts and definitions
  5.4 Our Approach
    5.4.1 Key Insight
    5.4.2 Fusion of loop nests with same loop-order in the presence of temporary array variables
    5.4.3 Fusion of loop nests with different loop-orders
    5.4.4 Cases in which dependence relaxation is not feasible
  5.5 Implementation: putting it all together
    5.5.1 Validation of relaxation criteria
  5.6 Experimental Evaluation
    5.6.1 Setup
    5.6.2 Benchmarks
    5.6.3 Results and Discussion
  5.7 Related Work
  5.8 Conclusion

6 Loop Fusion and Real Applications - Part 2
  6.1 Introduction and Motivation
  6.2 Loop fusion and the polyhedral framework
    6.2.1 Existing fusion models in the polyhedral framework
  6.3 Problem Formulation
  6.4 Our Fusion Algorithm
    6.4.1 Algorithm 3: Finding a good pre-fusion schedule
    6.4.2 Algorithm 4: Enabling Outer-level Parallelism
  6.5 Experimental Evaluation
    6.5.1 Setup
    6.5.2 Benchmarks
    6.5.3 Results and Discussion
  6.6 Related Work
  6.7 Conclusion

7 Addressing Scalability: Optimizing Large Programs at an Easy Price
  7.1 Introduction
  7.2 Causes of Unscalability
    7.2.1 First Culprit: Phase I - Constructing legality constraints
    7.2.2 Second Culprit: Phase II - Using an ILP solver to find legal hyperplanes
    7.2.3 Third Culprit: Phase III - Finding linearly independent hyperplanes
  7.3 Implementation
  7.4 Experimental Evaluation
    7.4.1 Setup
    7.4.2 Benchmarks
    7.4.3 Results and Discussion
  7.5 Related Work
  7.6 Conclusion

8 Conclusion and Discussion
  8.1 Summary of important findings and directions for future research
  8.2 Final words

References

List of Tables

3.1 Details of the microarchitectures
3.2 Summary of the benchmarks
3.3 Comparison of the algorithms
3.4 Performance of TSS algorithm
3.5 Comparison between TSS and 'Copy and Tile'
3.6 Performance of TSS algorithm in a multi-threaded environment
4.1 Summary of benchmarks with problem sizes
4.2 Details of the microarchitectures
4.3 Summary of different prefetching strategies
5.1 Performance speedup of transformed program in Fig 5.1b and PLuTo generated code (Fig 5.1c) wrt original code in Fig 5.1a
5.2 Summary of fusion models in different compilers
5.3 Summary of the benchmarks
5.4 Performance counters indicating reduced pipeline stalls and effective memory latency through var-lib
5.5 Compile time, memory requirement of different compilers and overall (application) speedup (S) achieved by var-lib over ifort
6.1 Summary of the fusion models
6.2 Summary of the benchmarks
7.1 Summary of the benchmarks
7.2 Compile time, memory requirement of different compilers
7.3 Compile times for the three time consuming phases in polyhedral compilation

List of Figures

2.1 Matmul - untiled (left) and tiled (right)
2.2 (a) Demonstration of self-interference (ldB = 512); (b) Demonstration of cross-interference (ldA = ldB = 2000); [ldA and ldB are leading dimensions of arrays A and B, respectively]
2.3 Loop fusion
2.4 Overview of the Polyhedral Framework
3.1 Source codes of matmul and jacobi-2d, before and after tiling through PLuTo
3.2 Normalized metrics for matmul (N=2000); L1 size = 32KB; L2 size = 256KB
3.3 Choosing a tile size that exploits data reuse in both L1 and L2 cache
3.4 Vectorization, reuse and tile size
3.5 Illustration of skewed-tile traversal
3.6 Our Framework
3.7 Comparison of TSS with other approaches that use 3D-tiling
4.1 An example loop nest in the swim benchmark
4.2 The matmul kernel
4.3 (a) Tiled matmul; (b) Tiled matmul with prefetching as in the general case; (c) Tiled matmul with prefetching as in the special case
4.4 Compiler framework for coordinated multi-stage prefetching
4.5 Performance speedup of different prefetching strategies on single thread in (a) SandyBridge (results normalized wrt baseline pref.) (b) Xeon Phi (results normalized wrt hardware pref.)
4.6 An example loop nest with prefetching instructions to L1 and L2 cache in the libquantum benchmark
4.7 Sparse matrix-vector multiplication with prefetching instructions to L1 and L2 cache
4.8 Performance speedup of different prefetching strategies using CMP technique in (a) SandyBridge (results normalized wrt baseline pref.) (b) Xeon Phi (results normalized wrt hardware pref.)
4.9 Performance speedup with loop distribution in Xeon Phi (results normalized wrt hardware pref.)
5.1 (a) Original code, (b) Optimized (or transformed) code, and (c) Transformed program generated by PLuTo
5.2 Instancewise (RAW) dependence between statements (a) S1 and S2, and (b) S2 and S8; the dashed arrows in the figure indicate a backward (RAW) dependence
5.3 Live range interference
5.4 (a) Example program, (b) (Incorrectly) Transformed program after dependence relaxation, and (c) Correctly transformed program
5.5 (a) Loop nests with different loop-orders, (b) Loop fusion after interchange and liberalization
5.6 Example programs where dependence relaxation is infeasible
5.7 Performance results using (a) single thread (b) 8 threads (cores)
5.8 Parallel performance comparison of var-lib and explicit parallelization for rhs subroutine in lu
6.1 (a) the gemver kernel; (b) illegal fusion; (c) legal fusion after loop interchange in the first loop-nest
6.2 the swim benchmark
6.3 (a) Original source code for advect; (b) Fully fused advect without shifting (incorrect code); (c) Fully fused advect with shifting (correct code)
6.4 Our fusion model: (a) Partial DDG for swim and pre-fusion schedule from Algorithm 3; (b) Fused code corresponding to (a); (c) Pre-fusion schedule chosen by PLuTo; (d) Fused code corresponding to (c)
6.5 Transformed advect code generated using Algorithm 4
6.6 Results
6.7 Partitioning achieved by different fusion models for the gemsfdtd benchmark (values in each column are spaced out for readability)
7.1 (a) Original program; (b) Representative program with condensed set of statements and dependences
7.2 (a) Original program; (b) Incorrectly transformed (fused) program; (c) Correctly Transformed program (with shifting in S5-S7)

Chapter 1

Introduction

Until a decade ago, Moore's law translated into deeper pipelines (and corresponding increases in processor frequency), sophisticated cores with newer technologies for branch prediction and Instruction-Level Parallelism (ILP), and larger on-chip caches. All this meant increasing performance gains without any involvement of the programmer or the compiler. However, this trend has changed for various reasons. It is now no longer possible to increase the clock frequency due to power dissipation issues (1). The era of ILP has also witnessed an end, since very little gain is estimated from employing more transistors for extracting ILP, and the community has exhausted options in this regard (2). The combined effect has been a shift to multiple cores on a chip, where each core runs at a lower frequency to save power. Since the number of transistors can now be increased in accord with Moore's law without excessive power consumption, a continued increase in performance can be sustained. This performance increase is, however, contingent on the ability of the source program to tap into the parallelism exposed by the hardware.

This marked shift in processor design has brought about a renewed focus on compiler design. Earlier, the onus of improving performance was borne mostly by the hardware without a significant contribution by the compiler or the programmer. This was natural since performance improvement could be achieved by optimizing the execution of instructions within a small window through techniques for extracting ILP. These techniques included dynamic out-of-order execution, superscalar processing, speculative execution and non-blocking caches. However, with the introduction of multi-core processors, extracting parallelism from within a few instructions is not sufficient, and parallelism at a much coarser level is necessitated. This coarse-grained parallelism is not easy for the hardware to extract since it has a view of only a narrow window of instructions.

Therefore, the onus of improving performance on multi-core processors has to now be borne by either the compiler or the programmer.

Clearly, extracting coarse-grained parallelism, where the compiler directs each core on the host hardware to execute specific portions of the source program, is certainly one important responsibility of the compiler. However, the shift to multicore has also had other important implications which cannot be ignored. In particular, when multiple cores on the chip compute in parallel, they also consume data in parallel. This implies simultaneous requests for data from the off-chip memory. Now, it is well known that the improvement in memory speed has not kept pace with processor speed (leading to the memory wall), and multiple cores on chip only accentuate this gap between the processor and memory. Another related effect that creeps in and is equally harmful to performance is the bandwidth wall, where effective memory latencies increase due to bandwidth contention resulting from memory-intensive applications and poor memory performance. Thus, alleviating the effects of the memory wall and the bandwidth wall for multi-cores is the second important responsibility of parallelizing compilers. Again, as with parallelization, the hardware cannot uncover the opportunities existent in source programs for potential improvements in memory performance. It is also unrealistic to pass this responsibility to the programmer, because the host platforms tend to be so different and evolving, and not all programmers like to be 'close to the silicon' or feel comfortable reasoning about an optimization for the long sequences of loop nests at their disposal. In such a scenario, traditional memory optimizations such as loop tiling, loop fusion, and data prefetching performed by a parallelizing compiler assume renewed significance.

1.1 Key challenges for compiler developers in the present day

While the two important responsibilities of a parallelizing compiler are well understood, it is also important to understand the challenges facing compiler developers when meeting these responsibilities. Our experience with parallelizing compilers and present-day architectures reveals that there are three key challenges in this regard:

1.1.1 The diverse world of computer architecture that also keeps on evolving

The first key challenge in developing a compiler stems from its intimate relationship with the host hardware. For any compiler to yield the best performance when applying a particular optimization, it must have a thorough understanding of the host hardware. For example, the choice of a good tile size depends on the size of the cache, its set-associativity, the number of levels of cache, the prefetching behavior of the hardware, etc. Similarly, the degree of aggressiveness at which loops should be fused depends on cache characteristics, the vectorization potential of the host hardware, etc. As a result, it becomes difficult for a compiler to perform optimally on the diverse set of microprocessors that exist today. While the characteristics of contemporary microprocessors from different vendors such as AMD and Intel are largely the same, we identify that an even more important problem is that traditional compiler optimizations such as loop tiling, loop fusion and prefetching, as they exist in present-day compilers, have failed to evolve with the major changes to computer microarchitecture in the last decade (over multiple microarchitecture generations). The changes include the following:

1. Deep memory hierarchies - Having recognized the importance of alleviating off-chip memory accesses, existing multi-cores have seen a shift from a single-level, to a two-level, to now a three-level cache hierarchy. However, important compiler optimizations such as loop tiling, as they exist in current production compilers, do not account for this change. Most of the work on loop tiling assumes that processors need to achieve data reuse in a single level of cache, and there is some work that assumes a two-level cache hierarchy, but the optimization is still not tuned for optimal reuse. Similarly, the existence of multiple levels of cache and the opportunity to prefetch data selectively to those levels requires a carefully executed strategy for optimal performance in current processors, perhaps with coordination between the compiler and hardware.

2. The multithreading technology - In current processors, multithreading is available through either Chip Multi-Processing (CMP) or Simultaneous Multi-Threading (SMT). Both these technologies have a bearing on the memory optimizations performed by the compiler. For example, if 2 threads are running in CMP, then the two threads simultaneously bring data into the shared last-level cache. Similarly, if 2 threads are running in SMT, then they bring in data simultaneously to even the private L1 or L2 cache on each core, and thus reduce each other's share of the cache. Threads in SMT may have an even more involved relationship. For example, in Intel's latest many-core processor, Xeon Phi, a thread can only issue instructions once every 2 cycles, and another thread is granted the opportunity to issue in the intervening cycle. In such cases, the compiler must account for such behavior for optimal performance.

3. The vectorization technology - Current processors employ efficient vectorization or SIMD (Single-Instruction Multiple-Data) units to perform parallel computation (called vectorization) on short vectors. The vectorization technology is used to extract the second level of parallelism available in the source program, which is at a much finer granularity (generally the innermost loop in a loop nest) than the coarse-grain or outer-loop parallelism. Current production compilers, particularly the Intel compiler, are adept at finding vectorizable loops because vectorization yields considerable performance improvement. We recognize that although production compilers are not so good at finding coarse-grain parallelism, they can find fine-grain parallelism because the analysis merely involves a single loop and its body, instead of an entire loop nest. However, they still fail to study its interaction with other memory optimizations such as loop tiling and loop fusion. For example, if the compiler takes the approach of aggressive loop fusion, then that might hurt vectorization because of the introduction of loop-carried dependences in the innermost loop. Similarly, certain tile sizes benefit significantly more from vectorization than others, and thus vectorization has an important say in tile size selection as well. In some cases, tiling may even degrade performance of the source program because of its detrimental impact on vectorization.

4. The prefetching technology - Data prefetching is the single most important and generally applicable technology for improving performance on existing multi- and many-core processors. This is because the gap between processor and memory speed is the largest in the present day, and it is therefore necessary to prefetch data from the slow memory for timely execution. This technology interacts with the compiler in that the compiler is also armed with prefetch instructions, and thus software and hardware prefetching contend with each other. To add to this, there are hardware prefetchers for different levels of cache, and likewise software prefetch requests for different levels, and deciding how they should coordinate to achieve the best performance is a live challenge. Certainly, the compiler cannot perform the prefetching optimization oblivious of the hardware prefetcher's abilities and inabilities. It should instead know those and complement the weaknesses of the hardware prefetcher. Also, since prefetching is so crucial to performance, other optimizations such as tiling and fusion must be performed by the compiler so as to not hurt prefetching. For example, tiling induces block-wise execution over the program arrays instead of sequential execution. This hurts prefetching, and thus requires a careful balance. Similarly, aggressive fusion may result in merging many arrays into the same nest. This increases the number of prefetch streams required to monitor them, and the processor may fall short of streams, which can potentially hurt performance.
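To make the interplay between software prefetch instructions and the cache hierarchy concrete, the following minimal sketch (not taken from the thesis) issues x86 prefetch intrinsics to two different cache levels inside a simple streaming loop. The distances PF_L1 and PF_L2 are illustrative placeholders that a compiler or programmer would tune per platform, and the hardware prefetcher may already cover such a streaming pattern; the point is only to show the two-level software mechanism that the compiler must coordinate with.

    #include <stdio.h>
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 / _MM_HINT_T1 */

    #define N      (1 << 20)
    #define PF_L1  16        /* illustrative distance (elements) for the L1 prefetch */
    #define PF_L2  128       /* illustrative, longer distance for the L2 prefetch    */

    static float a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++) {
            /* Request data far ahead of use into L2, and closer ahead into L1. */
            if (i + PF_L2 < N) {
                _mm_prefetch((const char *)&b[i + PF_L2], _MM_HINT_T1);
                _mm_prefetch((const char *)&c[i + PF_L2], _MM_HINT_T1);
            }
            if (i + PF_L1 < N) {
                _mm_prefetch((const char *)&b[i + PF_L1], _MM_HINT_T0);
                _mm_prefetch((const char *)&c[i + PF_L1], _MM_HINT_T0);
            }
            a[i] = b[i] + c[i];
        }
        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }

Each added prefetch is also an extra instruction and an extra outstanding request, which is precisely why an uncoordinated mix of software and hardware prefetching can hurt rather than help.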

Since such an intimate interaction exists between compiler memory optimizations and the above-mentioned important advances in computer microarchitecture, it is important for the compiler to take into account the host hardware features for best performance. We suggest that the compiler should ideally extract this needed information from the processor's host OS or during its installation, and then use it every time it goes ahead to perform its optimizations. In any case, these optimizations need to be implemented in a compiler in a way that provides the needed flexibility to adapt to different hosts.

1.1.2 Real applications and poor show

The greatest times of need for a programmer are when the source code is large and requires memory optimizations to improve performance. Such is the case in many real scientific applications, and the compiler's help is inevitably sought. However, our experiments with state-of-the-art production compilers on such real applications from popular benchmark suites show that it is in precisely such cases that the compiler is particularly ineffective. Although there exists immense opportunity to improve the temporal locality of data accessed in such applications, the compiler is actually able to do very little to help the programmer in need. We identify that this unfriendliness of the compiler in such circumstances arises from two key reasons:

1. The hot subroutines in various scientific applications contain sequences of loop nests. These loop nests tend to access common data and thus contain substantial opportunity for data reuse through fusion. However, production compilers either simply choose to ignore the possibility of fusion, or choose to be very restrictive in fusing nests. The first important reason is that they only consider very small scopes for fusion, such as just consecutive nests in 'pair-wise fusion'. Thus, if two nests that are not adjacent are fusable, the compiler will simply ignore the possibility. This is a consequence of the compiler's view of reducing analysis time by deciding on small scopes and thus fewer dependences to analyze. Also, the criteria to be used for deciding the best fusion structure are unclear, especially in the wake of the recent changes to microarchitecture discussed above. For example, if the compiler arrives at an imperfect fusion of nests (which results when the fused nests have different loop depths) with the help of the insertion of conditional clauses in the loop body, then it could hurt vectorization, and be unprofitable. Thus, clear criteria, or more precisely, a cost model, are necessitated to decide on good fusion structures.

2. The second important problem is that even if the criteria for choosing nests to fuse are decided, fusion is hindered by the occurrence of certain artificial dependences between the candidate nests due to the frequent use of temporary variables in such large scientific applications. The use of such temporary variables, both scalars and arrays, arises from the fact that these applications compute partial results to be immediately used in the program. Those partial results are stored in temporary variables to avoid re-computation. Existing production compilers, because of their oversight of the importance and opportunity of fusion in such applications, tend to leave such fusion-preventing dependences unaddressed. These dependences are artificial because they can be safely relaxed if certain criteria are fulfilled, in which case they no longer remain fusion-preventing. (A small sketch of such a fusion-preventing temporary appears right after this list.)
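The following is a hypothetical sketch (not taken from any benchmark in the thesis) of the pattern described in the second point: a temporary array t carries a partial result from one loop nest into the next, and the resulting dependence blocks naive fusion even though both nests traverse the same data.

    /* Hypothetical example: t holds a partial result produced by the first
     * nest and consumed, with a neighbor access t[j+1], by the second nest.
     * Fusing the two j-loops naively is illegal: in the fused loop,
     * iteration j would read t[j+1] before iteration j+1 has produced it
     * (a backward dependence), so a conventional compiler refuses to fuse. */
    void smooth(int n, const double *a, double *b, double *t) {
        for (int j = 1; j < n - 1; j++)
            t[j] = 0.5 * (a[j - 1] + a[j + 1]);   /* partial result */

        for (int j = 1; j < n - 2; j++)
            b[j] = t[j] + t[j + 1];               /* neighbor use of the temporary */
    }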

Thus, it is important for the compiler to address these weaknesses to be a 'friend in deed' of the programmer.

1.1.3 Unscalability: The traditional woe of compilers

The third and most important reason for the poor performance of current production compilers, especially on large programs, is their own choice in the matter. This choice is, however, forced, because they need to maintain programmer productivity through short compile times. That is, compilers choose not to consider global program optimizations (or optimizations spanning a large scope of statements) so as to circumvent the issue of analyzing a large number of dependences when reasoning about the application of a certain optimization, such as fusing multiple loop nests. In short, the compiler's capability of performing useful optimizations does not scale to large programs. Traditionally, large compile times have been associated with analyzing (many) dependences in a large program to generate the Program Dependence Graph (PDG). The PDG is then used to reason about optimizations such as parallelization, fusion, distribution, etc. However, our experiments with a state-of-the-art polyhedral compiler that is armed with the capability of analyzing large scopes and performing useful global optimizations reveal that the unscalability of compilers is more a problem when applying useful optimizations than when merely analyzing dependences to build the PDG. For example, it takes less than a second to analyze dependences and construct the PDG in a scientific application, lu, whereas effectively optimizing it takes around 3 hours. We are not aware of an existing work that attempts to tackle this important problem, and this problem is, in fact, now well recognized in the polyhedral compiler community as a key challenge.

1.2 Contribution of this thesis

This thesis makes contributions to address all three above-mentioned challenges involved in designing parallelizing compilers. In addition to parallelism, we focus on three key compiler optimizations to improve the memory system performance of applications, which in effect amounts to parallel performance. The three optimizations are loop tiling, loop fusion and data prefetching. Through our work on these three optimizations, we address all three challenges as follows.

1. We look at memory optimizations in light of present-day processors that contain the recent advances in microarchitecture such as multi-level caches, multithreading, vectorization, and prefetching. In particular, we revisit the locality-enhancing optimization, loop tiling, such that interference misses in caches, which stem from the caches being set-associative instead of fully associative, are minimized, and the chosen tile size best benefits from vectorization and prefetching. Its interaction with multithreading is also considered, and a generic algorithm that takes these parameters as input is presented, which can thus find utility in any compiler on any host. Similarly, the latency-hiding optimization, data prefetching, is revisited in light of the hardware support for prefetching in current processors, a multi-level cache hierarchy, and also multithreading. The decision on the aggressiveness of loop fusion is also made based on cache sizes and maximizing the benefits of vectorization. Experimental results indicate significant performance gains over production compilers on existing multi- and many-core processors when the important recent advances in microarchitecture are considered.

2. The solution to the challenge of optimizing large programs includes, (1) relaxing the overly stringent fusion-preventing dependences on temporary variables across loop nests so as to enable effective fusion while also preserving program correctness in the wake of the relaxed dependences, and (2) implementing an effective cost model that chooses wise heuristics to decide the statements to be fused to achieve global data reuse; the fusion achieved is such that it does not hurt coarse-grain parallelism. This work is implemented in the state-of-the-art PLuTo polyhedral compiler and has been shown to provide parallel performance improvement of as much as 2.17x for individual applications, and as much as 6.8x over the Intel compiler for individual hot regions in those applications, when run on an 8-core Intel Xeon processor.

3. The unscalability problem at its very root stems from the large number of program statements and dependences within hot regions in real application programs. The algorithms employed for computing transformations have a time and memory complexity that grows as a large power of the number of statements, and thus become unscalable. We address this problem with a one-shot solution - condense the program (or hot region) so that it is represented by a semantically equivalent but smaller set of statements and dependences. Essentially, we choose a single representative statement for an entire loop, which we call an Optimization-molecule. This condensation then also helps us to condense the set of program dependences. With this condensation, global program transformations such as loop fusion (and its supporting transformations such as interchange and shifting) can still be effectively reasoned about, and that too at a much lower overhead. Experimental results indicate significant improvement in compile time and memory requirement for program subroutines with more than 100 statements.

1.3 Organization of the thesis

In the following chapters of this thesis, each compiler optimization is dealt with separately, with the focus on addressing the above-identified three challenges for parallelizing compilers. Chapter 2 provides the background for the thesis - it introduces the three compiler optimizations that are dealt with in this work and particularly provides a gentle introduction to the polyhedral compiler framework. Chapter 3 discusses loop tiling and tile size selection for present-day processors, clearly showing the impact of hardware on this optimization and then our solution to account for all of it in a neat algorithm. Chapter 4 discusses data prefetching as an important latency-hiding optimization for both the latest multi-core and the many-core processors. This chapter shows how the optimal prefetching strategy on each platform is clearly a function of the host platform and is widely different on these two platforms; in particular, the best performing strategy on one is the worst on the other and vice-versa. On each platform, we propose an algorithm for the compiler to selectively prefetch data using carefully tuned prefetch distances at different levels of cache, and also to coordinate with the existing hardware prefetcher on the host platform when helpful. Chapters 5 through 7 deal with global program optimizations, particularly loop fusion for real application programs. Chapter 5 presents our solution to the relaxation of dependences on temporary variables across loop nests to enable effective loop fusion in large programs. Chapter 6 complements Chapter 5 by providing an effective cost model to decide which statements should be fused to benefit most from data reuse while preserving parallelism. Chapter 7 closes this discussion by addressing a long-standing problem in compilers - scalability. Thus, finally, real applications can be effectively optimized for locality and parallelism, and that too at a cheap price in terms of time and memory. Finally, we present the conclusions from our work in Chapter 8.

1.4 Related Publications

Portions of the work presented in this thesis have been published in the form of three papers. They are as follows.

1. Sanyam Mehta, Gautham Beeraka and Pen-Chung Yew. Tile Size Selection Revisited. In ACM Transactions on Architecture and Code Optimization, 10, 4, Article 35 (December 2013), 27 pages.

2. Sanyam Mehta, Pei-Hung Lin and Pen-Chung Yew. Revisiting Loop Fusion in the Polyhedral Framework. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). ACM, 233-246.

3. Sanyam Mehta, Zhenman Fang, Antonia Zhai and Pen-Chung Yew. Multi-stage Coordinated Prefetching for Present-day Processors. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS '14). ACM, 73-82.

Chapter 2

Background

Traditionally, compiler optimizations for improving the memory performance of a processor have been broadly categorized into (1) locality-enhancing and (2) latency-hiding optimizations. Many application programs involve multiple updates (writes) to the same data in the form of arrays, lists, etc. Such programs thus reuse data. However, the program may be written in a way that it cannot reuse the data in a faster level of the memory hierarchy due to its limited size. Thus, locality-enhancing optimizations transform the program to allow such data reuse. This prevents incurring costly off-chip memory accesses, and also reduces the use of off-chip bandwidth, both amounting to improved performance. Loop tiling and loop fusion are the two most prominent locality-enhancing compiler optimizations.

In other applications, there may not be much opportunity to reuse data, or at least the compiler may not be capable of automatically transforming the program to extract such reuse. In such cases, an application (particularly if it is memory intensive, as is the case with most scientific applications) could suffer performance loss due to waiting for data to arrive from memory. Here, data prefetching as a latency-hiding optimization comes to the rescue. In this optimization, the compiler identifies the data needed in the immediate future and proactively requests it, so that execution can proceed unimpeded without suffering from memory latency. Thus, data prefetching makes up for the slow memory speed, and boosts program performance.

In the rest of this chapter, we provide more specific background for each of these three optimizations. In particular, loop fusion and the polyhedral compiler framework are discussed in more detail.


for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

matmul

Figure 2.1: Matmul - untiled (left) and tiled (right)
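The right half of Figure 2.1 (the tiled loop nest) did not survive the text extraction. Below is a minimal sketch of what Section 2.1 describes, assuming a single tile size parameter T for all three dimensions and, for simplicity, that N is a multiple of T; iT, jT, kT are the inter-tile loops and i, j, k are the intra-tile loops. This is a reconstruction for readability, not the thesis's exact figure.

    /* Tiled matmul, as described in Section 2.1 (sketch) */
    for (iT = 0; iT < N; iT += T)
      for (jT = 0; jT < N; jT += T)
        for (kT = 0; kT < N; kT += T)
          for (i = iT; i < iT + T; i++)
            for (j = jT; j < jT + T; j++)
              for (k = kT; k < kT + T; k++)
                C[i][j] += A[i][k] * B[k][j];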

2.1 Loop Tiling and Tile Size Selection

Loop tiling converts a program that originally involves sequential traversal of the data (in arrays) to data traversal in tiles. This reduces the amount of data accessed between consecutive accesses to the same (i.e. reusable) data, and thus promotes data reuse in faster and smaller levels of the memory hierarchy such as caches.

Figure 2.1 shows the original matmul (matrix-multiplication) kernel and also its tiled version. The 3 loops in the original code become 6 loops, 3 of which traverse the data within the tile and are called the intra-tile loops, while the remaining 3 loops iterate between tiles and are called the inter-tile loops. In the figure, the loops (iT-jT-kT) in the tiled program are the inter-tile loops, and the loops (i-j-k) are the intra-tile loops. To aid the understanding of our work on tile size selection in relation to loop tiling, it is important to understand the following concepts and definitions. There are 2 types of data reuse that are especially relevant to tile size selection:

Self-temporal reuse. This happens when the same reference reuses a data item in distinct loop iterations. For example, the array reference B[k][j] in the tiled matmul kernel in Figure 2.1 reuses the same data items in every iteration of loop i, and thus can be said to have self-temporal reuse in loop i. Similarly, array references A[i][k] and C[i][j] have self-temporal reuse in loops j and k, respectively. However, array reference B[k][j] reuses an entire tile in each iteration of the outermost loop i, array reference C[i][j] reuses a tile-row in each iteration of loop k, and the array reference A[i][k] reuses only an element in each iteration of loop j. Thus, only array references with self-temporal reuse in the outermost loop, such as B[k][j], provide opportunity for considerable data reuse. It is only in the presence of such references in the source code that loop tiling gives a significant benefit. Thus, it is important to track such references and focus on minimizing the conflict misses caused by such references.

Self-spatial reuse. This happens when a reference reuses the same cache line in distinct loop iterations. For example, the array references C[i][j], A[i][k] and B[k][j] in the tiled matmul kernel in Figure 2.1 all reuse the same cache lines for multiple successive iterations of loops j, k and j, respectively, and can thus be said to have self-spatial reuse in the respective dimensions. It is for this reason that self-spatial reuse is best availed when the tile dimensions are a multiple of the cache line size.

Data reuse is enabled by data locality, i.e. data is reused if it is not replaced from the cache before its subsequent use. Thus, corresponding to the 2 types of data reuse, there are 2 types of data locality, i.e. self-temporal and self-spatial locality. Loop tiling improves the self-temporal locality of data by reducing its reuse distance. Since the reuse distance is a function of the tile size, the tile size should be chosen such that data items accessed within a tile are not replaced from the cache due to capacity or conflict misses. In the tiled code, capacity misses can be easily avoided by choosing a tile whose working set size is smaller than the cache capacity. This, however, doesn't avoid conflict misses that result from non-contiguous memory accesses within the tile. The conflict (or interference) misses are of 2 types - self and cross interference misses.

Self-interference misses occur when a reference with self-temporal reuse accesses multiple data items that collide in the cache, i.e. they are mapped to the same set. For example, in tiled matmul, self-interference occurs when data items accessed by the reference B[k][j] collide. This is demonstrated in Figure 2.2a, which shows a snapshot of a 32KB 4-way set-associative L1 cache with 16-element cache lines at the instant when self-interference begins to cause misses during the execution of an array tile. Self-interference is pronounced for problem sizes that are a power of 2. (Problem size refers to the size of the array or matrix in a program; in this chapter, however, we use problem size to refer particularly to the leading dimension of the array, which is critical for tile size selection. For row-major 2D arrays as in C, elements from one row to the next are a leading dimension apart.) This is because the cache size is also a power of 2. In such a case, different rows in the tile are prone to map to the same set. This is seen in Figure 2.2a, where the leading dimension of array B (ldB) is 512 - the 17th row of the tile interferes with the first row of the tile. Thus, for the given cache configuration and problem size, the tile height, or number of rows in the tile, should not exceed 16, or else conflict misses will result. In this example, the tile width was chosen to be 16, the size of the cache line, but the onset of pronounced conflict misses depends primarily on the tile height.

Cross-interference misses occur when two or more references collide in the cache and one of them has self-temporal reuse. For example, in the tiled matmul kernel, cross-interference occurs when data items accessed by the references to arrays A, B and C collide. Figure 2.2b shows a snapshot of the cache demonstrating cross-interference between references to arrays A and B, where the leading dimension of the arrays is 2000. In the figure, we assume that references A[0][0] and B[0][0] map to sets 0 and 6, respectively. However, run-time addresses of the arrays are not known at compile time and thus cross-interference is harder to account for. Among works based on an analytical model, (3) and (4) have attempted to minimize cross-interference misses, but only for direct-mapped caches.

Figure 2.2: (a) Demonstration of self-interference (ldB = 512); (b) Demonstration of cross-interference (ldA = ldB = 2000); [ldA and ldB are leading dimensions of arrays A and B, respectively]
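To make the set-mapping arithmetic behind Figure 2.2a concrete, the small sketch below (not part of the thesis) recomputes it under the stated cache configuration. The 4-byte element size, so that a 16-element line is 64 bytes, is an assumption made here for illustration; with it, the 32KB 4-way cache has 128 sets.

    #include <stdio.h>

    int main(void) {
        const int line_bytes = 64, ways = 4;
        const int sets = 32768 / (line_bytes * ways);   /* 128 sets */
        const int ldB = 512, elem_bytes = 4;            /* assumed element size */
        int lines_per_set[128] = {0};

        /* One cache line per tile row (tile width = 16 elements = 1 line). */
        for (int row = 0; row < 32; row++) {
            long offset = (long)row * ldB * elem_bytes;  /* offset of B[row][0] */
            int set = (int)(offset / line_bytes) % sets;
            lines_per_set[set]++;
            if (lines_per_set[set] > ways)
                printf("row %d maps to set %d, which is already full -> conflict\n",
                       row, set);
        }
        return 0;
    }

With these assumptions the program reports the first conflict at row 16, i.e. the 17th tile row, matching the tile-height limit of 16 derived above.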

2.2 Data Prefetching

Data prefetching involves detecting specific access patterns in the executing program and leveraging them to issue requests for data to be accessed in the future, so as to hide the latency of data access. For example, for the (untiled) matmul kernel shown in Figure 2.1, we see that consecutive elements of array A will be referenced since loop k is the innermost loop, and it corresponds to its fastest running subscript. Thus, future data needed by such a reference is easily predicted, and prefetched for performance improvement. However, matmul is not particularly memory intensive due to available reuse, and thus benefits more from tiling. But not all applications have such inherent reuse. Over the years, data prefetching has proved to be very useful for memory-intensive applications, and thus both hardware and software techniques exist to detect access patterns and thereby issue prefetch requests.

In hardware-based prefetching, special hardware monitors data access patterns to a particular cache and identifies data suitable for prefetching based on the obtained information. Current processors employ multiple hardware prefetchers for streaming as well as strided accesses. Software-directed prefetching, on the other hand, involves the insertion of prefetch instructions into the original code by the programmer or the compiler that request data needed a few iterations later. This distance in the number of loop iterations is called the prefetch distance. Like hardware prefetchers that sit on multiple levels of cache and can prefetch data to those levels, the latest instruction sets provide prefetch intrinsics for prefetching data at different levels of cache.

In software-directed prefetching, it is the responsibility of the programmer to ensure timeliness and prevent redundant prefetches by deciding the data to prefetch and the prefetch distance. The hardware prefetchers usually ensure timeliness through aggressive prefetching, i.e. maintaining a prefetch degree of more than one. For example, the streamer hardware prefetchers on SandyBridge (a multi-core processor) and Xeon Phi (a many-core processor) have prefetch degrees of 2 and 4, respectively, and can maintain a prefetch distance of a maximum of 20 cache lines. Depending on the implementation, software prefetch instructions can also be used to train and thus control the prefetch distance at which the hardware prefetcher operates. Comparing software-directed and hardware-based prefetching, software-directed prefetching has the advantage of being used in a controlled manner, but is associated with an additional instruction overhead which may compete with the gains.

While the above-mentioned facts about prefetching are better known, the impact of other hardware features influencing prefetching is less well understood. We briefly describe this interaction between hardware features and prefetching here to help the understanding of our strategy of coordinated prefetching in Chapter 4. The hardware tracks the outstanding prefetch requests through a buffer or a queue. This hardware structure is called the Line Fill Buffer (LFB) in SandyBridge and the MSHR (Miss Status Handling Registers) file in Xeon Phi, and is responsible for rendering the data prefetch requests non-blocking. The size of the LFB or the MSHR file,

Figure 2.3: Loop fusion among other factors, has a significant bearing on the most appropriate choice of prefetching strategy on a particular architecture. This is because, on Xeon Phi, if the MSHR file is full, the pipeline stalls. On SandyBridge, if the LFB is full, subsequent prefetches/loads enter the load buffer, which when full, stalls the pipeline. Thus, any prefetching strategy must issue prefetch requests in such a way that it leads to minimum contention for this scarce resource - the LFB or the MSHR file. 2.3 Loop Fusion and Polyhedral Compiler Framework Loop fusion groups references to the same data by merging the loop bodies of multiple loop nests into the same nest. For example, in Figure 2.3, the array a is referenced in two different loop nests which leads to access to similar data in both nests. In the fused program, the two references are grouped into the same nest. As a result, the successive uses of the same data (of array a) happen in the same iteration of loop i, as compared to the original program, when such accesses were separated by N loop iterations. This improved temporal locality of data reduces the costly accesses to off-chip memory, amounting to improved performance. In this dissertation, we implement loop fusion within the polyhedral compiler framework, and use our framework to then address loop fusion in real applications. Thus, we next provide a gentle introduction to the polyhedral framework with particular emphasis on aspects relevant to our work in this thesis. The reader is referred to existing literature (5; 6; 7) for more detail on polyhedral compiler frameworks. DEFINITION 1 (Affine Hyperplane). An affine hyperplane is an n - 1 dimensional affine sub-space of an n dimensional space. An affine hyperplane can be viewed as a one-dimensional affine function that maps an n-dimensional space onto a one-dimensional space, or partitions an n-dimensional space into n-1 dimensional slices. Hence, as a function, it can be written as, Φ(~v) = h.~v + c. A hyperplane divides the space into two half-spaces, the positive half-space, and a negative half space. 16 DEFINITION 2 (Polyhedron). A polyhedron is an intersection of a finite number of half- spaces. A polytope is a bounded polyhedron. LEMMA 1 (Affine form of the Farkas lemma). If a non-empty polyhedron is defined by p inequalities or faces,

ak~x + bk ≥ 0, k = 1, p (2.1)

then, an affine form ψ is non-negative everywhere in that polyhedron iff it is a non-negative linear combination of the faces:

p X ψ(~x) ≡ λ0 + λk(ak~x + bk), λ0, λ1, ..., λp ≥ 0 (2.2) k=1

2.3.1 Overview of the Polyhedral Framework

The polyhedral framework for compiler optimizations is a powerful mathematical framework based on parametric linear algebra and integer linear programming. It provides an effective intermediate representation that captures nearly all the complex high level optimizations per- formed by a traditional automatic parallelizing compiler. The polyhedral framework performs loop transformations on a Static Control Part (SCoP), a maximal set of consecutive statements (S1,S2, ..., Sn), where loop bounds and conditionals are affine functions of the surrounding loop iterators and parameters. The iteration domain of the statements within a SCoP can be specified as a set of linear inequalities in the polyhedral framework as shown in Figure 2.4b. This set of linear inequalities defines a polyhedron, with each iteration of a loop represented by an integer point within this polyhedron. With such an abstraction, it is possible not only to obtain the exact dependences between statements, but also to model a composition of complex transformations as a single algebraic operation.

A dependence between two statement instances belonging to statements Si and Sj respec- tively, is represented by a set of equalities and inequalities in the dependence polyhedron,

Si→Sj PeSi→Sj , where e ∈ E is an edge in the Data Dependence Graph (DDG), G = (V,E), with each vertex representing a statement. The dependence polyhedron not only captures the iteration domain of the statements involved in the dependence, but also the exact dependence 17

ф1 --> scalar for (i=0; i parallel loop 0 1 0 0 3 for (j=0; j forward loop S1: B[i][j] = A[i][j] + u1[i]*v1[j] -1 0 1 -1 N 0 -1 1 -1 1 for (i=0; i

S2: x[k] = x[k] + beta* B[l][k]*y[l]; x[i] = x[i] + beta* B[j][i]*y[j]; } /* TS2: (0, i, j) */ for (i=0; i

Figure 2.4: Overview of the Polyhedral Framework information such as which iterations are involved in the dependence. For example, in Figure 2.4c, the equalities express that there is dependence when i=l and j=k. The dependence poly- hedron could express such a precise dependence relation because there existed an affine relation between the iterations and the referenced data. Since the dependence polyhedron captures dependences among all loop iterations for affine programs, the dependence information in the DDG is exact. With this exact dependence infor- mation, the goal is to find a statement-wise multi-dimensional affine function (T ) to represent a composition of loop transformations for the entire SCoP. Each dimension or level of this multi-dimensional affine function is represented by φ(~i) and is defined as follows:

~ 1 2 mS ~ 0 φS(iS) = (cS cS ... cS )(iS) + cS (2.3)

φ_S(~i_S) = (c_S^1  c_S^2  ...  c_S^{m_S}) · ~i_S + c_S^0      (2.3)

~ ~ φSj (t) − φSi (~s) ≥ 0, h~s, ti ∈ PeSi→Sj (2.4) where ~s (source) and ~t (target) are instances of statements Si and Sj, respectively. The above condition implies the loop hyperplane preserves the direction of dependences between any two instances of statements Si and Sj. The above condition (2.4), when expanded becomes,

mS  1 2 j   1 2 mSi  c c ... c ~t − c c ... c ~s ≥ 0, h~s,~ti ∈ P S →S (2.5) Sj Sj Sj Si Si Si e i j

This is, however, non-linear in the unknown co-efficients of the phis and loop index vari- ables. Thus, the affine form of the Farkas lemma is used for linearizing this legality condition as follows.

me  mS   mS  X c1 c2 ... c j ~t − c1 c2 ... c i ~s ≡ λ + λ P k, λ ≥ 0 (2.6) Sj Sj Sj Si Si Si e0 ek e ek k=1

Thus, the non-linear form in terms of the loop variables (Equation 2.5) is now expressed as a non-negative linear combination of the faces of the dependence polyhedron through the use of Farkas lemma. Now, the coefficients (denoting a hyperplane) on the LHS and RHS can be equated to get rid of the loop variables. It is important to note that the legality condition has to be satisfied for every dependence in the SCoP, and thus Farkas lemma is also applied for every dependence. The resulting constraints (linear in the coefficients of the phis) are aggregated. For the purpose of eliminating the Farkas multipliers, Fourier-Motzkin elimination is employed. In the polyhedral model, a fusion partitioning can be represented by a scalar dimension in

1 2 mS the multi-dimensional affine function. For a scalar dimension, cS = cS = ... = cS = 0, ∀ 0 S, and cS determines the partition number that the statement belongs to. The statements with 0 the same value of cS belong to the same partition, or in other words, they are fused at that loop level. For example, for gemver, φ1 represents a scalar dimension, and statements S1 and S2 0 both have cS = 0, and as a result, they are perfectly fused in the transformed gemver code 0 shown in Figure 2.4d. Since, cS is 1 and 2, respectively, for statements S3 and S4, they are 19 distributed in the transformed code. Thus, a multi-dimensional affine transform consists of multiple one-dimensional affine transforms representing legal loop hyperplanes interspersed by scalar dimensions. This multi- dimensional affine transform, because of the convex combination property, allows it to capture a sequence of simpler transformations such as coarse-grained parallelism, loop interchange, skewing, shifting, fusion and tiling. For example, in gemver, it allowed to capture parallelism, interchange and fusion in a single optimization step as seen in Figure 2.4d. Chapter 3

Chapter 3

Tile Size Selection, from Then to Now

3.1 Introduction

Loop tiling (8; 9; 10; 11) is a widely used loop transformation to enhance data reuse in the higher levels of the memory hierarchy. In essence, loop tiling reduces the reuse distance from being a function of the problem size to a function of the tile size, and thus minimizes cache capacity misses. However, tiling leads to non-contiguous data accesses in memory, which increases the probability of conflict misses in the cache, and these conflict misses are a function of the tile size in the tiled code. For example, for a problem size of N=2000, a 2D tiled code of the dsyr2k kernel from the BLAS library (12) with tile size TS1=8x128 performs 2.44x better than the one with tile size TS2=128x8, even though both codes have the same working set size in the L1 cache, when tested on an Intel Xeon processor based on the Sandy Bridge microarchitecture. One of the key reasons for this significant difference in performance is that tile size TS2 incurs pronounced conflict misses - 22 times more than those observed for tile size TS1. Thus, tile size selection is critical to the performance of the tiled code, and an optimal tile size is one which minimizes not only capacity misses but also conflict misses within a tile. In the past, the problem of tile size selection has been attempted using analytical model-driven approaches (13; 14; 3; 15; 16; 4; 17; 18). However, these approaches have proved to be less robust because the models used did not fully capture the interaction between the source program (features like problem size and reuse characteristics of the arrays) and critical features in the modern processor microarchitecture such as multi-level and set-associative caches, and the SIMD unit.

This has resulted in a widening gap between the performance delivered by the best known tile sizes and that achieved by using tile sizes predicted by the previous analytical models. Tile size selection has also been widely studied using empirical auto-tuning as in ATLAS and PHiPAC for linear algebra (19; 20), FFTW and SPIRAL for signal processing (21; 22), ETile (23), and others (24; 25). However, each of these techniques is faced with a large search space of tile sizes when considering multidimensional, non-cubic tiling. For example, the size of the search space for a level-3 BLAS kernel such as gemm scales as n^3, where n is a function of the size of the cache being considered. If reuse opportunity in multiple levels of cache must be considered, the search space expands significantly for the larger caches lower in the memory hierarchy. As a result, the adopted approach is either too time consuming (19; 21; 20; 22; 24) or less accurate when heuristics are used to reduce the search space (25). In this work, we propose a new analytical model for selecting tile sizes on modern processors. The Tile Size Selection (TSS) algorithm proposed in our model chooses a tile size such that the following 4 objectives are met:
• it leverages the high set-associativity in modern caches to minimize interference and yield stable performance for all ranges of problem sizes in a variety of source programs.
• it considers data reuse at multiple levels of cache, which further boosts the performance of the tiled code.
• it considers the interaction of tiling with the SIMD unit on the host architecture and chooses a tile size that best benefits from it.
• it considers the impact of tiled code execution in a multithreaded environment (both chip multiprocessing and simultaneous multithreading environments) and achieves good performance.
Although the analytical model developed in this work chooses a particular tile size, it could also be used in conjunction with an auto-tuning framework to prune or navigate the search space. We implement the TSS algorithm within a source-to-source polyhedral compiler framework, PLuTo (26), which automatically tiles the source programs. For the generated tiled code, PLuTo chooses a cubic tile size by default. The TSS algorithm, instead, estimates the optimal tile size, which is then used by PLuTo to generate the corresponding tiled code. We tested our model on 12 benchmarks comprising a mix of linear algebra, data-mining and stencil kernels, all of which are known to benefit significantly from tiling. We tested each kernel with 2 different problem sizes on two different machines with different microarchitectures. Experimental results show that accommodating the impact of multi-level caches and the SIMD unit within the analytical model adds significant performance to the tiled code, and the tile size thus chosen is similar to that obtained from an exhaustive search for the best tiled code in a constrained search space. In comparison to the best square (cubic) tiled code for the test benchmarks, our tile size selection algorithm chooses tile sizes that perform 9.7% and 20.4% faster on average, respectively, for the 2 problem sizes. We also show that our model achieves good performance for tiled codes run in a multithreaded environment using either the chip multiprocessing or the simultaneous multithreading technology. The rest of this chapter is organized as follows.
Section 3.2 re-evaluates the problem of tile size selection from the point of view of a compiler via a case study and reinforces the motivation for this work. Section 3.3 discusses the multiple factors that influence tile size selection and explains our approach to accommodate them in our tile size selection model. In Section 3.4, we describe our algorithm for estimating the optimal tile size and discuss its scope. Section 3.5 describes the experimental setup and the benchmarks used for our experiments. We discuss the experimental results in Section 3.6. The related work is presented in Section 3.7. Finally, we conclude in Section 3.8.

3.2 Motivation

Tile size selection has been deemed a complex problem from the point of view of a compiler. In this section, we re-evaluate the major reasons behind such a view and present a viable alternative to tackle the problem of tile size selection at compile time.

3.2.1 Modeling the effect of set-associativity.

An analytical model for tile size selection must choose a tile size such that the conflict misses are minimized. Past works did not consider set-associative caches and proposed probabilistic approaches (3; 4; 27; 28; 29) for minimizing conflicts among array references. In such a sce- nario, a conservative approach based on eliminating conflict misses as in (3; 4) results in tiles with undesirable shapes (too “skinny” or too “fat” tiles) for some problem sizes, leading to sub- optimal performance in those cases. On the other hand, an optimistic approach as in (27; 28; 29) is based on ignoring the non-unity set associativity of caches. This too gives below optimal per- formance because multiple cache lines accessed in the source program may map to the same set of an n-way set associative cache, causing conflicts more likely than that predicted by such an optimistic analytical model. 23 With multiple array references within the loop body of a source program, modeling the effect of set associativity on the execution time behavior of a tiled code has been considered non- trivial. We re-evaluate this problem in the example of matrix-multiplication (matmul) kernel shown in Figure 3.1.

Figure 3.1: Source codes of matmul and jacobi-2d, before and after tiling through PLuTo
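The tiled loop structures referred to in Figure 3.1 have roughly the following shape (an illustrative sketch written for this text, not the exact code PLuTo generates; array declarations, clean-up code and min/max loop guards are omitted, and N is assumed to be a multiple of the tile sizes):

/* 2D-tiled matmul, C[i][j] += A[i][k] * B[k][j]; loop i is left untiled. */
void matmul_2d_tiled(int N, int J, int K,
                     double A[N][N], double B[N][N], double C[N][N]) {
    for (int jT = 0; jT < N; jT += J)
      for (int kT = 0; kT < N; kT += K)
        for (int i = 0; i < N; i++)               /* outermost untiled loop i       */
          for (int k = kT; k < kT + K; k++)
            for (int j = jT; j < jT + J; j++)     /* innermost (vector) loop        */
              C[i][j] += A[i][k] * B[k][j];       /* tile of B (K*J) reused over i  */
}

/* 3D-tiled matmul: loop i is also tiled with size I, so a tile of C (I*J) can
   additionally be reused across iterations of the innermost inter-tile loop kT. */
void matmul_3d_tiled(int N, int I, int J, int K,
                     double A[N][N], double B[N][N], double C[N][N]) {
    for (int iT = 0; iT < N; iT += I)
      for (int jT = 0; jT < N; jT += J)
        for (int kT = 0; kT < N; kT += K)         /* innermost inter-tile loop kT   */
          for (int i = iT; i < iT + I; i++)       /* outermost intra-tile loop i    */
            for (int k = kT; k < kT + K; k++)
              for (int j = jT; j < jT + J; j++)
                C[i][j] += A[i][k] * B[k][j];     /* tile of C (I*J) reused over kT */
}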

In 2D-tiled matmul, the maximum working set that should fit the cache comprises all the data accessed in a single iteration of the outermost untiled loop i, as shown in Figure 3.1b. Thus, the working set consists of a tile row each of arrays A and C (i.e. K and J elements, respectively), and an entire tile of array B (i.e. K*J elements). (If an LRU cache replacement policy is considered instead of an optimal replacement policy, the working set will also include another tile row of array C and an element of array A (30).) Clearly, the working set is dominated by the data accessed by array reference B. Further, if this working set fits a particular level of cache, the entire tile of array B can be reused in each iteration of loop i. Thus, for matmul (and similar kernels that allow data reuse), the task of modeling the effect of set-associativity for multiple array references in the source program is reduced to only those references with temporal reuse in the outermost loop, such as the array reference B. Therefore, in a set-associative cache, most ways can be assigned to such a reference and the effect of set-associativity modeled only for that reference, independent of the other array references. However, if there are multiple array references with temporal reuse in the outermost loop (as in the jacobi-2d kernel shown in Figure 3.1d, which has 2 such references, A[i][j] and B[i][j]), the available ways in each set can be equally partitioned among them and the effect of set-associativity modeled for any one of them, as all such references exhibit similar runtime behavior.

3.2.2 Tapping into data reuse at multiple levels of cache.

With multi-level caches on modern processor architectures, tapping into the reuse opportunities at the different levels of cache is important for effective tiling. For example, the best 2D tile for matmul performs 13% worse than the best 3D tile that considers data reuse in the L2 cache on an Intel Sandy Bridge core for a problem size of 2000. The possibility of reusing data in different levels of cache arises from the fact that different array references have different reuse distances. For example, in the 3D-tiled matmul shown in Figure 3.1c, each tile of array B is reusable in every iteration of loop i, whereas each tile of array C is reusable in every iteration of loop kT. While reusing data in multiple levels of cache is important, past works have assumed 2D tiling and determined tile sizes that target data reuse in a single level of cache, generally the L1 cache. If data reuse in caches that are lower in the memory hierarchy (such as the L2 and L3 caches) must be considered, the impact of set-associativity assumes greater significance, as those caches usually have higher set-associativity than the L1 cache. Hence, ignoring set-associativity will lead to suboptimal cache utilization and limited performance gains.

Figure 3.2: Normalized metrics for matmul (N=2000); L1 size = 32KB; L2 size = 256KB

Figure 3.2 substantiates the various arguments made in this section to motivate our work. It plots 3 different metrics - execution time, L2 data cache misses and L1 data cache misses - for the matmul benchmark tiled in 3 dimensions with 9 different tile sizes. The experiments were all performed on an Intel Xeon processor (E5-2650, 2.0 GHz) with a 32KB private L1 cache, a 256KB private L2 cache and a 20MB L3 cache shared by 8 Sandy Bridge cores. In each tiled code, the outermost dimension was fixed at 128 to illustrate the performance impact of L2 cache misses. For each metric, the figure plots values normalized within the range 0...1. Normalized values for each metric are computed using the standard min-max normalization, (x-min)/(max-min), where x, min and max are the absolute, minimum and maximum values, respectively, of that particular metric. Each tiled code uses values of J and K such that the tile maximally occupies the L1 cache without incurring conflict misses. The metric L1 DCM* shows the L1 cache misses for tile sizes that are greater than the 9 sizes shown on the x-axis by just 1 cache line in the K dimension. From the figure, the following observations can be made.
1. The working set for each of the 9 maximal tiled codes is nearly 24 KB (the size of the working set in the L1 cache, as derived earlier in Section 3.2.1, is given by K*J + 2*J + K + 1), which is well within the L1 cache size of 32KB. A small increase in tile size leads to a significant increase in the L1 cache misses, as shown by the metric L1 DCM* in the figure. This increase in the L1 cache misses is attributed to the conflicts that occur in the L1 cache even though the cache capacity is not reached. The increase in L1 cache conflict misses is reflected in the increased execution time as compared to that of the maximal tiles, as shown in the figure through the metric Ex. Time*. Thus, it is critical to consider the impact of set-associativity when determining the optimal tile size.
2. With the L1 misses held nearly constant for the 9 maximal tiled codes, as shown by the metric L1 DCM in the figure, the execution time follows the pattern of L2 cache misses, with a sharp increase in the execution time beyond J = 152. This is because, beyond this point, the working set comprising the data accessed within each iteration of the kT loop overflows the L2 cache, leading to loss of reuse of the data accessed by reference C. Thus, it is critical to choose a tile size that targets reuse of data in the L2 cache in addition to the L1 cache.
A similar analysis was used to motivate the need to consider data reuse in multiple levels of the memory hierarchy in (18). The algorithm presented in this chapter emulates the execution-time behavior of references with reuse in a particular level of cache by considering the cache size, line size and set-associativity. The algorithm considers data reuse in multiple levels of cache to provide an estimate of the optimal tile size. The details of our algorithm are presented in Sections 3.3 and 3.4.

3.3 Our Approach

In this section, we discuss the various factors that influence tile size, and our approach for accommodating them in our algorithm. For this purpose, we first consider linear algebra and data mining kernels that have considerable reuse and thus benefit from tiling. The factors affecting tile size selection in such kernels are considered in Sections 3.3.1 through 3.3.4 through the example of matmul introduced in Section 3.2. The conclusions presented are, however, applicable to a variety of kernel programs that allow data reuse and thus benefit significantly from the tiling transformation. Although the example of matmul, which has a 3D loop nest, is considered, the approach is applicable to nD loop nests as described in Section 3.4.1. It must be noted that we assume single-level nD tiling (i.e. n inter-tile loops and n intra-tile loops) for a source code that has n loops, and still achieve data reuse in multiple levels of cache - the array reference(s) that has reuse in the outermost intra-tile loop is reused in the L1 cache and the array reference(s) that has reuse in the innermost inter-tile loop is reused in the L2 cache. In other words, we do not consider multi-level tiling to achieve data reuse in multiple levels of cache. Apart from linear algebra and data mining kernels, stencils are another class of codes that benefit significantly from tiling. In stencils, tiling is performed after loop skewing, and is called time skewing (31). The factors affecting tile size selection in stencils are considered separately in Section 3.3.5.

3.3.1 Data reuse in L1 cache.

Data reuse in L1 cache, that is the fastest among all levels of cache, is critical to performance of a tiled code. For the 3D-tiled matmul shown in Figure 3.1c, data reuse in the L1 cache is achieved by reusing the data accessed within a tile of array B in every iteration of the outermost intra-tile loop i. This requires the tile dimensions (K,J) to be chosen such that data locality for array reference B is ensured between successive iterations of loop i. We define two metrics to analyze the impact of tile size selection on data reuse: (1) Total Cache Misses (TCM) - The total number of misses incurred in a particular level of cache for the entire execution of the program. (2) Reuse Ratio (RR) - The ratio of the amount of reusable data to the total working set size for a particular level of cache. TCM for L1 cache as a function of the tile dimensions (I, J, K) is calculated as follows: 27 The number of cold misses incurred in the first iteration of loop i equals the number of cache lines accessed in each iteration of loop i, i.e. a tile (J ∗ K elements) of array B, and a tile row each of arrays A and C (K and J elements, respectively). It is given by

$$K \cdot \left\lceil \frac{J}{CLS} \right\rceil + \left\lceil \frac{K}{CLS} \right\rceil + \left\lceil \frac{J}{CLS} \right\rceil \tag{3.1}$$

where CLS is the cache line size (in elements). Since our goal is to reuse a tile of array B in each iteration of loop i, we assume that the tile size is such that there are no capacity or conflict misses that cause the data accessed within a tile of array B to be evicted from the cache. Thus, the total number of cache misses for the I iterations of loop i equals

$$K \cdot \left\lceil \frac{J}{CLS} \right\rceil + I \cdot \left\lceil \frac{K}{CLS} \right\rceil + I \cdot \left\lceil \frac{J}{CLS} \right\rceil \tag{3.2}$$

Thus, TCM for the L1 cache equals

$$\left( K \cdot \left\lceil \frac{J}{CLS} \right\rceil + I \cdot \left\lceil \frac{K}{CLS} \right\rceil + I \cdot \left\lceil \frac{J}{CLS} \right\rceil \right) \cdot \frac{N^3}{I \cdot J \cdot K} \tag{3.3}$$

where N^3 / (I*J*K) is the number of iterations of the inter-tile loops. This, when simplified (assuming I, J and K are multiples of CLS), becomes

$$\left( \frac{1}{I} + \frac{1}{J} + \frac{1}{K} \right) \cdot \frac{N^3}{CLS} \tag{3.4}$$

RR for the L1 cache is defined as the ratio of the reusable data (i.e. J ∗ K elements) to the total working set size at L1 cache. The total working set size at L1 cache can be calculated as the sum of Minimum Working Set Lines (18) and the correction for the cache replacement policy (as discussed in Section 3.2). It is given by 28

K ∗ J + 2 ∗ J + K + 1 (3.5)

Thus, RR for L1 cache is

$$\frac{K \cdot J}{K \cdot J + 2 \cdot J + K + 1} \tag{3.6}$$
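As a small illustration of how these expressions can be evaluated for candidate tile dimensions (a sketch written for this text, not part of the TSS implementation; the dimensions in main are arbitrary examples), the simplified L1 TCM of Equation 3.4 and the L1 reuse ratio of Equation 3.6 can be computed as follows:

#include <stdio.h>

/* Evaluates Equations 3.4 and 3.6 for candidate tile dimensions (I, J, K);
   N is the problem size and CLS the cache line size in elements. */
static double tcm_l1(double I, double J, double K, double N, double CLS) {
    return (1.0/I + 1.0/J + 1.0/K) * (N*N*N) / CLS;    /* Equation 3.4 */
}
static double rr_l1(double J, double K) {
    return (K*J) / (K*J + 2.0*J + K + 1.0);            /* Equation 3.6 */
}

int main(void) {
    /* Arbitrary example: N = 2000, double-precision data, 64-byte lines => CLS = 8. */
    printf("TCM_L1 = %.3g, RR_L1 = %.3f\n",
           tcm_l1(128, 104, 32, 2000, 8), rr_l1(104, 32));
    return 0;
}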

From Equations 3.4 and 3.6, two inferences can be drawn: 1. each of the tile dimensions should be as large as possible for effective data reuse and minimization of cache misses, and 2. RR for the L1 cache is close to 1, which implies that the array reference B dominates the working set and would occupy most of the ways in a set-associative cache.

3.3.2 Data reuse in L2 cache.

Data reuse in the L2 cache is critical as L2 misses are more costly than L1 misses. Data reuse in L2 cache is achieved by reusing the data accessed within a tile of array C in every iteration of the innermost inter-tile loop kT . We again consider the TCM and RR for the L2 cache to analyze the impact of tile size selection on data reuse. TCM for L2 cache as a function of the tile dimensions, can be calculated as follows. The number of cold misses incurred in the first iteration of loop kT equals the number of cache lines accessed in each iteration of loop kT , i.e. a tile each of arrays A, B and C (I ∗ K, K ∗ J and I ∗ J elements respectively).

$$I \cdot \left\lceil \frac{K}{CLS} \right\rceil + K \cdot \left\lceil \frac{J}{CLS} \right\rceil + I \cdot \left\lceil \frac{J}{CLS} \right\rceil \tag{3.7}$$

Since our goal is to reuse a tile of array C in every iteration of loop kT, we assume that the tile size is chosen such that there are no capacity or conflict misses that cause the data accessed within a tile of array C to be evicted from the cache. Thus, the total number of cache misses for the N/K iterations of loop kT equals

$$\frac{I \cdot N + J \cdot N + I \cdot J}{CLS} \cdot \frac{N^2}{I \cdot J} \tag{3.8}$$

which, when simplified, becomes

$$\left( \frac{1}{J} + \frac{1}{I} + \frac{1}{N} \right) \cdot \frac{N^3}{CLS} \tag{3.9}$$

RR for the L2 cache is defined as the ratio of the reusable data (i.e. I ∗ J elements of array C) to the total working set size at L2 cache. The total working set size at L2 cache can be calculated as the sum of number of Distinct Lines (32) accessed within the intra-tile loops and the correction for the cache replacement policy. This allows L2 reuse in the innermost inter-tile loop. The working set at L2 cache is given by

(I + 1) ∗ K + 2 ∗ K ∗ J + I ∗ J (3.10)

Thus, RR for L2 cache is

$$\frac{I \cdot J}{(I + 1) \cdot K + 2 \cdot K \cdot J + I \cdot J} \tag{3.11}$$
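Analogously to the L1 sketch above (again an illustration written for this text, not part of the implementation, with arbitrary example dimensions), Equations 3.9 and 3.11 can be evaluated as:

#include <stdio.h>

/* Evaluates Equations 3.9 and 3.11 for candidate tile dimensions (I, J, K). */
static double tcm_l2(double I, double J, double N, double CLS) {
    return (1.0/J + 1.0/I + 1.0/N) * (N*N*N) / CLS;     /* Equation 3.9  */
}
static double rr_l2(double I, double J, double K) {
    return (I*J) / ((I + 1.0)*K + 2.0*K*J + I*J);       /* Equation 3.11 */
}

int main(void) {
    /* Arbitrary example: N = 2000, CLS = 8 (doubles, 64-byte cache lines). */
    printf("TCM_L2 = %.3g, RR_L2 = %.3f\n",
           tcm_l2(128, 104, 2000, 8), rr_l2(128, 104, 32));
    return 0;
}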

From Equations 3.9 and 3.11, two inferences can be drawn: 1. the tile dimensions I and J should be as large as possible for effective data reuse and minimization of L2 cache misses, and 2. unlike the L1 cache, the value of RR for the L2 cache cannot be easily estimated.

3.3.3 Data reuse in both L1 and L2 cache

The Tile Size Selection (TSS) algorithm presented in this chapter emulates the execution-time behavior of a single reference in a particular level of cache by considering the size of the cache, its line size and its associativity. The algorithm thus aims to minimize conflict misses at that level of cache to enable effective data reuse. For the L1 cache, clearly, the reference(s) with temporal reuse in the outermost intra-tile loop, such as the array reference B in matmul, dominates the working set and occupies most ways in a set-associative cache. Thus, the TSS algorithm (described later in Section 3.4) works on the assumption that the array reference B occupies all but 1 way of the L1 cache and returns a list of (K,J) tuples, all of which correspond to maximal tiles, i.e. tiles whose sizes can be increased no further without incurring conflict misses. The best (K,J) tuple and the size of the I dimension are chosen by considering further criteria.

Figure 3.3: Choosing a tile size that exploits data reuse in both L1 and L2 cache

We concluded from Equation 3.9 that tile dimensions I and J should both be large for effective data reuse in the L2 cache. Further, we observe from Figure 3.3 that the product J * K is nearly constant in each of the (K,J) tuples that minimize conflict misses in the L1 cache. Thus, for J to be large, a correspondingly smaller K must be chosen. With a large J (and a large I) and a small K, it can be concluded that the RR for the L2 cache as given by Equation 3.11 will also tend towards 1. This shows that the working set that should fit the L2 cache is also dominated by a single array reference, C, that has reuse in the innermost inter-tile loop, kT. In such a scenario, the TSS algorithm assumes that the array reference C occupies all but 1 of the available ways in the L2 cache, and similarly yields a list of (I,J) tuples, all of which correspond to maximal tiles in the L2 cache. From the list, the tuple (I,J) that minimizes 1/I + 1/J is selected in accord with Equation 3.9 (marked as step 1 in Figure 3.3). For the chosen J, a corresponding K is chosen from the list of tuples (K,J) obtained earlier (marked as step 2 in the figure). Thus, the 3 tile dimensions (I, J, K) are determined using the TSS algorithm that considers reuse at both the L1 and L2 cache (marked as step 3 in the figure). Although we have only considered reuse in the L1 and L2 caches in this work, the latest microarchitectures employ a shared L3 cache, and a similar analysis can be extended to analyze reuse in the L3 cache. The selection steps are sketched below.
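A compact rendering of this three-step selection (an illustrative sketch in C written for this text; the tuple lists are assumed to come from the cache emulation of Algorithm 1, and the real implementation sits inside the TSS/PLuTo framework):

/* Step 1: from the L2 list of maximal (I, J) tuples, pick the one minimizing 1/I + 1/J.
   Step 2: from the L1 list of maximal (K, J) tuples, pick the K paired with the chosen J.
   Step 3: (I, J, K) is the tile size estimate considering reuse in both L1 and L2. */
typedef struct { int a, b; } tuple_t;   /* a = first dimension (I or K), b = J */

static void choose_tile(const tuple_t *l2, int n2, const tuple_t *l1, int n1,
                        int *I, int *J, int *K) {
    double best = 1e30;
    for (int i = 0; i < n2; i++) {                      /* step 1 */
        double cost = 1.0 / l2[i].a + 1.0 / l2[i].b;    /* Equation 3.9 criterion */
        if (cost < best) { best = cost; *I = l2[i].a; *J = l2[i].b; }
    }
    *K = -1;
    for (int i = 0; i < n1; i++)                        /* step 2 */
        if (l1[i].b == *J) { *K = l1[i].a; break; }     /* K for the chosen J */
}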

3.3.4 Interaction with vectorization.

Short-vector SIMD instruction sets such as AltiVec and SSE (and now AVX) have proved to be promising for enhancing performance on modern processors. It is thus critical to consider the interaction of tiling with the SIMD unit on the host architecture. As a result of loop tiling, all loops, including the vector loop, are strip-mined, i.e. they are fragmented into smaller segments or strips. While this helps data reuse, it reduces the length of the vector pipeline. Thus, strip-mining the vector loop deprives the vector unit of the fodder it needs for instruction-level parallelism (ILP). This effect is more pronounced in the latest microarchitectures, as they have larger vector registers requiring longer loops, and multiple vector load/store units that reveal more opportunity for ILP. To make the best use of a bad bargain, the tile size should be chosen such that, while data reuse is exploited, losses from a reduced vector pipeline length are minimized. This is ensured when the tile dimension corresponding to the vector loop is large, but only large enough that the tile does not cause interference misses. We evaluate this interplay between vectorization and data reuse in 3 different scenarios, shown in Figure 3.4, where loops j, i and k are the vector loops in the matmul, trisolv and strsm benchmarks, respectively. In all 3 benchmarks, the inter-tile loops are in the same order (iT-jT-kT), but the order of intra-tile loops is different in each case and is decided by the PLuTo compiler such that efficient vectorization can be achieved. The PLuTo compiler applies the loop interchange transformation within the polyhedral framework to achieve this, and as a result, any of the 3 loops could be made the innermost (vector) loop. We thus consider all 3 possible scenarios. According to Equation 3.11, tile dimensions I and J should be large for effective data reuse in the L2 cache. This further implies that the tile dimension K should be small for effective data reuse in the L1 cache. Thus, in the matmul and trisolv benchmarks, where loops j and i are the vector loops respectively, the goals of achieving efficient vectorization and data reuse are concomitant. However, this is not the case for the strsm benchmark, where loop k is the vector loop.

Figure 3.4: Vectorization, reuse and tile size

Choosing a small value of K would severely hurt vectorization. On the other hand, although choosing a large value of K prevents data reuse in the L2 cache, data reuse within the L1 cache can still be achieved in addition to effective vectorization. Empirical verification shows that in such cases, 2D tiling that aims at data reuse in only the L1 cache and effective vectorization gives the best performance. This is incorporated in our framework by choosing the largest rectangular tile from the list of tile size tuples returned by our algorithm for the L1 cache; the outermost tile dimension is effectively left untiled by choosing its size to be the same as the problem size. Also, since the TSS algorithm chooses each tile dimension to be a multiple of the cache line size (for achieving spatial locality), the tile dimension corresponding to the vector loop is ensured to be a multiple of the vector size. This leads to effective vectorization.

3.3.5 Interaction with Time Skewing.

For stencil codes such as jacobi-2d (shown earlier in Figure 3.1d), time skewing can significantly improve temporal locality. For time-skewed codes, however, estimating the optimal tile size is more involved because of the shifting tiles as shown in Figure 3.5 - the tile space shifts by the corresponding skewing factors (SH and SW in case of 2-D tiles) in each dimension for each iteration of the tiled time loop (loop t in Figure 3.1e). This leads to the following essential differences in the case of time-skewed codes: 1. In non-time-skewed codes, adjacent tiles access distinct data (except for some possible 33

Figure 3.5: Illustration of skewed-tile traversal overlap at tile boundary). However, as a result of shifting tiles in time-skewed codes, there is an opportunity for reusing common data between adjacent tiles. For example, tilei,j and tilei,j+1 access common data for t = 1, shown by the darkly shaded region in

Figure 3.5. Thus, if the data accessed by tilei,j in all T (length of the tiled time loop, t) time steps fits the cache, the common data can be reused by tilei,j+1 in all T iterations. In such a scenario, the only misses incurred by tilei,j+1 for t = 1 (or for t = k, in general) are due to the new data accessed as shown by the unshaded region in tilei,j+1,t=1 in the

figure. Thus, the number of misses is given by W*SH/CLS, where CLS is the cache line size in terms of the number of elements. The total misses incurred in time steps 1 through T

are given as W*SH*T/CLS. Similarly, the total misses incurred in the program, which has a total of (N/W)*(N/H) tiles and iterates for T0 time steps, are given as

$$\frac{S_H \cdot T_0 \cdot N^2}{CLS \cdot H} \tag{3.12}$$

Since the total cache misses are inversely related to the tile height, the TSS algorithm 34 chooses a tile with largest height such that the tile width is equal to a single cache line, from the list of tile size tuples that minimize interference in L1 cache. This ensures that

tilei,j for any t(= k) fits the L1 cache. However, to enable inter-tile data reuse in L2

cache, the data accessed by tilei,j in time steps 1 through T should fit the L2 cache. In

addition, more data is brought into the cache by tilei,j+1, with the possibility of interfer- ence in the L2 cache. As can be seen from the figure, any two consecutive tiles such as

tilei,j and tilei,j+1 access data within a block of dimensions (2*W+T*SW )x(H+T*SH ), for all T iterations of the tiled time loop. Thus, the TSS algorithm assumes the worst case scenario, and chooses T such that this block fits in the L2 cache. This also minimizes interference in the L2 cache. With W and H known, the value of the tile in the time dimension, T , is thus calculated, and the tile chosen achieves reuse both in L1 and L2 caches. 2. The discussion above favors tiles with smaller widths to achieve data reuse in the L2 cache for stencil codes. This, however, is unproductive for efficient vectorization as ex- plained. Thus, if a stencil code is vectorizable in the innermost loop, the two factors conflict with each other. Empirical results show that for stencil codes that are not vec- torizable because of dependences such as seidel, smaller tile widths perform better, and vice-versa for vectorizable stencils such as jacobi and fdtd. Thus, for vectorizable sten- cils, we choose tiles with larger widths (such that the tile height is no less than a cache line) in favor of effective vectorization. In such cases, since inter-tile data reuse in the L2 cache is not achieved, the height and width of the tile are chosen such that a single tile fits L2 cache and the time dimension is chosen to be the total number of time steps. 3.3.6 Other factors that influence tiling.

Interaction with multi-level TLB. The authors in (29) argue that in addition to data reuse in the L1 cache, the tile size chosen must also minimize TLB misses. Their rationale was based on the assumption that TLB misses are much more costly than cache misses (and also possibly quite frequent given a small TLB). However, recent architectures employ a 2-level TLB. For example, Intel's Sandy Bridge has a 64-entry L1 TLB and a 512-entry L2 TLB, while there was just a single-level 64-entry TLB in Intel's Netburst microarchitecture. On Sandy Bridge, which uses a 4-level page table, the cost of a page walk is also reduced to merely 30 cycles, which is the same as the latency of an L2 cache miss. In addition, tiling compilers such as PLuTo (used for this work) permute the intra-tile loops such that most references achieve spatial locality and effective vectorization in the innermost loop. This also minimizes TLB misses, as for such references consecutive elements reside on the same page. It is for these reasons that empirical results reveal minimal impact of TLB misses on tile size selection.

Interaction with shared caches. Chip Multiprocessing (CMP) and Simultaneous Multithreading (SMT) are the two techniques currently employed to extract the inherent instruction-level parallelism in programs run in a multithreaded environment. In a multithreaded environment, multiple cores may bring in different data to the shared cache on a chip multiprocessor, and similarly, multiple threads may bring in different data to the private cache on a core using the SMT technology. We account for this in our algorithm by adjusting the set-associativity to 1/T of the actual value, where T is the number of threads/cores that share the cache. This is done to prevent any cache conflicts even in the worst case, when all T threads/cores execute different tiles and thus bring different data into the cache.

Data re-layout of the tiles, or to copy or not to copy. Copy optimization, or copying, is a technique to adjust the data layout for reducing cache conflicts within a tile, wherein a tile is copied into a temporary linear array and copied back to its original memory location after execution. As a result, all elements within a tile are mapped to contiguous locations, instead of disparate locations in the cache. This data re-layout improves the cache behavior by minimizing conflict misses (33). If we thus re-layout the tiled arrays, the tile size is then largely governed by cache capacity and is easily estimated (30). However, production-quality compilers such as GCC and ICC do not support such data re-layout because (1) the additional cost of copying may offset the savings, (2) there is a need to write complex clean-up code when the problem size is not a multiple of the tile size, and (3) the complexity of the transformation itself, i.e. it must be determined what to copy and when to copy. Another important reason is that copying cannot be performed for stencil codes, as the tiles shift in every iteration of the tiled time loop. The TSS algorithm avoids copying. It instead achieves the purpose of minimizing interference by analyzing the source program and its interaction with the architecture and the compiler. It thus saves the overhead of copying and relieves the compiler of the burden of performing the complex copy optimization.

Problem size and array padding. As discussed earlier in Chapter 2, conflict misses are pronounced for problem sizes that are a power of 2.
Such problem sizes are called pathological problem sizes, as the various algorithms employed for tile size selection yield either too skinny or too fat tile sizes, which result in suboptimal performance. Array padding is a technique proposed in previous works (27; 28) to handle such cases, where the problem size (particularly, the leading dimension) of the arrays is increased to prevent conflicts in the cache and achieve reasonable performance even for the pathological problem sizes. As for copying, compilers such as GCC and ICC do not support padding because (1) it transforms the array structure, which must be reflected in the entire code, and (2) it cannot be directly incorporated into a library routine as the problem size is not known a priori; copying the arrays into another array with padding helps, but copying has its own disadvantages and overheads.

3.4 The Tile Size Selection (TSS) Algorithm

As described in Sections 3.3.1 and 3.3.2, only references such as the array reference B for the L1 cache (and reference C for the L2 cache) in the example of matmul access reusable data and dominate the working set in the respective caches. For a given source program and cache parameters, the TSS algorithm emulates cache behavior for one such array reference (say, k) during tile execution. For the emulation, the reference k is allowed to occupy n - 1 ways of an n-way set-associative cache, while the remaining references, which have small contributions to the working set, are assumed to occupy the remaining 1 way of the cache. If there are K such array references, cross-interference among them is prevented by assuming the effective set-associativity of the cache to be ne = floor(n/K - 1) instead of the actual value of n. This emulation is done for different values of the tile width that are multiples of the cache line size, and the algorithm returns, in each case, a corresponding tile height at which cache conflicts ensue. Thus, a list, L, of tile sizes that aim to minimize interference and maximize cache usage is generated. The same algorithm is used to determine the lists of tile size tuples (K,J) and (I,J) for minimizing conflicts in the L1 and L2 cache, respectively. The estimate of the best tile size (I, J, K) is determined by further criteria as described earlier in Section 3.3.3. The algorithm emulates the fetching of cache lines into the cache for the array reference k. The emulation is done using cache_emu, an array whose every element stores the count of the number of occupied ways in a set of the cache. The elements of cache_emu are initialized to a count of 0, indicating that no way of the set is occupied. The count (of a set) is increased to emulate the fetching of a cache line into the set. A row in the tile maps to a set given by nset, as shown in the algorithm below.

(Here, the algorithm is presented for the case of 2D tiles for understandability; it is extended to the general case of nD tiles in Section 3.4.1. Width, in general, represents the size of the tile in the leading dimension.)

ALGORITHM 1: Tile Size Selection
INPUT:
  Size of cache: cSize
  Set-associativity of cache: n
  Size of cache line: lSize
  Leading dimension, or problem size: N
  No. of arrays with self-temporal reuse in the outermost loop: K
  No. of active threads per core: T
  No. of elements in a cache line: CLS = lSize / size(DataType)
  No. of sets in cache: nSets = cSize / (lSize * n)
  Effective set-associativity: ne = floor(n / (K * T) - 1)
Step 1:
  for w = 1 to min(cSize/lSize, N/CLS) do      /* w is tile width in number of cache lines */
    r = 0
    repeat                                     /* iterates over rows in a tile */
      nset <- floor(r * N / CLS) mod nSets
      for c = 0 to w do                        /* brings a tile-row into cache */
        if cache_emu[nset + c] = ne then
          h = r                                /* tile height is maximum rows in tile */
          break                                /* interference begins at this point */
        else
          cache_emu[nset + c]++
      r++
    until r = N
    if h < CLS then
      add the tile size (h, w * CLS) to list L
    else
      add the tile size (floor(h / CLS) * CLS, w * CLS) to list L
OUTPUT: List of tile size estimates, L

Since the memory addresses wrap around the cache, our algorithm can safely assume that the first element of an array maps to the first line (set) of the cache. Henceforth, for a chosen width of w cache lines in a tile-row, the algorithm brings the lines comprising a tile-row into the cache. The cache is thus filled row-wise until all the available ways (given by ne) in a set of the cache are filled, and interference begins to replace reusable data in the cache. Thus, the tile height (h = r, the number of rows in the tile at this point) for the chosen tile width is obtained such that there are no interference misses. The outermost for loop over w thus generates a list of tile sizes, L. Each tile size is the tuple (h, w * CLS) for a chosen value of w. The tile height in each tuple is truncated to the nearest multiple of the cache line size for efficient utilization of spatial reuse.
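A compact C rendering of this emulation (a sketch written for this text under the stated assumptions - row-major storage, the first array element mapping to set 0, ne at least 1, and the loop bounds as written in Algorithm 1 - rather than the exact implementation inside the PLuTo-based framework) is shown below.

#include <stdlib.h>

/* Illustrative sketch of Algorithm 1: emulate row-wise filling of a set-associative
   cache for the reference with self-temporal reuse, and record maximal (height, width)
   tile sizes that incur no interference. Returns the number of candidates written to L. */
typedef struct { int h, w; } tile_t;

static int tss_emulate(int cSize, int n, int lSize, int elemSize,
                       int N, int K, int T, tile_t *L, int maxL) {
    int CLS   = lSize / elemSize;          /* elements per cache line            */
    int nSets = cSize / (lSize * n);       /* number of sets                     */
    int ne    = n / (K * T) - 1;           /* effective associativity (>= 1)     */
    int count = 0;
    for (int w = 1; w <= cSize / lSize && w <= N / CLS && count < maxL; w++) {
        int *cache_emu = calloc(nSets, sizeof(int));   /* occupied ways per set  */
        int h = N;                          /* if no interference occurs, the full
                                               extent fits                        */
        for (int r = 0; r < N; r++) {       /* rows of the tile                   */
            int nset = (r * N / CLS) % nSets;
            int conflict = 0;
            for (int c = 0; c <= w; c++) {  /* bring one tile-row in (bounds as in
                                               Algorithm 1)                       */
                int s = (nset + c) % nSets; /* addresses wrap around the cache    */
                if (cache_emu[s] == ne) { h = r; conflict = 1; break; }
                cache_emu[s]++;
            }
            if (conflict) break;
        }
        free(cache_emu);
        L[count].h = (h < CLS) ? h : (h / CLS) * CLS;  /* truncate to line multiple */
        L[count].w = w * CLS;
        count++;
    }
    return count;
}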

This section describes the nature of programs that can benefit from the TSS algorithm. Firstly, it is important to note that only programs containing references that carry reuse in one of the loops in the loop nest can benefit significantly from the loop tiling transformation, and hence from a good tile size chosen by the TSS algorithm. The TSS algorithm relies on exploiting reuse at two loop levels - the outermost intra-tile loop and the innermost inter-tile loop. Thus, for a program to benefit from TSS, it must contain an array reference that has reuse in the outermost intra-tile loop and one that has reuse in the innermost inter-tile loop. The TSS algorithm gives a list of good tile dimensions corresponding to the inner loops based on considering intra-tile reuse, but the size of the outermost tile dimension must be obtained after considering the inter-tile reuse. The inter-tile reuse, however, depends significantly on the loop order - if the innermost inter-tile and innermost intra-tile loops correspond to the same loop in the original program, as in the strsm kernel, inter-tile reuse is sacrificed in favor of intra-tile reuse and effective vectorization. In such a case, the outermost intra-tile loop is left untiled. In all other cases, effective inter-tile reuse can be achieved without hurting vectorization by using the cost function obtained from Equation 3.9. This strategy proves applicable for kernels that yield non-skewed rectangular tiles, such as those in basic linear algebra (BLAS), data mining and image processing applications. In stencils, the inner space loops are skewed with respect to the outermost time loop. Thus, while intra-tile reuse can be achieved in a manner similar to the non-skewed tiled codes, inter-tile reuse is achieved in a different way, as explained in Section 3.3.5.

TSS algorithm for imperfectly nested loop nests. An important point to note is that when imperfectly nested loops are tiled, statements are distributed into different nests at one of the outer inter-tile loops. As a result, the cost functions derived in this work for the intra-tile loops as well as the innermost intra-tile loop can still be applied to such codes. However, in such cases, multiple imperfectly nested loop nests will have to use the same tile size, which may not be good for all nests.

TSS algorithm for programs with many array references. The TSS algorithm depends on the effective set-associativity, computed as ne = floor(n/(K*T) - 1). Thus, if the number of array references in the source program (K) is large (or the number of threads sharing the cache, T, is large), all array references cannot be assigned a unique way, given the limited number of ways in the cache. In such a case, we improvise the TSS algorithm by doubling the set-associativity (n) and halving the number of available sets. While this does not guarantee conflict minimization, due to the overlapping of data accessed by multiple references in the cache, it nonetheless serves as a good approximation for 2 reasons: (1) with the cache size remaining constant, it still partially emulates cache conflicts and does not fill the cache to capacity, and (2) since the number of sets remains a power of 2, it still accounts for the case of pronounced cache misses for certain pathological problem sizes.

TSS algorithm for nD tiles and non-square problem sizes. The TCM/RR analysis and the TSS algorithm presented thus far assume 3D loop-nests and 2D tiles.
Thus, we refer to tile dimensions, I, J and K, and to the “height” and “width” of a tile. However, in the general case of nD tiles, the tile size corresponding to the outermost intra-tile loop (I) represents the product of multiple outer intra-tile loops. Thus, the same analysis gives information about the relative impact of each tile dimension on data reuse, for an nD tile. Similarly, the TSS algorithm is easily extended for nD tiles - tile width corresponds to the tile size in the leading dimension and tile height is split into multiple dimensions and the mapping of a cache line to a particular set in the cache is accordingly computed. Thus, a list of tile size tuples, (t1, ... , tn), is obtained, and one of them is chosen as an estimate of the optimal tile size based on criteria determined from the TCM/RR analysis, exactly as in the case of 2D tiles. This is shown in Section 3.6 through the example of a tensor contraction kernel, doitgen, that involves 3D tiles and is tiled in 4 dimensions. Also, the TSS algorithm only depends on the leading dimension of the array reference with self-temporal reuse in the innermost loop. It is thus not restricted to square (cubic) problem 40 sizes but will find an estimate of optimal tile size for any problem size. 3.4.2 The Framework

The TSS algorithm is implemented within PLuTo, a source-to-source transformation system based on the polyhedral model. The complete framework of our implementation is shown in Figure 3.6. The PLuTo transformation framework takes as input the original source code and generates statement-wise affine transformations. These statement-wise transformations are representative of a composition of loop transformations including loop tiling. However, in the absence of a user-specified tile size, PLuTo uses a default cubic tile size of 32 to generate the updated (statement-wise) transformations through its Polyhedral Tile Specifier. Our algorithm, instead, interacts with PLuTo at this step to provide it with its estimate of the optimal tile size. The updated transformations with the tile size information are input to CLooG (34) to generate transformed source code, which can then be compiled on the target machine using any back- end compiler such as GCC or ICC.

Figure 3.6: Our Framework

The TSS algorithm needs 3 architectural parameters (cache size, line size and set-associativity) and 3 program-specific parameters (the order of intra-tile loops; K, i.e. the number of arrays with self-temporal reuse in a particular loop; and the skewing factors). The cache parameters can be determined through the OS, e.g. by a call to the 'GetLogicalProcessorInformation' function in Windows or through specific files under '/sys/devices/system/cpu/cpu0/cache/' in Linux. They can also be obtained using micro-benchmarks such as those in ATLAS. In our implementation, we used the former (OS) option to determine the cache parameters. The information about the program-specific parameters is obtained from the PLuTo transformation framework. The order of the intra-tile loops is particularly needed, as PLuTo might interchange the loops to favor vectorization, and this information is needed to determine the size of the tile in each dimension, as explained in Section 3.3.6. PLuTo also provides information about the skewing factors and loop vectorization for codes that are time skewed, to determine the tile size for such codes as discussed in Section 3.3.5.
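As an illustration of the Linux route (a sketch written for this text; the Windows and micro-benchmark alternatives are not shown), the per-level cache parameters can be read from sysfs as follows:

#include <stdio.h>

/* Illustrative sketch: read one attribute of one cache level from Linux sysfs
   (files under /sys/devices/system/cpu/cpu0/cache/indexN/). */
static int read_cache_param(int index, const char *attr, char *buf, int len) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", index, attr);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, len, f)) { fclose(f); return -1; }
    fclose(f);
    return 0;
}

int main(void) {
    char size[32], ways[32], line[32];
    /* index0 is usually the L1 data cache on Intel processors; a robust
       implementation should also check the 'level' and 'type' attributes. */
    if (read_cache_param(0, "size", size, sizeof(size)) == 0 &&
        read_cache_param(0, "ways_of_associativity", ways, sizeof(ways)) == 0 &&
        read_cache_param(0, "coherency_line_size", line, sizeof(line)) == 0)
        printf("size=%s ways=%s line=%s", size, ways, line);
    return 0;
}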

3.5 Experimental Setup

Microarchitecture   Cache Size (L1|L2|L3)    Cache Type (L1|L2|L3)   Set-associativity (L1|L2|L3)
Sandy Bridge        (32KB|256KB|12.5MB)      (priv.|priv.|shared)    (8|8|20)
Core                (32KB|1.25MB|-)          (priv.|shared|-)        (8|8|-)

Table 3.1: Details of the microarchitectures

We tested the effectiveness of our algorithm on an Intel Xeon E5-2650 processor with 8 cores (running at 2 GHz) based on the Sandy Bridge microarchitecture, and an Intel Core2 E6400 processor with 2 cores (running at 2.13 GHz) based on the Core microarchitecture. The details of each microarchitecture are given in Table 3.1 (in the table, we list the effective L2 cache size as presented in (35)). For our experiments, we used the 12 kernel benchmark programs from PLuTo and PolyBench (36) that are listed in Table 3.2.

Kernel      Description                                  Problem Size             Data Type
matmul      Matrix Multiplication                        N1=512; N2=2000          Double
dsyrk       Symmetric rank-k operations                  N1=512; N2=2000          Double
dsyr2k      Symmetric rank-2k operations                 N1=512; N2=2000          Double
lu          Lower Upper Decomposition                    N1=512; N2=2000          Double
trisolv     Multiple Triangular Solver                   N1=512; N2=2000          Double
doitgen     Multiresolution analysis kernel (MADNESS)    N1=128; N2=150           Double
strsm       Linear Equation Solver                       N1=512; N2=2000          Float
tmm         Triangular Matrix Product                    N1=512; N2=2000          Float
corcol      Correlation Computation                      N1=512; N2=2000          Float
covcol      Covariance Computation                       N1=512; N2=2000          Float
jacobi-2d   2-D Jacobi stencil computation               T=128; N1=512; N2=2000   Float
seidel-2d   2-D Gauss Seidel stencil computation         T=128; N1=512; N2=2000   Float

Table 3.2: Summary of the benchmarks

Of these 12 kernel benchmarks, the first 8 are various linear algebra kernels and solvers,

the next two (corcol and covcol) are kernels employed in data mining programs, and the last 2 (jacobi and seidel) are stencil kernels. Since the tile size varies with the problem size, we tested the algorithm for 2 different problem sizes - one that is a power of 2, and one that is not. The problem sizes that are a power of 2 are termed 'pathological', because they cause pronounced conflict misses for certain tile sizes, whereas other problem sizes do not demonstrate such behavior. It is for this reason that we chose a problem size in each of these 2 categories. In addition, since the tile size also depends on the data type of the problem arrays, we show the performance results of our algorithm for different benchmarks, some of which have single-precision and others double-precision data arrays. All tiled codes used in the experiments were generated using PLuTo (version 0.9.0) with the options '--tile' and '--parallel' (for extracting pipelined parallelism in time-skewed stencils). For all our experiments, we used the Intel C Compiler (ICC v13.0.1, using the '-O3' optimization option) as the backend compiler to compile the transformed source programs. We used the '-parallel' compiler option with ICC only to obtain results on the performance of the TSS algorithm on multiple cores.

3.6 Experimental Results

Among past works that used an analytical model for tile size selection, most algorithms suffered from poor performance for the 'pathological' problem sizes. As a result, later models such as eucpad (27), mentioned in Table 3.3, use padding to overcome the pronounced conflict misses that result at such problem sizes. However, eucpad considers neither the impact of set-associativity nor data reuse in multiple levels of cache. The TSS algorithm, on the other hand, considers the impact of set-associativity in generating the list of maximal tile sizes for both the L1 and L2 cache. The TSS algorithm further considers data reuse in the L1 and L2 cache to generate its estimate of the optimal tile size, as discussed in Section 3.3. We compare eucpad and TSS in Tables 3.3 and 3.4.

Algorithm   Cost function      Pad Size (single prec. | double prec.)   Set-associativity   Data reuse in L2
eucpad      min(1/K + 1/J)     6 | 7                                     Not considered      Not considered
TSS         min(1/I + 1/J)     - | -                                     Considered          Considered

Table 3.3: Comparison of the algorithms

The eucpad algorithm chooses non-conflicting tile widths using the Euclidean GCD algorithm (3) and tile heights using a simple recurrence. For pathological problem sizes, it allows for a padding of 0-7 elements and chooses a tile size from a list of sizes obtained by considering the different padding options. The cost function employed by eucpad tends to choose a square tile from the list of non-conflicting tile sizes obtained. It must be noted that their cost function is precisely based on Equation 3.4, which minimizes misses at the L1 cache. However, since the impact of set-associativity is ignored, the tile size chosen does not minimize the conflict misses. For example, eucpad chooses a tile size of (h,w)=(40,88) for the matmul kernel. The TSS algorithm, on the other hand, generates (40,80) as one of the maximal tiles for the L1 cache. In other words, the TSS algorithm predicts pronounced interference misses at the tile size (40,88), which is actually observed - the L1 cache misses incurred by the tile size (40,88) are 1.4 times more than those incurred by the size (40,80). Table 3.4 also compares the tile size chosen by the TSS algorithm with the "Best Tile". To generate the Best Tile, we exhaustively ran all combinations of tiled codes with each tile dimension that is a multiple of the cache line size, ranging from one cache line (8 for double precision and 16 for single precision floating-point data) to sqrt(L2CacheSize / size(DataType)) (176 for double precision and 256 for single precision floating-point data). The reason for choosing this range of tile sizes is that the working set that results from these sizes is large enough to occupy the L2 cache, which is essential for a fair comparison with the TSS algorithm. The search space is chosen considering the L2 cache size on Sandy Bridge, as the large shared cache in the Core microarchitecture would yield an unmanageable search space (the search space with the chosen limits already leads to 10648 tiles for double precision and 4096 tiles for single precision floating-point data arrays). Still, the Best Tile obtained for both microarchitectures clearly demonstrates the various factors impacting the tile size, and that the tile size generated by the TSS algorithm is close to the optimal tile size. Table 3.4 shows that for linear algebra and data mining kernels with 3D loop-nests, the tile sizes chosen by TSS, like those of the Best Tile, are rectangular and are such that the outermost and innermost intra-tile dimensions (I and J, respectively) are large. This favors both vectorization and data reuse in the L2 cache, as discussed in Section 3.3. Further, since the chosen tile size does not incur any interference misses, the performance achieved by TSS is always close to that achieved by the Best Tile. Since the tile sizes chosen are rectangular, in some cases the optimal tile size cannot be captured within the already large search space, and thus the tile size chosen by TSS can even outperform the Best Tile. This is particularly observed for the Core microarchitecture, which has a considerably larger L2 cache. From the table, an important observation can be made about the Best Tile sizes - the tile dimension corresponding to the innermost (vector) loop (96, for example, in matmul) is not the maximum possible length (176, for matmul), as would be expected to favor vectorization. The Best Tile instead conforms to the cost model used by TSS, i.e.
min(1/I + 1/J), thus validating the importance of reusing data in the L2 cache. For the linear algebra and data mining kernels considered, PLuTo generates tiled code in which the goals of achieving data reuse in the L2 cache and efficient vectorization are concomitant. However, as discussed in Section 3.3.6, these goals conflict in the case of the tiled strsm benchmark, where the algorithm decides in favor of efficient vectorization and data reuse in only the L1 cache. In this case, the outermost loop is effectively left untiled by choosing the tile size for that loop to be the problem size, as shown in Table 3.4. Another important observation from Table 3.4 is that while TSS performs significantly better than eucpad for most benchmarks, their performance is comparable for the lu and trisolv benchmarks. The reason for this is that both of these benchmarks have non-rectangular iteration spaces, and thus the reuse distance in different tiles varies as a function of their position in the iteration space. Our algorithm underestimates the tile size in such cases. However, since no conflict misses result from underestimating the tile size, our algorithm still achieves good performance in such cases. Among the linear algebra kernels, doitgen has 3D arrays and a loop-nest that is tiled in 4 dimensions. To estimate the tile size for doitgen, the framework first identifies the array references carrying reuse in the L2 cache. The TSS algorithm (modified for 3D arrays, as discussed in Section 3.4.1) generates a list of tile size triplets, all of which represent maximal tiles that do not incur interference. From this list, the triplet that maximizes reuse in the L2 cache according to the TCM/RR analysis is chosen as an estimate of the optimal tile size. The TCM/RR analysis suggests that for a tile size triplet, (P, Q, R), the tile dimension corresponding to the innermost loop (R in this case) and the tile dimension corresponding to the outermost loop (the product P * Q in this case) should both be large. Thus, the tile size triplet that satisfies these criteria is selected from the list. With P, Q and R known, the remaining tile dimension, S, is computed by considering reuse in the L1 cache, as demonstrated earlier for the case of 3D loop-nests through Figure 3.3.

Benchmark   Scheme      Sandy Bridge, PS=2000     Sandy Bridge, PS=512      Core, PS=2000             Core, PS=512
                        Tile Size (GFLOPS)        Tile Size (GFLOPS)        Tile Size (GFLOPS)        Tile Size (GFLOPS)
matmul      eucpad      40x88 (3.20)              64x48 (3.07)              40x88 (2.81)              64x48 (1.51)
            TSS         168x32x104 (4.1)          40x7x504 (3.45)           248x8x328 (3.2)           192x7x504 (1.46)
            Best Tile   176x32x96 (4.1)           144x48x176 (3.42)         128x16x128 (3.22)         144x64x16 (1.78)
dsyrk       eucpad      40x88 (3.22)              64x48 (2.55)              40x88 (2.80)              64x48 (1.96)
            TSS         168x32x104 (3.72)         40x7x504 (3.35)           248x8x328 (3.19)          192x7x504 (2.46)
            Best Tile   160x16x144 (3.89)         96x32x176 (3.22)          160x144x16 (3.09)         128x16x176 (2.31)
dsyr2k      eucpad      40x88 (3.98)              64x48 (2.93)              40x88 (2.76)              64x48 (2.01)
            TSS         120x8x88 (5.11)           16x4x504 (3.96)           248x8x136 (3.72)          96x4x504 (2.16)
            Best Tile   56x8x112 (5.27)           176x16x176 (3.90)         96x16x96 (3.58)           176x32x176 (2.20)
lu          eucpad      40x88 (2.82)              64x48 (2.07)              40x88 (1.89)              48x64 (1.63)
            TSS         168x32x104 (2.87)         40x7x504 (2.79)           248x8x328 (2.30)          192x7x504 (2.01)
            Best Tile   176x16x160 (3.01)         96x32x176 (2.63)          176x16x144 (2.01)         160x16x176 (1.86)
trisolv     eucpad      40x88 (3.76)              64x48 (2.43)              40x88 (2.52)              64x48 (2.19)
            TSS         168x32x104 (3.74)         40x7x504 (3.43)           248x8x328 (2.66)          192x7x504 (2.48)
            Best Tile   176x80x128 (3.81)         80x32x176 (2.84)          176x112x176 (1.80)        176x176x176 (2.09)
strsm       eucpad      80x64 (1.52)              80x64 (1.49)              80x64 (1.18)              80x64 (1.17)
            TSS3D*      2000x16x320 (1.68)        512x14x496 (4.63)         2000x16x320 (1.26)        512x14x496 (1.96)
            Best Tile   160x144x256 (1.47)        208x48x256 (2.58)         144x240x224 (1.11)        128x48x256 (1.48)
tmm         eucpad      80x64 (5.74)              80x64 (4.27)              80x64 (4.49)              80x64 (3.72)
            TSS         160x16x208 (7.3)          80x14x496 (4.55)          632x16x320 (6.00)         384x14x496 (5.25)
            Best Tile   192x16x256 (7.62)         160x16x256 (4.02)         256x16x256 (5.28)         48x16x256 (4.70)
corcol      eucpad      80x64 (4.31)              80x64 (3.58)              80x64 (2.14)              80x64 (2.09)
            TSS         160x16x208 (6.16)         80x14x496 (4.68)          632x16x320 (2.69)         384x14x496 (2.46)
            Best Tile   144x16x256 (6.41)         112x16x176 (4.51)         176x16x256 (2.38)         208x16x176 (2.37)
covcol      eucpad      80x64 (4.35)              80x64 (3.69)              80x64 (2.18)              80x64 (2.17)
            TSS         160x16x208 (6.26)         80x14x496 (4.98)          632x16x320 (2.72)         384x14x496 (2.58)
            Best Tile   112x16x256 (6.51)         96x16x176 (4.77)          224x16x256 (2.41)         112x16x176 (2.47)
doitgen     eucpad      -                         -                         -                         -
            TSS         8x14x16x150 (3.26)        5x14x24x128 (3.21)        4x68x16x150 (2.49)        5x127x24x128 (2.66)
            Best Tile   32x8x8x104 (2.92)         16x8x24x120 (3.07)        32x8x32x80 (2.38)         32x24x24x120 (2.52)
jacobi-2d   eucpad      -                         -                         -                         -
            TSS         128x32x560 (4.82)         128x48x496 (3.83)         128x48x2000 (2.30)        128x192x496 (1.82)
            Best Tile   64x64x32 (4.73)           64x32x240 (3.93)          128x96x256 (2.45)         128x224x256 (1.30)
seidel-2d   eucpad      -                         -                         -                         -
            TSS         128x208x16 (1.31)         96x14x16 (1.25)           128x208x16 (1.07)         128x14x16 (1.34)
            Best Tile   64x192x16 (1.32)          32x64x16 (1.28)           32x64x16 (1.15)           64x208x16 (1.3)

Table 3.4: Performance of TSS algorithm

Of the two stencil benchmarks considered, seidel is representative of non-vectorizable stencils and jacobi is representative of vectorizable stencils. For seidel, it can be observed that the analysis for data reuse in the L2 cache presented in Section 3.3.6 holds true - the Best Tile is rectangular with a larger tile height. The tile size corresponding to the outermost time loop is determined from the formula derived earlier in the same section, and the tile size determined by TSS performs close to the Best Tile. For jacobi, however, the TSS algorithm chooses a tile size to benefit from vectorization and intra-tile reuse in the L2 cache, at the cost of inter-tile reuse in the L2 cache. Results show that the TSS algorithm achieves performance comparable to the Best Tile in such cases as well. The difference in the optimal tile sizes for these two classes of stencils clearly validates both the impact of vectorization and that of reuse in the L2 cache on tile size selection.

Comparison with other approaches

Existing compilers - Figure 3.7 plots the normalized performance improvements achieved by the TSS algorithm, the default tiling in PLuTo, and the best cubic tile (BCT) with respect to Intel's C compiler. We show results for 9 of the 12 benchmarks considered on Sandy Bridge. The best cubic tiles have been chosen for comparison as some auto-tuning frameworks (20; 25) only consider cubic tiles to decide the best tiled code. Results show that the TSS algorithm outperforms the best cubic tile in all cases, achieving an average improvement of 9.7% and 20.4% for the two problem sizes. It is interesting to note that ICC could only tile the matmul benchmark (marked with an asterisk) among these 9 benchmarks, where it chose a cubic tile of size 128. Choosing such a tile size leads to a working set size that is mid-way between the L1 and the L2 cache size. The rationale is that some data could be reused in the L1 cache and, if there are some misses, they could be serviced by the L2 cache. Although an intuitive approach, it cannot match the performance of the TSS algorithm since it only considers 1 level of data reuse, whereas, as shown by our analysis, there are at least 2 (loop) levels at which data can be reused (loops kT and i for the matmul benchmark). In the case of PLuTo, the default cubic tiles of size 32 usually under-utilize the L1 cache and thus perform sub-optimally.

Figure 3.7: Comparison of TSS with other approaches that use 3D-tiling

The DL/ML model - The authors in (18) have recently proposed a useful analytical model for bounding the search space of tile sizes. They compute the Minimum working set Lines (ML) and the Distinct Lines (DL) (32) as a function of the tile size, which provide the upper and the lower bounds on the tile size that ensure intra-tile data reuse in the L1 cache. The authors further use the DL model to enable inter-tile data reuse in a higher level of cache. However, using these 3 bounds and only considering tile sizes in multiples of the cache line size, we get a search space of 7931 tile sizes for the matmul benchmark on the Sandy Bridge processor used in our experiments. The size of the search space is similar for other benchmarks as well; although much smaller than the original search space, it is nonetheless significantly large. Our work accommodates cache set-associativity to narrow down the search space to a list of maximal tile sizes, which maximally utilize the caches without causing interference. As a result, all maximal tiles obtained by the TSS algorithm (for linear algebra kernels) lie within the search space obtained from DL/ML, and thus the performance achieved is similar in both approaches. The TSS algorithm further uses a cost model derived from general principles underlying data reuse in linear algebra (and data mining) kernels to choose, from the list of maximal tiles, a single tile that minimizes cache misses. For stencil codes, however, inter-tile data reuse does not happen in the same manner as in linear algebra codes. For example, for the seidel-2d stencil code, the lower bound according to the DL model on the Sandy Bridge processor (with a 32KB L1 cache) is given as (H + T + 1) ∗ (⌈(W + T)/CLS⌉ + 1) ≥ 512, where W and H are the tile width and height, respectively, and T is the tile size corresponding to the outermost time loop. Using this lower bound, the best tile size (32x64x16) for the problem size of 512 and various other close-to-the-best tile sizes, including that chosen by TSS, lie outside the search space of DL/ML. This is because, in the optimal tile, the tile height should be small to avoid interference misses, and the tile width should be small to gain inter-tile reuse in the L2 cache on account of shifting tiles, as discussed in Section 3.3.5. The DL/ML model considers neither factor and thus cannot adequately bound the search space. The tile size chosen by the TSS algorithm considers these factors affecting data reuse and achieves good performance.

Copy and Tile - Table 3.5 compares the TSS algorithm with 'Copy and Tile' (abbreviated as 'C+T') for 3 out of the 6 level-3 BLAS kernels on Sandy Bridge. 'Copy and Tile' emulates the data copying performed in ATLAS, where the innermost matrix is copied into a block-major layout as described in (20) to prevent conflict misses. After data copying, a cubic tile is chosen that maximizes the L1 cache usage based on the inequality derived in (30), which is a refinement over (20). Further, we also ensure spatial locality for the matrices as in ATLAS, and choose the tile size to be a factor of the problem size to avoid complex clean-up code. We do not, however, compare with ATLAS itself, as ATLAS also performs other optimizations such as register tiling and pipeline scheduling in addition to tile size selection. Results in Table 3.5 show that the TSS algorithm achieves a performance improvement of 17% over 'C+T' for problem sizes that are not a power of 2 (2000 and 4000). This improvement is attributed to (1) the data reuse in the L2 cache exploited by the tile size chosen by TSS, whereas 'C+T' cannot capture this opportunity and incurs a 4% higher L2 cache miss rate than TSS, and (2) the fact that 'C+T' only considers cubic tile sizes and therefore cannot take advantage of the large vector pipeline length made possible by considering rectangular tiles.

Benchmark | Scheme | N=2000 (Tile Size / GFLOPS) | N=4000 | N=256 | N=1024
matmul | TSS | 168x32x104 / 4.1 | 168x32x104 / 4.28 | 40x7x504 / 3.73 | 40x7x504 / 3.52
matmul | C+T | 50x50x50 / 3.74 | 50x50x50 / 3.89 | 64x64x64 / 3.73 | 64x64x64 / 4.38
dsyrk | TSS | 168x32x104 / 3.72 | 168x32x104 / 3.60 | 40x7x504 / 2.85 | 40x7x504 / 3.58
dsyrk | C+T | 50x50x50 / 3.02 | 50x50x50 / 2.80 | 64x64x64 / 2.72 | 64x64x64 / 3.71
dsyr2k | TSS | 120x8x88 / 5.11 | 120x8x88 / 4.98 | 16x4x504 / 3.87 | 16x4x504 / 4.30
dsyr2k | C+T | 40x40x40 / 4.14 | 40x40x40 / 3.75 | 32x32x32 / 3.30 | 32x32x32 / 2.59

Table 3.5: Comparison between TSS and ‘Copy and Tile’

On the other hand, the performance of 'C+T' and TSS is comparable for the pathological problem sizes (256 and 1024), with the former even outperforming TSS in the case of matmul. This is because, to avoid interference for such problem sizes, the TSS algorithm chooses 'thin' tiles (for example, the tile size chosen for matmul at N=1024 is 28x7x504). This leads to less effective data reuse in the L2 cache, as compared to the square tiles chosen by 'C+T'. The reason that 'C+T' can choose square tiles even for these pathological problem sizes is that it is immune to the effects of problem size as a result of the copy optimization. Data copying, however, is a complex optimization and is not supported by production-quality compilers such as GCC and ICC. Since the TSS algorithm does not rely on the copy optimization, it can be readily implemented in any production-quality compiler.

Performance of the TSS algorithm in a multithreading environment

Table 3.6 shows the performance of the TSS algorithm using 4 and 8 threads on the Intel Xeon processor based on the Sandy Bridge microarchitecture. For the purpose of evaluation, we ran the same 10648 different tiled versions of the matmul and dsyr2k benchmarks to obtain the "Best Tile". In the table, we compare the performance of the TSS algorithm with the Best Tile and the best cubic tiled (BCT) code. On Sandy Bridge, which allows 2-way hyperthreading, the 2 threads on a core share resources, including the private L1 and L2 caches on the core, and bring in data to these caches simultaneously.

Scheme | matmul (N=2000), 4 threads (Tile Size / GFLOPS) | matmul (N=2000), 8 threads | dsyr2k (N=2000), 4 threads | dsyr2k (N=2000), 8 threads
BCT | 64x64x64 / 7.48 | 64x64x64 / 13.96 | 64x64x64 / 6.06 | 64x64x64 / 10.07
TSS | 128x8x80 / 7.84 | 128x8x80 / 15.3 | 40x8x40 / 6.33 | 40x8x40 / 10.53
Best Tile | 72x16x120 / 8.12 | 64x16x96 / 15.63 | 72x8x72 / 7.02 | 96x16x96 / 11.23

Table 3.6: Performance of TSS algorithm in a multi-threaded environment

This is incorporated in the TSS algorithm by reducing the effective set-associativity by a factor of 2, or more generally by the number of threads sharing the cache. As a result, a smaller tile size is chosen by our algorithm when using 4 and 8 threads (run on 2 and 4 cores, respectively); this tile size performs comparably to the Best Tile and outperforms the best cubic tiled code. A smaller tile size is similarly observed for the best cubic tiles when using this 2-way hyperthreading technology as compared to that observed when using a single thread per core, which further validates the effect of multithreading on tile size selection that we propose. As a result of using these smaller tiles instead of the tile size estimated for a single thread per core, the TSS algorithm achieves an improvement of 28% and 7%, respectively, for the matmul and dsyr2k benchmarks. A similar phenomenon is observed (data not shown) when multiple cores share the L2 cache on the Core2 processor using the Chip Multiprocessing (CMP) technology.

3.7 Related Work

In the past, the problem of optimal tile size selection has been attempted using two approaches: (1) through an analytical framework based on analyzing the interaction between the source program and the host architecture, and (2) through an extensive search at runtime to tune for tile size. In approach (1), there has been considerable work in the past (13; 14; 3; 15; 4; 27; 29). However, none of these works has proved to be effective for a range of programs. The authors in (13; 14; 15) study the impact of problem size and cache parameters on the tile size to eliminate self-interference misses, but do not consider cross-interference misses. On the other hand, the authors in (3; 4) aim to minimize cross-interference misses as well. However, they assume the cache to be direct-mapped and use probabilistic approaches for minimizing conflicts, which leads to sub-optimal performance, especially for the 'pathological' problem sizes. The authors in (27; 28; 29) address this problem of robustness by using array padding for such pathological problem sizes. However, these works neither accommodate set-associative caches in their models nor consider the option of exploiting data reuse in the multiple levels of cache. The authors in (37) use Presburger formulas to express cache misses and can consider moderate levels of associativity, but cannot accommodate the high set-associativity in modern caches. Of all the works in approach (2), the most prominent is the ATLAS library generator (20), which generates an optimized BLAS library for a given target platform. However, ATLAS relies on an extensive, time-consuming search for this purpose. Other self-tuning library generators (19; 21; 22) suffer from the same problem of a large search space. Tiwari et al. (38) combine the CHiLL (39) compiler framework with Active Harmony (40) to manage the cost of this search space, but rely on the user for the copy optimization. Yotov et al. (30) improve upon ATLAS by showing that search is not necessary to generate optimized BLAS. However, they use the code generator of ATLAS, which relies on the ad-hoc copy optimization to reduce interference, to estimate the optimal tile size. The automatic tile size creation model of Yuki et al. (25) avoids array copying but only considers cubic tiles to contain the data collection time. However, the authors in (41) and (42) show that, for linear algebra and stencil codes, respectively, the optimal tile is not cubic, which is also corroborated by our work. Chen et al. (43) and Shirako et al.
(18) combine an analytical model with empirical search to manage the search space. Chen et al. rely on copying to work with a reduced search space and use heuristics to traverse it. Shirako et al. propose a model to determine the upper and lower bounds on the search space using the existing Distinct Lines (DL) model (16) and their Minimum Working Set Lines (ML) model, and consider data reuse in multiple levels of cache. However, the ML model assumes an optimal cache replacement policy, thus effectively ignoring the impact of set-associativity in caches. In this work, we accommodate the impact of set-associativity in our analysis of data reuse. Recently, auto-tuning frameworks have been extended to select the optimal tile size during the production run (ETile (23)) using parameterized tiled code (44). ETile monitors a few loop iterations for a tile size to determine its optimality. However, this does not reduce the search space and may not give the best tile, as it is difficult to capture reuse in a few iterations, especially if the data reuse is in the outermost loop.

3.8 Conclusion

This work revisits the problem of tile size selection in the context of modern processors. It proposes an analytical framework that unveils several factors that had not been considered by previous models. In particular, it considers data reuse not just in the L1 cache but also in the L2 cache. In addition, it considers the interaction of the SIMD unit in modern processors with tile size selection. The proposed TSS algorithm chooses rectangular tiles that benefit from both vectorization and data reuse in multiple levels of cache. Experimental results indicate that the TSS algorithm achieves significant improvement over previous analytical models that did not consider these important factors on modern processors. In addition, results indicate that the tile size chosen by the TSS algorithm is comparable to that found by exhaustive search in a constrained search space. Since the TSS algorithm estimates the optimal tile size at compile-time, it can be readily incorporated within any production-quality compiler.

Chapter 4

Coordinated Multi-Stage Prefetching for Present-day Processors

4.1 Introduction

Data prefetching proves very effective for hiding the large memory latency. In many cases, the applicability of other memory optimizations such as loop fusion and loop blocking is limited, and thus prefetching proves particularly useful. It is for this reason that the latest processors are equipped with improved hardware prefetching logic, and their instruction sets provide support for software prefetch instructions that can selectively prefetch data to different levels of cache. However, while such extensive support exists in the form of hardware, software, or coordinated hardware-software prefetching, the choice of the right prefetching strategy among the various options continues to remain a mystery for programmers and compiler writers. The choice of the right prefetching strategy is a function of the various features supporting prefetching in the host architecture. One such feature is the nature of the hardware prefetcher - its effectiveness, aggressiveness, and behavior in the presence of software prefetch instructions. For example, on SandyBridge, the Mid-level cache (MLC) streamer hardware prefetcher is very effective and prefetches data to the L2 cache. Unlike software prefetching, it is not limited by the small size of the Line Fill Buffer at the L2 cache, which allows very few outstanding prefetch requests, but can effectively prefetch data for 32 streams. Further, it can be trained by the software prefetch requests to the L2 cache for prefetching data to the L1 cache. This makes for an ideal scenario for coordination between the MLC streamer prefetcher and the L1 software

prefetches that together fetch the data all the way to the L1 cache and make up for the not-so-effective L1 hardware prefetcher. The L1 software prefetching also provides other advantages: (1) Using a larger prefetch distance helps to increase the limit of the look-ahead prefetch distance of the hardware prefetcher, as the existing limit of 20 cache lines proves insufficient for small loops (as in matrix multiplication). (2) The hardware prefetcher stops at page boundaries. However, using a large prefetch distance in the software prefetch instructions helps to re-trigger the hardware prefetcher in time to prevent incurring any stalls at page boundaries. (3) The hardware prefetcher can track a maximum of 32 streams. Thus, for programs with more than 32 streams, software prefetch instructions prove particularly helpful to improve the prefetch coverage by prefetching data for the remaining streams directly to the L1 cache. On the other hand, on Xeon Phi, there is similarly an effective L2 streamer prefetcher, but it does not prefetch data in the presence of L1 software prefetch instructions. Thus, the advantages of coordinated hardware-software prefetching cannot be realized on Xeon Phi. However, unlike SandyBridge, which supports very few outstanding software prefetch requests at the L2 cache, each core on Xeon Phi supports a much larger number of outstanding software prefetch requests at the L2 cache. Thus, like SandyBridge, the data can be brought all the way to the L1 cache through coordinated multi-stage prefetching. But, unlike SandyBridge, the coordination is between software prefetch instructions at different levels of cache, i.e., data is first prefetched using a larger prefetch distance at the larger last-level cache, while a smaller prefetch distance is used at the L1 cache to prefetch data from the next-level cache. A smaller prefetch distance at the L1 cache allows for fewer outstanding prefetches, thus minimizing contention in the 8-entry MSHR file at the L1 cache. It is interesting to note that Xeon Phi, with in-order cores and blocking caches, is more sensitive to the choice of prefetching technique than SandyBridge, because pipeline stalls due to inadequate prefetching cannot be tolerated through out-of-order execution as on SandyBridge. In the recent past, hardware-based coordinated multi-stage prefetching has been implemented in IBM's Power6 microarchitecture (45), where the L1 hardware prefetcher coordinates with the L2 hardware prefetcher to bring the data to the L1 cache in stages. However, as noted above, software prefetches can be used in a similar manner to achieve this coordination and overcome the limitations of the existing hardware prefetchers. Also, there has been some work (46; 47; 48) that has considered employing software to aid hardware prefetching, but in such cases, the hardware was built around achieving such coordination. This, however, is not true for existing processors, and thus achieving such coordination while considering other hardware features is non-trivial. Recently, Lee et al. (49) have studied the interaction between software and hardware prefetching but have not proposed any particular prefetching strategy. Through this work, we make the following contributions:

1. We study the influence of various hardware features on data prefetching and their impact on the choice of prefetching technique for different hardware platforms.
2. Based on our study, we propose a coordinated multi-stage prefetching algorithm for 2 different state-of-the-art processors (SandyBridge and Xeon Phi). The means of achieving the coordination is, however, different on each platform, depending on the hardware support for prefetching.

3. We evaluate the performance of the different prefetching techniques possible on both platforms and identify the reasons for their respective behaviors in different programs. Our experiments with prefetching in a multithreaded environment further provide useful insights about the interaction between hardware and prefetching, and lead us to an interesting compiler optimization to tackle pronounced contention for resources in Xeon Phi.

The coordinated multi-stage prefetching proposed in this work is a static compiler technique with no hardware overhead and can thus be incorporated in existing production-quality compilers, or serve as an effective tool in the hands of a programmer. Our multi-stage prefetching algorithm works in three phases: (1) identifying what to prefetch, (2) identifying where to insert the prefetch instructions, and (3) identifying when to prefetch, or what prefetch distance to use at each level. The prefetch instructions and the prefetch distance at each level are carefully chosen to achieve the desired coordination. We have implemented our prefetching algorithm in the ROSE source-to-source compiler framework, which transforms the original source code to include the prefetch instructions. The generated transformed code is then compiled using the Intel compiler as the back-end vendor compiler. We tested our algorithm using various memory-intensive benchmarks from the SPEC CPU2006 and OMP2012 suites on Intel SandyBridge and Xeon Phi processors. Experimental results show that our multi-stage prefetching achieves a speedup of 1.55X over the Xeon Phi hardware prefetcher, and of 1.3X over the state-of-the-art Intel compiler for Xeon Phi. On SandyBridge, which employs an effective hardware prefetcher, an improvement of 1.08X is obtained. The rest of the chapter is organized as follows. Section 4.2 reinforces our motivation through an example. We describe our multi-stage prefetching algorithm with its three phases in Section 4.3. The interaction of our prefetching algorithm with the multithreading technique is also discussed in this section. Section 4.4 describes our compiler framework for implementing the prefetching algorithm. We present experimental results and discuss the results obtained using different prefetching strategies on individual benchmarks in Section 4.5. Related work is presented in Section 4.6 and we conclude in Section 4.7.

4.2 Motivation

On all hardware platforms, the size of the LFB or MSHR file at the L1 cache is usually small. This is because any arriving data request initiates a fully associative search across the structure to eliminate redundant requests, which is costly in terms of power usage and time delay, restricting its size. Chip area concerns also limit their size, since these structures are located close to the core. It is for these reasons that both SandyBridge and Xeon Phi allow for only 8 outstanding prefetch requests at the L1 cache. However, such a small number of requests is clearly insufficient to hide the large memory latencies on modern processors, as pipeline stalls will result once the MSHR file or the load buffers are full.
As a result, a multi-stage prefetching algorithm is needed that brings the data from memory in stages while being cognizant of the resource availability at each level. This is illustrated more clearly through the following example from swim, a well-known memory-intensive weather prediction benchmark from the SPEC OMP2012 suite. The Xeon Phi processor is considered below; similar arguments hold for a multi-core processor as well.

Figure 4.1: An example loop nest in the swim benchmark.
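Figure 4.1 itself is an image and is not reproduced in this text. The following C-style rendering is only an illustrative sketch of the kind of loop nest the figure shows - a sweep in which every statement advances several independent array streams. The array names, dimensions, and statements are assumptions for illustration, not the actual swim code (which is written in Fortran).

enum { N = 512 };                       /* placeholder problem dimension */
static double UNEW[N][N], VNEW[N][N], PNEW[N][N];
static double UOLD[N][N], VOLD[N][N], POLD[N][N];
static double CU[N][N], CV[N][N], Z[N][N], H[N][N];

void sweep(double tdts8, double tdtsdx, double tdtsdy)
{
    for (int i = 0; i < N - 1; i++) {
        for (int j = 0; j < N - 1; j++) {
            /* S1: CV[i][j] and CV[i][j+1] exhibit group reuse; only the
             * leading reference CV[i][j+1] would be prefetched. */
            UNEW[i][j] = UOLD[i][j]
                + tdts8 * (Z[i][j + 1] + Z[i][j]) * (CV[i][j + 1] + CV[i][j])
                - tdtsdx * (H[i][j + 1] - H[i][j]);
            /* S2 and S3 touch further arrays, so a single iteration of the
             * j loop advances many independent prefetch streams at once. */
            VNEW[i][j] = VOLD[i][j]
                - tdts8 * (Z[i][j + 1] + Z[i][j]) * (CU[i][j + 1] + CU[i][j])
                - tdtsdy * (H[i][j + 1] - H[i][j]);
            PNEW[i][j] = POLD[i][j]
                - tdtsdx * (CU[i][j + 1] - CU[i][j])
                - tdtsdy * (CV[i][j + 1] - CV[i][j]);
        }
    }
}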

Figure 4.1 shows one of the three computationally (and memory) intensive loop nests in swim. The loop nest shown has 14 array references that access different cache lines, or in other words, 14 data streams that need to be prefetched. When testing on Intel Xeon Phi, which has a maximum memory latency of 1000 cycles, we observe that a minimum prefetch distance of 6 cache lines is needed to hide the memory latency. That is, there can be a maximum of 14*6 (= 84) outstanding prefetches per thread for swim. Thus, for the data to be prefetched directly to the L1 cache, the L1 cache should provide 84 MSHRs per thread (in the ideal scenario when there is enough off-chip memory bandwidth available), or at least many more than the currently available 8 MSHRs per thread. Such a large MSHR file at the L1 cache is not feasible. To make things worse, the 8 MSHRs at the L1 cache are shared by the 4 SMT threads on each core of Xeon Phi. Thus, we implement a coordinated multi-stage data prefetching strategy, where data is first brought to the lower-level cache, e.g., the L2 cache (which, on Xeon Phi, has more MSHRs and can thus hold more requests for hiding larger memory latencies), using a large prefetch distance (6 cache lines in the case of swim). Subsequently, data is brought from the lower-level cache (e.g., the L2 cache) to the higher-level cache (e.g., the L1 cache) using a smaller prefetch distance (1 cache line in the case of swim), since the data is already in the next-level cache. As a result, a small MSHR file at the L1 cache proves sufficient to hide the small L2-to-L1 latency and prevent stalls due to contention. On SandyBridge, the same coordinated multi-stage prefetching strategy is implemented, but the coordination is between the L1 software prefetches and the L2 hardware prefetcher, which is more effective than its software counterpart due to its ability to hold many more outstanding prefetch requests.

4.3 Coordinated Multi-stage Data Prefetching

As stated in Section 4.1, our coordinated multi-stage data prefetching algorithm works in three phases, namely, what to prefetch, where to insert prefetch instructions, and when to prefetch. For each of these phases, we discuss our specific choices for the two hardware platforms considered and the reasons behind those choices.

4.3.1 What to Prefetch

This is the first of the three phases, in which the references that should be prefetched are identified. Such references include (1) those whose future memory accesses can be determined, and (2) those that, when prefetched, will benefit application performance. A recent work (49) classified memory references into 5 types: (1) direct-indexed streaming array references, (2) direct-indexed strided array references, (3) indirect-indexed array references, (4) recursive data structures (RDS), and (5) hashing data structures. In our compiler framework, we only handle memory references of types (1) through (3), as references of types (4) and (5) need to be handled differently. Further, even among references whose future access patterns can be statically determined, prefetching all of them is not always beneficial. For example, references that have temporal locality in the inner loops, such as references A and C in matmul (shown in Figure 4.2), have small reuse distances, and thus the data referenced by them stays in the cache with a high probability. Such references occur often in SPEC benchmarks and lead to redundant prefetches that reduce performance gains, especially when the amount of computation in loop-nests is small (as in matmul). Although Mowry et al. in (50) proposed prefetch predicates in the form of IF statements (or their equivalent) to eliminate redundant prefetches, we empirically observe that the instruction overhead of such predicates usually offsets the performance gain from data prefetching. These predicates also hinder vectorization by the compiler. It is for these reasons that we do not consider such references as profitable candidates for prefetching. In addition, for references that have group reuse, such as CV[i][j] and CV[i][j+1] in statement S1 in swim (shown in Figure 4.1), we prefetch only for the leading reference CV[i][j+1], as it is the one that accesses new data first and caches it for later use.

Figure 4.2: The matmul kernel.
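Figure 4.2 is likewise not reproduced here. The following minimal C sketch shows a matmul kernel of the kind the surrounding text describes; the loop order and variable names are assumptions inferred from how the references A, B, and C are discussed in the text.

void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                /* A[i][k] is invariant in the j loop and C[i][j] is reused
                 * across the k loop (temporal locality in inner loops), so
                 * neither is a profitable prefetch candidate; the streaming
                 * reference B[k][j] is the one marked for prefetching. */
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}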

For all references thus marked for prefetching, our prefetching algorithm inserts prefetch instructions for just the L1 cache in the case of SandyBridge, since it relies on the hardware prefetcher for prefetching data to the other levels. In the case of Xeon Phi, however, the algorithm inserts prefetch instructions for all levels of cache to implement pure software-directed coordinated multi-stage prefetching.

4.3.2 Where to Insert Prefetch Instructions

Having identified the array references to be prefetched, the next phase is to determine where prefetch instructions for the identified references should be inserted. For all identified references, we insert prefetch requests in the innermost loop. This simplifies the insertion of prefetch instructions and the prefetch distance calculation by the compiler. However, even in the innermost loop, the placement of prefetch instructions is important. For example, in the loop nest from the swim benchmark shown in Figure 4.1, if prefetch requests for all 14 array references are placed contiguously in the source code without any intermittent computation, pipeline stalls may result because prefetches will be blocked waiting for the availability of MSHRs. This is particularly true for L1 prefetch instructions, given the small size of the L1 LFB or MSHR file. Such stalls are particularly visible on Xeon Phi, where the pipeline stalls immediately upon unavailability of MSHRs, whereas incoming requests can be buffered in the load buffer in the case of the dynamically scheduled SandyBridge processor. Thus, our compiler framework inserts, between individual statements, the prefetch requests for the data needed by the array references in each statement - this introduces computation between batches of prefetch requests, providing adequate time for them to finish and free MSHRs. Further, prefetch requests for different levels of cache are intermingled, creating additional cycles for an L1 prefetch request to prefetch data from a lower-level cache.

4.3.3 When to Prefetch

This is the last phase of our prefetching algorithm, which determines when to prefetch data for a memory reference. In particular, this phase determines, for each array reference, the prefetch distance to be used when prefetching data to a particular level of cache. Prefetch distance calculation involves calculating the loop iteration time, which is hard to determine precisely for processors with out-of-order execution capability. Thus, in this section, we treat the prefetch distance calculation for the in-order many-core Xeon Phi and the out-of-order multi-core SandyBridge separately.

Calculating Prefetch Distance in Xeon Phi

Mowry et al., in one of the earliest works on software prefetching (50), defined the prefetch distance as ⌈Lat/LIT⌉, where Lat and LIT are the prefetch latency and the loop iteration time, respectively. Of the two parameters, Lat is a machine-specific parameter and can be known from vendor data sheets or measurements. LIT, on the other hand, must be estimated by the compiler. In our framework, we estimate LIT using Equation 4.1 below. This estimated value of LIT is then used to calculate the prefetch distances for prefetches to any level of cache, using measurements of the corresponding prefetch latencies.

LIT = nrefs ∗ Lat_L1 + nrefs ∗ C_p ∗ i + Σ_{comp=1}^{n} C_comp    (4.1)

where nrefs is the number of distinct array references in the loop nest, Lat_L1 is the latency of accessing a data item in the L1 cache, C_p is the number of cycles spent in executing a prefetch instruction, and C_comp is the cost of one of the n computations in the loop nest, measured in cycles. The above formula for calculating the loop iteration time (LIT) assumes that all array references in the loop nest are accessed from the L1 cache (i.e., L1 cache hits). This is because we accomplish multi-stage data prefetching in which the data is prefetched all the way up to the L1 cache. This assumption holds for all but the initial few iterations of the loop nest, which can be considered the warm-up phase. Also, since our framework inserts one prefetch instruction for each level of the cache memory per array reference, the instruction overhead in each loop iteration is calculated as nrefs ∗ C_p ∗ i, where i is the number of levels in the cache memory hierarchy. To facilitate further discussion in this section, we assume a 2-level cache memory hierarchy as in the Xeon Phi processor. Once LIT is determined, the prefetch distance at each of the two levels is calculated as,

PD_L1 = ⌈Lat_L2 / LIT⌉ ∗ α ∗ β    (4.2)

and

PD_L2 = ⌈Lat_memory / LIT⌉ ∗ α ∗ β    (4.3)

where α and β are program-dependent constants, as explained below. When calculating the prefetch distance, it is important to consider whether or not the innermost loop is vectorized. This is because the prefetch distance computed using the formula ⌈Lat/LIT⌉ gives the prefetch distance in the number of loop iterations. However, if the innermost loop is vectorized, the same formula (using appropriate computation costs for vector operations) gives the prefetch distance in the number of vector iterations. We accommodate this in our algorithm through the parameter α, where α is the size of the vector (in the number of data elements) when the innermost loop is vectorized and 1 otherwise. The prefetch distance is further multiplied by the constant stride, β, for an array reference that exhibits a strided access pattern. The prefetch distance is also rounded to the next higher multiple of the cache line size in cases when the innermost loop is not vectorized.
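As an illustration of how Equations 4.1-4.3 fit together, the following C sketch computes LIT and the two prefetch distances for a hypothetical loop. All machine constants (latencies, prefetch cost, per-operation costs), the number of streams, and the helper names are assumptions; real values would come from data sheets or measurement, and the cache-line rounding rule for non-vectorized loops mentioned above is omitted for brevity.

#include <math.h>

/* Machine parameters used by Equations 4.1-4.3; the values below are
 * placeholders, not measured numbers. */
typedef struct {
    double lat_l1;   /* L1 hit latency (cycles) */
    double lat_l2;   /* latency of a prefetch from the L2 cache (cycles) */
    double lat_mem;  /* memory latency (cycles) */
    double c_p;      /* cycles to issue one prefetch instruction */
    int    levels;   /* i in Equation 4.1: number of cache levels prefetched */
} machine_t;

/* Equation 4.1: loop iteration time, assuming all operands hit in L1. */
static double lit(const machine_t *m, int nrefs, const double *cost, int nops)
{
    double t = nrefs * m->lat_l1 + nrefs * m->c_p * m->levels;
    for (int c = 0; c < nops; c++)
        t += cost[c];
    return t;
}

/* Equations 4.2/4.3: alpha is the vector length if the innermost loop is
 * vectorized (1 otherwise); beta is the constant stride of the reference. */
static int pd(double lat, double loop_time, int alpha, int beta)
{
    return (int)ceil(lat / loop_time) * alpha * beta;
}

int main(void)
{
    machine_t phi = { 3.0, 24.0, 500.0, 1.0, 2 };   /* assumed constants   */
    double ops[3] = { 4.0, 4.0, 4.0 };              /* assumed op costs    */
    double t      = lit(&phi, 14, ops, 3);          /* 14 streams, as in swim */
    int pd_l1     = pd(phi.lat_l2,  t, 8, 1);       /* Equation 4.2 */
    int pd_l2     = pd(phi.lat_mem, t, 8, 1);       /* Equation 4.3 */
    return (pd_l2 > pd_l1) ? 0 : 1;                 /* PD_L2 should be larger */
}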

From Equations 4.2 and 4.3, we observe that PD_L2 is much larger than PD_L1 (as Lat_memory ≫ Lat_L2). As a result, data is prefetched first to the larger L2 cache using a larger prefetch distance, and this data is then prefetched to the L1 cache a few iterations ahead. Thus, the two stages of data prefetching coordinate to fetch the data to the L1 cache in a timely manner. As discussed earlier, the smaller L1 prefetch distance facilitated by this strategy minimizes contention (and also prevents possible cache pollution due to early prefetches).
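The following hand-written C sketch illustrates what the resulting two-stage software prefetching looks like on a simple streaming loop. It is not output of our compiler framework: the distances are placeholders, and both the mapping of the _mm_prefetch hints to cache levels (T1 for the L2 stage, T0 for the L1 stage) and the availability of this particular intrinsic on a given target compiler are assumptions.

#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

enum { PD_L1 = 8, PD_L2 = 48 };   /* placeholder distances, in elements */

void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        /* Stage 1: far-ahead prefetches toward the L2 cache (hint T1). */
        _mm_prefetch((const char *)&x[i + PD_L2], _MM_HINT_T1);
        _mm_prefetch((const char *)&y[i + PD_L2], _MM_HINT_T1);
        /* Stage 2: short-distance prefetches toward the L1 cache (T0);
         * the data is expected to already sit in L2 by the time they issue. */
        _mm_prefetch((const char *)&x[i + PD_L1], _MM_HINT_T0);
        _mm_prefetch((const char *)&y[i + PD_L1], _MM_HINT_T0);
        y[i] += a * x[i];
        /* A production version would issue one prefetch per cache line
         * rather than per element; prefetch hints past the array end do not
         * fault on x86 but do waste bandwidth. */
    }
}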

Calculating Prefetch Distance in SandyBridge

Unlike the many-core Xeon Phi, which has single-wide in-order issue and in-order execution, multi-cores are usually superscalar processors with dynamic scheduling capability. As a result, it is non-trivial to statically determine the iteration time of loops on a multi-core processor. It is for this reason that Lee et al. (49) use the IPC values from benchmark profiling in their prefetch distance calculation. Their calculation gives a lower bound on the prefetch distance. The prefetch distance can, however, be increased without any negative effects until the newly prefetched data begins to replace the previously prefetched but unused data. A larger prefetch distance for L1 prefetches is, in fact, required, as this helps to increase the prefetch distance of the hardware prefetcher. This, as explained earlier, helps to prevent stalls at page boundaries and, in particular, helps small loops that need large prefetch distances. In our algorithm, we prevent the replacement of useful data from the L1 cache by ensuring that the amount of unused prefetched data does not exceed a fourth of the L1 cache size. We restrict ourselves to a fourth of the L1 cache size because misses due to cache interference can set in well before the cache capacity is reached. The prefetch distance is thus calculated as

PD_L1 = Size_L1 / (4 ∗ nrefs ∗ Size_elem)    (4.4)

where Size_L1 is the L1 cache size and Size_elem is the size of each data element. The L1 software prefetch instructions use this prefetch distance and in turn trigger the L2 hardware prefetcher, which in time runs sufficiently ahead of the L1 prefetches to bring the data to the L2 cache in a timely manner. Thus, as on Xeon Phi, coordinated multi-stage prefetching that prefetches the data all the way to the L1 cache with minimal contention in the L1 LFB is accomplished on SandyBridge.
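For concreteness, with assumed values of a 32 KB L1 cache, 8-byte double elements, and the 14 streams of the swim loop nest, Equation 4.4 gives 32768 / (4 * 14 * 8) = 73 elements. A minimal helper capturing the formula might look as follows (the function name is illustrative only).

/* Equation 4.4: L1 prefetch distance in elements. Example (assumed values):
 * pd_l1_sandybridge(32 * 1024, 14, 8) == 73. */
static int pd_l1_sandybridge(int l1_bytes, int nrefs, int elem_bytes)
{
    return l1_bytes / (4 * nrefs * elem_bytes);
}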

Figure 4.3: (a) Tiled matmul; (b) Tiled matmul with prefetching as in the general case; (c) Tiled matmul with prefetching as in the special case.
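Since Figure 4.3 is not reproduced here, the following hedged C sketch illustrates the contrast its caption describes, on a single tile of matmul: case (b) keeps the prefetch distance in elements and can run far past the short tile rows, while case (c) measures the distance in array rows. The tile-loop structure, the distances, and the hint-to-cache-level mapping (T1 for L2, T0 for L1) are assumptions, not the actual code in the figure.

#include <xmmintrin.h>

enum { PD_L1_ELEMS = 64, PD_L2_ELEMS = 512,  /* general case, in elements */
       PD_L1_ROWS  = 1,  PD_L2_ROWS  = 2 };  /* special case, in rows     */

/* (b) General case: when the tile width tj is much smaller than PD_L2_ELEMS,
 * these prefetches run off the end of the current tile row and fetch data
 * the tile never uses. */
void tile_general(int n, int i0, int k0, int j0, int ti, int tk, int tj,
                  const double *A, const double *B, double *C)
{
    for (int i = i0; i < i0 + ti; i++)
        for (int k = k0; k < k0 + tk; k++)
            for (int j = j0; j < j0 + tj; j++) {
                _mm_prefetch((const char *)&B[k * n + j + PD_L2_ELEMS], _MM_HINT_T1);
                _mm_prefetch((const char *)&B[k * n + j + PD_L1_ELEMS], _MM_HINT_T0);
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
}

/* (c) Special case: distances in rows (lower bounds of 1 and 2, as in
 * Equations 4.5/4.6), so the prefetches target the same column range of the
 * rows of B that the next k iterations will actually touch. */
void tile_special(int n, int i0, int k0, int j0, int ti, int tk, int tj,
                  const double *A, const double *B, double *C)
{
    for (int i = i0; i < i0 + ti; i++)
        for (int k = k0; k < k0 + tk; k++)
            for (int j = j0; j < j0 + tj; j++) {
                _mm_prefetch((const char *)&B[(k + PD_L2_ROWS) * n + j], _MM_HINT_T1);
                _mm_prefetch((const char *)&B[(k + PD_L1_ROWS) * n + j], _MM_HINT_T0);
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
}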

Prefetch Distance for Tiled Code

While the formula for calculating the prefetch distances derived above applies to most programs, it nevertheless assumes long, contiguous streaming/strided access patterns, i.e., array references where the entire array is referenced in all loop nests within an application. Such access patterns are not observed, however, in tiled codes and certain other scientific codes where the computational domain is split into parts, as in the gemsfdtd and mgrid benchmarks from the SPEC suite. We use a simple example of tiled matmul to illustrate such a case and the prefetching strategy applied therein. Figure 4.3(b) shows the tiled matmul code with prefetch instructions inserted assuming contiguous streaming access patterns - B[k][j] is identified as suitable for prefetching and prefetch instructions are inserted in the innermost loop as shown. However, as a result of tiling, the access pattern is no longer contiguous - only J elements of array B are accessed contiguously, followed by another J elements from the next row of B. Thus, if a prefetch distance PD_L2 (which is likely to be greater than J), as calculated using Equation 4.3, is used, useless prefetches could result. Since the data does not arrive at the L2 cache in time for the L1 prefetches, 2-stage prefetching becomes ineffective. Our framework identifies such cases of accesses to non-contiguous data chunks (where the size of the innermost loop is a fraction of the array size), as in tiled matmul, and inserts prefetches as shown in Figure 4.3(c). It prefetches one of the next array rows instead of prefetching elements that are a few cache lines (= prefetch distance) away. Thus, useless prefetches are reduced to only those issued while processing the last few rows of a tile of array B. Since the correct data does arrive in time in the L2 cache, 2-stage prefetching again becomes very effective. In such cases, the prefetch distance is calculated in the number of array rows, instead of elements as in the general case discussed above. It is given as follows.

PD^s_L1 = max(1, PD^g_L1 / lc)    (4.5)

and

PD^s_L2 = max(2, PD^g_L2 / lc)    (4.6)

where PD^s is the prefetch distance in this special case, PD^g is the prefetch distance in the general case (calculated in the number of elements), and lc is the loop count of the innermost loop, or the tile size in the case of tiled codes. In the formulas, the lower bounds of 1 and 2 for the L1 and L2 prefetch distances ensure sufficient time for the data to arrive in both the L1 and L2 caches. On SandyBridge, our framework inserts prefetch requests for just the L1 cache using the above formula, and relies on the hardware to prefetch the data to the L2 cache.

4.3.4 Multi-stage Prefetching in multi-threaded environment

Chip multiprocessing (CMP) and simultaneous multithreading (SMT) can both be used to extract the parallelism inherent in programs run in a multithreaded environment. Both of these techniques interact closely with prefetching. In CMP, the threads running on the individual cores simultaneously place demands for data from memory, and pronounced bandwidth contention results. The net effect of this is an increase in the effective memory latency. As a result, pipeline stalls due to unavailability of slots in the load/store buffer (in multi-cores) tend to be more prominent in such cases. Data prefetch requests, which are requests issued in addition to the regular loads, also occupy slots in the load buffer and thus end up contributing to these stalls. It is for this reason that we observe that for certain memory-intensive benchmarks, such as swim, bwaves, and others, the performance advantage of software-directed prefetching to L1 over the hardware prefetcher is lost. For other benchmarks that are less memory-intensive, such as bt, cactus, and matmul, the performance advantage persists. In a many-core processor such as Xeon Phi that has blocking loads, there is no load buffer, and thus data prefetch requests and regular loads do not share common resources. As a result, prefetching does not cause a side-effect on parallel performance as in multi-cores, as long as useful data is prefetched in a timely manner. In SMT, multiple SMT threads on a core share the LFB or the MSHR file. As a result, the increased data access rate puts pressure on the shared MSHR file, resulting in potential stalls due to contention. Precisely, contention results when the available MSHRs are fewer than the maximum number of outstanding prefetches. Since MSHRs are already scarce for a single thread, we observe that prefetching generally adversely affects the performance of programs run in an SMT environment. In the case of Xeon Phi, however, prospects of an interesting compiler optimization to reduce contention for MSHRs emerge, for several reasons. Xeon Phi employs GDDR5 memory, which has significantly higher latency, and the latency increases when there are multiple active data streams, i.e., many distinct array references in a loop. It is for this reason that our framework detects contention for MSHRs and performs loop distribution prior to the insertion of prefetch instructions. This reduces the number of active streams in a loop by distributing the statements in the original nest across multiple nests, and thus benefits from reduced contention for MSHRs due to the reduced memory latency. It is important to note that in cases where there is reuse among array references in the original nest, loop distribution hurts temporal locality and thus reuse. However, since we distribute at the innermost loop, only L1 cache locality is hurt, which is more than compensated for by the effective data prefetching rendered possible by the reduced contention. In multi-cores that employ DDR3 memory, the latency does not increase significantly with the number of active streams, and thus loop distribution is not beneficial.

4.4 The Compiler Framework

Figure 4.4 gives an overview of our compiler framework for coordinated multi-stage data prefetching. Our prefetching strategy is implemented in the open-source ROSE (51) compiler, which performs source-to-source transformations and provides many APIs for program analysis and transformation based on its Abstract Syntax Tree (AST).

Figure 4.4: Compiler framework for coordinated multi-stage prefetching.

As shown in Figure 4.4, there are mainly five steps to insert multi-stage prefetches into the input source code after it is parsed into the AST. Steps 1 through 3 determine what to prefetch - all array references in the innermost loops are recognized, and those with either temporal or group locality are discarded for prefetching. Step 4 is the key step that determines the prefetch distances, or when to prefetch, for multi-stage prefetching. As stated in Section 4.3, the prefetch distance for each cache level is calculated using Equations (4.2)-(4.6). This requires information such as the total number of array references as well as the number and type of computations involved in the loop nest, which are obtained by querying the AST. The last step is to insert those prefetches with the calculated prefetch distances into the AST. After our multi-stage prefetching transformation, ROSE unparses the AST to the transformed source code with prefetch instructions. Finally, a back-end vendor compiler is used to compile the transformed source code into the final executable.

Benchmark | Benchmark Suite | Category | Problem Size
cactus | SPEC CPU2006 | General Relativity | Ref Input
hmmer | SPEC CPU2006 | Search Gene Sequence | Ref Input
libquantum | SPEC CPU2006 | Quantum Computing | Ref Input
gemsfdtd | SPEC CPU2006 | Computational Electromagnetics | Ref Input
bwaves | SPEC OMP2012 | Fluid Dynamics | Ref Input
swim | SPEC OMP2012 | Weather Prediction | Ref Input
mgrid | SPEC OMP2012 | Fluid Dynamics | Ref Input
bt | SPEC OMP2012 | Fluid Dynamics | Ref Input
matmul | BLAS | Matrix-Matrix Multiplication | N=4000
spmv | General-purpose | Sparse Matrix-Vector Multiplication | N=10000

Table 4.1: Summary of benchmarks with problem sizes.

4.5 Experimental Results and Discussion

In this section, we present the results of our coordinated multi-stage data prefetching and compare it with other prefetching strategies on both the SandyBridge and Xeon Phi processors. Results comparing multi-stage prefetching and other strategies in a multithreaded environment are also presented.

4.5.1 Experimental Environment

We first present our experimental setup. We chose a diverse set of memory-intensive benchmarks from the SPEC CPU2006 and OMP2012 benchmark suites that have high cache miss rates. We also chose two frequently used kernels: dense matrix-matrix multiplication (which can be tiled) and sparse matrix-vector multiplication (which involves indirect array indexing). Table 4.1 lists all the benchmarks used, with their problem sizes. We ran our experiments on the two different hardware platforms discussed in this chapter - the multi-core Intel SandyBridge and the recently released many-core Intel Xeon Phi. The micro-architectural details of the two processors are compared in Table 4.2. Also, Xeon Phi provides instructions for prefetching data as 'exclusive' into either of the two caches, whereas SandyBridge does not provide this functionality. On Xeon Phi, we use the 'exclusive' prefetches to prefetch writes. Since Xeon Phi has a 2-level cache hierarchy and SandyBridge (with a 3-level cache hierarchy) also allows prefetching only to the L1 and L2 caches, the coordinated prefetching strategy is implemented in 2 stages on both platforms. After generating the transformed source code that contains the prefetch instructions, we use the Intel compiler (v13) (52) as the back-end compiler to generate the executable.

Microarchitecture | Cache | No. of Cores | SMT | Hardware prefetcher
SandyBridge | Non-blocking | 8 | 2-way | Streaming, Strided (L1+L2)
Xeon Phi | Blocking | 60 | 4-way | Streaming (L2 only)

Table 4.2: Details of the microarchitectures

4.5.2 Results for Multi-stage Prefetching in single-threaded environment

Figure 4.5 compares the performance results obtained on SandyBridge and Xeon Phi, respectively, for the different prefetching strategies summarized in Table 4.3. In addition to the inherent hardware-based prefetching on Intel processors and the icc-directed prefetching, we implemented four other prefetching strategies for the purpose of comparison. As discussed, we propose 2-stage coordinated software prefetching for Xeon Phi and 2-stage coordinated hardware-software prefetching for SandyBridge. Of the seven different strategies listed in Table 4.3, we could not implement the baseline prefetching on Xeon Phi because the BIOS does not provide the facility to turn off the hardware prefetcher as on SandyBridge. Thus, hardware-based prefetching serves as the baseline in the case of Xeon Phi. On the other hand, the Intel compiler for multi-cores is largely passive in the matter of prefetching and relies primarily on the hardware prefetcher.

Strategy | Description
Hardware prefetching | The prefetching strategy used by Intel processors
L1 SW pref. | Data is prefetched only to L1 cache
L2 SW pref. | Data is prefetched only to L2 cache
2-stage SW pref. | Data is prefetched to both L1 and L2 cache using carefully chosen prefetch distances
2-stage HW-SW pref. | Data is prefetched to L1 cache assuming that the hardware prefetcher brings the data to the L2 cache
icc SW pref. | The prefetching strategy used by the Intel compiler
Baseline pref. | The baseline configuration with all prefetching disabled

Table 4.3: Summary of different prefetching strategies.


Figure 4.5: Performance speedup of different prefetching strategies with a single thread on (a) SandyBridge (results normalized w.r.t. baseline pref.) and (b) Xeon Phi (results normalized w.r.t. hardware pref.)

Thus, in our results for SandyBridge, we do not show the performance results for icc-directed prefetching, which performs similarly to the hardware-based prefetching for all benchmarks. In our analysis, we explain the behavior of the prefetching strategy adopted by the Intel compiler by studying the assembly code generated for the different benchmarks. According to the data access pattern, we divide the benchmarks into four categories: 1) programs with streaming or strided accesses, 2) programs with arrays of structures (and pointers), 3) programs with indirect array indexes, and 4) programs with loop sizes that are fractions of the problem size.

1. Programs with streaming or strided accesses. Among the benchmarks used for our experiments, cactus, hmmer, bwaves, swim, and matmul involve streaming array accesses, whereas bt involves streaming as well as strided accesses. Results show that on both hardware platforms, the 2-stage coordinated prefetching proposed for the respective platform either outperforms or performs close to the best performing prefetching strategy for all benchmarks. It is interesting to note that 2-stage coordinated hw-sw prefetching, which is the best performing strategy on SandyBridge, is the worst performing on Xeon Phi, owing to the difference in interaction between hardware and software prefetching on the two platforms. On Xeon Phi, the software prefetches cannot train the hardware prefetcher and the two cannot co-exist. On SandyBridge, on the other hand, we employ software-based prefetching to train the hardware prefetcher (because such an opportunity is provided by the hardware) and thereby overcome its limitations, and thus outperform the other strategies. It is similarly interesting that 2-stage sw prefetching, which is the best performing strategy on Xeon Phi, performs poorly on SandyBridge. This is because of the very small LFB at the L2 cache in SandyBridge. The L2 sw prefetching performs poorly on SandyBridge for the same reason. The L1 sw prefetching performs worse than the L2 sw prefetching on Xeon Phi due to fewer MSHRs at the L1 cache (and thus more contention), whereas it performs slightly better than the L2 sw prefetching on SandyBridge due to prefetching the data to the L1 cache (which has a same-sized LFB as the L2 cache). The performance results for each benchmark are discussed in more detail below.

swim, bwaves and matmul. These benchmarks involve streaming access patterns with long loops. Although matmul has temporal locality, it behaves like the other streaming benchmarks when it is not tiled, since the data set size is much larger than the L2 cache size. Here, multi-stage prefetching performs considerably better than the other prefetching strategies. It wins over the hardware and L2 sw prefetching by hiding the L2-to-L1 latency through prefetching the data to the L1 cache. The improvement is significant since the L1 misses are significant, given the streaming nature and large working sets in these benchmarks. On SandyBridge, the 2-stage coordinated hw-sw prefetching improves considerably over the hardware-based prefetching in the case of matmul, even though the L1 hardware prefetcher proves sufficient for the small number of streams in matmul.
This is because L1 software prefetches with large prefetch distances help in increasing the prefetch distance of the hardware prefetcher, which is essential for the small loops in matmul. The L1 hardware prefetcher on SandyBridge prefetches only the next cache line, and thus cannot sufficiently increase the prefetch distance of the L2 hardware prefetcher. It is for this reason that we observe (using the VTune performance monitoring tool (53)) that a significantly larger number of loads that miss the L1 cache hit the LFB for the matmul version that uses the hardware prefetcher than for the one that uses our 2-stage coordinated software prefetching. The higher number for the former suggests pending prefetch requests at the L2 cache because of an insufficient prefetch distance. On Xeon Phi, the Intel compiler performs worse than even the hardware prefetcher in matmul, because it inserts redundant prefetch instructions for the array reference (reference C[i][j] in Figure 4.2) that has temporal locality in an inner loop. In swim, the Intel compiler performs poorly as it does not prefetch data for all references. In addition, the data is prefetched only to the L2 cache and not to the L1 cache. In bwaves, the Intel compiler performs worse than coordinated prefetching because it issues redundant prefetches for references that have temporal locality in the innermost loop, while leaving out references that have spatial locality in the innermost loop. We believe that it makes a wrong decision because it performs the loop interchange optimization. It also does not prefetch data to the L1 cache in this case.

cactus. In cactus, the bulk of the computation and memory references happen in a very large loop nest. Each iteration of the loop nest provides sufficient cycles to hide the memory latency. Since the loop nest involves significant computation, even requests to prefetch the data directly to the L1 cache finish without many stalls. As a result, all software prefetching strategies perform similarly. The multi-stage prefetching achieves slightly better performance compared to the L1 or L2 prefetching because it achieves better performance in other smaller loop nests that also involve streaming accesses. On Xeon Phi, the Intel compiler performs slightly worse because it inserts all prefetch instructions at the end of the loop instead of interleaving them with the computation. This leads to stalls due to contention at the L1 cache that hurt performance. An important observation here is that the baseline hardware prefetcher on Xeon Phi performs significantly worse than the other strategies. This is because the large loop nest contains more streams (81 data streams) than can be handled by the hardware prefetcher (16, in the case of Intel Xeon Phi). The performance difference is much smaller on SandyBridge for 2 reasons: (1) the hardware prefetcher can prefetch data for 32 streams instead of 16, and (2) the out-of-order execution tolerates most of the stalls since there is significant computation interspersing the memory requests in the large loop nest in cactus.

hmmer. In hmmer, a subroutine called P7Viterbi is the most computationally intensive and contains small loops. All the data referenced in the subroutine fits in the L2 cache. Thus, although 2-stage prefetching that brings the data to the L1 cache wins over the other strategies, the performance difference is small.
On Xeon Phi, the Intel compiler also prefetches to both the L1 and L2 cache, but uses a much larger prefetch distance, which proves costly given the small loops.

bt. In bt, four subroutines, compute_rhs, x_solve, y_solve and z_solve, contribute almost entirely to the execution time. These subroutines contain both streaming and strided accesses, and the working set is large. Thus, although bt is a likely candidate for significant performance gains from 2-stage prefetching as in swim, the performance gains are small. This is because the loop nests in bt, particularly some of the time-consuming nests in x_solve, y_solve and z_solve, are dominated by computation rather than by memory references. The performance improvement achieved by coordinated prefetching on SandyBridge is smaller than that achieved on Xeon Phi because the higher amount of computation in the loop nests allows stalls to be tolerated by out-of-order execution. Most of the performance improvement achieved by 2-stage prefetching stems from prefetching for strided accesses. These strided accesses are missed even by the strided hardware prefetcher in SandyBridge (which can detect strides up to 2K bytes), since the strides are long, given loops with large trip counts. On Xeon Phi, the Intel compiler does not yield much improvement over the baseline because it does not adequately prefetch for strided accesses.

Figure 4.6: An example loop nest with prefetching instructions to L1 and L2 cache in the libquantum benchmark.

2. Programs with arrays of structures (and pointers). In libquantum, the computationally intensive loop-nests contain an array of structures that is responsible for the bulk of the memory accesses. One such loop-nest is shown in Figure 4.6, where the memory reference reg → node is an array of structures that has state as one of its fields. In such cases, our multi-stage prefetching algorithm determines the size of the structure and prefetches the entire structure on the assumption that the majority of the fields of the structure will be referenced - this may lead to prefetching more than a single cache line. However, in libquantum, the structure has only 2 fields and a size of 16 bytes, so we prefetch just one cache line. Our multi-stage prefetching wins over the hardware-based prefetching, as the hardware prefetcher cannot run sufficiently ahead of the program counter for timely prefetching, given the small amount of computation in the loop-nests and the larger size of the data structure. It is for this reason that on SandyBridge, coordinated hw-sw prefetching, where the L1 software prefetches train and thus help in increasing the prefetch distance of the hardware prefetcher, outperforms all other strategies. The performance improvement of coordinated sw prefetching over the baseline hardware prefetcher on Xeon Phi is more significant since there is no hardware prefetcher at L1 in Xeon Phi to prefetch to the L1 cache. On Xeon Phi, the Intel compiler does not prefetch arrays of structures and thus performs only as well as the hardware prefetcher.

3. Programs with indirect array indexes. An example program with indirect array indexes is sparse matrix-vector multiplication, as shown in Figure 4.7, where the array reference B[colIdx[j]] is indirectly indexed.

Figure 4.7: Sparse matrix-vector multiplication with prefetching instructions to L1 and L2 cache.

B[colIdx[j]] is indirectly indexed. In such cases, we prefetch data for the directly indexed ref- erence, colIdx, that is used to reference the indirectly indexed reference B. We also prefetch the array reference values, but not reference C that has temporal locality in the innermost loop j. On Xeon Phi, we get considerable performance improvement over the baseline hardware prefetcher due to prefetching to L1 cache. On SandyBridge, however, the hardware prefetcher performs slightly better than the 2-stage coordinated hw-sw prefetching as even the hardware prefetcher at L1 cache is successful in bringing the data to the L1 cache given few memory ref- erences in the loop nest. The slight better performance of the hardware prefetcher is due to no instruction overhead of prefetching. It is important to note that the loop iteration time in spmv is not as small as in matmul due to an indirectly indexed array reference that hurts efficient vec- torization, and thus the benefit from increasing the prefetch distance of the hardware prefetcher through software prefetches is not significant. On Xeon Phi, the Intel compiler performs similar to the coordinated sw prefetching since it also prefetches to both the L1 and L2 cache using different distances. Although it uses a much larger distance that needed for the L1 prefetches, the performance is not hurt due to very long data streams. 4. Programs with loop sizes that are fractions of the problem size. This category of pro- grams include both the tiled codes and various scientific codes that compute in parts such as the gemsfdtd and mgrid benchmarks from SPEC suite. In such codes, our multi-stage prefetching algorithm prefetches data in the following array rows instead of prefetching a few cache lines ahead as shown earlier in Figure 4.3(c). This helps to not only avoid prefetching useless data but also allows to timely prefetch useful data. On Xeon Phi, coordinated software prefetching wins over the hardware prefetcher and other software prefetching strategies primarily due to prefetching the data to L1 cache and also eliminating redundant prefetches. On SandyBridge, 72 hardware-based prefetching and coordinated hw-sw prefetching win over others due to employ- ing an efficient hardware prefetcher (with no resource contention) to prefetch data to the L2 cache. Other details regarding performance results for the 3 benchmarks in this category are discussed as follows. gemsfdtd. On SandyBridge, the coordinated hardware-software prefetching performs slightly worse than the hardware prefetcher because although this benchmark computes in 2 parts, most of the execution belongs to one of those 2 parts. As a result, aggressive prefetching by hardware enables timely prefetching for the 2 parts although at the cost of some useless prefetches. The overhead due to both the useless prefetches and not prefetching the data to the L1 cache are tolerated by the out-of-order execution since the loops nests are not highly memory-intensive and involve considerable computation. On Xeon Phi, the Intel compiler does not achieve im- provement over the hardware prefetcher due to insertion of useless prefetches by prefetching using large prefetch distances. mgrid. On SandyBridge, the improvement achieved by coordinated hw-sw prefetching over the hardware prefetcher is small because this benchmark is highly memory intensive, and thus we observe significant pipeline stalls due to filled up load buffer, when prefetch requests are inserted. 
Using Vtune, we observe that the stalls due to filled up load buffer in a code with L1 software prefetch instructions increase by a factor of 3 over a code that contains no prefetch instructions. On Xeon Phi, however, there are no additional stalls due to prefetching and thus significant improvement is obtained over the hardware prefetcher. The Intel compiler in this case inserts prefetches for both the L1 and L2 caches using different distances. The L1 prefetch distance is again larger than needed but does not hurt performance since this benchmark is less sensitive to prefetch distance. matmul-tiled. We tile matmul using the same tile size as chosen by icc, i.e. 128. As a result of using a tile size of 128, the working is reduced such that it exceeds the L1 cache size, but is well within the L2 cache size. Thus, a single tile always fits the L2 cache, and once fetched, the data needs to only be prefetched from the L2 cache in subsequent executions of the tile. This gives interesting performance results on Xeon Phi - all prefetching strategies that prefetch the data to the L1 cache perform as well as the coordinated sw prefetching. However, icc performs slightly worse because of inserting redundant prefetches for references with locality in the inner loops as in matmul. On SandyBridge, the L1 hardware prefetcher effectively prefetches data to the L1 cache, given few references involved, and thus performs slightly better than coordinated 73

4 L1-SW 2 L1-SW 3.5 L2-SW L2-SW 2-stage-SW 2-stage-SW 3 2-stage-HW-SW 1.6 2-stage-HW-SW HW ICC 2.5 1.2 2 speedup speedup 1.5 0.8 1 0.4 0.5 0 0 gemsfdtdbwaves cactus bt mgrid swim matmul geomean gemsfdtdbwaves cactus bt mgrid swim matmul geomean

Figure 4.8: Performance speedup of different prefetching strategies using CMP technique in (a) SandyBridge (results normalized wrt baseline pref.) (b) Xeon Phi (results normalized wrt hardware pref.) hw-sw prefetching. The other strategies only perform slightly worse since the working set occupies the L2 cache, and part of the L2-L1 latency can be tolerated through out-of-order execution. 4.5.3 Results for Multi-stage Prefetching in multithreaded environment

Figure 4.8 shows the performance results of our coordinated prefetching strategies for Sandy- Bridge and Xeon Phi on 8 and 32 cores, respectively, in CMP environment. On SandyBridge, as discussed in Section 4.3.4, the performance improvements achieved by coordinated hw-sw prefetching using single thread diminish for benchmarks that are highly memory-intensive due to increased stalls from resource contention. The benchmarks in this list are swim, gemsfdtd, bwaves and mgrid. Other less memory intensive benchmarks such as bt and cactus continue to show improvement. Matmul, although memory intensive, shows significant improvement owing to the increase in hardware prefetch distance achieved through software prefetching. On Xeon Phi, the benchmarks behave similar in both the CMP and single thread environment since prefetching does not cause additional stalls in the CMP environment as in SandyBridge. As discussed in Section 4.3.4, memory intensive benchmarks with many references in the loop nest benefit from loop distribution optimization on Xeon Phi on account of reduced con- tention. Figure 4.9 shows the performance gains of loop distribution (followed by multi-stage prefetching) in 3 memory-intensive parallel benchmarks, swim, gemsfdtd and cactus. Using the recommended 2-way SMT on Xeon Phi, loop distribution achieves another 13% average 74

2.5 2-stage 2-stage+loop-distribution 2 icc

1.5

speedup 1

0.5

0 swim gemsfdtd cactus

Figure 4.9: Performance speedup with loop distribution in Xeon Phi (results normalized wrt hardware pref.) speedup over multi-stage prefetching on the 3 benchmarks. 4.6 Related Work In the past, there has been significant research on tuning the prefetch distance and reducing overhead due to software prefetching (54; 55; 56; 50). Mowry et al. (50) defined prefetch distance in terms of the prefetch latency and loop iteration time. Mowry et al. also proposed prefetch predicates to eliminate redundant prefetches. Badawy et al. in (54) proposed to insert prefetch instructions in the epilogue and prologue. Recently, Lee et al. (49) have proposed to calculate the loop iteration time using the average IPC of the application obtained through program profiling. Such an adjustment to prefetch distance is necessary for modern multi-cores that execute out-of-order. In our work, however, since we only use software to prefetch to the L1 cache in multi-cores, we show that using a large prefetch distance such that it does not cause evictions in the L1 cache is a better choice for several reasons. This has the advantage of being loop-based and not program-based, and is calculated statically. In recent past, coordinated prefetching was proposed and implemented in IBM’s Power6 microarchitecture (45). On Power6 (and its follow-ons), the L2 hardware prefetcher that can hold 32 outstanding prefetch requests, prefetches data from memory using a larger distance (at most 24 lines) and the L1 hardware prefetcher that can hold 8 prefetch requests, prefetches data from L1 using a smaller distance (at most 2 lines). In this work, we use a similar idea to 75 achieve coordination, but through software prefetching. On SandyBridge, software prefetch- ing at L1 coordinates with the L2 hardware prefetcher that allows us to significantly increase the prefetch distance (without polluting the cache - Section 4.3.3) and overcome the limita- tions of the hardware prefetcher as mentioned in Section 4.1. On Xeon Phi, the coordination is achieved entirely through software prefetching. Lee et al. (49) have considered coordina- tion between hardware and software prefetching on existing multi-cores. They manually insert software prefetch instructions and identify benchmarks for their positive, neutral or negative interaction with the existing hardware prefetchers (primarily through simulation). Their work is thus focused on studying the interaction between hardware and software prefetching, but does not propose any particular prefetching strategy. In their tests, they use cooperative hardware- software prefetching where both software and hardware prefetching co-exist; ours is a coordi- nated hardware-software prefetching technique, where the software prefetches actively train the hardware prefetcher to achieve coordination. In the past, Santhanam et al. (57) and Caragea et al. (58) have also proposed reduction of prefetch distance due to limited resources in the form of number of outstanding prefetch requests that can be handled by the hardware. However, these works consider HP PA-8000 and a many-core research machine, XMT, respectively, both of which employ single-level cache. Thus, unlike our work, they do not employ the facility to prefetch to multiple levels of cache as in existing architectures, to tackle the problem of resource contention. A recent work from Intel (59) talks about prefetching the data to both the L1 and L2 cache - a technique implemented in the existing version of the Intel compiler for Xeon Phi. 
However, they do not describe any strategy to prefetch the data to the L1 and L2 cache such that a coordination is achieved. We observe in our results that the Intel compiler for Xeon Phi does not always insert prefetch requests for L1 and L2 cache, and generally uses a larger prefetch distance at L1 than needed. As a result, it cannot match the performance of our coordinated prefetching on Xeon Phi. Among other works in software-based prefetching are those that employ helper threads to predict future load addresses (60; 61; 62). The authors in (60; 61) use an idle thread as the helper thread to prefetch data for another compute thread, whereas Son et al. (62) extend the helper-thread prefetching to work with multiple cores by assigning a customized helper thread to a group of compute threads. These works, however, have not evaluated the impact of resource contention due to aggressive prefetching by the helper-thread. In our experiments, we find the impact of resource contention to be important, particularly on many-cores that have blocking 76 caches, and thus implement a simple and low-overhead software prefetching strategy that is different from helper-thread based prefetching. 4.7 Conclusion In this work, we study software prefetching in light of various architectural features that support prefetching in existing processors. Based on our study of those features, we implement a coor- dinated multi-stage prefetching strategy for two widely different state-of-the-art processors, the multi-core SandyBridge and the many-core Xeon Phi. However, the means of achieving this coordination is different on either - in SandyBridge, it is achieved through the MLC hardware prefetcher prefetching data to the L2 cache and L1 software prefetches further bringing it to L1 cache; in Xeon Phi, the coordination is achieved through carefully tuned software prefetch- ing at different levels of cache. Results establish the efficacy of these strategies on respective platforms and also the importance of an awareness of the influence of different architectural features on prefetching as studied in this work. The performance results achieved using the 7 different prefetching strategies are discussed for each benchmark considered. Further, the performance of these strategies in multithreaded environment is also presented and discussed. Since our multi-stage coordinated prefetching is a simple, static prefetching technique, and does not require any special hardware, it can be readily incorporated in production-quality compilers for existing architectures. Chapter 5

Loop Fusion and Real Applications - Part 1

5.1 Introduction With the increase in the number of cores on chip (or, processors in a node) and the consequent accentuation of the existing problems of memory and bandwidth wall, there is a renewed focus on the optimization capabilities of a parallelizing compiler. A parallelizing compiler should not only help to exploit the available parallelism in the host hardware, but also alleviate the problems of memory and bandwidth wall by performing memory optimizations such as loop tiling, data prefetching, loop fusion and other supporting optimizations such as loop shifting and loop interchange. While these important responsibilities of a parallelizing compiler are well recognized, it is also well known that compilers often fall way short in capitalizing on the optimization opportunities provided by a target application. One key reason for this shortfall is that most existing production compilers built upon tra- ditional wisdom, often limit themselves to optimizing only small scopes in the entire program (63; 64; 65; 66) or kernels (6; 67; 68; 69). Even in some recent work, such as (70), the au- thors only analyze individual hot loops in target applications to mark those loops as parallel or determine legality of loop distribution. However, our experiments with several scientific applications from the SPEC benchmark Suite reveal that there are many opportunities for im- provement in memory (and parallel) performance of those benchmarks through global program optimizations (or transformations) such as applying loop fusion across a sequence of such hot

77 78 loops. Such loop fusion across multiple loop nests saves the cost of fork-join synchronization between loops, and more importantly, significantly improves temporal locality and thus saves costly off-chip memory accesses. Both these benefits are of increasing importance in the era of multi-/many-cores. In the past, there has been work on loop fusion (71; 72; 73; 65), but the existing production compilers still prove insufficient in fusing multiple loop nests. We find that, in addition to some of the other limitations such as non-conformable loop bounds or loop orders in different nests, and existence of imperfect nests, it is the existence of (extra-)stringent memory dependences on temporary variables (scalars or low-dimensional arrays as shown in Figure 5.1a) in different nests that is the key factor behind the dismal performance of existing compilers with regards to global program optimization. Such memory dependences severely limit the degrees of freedom for loop transformations such as loop fusion and loop interchange. It is important to note that we cannot simply get rid of these dependences to enable loop fusion, because fusion merges multiple definitions and uses in the fused loop’s body that can potentially violate program cor- rectness. Such dependences thus require specific treatment. In this work, we make the following contributions,

1. We analyze an important cause of the incapability of the existing compilers in achiev- ing global program transformations - the (extra-)stringent data dependences caused by temporary variables that appear in different loop nests and have the same variable-name.

2. We propose variable liberalization1, a technique that strategically relaxes (or liberalizes) dependences on temporary variables in different nests so as to release the degrees of freedom needed to express effective loop transformations such as loop fusion and loop interchange while preserving program correctness. Variable liberalization is different from variable privatization because it relaxes dependences on multiple outer loops across loop-nests instead of just the outer loop in a single nest, and it is performed to also al- low fusion in addition to just parallelization. Liberalization is different from expansion because there is no real expansion of variable dimensions that takes place. It can thus be seen as a combination of both privatization and expansion, drawing the benefits of both.

1Liberalization literally means relaxing previously existing rules to, for example, allow freer interaction between 2 parties. Variable liberalization similarly relaxes the extra-stringent data dependences caused by temporary vari- ables in 2 loop-nests to allow global optimizations such as loop fusion between them. 79 3. We implement our work in the state-of-the-art PLuTo polyhedral compiler and evaluate the improvement obtained in terms of parallel performance on 8 hot regions in 5 sub- routines of 4 real scientific applications from the SPEC and NAS Parallel Benchmark Suites that have not been effectively optimized by current production compilers even though they contain significant opportunity. In addition, as a result of pruning/relaxing dependences to achieve variable liberalization, we significantly reduce the memory re- quirement and the compile time of large application programs as compared to state-of-the art polyhedral compilers, thus addressing the scalability problem within such compilers to some extent.

The rest of the chapter is organized as follows. Section 5.2 reinforces the motivation for this work through a simple example program that exposes the limitation of existing compil- ers in global program transformation, and shows the potential benefits from overcoming that limitation. Section 5.3 provides a background on dependence analysis with focus on tempo- rary variables, and also introduces key concepts for the understanding of our proposed variable liberalization optimization. This is followed by a detailed discussion of our approach, and its application in the different cases seen in real applications in Section 5.4. Section 5.5 puts all previous discussion together in the form of an algorithm that implements liberalization. In Sec- tion 5.6, we evaluate our approach against state-of-the-art compilers and discuss results in each case. The related work is presented in Section 5.7. Finally, Section 5.8 concludes our work. 5.2 Motivation Figure 5.1a shows an example program that assumes some of the features, characteristic of real scientific applications. These features include, extensive use of temporary variables such as the scalar variable a and the array variable tmp, excellent opportunities for data reuse across loop nests such as that in arrays rho, x and z, and imperfectly nested loops. Figure 5.1b shows a transformed program where the two loop nests in the original program (Figure 5.1a) are fused into a single nest and the outermost loop has been marked parallel (after marking variables a and tmp as private to each thread) . Clearly, the transformed program is equivalent to the original program since all memory dependences are satisfied. Figure 5.1c shows the transformed program as generated by PLuTo using its most effective fusion heuristic, smartfuse. Table 5.1 shows the sequential and parallel performance of the two transformed programs normalized with respect to the performance achieved by the Intel Compiler (ICC with ‘-O3 80 for(i1=0;i1

Figure 5.1: (a) Original code, (b) Optimized (or transformed) code, and (c) Transformed program generated by PLuTo

-parallel’; O3 enables fusion) on original program, on an 8-core Intel Xeon processor. Perfor- mance results indicate the following,

1. Neither ICC (on the original program) nor PLuTo were able to either achieve loop fusion even though there is reuse, or mark the outermost loop as parallel for coarse-grained parallelization.

2. The transformed program in Figure 5.1b fuses both the loop nests, and thus achieves a se- quential performance improvement of 1.19x over ICC. This demonstrates the efficacy of exploiting the data reuse across loop nests. PLuTo achieves fusion at the innermost loop within the same nest after some loop peeling and shifting. While this improves data reuse to some extent, it leads to non-vectorizable innermost loop due to the introduction of a forward dependence upon fusion. This explains PLuTo’s worse performance as compared to ICC.

3. The transformed program in Figure 5.1b achieves a parallel performance improvement of 4.47x over ICC when running 8 threads in parallel, clearly revealing the optimization potential. Even after the two loop nests in Figure 5.1a are explicitly marked parallel (and variables a and tmp privatized), parallel performance improvement of the transformed program in Figure 5.1b is still 1.48x over ICC. This is because of the combined savings 81 of off-chip memory accesses and fork-join synchronization through loop fusion.

1 thread 8 threads Transformed code 1.19x 4.47x PLuTo 0.48x 0.48x

Table 5.1: Performance speedup of transformed program in Fig 5.1b and PLuTo generated code (Fig 5.1c) wrt original code in Fig 5.1a

The key reason for the poor performance of ICC and PLuTo is that the use of same tempo- rary variables in the two loop nests introduces dependences that are loop-carried on the outer loops of these nests. These dependences are not false since their removal will allow a perfect fusion of the two nests as far as the temporary variables are concerned, and this can potentially violate program correctness due to incorrect definitions reaching certain uses. The following section (Section 5.3), however, shows that these dependences as they exist are more stringent than needed, and artificially lead to an reduced degree of freedom for transformations such as fusion and parallelization. Further, Section 5.3 introduces concepts that lead us to variable liberalization, the technique that relaxes dependences to generate the transformed code as in Figure 5.1b. 5.2.1 Why not scalar and array expansion (perhaps, followed by contraction)?

Scalar or array expansion involves transforming the scalar or low-dimension array variables (such as a and tmp[] in our motivating example) into full-dimension arrays (such as a[][][] and tmp[][][]). Expansion essentially creates a new memory location for the temporary variable for each iteration of the loop-nest. It thus removes loop-carried dependences between different ref- erences of the variable, leading to effective loop transformations. However, this technique (1) increases the memory footprint significantly and thus degrades temporal locality both in cache (74) and registers (75), and (2) requires declaration of new data structures and corresponding changes in the source code. The authors in (76; 77; 78) propose to perform array contraction to reduce the memory footprint after an initial expansion step. Their rationale is that expan- sion will enable aggressive optimizations, and array contraction will help to recover the loss in temporal locality at a later stage. But, the intermediate optimizations (such as loop distribu- tion that can potentially distribute the definitions and their uses) hamper opportunities for such recovery through array contraction. The authors in (79; 74) attempt to control the expansion 82 phase by constraining the aggressiveness of transformations so as to later effectively optimize the memory footprint. But, this often results in loss of optimization opportunities due to added constraints. 5.2.2 Why not scalar and array renaming across loop nests?

Renaming scalar and array variables in different loop nests could also result in disappearance of the transformation-limiting loop-carried dependences across loop nests, and will facilitate loop fusion and parallelization. But, there are three disadvantages of such a strategy which preclude us from implementing it.

1. The key limiting factor is that, after fusion of loop nests with (multiple) renamed tempo- rary variables, the opportunity for reuse of data (accessed by temporary variables) in the higher levels of the memory hierarchy is lost. In addition, the working set in those higher (and smaller) levels of the memory hierarchy expands by a factor of the number of tem- porary variables and the number of merged nests. This degrades program performance especially in the presence of temporary array variables. For example, the transformed zeusmp benchmark application program from the SPEC Suite (containing 24 temporary arrays in a loop nest) runs 7% slower when loop fusion is performed after renaming, as compared to that performed without renaming (i.e. through variable liberalization). The performance degradation may be even larger for certain other scientific applications that use even more temporary variables.

2. As a result of this substantial increase in the number of temporary variables, the number of hardware prefetch streams also increase in the same proportion as each temporary array triggers one of those prefetch streams. Since every processor has a limited number of these streams, the transformed program can easily fall short of the needed prefetch streams, leading to performance degradation.

3. Another drawback is that renaming will require generation of new variable declarations and also modification of all references to the temporary variable in all loop nests.

Our framework, on the other hand, does not require any of those changes to the original source code, and more importantly, does not cause any unnecessary increase in the number of variables in the program. This promotes efficient data reuse in higher levels of cache and effective prefetching by the hardware. 83 5.3 Background Instancewise dependence analysis (80) employed in state-of-the-art polyhedral compilers pre- cisely tells which iterations of the involved statements are in a dependence, and is thus more ef- fective is reasoning the feasibility of loop transformations. In this section, we use instancewise dependence analysis to show how dependences between temporary variables artificially sup- press fusion of multiple loop nests. Since fusion and other supporting transformations (such as loop interchange and loop shifting) are simultaneously composed within the polyhedral frame- work, we consider them in conjunction when applicable. Therefore, we first show how loop interchange is hampered by dependences on temporary variables within the same loop nest.

N N

2 i 2 j

1 1

0 0 1 2 N 1 2 N k j (a) (b)

Figure 5.2: Instancewise (RAW) dependence between statements (a) S1 and S2, and (b) S2 and S8; the dashed arrows in the figure indicate a backward (RAW) dependence

Figure 5.2 shows the dependences between specific statement instances involving temporary variables to demonstrate the transformation limiting nature of such dependences. Since the same memory locations are accessed in multiple iterations of the loop nest, there are introduced loop- carried dependences in multiple loops of the loop nest. For example, consider the read to the temporary scalar variable a in Statement S2 in Figure 5.1a at the instance (j1=1, k1=1); it is the sink of a Read After Write (RAW) dependence from not just the instances (j1=1, k1=0) and (j1=1, k1=1) but also from instances (j1=0, k1=0 .. N-1) as shown in Figure 5.2a. Thus, it leads 84 to loop-carried dependence in not just loop k but also loop j. It is important to note that the dependence distance for the dependence involving the scalar a whose source instance is (j=0, 1

We next describe some key concepts and definitions to aid the understanding of our approach to loop fusion by relaxing (or liberalizing) the dependences on temporary variables. 85 Live Ranges

In the context of polyhedral compilers, a live range is appropriately defined in terms of precise statement instances instead of static statements in the program as studied in earlier literature (81). We thus define a live range to be the range of statement instances from the definition instance of a value to its last use instance. For example, the live range for the scalar a defined in the first loop nest of the example program in Figure 5.1a is:

[S1(i, j, k) → S3(i, j, k)], s.t. 0≤i

The above notation implies there is a live range that begins at the write in statement S1 in every iteration of the loop-nest (comprising of loops i, j and k) and lasts until statement S3 in the very same iteration.

Iteration-private Live Ranges

For the above live range, we note that the live range begins and ends in the same iteration of the innermost loop k of the loop nest. We call such a live range as iteration-private in loop k. Intuitively, the same live range is also iteration private in the outer loops, i and j. Past work (82) has shown that if all live ranges are iteration private in a loop, then that loop can be marked as privatizable. For example, the live range for the scalar a in our example program is iteration private in all 3 loops, and hence, the outermost loop is marked privatizable as noted in Section 4.1. The other loops are, however, not marked as privatizable because the focus there is just coarse-grained parallelization. We use this concept of iteration-private live ranges in our proposed variable liberalization optimization.

Live Ranges and Loop Fusion

As a result of loop fusion (after a possible relaxation of dependences on temporary variables), loop bodies of the fused nests merge. Consequently, multiple definitions and uses of temporaries with the same name end up in the same loop. Thus, to ensure legality of fusion, each use must see the same definition as in the original program, or in other words, fusion of loop nests should preserve non-interference of live ranges. For example, the program in Figure 5.3a shows two live ranges for the scalar variable a in the two loop nests. Figure 5.3b shows a fused nest where the two live ranges interfere with each other as shown, and the first use of the scalar a doesn’t see the same definition as in the original program; it is thus an incorrectly 86 transformed program. Figure 5.3c preserves the non-interference of live ranges and is thus a correctly transformed program. Therefore, any relaxation of dependences must preserve this property to ensure correctness. This (1) non-interference of live ranges, combined with (2) satisfaction of true (RAW) dependences form a criteria for legality of program transformation as proved in previous work (83; 84). We use this criteria to reason correctness of the transformed program in the wake of our proposed relaxation of dependences involving (just) temporary variables. for(i=0;i

LR2 (a) S4: y[i2] = y[i2] + a; } S4: y[i] = y[i] + a; } S4: y[i] = y[i] + a; } (a) (b) (c)

Figure 5.3: Live range interference

5.4 Our Approach Our approach of achieving a legal fusion by relaxing the extra-stringent dependences on tem- porary variables is based on the following key insight: 5.4.1 Key Insight

The temporary variables by dint of their functionality of storing partial results temporarily in a program, are mostly defined in one of the inner loops of loop-nests and are used in the same

for(i1=0;i1

Figure 5.4: (a) Example program, (b) (Incorrectly) Transformed program after dependence relaxation, and (c) Correctly transformed program 87 loop. In other words, they are iteration private in one of the inner loops, and consequently, in all outer loops. For example, in the example program in Figure 5.1a that has a 3-level loop nest, the temporary scalar a is iteration private in the innermost loop k. The temporary 1D array, tmp, is iteration-private in the next-to-innermost loop j. If there was a temporary 2D array, it would be iteration-private in the outermost loop i. In the event of fusion of loop nests containing temporary variables with the same name, violation of both the criteria for legality with regards to temporary variables, i.e. (1) non- interference of live-ranges, and (2) satisfaction of RAW dependences, is only possible at loops within and including the innermost loop with iteration-private live ranges. Thus, the depen- dences on the outer loops can be relaxed to allow fusion of outer loops including the in- nermost loop with iteration-private live ranges. The ‘relaxation’ of dependences implies that the transformation-restricting all-to-all dependences on outer loops are converted into loop- independent dependences. This releases the necessary degrees of freedom for loop transfor- mations. Thus, the RAW dependences on the temporary variables in the innermost loop with iteration-private live ranges are preserved just by the presence of a loop-independent depen- dence. Also, this relaxation is sufficient to preserve non-interference of live ranges and hence correctness of the transformed program subject to the following criterion: Relaxation Criterion - While dependences on temporary variables in different loop nests are relaxed, the resultant loop fusion should not be accomplished as a consequence of loop shifting in any of the fused loops. The rationale behind this relaxation criterion is that preservation of the live-range non- interference property cannot be guaranteed if loop shifting is performed as an enabling trans- formation for loop fusion. The following example clarifies the idea. Figure 5.4b shows a trans- formed program with fused nests, for the original program in Figure 5.4a. In this example, loop fusion is made possible after relaxing cross-nest all-to-all dependences on the scalar a0 to be- come loop-independent as discussed above. Thus, effectively, the scalar variables are expanded to become 3D arrays like other arrays such as x in the loop-nest. This is shown in the comments at the end of each statement. In addition to loop fusion, the statements in the second nest are shifted by 1 iteration in the innermost loop (k) to prevent backward dependences on array x in loop k. In the trans- formed program, the (possible) backward dependence on array x is converted into a forward dependence through loop shifting to allow fusion. Thus, we see that there is a forward WAR 88 dependence between statements S3 and S4 on the scalar a0 in loop k (i.e. a0[i][j][k] written in statement S3 is read in the next iteration of the k-loop in statement S2) and a forward RAW dependence between statements S1 and S5 on the same scalar a0 in loop k (i.e. a0[i][j][k] writ- ten in statement S1 is read in the next iteration of the k-loop in statement S5). As a result, the schedule of statements within the fused nest as shown in Figure 5.4b is completely legal from the point-of-view of data dependences (some of which have been purposely relaxed). 
However, the transformed program in Figure 5.4b is incorrect because the non-interference of live-ranges property is violated - the live ranges for the 2 def-use pairs involving the scalar a0 interfere with each other as shown. This happened precisely because the above-mentioned relaxation criterion is violated (due to shifting) in the innermost loop of the fused nest, loop k. The same is possible for the other loops in the nest, which justifies the constraint in its entirety. In such cases when the relaxation criterion is violated for the last fused loop, the transfor- mation for that loop is recomputed after enforcing an all-to-all dependence at that loop-level for the temporary variables in different loop nests. This causes loop distribution at that level to generate correct code as shown in Figure 5.4c. It is worth noting here that our relaxation (or legality) criterion requires knowledge of whether loop shifting was performed to enable fusion at each loop-level, to reason correctness of transformation in the wake of proposed dependence relaxation. The precise implementation of this in our framework is discussed in detail in Section 5.5.1. It is also important to note that we use the criterion of non-interference of live ranges to rea- son about program correctness in the wake of relaxed data dependences on temporary variables. However, if the presence of certain other loop-carried data dependences on regular array refer- ences were to lead to a backward loop-carried dependence upon fusion of any two loops, then fusion of the concerned loops will not be performed as per the traditional criteria of performing fusion since such dependences are not modified (relaxed) by our framework. Figure 5.4 demonstrated how the relaxation of dependences on outer loops would lead to a legal program transformation in the presence of temporary scalar variables in different loop nests that have the same loop order (provided the fusion does not violate our relaxation cri- terion). In the following subsection, we demonstrate how we similarly handle fusion of loop nests with temporary array variables. Then, we consider liberalization of dependences when the involved nests have different loop order and show that loop-order determines depth of fusion of loop nests. 89 5.4.2 Fusion of loop nests with same loop-order in the presence of temporary array variables

We use our original example program (Figure 5.1a) to illustrate liberalization in the presence of temporary array variables. As discussed in Section 5.3, fusion of the 2 loop nests in Figure 5.1a is prevented by all-to-all dependences on the temporary 1D array, tmp, in all but the innermost loop in the nest. Similar to the case of scalar variables, we allow for loop fusion in this case by relaxing the all-to-all dependences to become loop-independent in all loops (i.e. loops i and j) including the innermost loop with iteration-private live ranges. In order to preserve live-range non-interference in loop j, we force an all-to-all dependence on the inner loop k that ensures that the two loop bodies will be distributed at the innermost loop-level as shown in Figure 5.1b. Thus, the net effect is that the WAR dependence between S2 and S8 and the RAW dependence between S4 and S6 is relaxed from being all-to-all in loops i and j to just being all-to-all in the innermost loop k, or in other words, the dependence has been shifted to the innermost loop where absolutely needed. Since no shifting was performed to enable loop fusion, the transformed program under the relaxed dependences is correct. 5.4.3 Fusion of loop nests with different loop-orders

for(i1=0;i1

Figure 5.5: (a) Loop nests with different loop-orders, (b) Loop fusion after interchange and liberaliza- tion 90 In an application program, loop nests may contain different loop-orders as shown in Figure 5.5a. Loop fusion in such cases is more involved. However, there may still be an opportunity for partially fusing the candidate loop nests, either because of common outer loops or through loop interchange in one of the nests. Thus, considering loop fusion in conjunction with loop interchange becomes particularly relevant. In case of temporary 1D arrays, loop nests are imperfectly fused at the innermost loop level as shown in Figure 5.5a, and thus interchanging the innermost two loops in infeasible. However, the outer loops can still be interchanged after array privatization in the outermost loop, which is similar to the way scalar privatization allows interchange as discussed in Section 5.3. In case of scalars, we extend scalar privatization to the inner loops also in order to allow for loop interchange in all loops of the loop-nest. Once loop interchange is made feasible, the process of relaxation of dependences on temporary variables in different nests is similar - the all-to-all dependences on all outer loops upto and including the innermost loop with iteration-private live ranges are relaxed, and an all-to-all dependence on remaining inner loops, if any, is introduced to preserve the non-interference of live-ranges property. For temporary 1D arrays as in our example program in Figure 5.5a, the all-to-all dependences on loops i and j in the first nest to loops j and i, respectively, in the second nest, are relaxed, and an all-to-all dependence is introduced in the inner loop k to achieve the transformed program in Figure 5.5b. It is important to note that fusion of all loops with iteration-private live ranges was possible since such loops were common (i.e. loops i and j in the first nest and loops j and i in the second) in this case. When trying all possible combinations of loop-orders in the two nests, we find that we are able to fuse the nests at least up to the depth of 1 loop-level in all cases, since the two loops with iteration-private live ranges will always have at least 1 loop in common, and loop interchange ensures that the common loop can be the outermost loop in the fused nest. Loop fusion even at the outermost loop, ensures data reuse in the last level cache in most cases, and is thus valuable in various scientific benchmarks where different loop orders in loop nests is common. 5.4.4 Cases in which dependence relaxation is not feasible

In this section, we mention two examples where dependences on temporary variables in dif- ferent loop nests cannot be relaxed because the conditions for such relaxation are not satisfied. Figure 5.6a shows the first example that contains a temporary array tmp in the two loop nests. Relaxing dependences on the outer loops i and j would result in fusion of the two nests, but 91 for(i1=0;i1

for(i2=0;i2

Figure 5.6: Example programs where dependence relaxation is infeasible such a fusion will clearly be illegal and our framework desists from performing it. The reason is that the live range involving tmp is not iteration private in the outer loops of the loop-nest, i.e. a value written in tmp[k] in each iteration of loop j is read in the next iteration of loop j. The same holds true for the outermost loop i as well. The net result is that tmp is live-in in the second nest, which clearly indicates the infeasibility of loop fusion in this case. Figure 5.6b shows the second example. The subscript of the temporary array a is the outer- most loop variable i as opposed to the previous examples where temporary array’s subscript is an innermost loop variable. Clearly, the live-range of a is not iteration-private in any loop in the loop-nest, which also indicates its use later in the program. It is therefore not marked for liber- alization. Also, in such a case, liberalization is not needed to allow fusion (and parallelism) of the outer loop since there is no all-to-all dependence on the outer loop. 5.5 Implementation: putting it all together This section describes the algorithm used to implement variable liberalization, as discussed in the preceding sections. The algorithm comprises two steps. Step 1 essentially performs vari- able privatization in all outer loops of a loop-nest that have iteration-private live-ranges for all references in the loop body. We perform privatization in multiple outer loops instead of just the outermost loop. It allows loop interchange which creates opportunities for loop fusion. This is effected by removal of the all-to-all dependences (identified by Line 6 in the algorithm) on 92 temporary variables in the iteration-private loops of the loop-nest containing the source state- ment (Line 5). Step 1 is performed only for those dependences whose sink (or destination) statement is live for its source statement (Line 4). This condition holds for temporary variables being defined and used in the same loop-nest, and thus program correctness can be preserved by only retaining a loop-independent dependence between the source and sink in the innermost loop with iteration-private live ranges after pruning the all-to-all dependences.

ALGORITHM 2: Variable privatization

1: INPUT: deps : Copy of the list of dependences 2: STEP 1: 3: begin 4: for each dependence d ∈ deps s.t. IsLive(d.src stmt,d.dest stmt) = true do 5: for each loop l ∈ LoopNest(d.src stmt) s.t. IsIterPriv(l,d.src stmt) = true do 6: if IsAllToAll(d,l) then 7: Remove d from the list of dependences, deps 8: end 9: STEP 2: 10: begin 11: for each dependence d ∈ deps s.t. IsLive(d.src stmt,d.dest stmt) = false do 12: for each loop l ∈ LoopNest(d.src stmt) do 13: if IsIterPriv(l,d.src stmt) = true then 14: Make d loop-independent at loop-level l 15: else 16: Make d all-to-all at loop-level l 17: end 18: OUTPUT: Relaxed set of dependences, deps

It is important to note here that the condition in Line 4 does not hold for those dependences whose source and sink lie in different loop nests, or even when they lie in the same loop nest in case of multiple definitions of the temporary variable in the same nest. It is such dependences that restrict loop fusion and require more careful treatment. Therefore they are handled sepa- rately in Step 2 to preserve the non-interference of live-ranges property. This involves forcing an all-to-all dependence on the loops inner to the innermost iteration-private loop (Line 16) that strictly precludes fusion of the two nests beyond the innermost iteration-private loop, whereas dependences on the other outer loops are relaxed to become loop-independent (Line 14). This 93 ensures program correctness, provided the relaxation criterion is not violated. 5.5.1 Validation of relaxation criteria

In polyhedral compilation, violated dependence analysis (80) is becoming increasingly popular. Using violated dependence analysis, the polyhedral compiler can reason the correctness of the proposed transformation in the wake of either relaxed legality checks (85) or relaxed depen- dences (84; 86). In such an approach, one has to wait until a transformation is performed and re-iterate the process possibly several times in the case of an incorrect transformation, or take a corrective action. In this work, we introduce violated transformation analysis, where we can reason about the correctness of the computed transformation during the process of finding trans- formation itself, and the recovery is executed immediately to yield a correct (although weaker) transformation. As discussed in Section 5.4, we can reason about the validity of dependence relaxation using our relaxation criterion which amounts to determining the absence of loop shifting transforma- tion in the fused loops. In our framework, we determine this for each step of the transformation process that involves determining what statements can be fused in the same nest at a given loop-level, starting from the outermost loop2. If the transformation computed for the statements from different nests indicates that loop shifting has occurred in the last (fused) loop, then the last loop found is discarded. In such a case, an all-to-all dependence is introduced at that loop to ensure that the 2 loop nests are distributed at that loop-level to guarantee correctness. Thus, the correctness is guaranteed during the transformation process itself and there is no need to re-iterate the process. For example, for the program in Figure 5.6a, when trying all possible combinations of loop orders in the two nests, we find that for certain combinations such as (i-j-k) and (k-i-j), the framework initially finds two common loops and the common loop i with iteration-private live ranges becomes the outermost loop. However, for the second loop found, there is loop shifting in S4 by 1 with respect to other statements, and thus the second loop is discarded. The final transformed program contains just 1-deep fused nest. Thus, we find violated transformation analysis to be extremely useful in this case, and we believe that this technique can become useful to find or reason about correctness of other transformations as well. 2More detail on the process of computing transformations using a polyhedral compiler can be found in (6) 94 5.6 Experimental Evaluation

Fusion Model Description gfortran GNU Fortran Compiler (baseline); flags = ‘-O3 -ftree-loop-parallelize=n’ ifort The ; flags = ‘-O3 -parallel’ The default fusion model in PLuTo. smartfuse It uses heuristics to determine a good fusion schedule explicit Explicit parallelization or parallelization by hand var-lib Variable liberation in PLuTo (our work)

Table 5.2: Summary of fusion models in different compilers

5.6.1 Setup

The test programs were compiled and run on an Intel Xeon processor (E5-2650) with 8 Sandy Bridge-EP cores, operating at 2.0GHz. The processor has private L1 (32KB per core) and L2 (256KB per core) caches and a 20MB shared L3 cache, and 16GB memory. Since we implement the variable liberalization optimization in the PLuTo polyhedral compiler, we compare our performance results with the fusion model within PLuTo, in addition to the popular production compilers, the GNU and the Intel compilers. A summary of the different fusion models used for comparison is given in Table 5.2. It is important to note that PLuTo itself cannot parse Fortan code (and the applications we use for experiments are written in Fortran). We thus use PolyOpt/Fortran (87), a tool that uses ROSE compiler (88) frontend to parse Fortran code and relies on PLuTo (version 0.5.4) for loop optimizations. The transformed source code generated is then compiled using the Intel compiler v14 (ifort) as the backend compiler. The compile time options used with the Intel compiler are ‘-O3’ and ‘-parallel’. 5.6.2 Benchmarks

The benchmarks used in the experiments include 4 real application programs used in the sci- entific community, and in other published research. A brief description of these benchmarks is given in Table 5.3. In these 4 applications, we identify 8 hot regions spanning 5 subroutines. Each identified hot region constitutes a Static Control Part (SCoP), or the maximal syntactic program segment that contains sequences of loop nests with constant strides and affine bounds. As a result, all chosen hot regions are amenable for optimizations by a polyhedral compiler. It is 95

Sequential performance normalized wrt to gfortan Parallel performance normalized wrt to gfortan 1.8 9 ifort ifort 1.6 smartfuse 8 explicit smartfuse 1.4 var-lib 7 var-lib 1.2 6

1 5

0.8 4

0.6 3

0.4 2

0.2 1

0 0 sp.rhs zeusmp.hsmoc.1 zeusmp.hsmoc.3 zeusmp.lorentz.2 sp.rhs zeusmp.hsmoc.1 zeusmp.hsmoc.3 zeusmp.lorentz.2 bt.rhs lu.rhs zeusmp.hsmoc.2 zeusmp.lorentz.1 bt.rhs lu.rhs zeusmp.hsmoc.2 zeusmp.lorentz.1 Figure 5.7: Performance results using (a) single thread (b) 8 threads (cores) important to note that each of these 8 SCoPs contain multiple large loop-nests with the number of statements in each SCoP ranging from 48 to 121. Such large sequences of statements are known to be hard for the compilers to optimize.

Benchmark Benchmark Suite Category Problem Size applu NPB/OMP2012 Computational Fluid Dynamics (CFD) N=102; CLASS B bt NPB/OMP2012 ” ” sp NPB ” ” zeusmp CPU2006 Simulation of astrophysical phenomena Reference Input

Table 5.3: Summary of the benchmarks

5.6.3 Results and Discussion

Figure 5.7 shows the performance results for the 8 SCoPs using different compilers for com- parison. Figure 5.7a shows results for sequential performance, while 5.7b shows parallel per- formance. Among the 4 benchmarks, the entire rhs subroutine in bt, sp and lu benchmarks forms a single SCoP, the hsmoc subroutine within zeusmp benchmark consists of 3 SCoPs each separated by a procedure call (with possible side-effects and thus recognized as a non-affine component) whereas the lorentz subroutine consists of a single SCoP, but divided into 2 in or- der to limit the memory requirement for optimizing it using polyhedral compilation. Each of the chosen benchmarks contains multiple loop nests and thus offers considerable opportunities for loop optimizations - a characteristic of most scientific application codes. In particular, each of the chosen benchmarks contain loop-nests with different loop-orders, and thus are more challenging from the point of view of the loop fusion optimization. From the figure, we can find that var-lib, i.e. the compiler optimization proposed in this work outperforms the other compilers in almost all cases. The sequential performance of var-lib outperforms 96 ifort by as much as 1.2x, and the performance improvement is significantly larger for parallel versions of the benchmarks with var-lib outperforming ifort by as much as 6.8x in lu. Even against the explicitly parallelized versions, var-lib performs considerably better as shown in Figure 5.7b. We next discuss the performance results for each compiler in more detail. gfortran and ifort gfortran or the GNU Fortran compiler was chosen as the baseline in our experiments. In ad- dition, we also show results with the Intel Fortan compiler. In all cases, gfortran proved to be worse than ifort due to the latter being much more effective at vectorization. However, both these production compilers are equally poor in performing loop fusion. As a result, neither of them fused any large loop-nests for any of the SCoPs listed in the figure. gfortran performed the worst of all compilers because it could not recognize parallel loops in any benchmark. ifort could recognize parallel loops in bt.rhs, sp.rhs, zeusmp.hsmoc, and in zeusmp.lorentz.1. From these results (and other experiments using our test kernels), we find that ifort is ca- pable of parallelizing the loop-nest in the presence of temporary scalar variables, but not in the presence of temporary array variables. It is for this reason that ifort could not parallelize lu. Although, temporary array variables exist in hsmoc as well, we believe that ifort relies on recognizing certain specific patterns in this case to achieve parallelization because the inability to recognize temporary array variables in a less computationally intensive subroutine, lorentz, in the same benchmark hurts parallelization opportunity. In any case, we can conclude that existing production compilers are limited in their capability of detecting parallelism in scien- tific application codes that contain temporary variables, and much more so in performing the important loop fusion optimization.

PLuTo’s smartfuse

PLuTo is a state-of-the-art polyhedral compiler that has shown significant promise in achieving automatic loop parallelization (89). PLuTo uses three different heuristics for fusion, min-, max- and smartfuse. maxfuse and smartfuse are practically equivalent for SCoPs with many state- ments, and minfuse performs maximal distribution and almost always performs sub-optimally. Thus we only choose to compare with smartfuse. PLuTo is also capable of performing scalar privatization that empowers it to perform coarse-grained parallelization in the presence of scalar temporary variables. It cannot, however, privatize temporary array variables. As a results, smartfuse can identify parallel loops in bt and sp, but not in any of the other SCoPs. Also, since 97 PLuTo does not relax dependence on such temporary variables across loop-nests or, in other words, perform variable liberalization, it is deprived of the opportunity to perform loop fusion. This amounts to reduced performance as compared to var-lib even for the benchmarks where it could achieve parallelization.

Variable Liberalization (var-lib)

When using a single thread to run benchmarks, var-lib outperforms all other compilers on ac- count of fusing multiple loop-nests. The improvement is proportional to the fusion opportunity available in the benchmarks. For example, both the large nests in zeusmp.hsmoc.2 are fused to enable data reuse, whereas in the other cases such as lu, only 2 of the 3 large nests could be fused together. This is because the common outer loop of the first two nests in lu is the innermost loop of the third nest, and the innermost loop cannot participate in loop interchange because of the presence of temporary arrays and imperfect nesting.

Speed-up of var-lib over explicit parallelization 1.6

hsmoc.2 1.4 lu.rhs

1.2

1

0.8 1 2 4 8 Number of threads Figure 5.8: Parallel performance comparison of var-lib and explicit parallelization for rhs sub- routine in lu

Interestingly, the performance improvement for all benchmarks surges upon parallelization, even for benchmarks that are successfully parallelized by other compilers (including explicitly or manually parallelized code) such as bt, sp and zeusmp.hsmoc. This is due to two reasons - (1) reduction of fork-join synchronization points, and more importantly, (2) saving of off-chip memory accesses and thus the bandwidth, which is a source of contention among parallel threads in such memory-intensive applications. Both of these benefits are direct consequences of the effective loop fusion achieved as a result of variable liberalization. In addition, var-lib significantly outperforms all other compilers when they cannot identify outer-parallel loops due to the presence of temporary array variables and imperfect nests as in lu and zeusmp.lorentz.
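A minimal OpenMP sketch of these two effects (hypothetical arrays, not code from the benchmarks): the unfused version forks and joins one parallel region per nest and streams the intermediate array through off-chip memory twice, while the fused version has a single fork-join point and keeps each intermediate value register- or cache-resident.

#include <omp.h>
#define N 1024
static double a[N][N], b[N][N], tmp[N][N];

void unfused(void) {
    #pragma omp parallel for              /* fork-join #1: writes tmp to memory      */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            tmp[i][j] = 0.5 * a[i][j];
    #pragma omp parallel for              /* fork-join #2: re-reads tmp from memory  */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = tmp[i][j] + a[i][j];
}

void fused(void) {
    #pragma omp parallel for              /* single fork-join; tmp value stays local */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double t = 0.5 * a[i][j];
            b[i][j] = t + a[i][j];
        }
}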

Hardware Event               | lu    | zeusmp
Load latency gt 512          | 24.03 | 12.88
Resource stalls.any          | 7.68  | 13.8
Uops dispatched.stall cycles | 8.8   | 19.94

Table 5.4: Performance counters indicating reduced pipeline stalls and effective memory latency through var-lib. Each entry is the percentage reduction relative to the explicitly parallelized code, i.e. (explicit − varLib) / explicit × 100%.

Figure 5.8 shows the speed-ups achieved by var-lib over explicitly parallelized code for two of the eight SCoPs, lu.rhs and zeusmp.hsmoc.2, when using different numbers of threads. The speed-up achieved in the case of zeusmp.hsmoc.2 is larger than that in lu.rhs. This is because of a greater opportunity for fusion in the former case, as explained earlier. Further, the speed-up achieved increases with the increase in the number of threads. This is explained with the help of the performance counters shown in Table 5.4, as obtained from Intel's VTune Performance Profiler (53). Both benchmarks witness a considerable reduction (24.03% and 12.88%, respectively) in the number of memory loads whose latency is greater than 512 cycles after variable liberalization. Since the memory latency on the test processor (SandyBridge microarchitecture) is roughly 200 cycles, such high latency corroborates bandwidth contention. Thus, clearly, the lower number of such high latency loads confirms the efficacy of the effective loop fusion performed by var-lib. These high latency loads result in an increase in the number of stall cycles due to resource contention (measured by the counter Resource stalls.any, which includes stalls due to a fully occupied Load Buffer, Store Buffer, Reorder Buffer, and Reservation Stations) and also due to reservation stations waiting on operands (measured by the counter Uops dispatched.stall cycles).

Benchmark | Subroutine | # statements | # deps | % Ex. time | ifort | smartfuse | var-lib | S = Ex. Time_ifort / Ex. Time_var-lib
bt | rhs | 48 | 1419 | 16.8 | 4/116 | 303/703 | 180/281 | 1.06x
sp | rhs | 50 | 1428 | 33.7 | 4/116 | 360/766 | 213/301 | 1.08x
lu | rhs | 106 | 3033 | 63.1 | 1.9/71.7 | 11302/5542 | 3247/1551 | 2.17x
zeusmp | hsmoc.1 | 121 | 2660 | 24.1 (hsmoc.1-3) | 5/141 | 10700/10780 | 162/4060 | 1.36x (all zeusmp SCoPs)
zeusmp | hsmoc.2 | 118 | 2493 | | 5.9/151 | 9067/9693 | 246/6361 |
zeusmp | hsmoc.3 | 120 | 2296 | | 5.7/143 | 7739/9014 | 280/5759 |
zeusmp | lorentz.1 | 98 | 2227 | 27.3 (lorentz.1-2) | 3.1/95.2 | 6027.5/10053 | 165/3222 |
zeusmp | lorentz.2 | 92 | 2149 | | 2.49/86 | 5833/8275 | 1114/5152 |

(The ifort, smartfuse and var-lib columns report compile time (s) / memory requirement (MB).)

Table 5.5: Compile time, memory requirement of different compilers and overall (application) speedup (S) achieved by var-lib over ifort

The reduction in loads with high latency is larger for lu than for zeusmp, since the optimized subroutine, rhs, in lu is its most memory-intensive part and thus benefits more from loop fusion. However, the reduction in stalls is more significant in the case of zeusmp, since the optimized subroutines, hsmoc and lorentz, together contribute more than 50% of the execution time in zeusmp, as compared to rhs, which contributes only 20% of the total execution time in the explicitly parallelized version of the lu benchmark.

Table 5.5 compares the compile time and memory requirement of the different compilers in each category. For example, ifort is the representative of the existing production compilers, smartfuse is the representative of existing polyhedral compilers, and var-lib is our work. Clearly, the time and memory needed by ifort for compilation is much less than that of any of the polyhedral compilers. This is because existing production compilers consider a very limited scope for analysis, such as a single loop-nest or consecutive loop nests that have the same bounds and the same loop order. In other words, they trade performance for the time and memory spent in analysis to reason about transformations. The polyhedral compilers, on the other hand, consider all possible dependences in a given SCoP to reason about transformations. They thus trade time and memory for performance. In some cases, such as for the lu benchmark, the compile time (and memory requirement) can be extremely large because of its multiple imperfectly nested loops. Since variable liberalization relies on pruning dependences within a loop-nest (to enable loop interchange) and relaxing dependences across loop-nests, it is also effective in significantly reducing both the compile time and memory requirement, while achieving superior transformed code with effective fusion and parallelization. On average, var-lib achieves a reduction of 22x and 2.4x in terms of compile time and memory requirement, respectively. Lastly, Table 5.5 also shows that var-lib achieves an overall (application) performance improvement ranging from 1.06x in bt, where the optimized subroutine rhs contributes only 16.8% to the overall execution time of the benchmark, to 2.17x in lu, where the optimized subroutine contributes 63.1% to the overall execution time.

5.7 Related Work

Past work has well recognized the importance of enabling important optimizations through variable privatization and expansion. Variable privatization (90; 82; 91) involves creating multiple copies of a temporary variable (a scalar or a low-dimension array) that is defined and used in one of the inner loops of a loop nest (i.e. has iteration-private live-ranges in that loop). Consequently, each thread is assigned a private copy of the variable, thus eliminating races between threads and allowing coarse-grained parallelism at an outer loop. From a data dependence analysis perspective, privatization involves pruning loop-carried dependences on the temporary variables at the outermost loop.
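As a concrete illustration of privatization in this sense (the code is hypothetical, not from the cited works), consider the OpenMP fragment below: the temporary row buffer has an iteration-private live range in the i loop, so declaring it private gives each thread its own copy, removes the loop-carried dependences on it, and allows coarse-grained parallelization of the outer loop.

#include <omp.h>
#define N 1024
double in[N][N], out[N][N];

void privatized(void) {
    double row[N];                          /* temporary with an iteration-private live range */
    #pragma omp parallel for private(row)   /* each thread works on its own copy of row[]     */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            row[j] = in[i][j] * in[i][j];   /* defined ...                                    */
        for (int j = 1; j < N; j++)
            out[i][j] = row[j] - row[j-1];  /* ... and used within the same i iteration       */
    }
}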
However, privatization only affects a single loop-nest; the data dependences between temporary variables in different nests continue to be stringent and thus transformation-restricting. While privatization is a supporting optimization for parallelism, expansion (92) is another enabling optimization for many other transformations such as loop fusion, shifting and distribution. However, it has a known drawback of significantly increasing the memory footprint. To counteract this, the authors in (76; 77; 78) propose to perform a maximal expansion to enable important optimizations, and then attempt to contract the expanded variables as much as possible. However, this contraction of the optimized program is not only difficult, but is sometimes not possible in the wake of transformations such as distribution that can potentially distribute the definition and use of expanded variables to different loops. In our work, we propose variable liberalization, which is a combination of privatization and expansion in that it achieves the benefits of both. In liberalization, we relax dependences on temporary variables in different nests as in privatization, thus effectively expanding those variables, but not through an actual expansion in their dimensions. In the past, there has also been considerable work on loop fusion using different algorithms to find the best loops to fuse with the objective of maximizing data reuse (72; 71), minimizing synchronization (71), reducing register pressure (73), and saving off-chip bandwidth (65). Interestingly, none of these techniques has been used in existing production compilers for various pragmatic reasons in addition to the increased compile time. These pragmatic reasons include loops with different bounds or orders, or loops that are imperfectly nested and are hard for the compiler to optimize in real application codes. However, we show in this work that even if these limitations were removed (most of which are non-existent in polyhedral compilers due to exact dependence analysis), effective fusion and consequent parallelization could still be prohibited due to the occurrence of transformation-restricting dependences on temporary variables in different nests. Thus, this work proposes variable liberalization to fill this gap. Recently, the authors in (80) have proposed violated dependence analysis, which has made it possible to reason about the correctness of an applied transformation by polyhedral compilers even when such a transformation is built upon a relaxed set of dependences. Vasilache et al. (85) propose a framework for correcting violations to a desired transformation through other enabling transformations such as loop shifting, index-set splitting and code motion. However, it is unclear whether the framework will always successfully correct the transformation performed. In the event of a failure, a different (new) transformation is applied and correction of it is attempted iteratively, resulting in possibly multiple passes. The authors in (84; 86) propose lazy expansion, where they ignore WAW and WAR dependences when finding transformations and then reason about correctness using an extension to the violated dependence analysis, called live range violation analysis. In the event of a violation, variables are expanded as needed (lazily) to ensure correctness. Since they do not violate true (RAW) dependences, they guarantee transformation correctness, albeit at the cost of a higher memory footprint.
However, these works do not focus on loop fusion, which is crucial for performance in large SCoPs. In fact, since there also exist RAW dependences across loop-nests which are fusion-restricting, considering them for transformations without relaxation will disallow fusion in the first place in the presence of temporary variables in different nests. It is also important to note that even partial expansion and renaming can significantly increase (potentially double, with the fusion of just two nests) the working set in the higher levels of the memory hierarchy by increasing the amount of data accessed in the inner loops of the nest, and thus hurt performance. In this work, we extend the violated dependence analysis to our proposed violated transformation analysis, where we can reason whether the transformation has gone wrong at the time it is being found, and immediately take corrective measures by imposing stricter constraints. This avoids the need for subsequent passes to find a correct transformation, and does not require any renaming or expansion.

5.8 Conclusion

In this work, we propose variable liberalization, a compiler technique that strategically relaxes dependences on temporary variables that impede useful optimizations such as fusion of loop-nests. Unlike variable expansion, variable liberalization does not cause an actual expansion of variables while enabling fusion, thus further improving the memory performance of transformed programs. In our framework, we guarantee correctness through violated transformation analysis, a technique that validates the relaxation of dependences for the purpose of the fusion optimization without requiring an iterative process to achieve a correct transformation. Experimental results demonstrate its effectiveness in achieving fusion of nests (and subsequent parallelization of the fused nest). It can thus substantially increase data reuse in real application programs, leading to improved performance.

Chapter 6

Loop Fusion and Real Applications - Part 2

6.1 Introduction and Motivation

In the last chapter, we identified 2 key challenges faced by the compiler when attempting to perform global program transformations. We also presented a solution to meet the first challenge, namely, the existence of transformation-limiting dependences on temporary variables across loop-nests. In this chapter, we address the second important challenge, i.e. deciding on a cost model to find effective fusion structures that achieve the best performance on current processors. This complements the work presented in the last chapter, so as to present a complete solution for enabling effective global program transformations by the compiler. Since we implement our solution to global program transformation in a polyhedral compiler, we first discuss some of the limitations in traditional compilers that are overcome by polyhedral compilers. This justifies our choice of the latter class of compilers when attempting global program transformations. Past work (93; 94; 95) on loop fusion using traditional compilers did not consider loop fusion in the context of other transformations. This misses some of the better solutions in various cases. For example, in the gemver benchmark shown in Figure 6.1(a), traditional compilers first identify outer parallel loops and then consider loops for fusion such that outer-loop or coarse-grained parallelism is not hurt. This approach cannot achieve fusion of statements S1 and S2 because of unsatisfied dependences (as shown in Figure 6.1(b)), and thus the data reuse opportunity is lost.


Figure 6.1: (a) the gemver kernel; (b) illegal fusion; (c) legal fusion after loop interchange in the first loop-nest

However, if loop fusion was considered in a manner coupled with outer-loop parallelism and loop interchange, legal fusion could be accomplished after interchanging loops i and j in the first loop nest, as shown in Figure 6.1(c). As a result, the transformed program now achieves both data reuse and outer-level parallelism. In addition, in the traditional compilers, the granularity of loop fusion is an entire loop nest. This again misses good solutions. For example, in the swim benchmark shown in Figure 6.2, the two loop nests cannot be entirely fused because of dependences between some of the intermediate statements (S4-S12) and statements S13, S14 of the second loop nest. However, if the granularity of loop fusion was individual statements, statements S15 and S18, which are not dependent on any of the intermediate statements, could be fused with statements S1-S3 in the first loop nest, as shown later in Figure 6.4(b). In addition, the state-of-the-art production compilers only consider consecutive loops for fusion in a pair-wise fusion (96), which clearly limits their capability of achieving global reuse in the program. The polyhedral compiler framework is markedly different from its traditional counterparts in that it takes a statement-centric view of the entire program as against a loop-nest-centric view. This shift in the intermediate representation allows it to simultaneously compose multiple high-level loop transformations and also to reduce the granularity of fusion from loop nests to individual statements. This enables finding effective fusion partitions that exploit data reuse in a global program context, those that are missed by the traditional compilers as shown in the above two examples. However, this statement-centric view also results in a very large search space of possible fusion partitionings.

Figure 6.2: the swim benchmark

The size of the search space varies exponentially with the number of statements, or more precisely, with the number of Strongly Connected Components (SCCs). For example, for just the 3 statements S1-S3 in the first loop nest of swim, which have no dependences between them, there can be 3! (= 6) different ways in which they can be ordered. Further, for a particular ordering of statements (e.g., S2-S3-S1), there can be 2^(3-1) (= 4) different partitionings¹, i.e. (S2|S3,S1), (S2,S3|S1), (S2|S3|S1), (S2,S3,S1), where '|' represents a fusion partition boundary or, in other words, it implies that the statements on either side belong to different loop nests. Thus, a total of 24 different fusion partitionings are possible for only the 3 statements considered. Similarly, if statements S13-S18 in the second loop nest are considered, there are 90 possible orderings of statements², and for each ordering, there are 32 different fusion partitionings, resulting in a total of 2880 possible fusion partitionings. If all the 36 statements of the swim benchmark are considered, the search space of possible fusion partitionings becomes unmanageable.
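The counts quoted above follow directly from the formula given in footnote 1; the small C program below (purely illustrative, not part of any compiler) reproduces the arithmetic: 2^(n-1) partitionings per legal ordering, multiplied by the number of legal orderings.

#include <stdio.h>

/* Partitionings of one fixed ordering of n statements: each of the (n-1)
 * gaps between consecutive statements either carries a boundary '|' or
 * not, giving 2^(n-1) choices. */
static unsigned long partitionings_per_ordering(unsigned n) {
    return 1UL << (n - 1);
}

int main(void) {
    /* S1-S3 in the first nest of swim: 3! = 6 legal orderings, 4 partitionings each. */
    printf("S1-S3  : %lu\n", 6UL * partitionings_per_ordering(3));   /* prints 24   */
    /* S13-S18 in the second nest: 90 legal orderings, 32 partitionings each. */
    printf("S13-S18: %lu\n", 90UL * partitionings_per_ordering(6));  /* prints 2880 */
    return 0;
}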

¹ For any two consecutive statements, there exist 2 possibilities - they can either belong to the same loop nest or not. Since there are a total of (n-1) pairs of consecutive statements for a total of n statements, there exist 2^(n-1) possible fusion partitionings.
² The number of orderings of statements is not 6! (= 720) because there are dependences among the statement pairs S13-S16, S14-S17 and S15-S18, respectively. As a result, some among the 720 total orderings are not legal.

This necessitates an effective cost model to find a good fusion partitioning among all legal partitionings. However, incorporating useful optimization criteria within the polyhedral framework for this purpose is hard, as the algorithms involved are computationally intensive (97), and the complexity increases exponentially with the number of statements. Our experiments show that the fusion model employed in the state-of-the-art polyhedral compiler frameworks (98; 99; 87; 100; 101) gives good performance only for kernel programs with few statements, but performs sub-optimally for larger programs such as those in the SPEC benchmark suite. We also find that the iterative compilation framework leveraging the polyhedral model (102; 103; 97) cannot construct the search space of legal fusion partitionings for such large programs, as the algorithms employed are not scalable with the number of statements. In this part of the thesis, we propose a new cost model to tackle this important problem that has prevented the polyhedral compiler framework from effectively optimizing large programs. The cost model employed by our fusion algorithm has 2 objective functions: (1) maximize data reuse in the fused loop nest, and (2) preserve the coarse-grained parallelism inherent in the source code. In our fusion algorithm, we achieve data reuse through the use of heuristics in a pre-fusion step that absolves the polyhedral framework from the responsibility of providing an objective function to guide loop fusion, and keeps large programs tractable. This pre-fusion step blends effectively with the polyhedral framework and, as a result, the subsequent benefits of composing multiple loop optimizations can still be achieved. This pre-fusion step also helps us to consider input dependences while evaluating data reuse, and not just the true dependences that constitute real edges in the Data Dependence Graph. The objective of maximizing data reuse, however, conflicts with preserving the coarse-grained parallelism that is originally present in the source code. This is because merging multiple statements into the same loop nest may result in a dependence carried by the outer loop, leading to a loss of outer-loop or coarse-grained parallelism. Thus, in our fusion algorithm, we detect the occurrence of such a dependence between two SCCs and distribute them into separate loop nests such that the loss of data reuse is minimized and coarse-grained parallelism is preserved. Our fusion algorithm is implemented within a source-to-source polyhedral compiler framework, PLuTo. We tested our fusion algorithm, called wisefuse, using 10 benchmarks from 3 different benchmark suites, with the large programs from the SPEC and NAS Parallel (NP) benchmark suites and the small kernel programs from the Polybench (36) suite. Using our fusion algorithm within the polyhedral framework, we achieve a performance improvement ranging from 1.7X
We achieve an improvement of 5% to 18% over the Intel compiler for the same benchmarks. Results also demonstrate that our algorithm matches the performance achieved by existing polyhedral frameworks for the kernel programs from Polybench, and in some cases we achieve an improvement of as much as 2.1X due to coarse-grained parallelization. The rest of the chapter proceeds as follows. Section 6.2 discusses specific background related to loop fusion within the polyhedral framework. Section 6.3 formulates the loop fu- sion problem and describes the legality constraints for fusion, in the context of the polyhedral framework. Section 6.4 describes our loop fusion algorithm and discusses the heuristics used to achieve the aforementioned objectives. Section 6.5 compares and discusses the performance re- sults of our fusion algorithm and other state-of-the-art algorithms. The related work is presented in Section 6.6 and we conclude in Section 6.7. 6.2 Loop fusion and the polyhedral framework The polyhedral framework finds legal statement-wise loop hyperplanes one loop-level at a time using an Integer Linear Programming (ILP) solver. If the ILP solver cannot find a solution at the current loop level because of unsatisfied dependences, a cut must be issued, represented by a scalar dimension. A cut between two SCCs essentially distributes them into different loop nests, thus satisfying certain dependences. The most important criteria in cutting SCCs is based on the dimensionality (depth of the enclosing loop nest) of the SCC, i.e. any two consecutive SCCs with different dimensionalities are cut first as they are least likely to aid data reuse. Thus, the initial ordering of the SCCs (we call, the pre-fusion schedule3) decides which SCCs remain fused and which ones are distributed in the transformed code, and is therefore critical in achieving a good fusion partitioning with effective data reuse. Loop fusion within the polyhedral framework thus involves three steps

1. finding the Strongly Connected Components (SCCs) in the Data Dependence Graph,

2. determining a legal (and good) pre-fusion schedule, and

3. finding legal statement-wise loop hyperplanes one loop level at a time. This involves issuing cuts (represented by scalar dimensions in the affine transform) between SCCs, whose relative ordering is determined in step 2.

³ The pre-fusion schedule is called so because this schedule or ordering of SCCs serves as a guideline for the fusion partitioning obtained in the final transformed program. A fusion partitioning is the partitioning of program statements into clusters of statements called fusion partitions, where each fusion partition represents a single loop nest.

SCCs that do not require to be cut to satisfy dependences end up in the same fusion partition after the above 3 steps. Thus, loop fusion is implicitly performed with the finding of the legal loop hyperplanes, in composition with other transformations.

6.2.1 Existing fusion models in the polyhedral framework

The state-of-the-art polyhedral frameworks (PLuTo (98), PoCC (99), PolyOpt (87), LLVM's Polly (104)) employ an effective cost function to find statement-wise loop hyperplanes that minimize communication volume among processors (or, reuse distance in sequential execution). However, there does not exist a useful cost function to find a good pre-fusion schedule that eventually guides loop fusion as explained above. The existing frameworks combine steps 1 and 2, i.e. they use a depth-first traversal of the DDG to find the SCCs, and the resultant order also yields a legal (but not necessarily good) pre-fusion schedule. Existing frameworks also do not employ an optimization criterion for data reuse in step 3, as this would require the incorporation of additional constraints in the ILP formulation for finding legal hyperplanes. This may render large programs intractable, as these constraints increase exponentially with the number of statements. The pre-fusion schedule obtained from a depth-first traversal of the DDG has 2 drawbacks.

1. It does not attempt to order statements with the same dimensionality consecutively. This leads to sub-optimal fusion. For example, in the swim benchmark shown in Figure 6.2, the intermediate statements S4-S12 have a dimensionality of 1, and if any of these statements is ordered (in the pre-fusion schedule) between statements from the first loop nest and statements S13-S18, then a cut to find legal loop hyperplanes would prevent any possible fusion of these statements.

2. It does not order statements with an input dependence between them consecutively, or in other words, it does not consider data reuse through the input or Read-After-Read (RAR) dependences. This is because the DDG (traversed to find SCCs) does not contain edges corresponding to the input dependences, as such edges restrict parallelism. The input dependences are, nonetheless, crucial to achieve effective data reuse. For example, in the swim benchmark, if the input dependences between statements S1, S2 and S3 are not considered, they would not be ordered consecutively in the pre-fusion schedule. As a result, they would be distributed into different loop nests, leading to a loss of data reuse (a small sketch of this read-reuse effect follows this list).
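The sketch below (hypothetical arrays p, q, r, u, not the actual swim statements) shows the read reuse that motivates keeping such statements together: the two statements share no true dependence, only reads of the same array, yet fusing them lets every element of that array be loaded once and feed both statements.

#define N 4096
double p[N], q[N], r[N], u[N];

/* S1 and S2 only read u[], so a DDG without input (RAR) edges gives the
 * scheduler no reason to place them in the same loop nest.  Fusing them
 * halves the traffic on u[]: each u[i] is loaded once and serves both
 * statements while it sits in a register. */
void fused_rar(void) {
    for (int i = 0; i < N; i++) {
        double ui = u[i];     /* one load of u[i] ...             */
        p[i] = 2.0 * ui;      /* ... feeds S1                     */
        q[i] = ui + r[i];     /* ... and S2 in the same iteration */
    }
}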

Thus, for effective loop fusion, steps 1 and 2 must be treated independently and there must be an effective cost model to determine a good pre-fusion schedule. The lack of a powerful cost model is a serious limitation in the existing polyhedral frameworks, as a poorly chosen pre-fusion schedule impacts all the other transformations performed. The performance impact becomes more significant as the number of statements within a SCoP, and consequently the number of possible pre-fusion schedules, becomes larger.

6.3 Problem Formulation

Prior to formulating the problem, we identify the different scenarios that result from fusing two statements, each enclosed in a separate loop nest. Here, we consider statements instead of loops because of the statement-centric view held by the polyhedral compiler. Before fusion, there is either no dependence or a loop-independent dependence between any two statements. Fusion is clearly legal in the former case. In the latter case, however, fusing two statements (Si and Sj) with a loop-independent dependence can lead to 3 different scenarios.

1. The dependence remains loop independent. In the polyhedral framework, this is represented by the condition,

\phi_{S_j}(\vec{t}\,) - \phi_{S_i}(\vec{s}\,) = 0, \quad \langle \vec{s}, \vec{t}\, \rangle \in \mathcal{P}_{e_{S_i \rightarrow S_j}} \qquad (6.1)

Clearly, this does not violate the constraint in Equation (2.4) presented in Chapter 2, and

thus fusing statements Si and Sj is legal. Such a dependence allows the loop to be a parallel loop and is called a precedence or a fusion-permitting dependence.

2. The dependence can become a backward loop-carried dependence. This is represented by the condition,

\phi_{S_j}(\vec{t}\,) - \phi_{S_i}(\vec{s}\,) < 0, \quad \langle \vec{s}, \vec{t}\, \rangle \in \mathcal{P}_{e_{S_i \rightarrow S_j}} \qquad (6.2)

This dependence violates the constraint in Equation (2.4), i.e. there is at least one instance or loop iteration that does not preserve the direction of the dependence, and thus fusing the two statements is illegal. Such a dependence is called a fusion-preventing dependence.

3. The dependence can become a forward loop-carried dependence. This is represented by the condition,

\phi_{S_j}(\vec{t}\,) - \phi_{S_i}(\vec{s}\,) > 0, \quad \langle \vec{s}, \vec{t}\, \rangle \in \mathcal{P}_{e_{S_i \rightarrow S_j}} \qquad (6.3)

This dependence satisfies the constraint in Equation (2.4), i.e. it preserves the direction of the dependence, and thus leads to a legal fusion. However, it leads to a forward dependence, or a non-parallel loop. Although we call it a precedence dependence (following previous works), this dependence may also be considered a fusion-preventing dependence if the goal is to preserve parallelism (a small code sketch contrasting the backward and forward cases follows this list).
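The C fragment below (hypothetical arrays, chosen only to illustrate the two loop-carried cases) makes scenarios 2 and 3 concrete: depending on which neighbour the second statement reads, fusing the two loops turns the original loop-independent dependence into either a backward (illegal) or a forward (legal but serializing) loop-carried dependence.

#define N 1000
double a[N + 1], c[N], d[N];

void two_nests(void) {
    for (int i = 0; i < N; i++) a[i] = 2.0 * i;         /* Si: writes a[i]   */

    for (int i = 0; i < N; i++) c[i] = a[i + 1] + 1.0;  /* Sj: reads a[i+1]  */
    /* Fusing Si and Sj would make iteration i of Sj read a[i+1] before
     * iteration i+1 of Si writes it: a backward loop-carried dependence,
     * so the fusion is illegal (scenario 2). */

    for (int i = 1; i < N; i++) d[i] = a[i - 1] + 1.0;  /* Sk: reads a[i-1]  */
    /* Fusing Si and Sk is legal, since a[i-1] was written in an earlier
     * iteration, but the dependence becomes a forward loop-carried one and
     * the fused loop is no longer parallel (scenario 3). */
}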

With this background on the different types of dependences, we next formulate the loop fusion problem. In the past, the loop fusion problem has been formulated as a graph partitioning problem (105; 93), where the goal was to partition all loops into disjoint clusters, each cluster representing a set of loops to be fused. However, in the context of the polyhedral compiler framework that takes a statement-centric view, this formulation needs an adjustment in that each vertex of the graph corresponds to a statement instead of a loop. The relations between statements are represented by a directed graph G = (V, E = F ∪ F̄), where each statement corresponds to a vertex v ∈ V of the graph. Edges in E are classified into fusion-permitting edges (edges in F) and fusion-preventing edges (edges in F̄). The problem is thus formulated as one of finding a fusion partitioning, i.e. partitioning V into clusters of statements that can be legally fused. The partitioning of V is legal subject to the following legality constraints. These constraints together ensure that all dependences are lexicographically positive.

1. Statements in the same SCC belong to the same fusion partition.

2. Precedence constraint - The ordering of SCCs in the partitioned graph (that corresponds to the fused program) must respect all dependences between statements.

3. Fusion-preventing constraint - The fusion partitioning should not result in any loop-carried backward dependence, i.e. statements connected by an edge in F̄ must belong to different fusion partitions.

To ensure the satisfaction of the first constraint, the first step in the fusion algorithm (listed in Section 6.2) involves determining the SCCs in the source program. Satisfaction of this constraint is ensured by ordering SCCs instead of statements in the further steps that achieve some optimization criteria.

Figure 6.3: (a) Original source code for advect; (b) Fully fused advect without shifting (incorrect code); (c) Fully fused advect with shifting (correct code)

The ordering of SCCs must satisfy the precedence constraint to ensure that a legal partitioning exists that leads to correct execution. Thus, the pre-fusion schedule that decides the initial ordering of SCCs (step 2 in the fusion algorithm) must satisfy the precedence constraint. However, a pre-fusion schedule that satisfies the precedence constraint may still lead to a partitioning where there is a loop-carried backward dependence between two statement instances in a partition. Such a partitioning is not legal. For example, fusing all four statements of the advect benchmark as shown in Figure 6.3(b) leads to a loop-carried backward dependence, although the precedence constraint is satisfied. In such a scenario, the polyhedral framework would either apply the loop shifting transformation to remove the backward dependence or issue a cut between statements to satisfy the dependence. In other words, if the pre-fusion schedule satisfies the precedence constraint, then the polyhedral framework guarantees a legal fusion partitioning that satisfies the fusion-preventing constraint. In the case of the advect benchmark, legality is ensured by shifting S4 by 1 iteration and thus removing the backward dependence, as shown in Figure 6.3(c). As a result of applying the shifting transformation, the backward dependence from S4 to statements S1-S3 turns into a forward dependence from statements S1-S3 to S4. This renders the outer loops i and j as forward-dependence loops, or in other words, coarse-grained parallelism is lost as shown in Figure 6.3(c). Thus, the problem of loop fusion within the polyhedral framework can be formulated as one of finding a pre-fusion schedule that satisfies the precedence constraint. In addition to legality, the criterion for deciding good partitions is two-fold: maximizing data reuse and preserving parallelism. In the following section, we propose a fusion algorithm that achieves precisely these two key objectives in the polyhedral framework.

6.4 Our Fusion Algorithm

As explained in previous sections, a pre-fusion schedule, or the initial ordering of the SCCs, has a significant bearing on the fusion partitioning achieved in the transformed program. In this section, we first describe our algorithm (Algorithm 3) for finding a good pre-fusion schedule and the heuristics underlying the algorithm. We use the same example of the swim benchmark shown in Figure 6.2 to elucidate our algorithm. We next describe Algorithm 4, which is used to achieve the second objective of preserving coarse-grained parallelism in the transformed code. We use the example of the advect benchmark introduced in Section 6.3 for the purpose of explanation.

6.4.1 Algorithm 3: Finding a good pre-fusion schedule

A good pre-fusion schedule is one that orders the SCCs so that SCCs with significant reuse between them are merged in the same loop nest in the transformed code. In order to achieve a good pre-fusion schedule, the ordering of SCCs is done based on the following criteria:

• Constraint: The ordering must respect the precedence constraint among the SCCs

• Heuristic 1: SCCs that allow data reuse and have the same dimensionality are ordered consecutively

• Heuristic 2: SCCs are considered for re-ordering in the original program order

The above criteria do not necessarily lead to a fusion partitioning that maximizes data reuse, but they do lead to one that has significant data reuse. Rationale for the chosen heuristics. As explained in Section 6.3, even with the satisfaction of the precedence constraint, there may be fusion-preventing dependences between statements, and hence some SCCs will need to be cut to satisfy those dependences. The heuristics chosen are such that they aim to minimize the loss of reuse through these inevitable cuts.

Figure 6.4: Our fusion model: (a) Partial DDG for swim and pre-fusion schedule from Algorithm 3; (b) Fused code corresponding to (a); (c) Pre-fusion schedule chosen by PLuTo; (d) Fused code corresponding to (c)

Since the first and most important criterion for issuing a cut is based on the dimensionality of the SCCs, heuristic 1 ensures that SCCs that have reuse remain immune to a cut with high probability. Also, since data reuse through input dependences is considered, we achieve better overall data reuse. This is possible mainly because we decouple the pre-fusion scheduling from the step where the DDG (with only true dependences) is traversed to find the SCCs. Heuristic 2 allows more SCCs to be fused together into a single loop nest. The reason is that, if SCCs close to each other in original program order are fused, the probability increases that the following SCCs become legally fusable because the precedence constraint is satisfied. Also, SCCs that are close to each other tend to have higher reuse through reads of the same data items.

ALGORITHM 3: Finding a good pre-fusion schedule

1:  INPUT: List of statements in the program: S(1, ..., n)
       Adjacency matrix representing true dependences: adj(1,..,n)(1,..,n)
       Adjacency matrix representing input dependences: RARadj(1,..,n)(1,..,n)
       List of statements belonging to the SCC that contains statement s: SCC_s
2:  begin
3:  Initialize id to 0
4:  for each statement s ∈ S do
5:    visited(s) = 0
6:  for each statement s ∈ S do
7:    if visited(s) = 0 then
8:      fusable = ∅   {/* fusable is the list of statements fusable with statement s */}
9:      for each statement t ∈ SCC_s do
10:       visited(t) = 1
11:       sccId(t) = id
12:       Add t to fusable
13:     Increment id by 1
        {/* the next three lines check the dimensionality, reuse and precedence constraint, respectively */}
14:     for each statement t ∈ S s.t. visited(t) = 0 ∧ dimension(s) = dimension(t) do
15:       if ∃ i ∈ fusable s.t. adj(i, j) = 1 ∨ RARadj(i, j) = 1, ∀ j ∈ SCC_t then
16:         if ∄ s' ∈ S, with visited(s') = 0, s.t. adj(s', t') ≠ 0, ∀ t' ∈ SCC_t then
17:           for each statement t' ∈ SCC_t do
18:             visited(t') = 1
19:             sccId(t') = id
20:             Add t' to fusable
21:           Increment id by 1
22: end
23: OUTPUT: Pre-fusion Schedule

Explanation of the algorithm. In order to find a pre-fusion schedule, it is essential that the Strongly Connected Components are first extracted from the Data Dependence Graph. This is done using Kosaraju's algorithm (106), as in existing polyhedral compiler frameworks. But our fusion algorithm determines the pre-fusion schedule in a separate step. The choice of the pre-fusion schedule is made to meet the above-mentioned three criteria. For this purpose, Algorithm 3 begins a traversal of the DDG in program order. For a statement s thus 'visited', the algorithm finds an 'unvisited' statement t (i.e. visited(t) = 0) with the same dimension that also has data reuse with the statements already marked as fusable with statement s. The algorithm further checks whether t satisfies the precedence constraint. The precedence constraint is satisfied if no statement t' belonging to the SCC that contains t (SCC_t) depends on an 'unvisited' statement s', i.e. adj(s', t') ≠ 0. If the precedence constraint is thus satisfied, all statements belonging to SCC_t are marked visited, assigned the next higher id than SCC_s, and put into the set fusable. This ensures that the reordering of SCCs achieved satisfies all 3 criteria.

The heuristics used can be seen in play for the swim benchmark in Figure 6.4, which compares the pre-fusion schedule obtained from Algorithm 3 with that achieved by the state-of-the-art pre-fusion scheduling algorithm used in PLuTo. The other existing polyhedral compilers such as PoCC (99), PolyOpt (87), and LLVM's Polly (104) all use the fusion algorithm described by Bondhugula et al. in their paper (107) and implemented in their tool, PLuTo. The fusion algorithm employed in PLuTo uses a depth-first traversal to find and order the SCCs. Figures 6.4(a) and (c) show the partial DDG for the same code excerpt of the swim benchmark shown earlier in Figure 6.2. The values in square brackets are the SCC ids for the 36 statements of the swim benchmark, and are representative of the pre-fusion schedules chosen by Algorithm 3 and PLuTo, respectively. Since Algorithm 3 also considers reuse through the input dependences, the DDG in (a) also shows input dependences marked with dashed lines. Figures 6.4(b) and (d) show the transformed codes for swim generated from the pre-fusion schedules given by Algorithm 3 and PLuTo, respectively. From the figures, the following observations can be made about the pre-fusion schedule obtained from Algorithm 3.

1. Statement S18 is scheduled immediately after statement S15 as there is an opportunity for reuse and the precedence constraint is satisfied. Since both S15 and S18 have the same dimensionality, they remain fused in the transformed code shown in Figure 6.4(b). This opportunity for reuse is missed by the fusion model used in PLuTo since statement S27

(SCCid=3) with a different dimensionality is scheduled immediately after S15 (SCCid=2), as shown in Figure 6.4(c).

2. Reuse through input dependences is considered. As a result, statements S1, S2 and S3 that have reuse through input dependences and the same dimensionality are ordered next to each other as shown in (a). This allows them to be fused in the transformed code as shown in (b). Again, this opportunity for significant data reuse is missed by PLuTo. It schedules these 3 statements separately because they are disconnected in the DDG as shown in (c).

3. Ordering SCCs that are close to each other in program order consecutively allows more statements to be fused together in the same loop nest because the precedence constraint is satisfied. For example, statements S1, S2 and S3 were considered in program order. Since S3 was already scheduled, it created the opportunity for statements S15 and S18 to also be fused in the same loop nest because the precedence constraint was satisfied. As a result, 5 statements could be fused together in the same loop nest as shown in (b), thus allowing significant data reuse. In contrast, a maximum of 2 statements are fused in a loop nest using the fusion model in PLuTo, as shown in (d).

4. Statements S13, S16 and statements S14, S17 could not be fused with the other statements in the first loop nest, as these statements depend on the intermediate statements (S4-S12, which update the variables at the grid boundary). In other words, scheduling these statements with the other statements violates the precedence constraint, and thus they are distributed into a different loop nest.
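To make the control flow of Algorithm 3 concrete, the following compact C sketch (simplified data structures and hypothetical helper names; the SCC computation itself is assumed to have been done already) mirrors the traversal described above: statements are visited in program order, and an unvisited SCC of the same dimensionality is appended to the current fusable group whenever it has true or RAR reuse with the group and none of its unscheduled predecessors would violate the precedence constraint.

#include <stdbool.h>

#define MAXS 128
/* Inputs, assumed to be filled in elsewhere: true and input (RAR) dependence
 * adjacency, per-statement loop depth, and the SCC of each statement. */
extern bool adj[MAXS][MAXS], rar[MAXS][MAXS];
extern int  dim[MAXS], scc_of[MAXS], nstmts;

static bool visited[MAXS];
static int  scc_id[MAXS];            /* output: position in the pre-fusion schedule */

/* mark every statement of the SCC containing s and append it to the group */
static void mark_scc(int s, int id, int group[], int *gn) {
    for (int t = 0; t < nstmts; t++)
        if (scc_of[t] == scc_of[s] && !visited[t]) {
            visited[t] = true; scc_id[t] = id; group[(*gn)++] = t;
        }
}
/* reuse check: some grouped statement has a true or RAR edge into SCC of t */
static bool has_reuse(const int group[], int gn, int t) {
    for (int g = 0; g < gn; g++)
        for (int u = 0; u < nstmts; u++)
            if (scc_of[u] == scc_of[t] && (adj[group[g]][u] || rar[group[g]][u]))
                return true;
    return false;
}
/* precedence check: no unvisited statement outside SCC of t depends into it */
static bool precedence_ok(int t) {
    for (int s = 0; s < nstmts; s++)
        if (!visited[s] && scc_of[s] != scc_of[t])
            for (int u = 0; u < nstmts; u++)
                if (scc_of[u] == scc_of[t] && adj[s][u])
                    return false;
    return true;
}

void prefusion_schedule(void) {
    int id = 0;
    for (int s = 0; s < nstmts; s++) {
        if (visited[s]) continue;
        int group[MAXS], gn = 0;
        mark_scc(s, id++, group, &gn);             /* seed the fusable group with SCC of s */
        for (int t = 0; t < nstmts; t++)           /* scan the rest in program order       */
            if (!visited[t] && dim[t] == dim[s] &&
                has_reuse(group, gn, t) && precedence_ok(t))
                mark_scc(t, id++, group, &gn);     /* append SCC of t to the same group    */
    }
}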

6.4.2 Algorithm 4: Enabling Outer-level Parallelism

As explained in Section 6.2, in the polyhedral framework, legal statement-wise hyperplanes are found one loop level at a time using an ILP solver. The cost function used aims to find a loop hyperplane that minimizes the communication traffic among processors. As a result, it first aims to find a hyperplane that involves no communication, i.e. an outer-parallel loop. If no communication-free hyperplane can be found, a pipelined-parallel (or forward-dependence) hyperplane with constant communication cost is found. This was seen in the case of the advect benchmark as shown in Figure 6.3(c). The pre-fusion schedule obtained from Algorithm 3 leads to a partitioning that is legal and achieves good reuse. However, this may deprive the transformed code of outer-loop parallelism because of the introduction of a forward dependence at the outermost loop between two of the fused statements. Although, with the outermost loop becoming a forward-dependence loop, pipelined parallelism can still be achieved (for example, in the advect benchmark), the performance can be far from optimal because of increased communication costs. Algorithm 4 thus aims to preserve parallelism in the outermost loop to enable coarse-grained parallel code, although at the cost of some loss of data reuse.

ALGORITHM 4: Enabling outer-level parallelism
INPUT: List of dependences in the program: deps
begin
  {For the first non-serial loop hyperplane, h, found by the ILP solver}
  for each dependence S_i → S_j ∈ deps do
    if S_i → S_j is not satisfied at loop-level h ∧ φ^h_{S_j}(t⃗) − φ^h_{S_i}(s⃗) > 0, ⟨s⃗, t⃗⟩ ∈ P_{e_{S_i→S_j}} then
      Issue a cut between the SCCs containing statements S_i and S_j
      Discard the found hyperplane, and solve for a new hyperplane with the updated DDG
end
OUTPUT: Fusion with possible outer-level parallelism

For the pre-fusion schedule determined through Algorithm 3, the ILP solver starts finding legal loop hyperplanes. For the first non-serial⁴ loop hyperplane thus found, Algorithm 4 checks for any unsatisfied forward dependence at that loop level. A forward dependence between two statement instances is determined through the inequality in Equation 6.3. The existence of such a forward dependence implies a forward-dependence outermost loop. To satisfy the dependence and thus generate a parallel outermost loop, we issue a cut between the dependent SCCs, and re-solve for a new hyperplane with the updated dependence information in the DDG. This may be repeated until all dependences are satisfied and outer-loop parallelism is restored. Since we issue the cut between the SCCs carrying the actual dependence and not arbitrarily, the transformed code is minimally distributed, or in other words, it suffers minimal loss of data reuse. This can be seen in the transformed code for the advect benchmark generated from Algorithm 4, as shown in Figure 6.5. Statements S1, S2 and S3 are still merged in the same loop nest and enjoy reuse through reads, while only statement S4 is distributed into a different loop nest. Thus, coarse-grained parallelism is achieved with a minimal loss of reuse. The existing polyhedral frameworks, on the other hand, perform maximal fusion after loop shifting, leading to a loss of coarse-grained parallelism as shown earlier in Figure 6.3(c).

⁴ A serial loop hyperplane cannot be parallelized.

Figure 6.5: Transformed advect code generated using Algorithm 4

6.5 Experimental Evaluation

We implemented our fusion algorithm within PLuTo. We replaced its algorithm for finding the pre-fusion schedule with ours, and also incorporated our algorithm for preserving coarse-grained parallelism. However, for the purpose of finding loop hyperplanes, we rely on the same cost function as used in PLuTo and described in (108).

6.5.1 Setup

Fusion Model | Description
Intel Compiler | Fusion within the Intel compiler (baseline)
wisefuse | Our fusion model
smartfuse | The default fusion model in PLuTo. It uses heuristics to 'cut' or distribute SCCs into different loop nests
nofuse | Another fusion model implemented in PLuTo. It separates all SCCs into different loop nests
maxfuse | Another fusion model implemented in PLuTo. It conservatively distributes SCCs into different loop nests

Table 6.1: Summary of the fusion models

We ran our experiments on an Intel Xeon processor (E5-2650) with 8 Sandy Bridge-EP cores, operating at 2.0GHz. The processor has private L1 (32KB per core) and L2 (256KB per core) caches and a 20MB shared L3 cache. The experimental results compare the performance between our fusion algorithm and the one used within PLuTo that employs three different fusion heuristics. We compare against all the three heuristics used. The reason for comparing with PLuTo, as explained before, is that various other automatic polyhedral frameworks such as (99; 87; 104) use the same fusion algorithm as in PLuTo, and others (101; 109) use its variants. Since PLuTo cannot parse Fortran code, we integrated our fusion model with PolyOpt/Fortran (87), a tool that uses the ROSE compiler (88) frontend to parse Fortran code and relies on PLuTo to accomplish loop fusion. A summary of the different fusion models compared in this section is given in Table 6.1. The transformed code generated using the different fusion models was compiled using the Intel compiler v13 (icc and ifort) as the backend compiler. The compile time options used with the Intel compiler were '-O3' and '-parallel'. The original source, automatically parallelized using the Intel compiler and with all high level optimizations including loop fusion enabled, was used as the base case for comparison.

6.5.2 Benchmarks

The benchmarks used in the experiments belong to three different benchmark suites. We chose large programs from the SPEC and the NAS Parallel (NP) benchmark suites. The small kernel programs were chosen mainly from Polybench and PLuTo. These are summarized in Table 6.2. In the table, the first five benchmarks correspond to the large programs, and the last five represent small kernel programs.

Benchmark | Benchmark Suite | Category | Problem Size
gemsfdtd | SPEC 2006 | Computational Electromagnetics | Reference Input
swim | SPEC OMP | Shallow Water Modeling | Reference Input
applu | SPEC OMP | Computational Fluid Dynamics | Reference Input
bt | NPB | Block Tri-diagonal solver | CLASS C; (162)^3, dt = 0.0001
sp | NPB | Scalar Penta-diagonal solver | CLASS C; (162)^3, dt = 0.00067
advect | PLuTo | Weather modeling | nx=ny=nz=300
lu | Polybench | Linear Algebra | N=1500
tce | Polybench | Computational Chemistry | Standard; (55)^3
gemver | Polybench | Linear Algebra | N=1500
wupwise | SPEC OMP | Quantum Chromodynamics | Reference Input

Table 6.2: Summary of the benchmarks

6.5.3 Results and Discussion

Figure 6.6 shows the normalized performance with respect to the Intel compiler achieved using the different fusion models on the different benchmarks considered. Each of the codes was run using all 8 cores on the test processor. The figure also shows the geometric mean (represented as GM in Figure 6.6) of the performance achieved by the different fusion models over the 10 benchmarks - wisefuse achieves a performance improvement of 1.3X over the Intel compiler.

Figure 6.6: Results

For the purpose of discussing our experimental results, we divide the benchmarks into three categories: (1) the large benchmark programs, (2) the programs where fusion leads to a loss of parallelism, and (3) the small kernel programs. Large Benchmark Programs. Large programs tend to have multiple loop nests, with the potential to serve as an ideal playground for the polyhedral compilers. However, ironically, none of the polyhedral frameworks has yet demonstrated success with large programs. As explained earlier, a fundamental reason for this is the lack of an effective cost model for loop fusion, which in turn hurts all other transformations. Figure 6.6 shows that our fusion algorithm, wisefuse, achieves a performance improvement in the range of 1.7X to 7.2X as compared to smartfuse for the large programs, and an improvement of 5-18% over the Intel Compiler. It is important to note that this performance improvement is attributed to the better fusion partitioning achieved by wisefuse, with the amount of improvement achieved being a function of the reuse opportunity available in the source program. For large programs, only a part of the program usually provides fusion opportunity. For example, wisefuse improves the UPMLupdatee and UPMLupdateh routines in the gemsfdtd benchmark by nearly 1.5X, but the overall program by only 1.18X. Similar is the case with the APPLU, SP and BT benchmarks. The following discussion provides further insight into the results obtained.

Figure 6.7: Partitioning achieved by different fusion models for the gemsfdtd benchmark (values in each column are spaced out for readability)

Figure 6.7 compares the fusion partitioning achieved by icc, smartfuse and wisefuse for the UPMLupdateh subroutine in the gemsfdtd benchmark. The UPMLupdateh and another similar subroutine, UPMLupdatee, together account for 45% of the total execution time in gemsfdtd. In the figure, the second column shows the dimensionality of each SCC within UPMLupdateh, and columns 3 through 5 show the partition number to which each SCC belongs in the transformed codes obtained from the different fusion strategies. The SCCs with the same partition number in the figure are fused in the same loop nest in the transformed code. It can be seen that wisefuse minimizes the number of partitions and thus, in effect, achieves maximum reuse. All the SCCs fused together have reuse not only through true dependences but also through input dependences. The key to fusing more SCCs together is that wisefuse chooses a pre-fusion schedule such that SCCs with the same dimensionality are ordered next to each other, increasing the probability for them to be fused together into perfect loop nests. For example, all SCCs with a dimensionality of 2 are perfectly fused in the transformed gemsfdtd code generated using wisefuse. Also, wisefuse employs another heuristic - it fuses SCCs close to each other in program order, as such SCCs tend to reuse more data, and this also leads to effective fusion. This was observed in the applu, bt and sp benchmark applications, where wisefuse fused SCCs that belonged to the same pass (x-, y- or z-pass) and thus enjoyed excellent reuse through the input dependences. Smartfuse, on the other hand, fused statements across different passes, which, although it led to reuse through the true dependences, deprived it of better reuse among the SCCs close in program order. This led to overall poor reuse achieved through smartfuse in such programs. For all the large programs, icc largely maintains the original program order and does not accomplish any fusion. This is because icc performs a pair-wise fusion (96), where pairs of loops are fused incrementally to prevent the exponential cost of finding all fusion choices. However, when successive loops are of different dimensionality, as in the gemsfdtd benchmark, icc does not fuse them. This is because fusing loop nests of different dimensionality does not lead to significant reuse, and the fused nest may contain conditional statements that hinder compiler auto-vectorization. Wisefuse, instead, reorders statements such that those with (reuse and) the same dimensionality are ordered consecutively in a pre-fusion step for them to be considered for fusion. This allows wisefuse to achieve global reuse (as seen in Figure 6.6) while the generated code is still effectively vectorized by the backend compiler. As compared to icc, smartfuse generally performs worse due to the lack of a cost model to determine a good pre-fusion schedule. Programs with Conflict between Fusion and Parallelism. In the case of the advect and swim benchmarks, maximal fusion leads to a loss of outer-loop parallelism. In these benchmarks, wisefuse detects the SCCs that carry a forward dependence on the outer loop and selectively distributes them to different loop nests, as was shown earlier in Figure 6.5 for the advect benchmark. Both the smartfuse and maxfuse fusion strategies apply maximal fusion in these cases, resulting in a loss of outer-loop parallelism.
Although the transformed codes generated by the smartfuse and maxfuse fusion strategies are pipelined parallel, they perform worse than wisefuse because of the constant communication costs involved after the parallel execution of each wavefront. It is for the same reason that the transformed code generated by wisefuse scales better than smartfuse, and the performance gap increases with the increase in the number of processors. Wisefuse also slightly outperforms icc (which achieves no fusion) because of the better reuse accomplished in each case. Small Kernel Programs. Among the various small kernel programs from the Polybench benchmark suite, we chose the lu and tce benchmarks, as these two benchmarks particularly provide an opportunity for fusion and also reveal the limitations of the current non-polyhedral compilers. The lu benchmark implements the Gaussian elimination algorithm that converts a system of linear equations to a unit upper-triangular system. Thus, the lu benchmark exhibits a non-rectangular iteration space, or in other words, non-conformable loop bounds. As a result, icc adopts a conservative approach and does not achieve coarse-grained parallelization. The polyhedral frameworks, on the other hand, armed with exact dependence information, are able to find the available parallelism even in benchmarks with non-rectangular iteration spaces such as lu. Both smartfuse and wisefuse achieve the same fusion partitioning and consequently the same performance, which is considerably better than icc. The tce kernel appears in computational quantum chemistry problems. It contains 4 loop nests with significant reuse opportunity among statements. However, since each loop nest contains loops in a different order, non-polyhedral compilers such as icc cannot find a conformable pattern to accomplish loop fusion. The polyhedral compilers, on the other hand, find common loop hyperplanes for the different statements and thus achieve loop fusion. Again, both wisefuse and smartfuse yield similar fusion partitions. Similarly, for the other benchmarks from the Polybench benchmark suite, wisefuse achieves the same fusion partitioning as smartfuse, proving the effectiveness of the heuristics employed by wisefuse even for small kernel programs. The gemver benchmark considered earlier shows interesting behavior with the different fusion models. Both wisefuse and smartfuse achieve identical fusion partitioning. However, they perform worse than both the icc and nofuse fusion models for the chosen problem size. The Intel compiler, like the nofuse fusion model, does not accomplish any fusion, i.e. all 4 statements in gemver are left unfused. Wisefuse and smartfuse fuse statements S1 and S2, but at the cost of spatial locality in the two statements. Nofuse outperforms icc as icc fails to achieve coarse-grained parallelism in the loop nest enclosing statement S2. It is for this reason that the fusion partitioning accomplished by wisefuse and smartfuse is still better than that achieved by icc, as a loss of coarse-grained parallelism in even one loop nest hurts scalability when the number of processors is increased. Among small programs, we also chose the wupwise benchmark. Although it is from the SPEC OMP benchmark suite, 60% of its execution time is spent executing the zgemm subroutine, which is essentially complex matrix-matrix multiplication. However, within the SPEC suite, it is written as a collection of imperfect nests (loop nests of different dimensionality) and also involves data-dependent control flow.
Although a recent work (110) has made data-dependent control flow amenable to the polyhedral frameworks through the use of predicates, no implementation of it is available. We implemented the predication strategy as proposed in (110) and were thus able to optimize the wupwise benchmark. Since wupwise consists of imperfect nests, wisefuse distributes them into different perfect loop nests so as to achieve better data reuse. This results in a serial performance improvement of 20% over icc, which avoids loop distribution given imperfect nests. However, interestingly, the performance improvement goes up to 40% for 8 cores. This was because loop distribution allowed wisefuse to selectively parallelize loop nests, as not all loop nests contain enough computation to benefit from parallelization. Thus, this proved to be an additional advantage of using the heuristic that favors the fusion of SCCs with the same dimensionality.

6.6 Related Work

In this section, we present the related work on loop fusion in 2 phases - the work done prior to the advent of the polyhedral compilers, and the work done within the domain of the polyhedral compiler framework. Loop fusion has been extensively studied in the compiler community as an effective optimization to improve data reuse since its advent in the 1980's. Since then, loop fusion has been studied to achieve the complementary goals of preserving parallelism, minimizing register pressure and minimizing synchronization. Kennedy et al. in (93) study loop fusion to preserve parallelism while minimizing synchronization. They also show that finding fusion partitions that maximize data reuse is an NP-hard problem, and provide a polynomial time solution to improve data reuse. In their formulation, Kennedy et al. represent data reuse by an edge between loops (nodes), whereas Ding and Kennedy (96) later gave another formulation based on the hyper graph, where the computation units are represented as nodes, and each data array is represented as a hyper edge connecting all nodes where the array is accessed. These works thus treated the two problems of preserving parallelism and maximizing reuse independently. Singhai and McKinley in (94) consider these two problems together and use heuristics to find solutions that achieve good reuse and preserve parallelism. Megiddo and Sarkar in (95) target the same problem and show that optimal solutions can be found using their proposed integer programming formulation for problem sizes that occur in practice. In both of these works, the parallel loop nests are already identified and the problem reduces to merging loop nests such that parallelism is not hurt. This decoupling of loop fusion and parallelism misses certain good solutions. In addition, fusion and parallelism are sometimes enabled only after other loop transformations such as skewing, interchange, shifting, etc. are applied. This composition of high-level transformations is not facilitated within the traditional compiler frameworks, leading to the development of the polyhedral compiler framework. The state-of-the-art fully automatic fusion algorithm (107) within the domain of polyhedral compilers is employed in the source-to-source compilation tool, PLuTo (98). It subsumes previous work in affine scheduling (111; 112; 113; 114) and partitioning (115; 116) and overcomes their major limitation by incorporating an effective cost model for coarse-grained parallelization and data locality.
The fusion algorithm employed within PLuTo is also used by various other polyhedral frameworks, including PoCC, PolyOpt, LLVM's Polly, and IBM's XL compiler. However, as discussed in our motivation for this work, it lacks a cost model to determine good pre-fusion schedules, leading to suboptimal fusion partitions, especially for large programs with many statements. Recently, the fusion algorithm used in PLuTo was revised (117) to also preserve coarse-grained parallelization while allowing loop fusion. However, that approach, in addition to relying on a suboptimal pre-fusion schedule, may lead to a significant loss of reuse in order to preserve parallelism. This is because, unlike wisefuse, which precisely distributes only the SCCs preventing parallelism so as to minimize the loss of reuse, the approach proposed in (117) uses heuristics and may end up in excessive distribution to restore parallelism.

In order to find the best fusion partitioning, the authors in (102; 103; 97) have proposed an iterative compilation framework leveraging the polyhedral model. The authors propose techniques to build a convex space of all legal and non-redundant fusion partitions, and an optimization algorithm to explore this space. This technique has been shown to find the optimal fusion schedule for small benchmark programs through iterative search. However, we observed that the iterative compilation framework fails to build the search space for even moderately sized programs because the search space grows exponentially with the size of the program. We propose a fusion algorithm with a static cost model to achieve data reuse and preserve parallelism, and thus help overcome this limitation, especially for programs with many statements. In addition, URUK/WRaP-IT (118; 119) and CHiLL (39) are other well known tools that use the polyhedral representation to perform various high-level optimizations. However, both of these tools are semi-automatic and require transformations to be specified manually by an expert. The authors of URUK in (118) also achieve performance improvements for some large programs from the SPEC benchmark suite. Our fusion algorithm, on the other hand, demonstrates performance gains on such large programs through a fully automatic application of polyhedral techniques by incorporating an effective cost model for loop fusion.

6.7 Conclusion

In this work, we presented a loop fusion algorithm, wisefuse, with 2 objective functions: maximizing data reuse and preserving parallelism. To achieve data reuse, wisefuse employs certain heuristics in a pre-fusion step that work in consonance with the polyhedral framework and help to find good fusion partitions. In addition, wisefuse ensures that the coarse-grained parallelism inherent in the source code remains preserved, as aggressive loop fusion may lead to a loop-carried dependence on the outer loop that prevents parallelism. Experimental results demonstrate that wisefuse achieves good fusion partitions not only for the small kernel programs but also for the large benchmark programs. Wisefuse thus overcomes a major limitation of the existing polyhedral frameworks in handling large programs.

Chapter 7

Addressing Scalability: Optimizing Large Programs at an Easy Price

7.1 Introduction

Compiler scalability is a well known problem: precise analysis of large program scopes to reason about the correctness of optimizations leads to an exponential increase in compile time and memory requirement. As a result, production compilers choose to limit optimization to small scopes. In other words, they trade performance for compile time (and programmer productivity (120)). However, with the onset of the era of multiple (many) cores on chip and the corresponding accentuation of the memory and bandwidth walls, there is renewed focus on compiler optimizations for improving memory performance. These optimizations particularly include temporal-locality-enhancing optimizations such as loop fusion, and other supporting optimizations such as loop interchange and shifting. As a result, analyzing large program scopes to exploit data reuse opportunities spanning multiple loop nests, or in other words, performing global program transformations, becomes necessary. In fact, our experiments and previous work confirm the immense potential for improving the memory (and hence, parallel) performance of applications through such global program transformations.

Previous work on compiler scalability has focused on reducing either the cost of analyzing each dependence or the number of dependences to be analyzed, or both. Compilers perform such an analysis to build the Program Dependence Graph (PDG) (121). The PDG can then be used to reason about the application of program optimizations.


However, it is this step of finding the optimizations, especially when it involves large scopes as in loop fusion, that is much more time (and memory) consuming than merely analyzing the dependences. For example, optimizing the computationally-intensive subroutine rhs (containing 106 statements within 3 large loop nests) in the applu benchmark application from the SPEC OMP2012 suite consumes nearly 3 hours using the PLuTo polyhedral compiler, whereas the instance-wise dependence analysis takes less than a second. The same is observed in multiple scientific applications that contain significant opportunity for data reuse through global program transformation. Thus, we conclude that the real problem of finding good global program transformations in a scalable (i.e., time- and memory-efficient) way has not been addressed by the compiler community.

Having evolved over the last two decades, polyhedral compilers, armed with exact dependence analysis and seamless transformation composition, have proven adept at performing global program transformation. This is especially useful because their traditional counterparts (including production compilers such as gcc and icc) are plagued by artificial limitations such as non-conformable loop bounds, limitations of phase ordering, and difficulty in composing transformations. These limitations have restricted them to optimizing individual loops, or fusing consecutive loops with the same loop bounds and loop order. However, while the polyhedral compilers have shown promise in effectively optimizing large scopes (89), their compile time and memory requirement are prohibitively expensive. The unscalability in polyhedral compilers stems from a massive (fifth-degree polynomial) increase in compile time with the number of statements (122). This unscalability of the techniques employed in polyhedral compilers has thus been recognized as a key problem in the community.

In this work, we address the scalability problem within a polyhedral compiler (the techniques presented can, however, be implemented within traditional compilers as well). The essential ideas of our technique can be understood through the example program shown in Figure 7.1a.

Figure 7.1: (a) Original program; (b) Representative program with condensed set of statements and dependences

The example program is constructed to represent some of the characteristic features in application programs. These features include updates to value arrays such as rsd1 and rsd2, and also frequent use of temporary variables such as flux1 and flux2, as shown in the figure. The first important insight that leads to our proposed strategy is that statements within the same Strongly Connected Component (SCC) at a given loop-level undergo the same transformation at that level. Hence, statements within an SCC at the innermost loop-level undergo the exact same transformation in the transformed program at all loop-levels. For example, the statements S2 and S4 belong to the same SCC at loop-level j of the nest, and therefore have the same transformation until then, but differ later on. However, statements S1 and S3 belong to the same SCC at the innermost loop-level k, and thus remain perfectly fused at the innermost loop also. Thus, we call an SCC at the innermost loop-level an 'Optimization atom (O-atom)', since it is the smallest block of statements that becomes part of an optimization and all statements within this block are optimized similarly. We can therefore represent an O-atom with a single representative statement (Srep), and force all other statements in the same atom to have the same transformation as Srep.
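Since the listing of Figure 7.1a does not survive in this text, the following C sketch illustrates the kind of code the figure depicts. The array names follow the description above, but the individual statements, loop bounds, constants, and the function signature are purely illustrative assumptions, not the thesis's actual example; in particular, the sketch does not reproduce the figure's statements S1-S10 or their SCC structure.

/* Illustrative sketch only (not the actual Figure 7.1a): value arrays such as
 * rsd1 and rsd2 are updated in several loop nests, while short-lived temporaries
 * such as flux1 and flux2 are recomputed inside each nest.  Each loop body forms
 * one of the 'O-molecules' introduced later in this section. */
void example(int N, double rsd1[N][N][N], double rsd2[N][N][N],
             double u[N][N][N], double flux1[N], double flux2[N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++) {               /* first loop body  */
        flux1[k] = 0.5 * u[i][j][k];              /* temporary        */
        flux2[k] = 2.0 * u[i][j][k];              /* temporary        */
        rsd1[i][j][k] += flux1[k] + flux2[k];     /* value-array update */
      }
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 1; k < N; k++) {               /* second loop body */
        flux1[k] = 0.5 * rsd1[i][j][k];           /* temporary reused */
        rsd2[i][j][k] += flux1[k] - flux1[k-1];   /* value-array update */
      }
}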
Thus, the entire program can be seen as a collection of O-atoms instead of statements. Furthermore, this coarsened granularity can be used to prune dependences. That is, dependences that have the same source and destination O-atoms and the same dependence distance are clubbed into a single class with a single representative dependence, Drep. Since many dependences have small dependence distances and are therefore similar, there is significant opportunity to reduce the effective number of dependences to be considered. Those dependences whose source (or destination) statement is not Srep but another statement in the same O-atom are artificially assigned Srep as their source (or destination) in their respective O-atoms. This ensures that each Srep accounts for the constraints needed for a correct transformation, while at the same time allowing those constraints to be expressed through fewer program statements and dependences. Thus, the discovery of O-atoms in a program can lead to a significant reduction in compile time and memory requirement. However, it is not possible to identify the O-atoms in a given program up front, because finding O-atoms rests on information about the SCCs at the innermost loop-level.

The innermost loop, however, is unknown until the actual transformation is computed, since the transformation process may well involve loop interchange, resulting in a different innermost loop than in the original program. We thus propose to use 'Optimization molecules (O-molecules)' instead. An O-molecule is the block of statements comprising the loop body of any loop in the program. For example, statements S1-S3, S4-S5, S6-S8, and S9-S10 in Figure 7.1a form the O-molecules of the program. We call it an O-molecule because it may be a collection of O-atoms, since a loop body may contain multiple SCCs, just as statements S1 and S3 belong to the same SCC at the innermost loop k, while S2 is another SCC by itself at that level. Thus, O-molecules, unlike O-atoms, are easily recognizable even before program transformation begins. Also, since they are coarser than O-atoms, they allow a more pronounced reduction in analysis complexity. However, since all statements in an O-molecule are now forced to have the same transformation, optimizations such as loop distribution (used to enable vectorization) may be constrained. This does not hurt the performance of the transformed program, though, since backend compilers are adept at recognizing vectorization opportunities through distribution if needed.

This notion of 'O-molecule' becomes very useful in reducing the complexity of analysis. As a result of breaking our example program into O-molecules, the effective number of statements analyzed is just 4, instead of 10 in the original program, as shown in Figure 7.1b. Figure 7.1b also shows that the reads and writes of all statements in an O-molecule are condensed onto the representative statement, Srep, within each O-molecule. This ensures that Srep accounts for all the constraints needed to express a legal transformation. Moreover, the actual number of dependences analyzed by our framework is reduced from 64 to just 18. The reduction in the effective number of statements analyzed by our framework is more pronounced for real application programs, which contain many more statements in a single loop. Also, the reduction in the number of dependences analyzed is a function of the number of different value arrays used in the program. This is because value arrays such as rsd1 and rsd2 in our example program tend to be updated in multiple loop nests in real applications, which use many of them. Since these dependences also have short distances, this creates an opportunity to club many dependences into one class. For example, in the most computationally intensive subroutine, rhs, in the lu application program from the NPB/SPEC OMP2012 benchmark suite, which contains 106 statements, 3033 dependences, and 30 value arrays reused across 3 large loop nests, the numbers of statements and dependences analyzed by our framework are reduced to 23 and 849, respectively. As a consequence of this reduction, the compile time and memory requirement are reduced by factors of 734 and 60, respectively.

The rest of the chapter is organized as follows. Section 7.2 discusses the precise causes of the unscalability of polyhedral compilers with respect to large programs. This is followed by a detailed discussion of the implementation of our technique within the polyhedral framework in Section 7.3. Section 7.4 demonstrates the performance of our technique with regards to compile time and memory requirement when compared to the state-of-the-art polyhedral compilers, and also the Intel production compiler.
The related work is discussed in Section 7.5, and Section 7.6 concludes this work.

7.2 Causes of Unscalability

The bulk of the time spent in compilation through a polyhedral compiler falls in 3 distinct phases. These are: (1) constructing the set of constraints on the coefficients of the transformation matrix (of each statement) from the dependence polyhedra, (2) solving these linear constraints using an Integer Linear Programming (ILP) solver to obtain the transformation matrix for each program statement, and (3) incorporating further constraints to ensure linear independence of the next solution (a legal hyperplane) with the previously found solutions. In this section, we describe how each of these phases proves to be a culprit in the unscalability of polyhedral compilers.

7.2.1 First Culprit: Phase I - Constructing legality constraints

As noted in Chapter 2, the linearized legality condition in Equation 2.6 must be satisfied for every dependence in the SCoP. Thus, the number of constraints increases linearly with the number of dependences, which themselves increase quadratically with the number of statements. This leads to a quadratic increase in the number of constraints with the number of program statements. Consequently, both the compile time and the memory requirement increase significantly.

Increase in compile time. As discussed in Chapter 2, the Fourier-Motzkin Elimination (FME) method is used to eliminate the Farkas multipliers when linearizing the legality condition. The multipliers are eliminated one at a time. Running an elimination step over n inequalities results in an increase in the number of inequalities (to a maximum of n²/4 inequalities (123)) for subsequent steps. The number of these steps is proportional to the number of statements, and thus, in a large program, considerable time is spent eliminating all the Farkas multipliers. Furthermore, this step of eliminating Farkas multipliers to linearize the legality condition is performed for every dependence, and thus the contribution of Phase I to the overall compile time is notable.

Increase in memory requirement. Phase I is the only memory-consuming phase in the entire compilation. The memory is used to hold the constraints contributed by each dependence edge. These constraints are input to the ILP solver to find hyperplane solutions. Since a dependence edge can enforce constraints on the transformation coefficients of any two program statements, the total number of variables used to express all the constraints (from all dependences) in a matrix (we call it the Constraint Matrix, C) equals dim·|V|, where dim denotes statement dimensionality. Thus, the width of the Constraint Matrix C is dim·|V|. The height of the matrix is given by k·|E|, since each dependence in E contributes a fixed number of constraints proportional to the sum of the dimensionalities of its source and destination statements. Thanks to the elimination of all Farkas multipliers through Fourier-Motzkin Elimination, k is not very large. Still, the overall memory requirement grows as dim²·|V|·|E|, or |V|³. We find that for programs containing around 100 statements, the overall memory requirement is already of the order of a few gigabytes.

7.2.2 Second Culprit: Phase II - Using an ILP solver to find legal hyperplanes

As mentioned above, the number of constraints (inequalities) input to the ILP solver equals k·|E|, and the number of variables involved is dim·|V|. We thus estimate the (time) complexity of the ILP solver as follows. Let Z(m, n) be the complexity of ILP using the Simplex algorithm with m constraints and n variables. Then, on a typical input, Z(m, n) = O((m+n)·m·n) on average (without counting the bit-size complexity). In our ILP, this amounts to a complexity of O((|E|+|V|)·|E|·|V|), or O(|E|²·|V|), or O(|V|⁵), i.e., a quintic complexity in the number of statements. This leads to large compilation times for SCoPs with many statements.
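Spelling out the substitution behind this estimate (treating k and dim as small constants, and using |E| ~ |V|² for large SCoPs):

Z(k·|E|, dim·|V|) = O((k·|E| + dim·|V|) · k·|E| · dim·|V|) = O(|E|²·|V|) = O(|V|⁵).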

7.2.3 Third Culprit: Phase III - Finding linearly independent hyperplanes

Using an ILP solver, we find solutions to the statement-wise loop hyperplanes one at a time, starting with the hyperplane corresponding to the outermost loop. However, we are required to find as many independent hyperplanes as the dimensionality of the polytope (i.e., the maximum depth of any statement within the SCoP). Thus, before a hyperplane for the next loop-level is found, it is essential to augment the ILP formulation with additional constraints that ensure that the next hyperplane found is linearly independent of those previously found. For this purpose, a sub-space H_S^⊥ orthogonal to H_S (where H_S represents the hyperplane solutions found so far) is constructed as follows:

H_S^⊥ = I − H_S^T (H_S H_S^T)^{−1} H_S        (7.1)

Clearly, H_S^⊥ · H_S^T = 0, i.e., the rows of H_S are orthogonal to those of H_S^⊥. For the next hyperplane h′ to be linearly independent of all the hyperplanes in H_S, the following must hold:

H_S^⊥ · h′ ≠ 0        (7.2)
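As a small worked illustration (our own example, not taken from the text): suppose a single hyperplane H_S = (1, 0) has already been found for a statement of dimensionality 2. Then H_S H_S^T = 1, so

H_S^⊥ = I − H_S^T (H_S H_S^T)^{−1} H_S = [[1, 0], [0, 1]] − [[1, 0], [0, 0]] = [[0, 0], [0, 1]],

and H_S^⊥ · h′ = (0, h′₂)^T. Constraint (7.2) therefore simply requires the second component of the next hyperplane h′ to be nonzero: h′ = (0, 1) is acceptable, while any multiple of (1, 0) is not.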

Furthermore, these additional constraints need to be constructed in such a way that, when combined with the existing legality constraints, a solution for the coefficients of the next hyperplane is still possible. This involves computing the intersection of the integer sets given by Equation 7.2 and the constraints on the hyperplane coefficients obtained by eliminating all Farkas multipliers in Equation 2.6. This set intersection, computed using the Integer Set Library (ISL) (124), further involves simplification of the constraints over many steps until the actual intersection is computed. The number of these simplification steps is proportional to the number of dependences, |E|, and each step has a complexity of O(|E|). Since the hyperplanes found are statement-wise, the linear independence constraints are also statement-wise, and thus the complexity of this step becomes O(|E|²·|V|), or O(|V|⁵), i.e., also quintic in the number of statements, as for Phase II. However, unlike Phase II, the quintic complexity of Phase III is more realistic, and we actually find this step to be the most time consuming for large programs.

Thus, we find that among all three culprits, the common cause behind the unscalability problem is the large number of statements within a single SCoP, which also amounts to a large number of dependences. One solution to the problem could be to break a single large SCoP into multiple smaller SCoPs, but that defeats the very purpose of achieving global program transformations such as fusing multiple loop nests for reuse. Thus, we propose a solution that resolves the unscalability without losing any global optimization opportunities. We attack the problem at the root, i.e., we effectively cut down the number of statements, and also the number of dependences, in a program through our Statement Condensation and Dependence Condensation algorithms. The program is then represented with a smaller set of representative statements and dependences, which still allows us to correctly reason about the application of any particular transformation. This cut in the number of statements (and dependences) serves as a one-shot solution to all the above causes of unscalability, and allows SCoPs with many statements to be compiled within an affordable time and memory budget. The following section details the algorithms used to implement our solution.

7.3 Implementation

As discussed in Section 7.1, O-molecules form the smallest participating units in the optimization process within our polyhedral framework. That is, the granularity of polyhedral transformations is coarsened from individual statements. It was also discussed that this coarsened granularity opens up avenues for reducing the number of dependences and statements to be analyzed for transformation. Algorithm 5 details the algorithm that clubs together dependences belonging to the same source and destination O-molecules and with the same dependence distance. A single dependence from each club (Drep) is then sufficient to contribute to the dependence-wise legality constraints given by Equation 2.6.

Since the polyhedral framework computes statement-wise loop hyperplanes, each legality constraint (pertaining to a dependence) is a linear function of the coefficients of the hyperplanes of both the source and destination statements of the dependence. And it is quite likely that a Drep has a different source or destination than the chosen Srep in its source or destination O-molecule.
For example, in the example program shown in Figure 7.2a, suppose statements S1 and S5 are chosen to be the 2 representative statements for the 2 loops. Further, consider the dependences belonging to the club denoting a dependence distance of +1 in loop k: the dependence between statements S3 and S6 on flux1 and the dependence between statements S4 and S7 on flux2 both belong to this club. Clearly, neither of their source nor destination statements is the Srep of its O-molecule. In such a scenario, Srep cannot factually represent the O-molecule, because none of the dependences that actually emanate from Srep has a dependence distance of +1. Failing to account for the constraints that arise from dependences belonging to the club with distance +1, such as the one between statements S3 and S6, leads to the incorrectly transformed code shown in Figure 7.2b: the code respects neither the dependence between statements S3 and S6 nor that between statements S4 and S7.

In order for Srep to factually become the representative statement for an O-molecule, the constraints on its hyperplane coefficients should reflect the constraints imposed on the hyperplane coefficients by those Dreps whose source or destination O-molecule is the same as that of the Srep in question. This ensures that the transformation effected by the hyperplane found (using an ILP solver) for Srep satisfies all important dependences incurred by all statements in the O-molecule. In other words, if a dependence such as the one between statements S3 and S6 precludes fusion of the loops (which would otherwise have been legal when considering only the dependences involving Srep), then such a fusion will not be performed, because the constraints imposed by that dependence have been transferred onto Srep. As a result, Srep factually becomes a representative statement, and all other statements in the O-molecule can then safely assume the same transformation as that of Srep.

ALGORITHM 5: Dependence condensation

INPUT:
  deps     : copy of the list of dependences
  dist[][] : array that stores, for each (source, destination) O-molecule pair, the list of
             dependence information (e.g., dependence distances) of its representative dependences

begin
  for each dependence d ∈ deps do
    head = dist[d.src_mol][d.dst_mol]
    if head = ∅ then
      allocate a new node at head
      head.info ≡ d.info
      head.next = ∅
    else
      tmp = head
      while tmp.next ≠ ∅ and d.info ≢ tmp.info do
        tmp = tmp.next
      if d.info ≢ tmp.info then    {/* no existing class matches d: append a new Drep node */}
        allocate a new node at tmp.next
        tmp.next.info ≡ d.info
        tmp.next.next = ∅
end

OUTPUT: dist[][] containing the information about the representative dependences (Dreps)
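The following C sketch shows one way the dependence-condensation step of Algorithm 5 could be realized; the data-structure layout and field names are our own assumptions for illustration, not PLuTo's actual implementation.

/* Sketch of dependence condensation: group dependences that share the same
 * (source O-molecule, destination O-molecule, dependence info) into one class,
 * keeping a single representative (Drep) per class.  Purely illustrative. */
#include <stdlib.h>
#include <string.h>

typedef struct { int src_mol, dst_mol; int dist[8]; int depth; } Dep;   /* assumed layout */

typedef struct DrepNode {
    int dist[8]; int depth;              /* the "info" that identifies the class */
    struct DrepNode *next;
} DrepNode;

static int same_info(const Dep *d, const DrepNode *n) {
    return d->depth == n->depth &&
           memcmp(d->dist, n->dist, d->depth * sizeof(int)) == 0;
}

/* dist_table[src * nmols + dst] is the head of the Drep list for that O-molecule pair. */
void condense_deps(Dep *deps, int ndeps, DrepNode **dist_table, int nmols) {
    for (int i = 0; i < ndeps; i++) {
        Dep *d = &deps[i];
        DrepNode **slot = &dist_table[d->src_mol * nmols + d->dst_mol];
        DrepNode *n = *slot;
        while (n && !same_info(d, n))     /* look for an existing class */
            n = n->next;
        if (!n) {                         /* none found: d becomes a new Drep */
            n = malloc(sizeof(DrepNode));
            n->depth = d->depth;
            memcpy(n->dist, d->dist, d->depth * sizeof(int));
            n->next = *slot;
            *slot = n;
        }
    }
}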

Our framework achieves this reduction in the effective number of program statements by first choosing one statement from each O-molecule as its Srep. In particular, we choose the statement in each O-molecule that is the source of the first such dependence found in a straightforward traversal of the list of Dreps. Algorithm 6 shows the matrix cst that accumulates the dependence-wise legality constraints.

Figure 7.2: (a) Original program; (b) Incorrectly transformed (fused) program; (c) Correctly transformed program (with shifting in S5-S7)

Columns of cst correspond to the input variables of the ILP, i.e., the coefficients of the hyperplane of each statement in the program. Originally, each dependence would add constraints to cst. Using Algorithm 5, we cut those dependences down to the set of Dreps. Further, each dependence in this set now adds constraints only to the variables representing the coefficients of the Sreps, as shown in Algorithm 6. This reduces the number of variables input to the ILP solver from a function of the number of statements in the program to a function of the number of O-molecules, which is a significant reduction.

7.4 Experimental Evaluation

7.4.1 Setup

The test programs were compiled (and run) on an Intel Xeon processor (E5-2650) with 8 Sandy Bridge-EP cores operating at 2.0GHz. The processor has private L1 (32KB per core) and L2 (256KB per core) caches, a 20MB shared L3 cache, and 8GB of memory. In this section, we compare the time and memory consumed for compilation by the state-of-the-art PLuTo polyhedral compiler and by our work on statement and dependence condensation, as presented in this chapter. In addition, we also show the time and memory consumed for compilation by the state-of-the-art Intel production compiler. The Intel compiler, although very efficient in terms of compilation time and memory consumption, is very conservative in analyzing large scopes for global program transformation. The reader is therefore referred back to Chapter 5 for the execution performance results of our framework versus the Intel compiler.

ALGORITHM 6: Statement condensation

INPUT: Drep : set of representative dependences constructed from dist[][]

Step 1: Finding the set of representative statements, Srep
begin
  for each O-molecule m in the program do
    for each dependence d ∈ Drep do
      if d.src_mol = m then
        Srep[m] = d.src_stmt
        break    {/* take the first such dependence */}
end

Step 2: Using Srep[] to reduce the number of variables
begin
  {/* This step only shows the program segments where Srep[] constructed in Step 1 becomes useful. */}
  {/* Originally: src_offset = f(d.src_stmt); dest_offset = f(d.dest_stmt) */}
  src_offset = f(Srep[d.src_mol])
  dest_offset = f(Srep[d.dest_mol])
  for each coefficient v in the source and destination statement's hyperplane do
    ...
    cst[f(d)][src_offset + v] = ...
    cst[f(d)][dest_offset + v] = ...
    ...
end
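A brief C sketch of Step 1 of Algorithm 6 follows; as with the earlier sketch, the data structure and field names (src_mol, src_stmt) are our own assumptions for illustration.

/* Sketch of Srep selection (Step 1 of Algorithm 6): for each O-molecule, pick as its
 * representative the source statement of the first representative dependence whose
 * source O-molecule matches.  Purely illustrative. */
typedef struct { int src_mol, dst_mol, src_stmt, dst_stmt; } DrepInfo;   /* assumed layout */

void choose_srep(const DrepInfo *dreps, int ndreps, int *srep, int nmols) {
    for (int m = 0; m < nmols; m++) {
        srep[m] = -1;                              /* no representative chosen yet */
        for (int i = 0; i < ndreps; i++) {
            if (dreps[i].src_mol == m) {
                srep[m] = dreps[i].src_stmt;       /* first matching Drep wins */
                break;
            }
        }
    }
}

In Step 2, the chosen srep[] entries then decide which columns of the constraint matrix cst a dependence writes to, so that only the representatives' hyperplane coefficients appear as ILP variables.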

7.4.2 Benchmarks

The benchmarks used in these experiments are the same as those used to show the results of our proposed variable liberalization optimization in Chapter 5. The benchmarks are listed in Table 7.1.

Benchmark   Benchmark Suite   Category                                  Problem Size
applu       NPB/OMP2012       Computational Fluid Dynamics (CFD)        N=102; CLASS B
bt          NPB/OMP2012       Computational Fluid Dynamics (CFD)        N=102; CLASS B
sp          NPB               Computational Fluid Dynamics (CFD)        N=102; CLASS B
zeusmp      CPU2006           Simulation of astrophysical phenomena     Reference Input

Table 7.1: Summary of the benchmarks

7.4.3 Results and Discussion

Table 7.2 compares the time and memory consumed for compilation by the different compilers. It is important to note that the maximum memory required for any benchmark is around 6GB, while the memory on the test processor is 8GB. This ensures that the compile times shown in our results are not influenced by the frequent page faults that result when the memory used approaches the memory limit of the host. The table also shows the number of statements and the number of dependences for each of the 8 SCoPs chosen for our experiments. As mentioned in Chapter 5, performance results on such large SCoPs have not been shown using existing polyhedral compilers. In that chapter, we show that our framework, which implements variable liberalization, achieves an average (geometric mean) speedup of 2.1x and 5.16x over the Intel Fortran compiler and the PLuTo polyhedral compiler, respectively. We also recognized in that chapter that the compile time and memory requirement of the polyhedral compilers is still prohibitively large for them to be employed for compilation in the real world: the benchmarks with around 100 statements take more than 3 hours to compile. Through our proposed techniques of statement condensation and dependence condensation, we are able to reduce the compilation time very significantly, with the largest compile time being merely 18 seconds, for the rhs subroutine in the lu benchmark. We refer to our work (in the form of the 2 algorithms presented in this chapter to achieve scalability) as scalefuse, which is implemented as another compiler option within the PLuTo compiler. On average (geometric mean), scalefuse achieves an improvement of 395x and 28x in compile time and memory requirement, respectively, over the PLuTo compiler. Also, the compile time of scalefuse is now comparable to, and its memory requirement even less than, that of the Intel compiler. These results clearly present a strong case for the practical use of a polyhedral compiler for optimizing application programs, in addition to kernels.

The last two columns in Table 7.2 show the number of representative statements and dependences actually analyzed by scalefuse. On average, the numbers of statements and dependences analyzed are reduced by factors of 4.8x and 7x, respectively. Since the time and memory complexity of the three key phases of compilation is polynomial with a large power in the number of statements, the gains from our proposed statement condensation (and dependence condensation) are large. In particular, the improvement in compile time is more pronounced, since the compile time varies as |V|⁵ while the memory requirement varies as |V|³. Another interesting point to note is that the number of representative dependences (Dreps) is larger for the bt, sp, and lu benchmarks than for the others, because these benchmarks involve many more variables, while the subroutines in the zeusmp benchmark use a lot of temporary variables with the same name, revealing more opportunity for dependence condensation.

                                                  Compile time (s) / Memory req. (MB)
Benchmark   Subroutine   #statements   #deps   dim   icc         smartfuse          scalefuse      #Srep   #Drep
bt          rhs          48            1419    4     4/116       303/703            4.11/61.88     15      391
sp          rhs          50            1428    4     4/116       360/766            4.132/77.93    15      391
lu          rhs          106           3033    4     1.9/71.7    11302/5542         15.39/92.22    23      849
zeusmp      hsmoc.1      121           2660    3     5/141       4903.72/1207.8     11.76/40.12    21      217
            hsmoc.2      118           2493    3     5.9/151     6784.99/1256.26    11.74/41.28    18      198
            hsmoc.3      120           2296    3     5.7/143     8121.38/1643.15    13.87/55.59    18      234
            lorentz.1    98            2227    3     3.1/95.2    6550.48/2548.09    6.51/62.44     21      256
            lorentz.2    92            2149    3     2.49/86     7759.11/3389.42    8.7/67.61      18      247

Table 7.2: Compile time, memory requirement of different compilers

Table 7.3 shows the time taken by each of the three phases that we identify as culprits for long compile times in Section 7.2. (The compile time and memory requirement numbers shown in these tables differ from those in Chapter 5 because we use the most recent version of the PLuTo compiler for this study, which enforces stricter constraints to obtain a feasible solution.) Clearly, scalefuse outperforms PLuTo considerably in all three phases. As the table shows, Phase II accounts for the bulk of the compile time, and consequently the most notable reduction through scalefuse is also for Phase II. The reduction is also substantial for Phase III, the phase where PLuTo adds constraints to guarantee the linear independence of future hyperplanes with those already found.

                           Compile time (s)
                           Phase I                  Phase II                  Phase III
Benchmark   Subroutine     smartfuse   scalefuse    smartfuse   scalefuse     smartfuse   scalefuse
bt          rhs            8.45        2.32         285.82      1.142         3.96        0.152
sp          rhs            9.22        2.334        340.32      1.144         4.58        0.157
lu          rhs            265.7       5.102        10822.91    8.316         111.83      0.577
zeusmp      hsmoc.1        213.7       8.317        4603.32     0.443         81.72       0.118
            hsmoc.2        182.5       6.69         6522.88     2.169         73.86       0.082
            hsmoc.3        160.58      8.41         7878.65     2.46          76.09       0.133
            lorentz.1      625.63      3.047        5780.06     1.226         141.6       0.177
            lorentz.2      904.10      4.26         6523.33     1.83          326.68      0.41

Table 7.3: Compile times for the three time consuming phases in polyhedral compilation

7.5 Related Work

The problem of the unscalability of compilers has been well recognized in the literature. Much of the past work has been devoted to speeding up either the dependence testing, through newer (faster and less conservative) dependence tests, or the construction of the Program Dependence Graph (PDG) to be used later in program optimization. However, these works have not translated into practical implementations in existing production compilers due to their weaknesses in both (1) performing exact dependence analysis and (2) composing effective program transformations. As a result, current production compilers fare poorly with regards to global program optimizations, since they limit themselves to small scopes. It is only more recently, in the context of the more powerful optimizing compilers using the polytope model, that the problem of scalability when applying global program optimizations to large program scopes has come to the fore.

Dependence Testing. Dependence testing answers the question of whether two statements (or statement instances) refer to the same memory location across iterations. Most dependence tests are approximate, yet conservative, i.e., if a dependence exists, it is reported; however, false dependences may also be reported, leading to (unnecessarily) constrained optimization. These include the fast GCD-test (125), the Banerjee test (126), the Power test (127), etc. There are also exact tests that give an exact solution, but at a higher computational cost, such as the Omega-test (128) and the PIP-test (129). Instance-wise dependence analysis (80) subsumes previous work in dependence testing, for it precisely tells which instances of the involved statements are in dependence. This instance-wise analysis is thus a more powerful dependence abstraction than the less precise counterparts used in previously proposed testing algorithms, such as (1) dependence levels (130), which tell which loops in the loop nest carry a dependence, and (2) distance vectors (131; 132), which show the difference of the loop counters of dependent instances. As a result, it aids the application of finer program transformations such as those in the polyhedral compilers.

Scalable PDG construction. In order to speed up the construction of the PDG while not sacrificing precision, Maydan et al. (133) propose to use a collection of different dependence tests. They show that faster tests such as the GCD-test suffice for most dependences, while expensive tests can be used for the remaining dependences. They also use memoization to avoid initiating a test on dependences that are similar to ones already seen.

Recently, (70) proposed a method to enable fast and precise construction of the Directed Acyclic Graph of Strongly Connected Components (DAG-SCC, also called the condensation of the PDG). The authors recognize that it is the DAG-SCC, rather than the PDG, that is used in program optimization, and thus propose to compute the former directly by analyzing fewer dependences. This approach, although more closely tied to program transformation, still does not address the real problem of speeding (scaling) up the step of finding global program transformations from the computed DAG-SCC. This, as discussed in Section 7.1, is the most time consuming step in compilation. In addition, the authors analyze individual hot loops, and do not take advantage of global program transformations within the polyhedral framework.

Scalability within the polytope model. The problem of the unscalability of polyhedral techniques (the simplex algorithm for solving the linear programs for scheduling) when applied to large scopes was first recognized by Feautrier in (134). He also proposed to avert the problem by calling the simplex algorithm on smaller scopes, or modules, that are comparable to functions. Each of these modules can be optimized in much less time than all of them together. However, this approach misses the essential global optimizations such as fusing those modules and then parallelizing the fused structure. Upadrasta and Cohen (122) have identified this problem and have proposed a sub-polyhedral technique using (Unit-)Two-Variable-Per-Inequality, or (U)TVPI, polyhedra. Their technique is an under-approximation of a general polyhedron, and thus does not guarantee finding the best solutions. If the under-approximation results in an empty polyhedron (or a non-solution), the authors propose to distribute loops to achieve a solution. This again introduces sub-optimal solutions. The authors show the scalability of their technique over the simplex algorithm used in PLuTo for a highly unrolled matmul kernel. The matmul kernel, although unrolled to introduce dependences on the order of thousands, still takes less than 5 seconds to compile using PLuTo, given the very simple nature of its dependences. We believe that achieving a feasible solution with such an approach may be a challenge for real application programs that contain imperfectly nested loops in multiple nests with different loop orders. Our work in this chapter does not use any under-approximations, and uses knowledge of program semantics to cut down the size of the input (in terms of variables and constraints) to the simplex algorithm. We also demonstrate its scalability on real-world applications.

7.6 Conclusion

In this work, we address a key problem in polyhedral (parallelizing) compilers: the unscalability of the employed algorithms. We identify that the root cause of this problem is the large number of program statements (and program dependences) in the hot regions of real application programs. We therefore propose a one-shot solution to the problem: represent the entire hot region with a much smaller, semantically equivalent set of representative statements and dependences. Our algorithms for this purpose are based on the notion of 'O-molecules', i.e., sequences of statements within innermost loops in the program, all of which are forced to undergo the same program transformation. This allows us to achieve the needed statement (and dependence) condensation, and does not constrain program transformation significantly.
As a result of our proposed statement condensation and dependence condensation, we significantly reduce the compile time and memory requirement with respect to the state-of-the-art polyhedral compilers, and match those of the Intel compiler, while achieving an average execution performance improvement of 2.1x over the Intel compiler for the same 8 hot regions. Thus, the strengths of program analysis and transformation offered by a polyhedral compiler can now be effectively and practically employed to optimize real application programs in addition to kernels.

Chapter 8

Conclusion and Discussion

In this dissertation, we identify 3 key challenges faced by parallelizing compilers, particularly in applying optimizations to improve the critical memory performance of existing multi- and many-core processors. This is important because the hardware can assist in improving memory performance only to a certain extent, such as by providing caches and better prefetchers; it is extremely difficult, if at all possible, for the hardware to excavate opportunities from the application code to make efficient use of these resources. It is also unrealistic to pass this responsibility to the programmer, because the host platforms tend to be very different and continually evolving, and not all programmers like to be 'close to the silicon' or feel comfortable reasoning about an optimization over the long sequences of loop nests at their disposal. It is therefore the compiler's most important responsibility, along with extracting parallelism, to improve memory performance so as to achieve effective parallel performance.

8.1 Summary of important findings and directions for future research

In order to address the impact of important microarchitectural changes over the last decade on important memory optimizations such as loop tiling (for locality enhancement) and data prefetching (for latency hiding), we first study the nature of the interaction between hardware and compiler in these cases. Then, an algorithm is presented for each of these two optimizations, in Chapters 3 and 4, respectively, that builds upon an understanding of the precise influence of hardware on these optimizations, and vice versa.

In particular, we find that loop tiling closely interacts with multi-level caches, their set-associativities, multithreading, vectorization, and hardware prefetching. Data prefetching, on the other hand, depends upon the prefetching support in each host hardware and the type of interaction between hardware prefetching and compiler prefetching allowed by the host. In addition, prefetching also closely interacts with multithreading and vectorization. We find that incorporating these algorithms in the compiler gives significant performance improvements on both kernels and real applications over state-of-the-art production compilers, which largely ignore this important interaction between the compiler and the host hardware. Also, since these algorithms are designed to take information about the hardware's parameters as input, they can be deployed in a general-purpose compiler running on different hosts with their own specific parameters.

For future research, our initial findings indicate that many applications can benefit when loop tiling is performed with a precise understanding of the nature of the hardware prefetching active at multiple levels of cache in modern hardware. For example, in one experiment, we found that the best tiled code was the one where the tile size was simply chosen to fit the last-level cache. Although it was not expected to perform better than the tile size that fits the L1 cache, the better performance stems from the fact that the hardware prefetcher itself brings the data into the L1 cache for faster accesses; meanwhile, larger tiles benefit more from prefetching. This approach may also remove the strong dependence on knowing the problem size of the code to implement the most effective tiling. Also, we believe that with an improvement in the memory system's performance through effective memory optimizations (such as loop tiling and prefetching), the memory hierarchies of modern multi-cores can be redesigned for better power efficiency. In particular, we find that in some applications it is not necessary to have a 3-level cache hierarchy, and 2 levels of cache are sufficient since prefetching can effectively bring the data into the L1 cache. With a closer study of the full range of target applications, a generic solution suitable for all applications can be found with a less power-consuming memory system.

In our research, we also focus on the loop fusion optimization, which is of particular relevance for achieving parallel performance on multi- and many-core processors. Ironically, state-of-the-art compilers are particularly ineffective in applying loop fusion. We identify the principal reason for this shortcoming to be their focus on optimizing small scopes such as individual loop nests. We address this problem by (1) proposing variable liberalization, an optimization that facilitates fusion of loop nests in the presence of transformation-restricting dependences on temporary variables, and (2) further, giving a heuristic-based cost model that wisely decides on the statements to be fused, to achieve good reuse in the final transformed program without paying the expense of lost parallelism. We believe that the combination of these contributions, presented respectively in Chapters 5 and 6, will allow production compilers to effectively fuse loop nests in real application programs, which have been shown to exhibit significant opportunity for temporal reuse through loop fusion.
Finally, Chapter 7 addresses the issue of compiler scalability, a longstanding problem that has kept compilers from attempting major optimizations on the large sequences of loop nests found in real applications. Compilers chose to avoid such attempts in order to contain the compile time and memory requirement of compilation, as a programmer who has to wait long to compile his code will shift to a different compiler. We introduce the notion of the Optimization-molecule as the smallest grain of statement block for which all incurred dependences can be subsumed in a single representative statement. This leads to a very significant reduction in the constraints to be analyzed when reasoning about transformations, and hence also in compile time and memory requirement. This work therefore perfectly complements our work on loop fusion for large programs, and makes it an implementable solution for production compilers. We believe that this will also open up avenues for further research on global program optimizations, and is therefore an interesting direction for future research. Also, for the benefits of this work to become available to a larger set of applications, an extension of the polyhedral framework to non-affine codes will be extremely useful.

8.2 Final words

This thesis revisits important compiler optimizations for improving memory performance. These optimizations will find ever-increasing importance in future architectures, which are expected to face the challenges of the memory wall and the bandwidth wall with increasing intensity. We believe that the most important general finding of this work is that the best solutions emerge when the hardware and software work in consonance for any given application. Since our work lies at the junction of architecture, compilers, and applications, it contains directions to guide the design of future architectures and their compilers: architectures and compilers built with an understanding of each other's capabilities and limitations, and with knowledge of the target applications, will achieve the best performance. We hope to use the experience and knowledge obtained from this research in the design of such architectures and compilers in the future.

References

[1] Vikas Agarwal, MS Hrishikesh, Stephen W Keckler, and Doug Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures, volume 28. ACM, 2000.

[2] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.

[3] Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, PLDI '95, pages 279–290, New York, NY, USA, 1995. ACM.

[4] Jacqueline Chame and Sungdo Moon. A tile selection algorithm for data locality and cache interference. In Proceedings of the 13th international conference on Supercomputing, ICS '99, pages 492–499, New York, NY, USA, 1999. ACM.

[5] Paul Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, 1991.

[6] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 101–113, New York, NY, USA, 2008. ACM.

[7] Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In CC’2008, pages 132–146. Springer Berlin / Heidelberg.

[8] M. Wolfe. More iteration space tiling. In SC '89, pages 655–664. ACM, 1989.

[9] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, PLDI '91, pages 30–44, New York, NY, USA, 1991. ACM.

[10] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for multicomputers. Journal of Parallel and Distributed Computing, 16(2):108–120, 1992.

[11] Amy W Lim, Shih-Wei Liao, and Monica S Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In ACM SIGPLAN Notices, volume 36, pages 103–112. ACM, 2001.

[12] Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990.

[13] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, ASPLOS-IV, pages 63–74, New York, NY, USA, 1991. ACM.

[14] Karim Esseghir. Improving Data Locality for Caches. PhD thesis, Rice University, September 1993.

[15] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. Cache miss equations: an analytical representation of cache misses. In ICS ’97, pages 317–324. ACM, 1997.

[16] V. Sarkar. Automatic selection of high-order transformations in the IBM XL Fortran compilers. IBM Journal of Research and Development, 41(3):233–264, May 1997.

[17] V. Sarkar and N. Megiddo. In ISPASS ’2000, pages 146 –153, 2000.

[18] Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan, and Vivek Sarkar. Analytical bounds for optimal tile size selection. In ETAPS International Conference on Compiler Construction (CC'12), Tallinn, Estonia, March 2012. Springer Verlag.

[19] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In ICS '97, pages 340–347. ACM, 1997.

[20] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimiza- tions of software and the atlas project. Parallel Computing, 27:3 – 35, 2001.

[21] Matteo Frigo. A fast fourier transform compiler. In PLDI ’99, pages 169–180. ACM, 1999.

[22] Jianxin Xiong, Jeremy Johnson, Robert Johnson, and David Padua. Spl: a language and compiler for dsp algorithms. In PLDI ’01, pages 298–308. ACM, 2001.

[23] Sanket Tavarageri, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. Dynamic selection of tile sizes. In Proceedings of the 18th International Conference on High Performance Computing, HiPC '11, Bengaluru, India, December 2011.

[24] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O’Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. The Journal of Supercomputing, 24:43–67, 2003.

[25] Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. Automatic creation of tile size selection models. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pages 190–199, New York, NY, USA, 2010. ACM.

[26] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not., 43:101–113, June 2008.

[27] Gabriel Rivera and Chau-Wen Tseng. A comparison of compiler tiling algorithms. In Compiler Construction, pages 168–182. Springer, 1999.

[28] P. R. Panda, Hiroshi Nakamura, Nikil Dutt, and Alexandru Nicolau. Augmenting loop tiling with data alignment for improved cache performance. Computers, IEEE Transactions on, 48(2):142–149, 1999.

[29] Chung-hsing Hsu and Ulrich Kremer. A quantitative analysis of tile size selection algorithms. The Journal of Supercomputing, 27(3):279–294, 2004.

[30] K. Yotov, X. Li, G. Ren, M.J.S. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance blas? Proceedings of the IEEE, 93(2):358 –386, feb. 2005.

[31] David Wonnacott. Achieving scalable locality with time skewing. International Journal of Parallel Programming, 30(3):181–221, 2002.

[32] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In Languages and Compilers for Parallel Computing, volume 589 of Lecture Notes in Computer Science, pages 328–343. Springer Berlin / Heidelberg, 1992.

[33] O. Temam, E. D. Granston, and W. Jalby. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In SC '93, pages 410–419, Nov. 1993.

[34] Cédric Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '04, pages 7–16, France, September 2004.

[35] Keith Cooper and Jeffrey Sandoval. Portable Techniques to Find Effective Memory Hi- erarchy Parameters. Technical report, 2011.

[36] Polybench. Available at http://www-roc.inria.fr/~pouchet/software/polybench/.

[37] Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. Exact analysis of the cache behavior of nested loops. In PLDI ’01, pages 286–297. ACM, 2001.

[38] A. Tiwari, Chun Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IPDPS '09, pages 1–12, May 2009.

[39] C. Chen, J. Chame, and M. Hall. Chill: A framework for composing high-level loop transformations. U. of Southern California, Tech. Rep, pages 08–897, 2008.

[40] Cristian Ţăpuş, I-Hsin Chung, and Jeffrey K. Hollingsworth. Active harmony: towards automated performance tuning. In SC '02, pages 1–11, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[41] Kazushige Goto and Robert Van De Geijn. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35(1):4:1–4:14, July 2008.

[42] Kaushik Datta. Auto-tuning Stencil Codes for Cache-Based Multicore Platforms. Technical report, University of California at Berkeley, Berkeley, December 2009.

[43] C. Chen, J. Chame, and M. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In CGO ’05, pages 111–122, 2005.

[44] Muthu Manikandan Baskaran, Albert Hartono, Sanket Tavarageri, Thomas Henretty, J. Ramanujam, and P. Sadayappan. Parameterized tiling revisited. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pages 200–209, New York, NY, USA, 2010. ACM.

[45] Hung Q Le, William J Starke, J Stephen Fields, Francis P O’Connell, Dung Q Nguyen, Bruce J Ronchetti, Wolfram M Sauer, Eric M Schwarz, and Michael T Vaden. Ibm power6 microarchitecture. IBM Journal of Research and Development, 51(6):639–662, 2007.

[46] Tien-Fu Chen and Jean-Loup Baer. A performance study of software and hardware data prefetching schemes. In ISCA’94, pages 223–232.

[47] Rafael H. Saavedra and Daeyeon Park. Improving the effectiveness of software prefetching with adaptive executions. In PACT'96, pages 68–78.

[48] Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. Guided region prefetching: A cooperative hardware/software approach. In ISCA’03, pages 388–398.

[49] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. When prefetching works, when it does not, and why. TACO’2012, 9(1):29.

[50] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS'92, pages 62–73.

[51] Daniel J. Quinlan, Markus Schordan, Bobby Philip, and Markus Kowarschik. The specification of source-to-source transformations for the compile-time optimization of parallel object-oriented scientific applications. In LCPC'01, pages 383–394.

[52] Intel compiler. Available at http://software.intel.com/en-us/intel-parallel-studio-xe.

[53] Intel’s VTune. Available at www.intel.com/Software/Products.

[54] Abdel-Hameed Badawy, Aneesh Aggarwal, Donald Yeung, and Chau-Wen Tseng. The efficacy of software prefetching and locality optimizations on future memory systems. JILP’2004, 6(7).

[55] David Callahan, Ken Kennedy, and Allan Porterfield. Software prefetching. In ASPLOS ’91, pages 40–52.

[56] Alexander C. Klaiber and Henry M. Levy. An architecture for software controlled data prefetching. In ISCA’91, pages 43–53.

[57] Vatsa Santhanam, Edward H Gornish, and Wei-Chung Hsu. Data prefetching on the HP PA-8000. In ISCA’97, pages 264–273.

[58] George C Caragea, Alexandros Tzannes, Fuat Keceli, Rajeev Barua, and Uzi Vishkin. Resource-aware compiler prefetching for many-cores. In ISPDC’2010, pages 133–140.

[59] Rakesh Krishnaiyer, Emre Kultursay, Pankaj Chawla, Serguei Preis, Anatoly Zvezdin, and Hideki Saito. Compiler-based data prefetching and streaming non-temporal store generation for Intel Xeon Phi coprocessor. In Workshop on Multithreaded Architectures and Applications, 2013.

[60] Md Kamruzzaman, Steven Swanson, and Dean M. Tullsen. Inter-core prefetching for multicore processors using migrating helper threads. In ASPLOS’11, pages 393–404.

[61] Dongkeun Kim and Donald Yeung. Design and evaluation of compiler algorithms for pre-execution. In ASPLOS’02, pages 159–170.

[62] Seung Woo Son, Mahmut Kandemir, Mustafa Karakoy, and Dhruva Chakrabarti. A compiler-directed data prefetching scheme for chip multiprocessors. In PPOPP’09, pages 209–218.

[63] Pierre Boulet, Alain Darte, Georges-André Silber, and Frédéric Vivien. Loop parallelization algorithms: From parallelism extraction to code generation. Parallel Computing, 24(3-4):421–444, 1998.

[64] L. Rauchwerger and D. A. Padua. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Transactions on Parallel and Distributed Systems, 10(2):160–180, February 1999.

[65] Chen Ding and Ken Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. JPDC, 64(1):108–134, 2004.

[66] Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The paralax infrastructure: automatic parallelization with a helping hand. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 389–400. ACM, 2010.

[67] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and Nicolas Vasilache. Iterative optimization in the polyhedral model: Part I, one-dimensional time. In Code Generation and Optimization, 2007. CGO’07. International Symposium on, pages 144–156. IEEE, 2007.

[68] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pages 90–100, New York, NY, USA, 2008. ACM.

[69] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. IJPP, 34:261–317, 2006.

[70] Nick P. Johnson, Taewook Oh, Ayal Zaks, and David I. August. Fast condensation of the program dependence graph. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 39–50, New York, NY, USA, 2013. ACM.

[71] Ken Kennedy and Kathryn McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In LCPC, volume 768 of Lecture Notes in Computer Science, pages 301–320. 1994.

[72] Nimrod Megiddo and Vivek Sarkar. Optimal weighted loop fusion for parallel programs. In SPAA, pages 282–291. ACM, 1997.

[73] S. K. Singhai and K. S. McKinley. A parametrized loop fusion algorithm for improving parallelism and cache locality. The Computer Journal, 40(6):340–355, 1997.

[74] William Thies, Frédéric Vivien, Jeffrey Sheldon, and Saman Amarasinghe. A unified framework for schedule and storage optimization. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI ’01, pages 232–242, New York, NY, USA, 2001. ACM.

[75] David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation for subscripted variables. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI ’90, pages 53–65, New York, NY, USA, 1990. ACM.

[76] Vincent Lefebvre and Paul Feautrier. Automatic storage management for parallel programs. Parallel Computing, 24(3-4):649–671, 1998.

[77] Fabien Quilleré and Sanjay Rajopadhye. Optimizing memory usage in the polyhedral model. ACM Transactions on Programming Languages and Systems (TOPLAS), 22(5):773–815, 2000.

[78] Alain Darte and Guillaume Huard. New complexity results on array contraction and related problems. Journal of VLSI signal processing systems for signal, image and video technology, 40(1):35–55, 2005.

[79] Albert Cohen. Parallelization via constrained storage mapping optimization. In High Performance Computing, pages 83–94. Springer, 1999.

[80] Nicolas Vasilache, Cédric Bastoul, Albert Cohen, and Sylvain Girbal. Violated dependence analysis. In Proceedings of the 20th annual international conference on Supercomputing, pages 335–344. ACM, 2006.

[81] Sebastian Hack, Daniel Grund, and Gerhard Goos. Register allocation for programs in SSA-form. In Compiler Construction, pages 247–262. Springer, 2006.

[82] Dror E Maydan, Saman P Amarasinghe, and Monica S Lam. Array data-flow analysis and its use in array privatization. In Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 2–15. ACM, 1993.

[83] Riyadh Baghdadi, Albert Cohen, Sven Verdoolaege, and Konrad Trifunović. Improved loop tiling based on the removal of spurious false dependences. ACM Trans. Archit. Code Optim., 9(4):52:1–52:26, January 2013.

[84] Konrad Trifunović, Albert Cohen, Razya Ladelsky, and Feng Li. Elimination of memory-based dependences for loop-nest optimization and parallelization. In 3rd GCC Research Opportunities Workshop, GROW, 2011.

[85] Nicolas Vasilache, Albert Cohen, and Louis-Noël Pouchet. Automatic correction of loop transformations. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 292–304. IEEE Computer Society, 2007.

[86] Nicolas Vasilache, Benoit Meister, Albert Hartono, Muthu Baskaran, David Wohlford, and Richard Lethin. Trading off memory for parallelism quality. In International Workshop on Polyhedral Compilation Techniques, IMPACT, 2012.

[87] PolyOpt: a polyhedral optimizer for the ROSE compiler. Available at http://www.cse.ohio-state.edu/~pouchet/software/polyopt/.

[88] ROSE compiler. Available at http://rosecompiler.org/.

[89] Sanyam Mehta, Pei-Hung Lin, and Pen-Chung Yew. Revisiting loop fusion in the polyhedral framework. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 233–246, New York, NY, USA, 2014. ACM.

[90] Zhiyuan Li. Array privatization for parallel execution of loops. In Proceedings of the 6th international conference on Supercomputing, pages 313–322. ACM, 1992.

[91] Peng Tu and David Padua. Automatic array privatization. In Languages and Compilers for Parallel Computing, pages 500–521. Springer, 1994.

[92] Paul Feautrier. Array expansion. In Proceedings of the 2nd international conference on Supercomputing, pages 429–441. ACM, 1988.

[93] Ken Kennedy and Kathryn McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In LCPC, volume 768 of Lecture Notes in Computer Science, pages 301–320. 1994.

[94] S. K. Singhai and K. S. McKinley. A parametrized loop fusion algorithm for improving parallelism and cache locality. The Computer Journal, 40(6):340–355, 1997.

[95] Nimrod Megiddo and Vivek Sarkar. Optimal weighted loop fusion for parallel programs. In SPAA ’97, pages 282–291. ACM.

[96] Chen Ding and Ken Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. JPDC, 64(1):108–134, 2004.

[97] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. Loop transformations: convexity, pruning and optimization. In POPL ’2011, pages 549–562. ACM.

[98] Pluto: An automatic parallelizer and locality optimizer for multicores. Available at http://pluto-compiler.sourceforge.net.

[99] PoCC: Polyhedral compiler collection. Available at http://www.cse.ohio-state.edu/~pouchet/software/pocc/.

[100] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly: Polyhedral optimization in LLVM. In IMPACT ’2011.

[101] IBM XL compiler. Available at http://ibm.com/software/awdtools/xlcpp/.

[102] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and Nicolas Vasilache. Iterative optimization in the polyhedral model: Part I, one-dimensional time. In CGO ’07, pages 144–156.

[103] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In PLDI ’08, pages 90–100. ACM.

[104] LLVM. Polly: Polyhedral optimizations for LLVM. Available at http://polly.llvm.org/.

[105] Alain Darte. On the complexity of loop fusion. In PACT ’99, Washington, DC, USA. IEEE Computer Society.

[106] Micha Sharir. A strong-connectivity algorithm and its applications in data flow analysis. Computers & Mathematics with Applications, 7(1):67–72, 1981.

[107] Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In CC’2008, pages 132–146. Springer Berlin / Heidelberg.

[108] Uday Bondhugula. Effective Automatic Parallelization and Locality Optimization Using The Polyhedral Model. PhD thesis, Ohio State University, 2010.

[109] R-stream compiler. Available at https://www.reservoir.com/?q=rstream.

[110] Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul. The polyhedral model is more widely applicable than you think. In CC ’2010, pages 283–303, Paphos, Cyprus, March 2010. Springer Verlag.

[111] Paul Feautrier. Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time. IJPP, pages 313–347, 1992.

[112] Paul Feautrier. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. IJPP, pages 389–420, 1992.

[113] Wayne Kelly and William Pugh. Minimizing communication while preserving parallelism. In ICS ’1995, pages 52–60. ACM Press.

[114] Martin Griebl. Automatic Parallelization of Loop Programs for Distributed Memory Architectures. PhD thesis, University of Passau, 2004.

[115] Amy W. Lim and Monica S. Lam. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing, 24:445–475, 1998.

[116] Amy W Lim, Gerald I Cheong, and Monica S Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In ICS ’99, pages 228–237. ACM.

[117] Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. A model for fusion and code motion in an automatic parallelizing compiler. In PACT ’2010, pages 343–352. ACM.

[118] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. IJPP, 34:261–317, 2006.

[119] Albert Cohen, Marc Sigler, Sylvain Girbal, Olivier Temam, David Parello, and Nicolas Vasilache. Facilitating the search for compositions of program transformations. In ICS ’05, pages 151–160. ACM.

[120] Arvind J. Thadhani. Factors affecting programmer productivity during application development. IBM Systems Journal, 23(1):19–35, 1984.

[121] Jeanne Ferrante, Karl J Ottenstein, and Joe D Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems (TOPLAS), 9(3):319–349, 1987.

[122] Ramakrishna Upadrasta and Albert Cohen. Sub-polyhedral scheduling using (unit-)two-variable-per-inequality polyhedra. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’13, pages 483–496, New York, NY, USA, 2013. ACM.

[123] George B Dantzig and B Curtis Eaves. Fourier-Motzkin elimination and its dual. Journal of Combinatorial Theory, Series A, 14(3):288–297, 1973.

[124] Sven Verdoolaege. isl: An integer set library for the polyhedral model. In Mathematical Software, ICMS 2010, volume 6327 of Lecture Notes in Computer Science, pages 299–302. Springer Berlin Heidelberg, 2010.

[125] Utpal K. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, USA, 1988.

[126] Michael Wolfe and Utpal Banerjee. Data dependence and its application to parallel processing. International Journal of Parallel Programming, 16(2):137–178, 1987.

[127] Michael Wolfe and C-W Tseng. The power test for data dependence. IEEE Transactions on Parallel and Distributed Systems, 3(5):591–601, 1992.

[128] William Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pages 4–13. ACM, 1991.

[129] Paul Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22, 1988.

[130] Randy Allen and Ken Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9:491–542, 1987.

[131] Leslie Lamport. The parallel execution of DO loops. Commun. ACM, 17(2):83–93, February 1974.

[132] Michael Joseph Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.

[133] Dror E. Maydan, John L. Hennessy, and Monica S. Lam. Efficient and exact data dependence analysis. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI ’91, pages 1–14, New York, NY, USA, 1991. ACM.

[134] Paul Feautrier. Scalable and structured scheduling. International Journal of Parallel Programming, 34(5):459–487, 2006.