
MACROSCOPIC ANALYSIS AND OPTIMIZATION

BY

CHRIS LATTNER

B.S., University of Portland, 2000
M.S., University of Illinois at Urbana-Champaign, 2002

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2005

Urbana, Illinois

Abstract

Providing high performance for pointer-intensive programs on modern architectures is an increasingly difficult problem for compilers. Pointer-intensive programs are often bound by memory latency and cache performance, but traditional approaches to these problems usually fail: Pointer-intensive programs are often highly irregular and the compiler has little control over the layout of heap-allocated objects.

This thesis presents a new class of techniques named “Macroscopic Data Structure Analyses and Optimizations”, which is a new approach to the problem of analyzing and optimizing pointer-intensive programs. Instead of analyzing individual load/store operations or structure definitions, this approach identifies, analyzes, and transforms entire memory structures as a unit. The foundation of the approach is an analysis named Data Structure Analysis and a transformation named Automatic Pool Allocation. Data Structure Analysis is a context-sensitive pointer analysis which identifies data structures on the heap and their important properties (such as type safety). Automatic Pool Allocation uses the results of Data Structure Analysis to segregate dynamically allocated objects on the heap, giving control over the layout of the data structure in memory to the compiler.

Based on these two foundation techniques, this thesis describes several performance-improving optimizations for pointer-intensive programs. First, Automatic Pool Allocation itself provides important locality improvements for the program. Once the program is pool allocated, several pool-specific optimizations can be performed to reduce inter-object padding and pool overhead. Second, we describe an aggressive technique, Automatic Pointer Compression, which reduces the size of pointers on 64-bit targets to 32 bits or less, increasing effective cache capacity and memory bandwidth for pointer-intensive programs.

This thesis describes the approach, analysis, and transformation of programs with macroscopic techniques, and evaluates the net performance impact of the transformations. Finally, it describes a large class of potential applications for the work in fields such as heap safety and reliability, program understanding, and static garbage collection.

To Tanya, for her unwavering love and support.

Acknowledgments

This thesis would not be possible without the support of many people who have helped me in ways both large and small.

In particular, I would like to thank my advisor, Vikram Adve, for his support, patience, and especially his trust and respect. He taught me how to communicate ideas more effectively, gave me the freedom to investigate the (broad and sometimes strange) areas that interest me, and contributed many ideas to this work. Few other advisors would allow a motivated student to go off and build an entire compiler to support their research.

My wife Tanya is an unshakable source of support, understanding, and love. She helped me make it through the occasionally grueling all-nighters and other challenging parts of these last five years, selflessly supporting me even when under pressures from her own research work and job. In addition to supporting my research, she continues to enrich my life as a whole.

I have deeply enjoyed my interactions and friendships with the members of the LLVM research group as well as the open-source community we have built around LLVM. Both have provided important insights, hard problems, and a desire to make LLVM as stable, robust, and extensible as possible. Special thanks go to Misha Brukman for reading (and rereading) many of my papers with his particularly critical eye for incorrect-hyphenation and mispellings.

I would like to thank the UIUC Classical Fencing Club as a whole, and John Mainzer and Luda Yafremava in particular, for absorbing many of the frustrations and craziness accumulated over the course of this work. They provided an important outlet and taught me physical awareness, flexibility, and dexterity that I did not think was possible. They are also responsible for keeping random workplace violence to a tolerable level, for which my colleagues are undoubtedly thankful!

Finally, I would like to thank Steven Vegdahl, who encouraged me to pursue graduate studies and whose infectious love of compilers started me on this path in the first place.

Table of Contents

Chapter 1 Introduction
  1.1 Foundations of the Macroscopic Approach
    1.1.1 Data Structure Analysis
    1.1.2 Automatic Pool Allocation
  1.2 Applications of Macroscopic Techniques
    1.2.1 Simple Pool Allocation Optimizations
    1.2.2 Transparent Pointer Compression
    1.2.3 Other Macroscopic Techniques
  1.3 Research Contributions of this Thesis
  1.4 Thesis Organization

Chapter 2 The LLVM Compiler Infrastructure
  2.1 Introduction
  2.2 Program Representation
    2.2.1 Overview of the LLVM Instruction Set
    2.2.2 Language-Independent Type Information, Cast, and GetElementPtr
    2.2.3 Explicit Memory Allocation and Unified Memory Model
    2.2.4 Function Calls and Exception Handling
    2.2.5 Plain-text, Binary, and In-memory Representations
  2.3 The LLVM Compiler Architecture
    2.3.1 High-Level Design of the LLVM Compiler Framework
    2.3.2 Compile-Time: External Front-end and Static Optimizer
    2.3.3 Linker & Interprocedural Optimizer
    2.3.4 Offline or JIT Native Code Generation
    2.3.5 Runtime Path Profiling & Reoptimization
    2.3.6 Offline Reoptimization with End-user Profile Information
  2.4 Applications and Experiences
    2.4.1 Representation Issues
    2.4.2 Example Applications of LLVM
  2.5 Related Work
  2.6 Conclusion

Chapter 3 Data Structure Analysis
  3.1 The Data Structure Graph
    3.1.1 Graph Nodes and Fields
  3.2 Construction Algorithm
    3.2.1 Primitive Graph Operations
    3.2.2 Local Analysis Phase
    3.2.3 Bottom-Up Analysis Phase
    3.2.4 Top-Down Analysis Phase
    3.2.5 Complexity Analysis
    3.2.6 Bounding Graph Size
  3.3 Engineering an Efficient Pointer Analysis
    3.3.1 The Globals Graph
    3.3.2 Efficient Graph Inlining
    3.3.3 Partitioning EV for Efficient Global Variable Iteration
    3.3.4 Shrinking EV with Global Value Equivalence Classes
    3.3.5 Avoiding N^2 Inlining for Function Pointers
    3.3.6 Merge Call Nodes for External Functions
    3.3.7 Direct Call Nodes
  3.4 Experimental Results
    3.4.1 Benchmark Suite and Simple Measurements
    3.4.2 Analysis Time & Memory Consumption
    3.4.3 Inferred Type Information
  3.5 Related Work
    3.5.1 Shape Analyses
    3.5.2 Cloning-based Context-Sensitive Analyses
    3.5.3 Non-cloning Context Sensitive Analyses
  3.6 Data Structure Analysis: Summary of Contributions

Chapter 4 Using Data Structure Analysis for Alias and IP Mod/Ref Analysis
  4.1 Alias and Mod/Ref Information
    4.1.1 Alias Analysis Assumptions and Applications
    4.1.2 Mod/Ref Analysis Assumptions and Applications
  4.2 Implementing Alias and Mod/Ref Analysis with DSA: ds-aa
    4.2.1 Computing Alias Analysis Responses
    4.2.2 Computing Mod/Ref Responses
  4.3 Alias Analysis Implementations for Comparison
    4.3.1 local Alias Analysis
    4.3.2 steens-fi Alias Analysis
    4.3.3 steens-fs Alias Analysis
    4.3.4 anders Alias Analysis
  4.4 Analysis Precision with a Synthetic Client
    4.4.1 Alias Precision
    4.4.2 Mod/Ref Precision
  4.5 Analysis Precision with Scalar Loop Optimizations
    4.5.1 Number of Transformations Performed
    4.5.2 Alias and Mod/Ref Queries
  4.6 Observations and Conclusions

Chapter 5 Automatic Pool Allocation
  5.1 The Transformation Overview and Example
    5.1.1 Pool Allocator Runtime
    5.1.2 Overview Using an Example
  5.2 The Core Pool Allocation Transformation
    5.2.1 Analysis: Finding Pool Descriptors for each H Node
    5.2.2 The Simple Transformation (No Indirect Calls)
    5.2.3 Passing Descriptors for Indirect Function Calls
  5.3 Algorithmic Complexity
  5.4 Simple Pool Allocation Refinements
    5.4.1 Argument Passing for Global Pools
    5.4.2 poolcreate/pooldestroy Placement
  5.5 Experimental Results
    5.5.1 Methodology and Benchmarks
    5.5.2 Pool Allocation Statistics
    5.5.3 Pool Allocation Compile Time
  5.6 Related Work
  5.7 Research Contributions of Automatic Pool Allocation

Chapter 6 Optimizing Pool Allocated Code
  6.1 Pool Optimizations
    6.1.1 Avoiding Pool Allocation for Singleton Objects: SelectivePA
    6.1.2 poolfree Elimination: PoolFreeElim
    6.1.3 Avoid Object Header Overhead: Bump-Pointer
    6.1.4 Avoiding Alignment Padding: AlignOpt
    6.1.5 Tail Padding Optimization
  6.2 Collocation of DS Nodes into Shared Pools
    6.2.1 Algorithm Extensions to Support Collocation
    6.2.2 Node Collocation Heuristics
    6.2.3 Experiences with Node Collocation
  6.3 Pool Allocation and Optimization Performance Results
    6.3.1 Implementation and Evaluation Framework
    6.3.2 Number of Pool Optimization Opportunities
    6.3.3 Performance Baseline, Allocator Influence, and Overhead
    6.3.4 Aggregate Performance Effect of Pool Allocation & Optimizations
    6.3.5 Performance Contribution of Individual Pool Optimizations
    6.3.6 Cache and TLB Impact of Pool Allocation
    6.3.7 Access Pattern and Locality Changes
  6.4 Research Contributions of Pool Allocation Optimizations

Chapter 7 Transparent Pointer Compression
  7.1 Static Pointer Compression
    7.1.1 Pointer Compression Runtime Library
    7.1.2 Intraprocedural Pointer Compression
    7.1.3 Interprocedural Pointer Compression
    7.1.4 Minimizing Pool Size Violations with Static Compression
  7.2 Dynamic Pointer Compression
    7.2.1 Intraprocedural Dynamic Compression
    7.2.2 Dynamic Compression Runtime Library
    7.2.3 Interprocedural Dynamic Compression
  7.3 Optimizing Pointer Compressed Code
    7.3.1 Address Space Reservation
    7.3.2 Reducing Redundant PoolBase Loads
    7.3.3 Reducing Dynamic isComp Comparisons
    7.3.4 Structure Field Reordering for Pointers
    7.3.5 Adding Hardware Support
  7.4 Experimental Results
    7.4.1 Performance Results
    7.4.2 Architecture Specific Impact of Pointer Compression
  7.5 Related Work
  7.6 Pointer Compression Summary

Chapter 8 Speculative Applications of Macroscopic Techniques
  8.1 Non-Performance Applications of Macroscopic Techniques
    8.1.1 Heap Safety for Languages with Explicit Deallocation
    8.1.2 Connectivity-Based Garbage Collection
    8.1.3 Data Marshalling for Pointer-Based Data Structures
  8.2 Program Performance-Related Macroscopic Applications
    8.2.1 Automatic Instance Interleaving
    8.2.2 Automatic use of Superpages for Improved TLB Effectiveness
    8.2.3 New Approaches for Prefetching
    8.2.4 Data Structure Traversal-Order Node
    8.2.5 Identification of Coarse-Grain Parallel Work
  8.3 Summary

Chapter 9 Conclusion

References

Author’s Biography

List of Figures

2.1 C code for complex memory addressing
2.2 LLVM code for complex memory addressing
2.3 C++ exception handling example
2.4 LLVM code for the C++ example. The handler code specified by invoke executes the destructor.
2.5 LLVM uses a runtime library for C++ exceptions support but exposes control-flow.
2.6 LLVM system architecture diagram
2.7 Executable sizes for LLVM, X86, Sparc (in KB)
2.8 Interprocedural optimization timings (in seconds)

3.1 C code for running example
3.2 Graph Notation
3.3 Local DSGraphs for do all and addG
3.4 Primitive operations used in the algorithm
3.5 The LocalAnalysis function
3.6 makeNode and updateType operations
3.7 Construction of the BU DS graph for addGToList
3.8 Bottom-Up Closure Algorithm
3.9 Handling recursion due to an indirect call in the Bottom-Up phase
3.10 Finished BU graph for main
3.11 C Source, DSGraph, and LLVM code for Global Value Equivalence Class Example
3.12 Benchmark Suite and Basic DSA Measurements
3.13 Scaling of Analysis Time with Program Size (Number of Memory Operations)
3.14 Scaling of Analysis Space with Program Size (Number of Memory Operations)
3.15 DSA Analysis Time and Space Consumption Data
3.16 Number of Load & Store instructions which access non-collapsed, complete, DS Nodes

4.1 Results of Example Pointer Analysis Clients
4.2 Example clients of mod/ref results
4.3 Percent of AA-EVAL Alias Queries Returned “May Alias”
4.4 AA-EVAL Mod/Ref Query Responses of “May Mod or Ref”
4.5 AA-EVAL Mod/Ref Query Responses of “No Mod or Ref”
4.6 AA-EVAL Mod/Ref Query Responses of “May Only Ref”
4.7 AA-EVAL Mod/Ref Query Responses of “May Mod Only”
4.8 AA-EVAL Mod/Ref Query Responses for ds-aa
4.9 Scalar Transformations
4.10 Number of Memory Locations Promoted To Registers
4.11 Number of Loads Hoisted or Sunk
4.12 Number of Instructions Hoisted or Sunk
4.13 Percent of LICM Alias Queries Returned “May Alias”
4.14 Percent of LICM Mod/Ref Query Responses Returned “Mod and Ref”
4.15 DSA LICM Mod/Ref Query Responses Breakdown

5.1 Interface to the Pool Allocator Runtime Library
5.2 Example illustrating the Pool Allocation Transformation
5.3 BU DSGraphs for functions in Figure 5.2 (a)
5.4 Pseudo code for basic algorithm
5.5 Pool Allocation Example with Function Pointers
5.6 Pseudo code for complete pool allocator
5.7 After moving pooldestroy(&PD1) earlier

6.1 Figure 5.2 after eliminating poolfree calls
6.2 After eliminating poolfree calls and dead loops
6.3 Standard Pool of 16-byte Objects with Default 8-Byte Alignment
6.4 Bump-Pointer Pool of 16-byte Objects with Default 8-Byte Alignment
6.5 Normal Pool of 16-byte Objects with Reduced 4-Byte Alignment
6.6 Example Structure with Tail Padding
6.7 Linked List of Doubles without Node Collocation
6.8 Linked List of Doubles with Perfect Node Collocation
6.9 Statistics for Pool Optimizations
6.10 Aggregate execution time ratios (Left 1.0 = NoPA, Right 1.0 = BasePA)
6.11 Pool Optimization Contributions (1.0 = No Pool Allocation)
6.12 Pool Optimization Contributions (1.0 = PA with all PoolOpts)
6.13 L1/L2/TLB Cache Miss Ratios
6.14 chomp Access Pattern with Standard malloc/free
6.15 chomp Access Pattern with Pool Allocation
6.16 ft Access Pattern with Standard malloc/free
6.17 ft Access Pattern with Pool Allocation

7.1 Linked List of 4-byte characters
7.2 Pool Allocated Linked List
7.3 Pointer Compressed Linked List
7.4 Simple linked list example
7.5 Example after static compression
7.6 Pool Compression Runtime Library
7.7 Pseudo code for pointer compression
7.8 Example with TH and non-TH nodes
7.9 Rewrite rules for pointer compression
7.10 Interprocedural rewrite rules
7.11 Example after dynamic compression
7.12 Dynamic pointer compression rules
7.13 Dynamic expansion example
7.14 Rewrite rules for non-compressed pools
7.15 MakeList pc32 after optimization
7.16 Pointer Compression Benchmark Results
7.17 llubenchmark: time to one node vs problem size

8.1 Linked-list pointer-chasing example

Chapter 1

Introduction

Memory system performance is one of the main factors limiting application performance on modern architectures. Today, modern compilers are able to aggressively analyze and optimize programs that use dense arrays, sometimes providing multiple factors of performance improvement. Pointer-based recursive data structures, on the other hand, have proven to be much more difficult – both to analyze and optimize – and thus compilers have had much less success improving the performance of these programs.

The primary existing approaches for analyzing and transforming pointer-intensive programs fall into two broad categories: scalar optimizations (such as register promotion [37], elimination, greedy prefetching [94], etc.), and structure transformations (structure field reordering, structure splitting [28], jump-pointer prefetching [112, 94], etc.). These approaches are attractive because they only require local changes to the program: a few instructions in the first case and a single structure definition in the second.

Unfortunately, this is precisely the property that limits the potential gain of these approaches.

Scalar optimizations are inherently limited to making local performance improvements because they only modify a few instructions at a time (for example, a particular load or store). The second approach is more aggressive and more promising, but generally requires the program to be written in a type-safe source-language and is limited to transforming all instances of a particular type or none of them (for example, a field that is unused in one instance of a data structure, but not another, cannot be removed from either).

Most importantly, neither of these approaches is able to attack the root cause of the problem: the compiler cannot analyze or control the layout of objects on the heap. In particular, the reason that recursive data structures exhibit poor locality is that their nodes are often distributed throughout the heap with little correlation between the layout of the nodes and the access/traversal pattern of the program. Because the access patterns of these recursive data structures are not directly connected to the layout of the objects on the heap, standard techniques for improving the cache performance of dense arrays cannot be applied to nodes in a recursive data structure.

Aggressively optimizing programs that heavily use recursive data structures is inherently difficult for several reasons. First, interprocedural analysis is required for any real-world program: recursive data structures are often created, traversed, and destroyed with recursive functions, are often passed throughout the program, and are often used to build larger aggregate structures (e.g. a list of lists). Second, extremely aggressive forms of interprocedural analysis are required: modern design techniques encourage the use of modular and reusable data structure libraries, and these libraries may be used in different ways in different portions of the program. Ideally, we would like to be able to optimize individual instances of a particular data structure, even if all of the instances of that type are processed and created with common functions (traditional scalable points-to analyses are insufficient for these programs). Third, compilers for statically compiled languages generally do not have control over the runtime, greatly limiting the information and control they have over the runtime layout of a data structure. Finally, compilers designed to optimize unsafe languages (like C or C++) must correctly handle programs that cast pointers or rely on the precise layout of data in memory (e.g., programs that copy structures to disk or across a network).

Throughout this work, we use the term “data structure” to mean an instance of a heap-allocated recursive data structure potentially formed with multiple node types (e.g. a graph with ‘edge’ and ‘node’ objects). This work is not concerned with classification of a data structure instance as some high-level conceptual type (e.g. a binary tree or a linked list); instead, we focus on the properties that are independent of the high-level conceptual type (e.g. node layout properties).

Prior to our work, shape analysis was the only extant approach for performing macroscopic analyses of programs that use data structures. Shape analysis is able to provide strong classification of data structures in the program as various high-level types, such as a singly- or doubly-linked list, a binary tree, etc. Unfortunately, shape analyses cannot handle non-type-safe programs, provide no mechanism for controlling the layout of heap-based data structures, and are extremely expensive [77, 144, 65].

This thesis describes a macroscopic approach for analyzing and transforming heap-allocated data structures which addresses the deficiencies of previous approaches by using aggressive (but practical) static analysis and transformation techniques to identify and control recursive data structures. Macroscopic techniques aim to:

• ... analyze and transform an entire data structure as a unit and each distinct instance of a data structure independently.

• ... give partial control over the heap layout of data structure instances to the compiler, allowing it to reason about and optimize some important layout properties.

• ... tune individual instances of data structures to the clients that use them. Reusable data structure libraries are very common, but different clients have different usage behaviors, and each application may contain many distinct clients of the library.

• ... provide a framework for existing approaches that use mod/ref or alias analysis, and support techniques that are traditionally implemented by changing structure type layout for entire programs (e.g., structure reordering/fission [28], instance interleaving [136], jump-pointer prefetching [112], etc), sometimes making them more powerful in the process.

• ... be suitable for inclusion in a commercial compiler. Our implementations of these techniques are scalable to large programs, work with incomplete programs, are safe in the presence of exception handling and setjmp/longjmp calls, correctly determine whether a data structure is accessed in a type-safe way, etc.

By achieving these goals, we show that macroscopic techniques can dramatically improve the performance of heap-intensive programs with purely automatic techniques applied at program link-time. This work is implemented in the context of the LLVM Compiler Infrastructure, which was built to support the aggressive link-time analysis and optimization required by this work. LLVM is described in Chapter 2 to provide the context for this work, portions of which were published in [87, 3, 88].

1.1 Foundations of the Macroscopic Approach

Our implementation of macroscopic algorithms is built on a foundation consisting of two techniques: Data Structure Analysis and Automatic Pool Allocation. Based on this analysis and transformation, many new techniques are feasible.

1.1.1 Data Structure Analysis

Data Structure Analysis (DSA) is a context- and field-sensitive pointer analysis. It is used to identify the connectivity of memory objects in a program, identify instances of data structures, and capture important properties of these structures (such as whether accesses to the objects are type-safe).

DSA is an aggressive interprocedural analysis which uses full acyclic call paths to name heap and stack objects (it is “fully” context sensitive), allowing it to identify disjoint instances of data structures, even if they are created and processed by common helper functions. DSA can support a superset of the clients supported by most flow-insensitive interprocedural alias, mod-ref, and call graph analyses, in addition to supporting the macroscopic analyses described throughout this thesis. DSA supports the full generality of C/C++ programs and provides conservatively correct analysis of incomplete programs and libraries.
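To make the role of context sensitivity concrete, consider the following C fragment (a hypothetical example, not drawn from our benchmark suite). Because DSA names heap objects by acyclic call paths, it can distinguish the list reachable from Prices from the list reachable from Counts, even though every node of both lists is allocated by the same helper function:

    #include <stdlib.h>

    struct Node { struct Node *Next; int Data; };

    /* Shared helper: every node of every list is allocated at this one call site. */
    static struct Node *Prepend(struct Node *Head, int Value) {
      struct Node *N = malloc(sizeof(struct Node));
      N->Next = Head;
      N->Data = Value;
      return N;
    }

    int main(void) {
      struct Node *Prices = 0, *Counts = 0;
      for (int i = 0; i != 100; ++i) {
        Prices = Prepend(Prices, i * 3);   /* one data structure instance  */
        Counts = Prepend(Counts, i);       /* a second, disjoint instance  */
      }
      /* A context-insensitive analysis merges both lists into a single heap
         object because they share the allocation site inside Prepend; DSA's
         use of call paths keeps the two instances separate. */
      return Prices->Data + Counts->Data;
    }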

The primary research contributions of DSA are:

(i) New techniques used to achieve its speed, scalability, and low memory consumption when analyzing large programs, despite its aggressive analysis. We show that DSA uses little memory and is fast and scalable in our experiments on programs spanning 4-5 orders of magnitude of code size (past 200,000 lines of code), never taking more than 3.2s on these codes. We describe why we believe that it will continue to scale well to larger programs in Sections 3.2.5 and 3.4.2. DSA is the first fully context-sensitive algorithm we are aware of that analyzes programs in a fraction of the time taken to compile the program with a standard optimizing compiler (GCC). Further, the fraction of compile time used by DSA is quite small: always less than 6% in our experiments.

(ii) Use of a novel extension to Tarjan’s Strongly Connected Component (SCC) finding algorithm that allows incremental discovery of SCCs in the call graph, even when edges are dynamically discovered and added to the call graph.

(iii) Use of a simple mechanism (fine-grain incompleteness tracking) to solve several hard problems in pointer analysis, including the use of speculative type information, dynamic discovery of the call graph without iteration, and conservatively correct handling of incomplete programs.

DSA is used by all macroscopic techniques and is described in detail in Chapter 3. Because all of the work in this thesis depends on DSA, we pay special attention to making sure that DSA is suitable for use in a commercial compiler, which includes being fast enough for plausible use, fully supporting incomplete programs, and supporting the full generality of C (setjmp/longjmp, function pointers, and non-type-safe pointer casts). We evaluate the precision of DSA when used as a standard pointer analysis in Chapter 4.

1.1.2 Automatic Pool Allocation

Automatic Pool Allocation transforms the data structures identified by DSA to segregate the memory for each data structure into a “pool” or “region” of memory. For example, if DSA identifies two disjoint linked lists as part of its execution, Automatic Pool Allocation will transform the program to create one region of memory for each list, and use that region to manage all of the memory allocated for each list. Automatic Pool Allocation ensures that the dynamic lifetime of the pool is a superset of the dynamic lifetime of the data structure being pool allocated.
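For the two-list example in Section 1.1.1, the transformed program looks roughly like the following sketch. The pool descriptor type and the poolcreate, poolalloc, and pooldestroy entry points are shown with simplified, hypothetical signatures (the actual runtime interface is given in Figure 5.1); the stand-in implementations exist only to make the sketch self-contained:

    #include <stdlib.h>

    /* Minimal stand-in for the pool runtime (hypothetical, simplified signatures;
       the real interface appears in Figure 5.1).  Here a pool is just a bump
       allocator over one malloc'd block that pooldestroy releases wholesale. */
    typedef struct {
      char *Base;
      unsigned Used;
      unsigned ObjSize;
    } PoolTy;

    static void poolcreate(PoolTy *PD, unsigned ObjSize) {
      PD->Base = malloc(ObjSize * 1024);  /* fixed capacity keeps the sketch short */
      PD->Used = 0;
      PD->ObjSize = ObjSize;
    }
    static void *poolalloc(PoolTy *PD) {
      void *Obj = PD->Base + PD->Used;
      PD->Used += PD->ObjSize;
      return Obj;
    }
    static void pooldestroy(PoolTy *PD) {
      free(PD->Base);
    }

    struct Node { struct Node *Next; int Data; };

    /* The helper now receives the pool descriptor of the list it extends. */
    static struct Node *Prepend(PoolTy *PD, struct Node *Head, int Value) {
      struct Node *N = poolalloc(PD);
      N->Next = Head;
      N->Data = Value;
      return N;
    }

    int ComputeSums(void) {
      PoolTy PricePool, CountPool;              /* one pool per data structure instance */
      poolcreate(&PricePool, sizeof(struct Node));
      poolcreate(&CountPool, sizeof(struct Node));

      struct Node *Prices = 0, *Counts = 0;
      for (int i = 0; i != 100; ++i) {
        Prices = Prepend(&PricePool, Prices, i * 3);
        Counts = Prepend(&CountPool, Counts, i);
      }
      int Sum = Prices->Data + Counts->Data;

      pooldestroy(&PricePool);                  /* all price nodes released at once */
      pooldestroy(&CountPool);
      return Sum;
    }

Each list now lives in its own contiguous region, which is what gives the compiler the layout control and lifetime information exploited by the later chapters.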

The primary motivation of the pool allocation transformation is to give partial control of the dynamic layout of a data structure to the compiler. While prior compiler transformations have provided limited control over layout of heap objects (garbage collector or allocation library heuristics [68, 64, 12, 39, 30, 119, 29], for example), none have been able to control the layout of a data structure at the granularity of individual instances of the data structure, and none have been able to support subsequent aggressive compiler transformations that optimize data structures on a per-instance basis (e.g., the simple ones in Chapter 6 or the aggressive one in Chapter 7). In particular, because all of the nodes of transformed data structures are managed by a compiler-controlled runtime library, compiler transformations can emit code that identifies and manipulates all of the allocated objects in the data structure at program runtime, which enables novel transformations (like the Pointer Compression transformation described below).

The primary research contributions of Automatic Pool Allocation are:

(i) We show that the algorithm succeeds in segregating recursive data structures on the heap, providing a substantial performance improvement for several programs. We experimentally find that the algorithm improves the performance of several programs by 10-20%, speeds up two by about 2x and two others by about 10x, and we explain the source of these improvements.

(ii) Unlike previous related approaches [134, 133, 4, 66, 31, 26], all of which require a type-safe input program, Automatic Pool Allocation supports the full generality of C and C++ programs (including indirect function calls, mutually recursive functions, variable argument functions, lack of type-safety, setjmp/longjmp, etc.).

(iii) Our algorithm is the first to perform region inference based on a scalable pointer analysis (DSA), which allows us to partition heap data by reachability. Work concurrent to ours [26] uses a somewhat similar approach, but uses a non-scalable analysis, does not handle global variables at all, requires type-safety, and has other limitations compared to our approach (as described in Section 5.6).

(iv) We present a simple strategy for correctly handling indirect function calls in arbitrary C programs without making the core transformation more complex.

(v) The algorithm computes static mapping information from pointers to the pools that they point into. We are the first to demonstrate that region inference and this mapping information can be used to support aggressive follow-on techniques like Transparent Pointer Compression.

In addition to the research contributions, we show that the analysis and transformation required to perform this optimization both require very little compile time or memory (less than 1.3s, including DSA time, for the programs we tested, which include codes up to 100,000 lines of code). To put this in perspective, this is at most 3% of the time required for GCC to compile the program (on the programs we tried) at its -O3 level of optimization. We feel that the amount of resources used is quite reasonable for an aggressive optimizing transformation targeting memory system performance.

The Automatic Pool Allocation algorithm is described in detail in Chapter 5. Portions of this work were published in [89].

1.2 Applications of Macroscopic Techniques

Building on the foundation of DSA and Automatic Pool Allocation, a wide range of new macroscopic techniques are possible. This thesis explores several macroscopic techniques which target improved performance, described briefly below.

1.2.1 Simple Pool Allocation Optimizations

The first and most straightforward application is a collection of simple improvements to the pool allocated code. Because the pool allocator has complete control over the pool runtime library, we can expose a richer interface to the compiler than what is provided by the standard C library malloc and free family of functions. In particular, if the compiler can prove that a pool of memory only contains nodes that require 4-byte alignment, it can lower the alignment requirement for the pool (which defaults to 8-byte alignment), potentially reducing inter-object padding. Likewise, if the compiler can prove that memory is never deallocated from a pool, it can inform the runtime that it does not need to keep track of any metadata for objects in the pool (reducing allocation time and eliminating a per-object header word).

The key contribution of pool allocation here is that it partitions distinct data structures in the heap, so that these decisions can be made on a per-data-structure basis. For example, this allows some data structures in the program to be fully aligned where necessary, and others to use less alignment when possible. Without pool allocation, even with a mutable runtime library, these sorts of decisions would have to be made on a global (per-program) basis, which would rarely allow any improvement. Chapter 6 describes and evaluates these techniques in more detail, showing that simple optimizations like this can provide up to a 40% performance improvement over that already provided by pool allocation alone.
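As a rough back-of-the-envelope illustration of the padding effect (hypothetical numbers, not the measured results of Chapter 6), consider a pool whose objects are 12 bytes:

    #include <stdio.h>

    /* Back-of-the-envelope illustration of the alignment observation above,
       with hypothetical numbers rather than the measurements of Chapter 6. */
    static unsigned RoundUp(unsigned Size, unsigned Align) {
      return (Size + Align - 1) / Align * Align;
    }

    int main(void) {
      unsigned NodeSize = 12;                    /* e.g. { void *Next; int A, B; } on a 32-bit target */
      unsigned Default  = RoundUp(NodeSize, 8);  /* 16 bytes per object in the pool */
      unsigned Reduced  = RoundUp(NodeSize, 4);  /* 12 bytes per object in the pool */
      printf("8-byte alignment: %u bytes/object\n", Default);
      printf("4-byte alignment: %u bytes/object (%u%% denser)\n",
             Reduced, 100 * (Default - Reduced) / Default);
      return 0;
    }

For this (hypothetical) node size, lowering the pool alignment packs 25% more objects into each cache line and page, which is the effect the AlignOpt optimization of Chapter 6 exploits.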

1.2.2 Transparent Pointer Compression

64-bit systems are becoming both increasingly important and increasingly common. Unfortunately, use of 64-bit pointers can dramatically impact the performance of pointer-intensive programs, as they require twice as much memory, cache space, and memory bandwidth to process as 32-bit pointers. Transparent Pointer Compression offers two ways of combating this problem: static and dynamic pointer compression.

Static Pointer Compression automatically identifies and transforms instances of type-safe data structures, replacing pointers in the data structure with smaller integer offsets from the start of the pool they are located in. Because pool allocation divides a program up into pools, it allows recursive data structures to each grow to 2^32 bytes (and in some cases 2^32 nodes), without encountering a runtime error. However, the possibility of this runtime error is not acceptable for all domains.

Dynamic Pointer Compression solves this problem by speculatively compressing 64-bit pointers to 32-bit indices in type-safe data structures, while allowing them to grow back to full 64-bit indexes when needed (rewriting memory as needed). This allows the compiler to speculate that the data structures will be small without losing generality to programs with large data sizes.
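The following C sketch illustrates the layout change on a 64-bit target (a hand-written, hypothetical illustration; the actual transformation rewrites the program automatically and is described in Chapter 7):

    #include <stdint.h>

    /* Original node: 16 bytes on a typical 64-bit target
       (8-byte pointer + 4 bytes data + 4 bytes padding). */
    struct Node {
      struct Node *Next;
      int Data;
    };

    /* Statically compressed node: 8 bytes.  The pointer is replaced by a
       32-bit offset from the pool base; offset 0 is reserved as the null
       value in this sketch. */
    struct NodeC {
      uint32_t Next;
      int Data;
    };

    /* Dereferencing a compressed "pointer" indexes off the pool base. */
    static struct NodeC *Deref(char *PoolBase, uint32_t Offset) {
      return (struct NodeC *)(PoolBase + Offset);
    }

    int SumList(char *PoolBase, uint32_t Head) {
      int Sum = 0;
      for (uint32_t Cur = Head; Cur != 0; Cur = Deref(PoolBase, Cur)->Next)
        Sum += Deref(PoolBase, Cur)->Data;
      return Sum;
    }

Halving the node size doubles the number of list nodes that fit in each cache line, which is the source of the bandwidth and cache-capacity improvements discussed below.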

Chapter 7 shows that Static Pointer Compression can speed up pointer-intensive programs by 20%, and by up to 2x in extreme cases (over pool allocation), matching the performance of programs compiled to use native 32-bit pointers in many cases. In cases where use of 64-bit mode enables features that are not available in 32-bit mode (e.g. the AMD64 architecture), pointer compression can even beat native 32-bit performance.

Pointer Compression is due to be published in [90].

1.2.3 Other Macroscopic Techniques

Macroscopic techniques can also be used for a wide range of other non-performance-related purposes. In particular, they may be used to improve program checkpointing (only checkpointing data structures that have changed since the last checkpoint), partitioning memory for embedded systems and non-traditional processor architectures, connectivity-based garbage collection [74], program understanding, program visualization, and even data marshaling for remote procedure calls (passing pointer-based data structures by converting pointers into indexes). Finally, our group is investigating use of macroscopic techniques to provide program heap safety for languages with explicit deallocation [49], and to guarantee that the points-to and call graph computed for a program is sound. We discuss some of these techniques in more detail in Chapter 8.

1.3 Research Contributions of this Thesis

The high-level contribution of this work is a new approach for analysis and transformation of heap-intensive programs written in arbitrary source-languages. This approach has a broad range of potential applications, some of which we explore in detail, others we only discuss.

More specifically, the high-level contributions of this thesis are (see the individual chapters for more detail):

• Data Structure Analysis: An aggressive and scalable context- and field-sensitive heap analysis that safely supports the full generality of C programs (lack of type-safety, variable argument functions, setjmp/longjmp, incomplete programs, etc). In addition, DSA is at least an order of magnitude faster than previous fully context-sensitive algorithms [92, 140, 103], making it the first to take a small fraction of the time required to compile the program with a standard compiler.

• Automatic Pool Allocation: The first fully automatic compiler transformation to partition the data structures of a program on the heap while retaining enough information for subsequent compiler analysis and optimization. Pool allocation provides the compiler with information about and partial control over the layout of memory objects on a points-to graph node granularity, and can substantially improve the performance of recursive data structure intensive programs.

• Pointer Compression: The first transformation to selectively shrink pointers in selected recursive data structures from 64-bit to 32-bit, dramatically reducing the memory footprint and working set of pointer-intensive programs on 64-bit systems. We describe (but have not yet implemented) a fully general version of the transformation which can speculatively shrink pointers while retaining the ability to dynamically expand them at run-time if 64-bit generality is required.

• Pool Micro-optimizations: A collection of simple optimizations which can potentially be applied to many pool-based and region-based runtime libraries, aimed at improving cache density by reducing inter-node padding (memory allocation headers, alignment padding, etc) and using the semantics of the pool library to eliminate operations.

1.4 Thesis Organization

Chapter 2 describes background information about the LLVM Compiler Infrastructure, which was developed to support this work. Following that, Chapters 3 and 5 describe the two foundations of macroscopic techniques: Data Structure Analysis and Automatic Pool Allocation. Chapter 4 evaluates the precision of DSA for alias analysis applications and Chapter 6 describes the suite of simple optimizations used to improve the performance of pool allocated programs. Chapter 7 describes Transparent Pointer Compression, an aggressive macroscopic transformation. Following this, Chapter 8 describes macroscopic applications that are not a core part of this thesis: both those explored primarily by other people and those that are still speculative. Finally, Chapter 9 concludes the work.

Chapter 2

The LLVM Compiler Infrastructure

Macroscopic data structure analysis and optimization inherently requires aggressive interprocedural analysis and transformation to be effective. When this work started, no open-source compiler system was available that provided the capabilities needed. As such, we developed and built the LLVM Compiler Infrastructure (concurrently with the work in the rest of this thesis) to support aggressive interprocedural optimization, and to provide a novel program representation that reduces the difficulty of implementing these aggressive techniques.

This chapter describes some of the important details of the LLVM Compiler System (which is now used for far more than the macroscopic techniques in this thesis), including its type system and instruction representation.

2.1 Introduction

Modern applications are increasing in size, change their behavior significantly during execution, support dynamic extensions and upgrades, and often have components written in multiple different languages. While some applications have small hot spots, others spread their execution time evenly throughout the application [34]. In order to maximize the efficiency of all of these programs, we believe that program analysis and transformation must be performed throughout the lifetime of a program. Such “lifelong code optimization” techniques encompass interprocedural optimizations performed at link-time (to preserve the benefits of separate compilation), machine-dependent optimizations at install time on each system, dynamic optimization at runtime, and profile-guided optimization between runs (“idle time”) using profile information collected from the end-user.

Program optimization is not the only use for lifelong analysis and transformation. Other applications of static analysis are fundamentally interprocedural, and are therefore most convenient to perform at link-time (examples include macroscopic data structure optimization, static debugging, static leak detection [69], and many other transformations). Sophisticated analyses and transformations are being developed to enforce program safety, but must be done at software installation time or load-time [48]. Allowing lifelong reoptimization of the program gives architects the power to evolve processors and exposed interfaces in more flexible ways [27, 50], while allowing legacy applications to run well on new systems.

This chapter describes LLVM — Low-Level Virtual Machine — a compiler framework that aims to make lifelong program analysis and transformation available for arbitrary software, and in a manner that is transparent to programmers. LLVM achieves this through two parts: (a) a code representation with several novel features that serves as a common representation for analysis, transformation, and code distribution; and (b) a compiler design that exploits this representation to provide a combination of capabilities that is not available in any previous compilation approach we know of.

The LLVM code representation describes a program using an abstract RISC-like instruction set but with key higher-level information for effective analysis. This includes type information, explicit control flow graphs, and an explicit dataflow representation (using an infinite, typed register set in Static Single Assignment form [40]). There are several novel features in the LLVM code representation: (a) A low-level, language-independent type system that can be used to implement data types and operations from high-level languages, exposing their implementation behavior to all stages of optimization. This type system includes the type information used by sophisticated (but language-independent) techniques, such as algorithms for pointer analysis, dependence analysis, and data transformations. (b) Instructions for performing type conversions and low-level address arithmetic while preserving type information. (c) Two low-level exception-handling instructions for implementing language-specific exception semantics, while explicitly exposing exceptional control flow to the compiler.

The LLVM representation is source-language-independent, for two reasons. First, it uses a low-level instruction set and memory model that are only slightly richer than standard assembly languages, and the type system does not prevent representing code with little type information. Second, it does not impose any particular runtime requirements or semantics on programs. Nevertheless, it is important to note that LLVM is not intended to be a universal compiler IR. In particular, LLVM does not represent high-level language features directly (so it cannot be used for some language-dependent transformations), nor does it capture machine-dependent features or code sequences used by back-end code generators (it must be lowered to do so).

Because of the differing goals and representations, LLVM is complementary to high-level virtual machines (e.g., SmallTalk [47], Self [137], JVM [93], Microsoft’s CLI [95], and others), and not an alternative to these systems. It differs from these in three key ways. First, LLVM has no notion of high-level constructs such as classes, inheritance, or exception-handling semantics, even when compiling source languages with these features. Second, LLVM does not specify a runtime system or particular object model: it is low-level enough that the runtime system for a particular language can be implemented in LLVM itself. Indeed, LLVM can be used to implement high-level virtual machines. Third, LLVM does not guarantee type safety, memory safety, or language interoperability any more than the assembly language for a physical processor does.

The LLVM compiler framework exploits the code representation to provide a combination of five capabilities that we believe are important in order to support lifelong analysis and transformation for arbitrary programs. In general, these capabilities are quite difficult to obtain simultaneously, but the LLVM design does so inherently:

(1) Persistent program information: The compilation model preserves the LLVM representation throughout an application’s lifetime, allowing sophisticated optimizations to be performed at all stages, including runtime and idle time between runs.

(2) Offline code generation: Despite the last point, it is possible to compile programs into efficient native machine code offline, using expensive code generation techniques not suitable for runtime code generation. This is crucial for performance-critical programs.

(3) User-based profiling and optimization: The LLVM framework gathers profiling information at run-time in the field so that it is representative of actual users, and can apply it for profile-guided transformations both at run-time and in idle time¹.

¹ An idle-time optimizer has not yet been implemented in LLVM.

(4) Transparent runtime model: The system does not specify any particular object model, exception semantics, or runtime environment, thus allowing any language (or combination of languages) to be compiled using it.

(5) Uniform, whole-program compilation: Language-independence makes it possible to optimize and compile all code comprising an application in a uniform manner (after linking), including language-specific runtime libraries and system libraries.

We believe that no previous system provides all five of these properties. Source-level compilers provide #2 and #4, but do not attempt to provide #1, #3 or #5. Link-time interprocedural optimizers [54, 9, 76], common in commercial compilers, provide the additional capability of #1 and #5 but only up to link-time. Profile-guided optimizers for static languages provide benefit #2 at the cost of transparency, and most crucially do not provide #3. High-level virtual machines such as JVM or CLI provide #3 and partially provide #1 and #5, but do not aim to provide #4, and either do not provide #2 at all or without #1 or #3. Binary runtime optimization systems provide #2, #4 and #5, but provide #3 only at runtime and to a limited extent, and most importantly do not provide #1. We explain these in more detail in Section 2.3.

We evaluate the effectiveness of the LLVM system with respect to three issues: (a) the size and effectiveness of the representation, including the ability to extract useful type information for C programs; (b) the compiler performance (not the performance of generated code which depends on the particular code generator or optimization sequences used); and (c) examples illustrating the key capabilities LLVM provides for several challenging compiler problems.

Our experimental results show that the LLVM compiler (using Data Structure Analysis from Chapter 3) can extract reliable type information for an average of 68% of the static memory access instructions across a range of SPECINT 2000 C benchmarks, and for virtually all the accesses in more disciplined programs. We also argue, based on our experience, that the type information captured by LLVM is enough to enable aggressive transformations that would traditionally be attempted only on type-safe languages in source-level compilers, through the use of macroscopic techniques. Code size measurements show that the LLVM representation is comparable in size to X86 machine code (a CISC architecture) and roughly 25% smaller than RISC code on average, despite capturing much richer type information as well as an infinite register set in SSA form.

14 Finally, we present example timings showing that the LLVM representation supports extremely fast interprocedural optimizations.

The primary languages supported by LLVM are C and C++, which are traditionally compiled entirely statically. LLVM also supports several other languages to various extents, and a Java front-end is currently being developed. LLVM is freely available under a non-restrictive license from the LLVM home-page (http://llvm.cs.uiuc.edu/), and has been used for many commercial and academic projects to date.

The rest of this chapter is organized as follows. Section 2.2 describes the LLVM code representation. Section 2.3 then describes the design of the LLVM compiler framework. Section 2.4 discusses our evaluation of the LLVM system as described above. Section 2.5 compares LLVM with related previous systems. Section 2.6 concludes with a summary of the chapter.

2.2 Program Representation

The code representation used by LLVM is one of the key factors that differentiates it from other systems. The representation is designed to provide high-level information about programs that is needed to support sophisticated analyses and transformations (such as macroscopic techniques), while being low-level enough to represent arbitrary programs and to permit extensive optimization in static compilers. This section gives an overview of the LLVM instruction set and describes the language-independent type system, the memory model, exception handling mechanisms, and the offline and in-memory representations. The detailed syntax and semantics of the representation are defined in the LLVM reference manual [86].

2.2.1 Overview of the LLVM Instruction Set

The LLVM instruction set captures the key operations of ordinary processors but avoids machine-specific constraints such as physical registers, pipelines, and low-level calling conventions. LLVM provides an infinite set of typed virtual registers which can hold values of primitive types (Boolean, integer, floating point, and pointer). The virtual registers are in Static Single Assignment (SSA) form [40]. LLVM is a load/store architecture: programs transfer values between registers and memory solely via load and store operations using typed pointers. The LLVM memory model is described in Section 2.2.3.

The entire LLVM instruction set consists of only 31 opcodes. This is possible because, first, we avoid multiple opcodes for the same operations². Second, most opcodes in LLVM are overloaded (for example, the add instruction can operate on operands of any integer or floating point operand type). Most instructions, including all arithmetic and logical operations, are in three-address form: they take one or two operands and produce a single result.

² For example, there are no unary operators: not and neg are implemented in terms of xor and sub, respectively.

LLVM uses SSA form as its primary code representation, i.e., each virtual register is written in exactly one instruction, and each use of a register is dominated by its definition. Memory locations in LLVM are not in SSA form because many possible locations may be modified at a single store through a pointer, making it difficult to construct a reasonably compact, explicit SSA code representation for such locations. The LLVM instruction set includes an explicit phi instruction, which corresponds directly to the standard (non-gated) φ function of SSA form. SSA form provides a compact def-use graph that simplifies many dataflow optimizations and enables fast, flow-insensitive algorithms to achieve many of the benefits of flow-sensitive algorithms without expensive dataflow analysis. Non-loop transformations in SSA form are further simplified because they do not encounter anti- or output dependences on SSA registers. Non-memory transformations are also greatly simplified because (unrelated to SSA) registers cannot have aliases.
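For example, in the C function below two definitions of X reach the final use; in the LLVM form each assignment defines a distinct virtual register, and a phi instruction at the join selects between them according to the predecessor block control arrived from (the comments sketch the correspondence, they are not literal compiler output):

    /* Each assignment to X defines a separate SSA register in the LLVM form;
       a phi instruction at the join after the if/else merges the two values. */
    int select_max(int A, int B) {
      int X;
      if (A > B)
        X = A;    /* first SSA definition of X, in the "then" block  */
      else
        X = B;    /* second SSA definition of X, in the "else" block */
      return X;   /* the use is fed by a phi of the two definitions  */
    }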

LLVM also makes the Control Flow Graph (CFG) of every function explicit in the representation. A function is a set of basic blocks, and each basic block is a sequence of LLVM instructions, ending in exactly one terminator instruction (branches, return, unwind, or invoke; the latter two are explained below). Each terminator explicitly specifies its successor basic blocks.

2.2.2 Language-Independent Type Information, Cast, and GetElementPtr

One of the fundamental design features of LLVM is the inclusion of a language-independent type system. Every SSA register and explicit memory object has an associated type, and all operations obey strict type rules. This type information is used in conjunction with the instruction opcode to determine the exact semantics of an instruction (e.g. floating point vs. integer add). This type information enables a broad class of high-level transformations on low-level code (for example, see Section 2.4.1). In addition, type mismatches are useful for detecting optimizer bugs.

The LLVM type system includes source-language-independent primitive types with predefined sizes (void, bool, signed/unsigned integers from 8 to 64 bits, and single- and double-precision floating-point types). This makes it possible to write portable code using these types, though non-portable code can be expressed directly as well. LLVM also includes (only) five derived types: pointers, arrays, structures, functions, and SIMD vectors. We believe that most high-level language data types are eventually represented using some combination of these five types in terms of their operational behavior. For example, C++ classes with inheritance are implemented using structures, functions, and arrays of function pointers, as described in Section 7.

Equally important, the five derived types above capture the type information used even by sophisticated language-independent analyses and optimizations. For example, field-sensitive points-to analyses like Data Structure Analysis, call graph construction (including for object-oriented languages like C++), scalar promotion of aggregates, and structure field reordering transformations [28] only use pointers, structures, functions, and primitive data types, while array dependence analysis and loop transformations use all those plus array types.

Because LLVM is language independent and must support weakly-typed languages, declared type information in a legal LLVM program may not be reliable. Instead, some pointer analysis algorithm must be used to distinguish memory accesses for which the type of the pointer target is reliably known from those for which it is not. The most aggressive algorithm currently included with LLVM is Data Structure Analysis (described in Chapter 3). Our results show that despite allowing values to be arbitrarily cast to other types, reliable type information is available for a large fraction of memory accesses in C programs compiled to LLVM.

The LLVM ‘cast’ instruction is used to convert a value of one type to another arbitrary type, and is the only way to perform such conversions. Casts thus make all type conversions explicit, including type coercion (there are no mixed-type operations in LLVM), explicit casts for physical subtyping, and reinterpreting casts for non-type-safe code. A program without casts is necessarily type-safe (in the absence of memory access errors, e.g., array overflow [48]).
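For example, the following (hypothetical) C function reinterprets the bits of a float through a pointer cast; in LLVM the conversion becomes an explicit cast instruction, so analyses such as Data Structure Analysis can see exactly where the declared types stop being trustworthy:

    #include <stdint.h>

    /* The pointer cast below is a reinterpreting cast: the same memory is
       accessed both as a float and as a 32-bit integer.  Because the cast is
       explicit in the LLVM code, the accesses involved can be conservatively
       marked as non-type-safe rather than trusting the declared types. */
    uint32_t FloatBits(float F) {
      uint32_t *P = (uint32_t *)&F;
      return *P;
    }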

A critical difficulty in preserving type information for low-level code is implementing address arithmetic. The getelementptr instruction is used by the LLVM system to perform pointer arithmetic in a way that both preserves type information and has machine-independent semantics. Given a typed pointer to an object of some aggregate type, this instruction calculates the address of a sub-element of the object in a type-preserving manner (effectively a combined ‘.’ and ‘[ ]’ operator for LLVM).

For example, the C statement “X[i].a = 1;” could be translated into the pair of LLVM instructions:

    %p = getelementptr %xty* %X, int %i, ubyte 3
    store int 1, int* %p

where we assume a is field number 3 within the structure X[i], and the structure is of type %xty. Multiple structure and array index values can be specified in one getelementptr instruction to index into nested aggregate types.

    struct RT {                /* Structure with complex types */
      char A;
      int B[10][20];
      char C;
    };
    struct ST {                /* ST contains an instance of RT embedded in it */
      int X;
      double Y;
      struct RT Z;
    };

    int *foo(struct ST *s) {
      return &s[1].Z.B[5][13];
    }

Figure 2.1: C code for complex memory addressing

The example in Figure 2.1 is a C code fragment that defines two structure types and a function that performs complex indexing. Figure 2.2 shows the LLVM code generated by the C front-end, with commentary, illustrating the getelementptr instruction.

    ; LLVM type definitions
    %RT = type { sbyte, [10 x [20 x int]], sbyte }
    %ST = type { int, double, %RT }

; Function body...
%ST* %s = ...
%tmp = getelementptr %ST* %s, int 1, ubyte 2, ubyte 1, uint 5, uint 13

Figure 2.2: LLVM code for complex memory addressing

Making all address arithmetic explicit is important so that it is exposed to all LLVM optimizations (most importantly, reassociation and redundancy elimination); getelementptr achieves this without obscuring the type information. Load and store instructions take a single pointer and do not perform any indexing, which makes the processing of memory accesses simple and uniform.

2.2.3 Explicit Memory Allocation and Unified Memory Model

LLVM provides instructions for typed memory allocation. The malloc instruction allocates one or more elements of a specific type on the heap, returning a typed pointer to the new memory. The free instruction releases memory allocated through malloc. The alloca instruction is similar to malloc except that it allocates memory in the stack frame of the current function instead of the heap, and the memory is automatically deallocated on return from the function. All stack-resident data (including “automatic” variables) are allocated explicitly using alloca.

In LLVM, all addressable objects (“lvalues”) are explicitly allocated. Global variable and function definitions define a symbol which provides the address of the object (not the object itself), and all stack memory is explicitly allocated with the alloca instruction. This gives a unified memory model in which all memory operations, including call instructions, occur through typed pointers. There are no implicit accesses to memory, simplifying memory access analysis, and the representation needs no “address of” operator.
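As a small, hypothetical illustration of this memory model (the function names are invented for the example), the comments below sketch how a stack object whose address is taken would be expressed in LLVM: the local becomes an explicit alloca, its initialization becomes a store, and the alloca's result is itself the address passed to the callee, so no separate “address-of” operator is needed.

    // Hypothetical example of the unified memory model.
    void increment(int *p) { *p += 1; }

    int useCounter(void) {
      int count = 0;       // in LLVM:  %count = alloca int
                           //           store int 0, int* %count
      increment(&count);   //           call void %increment(int* %count)
                           //           (the alloca result is itself the address)
      return count;        //           %v = load int* %count
    }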

2.2.4 Function Calls and Exception Handling

For ordinary function calls, LLVM provides a call instruction that takes a typed function pointer (which may be a function name or an actual pointer value) and typed actual arguments. This abstracts away the calling conventions of the underlying machine and simplifies program analysis.

One of the most unusual features of LLVM is that it provides an explicit, low-level, machine-independent mechanism to implement exception handling in high-level languages. In fact, the same mechanism also supports setjmp and longjmp operations in C, allowing these operations to be analyzed and optimized in the same way that exception features in other languages are. The common exception mechanism is based on two instructions, invoke and unwind.

The invoke and unwind instructions together support an abstract exception handling model logically based on stack unwinding (though LLVM-to-native code generators may use either “zero cost” table-driven methods [22] or setjmp/longjmp to implement the instructions). invoke is used to specify exception handling code that must be executed during stack unwinding for an exception. unwind is used to throw an exception or to perform a longjmp. We first describe the mechanisms and then describe how they can be used for implementing exception handling.

Footnote: When native code is generated for a program, malloc and free instructions are converted to the appropriate native function calls, allowing custom memory allocators to be used.

The invoke instruction works just like a call, but specifies an extra basic block that indicates the starting block for an unwind handler. When the program executes an unwind instruction, it logically unwinds the stack until it removes an activation record created by an invoke. It then transfers control to the basic block specified by the invoke. These two instructions expose exceptional control flow in the LLVM CFG.

These two primitives can be used to implement a wide variety of exception handling mechanisms.

We implemented full support for C’s setjmp/longjmp calls and the C++ exception model; in fact, both coexist cleanly in our implementation [33]. At a call site, if some code must be executed when an exception is thrown (for example, setjmp, “catch” blocks, or destructors in C++), the code uses the invoke instruction for the call. When an exception is thrown, this causes the stack unwinding to stop in the current function, execute the desired code, then continue execution or unwinding as appropriate.

{
  AClass Obj;   // Has a destructor
  func();       // Might throw; must execute destructor
  ...
}

Figure 2.3: C++ exception handling example

For example, consider Figure 2.3, which shows a case where “cleanup code” needs to be generated by the C++ front-end. If the ‘func()’ call throws an exception, C++ guarantees that the destructor for the Obj object will be run. To implement this, an invoke instruction is used to halt unwinding, the destructor is run, then unwinding is continued with the unwind instruction. The generated LLVM code is shown in Figure 2.4. Note that a front-end for Java would use similar code to unlock locks that are acquired through synchronized blocks or methods when exceptions are thrown.

...
; Allocate stack space for object:
%Obj = alloca %AClass, uint 1
; Construct object:
call void %AClass::AClass(%AClass* %Obj)
; Call "func()":
invoke void %func() to label %OkLabel
       unwind to label %ExceptionLabel
OkLabel:
  ; ... execution continues...
ExceptionLabel:
  ; If unwind occurs, execution continues
  ; here. First, destroy the object:
  call void %AClass::~AClass(%AClass* %Obj)
  ; Next, continue unwinding:
  unwind

Figure 2.4: LLVM code for the C++ example. The handler code specified by invoke executes the destructor.

A key feature of our approach is that the complex, language-specific details of what code must be executed to throw and recover from exceptions are isolated to the language front-end and language-specific runtime library (so they do not complicate the LLVM representation); however, the exceptional control flow due to stack unwinding is encoded within the application code and is therefore exposed in a language-independent manner to the optimizer. The C++ exception handling model is very complicated, supporting many related features such as try/catch blocks, checked exception specifications, function try blocks, etc., and requiring complex semantics for the dynamic lifetime of an exception object. The C++ front-end supports these semantics by generating calls to a simple runtime library.

; Allocate an exception object
%t1 = call sbyte* %llvm_cxx_eh_alloc_exc(uint 4)
%t2 = cast sbyte* %t1 to int*

; Construct the thrown value into the memory
store int 1, int* %t2

; "Throw" an integer expression, specifying the
; exception object, the typeid for the object, and
; the destructor for the exception (null for int).
call void %llvm_cxx_eh_throw(sbyte* %t1, , void (sbyte*)* null)
unwind    ; Unwind the stack.

Figure 2.5: LLVM uses a runtime library for C++ exception support but exposes control flow.

For example, consider the expression ‘throw 1’. This constructs and throws an exception with integer type. The generated LLVM code is shown in Figure 2.5. The example code illustrates the key features mentioned above. First, the runtime handles all of the implementation-specific details, such as allocating memory for exceptions. Second, the runtime functions manipulate the thread-local state of the exception handling runtime, but do not actually unwind the stack. Because the calling code performs the stack unwind, the optimizer has a better view of the control flow of the function without having to perform interprocedural analysis. This allows LLVM to turn stack unwinding operations into direct branches when the unwind target is in the same function as the unwinder (this often occurs due to inlining, for example).

Footnote: For example, the implementation has to be careful to reserve space for throwing std::bad_alloc exceptions.

Finally, try/catch blocks are implemented in a straightforward manner, using the same mechanisms and runtime support. Any function call within the try block becomes an invoke. Any throw within the try block becomes a call to the runtime library (as in the example above), followed by an explicit branch to the appropriate catch block. The “catch block” then uses the C++ runtime library to determine if the current exception is of one of the types that is handled in the catch block. If so, it transfers control to the appropriate block, otherwise it calls unwind to continue unwinding. The runtime library handles the language-specific semantics of determining whether the current exception is of a caught type.
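As a small, hypothetical C++ illustration of this lowering (the names are made up for the example), the call inside the try block becomes an invoke whose unwind destination is the dispatch code for the catch clause, and a handler that does not match the thrown type continues unwinding:

    // Hypothetical example annotated with the lowering described above.
    void mayThrow() { throw 42; }

    int tryExample() {
      try {
        mayThrow();      // becomes: invoke void %mayThrow() to label %Ok
                         //          unwind to label %Dispatch
        return 0;        // %Ok: normal continuation of the try block
      } catch (int e) {  // %Dispatch: a runtime-library call checks whether the
                         // current exception's typeid matches 'int'
        return e;        // if it matches, control branches into this catch block
      }                  // if not, 'unwind' is executed to continue unwinding
    }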

2.2.5 Plain-text, Binary, and In-memory Representations

The LLVM representation is a first-class language which defines equivalent textual, binary, and in-memory (i.e., compiler's internal) representations. The instruction set is designed to serve effectively both as a persistent, offline code representation and as a compiler internal representation, with no semantic conversions needed between the two. Being able to convert LLVM code between these representations without information loss makes debugging transformations much simpler, allows test cases to be written easily, and decreases the amount of time required to understand the in-memory representation.

Footnote: In contrast, typical JVM implementations convert from the stack-based language used offline to an appropriate representation for compiler transformations, and some even convert to SSA form for this purpose (e.g., [19]).

2.3 The LLVM Compiler Architecture

The goal of the LLVM compiler framework is to enable sophisticated transformations at link-time, install-time, run-time, and idle-time, by operating on the LLVM representation of a program at all stages. To be practical however, it must be transparent to application developers and end-users, and it must be efficient enough for use with real-world applications. This section describes how the overall system and the individual components are designed to achieve these goals.

2.3.1 High-Level Design of the LLVM Compiler Framework

Figure 2.6 shows the high-level architecture of the LLVM system. Briefly, static compiler front-ends emit code in the LLVM representation, which is combined by the LLVM linker. The linker performs a variety of link-time optimizations, with a focus on interprocedural techniques. The resulting LLVM code is then translated to native code for a given target at link-time or install-time, and the LLVM code is saved with the native code (alternatively, a JIT compiler can be used).

The native code generator inserts light-weight instrumentation to detect frequently executed code regions (currently loop nests and traces, but potentially also functions), and these can be optimized at runtime. The profile data collected at runtime represent the end-user’s (not the developer’s) runs, and can be used by an offline optimizer to perform aggressive profile-driven optimizations in the field during idle-time, tailored to the specific target machine.

Figure 2.6: LLVM system architecture diagram

This strategy provides five benefits that are not available in the traditional model of static compilation to native machine code. We argued in Section 2.1 that these capabilities are important for lifelong analysis and transformation, and we named them:

1. persistent program information,
2. offline code generation,
3. user-based profiling and optimization,
4. transparent runtime model, and
5. uniform, whole-program compilation.

These are difficult to obtain simultaneously for at least two reasons. First, offline code generation (#2) normally does not allow optimization at later stages on the higher-level representation instead of native machine code (#1 and #3). Second, lifelong compilation has traditionally been associated only with bytecode-based languages, which do not provide #4 and often not #2 or #5.

In fact, we noted in the Introduction that no existing compilation approach provides all the capabilities listed above. Our reasons are as follows:

• Traditional source-level compilers provide #2 and #4, but do not attempt #1, #3 or #5. They do provide interprocedural optimization, but require significant changes to application Makefiles.

• Several commercial compilers provide the additional benefit of #1 and #5 at link-time by exporting their intermediate representation to object files [54, 9, 76] and performing optimizations at link-time. No such system we know of is also capable of preserving its representation for runtime or idle-time use (benefits #1 and #3).

• Higher-level virtual machines like JVM and CLI provide benefit #3 and partially provide #1 (in particular, they focus on runtime optimization, because the need for bytecode verification greatly restricts the optimizations that may be done before runtime [5]). CLI partially provides #5 because it can support code in multiple languages, but any low-level system code and code in non-conforming languages is executed as “unmanaged code”. Such code is represented in native form and not in the CLI intermediate representation, so it is not exposed to CLI optimizations. These systems do not provide #2 with #1 or #3 because runtime optimization is generally only possible when using JIT code generation. They do not aim to provide #4, and instead provide a rich runtime framework for languages that match their runtime and object model, e.g., Java and C#. Omniware [2] provides #5 and most of the benefits of #2 (because, like LLVM, it uses a low-level representation that permits extensive static optimization), but at the cost of not providing information for high-level analysis and optimization (i.e., #1). It does not aim to provide #3 or #4.

• Transparent binary runtime optimization systems like Dynamo and the runtime optimizers in Transmeta processors provide benefits #2, #4 and #5, but they do not provide #1. They provide benefit #3 only at runtime, and only to a limited extent because they work only on native binary code, limiting the optimizations they can perform.

• Profile Guided Optimization for static languages provides benefit #3 at the cost of not being transparent (it requires a multi-phase compilation process). Additionally, PGO suffers from three problems: (1) Empirically, developers are unlikely to use PGO, except when compiling benchmarks. (2) When PGO is used, the application is tuned to the behavior of the training run. If the training run is not representative of the end-user's usage patterns, performance may not improve and may even be hurt by the profile-driven optimization. (3) The profiling information is completely static, meaning that the compiler cannot make use of phase behavior in the program or adapt to changing usage patterns.

There are also significant limitations of the LLVM strategy. First, language-specific optimizations must be performed in the front-end before generating LLVM code. LLVM is not designed to represent source-language types or features directly. Second, it is an open question whether languages requiring sophisticated runtime systems such as Java can benefit directly from LLVM.

We are currently exploring the potential benefits of implementing higher-level virtual machines such as JVM or CLI on top of LLVM.

The subsections below describe the key components of the LLVM compiler architecture, emphasizing design and implementation features that make the capabilities above practical and efficient.

2.3.2 Compile-Time: External Front-end and Static Optimizer

External static LLVM compilers (referred to as front-ends) translate source-language programs into the LLVM virtual instruction set. Each static compiler can perform three key tasks, of which the first and third are optional: (1) Perform language-specific optimizations, e.g., optimizing closures in languages with higher-order functions. (2) Translate source programs to LLVM code, synthesizing as much useful LLVM type information as possible, especially to expose pointers, structures, and arrays. (3) Invoke LLVM passes for global or interprocedural optimizations at the module level.

The LLVM optimizations are built into libraries, making it easy for front-ends to use them.

The front-end does not have to perform SSA construction. Instead, variables can be allocated on the stack (which is not in SSA form), and the LLVM stack promotion and scalar expansion passes can be used to build SSA form effectively. Stack promotion converts stack-allocated scalar values to SSA registers if their address does not escape the current function, inserting φ functions as necessary to preserve SSA form. Scalar expansion precedes this and expands local structures to scalars wherever possible, so that their fields can be mapped to SSA registers as well.
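The following fragment (hypothetical, for illustration) shows the kind of code a front-end can emit naively: the local x simply lives in stack memory. Because its address never escapes the function, the stack-promotion pass can rewrite the loads and stores into SSA registers, inserting a φ node at the join point.

    // Hypothetical illustration of stack promotion.
    // A front-end may emit 'x' as an alloca with loads and stores:
    //   %x = alloca int
    //   ... store int 1, int* %x ... store int 2, int* %x ...
    // Since the address of 'x' does not escape, stack promotion replaces
    // the memory traffic with SSA values and a phi node at the join:
    //   %x.0 = phi int [ 1, %then ], [ 2, %else ]
    int selectValue(bool flag) {
      int x;        // candidate for promotion: address never escapes
      if (flag)
        x = 1;
      else
        x = 2;
      return x;     // becomes a use of the phi result
    }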

Note that many “high-level” optimizations are not really language-dependent, and are often special cases of more general optimizations that may be performed on LLVM code. For example, both virtual function resolution for object-oriented languages (described in Section 7) and tail-recursion elimination, which is crucial for functional languages, can be done in LLVM. In such cases, it is better to extend the LLVM optimizer to perform the transformation, rather than investing effort in code which only benefits a particular front-end. This also allows the optimizations to be performed throughout the lifetime of the program.
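For instance (a hypothetical illustration, not an example from the thesis), a self-recursive tail call like the one below can be rewritten into a loop by a language-independent LLVM tail-recursion elimination pass, benefiting any front-end whose language relies on this behavior:

    // Hypothetical example: the recursive call is in tail position, so an
    // LLVM-level pass can turn it into a branch back to the function entry
    // (i.e., a loop), eliminating the stack growth.
    int sumTo(int n, int acc) {
      if (n == 0)
        return acc;
      return sumTo(n - 1, acc + n);   // tail call -> becomes a loop
    }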

2.3.3 Linker & Interprocedural Optimizer

Link time is the first phase of the compilation process where most of the program is available for analysis and transformation. As such, link-time is a natural place to perform aggressive interprocedural optimizations across the entire program. The link-time optimizations in LLVM operate on the LLVM representation directly, taking advantage of the semantic information it contains.

LLVM currently includes a number of interprocedural analyses, such as Data Structure Analysis (Chapter 3), call graph construction, Mod/Ref analysis, and interprocedural transformations like inlining, dead global elimination, dead argument elimination, dead type elimination, constant propagation, array bounds check elimination [82], simple structure field reordering, and Automatic Pool Allocation (Chapter 5).

The design of the compile- and link-time optimizers in LLVM permits the use of a well-known technique for speeding up interprocedural analysis. At compile-time, interprocedural summaries can be computed for each function in the program and attached to the LLVM bytecode. The link-time interprocedural optimizer can then process these interprocedural summaries as input instead of having to compute results from scratch. This technique can dramatically speed up incremental compilation when a small number of translation units are modified [18]. Note that this is achieved without building a program database or deferring the compilation of the input source code until link-time.

Footnote: Note that shared libraries and system libraries may not be available for analysis at link time, or may be compiled directly to native code.

2.3.4 Offline or JIT Native Code Generation

Before execution, a code generator is used to translate from LLVM to native code for the target platform (we currently support the X86, PowerPC, Sparc V8/V9, and Alpha architectures), in one of two ways. In the first option, the code generator is run statically at link time or install time, to generate high performance native code for the application, using possibly expensive code generation techniques. If the user decides to use the post-link (runtime and offline) optimizers, a copy of the LLVM bytecode for the program is included into the executable itself. In addition, the code generator can insert light-weight instrumentation into the program to identify frequently executed regions of code.

Alternatively, a Just-In-Time Execution Engine can be used which invokes the appropriate code generator at runtime, translating one function at a time for execution (or uses the portable (but slow) LLVM interpreter if no native code generator is available). The JIT translator can also insert the same instrumentation as the offline code generator.

2.3.5 Runtime Path Profiling & Reoptimization

One of the goals of the LLVM project is to develop a new strategy for runtime optimization of ordinary applications. Although that work is outside the scope of this thesis, we briefly describe the strategy and its key benefits.

As a program executes, the most frequently executed execution paths are identified through a combination of offline and online instrumentation [124]. The offline instrumentation (inserted by the native code generator) identifies frequently executed loop regions in the code. When a hot loop region is detected at runtime, a runtime instrumentation library instruments the executing native code to identify frequently-executed paths within that region. Once hot paths are identified, we duplicate the original LLVM code into a trace, perform LLVM optimizations on it, and then regenerate native code into a software-managed trace cache. We then insert branches between the original code and the new native code.

The strategy described here is powerful because it combines the following three characteristics:

(a) Native code generation can be performed ahead-of-time using sophisticated algorithms to generate high-performance code. (b) The native code generator and the runtime optimizer can work together since they are both part of the LLVM framework, allowing the runtime optimizer to exploit support from the code generator (e.g., for instrumentation and simplifying transformations). (c) The runtime optimizer can use high-level information from the LLVM representation to perform sophisticated runtime optimizations.

We believe these three characteristics together represent one “optimal” design point for a run- time optimizer because they allow the best choice in three key aspects: high-quality initial code generation (offline rather than online), cooperative support from the code-generator, and the ability to perform sophisticated analyses and optimizations (using LLVM rather than native code as the input).

2.3.6 Offline Reoptimization with End-user Profile Information

Because the LLVM representation is preserved permanently, it enables transparent offline optimization of applications during idle-time on an end-user's system. Such an optimizer is simply a modified version of the link-time interprocedural optimizer, but with a greater emphasis on profile-driven and target-specific optimizations.

An offline, idle-time reoptimizer has several key benefits. First, as noted earlier, unlike traditional profile-guided optimizers (i.e., compile-time or link-time ones), it can use profile information gathered from end-user runs of the application. It can even reoptimize an application multiple times in response to changing usage patterns over time (or optimize differently for users with differing patterns). Second, it can tailor the code to detailed features of a single target machine, whereas traditional binary distributions of code must often be run on many different machine configurations with compatible architectures and operating systems. Third, unlike the runtime optimizer (which has both the previous benefits), it can perform much more aggressive optimizations because it is run offline.

Nevertheless, runtime optimization can further improve performance because of the ability to perform optimizations based on runtime values as well as path-sensitive optimizations (which can cause significant code growth if done aggressively offline), and to adaptively optimize code for changing execution behavior within a run. For dynamic, long-running applications, therefore, the runtime and offline reoptimizers could coordinate to ensure the highest achievable performance.

2.4 Applications and Experiences

Sections 2.2 and 2.3 describe the design of the LLVM code representation and compiler architecture.

In this section, we evaluate this design in terms of three categories of issues: (a) the characteristics of the representation; (b) the speed of performing whole-program analyses and transformations in the compiler; and (c) illustrative uses of the LLVM system for challenging compiler problems (such as macroscopic analyses and transformations), focusing on how the novel capabilities in LLVM benefit these uses.

2.4.1 Representation Issues

We evaluate three important characteristics of the LLVM representation. First, a key aspect of the representation is the language-independent type system. Does this type system provide any useful information when it can be violated with casts? Second, how do high-level language features map onto the LLVM type system and code representation? Third, how large is the LLVM representation when written to disk?

What value does type information provide?

Reliable type information for programs can enable the optimizer to perform aggressive transformations that would be difficult otherwise, such as reordering two fields of a structure or optimizing memory management (as described throughout this thesis). As noted in Section 2.2.2, however, type information in LLVM is not reliable and some analysis (typically including a pointer analysis) must check the declared type information before it can be used. A key question is how much reliable type information is available in programs compiled to LLVM?

Data Structure Analysis (DSA, described in Chapter 3) is a flow-insensitive, field-sensitive and context-sensitive points-to analysis included with LLVM. DSA serves as an important host for all of the macroscopic techniques described in this thesis. As part of its analysis, DSA extracts LLVM types for a subset of memory objects in the program. It does this by using declared types in the LLVM code as speculative type information, and checks conservatively whether memory accesses to an object are consistent with those declared types, without having to perform type inference.

Footnote: DSA is quite aggressive: it can often extract type information for objects stored into and loaded out of “generic” void* data structures, despite the casts to and from void*.

Section 3.4.3 describes the use of DSA to verify the type information provided by LLVM for a suite of C, C++, and FORTRAN programs, counting the fraction of load and store operations for which reliable type information about the accessed objects is available using DSA. It shows that a vast amount of type information is available for C programs, despite the fact that the language permits all sorts of non-type-safe behavior.

It is important to note that similar results would be very difficult to obtain if LLVM had been an untyped representation. Intuitively, checking that declared types are respected is much easier (and requires less analysis time) than inferring those types for structure and array types in a low-level code representation. As an example, an earlier version of the LLVM C front-end was based on GCC's RTL internal representation, which provided little useful type information, and both DSA and pool allocation were much less effective. Our new C/C++ front-end is based on the GCC Abstract Syntax Tree representation, which makes much more type information available.

How do high-level features map onto LLVM?

LLVM is a much lower level representation than source code for standard source languages. Even C, which itself is quite low-level, has many features which must be lowered by a compiler targeting LLVM. For example, complex numbers, structure copies, unions, bit-fields, variable-sized arrays, and setjmp/longjmp all must be lowered by an LLVM C compiler (and all are supported by llvm-gcc). In order for the representation to support effective analyses and transformations, the mapping from source-language features to LLVM should capture the high-level operational behavior as cleanly as possible.

We discuss this issue by using C++ as an example, since it is the richest language for which we have a fully-functional front-end. We believe that all the complex, high-level features of C++ are expressed clearly in LLVM, allowing their behavior to be effectively analyzed and optimized:

• String constants (e.g. “hello world”) are lowered to constant, preinitialized LLVM global variables.

• Implicit calls (e.g. copy constructors) and parameters (e.g. ‘this’ pointers) are made explicit.

• Templates are fully instantiated by the C++ front-end before LLVM code is generated. Languages with true polymorphic types would be expanded into equivalent code using non-polymorphic types in LLVM.

• Base classes are expanded into nested structure types. For this C++ fragment:

  class base1 { int Y; };
  class base2 { float X; };
  class derived : base1, base2 { short Z; };

the LLVM type for class derived is ‘{ {int}, {float}, short }’. If the classes have virtual functions, a v-table pointer would also be included and initialized at object allocation time to point to the virtual function table, described below.

• A virtual function table is represented as a global, constant array of typed function pointers, plus the type-id object for the class. With this representation, virtual method call resolution can be performed by the LLVM optimizer as effectively as by a typical source compiler (more effectively if the source compiler uses only per-module instead of cross-module pointer analysis). A sketch of this mapping appears after this list.

• C++ exceptions are lowered to the ‘invoke’ and ‘unwind’ instructions as described in Section 2.2.4, exposing exceptional control flow in the CFG. In fact, having this information available at link time enables LLVM to use an interprocedural analysis to eliminate unused exception handlers. This optimization is much less effective if done on a per-module basis in a source-level compiler.
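As a rough sketch of the two preceding points about base-class layout and virtual function tables (the class name, members, and the exact v-table layout shown here are illustrative assumptions, not output of the actual front-end), consider:

    // Hypothetical sketch: how a class with a virtual function might map to LLVM.
    // (Simplified; the real front-end also emits a type-id object for the class
    // and may lay fields out differently.)
    class Shape {
      int id;
    public:
      virtual double area() { return 0.0; }  // forces a v-table pointer into the object
    };

    // Conceptually, at the LLVM level:
    //   %Shape        = type { double (%Shape*)**, int }   ; v-table pointer + fields
    //   %Shape.vtable = constant global array of typed function pointers,
    //                   e.g. one entry for %Shape::area
    //
    // A call "p->area()" loads the v-table pointer from the object, loads the
    // entry for area(), and makes an indirect call through it, all of which is
    // visible to the LLVM optimizer; this is what makes link-time virtual call
    // resolution possible.
    double callArea(Shape *p) { return p->area(); }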

We believe that similarly clean LLVM implementations exist for most constructs in other language families like Scheme, the ML family, SmallTalk, Java and Microsoft CLI. We have implementation experience with prototype MS CLI and Java virtual machines that indicates likely success.

How compact is the LLVM representation?

Since code for the compiled program is stored in the LLVM representation throughout its lifetime, it is important that it not be too large. The flat, three-address form of LLVM is well suited for a simple linear layout, with most instructions requiring only a single 32-bit word each in the file.

Figure 2.7 shows the size of LLVM files for SPEC CPU2000 executables after linking, compared to native X86 and 32-bit Sparc executables compiled by GCC 3.3 at optimization level -O3.

[Bar chart omitted: executable sizes in KB of the linked LLVM, X86, and SPARC executables for each SPEC CPU2000 benchmark and their average.]

Figure 2.7: Executable sizes for LLVM, X86, Sparc (in KB)

The figure shows that LLVM code is about the same size as native X86 executables (a dense, variable-size instruction set), and significantly smaller than SPARC executables (a traditional RISC machine with 32-bit instructions). We believe this is a very good result given that LLVM encodes an infinite register set, rich type information, control flow information, and data-flow (SSA) information that native executables do not.

Footnote: Note that LLVM compresses bytecode files with bzip2 by default. These numbers are collected with this compression feature turned off. LLVM can read both compressed and non-compressed bytecode files natively.

Currently, large programs are encoded less efficiently than smaller ones because they have a larger set of register values available at any point, making it harder to fit instructions into a 32-bit encoding. When an instruction does not fit into a 32-bit encoding, LLVM falls back on a 64-bit or larger encoding, as needed. Though it would be possible to make the fall back case more efficient, we have not attempted to do so.

How fast is LLVM?

An important aspect of LLVM is that the low-level representation enables efficient analysis and transformation, because of the small, uniform instruction set, the explicit CFG and SSA representations, and careful implementation of data structures. This speed is important for uses “late” in the compilation process (i.e., at link-time or run-time). In order to provide a sense for the speed of LLVM, Figure 2.8 shows a table of runtimes for several interprocedural optimizations. All timings were collected on a 3.06GHz Intel Xeon processor. The LLVM compiler system was compiled using the GCC 3.3 compiler at optimization level -O3.

Benchmark       DGE      DAE      inline    GCC
164.gzip        0.0018   0.0063   0.0127     1.937
175.vpr         0.0096   0.0082   0.0564     5.804
176.gcc         0.0496   0.1058   0.6455    55.436
177.mesa        0.0051   0.0312   0.0788    20.844
179.art         0.0002   0.0007   0.0085     0.591
181.mcf         0.0010   0.0007   0.0174     1.193
183.equake      0.0000   0.0009   0.0100     0.632
186.crafty      0.0016   0.0162   0.0531     9.444
188.ammp        0.0200   0.0072   0.1085     5.663
197.parser      0.0021   0.0096   0.0516     5.593
253.perlbmk     0.0137   0.0439   0.8861    25.644
254.gap         0.0065   0.0384   0.1317    18.250
255.vortex      0.1081   0.0539   0.2462    20.621
256.bzip2       0.0015   0.0028   0.0122     1.520
300.twolf       0.0712   0.0152   0.1742    11.986

Figure 2.8: Interprocedural optimization timings (in seconds)

The table includes numbers for several transformations: DGE (aggressive Dead Global variable and function Elimination), DAE (aggressive Dead Argument (and return value) Elimination), and inline (a function integration pass). All these interprocedural optimizations work on the whole program at link-time. In addition, they spend most of their time traversing and modifying the code representation directly, so they reflect the costs of processing the representation. As a reference for comparison, the GCC column indicates the total time the GCC 3.3 compiler takes to compile the program at -O3.

Footnote: “Aggressive” DCEs assume objects are dead until proven otherwise, allowing dead objects with cycles to be deleted.

We find that in all cases, the optimization time is substantially less than the time to compile the program with GCC, despite the fact that GCC does no cross-module optimization, and very little interprocedural optimization within a translation unit. In addition, the interprocedural optimizations scale mostly linearly with the number of transformations they perform. For example, DGE eliminates 331 functions and 557 global variables (which include string constants) from 255.vortex, DAE eliminates 103 arguments and 96 return values from 176.gcc, and ‘inline’ inlines 1368 functions (deleting 438 which are no longer referenced) in 176.gcc.

2.4.2 Example Applications of LLVM

Finally, to illustrate the capabilities provided by the compiler framework, we briefly describe four examples of how LLVM has been used for widely varying compiler problems, emphasizing some of the novel capabilities described in the introduction.

Projects using LLVM as a general compiler infrastructure

As noted earlier, LLVM has served as the host for many varied compiler techniques. The most aggressive of these are Data Structure Analysis (DSA) and Automatic Pool Allocation, which analyze and transform programs in terms of their logical data structures (as described in the rest of this thesis). These techniques rely on several key capabilities of LLVM: (a) they are only effective if most of the program is available, i.e., at link-time; (b) type information is crucial for their effectiveness, especially pointers and structures; (c) the techniques are source-language independent; and (d) SSA significantly improves the precision of DSA, which is flow-insensitive.

Other researchers not affiliated with our group have been actively using or exploring the use of the LLVM compiler framework in a number of different ways. These include using LLVM as an intermediate representation for binary-to-binary transformations, as a compiler back-end to support a hardware-based trace cache and optimization system, as a basis for runtime optimization and adaptation of Grid programs, as an implementation platform for several novel programming languages, to JIT compile shaders in a photo-realistic renderer, etc.

As of this writing, LLVM has shipped five major releases, roughly one every three months. The rapidly growing user base building around the compiler is a testament to its flexibility and adaptability.

SAFECode: A safe low-level representation and execution environment

SAFECode is a “safe” code representation and execution environment, based on a type-safe subset of LLVM. The goal of the work is to enforce memory safety of programs in the SAFECode representation through static analysis, by using a variant of automatic pool allocation instead of garbage collection [48], and using extensive interprocedural static analysis to minimize runtime checks [82, 48].

The SAFECode system exploits nearly all capabilities of the LLVM framework, except runtime optimization. It directly uses the LLVM code representation, which provides the ability to analyze C and C++ programs; this is crucial for supporting embedded software, middle-ware, and system libraries. SAFECode relies on the type information in LLVM (with no syntactic changes) to check and enforce type safety. It relies on the array type information in LLVM to enforce array bounds safety, and uses interprocedural analysis to eliminate runtime bounds checks in many cases [82]. It uses interprocedural safety checking techniques, exploiting the link-time framework to retain the benefits of separate compilation (a key difficulty that led previous such systems to avoid using interprocedural techniques [44, 63]).

External ISA design for Virtual Instruction Set Computers

Virtual Instruction Set Computers [125, 43, 3] are processor designs that use two distinct instruction sets: an externally visible, virtual instruction set (V-ISA) which serves as the program representation for all software, and a hidden implementation-specific instruction set (I-ISA) that is the actual hardware ISA. A software translator co-designed with the hardware translates V-ISA code to the I-ISA transparently for execution, and is the only software that is aware of the I-ISA. This translator is essentially a sophisticated, implementation-specific back-end compiler.

In recent work, we argued that an extended version of the LLVM instruction set could be a good choice for the external V-ISA for such processor designs [3]. We proposed a novel implementation strategy for the virtual-to-native translator that enables offline code translation and caching of translated code in a completely OS-independent manner.

That work exploits the important features of the instruction set representation, and extends it to be suitable as a V-ISA for hardware. The fundamental benefit of LLVM for this work is that the LLVM code representation is low-level enough to represent arbitrary external software (including operating system code), yet provides rich enough information to support sophisticated compiler techniques in the translator. A second key benefit is the ability to do both offline and online translation, which is exploited by the OS-independent translation strategy.

2.5 Related Work

We focus on comparing LLVM with three classes of previous work: other virtual-machine-based compiler systems, research on typed assembly languages, and link-time or dynamic optimization systems.

As noted in the introduction, the goals of LLVM are complementary to those of higher-level language virtual machines such as SmallTalk, Self, JVM, and the managed mode of Microsoft CLI.

High-level virtual machines such as these require a particular object model and runtime system for use. This implies that they can provide higher-level type information about the program, but are not able to support languages that do not match their design (even object-oriented languages such as C++). Additionally, programs in these representations (except CLI) are required to be type-safe. This is important for supporting mobile code, but makes these virtual machines insufficient for non-type-safe languages and for low-level system code. It also significantly limits the amount of optimization that can be done before runtime because of the need for bytecode verification.

The Microsoft CLI virtual machine has a number of features that distinguish it from other high-level virtual machines, including explicit support for a wide range of features from multiple languages, language interoperability support, non-type-safe code, and “unmanaged” execution mode. Unmanaged mode allows CLI to represent code in arbitrary languages, including those that do not conform to its type system or runtime framework, e.g., ANSI-standard C++ [96]. However, code in unmanaged mode is not represented in the CLI intermediate representation (MSIL), and therefore is not subject to dynamic optimization in CLI. In contrast, LLVM allows code from arbitrary languages to be represented in a uniform, rich representation and optimized throughout the lifetime of the code. A second key difference is that LLVM lacks the interoperability features of CLI but also does not require source languages to match the runtime and object model for interoperability. Instead, it requires source-language compilers to manage interoperability, but then allows all such code to be exposed to LLVM optimizers at all stages.

The Omniware virtual machine [2] is closer to LLVM, because it uses an abstract low-level RISC architecture and can support arbitrary code (including non-type-safe code) from any source language. However, the Omniware instruction set lacks the higher-level type information of LLVM. In fact, it allows (and requires) source compilers to choose data layouts, perform address arithmetic, and perform register allocation (to a small set of virtual registers). All these features make it difficult to perform any sophisticated analysis on the resulting Omniware code. These differences from LLVM arise because the goals of their work are primarily to provide code mobility and safety, not a basis for lifelong code optimization. Their virtual machine compiles Omniware code to native code at runtime, and performs only relatively simple optimizations plus some stronger machine-dependent optimizations.

Kistler and Franz describe a compilation architecture for performing optimization in the field, using simple initial load-time code generation, followed by profile-guided runtime optimization [81].

Their system targets the Oberon language, uses Slim Binaries [56] as its code representation, and provides type safety and memory management similar to other high-level virtual machines. They do not attempt to support arbitrary languages or to use a transparent runtime system, as LLVM does. They also do not propose doing static or link-time optimization.

There has been a wide range of work on typed intermediate representations. Functional languages often use strongly typed intermediate languages (e.g., [123]) as a natural extension of the source language. Projects on typed assembly languages (e.g., TAL [99] and LTAL [24]) focus on preserving high-level type information and type safety during compilation and optimizations. The SafeTSA [5] representation is a combination of type information with SSA form, which aims to provide a safe but more efficient representation than JVM bytecode for Java programs. In contrast, the LLVM virtual instruction set does not attempt to preserve type safety of high-level languages, to capture high-level type information from such languages, or to enforce code safety directly (though it can be used to do so [48]). Instead, the goal of LLVM is to enable sophisticated analyses and transformations beyond static compile time.

There have been attempts to define a unified, generic, intermediate representation. These have largely failed, ranging from the original UNiversal Computer Oriented Language [127] (UNCOL), which was discussed but never implemented, to the more recent Architecture and language Neutral Distribution Format [7] (ANDF), which was implemented but has seen limited use. These unified representations attempt to describe programs at the AST level, by including features from all supported source languages. LLVM is much less ambitious and is more like an assembly language: it uses a small set of types and low-level operations, and the “implementation” of high-level language features is described in terms of these types. In some ways, LLVM simply appears to be a strict RISC architecture.

Several systems perform interprocedural optimization at link-time. Some operate on assembly code for a given processor [101, 126, 34, 110] (focusing primarily on machine-dependent optimizations), while others export additional information from the static compiler, either in the form of an IR or annotations [139, 54, 9, 76]. None of these approaches attempts to support optimization at runtime or offline after software is installed in the field, and it would be difficult to directly extend them to do so.

There have also been several systems that perform transparent runtime optimization of native code [10, 50, 43]. These systems inherit all the challenges of optimizing machine-level code [101] in addition to operating under the tight time constraints of runtime optimization.

In contrast, LLVM aims to provide type, dataflow (SSA) information, and an explicit CFG for use by runtime optimizations. For example, our online tracing framework (Section 2.3.5) directly exploits the CFG at runtime to perform limited instrumentation of hot loop regions. Finally, none of these systems supports link-time, install-time, or offline optimizations, with or without profile information.

2.6 Conclusion

This chapter described LLVM, a system for performing lifelong code analysis and transformation while remaining transparent to programmers. The system uses a low-level, typed, SSA-based instruction set as the persistent representation of a program, but without imposing a specific runtime environment. The LLVM representation is language independent, allowing all the code for a program, including system libraries and portions written in different languages, to be compiled and optimized together. The LLVM compiler framework is designed to permit optimization at all stages of a software lifetime, including extensive static optimization, online optimization using information from the LLVM code, and idle-time optimization using profile information gathered from end-users in the field. The current implementation includes a powerful link-time global and interprocedural optimizer, a low-overhead tracing technique for runtime optimization, and Just-In-Time and static code generators.

We showed experimentally and based on experience that LLVM (with Data Structure Analysis from Chapter 3) makes available extensive type information even for C programs, which can be used to safely perform a number of aggressive transformations that would normally be attempted only on type-safe languages in source-level compilers. We also showed that the LLVM representation is comparable in size to X86 machine code and about 25% smaller than SPARC code on average, despite capturing much richer type information as well as an infinite register set in SSA form.

Finally, we gave several examples of whole-program optimizations that are very efficient to perform on the LLVM representation. A key question we are exploring currently is whether high-level language virtual machines can be implemented effectively on top of the LLVM runtime optimization and code generation framework.

Chapter 3

Data Structure Analysis

Alias analysis for programs with complex pointer-based data structures has been most successful at guiding traditional low-level memory optimizations. These transformations rely on disambiguating pairs of memory references and on identifying local and interprocedural side-effects of statements.

In contrast, there has been much less success with transformations that apply to entire instances of data structures such as lists, heaps, or graphs. Many reasons exist for this disparity, including the possibility of non-type-safe memory accesses in common programming languages (e.g., C and C++), and the potentially high cost of an analysis that can distinguish different instances of a logical data structure.

Enabling such analyses and transformations requires some powerful analysis capabilities:

1. Full Context-Sensitivity: Identifying distinct instances of data structures requires the analysis algorithm to distinguish between heap objects created via different call paths in a program (i.e., naming objects by their call paths), because data structures are often created with common library functions. Even many partially context-sensitive algorithms do not attempt to distinguish heap objects by call paths [51, 143, 53, 138, 42], which makes them unable to detect this key property. On the other hand, naïve approaches to full context sensitivity can easily lead to an explosion in the size of the heap representation (because the number of call paths is often exponential in the size of the program), and can make recursion difficult to handle.

2. Field-Sensitivity: Identifying the internal connectivity pattern of a data structure requires distinguishing the points-to properties of different structure fields. Such “field-sensitivity” is often supported by analyses targeting languages that are type-safe but is difficult to support efficiently (if at all) in non-type-safe languages (e.g., see [128, 92]).

3. Explicit Heap Model: Analyzing memory objects requires constructing an explicit model of the memory in the program, including objects not directly necessary for identifying aliases. Some common alias analysis algorithms (e.g., Steensgaard's [129] and Andersen's [6] algorithms) build an explicit heap representation, but do not provide any context-sensitivity. Other, more powerful analyses only record alias pairs to determine pointer aliasing properties [51, 25, 73, 140]. Retaining both capabilities is challenging due to the potential for the heap model to grow to be very large.

Practical alias and pointer analysis algorithms have not attempted to provide the combination of properties described above, or are not fast enough for realistic use in a commercial compiler (requiring minutes to hours of analysis time for medium-sized programs). If compile time is no issue, “shape analysis” algorithms are powerful enough to provide this information and more (e.g., enough to identify a particular structure as a “linked-list” or “binary tree” [60, 117]). Shape analysis, however, has so far not proven practical for use in commercial optimizing compilers due to its intractable scalability with current algorithms.

In this work, we present an analysis algorithm called Data Structure Analysis, which is the key foundation for all of the macroscopic techniques described in this thesis. The algorithm aims to lie somewhere between traditional pointer analyses and more powerful shape analysis algorithms in power, while being as fast as traditional aggressive alias analyses. It provides the three required capabilities listed above and it supports the full generality of C programs, including type-unsafe code, incomplete programs, function pointers, recursion, and setjmp/longjmp. We believe it is efficient and scalable enough for use in commercial compilers.

We show that the theoretical worst case time and memory complexity are Θ(nα(n) + kα(k)e), and Θ(fk), where n, k, e, and f denote the number of instructions, the maximum number of nodes in a data structure graph for a single procedure, the number of edges in the call graph, and the total number of functions. In practice, k is small, typically on the order of a hundred nodes or less, even in large programs.

We evaluate the algorithm on 35 C programs, showing that the algorithm is extremely efficient in practice (in both performance and memory consumption). This includes programs that contain complex heap structures, recursion, and function pointers. For example, it requires less than 3.5 seconds of analysis time and about 19MB of memory to analyze 176.gcc, a program consisting of over 222,000 lines of code. Overall, we believe the broader implication of our work is to show that a fully context-sensitive analysis as described here can be practical for significant, large, real-world programs.

The remainder of this chapter is organized as follows: Section 3.1 introduces the data structure graph representation and its semantics. Section 3.2 describes the algorithms used to construct the graphs used by the analysis. Section 3.3 describes important engineering and implementation issues that are critical for making the analysis efficient in practice. Section 3.4 evaluates the analysis time, memory usage, and type information provided by the compiler (a study of DSA precision is included in Chapter 4). Section 3.5 contrasts our work with prior art in the field. Finally, Section 3.6 summarizes the key contributions and results of Data Structure Analysis.

3.1 The Data Structure Graph

Data Structure Analysis computes a graph we call the Data Structure Graph (DS graph) for each function in a program, summarizing the memory objects accessible within the function along with their connectivity patterns. Each DS graph node represents a (potentially unbounded) set of dynamic memory objects and distinct nodes represent disjoint sets of objects, i.e., the graph is a finite, static partitioning of the memory objects. Because we use a unification-based approach, all dynamic objects which may be pointed to by a single static pointer variable or field (in some context) are represented as a single node in the graph.

Some assumptions about the input program representation are necessary for describing our graph representation; other details are described in Section 3.2.2. In practice we perform all analysis on the LLVM representation described in Section 2.2. However, the requirements we assume are provided by many systems, so we describe the relevant aspects that DSA depends on below.

We assume that input programs have a simple type system with structural type equivalence, having primitive integer and floating point types of predefined sizes, plus four derived types: pointers, structures (i.e., record types), arrays, and functions. We assume (as in the C language) that only explicit pointer types and integer types of the same size or larger can directly encode a pointer value, and call these pointer-compatible types (other values are handled very conservatively in the analysis). For any type τ, fields(τ) returns a set of field names for the fields of τ, which is a single degenerate field name if τ is a scalar type (field names are assumed to be unique to a type). An array type of known size k may be represented either as a structure with k fields or by a single field; an unknown-size array is always represented as the latter. Other assumptions about the input program representation are described in Section 3.2.2.

We also assume a load/store program representation in which virtual registers and memory locations are distinct. In our representation it is not possible to take the address of a virtual register, so address-taken variables must live in memory. Additionally, virtual registers can only represent scalar variables (i.e., integer, floating point, or pointer). Structures, arrays, and functions are strictly memory objects and are accessed only through load, store, and call instructions. All arithmetic operations operate on virtual registers. Memory is partitioned into heap objects (allocated via a malloc instruction), stack objects (allocated via an explicit stack allocation instruction named alloca, similar to malloc), global objects (global variables and functions), and unknown objects.

The DS graph for a function is a finite directed multigraph represented as a tuple DSG(F) = ⟨N, E, EV, C⟩, where:

• N is a set of nodes, called “DS nodes”. DS nodes have several attributes described in Sec-

tion 3.1.1 below.

• E is a set of edges in the graph. Formally, E is a function of type hns, fsi → hnd, fdi, where

ns, nd ∈ N, fs ∈ fields(T (ns)) and fd ∈ fields(T (nd)), and T (n) denotes type information computed for the objects of n as explained below. E is a function because a source field can

have only a single outgoing edge. Note that the source and target of an edge are both fields

of a DS node.

• EV is a function of type vars(f) → hn, fi, where vars(f) is the set of virtual registers in

function f. Conceptually, EV (v) is an edge from register v to the target field hn, fi pointed to by v, if v is of pointer-compatible type.

• C is a set of "call nodes" in the graph (separate from N), which represent unresolved call sites in the context of the current function. Each call node c ∈ C is a (k + 2)-tuple: (r, f, a1, ..., ak), where every element of the tuple is a node-field pair ⟨n, f⟩. r and f respectively denote the value returned by the call (if it is pointer-compatible) and the function(s) being called. a1, ..., ak denote the pointer-compatible values passed as arguments to the call (other arguments are not represented). Conceptually, each tuple element can also be regarded as a points-to edge in the graph.
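To make the preceding definition concrete, the following C++ sketch shows one plausible in-memory encoding of DS graphs. It is an illustration only, written against the definitions above; the class and member names (DSNode, DSGraph, Cell, and so on) are assumptions made for this sketch rather than the names used by the actual implementation, and types are simplified to strings purely to keep it self-contained.

#include <map>
#include <set>
#include <string>
#include <vector>

struct DSNode;

// A cell is a <node, field> pair: the source or target of a points-to edge.
struct Cell { DSNode *node = nullptr; unsigned field = 0; };

// One flag bit per property described in Section 3.1.1.
enum NodeFlag { H = 1, S = 2, G = 4, U = 8, M = 16, R = 32, C = 64, O = 128 };

struct DSNode {
  std::string type;                 // T(n): inferred type of the objects
  std::set<std::string> globals;    // G(n): globals represented by this node
  unsigned flags = 0;               // bitwise OR of NodeFlag values
  std::vector<Cell> outEdges;       // E: one slot per pointer-compatible field
};

// One unresolved call site: <return value, callee, arg1, ..., argk>.
struct DSCallNode { Cell retVal, callee; std::vector<Cell> args; };

struct DSGraph {
  std::vector<DSNode *> nodes;              // N
  std::map<std::string, Cell> scalarMap;    // EV: virtual register name -> cell
  std::vector<DSCallNode> callNodes;        // C
};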

typedef struct list { struct list *Next; int Data; } list;

int Global = 10;

void do_all(list *L, void (*FP)(int*)) {
  do { FP(&L->Data); L = L->Next; } while (L);
}

void addG(int *X) { (*X) += Global; }

void addGToList(list *L) { do_all(L, addG); }

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num-1) : 0;
  New->Data = Num;
  return New;
}

int main() {                 /* X & Y lists are disjoint */
  list *X = makeList(10);
  list *Y = makeList(100);
  addGToList(X);
  Global = 20;
  addGToList(Y);
}

Figure 3.1: C code for running example

To illustrate the DS graphs and the analysis algorithm, we use the code in Figure 3.1 as a running example. This example creates and traverses two disjoint linked lists, using iteration,

recursion, function pointers, a pointer to a subobject, and a global variable reference. Despite the complexity of the example, Data Structure Analysis is able to prove that the two lists X and Y are disjoint (the final DS graph computed for main is shown in Figure 3.10).

[Figure: legend for the graph notation used below — shapes for a DS node, a variable, and a call node; call node slots are labeled r (return value), f (called function), followed by the first and second arguments.]

Figure 3.2: Graph Notation

To illustrate the DS graphs computed by various stages of our algorithm, we render DS graphs using the graphical notation shown in Figure 3.2. Figure 3.3 shows the example graphs computed for the do_all and addG functions, before any interprocedural information is applied. The figure includes an example of a call node, which (in this case) calls the function pointed to by FP, passing the memory object pointed to by L as an argument, and ignoring the return value of the call.

3.1.1 Graph Nodes and Fields

The DS nodes in a DS graph are responsible for representing information about a set of memory objects corresponding to that node. Each node n has three pieces of information associated with it:

• T(n) identifies a type for the memory objects represented by n. The discussion of node collapsing later in this section describes how this is computed for nodes representing multiple incompatible memory objects.

• G(n) represents a (possibly empty) set of global objects, namely, all those represented by node n. Note that functions are treated as global objects.

• flags(n) is a set of flags associated with node n. There are eight possible flags (H, S, G, U, M, R, C, and O), defined below.

The type information T (n) determines the number of fields and outgoing edges in a node. A node can have one outgoing edge for each pointer-compatible field in T (n). An incoming edge can point to an arbitrary field of the node (e.g., the “&L->Data” temporary in Figure 3.3 points to the

integer field), but not to any other byte offset. We describe below how type-unsafe code using pointers to arbitrary byte offsets is handled.

The globals G(n) represented by each node can be used to find the targets of function pointers, both by clients of Data Structure Analysis and to incrementally construct the call-graph during the analysis.

[Figure: the Local DS graphs for do_all and addG, before interprocedural information is applied. The do_all graph contains an unresolved call node for the indirect call through FP, with &L->Data as its argument; L points to a list node marked R. The addG graph shows X pointing to an int node marked MR, and the node for Global.]

Figure 3.3: Local DS graphs for do_all and addG

Memory Allocation Class Flags: H, S, G, U

The ’H’, ’S’, ’G’ and ’U’ flags in flags(n) are used to distinguish four classes of memory objects:

Heap-allocated, Stack-allocated, Globals (which include functions), and Unknown objects. Multiple flags may be present in a single DS node if, for example, analysis finds a pointer which may point to either a heap object or a stack object. Memory objects are marked Unknown when the instruction creating them is not identifiable, e.g., when a constant value is cast to a pointer value (for example, to access a memory-mapped hardware device), or when unanalyzable address arithmetic is found (these cases occur infrequently in portable programs). Nodes representing objects created in an external, unanalyzed function are not marked 'U', but are treated as "missing information" as described below.

Mod/Ref Flags: M, R

Our analysis keeps track of whether a particular memory object has been Modified or Read within the current scope of analysis, and this is represented via the ’M’ and ’R’ flags. For example, in the do all function, the statement “L = L->Next;” reads a pointer element from the node pointed to by L, which causes the ’R’ flag to be set in flags(node(EV (L))) as shown in Figure 3.3. Mod/Ref information is useful to a variety of client analyses. Note that DSA does not track per-field mod-ref

information. While this would be an easy extension, to date it has not been needed.

Complete Information Flag: C

Practical algorithms must correctly handle incomplete programs: those where code for some functions is unavailable, or where the "program" is actually a library of code without information about its clients. In order to allow an aggressive analysis even in such situations, each DS node tracks whether there may be information missing from it.

For example, in Figure 3.3, Data Structure Analysis does not yet know anything about the incoming L and FP arguments because it has not yet performed interprocedural analysis. Inside this function, it can determine that L is treated as a list object (the construction algorithm looks at how pointers are used, not what their declared types are), that it is read from, and what nodes each variable points to. However, it cannot know whether the information it has for this memory object is correct in a larger scope. For example, the FP and L arguments are speculatively represented as different objects, even though they might actually be aliased to each other when called from a particular call site.

To handle such situations, Data Structure Analysis computes which nodes in the graph are "complete," and marks each one with the Complete flag1. If a node is not marked complete, the information calculated for the DS node represents partial information and must be treated conservatively. In particular, the node may later be assigned extra edges, extra flags, a different type, or may even end up merged with another incomplete node in the graph. For example, from the graph in Figure 3.3, an alias analysis algorithm (such as the one described in Section 4.2) must assume that L and FP may alias. Nevertheless, other nodes in such a graph may be complete, and such nodes will never be merged with any other node, allowing clients to obtain useful information from graphs with partial information.

This capability is the key to the incremental nature of our algorithm: Because nodes keep track of which information is final, and which is still being created, the graphs constructed by our algorithm are always conservatively correct, even during intermediate steps of the analysis.

1This is somewhat similar to the “inside nodes” of [138].

Flag for Field-Sensitivity With and Without Type-Safety: O

A particularly important benefit of the “Complete” flag is that it allows DS Analysis to efficiently provide field-sensitive information for the type-safe subsets of programs. This is important because

field-sensitivity for type-unsafe structure types can be very expensive [128], but in fact we observe that most portable code is completely (or mostly) type-safe, even if the source language does not require it (e.g., C or C++). The Complete flag allows DS analysis to assume speculatively that all accesses to a node are type-safe, until an access to the node is found which conflicts with the other accesses. Because a node is not marked complete as long as there are potentially unprocessed accesses, this is safe.

DS Analysis provides field-sensitive information by associating a type, T (n), with each DS node n, and keeping track of a distinct outgoing edge for each pointer field of the type. If all accesses to all objects at the node use a consistent type τ, then T (n) = τ.2

If operations using incompatible types (as defined in Section 3.2) are found, the type for the node is treated as an unsized array of bytes (T (n) = char[]), and the fields and edges of the node are “cOllapsed” into a single field with at most one outgoing edge, using the following algorithm:

collapse(dsnode n)
  cell e = ⟨null, 0⟩                       // null target
  ∀ f ∈ fields(T(n))
    e = mergeCells(e, E(⟨n, f⟩))           // merge old target with e
    remove field f                         // remove old edge
  T(n) = char[]                            // reset type information
  E(⟨n, 0⟩) = e                            // new edge from field 0
  flags(n) = flags(n) ∪ {'O'}              // mark node Collapsed

In the pseudo-code, a "cell" is a ⟨node, field⟩ pair; cells are used as the sources of edges in the DS graphs.

The function “mergeCells(c1, c2)” (described in the next section) merges the cells c1 and c2 and therefore the nodes pointed to by those cells. This ensures that the targets of the two cells are now exactly equal. Because the above algorithm merges all outgoing edges from the node, the end result is the same as if field-sensitivity were never speculated for node n. If a node has been collapsed

(i.e., O ∈ flags(n)), it is always treated in this safe, but field-insensitive, manner.

2As Section 3.2 describes, type information is inferred only at actual accesses rather than from the declared types for variables, so that common idioms such as casting a pointer to void* and back do not cause a loss of precision.

3.2 Construction Algorithm

DS graphs are created and refined in a three step process. The first phase constructs a DS graph for each function in the program, using only intraprocedural information (a “local” graph). Second, a “Bottom-Up” analysis phase is used to eliminate incomplete information due to callees in a function, by incorporating information from callee graphs into the caller’s graph (creating a “BU” graph). The final “Top-Down” phase eliminates incomplete information due to incoming arguments by merging caller graphs into callees (creating a “TD” graph). The BU and TD phases operate on the “known” Strongly Connected Components (SCCs) in the call graph.

Two properties are important for understanding how the analysis works in the presence of incomplete programs, and how it can incrementally construct the call graph even though it operates on the SCCs of the graph. First, the DS graph for a function is conservatively correct even if only a subset of its potential callers and potential callees have been incorporated into the graph (i.e., the information in the graph can be used safely so long as the limitations on nodes without ‘C’

flags are respected, as described in Section 3.1.1). Intuitively, the key to this property is simply that a node must not be marked complete until it is known that all callers and callees potentially affecting that node have been incorporated into the graph. Second, the result of two graph inlining operations at one or two call sites is independent of the order of those operations. This follows from a more basic property that the order in which a set of nodes is merged does not affect the

final result.

3.2.1 Primitive Graph Operations

Data Structure Analysis is a flow-insensitive algorithm which uses a unification-based memory model, similar to Steensgaard's algorithm [129]. The algorithm uses several primitive operations on DS graphs, shown in Figure 3.4. These operations are used in the algorithm to merge two cells, to merge two nodes while aligning fields in a specified manner, to inline a callee's graph into a caller's graph at a particular call site, and vice versa. The latter two operations are described later in this section.

The fundamental operation in the algorithm is mergeCells, which merges the two target nodes specified. This requires merging the type information, flags, globals, outgoing edges of the two

nodes, and moving the incoming edges to the resulting node. If the two fields have incompatible types (e.g., T(n1) = int, f1 = 0, T(n2) = {int, short}, f2 = 1), or if the two node types are compatible but the fields are misaligned (e.g., T(n1) = T(n2) = {int, short}, f1 = 0, f2 = 1), the resulting node is first collapsed as described in Section 3.1.1, before the rest of the information is merged. Merging the outgoing edges causes the target nodes of the edges to be merged as well (if the node is collapsed, the resulting node for n2 will have only one outgoing edge, which is merged with all the out-edges of n1). To perform this recursive merging of nodes efficiently, the merging operations are implemented using Tarjan's Union-Find algorithm.
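The union-find structure behind this recursive merging can be illustrated with the small C++ sketch below. It is a deliberately minimal model, not the thesis implementation: each node has at most one points-to target, and type compatibility checks, flag/global unioning, and node collapsing are all omitted.

#include <cstdio>
#include <vector>

struct Graph {
  std::vector<int> parent;   // union-find parent links over node ids
  std::vector<int> target;   // single points-to target per node (-1 = none)

  int addNode() {
    parent.push_back((int)parent.size());
    target.push_back(-1);
    return (int)parent.size() - 1;
  }

  int find(int n) {                       // representative, with path compression
    return parent[n] == n ? n : parent[n] = find(parent[n]);
  }

  void merge(int a, int b) {              // unify two nodes and, recursively, their targets
    a = find(a); b = find(b);
    if (a == b) return;                   // already unified; this also stops cycles
    parent[a] = b;                        // union: b becomes the representative
    int ta = target[a], tb = target[b];
    if (tb == -1) target[b] = ta;         // keep whichever outgoing edge exists
    else if (ta != -1) merge(ta, tb);     // both exist: their targets must merge too
  }
};

int main() {
  Graph g;
  int a = g.addNode(), b = g.addNode(), x = g.addNode(), y = g.addNode();
  g.target[a] = x; g.target[b] = y;       // a -> x, b -> y
  g.merge(a, b);                          // unifying a and b forces x and y together
  std::printf("x and y unified: %s\n", g.find(x) == g.find(y) ? "yes" : "no");
  return 0;
}

The early return when the two representatives are already equal is what keeps the recursion bounded even when the graphs contain cycles.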

(Merge two cells of same or different nodes; update n2, discard n1)
Cell mergeCells(Cell ⟨n1, f1⟩, Cell ⟨n2, f2⟩)
  if (IncompatibleForMerge(T(n1), T(n2), f1, f2))
    collapse n2 (i.e., merge fields and out-edges)
  union flags of n1 into flags of n2
  union globals of n1 into globals of n2
  merge target of each out-edge of ⟨n1, fj⟩ with target of corresponding field of n2
  move in-edges of n1 to corresponding fields of n2
  destroy n1
  return ⟨n2, 0⟩ (if collapsed) or ⟨n2, f2⟩ (otherwise)

(Clone G1 into G2; merge corresponding nodes for each global)
cloneGraphInto(G1, G2)
  G1c = make a copy of graph G1
  add nodes and edges of G1c to G2
  for (each node N ∈ G1c)
    for (each global g ∈ G(N))
      merge N with the node containing g in G2

(Clone callee graph into caller and merge arguments and return)
resolveCallee(Graph Gcallee, Graph Gcaller, Function Fcallee, CallSite CS)
  cloneGraphInto(Gcallee, Gcaller)
  clear 'S' flags on cloned nodes
  resolveArguments(Gcaller, Fcallee, CS)

(Clone caller graph into callee and merge arguments and return)
resolveCaller(Graph Gcaller, Graph Gcallee, Function Fcallee, CallSite CS)
  cloneGraphInto(Gcaller, Gcallee)
  resolveArguments(Gcallee, Fcallee, CS)

(Merge arguments and return value for resolving a call site)
resolveArguments(Graph Gmerged, Function FC, CallSite CS)
  mergeCells(target of CS[1], target of return value of FC)
  for (1 ≤ i ≤ min(NumFormals(FC), NumActualArgs(CS)))
    mergeCells(target of arg i at CS, target of formal i of FC)

Figure 3.4: Primitive operations used in the algorithm

3.2.2 Local Analysis Phase

The goal of the local analysis phase is to compute a Local DS graph for each function, without any information about callers and callees. This is the only phase that inspects the actual program representation: the other two phases operate solely on DS graphs. All H,S,G,U,M,R flags and call nodes are derived solely in this phase: other phases only propagate them.

The local DS graph for a function F is computed as shown in Figure 3.5. We present this analysis in terms of a minimal language which is still as powerful as C. The assumptions about the type system and memory model in this language were described in Section 3.1.³

The “LocalAnalysis” routine first creates an empty node as a target for every pointer-compatible virtual register (entering them in the map EV ), and creates a separate node for every global variable. The analysis then does a linear scan over the instructions of the function, creating new nodes at malloc and alloca operations, merging edges of variables at assignments and the return instruction, and updating type information at selected operations. The type of a cell, EV (Y ), is updated only when Y is actually used in a manner that interprets its type, viz., at a dereference operation on Y (for a load or store) and when indexing into a structure or array pointed to by Y . malloc, alloca, and cast operations simply create a node of void type. Structure field accesses adjust the incoming edge to point to the addressed field (which is a no-op if the node is collapsed).

Indexing into array objects is ignored, i.e., arrays are treated as having a single element. return instructions are handled by creating a special π virtual register which is used to capture the return value.

Function calls result in a new call node being added to the DS graph, with entries for the value returned, the function pointer (for both direct and indirect calls), and for the arguments. For example, the local graph for addGToList in Figure 3.7(a) shows the call node created for the call to function do_all. Note that an empty node is created and then merged using mergeCells for each entry in order to correctly merge type information. This avoids losing type information when the declared type of an object is cast to an intermediate type (e.g., void*), then cast back to its declared type again.

3 We assume that the functions E(X) and EV (X) return a new, empty node with the type of X (by invoking makeNode(typeof(X))) when no previous edge from the cell or variable X existed. For example, in Figure 3.7(a), the incoming argument L points to such a node. We also abuse the notation by using E(X) = ... or EV (X) = ... to change what X points to.

(Compute the local DS Graph for function F)
LocalAnalysis(function F)
  Create an empty graph
  if F returns a pointer-compatible type
    EV(π) = makeNode(void)
  ∀ virtual registers R: EV(R) = makeNode(T(R))
  ∀ globals X (variables and functions) used in F:
    N = makeNode(T(X)); G(N) ∪= X; flags(N) ∪= 'G'
  ∀ instructions I ∈ F: case I in:
    X = malloc ...:                  (heap allocation)
      EV(X) = makeNode(void)
      flags(node(EV(X))) ∪= 'H'
    X = alloca ...:                  (stack allocation)
      EV(X) = makeNode(void)
      flags(node(EV(X))) ∪= 'S'
    X = *Y:                          (load)
      mergeCells(EV(X), E(EV(Y)))
      flags(node(EV(Y))) ∪= 'R'
    *Y = X:                          (store)
      mergeCells(EV(X), E(EV(Y)))
      flags(node(EV(Y))) ∪= 'M'
    X = &Y->Z:                       (address of struct field)
      ⟨n, f⟩ = updateType(EV(Y), typeof(*Y))
      f′ = 0 if n is collapsed; field(field(n, f), Z) otherwise
      mergeCells(EV(X), ⟨n, f′⟩)
    X = &Y[idx]:                     (address of array element)
      ⟨n, f⟩ = updateType(EV(Y), typeof(*Y))
      mergeCells(EV(X), ⟨n, f⟩)
    return X:                        (return pointer-compatible value)
      mergeCells(EV(π), EV(X))
    X = (τ)Y:                        (value-preserving cast)
      mergeCells(EV(X), EV(Y))
    X = Y(Z1, Z2, ..., Zn):          (function call)
      callnode c = new callnode; C ∪= c
      mergeCells(EV(X), c[1])
      mergeCells(EV(Y), c[2])
      ∀ i ∈ {1...n}: mergeCells(EV(Zi), c[i+2])
    (Otherwise) X = Y op Z:          (all other instructions)
      mergeCells(EV(X), EV(Y))
      mergeCells(EV(X), EV(Z))
      flags(node(EV(X))) ∪= 'U'
      collapse(node(EV(X)))
  MarkCompleteNodes()

Figure 3.5: The LocalAnalysis function

(Create a new, empty node of type τ)
makeNode(type τ)
  n = new Node(type = τ, flags = ∅, globals = ∅)
  ∀ f ∈ fields(τ): E(n, f) = ⟨null, 0⟩
  return n

(Merge the type of field ⟨n, f⟩ with type τ. This may collapse fields and update in/out edges via mergeCells())
updateType(cell ⟨n, f⟩, type τ)
  if (τ ≠ void ∧ τ ≠ typeof(⟨n, f⟩))
    m = makeNode(τ)
    return mergeCells(⟨m, 0⟩, ⟨n, f⟩)
  else
    return ⟨n, f⟩

Figure 3.6: makeNode and updateType operations

Finally, if any other instruction is applied to a pointer-compatible value (e.g., a cast from a pointer to an integer smaller than the pointer, or integer arithmetic), any nodes pointed to by the operands and the result are collapsed and the Unknown flag is set on the node4.

The final step in local graph construction is to calculate which DS nodes are Complete. For a Local graph, nodes reachable from a formal argument, from a global, from a value passed as an argument to a call site, or from a value returned by a function call may not be marked complete. This reflects the fact that the local analysis phase does not have any interprocedural information. For example, in Figure 3.7(a), neither of the nodes for the arguments to do_all is marked 'C'.

3.2.3 Bottom-Up Analysis Phase

The Bottom-Up (BU) analysis phase refines the local graph for each function by incorporating interprocedural information from the callees of each function. The result of the BU analysis is a graph for each function which summarizes the total effect of calling that function (e.g., the imposed aliases and mod/ref information) without any calling context information. It computes this graph by cloning the BU graphs of all known callees into the caller’s Local graph, merging nodes pointed to by corresponding formal and actual arguments.

The Bottom-Up analysis is the key pass involved in computing the fully context-sensitive analysis result by cloning and inlining graphs from callees into callers. Cloning graphs for each edge in the call graph directly provides a fully context-sensitive result by implicitly5 naming objects by the call path they are inlined from. Cloning nodes is an inherently exponential process, but it is controlled by three factors: 1) unification merges most cloned nodes together (e.g., often summarizing lists as recursive nodes); 2) memory objects that are unreachable in a caller are not copied from a callee; and 3) nodes corresponding to global variables are always merged as inlining occurs (i.e., the node for a global G in a callee is merged with the node for G in the caller if it exists), which leads to recursive merging due to #1. While exponential behavior is theoretically possible, we find that it does not occur in practice; Section 3.2.6 describes how to handle it if it does.

4In LLVM, type-safe pointer arithmetic is represented with the getelementptr operation, which effectively computes &Y->Z or &Y[idx]. See Section 2.2.2.
5The naming is implicit because we do not explicitly remember where a node was inlined from. This is one of the key ways we maintain aggressive scalability.

We first describe a single graph inlining operation, and then explain how the call graph is discovered and traversed. Consider a call to a function F with formal arguments f1,... , fn, where the actual arguments passed are a1,... , an. The function resolveCallee in Figure 3.4 shows how such a call is processed in the BU phase. We first copy the BU graph for F , clearing all Stack node markers since stack objects of a callee are not legally accessible in a caller. We then merge the node pointed to by each actual argument ai of pointer-compatible type with the copy of the node pointed to by fi. If applicable, we also merge the return value in the call node with the copy of the return value node from the callee. Note that any unresolved call nodes in F ’s BU graph are copied into the caller’s graph, and all the objects representing arguments of the unresolved call in the callee’s graph are now represented in the caller as well.

Basic Analysis Without Recursion

The complete Bottom-Up algorithm for traversing the call graph is shown in Figure 3.8, but we explain it for four different cases. In the simplest case of a program with only direct calls to non-external functions, no recursion, and no function pointers, the call nodes in each DS graph implicitly define the entire call graph. The BU phase simply has to traverse this acyclic call graph in post-order (visiting callees before callers), cloning and inlining graphs as described above.

To support programs that have function pointers and external functions (but no recursion), we simply restrict our post-order traversal to only process a call site if its function pointer targets a Complete node (i.e., its targets are fully resolved, as explained in Section 3.1.1), and all potential callees are non-external functions (line 1 in the figure).

Such a call site may become resolved if the function passed to a function pointer argument

becomes known. For example, the call to FP cannot be resolved within the function do_all, but will be resolved in the BU graph for the function addGToList, where we conclude that it is a call to addG. We clone and merge the indirect callee's BU graph into the graph of the function where the call site became resolved, merging actual and formal arguments as well as return values, using resolveCallee just as before (line 2 in the figure). This technique of resolving call nodes as their function pointer targets are completed effectively discovers the call graph on the fly, and we record the call graph as it is discovered.

[Figure: three DS graphs — (a) the Local graph for addGToList, with an unresolved call node for the call to do_all; (b) the graph after inlining do_all's BU graph; (c) the final BU graph, in which the list node pointed to by L is marked MR.]

Figure 3.7: Construction of the BU DS graph for addGToList

Note that the BU graph of the function containing the original call still has the unresolved call node. We do not re-visit previously visited functions in each phase, but that call node will eventually be resolved in the top-down phase. The BU graph for the function where the call was resolved now fully incorporates the effect of the call. For example, inlining the BU graph of addG into that of addGToList yields the finished graph shown in Figure 3.7(c). The Modified flag in the node pointed to by L is obtained from the node EV(X) from addG (Figure 3.3), which is merged with the second argument node inlined from do_all. This graph for addGToList is identical to the one that would have been obtained if addG had first been inlined into do_all (eliminating the call node) and the resulting graph had then been inlined into addGToList.

After the cloning and merging is complete for a function in the SCC, we identify new complete nodes (Section 3.2.2) (line 5) and remove unreachable nodes from the graph (line 6). The latter are created because copying and inlining callee graphs can bring in excess nodes not accessible within the current function (and therefore not accessible in any of its callers as well). This includes non-global nodes not reachable from any virtual register, global node, or call node.

Recursion without Function Pointers

Our strategy for handling recursion is essentially to apply the bottom-up process described above but on Strongly Connected Components (SCCs) of the call graph, handling each multi-node SCC separately. The key difficulty is that call edges are not known beforehand and, instead, are discovered incrementally by the algorithm (implying that cycles are incrementally discovered as well).

The overall Bottom-Up analysis algorithm is shown in Figure 3.8. It uses an adaptation of Tarjan’s linear-time algorithm to find and visit Strongly Connected Components (SCCs) in the call graph in postorder [118].

Assume first that there are only direct calls, i.e., the call graph is known. For each SCC, all calls to functions outside the SCC are first cloned and resolved as before (these functions will already have been visited because of the postorder traversal over SCCs). Once this step is complete, all of the functions in the SCC have empty lists of call sites, except for intra-SCC calls and calls to external functions (the latter are simply ignored throughout). In an SCC, each function will eventually need to inline the graphs of all other functions in the SCC at least once (either directly or through the graph of a callee). A naive algorithm can produce an exponential number of inlining operations, and even a careful enumeration can require O(n²) inlining operations in complex SCCs

(which we encountered in some programs).

Instead, because there are an infinite number of call paths through the SCC, we choose to completely ignore intra-SCC context-sensitivity. We merge the partial BU graphs of all functions in the SCC, resolving all intra-SCC calls in the context of this single merged graph, capturing the same information as other fully context-sensitive algorithms [140]. A more aggressive technique would try to preserve some of the context-sensitivity within an SCC for better precision, but we found this approach to be unscalable (thus we leave it to future work).

Recursion with Function Pointers

The final case to consider is a recursive program with indirect calls. The difficulty is that some indirect calls may induce cycles in the SCC, but these call edges will not be discovered until the indirect call is resolved. We make a key observation, based on the properties described earlier, that yields a simple strategy to handle such situations: some call edges of an SCC can be resolved before discovering that they form part of an SCC. When the call site "closing the cycle" is discovered (say in the context of a function F0), the effect of the complete SCC will be incorporated into the BU graph for F0, though not into the graphs for functions handled earlier.

BottomUpAnalysis(Program P)
  ∀ Function F ∈ P
    BUGraph{F} = LocalGraph{F}
    Val[F] = 0; NextID = 0
  while (∃ unvisited function F ∈ P)           (visit main first if available)
    TarjanVisitNode(F, new Stack)

TarjanVisitNode(Function F, Stack Stk)
  NextID++; Val[F] = NextID; MinVisit = NextID; Stk.push(F)
  ∀ call sites C ∈ BUGraph{F}
    ∀ known non-external callees FC at C
      if (Val[FC] == 0)                        (FC unvisited)
        TarjanVisitNode(FC, Stk)
      else
        MinVisit = min(MinVisit, Val[FC])
  if (MinVisit == Val[F])                      (new SCC at top of Stack)
    SCC S = { N : N = F ∨ N appears above F on the stack }
    ∀ F ∈ S: Val[F] = MAXINT; Stk.pop(F)
    ProcessSCC(S, Stk)

ProcessSCC(SCC S, Stack Stk)
  ∀ Function F ∈ S
(1)   ∀ resolvable call sites C ∈ BUGraph{F}               (see text)
        ∀ known callees FC at C
          if (FC ∉ S)                                      (process funcs not in SCC)
(2)         ResolveCallee(BUGraph{FC}, BUGraph{F}, FC, C)
(3) SCCGraph = BUGraph{F0}, for some F0 ∈ S
    ∀ Function F ∈ S, F ≠ F0                               (merge all BUGraphs of SCC)
      cloneGraphInto(BUGraph{F}, SCCGraph)
      BUGraph{F} = SCCGraph
(4) ∀ resolvable call sites C ∈ SCCGraph                   (see text)
      ∀ known callees FC at C                              (note: FC ∈ S)
        ResolveArguments(SCCGraph, FC, C)
(5) MarkCompleteNodes()                                    (Section 3.2.2)
(6) remove unreachable nodes
(7) if (SCCGraph contains new resolvable call sites)
      ∀ F ∈ S: Val[F] = 0                                  (mark unvisited)
      TarjanVisitNode(F0, Stk), for some F0 ∈ S            (re-visit SCC)

Figure 3.8: Bottom-Up Closure Algorithm

[Figure: (a) a recursive call graph over functions A–F, with the indirect call E → C shown dotted; (b) the call node edges after inlining F and E; (c) the SCC visitation order: 1. {F}, 2. {E}, 3. {D}: mark unvisited, 4. {B, D, C}, 5. {A}.]

Figure 3.9: Handling recursion due to an indirect call in the Bottom-Up phase

Based on this observation, we slightly adapted Tarjan's algorithm to revisit nodes of an SCC when the SCC is discovered, even though some of the nodes may have been visited earlier (but visiting only unresolved call sites). After the current SCC is fully processed (i.e., after step (5) in Figure 3.8), we check whether the SCC graph contains any newly inlined call nodes that are now resolvable. If so, we reset the Val entries for all functions in the SCC, which are used in TarjanVisitNode to check whether a node has been visited. This causes the nodes in the current SCC to be revisited, but only the new call sites are processed (since other resolvable call sites have already been resolved, and will not be included in steps (1) and (4)). Note that this is a simple form of a partially dynamic incremental online SCC finding algorithm [104].

For example, consider the recursive call graph shown in Figure 3.9(a), where the call from E to

C is an indirect call. Assume this call is resolved in function D, e.g., because D passes C explicitly to E as a function pointer argument. Since the edge E → C is unknown when visiting E, Tarjan’s algorithm will first discover the SCCs { F }, { E }, and then { D } (Figure 3.9(c)). Now, it will find a new call node in the graph for D, find it is resolvable as a call to C, and mark D as unvisited (Figure 3.9(b)). This causes Tarjan’s algorithm to visit the “phantom” edge D → C, and therefore to discover the partial SCC { B, D, C }. After processing this SCC, no new call nodes are discovered. At this point, the BU graphs for B,D and C will all correctly reflect the effect of the call from E to C, but the graph for E will not6. The top-down pass will resolve the call from

6Nor should it. A different caller of E may cause the edge to be resolved to a different function, thus the BU graph for E does not include information about a call edge which is not necessarily present in all calling contexts.

E to C (within E) by inlining the graph for D into E.

Note that even in this case, the algorithm only resolves each callee at each call site once: no iteration is required, even for SCCs induced by indirect calls.

Figure 3.10 shows the BU graph calculated for the main function of our example. This graph has disjoint subgraphs for the lists pointed to by X and Y. These were proved disjoint because we cloned and then inlined the BU graph for each call to addGToList(). This shows how the combination of context sensitivity with cloning can identify disjoint data structures, even when complex pointer manipulation is involved.

[Figure: the BU graph for main — X and Y point to two disjoint list nodes, each marked HMRC, while Global points to a separate int node marked GMRC.]

Figure 3.10: Finished BU graph for main

3.2.4 Top-Down Analysis Phase

The Top-Down construction phase is very similar to the Bottom-Up construction phase. The BU phase has already identified the call graph, so the TD phase can traverse the SCCs of the call graph directly using Tarjan’s algorithm; it does not need to “re-visit” SCCs as the BU phase does.

Note that some SCCs may have been visited only partially in the BU phase, so the TD phase is responsible for merging their graphs.

Overall, the TD phase differs from the BU phase in only four ways: First, the TD phase never marks an SCC as unvisited as explained above: it uses the call edges discovered and recorded by the BU phase. Second, the TD phase visits SCCs of the call graph computed by the Bottom-Up traversal in reverse postorder instead of postorder. Third, the Top-Down pass inlines each function's graph into each of its callees (rather than the reverse), and it inlines a caller's graph into all its potential callees directly (it never needs to "defer" this inlining operation since the potential callees at each call site are known). The final difference is that formal argument nodes are marked complete if all callers of a function have been identified by the analysis, i.e., the function is not accessible to any external functions. Similarly, global variables may be marked complete, unless they are accessible to external functions. A function or global escapes the program if it does not have internal linkage (i.e., it is not marked static in C) or if there exists a node for the global in main's graph that is not marked Complete.

3.2.5 Complexity Analysis

The local phase adds at most one new node, ScalarMap entry, and/or edge for each instruction in a procedure (before node merging). Furthermore, node merging or collapsing only reduces the number of nodes and edges in the graphs. We implemented node merging using a Union-Find data structure, which ensures that the local phase requires O(nα(n)) time and O(n) space for a program containing n instructions in all [129].

The BU and TD phases operate on DS graphs directly, so their performance depends on the size of the graphs being cloned and the time to clone and merge one graph into another. We denote these by K and Tinline respectively, where Tinline is O(Kα(K)) in the worst case. They also depend on the average number of callee functions per caller (not call site), denoted c.

For the BU phase, each function must inline the graphs for c callee functions, on average.

Because each inlining operation requires Tinline time, this requires fcTinline time if there are f functions in the program. The call sites within an SCC do not introduce additional complexity, since every potential callee is again inlined only once into its caller within or outside the SCC (in fact, these are slightly faster because only a single graph is built, causing common nodes to be merged). Thus, the time to compute the BU graph is Θ(fcTinline). The space required to represent the Bottom-Up graphs is Θ(fK). The TD phase is identical in complexity to the BU phase.

3.2.6 Bounding Graph Size

In the common case, the merging behavior of the unification algorithm we use keeps individual data structure graphs very compact, which occurs whenever a data structure is processed by a loop or recursion. Nevertheless, the combination of field sensitivity and cloning makes it theoretically possible for a program to build data structure graphs that are exponential in the size of the input program. Such cases can only occur if the program builds and processes a large complex data

structure using only non-loop, non-recursive code, and are thus extremely unlikely to occur in practice.

Using a technique like k-limiting [73] to guard against such unlikely cases is unattractive because it could reduce precision for reasonable data structures with paths more than k nodes long. Instead, we propose that implementations simply impose a hard limit on graph size (10,000 nodes, for example, which is much larger than any real program is likely to need). If this limit is exceeded, node merging can be used to reduce the size of the graph. Because this is only a theoretical concern, our implementation does not include the check. Our results in Section 3.4 show that the maximum function graph size we observed in practice across a wide range of programs is quite small.
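As a concrete illustration of this proposed safeguard (which, as noted above, our implementation does not actually include), a hypothetical size check might look like the following C++ sketch; the Graph type, the kNodeLimit constant, and the mergeInto helper are all assumptions made for the example.

#include <cstddef>
#include <vector>

struct Node { /* type, flags, edges ... elided for the sketch */ };

struct Graph {
  std::vector<Node *> nodes;
  // Unification-based node merge; the real logic is elided in this sketch.
  void mergeInto(Node *dst, Node *src) { (void)dst; (void)src; }
};

constexpr std::size_t kNodeLimit = 10000;     // assumed hard limit from the text

// If a graph ever exceeds the budget, fold the surplus nodes into one summary
// node: precision is lost, but the graph size is bounded.
void boundGraphSize(Graph &g) {
  if (g.nodes.size() <= kNodeLimit)
    return;                                   // common case: nothing to do
  Node *summary = g.nodes[kNodeLimit - 1];
  for (std::size_t i = kNodeLimit; i < g.nodes.size(); ++i)
    g.mergeInto(summary, g.nodes[i]);
  g.nodes.resize(kNodeLimit);
}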

3.3 Engineering an Efficient Pointer Analysis

As part of its basic design, DSA includes several features that are required for scalability. For example, the use of unification avoids, in practice, the exponential explosion inherent in cloning, and processing SCCs of the call graph eliminates the need for iteration within SCCs.

Other factors are less obvious. In particular, because the local phase is the only part of DSA that uses the compiler IR (all other phases perform graph transformations on DS Graphs), DSA has better cache behavior than analyses that need to keep the pointer representation and the compiler

IR in cache.

These design choices are some of the keys to achieving a practical analysis, and can reduce analysis times by several orders of magnitude. In addition to these key design choices, this section describes several engineering issues which can also substantially improve analysis times, primarily by improving the handling of global variables and by reducing N² behavior in important cases.

3.3.1 The Globals Graph

One reason the DS graph representation is so compact is that each function graph need only contain the memory reachable from that function. However, Figures 3.7(c) and 3.10 illustrate a fundamental violation of this strength. In both of these graphs, the global variable Global makes an appearance even though it is not directly referenced and no edges target it. Such nodes cannot simply be deleted, because they may have to be merged with other nodes in callers or callees of each function. If left untreated, all global variables defined in the program would propagate bottom-up to main, then top-down to all functions in the program. This trivially balloons the size of each graph to include every global variable in the program, a potential O(N²) size explosion.

In order to prevent this unacceptable behavior, our implementation uses a separate "Globals Graph" to hold information about global nodes and all nodes reachable from global nodes. This allows us to remove global variables from a function's graph if they are not used in the current function (even though they may be used in callers or callees of that function). For example, this eliminates the two Global nodes in the example graphs7. For the steps below, all nodes reachable from virtual registers (which includes formal parameters and return values of the current function, and call node arguments within the current function, but not globals) are considered to be locally used. Call nodes are also considered to be locally used, unless they contain a callee that is an external function (and thus will never be resolved).

More specifically, we make the following changes to the algorithm:

• In the BU phase (respectively, TD phase), after all known callees (respectively, callers) have been incorporated in step 4, we copy and merge in the nodes from the globals graph for every global G that has a node in the current graph, plus any nodes reachable from such nodes. This ensures that the current graph reflects all known information about such globals from other functions.

• After step 5 in the BU phase, we copy all global nodes and nodes reachable from such nodes into the globals graph, merging the global nodes with the corresponding nodes already in the Globals Graph, if any (which will cause other "corresponding" nodes to be merged as well). We clear the Stack markers on nodes being copied into the Globals Graph, for the same reason as in ResolveCallee. We also clear the Complete markers since those markers will be re-computed correctly within the context of each function. By the end of the BU phase, all the known behavior about globals will be reflected in the Globals Graph. Therefore, globals do not need to be copied from the TD graph to the Globals Graph in the TD phase.

7Liang and Harrold [92] use a somewhat similar technique.

• In step 6 of the BU phase, we identify global nodes that are not reachable from any locally used nodes and do not reach any such nodes. The latter requirement is necessary because we may revisit the current function later, resolving previously unresolved call sites, which can bring in additional globals; merging such globals will not correctly merge other reachable nodes in the graph if a global that can reach a locally reachable node has been removed from the graph. The latter requirement is not needed for the TD phase since no further inlining needs to happen after reaching step 6. We simply drop all these identified nodes from the BU or TD graph for the function.

In practice, we find that the Globals Graph makes a remarkable difference in running time for global-intensive programs, speeding up the top-down phase by an order of magnitude or more.
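The pruning rule in step 6 above can be sketched as a simple bidirectional reachability computation. The C++ below is an illustration under assumed names (Node, removableGlobals, and so on), not the interface of the actual implementation: a global node may be dropped only if it neither is reachable from, nor can reach, any locally used node.

#include <set>
#include <vector>

struct Node {
  bool isGlobal = false;
  std::vector<Node *> succs;   // points-to successors
  std::vector<Node *> preds;   // points-to predecessors, maintained alongside
};

static void markReachable(Node *n, bool forward, std::set<Node *> &seen) {
  if (!seen.insert(n).second)
    return;                                        // already visited
  for (Node *next : (forward ? n->succs : n->preds))
    markReachable(next, forward, seen);
}

// 'roots' are the locally used nodes (those reachable from virtual registers
// or from unresolved call nodes). Returns the global nodes safe to drop.
std::vector<Node *> removableGlobals(const std::vector<Node *> &allNodes,
                                     const std::vector<Node *> &roots) {
  std::set<Node *> fromRoots, toRoots;
  for (Node *r : roots) {
    markReachable(r, /*forward=*/true, fromRoots); // nodes the roots can reach
    markReachable(r, /*forward=*/false, toRoots);  // nodes that can reach a root
  }
  std::vector<Node *> result;
  for (Node *n : allNodes)
    if (n->isGlobal && !fromRoots.count(n) && !toRoots.count(n))
      result.push_back(n);
  return result;
}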

3.3.2 Efficient Graph Inlining

Our first implementation of DSA used a very simple implementation of the graph inlining operation described in Section 3.2.1. To inline a callee graph into a caller graph (for example), it literally made a copy of the callee graph into the caller graph, then used unification to perform the merge

(this algorithm is listed as the cloneGraphInto operation in Figure 3.4). The merge simply unifies each of the linked nodes between the caller and callee: this includes the formal/actual argument bindings as well as any global variables that are common to the two graphs.

This implementation is inefficient for several reasons. First, this operation copies nodes that are not reachable in the caller graph (e.g. for stack allocations in the callee or local data structures), requiring an “unreachable node elimination” cleanup pass to get rid of them. Second, copying nodes only to unify them away is a gross waste of time. Third, unification uses a union-find approach which does not immediately free a node when it is unified. In particular, all nodes referring to a unified node need to have their references updated (lazily), which means the nodes that are copied may last far longer than we would like (consuming memory).

To solve these problems, our implementation uses a parallel recursive traversal of the caller and callee graphs starting from each matching pair of callee and caller nodes. For each pair of nodes traversed, we merge information from the callee node into the caller node (which may involve merging or collapsing nodes in the caller graph). If no caller node corresponds to the callee node,

63 nodes are lazily (recursively) created. Nodes that exist in the caller but not the callee do not require recursive traversal.

This approach solves all of the problems with the naive implementation: 1) only reachable nodes are copied; 2) the only new nodes created are those that exist in the callee graph but not in the caller graph; and 3) dead nodes are never created, so they consume neither memory nor time.
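A skeletal version of this parallel recursive merge is sketched below in C++. The names and the single-successor-per-field shape are assumptions for the sketch, and one important case is deliberately elided: when a callee node turns out to correspond to two different caller nodes, a full implementation must also unify those caller nodes.

#include <cstddef>
#include <map>
#include <vector>

struct Node {
  unsigned flags = 0;
  std::vector<Node *> succs;   // one points-to target per field (simplified)
};

using CloneMap = std::map<const Node *, Node *>;   // callee node -> caller node

// Merge the callee-graph node 'callee' into the caller-graph node 'caller',
// walking both graphs in parallel. Only nodes reachable from the initial
// callee/caller pairs are ever visited; missing caller targets are created lazily.
void mergeInto(const Node *callee, Node *caller, CloneMap &mapping,
               std::vector<Node *> &callerNodes) {
  if (!mapping.insert({callee, caller}).second)
    return;                                  // this callee node was already merged
  caller->flags |= callee->flags;            // merge flags (types, globals elided)
  for (std::size_t i = 0; i < callee->succs.size(); ++i) {
    if (!callee->succs[i])
      continue;                              // callee field has no target
    if (caller->succs.size() <= i)
      caller->succs.resize(i + 1, nullptr);
    if (!caller->succs[i]) {                 // lazily create a missing caller target
      caller->succs[i] = new Node();
      callerNodes.push_back(caller->succs[i]);
    }
    mergeInto(callee->succs[i], caller->succs[i], mapping, callerNodes);
  }
}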

3.3.3 Partitioning EV for Efficient Global Variable Iteration

The EV mapping described in Section 3.1 contains all of the scalar pointers in the graph as well as the addresses of all globals. This mapping is used primarily by clients of the analysis (e.g. to find out which node a pointer points to), but is also used by various phases of the analysis (e.g. to find the formal arguments for a function when inlining a graph). In programs with large SCCs (and thus many functions merged into the same DS graph), this mapping can be very large.

Several portions of the DSA algorithm need access to all of the global variables that exist in a

DSGraph (e.g., updating the globals graph, and performing graph inlining operations). Our initial implementation iterated through EV to find the globals used in a graph, which suffered due to the large size of EV (while clients use constant-time hash-table lookups, iteration takes linear time).

Our solution is to partition EV into two mappings, one for scalar pointers and one to represent the addresses of globals. This allows direct iteration over just the information needed, yielding a large speedup on big codes with large call graph SCCs or many pointer variables.
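A minimal sketch of the split, with names assumed for the example: keeping the global-address entries in their own map means that "iterate over all globals in this graph" touches only that map rather than every scalar pointer.

#include <map>
#include <string>

struct DSNode;
struct Cell { DSNode *node = nullptr; unsigned field = 0; };

struct DSGraphMaps {
  // Virtual registers (scalar pointers) -> the cell they point to.
  std::map<std::string, Cell> scalarMap;
  // Addresses of global variables and functions -> their cell.
  std::map<std::string, Cell> globalsMap;
};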

3.3.4 Shrinking EV with Global Value Equivalence Classes

Even with the refinements described in Section 3.3.1 and Section 3.3.3, programs that use extremely large tables of global variable pointers can still cause a problem. In particular, consider a program that contains the (very reasonable and not uncommon) C code shown in Figure 3.11. The figure also shows the LLVM code it expands into.

At the LLVM level, each constant string is lowered to a different global variable which is initialized with the string constant, "strGV_n" in our example (see Section 7). The "StringArr" global is an array that points to all of these globals, and DSA will represent this configuration with the graph shown on the right side of Figure 3.11.

const char *const StringArr[] = {
  "string1",
  "string2",
  "string3",
  ...
};

%strGV_1 = internal constant [8 x sbyte] c"string1\00"
%strGV_2 = internal constant [8 x sbyte] c"string2\00"
%strGV_3 = internal constant [8 x sbyte] c"string3\00"
%StringArr = constant [3 x sbyte*] [
    sbyte* getelementptr ([8 x sbyte]* %strGV_1, int 0, int 0),
    sbyte* getelementptr ([8 x sbyte]* %strGV_2, int 0, int 0),
    sbyte* getelementptr ([8 x sbyte]* %strGV_3, int 0, int 0) ]

[The figure also shows the resulting DS graph: a "sbyte* array: G" node for StringArr pointing to a single "sbyte array: G" node that represents strGV_1, strGV_2, strGV_3, ...]

Figure 3.11: C Source, DSGraph, and LLVM code for Global Value Equivalence Class Example

Given the operation of the Globals Graph, many functions that either directly or indirectly use

StringArr will have a copy of this graph in their per-function graphs. Unfortunately, this means that each of those graphs must also have EV entries for each of the (potentially thousands of) globals that are merged into the string constant node. These extra entries slow down any analyses that need to iterate over globals in the graph and require extra memory to represent. Finally, note that

DSA will never be able to distinguish between the strGV_* nodes in the graph. The solution we use for this problem is to maintain equivalence classes of global value addresses, merging these equivalence classes (maintained with Tarjan's union-find algorithm) when DSA merges nodes corresponding to multiple globals. With this refinement, DSA need only keep the leader of an equivalence class in the graphs. The interface used to query the DSGraphs automatically returns the full set of globals in the equivalence class, permitting clients to be unaware of this implementation detail. In practice, we find that this straightforward refinement can cut DSA runtimes by 30% and reduce memory usage by 50% for large programs (such as 176.gcc and

253.perlbmk).
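The equivalence classes themselves can be kept in an ordinary union-find structure over global names. The following C++ sketch (assumed names, not the actual DSA interface) shows the two operations involved: finding a class leader and unioning two classes when the corresponding nodes are merged.

#include <map>
#include <string>

struct GlobalECs {
  std::map<std::string, std::string> parent;   // union-find over global names

  std::string find(const std::string &g) {     // class leader, with path compression
    auto it = parent.find(g);
    if (it == parent.end() || it->second == g)
      return g;
    return it->second = find(it->second);
  }

  // Called whenever DSA merges the nodes for two globals.
  void unionGlobals(const std::string &a, const std::string &b) {
    std::string la = find(a), lb = find(b);
    if (la != lb)
      parent[la] = lb;                         // lb becomes the shared leader
  }
};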

3.3.5 Avoiding N² Inlining for Function Pointers

With a straightforward implementation, large tables of function pointers cause an efficiency problem for both the bottom-up and top-down analysis phases. The problem is that any call through the table can reach N callees, and programs with tables often have a large number of calls through them. Because of this, the BU and TD passes have to inline all N graphs M times (once for each call through the table), which takes N × M time. In practice, this time can be unacceptably large for programs with hundreds of function pointers in a table.

Our solution to this problem is to keep a graph cache of all sets of function pointers inlined.

For example, in the BU phase, every time a call site with more than one callee needs to be inlined, the cache is queried. If there is no entry for this set of callees, a new DSGraph is allocated, all of the callee graphs are inlined into it, all formals and globals are merged, and the new graph is added to the cache. Finally, whether the graph was in the cache or not, the graph (which now represents the effects of all callees) is inlined into the caller graph. This makes the first inline operation for a set of callees slightly more expensive for the benefit of subsequent inline operations with the same set of callees.

In the best case, instead of performing N × M graph inlining operations, the BU pass now needs to perform only N + M + 1 graph inlining operations, a substantial improvement. In the worst case, entries in the cache are never reused, which adds one extra graph inline operation to a call site with many callees. In practice, this refinement is extremely important for certain classes of large programs.
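The cache described above can be sketched as a map keyed by the set of callees at a call site. The C++ below is illustrative only; Graph, newGraph, inlineGraphInto, and graphForCallees are assumed names, and the real inlining and formal/global merging logic is elided.

#include <map>
#include <set>
#include <string>

struct Graph { /* DS graph contents elided for the sketch */ };

Graph *newGraph() { return new Graph(); }
void inlineGraphInto(Graph *callee, Graph *dest) { /* unification-based inlining elided */ (void)callee; (void)dest; }

// One pre-merged graph per distinct set of callees; call sites that resolve to
// the same callee set reuse it instead of re-inlining every callee graph.
std::map<std::set<std::string>, Graph *> calleeSetCache;

Graph *graphForCallees(const std::set<std::string> &callees,
                       const std::map<std::string, Graph *> &buGraphs) {
  Graph *&cached = calleeSetCache[callees];
  if (!cached) {                               // first time this set is seen
    cached = newGraph();
    for (const std::string &f : callees)
      inlineGraphInto(buGraphs.at(f), cached);
  }
  return cached;                               // inline this one graph into the caller
}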

3.3.6 Merge Call Nodes for External Functions

One simple observation is that any nodes reachable from an unrecognized external function call will always be marked incomplete. Because of this, no DSA client will be able to do any substantial analysis or transformation of these nodes. There are several ways to use this to shrink graphs: the compiler could simply merge all nodes reachable from any external function call.

For our implementation, we considered this too drastic: it eliminates the possibility of performing modular analysis (e.g., analyze a library, generate DS Graphs for it, then use these precomputed graphs when compiling the main application). As a compromise, our implementation merges call nodes for external calls to the same function: this discards some amount of context sensitivity, but does not grossly pessimize the points-to information for external function calls8.

In practice, we find that this can greatly reduce the number of nodes for common functions like printf, which often have globals (constant strings) passed as arguments. With this refinement, there is at most one node for printf format strings (per function), which contains all of the format strings used in that context.

3.3.7 Direct Call Nodes

The final, and simplest, refinement is based on the observation that direct function calls are far more common than indirect function calls. As such, our representation of call nodes allows either a callee node (as described above) or a callee function to be specified for the call. For direct function calls, this eliminates the need to allocate a DSNode to represent the callee. For indirect calls, a node is used to allow lazy resolution and multiple callees to be represented.
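A possible shape for such a call-site record is sketched below in C++ (the names are assumptions for the illustration): a direct call stores the callee function itself, so no DS node is allocated for it, while an indirect call carries a callee cell that supports lazy resolution and multiple potential callees.

#include <vector>

struct DSNode;
struct Function;

struct Cell { DSNode *node = nullptr; unsigned field = 0; };

struct DSCallSite {
  Function *directCallee = nullptr;   // non-null for direct calls: no node needed
  Cell calleeNode;                    // used only for indirect calls
  Cell retVal;                        // pointer-compatible return value, if any
  std::vector<Cell> args;             // pointer-compatible actual arguments

  bool isDirect() const { return directCallee != nullptr; }
};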

3.4 Experimental Results

We implemented the complete Data Structure Analysis algorithm in the LLVM Compiler Infrastructure (Chapter 2). The analysis is performed entirely at link-time, using stubs for standard C library functions to reflect their behavior (as in other work, e.g., [25]). To evaluate the effectiveness of DSA, we are primarily interested in four things: 1) Is it fast and scalable enough for use in a commercial compiler? 2) Is the analysis memory consumption reasonable? 3) How much type information is DSA able to infer from programs? 4) How precise is DSA for alias analysis?

This section addresses the first three questions, and Chapter 4 addresses the fourth.

3.4.1 Benchmark Suite and Simple Measurements

We evaluated DSA on three benchmark suites: SPEC CPU 95, SPEC CPU 2000, and a collection of unbundled programs (which includes Povray 3.1, NAMD, boxed-sim, and fpgrowth). In the SPEC suites, we included all programs written in C or C++ as well as those FORTRAN 77 programs that could be converted to C by Version 20031025 of the "f2c" program. We include Povray, NAMD, and boxed-sim because they have been used in other pointer analysis papers, and fpgrowth is used in Chapter 5. These programs range from 190 to 222,208 raw lines of source code.

8In particular, if a function call is passed two pointers, the nodes corresponding to these pointers would not be merged.

Before analysis, each of these programs is compiled and linked by LLVM and subjected to the standard suite of compile- and link-time optimizations. As part of linking C++ and FORTRAN programs, we statically link the standard runtime library into the program (libstdc++ or libf2c) as LLVM code. While LLVM includes an aggressive link-time interprocedural optimizer (which performs inlining, dead argument elimination, interprocedural constant propagation, dead global elimination, etc.), it does not include any aggressive interprocedural pointer analysis.

Figure 3.12 captures some of the key properties of the benchmarks we are considering, separated by benchmark suite. The first set of columns are indicators of static benchmark size. The "Raw LOC" column is the number of source lines of code, as counted by "wc -l". Because raw lines of code are not a very reliable metric (the count includes comments, is affected by the number of header files, changes impact based on source language, does not include the statically linked standard library, etc.), we include a count of the number of memory instructions9 in the analyzed LLVM code for the program. Because DSA ignores all non-memory instructions, this gives a much more reliable way to gauge the relative sizes of programs. The "max |SCC|" column shows the size of the largest SCC in the call graph for the program, as determined by DSA. Several of the programs in this collection have large call graph SCCs (for example, 176.gcc, 253.perlbmk, and povray).

The second set of columns captures information about the final Top-Down graphs computed by DSA. The first column is the total number of nodes in all Top-Down graphs, and the second column is the total number of collapsed nodes in all graphs. The third column is the maximum number of nodes in any Top-Down graph, and the final column is the size of the globals graph computed for the program (as described in Section 3.3.1).

These statistics show that exponential graph explosion simply doesn’t happen for DSA, as mentioned in Section 3.2.6. Though DSA uses full context-sensitive cloning (and is thus susceptible to exponential behavior in theory), the unification approach used effectively eliminates this in two

9Memory instructions are load, store, malloc, alloca, call, invoke, and getelementptr instructions.

68 Code Size TD Graph Info Raw Memory max Total Collapsed Max Nodes Globals Benchmark LOC Instrs |SCC| Nodes Nodes in a Graph Graph Size SPEC CINT 2000 181.mcf 2412 991 1 103 0 49 39 256.bzip2 4647 1315 1 205 3 76 85 164.gzip 8616 1785 1 290 1 60 120 175.vpr 17728 8972 1 2106 118 366 677 197.parser 11391 10086 3 1291 121 109 487 186.crafty 20650 14035 2 2890 45 701 1211 300.twolf 20459 19686 1 2022 37 411 645 255.vortex 67220 37601 23 3515 241 392 967 254.gap 71363 47389 9 5889 728 370 772 252.eon 35819 51897 6 6936 511 411 419 253.perlbmk 85055 98386 250 2038 510 401 547 176.gcc 222208 139790 337 12736 1000 3196 2876 SPEC CFP 2000 179.art 1283 773 1 166 0 55 74 183.equake 1513 1340 1 204 0 118 86 171.swim 435 3716 2 425 16 40 123 172.mgrid 489 4064 2 530 31 40 148 168.wupwise 2184 5087 2 608 33 64 213 173.applu 3980 5966 2 593 19 68 249 188.ammp 13483 10551 1 897 69 281 316 177.mesa 58724 43352 1 3038 857 98 518 SPEC CINT 1995 129.compress 1934 326 1 75 2 18 42 130.li 7598 7894 24 806 328 33 154 124.m88ksim 19233 7951 2 1796 195 56 571 132.ijpeg 28178 12507 1 1531 62 65 173 099.go 29246 20543 1 2298 0 131 269 134.perl 26870 29940 19 1463 136 232 553 147.vortex 67211 37632 23 3529 242 355 970 126.gcc 205085 129083 255 12226 1109 3046 2564 SPEC CFP 1995 102.swim 429 3493 2 427 15 40 132 101.tomcatv 190 3797 2 512 19 40 153 107.mgrid 484 4010 2 519 31 40 144 145.fpppp 2784 4447 2 623 43 48 314 104.hydro2d 4292 5773 2 688 88 48 200 110.applu 3868 5854 2 583 19 57 250 103.su2cor 2332 6450 2 1080 49 160 411 146.wave5 7764 11333 2 1171 164 70 538 Other Programs fpgrowth 634 544 1 108 0 49 29 boxed-sim 11641 12287 1 480 61 65 151 NAMD 5312 19002 1 1539 276 224 196 povray31 108273 62734 56 5278 732 318 1044

Figure 3.12: Benchmark Suite and Basic DSA Measurements

ways: 1) unification inherently merges together most of the nodes created through the cloning process, and 2) in the case of analysis failure, when the analysis must assume that many nodes may all point to each other, unification-based approaches aggressively merge these nodes, shrinking the representation (e.g., 253.perlbmk, which is largely not type-safe).

3.4.2 Analysis Time & Memory Consumption

We evaluated the time and space usage of our analysis on a Linux workstation with an AMD Athlon MP 2100+ processor. We compiled LLVM with GCC 3.4.2 at the -O3 level of optimization. Figures 3.13 and 3.14 show the analysis time and memory usage10 of DSA, compared against the number of LLVM memory instructions in the program, for each of the programs listed in Figure 3.12, and Figure 3.15 lists the raw data. The graphs show that DSA is both extremely fast and extremely space efficient, requiring less than 3.5s and 20MB of memory to fully analyze the largest program (176.gcc, which consists of 222K lines of C code). Note that memory consumption, not time, is often one of the biggest bottlenecks for interprocedural analysis: DSA has a very small footprint compared to many pointer analyses11.

[Two scatter plots: Analysis Time (seconds) vs. Number of LLVM Memory Instructions, with series for Local Time, Bottom-Up Time, Top-Down Time, and Local+BU+TD Time.]
Figure 3.13: Scaling of Analysis Time with Program Size (Number of Memory Operations) Left chart includes full data set. Right chart is zoomed in on lower-left quadrant.

10Note that the persistent memory footprint of the DSA results is given by the BU+TD sizes, as our implementation of the BU pass modifies the Local graphs in place as they are computed (the Local graphs are not useful to any clients, so they do not need to be preserved). See Section 4.2 for details.
11Even in the closest comparable analysis [92], for example, field-sensitivity had to be disabled for the povray3 program for the analysis to fit into 640M of physical memory.

[Two scatter plots: Analysis Results Size (KB) vs. Number of LLVM Memory Instructions, with series for Local Size, Bottom-Up Size, Top-Down Size, and BU+TD Size.]
Figure 3.14: Scaling of Analysis Space with Program Size (Number of Memory Operations) Left chart includes full data set. Right chart is zoomed in on lower-left quadrant.

In addition to being fast and compact, DSA is also very scalable. The Local and TD passes take roughly O(n) time, where n is the number of memory operations in the program. Programs with a large number of globals (e.g., 254.gap, 253.perlbmk, 176.gcc, etc) show that the time required for the BU pass is related both to the program size and the number of globals in the program. We believe that a more aggressive form of the optimization described in Section 3.3.4 can be used to improve this, but even without further refinements DSA is extremely fast.

To put this into perspective, we compiled the 176.gcc, 253.perlbmk, and povray31 benchmarks with our system GCC compiler at the -O3 level of optimization. GCC takes 94.7s, 47.4s, and 38.5s to compile and link these programs, even though it performs no link-time optimization and no compile-time interprocedural optimizations other than inlining. DSA therefore takes only 3.4%, 5.6%, and 1.4% of the total GCC compile times for these programs, despite the fact that GCC is not an aggressive interprocedurally optimizing compiler. We feel that this shows that DSA is fast enough for use in realistic commercial compilers, particularly considering that it may serve many varied clients (as described throughout this thesis).

3.4.3 Inferred Type Information

Figure 3.16 counts the number of load and store instructions (“accesses”) whose pointer operand is determined to point to a non-collapsed, complete DS node: those that DSA is able to conclusively infer as type-safe. The first two columns list the name and total number of memory instructions

Benchmark       Memory     Analysis Time (s)             Analysis Space (bytes)
                Instrs     Local   BU     TD    L+B+T    Local      BU         TD         BU+TD

176.gcc         139790     0.44    2.38   0.42  3.24     10275784   11906592   7270952    19177544
126.gcc         129083     0.44    1.87   0.37  2.68      9513864   11245056   6743568    17988624
253.perlbmk      98386     0.31    2.11   0.24  2.65      6125408    8996272   3601432    12597704
povray           62734     0.18    0.23   0.13  0.54      4062264    5716072   3470128     9186200
252.eon          51897     0.24    0.18   0.14  0.56      5048848    8005688   4556456    12562144
254.gap          47389     0.17    0.41   0.19  0.77      4534000    8233320   3928600    12161920
177.mesa         43352     0.10    0.06   0.07  0.23      2567048    4213688   2535080     6748768
147.vortex       37632     0.09    0.14   0.07  0.30      2232296    2852024   1718240     4570264
255.vortex       37601     0.07    0.10   0.09  0.26      2222704    2831328   1709200     4540528
134.perl         29940     0.08    0.12   0.06  0.26      1709904    2086760   1229320     3316080
099.go           20543     0.04    0.02   0.04  0.10       936608    1330576    927512     2258088
300.twolf        19686     0.05    0.01   0.03  0.09       857760    1207232    825944     2033176
namd             19002     0.05    0.02   0.02  0.09       786472    1067224    779768     1846992
186.crafty       14035     0.05    0.03   0.04  0.12       842968    1375824    836312     2212136
132.ijpeg        12507     0.03    0.02   0.02  0.07       978864    1487576    957488     2445064
bsim             12287     0.02    0.03   0.01  0.06       523968     709856    452680     1162536
146.wave5        11333     0.03    0.01   0.01  0.05       559304     929832    533936     1463768
188.ammp         10551     0.00    0.00   0.01  0.01       583504     885560    559200     1444760
197.parser       10086     0.04    0.03   0.03  0.10       874584    1332312    776144     2108456
175.vpr           8972     0.03    0.02   0.01  0.06       673488     992336    628608     1620944
124.m88ksim       7951     0.03    0.03   0.02  0.08       849928    1365728    792456     2158184
130.li            7894     0.03    0.09   0.04  0.16      1212472    1995008    955312     2950320
103.su2cor        6450     0.02    0.02   0.01  0.05       467496     777304    447472     1224776
173.applu         5966     0.02    0.01   0.01  0.04       343240     535856    298448      834304
110.applu         5854     0.01    0.02   0.00  0.03       336592     530208    295720      825928
104.hydro2d       5773     0.02    0.02   0.01  0.05       435680     688328    393360     1081688
168.wupwise       5087     0.01    0.01   0.01  0.03       328952     535112    309216      844328
145.fpppp         4447     0.01    0.01   0.01  0.03       285432     483256    267528      750784
172.mgrid         4064     0.01    0.01   0.01  0.03       335248     542752    304792      847544
107.mgrid         4010     0.02    0.01   0.01  0.04       331792     537104    301760      838864
101.tomcatv       3797     0.02    0.00   0.01  0.03       303168     489520    278928      768448
171.swim          3716     0.02    0.01   0.00  0.03       260736     411560    239616      651176
102.swim          3493     0.01    0.01   0.01  0.03       254136     404376    232176      636552
164.gzip          1785     0.00    0.01   0.00  0.01       128944     221712    129904      351616
183.equake        1340     0.00    0.00   0.00  0.00        57928      77744     57168      134912
256.bzip2         1315     0.01    0.00   0.00  0.01        83872     134040     81752      215792
181.mcf            991     0.00    0.00   0.01  0.01        55360      77880     53976      131856
179.art            773     0.00    0.00   0.00  0.00        61216      97536     57952      155488
fpgrowth           544     0.01    0.00   0.01  0.02        39240      59496     37656       97152
129.compress       326     0.00    0.00   0.00  0.00        34232      61784     32584       94368

Figure 3.15: DSA Analysis Time and Space Consumption Data

Benchmark       Mem Instrs   Safe Access   Unsafe Access   Safe Percent

SPEC CINT 2000
181.mcf               991          556              7          98.8%
256.bzip2            1315          592            119          83.3%
164.gzip             1785         1086             38          96.6%
175.vpr              8972         3936            490          88.9%
197.parser          10086         1826           3199          36.3%
186.crafty          14035         8241            318          96.3%
300.twolf           19686         9882           1102          90.0%
255.vortex          37601        13403           9032          59.7%
254.gap             47389        11728          13196          47.1%
252.eon             51897        11038          14064          44.0%
253.perlbmk         98386        20145          37153          35.2%
176.gcc            139790        43504          30831          58.5%
average                                                        69.6%

SPEC CFP 2000
179.art               773          406             53          88.5%
183.equake           1340          620             86          87.8%
171.swim             3716         1829            372          83.1%
172.mgrid            4064         1937            470          80.5%
168.wupwise          5087         2829            338          89.3%
173.applu            5966         2968            676          81.4%
188.ammp            10551         2785           2961          48.5%
177.mesa            43352         4776          17429          21.5%
average                                                        72.6%

SPEC CINT 1995
129.compress          326          187             30          86.2%
130.li               7894         1479           2155          40.7%
124.m88ksim          7951         3320           1536          68.4%
132.ijpeg           12507         4268           2517          62.9%
099.go              20543        11034              3         100.0%
134.perl            29940         5749          10677          35.0%
147.vortex          37632        13427           9032          59.8%
126.gcc            129083        38567          30103          56.2%
average                                                        63.6%

SPEC CFP 1995
102.swim             3493         1871            212          89.8%
101.tomcatv          3797         1982            281          87.6%
107.mgrid            4010         1914            462          80.6%
145.fpppp            4447         2673            481          84.7%
104.hydro2d          5773         2684            796          77.1%
110.applu            5854         2939            638          82.2%
103.su2cor           6450         3272            678          82.8%
146.wave5           11333         5203           2359          68.8%
average                                                        81.7%

Other Programs
fpgrowth              544          247              3          98.8%
boxed-sim           12287         2677           4226          38.8%
NAMD                19002         9229            686          93.1%
povray31            62734        13607          19722          40.8%
average                                                        67.9%

Figure 3.16: Number of Load & Store instructions which access non-collapsed, complete, DS Nodes

(including address arithmetic, calls, etc) from Figure 3.12. The third column, labelled "Safe Access", is the number of load/store instructions that target non-collapsed complete nodes. The fourth column, labelled "Unsafe Access", is the number of load/store instructions which target either collapsed or incomplete nodes. The fifth column is the percentage of load/store instructions that are "safe".

The table shows that many programs are found to be mostly type safe, despite being written in languages that do not encourage disciplined use of types. For example, in CINT2000, 8 out of 12 of the programs are more than 50% type-safe, and 6 out of 12 are more than 80% type-safe. The FP benchmarks generally do even better, due to simpler access patterns and data structures. For smaller and cleaner programs (e.g., those in the Olden suite [109]), many programs are 100% type-safe.

Of the programs that have a large number of non-type-safe accesses, the most common reason is the use of custom memory allocators (e.g., 197.parser, 254.gap, 176.gcc, and 253.perlbmk). If a program uses a custom memory allocator, DSA cannot tell that objects allocated through the custom allocator are disjoint from one another: this causes a large amount of node merging, and, if memory for different types is involved, all of these nodes are collapsed.

This problem can be addressed by adding special attributes or pragmas to the programs (such as GCC's "__attribute__((malloc))") to indicate that the functions return disjoint memory, though our implementation of DSA currently does not support this. Note that, being a context-sensitive analysis, DSA is able to see through malloc "wrappers" like "xmalloc" and "operator new" without any special treatment; the problem occurs only when the programmer reimplements the memory allocator. Note that this problem is not specific to DSA: all pointer analyses are affected, and the effect has been discussed in the literature before (e.g. [62]).
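As a concrete illustration, the sketch below (hypothetical code, not taken from any of the benchmarks) shows the pattern in question: a custom allocator that carves objects out of a private arena, with GCC's malloc attribute conveying the disjointness guarantee that the analysis cannot infer on its own.

  #include <cstddef>
  #include <cstdlib>

  // A hypothetical custom allocator that hands out chunks from a private arena.
  // A pointer analysis that does not understand the allocator must assume that
  // any two objects it returns may overlap, which merges (and eventually
  // collapses) the nodes for every data structure built from it.
  static char arena[1 << 20];
  static std::size_t arena_used = 0;

  // GCC's 'malloc' attribute asserts that the returned pointer does not alias
  // any other pointer live at the call site, restoring the disjointness that
  // DSA infers automatically for the real malloc().
  __attribute__((malloc)) void *my_alloc(std::size_t size) {
    size = (size + 7) & ~std::size_t(7);   // keep 8-byte alignment
    if (arena_used + size > sizeof(arena))
      return std::malloc(size);            // fall back to the system heap
    void *result = arena + arena_used;
    arena_used += size;
    return result;
  }

  struct ListNode { ListNode *next; int value; };
  struct TreeNode { TreeNode *left, *right; double key; };

  int main() {
    // Two differently-typed structures drawn from the same arena: without the
    // annotation, their memory nodes are merged and collapsed.
    ListNode *l = static_cast<ListNode *>(my_alloc(sizeof(ListNode)));
    TreeNode *t = static_cast<TreeNode *>(my_alloc(sizeof(TreeNode)));
    l->next = nullptr;  l->value = 1;
    t->left = t->right = nullptr;  t->key = 2.0;
    return l->value;
  }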

The second most common reason for a high percentage of non-type-safe accesses is the language runtime libraries for C++ and FORTRAN programs. These runtime libraries are compiled to LLVM and statically linked into the program12, and they tend to use non-type-safe constructs (the FORTRAN runtime is written in C, for example). This is also important because the implementation details of some runtime functions cause data structures to be collapsed in the main program (e.g., do_fio in libf2c, which is used to perform non-type-safe file I/O of scalars). A simple way to address this issue is to write transfer functions for each of the routines in these runtime libraries to more accurately describe the important points-to effects of each function.

Overall, we believe that these numbers show that a large portion of many programs written in C is type-safe, despite the fact that this property is not enforced by the language. Through the use of the macroscopic techniques described in this thesis, we attempt to take advantage of this type information where it is available, without requiring that the entire program be written in a type-safe language or be trivially type-safe (e.g., by disabling all unsafe operations).

3.5 Related Work

There is a vast literature on pointer analyses (e.g., see the survey by Hind [73]), but the majority of that work focuses on context-insensitive alias information and does not attempt to extract properties that are fundamental to macroscopic techniques (e.g., identifying disjoint data structure instances). For this reason, we focus on techniques whose goals are similar to ours.

12Note that the interprocedural optimizer is able to remove most of the obviously unused portions of these runtime libraries, through dead global and dead function elimination.

3.5.1 Shape Analyses

The most powerful class of related algorithms is the one referred to as "shape analysis" [84, 60, 117]. These algorithms are strictly more powerful than ours, allowing additional queries such as "is a given data structure instance a singly-linked list?" However, this extra power comes at a very significant cost in speed and scalability, particularly due to the need for flow-sensitivity and iteration [117]. Significant research is necessary before such algorithms are scalable enough to be used for moderate or large programs.

In contrast to shape analysis techniques, Data Structure Analysis is able to identify recursive data structures, but cannot determine whether something is a "doubly linked list" or "binary tree", and (because it is flow-insensitive) cannot make use of strong updates. Despite this, it is able to host a broad range of clients, such as those described throughout this work, and is efficient enough to be used on large programs.

3.5.2 Cloning-based Context-Sensitive Analyses

The prior work most closely related to our goals is the recent algorithm by Liang and Harrold [92], named MoPPA. The structure of MoPPA is similar to our algorithm, including Local, Bottom-Up, and Top-Down phases and a separate Globals Graph. For some programs, the analysis power and precision of MoPPA seem very similar to those of Data Structure Analysis. Nevertheless, their algorithm has several limitations for practical programs. MoPPA can only retain field-sensitivity for completely type-safe programs, and otherwise must turn it off entirely. It requires a precomputed call graph in order to analyze indirect calls through function pointers. It also requires a complete program, which can be a significant limitation in practice. Finally, MoPPA's handling of global variables is much more complex than that of Data Structure Analysis, which handles them as just another memory class. Both algorithms have similar compilation times, but MoPPA seems to require much more memory than our algorithm for larger programs: MoPPA runs out of memory analyzing povray3 with field-sensitivity on a machine with 640M of memory. In contrast, DSA can analyze the same program in less than one second using less than 10MB of memory.

Ruf's synchronization removal algorithm for Java [114] also shares several important properties with ours and with MoPPA, including combining context-sensitivity with unification, a non-iterative analysis with local, bottom-up, and top-down phases, and node flags to mark global nodes. Unlike our algorithm, his work requires a call graph to be specified, is limited to type-safe programs, and does not appear to handle incomplete programs.

Both the FICS algorithm of Liang and Harrold [91] and the Connection Analysis of Ghiya and Hendren [59] attempt to disambiguate pointers referring to disjoint data structures. However, both ignore heap locations that are not relevant for alias analysis, and both algorithms have higher complexity than ours.

Cheng and Hwu [25] describe a flow-insensitive, context-sensitive algorithm for alias analysis, which has three limitations relative to our goals: (a) they do not use cloning to represent distinct instances of memory objects allocated from the same program point; (b) they represent only relevant alias pairs, not an explicit heap model; and (c) they use a k-limiting technique that would lose connectivity information for nodes beyond k links (instead of representing recursive structures with cycles). Additionally, they allow a pointer to have multiple targets (as in Andersen's algorithm), which is more precise but introduces several iterative phases and incurs significantly higher time complexity than our algorithm.

Deutsch [46] presents a powerful heap analysis algorithm that is both flow- and context-sensitive and uses access paths represented by regular expressions to represent recursive structures efficiently. Although based on access paths, it appears possible to reconstruct heap information from the regular expressions created. In practice, however, his algorithm appears to have a much higher complexity than ours.

3.5.3 Non-cloning Context Sensitive Analyses

As discussed earlier, even many context-sensitive algorithms do not clone heap objects in different calling contexts. Instead, it is common to use more limited naming schemes for heap objects (often based on static allocation site13) [51, 143, 53, 138, 42]. This precludes obtaining information about disjoint data structure instances, which is fundamental to all applications of macroscopic data structure transformations. In the case of Figure 3.1, for example, all nodes of both lists are created at the same malloc site, which would force these algorithms to merge the memory nodes for the X and Y lists, preventing them from proving that the lists are disjoint.

13In principle, such algorithms can be implemented to use cloning, but the cost could become unbearably exponential [143, 53]. Making cloning efficient is the key challenge.

3.6 Data Structure Analysis: Summary of Contributions

Data Structure Analysis is a heap analysis algorithm designed to capture important properties of a program’s memory usage (including connectivity, type-safety, mod/ref information, etc) to provide the foundation for all of the macroscopic analyses and transformations described in this thesis. The algorithm uses a combination of techniques that balance heap analysis precision (context sensitivity, cloning, and field sensitivity) with efficiency (flow-insensitivity, unification, and the globals graph) and includes important properties to make it usable by many clients (an explicit heap model, incompleteness information, mod/ref and composition information).

There are three key novel aspects to our algorithm, a key property that has been used but not articulated before, and a result which has not been achieved so far:

(i) We describe a collection of new algorithmic techniques which are needed to achieve scalable context-sensitive analysis. These techniques can potentially be applied to other context-sensitive algorithms (even non-unification-based ones) to improve their analysis scalability. We show that DSA analyzes programs of up to two hundred thousand lines of code in about 3.2 seconds, and uses very little memory.

(ii) The algorithm incrementally discovers an accurate call graph for the program (and SCCs in the call graph) on the fly, using the call graph for parts of the analysis itself. The algorithm uses a novel extension of Tarjan's SCC-finding algorithm permitting incremental discovery of SCCs in the call graph, even when edges are dynamically discovered and added.

(iii) The algorithm uses a simple mechanism (fine-grain incompleteness tracking) to solve several hard problems in pointer analysis, including the use of speculative type information, dynamic discovery of the call graph without iteration, and conservatively correct handling of incomplete programs. This allows it to analyze portions of programs safely and allows modular analysis of programs (e.g., analyzing portions of the program at compile-time and combining the graphs at link-time).

(iv) The property that we believe is fundamental to achieving a scalable "fully context-sensitive" algorithm is the use of a unification-based approach. With this combination, it is extremely unlikely for the analysis representation to grow large, despite using a context-sensitive, field-sensitive representation. This is discussed in Section 3.2.5. Techniques that do not use unification (e.g., [140, 103]) have been shown to be scalable, but are one or two orders of magnitude slower than those that do (e.g., DSA and [92]).

(v) Data Structure Analysis is efficient and scalable enough to achieve analysis times that are comparable to non-context-sensitive subset-based algorithms. This result indicates that, given a target analysis time budget, a compiler engineer can choose to implement either a context-sensitive unification-based algorithm (like DSA) or a non-context-sensitive subset-based approach (such as Andersen's algorithm with refinements). Finally, this scalability makes DSA (and other macroscopic techniques) efficient enough to be reasonable for inclusion in a commercial compiler. DSA is at least an order of magnitude faster than previous fully context-sensitive algorithms, and is the first to require only a small fraction of the time needed to compile the program with a standard optimizing compiler (in this case, GCC).

In addition to the research contributions, we describe the key engineering details that make the algorithm efficient and scalable in practice. These implementation details do not affect the theoretical time bounds of the algorithm, but can make the algorithm hundreds of times faster on some programs.

We showed that the algorithm is extremely fast in practice (taking less than 3.5s to analyze a program that is over 200K LOC), uses very little memory (less than 20MB on the same program), and scales very well in analysis time and memory footprint across 40 benchmarks spanning 4 orders of magnitude of code size. Data Structure Analysis can be used to support a broad range of clients, including the macroscopic applications described throughout this thesis as well as standard alias analysis and mod/ref clients, a use which is described and evaluated in Chapter 4.

Chapter 4

Using Data Structure Analysis for Alias and IP Mod/Ref Analysis

Data Structure Analysis is an aggressive memory analysis which is designed to be powerful enough to support the macroscopic techniques described in this thesis, but is also fully capable of supporting traditional alias and mod/ref based techniques. This chapter describes and evaluates ds-aa, an alias and mod/ref analysis implementation built using the DSA framework, with several example clients.

The goal of this chapter is to show how a simple client is built using the DSA framework, described in Chapter 3, and show the alias and mod/ref precision provided by DSA compared against other analyses of similar compile-time cost. All of the evaluation in this section is performed in the context of the LLVM Compiler Infrastructure (Chapter 2).

4.1 Alias Analysis and Mod/Ref Information

The literature has thoroughly studied the computation and use of alias and mod/ref information.

See, e.g., [73], for a survey of some of the available work in the field. In this section, we describe the context for this work and the assumptions we make. All of the alias analysis implementations described in this chapter are built in and follow the conventions of the LLVM Alias Analysis Framework [85].

Note that, in the LLVM compiler, all automatic (stack) scalar variables that do not have their address taken are promoted to SSA values, and thus are not candidates for alias analysis (it is not possible to take the address of an SSA register). In LLVM, there are four operations that access memory: load, store, call, and invoke. See Section 2.2 for more details.

Alias analysis and mod/ref information are typically used by two very different forms of clients: optimizations and safety checking/program understanding tools. The two types of clients are characterized by how they use the resulting information and their tolerance for errors. An optimizing compiler requires that the pointer analysis be safe (i.e., that it return conservative information) while checking and program understanding tools generally do not. Because the primary focus of this thesis is program optimization, all analyses described and evaluated here (including DSA) are conservatively correct: if they cannot determine, for all executions of the program, that a statement is true, they do not assert it. For example, if an analysis cannot prove that two pointers will never alias, it must return "MayAlias" (defined below).

4.1.1 Alias Analysis Assumptions and Applications

Alias analysis, in this context, is a static compiler analysis which performs some amount of up-front inspection of the program, builds data structures to summarize its results, then answers queries of the form “alias(P1, S1, P2, S2)”, where P1 and P2 are pointers in the program and S1 and S2 are constant integers, which represent the size in bytes of the target of each pointer. This query can return one of three results:

• MustAlias: P1 is always exactly equal to P2.

• NoAlias: The two ranges [P1...P1 + S1) and [P2...P2 + S2) never overlap.

• MayAlias: The analysis cannot prove that the result is either MustAlias or NoAlias (i.e., the ranges might overlap).

Alias analysis can support a wide variety of different clients, including devirtualization, common subexpression elimination, scalar promotion, etc. (even optimizations as simple as transforming memmove calls to memcpy calls if the source and destination ranges can never overlap). Figure 4.1 gives two examples to demonstrate how alias analysis can be used to prove the safety of redundant load elimination (a form of Common Subexpression Elimination) and load hoisting (a form of Loop Invariant Code Motion). In Figure 4.1 (a) and (c), if an alias analysis can guarantee that P1 and P2 can never alias, CSE and LICM can transform the examples into the code in Figure 4.1 (b) and (d) respectively, which execute fewer dynamic loads from P1.

(a) CSE Input:
    t1 = *P1;
    *P2 = t2;
    t3 = *P1;

(b) CSE Desired Result:
    t1 = *P1;
    *P2 = t2;
    t3 = t1;     // load elim!

(c) LICM Input:
    do {
      t1 = *P1;
      ... use t1
      *P2 = t2;
    } while (...);

(d) LICM Desired Result:
    t1 = *P1;    // hoisted!
    do {
      ... use t1
      *P2 = t2;
    } while (...);

Figure 4.1: Results of Example Pointer Analysis Clients
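To make the memmove-to-memcpy example mentioned above concrete, here is a minimal sketch of such a client. The alias() function is a hypothetical stand-in for the compiler's alias query (implemented here as a runtime range check purely for illustration); a real client would consult the static analysis instead.

  #include <cstdint>
  #include <cstring>

  enum AliasResult { NoAlias, MayAlias, MustAlias };

  // Hypothetical stand-in for the static alias query: compare the actual
  // address ranges, which a compiler would instead have to prove at compile
  // time.
  AliasResult alias(const void *p1, std::size_t s1,
                    const void *p2, std::size_t s2) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p1);
    std::uintptr_t b = reinterpret_cast<std::uintptr_t>(p2);
    if (a + s1 <= b || b + s2 <= a)
      return NoAlias;
    return MayAlias;
  }

  // A memmove must tolerate overlapping ranges; if the ranges provably do not
  // overlap, the call can be strength-reduced to the cheaper memcpy.
  void *copy_bytes(void *dst, const void *src, std::size_t n) {
    if (alias(dst, n, src, n) == NoAlias)
      return std::memcpy(dst, src, n);
    return std::memmove(dst, src, n);
  }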

4.1.2 Mod/Ref Analysis Assumptions and Applications

Like alias analysis, mod/ref analysis is a well studied static compiler analysis which performs an up-front analysis and then answers queries from client analyses and transformations. Our implementation supports two forms of mod/ref query. The first query is of the form "modref(I1, I2)", where I1 and I2 are two primitive operations in the program. This query can return one of several forms of dependence between the two operations, and supports general call/call mod/ref information, but is not described in detail in this work.

The second query is of the form “modref(I, P , S)”, where I is a primitive operation, P is a pointer in the program, and S is a constant integer size. This query can return one of four possible results:

• NoModRef: I does not access the memory defined by the range [P...P + S).

• Ref: I might read the range [P...P + S), but is guaranteed not to modify it.

• Mod: I might modify the range [P...P + S), but is guaranteed not to read it.

• ModRef: I might modify or read the range [P...P + S).

Mod/ref information can be used for a variety of purposes, such as dead store elimination, program slicing, and redundancy elimination. When used for redundancy elimination, mod/ref information is strictly more general than alias analysis information, as it allows the client to query about the mod/ref effect of function calls. Figure 4.2 gives two examples where mod/ref information for function calls allows the elimination of a potentially redundant load and the hoisting of a potentially loop invariant load from a loop. If the mod/ref analysis can prove that 'func' never modifies P1 (i.e., the modref query returns NoModRef or Ref), it is legal for CSE to optimize (a) to (b) and LICM to optimize (c) to (d).

(a) CSE Input:
    t1 = *P1;
    func();
    t3 = *P1;

(b) CSE Desired Result:
    t1 = *P1;
    func();
    t3 = t1;     // load elim!

(c) LICM Input:
    do {
      t1 = *P1;
      ... use t1
      func();
    } while (...);

(d) LICM Desired Result:
    t1 = *P1;    // hoisted!
    do {
      ... use t1
      func();
    } while (...);

Figure 4.2: Example clients of mod/ref results

While the computation and use of mod/ref information have been investigated in the literature, context-sensitive analyses tend to be either limited to cases with very simple aliasing [11, 36, 35] or too slow for practical use [83, 32, 130, 107, 97]. Because of this, the use of context-sensitive mod/ref analyses (which permit aliasing) has largely been unattractive for inclusion in a commercial-grade compiler. Because DSA is very efficient and can directly provide context-sensitive mod/ref information, we feel it is very important to consider it.

Note that mod/ref information nicely encompasses several ad-hoc optimizations performed by many compilers (e.g. optimizing “pure” and “const” functions, which do not access memory or only read memory), simplifies the implementation of many clients, and is more general than using traditional alias queries for many clients (such as redundancy elimination).
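For reference, the GCC attributes alluded to above look like the following; this is an illustrative sketch (not code from the thesis) of how 'const' and 'pure' encode the two degenerate mod/ref cases that general mod/ref information subsumes.

  // 'const' promises the function accesses no memory at all (its result depends
  // only on its arguments); 'pure' promises it may read memory but never writes
  // it.
  __attribute__((const)) int square(int x) { return x * x; }

  __attribute__((pure)) int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
      s += a[i];
    return s;
  }

  int client(const int *a, int n) {
    // With no intervening writes, the second call to the 'pure' function can
    // be CSE'd away, and calls to the 'const' function can be freely reordered
    // or hoisted out of loops.
    int t1 = sum(a, n);
    int t2 = sum(a, n);
    return t1 + t2 + square(n);
  }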

Note that it is possible to use a context-sensitive interprocedural data flow analysis post-pass to construct context-sensitive mod/ref information from a non-context-sensitive alias analysis [116], but we have not implemented and do not evaluate this option here.

4.2 Implementing Alias and Mod/Ref Analysis with DSA: ds-aa

The Data Structure Analysis algorithm described in Chapter 3 constructs several sets of graphs which capture a general-purpose abstraction of the program memory image. These graphs are designed to represent important information about the memory usage of the program without tying the representation to a specific client. In this section, we describe ds-aa, an alias analysis implementation that uses the results of DSA to answer alias analysis queries.

4.2.1 Computing Alias Analysis Responses

DSA consists of three primary passes, each of which computes a set of graphs: the Local pass (Section 3.2.2), the Bottom-Up pass (Section 3.2.3), and the Top-Down pass (Section 3.2.4). Because DSA keeps track of what information is "complete" (see Section 3.1.1) at each stage of construction, we could use any of these three sets of graphs to implement alias analysis.

In practice, we use the TD graphs for alias analysis, as they have the most complete information available in them: the graph for a function includes the effects of all callers and all callees, so the only incomplete information remaining is due to information that leaks in from outside of the analysis scope (e.g. memory which is passed to or returned from an external function). To answer an “alias(P1, S1, P2, S2)” query, ds-aa performs the following steps:

1. Look up the TD DSGraph G which contains P1 and P2 in its EV mapping.

2. Let the node/field pairs ⟨n1, f1⟩ = EV(P1) and ⟨n2, f2⟩ = EV(P2), using G's EV mapping.

3. If C ∉ flags(n1) and C ∉ flags(n2), return MayAlias (if both nodes contain incomplete information, no judgement can be made).

4. If n1 ≠ n2, return NoAlias (the pointers point to two distinct nodes).

5. If not overlap(offsetof(f1), offsetof(f1)+S1, offsetof(f2), offsetof(f2)+S2), return NoAlias (if the fields cannot overlap, the pointers point to distinct fields).

6. Return MayAlias.

Steps #1 and #2 perform simple map lookups to find the relevant information. Step #3 ensures that ds-aa is safe for incomplete programs: if both pointers point to non-complete nodes, no conclusion about them can be made. Note that if n1 is complete and n2 is not (or vice versa), we know that the nodes are distinct and that n1 can never be merged with n2, no matter what code is outside of the analysis scope. If n1 could ever be merged with n2, it could not be marked complete, as described in Section 3.1.1.

Step #4 draws the conclusion that if the pointers point to distinct nodes (and at least one is marked complete, due to step #3), the pointers can never alias. Step #5 uses field sensitivity to refine the alias analysis in the case when the pointers point to the same node. In this case, if the two fields do not overlap, ds-aa can conclude NoAlias. If neither Step #4 nor Step #5 is able to determine non-aliasing, ds-aa must return MayAlias.

Notice that DSA is incapable of returning must-alias information. In particular, even if n1 = n2 and f1 = f2, DSA cannot prove that both pointers point to the same dynamic memory object, only that they are in the same class (for example, it cannot determine that the pointers point to the exact same linked list node). If a node only contains Global information (no heap, stack, or unknown memory), we could conceptually provide must-alias information in the cases where our aggregate model does not make this unsound (it would be unsound, for example, when an entire array is collapsed to a single element). We have not investigated this possibility.
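A minimal sketch of this query logic is shown below. The DSGraph and DSNode types here are hypothetical stand-ins for the real DSA data structures, and step #1 (locating the Top-Down graph) is assumed to have already happened.

  #include <cstdint>
  #include <map>

  enum AliasResult { NoAlias, MayAlias, MustAlias };

  struct DSNode { bool complete; };                       // the 'C' flag
  struct Field  { DSNode *node; std::uint64_t offset; };  // <node, field> pair
  struct DSGraph { std::map<const void *, Field> EV; };

  // Steps #2-#6 of the ds-aa alias query for alias(P1, S1, P2, S2).
  AliasResult aliasQuery(const DSGraph &G,
                         const void *P1, std::uint64_t S1,
                         const void *P2, std::uint64_t S2) {
    Field F1 = G.EV.at(P1), F2 = G.EV.at(P2);        // step #2
    if (!F1.node->complete && !F2.node->complete)
      return MayAlias;                               // step #3: both incomplete
    if (F1.node != F2.node)
      return NoAlias;                                // step #4: distinct nodes
    bool fieldsOverlap = F1.offset < F2.offset + S2 &&
                         F2.offset < F1.offset + S1;
    if (!fieldsOverlap)
      return NoAlias;                                // step #5: disjoint fields
    return MayAlias;                                 // step #6
  }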

4.2.2 Computing Mod/Ref Responses

The steps required to compute a safe answer to the "modref" queries described in Section 4.1.2 depend on the different instructions passed in as arguments. As mentioned above, there are four operations that (directly or indirectly) can access memory: load, store, call, and invoke. The LLVM framework handles the simple mod/ref queries automatically (e.g., an add operation mod/refs nothing), and dispatches the remaining queries to the "alias" query above (e.g., to determine if a store mods a location being loaded), to a call/call dependence tester, or to the second modref query above, which checks a call against a memory range (e.g., to test a load against a call).

DSA has all of the information it needs to compute context-sensitive mod/ref information for function calls. In particular, the Bottom-Up graphs capture the direct and indirect mod/ref effects of calling the function, in any context, at a per DSNode granularity. While this information is general enough to even allow testing for call/call dependence, none of our clients currently use this information. As such, we only describe call/location mod/ref analysis here.

To respond to a "modref(I, P, S)" query, where I is a call or invoke, ds-aa performs the following steps:

1. Look up the Top-Down DSGraph G for the function containing I, which has P in its EV mapping.

2. Let the node/field pair ⟨n, f⟩ = EV(P), using G's EV mapping.

3. If C ∉ flags(n), return ModRef (the memory is incomplete, so no conclusion can be drawn about whether the callees access it).

4. Let AC be the set of actual callees for I. If the actual callees are unknown, return ModRef.

5. Remove any external functions from AC.

6. If AC is empty, return NoModRef (AC must have been empty1 or have contained only external functions; since the memory does not escape the program, external functions cannot mod/ref it).

7. Union together all of the Bottom-Up graphs for the callees, merging the corresponding formals for each function:

   CG = Empty DS Graph
   ∀F ∈ AC:
     cloneGraphInto(BUDSG(F), CG)
     mergeArguments(F, CG)

8. Compute the mapping M from nodes in G to nodes in CG, as defined by the actual argument/formal argument bindings at I and by the global variables common to G and CG.

9. Check the nodes from the BU graphs for mod/ref flags:

   ModRefResult R = {}
   ∀nCG ∈ M(n):
     if (M ∈ flags(nCG)) R = R ∪ {Mod}
     if (R ∈ flags(nCG)) R = R ∪ {Ref}
   Return R

1If the set of actual callees for a call site is empty, the call site must be dynamically unreachable.

Steps #1 and #2 perform simple lookups to get the information we need. Step #3 checks for incomplete information: if P points to incomplete memory, we do not draw any conclusion about it. Step #4 computes the actual callees for a call site (note that our implementation currently only handles direct calls, but we could easily add support for indirect calls). Step #6 implements a trivial form of mod/ref analysis that any conservatively-correct whole-program analysis can provide: calls to external functions are known not to access memory that does not escape from the program2.

Given a direct call to a function in the program, Step #8 computes the relevant mapping from nodes in the TD graph G to the nodes in the CG graph. Because the bottom-up graphs for the functions in AC were inlined into the caller graph, we know that the caller graph is at least as constrained as the callee graphs (and may be more so). As such, we compute (and cache) the mapping from nodes in G to nodes in CG defined by the call site I. Finally, Step #9 iterates over all of the nodes that n maps to in the CG graph, and unions together the mod/ref information from these nodes to form a result. Note that if n is never accessed by the callee, it will not map to any nodes, and thus we will compute a NoModRef result.
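The sketch below condenses this query into code. It is a simplified illustration with hypothetical types: the per-callee mod/ref effects are assumed to have already been mapped into caller-graph node ids (steps #7 and #8), and the unknown-callee case of step #4 is omitted.

  #include <set>
  #include <vector>

  enum ModRefResult { NoModRef = 0, Ref = 1, Mod = 2, ModRef = 3 };

  // Hypothetical summary of one non-external callee: the caller-graph node ids
  // it may write and the ones it may read (its M and R flags after mapping).
  struct CalleeEffects {
    std::set<int> modNodes;
    std::set<int> refNodes;
  };

  // 'complete' is the C flag on the node P points to; 'mappedNodes' are the
  // callee-graph node ids that node maps to; 'callees' are the non-external
  // actual callees of the call site.
  ModRefResult modRefQuery(bool complete,
                           const std::vector<int> &mappedNodes,
                           const std::vector<CalleeEffects> &callees) {
    if (!complete)
      return ModRef;                        // step #3: incomplete memory
    if (callees.empty())
      return NoModRef;                      // step #6: nothing left can touch it
    int result = NoModRef;
    for (const CalleeEffects &CE : callees) // step #9: union the callee effects
      for (int n : mappedNodes) {
        if (CE.modNodes.count(n)) result |= Mod;
        if (CE.refNodes.count(n)) result |= Ref;
      }
    return static_cast<ModRefResult>(result);
  }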

Note that DSNodes in CG only track mod/ref information on a per-node basis. DSA could be trivially extended to support more precise mod/ref information by tracking it on a per-field basis. To do this, we would expand the M and R bits into bit-vectors that track one bit for every field in a node. This would have slightly higher overhead than tracking one bit per node, but would improve mod/ref precision for programs that use structures heavily.

ds-aa mod/ref precision could also be improved for nodes that escape the program. In particular, even if a node is not marked complete (which is handled above by Step #3), a call does not mod/ref the node if all of the DS nodes mod/ref'd by the call are complete. The check in Step #3 above could be enhanced to take this into consideration.
2Note that this judgement relies on the assumption that the externally called function cannot make a direct call back into the program. This assumption is guaranteed by the standard "whole program" optimization flags offered by many aggressive compilers, and is always safe for the programs in our test suite.

4.3 Alias Analysis Implementations for Comparison

In order to evaluate the effectiveness of DSA for standard alias analysis clients, we need to compare it against the precision of other well-known algorithms. In this section, we briefly describe the intraprocedural algorithm (local) and the three interprocedural algorithms (steens-fi, steens-fs, and anders) that we compare ds-aa to. Note that the relative precision of Steensgaard's and Andersen's algorithms has been characterized by other studies in the past (e.g., [72]): we chose this combination of analyses as a way to evaluate how the various design decisions impact the precision of DSA (including context-sensitivity, field-sensitivity, and the choice of a unification-based approach).

Note that all of these analyses handle incomplete programs in a conservatively correct manner.

Also, we are careful to use the same set of function stubs for known external functions with each of the interprocedural algorithms.

4.3.1 local Alias Analysis

The local alias analysis is an aggressive local analysis which attempts to disambiguate pointers with a large collection of ad-hoc rules (this is the LLVM "-basicaa" pass). For example, it knows that "A[i]" doesn't alias "B[i]" if "A" and "B" are two different global, stack, or heap objects. It knows that "A[1]" doesn't alias "A[2]" and that "A->field1" doesn't alias "A->field2", and it knows the alias and mod/ref properties of automatic variables that do not have their "addresses taken"3, etc.

The local analysis also provides mod/ref information for standard C library functions that never read or write memory (such as "sin" and "cos"), and for those that only read memory (such as "strcmp" and "strlen"). As of this writing, it does not model functions that may modify errno or other memory (such as "sqrt" and "log"). It is also smart enough to know that "const" globals can never be modified.
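The hypothetical fragment below (not from any benchmark) illustrates the kinds of facts these ad-hoc rules can establish without any interprocedural analysis.

  #include <cmath>
  #include <cstring>

  struct Point { double x, y; };

  double example(double *A, double *B, Point *P, int i) {
    A[1] = 1.0;                  // A[1] and A[2] cannot alias, so neither
    A[2] = 2.0;                  // store disturbs the other
    P->x = A[i];                 // P->x and P->y are distinct fields of the
    P->y = B[i];                 // same object and cannot alias
    double t = std::sin(P->x);         // sin is known not to access memory
    return t + std::strlen("hello");   // strlen is known to only read memory
  }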

While the local analysis is extremely fast and very simple, it is able to provide a large amount of alias information, particularly for codes that make heavy use of direct accesses to global and local variables. As others have observed [62], in most settings it makes the most sense to use an aggressive local analysis in combination with interprocedural techniques. For this reason, the LLVM alias

3Note that, in practice, this only occurs for aggregates like structs or arrays. Scalar variables are promoted to SSA values as described in Section 4.1.

analysis framework supports chaining of analyses together: if one analysis cannot answer a query precisely, the next analysis in the chain is queried and so on.

In this evaluation, when any of the four interprocedural analyses (steens-fi, steens-fs, anders, ds-aa) is unable to resolve a query, it chains to the local algorithm. Thus, when comparing the interprocedural algorithms, the local algorithm is the baseline. This avoids overstating the contribution of the interprocedural analyses.

4.3.2 steens-fi Alias Analysis

Steensgaard’s flow-insensitive, context-insensitive, and field-insensitive alias analysis [129] is a well known algorithm that computes an approximation of the heap in linear space and almost linear time.

It uses Tarjan’s union-find data structure to efficiently partition memory objects into equivalence classes. This algorithm is extremely fast, but produces a coarse approximation of the heap.
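As a reference for readers unfamiliar with the data structure, the following is a minimal union-find sketch of the kind used to maintain these equivalence classes; it is illustrative only and is not the thesis's implementation.

  #include <numeric>
  #include <utility>
  #include <vector>

  // Minimal disjoint-set (union-find) structure: path halving plus union by
  // size gives the near-linear behavior Steensgaard's algorithm relies on.
  struct UnionFind {
    std::vector<int> parent, size;
    explicit UnionFind(int n) : parent(n), size(n, 1) {
      std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
      while (parent[x] != x) {
        parent[x] = parent[parent[x]];  // path halving
        x = parent[x];
      }
      return x;
    }
    void unite(int a, int b) {          // merge two alias classes
      a = find(a); b = find(b);
      if (a == b) return;
      if (size[a] < size[b]) std::swap(a, b);
      parent[b] = a;
      size[a] += size[b];
    }
  };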

We name our implementation of Steensgaard's algorithm steens-fi and implement it using the DSA framework. In particular, we use the standard DSA local analysis phase, then merge all of the computed graphs into a single graph for the whole program, then perform actual/formal argument binding. Because we want to evaluate a field-insensitive version of Steensgaard's algorithm, we artificially collapse all nodes in the resultant graph to discard any field sensitivity captured by DSA.

As a result, our steens-fi implementation differs from Steensgaard's algorithm in two ways. First, it uses the standard DSA completeness tracking to make the analysis result sound for incomplete programs. Second, it keeps the mod/ref bits for memory objects, allowing it to make judgements about memory that is either never read or never stored to. This information can occasionally be used to mark global variables 'const' if they are never modified, for example.

4.3.3 steens-fs Alias Analysis

The steens-fs alias analysis is identical to steens-fi, except that it does not artificially collapse nodes in the resultant graphs. This produces a field-sensitive variant of Steensgaard’s analysis, similar in spirit to that described in [128].

4.3.4 anders Alias Analysis

The anders alias analysis is a simple implementation of Andersen's flow-insensitive, context-insensitive, field-insensitive subset-based pointer analysis [6]. It is strictly more powerful than steens-fi, and while the worst-case complexity is O(n^3), with refinements [52, 113, 105] it can be made to run very fast in practice.

Our implementation of Andersen’s analysis is accurate, but lacks the key refinements which make it efficient in practice. Because our implementation is very slow, we do not compare the analysis time of our implementation against DSA or any other algorithm. We believe that a well-engineered implementation of a context-insensitive Andersen’s analysis should require analysis time comparable to the analysis time used by DSA (e.g. seconds for programs that are hundreds of thousands of lines of code).

The only difference between our implementation of Andersen's analysis and the standard formulation is the introduction of a "universal" node, which represents information flow into and out of the program. The universal node is a distinguished memory location which points to itself. After constraint solving, any pointers that target the universal node are known to point to memory that escapes the program, allowing conservative whole-program analysis. Our implementation also explicitly tracks pointers to the null object (the virtual object whose address is the null pointer). This allows us to track which pointers may point to null.
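For illustration, here is a naive sketch of the core inclusion-constraint solver with such a universal node seeded to point to itself; the constraint kinds, node numbering, and fixpoint loop are assumptions of this sketch, and a practical implementation would add the scalability refinements cited above.

  #include <set>
  #include <vector>

  // Andersen-style inclusion constraints over variable/node ids, where node 0
  // is the distinguished "universal" node standing for everything outside the
  // analyzed program.
  struct Constraint {
    enum Kind { AddrOf, Copy, Load, Store } kind;
    int dst, src;  // AddrOf: dst >= {src};  Copy: dst >= src;
                   // Load:   dst >= *src;   Store: *dst >= src
  };

  std::vector<std::set<int>> solve(int numNodes,
                                   const std::vector<Constraint> &cs) {
    std::vector<std::set<int>> pts(numNodes);
    pts[0].insert(0);  // the universal node points to itself
    bool changed = true;
    while (changed) {  // naive iteration to a fixpoint
      changed = false;
      auto addAll = [&](std::set<int> &to, const std::set<int> &from) {
        for (int x : from)
          changed |= to.insert(x).second;
      };
      for (const Constraint &c : cs) {
        switch (c.kind) {
        case Constraint::AddrOf:
          changed |= pts[c.dst].insert(c.src).second;
          break;
        case Constraint::Copy:
          addAll(pts[c.dst], pts[c.src]);
          break;
        case Constraint::Load:
          for (int p : pts[c.src]) addAll(pts[c.dst], pts[p]);
          break;
        case Constraint::Store:
          for (int p : pts[c.dst]) addAll(pts[p], pts[c.src]);
          break;
        }
      }
    }
    return pts;  // pts[v] containing 0 means v may point to escaped memory
  }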

4.4 Analysis Precision with a Synthetic Client

In order to evaluate the precision of an alias analysis, we simply execute some number of clients on the full suite of benchmarks introduced in Section 3.4.1 with each of the analyses we are evaluating.

Clearly it is infeasible to evaluate all possible alias scenarios in this study, so we focus on two here.

This section evaluates the precision of ds-aa and the other alias analysis implementations with a synthetic client, which attempts to compute raw alias and mod/ref analysis precision metrics.

Section 4.5 evaluates the analyses using a specific client, which performs a suite of loop memory optimizations such as hoisting loads, promoting memory to a register, etc.

The synthetic client evaluated in this section attempts to examine the precision of all alias information that could possibly be used by an intraprocedural client. Because we cannot evaluate the precision of these algorithms for all possible clients, this experiment aims to provide a reasonable metric which can be used to evaluate suitability for standard intraprocedural clients which make queries such as those described in Section 4.1. All of the evaluation in this section is performed in the context of the LLVM Compiler Infrastructure (Chapter 2).

Evaluating the precision of alias analyses is very difficult if the different analyses have widely varying implementation details. For example, many papers use the size of "points-to sets" to evaluate the precision of an analysis: the smaller the set, the better. This metric works reasonably well if the analysis implementations all use the same system for naming memory objects in the program, but produces incomparable results otherwise. In particular, context-sensitive analyses may clone an object multiple times: two pointers may point to memory that is allocated at the same source line, yet the analysis can determine that the pointers never alias.

Because of this difficulty, and because we don't want the client to know anything about the implementation of the pointer analysis, we use a different approach. LLVM includes a synthetic alias analysis client, "AA-EVAL", which evaluates the alias and mod/ref precision of an arbitrary pointer analysis implementation. It contains two phases: the first gathers alias analysis information, and the second gathers mod/ref information.

4.4.1 Alias Precision

In order to evaluate alias analysis precision, the AA-EVAL client iterates over each function in the program. Within each function, AA-EVAL builds the set of pointers that are used by the various memory accesses in the body of the function (e.g., by load and store instructions). Given this set of pointers, it does a simple O(N^2) alias query of every pointer against all of the others4 and counts the alias responses. Because MayAlias is the only response that indicates a lack of information, an analysis with a lower may-alias response percentage is more precise than one with a higher percentage of may-alias responses.

This portion of the AA-EVAL client produces a metric that is very similar to the "alias frequencies" described in [42]. The primary difference between that work and this evaluation is that they

4Because alias relations are symmetric [alias(X, Y) = alias(Y, X)] and a pointer always must-aliases itself [alias(Z, Z) = MustAlias], AA-EVAL only performs N^2/2 queries.
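A sketch of this per-function evaluation loop is shown below; the alias() stub and the integer pointer ids are stand-ins for the analysis under test and for the pointer values AA-EVAL collects.

  #include <cstddef>
  #include <vector>

  enum AliasResult { NoAlias = 0, MayAlias = 1, MustAlias = 2 };

  // Conservative stub standing in for whichever alias analysis is being
  // evaluated.
  AliasResult alias(int /*p1*/, int /*p2*/) { return MayAlias; }

  // Query every pair of pointers used by a function's memory operations once
  // (alias() is symmetric, and a pointer trivially must-aliases itself) and
  // tally the three possible responses.
  void evaluateFunction(const std::vector<int> &pointers, long counts[3]) {
    for (std::size_t i = 0; i < pointers.size(); ++i)
      for (std::size_t j = i + 1; j < pointers.size(); ++j)
        ++counts[alias(pointers[i], pointers[j])];
  }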

[Bar chart: for each benchmark, the percentage of AA-EVAL alias queries answered "MayAlias" by the local, steens-fi, steens-fs, anders, and ds-aa analyses (0-100%).]
Figure 4.3: Percent of AA-EVAL Alias Queries Returned “May Alias”

[Bar chart: for each benchmark, the percentage of AA-EVAL mod/ref queries answered "May Mod or Ref" by the local, steens-fi, steens-fs, anders, and ds-aa analyses (0-100%).]
Figure 4.4: AA-EVAL Mod/Ref Query Responses of “May Mod or Ref”

[Bar chart: for each benchmark, the percentage of AA-EVAL mod/ref queries answered "No Mod or Ref" by the local, steens-fi, steens-fs, anders, and ds-aa analyses (0-100%).]
Figure 4.5: AA-EVAL Mod/Ref Query Responses of “No Mod or Ref”

[Bar chart: for each benchmark, the percentage of AA-EVAL mod/ref queries answered "May Only Ref" by the local, steens-fi, steens-fs, anders, and ds-aa analyses (0-50%).]
Figure 4.6: AA-EVAL Mod/Ref Query Responses of “May Only Ref”

[Bar chart: for each benchmark, the percentage of AA-EVAL mod/ref queries answered "May Mod Only" by the local, steens-fi, steens-fs, anders, and ds-aa analyses (0-50%).]
Figure 4.7: AA-EVAL Mod/Ref Query Responses of “May Mod Only”

[Stacked bar chart: for each benchmark, the breakdown of ds-aa mod/ref responses into Mod & Ref, Mod Only, Ref Only, and No Mod/Ref (0-100%).]
Figure 4.8: AA-EVAL Mod/Ref Query Responses for ds-aa

only consider one level of pointer dereference, where we consider all levels. For example, for the statement "*p = **q", we would count all of the alias pairs <*p,*q>, <*q,**q>, and <*p,**q>, whereas Das et al. only count the last. A secondary difference is that we consider must-alias information to be accurate, whereas they only count no-alias as a precise response. We believe this second difference to be very minor, as the only analyses capable of returning must-alias information in this evaluation are the local and anders analyses, which should not impact the evaluation of the DSA-based analyses.

Figure 4.3 shows the percentage of AA-EVAL queries that return a MayAlias response for each of the benchmarks in our suite and for each alias analysis implementation. All of the charts in this section are grouped by benchmark suite and ordered according to the number of memory instructions in the program (to match tables in Section 3.4). Thorough inspection of this figure confirms and validates several properties of pointer analyses which have been previously discussed in the literature, and shows that DSA provides very accurate points-to information in addition to being able to support the macroscopic techniques described in this thesis.

• Trivial local analysis can successfully resolve a large number of queries, particularly in simple array-based programs that do not pass values heavily by reference [62]. In particular, three FORTRAN programs have over 75% of their alias queries disambiguated without any interprocedural analysis at all, and 10 programs across the suite have over 50% of their alias queries resolved by the local algorithm. We believe that this shows the importance of evaluating interprocedural analyses together with a local algorithm, to avoid overstating the contribution of the interprocedural technique.

• Any interprocedural analysis is far better than none in many cases (e.g., 256.bzip2, 186.crafty, 175.vpr, 179.art, and 129.compress), even if it is as simple as Steensgaard's imprecise (but very fast) analysis. This argues for every compiler implementing some form of interprocedural pointer analysis if possible. Because Steensgaard's algorithm is the most straightforward to implement, and has an excellent worst-case complexity in its simplest form, it is probably the best candidate for an implementor who does not want to invest much time in pointer analysis.

• Field sensitivity can substantially improve the precision of unification-based analysis in programs that use multiple instances of structures with different types. While it makes no precision difference for a large number of programs, steens-fs is noticeably more precise than steens-fi for 188.ammp, fpgrowth, 175.vpr, 300.twolf, 176.gcc, 179.art, NAMD, povray, and for a large number of smaller programs that are not included in this data set (e.g., the Olden suite). If implementing a unification-based approach, adding field sensitivity should be considered. Note that steens-fs is more precise than anders for 188.ammp, due to the large contribution of field sensitivity.

• All other factors being equal, subset-based analysis is far more precise than unification-based analysis. While it is clear from the formulation that subset-based analysis is at least as precise as unification-based analysis, the numbers show that in many cases a subset-based analysis (such as anders) is far superior in practice. Given a choice between implementing basic Steensgaard's algorithm and Andersen's algorithm, and given the resources to implement all of the refinements needed to make Andersen's algorithm scalable in practice, Andersen's should be far preferred.

• Adding context sensitivity to a unification-based pointer analysis can allow it to meet or exceed the precision of a subset-based analysis in most cases. Others have shown that either limited (e.g., [41]) or full (e.g., [92]) context sensitivity can be used to achieve this added precision. Our experience (matching that of other researchers [41, 53]) is that bidirectional argument binding is the leading cause of precision loss in a unification-based analysis. This problem can be solved either by using context sensitivity or by using a subset-based analysis. Note that adding context sensitivity to a subset-based analysis has been shown to provide only a marginal increase in precision [55] and can be impractically expensive [103, 102].

• Using a cloning-based context-sensitive analysis can yield far more accurate points-to results than using a static naming scheme for heap and stack objects [92, 103, 140]. The effect is most pronounced in programs that use a large amount of heap allocated data and have few static allocation sites. For example, the 175.vpr, 300.twolf, and 252.eon programs have simple wrapper functions around malloc that prevent the context-insensitive algorithms from detecting the independence of any memory allocated through these wrappers. While special-purpose tricks [62] can be used to address this problem in limited cases, only full context sensitivity can address the problem in its full generality. Note that context-sensitive algorithms that name heap objects by their static allocation site will suffer the same precision problems as context-insensitive algorithms for such programs.

Overall, these numbers show that the raw alias disambiguation precision of DSA is comparable to Andersen's algorithm in many cases (256.bzip2, 164.gzip, 183.equake, 176.gcc, 129.compress, etc.), only occasionally slightly worse (197.parser, 255.vortex), and far better in several (181.mcf, 175.vpr, 186.crafty, 300.twolf, 172.mgrid, fpgrowth, NAMD, etc.). The cases where Andersen's algorithm is more precise than DSA are cases where the precision advantage of a subset-based (instead of unification-based) approach outweighs the precision advantage of using a context-sensitive (instead of context-insensitive) approach.

4.4.2 Mod/Ref Precision

To evaluate the precision of mod/ref information returned by an implementation, the AA-EVAL client iterates over each function in the program, builds the list of pointers used in the function, and collects a list of all of the function calls in the body of the function. It then performs O(M ∗N) mod/ref queries (to determine whether the analysis can decide whether a function call can modify or read the memory location) and counts the frequencies of the various results. Note that AA-

only queries mod/ref information of locations against calls; it does not query for call/call dependence information.
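To make the query pattern concrete, the following sketch (in C, with stand-in types and a function-pointer placeholder for the analysis being tested) shows the shape of this O(M × N) loop. It is illustrative only and is not the actual AA-EVAL implementation or the LLVM alias-analysis interface.

    /* Possible answers to a mod/ref query, mirroring the four responses discussed
       in the text (names are illustrative, not the actual implementation's). */
    enum ModRefAnswer { NO_MOD_REF, REF_ONLY, MOD_ONLY, MOD_AND_REF };

    struct ModRefCounts { unsigned No, Ref, Mod, ModRef; };

    /* Shape of the AA-EVAL client for one function: given its M pointers and N
       call sites, issue all M*N location-vs-call queries and tally the answers.
       'query' stands in for whichever alias analysis implementation is evaluated. */
    struct ModRefCounts evalModRef(void **Pointers, unsigned M,
                                   void **CallSites, unsigned N,
                                   enum ModRefAnswer (*query)(void *Call, void *Ptr)) {
      struct ModRefCounts C = {0, 0, 0, 0};
      unsigned i, j;
      for (i = 0; i != N; ++i)
        for (j = 0; j != M; ++j)
          switch (query(CallSites[i], Pointers[j])) {
          case NO_MOD_REF:  ++C.No;     break;
          case REF_ONLY:    ++C.Ref;    break;
          case MOD_ONLY:    ++C.Mod;    break;
          case MOD_AND_REF: ++C.ModRef; break;
          }
      return C;
    }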

Figures 4.4, 4.5, 4.6, and 4.7 evaluate the various AA-EVAL response percentages for each benchmark in our suite, and Figure 4.8 shows the composite results for just DSA in one figure (as a different way of visualizing the DSA data).

The results demonstrate several aspects of our analysis implementations and how they provide mod/ref information:

• The local analysis is capable of providing mod/ref information for a wide range of standard

C library functions (e.g. sin and cos). Programs that have a high percentage of calls to standard library functions (e.g. 183.equake) are well served by this information.

• Any interprocedural pointer analysis can give good mod/ref information when a query asks

about an external function call and memory that is known not to escape the program (as

implemented by step #6 of the ds-aa mod/ref implementation, and replicated in the other interprocedural analyses). For example, Figure 4.5 shows that even steens-fi is able to resolve

most of the mod/ref queries for programs that use a large number of external function calls

(such as 179.art, 183.equake) which are not modelled by the local analysis. This can occur

either because they are not part of libc, or because they may modify memory (e.g. sqrt,

which can modify errno).

• Failing the two cases above, steens-fi and steens-fs can occasionally provide mod/ref in-

formation for memory locations that are never read or never written in the program. For

example, this can occur when a global variable is logically const, but not marked as such (the

local analysis takes care of the case when it is marked const). If memory is never stored to,

it is trivial to see that no stores or calls can modify it. Note that it should be possible to

extend the anders analysis to incorporate this information if desired: we included it in the

steens implementations because the local pass of DSA provides the information for free.

• DSA’s direct support for context-sensitive mod/ref information makes it far more precise

than any of the other algorithms for all forms of mod/ref information, which can be seen

in Figures 4.4 and 4.5. In 24 of the 40 programs, ds-aa is able to return NoModRef 40%

more often than anders, and often does significantly better than that (e.g. resolving 75%

of the queries in NAMD as NoModRef and 50% more queries for all the programs in SPEC

FP95). This is particularly significant because ds-aa requires analysis time comparable to a

well-tuned anders implementation.

The results show that DSA (like other context-sensitive algorithms) clearly yields more accurate mod/ref information than non-context-sensitive algorithms (i.e. there is a reduction of “mod and ref” results and an increase in “Not mod or ref” results).

Despite this, the mod/ref precision of DSA can still be improved in two ways: first, we could track mod/ref information by-field instead of by-node. Second, our ds-aa implementation could be extended to support mod/ref queries for indirect function calls, which would improve precision

for programs like 253.perlbmk which include a large number of indirect calls.

4.5 Analysis Precision with Scalar Loop Optimizations

The second client we evaluate, the LLVM LICM pass, performs a small collection of scalar loop optimizations. It optimizes scalar operations using standard SSA and loop analyses, extending them to load and store operations when mod/ref analysis can prove that it is safe. In particular it performs: a) hoisting of load instructions to the loop preheader, b) sinking of load instructions to loop exit blocks, and c) promotion of stores in a loop to use a temporary and load/store once outside the loop. Figure 4.9 gives examples of the transformations applied.

(a) Load Hoisting Input:
    do {
      t1 = *P1;
      ... use t1 ...
      *P2 = t2;
    } while (...);

(b) Load Hoisting Result:
    t1 = *P1;      // hoisted!
    do {
      ... use t1 ...
      *P2 = t2;
    } while (...);

(c) Load Sinking Input:
    do {
      t1 = *P1;
      ...
      *P2 = t2;
    } while (...);
    ... use t1 ...

(d) Load Sinking Result:
    do {
      ...
      *P2 = t2;
    } while (...);
    t1 = *P1;      // sunk!
    ... use t1 ...

(e) Register Promotion Input:
    do {
      t1 = *P1;
      ... use t1 ...
      *P2 = t2;
      ...
      *P1 = t3;
    } while (...);

(f) Register Promotion Result:
    tmp = *P1;     // Promoted!
    do {
      t1 = tmp;    // Promoted!
      ... use t1 ...
      *P2 = t2;
      ...
      tmp = t3;    // Promoted!
    } while (...);
    *P1 = tmp;     // Promoted!

Figure 4.9: Scalar Loop Optimization Transformations

The figure shows that we are investigating the effects of two forms of load motion (hoisting and sinking) and register promotion [37] (also known as “Location Invariant Code Motion” [61]).

Figure 4.9(a) shows a simple example of a loop invariant load. This load may be safely hoisted

out of the loop (producing the code in (b)) if mod/ref analysis is able to prove that nothing in the body of the loop can modify the value of *P1 (for example, P1 and P2 don’t alias). Hoisting the load out of the loop reduces the dynamic number of loads if the loop executes more than once.

Figure 4.9(c) shows the same example, but the loaded value is only used outside of the loop.

In this case, if the loaded value is not modified between the load and all exits of the loop, it can be sunk to the loop exits, producing the code in (d). This reduces the number of dynamic executions of the load if the loop iterates more than one time.

Finally, Figure 4.9(e) illustrates a loop that has loads and stores to a loop invariant address.

In this case, if the only accesses (mods and refs) to the memory are through must-aliased pointers, the memory location can be promoted to a temporary which is eligible for register allocation. This transformation reduces the number of dynamic loads and stores inside of the loop, which reduces memory traffic if the loop iterates more than one time.

4.5.1 Number of Transformations Performed

We compare the relative effectiveness of our various analyses by running this set of optimizations with each analysis, and comparing the number of transformations that are performed. In particular, we count three numbers here: 1) the number of memory locations promoted to a register, 2) the number of load instructions hoisted or sunk, and 3) the number of non-load instructions hoisted or sunk out of the loop. #1 and #2 are described above. #3 is the number of non-memory operations that are removed from the loop, which is limited by the number of memory operations that are hoisted (e.g. if *P is hoisted from the loop, the division operation in (*P)/100.00 can be hoisted). Figure 4.10 counts the number of register promotion transformations performed with each analysis, Figure 4.11 lists the number of loads hoisted or sunk, and Figure 4.12 counts the number of instructions hoisted or sunk with each analysis.

We perform this evaluation on the programs in each suite after the programs have undergone standard compile and link-time optimization, including whole program inlining, interprocedural constant propagation, etc. The link-time optimizer does run all of these optimizations, including

LICM, with the local alias analysis, so most of the opportunities for motion and promotion that are achievable with the local analysis have already been performed (those few that are missed are

98 Transformation Count Transformation Ratio Benchmark local steens-fi steens-fs anders ds-aa ds-aa/st-fs ds-aa/and SPEC CINT 2000 181.mcf 2 2 inf 1.00 256.bzip2 23 23 23 23 1.00 1.00 164.gzip 13 13 13 16 1.15 1.15 175.vpr 17 17 17 19 1.12 1.12 197.parser 3 3 3 3 1.00 1.00 186.crafty 14 14 14 17 1.21 1.21 300.twolf 1 62 62 86 139 2.24 1.62 255.vortex 2 2 28 46 23.00 1.64 254.gap 26 26 30 55 2.12 1.83 252.eon 7 7 12 77 11.00 6.42 253.perlbmk 10 10 21 18 1.80 0.86 176.gcc 27 27 28 53 1.96 1.39 SPEC CFP 2000 179.art 3 3 3 3 1.00 1.00 183.equake 171.swim 1 inf inf 172.mgrid 168.wupwise 4 4 4 4 24 6.00 6.00 173.applu 2 2 17 7 3.50 0.41 188.ammp 39 39 71 81 2.08 1.14 177.mesa 3 3 3 3 1.00 1.00 SPEC CINT 1995 129.compress 1 inf inf 130.li 6 6 5 51 8.50 10.20 124.m88ksim 14 14 20 64 4.57 3.20 132.ijpeg 2 3 3 6 2.00 2.00 099.go 4 4 5 13 3.25 2.60 134.perl 10 10 9 17 1.70 1.89 147.vortex 2 2 28 46 23.00 1.64 126.gcc 23 23 26 57 2.48 2.19 SPEC CFP 1995 102.swim 1 inf inf 101.tomcatv 3 inf inf 107.mgrid 145.fpppp 22 22 24 1573 71.50 65.54 104.hydro2d 19 19 20 23 1.21 1.15 110.applu 2 2 17 7 3.50 0.41 103.su2cor 11 11 17 54 4.91 3.18 146.wave5 4 4 4 16 4.00 4.00 Other Programs fpgrowth 2 2 3 3 1.50 1.00 boxed-sim 30 30 25 45 1.50 1.80 NAMD 15 15 27 47 3.13 1.74 povray31 1 17 26 140 228 8.77 1.63

Figure 4.10: Number of Memory Locations Promoted To Registers. We elide zeros from the table.

99 Benchmark local steens-fi steens-fs anders ds-aa SPEC CINT 2000 181.mcf 14 14 14 14 256.bzip2 1 16 16 16 20 164.gzip 1 8 8 8 12 175.vpr 386 386 462 542 197.parser 33 33 42 47 186.crafty 45 45 46 124 300.twolf 1 373 373 441 592 255.vortex 1 86 86 173 300 254.gap 22 22 30 142 252.eon 51 51 73 307 253.perlbmk 17 17 32 21 176.gcc 3 319 319 332 467 SPEC CFP 2000 179.art 73 73 73 73 183.equake 1 60 60 60 60 171.swim 3 3 3 11 172.mgrid 5 5 5 16 168.wupwise 6 13 13 16 41 173.applu 3 3 8 10 188.ammp 17 17 56 69 177.mesa 54 54 57 78 SPEC CINT 1995 129.compress 1 130.li 15 124.m88ksim 6 6 7 20 132.ijpeg 60 62 91 148 099.go 1 6 6 6 30 134.perl 31 31 46 37 147.vortex 1 86 86 173 300 126.gcc 276 276 332 455 SPEC CFP 1995 102.swim 3 3 3 11 101.tomcatv 5 5 7 9 107.mgrid 5 5 13 145.fpppp 16 16 18 42 104.hydro2d 9 9 31 38 110.applu 3 3 8 10 103.su2cor 10 10 19 78 146.wave5 3 3 35 51 Other Programs fpgrowth 9 10 21 27 boxed-sim 28 28 109 121 NAMD 100 100 112 149 povray31 168 144 262 512

Figure 4.11: Number of Loads Hoisted or Sunk. We elide zeros from the table.

100 Benchmark local steens-fi steens-fs anders ds-aa SPEC CINT 2000 181.mcf 22 22 22 22 256.bzip2 5 23 23 23 31 164.gzip 2 10 10 10 18 175.vpr 7 546 546 654 792 197.parser 3 46 46 63 73 186.crafty 7 58 58 59 179 300.twolf 16 581 581 626 878 255.vortex 29 159 159 280 476 254.gap 1 34 34 53 271 252.eon 8 73 73 106 543 253.perlbmk 22 41 41 63 52 176.gcc 18 369 369 430 585 SPEC CFP 2000 179.art 98 98 98 98 183.equake 3 77 77 77 77 171.swim 1 4 4 4 32 172.mgrid 1 6 6 6 29 168.wupwise 30 37 37 40 65 173.applu 1 4 4 25 17 188.ammp 22 22 69 82 177.mesa 2 94 94 91 127 SPEC CINT 1995 129.compress 1 130.li 16 124.m88ksim 1 11 11 13 36 132.ijpeg 83 85 95 211 099.go 29 34 34 34 88 134.perl 6 54 54 82 62 147.vortex 29 159 159 280 476 126.gcc 11 369 369 428 582 SPEC CFP 1995 102.swim 1 4 4 1 32 101.tomcatv 2 7 7 7 17 107.mgrid 1 6 6 10 24 145.fpppp 11 28 28 30 54 104.hydro2d 3 19 19 56 63 110.applu 1 4 4 25 17 103.su2cor 3 13 13 37 240 146.wave5 3 6 6 62 78 Other Programs fpgrowth 13 14 25 32 boxed-sim 56 56 165 177 NAMD 1 102 102 114 154 povray31 14 203 179 362 826

Figure 4.12: Number of Instructions Hoisted or Sunk. We elide zeros from the table.

(Bar chart: for each benchmark, one bar per analysis (local, steens-fi, steens-fs, anders, ds-aa) giving the percent of alias queries by loop opts returned MayAlias, from 0% to 100%.)

Figure 4.13: Percent of LICM Alias Queries Returned “May Alias”

(Bar chart: for each benchmark, one bar per analysis (local, steens-fi, steens-fs, anders, ds-aa) giving the percent of mod/ref queries by loop opts returned Mod and Ref, from 0% to 100%.)

Figure 4.14: Percent of LICM Mod/Ref Query Responses Returned “Mod and Ref”

(Stacked bar chart: for each benchmark, the percent of mod/ref responses by loop opts broken down into Mod And Ref, Mod Only, Ref Only, and No Mod or Ref.)

Figure 4.15: DSA LICM Mod/Ref Query Responses Breakdown

because of phase ordering issues).

The table generally agrees with the metrics gathered in Section 4.4 for the AA-EVAL client.

In particular, the local analysis is able to perform few transformations (having already been used earlier in compilation). steens-fi and steens-fs enable far more optimization possibilities than local, and steens-fs performs a few more transformations than steens-fi in some cases (e.g. more locations are promoted to registers in povray31). The anders analysis provides far more precise results than any of the previous analyses, and ds-aa enables more optimizations than anders in most cases (sometimes far more, e.g. 252.eon, 254.gap, 103.su2cor, sometimes slightly fewer, e.g.

253.perlbmk, 173.applu).

4.5.2 Alias and Mod/Ref Queries

Figures 4.13, 4.14, and 4.15 contain information about how the different analyses responded to the queries made by the scalar loop optimizer. Figure 4.13 shows the percentage of MayAlias responses returned by the analysis for load/store and store/store dependence checks (compare with

Figure 4.3). Figure 4.14 shows the percentage of call/location mod/ref queries that are returned as

“Mod and Ref” (compare to Figure 4.4). Finally, Figure 4.15 shows the breakdown of mod/ref queries as returned by DSA (compare with Figure 4.8). We elide the other charts, as they are either irrelevant (“ref only” and “no mod ref” responses do not affect this transformation), or contain very little important data (e.g., the extremely sparse “mod only” chart).

The numbers in these charts are often correlated, but sometimes quite different than the pre- cision numbers provided in Section 4.4. Overall, the “success rate” of the queries is much higher for this client than for the synthetic client (for example, the percentage of may alias responses is generally lower). There are four potential reasons for this:

• The loop optimizer performs several other legality checks before it queries alias analysis. In

particular, if the pointer is not loop invariant, no alias queries will be performed (and obviously

it will never make queries for values outside of loops). These checks can significantly bias the

queries made of the alias analysis (for example, making queries of globals more frequent). It

is possible (and somewhat likely) that the alias analyses are better at disambiguating cases

that pass the initial legality tests.

• The loop optimizer does not perform all possible queries inside of a loop body. In particular,

if it is attempting to hoist a load, it will perform mod/ref queries of the loaded location

against all of the potentially modifying instructions in the loop body until it hits one that

might modify the memory location. Analysis counts will differ if the last operation in the

loop modifies the location compared to if the first operation in the loop modifies the location.

• Hoisting one instruction may make other dependent instructions hoistable (for example “T

= ***p”). This can bias query percentages in strange ways: for example, a precise analysis

that hoists a large number of operations will cause the loop optimizer to make many more

queries than it would of an imprecise analysis. This makes it almost impossible to compare

bars for different analyses on the same benchmark.

• The loop optimizer occasionally makes duplicate queries. Because we count the raw number

of executed transactions between the alias analysis and the client, these duplicate queries can

also bias the results.

Because so many issues can bias the results in strange ways, we do not feel it is meaningful to compare this set of data to the AA-EVAL data, or even to compare the bars for one analysis to the bars for another. Comparing the number of transformations enabled and performed is a more important and accurate metric.

4.6 Observations and Conclusions

• DSA can support traditional alias and mod/ref clients in the same framework

that it uses to support the macroscopic clients described throughout this thesis.

Our primary goal with DSA is to provide an extremely fast analysis framework that captures

the important properties of memory in the program. Because all of the algorithms in this

thesis depend on an analysis like DSA, DSA’s precision and generality are very important for

this work.

• A scalable context-sensitive unification-based analysis can be more precise and

more useful than a non-context-sensitive subset-based algorithm while requiring

approximately the same analysis time. One of the main results of DSA is to demonstrate

that a fully context-sensitive algorithm can be both extremely fast and scalable. Given this

extremely fast analysis, a compiler engineer can choose either a non-context-sensitive subset-

based approach or a context-sensitive unification-based approach to achieve good precision

with reasonable compile-time cost (while scalable cloning-based context-sensitive subset-based

analyses are available [140], in practice they are orders of magnitude slower than DSA). The advantage

of using a DSA-like approach is that it enables the full suite of macroscopic approaches

described in this thesis.

• Mod/ref information is an extremely useful property which cannot be aggressively

captured by a non-context-sensitive algorithm. If non-context-sensitive pointer analysis

algorithms are used as the main analysis, they must be followed by context-sensitive mod/ref

summary algorithms. We show that simple redundancy elimination optimizations are greatly

enhanced by context-sensitive mod/ref information, and describe how DSA computes this

information as part of its analysis.

• Field-Sensitivity is a straightforward extension of Steensgaard’s algorithm which

can improve precision in some cases. When implemented with speculative type infor-

mation (as DSA/ds-aa and steens-fs do), there is very little additional compile-time cost to

preserving field information. This information gives a marginal improvement in alias analysis

precision (for a non-context-sensitive unification-based algorithm), and can be used by more

aggressive analyses (like DSA) for higher level analyses and transformations (such as macro-

scopic techniques). We believe that the value of field-sensitivity has been largely ignored,

except as a way of increasing points-to precision.

Note that our evaluation specifically does not evaluate, compare against, or draw conclusions about subset-based heap-cloning context-sensitive pointer analysis algorithms. In particular, we explicitly do not draw any conclusions on the effect of adding context sensitivity to a subset-based analysis. Others have shown that this can substantially improve alias analysis precision, though at potentially very large additional analysis cost (e.g., 50-100x in some cases in recent work [140, 103]) over the cost of a non-context-sensitive subset-based approach. This context-sensitivity changes the

analysis time requirements from seconds to minutes (or hours, depending on the size of the code) for medium-sized programs (e.g. 100,000 lines of C code).

In our experience, we find that the use of unification provides a simple solution to hard problems that heap-cloning context-sensitive algorithms face, and we have not witnessed significant imprecision in heap analysis that would be prevented by the use of a subset-based approach. Intuitively, this is because there is a limited amount of information that any flow-insensitive algorithm can say about recursive data structures, and in practice, field-sensitive subset-based analysis and field-sensitive unification-based analysis can prove the same things. To get more information, flow-sensitive techniques with strong updates (e.g. shape analysis) must be used, providing an additional level of information.

In contrast, we frequently witness pessimization of global variable and (occasionally) stack object analysis precision due to the use of unification. Specifically, any time a program uses a single pointer to refer to two different globals, a unification-based analysis will never again be able to distinguish between the two. This experience leads us to propose the following conjecture, which we consider to be an open research question:

Conjecture 4.6.1 Subset-based pointer analysis does not provide any substantial precision advan- tage (over unification-based analysis) for heap allocated memory objects in a fully context-sensitive pointer analysis.

Studies have shown that context-sensitivity improves the precision of both unification-based

(this work, [41, 92], etc) and subset-based ([140, 103]) algorithms. We believe that the precision difference between a heap-cloning context-sensitive subset-based analysis and a similar unification- based analysis is mostly due to differences in global and stack object analysis precision, not due to heap object analysis precision. This belief leads us to propose the following conjecture:

Conjecture 4.6.2 A heap-cloning context-sensitive alias analysis algorithm that uses unification- based analysis for heap objects and subset-based analysis for stack objects and globals may be an important compromise that provides precision close to the leading fully context-sensitive subset- based approaches and analysis times that are significantly better (perhaps approaching the speed of

DSA).

We think that investigation and evaluation of this hybrid algorithm is an important open research problem, but leave it to future work.

Chapter 5

Automatic Pool Allocation

One of the most important tasks for modern compilers and runtime systems is the management of memory usage in programs, including safety checking, optimization, and storage management.

Unfortunately, while compiler techniques for analyzing and controlling memory access patterns for dense arrays have proven very effective, techniques for dealing with pointer-based data structures are much weaker. A key difference between the two is that compilers have precise knowledge of the runtime layout of arrays in memory, whereas they have much less information about complex data structures allocated on the heap. In such (pointer-based) data structures, both the relative layout of distinct data structures in memory (which affects working set sizes) and the layout of nodes within a single data structure (which affects memory traversal patterns) are difficult to predict.

One direct consequence is that irregular memory traversal patterns often have worse performance, both because of poor spatial locality and because techniques like hardware stride prefetching are not effective. A potentially more far-reaching consequence is that many compiler techniques (e.g., software prefetching, data layout transformations, and safety analysis) are either less effective or not applicable to complex data structures.

This chapter describes Automatic Pool Allocation, a transformation framework for arbitrary imperative programs that segregates distinct instances of heap data structures into separate memory pools, and allows heuristics to be used to partially control the internal layout of those data structures. For example, each distinct instance of a list, tree, or graph identified by the compiler would be allocated to a separate pool. Our transformation uses the output of a context-sensitive,

field-sensitive points-to analysis (DSA)1 to distinguish disjoint instances of logical data structures

1Less aggressive pointer analyses can also be used but may not distinguish data structure instances or may give less precise information about their internal structure.

in a program, and identify the locations at which nodes of those data structures are created, accessed, and destroyed. This gives the compiler the information needed to segregate individual data structure instances and to better control their internal layout.

The Automatic Pool Allocation algorithm supports arbitrary C and C++ programs, including programs with function pointers and/or virtual functions, recursion, varargs functions, non-type-safe memory accesses (e.g., via pointer casts and unions), setjmp/longjmp, and exceptions. One of the key strengths of the algorithm is a simple strategy for correctly handling indirect calls, which is difficult because different functions called via a function pointer may have different allocation and deallocation behavior and because (in C or C++) they may even have different signatures. The algorithm solves these complex issues via a relatively simple graph transformation phase, while keeping the code transformation process essentially unchanged. The transformation works correctly for incomplete programs, by only pool allocating memory that does not escape the scope of analysis.

Automatic Pool Allocation provides several novel features compared to previous work on region inference. In particular, it is the first approach that builds on a scalable context-sensitive pointer analysis, works with non-type-safe programs, supports functions with varargs, allows arbitrary function pointer handling, etc. In addition, Automatic Pool Allocation is the first approach which is designed both to improve program performance (through better locality) and to provide a framework for subsequent compiler optimizations.

Automatic Pool Allocation can directly improve program performance in several ways. First, since programs typically traverse and process only one or a few data structures at a time, segregating logical data structures reduces the memory working sets of programs, potentially improving both cache and TLB performance. Second, in certain cases, the allocation order within each data structure pool will match the subsequent traversal order (e.g., if a tree is created and then processed in preorder), improving spatial locality and (if objects are smaller than a cache line) temporal locality. Intuitively, both benefits arise because the layout of individual data structures is unaffected by intervening allocations for other data structures, and less likely to be scattered around in the heap. Third, in some cases, the traversal order may even become a simple linear stride, allowing more effective hardware prefetching than before. Note that Automatic Pool Allocation

can also potentially hurt performance in two ways: by separating data that are frequently accessed together and by allocating nearly-empty pages to small pools (some of the optimizations described in Chapter 6 address this).

At the end of Chapter 6, we present an experimental study of the performance impact of

Automatic Pool Allocation to show the execution time and locality effects of the transformation. We

find that several programs speed up by 10-20%, two by about 2x and two by more than 10x. Other programs are unaffected, and importantly, none are hurt substantially by the transformation. We also graphically show how segregation of data structures in memory can provide a 10x performance improvement in some cases.

This chapter starts by describing the running example we use and gives a high-level overview of the transformation (Section 5.1). Next it describes the full algorithm in detail in Section 5.2, its complexity in Section 5.3, and several simple (but important) refinements in Section 5.4. Section 5.5 evaluates the compile time and static pool allocation statistics on a broad range of pointer intensive programs, and Section 5.6 contrasts Automatic Pool Allocation with prior work. Finally, Section 5.7 summarizes the contributions of the pool allocation algorithm described in this chapter.

5.1 The Transformation Overview and Example

The pool allocation transformation operates on a program containing calls to malloc and free, and transforms the program to use a pool allocation library, described below. The algorithm uses a points-to graph and call graph, both of which are computed by DSA in our implementation. The transformation is a framework which has several optional refinements. In this section, we present a

“basic” version of the transformation in which all heap objects are allocated in pools (i.e., none are allocated directly via malloc) and every DS node generates a separate static pool (explained below). All steps of the algorithm consider only those DS nodes with H ∈ M (“H nodes”) as candidates for allocating to pools. Because the DSGraphs identified by DSA identify disjoint memory objects, this transformation automatically segregates such data structure instances in the heap. In the next section, we discuss additional refinements to this basic approach.

5.1.1 Pool Allocator Runtime Library

Figure 5.1 shows the interface to the runtime library. Pools are identified by a pool descriptor of type Pool. The functions poolalloc, poolfree, and poolrealloc allocate, deallocate, and resize memory in a pool. The poolcreate function initializes a pool descriptor for an empty pool, with an optional size hint (providing a fast path for a commonly allocated size) and an alignment required for the pool (this defaults to 8, as in many standard malloc libraries). pooldestroy releases all pool memory to the system heap. The last three functions (with suffix “_bp”) are variants that use a fast “bump pointer” allocation method, described in Section 6.1.3.

    void  poolcreate(Pool* PD, uint Size, uint Align)       Initialize a pool descriptor.
    void  pooldestroy(Pool* PD)                              Release pool memory and destroy pool descriptor.
    void* poolalloc(Pool* PD, uint numBytes)                 Allocate an object of numBytes bytes.
    void  poolfree(Pool* PD, void* ptr)                      Mark the object pointed to by ptr as free.
    void* poolrealloc(Pool* PD, void* ptr, uint numBytes)    Resize an object to numBytes bytes.

    void  poolinit_bp(Pool* PD, uint Align)                  Initialize a bump-pointer pool descriptor.
    void* poolalloc_bp(Pool* PD, uint NumBytes)              Allocate memory from a bump-pointer pool.
    void  pooldestroy_bp(Pool* PD)                           Release a bump-pointer pool.

Figure 5.1: Interface to the Pool Allocator Runtime Library

The library internally obtains memory from the system heap in blocks of one or more pages at a time using malloc (doubling the size each time). We implemented multiple allocation algorithms but the version used here is a general free-list-based allocator with coalescing of adjacent free objects. It maintains a four-byte header per object to record object size. The default alignment of objects (e.g., 4- or 8-byte) can be chosen on a per-pool basis, for reasons described in Section 6.1.4.

The pool library is general in the sense that it does not require all allocations from a pool to be the same size.
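As a rough illustration of the kind of allocator described above, the following is a minimal sketch of a variable-size pool with a per-object size header, blocks that double in size, and a simple free list. The struct layouts, growth policy, and alignment handling here are assumptions made for this sketch; the actual runtime library additionally coalesces adjacent free objects and honors the size and alignment hints.

    #include <stdlib.h>

    /* Hypothetical layout for a simple variable-size pool: memory comes from the
       system heap in blocks that double in size, each object carries a size header,
       and freed objects go on a per-pool free list for reuse. */
    typedef struct PoolBlock { struct PoolBlock *Next; size_t Used, Cap; } PoolBlock;
    typedef struct FreeObj   { struct FreeObj *Next; size_t Size; } FreeObj;

    typedef struct Pool {
      PoolBlock *Blocks;     /* blocks obtained from the system heap     */
      FreeObj   *FreeList;   /* objects returned by poolfree             */
      size_t     NextSize;   /* size of the next block request (doubles) */
    } Pool;

    void poolcreate(Pool *PD, unsigned SizeHint, unsigned Align) {
      (void)SizeHint; (void)Align;            /* hints ignored in this sketch */
      PD->Blocks = 0; PD->FreeList = 0; PD->NextSize = 4096;
    }

    void *poolalloc(Pool *PD, unsigned NumBytes) {
      size_t Bytes = NumBytes < sizeof(FreeObj) ? sizeof(FreeObj) : NumBytes;
      size_t Need;
      FreeObj **P;
      char *Mem;
      Bytes = (Bytes + 7) & ~(size_t)7;                 /* keep 8-byte alignment */
      for (P = &PD->FreeList; *P; P = &(*P)->Next)      /* first-fit reuse       */
        if ((*P)->Size >= Bytes) {
          FreeObj *Obj = *P;
          *P = Obj->Next;
          return Obj;
        }
      Need = sizeof(size_t) + Bytes;                    /* size header + payload */
      if (!PD->Blocks || PD->Blocks->Cap - PD->Blocks->Used < Need) {
        size_t Cap = PD->NextSize;
        PoolBlock *B;
        PD->NextSize *= 2;                              /* grow by doubling      */
        if (Cap < sizeof(PoolBlock) + Need) Cap = sizeof(PoolBlock) + Need;
        B = (PoolBlock *)malloc(Cap);
        B->Next = PD->Blocks; B->Used = sizeof(PoolBlock); B->Cap = Cap;
        PD->Blocks = B;
      }
      Mem = (char *)PD->Blocks + PD->Blocks->Used;      /* bump allocation       */
      PD->Blocks->Used += Need;
      *(size_t *)Mem = Bytes;                           /* record object size    */
      return Mem + sizeof(size_t);
    }

    void poolfree(Pool *PD, void *Ptr) {
      FreeObj *Obj = (FreeObj *)Ptr;
      Obj->Size = ((size_t *)Ptr)[-1];                  /* read the size header  */
      Obj->Next = PD->FreeList;                         /* no coalescing here    */
      PD->FreeList = Obj;
    }

    void pooldestroy(Pool *PD) {
      PoolBlock *B = PD->Blocks, *Next;
      for (; B; B = Next) {                             /* release all blocks at once */
        Next = B->Next;
        free(B);
      }
      PD->Blocks = 0; PD->FreeList = 0;
    }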

(a) Input C program manipulating linked lists:

    struct list { list *Next; int *Data; };

    list *createnode(int *Data) {
      list *New = malloc(sizeof(list));
      New->Data = Data;
      return New;
    }

    void splitclone(list *L, list **R1, list **R2) {
      if (L == 0) { *R1 = *R2 = 0; return; }
      if (some_predicate(L->Data)) {
        *R1 = createnode(L->Data);
        splitclone(L->Next, &(*R1)->Next, R2);
      } else {
        *R2 = createnode(L->Data);
        splitclone(L->Next, R1, &(*R2)->Next);
      }
    }

    int processlist(list *L) {
      list *A, *B, *tmp;
      // Clone L, splitting into list A and B.
      splitclone(L, &A, &B);
      processPortion(A);        // Process first list
      processPortion(B);        // process second list
      // free A list
      while (A) { tmp = A->Next; free(A); A = tmp; }
      // free B list
      while (B) { tmp = B->Next; free(B); B = tmp; }
    }

(b) C code after the basic pool allocation transformation:

    struct list { list *Next; int *Data; };

    list *createnode(Pool *PD, int *Data) {
      list *New = poolalloc(PD, sizeof(list));
      New->Data = Data;
      return New;
    }

    void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
      if (L == 0) { *R1 = *R2 = 0; return; }
      if (some_predicate(L->Data)) {
        *R1 = createnode(PD1, L->Data);
        splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
      } else {
        *R2 = createnode(PD2, L->Data);
        splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
      }
    }

    int processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;                        // initialize pools
      poolcreate(&PD1); poolcreate(&PD2);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);        // Process first list
      processPortion(B);        // process second list
      // this loop is eventually eliminated
      while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
      // this loop is eventually eliminated
      while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
      pooldestroy(&PD1); pooldestroy(&PD2);
    }

Figure 5.2: Example illustrating the Pool Allocation Transformation

‘processlist’ copies a list into two disjoint lists (based on some predicate), processes each, then frees them. After basic pool allocation, the new lists are put in separate pools (PD1 and PD2) which are each contiguous in memory. After subsequent optimization described in Chapter 6, the calls to poolfree and the loops containing them are removed because pooldestroy atomically frees all pool memory.

(Points-to graphs not reproduced; heap nodes carry labels such as list: HM, list: HMR, list: R, and int: R.)

(a) DS Graph for createnode    (b) DS Graph for splitclone    (c) DS Graph for processlist

Figure 5.3: BU DSGraphs for functions in Figure 5.2 (a)

112 5.1.2 Overview Using an Example

The basic pool allocation transformation is illustrated for the example program in Figure 5.2(b), which shows the results of our basic transformation in C syntax. The incoming list L and the two new lists have each been allocated to distinct pools (the pool for L is not passed in and so not shown; the new lists use pools PD1 and PD2). The list nodes for A and B will be segregated in the heap, unlike the original program where they will be laid out in some unpredictable fashion (and possibly interleaved) in memory. The items in each pool are explicitly deallocated and the pools are destroyed within processList when the data they contain is no longer live. We can use this example to explain the basic steps of the transformation. The DS graphs are shown in Figure 5.3. First, we use each function’s DS graph to determine which H nodes are accessible outside their respective functions, i.e., “escape” to the caller. The H nodes in createnode and splitclone do escape, because they are reachable from a returned pointer and a formal argument, respectively. The two in processlist (A and B) do not. The latter are candidates for new pools in processlist.

The transformation phase inserts code to create and destroy the pool descriptors for A (PD1) and B (PD2) in processlist (see Figure 5.2(b)). It adds pool descriptor arguments for every H node that escapes its function, i.e., for nodes pointed to by R1 and R2 in splitclone and the node pointed to by New in createnode. It rewrites the calls to malloc and free with calls to poolalloc and poolfree, passing appropriate pool descriptors as arguments. Finally, it rewrites other function calls (e.g., the calls to splitclone and createnode) to pass any necessary pool descriptor pointers as arguments. At this point, the basic transformation is complete.

Further refinements of the transformation move the pooldestroy for PD1 as early as possible within the function processlist, and then eliminate the calls to free items in the two lists (since these items will be released by pooldestroy before any new allocations from any pool) and hence the loop enclosing those calls to free. The final resulting code (Figure 6.1) puts each linked list into a separate pool on the heap, makes the list objects of each list contiguous in memory, and reclaims all the memory for each list at once instead of freeing items individually. In the example, the list nodes are placed in dynamic allocation order within their pool.

5.2 The Core Pool Allocation Transformation

The pool allocation transformation consists of two main parts: an analysis to find which functions host pools in the program (Section 5.2.1), and the code transformation that rewrites the program to allocate and free memory from these pools. To simplify presentation, we first present a version of the algorithm that does not support indirect function calls (Section 5.2.2), then extend the basic algorithm to support indirect calls (Section 5.2.3).

5.2.1 Analysis: Finding Pool Descriptors for each H Node

The analysis phase identifies which pool descriptors must be available in each function, determines where they must be created and destroyed, and assigns pool descriptors to DS nodes. We use the term static pool to refer to a single poolcreate statement in the generated code. By definition,

H ∈ M for a node if the objects of that node are returned by malloc or passed into free by the current function or any of its callees, since we assume a Bottom-up DS graph. These identify exactly those nodes for which a pool descriptor must be available in the current function.

Automatic Pool Allocation computes a map (pdmap) identifying the pool descriptor correspond- ing to each DS node with H ∈ M. We initially restrict pdmap to be a one-to-one mapping from

DS nodes to pool descriptor variables; Section 6.2 extends pdmap to allow a many-to-one mapping.

We must handle two cases: 1) the pool escapes the current function and 2) the pool lifetime is bounded by the function. In the first case, we add a pool descriptor argument to the function; in the second, we create a descriptor on the stack for the function and call poolcreate/pooldestroy. These two cases are differentiated by the “escapes” property for the DS node.

The “escapes” property is determined by a simple escape analysis [14] on the bottom-up DS graphs, implemented as a depth-first traversal. In particular, a node escapes iff 1) a pointer to the node is returned by the function (e.g. createnode), 2) the node is pointed to by a formal argument

(e.g. the R1 node in splitclone), 3) the node is pointed to by a global variable and the current function is not main, or 4) (inductively) an escaping node points to the node.
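The following is a minimal sketch of this escape computation over a hypothetical DS node representation; the DSNode fields and the way the roots are seeded are assumptions for illustration and do not reflect DSA's actual data structures.

    /* Hypothetical DS node representation with just the fields this pass needs;
       DSA's real nodes carry types, flags, and field-indexed edges. */
    typedef struct DSNode {
      struct DSNode **PointsTo;   /* outgoing points-to edges            */
      unsigned NumEdges;
      int FromReturn;             /* reachable from the return value     */
      int FromFormal;             /* pointed to by a formal argument     */
      int FromGlobal;             /* pointed to by a global variable     */
      int Escapes;
    } DSNode;

    /* Rule 4: once a node escapes, everything it points to escapes as well. */
    static void markEscaping(DSNode *N) {
      unsigned i;
      if (N->Escapes) return;     /* already visited (graphs may be cyclic) */
      N->Escapes = 1;
      for (i = 0; i != N->NumEdges; ++i)
        markEscaping(N->PointsTo[i]);
    }

    /* Rules 1-3 seed the depth-first traversal: returned nodes, nodes pointed to
       by formal arguments, and (outside of main) nodes pointed to by globals. */
    void computeEscapes(DSNode **Nodes, unsigned NumNodes, int FunctionIsMain) {
      unsigned i;
      for (i = 0; i != NumNodes; ++i)
        if (Nodes[i]->FromReturn || Nodes[i]->FromFormal ||
            (Nodes[i]->FromGlobal && !FunctionIsMain))
          markEscaping(Nodes[i]);
    }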

A subtle point is that any node that does not escape a function will be unaffected by callers of the function, since the objects at such a node are not reachable (in fact, may not exist) before the current function is called or after it returns. This explains why it is safe to use a BU graph

114 for pool allocation: Even though the BU graph does not reflect any aliases induced by callers, the non-escaping nodes are correctly identifiable and all information about them is complete, including their type τ, incoming points-to edges, and flags. In fact, in DSA, the escapes property is explicitly computed and all non-escaping nodes are marked using a “C”omplete flag (See Section 3.1.1).

It can be computed easily using the above definition by any context-sensitive algorithm that has similar points-to graphs.

5.2.2 The Simple Transformation (No Indirect Calls)

Figure 5.4 shows the pseudocode for a basic version of the Automatic Pool Allocation transformation, which does not handle indirect function calls. The algorithm makes two passes over the functions in the program in arbitrary order. The first (lines 1–11) adds arguments to functions, creates local pool descriptors, and builds the pdmap. The second (lines 12–20) rewrites the bodies of functions using pdmap.

    basicpoolallocate(program P)
     1  ∀F ∈ functions(P)
     2    dsgraph G = DSGraphForFunction(F)
     3    ∀n ∈ nodes(G)                          // Find pooldesc for heap nodes
     4      if (H ∈ n.M)
     5        if (escapes(n))                    // If node escapes fn
     6          Pool* a = AddPoolDescArgument(F, n)
     7          pdmap(n) = a                     // Remember pooldesc
     8          argnodes(F) = argnodes(F) ∪ {n}
     9        else                               // Node is local to fn
    10          Pool* pd = AddInitAndDestroyLocalPool(F, n)
    11          pdmap(n) = pd
    12  ∀F ∈ functions(P)
    13    ∀I ∈ instructions(F)                   // Rewrite function
    14      if (I isa 'ptr = malloc(size)')
    15        replace I with 'poolalloc(pdmap(N(ptr)), size)'
    16      else if (I isa 'free(ptr)')
    17        replace I with 'poolfree(pdmap(N(ptr)), ptr)'
    18      else if (I isa 'call Callee(args)')
    19        ∀n ∈ argnodes(Callee)
    20          addCallArgument(pdmap(NodeInCaller(F, I, n)))

Figure 5.4: Pseudo code for basic algorithm

For each node that needs a pool in the function, the algorithm either adds a pool descriptor argument (if the DS node escapes) or it allocates a pool descriptor on the stack. Non-escaping pools are initialized (using poolcreate) on entry to the function and destroyed (pooldestroy) at every exit of the function (these placement choices are improved in Section 5.4.2). Because the DS

115 node does not escape the function, we are guaranteed that any memory allocated from that pool can never be accessed outside of the current function, i.e., it is safe to destroy the pool, even if some memory was not deallocated by the original program. Note that this may actually eliminate some memory leaks in the program!

In the second pass (lines 12–20), the algorithm replaces calls to malloc() and free()2 with calls to poolalloc and poolfree. We pass the appropriate pool descriptor pointer using the pdmap information saved by the first pass. Since the DS node must have an H flag, a pool descriptor is guaranteed to be available in the map.

Calls to functions other than malloc or free must pass additional pool descriptor arguments for memory that escapes from them. Because the BU Graph of the callee reflects all accessed memory objects of all transitive callees, any heap objects allocated by a callee will be represented by an H node in the caller graph (this is true even for recursive functions like splitclone). This property guarantees that a caller will have all of the pool descriptors that any callee will ever need.

A key primitive computable from DS graphs is a mapping, NodeInCaller(F, C, n). For a call instruction, C, in a function F, if n is a DS node in any possible callee at that call site, then n′ = NodeInCaller(F, C, n) identifies the node in the DS graph of F corresponding to node n due to side-effects of the call C (i.e., n′ includes the memory objects of node n visible in F due to this call). The mapping is computed in a single linear-time traversal over matching paths in the caller and callee graphs, starting from matching pairs of actual and formal nodes, matching pairs of global variable nodes, and the return value nodes in the two graphs if any. If n escapes from the callee, then the matching node n′ is guaranteed to exist in the caller’s BU graph (due to the bottom-up inlining process used to construct the BU graphs), and is unique because the DS graphs are unification-based (see Section 3.2.3).
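The sketch below shows the shape of this matching walk over a simplified node representation. The fixed number of edge slots and the MappedTo field are assumptions made for the sketch; DSA's real graphs have typed, field-indexed edges and handle mismatches by merging nodes.

    /* Hypothetical node representation for sketching the matching walk: each
       callee node records the caller node it corresponds to once matched. */
    typedef struct MatchNode {
      struct MatchNode *Edges[4];     /* outgoing edges, indexed by field slot */
      struct MatchNode *MappedTo;     /* caller node this callee node maps to  */
    } MatchNode;

    /* Walk matching paths in the callee and caller graphs in lock step, recording
       the callee-to-caller correspondence.  The recursion stops at nodes that are
       already matched, so cyclic data structure graphs terminate. */
    static void matchPaths(MatchNode *Callee, MatchNode *Caller) {
      unsigned i;
      if (!Callee || !Caller || Callee->MappedTo) return;
      Callee->MappedTo = Caller;
      for (i = 0; i != 4; ++i)
        matchPaths(Callee->Edges[i], Caller->Edges[i]);
    }

    /* Roots of the walk are the (formal, actual) node pairs, matching global
       variable nodes, and the two return-value nodes (when present). */
    void buildNodeInCallerMap(MatchNode **CalleeRoots, MatchNode **CallerRoots,
                              unsigned NumRoots) {
      unsigned i;
      for (i = 0; i != NumRoots; ++i)
        matchPaths(CalleeRoots[i], CallerRoots[i]);
    }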

Identifying which pool of the caller (F ) to pass for callee pool arguments at call instruction I is now straightforward: for each callee node n that needs an argument pool descriptor, we pass the pool descriptor for the node NodeInCaller(F, I, n) in the caller’s DS graph. We record the set of nodes (“argnodes”) that must be passed into each function, in the first pass.

2Note that “malloc wrappers” (like calloc, operator new, strdup, etc) do not need special support from the pool allocator. Their bodies are simply linked into the program and treated as if they were a user function, getting new pool descriptor arguments to indicate which pool to allocate from.

(a) Input C program with indirect function call:

    int *func1(int *in) {
      *in = 1;
      return in;
    }

    int *func2(int *in) {
      free(in);
      in = malloc(sizeof(int));
      *in = 2;
      return in;
    }

    int caller(int X) {
      int *(*fp)(int*) = (X > 1) ? func1 : func2;
      int *p = malloc(sizeof(int));
      int *q = fp(p);
      return *q;
    }

(b) C code after pool allocation:

    int *func1(Pool *P, int *in) {
      *in = 1;
      return in;
    }

    int *func2(Pool *P, int *in) {
      poolfree(P, in);
      in = poolalloc(P, sizeof(int));
      *in = 2;
      return in;
    }

    int caller(int X) {
      Pool PD1; poolcreate(&PD1, ...);
      int *(*fp)(int*) = (X > 1) ? func1 : func2;
      int *p = poolalloc(&PD1, sizeof(int));
      int *q = fp(&PD1, p);
      pooldestroy(&PD1);
      return *q;
    }

int: HM int: HM

(c) Merged EBU Graph for func1 and func2 (d): EBU Graph for caller

Figure 5.5: Pool Allocation Example with Function Pointers. Though func1 and func2 are called at the same call site, only one needs a pool descriptor. The algorithm puts them in a single equivalence class, merges their DS graphs, and adds a pool argument to both functions.

Variable-argument functions do not need any special treatment in the transformation because of their representation in the BU graphs computed by DSA. In particular, the DS graph nodes for all pointer-compatible arguments passed via the “...” mechanism (i.e., received via va_arg) are merged so that they are represented by a single DS node in the caller and callee. If the DS node pointed to by this argument node has H ∈ M, a single pool argument is added to the function.

At every call site of this function, the nodes for the actual argument (corresponding to the merged formals) will also have been merged, and the pool corresponding to this node will be found by

NodeInCaller(F, I, n) and passed in as the pool argument. Note that explicit arguments before the ... are not merged and can have distinct pools.

117 5.2.3 Passing Descriptors for Indirect Function Calls

Indirect function calls make it much more complex to pass correct pool descriptor arguments to each function. There are multiple difficulties. First, different functions called via a function pointer at the same call site may require different sets of pools. Figure 5.5 shows a simple example where func1 needs no pools but func2 needs one pool, and both are called at the same site. Second, different indirect call sites can have different but overlapping sets of callees, e.g., {F1,F2} and {F2,F3} at two different call sites. In order to avoid cloning F2 into two versions, we must pass the same pool arguments to all three functions F1,F2 and F3. This raises a third major problem: because the call graph says that F3 is not a callee at the first call-site, its DS graph was never inlined into that of the caller at that call-site. This means that the matching of nodes between caller and callee graphs, which is essential for passing pool descriptors, may be undefined: NodeInCaller(F, C, n) may not exist for all escaping n. Programs that violate the type signatures of functions at call sites (not uncommon in C code) exacerbate all three problems because any attempt to match pool arguments explicitly for different callees must account for mismatches between the actual and formals for each possible callee.

Our solution is composed of two key principles, described below, and shown in pseudocode in

Figure 5.6. The first principle is to partition the functions of the program into equivalence classes so that all potential callees at an indirect call site are in the same class. We then treat all functions in the same equivalence class as potential callees for that call site. For example, func1 and func2 in the example figure are put into the same class, and so are F1, F2 and F3 in the example above. Lines 1-2 use the call graph to partition all the functions of the program into disjoint equivalence classes in this manner.
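This partitioning step can be implemented with a standard union-find structure over the program's functions, as in the following sketch; the array-based call graph representation is an assumption made for illustration.

    /* Union-find over function ids: after processing every call site, two
       functions are in the same class whenever some call site may invoke both. */
    static unsigned findClass(unsigned *Parent, unsigned F) {
      while (Parent[F] != F) {
        Parent[F] = Parent[Parent[F]];   /* path halving */
        F = Parent[F];
      }
      return F;
    }

    static void uniteClasses(unsigned *Parent, unsigned A, unsigned B) {
      Parent[findClass(Parent, A)] = findClass(Parent, B);
    }

    /* Callees[cs][0..NumCallees[cs]-1] lists the possible callees of call site cs.
       Parent[] must be initialized so that Parent[f] == f for every function f. */
    void partitionFunctions(unsigned *Parent,
                            unsigned **Callees, unsigned *NumCallees,
                            unsigned NumCallSites) {
      unsigned cs, i;
      for (cs = 0; cs != NumCallSites; ++cs)
        for (i = 1; i < NumCallees[cs]; ++i)
          uniteClasses(Parent, Callees[cs][0], Callees[cs][i]);
    }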

The second principle is to simplify matching nodes between different callees at a call site with the nodes of the caller by merging the graphs of all functions in an equivalence class, and then updating the caller graphs to be consistent with the merged callee graphs. Merging the graphs ensures that an identical set of pool descriptor formal arguments will be inferred for all functions in the class. Updating the caller graphs to be consistent with the callee graphs (as explained below) ensures that the third problem above — finding matching nodes between callee and caller — is always possible.

In the example, the algorithm merges the DS graphs of func1 and func2 into the common graph

118 shown in Figure 5.5(c), and uses this common graph to transform both functions. This results in a matching set of pool arguments for both functions, even though the pool will be unused in func1. This common graph is merged into the caller, resulting in the graph shown in Figure 5.5(d). Using this graph, one descriptor is passed to both functions at the call site.

The implementation of these graph merging and inlining steps (lines 3-8 of Figure 5.6) uses two primitive DSA operations: merging two graphs and performing a bottom-up inlining pass on strongly-connected components (SCCs) of the call graph. To merge the graphs of two functions in an equivalence class (lines 3-4), we copy one graph into the other, then unify corresponding formal argument nodes (ignoring any extra nodes in one of the graphs if the formal argument lists do not match), global nodes, and the return value node of each graph. Unifying nodes causes recursive merging and can potentially cause loss of some type information if merged nodes have incompatible types.

Finally, we perform a bottom-up “inlining” pass on the strongly connected components (SCCs) of the call graph, inlining merged graphs of the callees into their callers. This simply requires repeating the bottom-up inlining pass of the DSA algorithm (starting with the merged equivalence- class graphs of each function). This step is described in detail in Section 3.2.3.

We call the resulting DS graphs the EBU (“equivalence bottom-up”) graphs. The EBU graph is more conservative than the original DS graph because functions known not to be called at a call-site may be merged into the caller along with those that are (because they are in the same equivalence class). Such cases do not arise often in practice, and the merging of equivalence class graphs greatly simplifies the overall transformation algorithm by solving the above three problems with a uniform strategy based on existing DS graph primitives.

    completepoolallocate(program P)
     1  ∀cs ∈ callsites(P)                       // Build equivalence classes
     2    unify equivclasses(callees(cs))
     3  ∀ec ∈ equivclasses(functions(P))         // Build graph for each class
     4    ECGraph(ec) = mergeGraphs(DSGraphs(members(ec)))
     5  ∀scc ∈ tarjansccfinder(callgraph(P))
     6    ECGraph(scc) = mergeGraphs(ECGraphs(functions(scc)))
     7    ∀cs ∈ callsites(scc)                   // Inline callees into caller
     8      ECGraph(scc) = mergeGraph(cs, ECGraph(callees(cs)))
     9  basicpoolallocate(P)

Figure 5.6: Pseudo code for complete pool allocator

Given the EBU graphs for a program, the pool allocator is now guaranteed to have all of the

pool descriptors required at an indirect call site for any of the potential callees of the call site, allowing it to apply the basicpoolallocate algorithm safely. Note that lines 17-19 simply have to use the common graph for all callees even though there may now be multiple callers for the call at line 17.

5.3 Algorithmic Complexity

The first phase of the basic pool allocation transformation itself (lines 1–10 in Figure 5.4) is linear in the total number of DS graph nodes because all escaping nodes can be found with a single traversal of each DS graph, and the remaining steps are trivially linear. We have found empirically that the total number of DS graph nodes scales essentially linearly with program size, a consequence of using a unification-based algorithm (although field-sensitivity can cause the number of DS graph nodes to grow more quickly with program size, this occurs only in pathological cases because of the node merging effect of unification, see Section 3.2.6). The second phase (lines 11–20) is linear in the number of memory allocation and deallocation operations and call sites in the program, plus the number of pool arguments added which is itself linear in the number of DS graph nodes.

The EBU graph merging phase, which precedes the transformation, has three main steps (lines

1–8 of Figure 5.6). The first step (lines 1–2) is O(c + fα(f)) if c and f are the number of edges and functions in the program call graph and α is the inverse Ackermann’s function. The mergeGraphs operation is equivalent to the node merging performed in any unification-based algorithm, and requires O(kα(k)) time for graphs of size k. The first merging phase (lines 3–4) merges each function’s graph no more than once into its equivalence-class graph, so its complexity is similar.

The second merging phase is equivalent to a subset of the BU phase of Data Structure Analysis and has the same complexity as that phase, viz., Θ(nα(n) + kα(k)c), if n, k, and c denote the total number of instructions, the maximum size of a DS graph for a single procedure, and the number of edges in the call graph. In practice, k is very small, typically on the order of a hundred nodes or less, even in programs of over 100K lines of C code (See Section 3.2.6).

Overall, the complexity of the pool allocation algorithm is close to linear in the size of the input

DS graphs, and (empirically) close to linear in overall program size. In practice, it is extremely fast, as we show experimentally in Section 5.5.3.

5.4 Simple Pool Allocation Refinements

This section describes two simple refinements to the pool allocation algorithm which permit it to generate far more efficient code in some cases.

5.4.1 Argument Passing for Global Pools

Efficient handling of pools reachable from global variables is optional for functionality, but absolutely required for performance. Note that the pool for any node reachable from a global must be created in main because the data may be live throughout the lifetime of the program. The major problem this causes is that such a pool would have to be passed down through many layers of function calls to be available at each function that actually allocates or frees data in the pool. In practice, programs which have many heap nodes reachable from globals may get thousands of arguments added to the program.

The solution to this problem is simple: we create a global variable to hold the pool descriptor for each heap node reachable from a global and use this in place of arguments passed into a function where possible. In practice, this refinement greatly reduces the number of pool arguments that must be passed to functions in some C programs. Most importantly, it ensures that the only pool arguments that must be passed to a function are for pointers passed in as formal arguments to that function (or nodes reachable from such pointers), ensuring that the number of pool arguments is roughly proportional to the number of formal pointer arguments in the original function.
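As a small illustration of the effect (the list, Worklist, WorklistPool, and addWork names are invented for this example), a function that allocates nodes of a globally reachable data structure can simply name the corresponding global pool descriptor rather than receiving it as an argument:

    /* The Pool type and poolalloc come from the runtime library of Figure 5.1;
       everything else here is hypothetical. */
    typedef struct Pool Pool;                   /* opaque pool descriptor            */
    void *poolalloc(Pool *PD, unsigned NumBytes);

    typedef struct list { struct list *Next; int *Data; } list;

    extern list *Worklist;                      /* data structure rooted at a global */
    extern Pool WorklistPool;                   /* its pool descriptor, also global  */

    void addWork(int *Data) {                   /* needs no pool descriptor argument */
      list *N = (list *)poolalloc(&WorklistPool, sizeof(list));
      N->Data = Data;
      N->Next = Worklist;
      Worklist = N;
    }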

Finally, using global variables as pool descriptors allows the standard LLVM interprocedural constant propagation pass to simplify some programs. In particular, any functions that are always passed the same global pool descriptor address will automatically have that parameter value (a link- time constant) substituted into their body. This makes the argument dead, allowing the standard

LLVM dead argument elimination pass to remove it. For some programs composed primarily of global pools, this can reduce some of the pool descriptor arguments added to the program.

5.4.2 poolcreate/pooldestroy Placement

The algorithm described above places poolcreate and pooldestroy calls at the beginning and exits of each function. In practice, the lifetime of the data objects in a pool may begin at a later

121 point in the function and may end before the end of the function. Moving the pool create/destroy calls later and earlier within the function reduces the lifetime of objects in the pool3. We use simple depth-first traversals of the CFG to move the placement of create/destroy calls later and earlier within the same function where they were originally placed.

Starting from a poolcreate, we use a forward depth-first traversal of the CFG, searching for the first occurrence of a “real use” of a pool on any path, and place the poolcreate call at the immediate dominator of the CFG node containing that occurrence. Similarly, for pooldestroy, we use a backward traversal looking for the last occurrence of “real uses” and place the call in the immediate postdominator of each last occurrence on any path. A “real use” of a pool is a load, store, call, or poolalloc instruction that uses the pool descriptor, but not a poolfree instruction because any poolfree instructions that do not have any later “real uses” are essentially dead. Traversal is linear in the number of nodes and edges of the CFG.

    int processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1);                       // initialize pools
      poolcreate(&PD2);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);                      // Process first list
      processPortion(B);                      // process second list
      while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
      pooldestroy(&PD1);                      // NOTE: this moved up
      while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
      pooldestroy(&PD2);                      // destroy pool PD2
    }

Figure 5.7: After moving pooldestroy(&PD1) earlier

Figure 5.7 illustrates this placement, for the function processlist of our example. The call to pooldestroy(&PD1) has been moved earlier in the function, to immediately after the while loop that reads the Next field from nodes in PD1 pool. The poolcreate calls for both pools cannot be moved any later. Moving the poolcreate and pooldestroy calls interprocedurally [4] or into loops [26] can further reduce the lifetime of pools, but we have not yet implemented this.

3This refinement can also make it more likely that the optimization described in Section 6.1.2 can apply.

5.5 Experimental Results

Automatic Pool Allocation has two primary effects: first, it enables new classes of macroscopic transformations (described in other chapters); second, it has a direct performance effect on pointer-intensive programs. Here we aim to evaluate the basic properties of the pool allocator and the code it produces. In Section 6.3, we evaluate the detailed performance effect of the pool allocation transformation and show (by example) that the transformation succeeds in segregating data structures.

5.5.1 Methodology and Benchmarks

We implemented Automatic Pool Allocation as a link-time transformation using the LLVM Compiler Infrastructure (described in Chapter 2). All of the experiments in this section are compiled and optimized with the LLVM compiler, optionally run through the pool allocator, then converted to C code and compiled with GCC 3.4.2 for final code generation. All runtimes reported are the minimum user+system time from three identical executions of the program on an AMD Athlon MP 2100+ running Fedora Core 1 Linux at runlevel 3. Note that Chapter 6 describes a series of very simple transformations that can be used to boost program performance further.

For this work, we are most interested in heap-intensive programs, particularly those that use recursive data structures. For this reason, we include numbers for the pointer-intensive SPECINT 2000 benchmarks, the Ptrdist suite [8], the Olden suite [109], and the FreeBench suite [115]. We also include a few standalone programs: Povray3.1 (a widely used open source ray tracer, available from povray.org), espresso, fpgrowth (a patent-protected data mining algorithm [67]), llu-bench (a linked-list microbenchmark) [147], and "chomp" from the McGill benchmark suite. All but SPEC, fpgrowth and povray31 are available from llvm.cs.uiuc.edu.

Note that we elide many benchmarks from these suites that cannot be affected by pool allocation, for several reasons. Some of the benchmarks, including 181.mcf, 186.crafty, 256.bzip2, and several FreeBench benchmarks, have very few dynamic memory allocations. A few (e.g. 197.parser, 254.gap, 255.vortex) have custom memory allocators, which prevents disambiguation of allocated memory objects and causes all objects to be placed in a single pool. As an experiment, we removed the custom memory allocator from 197.parser and replaced it with wrappers that just call malloc/free; this is called 197.parser-b below. We can do this to 197.parser (but not the others) because its custom allocator has semantics identical to malloc/free. Finally, almost all the codes in the McGill benchmark suite have run times that are too small to be measured reliably.
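As a concrete illustration of the 197.parser-b change, the wrappers are conceptually no more than the following (the names are hypothetical stand-ins for the benchmark's allocator entry points):

    #include <stdlib.h>

    /* Hypothetical stand-ins for 197.parser's custom allocator entry points.
       Replacing the custom allocator with thin wrappers like these is valid
       only because its allocation interface has malloc/free semantics.      */
    void *xalloc(size_t size) { return malloc(size); }
    void  xfree(void *ptr)    { free(ptr); }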

5.5.2 Pool Allocation Statistics

Table 5.1 shows several basic statistics about pool allocation for each program. The StatPools column shows the number of static pools created in the program (when using Selective PA). The NumTH column shows the static number of type-homogeneous pools, and TH% is the percentage of static pools that are type-homogeneous. The DynPools column lists the number of dynamic pools created by the program. Tot Args and Max Args are the total number of formal arguments added to the program across all functions, and the maximum number for a single function.

    Program        LOC     StatPools  NumTH  TH%   DynPools  Tot Args  Max Args
    164.gzip       8616    4          4      100%  44        1         1
    175.vpr        17728   107        91     85%   44        23        4
    197.parser-b   11204   49         48     98%   6674      76        16
    252.eon        35819   124        123    99%   66        549       41
    300.twolf      20461   94         88     94%   227       1         1
    anagram        650     4          3      75%   4         0         0
    bc             7297    24         22     32%   19        6         2
    ft             1803    3          3      100%  4         0         0
    ks             782     3          3      100%  3         0         0
    yacr2          3982    20         20     100%  83        0         0
    analyzer       923     5          5      100%  8         0         0
    neural         785     5          5      100%  93        0         0
    pcompress2     903     5          5      100%  8         0         0
    llu-bench      191     1          1      100%  2         0         0
    chomp          424     4          4      100%  7         10        8
    fpgrowth       634     6          6      100%  3.4M      10        6
    espresso       14959   160        160    100%  100K      191       13
    povray31       108273  46         5      11%   14        290       4
    bh             2090    1          0      0%    1         0         0
    bisort         350     1          1      100%  1         1         1
    em3d           682     12         12     100%  12        3         2
    health         508     2          2      100%  2         4         2
    mst            432     4          4      100%  4         0         0
    perimeter      484     1          1      100%  1         1         1
    power          622     3          3      100%  3         9         7
    treeadd        245     6          6      100%  6         1         1
    tsp            579     1          1      100%  1         1         1

Table 5.1: Basic Pool Allocation Statistics

The programs vary greatly in terms of the ratio of dynamic pool instances (DynPools) to static pools (StatPools). fpgrowth has a particularly high ratio because it creates a new pool (for a local search tree) in each call to a recursive function. The number of arguments added to the programs is generally modest. 252.eon has a large number of arguments added because the standard C++ library is statically linked in, providing a large amount of cold code.

The TH% column also shows that for most pools, DSA is able to successfully prove that memory in the pool is used in a type-consistent manner, which we have found true across a wide range of C programs. This allows intelligent alignment decisions, gives the pool runtime information about expected size for single objects, and enables other novel compiler techniques described in Chapters 6 and 7.

5.5.3 Pool Allocation Compile Time

Table 5.2 shows the compile times for pool allocation on programs bigger than 1000 lines of code. It breaks down this time into three components: the total time for DSA (which can be used by other clients as well), the time to compute the EBU graphs described in Section 5.2.3 (which are specific to pool allocation), and the time to perform the pool allocation transformation itself. The GCC column lists the time to compile the program with GCC 3.4.2 at -O3.

The total compilation time for pool allocation is extremely modest, taking less than 1.25 seconds in all cases on our Athlon 2100+. The largest amount of time is spent analyzing 252.eon (which has a large portion of the standard C++ library statically linked into it), followed by povray31; these are the only programs that took more than 1 second. Furthermore, much of the time is spent in DSA, which can be used for a variety of applications besides pool allocation. Our implementations of the EBU and PA passes have not been optimized substantially, so these times could probably be reduced further. Overall, these compilation times are extremely small for a sophisticated interprocedural optimization.

To put these times in perspective, the GCC% column, computed as (Total/GCC)*100, shows that the pool allocation transformation takes 3% or less of the time taken by GCC to compile these programs. This is significant because GCC -O3 performs no cross-module optimizations and inlining is the only interprocedural optimization it performs within a module (thus it is very conservative compared to other interprocedurally optimizing compilers). Overall, we believe these compilation times are quite acceptable for a production compiler.

    Program        LOC     GCC     DSA   EBU   PA    Total  GCC%
    164.gzip       8616    2.67    0.02  0.01  0.01  0.03   1.1%
    175.vpr        17728   9.39    0.06  0.03  0.05  0.14   1.5%
    197.parser-b   11204   9.03    0.08  0.05  0.05  0.18   1.9%
    252.eon        35819   131.13  0.51  0.30  0.42  1.23   0.9%
    300.twolf      20459   17.21   0.09  0.07  0.03  0.19   1.1%
    bc             7297    3.55    0.03  0.02  0.01  0.06   1.7%
    ft             1803    0.68    0.01  0.01  0.01  0.02   2.9%
    yacr2          3982    1.79    0.02  0.01  0.01  0.03   1.7%
    espresso       14959   10.28   0.14  0.08  0.08  0.30   2.9%
    povray31       108273  39.20   0.58  0.33  0.27  1.18   3.0%
    bh             2090    0.85    0.01  0.01  0.01  0.01   1.2%

Table 5.2: Compile time (seconds) for programs > 1000 LOC

Note that the effect of pool allocation on program performance and cache behavior is studied in detail in Section 6.3.

5.6 Related Work

The primary goal of the pool allocation transformation is to give the compiler some control over the layout of data structures in the heap. We achieve this using a context-sensitive points-to graph to distinguish data structure instances and object lifetimes. We first contrast this work with previous approaches for influencing the layout of heap objects, and then with previous work on partitioning the heap for automatic (region-based) memory management.

Chilimbi et al. [29] describe a semi-automatic tool called ccmorph that reorganizes the layout of homogeneous trees at runtime to improve locality. It relies on programmer annotations to identify the root of a tree and to indicate the reorganization is safe. We automatically identify and segregate instances of many kinds of logical data structures, but do not yet identify when a runtime reorganization would be safe. They also describe another tool, ccmalloc, which is a malloc replacement that accepts hints to allocate one object near another object. These hints only provide local information for an object pair and not any global information about entire data structures.

Hirzel et al. [74] describe a technique to improve the effectiveness of Garbage Collection by partitioning heap objects according to their connectivity properties. Unlike our work, their partitions are not segregated on the runtime heap, are not directly related to distinct data structures, and the graph of partitions is restricted to be a DAG, which prevents fine-grained partitioning of mutually recursive structures (like graphs).

Several proposed techniques aim to improve storage allocation or GC performance by relating objects based on their predicted lifetimes [68, 45, 12, 39, 30, 119]. These techniques use heuristics such as allocation site, call stack, or object size, combined with profiling information, to predict lifetime properties approximately. In contrast, our approach uses a more rigorous analysis to group objects both by structural relationships and statically derived lifetimes.

Other authors have developed techniques (usually profile-based) to reorganize fields within a single structure or place objects near each other to improve locality of reference [64, 20, 119, 28, 75].

These placement decisions are orthogonal to the choices made by Automatic Pool Allocation, and could therefore be combined with our transformation. This is an important direction for future work.

There has been significant work on runtime libraries for region-based memory management [13], and on language mechanisms for manual region-based memory management as an alternative to garbage collection, e.g., Real-time Java [16], RC [58], Cyclone [78, 63], and others [58, 44, 17]. Compared with our approach, these library- or language-based techniques are much easier to implement, but require significant manual effort to use. In addition, although the region-based libraries and languages expose the relationship between objects and regions to the compiler, they do not expose any notion of higher-level data structures or how they relate to objects and regions. Therefore, the compiler does not obtain information about data structures and traversals that could enable optimizations on logical data structures.

There is a rich body of work on automatic region inference as a technique for memory management, for both functional [134, 133, 4, 66] and object-oriented languages [31, 26]. Unlike this body of work, our primary goal is to segregate and control the layout of data structures in the heap for better performance and to enable subsequent compiler techniques that exploit knowledge of these layouts. We describe several optimizations (Chapters 6 and 7) that exploit data structure pools, and explore the implications of data structure segregation for program performance in some detail (Section 6.3). There are also some key technical differences between this prior work and ours. First, all these previous techniques except the work of Cherem and Rugina [26] are based on type inference with a region-based type system. It does not appear straightforward to extend the type inference approaches to work for weakly-typed languages like C and C++, which can contain pointer casts, varargs functions, unions, etc., on which type information is difficult to propagate statically. In contrast, both our underlying pointer analysis and our transformation algorithm correctly handle all the complex features of C and C++, by distinguishing objects with known and unknown type (in the points-to graph) and by using a conservative and very efficient graph merging technique (the same as in DSA) to deal with potentially type-unsafe uses of pointers during the transformation. Second, using a pointer analysis as the basis for our transformation enables additional optimizations by exploiting the explicit relationship between a points-to graph and pools.

Finally, the use of type inference and a rich type system is not well suited for modern optimizing compilers, which are usually based on a mid-level or low-level internal representation supporting multiple source languages. Our approach is specifically designed for use in such compilers, and relies on only a simple, mid-level intermediate representation and pointer analysis.

The work of Cherem and Rugina [26] was performed concurrently with ours and our approaches are technically similar in some key ways. They describe a region inference approach for Java based on a flow-insensitive, context-sensitive points-to analysis. Because their primary focus is automatic memory management, they are much more aggressive about computing region lifetimes, including loop-carried regions. Our regions can be placed as flexibly as theirs, but we use a simpler placement analysis. Like the type-inference approaches, however, their work also does not support weakly typed languages like C. Although the underlying pointer analysis could be extended to do so (using our approach, for example), we believe the transformation would be more difficult to extend. Furthermore, they too focus on automatic memory management, and do not explore the impact of their work on memory hierarchy performance or consider other optimizations that could exploit their region information. We expect that our optimization techniques could be fruitfully combined with their region inference algorithm for Java programs.

There is a wide range of work on techniques for stack allocation of heap objects as well as techniques for static garbage collection, both of which are based on analyzing the lifetimes of objects in programs (e.g., see [14, 122, 79] and the references therein). These techniques do not attempt to analyze or control the layout of logical data structures in the heap per se, and are largely orthogonal to our work. A minor exception is that our optimization to eliminate poolfree for a pool (when there are no intervening allocations before the subsequent pooldestroy) essentially replaces explicit deallocation with static reclamation of memory in the pool. This is the inverse of (and much more limited than) the work on static GC, which aims to replace or optimize runtime GC.

5.7 Research Contributions of Automatic Pool Allocation

The primary contribution of Automatic Pool Allocation is a practical, efficient compiler algorithm to segregate distinct instances of logical data structures into separate pools in the heap. Our algorithm and implementation make the following specific contributions:

(i) We show that the algorithm succeeds in segregating recursive data structures on the heap, providing a substantial performance improvement for several programs. We experimentally find that the algorithm improves the performance of several programs by 10-20%, speeds up two by about 2x and two others by about 10x, and explain the source of these improvements.

(ii) Unlike related previous approaches [134, 133, 4, 66, 31, 26], all of which require a type-safe input program, Automatic Pool Allocation supports the full generality of C and C++ programs (including indirect function calls, mutually recursive functions, variable argument functions, and lack of type safety).

(iii) Our algorithm is the first to perform region inference based on a scalable pointer analysis (DSA), which allows us to partition heap data by reachability. Work concurrent to ours [26] uses a somewhat similar approach, but uses a non-scalable analysis, does not handle global variables at all, requires type safety, and has other limitations compared to our approach.

(iv) We present a simple strategy for correctly handling indirect function calls in arbitrary C programs without making the core transformation more complex.

(v) The algorithm computes static mapping information from pointers to the pools that they point into. We are the first to demonstrate that region inference and this mapping information can be used to support aggressive follow-on techniques like Transparent Pointer Compression.

In addition to the research contributions, we show that the analysis and transformation needed to perform this optimization both require very little compile time or memory. We feel that the amount of resources used is quite reasonable for aggressive optimizing compilers. Finally, Chapter 6 provides a detailed evaluation of the performance effect of Automatic Pool Allocation.

Chapter 6

Optimizing Pool Allocated Code

Chapter 5 describes the basic pool allocation transformation, along with the refinements which allow it to produce reasonably efficient code. While pool allocation itself can produce a strong performance improvement for many programs that heavily use heap data structures, Chapter 5 did not attempt to use any of the extra information provided by pool allocation to further improve the performance of the program, and did not evaluate the performance impact that pool allocation itself has.

This chapter is dedicated to simple techniques which can take a pool allocated program and improve its performance. These techniques generally make use of the partitioning of memory into pools, which allows the optimizations to reason about the behavior of a subset of the heap memory in the program. While some of these techniques could in theory be applied to programs that are not pool allocated (e.g. the bump pointer optimization), doing so would only allow them to be applied in unrealistic cases (e.g., a program that never deallocates any memory).

In addition to simple pool optimizations, this chapter also discusses pool collocation (assigning more than one DS Node to a pool), extensions of the pool allocation algorithm to support collocation, several example heuristics, and our experience with collocation.

Finally, we evaluate the performance impact of pool allocation with and without these opti- mizations, and evaluate the contribution of each one.

6.1 Pool Optimizations

We describe four optimizations that exploit the partitioning of heap objects into pools by the pool allocator, and the control we have over the pool runtime library. The benefits of all four optimizations are evaluated in Section 6.3.

6.1.1 Avoiding Pool Allocation for Singleton Objects: SelectivePA

The simplest optimization we propose attempts to avoid pool allocating DS nodes that will have only a single memory object allocated from them. Pool allocation suffers from two performance disadvantages when pool allocating these singleton pools. First, it adds overhead to the program by requiring the creation, destruction, and potential argument passing of pool descriptors.

Second, and more importantly, the pool allocation runtime is optimized to handle collections of allocations, so it does not perform well in time or space for singleton allocations. In particular, on the first allocation, it allocates a large chunk of memory sufficient to hold the requested memory plus several more allocations. If there is only ever one dynamic allocation from a pool, this extra memory is wasted for as long as the pool is live.

We identify potentially singleton pools with a simple heuristic: we classify all H nodes that are not pointed to by any other memory object (including themselves) as singleton objects. This conservative approximation preserves pool allocation for common cases (such as recursive structures, or stack or global arrays that point to a large number of nodes) while filtering out some obviously bad cases.

In practice, we find that this triggers most often for functions that allocate a dynamic array of memory on entry to the function and deallocate the memory before exit. If the array had a statically bound size that was known to be small, it would be reasonable to stack allocate the object. Because these objects have an unknown (and potentially very large) size, we simply preserve the malloc/free.
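A minimal sketch of the pattern this heuristic is designed to catch (illustrative code only, not taken from any benchmark): the scratch array below is a singleton object, so Selective PA leaves the plain malloc/free in place rather than creating a pool for it.

    #include <stdlib.h>

    /* A dynamically sized scratch array that is allocated on entry and freed
       before return.  Its DS node is never pointed to by any other memory
       object, so it is classified as a singleton.                            */
    double average(const double *samples, int n) {
      double *scratch = malloc(n * sizeof(double));
      if (!scratch)
        return 0.0;
      double sum = 0.0;
      for (int i = 0; i < n; ++i) {
        scratch[i] = samples[i];        /* stand-in for some per-element work */
        sum += scratch[i];
      }
      free(scratch);
      return n ? sum / n : 0.0;
    }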

We name this optimization “Selective PA”.

6.1.2 poolfree Elimination: PoolFreeElim

The pooldestroy operation atomically releases all memory from a pool, regardless of whether or not the program has explicitly freed it. As mentioned in Section 5.2.2, this can actually fix memory leaks in some cases by reclaiming memory that would otherwise be leaked. The poolfree elimination optimization is based on the observation that we can deliberately introduce memory leaks into the program if we can prove that they will not increase the peak heap size of the program.

This optimization can be applied in many cases. Intuitively, many short-lived data structures have a "build-use-destroy" pattern, in which all allocations happen before any deallocations. In these cases (e.g. the example in Figure 5.2), if no memory is allocated from any pool between the poolfree calls and the pooldestroy for the pool, the poolfree operations can be eliminated. For the example in Figure 5.2, this leaves us with the code shown in Figure 6.1.

    int processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1);
      poolcreate(&PD2);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);                      // Process first list
      processPortion(B);                      // process second list
      while (A) { tmp = A->Next; A = tmp; }
      while (B) { tmp = B->Next; B = tmp; }
      pooldestroy(&PD1);                      // destroy pool (including list nodes)
      pooldestroy(&PD2);                      // destroy pool (including list nodes)
    }

Figure 6.1: Figure 5.2 after eliminating poolfree calls

Note that elimination of the poolfree operations in the deallocation loops actually makes the loops output-free. Because of this, standard algorithms for aggressive dead code elimination (e.g., LLVM's -adce pass) can be used to eliminate the dead loops, leaving us with the code in Figure 6.2.

The poolfree elimination optimization is beneficial for two reasons: first, it removes some unnecessary manipulations of the free list (a minor effect); second, it occasionally allows the removal of entire traversals of data structures, such as in the case above.

    int processlist(list *L) {
      list *A, *B, *tmp;
      Pool PD1, PD2;
      poolcreate(&PD1);
      poolcreate(&PD2);
      splitclone(&PD1, &PD2, L, &A, &B);
      processPortion(A);                      // Process first list
      processPortion(B);                      // process second list
      pooldestroy(&PD1);                      // destroy pool (including list nodes)
      pooldestroy(&PD2);                      // destroy pool (including list nodes)
    }

Figure 6.2: After eliminating poolfree calls and dead loops

In order to detect unnecessary poolfree calls, we perform a simple intraprocedural backward dataflow analysis (per pool) from the pooldestroy calls for the pool, identifying basic blocks in the CFG which have no poolalloc calls on any path from their block to a pooldestroy. We then remove any poolfree calls to the current pool in these blocks.
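The following is a deliberately simplified, block-granularity sketch of this backward walk, assuming a toy CFG representation rather than the actual LLVM data structures (instruction ordering within a block is ignored here for brevity):

    #include <stdbool.h>

    struct BasicBlock {
      struct BasicBlock **preds;   /* predecessor blocks                     */
      int num_preds;
      bool has_poolalloc;          /* block contains a poolalloc (any pool)  */
      bool free_removable;         /* result: poolfree calls to the pool     */
                                   /* being processed may be deleted here    */
      bool visited;
    };

    /* Walk backward from a block containing pooldestroy(&PD).  Any block we
       can reach without passing through a poolalloc has no allocation between
       it and the pooldestroy, so poolfree(&PD, ...) calls in it are dead.    */
    static void mark_removable(struct BasicBlock *bb) {
      if (bb->visited)
        return;
      bb->visited = true;
      if (bb->has_poolalloc)       /* an allocation could reuse freed memory: stop */
        return;
      bb->free_removable = true;
      for (int i = 0; i < bb->num_preds; ++i)
        mark_removable(bb->preds[i]);
    }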

Note that the refinement described in Section 5.4.2 can interact positively with poolfree elimination, by moving the pooldestroy for a pool above calls to poolalloc from other pools. Also note that this optimization could be enhanced to use interprocedural analysis to increase the number of opportunities to apply it, and could use more aggressive interprocedural techniques to remove calls in (for example) recursive functions that are used to delete the nodes in a data structure.

We name this optimization “PoolFreeElim”.

6.1.3 Avoiding Object Header Overhead: Bump-Pointer

The pool allocator runtime supports the full set of heap operations, including malloc/free/realloc etc. This support is required to handle the full generality of programs, but not all pools need this generality. In particular, one of the costs of this generality is that objects allocated from the pool need to include a header word, which indicates the size of the object and includes bookkeeping information to be used when freeing objects. For example, consider a pool containing 16-byte objects. The objects will be laid out in the pool as shown in Figure 6.3.

Figure 6.3: Standard Pool of 16-byte Objects with Default 8-Byte Alignment

This diagram shows the header word and alignment padding words required for bookkeeping and to keep the user data 8-byte aligned (which is the default alignment for standard malloc implementations and the pool allocator runtime). If the CPU has 32-byte cache lines, each cache line can hold only 1.5 of these objects.

The bump-pointer optimization applies to pools which are allocated from but can be proven to never have memory deallocated back to them. In this case, the header word is completely unnecessary: all of the bookkeeping for deallocation is unneeded for the pool. The bump-pointer optimization transforms all calls to pool library functions to call bump-pointer versions instead (e.g., changing poolalloc to poolalloc_bp, poolcreate to poolcreate_bp, etc.). The bump-pointer interface to the pool library provides only a simple, lightweight allocator with an extremely fast allocation path (no free lists to search), and it does not use object headers in objects. After the bump-pointer optimization, the pool in Figure 6.3 is laid out as shown in Figure 6.4.

Figure 6.4: Bump-Pointer Pool of 16-byte Objects with Default 8-Byte Alignment

The advantage in this case is substantial: instead of fitting 1.5 objects into each cache line, we are able to fit 2. Clearly the benefit of this optimization is larger for smaller objects and smaller for larger ones. Finally, note that this optimization interacts with the PoolFreeElim optimization described in Section 6.1.2: if PoolFreeElim is able to remove all of the poolfree calls for a pool, we will be able to convert the pool to use a bump pointer.
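To make the difference concrete, here is a minimal sketch of a bump-pointer pool's allocation path; the field and function names are hypothetical, and the real runtime grows the pool by chaining slabs as described in Section 5.1.1 rather than failing when a slab fills up:

    #include <stddef.h>

    typedef struct {
      char *next;   /* next free byte in the current slab */
      char *end;    /* one past the last byte of the slab */
    } BumpPool;

    static void *poolalloc_bp(BumpPool *P, size_t size) {
      /* Keep allocations 8-byte aligned.  No per-object header is needed,
         because objects are never freed individually from such a pool.      */
      size = (size + 7) & ~(size_t)7;
      if ((size_t)(P->end - P->next) < size)
        return NULL;             /* the real runtime would start a new slab  */
      void *result = P->next;
      P->next += size;
      return result;
    }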

Our implementation of the bump-pointer optimization is currently very simple. First, it identifies all poolcreate calls in the program, using them to find the pool descriptors. Second, it walks the def-use chains for the pool descriptors, inspecting every use of each descriptor. If the pool descriptor is never passed into a non-pool-allocator function and never passed to a poolfree call, it is promoted to use the bump-pointer pool interfaces. This very simple implementation could be extended in several obvious ways, none of which have been implemented yet.

6.1.4 Avoiding Alignment Padding: AlignOpt

As mentioned above, the pool allocator and most standard malloc implementations return 8-byte aligned memory by default. The problem is that traditional heap libraries must return memory that can be used to hold any data type supported by the processor. Most RISC machines require 8-byte data to be 8-byte aligned, and even processors that support unaligned data (e.g., the X86 line) generally impose a significant performance penalty for it (the Alpha, for example, traps to the operating system to emulate unaligned accesses).

This alignment restriction requires that the allocator insert inter-object padding in cases where the available space for an object is not suitably aligned. For example, on a 32-bit machine, all allocations of 8n bytes generally need a 4-byte alignment pad (allocations of 8n + 4 bytes do not require an alignment pad). Figure 6.3 illustrates this for 16-byte objects.
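A short worked sketch of this rule, assuming the 4-byte header word and 8-byte alignment described above (the constants are illustrative):

    #include <stddef.h>

    enum { HEADER = 4, ALIGN = 8 };   /* header word size and required alignment */

    /* Space consumed per object in a standard (non-bump-pointer) pool.          */
    static size_t slot_size(size_t object_size) {
      size_t raw = HEADER + object_size;                 /* header + user data   */
      return (raw + ALIGN - 1) & ~(size_t)(ALIGN - 1);   /* round up to 8 bytes  */
    }

    /* slot_size(16) == 24: a 16-byte (8n) object needs a 4-byte alignment pad.
       slot_size(12) == 16: a 12-byte (8n+4) object needs no pad at all.         */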

The AlignOpt optimization uses the DSNodes computed by DSA to infer when it is safe to reduce alignment from 8 bytes to 4 bytes for a pool. In particular, if a pool is type-homogeneous, and if the type for the pool does not contain any data that requires 8-byte alignment, the poolcreate (or poolcreate_bp) call is modified to request 4-byte alignment instead of 8-byte alignment. For the example shown in Figure 6.3, this changes the memory layout to that shown in Figure 6.5.

Figure 6.5: Normal Pool of 16-byte Objects with Reduced 4-Byte Alignment

Like the bump-pointer optimization, this optimization reduces inter-object padding to increase the number of objects that fit into a cache line (in this case, from 1.5 to 1.75), and it likewise has a more dramatic performance effect for smaller objects than for large ones. Note that this optimization works together with the bump-pointer optimization: the alignment optimization applies in cases where the bump-pointer optimization does not (and vice versa), and the two can be applied together. If the bump-pointer optimization has been applied to a pool, the alignment optimization helps for pools that contain objects of size 8n + 4 (e.g. 12 bytes) instead of 8n.

6.1.5 Tail Padding Optimization

In C, the size of a structure is determined by the elements in the structure, padding to ensure each element in the structure is aligned to an appropriate boundary, and padding inserted at the end of an object to ensure that all elements in an array of the structure will be appropriately aligned.

For example, Figure 6.6 contains a simple example that requires 4 bytes of tail padding on a 32-bit machine:

    struct DoubleList {              // sizeof(struct DoubleList) == 16
      double Data;
      struct DoubleList *Next;
    };

Figure 6.6: Example Structure with Tail Padding

Like other padding, tail padding for a structure wastes cache capacity with unnecessary data, reducing the cache's effective size. We propose (but have not implemented) that the compiler detect type-homogeneous pools that contain objects with tail padding and transform them to pass the tail padding amount into the poolcreate call. Given this, the poolalloc call can implicitly subtract the tail-padding amount from any allocation request. Even if the bump-pointer and alignment optimizations fail, this can eliminate inter-object padding by placing object headers in the tail padding (in this case, reducing the object size to 12 bytes eliminates the need for an alignment pad).

Currently our analysis does not keep track of whether or not memcpy/memmove/memset are used on memory allocated from a pool. Without this information, eliminating tail padding is not safe, as these can copy and clobber anything put into the tail pad.
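As a concrete illustration of the hazard (illustrative code only), a whole-structure memset would overwrite whatever the allocator had placed in the tail padding:

    #include <string.h>

    struct DoubleList { double Data; struct DoubleList *Next; };

    /* A common idiom: zero the whole structure.  On a 32-bit target this
       writes sizeof(struct DoubleList) == 16 bytes, including the 4 bytes of
       tail padding; if the next object's header had been placed there, this
       call would clobber it.                                                 */
    void clear_node(struct DoubleList *N) {
      memset(N, 0, sizeof *N);
    }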

6.2 Collocation of DS Nodes into Shared Pools

The basic pool allocation algorithm provides a general framework for segregating data structures in the heap but never collocates two DS nodes into the same pool. Intuitively, it seems that collocation can improve the performance of programs that often access nodes from two different data structures in an interlaced fashion. For example, if a program contains a linked list where every list node contains a pointer to its data, a common traversal pattern may be to walk the list and dereference the pointer at each node. Collocating the list with the data may improve locality.

There are two aspects to implementing this in the pool allocator framework: 1) allowing multiple DS Nodes to be associated with the same pool, and 2) heuristics to determine when to merge multiple DS Nodes into a single pool. We describe these in Sections 6.2.1 and 6.2.2 below. Section 6.2.3 describes our experiences with collocation.

6.2.1 Algorithm Extensions to Support Collocation

We can easily adapt the pool allocation algorithm to support collocation simply by changing line 10 of Figure 5.4 to insert poolcreate and pooldestroy for only one of the nodes being collocated, and to initialize the "pdmap" entries for the other nodes to point to the common descriptor. Since heap objects are laid out separately and dynamically within each pool, collocating objects can give the compiler some control over the internal layout of data structures. Even more sophisticated control might be possible using additional techniques (e.g., [20]) on a per-pool basis.
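For instance, if the DS nodes for a list and the objects it points to were collocated, the generated allocation code would simply draw both from one shared descriptor, roughly as in this sketch (the names and the runtime header are hypothetical):

    #include "poolalloc_runtime.h"  /* hypothetical header declaring Pool and poolalloc */

    struct item { double d; };
    struct node { struct item *dptr; struct node *next; };

    /* Both DS nodes share one pool descriptor: their pdmap entries point at
       the same Pool, so a single poolcreate/pooldestroy pair covers both.    */
    struct node *prepend(Pool *SharedPD, struct node *head, double value) {
      struct item *I = poolalloc(SharedPD, sizeof(struct item));
      struct node *N = poolalloc(SharedPD, sizeof(struct node));
      I->d = value;
      N->dptr = I;
      N->next = head;
      return N;
    }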

6.2.2 Node Collocation Heuristics

In our implementation, we experimented with two static heuristics for collocating H nodes into a common pool. For these heuristics, we define a Collection to be either a node with A = true, or any non-trivial strongly connected component (SCC) in the DS Graph. A non-trivial SCC is one containing at least one cycle, including self-cycles. Given this, any H node reachable from a Collection represents a set of objects which may be visited by a single traversal over the objects of the collection.

The NoCollocation Heuristic

This is the default heuristic, which assigns each DSNode to its own pool, as described in Chapter 5.

The OnePoolPerCollection Heuristic

All candidate H nodes in a collection are assigned to a single pool. Any other H node reachable from a collection (without going through another collection) is assigned to the same pool as the collection. This choice effectively partitions the heap so that each minimal "traversable" collection of objects becomes a separate pool. Intuitively, this gives the finest-grain partitioning of recursive data structures, which are often hierarchical. It favors traversals over a single collection within such a hierarchical (i.e., multi-collection) data structure.

The MaximalDS Heuristic

Assign a maximal connected subgraph of the DS graph in which all nodes are H nodes to a single pool. This partition could be useful as a default choice if there is no information about traversal orders within and across collections. In particular, this can help if a program creates a complex connected data structure consisting of multiple DS nodes and traverses it in the same order as it was created. This heuristic puts all nodes in a single pool, allowing such a traversal to be linear in memory (if objects are laid out by the library in allocation order).

6.2.3 Experiences with Node Collocation

Our implementation of pool allocation supports flexible and pluggable collocation policies, and we experimented with the options above and several other (ad-hoc) choices. In practice, however, we found that static collocation heuristics rarely outperform (and often do much worse than) assigning each H node to a separate pool. Our experiments suggest several explanations that contribute to this effect:

• Collocation interferes with and often disables the pool optimizations described in Section 6.1. In particular, merging pools often breaks type homogeneity, can lose the "never freed to" property, etc.

• Static heuristics may not be enough. In particular, it is possible that profile information could be used to tune the collocation choices to the program being optimized. We have not experimented with using profile information to drive collocation decisions.

• Collocation can, but does not necessarily, improve the amortized locality for programs that use data structure traversals (in contrast to occasionally accessing single associated memory objects), even when achieving "perfect" collocation.

The third observation is the most important, so we expand on it here. In particular, consider a linked list of pointers to doubles. The memory layout of this data structure without collocation might be as shown in Figure 6.7 (ignoring inter-object padding and object headers).

Figure 6.7: Linked List of Doubles without Node Collocation

After collocation of these two pools, assuming the best case node interleaving occurs, the pool would be laid out as shown in Figure 6.8.

Figure 6.8: Linked List of Doubles with Perfect Node Collocation

If the program frequently traverses the linked list, there are two possible traversal patterns it could use. In the first, the program does not use the double nodes during a traversal of the list; nevertheless, they are pulled into the cache when the list nodes are accessed. If this is the most important traversal, collocation of this data structure would clearly be detrimental to effective cache capacity for this list. This effect can potentially be avoided through the use of profile information and/or smart heuristics. The second possible traversal does access the double every time the list node is accessed.

In this traversal, node collocation intuitively should help locality by avoiding a potential cache miss accessing the double. Unfortunately, two issues make this significantly less likely to occur than we would like. First, this behavior only holds in the “perfect” case above, where we succeed at putting the dereferenced node immediately after the single node that points to it. In practice, however, it is likely that the program either mutates the “dptr” pointers during the lifetime of the list, or it has multiple list nodes that point to the double nodes (otherwise the double should have been inlined into the list).

The second issue is that, in many cases where the perfect situation for collocation occurs, collocation will not actually reduce the amortized number of cache lines accessed by the program. In particular, assume the list node is a 16-byte object, the double is an 8-byte object, a cache line is 32 bytes, and there is no inter-object padding for these structures. In this case, before collocation, two list nodes or four doubles fit on a single cache line. After collocation, 1.5 list nodes and one double fit on a cache line. If the program performs a traversal where it gets inter-node reuse from the cache, both organizations of memory will cause the same number of cache lines to be read from memory.

Though collocation does not help the case above, we can easily come up with other cases where it would help, assuming perfect collocation. In particular, if a data structure is not traversed in perfect memory order (e.g. querying a binary search tree), collocation could significantly reduce the number of cache misses for data pointed to by the tree nodes.

Clearly there are many variables that interact and may affect the profitability of collocation. Though our experiments have not shown an advantage to using collocation, more aggressive techniques (e.g. using profile information) that avoid bad cases could show substantial locality benefits. We leave full investigation of collocation algorithms and benefits to future work.

6.3 Pool Allocation and Optimization Performance Results

The pool allocation optimizations described in Section 6.1 are directly aimed at improving the performance of the application. As such, we are interested in several aspects: 1) How often do the optimizations trigger? 2) What is the aggregate performance impact of the optimizations? 3) What contribution to the aggregate impact does each optimization make? Also, because Chapter 5 did not evaluate the performance impact or overhead of pool allocation, we do so here.

To quantify these aspects of the optimizations, we applied the optimizations and ran the resulting programs on the same set of benchmarks and the same machine used to evaluate the pool allocator in Section 5.5.

6.3.1 Implementation and Evaluation Framework

We implemented Automatic Pool Allocation as a link-time transformation using the LLVM Compiler Infrastructure (Chapter 2). Our system compiles source programs into the LLVM representation (for C and C++, we use a modified version of the GCC front-end), applies standard intraprocedural optimizations to each module, links the LLVM object files into a single LLVM module, and then applies interprocedural optimizations. At this stage, we first compute the complete Bottom-up DS graphs and then apply the Pool Allocation algorithm. Finally, we run a few passes to clean up the resulting code, the most important of which are interprocedural constant propagation (IPCP), to propagate null or global pool descriptors when these are passed as function arguments, and dead argument elimination (to remove pool pointer arguments made dead by IPCP). The resulting code is compiled to either native or C code using one of the LLVM back-ends, and linked with any native code libraries (i.e., those not available in LLVM form) for execution.

6.3.2 Number of Pool Optimization Opportunities

    Program        BP   BP%   PFE
    164.gzip       1    25%   9
    175.vpr        27   25%   29
    197.parser-b   3    6%    0
    252.eon        0    0%    28
    300.twolf      61   65%   1
    anagram        2    50%   0
    bc             3    13%   0
    ft             2    67%   0
    ks             3    100%  0
    yacr2          7    35%   0
    analyzer       5    100%  0
    neural         5    100%  0
    pcompress2     0    0%    0
    llu-bench      1    100%  0
    chomp          0    0%    0
    fpgrowth       0    0%    0
    espresso       1    1%    3
    povray31       6    13%   28
    bh             1    100%  0
    bisort         1    100%  0
    em3d           6    50%   0
    health         2    100%  0
    mst            4    100%  0
    perimeter      1    100%  0
    power          3    100%  0
    treeadd        2    33%   0
    tsp            1    100%  0

Figure 6.9: Statistics for Pool Optimizations

Figure 6.9 shows the static number of pools that can use a bump pointer after poolfree elimination (BP), that number as a percentage of the program's static pools (BP%), and the number of poolfree calls deleted when PoolFreeElim is enabled (PFE). The figure shows that in many programs (the larger such examples are vpr, twolf, yacr2, and povray), a significant fraction of pools are identified as eligible bump-pointer pools (no frees occur to the pool), even with our simple detection algorithm. For vpr, twolf and povray, this is enabled by the elimination of several poolfree operations. This elimination indicates the presence of the build-use-destroy pattern explained in Section 6.1.2. In 175.vpr, for example, 29 poolfree calls are eliminated. Overall, the figure shows that though these optimizations are very simple, they do trigger a large number of times, building on the segregation of memory performed by the pool allocator.

    Program        GCC     NoPA    OnePool Run  OnePool %  OnlyOH Run  OnlyOH %
    164.gzip       25.11   28.16   28.44        101.0%     28.17       100.0%
    175.vpr        10.54   10.88   10.86        99.8%      10.87       99.9%
    197.parser-b   12.59   12.42   17.86        142.7%     13.36       106.7%
    252.eon        1.15    0.86    0.85         98.8%      0.88        102.3%
    300.twolf      20.26   20.10   19.98        99.4%      20.50       102.0%
    anagram        3.46    3.02    3.01         99.7%      3.02        100.0%
    bc             1.71    1.55    1.48         95.5%      1.71        110.3%
    ft             63.74   68.73   66.08        96.1%      68.94       100.3%
    ks             4.56    4.43    5.30         119.6%     4.39        99.1%
    yacr2          3.76    3.86    3.94         102.0%     3.89        100.8%
    analyzer       324.54  312.25  314.69       100.8%     313.69      100.5%
    neural         88.82   87.34   87.35        100.0%     87.60       100.3%
    pcompress2     38.61   37.77   37.44        99.1%      38.04       100.7%
    llu-bench      106.63  106.50  108.86       102.2%     106.76      100.2%
    chomp          17.26   16.71   10.63        63.6%      16.82       100.6%
    fpgrowth       36.27   36.62   36.49        99.7%      39.30       107.3%
    espresso       1.25    1.22    1.20         98.3%      1.26        103.3%
    povray31       9.41    9.79    9.69         98.9%      9.81        100.2%
    bh             14.02   9.33    9.32         99.9%      9.35        100.2%
    bisort         12.59   13.06   13.14        100.6%     13.20       101.1%
    em3d           9.55    6.80    6.76         99.4%      6.80        100.0%
    health         14.11   13.99   13.39        95.7%      13.98       99.9%
    mst            12.79   13.14   13.23        100.7%     13.34       101.5%
    perimeter      3.02    2.92    2.58         88.4%      3.00        102.7%
    power          4.61    2.91    2.93         100.7%     2.92        100.3%
    treeadd        17.48   17.41   17.29        99.3%      17.6        101.1%
    tsp            7.17    7.24    7.08         97.8%      7.42        102.5%

Table 6.1: Baseline (NoPA), allocator, and overhead comparisons

6.3.3 Performance Baseline, Allocator Influence, and Overhead

Table 6.1 shows data to characterize the baseline we use for comparison and isolate the overheads added to a program by pool allocation. The GCC column is the execution time of the program with the GCC 3.4.2 compiler (at -O3). The NoPA column is the program compiled with LLVM using exactly the same sequence of transformation and cleanup passes as we do for pool allocation (described in Section 6.3.1), but with the pool allocator and all pool-based optimizations disabled. Using NoPA as a baseline for comparison below isolates the speedup of the pool allocator transformation and its optimizations by factoring out the impact of other LLVM compiler passes.

Comparing GCC to NoPA shows that the LLVM-generated code is no worse than 12% slower than GCC code and is sometimes much better. This indicates that the code quality of NoPA is reasonable to use as a baseline for comparisons.

Another key question is how the difference between the allocator in our pool runtime library (used after pool allocation) and the standard libc malloc library (used by NoPA) affects the comparisons. This is significant because our pool library implementation is currently not thread-safe (though it is otherwise fully general), and this or other implementation details could skew the results in our favor. To measure this, we transformed the programs to allocate out of a single global pool (this transformation does not add pool arguments or other overhead to the program), effectively using our allocator to replace malloc and free for the program (the OnePool column). Comparing with NoPA shows that in all but 4 cases (197.parser-b, ks, chomp and perimeter), OnePool is within about 5% of NoPA. The large slowdown for parser-b occurs because we use a singly-linked free list and the order of frees prevents coalescing adjacent free blocks. chomp is much faster with our allocator because our allocator has a fast path for fixed-size allocations (to exploit type-homogeneous pools) and nearly all allocations in chomp are (multiples of) this fixed size. As shown below, in all cases except perimeter, any such advantages from our runtime library (even for chomp) are much smaller than the aggregate performance improvements due to pool allocation.
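Conceptually, the OnePool configuration rewrites every allocation site roughly as follows (a sketch: the descriptor name and the runtime header are hypothetical, and the real rewrite is done on LLVM IR):

    #include "poolalloc_runtime.h"  /* hypothetical header declaring Pool, poolalloc, poolfree */

    static Pool TheOnePool;         /* poolcreate(&TheOnePool) is called once at startup */

    /* Every malloc/free in the program is redirected to the single global
       pool, so our allocator effectively replaces libc malloc without adding
       pool arguments or changing object segregation.                         */
    void *onepool_malloc(unsigned size) { return poolalloc(&TheOnePool, size); }
    void  onepool_free(void *ptr)       { poolfree(&TheOnePool, ptr); }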

Finally, the OnlyOH column aims to isolate the performance overheads in the transformed code, namely, extra pool arguments on functions and initializing and destroying pool descriptors. It is computed by pool-allocating the program, but modifying the runtime library so that poolalloc/poolfree simply call malloc/free. Comparing to NoPA shows that this overhead is negligible or quite low (less than about 5%) in nearly all cases, but is slightly higher in 197.parser-b (7%), bc (10%), and fpgrowth (7%). The pool allocator must overcome this overhead to provide a net performance improvement.

6.3.4 Aggregate Performance Effect of Pool Allocation & Optimizations

Figure 6.10 and Table 6.2 show the program running times and speedups (relative to NoPA) for automatic pool allocation alone (BasePA) and for pool allocation with all pool-based optimizations (FullPA). FullPA therefore represents the aggregate performance impact of this work. As the table shows, FullPA improves the performance of many programs by 5% to 20%, improves analyzer and llu-bench by roughly 2x, and improves ft and chomp by more than 10x (see Section 6.3.7 for an analysis of ft and chomp). In no case does FullPA hurt performance relative to NoPA.

[Figure: two bar charts per benchmark. Left legend: Base Pool Allocator, PA + All Optimizations; Y axis: Runtime ratio (1.0 = No Pool Allocation). Right legend: No Pool Allocation, All Optimizations; Y axis: Runtime Ratio (1.0 = Base Pool Allocation).]

Figure 6.10: Aggregate execution time ratios (Left 1.0 = NoPA, Right 1.0 = BasePA)

Not surprisingly, there is no obvious correlation between the speedups obtained and the number of static or dynamic pools.

The causes and breakdown of these improvements are studied below. The charts in Figure 6.10 are shown with two different baselines. The chart on the left makes it easy to see the net impact of pool allocation and the pool optimizations, and the chart on the right allows inspection of the performance effect of the pool optimizations over and above what pool allocation itself does.

This shows that the pool optimizations contribute significant performance improvements to these programs.

6.3.5 Performance Contribution of Individual Pool Optimizations

Figures 6.11 and 6.12 show the runtime ratio of each program with one optimization disabled at a time, and compare it to two baselines (NoPA for the former and FullPA for the latter). This shows how much the program slows down when a particular optimization is disabled, which is correlated with how much the optimization helps the performance of the code. Note that if two optimizations can provide the same speedup (e.g. either the alignment optimization or the bump pointer can reduce inter-object padding), disabling either one alone will not show a slowdown. Despite this, the analysis does provide useful insight into the effect of the optimizations.

All of the optimizations except SelectivePA contribute noticeable improvements to at least one program. SelectivePA provides no significant speedup but does not hurt performance, and it is useful because it can improve memory consumption significantly in some cases.

    Program        NoPA    BasePA  BasePA/NoPA  FullPA  FullPA/NoPA
    164.gzip       28.09   27.93   0.99         28.40   1.01
    175.vpr        10.88   10.85   1.00         10.30   0.94
    197.parser-b   12.52   10.14   0.81         9.84    0.79
    252.eon        0.86    0.84    0.98         0.84    0.98
    300.twolf      20.10   17.59   0.88         17.01   0.85
    anagram        3.02    3.00    0.99         3.00    0.99
    bc             1.55    1.26    0.81         1.24    0.80
    ft             68.73   5.89    0.09         4.98    0.07
    ks             4.43    4.38    0.99         4.39    0.99
    yacr2          3.89    3.89    1.01         3.87    1.00
    analyzer       312.25  183.64  0.59         130.53  0.42
    neural         87.60   87.33   1.00         87.15   1.00
    pcompress2     38.04   37.52   0.99         37.68   1.00
    llu-bench      106.50  108.37  1.02         60.96   0.57
    chomp          16.71   1.71    0.10         1.46    0.09
    fpgrowth       36.62   31.13   0.85         30.42   0.83
    espresso       1.22    1.15    0.94         1.09    0.89
    povray31       9.79    9.31    0.95         9.12    0.93
    bh             9.33    9.41    1.01         8.88    0.95
    bisort         13.06   13.02   1.00         11.04   0.85
    em3d           6.80    6.82    1.00         6.62    0.97
    health         13.99   13.35   0.95         12.02   0.86
    mst            13.14   11.67   0.89         11.39   0.87
    perimeter      2.92    2.59    0.89         2.45    0.84
    power          2.91    2.91    1.00         2.91    1.00
    treeadd        17.41   17.19   0.99         16.85   0.97
    tsp            7.24    7.03    0.97         5.95    0.82

Table 6.2: Run time (seconds) and runtime ratios vs. NoPA

[Figure: bar chart per benchmark. Legend: Base Pool Allocator, PA + All Optimizations, Selective PA Disabled, PoolFree Elim Disabled, Bump Pointer Disabled. Y axis: Runtime ratio Test/NoPA (smaller is faster).]

Figure 6.11: Pool Optimization Contributions (1.0 = No Pool Allocation)

[Figure: bar chart per benchmark. Legend: Selective PA Disabled, PoolFree Elim Disabled, Bump Pointer Disabled, Align Opt Disabled, All Disabled (Base PA). Y axis: Runtime ratio Test/FullPA (smaller is faster).]

Figure 6.12: Pool Optimization Contributions (1.0 = PA with all PoolOpts)

The poolfree elimination optimization improves 175.vpr, 197.parser-b, espresso, and povray31. The bump pointer optimization appears to be the most significant of the three, being particularly valuable to 175.vpr, 300.twolf, ft, analyzer, llu-bench, and several Olden programs. Close inspection of 175.vpr is particularly interesting: BasePA is not faster than NoPA, but a combination of poolfree elimination and the bump pointer optimization reduces the runtime of the program to 95.7% of NoPA (SelectivePA reduces it further to 94.6%). Finally, several programs benefited from the alignment optimization, particularly ft, chomp, health and tsp.

The speedups from these pool optimizations are particularly notable because they are all very simple optimizations, but they can be performed only once the heap has been segregated into pools.

6.3.6 Cache and TLB Impact of Pool Allocation

[Figure: bar chart per benchmark. Legend: L1 Misses, L2 Misses, TLB Misses. Y axis: Cache miss ratio.]

Figure 6.13: L1/L2/TLB Cache Miss Ratios

Figure 6.13 shows the measured cache miss ratios of FullPA compared to NoPA. The figure includes measurements showing the number of accesses that miss the Athlon's L1 D-cache, the number of accesses that miss the L2 D-cache, and the number of DTLB misses, as measured by the Athlon performance monitoring counters. The graph shows that the programs with the largest speedups generally have dramatically reduced miss rates at every level of the cache hierarchy. The benefits for twolf and llu-bench are primarily at the TLB, and those of ft are much greater in the L1/L2 caches. For all other cases, the reductions are closely correlated at all three levels of the memory hierarchy. This indicates that in these cases, the performance benefits are primarily due to smaller working sets, which would be produced by defragmenting the heap.

6.3.7 Access Pattern and Locality Changes

Section 6.3.4 shows that two programs, "ft" (from the Ptrdist suite) and "chomp" (from the McGill suite), speed up by over 10x with automatic pool allocation, and Section 6.3.6 shows that this is due to a dramatic reduction in cache and TLB misses. To characterize the source of this effect, we study these two programs in detail. We find that in both cases, the dramatic speedup comes from success at our stated goal: segregating distinct data structures from each other in memory.

To evaluate the performance behavior of these programs, we instrument them to capture a trace of all of the dynamic loads they execute. Given this data, we filter out accesses that are not directed to the heap, number the remaining loads in order, and plot the address loaded vs. the load number. This generates a plot like that shown in Figure 6.15. We choose not to plot data for stores, as loads are typically the primary performance problem for heap-intensive programs (a load that misses in the cache blocks all instructions with true dependences on the load from executing).
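Illustratively, the effect of the instrumentation is roughly that of inserting a call like the one below before every load; our actual instrumentation is an LLVM pass, and details such as buffering, the trace file name, and the filtering of non-heap addresses are omitted or hypothetical here:

    #include <stdio.h>

    static FILE *TraceFile;
    static unsigned long LoadNumber;

    /* Record one (load number, address) pair; non-heap addresses are filtered
       out in a later post-processing step before plotting.                   */
    void record_load(void *addr) {
      if (!TraceFile) {
        TraceFile = fopen("loads.trace", "w");   /* hypothetical file name */
        if (!TraceFile)
          return;
      }
      fprintf(TraceFile, "%lu %p\n", LoadNumber++, addr);
    }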

For this study, we generate two plots: one when the program is running with standard malloc, and one when the program is changed to use the pool allocator. In both cases, we color each load on the plot to indicate which pool the load would target if the program were pool allocated (to make it easier to correlate data between the charts). Because the dynamic loads executed do not depend on the memory allocation pattern, the X axes of the charts match each other exactly. The Y axes, on the other hand, depend on the addresses at which the allocator placed the objects being loaded.

In runs that use the pool allocator, we disallow the system malloc implementation from calling mmap to satisfy allocation requests. The pool allocation runtime library requests (relatively) large blocks of memory from the system malloc to implement its internal memory slabs (described in Section 5.1.1). Without this tweak, the first several pool slabs are allocated in the 'brk' region, then the later ones are allocated with mmap. While there should be no substantial performance difference between sometimes using mmap and always using 'brk', it makes the graphs much harder to visualize: the giant address range difference between the mmap region and the brk region dwarfs the address range differences within either region (compressing both regions to horizontal lines). Note that runs with the standard malloc library do not need this tweak because they only allocate relatively small objects of less than 1000 bytes.

Finally, note that these figures are most easily understood in color. If possible, obtain the color postscript or PDF version of this thesis from the LLVM web site to see them.

Impact of Automatic Pool Allocation on chomp

chomp is a solver for a simple two-player board game, configured to play against itself. chomp allocates three different kinds of nodes, which we call L, P, and D. L is an 8-byte object of type list, P is a 16-byte object of type play, and D is an array of int. To explain how the pool allocator is able to realize a 10x speedup on chomp, Figures 6.14 and 6.15 plot every load in chomp that accesses the heap, using malloc and the pool allocator, respectively. In these charts, the L objects are green, the P objects are red, and the D objects are blue1. In this (reduced) execution of chomp, we see that it has three phases: construction, processing and destruction, and that the processing phase makes eight distinct traversals over the L and P lists (corresponding to the vertical lines in Figure 6.14).

Like many programs, chomp uses an irregular allocation pattern, and generally intermixes object allocations (e.g. it starts with DPDLDPDLDLDDDPDLDL...). When using malloc, these objects are interspersed on the heap, roughly corresponding to allocation order (reuse of freed memory makes it inexact). When using the pool allocator, the three different objects are put in separate pools, and objects in each pool end up roughly in allocation order (P is exactly in allocation order, because nodes are never freed to its pool). The D objects are relatively large (compared to the L and P nodes) but are not accessed very frequently.

These allocation/layout patterns mean that, without Pool Allocation, the L and P list nodes are dispersed in memory (e.g. with variable strides of 100-500 bytes for the P objects), whereas the pool allocator packs them together, achieving a perfect stride of 20 bytes for the P objects (16 for the object and 4 for the object header). In Figure 6.14, we can see that the traversals of these L and P lists pull most of the heap into and out of the cache, and because the nodes have D allocations interspersed between them (which are not used in these traversals), each cache line fetch has at most one useful 8- or 16-byte object on it (much smaller than the cache line), wasting cache capacity and memory bandwidth on unused D objects.

1Note that our data plotting tool cannot draw data points transparently. As such, in Figure 6.14, all of the red points are covered with green points during the processing phase, and the red/green points are covered by the blue points during the construction and destruction phases. All of these points are easily visible in Figure 6.15.

Figure 6.14: chomp Access Pattern with Standard malloc/free

Figure 6.15: chomp Access Pattern with Pool Allocation

Figure 6.15 shows that pool allocation segregates the linked lists from each other (and from the D objects), allowing these traversals to cover a much smaller address range. Because these small objects are packed densely together, a cache line fetch for one object will bring other useful objects on the same cache line into the cache at the same time. This improves cache density and reduces the memory bandwidth required. The figure also shows the behavior of the pool allocation runtime library (described in Section 5.1.1), where it allocates chunks of memory from malloc to hold pool objects, doubling the size of the chunks each time it fills a chunk.

This change dramatically reduces the cache footprint of linked list traversals over the P and L nodes. In the case of the P list, it yields optimal cache density and provides the hardware stride prefetcher with a linear access pattern. This combination provides a reduction from 251M L1 misses to 63M L1 misses. Also, because the range of accessed memory is much smaller for these traversals, TLB misses are greatly reduced.

Impact of Automatic Pool Allocation on “ft”

The ft program (from the Ptrdist benchmark suite [8]) is an implementation of the minimum spanning tree algorithm described in [57]. It first creates a random undirected graph, then computes the minimum spanning tree of it. The input used for the performance numbers above (e.g., Section 6.3.4) builds a graph with 6000 nodes and 100,000 edges, which is large, but not unreasonably so. Figures 6.16 and 6.17 show the access pattern of a small input to ft (30 nodes and 150 edges) with malloc (the former) and with Automatic Pool Allocation (the latter). We show a reduced input to make it easier to understand the figures.

ft consists of four main phases. The first phase creates the nodes for the random graph, the second phase adds the random edges to the graph, the third phase computes the minimum spanning tree, and the fourth phase prints out information about the spanning tree and the graph. The first phase is roughly from load #0 to #400, the second from #400 to #7600, the third from #7600 to #15750, and the last is from #15750 on.

Figure 6.16: ft Access Pattern with Standard malloc/free

Figure 6.17: ft Access Pattern with Pool Allocation

ft allocates three different types of heap objects, which we name V, E, and H. V is a 20-byte structure of type Vertices, used to represent each vertex in the graph. E is a 16-byte structure of type Edges, used to represent each edge in the graph. H is a 28-byte object of type Heap, which represents the fibonacci heap used to solve the spanning tree problem. In the figures, the V nodes are red dots, the E nodes are blue crosses, and the H nodes are green diamonds.

The figures show that Automatic Pool Allocation is able to segregate the nodes of each of these different data structures into different memory spaces, whereas malloc interlaces the V and initial E nodes. In particular, when allocating the nodes for the graph (the first phase), the program allocates two edges for every node that it inserts into the graph (producing a pattern of VEEVEEVEEVEEVEE...). This pattern results in each of the V objects having two E objects between them, separating them in memory by 72 bytes: 20 bytes for the V node, a 4-byte object header, two 16-byte E nodes, and two 8-byte headers for the E objects (one word of alignment padding and one word of object header each). With the pool allocator, the nodes are allocated separately from each other, with a 24-byte offset between the V nodes. When both the bump-pointer and alignment optimizations are enabled2, the offset between these nodes shrinks to 20 bytes because the nodes are never freed and the nodes contain no data that requires 8-byte alignment.

The ft program makes many traversals over the list of nodes during Phase 2 (easily visible in the bottom left of Figure 6.17), thus this dense packing of node objects allows each cache line fetch of a V node to pull other V nodes into the cache (instead of unrelated E nodes). This improves effective cache density, reduces memory bandwidth requirements, and reduces the working set of the program. It also makes it more likely that the V nodes will fit in the cache during Phase 3. The combination of bump-pointer and alignment optimizations further improves this, speeding up ft by about 18% over the base pool allocation performance. An optimization like instance interleaving [136] (discussed briefly in Section 8.2.1) would improve performance even more by improving the density of the 'next' field accesses in the V structure.

2If the bump-pointer optimization is disabled, 4 bytes are required for an object header. If the alignment optimization is disabled, the four bytes saved by the bump-pointer optimization are replaced with 4 bytes of alignment padding.

After pool allocation and optimizations, the V list is often traversed with a constant (backward) stride of 20 bytes (because the newly allocated nodes are added to the front of the list). Because the nodes fit on fewer hardware pages, the automatic stride prefetcher on the Athlon is slightly more effective: with malloc it prefetches with a (backward) stride of 72 bytes (prefetching 56 nodes per page); with pool allocation it prefetches with a (backward) stride of 20 bytes (prefetching 204 nodes per page). This effect is significant, because the Athlon prefetcher stops on virtual memory page boundaries [80].
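
The per-page node counts quoted above follow directly from the two strides, assuming the Athlon's 4KB base page size (an assumption; the text only states the resulting counts):

    floor(4096 / 72) = 56 nodes per page with malloc
    floor(4096 / 20) = 204 nodes per page with pool allocation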

In addition to optimizing the V list, the pool optimizations apply to the other two node types, though they do not contribute significantly to the performance improvement of FullPA over BasePA.

The H structures are eligible for the alignment optimization but not the bump-pointer optimization (nodes are freed and reallocated to the Fibonacci heap), but this optimization does not eliminate any inter-node padding in this case (because the H nodes are of size 8n + 4 with n = 3). The E list is eligible for both the bump-pointer and alignment optimizations, which reduce inter-node padding from 8 bytes to 0 bytes. Because accesses to the edge list suffer from poor locality even after pool allocation (neighboring nodes in memory are seldom accessed together), this improvement to the E list does not significantly improve program performance, but it may contribute to reduced TLB miss rates.

Overall, we see that segregation of the V and E lists, which is one of the main goals of Automatic Pool Allocation, greatly improves the performance of this program.

Summary of Automatic Pool Allocation Impact

While chomp and ft are extreme cases, they illustrate perfectly how segregating and deinterlacing unrelated data structures can have a significant impact on the performance of heap-intensive programs. Note that Automatic Pool Allocation may have an even more significant effect for real-world programs which fragment their heap over time: if the heap starts out fragmented, even linear allocations of memory (without interspersed allocations of other node types) may be fragmented in the heap. With Automatic Pool Allocation, these nodes are more likely to be grouped coherently together.

6.4 Research Contributions of Pool Allocation Optimizations

This chapter makes the following major research contributions beyond those described in Section 5.7:

(i) We present several simple but novel optimizations that can substantially improve the performance of heap data structures on a per-pool basis, beyond what pool allocation already provides. Several of these optimizations are general enough that they could be reimplemented in other existing region inference implementations.

(ii) We present an extension of the Automatic Pool Allocation transformation to support node collocation, present several example collocation heuristics, and describe our experiences with collocation.

(iii) We provide detailed performance results for both the Automatic Pool Allocation transformation itself and the pool optimizations presented in this chapter. We show that Automatic Pool Allocation and its optimizations improve the performance of several programs by 10-20%, speed up two by about 2x and two others by about 10x. We show that locality is substantially improved by these transformations.

(iv) We use the chomp and ft benchmarks to show graphically how Automatic Pool Allocation meets its goal of segregating distinct data structures, which dramatically improves the performance of these codes by more than 10x. We analyze and describe exactly what happens and how it relates to the source level.

Finally, we note again that pool allocation can be used as the basis for subsequent optimizations and analysis. Chapter 7 describes an aggressive macroscopic optimization (Transparent Pointer Compression) and Chapter 8 describes several non-performance-related applications of pool allocation.

Chapter 7

Transparent Pointer Compression

64-bit computing is becoming increasingly important for modern applications. Large virtual address spaces are important for several reasons, including increasing physical memory capacity, rapidly growing data sets, and several advanced programming techniques [121, 145, 70].

One problem with 64-bit address spaces is that 64-bit pointers can significantly reduce memory system performance [98] compared to 32-bit pointers1. In particular, pointer-intensive programs on a 64-bit system will suffer from (effectively) reduced cache/TLB capacity and memory bandwidth for the system, compared to an otherwise identical 32-bit system. The increasing popularity of object-oriented programming (which tends to be pointer intensive) amplifies the potential problem.

We observe that the primary use of pointers in many programs is to traverse linked data structures, and very few individual data structures use more than 4GB of memory, even on a 64-bit system.

The question therefore is: How can we use pointers more efficiently to index into individual data structures?

This chapter2 presents a sophisticated compiler transformation, Transparent Pointer Compression for Linked Data Structures, which automatically compresses pointers in type-safe data structures (e.g. from 64 bits to 32 bits or less), while conservatively leaving non-type-safe data structures unmodified. Transparent Pointer Compression first pool allocates (Chapter 5) the code, then compresses pointers by replacing 64-bit pointers with smaller integer indexes from the start of these pools.

1Thanks to Wen-mei Hwu's research group for bringing this issue to our attention.
2Note that an updated version of this content will be published in [90].

Figure 7.1: Linked List of 4-byte characters

Figure 7.2: Pool Allocated Linked List

Figure 7.3: Pointer Compressed Linked List

Consider a simple linked list of integers. Figure 7.1 illustrates the list compiled without pointer compression, and Figure 7.3 illustrates the memory organization with pointers compressed to 32-bit integer indexes. In this example, each node of the list originally required 16 bytes of memory3 (4 bytes for the integer, 4 bytes of alignment padding, and 8 bytes for the pointer), and the nodes may be scattered throughout the heap. In this (extreme) example, pointer compression reduces each node to 8 bytes of memory (4 for the integer, and 4 for the index that replaces the pointer). Each index holds the offset of the target node from the start of the pool instead of an absolute address in memory.
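
A minimal sketch of the two node layouts described above, on an LP64 target (the struct and field names are illustrative, not taken from a particular benchmark):

#include <stdint.h>
#include <assert.h>

/* Original node: 4 bytes of data + 4 bytes of padding + an 8-byte pointer. */
struct ListNode   { int32_t Value; struct ListNode *Next; };

/* Compressed node: the pointer becomes a 32-bit offset from the pool base. */
struct ListNodePC { int32_t Value; int32_t Next; };

int main(void) {
  assert(sizeof(struct ListNode)   == 16);   /* 16 bytes with alignment padding */
  assert(sizeof(struct ListNodePC) == 8);    /* 8 bytes, no padding needed      */
  return 0;
}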

This chapter is organized as follows. In Section 7.1, we first describe a “static” version of pointer compression which limits individual pools to 2^k bytes each, for some k < 64 (e.g., k = 32).

Section 7.2 extends this basic approach with a “dynamic” scheme that speculates that pointers will be small (and thus shrinks them) but allows them to dynamically expand if full addressing generality is required. Section 7.3 describes important optimizations over the basic algorithm required to achieve good performance of the generated code. Section 7.4 evaluates the performance impact and memory usage impact of the static form of pointer compression, Section 7.5 contrasts this work to previous work, and Section 7.6 concludes the chapter.

3Not including overhead added by malloc.

(a) Original:

struct list { int X; list *Next; };

list *MakeList(int N) {
  list *Result = 0;
  for (int i = 0; i != N; ++i) {
    list *Node = malloc(sizeof(list));
    Node->Next = Result;
    Node->X = i + 'A';
    Result = Node;
  }
  return Result;
}

int Length(list *L) {
  if (L == 0) return 0;
  return Length(L->Next) + 1;
}

int Testlists() {
  list *A = MakeList(100);
  list *B = MakeList(20);
  int Sum = Length(A) + Length(B);
  ((char*)B)[5] = 'c';   // not type safe!
  return Sum;
}

(b) BU DSA graph for MakeList: Result, Node, and the returned value point to the node "list: HM"
(c) BU DSA graph for Length: L points to the node "list: R"
(d) BU DSA graph for Testlists: A points to "list: HMR"; B points to "byte: AHMR"

(e) After Pool Allocation:

struct list { int X; list *Next; };

list *MakeList(Pool *PD, int N) {
  list *Result = 0;
  for (int i = 0; i != N; ++i) {
    list *Node = poolalloc(PD, sizeof(list));
    Node->Next = Result;
    Node->X = i + 'A';
    Result = Node;
  }
  return Result;
}

int Length(list *L) {
  if (L == 0) return 0;
  return Length(L->Next) + 1;
}

int Testlists() {
  Pool P1, P2;
  poolinit(&P1, sizeof(list));
  poolinit(&P2, 0 /* no size hint known */);
  list *A = MakeList(&P1, 100);
  list *B = MakeList(&P2, 20);
  int Sum = Length(A) + Length(B);
  ((char*)B)[5] = 'c';
  pooldestroy(&P1);
  pooldestroy(&P2);
  return Sum;
}

Figure 7.4: Simple linked list example

7.1 Static Pointer Compression

Static pointer compression reduces the size of pointers in data structures in two steps. First, it replaces pointers in data structures with integers representing offsets from a pool base (i.e., indexes into the pool). Second, in order to compress this index, it attempts to select an integer type that is smaller than the pointer size (e.g. by using a 32-bit integer on a 64-bit host). We refer to these as “index conversion” and “index compression” respectively. The latter step may fail because it requires somewhat stronger safety guarantees; nevertheless, we still perform the first step to achieve uniform code sequences for accessing compressed and uncompressed pools4. Static pointer compression will cause a runtime error if the program allocates more than 2^k bytes from a single pool using k-bit indices. Techniques to deal with this in the static case are discussed briefly in Section 7.1.4. Alternatively, this problem is solved by the dynamic algorithm in Section 7.2, but that algorithm is more restrictive in its applicability.

struct list_pc32 { int X; int Next; };

static int MakeList_pc32(Pool *PD, int N) {
  int Result = 0;
  for (int i = 0; i != N; ++i) {
    int Node = poolalloc_pc(PD, 1);
    int *tmp1 = PD->poolbase + Node + offsetof(list_pc32, Next);
    *tmp1 = Result;
    int *tmp2 = PD->poolbase + Node + offsetof(list_pc32, X);
    *tmp2 = i + 'A';
    Result = Node;
  }
  return Result;
}

static int Length_pc32(Pool *PD, int L) {
  if (L == 0) return 0;
  int *tmp = PD->poolbase + L + offsetof(list_pc32, Next);
  return Length_pc32(PD, *tmp) + 1;
}

int Testlists() {
  Pool P1, P2;
  poolinit_pc(&P1, sizeof(list_pc32));
  poolinit_pc(&P2, 1);
  int A = MakeList_pc32(&P1, 100);
  int B = MakeList_pc64(&P2, 20);
  int Sum = Length_pc32(&P1, A) + Length_pc64(&P2, B);
  ((char*)B)[5] = 'c';
  pooldestroy_pc(&P1);
  pooldestroy_pc(&P2);
  return Sum;
}

Figure 7.5: Example after static compression

For our list example of Figure 7.4(a), the static pointer compression transformation transforms the code to that in Figure 7.5. Pointers to the A list are index-converted and compressed whereas those to the B list are converted but not compressed, for reasons explained below. This also requires that distinct function bodies be used for the A and B lists (those for the former are shown). By shrinking pointers from 64 bits to 32 bits (which also reduces intra-object padding for alignment constraints), each object of the A list is reduced from 16 to 8 bytes, effectively reducing the cache footprint and bandwidth requirement by half for these nodes. The dynamic memory layout of the A list is transformed from that of Figure 7.2 to Figure 7.3.

To simplify the presentation, we describe the transformation in three pieces. First we describe changes required to the pool allocation runtime library to support pointer compression. Next, we describe the transformation for data structures that are never passed to or returned from functions: intraprocedural static pointer compression (Section 7.1.2). Finally we describe the approach to handle function calls (Section 7.1.3).

4Note that index conversion alone may also be useful for purposes other than pointer compression because it provides “position independent” data structures that can be relocated in memory without rewriting any pointers other than the pool base.

7.1.1 Pointer Compression Runtime Library

The pointer compression runtime library is almost identical to the standard pool allocator runtime described in Section 5.1.1. The only two functionality differences are that it guarantees that the pool is always contiguous (realloc'ing the entire pool to grow it, or using the technique described in Section 7.3.1) and that it reserves the 0th node to represent the null pointer. The library interface is also cosmetically different in that the memory allocation/free functions take indices instead of pointers, and numbers of nodes to allocate instead of numbers of bytes. The API is listed in Figure 7.6.

void poolinit_pc(Pool* PP, unsigned NodeSize): Initialize the pool descriptor, record node size.
void pooldestroy_pc(Pool* PP): Release pool memory and destroy pool descriptor.
int poolalloc_pc(Pool* PP, uint NumNodes): Allocate NumNodes nodes.
void poolfree_pc(Pool* PP, int NodeIdx): Mark the nodes starting at NodeIdx as free.
void* poolrealloc_pc(Pool* PP, int NodeIdx, uint NumNodes): Resize an object to NumNodes nodes.

Figure 7.6: Pool Compression Runtime Library

7.1.2 Intraprocedural Pointer Compression

Given the points-to graph and the results of automatic pool allocation, intraprocedural static pointer compression is relatively straight-forward. The high level algorithm is shown in Figure 7.7.

Each function in the program is inspected for pools created by the pool allocator. If index-conversion is safe for such a pool, any instructions in the function that use a pointer to objects in that pool are rewritten to use indexes off the pool base. Indexes in memory are stored in compressed form (k bits) when safe, otherwise left in uncompressed form (i.e., 0-extended to 64 bits). The pool is also marked to limit its aggregate size to 2^k bytes.

pointercompress(program P)
 1  poolallocate(P)                      // First, run pool allocator
 2  ∀ F ∈ functions(P)
 3    set PoolsToIndex = ∅
 4    ∀ p ∈ pools(F)                     // Find all pools
 5      if (safetoindex(p))              // index-conversion safe for p?
 6        PoolsToIndex = PoolsToIndex ∪ {p}
 7    if (PoolsToIndex ≠ ∅)
 8      rewritefunction(F, PoolsToIndex)

Figure 7.7: Pseudo code for pointer compression

The safetoindex predicate used on line #5 controls what pools are considered safe to access via indexes instead of pointers. For intraprocedural pointer compression, the constraints are:

1. The pool lifetime must be bounded by this function.

2. The points-to graph node corresponding to the pool must represent only heap objects and no other class of memory (i.e., no global or stack objects).

3. The pool cannot be passed into a function call.

Constraint #1 is directly identified by the pool allocator. If the constraint is not satisfied, the pool may still be index-converted in some parent function (via the full interprocedural algorithm described below). Constraint #2 is determined from the points-to graph produced by DSA. It is required because stack and global data are not allocated out of a heap pool, and pointers to such objects cannot easily be converted into offsets relative to the base of such a pool. Constraint #3 is also identified by DSA, and is relaxed in Section 7.1.3. In practice, we impose a profitability constraint as well: we only index-convert pools that are pointed to by heap memory objects. If a heap object is not pointed to by any memory object (including itself), no pointers in memory will be shrunk by indexing the pool, so there is no reason to index-convert it.

[Figure: points-to graph in which nodes A ("list: HMR") and B ("byte: HMR") both point to node C ("list: HMR"); scalar pointers pa and pb are loaded from the A and B lists]

Figure 7.8: Example with TH and non-TH nodes. Pointers from list A to list C can use compressed indices; those from list B to list C must use uncompressed indices.

Once the indexable pools have been identified in the function, these pools will be used to hold at most 2^k bytes, k < n, where n is the pointer size for the target architecture (e.g., k = 32 and n = 64). All valid pointers into such pools are replaced with indexes in the range [1 ... 2^k − 1]. Some of these index variables, however, must still use a full n-bit representation (i.e., 0-extended from k to n bits) if, for example, the compiler cannot safely change the layout of an object containing the variable. By definition, objects represented by TH nodes of the points-to graph (see Chapter 3) can be safely reorganized; index values in such objects are stored using k bits. For example, in Figure 7.4, the A list objects can be reorganized and therefore can hold compressed indices whereas the B list objects cannot (this would still be true if both lists pointed to a common indexed pool).

For example, Figure 7.8 shows a points-to graph in which a node (list C) is pointed to by a TH node (list A) and a non-TH node (list B). The pool for the C lists can be index-converted, the pointers from the A list to the C lists can be compressed to k-bit indices, but those from the B lists to the C list must be recorded as n-bit indices. Assume the scalar pointer variables pa and pb are loaded out of the A and B lists (e.g., pa = A->next->val and pb = B->next->val). Then, pa and pb will both hold n-bit values, but different code sequences must be used for these two loads.

Once the indexable pools and the compressible index variables have been identified in the function, a single linear scan over the function is used to rewrite instructions that address the indexable pools. Assuming a simple C-like representation of the code which has been lowered to individual operations, the rewrite rules are shown in Figure 7.9 (operations not shown here are unmodified).

Original Statement              Transformed Statement
P = null                  ⇒     P′ = 0
P1 = P2                   ⇒     P1′ = P2′
cc = (P1 == P2)           ⇒     cc = (P1′ == P2′)
P1 = &P2->field           ⇒     P1′ = P2′ + newoffsetof(field)
P1 = &P2[V]               ⇒     P1′ = P2′ + V*newsizeof(P2[0])

If node(P) is non-TH or τ is not a pointer (P : τ∗):
V = *(τ*)P                ⇒     Base = PD->PoolBase
                                V = *(τ*)(Base+P′)
*(τ*)P = V                ⇒     Base = PD->PoolBase
                                *(τ*)(Base+P′) = V

If node(P) is TH and τ is a pointer (P : τ∗):
P1 = *P                   ⇒     Base = PD->PoolBase
                                P1′ = *(IdxType*)(Base+P′)
*P1 = P                   ⇒     Base = PD->PoolBase
                                *(IdxType*)(Base+P1′) = P′

P = poolalloc(PD, N)      ⇒     Tmp = N/OldSize
                                P′ = poolalloc_pc(PD, Tmp)
poolfree(PD, P)           ⇒     poolfree_pc(PD, P′)
poolinit(PD, Size)        ⇒     Tmp = Size/OldSize*NewSize
                                poolinit_pc(PD, Tmp)
pooldestroy(PD)           ⇒     pooldestroy_pc(PD)

Figure 7.9: Rewrite rules for pointer compression

In the rewrite rules, “P” and “P′” denote an original pointer and a compressed index. “V” is any non-compressed value in the program (a non-pointer value, a non-converted pointer, or an uncompressed index). “IdxType” is the integer type used for compressed pointers (e.g. int32_t on a 64-bit system). All P′ values are of type IdxType. Indexes loaded from (or stored to) non-TH pools are left in their original size whereas those from TH pools are cast to IdxType.

The rules to rewrite addressing of structures and arrays lower addressing to explicit arithmetic, and use new offsets and sizes for the compressed objects, not the original. Memory allocations scale (at runtime) the allocated size from the old to the new size. The most common argument to a poolalloc call is a constant that is exactly “OldSize”, allowing the arithmetic to constant fold to NewSize. The dynamic instructions are only needed when allocating an array of elements from a single poolalloc site, or when a malloc wrapper is used (in the interprocedural case).
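
As a concrete example of the allocation rewrite, consider the common single-node case (a sketch; the list_pc32 type and the poolalloc_pc signature follow Figures 7.5 and 7.6):

typedef struct Pool Pool;
extern int poolalloc_pc(Pool *PP, unsigned NumNodes);   /* signature from Figure 7.6 */

struct list_pc32 { int X; int Next; };                  /* from Figure 7.5 */

int AllocateOneNode(Pool *PD) {
  /* The generic rule emits Tmp = N/OldSize, but because the argument to the
     original poolalloc was exactly OldSize, the division constant-folds to 1
     and no runtime scaling code remains. */
  return poolalloc_pc(PD, 1);
}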

7.1.3 Interprocedural Pointer Compression

Extending pointer compression to support function calls and returns requires four changes to the algorithm above. First, constraint #3 from safetoindex is eliminated. Second, a minor change is needed to the pool allocation transformation to pass pool descriptors for all pools accessed in a callee (or its callees), not just those pools used for malloc or free in the callee (this is accomplished by removing the check from Line #4 of Figure 5.4).

In Figure 7.4 for example, the Length function now gets a pool descriptor argument for “L.” Third, the rewrite rules in Figure 7.10 must be used to rewrite function calls and returns. Fourth, and most significantly, interprocedural pointer compression must handle the problem that a reference in a function may use either compressed or non-compressed indices in different calling contexts.

Original Statement              Transformed Statement
P1 = F(P2, V, P3, ...)    ⇒     P1′ = Fc(P2′, V, P3′, ...)
V1 = F(V2, P2, ...)       ⇒     V1 = Fc(V2, P2′, ...)
F(V1, V2, ...)            ⇒     F(V1, V2, ...)
return P                  ⇒     return P′

Figure 7.10: Interprocedural rewrite rules. Pool descriptor args. added by pool allocation are not shown. They are ignored during pointer compression.

The fourth problem arises because the same points-to graph node in a callee function can correspond to different pools in different calling contexts. One context may pass a TH pool and another a non-TH pool, requiring different code to load or store pointers in these two pools. We propose two possible solutions to this problem. The first is to generate conditional code for loads and stores of such index values (uses of these indexes are not a concern because they are always used as n-bit values). The second is to use function cloning and generate efficient, unconditional code in each function body. As explained in the next section, dynamic pointer compression requires conditional code sequences in any case to handle dynamic pool expansion, and we describe the former solution there. Our goal with static pointer compression is to present a very efficient solution that works in most common cases, and therefore we focus on the latter solution (function cloning) here. In practice, we believe that relatively little cloning would be needed for many programs.

Figure 7.4 shows a case when cloning must be used. In particular, Testlists in Figure 7.4 calls MakeList and Length and passes or gets back data from indexed pools into each of them. Since the A list indices are compressed but the B list ones are not, the transformation needs to create two versions of MakeList and Length, one for each case. The A list version (denoted by the suffix “_pc32”) is shown; the second version is the same except it uses the uncompressed rewrite rules for loads and stores of pointers in Figure 7.10. Only two versions are needed for each function because only one pool within each function (the list node) is accessed in multiple ways. In the worst case, cloning can create an exponential number of clones for a function: one clone for each combination of compressed or uncompressed pools passed to a function. In practice, however, we find that we rarely encounter cases where TH and non-TH pools containing heap objects point to a common indexed pool or are passed to the same function.

Given the extensions described above, interprocedural static pointer compression is a top-down traversal of the program call graph, starting in main and cloning or rewriting existing function bodies as needed. Our implementation of static pointer compression does not support indirect function calls, so the single static callee is always known for each call site. All together, applied to the example in Figure 7.4, static pointer compression produces the code in Figure 7.5.

7.1.4 Minimizing Pool Size Violations with Static Compression

Static compression is not a completely safe transformation because a correct program may fail if it tries to allocate more than 2^k bytes from a pool that uses k-bit indices. Nevertheless, we believe this transformation can be used safely in practice on many programs. First, each pool only holds a single data structure instance or even a subset of an instance (if the data structure consists of multiple nodes in the points-to graph). This means that part or all of a single DS instance must exceed 2^k bytes (e.g., 4GB for k=32) before an error occurs.

Second, many pools can be indexed by objects instead of bytes, thus expanding the effective maximum pool size greatly5. Node indexing can be used for TH pools holding objects for which the address of a field is not taken (i.e., all pointers point to the start of pool objects). This criterion is met by many data objects in C and C++ programs, and all those in Java programs.
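
The difference between byte indexing and node indexing can be sketched as two tiny address computations (the helper names are illustrative, not part of the runtime): with k = 32, a byte index can address at most 2^32 bytes of pool, while a node index can address 2^32 objects, i.e. 2^32 * NodeSize bytes.

#include <stdint.h>

/* Byte indexing: the index is a byte offset from the pool base. */
static inline void *ByteIndexToAddr(char *PoolBase, uint32_t Idx) {
  return PoolBase + Idx;                          /* reaches 2^32 bytes */
}

/* Node indexing: the index numbers whole objects, so the reachable pool
   size scales with the object size. */
static inline void *NodeIndexToAddr(char *PoolBase, uint32_t Idx,
                                    uint32_t NodeSize) {
  return PoolBase + (uint64_t)Idx * NodeSize;     /* reaches 2^32 * NodeSize bytes */
}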

Third, a compiler could use profiling runs (and simple runtime pool statistics) to identify pool instances that grow unusually large compared with other pools in a program and simply disable pointer compression for those pools. Finally, programmers could use options or #pragmas to specify that pools created in certain functions should not undergo index compression.

5Node-indexing is actually required for dynamic compression, and is described below.

7.2 Dynamic Pointer Compression

Dynamic pointer compression aims to allow a pool instance to grow beyond the limit of 2^k bytes (or 2^k objects) by expanding compressed indices transparently at run time. The technique has a higher runtime overhead, and cannot be used for all indexed pools in C and C++ programs (this is not a problem in Java programs).

There are several possible ways to implement dynamic pointer compression. To make it as simple as possible to grow pools at run time, we impose three restrictions on the optimization.

First, we compress and expand indices within objects in a pool only if it meets the criteria for node-indexing mentioned above: it must be a TH pool and the address of a field is not taken.

Second, we allow only two possible index sizes to be used for a pool: the initial k bits (e.g., 32) and the original pointer size, n = intptr_t (e.g., 64). Third, for any pool of objects containing compressible indices, we allow only two choices: all are compressed or all are uncompressed. For example, in list A in Figure 7.8, either both index fields (the pointers to the C list and the back edge to the A list) are stored in compressed form or both are stored in uncompressed form (of course, list B would have to be TH for the transformation to apply).

Section 7.2.1 describes the modified rewrite rules for dynamic pointer compression, Section 7.2.2 describes changes to the runtime, and Section 7.2.3 describes the needed changes to the interprocedural transformation.

7.2.1 Intraprocedural Dynamic Compression

In the discussion below, we refer to a pool containing indices as a source pool, since it is a source of pointers into indexed pools. A source pool is often also an indexed pool because many linked data structures are recursive, e.g., the pool for list list2 in Figure 7.13. In this example, the int pool is indexed but is not a source pool.

Intraprocedural dynamic pointer compression is largely the same as static compression, but more complex load/store code sequences are needed for objects containing compressed indices, since these indices may grow at run time. In each source pool, we store a boolean value, “isComp,” which is set to true when objects in the pool hold compressed indices and false otherwise. A single boolean is sufficient by our third restriction above (all indices in an object are compressed or all are uncompressed). If a source pool is also an indexed pool, all index values pointing to the pool that are held in registers, globals, or stack locations are represented using the full n bits (the high bits are zero when isComp = true). Without this simplification, pointer compression would have to expand the indices in all such objects when the pool exceeded 2^k nodes. This is technically feasible for global and stack locations (using information from the points-to graph) but probably isn't worth the added implementation complexity.

Figure 7.12 shows the main6 rewrite rules used for dynamic pointer compression. The transformed version of the Length function in the example is shown in Figure 7.11. Because we do not compress pool indexes if the address of a field is taken, the code for addressing a field and loading it is handled by one rule. The generated code differs from the static compression case in two ways: 1) both compressed and expanded cases must be handled; and 2) node-indexing rather than byte-indexing is used, i.e., the pool index is scaled by the node size before adding it to PoolBase. For the former, a single branch on isComp is sufficient because we restricted source objects to have all compressed or all uncompressed indices: there are only two cases for each source pool, and the object size and field offsets are fixed and known at compile time for each case7.

/* Length with dynamic pointer compression (64 -> 32 bits) */
static int Length(Pool *PD, long L) {
  if (L == 0) return 0;
  long Next = PD->isComp
      ? (long)*(int *)(PD->PoolBase + L*8 + 4)    /* compressed: 8-byte nodes, Next at offset 4 */
      : *(long *)(PD->PoolBase + L*16 + 8);       /* expanded: 16-byte nodes, Next at offset 8  */
  return Length(PD, Next) + 1;
}

Figure 7.11: Example after dynamic compression

We can use node- rather than byte-indexing because of the first restriction, which disallows pointers into the middle of an object in a pool with compressed indices. Node-indexing is important in order to limit which indices need to be rewritten at run time when objects in a source pool are expanded (the specific run-time operations are described below). In particular, expanding objects in a pool does not change their node index, although it does change their byte offset. Therefore, when objects in a source pool are expanded (and the source pool itself is indexed), the node index values in the source pool do not need to change; only their sizes increase.

6We only show the rules for loads of structure fields. Stores are identical except for the final instruction, and array accesses are similar.
7Branch-free sequences are possible for loads and stores for many architectures, and can be tuned for many specific values of the constants. We omit the details here for lack of space.

Original Statement              Transformed Statement
P = null                  ⇒     P′ = 0
P1 = P2                   ⇒     P1′ = P2′
cc = (P1 == P2)           ⇒     cc = (P1′ == P2′)
P1 = P2->field            ⇒     char *Ptr = PD->PoolBase
                                if (PD->isComp) {
                                  Ptr += P2′*newsizeof(pooltype)
                                  Ptr += newoffsetof(field)
                                  P1′ = *(int32_t*)Ptr
                                } else {
                                  Ptr += P2′*oldsizeof(pooltype)
                                  Ptr += oldoffsetof(field)
                                  P1′ = *(int64_t*)Ptr
                                }
τ V = P->field            ⇒     char *Ptr = PD->PoolBase
                                if (PD->isComp) {
                                  Ptr += P′*newsizeof(pooltype)
                                  Ptr += newoffsetof(field)
                                } else {
                                  Ptr += P′*oldsizeof(pooltype)
                                  Ptr += oldoffsetof(field)
                                }
                                V = *(τ*)Ptr
P = poolalloc(PD, N)      ⇒     Tmp = N/OldSize
                                P′ = poolalloc_pc(PD, Tmp)
poolfree(PD, P)           ⇒     poolfree_pc(PD, P′)
poolinit(PD, Size)        ⇒     poolinit_pc(PD, &TypeDesc, PD1, PD2, ..., NULL)
pooldestroy(PD)           ⇒     pooldestroy_pc(PD)

Figure 7.12: Dynamic pointer compression rules

7.2.2 Dynamic Compression Runtime Library

The dynamic pointer compression runtime library is significantly different from the library for the static case. When a pool P grows beyond the 2^k limit, the run-time must be able to find and expand (0-extend) all indices in all source pools pointing to this pool. This requires knowing which source pools point to pool P, and where the pointers lie within objects in these source pools.

To support these operations, the poolinit function takes static information about the program type for each pool (this type is unique since we only operate on TH pools), and is enhanced to build a run time pool points-from graph for the program. The type information for a pool consists of the type size and the offset of each pointer field in the type.
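
A sketch of what this static type information might look like and how it is passed to the pool initializer (the struct and field names are illustrative assumptions; the thesis only specifies that the descriptor records the node size and the offsets of the pointer fields, and the call form matches Figure 7.12):

typedef struct Pool Pool;

typedef struct PoolTypeDesc {
  unsigned NodeSize;               /* sizeof the pool's (unique) node type      */
  unsigned NumPointerFields;       /* how many index/pointer fields a node has  */
  unsigned PointerFieldOffset[8];  /* offset of each such field within the node */
} PoolTypeDesc;

/* Called as in Figure 7.12: a type descriptor plus a NULL-terminated list of
   the pool descriptors this pool's objects point into, so the runtime can
   build the pool points-from graph. */
void poolinit_pc(Pool *PD, const PoolTypeDesc *TD, ... /* Pool*, ..., NULL */);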

[Figure: (a) the compile-time points-to graph, with nodes "list1: HMR", "list2: HMR" (with a self-loop), and "int: HMR"; (b) the run-time pool descriptor graph, with a Root and the corresponding list1, list2, and int pools]

Figure 7.13: Dynamic expansion example

The run-time pool points-from graph has a node for each pool and an edge P2 → P1 if there is an edge N1 → N2 in the compiler's points-to graph, where P1 and P2 are the pools for nodes N1 and N2. An example points-to graph and the corresponding run-time points-from graph are shown in Figure 7.13. When poolinit_pc is called to initialize a pool descriptor (PD), it is passed some number of additional pool descriptor arguments (PD1 ... PDn). It adds PD to the “points-from” list of each descriptor PD1 ... PDn. For the example, when the list2 pool descriptor is initialized, it is passed pointers to the int pool descriptor and itself (since the list2 node has a self-loop), so it adds itself to the points-from lists in both pools. pooldestroy_pc(PD) removes the PD entry from PD1 ... PDn. The run-time points-from lists are created and emptied in this manner because, if N1 → N2 in the compiler's points-to graph, then the lifetime of P1 (for N1) is properly nested within the lifetime of P2 (for N2).

At run time, if the 2^k-th node is allocated from a pool P, the “points-from” list in P is traversed, decompressing all the pointers in each pool in the list. For example, in Figure 7.13, when the 2^k-th node is allocated from the “list2” pool, both the list2 pool and the list1 pool need to be decompressed so that all pointers into the list2 pool are n-bit values. The normal metadata for a pool identifies which objects in the pool are live. All pointers in each live object are decompressed (because of our third restriction above). Decompressing each pointer simply means zero-extending it from k to n bits. Decompression will grow the pool, which may require additional pages to be allocated, and the pool base may change. As objects are copied to their new locations, their relative position in the pool is preserved so that all indices into the pool remain valid.
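
A simplified sketch of what expanding one source pool involves when an indexed pool crosses its 2^k-node limit (the Pool and descriptor layouts are illustrative assumptions, and the sketch assumes the field reordering of Section 7.3.4, so that the non-pointer prefix of each object has the same offsets in both layouts):

#include <stdint.h>
#include <string.h>

typedef struct PoolTypeDesc {
  unsigned OldNodeSize, NewNodeSize;     /* compressed vs. expanded object size */
  unsigned NumPtrFields;
  unsigned OldOffset[4], NewOffset[4];   /* index-field offsets in each layout  */
} PoolTypeDesc;

typedef struct Pool {
  char *PoolBase;
  int isComp;
  unsigned NumNodes;                     /* objects currently in the pool       */
  const PoolTypeDesc *TD;
} Pool;

/* Expand every object in a source pool in place, zero-extending each k-bit
   index to n bits.  Walking from the last node to the first lets each object
   move to its larger slot without clobbering objects not yet copied, and node
   indices (and thus all indices into this pool) are preserved. */
static void ExpandSourcePool(Pool *P) {
  const PoolTypeDesc *TD = P->TD;
  unsigned Prefix = TD->OldOffset[0];    /* non-pointer data precedes the index fields */
  for (unsigned i = P->NumNodes; i-- > 0; ) {
    char *Old = P->PoolBase + (uint64_t)i * TD->OldNodeSize;
    char *New = P->PoolBase + (uint64_t)i * TD->NewNodeSize;
    uint32_t Small[4];
    for (unsigned f = 0; f < TD->NumPtrFields; ++f)     /* read before moving */
      memcpy(&Small[f], Old + TD->OldOffset[f], sizeof(uint32_t));
    memmove(New, Old, Prefix);                          /* non-pointer fields */
    for (unsigned f = 0; f < TD->NumPtrFields; ++f) {
      uint64_t Wide = Small[f];                         /* zero-extend k -> n */
      memcpy(New + TD->NewOffset[f], &Wide, sizeof(Wide));
    }
  }
  P->isComp = 0;                         /* pool now uses expanded indices */
}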

7.2.3 Interprocedural Dynamic Compression

As noted with static pointer compression, the primary challenge in the interprocedural case is that the same points-to graph node may represent pools containing compressed indices or non-compressed indices. This led to the possibility that functions must be cloned in the static case.

Because dynamic pointer compression already uses conditional code to distinguish compressed indices from expanded indices, the need for cloning does not arise.

For interprocedural dynamic compression to compress indices in a pool, it must check whether the pool meets the first criterion (TH pool, no field address taken) for all calling contexts. DSA computes two DS graphs for each function: a bottom-up (BU) graph representing a function and its callees (but not any callers), and a final, top-down (TD) graph representing the effects of both callees and callers. Therefore, we can check the criterion for all contexts simply by checking it in the TD graph.

Original Statement              Transformed Statement
poolinit(PD, Size)        ⇒     poolinit_pc(PD, NULL)
pooldestroy(PD)           ⇒     pooldestroy_pc(PD)

Figure 7.14: Rewrite rules for non-compressed pools

Interprocedural dynamic pointer compression is very straightforward: a single linear pass over the program is used to rewrite all of the instructions in the whole program, according to the rewrite rules in Figure 7.12 and Figure 7.14. The only difference between compressed and non-compressed pools (i.e., those that pass or fail the above criterion) is that the poolinit_pc call for the latter passes a null type descriptor (and an empty points-to list). In this case, poolinit_pc initializes the pool descriptor such that PoolBase is null and isComp is false, and ensures that the poolalloc_pc/poolfree_pc calls behave the same as poolalloc/poolfree. This approach takes advantage of the fact that the pool allocator identifies data structures that do not escape from the program (the pool allocator cannot pool allocate something otherwise), which is the same legality constraint that dynamic pointer compression needs. Because isComp is false, non-compressed pools will always use the “expanded” code paths, which use the uncompressed sizes and field offsets for memory accesses.

7.3 Optimizing Pointer Compressed Code

The straightforward pointer compression implementations described in Sections 7.1 and 7.2 generate functional, but slow, code. We describe several simple improvements below that can significantly reduce redundant or inefficient operations in the generated code.

7.3.1 Address Space Reservation

One of the biggest overheads of pointer compression is the need to keep the memory pools contiguous. If the pool allocator is built on top of a general memory allocator like malloc, growing the pool may require copying all its data to a new location with enough memory.

Given that this work targets 64-bit address space machines, however, a reasonable implementation approach is to choose a large static limit for individual data structures in the program that is unlikely to be exceeded (e.g., 2^40 bytes), and reserve that much address space for each pool when it is created by the program (using facilities like mmap(MAP_NORESERVE)). This allows the program to grow a data structure up to that (large) size without ever needing to copy the pool, with the operating system kernel allocating memory pages to the data structure as they are used.
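
A minimal sketch of the reservation idea on a POSIX system (illustrative; this is not the pool runtime's actual initialization code):

#include <sys/mman.h>
#include <stddef.h>

/* Reserve a large contiguous address range for a pool up front; with
   MAP_NORESERVE the kernel hands out physical pages lazily, only as the
   pool actually touches them, so the reservation itself is cheap. */
static void *ReservePoolAddressSpace(size_t ReserveBytes) {
  void *Base = mmap(NULL, ReserveBytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return (Base == MAP_FAILED) ? NULL : Base;
}

/* e.g. ReservePoolAddressSpace(1ULL << 40) reserves 2^40 bytes for one pool. */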

7.3.2 Reducing Redundant PoolBase Loads

Pointer compression requires loading the PoolBase and isComp fields from the pool descriptor for each load and store from a pool. Although these loads are likely to hit in the L1 cache, this overhead can dramatically impact tight pointer-chasing loops. Fortunately, almost all of these loads are redundant and can be removed with Partial Redundancy Elimination (or a combination of LICM and GCSE). The only operation that invalidates these fields is an allocation, either from the pool (moving the pool base) or from one of the pools it points to (decompressing pointers in the pool). The DS graphs directly identify which function calls may cause such operations.

Note that if Address Space Reservation is used, the PoolBase is never invalidated, making it reasonable to load it once into a register when the pool is initialized or in the prologue of a function if the pool descriptor is passed in as an argument. Figure 7.15 shows MakeList_pc32 after simple optimizations on a 64-bit machine (assuming address space reservation is used).

static int MakeList_pc32(Pool *PD, int N) {
  char *PoolBase = PD->poolbase;
  int Result = 0;
  for (int i = 0; i != N; ++i) {
    int Node = poolalloc_pc(PD, sizeof(list_pc32));
    char *NodePtr = PoolBase + Node;
    *(int*)(NodePtr + 4) = Result;
    *(int*)NodePtr = i + 'A';
    Result = Node;
  }
  return Result;
}

Figure 7.15: MakeList_pc32 after optimization

7.3.3 Reducing Dynamic isComp Comparisons

The generated code for dynamic pointer compression makes heavy use of conditional branches to test whether or not the pool is compressed. To get reasonable performance from the code, several standard techniques can be used. The most important of these is to use loop unswitching on small pointer-chasing loops. This, combined with merging of identical consecutive conditions in straight-line code, can eliminate much of the gross inefficiency of the code, at a cost of increased code size. Other reasonable options are to move the “expanded” code to a cold section instead of the hot section, or to use predication (e.g., on IA64).
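
To make the effect concrete, the sketch below shows a pointer-chasing loop in both forms, using the node sizes and field offsets of the Length example from Figure 7.11 (the Pool layout shown here is an illustrative assumption). Because the loop does not allocate from the pool, isComp is loop-invariant and the test can be hoisted:

typedef struct Pool { char *PoolBase; int isComp; } Pool;

/* Straightforward generated code: tests isComp on every iteration. */
static int LengthGeneric(Pool *PD, long L) {
  int Len = 0;
  while (L != 0) {
    L = PD->isComp ? (long)*(int *)(PD->PoolBase + L * 8 + 4)
                   : *(long *)(PD->PoolBase + L * 16 + 8);
    ++Len;
  }
  return Len;
}

/* After unswitching on the loop-invariant isComp flag: the test runs once,
   and each copy of the loop uses a fixed node size and field offset. */
static int LengthUnswitched(Pool *PD, long L) {
  int Len = 0;
  if (PD->isComp) {
    while (L != 0) { L = (long)*(int *)(PD->PoolBase + L * 8 + 4); ++Len; }
  } else {
    while (L != 0) { L = *(long *)(PD->PoolBase + L * 16 + 8); ++Len; }
  }
  return Len;
}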

7.3.4 Structure Field Reordering for Pointers

One of the overheads involved with dynamic pointer compression is that the offsets of fields are different in the compressed and uncompressed cases. A reasonable way to help reduce this impact is to use structure field reordering to move all pointer fields to the end of the structure. After performing this transformation, all offsets up to and including the first pointer field are constant and do not depend on “isComp”.
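
For example (hypothetical field names, for illustration only), moving the index field to the tail of the structure keeps the data fields at fixed offsets regardless of whether the pool is compressed:

#include <stdint.h>

/* Original layout: Weight's offset depends on whether Left is stored as a
   4-byte compressed index or an 8-byte expanded one. */
struct NodeOriginal {
  int32_t Key;
  int32_t Left;      /* index field: 4 bytes compressed, 8 bytes expanded */
  int32_t Weight;    /* offset changes between the two layouts            */
};

/* After field reordering: all offsets up to and including the first index
   field are the same in both layouts; only the tail grows on expansion. */
struct NodeReordered {
  int32_t Key;
  int32_t Weight;
  int32_t Left;      /* index fields grouped at the end of the object */
};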

7.3.5 Adding Hardware Support

Depending on the host ISA, several different forms of hardware support may be useful. For static compression, perhaps the most important hardware support is a “register+register+immediate” addressing mode (supported by the X86-64 ISA, for example).

For dynamic pointer compression, several options are possible. An integer multiply-accumulate instruction (which takes two immediates), coupled with a conditional move, can be used to implement branch-free structure field indexing as:

reg1 = NodeIndex * newsize + newfieldoffset

reg2 = NodeIndex * oldsize + oldfieldoffset

reg2 = cmove isComp, reg1

load [reg2 + poolbase]

However, the most important operation to have is the ability to do either 32-bit or 64-bit loads (and stores) controlled by a predicate (e.g. “reg = isComp ? LOAD32 [ptr] : LOAD64 [ptr]”). Architectures that support general predication (like IA-64) can do this with several instructions, but there is no good way to implement this without a branch on other systems (unless they support efficient unaligned 64-bit loads). This simple addition can make the unoptimizable case of dynamic compression much more efficient.

7.4 Experimental Results

We implemented the static approach to pointer compression in the LLVM Compiler Infrastructure (Chapter 2), building on our implementations of Data Structure Analysis (Chapter 3) and Automatic Pool Allocation (Chapter 5). We use address space reservation to avoid reallocating pools and to make redundancy elimination of PoolBase pointers easier (as described in Section 7.3). To evaluate the performance effect of Pointer Compression, we first look at how it affects a set of pointer-intensive benchmarks, then investigate how the effect of the pointer compression transformation varies across four different 64-bit architectures.

7.4.1 Performance Results

Program     Native    PA      PC      PC/PA   PeakPA   PeakPC
bh          20.19     16.63   16.63   1.0     7.7MB    7.7MB
bisort      33.69     26.55   24.14   .909    48MB     24MB
perimeter   11.06     6.50    5.07    .780    256MB    149MB
power       12.56     6.80    6.78    .997    924KB    854KB
treeadd     73.10     53.57   35.86   .669    96MB     48MB
tsp         18.48     16.24   11.50   .708    131MB    114MB
ft          15.07     11.33   9.80    .865    8.9MB    4.5MB
ks          9.14      8.05    8.05    1.0     47KB     47KB
llubench    35.40     27.87   11.84   .425    3MB      1.5MB

Figure 7.16: Pointer Compression Benchmark Results

Figure 7.16 shows the results of using pointer compression on a collection of benchmarks running on an UltraSPARC-IIIi processor with 1MB of cache. The first column lists the benchmark names, which are drawn from the Olden [109], Ptrdist [8] and LLUbench [147] pointer-intensive benchmark suites.

To evaluate the performance impact of pointer compression, we compiled each program with the LLVM compiler (including the pool allocation or pointer compression), emitted C code, and compiled it with the system GCC compiler. The PA and PC columns are the execution time for each benchmark with Pool Allocation or Pointer Compression turned on, and the PC/PA column is their runtime ratio (smaller is a bigger speedup). We include the runtime for the program, compiled just by GCC, in the ‘Native’ column to show that the pool allocated execution time for the program is a very aggressive baseline to compare against. Each number is the minimum of three runs of the program, reported in seconds.

Pointer compression speeds up programs by over 2x in some cases (llubench) by dramatically reducing the cache footprint of the program. Even in cases that are less dramatic, pointer compression is able to speed up programs by 20-30% over pool allocation. Some programs, however, are not helped. BH, for example, is not type-homogenous, so pointer compression does not compress anything. Power has such a small footprint that its main traversals are able to live in the cache, even with 64-bit pointers. In KS, pointer compression shrunk a pointer, but the space saved is replaced by structure padding. Overall, as a program's memory image grows, the speedup provided by pointer compression should grow correspondingly.

To show memory savings, we counted the peak number of bytes allocated by the program in pool allocated and pointer compressed forms. For these programs, pointer compression substantially reduces the heap image for the program as you would expect.

[Figure: four panels plotting time per node dereference versus number of LLU Bench iterations for the Normal, PoolAlloc, and PtrComp configurations in 32-bit and 64-bit modes]
(a) 1GHz UltraSPARC-IIIi - 1MB Cache
(b) 1.8GHz AMD Opteron 244 - 1MB Cache
(c) 375MHz IBM SP 9076-550 - 8MB Cache
(d) 1.6GHz Itanium 2 Madison - 9MB Cache

Figure 7.17: llubenchmark: time to process one node vs problem size

7.4.2 Architecture Specific Impact of Pointer Compression

In order to evaluate the effect of pointer compression on different architectures, we chose to use a single benchmark, LLUbench, and a range of input sizes. We chose LLUbench, a linked-list microbenchmark, because its input size can be scaled over a wide range and it is small enough to get working on several platforms without having to port our entire compiler to each system.

Figure 7.17 shows the scaling behavior of llubench on four different systems, compiled in several configurations. For each configuration, we compiled and optimized the program using LLVM, emitted C code, then compiled the resultant code with a standard C compiler (IBM XLC for the SP, GCC for all others). We used 6 configurations for each platform: the original code (Normal), pool allocation only (PoolAlloc), and pointer compression (PtrComp), each compiled in 32-bit mode and in 64-bit mode (except on the Linux Itanium system, which lacks 32-bit support).

The heap size used by llubench is a linear function of the number of iterations, but the execution time of the benchmark grows quadratically. To compare performance of different configurations and systems as a function of the heap size, therefore, we show the ratio of total running time to number of list nodes on the Y axis. This number increases with heap size because the processor spends more time stalled for cache misses8.

Overall, 64-bit pointers have a major performance overhead compared to 32-bit pointers for all systems, when using either the native (Normal) or pool allocator. With a particular pointer size, the Automatic Pool Allocation transformation consistently increases locality over using the standard system allocator, particularly with the default Solaris malloc implementation.

To evaluate the overhead of pointer compression, the “PtrComp 32” values show the effect of transforming 32-bit pointers into 32-bit indexes (i.e. there is no compression, just overhead added).

On SPARC, the added ALU overhead of pointer compression is negligible, but on AMD-64 there is a fair amount of overhead because of the extra register pressure (IA-32 has a very small integer register file). On the IBM-SP, pointer compression adds a substantial overhead to the program: the native 64-bit program is faster than the pointer compressed code until about 700 iterations in the program. On this (old) system, the memory hierarchy is fast enough, and the ALUs are slow enough that pointer compression may not make sense.

8Note that the IBM SP system does not support MAP_NORESERVE, which significantly increases the time to create a pool (thus impacting runs with a small number of iterations).

On the SPARC system, pointer compression provides a substantial speedup over PoolAlloc, and PtrComp-64 is able to match the performance of the 32-bit native version. On the Itanium, PtrComp makes the code substantially faster across the range of iterations (but we cannot compare to a 32-bit baseline). In the case of the Opteron, PtrComp-64 is actually the fastest configuration: in 64-bit mode the Opteron can use twice as many integer registers as in its 32-bit mode, so it does not need to spill as often. On the IBM SP, performance is substantially improved with pointer compression, but cannot match the 32-bit version with pool allocation because of the slow ALU. On all systems, though, pointer compression improves the performance of 64-bit applications dramatically as the problem size increases. The figures also show that on all the architectures, the problem size at which performance begins to degrade rapidly is much larger for PtrComp than for PoolAlloc, showing that pointer compression significantly reduces the effective working set size of the benchmark on each of the architectures.

7.5 Related Work

If an architecture supports both 64-bit and 32-bit pointers, and if the application does not require the use of a 64-bit address space, the simplest solution is simply to compile the program in 32-bit mode, which can provide a substantial performance increase [98]. Unfortunately, this approach will not work for many applications that require 64-bit address spaces, e.g., due to genuine use of more than 4GB of memory, due to special requirements for more address space than physical memory (e.g., [121, 145, 70]), or because the system does not provide 32-bit runtime libraries (e.g. Linux IA-64). Our approach allows for selective compression of individual data structures, where each data structure is limited to 4GB of memory in the static case. In the dynamic case, there is no inherent limit.

Most recently, Adl-Tabatabai et al. describe a trivial form of pointer compression to compile 64-bit pointers in Java programs to a 32-bit pointer model [1]. Their approach is very simple (requiring no program analysis at all), unilaterally compressing pointers to be offsets from the base of the Java memory image located in a 64-bit address space. To decompress these pointers, they add the base of the Java memory image to the compressed value, allowing a Java heap size of 2^32 bytes. This approach provides substantial performance improvements, but provides little benefit over having the JVM produce 32-bit code directly.
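For concreteness, a single-base scheme of this kind can be written out in a few lines of C. This sketch is our own illustration (the heap_base variable and Node layout are hypothetical), not code from [1]:

#include <stdint.h>

static char *heap_base;   /* base of the single heap image (hypothetical) */

struct Node {
  uint32_t next;          /* compressed reference: offset from heap_base; 0 = null */
  int      value;
};

static inline struct Node *decompress(uint32_t ref) {
  return ref ? (struct Node *)(heap_base + ref) : 0;
}

static inline uint32_t compress(struct Node *p) {
  return p ? (uint32_t)((char *)p - heap_base) : 0;
}

Because every compressed reference is an offset from one global base, the entire heap is limited to 2^32 bytes; the pool-based approach of Chapter 7 instead applies that limit per data structure (or no inherent limit in the dynamic case).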

Zhang and Gupta compress pairs consisting of a 32-bit integer and a 32-bit pointer into two 15-bit values which are packed into a single 32-bit field [146]. They compress a pointer by replacing it with a value relative to its own address, which is effective for recursive data structures packed closely in memory. If the offset exceeds 15 bits, the pair is replaced with a pointer to an uncompressed pair on the side. They show a substantial reduction in memory consumption, cache misses, and (with custom hardware support) a reasonable performance increase on a subset of the Olden benchmarks. Unlike our work, their transformation is completely manual and only operates on pairs of values (but it can compress integers as well as pointers, and can selectively compress some fields and not others). Also, it requires specialized hardware to improve performance.

Takagi and Hiraki describe a combined hardware/software technique they dub "Field Array Compression Technique" (FACT) [131]. FACT uses manual "Instance Interleaving" [136] to split each structure definition, packing the compressed fields of multiple instances of a structure together in memory. To handle data that cannot be compressed, they always allocate enough space for both the compressed and uncompressed data. This usually improves locality though it does not reduce memory consumption. Compared with our work, FACT has higher memory consumption, requires manual transformation of the program, and requires exotic single-purpose hardware support.

An additional advantage of the macroscopic approach to pointer compression is that it allows standard compiler optimizations (e.g. loop unswitching) to statically optimize the compressed code for specific static pools. In both the Zhang/Gupta and Takagi/Hiraki approaches, the compiler cannot use such coarse-grained optimizations because individual fields in the heap are compressed or uncompressed independently of each other. Using our approach, a compiler can trivially unswitch a dynamically pointer-compressed loop that traverses a pool if the loop does not allocate from the pool.
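To illustrate the unswitching opportunity, here is a hedged sketch (our own, with hypothetical Pool, Node32, Node64, and index_to_ptr32 names, not the implementation from Chapter 7) of a traversal specialized on one runtime check of the pool's compression mode:

#include <stdint.h>

struct Node32 { uint32_t next; int value; };        /* compressed node: next is a pool index  */
struct Node64 { struct Node64 *next; int value; };  /* native node: next is a 64-bit pointer  */

struct Pool {                                        /* hypothetical pool descriptor           */
  int   is_compressed;
  char *base;                                        /* base of the pool's memory              */
};

static struct Node32 *index_to_ptr32(struct Pool *p, uint32_t i) {
  return (struct Node32 *)(p->base + (uint64_t)i * sizeof(struct Node32));
}

/* One test of the pool's mode is hoisted out of the loop, leaving two
 * specialized loop bodies: classic loop unswitching, applied per pool.
 * 'head' holds either a 32-bit index or a native pointer, depending on
 * the pool's current mode; index 0 is treated as null in this sketch.   */
long sum_list(struct Pool *pool, uint64_t head) {
  long sum = 0;
  if (pool->is_compressed) {
    for (uint32_t n = (uint32_t)head; n != 0; ) {
      struct Node32 *node = index_to_ptr32(pool, n);
      sum += node->value;
      n = node->next;
    }
  } else {
    for (struct Node64 *node = (struct Node64 *)(uintptr_t)head; node; node = node->next)
      sum += node->value;
  }
  return sum;
}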

7.6 Pointer Compression Summary

Transparent Pointer Compression is an aggressive technique for speculatively shrinking 64-bit pointers to 32-bit indices, without losing the generality of 64-bit pointers. We show that Pointer Compression provides both substantial performance improvements for pointer-intensive codes and significantly reduced memory footprint for these programs.

Pointer Compression is a good demonstration of the power of macroscopic techniques. Through the use of Data Structure Analysis to identify disjoint type-homogeneous data structures and Automatic Pool Allocation to partition the heap (and provide control over the memory allocation runtime), the Pointer Compression implementation is simplified to the point where it is feasible to implement. As others have found, implementing a pointer compression technique without using macroscopic techniques requires either (extremely complex) hardware support or the loss of 64-bit addressing generality. Through the use of macroscopic techniques, our approach suffers from neither of these drawbacks.

Chapter 8

Speculative Applications of Macroscopic Techniques

The primary focus of this dissertation is to demonstrate the effectiveness of macroscopic techniques for improving the memory system performance of pointer-intensive programs. However, macroscopic techniques can be used for far more than just performance-related applications, and we have not exhausted the scope of macroscopic techniques even within this area.

This chapter briefly discusses several other techniques that are either not performance related or are not currently implemented (and are thus speculative ideas). The primary purpose of this chapter is to capture some of the ideas that we haven’t investigated yet (“future work”), and describe work not performed solely by the author. These are not new research contributions (as they have not been adequately investigated), but we include them to illustrate the wide range of potential benefits and applications of this work.

These techniques make use of the four main capabilities provided by macroscopic techniques:

1. Data structure-specific layout policies and heap segregation: Allocating distinct instances of data structures from different pools allows compiler and run-time techniques to be customized for each instance. These techniques can use both static pool properties (e.g., type information and points-to relationships) and dynamic properties (anything recordable in per-pool metadata).

2. Mapping of pointers to pool descriptors: The Automatic Pool Allocation transformation provides a static many-to-one mapping of heap pointers to pool descriptors. This information is key to most transformations that exploit pool allocation because it enables the compiler to transform pointer operations into pool-specific code sequences.

3. Type-homogeneous pools: Many pools are completely type-homogeneous, as shown in Section 3.4.3, even in C programs. Novel compiler and run-time techniques are possible for type-homogeneous pools that would not be possible on other pools or the general heap.

4. Knowledge of the run-time points-to graph: One way to view pool allocation is that it partitions the heap to provide a run-time representation of the points-to graph. The compiler has full information about which pools contain pointers to other pools and, for type-homogeneous pools, where all the intra-pool and inter-pool pointers are located. Such information is useful any time pointers need to be traversed or rewritten at run-time.

The optimizations described in Chapter 6 show some simple examples of how compiler techniques can exploit these benefits, the Transparent Pointer Compression transformation (Chapter 7) makes heavy use of all of these properties, and many of the techniques described below make use of one or more of these.

8.1 Non-Performance Applications of Macroscopic Techniques

We believe that macroscopic techniques are widely applicable to areas other than performance-related compiler work. In particular, macroscopic analysis should be useful for program understanding and visualization. It should be useful when targeting systems with partitioned memory spaces, such as deeply embedded or network processors. Finally, macroscopic techniques are also useful in the fields of distributed computing and program safety, which we discuss in more detail below.

8.1.1 Heap Safety for Languages with Explicit Deallocation

In [48]/[49], we describe an application of Automatic Pool Allocation that provides heap safety for type-safe programming languages with explicit deallocation. This work targets embedded systems that run multiple components in the same address space, e.g., a driver in a kernel, or an untrusted controller in a real-time control system [120]. Because these components are either untrusted or potentially buggy, the runtime system needs to guarantee that one component cannot alter memory that belongs to another component.

The fundamental observation of this work is to show that if a program is otherwise type-safe (which is inferred by Data Structure Analysis), and if pointers are initialized to null when appropriate, the only way memory safety can be violated is with use-after-free errors (discussed here) and array bounds errors (discussed in [49]). The traditional way to solve this problem is to eliminate explicit deallocations, either by using a garbage collector or by forcing the program to use a region library which disallows all deallocations except batch disposals. Neither of these techniques is suitable for very-low-level devices: the first may introduce unpredictable pauses, slow down the program, and require increased executable sizes for GC maps, and the second requires the programmer to insert non-trivial annotations into the program.

The solution described in [48] uses pool allocation with a minor twist: explicit deallocations are preserved in the program, but the pool library is modified to never return blocks of memory back to the system, except when a pool is destroyed. Because each pool is type-homogeneous, memory reuse only occurs between nodes of the same type, preventing illegal typecasts due to dangling pointers (e.g. a cast from an integer type to a pointer type). Even if a dangling pointer is dereferenced, no access to non-component state can occur.
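A minimal sketch of this modified deallocation policy (our own illustration with a hypothetical free-list representation, not the actual pool runtime from [48]): freed nodes stay inside the type-homogeneous pool on a free list, so later allocations can only reuse that memory for objects of the same type, and nothing is returned to the operating system until the pool is destroyed.

#include <stddef.h>

/* Hypothetical per-pool state for a type-homogeneous pool of fixed-size
 * nodes (node_size is assumed to be at least sizeof(void *)).            */
struct Pool {
  size_t node_size;     /* every object in this pool has the same type/size */
  void  *free_list;     /* free list threaded through the freed nodes       */
};

/* "Free" keeps the node inside the pool rather than returning it to the
 * system, so its memory can only be reused for an object of the same type. */
void pool_free(struct Pool *p, void *node) {
  *(void **)node = p->free_list;
  p->free_list = node;
}

void *pool_alloc(struct Pool *p) {
  if (p->free_list) {                    /* reuse same-typed memory first */
    void *node = p->free_list;
    p->free_list = *(void **)node;
    return node;
  }
  return NULL;   /* otherwise carve a new node from the pool's slabs (elided) */
}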

8.1.2 Connectivity-Based Garbage Collection

Garbage Collection is a widely studied field with many implementation approaches [141, 79]. One natural application of pool allocation is to use it to either replace [135] or supplement an existing garbage collector [66, 74] for memory reclamation. Replacing garbage collection with pool allocation makes use of the "atomic destroy" property of pools, which frees all of a pool's memory at once when the data structure it holds is no longer reachable. Because this technique can induce unbounded space leaks [135] into the program, it is not feasible for most applications.

The most promising combination of pool allocation and garbage collection seems to be the use of partial garbage collections without write barrier overhead. In [74], Hirzel shows that lifetime and connectivity patterns are often highly correlated. Making use of this property, heap object connectivity information obtained by Data Structure Analysis, and scalar pointer information should allow direct implementation of connectivity-based garbage collection (CBGC). Compared to [66] and [74], using DSA as the basis for this transformation would provide a more accurate context-sensitive analysis along with the ability to garbage collect from cyclic pool structures, instead of decomposing the points-to graph into a DAG of strongly connected components.

8.1.3 Data Marshalling for Pointer-Based Data Structures

Distributed computing systems that use distributed object models (e.g. CORBA and Microsoft's DCOM) are built on the idea of using data marshalling to convert complex data (e.g. structures and arrays) into a serialized format that can be transmitted over a network. The most common approach for data marshalling of recursive data structures is to marshall each node, which requires each individual node to be a distributed object.

Using macroscopic analysis and a transformation similar to static pointer compression (but which only compresses N-bit pointers to N-bit indexes), type-homogeneous recursive data structures can be transformed into a "position independent" form, where indices are used to address nodes instead of pointers. In this form, code to marshall entire recursive data structures can be automatically produced by the compiler, using information from the runtime pool library to identify which nodes are allocated. This approach would reduce both the marshalling/demarshalling cost and the network bandwidth required to send a recursive data structure.
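As a rough illustration of the idea (not the actual transformation), the sketch below uses hypothetical names: child links are pool-relative 32-bit indices, so a whole pool of nodes can be marshalled by copying its underlying storage without rewriting any pointers.

#include <stdint.h>
#include <string.h>

/* Hypothetical "position independent" node: children are pool-relative
 * 32-bit indices rather than raw pointers, so the bytes mean the same
 * thing in any address space.                                           */
struct TreeNode {
  int      key;
  uint32_t left, right;     /* indices into the pool; 0 means "no child" */
};

/* Marshalling then degenerates to copying the pool's node array; no
 * per-node pointer swizzling is needed on either end of the wire.       */
size_t marshall_pool(const struct TreeNode *nodes, uint32_t num_nodes, char *buf) {
  memcpy(buf, nodes, (size_t)num_nodes * sizeof(struct TreeNode));
  return (size_t)num_nodes * sizeof(struct TreeNode);
}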

8.2 Program Performance-Related Macroscopic Applications

Program performance is the primary focus of this thesis, but we still have not been able to explore all possible applications of macroscopic techniques for improved program performance.

8.2.1 Automatic Instance Interleaving

Instance Interleaving is a technique which arranges for the fields of multiple instances of structures in a program to be interleaved with each other [136]. For example, consider a recursive data structure consisting of nodes with fields F1, F2, F3, F4. With a standard memory organization, four instances (A, B, C, D) of this node type would be laid out in memory as:

AF1,AF2,AF3,AF4, BF1,BF2,BF3,BF4, CF1,CF2,CF3,CF4, DF1,DF2,DF3,DF4

When instance interleaving is used, assuming that the fields of this structure are all the same size and that four fields fit on a cache line, memory would be organized like this instead:

AF1,BF1,CF1,DF1, AF2,BF2,CF2,DF2, AF3,BF3,CF3,DF3, AF4,BF4,CF4,DF4

The advantage of this layout is that it packs identical fields together onto a cache line. Consider a traversal of this data structure that accesses fields F1 and F2, but not F3 or F4. In the first case, each structure instance occupies an entire cache line, traversing these four instances requires four cache lines, and only half of the information in each cache line is actually used. After instance interleaving, only two cache lines are accessed, reducing the cache footprint of the traversal.
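The same contrast can be written down directly in C. The following sketch is our own illustration (with made-up field names, and using fixed-size arrays rather than the heap-allocated recursive structures the transformation actually targets); it shows the ordinary array-of-structures layout next to the interleaved layout:

/* Standard layout: each instance's four fields are contiguous.             */
struct Node {
  long F1, F2, F3, F4;
};
struct Node nodes[4];            /* AF1,AF2,AF3,AF4, BF1,BF2,... in memory   */

/* Interleaved layout: identical fields of different instances sit together,
 * so a traversal that touches only F1 and F2 uses half the cache lines.    */
struct InterleavedNodes {
  long F1[4];                    /* AF1,BF1,CF1,DF1 */
  long F2[4];                    /* AF2,BF2,CF2,DF2 */
  long F3[4];
  long F4[4];
};
struct InterleavedNodes block;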

Instance interleaving is a powerful technique, first proposed by Truong et al. in [136], and partially automated in [106]. They show that instance interleaving can have a large positive performance impact, but is difficult to implement. In particular, instance interleaving requires a special allocation library and requires a way to get the compiler to lay out the fields of a structure in this unusual way. The implementation in [106] is limited in several ways: in particular, it only evaluates the transformation for very small programs, assumes (but does not check) that the C programs are type-safe, performs the transformation "per type" instead of per data structure instance, does not check for memory that escapes the program, etc.

Implementing instance interleaving as a macroscopic transformation would improve upon this in several ways, requiring implementation techniques that are very similar to the pointer compression algorithm described in Chapter 7. In particular, a macroscopic implementation could directly solve the problems with the algorithm presented in [106], making it suitable for use in a production compiler by using the following properties:

• Macroscopic analysis identifies memory that is accessed in a type-safe way.

• Macroscopic analysis identifies type-homogeneous recursive data structures.

• Macroscopic analysis identifies memory objects that escape outside of the scope of analysis (e.g., those that are passed to external functions).

• Macroscopic techniques give full control over the allocation runtime library that the program allocates and frees memory with.

• Macroscopic techniques would transform one data structure instance at a time, independently of the others. This would allow different instances to have different fields collocated together when profitable.

• Macroscopic techniques identify tricky cases that the algorithm must handle, such as allocation of arrays of nodes.

We believe that this aggressive application would have a large performance impact on many different programs and be reasonably straightforward to implement.

8.2.2 Automatic use of Superpages for Improved TLB Effectiveness

TLB misses can be a significant factor that limits the performance of programs with large memory footprints. To combat this problem, architectural support for superpages has become commonplace. Superpages improve TLB "reach" by enhancing the TLB to support entries for two or more page sizes: the first is a standard size (e.g. 4K bytes) and the second is a power of two that is often much larger (e.g. 1M or 16M bytes).
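For concreteness: covering a 64MB working set with 4KB pages requires 16,384 TLB entries, while the same range fits in just four 16MB superpage entries.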

Using superpages improves TLB performance by reducing the number of entries required to cover an address range. Because of this, operating system support for automatically inferring when superpages are beneficial has been investigated (e.g. [111]), focusing on how and when to promote normal pages to superpages and when to reduce them to normal pages again. However, use of superpages is not always profitable [132, 23]. In particular, superpages add complexity to the operating system, make swapping more expensive, and can affect working set sizes.

Macroscopic analysis and pool allocation in particular can be used to identify and increase the number of cases where superpage promotion is cost-effective. For example, a simple approach would enhance the pool runtime library (described in Section 5.1.1) to allocate superpage memory when allocating slabs that are larger than the superpage size. This approach (or more aggressive ones) could increase the number of situations where use of superpages for recursive data structures is profitable, taking advantage of the data structure defragmentation properties provided by pool allocation (discussed in Section 6.3.7).
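A hedged sketch of this simple approach, assuming a Linux-style mmap with MAP_HUGETLB (an interface that postdates this work); the helper name and superpage size are illustrative, not part of the actual pool runtime:

#define _GNU_SOURCE            /* for MAP_ANONYMOUS / MAP_HUGETLB on Linux */
#include <sys/mman.h>
#include <stddef.h>

#define SUPERPAGE_SIZE (2UL * 1024 * 1024)   /* assumed superpage size */

/* Hypothetical slab-allocation helper: back large slabs with superpages,
 * falling back to ordinary pages if the huge-page mapping is unavailable. */
static void *alloc_slab(size_t size) {
  if (size >= SUPERPAGE_SIZE) {
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (mem != MAP_FAILED)
      return mem;
  }
  void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return mem == MAP_FAILED ? NULL : mem;
}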

186 8.2.3 New Approaches for Prefetching

Prefetching for programs that use dense arrays is a well-understood problem [21, 100], but prefetching for pointer-chasing traversals of recursive data structures is much harder. For example, consider Figure 8.1, a function that computes the length of a linked list.

struct list {
  int X;
  struct list *Next;
};

unsigned length(struct list *L) {
  unsigned Length = 0;
  for (; L; L = L->Next)
    ++Length;
  return Length;
}

Figure 8.1: Linked-list pointer-chasing example

The problem in this case, as in many other tight pointer-chasing loops, is that there is not enough work to overlap with the prefetch. Even if the prefetch for the 'next' dereference is started immediately after the previous load completes, the prefetch will not have enough time to bring the memory into cache, unless it is already there to begin with. The only general-purpose prior solution to this problem is a technique known as history-pointer prefetching [94] (also known as jump-pointer prefetching [112]).

Compressed History-Pointer Prefetching

History-pointer prefetching is one successful approach for overcoming the latency of pointer-chasing loops: it adds additional pointers to the data structure that point several nodes ahead in the traversal. Having a pointer to the node that will be needed N steps ahead in the traversal allows the prefetching code to be fetching N nodes away, which allows it to overcome almost arbitrary memory latency (assuming that these links are accurate). The primary disadvantage of history-pointer prefetching is that it simultaneously reduces the effectiveness of the cache by increasing the size of the list nodes. This effect is particularly bad on 64-bit systems.

Note that the inefficiency introduced by history-pointers is precisely the overhead that pointer compression is designed to eliminate: it adds intra-data-structure pointers. For this reason, using pointer compression to compress the original and history-pointers in a data structure seems extremely powerful: it has the prefetching power of history-pointer prefetching, but without the overhead of increasing the size of the nodes. Also, as with pointer compression, data structure analysis exposes information about when it is safe to modify the layout of a particular data structure, which is a prerequisite to performing automatic history-pointer prefetching for programs written in languages like C.
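For reference, a minimal sketch of (uncompressed) history-pointer prefetching applied to a variant of the list from Figure 8.1; the Jump field, the prefetch distance, and the use of the GCC __builtin_prefetch intrinsic are our own illustrative choices:

struct list {
  int          X;
  struct list *Next;
  struct list *Jump;   /* history pointer: links several nodes ahead in the
                          usual traversal order (set when the list is built
                          or during a first traversal)                      */
};

unsigned length_with_prefetch(struct list *L) {
  unsigned Length = 0;
  for (; L; L = L->Next) {
    if (L->Jump)
      __builtin_prefetch(L->Jump);   /* start fetching a node far ahead */
    ++Length;
  }
  return Length;
}

The Jump field is exactly the kind of extra intra-data-structure pointer whose cache cost pointer compression could remove.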

Pool-order prefetching

With standard heap allocation of data structure nodes, the individual nodes can be fragmented throughout memory. Automatic Pool Allocation inherently improves this situation by grouping the nodes together in memory, which has a positive effect on locality (improving effective cache line density and TLB usage). Additionally, we find that the allocation order and common traversal patterns of data structures are strongly correlated.

All of these observations lead us to believe that simple stride prefetching of data structure nodes in a pool might be an effective way to improve the performance of pointer-chasing codes. Stride prefetching is very simple and has the advantage (like history-pointer prefetching) that you can prefetch as many nodes ahead as needed to cover the latency of memory accesses. Implementing this technique and experimenting with it could provide valuable insight into the locality gains that pool allocation can provide, especially because many processors now have hardware stride prefetchers available.

8.2.4 Data Structure Traversal-Order Node Relocation

A common usage pattern for data structures is to have a construction/mutation phase followed by a traversal phase, followed by a destruction phase. As an example, consider a program that populates a balanced binary tree and then spends a lot of time querying it. When created, the tree will require the nodes to be reordered to maintain the balancing properties, so the common traversal orders will be unstable. However, when the program enters its query phase, it will begin querying the tree with very similar traversal patterns.

For programs with distinct phase behavior like this, it is sometimes effective for the compiler to insert code into the program that reorders the nodes of the data structure in the expected hot traversal order. Others have observed this effect and implemented it in garbage-collected systems or with manual instrumentation, showing positive performance effects [142, 30, 29, 75].

Macroscopic techniques provide all of the information necessary to perform this transformation, automatically and safely, even for non-type-safe languages like C. This includes identification of the allocated nodes in the data structure, identification of all scalar pointers into the data structure (which would need to be updated to reorder nodes in the data structure), and information about inter-node pointers that need to be updated.

8.2.5 Identification of Coarse-Grain Parallel Work

DSA and pool allocation together provide information to make possible at least three different forms of parallelization.

Mod/Ref based region parallelization

Data Structure Analysis provides context-sensitive mod/ref information, which makes it very easy to identify function calls and other regions of code that do not interfere with each other. In short, the transformation identifies pairs of function calls whose mod sets do not intersect and whose mod sets do not conflict with the ref set of the other call. If these conditions are true, the calls can be executed in parallel, potentially exposing important coarse-grained parallelism, for example, parallelizing operations that occur on disjoint data structures.

We implemented a simple version of this algorithm in earlier versions of DSA, using the Cilk runtime [15] to spawn threads for the parallel calls. The primary limitation of this approach is that it does not apply to parallelism within a data structure, so it has limited applicability in many important situations.
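As an illustration of the kind of call pair this analysis looks for, the following hedged sketch (using POSIX threads rather than the Cilk runtime mentioned above) runs two calls in parallel when they operate on provably disjoint trees; all names are hypothetical:

#include <pthread.h>

struct Tree { long value; struct Tree *left, *right; };

/* Touches only the tree it is given, so (with DSA) its mod/ref set is
 * confined to that tree's pool.                                        */
static long tree_sum(struct Tree *t) {
  return t ? t->value + tree_sum(t->left) + tree_sum(t->right) : 0;
}

struct SumTask { struct Tree *tree; long result; };

static void *tree_sum_thread(void *arg) {
  struct SumTask *task = (struct SumTask *)arg;
  task->result = tree_sum(task->tree);
  return NULL;
}

/* If the analysis proves the two calls reach disjoint data structures
 * (their mod/ref sets do not conflict), they may execute in parallel.  */
long sum_both(struct Tree *a, struct Tree *b) {
  struct SumTask task = { a, 0 };
  pthread_t helper;
  pthread_create(&helper, NULL, tree_sum_thread, &task);
  long sum_b = tree_sum(b);      /* second call runs on the calling thread */
  pthread_join(helper, NULL);
  return task.result + sum_b;
}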

Parallel processing of recursive data structures

The primary technique to expose intra-data-structure parallelism is an approach known as "Shape Analysis" [60, 117]. Shape analysis is a powerful technique that can identify a data structure as being a "list", "tree", "dag" or a general graph. One important use of shape analysis is to parallelize computations on these data structures [71]. If each node of a data structure is recursively processed, if we know that all nodes in a data structure are processed (no early outs), if we know the data structure isn't cyclic (i.e., a node cannot be visited more than once), and if the "processing" of each node is independent or commutative [108], it is possible to use standard "divide-and-conquer" techniques to parallelize the operation on the data structure. Unfortunately, shape analysis algorithms are also extremely expensive (often doubly exponential), limiting their use to programs that are quite small (e.g., less than a thousand lines in size) [117].

The capability that allows shape analysis to distinguish between list/tree-like data structures and DAGs is generally called the "shared" bit (either on a node [117] or a field [38] in the graph). The shared bit indicates if a memory object is pointed to by multiple heap objects. If it is not set, and the data structure is tree-like, the data structure may be processed with divide-and-conquer techniques. We believe that a small amount of flow-sensitivity could be added to DSA, which may allow DSA to capture this property in many cases at a compile-time cost that is much less than shape analysis.

Pool-order processing of data structures

The largest gain from static analysis and pool allocation could be achieved by completely ignoring the data structure traversal pattern of the source program, eliminating pointer chasing from the program altogether. Ideally, we would like to transform programs that iterate over every node in a data structure to iterate over the nodes in pool order instead of by traversing the pointers in the data structure. This transformation would turn sparse pointer-chasing algorithms into algorithms that are much easier to analyze (and the pool can be divided up to execute in parallel).

To safely perform this transformation, the compiler would need to identify a parallelizable data structure computation (as described in Section 8.2.5) and prove that there is only a single data structure in the pool. This goal is by far the most aggressive of any of the techniques described here. At this point it is not clear whether this goal is achievable with enough generality to make it useful in practice.
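A hedged sketch of what the rewritten code might look like once the compiler has proven those conditions; the pool layout (a dense slot array plus an allocation map) and all names are hypothetical:

#include <stddef.h>
#include <stdint.h>

struct Node { int value; uint32_t next; };   /* 'next' is unused by the rewritten loop */

/* Hypothetical view of a pool: nodes live in one dense array of slots,
 * with a byte map recording which slots are currently allocated.       */
struct Pool {
  struct Node   *slots;
  unsigned char *allocated;                  /* allocated[i] != 0 if slot i is live */
  size_t         num_slots;
};

/* Instead of chasing 'next' links, visit every live node in pool (slab)
 * order; the iteration space is dense and trivially divisible among
 * threads.                                                              */
long sum_pool_order(const struct Pool *p) {
  long sum = 0;
  for (size_t i = 0; i < p->num_slots; ++i)
    if (p->allocated[i])
      sum += p->slots[i].value;
  return sum;
}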

8.3 Summary

This chapter briefly described several areas for future work in the field of macroscopic data structure analysis and transformation. Several of the described techniques are extensions of other well-known approaches that are either made more powerful, more general, or automatic where the techniques were previously manual. We believe that many applications remain undiscovered.

Chapter 9

Conclusion

Memory system performance is an important factor in the performance of modern systems, and is becoming even more critical over time. This thesis describes and evaluates Macroscopic Data Structure Analysis and Optimization – a set of aggressive, but practical, techniques that address an extremely hard problem facing compilers for modern systems: How should compilers analyze and transform programs that build and traverse recursive data structures?

Analyzing and transforming programs that use recursive data structures is an extremely difficult problem, particularly when targeting programs written in a language like C or C++. Aspects of this problem include:

• The C family of languages is very complex, supports exceptions, unpredictable setjmp/longjmp control flow, type-unsafe pointer casts and unions, variable argument functions, etc. With the exception of garbage collection, the C language family provides a superset of the challenges faced by other languages, such as ML, Java, and Smalltalk.

• "Programs" are inherently incomplete chunks of code which are often built using external libraries, dynamically loaded libraries, etc. Compilers that assume they have knowledge of the whole program, or compilers that require changes to an ISV's build system, generally have poor acceptance for anything other than system benchmarks.

• Programs that use recursive data structures often do so with layers of helper routines (which are often recursive), use function pointers for abstraction, use void*'s for type genericity, etc.

• Logically distinct instances of recursive data structures are often manipulated by common routines in the program, requiring powerful analysis techniques to distinguish them.

• Compilers generally cannot reason about the dynamic layout of the heap: heap layout is controlled both by the (hard to predict) dynamic behavior of the program as well as the particular implementation of malloc/free being used, which is usually not provided by the compiler vendor. Without the ability to reason about and control layout, the compiler has limited ability to use static information inferred about the program to improve program performance, and has limited ability to perform transformations that depend on heap layout.

• Modern applications are large and getting bigger: analysis and optimization time matters! Commercial compilers generally are unable to use techniques that (alone) take as much time to perform as the rest of the compile time for a program: techniques that require hours or days for programs that take minutes to compile are completely out of the question.

At root, this thesis is devoted to taking these problems and carefully breaking them down into orthogonal pieces that can be handled by purely automated techniques. With respect to the above:

The LLVM Compiler Infrastructure (Chapter 2) is designed to canonicalize as much of the source-level complexity as possible into simple forms: it eliminates bit-fields, literal string constants, unions, complex looping structures, setjmp/longjmp, source-level exception semantics, and a tremendous amount of other source-level detail that would make interprocedural optimizations like these more complex, while preserving the important features such as the type structure, data-flow effects, data access behavior, etc. In principle, the language-independent nature of LLVM allows all of the techniques in this thesis to work unmodified for any language that targets the LLVM representation, though in practice, the techniques may have increased or reduced impact. In addition, LLVM supports aggressive and efficient link-time program analysis and optimization without having to make significant changes to ISV makefiles. The LLVM representation is extremely simple and light-weight, allowing interprocedural analyses and transformations to be very efficient.

Data Structure Analysis is designed to directly address the difficult program analysis problems faced by aggressive memory transformations in a consistent and unified framework. In particular, DSA computes context-sensitive points-to and mod/ref information for memory objects, information about which memory objects are accessed in a type-consistent manner, and information about which memory objects can escape the program analysis scope (thus making them illegal for a transformation to modify). This analysis is strong enough to identify distinct instances of heap structures used by common helper functions, handles all of the difficult aspects that the LLVM IR exposes (which are inherited from C, including pointer casts, varargs, exceptions, etc.), can see through void* casts, identifies the important structure of recursive objects, etc. Despite the aggressive nature of DSA, it uses little memory and is fast and scalable in our experiments on programs spanning 4-5 orders of magnitude of code size (past 200,000 lines of code). We believe that it should continue to scale well to larger programs (as discussed in Sections 3.2.5 and 3.4.2). Finally, we demonstrate that DSA requires only a small fraction of the total compile time for the programs we tested, which we believe has never been accomplished for a general-purpose context-sensitive pointer analysis with full heap cloning.

Automatic Pool Allocation (Chapter 5) is designed to provide the compiler with the information and partial control it needs to reason about and optimize the heap layout of recursive data structures. Automatic Pool Allocation partitions the heap, changing it from a giant black box that holds memory objects into (potentially many) distinct pools in the heap which often contain a homogeneous collection of memory objects with common properties. By itself this transformation often has a positive performance impact on heap-intensive programs (by increasing locality of reference among the data structure, and "deinterlacing" distinct data structures from each other), but its most important purpose is to take control of portions of the heap and enable subsequent analyses and optimizations.

Together, these techniques and algorithms are the foundation of the Macroscopic Data Structure Analysis and Optimization approach: we aim to identify, isolate, and optimize distinct instances of program data structures and transform them as a whole. The driving motivation for this approach is that programs are growing, frequency of code reuse and program modularization are growing, core processor speeds are growing (far outpacing the memory subsystem), and the use of recursive data structures is both prevalent and growing.

This work is primarily focused on program performance. As such, we show that Automatic Pool Allocation can have a substantial performance effect on heap-intensive programs and that a number of extremely simple macroscopic techniques (Chapter 6) can be used to improve program performance even more. These simple techniques focus on directly increasing cache density by eliminating inter-object padding and memory allocator overhead, demonstrating how cooperation between the compiler and the memory management runtime can be used to improve program performance.

As a more aggressive demonstration of the power of macroscopic techniques, Chapter 7 describes and evaluates Transparent Pointer Compression, which optimistically shrinks pointers on 64-bit systems to 32-bit indices, allowing them to grow back to 64 bits if any instance of a data structure grows over 4GB in size. We show that this technique can dramatically improve the performance of pointer-intensive programs by effectively increasing cache capacity, increasing memory bandwidth, and reducing the working set size of the program. This technique builds on Data Structure Analysis to identify type-safe data structures, identify program references to these data structures, and indicate whether a data structure ever escapes from the program. It builds on Automatic Pool Allocation to take control of the memory layout and provide runtime information and control over which nodes in a pool are dynamically allocated, allowing it to rewrite the data structure at runtime. Furthermore, it intrinsically depends on the static pointer-to-static-pool mapping information computed by the pool allocation transformation.

In addition to these new and aggressive techniques, we show (in Chapter 4) that this framework can also be used to host analyses and transformations that use traditional alias and mod/ref analysis information, providing analysis precision that meets and exceeds that of other pointer analyses that require similar analysis time.

Finally, Chapter 8 describes some potential directions for future work in this field. In particular, though this thesis has primarily focused on program performance, macroscopic analysis techniques are applicable to a wide variety of different program analysis and transformation problems in many domains (e.g. memory management, program safety, distributed computing, etc). We hope that continuing work in the field will expose many new ideas and approaches that we haven't even considered yet.

References

[1] Ali-Reza Adl-Tabatabai, Jay Bharadwaj, Michal Cierniak, Marsha Eng, Jesse Fang, Brian T. Lewis, Brian R. Murphy, and James M. Stichnoth. Improving 64-bit Java IPF performance by compressing heap references. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 100–110, March 2004.

[2] Ali-Reza Adl-Tabatabai, Geoff Langdale, Steven Lucco, and Robert Wahbe. Efficient and language-independent mobile programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 127–136, 1996.

[3] Vikram Adve, Chris Lattner, Michael Brukman, Anand Shukla, and Brian Gaeke. LLVA: A low-level virtual instruction set architecture. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 205–216, San Diego, CA, Dec 2003.

[4] Alex Aiken, Manuel Fähndrich, and Ralph Levien. Better static memory management: Improving region-based analysis of higher-order languages. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 174–185, La Jolla, CA, June 1995.

[5] Wolfram Amme, Niall Dalton, Jeffery von Ronne, and Michael Franz. SafeTSA: A type safe and referentially secure mobile-code representation based on static single assignment form. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2001.

[6] Lars O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994.

[7] ANDF Consortium. The Architectural Neutral Distribution Format. http://www.andf.org/.

[8] Todd Austin, et al. The Pointer-intensive Benchmark Suite. www.cs.wisc.edu/~austin/ptr-dist.html, Sept 1995.

[9] Andrew Ayers, Stuart de Jong, John Peyton, and Richard Schooler. Scalable cross-module optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, June 1998.

[10] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–12, June 2000.

[11] John P. Banning. An efficient way to find the side effects of procedure calls and the aliases of variables. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 29–41, New York, NY, USA, 1979.

[12] David A. Barrett and Ben G. Zorn. Using lifetime predictors to improve memory allocation performance. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 187–196, Albuquerque, New Mexico, June 1993.

[13] Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. Reconsidering custom memory allocation. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Seattle, Washington, November 2002.

[14] Bruno Blanchet. Escape Analysis for Java(TM): Theory and Practice. ACM Transactions on Programming Languages and Systems (TOPLAS), 25(6):713–775, Nov 2003.

[15] Robert D. Blumofe, Christopher F. Joerg, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 207–216, Santa Barbara, CA, July 1995.

[16] Greg Bollella and James Gosling. The real-time specification for Java. IEEE Computer, 33(6):47–54, 2000.

[17] Chandrasekhar Boyapati, Alexandru Salcianu, William Beebee, and Martin Rinard. Ownership types for safe region-based memory management in real-time Java. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2003.

[18] Michael Burke and Linda Torczon. Interprocedural optimization: eliminating unnecessary recompilation. ACM Transactions on Programming Languages and Systems (TOPLAS), 15(3):367–399, 1993.

[19] Michael G. Burke et al. The Jalapeño Dynamic Optimizing Compiler for Java. In Java Grande, pages 129–141, 1999.

[20] Brad Calder, Chandra Krintz, Simmi John, and Todd Austin. Cache-conscious data placement. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 139–149, San Jose, USA, 1998.

[21] David Callahan, Ken Kennedy, and Allan Porterfield. Software prefetching. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 40–52, Santa Clara, USA, April 1991.

[22] David Chase. Implementation of exception handling. The Journal of C Language Translation, 5(4):229–240, June 1994.

[23] J. Bradley Chen, Anita Borg, and Norman P. Jouppi. A simulation based study of TLB performance. In Proceedings of the International Conference on Computer Architecture (ISCA), pages 114–123, 1992.

[24] Juan Chen, Dinghao Wu, Andrew W. Appel, and Hai Fang. A provably sound TAL for back-end optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Diego, CA, Jun 2003.

[25] Ben-Chung Cheng and Wen-mei Hwu. Modular interprocedural pointer analysis using access paths: Design, implementation, and evaluation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 57–69, Vancouver, British Columbia, Canada, June 2000.

[26] Sigmund Cherem and Radu Rugina. Region analysis and transformation for Java programs. In Proceedings of the International Symposium on Memory Management (ISMM), Vancouver, Canada, October 2004.

[27] Anton Chernoff, Mark Herdeg, Ray Hookway, Chris Reeve, Norman Rubin, Tony Tye, S. Bharadwaj Yadavalli, and John Yates. FX!32: A profile-directed binary translator. Proceedings of the ACM/IEEE International Symposium on Microarchitecture (MICRO), 18(2):56–64, 1998.

[28] Trishul M. Chilimbi, Bob Davidson, and James R. Larus. Cache-conscious structure definition. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 13–24, 1999.

[29] Trishul M. Chilimbi, Mark D. Hill, and James R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–12, 1999.

[30] Trishul M. Chilimbi and James R. Larus. Using generational garbage collection to implement cache-conscious data placement. ACM SIGPLAN Notices, 34(3):37–48, 1999.

[31] Wei-Ngan Chin, Florin Craciun, Shengchao Qin, and Martin Rinard. Region inference for an object-oriented language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Washington, DC, June 2004.

[32] Jong-Deok Choi, Michael Burke, and Paul Carini. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 232–245, 1993.

[33] CodeSourcery, Compaq, et al. C++ ABI for Itanium. http://www.codesourcery.com/cxx-abi/abi.html, 2001.

[34] Robert S. Cohn, David W. Goodwin, and P. Geoffrey Lowney. Optimizing Alpha executables on Windows NT with Spike. Digital Technical Journal, 9(4), 1997.

[35] Keith Cooper and Ken Kennedy. Interprocedural side-effect analysis in linear time. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Atlanta, GA, June 1988.

[36] Keith D. Cooper. Analyzing aliases of reference formal parameters. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 281–290, New York, NY, USA, 1985.

[37] Keith D. Cooper and John Lu. Register promotion in C programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 308–319, 1997.

[38] Francisco Corbera, Rafael Asenjo, and Emilio L. Zapata. New shape analysis techniques for automatic parallelization of C codes. In Proceedings of the International Conference on Supercomputing (ICS), pages 220–227, 1999.

[39] Robert Courts. Improving locality of reference in a garbage-collecting memory management system. Communications of the ACM, 31(9):1128–1138, 1988.

[40] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(4):451–490, October 1991.

[41] Manuvir Das. Unification-based pointer analysis with directional assignments. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 35–46, 2000.

[42] Manuvir Das, Ben Liblit, Manuel Fähndrich, and Jakob Rehof. Estimating the impact of scalable pointer analysis on optimization. In Proceedings of the International Symposium on Static Analysis (SAS), pages 260–278. Springer-Verlag, 2001.

[43] James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, and Jim Mattson. The Transmeta Code Morphing Software: Using speculation, recovery and adaptive retranslation to address real-life challenges. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), San Francisco, CA, Mar 2003.

[44] Robert DeLine and Manuel Fähndrich. Enforcing high-level protocols in low-level software. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Snowbird, UT, June 2001.

[45] Alan Demers, Mark Weiser, Barry Hayes, Hans Boehm, Daniel Bobrow, and Scott Shenker. Combining generational and conservative garbage collection: framework and implementations. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 261–269, 1990.

[46] Alain Deutsch. Interprocedural may-alias analysis for pointers: Beyond k-limiting. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 230–241, June 1994.

[47] L. Peter Deutsch and Allan M. Schiffman. Efficient implementation of the Smalltalk-80 system. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 297–302, Jan 1984.

[48] Dinakar Dhurjati, Sumant Kowshik, Vikram Adve, and Chris Lattner. Memory safety without runtime checks or garbage collection. In Proceedings of the ACM SIGPLAN Conference on Language, Compiler, and Tool Support for Embedded Systems (LCTES), San Diego, Jun 2003.

[49] Dinakar Dhurjati, Sumant Kowshik, Vikram Adve, and Chris Lattner. Memory safety without garbage collection for embedded applications. Transactions on Embedded Computing Systems, 4(1):73–111, February 2005.

[50] Kemal Ebcioglu and Erik R. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In Proceedings of the International Conference on Computer Architecture (ISCA), pages 26–37, 1997.

[51] Maryam Emami, Rakesh Ghiya, and Laurie J. Hendren. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 242–256, Orlando, FL, June 1994.

[52] Manuel Fähndrich, Jeffrey S. Foster, Zhendong Su, and Alexander Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 85–96, 1998.

[53] Manuel Fähndrich, Jakob Rehof, and Manuvir Das. Scalable context-sensitive flow analysis using instantiation constraints. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Vancouver, Canada, June 2000.

[54] Mary F. Fernández. Simple and effective link-time optimization of Modula-3 programs. 1995.

[55] Jeffrey S. Foster, Manuel Fähndrich, and Alexander Aiken. Polymorphic versus monomorphic flow-insensitive points-to analysis for C. In Proceedings of the International Symposium on Static Analysis (SAS), pages 175–198, London, UK, 2000. Springer-Verlag.

[56] Michael Franz and Thomas Kistler. Slim binaries. Communications of the ACM, 40(12), 1997.

[57] Michael L. Fredman and Robert Endre Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34(3):596–615, 1987.

[58] David Gay and Alexander Aiken. Memory management with explicit regions. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 313–323, Montreal, Canada, 1998.

[59] Rakesh Ghiya and Laurie J. Hendren. Connection analysis: A practical interprocedural heap analysis for C. International Journal of Parallel Programming, 24(6):547–578, 1996.

[60] Rakesh Ghiya and Laurie J. Hendren. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 1–15, 1996.

[61] Rakesh Ghiya and Laurie J. Hendren. Putting pointer analysis to work. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 121–133, New York, NY, USA, 1998.

[62] Rakesh Ghiya, Daniel Lavery, and David Sehr. On the importance of points-to analysis and other memory disambiguation methods for C programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2001.

[63] Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling Wang, and James Cheney. Region-based memory management in Cyclone. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2002.

[64] Dirk Grunwald and Benjamin Zorn. CustoMalloc: Efficient synthesized memory allocators. Software–Practice and Experience, 23(8):851–869, 1993.

[65] Brian Hackett and Radu Rugina. Region-based shape analysis with tracked locations. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 310–323, New York, NY, USA, 2005.

[66] Niels Hallenberg, Martin Elsman, and Mads Tofte. Combining region inference and garbage collection. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Berlin, Germany, June 2002.

[67] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 1–12, 2000.

[68] David R. Hanson. Fast Allocation and Deallocation of Memory Based on Object Lifetimes. Software–Practice and Experience, 20(1):5–12, Jan 1990.

[69] David L. Heine and Monica S. Lam. A practical flow-sensitive and context-sensitive C and C++ memory leak detector. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 168–181, 2003.

[70] Gernot Heiser, Kevin Elphinstone, Jerry Vochteloo, Stephen Russell, and Jochen Liedtke. The Mungi single-address-space operating system. Software–Practice and Experience, 28(9):901–928, 1998.

[71] Laurie J. Hendren and Alexandru Nicolau. Parallelizing programs with recursive data structures. IEEE Transactions on Parallel and Distributed Systems, pages 35–47, 1990.

[72] Michael Hind. Which pointer analysis should I use? In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, 2000.

[73] Michael Hind. Pointer analysis: Haven't we solved this problem yet? In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE), pages 54–61, 2001.

[74] Martin Hirzel, Amer Diwan, and Matthew Hertz. Connectivity-based garbage collection. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 359–373, 2003.

[75] Xianglong Huang, Stephen Blackburn, Kathryn McKinley, Eliot Moss, Zhenlin Wang, and Perry Cheng. The garbage collection advantage: improving program locality. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 69–80, 2004.

[76] IBM Corp. XL FORTRAN: Eight Ways to Boost Performance. White Paper, 2000.

[77] Bertrand Jeannet, Alexey Loginov, Thomas Reps, and Mooly Sagiv. A relational approach to interprocedural shape analysis. In Proceedings of the International Symposium on Static Analysis (SAS), Verona, Italy, August 2004.

[78] Trevor Jim, Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang. Cyclone: A safe dialect of C. In USENIX Annual Technical Conference, Monterey, CA, 2002.

[79] Richard Jones. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley & Sons, 1999.

[80] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the International Conference on Computer Architecture (ISCA), pages 364–373, New York, NY, USA, 1990.

[81] Thomas Kistler and Michael Franz. Continuous program optimization: A case study. ACM Transactions on Programming Languages and Systems (TOPLAS), 25(4):500–548, Jul 2003.

[82] Sumant Kowshik, Dinakar Dhurjati, and Vikram Adve. Ensuring code safety without runtime checks for real-time control systems. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Grenoble, Oct 2002.

[83] William Landi, Barbara Ryder, and Sean Zhang. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Albuquerque, NM, June 1993.

[84] James R. Larus and Paul N. Hilfinger. Detecting conflicts between structure accesses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 21–34, July 1988.

[85] Chris Lattner. LLVM Alias Analysis Infrastructure. http://llvm.cs.uiuc.edu/docs/AliasAnalysis.html.

[86] Chris Lattner and Vikram Adve. LLVM Language Reference Manual. http://llvm.cs.uiuc.edu/docs/LangRef.html.

[87] Chris Lattner and Vikram Adve. Architecture for a Next-Generation GCC. In Proceedings of the First Annual GCC Developers’ Summit, Ottawa, Canada, May 2003.

[88] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), San Jose, USA, Mar 2004.

[89] Chris Lattner and Vikram Adve. Automatic pool allocation: Improving performance by controlling data structure layout in the heap. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Chicago, IL, Jun 2005.

[90] Chris Lattner and Vikram Adve. Transparent Pointer Compression for Linked Data Structures. In Proceedings of the ACM Workshop on Memory System Performance, Chicago, IL, Jun 2005.

[91] Donglin Liang and Mary Jean Harrold. Efficient points-to analysis for whole-program analysis. In Proceedings of the European Software Engineering Conference (ESEC), pages 199–215, 1999.

[92] Donglin Liang and Mary Jean Harrold. Efficient computation of parameterized pointer information for interprocedural analysis. In Proceedings of the International Symposium on Static Analysis (SAS), July 2001.

[93] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, Reading, MA, 1997.

[94] Chi-Keung Luk and Todd C. Mowry. Automatic compiler-inserted prefetching for pointer-based applications. IEEE Transactions on Computers, 48(2):134–141, 1999.

[95] Erik Meijer and John Gough. A technical overview of the Common Language Infrastructure, 2002. http://research.microsoft.com/~emeijer/Papers/CLR.pdf.

[96] Microsoft Corp. Managed extensions for C++ specification. .NET Framework Compiler and Language Reference.

[97] Ana Milanova, Atanas Rountev, and Barbara Ryder. Parameterized object sensitivity for points-to and side-effect analyses for Java. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1–11, 2002.

[98] Jeffrey C. Mogul, Joel F. Bartlett, Robert N. Mayo, and Amitabh Srivastava. Performance implications of multiple pointer sizes. In USENIX Winter, pages 187–200, 1995.

[99] Greg Morrisett, David Walker, Karl Crary, and Neal Glew. From System F to typed assembly language. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(3):528–569, May 1999.

206 [100] Todd Mowry, Monica S. Lam, and Anoop Gupta. Design and evaluation of a compiler

algorithm for prefetching. In Proceedings of the International Conference on Architectural

Support for Programming Languages and Operating Systems (ASPLOS), pages 62–73, Boston,

USA, October 1992.

[101] Robert M. Muth. Alto: A Platform for Modification. Ph.d. Thesis, Department

of Computer Science, University of Arizona, 1999.

[102] Erik M. Nystrom, Hong-Seok Kim, and Wen mei W. Hwu. Bottom-up and top-down context-

sensitive summary-based pointer analysis. In Proceedings of the International Symposium on

Static Analysis (SAS), 2004.

[103] Erik M. Nystrom, Hong-Seok Kim, and Wen mei W. Hwu. Importance of heap specialization

in pointer analysis. In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on

Program Analysis for Software Tools and Engineering (PASTE), pages 43–48, New York,

NY, USA, 2004.

[104] David J. Pearce and Paul H. J. Kelly. A dynamic algorithm for topologically sorting di-

rected acyclic graphs. In Proceedings of the 3rd International Workshop on Efficient and

Experimental Algorithms (WEA 2004), Lecture Notes in Computer Science. Springer-Verlag,

2004.

[105] David J. Pearce, Paul H. J. Kelly, and Chris Hankin. Online cycle detection and difference

propagation for pointer analysis. In Proceedings of the International IEEE Workshop on

Source Code Analysis and Manipulation (SCAM), 2003.

[106] Rodric M. Rabbah and Krishna V. Palem. Data remapping for design space optimization of

embedded memory systems. Transactions on Embedded Computing Systems, 2(2):186–218,

2003.

[107] Chrislain Razafimahefa. A study of side-effect analyses for Java. Master’s thesis, McGill

University, Dec 1999.

[108] Martin C. Rinard and Pedro C. Diniz. Commutativity analysis: a new analysis technique

for parallelizing compilers. ACM Transactions on Programming Languages and Systems

(TOPLAS), 19(6):942–991, 1997.

[109] Anne Rogers, Martin C. Carlisle, John H. Reppy, and Laurie J. Hendren. Supporting dy-

namic data structures on distributed memory machines. ACM Transactions on Programming

Languages and Systems (TOPLAS), 17(2), March 1995.

[110] Ted Romer, Geoff Voelker, Denis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Ber-

shad, and Brad Chen. Instrumentation and optimization of Win32/Intel executables using

Etch. In Proceedings of the USENIX Windows NT Workshop, August 1997.

[111] Theodore H. Romer, Wayne H. Ohlrich, Anna R. Karlin, and Brian N. Bershad. Reducing TLB

and memory overhead using online superpage promotion. In Proceedings of the International

Conference on Computer Architecture (ISCA), pages 176–187, New York, NY, USA, 1995.

[112] Amir Roth and Gurindar S. Sohi. Effective jump-pointer prefetching for linked data struc-

tures. In Proceedings of the International Conference on Computer Architecture (ISCA),

pages 111–121, May 1999.

[113] Atanas Rountev and Satish Chandra. Off-line variable substitution for scaling points-to

analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design

and Implementation (PLDI), pages 47–56, 2000.

[114] Erik Ruf. Effective synchronization removal for Java. In Proceedings of the ACM SIGPLAN

Conference on Programming Language Design and Implementation (PLDI), pages 208–218,

2000.

[115] Peter Rundberg and Fredrik Warg. The FreeBench v1.0 Benchmark Suite.

http://www.freebench.org, Jan 2002.

[116] Barbara Ryder, William Landi, Philip Stocks, Sean Zhang, and Rita Altucher. A schema for

interprocedural modification side-effect analysis with pointer aliasing. ACM Transactions on

Programming Languages and Systems (TOPLAS), 23(2):105–186, March 2001.

[117] Mooly Sagiv, Thomas Reps, and Reinhard Wilhelm. Solving shape-analysis problems in

languages with destructive updating. ACM Transactions on Programming Languages and

Systems (TOPLAS), 20(1), January 1998.

[118] Robert Sedgewick. Algorithms. Addison-Wesley, Inc., Reading, MA, 1988.

[119] Matthew L. Seidl and Benjamin G. Zorn. Segregating heap objects by reference behavior

and lifetime. In Proceedings of the International Conference on Architectural Support for

Programming Languages and Operating Systems (ASPLOS), pages 12–23, San Jose, USA,

1998.

[120] Lui Sha. Dependable system upgrades. In Proceedings of IEEE Real Time System Symposium,

1998.

[121] Hovav Shacham, Matthew Page, Ben Pfaff, Eu-Jin Goh, Nagendra Modadugu, and Dan

Boneh. On the effectiveness of address-space randomization. In Proceedings ACM Conference

on Computer and Communications Security (CCS ’04), pages 298–307, 2004.

[122] Ran Shaham, Eran Yahav, Elliot K. Kolodner, and Mooly Sagiv. Establishing local temporal

heap safety properties with applications to compile-time memory management. In Proceedings

of the International Symposium on Static Analysis (SAS), San Diego, USA, June 2003.

[123] Zhong Shao, Christopher League, and Stefan Monnier. Implementing Typed Intermediate

Languages. In Proceedings of the ACM SIGPLAN International Conference on Functional

Programming, pages 313–323, 1998.

[124] Anand Shukla. Lightweight, cross-procedure tracing for runtime optimization. Master’s

thesis, Computer Science Department, University of Illinois at Urbana-Champaign, Urbana,

IL, Aug 2003.

[125] James E. Smith, Timothy Heil, Subramanya Sastry, and Todd Bezenek. Achieving high

performance via co-designed virtual machines. In Proceedings of the International Workshop

on Innovative Architecture (IWIA), 1999.

[126] Amitabh Srivastava and David Wall. A practical system for intermodule code optimization

at link-time. Journal of Programming Languages, 1(1):1–18, Dec. 1992.

[127] T.B. Steel. UNCOL: The myth and the fact. Annual Review in Automated Programming 2,

1961.

[128] Bjarne Steensgaard. Points-to analysis by type inference of programs with structures and

unions. In Proceedings of the International Conference on Compiler Construction (CC),

pages 136–150, London, UK, 1996.

[129] Bjarne Steensgaard. Points-to analysis in almost linear time. In Proceedings of the ACM

SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 32–41, Jan

1996.

[130] Phil Stocks, Barbara G. Ryder, William Landi, and Sean Zhang. Comparing flow and context

sensitivity on the modification-side-effects problem. In Proceedings of the ACM SIGSOFT

International Symposium on Software Testing and Analysis, pages 21–31, 1998.

[131] Masamichi Takagi and Kei Hiraki. Field array compression in data caches for dynamically

allocated recursive data structure. In Proceedings of 5th International Symposium on High

Performance Computing (ISHPC’03), pages 127–145, October 2003.

[132] Madhusudhan Talluri, Shing I. Kong, Mark D. Hill, and David A. Patterson. Tradeoffs

in supporting two page sizes. In Proceedings of the International Conference on Computer

Architecture (ISCA), pages 415–424, 1992.

[133] Mads Tofte and Lars Birkedal. A region inference algorithm. ACM Transactions on Pro-

gramming Languages and Systems (TOPLAS), 20(4):724–768, July 1998.

[134] Mads Tofte and Jean-Pierre Talpin. Implementation of the typed call-by-value λ-calculus

using a stack of regions. In Proceedings of the ACM SIGACT-SIGPLAN Symposium on

Principles of Programming Languages, pages 188–201, 1994.

[135] Mads Tofte and Jean-Pierre Talpin. Region-based memory management. Information and

Computation, 132(2):109–176, February 1997.

[136] Dan N. Truong, François Bodin, and André Seznec. Improving cache behavior of dynami-

cally allocated data structures. In Proceedings of the International Conference on Parallel

Architectures and Compilation Techniques (PACT), pages 322–329, October 1998.

[137] David Ungar and Randall B. Smith. Self: The power of simplicity. In Proceedings of the

ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Ap-

plications (OOPSLA), 1987.

[138] Frédéric Vivien and Martin Rinard. Incrementalized pointer and escape analysis. In Proceedings

of the ACM SIGPLAN Conference on Programming Language Design and Implementation

(PLDI), pages 35–46, 2001.

[139] David Wall. Global register allocation at link-time. In Proceedings of the ACM SIGPLAN

Conference on Programming Language Design and Implementation (PLDI), Palo Alto, CA,

1986.

[140] John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using

binary decision diagrams. In Proceedings of the ACM SIGPLAN Conference on Programming

Language Design and Implementation (PLDI), pages 131–144, 2004.

[141] Paul R. Wilson. Uniprocessor garbage collection techniques. In Proceedings of the Interna-

tional Workshop on Memory Management, number 637, Saint-Malo, France, 1992. Springer-

Verlag.

[142] Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Effective “static-graph” reorganiza-

tion to improve locality in garbage-collected systems. In Proceedings of the ACM SIGPLAN

Conference on Programming Language Design and Implementation (PLDI), pages 177–191,

1991.

[143] Robert P. Wilson and Monica S. Lam. Effective context sensitive pointer analysis for C

programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language

Design and Implementation (PLDI), pages 1–12, June 1995.

[144] Eran Yahav and G. Ramalingam. Verifying safety properties using separation and hetero-

geneous abstractions. In Proceedings of the ACM SIGPLAN Conference on Programming

Language Design and Implementation (PLDI), pages 25–34, New York, NY, USA, 2004.

[145] Curtis Yarvin, Richard Bukowski, and Thomas Anderson. Anonymous RPC: Low-latency

protection in a 64-bit address space. In USENIX Summer, pages 175–186, 1993.

[146] Youtao Zhang and Rajiv Gupta. Data compression transformations for dynamically allocated

data structures. In Proceedings of the International Conference on Compiler Construction

(CC), Apr 2002.

[147] Craig B. Zilles. Benchmark health considered harmful. ACM SIGARCH Computer Architec-

ture News, 29(3):4–5, 2001.

Author’s Biography

Chris Lattner attended elementary school in Beaverton, Oregon (suburbia), moving to Banks, Oregon (the country) for middle school and high school. After high school, he attended the University of Portland in Portland, Oregon (the city), and was awarded a Bachelor’s degree in Computer Science. For graduate school, he moved to Urbana, IL (the corn fields) and attended the University of Illinois at Urbana-Champaign, earning his M.S. and Ph.D. degrees. Following graduation, he is joining Apple Computer in Cupertino, California (Silicon Valley), where he will continue development of LLVM and find cool new ways for it to improve their products. Among other things, Chris thinks that author biographies should not be compulsory for Ph.D. dissertations at the University of Illinois. For the curious, Chris’s work history, publications, and many other things are available on his web page (http://nondot.org/sabre/).
