THE VELOCITY COMPILER: EXTRACTING

EFFICIENT MULTICORE EXECUTION FROM

LEGACY SEQUENTIAL CODES

MATTHEW JOHN BRIDGES

A DISSERTATION

PRESENTED TO THE FACULTY

OF PRINCETON UNIVERSITY

IN CANDIDACY FOR THE DEGREE

OF DOCTOR OF PHILOSOPHY

RECOMMENDED FOR ACCEPTANCE

BY THE DEPARTMENT OF

COMPUTER SCIENCE

ADVISOR: PROF. DAVID I. AUGUST

NOVEMBER 2008

© Copyright by Matthew John Bridges, 2008. All Rights Reserved

Abstract

Multiprocessor systems, particularly chip multiprocessors, have emerged as the predominant organization for future microprocessors. Systems with 4, 8, and 16 cores are already shipping and a future with 32 or more cores is easily conceivable. Unfortunately, multiple cores do not always directly improve application performance, particularly for a single legacy application. Consequently, parallelizing applications to execute on multiple cores is essential.

Parallel programming models and languages could be used to create multi-threaded applications. However, moving to a parallel programming model only increases the complexity and cost involved in software development. Many automatic thread extraction techniques have been explored to address these costs.

Unfortunately, the amount of parallelism that has been automatically extracted using these techniques is generally insufficient to keep many cores busy. Though there are many reasons for this, the main problem is that extensions are needed to take full advantage of these techniques. For example, many important loops are not parallelized because the compiler lacks the necessary scope to apply the optimization. Additionally, the sequential programming model forces programmers to define a single legal application outcome, rather than allowing for a range of legal outcomes, leading to conservative dependences that prevent parallelization.

This dissertation integrates the necessary parallelization techniques, extending them where necessary, to enable automatic thread extraction. In particular, this includes an expanded optimization scope, which facilitates the optimization of large loops, leading to parallelism at higher levels in the application. Additionally, this dissertation shows that many unnecessary dependences can be broken with the help of the programmer using natural, simple extensions to the sequential programming model.

Through a case study of several applications, including several C benchmarks in the SPEC CINT2000 suite, this dissertation shows how scalable parallelism can be extracted. By changing only 38 lines out of 100,000, several of the applications could be parallelized by automatic thread extraction techniques, yielding a speedup of 3.64x on 32 cores.

Acknowledgments

First and foremost, I would like to thank my advisor, David August, for his assistance and support throughout my years in graduate school at Princeton. He has taught me a great deal about research, and I would not be here if not for his belief in me. Additionally, the strong team environment and investment in infrastructure that he has advocated have been a wonderful benefit. His continued expectation that I work to the best of my abilities has led to this dissertation, and for that I am grateful.

This dissertation was also made possible by the support of the Liberty research group, including Ram Rangan, Manish Vachharajani, David Penry, Thomas Jablin, Yun Zhang, Bolei Guo, Adam Stoler, Jason Blome, George Reis, and Jonathan Chang. Spyros Triantafyllis, Easwaran Raman, and Guilherme Ottoni have contributed greatly to the VELOCITY compiler used in this research and have my thanks for both their friendship and support. Special thanks goes to Neil Vachharajani, both for the many contributions to VELOCITY and this work, as well as for the many interesting discussions we have had over the years.

I would like to thank the members of my dissertation committee, Kai Li, Daniel Lavery, Margaret Martonosi, and Doug Clark. Their collective wisdom and feedback have improved this dissertation. In particular, I thank my advisor and my readers, Kai Li and Daniel Lavery, for carefully reading through this dissertation and providing comments. Additionally, I thank my undergraduate research advisor, Lori Pollock, without whom I would not have entered graduate research.

This dissertation would not be possible without material research support from the National Science Foundation, Intel Corporation, and GSRC. Special thanks also goes to Intel for the summer internship they provided and for the guidance given to me by Daniel Lavery and Gerolf Hoflehner.

Finally, I thank my friends and family. I thank my mother Betty, my father John, my step-father Doug, my mother-in-law Janet, and my father-in-law Russell, for having supported me throughout this process with their love and encouragement. Above all, my wife Marisa has my sincere thanks for putting up with everything involved in creating this dissertation and for reading early drafts. She has seen me through the good times and the bad, and I would not have accomplished this without her.

Contents

Abstract ...... iii

Acknowledgments ...... v

List of Tables ...... x

List of Figures ...... xi

1 Introduction 1

2 Automatic Parallelization 7

2.1 Extracting Parallelism ...... 7

2.1.1 Independent Multi-Threading ...... 7

2.1.2 Cyclic Multi-Threading ...... 9

2.1.3 Pipelined Multi-Threading ...... 11

2.2 Enhancing Parallelism ...... 13

2.2.1 Enhancing IMT Parallelism ...... 13

2.2.2 Enhancing CMT Parallelism ...... 16

2.2.3 Enhancing PMT Parallelism ...... 19

2.3 Breaking Dependences with Speculation ...... 21

2.3.1 Types of Speculation ...... 21

2.3.2 Speculative Parallelization Techniques ...... 22

2.3.3 Loop-Sensitive Profiling ...... 27

2.4 Compilation Scope ...... 28

2.5 Parallelization For General-Purpose Programs ...... 30

3 The VELOCITY Parallelization Framework 33

3.1 Compilation and Execution Model ...... 33

3.2 Decoupled Software Pipelining ...... 34

3.2.1 Determining a Thread Assignment ...... 34

3.2.2 Realizing a Thread Assignment ...... 41

3.3 Parallel-Stage DSWP ...... 46

3.3.1 Determining a Thread Assignment ...... 46

3.3.2 Realizing a Thread Assignment ...... 50

3.4 Speculative Parallel-Stage DSWP ...... 56

3.4.1 Determining a Thread Assignment ...... 56

3.4.2 Realizing a Thread Assignment ...... 63

3.4.3 Loop-Aware Profiling ...... 70

3.5 Interprocedural Speculative Parallel-Stage DSWP ...... 75

3.5.1 Determining a Thread Assignment ...... 75

3.5.2 Realizing the Thread Assignment ...... 78

4 Extending the Sequential Programming Model 85

4.1 Programmer Specified Parallelism ...... 85

4.2 Annotations for the Sequential Programming Model ...... 87

4.2.1 Y-branch ...... 88

4.2.2 Commutative ...... 90

5 Evaluation of VELOCITY 94

5.1 Compilation Environment ...... 94

5.2 Measuring Multi-Threaded Performance ...... 95

5.2.1 Emulation ...... 96

5.2.2 Performance ...... 96

5.2.3 Cache Simulation ...... 99

5.2.4 Bringing it All Together ...... 99

5.3 Measuring Single-Threaded Performance ...... 100

5.4 Automatically Extracting Parallelism ...... 100

5.4.1 256.bzip2 ...... 100

5.4.2 175.vpr ...... 104

5.4.3 181.mcf ...... 106

5.5 Extracting Parallelism with Programmer Support ...... 109

5.5.1 164.gzip ...... 109

5.5.2 197.parser ...... 112

5.5.3 186.crafty ...... 113

5.5.4 300.twolf ...... 115

5.6 Continuing Performance Trends ...... 117

6 Conclusions and Future Directions 122

6.1 Future Directions ...... 123

List of Tables

3.1 DSWP ISA Extensions ...... 42

3.2 PS-DSWP ISA Extensions ...... 51

3.3 SpecPS-DSWP ISA Extensions ...... 67

3.4 iSpecPS-DSWP ISA Extensions ...... 82

5.1 Cache Simulator Details ...... 99

5.2 Information about the loop(s) parallelized, the execution time of the loop, and a description. ...... 117

5.3 The minimum # of threads at which the maximum speedup occurred. Assuming a 1.4x speedup per doubling of cores, the Historic Speedup column gives the speedup needed to maintain existing performance trends. The final column gives the ratio of actual speedup to expected speedup. ...... 120

List of Figures

1.1 Normalized performance of the SPEC 92, 95, and 2000 integer benchmark suites. Each version of the suite is normalized to the previous version using matching hardware. Only the highest performance result for each time period is shown. The solid line is a linear regression on all points prior to 2004, while the dashed line is a linear regression on all points in or after 2004. Source data from SPEC [4]. ...... 2

1.2 Transistor counts for successive generations of Intel processors. Source data from Intel [37]...... 3

2.1 Parallelization Types ...... 8

2.2 CMT Execution Examples ...... 10

2.3 PMT Execution Examples ...... 12

2.4 Reduction Expansion Example ...... 14

2.5 Array Privatization Example ...... 14

2.6 Loop Transformation Example ...... 15

2.7 Scheduling Example ...... 17

2.8 Parallel Stage Example ...... 20

2.9 TLS Example ...... 23

2.10 SpecDSWP Example ...... 26

2.11 Interprocedural Scope Example ...... 29

2.12 Choice of Parallelism Example ...... 31

3.1 DSWP Example Code ...... 35

3.2 Communication Synchronization Example ...... 38

3.3 Communication Optimization Example ...... 39

3.4 Multi-Threaded Code after moving operations ...... 42

3.5 Multi-Threaded Code after communication insertion ...... 43

3.6 Multi-Threaded Code after redirecting branches ...... 45

3.7 PS-DSWP Example Code ...... 47

3.8 Multi-Threaded Code after communication insertion ...... 52

3.9 PS-DSWP Multi-Threaded Code after communication insertion ...... 55

3.10 SpecPS-DSWP Example Code ...... 57

3.11 Single Threaded Speculation ...... 64

3.12 SpecPS-DSWP Parallelized Code without Misspeculation Recovery .... 68

3.13 Interaction of Parallel Code and Commit Thread ...... 69

3.14 iSpecPS-DSWP Example Code ...... 76

3.15 Single Threaded Speculation ...... 79

3.16 Stack Code Example ...... 81

3.17 iSpecPS-DSWP Parallelized Code without Misspeculation Recovery .... 83

3.17 iSpecPS-DSWP Parallelized Code without Misspeculation Recovery (cont.) 84

4.1 Motivating Example for Y-branch ...... 88

4.2 Y-branch Example ...... 90

4.3 Motivating Examples for Commutative ...... 91

5.1 Parallel Performance Measurement Dependences ...... 96

5.2 Conceptual Performance Run Execution Schedule ...... 97

5.3 Simplified version of compressStream from 256.bzip2 ...... 101

5.4 Dynamic Execution Pattern ...... 102

5.5 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 256.bzip2 ...... 103

5.6 Simplified version of try_place from 175.vpr ...... 104

5.7 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 175.vpr ...... 105

5.8 Simplified version of global_opt from 181.mcf ...... 106

5.9 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 181.mcf ...... 107

5.9 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 181.mcf (cont.) ...... 108

5.10 Simplified version of deflate from 164.gzip...... 109

5.11 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 164.gzip ...... 110

5.11 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 164.gzip (cont.) ...... 111

5.12 Simplified version of batch_process from 197.parser ...... 112

5.13 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 197.parser ...... 112

5.14 Simplified version of Search from 186.crafty ...... 113

5.15 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 186.crafty ...... 114

5.16 Simplified version of uloop from 300.twolf ...... 115

5.17 Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 300.twolf ...... 116

5.18 The maximum speedup achieved on up to 32 threads over single-threaded execution. The Manual bar represents the speedup achieved using manual parallelization [9], while the VELOCITY bar represents the speedup achieved by VELOCITY. ...... 118

5.19 The minimum number of threads at which the maximum speedup occurred. The Manual bar represents the number of threads used by manual parallelization [9], while the VELOCITY bar represents the number of threads used by VELOCITY. ...... 118

Chapter 1

Introduction

Until recently, increasing uniprocessor clock speeds and microarchitectural improvements generally improved performance for all programs. Unfortunately, as Figure 1.1 illustrates, since approximately 2004 the performance of single-threaded applications has no longer been increasing at its earlier rate. This change came about because of several factors, though two predominant ones stand out. First, processor designers were unable to increase processor frequency as they had in the past without exceeding power and thermal design constraints. Second, even with a steady increase in transistors (Figure 1.2), processor designers have also been unable to create new microarchitectural innovations without exceeding power and heat budgets.

Instead, processor designers now use the continuing growth in transistor count to place multiple cores on a processor die. This strategy has several microarchitectural benefits, including more scalable thermal characteristics, better throughput per watt, and greatly reduced design complexity. Because of this, processor manufacturers are already shipping machines with four or eight cores per processor die and tomorrow’s machines promise still more cores [36]. However, additional cores only improve the performance of multi-threaded applications. The prevalence of single-threaded applications means that these cores often provide no performance improvement, as shown in the post-2004 portion of Figure 1.1.

[Figure 1.1: Normalized performance of the SPEC 92, 95, and 2000 integer benchmark suites. Each version of the suite is normalized to the previous version using matching hardware. Only the highest performance result for each time period is shown. The solid line is a linear regression on all points prior to 2004, while the dashed line is a linear regression on all points in or after 2004. Source data from SPEC [4].]

[Figure 1.2: Transistor counts for successive generations of Intel processors. Source data from Intel [37].]

One potential solution is manual parallelization. Many parallel programming languages and models have been developed to ease the burden of creating multi-threaded applications [20, 23, 47, 68]. Converting existing single-threaded applications to these languages is not without cost, however. Developers must be trained to write and debug parallel programs with race conditions, deadlock, livelock, and the other concerns of parallel programming in mind, which makes writing a parallel application considerably more difficult than writing the equivalent single-threaded one. While parallel programming languages should become more commonplace, manual parallelization does little for the many existing single-threaded applications and, by itself, does not restore the historical performance trends of Figure 1.1.

Many automatic thread extraction techniques [12, 26, 29], which extract threads from single-threaded programs without programmer intervention, have been proposed, particularly loop parallelization techniques such as DOALL [48], DSWP [55], and DOACROSS [72] (Chapter 2). In isolation, these techniques produce either scalable parallelism on a narrow range of applications (e.g. DOALL on scientific applications), or nonscalable parallelism on a broad range of applications (e.g. DOACROSS and DSWP). This dissertation focuses on the problems that have prevented the automatic extraction of scalable parallelism in general-purpose applications and constructs a parallelization framework in the VELOCITY compiler to overcome these limitations.

Achieving scalable parallelism on a broad range of applications first requires an automatic parallelization technique with broad applicability. To that end, this dissertation starts with the DSWP technique. However, the base parallelization technique can extract only small amounts of parallelism and must be extended with the ability to speculate dependences and to extract data-level parallelism. This dissertation integrates and extends several existing DSWP-based parallelization techniques that have not previously been combined to achieve this (Chapter 3).

Beyond combining existing extensions, it is also necessary to give the parallelization

technique interprocedural scope [9, 32]. Existing DOALL parallelization techniques use interprocedural array analysis, but have not needed other interprocedural analysis or the ability to optimize interprocedurally. Existing DOACROSS and DSWP techniques rely on inlining function calls to obtain loops with sufficient opportunity for parallelization. Unfortunately, the limitations of inlining, namely its inability to handle recursion and its explosive code growth, mean that it is sufficient to extract parallelism only from inner loops or from functions at the bottom of the call tree. To extract more scalable parallelism, this dissertation also develops a parallelization technique that can target any loop in the program, from innermost to outermost or anywhere in between (Chapter 3).

Additional support, in the form of accurate compiler analysis, aggressive compiler optimization, and efficient hardware support, ensures that the parallelization technique can extract significant parallelism. In particular, effectively applying additional optimizations during parallelization, particularly for synchronization placement [54, 90] and speculation [72, 82], requires accurate profile information about the loop being parallelized. Traditional profiling infrastructures, however, focus on profiling with respect to instructions, or basic blocks, not loops. This limits the applicability of loop parallelization techniques by conflating intra-iteration dependences with inter-iteration dependences, or worse with dependences that occur outside the loop.

Finally, even many apparently easily parallelizable programs cannot be parallelized automatically. The dependences that inhibit parallelism are only rarely those required for correct program execution, more commonly manifesting from artificial constraints imposed by sequential execution models. In particular, the programmer is often unable to specify multiple legal execution orders or program outcomes. Because of this, the compiler is forced to maintain the single execution order and program output that a sequential program specifies, even when others are more desirable. This dissertation proposes two simple sequential annotations to help the compiler overcome this limitation through programmer intervention (Chapter 4).

Using the VELOCITY compiler, scalable parallelism can be extracted from many applications, yet simulating long-running parallel loops can be hard. Existing parallelization techniques have generally been evaluated using only 2 or 4 processors on loops with iterations that take fewer than a million cycles, making them amenable to simulation. When extracting scalable parallelism, it is not uncommon for each iteration to take billions of cycles and for such parallelism to scale to 32 processors or beyond. As such, existing simulation environments are insufficient, and a new evaluation infrastructure is required that can measure the parallelism extracted from outermost loops (Chapter 5). In summary, the contributions of this dissertation are:

1. The design and implementation of a novel parallelization framework for general-purpose programs, VELOCITY, that can address the performance problem of legacy sequential codes on modern, multi-core processors.

2. The design and implementation of a loop-aware profiling infrastructure and the integration of several types of loop-aware profile information into the parallelization technique.

3. Extensions to the sequential programming model that enhance VELOCITY’s ability to extract parallelism without forcing the programmer to consider multi-threaded concepts or execution.

4. The design and implementation of a performance evaluation methodology able to handle parallelized loops with up to 32 threads that execute for billions of cycles.

5. An evaluation of the VELOCITY compiler on several applications in the SPEC CINT2000 benchmark suite.

By bringing together these many techniques and extensions, to both the compiler and the programming language, this dissertation shows that parallelism can be extracted from general-purpose applications. Using VELOCITY, seven benchmarks in the SPEC CINT2000

benchmark suite are automatically parallelized. Modifying only 38 out of 100,000 lines of code, these applications show a 3.64x speedup, on up to 32 cores, over single-threaded execution, with a 20.61x maximum speedup on a single application. Overall, a 2.12x speedup was achieved across the 12 benchmarks of the SPEC CINT2000 suite, even with no current speedup on 5 of these benchmarks. The VELOCITY point in Figure 1.1 indicates the expected SPEC CINT2000 score that could be obtained with VELOCITY, showing that the performance trends of the past can be obtained now and in the future via automatic parallelization.

Chapter 2

Automatic Parallelization

This chapter explores existing techniques in automatic parallelization of loops, discussing their benefits and limitations. Techniques that spawn speculative threads along multiple acyclic paths [5, 85] or to better utilize microarchitectural structures [6, 13, 15, 51, 60] are orthogonal to this work and will not be discussed further.

2.1 Extracting Parallelism

There are three types of parallelism that can be extracted from loops [62]. Each type allows or disallows certain cross-thread dependences to trade off applicability with performance.

2.1.1 Independent Multi-Threading

The first type of parallelism, Independent Multi-Threading (IMT), does not allow cross-thread dependences and was first introduced to parallelize array-based scientific programs. The most successful IMT technique, DOALL parallelization [48], satisfies the condition of no inter-thread dependences by executing loop iterations in separate threads that it can prove have no inter-iteration dependences. A primary benefit of IMT parallelism is that it scales linearly with the number of available threads. In particular, this scalability has led to the integration of DOALL into several compilers that target array-based, numerical programs [8, 32].

(a) Array Traversal:

       int i = 0;
    C: for (; i < N; i++)
    X:   work(i);

(b) Linked List Traversal:

       list *cur = head;
    L: for (; cur != NULL; cur = cur->next)
    X:   work(cur);

[Figure 2.1 (c)-(e): IMT Execution Example, CMT Execution Example, and PMT Execution Example, showing cycle-by-cycle schedules of the two loops above across two cores.]

Figure 2.1: Parallelization Types

To see how DOALL would parallelize a loop, consider the loop in Figure 2.1a. When parallelized into two threads, one potential dynamic execution is shown in Figure 2.1c. Since this parallelization requires no communication between cores, it is insensitive to the communication latency between cores and can scale to as many cores as there are iterations. In the case of Figure 2.1c, this means that the overall speedup of the loop is 2.
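To make this concrete, the following is a minimal sketch (not code from this dissertation) of the multi-threaded code a DOALL parallelization of Figure 2.1a might produce; the iteration count N, the thread count, and the empty work body are assumptions for illustration, with the iteration space block-partitioned across POSIX threads:

    #include <pthread.h>

    #define N       1024
    #define THREADS 4

    static void work(int i) { (void)i; }   /* stand-in for the loop body */

    /* Each worker executes a contiguous block of the original iteration space;
       no values flow between workers, so no synchronization is needed.        */
    static void *doall_worker(void *arg) {
        long t     = (long)arg;
        int  begin = (int)( t      * N / THREADS);
        int  end   = (int)((t + 1) * N / THREADS);
        for (int i = begin; i < end; i++)
            work(i);
        return NULL;
    }

    void compute_doall(void) {
        pthread_t tid[THREADS];
        for (long t = 0; t < THREADS; t++)
            pthread_create(&tid[t], NULL, doall_worker, (void *)t);
        for (long t = 0; t < THREADS; t++)
            pthread_join(tid[t], NULL);
    }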

IMT’s requirement that there are no cross-thread dependences within a loop is both a source of scalable parallelism and a limit on applicability. While IMT techniques have enjoyed broad applicability for scientific applications, they have not been used to parallelize general-purpose applications, which have many inter-iteration dependences. These dependences arise both from complicated control flow and from the manipulation of large, heap-allocated data structures, particularly recursively defined data structures.

2.1.2 Cyclic Multi-Threading

Cyclic Multi-Threading (CMT) techniques, which allow cross-thread dependences to exist, have been developed to solve the problem of applicability. To maintain correct execution, dependences between threads are synchronized so that each thread executes with the proper values. Though initially introduced to handle array-based programs with intricate access patterns, CMT techniques can also be used to extract parallelism from general purpose applications with complex dependence patterns.

The most common CMT transformation is DOACROSS [18, 57], which achieves parallelism in the same way as DOALL, by executing iterations in parallel in separate threads. To maintain correct execution, the inter-iteration dependences that cross between threads must be respected. These dependences are synchronized to ensure that later iterations receive the correct values from earlier iterations.

An example of a DOACROSS execution schedule is shown in Figure 2.1d for the loop in Figure 2.1b. Notice that DOACROSS can extract parallelism even in the face of the cur=cur->next loop-carried dependence. For the example shown, DOACROSS achieves an ideal speedup of 2 when the work code takes the same amount of execution time as the linked list traversal.
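As an illustration (a sketch under assumed types, not code from this dissertation), a DOACROSS parallelization of the loop in Figure 2.1b can synchronize the cur=cur->next recurrence by passing a turn token between threads: each thread waits for its turn, reads the shared cursor, immediately advances it for the next iteration, and only then performs the parallel call to work. The list type, the empty work body, and the two-thread setup are assumptions.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stddef.h>

    #define NTHREADS 2

    typedef struct list { struct list *next; } list;

    static void work(list *node) { (void)node; }   /* stand-in for the loop body */

    static sem_t turn[NTHREADS];   /* turn[t] lets thread t consume the recurrence */
    static list *cur;              /* the loop-carried traversal cursor            */

    static void *doacross_worker(void *arg) {
        long id = (long)arg;
        for (;;) {
            sem_wait(&turn[id]);                   /* wait for the loop-carried value          */
            list *node = cur;
            if (node != NULL)
                cur = node->next;                  /* produce the value for the next iteration */
            sem_post(&turn[(id + 1) % NTHREADS]);  /* hand the recurrence to the next thread   */
            if (node == NULL)
                return NULL;                       /* loop exit reached                        */
            work(node);                            /* this portion runs in parallel            */
        }
    }

    void traverse_doacross(list *head) {
        pthread_t tid[NTHREADS];
        cur = head;
        for (long t = 0; t < NTHREADS; t++)
            sem_init(&turn[t], 0, t == 0 ? 1 : 0); /* thread 0 runs the first iteration */
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, doacross_worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        for (long t = 0; t < NTHREADS; t++)
            sem_destroy(&turn[t]);
    }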

Unlike IMT transformations, CMT transformations do not always experience linearly increasing performance as more threads are added to the system. In particular, when the threads begin completing work faster than dependences can be communicated, the critical path of a DOACROSS parallelization becomes dependent upon the communication latency between threads. Because of this, DOACROSS parallelizations are fundamentally limited in speedup by the size of the longest cross-iteration dependence cycle and the communication latency between processor cores [45, 73].

There are often many inter-iteration, cross-thread dependences in general-purpose programs, which leaves little code to execute in parallel. Synchronizing these dependences only decreases the number of threads at which communication latency becomes the limiting factor. Additionally, variability in the execution time of one iteration can potentially cause stalls in every other thread in the system, further adding to the critical path and limiting the attainable speedup.

[Figure 2.2: CMT Execution Examples. (a) Comm. Latency = 1, No Variability; (b) Comm. Latency = 2, No Variability; (c) Comm. Latency = 1, Variability.]

Figure 2.2 illustrates these problems. The parallelization shown in the CMT column of Figure 2.1 assumed a 1-cycle communication latency between cores and no variability in the time to perform work. Figure 2.2a shows a parallelization of the same code onto 3 threads, assuming work now takes 2 cycles per iteration, completing an iteration every cycle. As the communication latency increases from 1 to 2 cycles, an iteration is completed only every other cycle, as shown in Figure 2.2b. Additionally, as variability occurs in the time to perform work, stall cycles appear on the critical path and iterations are not completed every cycle, as shown in Figure 2.2c.

2.1.3 Pipelined Multi-Threading

In order to achieve broadly applicable parallelizations that are not limited by communication latency, a parallelization paradigm is needed that does not spread the critical path of a loop across multiple threads. Pipelined Multi-Threading (PMT) techniques have been proposed that allow cross-thread dependences while ensuring that the dependences between threads flow in only one direction. This ensures that the dependence recurrences in a loop remain local to a thread, so that communication latency does not become a bottleneck. This paradigm also facilitates a decoupling between earlier threads and later threads that can be used to tolerate variable latency. The most common PMT transformation is Decoupled Software Pipelining (DSWP) [55, 63], an extension of the much earlier DOPIPE [19] technique that handles arbitrary control flow, complicated memory accesses, and all other features of modern software.

As in CMT techniques, DSWP synchronizes dependences between threads to ensure correct execution. However, instead of executing each iteration in a different thread, DSWP executes portions of the static loop body in parallel, with dependences flowing only in a single direction, forming a pipeline.

Figure 2.1e shows an example of a DSWP parallelization executing the loop in Figure 2.1b. In the example the first thread executes the pointer chasing code, while the second thread performs the necessary work on each list element. However, the dependence recurrence formed by the pointer chasing code is executed only on the first core, keeping it thread local. The work portion of the loop executes in parallel on the other core, allowing the parallelization to achieve an ideal speedup of 2, assuming the list traversal and work have equal weight.

DSWP ensures that all operations in a dependence cycle execute in the same thread, removing communication latency from the critical path. This has the additional benefit of ensuring that dependences flow in only one direction, which allows DSWP to use communication queues [63] to decouple execution between stages of the pipeline; this decoupling can be used to tolerate variable latency.
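As an illustration (a sketch under assumed types, not code produced by the DSWP implementation described later), a two-stage DSWP parallelization of the loop in Figure 2.1b can be written as a traversal stage that performs the pointer chasing and a work stage that calls work, connected by a small bounded queue; the queue here stands in for DSWP's communication queues, and the list type and empty work body are assumptions.

    #include <pthread.h>
    #include <stddef.h>

    #define QSIZE 64

    typedef struct list { struct list *next; } list;

    static void work(list *node) { (void)node; }   /* stand-in for the loop body */

    /* A small bounded queue standing in for DSWP's inter-thread communication queues. */
    static struct {
        list *buf[QSIZE];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty, not_full;
    } q = { .lock = PTHREAD_MUTEX_INITIALIZER,
            .not_empty = PTHREAD_COND_INITIALIZER,
            .not_full  = PTHREAD_COND_INITIALIZER };

    static void q_push(list *v) {
        pthread_mutex_lock(&q.lock);
        while (q.count == QSIZE) pthread_cond_wait(&q.not_full, &q.lock);
        q.buf[q.tail] = v; q.tail = (q.tail + 1) % QSIZE; q.count++;
        pthread_cond_signal(&q.not_empty);
        pthread_mutex_unlock(&q.lock);
    }

    static list *q_pop(void) {
        pthread_mutex_lock(&q.lock);
        while (q.count == 0) pthread_cond_wait(&q.not_empty, &q.lock);
        list *v = q.buf[q.head]; q.head = (q.head + 1) % QSIZE; q.count--;
        pthread_cond_signal(&q.not_full);
        pthread_mutex_unlock(&q.lock);
        return v;
    }

    /* Stage 1: the cur=cur->next recurrence stays local to this thread. */
    static void *traverse_stage(void *head) {
        for (list *cur = head; cur != NULL; cur = cur->next)
            q_push(cur);                /* forward the dependence to the next stage */
        q_push(NULL);                   /* sentinel: the loop has exited            */
        return NULL;
    }

    /* Stage 2: the work portion executes decoupled from the traversal. */
    static void *work_stage(void *unused) {
        (void)unused;
        for (list *cur; (cur = q_pop()) != NULL; )
            work(cur);
        return NULL;
    }

    void traverse_dswp(list *head) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, traverse_stage, head);
        pthread_create(&t2, NULL, work_stage, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
    }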

[Figure 2.3: PMT Execution Examples. (a) Comm. Latency = 1, No Variability; (b) Comm. Latency = 2, No Variability; (c) Comm. Latency = 1, Variability.]

Figure 2.3 illustrates these properties for the loop in Figure 2.1b. Figure 2.3a illustrates a DSWP execution similar to the one for the PMT column in Figure 2.1e, except that the work now takes 2 cycles. For 3 threads, DSWP is only able to extract a 2-stage, 2-thread pipeline, leaving one thread unused. However, increasing the communication latency, shown in Figure 2.3b, only increases the pipeline fill time. The steady-state parallelism is unaffected, leading to a speedup equivalent to that of Figure 2.3a. Because dependence recurrences along the loop’s critical path are not spread across cores, variability also does not dramatically affect the speedup of a DSWP partition, as shown in Figure 2.3c.

The speedup of a DSWP parallelization is limited by the size of the largest pipeline stage. Because DSWP extracts parallelism among the static operations of a loop and must keep dependence recurrences local to a thread, the largest pipeline stage is limited by the largest set of intersecting dependence cycles in the static code. In practice, this means that DSWP can often scale to only a few threads before performance no longer increases with additional threads [61].

2.2 Enhancing Parallelism

Not surprisingly, many techniques have been proposed to overcome the limitations that current parallelization techniques face. This section discusses the most prominent techniques for each type of parallelism.

2.2.1 Enhancing IMT Parallelism

Because IMT techniques do not allow inter-iteration, cross-thread dependences to exist, extensions to IMT techniques have focused on turning inter-iteration, cross-thread dependences into thread-local dependences. In particular, many techniques have extended the base DOALL technique to remove inter-iteration dependences, either through more accurate analysis or through program transformations.

Reduction Expansion

Dependences that form specific patterns are amenable to reduction transformations [49, 56, 64]. Though often used to expose instruction-level parallelism for single-threaded scheduling, the same techniques hold merit when extracting thread-level parallelism. These transformations take accumulators, min/max reductions, etc., and make inter-iteration dependences thread-local by creating local copies of the variable. The correct value is then calculated via appropriate cleanup code at the end of the loop’s invocation.

[Figure 2.4: Reduction Expansion Example. (a) Reduction Code; (b) Expanded Reduction.]

Figure 2.4 shows an example of reduction expansion applied to a simple array traversal. The unoptimized code in Figure 2.4a sums the elements of an array A. The cross-iteration dependence on the sum variable makes a DOALL parallelization impossible. However, since the expression computing sum is a reduction, it can be expanded to produce the code in Figure 2.4b. In the expanded version, each thread computes a thread-local value for sum in the _sum array. After the loop executes, the _sum array is reduced into the sum variable. So long as the number of iterations is much greater than the number of threads, the execution time of the reduction code run after the loop finishes is negligible.
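Because the code in Figure 2.4 survives only partially in this text, the following is a hedged sketch of the transformation just described (the array A, its length N, and the thread count are illustrative choices): each thread accumulates into its own slot of _sum, and the loop-carried dependence on sum is recreated by cleanup code after the threads finish.

    #include <pthread.h>

    #define N       1024
    #define THREADS 4

    static int A[N];                      /* input array (contents assumed)           */
    static int _sum[THREADS];             /* one private accumulator per thread       */
    static int sum;                       /* the original reduction variable          */

    static void *partial_sum(void *arg) {
        long t = (long)arg;
        for (int i = (int)(t * N / THREADS); i < (int)((t + 1) * N / THREADS); i++)
            _sum[t] += A[i];              /* thread-local: no cross-thread dependence */
        return NULL;
    }

    void sum_parallel(void) {
        pthread_t tid[THREADS];
        for (long t = 0; t < THREADS; t++)
            pthread_create(&tid[t], NULL, partial_sum, (void *)t);
        for (long t = 0; t < THREADS; t++)
            pthread_join(tid[t], NULL);
        sum = 0;                          /* cleanup code: combine the partial sums   */
        for (int t = 0; t < THREADS; t++)
            sum += _sum[t];
    }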

Privatization

[Figure 2.5: Array Privatization Example. (a) Array Code; (b) Privatized Array Code.]

In array-based programs, the same static variable or array is often referenced in multiple iterations, but can be proven, via compiler analysis, to always be written in the current iteration. Lacking separate address spaces, the compiler can privatize the variable or array, conceptually creating separate variables or arrays via memory renaming [32, 41, 42, 81]. Since each thread references a thread-local variable or array as opposed to a global variable or array, there are no inter-iteration, and thus cross-thread, false memory dependences to synchronize, enabling better parallelization.

Figure 2.5 shows an example of array privatization applied to a simple array traversal. Without privatization, updates to the B array in different iterations of the outer loop would be forced to execute sequentially to ensure that earlier iterations do not overwrite values already written by later iterations. However, since every read from the B array is preceded by a write in the same iteration, the compiler can create an iteration-local version of the array. This ensures that there is no potential for other iterations to write to the current iteration’s version of B, removing the false memory dependences.
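Since Figure 2.5 is only partially legible in this text, the following is a hedged sketch of the transformation it describes; the shapes of A, B, and C, the inner computation, and the max_range helper are assumptions chosen to match the max(B, 0, 10) call that survives. Because every element of B that an iteration reads is written earlier in that same iteration, B can be made iteration-local, removing the false memory dependences between outer-loop iterations.

    #define N 1024
    #define M 11

    static int A[N][M], C[N];
    static int B[M];                         /* shared scratch array in the original   */

    static int max_range(const int *v, int lo, int hi) {   /* illustrative helper      */
        int m = v[lo];
        for (int j = lo + 1; j <= hi; j++)
            if (v[j] > m) m = v[j];
        return m;
    }

    /* (a) Every iteration writes all of B before reading it, but because B is shared,
       iterations carry false (anti- and output-) memory dependences on it.            */
    void fill_original(void) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++)
                B[j] = A[i][j];
            C[i] = max_range(B, 0, 10);
        }
    }

    /* (b) Privatized: B becomes an iteration-local _B, so outer-loop iterations can
       execute in different threads without synchronizing on a shared scratch array.   */
    void fill_privatized(void) {
        for (int i = 0; i < N; i++) {
            int _B[M];
            for (int j = 0; j < M; j++)
                _B[j] = A[i][j];
            C[i] = max_range(_B, 0, 10);
        }
    }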

Loop Transformations

[Figure 2.6: Loop Transformation Example. (a) Array Code; (b) Distributed Code.]

Should an inter-iteration dependence manifest itself even in the face of other techniques, it is sometimes possible to change the loop so as to facilitate parallelization [38, 42]. One particular transformation is to split the portion of the loop that contains the dependence into a separate loop via Loop Distribution (also known as Loop Fission). In this case, Loop Distribution creates several loops, with each set of intersecting, inter-iteration dependence cycles isolated into a separate loop that must execute sequentially. The remaining code is formed into loops that can execute in parallel. This technique is generally limited to loops that operate on statically sized arrays, as the compiler must be able to compute the amount of state needed to hold temporary values. Other transformations rearrange the loop hierarchy (Loop Interchange), fuse loops to create larger iterations (Loop Fusion), unroll loop iterations (Loop Unrolling), etc., in order to create more optimal loops to parallelize.

Figure 2.6 shows an example of Loop Distribution applied to a simple array traversal. The base code in Figure 2.6a has a loop-carried dependence on the update to the elements of the A array. However, the A array updates can be isolated into their own loop that executes sequentially before the original loop. The code to update the B array can then execute in a parallelized loop, as there are no loop-carried dependences on the calculation used to determine its value.
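The following is a hedged sketch of the distribution just described, with the actual computations replaced by stand-ins since Figure 2.6 is only partially legible here: the recurrence on A is isolated into a loop that runs sequentially, after which the loop over B carries no dependence and becomes a DOALL candidate.

    #define N 1024

    static int A[N], B[N];

    /* (a) Original: the recurrence on A serializes the whole loop.                  */
    void combined(void) {
        for (int i = 1; i < N; i++) {
            A[i] = A[i - 1] + 1;          /* loop-carried dependence (stand-in)      */
            B[i] = 2 * A[i];              /* no loop-carried dependence (stand-in)   */
        }
    }

    /* (b) Distributed: the recurrence runs sequentially, then the B loop can be
       parallelized because each B[i] depends only on the already-computed A[i].     */
    void distributed(void) {
        for (int i = 1; i < N; i++)
            A[i] = A[i - 1] + 1;          /* sequential loop                         */
        for (int i = 1; i < N; i++)       /* parallelizable loop                     */
            B[i] = 2 * A[i];
    }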

2.2.2 Enhancing CMT Parallelism

Even with the extensions that have been proposed for IMT techniques, it is usually impossible to extract purely DOALL parallelism from general purpose programs. However, the extensions lend themselves just as readily to CMT techniques, such as DOACROSS, as they do to IMT techniques. Using these extensions to break dependences expands the size of parallel regions and reduces the amount of synchronization. CMT-specific extensions to the DOACROSS technique focus on dealing with cross-thread dependences that cannot be removed by IMT extension techniques.

In particular, IMT techniques still fail to alleviate late-iteration synchronization points, which can greatly limit the speedup of CMT parallelizations. Figure 2.7c shows an example of this problem, with the printf call synchronizing two otherwise parallelizable bodies of work. The printf dependence is typical of dependences in general-purpose code that cannot be removed. However, several extensions to CMT techniques have been proposed to help deal with these dependences by limiting their influence.

Scheduling Dependences

The size of regions executing in parallel directly affects the amount of speedup obtained. As a natural consequence of this, if a dependence that causes synchronization in two otherwise unsynchronized regions can be scheduled above or below those regions, then there is more potential for parallel execution. This observation is particularly useful for CMT techniques like DOACROSS [7, 18, 45, 57, 90], where scheduling synchronization to occur either earlier or later increases the size of regions that can execute in parallel.

(a) List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   work(cur);
    C:   printf(cur);
       }

(b) Scheduled List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    C:   printf(cur);
    B:   work(cur);
       }

[Figure 2.7 (c)-(d): Execution of the code in Figure 2.7a and of the code in Figure 2.7b, showing the per-cycle schedule across three cores.]

Figure 2.7: Scheduling Example

Figure 2.7 illustrates the value of scheduling. Before scheduling, the code in Figure 2.7a is limited by the printf loop-carried dependence that must be synchronized. However, with scheduling, the compiler can prove that it is legal to move this dependence to the top of the iteration, significantly reducing the size of the region that DOACROSS must synchronize to ensure correct execution. For Figure 2.7a, scheduling moves the printf loop-carried dependence from the bottom of the iteration to the top, as shown in Figure 2.7b. This scheduling transformation reduces the time each thread spends waiting for communication, reducing the length of the loop’s critical path.

Dynamic Work Allocation

DOACROSS statically schedules each iteration to a specific processor, which is only optimal when there is no variation in runtime between iterations. Variance can cause a single thread to become the bottleneck even though free processors exist. Dynamically spawning threads onto idle processors [28] allows the processors to greedily consume work as it becomes available rather than waiting for a statically determined iteration, more evenly balancing the available work and leading to more parallelism.

Figure 2.7d illustrates dynamic scheduling and instruction scheduling combining to produce better execution time than the base DOACROSS technique. Instruction scheduling moves the printf dependence cycle to the top of the loop, allowing it to be synchronized before the variable-length work is executed. Because of this, processors do not wait for synchronization and can dynamically execute the next ready iteration.
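A minimal sketch of dynamic work allocation for a counted loop follows (the iteration count, thread count, and empty work body are assumptions, and the iterations are taken to be independent or already speculated): instead of statically assigning iteration i to thread i mod THREADS, idle threads claim the next unclaimed iteration with an atomic fetch-and-add, so a long-running iteration delays only the thread executing it.

    #include <pthread.h>
    #include <stdatomic.h>

    #define N       1024
    #define THREADS 4

    static void work(int i) { (void)i; }          /* stand-in for the iteration body */

    static atomic_int next_iter;                  /* shared work counter             */

    static void *dynamic_worker(void *unused) {
        (void)unused;
        for (;;) {
            int i = atomic_fetch_add(&next_iter, 1);   /* claim the next iteration   */
            if (i >= N)
                return NULL;
            work(i);                              /* iteration runtimes may vary     */
        }
    }

    void run_dynamic(void) {
        pthread_t tid[THREADS];
        atomic_store(&next_iter, 0);
        for (long t = 0; t < THREADS; t++)
            pthread_create(&tid[t], NULL, dynamic_worker, NULL);
        for (long t = 0; t < THREADS; t++)
            pthread_join(tid[t], NULL);
    }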

18 2.2.3 Enhancing PMT Parallelism

The techniques used for IMT parallelizations are also applicable to PMT parallelizations when they can break dependence recurrences. Unfortunately, the dependence recurrences that these techniques break tend not to split large stages into smaller stages.

Parallel Stage

Rather than break large stages into smaller stages, it is sometimes possible to improve the speedup of a DSWP parallelization by replicating the largest stage. A DSWP parallelization is limited by the size of the largest stage because DSWP extracts only parallelism among the static code of a loop. Large stages tend to be the result of inner loops whose static code cannot be split among multiple pipeline stages. However, a pipeline stage that contains no outer loop-carried dependence can execute multiple outer loop iterations in parallel. The Parallel-Stage DSWP (PS-DSWP) extension to DSWP uses this insight to achieve data-level parallelism for such stages [57, 61, 75], essentially extracting IMT parallelism for that stage. For a sufficient number of threads, PS-DSWP can be used to reduce the runtime of a replicable stage to the length of the longest single dynamic iteration.

Figure 2.8b shows a PS-DSWP parallelization for the code in Figure 2.8a. In this example, the stage containing the call to work can be replicated. Conceptually, this is equivalent to unrolling the loop by the replication factor, as shown in Figure 2.8b. All operations corresponding to the non-replicable A stage remain in a loop-carried dependence recurrence. However, there now exist two instances of the B stage that can be scheduled to separate threads, increasing parallelism.

Dynamic Parallel Stage

PMT techniques, like CMT techniques, benefit from dynamic allocation of work to threads. In particular, PS-DSWP can be extended so that each thread executing a replicated stage dynamically pulls work off of a master work queue. Figure 2.8e illustrates the execution of a PS-DSWP parallelization that dynamically assigns work to threads.

(a) List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   work(cur);
       }

(b) Unrolled List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   work(cur);
    A:   cur = cur->next;
         if (!cur) break;
    B:   work(cur);
       }

[Figure 2.8 (c)-(e): Execution of the code in Figure 2.8a; execution of the code in Figure 2.8b with static thread assignment; and execution of the code in Figure 2.8b with dynamic thread assignment.]

Figure 2.8: Parallel Stage Example

2.3 Breaking Dependences with Speculation

The success of automatic parallelization techniques is inherently limited by dependences in the region being parallelized. Since the compiler must be conservatively correct with respect to all possible dependences, it will often be limited by dependences that cannot, will not, or rarely manifest themselves dynamically. This section explores several types of speculation and their use in breaking problematic dependences to enhance parallelism.

2.3.1 Types of Speculation

Many variables in a program have predictable values at specific points in the program [58]. In particular, variables that do not change in value, variables set by the command line, and variables that have a constant update are easily predictable. Value Speculation [46, 59] removes dependences using a predicted value instead of the actual value to satisfy the dependence. Whenever the dependence executes and the actual value is not the predicted value, misspeculation is flagged [28, 82].

The most common type of value speculation predicts that a biased branch will not take a specific path [82]. In particular, this type of speculation is often used to remove loop exits and branches that control programmer assertions and other error-checking code, as these branches control, either directly or indirectly, the rest of the loop. This leads them to create long loop-carried dependence chains that limit parallelization. Other value speculations predict the value of loads [9] or that a store will be silent [40].

In general purpose programs, parallelization must be able to break dependences that arise from code that manipulates arbitrarily complex heap-allocated data structures. Even with aggressive memory analysis [14, 31, 53, 70], many loads and stores will potentially access the same data. At runtime, many such conflicts will not manifest themselves,

21 either because they are not possible or are extraordinarily rare. Memory Alias Speculation leverages this property to remove memory data dependences between loads and stores.

Memory Alias Speculation removes memory dependences by assuming that two memory operations will not access the same memory location. Unlike value speculation, alias speculation only causes misspeculation when the destination of a dependence executes before the source. Alias speculation only ensures that a dependence does not have to be synchronized; the dependence may still be respected at runtime due to the timing of the threads involved in the dependence.

Finally, Memory Value Speculation combines the benefits of memory alias speculation with value speculation. Instead of checking that a load and store access the same memory location in the correct order, it checks that the value loaded is correct, usually by performing a second checking load that respects all memory dependences of the original load. Because the speculation checks values and not locations, it is only effective for the speculation of true memory dependences. However, unlike alias speculation, if a load happens to load the correct value, even from a different store, the speculation will not fail.

2.3.2 Speculative Parallelization Techniques

All speculative loop parallelization techniques use speculation to break dependences. To ensure that correct execution is always possible when speculating inter-thread dependences, speculative threads conceptually maintain speculative state separately from non-speculative state. This allows them to roll back to a known good state upon misspeculation.

Speculative IMT

Adding speculation to the IMT parallelization techniques allows them to parallelize loops that would otherwise contain inter-iteration, cross-thread dependences. By speculating all inter-iteration dependences, Speculative DOALL techniques [44, 64, 91] can parallelize any loop. For most loops this involves breaking many dependences that are hard if not impossible to predict and which manifest often. This leads to excessive misspeculation, which in turn causes performance little better than that of their non-speculative counterparts. Where they are profitable, more accurate analyses or other transformations can often mitigate the need for speculation [44, 56].

(a) List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   cur->data = work(cur);
    C:   if (cur->data == NULL)
           break;
       }

(b) Speculated List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   cur->data = work(cur);
    C:   if (cur->data == NULL)
           MISSPECULATE;
       }

[Figure 2.9 (c)-(d): Execution of the code in Figure 2.9a and of the code in Figure 2.9b across three cores.]

Figure 2.9: TLS Example

Speculative CMT

Speculation is more useful in the domain of CMT techniques, which can use speculation to break problematic dependences in a more controlled manner than Speculative IMT techniques. There are many speculative CMT techniques that extend the basic DOACROSS concept with speculation and shall be referred to as Thread-Level Speculation (TLS) [22, 33, 71, 72, 80, 83] techniques.

As in DOACROSS, TLS techniques execute multiple iterations of a loop in parallel by executing each iteration in a separate thread. Unlike DOACROSS, only the oldest thread in sequential program order is non-speculative; all other threads execute speculatively. Threads commit in the original sequential program order to maintain correctness. Once a thread commits, the oldest speculative thread becomes non-speculative and the process repeats.

TLS techniques generally checkpoint memory at the beginning of each iteration, storing speculative writes into a separate version of memory that is made visible to the rest of the system when the iteration becomes non-speculative. These iteration-specific versions of memory closely resemble transactional memories in their semantics, but with constrained commit and rollback operations [28, 39, 65]. That is, the oldest iteration’s memory version must commit before younger versions. When a conflict occurs that requires a rollback, the younger iteration is always rolled back. To ensure that younger iterations do not commit their state until all older iterations have completed, the younger iteration waits at the end of its iteration to receive a commit token from the next oldest iteration.

Speculation in TLS techniques focuses on removing inter-iteration dependences. In practice, this means that memory alias speculation is used to remove potential dependences among loads and stores. Control speculation is used to ensure that potential inter-iteration register and control dependences either become unreachable or can be scheduled early in the loop [90]. Detection of misspeculation can be performed in either hardware or software. Misspeculation of control speculation is detected by the program. For alias speculation, misspeculation is detected by the same hardware that maintains speculative memory versions. The separate memory versions also implicitly break inter-iteration false (anti- and output-) memory dependences, giving more opportunity for parallelism.

Figure 2.9 shows how speculating a loop exit branch can unlock parallelism. The speculation removes the dependence cycle from the loop exit to the loop header that prevents the extraction of any parallelism. When speculated, the branch can be executed in parallel, with misspeculation used to indicate that the loop is finished.

TLS techniques have acknowledged that synchronization is the limiting factor in a CMT parallelization, particularly when it occurs late in the iteration [90]. To avoid this limitation, a TLS parallelization splits a loop iteration into two parts, delineated by a single synchronization point. All inter-iteration dependences above the synchronization point are respected and communicated to the next iteration at the synchronization point. All inter-iteration dependences below the synchronization point are speculated away. In practice, keeping the unsynchronized region of the iteration large enough to allow for effective parallelization while avoiding excessive misspeculation is difficult, requiring complicated misspeculation cost models [22] and runtime feedback to prune parallelizations that perform poorly [28].

Speculative PMT

Since PMT techniques are not bound by inter-thread communication, speculation in a PMT technique is generally aimed not at reducing inter-thread communication, but at removing dependences that form large pipeline stages. The speculative PMT analogue of DSWP is the Speculative DSWP (SpecDSWP) technique [82].

SpecDSWP uses speculation to break dependences that form large sets of intersecting dependence recurrences, leading to many smaller pipeline stages from which better parallelism can be extracted. Just as in DSWP, SpecDSWP executes stages to achieve pipelined parallelism, using queues to buffer state and tolerate variability. Since SpecDSWP allows intra-iteration speculation, no iteration executes non-speculatively. Instead, each iteration is speculative and is committed when all stages have executed without misspeculation. A Commit Thread ensures that iterations commit in the proper order and manages the recovery process upon misspeculation detection.

(a) List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   cur->data = work(cur);
    C:   if (cur->data == NULL)
           break;
       }

(b) Speculated List Code:

       list *cur = head;
    A: for (; cur != NULL; cur = cur->next) {
    B:   cur->data = work(cur);
    C:   if (cur->data == NULL)
           MISSPECULATE;
       }

[Figure 2.10 (c)-(e): Execution of the code in Figure 2.10a with DSWP; execution of the code in Figure 2.10b with SpecDSWP; and execution of the code in Figure 2.10b with PS-DSWP and SpecDSWP combined.]

Figure 2.10: SpecDSWP Example

As in TLS techniques, iterations in a SpecDSWP parallelization checkpoint state at the beginning of each iteration. Unlike TLS techniques, an iteration does not stay local to a single thread or processor. Instead, as it flows through the stages of the pipeline, it executes on multiple threads, potentially at the same time, on multiple processors. Existing TLS memory systems assume that the iteration executes in a single thread and cannot handle multiple threads participating in a single memory version at the same time. To handle this, SpecDSWP uses a versioned memory system that allows multiple threads to execute inside the same version at the same time [82].

SpecDSWP implements speculation entirely in software, relying on the hardware only for efficient execution of loads and stores in the versioned memory system. Control and silent store speculation are used to break dependences that form large sets of dependence recurrences, with misspeculation detection done in software. Though the initial implementation of SpecDSWP did not use speculation to break other memory dependences, memory value speculation can also be added to the suite of possible speculation types, as it is easily checked in software.

Figure 2.10 shows the result of applying SpecDSWP to the same loop as TLS was applied to in Figure 2.9, without showing the commit thread. Just as in TLS, the loop exit branch creates a large dependence recurrence that prevents SpecDSWP from extracting parallelism. After the branch is speculated, SpecDSWP can extract a 3-stage DSWP partition (Figure 2.10d). The power of SpecDSWP is greatly increased when it is combined with PS-DSWP. Figure 2.10e shows a 2-stage speculative PS-DSWP partition with the second stage replicated twice.

2.3.3 Loop-Sensitive Profiling

Many parallelization techniques use heuristics to guide them to profitable areas of the optimization space. Speculation in particular is highly dependent upon accurate profile information to determine when a control edge is rarely taken, a memory alias rarely manifests, or an operation usually computes the same value regardless of input. Many profiling techniques and infrastructures have been built to extract this information from programs. However, the profiles extracted are often loop-insensitive. This is problematic for loop parallelization techniques that want to speculate dependences relative to a specific loop header. In particular, identifying intra-iteration versus inter-iteration dependences can allow the compiler to more accurately determine how often a dependence will misspeculate [22].
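As an illustration of the distinction (a sketch, not the loop-aware profiler described in Chapter 3), a loop-aware memory profiler can tag every store with the current iteration count of the loop being profiled; when a later load touches the same address, comparing iteration counts classifies the dependence as intra-iteration or inter-iteration. The hashed shadow table and the instrumentation hooks below are assumptions for illustration.

    #include <stdio.h>

    #define SHADOW_SIZE 65536

    /* Iteration in which each (hashed) address was last stored to; 0 if never. */
    static long last_store_iter[SHADOW_SIZE];
    static long cur_iter;                 /* iteration count of the profiled loop */
    static long intra_deps, inter_deps;

    static unsigned hash_addr(const void *p) {
        return (unsigned)((unsigned long)p >> 3) % SHADOW_SIZE;
    }

    void profile_loop_header(void)    { cur_iter++; }                       /* at the loop header */
    void profile_store(const void *p) { last_store_iter[hash_addr(p)] = cur_iter; }

    void profile_load(const void *p) {                                      /* at each load */
        long w = last_store_iter[hash_addr(p)];
        if (w == 0)
            return;                       /* no prior store observed in this loop      */
        if (w == cur_iter)
            intra_deps++;                 /* flow dependence within a single iteration */
        else
            inter_deps++;                 /* loop-carried (inter-iteration) dependence */
    }

    void profile_report(void) {
        printf("intra-iteration: %ld, inter-iteration: %ld\n", intra_deps, inter_deps);
    }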

2.4 Compilation Scope

Even an aggressive parallelization technique is of little use if it cannot parallelize loops that contain large portions of a program’s execution time. In general-purpose programs, these loops often exist at or close to the program’s outermost loop [9]. Consequently, it is important that the parallelization framework be powerful enough to identify and leverage parallelism in loops that contain function calls and whose dependences cross between many levels of the call graph.

In Figure 2.11, interprocedural analysis and optimization are needed to effectively parallelize the loop in compute. If the various parallelization techniques are applied only to the compute function (Figure 2.11b), the compiler cannot deal with the loop-carried dependence from cache_insert to cache_lookup inside work. Therefore, the call to work has inter-iteration dependences with itself that greatly limit the parallelism extractable by either CMT or PMT parallelization techniques.

Most techniques remain intraprocedural; to achieve the larger scope necessary to parallelize loops like those in Figure 2.11, they rely on a pass of compiler inlining to expand a procedure to the point where optimizations can be successful. However, inlining cannot hope to expose all the operations needed to parallelize effectively at the outermost loop levels. First, for a program with recursion, it is impossible to fully inline such that all operations in the loop are visible.

(a) Code Example:

    int work(list_t *data) {
        int result = cache_lookup(data);
        if (result != 0)
            return result;

        if (data == NULL)
            return -1;

        result = real_work(data);
        cache_insert(data, result);
    }

    int compute(list_t *list) {
        int total = 0;
        while (list != NULL) {
            list_t *data = getData(list);
            printData(data);
            int result = work(data);
            if (result == -1)
                break;
            total += result;
            list = list->next;
        }
        return total;
    }

(b) Code Example with Parallelization Optimizations applied (work is unchanged):

    int compute(list_t *list) {
        int total = 0;
        int __total[THREADS];
        while (1) {
            list_t *data = getData(list);
            list = list->next;
            int result = work(data);
            if (result == -1)
                MISSPECULATE;
            printData(data);
            __total[TID] += result;
        }
        total += sum(__total, THREADS);
        return total;
    }

Figure 2.11: Interprocedural Scope Example

Second, even when full inlining is possible, it leads to exponential code growth and significant compile-time concerns [34]. Because of this, inlining can rarely be applied aggressively enough to allow a parallelization technique to avoid synchronizing large portions of a program's runtime.

Whole program optimization [79] techniques can increase the compiler's visibility without requiring the modification of analysis or optimizations. Whole program optimization removes procedure boundaries, merging all code into a single control flow graph (CFG), so the compiler can both see and modify code regardless of its location in the program.

Through region formation, the compiler controls the amount of code to analyze and optimize at any given time, with code from multiple procedures potentially in the same region. This can achieve the effect of partial inlining [74, 87], without the associated code growth. This flexibility comes at a cost, as regions tend to be small, on the order of 500 operations, in order to prevent a compile-time explosion. Additionally, it is not clear how to structure optimizations that cross region boundaries, as any parallelization technique is bound to do for any reasonable region size.

Previous work in parallelizing scientific programs has shown interprocedural analysis to be useful [11, 32, 42] when parallelizing array-based programs. These analyses aim to disambiguate the array accesses that occur in lower-level procedures. The results of this analysis are used only to enhance the parallelism extracted in the current procedure, not to facilitate an interprocedural parallelization. This has proved sufficient for array-based programs because they extract non-speculative DOALL parallelizations, so beyond proving a lack of dependence, there is no work to perform in lower-level procedures.

General-purpose programs, on the other hand, have dependences both in the loop being parallelized and buried deeply in lower-level functions. As such, it is important to have an interprocedural parallelization technique that is able to operate across procedures, synchronizing or speculating dependences across multiple levels of the call tree.

2.5 Parallelization For General-Purpose Programs

This dissertation focuses on building a parallelization framework that can extract scalable parallelism from the outermost loops of applications. Previous work has shown that scalable parallelism exists and generally has the abstract structure shown in Figure 2.12a [9, 10]. In this code sequence the process function takes the majority (> 90%) of the program's runtime. The get_data function contains loop-carried dependences to itself, as does the emit function. Finally, the emit function is dependent upon the result of the process function.

    void *data, *result;
    A: while (data = get_data()) {
    B:     result = process(data);
    C:     emit(data, result);
       }

(a) Code Example

[Execution schedule diagrams omitted.]

(b) Execution of Code in Figure 2.12a with TLS
(c) Execution of Code in Figure 2.12a with PS-DSWP and SpecDSWP

Figure 2.12: Choice of Parallelism Example

Though the process function may contain loop-carried dependences, these can be broken through speculation, reduction expansion, etc. Either a CMT (via TLS) or PMT (via SpecDSWP and PS-DSWP) technique with interprocedural scope could extract parallelism from this type of loop.

In general, the SpecDSWP parallelization will offer better performance for a variety of reasons. First, as mentioned earlier, communication latency is still added to the critical path for the synchronized portion of a TLS parallelization. Additionally, TLS parallelizations generally do not allow for an explicit second synchronized region to avoid variability in one iteration stalling subsequent iterations. However, the commit token used by TLS implementations to ensure speculative state is committed correctly implicitly creates a second synchronized region at the end of every iteration. SpecDSWP can better handle synchronizing dependences that occur anywhere in the iteration. The overhead of the commit thread is relevant for small numbers of threads, but as processors move to 8 or more cores, the overhead becomes smaller. Additionally, the commit thread can be context switched out to further reduce the overhead. By integrating the PS-DSWP and SpecDSWP extensions together, DSWP also gains the ability to extract dynamic parallelism, a primary benefit of a TLS parallelization. Finally, by not putting dependence recurrences across cores, DSWP is able to tolerate long inter-core communication [55].

To handle the scope issues discussed in Section 2.4, the SpecDSWP technique also needs to be able to operate interprocedurally. This can be accomplished through a combination of interprocedural analysis and extensions to the way SpecDSWP represents and transforms the code, for example, by building a program-wide System Dependence Graph rather than a procedure-local Program Dependence Graph to represent dependences.

Chapter 3

The VELOCITY Parallelization Framework

This chapter discusses the components of VELOCITY's aggressive automatic parallelization framework. This framework includes a compiler infrastructure to identify parallelism in sequential codes, hardware support to efficiently execute the parallelized code, interprocedural analysis, interprocedural optimization, and a loop-sensitive profiling infrastructure. All of these components will be discussed in the context of the Interprocedural Speculative Parallel-Stage DSWP (iSpecPS-DSWP) parallelization technique that VELOCITY uses. iSpecPS-DSWP will be formed by combining and extending the existing parallelization techniques Decoupled Software Pipelining (DSWP), Parallel-Stage DSWP (PS-DSWP), and Speculative DSWP (SpecDSWP).

3.1 Compilation and Execution Model

DSWP assumes a set of processors that each executes a single thread at a time. Each processor contains private registers used by executing instructions, and all processors share a memory. In the examples provided, a simple load/store architecture is assumed. A program that has been DSWPed executes sequentially until it reaches the DSWPed loop.

Upon reaching the loop it conceptually spawns the necessary threads, executes in parallel, and upon loop exit, the spawned threads conceptually terminate. In practice, the threads can be prespawned and reused by successive invocations to mitigate the overhead of spawning [55, 61].

For the sake of simplicity it is assumed that for the loop being DSWPed, all loop backedges have been coalesced into a single backedge and preheaders and postexits exist. Additionally, all irreducible inner loops have been cloned to make reducible loops [50]. Finally, all indirect branches and all indirect calls have been expanded to direct, 2-way branches and direct calls.

3.2 Decoupled Software Pipelining

Algorithm 1 DSWP
Require: Loop L, Threads T
1: Opportunity O = DetermineAssignment(L, T)
2: MultiThreadedCodeGeneration(L, O)

The Decoupled Software Pipelining (DSWP) algorithm consists of a thread assignment analysis phase (line 1 of Algorithm 1) and a multi-threaded code generation transformation that realizes a thread assignment (line 2 of Algorithm 1).

3.2.1 Determining a Thread Assignment

Algorithm 2 DetermineAssignment
Require: Loop L, Threads T
1: PDG = BuildPDG(L)
2: DAGSCC = FormSCCs(PDG)
3: ThreadAssignment A = AssignSCCs(DAGSCC, T)
4: Synchronization S = DetermineSynchronization(A, L)
5: return (A, S)

(Legend: Register Dependence, Control Dependence, Memory Dependence)

    list *cur = head;
    while (cur != NULL) {
        list *item = cur->item;
        while (item != NULL) {
            sum += item->data;
            item = item->next;
        }
        cur = cur->next;
    }
    print(sum);

(a) Sequential C Code

    START:  A: mov cur=head
            B: mov sum=0
    BB1:    C: br.eq cur, NULL, END
    BB2:    D: load item=[cur + offset(item)]
    BB3:    E: br.eq item, NULL, BB5
    BB4:    F: load t1=[item + offset(data)]
            G: add sum=sum,t1
            H: load item=[item + offset(next)]
            I: jump BB3
    BB5:    J: load cur=[cur + offset(next)]
            K: goto BB1
    END:    L: call print, sum

(b) Sequential Lowered Code
(c) PDG
(d) DAGSCC

Figure 3.1: DSWP Example Code

Step 1: Build PDG

Since the goal of DSWP is to extract PMT parallelism, the thread assignment algorithm must extract threads that have no backwards dependences. To determine which dependences are forward or backward, DSWP first builds a Program Dependence Graph (PDG) [25], which contains all data and control dependences in the loop L. DSWP currently operates on a low-level IR representation with virtualized registers. Data dependences among these operations can come through registers or memory. Data dependences have three types:

Flow Dependence A dependence from a write to a read of either a register or a memory location. This is the only true dependence; it cannot be broken by renaming.

Anti Dependence A dependence from a read to a write of either a register or a memory location. This is a false dependence that can be broken by making the write use a different name than the read.

Output Dependence A dependence from a write to a write of either a register or a memory location. This is a false dependence that can be broken by making one of the writes use a different name.
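To make the three categories concrete, the following small C snippet (an illustrative example, not taken from the benchmarks studied in this dissertation) labels the dependences its statements create on the variable x.

    /* Illustrative only: the comments name the dependence each pair of
     * statements creates on x.                                          */
    int example(int a, int b) {
        int x, y, z;
        x = a + 1;      /* S1: write x                                    */
        y = x * 2;      /* S2: read x,  flow dependence S1 -> S2          */
        x = b + 3;      /* S3: write x, anti dependence S2 -> S3          */
        x = y + 4;      /* S4: write x, output dependence S3 -> S4        */
        z = x;          /* S5: read x,  flow dependence S4 -> S5          */
        return y + z;
    }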

The PDG used for thread partitioning contains all memory flow, anti, and output dependences, control dependences, and register flow dependences. Both control and register dependences can be computed using standard data-flow analysis algorithms [50]. Only flow dependences among registers are inserted into the PDG. For any two operations related by a false register dependence, if they are in different threads, then they will use different register files and the dependence is not violated. If they are in the same thread, the code generation algorithm will maintain the original ordering, respecting the dependence.

Memory dependences are generally harder to determine than register dependences. Unlike register dependences, memory operations can potentially reference many locations, making it hard for an analysis to obtain accurate results. For array-based programs, memory analysis focuses on disambiguating bounds of array references. For general-purpose programs that manipulate recursive data structures, more complicated analysis can disambiguate most non-conflicting memory accesses, even in type-unsafe languages like C, though this remains an area of active research [14, 31, 53].

Figure 3.1c shows the PDG formed for the sequential code in Figure 3.1b. Green arrows represent register data dependences and are annotated with the register that the dependence is on. Red arrows represent control dependences. Variables live into and out of the loop being parallelized are also shown, though they will not take part in further steps unless explicitly noted.

Step 2: Form SCCs

To keep dependence recurrences thread-local, DSWP relies on the formation of the DAGSCC from the PDG. The DAGSCC contains the strongly connected components (SCC) [76] of the PDG (i.e. the dependence recurrences) and has no cycles in it. The shaded boxes in Figure 3.1c illustrate the three strongly connected components, while Figure 3.1d illustrates the DAGSCC itself, with nodes collapsed and represented by their containing SCC.

Step 3: Assign SCCs

Given sufficient threads, each SCC in the DAGSCC can be assigned its own thread and the step is finished. However, the common case is to have more SCCs than threads. A thread partitioner is used to find the allocation of SCCs to threads that maximizes parallelism by creating equal weight threads, without creating backwards dependences.

Unfortunately, this problem is NP-hard even for simple processors with scalar cores [55]. Heuristic partitioners have been proposed and used in practice to solve this problem [55, 61]. These partitioners attempt to merge SCCs until there are as many SCCs as threads. Two nodes N1 and N2 in the DAGSCC can be merged so long as there is no node N3, distinct from both N1 and N2, such that N1 →* N3 and N3 →* N2, or N2 →* N3 and N3 →* N1. This avoids creating cycles in the DAGSCC, which would create CMT instead of PMT parallelism.

[Three-column code listing omitted.]

(a) Sequential Code with Thread Assignment in parentheses, (b) Thread 1, (c) Thread 2

Figure 3.2: Communication Synchronization Example

DSWP uses a Load-Balance partitioner that attempts to create equal-weight threads by assigning operations to threads while minimizing the overhead of synchronization [54, 55]. In particular, because loads are often on the critical path, they are assigned a greater weight than other operations.
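The merge-legality condition above can be read as a reachability query on the DAGSCC. The sketch below is an illustration only; the adjacency-matrix DagScc type, MAX_NODES bound, and function names are assumptions of this sketch, not VELOCITY's data structures.

    #include <stdbool.h>

    #define MAX_NODES 64

    /* Assumed adjacency-matrix representation of the DAGSCC. */
    typedef struct {
        int  n;                            /* number of SCC nodes           */
        bool edge[MAX_NODES][MAX_NODES];   /* edge[u][v]: dependence u -> v */
    } DagScc;

    /* Is dst reachable from src (including src == dst)?  Simple DFS. */
    static bool reachable(const DagScc *g, int src, int dst, bool *seen) {
        if (src == dst) return true;
        seen[src] = true;
        for (int v = 0; v < g->n; v++)
            if (g->edge[src][v] && !seen[v] && reachable(g, v, dst, seen))
                return true;
        return false;
    }

    /* True if some path a ->* n3 ->* b exists with n3 distinct from a and b,
     * i.e. a reaches b through at least one intermediate node.             */
    static bool reaches_via_intermediate(const DagScc *g, int a, int b) {
        for (int s = 0; s < g->n; s++) {
            if (!g->edge[a][s] || s == b) continue;   /* s is a candidate n3 */
            bool seen[MAX_NODES] = { false };
            if (reachable(g, s, b, seen)) return true;
        }
        return false;
    }

    /* Merging n1 and n2 is legal only if neither reaches the other through an
     * intermediate node; otherwise the merge would create a cycle.          */
    bool merge_is_legal(const DagScc *g, int n1, int n2) {
        return !reaches_via_intermediate(g, n1, n2) &&
               !reaches_via_intermediate(g, n2, n1);
    }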

Step 4: Determine Synchronization

For a given thread partitioning, there is a set of dependences that must be synchronized to ensure correct execution. Synchronization is implemented through cross-thread communication, and the two terms will be used interchangeably. Synchronization can be broken down into two steps: determining what to synchronize and deciding where to synchronize it. For a given dependence S −dep→ D with S in thread TS and D in thread TD, if TS ≠ TD, then the dependence must be communicated from TS to TD. For data dependences, it is sufficient to communicate the data value itself. For control dependences, DSWP assumes the branch will be replicated in TD, meaning that only the branch's operands must be synchronized as cross-thread data dependences.

(a) Example Code:            (b) Naive Synchronization:    (c) Optimized Synchronization:

    (1) move r1 = 0              (1) move r1 = 0               (1) move r1 = 0
    L1:                          L1:                           L1:
    (1) add r1 = r1, r3          (1) add r1 = r1, r3           (1) add r1 = r1, r3
    (1) br r1 < MAX1, L1         synchronize r1, T2            (1) br r1 < MAX1, L1
    (1) r2 = 0                   (1) br r1 < MAX1, L1          synchronize r1, T1
    L2:                          (1) r2 = 0                    (1) r2 = 0
    (2) add r2 = r2, r1          L2:                           L2:
    (2) br r2 < MAX2, L2         (2) add r2 = r2, r1           (2) add r2 = r2, r1
                                 (2) br r2 < MAX2, L2          (2) br r2 < MAX2, L2

Figure 3.3: Communication Optimization Example

Alternatively, the source thread could communicate the direction of the branch along each path, which also introduces a cross-thread data dependence on the direction's data value.

However, this alone will not ensure proper control flow in TD because indirect control dependences will not have been communicated. Figure 3.2 shows a simple 2-thread partitioning where all operations except the add assigned to T2 have been assigned to T1. If indirect control dependences are not synchronized, only the direct control dependence on branch C will have been communicated. Branches A and B will not exist and D will execute even when A and B jump to L5. The solution to this problem is to consider all indirect control dependences to be cross-thread control dependences. Since the branch's operands can be synchronized in the same condition of execution as the branch itself, their control dependences, both direct and indirect, do not need to be synchronized.

Once the set of dependences that must be synchronized is constructed, the algorithm must determine where to place the synchronization. The naive approach to determining the position at which to synchronize dependences does so at the source of the dependence.

Since TS has all branches needed to compute the condition of execution for S, in the worst case it can communicate all branches related to the condition of execution for S to TD. Thus, no backwards dependences can be created by this placement of synchronization [55].

While this placement is sufficient to ensure correctness, it can easily lead to excessive communication, particularly in the case of memory dependences [54]. A better synchronization placement algorithm can analyze the many paths between the source and destination to determine a more optimal placement. The communication optimization framework proposed by Ottoni et al [54] optimizes the placement of each type of dependence for each pair of threads, while ensuring that no new backwards dependences are created.

Register Dependence For each register r, all cross-thread register dependences are optimized as a single unit. An effective solution can be found by modeling the problem as max-flow/min-cut [17], where the nodes are control flow nodes and the edges are control flow edges. The graph is the control flow subgraph that includes all nodes along paths from all definitions to all uses. If an edge is a legal placement for synchronization, then its weight is set to the profile weight of the control edge; otherwise it is set to ∞. In particular, edges that would result in backwards communication to facilitate new conditions of execution are illegal. The min-cut of the graph produces the points at which to synchronize r.

Memory Dependence Memory dependences are optimized similarly to register dependences. However, since memory dependences only communicate a token whose value does not matter, memory dependences among unrelated groups of instructions can be optimized at the same time. Because disjoint sets of instructions can be optimized at the same time, the max-flow/min-cut problem for memory dependences has multiple sources and multiple sinks. While the multi-source, multi-sink max-flow/min-cut problem, in which each source must be separated from its corresponding sink, is NP-hard [27], heuristics exist that can achieve effective solutions [54].

Control Dependence Control dependences are not optimized by the framework. However, the arguments of the branch that is the source of the control dependence can be optimized as cross-thread register or memory dependences.

Figure 3.3 shows how optimizing synchronization placement can improve performance. The register r1 must be communicated from T1 to T2; however, there are two places at which to communicate the variable. The naive placement inserts the synchronization at the source of the dependence, inside the L1 loop, forcing the second thread to receive the value every iteration of the L1 loop. A more optimal placement for the synchronization of r1 occurs just before the L2 loop so that the value only needs to be synchronized once.

3.2.2 Realizing a Thread Assignment

Algorithm 3 MultiThreadedCodeGeneration
Require: Loop L, (ThreadAssignment A, Synchronization S)
1: CreateThreadBlocks(L, A, S)
2: MoveInstructions(A)
3: InsertSynchronization(S)
4: CreateControlFlow(L, A)

DSWP leverages the Multi-Threaded Code Generation (MTCG) algorithm proposed by Ottoni et al to generate correct multi-threaded code [55]. The code example from Figure 3.1 will be used to provide examples throughout this subsection, assuming a thread assignment of {CDJ, EFGH} and an unoptimized synchronization placement.

Step 1: Create Thread Blocks

The first step of the MTCG algorithm is to create a new CFG, CFGi, for each thread Ti ∈ A. The set of blocks to create in each new CFG is defined by the set of relevant basic blocks from the original code. For each basic block B relevant to Ti, a new basic block B′ is created in CFGi. A Relevant Basic Block for Ti is any block in the original CFG that contains either an instruction assigned to Ti or the synchronization point for a cross-thread dependence that begins or ends in Ti. This ensures that any operation or synchronization in Ti has a place to reside. In addition to the relevant basic blocks, special START and END blocks are inserted at the loop's preheader and postexits respectively.

Step 2: Moving Instructions

START’: START: A: mov cur=head; BB1’: B: mov sum=0 C’: br.eq cur, NULL, END

BB1: BB2’: C: br.eq cur, NULL, END BB3’: BB2: E: br.eq item, NULL, BB5 D: load item=[cur + offset(item)] BB4’: BB5: F: load t1=[item + offset(data)] J: load cur=[cur + offset(next)] G: add sum=sum,t1 K: goto BB1 H: load item=[item + offset(next)] I: jump BB3 END: L: call print, sum END’:

(a) Thread 1 (b) Thread 2

Figure 3.4: Multi-Threaded Code after moving operations

Once the basic blocks for each thread have been created, the instructions assigned to that thread can be moved into the thread. For a specific basic block BB from the original thread, all operations in BB assigned to thread Ti are inserted into the newly created basic block BB′ in Ti that represents BB. The order among the operations is maintained, in order to preserve the intra-thread dependences that may exist. Figure 3.4 shows the results of moving operations to the basic blocks created for them.

Step 3: Insert Synchronization

Queue Instructions

Instruction   Arguments   Description
produce       [Q] = V     Pushes value V onto the tail of queue Q if it is not full. If the queue is full, the instruction stalls until room is available.
consume       R = [Q]     Pops a value off of the head of queue Q if there are values in the queue and places the value in register R. If the queue is empty, the instruction stalls until one is available.

Table 3.1: DSWP ISA Extensions
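The blocking semantics of produce and consume can be approximated in software by a single-producer/single-consumer ring buffer. The sketch below is an illustration of those semantics only, assuming one producer and one consumer per queue and C11 atomics; the queue_t type, QUEUE_DEPTH, and the spin-wait are assumptions of this sketch, not the hardware queues DSWP relies on.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    #define QUEUE_DEPTH 32   /* assumed queue depth for this sketch */

    /* Single-producer/single-consumer ring buffer mimicking produce/consume. */
    typedef struct {
        uint64_t        buf[QUEUE_DEPTH];
        _Atomic size_t  head;   /* next slot to consume */
        _Atomic size_t  tail;   /* next slot to fill    */
    } queue_t;

    /* produce [Q] = V : spin while the queue is full, then append V. */
    void produce(queue_t *q, uint64_t v) {
        size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        while (tail - atomic_load_explicit(&q->head, memory_order_acquire)
                   == QUEUE_DEPTH)
            sched_yield();                     /* "stall" until room exists   */
        q->buf[tail % QUEUE_DEPTH] = v;
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    }

    /* consume R = [Q] : spin while the queue is empty, then pop one value. */
    uint64_t consume(queue_t *q) {
        size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        while (atomic_load_explicit(&q->tail, memory_order_acquire) == head)
            sched_yield();                     /* "stall" until a value exists */
        uint64_t v = q->buf[head % QUEUE_DEPTH];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return v;
    }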

The MTCG algorithm then inserts synchronization between threads to ensure correct execution.

42 START: START’: A: mov cur=head; consume sum=[1] B: mov sum=0 BB1’: produce [1]=sum consume cur=[2] C’: br.eq cur, NULL, END BB1: produce [2]=cur BB2’: C: br.eq cur, NULL, END consume item=[3]

BB2: BB3’: D: load item=[cur + offset(item)] E: br.eq item, NULL, BB5 produce [3]=item BB4’: BB5: F: load t1=[item + offset(data)] J: load cur=[cur + offset(next)] G: add sum=sum,t1 K: goto BB1 H: load item=[item + offset(next)] I: jump BB3 END: consume sum=[4] END’: L: call print, sum produce [4]=sum

(a) Thread 1 (b) Thread 2

MTCG assumes the existence of a set of queues that can be accessed via produce and consume instructions. Table 3.1 gives the semantics of these instructions. The queues used by these instructions are orthogonal to the memory and register systems. Synchronization is achieved by producing a value in the source thread at the synchronization point and consuming it in the destination thread, also at the synchronization point. Each synchronization uses a separate queue, though later queue allocation phases can reduce the number needed. There are three types of dependences that must be synchronized:

Register Dependence The register is pushed onto the queue by the producer thread and popped off the queue by the consumer thread.

Control Dependence The registers that make up the condition of the branch are communicated as register dependences, while the branch is replicated on the consumer side.

Memory Dependence Since MTCG assumes a shared-memory model, the dependence must be synchronized via write and read barriers that enforce a global memory ordering. A produce/consume pair is inserted where the produce is a global write barrier and the consume is a global read barrier. A token value is sent by the produce to indicate that the write barrier has occurred.

In addition to communicating values during execution, the live-ins and live-outs must be communicated when moving from sequential to parallel execution and vice-versa. For registers live into the loop, before each thread enters its copy of the DSWPed loop, the main thread sends it the register live-ins that it needs to execute. On loop exit, the thread then sends back the live-outs of its loop to the main thread. If there are multiple threads that potentially define the same live-out register, extra synchronization is inserted to ensure that one of the threads always holds the correct value upon loop exit [55]. Memory live-ins do not need to be communicated specially, as the thread spawn that conceptually starts the parallel region is assumed to be a write barrier in the spawning thread and a read barrier in the spawned thread. Memory live-outs do require that a memory synchronization occur in any thread that contains store instructions, ensuring that code after the loop sees the latest computed values.

Figure 3.5 shows the results of inserting communication for the thread assignment {CDJ, EFGH} assuming naive synchronization points. On entry to the loop, T1 communicates the live-in sum to T2. On every iteration of the outer loop, T1 communicates cur to satisfy a cross-thread control dependence, and item to satisfy a cross-thread register dependence. Finally, on loop exit, T2 sends the live-out variable sum back to T1.

Step 4: Create Control Flow

Finally, with all the appropriate code for each thread inserted, it remains only to insert control flow into each thread's CFG. This consists of retargeting both conditional branches assigned or copied into the thread and inserting unconditional control flow at the end of basic blocks. All control flow must properly compensate for the fact that not all basic blocks were copied into a thread. Thus, the mapping from a target in the original code to a target in each thread is not direct.

44 START’: START: consume sum=[1] A: mov cur=head; B: mov sum=0 BB1’: produce [1]=sum consume cur=[2] C’: br.eq cur, NULL, END’ BB1: produce [2]=cur BB2’: C: br.eq cur, NULL, END consume item=[3]

BB2: BB3’: D: load item=[cur + offset(item)] E: br.eq item, NULL, BB1’ produce [3]=item BB4’: BB5: F: load t1=[item + offset(data)] J: load cur=[cur + offset(next)] G: add sum=sum,t1 K: goto BB1 H: load item=[item + offset(next)] I: jump BB3’ END: consume sum=[4] END’: L: call print, sum produce [4]=sum

(a) Thread 1 (b) Thread 2

Figure 3.6: Multi-Threaded Code after redirecting branches


MTCG ensures that the control dependences among basic blocks in the new thread, Ti, mimic those of the original thread, T0. As control dependence is defined by post-dominance, it is sufficient to redirect each branch's target to the closest post-dominator in T0's post-dominance tree for which a block in Ti exists. This post-dominator is guaranteed to exist, as in the worst case it is the END block that exits the loop. Ottoni et al [55] have shown that this transformation ensures that control dependences among basic blocks in each thread match the control dependences among the corresponding blocks in the original thread.
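The redirection itself amounts to a walk up the original post-dominator tree. The following sketch assumes a hypothetical ThreadView structure holding the immediate post-dominators of T0 and a relevance bitmap for Ti; it illustrates the walk, not VELOCITY's implementation.

    #include <stdbool.h>

    #define MAX_BLOCKS 256

    /* Assumed per-thread view of the original CFG's post-dominator tree. */
    typedef struct {
        int  ipdom[MAX_BLOCKS];     /* immediate post-dominator in T0; -1 at root */
        bool relevant[MAX_BLOCKS];  /* does thread Ti have a copy of this block?  */
    } ThreadView;

    /* Redirect a branch target: walk up T0's post-dominator tree from the
     * original target until a block that exists in Ti is found.  The walk
     * terminates because END is relevant to every thread.                  */
    int redirect_target(const ThreadView *ti, int original_target) {
        int b = original_target;
        while (b != -1 && !ti->relevant[b])
            b = ti->ipdom[b];
        return b;   /* analogous block of the closest relevant post-dominator */
    }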

Figure 3.6 illustrates the results of redirection. No redirection was needed in T1. In T2, the C′ and I branches have targets that directly map to basic blocks in T2. However, the E branch has target BB5, which does not exist in T2, so it is redirected to its closest post-dominator BB1, which has an analogous basic block in T2.

3.3 Parallel-Stage DSWP

As discussed in Section 2.2.3, DSWP is limited in its ability to extract DOALL-like, data-level parallelism. The Parallel-Stage extension to DSWP removes this limitation by allowing pipeline stages with no inter-iteration dependences to be replicated an arbitrary number of times [61]. This section discusses the alterations to the DSWP algorithm necessary to achieve this end.

3.3.1 Determining a Thread Assignment

Parallel-Stage DSWP (PS-DSWP) changes a DSWP thread assignment to include not just a set of stages and the synchronization amongst those stages, but also a replication count for each stage.

Algorithm 4 Partition
Require: Loop L, Threads T
1: PDG = BuildPDG(L)
2: DAGSCC = FormSCCs(PDG)
3: ThreadAssignment A, Replications R = AssignSCCs(DAGSCC, T)
4: Synchronization S = DetermineSynchronization(A, R, L)
5: return (A, R, S)

Step 1: Build PDG

PS-DSWP changes the PDG built by the base DSWP algorithm to include information about the loop-carriedness of each dependence arc. Specifically, each arc is identified as only intra-iteration (INTRA), only inter-iteration (INTER), or either (EITHER). Note that EITHER indicates that a dependence may be both INTRA and INTER, not that it must be both. All dependences implicitly start as EITHER and are analyzed to refine their type into either INTRA or INTER. Prior implementations of PS-DSWP distinguished only INTER from EITHER [61], but the utility of INTRA will become apparent once PS-DSWP is extended with speculation.

(Legend: Register Dep., Control Dep., Memory Dep.; Intra-Iteration Dep., Inter-Iteration Dep., Intra/Inter-Iteration Dep.)

    list *cur = head;
    while (cur != NULL) {
        list *item = cur->item;
        while (item != NULL) {
            sum += item->data;
            item = item->next;
        }
        cur = cur->next;
    }
    print(sum);

(a) Sequential C Code

    START:  A: mov cur=head
            B: mov sum=0
    BB1:    C: br.eq cur, NULL, END
    BB2:    D: load item=[cur + offset(item)]
    BB3:    E: br.eq item, NULL, BB5
    BB4:    F: load t1=[item + offset(data)]
            G: add sum=sum,t1
            H: load item=[item + offset(next)]
            I: jump BB3
    BB5:    J: load cur=[cur + offset(next)]
            K: goto BB1
    END:    L: call print, sum

(b) Sequential Lowered Code
(c) PDG
(d) DAGSCC

Figure 3.7: PS-DSWP Example Code

For each dependence, a graph reachability analysis determines whether the dependence must be INTER only. For a dependence S −dep→ D, if, starting at S, D can only be reached using control flow by traversing the loop backedge, then the dependence must be INTER. Otherwise, more sophisticated analyses are used on a case-by-case basis as described below:

Register Dependence For a register dependence S −r→ D that writes r at S and reads it at D, loop-carriedness is calculated by computing whether the dependence is carried by the loop backedge. Specifically, if the definition of r at S does not reach the loop header or there is not an upwards-exposed use of r at D at the loop header, then the dependence is INTRA. Note that these conditions only prove that the dependence is not loop-carried, and a failure to satisfy them does not prove that the dependence is not also intra-iteration.

Control Dependence For control dependences, the graph reachability analysis used to prove INTER is also sufficient to prove INTRA. That is, if for S −cd→ D, D can only be reached using control flow from S by traversing the loop backedge, then the dependence must be INTER. Otherwise it is INTRA.

Memory Dependence Due to the issues involved in determining strong updates for heap and function argument pointers in C-like languages, it is generally not possible to determine if a store MUST write or a load MUST read a particular memory location. As such, it is all but impossible to determine the set of reaching memory definitions or upwards exposed memory uses. Thus, all memory dependences are considered EITHER unless proven INTER by the graph reachability analysis.
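As a concrete illustration of the register case, consider the hypothetical loop below (not taken from the benchmarks in this dissertation). The dependence on t is classified INTRA, while the dependence on sum is carried by the backedge and therefore INTER.

    /* Illustrative only.  The use of t is preceded by a definition of t in the
     * same iteration, so there is no upwards-exposed use of t at the loop
     * header: the register dependence on t is INTRA.  The use of sum at the
     * add reads the value written in the previous iteration, so the sum
     * dependence is carried by the backedge and is INTER (it is the kind of
     * accumulator that step 2 later marks DOALL).                            */
    int example(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int t = a[i] * 2;   /* definition of t          */
            sum += t;           /* use of t; def/use of sum */
        }
        return sum;
    }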

Reusing the example code from Figure 3.1, Figure 3.7c shows the PDG that would be built by PS-DSWP. INTRA edges are denoted by solid lines, while INTER edges are represented by dashed lines; no EITHER edges exist in this PDG, but they are represented in later examples as dotted lines.

Step 2: Form SCCs

The DAGSCC is formed from the PDG just as in the base DSWP algorithm. However, each SCC is marked as either SEQUENTIAL if a loop-carried edge exists among the nodes in the SCC or DOALL if there is no such edge. Additionally, an SCC that is a set of operations that create a min/max reduction or accumulator (addition or subtraction) is also marked as DOALL. The SCC G in Figure 3.7c is an accumulator, thus its loop-carried dependence does not cause it to be marked SEQUENTIAL. The code generation phase will create thread-local reductions and insert the appropriate cleanup code to calculate the global reduction. Though not discussed in this dissertation, PS-DSWP can also expand induction variables [61] when optimizing array-based loops.

Step 3: Assign SCCs

The thread assignment process is augmented to provide a replication count for each stage. Note that all DSWP thread assignments are implicitly PS-DSWP thread assignments with a replication count of 1 for each stage. Because a DOALL node replicated N times has its weight effectively reduced by a factor of N, a new partitioner is needed that understands the effects of replication on performance.

The simplest such partitioner, the Single Parallel-Stage partitioner, attempts to extract a single, large DOALL stage, potentially surrounded by a SEQUENTIAL stage on either side [61]. It does so by aggressively merging DOALL nodes together until no further legal merging is possible. Higher-weight DOALL nodes are merged together first. The largest DOALL node formed by this merging is retained, while the other DOALL nodes are marked SEQUENTIAL. The algorithm then aggressively merges SEQUENTIAL nodes until 2 are left, one before the DOALL stage and one after. The DOALL stage is assigned a replication factor of T − 2, while each of the SEQUENTIAL stages is assigned a replication factor of 1. The Single DOALL partitioner is simple, quick, and can often extract good partitions when there are large blocks of code that can execute in parallel.

However, when no DOALL stage exists, the single parallel-stage partitioner can produce worse partitionings than the partitioners used by DSWP. To avoid this problem, the Iterative DOALL partitioner attempts to balance the IMT parallelism of DOALL stages with the PMT parallelism of SEQUENTIAL stages. This partitioner aggressively merges DOALL SCCs just as in the Single DOALL partitioner. However, after the largest DOALL SCC nodes are formed, the partitioner explores the space of possible thread assignments, potentially extracting a partitioning with several SEQUENTIAL stages, depending on whether more IMT or PMT parallelism is available.

Step 4: Determine Synchronization

This step is unchanged.

3.3.2 Realizing a Thread Assignment

Algorithm 5 MultiThreadedCodeGeneration
Require: Loop L, (ThreadAssignment A, Replications R, Synchronization S)
1: CreateThreadBlocks(L, A, S)
2: MoveInstructions(A)
3: InsertSynchronization(S, R)
4: CreateControlFlow(L, A, R)

This subsection describes the changes to the MTCG algorithm needed to handle a PS-DSWP thread assignment. The current PS-DSWP code generation algorithm assumes that all DOALL stages have the same replication factor R and that at least one SEQUENTIAL stage exists before and after each DOALL stage. Throughout this subsection, RS denotes the replication factor of stage S, either 1 for SEQUENTIAL stages or R for DOALL stages, and rT denotes the replication number of thread T. The code example from Figure 3.7 will be used to provide examples throughout this subsection, assuming a thread assignment of {(CDJ, 1), (EFGH, 2)} and naive synchronization.

50 Step 1: Create Thread Blocks

Basic block creation is essentially unchanged from the base DSWP algorithm. However, for a DOALL stage, the loop header is required to be relevant for each new CFG so that PS-DSWP can insert operations that must execute each iteration into it.

Step 2: Move Instructions

For SEQUENTIAL stages, operations are moved to the threads that they were assigned to, just as in DSWP. For DOALL stages, to avoid code bloat, each thread that is assigned to the stage will use the same code. To allow the code to distinguish which thread is executing and perform thread-specific actions, each spawned thread takes as an argument a unique replication number in the range [0, RS) for a stage replicated RS times. SEQUENTIAL stages have RS = 1 and rT = 0. Any reduction expansion that occurs in a DOALL stage is performed at this point, creating thread-local copies. For register variables, no change is needed to the iteration's code.

For memory variables, an array of size RS is created and each thread uses its replication number rT to index into a separate element of the array. Buffers are placed between each element to avoid false sharing effects in the cache.
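A sketch of the memory-variable case is given below. It assumes a 64-byte cache line and a fixed replication factor, and mirrors the __total array from Figure 2.11b; the padded_acc_t type and the constants are assumptions of this illustration.

    #include <stdint.h>

    #define R_S        4    /* assumed replication factor of the DOALL stage */
    #define CACHE_LINE 64   /* assumed cache-line size in bytes              */

    /* One accumulator slot per replicated thread, padded out to a full cache
     * line so threads updating adjacent slots do not falsely share a line.  */
    typedef struct {
        int64_t value;
        char    pad[CACHE_LINE - sizeof(int64_t)];
    } padded_acc_t;

    static padded_acc_t __total[R_S];

    /* Each replicated thread accumulates into its own slot, indexed by its
     * replication number rT.                                                */
    void accumulate(int rT, int64_t result) {
        __total[rT].value += result;
    }

    /* After the loop, sequential cleanup code sums the thread-local copies. */
    int64_t reduce_total(void) {
        int64_t total = 0;
        for (int i = 0; i < R_S; i++)
            total += __total[i].value;
        return total;
    }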

Step 3: Inserting Communication

Queue Instructions

Instruction   Arguments   Description
set.qbase     V           Sets the queue base Base for a thread to the given value V.
produce       [Q] = V     Pushes value V onto the tail of queue Q + Base if it is not full. If the queue is full, the instruction stalls until room is available.
consume       R = [Q]     Pops a value off of the head of queue Q + Base if there are values in the queue and places the value in register R. If the queue is empty, the instruction stalls until one is available.

Table 3.2: PS-DSWP ISA Extensions

For each iteration, dependences are synchronized among threads in the same way as in the base DSWP algorithm. However, to enable each thread in a DOALL stage to execute using the same code, the queues used to achieve this synchronization must be different per iteration.

(a) Thread 1:

    START:  A: mov cur=head
            B: mov sum=0
            produce [1]=0
            produce [6]=0
            mov iter=rT
    BB1:    mul qoff=iter,5
            add iter=iter,1
            mod iter=iter,2
            set.qbase qoff
            produce [3]=cur
            C: br.eq cur, NULL, END
    BB2:    D: load item=[cur + offset(item)]
            produce [4]=item
    BB5:    J: load cur=[cur + offset(next)]
            K: goto BB1
    END:    consume sum1=[5]
            consume sum2=[10]
            add sum=sum1,sum2
            L: call print, sum

(b) Thread 2:

    START': mul qoff=5,rT
            set.qbase qoff
            consume sum=[1]
    BB1':   consume cur=[3]
            C': br.eq cur, NULL, END
    BB2':   consume item=[4]
    BB3':   E: br.eq item, NULL, BB1'
    BB4':   F: load t1=[item + offset(data)]
            G: add sum=sum,t1
            H: load item=[item + offset(next)]
            I: jump BB3
    END':   produce [5]=sum

Figure 3.8: Multi-Threaded Code after communication insertion

PS-DSWP conceptually creates a new sequence of queues per iteration to allow the threads that execute the iteration to synchronize. Each sequence of queues is referred to as a queue set. All synchronization instructions in each stage refer to queue numbers relative to the Base of a queue set, allowing PS-DSWP to change queue sets without changing the synchronization instructions.

In an actual implementation, only a finite number of queue sets will exist, so they must be reused. However, the parallelization must ensure that at no time is the same queue set simultaneously being used for synchronization by two different iterations. Each pair of stages uses a non-overlapping portion of an iteration's queue set to ensure that each pair of stages never violates this principle. Thus, PS-DSWP must ensure that stages with two or more threads are not using the same queue set to synchronize two different iterations.

Since PS-DSWP allows only two replication factors to exist, 1 and R, a static allocation of iterations to threads can be created with R queue sets, assigning each iteration I to queue set I mod M. Each stage then executes the iteration on thread ((I mod M) mod RS) [61]. If RS = M, then queue set (I mod M) is always executed on the same thread, since (I mod M) mod M = (I mod M). If RS = 1, every queue set is also executed on the same thread.
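The arithmetic can be made concrete with a small sketch. It assumes M = 2 queue sets, five queue numbers reserved per set (mirroring the "mul qoff=iter,5" in Figure 3.8), and the replication factors of that example; these constants are assumptions of the illustration.

    #include <stdio.h>

    #define M               2   /* number of queue sets (assumed = R)        */
    #define QUEUES_PER_SET  5   /* queue numbers reserved per iteration      */

    int queue_set(int iter)                 { return iter % M; }
    int queue_base(int iter)                { return queue_set(iter) * QUEUES_PER_SET; }
    int executing_thread(int iter, int R_S) { return queue_set(iter) % R_S; }

    int main(void) {
        /* Iterations alternate between queue sets 0 and 1, and hence between
         * the two replicated threads of the DOALL stage.                    */
        for (int i = 0; i < 6; i++)
            printf("iter %d: queue set %d (base %d), DOALL thread %d\n",
                   i, queue_set(i), queue_base(i), executing_thread(i, 2));
        return 0;
    }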

Figure 3.8 shows the implementation of a static allocation of iterations for the partitioning {(CDJ, 1), (EFGH, 2)}. Each thread computes the next iteration assigned to it in the iter variable and then uses that variable to compute the proper queue set. Note that in the static allocation scheme, DOALL stages do not need to update their queue set each iteration, as they will always execute using the same one.

A dynamic allocation can also be achieved where threads of a DOALL stage dynamically execute iterations instead of having them assigned a priori. In this situation, each thread in a DOALL stage must receive the iteration it will execute, and use it to calculate the value of (I mod M). To ensure that the same queue set is not reused by two different threads in two different iterations, a queue set must be allocated for each potentially outstanding iteration in the pipeline. The total number of potentially outstanding iterations is bounded by the number of iterations that can be buffered between stages, calculated as queue depth × R × |stages|.

The implementation of a dynamic assignment of iterations requires a change to the DOALL stage, as a method is needed for a DOALL stage to acquire the next iteration to work on. Since the synchronization mechanism in DSWP assumes a single producer and a single consumer, it is insufficient for multiple threads to consume from the same queue. For this, PS-DSWP uses a lock to ensure that only one thread at a time consumes from the queue that gives the iteration. All other synchronization in the stage is unchanged. To amortize the overhead of the lock and reduce contention, a thread can consume several iterations for each lock acquisition.
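The locked, batched acquisition of iterations might look like the sketch below. The BATCH constant, the consume_iteration stand-in, and the claim_iterations helper are all assumptions of this illustration, not part of the PS-DSWP code generator.

    #include <pthread.h>

    #define BATCH 4   /* iterations claimed per lock acquisition (assumed) */

    /* Stand-in for "consume the next iteration number from the queue written
     * by the previous SEQUENTIAL stage"; illustrative only.                 */
    static long next_iteration = 0;
    static long consume_iteration(void) { return next_iteration++; }

    static pthread_mutex_t iter_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Claim a batch of iterations under the lock; all other synchronization
     * in the DOALL stage is unchanged.  Returns the number claimed.  A real
     * implementation would also stop at a loop-exit sentinel.               */
    int claim_iterations(long *iters) {
        pthread_mutex_lock(&iter_lock);
        int n = 0;
        while (n < BATCH)
            iters[n++] = consume_iteration();
        pthread_mutex_unlock(&iter_lock);
        return n;
    }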

Beyond ensuring that iterations operate in the appropriate queue set, changes are needed to live-in and live-out synchronization. For SEQUENTIAL stages, this synchronization is unchanged from the base DSWP algorithm.

Live-ins to a DOALL stage defined both outside the loop and in a prior SEQUENTIAL stage must be communicated at the loop header of the SEQUENTIAL stage every iteration. Consider a definition of the register that dynamically occurred in iteration I. The DOALL thread executing I would have the appropriate synchronization and execute correctly. However, other threads executing later iterations would not receive the value and would execute incorrectly. By sending the value every iteration, all threads in the DOALL stage are assured of having the proper value.

Live-outs from a DOALL stage also require that additional information be communicated to the main thread. In particular, to allow the main thread to determine the last writer of the live-out, a last-written iteration timestamp is also sent back to the main thread, which then compares the timestamps to determine the last thread to write the register. Additionally, min/max reductions communicate the thread-local min/max and the iteration timestamp of the last update so that the main thread can determine the appropriate value. Finally, accumulator-expanded variables similarly communicate their thread-local value, which the main thread accumulates into the proper value.

Step 4: Create Control Flow

Branches for PS-DSWP are redirected exactly as in the base DSWP algorithm. However, DOALL stages require that an extra loop exit be introduced to ensure proper termination. As only a single thread assigned to a DOALL stage is executing an iteration, if that iteration exits, the other threads assigned to the DOALL stage will not know to exit. The solution is to introduce a loop exit branch in the loop header before any other operations. The previous SEQUENTIAL stage produces the value TRUE to that queue each iteration to indicate that it can proceed to execute the iteration. On loop exit, the exiting thread for the DOALL stage sends the value FALSE to every other thread assigned to the DOALL stage, causing them to also exit and send their live-outs back to the main thread. Figure 3.9 illustrates the final code produced after branches have been redirected and the extra DOALL stage exit inserted.

START’: mul qoff=5,rT set.qbase qoff consume sum=[1] START: A: mov cur=head; BB1’: B: mov sum=0 consume exec=[2] produce [1]=0 br.eq exec, FALSE, END2 produce [6]=0 mov iter=rT consume cur=[3] C’: br.eq cur, NULL, END’ BB1: mul qoff=iter,5 BB2’: add iter=iter,1 consume item=[4] mod iter=iter,2 set.qbase qoff BB3’: produce [2]=TRUE E: br.eq item, NULL, BB1’

produce [3]=cur BB4’: C: br.eq cur, NULL, END F: load t1=[item + offset(data)] G: add sum=sum,t1 BB2: H: load item=[item + offset(next)] D: load item=[cur + offset(item)] I: jump BB3’ produce [4]=item END’: BB5: add other=1,rT J: load cur=[cur + offset(next)] mod other=other,2 K: goto BB1 mul qoff_exit=other,4 set.qbase qoff_exit END: produce [2]=FALSE consume sum1=[5] set.qbase qoff consume sum2=[10] add sum=sum1,sum2 END2: L: call print, sum produce [5]=sum

(a) Thread 1 (b) Thread 2

Figure 3.9: PS-DSWP Multi-Threaded Code after communication insertion

3.4 Speculative Parallel-Stage DSWP

Section 2.3.2 discussed how adding speculation can break dependences that inhibit parallelism. Speculative PS-DSWP (SpecPS-DSWP) adds speculation to the PS-DSWP algorithm and is heavily influenced by the Speculative DSWP work by Vachharajani et al [82], which was built on top of DSWP, not PS-DSWP. By building on top of PS-DSWP, SpecPS-DSWP can use speculation to remove inter-iteration dependences that prevent the extraction of data-level parallelism.

3.4.1 Determining a Thread Assignment

Algorithm 6 Partition
Require: Loop L, Threads T
1: PDG = BuildPDG(L)
2: Speculation C = DetermineSpeculation(L, PDG)
3: SPDG = BuildSpeculativePDG(L, C)
4: ThreadAssignment A, Replications R = FormAndAssignSCCs(SPDG, T)
5: C = DetermineNeededSpeculation(A, PDG, C)
6: SPDG = RebuildSpeculativePDG(L, C)
7: A, R = FormAndAssignSCCs(SPDG, T)
8: Versioning V = DetermineMemoryVersioning(A, C)
9: Synchronization S = DetermineSynchronization(A, L)
10: return (A, R, S, V, C)

SpecPS-DSWP removes dependences from the PDG before partitioning in order to avoid complicating the already difficult problem of partitioning. Since speculation occurs before partitioning, it can remove dependences that occur in a forward direction, given the DSWP pipeline. This is unnecessary in SpecPS-DSWP, as the pipelined nature of a DSWP parallelization means that the only cost of forward synchronization is the overhead of the extra instructions. To avoid unnecessary misspeculation, after partitioning, only the set of speculations that remove dependences that flow backward in the pipeline are applied. The code is repartitioned after the set of needed speculations is computed, both to ensure that previously unreachable code is assigned to a thread and to give the partitioner the chance to better balance the effects of communication among threads.

(Legend: Register Dep., Control Dep., Memory Dep.; Intra-Iteration Dep., Inter-Iteration Dep., Intra/Inter-Iteration Dep.)

    list *cur = head;
    int temp;
    A: while (cur != NULL) {
    B:     list *item = cur->item;
    C:     while (item != NULL) {
    D:         if (item->data < 0) abort();
    E:         temp = -item->data;
    F:         item->data = temp;
    G:         item = item->next;
           }
    H:     cur = cur->next;
       }

(a) Sequential C Code

(b) Non-Speculative PDG (c) Speculative PDG (d) Speculative DAGSCC

Figure 3.10: SpecPS-DSWP Example Code

Steps related to forming the DAGSCC and assigning SCCs to threads are unchanged from the PS-DSWP algorithm and are combined into a single step.

Step 1: Build PDG

The PDG formed by SpecPS-DSWP is the same as that formed by PS-DSWP, except that false (anti- and output-) memory dependences are not inserted into the PDG. Any backward false memory dependence will be removed through appropriate memory versioning in the generated code. Figure 3.10b shows the PDG built for the example code in Figure 3.10a. Note that even though flow memory dependences exist from F −flow→ D and F −flow→ E, the false memory dependences F −anti→ D, F −anti→ E, and F −output→ F have not been included.

Step 2: Selecting Speculation

SpecPS-DSWP currently supports several speculation types. Three of these speculation types, biased branch, infrequent block, and silent store, come from SpecDSWP, while the remaining two, memory value and loop-carried invariant, are novel to this dissertation. To avoid overspeculation, each speculation type has a threshold limit, T. For the purposes of this subsection, H represents the loop header.

Biased Branch Speculation A Biased Branch speculation predicts that a branch B with targets D1 and D2 does not go to a certain target. For a control flow edge weight function ω and operation weight function β, if ω(B → Di)/β(B) ≤ T_LocalBias, then the edge B → Di is speculated not taken. Unfortunately, SpecDSWP does not distinguish how often a branch is taken relative to the loop header. This is not sufficient for branches in inner loops, as they can often be heavily biased and yet still traverse the unbiased path with high probability per outer loop iteration. Therefore, in SpecPS-DSWP, a loop-sensitive control flow edge weight function δ is used as a second filter. For a loop L, if δ(B → Di, L)/β(H) ≤ T_LoopBias, then the edge B → Di is allowed to be speculated. In practice, the loop threshold is set to the same percentage as the bias threshold. A threshold of 10% is used for both T_LocalBias and T_LoopBias in VELOCITY.

Infrequent Block Speculation An Infrequent Block speculation predicts that a basic block is not executed. For a given basic block BB and a loop-sensitive basic block weight function β, if β(BB)/β(H) < T_Frequency, then BB is speculated to be unreachable. A threshold of 1% is used in VELOCITY.

Silent Store Speculation A Silent Store speculation predicts that a value about to be stored is the same as that already in memory. Silent store speculation relies on a loop-sensitive value prediction function γ to predict how often the value being stored is the same as the value already in memory. If γ(S) > T_SilentStore, then the store is speculated as silent. A threshold of 1% is used in VELOCITY.

Memory Value Speculation In order to extract data-level parallelism, SpecPS-DSWP requires a way to speculate inter-iteration memory flow dependences. Since SpecDSWP avoids the use of hardware to detect misspeculation, memory value speculation is used instead of memory alias speculation to break memory flow dependences. A Memory Value speculation predicts that a load L will not be fed by a store S. The speculation can predict that the store feeds the load only within an iteration (inter-iteration speculation), only across iterations (intra-iteration speculation), or never (both intra- and inter-iteration speculation). Memory value speculation relies on a loop-sensitive alias prediction function α_INTRA to determine the probability of aliasing intra-iteration and α_INTER to predict the probability of aliasing inter-iteration. If α_INTRA(L) ≤ T_IntraAlias, then the intra-iteration portion is speculated not to exist. If α_INTER(L) ≤ T_InterAlias, then the inter-iteration portion is speculated not to exist. Thresholds of 10% for T_InterAlias and 0% for T_IntraAlias are used in VELOCITY. The reason for the lack of intra-iteration memory value speculation will be explained in Section 5.2.2.

Loop-Carried Invariant Speculation SpecPS-DSWP also introduces a new type of speculation, Loop-Carried Invariant. A loop-carried invariant speculation predicts that if a load L obtains a value from outside the current iteration, then the value will be the same as the one that existed in memory at the beginning of the last iteration. Loop-carried invariant speculation is useful for removing memory flow dependences to loads of global data structures that can change during an iteration, but are eventually reset to the value that they had at the beginning of the iteration. Loop-carried invariant speculation relies on a loop-sensitive invariant prediction function λ to determine the probability of a load getting the correct value from memory T iterations ago. If λ(L) ≤ T_Variance, then L is speculated as loop-carried invariant. A threshold of 10% is used in VELOCITY.
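The biased branch test can be illustrated with a small sketch. The EdgeProfile structure, its field names, and the helper function are assumptions of this illustration; only the two 10% thresholds come from the text above.

    #include <stdbool.h>

    /* Assumed profile record for one branch edge B -> Di, relative to a
     * candidate loop L with header H.                                   */
    typedef struct {
        double edge_weight;        /* omega(B -> Di): global edge count        */
        double branch_weight;      /* beta(B): global execution count of B     */
        double loop_edge_weight;   /* delta(B -> Di, L): edge count within L   */
        double header_weight;      /* beta(H): executions of L's header        */
    } EdgeProfile;

    #define T_LOCAL_BIAS 0.10   /* thresholds used by VELOCITY (10%) */
    #define T_LOOP_BIAS  0.10

    /* An edge may be speculated "not taken" only if it is rarely taken both
     * locally and per iteration of the loop being parallelized.            */
    bool may_speculate_not_taken(const EdgeProfile *p) {
        bool locally_biased = p->edge_weight      / p->branch_weight <= T_LOCAL_BIAS;
        bool loop_biased    = p->loop_edge_weight / p->header_weight <= T_LOOP_BIAS;
        return locally_biased && loop_biased;
    }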

Since memory value speculation and loop-carried invariant speculation can remove the same dependences, but require different transformations, they cannot both be applied at the same time. As the dependences removed by loop-carried invariant speculation subsume those removed by memory value speculation, when choosing speculation, loop-carried invariant speculation is determined before memory alias speculation, thus avoiding the problem.

Statically determining the prediction functions has been attempted for a loop-insensitive branch predictor [88]. In general, though, the compiler cannot easily predict the dynamic effects that these functions seek to capture. Thus, profiling is used to gather the relevant data for each function. The profiling infrastructure is described in Section 3.4.3.

For the example in Figure 3.10, biased branch speculation is applied to the loop exit branch D, and memory value speculation is applied to the F −flow→ D and F −flow→ E memory dependences. The speculative PDG and speculative DAGSCC formed by applying these speculations to the non-speculative PDG are shown in Figure 3.10c and Figure 3.10d respectively.

60 Step 3: Build Speculative PDG

The speculative PDG is built using the same process as Step 1, except that a Speculative CFG is used as the basis rather than the original, non-speculative CFG. The Speculative CFG is formed by removing all control edges that have been speculated not taken by Biased Branch and Infrequent Block speculation. After the Speculative PDG is formed from the Speculative CFG, dependences are removed on a per-speculation basis as follows:

Silent Store Speculation If a store S is speculated as silent, then all outgoing flow mem- ory arcs from S are removed from the Speculative PDG.

Memory Value Speculation If a memory value speculation predicts that the intra-iteration dependence will not manifest, then for an EITHER memory flow dependence, the loop-carriedness is speculated to INTER, while for an INTRA memory flow dependence, the dependence is removed from the Speculative PDG. If a memory value speculation predicts that the inter-iteration dependence will not manifest, then for an EITHER memory flow dependence, the loop-carriedness is speculated to INTRA, while for an INTER memory flow dependence, the dependence is removed from the Speculative PDG.

Loop-Carried Invariant Speculation If a load L is speculated as loop-carried invariant, then all INTER memory flow dependences with destination L are removed from the Speculative PDG. All EITHER memory flow dependences with destination L are marked INTRA.

Step 4: Form and Assign SCCs

The DAGSCC is formed from the Speculative PDG and partitioned just as in PS-DSWP.

61 Step 5: Determine Needed Speculation

Once the Speculative PDG and Speculative CFG have been formed, the partitioning process proceeds as normal to produce a thread assignment. For a given thread assignment, speculation that only breaks forward dependences can be removed to reduce the misspeculation rate.

Choosing the amount of speculation needed to produce a good parallelization is a hard problem. Most existing techniques overspeculate to ensure that parallelism exists, using feedback-directed compilation to prune unprofitable parallelizations [22]. To give the partitioner as much freedom as possible, SpecPS-DSWP also speculates more dependences than may be necessary to produce good parallelism. However, unlike speculative IMT and CMT techniques, SpecPS-DSWP can undo any speculation that does not introduce backwards dependences. The details of how to unspeculate dependences are beyond the scope of this dissertation; see Vachharajani et al [82] for more detail. The main point is that the ability to unspeculate dependences allows SpecPS-DSWP to avoid unnecessary misspeculation on an otherwise forward dependence. Additionally, in the case where a dependence can be removed by multiple types of speculation, SpecPS-DSWP can choose to keep only one speculation, again avoiding unnecessary misspeculation, if, for example, an alias speculation can be used instead of a biased branch speculation.

For Figure 3.10, if E and F are placed in the same SEQUENTIAL stage, then the memory flow speculation that removed the F → E dependence can be unspeculated, as it is no longer needed. If E and F are placed in the same DOALL stage, then only the INTER portion of the dependence needs to be speculated, while the INTRA portion can be unspeculated.

Step 6: Rebuild Speculative PDG

After dependences are unspeculated, particularly biased branch and infrequent block speculations, previously unreachable code may become reachable and will need to be integrated into the partition. Additionally, by unspeculating forward dependences, new inter-thread communication may be inserted. To handle this, after unspeculation, the Speculative PDG is reformed exactly as in Step 3, but with only the remaining needed speculation.

Step 7: Form and Assign SCCs

The same partitioner is used to repartition the code using the Speculative PDG and its DAGSCC recreated by Step 6.

Step 8: Determine Memory Versions

Once a thread assignment has been established and the set of speculations is fixed, SpecPS-DSWP must determine the memory version that each memory operation and external function call will execute in. At the outermost loop level, memory versions are used to manage speculative state so that rollback can occur. Inside the loop, memory versions are used to break false memory dependences. Additionally, because the SpecPS-DSWP code generator relies on software to detect misspeculation, memory versions are used to ensure that any unanticipated, new memory flow dependences that occur in an iteration that misspeculates do not prevent misspeculation from being detected. The details of determining memory versions are beyond the scope of this dissertation and can be found in Vachharajani et al. [82].

Step 9: Determine Synchronization

This step is unchanged.

3.4.2 Realizing a Thread Assignment

A SpecPS-DSWP thread assignment is realized by first reifying the speculation and then applying the PS-DSWP algorithm to the resulting code. After PS-DSWP multi-threaded code generation is done, checkpoint and recovery code are inserted into each thread and the Commit Thread.

Algorithm 7 Speculative MultiThreadedCodeGeneration
Require: Loop L, (ThreadAssignment A, Replication R, Synchronization S, Versioning V, Speculation C)
1: CopyCodeForRecovery(L)
2: ApplySpeculation(L, C)
3: Threads T = MultiThreadedCodeGeneration(L, A, S)
4: ApplyMemoryVersioning(V)
5: FinishRecoveryCode(L, T)

Step 1: Copy Code For Recovery

Because SpecPS-DSWP can speculate intra-iteration dependences, it must have an unaltered copy of the loop body to use upon misspeculation. In this step, the commit thread is created, and a copy of the loop body is placed in it.

Step 2: Apply Speculation

       list *cur = head;
       int temp;
    A: while (cur != NULL) {
    B:   list *item = cur->item;
    C:   while (item != NULL) {
    D:     if (item->data < 0) abort();
    E:     temp = -item->data;
    F:     item->data = temp;
    G:     item = item->next;
         }
    H:   cur = cur->next;
       }

    (a) Sequential C Code

        list *cur = head;
        int temp;
    (1) while (cur != NULL) {
    (1)   list *item = cur->item;
    (2)   while (item != NULL) {
    (2)     temp2 = item->data;
    (3)     if (item->data != temp2) MISSPEC;
    (2)     if (temp2 < 0) MISSPEC;
    (2)     temp3 = item->data;
    (3)     if (item->data != temp3) MISSPEC;
    (2)     temp = -temp3;
    (2)     item->data = temp;
    (2)     item = item->next;
          }
    (1)   cur = cur->next;
        }

    (b) Single-Threaded Code with Speculation (Thread Assignment in parentheses)

Figure 3.11: Single Threaded Speculation

Each type of speculation is applied to the code as described below:

Biased Branch Speculation To realize a biased branch speculation, the compiler inserts a new basic block on each control flow edge from the branch to each speculated target. The block signals misspeculation and then waits to be redirected to recovery code, either by consuming from an empty queue or via another action of unbounded latency. To facilitate accurate analysis by the compiler, the block then branches to one of the non-speculated targets if one exists; otherwise it is marked as an exit. This keeps the branch from introducing additional region exits or control dependences that would force conservative compiler analysis.

Infrequent Block Speculation To realize an infrequent block speculation, the compiler has two options. It can replace the block with a misspeculation signal, followed by the original fall through branch. Unfortunately, this retains control dependences in non-speculated code. To avoid this, the infrequent block speculation can be mapped to a set of biased branch speculations, one for each incoming control flow edge.

Silent Store Speculation To realize a silent store speculation, the compiler inserts a load of the same memory and a branch to compare the loaded value with the one about to be stored. If they are equal, the branch jumps around the store, otherwise it signals misspeculation in the same way biased branch speculation does.
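For illustration, a minimal C-level sketch of the check emitted for a speculated-silent store follows. The function names are hypothetical; the real transformation operates on the compiler IR, and misspec() stands in for the misspeculation-signaling block described above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the real handler: in SpecPS-DSWP this signals the
     * commit thread and waits to be resteered to recovery code. */
    static void misspec(void) { fprintf(stderr, "misspeculation\n"); abort(); }

    /* Original code:  *p = v;   (speculated to be silent)
     * The compiler inserts a load of the same address and a compare; when
     * the values match, control jumps around the store, so the store keeps
     * no outgoing memory flow arcs on the speculated path. */
    static void silent_store_check(int *p, int v) {
        if (*p != v)
            misspec();   /* the store was not silent: speculation failed */
        /* silent case: the store is skipped */
    }

    int main(void) {
        int x = 7;
        silent_store_check(&x, 7);   /* silent: no misspeculation */
        return 0;
    }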

Memory Value Speculation To realize a memory value speculation from a store S to a load L, the compiler creates a copy of the load L′, and updates the memory analysis to remove the dependence from S to L′. The compiler then inserts code to compare the value loaded from L′ to that loaded by L. If the values differ, the code branches to a block that indicates misspeculation in the same way biased branch speculation does.

Loop-Carried Invariant Speculation To realize a loop-carried invariant speculation, the compiler employs the same methodology as memory value speculation. That is, a copy of the load is made and all incoming memory flow dependences are removed. A compare of the speculative load with the original load is used to determine whether misspeculation occurred.

To signify to the hardware the point at which a value should be read from committed state rather than traverse the memory hierarchy, two new instructions are added. A set.lvp instruction is used at the beginning of each iteration to indicate where a load.invariant should stop traversing the memory version hierarchy and proceed directly to committed memory.

Figure 3.11 shows the result of applying biased branch speculation to branch D, and memory value speculation to the flow dependences F → D and F → E. MISSPEC represents a general code sequence that signals misspeculation to the commit thread and then loops indefinitely until resteered by the commit thread.

Step 3: Multi-Threaded Code Generation

Once speculation has been realized in the single-threaded code, the PS-DSWP code generation algorithm is applied to the loop.

Step 4: Apply Memory Versioning

SpecPS-DSWP relies upon a versioned memory system with hierarchical memory versions. A parent MTX is a virtual container for many subTX memory versions, which are ordered from oldest to youngest. A subTX can contain at most one MTX or be used to execute a sequence of memory operations. At any point in time each thread is executing in a specific memory version, denoted (MTX, subTX). The memory version (0, 0) is special and represents nonspeculative memory. Stores are modified to store into the current memory version. Loads are modified to load from the closest memory version older than or equal to the current memory version; thus, loads can potentially traverse all direct and indirect parents, older siblings, and older siblings' children to find the oldest stored value. Table 3.3 summarizes the queue and version memory ISA extensions.

Queue Instructions

  set.qbase V
      Sets the queue base Base for a thread to the given value V.
  produce [Q] = V
      Pushes value V onto the tail of queue Q + Base if it is not full. If the queue is full, the instruction stalls until room is available.
  consume R = [Q]
      Pops a value off of the head of queue Q + Base if there are values in the queue and places the value in register R. If the queue is empty, the instruction stalls until one is available.
  consume.poll P, R = [Q]
      For queue Q + Base, if there is a value in the queue, it is popped off the queue, placed in register R, and P is set to 1. If the queue is empty, the instruction writes 0 into P and the value of R is undefined. This instruction does not stall if the queue is empty.
  queue.flush Q
      Flushes all values in the queue Q + Base, leaving it empty.
  resteer Q, A
      Asynchronously resteers the thread that produced into queue Q + Base to address A.

Version Memory Instructions

  allocate (M, S)
      Returns an unused MTX ID, setting its parent version to memory version (M, S).
  enter (M, S)
      Enters the specified MTX M and subTX S.
  commit_stx (M, S)
      Commits the subTX S into the parent MTX M. S must be the oldest child of M.
  commit_mtx M
      Commits the MTX M into the parent (MTX, subTX) that it was created under. If the parent memory version is (0, 0), then commits into nonspeculative state.
  rollback M
      All the stores from the specified MTX M and any subordinate subTXs and MTXs are discarded, and the MTX is deallocated. Threads must issue an enter to enter a legitimate MTX or committed state.
  set.lvp (M, S)
      Sets the load variant point register to the specified MTX M and subTX S.
  load.variant R = [Addr]
      Performs a load that respects the load variant point register, relative to the current memory version.

Table 3.3: SpecPS-DSWP ISA Extensions

An efficient implementation of version memory is beyond the scope of this dissertation, but, as in a hardware transactional memory system, versioned memory can use extra bits in the caches to hold version-related information [82].

For each memory instruction and external function call, a memory version enter instruction is inserted ahead of it. Additionally, for any loop body with instructions in more than one memory version, a new hierarchical memory transaction (MTX) is allocated. Its parent is the last executing memory transaction, and when it commits it will commit its state into that transaction. The purpose of these hierarchical transactions is to allow all of the memory versions inside the loop (subTXs) to be summarized by a single MTX, so that all version numbers assigned to operations can be statically determined. If hierarchy were not used, then the unbounded number of memory versions that the loop would use would force the compiler to communicate memory versions between threads, potentially creating backwards dependences [82].

[Figure 3.12 listing: (a) Thread 1, (b) Thread 2, (c) Thread 3 — the three worker threads in high-level C-like form, showing the inserted produce/consume, enter, commit_stx, and commit_mtx operations.]

Figure 3.12: SpecPS-DSWP Parallelized Code without Misspeculation Recovery

Figure 3.12 shows the memory version code inserted for the code from Figure 3.11 after the PS-DSWP code generation algorithm has been applied. In the interest of space, the code is shown in high-level C-like form.
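To make the versioning pattern concrete, the following is a minimal C-like sketch of one pipeline stage executing its slice of a loop inside a nested MTX. The allocate/enter/commit_stx/commit_mtx calls stand in for the ISA extensions of Table 3.3, exposed here as hypothetical intrinsics with stub definitions; the loop body and helper names are illustrative, not taken from the benchmark.

    #include <stdbool.h>

    /* Hypothetical intrinsics standing in for the versioned-memory ISA. */
    static int  allocate(int m, int s)   { (void)m; (void)s; return 1; }
    static void enter(int m, int s)      { (void)m; (void)s; }
    static void commit_stx(int m, int s) { (void)m; (void)s; }
    static void commit_mtx(int m)        { (void)m; }

    static bool more_iterations(void)    { static int n = 3; return n-- > 0; }
    static void do_stage_work(void)      { /* this stage's slice of the iteration */ }

    void stage_body(int parent_mtx, int parent_stx) {
        int mtx = allocate(parent_mtx, parent_stx); /* nested MTX for this loop      */
        int stx = 0;

        while (more_iterations()) {
            enter(mtx, stx);        /* memory ops below land in version (mtx, stx)   */
            do_stage_work();
            commit_stx(mtx, stx);   /* summarize this subTX into the MTX             */
            stx += 1;               /* next version number is statically determined  */
        }

        commit_mtx(mtx);            /* fold the loop's state into the parent version */
        enter(parent_mtx, parent_stx);
    }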

Step 5: Finish Recovery Code

    while (true) {
        move_to_next_memory_version();
        send_checkpoint(commit_thread);
        status = execute_loop_iteration();
        produce(commit_thread, status);
        if (status == EXIT)
            break;
        else if (status == MISSPEC)
            wait_for_resteer();
        else if (status == OK)
            continue;
    RECOVERY:
        produce_resteer_ack(commit_thread);
        flush_queues();
        regs = receive_checkpoint(commit_thread);
        restore_registers(regs);
    }

    (a) Worker Thread

    do {
        regs = receive_checkpoints(threads);
        status = poll_worker_statuses(threads);
        if (status == MISSPEC) {
            resteer_threads(threads);
            consume_resteer_acks(threads);
            rollback_memory();
            regs = execute_loop_iteration(regs);
            send_checkpoints(threads, regs);
        } else if (status == OK || status == EXIT)
            commit_memory();
        move_to_next_memory_version();
    } while (status != EXIT);

    (b) Commit Thread

Figure 3.13: Interaction of Parallel Code and Commit Thread

Once the code for each thread is finished, the only remaining step is to create code necessary for recovery from misspeculation. Each worker thread will have the same basic layout, shown in Figure 3.13a [82]. Essentially, at the beginning of an iteration, a memory checkpoint is created by moving to the next memory version and a register checkpoint is sent to the commit thread. The thread then executes the iteration normally. In the event of misspeculation, the thread waits for the commit thread to resteer it to recovery code.

The commit thread, shown in Figure 3.13b, never executes speculatively. After receiving the register checkpoints at the beginning of each iteration, it waits for status updates from the threads executing the current iteration. Status messages are sent using the queue system. Since only one thread may detect misspeculation while the others loop indefinitely, the commit thread is not guaranteed to receive a status message from every thread. The consume.poll instruction is used by the commit thread to repeatedly poll the status queues for the threads until all of them have sent an OK/EXIT status or any one has sent a MISSPEC status.

If a thread signals misspeculation, the commit thread uses an asynchronous resteer instruction to direct each thread to statically created recovery code. The system returns to a known good state by:

1. Discarding speculative writes by rolling back the version memory system via the rollback instruction.

2. Flushing the queue system, via a queue.flush operation for each queue.

3. Restoring registers from an iteration checkpoint.

If misspeculation is not detected in an iteration, the commit thread commits the speculative state to non-speculative memory via a sequence of commit instructions [82].

3.4.3 Loop-Aware Profiling

The ability to speculate a dependence requires the ability to predict how often a property of the dependence occurs. For example, if a load always reads the same value, then the register it defines can be speculated to always have that value. The general solution to creating predictors is to profile the application, using the results of profiling to guide prediction functions. While these profiles are useful in traditional optimizations, to be useful in loop parallelization they must be able to distinguish how often a dependence manifests per loop iteration, or whether a dependence is intra-iteration versus inter-iteration [22, 89].

VELOCITY relies on several profilers to create predictors for each type of speculation. Unlike existing approaches that rely on path profiles to obtain loop-sensitive results for alias profiling [89], the VELOCITY compiler uses a novel methodology to maintain and collect only the loop-sensitive profile information needed for speculation.

To make the profilers loop-sensitive, the profiling framework maintains a stack of the loops that have been entered and uses this loop stack to create loop-sensitive profiles. On entry to a loop, its context is pushed onto the stack; on exit from the loop, the top of the stack is popped off and discarded. In the current implementation, loops inside recursive functions are not profiled, so each loop appears at most once in the loop context stack. Since DSWP cannot currently be applied to loops inside recursive functions, this limitation does not adversely affect VELOCITY. To efficiently test whether a profile event has already occurred during a loop's current iteration, a timestamp counter is maintained by the profiling system, along with a timestamp for each item in the loop stack. At every loop backedge, the current timestamp counter is incremented by one and the timestamp at the top of the loop stack is replaced with the current timestamp. Initially, the loop stack contains a pseudo-loop, representing code that does not execute in any loop, with a timestamp of 0.
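The following is a minimal sketch of this bookkeeping; the structure, array sizes, and function names are illustrative assumptions rather than VELOCITY's actual profiling runtime.

    #include <stdint.h>

    #define MAX_LOOP_DEPTH 64

    /* One entry per loop currently being executed. */
    typedef struct {
        int      loop_id;          /* statically assigned loop identifier        */
        uint64_t iter_timestamp;   /* timestamp of this loop's current iteration */
    } LoopFrame;

    /* Entry 0 is the pseudo-loop (id 0, timestamp 0) for code outside any loop. */
    LoopFrame loop_stack[MAX_LOOP_DEPTH];
    int       loop_depth;          /* index of the top of the stack              */
    uint64_t  current_timestamp;   /* global iteration counter                   */

    /* Instrumentation called on loop entry. */
    void profile_loop_enter(int loop_id) {
        loop_depth++;
        loop_stack[loop_depth].loop_id = loop_id;
        loop_stack[loop_depth].iter_timestamp = current_timestamp;
    }

    /* Instrumentation called on every loop backedge. */
    void profile_loop_backedge(void) {
        current_timestamp++;
        loop_stack[loop_depth].iter_timestamp = current_timestamp;
    }

    /* Instrumentation called on loop exit. */
    void profile_loop_exit(void) {
        loop_depth--;
    }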

SpecPS-DSWP relies on several profilers detailed below to give loop-sensitive results.

Branch Profile

A branch or edge profile gives the absolute number of times each control flow edge is traversed. In an insensitive profiling environment, branches are instrumented to determine their target, producing a set of (branch, target, count) tuples that guide many optimizations. In a loop-sensitive branch profile, the count for the (branch, target, count) tuple is incremented for each loop in the loop stack, if it has not already been incremented during that loop's current iteration.

When a branch executes, the loop stack is traversed from top to bottom, checking whether the branch edge has already been taken during the loop's current iteration. This is accomplished by maintaining a profile-event (e.g., branch edge taken) to timestamp map for each loop that indicates the last iteration in which the profile event was incremented. If the last timestamp is not equal to the current timestamp, then the branch edge counter is incremented. This traversal can exit on the first timestamp that is equal to the current timestamp, as it is guaranteed that all outer containing loops lower in the stack also have the same timestamp. The final result of the profiler is a set of (loop, branch, target, count) tuples. These tuples and maps can be efficiently implemented using arrays by statically assigning unique, consecutive identifiers to loops and (branch, target) pairs.
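Continuing the loop-stack sketch above, a loop-sensitive edge counter might be updated as follows. The flat arrays indexed by statically assigned loop and edge identifiers, and their sizes, are illustrative; timestamps are stored plus one so that zero means "never counted".

    #include <stdint.h>

    /* Declared in the loop-stack sketch. */
    typedef struct { int loop_id; uint64_t iter_timestamp; } LoopFrame;
    extern LoopFrame loop_stack[];
    extern int       loop_depth;

    #define MAX_LOOPS 128
    #define MAX_EDGES 4096

    static uint64_t edge_count[MAX_LOOPS][MAX_EDGES];   /* (loop, branch, target) -> count */
    static uint64_t edge_seen[MAX_LOOPS][MAX_EDGES];    /* last counted iteration + 1      */

    /* Called when the (branch, target) edge `edge_id` is traversed. */
    void profile_branch_edge(int edge_id) {
        for (int i = loop_depth; i >= 0; i--) {
            int      loop = loop_stack[i].loop_id;
            uint64_t tag  = loop_stack[i].iter_timestamp + 1;
            if (edge_seen[loop][edge_id] == tag)
                break;   /* already counted this iteration; outer loops were too */
            edge_seen[loop][edge_id] = tag;
            edge_count[loop][edge_id]++;
        }
    }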

Silent Store Value Profile

A Silent Store Value Profile indicates the number of times that a store is silent [40]; that is, the number of times the value being stored equals the value already in memory. It is obtained by taking the value of each stored register just before it is written to memory and comparing it to the value in memory at the address being stored to. If they are equal, the value profile increments an areEqual counter for the store; otherwise, it increments an areDifferent counter.

To make the store value profile loop-sensitive, essentially the same actions are performed as in the branch profiling infrastructure. Instead of keeping track of a (branch, target) count, the store value profile keeps track of (areEqual, areDifferent) counts. Additionally, two (loop, store) → Timestamp maps are used to ensure that areEqual and areDifferent are incremented at most once per iteration. The final result of the profiler is a set of (loop, store, areEqual, areDifferent) tuples. These tuples can be efficiently represented with arrays by statically assigning unique, consecutive identifiers to loops and stores.
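A compact sketch of this profiler, reusing the loop stack from above; the int-sized store granularity, array sizes, and names are illustrative assumptions.

    #include <stdint.h>

    typedef struct { int loop_id; uint64_t iter_timestamp; } LoopFrame;
    extern LoopFrame loop_stack[];
    extern int       loop_depth;

    #define MAX_LOOPS  128
    #define MAX_STORES 4096

    /* (loop, store) -> counters; *_seen hold the last counted iteration + 1. */
    static uint64_t are_equal[MAX_LOOPS][MAX_STORES],     equal_seen[MAX_LOOPS][MAX_STORES];
    static uint64_t are_different[MAX_LOOPS][MAX_STORES], diff_seen[MAX_LOOPS][MAX_STORES];

    /* Called just before store `store_id` writes `value` to `addr`. */
    void profile_silent_store(int store_id, const int *addr, int value) {
        int silent = (*addr == value);
        for (int i = loop_depth; i >= 0; i--) {
            int      loop  = loop_stack[i].loop_id;
            uint64_t tag   = loop_stack[i].iter_timestamp + 1;
            uint64_t *seen = silent ? &equal_seen[loop][store_id] : &diff_seen[loop][store_id];
            uint64_t *cnt  = silent ? &are_equal[loop][store_id]  : &are_different[loop][store_id];
            if (*seen == tag)
                break;           /* already counted during this iteration */
            *seen = tag;
            (*cnt)++;
        }
    }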

Memory Alias Profile

A Memory Alias Profile indicates the number of times that a (store, load) memory flow dependence manifests itself. Since a store or load may write or read multiple bytes, a particular load may increment several (store, load) pairs, though this is rare in practice. The profile is obtained by keeping a map from each address to the last store to write that address. A store instruction updates the map entries for the addresses it writes, while a load instruction reads the set of last stores for the addresses it reads. The load then increments the counts for the resulting set of unique (store, load) tuples.

To make the alias profile loop-sensitive, the loop timestamp infrastructure from the branch profiling is reused. However, besides keeping track of the last iteration's timestamp, the loop stack also keeps track of each loop's invocation timestamp. This is used to determine whether an alias dependence is intra-iteration or inter-iteration.

The address-to-store map is changed to an address to (store, timestamp) map. A store instruction updates the map with itself and the current timestamp. On a load, the set of unique (store, timestamp) tuples is read out of the accessed addresses. For each (store, timestamp) pair, the loop stack is traversed from top to bottom until the store's timestamp is greater than the loop's invocation timestamp. This is the first loop that contains both the store and the load instructions, and it is the loop around which the dependence is either intra-iteration or inter-iteration. This is determined by comparing the store's timestamp with the loop's current iteration timestamp. If they are equal, the dependence is intra-iteration and the count for the (loop, store, load, INTRA) tuple is incremented. A (loop, store, load) → Timestamp map is used to ensure this increment occurs only once per iteration. Otherwise, the dependence is inter-iteration, represented as (loop, store, load, INTER), which is also incremented if it has not already been incremented this iteration.
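A simplified, standalone sketch of this mechanism follows. It uses a word-granularity direct-mapped table instead of the page-based hash map described below, classifies a dependence as intra-iteration when the store's timestamp is at least the loop's current-iteration timestamp, and omits the per-iteration deduplication and the (store, load) keying of the real counters; all names and sizes are illustrative.

    #include <stdint.h>

    /* Loop stack mirroring the earlier sketch, extended with the invocation timestamp. */
    typedef struct {
        int      loop_id;
        uint64_t iter_timestamp;
        uint64_t invocation_timestamp;
    } LoopFrame;
    LoopFrame loop_stack[64];
    int       loop_depth;
    uint64_t  current_timestamp;

    typedef struct { int store_id; uint64_t timestamp; } LastStore;
    #define TABLE_SIZE (1u << 20)
    static LastStore last_store[TABLE_SIZE];

    static LastStore *entry(const void *addr) {
        return &last_store[((uintptr_t)addr >> 2) & (TABLE_SIZE - 1)];
    }

    /* Called for each store: record the last writer of the address. */
    void profile_store(int store_id, const void *addr) {
        entry(addr)->store_id  = store_id;
        entry(addr)->timestamp = current_timestamp;
    }

    #define MAX_LOOPS 128
    static uint64_t alias_count[MAX_LOOPS][2];   /* [loop][0 = INTRA, 1 = INTER] */

    /* Called for each load: attribute the (store, load) dependence to the
     * first loop containing both, and classify it as intra- or inter-iteration. */
    void profile_load(int load_id, const void *addr) {
        (void)load_id;
        LastStore ls = *entry(addr);
        if (ls.timestamp == 0)
            return;                                /* no prior store observed */
        for (int i = loop_depth; i >= 0; i--) {
            if (ls.timestamp > loop_stack[i].invocation_timestamp) {
                int intra = (ls.timestamp >= loop_stack[i].iter_timestamp);
                alias_count[loop_stack[i].loop_id][intra ? 0 : 1]++;
                break;
            }
        }
    }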

The final result of the profiler is a set of (loop, store, load, INTRA|INTER, count) tuples. These tuples can be efficiently represented with arrays by statically assigning unique, consecutive identifiers to loops, stores, and loads. In practice, the map from address to (store, timestamp) can be represented as a hash map of 4Kb pages, where each page element is a 64-bit (store, timestamp) tuple with a 20-bit store identifier and a 44-bit timestamp counter.

To determine the set of intra-iteration alias dependences, the compiler must consider the intra-iteration alias dependences of the loop itself, as well as both the intra- and inter-iteration dependences of all loops contained, either directly or indirectly, by the loop.

Loop-Invariant Value Profile

A Loop-Invariant Value Profile is a purely loop-sensitive profile. The purpose of the profile is to determine how often a load retrieves a value from memory that, if the dependence is inter-iteration, is the same as in the previous iteration(s). If the load is fed by a store in the current iteration of the loop that first contains both instructions, it is not counted. The profiler is essentially the same as the alias profiler, except that the loop stack is modified to keep a loop-specific version of memory (an address-to-value map) that corresponds to the previous iteration's memory. On a load instruction fed by a store at least one iteration distant, the previous iteration's memory value for the address is compared to the current value in memory. If they differ, the areDifferent value is incremented; otherwise the areEqual value is incremented.

Maintaining a separate copy of memory for each loop in the program is prohibitively expensive, thus only loops that account for more than 30% of the program’s runtime and that are at most 4 levels deep in the loop hierarchy are profiled for loop-invariance.

The final result of the profiler is a set of (loop, load, areEqual, areDifferent) tuples. These tuples can be efficiently represented with arrays by statically assigning unique, consecutive identifiers to loops and loads. To determine whether a load is invariant with respect to a loop, the compiler must consider the invariance of the load in the loop itself, as well as in all loops contained, either directly or indirectly, by the loop.

3.5 Interprocedural Speculative Parallel-Stage DSWP

Section 2.4 discussed how adding interprocedural scope was beneficial to the extraction of parallelism. Interprocedural SpecPS-DSWP (iSpecPS-DSWP) is a novel technique that adds both interprocedural analysis and optimization scope to the SpecPS-DSWP algorithm, allowing it to optimize larger loops than existing parallelization techniques.

3.5.1 Determining a Thread Assignment

No additional steps are needed for iSpecPS-DSWP; however, several existing steps change to account for the interprocedural scope.

Steps 1, 3, & 6: Build PDG

To facilitate interprocedural scope, the PDG must be formed and connected not just for the local loop, but for any reachable procedure. The algorithm in this dissertation forms an Interprocedural PDG (iPDG), a derivative of the System Dependence Graph (SDG) [35, 66], an interprocedural PDG representation originally used for program slicing. As in the SDG, the iPDG contains all local dependences among operations in the loop and any procedures called directly or indirectly from operations in the loop.

To represent interprocedural dependences, the SDG uses actual parameters at the call site and formal parameters at the callee's entrance and exit. Call sites are expanded so that actual-in parameters connect to formal-in parameters and formal-out parameters connect to actual-out parameters. To handle memory data dependences, references to memory variables that are modified in the procedure or its descendants appear as separate formal and actual arguments. To handle control dependences, a single call dependence is inserted between the call and the entry node of the called procedure. The SDG forces all interprocedural dependences to occur among these formal/actual pairings so that it can be used for efficient slicing through the addition of transitive dependence edges. Transitive dependence edges connect actual outs with only the actual ins that they are transitively dependent upon. They are used by slicing algorithms to perform an accurate interprocedural slice using just an up and a down pass over the SDG, rather than iterating to convergence [35].

       void main() {
         list *cur = head;
    A:   while (cur != NULL) {
    B:     list *item = cur->item;
    C:     work(item);
    D:     cur = cur->next;
         }
       }

    E: void work(list *item) {
    F:   while (item != NULL) {
    G:     if (item->data < 0) abort();
    H:     int temp = -item->data;
    I:     item->data = temp;
    J:     item = item->next;
         }
       }

    (a) Sequential C Code

Legend: Parameter/Register Dep., Memory Dep., Call/Control Dep.; Intra-Iteration Dep., Inter-Iteration Dep., Intra/Inter-Iteration Dep.

(b) Non-Speculative PDG (c) Speculative PDG (d) Speculative DAGSCC

Figure 3.14: iSpecPS-DSWP Example Code


The iPDG uses the same formal-to-actual relation as the SDG to represent actual parameter dependences between call site and callee. In addition to adding the call dependence from the call site to the procedure's entry, the iPDG adds a call dependence from the call site to every operation in the procedure. This is required so that later phases of the DSWP algorithm can properly replicate the branches and calls so that each operation in a thread can be executed under the correct condition of execution. Finally, DSWP does not use transitive dependence edges and so does not need them in the iPDG. Because of this, the iPDG does not need to force memory dependences to indirect through pseudo formal/actual pairs, instead allowing interprocedural memory dependences to connect directly. In particular, memory dependences are calculated based on the results of an interprocedural, context-sensitive pointer analysis [14, 53].

Determining the loop-carriedness of each edge in the iPDG occurs as follows (a sketch of the classification appears after the list):

Register Dependence If the register dependence is in the DSWPed loop, then its loop-carriedness is computed as in Step 1 of Section 3.3.1. Otherwise, it occurs in a called procedure, and is marked as INTRA.

Parameter Dependence A parameter dependence is always marked as INTRA, since it cannot be live around the loop backedge.

Control Dependence If the control dependence is in the DSWPed loop, then its loop-carriedness is computed as in Step 1 of Section 3.3.1. Otherwise, it occurs in a called procedure, and is marked INTRA.

Call Dependence Like a parameter dependence, a call dependence cannot involve the loop backedge, so it is marked INTRA.

Memory Dependence The loop-carriedness of a memory dependence is computed as in Step 1 of Section 3.3.1.
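The classification above can be summarized by the following sketch; the enum and edge representation are hypothetical, and compute_as_in_step1() is a stub standing in for the intraprocedural analysis of Section 3.3.1.

    typedef enum { EDGE_REGISTER, EDGE_PARAMETER, EDGE_CONTROL,
                   EDGE_CALL, EDGE_MEMORY } EdgeKind;
    typedef enum { DEP_INTRA, DEP_INTER, DEP_EITHER } Carriedness;

    typedef struct {
        EdgeKind kind;
        int in_dswped_loop;   /* 1 if the edge lies in the DSWPed loop itself */
    } IpdgEdge;

    /* Placeholder for the Step 1 analysis of Section 3.3.1. */
    static Carriedness compute_as_in_step1(const IpdgEdge *e) {
        (void)e;
        return DEP_EITHER;
    }

    Carriedness classify_loop_carriedness(const IpdgEdge *e) {
        switch (e->kind) {
        case EDGE_PARAMETER:
        case EDGE_CALL:
            return DEP_INTRA;               /* cannot involve the loop backedge */
        case EDGE_REGISTER:
        case EDGE_CONTROL:
            return e->in_dswped_loop ? compute_as_in_step1(e) : DEP_INTRA;
        case EDGE_MEMORY:
        default:
            return compute_as_in_step1(e);  /* memory edges are always analyzed */
        }
    }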

Figure 3.14 shows the non-speculative and speculative PDGs for the same code as in Figure 3.10, except that the inner loop is now inside a function, requiring an interprocedural scope to optimize.

Step 9: Determine Synchronization

First, the new types of dependence arcs must be synchronized properly. Parameter dependences are treated just like register dependences for the purposes of synchronization. Similarly, call dependences are treated just like control dependences for the purposes of synchronization. In particular, just as all indirect control dependences become cross-thread dependences, so do all indirect call dependences. This ensures that all necessary control flow, both branches and calls, exists to allow for the correct condition of execution.

Ideally, communication could be optimized across the merged control flow graphs of the reachable procedures, also called the supergraph [52]. Unfortunately, the algorithms used to optimize communication are on the order of n^3 [54], and the supergraph usually contains tens of thousands of nodes. To avoid excessive compile times, each procedure is optimized separately. That is, only the dependences that cross threads local to a procedure are optimized together. In practice, this means that only register and control dependences are optimized, as most memory flow arcs are interprocedural.

3.5.2 Realizing the Thread Assignment

An iSpecPS-DSWP thread assignment is realized by first reifying the speculation as in SpecPS-DSWP and then applying the multi-threaded code generation algorithm locally to each procedure. After PS-DSWP multi-threaded code generation is done, the SpecPS-DSWP commit thread is created.

Figures 3.15 and 3.17 show the results of applying iSpecPS-DSWP using essentially the same thread assignment and speculation as in Section 3.4.

Step 1: Create Recovery Code

To allow for proper execution of an iteration that misspeculates, all code in the loop and reachable procedures is cloned. Procedure calls inside the cloned region are redirected to cloned copies. This ensures that all code that a recovery iteration can execute remains intact.

Step 2: Apply Speculation

       void main() {
         list *cur = head;
    A:   while (cur != NULL) {
    B:     list *item = cur->item;
    C:     work(item);
    D:     cur = cur->next;
         }
       }

    E: void work(list *item) {
    F:   while (item != NULL) {
    G:     if (item->data < 0) abort();
    H:     int temp = -item->data;
    I:     item->data = temp;
    J:     item = item->next;
         }
       }

    (a) Sequential C Code

        void main() {
          list *cur = head;
    (1)   while (cur != NULL) {
    (1)     list *item = cur->item;
    (2)     work(item);
    (1)     cur = cur->next;
          }
        }

    (2) void work(list *item) {
    (2)   while (item != NULL) {
    (2)     temp2 = item->data;
    (3)     if (item->data != temp2) MISSPEC;
    (2)     if (temp2 < 0) MISSPEC;
    (2)     temp3 = item->data;
    (3)     if (item->data != temp3) MISSPEC;
    (2)     temp = -temp3;
    (2)     item->data = temp;
    (2)     item = item->next;
          }
        }

    (b) Single-Threaded Code with Speculation (Thread Assignment in parentheses)

Figure 3.15: Single Threaded Speculation

All speculation requires only local changes to the IR, so this step is unchanged from SpecPS-DSWP. Figure 3.15 shows the results of applying speculation to the unmodified code from Figure 3.14.

Step 3: Multi-Threaded Code Generation

The first step of code generation is to create thread-specific procedures for each parallelized procedure. Multi-threaded code generation then proceeds in turn on each procedure and the DSWPed loop. For the DSWPed loop, the local CFG is created exactly as in Step 1 of Section 3.3.2; for the procedures, the START and END blocks correspond to procedure entry and exit, respectively. Once the CFGs are created, the instructions in the procedure are moved to the cloned basic blocks.

Synchronization is then inserted for the synchronization points that occur locally in each procedure or the DSWPed loop. For register, control, and memory dependences this is handled exactly as in Step 3 of Section 3.3.2. Parameter dependences are communicated in the same way that register dependences are. Call dependences are handled similarly to control dependences, in that they are cloned into the destination of the dependence.

Finally, control flow edges are redirected, just as in Step 4 of Section 3.3.2. However, calls to procedures in the parallelized region must also be redirected. In particular, the target of a non-external call is redirected to the thread-specific version of that procedure. iSpecPS-DSWP ensures that every procedure exists in each thread, even if it is empty, relying on a subsequent pass of inlining to remove empty procedures. Additionally, a thread-specific procedure call may not pass all the parameters that the original did. The compiler can set the unused parameters to dummy values or, better, collapse the remaining parameters by removing the unused ones as part of redirecting the call.

One complication to the DSWP MTCG algorithm arises when the original sequential code of a procedure uses a local stack variable and its operations are assigned to multiple threads. This segments the live range of the local variable to span multiple threads. If the memory continues to be allocated on the local stack of the first thread, that thread would need to wait for later threads to communicate that they are done using the memory, to avoid later threads using memory that has been deallocated by the first thread. This introduces an illegal backward communication into the DSWP pipeline. The alternative of having the last thread allocate the memory on its stack avoids the problem of backwards dependences, but encounters the same problem upon creating the memory, as the address of the local memory must be communicated backwards to prior threads before they can reference the segment. iSpecPS-DSWP solves this problem by changing stack-allocated variables in the local variable segment that are referenced across multiple threads into heap-allocated variables. The first thread to use the segment creates a heap-allocated memory location (via a thread-safe call to malloc) whose address can be communicated to later threads. The last thread to use the segment destroys the heap-allocated memory (via a thread-safe call to free) after all prior threads have finished the function and sent a token to the last thread indicating that they will no longer access the memory.

    void f(list *item) {
      int flag = false;
      g(item, &flag);
      h(item, &flag);
    }

    (a) Sequential Code

    void f1(list *item) {
      int *flagptr = malloc(4);
      produce(1, item);
      produce(2, flagptr);
      g(item, flagptr);
      produce(3, SYNCH);
    }

    (b) Thread 1

    void f2() {
      list *item = consume(1);
      int *flagptr = consume(2);
      h(item, flagptr);
      consume(3);
      free(flagptr);
    }

    (c) Thread 2

Figure 3.16: Stack Code Example


Figure 3.16 illustrates a simple example for a function f. After partitioning, the flag variable is split across multiple threads. To ensure that both threads can properly access it, it is first malloced in Thread 1 and its address sent to Thread 2. When Thread 1 finishes the function, it sends a synchronization token to Thread 2 to indicate that it is no longer using the memory, allowing Thread 2 to free it.

Queue Instructions

  set.qbase V
      Sets the queue base Base for a thread to the given value V.
  produce [Q] = V
      Pushes value V onto the tail of queue Q + Base if it is not full. If the queue is full, the instruction stalls until room is available.
  consume R = [Q]
      Pops a value off of the head of queue Q + Base if there are values in the queue and places the value in register R. If the queue is empty, the instruction stalls until one is available.
  consume.poll P, R = [Q]
      For queue Q + Base, if there is a value in the queue, it is popped off the queue, placed in register R, and P is set to 1. If the queue is empty, the instruction writes 0 into P and the value of R is undefined. This instruction does not stall if the queue is empty.
  queue.flush Q
      Flushes all values in the queue Q + Base, leaving it empty.
  resteer Q, A
      Asynchronously resteers the thread that produced into queue Q + Base to address A.

Version Memory Instructions

  allocate (M, S)
      Returns an unused MTX ID, setting its parent version to memory version (M, S).
  enter (M, S)
      Enters the specified MTX M and subTX S.
  commit_stx (M, S)
      Commits the subTX S into the parent MTX M. S must be the oldest child of M.
  commit_mtx M
      Commits the MTX M into the parent (MTX, subTX) that it was created under. If the parent memory version is (0, 0), then commits into nonspeculative state.
  rollback M
      All the stores from the specified MTX M and any subordinate subTXs and MTXs are discarded, and the MTX is deallocated. Threads must issue an enter to enter a legitimate MTX or committed state.
  set.lvp (M, S)
      Sets the load variant point register to the specified MTX M and subTX S.
  load.variant R = [Addr]
      Performs a load that respects the load variant point register, relative to the current memory version.
  get_mtx R
      Sets R to the current MTX.
  get_stx R
      Sets R to the current subTX.

Table 3.4: iSpecPS-DSWP ISA Extensions

Step 4: Apply Memory Versioning

The instructions that memory versioning inserts for each load, store, and external function call are all local to a procedure. However, the current MTX and subTX are stored in microarchitectural registers. To access these registers, two additional instructions are added to the ISA (Table 3.4).

Step 5: Finish Recovery Code

This step proceeds as in SpecPS-DSWP. With the exception of the misspeculation recovery code, Figure 3.17 shows the final code after the transformation is finished.

[Figure 3.17 listing: (a) Thread 1: main, (b) Thread 2: main, (c) Thread 3: main — the thread-specific versions of main in high-level C-like form.]

Figure 3.17: iSpecPS-DSWP Parallelized Code without Misspeculation Recovery

[Figure 3.17 listing (cont.): (d) Thread 2: work, (e) Thread 3: work — the thread-specific versions of work in high-level C-like form.]

Figure 3.17: iSpecPS-DSWP Parallelized Code without Misspeculation Recovery (cont.)

Chapter 4

Extending the Sequential Programming Model

Even with the framework described thus far, some applications will defy automatic parallelization. To extract parallel execution from these applications, this dissertation relies on programmer intervention. However, unlike existing languages and frameworks that force the programmer to begin thinking in terms of parallel execution, this dissertation relies on extensions to the sequential programming model, reducing the burden on the programmer. Before looking at extensions to the sequential programming model in Section 4.2, this chapter first gives an overview of programmer-specified parallelism in Section 4.1.

4.1 Programmer Specified Parallelism

Many parallel programming languages have been proposed over the years. The most popular ones extend existing sequential languages with parallel concepts, with the goal of allowing existing code to be easily parallelized. Most existing languages, including C, have been extended with threads and locks by threading libraries, notably pthreads. Unfortunately, by exposing only the low-level concept of threads, these languages give the programmer little if any help avoiding deadlock, livelock, or just plain incorrect execution [47]. Additionally, even if the programmer can extract a correct parallelization, he must then perform a round of performance debugging to achieve a parallelization that is faster than the equivalent single-threaded program. Unfortunately, this performance rarely carries over to a new processor architecture, or even a new configuration of an existing processor architecture, necessitating a new round of performance debugging that often leads to another round of correctness debugging.

The OpenMP [2] and MPI [1] extensions to C attempt to mitigate the performance-debugging aspect of manual parallelization. Aimed at processor-level and cluster-level parallelism, respectively, these extensions allow the programmer to indicate the type of parallelism desired. However, the programmer is still responsible for finding and appropriately dealing with any cross-thread dependences. Thus, these systems still suffer from the problems of creating a correct parallel program. The use of memory transactions [12] to express parallelism has recently been proposed, but such proposals often suffer from correctness [30] and performance [92] issues.

Many new languages have been developed that use higher-level abstractions than threads and locks to expose parallelism. If an application's algorithm can be expressed in the form of streams, languages such as StreamIt [78] ensure the correctness of the parallelism while also extracting good performance, but streams are not the natural formulation for many problems. Several languages, such as Cilk [26], attempt to maintain C-like semantics while giving the programmer the ability to express parallelism by making the concepts of closures and continuations first-class features. Just as with language extensions for parallelism, these new languages still force the programmer to understand and effectively use a new programming model.

Beyond a purely programmer-centric approach to parallelism, several techniques in the literature advocate an integrated approach to the extraction of parallelism, combining both manual and automatic parallelization.

SUIF Explorer [43] is a tool for programmer specification of parallelism, which is then assisted by tool support to ensure correctness. The Software Behavior-Oriented Parallelization [21] system allows the programmer to specify intended parallelism. If the intended parallelization is incorrect, the worst case is single-threaded performance at the cost of extra execution resources.

Work proposed by Thies et al. [77] also extracts parallelism in conjunction with the programmer. By profiling the program's dynamic behavior, their system extracts pipelined parallelism and produces parallelizations for 197.parser and 256.bzip2 that are effectively the same as those in this dissertation. However, their system relies upon the programmer to mark the potentially parallel regions and often to perform several transformations that make the program amenable to parallelization but reduce the software-engineering quality of the code. Additionally, their system does not produce sound code, relying on the profiling phase to see the addresses of all potential dynamic memory dependences.

4.2 Annotations for the Sequential Programming Model

Forcing the programmer to think in terms of parallel execution and then annotate the program with this information imposes a large burden on the programmer. It would be better for the programmer to remain in a sequential execution model and for the annotations to also fit into this sequential model. To this end, this dissertation proposes annotations for a sequential execution model that inform the compiler about non-determinism without forcing the programmer to think in terms of parallelism. By annotating non-determinism, the programmer indicates that there is no single required order of execution or even a single correct output; rather, a multitude of execution orders and outputs are correct, though syntactically or semantically different. This section introduces two extensions to the sequential programming model that present this information to the compiler, allowing the extraction of parallelism.

4.2.1 Y-branch

    char c;
    dict = start_dictionary();
    while ((c = read(1)) != EOF) {
      profitable = compress(c, dict);

      @YBRANCH(probability=.000001)
      if (!profitable)
        dict = restart_dictionary(dict);
    }
    finish_dictionary(dict);

    (a) Y-branch

    #define CUTOFF 100000
    char c;
    dict = start_dictionary();
    int count = 0;
    while ((c = read(1)) != EOF) {
      profitable = compress(c, dict);

      if (!profitable) {
        dict = restart_dictionary(dict);
      } else if (count == CUTOFF) {
        dict = restart_dictionary(dict);
        count = 0;
      }
      count++;
    }
    finish_dictionary(dict);

    (b) Manual Choice of Parallelism

Figure 4.1: Motivating Example for Y-branch

Of those applications with many correct outputs, a large subset have outputs where some are more preferable than others. When parallelizing by hand, the developer must make a choice that trades off parallel performance against output quality. Instead, this flexibility should be given to the compiler, as it is often better at targeting the unique features of the machine it is compiling for. To this end, this dissertation proposes the use of a Y-branch in the source code. The semantics of the Y-branch are that, for any dynamic instance, the true path can be taken regardless of the condition of the branch [86]. The compiler is then free to generate code that pursues this path when it is profitable. In particular, this allows the compiler to balance the quality of the output with the parallelism achieved.

Figure 4.1a illustrates a case where the Y-branch can be used. The code is a simplified version of a compression algorithm that uses a dictionary. Heuristics are used to restart the dictionary at arbitrary intervals. Rather than insert code to split the input into multiple blocks, as is done in Figure 4.1b, the Y-branch communicates to the compiler that it can control when a new dictionary is started, allowing it to choose an appropriate block size. This gives the compiler the ability to break dependences on the dictionary and extract multiple threads. A probability argument informs the compiler of the relative importance of compression to performance. In the case of Figure 4.1a, a probability of .000001 was chosen to indicate the dictionary should not be reset until at least 100000 characters have been compressed. Determination of the proper probability is left to a profiling pass or the programmer. Simple metrics, such as the minimum number of characters, are often sufficient.

VELOCITY uses the Y-branch annotation [9] to extract more parallelism with iSpecPS-DSWP by allowing the compiler to make the reset logic execute at the beginning of a predictable loop iteration. First, a new outer loop is created around the original loop containing the Y-branch. The Y-branch is then transformed to break out of the original loop after it has executed 1/probability times. The original loop is executed once before entering the new outer loop. An iteration of the new outer loop first executes the code originally controlled by the Y-branch and then resumes the original loop. Because the reset logic executes at predictable intervals, the compiler can easily parallelize the new outer loop. Figure 4.2 shows how the compiler transforms the Y-branch annotation to produce code that can be parallelized. In Figure 4.2a the programmer has indicated that the compiler can choose to take the branch that controls the reset logic, and that it should not do so more than once every 100,000 iterations. Prior to parallelization, the compiler translates the loop containing the Y-branch to form a new outer loop, shown in Figure 4.2b, that can be parallelized automatically.

    char c;
    dict = start_dictionary();
    while ((c = read(1)) != EOF) {
      profitable = compress(c, dict);

      @YBRANCH(probability=.000001)
      if (!profitable)
        dict = restart_dictionary(dict);
    }
    finish_dictionary(dict);

(a) Sequential Y-branch Code

    char c;
    dict = start_dictionary();
    ...
    while (c != EOF) {
      int numchars = 0;
      dict = restart_dictionary(dict);

      while ((c = read(1)) != EOF) {
        profitable = compress(c, dict);
        if (numchars++ == 100000) break;
        if (!profitable)
          dict = restart_dictionary(dict);
      }
    }

(b) Transformed Y-branch Code

Figure 4.2: Y-branch Example

4.2.2 Commutative

Many functions have the property that multiple calls to the same function are interchangeable even though they maintain internal state. Figure 4.3a illustrates a code snippet for the rand function, which contains an internal dependence recurrence on the seed variable. Multiple calls to this function will be forced to execute serially due to this dependence. The Commutative annotation informs the compiler that the calls to rand can occur in any order.

In general, the Commutative annotation allows the developer to leverage the notion of a commutative mathematical operator, which can facilitate parallelism by allowing function calls to execute in any order. This annotation is similar to commutativity analysis [67].

    int seed;

    @Commutative
    int rand() {
      seed = seed * 2 + 1;
      return seed;
    }

    (a) Rand Example

    void *free_list;

    @Commutative(memalloc, ROLLBACK, free, ptr)
    void *malloc(size_t s) {
      void *ptr = allocate(free_list, s);
      return ptr;
    }

    @Commutative(memalloc, COMMIT, free, ptr)
    void free(void *ptr) {
      size_t size = allocated_size(ptr);
      free_list = deallocate(free_list, ptr);
    }

    (b) Malloc Example

Figure 4.3: Motivating Examples for Commutative

Both have the goal of facilitating parallelization by reordering operations. However, calls to Commutative functions are generally not commutative in the way that commutativity analysis requires. Commutativity analysis looks for sets of operations whose order can be changed but that result in the same data in the same locations. Commutative functions can be executed in an order that leads to different values in locations compared to the sequential version; the differences that may result are accepted by the programmer. Finally, the programmer annotates Commutative based on the definition of a function and not the many call sites it may have, making it easy to apply. Commutative is also similar to the Cilk inlet directive, as it ensures correct execution of code that executes in a non-deterministic fashion. However, inlet is meant to serially update state upon return from a spawned function, while Commutative is meant to facilitate parallelism by removing serialization. The annotation is also similar to the async and race directives used by the IPOT system [84]. In IPOT, the programmer annotates variables as races to indicate that a read can see a potentially stale value of the variable without causing misspeculation. However, a variable can only be marked race when the differing data values that result are confined to internal state and are not externally visible. IPOT can also mark code regions as async to allow them to execute in an order that ignores data dependences inside the region, leading to non-deterministic orderings. Though this annotation can achieve effects similar to Commutative, its purpose is to facilitate parallel execution, forcing the programmer to consider multi-threaded execution when applying it. Additionally, it is unclear how to use it in a speculative parallelization.

The semantics of the Commutative annotation are that, outside of the function, the outputs of the function call are only dependent upon the inputs to it. Additionally, there must exist a well-defined sequential ordering of calls to the Commutative function, though that order need not be determined a priori.

The Commutative annotation can also take an argument which indicates that groups of functions share internal state. When parallelizing, any function in the group must execute atomically with respect to every function in the group. For example, malloc and free access the free list shown in Figure 4.3b, so both are part of the memalloc Commutative set.

The compiler uses the Commutative annotation to extract more parallelism by ignoring internal dependence recurrences. Specifically, when building the iPDG, the Commutative function is not added to the iPDG, just its call sites. Any memory flow dependence between two call sites of the Commutative function is removed from the iPDG. When generating code, the Commutative function itself conceptually executes atomically when called and, inside the function, dependences that are local to the function are respected. By ensuring that the function executes atomically, a well-defined sequential sequence of calls to the Commutative function is guaranteed to exist in the absence of misspeculation.

The use of Commutative in a speculative execution environment requires additional care. In particular, there must always be a well-defined sequential sequence of calls to the Commutative function, particularly in the face of versioned memory and rollback or commit of versioned state. A well-defined ordering is maintained by ensuring that Commutative functions execute in non-transactional memory and that the effects of a call to a Commutative function can either be undone by a rollback or are deferred until commit, specified by the ROLLBACK and COMMIT values in the Commutative annotation.

To handle ROLLBACK, the Commutative annotation takes a function as one of its arguments. On each call to the Commutative function, the specified arguments and variables are appended to a list, and on rollback the given function is called with these arguments. For example, the rollback function for malloc is free, and the ptr variable is appended to a list for each call to malloc.

For COMMIT, the function is not executed immediately. Instead, its arguments are appended to a list. Note that this is only a viable option for functions without return values that do not modify state through pointers passed in as arguments. On commit of versioned memory to non-speculative state, the function is executed for each element of the list. For example, free is a COMMIT Commutative function. Were it a ROLLBACK Commutative function, a new malloc function would need to be written that could malloc a specific pointer rather than a size.
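A minimal sketch of this ROLLBACK/COMMIT bookkeeping follows, using malloc/free as in Figure 4.3. The log structure and the runtime entry points (spec_malloc, spec_free, on_rollback, on_commit) are hypothetical illustrations, not VELOCITY's actual runtime interface.

    #include <stdlib.h>

    /* One recorded action for a Commutative call: either an undo action
     * (ROLLBACK) or a deferred action (COMMIT). */
    typedef struct LogEntry {
        void (*fn)(void *);
        void  *arg;
        struct LogEntry *next;
    } LogEntry;

    static LogEntry *rollback_log;   /* undo actions, run if the transaction rolls back */
    static LogEntry *commit_log;     /* deferred actions, run when it commits           */

    static void log_push(LogEntry **log, void (*fn)(void *), void *arg) {
        LogEntry *e = malloc(sizeof *e);
        e->fn = fn; e->arg = arg; e->next = *log; *log = e;
    }

    /* ROLLBACK Commutative: execute now, record how to undo it. */
    void *spec_malloc(size_t s) {
        void *p = malloc(s);
        log_push(&rollback_log, free, p);
        return p;
    }

    /* COMMIT Commutative: defer execution until the transaction commits. */
    void spec_free(void *p) {
        log_push(&commit_log, free, p);
    }

    static void run_and_clear(LogEntry **log) {
        while (*log) {
            LogEntry *e = *log; *log = e->next;
            e->fn(e->arg);
            free(e);
        }
    }

    static void discard(LogEntry **log) {
        while (*log) { LogEntry *e = *log; *log = e->next; free(e); }
    }

    /* Called by the runtime when the enclosing transaction resolves. */
    void on_rollback(void) { run_and_clear(&rollback_log); discard(&commit_log); }
    void on_commit(void)   { run_and_clear(&commit_log);   discard(&rollback_log); }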

Chapter 5

Evaluation of VELOCITY

This chapter presents an evaluation methodology for the automatic parallelization framework presented in Chapter 3 and Chapter 4, as implemented in the VELOCITY compiler. Using several applications from the SPEC CINT2000 benchmark suite as case studies, the chapter also shows how this framework enables the automatic extraction of parallelism.

5.1 Compilation Environment

An automatic parallelization framework based on the techniques discussed in Chapter 3 was built in the VELOCITY [79] compiler. VELOCITY relies upon the IMPACT [69] compiler, a research compiler developed at the University of Illinois at Urbana-Champaign, to read in C code and emit assembly code for a virtualized Itanium 2 architecture. The IMPACT compiler can perform many advanced analyses and optimizations designed to produce good performing code for EPIC-like architectures [69]. Among these, VELOCITY only uses IMPACT's pointer analysis [14, 53] to identify both the call targets of function pointers and the points-to sets of loads, stores, and external functions.

VELOCITY contains many traditional optimizations, including global versions of copy propagation, reverse copy propagation, constant propagation, constant folding, dead code elimination, unreachable code elimination, algebraic simplification, redundant load and redundant store elimination, strength reduction, and partial-redundancy elimination, as well as a local version of value numbering. When used, these optimizations are applied exhaustively in a loop until there are no further optimization opportunities. Specialization optimizations, including inlining, loop unrolling, and superblock formation, have also been implemented, with their heuristics strongly influenced by IMPACT's. Finally, VELOCITY contains superblock-based scheduling and global register allocation routines, whose heuristics are also strongly influenced by IMPACT's. VELOCITY outputs assembly code targeting the Itanium 2 architecture; however, it does not currently produce code to take advantage of several of Itanium 2's more advanced features, including predication, single-threaded software pipelining, single-threaded control and data speculation, and prefetching.

After reading in lowered C code from the IMPACT compiler, all code in the VELOCITY compiler undergoes branch profiling on a training input set, traditional optimizations, inlining, and another round of traditional optimizations. After this, there are two paths that the code follows through the compiler. The standard non-parallelizing optimization path lowers the code to a machine-specific Itanium 2 IR, performs register allocation and scheduling, and then emits Itanium 2 assembly. This highly optimized single-threaded code forms the baseline for the experimental results. The standard parallelizing path first runs multiple profiling passes, including loop-sensitive versions of branch, silent store, alias, and loop-invariant value profiling. After this, the iSpecPS-DSWP automatic parallelization technique is applied. Just as in the single-threaded path, the code is then lowered, register allocated, and scheduled before being emitted as Itanium 2 assembly.

5.2 Measuring Multi-Threaded Performance

Obtaining the performance of the applications that have been parallelized is difficult. First, many of the loops that will be parallelized encompass more than 95% of the execution

time of the program. Since the applications run for many tens of billions, if not hundreds of billions, of cycles, and each iteration can take upwards of a billion cycles, the use of traditional simulation methodologies to target many-core processors is impractical at best. A combination of native execution and simulation was therefore used to efficiently measure parallel performance, as shown in Figure 5.1.

Figure 5.1: Parallel Performance Measurement Dependences

5.2.1 Emulation

Each parallel program is first run in a multi-threaded emulation environment, where the ISA extensions are replaced with function calls into an emulation library. Loads and stores are also replaced, since their semantics depend upon the memory versioning system. The resulting binary is then run natively on a multi-processor Itanium 2 system. This emulation serves several purposes. First, it can easily determine whether the benchmark runs correctly, often in a few hours. Second, it can produce the set of misspeculations that manifested during the application's execution. Lastly, it can produce a memory trace for each thread that will be used for cache simulation.

5.2.2 Performance

While the emulation run is useful for correctness and other information, its utility as a performance run is hindered by two factors. First, replacing every load and store with a function call greatly reduces any ILP that may exist. Second, emulation must wrap external function calls and emulate their effects in a library, and determining the proper number of cycles for these functions is problematic at best. Thus, another run of the program is performed; this time, an ordering is enforced on the multi-threaded execution that allows loads and stores to execute correctly without emulation.

Figure 5.2: Conceptual Performance Run Execution Schedule (stages A, B, and C of iterations 1–3 laid out across Cores 1–4 over time)

Figure 5.2 illustrates this execution order. Essentially, each iteration executes completely before the next iteration is allowed to run. That is, only when the last stage of an iteration completes is the first stage of the next iteration allowed to execute. So long as there is no intra-iteration memory value speculation, this execution plan is a legal multi-threaded execution that respects all memory flow dependences and false memory dependences without the use of versioned memory. Thus, loads and stores are not replaced, and memory version instructions become nops. However, for correct execution, produce, consume, queue.set, and queue.flush are replaced with calls into the emulation library. The multi-threaded application was run on an unloaded HP zx2000 workstation with a 900MHz Intel Itanium 2 processor and 2GB of memory, running CentOS 4.5. Runtime numbers were gathered using HP's pfmon tool version 3.0, which is based on version 3.1 of the libpfm performance monitoring library [24]. The runtime of an application was determined by running it under the pfmon tool with the CPU_CYCLES, BE_EXE_BUBBLE_GRALL, and BE_EXE_BUBBLE_GRGR events monitored. CPU_CYCLES represents the total number of cycles needed to execute the application. BE_EXE_BUBBLE_GRALL − BE_EXE_BUBBLE_GRGR represents the number of cycles spent stalling on an outstanding integer load. This number is subtracted from CPU_CYCLES, and a cache simulation's performance measurement is added back in.
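Restated as a single expression (the event names are pfmon's; the cache term is the stall count produced by the cache simulator described in Section 5.2.3), the per-run accounting is roughly:

\[
\text{cycles} \;=\; \text{CPU\_CYCLES}
  \;-\; \bigl(\text{BE\_EXE\_BUBBLE\_GRALL} - \text{BE\_EXE\_BUBBLE\_GRGR}\bigr)
  \;+\; \text{cycles}_{\text{cache sim}}
\]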

To allow for an accurate performance measurement, performance counters are turned off before entering an emulated instruction and turned back on just after returning. Itanium 2 contains single-instruction support for turning performance counters on (sum) and off (rum). While these instructions introduce bubbles into the pipeline, the cycles lost because of this can be accurately measured via the BE_EXE_BUBBLE_ARCR event and subtracted from the CPU_CYCLES measurement. Additionally, to avoid profiting from the use of separate copies of each cache, branch predictor, and other microarchitectural structures per core, the threads are constrained to execute on a single processor via the sched_setaffinity system call.
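A minimal sketch of the pinning step, assuming a Linux system with the glibc CPU-affinity macros; the choice of CPU 0 is arbitrary, and the actual measurement harness may differ.

#define _GNU_SOURCE
#include <sched.h>

/* Constrain the calling thread to a single processor (CPU 0 here). */
static int pin_to_one_cpu(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                               /* allow only CPU 0        */
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread  */
}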

Because loads and stores are not replaced and memory versioning is not done, memory state cannot be undone in the event of misspeculation. To avoid the need to roll back state, the performance run takes in the misspeculations that occurred during the emulation run and proactively signals misspeculation and branches to recovery code for any iteration that misspeculates.

While this run avoids parallel execution entirely, it produces an execution time for each (iteration, stage) pair, which a scheduler can then use to produce a multi-threaded performance number.

Private L1 Cache   16KB, 4-way, 64B line size, writeback, 1/1/1-cycle access/commit/merge latencies
Private L2 Cache   256KB, 8-way, 128B line size, writeback, 4/2/1-cycle access/commit/merge latencies
Shared L3 Cache    768KB, 12-way, 128B line size, writeback, 9/7/1-cycle access/commit/merge latencies
Shared L4 Cache    64MB, 256-way, 128B line size, writeback, 50/25/1-cycle access/commit/merge latencies
Main Memory        100 cycle access latency

Table 5.1: Cache Simulator Details

5.2.3 Cache Simulation

Since the stage performance run executed on a single processor, it did not benefit from effectively larger branch predictors or caches; however, it also did not pay any penalty due to cache coherence effects, particularly false sharing, or the use of versioned memory. To model the cache effects, the memory trace from the emulation run is fed into a multi-core cache simulator, the details of which are given in Table 5.1. The cache simulator models a 4-level hierarchy, with private L1 and L2 caches and shared L3 and L4 caches. The details of the versioned memory cache system can be found in [82]. The output of the cache simulator is a stall cycle count that each (iteration, stage) pair spent in the cache system. The simulated stall cycle count is added to the (iteration, stage) cycle count from Section 5.2.2, and the measured load stall cycle count (BE_EXE_BUBBLE_GRALL − BE_EXE_BUBBLE_GRGR) is subtracted from it.

5.2.4 Bringing it All Together

To determine the performance of the application given the (iteration, stage) performance counts, the set of misspeculating iterations, and the stage dependences of the iSpecPS-DSWP pipeline, a scheduler is used to simulate the performance of multi-threaded execution. The scheduler simulates a 32-core Itanium 2 machine, but does not directly model the microarchitectural effects of branch predictors or other microarchitectural state. Note that cache effects are represented by the stall times incorporated into the (iteration, stage) performance counts from the cache simulator. The scheduler executes the multi-threaded program assuming infinite-length queues and respecting the dependences among pipeline stages and misspeculating iterations. At the end, it outputs the cycle count of the last stage of the last iteration. This is the performance number of the multi-threaded code.
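A hedged sketch of the scheduler's core recurrence, under the assumption that stage s of iteration i may start only after stage s−1 of the same iteration and after stage s of the previous iteration (each stage's thread executes iterations in order); misspeculation restarts and queue-capacity effects, which the real scheduler also handles, are omitted here. cost[i][s] holds the measured (iteration, stage) cycle counts, already adjusted for cache stalls.

#include <stdint.h>

#define MAX_ITERS  4096
#define MAX_STAGES 8

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Returns the finish time of the last stage of the last iteration. */
uint64_t schedule(uint64_t cost[MAX_ITERS][MAX_STAGES], int iters, int stages) {
    static uint64_t finish[MAX_ITERS][MAX_STAGES];
    for (int i = 0; i < iters; i++) {
        for (int s = 0; s < stages; s++) {
            uint64_t ready = 0;
            if (s > 0) ready = max_u64(ready, finish[i][s - 1]);  /* earlier stage, same iteration  */
            if (i > 0) ready = max_u64(ready, finish[i - 1][s]);  /* same stage, previous iteration */
            finish[i][s] = ready + cost[i][s];
        }
    }
    return finish[iters - 1][stages - 1];
}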

5.3 Measuring Single-Threaded Performance

To evaluate the performance of the single-threaded baseline, the applications were run on the same machine as the multi-threaded applications. The integer load stalls were subtracted just as for the multi-threaded applications, yielding a performance number that ignores the effects of data caches. To get the number of cycles spent stalling on the cache, the binary is instrumented to produce a memory trace, which is fed into the same cache simulator used to measure the multi-threaded binary. The resulting penalty is added to CPU_CYCLES to get the final single-threaded performance number.

5.4 Automatically Extracting Parallelism

The automatic parallelization technique described in Chapter 3 can extract scalable parallelism for several applications. This section shows the results of applying VELOCITY to several benchmarks in the SPEC CINT2000 suite.

5.4.1 256.bzip2

To see how the many components of the framework extract parallelism, consider the 256.bzip2 application from the SPEC CINT2000 benchmark suite. 256.bzip2 compresses or decompresses a file using the Burrows-Wheeler transform and Huffman encoding. This dissertation focuses only on the compression portion of the benchmark. Each iteration of the compressStream function, shown in Figure 5.3a, compresses an independent block of data. The data from the file is run through a Run-Length Encoding filter and stored into the global block data structure. Because of this, the block array contains a varying number of bytes per data block, indicated by last. Assuming

compressStream(inStream, outStream) {
  while (True) {
    blockNo++;
    initialiseCRC();
    loadAndRLEsource(inStream);
    if (last == -1) break;
    doReversibleTransformation();
    moveToFrontCodeAndSend();
  }
}

(a) compressStream pseudo-code   (b) compressStream DAGSCC

Figure 5.3: Simplified version of compressStream from 256.bzip2

data bytes exist to compress, they are compressed by the doReversibleTransform and generateMTFValues functions before being appended to the output stream by sendMTFValues.

Versions of the bzip2 algorithm that compress independent blocks in parallel have been implemented in parallel-programming paradigms [3]. The parallelization performed manually with mutexes and locks decomposes the original loop in compressStream into three separate loops, shown in Figure 5.3b. The first stage reads in blocks to compress, followed by a stage to compress them, finishing with a stage to print the compressed blocks in order. Most of the parallelism extracted comes from executing multiple iterations of the second stage in parallel.
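The structure of that decomposition can be sketched as three loops over the blocks of a file; the block_t type and the per-stage helper functions below are hypothetical stand-ins, and only the read / compress / write-in-order shape mirrors the decomposition described above. In the parallel version, the middle loop is the one whose iterations run concurrently.

#define MAX_BLOCKS 64

typedef struct { int id; int nbytes; } block_t;

extern int  read_block(block_t *b);        /* stage 1: sequential reader           */
extern void compress_block(block_t *b);    /* stage 2: blocks are independent      */
extern void write_block(const block_t *b); /* stage 3: sequential, preserves order */

void compress_stream_decomposed(void) {
    block_t blocks[MAX_BLOCKS];
    int n = 0;

    while (n < MAX_BLOCKS && read_block(&blocks[n]))   /* read stage            */
        n++;

    for (int i = 0; i < n; i++)                        /* compress stage        */
        compress_block(&blocks[i]);

    for (int i = 0; i < n; i++)                        /* write stage, in order */
        write_block(&blocks[i]);
}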

The same pipelined parallelism extracted manually can also be extracted using the DSWP [55, 63] technique. Unfortunately, the base technique cannot extract the DOALL parallelism present in the manual parallelization, as it only partitions the static body of a loop. In the case of 256.bzip2, DSWP could extract a 4-stage pipeline, placing doReversibleTransform and generateMTFValues in their own pipeline stages.

However, this limits the speedup achievable to ∼2x in practice, as the doReversibleTransform stage alone takes ∼50% of the loop's runtime. The PS-DSWP extension [61] works around this by allowing multiple iterations of a stage to execute in parallel, so the Compress stage can be replicated across several cores. Unfortunately, the compiler is unable to prove that no loop-carried memory dependences exist in the loop. In particular, because the block array is reused across iterations, the compiler must conservatively assume that writes to these memory locations feed reads in the next iteration across the loop's back edge. Instead, a memory profiler [82] is used to find memory dependences that rarely or never manifest themselves at runtime, allowing the framework to break them with speculation. To avoid excessive misspeculation, the profiler must also determine whether each dependence is intra-iteration or inter-iteration, as well as the relative number of times it manifests itself.

Figure 5.4: Dynamic Execution Pattern of Compress (the Read, Compress, and Print stages laid out across Cores 1–4, with the Compress stage replicated)

The reuse of the block array presents another problem: because the same memory locations are read and written every iteration, many loop-carried false memory dependences arise. To break these dependences, each iteration is placed in a separate ordered version memory transaction.

Figure 5.5: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 256.bzip2

Lastly, the parallelization technique must be able to parallelize not just a loop, but the functions called, directly or indirectly, from the loop. In 256.bzip2, the read and print phases are the bsR (bitstream Read) and bsW (bitstream Write) functions, respectively. These functions are found deep in the call tree, making it hard to expose the code inside for parallelization. Inlining until all calls to these functions are visible at the outermost loop level is not a practical solution, as the loop quickly becomes too large to analyze. Thus, iSpecPS-DSWP is needed to extract the expected parallelization.

The framework extracts a parallelization that executes according to the dynamic execution plan shown in Figure 5.4. As the number of Compress stage threads is increased, performance increases as shown in Figure 5.5. Though the parallelization relies on speculation to remove dependences, no misspeculation occurs, as no flow memory dependence is actually loop-carried in the Compress stage. Thus, the speedup is limited by the input file's size and the level of compression used. Since the file size is only a few megabytes and the compression level is high, only a few independent blocks exist to compress in parallel, and the speedup levels off after 14 threads.

5.4.2 175.vpr

try_swap() {
  blk_from = rand();
  ax = blk_from.x; ay = blk_from.y;
  do {
    bx = rand(), by = rand();
  } while ((ax != bx) && (ay != by));
  blk_to = find_block(bx, by);

  swap(blk_from, blk_to);

  affected_nets = find_affected_nets(blk_from, blk_to);
  if (assess_swap(affected_nets, blk_from, blk_to)) {
    update_nets(affected_nets);
    reset_nets(affected_nets);
  } else {
    unswap(blk_from, blk_to);
    reset_nets(affected_nets);
  }
}

try_place() {
  while (cond) {
    for (iter = 0; iter < limit; iter++) {
      if (try_swap()) success++;
    }
  }
}

Figure 5.6: Simplified version of try_place from 175.vpr

175.vpr is an application that performs FPGA place and route calculations. As in previous work [59], this dissertation focuses on the placement portion of the algorithm, which is distinct from the routing portion. Placement consists of repeated calls to try_swap in the try_place function. try_swap, as its name implies, attempts to switch a block to a random position, also switching the block at that position if one is already there. A pseudo-random number generator is used to choose a block and an (x, y) position to move the block to. If either coordinate is the same as the chosen block's x or y value, then a new random number is generated until they are distinct. The block's coordinates are then updated and the cost of updating the connecting networks is calculated. If the cost is below a certain threshold, the swap is kept; otherwise, the block's coordinates are reverted to their previous values.

Figure 5.7: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 175.vpr

The calls to try_swap can often be executed in parallel. Predicting dependences among these iterations a priori is hard, to the point that an alias speculation rate of 90% and a control speculation rate of 30% are required to produce a good partition. However, in many cases the value at the beginning of an iteration will be the same as it was in a previous iteration. If a loop-carried invariant speculation rate of 10% is used, then a better partitioning can be achieved that has a much lower rate of misspeculation. The iteration-invariant misspeculation occurs when updates to the block coordinates and network structures are not properly read by subsequent iterations, which happens, on average, every 10 iterations.

The overall performance of 175.vpr on up to 32 threads is shown in Figure 5.7. Interestingly, misspeculation is highest in early iterations of the try_place loop, with an average of 6 iterations between misspeculations, where swaps are accepted more often. It drops to almost zero in later iterations, with an average of 30 iterations between misspeculations, where few swaps are accepted. Thus, good parallel performance requires many threads in later iterations, where the speculation succeeds more than 95% of the time, to balance out the more serial performance of earlier iterations.

5.4.3 181.mcf

181.mcf is an application that solves the single-depot vehicle scheduling problem in public mass transportation, essentially a combinatorial optimization problem solved using a network simplex algorithm. The high-level loop occurs in global_opt, which calls both the price_out_impl and primal_net_simplex functions.

Parallelizing the global_opt loop in 181.mcf requires a large amount of speculation that limits any speedup obtainable. However, the outer loops in the price_out_impl and primal_net_simplex functions are parallelizable.

void primal_net_simplex(net) {
  while (!opt) {
    if (bea = primal_bea_mpp(m, arcs, stop_arcs, ...)) {
      ...
      if (iterations % 20) {
        refresh_potential();
      }
    }
  }
  refresh_potential(net);
}

long price_out_impl() {
  for (; i < trips; i++, arcout += 3) {
    head = arcout->head;
    arcin = first_of_sparse_list->tail->mark;
    while (arcin) {
      tail = arcin->tail;
      red_cost = compute_red_cost(arc_cost, tail);
      update_arcs(tail, head, new_arcs, red_cost);
      arcin = tail->mark;
    }
  }
}

void global_opt() {
  while (new_arcs) {
    primal_net_simplex(net);
    new_arcs = price_out_impl(net);
  }
}

Figure 5.8: Simplified version of global_opt from 181.mcf

(a) primal_net_simplex Speedup

Figure 5.9: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 181.mcf

primal_net_simplex takes approximately 70% of the execution time and can be parallelized by executing refresh_potential in parallel with the rest of primal_net_simplex. This is primarily accomplished by applying silent store speculation to the node->potential updates in refresh_potential. Essentially, the parallelization speculates that refresh_potential will not change the actual potential of any node in the tree, which is almost always the case. Speedup is limited not by misspeculation, but by the largest stage, which contains approximately 40% of the runtime.
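The effect of silent store speculation here can be sketched as a guard around the update; node_t and the misspeculate() recovery hook are hypothetical stand-ins, and in practice the check is inserted by the compiler rather than written by hand.

typedef struct node { long potential; } node_t;

extern void misspeculate(void);              /* branch to recovery code */

static void update_potential_speculated(node_t *n, long computed) {
    if (computed == n->potential)
        return;                              /* silent store: no write, no dependence */
    misspeculate();                          /* value actually changed: recover       */
}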

The other 30% of the runtime occurs in price_out_impl. The outer loop in price_out_impl can be parallelized into three stages. The first stage is very small, 1% of the runtime, and SEQUENTIAL. It contains a few branches that control whether the loop body continues executing, as well as the arcout->head->firstout->head->mark update. The second stage is DOALL and contains the majority of the runtime, 75%, and the majority of the code. The third stage is SEQUENTIAL and contains the code from insert_new_arc and replace_weaker_arc that inserts new arcs into a priority queue backed by a heap. Misspeculation is not the limiting factor on speedup, as the parallelization uses speculation only to break potential alias dependences that never manifest.

(b) price_out_impl Speedup

(c) Benchmark Speedup

Figure 5.9: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 181.mcf (cont.)

Figure 5.9 shows the speedup achieved when parallelizing the price_out_impl and primal_net_simplex loops. Since the price_out_impl parallelization contains a DOALL node, it is able to achieve some scalable parallelism, and its speedup tops out at 2.61x. The primal_net_simplex parallelization contains no DOALL nodes, so its speedup is always 1.5x for 4 or more threads.

5.5 Extracting Parallelism with Programmer Support

Combining the parallelization technique of Chapter 3 with the programmer annotations described in Chapter 4, VELOCITY can extract parallelism from several more benchmarks in the SPEC CINT2000 suite.

5.5.1 164.gzip

deflate() {
  while (lookahead != 0) {
    match = find_longest_match();
    if (match)
      reset_needed = tally_match(match);

    @YBRANCH
    if (reset_needed)
      reset();

    while (lookahead < MIN_LOOKAHEAD)
      fill_window();
  }
}

void crc(short *s, int n) {
  while (--n) {
    int indx = (crc ^ (*s++)) & 0xff;
    crc = crc_32_tab[indx] ^ (crc >> 8);
  }
}

Figure 5.10: Simplified version of deflate from 164.gzip.

164.gzip is another compression and decompression application, which uses the Lempel-Ziv 1977 algorithm. As with 256.bzip2, this dissertation focuses on the compression portion of the benchmark. Each file is compressed by either the deflate or deflate_fast function, which share almost the same code. The implementation of the algorithm uses a sliding window of 65536 characters to look for repeated substrings. The compression itself is delineated by blocks, ended by calls to FLUSH_BLOCK. Unlike 256.bzip2, the choice of when to end compression of the current block and begin a new block is made based on various factors related to the compression achieved on the current block. This dependence makes it impossible to compress blocks in parallel, as it is very hard to predict the point at which a new block will begin.

(a) deflate_fast Speedup

(b) deflate Speedup

Figure 5.11: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 164.gzip

Manually parallelized versions of the gzip algorithm insert code to ensure that a new block is started at fixed intervals. To allow the benchmark to be parallelized, a similar change was made to the source code to always start a new block at a fixed interval. Upon starting a new block, the sliding window is reset, avoiding cross-block matches. This code is represented by the call to reset highlighted in Figure 5.10.
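A minimal sketch (with hypothetical helper names and an assumed interval) of the source change described above: a new block is forced at a fixed interval so that blocks compress independently, and the sliding window is reset to avoid cross-block matches. The real change in 164.gzip is the reset() call shown in Figure 5.10.

#define BLOCK_RESET_INTERVAL (1024 * 1024)   /* assumed 1MB; see the compression-loss discussion below */

extern void flush_current_block(void);       /* stand-in for gzip's FLUSH_BLOCK   */
extern void reset_sliding_window(void);      /* stand-in for the reset() routine  */

void maybe_start_new_block(unsigned long *bytes_in_block, unsigned long consumed) {
    *bytes_in_block += consumed;
    if (*bytes_in_block >= BLOCK_RESET_INTERVAL) {
        flush_current_block();               /* end the current block             */
        reset_sliding_window();              /* prevent cross-block matches       */
        *bytes_in_block = 0;
    }
}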

Unfortunately, this change can cause the application to achieve a lower compression ratio than the single-threaded version. For the 32MB input file, using the smallest block size of 64KB, the loss in compression was less than 0.5% at the highest compression level. The loss in compression decreased to less than 0.1% when 1MB blocks were used. When there are multiple processors, this tradeoff between compression and performance is acceptable. However, this loss of compression should only occur if parallelization was achieved. In particular, if only a single processor is available, the same compression level should be achieved as with the original benchmark.

(c) Benchmark Speedup

Figure 5.11: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 164.gzip (cont.)

Figure 5.11 shows the speedup achieved on the 32MB input file for a block size of 132KB. Little speculation is needed, mostly alias speculation on various arrays as in 256.bzip2, and no misspeculation occurs in practice. Additionally, since the block size is smaller than that in 256.bzip2 and the file size is larger, there is sufficient work to keep more threads busy. The limiting factor for the speedup is the calculation of the Cyclic Redundancy Check (crc). The crc computation is isolated to its own thread, but takes approximately 3-4% of the runtime for both the deflate and deflate_fast loops. The code is amenable to single-threaded optimizations that would reduce its execution time, including software pipelining and load combining, which have not been implemented in VELOCITY.

5.5.2 197.parser

while (1) {
  sentence = read();
  if (!sentence) break;
  print(sentence);
  err = parse(sentence);
  errors += err;
  if (err < 0) exit();
}
print errors;

Figure 5.12: Simplified version of batch_process from 197.parser

Figure 5.13: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 197.parser

197.parser is an application that parses a series of sentences, analyzing them to see if they are grammatically correct. The loop in the batch_process function is the outermost loop, and the parse function call dominates the runtime of each iteration. As each sentence is grammatically independent of every other sentence, parsing can occur in parallel for each sentence. Previous work [21] has used speculation to break dependences between iterations. A sentence may be a command for the parser rather than a sentence to parse, turning echo mode on or off, for example. However, speculation is not required for this application if these operations are placed into a SEQUENTIAL stage. The loss in parallelization is limited, as a majority of the time is taken up by the parse call.

To achieve a parallelization that parses each sentence in parallel, the memory allocator must also be dealt with. Upon startup, 197.parser allocates 60MB of memory, which it then manages internally. To avoid dependences from the memory allocator interfering with parallelization, the Commutative annotation is used once again. This allows the compiler to break dependences across iterations arising from calls to the allocator.
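The annotation itself might look roughly like the following; the syntax follows the @YBRANCH style used in Figure 5.10, and the allocator entry points shown are hypothetical stand-ins for 197.parser's internal routines. Marking them Commutative lets calls from different iterations execute in either order, removing the loop-carried dependence on the allocator's internal bookkeeping.

@Commutative
void *internal_alloc(unsigned int size);

@Commutative
void internal_free(void *ptr);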

The standard SEQUENTIAL, DOALL, SEQUENTIAL parallelization paradigm is used. Its speedup is shown in Figure 5.13. The limiting factors on speedup are the number of sentences to parse and the high variance, up to 10x, in the time needed to parse a particular sentence.

5.5.3 186.crafty

Search(alpha, beta, depth, white) {
  initial_alpha = alpha;
  while (move = NextMove(depth, white)) {
    MakeMove(move, depth, white);
    value = -Search(-alpha-1, -alpha, !white, depth+1);
    if (value > alpha) {
      if (value >= beta) {
        UnMakeMove(move, depth, white);
        return value;
      }
      alpha = value;
    }
    UnMakeMove(move, depth, white);
  }
  return alpha;
}

Figure 5.14: Simplified version of Search from 186.crafty

186.crafty is an application that plays chess. The high-level loop reads in a chess board and a search depth n. For each iteration of this loop the Iterate function is called, which executes repeated searches from a depth of 1 to n. To perform this, Iterate calls the SearchRoot function which calls the recursive Search function to perform an alpha-beta search. For each level of the search, several moves are computed, each of which is recursively searched and evaluated to determine the most profitable path. The

most profitable move is then stored and the alpha value is returned.

Figure 5.15: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 186.crafty

The most obvious parallelization is to search each of the moves at one level of Search independently, similar to the way the application has been parallelized by hand. Unfortunately, Search is recursive, a problem that has hindered previous work in parallelization [59]. To solve this problem, the compiler conceptually peels off one level of the recursion, allowing the loop at that level to be parallelized.

This parallelization requires that loads of the search variable, which contains many fields related to the current search, be speculated as loop-carried invariant. This is always true, but is very hard for the compiler to predict, as it requires understanding that UnMakeMove undoes the effects of MakeMove. Additionally, control speculation is needed so that certain cutoff checks related to timing and the number of nodes searched do not prevent parallelization. In particular, the branch on the next_time_check variable in Search must be speculated as not taken.

An additional problem arises in the parallelization. Since the search reaches many of the same boards, a large amount of caching is used to prune the search space and improve performance. Unfortunately, these caches prevent parallelism, as the dependence from a store into a cache to a later load from that cache is hard to predict. Alias speculation can allow the parallelization to continue, but misspeculation still limits performance. Instead, this dissertation relies on the programmer to mark each cache lookup and insertion function as Commutative. This includes references to the pawn hash table, trans_ref, history, and killer move caches.

Finally, reads of and updates to the alpha variable and the largest positional score are also marked as Commutative. Both of these values are monotonically increasing, and marking them as Commutative only means that the search may explore a larger space than it would have otherwise. In practice, the number of extra moves explored is negligible, as neither value is updated often.

Unfortunately, even though the parallelization framework extracts a SEQUENTIAL, DOALL, SEQUENTIAL parallelization where more than 99% of the runtime occurs in the DOALL stage, the speedup is only 1.15x. This occurs not because of misspeculation, which occurs only on loop exit, but because the amount of time it takes to search a particular move is highly variable due to the aggressive pruning performed during the search. In particular, in most invocations of Search, a single iteration contains more than 90% of the runtime.

5.5.4 300.twolf

while (attempts < attmax) {
  a = rand();
  do {
    b = rand();
  } while (a != b);

  if (cond) {
    ucxx2();
  } else
    ...
}

Figure 5.16: Simplified version of uloop from 300.twolf

300.twolf is an application that performs place and route simulation. The innermost loop that comprises most of the execution time is in the uloop function, which contains many calls to ucxx2. The ucxx2 function takes up approximately 75% of the execution

time. Previous work has executed portions of the ucxx2 code in parallel via speculative pipelining [59]. This dissertation instead parallelizes the calls to ucxx2 themselves by parallelizing iterations of the loop in uloop.

Figure 5.17: Speedup of multi-threaded (MT) execution over single-threaded (ST) execution for 300.twolf

Predicting which iterations can execute in parallel a priori is hard, so branch and alias speculation are used to achieve the parallelization. However, the misspeculation rate greatly limits the amount of parallelism extracted [59]. This misspeculation comes from two sources: misprediction of the number of calls to the pseudo-random number generator, and memory dependence violations on the block and network structures.

Misspeculation on the random number generator occurs because the number of calls to it varies from iteration to iteration. In previous work [59], this dependence has been broken by manually speculating the number of calls to the generator and predicting the next iteration's seed. However, it is counterintuitive for parallelism to be limited by the generation of random numbers. To avoid the misspeculation and allow the compiler to see the parallelism, the random number generator is marked as Commutative by the programmer. This allows the calls to the generator to occur in any order and breaks the dependence that the generator carries across iterations through the randVarS variable. Though the output changes as a result, the benchmark still runs as intended.
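A sketch of the programmer change described above (annotation syntax as in Figure 5.10; the generator below is a hypothetical stand-in for 300.twolf's internal generator and its randVarS seed). The annotation permits calls from different iterations to execute in any order, so the seed update no longer serializes the loop; the output may differ but remains a valid placement.

static unsigned long randVarS;                      /* generator seed carried across calls */

@Commutative
int twolf_rand(void) {
    randVarS = randVarS * 1103515245UL + 12345UL;   /* illustrative LCG, not twolf's actual recurrence */
    return (int)(randVarS >> 16) & 0x7fff;
}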

With these changes, the parallelization is able to achieve a speedup of 1.10x, as shown in Figure 5.17. Misspeculation occurs quite often, approximately every other iteration. Because the pipeline is restarted on every misspeculation, pipeline fill time is on the critical path and lowers the speedup attainable.

5.6 Continuing Performance Trends

This chapter has shown that VELOCITY can extract parallelism from applications in the SPEC CINT2000 benchmark suite, even though these programs are generally considered not to be amenable to automatic parallelization because of the number and complexity of the dependences involved. This section summarizes the results of the dissertation and compares them to aggressive manual parallelization.

Benchmark              Loop (source location)                   Exec. Time   Lines Changed   Lines Annotated   Description
164.gzip (Compress)    deflate_fast (deflate.c:583-655)         30%          26              2                 Gzip Compression
                       deflate (deflate.c:664-762)              70%
175.vpr (Place)        try_place (place.c:506-513)              100%         0               0                 Processor Node Placement
181.mcf                price_out_impl (implicit.c:228-273)      25%          0               0                 Single-Source, Multi-Depot Optimization Problem
                       primal_net_simplex (psimplex.c:50-138)   75%
186.crafty             Search (search.c:218-368)                98%          9               9                 Chess Playing Program
197.parser             batch_process (main.c:1522-1779)         100%         2               2                 English Grammar Checker
256.bzip2 (Compress)   compressStream (bzip2.c:2870-2919)       100%         0               0                 Bzip2 Compression
300.twolf              uloop (uloop.c:154-361)                  100%         1               1                 Processor Layout

Table 5.2: Information about the loop(s) parallelized, the execution time of the loop, and a description.

Figure 5.18: The maximum speedup achieved on up to 32 threads over single-threaded execution. The Manual bar represents the speedup achieved using manual parallelization [9], while the VELOCITY bar represents the speedup achieved by VELOCITY.

Figure 5.19: The minimum number of threads at which the maximum speedup occurred. The Manual bar represents the number of threads used by manual parallelization [9], while the VELOCITY bar represents the number of threads used by VELOCITY.

Table 5.2 gives a summary of the applications from the SPEC CINT2000 suite that were parallelized by VELOCITY, including the loop, its percent of execution, and a general description of the application. For these benchmarks, as well as the remaining C benchmarks in SPEC CINT2000, Figure 5.18 shows the maximum speedup achieved for each benchmark, while Figure 5.19 shows the minimum number of threads at which the maximum speedup was achieved. The Manual bar represents the speedup achieved via manual application of the parallelization techniques in this dissertation [9], while the VELOCITY bar represents the speedup that was achievable automatically by the VELOCITY compiler. For the benchmarks that were automatically parallelized, the parallelizations obtained match those that were simulated manually [9]. There are several reasons for the performance differences:

1. The manual parallelizations used the gcc compiler, which contains different optimizations and heuristics than the VELOCITY compiler.

2. The manual parallelization performance evaluation did not account for the effects of multiple caches or cache coherence among the cores.

3. The manual parallelization performance evaluation was achieved by manually marking portions of the single-threaded binary that could execute in parallel and then feeding these regions into a parallel execution simulator similar to that described in Section 5.2. The VELOCITY performance evaluation uses real thread partitionings, chosen automatically, that generate actual multi-threaded code and which contain many overheads, including synchronization instructions and extra branches to ensure proper control flow.

As in the manual parallelization work, 252.eon was not parallelized because it is a C++ benchmark. 176.gcc, 253.perlbmk, 254.gap, and 255.vortex were not parallelized by VELOCITY because of the difficulties, particularly with regard to memory usage, in building the iPDG. The smallest of these benchmarks, 255.vortex, would require the creation of an iPDG with a hundred thousand nodes and more than 25 million edges. The vast majority of these edges are memory dependence edges, with fewer than a million control or register edges. The iPDG can be made more scalable by selectively expanding function calls, rather than expanding all functions as the current iSpecPS-DSWP implementation does. The downside of not expanding all functions is that the summarization of call sites potentially creates larger SCCs and forces the partitioner to allocate an entire function to a single thread rather than distributing it among multiple threads. Techniques that summarize operations at a granularity chosen by the compiler, particularly dependence chains, would also reduce the number of edges in the iPDG while retaining the information needed for parallelization.

Finally, even though the 186.crafty parallelization for the Search function is the same, the performance extracted is much lower because VELOCITY does not currently support parallelizing loops inside loops that are already parallelized. Thus, only one loop could be parallelized, leading to execution time that is heavily biased to only a few iterations in the loop. This limitation also results in reduced performance potential in 181.mcf.

Even with these difficulties, targeting a 32-core system, a speedup of 3.64x was achieved by changing only 38 out of more than 100,000 lines of code. As the manual parallelizations show, the speedups obtained are not an upper bound. Indeed, there are many inner loops that have not yet been parallelized.

Benchmark    # Threads   Speedup   Historic Speedup   Ratio
164.gzip     32          20.61     5.38               3.83
175.vpr      14          2.37      2.43               0.98
181.mcf      9           1.74      2.09               0.83
186.crafty   10          1.14      2.17               0.52
197.parser   32          13.98     5.38               2.60
256.bzip2    14          5.66      2.43               2.82
300.twolf    5           1.10      1.72               0.64
GeoMean                  3.24      2.80               1.16
ArithMean    16.58                 3.09

Table 5.3: The minimum # of threads at which the maximum speedup occurred. Assuming a 1.4x speedup per doubling of cores, the Historic Speedup column gives the speedup needed to maintain existing performance trends. The final column gives the ratio of actual speedup to expected speedup.

Table 5.3 presents the actual VELOCITY numbers for Figures 5.18 and 5.19. Additionally, the Historic Speedup is given, which details the expected performance increase for the number of transistors used. While there are no statistics that directly relate the doubling of cores to performance improvement, historically, the number of transistors on a chip has doubled approximately every 18 months, while performance has doubled approximately every 36 months. Assuming that all new transistors are used to place new cores on a chip, each doubling of cores must yield approximately a 1.4x speedup to maintain existing performance trends. Thus, the Historic Speedup column represents the expected speedup for the number of threads, calculated as 1.4^(log2(#Threads)). The final column gives the ratio of the actual performance improvement to that required to maintain the 1.4x speedup. The overall performance improvement indicates that sufficient parallelism can be extracted to use the resources of current and future many-core processors.
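For example, at 32 threads the expected speedup works out as follows, matching the Historic Speedup entries for 164.gzip and 197.parser in Table 5.3:

\[ 1.4^{\log_2 32} \;=\; 1.4^{5} \;\approx\; 5.38 \]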

Chapter 6

Conclusions and Future Directions

Multi-core processors have already posed a severe challenge to computer architecture. Unless compilers begin automatically extracting parallelism, multi-core processors also promise to significantly increase the burden on programmers. Relying on the programmer to correctly build large parallel programs is not a tenable solution, as programming is already a complex process and making it harder is not a way forward. Additionally, this solution does not address the large body of legacy sequential codes. This dissertation shows that automatic extraction of parallelism is possible for many of the benchmarks in the SPEC CINT2000 benchmark suite and is a viable methodology for continuing existing performance trends for legacy sequential codes.

This dissertation has proposed and built a new parallelization framework designed to extract parallelism from general-purpose programs. It combines existing Pipelined Multi-Threading techniques, including DSWP, PS-DSWP, and SpecDSWP, with a novel interprocedural scope extension to form iSpecPS-DSWP. This technique, along with existing analysis and optimization techniques, can be used to extract large amounts of parallelism. For those applications that are not parallelized by this framework, this dissertation proposes simple additions to the standard sequential programming model that allow the framework to parallelize them. In particular, the SPEC CINT2000 benchmark suite

was used as a case study to show that this framework and programming model can be applied to many applications. This allows a software developer to develop in a sequential programming model, but still obtain parallel performance. In an initial implementation, the VELOCITY compiler obtained a speedup of 3.64x while modifying only a few dozen lines out of one hundred thousand.

6.1 Future Directions

Potential future research based on this dissertation is focused on making it easier for automatic tools to extract parallelism from larger regions, in both sequential and parallel programs, while minimizing the burden placed on the programmer.

The ability to express non-determinism in a sequential programming language is a powerful technique that can be leveraged to facilitate parallelism in many applications, not just those in the SPEC CINT2000 benchmark suite. A base set of annotations useful across a wide variety of application domains should be enumerated by studying more applications. While these non-deterministic annotations are simpler than the use of locks and other parallel constructs, they still increase programmer overhead because they cannot be statically checked for correctness. Additional research into program analysis and annotations can help programmers identify how the non-determinism introduced by annotations affects program output or other aspects of program execution.

In addition to static compilation frameworks that can extract parallelism using these annotations, dynamic compilation frameworks can potentially extract more parallelism. Dynamic compilation systems have many useful properties, particularly profile information relevant to the current run and often better debugging information. The former is useful to facilitate better speculation, which often relies on expensive and potentially inaccurate static profiling. The latter is particularly important for automatic parallelization, as the programmer may not realize that the compiler has parallelized the program and yet needs

to debug a multi-threaded program. As the number of cores on a processor grows, it also becomes more feasible to set aside one (or even several) cores specifically for the purpose of dynamically optimizing a program. Balancing this process with a running application will be an interesting area of future research.

Beyond sequential programs and runtime systems, there are many parallel programs that have been and will continue to be written, either because the parallel programming paradigm sometimes matches the programming domain, as in hardware simulation, or because programmers feel that they can extract the parallelism themselves. For the former, the conceptual parallelism of the programming domain will only rarely match that of the underlying architecture, making it necessary to re-parallelize the application to aggressively utilize the underlying resources. For the latter, retargeting the program to a new architecture often entails significant rewriting of code to extract equivalent performance on the new machine. Finally, as multicore machines approach 10s or 100s of processors, even parallel programming models will need to rely on automatic thread extraction techniques to enable good performance. Because of these factors, research should continue into the application of automatic thread extraction techniques not just to sequential programs but to parallel programs as well.

Bibliography

[1] Message Passing Interface (MPI). http://www.mpi-forum.org.

[2] OpenMP. http://openmp.org.

[3] Parallel BZIP2 (PBZIP2) Data Compression Software. http://compression.ca/pbzip2.

[4] Standard Performance Evaluation Corporation (SPEC). http://www.spec.org.

[5] Haitham Akkary and Michael A. Driscoll. A Dynamic Multithreading Processor. International Symposium on Microarchitecture, pages 226–236, Dallas, Texas, De- cember 1998.

[6] Ronald D. Barnes, Shane Ryoo, and Wen-mei W. Hwu. “Flea-Flicker” Multipass Pipelining: An Alternative to the High-Powered Out-of-Order Offense. International Symposium on Microarchitecture, pages 319–330, Barcelona, Spain, November 2005.

[7] D. Baxter, R. Mirchandaney, and J. H. Saltz. Run-Time Parallelization and Schedul- ing of Loops. Symposium on Parallel Algorithms and Architectures, pages 303–312, Santa Fe, New Mexico, June 1989.

[8] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger,

Lawrence Rauchwerger, and Peng Tu. Parallel Programming with Polaris. Computer, 29(12):78–82, December 1996.

[9] Matthew J. Bridges, Neil Vachharajani, Yun Zhang, and David I. August. Revisiting the Sequential Programming Model for Multi-Core. International Symposium on Microarchitecture, Chicago, Illinois, December 2007.

[10] Matthew J. Bridges, Neil Vachharajani, Yun Zhang, and David I. August. Revisiting the Sequential Programming Model for the MultiCore Era. IEEE Micro, January 2008.

[11] Michael Burke and Ron Cytron. Interprocedural Dependence Analysis and Paral- lelization. Symposium on Compiler Construction, pages 162–175, Palo Alto, Califor- nia, October 1986.

[12] Brian D. Carlstrom, Austen McDonald, Hassan Chafi, JaeWoong Chung, Chi Cao Minh, Christos Kozyrakis, and Kunle Olukotun. The Atomos Transactional Program- ming Language. Programming Language Design and Implementation, pages 1–13, Ottawa, Canada, June 2006.

[13] R. S. Chappel, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous Subordinate Microthreading. International Symposium on Computer Architecture, pages 186–195, Atlanta, Georgia, May 1999.

[14] Ben-Chung Cheng and Wen-mei W. Hwu. Modular Interprocedural Pointer Analysis Using Access Paths: Design, Implementation, and Evaluation. Programming Language Design and Implementation, pages 57–69, Vancouver, Canada, June 2000.

[15] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. P. Shen. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. Interna- tional Symposium on Computer Architecture, Anchorage, Alaska, May 2002.

[16] James C. Corbett. Evaluating Deadlock Detection Methods for Concurrent Software. IEEE Transactions on Software Engineering, 22(3):161–180, 1996.

[17] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill, 1990.

[18] R. Cytron. DOACROSS: Beyond Vectorization for Multiprocessors. International Conference on Parallel Programming, pages 836–884, University Park, Pennsylvania, August 1986.

[19] James Russell Beckman Davies. Parallel Loop Constructs for Multiprocessors. Mas- ter’s Thesis, Department of Computer Science, Department of Computer Science, University of Illinois, Urbana-Champaign, UIUCDCS-R-81-1070, University of Illi- nois, Urbana, IL, May 1981.

[20] Claudio Demartini, Radu Iosif, and Riccardo Sisto. A Deadlock Detection Tool for Concurrent Java Programs. Software: Practice and Experience, 29(7):577–603, John Wiley & Sons, 1999.

[21] Chen Ding, Xipeng Shen, Kirk Kelsey, Chris Tice, Ruke Huang, and Chengliang Zhang. Software Behavior Oriented Parallelization. Programming Language Design and Implementation, San Diego, California, June 2007.

[22] Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, and Tin- Fook Ngai. A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs. Programming Language Design and Implementation, pages 71–81, Washington, D.C. June 2004.

[23] Perry A. Emrath and David A. Padua. Automatic Detection of Nondeterminacy in Parallel Programs. Workshop on Parallel and Distributed Debugging, pages 89–99, 1988.

[24] Stephane Eranian. Perfmon: Linux Performance Monitoring for IA-64. http://www.hpl.hp.com/research/linux/perfmon, Hewlett-Packard Laboratories, 2003.

[25] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9:319-349, July 1987.

[26] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Programming Language Design and Implementation, pages 212–223, Montréal, Canada, June 1998.

[27] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. New York, NY, 1979.

[28] María Jesús Garzarán, Milos Prvulovic, José María Llabería, Víctor Viñals, Lawrence Rauchwerger, and Josep Torrellas. Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors. ACM Transactions on Architecture and Code Optimization, 2(3):247–279, 2005.

[29] Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, An- drew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, and Saman Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In- ternational Conference on Architectural Support for Programming Languages and

Operating Systems, pages 291–303, San Jose, California, October 2002.

[30] Dan Grossman, Jeremy Manson, and William Pugh. What Do High-Level Memory Models Mean for Transactions? Workshop on Memory System Performance and Correctness, pages 62–69, San Jose, California, 2006.

[31] Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August. Practical and Accurate Low-Level Pointer Analysis.

International Symposium on Code Generation and Optimization, San Jose, California, March 2005.

[32] Mary W. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. Interprocedural Parallelization Analysis in SUIF. ACM Transactions on Programming Languages and Systems, 27(4):662–731, 2005.

[33] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, and Kunle Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, 2000.

[34] Richard E. Hank, Wen-mei W. Hwu, and B. Ramakrishna Rau. Region-Based Com- pilation: An Introduction and Motivation. International Symposium on Microarchi- tecture, pages 158-168, Ann Arbor, Michigan, December 1995.

[35] Susan Horwitz, Thomas Reps, and David Binkley. Interprocedural Slicing Using Dependence Graphs. ACM Transactions on Programming Languages and Systems, 12(1), 1990.

[36] Intel Corporation. Intel New Processor Generations. Intel Corporation. http://www.intel.com/pressroom/archive/releases/Intel_New_Processor_Generations.pdf.

[37] Intel Corporation. PRESS KIT – Moore's Law 40th Anniversary. Intel Corporation. http://www.intel.com/pressroom/kits/events/moores_law_40th.

[38] Ken Kennedy and Kathryn S. McKinley. Maximizing Loop Parallelism and Improv- ing Data Locality Via Loop Fusion and Distribution. Workshop on Languages and Compilers for Parallel Computing, pages 301–320, Ithaca, New York, August 1994.

[39] Hammond, Lance, Carlstrom, Brian D., Wong, Vicky, Chen, Mike, Kozyrakis, Christos, and Olukotun, Kunle. Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software. IEEE Micro, 24(6), Nov-Dec 2004.

[40] Kevin M. Lepak and Mikko H. Lipasti. Silent Stores for Free. International Sympo- sium on Microarchitecture, pages 22–31, Monterey, California, December 2000.

[41] Zhiyuan Li. Array Privatization for Parallel Execution of Loops. International Conference on Supercomputing, pages 313–322, Minneapolis, Minnesota, November 1992.

[42] Zhiyuan Li and Pen-Chung Yew. Efficient Interprocedural Analysis for Program Par- allelization and Restructuring. Conference on Parallel Programming: Experience with Applications, Languages and Systems, pages 85–99, New Haven, Connecticut, July 1988.

[43] Shih-Wei Liao, Amer Diwan, Robert P. Bosch Jr., Anwar M. Ghuloum, and Monica S. Lam. SUIF Explorer: An Interactive and Interprocedural Parallelizer. Symposium on Principles and Practice of Parallel Programming, pages 37-48, Atlanta, Georgia, May 1999.

[44] Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke. Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications. Interna- tional Symposium on High-Performance Computer Architecture, Phoenix, Arizona, February 2007.

[45] J. T. Lim, A. R. Hurson, K. Kavi, and B. Lee. A Loop Allocation Policy for DOACROSS Loops. Symposium on Parallel and Distributed Processing, pages 240– 249, New Orleans, Louisiana, October 1996.

[46] M. H. Lipasti and J. P. Shen. Exceeding the Dataflow Limit Via Value Prediction.

International Symposium on Microarchitecture, pages 226–237, Paris, France, December 1996.

[47] Glenn R. Luecke, Yan Zou, James Coyle, Jim Hoekstra, and Marina Kraeva. Dead- lock Detection in MPI Programs. Concurrency and Computation: Practice and Ex- perience, 14(11):911–932, John Wiley & Sons, 2002.

[48] S. F. Lundstorm and G. H. Barnes. A Controllable MIMD Architecture. International Conference on Parallel Programming, pages 19–27, 1980.

[49] Scott A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, Wen-mei W. Hwu, P. P. Chang, and T. Kiyohara. Compiler Code Transformations for Superscalar-Based High-Performance Systems. International Conference on Supercomputing, pages 808–817, Minneapolis, Minnesota, November 1992.

[50] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan- Kaufmann Publishers, San Francisco, CA, 1997.

[51] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. International Sym- posium on High-Performance Computer Architecture, Anaheim, California, February 2003.

[52] Eugene W. Myers. A Precise Inter-Procedural Data Flow Algorithm. Symposium on Principles of Programming Languages, pages 219–230, Williamsburg, Virginia, January 1981.

[53] E. M. Nystrom, H.-S. Kim, and Wen-mei W. Hwu. Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis. Static Analysis Symposium, Verona, Italy, August 2004.

[54] Guilherme Ottoni and David I. August. Communication Optimizations for Global Multi-Threaded Instruction Scheduling. International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222–232, Seattle, Washington, March 2008.

[55] Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic Thread Extraction with Decoupled Software Pipelining. International Symposium on Microarchitecture, pages 105–116, Barcelona, Spain, November 2005.

[56] Guilherme Ottoni, Ram Rangan, Adam Stoler, Matthew J. Bridges, and David I. Au- gust. From Sequential Programs to Concurrent Threads. IEEE Computer Architecture Letters, 4, June 2005.

[57] David A. Padua. Multiprocessors: Discussion of Some Theoretical and Practical Problems. Ph.D. Thesis, Department of Computer Science, Department of Computer Science, University of Illinois, Urbana-Champaign, UIUCDCS-R-79-990, University of Illinois, Urbana, IL, November 1979.

[58] J. R. C. Patterson. Accurate Static Branch Prediction By Value Range Propagation. Programming Language Design and Implementation, pages 67–78, La Jolla, California, June 1995.

[59] Manohar K. Prabhu and Kunle Olukotun. Exposing Speculative Thread Parallelism in SPEC2000. Symposium on Principles and Practice of Parallel Programming, pages 142–152, Chicago, Illinois, July 2005.

[60] Zach Purser, Karthik Sundaramoorthy, and Eric Rotenberg. A Study of Slipstream Processors. International Symposium on Microarchitecture, pages 269–280, Monterey, California, December 2000.

[61] Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. Parallel-Stage Decoupled Software Pipelining. International Symposium on Code Generation and Optimization, Boston, Massachusetts, April 2008.

[62] Ram Rangan. Pipelined Multithreading Transformations and Support Mechanisms. Ph.D. Thesis, Department of Computer Science, Princeton University, Princeton, NJ, United States, June 2007.

[63] Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. Decoupled Software Pipelining with the Synchronization Array. International Conference on Parallel Architectures and Compilation Techniques, pages 177–188, Antibes Juan-les-Pins, France, September 2004.

[64] Lawrence Rauchwerger and David Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. ACM SIGPLAN Notices, 30(6):218–232, 1995.

[65] Ravi Rajwar and James R. Goodman. Transactional Execution: Toward Reliable, High-Performance Multithreading. IEEE Micro, 23(6):117–125, Nov–Dec 2003.

[66] Thomas Reps, Susan Horwitz, Mooly Sagiv, and Genevieve Rosay. Speeding Up Slicing. Symposium on Foundations of Software Engineering, pages 11–20, New Orleans, Louisiana, December 1994.

[67] Martin C. Rinard and Pedro C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. Programming Language Design and Implementation, Philadelphia, Pennsylvania, May 1996.

[68] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Transactions on Computer Systems, 15(4):391–411, 1997.

[69] John W. Sias, Sain-Zee Ueng, Geoff A. Kent, Ian M. Steiner, Erik M. Nystrom, and Wen-mei W. Hwu. Field-Testing IMPACT EPIC Research Results in Itanium 2. International Symposium on Computer Architecture, Munich, Germany, June 2004.

[70] Jeff Da Silva and J. Gregory Steffan. A Probabilistic Pointer Analysis for Speculative Optimizations. International Conference on Architectural Support for Programming Languages and Operating Systems, pages 416–425, San Jose, California, October 2006.

[71] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar Processors. International Symposium on Computer Architecture, pages 414–425, S. Margherita Ligure, Italy, June 1995.

[72] J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C. Mowry. The STAMPede Approach to Thread-Level Speculation. ACM Transactions on Computer Systems, 23(3):253–300, 2005.

[73] Hong-Men Su and Pen-Chung Yew. Efficient DOACROSS Execution on Distributed Shared-Memory Multiprocessors. International Conference on Supercomputing, pages 842–853, Albuquerque, New Mexico, November 1991.

[74] Toshio Suganuma, Toshiaki Yasue, and Toshio Nakatani. A Region-Based Compilation Technique for a Java Just-in-Time Compiler. Programming Language Design and Implementation, pages 312–323, San Diego, California, June 2003.

[75] Motoyasu Takabatake, Hiroki Honda, and Toshitsugu Yuba. Performance Measurements on Sandglass-Type Parallelization of DOACROSS Loops. International Conference on High-Performance Computing and Networking, pages 663–672, Amsterdam, The Netherlands, April 1999.

[76] Robert E. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.

[77] William Thies, Vikram Chandrasekhar, and Saman Amarasinghe. A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. International Symposium on Microarchitecture, Chicago, Illinois, December 2007.

[78] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. Symposium on Compiler Construction, Grenoble, France, April 2002.

[79] Spyros Triantafyllis, Matthew J. Bridges, Easwaran Raman, Guilherme Ottoni, and David I. August. A Framework for Unrestricted Whole-Program Optimization. Programming Language Design and Implementation, pages 61–71, Ottawa, Canada, June 2006.

[80] Jenn-Yuan Tsai, Jian Huang, Christoffer Amlo, David J. Lilja, and Pen-Chung Yew. The Superthreaded Processor Architecture. IEEE Transactions on Computers, 48(9):881–902, 1999.

[81] Peng Tu and David A. Padua. Automatic Array Privatization. Compiler Optimizations for Scalable Parallel Systems Languages, pages 247–284, 2001.

[82] Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August. Speculative Decoupled Software Pipelining. International Conference on Parallel Architectures and Compilation Techniques, Brasov, Romania, September 2007.

[83] Sriram Vajapeyam and Tulika Mitra. Improving Superscalar Instruction Dispatch and Issue By Exploiting Dynamic Code Sequences. International Symposium on Computer Architecture, pages 1–12, Denver, Colorado, June 1997.

[84] Christoph von Praun, Luis Ceze, and Calin Caşcaval. Implicit Parallelism with Ordered Transactions. Symposium on Principles and Practice of Parallel Programming, pages 79–89, San Jose, California, March 2007.

[85] Steven Wallace, Brad Calder, and Dean M. Tullsen. Threaded Multiple Path Execution. International Symposium on Computer Architecture, pages 238–249, Barcelona, Spain, June 1998.

[86] Nicholas Wang, Michael Fertig, and Sanjay Patel. Y-Branches: When You Come to a Fork in the Road, Take It. International Conference on Parallel Architectures and Compilation Techniques, pages 56–67, New Orleans, Louisiana, September 2003.

[87] Tom Way, Ben Breech, and Lori Pollock. Region Formation Analysis with Demand-Driven Inlining for Region-Based Optimization. International Conference on Parallel Architectures and Compilation Techniques, pages 24, Philadelphia, Pennsylvania, October 2000.

[88] Youfeng Wu and James R. Larus. Static Branch Frequency and Program Profile Analysis. International Symposium on Microarchitecture, pages 1–11, San Jose, California, November 1994.

[89] Antonia Zhai. Compiler Optimization of Value Communication for Thread-Level Speculation. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States, January 2005.

[90] Antonia Zhai, J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Compiler and Hardware Support for Reducing the Synchronization of Speculative Threads. ACM Transactions on Architecture and Code Optimization, 5(1):1–33, 2008.

[91] Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke. Uncovering Hidden Loop Level Parallelism in Sequential Applications. International Symposium on High-Performance Computer Architecture, Salt Lake City, Utah, February 2008.

[92] Craig Zilles and David Flint. Challenges to Providing Performance Isolation in Trans- actional Memories. Workshop on Duplicating, Deconstructing, and Debunking, 2005.
