HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones+, Gu-Yeon Wei, David Brooks
Harvard University, Cambridge, MA, USA
+University of Cambridge, Cambridge, UK

978-1-4799-4394-4/14/$31.00 © 2014 IEEE

Abstract

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

1. Introduction

In today's multicore era, program performance largely depends on the amount of thread-level parallelism (TLP) available. While some computing problems translate to either inherently parallel or easy-to-parallelize numerical programs, sequentially designed, non-numerical programs with complicated control (e.g., execution path) and data flow (e.g., aliasing) are much more common, but difficult to analyze precisely. These non-numerical programs are the focus of this paper. While conventional wisdom is that non-numerical programs cannot make good use of multiple cores, research in the last decade has made steady progress towards extracting TLP from complex, sequentially designed programs such as the integer benchmarks from the SPEC CPU suites [3, 6, 27, 45, 48]. To further extend this body of research, this paper presents lightweight architectural enhancements for fast inter-core communication in order to support advances in a custom compiler framework that parallelizes loop iterations across multiple cores within a chip multiprocessor.

Performance gains sought by parallelizing loop iterations of non-numerical programs depend on two key factors: (i) the accuracy of the data dependence analysis and (ii) the speed of communication provided by the underlying computer architecture to satisfy the dependences. Unfortunately, complex control and data flow in non-numerical programs—both exacerbated by ambiguous pointers and ambiguous indirect calls—make accurate data dependence analysis difficult. In addition to actual dependences that require communication between cores, a compiler must conservatively handle apparent dependences never realized at runtime. While thread-level speculation (TLS) avoids the need for accurate data dependence analysis by speculating that some apparent dependences are not realized [29, 38, 39], TLS suffers overheads to support misspeculation and must therefore target relatively large loops to amortize penalties.

In contrast to existing parallelization solutions, we propose an alternate strategy that instead targets small loops, which are much easier to analyze via state-of-the-art control and data flow analysis, significantly improving accuracy. Furthermore, this ease of analysis enables transformations that simply re-compute shared variables in order to remove a large fraction of actual dependences. This strategy increases TLP and reduces core-to-core communication. Such optimizations do not readily translate to TLS because the complexity of TLS-targeted code typically spans multiple procedures in larger loops. Finally, our data shows that parallelizing small hot loops yields high program coverage and produces meaningful speedups for the non-numerical programs in the SPEC CPU2000 suite.

Targeting small loops presents its own set of challenges. Even after extensive code analysis and optimizations, small hot loops will retain actual dependences, typically to share dynamically allocated data. Moreover, since loop iterations of small loops tend to be short in duration (less than 25 clock cycles on average), they require frequent, memory-mediated communication. Attempting to run these iterations in parallel demands low-latency core-to-core communication for memory traffic, something not available in commodity multicore processors.

To meet these demands, we present HELIX-RC, a co-designed architecture-compiler parallelization framework for chip multiprocessors. The compiler identifies what data must be shared between cores, and the architecture proactively circulates this data along with synchronization signals among cores. Rather than waiting for a request, this proactive communication immediately circulates shared data, as early as possible—decoupling communication from computation. HELIX-RC builds on the HCCv1 compiler, developed for the first iteration of HELIX [6, 7], which automatically generates parallel code for commodity multicore processors. Because performance improvements from HCCv1 saturate at four cores due to communication latency, we propose ring cache—an architectural enhancement that facilitates low-latency core-to-core communication to satisfy inter-thread memory dependences—which relies on guarantees provided by the co-designed HCCv3 compiler to remain lightweight.

HELIX-RC automatically parallelizes non-numerical programs with unmatched performance improvements. Across a range of SPEC CINT2000 benchmarks, decoupling communication enables a three-fold improvement in performance compared to HCCv1, on a simulated multicore processor consisting of 16 Atom-like, in-order cores with a ring cache holding 1KB per node of memory (32× smaller than the L1 data cache). The proposed system offers an average speedup of 6.85× when compared to running un-parallelized code on a single core. Detailed evaluations show that even with a conservative ring cache configuration, HELIX-RC achieves 95% of the speedup possible with unlimited resources (i.e., unbounded bandwidth, instantaneous inter-core communication, and unconstrained size). Moreover, simulations for a HELIX-RC system comprising 16 out-of-order cores show a 3.8× performance speedup for the same set of non-numerical programs. This result confirms HELIX-RC's ability to extract TLP on top of the instruction-level parallelism (ILP) provided by an out-of-order processor.

The remainder of this paper further describes the motivation for and results of implementing HELIX-RC. We first review the limitations of compiler-only improvements and identify co-design opportunities to improve the TLP of loop iterations. Next, we explore the speedups obtained by decoupling communication from computation with compiler support. After describing the overall HELIX-RC approach, we delve deeper into both the compiler and the hardware enhancement. Finally, we use a detailed simulation framework to evaluate the performance of HELIX-RC and analyze its sensitivity to architectural parameters.

2. Background and Opportunities

2.1. Limits of compiler-only improvements

To understand what limits the performance of parallel code extracted from non-numerical programs, we started with HCCv1 [6, 7], a state-of-the-art parallelizing compiler.

HCCv1. This first-generation compiler automatically generates parallel threads from sequential programs by distributing successive loop iterations across adjacent cores within a single multicore processor, similar to conventional DOACROSS parallelism [10]. Since there are data dependences between loop iterations (i.e., loop-carried dependences), some segments of a loop's body—called sequential segments—must execute in iteration order on the separate cores to preserve the semantics of sequential code. Synchronization operations mark the beginning and end of each sequential segment.

… communication latency.¹ The engineering improvements of HCCv2 significantly raised speedups for numerical programs (SPEC CFP2000) over HCCv1, from 2.4× to 11×. HCCv2 successfully parallelized the numerical programs because the accuracy of the data dependence analysis is high for loops at almost any level of the loop-nesting hierarchy. Furthermore, the improved compiler removed the remaining actual dependences among registers (e.g., via parallel reduction) to generate loops with long iterations that can run in parallel on different cores.

Unfortunately, non-numerical programs (SPEC CINT) are not as amenable to compiler improvements and saw little to no benefit from HCCv2. Because core-to-core communication in conventional systems is expensive, the compiler must parallelize large loops (the larger the loop with loop-carried dependences, the less frequently cores synchronize), which limits the accuracy of dependence analysis and thereby limits TLP extraction. This is why HELIX-RC focuses on small (hot) loops to parallelize this class of programs. Our hypothesis is that modest architectural enhancements, co-designed with a compiler that targets small loops, can successfully parallelize non-numerical programs.

Figure 1: Improving the HCCv1 compiler alone does not improve performance for SPEC CINT2000 benchmarks. (Bar chart: program speedup under HCCv1 and HCCv2 for SPEC CPU2000 benchmarks, grouped into non-numerical and numerical programs, with INT and FP geomeans.)

2.2. Opportunity

There is an opportunity to aggressively parallelize non-numerical programs based on the following insights: (i) small loops are easier to analyze with high accuracy; (ii) predictable computation means most of the required communication updates shared