Optimizing Compilers

Source-to-source (high-level) translation and optimization for multicores

Rudolf (Rudi) Eigenmann

R. Eigenmann, Optimizing compilers Slide 1 I. Motivation and Introduction: Optimizing Compilers are in the Center of the (Software) Universe They translate increasingly advanced human interfaces (programming languages) onto increasingly complex target machines

[Figure, today and tomorrow: a translator maps a problem specification, written in human (programming) languages such as C, C++, Java, and Fortran, onto machine architectures: workstations and multicores today, globally distributed/cloud and HPC resources tomorrow. Building this translator is the grand challenge.]

Processors have multiple cores. Parallelization is a key optimization. R. Eigenmann, Optimizing compilers Slide 2 Issues in Optimizing / Parallelizing Compilers

The goal: we would like to run standard (C, C++, Java, Fortran) programs on common parallel computers. This leads to the following high-level issues:
• How to detect parallelism?
• How to map parallelism onto the machine?
• How to architect a compiler that can do this?

R. Eigenmann, Optimizing compilers Slide 3 Detecting Parallelism

• Program analysis techniques • Data dependence analysis • Dependence removing techniques • Parallelization in the presence of dependences • Runtime dependence detection

R. Eigenmann, Optimizing compilers Slide 4 Mapping Parallelism onto the Machine
• Exploiting parallelism at many levels
  – Multiprocessors and multi-cores (our focus)
  – Distributed computers (clusters or global networks)
  – Heterogeneous architectures
  – Instruction-level parallelism
  – Vector machines
• Exploiting memory organizations
  – Data placement
  – Locality enhancement
  – Data communication

R. Eigenmann, Optimizing compilers Slide 5 Architecting a Compiler

• Compiler generator languages and tools • Internal representations • Implementing analysis and transformation techniques • Orchestrating compiler techniques (when to apply which technique) • Benchmarking and performance evaluation

R. Eigenmann, Optimizing compilers Slide 6 Early, Foundational Parallelizing Compiler Books and Survey Papers

Books:
• Michael Wolfe: High-Performance Compilers for Parallel Computing (1996)
• Utpal Banerjee: several books on data dependence analysis and transformations
• Ken Kennedy, John Allen: Optimizing Compilers for Modern Architectures: A Dependence-based Approach (2001)
• H. Zima and B. Chapman: Supercompilers for Parallel and Vector Computers (1990)
• A. Darte, Y. Robert, and F. Vivien: Scheduling and Automatic Parallelization (2000)

Survey Papers:
• Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers, Wiley Encyclopedia of Electrical Engineering, John Wiley & Sons, Inc., 2001
• Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua, Automatic Program Parallelization, Proceedings of the IEEE, 81(2), February 1993
• David F. Bacon, Susan L. Graham, Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 26(4), December 1994, pages 345-420

R. Eigenmann, Optimizing compilers Slide 7 Course Approach

There are many schools on optimizing compilers. Our approach is performance-driven. We will discuss: – Performance of parallelization techniques – Analysis and Transformation techniques in the Cetus compiler (for multiprocessors/cores) – Additional transformations (for GPGPUs and other architectures) – Compiler infrastructure considerations

R. Eigenmann, Optimizing compilers Slide 8 The Heart of Automatic Parallelization Data Dependence Testing

If a loop does not have data dependences between any two iterations then it can be safely executed in parallel

In many science/engineering applications, loop parallelism is most important. In non-numerical programs, other control structures are also important.

R. Eigenmann, Optimizing compilers Slide 9 Data Dependence Tests: Motivating Examples

Loop Parallelization: can the iterations of this loop be run concurrently?

DO i=1,100,2
  B[2*i] = ...
  ... = B[2*i] + B[3*i]
ENDDO

DD testing is needed to detect parallelism.

Statement Reordering: can these two statements be swapped?

DO i=1,100,2
  B[2*i] = ...
  ... = B[3*i]
ENDDO

DD testing is important not just for detecting parallelism.

A data dependence exists between two data references iff:
• both references access the same storage location, and
• at least one of them is a write access

R. Eigenmann, Optimizing compilers Slide 10 Data Dependence Tests: Concepts

Terms for data dependences between statements of loop iterations:
• Distance (vector): indicates how many iterations apart the source and sink of the dependence are.
• Direction (vector): essentially the sign of the distance. There are different notations: (<,=,>) or (-1,0,+1), meaning the dependence goes from an earlier to a later iteration, stays within the same iteration, or goes from a later to an earlier iteration, respectively.
• Loop-carried (or cross-iteration) versus non-loop-carried (or loop-independent) dependence: indicates whether the dependence crosses iterations or stays within one iteration.
  – For detecting parallel loops, only cross-iteration dependences matter.
  – Loop-independent ("=") dependences are relevant for optimizations such as statement reordering and loop distribution.
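To make distance and direction concrete, here is a small C example; it is not from the slides, and the array names and sizes are illustrative only.

#include <stdio.h>

int main(void) {
    double a[100], b[100];
    for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 0.0; }

    /* S2 reads a[i-2], which S1 wrote two iterations earlier:
       a loop-carried flow dependence with distance 2, direction "<".
       S2 also reads a[i], written by S1 in the same iteration:
       a loop-independent (distance 0, direction "=") dependence.
       Only the cross-iteration dependence prevents parallelization. */
    for (int i = 2; i < 100; i++) {
        a[i] = b[i] + 1.0;        /* S1 */
        b[i] = a[i] + a[i - 2];   /* S2 */
    }
    printf("%f\n", b[99]);
    return 0;
}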

R. Eigenmann, Optimizing compilers Slide 11 Data Dependence Tests: Concepts

• Iteration space graphs: the un-abstracted form of a dependence graph with one node per statement instance.

Example:

DO i=1,n
  DO j=1,m
    a[i,j] = a[i-1,j-2]+b[i,j]
  ENDDO
ENDDO

[Iteration space graph: one node per (i,j) statement instance, i on the horizontal and j on the vertical axis; arrows in execution order show the dependence with distance vector (1,2).]

R. Eigenmann, Optimizing compilers Slide 12 Data Dependence Tests: Formulation of the Data-dependence problem

DO i=1,n
  a[4*i] = . . .
  . . . = a[2*i+1]
ENDDO

The question to answer: can 4*i1 ever be equal to 2*i2+1, with i1, i2 ∈ [1,n]?

Note that the iterations at which the two expressions are equal may differ. To express this fact, we choose the notation i1, i2.

Let us generalize a bit: given
• two subscript functions f and g, and
• loop bounds lower, upper,
does f(i1) = g(i2) have a solution such that lower ≤ i1, i2 ≤ upper?

R. Eigenmann, Optimizing compilers Slide 13 This course would now be finished if:

• the mathematical formulation of the data dependence problem had an accurate and fast solution, and
• there were enough loops in programs without data dependences, and
• dependence-free code could be executed by today’s parallel machines directly and efficiently, and
• engineering these techniques into a production compiler were straightforward.

There are enough hard problems to fill several courses!

R. Eigenmann, Optimizing compilers Slide 14 II. Performance of Basic Automatic Program Parallelization

R. Eigenmann, Optimizing compilers Slide 15 Three Decades of Parallelizing Compilers

A performance study at the beginning of the 1990s (the "Blume study") analyzed the performance of state-of-the-art parallelizers and vectorizers using the Perfect Benchmarks.

William Blume and Rudolf Eigenmann, Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs, IEEE Transactions on Parallel and Distributed Systems, 3(6), November 1992, pages 643--656.

Good reasons for starting two decades back: • We will learn simple techniques first. • We will see how parallelization techniques have evolved • We will see that extensions of the important techniques back then are still the important techniques today.

R. Eigenmann, Optimizing compilers Slide 16 Overall Performance of parallelizers in 1990

Speedup on 8 processors with 4-stage vector units

R. Eigenmann, Optimizing compilers Slide 17 Performance of Individual Techniques

R. Eigenmann, Optimizing compilers Slide 18 Transformations measured in the “Blume Study”
• Scalar expansion
• Reduction parallelization
• Induction variable substitution
• Forward substitution
• Stripmining
• Loop synchronization
• Recurrence substitution
• Loop interchange

R. Eigenmann, Optimizing compilers Slide 19 Scalar Expansion

Original loop (output, flow, and anti dependences on t):

DO j=1,n
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO

Privatization:

DO PARALLEL j=1,n
  PRIVATE t
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO

Expansion:

DO PARALLEL j=1,n
  t0[j] = a[j]+b[j]
  c[j] = t0[j] + t0[j]**2
ENDDO

We assume a shared-memory model:
• by default, data is shared, i.e., all processors can see and modify it
• processors share the work of parallel loops
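The same two transformations can be written in C with OpenMP. This is a minimal sketch; the array names and sizes are assumptions for illustration, not taken from the slides.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[N], b[N], c[N], t0[N], t;
    for (int j = 0; j < N; j++) { a[j] = j; b[j] = 2.0 * j; }

    /* Privatization: every thread gets its own copy of t. */
    #pragma omp parallel for private(t)
    for (int j = 0; j < N; j++) {
        t = a[j] + b[j];
        c[j] = t + t * t;
    }

    /* Expansion: the scalar becomes an array indexed by the loop variable,
       so different iterations no longer share storage. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        t0[j] = a[j] + b[j];
        c[j] = t0[j] + t0[j] * t0[j];
    }
    printf("%f\n", c[N - 1]);
    return 0;
}

Privatization needs no extra memory per iteration, only per thread; expansion uses O(n) extra storage but also works for vectorization, as discussed later.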

R. Eigenmann, Optimizing compilers Slide 20 Parallel Loop Syntax and Semantics in OpenMP

Fortran:
!$OMP PARALLEL PRIVATE(...)
!$OMP DO
DO i = lb, ub
  ...
ENDDO
!$OMP END DO
!$OMP END PARALLEL

C:
#pragma omp parallel for
for (i=lb; i<=ub; i++) {
  ...
}

work (iterations) shared by participating processors (threads)

Same code executed by all participating processors (threads)

R. Eigenmann, Optimizing compilers Slide 21 Reduction Parallelization

Serial loop (loop-carried flow and anti dependences on sum):

DO j=1,n
  sum = sum + a[j]
ENDDO

Privatized reduction implementation:

!$OMP PARALLEL PRIVATE(s)
s = 0
!$OMP DO
DO j=1,n
  s = s + a[j]
ENDDO
!$OMP END DO
!$OMP ATOMIC
sum = sum + s
!$OMP END PARALLEL

Using the OpenMP reduction clause:

!$OMP PARALLEL DO
!$OMP+REDUCTION(+:sum)
DO j=1,n
  sum = sum + a[j]
ENDDO
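For C programmers, here is a sketch of the same two variants with OpenMP; the array size and contents are arbitrary choices for illustration.

#include <stdio.h>
#define N 10000

int main(void) {
    double a[N], sum = 0.0, sum2 = 0.0;
    for (int j = 0; j < N; j++) a[j] = 1.0;

    /* Privatized reduction, written out explicitly: each thread accumulates
       into a private partial sum s and adds it to the shared sum once,
       protected by an atomic update. */
    #pragma omp parallel
    {
        double s = 0.0;
        #pragma omp for
        for (int j = 0; j < N; j++)
            s += a[j];
        #pragma omp atomic
        sum += s;
    }

    /* Equivalent, using the OpenMP reduction clause. */
    #pragma omp parallel for reduction(+:sum2)
    for (int j = 0; j < N; j++)
        sum2 += a[j];

    printf("%f %f\n", sum, sum2);
    return 0;
}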

R. Eigenmann, Optimizing compilers Slide 22 Induction Variable Substitution

Serial loop (flow dependence on ind):

ind = ind0
DO j = 1,n
  a[ind] = b[j]
  ind = ind+k
ENDDO

After induction variable substitution:

ind = ind0
DO PARALLEL j = 1,n
  a[ind0+k*(j-1)] = b[j]
ENDDO

Note, this is the reverse of strength reduction, an important transformation in classical (code-generating) compilers:

real d[20,100]
DO j=1,n
  d[1,j]=0
ENDDO

Address computation without strength reduction:
loop: ...
      R0 ← &d+20*j
      (R0) ← 0
      ...
      jump loop

With strength reduction:
      R0 ← &d
loop: ...
      (R0) ← 0
      ...
      R0 ← R0+20
      jump loop
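Returning to the induction variable itself, here is a minimal C/OpenMP sketch of the substitution; ind0, k, and the array sizes are illustrative assumptions.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[4 * N + 10], b[N];
    for (int j = 0; j < N; j++) b[j] = j;
    for (int i = 0; i < 4 * N + 10; i++) a[i] = 0.0;

    int k = 3, ind0 = 2;

    /* Original: the loop-carried flow dependence on ind serializes the loop. */
    int ind = ind0;
    for (int j = 0; j < N; j++) {
        a[ind] = b[j];
        ind = ind + k;
    }

    /* After induction variable substitution: ind is replaced by its
       closed form ind0 + k*j, so the loop can run in parallel. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        a[ind0 + k * j] = b[j];

    printf("%f\n", a[ind0 + k * (N - 1)]);
    return 0;
}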

R. Eigenmann, Optimizing compilers Slide 23 Forward Substitution

Before:
m = n+1
…
DO j=1,n
  a[j] = a[j+m]
ENDDO

After forward substitution of m:
m = n+1
…
DO j=1,n
  a[j] = a[j+n+1]
ENDDO

Before (dependences):        After (no dependences):
a = x*y                      a = x*y
b = a+2                      b = x*y+2
c = b + 4                    c = x*y + 6

R. Eigenmann, Optimizing compilers Slide 24 Stripmining

[Figure: the iteration space 1..n divided into strips of length strip.]

Before:
DO j=1,n
  a[j] = b[j]
ENDDO

After stripmining:
DO i=1,n,strip
  DO j=i,min(i+strip-1,n)
    a[j] = b[j]
  ENDDO
ENDDO

There are many variants of stripmining (sometimes called loop blocking)
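A C sketch of the stripmined loop; STRIP and the array size are arbitrary choices for illustration.

#include <stdio.h>
#define N 1000
#define STRIP 64

int main(void) {
    double a[N], b[N];
    for (int j = 0; j < N; j++) { a[j] = 0.0; b[j] = j; }

    /* The outer loop walks over strips of STRIP iterations; the inner loop
       covers one strip, and the bound check handles the last, partial strip. */
    for (int i = 0; i < N; i += STRIP) {
        int upper = (i + STRIP < N) ? i + STRIP : N;
        for (int j = i; j < upper; j++)
            a[j] = b[j];
    }
    printf("%f\n", a[N - 1]);
    return 0;
}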

R. Eigenmann, Optimizing compilers Slide 25 Loop Synchronization

Before:
DO j=1,n
  a[j] = b[j]
  c[j] = a[j]+a[j-1]
ENDDO

With loop synchronization:
DOACROSS j=1,n
  a[j] = b[j]
  post(current_iteration)
  wait(current_iteration-1)
  c[j] = a[j]+a[j-1]
ENDDO

R. Eigenmann, Optimizing compilers Slide 26 Recurrence Substitution

DO j=1,n
  a[j] = c0+c1*a[j]+c2*a[j-1]+c3*a[j-2]
ENDDO

call rec_solver(a,n,c0,c1,c2,c3)

Basic idea of the recurrence solver: split the loop

DO j=1,40
  a[j] = a[j] + a[j-1]
ENDDO

into chunks that run in parallel:

DO j=1,10   |  DO j=11,20  |  DO j=21,30  |  DO j=31,40
  a[j] = a[j] + a[j-1] in each chunk
ENDDO       |  ENDDO       |  ENDDO       |  ENDDO

Each chunk starts from a value that is not yet final; the resulting errors are 0, ∆a[10], ∆a[10]+∆a[20], ∆a[10]+∆a[20]+∆a[30], respectively, and are corrected in a second step.

R. Eigenmann, Optimizing compilers Slide 27 Loop Interchange

Original:
DO i=1,n
  DO j=1,m
    a[i,j] = a[i,j]+a[i,j-1]
  ENDDO
ENDDO

After loop interchange:
DO j=1,m
  DO i=1,n
    a[i,j] = a[i,j]+a[i,j-1]
  ENDDO
ENDDO

• stride-1 references increase cache locality
  – read: increase spatial locality
  – write: avoid false sharing
• scheduling of the outer loop is important (consider the original loop nest):
  – cyclic: no locality w.r.t. the i loop
  – block schedule: there may be some locality
  – dynamic scheduling: chunk scheduling desirable
• cache organization is important
• parallelism at the outer position reduces loop fork/join overhead
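Fortran is column-major, so the interchanged nest above has the stride-1 loop innermost; in row-major C the roles of i and j are swapped. Here is a hedged C/OpenMP sketch of the same idea, with sizes chosen only for illustration.

#include <stdio.h>
#define N 400
#define M 300

static double a[N][M + 1];

static void init(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= M; j++)
            a[i][j] = i + j;
}

int main(void) {
    /* Original nest: j outermost (it carries the dependence and must stay
       serial), i innermost, which walks column-wise, i.e. not stride-1 in C. */
    init();
    for (int j = 1; j <= M; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = a[i][j] + a[i][j - 1];
    double check = a[N - 1][M];

    /* After interchange: the dependence-free i loop is outermost and parallel,
       and the innermost j loop is stride-1 within each row. */
    init();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 1; j <= M; j++)
            a[i][j] = a[i][j] + a[i][j - 1];

    printf("results match: %d\n", check == a[N - 1][M]);
    return 0;
}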

R. Eigenmann, Optimizing compilers Slide 28 Effect of Loop Interchange

Example: speedups of the most time-consuming loops in the ARC2D benchmark on 4-core machine loop interchange applied in the process of parallelization

R. Eigenmann, Optimizing compilers Slide 29 Execution Scheme for Parallel Loops

1. The architecture supports parallel loops. Example: Alliant FX/8 (1980s)
– machine instruction for parallel loop
– HW concurrency bus supports loop scheduling

Source:
a=0
! DO PARALLEL
DO i=1,n
  b[i] = 2
ENDDO
b=3

Generated code:
store #0,a
load n,D6
sub 1,D6
load &b,A1
cdoall D6
  store #2,A1(D7.r)
endcdoall
store #3,b

(D7 is reserved for the loop variable; it starts at 0.)

R. Eigenmann, Optimizing compilers Slide 30 Execution Scheme for Parallel Loops

2. Microtasking scheme (dates back to early IBM mainframes)

[Figure: timeline over processors p1-p4: init_helper_tasks at program start, then the master alternates sequential sections with wakeup_helpers / parallel loop / sleep_helpers.]

The problem: parallel loop startup must be very fast.
Microtask startup: about 1 µs; pthreads startup: up to 100 µs.

R. Eigenmann, Optimizing compilers Slide 31 Compiler Transformation and Runtime Function for the Microtasking Scheme

Source:
! DO PARALLEL
a=0
DO i=1,n
  b[i] = 2
ENDDO
b=3

Transformed code:
call init_microtasking()   // once at program start
...
a=0
call loop_scheduler(loopsub,1,n,b)
b=3

subroutine loopsub(mytask,lb,ub,b)
DO i=lb,ub
  b[i] = 2
ENDDO
END

Runtime scheme:
Master task (loop_scheduler):
  partition the loop iterations
  wake up the helpers: write loopsub, lb, ub, sh_var into the control blocks (shared data) and set each helper's flag
  call loopsub(id,lb,ub,sh_var)
  barrier (wait until all flags are reset)
  return

Helper task:
  loop: wait for flag
        call loopsub(...)
        reset flag

R. Eigenmann, Optimizing compilers Slide 32 III. Performance of Advanced Parallelization

R. Eigenmann, Optimizing compilers Slide 33 Manual Improvements of the Perfect Benchmarks (1995)

Same information as on Slide 17. Reference: Rudolf Eigenmann, Jay Hoeflinger, and David Padua, On the Automatic Parallelization of the Perfect Benchmarks, IEEE Transactions on Parallel and Distributed Systems, 9(1), January 1998, pages 5-23.

(a) eliminated file I/O, (b) parallelized random number generator

R. Eigenmann, Optimizing compilers Slide 34 Performance of Individual Techniques in Manually Improved Programs (1995)

Performance loss when disabling individual techniques (Cedar machine) R. Eigenmann, Optimizing compilers Slide 35 Overall Performance of the Cetus and ICC Compilers (2011)

[Figure: overall speedups of the Cetus and ICC compilers; chart labels not recoverable.]

NAS (Class A) Benchmarks on 8-core x86 processor

R. Eigenmann, Optimizing compilers Slide 36 Performance of Individual Cetus Techniques (2011)

[Figure: performance impact of individual Cetus techniques; chart labels not recoverable.]

NAS Benchmarks (Class A) on 8-core x86 processor

R. Eigenmann, Optimizing compilers Slide 37 IV. Analysis and Transformation Techniques

• 1 Data-dependence analysis
• 2 Parallelism enabling transformations
• 3 Techniques for multiprocessors/multicores
• 4 Advanced program analysis
• 5 Dynamic decision making
• 6 Techniques for vector architectures
• 7 Techniques for heterogeneous multicores
• 8 Techniques for distributed-memory machines

R. Eigenmann, Optimizing compilers Slide 38 IV.1 Data Dependence Testing

Earlier, we have considered the simple case of a 1-dimensional array enclosed by a single loop:

DO i=1,n
  a[4*i] = . . .
  . . . = a[2*i+1]
ENDDO

The question to answer: can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n]?

In general: given
• two subscript functions f and g, and
• loop bounds lower, upper,
does f(i1) = g(i2) have a solution such that lower ≤ i1, i2 ≤ upper?

R. Eigenmann, Optimizing compilers Slide 39 DDTests: doubly-nested loops

• Multiple loop indices:

DO i=1,n
  DO j=1,m
    X[a1*i + b1*j + c1] = ...
    ... = X[a2*i + b2*j + c2]
  ENDDO
ENDDO

The dependence problem:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m

Almost all DD tests expect the coefficients ax to be integer constants. Such subscript expressions are called affine. R. Eigenmann, Optimizing compilers Slide 40 DDTests: even more complexity

• Multiple loop indices, multi-dimensional array:

DO i=1,n
  DO j=1,m
    X[a1*i + b1*j + c1, d1*i + e1*j + f1] = ...
    ... = X[a2*i + b2*j + c2, d2*i + e2*j + f2]
  ENDDO
ENDDO

The dependence problem:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
d1*i1 - d2*i2 + e1*j1 - e2*j2 = f2 - f1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m

R. Eigenmann, Optimizing compilers Slide 41 Data Dependence Tests: The Simple Case

Note: the variables i1, i2 are integers → Diophantine equations.

The equation a*i1 - b*i2 = c has a solution if and only if gcd(a,b) (evenly) divides c.

in our example this means: gcd(4,2)=2, which does not divide 1 and thus there is no dependence.

If there is a solution, we can test if it lies within the loop bounds. If not, then there is no dependence.

R. Eigenmann, Optimizing compilers Slide 42 Performing the GCD Test • The diophantine equation

a1*i1 + a2*i2 + ... + an*in = c has a solution iff gcd(a1,a2,...,an) evenly divides c

Examples:
15*i + 6*j - 9*k = 12   has a solution (gcd=3)
2*i + 7*j = 3           has a solution (gcd=1)
9*i + 3*j + 6*k = 5     has no solution (gcd=3)

Euclid's algorithm: find gcd(a,b)
  Repeat
    a ← a mod b
    swap a,b
  Until b=0
  → the resulting a is the gcd

For more than two numbers: gcd(a,b,c) = gcd(a,gcd(b,c))
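A small C implementation of Euclid's algorithm and the GCD test, applied to the examples above and to the loop from Slide 13. This is a sketch of the idea, not the code of any particular compiler.

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm for two non-negative numbers. */
static long gcd2(long a, long b) {
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a1*i1 + a2*i2 + ... + an*in = c:
   an integer solution exists iff gcd(a1,...,an) divides c.
   Returns 1 if a dependence is possible, 0 if independence is proven. */
static int gcd_test(const long *coef, int n, long c) {
    long g = 0;
    for (int k = 0; k < n; k++)
        g = gcd2(g, labs(coef[k]));
    if (g == 0)                 /* all coefficients zero */
        return c == 0;
    return c % g == 0;
}

int main(void) {
    long ex1[] = {15, 6, -9};   /* 15i + 6j - 9k = 12 : gcd 3 divides 12 */
    long ex2[] = {9, 3, 6};     /*  9i + 3j + 6k = 5  : gcd 3 does not divide 5 */
    long ex3[] = {4, -2};       /*  4*i1 - 2*i2 = 1   : the loop from Slide 13 */
    printf("%d %d %d\n",
           gcd_test(ex1, 3, 12),   /* 1: dependence possible */
           gcd_test(ex2, 3, 5),    /* 0: independent */
           gcd_test(ex3, 2, 1));   /* 0: independent */
    return 0;
}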

R. Eigenmann, Optimizing compilers Slide 43 Other Data Dependence Tests

• The GCD test is simple but not accurate • Other tests – Banerjee(-Wolfe) test: widely used test – Power Test: improvement over Banerjee test – Omega test: “precise” test, most accurate for linear subscripts – Range test: handles non-linear and symbolic subscripts – many variants of these tests

R. Eigenmann, Optimizing compilers Slide 44 The Banerjee(-Wolfe) Test

Basic idea: if the total subscript range accessed by ref1 does not overlap with the range accessed by ref2, then ref1 and ref2 are independent.

DO j=1,100
  a[j] = …
  … = a[j+200]
ENDDO

Ranges accessed: a[j]: [1:100], a[j+200]: [201:300] → no overlap → independent
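Below is a minimal C sketch of this range-overlap reasoning for a single loop and affine subscripts of the form a*i+b. The helper names are my own, and the real Banerjee test generalizes this to multiple loops and direction vectors.

#include <stdio.h>

/* Independence check for A[a1*i + b1] (write) and A[a2*i + b2] (read)
   inside DO i = lo, hi.  A dependence requires a1*i1 - a2*i2 = b2 - b1
   for some i1, i2 in [lo,hi].  We bound the left side over the loop
   bounds; if b2 - b1 lies outside [min,max], the references are independent. */
static long term_min(long a, long lo, long hi) { return a >= 0 ? a * lo : a * hi; }
static long term_max(long a, long lo, long hi) { return a >= 0 ? a * hi : a * lo; }

static int banerjee_independent(long a1, long b1, long a2, long b2,
                                long lo, long hi) {
    long c   = b2 - b1;
    long min = term_min(a1, lo, hi) + term_min(-a2, lo, hi);
    long max = term_max(a1, lo, hi) + term_max(-a2, lo, hi);
    return c < min || c > max;
}

int main(void) {
    /* DO j=1,100:  a[j] = ...;  ... = a[j+200]   -> independent */
    printf("%d\n", banerjee_independent(1, 0, 1, 200, 1, 100));
    /* DO j=1,100:  a[j] = ...;  ... = a[j+5]     -> cannot prove independence */
    printf("%d\n", banerjee_independent(1, 0, 1, 5, 1, 100));
    return 0;
}

The first call reproduces the bounds on the next slide: the left-hand side ranges over [-99, 99], and 200 lies outside it.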

R. Eigenmann, Optimizing compilers Slide 45 Mathematical Formulation of the Test – Banerjee’s Inequalities

For the example above, the dependence equation is j1 - j2 = 200. Over the loop bounds, the left-hand side ranges from Min: 1-100 = -99 to Max: 100-1 = 99; since 200 lies outside [-99, 99], there is no solution and thus no dependence.

The general case of a doubly-nested loop and a single subscript, as shown on Slide 40:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1

Assuming positive coefficients,
a1*i1 - a2*i2 ranges over [Min: a1 - a2*n, Max: a1*n - a2], and
b1*j1 - b2*j2 ranges over [Min: b1 - b2*m, Max: b1*m - b2].
If c2 - c1 lies outside the sum of these ranges, the references are independent.

Multiple dimensions: apply test separately on each subscript or linearize

R. Eigenmann, Optimizing compilers Slide 46 Banerjee(-Wolfe) Test continued

Weakness of the test: consider the potential flow dependence in

DO j=1,100
  a[j] = …
  … = a[j+5]
ENDDO

Ranges accessed: a[j]: [1:100], a[j+5]: [6:105]. The ranges overlap, so the test cannot prove independence → no parallelization?

We did not take into consideration that only loop-carried dependences matter for parallelization. A loop-carried flow dependence only exists if a write in some iteration, j1, conflicts with a read in some later iteration, j2 > j1.

R. Eigenmann, Optimizing compilers Slide 47 Using Dependence Direction Information in the Banerjee(-Wolfe) Test

Idea for overcoming the weakness: for loop-carried dependences, make use of the fact that j in ref2 is greater than j in ref1.

Still considering the potential flow dependence from a(j) to a(j+5):

DO j=1,100
  a[j] = …
  … = a[j+5]
ENDDO

Ranges accessed by iteration j1 and any other iteration j2, where j1 < j2: the write a[j] accesses [j1], the read a[j+5] accesses [j1+6:105]. These ranges do not overlap → independent for the “>” direction.

Clearly, this loop has a dependence, but it is an anti-dependence from a[j+5] to a[j]. This is commonly referred to as the Banerjee test with direction vectors.

R. Eigenmann, Optimizing compilers Slide 48 DD Testing with Direction Vectors

Considering direction vectors can increase the complexity of the DD test substantially. For long vectors (corresponding to deeply-nested loops), there are many possible combinations of directions.

A direction vector (d1, d2, …, dn) has one element per loop; each element is one of <, =, > (or *, meaning any direction).

A possible algorithm:
1. try (*,*,…,*), i.e., do not consider directions
2. (if not independent) try (<,*,…,*), (=,*,…,*)
3. (if still not independent) try (<,<,*,…,*), (<,>,*,…,*), (<,=,*,…,*), (=,=,*,…,*), (=,<,*,…,*), . . .
(This forms a tree.)

R. Eigenmann, Optimizing compilers Slide 49 Data-dependence Test Driver

procedure DataDependenceAnalysis( PROG )
  input : Program representing all source files: PROG
  output : Data dependence graph containing dependence arcs, DDG

  // Collect all FOR loops meeting eligibility
  // checks: Canonical, FunctionCall, ControlFlowModifier
  ELIGIBLE_LOOPS = getOutermostEligibleLoops( PROG )
  foreach LOOP in ELIGIBLE_LOOPS
    // Obtain lower bounds, upper bounds and loop steps for this loop
    // and all enclosed loops, i.e. the loop nest.
    // Substitute symbolic information if available.
    LOOP_INFO = collectLoopInformation( LOOP and enclosed nest )
    // Collect all array access expressions appearing within the body of
    // this loop; this includes enclosed loops and non-perfectly nested statements
    ACCESSES = collectArrayAccesses( LOOP and enclosed nest )
    // Traverse all array accesses, test relevant pairs and
    // create a set of dependence arcs for the loop nest
    LOOP_DDG = runDependenceTest( LOOP_INFO, ACCESSES )
    // Add the loop dependence graph to the program-wide DDG
    // (the program-wide DDG is initially empty)
    DDG += LOOP_DDG
  // return the program-wide data dependence graph once all loops are done
  return DDG

R. Eigenmann, Optimizing compilers Slide 50 Data-dependence Test Driver (continued)

procedure runDependenceTest( LOOP_INFO, ACCESSES )
  input : Loop information for the current loop nest, LOOP_INFO
          List of array access expressions, ACCESSES
  output : Loop data dependence graph, LOOP_DDG

  foreach ARRAY_1 in ACCESSES of type write
    // Obtain alias information, i.e. aliases to this array name
    // (alias information in Cetus is generated through points-to analysis)
    ALIAS_SET = getAliases( ARRAY_1 )
    // Collect all expressions/references to the same array from the entire list of accesses
    TEST_LIST = getOtherReferences( ALIAS_SET, ACCESSES )
    foreach ARRAY_2 in TEST_LIST
      // Obtain the common loops enclosing the pair
      COMMON_NEST = getCommonNest( ARRAY_1, ARRAY_2 )
      // A possibly empty set of direction vectors under which
      // a dependence exists is returned by the test
      DV_SET = testAccessPair( ARRAY_1, ARRAY_2, COMMON_NEST, LOOP_INFO )
      foreach DV in DV_SET
        // Create an arc from source to sink
        DEP_ARC = buildDependenceArc( ARRAY_1, ARRAY_2, DV )
        // Build the loop dependence graph by accumulating all arcs
        LOOP_DDG += DEP_ARC
  // All expressions have been tested; return the loop dependence graph
  return LOOP_DDG

R. Eigenmann, Optimizing compilers Slide 51 Data-dependence Test Driver (continued)

procedure testAccessPair( A1, A2, COMMON_NEST, LOOP_INFO )
  input : Pair of array accesses to be tested, A1 and A2
          Nest of common enclosing loops, COMMON_NEST
          Information for these loops, LOOP_INFO
  output : Possibly empty set of direction vectors under which dependence exists, DV_SET

  // Partition the subscripts of the array accesses into dimension pairs
  // (coupled subscripts may be handled)
  PARTITIONS = partitionSubscripts( A1, A2, COMMON_NEST )
  foreach PARTITION in PARTITIONS
    // Depending on the number of loop index variables in the partition,
    // use the corresponding test.
    if ( ZIV )   // zero index variables
      DVs = simpleZIVTest( PARTITION )
    else         // single or multiple loop index variables: SIV, MIV
      // traverse and prune the tree of direction vectors, collect DVs where
      // dependence exists (traversal not shown here)
      foreach DV in DV_TREE using prune
        // In Cetus, the MIV test is performed using the Banerjee or Range test
        DVs += MIVTest( PARTITION, DV, COMMON_NEST, LOOP_INFO )
  // Merge DVs for all partitions
  DV_SET = merge( DVs )
  return DV_SET

R. Eigenmann, Optimizing compilers Slide 52 Non-linear and Symbolic DD Testing

Weakness of most data dependence tests: subscripts and loop bounds must be affine, i.e., linear with integer-constant coefficients

Approach of the Range Test: capture subscript ranges symbolically compare ranges: find their upper and lower bounds by determining monotonicity. Monotonically increasing/decreasing ranges can be compared by comparing their upper and lower bounds.

R. Eigenmann, Optimizing compilers Slide 53 The Range Test Basic idea : 1. Find the range of array accesses made in a given loop iteration j => r(j). 2. If r(j) does not overlap with r(j+1) then there is no cross-iteration dependence

Symbolic comparison of ranges r1 and r2: if max(r1) < min(r2) or min(r1) > max(r2), the ranges do not overlap.

Example: testing independence of the outer loop:

DO i=1,n
  DO j=1,m
    A[i*m+j] = 0
  ENDDO
ENDDO

Range of A accessed in iteration ix:    [ix*m+1 : (ix+1)*m]      (its upper bound: ubx)
Range of A accessed in iteration ix+1:  [(ix+1)*m+1 : (ix+2)*m]  (its lower bound: lbx+1)

ubx < lbx+1 ⇒ no cross-iteration dependence

R. Eigenmann, Optimizing compilers Slide 54 we need powerful expression manipulation and comparison Range Test continued utilities

DO i1=L1,U1 Assume f,g are monotonically increasing w.r.t. all ix: ... find upper bound of access range at loop k, 1

we need Determining monotonicity: consider d = f(...,ik,...) - f(...,ik-1,...) range If d>0 (for all values of ik) then f is monotonically increasing w.r.t. k analysis If d<0 (for all values of ik) then f is monotonically decreasing w.r.t. k

What about symbolic coefficients? • in many cases they cancel out • if not, find their range (i.e., all possible values they can assume at this point in the program), and replace them by the upper or lower bound of the range. R. Eigenmann, Optimizing compilers Slide 55 Handling Non-contiguous Ranges

DO i1=1,u1
  DO i2=1,u2
    A[n*i1+m*i2] = …
  ENDDO
ENDDO

The basic Range Test finds independence of the outer loop if n >= u2 and m=1, but not if n=1 and m >= u1.

Idea: - temporarily (during program analysis) interchange the loops, - test independence, - interchange back

Issues: • legality of loop interchanging, • change of parallelism as a result of loop interchanging

R. Eigenmann, Optimizing compilers Slide 56 Some Engineering Tasks and Questions for DD Test Pass Writers

- Start with the simple case: linear (affine) subscripts, single nests with 1-dim arrays; subscript and loop bounds are integer constants; stride-1 loop, lower bound = 1
- Deal with multiple array dimensions and loop nests
- Add capabilities for non-stride-1 loops and lower bounds ≠ 1
- How to deal with symbolic subscript coefficients and bounds
- Ignore dependences in private variables and reductions
- Generate DD vectors
- Mark parallel loops
- Things to think about:
  -- how to handle loop-variant coefficients
  -- how to deal with private, reduction, induction variables
  -- how to represent DD information
  -- how to display the DD info
  -- how to deal with non-parallelizable loops (I/O ops, function calls, other?)
  -- how to find eligible DO loops?
  -- how to find eligible loop bounds, array subscripts?
  -- what is the result of the pass? Generate DD info or set parallel loop flags?
  -- what symbolic analysis capabilities are needed?

R. Eigenmann, Optimizing compilers Slide 57 Data-Dependence Test, References
• Banerjee/Wolfe test – M. Wolfe and U. Banerjee, "Data Dependence and its Application to Parallel Processing", Int. J. of Parallel Programming, 16(2), pp. 137-178, 1987
• Power Test – M. Wolfe and C.W. Tseng, The Power Test for Data Dependence, IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 591-601, 1992
• Range Test – William Blume and Rudolf Eigenmann, Non-Linear and Symbolic Data Dependence Testing, IEEE Transactions on Parallel and Distributed Systems, 9(12), pp. 1180-1194, December 1998
• Omega Test – William Pugh, The Omega test: a fast and practical integer programming algorithm for dependence analysis, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, 1991
• I Test – Xiangyun Kong, David Klappholz, and Kleanthis Psarris, "The I Test: A New Test for Subscript Data Dependence", Proceedings of the 1990 International Conference on Parallel Processing, Vol. II, pp. 204-211, August 1990

R. Eigenmann, Optimizing compilers Slide 58 IV.2 Parallelism Enabling Techniques

R. Eigenmann, Optimizing compilers Slide 59 Advanced Privatization

Scalar privatization (loop-carried anti dependence on t):

DO i=1,n
  t = A[i]+B[i]
  C[i] = t + t**2
ENDDO

!$OMP PARALLEL DO
!$OMP+PRIVATE(t)
DO i=1,n
  t = A[i]+B[i]
  C[i] = t + t**2
ENDDO

Array privatization (loop-carried anti dependence on t[1:m]):

DO j=1,n
  t[1:m] = A[j,1:m]+B[j]
  C[j,1:m] = t[1:m] + t[1:m]**2
ENDDO

!$OMP PARALLEL DO
!$OMP+PRIVATE(t)
DO j=1,n
  t[1:m] = A[j,1:m]+B[j]
  C[j,1:m] = t[1:m] + t[1:m]**2
ENDDO

R. Eigenmann, Optimizing compilers Slide 60 Array Privatization

Examples:

k = 5
DO j=1,n
  t[1:10] = A[j,1:10]+B[j]
  C[j,iv] = t[k]
  t[11:m] = A[j,11:m]+B[j]
  C[j,1:m] = t[1:m]
ENDDO

DO j=1,n
  IF (cond(j))
    t[1:m] = A[j,1:m]+B[j]
    C[j,1:m] = t[1:m] + t[1:m]**2
  ENDIF
  D[j,1] = t[1]
ENDDO

Capabilities needed for Array Privatization:
• array Def-Use analysis
• combining and intersecting subscript ranges
• representing subscript ranges
• representing conditionals under which sections are defined/used
• if ranges are too complex to represent: overestimate Uses, underestimate Defs

R. Eigenmann, Optimizing compilers Slide 61 Array Privatization continued

Array privatization algorithm: • For each loop nest: – iterate from innermost to outermost loop: • for each statement in the loop – Find array definitions; add them to the existing definitions in this loop. – find array uses; if they are covered by a definition, mark this array section as privatizable for this loop, otherwise mark it as upward-exposed in this loop; • aggregate defined and upward-exposed uses (expand from range per-iteration to entire iteration space); record them as Defs and Uses for this loop

R. Eigenmann, Optimizing compilers Slide 62 Some Engineering Tasks and Questions for Privatization Pass Writers

• Start with scalar privatization • Next step: array privatization with simple ranges (contiguous; no range merge) and singly-nested loops • Deal with multiply-nested loops (-> range aggregation) • Add capabilities for merging ranges • Implement advanced range representation (symbolic bounds, non- contiguous ranges) • Deal with conditional definitions and uses (too advanced for this course) • Things to think about – what symbolic analysis capabilities are needed? – how to represent advanced ranges? – how to deal with loop-variant subscript terms? – how to represent private variables?

R. Eigenmann, Optimizing compilers Slide 63 Array Privatization, References • Peng Tu and D. Padua. Automatic Array Privatization. Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science 768, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (Eds.), Springer-Verlag, 1994.

• Zhiyuan Li, Array Privatization for Parallel Execution of Loops, Proceedings of the 1992 ACM International Conference on Supercomputing

R. Eigenmann, Optimizing compilers Slide 64 Reduction Parallelization

Scalar reduction (loop-carried flow dependence on sum):

DO i=1,n
  sum = sum + A[i]
ENDDO

Privatized reduction implementation:

!$OMP PARALLEL PRIVATE(s)
s=0
!$OMP DO
DO i=1,n
  s=s+A[i]
ENDDO
!$OMP END DO
!$OMP ATOMIC
sum = sum+s
!$OMP END PARALLEL

Expanded reduction implementation:

DO i=1,num_proc
  s[i]=0
ENDDO
!$OMP PARALLEL DO
DO i=1,n
  s[my_proc]=s[my_proc]+A[i]
ENDDO
DO i=1,num_proc
  sum=sum+s[i]
ENDDO

Note, OpenMP has a reduction clause, so only reduction recognition is needed:

!$OMP PARALLEL DO
!$OMP+REDUCTION(+:sum)
DO i=1,n
  sum = sum + A[i]
ENDDO

R. Eigenmann, Optimizing compilers Slide 65 Parallelizing Array Reductions

Array reductions (a.k.a. irregular or histogram reductions):

REAL sum[m]
DO i=1,n
  sum[expr] = sum[expr] + A[i]
ENDDO

Privatized reduction implementation:

REAL sum(m),s(m)
!$OMP PARALLEL PRIVATE(s)
s[1:m] = 0
!$OMP DO
DO i=1,n
  s[expr]=s[expr]+A[i]
ENDDO
!$OMP ATOMIC
sum[1:m] = sum[1:m]+s[1:m]
!$OMP END PARALLEL

Expanded reduction implementation:

REAL sum(m),s(m,#proc)
!$OMP PARALLEL DO
DO i=1,m
  DO j=1,#proc
    s[i,j]=0
  ENDDO
ENDDO
!$OMP PARALLEL DO
DO i=1,n
  s[expr,my_proc]=s[expr,my_proc]+A[i]
ENDDO
!$OMP PARALLEL DO
DO i=1,m
  DO j=1,#proc
    sum[i]=sum[i]+s[i,j]
  ENDDO
ENDDO
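Here is a C/OpenMP sketch of the expanded (histogram) reduction; the thread-count limit, array sizes, and index function idx are assumptions made only for this illustration.

#include <stdio.h>
#include <omp.h>
#define N 100000
#define M 64

int main(void) {
    enum { MAXT = 256 };
    static double a[N], sum[M], s[MAXT][M];  /* one private histogram per thread */
    static int idx[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; idx[i] = i % M; }

    int nthreads = omp_get_max_threads();
    if (nthreads > MAXT) nthreads = MAXT;
    omp_set_num_threads(nthreads);

    for (int t = 0; t < nthreads; t++)
        for (int i = 0; i < M; i++)
            s[t][i] = 0.0;

    /* Each thread accumulates into its own contiguous row of s,
       which avoids conflicts and limits false sharing. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            s[tid][idx[i]] += a[i];
    }

    /* Parallel sum-up across the per-thread copies. */
    #pragma omp parallel for
    for (int i = 0; i < M; i++)
        for (int t = 0; t < nthreads; t++)
            sum[i] += s[t][i];

    printf("%f\n", sum[0]);
    return 0;
}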

Note, OpenMP 1.0 does not support such array reductions R. Eigenmann, Optimizing compilers Slide 66 Recognizing Reductions

Recognition criteria:
1. The loop may contain one or more reduction statements of the form X = X ⊗ expr, where
   • X is either a scalar or an array expression, a[sub] (sub must be the same on the LHS and RHS)
   • ⊗ is a reduction operation, such as +, *, min, max
2. X must not be used in any non-reduction statement of the loop, nor in expr.

R. Eigenmann, Optimizing compilers Slide 67 Reduction Recognition Algorithm

procedure RecognizeSumReductions (L)
  Input : Loop L
  Output: reduction annotations for loop L, inserted in the IR

  REDUCTION = {}   // set of candidate reduction expressions
  REF = {}         // set of non-reduction variables referenced in L
  foreach stmt in L
    localREFs = findREF(stmt)                   // gather all variables referenced in stmt
    if (stmt is AssignmentStatement)
      candidate = lhs_expr(stmt)
      increment = rhs_expr(stmt) - candidate    // symbolic subtraction
      if ( !(baseSymbol(candidate) in findREF(increment)) )   // criterion 1 is satisfied
        REDUCTION = REDUCTION ∪ candidate
        localREFs = findREF(increment)          // all variables referenced in the increment expression
    REF = REF ∪ localREFs                       // collect non-reduction variables for criterion 2
  foreach expr in REDUCTION
    if ( !(baseSymbol(expr) in REF) )           // criterion 2 is satisfied
      if (expr is ArrayAccess AND expr.subscript is loop-variant)
        CreateAnnotation(sum-reduction, ARRAY, expr)
      else
        CreateAnnotation(sum-reduction, SCALAR, expr)

R. Eigenmann, Optimizing compilers Slide 68 Reduction Compiler Passes

Reduction recognition and parallelization passes:

Induction variable recognition recognizes and Reduction recognition annotates reduction Privatization variables Data dependence test Loop parallelization Profitability decision Reduction parallelization performs the reduction transformation compiler passes

R. Eigenmann, Optimizing compilers Slide 69 Performance Considerations for Reduction Parallelization

• Parallelized reductions execute substantially more code than their serial versions Þ overhead if the reduction (n) is small. • In many cases (for large reductions) initialization and sum-up are insignificant. • False sharing can occur, especially in expanded reductions, if multiple processors use adjacent array elements of the temporary reduction array (s). • Expanded reductions exhibit more parallelism in the sum-up operation. • Potential overhead in initialization, sum-up, and memory used for large, sparse array reductions Þ compression schemes can become useful.

R. Eigenmann, Optimizing compilers Slide 70 Induction Variable Substitution

Serial loop (loop-carried flow dependence on ind):

ind = k
DO i=1,n
  ind = ind + 2
  A[ind] = B[i]
ENDDO

After substitution:

Parallel DO i=1,n
  A[k+2*i] = B[i]
ENDDO

This is the simple case of an induction variable

R. Eigenmann, Optimizing compilers Slide 71 Generalized Induction Variables

ind = k
DO j=1,n
  ind = ind + j
  A[ind] = B[j]
ENDDO

After substitution:

Parallel DO j=1,n
  A[k+(j**2+j)/2] = B[j]
ENDDO

Coupled induction variables:

DO i=1,n
  ind1 = ind1 + 1
  ind2 = ind2 + ind1
  A[ind2] = B[i]
ENDDO

Induction variable in a triangular loop nest:

DO i=1,n
  DO j=1,i
    ind = ind + 1
    A[ind] = B[i]
  ENDDO
ENDDO
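To check the closed form A[k+(j**2+j)/2] against the original induction sequence, here is a small C verification sketch; the array names and sizes are illustrative.

#include <stdio.h>
#define N 100

int main(void) {
    static double a[1 + N * (N + 1) / 2 + 1], a2[1 + N * (N + 1) / 2 + 1], b[N + 1];
    for (int j = 1; j <= N; j++) b[j] = j;

    int k = 0;

    /* Original: ind is incremented by the loop variable, so after iteration j
       it equals k + j*(j+1)/2 (a triangular number). */
    int ind = k;
    for (int j = 1; j <= N; j++) {
        ind = ind + j;
        a[ind] = b[j];
    }

    /* After substitution with the closed form k + (j*j + j)/2,
       the loop no longer carries a dependence on ind. */
    #pragma omp parallel for
    for (int j = 1; j <= N; j++)
        a2[k + (j * j + j) / 2] = b[j];

    int same = 1;
    for (int j = 1; j <= N; j++)
        if (a[k + (j * j + j) / 2] != a2[k + (j * j + j) / 2]) same = 0;
    printf("closed form matches: %d\n", same);
    return 0;
}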

R. Eigenmann, Optimizing compilers Slide 72 Recognizing GIVs

• Pattern Matching: – find induction statements in a loop nest of the form iv=iv+expr or iv=iv*expr, where iv is an scalar integer. – expr must be loop-invariant or another induction variable (there must not be cyclic relationships among IVs) – iv must not be assigned in a non-induction statement • Abstract interpretation: find symbolic increments of iv per loop iteration • SSA-based recognition

R. Eigenmann, Optimizing compilers Slide 73 GIV Closed-form Computation and Substitution Algorithm Step1: find the increment rel. to start of loop L FindIncrement(L) inc=0 Loop structure L0: stmt type foreach si of type I,L if type(s )=I inc += exp For j: 1..ub i … else /* L */ inc+= FindIncrement(si) inc_after[s ]=inc S1: iv=iv+exp I i j-1 … inc_into_loop[L]= å1 (inc) ; inc may depend ub S2: loop using iv L return å1 (inc) ; on j … Step 2: substitute IV S3: stmt using iv U … Replace (L,initval) Rof val = initval+inc_into_loop[L]

foreach si of type I,L,U Main: if type(si)=L Replace(si,val) Insert this if type(s )=L,I val=initialval totalinc = FindIncrement(L0) statement i If iv is live-out +inc_into_loop[L] Replace(L0,iv) InsertStatement(“iv = iv+totalinc”) +inc_after[si]

if type(si)=U Substitute(si,iv,val) For coupled GIVs: begin with independent iv. R. Eigenmann, Optimizing compilers Slide 74 Induction Variables, References

• B. Pottenger and R. Eigenmann. Idiom Recognition in the Polaris Parallelizing Compiler. ACM Int. Conf. on Supercomputing (ICS'95), June 1995.

• Mohammad R. Haghighat , Constantine D. Polychronopoulos, Symbolic analysis for parallelizing compilers, ACM Transactions on Programming Languages and Systems (TOPLAS), v.18 n.4, p.477-518, July 1996

• Michael P. Gerlek , Eric Stoltz , Michael Wolfe, Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form, ACM Transactions on Programming Languages and Systems (TOPLAS), v.17 n.1, p.85-122, Jan. 1995

R. Eigenmann, Optimizing compilers Slide 75 Loop Skewing !$OMP PARALLEL DO DO i=1,4 DO set=1,?1,9 DO j=1,6 i = ?max(5-set,1) A[i,j]= A[i-1,j-1] j = ?max(-3+set,1) ENDDO setsize== ? min(4,5-abs(set-5)) ENDDO DO k=0,setsize-1 A(i+k,j+kA[i+k,j+k])=A(i = A[i-1+k,j-1+k,j-1+k)-1+k] ENDDO ENDDO

i Iteration space graph: Shared regions show sets of iterations in the transformed code that can be executed in parallel.

j R. Eigenmann, Optimizing compilers Slide 76 Loop Skewing for the Wavefront Method

DO i=2,n-1
  DO j=2,n-1
    A[i,j] = (A[i+1,j] + A[i-1,j] + A[i,j+1] + A[i,j-1])/4
  ENDDO
ENDDO

The outer loop is serial; the inner loop is parallel.

After skewing, the nest walks over anti-diagonals (the wavefront):

DO j=4, n+n-2
  DOALL i = max(2, j-n+1), min(n-1, j-2)
    A[i, j-i] = (A[i+1, j-i] + A[i-1, j-i] + A[i, j-i+1] + A[i, j-i-1])/4
  ENDDO
ENDDO

[Figure: iteration space with the anti-diagonal wavefronts.]

R. Eigenmann, Optimizing compilers Slide 77 IV.3 Techniques for Multiprocessors: Mapping Parallelism to Shared-memory Machines

R. Eigenmann, Optimizing compilers Slide 78 Loop Fusion and Distribution

Fused loop:
DO i=1,n
  A(i) = B(i)
  C(i) = A(i-1) + D(i)
ENDDO

Distributed (fissioned) loops:
DO i=1,n
  A(i) = B(i)
ENDDO
DO i=1,n
  C(i) = A(i-1)+D(i)
ENDDO

• Loop fusion is the reverse of loop distribution • Fusion reduces the loop fork/join overhead and enhances data affinity • Distribution inserts a barrier synchronization between parallel loops • Both transformations reorder computation • Legality: dependences in fused loop must be lexically forward

R. Eigenmann, Optimizing compilers Slide 79 Loop Distribution Enables Other Techniques DO i=1,n DO i=1,n A(i) = B(i)+A(i-1) • enables A(i) = B(i)+A(i-1) interchange ENDDO DO j=1,m • separates D(i,j)=E(i,j) out partial DOALL j=1,m ENDDO paralleism DO i=1,n ENDDO D(i,j)=E(i,j) ENDDO ENDDO In a program with multiply-nested loops, there can be a large number of possible program variants obtained through distribution and interchanging

R. Eigenmann, Optimizing compilers Slide 80 Enforcing Data Dependence

Criterion for correct transformation and execution of a computation involving a data dependence with direction vector v = (=, …, =, <, …, *):

Let Ls be the outermost loop with a non-"=" DD direction:
– Ls must be executed serially
– the direction at Ls must be "<"

Same rule applies to all dependences

Note that a data dependence is defined with respect to an ordered execution. For autoparallelization, this is the serial program order. User-defined, fully parallel loops by definition do not have cross-iteration dependences. Legality rules for transforming already parallel programs are different.

R. Eigenmann, Optimizing compilers Slide 81 Loop Interchange Legality of Loop interchange and resulting parallelism can be tested with the above rules: After loop interchange, the two conditions must still hold.

DO i=1,n DOALL j=1,m DOALL j=1,m DO i=1,n A(i,j) = A(i-1,j) A(i,j) = A(i-1,j) ENDDO ENDDO ENDDO ENDDO DOALL i=1,n DOALL j=1,m DO j=1,m DO i=1,n A(i,j) = A(i-1,j-1) A(i,j) = A(i-1,j-1) ENDDO ENDDO ENDDO ENDDO

R. Eigenmann, Optimizing compilers Slide 82 Loop Coalescing a.k.a. loop collapsing

Before:
PARALLEL DO i=1,n
  DO j=1,m
    A(i,j) = B(i,j)
  ENDDO
ENDDO

After loop coalescing:
PARALLEL DO ij=1,n*m
  i = 1 + (ij-1) DIV m
  j = 1 + (ij-1) MOD m
  A(i,j) = B(i,j)
ENDDO

Loop coalescing • can increase the number of iterations of a parallel loop à load balancing • adds additional computation à overhead
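A C/OpenMP sketch of coalescing; the second loop shows that OpenMP's collapse clause performs essentially the same transformation. Sizes are arbitrary.

#include <stdio.h>
#define N 40
#define M 30

int main(void) {
    static double a[N][M], b[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            b[i][j] = i * M + j;

    /* Coalesced by hand: one parallel loop of N*M iterations; the original
       indices are recomputed with division and modulo (the added overhead). */
    #pragma omp parallel for
    for (int ij = 0; ij < N * M; ij++) {
        int i = ij / M;
        int j = ij % M;
        a[i][j] = b[i][j];
    }

    /* The same effect, expressed with OpenMP's collapse clause. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = b[i][j];

    printf("%f\n", a[N - 1][M - 1]);
    return 0;
}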

R. Eigenmann, Optimizing compilers Slide 83 Loop Blocking/Tiling

Before:
DO j=1,m
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
ENDDO

After blocking the i loop:
DO PARALLEL i1=1,n,block
  DO j=1,m
    DO i=i1,min(i1+block-1,n)
      B(i,j)=A(i,j)+A(i,j-1)
    ENDDO
  ENDDO
ENDDO

[Figure: the (i,j) iteration space; after blocking, each processor p1-p4 works on one block of i values across all j.]

This is basically the same transformation as stripmining, but followed by loop interchanging.

R. Eigenmann, Optimizing compilers Slide 84 Loop Blocking/Tiling continued

The same effect, expressed in OpenMP:

Before:
DO j=1,m
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
ENDDO

After:
!$OMP PARALLEL
DO j=1,m
!$OMP DO
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
!$OMP END DO NOWAIT
ENDDO
!$OMP END PARALLEL

[Figure: with a static schedule, each processor p1-p4 keeps the same block of i values across all j iterations.]

R. Eigenmann, Optimizing compilers Slide 85 Choosing the Block Size

The block size must be small enough so that all data references between the use and the reuse fit in cache.

DO j=1,m
  DO k=1,block
    … (r1 data references)
    … = A(k,j) + A(k,j-d)
    … (r2 data references)
  ENDDO
ENDDO

Number of references made between the access A(k,j) and the access A(k,j-d) to the same memory location: (r1+r2+3)*d*block

→ block < cachesize / ((r1+r2+3)*d)

If the cache is shared, all cores use it simultaneously, so the effective cache size appears smaller:
block < cachesize / ((r1+r2+3)*d*num_cores)
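A worked instance of this rule; the cache size, reference counts, and reuse distance below are assumed values for illustration, not measurements.

#include <stdio.h>

int main(void) {
    long cachesize = 512 * 1024 / 8;   /* assume a 512 KB cache, counted in doubles */
    int r1 = 4, r2 = 2, d = 8, num_cores = 8;

    long block_private = cachesize / ((long)(r1 + r2 + 3) * d);
    long block_shared  = cachesize / ((long)(r1 + r2 + 3) * d * num_cores);

    printf("private cache: block < %ld\n", block_private);   /* 910 */
    printf("shared cache:  block < %ld\n", block_shared);    /* 113 */
    return 0;
}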

Reference: Zhelong Pan, Brian Armstrong, Hansang Bae and Rudolf Eigenmann, On the Interaction of Tiling and Automatic Parallelization, First International Workshop on OpenMP (Wompat), 2005.

R. Eigenmann, Optimizing compilers Slide 86 Multi-level Parallelism from Single Loops

DO i=1,n
  A(i) = B(i)
ENDDO

Strip mining for multi-level parallelism:

PARALLEL DO (inter-cluster) i1=1,n,strip
  PARALLEL DO (intra-cluster) i=i1,min(i1+strip-1,n)
    A(i) = B(i)
  ENDDO
ENDDO

[Figure: a clustered machine; each cluster has a memory M shared by several processors P.]

R. Eigenmann, Optimizing compilers Slide 87 References

• High Performance Compilers for Parallel Computing, Michael Wolfe, Addison-Wesley, ISBN 0-8053-2730-4
• Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Ken Kennedy and John R. Allen, Morgan Kaufmann Publishers, ISBN 1558602860

R. Eigenmann, Optimizing compilers Slide 88 IV.4 Advanced Program Analysis

R. Eigenmann, Optimizing compilers Slide 89 Interprocedural Analysis

• Most compiler techniques work intra- procedurally • Ideally, inter-procedural analyses and transformations available • In practice: inter-procedural operation of basic analysis works well • helps but no silver bullet

R. Eigenmann, Optimizing compilers Slide 90 Interprocedural Constant Propagation Making constant values of variables known across subroutine calls

Subroutine A
  j = 150
  call B(j)
END

Subroutine B(m)
  DO k=1,100
    X(k)=X(k+m)
  ENDDO
END

knowing that m>100 allows this loop to be parallelized

R. Eigenmann, Optimizing compilers Slide 91 An Algorithm for Interprocedural Constant Propagation Intra-procedural part: determine jump functions for all subroutines

Subroutine X(a,b,c)
  e = 10
  d = b+2
  call Y(c)
  f = b*2
  call Z(a,d,c,e,f)
END

Jump functions:
JY,1 = c
JZ,1 = a      (jump function of the first parameter)
JZ,2 = b+2
JZ,3 = ⊥      (called bottom, meaning non-constant)
JZ,4 = 10
JZ,5 = ⊥

• Mechanism for finding jump functions: (local) forward substitution and interprocedural MAYMOD information. • Here we assume the compiler supports jump functions of the form P+const (P is a subroutine parameter of the callee).

R. Eigenmann, Optimizing compilers Slide 92 Constant Propagation Algorithm:

Interprocedural Part

1. initialize all formal parameters to the value T (called top = non yet known) 2. for all jump functions: – if it is ^: set formal parameter value to ^ (called bottom = unknown) – if it is constant and the value of the formal parameter is the same constant or T : set the value to this constant 3. put all formal parameters on a work queue 4. repeat: take a parameter from the queue until queue is empty for all jump functions that contain this parameter: • determine the value of the target parameter of this jump function. Set it to this value, or to ^ if it is different from a previously set value. • if the value of the target parameter changes, put this parameter on the queue

R. Eigenmann, Optimizing compilers Slide 93 Examples of Constant Propagation

x = 3 Consider Call SubY(x) what happens if x = 3 x = 3 t = 6 t = 7 Call SubY(x) Call SubY(x) Call SubU(t)

Subroutine SubY(a) Subroutine SubY(a) Subroutine SubY(a) b = a+2 b = a+2 Call SubZ(b) … = ….a… Call SubZ(b)

Subroutine SubU(c) Subroutine SubZ(e) d = c-1 Call SubZ(d) … = … e….

Subroutine SubZ(e)

… = … e….

R. Eigenmann, Optimizing compilers Slide 94 Interprocedural Data-Dependence Analysis • Motivational examples:

Example 1:
DO i=1,n
  call clear(a,i)
ENDDO

Subroutine clear(x,j)
  x(j) = 0
END

Example 2:
DO i=1,n
  a(i) = b(i)
  call dupl(a,i)
ENDDO

Subroutine dupl(x,j)
  x(j) = 2*x(j)
END

Example 3:
DO k=1,m
  DO i=1,n
    a(i,k) = math(i,k)
    call smooth(a(i,k))
  ENDDO
ENDDO

Subroutine smooth(x,j)
  x(j) = (x(j-1)+x(j)+x(j+1))/3
END

R. Eigenmann, Optimizing compilers Slide 95 Interprocedural Data-Dependence Analysis • Overall strategy: – subroutine inlining – move loop into called subroutine – collect array access information in callee and use in the analysis of the caller ® will be discussed in more detail

R. Eigenmann, Optimizing compilers Slide 96 Interprocedural Data-Dependence Analysis • Representing array access information – summary information • [low:high] or [low:high:stride] • sets of the above – exact representation • essentially all loop bound and subscript information is captured – representation of multiple subscripts • separate representation • linearized

R. Eigenmann, Optimizing compilers Slide 97 Interprocedural Data-Dependence Analysis • Reshaping arrays – simple conversion • matching subarray or 2-D®1-D – exact reshaping with div and mod – linearizing both arrays – equivalencing the two shapes • can be used in subroutine inlining Important: reshaping may lose the implicit assertion that array bounds are not violated!

R. Eigenmann, Optimizing compilers Slide 98 Symbolic Analysis

• Expression manipulation techniques – Expression simplification/normalization – Expression comparison – Symbolic arithmetic • Range analysis – Find lower/upper bounds of variable values at a given statement • For each statement and variable, or • Demand-driven, for a given statement and variable

R. Eigenmann, Optimizing compilers Slide 99 Symbolic Range Analysis Example

int foo(int k)
{                              // ranges: []
  int i, j;                    // []
  double a;                    // []
  for ( i=0; i<10; ++i ) {     // [0<=i<=9]
    a=(0.5*i);
  }
                               // [i=10]
  j=(i+k);
                               // [i=10, j=(i+k)]
  return j;
}

R. Eigenmann, Optimizing compilers Slide 100

Find references to the same storage by different names Þ Program analyses and transformations must consider all these names

Simple case: different named variables allocated in same storage location • Fortran Equivalence statement • Same variable passed to subroutine by-reference as two different parameters (can happen in Fortran and C++, but not in C) • Global variable also passed as subroutine parameter

R. Eigenmann, Optimizing compilers Slide 101 Pointer Alias Analysis

• More complex: variables pointed to by named pointers – p=&a; q=&a => *p, *q are aliases – Same variable passed to C subroutines via pointer

• Most complex: pointers between dynamic data structure objects – This is commonly referred to as shape analysis

R. Eigenmann, Optimizing compilers Slide 102 Is Alias Analysis in Parallelizing Compilers Important? • Fortran77: alias analysis is simple/absent – By Fortran rule, aliased subroutine parameters must not be written to – there are no pointers • C programs: alias analysis is a must – Pointers, pointer arithmetic – No Fortran-like rule about subroutine parameters – Without alias information, compilers would have to be very conservative => big loss of parallelism – Classical science/engineering applications do not have dynamic data structures => no shape analysis needed

R. Eigenmann, Optimizing compilers Slide 103 IV.5 Dynamic Decision Support

R. Eigenmann, Optimizing compilers Slide 104 Achilles’ Heel of Compilers

Big compiler limitations: – Insufficient compile-time knowledge • Input data • Architecture parameters (e.g., cache size) • Memory layout – Even if this information is known: Performance models too complex Effect: – Unknown profitability of optimizations – Inconsistent performance behavior – Conservative behavior of compilers – Many compiler options – Users need to experiment with options

R. Eigenmann, Optimizing compilers Slide 105 Multi-version Code

IF (d>n) THEN
  PARALLEL DO i=1,n
    a(i) = a(i+d)
  ENDDO
ELSE
  DO i=1,n
    a(i) = a(i+d)
  ENDDO
ENDIF

Limitations:
• Less readable
• Additional code
• Not feasible for all optimizations
• Combinatorial explosion when trying to apply it to many optimization decisions
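A C/OpenMP sketch of such two-version code for the example above; N and d are illustrative, and in practice d would only be known at run time.

#include <stdio.h>
#define N 100000

int main(void) {
    static double a[2 * N];
    for (int i = 0; i < 2 * N; i++) a[i] = i;
    int d = N;   /* generally unknown at compile time */

    /* The parallel version is only taken when the runtime test proves that
       the written range a[0..N-1] and the read range a[d..N-1+d] cannot overlap. */
    if (d >= N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i + d];
    } else {
        for (int i = 0; i < N; i++)
            a[i] = a[i + d];
    }
    printf("%f\n", a[0]);
    return 0;
}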

R. Eigenmann, Optimizing compilers Slide 106 Profiling

• Gather missing information in a profile run – Compiler instruments code that gathers at runtime information needed for optimization decisions • Use the gathered profile information for improved decision making in a second compiler invocation • Training vs. production data • Initially used for branch prediction. Now increasingly used to guide additional transformations. • Requires a compiler performance model

R. Eigenmann, Optimizing compilers Slide 107 Autotuning – Empirical Tuning

Autotuning: try many optimization variants; pick the best at runtime.

[Figure: a tuning cycle of version generation, search-space navigation, and runtime evaluation.]

• No compiler performance model needed
• Optimization decisions based on true execution time
• Dependence on training data (same as profiling)
• Potentially huge search space
• Whole-program vs. section-level tuning

Many active research projects

R. Eigenmann, Optimizing compilers Slide 108 IV.4 Techniques for Vector Machines

R. Eigenmann, Optimizing compilers Slide 109 Vector Instructions

A vector instruction operates on a number of data elements at once. Example: vadd va,vb,vc,32 vector operation of length 32 on vector registers va,vb, and vc – va,vb,vc can be • Special cpu registers or memory ® classical supercomputers • Regular registers, subdivided into shorter partitions (e.g., 64bit register split 8-way) ® multi-media extensions – The operations on the different vector elements can overlap ® vector pipelining

R. Eigenmann, Optimizing compilers Slide 110 Applications of Vector Operations • Science/engineering applications are typically regular with large loop iteration counts. This was ideal for classical supercomputers, which had long vectors (up to 256; vector pipeline startup was costly). • Graphics applications can exploit “multi- media” register features and instruction sets.

R. Eigenmann, Optimizing compilers Slide 111 Basic Vector Transformation

DO i=1,n
  A(i) = B(i)+C(i)           →   A(1:n)=B(1:n)+C(1:n)
ENDDO

DO i=1,n
  A(i) = B(i)+C(i)           →   A(1:n)=B(1:n)+C(1:n)
  C(i-1) = D(i)**2               C(0:n-1)=D(1:n)**2
ENDDO

The triplet notation is interpreted to mean “vector operation”. Notice that this is not (necessarily) the same meaning as in Fortran 90,

R. Eigenmann, Optimizing compilers Slide 112 Distribution and Vectorization The transformation done on the previous slide involves loop distribution. Loop distribution reorders computation and is thus subject to data dependence constraints. loop DO i=1,n distribution A(i) = B(i)+C(i) DO i=1,n ENDDO A(i) = B(i)+C(i) D(i) = A(i)+A(i-1) DO i=1,n dependence ENDDO D(i) = A(i)+A(i-1) ENDDO vectorization

The transformation is not legal if there is a A(1:n)=B(1:n)+C(1:n) lexical-backward dependence: D(1:n)=A(1:n)+A(0:n-1) DO i=1,n loop-carried A(i) = B(i)+C(i) dependence Statement reordering may help C(i+1) = D(i)**2 resolve the problem. However, this is not possible if there is a dependence ENDDO cycle.

R. Eigenmann, Optimizing compilers Slide 113 Vectorization Needs Expansion

... as opposed to privatization

DO i=1,n DO i=1,n t = A(i)+B(i) expansion T(i) = A(i)+B(i) C(i) = t + t**2 C(i) = T(i) + T(i)**2 ENDDO ENDDO vectorization

T(1:n) = A(1:n)+B(1:n) C(1:n) = T(1:n)+T(1:n)**2

R. Eigenmann, Optimizing compilers Slide 114 Conditional Vectorization

DO i=1,n IF (A(i) < 0) A(i)=-A(i) ENDDO

conditional vectorization

WHERE (A(1:n) < 0) A(1:n)=-A(1:n)

R. Eigenmann, Optimizing compilers Slide 115 Stripmining for Vectorization

Before:
DO i=1,n
  A(i) = B(i)
ENDDO

After stripmining (vector length 32):
DO i1=1,n,32
  DO i=i1,min(i1+31,n)
    A(i) = B(i)
  ENDDO
ENDDO

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism. It also needs to be done by the code-generating compiler to split an operation into chunks of the available vector length.

R. Eigenmann, Optimizing compilers Slide 116 IV.7 Compiling for Heterogeneous Architectures

R. Eigenmann, Optimizing compilers Slide 117 Why Heterogeneous Architectures? • Performance – Fast uniprocessor best for serial code – Many simple cores best for highly parallel code – Special-purpose architectures for accelerating certain code patterns • E.g., math co-processor • Energy – Same arguments hold for power savings

R. Eigenmann, Optimizing compilers Slide 118 Examples of Accelerators

• GPU Accelerators are typically used as co-processors. • nvidia GPGPU • CPU+accelerator = heterogeneous • Shared or distributed address space • IBM Cell • Intel MIC • FPGAs • Crypto processor • Network processor • Video Encoder/decoder

R. Eigenmann, Optimizing compilers Slide 119 Accelerator Architecture

CPU GPU

Example GPGPU: Grid Thread Block M

• Address space is Thread Block 0 separate from CPU Shared Memory

• Complex Memory Registers Registers CUDA hierarchy Memory • Large number of cores Thread 0 ••• Thread K Model Local Local • Multithreaded SIMD Memory Memory execution Global Memory • Optimized for coalesced CPU Texture Memory with a Dedicated Cache (stride-1) accesses Memory Constant Memory with a Dedicated Cache

R. Eigenmann, Optimizing compilers Slide 120 Compiler Optimizations for GPGPUs • Optimizing GPU Global Memory Accesses – Parallel Loop Swap – Loop Collapsing – Matrix Transpose

• Exploiting GPU On-chip Memories

• Optimizing CPU-GPU Data Movement – Resident GPU Variable Analysis – Live CPU Variable Analysis – Memory Transfer Promotion Optimization

R. Eigenmann, Optimizing compilers Slide 121 Parallel Loop-Swap Transformation Memory access at time t k #pragma omp parallel for Thread ID for(i=0; i< N; i++) T0 for(k=0; k

#pragma omp parallel for Thread ID T0 T1 T2 T3 k schedule(static, 1) for(k=0; k

R. Eigenmann, Optimizing compilers Slide 122 Loop Collapsing Transformation k #pragma omp parallel for Thread ID for(i=0; i

#pragma omp parallel Thread ID T0 T1 T2 T3 T4 T5 T6 T7 #pragma omp for collapse(2) schedule(static, 1) for(i=0; i

R. Eigenmann, Optimizing compilers Slide 123 Matrix-Transpose Transformation Memory access at time t i Thread ID A float d[N][M] T0 ... T1 k T2 T3 #kernel function: Global Memory float d[M][N] # pragma omp parallel Thread ID T0 T1 T2 T3 for(k=0; i< N; i++) k for(i=0; k

Global Memory

R. Eigenmann, Optimizing compilers Slide 124 Techniques to Exploit GPU On-chip Memories Caching Strategies

Variable type                                   Caching strategy
R/O shared scalar w/o locality                  SM
R/O shared scalar w/ locality                   SM, CM, Reg
R/W shared scalar w/ locality                   Reg, SM
R/W shared array element w/ locality            Reg
R/O 1-dimensional shared array w/ locality      TM
R/W private array w/ locality                   SM

Reg: Registers CM: Constant Memory SM: Shared Memory TM: Texture Memory

R. Eigenmann, Optimizing compilers Slide 125 Techniques to Optimize Data Movement between CPU and GPU • Resident GPU Variable Analysis – Up-to-date data in GPU global memory: do not copy again from CPU. • Live CPU Variable Analysis – After a kernel finishes: only transfer live CPU variables from GPU to CPU. • Memory Transfer Promotion Optimization – Find optimal points to insert necessary memory transfers

R. Eigenmann, Optimizing compilers Slide 126 GPGPU Performance Relative to CPU

R. Eigenmann, Optimizing compilers Slide 127 [Figure: GPGPU speedups relative to the CPU per benchmark, and the importance of individual optimizations; the chart labels did not survive extraction.]

R. Eigenmann, Optimizing compilers Slide 128 IV.8 Techniques Specific to Distributed-memory Machines

R. Eigenmann, Optimizing compilers Slide 129 Execution Scheme on a Distributed-Memory Machine Typical execution scheme: • All nodes execute the same program M M M M • Program uses node_id to select the P P P P subcomputation to execute on each participating processor and the data to access. For example, mystrip=én/max_nodesù DO i=1,n lb = node_id*mystrip +1 ub = min(lb+mystrip-1,n) . . . how to place DO i=lb,ub and access ENDDO . . . data ? ENDDO how/when to synchronize ? This is called Single-Program-Multiple-Data (SPMD) execution scheme

R. Eigenmann, Optimizing compilers Slide 130 Data Placement

Single owner: • Data is distributed onto the participating processors’ memories

Replication: • Multiple versions of the data are placed on some or all nodes.

R. Eigenmann, Optimizing compilers Slide 131 Data Distribution Schemes

Numbers indicate the node of a 4-processor distributed-memory machine on which each array section is placed.

  block distribution:         1111 2222 3333 4444
  cyclic distribution:        1234 1234 1234 1234 ...
  block-cyclic distribution:  11 22 33 44 11 22 ...
  indexed distribution:       IND(1) IND(2) IND(3) IND(4) IND(5) ...
                              (an index array gives the owning node)
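As a hedged illustration (not from the slide), the owning node of element i of an n-element array on P nodes can be computed as follows for the regular schemes; for the indexed distribution the owner is simply looked up in the index array.

  /* Owner of element i (0-based). P = number of nodes, n = array length,
     B = block size of the block-cyclic scheme. */
  int owner_block(int i, int n, int P)        { int b = (n + P - 1) / P; return i / b; }
  int owner_cyclic(int i, int P)              { return i % P; }
  int owner_block_cyclic(int i, int B, int P) { return (i / B) % P; }
  /* Indexed distribution: owner = IND[i]. */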

Automatic data distribution is difficult because it is a global optimization.

R. Eigenmann, Optimizing compilers Slide 132 Message Generation for single-owner placement

EXAMPLE
  Original loop:
  DO i=1,n
    B(i) = A(i)+A(i-1)
  ENDDO

  After message generation:
  send (A(ub), my_proc+1)
  receive (A(lb-1), my_proc-1)
  DO i=lb,ub
    B(i) = A(i)+A(i-1)
  ENDDO

• lb, ub determine the iterations assigned to each processor.
• The data uses a block distribution that matches the iteration distribution.
• my_proc is the current processor number.
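A hedged C/MPI rendering of the generated communication for this block-distributed stencil (0-based indexing, lb/ub delimiting the owned range with ub exclusive; for brevity the arrays are indexed with global indices):

  #include <mpi.h>

  /* Each node owns A[lb..ub-1] and B[lb..ub-1]. Computing B[i] = A[i] + A[i-1]
     needs one halo element, A[lb-1], from the left neighbor. */
  void stencil_step(double *A, double *B, int lb, int ub, int my_proc, int nprocs) {
      int right = (my_proc + 1 < nprocs) ? my_proc + 1 : MPI_PROC_NULL;
      int left  = (my_proc > 0)          ? my_proc - 1 : MPI_PROC_NULL;
      double halo = 0.0;

      /* send my last owned element to the right, receive my halo from the left */
      MPI_Sendrecv(&A[ub - 1], 1, MPI_DOUBLE, right, 0,
                   &halo,      1, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (my_proc > 0)
          A[lb - 1] = halo;

      int start = (lb > 0) ? lb : 1;   /* the global loop runs over i = 1..n-1 */
      for (int i = start; i < ub; i++)
          B[i] = A[i] + A[i - 1];
  }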

Compilers for languages such as HPF (High-Performance Fortran) have explored these ideas extensively

R. Eigenmann, Optimizing compilers Slide 133 Owner-computes Scheme
In general, the elements accessed by a processor differ from the elements it owns as defined by the data distribution.

  Original loop:
  DO i=1,n
    A(i) = B(i)+B(i-m)
    C(ind(i)) = D(ind2(i))
  ENDDO

  Owner-computes version:
  DO i=1,n
    send/receive what's necessary
    IF I_own(A(i)) THEN
      A(i) = B(i)+B(i-m)
    ENDIF
    send/receive what's necessary
    IF I_own(C(ind(i))) THEN
      C(ind(i)) = D(ind2(i))
    ENDIF
  ENDDO

• Nodes execute those iterations and statements whose LHS they own.
• First, they receive needed RHS elements from remote nodes.
• Nodes must send all elements needed by other nodes.
The example shows the basic idea only; compiler optimizations are needed!

R. Eigenmann, Optimizing compilers Slide 134 Compiler Optimizations for the Raw Owner-computes Scheme
• Eliminate conditional execution
  – combine IF statements with the same condition
  – reduce the iteration space if possible
• Aggregate communication
  – combine small messages into larger ones
  – tradeoff: delaying a message enables aggregation but increases the message latency
• Message prefetch
  – move send operations earlier in order to reduce message latencies
There is a large number of research papers describing such techniques.

R. Eigenmann, Optimizing compilers Slide 135 Message Generation for Virtual Data Replication

[Figure: execution timeline (time runs downward): a fully parallel section with local reads and writes, then a broadcast of the written data, then the next fully parallel section, and so on.]
Optimization: reduce broadcast operations to the necessary point-to-point communication.

Advantages:
• Fully parallel sections with local reads and writes
• Easier message set computation (no partitioning per processor needed)

Disadvantages:
• Not data-scalable
• More write operations necessary (but collective communication can be used)

R. Eigenmann, Optimizing compilers Slide 136 IV.9 Techniques for Instruction-Level Parallelization

R. Eigenmann, Optimizing compilers Slide 137 Implicit vs. Explicit ILP

Implicit ILP: the ISA is the same as for sequential programs.
– Most processors today employ a certain degree of implicit ILP.
– Parallelism detection is done entirely by the hardware.
– The compiler can assist ILP by arranging the code so that detection becomes easier.

R. Eigenmann, Optimizing compilers Slide 138 Implicit vs. Explicit ILP

Explicit ILP: the ISA expresses parallelism.
– Parallelism is detected by the compiler.
– Parallelism is expressed in the form of:
  • VLIW (Very Long Instruction Word): several instructions are packed into one long word.
  • EPIC (Explicitly Parallel Instruction Computing): bundles of (up to three) instructions are issued, and dependence bits can be specified. Used in the Intel/HP IA-64 architecture. The processor also supports predication, early (speculative) loads, prepare-to-branch, and rotating registers.

R. Eigenmann, Optimizing compilers Slide 139 Trace Scheduling (invented for VLIW processors, still useful terminology)

Two big issues must be solved by all approaches:
1. Identifying the instruction sequence that will be inspected for ILP (trace selection). Main obstacle: branches.
2. Reordering the instructions so that machine resources are exploited efficiently (trace compaction).
Together, the two steps constitute trace scheduling.

R. Eigenmann, Optimizing compilers Slide 140 Trace Selection

• It is important to have a large instruction window (block) within which the compiler can find parallelism.
• Branches are the problem: instruction pipelines have to be flushed/squashed at branches.
• Possible remedies:
  – eliminate branches
  – code motion can increase block size
  – a block can contain out-branches with low probability
  – predicated execution

R. Eigenmann, Optimizing compilers Slide 141 Branch Elimination

• Example:

Before:
    comp R0 R1
    bne  L1
    bra  L2
  L1: . . .
  L2: . . .

After branch elimination:
    comp R0 R1
    beq  L2
  L1: . . .
  L2: . . .

R. Eigenmann, Optimizing compilers Slide 142 Code Motion

[Figure: control-flow trees illustrating code motion involving instruction I1 and conditionals c1 and c2; the diagram did not survive extraction.]
Code motion can increase window sizes and eliminate subtrees.
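A hedged source-level analogue of what the slide shows at the instruction level (variable names are illustrative): an operation that appears in both arms of a branch can be moved above the branch, which enlarges the straight-line block and eliminates a duplicated subtree.

  /* Before: t = a * b appears in both arms of the branch. */
  int before_motion(int a, int b, int c1) {
      int t;
      if (c1) { t = a * b; t = t + 1; }
      else    { t = a * b; t = t - 1; }
      return t;
  }

  /* After code motion: the common operation is hoisted above the branch. */
  int after_motion(int a, int b, int c1) {
      int t = a * b;
      if (c1) t = t + 1;
      else    t = t - 1;
      return t;
  }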

R. Eigenmann, Optimizing compilers Slide 143 Predicated Execution

  Source:               Predicated form:
  IF (a>0) THEN         p = a>0       ; assignment of predicate
    b = a               p:  b = a     ; executed if predicate true
  ELSE                  !p: b = -a    ; executed if predicate false
    b = -a
  ENDIF

Predication
• increases the window size for analyzing and exploiting parallelism
• increases the number of instructions “executed”
These are opposite demands!

Compare this technique to conditional vectorization

R. Eigenmann, Optimizing compilers Slide 144 Dependence-removing ILP Techniques

Induction variable example:

  With dependences:      After removal:
  ind = i0               ind = i0
  . . .                  . . .
  ind = ind+1            ind = i0+1
  . . .                  . . .
  ind = ind+1            ind = i0+2

Reduction example:

  With dependences:      After removal:
  sum = sum+expr1        s1 = expr1
  . . .                  . . .
  sum = sum+expr2        s1 = s1+expr2
  . . .                  . . .
  sum = sum+expr3        s2 = expr3
  . . .                  . . .
  sum = sum+expr4        s2 = s2+expr4
                         . . .
                         sum = sum+s1+s2

In each case, every statement on the left depends on the previous one. After the transformation, the statement groups on the right (shown as shaded blocks on the slide) are independent of each other and can be executed as parallel instructions.
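A hedged C rendering of the reduction example (expr1..expr4 stand for arbitrary independent expressions; the function form is illustrative): after splitting, the two partial-sum chains carry no dependence between them and can be issued as parallel instructions, with a single combining add at the end.

  /* Before: one dependence chain through 'sum' serializes all four additions. */
  double reduce_serial(double e1, double e2, double e3, double e4) {
      double sum = 0.0;
      sum = sum + e1;
      sum = sum + e2;
      sum = sum + e3;
      sum = sum + e4;
      return sum;
  }

  /* After: the chains through s1 and s2 are independent of each other;
     only the final statement combines them. */
  double reduce_split(double e1, double e2, double e3, double e4) {
      double s1 = e1;
      double s2 = e3;
      s1 = s1 + e2;
      s2 = s2 + e4;
      return s1 + s2;
  }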

R. Eigenmann, Optimizing compilers Slide 145 Speculative ILP
Speculation is performed by the architecture in various forms:
– Superscalar processors: the compiler only has to deal with the performance model; the ISA is the same as for non-speculative processors.
– Multiscalar processors (research only): the compiler defines tasks that the hardware can try to execute speculatively in parallel. Other than task boundaries, the ISA is the same.

References:
• Task Selection for a Multiscalar Processor, T. N. Vijaykumar and Gurindar S. Sohi, The 31st International Symposium on Microarchitecture (MICRO-31), pp. 81-92, December 1998.
• Reference Idempotency Analysis: A Framework for Optimizing Speculative Execution, Seon-Wook Kim, Chong-Liang Ooi, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar, In Proc. of PPOPP'01, Symposium on Principles and Practice of Parallel Programming, 2001.

R. Eigenmann, Optimizing compilers Slide 146 Compiler Model of Explicit Speculative Parallel Execution (Multiscalar Processor)
• Overall Execution: speculative threads choose and start the execution of any predicted next thread.
• Data Dependence and Control Flow Violations lead to roll-backs.
• Final Execution: satisfies all cross-segment flow and control dependences.
• Data Access: writes go to thread-private speculative storage; reads read from an ancestor thread or from memory.
• Dependence Tracking: data flow and control flow dependences are detected directly and lead to roll-back; anti and output dependences are satisfied via speculative storage.
• Segment Commit: correctly executed threads (i.e., their final execution) commit their speculative storage to memory, in sequential order.

R. Eigenmann, Optimizing compilers Slide 147