Optimizing Compilers Compiler Structure

Compiler structure

token stream → Parser → syntax tree → Semantic analysis → syntax tree → Intermediate code generator → low-level IR (e.g., canonical trees/tuples) → Optimizer → low-level IR (e.g., canonical trees/tuples) → Machine code generator → machine code

Potential optimizations:

Source-language (AST):
• constant bounds in loops/arrays
• loop unrolling
• suppressing run-time checks (e.g., type checking)
• enable later optimizations

IR: local and global
• common-subexpression elimination (CSE)
• live variable analysis
• code hoisting
• enable later optimizations

Code-generation (machine code):
• register allocation
• instruction scheduling
• peephole optimization

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

CS502 Register allocation

Optimization

Goal: produce fast code
• What is optimality?
• Problems are often hard
• Many are intractable or even undecidable
• Many are NP-complete
• Which optimizations should be used?

Definition: An optimization is a transformation that is expected to:
• improve the running time of a program, or
• decrease its space requirements

The point:
• “improved” code, not “optimal” code
• sometimes produces worse code
• range of speedup might be from 1.000001 to 4 (or more)
• many optimizations overlap or interact

Machine-independent transformations
• applicable across broad range of machines
• remove redundant computations
• move evaluation to a less frequently executed place
• specialize some general-purpose code
• find useless code and remove it
• expose opportunities for other optimizations

Machine-dependent transformations
• capitalize on machine-specific properties
• improve mapping from IR onto machine
• replace a costly operation with a cheaper one
• hide latency
• replace sequence of instructions with more powerful one (use “exotic” instructions)

A classical distinction
The distinction is not always clear:
    replace multiply with shifts and adds

Optimization
Desirable properties of an optimizing compiler
• code at least as good as an assembler programmer
• stable, robust performance (predictability)
• architectural strengths fully exploited
• architectural weaknesses fully hidden
• broad, efficient support for language features
• instantaneous compiles
Unfortunately, modern compilers often drop the ball

Optimization
Good compilers are crafted, not assembled
• consistent philosophy
• careful selection of transformations
• thorough application
• coordinate transformations and data structures
• attention to results (code, time, space)

Scope of optimization
Local (single block)
• confined to straight-line code
• simplest to analyse
• time frame: ’60s to present, particularly now
Intraprocedural (global)
• consider the whole procedure
• What do we need to optimize an entire procedure?
• classical data-flow analysis, dependence analysis
• time frame: ’70s to present
Interprocedural (whole program)
• analyse whole programs
• What do we need to optimize an entire program?
• less information is discernible
• time frame: late ’70s to present, particularly now

Compilers are engineered objects
• minimize running time of compiled code
• minimize compile time
• use reasonable compile-time space (serious problem)
Thus, results are sometimes unexpected

Optimization
Three considerations arise in applying a transformation:
• safety
• profitability
• opportunity
We need a clear understanding of these issues
• the literature often hides them
• every discussion should list them clearly

Safety
Fundamental question: Does the transformation change the results of executing the code?
    yes ⇒ don’t do it!
    no ⇒ it is safe
Compile-time analysis
• may be safe in all cases (loop unrolling)
• analysis may be simple (DAGs and CSEs)
• may require complex reasoning (data-flow analysis)

Profitability
Fundamental question: Is there a reasonable expectation that the transformation will be an improvement?
    yes ⇒ do it!

Opportunity
Fundamental question: Can we efficiently locate sites for applying the transformation?
    yes ⇒ compilation time won’t suffer
When the answer is no:
• profitability: no ⇒ don’t do it
• opportunity: no ⇒ better be highly profitable

Opportunity: issues
• provides a framework for applying transformation
• systematically find all sites
• update safety information to reflect previous changes
• order of application (hard)

Profitability: compile-time estimation
• always profitable
• heuristic rules
• compute benefit (rare)

Optimization
Successful optimization requires
• test for safety, profitability should be $O(1)$ per transformation, or $O(n)$ for whole routine (maybe $n \log n$)
• profit is local improvement × executions
  ⇒ focus on loops:
  – loop unrolling
  – factoring loop invariants
  – strength reduction
• want to minimize side-effects like code growth

Example: loop unrolling
Idea: reduce loop overhead by creating multiple successive copies of the loop’s body and increasing the increment appropriately
Safety: always safe
Profitability: reduces overhead
    (instruction cache blowout)
    (subtle secondary effects)
Opportunity: loops
Unrolling is easy to understand and perform

Example: loop unrolling
Matrix-matrix multiply
    do i ← 1, n, 1
      do j ← 1, n, 1
        c(i,j) ← 0
        do k ← 1, n, 1
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)
• $2n^3$ flops, $n^3$ loop increments and branches
• each iteration does 2 loads and 2 flops
This is the most overstudied example in the literature

Example: loop unrolling
Matrix-matrix multiply (assume 4-word cache line)
    do i ← 1, n, 1
      do j ← 1, n, 1
        c(i,j) ← 0
        do k ← 1, n, 4
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)
          c(i,j) ← c(i,j) + a(i,k+1) * b(k+1,j)
          c(i,j) ← c(i,j) + a(i,k+2) * b(k+2,j)
          c(i,j) ← c(i,j) + a(i,k+3) * b(k+3,j)
• $2n^3$ flops, $n^3/4$ loop increments and branches
• each iteration does 8 loads and 8 flops
• memory traffic is better
  – c(i,j) is reused (put it in a register)
  – a(i,k) references are from cache
  – b(k,j) is problematic

Example: loop unrolling
Matrix-matrix multiply (to improve
traffic on b)
    do j ← 1, n, 1
      do i ← 1, n, 4
        c(i,j) ← 0
        c(i+1,j) ← 0
        c(i+2,j) ← 0
        c(i+3,j) ← 0
        do k ← 1, n, 4
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)
                          + a(i,k+1) * b(k+1,j)
                          + a(i,k+2) * b(k+2,j)
                          + a(i,k+3) * b(k+3,j)
          c(i+1,j) ← c(i+1,j) + a(i+1,k) * b(k,j)
                              + a(i+1,k+1) * b(k+1,j)
                              + a(i+1,k+2) * b(k+2,j)
                              + a(i+1,k+3) * b(k+3,j)
          c(i+2,j) ← c(i+2,j) + a(i+2,k) * b(k,j)
                              + a(i+2,k+1) * b(k+1,j)
                              + a(i+2,k+2) * b(k+2,j)
                              + a(i+2,k+3) * b(k+3,j)
          c(i+3,j) ← c(i+3,j) + a(i+3,k) * b(k,j)
                              + a(i+3,k+1) * b(k+1,j)
                              + a(i+3,k+2) * b(k+2,j)
                              + a(i+3,k+3) * b(k+3,j)

What happened?
• interchanged i and j loops
• unrolled i loop
• fused inner loops
• $2n^3$ flops, $n^3/16$ loop increments and branches
• first assignment does 8 loads and 8 flops
• 2nd through 4th do 4 loads and 8 flops
• memory traffic is better
  – c(i,j) is reused (register)
  – a(i,k) references are from cache
  – b(k,j) is reused (register)

Example: loop unrolling
It is not as easy as it looks:
Safety: loop interchange? loop unrolling? loop fusion?
Profitability: machine dependent (mostly)
Opportunity: find memory-bound loop nests
Summary
• chance for large improvement
• answering the fundamentals is tough
• resulting code is ugly
Matrix-matrix multiply is everyone’s favorite example

Loop optimizations: factoring loop-invariants
Loop invariants: expressions constant within loop body
Relevant variables: those used to compute an expression
Opportunity:
1. identify variables defined in body of loop (LoopDef)
2. loop invariants have no relevant variables in LoopDef
3. assign each loop-invariant to temp. in loop header
4. use temporary in loop body
Safety: loop-invariant expression may throw exception early
Profitability:
• loop may execute 0 times
• loop-invariant may not be needed on every path through loop body
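The four steps for factoring loop invariants can be sketched as a small pass over three-address statements. This is a minimal illustration, not the course's implementation: the `Assign` statement form, the `_inv` temporary names, and the `run` interpreter are all hypothetical, and the pass only recognizes expressions whose operands are defined entirely outside the loop (it does not iterate to find invariants that depend on other invariants).

```python
import operator
from typing import NamedTuple, Optional, Union

Operand = Union[str, int]  # a variable name or an integer constant


class Assign(NamedTuple):
    """Hypothetical three-address statement: target := left op right."""
    target: str
    op: str                      # "+", "*", or "copy" (target := left)
    left: Operand
    right: Optional[Operand] = None


def factor_invariants(induction_var, body):
    """Return (header, new_body) with loop-invariant expressions factored out."""
    # Step 1: LoopDef = variables defined in the loop body (plus the induction variable).
    loop_def = {stmt.target for stmt in body} | {induction_var}
    header, new_body, temps = [], [], {}
    for stmt in body:
        # Relevant variables: the variables used to compute the expression.
        relevant = [x for x in (stmt.left, stmt.right) if isinstance(x, str)]
        # Step 2: the expression is invariant if no relevant variable is in LoopDef.
        if stmt.op != "copy" and not any(v in loop_def for v in relevant):
            key = (stmt.op, stmt.left, stmt.right)
            if key not in temps:
                # Step 3: assign the invariant to a temporary in the loop header.
                temps[key] = f"_inv{len(temps)}"
                header.append(Assign(temps[key], *key))
            # Step 4: use the temporary in the loop body.
            new_body.append(Assign(stmt.target, "copy", temps[key]))
        else:
            new_body.append(stmt)
    return header, new_body


OPS = {"+": operator.add, "*": operator.mul}


def run(header, induction_var, lo, hi, body, env):
    """Execute header once, then body for induction_var = lo..hi; return the final env."""
    env = dict(env)

    def value(x):
        return env[x] if isinstance(x, str) else x

    def execute(stmt):
        if stmt.op == "copy":
            env[stmt.target] = value(stmt.left)
        else:
            env[stmt.target] = OPS[stmt.op](value(stmt.left), value(stmt.right))

    for stmt in header:
        execute(stmt)
    for i in range(lo, hi + 1):
        env[induction_var] = i
        for stmt in body:
            execute(stmt)
    return env


body = [
    Assign("t", "*", "a", "b"),  # invariant: a and b are not in LoopDef
    Assign("s", "+", "s", "t"),  # not invariant: s and t are in LoopDef
]
header, new_body = factor_invariants("i", body)
```

Note that the header is executed even when the loop body runs zero times, which is exactly the safety and profitability caveat listed above: the sketch is only appropriate when evaluating the invariant early cannot fault and is cheap enough to risk wasting.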
Example: factoring loop invariants
    for i := 1 to 100 do            (* LoopDef = {i, j, k, A} *)
      for j := 1 to 100 do          (* LoopDef = {j, k, A} *)
        for k := 1 to 100 do        (* LoopDef = {k, A} *)
          A[i, j, k] := i * j * k;
• 3 million index operations
• 2 million multiplications

Example: factoring loop invariants (cont.)
Factoring the inner loop:
    for i := 1 to 100 do
      for j := 1 to 100 do
        t1 := ADR(A[i][j]);
        t2 := i * j;
        for k := 1 to 100 do
          t1[k] := t2 * k;
And the second loop:
    for i := 1 to 100 do
      t3 := ADR(A[i]);
      for j := 1 to 100 do
        t1 := ADR(t3[j]);
        t2 := i * j;
        for k := 1 to 100 do
          t1[k] := t2 * k;

Strength reduction in loops
Loop induction variable: incremented on each iteration
    $i_0,\ i_0 + 1,\ i_0 + 2,\ \ldots$
Induction expression: $i c_1 + c_2$, where $c_1$, $c_2$ are loop invariant
    $i_0 c_1 + c_2,\ (i_0 + 1) c_1 + c_2,\ (i_0 + 2) c_1 + c_2,\ \ldots$

Example: strength reduction in loops
From previous example:
    for i := 1 to 100 do
      t3 := ADR(A[i]);
      t4 := i;                      (* i * j0 = i *)
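As a minimal sketch of strength reduction on an induction expression of the form i·c1 + c2, the two functions below produce the same sequence of values. The first multiplies on every iteration; the second computes the initial value once and then adds c1 each time the induction variable advances by 1. The function names and the list-returning interface are assumptions made for illustration only; a compiler would perform this rewrite on its IR.

```python
def induction_values_multiply(i0, n, c1, c2):
    """Evaluate i*c1 + c2 for i = i0 .. i0+n-1 with one multiply per iteration."""
    values = []
    for i in range(i0, i0 + n):
        values.append(i * c1 + c2)
    return values


def induction_values_reduced(i0, n, c1, c2):
    """Same sequence, with the per-iteration multiply replaced by an add."""
    values = []
    t = i0 * c1 + c2        # computed once, in the loop header
    for _ in range(n):
        values.append(t)
        t += c1             # advancing i by 1 advances the expression by c1
    return values
```

The reduced form relies on c1 and c2 being loop invariant; if either changed inside the loop, the running value `t` would no longer track the expression.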
