Optimizing compilers

Compiler structure:
  token stream
    → Parser → syntax tree
    → Semantic analysis (e.g., type checking) → syntax tree
    → Intermediate code generator → low-level IR (e.g., canonical trees/tuples)
    → Optimizer → low-level IR (e.g., canonical trees/tuples)
    → Machine code generator → machine code

Potential optimizations:
Source-language (AST):
• constant bounds in loops/arrays
• loop unrolling
• suppressing run-time checks
• enable later optimisations
IR: local and global
• CSE elimination
• live variable analysis
• code hoisting
• enable later optimisations
Code-generation (machine code):
• register allocation
• instruction scheduling
• peephole optimization

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].
CS502 Register allocation
Optimization

Goal: produce fast code
• What is optimality?
• Problems are often hard
• Many are intractable or even undecidable
• Many are NP-complete
• Which optimizations should be used?
• Many optimizations overlap or interact

Optimization

Definition: An optimization is a transformation that is expected to:
• improve the running time of a program
• or decrease its space requirements
The point:
• “improved” code, not “optimal” code
• sometimes produces worse code
• range of speedup might be from 1.000001 to 4 (or more)
Machine-independent transformations

• applicable across a broad range of machines
• remove redundant computations
• move evaluation to a less frequently executed place
• specialize some general-purpose code
• find useless code and remove it
• expose opportunities for other optimizations

Machine-dependent transformations

• capitalize on machine-specific properties
• improve mapping from IR onto machine
• replace a costly operation with a cheaper one
• hide latency
• replace a sequence of instructions with a more powerful one (use “exotic” instructions)
A classical distinction

The distinction is not always clear:
• replace multiply with shifts and adds

Optimization

Desirable properties of an optimizing compiler
• code at least as good as an assembler programmer
• stable, robust performance (predictability)
• architectural strengths fully exploited
• architectural weaknesses fully hidden
• broad, efficient support for language features
• instantaneous compiles
Unfortunately, modern compilers often drop the ball
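The shift-and-add rewrite used to illustrate the classical distinction can be sketched in a few lines of Python (`mul_by_10` is a name invented here for illustration):

```python
def mul_by_10(x: int) -> int:
    # 10*x = 8*x + 2*x, so one multiply becomes two shifts and an add.
    return (x << 3) + (x << 1)

# The rewrite is machine-independent in form, but whether it pays off is
# machine-dependent: it hinges on the target's multiply vs. shift latency.
```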
Optimization

Good compilers are crafted, not assembled
• consistent philosophy
• careful selection of transformations
• thorough application
• coordinate transformations and data structures
• attention to results (code, time, space)
Compilers are engineered objects
• minimize running time of compiled code
• minimize compile time
• use reasonable compile-time space (serious problem)
Thus, results are sometimes unexpected

Scope of optimization

Local (single block)
• confined to straight-line code
• simplest to analyse
• time frame: ’60s to present, particularly now
Intraprocedural (global)
• consider the whole procedure
• What do we need to optimize an entire procedure?
• classical data-flow analysis, dependence analysis
• time frame: ’70s to present
Interprocedural (whole program)
• analyse whole programs
• What do we need to optimize an entire program?
• less information is discernible
• time frame: late ’70s to present, particularly now
Optimization

Three considerations arise in applying a transformation:
• safety
• profitability
• opportunity
We need a clear understanding of these issues
• the literature often hides them
• every discussion should list them clearly

Safety

Fundamental question: Does the transformation change the results of executing the code?
yes ⇒ don’t do it!    no ⇒ it is safe
Compile-time analysis
• may be safe in all cases (loop unrolling)
• analysis may be simple (DAGs and CSEs)
• may require complex reasoning (data-flow analysis)
Profitability

Fundamental question: Is there a reasonable expectation that the transformation will be an improvement?
yes ⇒ do it!    no ⇒ don’t do it
Compile-time estimation
• always profitable
• heuristic rules
• compute benefit (rare)

Opportunity

Fundamental question: Can we efficiently locate sites for applying the transformation?
yes ⇒ compilation time won’t suffer    no ⇒ better be highly profitable
Issues
• provides a framework for applying transformation
• systematically find all sites
• update safety information to reflect previous changes
• order of application (hard)
Optimization

Successful optimization requires
• test for safety, profitability should be O(1) per transformation or O(n) for whole routine (maybe n log n)
• profit is local improvement × executions
  ⇒ focus on loops:
  – loop unrolling
  – factoring loop invariants
  – strength reduction
• want to minimize side-effects like code growth

Example: loop unrolling

Idea: reduce loop overhead by creating multiple successive copies of the loop’s body and increasing the increment appropriately
Safety: always safe
Profitability: reduces overhead (subtle secondary effects: instruction cache blowout)
Opportunity: loops
Unrolling is easy to understand and perform
Example: loop unrolling

Matrix-matrix multiply

do i ← 1, n, 1
  do j ← 1, n, 1
    c(i,j) ← 0
    do k ← 1, n, 1
      c(i,j) ← c(i,j) + a(i,k) * b(k,j)

• 2n³ flops, n³ loop increments and branches
• each iteration does 2 loads and 2 flops

Example: loop unrolling

Matrix-matrix multiply (assume 4-word cache line)

do i ← 1, n, 1
  do j ← 1, n, 1
    c(i,j) ← 0
    do k ← 1, n, 4
      c(i,j) ← c(i,j) + a(i,k) * b(k,j)
      c(i,j) ← c(i,j) + a(i,k+1) * b(k+1,j)
      c(i,j) ← c(i,j) + a(i,k+2) * b(k+2,j)
      c(i,j) ← c(i,j) + a(i,k+3) * b(k+3,j)

• 2n³ flops, n³/4 loop increments and branches
• each iteration does 8 loads and 8 flops
• memory traffic is better
  – c(i,j) is reused (put it in a register)
  – a(i,k) references are from cache
  – b(k,j) is problematic
This is the most overstudied example in the literature
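The unrolled k loop can be checked against the original in Python. This is a sketch, not the course's code: the names `matmul` and `matmul_unrolled` are invented, and n is assumed to be a multiple of 4, as on the slide:

```python
def matmul(a, b, n):
    # Baseline: one multiply-add per innermost iteration (n^3 branches).
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_unrolled(a, b, n):
    # k loop unrolled by 4: same 2n^3 flops, but only n^3/4 loop
    # increments and branches. Assumes n % 4 == 0.
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(0, n, 4):
                c[i][j] += a[i][k]   * b[k][j]
                c[i][j] += a[i][k+1] * b[k+1][j]
                c[i][j] += a[i][k+2] * b[k+2][j]
                c[i][j] += a[i][k+3] * b[k+3][j]
    return c
```

Both versions compute the same product, which is exactly the safety claim on the previous slide; the payoff (fewer branches, better use of a 4-word cache line) is the profitability claim.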
Example: loop unrolling

Matrix-matrix multiply (to improve traffic on b)

do j ← 1, n, 1
  do i ← 1, n, 4
    c(i,j) ← 0
    do k ← 1, n, 4
      c(i,j) ← c(i,j) + a(i,k) * b(k,j)
                      + a(i,k+1) * b(k+1,j)
                      + a(i,k+2) * b(k+2,j)
                      + a(i,k+3) * b(k+3,j)
      c(i+1,j) ← c(i+1,j) + a(i+1,k) * b(k,j)
                          + a(i+1,k+1) * b(k+1,j)
                          + a(i+1,k+2) * b(k+2,j)
                          + a(i+1,k+3) * b(k+3,j)
      c(i+2,j) ← c(i+2,j) + a(i+2,k) * b(k,j)
                          + a(i+2,k+1) * b(k+1,j)
                          + a(i+2,k+2) * b(k+2,j)
                          + a(i+2,k+3) * b(k+3,j)
      c(i+3,j) ← c(i+3,j) + a(i+3,k) * b(k,j)
                          + a(i+3,k+1) * b(k+1,j)
                          + a(i+3,k+2) * b(k+2,j)
                          + a(i+3,k+3) * b(k+3,j)

Example: loop unrolling

What happened?
• interchanged i and j loops
• unrolled i loop
• fused inner loops
• 2n³ flops, n³/16 loop increments and branches
• first assignment does 8 loads and 8 flops
• 2nd through 4th do 4 loads and 8 flops
• memory traffic is better
  – c(i,j) is reused (register)
  – a(i,k) references are from cache
  – b(k,j) is reused (register)
Example: loop unrolling

It is not as easy as it looks:
Safety: loop interchange? loop unrolling? loop fusion?
Profitability: machine dependent (mostly)
Opportunity: find memory-bound loop nests
Summary
• chance for large improvement
• answering the fundamentals is tough
• resulting code is ugly
Matrix-matrix multiply is everyone’s favorite example

Loop optimizations: factoring loop-invariants

Loop invariants: expressions constant within loop body
Relevant variables: those used to compute an expression
Opportunity:
1. identify variables defined in body of loop (LoopDef)
2. loop invariants have no relevant variables in LoopDef
3. assign each loop-invariant to a temporary in the loop header
4. use the temporary in the loop body
Safety: loop-invariant expression may throw exception early
Profitability:
• loop may execute 0 times
• loop-invariant may not be needed on every path through loop body
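The four-step recipe above can be sketched in Python (an illustration with invented names, not the course's code): the product i*j has no relevant variable in the LoopDef of the k loop, so it can be assigned to a temporary in the loop header:

```python
def fill(n):
    # Before: i*j is recomputed on every k iteration.
    A = [[[0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                A[i-1][j-1][k-1] = i * j * k
    return A

def fill_factored(n):
    # After: i*j is loop-invariant w.r.t. k (steps 1-2), so it is
    # assigned once to the temporary t2 in the k loop's header (step 3)
    # and the temporary is used in the body (step 4).
    A = [[[0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            t2 = i * j
            for k in range(1, n + 1):
                A[i-1][j-1][k-1] = t2 * k
    return A
```

Note the safety and profitability caveats from the slide still apply: if the inner loop ran zero times, the hoisted computation would be wasted work (and, for an expression that can trap, executed too early).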
Example: factoring loop invariants

for i := 1 to 100 do           (∗ LoopDef = {i, j, k, A} ∗)
  for j := 1 to 100 do         (∗ LoopDef = {j, k, A} ∗)
    for k := 1 to 100 do       (∗ LoopDef = {k, A} ∗)
      A[i, j, k] := i ∗ j ∗ k;

• 3 million index operations
• 2 million multiplications

Example: factoring loop invariants (cont.)

Factoring the inner loop:

for i := 1 to 100 do
  for j := 1 to 100 do
    t1 := ADR(A[i][j]);
    t2 := i ∗ j;
    for k := 1 to 100 do
      t1[k] := t2 ∗ k;

And the second loop:

for i := 1 to 100 do
  t3 := ADR(A[i]);
  for j := 1 to 100 do
    t1 := ADR(t3[j]);
    t2 := i ∗ j;
    for k := 1 to 100 do
      t1[k] := t2 ∗ k;
Strength reduction in loops

Loop induction variable: incremented on each iteration
  i₀, i₀ + 1, i₀ + 2, ...
Induction expression: i·c₁ + c₂, where c₁, c₂ are loop invariant
  i₀c₁ + c₂, (i₀ + 1)c₁ + c₂, (i₀ + 2)c₁ + c₂, ...
1. replace i·c₁ + c₂ by t in body of loop
2. insert t := i₀c₁ + c₂ before loop
3. insert t := t + c₁ at end of loop

Example: strength reduction in loops

From previous example:

for i := 1 to 100 do
  t3 := ADR(A[i]);
  t4 := i;                     (∗ i ∗ j₀ = i ∗)
  for j := 1 to 100 do
    t1 := ADR(t3[j]);
    t2 := t4;                  (∗ t4 = i ∗ j ∗)
    t5 := t2;                  (∗ t2 ∗ k₀ = t2 ∗)
    for k := 1 to 100 do
      t1[k] := t5;             (∗ t5 = t2 ∗ k ∗)
      t5 := t5 + t2;
    end;
    t4 := t4 + i;
  end;
end;
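The three-step rewrite can be illustrated in Python (a sketch with invented names): the induction expression t2 ∗ k becomes a running sum that is bumped by c₁ = t2 on each iteration, trading a multiply for an add:

```python
def products(t2, n):
    # Before: one multiply per iteration of the k loop (t2*k, k = 1..n).
    return [t2 * k for k in range(1, n + 1)]

def products_reduced(t2, n):
    # After strength reduction, with c1 = t2 and c2 = 0:
    out = []
    t5 = t2            # step 2: t := k0*c1 + c2 before the loop (k0 = 1)
    for _ in range(n):
        out.append(t5)  # step 1: t replaces t2*k in the body
        t5 += t2        # step 3: t := t + c1 at the end of the loop
    return out
```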
Example: strength reduction in loops

After copy propagation and exposing indexing:

for i := 1 to 100 do
  t3 := A0 + (10000 ∗ i) − 10000;
  t4 := i;
  for j := 1 to 100 do
    t1 := t3 + (100 ∗ j) − 100;
    t5 := t4;
    for k := 1 to 100 do
      (t1 + k − 1) ↑ := t5;
      t5 := t5 + t4;
    end;
    t4 := t4 + i;
  end;
end;

Example: strength reduction in loops

Applying strength reduction to exposed index expressions:

t6 := A0;
for i := 1 to 100 do
  t3 := t6;
  t4 := i;
  t7 := t3;
  for j := 1 to 100 do
    t1 := t7;
    t5 := t4;
    t8 := t1;
    for k := 1 to 100 do
      t8 ↑ := t5;
      t5 := t5 + t4;
      t8 := t8 + 1;
    end;
    t4 := t4 + i;
    t7 := t7 + 100;
  end;
  t6 := t6 + 10000;
end;
Again, copy propagation further improves the code.
Intraprocedural analysis and transformation

In order to move, remove or rearrange instructions, we must
• understand the control flow of the procedure
• find and connect definitions and uses of variables ⇒ data-flow analysis
The control flow graph (CFG):
• nodes are basic blocks
• edges represent potential flow of control
Any-path flow analysis: property may be true if it holds along any path to node
All-paths flow analysis: property must be true if it holds along all paths to node

Live variables: any-path backward flow

Define:
  In(b): variables live on entry to block b
  Out(b): variables live on exit from block b
Let S(b) be the set of all immediate successors of b. Then:
  Out(b) = ∪_{i ∈ S(b)} In(i)
  S(b) = ∅ ⇒ Out(b) = ∅
Now, define:
  Use(b): variables used in b before/without definition in b, so In(b) ⊇ Use(b)
  Def(b): variables defined in b, so In(b) ⊇ Out(b) − Def(b)
Use(b) and Def(b) are constant (independent of control flow)
Now, v ∈ In(b) iff v ∈ Use(b) or v ∈ Out(b) − Def(b), i.e.
  In(b) = Use(b) ∪ (Out(b) − Def(b))
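These equations can be solved by iterating to a fixed point. A minimal Python sketch (the function and block names are invented for illustration):

```python
def liveness(blocks, succ):
    # blocks: {name: (Use(b), Def(b))}; succ: {name: [successor names]}.
    # Iterate In(b) = Use(b) ∪ (Out(b) − Def(b)) and Out(b) = ∪ In(s)
    # over successors s until nothing changes.
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            inn = use | (out - defs)
            if inn != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b], changed = inn, out, True
    return live_in, live_out

# Tiny CFG: entry defines x and y, a loop updates x, exit uses y.
blocks = {"entry": (set(), {"x", "y"}),
          "loop":  ({"x"}, {"x"}),
          "exit":  ({"y"}, set())}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succ)
# y stays live through the loop even though the loop never mentions it.
```

Starting every set empty is the right initialization for an any-path (union) problem: the sets only grow, so the iteration converges to the least fixed point.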
Uninitialized variables: any-path forward flow

Define:
  In(b): possibly uninitialized vars on entry to b
  Out(b): possibly uninitialized vars on exit from b
Let P(b) be the set of all immediate predecessors of b. Then:
  In(b) = ∪_{i ∈ P(b)} Out(i)
  P(b) = ∅ ⇒ In(b) = U (the set of all variables)
Now, define:
  Init(b): vars known to be initialized on exit from b (e.g., vars assigned legal values or tested before use), so Out(b) ⊇ In(b) − Init(b)
  Uninit(b): vars that become uninitialized in b, not subsequently reassigned or tested (e.g., assigned null, freed ptrs, nested variables), so Out(b) ⊇ Uninit(b)
Now, Out(b) = Uninit(b) ∪ (In(b) − Init(b))

Available expressions: all-paths forward flow

An expression is available if recomputing it is redundant
Define:
  RelVar(T): all vars/temps used to calculate T
  T is available on exit from b iff ∀X ∈ RelVar(T), T is more recently defined in b than X
  Out(b): (value-numbered) temps available on exit from b
  In(b): temporaries available on entry to b
  In(b) = ∩_{i ∈ P(b)} Out(i)
  P(b) = ∅ ⇒ In(b) = ∅
Now, define:
  Computed(b): exprs computed in b, not subsequently killed
  Kill(b): exprs killed in b (because one of its relevant vars is defined in b)
Now, Out(b) = Computed(b) ∪ (In(b) − Kill(b))
Very busy expressions: all-paths backward flow

An expression is very busy if its value is used along all paths before the expression is killed (reg. alloc., loop invariants)
Define:
  Out(b): expressions very busy on exit from b
  In(b): expressions very busy on entry to b
Then:
  Out(b) = ∩_{i ∈ S(b)} In(i)
  S(b) = ∅ ⇒ Out(b) = ∅
Now, define:
  Used(b): all expressions used before they are killed in b
  Kill(b): all expressions killed in b before they are used
Then, In(b) = Used(b) ∪ (Out(b) − Kill(b))

A taxonomy of data-flow problems

Any-path, forward:    In(b) = ∪_{i ∈ P(b)} Out(i);   Out(b) = Gen(b) ∪ (In(b) − Kill(b))
Any-path, backward:   Out(b) = ∪_{i ∈ S(b)} In(i);   In(b) = Gen(b) ∪ (Out(b) − Kill(b))
All-paths, forward:   In(b) = ∩_{i ∈ P(b)} Out(i);   Out(b) = Gen(b) ∪ (In(b) − Kill(b))
All-paths, backward:  Out(b) = ∩_{i ∈ S(b)} In(i);   In(b) = Gen(b) ∪ (Out(b) − Kill(b))
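All four cells of the taxonomy share the transfer function Gen(b) ∪ (· − Kill(b)), so one solver can be parameterized by direction and meet operator. The sketch below uses invented names (`solve`, the example blocks) and is not from the slides; start from empty sets for any-path (union) problems, and from the universal set for all-paths (intersection) problems so the iteration shrinks toward the greatest fixed point:

```python
from functools import reduce

def solve(blocks, succ, direction, meet, init):
    # blocks: {b: (Gen(b), Kill(b))}; succ: {b: [successors]}.
    # direction: "forward" or "backward"; meet: set.union or set.intersection.
    # Returns Out(b) for forward problems, In(b) for backward ones.
    pred = {b: [] for b in blocks}
    for b, ss in succ.items():
        for s in ss:
            pred[s].append(b)
    sources = pred if direction == "forward" else succ  # where the meet reads
    state = {b: set(init) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (gen, kill) in blocks.items():
            ins = [state[p] for p in sources.get(b, [])]
            met = reduce(meet, ins) if ins else set()   # boundary blocks get ∅
            new = gen | (met - kill)                    # Gen(b) ∪ (met − Kill(b))
            if new != state[b]:
                state[b], changed = new, True
    return state

# Reaching definitions = any-path forward: d1 and d2 both reach b3.
blocks = {"b1": ({"d1"}, {"d2"}), "b2": ({"d2"}, {"d1"}), "b3": (set(), set())}
succ = {"b1": ["b2", "b3"], "b2": ["b3"], "b3": []}
out = solve(blocks, succ, "forward", set.union, set())
```

(One boundary case this sketch glosses over: the uninitialized-variables problem seeds entry blocks with U rather than ∅, as on its slide.)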
Other data-flow problems

Reaching definitions: a definition of v reaches a use of v if ∃ any path from the definition to the use, with no intervening definitions (register targeting, constant propagation)
  In(b): defs that reach entry to b
  Out(b): defs that reach exit from b
  Gen(b): defs in b that reach exit from b
  Kill(b): defs killed by b (i.e., not in Gen(b))
ud-chains: reaching defs associated with each use (forward)

Other data-flow problems (cont.)

du-chains: uses associated with each definition (backward)
  In(b): possible uses of a var defined on entry to b
  Out(b): possible uses of a var defined on exit from b
  Gen(b): uses in b of a var not yet defined in b
  Kill(b): uses in b of vars defined in b
Copy propagation: given A := B, replace use of A with use of B
  In(b): copy statements that reach entry to b
  Out(b): copy statements that reach exit from b
  Gen(b): copy statements in b that reach end of b
  Kill(b): copy statements that reach entry to, but not exit from, b
Optimizations based on global data-flow

Very busy expressions:
• move invariants outside loops (used on every iteration)
• code hoisting: move from successors to a common predecessor (dominator)
Global CSE:
• remove redundant first calculations of expressions
• local CSE handles later redundant calculations
Live variable analysis:
• only live variables (incl. live global CSEs) are stored on block exit
Uninitialized variable analysis:
• compile-time warnings of possible uses
• run-time checks to detect illegal uses (e.g., null ptrs)
Constant and copy propagation: uses reaching definitions, du-chains, copy propagation analysis
• propagate constant and variable assignments and simplify resulting expressions

Ordering optimization phases

1. semantic analysis and intermediate code generation:
   • loop unrolling
   • inline expansion
2. intermediate code generation:
   • build basic blocks with their Def and Kill sets
3. build control flow graph:
   • perform initial data flow analyses
   • assume worst case for calls if no interprocedural analysis
4. early data-flow optimizations: constant/copy propagation (may expose dead code, changing flow graph, so iterate)
5. CSE and live/dead variable analyses
6. translate basic blocks to target code: local optimizations (register allocation/assignment, code selection)
7. peephole optimization