<<

Optimizing structure

token stream Potential optimizations: Source-language (AST): Parser • constant bounds in loops/arrays syntax tree • Semantic analysis • suppressing run-time checks (eg, type checking) • enable later optimisations syntax tree IR: local and global Intermediate • CSE elimination code generator low!level IR • (eg, canonical trees/tuples) • code hoisting Optimizer • enable later optimisations low!level IR (eg, canonical trees/tuples) Code-generation (): Machine code • generator • machine code Copyright ! 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for • personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

CS502 Register allocation 1 CS502 Register allocation 2

Optimization Optimization

Goal: produce fast code Definition: An optimization is a transformation that is expected to:

• What is optimality? improve the running time of a program

• Problems are often hard or decrease its space requirements

• Many are intractable or even undecideable The point: • Many are NP-complete

• Which optimizations should be used? • “improved” code, not “optimal” code • Many optimizations overlap or interact • sometimes produces worse code

• range of speedup might be from 1.000001 to 4 (or more)

CS502 Register allocation 3 CS502 Register allocation 4 Machine-independent transformations Machine-dependent transformations

• applicable across broad range of machines • capitalize on machine-specific properties

• remove redundant computations • improve mapping from IR onto machine

• move evaluation to a less frequently executed place • replace a costly operation with a cheaper one

• specialize some general-purpose code • hide latency

• find useless code and remove it • replace sequence of instructions with more powerful one (use “exotic” instructions)

• expose opportunities for other optimizations

CS502 Register allocation 5 CS502 Register allocation 6

A classical distinction Optimization

The distinction is not always clear: Desirable properties of an

replace multiply with shifts and adds • code at least as good as an assembler

• stable, robust performance (predictability)

• architectural strengths fully exploited

• architectural weaknesses fully hidden

• broad, efficient support for language features

• instantaneous compiles

Unfortunately, modern compilers often drop the ball

CS502 Register allocation 7 CS502 Register allocation 8 Optimization Scope of optimization

Good compilers are crafted, not assembled Local (single ) • confined to straight-line code • consistent philosophy • simplest to analyse • careful selection of transformations • time frame: ’60s to present, particularly now • thorough application Intraprocedural (global) • coordinate transformations and data structures • consider the whole procedure • attention to results (code, time, space) • What do we need to optimize an entire procedure? • classical data-flow analysis, dependence analysis Compilers are engineered objects • time frame: ’70s to present

• minimize running time of compiled code Interprocedural (whole program) • minimize • analyse whole programs • use reasonable compile-time space (serious problem) • What do we need to optimize and entire program? • less information is discernible Thus, results are sometimes unexpected • time frame: late ’70s to present, particularly now

CS502 Register allocation 9 CS502 Register allocation 10

Optimization Safety

Three considerations arise in applying a transformation: Fundamental question Does the transformation change the results of executing the code? • safety yes ⇒ don’ do it! no ⇒ it is safe

• profitability

• opportunity Compile-time analysis • may be safe in all cases (loop unrolling) We need a clear understanding of these issues • analysis may be simple (DAGs and CSEs) • may require complex reasoning (data-flow analysis) • the literature often hides them

• every discussion should list them clearly

CS502 Register allocation 11 CS502 Register allocation 12 Profitability Opportunity

Fundamental question Fundamental question Is there a reasonable expectation that the transformation will be an Can we efficiently locate sites for applying the transformation? improvement? yes ⇒ compilation time won’t suffer yes ⇒ do it! no ⇒ better be highly profitable no ⇒ don’t do it Issues • provides a framework for applying transformation • systematically find all sites • update safety information to reflect previous changes Compile-time estimation • order of application (hard) • always profitable • heuristic rules • compute benefit (rare)

CS502 Register allocation 13 CS502 Register allocation 14

Optimization Example: loop unrolling

Successful optimization requires Idea: reduce loop by creating multiple successive copies of the loop’s • test for safety, profitability should be body and increasing the increment appropriately O(1) per transformation Safety: always safe or O(n) for whole routine (maybe nlogn) Profitability: reduces overhead • profit is local improvement× executions (instruction blowout) ⇒ focus on loops: (subtle secondary effects) – loop unrolling Opportunity: loops – factoring loop invariants – • want to minimize side-effects like code growth

Unrolling is easy to understand and perform

CS502 Register allocation 15 CS502 Register allocation 16 Example: loop unrolling Example: loop unrolling

Matrix-matrix multiply Matrix-matrix multiply (assume 4-word cache line)

do i ← 1, n, 1 do i ← 1, n, 1 do j ← 1, n, 1 do j ← 1, n, 1 c(i,j) ← 0 c(i,j) ← 0 do k ← 1, n, 1 do k ← 1, n, 4 c(i,j) ← c(i,j) + a(i,k) * b(k,j) c(i,j) ← c(i,j) + a(i,k) * b(k,j) c(i,j) ← c(i,j) + a(i,k+1) * b(k+1,j) • 2n3 flops, n3 loop increments and branches c(i,j) ← c(i,j) + a(i,k+2) * b(k+2,j) c(i,j) ← c(i,j) + a(i,k+3) * b(k+3,j)

• each iteration does 2 loads and 2 flops 3 n3 • 2n flops, 4 loop increments and branches • each iteration does 8 loads and 8 flops

• memory traffic is better – c(i,j) is reused (put it in a register) – a(i,k) reference are from cache – b(k,j) is problematic This is the most overstudied example in the literature

CS502 Register allocation 17 CS502 Register allocation 18

Example: loop unrolling Example: loop unrolling

Matrix-matrix multiply (to improve traffic on b) What happened?

do j ← 1, n, 1 • interchanged i and j loops do i ← 1, n, 4 c(i,j) ← 0 • unrolled i loop do k ← 1, n, 4 • fused inner loops c(i,j) ← c(i,j) + a(i,k) * b(k,j) 3 + a(i,k+1) * b(k+1,j) 3 n • 2n flops, 16 loop increments and branches + a(i,k+2) * b(k+2,j) + a(i,k+3) * b(k+3,j) • first assignment does 8 loads and 8 flops nd th c(i+1,j) ← c(i+1,j) + a(i+1,k) * b(k,j) • 2 through 4 do 4 loads and 8 flops + a(i+1,k+1) * b(k+1,j) + a(i+1,k+2) * b(k+2,j) • memory traffic is better + a(i+1,k+3) * b(k+3,j) c(i+2,j) ← c(i+2,j) + a(i+2,k) * b(k,j) – c(i,j) is reused (register) + a(i+2,k+1) * b(k+1,j) – a(i,k) references are from cache + a(i+2,k+2) * b(k+2,j) + a(i+2,k+3) * b(k+3,j) – b(k,j) is reused (register) c(i+3,j) ← c(i+3,j) + a(i+3,k) * b(k,j) + a(i+3,k+1) * b(k+1,j) + a(i+3,k+2) * b(k+2,j) + a(i+3,k+3) * b(k+3,j)

CS502 Register allocation 19 CS502 Register allocation 20 Example: loop unrolling Loop optimizations: factoring loop-invariants

It is not as easy as it looks: Loop invariants: expressions constant within loop body

Safety: ? loop unrolling? loop fusion? Relevant variables: those used to compute and expression Profitability: machine dependent (mostly) Opportunity: Opportunity: find memory-bound loop nests 1.identifyvariablesdefinedinbodyofloop (LoopDef) Summary 2. loop invariants have no relevant variables in LoopDef • chance for large improvement 3. assign each loop-invariant to temp. in loop header • answering the fundamentals is tough 4. use temporary in loop body • resulting code is ugly Safety: loop-invariant expression may throw exception early

Profitability: • loop may execute 0 times • loop-invariant may not be needed on every path through loop body

Matrix-matrix multiply is everyone’s favorite example

CS502 Register allocation 21 CS502 Register allocation 22

Example: factoring loop invariants Example: factoring loop invariants (cont.)

for i := 1 to 100 do (∗ LoopDef = {i, j,k,A} ∗) Factoring the inner loop: And the second loop: for j := 1 to 100 do (∗ LoopDef = { j,k,A} ∗) for i := 1 to 100 do for i := 1 to 100 do for k := 1 to 100 do (∗ LoopDef = {k,A} ∗) for j := 1 to 100 do t3 := ADR(A[i]); A[i, j,k] := i ∗ j ∗ k; t1 := ADR(A[i][ j]); for j := 1 to 100 do t2 := i ∗ j; t1 := ADR(t3[ j]); for k := 1 to 100 do t := i ∗ j; • 3 million index operations 2 t1[k] := t2 ∗ k; for k := 1 to 100 do t1[k] := t2 ∗ k; • 2 million multiplications

CS502 Register allocation 23 CS502 Register allocation 24 Strength reduction in loops Example: strength reduction in loops

Loop : incremented on each iteration From previous example: , , ,... i0 i0 + 1 i0 + 2 for i := 1 to 100 do t := ADR(A[i]); Induction expression: ic + c , where c , c are loop invariant 3 1 2 1 2 t4 := i; (∗ i ∗ j0 = i ∗) i0c1 + c2,(i0 + 1)c1 + c2,(i0 + 2)c1 + c2,... for j := 1 to 100 do t1 := ADR(t3[ j]); 1. replace ic1 + c2 by t in body of loop t2 := t4; (∗ t4 = i ∗ j ∗) t5 := t2; (∗ t2 ∗ k0 = t2 ∗) 2. insert t := i0c1 + c2 before loop for k := 1 to 100 do t1[k] := t5; (∗ t5 = t2 ∗ k ∗) 3. insert t := t + c1 at end of loop t5 := t5 +t2; end; t4 := t4 + i; end; end;

CS502 Register allocation 25 CS502 Register allocation 26

Example: strength reduction in loops Example: strength reduction in loops

After copy propagation and exposing indexing: Applying strength reduction to exposed index expressions: for i := 1 to 100 do t6 := A0; for i := 1 to 100 do t3 := A0 +(10000 ∗ i) − 10000; t3 := t6; t4 := i; t7 := t3; t4 := i; for j := 1 to 100 do for j := 1 to 100 do t := t +(100 ∗ j) − 100; t1 := t7; t5 := t4; t8 := t1; 1 3 for k := 1 to 100 do t5 := t4; for k := 1 to 100 do t8 ↑:= t5; t5 := t5 +t4; (t1 + k − 1) ↑:= t5; t8 := t8 + 1; t5 := t5 +t4; end; end; t4 := t4 + i; t4 := t4 + i; end; t7 := t7 + 100; end; end; t6 := t6 + 10000; end;

Again, copy propagation further improves the code.

CS502 Register allocation 27 CS502 Register allocation 28 Intraprocedural analysis and transformation Live variables: any-path backward flow

In order to move, remove or rearrange instructions, must Define: • understand the control flow of the procedure In(b): variables live on entry to block b Out(b): variables live on exit from block b • find and connect definitions and uses of variables ⇒ data-flow analysis Let S(b) be the set of all immediate successors of b. Then:

The control flow graph (CFG): Out(b)=∪i∈S(b)In(i) • nodes are basic blocks S(b)=! ⇒ Out(b)=! • edges represent potential flow of control Now, define: Any-path flow analysis: property may be true if it holds along any path to node Use(b): variables used in b before/without definition in b In(b) ⊇ Use(b) All-paths flow analysis: property must be true if it holds along all paths to node Def(b): variables defined in b In(b) ⊇ Out(b) − Def(b)

Use(b) and Def(b) are constant (independent of control flow)

Now, v ∈ In(b) iff. v ∈ Use(b) or v ∈ Out(b) − Def(b) i.e., In(b)=Use(b) ∪ (Out(b) − Def(b))

CS502 Register allocation 29 CS502 Register allocation 30

Uninitialized variables: any-path forward flow Available expressions: all-paths forward flow

Define: An expression is available if recomputing it is redundant

In(b): possibly uninitialized vars. on entry to b Define: Out(b): possibly uninitialized vars. on exit from b RelVar(T): all vars/temps used to calculate T Let P(b) be the set of all immediate predecessors of b. Then: T is available on exit from b iff. ∀X ∈ RelVar(T), T is more recently defined in b than X In(b)=∪i∈P(b)Out(i) Out(b): (value-numbered) temps available on exit from b In(b): temporaries available on entry to b P(b)=! ⇒ In(b)=U (the set of all variables)

Now, define: In(b)=∩i∈P(b)Out(i) P(b)=! ⇒ In(b)=! Init(b): vars known to be initialized on exit from b (e.g., vars assigned legal values or tested before use) Out(b) ⊇ In(b) − Init(b) Now, define: Uninit(b): vars that become uninitialized in b, not subsequently reassigned or tested Computed(b): exprs computed in b, not subsequently killed (e.g., assigned null, freed ptrs, nested variables) Kill(b): exprs killed in b Out(b) ⊇ Uninit(b) (because one of its rel. vars. is defined in b)

Now, Out(b)=Uninit(b) ∪ (In(b) − Init(b)) Now, Out(b)=Computed(b) ∪ (In(b) − Kill(b))

CS502 Register allocation 31 CS502 Register allocation 32 Very busy expressions: all-paths backward flow A taxonomy of data-flow problems

Forward Backward An expression is very busy if its value is used along all paths before the Out(b)=Gen(b) ∪ (In(b) − Kill(b) In(b)=Gen(b) ∪ (Out(b) − Kill(b) expression is killed (reg. alloc., loop invariants)

Out(b): expressions very busy on exit from b Any Define: In(b)=Si∈P(b) Out(i) Out(b)=Si∈S(b) In(i) In(b): expressions very busy on entry to b Out(b)=Gen(b) ∪ (In(b) − Kill(b) In(b)=Gen(b) ∪ (Out(b) − Kill(b) Then:

All In(b)=Ti∈P(b) Out(i) Out(b)=Ti∈S(b) In(i) Out(b)=∩i∈S(b)In(i) S(b)=! ⇒ Out(b)=! Used(b): all expressions used before they are killed in b Now, define: Kill(b): allexpressionskilledin b before they are used Then, In(b)=Used(b) ∪ (Out(b) − Kill(b))

CS502 Register allocation 33 CS502 Register allocation 34

Other data-flow problems Other data-flow problems (cont.)

Reaching definitions: a definition of v reaches a use of v if ∃ any path from the du-chains: uses associated with each definition (backward) definition to the use, no intervening definitions In(b): possible uses of a var defined on entry to b (register targeting, constant propagation) Out(b): possible uses of a var defined on exit from b In(b): defs that reach entry to b Gen(b): uses in b of a var not yet defined in b Out(b): defs that reach exit from b Kill(b): uses in b of vars defined in b Gen(b): defs in b that reach exit from b Copy propagation: given A:=B, replace use of A with use of B Kill(b): defs killed by b (i.e., not in Gen(b)) In(b): copy statements that reach entry to b ud-chains: reaching defs associated with each use (forward) Out(b): copy statements that reach exit from b Gen(b): copy statements in b that reach end of b Kill(b): copy statements that reach entry to, but not exit from, b

CS502 Register allocation 35 CS502 Register allocation 36 Optimizations based on global data-flow Ordering optimization phases

Very busy expressions: 1. semantic analysis and intermediate code generation: • move invariants outside loops (used on every itern) • loop unrolling • code hoisting: move from successors common predecessor (dominator) • Global CSE: 2. intermediate code generation: • remove redundant first calculations of expressions • build basic blocks with their Def and Kill sets • local CSE handles later redundant calculations Live variable analysis: 3. build control flow graph: • only live variables (incl. live global CSEs) are stored on block exit • perform initial data flow analyses Uninitialized variable analysis: • assume worst case for calls if no interproc. analysis • compile-time warnings of possible uses 4. early data-flow optimizations: constant/copy propagation (may expose dead • run-time checks to detect illegal uses (e.g., null ptrs) code, changing flow graph, so iterate) Constant and copy propagation: uses reaching definitions, du-chains, copy 5. CSE and live/dead variable analyses propagation analysis 6. translate basic blocks to target code: local optimizations (register • propagate constant and variable assignments and simplify resulting allocation/assignment, code selection) expressions 7. peephole optimization

CS502 Register allocation 37 CS502 Register allocation 38