Optimizing compilers

CS6013 - Modern Compilers: Theory and Practise
Overview of different optimizations

V. Krishna Nandivada

Copyright © 2020 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to publish from [email protected].



Compiler structure

    token stream
        ↓ Parser
    syntax tree
        ↓ Semantic analysis (e.g., type checking)
    syntax tree
        ↓ Intermediate code generator
    low-level IR (e.g., canonical trees/tuples)
        ↓ Optimizer
    low-level IR (e.g., canonical trees/tuples)
        ↓ Machine code generator
    machine code

Potential optimizations at each level:
- Source-language (AST): constant bounds in loops/arrays; suppressing run-time checks; enable later optimisations
- Intermediate IR: local and global CSE elimination; code hoisting; enable later optimisations
- Code-generation (machine code)

Optimization

- Goal: produce fast code
- What is optimality?
- Problems are often hard:
  - Many are intractable or even undecidable
  - Many are NP-complete
- Which optimizations should be used?
  - Many optimizations overlap or interact

Optimization

Definition: An optimization is a transformation that is expected to improve the running time of a program or decrease its space requirements.

The point:
- "improved" code, not "optimal" code (forget "optimum")
- sometimes produces worse code
- range of speedup might be from 1.000001 to xxx

Machine-independent transformations

- applicable across a broad range of machines
- remove redundant computations (see the sketch after this list)
- move evaluation to a less frequently executed place
- specialize some general-purpose code
- find useless code and remove it
- expose opportunities for other optimizations
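As a concrete illustration of removing a redundant computation, here is common-subexpression elimination applied by hand; a real compiler performs this on its IR rather than on source, and the function names are mine:

    /* Before: a[i] * b[i] is evaluated twice. */
    double before(const double *a, const double *b, int i) {
        return a[i] * b[i] + 1.0 / (a[i] * b[i]);
    }

    /* After CSE: the common subexpression is computed once
       and reused from a temporary. */
    double after(const double *a, const double *b, int i) {
        double t = a[i] * b[i];
        return t + 1.0 / t;
    }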



Machine-dependent transformations

- capitalize on machine-specific properties
- improve mapping from IR onto machine
- replace a costly operation with a cheaper one, e.g., replace multiply with shifts and adds (see the sketch below)
- replace a sequence of instructions with a more powerful one (use "exotic" instructions)
- hide latency

A classical distinction

The distinction between machine-independent and machine-dependent transformations is not always clear.
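As promised above, a minimal C sketch of the multiply-with-shifts-and-adds idea (the function names are mine; a compiler would do this during instruction selection):

    #include <stdint.h>

    /* x * 9 == x * 8 + x, so one shift and one add
       replace the multiply. */
    uint32_t times9_mul(uint32_t x)   { return x * 9; }
    uint32_t times9_shift(uint32_t x) { return (x << 3) + x; }

Whether the shift-and-add form is actually cheaper depends on the target's multiplier latency, which is exactly why this counts as a machine-dependent transformation.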


Optimization

Desirable properties of an optimizing compiler:
- code at least as good as an assembler programmer
- stable, robust performance (predictability)
- architectural strengths fully exploited
- architectural weaknesses fully hidden
- broad, efficient support for language features
- instantaneous compiles

Unfortunately, modern compilers often drop the ball. Thus, results are sometimes unexpected.

Optimization

Good compilers are crafted, not assembled:
- consistent philosophy
- careful selection of transformations
- thorough application
- coordinate transformations and data structures
- attention to results (code, time, space)

Compilers are engineered objects:
- minimize running time of compiled code
- minimize compile time
- use reasonable compile-time space (serious problem)



Scope of optimization

- Local (single block):
  - confined to straight-line code
  - simplest to analyse
  - time frame: '60s to present, particularly now
- Intraprocedural (global):
  - consider the whole procedure
  - What do we need to optimize an entire procedure?
  - classical data-flow analysis, dependence analysis
  - time frame: '70s to present
- Interprocedural (whole program):
  - analyse whole programs
  - What do we need to optimize an entire program?
  - less information is discernible
  - time frame: late '70s to present, particularly now

Optimization

Three considerations arise in applying a transformation:
- safety
- profitability
- opportunity

We need a clear understanding of these issues:
- the literature often hides them
- every discussion should list them clearly


Safety

Fundamental question: Does the transformation change the results of executing the code?
- yes ⇒ don't do it!
- no ⇒ it is safe

Compile-time analysis:
- may be safe in all cases (loop unrolling)
- analysis may be simple (DAGs and CSEs)
- may require complex reasoning (data-flow analysis)

Profitability

Fundamental question: Is there a reasonable expectation that the transformation will be an improvement?
- yes ⇒ do it!
- no ⇒ don't do it

Compile-time estimation:
- always profitable
- heuristic rules
- compute benefit (rare)



Opportunity

Fundamental question: Can we efficiently locate sites for applying the transformation?
- yes ⇒ compilation time won't suffer
- no ⇒ better be highly profitable

Issues:
- provides a framework for applying the transformation systematically
- find all sites
- update safety information to reflect previous changes
- order of application (hard)

Optimization

Successful optimization requires:
- test for safety
- profit is local improvement × executions
  ⇒ focus on loops: loop unrolling, factoring loop invariants
- want to minimize side-effects like code growth


Example: loop unrolling

Matrix-matrix multiply:

    do i ← 1, n, 1
      do j ← 1, n, 1
        c(i,j) ← 0
        do k ← 1, n, 1
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)

- 2n³ flops, n³ loop increments and branches
- each iteration does 2 loads and 2 flops

Example: loop unrolling

Idea: reduce loop overhead by creating multiple successive copies of the loop's body and increasing the increment appropriately.

- Safety: always safe
- Profitability: reduces overhead (subtle secondary effects, e.g., instruction cache blowout)
- Opportunity: loops

Unrolling is easy to understand and perform. This is the most overstudied example in the literature.
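For concreteness, the matrix-multiply pseudocode above rendered as C; the row-major layout, 0-based indexing, and function name are my choices:

    /* Naive n x n matrix-matrix multiply: 2n^3 flops, n^3 inner-loop
       increments and branches; each inner iteration does 2 loads and
       2 flops. */
    void matmul(int n, double *c, const double *a, const double *b) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                c[i*n + j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
            }
    }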



Example: loop unrolling

Matrix-matrix multiply (assume 4-word cache line):

    do i ← 1, n, 1
      do j ← 1, n, 1
        c(i,j) ← 0
        do k ← 1, n, 4
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)
          c(i,j) ← c(i,j) + a(i,k+1) * b(k+1,j)
          c(i,j) ← c(i,j) + a(i,k+2) * b(k+2,j)
          c(i,j) ← c(i,j) + a(i,k+3) * b(k+3,j)

- 2n³ flops, n³/4 loop increments and branches
- each iteration does 8 loads and 8 flops
- memory traffic is better:
  - c(i,j) is reused (put it in a register)
  - a(i,k) references are from cache
  - b(k,j) is problematic

Example: loop unrolling

Matrix-matrix multiply (to improve traffic on b):

    do j ← 1, n, 1
      do i ← 1, n, 4
        c(i,j) ← 0
        c(i+1,j) ← 0
        c(i+2,j) ← 0
        c(i+3,j) ← 0
        do k ← 1, n, 4
          c(i,j) ← c(i,j) + a(i,k) * b(k,j)
                          + a(i,k+1) * b(k+1,j)
                          + a(i,k+2) * b(k+2,j)
                          + a(i,k+3) * b(k+3,j)
          c(i+1,j) ← c(i+1,j) + a(i+1,k) * b(k,j)
                              + a(i+1,k+1) * b(k+1,j)
                              + a(i+1,k+2) * b(k+2,j)
                              + a(i+1,k+3) * b(k+3,j)
          c(i+2,j) ← c(i+2,j) + a(i+2,k) * b(k,j)
                              + a(i+2,k+1) * b(k+1,j)
                              + a(i+2,k+2) * b(k+2,j)
                              + a(i+2,k+3) * b(k+3,j)
          c(i+3,j) ← c(i+3,j) + a(i+3,k) * b(k,j)
                              + a(i+3,k+1) * b(k+1,j)
                              + a(i+3,k+2) * b(k+2,j)
                              + a(i+3,k+3) * b(k+3,j)
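The first of these two loop nests (k unrolled by 4), in the same hypothetical C rendering as before; using an explicit accumulator makes the register reuse of c(i,j) visible (assumes n divisible by 4, as the slides do):

    /* k loop unrolled by 4: n^3/4 increments and branches; the c(i,j)
       accumulator lives in a register, and a(i,k..k+3) typically share
       a cache line. */
    void matmul_unroll4(int n, double *c, const double *a, const double *b) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k += 4)
                    s += a[i*n + k]     * b[k*n + j]
                       + a[i*n + k + 1] * b[(k+1)*n + j]
                       + a[i*n + k + 2] * b[(k+2)*n + j]
                       + a[i*n + k + 3] * b[(k+3)*n + j];
                c[i*n + j] = s;
            }
    }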

Example: loop unrolling

What happened?
- interchanged the i and j loops
- unrolled the i loop
- fused the inner loops

- 2n³ flops, n³/16 loop increments and branches
- first assignment does 8 loads and 8 flops
- 2nd through 4th do 4 loads and 8 flops
- memory traffic is better:
  - c(i,j) is reused (register)
  - a(i,k) references are from cache
  - b(k,j) is reused (register)

Loop transformations

It is not as easy as it looks:
- Safety: loop unrolling? loop fusion?
- Opportunity: find memory-bound loop nests
- Profitability: machine dependent (mostly)

Summary:
- chance for large improvement
- answering the fundamentals is tough
- resulting code is ugly

Matrix-matrix multiply is everyone's favorite example.
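The second loop nest in the same hypothetical C rendering, with the k-by-4 unrolling elided for brevity; it makes the b(k,j) reuse explicit, since each loaded b value now feeds four accumulators (again assuming n divisible by 4):

    /* j outermost, i unrolled by 4: each b(k,j) is loaded once and
       reused across four rows of a, so b traffic drops by 4x. */
    void matmul_reuse_b(int n, double *c, const double *a, const double *b) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i += 4) {
                double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
                for (int k = 0; k < n; k++) {
                    double bk = b[k*n + j];   /* loaded once, used 4 times */
                    s0 += a[ i     *n + k] * bk;
                    s1 += a[(i + 1)*n + k] * bk;
                    s2 += a[(i + 2)*n + k] * bk;
                    s3 += a[(i + 3)*n + k] * bk;
                }
                c[ i     *n + j] = s0;
                c[(i + 1)*n + j] = s1;
                c[(i + 2)*n + j] = s2;
                c[(i + 3)*n + j] = s3;
            }
    }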



Loop optimizations: factoring loop-invariants

- Loop invariants: expressions constant within the loop body
- Relevant variables: those used to compute an expression
- Opportunity:
  1. identify variables defined in the body of the loop (LoopDef)
  2. loop invariants have no relevant variables in LoopDef
  3. assign each loop-invariant to a temporary in the loop header
  4. use the temporary in the loop body
- Safety: a loop-invariant expression may throw an exception early
- Profitability:
  - loop may execute 0 times
  - loop-invariant may not be needed on every path through the loop body

Example: factoring loop invariants

    foreach i = 1 .. 100 do
      // LoopDef = {i, j, k, A}
      foreach j = 1 .. 100 do
        // LoopDef = {j, k, A}
        foreach k = 1 .. 100 do
          // LoopDef = {k, A}
          A[i,j,k] = i * j * k;
        end
      end
    end

3 million index operations, 2 million multiplications.
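A hand-applied C rendering of the recipe on this example (0-based arrays and function names are mine); the next slide performs the same factoring step by step on the pseudocode:

    /* Before: i * j is recomputed on every k iteration, even though
       its relevant variables {i, j} are not in the k loop's
       LoopDef = {k, A}. */
    void fill_before(int A[100][100][100]) {
        for (int i = 1; i <= 100; i++)
            for (int j = 1; j <= 100; j++)
                for (int k = 1; k <= 100; k++)
                    A[i-1][j-1][k-1] = i * j * k;
    }

    /* After: the invariant is assigned to a temporary in the
       k loop's header and reused in its body. */
    void fill_after(int A[100][100][100]) {
        for (int i = 1; i <= 100; i++)
            for (int j = 1; j <= 100; j++) {
                int t = i * j;
                for (int k = 1; k <= 100; k++)
                    A[i-1][j-1][k-1] = t * k;
            }
    }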


Example: factoring loop invariants (cont.)

Factoring the inner loop:

    foreach i = 1 .. 100 do
      // LoopDef = {i, j, k, A}
      foreach j = 1 .. 100 do
        // LoopDef = {j, k, A}
        t1 = &A[i][j];
        t2 = i * j;
        foreach k = 1 .. 100 do
          // LoopDef = {k, A}
          t1[k] = t2 * k;
        end
      end
    end

And the second loop:

    foreach i = 1 .. 100 do
      // LoopDef = {i, j, k, A}
      t3 = &A[i];
      foreach j = 1 .. 100 do
        // LoopDef = {j, k, A}
        t1 = &t3[j];
        t2 = i * j;
        foreach k = 1 .. 100 do
          // LoopDef = {k, A}
          t1[k] = t2 * k;
        end
      end
    end

Strength reduction in loops

- Loop induction variable: incremented on each iteration: i0, i0 + 1, i0 + 2, ...
- Induction expression: i*c1 + c2, where c1 and c2 are loop-invariant: i0*c1 + c2, (i0 + 1)*c1 + c2, (i0 + 2)*c1 + c2, ...

To strength-reduce an induction expression:
1. replace i*c1 + c2 by t in the body of the loop
2. insert t := i0*c1 + c2 before the loop
3. insert t := t + c1 at the end of the loop
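Here is the three-step recipe on a minimal scalar C loop (function and variable names are mine), before the next slides apply it to the running example:

    /* Before: the induction expression i*c1 + c2 costs a
       multiply on every iteration. */
    long sum_before(long c1, long c2, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += i * c1 + c2;
        return s;
    }

    /* After: step 2 initializes t = i0*c1 + c2 before the loop
       (here i0 = 0), step 1 replaces the expression by t, and
       step 3 adds c1 at the end of each iteration. */
    long sum_after(long c1, long c2, long n) {
        long s = 0;
        long t = c2;          /* i0*c1 + c2 with i0 = 0 */
        for (long i = 0; i < n; i++) {
            s += t;
            t += c1;
        }
        return s;
    }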



Example: strength reduction in loops

From the previous example:

    foreach i = 1 .. 100 do
      t3 = &A[i];
      t4 = i;             // i * j0 = i
      foreach j = 1 .. 100 do
        t1 = &t3[j];
        t2 = t4;          // t4 = i * j
        t5 = t2;          // t2 * k0 = t2
        foreach k = 1 .. 100 do
          t1[k] = t5;     // t5 = t2 * k
          t5 = t5 + t2;
        end
        t4 = t4 + i;
      end
    end

Example: strength reduction in loops

After copy propagation and exposing indexing:

    foreach i = 1 .. 100 do
      t3 = A + (10000 * i) - 10000;
      t4 = i;
      foreach j = 1 .. 100 do
        t1 = t3 + (100 * j) - 100;
        t5 = t4;
        foreach k = 1 .. 100 do
          *(t1 + k - 1) = t5;
          t5 = t5 + t4;
        end
        t4 = t4 + i;
      end
    end
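A quick runnable check (my own, not from the slides) that the pointer form above stores exactly what A[i,j,k] = i * j * k stores; 100 plays the role of the row length and 10000 of the plane size:

    #include <assert.h>

    #define N 100
    static int A[N][N][N], B[N][N][N];

    int main(void) {
        /* original loop nest, 1-based as in the slides */
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                for (int k = 1; k <= N; k++)
                    A[i-1][j-1][k-1] = i * j * k;

        /* strength-reduced, copy-propagated form: only additions
           remain in the inner loop */
        int *base = &B[0][0][0];
        for (int i = 1; i <= N; i++) {
            int *t3 = base + (N*N * i) - N*N;   /* start of plane i */
            int t4 = i;                          /* i * j, kept additively */
            for (int j = 1; j <= N; j++) {
                int *t1 = t3 + (N * j) - N;      /* start of row j */
                int t5 = t4;                     /* (i*j) * k, kept additively */
                for (int k = 1; k <= N; k++) {
                    *(t1 + k - 1) = t5;
                    t5 = t5 + t4;
                }
                t4 = t4 + i;
            }
        }

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    assert(A[i][j][k] == B[i][j][k]);
        return 0;
    }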


Example: strength reduction in loops

Applying strength reduction to the exposed index expressions:

    t6 = A;
    foreach i = 1 .. 100 do
      t3 = t6;
      t4 = i;
      t7 = t3;
      foreach j = 1 .. 100 do
        t1 = t7;
        t5 = t4;
        t8 = t1;
        foreach k = 1 .. 100 do
          *t8 = t5;
          t5 = t5 + t4;
          t8 = t8 + 1;
        end
        t4 = t4 + i;
        t7 = t7 + 100;
      end
      t6 = t6 + 10000;
    end

Again, copy propagation further improves the code.

Ordering optimization phases

1. semantic analysis and intermediate code generation: loop unrolling
2. intermediate code generation: build basic blocks with their Def and Kill sets
3. build control flow graph: perform initial data-flow analyses; assume worst case for calls if no interprocedural analysis
4. early data-flow optimizations: constant/copy propagation (may expose dead code, changing the flow graph, so iterate)
5. CSE and live/dead variable analyses
6. translate basic blocks to target code: local optimizations (register allocation/assignment, code selection)
7. peephole optimization

Loop optimizations

- Loop unswitching
- Loop unrolling
- Loop-invariant code motion
- Loop fusion
- Strip mining
- Loop tiling
- Loop reversal
- Loop interchange
- Loop distribution
- Vectorisation (brief)

