CS6013 - Modern Compilers: Theory and Practise
Overview of Different Optimizations
Optimizing compilers

V. Krishna Nandivada, IIT Madras

Copyright © 2020 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to publish from [email protected].

Compiler structure

    token stream
      -> Parser: syntax tree
      -> Semantic analysis (e.g., type checking): syntax tree
      -> Intermediate code generator: low-level IR (e.g., canonical trees/tuples)
      -> Optimizer: low-level IR (e.g., canonical trees/tuples)
      -> Machine code generator: machine code

Optimization

Goal: produce fast code.
- What is optimality?
- Problems are often hard: many are intractable or even undecidable, and many are NP-complete.
- Which optimizations should be used? Many optimizations overlap or interact.

Potential optimizations:
- Source-language (AST): constant bounds in loops/arrays; loop unrolling; suppressing run-time checks; enabling later optimisations.
- IR: local and global CSE elimination; live-variable analysis; code hoisting; enabling later optimisations.
- Code generation (machine code): register allocation; instruction scheduling; peephole optimization.

Optimization

Definition: an optimization is a transformation that is expected to improve the running time of a program or decrease its space requirements.

The point:
- "improved" code, not "optimal" code (forget "optimum")
- sometimes produces worse code
- range of speedup might be from 1.000001 to xxx

Machine-independent transformations

Applicable across a broad range of machines:
- remove redundant computations
- move evaluation to a less frequently executed place
- specialize some general-purpose code
- find useless code and remove it
- expose opportunities for other optimizations

Machine-dependent transformations

Capitalize on machine-specific properties:
- improve the mapping from IR onto the machine
- replace multiply with shifts and adds (a short C sketch appears at the end of this part)
- hide latency
- replace a sequence of instructions with a single more powerful one (use "exotic" instructions)

A classical distinction

The machine-independent/machine-dependent distinction is not always clear. "Replace a costly operation with a cheaper one" sounds machine-independent, but which operation is actually cheaper is a machine-specific question.

Optimization

Desirable properties of an optimizing compiler:
- code at least as good as an assembly programmer's
- stable, robust performance (predictability)
- architectural strengths fully exploited
- architectural weaknesses fully hidden
- broad, efficient support for language features
- instantaneous compiles

Good compilers are crafted, not assembled:
- consistent philosophy
- careful selection of transformations
- thorough application
- coordinated transformations and data structures
- attention to results (code, time, space)

Compilers are engineered objects:
- minimize the running time of compiled code
- minimize compile time
- use reasonable compile-time space (a serious problem)

Unfortunately, modern compilers often drop the ball. Thus, results are sometimes unexpected.
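To make the multiply-to-shift bullet concrete, here is a minimal C sketch; the constant 10 and the helper name mul10 are illustrative choices, not from the slides:

    #include <stdint.h>

    /* Replace a multiply by the constant 10 with shifts and adds:
       x * 10 = x*8 + x*2 = (x << 3) + (x << 1).
       Whether this actually beats the hardware multiplier is a
       machine-specific question: the "classical distinction" above. */
    static inline int32_t mul10(int32_t x) {
        /* unsigned arithmetic sidesteps signed-shift overflow */
        return (int32_t)(((uint32_t)x << 3) + ((uint32_t)x << 1));
    }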
Scope of optimization

- Local (single block): confined to straight-line code; simplest to analyse. Time frame: '60s to present, particularly now.
- Intraprocedural (global): consider the whole procedure. What do we need to optimize an entire procedure? Classical data-flow analysis and dependence analysis. Time frame: '70s to present.
- Interprocedural (whole program): analyse whole programs. What do we need to optimize an entire program? Less information is discernible. Time frame: late '70s to present, particularly now.

Optimization

Three considerations arise in applying a transformation: safety, profitability, and opportunity. We need a clear understanding of these issues; the literature often hides them, and every discussion should list them clearly.

Safety

Fundamental question: does the transformation change the results of executing the code?
- yes ⇒ don't do it!
- no ⇒ it is safe

Compile-time analysis:
- may be safe in all cases (loop unrolling)
- analysis may be simple (DAGs and CSEs)
- may require complex reasoning (data-flow analysis)

Profitability

Fundamental question: is there a reasonable expectation that the transformation will be an improvement?
- yes ⇒ do it!
- no ⇒ don't do it

Compile-time estimation:
- always profitable
- heuristic rules
- compute benefit (rare)

Opportunity

Fundamental question: can we efficiently locate sites for applying the transformation?
- yes ⇒ compilation time won't suffer
- no ⇒ it had better be highly profitable

Issues:
- a framework for applying the transformation systematically
- find all sites
- update safety information to reflect previous changes
- order of application (hard)

Optimization

Successful optimization requires:
- a test for safety
- profit: profit is local improvement × executions, so focus on loops: loop unrolling, factoring loop invariants, strength reduction
- minimizing side-effects like code growth

Example: loop unrolling

Idea: reduce loop overhead by creating multiple successive copies of the loop's body and increasing the increment appropriately.
- Safety: always safe.
- Profitability: reduces overhead, though beware instruction-cache blowout and subtle secondary effects.
- Opportunity: loops.

Unrolling is easy to understand and perform. This is the most overstudied example in the literature.

Example: loop unrolling

Matrix-matrix multiply:

    do i = 1, n, 1
      do j = 1, n, 1
        c(i,j) = 0
        do k = 1, n, 1
          c(i,j) = c(i,j) + a(i,k) * b(k,j)

- 2n³ flops; n³ loop increments and branches
- each iteration does 2 loads and 2 flops

Example: loop unrolling

Matrix-matrix multiply (assume a 4-word cache line):

    do i = 1, n, 1
      do j = 1, n, 1
        c(i,j) = 0
        do k = 1, n, 4
          c(i,j) = c(i,j) + a(i,k)   * b(k,j)
          c(i,j) = c(i,j) + a(i,k+1) * b(k+1,j)
          c(i,j) = c(i,j) + a(i,k+2) * b(k+2,j)
          c(i,j) = c(i,j) + a(i,k+3) * b(k+3,j)

- 2n³ flops; n³/4 loop increments and branches
- each iteration does 8 loads and 8 flops
- memory traffic is better:
  - c(i,j) is reused (put it in a register)
  - a(i,k) references come from cache
  - b(k,j) is problematic
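The unrolled pseudocode above tacitly assumes that n is a multiple of 4. A minimal C sketch of the same 4-way k-loop unrolling (the function name and the C99 variable-length-array parameters are my choices, not the slides') adds a remainder loop for the general case:

    /* Matrix-matrix multiply with the k loop unrolled by 4. */
    void matmul_unroll4(int n, double a[n][n], double b[n][n], double c[n][n]) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;            /* c(i,j) held in a register */
                int k = 0;
                for (; k + 3 < n; k += 4) {  /* 8 loads, 8 flops per trip */
                    sum += a[i][k]   * b[k][j];
                    sum += a[i][k+1] * b[k+1][j];
                    sum += a[i][k+2] * b[k+2][j];
                    sum += a[i][k+3] * b[k+3][j];
                }
                for (; k < n; k++)           /* leftover iterations */
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }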
Example: loop unrolling

Matrix-matrix multiply (to improve traffic on b):

    do j = 1, n, 1
      do i = 1, n, 4
        c(i,j)   = 0
        c(i+1,j) = 0
        c(i+2,j) = 0
        c(i+3,j) = 0
        do k = 1, n, 4
          c(i,j)   = c(i,j)   + a(i,k)   * b(k,j)   + a(i,k+1)   * b(k+1,j)
                              + a(i,k+2) * b(k+2,j) + a(i,k+3)   * b(k+3,j)
          c(i+1,j) = c(i+1,j) + a(i+1,k) * b(k,j)   + a(i+1,k+1) * b(k+1,j)
                              + a(i+1,k+2) * b(k+2,j) + a(i+1,k+3) * b(k+3,j)
          c(i+2,j) = c(i+2,j) + a(i+2,k) * b(k,j)   + a(i+2,k+1) * b(k+1,j)
                              + a(i+2,k+2) * b(k+2,j) + a(i+2,k+3) * b(k+3,j)
          c(i+3,j) = c(i+3,j) + a(i+3,k) * b(k,j)   + a(i+3,k+1) * b(k+1,j)
                              + a(i+3,k+2) * b(k+2,j) + a(i+3,k+3) * b(k+3,j)

Example: loop unrolling

What happened?
- interchanged the i and j loops
- unrolled the i loop
- fused the inner loops

- 2n³ flops; n³/16 loop increments and branches
- the first assignment does 8 loads and 8 flops; the 2nd through 4th do 4 loads and 8 flops
- memory traffic is better:
  - c(i,j) is reused (register)
  - a(i,k) references come from cache
  - b(k,j) is reused (register)

Loop transformations

It is not as easy as it looks:
- Safety: loop interchange? loop unrolling? loop fusion?
- Opportunity: find memory-bound loop nests.
- Profitability: machine dependent (mostly).

Summary:
- chance for a large improvement
- answering the fundamentals is tough
- the resulting code is ugly
- matrix-matrix multiply is everyone's favorite example

Loop optimizations: factoring loop-invariants

Loop invariants: expressions constant within the loop body.
Relevant variables: those used to compute an expression.

Opportunity:
1. identify the variables defined in the body of the loop (LoopDef)
2. loop invariants have no relevant variables in LoopDef
3. assign each loop-invariant to a temporary in the loop header
4. use the temporary in the loop body

Safety: a loop-invariant expression may throw an exception early.

Profitability:
- the loop may execute 0 times
- the loop-invariant may not be needed on every path through the loop body

Example: factoring loop invariants

    foreach i = 1 .. 100 do
      // LoopDef = {i, j, k, A}
      foreach j = 1 .. 100 do
        // LoopDef = {j, k, A}
        foreach k = 1 .. 100 do
          // LoopDef = {k, A}
          A[i,j,k] = i * j * k;
        end
      end
    end

3 million index operations; 2 million multiplications.

Example: factoring loop invariants (cont.)

Factoring the inner loop (i * j is invariant in the k loop, so it moves to a temporary in the loop header):

    foreach i = 1 .. 100 do
      foreach j = 1 .. 100 do
        t = i * j;
        foreach k = 1 .. 100 do
          A[i,j,k] = t * k;
        end
      end
    end

And the second loop:

    foreach i = 1 ..

Strength reduction in loops
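The slide body for this topic is cut off in the source; as a minimal sketch of the classic technique the title names (the array a, the stride 8, and the temporary t are illustrative choices, not from the slides), strength reduction replaces a multiplication by the loop induction variable with an addition carried across iterations:

    /* Before: one multiply per iteration to compute i * 8. */
    void fill_naive(int n, int a[]) {
        for (int i = 0; i < n; i++)
            a[i] = i * 8;
    }

    /* After strength reduction: t tracks i * 8, so the multiply
       in the loop body becomes one addition per iteration. */
    void fill_reduced(int n, int a[]) {
        int t = 0;
        for (int i = 0; i < n; i++) {
            a[i] = t;
            t += 8;
        }
    }

This is the same idea the earlier "focus on loops" slide lists alongside unrolling and loop-invariant factoring; the address arithmetic behind A[i,j,k] in the example above is another natural target.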