Code Optimization

TDDB44/TDDD55 Lecture 10: Code Optimization
Peter Fritzson, Christoph Kessler, and Martin Sjölund
Department of Computer and Information Science, Linköping University
2020-11-30


Part I: Overview

Code Optimization – Overview

Goal: code that is faster and/or smaller and/or has low energy consumption.

Figure (overview of the compilation pipeline): source code passes through the front end to an intermediate program representation (IR), and through the back end to a target-level representation from which asm code is emitted. Optimizations can be applied at three places:

- Source-to-source compiler/optimizer (source-level optimizations): target machine independent, language dependent
- IR-level optimizations: mostly target machine independent, mostly language independent
- Target-level optimizations: target machine dependent, language independent

Remarks

- There are often multiple levels of IR:
  - high-level IR (e.g. abstract syntax tree, AST)
  - medium-level IR (e.g. quadruples, basic block graph)
  - low-level IR (e.g. directed acyclic graphs, DAGs)
  Do each optimization at the most appropriate level of abstraction; code generation is then a continuous lowering of the IR towards target code.
- "Postpass optimization": done on binary code (after compilation, or without compilation).

Disadvantages of Compiler Optimizations

- Debugging is made difficult:
  - code moves around or disappears;
  - it is important to be able to switch off optimization.
  Note: some compilers have an optimization level (-Og) that avoids optimizations that make debugging hard.
- Compilation time increases.
- Program semantics may even be affected:
      A = B*C - D + E   →   A = B*C + E - D
  may lead to overflow if B*C + E is too large a number.

Optimization at Different Levels of Program Representation

Source-level optimization
- made on the source program (text)
- independent of the target machine

Intermediate code optimization
- made on the intermediate code (e.g. on AST trees, quadruples)
- mostly target machine independent

Target-level code optimization
- made on the target machine code
- target machine dependent

Source-level Optimization

At the source code level, independent of the target machine:
- Replace a slow algorithm with a quicker one, e.g. bubble sort → quicksort.
- Poor algorithms are the main source of inefficiency, but they are difficult to optimize automatically.
- Needs pattern matching, e.g. Kessler 1996; Di Martino and Kessler 2000.

Intermediate Code Optimization

At the intermediate code level (e.g. trees, quadruples); in most cases target machine independent:
- Local optimizations within basic blocks (e.g. common subexpression elimination)
- Loop optimizations (e.g. loop interchange to improve data locality)
- Global optimization (e.g. code motion, within procedures)
- Interprocedural optimization (between procedures)

Target-level Code Optimization

At the target machine (binary) code level; dependent on the target machine:
- Instruction selection, register allocation, instruction scheduling, branch prediction
- Peephole optimization

Basic Block

A basic block is a sequence of textually consecutive operations (e.g. quadruples) that contains no branches (except perhaps its last operation) and no branch targets (except perhaps its first operation).

- Always executed in the same order from entry to exit
- A.k.a. straight-line code

    B1:  1: ( JEQZ, T1, 5, 0 )
    B2:  2: ( ASGN, 2, 0, A )
         3: ( ADD, A, 3, B )
         4: ( JUMP, 7, 0, 0 )
    B3:  5: ( ASGN, 23, 0, A )
         6: ( SUB, A, 1, B )
    B4:  7: ( MUL, A, B, C )
         8: ( ADD, C, 1, A )
         9: ( JNEZ, B, 2, 0 )

Control Flow Graph

- Nodes: primitive operations (e.g. quadruples), or basic blocks
- Edges: control flow transitions
Figure: the quadruple-level control flow graph of the example above.

Basic Block Control Flow Graph

- Nodes: basic blocks
- Edges: control flow transitions
Figure: the basic block control flow graph of the example above, with edges B1 → B2, B1 → B3, B2 → B4, B3 → B4, and B4 → B2 (plus the exit from B4).
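The partitioning into B1–B4 above can be computed with the usual leader rules: the first quadruple, every branch target, and every quadruple that follows a branch start a new block. The following is a minimal sketch of this, assuming an invented Quad encoding with 0-based branch targets; it is for illustration only and is not the course's IR or lab code.

    #include <stdio.h>

    typedef enum { ASGN, ADD, SUB, MUL, JUMP, JEQZ, JNEZ } Op;

    typedef struct {
        Op  op;
        int target;   /* 0-based branch target; only meaningful for jumps */
    } Quad;

    /* Mark basic-block leaders: the first quadruple, every branch target,
       and every quadruple that immediately follows a branch.             */
    static void mark_leaders(const Quad *q, int n, int *leader)
    {
        for (int i = 0; i < n; i++)
            leader[i] = 0;
        leader[0] = 1;                              /* first quadruple     */
        for (int i = 0; i < n; i++) {
            if (q[i].op == JUMP || q[i].op == JEQZ || q[i].op == JNEZ) {
                leader[q[i].target] = 1;            /* branch target       */
                if (i + 1 < n)
                    leader[i + 1] = 1;              /* quad after a branch */
            }
        }
    }

    int main(void)
    {
        /* Quadruples 1..9 from the Basic Block example, stored 0-based;
           operands other than branch targets are omitted here.           */
        Quad q[] = {
            { JEQZ, 4 }, { ASGN, 0 }, { ADD, 0 }, { JUMP, 6 },
            { ASGN, 0 }, { SUB, 0 },  { MUL, 0 }, { ADD, 0 }, { JNEZ, 1 }
        };
        int n = (int)(sizeof q / sizeof q[0]);
        int leader[16], block = 0;

        mark_leaders(q, n, leader);
        for (int i = 0; i < n; i++)
            if (leader[i])
                printf("B%d starts at quadruple %d\n", ++block, i + 1);
        return 0;
    }

Run on the nine quadruples it reports block starts at quadruples 1, 2, 5, and 7, matching B1–B4 above.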
Part II: Local Optimization (within a single Basic Block)

Local Optimization

Within a single basic block (needs no information about other blocks).

Example: constant folding (constant propagation) computes constant expressions at compile time:

    const int NN = 4;                  const int NN = 4;
    // ...                             // ...
    i = 2 + NN;                   →    i = 6;
    j = i * 5 + a;                     j = 30 + a;

Local Optimization (cont.)

Elimination of common subexpressions. Common subexpression elimination builds DAGs (directed acyclic graphs) from expression trees and forests.

    A[i+1] = B[i+1];              →    tmp = i + 1;
                                       A[tmp] = B[tmp];

    D = D + C * B;                     T = C * B;
    A = D + C * B;                →    D = D + T;
                                       A = D + T;

Note: the second D + C * B is not a common subexpression of the first one, because D is redefined in between and so it does not refer to the same value; only C * B is common.

Local Optimization (cont.)

Reduction in operator strength: replace an expensive operation by a cheaper one (on the given target machine):

    x = y ^ 2.0;          →   x = y * y;
    x = 2.0 * y;          →   x = y + y;
    x = 8 * y;            →   x = y << 3;
    (S1+S2).length()      →   S1.length() + S2.length()
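The common-subexpression examples above can be implemented with local value numbering, one standard way of building the DAG for a basic block. The sketch below is an illustration built on assumptions (three-address statements, tiny linear-search tables, constants treated as named leaves), not the course's implementation; on the statements above it reuses the value numbers of i + 1 and C * B, but gives the second D + ... a new value number because D has been redefined.

    #include <stdio.h>
    #include <string.h>

    #define MAXV 64

    /* Value table: entry i records how value number i was computed.      */
    static char op_of[MAXV];               /* 0 = leaf (variable/constant) */
    static int  lhs_of[MAXV], rhs_of[MAXV];
    static int  nvals = 0;

    /* Current value number held by each named variable.                   */
    static char var_name[MAXV][8];
    static int  var_vn[MAXV];
    static int  nvars = 0;

    /* Value number of a name; constants are treated as leaves too.        */
    static int vn_of(const char *v)
    {
        for (int i = 0; i < nvars; i++)
            if (strcmp(var_name[i], v) == 0)
                return var_vn[i];
        op_of[nvals] = 0;                               /* new leaf value  */
        strcpy(var_name[nvars], v);
        var_vn[nvars++] = nvals;
        return nvals++;
    }

    static void set_var(const char *v, int vn)
    {
        for (int i = 0; i < nvars; i++)
            if (strcmp(var_name[i], v) == 0) { var_vn[i] = vn; return; }
        strcpy(var_name[nvars], v);
        var_vn[nvars++] = vn;
    }

    /* Process "dst = a op b": reuse the value number if (op,a,b) is known. */
    static void emit(const char *dst, const char *a, char op, const char *b)
    {
        int va = vn_of(a), vb = vn_of(b);
        for (int i = 0; i < nvals; i++)
            if (op_of[i] == op && lhs_of[i] == va && rhs_of[i] == vb) {
                printf("%-2s = v%d          ; common subexpression, reuse v%d\n",
                       dst, i, i);
                set_var(dst, i);
                return;
            }
        op_of[nvals] = op; lhs_of[nvals] = va; rhs_of[nvals] = vb;
        printf("v%d = v%d %c v%d     ; from %s = %s %c %s\n",
               nvals, va, op, vb, dst, a, op, b);
        set_var(dst, nvals++);
    }

    int main(void)
    {
        emit("t1", "i", '+', "1");    /* A[i+1] = B[i+1]: i+1 computed once */
        emit("t2", "i", '+', "1");    /* ... and reused here                */
        emit("t3", "C", '*', "B");    /* D = D + C*B                        */
        emit("D",  "D", '+', "t3");
        emit("t4", "C", '*', "B");    /* A = D + C*B: C*B is reused ...     */
        emit("A",  "D", '+', "t4");   /* ... but D+... is not: D redefined  */
        return 0;
    }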
Some Other Machine-Independent Optimizations

Array references
- C = A[I,J] + A[I,J+1]
- The two elements lie next to each other in memory, so the second address computation ought to be "give me the next element" rather than a full recomputation.

Inline expansion of code for small routines
- x = sqr(y)  →  x = y * y

Short-circuit evaluation of tests
- (a > b) and (c-b < k) and /* ... */
- If the first expression is false, the remaining expressions do not need to be evaluated, provided they contain no side effects (or if the language stipulates that the operator must perform short-circuit evaluation).

More examples of machine-independent optimization

See for example the OpenModelica Compiler (https://github.com/OpenModelica/OpenModelica/blob/master/OMCompiler/Compiler/FrontEnd/ExpressionSimplify.mo) optimizing abstract syntax trees:

    // listAppend(e1,{}) => e1 is O(1) instead of O(len(e1))
    case DAE.CALL(path=Absyn.IDENT("listAppend"),
                  expLst={e1,DAE.LIST(valList={})})
      then e1;

    // atan2(y,0) = sign(y)*pi/2
    case (DAE.CALL(path=Absyn.IDENT("atan2"),expLst={e1,e2}))
      guard Expression.isZero(e2)
      algorithm
        e := Expression.makePureBuiltinCall("sign", {e1}, DAE.T_REAL_DEFAULT);
      then DAE.BINARY(DAE.RCONST(1.570796326794896619231321691639751442),
                      DAE.MUL(DAE.T_REAL_DEFAULT), e);

Exercise 1: Draw a basic block control flow graph (BB CFG)


Part III: Loop Optimization

Loop Optimization

Minimize the time spent in a loop:
- time of the loop body
- data locality
- loop control overhead

What is a loop?
- A strongly connected component (SCC) in the control flow graph or, respectively, in the basic block graph
- Strongly connected: all nodes can be reached from all others
- A loop has a unique entry point
- Example: { B2, B4 } in the basic block CFG above is an SCC with 2 entry points → not a loop in the strict sense (spaghetti code)

Loop Example

Removing the 2nd entry point (quadruple 6 in B3 now jumps past the loop instead of falling through into B4):

    B1:  1: ( JEQZ, T1, 5, 0 )
    B2:  2: ( ASGN, 2, 0, A )
         3: ( ADD, A, 3, B )
         4: ( JUMP, 7, 0, 0 )
    B3:  5: ( ASGN, 23, 0, A )
         6: ( JUMP, 10, 0, 0 )
    B4:  7: ( MUL, A, B, C )
         8: ( ADD, C, 1, A )
         9: ( JNEZ, B, 2, 0 )

Now { B2, B4 } is an SCC with 1 entry point → it is a loop.

Loop Optimization Example: Loop-invariant code hoisting

Move loop-invariant code out of the loop:

    for (i=0; i<10; i++) {             tmp = c / d;
        a[i] = b[i] + c / d;      →    for (i=0; i<10; i++) {
    }                                      a[i] = b[i] + tmp;
                                       }

Loop Optimization Example: Loop unrolling

- Reduces loop overhead (the number of tests and branches) by duplicating the loop body: faster code, but code size grows.
- In the general case, e.g. when the loop limit is odd, make it even by handling the first iteration in an if-statement before the loop (see the sketch below).

    i = 1;                             i = 1;
    while (i <= 50) {                  while (i <= 50) {
        a[i] = b[i];              →        a[i] = b[i];
        i = i + 1;                         i = i + 1;
    }                                      a[i] = b[i];
                                           i = i + 1;
                                       }
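The peeling trick mentioned in the bullet above can be made concrete as follows. This is a minimal sketch, not code from the course material: the function name copy_unrolled and the 1-based indexing (so the arrays need n+1 elements) are assumptions chosen to mirror the slide's example.

    /* Unrolling the copy loop by a factor of two.  If n is odd, the single
       if-statement in front peels off the first iteration, so the unrolled
       loop always executes an even number of iterations.                  */
    void copy_unrolled(int a[], const int b[], int n)   /* copies b[1..n]  */
    {
        int i = 1;
        if (n % 2 != 0) {                 /* odd trip count: peel one      */
            a[i] = b[i];
            i = i + 1;
        }
        while (i <= n) {                  /* half as many tests/branches   */
            a[i] = b[i];
            i = i + 1;
            a[i] = b[i];
            i = i + 1;
        }
    }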
Loop Optimization Example: Loop interchange

To improve data locality, change the order of the inner and outer loops so that the data is accessed sequentially; accesses then stay within a cached block (fewer cache misses / page faults).

    for (i=0; i<N; i++)                for (j=0; j<M; j++)
        for (j=0; j<M; j++)       →        for (i=0; i<N; i++)
            a[j][i] = 0.0;                     a[j][i] = 0.0;

Figure: column-major order versus row-major order layout of a 3×3 matrix (a11 .. a33). By Cmglee – own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=65107030

Loop Optimization Example: Loop fusion

- Merge loops with identical headers
- Improves data locality and reduces the number of tests and branches (a before/after sketch follows below)
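A minimal before/after sketch of loop fusion, with illustrative array names and bound N; fusion is only legal when the loops have identical headers and no dependence forces the first loop to finish before the second may start.

    #define N 1024
    double a[N], b[N], c[N];

    /* Before fusion: two loops, 2*N loop tests, a[] is traversed twice.  */
    void before(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 1.0;
        for (int i = 0; i < N; i++)
            c[i] = a[i] * 2.0;
    }

    /* After fusion: one loop, N loop tests, and a[i] is still in a       */
    /* register or cache line when c[i] is computed (better locality).    */
    void after(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + 1.0;
            c[i] = a[i] * 2.0;
        }
    }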