Compiler construction 2014
Lecture 8: More on code optimization

Topics:
  SSA form
  Constant propagation
  Common subexpression elimination
  Loop optimizations

An example of optimization in LLVM

  int f () {
    int i, j, k;
    i = 8;
    j = 1;
    k = 1;
    while (i != j) {
      if (i == 8)
        k = 0;
      else
        i++;
      i = i + k;
      j++;
    }
    return i;
  }

Comments: A human reader sees, with some effort, that the C/Javalette function f returns 8. We follow how LLVM's optimizations will discover this fact.

Step 1: Naive translation to LLVM

  define i32 @f() {
  entry:
    %i = alloca i32
    %j = alloca i32
    %k = alloca i32
    store i32 8, i32* %i
    store i32 1, i32* %j
    store i32 1, i32* %k
    br label %while.cond

  while.cond:
    %tmp = load i32* %i
    %tmp1 = load i32* %j
    %cmp = icmp ne i32 %tmp, %tmp1
    br i1 %cmp, label %while.body, label %while.end

  while.body:
    %tmp2 = load i32* %i
    %cmp3 = icmp eq i32 %tmp2, 8
    br i1 %cmp3, label %if.then, label %if.else

  if.then:
    store i32 0, i32* %k
    br label %if.end

  if.else:
    %tmp4 = load i32* %i
    %inc = add i32 %tmp4, 1
    store i32 %inc, i32* %i
    br label %if.end

  if.end:
    %tmp5 = load i32* %i
    %tmp6 = load i32* %k
    %add = add i32 %tmp5, %tmp6
    store i32 %add, i32* %i
    %tmp7 = load i32* %j
    %inc8 = add i32 %tmp7, 1
    store i32 %inc8, i32* %j
    br label %while.cond

  while.end:
    %tmp9 = load i32* %i
    ret i32 %tmp9
  }

Step 2: Translating to SSA form (opt -mem2reg)

  define i32 @f() {
  entry:
    br label %while.cond

  while.cond:
    %k.1 = phi i32 [ 1, %entry ], [ %k.0, %if.end ]
    %j.0 = phi i32 [ 1, %entry ], [ %inc8, %if.end ]
    %i.1 = phi i32 [ 8, %entry ], [ %add, %if.end ]
    %cmp = icmp ne i32 %i.1, %j.0
    br i1 %cmp, label %while.body, label %while.end

  while.body:
    %cmp3 = icmp eq i32 %i.1, 8
    br i1 %cmp3, label %if.then, label %if.else

  if.then:
    br label %if.end

  if.else:
    %inc = add i32 %i.1, 1
    br label %if.end

  if.end:
    %k.0 = phi i32 [ 0, %if.then ], [ %k.1, %if.else ]
    %i.0 = phi i32 [ %i.1, %if.then ], [ %inc, %if.else ]
    %add = add i32 %i.0, %k.0
    %inc8 = add i32 %j.0, 1
    br label %while.cond

  while.end:
    ret i32 %i.1
  }

Step 3: Sparse Conditional Constant Propagation (opt -sccp)

  define i32 @f() {
  entry:
    br label %while.cond

  while.cond:
    %j.0 = phi i32 [ 1, %entry ], [ %inc8, %if.end ]
    %k.1 = phi i32 [ 1, %entry ], [ 0, %if.end ]
    %cmp = icmp ne i32 8, %j.0
    br i1 %cmp, label %while.body, label %while.end

  while.body:
    br i1 true, label %if.then, label %if.else

  if.then:
    br label %if.end

  if.else:
    br label %if.end

  if.end:
    %inc8 = add i32 %j.0, 1
    br label %while.cond

  while.end:
    ret i32 8
  }

Step 4: CFG Simplification (opt -simplifycfg)

  define i32 @f() {
  entry:
    br label %while.cond

  while.cond:
    %j.0 = phi i32 [ 1, %entry ], [ %inc8, %if.end ]
    %k.1 = phi i32 [ 1, %entry ], [ 0, %if.end ]
    %cmp = icmp ne i32 8, %j.0
    br i1 %cmp, label %if.end, label %while.end

  if.end:
    %inc8 = add i32 %j.0, 1
    br label %while.cond

  while.end:
    ret i32 8
  }

Comments: If the function terminates, the return value is 8. opt has not yet detected that the loop is certain to terminate.

Step 5: Dead Loop Deletion (opt -loop-deletion)

  define i32 @f() {
  entry:
    br label %while.end

  while.end:
    ret i32 8
  }

One more -simplifycfg step finally yields

  define i32 @f() {
  entry:
    ret i32 8
  }

For realistic code, dozens of passes are performed, some of them repeatedly. Many heuristics are used to determine the order. Use opt -std-compile-opts for a default selection.

Static Single Assignment form

Def-use chains: Dataflow analysis often needs to connect a definition with its uses and, conversely, to find all definitions reaching a use. This becomes much simpler if each variable has only one definition.

A new form of IR: Three-address code can be converted to SSA form by renaming variables so that each variable has just one definition.
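To see why the single-definition property pays off, here is a small C sketch of def-use bookkeeping under SSA (the types and names are our own illustration, not LLVM's): finding the definition of a name is a single field access, and all uses are one list away.

  typedef struct Instr Instr;            /* opaque IR instruction */

  typedef struct Use {
      Instr *user;                       /* an instruction that reads the name */
      struct Use *next;
  } Use;

  typedef struct {
      const char *name;
      Instr *def;                        /* the unique defining instruction */
      Use   *uses;                       /* head of the list of all uses */
  } SSAName;

Without SSA, "the definition of x" is not unique, and connecting a use to its reaching definitions requires a separate dataflow analysis.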

Conversion to SSA

A non-example:
  s := 0
  x := 1
  s := s + x
  x := x + 1

Converted to SSA:
  s1 := 0
  x1 := 1
  s2 := s1 + x1
  x2 := x1 + 1

An artificial device: φ-functions

A harder example

Naive version:
  s := 0
  x := 1
  L1: if x > n goto L2
      s := s + x
      x := x + 1
      goto L1
  L2:

Conversion started:
  s1 := 0
  x1 := 1
  L1: if x > n goto L2
      s2 := s? + x?
      x2 := x? + 1
      goto L1
  L2:

Note the def/use difficulty: in s + x, which def of s does the use refer to? What should replace the three '?'?

First pass: Add "definitions" of the form x := φ(x,...,x) at the beginning of each block with several predecessors, for each variable.
Second pass: Do the renumbering.

After 1st pass:
  s := 0
  x := 1
  L1: s := φ(s,s)
      x := φ(x,x)
      if x > n goto L2
      s := s + x
      x := x + 1
      goto L1
  L2:

After 2nd pass:
  s1 := 0
  x1 := 1
  L1: s3 := φ(s1,s2)
      x3 := φ(x1,x2)
      if x3 > n goto L2
      s2 := s3 + x3
      x2 := x3 + 1
      goto L1
  L2:

What are φ-functions?

A device during optimization: Think of φ-functions as function calls during optimization. Later, some of them will be eliminated (e.g. by dead code elimination). Others will, after optimization, be transformed to real code.

Idea: x3 := φ(x1,x2) will be transformed to an instruction x3 := x1 at the end of the left predecessor and x3 := x2 at the end of the right predecessor.

Step 2 of example revisited: To SSA form

[Figure lost in extraction: the LLVM code of step 2, shown again in SSA form.]

Advantages: Many analyses become much simpler when code is in SSA form. The main reason: for each use of a variable, we see immediately where it was defined.

Computing SSA form; algorithm

We already did this?
Yes, but that conversion inserts unnecessary φ-functions and is too inefficient – the gains in analysis with SSA form may be lost in conversion.

Better algorithms: There are algorithms for finding the right number of φ-functions needed. These are based on the notion of dominance; if you intend to use SSA form, you need to learn about that – or use LLVM, which has tools to do it for you.

Simple constant propagation

A dataflow analysis based on SSA form. It uses values from a lattice L with elements
  ⊤: Not a constant, as far as the analysis can tell.
  c1, c2, c3, ...: The value is constant, as indicated.
  ⊥: Yet unknown, may be constant.

Each variable v is assigned an initial value val(v) ∈ L: variables with definitions v := c get val(v) = c, input variables/parameters v get val(v) = ⊤, and the rest get val(v) = ⊥.

The lattice order: ⊥ ≤ c ≤ ⊤ for every constant c; ci and cj are not related to each other.

Propagation phase, 1

Iteration: Initially, place all names n with val(n) ≠ ⊥ on a worklist (i.e., every name about which something is already known). Iterate by picking a name from the worklist, examining its uses, and computing val of the right-hand sides, using rules such as

  0 · x = 0 (for any x)
  ⊥ · x = ⊥
  ⊤ · x = ⊤ (for x ≠ 0)

plus ordinary multiplication for constant operands.

For φ-functions, we take the join ∨ of the arguments, where x ∨ x = x for all x, ⊥ ∨ x = x for all x, x ∨ ⊤ = ⊤ for all x, and

  ci ∨ cj = ⊤, if ci ≠ cj
  ci ∨ cj = ci, otherwise.

Propagation phase, 2

Iteration, continued: Update val for the defined variables, putting variables that get a new value back on the worklist. Terminate when the worklist is empty.

Termination: Values of variables on the worklist can only increase (in lattice order) during the iteration, and each variable can have its value increased at most twice (⊥ → c → ⊤).

A disappointment: In our running example, this algorithm terminates with all variables having value ⊤. We need to take reachability into account.
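The rules above are compact enough to state directly in code. A minimal sketch in C (the Kind/Val encoding and the function names are our own illustration, not LLVM code):

  /* BOT = "yet unknown" (⊥), TOP = "not a constant" (⊤) */
  typedef enum { BOT, CONST, TOP } Kind;
  typedef struct { Kind kind; int c; } Val;

  /* The join x ∨ y, used when evaluating φ-functions. */
  Val join(Val a, Val b) {
      if (a.kind == BOT) return b;                      /* ⊥ ∨ x = x */
      if (b.kind == BOT) return a;
      if (a.kind == TOP || b.kind == TOP)
          return (Val){ TOP, 0 };                       /* x ∨ ⊤ = ⊤ */
      if (a.c == b.c) return a;                         /* x ∨ x = x */
      return (Val){ TOP, 0 };                           /* ci ∨ cj = ⊤ if ci ≠ cj */
  }

  /* Abstract multiplication, following the rules above. */
  Val mul(Val a, Val b) {
      if ((a.kind == CONST && a.c == 0) ||
          (b.kind == CONST && b.c == 0))
          return (Val){ CONST, 0 };                     /* 0 · x = 0, for any x */
      if (a.kind == BOT || b.kind == BOT)
          return (Val){ BOT, 0 };                       /* ⊥ · x = ⊥ */
      if (a.kind == TOP || b.kind == TOP)
          return (Val){ TOP, 0 };                       /* ⊤ · x = ⊤ (x ≠ 0) */
      return (Val){ CONST, a.c * b.c };                 /* ordinary multiplication */
  }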

Sparse Conditional Constant Propagation

Sketch of algorithm:
  The analysis also uses a worklist of reachable blocks.
  Initially, only the entry block is reachable.
  In the evaluation of φ-functions, only ⊥ flows in from unreachable blocks (see the sketch below).
  New blocks are added to the worklist when elaborating terminator instructions.

[Figure lost in extraction: the result for the running example.]

Correctness of SCCP

A combination of two dataflow analyses: Sparse conditional constant propagation can be seen as the combination of simple constant propagation and reachability analysis/dead code analysis. Both of these can be expressed as dataflow problems, and a framework can be devised in which the correctness of such a combination can be proved.
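The φ rule is the heart of the "sparse conditional" part and fits in a few lines of C, reusing Val and join from the sketch above (eval_phi and edge_reachable are our own hypothetical names):

  #include <stdbool.h>

  /* Evaluate a φ-function under reachability: arguments arriving over
     unreachable edges contribute nothing (they act as ⊥). */
  Val eval_phi(int nargs, const Val args[], const bool edge_reachable[]) {
      Val v = { BOT, 0 };
      for (int i = 0; i < nargs; i++)
          if (edge_reachable[i])        /* only ⊥ flows from unreachable blocks */
              v = join(v, args[i]);
      return v;
  }

Correspondingly, when a terminator's condition evaluates to a constant, only the taken successor is added to the block worklist, so code behind the untaken edge never contributes any values.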

Final steps

Control flow graph simplification: A fairly simple pass; SCCP does not change the graph structure of the CFG even when "obvious" simplifications can be done.

Dead Loop Elimination: Identifies an induction variable (namely j) which increases by 1 in each loop iteration, terminates the loop when it reaches a known value, and is initialised to a smaller value. When such a variable is found, loop termination is guaranteed and the loop can be removed.

Common subexpression elimination

Problem: We want to avoid re-computing an expression; instead we want to use the previously computed value.

Code example:
  a := b + c
  b := a - d
  c := b + c
  d := a - d

Notes: The second occurrence of a - d should not be computed; instead we should use d := b. Both occurrences of b + c must be computed, since b is redefined in between.

Value numbering, 1

A classic technique: Works on three-address code within a basic block. Each expression is assigned a value number (VN), so that expressions that have the same VN must have the same value. (Note: the VN is not the value of the expression.)

Data structures:
  A dictionary D1 that associates
    a variable or a literal with a VN,
    a triple (VN, operator, VN) with a VN.
  Typically, D1 is implemented as a hash table.
  A dictionary D2, mapping VNs to sets of variables (implemented as an array).

Value numbering, 2

Algorithm: For each instruction x := y # z:
  Look up the VN ny for y in D1. If not present, generate a new unique VN ny and put D1(y) = ny, D2(ny) = {y}. Do the same for z.
  Look up x in D1; if a VN n is found, remove x from D2(n).
  Look up (ny, #, nz) in D1.
    If a VN m is found, put D1(x) = m (m has been computed before). If D2(m) is non-empty, replace the instruction by x := v for some v in that set.
    Otherwise, generate a new unique VN m and put D1(ny, #, nz) = m, D1(x) = m.
  Add x to D2(m).
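As a concrete illustration, here is a toy C implementation of the algorithm for instructions x := y # z. It uses linear search in fixed-size arrays where the slides suggest a hash table, and all names are our own — a sketch, not production code. Run on the code example from the previous slide, it prints d := b for the last instruction, as claimed there.

  #include <stdio.h>
  #include <string.h>

  #define MAX 64

  /* D1, part 1: variable or literal -> VN */
  static const char *var_key[MAX]; static int var_vn[MAX]; static int nvars;
  /* D1, part 2: (VN, op, VN) -> VN */
  static int tl[MAX], tr[MAX]; static char t_op[MAX]; static int tvn[MAX]; static int ntrips;
  /* D2: VN -> set of variables currently holding that value */
  static const char *d2[MAX][MAX]; static int d2n[MAX];
  static int next_vn;

  static int lookup_var(const char *s) {
      for (int i = 0; i < nvars; i++)
          if (strcmp(var_key[i], s) == 0) return i;
      return -1;
  }

  static int vn_of(const char *y) {      /* look up y in D1, adding it if absent */
      int i = lookup_var(y);
      if (i >= 0) return var_vn[i];
      int vn = next_vn++;
      var_key[nvars] = y; var_vn[nvars++] = vn;
      d2[vn][d2n[vn]++] = y;             /* D2(ny) = {y} */
      return vn;
  }

  static void d2_remove(int vn, const char *x) {
      for (int i = 0; i < d2n[vn]; i++)
          if (strcmp(d2[vn][i], x) == 0) { d2[vn][i] = d2[vn][--d2n[vn]]; return; }
  }

  /* Process x := y # z, printing the possibly improved instruction. */
  void value_number(const char *x, const char *y, char op, const char *z) {
      int ny = vn_of(y), nz = vn_of(z);

      int ix = lookup_var(x);            /* x is redefined: remove it from D2(n) */
      if (ix >= 0) d2_remove(var_vn[ix], x);

      int m = -1;                        /* look up (ny, #, nz) in D1 */
      for (int i = 0; i < ntrips; i++)
          if (tl[i] == ny && t_op[i] == op && tr[i] == nz) { m = tvn[i]; break; }

      if (m >= 0 && d2n[m] > 0) {        /* computed before, and still held */
          printf("%s := %s\n", x, d2[m][0]);
      } else {
          if (m < 0) {                   /* new triple: generate a fresh VN */
              m = next_vn++;
              tl[ntrips] = ny; t_op[ntrips] = op; tr[ntrips] = nz; tvn[ntrips++] = m;
          }
          printf("%s := %s %c %s\n", x, y, op, z);
      }
      if (ix >= 0) var_vn[ix] = m;       /* D1(x) = m */
      else { var_key[nvars] = x; var_vn[nvars++] = m; }
      d2[m][d2n[m]++] = x;               /* add x to D2(m) */
  }

  int main(void) {                       /* the code example from the slide */
      value_number("a", "b", '+', "c");  /* a := b + c */
      value_number("b", "a", '-', "d");  /* b := a - d */
      value_number("c", "b", '+', "c");  /* c := b + c  (recomputed: b redefined) */
      value_number("d", "a", '-', "d");  /* a - d has VN of b: prints d := b */
      return 0;
  }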

Value numbering, 3

Extended basic blocks: A subtree of the CFG where each node has only one predecessor. Each path through the EBB is handled by value numbering. To avoid starting from scratch, use stacks of dictionaries. (Needs SSA form.)

[Figure lost in extraction: an example CFG with blocks B1–B6 illustrating extended basic blocks.]

Algebraic identities: Value numbering can be combined with code improvement using identities such as

  x · 0 = 0        0 · x = 0
  x · 1 = x        1 · x = x

and similar identities for other operators x # y. Avoid long sequences of tests!

Available expressions: a dataflow analysis

Purpose: An auxiliary concept in an intraprocedural analysis for finding common subexpressions that are not handled by value numbering.

Definition: An expression x # y is available at a point P in a CFG if the expression is evaluated on every path from the entry node to P, and neither x nor y is redefined after the last such evaluation.

Locally defined sets: We consider sets of expressions.
  gen(n) is the set of expressions x # y that are evaluated in n without a subsequent definition of x or y.
  kill(n) is the set of expressions x # y where n defines x or y without a subsequent evaluation of x # y.

Available expressions: the flow equations

Sets to compute by flow analysis:
  avail-in(n) is the set of available expressions at the beginning of n.
  avail-out(n) is the set of available expressions at the end of n.

The equations:
  avail-in(n0)  = {}   for the entry node n0
  avail-in(n)   = ∩_{p ∈ preds(n)} avail-out(p)   (for other n)
  avail-out(n)  = gen(n) ∪ (avail-in(n) − kill(n))

Motivation: An expression is available on exit from n if it is either generated in n, or it was already available on entry and not killed in n. An expression is available on entry if it is available on exit from all predecessors.

Available expressions: Comments

Solution method: Iterate from the initial sets avail-in(n) = avail-out(n) = U, where U is the set of all expressions occurring in the CFG (except avail-in(n0) = {}).
  The iteration converges to the greatest fixpoint; all sets shrink monotonically during the iterations.
  The fixpoint solution has the property that any expression declared available really is available. This does not hold for earlier iterations.
  Sets can be represented as bit-vectors (U = all ones).
  This is a forward problem; information flows from predecessors to successors. Thus one should try to compute predecessors first.
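As an illustration of the solution method, here is a small C sketch of the iteration with bit-vector sets, using one machine word per set (our own encoding; a real implementation would use multi-word vectors):

  #include <stdint.h>
  #include <stdbool.h>

  #define MAXB 32                      /* max number of basic blocks in this sketch */

  typedef uint64_t Set;                /* one bit per expression; at most 64 exprs */

  int nblocks;                         /* block 0 is the entry node n0 */
  Set gen_set[MAXB], kill_set[MAXB];   /* gen(n) and kill(n) from the slide */
  int npreds[MAXB], pred[MAXB][MAXB];
  Set avail_in[MAXB], avail_out[MAXB];

  void available_expressions(Set U) {  /* U = all expressions in the CFG */
      for (int n = 0; n < nblocks; n++) {
          avail_in[n]  = U;            /* initial sets: everything available... */
          avail_out[n] = U;
      }
      avail_in[0] = 0;                 /* ...except avail-in(n0) = {} */

      bool changed = true;
      while (changed) {                /* sets shrink monotonically towards */
          changed = false;             /* the greatest fixpoint */
          for (int n = 0; n < nblocks; n++) {
              Set in = (n == 0) ? 0 : U;
              for (int p = 0; p < npreds[n]; p++)
                  in &= avail_out[pred[n][p]];         /* ∩ over all preds */
              Set out = gen_set[n] | (in & ~kill_set[n]);
              if (in != avail_in[n] || out != avail_out[n]) {
                  avail_in[n]  = in;
                  avail_out[n] = out;
                  changed = true;
              }
          }
      }
  }

The visiting order only affects the number of iterations, not the result; visiting predecessors first, as the slide suggests, makes the information propagate faster.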

Common subexpression elimination

Available expressions can be eliminated: If dataflow analysis finds that y # z in an instruction x := y # z is available, we could eliminate it. This is a second, separate step (a code transformation): replace the instruction by x := w. But how to find w?

Basic idea: Generate a new name w. Follow the control flow backwards along all paths until a definition v := y # z is found (such a def must exist in all paths!). Replace the def by
  w := y # z
  v := w

A more powerful idea: Find these definitions by dataflow analysis: reaching definitions.

Tail recursion

A different optimization: A recursive function is tail-recursive if it returns a value computed by (just) a recursive call. This can (and should) be optimized to a loop.

Recursive form:
  int sumTo(int lim) {
    return ack(1, lim, 0);
  }

  int ack(int n, int k, int s) {
    if (n > k)
      return s;
    else
      return ack(n+1, k, s+n);
  }

ack rewritten:
  int ack(int n, int k, int s) {
  L: if (n > k)
       return s;
     else {
       k = k;      // not needed
       s = s + n;
       n = n + 1;  // note reordering!
       goto L;
     }
  }

A motivating example

A simple Javalette function (in extension arrays1):

  int sum (int [] a) {
    int res = 0;
    for (int x : a)
      res = res + x;
    return res;
  }

What code would you generate?

Possible naive LLVM code, part 1

  %arr = type { i32, [ 0 x i32 ] }*

  define i32 @sum(%arr %__p__a) {
  entry:
    %a = alloca %arr
    store %arr %__p__a , %arr* %a
    %_res_t0 = alloca i32
    store i32 0 , i32* %_res_t0
    %_x_t1 = alloca i32
    %t2 = load %arr* %a
    %t3 = getelementptr %arr %t2 , i32 0, i32 0
    %t4 = load i32* %t3
    %_indexx_t5 = alloca i32
    store i32 0 , i32* %_indexx_t5
    br label %lab0

  lab0:
    %t6 = load i32* %_indexx_t5
    %t7 = icmp slt i32 %t6 , %t4
    br i1 %t7 , label %lab1 , label %lab2

Possible naive LLVM code, part 2

  lab1:
    %t8 = getelementptr %arr %t2 , i32 0, i32 1, i32 %t6
    %t9 = load i32* %t8
    store i32 %t9 , i32* %_x_t1
    %t10 = load i32* %_res_t0
    %t11 = load i32* %_x_t1
    %t12 = add i32 %t10 , %t11
    store i32 %t12 , i32* %_res_t0
    %t13 = add i32 %t6 , 1
    store i32 %t13 , i32* %_indexx_t5
    br label %lab0

  lab2:
    %t14 = load i32* %_res_t0
    ret i32 %t14
  }

After opt -mem2reg

  define i32 @sum(%arr %__p__a) {
  entry:
    %t3 = getelementptr %arr %__p__a, i32 0, i32 0
    %t4 = load i32* %t3
    br label %lab0

  lab0:
    %_res_t0.0 = phi i32 [ 0, %entry ], [ %t12, %lab1 ]
    %_indexx_t5.0 = phi i32 [ 0, %entry ], [ %t13, %lab1 ]
    %t7 = icmp slt i32 %_indexx_t5.0, %t4
    br i1 %t7, label %lab1, label %lab2

  lab1:
    %t8 = getelementptr %arr %__p__a, i32 0, i32 1, i32 %_indexx_t5.0
    %t9 = load i32* %t8
    %t12 = add i32 %_res_t0.0, %t9
    %t13 = add i32 %_indexx_t5.0, 1
    br label %lab0

  lab2:
    ret i32 %_res_t0.0
  }

After opt -std-compile-opts

  define i32 @sum(%arr nocapture %__p__a) nounwind readonly {
  entry:
    %t3 = getelementptr %arr %__p__a, i32 0, i32 0
    %t4 = load i32* %t3
    %t71 = icmp sgt i32 %t4, 0
    br i1 %t71, label %bb.nph, label %lab2

  bb.nph:
    %tmp = zext i32 %t4 to i64
    br label %lab1

  lab1:
    %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %lab1 ]
    %_res_t0.02 = phi i32 [ 0, %bb.nph ], [ %t12, %lab1 ]
    %t8 = getelementptr %arr %__p__a, i64 0, i32 1, i64 %indvar
    %t9 = load i32* %t8
    %t12 = add i32 %t9, %_res_t0.02
    %indvar.next = add i64 %indvar, 1
    %exitcond = icmp eq i64 %indvar.next, %tmp
    br i1 %exitcond, label %lab2, label %lab1

  lab2:
    %_res_t0.0.lcssa = phi i32 [ 0, %entry ], [ %t12, %lab1 ]
    ret i32 %_res_t0.0.lcssa
  }

Generated x86 assembly (with llc)

  _sum:    push EDI
           push ESI
           mov ECX, DWORD PTR [ESP + 12]
           mov EDX, DWORD PTR [ECX]
           test EDX, EDX
           jg LBB0_2
           xor EAX, EAX
           jmp LBB0_4
  LBB0_2:  xor ESI, ESI
           add ECX, 4
           xor EAX, EAX
  LBB0_3:  add EAX, DWORD PTR [ECX]
           add EDX, -1
           adc ESI, -1
           add ECX, 4
           mov EDI, EDX
           or EDI, ESI
           jne LBB0_3
  LBB0_4:  pop ESI
           pop EDI
           ret

Comments:
  No local vars; no stack frame handling.
  Uses callee-save registers EDI and ESI; note the save/restore.
  ECX holds the address of the current array element; increased by 4 in each iteration.
  EDX counts the number of elements remaining.
  Use of ESI in the loop termination test??

Optimizations of loops

In computationally demanding applications, most of the time is spent executing (inner) loops.

Moving loop-invariant code out of the loop

A simple example / Not quite as simple
[The slide's code examples, beginning "for (i=0; i<...", were lost in extraction.]
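Since the slide's own code did not survive, here is a typical illustration of the transformation (our example, not necessarily the lecture's): b * c does not change inside the loop, so it can be computed once before it.

  void hoist_demo(int a[], int n, int b, int c) {
      /* Before: b*c is loop-invariant but recomputed in every iteration. */
      for (int i = 0; i < n; i++)
          a[i] = b * c + i;

      /* After moving the invariant computation out of the loop: */
      int t = b * c;
      for (int i = 0; i < n; i++)
          a[i] = t + i;
  }

A case that is "not quite as simple": if the loop can execute zero times, the hoisted computation runs although the original never evaluated it — the same issue that motivates the loop inversion on the strength-reduction slides below.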

Induction variables

A basic induction variable is an (integer) variable which has a single definition in the loop body, which increases its value by a fixed (loop-invariant) amount. Example: n = n + 3.

A basic IV will assume values in arithmetic progression when the loop executes.

Given a basic IV, we can find a collection of derived IVs, each of which has a single def of the form
  m = a*n + b;
where a and b are loop-invariant. The def can be extended to allow RHSs of the form a*k + b, where k is an already established derived IV.

Strength reduction for IVs

  while (n < 100) {
    k = 7*n + 3;
    a[k]++;
    n++;
  }

Here n is a basic IV (its only def increases it by 1), and k is a derived IV. Replace the multiplication involved in the def of k by an addition:

  k = 7*n + 3;
  while (n < 100) {
    a[k]++;
    n++;
    k += 7;
  }

Could there be some problem with this transformation?

Strength reduction for IVs, continued

The loop might not execute at all, in which case k would not be evaluated. Better to perform loop inversion first:

  if (n < 100) {
    k = 7*n + 3;
    do {
      a[k]++;
      n++;
      k += 7;
    } while (n < 100);
  }

If n is not used after the loop, it can be eliminated from the loop:

  if (n < 100) {
    k = 7*n + 3;
    do {
      a[k]++;
      k += 7;
    } while (k < 703);
  }

One more example

Sample loop:
  int sum = 0;
  for (i = 0; i < 1000; i++)
    sum += a[i];

What can these techniques do for this loop?

Naive assembler code:
  %sum = 0
  %i = 0
  L1: %off = mul %i, 4
      %addr = add %addr.a, %off
      %a.i = load %addr
      %sum = add %sum, %a.i
      %i = add %i, 1
      %stop = cmp lt %i, 1000
      br %stop, L1, L2
  L2:

After strength reduction/IV techniques:
  %sum = 0
  %off = 0
  %addr = %addr.a
  %end = add %addr.a, 4000
  L1: %a.i = load %addr
      %sum = add %sum, %a.i
      %addr = add %addr, 4
      %stop = cmp lt %addr, %end
      br %stop, L1, L2
  L2:

Loop unrolling

  for (i = 0; i < 100; i++)
    a[i] = a[i] + x[i];

becomes

  for (i = 0; i < 100; i = i + 4) {
    a[i]   = a[i]   + x[i];
    a[i+1] = a[i+1] + x[i+1];
    a[i+2] = a[i+2] + x[i+2];
    a[i+3] = a[i+3] + x[i+3];
  }

In which ways is this an improvement? What to do if the upper bound is n? (One common answer is sketched after the gcc notes below.) Is unrolling four steps the best choice? What could be the disadvantages?

Optimizations in gcc

On ASTs: Inlining, constant folding, arithmetic simplification.
On RTL code (≈ three-address code):
  Tail (and sibling) call optimization.
  Jump optimization.
  SSA pass: constant propagation, ...
  Common subexpression elimination, more constant propagation.
  ...
Difficult decisions: optimization order, repetitions.
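One common answer to "what to do if the upper bound is n" (our sketch, not from the slides): unroll by four and add a remainder loop for the last few elements.

  void add_arrays(int a[], const int x[], int n) {
      int i;
      for (i = 0; i + 3 < n; i += 4) {   /* unrolled main loop */
          a[i]   += x[i];
          a[i+1] += x[i+1];
          a[i+2] += x[i+2];
          a[i+3] += x[i+3];
      }
      for (; i < n; i++)                 /* remainder: at most 3 iterations */
          a[i] += x[i];
  }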

Summing up

On optimization We have only looked at a few of many, many techniques. Modern optimization techniques use sophisticated algorithms and clever data structures. Frameworks such as LLVM make it possible to get the benefits of state-of-the-art techniques in your own compiler project.