
15-745 Lecture 5: Natural Loops, Classical Loop Optimizations, Dependencies

Loops are Key
• Loops are extremely important – the “90-10” rule
• Loop optimization involves
  – understanding control-flow structure
  – understanding data-dependence information
  – sensitivity to side-effecting operations
  – extra care in some transformations such as register spilling

Copyright © Seth Goldstein, 2008

Based on slides from Lee & Callahan

Common loop optimizations
(Scalar opts, DF analysis, control flow analysis)
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p = i*w + b to p = b, p += w, when w, b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body (requires understanding data dependencies)
• Loop permutation
  – to improve cache memory performance

Finding Loops
• To optimize loops, we need to find them!
• Could use source language loop information in the abstract syntax tree…
• BUT:
  – There are multiple source loop constructs: for, while, do-while, even goto
  – Want IR to support different languages
  – Ideally, we want a single concept of a loop so all have same analysis, same optimizations
  – Solution: dismantle source-level constructs, then re-find loops from fundamentals

Finding Loops
• To optimize loops, we need to find them!
• Specifically:
  – loop-header node(s)
    • nodes in a loop that have immediate predecessors not in the loop
  – back edge(s)
    • control-flow edges to previously executed nodes
  – all nodes in the loop body

Control-flow analysis
• Many languages have goto and other complex control, so loops can be hard to find in general
• Determining the control structure of a program is called control-flow analysis
• Based on the notion of dominators


Dominators
• a dom b
  – node a dominates b if every possible path from entry to b includes a
• a sdom b
  – a strictly dominates b if a dom b and a != b
• a idom b
  – a immediately dominates b if a sdom b, AND there is no c such that a sdom c and c sdom b

Back edges and loop headers
• A control-flow edge from node B3 to B2 is a back edge if B2 dom B3
• Furthermore, in that case node B2 is a loop header
[Figure: a CFG with blocks B1–B6 and a back edge B3 → B2; B2 dominates B3, so B2 is the loop header]
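Dominator sets can be computed as a forward data-flow fixed point. A minimal sketch in C (not from the slides; node 0 as entry, and MAXN, npred, and pred as the CFG representation are illustrative assumptions):

  #define MAXN 32
  typedef unsigned int BitSet;              /* one bit per CFG node */

  int nnodes;                               /* number of CFG nodes */
  int npred[MAXN];                          /* in-degree of each node */
  int pred[MAXN][MAXN];                     /* pred[v][i]: i-th predecessor of v */
  BitSet dom[MAXN];                         /* dom[v]: set of dominators of v */

  static BitSet all_nodes(void) {
      return (nnodes >= 32) ? ~0u : ((1u << nnodes) - 1u);
  }

  void compute_dominators(void) {
      /* entry dominates only itself; initialize the rest to "all nodes" */
      dom[0] = 1u;
      for (int v = 1; v < nnodes; v++)
          dom[v] = all_nodes();

      /* fixed point of: dom(v) = {v} union intersection of dom(p) over preds p */
      for (int changed = 1; changed; ) {
          changed = 0;
          for (int v = 1; v < nnodes; v++) {
              BitSet d = all_nodes();
              for (int i = 0; i < npred[v]; i++)
                  d &= dom[pred[v][i]];
              d |= 1u << v;
              if (d != dom[v]) { dom[v] = d; changed = 1; }
          }
      }
  }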

Natural loop
• Consider a back edge from node n to node h
• The natural loop of n→h is the set of nodes L such that for all x∈L:
  – h dom x and
  – there is a path from x to n not containing h

Examples
Simple example: a for loop (often it’s more complicated, since a for loop found in the source might need an if/then guard)
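A common way to collect the natural loop of a back edge n → h is a backward walk from n that never crosses h. A minimal sketch (illustrative, reusing the CFG arrays from the dominator sketch above):

  #include <string.h>

  int in_loop[MAXN];   /* in_loop[v] != 0 iff v is in the natural loop */

  /* backward DFS over predecessor edges, stopping at marked nodes */
  static void mark_backward(int v) {
      if (in_loop[v]) return;
      in_loop[v] = 1;
      for (int i = 0; i < npred[v]; i++)
          mark_backward(pred[v][i]);
  }

  /* natural_loop(n, h): marks {h} plus every node that can reach n
     without passing through h */
  void natural_loop(int n, int h) {
      memset(in_loop, 0, sizeof in_loop);
      in_loop[h] = 1;   /* pre-mark the header so the walk stops there */
      mark_backward(n);
  }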


Examples
Try this:
  for (..) {
    if (..) {
      …
    } else {
      …
      if (x) {
        e;
        break;
      }
    }
  }
(CFG nodes labeled a through e)

Examples
  for (..) {
    if (..) {
      …
    } else {
      …
      if (x) {
        e;
        break;
      }
    }
  }
The block containing e is lexically in the loop, but not in the natural loop; this is another reason why CFG analysis is preferred over source/AST loops.


Examples
• Yes, it can happen in C
[Figure: an irreducible CFG that real C code can produce]

Natural Loops
One loop per header. What are the natural loops?
[Figures: small CFGs over nodes 0–3; the natural loops shown are {0,1,2}, {}, {1,2,3}, {1,2}, and {0,1,2,3}]

Nested Loops
• Unless two natural loops have the same header, they are either disjoint or nested within each other
• If A and B are loops (sets of blocks) with headers a and b such that a ≠ b and b ∈ A
  – B ⊂ A
  – loop B is nested within A
  – B is the inner loop
• Can compute the loop-nest tree

General Loops
• A more general looping structure is a strongly connected component of the control flow graph
  – subgraph (N_scc, E_scc) such that every node in N_scc is reachable from every other node using only edges in E_scc
[Figure: CFG 0 → 1, 1 ↔ 2; the SCC is {1,2}]
Not a very useful definition of a loop


Reducible Flow Graphs
There is a special class of flow graphs, called reducible flow graphs, for which several code optimizations are especially easy to perform.
In reducible flow graphs, loops are unambiguously defined and dominators can be efficiently computed.

Reducible flow graphs
Definition: A flow graph G is reducible iff we can partition the edges into two disjoint groups, forward edges and back edges, with the following two properties:
1. The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
2. The back edges consist only of edges whose heads dominate their tails.

Why isn’t this reducible?
[Figure: CFG with nodes 0, 1, 2 and a cycle between 1 and 2]
This flow graph has no back edges (neither edge of the 1–2 cycle has a head that dominates its tail). Thus, it would be reducible only if the entire graph were acyclic, which is not the case.

Alternative definition
• Definition: A flow graph G is reducible if we can repeatedly collapse (reduce) together blocks (x,y) where x is the only predecessor of y (ignoring self loops) until we are left with a single node
[Figure: a four-node CFG collapsing step by step: {0},{1},{2},{3} → {0},{1,2},{3} → {0},{1,2,3} → {0,1,2,3}]

Properties of Reducible Flow Graphs
• In a reducible flow graph, all loops are natural loops
• Can use DFS to find loops
• Many analyses are more efficient
  – polynomial versus exponential
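The collapse test is easy to prototype. A minimal sketch in C (not from the slides; the adjacency matrix, node limit, and entry node 0 are illustrative assumptions):

  #define NMAX 16
  int edge[NMAX][NMAX];   /* edge[u][v] != 0 iff there is an edge u -> v */

  /* Returns 1 iff nodes 0..n-1 collapse to a single node by repeatedly
     merging a node y (never the entry, node 0) into its unique
     predecessor x, ignoring self loops. */
  int is_reducible(int n) {
      int alive[NMAX], nalive = n, progress = 1;
      for (int v = 0; v < n; v++) alive[v] = 1;
      while (nalive > 1 && progress) {
          progress = 0;
          for (int y = 1; y < n; y++) {
              if (!alive[y]) continue;
              int x = -1, preds = 0;
              for (int u = 0; u < n; u++)          /* count y's predecessors */
                  if (alive[u] && u != y && edge[u][y]) { x = u; preds++; }
              if (preds != 1) continue;
              for (int v = 0; v < n; v++)          /* x inherits y's out-edges */
                  if (alive[v] && v != x && edge[y][v]) edge[x][v] = 1;
              for (int u = 0; u < n; u++)          /* erase y from the graph */
                  edge[u][y] = edge[y][u] = 0;
              alive[y] = 0; nalive--; progress = 1;
          }
      }
      return nalive == 1;
  }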


Good News
• Most flow graphs are reducible
• Languages prohibit irreducibility
  – goto-free C
  – Java
• Programmers usually don’t use such constructs even if they’re available
  – >90% of old code reducible

Dealing with Irreducibility
• Don’t
• Can split nodes and duplicate code to get reducible graph
  – possible exponential blowup
• Other techniques…
[Figure: node splitting: the irreducible CFG 0, 1, 2 becomes reducible with a duplicated node 2a]

Loop optimizations: Hoisting of loop-invariant computations

Loop-invariant computations
• A definition
    t = x op y
  in a loop is (conservatively) loop-invariant if
  – x and y are constants, or
  – all reaching definitions of x and y are outside the loop, or
  – only one definition reaches x (or y), and that definition is loop-invariant
• so keep marking iteratively
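The “keep marking iteratively” rule is a small fixed point. A minimal sketch over a toy three-address IR (the Instr type and the helper queries are illustrative assumptions, not from the slides):

  typedef struct Instr {
      const char *dst, *src1, *src2;   /* dst = src1 op src2 */
      int invariant;                   /* the mark */
  } Instr;

  /* Assumed helpers backed by reaching-definitions information: */
  int is_constant(const char *operand);
  int all_defs_outside_loop(const char *operand);
  int sole_reaching_def(const char *operand, Instr **def);  /* unique in-loop def? */

  static int operand_invariant(const char *op) {
      Instr *d;
      if (is_constant(op)) return 1;
      if (all_defs_outside_loop(op)) return 1;
      return sole_reaching_def(op, &d) && d->invariant;
  }

  void mark_invariants(Instr *loop[], int n) {
      for (int changed = 1; changed; ) {   /* iterate until no new marks */
          changed = 0;
          for (int i = 0; i < n; i++) {
              if (loop[i]->invariant) continue;
              if (operand_invariant(loop[i]->src1) &&
                  operand_invariant(loop[i]->src2)) {
                  loop[i]->invariant = 1;
                  changed = 1;
              }
          }
      }
  }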


Loop-invariant computations
• Be careful:
    t = expr;
    for (..) {
      s = t * 2;
      t = loop_invariant_expr;
      x = t + 2;
      …
    }
• Even though t’s two reaching definitions are each invariant, s is not invariant…

Hoisting
• In order to “hoist” a loop-invariant computation out of a loop, we need a place to put it
• We could copy it to all immediate predecessors (except along the back edge) of the loop header...
• ...But we can avoid code duplication by inserting a new block, called the pre-header

Hoisting
[Figure: pre-headers inserted: a new block A’ is added before loop header A, and B’ before loop header B; hoisted code goes in the pre-header]

Hoisting conditions
• For a loop-invariant definition
    d: t = x op y
• we can hoist d into the loop’s pre-header only if
  1. d’s block dominates all loop exits at which t is live-out, and
  2. there is only one def of t in the loop, and
  3. t is not live-out of the pre-header

We need to be careful...
• All hoisting conditions must be satisfied!

  OK:                     violates 1,3:            violates 2:
  L0: t = 0               L0: t = 0                L0: t = 0
  L1: i = i + 1           L1: if i >= N goto L2    L1: i = i + 1
      t = a * b               i = i + 1                t = a * b
      M[i] = t                t = a * b                M[i] = t
      if i < N goto L1        M[i] = t                 t = 0
  L2: x = t                   goto L1                  if i < N goto L1
                          L2: x = t                L2: x = t

We need to be careful...
• All hoisting conditions must be satisfied!
• In the “violates 2” program above, the second def of t (t = 0) inside the loop also reaches a use of t, so hoisting t = a * b would change which value those uses see.

Loop optimizations: Induction-variable elimination

The idea of IVE
• Suppose we have a loop variable
  – i initially 0; each iteration i = i + 1
• and a variable x that depends linearly on it, x = i * c1 + c2, with c1 and c2 loop-invariant; then instead of recomputing x each iteration:
  – initialize x = i0 * c1 + c2 (execute once)
  – increment x by c1 each iteration

Simple Example of IVE
  for i = 0 to n …

  i <- 0
  H: if i >= n goto exit
     j <- i * 4
     k <- j + a
     M[k] <- 0
     i <- i + 1
     goto H

Clearly, j & k do not need to be computed anew each time, since they are related to i and i changes linearly.
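For a concrete picture of the end-to-end effect, here is the same transformation written in C (an illustrative sketch, not from the slides; it zeroes an int array at address a, assuming 4-byte ints as the slides do):

  /* Before IVE: j and k are recomputed from i on every iteration. */
  void zero_before(int *a, int n) {
      for (int i = 0; i < n; i++) {
          int j = i * 4;                 /* derived IV: j = 4*i     */
          char *k = (char *)a + j;       /* derived IV: k = a + 4*i */
          *(int *)k = 0;
      }
  }

  /* After IVE and comparison rewriting: i and j are gone,
     k advances by 4 bytes per iteration. */
  void zero_after(int *a, int n) {
      char *k = (char *)a;
      char *limit = (char *)a + (long)n * 4;   /* n' = a + n*4 */
      while (k < limit) {
          *(int *)k = 0;
          k += 4;                        /* strength-reduced: one add */
      }
  }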

Simple Example of IVE
Introduce j' and k' that track i*4 and i*4 + a incrementally:

  i <- 0
  j' <- 0
  k' <- a
  H: if i >= n goto exit
     j <- j'
     k <- k'
     M[k] <- 0
     i <- i + 1
     j' <- j' + 4
     k' <- k' + 4
     goto H

But, then we don't even need j (or j'):

  i <- 0
  k' <- a
  H: if i >= n goto exit
     k <- k'
     M[k] <- 0
     i <- i + 1
     k' <- k' + 4
     goto H

Do we need i?


Simple Example of IVE
Rewrite the comparison: since k' = a + i*4, the test i >= n can become k' >= a+(n*4), and then i is no longer needed:

  k' <- a
  H: if k' >= a+(n*4) goto exit
     k <- k'
     M[k] <- 0
     k' <- k' + 4
     goto H

But, a+(n*4) is loop-invariant! Invariant code motion on a+(n*4):

  k' <- a
  n' <- a + (n * 4)
  H: if k' >= n' goto exit
     k <- k'
     M[k] <- 0
     k' <- k' + 4
     goto H

Now, we do copy propagation and eliminate k:

  k' <- a
  n' <- a + (n * 4)
  H: if k' >= n' goto exit
     M[k'] <- 0
     k' <- k' + 4
     goto H

Compare original and result of IVE:

  i <- 0                          k' <- a
  H: if i >= n goto exit          n' <- a + (n * 4)
     j <- i * 4                   H: if k' >= n' goto exit
     k <- j + a                      M[k'] <- 0
     M[k] <- 0                       k' <- k' + 4
     i <- i + 1                      goto H
     goto H

Voila!


What we did
• identified induction variables (i, j, k)
• strength reduction (changed * into +)
• dead-code elimination (j <- j')
• useless-variable elimination (j' <- j' + 4)
  (This can also be done with ADCE)
• loop invariant identification & code-motion
• almost useless-variable elimination (i)
• copy propagation

Is it faster?
• On some hardware, adds are much faster than multiplies
• Furthermore, one fewer value is computed,
  – thus potentially saving a register
  – and decreasing the possibility of spilling

Loop preparation
• Before attempting IVE, it is best to first perform:
  – constant propagation & constant folding
  – copy propagation
  – loop-invariant hoisting

How to do it, step 1
• First, find the basic IVs
  – scan loop body for defs of the form
      x = x + c or x = x - c
    where c is loop-invariant
  – record these basic IVs as x = (x, c, 1)
  – this represents the IV: x = c + x * 1


Representing IVs
• Characterize all induction variables by:
    (base-variable, offset, multiple)
  – where the offset and multiple are loop-invariant
• IOW, after an induction variable is defined it equals:
    offset + multiple * base-variable

How to do it, step 2
• Scan for derived IVs of the form k = c1 + c2 * i
  – where i is a basic IV,
  – this is the only def of k in the loop, and
  – c1 and c2 are loop invariant
• We say k is in the family of i
• Record as k = (i, c1, c2)

How to do it, step 3
• Iterate, looking for derived IVs of the form k = c1 + c2 * j
  – where IV j = (i, a, b), and
  – this is the only def of k in the loop, and
  – there is no def of i between the def of j and the def of k, and
  – c1 and c2 are loop invariant
• Record as k = (i, c1 + a*c2, b*c2)

Simple Example of IVE
  i <- 0
  H: if i >= n goto exit
     j <- i * 4
     k <- j + a
     M[k] <- 0
     i <- i + 1
     goto H

  i: (i, 1, 1)   i.e., i = 1 + 1 * i
  j: (i, 0, 4)   i.e., j = 0 + 4 * i
  k: (i, a, 4)   i.e., k = a + 4 * i

So, j & k are in the family of i
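Step 3 is just composition of affine forms. A minimal sketch (illustrative names, not from the slides):

  /* IV triple (base, offset, multiple): the variable == offset + multiple*base. */
  typedef struct IV { int base; long offset, multiple; } IV;

  /* k = c1 + c2 * j, where j = (i, a, b), i.e. j == a + b*i:
     then k == (c1 + c2*a) + (c2*b)*i, still in the family of i. */
  IV derive(IV j, long c1, long c2) {
      IV k = { j.base, c1 + c2 * j.offset, c2 * j.multiple };
      return k;
  }

  /* Running example (I names the basic IV i; a is a loop-invariant value).
     For composing, treat the basic IV as the identity (i, 0, 1); the
     triple (i, 1, 1) on the slide records its per-iteration recurrence.
         IV iv_i = { I, 0, 1 };          // i == 0 + 1*i
         IV iv_j = derive(iv_i, 0, 4);   // (i, 0, 4): j == 4*i
         IV iv_k = derive(iv_j, a, 1);   // (i, a, 4): k == a + 4*i */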


Finding the IVs
• Maintain three tables: basic & maybe & other
• Find basic IVs: Scan stmts. If var ∉ maybe, and of proper form, put into basic. Otherwise, put var in other and remove from maybe.
• Find compound IVs:
  – If var defined more than once, put into other
  – For all stmts of proper form that use a basic IV, record the defined var as a derived IV in that basic IV’s family

How to do it, step 4
• This is the strength reduction step
• For an induction variable k = (i, c1, c2), i.e. k = c1 + c2 * i
  – initialize k = c1 + c2 * i in the preheader
  – replace k’s def in the loop by k = k + c2 * s, where s is i’s step (just k = k + c2 when i steps by 1)
  – make sure to do this after i’s def
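In compiler terms, step 4 rewrites the IR. A sketch of the shape of that rewrite (the printf-style toy-IR helpers here are assumptions, not a real API):

  /* Assumed toy-IR helpers (illustrative only): */
  void emit_in_preheader(const char *fmt, ...);
  void replace_loop_def(const char *var, const char *fmt, ...);

  /* Strength-reduce one derived IV k = (i, offset, multiple),
     i.e. k == offset + multiple*i, where the basic IV i steps by `step`. */
  void strength_reduce(const char *k, const char *i,
                       long offset, long multiple, long step) {
      /* once, in the preheader: k = offset + multiple * i */
      emit_in_preheader("%s = %ld + %ld * %s", k, offset, multiple, i);
      /* in the loop, after i's def: k = k + multiple*step */
      replace_loop_def(k, "%s = %s + %ld", k, k, multiple * step);
  }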

How to do it, step 5
• This is the comparison rewriting step
• For an induction variable k = (i, ak, bk)
  – If k is used only in its definition and the comparison
  – There exists another variable, j, in the same class that is not “useless”, with j = (i, aj, bj)
• Rewrite k < n as
    j < (bj/bk)(n - ak) + aj
• Note: since they are in the same class:
    (j - aj)/bj = (k - ak)/bk

Notes
• Are the c1, c2 constant, or just invariant?
  – if constant, then you can keep folding them: they’re always a constant even for derived IVs
  – otherwise, they can be expressions of loop-invariant variables
• But if constant, can find IVs of the type
    x = i/b
  and know that it’s legal, if b evenly divides the stride…
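A quick numeric sanity check of the rewrite (illustrative values; with integer arithmetic the rule is only safe when bk evenly divides bj*(n - ak), echoing the divisibility caveat above):

  #include <assert.h>

  int main(void) {
      /* j = aj + bj*i and k = ak + bk*i, both in the family of i */
      long aj = 5, bj = 4, ak = 0, bk = 2, n = 100;
      for (long i = 0; i < 80; i++) {
          long j = aj + bj * i;
          long k = ak + bk * i;
          /* k < n  <=>  j < (bj/bk)*(n - ak) + aj */
          assert((k < n) == (j < (bj * (n - ak)) / bk + aj));
      }
      return 0;
  }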


Is it faster? (2)
• On some hardware, adds are much faster than multiplies
• But… not always a win!
  – Constant multiplies might otherwise be reduced to shifts/adds that result in even better code than IVE
  – Scaling of addresses (i*4) might come for free on your processor’s address modes
• So maybe: only convert i*c1+c2 when c1 is loop invariant but not a constant

Common loop optimizations
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p = i*w + b to p = b, p += w, when w, b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body (requires understanding data dependencies)
• Loop permutation
  – to improve cache memory performance

Dependencies in Loops
• Loop-independent data dependence occurs between accesses in the same loop iteration.
• Loop-carried data dependence occurs between accesses across different loop iterations.
• There is a data dependence between access a at iteration i-k and access b at iteration i when:
  – a and b access the same memory location
  – There is a path from a to b
  – Either a or b is a write

Defining Dependencies
• Flow Dependence     W → R   δf   (true)
• Anti-Dependence     R → W   δa   (false)
• Output Dependence   W → W   δo

S1) a=0;
S2) b=a;
S3) c=a+d+e;
S4) d=b;
S5) b=5+e;


Example Dependencies
S1) a=0;
S2) b=a;
S3) c=a+d+e;
S4) d=b;
S5) b=5+e;

  source  type  target  due to
  S1      δf    S2      a
  S1      δf    S3      a
  S2      δf    S4      b
  S3      δa    S4      d
  S4      δa    S5      b

These are scalar dependencies. The same idea holds for memory accesses.

Data Dependence in Loops
• Dependence can flow across iterations of the loop.
• Dependence information is annotated with iteration information.
• If dependence is across iterations it is loop carried, otherwise loop independent.

for (i=0; i<n; i++) { … }

Unroll Loop to Find Dependencies
for (i=0; i<n; i++) { … }
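A small illustrative loop (not the slides’ example) showing flow, anti, and output dependences, both loop-carried and loop-independent:

  void dep_kinds(double *a, double *b, double *c, int n) {
      double t = 0.0;
      for (int i = 1; i < n - 1; i++) {
          t = a[i-1] * 2.0;     /* output dep (δo): t is rewritten on every
                                   iteration (loop-carried, W -> W) */
          a[i] = t;             /* flow dep (δf): a[i] written here is read
                                   as a[i-1] on the next iteration
                                   (loop-carried, distance 1) */
          c[i] = b[i+1];        /* anti dep (δa): b[i+1] is read here, then
                                   overwritten by the next iteration's
                                   b[i] = ... (loop-carried, distance 1) */
          b[i] = c[i] + 1.0;    /* flow dep (δf) on c[i] from the statement
                                   above: same iteration, loop-independent */
      }
  }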


Iteration Space
Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest.
for (i=0; i<n; i++) { … }

Distance Vector
The distance vector of a dependence is the difference between the target and source iteration vectors.
for (i=0; i<n; i++) { … }

Example of Distance Vectors
for (i=0; i<n; i++) { … }


Data Dependences
Loop carried: between two statement instances in two different iterations of a loop.
Loop independent: between two statement instances in the same loop iteration.
Lexically forward: the source comes before the target.
Lexically backward: otherwise.
The right-hand side of an assignment is considered to precede the left-hand side.

Review of Linear Algebra: Lexicographic Order
Two n-vectors a and b are equal, a = b, if ai = bi, 1 ≤ i ≤ n.
We say that a is less than b, a < b, if ai < bi, 1 ≤ i ≤ n.
We say that a is lexicographically less than b at level j, a «j b, if ai = bi for 1 ≤ i < j, and aj < bj.
We say that a is lexicographically less than b, a « b, if there is a j, 1 ≤ j ≤ n, such that a «j b.
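As a small sketch (illustrative), the level test translates directly into code:

  /* Returns the level u (1-based) such that a «u b, or 0 if a is not
     lexicographically less than b. */
  int lex_less_level(const int *a, const int *b, int n) {
      for (int u = 0; u < n; u++) {
          if (a[u] < b[u]) return u + 1;  /* first difference: a «(u+1) b */
          if (a[u] > b[u]) return 0;      /* b precedes a */
      }
      return 0;                           /* the vectors are equal */
  }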

Lecture 5 15-745 © 2008 67 Lecture 5 15-745 © 2008 68 Lexicographic Order Properties of Lexicographic Order Example of vectors

Consider the vectors a and b below : Let n ≥ 1, and i, j, k denote arbitrary vectors in Rn

n ⎡ 1 ⎤ ⎡ 1 ⎤ 1 For each u in 1 ≤ u ≤ n, the relation «u in R is ⎢−1⎥ ⎢−1⎥ irreflexive and transitive. a = ⎢ ⎥ b = ⎢ ⎥ 2 The n reltislations «u are paiiisrwise disjitdisjoint: ⎢ 0 ⎥ ⎢ 1 ⎥ i « j and i « j imply that u = v. ⎢ ⎥ ⎢ ⎥ u v ⎣ 2 ⎦ ⎣−1⎦ 3 If i ≠ j, there is a unique integer u such that 1 ≤ u ≤ n and exactly one of the following two conditions We say that a is lexicographically less than b holds: i « j or j « i. at level 3, a p 3 b, or simply tha t a p b. u u Both a and b are lexicographically positive 4 i «u j and j «v k together imply that i «w k, where w = min (u,v). because 0 p a, and 0 p b.


Data Dependence in Loops: An Example
Find the dependence relations due to the array X in the program below:
  (S1) for i = 2 to 9 do
  (S2)   X[i] = Y[i] + Z[i]
  (S3)   A[i] = X[i-1] + 1
  (S4) end for
Solution: To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others:
  i = 2:  X[2]=Y[2]+Z[2]   A[2]=X[1]+1
  i = 3:  X[3]=Y[3]+Z[3]   A[3]=X[2]+1
  i = 4:  X[4]=Y[4]+Z[4]   A[4]=X[3]+1
There is a loop-carried, lexically forward, flow dependence from S2 to S3.

Data Dependence in Loops
Data dependence graph for statements in a loop:
[Figure: S2 → S3 with edge label (1,3)]
(1,3) := iteration distance is 1, latency is 3.
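For single-index affine subscripts, the iteration distance can be computed directly. A minimal sketch (illustrative, not from the slides):

  #include <stdio.h>

  /* Write touches X[a*i + b1], read touches X[a*i + b2] (same coefficient a).
     The dependence distance is (b1 - b2)/a when that divides evenly. */
  int distance(long a, long b1, long b2, long *dist) {
      if (a == 0 || (b1 - b2) % a != 0) return 0;   /* not handled / none */
      *dist = (b1 - b2) / a;
      return 1;
  }

  int main(void) {
      long d;
      /* S2 writes X[i] (b1 = 0); S3 reads X[i-1] (b2 = -1):
         the value written at iteration i is read 1 iteration later. */
      if (distance(1, 0, -1, &d))
          printf("distance = %ld\n", d);            /* prints 1 */
      return 0;
  }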

(Examples from Wolfe)