
15-745 Lecture 5: Natural Loops, Classical Loop Optimizations, Dependencies

Loops are Key
• Loops are extremely important – the “90-10” rule
• Loop optimization involves
  – understanding control-flow structure
  – understanding data-dependence information
  – sensitivity to side-effecting operations
  – extra care in some transformations such as register spilling

Copyright © Seth Goldstein, 2008

Based on slides from Lee & Callahan

Common loop optimizations
(Scalar opts, DF analysis, control flow analysis)
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p = i*w + b to p = b, p += w, when w, b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body (requires understanding data dependencies)
• Loop permutation
  – to improve cache memory performance

Finding Loops
• To optimize loops, we need to find them!
• Could use source language loop information in the abstract syntax tree…
• BUT:
  – There are multiple source loop constructs: for, while, do-while, even goto
  – Want IR to support different languages
  – Ideally, we want a single concept of a loop so all have same analysis, same optimizations
  – Solution: dismantle source-level constructs, then re-find loops from fundamentals

Finding Loops
• To optimize loops, we need to find them!
• Specifically:
  – loop-header node(s)
    • nodes in a loop that have immediate predecessors not in the loop
  – back edge(s)
    • control-flow edges to previously executed nodes
  – all nodes in the loop body

Control-flow analysis
• Many languages have goto and other complex control, so loops can be hard to find in general
• Determining the control structure of a program is called control-flow analysis
• Based on the notion of dominators


Dominators
• a dom b
  – node a dominates b if every possible path from entry to b includes a
• a sdom b
  – a strictly dominates b if a dom b and a != b
• a idom b
  – a immediately dominates b if a sdom b, AND there is no c such that a sdom c and c sdom b

Back edges and loop headers
• A control-flow edge from node B3 to B2 is a back edge if B2 dom B3
• Furthermore, in that case node B2 is a loop header
[Figure: a CFG with blocks B1–B6 and a back edge B3 → B2; B2 dominates B3, so B2 is the loop header]
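Dominator sets can be computed as a forward data-flow fixed point. A minimal sketch in C (not from the slides; node 0 as entry, and MAXN, npred, and pred as the CFG representation are illustrative assumptions):

  #define MAXN 32
  typedef unsigned int BitSet;              /* one bit per CFG node */

  int nnodes;                               /* number of CFG nodes */
  int npred[MAXN];                          /* in-degree of each node */
  int pred[MAXN][MAXN];                     /* pred[v][i]: i-th predecessor of v */
  BitSet dom[MAXN];                         /* dom[v]: set of dominators of v */

  static BitSet all_nodes(void) {
      return (nnodes >= 32) ? ~0u : ((1u << nnodes) - 1u);
  }

  void compute_dominators(void) {
      /* entry dominates only itself; initialize the rest to "all nodes" */
      dom[0] = 1u;
      for (int v = 1; v < nnodes; v++)
          dom[v] = all_nodes();

      /* fixed point of: dom(v) = {v} union intersection of dom(p) over preds p */
      for (int changed = 1; changed; ) {
          changed = 0;
          for (int v = 1; v < nnodes; v++) {
              BitSet d = all_nodes();
              for (int i = 0; i < npred[v]; i++)
                  d &= dom[pred[v][i]];
              d |= 1u << v;
              if (d != dom[v]) { dom[v] = d; changed = 1; }
          }
      }
  }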

Natural loop
• Consider a back edge from node n to node h
• The natural loop of n→h is the set of nodes L such that for all x∈L:
  – h dom x and
  – there is a path from x to n not containing h

Examples
Simple example: a for loop (often it’s more complicated, since a for loop found in the source might need an if/then guard)
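A common way to collect the natural loop of a back edge n → h is a backward walk from n that never crosses h. A minimal sketch (illustrative, reusing the CFG arrays from the dominator sketch above):

  #include <string.h>

  int in_loop[MAXN];   /* in_loop[v] != 0 iff v is in the natural loop */

  /* backward DFS over predecessor edges, stopping at marked nodes */
  static void mark_backward(int v) {
      if (in_loop[v]) return;
      in_loop[v] = 1;
      for (int i = 0; i < npred[v]; i++)
          mark_backward(pred[v][i]);
  }

  /* natural_loop(n, h): marks {h} plus every node that can reach n
     without passing through h */
  void natural_loop(int n, int h) {
      memset(in_loop, 0, sizeof in_loop);
      in_loop[h] = 1;   /* pre-mark the header so the walk stops there */
      mark_backward(n);
  }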


Examples
Try this:
  for (..) {
    if (..) {
      …
    } else {
      …
      if (x) {
        e;
        break;
      }
    }
  }
(CFG nodes labeled a through e)

Examples
  for (..) {
    if (..) {
      …
    } else {
      …
      if (x) {
        e;
        break;
      }
    }
  }
The block containing e is lexically in the loop, but not in the natural loop; this is another reason why CFG analysis is preferred over source/AST loops.


Examples
• Yes, it can happen in C
[Figure: an irreducible CFG that real C code can produce]

Natural Loops
One loop per header. What are the natural loops?
[Figures: small CFGs over nodes 0–3; the natural loops shown are {0,1,2}, {}, {1,2,3}, {1,2}, and {0,1,2,3}]

Nested Loops
• Unless two natural loops have the same header, they are either disjoint or nested within each other
• If A and B are loops (sets of blocks) with headers a and b such that a ≠ b and b ∈ A
  – B ⊂ A
  – loop B is nested within A
  – B is the inner loop
• Can compute the loop-nest tree

General Loops
• A more general looping structure is a strongly connected component of the control flow graph
  – subgraph (N_scc, E_scc) such that every node in N_scc is reachable from every other node using only edges in E_scc
[Figure: CFG 0 → 1, 1 ↔ 2; the SCC is {1,2}]
Not a very useful definition of a loop


Reducible Flow Graphs
There is a special class of flow graphs, called reducible flow graphs, for which several code optimizations are especially easy to perform.
In reducible flow graphs, loops are unambiguously defined and dominators can be efficiently computed.

Reducible flow graphs
Definition: A flow graph G is reducible iff we can partition the edges into two disjoint groups, forward edges and back edges, with the following two properties:
1. The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
2. The back edges consist only of edges whose heads dominate their tails.

Why isn’t this reducible?
[Figure: CFG with nodes 0, 1, 2 and a cycle between 1 and 2]
This flow graph has no back edges (neither edge of the 1–2 cycle has a head that dominates its tail). Thus, it would be reducible only if the entire graph were acyclic, which is not the case.

Alternative definition
• Definition: A flow graph G is reducible if we can repeatedly collapse (reduce) together blocks (x,y) where x is the only predecessor of y (ignoring self loops) until we are left with a single node
[Figure: a four-node CFG collapsing step by step: {0},{1},{2},{3} → {0},{1,2},{3} → {0},{1,2,3} → {0,1,2,3}]

Properties of Reducible Flow Graphs
• In a reducible flow graph, all loops are natural loops
• Can use DFS to find loops
• Many analyses are more efficient
  – polynomial versus exponential
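The collapse test is easy to prototype. A minimal sketch in C (not from the slides; the adjacency matrix, node limit, and entry node 0 are illustrative assumptions):

  #define NMAX 16
  int edge[NMAX][NMAX];   /* edge[u][v] != 0 iff there is an edge u -> v */

  /* Returns 1 iff nodes 0..n-1 collapse to a single node by repeatedly
     merging a node y (never the entry, node 0) into its unique
     predecessor x, ignoring self loops. */
  int is_reducible(int n) {
      int alive[NMAX], nalive = n, progress = 1;
      for (int v = 0; v < n; v++) alive[v] = 1;
      while (nalive > 1 && progress) {
          progress = 0;
          for (int y = 1; y < n; y++) {
              if (!alive[y]) continue;
              int x = -1, preds = 0;
              for (int u = 0; u < n; u++)          /* count y's predecessors */
                  if (alive[u] && u != y && edge[u][y]) { x = u; preds++; }
              if (preds != 1) continue;
              for (int v = 0; v < n; v++)          /* x inherits y's out-edges */
                  if (alive[v] && v != x && edge[y][v]) edge[x][v] = 1;
              for (int u = 0; u < n; u++)          /* erase y from the graph */
                  edge[u][y] = edge[y][u] = 0;
              alive[y] = 0; nalive--; progress = 1;
          }
      }
      return nalive == 1;
  }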


Good News
• Most flow graphs are reducible
• Languages prohibit irreducibility
  – goto-free C
  – Java
• Programmers usually don’t use such constructs even if they’re available
  – >90% of old code reducible

Dealing with Irreducibility
• Don’t
• Can split nodes and duplicate code to get reducible graph
  – possible exponential blowup
• Other techniques…
[Figure: node splitting: the irreducible CFG 0, 1, 2 becomes reducible with a duplicated node 2a]

Loop optimizations: Hoisting of loop-invariant computations

Loop-invariant computations
• A definition
    t = x op y
  in a loop is (conservatively) loop-invariant if
  – x and y are constants, or
  – all reaching definitions of x and y are outside the loop, or
  – only one definition reaches x (or y), and that definition is loop-invariant
• so keep marking iteratively
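The “keep marking iteratively” rule is a small fixed point. A minimal sketch over a toy three-address IR (the Instr type and the helper queries are illustrative assumptions, not from the slides):

  typedef struct Instr {
      const char *dst, *src1, *src2;   /* dst = src1 op src2 */
      int invariant;                   /* the mark */
  } Instr;

  /* Assumed helpers backed by reaching-definitions information: */
  int is_constant(const char *operand);
  int all_defs_outside_loop(const char *operand);
  int sole_reaching_def(const char *operand, Instr **def);  /* unique in-loop def? */

  static int operand_invariant(const char *op) {
      Instr *d;
      if (is_constant(op)) return 1;
      if (all_defs_outside_loop(op)) return 1;
      return sole_reaching_def(op, &d) && d->invariant;
  }

  void mark_invariants(Instr *loop[], int n) {
      for (int changed = 1; changed; ) {   /* iterate until no new marks */
          changed = 0;
          for (int i = 0; i < n; i++) {
              if (loop[i]->invariant) continue;
              if (operand_invariant(loop[i]->src1) &&
                  operand_invariant(loop[i]->src2)) {
                  loop[i]->invariant = 1;
                  changed = 1;
              }
          }
      }
  }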


Loop-invariant computations
• Be careful:
    t = expr;
    for (..) {
      s = t * 2;
      t = loop_invariant_expr;
      x = t + 2;
      …
    }
• Even though t’s two reaching definitions are each invariant, s is not invariant…

Hoisting
• In order to “hoist” a loop-invariant computation out of a loop, we need a place to put it
• We could copy it to all immediate predecessors (except along the back edge) of the loop header...
• ...But we can avoid code duplication by inserting a new block, called the pre-header

Hoisting
[Figure: pre-headers inserted: a new block A’ is added before loop header A, and B’ before loop header B; hoisted code goes in the pre-header]

Hoisting conditions
• For a loop-invariant definition
    d: t = x op y
• we can hoist d into the loop’s pre-header only if
  1. d’s block dominates all loop exits at which t is live-out, and
  2. there is only one def of t in the loop, and
  3. t is not live-out of the pre-header

We need to be careful...
• All hoisting conditions must be satisfied!

  OK:                     violates 1,3:            violates 2:
  L0: t = 0               L0: t = 0                L0: t = 0
  L1: i = i + 1           L1: if i >= N goto L2    L1: i = i + 1
      t = a * b               i = i + 1                t = a * b
      M[i] = t                t = a * b                M[i] = t
      if i < N goto L1        M[i] = t                 t = 0
  L2: x = t                   goto L1                  if i < N goto L1
                          L2: x = t                L2: x = t

We need to be careful...
• All hoisting conditions must be satisfied!
• In the “violates 2” program above, the second def of t (t = 0) inside the loop also reaches a use of t, so hoisting t = a * b would change which value those uses see.

Loop optimizations: Induction-variable elimination

The idea of IVE
• Suppose we have a loop variable
  – i initially 0; each iteration i = i + 1
• and a variable x that depends linearly on it, x = i * c1 + c2, with c1 and c2 loop-invariant; then instead of recomputing x each iteration:
  – initialize x = i0 * c1 + c2 (execute once)
  – increment x by c1 each iteration

Simple Example of IVE
  for i = 0 to n …

  i <- 0
  H: if i >= n goto exit
     j <- i * 4
     k <- j + a
     M[k] <- 0
     i <- i + 1
     goto H

Clearly, j & k do not need to be computed anew each time, since they are related to i and i changes linearly.
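For a concrete picture of the end-to-end effect, here is the same transformation written in C (an illustrative sketch, not from the slides; it zeroes an int array at address a, assuming 4-byte ints as the slides do):

  /* Before IVE: j and k are recomputed from i on every iteration. */
  void zero_before(int *a, int n) {
      for (int i = 0; i < n; i++) {
          int j = i * 4;                 /* derived IV: j = 4*i     */
          char *k = (char *)a + j;       /* derived IV: k = a + 4*i */
          *(int *)k = 0;
      }
  }

  /* After IVE and comparison rewriting: i and j are gone,
     k advances by 4 bytes per iteration. */
  void zero_after(int *a, int n) {
      char *k = (char *)a;
      char *limit = (char *)a + (long)n * 4;   /* n' = a + n*4 */
      while (k < limit) {
          *(int *)k = 0;
          k += 4;                        /* strength-reduced: one add */
      }
  }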

Simple Example of IVE
Introduce j' and k' that track i*4 and i*4 + a incrementally:

  i <- 0
  j' <- 0
  k' <- a
  H: if i >= n goto exit
     j <- j'
     k <- k'
     M[k] <- 0
     i <- i + 1
     j' <- j' + 4
     k' <- k' + 4
     goto H

But, then we don't even need j (or j'):

  i <- 0
  k' <- a
  H: if i >= n goto exit
     k <- k'
     M[k] <- 0
     i <- i + 1
     k' <- k' + 4
     goto H

Do we need i?


Simple Example of IVE
Rewrite the comparison: since k' = a + i*4, the test i >= n can become k' >= a+(n*4), and then i is no longer needed:

  k' <- a
  H: if k' >= a+(n*4) goto exit
     k <- k'
     M[k] <- 0
     k' <- k' + 4
     goto H

But, a+(n*4) is loop-invariant! Invariant code motion on a+(n*4):

  k' <- a
  n' <- a + (n * 4)
  H: if k' >= n' goto exit
     k <- k'
     M[k] <- 0
     k' <- k' + 4
     goto H

Now, we do copy propagation and eliminate k:

  k' <- a
  n' <- a + (n * 4)
  H: if k' >= n' goto exit
     M[k'] <- 0
     k' <- k' + 4
     goto H

Compare original and result of IVE:

  i <- 0                          k' <- a
  H: if i >= n goto exit          n' <- a + (n * 4)
     j <- i * 4                   H: if k' >= n' goto exit
     k <- j + a                      M[k'] <- 0
     M[k] <- 0                       k' <- k' + 4
     i <- i + 1                      goto H
     goto H

Voila!


What we did
• identified induction variables (i, j, k)
• strength reduction (changed * into +)
• dead-code elimination (j <- j')
• useless-variable elimination (j' <- j' + 4)
  (This can also be done with ADCE)
• loop invariant identification & code-motion
• almost useless-variable elimination (i)
• copy propagation

Is it faster?
• On some hardware, adds are much faster than multiplies
• Furthermore, one fewer value is computed,
  – thus potentially saving a register
  – and decreasing the possibility of spilling

Loop preparation
• Before attempting IVE, it is best to first perform:
  – constant propagation & constant folding
  – copy propagation
  – loop-invariant hoisting

How to do it, step 1
• First, find the basic IVs
  – scan loop body for defs of the form
      x = x + c or x = x - c
    where c is loop-invariant
  – record these basic IVs as x = (x, c, 1)
  – this represents the IV: x = c + x * 1


Representing IVs
• Characterize all induction variables by:
    (base-variable, offset, multiple)
  – where the offset and multiple are loop-invariant
• IOW, after an induction variable is defined it equals:
    offset + multiple * base-variable

How to do it, step 2
• Scan for derived IVs of the form k = c1 + c2 * i
  – where i is a basic IV,
  – this is the only def of k in the loop, and
  – c1 and c2 are loop invariant
• We say k is in the family of i
• Record as k = (i, c1, c2)

How to do it, step 3
• Iterate, looking for derived IVs of the form k = c1 + c2 * j
  – where IV j = (i, a, b), and
  – this is the only def of k in the loop, and
  – there is no def of i between the def of j and the def of k, and
  – c1 and c2 are loop invariant
• Record as k = (i, c1 + a*c2, b*c2)

Simple Example of IVE
  i <- 0
  H: if i >= n goto exit
     j <- i * 4
     k <- j + a
     M[k] <- 0
     i <- i + 1
     goto H

  i: (i, 1, 1)   i.e., i = 1 + 1 * i
  j: (i, 0, 4)   i.e., j = 0 + 4 * i
  k: (i, a, 4)   i.e., k = a + 4 * i

So, j & k are in the family of i
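Step 3 is just composition of affine forms. A minimal sketch (illustrative names, not from the slides):

  /* IV triple (base, offset, multiple): the variable == offset + multiple*base. */
  typedef struct IV { int base; long offset, multiple; } IV;

  /* k = c1 + c2 * j, where j = (i, a, b), i.e. j == a + b*i:
     then k == (c1 + c2*a) + (c2*b)*i, still in the family of i. */
  IV derive(IV j, long c1, long c2) {
      IV k = { j.base, c1 + c2 * j.offset, c2 * j.multiple };
      return k;
  }

  /* Running example (I names the basic IV i; a is a loop-invariant value).
     For composing, treat the basic IV as the identity (i, 0, 1); the
     triple (i, 1, 1) on the slide records its per-iteration recurrence.
         IV iv_i = { I, 0, 1 };          // i == 0 + 1*i
         IV iv_j = derive(iv_i, 0, 4);   // (i, 0, 4): j == 4*i
         IV iv_k = derive(iv_j, a, 1);   // (i, a, 4): k == a + 4*i */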


Finding the IVs
• Maintain three tables: basic & maybe & other
• Find basic IVs: Scan stmts. If var ∉ maybe, and of proper form, put into basic. Otherwise, put var in other and remove from maybe.
• Find compound IVs:
  – If var defined more than once, put into other
  – For all stmts of proper form that use a basic IV, record the defined var as a derived IV in that basic IV’s family

How to do it, step 4
• This is the strength reduction step
• For an induction variable k = (i, c1, c2), i.e. k = c1 + c2 * i
  – initialize k = c1 + c2 * i in the preheader
  – replace k’s def in the loop by k = k + c2 * s, where s is i’s step (just k = k + c2 when i steps by 1)
  – make sure to do this after i’s def
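In compiler terms, step 4 rewrites the IR. A sketch of the shape of that rewrite (the printf-style toy-IR helpers here are assumptions, not a real API):

  /* Assumed toy-IR helpers (illustrative only): */
  void emit_in_preheader(const char *fmt, ...);
  void replace_loop_def(const char *var, const char *fmt, ...);

  /* Strength-reduce one derived IV k = (i, offset, multiple),
     i.e. k == offset + multiple*i, where the basic IV i steps by `step`. */
  void strength_reduce(const char *k, const char *i,
                       long offset, long multiple, long step) {
      /* once, in the preheader: k = offset + multiple * i */
      emit_in_preheader("%s = %ld + %ld * %s", k, offset, multiple, i);
      /* in the loop, after i's def: k = k + multiple*step */
      replace_loop_def(k, "%s = %s + %ld", k, k, multiple * step);
  }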

How to do it, step 5
• This is the comparison rewriting step
• For an induction variable k = (i, ak, bk)
  – If k is used only in its definition and the comparison
  – There exists another variable, j, in the same class that is not “useless”, with j = (i, aj, bj)
• Rewrite k < n as
    j < (bj/bk)(n - ak) + aj
• Note: since they are in the same class:
    (j - aj)/bj = (k - ak)/bk

Notes
• Are the c1, c2 constant, or just invariant?
  – if constant, then you can keep folding them: they’re always a constant even for derived IVs
  – otherwise, they can be expressions of loop-invariant variables
• But if constant, can find IVs of the type
    x = i/b
  and know that it’s legal, if b evenly divides the stride…
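A quick numeric sanity check of the rewrite (illustrative values; with integer arithmetic the rule is only safe when bk evenly divides bj*(n - ak), echoing the divisibility caveat above):

  #include <assert.h>

  int main(void) {
      /* j = aj + bj*i and k = ak + bk*i, both in the family of i */
      long aj = 5, bj = 4, ak = 0, bk = 2, n = 100;
      for (long i = 0; i < 80; i++) {
          long j = aj + bj * i;
          long k = ak + bk * i;
          /* k < n  <=>  j < (bj/bk)*(n - ak) + aj */
          assert((k < n) == (j < (bj * (n - ak)) / bk + aj));
      }
      return 0;
  }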


Is it faster? (2)
• On some hardware, adds are much faster than multiplies
• But… not always a win!
  – Constant multiplies might otherwise be reduced to shifts/adds that result in even better code than IVE
  – Scaling of addresses (i*4) might come for free on your processor’s address modes
• So maybe: only convert i*c1+c2 when c1 is loop invariant but not a constant

Common loop optimizations
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p = i*w + b to p = b, p += w, when w, b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body (requires understanding data dependencies)
• Loop permutation
  – to improve cache memory performance

Dependencies in Loops
• Loop-independent data dependence occurs between accesses in the same loop iteration.
• Loop-carried data dependence occurs between accesses across different loop iterations.
• There is a data dependence between access a at iteration i-k and access b at iteration i when:
  – a and b access the same memory location
  – There is a path from a to b
  – Either a or b is a write

Defining Dependencies
• Flow Dependence     W → R   δf   (true)
• Anti-Dependence     R → W   δa   (false)
• Output Dependence   W → W   δo

S1) a=0;
S2) b=a;
S3) c=a+d+e;
S4) d=b;
S5) b=5+e;


Example Dependencies
S1) a=0;
S2) b=a;
S3) c=a+d+e;
S4) d=b;
S5) b=5+e;

  source  type  target  due to
  S1      δf    S2      a
  S1      δf    S3      a
  S2      δf    S4      b
  S3      δa    S4      d
  S4      δa    S5      b

These are scalar dependencies. The same idea holds for memory accesses.

Data Dependence in Loops
• Dependence can flow across iterations of the loop.
• Dependence information is annotated with iteration information.
• If dependence is across iterations it is loop carried, otherwise loop independent.

for (i=0; i<n; i++) { … }

Unroll Loop to Find Dependencies
for (i=0; i<n; i++) { … }
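A small illustrative loop (not the slides’ example) showing flow, anti, and output dependences, both loop-carried and loop-independent:

  void dep_kinds(double *a, double *b, double *c, int n) {
      double t = 0.0;
      for (int i = 1; i < n - 1; i++) {
          t = a[i-1] * 2.0;     /* output dep (δo): t is rewritten on every
                                   iteration (loop-carried, W -> W) */
          a[i] = t;             /* flow dep (δf): a[i] written here is read
                                   as a[i-1] on the next iteration
                                   (loop-carried, distance 1) */
          c[i] = b[i+1];        /* anti dep (δa): b[i+1] is read here, then
                                   overwritten by the next iteration's
                                   b[i] = ... (loop-carried, distance 1) */
          b[i] = c[i] + 1.0;    /* flow dep (δf) on c[i] from the statement
                                   above: same iteration, loop-independent */
      }
  }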


Iteration Space
Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest.
for (i=0; i<n; i++) { … }

Distance Vector
The distance vector of a dependence is the difference between the target and source iteration vectors.
for (i=0; i<n; i++) { … }

Example of Distance Vectors
for (i=0; i<n; i++) { … }


Data Dependences
Loop carried: between two statement instances in two different iterations of a loop.
Loop independent: between two statement instances in the same loop iteration.
Lexically forward: the source comes before the target.
Lexically backward: otherwise.
The right-hand side of an assignment is considered to precede the left-hand side.

Review of Linear Algebra: Lexicographic Order
Two n-vectors a and b are equal, a = b, if ai = bi, 1 ≤ i ≤ n.
We say that a is less than b, a < b, if ai < bi, 1 ≤ i ≤ n.
We say that a is lexicographically less than b at level j, a «j b, if ai = bi for 1 ≤ i < j, and aj < bj.
We say that a is lexicographically less than b, a « b, if there is a j, 1 ≤ j ≤ n, such that a «j b.
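As a small sketch (illustrative), the level test translates directly into code:

  /* Returns the level u (1-based) such that a «u b, or 0 if a is not
     lexicographically less than b. */
  int lex_less_level(const int *a, const int *b, int n) {
      for (int u = 0; u < n; u++) {
          if (a[u] < b[u]) return u + 1;  /* first difference: a «(u+1) b */
          if (a[u] > b[u]) return 0;      /* b precedes a */
      }
      return 0;                           /* the vectors are equal */
  }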

Lecture 5 15-745 © 2008 67 Lecture 5 15-745 © 2008 68 Lexicographic Order Properties of Lexicographic Order Example of vectors

Consider the vectors a and b below : Let n ≥ 1, and i, j, k denote arbitrary vectors in Rn

n ⎡ 1 ⎤ ⎡ 1 ⎤ 1 For each u in 1 ≤ u ≤ n, the relation «u in R is ⎢−1⎥ ⎢−1⎥ irreflexive and transitive. a = ⎢ ⎥ b = ⎢ ⎥ 2 The n reltislations «u are paiiisrwise disjitdisjoint: ⎢ 0 ⎥ ⎢ 1 ⎥ i « j and i « j imply that u = v. ⎢ ⎥ ⎢ ⎥ u v ⎣ 2 ⎦ ⎣−1⎦ 3 If i ≠ j, there is a unique integer u such that 1 ≤ u ≤ n and exactly one of the following two conditions We say that a is lexicographically less than b holds: i « j or j « i. at level 3, a p 3 b, or simply tha t a p b. u u Both a and b are lexicographically positive 4 i «u j and j «v k together imply that i «w k, where w = min (u,v). because 0 p a, and 0 p b.


Data Dependence in Loops: An Example
Find the dependence relations due to the array X in the program below:
  (S1) for i = 2 to 9 do
  (S2)   X[i] = Y[i] + Z[i]
  (S3)   A[i] = X[i-1] + 1
  (S4) end for
Solution: To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others:
  i = 2:  X[2]=Y[2]+Z[2]   A[2]=X[1]+1
  i = 3:  X[3]=Y[3]+Z[3]   A[3]=X[2]+1
  i = 4:  X[4]=Y[4]+Z[4]   A[4]=X[3]+1
There is a loop-carried, lexically forward, flow dependence from S2 to S3.

Data Dependence in Loops
Data dependence graph for statements in a loop:
[Figure: S2 → S3 with edge label (1,3)]
(1,3) := iteration distance is 1, latency is 3.
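For single-index affine subscripts, the iteration distance can be computed directly. A minimal sketch (illustrative, not from the slides):

  #include <stdio.h>

  /* Write touches X[a*i + b1], read touches X[a*i + b2] (same coefficient a).
     The dependence distance is (b1 - b2)/a when that divides evenly. */
  int distance(long a, long b1, long b2, long *dist) {
      if (a == 0 || (b1 - b2) % a != 0) return 0;   /* not handled / none */
      *dist = (b1 - b2) / a;
      return 1;
  }

  int main(void) {
      long d;
      /* S2 writes X[i] (b1 = 0); S3 reads X[i-1] (b2 = -1):
         the value written at iteration i is read 1 iteration later. */
      if (distance(1, 0, -1, &d))
          printf("distance = %ld\n", d);            /* prints 1 */
      return 0;
  }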

(Examples from Wolfe)