
15-745 Optimizing Compilers, Spring 2007

School of Computer Science

Compiler Backend: Intermediate Representations

The compiler pipeline:

    Source Code → Front End → IR → Middle End → IR → Back End → Target Code

– Front end: produces an intermediate representation (IR)
– Middle end: transforms the IR into an equivalent IR that runs more efficiently; usually consists of several passes
– Back end: transforms the IR into native code

The IR encodes the compiler's knowledge of the program. Within the back end, the instruction selector turns IR into Assem, the register allocator produces a TempMap, and the instruction scheduler orders the Assem.


Intermediate Representations

Decisions in IR design affect the speed and efficiency of the compiler. Some important IR properties:
– Ease of generation
– Ease of manipulation
– Procedure size
– Freedom of expression
– Level of abstraction

The importance of different properties varies between compilers, so selecting an appropriate IR for a compiler is critical.

Types of Intermediate Representations

Structural
– Graphically oriented; heavily used in source-to-source translators
– Examples: trees, DAGs
– Tend to be large

Linear
– Pseudo-code for an abstract machine; level of abstraction varies
– Simple, compact data structures; easier to rearrange
– Examples: 3-address code, stack code

Hybrid
– Combination of graphs and linear code
– Example: control-flow graph


Level of Abstraction

The level of detail exposed in an IR influences the profitability and feasibility of different optimizations. Consider two different representations of an array reference A[i,j].

High-level AST: subscript(A, i, j)

Low-level linear code:

    loadI 1      => r1
    sub   rj, r1 => r2
    loadI 10     => r3
    mult  r2, r3 => r4
    sub   ri, r1 => r5
    add   r4, r5 => r6
    loadI @A     => r7
    add   r7, r6 => r8
    load  r8     => rAij

Structural IRs are usually considered high-level and linear IRs low-level, but this is not necessarily true: a low-level AST can spell out the address arithmetic, and high-level linear code (e.g., loadArray A,i,j) also exists.
– High-level AST: good for memory disambiguation
– Low-level linear code: good for address calculation
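The low-level sequence can be checked directly. A minimal sketch (the function name, the 1-based indexing, and the dimension of 10 are read off the code above, not stated in the slides):

```python
def array_address(base, i, j, dim=10):
    """Mirror the low-level linear code for the address of A[i,j]."""
    r1 = 1
    r2 = j - r1        # sub   rj, r1 => r2
    r3 = dim           # loadI 10     => r3
    r4 = r2 * r3       # mult  r2, r3 => r4
    r5 = i - r1        # sub   ri, r1 => r5
    r6 = r4 + r5       # add   r4, r5 => r6
    r7 = base          # loadI @A     => r7
    return r7 + r6     # add   r7, r6 => r8

# A[1,1] falls at the base address; bumping j moves by a whole dimension.
assert array_address(1000, 1, 1) == 1000
assert array_address(1000, 1, 2) == 1010
```

The point of the high-level form is that none of this arithmetic is visible, so the optimizer can still reason about which array element is touched.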

Abstract Syntax Tree (Structural IR)

An abstract syntax tree (AST) is the procedure's parse tree with the nodes for most non-terminal symbols removed. For z ← x - 2 * y, the right-hand side is the tree -(x, *(2, y)).

If the same expression appears twice in the source, it appears twice in the AST: the AST encodes redundancy. When is an AST a good IR?

Directed Acyclic Graph (Structural IR)

A directed acyclic graph (DAG) is an AST with a unique node for each value. For the pair

    z ← x - 2 * y
    w ← x / 2

the DAG has a single node for x (and for 2), shared by both assignments. A DAG makes sharing explicit: the same expression appearing twice means that the compiler might arrange to evaluate it just once!
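One common way to build such a DAG is hash-consing: intern each (operator, operand-ids) tuple so that identical subexpressions map to the same node. A sketch (class and method names are mine, not from the slides):

```python
class DagBuilder:
    """Intern (op, operand ids) so each distinct value has a unique node."""
    def __init__(self):
        self.ids = {}                    # (op, child ids...) -> node id

    def node(self, op, *kids):
        key = (op,) + kids
        if key not in self.ids:          # create only if not seen before
            self.ids[key] = len(self.ids)
        return self.ids[key]

b = DagBuilder()
x, two, y = b.node('x'), b.node('2'), b.node('y')
b.node('-', x, b.node('*', two, y))      # z <- x - 2 * y
b.node('/', x, two)                      # w <- x / 2
# x and 2 each get one node: 6 nodes total (x, 2, y, *, -, /)
assert len(b.ids) == 6
```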


Stack Machine Code (Linear IR)

Originally used for stack-based computers, now Java (JVM bytecode). Example: x - 2 * y becomes

    push x
    push 2
    push y
    multiply
    subtract

Advantages:
– Compact form: introduced names are implicit, not explicit (implicit names take up no space, where explicit ones do!)
– Simple to generate and execute code
– Useful where code is transmitted over slow communication links (the net)

Pegasus IR

Pegasus (Predicated Explicit GAted Simple Uniform SSA) is a structural IR used in CASH (Compiler for Application-Specific Hardware).
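The semantics of the stack code are easy to pin down with a toy interpreter (a sketch; the opcode spelling and the environment argument are my choices):

```python
def run(code, env):
    """Evaluate postfix stack code against an environment of variable values."""
    stack = []
    for op, *args in code:
        if op == 'push':
            v = args[0]
            stack.append(env.get(v, v))      # a variable name, or a literal
        elif op == 'multiply':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == 'subtract':
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
    return stack[-1]

prog = [('push', 'x'), ('push', 2), ('push', 'y'),
        ('multiply',), ('subtract',)]
assert run(prog, {'x': 10, 'y': 3}) == 4     # x - 2*y = 10 - 6
```

Note that the operands of multiply and subtract never appear in the code: the operand stack supplies them, which is exactly why the form is so compact.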


Three-Address Code (Linear IR)

Three-address code has statements of the form x ← y op z, with one operator (op) and at most three names (x, y, and z). Example: z ← x - 2 * y becomes

    t ← 2 * y
    z ← x - t

Advantages:
– Resembles many machines (RISC)
– Compact form

Two-Address Code (Linear IR)

Two-address code allows statements of the form x ← x op y, with one operator (op) and at most two names (x and y). Example: z ← x - 2 * y becomes

    t1 ← 2
    t2 ← load y
    t2 ← t2 * t1
    z  ← load x
    z  ← z - t2

– Can be very compact
– Destructive operations make reuse hard
– Good model for machines with destructive operations (x86)
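Generating three-address code from an expression tree is a single post-order walk that mints one temporary per operator. A sketch (the tuple encoding and helper names are mine):

```python
import itertools

def to_tac(expr):
    """Flatten a tuple-encoded expression tree into x <- y op z statements."""
    code, counter = [], itertools.count(1)
    def gen(node):
        if not isinstance(node, tuple):       # a name or a constant
            return str(node)
        op, left, right = node
        a, b = gen(left), gen(right)
        t = f"t{next(counter)}"               # fresh temporary for this op
        code.append(f"{t} <- {a} {op} {b}")
        return t
    gen(expr)
    return code

# z <- x - 2 * y
assert to_tac(('-', 'x', ('*', 2, 'y'))) == ['t1 <- 2 * y', 't2 <- x - t1']
```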


Control-Flow Graph (Hybrid IR)

A control-flow graph (CFG) models the transfer of control in the procedure.
– Nodes in the graph are basic blocks: maximal-length sequences of straight-line code, held in either a linear or a structural representation
– Edges in the graph represent control flow

Example: if (x = y) branches to one of two blocks (a ← 2; b ← 5, or a ← 3; b ← 4), and both flow into a join block computing a * b.

Using Multiple Representations

    Source Code → Front End → IR 1 → Middle End → IR 2 → Middle End → IR 3 → Back End → Target Code

Repeatedly lower the level of the intermediate representation; each intermediate representation is suited towards certain optimizations.
– Example: the Open64 compiler's WHIRL intermediate format consists of 5 different IRs that are progressively more detailed
– gcc also lowers through multiple IRs, but is not explicit about it :-(

Instruction Selection

The instruction selector maps IR to Assem using IR → Assem templates.

Instruction Selection Example

Suppose we have

    MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))

We can generate the x86 code

    movl (,%s,c), %r

if c = 1, 2, or 4; otherwise

    imull $c, %s, %r
    movl  (%r), %r


Selection Dependencies

We can see that the selection of instructions can depend on the constants. The context of an IR expression can also affect the choice of instruction.

Example, cont'd

For

    MEM(BINOP(TIMES, TEMP s, CONST c))

we might sometimes want to generate

    testl $0, (,%s,c)
    je    L

Consider

    MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))

What context might cause us to do this?


Instruction Selection as Tree Matching

To take context into account, instruction selectors often use pattern-matching on IR trees; each pattern specifies what instructions to select.

Sample tree-matching rules:

    IR pattern                        code              cost
    BINOP(PLUS,i,j)                   leal (i,j),r      1
    BINOP(TIMES,i,j)                  movl j,r          2
                                      imull i,r
    BINOP(PLUS,i,CONST c)             leal c(i),r       1
    MEM(BINOP(PLUS,i,CONST c))        movl c(i),r       1
    MOVE(MEM(BINOP(PLUS,i,j)),k)      movl k,(i,j)      1
    BINOP(TIMES,i,CONST c)            leal (,i,c),r     1    (if c is 1, 2, or 4)
    BINOP(TIMES,i,CONST c)            movl c,r          2
                                      imull i,r
    MEM(i)                            movl (i),r        1
    MOVE(MEM(i),j)                    movl j,(i)        1
    MOVE(MEM(i),MEM(j))               movl (j),t        2
                                      movl t,(i)
    …

Tiling an IR Tree

For a[x] = *y; (assume a is a formal parameter passed on the stack), the IR tree is

    MOVE(
      MEM(PLUS(
        MEM(PLUS(EBP, CONST a)),
        TIMES(x, CONST 4))),
      MEM(y))

v.1, using small tiles:

    leal $a(%ebp),r1
    movl (r1),r2
    leal (,x,$4),r3
    leal (r2,r3),r4
    movl (y),r5
    movl r5,(r4)

v.2, using larger tiles:

    movl $a(%ebp),r1
    leal (,x,4),r2
    movl (y),r3
    movl r3,(r1,r2)

Tiling Choices

In general, for any given tree, many tilings are possible, each resulting in a different instruction sequence. We can ensure pattern coverage by covering, at a minimum, all atomic IR trees.

The Best Tiling?

We want the "lowest cost" tiling:
– usually, the shortest sequence
– but we can also take into account the cost/delay of each instruction

Optimum tiling: the lowest-cost tiling.
Locally optimal tiling: no two adjacent tiles can be combined into one tile of lower cost.


Locally Optimal Tilings

Locally optimal tiling is easy: a simple greedy algorithm, "maximal munch", works extremely well in practice:
– start at the root
– use the "biggest" match (in # of nodes), i.e., choose the largest pattern with lowest cost
– use cost to break ties

For the tree for a[x] = *y, the relevant root rules are:

    IR pattern                        code              cost
    MOVE(MEM(BINOP(PLUS,i,j)),k)      movl k,(i,j)      1
    MOVE(MEM(i),j)                    movl j,(i)        1
    MOVE(MEM(i),MEM(j))               movl (j),t        2
                                      movl t,(i)
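The greedy structure of maximal munch fits in a few lines once trees and patterns share an encoding. A sketch (the tuple encoding, the tiny rule set, and the register naming are my own illustration, not the lecture's):

```python
import itertools

REG = itertools.count(1)

RULES = [  # (pattern, assembly template); lowercase strings are variables
    (('MEM', ('PLUS', 'i', ('CONST', 'c'))), 'movl {c}({i}),{r}'),
    (('MEM', 'i'),                           'movl ({i}),{r}'),
    (('PLUS', 'i', 'j'),                     'leal ({i},{j}),{r}'),
    (('CONST', 'c'),                         'movl ${c},{r}'),
    ('i',                                    None),   # a bare temp: no code
]

def size(pat):
    return 1 + sum(size(p) for p in pat[1:]) if isinstance(pat, tuple) else 0

def match(pat, tree, env):
    if not isinstance(pat, tuple):            # a variable: bind the subtree
        env[pat] = tree
        return True
    return (isinstance(tree, tuple) and pat[0] == tree[0]
            and len(pat) == len(tree)
            and all(match(p, t, env) for p, t in zip(pat[1:], tree[1:])))

def munch(tree, out):
    """Greedily cover `tree` with the biggest pattern, then its subtrees."""
    for pat, tmpl in sorted(RULES, key=lambda rule: -size(rule[0])):
        env = {}
        if match(pat, tree, env):
            args = {v: munch(t, out) if isinstance(t, tuple) else t
                    for v, t in env.items()}
            if tmpl is None:
                return args['i']
            r = f'r{next(REG)}'
            out.append(tmpl.format(r=r, **args))
            return r

out = []
munch(('MEM', ('PLUS', 'ebp', ('CONST', 8))), out)
assert out == ['movl 8(ebp),r1']     # one big tile covers the whole tree
```

Sorting the rules by pattern size and trying them at the root first is exactly the "biggest match wins" policy; the recursion then munches whatever the chosen tile left uncovered.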

Maximal Munch

Maximal munch does not necessarily produce the optimum selection of instructions. But:
– it is easy to implement
– it tends to work well for current instruction-set architectures

Maximal Munch Is Not Optimum

Consider what happens, for example, if two of our rules are changed as follows:

    IR pattern                        code              cost
    MEM(BINOP(PLUS,i,CONST c))        movl c,r          3
                                      addl i,r
                                      movl (r),r
    MOVE(MEM(BINOP(PLUS,i,j)),k)      movl j,r          3
                                      addl i,r
                                      movl k,(r)


Sample Tree-Matching Rules (numbered)

    Rule #   IR pattern                        cost
    0        TEMP t                            0
    1        CONST c                           1
    2        BINOP(PLUS,i,j)                   1
    3        BINOP(TIMES,i,j)                  2
    4        BINOP(PLUS,i,CONST c)             1
    5        MEM(BINOP(PLUS,i,CONST c))        3
    6        MOVE(MEM(BINOP(PLUS,i,j)),k)      3
    7        BINOP(TIMES,i,CONST c)            1    (if c is 1, 2, or 4)
    8        BINOP(TIMES,i,CONST c)            2
    9        MEM(i)                            1
    10       MOVE(MEM(i),j)                    1
    11       MOVE(MEM(i),MEM(j))               2

Tiling an IR Tree, New Rules

With the changed rules, maximal munch tiles a[x] = *y; as

    movl $a,r1
    addl %ebp,r1
    movl (r1),r1
    leal (,x,4),r2
    movl (y),r3
    movl r2,r4
    addl r1,r4
    movl r3,(r4)

Optimum Selection

To achieve optimum instruction selection, we must use a more complex algorithm: dynamic programming. In contrast to maximal munch, the trees are matched bottom-up.

Dynamic Programming

The idea is fairly simple. Working bottom up, given the optimum tilings of all subtrees, generate the optimum tiling of the current tree:
– consider all tiles for the root of the current tree
– sum the cost of the best subtree tilings and each tile
– choose the tile with minimum total cost

A second pass generates the code using the results from the bottom-up pass.
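The bottom-up pass is a few lines of recursion: each node's best (rule, cost) pair considers every applicable tile plus the best costs of the subtrees that tile leaves uncovered. A sketch using a fragment of the numbered rule table (the tuple tree encoding and the '*' wildcard are mine):

```python
RULES = [  # (rule number, pattern, cost of the tile itself)
    (0, ('TEMP',), 0),
    (1, ('CONST',), 1),
    (2, ('PLUS', '*', '*'), 1),
    (4, ('PLUS', '*', ('CONST',)), 1),
    (5, ('MEM', ('PLUS', '*', ('CONST',))), 3),
    (9, ('MEM', '*'), 1),
]

def matches(pat, tree, subs):
    """True if the tile fits; collect the uncovered subtrees in `subs`."""
    if pat == '*':
        subs.append(tree)
        return True
    return (isinstance(tree, tuple) and tree[0] == pat[0]
            and len(tree) == len(pat)
            and all(matches(p, t, subs) for p, t in zip(pat[1:], tree[1:])))

def best(tree):
    """Pass 1: the (rule, total cost) annotation for `tree`."""
    choices = []
    for num, pat, cost in RULES:
        subs = []
        if matches(pat, tree, subs):
            choices.append((num, cost + sum(best(s)[1] for s in subs)))
    return min(choices, key=lambda rc: rc[1])

# MEM(PLUS(TEMP, CONST)): the big tile (rule 5) costs 3, but rule 9 over
# rule 4 over rule 0 costs only 1 + 1 + 0 = 2; maximal munch would miss it.
assert best(('MEM', ('PLUS', ('TEMP',), ('CONST',)))) == (9, 2)
```

A memoized version visits each node once per rule, which is where the linear-time behavior of the real algorithm comes from.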


Bottom-Up Code Generation, Passes 1 and 2

Pass 1 annotates each node, bottom-up, with pairs (r, c), meaning rule r gives total cost c at that node. Leaves are cheap: a TEMP such as %EBP gets (0,0) and a CONST gets (1,1). Each interior node adds the best subtree costs into the cost of each applicable rule and keeps the minimum, so a MEM over a cheap address computation can be annotated (9,2) where a larger tile would cost more, and the root MOVE ends up with tied candidates such as (10,6) and (11,6).

Pass 2 then walks top-down, emitting the code for the rule recorded at each node.

Bottom-Up Code Generation

How does the running time compare to maximal munch? Memory usage?

Tools

A lot of tools have been developed to automatically generate instruction selectors:
– TWIG [Aho, Ganapathi, Tjiang '86]
– BURG [Fraser, Henry, Proebsting '92]
– BEG [Emmelmann, Schroer, Landwehr '89]
– …

These generate bottom-up instruction selectors from tree-matching rule specifications.


Code-Generator Generators

Twig, Burg, and Beg use the dynamic programming approach; a lot of work has gone into making these cgg's highly efficient. There are also grammar-based cgg's that use LR(k) parsing to perform the tree-matching.

Tiling a DAG

How would you tile a DAG in which two loads share the base address &x?

    t0 = ld(+(4, &x))
    t1 = ld(+(&x, 8))

One option recomputes each address:

    mov x+4 -> t0
    mov x+8 -> t1

Another shares the base:

    mov  &x -> t3
    add  t3,4 -> t4
    load (t4) -> t0
    add  t3,8 -> t5
    load (t5) -> t1


Peephole Matching

Basic idea: the compiler can discover local improvements locally.
– Look at a small set of adjacent operations
– Move a "peephole" over the code & search for improvement

The classic example is a store followed by a load:

    Original code                 Improved code
    storeAI r1   ⇒ r0,8           storeAI r1 ⇒ r0,8
    loadAI  r0,8 ⇒ r15            i2i     r1 ⇒ r15

Simple algebraic identities:

    Original code                 Improved code
    addI r2,0  ⇒ r7               mult r4,r2 ⇒ r10
    mult r4,r7 ⇒ r10


Jumps to jumps:

    Original code                 Improved code
    jumpI → L10                   L10: jumpI → L11
    L10: jumpI → L11

Implementing It

Early systems used a limited set of hand-coded patterns; the window size ensured quick processing. Modern peephole instruction selectors (Davidson) break the problem into three tasks:

    IR → Expander → LLIR → Simplifier → LLIR → Matcher → ASM
         (IR→LLIR)        (LLIR→LLIR)        (LLIR→ASM)

and apply symbolic interpretation & simplification systematically.


Page ‹#› Peephole Matching Peephole Matching Expander Simplifier Turns IR code into a low-level IR (LLIR) such as RTL Looks at LLIR through window and rewrites it Uses forward substitution, algebraic simplification, local constant Operation-by-operation, template-driven rewriting propagation, and dead-effect elimination LLIR form includes all direct effects (e.g., setting cc) Performs local optimization within window Significant, albeit constant, expansion of size

IR Expander LLIR Simplifier LLIR Matcher ASM IR→LLIR LLIR→LLIR LLIR→ASM IR Expander LLIR Simplifier LLIR Matcher ASM IR→LLIR LLIR→LLIR LLIR→ASM This is the heart of the peephole system – Benefit of shows up in this step
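The core simplifier move, forward substitution plus dead-definition elimination inside a small window, can be sketched as follows (the (dest, expr) tuple encoding of LLIR is mine; real simplifiers must also track machine effects such as condition codes):

```python
def uses(reg, expr):
    """Does `expr` mention `reg` anywhere?"""
    if expr == reg:
        return True
    return isinstance(expr, tuple) and any(uses(reg, a) for a in expr)

def simplify(code):
    """Forward-substitute a definition into the next op; drop it if then dead."""
    out = list(code)
    i = 0
    while i + 1 < len(out):
        (d1, e1), (d2, e2) = out[i], out[i + 1]
        if isinstance(e2, tuple) and d1 in e2:
            folded = tuple(e1 if a == d1 else a for a in e2)
            if not any(uses(d1, e) for _, e in out[i + 2:]):   # d1 now dead
                out[i:i + 2] = [(d2, folded)]
                i = max(i - 1, 0)        # re-examine the new neighborhood
                continue
        i += 1
    return out

llir = [('r11', '@y'),
        ('r12', ('+', 'r0', 'r11')),
        ('r13', ('MEM', 'r12'))]
# r11 and r12 fold away, leaving r13 <- MEM(r0 + @y)
assert simplify(llir) == [('r13', ('MEM', ('+', 'r0', '@y')))]
```

This sketch shows only the substitution step; the window in a real simplifier also applies algebraic identities and local constant propagation.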


Matcher (LLIR→ASM)

– Compares the simplified LLIR against a library of patterns
– Picks the low-cost pattern that captures the effects
– Must preserve LLIR effects, and may add new ones (e.g., setting condition codes)
– Generates the assembly code output

Example

Original IR code for w ← x - 2 * y:

    op    arg1  arg2  result
    mult  2     y     t1
    sub   x     t1    w

The expander produces this LLIR code:

    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← MEM(r12)
    r14 ← r10 x r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← MEM(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    MEM(r20) ← r18


Example, cont'd

The simplifier reduces the LLIR to:

    r13 ← MEM(r0 + @y)
    r14 ← 2 x r13
    r17 ← MEM(r0 + @x)
    r18 ← r17 - r14
    MEM(r0 + @w) ← r18

The matcher then emits ILOC code:

    loadAI  r0,@y   ⇒ r13
    multI   r13,2   ⇒ r14
    loadAI  r0,@x   ⇒ r17
    sub     r17,r14 ⇒ r18
    storeAI r18     ⇒ r0,@w

Steps of the Simplifier

The simplifier slides a 3-operation window over the LLIR code above. In each position it forward-substitutes definitions into later uses and deletes definitions whose values become dead:

1. Window [r10 ← 2; r11 ← @y; r12 ← r0 + r11]: substituting @y for r11 gives r12 ← r0 + @y, and r11 is dead.
2. Window [r10 ← 2; r12 ← r0 + @y; r13 ← MEM(r12)]: substituting gives r13 ← MEM(r0 + @y), and r12 is dead.
3. Window [r10 ← 2; r13 ← MEM(r0 + @y); r14 ← r10 x r13]: substituting 2 for r10 gives r14 ← 2 x r13, and r10 is dead. r13 ← MEM(r0 + @y) is the first operation to roll out of the window.
4. The same substitutions on r15 ← @x and r16 ← r0 + r15 give r17 ← MEM(r0 + @x); r15 and r16 are dead.
5. r18 ← r17 - r14 passes through unchanged.
6. Finally, folding r19 ← @w and r20 ← r0 + r19 into the store gives MEM(r0 + @w) ← r18.

The result is the simplified code:

    r13 ← MEM(r0 + @y)
    r14 ← 2 x r13
    r17 ← MEM(r0 + @x)
    r18 ← r17 - r14
    MEM(r0 + @w) ← r18

Making It All Work

– We've assumed that each tile is connected
– The LLIR is largely machine independent (e.g., RTL)
– The target machine is described as a set of LLIR → ASM patterns
– Actual pattern matching:
  • use a hand-coded pattern matcher (gcc)
  • turn the patterns into a grammar & use an LR parser (VPO)
– Several important compilers use this technology; it seems to produce good portable instruction selectors
– Its key strength appears to be late low-level optimization

SIMD Instructions

What about SIMD instructions? For example, the TI C62x add2 exploits subword parallelism, operating on 16-bit halves of a 32-bit word at once.

Automatic Vectorization

Loop level:

    for (i = 0; i < 1024; i++) {
      C[i] = A[i]*B[i];
    }

becomes

    for (i = 0; i < 1024; i+=4) {
      C[i:i+3] = A[i:i+3]*B[i:i+3];
    }

Basic block level:

    x = a+b;
    y = c+d;

becomes

    (x,y) = (a,c)+(b,d);

In both cases we are looking for independent, isomorphic operations.
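The loop-level transformation is easy to sanity-check in plain Python, where slice assignment plays the role of the vector operation (a sketch; the list-based encoding is mine):

```python
N = 1024
A = list(range(N))
B = [2] * N

C_scalar = [0] * N
for i in range(N):                     # original scalar loop
    C_scalar[i] = A[i] * B[i]

C_vector = [0] * N
for i in range(0, N, 4):               # strip-mined by 4, as in C[i:i+3]
    C_vector[i:i + 4] = [a * b for a, b in zip(A[i:i + 4], B[i:i + 4])]

assert C_scalar == C_vector            # same result, 4 elements per step
```

The transformation is legal here because the iterations are independent: no C[i] feeds a later A[j] or B[j].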
