10/15/2013

CS 4120 Introduction to Compilers
Ross Tate
Cornell University

Lecture 18: Instruction Selection

Where we are

• syntax-directed translation ⇒ Intermediate code
• reordering with traces ⇒ Canonical intermediate code
• tiling, dynamic programming ⇒ Abstract assembly code
• ⇒ Assembly code

CS 4120 Introduction to Compilers 2

Abstract Assembly

• Abstract assembly = assembly code w/ infinite register set
• Canonical intermediate code = abstract assembly code + expression trees

    MOVE(e1, e2)   ⇒ mov e1, e2
    JUMP(e)        ⇒ jmp e
    CJUMP(e, l)    ⇒ cmp e1, e2
                     [jne|je|jgt|…] l
    CALL(e, e1,…)  ⇒ push e1; … ; call e
    LABEL(l)       ⇒ l:

Instruction selection

• Conversion to abstract assembly is the problem of instruction selection for a single IR statement node
• Full abstract assembly code: glue together the translated instructions from each of the statements
• Problem: more than one way to translate a given statement. How to choose?
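The statement table above, and the idea of gluing per-statement translations together, can be sketched in a few lines of Java. This is only an illustration: the class `StmtTranslationSketch`, the method `translate`, and the convention of passing already-flattened operands as strings are assumptions made for the example, not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: operands are assumed to be already-flattened temporaries or
// labels, passed as strings. Class and method names are hypothetical.
final class StmtTranslationSketch {
    /** Translates one IR statement, following the table on the slide literally. */
    static List<String> translate(String kind, String... ops) {
        List<String> out = new ArrayList<>();
        switch (kind) {
            case "MOVE":   // ops = {e1, e2}
                out.add("mov " + ops[0] + ", " + ops[1]);
                break;
            case "JUMP":   // ops = {e}
                out.add("jmp " + ops[0]);
                break;
            case "CJUMP":  // ops = {e1, e2, jcc, l}: compare, then conditional jump
                out.add("cmp " + ops[0] + ", " + ops[1]);
                out.add(ops[2] + " " + ops[3]);
                break;
            case "CALL":   // ops = {f, e1, ...}: push arguments in the order listed
                for (int i = 1; i < ops.length; i++) out.add("push " + ops[i]);
                out.add("call " + ops[0]);
                break;
            case "LABEL":  // ops = {l}
                out.add(ops[0] + ":");
                break;
            default:
                throw new IllegalArgumentException("unknown statement: " + kind);
        }
        return out;
    }

    public static void main(String[] args) {
        // Glue the per-statement translations together in program order.
        List<String> program = new ArrayList<>();
        program.addAll(translate("LABEL", "loop"));
        program.addAll(translate("MOVE", "t1", "t2"));
        program.addAll(translate("CJUMP", "t1", "t2", "jne", "loop"));
        program.forEach(System.out::println);
    }
}
```

Each statement here has exactly one translation; the rest of the lecture is about what to do when several translations are possible.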


Example

    MOVE
      TEMP(t1)
      ADD
        TEMP(t1)
        MEM
          ADD
            TEMP(fp)
            4

    mov %rbp,t2
    add $4,t2
    mov (t2),t3
    add t3,t1

  Or a single instruction?

    add 4(%rbp),t1

x86-64 ISA

• Need to map IR tree to actual machine instructions
  – need to know how instructions work
• A two-address CISC architecture (inherited from 4004, 8008, 8086…)
• Typical instruction has
  – opcode (mov, add, sub, shl, shr, mul, div, jmp, jcc, push, pop, test, enter, leave)
  – destination (r, n, (r), k(r), (r1,r2), (r1,r2,w), k(r1,r2,w)) (may also be an operand)
  – source (any legal destination, or a constant $k)

    opcode src, src/dest

    mov $1,%rax
    add %rcx,%rbx
    sub %rbp,%rsi
    add %edi,(%rcx,%rdi,8)
    je label1
    jmp *4(%rbp)


AT&T vs Intel

• Intel syntax:
  • opcode dest, src
  • Registers rax, rbx, rcx, ... r8, r9, ... r15
  • constants k
  • memory operands [n], [r+k], [r1+w*r2], ...
• AT&T syntax (GNU assembler default):
  • opcode src, dest
  • %rax, %rbx, …
  • constants $k
  • memory operands n, k(r), (r1,r2,w), ...

Tiling

• Idea: each Pentium instruction performs computation for a piece of the IR tree: a tile

    MOVE                      add t3, t1
      TEMP(t1)
      ADD
        TEMP(t1)
        MEM        (result t3)  mov (t2),t3
          ADD      (result t2)  mov %rbp,t2
            TEMP(fp)            add $4, t2
            4

• Tiles connected by new temporary registers (t2, t3) that hold result of tile
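To make the role of the connecting temporaries concrete, here is a minimal Java sketch that tiles the tree above with three small tiles (add-with-constant, load, and a two-address add folded into the MOVE). The IR classes, `munch`, `munchMove`, and the choice to start fresh temporaries at t2 are all assumptions made so that the output lines up with the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch with a hypothetical four-node IR; fresh temporaries start at t2 only
// so that the output lines up with the slide's numbering.
final class TempTilingSketch {
    static abstract class Expr {}
    static final class Temp extends Expr { final String name; Temp(String n) { name = n; } }
    static final class Const extends Expr { final long value; Const(long v) { value = v; } }
    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr l, Expr r) { left = l; right = r; }
    }
    static final class Mem extends Expr { final Expr addr; Mem(Expr a) { addr = a; } }

    final List<String> code = new ArrayList<>();
    private int next = 2;

    private String fresh() { return "t" + next++; }

    /** Tiles an expression; returns the temporary (outgoing edge) holding its value. */
    String munch(Expr e) {
        if (e instanceof Temp) return ((Temp) e).name;   // already in a register
        String t;
        if (e instanceof Const) {
            t = fresh();
            code.add("mov $" + ((Const) e).value + ", " + t);
        } else if (e instanceof Mem) {
            String a = munch(((Mem) e).addr);
            t = fresh();
            code.add("mov (" + a + "), " + t);
        } else {
            Add add = (Add) e;
            String l = munch(add.left);
            String r = (add.right instanceof Const)
                    ? "$" + ((Const) add.right).value   // fold the constant into the tile
                    : munch(add.right);
            t = fresh();
            code.add("mov " + l + ", " + t);
            code.add("add " + r + ", " + t);
        }
        return t;
    }

    /** Tiles MOVE(TEMP(dst), src), using a two-address add when src is dst + e. */
    void munchMove(Temp dst, Expr src) {
        if (src instanceof Add && ((Add) src).left instanceof Temp
                && ((Temp) ((Add) src).left).name.equals(dst.name)) {
            String r = munch(((Add) src).right);
            code.add("add " + r + ", " + dst.name);
        } else {
            String s = munch(src);
            code.add("mov " + s + ", " + dst.name);
        }
    }

    public static void main(String[] args) {
        TempTilingSketch g = new TempTilingSketch();
        g.munchMove(new Temp("t1"),
                new Add(new Temp("t1"), new Mem(new Add(new Temp("%rbp"), new Const(4)))));
        g.code.forEach(System.out::println);
    }
}
```

Run on the slide's tree, the sketch emits the same four instructions, with t2 and t3 carrying values between tiles.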


Tiles

    t2
    +            mov t1, t2
      t1         add $1, t2
      1

• Tiles capture compiler’s understanding of instruction set
• Each tile: sequence of instructions that update a fresh temporary (may need extra mov’s) and associated IR tree
• All outgoing edges are temporaries

Some tiles

    MOVE
      TEMP(t1)   mov t2,t1
      t2

    tf
    ADD          mov t1,tf
      t1         add t2,tf
      t2

    MOVE
      MEM
        ADD      mov $i,(t1,t2)
          t1
          t2
      CONST(i)


Designing tiles

• Only add tiles that are useful to compiler
• Many instructions will be too hard to use effectively or will offer no advantage
• Need tiles for all single-node trees to guarantee that every tree can be tiled, e.g.

    t1
    +            mov t2, t1
      t2         add t3, t1
      t3

More handy tiles

lea instruction computes a memory address but doesn’t actually load from memory

    tf
    +            lea (t1,t2), tf
      t1
      t2

    tf
    +
      t1         lea (t1,t2,k1), tf
      *          (k1 one of 1, 2, 4, 8)
        CONST(k1)
        t2
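The two lea tiles can be recognised with a small pattern check. The Java below is a sketch under stated assumptions: the IR classes and `matchLea` are hypothetical names, the operands of the tile are assumed to be temporaries already, and the accepted scale factors are the x86 addressing-mode scales 1, 2, 4, and 8.

```java
import java.util.Optional;

// Sketch: recognises tf = t1 + t2 and tf = t1 + k*t2 as lea tiles.
// All class and method names are hypothetical.
final class LeaTileSketch {
    static abstract class Expr {}
    static final class Temp extends Expr { final String name; Temp(String n) { name = n; } }
    static final class Const extends Expr { final long value; Const(long v) { value = v; } }
    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr l, Expr r) { left = l; right = r; }
    }
    static final class Mul extends Expr {
        final Expr left, right;
        Mul(Expr l, Expr r) { left = l; right = r; }
    }

    /** Scale factors accepted by x86 base+index*scale addressing. */
    static boolean isScale(long k) { return k == 1 || k == 2 || k == 4 || k == 8; }

    /** Returns the lea instruction for e with result tf, or empty if no lea tile matches. */
    static Optional<String> matchLea(Expr e, String tf) {
        if (!(e instanceof Add)) return Optional.empty();
        Add add = (Add) e;
        if (!(add.left instanceof Temp)) return Optional.empty();
        String t1 = ((Temp) add.left).name;
        if (add.right instanceof Temp) {                       // t1 + t2
            return Optional.of("lea (" + t1 + "," + ((Temp) add.right).name + "), " + tf);
        }
        if (add.right instanceof Mul) {                        // t1 + k * t2
            Mul mul = (Mul) add.right;
            if (mul.left instanceof Const && mul.right instanceof Temp
                    && isScale(((Const) mul.left).value)) {
                return Optional.of("lea (" + t1 + "," + ((Temp) mul.right).name + ","
                        + ((Const) mul.left).value + "), " + tf);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Expr e = new Add(new Temp("t1"), new Mul(new Const(4), new Temp("t2")));
        System.out.println(matchLea(e, "tf").orElse("no lea tile"));
    }
}
```

A multiplier that is not a legal scale, such as 3, simply fails to match, and the tree falls back to smaller tiles.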


Matching CJUMP for RISC

• As defined in lecture, have CJUMP(cond, destination)
• Appel: CJUMP(op, e1, e2, destination) where op is one of ==, !=, <, <=, >=, >
• Our CJUMP translates easily to RISC ISAs that have explicit comparison result

    CJUMP
      <   (result t1)      MIPS
        t2                 cmplt t2, t3, t1
        t3                 br t1, n
      NAME(n)

Condition code ISA

• Appel’s CJUMP corresponds more directly to Pentium conditional jumps

    CJUMP
      <
      t1                   cmp t1, t2    (set condition codes)
      t2                   jl n          (test condition codes)
      NAME(n)

• However, can handle Pentium-style jumps with lecture IR with appropriate tiles


Branches

• How to tile a conditional jump?
• Fold comparison operator into tile

    CJUMP
      t1                   test t1, t1
      l1                   jnz l1
      (l2)

    CJUMP
      EQ
        t1                 cmp t1, t2
        t2                 je l1
      l1
      (l2)

An annoying instruction

• Pentium mul instruction multiplies single operand by eax, puts result in eax (low 32 bits), edx (high 32 bits)
• Solution: add extra mov instructions, let register allocation deal with edx overwrite

    tf
    MUL          mov t1, %eax
      t1         mul t2
      t2         mov %eax, tf


Tiling Problem

• How to pick tiles that cover IR statement tree with minimum execution time?
• Need a good selection of tiles
  • small tiles to make sure we can tile every tree
  • large tiles for efficiency
• Usually want to pick large tiles: fewer instructions
• instructions ≠ cycles: RISC core instructions take 1 cycle, other instructions may take more

    add %rax,4(%rcx)   ⇔   mov 4(%rcx),%rdx
                           add %rax,%rdx
                           mov %rdx,4(%rcx)

Example

x = x + 1;

ebp: Pentium frame-pointer register

    MOVE
      MEM
        +
          FP
          x
      +          (result t2)
        MEM      (result t1)
          +
            FP
            x
        1

    mov (%ebp,x), t1
    mov t1, t2
    add $1, t2
    mov t2, (%ebp,x)


Alternate (non-RISC) tiling

x = x + 1;

    MOVE
      MEM
        +
          FP
          x
      +
        MEM
          +
            FP
            x
        1

    add $1, (%ebp,x)

  The whole tree is covered by one tile:

    MOVE
      r/m32
      ADD
        r/m32
        CONST(k)

Greedy tiling

• Assume larger tiles = better
• Greedy algorithm: start from top of tree and use largest tile that matches tree
• Tile remaining subtrees recursively

[Figure: example IR tree (a MOVE over MEM, ADD, and MUL nodes, with frame-pointer offsets FP+8 and FP+12) covered by tiles chosen from the root down.]
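The greedy algorithm can be sketched as a recursive traversal that tries the largest tile first at each node. The Java below is a minimal illustration, not the course implementation: the IR classes, the helper `memOperand`, and the particular tile set are assumptions, and x is treated as a temporary so that the operand prints as `(%ebp,x)` as on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of greedy ("maximal munch") tiling over a hypothetical mini-IR.
final class GreedyTilingSketch {
    static abstract class Expr {}
    static final class Temp extends Expr {
        final String name; Temp(String n) { name = n; }
        @Override public String toString() { return name; }
    }
    static final class Const extends Expr {
        final long value; Const(long v) { value = v; }
        @Override public String toString() { return "$" + value; }
    }
    static final class Add extends Expr {
        final Expr left, right; Add(Expr l, Expr r) { left = l; right = r; }
        @Override public String toString() { return "ADD(" + left + "," + right + ")"; }
    }
    static final class Mem extends Expr {
        final Expr addr; Mem(Expr a) { addr = a; }
        @Override public String toString() { return "MEM(" + addr + ")"; }
    }

    final List<String> code = new ArrayList<>();
    private int next = 1;
    private String fresh() { return "t" + next++; }

    /** Operand matcher shared by every tile that touches memory. */
    private String memOperand(Mem m) {
        if (m.addr instanceof Add) {
            Add a = (Add) m.addr;
            return "(" + munch(a.left) + "," + munch(a.right) + ")";
        }
        return "(" + munch(m.addr) + ")";
    }

    /** Tiles an expression top-down; returns the temporary holding its value. */
    String munch(Expr e) {
        if (e instanceof Temp) return ((Temp) e).name;
        String t;
        if (e instanceof Const) {
            t = fresh();
            code.add("mov $" + ((Const) e).value + ", " + t);
        } else if (e instanceof Mem) {
            String m = memOperand((Mem) e);
            t = fresh();
            code.add("mov " + m + ", " + t);
        } else {
            Add a = (Add) e;
            String l = munch(a.left);
            String r = (a.right instanceof Const) ? "$" + ((Const) a.right).value : munch(a.right);
            t = fresh();
            code.add("mov " + l + ", " + t);
            code.add("add " + r + ", " + t);
        }
        return t;
    }

    /** Tiles MOVE(dst, src), where dst is a Temp or a Mem. */
    void munchMove(Expr dst, Expr src) {
        if (dst instanceof Mem && src instanceof Add) {
            Add a = (Add) src;
            // Largest tile first: MOVE(MEM(x), ADD(MEM(x), CONST(k))) => add $k, x
            if (a.right instanceof Const && a.left.toString().equals(dst.toString())) {
                String m = memOperand((Mem) dst);
                code.add("add $" + ((Const) a.right).value + ", " + m);
                return;
            }
        }
        String s = munch(src);   // otherwise tile the source, then store or move
        String d = (dst instanceof Mem) ? memOperand((Mem) dst) : ((Temp) dst).name;
        code.add("mov " + s + ", " + d);
    }

    public static void main(String[] args) {
        Expr x = new Mem(new Add(new Temp("%ebp"), new Temp("x")));
        GreedyTilingSketch g = new GreedyTilingSketch();
        g.munchMove(x, new Add(new Mem(new Add(new Temp("%ebp"), new Temp("x"))), new Const(1)));
        g.code.forEach(System.out::println);
    }
}
```

For x = x + 1 the largest tile matches at the root and a single `add` is emitted; if the source read a different location than the destination, the sketch falls back to the load, add, store sequence.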

Improving instruction selection

• Greedy tiling may not generate best code
  • Always selects largest tile, not necessarily fastest instruction
  • May pull nodes up into tiles when better to leave below
• Can do better using dynamic programming algorithm

Timing model

• Idea: associate cost with each tile (proportional to # cycles to execute)
  • caveat: cost is fictional on modern architectures
• Estimate of total execution time is sum of costs of all tiles

[Figure: the x = x + 1 tree covered by three tiles with costs 2, 1, and 2.]

Total cost: 5


Finding optimum tiling

• Goal: find minimum total cost tiling of tree
• Algorithm: for every node, find minimum total-cost tiling of that node and its subtree
• Lemma: once minimum-cost tiling of all descendants of a node is known, can find minimum-cost tiling of the node by trying out all possible tiles matching the node
• Therefore: start from leaves, work upward to top node

Recursive implementation

• Any dynamic-programming algorithm is equivalent to a memoized version of the same algorithm that runs top-down
• For each node, record best tile for node
• Start at top, recurse:
  • First, check in table for best tile for this node
  • If not computed, try each matching tile to see which one has lowest cost
  • Store lowest-cost tile in table and return
• Finally, use entries in table to emit code
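The memoized top-down formulation can be sketched as follows. This Java example is an illustration under assumptions: the `Node` representation, the tile set, and every cost (0 for a temporary, 1 for register-only tiles, 2 for tiles that touch memory, and a configurable `rmwCost` for the add-to-memory tile) are invented for the example. It records only the minimum cost per node; a fuller version would store the winning tile as well and emit code from the table.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of minimum-cost tiling by memoized recursion; costs are hypothetical.
final class DpTilingSketch {
    static final class Node {
        final String op, label;
        final Node[] kids;
        Node(String op, String label, Node... kids) { this.op = op; this.label = label; this.kids = kids; }
        boolean is(String o) { return op.equals(o); }
        @Override public String toString() { return op + ":" + label + Arrays.toString(kids); }
    }
    static Node temp(String n) { return new Node("TEMP", n); }
    static Node konst(long k) { return new Node("CONST", Long.toString(k)); }
    static Node add(Node l, Node r) { return new Node("ADD", "", l, r); }
    static Node mem(Node a) { return new Node("MEM", "", a); }
    static Node move(Node d, Node s) { return new Node("MOVE", "", d, s); }

    private final Map<Node, Integer> memo = new IdentityHashMap<>();
    private final int rmwCost;   // assumed cost of the add-to-memory tile

    DpTilingSketch(int rmwCost) { this.rmwCost = rmwCost; }

    /** Minimum total cost of tiling the subtree rooted at n. */
    int best(Node n) {
        Integer cached = memo.get(n);            // first, check the table
        if (cached != null) return cached;
        int c;
        if (n.is("TEMP")) {
            c = 0;                               // value already in a register
        } else if (n.is("CONST")) {
            c = 1;                               // mov $k, t
        } else if (n.is("ADD")) {
            c = 1 + best(n.kids[0]) + best(n.kids[1]);
            if (n.kids[1].is("CONST")) c = Math.min(c, 1 + best(n.kids[0]));  // immediate operand
        } else if (n.is("MEM")) {
            c = 2 + addr(n.kids[0]);             // load
        } else {                                 // MOVE(dst, src); dst is TEMP or MEM
            Node dst = n.kids[0], src = n.kids[1];
            if (dst.is("TEMP")) {
                c = 1 + best(src);
            } else {
                c = 2 + addr(dst.kids[0]) + best(src);   // store
                boolean rmw = src.is("ADD") && src.kids[1].is("CONST")
                        && src.kids[0].toString().equals(dst.toString());
                if (rmw) c = Math.min(c, rmwCost + addr(dst.kids[0]));  // add $k, mem
            }
        }
        memo.put(n, c);                          // store the best cost and return
        return c;
    }

    /** Cost of an address, optionally folding a top-level ADD into the addressing mode. */
    private int addr(Node a) {
        if (!a.is("ADD")) return best(a);
        return Math.min(best(a), best(a.kids[0]) + best(a.kids[1]));
    }

    public static void main(String[] args) {
        Node x = mem(add(temp("fp"), temp("x")));
        Node stmt = move(x, add(mem(add(temp("fp"), temp("x"))), konst(1)));
        System.out.println(new DpTilingSketch(6).best(stmt));
        System.out.println(new DpTilingSketch(3).best(stmt));
    }
}
```

With these assumed costs the load, add, store tiling of x = x + 1 costs 2 + 1 + 2 = 5. When the large tile is priced at 6 the search keeps the three small tiles; priced at 3 it selects the large one. Greedy tiling would pick the large tile in both cases.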


Problems with model

• Modern processors:
  • execution time not sum of tile times
  • instruction order matters
    • Processors are pipelining instructions and executing different pieces of instructions in parallel
    • bad ordering (e.g. too many memory operations in sequence) stalls processor pipeline
    • processor can execute some instructions in parallel (super-scalar)
  • cost is merely an approximation
  • instruction scheduling needed

Finding matching tiles

• Explicitly building every tile: tedious
• Easier to write subroutines for matching Pentium source, destination operands
• Reuse matcher for all opcodes


Matching tiles

    abstract class IR_Stmt {
      // Emit assembly for this statement, choosing tiles greedily.
      abstract Assembly munch();
    }
    class IR_Move extends IR_Stmt {
      IR_Expr src, dst;
      Assembly munch() {
        // Tile MOVE(r/m32, +(r/m32, e)): dst = dst + e becomes a single add.
        if (src instanceof IR_Plus &&
            ((IR_Plus)src).lhs.equals(dst) &&
            is_regmem32(dst)) {
          Assembly e = ((IR_Plus)src).rhs.munch();
          return e.append(new AddIns(dst, e.target()));
        }
        else if ...
      }
    }

  Tile matched by the first case:

    MOVE
      r/m32
      +
        r/m32
        (any expression)

Tile Specifications

• Previous approach simple, efficient, but hard-codes tiles and their priorities
• Another option: explicitly create data structures representing each tile in instruction set
  • Tiling performed by a generic tree-matching and code generation procedure
  • Can generate from instruction set description – generic back end!
• For RISC instruction sets, over-engineering
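The data-driven alternative might look like the following sketch, in which each tile is a pattern tree held in a table and one generic routine does the matching. Everything here is an assumption made for illustration: the `Node` and `Tile` classes, the wildcard `ANY`, the four tile names, and the convention that the table is ordered largest tile first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: tiles as data plus a generic tree matcher. Names are hypothetical.
final class TileTableSketch {
    static final class Node {
        final String op;
        final Node[] kids;
        Node(String op, Node... kids) { this.op = op; this.kids = kids; }
    }
    static final class Tile {
        final String name;
        final Node pattern;
        Tile(String name, Node pattern) { this.name = name; this.pattern = pattern; }
    }

    /** Wildcard: matches any subtree and marks it as an outgoing edge of the tile. */
    static final Node ANY = new Node("_");

    /** Tile table, ordered so that larger patterns are tried first. */
    static final List<Tile> TILES = Arrays.asList(
            new Tile("load-indexed", new Node("MEM", new Node("ADD", ANY, ANY))),
            new Tile("load", new Node("MEM", ANY)),
            new Tile("add-imm", new Node("ADD", ANY, new Node("CONST"))),
            new Tile("add", new Node("ADD", ANY, ANY)));

    /** Generic matcher: collects the subtrees left uncovered by the pattern in holes. */
    static boolean match(Node pat, Node tree, List<Node> holes) {
        if (pat == ANY) { holes.add(tree); return true; }
        if (!pat.op.equals(tree.op) || pat.kids.length != tree.kids.length) return false;
        for (int i = 0; i < pat.kids.length; i++) {
            if (!match(pat.kids[i], tree.kids[i], holes)) return false;
        }
        return true;
    }

    /** Returns the first tile in the table that matches at the root, or null. */
    static Tile select(List<Tile> tiles, Node tree, List<Node> holes) {
        for (Tile t : tiles) {
            holes.clear();
            if (match(t.pattern, tree, holes)) return t;
        }
        return null;
    }

    public static void main(String[] args) {
        Node tree = new Node("MEM", new Node("ADD", new Node("TEMP"), new Node("TEMP")));
        List<Node> holes = new ArrayList<>();
        Tile t = select(TILES, tree, holes);
        System.out.println(t.name + " with " + holes.size() + " subtrees left to tile");
    }
}
```

Adding an instruction then means adding a table entry rather than another `instanceof` branch; the subtrees collected in `holes` are the ones a driver would tile recursively.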


Summary

• Can specify code-generation process as a set of tiles that relate IR trees to instruction sequences
• Instructions using fixed registers are problematic but can be handled using extra temporaries
• Greedy algorithm implemented simply as recursive traversal
• Dynamic-programming algorithm generates better code, can also be implemented recursively using memoization
• Real optimization will require instruction scheduling
