Rethinking Code Generation in

Christian Schulte
SCALE, KTH Royal Institute of Technology & SICS (Swedish Institute of Computer Science)

Joint work with:
• Mats Carlsson, SICS
• Roberto Castañeda Lozano, SICS + KTH
• Frej Drejhammar, SICS
• Gabriel Hjort Blindell, KTH

Funded by: Ericsson, Vetenskapsrådet

Compilation

source program → front-end → optimizer → back-end (code generator) → assembly program

• Front-end: depends on source programming language
  • changes infrequently
• Optimizer: independent optimizations
  • changes infrequently
• Back-end: depends on processor architecture
  • changes often: new architectures, new features, …

Schulte, SCALE · Rethinking Code Generation · Oct 1, 2014

Building a Compiler

source program → [front-end → optimizer: LLVM] → back-end (code generator) → assembly program

• Infrequent changes: front-end & optimizer
  • reuse state-of-the-art: LLVM, for example

Building a Compiler

source program → front-end → optimizer → [back-end (code generator): Unison] → assembly program

• Infrequent changes: front-end & optimizer
  • reuse state-of-the-art: LLVM, for example
• Frequent changes: back-end
  • use flexible approach: Unison (project this talk is based on)

State-of-the-art

• Code generation organized into stages
  • instruction selection, register allocation, instruction scheduling

Instruction selection:
  x = y + z;    →    add r0 r1 r2
                     mv $a6f0 r0

Register allocation:
  x = y + z;    →    x → register r0
                     y → memory (spill to stack)
                     …

Instruction scheduling:
  x = y + z;         u = v – w;
  u = v – w;    →    x = y + z;
  …                  …

State-of-the-art

instruction selection → register allocation → instruction scheduling
instruction selection → instruction scheduling → register allocation

• Code generation organized into stages
  • stages are interdependent: no optimal order possible
• Example: instruction scheduling → register allocation
  • increased delay between instructions can increase throughput
    → registers used over longer time-spans
    → more registers needed
• Example: instruction scheduling ← register allocation
  • put variables into fewer registers
    → more dependencies among instructions
    → less opportunity for reordering instructions
• Stages use heuristic algorithms
  • for hard combinatorial problems (NP hard)
  • assumption: optimal solutions not possible anyway
  • difficult to take advantage of processor features
  • error-prone when adapting to change
• Staging and heuristics preclude optimal code, make development complex

Rethinking: Unison Idea

• No more staging and heuristic algorithms!
  • many assumptions are decades old...
• Use state-of-the-art technology for solving combinatorial optimization problems: constraint programming
  • tremendous progress in last two decades...
• Generate and solve single model
  • captures all code generation tasks in unison
  • high level of abstraction: based on processor description
  • flexible: ideally, just change processor description
  • potentially optimal: tradeoff between decisions accurately reflected

Unison Approach

input program + processor description
  → constraint model (instruction selection constraints, instruction scheduling constraints, register allocation constraints)
  → off-the-shelf constraint solver
  → assembly program

• Generate constraint model
  • based on input program and processor description
  • constraints for all code generation tasks
  • generate but not solve: simpler and more expressive
• Off-the-shelf constraint solver solves constraint model
  • solution is assembly program
  • optimization takes inter-dependencies into account

Overview

• Constraint programming in a nutshell

• Approach
• Results
• Discussion

CONSTRAINT PROGRAMMING IN A NUTSHELL

Constraint Programming

• Model and solve combinatorial (optimization) problems

• Modeling
  • variables
  • constraints
  • branching heuristics
  • (cost function)
• Solving
  • constraint propagation
  • heuristic search
• Of course simplified…
  • array of modeling techniques

Problem: Send More Money

• Find distinct digits for letters such that

    SEND
  + MORE
  = MONEY

Constraint Model

• Variables: S, E, N, D, M, O, R, Y ∈ {0,…,9}
• Constraints:
  distinct(S,E,N,D,M,O,R,Y)

               1000×S + 100×E + 10×N + D
  +            1000×M + 100×O + 10×R + E
  = 10000×M + 1000×O + 100×N + 10×E + Y

  S ≠ 0, M ≠ 0
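
The model above can be checked with a few lines of Python. This is only a sketch: it enumerates digit assignments by brute force instead of using constraint propagation as a constraint solver would, and the function name `solve_send_more_money` is made up for illustration.

```python
from itertools import permutations

def solve_send_more_money():
    """Enumerate digit assignments and keep those satisfying the model."""
    solutions = []
    for s, e, n, d, m, o, r in permutations(range(10), 7):
        if s == 0 or m == 0:                # S ≠ 0, M ≠ 0
            continue
        y = (d + e) % 10                    # the last column determines Y
        if y in (s, e, n, d, m, o, r):      # distinct(S,E,N,D,M,O,R,Y)
            continue
        send = 1000 * s + 100 * e + 10 * n + d
        more = 1000 * m + 100 * o + 10 * r + e
        money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if send + more == money:
            solutions.append(dict(S=s, E=e, N=n, D=d, M=m, O=o, R=r, Y=y))
    return solutions

print(solve_send_more_money())
```

The search finds the single assignment 9567 + 1085 = 10652.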

Constraints

• State relations between variables
  • legal combinations of values for variables
• Examples
  • all variables pairwise distinct: distinct(x1, …, xn)
  • arithmetic constraints: x + 2×y = z
  • domain-specific: cumulative(t1, …, tn), nooverlap(r1, …, rn)
• Success story: global constraints
  • modeling: capture recurring problem structures
  • solving: enable strong reasoning
    constraint-specific methods

Solving: Variables and Values

x ∈ {1,2,3,4}   y ∈ {1,2,3,4}   z ∈ {1,2,3,4}

• Record possible values for variables
  • solution: single value left
  • failure: no values left

Constraint Propagation

distinct(x, y, z)   x + y = 3

  initially:                 x ∈ {1,2,3,4}   y ∈ {1,2,3,4}   z ∈ {1,2,3,4}
  after x + y = 3:           x ∈ {1,2}       y ∈ {1,2}       z ∈ {1,2,3,4}
  after distinct(x, y, z):   x ∈ {1,2}       y ∈ {1,2}       z ∈ {3,4}

• Prune values that are in conflict with constraint
• propagation is often smart if not perfect!
  • captured by so-called global constraints

Heuristic Search

distinct(x, y, z)   x + y = 3

  root:          x ∈ {1,2}   y ∈ {1,2}   z ∈ {3,4}
  branch x = 1:  x ∈ {1}     y ∈ {2}     z ∈ {3,4}
  branch x ≠ 1:  x ∈ {2}     y ∈ {1}     z ∈ {3,4}

• Propagation alone not sufficient
  • decompose into simpler sub-problems
  • search needed
• Create subproblems with additional constraints
  • enables further propagation
  • defines search tree
  • uses problem specific heuristic
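
The x, y, z example can be replayed with a toy propagate-and-search loop. This is a minimal sketch, not how a real solver is implemented: each constraint is propagated by enumerating all value combinations (feasible only for tiny domains), and the names `propagate`, `search`, `DOMAINS` and `CONSTRAINTS` are illustrative.

```python
from itertools import product

def propagate(domains, constraints):
    """Prune values without support until nothing changes; False on failure."""
    changed = True
    while changed:
        changed = False
        for scope, pred in constraints:
            for var in scope:
                pos = scope.index(var)
                supported = {vals[pos]
                             for vals in product(*(domains[v] for v in scope))
                             if pred(*vals)}
                if supported != domains[var]:
                    domains[var] = supported
                    changed = True
    return all(domains[v] for v in domains)

def search(domains, constraints):
    """Propagate, then branch on var = val and var ≠ val."""
    domains = {v: set(d) for v, d in domains.items()}
    if not propagate(domains, constraints):
        return []
    undecided = [v for v in domains if len(domains[v]) > 1]
    if not undecided:
        return [{v: next(iter(d)) for v, d in domains.items()}]
    var = min(undecided, key=lambda v: len(domains[v]))  # smallest domain first
    val = min(domains[var])
    left = dict(domains); left[var] = {val}
    right = dict(domains); right[var] = domains[var] - {val}
    return search(left, constraints) + search(right, constraints)

DOMAINS = {"x": {1, 2, 3, 4}, "y": {1, 2, 3, 4}, "z": {1, 2, 3, 4}}
CONSTRAINTS = [
    (("x", "y"), lambda x, y: x + y == 3),
    (("x", "y", "z"), lambda x, y, z: len({x, y, z}) == 3),  # distinct
]
```

Calling `propagate` on a copy of `DOMAINS` reproduces the pruned domains from the slides; `search(DOMAINS, CONSTRAINTS)` then branches on x first and enumerates the remaining solutions.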

What Makes It Work?

• Essential: avoid search...
  ...as it always suffers from combinatorial explosion
• Constraint propagation drastically reduces search space
• Efficient and powerful methods for propagation available
• When using search, use a clever heuristic
• Array of modeling techniques available that reduce search
  • adding redundant constraints for more propagation, symmetry breaking, ...
• Hybrid methods (together with LP, SAT, stochastic, ...)

Approach

Source Material

• Survey on Combinatorial Register Allocation and Instruction Scheduling • Roberto Castañeda Lozano, Christian Schulte. CoRR entry, 2014.

• Combinatorial Spill Code Optimization and Ultimate Coalescing
  • Roberto Castañeda Lozano, Mats Carlsson, Gabriel Hjort Blindell, Christian Schulte. Languages, Compilers, Tools and Theory for Embedded Systems, 2014.
• Constraint-based Register Allocation and Instruction Scheduling
  • Roberto Castañeda Lozano, Mats Carlsson, Frej Drejhammar, Christian Schulte. Eighteenth International Conference on Principles and Practice of Constraint Programming, 2012.

Getting Started…

  int fac(int n) {
    int f = 1;
    while (n > 0) {
      f = f * n; n--;
    }
    return f;
  }

  t3 ← li
  t4 ← slti t2
  bne t3
  t8 ← mul t7, t6
  t9 ← subiu t6
  bgtz t9
  jr t10

• Function is unit of compilation
  • generate code for one function at a time
• Instruction selection has already been performed
  • some instructions might depend on register allocation [later]
• Use control flow graph (CFG) and turn it into LSSA form
  • edges = control flow
  • nodes = basic blocks (no control flow)

Schulte, SCALE · Rethinking Code Generation · Sep 5, 2013

Register Allocation

  t2 ← mul t1, 2       r2 ← mul r1, 2       r2 ← mul r1, 2
  t3 ← sub t1, 2       r3 ← sub r1, 2       r1 ← sub r1, 2
  t4 ← add t2, t3      r4 ← add r2, r3      r1 ← add r2, r1
  return t4            return r4            return r1

• Assign registers to program temporaries (variables)
  • infinite number of temporaries
  • finite number of registers
• Naive strategy: each temporary assigned a different register
  • will never work, way too few registers!
• Assign the same register to several temporaries
  • when is this safe? interference
  • what if there are not enough registers? spilling

Static Single Assignment (SSA)

  t3 ← li
  t4 ← slti t2
  bne t3
  t8 ← mul t7, t6
  t9 ← subiu t6
  bgtz t9
  jr t10

• SSA: each temporary is defined (t ← ...) once
• SSA simplifies many optimizations
• Instead of using φ-functions we use φ-congruences and LSSA
  • φ-functions disambiguate definitions of temporaries

Liveness and Interference

  t0 ← add ...
  ...
  jr t0

• Temporary is live when it might be still used
  • live range of a temporary from its definition to use
• Temporaries interfere if they are live simultaneously
  • this definition is naive [more later]
• Non-interfering temporaries can be assigned to same register
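
The naive definition of interference can be sketched for the straight-line mul/sub/add example from the Register Allocation slide. The encoding of instructions as (defined temporary, used temporaries) pairs and the choice to treat a live range as the half-open interval from definition to last use are assumptions made for this illustration; t1 is assumed to be live on entry.

```python
from itertools import combinations

def live_ranges(block, live_in=()):
    """Map each temporary to (definition index, last-use index)."""
    start = {t: -1 for t in live_in}       # live-in temporaries start before the block
    end = {t: -1 for t in live_in}
    for i, (defined, used) in enumerate(block):
        for t in used:
            end[t] = i
        if defined is not None:
            start[defined] = i
            end.setdefault(defined, i)
    return {t: (start[t], end[t]) for t in start}

def interfere(a, b):
    """Ranges that merely touch (one ends where the other starts) do not interfere."""
    (sa, ea), (sb, eb) = a, b
    return sa < eb and sb < ea

BLOCK = [
    ("t2", ["t1"]),          # t2 ← mul t1, 2
    ("t3", ["t1"]),          # t3 ← sub t1, 2
    ("t4", ["t2", "t3"]),    # t4 ← add t2, t3
    (None, ["t4"]),          # return t4
]
RANGES = live_ranges(BLOCK, live_in=["t1"])
INTERFERING = sorted(
    (a, b) for a, b in combinations(sorted(RANGES), 2)
    if interfere(RANGES[a], RANGES[b])
)
print(INTERFERING)
```

Under these assumptions only t2 conflicts with t1 and t3, which is consistent with the slide's third column where t1, t3 and t4 share r1 and t2 gets r2.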

Linear SSA (LSSA)

[Figure: CFG of fac in LSSA form]
  instructions: t3 ← li · t4 ← slti t2 · bne t4 · t8 ← mul t7, t6 · t9 ← subiu t6 · bgtz t9 · jr t10
  congruences:  t1 ≡ t5, t2 ≡ t6, t3 ≡ t7, t6 ≡ t9, t7 ≡ t8, t5 ≡ t10, t1 ≡ t10, t8 ≡ t11, t3 ≡ t11

• Linear live range of a temporary cannot span block boundaries
• Liveness across blocks defined by temporary congruence
  t ≡ t’ ⇔ represent same original temporary
• Example: t3, t7, t8, t11 are congruent
  • correspond to the program variable f (factorial result)
  • not discussed: t1 return address, t2 first argument, t11 return value
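
Congruence classes can be computed with a small union-find. This is a sketch: the pairs below are read off the garbled figure text of this slide and should be treated as an assumption, apart from the class {t3, t7, t8, t11} that the slide states explicitly. The helper name `congruence_classes` is illustrative.

```python
def congruence_classes(pairs):
    """Group temporaries connected by congruences (union-find with path halving)."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for t in list(parent):
        classes.setdefault(find(t), set()).add(t)
    # Sort members by temporary number so the result is deterministic.
    return sorted(sorted(c, key=lambda t: int(t[1:])) for c in classes.values())

CONGRUENCES = [
    ("t1", "t5"), ("t2", "t6"), ("t3", "t7"), ("t6", "t9"), ("t7", "t8"),
    ("t5", "t10"), ("t1", "t10"), ("t8", "t11"), ("t3", "t11"),
]
CLASSES = congruence_classes(CONGRUENCES)
print(CLASSES)
```

Each resulting class corresponds to one original temporary; global register allocation then has to relate the registers of the members of a class.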

Linear SSA (LSSA)

• Advantage
  • simple modeling for linear live ranges
  • enables problem decomposition for solving

Spilling

• If not enough registers available: spill

• Spilling moves temporary to memory (stack)
  • store in memory after defined
  • load from memory before used
  • memory access typically considerably more expensive
  • decision on spilling crucial for performance
• Architectures might have more than one register file
  • some instructions only capable of addressing a particular file
  • “spilling” from one register bank to another

Coalescing

• Temporaries d (“destination”) and s (“source”) are move-related if
    d ← s
  • d and s should be coalesced (assigned to same register)
  • coalescing saves move instructions and registers
• Coalescing is important
  • due to how registers are managed (calling convention, callee-save)
  • due to using LSSA for our model (congruence)
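
The effect of coalescing on the emitted code can be sketched as follows. The instruction encoding, the three-instruction example and both register assignments are hypothetical and chosen only to illustrate the idea: once d and s share a register, the move d ← s does nothing and can be dropped.

```python
def drop_coalesced_moves(code, reg):
    """Rewrite temporaries to registers and drop moves made redundant by coalescing.

    code: list of (opcode, destination, sources) over temporaries
    reg:  hypothetical register assignment {temporary: register}
    """
    out = []
    for op, dst, srcs in code:
        if op == "move" and reg[dst] == reg[srcs[0]]:
            continue  # d and s share a register: the move is a no-op
        out.append((op, reg[dst], [reg[s] for s in srcs]))
    return out

CODE = [("li", "t1", []), ("move", "t2", ["t1"]), ("addiu", "t3", ["t2"])]
COALESCED = drop_coalesced_moves(CODE, {"t1": "r1", "t2": "r1", "t3": "r2"})
SEPARATE = drop_coalesced_moves(CODE, {"t1": "r1", "t2": "r3", "t3": "r2"})
print(len(COALESCED), len(SEPARATE))
```

With t1 and t2 in the same register the move disappears and one register is saved; with different registers the move has to be kept.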

Copy Operations

• Copy operations replicate a temporary t to a temporary t’
    t’ ← {i1, i2, …, in} t
  • copy is implemented by one of the alternative instructions i1, i2, …, in
  • instruction depends on where t and t’ are stored
  • similar to [Appel & George, 2001]
• Example MIPS32
    t’ ← {move, sw, nop} t
  • t’ memory and t register: sw (spill)
  • t’ register and t register: move (move-related)
  • t’ and t same register: nop (coalescing)
  • MIPS32: instructions can only be performed on registers

Alternative Temporaries

• Program representation uses operands and alternative temporaries
  • enable substitution of temporaries that hold the same value
• Alternative temporaries realize ultimate coalescing
  • all temporaries which are copy-related can be coalesced
  • opposed to naïve coalescing: temporaries which are not live at the same time can be coalesced
• Alternative temporaries enable spill code optimization
  • possibly reuse spilled temporary defined by load instruction
• Significant impact on code quality

Register Allocation Approach

• Local register allocation
  • perform register allocation per basic block
  • possible as temporaries are not shared among basic blocks
• Local register assignment as geometrical packing problem
  • take width of temporaries into account
  • also known as “register packing”
• Global register allocation
  • force temporaries into same registers across blocks

Unified Register Array

[Figure: unified register array: registers r0, r1, …, rn-1 followed by memory registers m0, m1, …]

• Unified register array
  • limited number of registers for each register file
  • memory is just another “register” file
  • unlimited number of memory “registers”

Geometrical Interpretation

[Figure: unified register array (r0, r1, …, rn-1, m0, m1, …) on the horizontal axis, clock cycle on the vertical axis; temporary t drawn as a rectangle]

• Temporary t is rectangle
  • width is 1 (occupies one register)
  • top = issue cycle of defining instruction (t ← …)
  • bottom = last issue cycle of using instructions (… ← t)

Register Assignment

[Figure: unified register array (r0, r1, …, rn-1, m0, m1, …) on the horizontal axis, clock cycle on the vertical axis; temporary t drawn as a rectangle]

• Register assignment = geometric packing problem
  • find horizontal coordinates for all temporaries
  • such that no two rectangles for temporaries overlap
  • corresponds to global constraint (no-overlap)

Register Packing

• Temporaries might have different width width(t)
  • many processors support access to register parts
  • still modeled as geometrical packing problem [Pereira & Palsberg, 2008]
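
Register packing as rectangle placement can be sketched with a small backtracking search. This is an illustration only: a constraint solver would use a no-overlap propagator rather than plain enumeration. The temporaries use the start/end/width values of the x86 example in this deck, assuming width(t2) = 2; the names `pack`, `overlaps` and `ATOMS` are made up here, and width-2 temporaries are restricted to even positions (AH, BH, CH).

```python
ATOMS = ["AH", "AL", "BH", "BL", "CH", "CL"]   # register parts, left to right

def overlaps(a, b):
    """Rectangles are (x, width, start, end), half-open in both dimensions."""
    (xa, wa, sa, ea), (xb, wb, sb, eb) = a, b
    return xa < xb + wb and xb < xa + wa and sa < eb and sb < ea

def pack(temps, n_atoms=len(ATOMS)):
    """temps: {name: (start, end, width)}; returns {name: first atom index} or None."""
    names = sorted(temps, key=lambda t: -temps[t][2])    # widest first
    placed = {}

    def place(i):
        if i == len(names):
            return True
        start, end, width = temps[names[i]]
        for x in range(0, n_atoms - width + 1, width):   # aligned positions only
            rect = (x, width, start, end)
            if not any(overlaps(rect, r) for r in placed.values()):
                placed[names[i]] = rect
                if place(i + 1):
                    return True
                del placed[names[i]]
        return False

    return {n: r[0] for n, r in placed.items()} if place(0) else None

TEMPS = {"t1": (0, 1, 1), "t2": (0, 2, 2), "t3": (0, 1, 1), "t4": (1, 2, 2)}
PACKING = pack(TEMPS)
print({t: ATOMS[x] for t, x in PACKING.items()})
```

With three 16 bit registers a packing exists; with a single one (`n_atoms=2`) the function returns `None`, which is the situation where spilling becomes necessary.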

Register Packing

[Figure: registers AX, BX, CX split into parts AH, AL, BH, BL, CH, CL on the horizontal axis, clock cycle on the vertical axis; rectangles for temporaries t1, t2, t3, t4]
  start(t1)=0   end(t1)=1   width(t1)=1
  start(t2)=0   end(t2)=2   width(t2)=2
  start(t3)=0   end(t3)=1   width(t3)=1
  start(t4)=1   end(t4)=2   width(t4)=2

• Example: Intel x86
  • assign two 8 bit temporaries (width = 1) to 16 bit register (width = 2)
  • register parts: AH, AL, BH, BL, CH, CL
  • possible for 8 bit: AH, AL, BH, BL, CH, CL
  • possible for 16 bit: AH, BH, CH

Global Register Allocation

• Enforce that congruent temporaries are assigned to same register

• If register pressure is low…
  • copy instructions might disappear (nop) = coalescing
• If register pressure is high…
  • copy instructions might be implemented by a move (move) = no coalescing
  • copy instructions might be implemented by a load/store (lw, sw) = spill

Local Instruction Scheduling

[Figure: dependency graph of the first basic block with nodes in, li, slti, bne, out; edges labeled with latencies and the temporaries t1 … t4 they carry]

• Data and control dependencies
  • data, control, artificial (for making in and out first/last)
  • again ignored: t1 return address, t2 first argument
• If instruction i depends on j
  issue distance of operation for i must be at least latency of operation for j

Limited Processor Resources

• Processor resources
  • functional units
  • data buses
• Classical cumulative scheduling problem
  • processor resource has capacity (functional units: #units)
  • instructions occupy parts of resource (1 unit)
  • resource consumption can never exceed capacity
• Also modeled as resources
  • instruction bundle width for VLIW processor
  • how many instructions can be issued simultaneously
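
The two scheduling conditions, latency between dependent instructions and limited resource capacity, can be sketched as checks plus a greedy scheduler. This is an assumption-laden illustration: the greedy loop is a simple heuristic, not the constraint-based solving described in this talk, each instruction is assumed to occupy the resource only in its issue cycle, and the dependency, latency and capacities below are hypothetical.

```python
from collections import Counter

def precedence_ok(issue, deps):
    """deps holds (j, i, latency): i depends on j."""
    return all(issue[i] - issue[j] >= latency for j, i, latency in deps)

def cumulative_ok(issue, usage, capacity):
    """Resource consumption per cycle must never exceed the capacity."""
    load = Counter()
    for op, cycle in issue.items():
        load[cycle] += usage[op]
    return all(used <= capacity for used in load.values())

def schedule(ops, deps, usage, capacity):
    """Greedy list scheduling; ops must be listed in dependency order."""
    issue, load = {}, Counter()
    for op in ops:
        assert usage[op] <= capacity, "instruction can never be issued"
        cycle = max([issue[j] + lat for j, i, lat in deps if i == op], default=0)
        while load[cycle] + usage[op] > capacity:
            cycle += 1
        issue[op] = cycle
        load[cycle] += usage[op]
    return issue

OPS = ["li", "slti", "bne"]
DEPS = [("slti", "bne", 1)]                 # bne uses the result of slti
USAGE = {"li": 1, "slti": 1, "bne": 1}      # one issue slot each
NARROW = schedule(OPS, DEPS, USAGE, capacity=1)
WIDE = schedule(OPS, DEPS, USAGE, capacity=2)
print(NARROW, WIDE)
```

With capacity 1 the three instructions issue in consecutive cycles; with capacity 2, as for a bundle width of two, li and slti can share a cycle.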

RESULTS

Code Quality

• Compared to LLVM 3.3 for Qualcomm’s Hexagon V4
  • 7% mean improvement
  • Provably optimal for 29% of functions
  • model limitation: no re-materialization

Scalability

• Quadratic average complexity up to 1000 instructions

Optimizing for Size

• Code size improvement over LLVM 3.3
  • 1% mean improvement
• Important: straightforward replacement of optimization criterion

Impact Alternative Temporaries

• 62% of functions become faster, none slower
• 2% mean improvement

DISCUSSION

Related Approaches

• Idea and motivation in Unison for combinatorial optimization is absolutely not new!
  • starting in the early 1990s [Castañeda Lozano & Schulte, Survey on Combinatorial Register Allocation and Instruction Scheduling, CoRR, 2014]
• Common to pretty much all approaches: compilation unit is basic block
• Approaches differ
  • which code generation tasks covered
  • which technology used (ILP, CLP, SAT, Stochastic Optimization, ...)
• Common challenge: robustness and scalability

Unique to Unison Approach

• First global approach (function as compilation unit)

• Constraint programming using global constraints

  • sweet spot: cumulative and nooverlap are state-of-the-art!
• Full register allocation with ultimate coalescing, packing, spilling, and spill code optimization
  • spilling is internalized
• Robust at the expense of optimality
  • problem decomposition
• But: instruction selection not yet there!

Project Context

• Goal
  • deliver a deployable product to Ericsson
  • possibly open source software to LLVM
• Project funded by
  • Ericsson: 2010-2015 13.3 MSEK, 2015-2017 5.8 MSEK
  • Vetenskapsrådet: 2012-2014 2.4 MSEK
• Connected project: several tools generated from processor description
  • code generator just one tool
  • other tools: assembler, linker, ...