Hardware-Software Interface

Machine vs. Program

• Machine: available resources are statically fixed; designed to support a wide variety of programs; interested in running many programs fast.
• Program: required resources vary dynamically; designed to run well on a variety of machines; interested in having itself run fast.

Execution time = tcyc x CPI x instruction count

This product reflects how well the machine resources match the program requirements.

Compiler Tasks

• Code translation: source language → target language
  – e.g., FORTRAN → C; C → MIPS, PowerPC or Alpha machine code; MIPS binary → Alpha binary
• Code optimization
  – Make the code run faster
  – Match dynamic code behavior to the static machine structure

Compiler Structure

High-level source code → Front End → IR → Optimizer / Dependence Analyzer (machine independent) → IR → Back End (machine dependent) → machine code

(IR = intermediate representation)

Structure of Optimizing Compilers

• Front ends (Front-end #1, Front-end #2, ...), one per source language, translate source programs into a high-level intermediate language (HIL); tools such as lex and yacc and a program database support this stage.
• Middle end: a high-level optimizer produces optimized HIL; lowering of the IL yields a low-level intermediate language (LIL); a low-level optimizer produces optimized LIL.
• Back ends: per-target code generators and linkers (Target-1, Target-2, Target-3, ...) produce the target executables, supported by runtime systems.

Front End

• Lexical analysis (e.g., lex)
  – Catches errors such as a misspelled identifier, keyword, or operator
• Syntax analysis (e.g., yacc)
  – Catches grammar errors, such as mismatched parentheses
• Semantic analysis
  – Type checking

Front-end

1. Scanner: converts the input character stream into a stream of lexical tokens.
2. Parser: derives the syntactic structure (parse tree, abstract syntax tree) from the token stream, and reports any syntax errors encountered.
3. Semantic analysis: generates the intermediate-language representation from the input source program and user options/directives, and reports any semantic errors encountered.
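To make the scanner step concrete, here is a minimal sketch in C of a hand-written scanner for identifiers, integer literals, and single-character operators. The token kinds and the next_token interface are assumptions made for this sketch; they are not the front end used in the course.

    #include <ctype.h>
    #include <stdio.h>

    /* Illustrative token kinds -- names are assumptions for this sketch. */
    enum tok_kind { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF };

    struct token {
        enum tok_kind kind;
        char text[64];
    };

    /* Read one token from stdin: skip whitespace, then group characters
       into an identifier, an integer literal, or a one-character operator. */
    static struct token next_token(void)
    {
        struct token t = { TOK_EOF, "" };
        int c = getchar();
        int n = 0;

        while (c != EOF && isspace(c))          /* skip whitespace */
            c = getchar();
        if (c == EOF)
            return t;

        if (isalpha(c) || c == '_') {           /* identifier or keyword */
            t.kind = TOK_ID;
            do { if (n < 63) t.text[n++] = (char)c; c = getchar(); }
            while (isalnum(c) || c == '_');
            ungetc(c, stdin);
        } else if (isdigit(c)) {                /* integer literal */
            t.kind = TOK_NUM;
            do { if (n < 63) t.text[n++] = (char)c; c = getchar(); }
            while (isdigit(c));
            ungetc(c, stdin);
        } else {                                /* single-character operator */
            t.kind = TOK_OP;
            t.text[n++] = (char)c;
        }
        t.text[n] = '\0';
        return t;
    }

    int main(void)
    {
        for (struct token t = next_token(); t.kind != TOK_EOF; t = next_token())
            printf("token kind=%d text=%s\n", t.kind, t.text);
        return 0;
    }

A real front end (e.g. one generated by lex) adds keyword recognition, multi-character operators, and error reporting, but the character-grouping idea is the same.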


Intermediate Representation

• Achieves retargetability
  – Different source languages
  – Different target machines
• Example (tree-based IR from CMCC) for the source fragment: int a, b, c, d;  d = a * (b + c);
  – Graphical representation: ASGI( &d, MULI( INDIRI(&a), ADDI( INDIRI(&b), INDIRI(&c) ) ) )
  – Linear form of the graphical representation (A0-A3 are the symbol-table entries for a, b, c, d):
      FND1  ADDRL  A3
      FND2  ADDRL  A0
      FND3  INDIRI FND2
      FND4  ADDRL  A1
      FND5  INDIRI FND4
      FND6  ADDRL  A2
      FND7  INDIRI FND6
      FND8  ADDI   FND5 FND7
      FND9  MULI   FND3 FND8
      FND10 ASGI   FND1 FND9

High-level Optimizer

• Global intra-procedural and inter-procedural analysis of the source program's control and data flow
• Selection of high-level optimizations and transformations
• Update of the high-level intermediate language

Lowering of Intermediate Language

• Linearized storage/mapping of variables
  – e.g., 2-d array to 1-d array
• Array/structure references → load/store operations
  – e.g., A[i] to "load R1,(R0)" where R0 contains the address of A[i]
• High-level control structures → low-level control flow
  – e.g., a "while" statement to branch statements

Machine-Independent Optimizations

• Dataflow analysis and optimizations
  – Constant propagation
  – Copy propagation
• Elimination of common subexpressions
• Strength reduction
• Function/procedure inlining

Code-Optimizing Transformations

• Constant propagation: replace a use of a variable with its known constant value.
• Constant folding: evaluate constant expressions at compile time, e.g. (1 + 2) ⇒ 3, (100 > 0) ⇒ true.
• Copy propagation:
      x = b + c          x = b + c
      z = y * x     ⇒    z = y * (b + c)
• Common subexpression elimination:
      x = b * c + 4      t = b * c
      z = b * c - 1  ⇒   x = t + 4
                         z = t - 1
• Dead code elimination: remove an assignment such as x = 1 when x is redefined (e.g. by x = b + c or x = 3) before being read, or when x is not referred to at all.

Code Optimization Example

  Original:                  After constant propagation:
      x = 1                      x = 1
      y = a * b + 3              y = a * b + 3
      z = a * b + x + z + 2      z = a * b + 1 + z + 2
      x = 3                      x = 3

  After constant folding and dead code elimination:
      y = a * b + 3
      z = a * b + 3 + z
      x = 3

  After common subexpression elimination:
      t = a * b + 3
      y = t
      z = t + z
      x = 3

Code Motion

• Move code between basic blocks
• E.g., move loop-invariant computations outside of loops:

      while ( i < 100 ) {          t = x / y
        *p = x / y + i             while ( i < 100 ) {
        i = i + 1            ⇒       *p = t + i
      }                              i = i + 1
                                   }

Strength Reduction

• Replace complex (and costly) expressions with simpler ones
• E.g., a := b * 17  ⇒  a := (b << 4) + b
• E.g., replace the address and multiply computations in a loop with running updates (loop invariants maintained: &a[i] == p, i*100 == t):

      while ( i < 100 ) {          p = & a[ i ]
        a[ i ] = i * 100           t = i * 100
        i = i + 1            ⇒     while ( i < 100 ) {
      }                              *p = t
                                     t = t + 100
                                     p = p + 4
                                     i = i + 1
                                   }

Loop Optimizations

• Motivation: restructure the program so as to enable more effective back-end optimizations and hardware exploitation.

Induction Variable Elimination

• Induction variable: the loop index.
• Consider a loop of the form for (i = 0; i < n; i++) { ... } in which i is used mainly to compute array addresses; the address computation can be updated incrementally instead of being recomputed from i on every iteration (a sketch follows below).
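A minimal before/after sketch of induction variable elimination, written as source-level C for readability. The array name and loop bound are assumptions, since the loop in the original slide is truncated; a compiler would perform this transformation on the IR rather than on the source.

    /* Before: the address of a[i] and the value i*100 are recomputed
       from i on every iteration. */
    void scale_before(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = i * 100;
    }

    /* After: both quantities are kept as induction variables and updated
       with cheap additions (strength reduction); the loop index i no
       longer appears in the body at all. */
    void scale_after(int *a, int n)
    {
        int *p = a;                  /* invariant: p == &a[i]   */
        int t = 0;                   /* invariant: t == i * 100 */
        for (int i = 0; i < n; i++) {
            *p = t;
            p += 1;                  /* advance by one element  */
            t += 100;
        }
    }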

Importance of Loop Optimizations

Study of loop-intensive benchmarks in the SPEC92 suite [C.J. Newburn, 1991]:

  Program     No. of Loops   Static B.B. Count   Dynamic B.B. Count   % of Total
  nasa7             9               ---                322M               64%
                   16               ---                362M               72%
                   83               ---                500M              ~100%
  matrix300         1                17              217.6M               98%
                   15                96              221.2M               98+%
  tomcatv           1                 7               26.1M               50%
                    5                22               52.4M               99+%
                   12                96               54.2M              ~100%

Loop Optimizations

• Loops are good targets for optimization.
• Basic loop optimizations:
  – code motion;
  – induction-variable elimination;
  – strength reduction (x*2 -> x<<1).
• Improve performance by unrolling the loop (see the sketch below).
  – Note the impact when using processors that allow parallel execution of instructions, e.g. Texas Instruments' newer DSP processors.
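As an illustration of unrolling, here is a minimal C sketch. The element-wise loop and the unroll factor of 4 are assumptions chosen for the example; a real compiler picks the factor from the target's resources and emits the same kind of cleanup loop shown here.

    /* Before: one add per iteration, one branch per element. */
    void add_before(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* After: unrolled by 4.  Each iteration exposes four independent adds
       that an ILP processor can issue in parallel, and the branch overhead
       is paid once per four elements.  The second loop handles n % 4. */
    void add_unrolled(float *c, const float *a, const float *b, int n)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
        for (; i < n; i++)            /* cleanup for leftover elements */
            c[i] = a[i] + b[i];
    }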

Function Inlining

• Replace function calls with the function body.
• Increases compilation scope (increases ILP), e.g. constant propagation and common subexpression elimination across the former call.
• Reduces function call overhead, e.g. passing arguments, register saves and restores.

  [W.M. Hwu, 1991 (DEC 3100)]
  Program     In-line Speedup   In-line Code Expansion
  cccp             1.06               1.25
  compress         1.05               1.00+
  equ              1.12               1.21
  espresso         1.07               1.09
  lex              1.02               1.06
  tbl              1.04               1.18
  xlisp            1.46               1.32
  yacc             1.03               1.17

Back End

  Instruction-level IR → code selection → code scheduling → register allocation → code emission → machine code

• Code selection: map the IR to machine instructions.
• Code scheduling: rearrange the code.
• Register allocation: map virtual registers onto architected registers.
• Target-machine-specific optimizations:
  – delayed branch
  – conditional move
  – instruction combining (e.g. auto-increment addressing mode, add carrying (PowerPC), hardware branch (PowerPC))

Code Selection

• Map the IR to machine instructions, e.g. by pattern matching over the IR tree.
• For the tree ASGI( &d, MULI( INDIRI(&a), ADDI( INDIRI(&b), INDIRI(&c) ) ) ) the matcher can emit:

      addi Rt1, Rb, Rc
      muli Rt2, Ra, Rt1

• Sketch of the tree matcher from the slide (lightly cleaned up):

      Inst *match(IR *n) {
          Inst *inst = NULL, *l, *r;
          switch (n->opcode) {
          case ...:
              ...
          case MUL:
              l = match(n->left());
              r = match(n->right());
              if (n->type == D || n->type == F)
                  inst = mult_fp((n->type == D), l, r);
              else
                  inst = mult_int((n->type == I), l, r);
              break;
          case ADD:
              l = match(n->left());
              r = match(n->right());
              if (n->type == D || n->type == F)
                  inst = add_fp((n->type == D), l, r);
              else
                  inst = add_int((n->type == I), l, r);
              break;
          case ...:
              ...
          }
          return inst;
      }

Our Old Friend: CPU Time

• CPU time = CPI * IC * clock cycle time
• What do the various optimizations affect?
  – Function inlining
  – Code-optimizing transformations
  – Code selection

Machine Dependent Optimizations

• Register allocation
• Peephole optimizations

Peephole Optimizations

• Replacement of short assembly instruction sequences through template matching.
• E.g., replacing one addressing mode with another in a CISC.

Code Scheduling

• Rearrange the code sequence to minimize execution time
  – Hide instruction latency
  – Utilize all available resources
• Example (the f4/f2 loads feed fadd/fsub, whose results feed the fmul):

      l.d  f4, 8(r8)
      l.d  f2, 16(r8)
      fadd f5, f4, f6
      fsub f7, f2, f6
      fmul f7, f7, f5
      s.d  f7, 24(r8)
      l.d  f8, 0(r9)
      s.d  f8, 8(r9)

  Reordering the independent l.d f8 / s.d f8 pair (which requires memory disambiguation to show it cannot conflict with the other memory references) so that it fills the stall cycles before the fmul and s.d removes the idle cycles from the sequence.

Cost of Instruction Scheduling

• Given a program segment, the goal is to execute it as quickly as possible.
• The completion time is the objective function, or cost, to be minimized; this is referred to as the makespan of the schedule.
• It has to be balanced against the running time and space needs of the algorithm for finding the schedule, which translate into compilation cost.

Instruction Scheduling Example

Source program:

      main(int argc, char *argv[])
      {
          int a, b, c;
          a = argc;
          b = a * 255;
          c = a * 15;
          printf("%d\n", b*b - 4*a*c);
      }

After scheduling (prior to register allocation):

      op 10  MPY  vr2      ← param1, 255
      op 12  MPY  vr3      ← param1, 15
      op 14  MPY  vr8      ← vr2, vr2
      op 15  SHL  vr9      ← param1, 2
      op 16  MPY  vr10     ← vr9, vr3
      op 17  SUB  param2   ← vr8, vr10
      op 18  MOV  param1   ← addr("%d\n")
      op 27  PBRR vb12     ← addr(printf)
      op 20  BRL  ret_addr ← vb12

The General Instruction Scheduling Problem

• Given a source program P, schedule the instructions so as to minimize the overall execution time on the functional units of the target machine.
• Feasible schedule: a specification of a start time for each instruction such that the following constraints are obeyed:
  1. Resource: the number of instructions of a given type executing at any time ≤ the corresponding number of FUs.
  2. Precedence and latency: for each predecessor j of an instruction i in the DAG, i is started only δ cycles after j finishes, where δ is the latency labeling the edge (j,i).
• Output: a schedule with the minimum overall completion time.

Instruction Scheduling

• Input: a basic block represented as a DAG.
• Example DAG: i1 precedes i2 and i3; both i2 and i3 precede i4. The edge (i2,i4) has latency 1; the other edges have latency 0.
• i2 is a load instruction. The latency of 1 on (i2,i4) means that i4 cannot start until one cycle after i2 completes.
• Two schedules for the above DAG:
  – S1 = i1, i3, i2, (idle), i4: the idle cycle is due to the latency on (i2,i4).
  – S2 = i1, i2, i3, i4: i3 fills the latency slot, so S2 is the desired sequence.

Why Register Allocation?

• Storing and accessing variables in registers is much faster than accessing data from memory.
  – Variables ought to be stored in registers.
• It is useful to keep variables in registers as long as possible, once they are loaded.
• Registers are bounded in number, so "register sharing" is needed over time.

Register Allocation

• Map virtual registers into physical registers.
  – Minimize register usage to reduce memory accesses,
  – but register reuse introduces false dependencies.
• Example (virtual code on the left; after allocation with f2, f4, f7, f8 → $f0, f5 → $f2, f6 → $f3 on the right):

      l.d  f4, 8(r8)           l.d  $f0, 8(r8)
      fadd f5, f4, f6          fadd $f2, $f0, $f3
      l.d  f2, 16(r8)          l.d  $f0, 16(r8)
      fsub f7, f2, f6          fsub $f0, $f0, $f3
      fmul f7, f7, f5          fmul $f0, $f0, $f2
      s.d  f7, 24(r8)          s.d  $f0, 24(r8)
      l.d  f8, 0(r9)           l.d  $f0, 0(r9)
      s.d  f8, 8(r9)           s.d  $f0, 8(r9)

  Reusing $f0 for f4, f2, f7 and f8 creates false dependencies that constrain later scheduling.

Cost of Register Allocation

• Maximizing the duration of operands in registers, or equivalently minimizing the amount of spilling, is the goal.
• Once again, the running time (complexity) and space used by the algorithm that does this is the compilation cost.

The Goal

• Primarily to assign registers to variables.
• However, the allocator runs out of registers quite often.
• Decide which variables to "flush" out of registers to free them up, so that other variables can be brought in: spilling.

Register Allocation and Assignment

• Allocation: identifying program values (virtual registers, live ranges) and the program points at which values should be stored in a physical register.
• Program values that are not allocated to registers are said to be spilled.
• Assignment: identifying which physical register should hold an allocated value at each program point.

Our Old Friend: CPU Time

• CPU time = CPI * IC * clock cycle time
• What do the various optimizations affect?
  – Instruction scheduling: stall cycles.
  – Register allocation: stall cycles due to false dependencies, and spill code.

Performance Analysis

• Elements of program performance (Shaw):
  – execution time = program path + instruction timing
• The path depends on data values: choose which case you are interested in.
• Instruction timing depends on pipelining and cache behavior.
• Importance of the compiler, in particular its back end.

Programs and Performance Analysis

• Best results come from analyzing optimized instructions, not high-level language code:
  – non-obvious translations of HLL statements into instructions;
  – code may move;
  – cache effects are hard to predict.

Instruction Timing

• Not all instructions take the same amount of time.
  – It is hard to get execution time data for individual instructions.
• Instruction execution times are not independent.
• Execution time may depend on operand values.

Trace-Driven Performance Analysis

• Trace: a record of the execution path of a program.
• The trace gives the execution path for performance analysis.
• A useful trace:
  – requires proper input values;
  – is large (gigabytes).
• Trace generation in hardware or software?

What are Execution Frequencies?

• Branch probabilities
• Average number of loop iterations
• Average number of procedure calls

How are Execution Frequencies Used?

• Focus optimization on the most frequently used regions
  – region-based compilation
• Provide a quantitative basis for evaluating the quality of optimization heuristics

How are Execution Frequencies Obtained?

• Profiling tools:
  – Mechanism: sampling vs. counting
  – Granularity: procedure vs. basic block
• Compile-time estimation:
  – Default values
  – Compiler analysis
  – The goal is to select the same set of program regions and optimizations that would be obtained from profiled frequencies

What are Execution Costs?

Cost of an intermediate code operation, parametrized according to the target architecture:
• Number of target instructions
• Resource requirement template
• Number of cycles

How are Execution Costs Used?

In conjunction with execution frequencies:
• Identify the most time-consuming regions of the program
• Provide a quantitative basis for evaluating the quality of optimization heuristics

How are Execution Costs Obtained?

• Simplistic translation of each intermediate code operation to the corresponding instruction template for the target machine

Cost Functions

• Effectiveness of the optimizations: how well can we optimize our objective function? Impact on the running time of the compiled code, determined by the completion time.
• Efficiency of the optimization: how fast can we optimize? Impact on the time it takes to compile, i.e. the cost paid for gaining the benefit of code with a fast running time.

Instruction Scheduling: Program Dependence Graph

Basic Graphs

• A graph is made up of a set of nodes (V) and a set of edges (E).
• Each edge has a source and a sink, both of which must be members of the node set, i.e. E ⊆ V × V.
• Edges may be directed or undirected.
  – A directed graph has only directed edges.
  – An undirected graph has only undirected edges.

Paths and Cycles

• A path is a sequence of edges leading from a source node to a sink node.
• A cycle is a path that leads back to its starting node; a graph with no cycles is acyclic.
• (The slides show example paths and cycles in undirected and directed graphs.)

Connected Graphs

• A connected (undirected) graph has a path between every pair of vertices; otherwise it is unconnected.

Connectivity of Directed Graphs

• A strongly connected directed graph is one which has a path from each vertex to every other vertex.
• (The slide's example asks whether a given directed graph on vertices A-G is strongly connected.)

Program Dependence Graph

• The Program Dependence Graph (PDG) is an intermediate (abstract) representation of a program designed for use in optimizations.
• It consists of two important graphs:
  – the Control Dependence Graph, which captures control flow and control dependence;
  – the Data Dependence Graph, which captures data dependences.

Control Flow Graphs

• Motivation: a language-independent and machine-independent representation of control flow in programs, used in high-level and low-level code optimizers.
• The flow graph data structure lends itself to the use of several important algorithms from graph theory.

Control Flow Graph: Definition

A control flow graph CFG = (Nc, Ec, Tc) consists of:
• Nc, a set of nodes. A node represents a straight-line sequence of operations with no intervening control flow, i.e. a basic block.
• Ec ⊆ Nc x Nc x Labels, a set of labeled edges.
• Tc, a node type mapping. Tc(n) identifies the type of node n as one of START, STOP, OTHER.

We assume that the CFG contains a unique START node and a unique STOP node, and that for any node N in the CFG there exist directed paths from START to N and from N to STOP.

CFG From Trimaran

Trimaran partitions the following function into basic blocks (labeled BB1, BB2, ... in the slide), which become the CFG nodes:

      main(int argc, char *argv[])
      {
          if (argc == 1) {
              printf("1");
          } else {
              if (argc == 2) {
                  printf("2");
              } else {
                  printf("others");
              }
          }
          printf("done");
      }
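For concreteness, here is a minimal sketch of how a CFG like the one defined above might be represented in C. The field names are illustrative assumptions, not Trimaran's actual data structures.

    #include <stddef.h>

    struct instr;                       /* opaque: one operation in a block */

    /* Node types from the CFG definition above. */
    enum cfg_node_type { CFG_START, CFG_STOP, CFG_OTHER };

    /* Illustrative edge labels (branch taken / not taken / fall-through). */
    enum cfg_edge_label { EDGE_TAKEN, EDGE_NOT_TAKEN, EDGE_FALLTHROUGH };

    struct cfg_edge {
        struct cfg_node *from;          /* source basic block            */
        struct cfg_node *to;            /* sink basic block              */
        enum cfg_edge_label label;      /* element of the Labels set     */
    };

    struct cfg_node {
        int id;                         /* e.g. 1 for BB1, 2 for BB2 ... */
        enum cfg_node_type type;        /* Tc(n): START, STOP or OTHER   */
        struct instr *first, *last;     /* straight-line op sequence     */
        struct cfg_edge **succ;         /* outgoing labeled edges (Ec)   */
        size_t num_succ;
        struct cfg_edge **pred;         /* incoming labeled edges        */
        size_t num_pred;
    };

    struct cfg {
        struct cfg_node **nodes;        /* Nc                            */
        size_t num_nodes;
        struct cfg_node *start, *stop;  /* the unique START and STOP     */
    };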

Data and Control Dependences

• Motivation: identify only the essential control and data dependences which need to be obeyed by transformations for code optimization.
• The Program Dependence Graph (PDG) consists of:
  1. a set of nodes, as in the CFG;
  2. control dependence edges;
  3. data dependence edges.
• Together, the control and data dependence edges dictate whether or not a proposed code transformation is legal.

Data Dependence Analysis

• If two operations have potentially interfering data accesses, data dependence analysis is necessary for determining whether or not an interference actually exists. If there is no interference, it may be possible to reorder the operations or execute them concurrently.
• The data accesses examined for data dependence analysis may arise from array variables, scalar variables, procedure parameters, pointer dereferences, etc. in the original source program.
• Data dependence analysis is conservative, in that it may state that a data dependence exists between two statements when actually none exists.

Data Dependence: Definition

A data dependence, S1 → S2, exists between CFG nodes S1 and S2 with respect to variable X if and only if

1. there exists a path P: S1 → S2 in the CFG, with no intervening write to X, and
2. at least one of the following is true:
   (a) (flow) X is written by S1 and later read by S2, or
   (b) (anti) X is read by S1 and later written by S2, or
   (c) (output) X is written by S1 and later written by S2.

Def/Use Chaining for Data Dependence Analysis

A def-use chain links a definition D (i.e. a write access) of variable X to each use U (i.e. a read access), such that there is a path from D to U in the CFG that does not redefine X.

Similarly, a use-def chain links a use U to a definition D, and a def-def chain links a definition D to a definition D' (with no intervening write to X in all cases).

Def-use, use-def, and def-def chains can be computed by data flow analysis, and provide a simple but conservative way of enumerating flow, anti, and output data dependences.
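The three dependence kinds can be seen in a small C fragment; the statement labels S1-S4 are only for discussion.

    int deps_example(int a, int b)
    {
        int x, y;
        x = a + 1;      /* S1: writes x                                    */
        y = x * 2;      /* S2: reads x  -> flow (true) dependence S1 → S2  */
        x = b - 3;      /* S3: writes x -> anti dependence S2 → S3 (S2
                           reads x before S3 overwrites it) and output
                           dependence S1 → S3 (both S1 and S3 write x)     */
        y = x + y;      /* S4: flow dependences S3 → S4 and S2 → S4        */
        return y;
    }

Only the flow dependences reflect real value communication; the anti and output dependences come from reusing the name x and could be removed by renaming.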

Impact of Control Flow

• Acyclic control flow is easier to deal with than cyclic control flow. Problems in dealing with cyclic flow:
  – A loop implicitly represents a large run-time program space compactly.
  – It is not possible to open out the loops fully at compile time.
  – Loop unrolling provides a partial solution.

Impact of Control Flow (Contd.)

• Using the loop structure to optimize its dynamic behavior is a challenging problem.
• It is hard to optimize well without detailed knowledge of the range of the iteration.
• In practice, profiling can offer only limited help in estimating loop bounds.

Control Dependence Analysis

We want to capture two related ideas with control dependence analysis of a CFG:

1. Node Y should be control dependent on node X if node X evaluates a predicate (conditional branch) which can control whether node Y will subsequently be executed or not. This idea is useful for determining whether node Y needs to wait for node X to complete, even though they have no data dependences.

2. Two nodes, Y and Z, should be identified as having identical control conditions if, in every run of the program, node Y is executed if and only if node Z is executed. This idea is useful for determining whether nodes Y and Z can be made adjacent and executed concurrently, even though they may be far apart in the CFG.

Instruction Scheduling Algorithms

Acyclic Instruction Scheduling

• We will consider the case of acyclic control flow first.
• The acyclic case itself has two parts:
  – The simpler case, which we consider first, has no branching and corresponds to a basic block of code, e.g. a loop body.
  – The more complicated case of scheduling programs with acyclic control flow that contains branching is considered next.

The Core Case: Scheduling Basic Blocks

• Why are basic blocks easy?
• All instructions specified as part of the input must be executed.
• This allows deterministic modeling of the input.
• There are no "branch probabilities" to contend with, which makes the problem space easy to optimize using classical methods.

Instruction Scheduling

• Input: a basic block represented as a DAG.
• Example DAG (as before): i1 precedes i2 and i3; i2 and i3 precede i4; the edge (i2,i4) has latency 1, the other edges latency 0.
• i2 is a load instruction. The latency of 1 on (i2,i4) means that i4 cannot start until one cycle after i2 completes.

Instruction Scheduling (Contd.)

• Two schedules for the example DAG:
  – S1 = i1, i3, i2, (idle), i4: the idle cycle is due to latency.
  – S2 = i1, i2, i3, i4: the desired sequence.

The General Instruction Scheduling Problem

• Input: a DAG representing each basic block, where:
  1. Nodes encode unit-execution-time (single cycle) instructions.
  2. Each node requires a definite class of FUs.
  3. Additional pipeline delays are encoded as latencies on the edges.
  4. The number of FUs of each type in the target machine is given.

The General Instruction Scheduling Problem (Contd.)

• Feasible schedule: a specification of a start time for each instruction such that the following constraints are obeyed:
  1. Resource: the number of instructions of a given type executing at any time ≤ the corresponding number of FUs.
  2. Precedence and latency: for each predecessor j of an instruction i in the DAG, i is started only k cycles after j finishes, where k is the latency labeling the edge (j,i).
• Output: a schedule with the minimum overall completion time (makespan).

Drawing on Deterministic Scheduling

• Canonical list scheduling algorithm:
  1. Assign a rank (priority) to each instruction (or node).
  2. Sort and build a priority list of the instructions in non-decreasing order of rank.
     – Nodes with smaller ranks occur earlier.

Drawing on Deterministic Scheduling (Contd.)

• 3. Greedily list-schedule.
  – Scan iteratively and, on each scan, choose the largest number of "ready" instructions subject to resource (FU) constraints, in list order.
  – An instruction is ready provided
    • it has not been chosen earlier, and
    • all of its predecessors have been chosen and the appropriate latencies have elapsed.

Code Scheduling

• Objectives: minimize the execution latency of the program
  – Start instructions on the critical path as early as possible
  – Help expose more instruction-level parallelism to the hardware
  – Help avoid resource conflicts that increase execution time
• Constraints
  – Program precedences
  – Machine resources
• Motivations
  – Dynamic/Static Interface (DSI): by employing more software (static) optimization techniques at compile time, hardware complexity can be significantly reduced
  – Performance boost: even with the same complex hardware, software scheduling can provide additional performance enhancement over that of unscheduled code

Precedence Constraints

• Minimum required ordering and latency between definition and use.
• Precedence graph
  – Nodes: instructions
  – Edges (a → b): a precedes b
  – Edges are annotated with the minimum latency

Precedence Graph Example

FFT code fragment:

      w[i+k].ip = z[i].rp + z[m+i].rp;
      w[i+j].rp = e[k+1].rp * (z[i].rp - z[m+i].rp) -
                  e[k+1].ip * (z[i].ip - z[m+i].ip);

Compiled code (the slide draws its precedence graph; the edge annotations show latencies of 2 on load results and 4 on the multiply results feeding i15):

      i1:  l.s    f2, 4(r2)
      i2:  l.s    f0, 4(r5)
      i3:  fadd.s f0, f2, f0
      i4:  s.s    f0, 4(r6)
      i5:  l.s    f14, 8(r7)
      i6:  l.s    f6, 0(r2)
      i7:  l.s    f5, 0(r3)
      i8:  fsub.s f5, f6, f5
      i9:  fmul.s f4, f14, f5
      i10: l.s    f15, 12(r7)
      i11: l.s    f7, 4(r2)
      i12: l.s    f8, 4(r3)
      i13: fsub.s f8, f7, f8
      i14: fmul.s f8, f15, f8
      i15: fsub.s f8, f4, f8
      i16: s.s    f8, 0(r8)

Resource Constraints

• Bookkeeping
  – Prevent resources from being oversubscribed.
• Machine model in the slide's example: two integer units (I1, I2), one FP adder (FA) and one FP multiplier (FM); the code fadd f1, f1, f2; add r1, r1, 1; add r2, r2, 4; fadd f3, f3, f4 must be placed cycle by cycle without exceeding these units.

The Value of Greedy List Scheduling

• Example: consider the DAG shown in the slide (instructions i1 through i5).
• Using the priority list, greedy scanning produces the steps of the schedule as follows (continued on the next slide).

The Value of Greedy List Scheduling (Contd.)

• 1. On the first scan: i1, which is the first step.
• 2. On the second and third scans, and out of the list order, i4 and i5 respectively correspond to steps two and three of the schedule.
• 3. On the fourth and fifth scans, i2 and i3 are respectively scheduled in steps four and five.

List Scheduling for Basic Blocks

1. Assign a priority to each instruction.
2. Initialize the ready list that holds all ready instructions.
   (Ready = data ready and can be scheduled.)
3. Greedily choose one ready instruction I from the ready list with the highest priority, possibly using tie-breaking heuristics.
4. Insert I into the schedule, making sure resource constraints are satisfied.
5. Add those instructions whose precedence constraints are now satisfied into the ready list.

A code sketch of this algorithm follows below.
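A minimal sketch in C of the cycle-oriented greedy list scheduler described above, run on the four-instruction example DAG used earlier. The latency matrix, the priorities, and the single-issue width are assumptions made for illustration.

    #include <stdio.h>

    #define N 4            /* instructions i1..i4 from the running example */
    #define WIDTH 1        /* one instruction may start per cycle          */

    /* latency[j][i] >= 0 means there is an edge j -> i with that many
       extra idle cycles required after j issues before i may issue;
       -1 means no edge.  This encodes: i1 -> i2 and i1 -> i3 with
       latency 0, i2 -> i4 with latency 1, i3 -> i4 with latency 0. */
    static const int latency[N][N] = {
        /*        i1  i2  i3  i4 */
        /* i1 */ { -1,  0,  0, -1 },
        /* i2 */ { -1, -1, -1,  1 },
        /* i3 */ { -1, -1, -1,  0 },
        /* i4 */ { -1, -1, -1, -1 },
    };

    /* Higher value = higher priority (e.g. longest path to the leaves). */
    static const int priority[N] = { 3, 2, 1, 0 };

    int main(void)
    {
        int start[N];                    /* issue cycle of each instruction */
        int done = 0;
        for (int i = 0; i < N; i++) start[i] = -1;

        for (int cycle = 0; done < N; cycle++) {
            int issued = 0;
            while (issued < WIDTH) {
                int best = -1;
                for (int i = 0; i < N; i++) {
                    if (start[i] >= 0) continue;        /* already placed   */
                    int ready = 1;
                    for (int j = 0; j < N; j++) {       /* check predecessors */
                        if (latency[j][i] < 0) continue;
                        if (start[j] < 0 || cycle < start[j] + 1 + latency[j][i])
                            ready = 0;
                    }
                    if (ready && (best < 0 || priority[i] > priority[best]))
                        best = i;
                }
                if (best < 0) break;                    /* nothing ready now */
                start[best] = cycle;
                issued++;
                done++;
            }
        }

        for (int i = 0; i < N; i++)
            printf("i%d issues in cycle %d\n", i + 1, start[i]);
        return 0;
    }

On this input the sketch reproduces the good schedule S2 = i1, i2, i3, i4 with no idle cycle, because i3 is available to fill the latency slot between i2 and i4.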


Rank/Priority Functions/Heuristics

• Number of descendants in the precedence graph
• Maximum latency from the root node of the precedence graph
• Length of operation latency
• Ranking of paths based on importance
• Combinations of the above

Orientation of Scheduling

• Instruction oriented
  – Initialization (priority and ready list)
  – Choose one ready instruction I and find a slot in the schedule, making sure the resource constraints are satisfied
  – Insert I into the schedule
  – Update the ready list
• Cycle oriented
  – Initialization (priority and ready list)
  – Step through the schedule cycle by cycle
  – For the current cycle C, choose one ready instruction I, making sure the latency and resource constraints are satisfied
  – Insert I into the schedule (at cycle C)
  – Update the ready list

List Scheduling Example

• Small example: (a + b) * c, with operations ld a, ld b, ld c, add, mul and latencies: load 2 cycles, add 1 cycle, mult 2 cycles.
• Larger example: (a + b) * (c - d) + e/f, with operations ld a, ld b, ld c, ld d, ld e, ld f, fadd, fsub, fdiv, fmul, fadd and latencies: load 2 cycles, add 1 cycle, sub 1 cycle, mul 4 cycles, div 10 cycles.
• Scheduling parameters used in the example: orientation = cycle, direction = backward, heuristic = maximum latency to root.

Scalar Scheduling Example

With one instruction issued at a time (loads take 2 cycles, the add 1 cycle, the multiply 2 cycles), the schedule for (a + b) * c is:

  Cycle   Ready list    Scheduled   Code
  1-2     1,2,4,3,5     1           ld a
  3-4     2,4,3,5       2           ld b
  5       4,3,5         4           a+b
  6-7     3,5           3           ld c
  8-9     5             5           mult

ILP Scheduling Example

With a machine that has two memory ports and an ALU, independent operations can be issued in the same cycle:

  Cycle   Ready list    Scheduled   Code
  1-2     1,2,4,3,5     1,2         ld a, ld b
  3       4,3,5         4,3         a+b, ld c
  4       3,5           3           ld c (completes)
  5-6     5             5           mult

Some Intuition

• Greediness helps in making sure that idle cycles don't remain if there are available instructions further "downstream."
• Ranks help prioritize nodes such that choices made early on favor instructions with greater enabling power, so that there is no unforced idle cycle.
  – The rank/priority function is critical.

How Good is Greedy?

• Approximation: for any pipeline depth k ≥ 1 and any number m of pipelines,

      Sgreedy / Sopt ≤ 2 - 1/(mk).

How Good is Greedy? (Contd.)

• For example, with one pipeline (m = 1) and latencies k growing as 2, 3, 4, ..., the approximate schedule is guaranteed to have a completion time no more than 66%, 75%, and 80% over the optimal completion time.
• This theoretical guarantee shows that greedy scheduling is not bad, but the bounds are worst-case; practical experience tends to be much better.
• Running time of greedy list scheduling: linear in the size of the DAG.
• "Scheduling Time-Critical Instructions on RISC Machines," K. Palem and B. Simons, ACM Transactions on Programming Languages and Systems, vol. 15, 632-658, 1993.

A Critical Choice: The Rank Function for Prioritizing Nodes

Rank Functions

• 1. "Postpass Code Optimization of Pipeline Constraints", J. Hennessy and T. Gross, ACM Transactions on Programming Languages and Systems, vol. 5, 422-448, 1983.
• 2. "Scheduling Expressions on a Pipelined Processor with a Maximal Delay of One Cycle," D. Bernstein and I. Gertner, ACM Transactions on Programming Languages and Systems, vol. 11, no. 1, 57-66, Jan 1989.

Rank Functions (Contd.)

• 3. "Scheduling Time-Critical Instructions on RISC Machines," K. Palem and B. Simons, ACM Transactions on Programming Languages and Systems, vol. 15, 632-658, 1993.
• Optimality: references 2 and 3 produce optimal schedules for RISC processors such as the IBM 801, Berkeley RISC, and so on.

An Example Rank Function

• The example DAG is the one used before: i1 precedes i2 and i3, i2 and i3 precede i4, and the edge (i2,i4) has latency 1 while the other edges have latency 0.
• 1. Initially label all the nodes by the same value.
• 2. Compute new labels from old, starting with the nodes at level zero (i4) and working towards higher levels:
  (a) All nodes at level zero get the same initial rank.
  (b) For a node at level 1, construct a new label which is the concatenation of the labels of all its successors connected by a latency-1 edge (the edge i2 to i4 in this case).
  (c) The empty symbol is associated with latency-zero edges (the edge i3 to i4, for example).
  (d) The result is that i2 and i3 get new labels, and hence ranks, with the rank of i2 greater than the rank of i3; labels are drawn from a totally ordered alphabet.
  (e) The rank of i1 is the concatenation of the ranks of its immediate successors i2 and i3.
• 3. The resulting sorted list is (optimally) i1, i2, i3, i4.

Limitations of List Scheduling

• Cannot move instructions past conditional branch instructions in the program (scheduling is limited by basic block boundaries).
• Problem: many programs have small numbers of instructions (4-5) in each basic block, hence not much code motion is possible.
• Solution: allow code motion across basic block boundaries.
  – Speculative code motion: "jumping the gun"
    • execute instructions before we know whether or not we need to;
    • utilize otherwise idle resources to perform work which we speculate will need to be done.
  – Relies on program profiling to make intelligent decisions about speculation.

Getting Around Basic Block Limitations

• Basic block size limits the amount of parallelism available for extraction.
  – Need to consider more "flexible" regions of instructions.
• A well-known classical approach is to consider traces through the (acyclic) control flow graph.
  – We shall return to this when we cover compiling for ILP processors.

Traces

• "Trace Scheduling: A Technique for Global Microcode Compaction," J.A. Fisher, IEEE Transactions on Computers, Vol. C-30, 1981.
• Main ideas:
  – Choose a program segment that has no cyclic dependences.
  – Choose one of the paths out of each branch that is encountered.
• In the slide's CFG (START, BB-1 through BB-7, STOP), the highlighted trace is BB-1, BB-4, BB-6.

Revisiting a Typical Compilation Flow

Source Program → Front End → Intermediate Language → Back End (Scheduling, Register Allocation)

Rationale for Separating Register Allocation from Scheduling

• Each of scheduling and register allocation is hard to solve individually, let alone solve globally as a combined optimization.
• So, solve each optimization locally and heuristically "patch up" the two stages.

The Goal

• Primarily to assign registers to variables.
• However, the allocator runs out of registers quite often.
• Decide which variables to "flush" out of registers to free them up, so that other variables can be brought in: spilling.

Register Allocation and Assignment

• Allocation: identifying program values (virtual registers, live ranges) and the program points at which values should be stored in a physical register.
• Program values that are not allocated to registers are said to be spilled.
• Assignment: identifying which physical register should hold an allocated value at each program point.

Register Allocation – Key Concepts

• Determine the range of code over which a variable is used
  – Live ranges
• Formulate the problem of assigning variables to registers as a graph problem
  – Graph coloring
  – Use the application domain (instruction execution) to define the priority function

Live Ranges

Example (CFG with blocks BB1-BB7; a is defined in BB1 and used in BB3, BB5 and BB7):
• Live range of virtual register a = (BB1, BB2, BB3, BB4, BB5, BB6, BB7).
• Def-use chain of virtual register a = (BB1, BB3, BB5, BB7).

Computing Live Ranges

Using data flow analysis, we compute for each basic block:
• In the forward direction, the reaching attribute. A variable is reaching block i if a definition or use of the variable reaches the basic block along the edges of the CFG.
• In the backward direction, the liveness attribute. A variable is live at block i if there is a direct reference to the variable at block i, or at some block j that succeeds i in the CFG, provided the variable in question is not redefined in the interval between i and j.
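A compact sketch of the backward liveness computation described above, using one bit per variable and iterating to a fixed point. The tiny three-block CFG and its def/use sets are made-up inputs for illustration.

    #include <stdio.h>

    /* Backward liveness by iterative data flow analysis:
         live_out[b] = union of live_in over successors of b
         live_in[b]  = use[b] | (live_out[b] & ~def[b])
       Variables are bits in an unsigned int; the CFG below is a made-up
       example with edges BB0 -> BB1 and BB0 -> BB2. */

    #define NBLOCKS 3

    static const int nsucc[NBLOCKS]   = { 2, 0, 0 };
    static const int succ[NBLOCKS][2] = { {1, 2}, {0}, {0} };

    /* Bit 0 = a, bit 1 = b, bit 2 = c (illustrative). */
    static const unsigned def[NBLOCKS] = { 0x1, 0x2, 0x0 };  /* BB0 defs a, BB1 defs b */
    static const unsigned use[NBLOCKS] = { 0x0, 0x1, 0x5 };  /* BB1 uses a; BB2 uses a, c */

    int main(void)
    {
        unsigned live_in[NBLOCKS] = {0}, live_out[NBLOCKS] = {0};
        int changed = 1;

        while (changed) {                 /* iterate to a fixed point */
            changed = 0;
            for (int b = NBLOCKS - 1; b >= 0; b--) {
                unsigned out = 0;
                for (int s = 0; s < nsucc[b]; s++)
                    out |= live_in[succ[b][s]];
                unsigned in = use[b] | (out & ~def[b]);
                if (in != live_in[b] || out != live_out[b]) {
                    live_in[b] = in;
                    live_out[b] = out;
                    changed = 1;
                }
            }
        }

        for (int b = 0; b < NBLOCKS; b++)
            printf("BB%d: live_in=%#x live_out=%#x\n", b, live_in[b], live_out[b]);
        return 0;
    }

The blocks in which a variable's bit is set, intersected with the blocks it can reach, give the live range used by the allocator.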

Computing Live Ranges (Contd.)

The live range of a variable is the intersection of the set of CFG basic blocks in which the variable is live and the set of blocks which it can reach.

Global Register Allocation

• Local register allocation does not store data in registers across basic blocks. Local allocation has poor register utilization, so global register allocation is essential.
• Simple global register allocation: allocate the most "active" values in each inner loop.
• Full global register allocation: identify live ranges in the control flow graph, allocate live ranges, and split ranges as needed.
• Goal: select the allocation so as to minimize the number of load/store instructions performed by the optimized program.

Simple Example of Global Register Allocation

Control flow graph: B1: a = ...; then either B2: b = ... or B3: ... = a; both join at B4: ... = b.
• Live range of a = {B1, B3}
• Live range of b = {B2, B4}
• No interference! a and b can be assigned to the same register.

Another Example of Global Register Allocation

Control flow graph: B1: a = ...; then either B2: b = ... or B3: c = c + 1; both join at B4: ... = a + b.
• Live range of a = {B1, B2, B3, B4}
• Live range of b = {B2, B4}
• Live range of c = {B3}
• In this example, a and c interfere, and c should be given priority because it has a higher usage count.

Cost and Savings

• Compilation cost: running time and space of the global allocation algorithm.
• Execution savings: cycles saved due to register residence of variables in optimized program execution.
  – Contrast with memory residence, which leads to longer execution times.

Interference Graph

• Definition: an interference graph G is an undirected graph with the following properties:
  (a) each node x denotes exactly one distinct live range X, and
  (b) an edge exists between nodes x and y iff X and Y interfere (overlap), where X and Y are the live ranges corresponding to nodes x and y.

Interference Graph Example

Code (the live ranges overlap and hence interfere):

      a := ...
      b := ...
      c := ...
         := a
         := b
      d := ...
         := c
         := d

Nodes model live ranges. Here a, b and c are simultaneously live and mutually interfere, while d is live only alongside c, so the interference graph has edges a-b, a-c, b-c and c-d.

The Classical Approach

• "Register Allocation and Spilling via Graph Coloring", G. Chaitin, Proceedings of the SIGPLAN-82 Symposium on Compiler Construction, 98-105, 1982.
• "Register Allocation via Coloring", G. Chaitin, M. Auslander, A. Chandra, J. Cocke, M. Hopkins and P. Markstein, Computer Languages, vol. 6, 47-57, 1981.

The Classical Approach (Contd.)

• These works introduced the key notion of an interference graph for encoding conflicts between the live ranges.
• This notion was defined for the global control flow graph.
• They also introduced the notion of graph coloring to model the idea of register allocation.

Execution Time and Spill Cost

• Spilling: moving a variable that is currently register-resident to memory when no more registers are available and a new live range needs to be allocated one.
• Minimizing execution cost: given an optimistic assignment, i.e., one where all the variables are register-resident, minimize spilling.

Graph Coloring

• Given an undirected graph G and a set of k distinct colors, compute a coloring of the nodes of the graph, i.e., assign a color to each node such that no two adjacent nodes get the same color. Recall that two nodes are adjacent iff they have an edge between them.
• A given graph might not be k-colorable.
• In general, it is a computationally hard problem to color a given graph using a given number k of colors.
• The register allocation problem uses good heuristics for coloring.

Register Interference & Allocation

• Interference graph: G = <V, E>
  – Nodes (V) = variables (more specifically, their live ranges)
  – Edges (E) = interference between variable live ranges
• Graph coloring (vertex coloring)
  – Given a graph G = <V, E>, assign colors to the nodes (V) so that no two adjacent nodes (connected by an edge) have the same color.
  – A graph can be "n-colored" if no more than n colors are needed to color the graph.
  – The chromatic number of a graph is the minimum n such that it can be n-colored.
  – n-coloring is an NP-complete problem, therefore an optimal solution can take a long time to compute.
• How is graph coloring related to register allocation?

Register Allocation as Coloring

• Given k registers, interpret each register as a color.
• The graph G is the interference graph of the given program.
• The nodes of the interference graph are the executable live ranges on the target platform.
• A coloring of the interference graph is an assignment of registers (colors) to live ranges (nodes).
• Running out of colors implies there are not enough registers, and hence a need to spill in the above model.

Interference Graph

Example code (r1, r2 and r3 are live-in; r1 and r3 are live-out):

      beq r2, $0
      ld  r4, 16(r3)      ld  r5, 24(r3)
      sub r6, r2, r4      add r2, r1, r5
      sw  r6, 8(r3)
      add r7, r7, 1
      blt r7, 100

The nodes of the interference graph are the live ranges r1-r7; the edges represent interference.

Chaitin's Graph Coloring Theorem

• Key observation: if a graph G has a node X with degree less than n (i.e. having fewer than n edges connected to it), then G is n-colorable IFF the reduced graph G' obtained from G by deleting X and all its edges is n-colorable.
• Proof idea: color G' with n colors; since X has at most n-1 neighbors, some color remains free for X.

Graph Coloring Algorithm (Not Optimal)

• Assume the register interference graph is n-colorable.
  – How do you choose n?
• Simplification:
  – Remove all nodes with degree less than n, pushing each onto a COLOR stack.
  – Repeat until the graph has n nodes left.
• Assign each remaining node a different color.
• Add the removed nodes back one by one and pick a legal color as each one is added (two nodes connected by an edge get different colors).
  – This must be possible with fewer than n colors.
• Complication: simplification can block if there are no nodes with fewer than n edges.
  – Choose one node to spill based on a spilling heuristic.

Example (N = 4)

• On the interference graph over r1-r7 from the previous slide: starting from COLOR stack = {}, simplification removes r5 (COLOR stack = {r5}) and then blocks; the slide considers spilling r1 ("is this a good choice??"), after which r6 can be removed (COLOR stack = {r5, r6}).

Example (N = 5)

• With five colors, simplification proceeds without blocking: remove r5, then r6, then r4, giving COLOR stack = {r5}, {r5, r6}, {r5, r6, r4}, and so on until the graph is empty.

Register Spilling

• When simplification is blocked, pick a node to delete from the graph in order to unblock.
• Deleting a node implies the variable it represents will not be kept in a register (i.e. it is spilled into memory).
  – When constructing the interference graph, each node is assigned a value indicating the estimated cost to spill it.
  – The estimated cost can be a function of the total number of definitions and uses of that variable, weighted by its estimated execution frequency.
  – When the coloring procedure is blocked, the node with the least spilling cost is picked for spilling.
• When a node is spilled, spill code is added to the original code to store the spilled variable at its definition and to reload it at each of its uses.
• After spill code is added, a new interference graph is rebuilt from the modified code, and n-coloring of this graph is again attempted.
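A compact sketch in C of the simplify/select procedure just described, on a small hard-coded interference graph. The graph, K, and the "fewest remaining neighbors" spill candidate rule are illustrative assumptions, not a full Chaitin/Briggs allocator; note that a candidate pushed when nothing simplifies may still receive a color when popped, which is the optimistic (Briggs-style) variant.

    #include <stdio.h>

    #define N 5            /* live ranges 0..4 */
    #define K 3            /* physical registers (colors) */

    /* adj[i][j] = 1 iff live ranges i and j interfere (made-up example). */
    static int adj[N][N] = {
        {0,1,1,1,0},
        {1,0,1,0,1},
        {1,1,0,1,0},
        {1,0,1,0,1},
        {0,1,0,1,0},
    };

    int main(void)
    {
        int removed[N] = {0}, stack[N], top = 0;
        int color[N], spilled[N] = {0};

        /* Simplify: repeatedly remove a node of degree < K; if none exists,
           push the node with the fewest remaining neighbors as a potential
           spill (a stand-in for a real cost-based heuristic). */
        for (int placed = 0; placed < N; placed++) {
            int pick = -1, pick_deg = N + 1;
            for (int i = 0; i < N; i++) {
                if (removed[i]) continue;
                int deg = 0;
                for (int j = 0; j < N; j++)
                    if (!removed[j] && adj[i][j]) deg++;
                if (deg < K) { pick = i; break; }              /* simplify   */
                if (deg < pick_deg) { pick = i; pick_deg = deg; } /* candidate */
            }
            removed[pick] = 1;
            stack[top++] = pick;
        }

        /* Select: pop nodes and give each the lowest color not used by an
           already-colored neighbor; if no color is free, the node is spilled. */
        while (top > 0) {
            int n = stack[--top];
            int used[K] = {0};
            for (int j = 0; j < N; j++)
                if (adj[n][j] && !removed[j] && !spilled[j]) used[color[j]] = 1;
            removed[n] = 0;
            color[n] = -1;
            for (int c = 0; c < K; c++)
                if (!used[c]) { color[n] = c; break; }
            if (color[n] < 0) spilled[n] = 1;
        }

        for (int i = 0; i < N; i++) {
            if (spilled[i]) printf("live range %d: spilled\n", i);
            else            printf("live range %d: register r%d\n", i, color[i]);
        }
        return 0;
    }

A real allocator would, as the slide says, insert spill code for the spilled nodes, rebuild the interference graph, and repeat.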

The Alternate Approach (More Common)

• An alternate approach, used widely in most compilers:
  – it also uses the graph coloring formulation.
• "The Priority-Based Coloring Approach to Register Allocation", F. Chow and J. Hennessy, ACM Transactions on Programming Languages and Systems, vol. 12, 501-536, 1990.
  – Hennessy: founder of MIPS, later President of Stanford University.

Important Modeling Difference

• The first difference from the classical approach is that now we assume that the "home location" of a live range is in memory.
  – Conceptually, values are always in memory unless promoted to a register; this is also referred to as the pessimistic approach.
  – In the classical approach, the dual of this model is used, where values are always in registers except when spilled; recall that this is referred to as the optimistic approach.

Important Modeling Difference (Contd.)

• A second major difference is the granularity at which code is modeled.
  – In the classical approach, individual instructions are modeled, whereas now basic blocks are the primitive units modeled as nodes in live ranges and the interference graph.
• The final major difference is the place of register allocation in the overall compilation process.
  – In the present approach, the interference graph is considered earlier in the compilation process, using intermediate-level statements; compiler-generated temporaries are known.
  – In contrast, in the previous work the allocation is done at the level of the machine code.

The Main Information to be Used by the Register Allocator

• For each live range, we have a bit vector LIVE of the basic blocks in it.
• We also have INTERFERE, which gives, for the live range, the set of all other live ranges that interfere with it.
• Recall that two live ranges interfere if they intersect in at least one basic block.
• If |INTERFERE| is smaller than the number of available registers for a node i, then i is unconstrained; it is constrained otherwise.

The Main Information to be Used by the Register Allocator (Contd.)

• An unconstrained node can be safely assigned a register, since its conflicting live ranges do not use up the available registers.
• We associate a (possibly empty) set FORBIDDEN with each live range that represents the set of colors that have already been assigned to the members of its INTERFERE set.
• The above representation is essentially a detailed interference graph representation.

Prioritizing Live Ranges

• In the memory-bound approach, given live ranges with a choice of assigning registers, we do the following:
  – Choose a live range that is "likely" to yield greater savings in execution time.
  – This means that we need to estimate the savings of each basic block in a live range.

Estimate the Savings

Given a live range X for variable x, the estimated savings in a basic block i is determined as follows:

1. First compute CyclesSaved, which is the number of loads and stores of x in i, scaled by the number of cycles taken for each load/store.
2. Compensate for the single load and/or store that might be needed to bring the variable in at the beginning of i and/or store it at the end, and denote this by Setup. Note that Setup is derived from a single load, or a single store, or a load plus a store.
3. Savings(X, i) = CyclesSaved - Setup. This is the actual saving in cycles after accounting for the possible loads/stores needed to move x at the beginning/end of i.
4. TotalSavings(X) = Σ_{i ∈ X} Savings(X, i) x W(i), where
   (a) X is the set of all basic blocks in the live range of x, and
   (b) W(i) is the execution frequency of variable x in block i.

Estimate the Savings (Contd.)

5. Note, however, that some live ranges might span a few blocks but yield a large savings due to frequent use of the variable, while others might yield the same cumulative gain over a larger number of basic blocks. We prioritize the former case and define

      Priority(X) = TotalSavings(X) / Span(X)

   where Span(X) is the number of basic blocks in X (a small worked example appears below, after the algorithm).

The Algorithm

For all constrained live ranges, execute the following steps:

1. Compute Priority(X) if it has not already been computed.
2. For the live range X with the highest priority:
   (a) If its priority is negative, or if no basic block i in X can be assigned a register (because every color has been assigned to a basic block that interferes with i), then delete X from the list and modify the interference graph.
   (b) Else, assign it a color that is not in its forbidden set.
   (c) Update the forbidden sets of the members of INTERFERE for X.
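To make the priority computation in step 1 concrete, here is a small worked example with made-up numbers. Suppose live range X spans Span(X) = 2 blocks, with Savings(X,1) = 2 cycles and W(1) = 90, and Savings(X,2) = 1 cycle and W(2) = 10; suppose live range Y spans 5 blocks, each with savings of 1 cycle and frequency 38. Then TotalSavings(X) = 2 x 90 + 1 x 10 = 190 and Priority(X) = 190 / 2 = 95, while TotalSavings(Y) = 5 x 1 x 38 = 190 and Priority(Y) = 190 / 5 = 38. Both live ranges save the same total number of cycles, but X achieves the savings over fewer blocks, so X is colored first.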

The Algorithm (Contd.)

3. For each live range X' that is in INTERFERE for X, do:
   (a) If FORBIDDEN of X' is the set of all colors, i.e., if no colors are available, SPLIT(X').
       Procedure SPLIT breaks a live range into smaller live ranges with the intent of reducing the interference of X'; it is described next.
4. Repeat the above steps till all constrained live ranges are colored, or till there is no color left to color any basic block.

The Idea Behind Splitting

• Splitting ensures that we break a live range up into increasingly smaller live ranges.
• The limit is of course when we are down to the size of a single basic block.
• The intuition is that we start out with coarse-grained interference graphs with few nodes. This makes the interference node degree possibly high.
• We increase the problem size via splitting on a need-to basis. This strategy lowers the interference.

The Splitting Strategy

A sketch of an algorithm for splitting:

1. Choose a split point. Note that we are guaranteed that X has at least one basic block i which can be assigned a color, i.e., its forbidden set does not include all the colors. The earliest such block in the order of control flow can be the split point.
2. Separate the live range X into X1 and X2 around the split point.
3. Update the sets INTERFERE for X1 and X2, and those for the live ranges that interfered with X.
4. Recompute priorities and reprioritize the list. Other bookkeeping activities needed to realize a safe implementation are also executed.

Live Range Splitting Example

CFG: BB1: a := ...; b := ...; BB2: c := ...; BB3/BB4: branch; BB5: := a; := c; BB6: := b.

Before splitting (assume the number of physical registers = 2):
• Live ranges: a: BB1-BB5; b: BB1-BB6; c: BB2-BB5.
• In the interference graph a, b and c all interfere, so two registers are not enough.

After splitting b:
• New live ranges: a: BB1-BB5; b: BB1; b2: BB6; c: BB2-BB5.
• A spill is introduced between b and b2; b and b2 are logically the same program variable, with b2 a renamed equivalent of b.
• All nodes are now unconstrained.

Interaction Between Allocation and Scheduling

• The allocator and the scheduler are typically patched together heuristically.
• This leads to the "phase ordering" problem: should allocation be done before scheduling, or vice versa?
• Saving on spilling, or "good allocation", is only indirectly connected to the actual execution time. Contrast this with instruction scheduling.
• Factoring register allocation into scheduling and solving the problem "globally" is a research issue.

Next: Scheduling for ILP Processors

• A basic block does not expose enough parallelism, due to its small number of instructions.
• Need to look at more flexible regions
  – Trace scheduling, superblocks, ...
• Scheduling more flexible regions implies using features such as speculation, code duplication, and predication.

EPIC and Compiler Optimization

• EPIC requires dependency-free "scheduled code".
• The burden of extracting parallelism falls on the compiler.
• The success of EPIC architectures depends on the efficiency of compilers!
• We provide an overview of compiler optimization techniques (as they apply to EPIC/ILP), enhanced by examples using the Trimaran ILP infrastructure.

Scheduling for ILP Processors

• The size of a basic block limits the amount of ILP that can be extracted.
• More than one basic block = going beyond branches
  – Loop optimizations also
• Trace scheduling
  – Pick a trace in the program graph: the most frequently executed region of code
• Region-based scheduling
  – Find a region of code, and send this to the scheduler/register allocator

Getting Around Basic Block Limitations

• Basic block size limits the amount of parallelism available for extraction.
  – Need to consider more "flexible" regions of instructions.
• A well-known classical approach is to consider traces through the (acyclic) control flow graph.
• In the example CFG (START, BB-1 through BB-7, STOP), the trace is BB-1, BB-4, BB-6.

Definitions: The Trace

• A trace is a path through the CFG chosen according to branch probabilities (the slide's example CFG has blocks A-I with edge probabilities such as 0.9/0.1 and 0.2/0.8).

Region-Based Scheduling

• Treat a region as the input to the scheduler.
  – How do we schedule instructions in a region?
  – Can we move instructions to any "slot"?
  – What do we have to watch out for?
• Scheduling algorithm
  – The input is the region (trace, superblock, etc.)
  – Use the list scheduling algorithm
  – Treat movement of instructions past branch and join points as "special cases"

The Four Elementary but Significant Side-Effects

The First Case

• Consider a single instruction moving upward past a conditional branch that has an off-trace path.
• If the moved instruction A is a DEF that is live on the off-trace path, a false dependence edge is added to prevent the motion.
• This code movement leads to the instruction being moved executing, sometimes, when it ought not to have: it executes speculatively.

The Second Case

• Identical to the previous case, except the pseudo-dependence edge is added from A to the join instruction whenever A is a "write" or a def.
• A more general solution is to permit the code motion but undo the effect of the speculated definition by adding repair code: an expensive proposition in terms of compilation cost.

The Third Case

• Instruction A will not be executed if the off-trace path is taken.
• To avoid mistakes, A is replicated on the off-trace path.

The Fourth Case

• Similar to Case 3, except for the direction of the replication of A onto the off-trace path, as shown in the slide's figure.

Super Block

• A trace with a single entry but potentially many exits.
• Simplifies code motion during scheduling:
  – upward movements past a side entry within a block are pure replication;
  – downward movements past a side entry within a block are pure speculation.
• Two-step formation:
  – Trace picking
  – Tail duplication

Definitions: The Superblock

• The superblock is a scheduling region composed of basic blocks with a single entry but potentially many exits.
• Superblock formation is done in two steps:
  – Trace selection
  – Tail duplication
• A larger scheduling region exposes more instructions that may be executed in parallel.

Superblock Formation and Tail Duplication

• The slide's example CFG (blocks A through H, with a test "if x=3" in A and assignments such as y=1, y=2, u=v, u=w, x=y*2, z=y*3) is tail-duplicated: copies of the blocks below the side entrance (D', E', G') are created so that the main trace has a single entry, and the known outcome of the x=3 test then lets x=y*2 and z=y*3 be folded to constants (x=2, z=6) in the specialized blocks.
• (The slide also shows a Very Long Instruction Word format: one wide instruction with multiple operation slots, numbered 0-8.)

Background: Region Formation – The Superblock

• The slide shows a CFG (BB1 through BB7, with branch probabilities such as 0.9/0.1 and 0.8/0.2) and the superblock formed from its most frequent path, with tail-duplicated blocks (BB8-BB10, i.e. copies D', E', G') removing the side entrances.

Advantage of Superblock

• We have taken care of the replication when we form the region.
  – Schedule the region independently of other regions!
  – Don't have to worry about code replication each time we move an instruction around a branch.
• Send the superblock to the list scheduler, and it works the same as it did with basic blocks!

Hyperblock Region Formation

• A hyperblock is a single-entry / multiple-exit set of predicated basic blocks (formed with if-conversion).
• There are no incoming control flow arcs from outside basic blocks to the selected blocks, other than to the entry block.
• Nested inner loops inside the selected blocks are not allowed.
• Hyperblock formation procedure:
  – Trace selection
  – Tail duplication
  – Loop peeling
  – Node splitting
  – If-conversion

Hyperblock Formation Procedure

• Tail duplication: remove side entries.
• Loop peeling: create a bigger region around a nested loop.
• Node splitting: eliminate dependencies created by control path merges; can cause large code expansion.
• After the above three transformations, perform if-conversion.

Tail Duplication and Loop Peeling

• The slides illustrate both transformations on a small CFG (tests x > 0 and y > 0; statements such as v := v*x, x = 1, v := v+1, v := v-1, u := v+y): tail duplication copies the blocks below a side entrance (B', C', D'), and loop peeling pulls one iteration of an inner loop out so that the remaining loop body can join the region.

Node Splitting and Assembly Code

• Node splitting duplicates a merge block so that each control path gets its own copy, eliminating the dependences created by the merge (in the figure, the blocks computing v := v+1 / v := v-1 and u := v+y are split per path).
• The corresponding assembly view of the example region (blocks A-G):

      A:  ble x,0,C
      B:  ble y,0,F
      C:  v := v*x
      D:  ne  x,1,F
      E:  v := v+1
      F:  v := v-1
      G:  u := v+y
          l := k+z

If-Conversion

• If-conversion replaces the branches of the example with predicate defines and predicated operations:

      Before (branching code)        After (predicated code)
      A: ble x,0,C                   ble x,0,C
      B: ble y,0,F                   d  := ?(y>0)
      C: v := v*x                    f' := ?(y<=0)
      D: ne x,1,F                    C: v := v*x
      E: v := v+1                    e  := ?(x=1)   if d
      F: v := v-1                    f" := ?(x≠1)   if d
      G: u := v+y                    f  := ?(f' ∨ f")
                                     v := v+1       if e
                                     v := v-1       if f
                                     u := v+y

Summary: Region Formation

• In general, the opportunity to extract more parallelism increases as the region size increases: more instructions are exposed in the larger region.
• The compile time increases as the region size increases. A trade-off of compile time versus run time must be considered.

Region Formation in Trimaran

• Trimaran is a research infrastructure used to facilitate the creation and evaluation of EPIC/VLIW and superscalar compiler optimization techniques.
  – Forms 3 types of regions: basic blocks, superblocks, hyperblocks.
  – Operates only on the C language as input.
  – Uses a general machine description language (HMDES).
• The infrastructure uses a parameterized processor architecture called HPL-PD (a.k.a. PlayDoh).
• All architectures are mapped into and simulated in HPL-PD.

ILP Scheduling – Summary

• Send a large region of code into a list scheduler.
  – What regions? Start with a trace of the high-frequency paths in the program.
• Modify the list scheduler to handle movements past branches.
  – If you have speculation in the processor, then allow speculative code motion.
  – Replication will cause code size growth, but does not need speculation to support it.
  – Hyperblocks may need predication support.
• Key idea: increase the scope of ILP analysis.
  – Trade-off between compile time and execution time.
  – When do we stop?