
Introduction to Optimizing Compilers (CS 211)

Hardware-Software Interface

    Machine                                   Program
    Available resources statically fixed      Required resources dynamically varying
    Designed to support a wide                Designed to run well on
    variety of programs                       a variety of machines
    Interested in running many                Interested in having
    programs fast                             itself run fast

Execution time = t_cyc × CPI × code size. This equation reflects how well the machine resources match the program requirements.

Compiler Tasks

• Code Translation
  – Source language → target language, e.g.
      FORTRAN → C
      C → MIPS, PowerPC, or Alpha machine code
      MIPS binary → Alpha binary
• Code Optimization
  – Make the code run faster
  – Match dynamic code behavior to static machine structure

Compiler Structure

    high-level source code → Front End → IR → Optimizer (with dependence analyzer) → IR → Back End → machine code

(IR = intermediate representation.) The front end and optimizer are machine independent; the back end is machine dependent.

Structure of Optimizing Compilers

    Source Program
      → Front-end #1, Front-end #2, … (front ends, sharing a program database)
      → High-level Intermediate Language (HIL)
      → High-level Optimizer → Optimized HIL
      → Lowering of IL (middle end)
      → Low-level Intermediate Language (LIL)
      → Low-level Optimizer → Optimized LIL
      → Target-1 / Target-2 / Target-3 Code Generator and Linker (back ends, plus runtime systems)
      → Target-1 / Target-2 / Target-3 Executable

Front End

• Lexical Analysis (tool: e.g. lex)
  – Catches a misspelled identifier, keyword, or operator
• Syntax Analysis (tool: e.g. yacc)
  – Catches grammar errors, such as mismatched parentheses
• Semantic Analysis
  – Type checking

Front-end phases:
1. Scanner – converts the input character stream into a stream of lexical tokens.
2. Parser – derives the syntactic structure (parse tree, abstract syntax tree) from the token stream, and reports any syntax errors encountered.
3. Semantic Analysis – generates the intermediate language representation from the input source program and user options/directives, and reports any semantic errors encountered.

High-level Optimizer

• Global intra-procedural and inter-procedural analysis of the source program's control and data flow
• Selection of high-level optimizations and transformations
• Update of the high-level intermediate language

Intermediate Representation

• Achieves retargetability
  – Different source languages
  – Different target machines
• Example (tree-based IR from CMCC) for

    int a, b, c, d;
    d = a * (b + c);

Graphical representation (expression tree):

    ASGI(&d, MULI(INDIRI(&a), ADDI(INDIRI(&b), INDIRI(&c))))

Linear form of the graphical representation:

    A0    5 78 "a"
    A1    5 78 "b"
    A2    5 78 "c"
    A3    5 78 "d"
    FND1  ADDRL  A3
    FND2  ADDRL  A0
    FND3  INDIRI FND2
    FND4  ADDRL  A1
    FND5  INDIRI FND4
    FND6  ADDRL  A2
    FND7  INDIRI FND6
    FND8  ADDI   FND5 FND7
    FND9  MULI   FND3 FND8
    FND10 ASGI   FND1 FND9

Lowering of Intermediate Language

• Linearized storage/mapping of variables
  – e.g. a 2-d array becomes a 1-d array
• Array/structure references → load/store operations
  – e.g. A[i] becomes load R1,(R0), where R0 holds the address of A[i]
• High-level control structures → low-level control flow
  – e.g. a "while" statement becomes compare-and-branch statements

Machine-Independent Optimizations

• Dataflow analysis and optimizations
  – Constant propagation
  – Copy propagation
  – Value numbering
• Elimination of common subexpressions
• Dead code elimination
• Strength reduction
• Function/procedure inlining
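To make one of these passes concrete, here is a minimal constant-folding sketch over a toy expression-tree IR in C. The node layout, the fold function, and the three-opcode set are illustrative inventions for this sketch, not the CMCC IR:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { CONST, ADD, MUL } Opcode;

    typedef struct Node {
        Opcode op;
        int value;                 /* used only when op == CONST */
        struct Node *left, *right; /* NULL for CONST leaves */
    } Node;

    static Node *mknode(Opcode op, int value, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->op = op; n->value = value; n->left = l; n->right = r;
        return n;
    }

    /* Constant folding: fold the children first, then collapse this
     * node to a CONST if both operands are now constants. */
    static Node *fold(Node *n) {
        if (n->op == CONST) return n;
        n->left  = fold(n->left);
        n->right = fold(n->right);
        if (n->left->op == CONST && n->right->op == CONST) {
            int v = (n->op == ADD) ? n->left->value + n->right->value
                                   : n->left->value * n->right->value;
            n->op = CONST;
            n->value = v;          /* children leak; acceptable in a sketch */
            n->left = n->right = NULL;
        }
        return n;
    }

    int main(void) {
        /* (1 + 2) * 4 folds to the single constant 12 */
        Node *e = mknode(MUL, 0,
                         mknode(ADD, 0, mknode(CONST, 1, NULL, NULL),
                                        mknode(CONST, 2, NULL, NULL)),
                         mknode(CONST, 4, NULL, NULL));
        e = fold(e);
        printf("folded to %d\n", e->value);
        return 0;
    }

A real pass would run bottom-up over the linear FND form shown above; the recursion here plays the same role on the tree form.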
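The lowering step can also be illustrated at the source level. Below, a hand-lowered version of a small loop turns the while into an explicit test and branches, and turns A[i] into an address computation plus a load; this is a sketch of what the LIL expresses, not actual compiler output:

    #include <stddef.h>

    /* High-level form: the loop structure and the subscript are abstract. */
    int sum_high(const int A[], int n) {
        int s = 0, i = 0;
        while (i < n) {
            s = s + A[i];
            i = i + 1;
        }
        return s;
    }

    /* Hand-lowered form: compare-and-branch control flow, and the
     * source-level analogue of "load R1,(R0)" for A[i]. */
    int sum_lowered(const int A[], int n) {
        int s = 0, i = 0;
    loop_test:
        if (!(i < n)) goto loop_exit;     /* conditional branch */
        s = s + *(const int *)((const char *)A + (size_t)i * sizeof(int));
        i = i + 1;
        goto loop_test;                   /* unconditional back edge */
    loop_exit:
        return s;
    }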
Code-Optimizing Transformations

• Constant folding
    (1 + 2)   ⇒ 3
    (100 > 0) ⇒ true
• Copy propagation
    x = b + c            x = b + c
    z = y * x       ⇒    z = y * (b + c)
• Common subexpression elimination
    x = b * c + 4        t = b * c
    z = b * c - 1   ⇒    x = t + 4
                         z = t - 1
• Dead code elimination
    x = 1
    x = b + c       ⇒    x = b + c    (the first assignment is dead)
  or remove x entirely if x is not referred to at all

Code Optimization Example

    x = 1
    y = a * b + 3
    z = a * b + x + z + 2
    x = 3

⇒ (constant propagation)

    x = 1
    y = a * b + 3
    z = a * b + 1 + z + 2
    x = 3

⇒ (constant folding)

    x = 1
    y = a * b + 3
    z = a * b + 3 + z
    x = 3

⇒ (dead code elimination: the first assignment to x is never used)

    y = a * b + 3
    z = a * b + 3 + z
    x = 3

⇒ (common subexpression elimination)

    t = a * b + 3
    y = t
    z = t + z
    x = 3

Code Motion

• Move code between basic blocks
• E.g. move loop-invariant computations outside of loops:

    while (i < 100) {          t = x / y;
        *p = x / y + i;    ⇒   while (i < 100) {
        i = i + 1;                 *p = t + i;
    }                              i = i + 1;
                               }

Strength Reduction

• Replace complex (and costly) expressions with simpler ones
• E.g. a := b * 17 ⇒ a := (b << 4) + b
• E.g.:

    while (i < 100) {          p = &a[i];
        a[i] = i * 100;        t = i * 100;
        i = i + 1;         ⇒   while (i < 100) {
    }                              *p = t;
                                   t = t + 100;
                                   p = p + 4;
                                   i = i + 1;
                               }

  Loop invariants maintained: &a[i] == p and i * 100 == t.

Induction Variable Elimination

• Induction variable: the loop index.
• Consider the loop:

    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            z[i][j] = b[i][j];

• Rather than recompute i*M + j for each array in each iteration, share one induction variable between the arrays and increment it at the end of the loop body (see the sketch following this section).

Loop Optimizations

• Motivation: restructure the program so as to enable more effective back-end optimizations and hardware exploitation.
• Loop transformations are useful for enhancing
  – register allocation
  – instruction-level parallelism
  – data-cache locality
  – vectorization
  – parallelization
• Loops are good targets for optimization.
• Basic loop optimizations:
  – code motion
  – induction-variable elimination
  – strength reduction (x * 2 → x << 1)
• Improve performance by unrolling the loop (see the sketch following this section)
  – Note the impact on processors that allow parallel execution of instructions, e.g. Texas Instruments' newer DSP processors.

Importance of Loop Optimizations

Study of loop-intensive benchmarks in the SPEC92 suite [C.J. Newburn, 1991]:

    Program     No. of   Static       Dynamic      % of
                Loops    B.B. Count   B.B. Count   Total
    nasa7         9      ---          322M          64%
                 16      ---          362M          72%
                 83      ---          500M         ~100%
    matrix300     1       17          217.6M        98%
                 15       96          221.2M        98+%
    tomcatv       1        7           26.1M        50%
                  5       22           52.4M        99+%
                 12       96           54.2M       ~100%

Function Inlining

• Replace function calls with the function body
• Increases compilation scope, and with it ILP, e.g. for constant propagation and common subexpression elimination
• Reduces function call overhead, e.g. passing arguments, register saves and restores

Measured on a DEC 3100 [W.M. Hwu, 1991]:

    Program     In-line Speedup   In-line Code Expansion
    cccp             1.06               1.25
    compress         1.05               1.00+
    eqn              1.12               1.21
    espresso         1.07               1.09
    lex              1.02               1.06
    tbl              1.04               1.18
    xlisp            1.46               1.32
    yacc             1.03               1.17

Back End

    IR → code selection → code scheduling → register allocation → code emission → machine code

The back end works on an instruction-level IR and is machine dependent:
• maps virtual registers onto architected registers
• rearranges code
• performs target-machine-specific optimizations:
  – delayed branches
  – conditional moves
  – instruction combining, e.g. auto-increment addressing modes, add carrying (PowerPC), hardware branch (PowerPC)
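A source-level sketch of the induction-variable sharing described above, with the 2-d arrays viewed as flat 1-d arrays; the function names are illustrative:

    /* Before: the linear offset i*M + j is recomputed for each array,
     * in every iteration. */
    void copy_naive(int *z, const int *b, int N, int M) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                z[i * M + j] = b[i * M + j];
    }

    /* After: one shared induction variable k replaces both offset
     * computations and is incremented at the end of the loop body. */
    void copy_shared_iv(int *z, const int *b, int N, int M) {
        int k = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                z[k] = b[k];
                k = k + 1;
            }
    }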
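The slides mention loop unrolling without showing it, so here is a minimal sketch, assuming unrolling by four with independent accumulators; the function names are illustrative:

    #include <stddef.h>

    /* Rolled: one add, one compare, one branch per element. */
    long sum_rolled(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by 4: less branch overhead per element, and the four
     * independent adds can be issued in parallel on a wide machine. */
    long sum_unrolled(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; i++)      /* epilogue for n not divisible by 4 */
            s += a[i];
        return s;
    }

The four accumulators matter as much as the unrolling itself: a single accumulator would chain the adds and limit the parallel issue the slides point to.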
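Likewise, a before/after sketch of function inlining, showing how it exposes a constant argument to the optimizations listed earlier; all names here are illustrative:

    /* Before: the constant argument is hidden behind the call. */
    static int scale(int x, int k) {
        return x * k;
    }

    int f(int x) {
        return scale(x, 8);      /* call overhead + opaque k */
    }

    /* After inlining, the body is exposed with k = 8, so constant
     * propagation and strength reduction can finish the job. */
    int f_inlined(int x) {
        return x * 8;            /* inlined body, k propagated */
    }

    int f_reduced(int x) {
        return x << 3;           /* multiply strength-reduced to a shift */
    }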
Code Selection

• Map the IR to machine instructions, e.g. by pattern matching over the IR tree. For d = a * (b + c), the tree

    ASGI(&d, MULI(INDIRI(&a), ADDI(INDIRI(&b), INDIRI(&c))))

  is matched to

    addi Rt1, Rb, Rc
    muli Rt2, Ra, Rt1

• A tree-matching code selector:

    Inst *match(IR *n) {
        Inst *l, *r, *inst = NULL;
        switch (n->opcode) {
        /* ... cases for other opcodes ... */
        case MUL:
            l = match(n->left());
            r = match(n->right());
            if (n->type == D || n->type == F)
                inst = mult_fp((n->type == D), l, r);
            else
                inst = mult_int((n->type == I), l, r);
            break;
        case ADD:
            l = match(n->left());
            r = match(n->right());
            if (n->type == D || n->type == F)
                inst = add_fp((n->type == D), l, r);
            else
                inst = add_int((n->type == I), l, r);
            break;
        /* ... */
        }
        return inst;
    }

Our Old Friend…CPU Time

• CPU time = IC × CPI × t_cyc
• Consider which of these factors the various optimizations affect:
  – Function inlining
  – Loop unrolling
  – Code-optimizing transformations
  – Code selection

Machine-Dependent Optimizations

• Register allocation
• Instruction scheduling
• Peephole optimizations

Peephole Optimizations

• Replacement of assembly instructions through template matching
• E.g. replacing one addressing mode with another in a CISC (a sketch appears at the end of this section)

Code Scheduling

• Rearrange the code sequence to minimize execution time
  – Hide instruction latency
  – Utilize all available resources

Original sequence (stalls annotated on the waiting instruction):

    l.d   f4, 8(r8)
    fadd  f5, f4, f6     ; 1 stall
    l.d   f2, 16(r8)
    fsub  f7, f2, f6     ; 1 stall
    fmul  f7, f7, f5     ; 3 stalls
    s.d   f7, 24(r8)
    l.d   f8, 0(r9)
    s.d   f8, 8(r9)      ; 1 stall

After reordering (loads hoisted above their uses):

    l.d   f4, 8(r8)
    l.d   f2, 16(r8)
    fadd  f5, f4, f6     ; 0 stalls
    fsub  f7, f2, f6     ; 0 stalls
    fmul  f7, f7, f5     ; 3 stalls
    l.d   f8, 0(r9)
    s.d   f8, 8(r9)      ; 1 stall
    s.d   f7, 24(r8)

After reordering with memory disambiguation (l.d f8 proven independent of the pending stores, so it moves up as well):

    l.d   f4, 8(r8)
    l.d   f2, 16(r8)
    l.d   f8, 0(r9)
    fadd  f5, f4, f6     ; 0 stalls
    fsub  f7, f2, f6     ; 0 stalls
    s.d   f8, 8(r9)
    fmul  f7, f7, f5     ; 0 stalls
    s.d   f7, 24(r8)

Cost of Instruction Scheduling

• Given a program segment, the goal is to execute it as quickly as possible.
• The completion time is the objective function, or cost, to be minimized.
• This is referred to as the makespan of the schedule.
• It has to be balanced against the running time and space needs of the algorithm for finding the schedule, which translate into compilation cost.

Instruction Scheduling Example

Source program:

    main(int argc, char *argv[])
    {
        int a, b, c;
        a = argc;
        b = a * 255;
        c = a * 15;
        printf("%d\n", b*b - 4*a*c);
    }

After scheduling (prior to register allocation):

    op 10  MPY  vr2      ← param1, 255
    op 12  MPY  vr3      ← param1, 15
    op 14  MPY  vr8      ← vr2, vr2
    op 15  SHL  vr9      ← param1, 2
    op 16  MPY  vr10     ← vr9, vr3
    op 17  SUB  param2   ← vr8, vr10
    op 18  MOV  param1   ← addr("%d\n")
    op 27  PBRR vb12     ← addr(printf)
    op 20  BRL  ret_addr ← vb12

The General Instruction Scheduling Problem

Given a source program P, schedule the instructions so as to minimize the completion time (the makespan defined above).

Feasible schedule: a specification of a start time for each instruction such that the following constraints are obeyed: each instruction starts only after the instructions it depends on have produced their results, and the machine's resource limits are never exceeded.
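Returning to the peephole pass described above, it can be sketched as a template match over the emitted instructions. The tiny instruction encoding and the single template below are invented for illustration:

    #include <stdio.h>

    typedef enum { ADDI, MOVE, NOP } Op;
    typedef struct { Op op; int dst, src, imm; } Insn;

    /* One illustrative template: "addi rD, rS, 0" computes no new value,
     * so rewrite it to a plain move, or delete it when rD == rS. */
    static void peephole(Insn *code, int n) {
        for (int i = 0; i < n; i++) {
            if (code[i].op == ADDI && code[i].imm == 0)
                code[i].op = (code[i].dst == code[i].src) ? NOP : MOVE;
        }
    }

    int main(void) {
        Insn code[] = {
            { ADDI, 1, 2, 0 },   /* addi r1, r2, 0  -> move r1, r2 */
            { ADDI, 3, 3, 0 },   /* addi r3, r3, 0  -> deleted     */
            { ADDI, 4, 4, 8 },   /* addi r4, r4, 8  -> unchanged   */
        };
        peephole(code, 3);
        for (int i = 0; i < 3; i++)
            printf("op=%d dst=r%d src=r%d imm=%d\n",
                   code[i].op, code[i].dst, code[i].src, code[i].imm);
        return 0;
    }

A production peephole pass matches over a sliding window of several instructions; the one-instruction window here keeps the sketch short.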
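To tie the feasibility and makespan definitions together, here is a minimal greedy list-scheduling sketch for a single-issue machine. The five-operation dependence DAG, the latencies, and all names are invented, and the feasibility constraints assumed are the usual two stated above (operands ready before issue, one issue per cycle):

    #include <stdio.h>

    #define N 5

    /* latency[i]: cycles until op i's result is available after issue.
     * dep[i][j] = 1 means op j consumes op i's result. */
    static const int latency[N] = { 2, 2, 1, 1, 3 };
    static const int dep[N][N] = {
        /* op0 feeds op2; op1 feeds op3; op2 and op3 feed op4 */
        { 0, 0, 1, 0, 0 },
        { 0, 0, 0, 1, 0 },
        { 0, 0, 0, 0, 1 },
        { 0, 0, 0, 0, 1 },
        { 0, 0, 0, 0, 0 },
    };

    int main(void) {
        int start[N], placed[N] = { 0 };
        int cycle = 0, count = 0;

        while (count < N) {
            int issued = 0;
            /* Greedily issue the first op whose operands are all ready. */
            for (int j = 0; j < N && !issued; j++) {
                if (placed[j]) continue;
                int ready = 1;
                for (int i = 0; i < N; i++)
                    if (dep[i][j] &&
                        (!placed[i] || start[i] + latency[i] > cycle))
                        ready = 0;
                if (ready) {
                    start[j] = cycle;
                    placed[j] = 1;
                    count++;
                    issued = 1;
                    printf("cycle %2d: issue op%d\n", cycle, j);
                }
            }
            cycle++;   /* single-issue: at most one op per cycle */
        }

        int makespan = 0;   /* completion time of the whole schedule */
        for (int i = 0; i < N; i++)
            if (start[i] + latency[i] > makespan)
                makespan = start[i] + latency[i];
        printf("makespan = %d cycles\n", makespan);
        return 0;
    }

Real schedulers replace the first-ready choice with a priority heuristic (e.g. critical-path height), which is where the compile-time/quality trade-off discussed above comes in.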