
Introduction to Optimizing Compilers (CS 211)

Hardware-Software Interface

    Machine                                   Program
    Available resources statically fixed      Required resources dynamically varying
    Designed to support a wide                Designed to run well on
    variety of programs                       a variety of machines
    Interested in running many                Interested in having
    programs fast                             itself run fast

Execution time = t_cyc × CPI × code size. This equation reflects how well the machine resources match the program requirements.

Compiler Tasks

• Code Translation
  – Source language → target language, e.g.
      FORTRAN → C
      C → MIPS, PowerPC, or Alpha machine code
      MIPS binary → Alpha binary
• Code Optimization
  – Make the code run faster
  – Match dynamic code behavior to static machine structure

Compiler Structure

    high-level source code → Front End → IR → Optimizer (with dependence analyzer) → IR → Back End → machine code

(IR = intermediate representation.) The front end and optimizer are machine independent; the back end is machine dependent.

Structure of Optimizing Compilers

    Source Program
      → Front-end #1, Front-end #2, … (front ends, sharing a program database)
      → High-level Intermediate Language (HIL)
      → High-level Optimizer → Optimized HIL
      → Lowering of IL (middle end)
      → Low-level Intermediate Language (LIL)
      → Low-level Optimizer → Optimized LIL
      → Target-1 / Target-2 / Target-3 Code Generator and Linker (back ends, plus runtime systems)
      → Target-1 / Target-2 / Target-3 Executable

Front End

• Lexical Analysis (tool: e.g. lex)
  – Catches a misspelled identifier, keyword, or operator
• Syntax Analysis (tool: e.g. yacc)
  – Catches grammar errors, such as mismatched parentheses
• Semantic Analysis
  – Type checking

Front-end phases:
1. Scanner – converts the input character stream into a stream of lexical tokens.
2. Parser – derives the syntactic structure (parse tree, abstract syntax tree) from the token stream, and reports any syntax errors encountered.
3. Semantic Analysis – generates the intermediate language representation from the input source program and user options/directives, and reports any semantic errors encountered.

High-level Optimizer

• Global intra-procedural and inter-procedural analysis of the source program's control and data flow
• Selection of high-level optimizations and transformations
• Update of the high-level intermediate language

Intermediate Representation

• Achieves retargetability
  – Different source languages
  – Different target machines
• Example (tree-based IR from CMCC) for

    int a, b, c, d;
    d = a * (b + c);

Graphical representation (expression tree):

    ASGI(&d, MULI(INDIRI(&a), ADDI(INDIRI(&b), INDIRI(&c))))

Linear form of the graphical representation:

    A0    5 78 "a"
    A1    5 78 "b"
    A2    5 78 "c"
    A3    5 78 "d"
    FND1  ADDRL  A3
    FND2  ADDRL  A0
    FND3  INDIRI FND2
    FND4  ADDRL  A1
    FND5  INDIRI FND4
    FND6  ADDRL  A2
    FND7  INDIRI FND6
    FND8  ADDI   FND5 FND7
    FND9  MULI   FND3 FND8
    FND10 ASGI   FND1 FND9

Lowering of Intermediate Language

• Linearized storage/mapping of variables
  – e.g. a 2-d array becomes a 1-d array
• Array/structure references → load/store operations
  – e.g. A[i] becomes load R1,(R0), where R0 holds the address of A[i]
• High-level control structures → low-level control flow
  – e.g. a "while" statement becomes compare-and-branch statements

Machine-Independent Optimizations

• Dataflow analysis and optimizations
  – Constant propagation
  – Copy propagation
  – Value numbering
• Elimination of common subexpressions
• Dead code elimination
• Strength reduction
• Function/procedure inlining
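To make one of these passes concrete, here is a minimal constant-folding sketch over a toy expression-tree IR in C. The node layout, the fold function, and the three-opcode set are illustrative inventions for this sketch, not the CMCC IR:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { CONST, ADD, MUL } Opcode;

    typedef struct Node {
        Opcode op;
        int value;                 /* used only when op == CONST */
        struct Node *left, *right; /* NULL for CONST leaves */
    } Node;

    static Node *mknode(Opcode op, int value, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->op = op; n->value = value; n->left = l; n->right = r;
        return n;
    }

    /* Constant folding: fold the children first, then collapse this
     * node to a CONST if both operands are now constants. */
    static Node *fold(Node *n) {
        if (n->op == CONST) return n;
        n->left  = fold(n->left);
        n->right = fold(n->right);
        if (n->left->op == CONST && n->right->op == CONST) {
            int v = (n->op == ADD) ? n->left->value + n->right->value
                                   : n->left->value * n->right->value;
            n->op = CONST;
            n->value = v;          /* children leak; acceptable in a sketch */
            n->left = n->right = NULL;
        }
        return n;
    }

    int main(void) {
        /* (1 + 2) * 4 folds to the single constant 12 */
        Node *e = mknode(MUL, 0,
                         mknode(ADD, 0, mknode(CONST, 1, NULL, NULL),
                                        mknode(CONST, 2, NULL, NULL)),
                         mknode(CONST, 4, NULL, NULL));
        e = fold(e);
        printf("folded to %d\n", e->value);
        return 0;
    }

A real pass would run bottom-up over the linear FND form shown above; the recursion here plays the same role on the tree form.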
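The lowering step can also be illustrated at the source level. Below, a hand-lowered version of a small loop turns the while into an explicit test and branches, and turns A[i] into an address computation plus a load; this is a sketch of what the LIL expresses, not actual compiler output:

    #include <stddef.h>

    /* High-level form: the loop structure and the subscript are abstract. */
    int sum_high(const int A[], int n) {
        int s = 0, i = 0;
        while (i < n) {
            s = s + A[i];
            i = i + 1;
        }
        return s;
    }

    /* Hand-lowered form: compare-and-branch control flow, and the
     * source-level analogue of "load R1,(R0)" for A[i]. */
    int sum_lowered(const int A[], int n) {
        int s = 0, i = 0;
    loop_test:
        if (!(i < n)) goto loop_exit;     /* conditional branch */
        s = s + *(const int *)((const char *)A + (size_t)i * sizeof(int));
        i = i + 1;
        goto loop_test;                   /* unconditional back edge */
    loop_exit:
        return s;
    }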
Code-Optimizing Transformations

• Constant folding
    (1 + 2)   ⇒ 3
    (100 > 0) ⇒ true
• Copy propagation
    x = b + c            x = b + c
    z = y * x       ⇒    z = y * (b + c)
• Common subexpression elimination
    x = b * c + 4        t = b * c
    z = b * c - 1   ⇒    x = t + 4
                         z = t - 1
• Dead code elimination
    x = 1
    x = b + c       ⇒    x = b + c    (the first assignment is dead)
  or remove x entirely if x is not referred to at all

Code Optimization Example

    x = 1
    y = a * b + 3
    z = a * b + x + z + 2
    x = 3

⇒ (constant propagation)

    x = 1
    y = a * b + 3
    z = a * b + 1 + z + 2
    x = 3

⇒ (constant folding)

    x = 1
    y = a * b + 3
    z = a * b + 3 + z
    x = 3

⇒ (dead code elimination: the first assignment to x is never used)

    y = a * b + 3
    z = a * b + 3 + z
    x = 3

⇒ (common subexpression elimination)

    t = a * b + 3
    y = t
    z = t + z
    x = 3

Code Motion

• Move code between basic blocks
• E.g. move loop-invariant computations outside of loops:

    while (i < 100) {          t = x / y;
        *p = x / y + i;    ⇒   while (i < 100) {
        i = i + 1;                 *p = t + i;
    }                              i = i + 1;
                               }

Strength Reduction

• Replace complex (and costly) expressions with simpler ones
• E.g. a := b * 17 ⇒ a := (b << 4) + b
• E.g.:

    while (i < 100) {          p = &a[i];
        a[i] = i * 100;        t = i * 100;
        i = i + 1;         ⇒   while (i < 100) {
    }                              *p = t;
                                   t = t + 100;
                                   p = p + 4;
                                   i = i + 1;
                               }

  Loop invariants maintained: &a[i] == p and i * 100 == t.

Induction Variable Elimination

• Induction variable: the loop index.
• Consider the loop:

    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            z[i][j] = b[i][j];

• Rather than recompute i*M + j for each array in each iteration, share one induction variable between the arrays and increment it at the end of the loop body (see the sketch following this section).

Loop Optimizations

• Motivation: restructure the program so as to enable more effective back-end optimizations and hardware exploitation.
• Loop transformations are useful for enhancing
  – register allocation
  – instruction-level parallelism
  – data-cache locality
  – vectorization
  – parallelization
• Loops are good targets for optimization.
• Basic loop optimizations:
  – code motion
  – induction-variable elimination
  – strength reduction (x * 2 → x << 1)
• Improve performance by unrolling the loop (see the sketch following this section)
  – Note the impact on processors that allow parallel execution of instructions, e.g. Texas Instruments' newer DSP processors.

Importance of Loop Optimizations

Study of loop-intensive benchmarks in the SPEC92 suite [C.J. Newburn, 1991]:

    Program     No. of   Static       Dynamic      % of
                Loops    B.B. Count   B.B. Count   Total
    nasa7         9      ---          322M          64%
                 16      ---          362M          72%
                 83      ---          500M         ~100%
    matrix300     1       17          217.6M        98%
                 15       96          221.2M        98+%
    tomcatv       1        7           26.1M        50%
                  5       22           52.4M        99+%
                 12       96           54.2M       ~100%

Function Inlining

• Replace function calls with the function body
• Increases compilation scope, and with it ILP, e.g. for constant propagation and common subexpression elimination
• Reduces function call overhead, e.g. passing arguments, register saves and restores

Measured on a DEC 3100 [W.M. Hwu, 1991]:

    Program     In-line Speedup   In-line Code Expansion
    cccp             1.06               1.25
    compress         1.05               1.00+
    eqn              1.12               1.21
    espresso         1.07               1.09
    lex              1.02               1.06
    tbl              1.04               1.18
    xlisp            1.46               1.32
    yacc             1.03               1.17

Back End

    IR → code selection → code scheduling → register allocation → code emission → machine code

The back end works on an instruction-level IR and is machine dependent:
• maps virtual registers onto architected registers
• rearranges code
• performs target-machine-specific optimizations:
  – delayed branches
  – conditional moves
  – instruction combining, e.g. auto-increment addressing modes, add carrying (PowerPC), hardware branch (PowerPC)
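A source-level sketch of the induction-variable sharing described above, with the 2-d arrays viewed as flat 1-d arrays; the function names are illustrative:

    /* Before: the linear offset i*M + j is recomputed for each array,
     * in every iteration. */
    void copy_naive(int *z, const int *b, int N, int M) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                z[i * M + j] = b[i * M + j];
    }

    /* After: one shared induction variable k replaces both offset
     * computations and is incremented at the end of the loop body. */
    void copy_shared_iv(int *z, const int *b, int N, int M) {
        int k = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                z[k] = b[k];
                k = k + 1;
            }
    }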
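The slides mention loop unrolling without showing it, so here is a minimal sketch, assuming unrolling by four with independent accumulators; the function names are illustrative:

    #include <stddef.h>

    /* Rolled: one add, one compare, one branch per element. */
    long sum_rolled(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by 4: less branch overhead per element, and the four
     * independent adds can be issued in parallel on a wide machine. */
    long sum_unrolled(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; i++)      /* epilogue for n not divisible by 4 */
            s += a[i];
        return s;
    }

The four accumulators matter as much as the unrolling itself: a single accumulator would chain the adds and limit the parallel issue the slides point to.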
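Likewise, a before/after sketch of function inlining, showing how it exposes a constant argument to the optimizations listed earlier; all names here are illustrative:

    /* Before: the constant argument is hidden behind the call. */
    static int scale(int x, int k) {
        return x * k;
    }

    int f(int x) {
        return scale(x, 8);      /* call overhead + opaque k */
    }

    /* After inlining, the body is exposed with k = 8, so constant
     * propagation and strength reduction can finish the job. */
    int f_inlined(int x) {
        return x * 8;            /* inlined body, k propagated */
    }

    int f_reduced(int x) {
        return x << 3;           /* multiply strength-reduced to a shift */
    }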
Code Selection

• Map the IR to machine instructions, e.g. by pattern matching over the IR tree. For d = a * (b + c), the tree

    ASGI(&d, MULI(INDIRI(&a), ADDI(INDIRI(&b), INDIRI(&c))))

  is matched to

    addi Rt1, Rb, Rc
    muli Rt2, Ra, Rt1

• A tree-matching code selector:

    Inst *match(IR *n) {
        Inst *l, *r, *inst = NULL;
        switch (n->opcode) {
        /* ... cases for other opcodes ... */
        case MUL:
            l = match(n->left());
            r = match(n->right());
            if (n->type == D || n->type == F)
                inst = mult_fp((n->type == D), l, r);
            else
                inst = mult_int((n->type == I), l, r);
            break;
        case ADD:
            l = match(n->left());
            r = match(n->right());
            if (n->type == D || n->type == F)
                inst = add_fp((n->type == D), l, r);
            else
                inst = add_int((n->type == I), l, r);
            break;
        /* ... */
        }
        return inst;
    }

Our Old Friend…CPU Time

• CPU time = IC × CPI × t_cyc
• Consider which of these factors the various optimizations affect:
  – Function inlining
  – Loop unrolling
  – Code-optimizing transformations
  – Code selection

Machine-Dependent Optimizations

• Register allocation
• Instruction scheduling
• Peephole optimizations

Peephole Optimizations

• Replacement of assembly instructions through template matching
• E.g. replacing one addressing mode with another in a CISC (a sketch appears at the end of this section)

Code Scheduling

• Rearrange the code sequence to minimize execution time
  – Hide instruction latency
  – Utilize all available resources

Original sequence (stalls annotated on the waiting instruction):

    l.d   f4, 8(r8)
    fadd  f5, f4, f6     ; 1 stall
    l.d   f2, 16(r8)
    fsub  f7, f2, f6     ; 1 stall
    fmul  f7, f7, f5     ; 3 stalls
    s.d   f7, 24(r8)
    l.d   f8, 0(r9)
    s.d   f8, 8(r9)      ; 1 stall

After reordering (loads hoisted above their uses):

    l.d   f4, 8(r8)
    l.d   f2, 16(r8)
    fadd  f5, f4, f6     ; 0 stalls
    fsub  f7, f2, f6     ; 0 stalls
    fmul  f7, f7, f5     ; 3 stalls
    l.d   f8, 0(r9)
    s.d   f8, 8(r9)      ; 1 stall
    s.d   f7, 24(r8)

After reordering with memory disambiguation (l.d f8 proven independent of the pending stores, so it moves up as well):

    l.d   f4, 8(r8)
    l.d   f2, 16(r8)
    l.d   f8, 0(r9)
    fadd  f5, f4, f6     ; 0 stalls
    fsub  f7, f2, f6     ; 0 stalls
    s.d   f8, 8(r9)
    fmul  f7, f7, f5     ; 0 stalls
    s.d   f7, 24(r8)

Cost of Instruction Scheduling

• Given a program segment, the goal is to execute it as quickly as possible.
• The completion time is the objective function, or cost, to be minimized.
• This is referred to as the makespan of the schedule.
• It has to be balanced against the running time and space needs of the algorithm for finding the schedule, which translate into compilation cost.

Instruction Scheduling Example

Source program:

    main(int argc, char *argv[])
    {
        int a, b, c;
        a = argc;
        b = a * 255;
        c = a * 15;
        printf("%d\n", b*b - 4*a*c);
    }

After scheduling (prior to register allocation):

    op 10  MPY  vr2      ← param1, 255
    op 12  MPY  vr3      ← param1, 15
    op 14  MPY  vr8      ← vr2, vr2
    op 15  SHL  vr9      ← param1, 2
    op 16  MPY  vr10     ← vr9, vr3
    op 17  SUB  param2   ← vr8, vr10
    op 18  MOV  param1   ← addr("%d\n")
    op 27  PBRR vb12     ← addr(printf)
    op 20  BRL  ret_addr ← vb12

The General Instruction Scheduling Problem

Given a source program P, schedule the instructions so as to minimize the completion time (the makespan defined above).

Feasible schedule: a specification of a start time for each instruction such that the following constraints are obeyed: each instruction starts only after the instructions it depends on have produced their results, and the machine's resource limits are never exceeded.
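Returning to the peephole pass described above, it can be sketched as a template match over the emitted instructions. The tiny instruction encoding and the single template below are invented for illustration:

    #include <stdio.h>

    typedef enum { ADDI, MOVE, NOP } Op;
    typedef struct { Op op; int dst, src, imm; } Insn;

    /* One illustrative template: "addi rD, rS, 0" computes no new value,
     * so rewrite it to a plain move, or delete it when rD == rS. */
    static void peephole(Insn *code, int n) {
        for (int i = 0; i < n; i++) {
            if (code[i].op == ADDI && code[i].imm == 0)
                code[i].op = (code[i].dst == code[i].src) ? NOP : MOVE;
        }
    }

    int main(void) {
        Insn code[] = {
            { ADDI, 1, 2, 0 },   /* addi r1, r2, 0  -> move r1, r2 */
            { ADDI, 3, 3, 0 },   /* addi r3, r3, 0  -> deleted     */
            { ADDI, 4, 4, 8 },   /* addi r4, r4, 8  -> unchanged   */
        };
        peephole(code, 3);
        for (int i = 0; i < 3; i++)
            printf("op=%d dst=r%d src=r%d imm=%d\n",
                   code[i].op, code[i].dst, code[i].src, code[i].imm);
        return 0;
    }

A production peephole pass matches over a sliding window of several instructions; the one-instruction window here keeps the sketch short.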
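To tie the feasibility and makespan definitions together, here is a minimal greedy list-scheduling sketch for a single-issue machine. The five-operation dependence DAG, the latencies, and all names are invented, and the feasibility constraints assumed are the usual two stated above (operands ready before issue, one issue per cycle):

    #include <stdio.h>

    #define N 5

    /* latency[i]: cycles until op i's result is available after issue.
     * dep[i][j] = 1 means op j consumes op i's result. */
    static const int latency[N] = { 2, 2, 1, 1, 3 };
    static const int dep[N][N] = {
        /* op0 feeds op2; op1 feeds op3; op2 and op3 feed op4 */
        { 0, 0, 1, 0, 0 },
        { 0, 0, 0, 1, 0 },
        { 0, 0, 0, 0, 1 },
        { 0, 0, 0, 0, 1 },
        { 0, 0, 0, 0, 0 },
    };

    int main(void) {
        int start[N], placed[N] = { 0 };
        int cycle = 0, count = 0;

        while (count < N) {
            int issued = 0;
            /* Greedily issue the first op whose operands are all ready. */
            for (int j = 0; j < N && !issued; j++) {
                if (placed[j]) continue;
                int ready = 1;
                for (int i = 0; i < N; i++)
                    if (dep[i][j] &&
                        (!placed[i] || start[i] + latency[i] > cycle))
                        ready = 0;
                if (ready) {
                    start[j] = cycle;
                    placed[j] = 1;
                    count++;
                    issued = 1;
                    printf("cycle %2d: issue op%d\n", cycle, j);
                }
            }
            cycle++;   /* single-issue: at most one op per cycle */
        }

        int makespan = 0;   /* completion time of the whole schedule */
        for (int i = 0; i < N; i++)
            if (start[i] + latency[i] > makespan)
                makespan = start[i] + latency[i];
        printf("makespan = %d cycles\n", makespan);
        return 0;
    }

Real schedulers replace the first-ready choice with a priority heuristic (e.g. critical-path height), which is where the compile-time/quality trade-off discussed above comes in.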