Compiler Optimization and Structure of Compilers


Introduction to Optimizing Compilers

Hardware-Software Interface

• Machine: available resources, statically fixed; designed to support a wide variety of programs; interested in running many programs fast
• Program: required resources, dynamically varying; designed to run well on a variety of machines; interested in having itself run fast

Performance = tcyc x CPI x code size
Performance reflects how well the machine resources match the program requirements.

Compiler Tasks

• Code Translation
  – Source language → target language, e.g.
      FORTRAN → C
      C → MIPS, PowerPC or Alpha machine code
      MIPS binary → Alpha binary
• Code Optimization
  – Make the code run faster
  – Match dynamic code behavior to the static machine structure

Compiler Structure

    high-level source code → Front End → IR → Optimizer → IR → Back End → machine code
                                               (a Dependence Analyzer supports the Optimizer)

  Front End and Optimizer: machine independent; Back End: machine dependent
  (IR = intermediate representation)

Structure of Optimizing Compilers

    Source Programs
      → Front-end #1, Front-end #2, …                 (tools: program database)
      → High-level Intermediate Language (HIL)
      → High-level Optimizer → Optimized HIL
      → Lowering of IL
      → Low-level Intermediate Language (LIL)
      → Low-level Optimizer → Optimized LIL
      → Target-1 / Target-2 / Target-3 Code Generator and Linker   (with runtime systems)
      → Target-1 / Target-2 / Target-3 Executable

Front End

• Lexical Analysis
  – catches errors such as a misspelled identifier, keyword, or operator
  – e.g. lex
• Syntax Analysis
  – catches grammar errors, such as mismatched parentheses
  – e.g. yacc
• Semantic Analysis
  – type checking

Front-end Phases

1. Scanner – converts the input character stream into a stream of lexical tokens
2. Parser – derives the syntactic structure (parse tree, abstract syntax tree) from the token stream, and reports any syntax errors encountered
3. Semantic Analysis – generates the intermediate language representation from the input source program and user options/directives, and reports any semantic errors encountered

High-level Optimizer

• Global intra-procedural and inter-procedural analysis of the source program's control and data flow
• Selection of high-level optimizations and transformations
• Update of the high-level intermediate language

Intermediate Representation

• Achieve retargetability
  – Different source languages
  – Different target machines
• Example (tree-based IR from CMCC):

      int a, b, c, d;
      d = a * (b+c);

  Graphical representation:

      ASGI
        &d
        MULI
          INDIRI  &a
          ADDI
            INDIRI  &b
            INDIRI  &c

  Linear form of the graphical representation:

      A0     5 78 "a"
      A1     5 78 "b"
      A2     5 78 "c"
      A3     5 78 "d"
      FND1   ADDRL   A3
      FND2   ADDRL   A0
      FND3   INDIRI  FND2
      FND4   ADDRL   A1
      FND5   INDIRI  FND4
      FND6   ADDRL   A2
      FND7   INDIRI  FND6
      FND8   ADDI    FND5 FND7
      FND9   MULI    FND3 FND8
      FND10  ASGI    FND1 FND9

Lowering of Intermediate Language

• Linearized storage/mapping of variables
  – e.g. a 2-d array to a 1-d array
• Array/structure references → load/store operations
  – e.g. A[i] to "load R1, (R0)" where R0 contains i
• High-level control structures → low-level control flow
  – e.g. a "while" statement to branch statements

Machine-Independent Optimizations

• Dataflow analysis and optimizations
  – Constant propagation
  – Copy propagation
  – Value numbering
• Elimination of common subexpressions
• Dead code elimination
• Strength reduction
• Function/procedure inlining
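Constant propagation, copy propagation, and folding all rewrite the IR in place. As a concrete illustration, here is a minimal constant-folding sketch over a toy tree IR, in C (the same language as the code selector later in these notes). The node layout and opcode names are illustrative stand-ins, not the CMCC IR shown above.

    /* A minimal constant-folding pass over a toy tree IR (illustrative node
     * layout and opcode names -- not the CMCC IR above).  fold() walks an
     * expression tree bottom-up and collapses any operator whose operands
     * are both integer constants, e.g. (1 + 2) * a  =>  3 * a. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { CONST, VAR, ADD, MUL } Op;

    typedef struct Node {
        Op op;
        int value;              /* valid when op == CONST */
        const char *name;       /* valid when op == VAR   */
        struct Node *left, *right;
    } Node;

    static Node *mknode(Op op, int value, const char *name, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->op = op; n->value = value; n->name = name; n->left = l; n->right = r;
        return n;
    }

    static Node *fold(Node *n) {
        if (n->op == CONST || n->op == VAR)
            return n;
        n->left = fold(n->left);
        n->right = fold(n->right);
        if (n->left->op == CONST && n->right->op == CONST) {
            int v = (n->op == ADD) ? n->left->value + n->right->value
                                   : n->left->value * n->right->value;
            return mknode(CONST, v, NULL, NULL, NULL);
        }
        return n;
    }

    static void show(Node *n) {
        if (n->op == CONST)      printf("%d", n->value);
        else if (n->op == VAR)   printf("%s", n->name);
        else {
            printf("(");
            show(n->left);
            printf(n->op == ADD ? " + " : " * ");
            show(n->right);
            printf(")");
        }
    }

    int main(void) {
        /* (1 + 2) * a   folds to   3 * a */
        Node *e = mknode(MUL, 0, NULL,
                         mknode(ADD, 0, NULL,
                                mknode(CONST, 1, NULL, NULL, NULL),
                                mknode(CONST, 2, NULL, NULL, NULL)),
                         mknode(VAR, 0, "a", NULL, NULL));
        show(fold(e));
        printf("\n");
        return 0;
    }

Running it prints (3 * a) for the input (1 + 2) * a; in a real middle end, constant propagation would feed this pass by replacing variable uses whose reaching definition is a known constant.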
Code-Optimizing Transformations

• Constant folding
      (1 + 2)    ⇒  3
      (100 > 0)  ⇒  true
• Copy propagation
      x = b + c           x = b + c
      z = y * x      ⇒    z = y * (b + c)
• Common subexpression elimination
      x = b * c + 4       t = b * c
      z = b * c - 1  ⇒    x = t + 4
                          z = t - 1
• Dead code elimination
      x = 1
      x = b + c
  (the first assignment to x is dead because it is overwritten before use,
   or because x is not referred to at all)

Code Optimization Example

      x = 1                              x = 1
      y = a * b + 3       constant       y = a * b + 3
      z = a * b + x + z + 2    ⇒         z = a * b + 1 + z + 2
      x = 3               propagation    x = 3

                          constant       x = 1
                          folding        y = a * b + 3
                              ⇒          z = a * b + 3 + z
                                         x = 3

                          dead code      y = a * b + 3
                          elimination    z = a * b + 3 + z
                              ⇒          x = 3

                          common         t = a * b + 3
                          subexpression  y = t
                              ⇒          z = t + z
                                         x = 3

Code Motion

• Move code between basic blocks
• E.g. move loop-invariant computations outside of loops:

      while ( i < 100 ) {           t = x / y
        *p = x / y + i         ⇒    while ( i < 100 ) {
        i = i + 1                     *p = t + i
      }                               i = i + 1
                                    }

Strength Reduction

• Replace complex (and costly) expressions with simpler ones
• E.g. a := b*17  ⇒  a := (b<<4) + b
• E.g. (a runnable C version of this loop transformation is sketched below, after the loop-optimization slides):

      while ( i < 100 ) {           p = &a[ i ]
        a[ i ] = i * 100            t = i * 100
        i = i + 1              ⇒    while ( i < 100 ) {
      }                               *p = t
                                      t = t + 100
                                      p = p + 4
                                      i = i + 1
                                    }

      loop invariants: &a[i] == p, i*100 == t

Induction Variable Elimination

• Induction variable: the loop index.
• Consider the loop:

      for (i=0; i<N; i++)
        for (j=0; j<M; j++)
          z[i][j] = b[i][j];

• Rather than recompute i*M+j for each array in each iteration, share the induction variable between the arrays and increment it at the end of the loop body.

Loop Optimizations

• Motivation: restructure the program so as to enable more effective back-end optimizations and hardware exploitation
• Loop transformations are useful for enhancing
  – register allocation
  – instruction-level parallelism
  – data-cache locality
  – vectorization
  – parallelization
• Loops are good targets for optimization.
• Basic loop optimizations:
  – code motion
  – induction-variable elimination
  – strength reduction (x*2 -> x<<1)
• Improve performance by unrolling the loop
  – Note the impact when using processors that allow parallel execution of instructions, e.g. Texas Instruments' newer DSP processors

Importance of Loop Optimizations

Study of loop-intensive benchmarks in the SPEC92 suite [C.J. Newburn, 1991]:

      Program      No. of Loops   Static B.B. Count   Dynamic B.B. Count   % of Total
      nasa7              9               ---                322M              64%
                        16               ---                362M              72%
                        83               ---                500M             ~100%
      matrix300          1                17                217.6M            98%
                        15                96                221.2M            98+%
      tomcatv            1                 7                 26.1M            50%
                         5                22                 52.4M            99+%
                        12                96                 54.2M           ~100%
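To make the strength-reduction and induction-variable-elimination loop above concrete, here is a small runnable before/after version in C. The second array b and the final comparison loop are additions for checking the result, and the pointer increment is written as p + 1 because p is an int pointer (the slide's p = p + 4 is the byte-level equivalent).

    /* Before/after sketch of strength reduction and induction-variable
     * elimination on the slide's loop a[i] = i * 100.  The "after" loop
     * replaces the multiply with an addition and the indexed store with a
     * walking pointer, maintaining the invariants p == &b[i] and t == i*100. */
    #include <stdio.h>

    #define N 100

    int main(void) {
        int a[N], b[N];

        /* Before: one multiply and one indexed address computation per iteration. */
        for (int i = 0; i < N; i++)
            a[i] = i * 100;

        /* After: t and p are induction variables updated by cheap additions. */
        int t = 0;
        int *p = b;
        for (int i = 0; i < N; i++) {
            *p = t;
            p = p + 1;      /* the slide's "p = p + 4" in byte terms */
            t = t + 100;
        }

        /* Check that both loops produced the same values. */
        int same = 1;
        for (int i = 0; i < N; i++)
            if (a[i] != b[i])
                same = 0;
        printf("identical: %s\n", same ? "yes" : "no");
        return 0;
    }

An optimizing back end performs this rewriting automatically on the IR; writing it by hand here simply shows that the transformed loop is equivalent while using only additions in its body.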
Function Inlining

• Replace function calls with the function body
• Increases compilation scope (increases ILP)
  – e.g. constant propagation, common subexpression elimination
• Reduces function call overhead
  – e.g. passing arguments, register saves and restores

Speedup and code expansion from inlining [W.M. Hwu, 1991 (DEC 3100)]:

      Program    In-line Speedup   In-line Code Expansion
      cccp            1.06               1.25
      compress        1.05               1.00+
      equ             1.12               1.21
      espresso        1.07               1.09
      lex             1.02               1.06
      tbl             1.04               1.18
      xlisp           1.46               1.32
      yacc            1.03               1.17

Back End

    IR → Back End → Machine code

  The back end works on an instruction-level IR and passes through:
    code selection → code scheduling → register allocation → code emission

• map virtual registers onto architected registers
• rearrange code
• target-machine-specific optimizations
  – delayed branch
  – conditional move
  – instruction combining
      auto-increment addressing mode
      add carrying (PowerPC)
      hardware branch (PowerPC)

Code Selection

• Map IR to machine instructions (e.g. by pattern matching)
• For the earlier tree d = a * (b+c):

      ASGI
        &d
        MULI
          INDIRI  &a
          ADDI
            INDIRI  &b
            INDIRI  &c

  selected code:

      addi Rt1, Rb, Rc
      muli Rt2, Ra, Rt1

  Sketch of a tree-matching code selector:

      Inst *match (IR *n) {
          Inst *l, *r, *inst = NULL;
          switch (n->opcode) {
          case ……..:
          case MUL:
              /* recursively select code for the operands, then pick the
                 multiply instruction that matches the operand type */
              l = match (n->left());
              r = match (n->right());
              if (n->type == D || n->type == F)
                  inst = mult_fp ((n->type == D), l, r);
              else
                  inst = mult_int ((n->type == I), l, r);
              break;
          case ADD:
              l = match (n->left());
              r = match (n->right());
              if (n->type == D || n->type == F)
                  inst = add_fp ((n->type == D), l, r);
              else
                  inst = add_int ((n->type == I), l, r);
              break;
          case ……..:
          }
          return inst;
      }

Our Old Friend… CPU Time

• CPU time = CPI * IC * Clock (cycles per instruction x instruction count x cycle time)
• What do the various optimizations affect?
  – Function inlining
  – Loop unrolling
  – Code-optimizing transformations
  – Code selection

Machine-Dependent Optimizations

• Register allocation
• Instruction scheduling
• Peephole optimizations

Peephole Optimizations

• Replacement of assembly instructions through template matching
• E.g. replacing one addressing mode with another in a CISC

Code Scheduling

• Rearrange the code sequence to minimize execution time
  – Hide instruction latency
  – Utilize all available resources

  Original order:

      l.d  f4, 8(r8)
      fadd f5, f4, f6      1 stall
      l.d  f2, 16(r8)
      fsub f7, f2, f6      1 stall
      fmul f7, f7, f5
      s.d  f7, 24(r8)      3 stalls
      l.d  f8, 0(r9)
      s.d  f8, 8(r9)       1 stall

  After reordering the loads:

      l.d  f4, 8(r8)
      l.d  f2, 16(r8)
      fadd f5, f4, f6      0 stall
      fsub f7, f2, f6      0 stall
      fmul f7, f7, f5
      s.d  f7, 24(r8)      3 stalls
      l.d  f8, 0(r9)
      s.d  f8, 8(r9)       1 stall

  After a second reordering (requires memory disambiguation to move the
  accesses through r9 past the store to 24(r8)):

      l.d  f4, 8(r8)
      l.d  f2, 16(r8)
      fadd f5, f4, f6      0 stall
      fsub f7, f2, f6      0 stall
      fmul f7, f7, f5
      l.d  f8, 0(r9)
      s.d  f8, 8(r9)       1 stall
      s.d  f7, 24(r8)      0 stalls

Cost of Instruction Scheduling

• Given a program segment, the goal is to execute it as quickly as possible
• The completion time is the objective function, or cost, to be minimized
• This is referred to as the makespan of the schedule
• It has to be balanced against the running time and space needs of the algorithm for finding the schedule, which translate into compilation cost

Instruction Scheduling Example

  Source:

      main(int argc, char *argv[])
      {
          int a, b, c;
          a = argc;
          b = a * 255;
          c = a * 15;
          printf("%d\n", b*b - 4*a*c);
      }

  After scheduling (prior to register allocation):

      op 10  MPY  vr2      ← param1, 255
      op 12  MPY  vr3      ← param1, 15
      op 14  MPY  vr8      ← vr2, vr2
      op 15  SHL  vr9      ← param1, 2
      op 16  MPY  vr10     ← vr9, vr3
      op 17  SUB  param2   ← vr8, vr10
      op 18  MOV  param1   ← addr("%d\n")
      op 27  PBRR vb12     ← addr(printf)
      op 20  BRL  ret_addr ← vb12

Instruction Scheduling

Given a source program P, schedule the instructions so as to minimize the …

The General Instruction Scheduling Problem

Feasible Schedule: a specification of a start time for each instruction such that the following constraints are obeyed:
1. …
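As a concrete companion to the scheduling slides, here is a minimal greedy list-scheduling sketch in C for the eight-instruction example above. It assumes a single-issue, in-order machine; the latency values, the dependence edges (including the assumption that the r8 and r9 accesses do not alias), and the critical-path-height priority are illustrative choices, not the course's reference algorithm.

    /* Greedy list scheduling for the eight-instruction example, on an assumed
     * single-issue, in-order machine.  Latencies and dependences are
     * illustrative.  Priority is the critical-path height: the instruction's
     * latency plus the longest chain of dependent latencies below it. */
    #include <stdio.h>

    #define N 8

    typedef struct {
        const char *text;
        int latency;      /* cycles until the result may be used */
        int npred;
        int pred[4];      /* indices of instructions this one depends on */
    } Insn;

    static Insn code[N] = {
        { "l.d  f4, 8(r8)",  2, 0, {0} },        /* 0 */
        { "l.d  f2, 16(r8)", 2, 0, {0} },        /* 1 */
        { "fadd f5, f4, f6", 2, 1, {0} },        /* 2 */
        { "fsub f7, f2, f6", 2, 1, {1} },        /* 3 */
        { "fmul f7, f7, f5", 4, 2, {2, 3} },     /* 4 */
        { "s.d  f7, 24(r8)", 1, 1, {4} },        /* 5 */
        { "l.d  f8, 0(r9)",  2, 0, {0} },        /* 6 */
        { "s.d  f8, 8(r9)",  1, 1, {6} },        /* 7 */
    };

    int main(void) {
        int height[N], finish[N], done[N] = {0};

        /* Critical-path height: latency plus longest dependent chain below. */
        for (int i = 0; i < N; i++) height[i] = code[i].latency;
        for (int changed = 1; changed; ) {
            changed = 0;
            for (int i = 0; i < N; i++)
                for (int p = 0; p < code[i].npred; p++) {
                    int pr = code[i].pred[p];
                    if (height[pr] < code[pr].latency + height[i]) {
                        height[pr] = code[pr].latency + height[i];
                        changed = 1;
                    }
                }
        }

        /* Each cycle, issue the ready instruction with the greatest
         * critical-path height, or stall if nothing is ready. */
        int cycle = 0, issued = 0;
        while (issued < N) {
            int best = -1;
            for (int i = 0; i < N; i++) {
                if (done[i]) continue;
                int ready = 1;
                for (int p = 0; p < code[i].npred; p++)
                    if (!done[code[i].pred[p]] || finish[code[i].pred[p]] > cycle)
                        ready = 0;
                if (ready && (best < 0 || height[i] > height[best]))
                    best = i;
            }
            if (best >= 0) {
                printf("cycle %2d: %s\n", cycle, code[best].text);
                finish[best] = cycle + code[best].latency;
                done[best] = 1;
                issued++;
            } else {
                printf("cycle %2d: <stall>\n", cycle);
            }
            cycle++;
        }
        printf("makespan: %d cycles\n", cycle);
        return 0;
    }

With these assumed latencies the scheduler hoists the loads above their uses and fills one of the multiply's latency cycles with the store to 8(r9), but two stall cycles remain before the final store; better or worse schedules emerge as the latency model and priority function change, which is exactly the schedule-quality versus compile-time trade-off discussed above.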