Introduction to Instruction Scheduling
15-745 Optimizing Compilers, Spring 2006
Peter Lee

Reminders
- Read: Ch.17 (instruction scheduling)
- T3 due after Spring Break
- Watch the RSS feed for announcements about the reading list and project suggestions
- Contact Mike about proposal dates

Optimizations
- IR-level optimizations: mostly machine independent; the goal is to eliminate computations
- Instruction-level optimizations: register allocation (reduce data access cost) and instruction scheduling (today's topic)

Instruction scheduling
Assume all instructions are essential, i.e., we have finished optimizing the IR. Instruction scheduling attempts to rewrite the code for maximum instruction-level parallelism (ILP). Instruction scheduling (IS) is NP-complete (and bad in practice), so heuristics must be used.

Example:
  a = 1 + x;
  b = 2 + y;
  c = 3 + z;
Since all three instructions are independent, we can execute them in parallel, assuming adequate hardware.

Parallelism constraints
- Data-dependence constraints: if instruction A computes a value that is read by instruction B, then B can't execute before A is completed.
- Resource hazards: the finiteness of hardware (e.g., how many add circuits there are) means limited parallelism.

Hardware parallelism
Three forms of parallelism are found in modern hardware:
- pipelining
- superscalar processing
- multiprocessing
Of these, the first two are commonly exploited by instruction scheduling.

Pipelining
The basic idea is to decompose an instruction's execution into a sequence of stages, so that multiple instruction executions can be overlapped: the same principle as the assembly line. The classic 5-stage pipeline:
- instruction fetch (IF)
- decode and register fetch (RF)
- execute on ALU (EX)
- memory access (ME)
- write back to register file (WB)

Why this might help
The classic laundry picture...

Pipelining speedup
Suppose a non-pipelined machine can execute an
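Since scheduling is NP-complete and solved heuristically, the dependence constraint above can be made concrete with a tiny greedy list scheduler. This is a minimal sketch, not the course's algorithm: instruction ids, the dependence-pair encoding, and the `issue_width` parameter are assumptions for illustration, and the dependence graph is assumed to be acyclic.

```python
def list_schedule(deps, n_insts, issue_width=2):
    """deps: (a, b) pairs meaning instruction b reads a value written
    by instruction a, so b cannot issue before a completes.
    Returns a list of cycles, each a list of instruction ids."""
    preds = {i: set() for i in range(n_insts)}
    for a, b in deps:
        preds[b].add(a)
    done, cycles = set(), []
    while len(done) < n_insts:
        # ready = not yet scheduled, all predecessors already completed
        ready = [i for i in range(n_insts)
                 if i not in done and preds[i] <= done]
        issue = ready[:issue_width]      # greedy: take them in order
        cycles.append(issue)
        done |= set(issue)
    return cycles

# a = 1 + x; b = 2 + y; c = 3 + z  (all independent): one cycle
print(list_schedule([], 3, issue_width=4))       # [[0, 1, 2]]
# a dependent pair must be split across two cycles
print(list_schedule([(0, 1)], 2, issue_width=4)) # [[0], [1]]
```

Real list schedulers refine the "take them in order" step with priority heuristics such as critical-path length.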
instruction in 5ns: this is a completion rate of 0.20 inst/ns. If each stage of the pipelined machine completes in 1ns, then it executes at a completion rate of 1.0 inst/ns (after 5ns to fill the pipeline): an N-fold improvement, where N is the number of stages! This works best if each stage takes the same amount of time (e.g., one clock cycle).

Pipelining illustration
In a given cycle, each instruction is in a different stage, but every stage is active: the pipeline is "full".
[Figure: overlapped IF RF EX ME WB stages for successive instructions over time]
This classic 5-stage organization of the standard von Neumann model was developed for the MIPS RISC processor (Hennessy et al., 1981).

Pipelining speedup
In the ideal case, a pipelined processor will complete one instruction every cycle. This is the instruction throughput, which ideally is 1 IPC (instructions per cycle). Alternatively, we can talk in terms of CPI (cycles per instruction).

Is more better?
If 5 stages is good, 8 should be better, right? And 20 even better than that! Not only do we get (potentially) more parallelism this way, but with smaller/simpler stages, the clock rate can be even higher: we win on both parallelism and cycle time.

More and more stages
Indeed, the MIPS R4000 used an 8-stage pipeline, and the Pentium 4 has 20 stages!

Limitations of pipelining
Since all stages execute in parallel, the cycle time is limited by the slowest stage.
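The speedup arithmetic above is easy to check directly. A sketch using the slide's numbers (5ns non-pipelined execution, five 1ns stages); the function names are mine:

```python
# Non-pipelined: every instruction takes the full latency.
def nonpipelined_time(n, inst_ns=5.0):
    return n * inst_ns

# Pipelined: (stages - 1) cycles to fill, then one completion per cycle.
def pipelined_time(n, stages=5, stage_ns=1.0):
    return (stages - 1 + n) * stage_ns

n = 1000
print(nonpipelined_time(n))                       # 5000.0 ns
print(pipelined_time(n))                          # 1004.0 ns
print(nonpipelined_time(n) / pipelined_time(n))   # ~4.98x
```

The speedup approaches the ideal N-fold factor (here 5x) only as n grows large relative to the pipeline-fill time.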
(Most consumers equate performance with clock rate, so deep pipelines also sell more chips. However, the value of a stage is unclear if it is extremely simple, e.g., shorter than the time of an integer add.)

Also, the deeper the pipeline, the longer it takes to fill. Worst of all, pipeline hazards can cause the processor to suspend, or stall, progress temporarily, and stalls can have a devastating effect on performance. Amdahl's Law implies a limit to the available parallelism.

[Figure: the memory hierarchy, larger/slower/cheaper from registers down: CPU registers (8B, 3ns), cache (8KB, 32B lines, 6ns), main memory (60ns), disk/virtual memory (8ms)]

Pipeline hazards
- One instruction may need the results of a previous instruction.
- A branch/jump target might not be known until later in the pipeline.
- An instruction may be waiting on a fetch from main memory.
Such inconsistencies in the pipeline that cause stalls are called hazards.

Superscalar processing
To get even more parallelism, we can go superscalar: multiple instructions proceed simultaneously through the same pipeline stages, i.e., multiple instructions occupy the same pipeline stage at the same time. This is accomplished by adding more hardware, both for parallel execution of stages and for dispatching instructions to them. Some versions of the PowerPC, for example, have 4 ALUs and 2 FPUs. The Intel i960 (1988) was the first commercial superscalar microprocessor; before that, 1960s-era Cray supercomputers also used the superscalar concept.

Scheduling complications
We would like to reorder/rewrite instructions so as to maximize parallelism, i.e., keep all hardware resources busy, but there are complications.

Complication #1: Data dependences
We must be careful not to read or write a data location "too early".
- True dependence: read-after-write, e.g., x = 1 followed by y = x.
- Output dependence: write-after-write, e.g., x = 1 followed by x = 2; sometimes fixed via renaming.
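The dependence kinds named above can be sketched as a small classifier over read/write sets. This is an illustration under simplifying assumptions: real compilers work on IR operands, and (as the notes say) memory references make these sets hard to compute.

```python
def dependences(first, second):
    """first/second: (reads, writes) as sets of locations, with
    `first` earlier in program order. Returns the dependence kinds
    that force `second` to stay after `first`."""
    r1, w1 = first
    r2, w2 = second
    kinds = set()
    if w1 & r2: kinds.add("true (RAW)")    # second reads first's result
    if w1 & w2: kinds.add("output (WAW)")  # both write the same place
    if r1 & w2: kinds.add("anti (WAR)")    # second overwrites an input
    return kinds

# x = 1  then  y = x : read-after-write
print(dependences((set(), {"x"}), ({"x"}, {"y"})))  # {'true (RAW)'}
# y = x  then  x = 1 : write-after-read
print(dependences(({"x"}, {"y"}), (set(), {"x"})))  # {'anti (WAR)'}
```

A scheduler may reorder two instructions only when this set is empty in both directions.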
- Anti-dependence: write-after-read, e.g., y = x followed by x = 1.

(The scheduling complications, in summary: 1) data dependences, 2) control dependences, 3) limited hardware resources.)

Data hazards
  r1 = r2 + r3    ; r2+r3 available only late in the pipeline
  r4 = r1 + r1
  r1 = [r2]       ; [r2] available only after the memory stage
  r4 = r1 + r1
[Figure: IF RF EX ME WB pipeline diagrams showing when each result becomes available to the dependent instruction]

Data dependences
In practice, data dependences are extremely difficult to reason about:
  x = a[i];
  *p = 1;
  y = *q;
  *r = z;
There has been considerable research effort on alias analysis and pointer analysis, with limited success.

Register renaming
Sometimes data hazards are not "real", in the sense that a simple renaming of registers can eliminate them:
- for an output dependence, A and B both write;
- for an anti-dependence, A reads and B writes.
The hardware can sometimes rename registers, thereby allowing reordering.

Renaming example
  r1 = r2 + 1          r7 = r2 + 1          r7 = r2 + 1
  [fp+8] = r1    =>    [fp+8] = r7    =>    r1 = r3 + 2
  r1 = r3 + 2          r1 = r3 + 2          [fp+8] = r7
  [fp+12] = r1         [fp+12] = r1         [fp+12] = r1
Renaming the first definition of r1 to r7 removes the output and anti-dependences, so the two computations can be reordered. Some register allocators avoid quick re-use of registers to get some of the benefit of renaming. Note that there is a phase-ordering problem: should scheduling run before or after register allocation?

Complication #2: Control dependences
What do we do when we reach a conditional branch? An unconditional target is available only after the decode stage; a conditional target is available only after the execute stage.
- Option 1 (hw solution): stall the pipeline.
- Option 2 (sw solution): insert nop instructions.
Both make the code slower. We can try to schedule for multiple paths, but this is largely impractical for real programs.

Branch delay slots
Option 3: the branch takes effect only after the delay slots.

Branch prediction
Some current processors will speculatively
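The renaming example above can be sketched as a tiny rewrite pass that gives every register definition a fresh name (the slide renames only one definition; renaming all of them is the same idea taken to its SSA-style limit). The tuple encoding and the `r`/`t` naming convention are assumptions for illustration; stores to stack slots like `fp+8` are left unrenamed.

```python
def rename(insts):
    """insts: list of (dest, [srcs]). Register names start with 'r';
    each register definition gets a fresh 't' name and later uses are
    rewritten, eliminating output and anti-dependences between defs."""
    cur, out = {}, []       # cur: current fresh name per register
    counter = 0
    for dest, srcs in insts:
        srcs = [cur.get(s, s) for s in srcs]   # uses see the latest name
        if dest.startswith("r"):
            counter += 1
            cur[dest] = f"t{counter}"
            dest = cur[dest]
        out.append((dest, srcs))
    return out

prog = [("r1", ["r2"]),      # r1 = r2 + 1
        ("fp+8", ["r1"]),    # [fp+8] = r1
        ("r1", ["r3"]),      # r1 = r3 + 2
        ("fp+12", ["r1"])]   # [fp+12] = r1
print(rename(prog))
# [('t1', ['r2']), ('fp+8', ['t1']), ('t2', ['r3']), ('fp+12', ['t2'])]
```

After renaming, only the true dependences remain, so the two computation/store pairs can be freely interleaved by the scheduler.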
execute at conditional branches: if the guess is correct, then great; if not, then flush the pipelines. (With branch delay slots, some instructions get executed after the branch instruction but before the branching takes effect.) The average number of instructions per basic block (in C code) is 5: what happens with a 20-stage pipeline?

Complication #3: Hardware
The hardware is finite (obviously). Also, for engineering reasons, there may be constraints on how the hardware resources may be used.

HW complication: Issue width
Superscalar allows more than one instruction to be "issued" to the processor in each cycle. However, there is a finite limit.
[Figure: the same schedule with infinite issue width vs. issue width = 4]

HW complication: Functional units
A 4-way superscalar might be limited to issuing, e.g., at most 2 integer, 1 memory, and 1 FP instruction per cycle.
[Figure: original (unconstrained) vs. realistic schedules for integer, memory, and floating-point instructions; the FP unit is the bottleneck]

HW complication: Pipeline limits
Often, there are restrictions on the pipelining within a functional unit: e.g., some FPUs allow a new FP division only once every 2 cycles.

Compiler or hardware?
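The issue-width and functional-unit limits can be sketched as a cycle counter for an in-order instruction stream. This is a simplification under stated assumptions: the per-class limits (2 integer, 1 memory, 1 FP) and total width of 4 come from the slide's example, instructions are not reordered, and latencies are ignored.

```python
from collections import Counter

LIMITS = {"int": 2, "mem": 1, "fp": 1}   # per-cycle, per-class limits
WIDTH = 4                                # total issue width

def issue_cycles(classes):
    """classes: instruction classes in program order. Start a new
    cycle whenever a per-class limit or the total width is hit."""
    cycles, cur = 1, Counter()
    for c in classes:
        if sum(cur.values()) == WIDTH or cur[c] == LIMITS[c]:
            cycles += 1
            cur = Counter()
        cur[c] += 1
    return cycles

print(issue_cycles(["int", "int", "mem", "fp"]))  # 1 cycle: fits exactly
print(issue_cycles(["mem", "mem", "mem"]))        # 3 cycles: 1 mem/cycle
```

This is why schedulers try to interleave classes: three loads in a row waste the integer and FP slots of three consecutive cycles.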
It is possible to do some "on-the-fly" instruction re-ordering in hardware. This raises the question of whether it is better to do scheduling in the compiler or in hardware. In reality, a combination of compiler and hardware scheduling support will be used. There is general agreement, however, that further improvements in superscalar control will be quite limited. The design spectrum, from compiler-centric to hardware-centric:
- very long instruction word (VLIW) [Itanium]
- in-order superscalar [early Pentium]
- out-of-order superscalar [Pentium 4]

VLIW processors
Except for memory references, execution latencies for each functional unit are fixed, so the hardware typically checks data dependences dynamically only on memory references. The compiler is responsible for taking all other data dependences into account, and must insert NOPs for empty slots.
[Figure: a very long instruction word with int, mem, and fp slots, holding e.g. a = b + 1; c = a - d]

Compiling for VLIW
- Idea: give full control of scheduling to the compiler.
- Why: the hardware can be simpler (and thus faster) if it isn't complicated by scheduling.
- How: expose all functional units via a "very long instruction word".
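The NOP-insertion duty described above can be sketched as a slot packer. The one-slot-per-unit word shape, the tuple encoding, and the rule that a consumer may not share a word with its producer are all assumptions for illustration, not the Itanium encoding.

```python
SLOTS = ("int", "mem", "fp")   # assumed word shape: one slot per unit

def pack(insts):
    """insts: list of (name, slot, deps), deps naming earlier
    instructions whose results this one reads. Returns a list of
    long words; unused slots are filled with explicit NOPs."""
    words, placed = [], {}     # placed: name -> index of its word
    for name, slot, deps in insts:
        # earliest legal word: strictly after every producer's word
        w = 0 if not deps else max(placed[d] for d in deps) + 1
        while True:
            while len(words) <= w:
                words.append({s: "nop" for s in SLOTS})
            if words[w][slot] == "nop":
                break
            w += 1             # slot already taken: try a later word
        words[w][slot] = name
        placed[name] = w
    return words

# a = b + 1; c = a - d  (both integer ops, c depends on a)
for word in pack([("a", "int", []), ("c", "int", ["a"])]):
    print(word)
# {'int': 'a', 'mem': 'nop', 'fp': 'nop'}
# {'int': 'c', 'mem': 'nop', 'fp': 'nop'}
```

The dependent pair costs two full words, and four of the six slots are NOPs: exactly the code-density problem VLIW compilers fight.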