
Instruction Execution & Instruction Pipelining

CS160 Ward 1 CS160 Ward 2

Instruction Execution

• Simple fetch-decode-execute cycle:
  1. Get address of next instruction from PC
  2. Fetch next instruction into IR
  3. Change PC
  4. Determine instruction type (add, shift, …)
  5. If instruction has operand in memory, fetch it into a register
  6. Execute instruction, storing result
  7. Go to step 1

Indirect Cycle

• Instruction execution may require memory access to fetch operands
• Memory fetch may use indirect addressing
  – Actual address to be fetched is in memory
  – Requires more memory accesses
  – Can be thought of as an additional instruction subcycle
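The fetch-decode-execute cycle above can be sketched as a toy interpreter. The one-address instruction format and the opcodes (LOAD, ADD, STORE, HALT) are invented for this sketch, not part of any real ISA:

```python
# Toy machine: memory holds both instructions and data.
memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102), 3: ("HALT", None),
          100: 5, 101: 7, 102: 0}

pc = 0          # program counter: address of next instruction
acc = 0         # accumulator register

while True:
    ir = memory[pc]          # steps 1-2: use PC to fetch next instruction into IR
    pc += 1                  # step 3: change PC to point past the fetched instruction
    op, addr = ir            # step 4: determine instruction type
    if op == "HALT":
        break
    operand = memory[addr]   # step 5: operand is in memory, fetch it into a register
    if op == "LOAD":         # step 6: execute instruction, storing the result
        acc = operand
    elif op == "ADD":
        acc += operand
    elif op == "STORE":
        memory[addr] = acc
                             # step 7: loop back to step 1

print(acc, memory[102])
```

Running this loads 5, adds 7, and stores 12 back to location 102.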

Basic Instruction Cycle States

Data Flow (Instruction Fetch)

• Depends on CPU design; below is typical
• Fetch (Step 2):
  – PC contains address of next instruction
  – Address moved to MAR
  – Address placed on address bus; control unit requests memory read
  – Result placed on data bus, copied to MBR, then to IR
  – Meanwhile PC incremented by 1 instruction word
    • Typically byte addressable, so PC incremented by 4 for a 32-bit ISA


Data Flow (Instruction Fetch)

[Diagram: 3-bus architecture; buses combined in a 1-bus architecture]

Data Flow (Data Fetch)

• IR is examined
• If indirect addressing, an indirect cycle is performed:
  – Rightmost bits of MBR transferred to MAR
  – Control unit requests memory read
  – Result (address of operand) moved to MBR

Data Flow (Indirect)

[Diagram of the indirect cycle]

Data Flow (Execute)

• May take many forms
• Depends on instruction being executed
• May include:
  – Memory read/write
  – Input/output
  – Register transfers
  – ALU operations


Data Path [1]

[Diagram of the data path]

Data Path [2]

• Data path cycle: two operands go from registers through the ALU, and the result is stored (the faster the better)
• Data path cycle time: the time to accomplish the above
• Width of the data path is a major determining factor in performance

RISC and CISC Based Processors

• RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)
  – "War" started in the late 1970's
  – RISC:
    • Simple interpret step, so each instruction is faster
    • Issues instructions more quickly (faster)
  – Basic concept: if RISC needs 5 instructions to do what 1 CISC instruction does, but 1 CISC instruction takes 10 times longer than 1 RISC instruction, then RISC is faster

RISC vs. CISC Debate

• CISC & RISC both improved during the 1980's
  – CISC driven primarily by technology
  – RISC driven by technology & architecture
• RISC has basically won
  – Improved by over 50% / year
  – Took advantage of new technology
  – Incorporated new architectural features (e.g., Instruction-Level Parallelism (ILP), superscalar concepts)
• Compilers have become a critical component in performance
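The "basic concept" arithmetic works out as follows, using the slide's own numbers (5 RISC instructions vs. 1 CISC instruction that takes 10 times as long):

```python
# Numbers taken from the slide's example; time units are arbitrary.
risc_count, cisc_count = 5, 1       # instructions needed for the same task
risc_time = 1                        # time per RISC instruction
cisc_time = 10 * risc_time           # 1 CISC instruction takes 10x longer

risc_total = risc_count * risc_time  # 5 time units
cisc_total = cisc_count * cisc_time  # 10 time units
print(risc_total, cisc_total)        # RISC finishes the task in half the time
```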


More on RISC & CISC

• RISC-CISC debate largely irrelevant now
  – Concepts are similar
  – Modern RISC has a "complicated" ISA
  – CISC leader (Intel family) now has a large RISC component
• Heavy reliance on compiler technology

Prefetching

Prefetch

• Fetch accesses main memory
• Execution usually does not access main memory
• Can fetch next instruction during execution of current instruction (will discuss later)
• Called instruction prefetch

Improved Performance

• But not doubled:
  – Fetch is usually shorter than complete execution of one instruction
  – Any jump or branch means that prefetched instructions are not the required instructions
• Prefetch more than one instruction?
  – If fetch takes longer than execute, then prefetching more than one saves the day!
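The improvement (and why it is less than 2x) can be illustrated with a quick overlap calculation. The stage times below are assumed for illustration, not taken from the slides:

```python
# Assumed illustrative times (nanoseconds): fetch is shorter than execute.
fetch, execute = 4, 6
n = 10  # number of instructions, no branches

# Strictly sequential: every instruction pays fetch + execute.
no_prefetch = n * (fetch + execute)

# With prefetch, the fetch of instruction i+1 overlaps the execute of
# instruction i; since fetch < execute, every fetch after the first is
# completely hidden.
with_prefetch = fetch + n * execute

print(no_prefetch, with_prefetch)  # improved, but not halved
```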


Prefetching, Waiting & Branching

Pipelining

Pipelining

• Add more stages to improve performance
• For example, 6 stages:
  – Fetch instruction (FI) (memory access)
  – Decode instruction (DI)
  – Calculate operands (CO)
  – Fetch operands (FO) (possible memory access)
  – Execute instruction (EI)
  – Write operand (result) (WO) (possible memory access)
• Overlap these stages

Timing of Pipeline

• Again, cache saves the day!


Important Points

• An instruction pipeline passes instructions through a series of stages, with each stage performing some part of the instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture will not improve the overall time required to execute one instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture has a higher throughput (number of instructions processed per second).

Execution Time

• Assume that a pipelined instruction has 4 stages, and the maximum times required in the stages are 10, 11, 10 and 12 nanoseconds, respectively.
• How long will the processor take to perform each stage? 12 nanoseconds
• How long does it take for one instruction to complete execution? 48 nanoseconds
• How long will it take for ten instructions to complete execution? (48 + 9*12) = 156 nanoseconds, i.e., 15.6 nanoseconds/instruction
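The worked example above follows directly from the stage times; a short calculation reproduces each answer:

```python
# Stage times from the slide's example (nanoseconds).
stage_times = [10, 11, 10, 12]

clock = max(stage_times)                        # every stage gets the longest time: 12 ns
one_instruction = clock * len(stage_times)      # 4 stages x 12 ns = 48 ns
n = 10
# First instruction takes the full 48 ns; each later one completes
# one clock period after its predecessor.
ten_instructions = one_instruction + (n - 1) * clock   # 48 + 9*12 = 156 ns
per_instruction = ten_instructions / n                 # throughput: 15.6 ns/instruction

print(clock, one_instruction, ten_instructions, per_instruction)
```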

4-Stage Pipeline

• Another example (with interstage buffers):
  – Fetch instruction (F) (memory access)
  – Decode instruction and fetch operands (D)
  – Execute instruction (E) (possible memory access)
  – Write operand (result) (W)

Some Potential Causes of Pipeline "Stalls"

• What might cause the pipeline to "stall"?
  – One stage takes too long, for example:
    • Different execution times (data hazard)
    • Cache miss (control hazard)
    • Memory unit busy (structural hazard)
  – Branch to a new location (control hazard)
  – Call a subroutine (control hazard)


Long Stages – Execution [1]

• For a variety of reasons, some instructions take longer to execute than others (e.g., add vs. divide) ⇒ stall

[Pipeline chart, clock cycles 1–9: instructions I1–I5 proceed through F, D, E, W; a long execute stage stalls a following instruction 2 cycles (data hazard example)]

Long Stages – Execution [2]

• Solution:
  – Sub-divide the long execute stage into multiple stages
  – Provide multiple execution hardware so that multiple instructions can be in the execute stage at the same time
  – Produces one instruction completion every cycle after the pipeline is full

Long Stages – Cache Miss

• Delay in getting the next instruction due to a cache miss (will discuss later)

[Diagram: control hazard example; while the fetch waits, the Decode, Execute & Write stages become idle (stall)]

Long Stages – Memory Access Conflict

• Two different instructions require access to memory at the same time

  Load X(R1),R2  (place contents of memory location X+R1 into R2)

[Diagram: structural hazard example; I2 is writing to R2 when I3 wants to write to a register, causing a stall]


Conditional Branch in a Pipeline

[Diagram: instruction 3 is a conditional branch to instruction 15; a control hazard]

Six-Stage Instruction Pipeline

[Flowchart including unconditional-branch logic]

• Note important architectural point: when is the PC changed?

Data Dependencies in a Pipeline

  r1 = r2 + r3
  r4 = r1 - r3

• Pipeline stalls until r1 has been written (RAW hazard)

Alternative Pipeline Depiction

  r1 = r2 + r3
  r4 = r1 - r3


Example of Avoiding Stalls

• Stalls eliminated by rearranging the code in (a) to (b)

RAW Hazard: Forwarding

• Hardware optimization to avoid stall
• Allows ALU to reference result in next instruction
• Example:
    Instruction K:   C ← add A B
    Instruction K+1: D ← subtract E C
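The cycle savings from forwarding can be sketched with a simplified model of the 4-stage F/D/E/W pipeline, using the earlier RAW example (r1 = r2 + r3; r4 = r1 - r3). The model tracks only register availability and assumes one fetch per cycle:

```python
# Simplified cycle count for an in-order 4-stage pipeline (F, D, E, W).
# Without forwarding, a consumer's D stage waits until the cycle after
# the producer's W; with forwarding, the consumer's E stage can use the
# producer's ALU output the cycle after the producer's E.
def run(instrs, forwarding):
    avail = {}      # register -> first cycle its value can be consumed
    fetch = 1       # cycle in which the current instruction enters F
    last_w = 0
    for dest, srcs in instrs:
        if forwarding:
            e = fetch + 2                       # F, D, then E
            for s in srcs:
                e = max(e, avail.get(s, 0))     # E waits for forwarded values
        else:
            d = fetch + 1
            for s in srcs:
                d = max(d, avail.get(s, 0))     # D waits for written-back values
            e = d + 1
        w = e + 1
        avail[dest] = (e + 1) if forwarding else (w + 1)
        last_w = w
        fetch += 1
    return last_w   # cycle in which the last instruction completes W

prog = [("r1", ("r2", "r3")),   # r1 = r2 + r3
        ("r4", ("r1", "r3"))]   # r4 = r1 - r3  (RAW on r1)
print(run(prog, forwarding=False), run(prog, forwarding=True))
```

In this model the stalling design finishes in cycle 7 (two bubble cycles), while forwarding removes the bubbles entirely and finishes in cycle 5.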

CS160 Ward 35 CS160 Ward 36 Unconditional Branches: Instruction Conditional Branches: Delayed Branch Queue & Prefetching

With more stages, need to reorder more instructions.


Conditional Branches: Dynamic Prediction

Pipelining Speedup Factors [1]

[Graph: speedup without branches]

• Never get speedup better than the number of stages in a pipeline
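The stage-count limit follows from the ideal pipeline time formula. The formula below is the standard idealization S = (n·k)/(k + n − 1) for n instructions in a k-stage pipeline (no branches or stalls), not something stated on the slides:

```python
# Ideal speedup of a k-stage pipeline over an unpipelined machine,
# for n instructions: unpipelined time n*k clocks vs. pipelined k + n - 1.
def speedup(n, k):
    return (n * k) / (k + n - 1)

k = 6  # stages, matching the 6-stage example above
# As n grows, speedup approaches k but never reaches it.
print(speedup(10, k), speedup(1000, k), speedup(10**6, k))
```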

Pipelining Speedup Factors [2]

• Never get speedup better than the number of instructions without a branch

Classic 5-Stage RISC Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode & Register Fetch (ID)
• Execute / Effective Address (load-store) (EX)
• Memory Access (MEM) (possible memory access)
• Write-back (to register) (WB)


Another 5-Stage Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode (ID)
• Operand Fetch (OF) (possible memory access)
• Execute Instruction (EX)
• Write-back (WB) (possible memory access)

Achieving Maximum Speed

• Program must be written to accommodate the instruction pipeline
• To minimize stalls, for example:
  – Avoid introducing unnecessary branches
  – Delay references to result register(s)

More About Pipelines

• Although hardware that uses an instruction pipeline will not run at full speed unless programs are written to accommodate the pipeline, a programmer can choose to ignore pipelining and assume the hardware will automatically increase speed whenever possible.
• Modern compilers are typically able to move instructions around to optimize pipeline execution for RISC architectures.

Advanced Pipeline Topics

• Structural hazards (e.g., resource conflicts)
• Data hazards (e.g., RAW, WAR, WAW)
• Control hazards (e.g., branch prediction, cache miss)
• Dynamic scheduling
• Out-of-order issue
• Out-of-order execution
• . . .


Parallelizations

Superscalar Architecture [1]

• Superscalar architecture (parallelism inside a single processor: Instruction-Level Parallelism) example
  – 2 pipelines with 2 ALUs
  – 2 instructions must not conflict over a resource (e.g., register, memory access unit) and must not depend on the other's result
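The dual-issue condition can be sketched as a register-level check. This is a simplification: it tests only register conflicts and the result dependence named on the slide, ignoring memory-unit conflicts; the instruction encoding is invented for the sketch:

```python
# Two instructions, each as (destination register, source registers).
# They may issue in the same cycle only if they don't write the same
# register and the second doesn't consume the first's result.
def can_dual_issue(i1, i2):
    d1, _ = i1
    d2, s2 = i2
    return d1 != d2 and d1 not in s2

add = ("r1", ("r2", "r3"))   # r1 = r2 + r3
sub = ("r4", ("r1", "r3"))   # r4 = r1 - r3   (depends on r1)
mul = ("r5", ("r6", "r7"))   # r5 = r6 * r7   (independent)

print(can_dual_issue(add, sub), can_dual_issue(add, mul))
```

The dependent pair must issue in separate cycles; the independent pair can share a cycle across the two pipelines.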

Superscalar Architecture [2]

• Superscalar architecture (parallelism inside a single processor: Instruction-Level Parallelism) example
  – Multiple functional units (i.e., ALUs)

Processor-Level Parallelism

• Array processor: single control unit, large array of identical processors that perform the same instructions on different data
• Vector processor: heavily pipelined ALU efficient at executing instructions on data pairs (like matrices), plus vector registers (sets of registers that can be loaded simultaneously)
• Multiprocessors: more than one independent CPU sharing the same memory
• Multicomputers: large number of interconnected computers that don't share memory but pass data on buses to share it
