
Instruction Execution & Instruction Pipelining
CS160 Ward

Instruction Execution
• Simple fetch-decode-execute cycle:
  1. Get address of next instruction from PC
  2. Fetch next instruction into IR
  3. Change (increment) PC
  4. Determine instruction type (add, shift, …)
  5. If instruction has operand in memory, fetch it into a register
  6. Execute instruction, storing result
  7. Go to step 1

Indirect Cycle
• Instruction execution may require memory access to fetch operands
• Memory fetch may use indirect addressing
  – Actual address to be fetched is in memory
  – Requires more memory accesses
  – Can be thought of as an additional instruction subcycle

Basic Instruction Cycle States
• Depends on CPU design; a typical state diagram is shown on the slide (omitted here)

Data Flow (Instruction Fetch)
• Fetch (Step 2)
  – PC contains address of next instruction
  – Address moved to MAR
  – Address placed on address bus
  – Control unit requests memory read
  – Result placed on data bus, copied to MBR, then to IR
  – Meanwhile PC incremented by 1 instruction word
    • Typically byte addressable, so PC incremented by 4 for a 32-bit ISA
• (Diagram omitted) The fetch data flow is drawn for a 3-bus architecture; the transfers are combined in a 1-bus architecture

Data Flow (Data Fetch)
• IR is examined
• If indirect addressing, an indirect cycle is performed:
  – Rightmost bits of MBR transferred to MAR
  – Control unit requests memory read
  – Result (address of operand) moved to MBR
• (Indirect-cycle data flow diagram omitted)

Data Flow (Execute)
• May take many forms; depends on instruction being executed
• May include
  – Memory read/write
  – Input/Output
  – Register transfers
  – ALU operations

Data Path
• Data path cycle: two operands go from registers, through the ALU, and the result is stored (the faster the better)
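The simple fetch-decode-execute cycle described above can be sketched as a toy loop. This is a minimal sketch for a hypothetical accumulator machine; the LOAD/ADD/HALT instruction set and the tuple encoding are invented for illustration (real ISAs encode instructions as bit fields):

```python
# Toy fetch-decode-execute loop for a hypothetical accumulator machine.
# Instructions are (opcode, operand) tuples; this format is invented
# purely to make the cycle's steps visible.

def run(memory, data):
    pc = 0          # program counter: address of next instruction
    acc = 0         # accumulator register
    while True:
        ir = memory[pc]          # steps 1-2: fetch next instruction into IR
        pc += 1                  # step 3: change (increment) PC
        op, operand = ir         # step 4: determine instruction type
        if op == "LOAD":         # step 5: operand in memory, fetch it
            acc = data[operand]
        elif op == "ADD":
            acc += data[operand] # step 6: execute, storing result in acc
        elif op == "HALT":
            break
        # step 7: loop back to step 1

    return acc

# Example program: load data[0], add data[1], halt
program = [("LOAD", 0), ("ADD", 1), ("HALT", None)]
print(run(program, [3, 4]))  # prints 7
```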
• Data path cycle time: time to accomplish the above
• Width of the data path is a major determining factor in performance

RISC and CISC Based Processors
• RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)
  – "War" started in the late 1970's
  – RISC:
    • Simple interpret step, so each instruction is faster
    • Issues instructions more quickly
  – Basic concept: if RISC needs 5 instructions to do what 1 CISC instruction does, but 1 CISC instruction takes 10 times longer than 1 RISC instruction, then RISC is faster

RISC vs. CISC Debate
• CISC & RISC both improved during the 1980's
  – CISC driven primarily by technology
  – RISC driven by technology & architecture
• RISC has basically won
  – Improved by over 50% / year
  – Took advantage of new technology
  – Incorporated new architectural features (e.g., Instruction-Level Parallelism (ILP), superscalar concepts)
• Compilers have become a critical component in performance

More on RISC & CISC
• RISC-CISC debate largely irrelevant now
  – Concepts are similar
  – Heavy reliance on compiler technology
  – Modern RISC has a "complicated" ISA
  – CISC leader (Intel x86 family) now has a large RISC component

Prefetching

Prefetch
• Fetch accesses main memory
• Execution usually does not access main memory
• Can fetch next instruction during execution of current instruction

Improved Performance
• But not doubled:
  – Fetch usually shorter than complete execution of one instruction
    • If fetch takes longer than execute, then cache (will discuss later) saves the day!
  – Prefetch more than one instruction?
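The prefetch improvement above can be put in numbers with a back-of-envelope model: without prefetch each instruction pays fetch plus execute time; with prefetch the next fetch overlaps the current execute, so after the first fetch only the longer of the two phases is paid per instruction. The 2 ns / 5 ns phase times are invented for illustration:

```python
# Rough model of instruction prefetch. Phase times (2 ns fetch,
# 5 ns execute) are assumed values, not from the slides.

def no_prefetch(n, fetch, execute):
    # each instruction pays fetch + execute sequentially
    return n * (fetch + execute)

def with_prefetch(n, fetch, execute):
    # after the first fetch, fetch of instruction i+1 overlaps
    # execute of instruction i
    return fetch + n * max(fetch, execute)

print(no_prefetch(10, 2, 5))    # 70 ns
print(with_prefetch(10, 2, 5))  # 52 ns: faster, but not halved
```

This matches the slide's point: performance improves but is not doubled, because fetch is usually shorter than execution.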
• Called instruction prefetch
• Any jump or branch means that prefetched instructions are not the required instructions

Prefetching, Waiting & Branching
• (Timing diagrams omitted)

Pipelining
• Add more stages to improve performance
• For example, 6 stages:
  – Fetch instruction (FI) (memory access)
  – Decode instruction (DI)
  – Calculate operands (CO)
  – Fetch operands (FO) (possible memory access)
  – Execute instruction (EI)
  – Write operand (result) (WO) (possible memory access)
• Overlap these stages

Timing of Pipeline
• (Timing diagram omitted) Again, cache saves the day!

Important points
• An instruction pipeline passes instructions through a series of stages, with each performing some part of the instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture will not improve the overall time required to execute one instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture has a higher throughput (number of instructions processed per second).

Execution Time
• Assume that a pipelined instruction processor has 4 stages, and the maximum times required in the stages are 10, 11, 10 and 12 nanoseconds, respectively.
• How long will the processor take to perform each stage? 12 nanoseconds (the clock must accommodate the slowest stage)
• How long does it take for one instruction to complete execution? 48 nanoseconds
• How long will it take for ten instructions to complete execution? (48 + 9*12) nanoseconds = 156 nanoseconds, or 15.6 nanoseconds/instruction

4-Stage Pipeline
• Another example (with interstage buffers):
  – Fetch instruction (F) (memory access)
  – Decode instruction and fetch operands (D)
  – Execute instruction (E) (possible memory access)
  – Write operand (result) (W)

Some Potential Causes of Pipeline "Stalls"
• What might cause the pipeline to "stall"?
  – One stage takes too long, for example:
    • Different execution times (data hazard)
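The Execution Time answers above follow from two facts: the clock period is the slowest stage, and after the first instruction completes, one more finishes every cycle. The arithmetic can be checked directly:

```python
# Pipeline timing for the slide's example: 4 stages with worst-case
# stage times 10, 11, 10, 12 ns. All stages share one clock, so the
# cycle time equals the slowest stage.

stage_times = [10, 11, 10, 12]
k = len(stage_times)          # number of stages
t = max(stage_times)          # clock period: 12 ns

first = k * t                 # first instruction traverses all k stages

def total(n):
    # after the first result, one instruction completes per cycle,
    # so n instructions need k + (n - 1) cycles
    return (k + n - 1) * t

print(t)              # 12 ns per stage
print(first)          # 48 ns for one instruction
print(total(10))      # 156 ns for ten instructions
print(total(10) / 10) # 15.6 ns per instruction on average
```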
    • Cache miss (control hazard)
    • Memory unit busy (structural hazard)
  – Branch to a new location (control hazard)
  – Call a subroutine (control hazard)

Long Stages – Execution [1]
• For a variety of reasons, some instructions take longer to execute than others (e.g., add vs. divide) ⇒ stall
• (Timing diagram omitted: I1's long execute stage stalls I2 for 2 cycles; a data hazard example)

Long Stages – Execution [2]
• Solution:
  – Sub-divide the long execute stage into multiple stages
  – Provide multiple execution hardware so that multiple instructions can be in the execute stage at the same time
  – Produces one instruction completion every cycle after the pipeline is full

Long Stages – Cache Miss
• Delay in getting next instruction due to a cache miss (will discuss later)
• Decode, Execute & Write stages become idle – stall (a control hazard example)

Long Stages – Memory Access Conflict
• Two different instructions require access to memory at the same time (structural hazard)
• Example: Load X(R1),R2 places the contents of memory location X+R1 into R2; the load's data access competes with a later instruction's memory access, forcing a stall

Conditional Branch in a Pipeline
• (Diagram omitted: in a six-stage instruction pipeline, Instruction 3 is a conditional branch to Instruction 15; an unconditional branch flushes prefetched instructions similarly; a control hazard)
• Note the important architectural point: when is the PC changed?

Data Dependencies in a Pipeline
  r1 = r2 + r3
  r4 = r1 - r3
• Pipeline stalls until r1 has been written
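The stall on the r1/r4 pair can be counted cycle by cycle. The sketch below assumes a 4-stage F/D/E/W pipeline with no forwarding, where D reads registers and W writes them, and a read must wait for the producing write; this simple timing model is an illustration, not a description of any particular processor:

```python
# Cycle numbering for r1 = r2 + r3 ; r4 = r1 - r3 on an assumed
# 4-stage F/D/E/W pipeline with no forwarding. I1 occupies
# F=1, D=2, E=3, W=4; I2's decode must wait for I1's write-back
# when it reads I1's result.

def i2_stage_cycles(depends_on_prev):
    i1_writeback = 4
    d2 = 3                         # I2 would normally decode in cycle 3
    if depends_on_prev:
        d2 = i1_writeback + 1      # wait for r1 to be written: cycle 5
    return {"F": 2, "D": d2, "E": d2 + 1, "W": d2 + 2}

print(i2_stage_cycles(False)["W"])  # 5: no dependency, no stall
print(i2_stage_cycles(True)["W"])   # 7: the RAW hazard costs 2 stall cycles
```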
• This is a read-after-write (RAW) hazard

Alternative Pipeline Depiction
• (Diagram omitted: the same r1 = r2 + r3 ; r4 = r1 - r3 sequence drawn stage by stage)

Example of Avoiding Stalls
• Stalls eliminated by rearranging instruction sequence (a) into sequence (b) (diagram omitted)

RAW Hazard: Forwarding
• Hardware optimization to avoid the stall
• Allows the ALU to reference the result in the next instruction
• Example:
  Instruction K:   C ← add A B
  Instruction K+1: D ← subtract E C

Unconditional Branches: Instruction Queue & Prefetching
• (Diagram omitted)

Conditional Branches: Delayed Branch
• With more stages, need to reorder more instructions

Conditional Branches: Dynamic Prediction
• (Diagram omitted)

Pipelining Speedup Factors [1]
• Without branches, speedup never gets better than the number of stages in the pipeline

Pipelining Speedup Factors [2]
• Speedup never gets better than the number of instructions executed without a branch

Classic 5-Stage RISC Pipeline
• Instruction Fetch (IF) (memory access)
• Instruction Decode & Register Fetch (ID)
• Execute / Effective Address (load-store) (EX)
• Memory Access (MEM) (possible memory access)
• Write-back (to register) (WB)

Another 5-Stage Pipeline
• Instruction Fetch (IF) (memory access)
• Instruction Decode (ID)
• Operand Fetch (OF) (possible memory access)
• Execute Instruction (EX)
• Write-back (WB) (possible memory access)

Achieving Maximum Speed
• Program must be written to accommodate the instruction pipeline
• To minimize stalls, for example:
  – Avoid introducing unnecessary branches
  – Delay references to result register(s)

More About Pipelines
• Although hardware that uses an instruction pipeline will not run at full speed unless programs are written to accommodate the pipeline, a programmer can choose to ignore pipelining and assume the hardware will automatically increase speed whenever possible.

Advanced Pipeline Topics
• Structural hazards (e.g., resource conflicts)
• Data hazards (e.g., RAW, WAR, WAW)
• Control hazards (e.g., branch prediction, cache miss)
• Dynamic scheduling
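The speedup-factor claim above follows from the pipeline timing formula: n instructions take n*k cycles unpipelined but only k + n - 1 cycles on a k-stage pipeline, so the speedup approaches k as n grows. A quick check (stage and instruction counts here are arbitrary examples):

```python
# Speedup of an ideal k-stage pipeline over a non-pipelined processor
# for n instructions with no branches: n*k cycles unpipelined versus
# k + n - 1 cycles pipelined. As n grows the ratio approaches k,
# matching the slide's bound.

def speedup(n, k):
    return (n * k) / (k + n - 1)

print(speedup(10, 6))              # 4.0 for a short run of 10 instructions
print(round(speedup(10000, 6), 2)) # 6.0: approaches the stage count
```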
• Out-of-order issue
• Out-of-order execution
• Speculative execution
• Register renaming
• Modern compilers are typically able to move instructions around to optimize pipeline execution for RISC architectures.

Parallelizations

Superscalar Architecture
• Superscalar architecture (parallelism inside a single processor: Instruction-Level Parallelism)
• Example: 2 pipelines with 2 ALUs
  – The 2 instructions must not conflict over a resource (e.g., register, memory access unit) and must not depend on each other's results
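The dual-issue condition above can be sketched as a dependence check. The 3-tuple instruction format (dest, src1, src2) is invented for illustration; real issue logic also checks functional-unit availability:

```python
# Sketch of a dual-issue legality check for the 2-pipeline superscalar
# example: two instructions may issue together only if neither depends
# on the other's result and they don't write the same register.

def can_dual_issue(i1, i2):
    d1, s1a, s1b = i1   # (dest, src1, src2), hypothetical encoding
    d2, s2a, s2b = i2
    if d1 in (s2a, s2b):   # RAW: i2 reads i1's result
        return False
    if d2 in (s1a, s1b):   # WAR: i2 overwrites a register i1 reads
        return False
    if d1 == d2:           # WAW: both write the same register
        return False
    return True

print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True
print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: reads r1
```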