Lecture 15: Pipelining

Spring 2019 Jason Tang

1 Topics

• Overview of pipelining

• Pipeline performance

• Pipeline hazards

2 Sequential Laundry

[Timing diagram, 6 PM to midnight: loads A–D run one after another, each washing (30 min), drying (40 min), then folding (20 min)]

• A clothes washer takes 30 minutes, a dryer takes 40 minutes, and folding takes 20 minutes

• Sequential laundry would thus take 6 hours for 4 loads

3 Pipelined Laundry

[Timing diagram, 6 PM to 9:30 PM: loads A–D overlap, with a new load starting every 40 minutes]

• Pipelining means starting each task as soon as possible

• Pipelined laundry would thus take 3.5 hours for 4 loads
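The two laundry totals above can be checked with a short calculation (a sketch; the stage times come straight from the slides):

```python
# Stage times from the slides, in minutes: wash 30, dry 40, fold 20.
WASH, DRY, FOLD = 30, 40, 20
loads = 4

# Sequential: each load finishes completely before the next starts.
sequential = loads * (WASH + DRY + FOLD)  # 4 * 90 = 360 min

# Pipelined: the first load takes the full 90 minutes; each later load
# finishes one slowest-stage time (the 40-minute dryer) after the previous.
pipelined = (WASH + DRY + FOLD) + (loads - 1) * max(WASH, DRY, FOLD)

print(sequential / 60, pipelined / 60)  # 6.0 hours vs. 3.5 hours
```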

4 Pipelining

• Does not improve the latency of a single task, but improves the throughput of the entire workload

• Pipeline rate limited by slowest pipeline stage

• Multiple tasks operating simultaneously using different resources

• Potential speedup correlates to the number of pipe stages

• Actual speedup reduced by: unbalanced lengths of pipe stages, time to fill pipeline, time to drain pipeline, and stalling due to dependencies
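The fill-time effect can be sketched numerically. This is an idealized model that assumes perfectly balanced stages and no stalls, so it shows only why short workloads fall below the stage-count speedup:

```python
# Idealized speedup of a k-stage pipeline over a non-pipelined design
# that takes k cycles per instruction.
def speedup(n_instructions, k_stages):
    non_pipelined = n_instructions * k_stages       # k cycles each
    pipelined = k_stages + (n_instructions - 1)     # fill, then 1/cycle
    return non_pipelined / pipelined

print(speedup(4, 5))     # 2.5: fill time dominates a short workload
print(speedup(1000, 5))  # ~4.98: approaches the stage count, 5
```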

5 Multi-Cycle Instruction Execution

IF: Instruction Fetch; ID: Instruction Decode / Register File Read; EX: Execute / Address Calculation; MEM: Memory Access; WB: Write Back

[Datapath diagram: PC addresses the Instruction Memory; the register file (A/B/W selects) feeds the ALU; the ALU result addresses the Data Memory; a sign-extend unit handles immediates; an adder computes PC + 4]

6 Stages of Instruction Execution

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load: Fetch, Decode, Execute, Memory, Write Back

• As mentioned earlier, load instructions take the longest to complete

• All instructions follow at most these five steps:

• Fetch: fetch instruction from Instruction Memory at PC

• Decode: fetch registers and decode instruction

• Execute: calculate results

• Memory: read/write data from Data Memory

• Write Back: write data back to register file

7 Instruction Pipelining

[Pipeline diagram: five instructions in program flow, each passing through Fetch, Decode, Execute, Memory, and Write Back, with each instruction starting one stage after its predecessor]

• Start handling next instruction while current instruction is in progress

• Pipelining is feasible when different devices are used at different stages of instruction execution

• Pipelined time between instructions = non-pipelined time ÷ number of stages

8 Comparison

• Example program flow: a load instruction followed by a store instruction

[Timing diagram: with a single-cycle clock, the load fills cycle 1 and the store fills cycle 2, each with wasted slack time; with a faster pipelined clock over cycles 1–10, the load runs Fetch/Decode/Execute/Memory/Write Back and the store follows one cycle behind]

9 Example Pipeline Performance

Fetch: 200 ps, Decode: 100 ps, Execute: 200 ps, Memory: 200 ps, Write Back: 100 ps

• Given this instruction sequence: load, store, and then R-type:

• How long for a single-cycle datapath?

• How long for a multi-cycle datapath?

• For a pipelined datapath (ignoring all hazards)?

• How long would it take to execute 1000 consecutive loads for: single-cycle, multi-cycle, and pipeline ?
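One way to work these questions out is sketched below. The cycle counts are assumptions typical of this style of exercise: the single-cycle clock is the sum of all stage times, the multi-cycle and pipelined clocks equal the slowest stage, a store skips Write Back, and an R-type instruction skips Memory in the multi-cycle design:

```python
# Stage times from the slide, in picoseconds.
stages = {"Fetch": 200, "Decode": 100, "Execute": 200, "Memory": 200, "WB": 100}

single_clock = sum(stages.values())   # 800 ps: one long cycle per instruction
fast_clock = max(stages.values())     # 200 ps: slowest stage sets the clock

# Sequence of three instructions: load (5 cycles), store (4), R-type (4).
single_cycle = 3 * single_clock             # 2400 ps
multi_cycle = (5 + 4 + 4) * fast_clock      # 2600 ps
pipelined = (5 + 3 - 1) * fast_clock        # 1400 ps, ignoring all hazards

# 1000 consecutive loads.
loads_single = 1000 * single_clock          # 800,000 ps
loads_multi = 1000 * 5 * fast_clock         # 1,000,000 ps
loads_pipe = (5 + 999) * fast_clock         # 200,800 ps

print(single_cycle, multi_cycle, pipelined, loads_single, loads_multi, loads_pipe)
```

Note how the pipelined advantage grows with workload size: the five-cycle fill time is amortized over 1000 loads.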

10 Designing Instruction Sets for Pipelining

• How bits are represented within an instruction affects pipeline performance

• Simplifying instruction fetch:

• RISC architectures [generally] have same sized instructions

• CISC architectures have varying length instructions

• Simplifying memory access:

• ARMv8-A has limited load and store instructions

• x86-64 allows memory to be used as operands to ALU instructions

11 Pipeline Hazards

• Situation that prevents following instruction from executing on next clock cycle

• Structural hazard: attempt to use a resource two different ways at same time

• Example: all-in-one washer/dryer

• Data hazard: attempt to use item before it is ready

• Example: one sock still in washer when ready to fold sock pair

• Control hazard: attempt to make a decision before condition is evaluated

• Example: choosing laundry detergent based upon previous load

12 Structural Hazard

Time

ldur X1, [X9, #100] IF ID EX MEM WB

ldur X2, [X9, #100] IF ID EX MEM WB

ldur X3, [X9, #100] IF ID EX MEM WB

ldur X4, [X9, #100] IF ID EX MEM WB

Program Flow ldur X5, [X9, #100] IF ID EX MEM WB

• Combined instruction/data memory can cause conflicting accesses

• Can be resolved by adding an idle cycle before fetching fourth ldur
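The conflict in the diagram can be sketched as a hypothetical simulation. It assumes instruction i enters IF in cycle i and reaches MEM in cycle i + 3, and that the combined memory has a single port:

```python
# Find cycles where an instruction fetch (IF) and a data access (MEM)
# hit the single combined memory at the same time, for a back-to-back
# run of n_loads ldur instructions.
def memory_conflicts(n_loads):
    conflicts = []
    for cycle in range(n_loads + 4):
        fetching = 0 <= cycle < n_loads        # some ldur is in IF
        mem_access = 0 <= cycle - 3 < n_loads  # some ldur is in MEM
        if fetching and mem_access:
            conflicts.append(cycle)
    return conflicts

# Five ldurs, as in the diagram: the fourth and fifth fetches collide
# with the first and second data accesses.
print(memory_conflicts(5))  # [3, 4]
```

Each reported cycle is one place the hardware would have to insert an idle fetch cycle, matching the slide's fix for the fourth ldur.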

13 Memory Architectures

• von Neumann (also known as Princeton) Architecture: single combined memory for both data and instructions

• Simpler to build, allows for self-modifying code, but leads to the von Neumann bottleneck

• Harvard Architecture: separate memory buses for data and instructions

• Allows parallel access to data and instructions and allows different memory technologies to be used, but is much more complicated to build and prevents self-modifying code

• Modified Harvard Architecture: has split caches, but unified main memory

http://ithare.com/modified-harvard-architecture-clarifying-confusion/

14 Data Hazard

Time

add X1, X2, X3 IF ID EX MEM WB

sub X4, X1, X3 IF ID EX MEM WB

ldur X1, [X2, #0] IF ID EX MEM WB

orr X8, X1, X9 IF ID EX MEM WB

Program Flow eor X10, X1, X11 IF ID EX MEM WB

• Dependence of one instruction on an earlier one in pipeline

• Can be resolved by code reordering, forwarding, or stalling

15 Code Reordering

• Clever programmers (specifically, compiler code generators) could reorder generated assembly to avoid data hazards

• Example: a = b + e; c = b + f;

literal, unoptimized:

ldur X1, [X0, #0]  // load b
ldur X2, [X0, #8]  // load e
add  X3, X1, X2    // b + e
ldur X4, [X0, #12] // load f
add  X5, X1, X4    // b + f

reordered, optimized:

ldur X1, [X0, #0]  // load b
ldur X2, [X0, #8]  // load e
ldur X4, [X0, #12] // load f
add  X3, X1, X2    // b + e
add  X5, X1, X4    // b + f

16 Forwarding

add X1, X2, X3 IF ID EX MEM WB X1

sub X4, X1, X3 IF ID EX MEM WB

• Add hardware to retrieve missing data from an internal buffer instead of from programmer-visible registers or memory

• Only works for forward paths, later in time

• Does not work for a load immediately followed by an instruction that uses that result (a load-use data hazard)
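A compiler or hazard-detection unit can spot exactly this load-use case by checking whether a load's destination register is a source of the very next instruction. The sketch below uses a made-up tuple encoding of instructions, purely for illustration:

```python
# Each instruction is (opcode, destination, list of source registers).
def load_use_hazards(program):
    hazards = []
    for i in range(len(program) - 1):
        op, dest, _ = program[i]
        _, _, sources = program[i + 1]
        if op == "ldur" and dest in sources:
            hazards.append(i + 1)  # index of the instruction that must stall
    return hazards

prog = [
    ("add",  "X1", ["X2", "X3"]),
    ("sub",  "X4", ["X1", "X3"]),  # forwarding covers this add -> sub case
    ("ldur", "X1", ["X2"]),
    ("orr",  "X8", ["X1", "X9"]),  # load-use: forwarding alone cannot help
]
print(load_use_hazards(prog))  # [3]
```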

17 Stalling

Time

ldur X1, [X2, #0] IF ID EX MEM WB

X1 X1

(stall) bubble bubble bubble bubble bubble

orr X8, X1, X9 IF ID EX MEM WB

eor X10, X1, X11 IF ID EX MEM WB Program Flow

• When code reordering and forwarding are insufficient, intentionally stall the pipeline by adding bubbles

• Many ways to detect when stalling is needed and how many bubbles to induce
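The cost of a stall can be sketched as simple cycle bookkeeping. This is a simplified model that assumes a 5-stage pipeline and that each bubble delays completion by exactly one cycle:

```python
# Total cycles for n instructions through a k-stage pipeline, plus any
# bubbles inserted by the hazard-detection logic.
def total_cycles(n_instructions, bubbles, k_stages=5):
    return k_stages + (n_instructions - 1) + bubbles

# ldur / one bubble / orr / eor, as in the diagram above:
print(total_cycles(3, 1))  # 8 cycles instead of the hazard-free 7
```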

18 Control Hazard

Time

add X1, X2, X3 IF ID EX MEM WB

cbz X1, 3 IF ID EX MEM WB

(stall) bubble bubble bubble bubble bubble

(stall) bubble bubble bubble bubble bubble

Program Flow orr X7, X8, X9 IF ID EX MEM WB

• Upon branching, the PC for the next instruction is unknown until after decode (for unconditional branches) or after execution (for conditional branches)

• One solution is to always induce stall(s) when a branch instruction is detected, until after branch is resolved

19 Branch Prediction

Time

add X1, X2, X3 IF ID EX MEM WB

cbz X1, 3 IF ID EX MEM WB

ldur X1, [X2, #0] IF ID bubble bubble bubble

sub X4, X1, X3 IF bubble bubble bubble bubble

Program Flow orr X7, X8, X9 IF ID EX MEM WB

• In the simple case, assume that the branch will never be taken

• If branch is taken, then flush pipeline, restarting with correct instruction
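The average cost of this predict-not-taken scheme can be sketched as follows. It assumes, as in the diagram above, that a mispredicted branch flushes two wrongly fetched instructions; the 60% taken rate in the usage line is a made-up illustrative figure:

```python
# Average extra cycles per branch under predict-not-taken.
def avg_branch_penalty(taken_fraction, flush_cycles=2):
    # Not-taken branches cost nothing extra; taken branches pay the flush.
    return taken_fraction * flush_cycles

print(avg_branch_penalty(0.6))  # 1.2 extra cycles per branch, on average
```

Real hardware improves on this with dynamic branch predictors that track each branch's history rather than assuming not-taken.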
