Processor Pipeline

Spring 2016 :: CSE 502 – Computer Architecture Processor Pipeline Nima Honarmand Spring 2016 :: CSE 502 – Computer Architecture Generic Instruction Cycle • Steps in processing an instruction: – Instruction Fetch (IF_STEP) – Instruction Decode (ID_STEP) – Operand Fetch (OF_STEP) • Might be from registers or memory – Execute (EX_STEP) • Perform computation on the operands – Result Store or Write Back (RS_STEP) • Write the execution results back to registers or memory • ISA determines what needs to be done in each step for each instruction • μArch determines how HW implements the steps Spring 2016 :: CSE 502 – Computer Architecture Datapath vs. Control Logic • Datapath is the collection of HW components and their connection in a processor – Determines the static structure of processor • Control logic determines the dynamic flow of data between the components – E.g., the control lines of MUXes and ALU in last slide – Is a function of? • Instruction words • State of the processor • Execution results at each stage Spring 2016 :: CSE 502 – Computer Architecture Generic Datapath Components • Main components – Instruction Cache – Data Cache – Register File – Functional Units (ALU, Floating Point Unit, Memory Unit, …) – Pipeline Registers • Auxiliary Components (in advanced processors) – Reservation Stations – Reorder Buffer – Branch Predictor – Prefetchers – … • Lots of glue logic (often multiplexors) to glue these together Spring 2016 :: CSE 502 – Computer Architecture Example: MIPS Instruction Set • All instructions are 32 bits Spring 2016 :: CSE 502 – Computer Architecture A Simple MIPS Datapath Write-Back (WB) +1 PC Reg ALU File I-cache D-cache Inst. Decode & Inst. Fetch Register Read Execute Memory (IF) (ID) (EX) (MEM) IF_STEP ID_STEP OF_STEP EX_STEP RS_STEP Spring 2016 :: CSE 502 – Computer Architecture Single-Instruction Datapath Single-cycle ins0.(fetch,dec,ex,mem,wb) ins1.(fetch,dec,ex,mem,wb) Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) time • Process one instruction at a time • Single-cycle control: hardwired – Low CPI (1) – Long clock period (to accommodate slowest instruction) • Multi-cycle control: typically micro-programmed – Short clock period – High CPI • Can we have both low CPI and short clock period? – Not if datapath executes only one instruction at a time – No good way to make a single instruction go faster Spring 2016 :: CSE 502 – Computer Architecture Pipelined Datapath Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) ins0.fetch ins0.(dec,ex) ins0.(mem,wb) Pipelined ins1.fetch ins1.(dec,ex) ins1.(mem,wb) time ins2.fetch ins2.(dec,ex) ins2.(mem,wb) • Start with multi-cycle design • When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1 • Each instruction passes through all stages … but instructions enter and leave at faster rate Style Ideal CPI Cycle Time (1/freq) Single-cycle 1 Long Multi-cycle > 1 Short Pipelined 1 Short Pipeline can have as many insns in flight as there are stages Spring 2016 :: CSE 502 – Computer Architecture Pipeline Illustrated Comb. Logic L n Gate Delay BW = ~(1/n) n n L -- Gate L -- Gate 2 Delay 2 Delay BW = ~(2/n) n n n Gate L -- Gate L -- Gate L -- 3Delay 3Delay 3 Delay BW = ~(3/n) Pipeline Latency = n Gate Delay + (p-1) register delays p: # of stages Improves throughput at the expense of latency Spring 2016 :: CSE 502 – Computer Architecture 5-Stage MIPS Pipeline Spring 2016 :: CSE 502 – Computer Architecture Stage 1: Fetch • Fetch an instruction from instruction cache every cycle – Use PC to index instruction cache – Increment PC (assume no branches for now) • Write state to the pipeline register (IF/ID) – The next stage will read this pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 1: Fetch Diagram target M U X 1 + PC + 1 + PC Decode PC en Instruction Cache bits Instruction en IF / ID Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 2: Decode • Decodes opcode bits – Set up Control signals for later stages • Read input operands from register file – Specified by decoded instruction bits • Write state to the pipeline register (ID/EX) – Opcode – Register contents, immediate operand – PC+1 (even though decode didn’t use it) – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 2: Decode Diagram target PC +1 PC regA regB PC + 1 + PC regA contents Register File Fetch destReg Execute data regB contents en bits imm Instruction Control Signals/ IF / ID ID / EX Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 3: Execute • Perform ALU operations – Calculate result of instruction • Control signals select operation • Contents of regA used as one input • Either regB or constant offset (imm from insn) used as second input – Calculate PC-relative branch target • PC+1+(constant offset) • Write state to the pipeline register (EX/Mem) – ALU result, contents of regB, and PC+1+offset – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 3: Execute Diagram target PC+1 + +offset PC + 1 + PC A ALU regA L result contents M U Decode U Memory regB X regB contents contents imm Control Signals destReg Control Signals/ data ID / EX EX/Mem Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 4: Memory • Perform data cache access – ALU result contains address for LD or ST – Opcode bits control R/W and enable signals • Write state to the pipeline register (Mem/WB) – ALU result and Loaded data – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 4: Memory Diagram target PC+1 +offset ALU result ALU in_addr result back - in_data Execute Write Data Cache data Loaded regB contents en R/W signals signals Control Control destReg data EX/Mem Mem/WB Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 5: Write-back • Writing result to register file (if required) – Write Loaded data to destReg for LD – Write ALU result to destReg for ALU insn – Opcode bits control register write enable signal Spring 2016 :: CSE 502 – Computer Architecture Stage 5: Write-back Diagram ALU result data Loaded M Memory data U X signals Control M destReg U Mem/WB X Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Putting It All Together M U X 1 + target + PC+1 PC+1 instruction eq? ALU regA M regB result Register valA A U Inst ALU X PC File L mdata Cache data result valB M Data dest U U Cache X data dest valB imm M U X Control signals/ signals signals Control Control IF/ID ID/EX EX/Mem Mem/WB Spring 2016 :: CSE 502 – Computer Architecture Issues With Pipelining Spring 2016 :: CSE 502 – Computer Architecture Pipelining Idealism • Uniform Sub-operations – Operation can partitioned into uniform-latency sub-ops • Repetition of Identical Operations – Same ops performed on many different inputs • Independent Operations – All ops are mutually independent Spring 2016 :: CSE 502 – Computer Architecture Pipeline Realism • Uniform Sub-operations … NOT! – Balance pipeline stages • Stage quantization to yield balanced stages • Minimize internal fragmentation (left-over time near end of cycle) • Repetition of Identical Operations … NOT! – Unifying instruction types • Coalescing instruction types into one “multi-function” pipe • Minimize external fragmentation (idle stages to match length) • Independent Operations … NOT! – Resolve data and resource hazards • Inter-instruction dependency detection and resolution Pipelining is expensive Spring 2016 :: CSE 502 – Computer Architecture The Generic Instruction Pipeline Instruction Fetch IF Instruction Decode ID Operand Fetch OF Instruction Execute EX Write-back WB Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages IF TIF= 6 units Without pipelining ID Tcyc TIF+TID+TOF+TEX+TOS TID= 2 units = 31 Pipelined OF TID= 9 units Tcyc max{TIF, TID, TOF, TEX, TOS} = 9 EX Speedup = 31 / 9 = 3.44 TEX= 5 units WB TOS= 9 units Can we do better? Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages (1/2) • Two methods for stage quantization – Divide sub-ops into smaller pieces – Merge multiple sub-ops into one • Recent/Current trends – Deeper pipelines (more and more stages) – Pipelining of memory accesses – Multiple different pipelines/sub-pipelines Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages (2/2) Coarser-Grained Machine Cycle: Finer-Grained Machine Cycle: 4 machine cyc / instruction 11 machine cyc /instruction IF IF TIF&ID= 8 units ID IF ID OF TOF= 9 units OF OF # stages = 4 # stages = 11 OF Tcyc= 9 units EX T = 5 units Tcyc= 3 units EX EX EX WB WB TOS= 9 units WB WB Spring 2016 :: CSE 502 – Computer Architecture Pipeline Examples AMDAHL 470V/7 IF_STEP PC GEN MIPS R2000/R3000 Cache Read IF_STEP Cache Read IF ID_STEP ID_STEP Decode OF_STEP Read REG OF_STEP RD Addr GEN EX_STEP ALU Cache Read Cache Read RS_STEP MEM EX_STEP EX 1 EX 2 WB RS_STEP Check Result Write Result Spring 2016 :: CSE 502 – Computer Architecture Instruction Dependencies (1/2) • Data Dependence – Read-After-Write (RAW) (the only true dependence) • Read must wait until earlier write finishes – Anti-Dependence (WAR) • Write must wait until earlier read finishes (avoid clobbering) – Output Dependence (WAW) • Earlier write can’t overwrite later write • Control Dependence (a.k.a. Procedural Dependence) – Branch condition must execute before branch target – Instructions after branch

Processor Pipeline

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support