Processor Pipeline

Spring 2016 :: CSE 502 – Computer Architecture Processor Pipeline Nima Honarmand Spring 2016 :: CSE 502 – Computer Architecture Generic Instruction Cycle • Steps in processing an instruction: – Instruction Fetch (IF_STEP) – Instruction Decode (ID_STEP) – Operand Fetch (OF_STEP) • Might be from registers or memory – Execute (EX_STEP) • Perform computation on the operands – Result Store or Write Back (RS_STEP) • Write the execution results back to registers or memory • ISA determines what needs to be done in each step for each instruction • μArch determines how HW implements the steps Spring 2016 :: CSE 502 – Computer Architecture Datapath vs. Control Logic • Datapath is the collection of HW components and their connection in a processor – Determines the static structure of processor • Control logic determines the dynamic flow of data between the components – E.g., the control lines of MUXes and ALU in last slide – Is a function of? • Instruction words • State of the processor • Execution results at each stage Spring 2016 :: CSE 502 – Computer Architecture Generic Datapath Components • Main components – Instruction Cache – Data Cache – Register File – Functional Units (ALU, Floating Point Unit, Memory Unit, …) – Pipeline Registers • Auxiliary Components (in advanced processors) – Reservation Stations – Reorder Buffer – Branch Predictor – Prefetchers – … • Lots of glue logic (often multiplexors) to glue these together Spring 2016 :: CSE 502 – Computer Architecture Example: MIPS Instruction Set • All instructions are 32 bits Spring 2016 :: CSE 502 – Computer Architecture A Simple MIPS Datapath Write-Back (WB) +1 PC Reg ALU File I-cache D-cache Inst. Decode & Inst. Fetch Register Read Execute Memory (IF) (ID) (EX) (MEM) IF_STEP ID_STEP OF_STEP EX_STEP RS_STEP Spring 2016 :: CSE 502 – Computer Architecture Single-Instruction Datapath Single-cycle ins0.(fetch,dec,ex,mem,wb) ins1.(fetch,dec,ex,mem,wb) Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) time • Process one instruction at a time • Single-cycle control: hardwired – Low CPI (1) – Long clock period (to accommodate slowest instruction) • Multi-cycle control: typically micro-programmed – Short clock period – High CPI • Can we have both low CPI and short clock period? – Not if datapath executes only one instruction at a time – No good way to make a single instruction go faster Spring 2016 :: CSE 502 – Computer Architecture Pipelined Datapath Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) ins0.fetch ins0.(dec,ex) ins0.(mem,wb) Pipelined ins1.fetch ins1.(dec,ex) ins1.(mem,wb) time ins2.fetch ins2.(dec,ex) ins2.(mem,wb) • Start with multi-cycle design • When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1 • Each instruction passes through all stages … but instructions enter and leave at faster rate Style Ideal CPI Cycle Time (1/freq) Single-cycle 1 Long Multi-cycle > 1 Short Pipelined 1 Short Pipeline can have as many insns in flight as there are stages Spring 2016 :: CSE 502 – Computer Architecture Pipeline Illustrated Comb. Logic L n Gate Delay BW = ~(1/n) n n L -- Gate L -- Gate 2 Delay 2 Delay BW = ~(2/n) n n n Gate L -- Gate L -- Gate L -- 3Delay 3Delay 3 Delay BW = ~(3/n) Pipeline Latency = n Gate Delay + (p-1) register delays p: # of stages Improves throughput at the expense of latency Spring 2016 :: CSE 502 – Computer Architecture 5-Stage MIPS Pipeline Spring 2016 :: CSE 502 – Computer Architecture Stage 1: Fetch • Fetch an instruction from instruction cache every cycle – Use PC to index instruction cache – Increment PC (assume no branches for now) • Write state to the pipeline register (IF/ID) – The next stage will read this pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 1: Fetch Diagram target M U X 1 + PC + 1 + PC Decode PC en Instruction Cache bits Instruction en IF / ID Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 2: Decode • Decodes opcode bits – Set up Control signals for later stages • Read input operands from register file – Specified by decoded instruction bits • Write state to the pipeline register (ID/EX) – Opcode – Register contents, immediate operand – PC+1 (even though decode didn’t use it) – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 2: Decode Diagram target PC +1 PC regA regB PC + 1 + PC regA contents Register File Fetch destReg Execute data regB contents en bits imm Instruction Control Signals/ IF / ID ID / EX Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 3: Execute • Perform ALU operations – Calculate result of instruction • Control signals select operation • Contents of regA used as one input • Either regB or constant offset (imm from insn) used as second input – Calculate PC-relative branch target • PC+1+(constant offset) • Write state to the pipeline register (EX/Mem) – ALU result, contents of regB, and PC+1+offset – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 3: Execute Diagram target PC+1 + +offset PC + 1 + PC A ALU regA L result contents M U Decode U Memory regB X regB contents contents imm Control Signals destReg Control Signals/ data ID / EX EX/Mem Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 4: Memory • Perform data cache access – ALU result contains address for LD or ST – Opcode bits control R/W and enable signals • Write state to the pipeline register (Mem/WB) – ALU result and Loaded data – Control signals (from insn) for opcode and destReg Spring 2016 :: CSE 502 – Computer Architecture Stage 4: Memory Diagram target PC+1 +offset ALU result ALU in_addr result back - in_data Execute Write Data Cache data Loaded regB contents en R/W signals signals Control Control destReg data EX/Mem Mem/WB Pipeline register Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Stage 5: Write-back • Writing result to register file (if required) – Write Loaded data to destReg for LD – Write ALU result to destReg for ALU insn – Opcode bits control register write enable signal Spring 2016 :: CSE 502 – Computer Architecture Stage 5: Write-back Diagram ALU result data Loaded M Memory data U X signals Control M destReg U Mem/WB X Pipeline register Spring 2016 :: CSE 502 – Computer Architecture Putting It All Together M U X 1 + target + PC+1 PC+1 instruction eq? ALU regA M regB result Register valA A U Inst ALU X PC File L mdata Cache data result valB M Data dest U U Cache X data dest valB imm M U X Control signals/ signals signals Control Control IF/ID ID/EX EX/Mem Mem/WB Spring 2016 :: CSE 502 – Computer Architecture Issues With Pipelining Spring 2016 :: CSE 502 – Computer Architecture Pipelining Idealism • Uniform Sub-operations – Operation can partitioned into uniform-latency sub-ops • Repetition of Identical Operations – Same ops performed on many different inputs • Independent Operations – All ops are mutually independent Spring 2016 :: CSE 502 – Computer Architecture Pipeline Realism • Uniform Sub-operations … NOT! – Balance pipeline stages • Stage quantization to yield balanced stages • Minimize internal fragmentation (left-over time near end of cycle) • Repetition of Identical Operations … NOT! – Unifying instruction types • Coalescing instruction types into one “multi-function” pipe • Minimize external fragmentation (idle stages to match length) • Independent Operations … NOT! – Resolve data and resource hazards • Inter-instruction dependency detection and resolution Pipelining is expensive Spring 2016 :: CSE 502 – Computer Architecture The Generic Instruction Pipeline Instruction Fetch IF Instruction Decode ID Operand Fetch OF Instruction Execute EX Write-back WB Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages IF TIF= 6 units Without pipelining ID Tcyc TIF+TID+TOF+TEX+TOS TID= 2 units = 31 Pipelined OF TID= 9 units Tcyc max{TIF, TID, TOF, TEX, TOS} = 9 EX Speedup = 31 / 9 = 3.44 TEX= 5 units WB TOS= 9 units Can we do better? Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages (1/2) • Two methods for stage quantization – Divide sub-ops into smaller pieces – Merge multiple sub-ops into one • Recent/Current trends – Deeper pipelines (more and more stages) – Pipelining of memory accesses – Multiple different pipelines/sub-pipelines Spring 2016 :: CSE 502 – Computer Architecture Balancing Pipeline Stages (2/2) Coarser-Grained Machine Cycle: Finer-Grained Machine Cycle: 4 machine cyc / instruction 11 machine cyc /instruction IF IF TIF&ID= 8 units ID IF ID OF TOF= 9 units OF OF # stages = 4 # stages = 11 OF Tcyc= 9 units EX T = 5 units Tcyc= 3 units EX EX EX WB WB TOS= 9 units WB WB Spring 2016 :: CSE 502 – Computer Architecture Pipeline Examples AMDAHL 470V/7 IF_STEP PC GEN MIPS R2000/R3000 Cache Read IF_STEP Cache Read IF ID_STEP ID_STEP Decode OF_STEP Read REG OF_STEP RD Addr GEN EX_STEP ALU Cache Read Cache Read RS_STEP MEM EX_STEP EX 1 EX 2 WB RS_STEP Check Result Write Result Spring 2016 :: CSE 502 – Computer Architecture Instruction Dependencies (1/2) • Data Dependence – Read-After-Write (RAW) (the only true dependence) • Read must wait until earlier write finishes – Anti-Dependence (WAR) • Write must wait until earlier read finishes (avoid clobbering) – Output Dependence (WAW) • Earlier write can’t overwrite later write • Control Dependence (a.k.a. Procedural Dependence) – Branch condition must execute before branch target – Instructions after branch

Processor Pipeline

PIPELINING and ASSOCIATED TIMING ISSUES Introduction: While

2.5 Classification of Parallel Computers

Microprocessor Architecture

Instruction Pipelining (1 of 7)

Review of Computer Architecture

CS152: Computer Systems Architecture Pipelining

Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2

Pipelined Computations

Trends in Processor Architecture

Reverse Engineering X86 Processor Microcode

Chap. 9 Pipeline and Vector Processing

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures