Pipelining: Basic/Intermediate Concepts and Implementation
Lecture 05: Pipelining: Basic/Intermediate Concepts and Implementation
CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan, [email protected], www.secs.oakland.edu/~yan

Contents
1. Pipelining Introduction
2. The Major Hurdle of Pipelining - Pipeline Hazards
3. RISC-V ISA and Its Implementation
Reading:
- Textbook: Appendix C
- RISC-V ISA
- Chisel Tutorial

Pipelining: It's Natural!
Laundry example: Ann, Brian, Cathy, and Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold.
- The washer takes 30 minutes
- The dryer takes 40 minutes
- Folding takes 20 minutes
- One load: 90 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A-D run one after another, each as wash (30), dry (40), fold (20)]
- Sequential laundry takes 6 hours for 4 loads.
- If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight; loads A-D overlap so the washer, dryer, and folder stay busy, with a new load starting every 40 minutes]
- Pipelined laundry takes 3.5 hours for 4 loads.

The Basics of a RISC Instruction Set (1/2)
RISC (Reduced Instruction Set Computer), or load-store, architecture:
- All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).
- The only operations that affect memory are load and store operations.
- The instruction formats are few in number, with all instructions typically being one size.
These three simple properties lead to dramatic simplifications in the implementation of pipelining.

The Basics of a RISC Instruction Set (2/2)
MIPS64 / RISC-V:
- 32 registers, with R0 = 0
- Three classes of instructions:
  - ALU instructions: add (DADD), subtract (DSUB), and logical operations (such as AND or OR)
  - Load and store instructions
  - Branches and jumps
Instruction formats:
  R-format (add, sub, …):  OP | rs | rt | rd | sa | funct
  I-format (lw, sw, …):    OP | rs | rt | immediate
  J-format (j):            OP | jump target

RISC Instruction Set
Every instruction can be implemented in at most 5 clock cycles/stages:
- Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4.
- Instruction decode/register fetch cycle (ID): decode the instruction and read the registers named by the register source specifiers from the register file.
- Execution/effective address cycle (EX): perform the memory address calculation for loads and stores, or the ALU operation for register-register and register-immediate ALU instructions.
- Memory access cycle (MEM): perform the memory access for load/store instructions.
- Write-back cycle (WB): write the result back to the destination register for register-register ALU instructions and loads.

Classic 5-Stage Pipeline for a RISC
Each cycle the hardware initiates a new instruction and is executing some part of five different instructions.
- Simple;
- However, we must ensure that the overlap of instructions in the pipeline cannot cause a conflict (also called a hazard).

Instruction \ Clock number   1    2    3    4    5    6    7    8    9
Instruction i                IF   ID   EX   MEM  WB
Instruction i+1                   IF   ID   EX   MEM  WB
Instruction i+2                        IF   ID   EX   MEM  WB
Instruction i+3                             IF   ID   EX   MEM  WB
Instruction i+4                                  IF   ID   EX   MEM  WB
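The overlap in the table above is mechanical: instruction i enters IF in cycle i + 1 and advances one stage per clock. The following is a minimal C sketch, not from the lecture, that prints the same occupancy table under the assumption of no hazards or stalls.

```c
/* Prints the classic 5-stage occupancy table: instruction i enters IF
 * in cycle i+1 and moves one stage per clock (no hazards, no stalls). */
#include <stdio.h>

int main(void) {
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    const int num_stages = 5, num_insts = 5;
    const int total_cycles = num_insts + num_stages - 1;   /* 9 cycles */

    printf("%-16s", "instruction");
    for (int c = 1; c <= total_cycles; c++)
        printf("%5d", c);
    printf("\n");

    for (int i = 0; i < num_insts; i++) {
        printf("instruction i+%-2d", i);
        for (int c = 1; c <= total_cycles; c++) {
            int s = c - 1 - i;   /* stage occupied in cycle c, if any */
            printf("%5s", (s >= 0 && s < num_stages) ? stages[s] : "");
        }
        printf("\n");
    }
    return 0;
}
```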
Computer Pipelines
Pipeline properties:
- Processors execute billions of instructions, so throughput is what matters.
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce the speedup.
- The time to "fill" the pipeline and the time to "drain" it reduce the speedup.
Under ideal conditions, the time per instruction on the pipelined processor equals:
  Time per instruction (pipelined) = Time per instruction (unpipelined) / Number of pipe stages
However, the stages may not be perfectly balanced. Pipelining yields a reduction in the average execution time per instruction.

Review: Components of a Computer
[Figure: block diagram of the processor (control plus datapath with the program counter, registers, and ALU), memory holding program and data bytes, and input/output devices, connected through the processor-memory and I/O-memory interfaces]

CPU: Datapath vs. Control
- Datapath: the storage, functional units, and interconnect sufficient to perform the desired functions.
  - Inputs are control points
  - Outputs are signals
- Controller: a state machine that orchestrates operation of the datapath, based on the desired function and the signals.

5 Stages of the MIPS Pipeline
[Figure: datapath annotated with the five stages (Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back), showing the PC with its next-PC adder and mux, the instruction memory, register file, sign extender, ALU with zero detection, data memory, and write-back mux]
  IF:           IR <= mem[PC]; PC <= PC + 4
  Later stages: Reg[IR.rd] <= Reg[IR.rs] op Reg[IR.rt]

Making RISC Pipelining Real
- Functional units are used in different cycles, hence we can overlap the execution of multiple instructions.
- Important things to make it real:
  - Separate instruction and data memories (e.g., I-cache and D-cache, or banking) to eliminate the conflict of accessing a single memory.
  - The register file is used in two stages (two reads and one write every cycle): read in ID during the second half of the clock cycle, write in WB during the first half.
  - PC: increment and store the PC every clock; this is done during the IF stage. A branch does not change the PC until the ID stage (an adder computes the potential branch target).
  - Staging data between pipeline stages: pipeline registers.

Pipeline Datapath
- The register file is used in the ID and WB stages: read in ID (second half of the clock cycle), write in WB (first half of the clock cycle).
- Separate instruction memory (IM) and data memory (DM).

Pipeline Registers
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers inserted between stages, annotated with the per-stage RTL: IR <= mem[PC]; PC <= PC + 4; A <= Reg[IR.rs]; B <= Reg[IR.rt]; rslt <= A op B; WB <= rslt; Reg[IR.rd] <= WB]
- Pipeline registers stage data between pipeline stages; they are named IF/ID, ID/EX, EX/MEM, and MEM/WB.
- The edge-triggered property of the registers is critical (see the sketch after this slide).
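The staging and edge-triggered update can be pictured in software. Below is a minimal C sketch, not from the lecture, in which each pipeline register is a plain struct (field names such as ir, npc, a, b, imm, alu_output, and lmd follow the lecture's RTL but are otherwise assumptions); every stage reads only the current register contents, and all registers latch new values together at the clock edge. A real datapath would instead be described in an HDL such as Chisel, per the course reading.

```c
/* A software picture of the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline
 * registers: stages read from `cur` and write into `next`, and all four
 * registers are overwritten together at the clock edge, so a value
 * produced this cycle is visible to the next stage only next cycle. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t ir, npc; }                 IF_ID;
typedef struct { uint32_t ir, npc, a, b, imm; }      ID_EX;
typedef struct { uint32_t ir, alu_output, b, cond; } EX_MEM;
typedef struct { uint32_t ir, alu_output, lmd; }     MEM_WB;

typedef struct {
    IF_ID  if_id;
    ID_EX  id_ex;
    EX_MEM ex_mem;
    MEM_WB mem_wb;
} PipelineRegs;

/* Edge-triggered update: latch all four registers at once. */
static void clock_edge(PipelineRegs *cur, const PipelineRegs *next) {
    memcpy(cur, next, sizeof *cur);
}

int main(void) {
    PipelineRegs cur = {0}, next = {0};

    /* IF stage of cycle 1 deposits its results into the IF/ID register. */
    next.if_id.ir  = 0x012A4020u;   /* arbitrary instruction encoding */
    next.if_id.npc = 4u;

    clock_edge(&cur, &next);        /* rising edge: ID can now see them */
    printf("IF/ID.IR = 0x%08X, IF/ID.NPC = %u\n", cur.if_id.ir, cur.if_id.npc);
    return 0;
}
```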
A <= Reg[IRrs]; opFetch-DCD JSR JR B <= Reg[IR ] rt ST br jmp LD RR RI r <= A + IRim if bop(A,b) PC <= IRjaddr r <= A opIRop B r <= A opIRop IRim PC <= PC+IRim WB <= r WB <= r WB <= Mem[r] Reg[IRrd] <= WB Reg[IRrd] <= WB Reg[IRrd] <= WB Events on Every Pipeline Stage Stage Any Instruction IF IF/ID.IR ß MEM[PC]; IF/ID.NPC ß PC+4 PC ß if ((EX/MEM.opcode=branch) & EX/MEM.cond) {EX/MEM.ALUoutput} else {PC + 4} ID ID/EX.A ß Regs[IF/ID.IR[Rs]]; ID/EX.B ß Regs[IF/ID.IR[Rt]] ID/EX.NPC ß IF/ID.NPC; ID/EX.Imm ß extend(IF/ID.IR[Imm]); ID/EX.Rw ß IF/ID.IR[Rt or Rd] ALU Instruction Load / Store Branch EX EX/MEM.ALUoutput ß EX/MEM.ALUoutput ß EX/MEM.ALUoutput ß ID/EX.A func ID/EX.B, or ID/EX.A + ID/EX.Imm ID/EX.NPC + (ID/EX.Imm << 2) EX/MEM.ALUoutput ß ID/EX.A op ID/EX.Imm EX/MEM.B ß ID/EX.B EX/MEM.cond ß br condition MEM MEM/WB.ALUoutput ß MEM/WB.LMD ß EX/MEM.ALUoutput MEM[EX/MEM.ALUoutput] or MEM[EX/MEM.ALUoutput] ß EX/MEM.B WB Regs[MEM/WB.Rw] ß For load only: MEM/WB.ALUOutput Regs[MEM/WB.Rw] ß MEM/WB.LMD Pipelining Performance (1/2) © Pipelining increases throughput, not reduce the execution time of an individual instruction. u In face, slightly increases the execution time (an instruction) due to overhead in the control of the pipeline. u Practical depth of a pipeline is limits by increasing execution time. © Pipeline overhead u Unbalanced pipeline stage; u Pipeline stage overhead; u Pipeline register delay; u Clock skew. Processor Performance Instructions Cycles Time CPU Time = * * Program Instruction Cycle © Instructions per program depends on source code, compiler technology, and ISA © Cycles per instructions (CPI) depends on ISA and µarchitecture © Time per cycle depends upon the µarchitecture and base technology CPI for Different Instructions 7 cycles 5 cycles 10 cycles Inst 1 Inst 2 Inst 3 Time Total clock cycles = 7+5+10 = 22 Total instruc7ons = 3 CPI = 22/3 = 7.33 CPI is always an average over a large numBer of instruc7ons 22 Pipeline Performance (2/2) © Example 1 (p.C-10): Consider the unpipelined processor in previous section. Assume that it has a 1ns clock cycle and that it uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? © Answer The average instruction execution time on the unpipelined processor is Average instruction execution time = Clock cycle × Average CPI =1 ns×((40%+ 20%)× 4 + 40%× 5) =1 ns× 4.4 = 4.4 ns Average instruction time unpipelined 4.4 ns Speedup from pipelining = = = 3.7 times Average instruction time pipelined 1.2 ns † In the pipeline, the clock must run at the speed of the slowest stage plus overhead, which will be 1+0.2 ns.