Pipelined Instruction Execution: Hazards, Stages
Computer Architectures
Pipelined Instruction Execution: Hazards, Stage Balancing, Super-scalar Systems
Pavel Píša, Richard Šusta, Michal Štepanovský, Miroslav Šnorek
Main source of inspiration: Patterson and Hennessy
Czech Technical University in Prague, Faculty of Electrical Engineering
English version partially supported by: European Social Fund Prague & EU: We invest in your future.
B35APO Computer Architectures, Ver. 1.10

Motivation – AMD Bulldozer 15h (FX, Opteron), 2011
[Die/microarchitecture figure only.]

Motivation – Intel Nehalem (Core i7), 2008
[Die/microarchitecture figure only.]

The Goal of Today's Lecture
● Convert/extend the CPU presented in lecture 2 to a pipelined CPU design.
● The following instructions are considered for our CPU design: add, sub, and, or, slt, addi, lw, sw and beq

Instruction formats (bits 31…0):
  R: opcode(6) 31:26 | rs(5) 25:21 | rt(5) 20:16 | rd(5) 15:11 | shamt(5) 10:6 | funct(6) 5:0
  I: opcode(6) 31:26 | rs(5) 25:21 | rt(5) 20:16 | immediate(16) 15:0
  J: opcode(6) 31:26 | address(26) 25:0

Single Cycle CPU Together with Memories (from lecture 2)
[Single-cycle datapath figure: PC, instruction memory, register file, sign extension, ALU, data memory and the result multiplexer, driven by the control unit outputs MemToReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst and RegWrite derived from the Opcode and Funct fields.]

Single Cycle CPU – Performance: IPS = IC / T = IPCavg · fCLK (from lecture 2)
● What is the maximal possible frequency of this CPU?
● It is given by the latency of the critical path – the lw instruction in our case:
  Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup

Single Cycle CPU – Throughput: IPS = IC / T = IPCavg · fCLK (from lecture 2)
● Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup
● Consider the following parameters:
  tPC = 30 ns, tMem = 300 ns, tRFread = 150 ns, tALU = 200 ns, tMux = 20 ns, tRFsetup = 20 ns
  Then Tc = 1020 ns → fCLK max ≈ 980 kHz, IPS = 1 · 980e3 = 980 000 instructions per second

Separate Instruction Fetch and Execution
Even non-pipelined processors separate instruction execution into stages:
  Instruction Fetch | Execution
1. Instruction Fetch – set up PC for the instruction memory and fetch the addressed instruction; update PC = PC + 4.
2. The actual execution of the instruction.

Non-Pipelined Execution with Instruction Prefetching
[Single-cycle datapath figure from lecture 2; the triangle symbol in the figure indicates a clock input responding to the rising edge.]

Single Cycle CPU – Throughput: IPS = IC / T = IPCavg · fCLK
Consider the following parameters:
  tPC = 30 ns, tMem = 300 ns, tRFread = 50 ns, tALU = 200 ns, tMux = 20 ns, tRFsetup = 20 ns
If the fetch is executed in parallel with the processing part, then Tcfetch = tPC + tMem = 330 ns < Tcproc, and
  Tcproc = 50 + 200 + 300 + 20 + 20 = 590 ns → fCLK max ≈ 1.69 MHz → IPS ≈ 1 690 000
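The throughput numbers above follow directly from the stage latencies. As a minimal illustration (not part of the original slides), the following C sketch recomputes the single-cycle clock period and the prefetching variant; the variable names are arbitrary and the latencies are the values assumed above.

  #include <stdio.h>

  /* Illustrative only: recomputes the clock period and instruction throughput
   * from the stage latencies given on the slides above (values in nanoseconds). */
  int main(void)
  {
      double t_pc = 30, t_mem = 300, t_rf_read = 150, t_alu = 200,
             t_mux = 20, t_rf_setup = 20;

      /* Single-cycle CPU: the critical path is the lw instruction, traversing
       * PC, instruction memory, register file read, ALU, data memory,
       * result multiplexer and register file setup. */
      double t_single = t_pc + t_mem + t_rf_read + t_alu + t_mem + t_mux + t_rf_setup;
      printf("single cycle: Tc = %.0f ns, fclk = %.0f kHz, IPS = %.0f\n",
             t_single, 1e6 / t_single, 1e9 / t_single);

      /* With instruction prefetching the fetch (PC + instruction memory)
       * overlaps the processing part, so the clock period is the longer of
       * the two; the prefetching slide uses tRFread = 50 ns. */
      double t_rf_read_fast = 50;
      double t_fetch = t_pc + t_mem;                                   /* 330 ns */
      double t_proc  = t_rf_read_fast + t_alu + t_mem + t_mux + t_rf_setup; /* 590 ns */
      double t_pref  = t_fetch > t_proc ? t_fetch : t_proc;
      printf("prefetching:  Tc = %.0f ns, fclk = %.2f MHz, IPS = %.0f\n",
             t_pref, 1e3 / t_pref, 1e9 / t_pref);
      return 0;
  }

Running it reproduces the two results quoted above: 1020 ns (≈ 980 kHz, 980 000 IPS) and 590 ns (≈ 1.69 MHz, ≈ 1 690 000 IPS).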
Delay Slot
● The prefetched instruction is executed unconditionally, even when the preceding instruction is a taken branch – the branch takes effect only after the next instruction:

  label:  ...
          add $2,$3,$4
          beq $1,$0,label
          (delay slot)

● The MIPS architecture defines execution of one delay slot.
● The compiler fills the branch delay slot by selecting an independent instruction from the sequence before the branch:

  label:  ...
          beq $1,$0,label
          add $2,$3,$4      ← delay slot

● It has to be correct to execute the instruction in the delay slot whether the branch is taken or not.
● If no suitable instruction is found, the compiler fills the delay slot with a NOP.

Delay Slot
The task of the compiler is therefore to ensure that the instruction placed in the branch delay slot is valid and useful. Three alternatives illustrating how the delay slot can be filled are depicted in figures on the original slide. The easiest way (and at the same time the least efficient) is to fill the delay slot with a blank instruction, nop.

Pipelined Instruction Execution
Suppose that instruction execution can be divided into 5 stages:
  IF | ID | EX | MEM | WB
IF – Instruction Fetch, ID – Instruction Decode (and Operand Fetch), EX – Execute, MEM – Memory Access, WB – Write Back

The stage time is τ = max{ τi }, i = 1, …, k, where τi is the time required for signal propagation (propagation delay) through the i-th stage.

IF – set up PC for the instruction memory and fetch the addressed instruction; update PC = PC + 4
ID – decode the opcode and read the registers specified by the instruction, check them for equality (for a possible beq instruction), sign-extend the offset and compute the branch target address for the branch case (that is, extend the offset, shift it and add it to the PC)
EX – execute the function / pass the register values through the ALU
MEM – read/write the main memory for the load/store instruction case
WB – write the result into the register file for instructions of the register-register class or for load (the result source is the ALU or memory)

Instruction-level Parallelism – Pipelining
[Pipeline timing diagram: instructions I1–I10 advance through the IF, ID, EX, MEM and WB stages, one stage per clock cycle (time axis 1–10); once the pipeline is full, one instruction completes every cycle.]

● The time to execute n instructions in the k-stage pipeline:
  Tk = k·τ + (n − 1)·τ
● Speedup:
  Sk = T1 / Tk = n·k·τ / (k·τ + (n − 1)·τ),   and lim (n→∞) Sk = k
  (these formulas are evaluated numerically in the sketch after the hazards overview below)
● Prerequisite: the pipeline is optimally balanced and the circuit can be divided arbitrarily.

Pipeline Loads Example
[Figure only; source: Gedare Bloom and Mary Jane Irwin.]

Instruction-level Parallelism – Pipelining
● Pipelining does not reduce the execution time of individual instructions; the effect on a single instruction is just the opposite…
● Hazards:
  ● structural (resolved by duplication of resources),
  ● data (result of data dependencies: RAW, WAR, WAW),
  ● control (caused by instructions which change the PC)…
● Hazard prevention can result in a pipeline stall or a pipeline flush.
● Remark: a deeper pipeline (more stages) results in shorter chains of gates in each stage, which enables an increase of the processor's operating frequency…, but more stages mean higher overhead (a greater demand to arrange instructions suitably for the pipeline, and a more significant penalty in the case of a stall or pipeline flush).
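The speedup formulas from the pipelining slide above, and the cost of the stall cycles just mentioned, can be made concrete with a small C sketch (not part of the original slides). The stage delay, instruction count and average stall rate used here are illustrative assumptions only.

  #include <stdio.h>

  /* Illustrative sketch: evaluates T1, Tk and the speedup Sk for a k-stage
   * pipeline, optionally with extra stall (bubble) cycles caused by hazards.
   * Assumes an optimally balanced pipeline with stage delay tau. */
  int main(void)
  {
      const int    k   = 5;       /* pipeline stages: IF ID EX MEM WB        */
      const double tau = 200e-9;  /* stage delay, e.g. 200 ns (assumption)   */
      const long   n   = 1000000; /* number of executed instructions         */

      double t1 = (double)n * k * tau;      /* non-pipelined execution time  */
      double tk = (k + (n - 1.0)) * tau;    /* ideal pipelined execution     */
      printf("ideal:   Tk = %.4f s, speedup Sk = %.3f\n", tk, t1 / tk);

      /* Hazards insert stall cycles; e.g. an average of 0.3 stalls per
       * instruction lowers the achievable speedup below k. */
      double stalls_per_instr = 0.3;        /* assumed average, illustration */
      double tk_stall = (k + (n - 1.0) + stalls_per_instr * n) * tau;
      printf("stalled: Tk = %.4f s, speedup Sk = %.3f\n", tk_stall, t1 / tk_stall);
      return 0;
  }

For the chosen numbers the ideal speedup approaches k = 5, while 0.3 stalls per instruction already drop it to about 3.85.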
Instruction-level Parallelism – Semantics Violations

Data hazard (add writes the new value to R1 only in its WB stage, so sub reads an incorrect value from R1 and the expected effect of the instruction flow is violated):

  ADD R1,R2,R3    IF ID EX MEM WB
  SUB R4,R1,R3       IF ID EX MEM WB

Control hazard (the condition and the new PC are evaluated late; the PC is set to the branch target only after several following instructions have already entered the pipeline):

  BEQZ R3, M1        IF ID EX MEM WB
  ADD  R6,R1,R2         IF ID EX MEM WB
  instruction 3            IF ID EX MEM WB
  instruction 4               IF ID EX MEM WB
  M1: ADD R4,R6,R7               IF ID EX MEM WB

Should these instructions be fetched (and then executed)?

Non-pipelined Execution (from lecture 2)
[Single-cycle datapath figure, as in lecture 2.]

Pipelined Execution
[The same datapath split by pipeline registers into the Fetch | Decode | Execute | Memory | WriteBack stages; values such as WriteData, WriteReg, ALUOut and PCPlus4 are carried along in per-stage copies (e.g. WriteRegE, WriteRegM, WriteRegW).]

Pipelined Execution
[The same pipelined datapath with the control unit added; the control signals (RegWrite, MemToReg, MemWrite, ALUControl, ALUSrc, RegDst, Branch) are derived from Opcode and Funct in Decode.]

The Same Design but Drawn Scaled Down…
[Scaled-down pipelined datapath figure with the control signals pipelined per stage: RegWriteD/E/M/W, MemToRegD/E/M, MemWriteD/E/M, ALUControlD/E, ALUSrcD/E, RegDstD/E, BranchD/E and PCSrcM.]

Cause of the Data Hazards
● Register File – it is accessed from two pipeline stages (Decode and WriteBack); the actual write occurs in the first half of the clock cycle and the read in the second half ⇒ there is no hazard for the sub instruction's $s0 input operand (it reads $s0 in the same cycle in which the producing add writes it back).
● RAW (Read After Write) hazard – the and (or) instruction following the add that produces $s0 requires $s0 already in cycle 3 (4), before the add writes it back.
● How can such a hazard be prevented without degrading the pipeline throughput?

QtMips – Resolve Data Hazard by Stall
Program fragments from the QtMips demonstration, with symbolic and with numeric register names:

  OR  $t2, $s0, $s5
  AND $t0, $s0, $s1
  NOP
  ADD $s0, $s2, $s3

  OR  $9, $20, $16
  AND $8, $16, $17
  NOP
  ADD $16, $18, $19
  ORI $21,
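As a rough illustration of why the stall (bubble/NOP) is needed and how many cycles it can cost, the following C sketch (not part of the original material) simulates in which clock cycle each instruction of the add/and/or/sub sequence can read its operands in a 5-stage pipeline without forwarding. The register numbering ($s0 = 16, $t0 = 8, …) follows the MIPS convention; the simple issue model and the chosen instruction sequence are assumptions made only for this illustration.

  #include <stdio.h>

  /* Illustrative sketch: 5-stage pipeline without forwarding. The register
   * file is assumed to write in the first half of a cycle and read in the
   * second half, so reading a register in the same cycle as its write-back
   * is already safe. */
  struct instr {
      const char *text;
      int rd;       /* destination register, -1 if none */
      int rs, rt;   /* source registers, -1 if unused   */
  };

  int main(void)
  {
      /* Example sequence: and/or depend on $s0 ($16) written by add;
       * sub is far enough away to need no stall. */
      struct instr prog[] = {
          { "add $s0,$s2,$s3", 16, 18, 19 },
          { "and $t0,$s0,$s1",  8, 16, 17 },
          { "or  $t2,$s0,$s5", 10, 16, 21 },
          { "sub $t3,$s0,$s4", 11, 16, 20 },
      };
      int n = sizeof prog / sizeof prog[0];
      int wb_cycle[32];              /* cycle in which a register is written back */
      for (int r = 0; r < 32; r++)
          wb_cycle[r] = 0;

      int fetch = 0;                 /* cycle in which the next instruction is fetched */
      for (int i = 0; i < n; i++) {
          int id = fetch + 1;        /* nominal ID (register read) cycle */
          if (prog[i].rs >= 0 && wb_cycle[prog[i].rs] > id)
              id = wb_cycle[prog[i].rs];
          if (prog[i].rt >= 0 && wb_cycle[prog[i].rt] > id)
              id = wb_cycle[prog[i].rt];
          int stalls = id - (fetch + 1);
          printf("%-18s ID in cycle %d, stalls inserted: %d\n",
                 prog[i].text, id, stalls);
          if (prog[i].rd > 0)
              wb_cycle[prog[i].rd] = id + 3;  /* EX, MEM, then WB */
          fetch = id;                /* the next instruction fetches while this one decodes */
      }
      return 0;
  }

With these assumptions the and must wait two cycles for the add to write $s0 back, and once it has stalled, the following or and sub need no further stalls; a forwarding (bypass) network, discussed next, removes even those stalls without reducing the clock frequency.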