Computer Architectures
Pipelined Instruction Execution
Hazards, Stages Balancing, Super-scalar Systems
Pavel Píša, Richard Šusta, Michal Štepanovský, Miroslav Šnorek
Main source of inspiration: Patterson and Hennessy
Czech Technical University in Prague, Faculty of Electrical Engineering
English version partially supported by: European Social Fund Prague & EU: We invest in your future.
B35APO Computer Architectures Ver.1.10
Motivation – AMD Bulldozer 15h (FX, Opteron) - 2011
Motivation – Intel Nehalem (Core i7) - 2008
The Goal of Today's Lecture
● Convert/extend the CPU presented in lecture 2 into a pipelined CPU design.
● The following instructions are considered for our CPU design: add, sub, and, or, slt, addi, lw, sw and beq

Type  31:26      25:21   20:16   15:11   10:6      5:0
R     opcode(6)  rs(5)   rt(5)   rd(5)   shamt(5)  funct(6)
I     opcode(6)  rs(5)   rt(5)   immediate(16), bits 15:0
J     opcode(6)  address(26), bits 25:0
Single Cycle CPU Together with Memories
[Figure: single-cycle CPU datapath (PC, instruction memory, register file, ALU, data memory, sign extension, branch adder) with the control unit signals MemToReg, MemWrite, Branch, ALUControl, ALUSrc, RegDest and RegWrite]
(from lecture 2)
Single Cycle CPU – Performance: IPS = IC / T = IPCavg · fCLK
● What is the maximal possible frequency of this CPU?
● It is given by the latency of the critical path, which is the lw instruction in our case:
Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup
[Figure: single-cycle CPU datapath, as on the previous slide]
(from lecture 2)
Single Cycle CPU – Throughput: IPS = IC / T = IPCavg · fCLK
● Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup
● Consider the following parameters:
tPC = 30 ns
tMem= 300 ns
tRFread = 150 ns
tALU = 200 ns
tMux = 20 ns
tRFsetup = 20 ns
Then Tc = 1020 ns --> fCLK max = 980 kHz,
IPS = 1 • 980e3 = 980 000 instructions per second
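The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the dictionary and function names are illustrative and not part of the lecture, and IPC = 1 is assumed as on the slide.

```python
# Component delays in nanoseconds, copied from the parameters above.
DELAYS_NS = {"tPC": 30, "tMem": 300, "tRFread": 150,
             "tALU": 200, "tMux": 20, "tRFsetup": 20}

# The lw critical path visits memory twice: instruction fetch and data read.
LW_CRITICAL_PATH = ["tPC", "tMem", "tRFread", "tALU", "tMem", "tMux", "tRFsetup"]


def cycle_time_ns(path, delays):
    """Sum the delays of all components along the given path."""
    return sum(delays[name] for name in path)


def max_clock_hz(tc_ns):
    """Highest clock frequency that still fits one cycle of tc_ns nanoseconds."""
    return 1e9 / tc_ns


tc = cycle_time_ns(LW_CRITICAL_PATH, DELAYS_NS)
f_max = max_clock_hz(tc)
ips = 1 * f_max  # assumed IPC = 1 for the single-cycle CPU

print(f"Tc = {tc} ns, fCLK max = {f_max / 1e3:.0f} kHz, IPS = {ips:.0f}")
```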
(from lecture 2)
Separate Instruction Fetch and Execution
Even non-pipelined processors separate instruction execution into stages:
Instruction Fetch | Execution
1. Instruction Fetch – set up the PC for memory and fetch the instruction it points to. Update PC = PC+4
2. The actual execution of the instruction
Non-Pipelined Execution with Instruction Prefetching
[Figure: single-cycle CPU datapath with instruction prefetching]
Note: the clock-input marker in the figure denotes an input that responds to the rising clock edge.
Single Cycle CPU – Throughput: IPS = IC / T = IPCavg · fCLK
Consider the following parameters:
tPC = 30 ns       tMem = 300 ns
tRFread = 50 ns   tALU = 200 ns
tMux = 20 ns      tRFsetup = 20 ns
If the fetch (Tcfetch) is executed in parallel with the execution (Tcproc), then Tcfetch < Tcproc and the cycle time is Tcproc = 50 + 200 + 300 + 20 + 20 = 590 ns, so fCLK max ≈ 1.69 MHz and IPS ≈ 1 690 000
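A short check of this overlap, under the assumption (not stated explicitly on the slide) that the fetch phase consists of tPC + tMem and the execution phase of the remaining terms of the critical path. The variable names are illustrative.

```python
# Delays in nanoseconds from the slide above (note tRFread = 50 ns here).
D = {"tPC": 30, "tMem": 300, "tRFread": 50,
     "tALU": 200, "tMux": 20, "tRFsetup": 20}

# Assumed split of the lw critical path into the two overlapped phases.
t_fetch = D["tPC"] + D["tMem"]
t_exec = D["tRFread"] + D["tALU"] + D["tMem"] + D["tMux"] + D["tRFsetup"]

# When both phases run in parallel, the longer one sets the clock period.
t_cycle = max(t_fetch, t_exec)
f_max_hz = 1e9 / t_cycle

print(f"fetch = {t_fetch} ns, exec = {t_exec} ns, "
      f"Tc = {t_cycle} ns, fCLK max = {f_max_hz / 1e6:.2f} MHz")
```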
Delay Slot
● The prefetched instruction is executed unconditionally, even when the preceding instruction is a taken branch
● The branch takes effect after the next instruction
● The MIPS architecture defines execution of one delay slot
● The compiler fills the branch delay slot
● by selecting an independent instruction from the sequence before the branch
● It has to be safe to execute the instruction in the delay slot whether the branch is taken or not
● If no suitable instruction is found, the compiler fills the delay slot with a NOP

Before filling:
    label: . . .
           add $2,$3,$4
           beq $1,$0,label
           (delay slot)

After filling:
    label: . . .
           beq $1,$0,label
           add $2,$3,$4      (in the delay slot)
Delay Slot
The task of the compiler is therefore to ensure that the instructions placed in the branch delay slot are valid and useful. Three alternatives illustrating how the delay slot can be filled are depicted in the figures below.
The easiest way (and at the same time the least efficient one) is to fill the delay slot with an empty instruction, nop.
Pipelined Instructions Execution
Suppose that instruction execution can be divided into 5 stages:
IF | ID | EX | MEM | WB
IF – Instruction Fetch, ID – Instruction Decode (and Operands Fetch), EX – Execute, MEM – Memory Access, WB – Write Back
and $\tau = \max_{i=1,\dots,k} \{\tau_i\}$, where $\tau_i$ is the time required for signal propagation (propagation delay) through the i-th stage and k is the number of stages.
● IF – set up the PC for memory and fetch the instruction it points to. Update PC = PC+4
● ID – decode the opcode and read the registers specified by the instruction, check for equality (for a possible beq instruction), sign-extend the offset, compute the branch target address for the branch case (this means extending the offset and adding it to the PC)
● EX – execute the function / pass register values through the ALU
● MEM – read/write main memory in the case of a load/store instruction
● WB – write the result into the RF for register-register instructions or for a load (the result source is the ALU or memory)
Instruction-level Parallelism - Pipelining
time   1    2    3    4    5    6    7    8    9    10
IF     I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID          I1   I2   I3   I4   I5   I6   I7   I8   I9
EX               I1   I2   I3   I4   I5   I6   I7   I8
MEM                   I1   I2   I3   I4   I5   I6   I7
WB                         I1   I2   I3   I4   I5   I6
● The time to execute n instructions in the k-stage pipeline:
$T_k = k\tau + (n-1)\tau$
● Speedup:
$S_k = \frac{T_1}{T_k} = \frac{nk\tau}{k\tau + (n-1)\tau}, \qquad \lim_{n\to\infty} S_k = k$
Prerequisite: the pipeline is optimally balanced and the circuit can be divided arbitrarily
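The two formulas can be evaluated directly. The sketch below uses illustrative function names and shows how the speedup approaches k as the number of instructions grows, assuming an ideally balanced pipeline with no stalls.

```python
def pipeline_time(n, k, tau=1.0):
    """T_k = k*tau + (n - 1)*tau: time to run n instructions on k stages."""
    return k * tau + (n - 1) * tau


def speedup(n, k, tau=1.0):
    """S_k = T_1 / T_k, where the non-pipelined time is T_1 = n*k*tau."""
    return (n * k * tau) / pipeline_time(n, k, tau)


if __name__ == "__main__":
    # 10 instructions on 5 stages, as in the diagram above: 14 cycles in total.
    print(pipeline_time(10, 5))
    for n in (1, 10, 100, 10_000):
        print(n, round(speedup(n, 5), 3))
```

For a single instruction the speedup is exactly 1; the pipeline only pays off once it is kept full.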
Pipeline Loads Example
(source: Gedare Bloom and Mary Jane Irwin)
Instruction-level Parallelism - Pipelining
● Pipelining does not reduce the execution time of individual instructions; the effect is just the opposite...
● Hazards:
  ● structural (resolved by duplication),
  ● data (result of data dependencies: RAW, WAR, WAW),
  ● control (caused by instructions which change the PC)...
● Hazard prevention can result in a pipeline stall or a pipeline flush
● Remark: A deeper pipeline (more stages) results in shorter sequences of gates in each stage, which makes it possible to increase the operating frequency of the processor, but more stages mean higher overhead (a greater demand to arrange instructions well in the pipeline, and a more significant lag in the case of a stall or pipeline flush)
Instruction-level Parallelism – Semantics Violations
Data hazard (flow of instructions and expected effect):
    ADD R1,R2,R3      IF ID EX MEM WB           ADD writes the new value to R1
    SUB R4,R1,R3         IF ID EX MEM WB        SUB reads an incorrect value from R1

Control hazard:
    BEQZ R3, M1       IF ID EX MEM WB           condition and new PC evaluation; PC set to branch target
    ADD R6,R1,R2         IF ID EX MEM WB
    instruction 3           IF ID EX MEM WB
    instruction 4              IF ID EX MEM WB
    M1: ADD R4,R6,R7              IF ID EX MEM WB
Should the instructions following the branch be fetched (and then executed)?
Non-pipelined Execution
[Figure: single-cycle CPU datapath]
(from lecture 2)
Pipelined Execution
[Figure: datapath divided by pipeline registers into the Fetch, Decode, Execute, Memory and WriteBack stages; signals carry stage suffixes such as PCPlus4F/D/E, AluOutM/W and WriteRegE/M/W]
Pipelined Execution
[Figure: the same pipelined datapath with the control unit added]
The Same Design but Drawn Scaled Down…
[Figure: pipelined datapath in which the control signals travel through the pipeline registers together with the data (RegWriteD/E/M/W, MemToRegD/E/M, MemWriteD/E/M, ALUControlD/E, ALUSrcD/E, RegDstD/E, Branch, PCSrcM)]
Cause of the Data Hazards
● Register File – accessed from two pipeline stages (Decode, WriteBack) – the actual write occurs in the first half of the clock cycle and the read in the second half ⇒ there is no hazard for the $s0 input operand of sub
● RAW (Read After Write) hazard – the and (or) instruction requires $s0 in cycle 3 (4), before add has written it
● How can such a hazard be prevented without degrading pipeline throughput?
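The timing argument above can be checked with a small model. This is only a sketch under the slide's assumptions (5 stages, one instruction issued per cycle, register file written in the first half of the cycle and read in the second half); the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction fetched in issue_cycle occupies stage."""
    return issue_cycle + STAGES.index(stage)


def raw_hazard(distance, producer_issue=1):
    """True if a consumer issued `distance` instructions after the producer
    reads its operand (in ID) before the producer has written it (in WB).

    A read in the same cycle as the write is safe, because the register
    file is written in the first half of the cycle and read in the second.
    """
    write_cycle = stage_cycle(producer_issue, "WB")
    read_cycle = stage_cycle(producer_issue + distance, "ID")
    return read_cycle < write_cycle


if __name__ == "__main__":
    # add $s0 is the producer; and, or, sub follow at distances 1, 2, 3.
    for name, distance in (("and", 1), ("or", 2), ("sub", 3)):
        print(name, "hazard" if raw_hazard(distance) else "no hazard")
```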
QtMips – Resolve Data Hazard by Stall
[Figure: QtMips pipeline view with the IF/ID, ID/EX, EX/MEM and MEM/WB registers and the hazard unit. Instructions visible in the pipeline: OR $t2, $s0, $s5; AND $t0, $s0, $s1; NOP; ADD $s0, $s2, $s3; ORI $s5, $s5, 0x5555]
QtMips – Resolve Data Hazard by Stall
[Figure: QtMips pipeline view. Instructions visible in the pipeline: OR $t1, $s4, $s0; AND $t0, $s0, $s1; NOP; NOP; ADD $s0, $s2, $s3]
Forwarding to Avoid Data Hazards
● If a result is available (computed) before the subsequent instruction(s) require the value, the data hazard can be avoided by forwarding
● A hazard is indicated when one of the source registers in the EX stage is the same as the destination register in the MEM or WB stage
● The register numbers are fed to the Hazard Unit
● The RegWrite signal from the MEM and WB stages has to be monitored as well, to check that the register number on the WriteReg lines actually takes effect (compare lw and sw: sw does not write a register)
CPU after previous design steps
[Figure: pipelined datapath with control unit, before forwarding is added]
Data Hazards Solved by Forwarding
[Figure: pipelined datapath extended with three-input multiplexers (00, 01, 10) in front of SrcAE and SrcBE and with a hazard unit that receives RsE, RtE, WriteRegM, WriteRegW, RegWriteM and RegWriteW and drives ForwardAE and ForwardBE]
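The forwarding rule stated earlier (a source register in EX matches a destination register in MEM or WB whose RegWrite is asserted) can be sketched as a small function. The select encoding (10 for the MEM stage, 01 for the WB stage, 00 for the register file) and the exclusion of register $0 are assumptions made for this sketch; the slides only show that the multiplexers have inputs 00, 01 and 10.

```python
FROM_REGFILE = 0b00   # no hazard: use the value read in Decode
FROM_WB = 0b01        # assumed encoding: forward ResultW
FROM_MEM = 0b10       # assumed encoding: forward ALUOutM


def forward_select(src_reg, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Select the operand source for one ALU input in the Execute stage.

    src_reg is RsE (for ForwardAE) or RtE (for ForwardBE). The MEM stage
    has priority because it holds the most recent value of the register.
    """
    if src_reg != 0 and reg_write_m and src_reg == write_reg_m:
        return FROM_MEM
    if src_reg != 0 and reg_write_w and src_reg == write_reg_w:
        return FROM_WB
    return FROM_REGFILE


if __name__ == "__main__":
    # ADD $s0,... is in MEM (WriteRegM = 16); AND $t0, $s0, $s1 is in EX.
    print(forward_select(16, 16, True, 0, False))  # RsE = $s0
    print(forward_select(17, 16, True, 0, False))  # RtE = $s1
```

Checking RegWrite matters for instructions such as sw, which carry a register number on the WriteReg lines but never write it.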
QtMips – Resolve Data Hazard by Forwarding
[Figure: QtMips pipeline view. Instructions visible in the pipeline: OR $t1, $s4, $s0; AND $t0, $s0, $s1; ADD $s0, $s2, $s3; ORI $s5, $s5, 0x5555; LUI $s5, 0x5555]
QtMips – Resolve Data Hazard by Forwarding
[Figure: QtMips pipeline view, forwarding active. Instructions visible in the pipeline: SUB $t2, $s0, $s5; OR $t1, $s4, $s0; AND $t0, $s0, $s1; ADD $s0, $s2, $s3; ORI $s5, $s5, 0x5555]
QtMips – Resolve Data Hazard by Forwarding
[Figure: QtMips pipeline view, forwarding active. Instructions visible in the pipeline: SUB $t2, $s0, $s5; OR $t1, $s4, $s0; AND $t0, $s0, $s1; NOP; ADD $s0, $s2, $s3]
Data Hazard Avoided by Pipeline Stall
● If subsequent instructions require a result before it is available in the CPU, then the pipeline has to be stalled (a stall state is inserted)
● The stall is a means of solving the hazard, but it reduces system throughput
● The pipeline stages preceding the one affected by the hazard are stalled until all results required by the subsequent instructions are available; the results are then forwarded to the sink which requires their value
Data Hazard Avoided by Pipeline Stall
● The stall is realized by holding the content of the inter-stage registers (gating their clocks or blocking their latch enable signals)
● Results from the colliding stages have to be "discarded": certain control signals in the CPU (RF or memory write enable, branch gating) are reset (held low)
● Both are achieved by introducing control signals to hold and/or reset the inter-stage registers
Processor Design Built Till Now
[Figure: pipelined datapath with forwarding multiplexers and the hazard unit driving ForwardAE and ForwardBE]
Processor with Data Hazards Avoided by Stall
[Figure: the forwarding datapath extended with enable (EN) and clear (CLR) inputs on the pipeline registers; the hazard unit additionally drives Stall F and Stall D]
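One common way to express the load-use stall condition for this kind of pipeline is sketched below. It is an assumption rather than the exact logic of the lecture's hazard unit: the signal names follow the figure residue (RsD, RtD, RtE, MemToRegE, Stall F, Stall D), while the name flush_e for the signal that clears the Decode/Execute register is hypothetical.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True if the instruction in Decode needs a register that the lw
    currently in Execute has not loaded yet (value known only after MEM)."""
    return bool(mem_to_reg_e and rt_e in (rs_d, rt_d))


def stall_controls(rs_d, rt_d, rt_e, mem_to_reg_e):
    """Derive the control signals from the stall condition.

    stall_f / stall_d hold the PC and the Fetch/Decode register;
    flush_e (hypothetical name) clears the Decode/Execute register so
    that a bubble (NOP) enters the Execute stage.
    """
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"stall_f": stall, "stall_d": stall, "flush_e": stall}


if __name__ == "__main__":
    # LW $s0, 40($0) is in Execute (RtE = 16); AND $t0, $s0, $s1 is in Decode.
    print(stall_controls(rs_d=16, rt_d=17, rt_e=16, mem_to_reg_e=True))
```

After the one-cycle bubble the loaded value is in the WB stage and can be forwarded to the ALU input in the usual way.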
QtMips – Resolve LW Hazard by Stall
[Figure: QtMips pipeline view. Instructions visible in the pipeline: OR $t1, $s4, $s0; AND $t0, $s0, $s1; LW $s0, 40($0); ORI $s5, $s5, 0x5555; LUI $s5, 0x5555]
QtMips – Resolve LW Hazard by Stall
[Figure: QtMips pipeline view. Instructions visible in the pipeline: OR $t1, $s4, $s0; AND $t0, $s0, $s1; NOP; LW $s0, 40($0); ORI $s5, $s5, 0x5555]
QtMips – Resolve LW Hazard by Stall
[Figure: QtMips pipeline view, forwarding active. Instructions visible in the pipeline: SUB $t2, $s0, $s5; OR $t1, $s4, $s0; AND $t0, $s0, $s1; NOP; LW $s0, 40($0)]
Control Hazards (Branch and Jump)
● The branch result is not known before the 4th cycle. Why?
Control Hazards – Better to Know Result Earlier…
● If the result of the comparison can be evaluated in the 2nd cycle, the misprediction penalty can be reduced
● But evaluating the comparison at an earlier stage can induce new RAW hazards!
Resolve Control Hazards by Early Evaluate and Flush
[Figure: pipelined datapath with the equality comparator (=) and the branch target adder (PCBranchD) moved to the Decode stage; BranchD and the comparator output produce PCSrcD; pipeline registers have CLR and EN inputs]
Resolve RAW Hazards by Forwarding or Stalling
[Figure: the early-branch datapath with multiplexers added in front of the Decode-stage comparator; annotations mark the cases "Forward / Stall", "Stall" and "No Action Required"; the hazard unit additionally receives BranchD and drives Decode-stage forwarding next to ForwardAE, ForwardBE, Stall F and Stall D]
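Once the comparison moves to Decode, a branch may need a value that is still being computed. The sketch below shows one common formulation of the resulting stall condition. It is an assumption, not taken from the slides: the branch stalls if the producer is any register-writing instruction still in Execute, or a load still in Memory. The parameter names mirror the signal names in the figure, and the check for register $0 is an added assumption.

```python
def branch_stall(branch_d, rs_d, rt_d,
                 reg_write_e, write_reg_e,
                 mem_to_reg_m, write_reg_m):
    """True if a branch in Decode must wait for one of its source registers."""

    def is_source(reg):
        # Register $0 is constant zero, so it never causes a dependency.
        return reg != 0 and reg in (rs_d, rt_d)

    # Producer still in Execute: its result does not exist yet.
    producer_in_execute = reg_write_e and is_source(write_reg_e)
    # Load in Memory: the data arrives only at the end of this cycle.
    load_in_memory = mem_to_reg_m and is_source(write_reg_m)
    return bool(branch_d and (producer_in_execute or load_in_memory))


if __name__ == "__main__":
    # ADDI $t2, $0, 4660 is in Execute; BEQ $t1, $t2, ... is in Decode.
    print(branch_stall(True, 9, 10, True, 10, False, 0))
    # One cycle later ADDI is in Memory and its ALU result can be forwarded.
    print(branch_stall(True, 9, 10, False, 0, False, 10))
```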
QtMips – Resolve BEQ Source Hazard by Stall
[Figure: QtMips pipeline view. Instructions visible in the pipeline: AND $t0, $s0, $s1; BEQ $t1, $t2, 0x80020040; NOP; ADDI $t2, $0, 4660; ADDI $t1, $0, 4660]
QtMips – Resolve BEQ Control Hazard
[Figure: QtMips pipeline view. Instructions visible in the pipeline: SLT $t3, $s2, $s3; AND $t0, $s0, $s1; BEQ $t1, $t2, 0x80020040; NOP; ADDI $t2, $0, 4660]
We are Finished – Pipelined Processor is Designed
[Figure: complete pipelined CPU datapath with control unit, early branch evaluation in Decode, forwarding multiplexers and hazard unit]
Pipelined CPU – Performance: IPS = IC / T = IPCavg · fCLK
● What is the maximal acceptable frequency for the CPU?
● Which stage is the slowest one?
● The cycle time is determined by the slowest stage
● For our case: Tc = 300 ns --> fCLK max = 3 333 kHz
If the pipeline fill overhead is neglected (i.e. no pipeline stalls and flushes are considered), then the ideal IPC = 1 and IPS = 1 • 3 333e3 = 3 333 000 instructions per second
● Introduction of the 5-stage pipeline increases performance (throughput) 3 333 000 / 980 000 = 3.4 times (considering IPC = 1)
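The comparison with the single-cycle design can be checked numerically. The sketch takes the two cycle times from the slides (1020 ns and 300 ns); the function name and the optional ipc parameter are illustrative, and the value 0.8 used at the end is a made-up example of how stalls and flushes would reduce the gain.

```python
TC_SINGLE_NS = 1020     # single-cycle CPU, lw critical path
TC_PIPELINED_NS = 300   # slowest stage of the 5-stage pipeline


def ips(tc_ns, ipc=1.0):
    """Instructions per second for a clock period in ns and an average IPC."""
    return ipc * 1e9 / tc_ns


speedup = ips(TC_PIPELINED_NS) / ips(TC_SINGLE_NS)
print(f"pipelined IPS = {ips(TC_PIPELINED_NS):.0f}, speedup = {speedup:.2f}")

# Hypothetical case: stalls and flushes lower the average IPC to 0.8.
print(f"speedup at IPC 0.8 = {ips(TC_PIPELINED_NS, 0.8) / ips(TC_SINGLE_NS):.2f}")
```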
What is the Result of the Design? Return back to the non-pipelined CPU version
[Figure: non-pipelined CPU datapath with control unit]
What is the Result of the Design? Return back to the non-pipelined CPU version
[Figure: the same CPU with its blocks grouped into the control unit (control path), Data/ALU (data path), instruction memory and data memory]
What is the Result of the Design?
[Figure: block diagram. The processor consists of the control unit and the data-path (ALU, registers). The PC addresses the instruction memory, which returns the instruction. The data-path supplies the address for data, the write enable (WE) and the data to write to the data memory, and receives the read data.]
CPU Design Result – Pipelined Version
[Figure: final pipelined CPU datapath with control unit and hazard unit]
Pipelined CPU – Timing
● The timing/AC characteristics of a synchronous sequential circuit:
tsetup – input setup time
thold – input hold time
tpcq – clock-to-output (clock-to-Q) propagation delay
● Timing constraint for the setup time before the clock edge:
Tc >= tpcq + tpd + tsetup
tpd – combinational logic propagation delay
Pipelined Processor – Timing
● Constraint for the setup time (considering the clock distribution jitter/skew):
Tc >= tpcq + tpd + tsetup + tskew
The clock distribution jitter/skew becomes the limiting factor if it reaches or exceeds the value of tpd (pipeline too deep / too many stages…)
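The constraint can be turned into a quick estimate of the maximum clock frequency. All numeric values in this sketch are hypothetical and chosen only to show the trend: when tpd shrinks because the pipeline gets deeper, the fixed terms tpcq, tsetup and tskew take a growing share of the cycle.

```python
def min_cycle_ns(t_pcq, t_pd, t_setup, t_skew=0.0):
    """Smallest clock period satisfying Tc >= tpcq + tpd + tsetup + tskew."""
    return t_pcq + t_pd + t_setup + t_skew


def max_freq_mhz(t_pcq, t_pd, t_setup, t_skew=0.0):
    """Maximum clock frequency in MHz for delays given in nanoseconds."""
    return 1e3 / min_cycle_ns(t_pcq, t_pd, t_setup, t_skew)


if __name__ == "__main__":
    # Hypothetical delays in ns: tpcq = 1, tsetup = 1, tskew = 2.
    for t_pd in (6, 3, 1):
        tc = min_cycle_ns(1, t_pd, 1, 2)
        overhead = (tc - t_pd) / tc
        print(f"tpd = {t_pd} ns: Tc = {tc} ns, "
              f"f = {max_freq_mhz(1, t_pd, 1, 2):.1f} MHz, "
              f"overhead share = {overhead:.0%}")
```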
Clock Distribution Network Skew
[Figure: clock skew between two flip-flops, FF1 and FF2, on the path from Data IN to Data OUT, both clocked from CLK]
● Positive Clock Skew – the clock arrives at the capturing sequential element later than it arrives at the launching sequential element
● Negative Clock Skew – the clock arrives at the launching sequential element later than it arrives at the capturing sequential element
● Local Clock Skew – skew between any two sequential elements with a valid timing path between them
● Global Clock Skew – clock skew between any two sequential elements in the design, irrespective of whether a timing path exists between them
(diagram source: peda clock-tree-skew)
Clock Distribution Network – H-tree
[Figure: H-tree clock distribution network driven from CLK-in]
source: Tawfik, S., Kursun, V.: Clock Distribution Networks with Gradual Signal Transition Time Relaxation for Reduced Power Consumption.
Pipeline Stages Balancing
Linear pipelining:
(applies to a tree-based adder, a multiplier, an (unrolled) iterative divider, ...)
● Balancing: the goal is to divide the processing into N stages in such a way that the stage propagation delays are roughly the same
● The number of stages reflects the preference for performance (throughput) versus latency.
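Balancing can be viewed as splitting a chain of logic blocks into N contiguous stages so that the slowest stage is as fast as possible. The sketch below is one simple way to compute this (binary search over the stage delay limit with a greedy feasibility check); it assumes integer block delays, ignores pipeline register overhead, and reuses the component delays from the earlier slides purely as a hypothetical chain of blocks.

```python
def stages_needed(delays, limit):
    """Greedy count of contiguous stages when no stage may exceed limit."""
    stages, current = 1, 0
    for d in delays:
        if current + d > limit:
            stages += 1      # close the current stage, start a new one
            current = d
        else:
            current += d
    return stages


def balance(delays, n_stages):
    """Smallest achievable delay of the slowest stage using at most
    n_stages contiguous stages (integer delays assumed)."""
    lo, hi = max(delays), sum(delays)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(delays, mid) <= n_stages:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    blocks = [30, 300, 50, 200, 300, 20, 20]  # hypothetical chain, in ns
    for n in (1, 2, 5):
        tc = balance(blocks, n)
        print(f"N = {n}: Tc = {tc} ns, latency = {n * tc} ns")
```

The output illustrates the trade-off named in the last bullet: more stages shorten the cycle time (higher throughput), while the latency N · Tc of a single item does not improve.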
Superpipeline and Beyond
● Not well balanced 5-stage pipeline:
units:  IM RF DM RF
stages: IF ID EX MEM WB
● A deeper pipeline is the result of decomposing stages into more stages:
units:  IM RF DM RF
stages: IF IS RF EX DF DS TC WB
● It allows the CPU to work at higher frequencies but introduces many problems as well
● Complex forwarding, more pipeline stalls, hazards need to be solved by complex logic
Typical Pipeline Depths in Today's CPUs
P5 (Pentium): 5
P6 (Pentium 3): 10
P6 (Pentium Pro): 14
NetBurst (Willamette, 180 nm) - Celeron, Pentium 4: 20
NetBurst (Northwood, 130 nm) - Celeron, Pentium 4, Pentium 4 HT: 20
NetBurst (Prescott, 90 nm) - Celeron D, Pentium 4, Pentium 4 HT, Pentium 4 ExEd: 31
NetBurst (Cedar Mill, 65 nm): 31
NetBurst (Presler 65 nm) - Pentium D: 31
Core: 14
Bonnell: 16
Haswell: 14-19
Cooper Lake: 14-19
K7 Architecture - Athlon: 10-15
K8 - Athlon 64, Sempron, Opteron, Turion 64: 12-17
AMD Zen: 19
AMD Zen2: 19
ARM 8-9: 5
ARM 11: 8
Cortex A7 (2-wide): 8-10
Cortex A8 (2-wide): 13
Cortex A15 (3-wide): 15-25
Cortex-A35 (2-wide): 8
Cortex-A53 (2-wide): 8
Cortex-A57 (3-wide): 15
Cortex-A77 (4-wide): 11–13
Denver (2/7-wide): 13
M4 (6-wide): 15
Lightning (7-wide): 16
SiFive FE310-G000: 5
SiFive FU540-C000: 5
● The Optimum Pipeline Depth for a Microprocessor: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.4333&rep=rep1&type=pdf
Branch Stall Discussion and Delay Slots
● The instruction memory read and fetch is expensive, and the result of the condition evaluation in branch instructions (or, even worse, the target of indirect branch instructions) has to be known before the next fetch and execution. The stall state is a waste of cycles. Options to use those cycle(s) are:
  ● Start the fetch and execution of the instruction(s) following the branch and flush/discard the results if it is resolved that they should not be executed
  ● Extend the above by adding a condition result / branch predictor (taken/not-taken) and a branch target cache (BTB)
  ● Execute one or more instructions after the branch unconditionally in a (so-called) delay slot
● Unconditional execution of delay slots is common for many DSP (digital signal processor) and some RISC architectures (MIPS, SPARC)
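The slides do not say how the taken/not-taken predictor works. As an illustration only, the sketch below models one widely used scheme, a table of 2-bit saturating counters indexed by the branch address. The class name, the table size and the initial counter state are all assumptions made for this example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc):
        # Instructions are 4 bytes long, so drop the two lowest address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run a sequence of branch outcomes through the predictor."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


if __name__ == "__main__":
    # A loop branch that is taken nine times and then falls through.
    loop = [True] * 9 + [False]
    p = TwoBitPredictor()
    print(count_mispredictions(p, 0x80020040, loop))  # first run of the loop
    print(count_mispredictions(p, 0x80020040, loop))  # second run of the loop
```

Each misprediction costs a pipeline flush, so the fewer wrong guesses, the closer the average IPC stays to 1.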