Computer Architectures

Pipelined Instruction Execution Hazards, Stages Balancing, Super-scalar Systems Pavel Píša, Richard Šusta Michal Štepanovský, Miroslav Šnorek Main source of inspiration: Patterson and Hennessy

Czech Technical University in Prague, Faculty of Electrical Engineering English version partially supported by: European Social Fund Prague & EU: We invests in your future.

B35APO Computer Architectures Ver.1.10 1 Motivation – AMD Bulldozer 15h (FX, Opteron) - 2011

B35APO Computer Architectures 2 Motivation – Intel Nehalem (Core i7) - 2008

B35APO Computer Architectures 3 The Goal of Today Lecture

● Convert/extend CPU presented in the lecture 2 to the pipelined CPU design. ● The following instructions are considered for our CPU design: add, sub, and, or, slt, addi, lw, sw and beq

Typ 31… 0 R opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 rd(5), 15:11 shamt(5) funct(6), 5:0 I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 J opcode(6), 31:26 address(26), 25:0

B35APO Computer Architectures 4 Single Cycle CPU Together with Memories

MemToReg Control MemWrite Unit Branch 31:26 ALUControl 2:0 Opcode ALUScr RegDest 5:0 Funct RegWrite WE3 0 PC’ PC Instr 25:21 SrcA Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4

B35APO Computer Architectures 5 From lecture 2 Single Cycle CPU – Performance: IPS = IC / T = IPCavg.fCLK ● What is the maximal possible frequency of this CPU? ● It is given by latency on the critical path – it is lw in our case:

Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup

WE3 SrcA 0 PC’ PC Instr 25:21 Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4

B35APO Computer Architectures 6 From lecture 2 Single Cycle CPU – Throughput: IPS = IC / T = IPCavg.fCLK

● Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup ● Consider following parameters

tPC = 30 ns

tMem= 300 ns

tRFread = 150 ns

tALU = 200 ns

tMux = 20 ns

tRFsetup = 20 ns

Then Tc = 1020 ns --> fCLK max = 980 kHz,

IPS = 1 • 980e3 = 980 000

B35APO Computer Architectures 7 From lecture 2 Separate Instruction Fetch and Execution

Even non-pipelined processors separate instruction execution into stages :

Instruction Fetch Execution

1. Instruction Fetch – setup PC for memory and fetch pointed instruction. Update PC = PC+4 2. The actual execution of the instruction

B35APO Computer Architectures 8 Non-Pielined Execution with Instructions Prefetching

WE3 SrcA 0 PC’ PC Instr 25:21 Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4

in the figure, it indicates the clock input responding to the rising edge

B35APO Computer Architectures 9 Single Cycle CPU – Throughput: IPS = IC / T = IPCstr⋅fCLK

Consider following parameters:

tPC = 30 ns tMem = 300 ns

tRFread = 50 ns tALU = 200 ns

tMux = 20 ns tRFsetup = 20 ns

If Tcfetch is executed paralel with Tcproc, then Tcfetch < Tcproc, and Tcp = 50+200+300+20+20 = 590 ns = 1.69 MHz -> IPS = 1 690 000

B35APO Computer Architectures 10

● Prefetched instruction is executed unconditionally even when the proceeding instruction is a taken branch label: . . . ● Branch takes place after the next instruction add $2,$3,$4 beq $1,$0,label ● MIPS architecture defines execution of one

delay slot Delay Slot ● Compiler fills the branch delay slot ● By selecting an independent instruction label: from the sequence before the branch . . . ● It has to be to execute instruction in the beq $1,$0,label delay slot whether branch is taken or not

add $2,$3,$4 ● If no suitable instruction is found ● then the compiler fills delay slot with a NOP

B35APO Computer Architectures 11 Delay Slot

The task of the compiler is therefore to ensure that the following instructions in the branch delay slot are valid and useful. Three alternatives illustrating how the delay slot can be filled are depicted in the figures below.

The easiest way (at the same time the least efficient) is to fill the delay slot with a blank instruction - . B35APO Computer Architectures 12 Pipelined Instructions Execution

Suppose that instruction execution can be divided into 5 stages:

IF ID EX MEM WB

IF – Instruction Fetch, ID – Instruction decode (and Operands Fetch), EX – Execute, MEM – Memory Access, WB – Write Back

k and  = max { i } i=1, where i is time required for signal propagation (propagation delay) through i-th stage.

IF – setup PC for memory and fetch pointed instruction. Update PC = PC+4 ID – decode the opcode and read registers specified by instruction, check for equality (for possible beq instruction), sign extend offset, compute branch target address for branch case (this is means to extend offset and add PC) EX – execute function/pass register values through ALU MEM – read/write main memory for load/store instruction case WB – write result into RF for instructions of register-register class or instruction load (result source is ALU or memory) B35APO Computer Architectures 13 Instruction-level Parallelism - Pipelining

IF I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 ID I1 I2 I3 I4 I5 I6 I7 I8 I9

EX I1 I2 I3 I4 I5 I6 I7 I8

MEM I1 I2 I3 I4 I5 I6 I7

ST I1 I2 I3 I4 I5 I6 5       čas 1 2 3 4 5 6 7 8 9 10 ● The time to execute n instructions in the k-stage :

Tk = k. + (n – 1) 

T1 nk τ ● Speedup: Sk= = lim Sk=k Tk kτ +(n−1)τ n→∞ Prerequisite: pipeline is optimally balanced, circuit can arbitrarily divided

B35APO Computer Architectures 14 Pipeline Loads Example

B35APO Computer Architectures 15 source: Gedare Bloom and Mary Jane Irwin Instruction-level Parallelism - Pipelining

● Does not reduce the execution time of individual instructions, effect is just the opposite...

● Hazards: ● structural (resolved by duplication), ● data (result of data dependencies: RAW, WAR, WAW) ● control (caused by instructions which change PC)... ● Hazard prevention can result in pipeline stall or pipeline flush

● Remark : Deeper pipeline (more stages) results in shorter sequences of gates in each stage which enables to increase the operating frequency of the …, but more stages means higher overhead (demand to arrange better instructions into pipeline and result in more significant lag in the case of stall or pipeline flush) B35APO Computer Architectures 16 Instruction-level Parallelism – Semantics Violations

Data hazard: Add writes new value to R1

ADD R1,R2,R3 IF ID EX MEM WB SUB R4,R1,R3 IF ID EX MEM WB flow of instructions SUB reads incorrect value from R1 and expected effect Condition and new PC evaluation Control hazard: PC set to branch target BEQZ R3, M1 IF ID EX MEM WB ADD R6,R1,R2 IF ID EX MEM WB instruction 3 IF ID EX MEM WB instruction 4 IF ID EX MEM WB M1: ADD R4,R6,R7 IF ID EX MEM WB

Should be these instructions fetched (and executed then)? B35APO Computer Architectures 17 Non-pipelined Execution

WE3 SrcA 0 PC’ PC Instr 25:21 Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4

B35APO Computer Architectures 18 From lecture 2 Pipelined Execution

AluOutW

WE3 SrcA 0 PC’ PC Instr 25:21 Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOutM Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteDataE WriteDataM WD 20:16 Rt 0 WriteRegE WriteRegM WriteRegW 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4F PCPlus4D PCPlus4E

Fetch Decode Execute Memory WriteBack B35APO Computer Architectures 19 Pipelined Execution

MemToReg Control MemWrite Unit Branch 31:26 ALUControl 2:0 Opcode ALUScr 5:0 RegDest Funct AluOutW RegWrite WE3 SrcA 0 PC’ PC Instr 25:21 Zero WE A RD A1 RD1 ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOutM Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteDataE WriteDataM WD 20:16 Rt 0 WriteRegE WriteRegM WriteRegW 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4F PCPlus4D PCPlus4E

Fetch Decode Execute Memory WriteBack B35APO Computer Architectures 20 The Same Design but Drawn Scaled Down…

RegWriteD RegWriteE RegWriteM RegWriteW MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE RegDstD RegDstE 5:0 Funct BranchD BranchD BranchE PCSrcM

Zero 0 PC´ PC InstrD 25:21 WE3 SrcAE WE A RD A1 RD1 ALUOutM ReadDataW 1 20:16 A RD Instruction A2 RD2 0 SrcBE ALU Data Memory A3 Reg. 1 Memory WriteDataE WriteDataM WD3 File WD 1 ALUOutW 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign SignImmE 4 Ext

<<2 PCPlus4F PCPlus4D + PCBranchD

ResultW

B35APO Computer Architectures 21 Cause of the Data Hazards

– access from two pipeline stages (Decode, WriteBack) – actual write occurs at the first half of the clock cycle, the read in the second half ⇒ there is no hazard for sub $s0 input operand ● RAW (Read After Write) hazard – and (or) requires $s0 in 3 (4) ● How can such hazard be prevented without pipeline throughput degradation?

B35APO Computer Architectures 22 QtMips – Resolve Data Hazard by Stall

OR $t2, $s0, $s5 AND $t0, $s0, $s1 NOP ADD $s0, $s2, $s3 OR $9, $20, $16 AND $8, $16, $17 NOP ADD $16, $18, $19 ORI $21, $21, 0x5555 ORI $s5, $s5, 0x5555 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 0 1 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02114024 = 0 0 0 0 0 0 0 5 0 5 RsD 5 16 00000000 0 0 Hit: 5 0 0 Miss: RtD ALU 5 Cache 0 17 11111111 0 0 17 5 Hit: 0 0 7 Registers 0 0 5 PC 0 Miss: 0 0 5 0 0 0 0 0x80020030 Program 0 0 0 5 0 Memory 0 0 5 0 3 Data 0 5 0 3 5 0 Memory 3 5 0 3 0 5 + NORMAL 3 5 3 Peripherals 5 4 3 3 RtD 17 RtE RdD RdE 0 16 0 8 0 Sign Terminal extension 00004024 00000000

<<2 0 Exception + STALL NONE

21

STALL Hazard Cycles Stalls

Unit B35APO Computer Architectures 23 QtMips – Resolve Data Hazard by Stall

OR $t1, $s4, $s0 AND $t0, $s0, $s1 OR $9, $20, $16 AND $8, $16, $17 NOP NOP

ADD $16, $18, $19 ADD $s0, $s2, $s3 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 0 0 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02114024 = 0 0 0 0 0 0 0 0 0 0 Cache RsD 0 16 55555555 0 0 Hit: 0 0 0 Miss: RtD ALU 0 Cache 0 17 11111111 0 0 18 0 Hit: 0 0 7 Registers 0 0 0 PC 0 Miss: 0 0 0 0 0 0 0 0x80020034 Program 0 0 0 5 0 Memory 0 0 5 0 0 Data 0 5 0 0 5 0 Memory 0 5 0 0 0 5 + NORMAL 0 5 0 Peripherals 5 4 0 0 RtD 17 RtE RdD RdE 0 0 0 8 0 Sign Terminal extension 00004024 00000000

<<2 0 Exception + STALL NONE

16

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 24 Forwarding to Avoid Data Hazards

● If a result is available (computed) before subsequent instruction(s) requires the value then data hazard can be avoided by forwarding ● Hazard case is indicated when some of source registers in EX stage is the same as destination register in stage MEM or WB ● The register numbers are fed to the Hazard Unit ● The RegWrite signal from MEM and WB stage has to be monitored as well to check that register number on WriteReg lines takes effect – lw / sw B35APO Computer Architectures 25 CPU after previous design steps

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE RegDstD RegDstE 5:0 Funct BranchD BranchD BranchE PCSrcM

Zero 0 PC´ PC InstrD 25:21 WE3 SrcAE WE A RD A1 RD1 ALUOutM ReadDataW 1 20:16 A RD Instruction A2 RD2 0 SrcBE ALU Data Memory A3 Reg. 1 Memory WriteDataE WriteDataM WD3 File WD 1 ALUOutW 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign SignImmE 4 Ext

<<2 PCPlus4F PCPlus4D + PCBranchD

ResultW

B35APO Computer Architectures 26 Data Hazards Solved by Forwarding

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE RegDstD RegDstE 5:0 Funct BranchD BranchD BranchE PCSrcM

Zero 0 PC´ PC InstrD 25:21 WE3 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 10 1 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 10 1 Memory WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign SignImmE 4 Ext

<<2 PCPlus4F PCPlus4D + PCBranchD

ResultW

Forward Forward RegWrite RegWriteM AE BE W Hazard unit

B35APO Computer Architectures 27 QtMips – Resolve Data Hazard by Forwarding

OR $t1, $s4, $s0 AND $t0, $s0, $s1 ADD $s0, $s2, $s3 ORI $s5, $s5, 0x5555 OR $9, $20, $16 AND $8, $16, $17 ADD $16, $18, $19 ORI $21, $21, 0x5555 LUI $s5, 0x5555 LUI $5, 0x5555 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 1 1 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02114024 = 2 2 2 0 0 2 0 0 0 2 5 2 5 Cache RsD 5 16 00000000 2 5 Hit: 5 2 5 Miss: RtD ALU 5 Cache 0 17 11111111 5 0 6 5 Hit: 5 0 7 Registers 0 3 0 5 PC 5 Miss: 0 3 5 0 5 0 3 0x80020034 Program 5 0 0 5 3 Memory 5 0 5 3 0 Data 0 5 3 0 5 3 Memory 0 0 3 0 1 0 + NORMAL 0 0 0 Peripherals 0 4 RsE 0 18 0 RtD 17 RtE 19 RdD RdE 16 21 16 8 Sign Terminal extension 00004024 ffff8020

<<2 0 Exception + NORMAL NONE

21

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 28 QtMips – Resolve Data Hazard by Forwarding

SUB $t2, $s0, $s5 OR $t1, $s4, $s0 AND $t0, $s0, $s1 ADD $s0, $s2, $s3 SUB $10, $16, $21 OR $9, $20, $16 AND $8, $16, $17 ADD $16, $18, $19 ORI $21, $21, 0x5555 ORI $s5, $s5, 0x5555 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 1 1 MemToReg MemWrite 0 0 0 MemRead 0 0 AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02904825 = 5 5 5 0 0 5 0 0 2 5 5 5 5 Cache RsD 5 20 44444444 5 1 Hit: 5 5 1 Miss: RtD ALU 5 Cache 0 16 00000000 1 0 7 5 Hit: 1 0 7 Registers 0 1 0 5 PC 1 Miss: 0 1 5 0 1 0 1 0x80020038 Program 1 0 0 5 1 Memory 1 0 5 1 3 Data 0 5 1 3 5 1 Memory 3 5 1 3 1 5 + NORMAL 3 5 3 Peripherals 5 4 RsE 3 16 3 RtD 16 RtE 17 RdD RdE 16 8 9 8 Sign Terminal extension 00004825 00004024

<<2 0 Exception + FORWARD NONE

21

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 29 QtMips – Resolve Data Hazard by Forwarding

SUB $t2, $s0,$s5 OR $t1, $s4, $s0 AND $t0, $s0, $s1 NOP SUB $10, $16,$21 OR $9, $20, $16 AND $8, $16, $17 ADD $s0, $s2,$s3 ADD $16, $18,$19 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 1 1 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02155022 = 4 4 4 0 0 4 0 0 0 4 1 4 1 Cache RsD 1 16 55555555 4 5 Hit: 1 4 5 Miss: RtD ALU 1 Cache 0 21 55555555 5 0 7 1 Hit: 5 0 8 Registers 1 5 0 1 PC 5 Miss: 0 5 1 0 5 0 5 0x8002003c Program 5 0 0 5 5 Memory 5 0 5 5 1 Data 0 5 5 1 5 5 Memory 1 5 5 1 1 5 + NORMAL 1 5 1 Peripherals 5 4 RsE 1 20 1 RtD 21 RtE 16 RdD RdE 10 9 8 9 Sign Terminal extension 00005022 00004825

<<2 0 Exception + FORWARD NONE

16

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 30 Data Hazard Avoided by Pipeline Stall

● If subsequent instructions require result before it is available in CPU then the pipeline has to be stalled (stall state inserted) ● The stall is mean to solve hazard but affect system throughput ● Pipeline stages preceding that one which is affected by the hazard are stalled until all results required by subsequent instructions are available – results are forwarded to the sink which required their value

B35APO Computer Architectures 31 Data Hazard Avoided by Pipeline Stall

● The stall is realized by the holding content of the inter-stage registers (gating their clocks or blocking their latch enable signals) ● Results from colliding stages have to be „discarded“ – certain control signals in CPU (RF or memory write enable, branch gating) are reset (held low) ● Both is achieved by introduction of control signals to hold and/or reset inter-stages registers B35APO Computer Architectures 32 Build Till Now

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE RegDstD RegDstE 5:0 Funct BranchD BranchD BranchE PCSrcM

Zero 0 PC´ PC InstrD 25:21 WE3 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 10 1 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 10 1 Memory WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign SignImmE 4 Ext

<<2 PCPlus4F PCPlus4D + PCBranchD

ResultW

Forward Forward RegWrite RegWriteM AE BE W Hazard unit

B35APO Computer Architectures 33 Processor with Data Hazards Avoided by Stall

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE RegDstD RegDstE 5:0 Funct BranchD BranchD BranchE PCSrcM

Zero 0 PC´ PC InstrD 25:21 WE3 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW EN 10 1 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 10 1 Memory WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign SignImmE 4 Ext

<<2

PCPlus4F PCPlus4D CLR + EN PCBranchD

ResultW

Forward Forward RegWrite RegWriteM Stall F Stall D AE BE W Hazard unit

B35APO Computer Architectures 34 QtMips – Resolve LW Hazard by Stall

OR $t1, $s4, $s0 AND $t0, $s0, $s1 LW $s0, 40($0) ORI $s5, $s5, 0x5555 OR $9, $20, $16 AND $8, $16, $17 LW $16, 40($0) ORI $21, $21, 0x5555 LUI $21, 0x5555 LUI $s5, 0x5555 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 1 1 MemToReg MemWrite 0 1 0 MemRead 0 0 AluCtrl 0 11 RegDest AluSrc 1 Branch 0

02114024 = 0 0 0 0 0 0 0 0 0 0 5 0 5 Cache RsD 5 16 00000000 0 0 Hit: 5 0 0 ALU 5 Cache 0 Miss: RtD 0 6 17 11111111 5 Hit: 0 0 0 7 Registers 0 0 1 5 PC 0 Miss: 0 0 5 0 0 0 0 0x80020030 Program 2 0 0 5 0 Memory 8 0 5 0 0 Data 0 5 0 0 5 0 Memory 0 0 0 0 0 0 + NORMAL 0 0 0 Peripherals 0 4 RsE 0 0 RtD 17 RtE 160 RdD RdE 16 21 8 0 Sign Terminal extension 00004024 00000028

<<2 0 Exception + NORMAL NONE

21

STALL Hazard Cycles Stalls

Unit B35APO Computer Architectures 35 QtMips – Resolve LW Hazard by Stall

OR $t1, $s4, $s0 AND $t0, $s0, $s1 NOP LW $s0, 40($0) OR $9, $20, $16 AND $8, $16, $17 NOP LW $16, 40($0) ORI $21, $21, 0x5555 ORI $s5, $s5, 0x5555 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 0 1 MemToReg 0 0 1 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02114024 = 0 0 0 1 0 0 0 0 0 0 0 0 0 Cache RsD 0 16 00000000 0 0 Hit: 0 0 0 ALU 0 Cache 1 Miss: RtD 0 7 17 11111111 0 Hit: 2 0 3 7 Registers 0 0 0 2 PC 0 Miss: 4 0 8 0 0 5 0 0x80020034 Program 0 1 6 5 0 Memory 0 7 5 0 0 Data 8 5 0 0 5 0 Memory 0 5 0 0 0 5 + NORMAL 0 5 0 Peripherals 5 4 RsE 0 0 RtD 17 RtE 0 RdD RdE 0 16 0 8 0 Sign Terminal extension 00004024 00000000

<<2 0 Exception + STALL NONE

21

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 36 QtMips – Resolve LW Hazard by Stall

SUB $t2, $s0, $s5 OR $t1, $s4, $s0 AND $t0, $s0, $s1 NOP SUB $10, $16, $21 OR $9, $20, $16 AND $8, $16, $17 NOP LW $16, 40($0) LW $s0, 40($0) 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 1 0 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02904825 = 1 2 3 0 0 4 0 0 1 5 0 6 0 Cache RsD 0 20 44444444 7 1 Hit: 0 8 0 Miss: RtD ALU 0 Cache 0 16 12345678 1 0 8 0 Hit: 0 0 7 Registers 0 1 0 0 PC 1 Miss: 0 1 0 0 0 0 1 0x80020038 Program 1 1 0 1 1 Memory 0 0 2 1 0 Data 0 3 1 0 4 1 Memory 0 5 1 0 1 6 + NORMAL 0 7 0 Peripherals 8 4 RsE 0 16 0 RtD 16 RtE 17 RdD RdE 8 0 9 8 Sign Terminal extension 00004825 00004024

<<2 0 Exception + FORWARD NONE

16

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 37 Control Hazards (Branch and Jump)

● Result is not known before 4th cycle. Why?

B35APO Computer Architectures 38 Control Hazards – Better to Know Result Earlier…

● If the result of comparison can be evaluated in the 2nd cycle misprediction penalty can be reduced ● But the processing of the comparison at earlier stage can induce new RAW hazards..!!! B35APO Computer Architectures 39 Resolve Control Hazards by Early Evaluate and Flush

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW EN 10 1 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 10 1 Memory WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR PCPlus4F PCPlus4D CLR EN PCBranchD

ResultW

Forward Forward RegWrite RegWriteM Stall F Stall D AE BE W Hazard unit

B35APO Computer Architectures 40 Resolve RAW Hazards by Forwarding or Stalling Forward / Stall Stall RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW EN 10 1 20:16 1 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 0 10 1 Memory WriteDataE WriteDataM WD3 File 1 WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR No PCPlus4F PCPlus4D CLR EN PCBranchD Action Required ResultW

Forward Forward RegWrite Forward RegWriteM Stall F Stall D BranchD BD AE BE W Hazard unit

B35APO Computer Architectures 41 QtMips – Resolve BEQ Source Hazard by Stall

AND $t0, $s0, $s1 BEQ $t1, $t2, 0x80020040 NOP ADDI $t2, $0, 4660 AND $8, $16, $17 BEQ $9, $10, 0x80020040 NOP ADDI $10, $0, 4660 ADDI $9, $0, 4660 ADDI $t1, $0, 4660 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 0 0 1 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 0 Branch 0

112a0003 = 0 0 0 0 0 0 0 1 0 0 0 0 0 Cache RsD 0 00001234 0 0 Hit: 0 9 0 0 Miss: RtD ALU 1 Cache 0 10 00000000 0 0 8 2 Hit: 0 0 7 Registers 0 0 0 3 PC 0 Miss: 0 0 4 0 0 0 0 0x80020040 Program 0 0 0 0 0 Memory 0 0 0 0 0 Data 0 0 0 0 0 0 Memory 0 1 0 0 0 2 + FORWARD 0 3 0 Peripherals 4 4 RsE 0 0 RtD 10 RtE 0 RdD RdE 0 10 0 0 0 Sign Terminal extension 00000003 00000000

<<2 1 Exception + STALL NONE

9

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 42 QtMips – Resolve BEQ Control Hazard

SLT $t3, $s2, $s3 AND $t0, $s0, $s1 BEQ $t1, $t2, 0x80020040 NOP SLT $11, $18, $19 AND $8, $16, $17 BEQ $9, $10, 0x80020040 NOP ADDI $10, $0, 4660 ADDI $t2, $0, 4660 1

IF/ID ID/EX EX/MEM MEM/WB

JumpReg 0 Control PcToR31 0 unit RegWrite 1 0 0 MemToReg 0 0 0 MemWrite 0 0 MemRead AluCtrl 0 00 RegDest AluSrc 1 Branch 0

02114024 = 0 0 0 0 0 0 0 0 0 1 0 2 0 Cache RsD 0 16 00000000 3 0 Hit: 0 4 0 Miss: RtD ALU 0 Cache 0 17 11111111 0 0 8 0 Hit: 0 0 8 Registers 0 0 0 0 PC 1 Miss: 0 0 0 0 2 0 0 0x80020044 Program 3 0 0 0 0 Memory 4 0 0 1 0 Data 0 0 2 0 0 3 Memory 0 1 4 0 0 2 + NORMAL 0 3 0 Peripherals 4 4 RsE 0 0 RtD 17 RtE 109 RdD RdE 10 0 8 0 Sign Terminal extension 00004024 00000003

<<2 0 Exception + NORMAL NONE

10

NORMAL Hazard Cycles Stalls

Unit B35APO Computer Architectures 43 We are Finished – Pipelined Processor is Designed

RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW EN 10 1 20:16 1 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 0 10 1 Memory WriteDataE WriteDataM WD3 File 1 WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdE RdD 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR PCPlus4F PCPlus4D CLR EN PCBranchD

ResultW

Forward Forward RegWrite Forward RegWriteM Stall F Stall D BranchD BD AE BE W Hazard unit

B35APO Computer Architectures 44 Pipelined CPU – Performance: IPS = IC / T = IPCavg.fCLK

● What is maximal acceptable frequency for the CPU? ● Which stage is the slowest one? ● The cycle time is determined by the slowest stage ● For our case: Tc = 300 ns --> 3 333 kHz

If the pipeline fill overhead is neglected (i.e. no pipeline stalls and flushes are considered) then ideal IPC = 1. IPS = 1 • 3 333e3 = 3 333 000 instructions per second

● Introduction of the 5-stage pipeline increases performance (throughput) 3 333 000/ 980 000 = 3.4 times! (considering IPC=1)

B35APO Computer Architectures 45 What is Result of the Design? Return back to non-pipelined CPU version

MemToReg Control MemWrite Unit Branch 31:26 Opcode ALUControl 2:0 ALUScr 5:0 RegDest Funct AluOutW RegWrite 25:21 WE3 SrcA PC’ Zero WE 0 PC A RD Instr A1 RD1 ALU 0 Result 1 20:16 A RD 1 Instr. A2 RD2 0 SrcB AluOutM Data ReadData Memory 1 Memory A3 Reg. WD3 WD 20:16 File WriteData Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch PCPlus4F PCPlus4D PCPlus4E +

B35APO Computer Architectures 46 What is Result of the Design? Return back to non-pipelined CPU version

MemToReg Control unit Memory Control MemWrite (control path) Unit Branch 31:26 Opcode ALUControl 2:0 ALUScr 5:0 RegDest Funct AluOutW RegWrite 25:21 WE3 SrcA PC’ Zero WE A RD 0 PC A RD Instr A1 RD1 ALU 0 Result 1 20:16 A RD 1 Instr. A2 RD2 0 SrcB AluOutM ReadData Memory 1 A3 Reg. WD3 WD Data/ALU 20:16 File WriteData Rt 0 WriteReg 1 (data path) WE 15:11 Rd + A RD 4 15:0 <<2 Data Sign Ext SignImm PCBranch Memory PCPlus4F PCPlus4D PCPlus4E + WD

B35APO Computer Architectures 47 What is Result of the Design?

Processor

Control unit

PC Instruction PC A RD RD A Instr. Memory

Write enable Data-path Address for data WE (ALU, registers) Read/Write Read data Address A RD RD A Data Results Data to Write Memory WD WD

B35APO Computer Architectures 48 CPU Design Result – Pipelined Version

RegWriteD RegWriteE RegWriteM RegWriteW Contr MemToRegD MemToRegE MemToRegM ol unit MemWriteD MemWriteE MemWriteM MemTo RegW 31:26 ALUControlD ALUControlE Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD = PCSrcD 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 1 EN 20:16 1 10 A RD 00 SrcBE Instruction A2 RD2 01 0 ALU Data 10 Memory Memory A3 Reg. 0 1 WriteDataE WriteDataM 1 WD 25:21 WD3 File RsD RsE ALUOutW1 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 15:11 RdD RdE 0 WriteRegW 4:0 1 15:0 SignImmD + Sign 4 SignImmE Ext <<2 PCPlus4F PCPlus4D + CLR EN PCBranchD

ResultW

Forward Forward Forward RegWriteM RegWrite Stall F Stall D BranchD BD AE BE W Hazard unit

B35APO Computer Architectures 49 Pipelined CPU – Timing

● The timing/AC characteristics of synchronous sequential circuit :

tsetup – inputs setup time

thold – inputs hold time

● Signal integrity constrain for the setup time before the clock:

Tc >= tpcq + tpd + tsetup

tpd – combinatorial logic propagation delay

B35APO Computer Architectures 50 Pipelined Processor – Timing

● Constraint for the setup time (consider the clock distribution jitter):

Tc >= tpcq + tpd + tsetup + tskew

Clock distribution jitter is limiting factor,

if it reaches or exceeds value of tpd (too deep pipeline / too many stages…)

B35APO Computer Architectures 51 Clock Distribution Network Skew

Y 4 4

S 1 Data IN 2 D Q 5 2 3 1 S 3 1 2 3 2 5 Data OUT CLK D Q 6 3 2 3 CLK Q 6

R Q 2 1

Z R FF1 X 1 FF2 1

CLK 1 2 1 2 1 2 1 2

● Positive Clock Skew – clock arrives at the capturing sequential later than it arrives at the launching sequential ● Negative Clock Skew – clock arrives at the launching sequential later than it arrives at the capturing sequential ● Local Clock Skew – skew between any two sequentials with a valid timing path between them. ● Global Clock Skew – clock skew between any two sequentials in the design irrespective of whether a timing paths exists between them

B35APO Computer Architectures 52 diagram source: peda clock-tree-skew Clock Distribution Network – H-tree

2 2 1

1

3 4 4

3

CLK-in

source: Tawfik, S., Kursun, V.: Clock Distribution Networks with Gradual Signal Transition Time Relaxation for Reduced Power Consumption.

B35APO Computer Architectures 53 Pipeline Stages Balancing Linear pipelining:

(applies to tree based , multiplier, (unrolled) iterative divider..)

● Balancing: the goal is to divide the processing into N stages in such way, that stage propagation delays are roughly the same… ● The number of stages reflects preference of performance (throughput) versus latency.

B35APO Computer Architectures 54 Superpipeline and Beyond

● Not well balanced 5-stage pipeline:

IM RF DM RF

IF ID EX MEM WB ● Deeper pipeline is result of decomposing stages into more stages

IM RF DM RF

IF IS RF EX DF DS TC WB ● It allows CPU to work at higher frequencies but introduces many problems as well.. ● Complex forwarding, more pipeline stalls, hazards need to be solved by complex logic B35APO Computer Architectures 55 Typical Pipeline Depths in Today's CPUs P5 (Pentium) : 5 (Pentium 3): 10 P6 (Pentium Pro): 14 NetBurst (Willamette, 180 nm) - Celeron, Pentium 4: 20 NetBurst (Northwood, 130 nm) - Celeron, Pentium 4, Pentium 4 HT: 20 NetBurst (Prescott, 90 nm) - Celeron D, Pentium 4, Pentium 4 HT, Pentium 4 ExEd: 31 NetBurst (Cedar Mill, 65 nm): 31 NetBurst (Presler 65 nm) - Pentium D: 31 Core : 14 Haswell 14-19 Bonnell: 16 Cooper Lake 14-19

K7 Architecture - Athlon : 10-15 AMD Zen 19 K8 - Athlon 64, Sempron, Opteron, Turion 64: 12-17 AMD Zen2 19 Cortex-A35 2-wide 8 ARM 8-9: 5 Cortex-A53 2-wide 8 ARM 11: 8 Cortex-A57 3-wide 15 Cortex A7 2-wide 8-10 SiFive FE310-G000 5 Cortex-A77 4-wide 11–13 Cortex A8 2-wide 13 SiFive FU540-C000 5 Denver 2/7-wide 13 Cortex A15 3-wide 15-25 M4 6-wide 15 Lightning 7-wide 16 ● The Optimum Pipeline Depth for a : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.4333&rep=rep1&type=pdf B35APO Computer Architectures 56 Branch Stall Discussion and Delay Slots

● The instruction memory read and fetch is expensive and result of condition evaluation in branch instructions (even worse target in indirect branch instructions) has to be evaluated before next fetch and execute. The stall state is waste of cycles. Options to use that cycle(s) are: ● Start fetch and execution of instruction(s) following branch and flush/discard results if it is resolved that it should not be executed ● Extend above by adding condition results/ (taken/not-taken) and branch target cache (BTB) ● Execute one or more instructions after branch unconditionally in (so called) delay slot ● Delay slots unconditional execution is common for many DSP () and some RISC architectures (MIPS, SPARC)

B35APO Computer Architectures 57