18‐447 Lecture 6: Microprogrammed Multi‐Cycle Implementation

James C. Hoe Department of ECE Carnegie Mellon University

18‐447‐S21‐L06‐S1, James C. Hoe, CMU/ECE/CALCM, ©2021

Housekeeping

• Your goal today
  – understand why VAX was possible and reasonable
• Notices
  – Lab 1, Part A, due this week
  – Lab 1, Part B, due next week
  – HW1, due Monday 2/22
• Readings
  – P&H Appendix C
  – Start reading the rest of P&H Ch 4

"Single-Cycle": Is it any good?

[Figure: single-cycle datapath with its control signals (PCSrc1=Jump, PCSrc2=Br Taken, RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite). Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Neither fast nor cheap, and not even simplest

Go Fast(er)!!

Iron Law of Performance

• time/program = (inst/program) × (cyc/inst) × (time/cyc)
  – cyc/inst = 1/IPC
  – time/cyc = 1/f_clk (e.g., 1/GHz)
  – (cyc/inst) × (time/cyc) = 1/MIPS; note workload dependence
• Contributing factors
  – time/cyc: architecture and implementation
  – cyc/inst: architecture, implementation, instruction mix
  – inst/program: architecture, nature and quality of prgm
• **Note**: cyc/inst is a workload average; there can be large instantaneous variations due to instruction type and sequence

Worst-Case Critical Path

[Figure: the single-cycle datapath, shown again for tracing the worst-case critical path.]

[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Single-Cycle Datapath Analysis

• Assume (numbers from P&H)
  – memory units (read or write): 200 ps
  – ALU and adders: 100 ps
  – register file (read or write): 50 ps
  – other: 0 ps

steps      IF    ID    EX    MEM   WB    Delay
resources  mem   RF    ALU   mem   RF    (ps)
R-type     200   50    100   -     50    400
I-type     200   50    100   -     50    400
LW         200   50    100   200   50    600
SW         200   50    100   200   -     550
Bxx        200   50    100   -     -     350
JALR       200   50    100   -     50    350
JAL        200   -     100   -     50    300

For JALR and JAL the listed delay is 50 ps less than the sum of the row, presumably because the write of the link value to rd proceeds in parallel with computing the jump target.

Single-Cycle Implementations

• Good match for the sequential and atomic semantics of ISAs
  – instantiate programmer-visible state one-for-one
  – map instructions to combinational next-state logic
• But, contrived and inefficient
  1. all instructions run as slow as the slowest instruction
  2. must provide worst-case combinational resources in parallel as required by any one instruction
  3. what about CISC ISAs? polyf?

Not the fastest, cheapest or even the simplest way
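The cost of "all instructions run as slow as the slowest instruction" can be checked with a few lines of Python. This is only a sketch: the delays are the Delay column of the datapath analysis table, and the names `DELAY_PS`, `clock_ps`, and `slack_ps` are made up for illustration.

```python
# Per-instruction critical-path delay in ps (Delay column of the analysis table).
DELAY_PS = {
    "R-type": 400, "I-type": 400, "LW": 600, "SW": 550,
    "Bxx": 350, "JALR": 350, "JAL": 300,
}

# A single-cycle design must clock at the slowest instruction's delay.
clock_ps = max(DELAY_PS.values())

# A period in ps converts to MHz as 1e6 / period.
f_clk_mhz = 1e6 / clock_ps

# Time each instruction type sits idle inside the one long cycle.
slack_ps = {name: clock_ps - d for name, d in DELAY_PS.items()}

print(f"clock period = {clock_ps} ps, f_clk = {f_clk_mhz:.0f} MHz")
for name, s in slack_ps.items():
    print(f"{name:7s} idle for {s} ps of every cycle")
```

Under these numbers LW sets the clock, and every other instruction type leaves part of the cycle unused.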

Multi-cycle Implementation: Ver 1.0

• Each instruction type takes only as much time as needed
  – run a 50 ps clock
  – each instruction type takes as many 50 ps clock cycles as needed
• Add a "MasterEnable" signal so architectural state ignores clock edges until enough time has passed
  – an instruction's effect is still purely combinational from state to state
  – all other control signals are unaffected

Multi-Cycle Datapath: Ver 1.0

[Figure: the single-cycle datapath with an added MasterEn signal. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Sequential Control: Ver 1.0

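[Figure: control state diagram.
  – IF1 → IF2 → IF3 → IF4 → ID → EX1 → EX2; JAL goes from IF4 directly to EX1
  – from EX2: Bxx, JAL, or JALR assert MasterEn=1 and return to IF1; LW or SW continue to MEM1 → MEM2 → MEM3 → MEM4; I-type or R-type go to WB
  – from MEM4: SW asserts MasterEn=1 and returns to IF1; LW goes to WB
  – WB returns to IF1]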
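The state sequencing on this slide can be sketched as a small next-state function that counts clock ticks per instruction class. This is an illustrative model only: the state names follow the slide, but `LINEAR`, `next_state`, and `cycles` are hypothetical helper names, and nothing here models the datapath itself.

```python
# States with a single unconditional successor.
LINEAR = {
    "IF1": "IF2", "IF2": "IF3", "IF3": "IF4",
    "ID": "EX1", "EX1": "EX2",
    "MEM1": "MEM2", "MEM2": "MEM3", "MEM3": "MEM4",
    "WB": "IF1",
}

def next_state(state, kind):
    """Next control state for an instruction of class `kind`."""
    if state == "IF4":
        return "EX1" if kind == "JAL" else "ID"   # JAL skips ID
    if state == "EX2":
        if kind in ("LW", "SW"):
            return "MEM1"
        if kind in ("R-type", "I-type"):
            return "WB"
        return "IF1"                              # Bxx, JALR, JAL finish here
    if state == "MEM4":
        return "WB" if kind == "LW" else "IF1"    # SW finishes here
    return LINEAR[state]

def cycles(kind):
    """Count 50 ps clock ticks from IF1 until control returns to IF1."""
    state, ticks = "IF1", 0
    while True:
        ticks += 1
        state = next_state(state, kind)
        if state == "IF1":
            return ticks

for kind in ("R-type", "I-type", "LW", "SW", "Bxx", "JALR", "JAL"):
    print(kind, cycles(kind))
```

Multiplying each count by the 50 ps clock gives back the per-instruction delays assumed in the single-cycle analysis.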

Performance Analysis

• Iron Law: time/program = (inst/program) × (cyc/inst) × (time/cyc)
• For the same ISA, inst/program is the same; okay to compare

    MIPS = IPC × f_clk

  where MIPS is million instructions per second and f_clk is the clock frequency in MHz

Performance Analysis

• Single-Cycle Implementation
    1 × 1,667 MHz = 1,667 MIPS
• Multi-Cycle Implementation
    IPC_avg × 20,000 MHz = 2,178 MIPS
  what is IPC_avg?
• Assume: 25% LW, 15% SW, 40% ALU, 13.3% Branch, 6.7% Jumps [Agerwala and Cocke, 1987]
  – weighted arithmetic mean of CPI ⇒ 9.18
  – weighted harmonic mean of IPC ⇒ 0.109 (this is 1/9.18, the value that gives 2,178 MIPS)
  – weighted arithmetic mean of IPC ⇒ 0.115 (not the right average; it overstates throughput)
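The three averages can be reproduced directly. The sketch below takes the instruction mix from the slide and the per-class cycle counts from the multi-cycle Ver 1.0 design; treating every jump as a 6-cycle JAL is an assumption made here because it reproduces the slide's 9.18. The dictionary and variable names are illustrative.

```python
# Instruction mix from the slide.
MIX = {"LW": 0.25, "SW": 0.15, "ALU": 0.40, "Branch": 0.133, "Jump": 0.067}

# Cycles per instruction class at a 50 ps clock.
# Assumption: all jumps cost 6 cycles (JAL); JALR would cost 7.
CPI = {"LW": 12, "SW": 11, "ALU": 8, "Branch": 7, "Jump": 6}

F_CLK_MHZ = 1e6 / 50  # 50 ps period = 20,000 MHz

cpi_avg = sum(MIX[k] * CPI[k] for k in MIX)          # weighted arithmetic mean of CPI
ipc_harmonic = 1.0 / cpi_avg                         # weighted harmonic mean of IPC
ipc_arithmetic = sum(MIX[k] / CPI[k] for k in MIX)   # the tempting but wrong average

print(f"CPI_avg          = {cpi_avg:.2f}")
print(f"IPC (harmonic)   = {ipc_harmonic:.3f}")
print(f"IPC (arithmetic) = {ipc_arithmetic:.3f}")
print(f"multi-cycle MIPS = {ipc_harmonic * F_CLK_MHZ:.0f}")
```

The harmonic mean is the one that matters because execution time adds up in cycles, not in instructions per cycle.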

Microsequencer: Ver 1.0

• ROM as a combinational logic lookup table
  – ROM size grows as O(2^n) with the number of inputs n, since the ROM literally holds the 2^n-row truth table
  – ROM size grows as O(m) with the number of outputs m

[Figure: combinational control logic implemented as a ROM. The n address inputs come from the state register and the opcode field of the instruction register; the m outputs are the next state, MasterEn, and the datapath control outputs.]
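The two growth rates can be made concrete with a one-line size formula. The input and output widths below are hypothetical numbers chosen only to show the scaling, not figures from the slide.

```python
def rom_bits(n_inputs, m_outputs):
    """Bits in a ROM that stores a full truth table: 2^n rows of m bits each."""
    return (2 ** n_inputs) * m_outputs

base = rom_bits(10, 16)          # hypothetical: 10 address bits, 16 output bits
one_more_input = rom_bits(11, 16)
twice_outputs = rom_bits(10, 32)

print(base, one_more_input, twice_outputs)
```

Adding a single input bit costs as much as doubling the number of outputs, which is why wide opcode fields make a flat ROM lookup expensive.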

[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Microcoding: Ver 0
(note: this is only about counting clock ticks)

state   cntrl   conditional targets
label   flow    R/I-type  LW     SW     Bxx    JALR   JAL
IF1     next    -         -      -      -      -      -
IF2     next    -         -      -      -      -      -
IF3     next    -         -      -      -      -      -
IF4     goto    ID        ID     ID     ID     ID     EX1
ID      next    -         -      -      -      -      -
EX1     next    -         -      -      -      -      -
EX2     goto    WB        MEM1   MEM1   IF1    IF1    IF1
MEM1    next    -         -      -      -      -      -
MEM2    next    -         -      -      -      -      -
MEM3    next    -         -      -      -      -      -
MEM4    goto    -         WB     IF1    -      -      -
WB      goto    IF1       IF1    -      -      -      -
CPI             8         12     11     7      7      6

A systematic approach to FSM sequencing/control

Microsequencer

• A stripped-down "processor" for sequencing and control
  – control states are like a µPC (microprogram counter)
  – µPC indexes into a µprogram ROM to select a µinstruction
  – µprogram state and well-formed control-flow support (branch, jump)
  – fields in the µinstruction map to control signals
• Very elaborate controllers have been built

[Figure: microcode storage producing the datapath control outputs and a sequencing control field; a microprogram counter with address select logic takes inputs from the instruction register opcode field. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Go Cheap!! (And More Capable)

Reducing Datapath by Resource Reuse

How to reuse the same adder for two additions in one instruction

[Figure: datapath fragment with PC, a "+4" adder, instruction memory, register file, sign extend, and ALU. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

"Single-cycle" reused the same adder for different instructions

Reducing Datapath by Sequential Reuse

[Figure: the same datapath fragment with an instruction register (IR) added. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

to IR or not to IR?

Removing Redundancies

[Figure: reduced datapath; the IR fields rd, rs1, rs2, and immed are labeled.]

• Latch Enables: PC, IR, MDR, A, B, ALUOut, RegWr, MemWr
• Steering: ALUSrc1{RF, PC}, ALUSrc2{RF, immed}, MAddrSrc{PC, ALUOut}, RFDataSrc{ALUOut, MDR}

Could also reduce down to a single register read-write port!

[Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Synchronous Register Transfers

• Synchronous state with latch enables
  – PC, IR, RF, MEM, A, B, ALUOut, MDR
• One can enumerate all possible "register transfers"
• For example, starting from PC
  – IR ← MEM[ PC ]
  – MDR ← MEM[ PC ]
  – PC ← PC + 4
  – PC ← PC + B
  – PC ← PC + immediate(IR)
  – ALUOut ← PC + 4
  – ALUOut ← PC + immediate(IR)
  – ALUOut ← PC + B

Not all feasible RTs are meaningful

Useful Register Transfers (by dest)

• PC ← PC + 4
• PC ← PC + immediate_{SB-type,U-type}(IR)
• PC ← A + immediate_{SB-type}(IR)
• IR ← MEM[ PC ]
• A ← RF[ rs1(IR) ]
• B ← RF[ rs2(IR) ]
• ALUOut ← A + B
• ALUOut ← A + immediate_{I-type,S-type}(IR)
• ALUOut ← PC + 4
• MDR ← MEM[ ALUOut ]
• MEM[ ALUOut ] ← B
• RF[ rd(IR) ] ← ALUOut
• RF[ rd(IR) ] ← MDR

RT Sequencing: R-Type ALU

• IF    IR ← MEM[ PC ]            step 1
• ID    A ← RF[ rs1(IR) ]         step 2
        B ← RF[ rs2(IR) ]         step 3
• EX    ALUOut ← A + B            step 4
• MEM
• WB    RF[ rd(IR) ] ← ALUOut     step 5
        PC ← PC + 4               step 6

if MEM[PC] == ADD rd rs1 rs2
    GPR[rd] ← GPR[rs1] + GPR[rs2]
    PC ← PC + 4

RT Datapath Conflicts

• RF[ rd(IR) ] ← ALUOut
• RF[ rd(IR) ] ← MDR
• PC ← PC + 4
• PC ← PC + imm(IR)
• PC ← A + imm(IR)
• IR ← MEM[ PC ]
• A ← RF[ rs1(IR) ]
• B ← RF[ rs2(IR) ]
• ALUOut ← A + B
• ALUOut ← A + imm(IR)
• ALUOut ← PC + 4
• MEM[ ALUOut ] ← B
• MDR ← MEM[ ALUOut ]

Can utilize each resource only once per control step (cycle)

RT Sequencing: R-Type ALU

• IF    IR ← MEM[ PC ]                                step 1
• ID    A ← RF[ rs1(IR) ]    B ← RF[ rs2(IR) ]        step 2
• EX    ALUOut ← A + B                                step 3
• MEM
• WB    RF[ rd(IR) ] ← ALUOut    PC ← PC + 4          step 4
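The "each resource only once per control step" rule can be checked mechanically. The sketch below is an assumption-laden illustration: the resource labels in `USES` (one memory, one ALU, two register-file read ports, one write port) are hypothetical and chosen here to match the four-step schedule, and `conflicts` is a made-up helper name.

```python
from collections import Counter

# Assumed resource usage of each register transfer (labels are hypothetical).
USES = {
    "IR <- MEM[PC]":        ["MEM"],
    "A <- RF[rs1(IR)]":     ["RF read port 1"],
    "B <- RF[rs2(IR)]":     ["RF read port 2"],
    "ALUOut <- A + B":      ["ALU"],
    "RF[rd(IR)] <- ALUOut": ["RF write port"],
    "PC <- PC + 4":         ["ALU"],
}

def conflicts(step):
    """Resources that the RTs grouped into one control step use more than once."""
    counts = Counter(res for rt in step for res in USES[rt])
    return sorted(res for res, n in counts.items() if n > 1)

# The four-step R-type schedule.
FOUR_STEP = [
    ["IR <- MEM[PC]"],
    ["A <- RF[rs1(IR)]", "B <- RF[rs2(IR)]"],
    ["ALUOut <- A + B"],
    ["RF[rd(IR)] <- ALUOut", "PC <- PC + 4"],
]

for i, step in enumerate(FOUR_STEP, start=1):
    print(f"step {i}: conflicts = {conflicts(step)}")

# Folding PC <- PC + 4 into the EX step would need the single ALU twice.
print(conflicts(["ALUOut <- A + B", "PC <- PC + 4"]))
```

With a single register-file port, as the slides suggest is possible, the two reads in step 2 would conflict and the schedule would grow by a step.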

RT Sequencing: LW

• IF    IR ← MEM[ PC ]
• ID    A ← RF[ rs1(IR) ]    B ← RF[ rs2(IR) ]
• EX    ALUOut ← A + imm_{I-type}(IR)
• MEM   MDR ← MEM[ ALUOut ]
• WB    RF[ rd(IR) ] ← MDR    PC ← PC + 4

if MEM[PC] == LW rd offset(base)
    EA = sign-extend(offset) + GPR[base]
    GPR[rd] ← MEM[ EA ]
    PC ← PC + 4

Combined RT Sequencing

common steps (all instructions)
  start:   IR ← MEM[ PC ]; next
           A ← RF[ rs1(IR) ]; B ← RF[ rs2(IR) ]; ALUOut ← PC + imm(IR); case opcode

opcode-dependent steps
  R-Type:  ALUOut ← A + B; next
           RF[rd(IR)] ← ALUOut; PC ← PC + 4; start
  LW:      ALUOut ← A + imm(IR); next
           MDR ← M[ALUOut]; next
           RF[rd(IR)] ← MDR; PC ← PC + 4; start
  SW:      ALUOut ← A + imm(IR); next
           M[ALUOut] ← B; PC ← PC + 4; start
  Branch:  PC ← PC + 4; cond?( A , B ); start if-not-cond, else next
           PC ← ALUOut; start
  Jump:    PC ← PC + imm(IR); start

RTs in each state correspond to some setting of the control signals

Horizontal Microcode

[Figure: microcode storage with an n-bit µPC input and a k-bit "control" output that drives the datapath control signals directly (ALUSrcA, IorD, IRWrite, PCWrite, PCWriteCond, ...). A microprogram counter, an adder (+1), and address select logic fed by the instruction register opcode field perform the sequencing. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Microcode storage: 2^n × k bits (not including sequencing)

Vertical Microcode

[Figure: microcode storage with an n-bit µPC input and an m-bit output in which each 1-bit signal means "do this RT":
  "PC ← PC+4"
  "PC ← ALUOut"
  "PC ← PC[ 31:28 ],IR[ 25:0 ],2'b00"
  "IR ← MEM[ PC ]"
  "A ← RF[ IR[ 25:21 ] ]"
  "B ← RF[ IR[ 20:16 ] ]"
  ...
A ROM with this m-bit input and a k-bit output produces the datapath control signals (ALUSrcA, IorD, IRWrite, PCWrite, PCWriteCond, ...). Sequencing uses a microprogram counter, an adder (+1), and address select logic fed by the instruction register opcode field. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Still more elaborate behaviors can be sequenced as subroutines

Programmed Implementation

[Figure: combinational control logic with a state register, taking inputs from the instruction register opcode field and from the datapath, and sending control outputs to the datapath; the datapath shows the IR fields rd, rs1, rs2, and immed. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Microcoding for CISC

• Can we extend the last slide
  – to support a new instruction?
  – to support a complex instruction, e.g. polyf?
• Yes, a very simple datapath can do very complicated things easily, but with a slowdown
  – if I can sequence an arbitrary RISC instruction then I can sequence an arbitrary "RISC program" as a µprogram sequence
  – will need some µISA state (e.g. loop counters) for more elaborate µprograms
  – more elaborate µISA features also make life easier
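To make the "complex instruction as a sequence of simple steps" idea concrete, here is a Python sketch of a polynomial-evaluation routine in the spirit of polyf. It is not VAX microcode: the Horner-style loop, the coefficient ordering (highest degree first), and every name (`poly_microroutine`, `acc`, `count`, `index`) are assumptions made for illustration. Each loop iteration stands in for one multiply-add that a simple datapath could sequence, and `count` plays the role of the loop-counter state mentioned above.

```python
def poly_microroutine(x, coeffs):
    """Evaluate coeffs[0]*x^d + ... + coeffs[d] one multiply-add at a time.

    Returns (result, steps), where steps counts the sequenced multiply-adds.
    """
    acc = 0.0             # hypothetical accumulator register
    count = len(coeffs)   # hypothetical loop counter kept by the sequencer
    index = 0             # hypothetical pointer into the coefficient table
    steps = 0
    while count > 0:
        acc = acc * x + coeffs[index]   # one simple multiply-add step
        index += 1
        count -= 1
        steps += 1
    return acc, steps

# x^2 - 3 at x = 2: one "instruction", three sequenced steps.
result, steps = poly_microroutine(2.0, [1.0, 0.0, -3.0])
print(result, steps)
```

The single architectural instruction hides a data-dependent number of steps, which is exactly the slowdown the slide trades for a simple datapath.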

Single‐ [8086 Family User’s Manual]

You get a try on HW2

High Performance CISC Today

• High-perf implementations translate CISC inst's to RISC uOPs
• Pentium-Pro decoding example:

[Figure: Pentium-Pro decode, fed by 16 bytes of instructions
  – primary decoder: decodes the 1st x86 instruction into 1~4 uOPs
  – two more decoders: decode up to 2 more simple x86 instructions that each map to 1 uOP
  – uop ROM: plays back a uOP sequence for more complicated instructions]

uOP stream executes on a RISC internal machine

Evolution of ISAs

• Why were the earlier ISAs so simple? e.g., EDSAC
  – technology
  – precedence
• Why did it get so complicated later? e.g., VAX-11
  – assembly programming
  – lack of memory size and performance
  – microprogrammed implementation
• Why did it become simple again? e.g., RISC
  – memory size and speed (!)
  – compilers
• Why is x86 still so popular?
  – technical merit vs. {SW base, psychology, deep pocket}
• Why has ARM thrived while other RISC ISAs vanished?
• Why RISC-V now?