Chapter Five

1 The Big Picture: The Performance Perspective

• Performance of a machine is determined by: – Instruction count – Clock cycle time – Clock cycles per instruction • design ( and control) will determine: – Clock cycle time – Clock cycles per instruction • Single cycle processor - one clock cycle per instruction – Advantages: Simple design, low CPI – Disadvantagggy,es: Long cycle time, which is limited by the slowest instruction. CPI

Inst. Count Cycle Time 2 How to Design a Processor: step-by-step

1. Analyze instruction set Î datapath requirements – the meaning of each instruction is given by register transfers R[rd] <– R[rs] + R[rt]; – datapath must include storage element for ISA registers – datapath must support each register transfer 2. Select set of datapath components and establish clocking methodology 3. Design datapath to meet the requirements 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer. 5. Design the control logic

3 Review: The MIPS Instruction Formats

• All MIPS instructions are 32 bits long. The three instruction formats are: 31 26 21 16 11 6 0 R-Type op rs rt rd shamt funct 6 bits5 bits 5 bits 5 bits 5 bits 6 bits 31 26 21 16 0 I-Type op rs rt immediate 6 bits5 bits 5 bits 16 bits 31 26 0 J-Type op target address 6 bits 26 bits • The different fields are: – op: operation of the instruction – rs, rt, rd: the source and destination register specifiers – shamt: shift amount – funct: selects the variant of the operation in the “op” field – address / immediate: address offset or immediate value – target address: target address of the jump instruction 4 The Processor: Datapath & Control

• Designing an implementation that includes a subset of the core MIPS instruction set: – The memory-reference instructions: lw, sw – The arithmetic-logical instructions: add, sub, and, or, slt – The cont rol fl ow i nst ructi ons: bjbeq, j • For every instruction except j, the first two steps are identical: – Send the program (PC) to the memory that contains the code and fetch the instruction from that memory – Read one or two registers, using fields of the instruction to select the regi st ers t o read . F or th e l oad word i nst ructi on, we need t o read only one register. • After these two steps, the actions required to complete the instruction depend on the instruction class

• All instructions use the ALU after reading the registers. Why?

5 Step 1a: The MIPS Subset

31 26 21 16 11 6 0 • R-Type op rs rt rd shamt funct – ADD and SUB 6 bits5 bits 5 bits 5 bits 5 bits 6 bits • addu rd, rs, rt • subu rd,,, rs, rt

31 26 21 16 0 • I-Type op rs rt immediate – OR Immediate : 6 bits5 bits 5 bits 16 bits • ori rt, rs, imm16 – LOAD and STORE • lw rt, imm16(rs) • sw rt, imm16(rs) – BRANCH: • beq rs, rt, imm16

6 Register Transfer Logic (RTL)

• RTL gives the meaning of the instructions • All instructions start by fetching the instruction

op | rs | rt | rd | shamt | funct = MEM[ PC ] op | rs | rt | Imm16 = MEM[ PC ]

inst Register Transfers addu R[rd] Å R[rs] + R[rt]; PC Å PC + 4 subu R[rd] Å R[rs] – R[rt]; PC Å PC + 4 ori R[rt] Å R[rs] | zerozero_ext(imm16); ext(imm16); PC Å PC+4PC + 4 load R[rt] Å MEM[R[rs] + sign_ext(imm16)]; PC Å PC + 4 store MEM[R[rs] + sign_ext(imm16)] Å R[rt]; PC Å PC + 4 beq if (R[rs]==R[rt]) then PC Å PC + 4 + (sign_ext(imm16)<<2) else PC Å PC + 4

7 Step 1: Requirements of the Instruction Set

• Memory – instruction & data • Registers (32 x 32) – read rs – readtd rt – write rt or rd • PC • Extender (sign extend or zero extend) • Add and sub register or extended immediate • Add 4 or s hifted ext end ed i mmedi at e t o PC

8 More Implementation Details

• Abstract / Simplified View:

• The functional units in the MIPS implementation consist of two different types of logic elements: – Elements that operate on data values (combinational) • Outputs depend only on the current inputs. – Elements that contain state (sequential) • They have internal storage. • At least two inputs (clock and data) and one output (data) •E.ggp. D flip-flop

9 Step 2: Components of the Datapath

: Does not use a clock – – MUX – ALU

• Storage Element: Register (Basic Building Blocks) – Similar to the D Flip Flop except • N-bit input and output WitWrite E nabl e • Write enable input – Write Enable: Data In Data Out • negated (0): Data Out will not change • asserted (1): Data Out will become Data In N N on the falling edge of the clock Clk

10 Storage Element:

Write • Register File consists of 32 registers: RW RA RB Enable 5 5 5 – Two 32-bit output busses: busA and busB busA – One 32-bit input : busW busW 32×32-bit 32 32 Registers busB • Register is selected by: Clk 32 – RA (number) selects the register to put on busA (data) – RB (number) selects the register to put on busB (data) – RW (number) selects the register to be written via busW (()data) when Write Enable is 1

• Clock input (CLK) – The CLK inpu t is a fac tor ONLY dur ing wr ite operati on – During read operation, behaves as a combinational logic block: • RA or RB valid => busA or busB valid after “access time.”

11 Storage Element: Idealized Memory

• Memory (idealized) Write Enable Address – One input bus: Data In 32 – One output bus: Data Out Data In DataOut 32 32 Clk • Memory word is selected by: – Address selects the word to put on Data Out – Write Enable = 1: address selects the memory word to be written via the Data In bus

• Clock input (CLK) – The CLK input is a factor ONLY during write operation – During read operation, memory behaves as a combinational logic block: • Address valid => Data Out valid after “access time.”

12 Clocking Methodology – Negative Edge Triggered

Clk StSetup HldHold StSetup HldHold Don’t Care

......

• All storage elements are clocked by the same clock edge • Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew • (CLK-to-Q + Shortest Delay Path – Clock Skew) > Hold Time

13 Combinational vs. Sequential

• Combinational logic , state • Allows a state element to be elements, and the clock read and written in the same clock cycle without causing a – Any inputs to a state race condition. elements must reach a stable value before the active clock edge causes the sate to be updated. – The time necessary for the signals to travel from SE1 to SE2 defines the clock cycle length 14 Step 3: Design Datapath

• Register Transfer Requirements Æ Datapath Design – Instruction Fetch – Decode instructions and Read Operands – Execute Operation – Write back the result

15 Datapath for Instruction Fetch

• The common RTL operations – Fetch the Instruction: mem[PC] – Update the : • Sequential Code: PC Å PC + 4 • Branch and Jump: PC Å “something else”

16 Datapath for R-type ALU operations

• Register file and ALU

17 Datapath for Add & Subtract

• R[rd] Å R[rs] op R[rt] Example: addu rd, rs, rt – Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields – ALUctr and RegWr: control logic after decoding the instruction

31 26 21 16 11 6 0 op rs rt rd shamt funct 6 bits5 bits 5 bits 5 bits 5 bits 6 bits

Rd Rs Rt RegWr ALUctr 5 5 5 3 busA Rw Ra Rb busW 32×32-bit 32 ALU Result 32 Registers 3 Clk bBbusB 2 32

18 Register-Register Timing: One complete cycle

Clk Clk-to-Q PC Old Value New Value Instruction Memory Access Time Rs, Rt, Rd, Old Value New Value Op, Func Delay through Control Logic ALUctr Old Value New Value

RegWr Old Value New Value Register File Access Time busA, B Old Value New Value ALU Delay busW Old Value New Value Rd Rs Rt ALUctr RegWr 5 5 5 busA Register Write Rw Ra Rb Occurs Here busW 32 ALU 32 32-bit Result 32 Registers 32 Clk bBbusB 32 19 Logical Operations with Immediate

• R[rt] Å R[rs] op ZeroExt[imm16] Example : ori rt, rs, imm16

31 26 21 16 0 op rs rt immediate 6 bits5 bits 5 bits 16 bits 31 16 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 immediate Rd Rt RegDst 16 bits 16 bits Mux Rs ALUctr RegWr 5 5 5 3 busA Rw Ra Rb

bWbusW ALU 32 32-bit 32 Result 32 Registers 32 Clk busB 32 Mux Zero

imm16 E 16 xt 32 ALUSrc 20 Datapath for Loads and Stores

21 Load Operations

• R[rt] Å Mem[R[rs] + SignExt[imm16] Example: lw rt, rs, imm16

31 26 21 16 0 op rs rt immediate Rd Rt 6 bits5 bits 5 bits 16 bits RegDst Mux Rs RegWr ALUctr 5 5 5 3 busA WSrcW_Src Rw Ra Rb

busW ALU 32×32-bit 32 32 Registers 32 Clk bBbusB

MemWr Mux 32 Mux

Ex WrEn Adr t ender Data In 32 imm16 32 Data 16 32 Memory Clk ALUSrc

ExtOp 22 Store Operations

• Mem[ R[rs] + SignExt[imm16] Å R[rt] ] Example: sw rt, rs, imm16 31 26 21 16 0 op rs rt immediate 6 bits5 bits 5 bits 16 bits

Rd Rt ALUctr MemWr W_Src RegDst Mux Rs Rt 3 RegWr 5 5 5 busA Rw Ra Rb busW 32 A 32×32-bit L U 32 Registers 32 Clk busB Mux Mux 32 E

xtender WrEn Adr Data In 32 32 Data imm16 32 16 Memory Clk

ExtOp ALUSrc 23 Datapath for Branch

• beq rs, rt, imm16 – mem[PC] Fetch the instruction from memory – Equal <- R[rs] == R[rt] Calculate the branch condition – if (COND eq 0) Calculate the next instruction’s address PC Å PC + 4 + ( SignExt(imm16) x 4 ) else PC Å PC + 4

31 26 21 16 0 op rs rt immediate 6 bits5 bits 5 bits 16 bits 24 Putting it All Together: A Single Cycle Datapath

25 Step 4: Given Datapath: RTL Æ Control

• Necessary Control Lines

26 Designing the Main

31-26 25-21 20-16 15-11 10-6 5-0 R op rs rtt rdd shhtamt ftfunct I op rs rt 16 bit address J op 26 bit address

• Observations – Op field is always contained in bits 31-26 – R-type, Branch, Store Æ two registers (rs and rt) needs to be read – Base register for Load, Store Æ rs (bits 25-21) – 16-bit offset for Load, Store, Branch Æ bits 15-0 – Destination register • For Load, bits 20-16 (rt) • For R-type, bits 15-11 (rd)

27 R-Type: The Add Instruction

31 26 21 16 11 6 0 op rs rt rd shamt funct 6 bits5 bits 5 bits 5 bits 5 bits 6 bits

• add rd, rs, rt

– imem[PC] Read an instruction • Fetch the instruction from the instruction memory

– R[rd] Å R[rs] + R[rt] Execute the actual operation • Read two operands from the register file • Add two operands • Store the result to the register file

– PC Å PC + 4 Set the next instruction’s address • Add the program counter and 4 • Store the resu lt t o th e program count er 28 Instruction Fetch Unit at the Beginning of Add

• Fetch the instruction from Instruction memory: – Instruction Å mem[PC]

Inst Memory Instruction<31:0> Adr 0 00 PC

Clk

29 The Single Cycle Datapath during Add

• R[rd] Å R[rs] op R[rt] Instruction<31:0> <21:25> <16:20> <11:15>

nPC_sel= +4 <0:15> Instruction Rd Rt Fetch Unit RegDst = 1 Clk 1 Mux 0 Rs Rt ALUctr = Add Rt Rs Rd Imm16 RegWrite = 1 5 5 5 MemtoReg = 0 busA Zero MemWrite = 0 Rw Ra Rb

busW ALU 32 32-bit 32 32 0 Registers busB 0 32 M M

Clk u

32 x ux 32 Exten WrEn Adr 1 1 Data In 32 Data imm16 d 32 16 er Memory Clk ALUSrc = 0

ExtOp = x

30 Instruction Fetch Unit at the End of Add

• PC Å PC + 4

Inst Memory Instruction<31:0> Adr

nPC_sel = +4

4 Add e 00 r 0 Mux PC A dder

Clk m16 mm i

31 R-type Instruction

• Control for R-type instructions

32 I-Type: The Or immediate Instruction

31 26 21 16 0 op rs rt immediate

• ori rt, rs, imm16

– imem[PC]

– R[rt] Å R[rs] | ZeroExt(imm16) • Read an operand from the register file • Or the operand and the zero extended immediate value • Store the result to the register file

– PC Å PC + 4

33 The Single Cycle Datapath for Or Immediate

• R[rt] Å R[rs] or ZeroExt(Imm16)

Instruction<31:0> <21:2 <16:2 <11:15

nPC_sel= +4 <0:15 Instruction Rd Rt Fetch Unit 5 0 > > RegDst = 0 Clk > > 1 Mux 0 Rt Rs Rd Imm16 Rs Rt ALUctr = Or RegWrite = 1 55 5 MemtoReg = 0 busA Zero MemWrite = 0 Rw Ra Rb

busW ALU 32 32-bit 32 32 0 Registers busB 0 32 M Mux Clk ux 32 32 Ext WrEn Adr 1

e 1 Data In nder 32 Data i16imm16 32 16 Memory Clk ALUSrc = 1

ExtOp = 0 34 I-Type: The Load, Store Instruction

31 26 21 16 0 op rs rt immediate

• lw rt, imm16(rs) – imem[PC] – R[rt] Å dmem[ R[rs] + SignExt(imm16) ] • Read an operand from the register file • Add the operand and the immediate value • Rea d a da tum from the da ta memory w ith the resu lt a ddress • Store the datum to the register file – PC Å PC + 4

• sw rt, imm16(rs) – dmem[ R[rs] + SignExt(imm16) ] Å R[rt] • Read two operands from the register file • Add one operand and the immediate value • Store the other operand to the data memory with the result address

35 The Single Cycle Datapath for Load

• R[rt] Å DMem[ R[rs] + SignExt[imm16] ]

Instruction<31:0> <21:25 <16:20 <11:15

nPC_sel= +4 <0:15> Instruction Rd Rt Fetch Unit >

RegDst = 0 Clk > > 1 Mux 0 ALUctr = Add Rt Rs Rd Imm16 RegWrite = 1 Rs Rt 5 5 5 MemtoReg = 1 busA Zero MemWrite = 0 Rw Ra Rb

busW ALU 32 32-bit 32 32 0 Registers busB 0 32 M M Clk ux

32 ux

Exte WrEn Adr 1 1 Data In n 32 Data imm16 der 32 32 16 Memory Clk ALUSrc = 1

EtOExtOp = 1 36 The Single Cycle Datapath for Store

• dmem[ R[rs] + SignExt(imm16) ] Å R[rt]

Instruction<31:0> <21:25 <16:20 <11:15

nPC_sel= +4 <0:15> Instruction Rd Rt Fetch Unit >

RegDst = X Clk > > 1 Mux 0 ALUctr = Add Rt Rs Rd Imm16 RegWrite = 0 Rs Rt 5 5 5 MemtoReg = X busA Zero MemWrite = 1 Rw Ra Rb

busW ALU 32 32-bit 32 32 0 Registers busB 0 32 M Mux Clk ux 32

Exte WrEn Adr 1 1 Data In n 32 Data imm16 der 32 32 16 Memory Clk ALUSrc = 1

EOExtOp = 1 37 Load Instruction

• Control for I-type Load instructions

38 I-Type: The Branch Instruction

31 26 21 16 0 op rs rt immediate

• beq rs, rt, imm16

– imem[PC] Read an instruction

– Zero Å R[rs] == R[rt] Execute the actual operation • Read an operand from the register file • Subtract one operand from the other

– Baddr Å PC + 4 + SignExt(imm16) Calculate branch address • Add PC and the immediate value

– PC Å PC + 4 or Baddr Set the next instruction’s address

39 The Single Cycle Datapath for Branch

• if (R[rs] - R[rt] == 0) then Zero Å 1 ; else Zero Å 0

Instruction< 31:0> <21:25 <16:20 <11:15

nPC_sel= “Br” <0:15 Instruction Rd Rt Fetch Unit > > RegDst = x Clk > > 1 Mux 0 ALUctr = Sub RegWr = 0 Rs Rt Rt Rs Rd Imm16 5 5 5 MemtoReg = x busA Zero MemWr = 0 Rw Ra Rb

busW ALU 32 32-bit 32 32 0 Registers busB 0 32 M Mux Clk ux 32 32 Ext WrEn Adr 1

e 1 Data In nder 32 Data i16imm16 32 16 Memory Clk ALUSrc = 0

ExtOp = x 40 Instruction Fetch Unit at the End of Branch

• if (Zero==1) then PC Å PC + 4 + SignExt(imm16)<<2; else PC Å PC + 4

Inst Memory Instruction<31:0> Adrd

nPC_sel = Br

4 Ad d er 00 0 Mux PC Adder Sign

& s Clk imm16 E h xt l

41 BEQ Instruction

• Control for I-type BEQ instructions

42 Jump Instruction

• PC Å “PC+4[31:28]”;“imm26<<2”

43 Simple Datapath with the Control Unit

Instr RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp2 R10010 0010 Load01111 0000 Store X 1 X 0 0 1 0 0 0 Beq X 0 X 0 0 0 1 0 1 44 A Summary of the Control Signals

See func 10 0000 10 0010 We Don’t Care :-) Appendix A op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010 add sub ori lw sw beq jump RegDst 1 1 0 0 x x x ALUSrc 0 0 1 1 1 0 x MemtoReg 0 0 0 1 x x x RegWrite 1 1 1 1 0 0 0 MemWrite 0 0 0 0 1 0 0 nPCsel 0 0 0 0 0 1 0 Jump 0 0 0 0 0 0 1 EOExtOp x x 0 1 1 x x ALUctr<2:0> Add Subtract Or Add Add Subtract xxx 31 26 21 16 11 6 0 R-type op rs rt rd shamt funct add, sub

I-type op rs rt immediate ori, lw, sw, beq

J-typeop target address jump 45 The Concept of Local Decoding

op 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010 R-typeorilwswbeq jjpump RegDst 1 0 0 x x x ALUSrc 0 1 1 1 0 x MemtoReg 0 0 1 x x x RegWrite 1 1 1 0 0 0 MemWrite 0 0 0 1 0 0 BhBranch 0 0 0 0 1 0 Jump 0 0 0 0 0 1 ExtOp x 0 1 1 x x ALUop “R-type” Or Add Add Subtract xxx

func ALU ALUctr op Main 6 Control 3 6 Control ALUop (Local)

N A L U

46 Opcode & ALU Operation Mapping

Name Opcode Opcode (Bin) (Hex) Op5 Op4 Op3 Op2 Op1 Op0 R-format0000000 lw35100011 sw 43 1 0 1 0 1 1 beq4000100

Op Operation ALUOp Funct ALU action ALUctr LW Load word 000 XXXXXX Add 010 SW Store word 000 XXXXXX Add 010 BEQ Branch equal 001 XXXXXX Sub 110 R-type add 100 100000 Add 010 R-type sub 100 100010 SbSub 110 R-type AND 100 100100 AND 000 R-type OR 100 100101 OR 001 R-type slt 100 101010 Slt 111 47 Control with Opcode

Input or Output Signal Name R-format lw sw beq Op5 0 1 1 0 Op4 0 0 0 0 Inputs Op3 0 0 1 0 Op20001 Op1 0 1 1 0 Op0 0 1 1 0 RegDst 1 0 X X ALUSrc 0 1 1 0 Outputs MemtoReg 0 1 X X RegWrite 1 1 0 0 MemRead 0 1 0 0 MemWrite 0 0 1 0 Branch 0 0 0 1 ALUOp3 1 0 0 0 ALUOp2 0 0 0 0 ALUOp1 0 0 0 1 48 Logic for each control signal

• nPC_sel Å if (OP == BEQ) then EQUAL else 0 • ALUsrc Å if (OP == “000000”) then “regB” else “immed” • ALUctr Å if (OP == “000000”) then funct elseif (OP == ORi) then “OR” elseif (()OP == BEQ) then “sub” else “add” • MemWr Å (OP == Store) • MemtoReg Å (OP == Load) • RegWr: Å if ((OP == Store) || (OP == BEQ)) then 0 else 1 • RegDst: Å if ((OP == Load) || (OP == ORi)) then 0 else 1

49 Control Logic

• Simple combinational logic (truth tables)

Inputs Op5 Op4 Op3 Op2 Op1 ALUOp Op0 ALU control block ALUOp0 Outputs ALUOp1 R-format Iw sw beq RegDst Operation2 F3 ALUSrc Operation MemtoReg F2 Operation1 F (5– 0) RegWrite F1 Operation0 MemRead F0 MemWrite Branch ALUOp1

ALUOpO

50 Summary

• 5 steps to design a processor – 1. Analyze instruction set => datapath requirements – 2. Select set of datapath components & establish clock methodology – 3. Design datapath meeting the requirements – 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer. – 5. Design the control logic • MIPS mak es it easi er – Instructions same size – Source registers always in same place – Immediates same size, location – Operations always on registers/immediates • Single cycle datapath Î CPI=1, CCT Î long • Next time: implementing control

51 Where we are headed

• Single Cycle Problems: – The clock cycle is equal to the worst-case delay for all instructions – Inefficient both in its performance and in its hardware cost • One Solution: – use a “smaller” cycle time – have different instructions take different numbers of cycles – a “multicyypcle” datapath:

52 What’s wrong with our CPI=1 processor?

• Long Cycle Time • All instructions take as much time as the slowest • Real memory is not so nice as our idealized memory – cannot always get the job done in one (short) cycle

Arithmetic & Logical PC Inst Memory Reg File mux ALU mux setup

Load PC Inst Memory Reg File mux ALU Data Mem mux setup Critical Path Store PC Inst Memory Reg File mux ALU Data Mem

Branch PC Inst Memory Reg File cmp mux setup

53 Reducing Cycle Time

• Cut combinational dependency graph and insert register / latch • Do same work in two fast cycles, rather than one slow one

storage element storage element

Acyclic Acyclic Combinational Combinational LiLogic Logic (A)

=> storage element

Acyclic Combinational storage element Logic (B)

storage element 54 Multicycle Approach

• We will be reusing functional units – ALU used to compute address and to increment PC – Memory used for instruction and data

• Our control signals will not be determined directly by instruction – e.g., what should the ALU do for a “subtract” instruction?

• We’ll use a finite state machine for control

• Partition data so can handle different instruction execution times.

55 Multicycle Implementation

• Break each instruction into a series of steps – Break it down into steppgs following our rule that data flows through at most one major functional unit (e.g., balance work across steps) • Each step in the execution will take 1 clock cycle – Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) • Allows a functional unit to be used more than once per instruction – RdReduce the amount of fh hard ware requi red

• Single memory unit • Single ALU • Registers after every major functional unit

56 Five Execution Steps

• Instruction Fetch (IF) • Instruction Decode (ID) and Register Fetch • EtiMExecution, Memory Add AddCress Comput ttiBhCltiation, or Branch Completion (EX) • Memory Access (MEM) or R-type instruction completion • Write-back step (WB)

57 Partitioning the CPI=1 Datapath

• Add registers between smallest steps egWr C_sel Uctr egDst USrc emRd emWr xtOp LL LL PP RR RR EE MM MM n A A h tion h eg. PC ile

Exec ss ore and cc cc cc mm tt CC RR FF ee P Fet Fet Me Next Acc Oper Instru Result S

Allow the inst ructi on t o t ak e mu ltipl e cycl es. 58 Step 1: Instruction Fetch

• Use PC to get instruction and put it in the .

• Increment the PC by 4 and put the result back in the PC.

• Can be described succinctly using RTL "Register-Transfer Language“

IR = Memory[PC]; PC = PC + 4;

Updating the PC at this stage can be done using ALU.

59 Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them

• Compute the branch address in case the instruction is a branch RTL:

A = Reg[IR[25:21]]; B = Reg[IR[20:16]]; ALUOut = PC + (sign(sign_extend(IR[15:0]) extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type (we are b usy "d ecodi ng" it i n our cont rol l ogi c)

60 Step 3: (Instruction dependent operations)

• ALU is performing one of three functions, based on instruction type

– Memory Reference: ALUOut = A + sign-extend(IR[15:0]);

– R-type: ALUOut = A op B;

– Branch: if (A==B) PC = ALUOut;

• Jump instruction

PC = PC [31:28] || (IR[25:0] << 2);

61 Step 4: (R-type or memory-access)

• Loads and stores access memory

MDR = Memory[ALUOut]; or Memory[ALUOut] = B;

• R-type instructions finish

Reg[IR[15:11]] = ALUOut;

The write actually takes place at the end of the cycle on the edge

62 Step 5: Write-back step

• Load instructions finish

Reg[IR[20:16]]= MDR;

What a bout a ll t he ot her instruct ions ?

63 Summary:

Action for R-type Action for memory-reference Action for Action for Step name instructions instructions branches jumps

IR = Memory[PC] Instruction fetch PC = PC + 4

A = Reg [IR[25:21]] Instruction decode/register B = Reg [IR[20:16]] fetch ALUOut = PC + (sign-extend (IR[15:0]) << 2)

Execution, address computation, ALUOut = A + if (A ==B) then PC = PC [31:28] ALUOut = A op B branch/jump sign_extend(IR[15:0]) PC = ALUOut II (IR[25:0]<<2) completion

Load: MDR = Memory[ALUOut] Memory access or Reg [IR[15:11]] = or R-type completion ALUOut Store: Memory [ALUOut] = B Memory read Load: Reg[IR[20:16]] = MDR completion

64 Multicycle Datapath for R-type Instruction

65 Multicycle Datapath for Load Instruction

66 Multicycle Datapath for Branch Instruction M UX

67 Multicycle Datapath with Control Lines

68 Control for Multicycle Implementation

• Two techniques – Finite State Machine – Microprogramming • Finite state machines: – a set of states and – next state function (determined by current state and the input) – output function (determined by current state and possibly input) – a Moore machine (output based only on current state) – a Mealy machine (output based on current state & inputs)

Next state Next-state Current state function

Clock Inputs

Output Outputs function 69 Control for Multicycle Implementation

70 Control for Multicycle Implementation

ItInstructi tiFthdDdon Fetch and Decode

71 Control for Multicycle Implementation

MRfItti(ld)Memory Reference Instructions (load)

72 Control for Multicycle Implementation

Memory RfReference Itti(t)Instructions (store)

73 Control for Multicycle Implementation

R-tIttitype Instructions

74 Control for Multicycle Implementation

BhIttiBranch Instructions

75 Control for Multicycle Implementation

JIttiJump Instructions

76 Implementing the Control

• Value of control signals is dependent upon: – what instruction is being executed – which step is being performed

• Use the information we’ve accumulated to specify a finite state machine – specify the finite state machine graphically, or – use microprogramming

• Implementation can be derived from specification

77 Complete Finite State Machine Control

Instruction decode/ Instruction fetch register fetch 0 MemRead 1 ALUSrcA = 0 IorD = 0 ALUSrcA = 0 Start IRWrite ALUSrcB = 11 ALUSrcB = 01 ALUOp = 00 ALUOp = 00 PCWrite

PCSource = 00 ) ') pe Q -ty E R B p = ' (O = 'J') =

') pp Memory address 'SW p = Branch (O Jump p (O computation r (O ') o Execution completion completion 'LW p = 2 (O 6 8 9 ALUSrcA = 1 ALUSrcA = 1 ALUSrcB = 00 ALUSrcA =1 PCWrite ALUSrcB = 10 ALUOp = 01 ALUSrcB = 00 PCSource = 10 ALUOp =00 ALUOp= 10 PCWriteCond PCSource = 01

(O p =

'S W ') Memory Memory Op = 'LW') (( access access R-type completion 3 5 7

RegDst = 1 MemRead MemWrite RegWrite IorD = 1 IorD = 1 MemtoReg = 0

Write-back step 4

RegDst = 0 RegWrite MemtoReg=1

Finite State Machine for Control

• Implementation:

PCWrite PCWriteCond IorD MemRead MemWrite IRWrite Control logic MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst

NS3 NS2 NS1 Inputs NS0 p5 p4 p3 p2 p1 p0 3 2 1 0 OO OO OO OO OO OO SS SS SS SS

Instruction register State register opcode field

79 Recap: PLA Implementation of the Main Control

op<5>.. op<5>.. op<5>.. op<5>.. op<5>.. op<5>.. <0> <0> <0> <0> <0> op<0>

R-type ori lw sw beq jump RegWrite

ALUSrc RegDst MemtoReg MWitMemWrite Branch Jump ExtOp ALUop<2> ALUop<1> ALUop<0> 80 PLA Implementation

• If I picked a horizontal or vertical line could you explain it?

Op5

Op4

Op3 AND Plane Op2

Op1

Op0

S3

S2 Î

S1

S0

PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 Î PCSource0 OR Plane ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 81 ROM Implementation

• ROM = "Read Only Memory" – values of memory locations are fixed ahead of time • A ROM can be used to implement a truth table – if the address is m-bits, we can address 2m entries in the ROM. – our outputs are the bits of data that the address points to.

address data 0 0 0 0 0 1 1 0 0 1 1 1 0 0 mn 0 1 0 1 1 0 0 01110000 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 11101111 1 1 0 1 1 1

misthem is the "height", and n is the "width" 82 Example: Implement an Adder with ROM

• Implementation of 2 bit adder

a1a0+b1b0 = c2c1c0 a1a0b1b0 c2c1c0

a1a0 00 + 00 = 000 0000 000 +b+ b1b0 00 + 01 = 001 0001 001 00 + 10 = 010 0010 010 c2c1c0 00 + 11 = 011 0011 011 01 + 00 = 001 0100 001 01 + 01 = 010 0101 010 01 + 10 = 011 0110 011 01 + 11 = 100 0111 100 ··· ··· ··· 11 + 00 = 011 1100 011 11 + 01 = 100 1101 100 11 + 10 = 101 1110 101 11 + 11 = 110 1111 110

83 ROM Implementation

• How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 210 = 1024 different addresses) • How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs

• ROM is 210 x 20 = 20K bits (and a rather unusual size)

• Rather wasteful, since for lots of the entries, the outputs are the same — i.e., opcode is often ignored

84 ROM vs PLA

• Break up the table into two parts — 4 state bits tell you the 16 outputs, 2 4 x 16 bits of ROM — 10 bits tell you the 4 next state bits, 210 x 4 bits of ROM — Total: 4.3K bits of ROM • PLA is much smaller — can share product terms — only need entries that produce an active output — can take into account don't cares • Size is (#inputs ∗ #product-terms) + (#outputs ∗ #product-terms) For this example = (10x17)+(20x17) = 460 PLA cells

• PLA cells usually about the size of a ROM cell (slightly bigger)

85 Microprogramming

• Control is the hard part of – Datapath is fairly regular and well-organized – Memory is highly regular – Control is irregular and global

• Microprogramming:

– A Particular Strategy for Implementing the Control Unit of a processor by "programming" at the level of register transfer operations

:

– Logical structure and functional capabilities of the hardware as seen by the microprogrammer

86 Sequencer-based control unit

• Complex instructions: the "next state" is often current state + 1

Control unit PCWrite PCWriteCond IorD MemRead PLA or ROM MemWrite IRWrite BWrite Multicycle Outputs MemtoReg PCSource Datapath ALUOp ALUSrcB ALUSrcA RegWrite RegDst Input AddrCtl 1

State Types of “branching” Adder • Set state to 0 Address select logic • Dispatch (state 1) • Use incremented state number Op[5– 0]

Instruction register opcode field

87 “Macroinstruction” Interpretation

User program plus Data Main ADD Memory SUB this can change! AND . . . one of these is DATA mapped into one of these

CPU control AND microsequence memory e.g., Fetch Calc Operand Addr FthOFetch Operand( d()s) Calculate Save Answer(s)

88 Microprogramming

Control unit PCWrite PCWriteCond IorD memory MemRead Datapath MemWrite IRWrite BWrite Outputs MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl Input 1

Microprogram counter

Adder

Address select logic Op[5– 0]

Instruction register opcode field

• What are the “microinstructions” ? 89 Variations on Microprogramming

• “Horizontal” Microcode – control field for each control point in the machine

µseq µaddr A-mux B-mux bus enables register enables • “Vertical” Microcode – compact microinstruction format for each class of microoperation – local decode to generate all control points branch: µseq-op µadd execute: ALU-op A,B,R memory:mem-opS, D

src dst other control fields next states inputs

D D E E MUX C C 90 Horizontal vs. Vertical Microprogramming

• NOTE: previous organization is not TRUE horizontal microprogramming; register decoders give flavor of encoded microoperations

• Most microprogramming-based controllers vary between: – horizontal orgg(anization (1 control bit p er control p p)oint) – vertical organization (fields encoded in the control memory and must be decoded to control something)

Horizontal Vertical

+ more control over the potential + easier to program, not very parallelism of operations in the different from programming datapath a RISC machine in assembly language - uses up lots of - extra level of decoding may slow the machine down

91 Designing a Microinstruction Set

1. Start with list of control signals

2. Group signals together that make sense (vs. random): called “fields”

3. Places fields in some logical order (e .g ., ALU operation & ALU operands first and microinstruction sequencing last)

4. CtCreate a sym blilbolic legend dfthiittif for the microinstruction format thi, showing name of field values and how they set the control signals – Use computers to design computers

5. To minimize the width, encode operations that will never be used at the same time

92 1&2) List of control signals, grouped into fields

Signal name Effect when deasserted Effect when asserted ALUSelA 1st ALU operand = PC 1st ALU operand = Reg[rs]

ol RegWrite None Reg. is written

trtr MtRMemtoReg RitdtitALUReg. write data input = ALU RitdtitReg. write data input = memory RDtRegDst Reg. dest. no. = rt Reg. dest. no. = rd TargetWrite None Target reg. = ALU

it Con MemRead None Memory at address is read

BB MWitMemWrite None MtddiittMemory at address is written IorD Memory address = PC Memory address = ALU IRWrite None IR = Memory PCWrite None PC = PCSource Single PCWr ite Con d None IF ALUzero then PC = PCSource

Signal name Value Effect ALUOp 00 ALU adds

ol 01 ALU subtracts rr 10 ALU does function code 11 ALU does logical OR ALUSelB 000 2nd ALU input = Reg[rt] it Cont 001 2nd ALU input = 4 BB 010 2nd ALU input = sign extended IR[15-0] 011 2nd ALU input = sign extended, shift left 2 IR[15-0] 100 2nd ALU input = zero extended IR[15-0]

ultiple PCSource 00 PC = ALU

M 01 PC = Target 10 PC = PC+4[29-26] : IR[25–0] << 2 93 Start with list of control signals, cont’d

• For next state function (next microinstruction address), use Sequencer-based control unit from last lecture – Called “microPC” or “µPC” vs. state register

• Siggqnal Sequencing PLA or ROM Value Effect 1 00 Next µaddress = 0 01 Next µaddress = dispatch ROM 1 State Adder 10 Next µaddress = dispatch ROM 2 Mux AddrCtl 11 Next µaddress = µaddress + 1 3210

0

Dispatch ROM 2 Dispatch ROM 1

Address select logic Op

Instruction register opcode field

94 Details

Dispatch ROM 1 Dispatch ROM 2 Op Opcode name Value Op Opcode name Value 000000 R-format 0110 100011 lw 0011 000010 jmp 1001 101011 sw 0101 000100 beq 1000 PLA or ROM 100011 lw 0010 101011 sw 0010 1

State

Adder

Mux AddrCtl 3210

0

Dispatch ROM 2 Dispatch ROM 1

Address select logic Op

Instruction register opcode field

State number Address-control action Value of AddrCtl 0 Use incremented state 3 1 Use dispatch ROM 1 1 2 Use dispatch ROM 2 2 3 Use incremented state 3 4 Replace state number by 0 0 5 Replace state number by 0 0 6 Use incremen tdted s tttate 3 7 Replace state number by 0 0 8 Replace state number by 0 0 9Replace state number by 00 95 3) Microinstruction Format

• Unencoded vs. Encoded fields

Field Name Width Control Signals Set wide narrow ALU Ctrl 3 2 ALUOp SRC121ALUSrcA SRC242ALUSrcB Reg Ctrl 32Regg,Write, MemtoReg, g,g RegDst. Memory 3 2 MemRead, MemWrite, IorD PCWrite Ctrl 3 2 PCWrite, PCWriteCond, PCSource Sequenc in g 42AddddrCt l . Total width 22 13 bits

96 4) Legend of Fields and Symbolic Names

Field name Value Signals active Comment Add ALUOp = 00 Cause the ALU to add. ALU control Subt ALUOp = 01 Cause the ALU to subtract; this implements the compare for branches. Func code ALUOp = 10 Use the instruction's function code to determine ALU control. PC ALUSrcA = = 0 0 Use the PC as as the the first first ALU ALU input input. SRC1 A ALUSrcA = 1 Register A is the first ALU input. B ALUSrcB = 00 Register B is the second ALU input. 4 ALUSrcB = 01 Use 4 as the second ALU input. SRC2 Extend ALUSrcB = 10 Use output of the sign extension unit as the second ALU input. Extshft ALUSrcB = 11 Use the out ppyut of the shift-by-two unit as the second ALU input. p Read two registers using the rs and rt fields of the IR as the register Read numbers and putting the data into registers A and B. RegWrite, Write a register using the rd field of the IR as the register number and Write ALU RegDst = 1, the contents of the ALUOut as the data. Register control MemtoReg = 0 RWitRegWrite, Writ e a regi ist ter using i th the rt tfi field ld of fth the IR IR as th the regis iter t number b and d Write MDR RegDst = 0, the contents of the MDR as the data. MemtoReg = 1 MemRead, Read memory using the PC as address; write result into IR (and the MDR) Read PC lorD = 0 MemRead, Read memory using using the the ALUOut ALUOut as as address; address; write write result result into into MDR MDR. Memory Read ALU lorD = 1 MemWrite, Write memory using the ALUOut as address, contents of B as the data. Write ALU lorD = 1 PCSource = 00 Write the output of the ALU into the PC. ALU PCWrite PCSource = 01, If the Zero output of the ALU is active, write the PC with the contents PC write control ALUOut-cond PCWriteCond of the register ALUOut. PCSource = 10, Write the PC with the jump address from the instruction. jump address PCWrite Seq AddrCtl = 11 Choose the next microinstruction sequentially. FthFetch AddrCtl = 00 00 GtthfitGo to the first mi icroins it tructi titbon to begin i a new ins itruc ttion. ti Sequencing Dispatch 1 AddrCtl = 01 Dispatch using the ROM 1. Dispatch 2 AddrCtl = 10 Dispatch using the ROM 2. 97 Microprogramming

• A specification methodology – appropriate if hundreds of opcodes , modes , cycles , etc . – signals specified symbolically using microinstructions

ALU Register PCWrite Label control SRC1 SRC2 control Memory control Sequencing Fetch Add PC 4 Read PC ALU Seq Add PC Extshft Read Dispatch 1 M1Mem1 Add A EtExtend Dispa tc h 2 LW2 Read ALU Seq Write MDR Fetch SW2 Write ALU Fetch Rformat1 Func code A B Seq Write ALU Fetch BEQ1 Subt A B ALUOut-cond Fetch JUMP1 Jump address Fetch • Will two implementations of the same architecture have the same microcode? • What would a do?

98 5) Maximally vs. Minimally Encoded

• No encoding: – 1 bit for each datapath operation – faster, requires more memory (logic) – used for Vax 780 — an astonishinggy 400K of memory! • Lots of encoding: – send the microinstructions through logic to get control signals – uses less memory, slower • Historical context of CISC: – Too much logic to put on a single chip with everything else – Use a ROM (or even RAM) to hold the microcode – It’s easy to add new instructions

99 Legacy Software and Microprogramming

• IBM bet company on 360 Instruction Set Architecture (ISA): single instruction set for many classes of machines – (8-bit to 64-bit) • Stewart Tucker stuck with job of what to do about software compatability • If microprogramming could easily do same instruction set on many different , then why couldn’t multiple microprograms do multiple instruction sets on the same microarchitecture? • Coined term “emulation”: instruction set interpreter in microcode for non-native instruction set • Very successful: in early years of IBM 360 it was hard to know whether old instruction set or new instruction set was more frequently used

100 Microprogramming Pros and Cons

• Ease of design • Flexibility – Easy to adapt to changes in organization, timing, technology – Can make changes late in design cycle, or even in the field • Can implement very powerful instruction sets (just more control memory) • Generality – CilCan implement multi ltilittitple instruction sets on same machi ne. – Can tailor instruction set to application. • Compatibility • Many organizations, same instruction set • Costly to implement • Slow

101 Microcode: Trade-offs

• Distinction between specification and implementation is often blurred

• Specification Advantages: – Easy to design and write – Design architecture and microcode in parallel • Implementation (off-chip ROM) Advantages – Easy to change since values are in memory – Can emulate other architectures – Can make use of internal registers • Implementation Disadvantages, SLOWER now that: – Control is implemented on same chip as processor – ROM is no longer faster than RAM – No need to go back and make changes

102 Historical Perspective

• In the ‘60s and ‘70s microprogramming was very important for implementing machines • This led to more sophisticated ISAs and the VAX • In the ‘80s RISC processors based on pipelining became popular • Pipelining the microinstructions is also possible! • Implementations of IA-32 architecture processors since 486 use: – “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA / random logic) – “microcoded control” for more complex instructions (large numbers of cycles, central control store)

• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

103 Overview of Control

• The control can then be implemented with one of several methods using a structured logic technique.

Initial Representation Finite State Diagram Microprogram

Sequencing Control Explicit Next State Microprogram Counter Function + Dispatch ROMs

Logic Representation Logic Equations Truth Tables

Implementation PLA ROM Technique

“hardwired control ” “microprogrammed control”

104 Summary

• If we understand the instructions… – We can build a simple processor!

• If instructions take different amounts of time, multi-cycle is better

• Datapath implemented using: – Combinational logic for arithmetic – State holding elements to remember bits

• Control imppglemented using: – Combinational logic for single-cycle implementation – Finite state machine for multi-cycle implementation

105 An Alternative DataPath: “Princeton” Organization

A-Bus

B Bus Reg A next P inst IR S mem File PC C mem ZX SX B W-Bus

• In each clock cycle, each Bus can be used to transfer from one source • µ-instruction can simply contain B-Bus and W-Dst fields • Single memory for instruction and data access

• In this case our state diagram does not change – several additional control signals – must ensure each bus is only driven by one source on each cycle

106 What about a 2-Bus Microarchitecture (datapath)?

• 2-Bus Microarchitecture

Instruction Fetch A- BusB Reg A Bus next P IR S Mem File M PC C ZXSX B

Decode / Operand Fetch

Reg A next P IR S Mem File M PC C ZXSX B

107 Operations of 2-Bus Microarchitecture

Execute

Reg A next P IR S Mem File M PC C ZXSX B Mem

addr Reg A next P IR S Mem File M PC C ZXSX B

Write-back

Reg A next P IR S Mem File M PC C ZXSX B 108 Exceptions

System user program Exception Exception: Handler

return from exception

normal control flow: sequential, jumps, branches, calls, returns

• Exception = Unprogrammed control transfer – system takes action to handle the exception •must record t he a ddress o f t he o ffen ding instruct ion – returns control to user – must save & restore user state • Allows constuction of a “user virtual machine” 109 What happens to Instruction with Exception?

• MIPS architecture defines the instruction as having no effect if the instruction causes an exception.

• When get to we will see that certain classes of exceptions must prevent the instruction from changgging the machine state.

• This aspect of handling exceptions becomes complex and potentially limits performance Î whyitishardwhy it is hard

110 Two Types of Exceptions

• Interrupts – caused by external events – asynchronous to program execution – may be handled between instructions – simply suspend and resume user program

• Traps – caused by internal events • exceptional conditions (overflow) • errors (parity) • faults (non-resident page) – synchronous to program execution – condition must be remedied by the handler – instruction may be retried or simulated and program continued or program may be aborted

111 MIPS convention:

• exception means any unexpected change in control flow, without distinguishing internal or external; use the term interrupt only when the eventit is ext ernall y caused .

Type of event where? MIPS term. I/O device request External Interrupt Invoke OS from user program Internal Exception Arithmetic overflow Internal Exception Using an undefined instruction Internal Exception Hardware malfunctions Either Exception or Interrupt

112 Additions to MIPS ISA to support Exceptions?

• EPC–a 32-bit register used to hold the address of the affected instruction. • Cause–a register used to record the cause of the exception. To simplify the discussion, assume – undefined instruction=0 – arithmetic overflow= 1 • Status - interrupt mask and enable bits and determines what exceptions can occur.

• Control signals to write EPC , Cause, and Status • Be able to write exception address into PC, increase mux set PC to

exception add ress (C000 0000hex). • May have to undo PC = PC + 4, since want EPC to point to offending instruction (not its successor); PC = PC - 4

113 Details of the

15 8 5 4 3 2 1 0

Status Mask k e k e k e old prev current

• Mask = 1 bit for each of 5 hardware and 3 software interrupt levels – 1 => enables interrupts – 0 => disables interrupts • k = kernel/user – 0 => was in the kernel when interrupt occurred – 1 => was running user mode • e = interrupt enable – 0 => interrupts were disabled – 1 => interrupts were enabled • When interrupt occurs, 6 LSB shifted left 2 bits, setting 2 LSB to 0 – run in kernel mode with interrupts disabled 114 Big Picture: user / system modes

• By providing two modes of execution (user/system) it is possible for the computer to manage itself – operating system is a special program that runs in the privileged mode and has access to all of the resources of the computer – presents “virtual resources” to each user that are more convenient that the physical resources • files vs. disk sectors • virtual memoryypy vs. physical memor y – protects each user program from others • Exceptions allow the system to take action in response to events that occur while user program is executing – O/S begins at the handler

115 Details of the Cause register

15 10 52 Status Pending Code

• Pending Interrupt: – 5 hardware levels: bit set if interrupt occurs but not yet serviced – handles cases when more than one interrupt occurs at same time, or while records interrupt requests when interrupts disabled • Exception Code: encodes reasons for interrupt – 0 (INT) => external interrupt – 4 (ADDRL) => address error exception (load or instr fetch) – 5 (ADDRS) => address error exception (store) – 6 (IBUS) => bus error on instruction fetch – 7 (DBUS) => bus error on data fetch – 8 (Syscall) => Syscall exception – 9 (BKPT) => Breakpoint exception – 10 (RI) => Reserved Instruction exception – 12 (OVF) => A rith meti c overfl ow excepti on 116 How Control Detects Exceptions in our FSD

• Undefined Instruction – detected when no next state is defined from state 1 for the op value. – We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12. – Shown syyygmbolically using “other” to indicate that the op field does not match any of the opcodes that label arcs out of state 1. • Arithmetic overflow – included logic in the ALU to detect overflow, and a siggppgnal called Overflow is provided as an output from the ALU. This signal is used in the modified finite state machine to specify an additional possible next state

• Note: Challenge in designing control of a real machine is to handle different interactions between instructions and other exception-causing events such that control logic remains small and fast. – Complex interactions makes the control unit the most challenging aspect of hardware design

117 Modification to the Control Specification

• Operations for an Exception – EPC Å PC – 4 – PC Å exp_addr – cause Å 0 (UI) or 1 (Ovf)

118