Outline

• Combinational & sequential logic
• Single-cycle CPU
• Multi-cycle CPU

Combinational Element

[Figure: n-bit input → combinational logic → n-bit output]

• Output determined entirely by input
• Contains no storage element

Examples of Combinational Elements

• Multiplexor selects one out of 2^n inputs
• ALU performs arithmetic & logic operations
  – AND: 000
  – OR: 001
  – add: 010
  – subtract: 110
  – set on less than: 111
  – other 3 combinations unused

[Figure: MUX; ALU with two 64-bit inputs, a 3-bit control input, a zero output and a 64-bit result]

(Sequential) State Element

[Figure: state element with input, output, write and clock leads]

• State element has storage (i.e., memory)
• State defined by storage content
• Output depends on input and the storage state
• Write lead controls storage update
• Clock lead determines time of update
• Examples: main memory, registers, PC
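The ALU control codes listed above can be exercised with a small behavioral model. This is only a sketch: the function name `alu`, the helper `to_signed`, and the choice to treat "set on less than" as a signed comparison are assumptions made for illustration, not details given on the slide.

```python
MASK64 = (1 << 64) - 1  # the slide's ALU is 64 bits wide


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def alu(control, a, b):
    """Return (result, zero) for the 3-bit control codes on the slide."""
    a &= MASK64
    b &= MASK64
    if control == 0b000:      # AND
        result = a & b
    elif control == 0b001:    # OR
        result = a | b
    elif control == 0b010:    # add
        result = (a + b) & MASK64
    elif control == 0b110:    # subtract
        result = (a - b) & MASK64
    elif control == 0b111:    # set on less than (signed compare assumed)
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:                     # the other 3 combinations are unused
        raise ValueError("unused ALU control code")
    return result, result == 0
```

Subtracting equal operands raises the zero output, which is the signal the branch logic later relies on.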

Clocking Methodology

• Needed to prevent simultaneous read/write to state elements
• Edge-triggered methodology: state elements updated at rising clock edge

[Figure: state element 1 → combinational logic → state element 2, with a clock input]

Input/Output of Elements

• Combinational elements take input from one state element at clock edge and output to another state element at the next clock edge
• Within a clock cycle, state elements are not updated and their stable state is available as input to combinational elements
• Output can be derived from a state element at the edge of one cycle and input into the same state element at the next

[Figure: state element 1 → combinational logic → state element 2; and a single state element feeding combinational logic whose output returns to the same element]

Register File

• Register file is the structure that contains the processor's 32 registers
• Any register can be accessed for read or write by specifying the register number
• Register File's I/O structure
  – 3 inputs derived from current instruction to specify register operands (2 for read and 1 for write)
  – 1 input to write data into a register
  – 2 outputs carrying contents of the specified registers
• Register file's outputs are always available on the output lines
• Register write is controlled by RegWrite lead

[Figure: Registers block with 5-bit inputs Read reg 1, Read reg 2 and Write reg, a 64-bit Write data input, 64-bit outputs Read data 1 and Read data 2, and the RegWrite control]

MIPS64 Instruction Formats

I-Type: opcode (6) | rs (5) | rt (5) | immediate (16)
R-Type: opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
J-Type: opcode (6) | offset added to PC (26)

Note the regularity of instruction encoding. This is important for implementing an efficient pipelined CPU.
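The register file interface described above (two read ports that are always available, one write port gated by RegWrite) can be sketched behaviorally. The class name `RegisterFile` and its method names are hypothetical, chosen only for this illustration.

```python
MASK64 = (1 << 64) - 1


class RegisterFile:
    """Behavioral sketch: 32 registers, 2 read ports, 1 write port."""

    def __init__(self, count=32):
        self.regs = [0] * count

    def read(self, read_reg1, read_reg2):
        # Outputs are always available; no control signal is needed to read.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write_reg, write_data, reg_write):
        # The write happens only at the clock edge and only if RegWrite is asserted.
        if reg_write:
            self.regs[write_reg] = write_data & MASK64


rf = RegisterFile()
rf.clock_edge(5, 42, reg_write=False)   # ignored: RegWrite not asserted
before = rf.read(5, 0)
rf.clock_edge(5, 42, reg_write=True)    # register 5 now holds 42
after = rf.read(5, 5)
```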

Common Steps in Instruction Execution

• Execution of all instructions requires the following steps
  – send PC to memory and fetch instruction stored at location specified by PC
  – read 0-2 registers, using fields specifying the registers in the instruction
• All instructions use ALU functionality
  – data transfer instructions: compute address
  – ALU instructions: execute ALU operations
  – branch instructions: comparison & address computation

Differences in Instruction Execution

• Data transfer (strictly load/store ISA)
  – load: access memory for read data {ld R1, 0(R2)}
  – store: access memory for write data {sd R1, 0(R2)}
• ALU instruction
  – no memory access for operands
  – access a register for write of result {add R1, R2, R3}
• Branch instruction
  – change PC content based on comparison {bnez R1, Loop}

Summary

Instruction          | Fetch | Decode | Read Registers | Compute | Access Memory | Write Registers
add/sub              | X     | X      | X              | X       |               | X
load                 | X     | X      | X              | X       | X             | X
store                | X     | X      | X              | X       | X             |
conditional branch   | X     | X      | X              | X       |               |
unconditional branch | X     | X      |                | X       |               |

Data Path & Control path

• Datapath is the signal path through which data in the CPU flows, including the functional elements
• Elements of Datapath
  – combinational elements
  – state (sequential) elements
• Control path
  – the signal path from the controller to the Datapath elements
  – exercises timing & control over Datapath elements

What Should be in the Datapath

• At a minimum we need combinational and sequential logic elements in the datapath to support the following functions
  – fetch instructions and data from memory
  – read registers
  – decode instructions and dispatch them to the [...]
  – execute arithmetic & logic operations
  – update state elements (registers and memory)

Datapath Schematic

[Figure: PC → instruction memory (address in, instruction out) → registers (three register-number inputs, data) → ALU → data memory (address, data). One element is annotated "What is this for?"]

Datapath Building Blocks: Instruction Access

• Program Counter (PC)
  – a register that points to the next instruction to be fetched
  – it is incremented each clock cycle
• Content of PC is input to Instruction Memory
• The instruction is fetched and supplied to upstream datapath elements
• Adder is used to increment PC by 4 in preparation for the next instruction (why 4?)
• Adder: an ALU with control input hardwired to perform add instruction only
• For reasons that will become clear later, we assume separate memory units for instructions & data

[Figure: PC drives the read address of the Instruction Memory (32-bit instruction out) and an Adder with constant 4. The PC is annotated "How wide is this in MIPS 64?"]

Datapath Building Blocks: R-Type Instruction

R-Type Format: opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)

• Used for arithmetic & logic operations
• Read two registers, rs and rt
• ALU operates on registers' content
• Write result to register rd
• Example: add R1, R2, R3
  – rs=R2, rt=R3, rd=R1
• Controls
  – RegWrite is asserted to enable write at clock edge
  – ALUop to control operation

[Figure: instruction fields select Read reg 1, Read reg 2 and Write reg of the Register File; Read data 1 and Read data 2 feed the ALU (ALUop, zero output); the ALU result returns to Write data, gated by RegWrite]

I-Type Instruction: load/store

I-Type: opcode (6) | rs (5) | rt (5) | immediate (16)

• rs contains the base field for the displacement address mode
• rt specifies register
  – to load from memory for load
  – to write to memory for store
• Immediate contains address offset
• To compute memory address, we must
  – sign-extend the 16-bit immediate to 64 bits
  – add it to the base in rs

Examples:
  LW R2, 232(R1)
  SW R5, -88(R4)

[Figure: 16-bit immediate → sign extend → 64 bits]

Required Datapath Elements for load/store

• Register file
  – load: registers to read for base address & to write for data
  – store: registers to read for base address & for data
• Sign extender
  – to sign-extend and condition immediate field for 2's complement addition of address offset using 64-bit ALU
• ALU
  – to add base address and sign-extended immediate field
• Data memory to load/store data:
  – memory address; data input for store; data output for load
  – control inputs: MemRead, MemWrite, clock
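The address computation for load/store (sign-extend the 16-bit immediate to 64 bits, then add it to the base in rs) can be sketched as follows. The function names and the base value 1000 are illustrative assumptions; the offsets 232 and -88 come from the slide's LW and SW examples.

```python
MASK64 = (1 << 64) - 1


def sign_extend_16_to_64(imm16):
    """Copy bit 15 of the immediate into bits 16..63."""
    imm16 &= 0xFFFF
    if imm16 & 0x8000:
        return imm16 | (MASK64 ^ 0xFFFF)
    return imm16


def effective_address(base, imm16):
    """base (contents of rs) plus the sign-extended offset, modulo 2**64."""
    return (base + sign_extend_16_to_64(imm16)) & MASK64


# Assumed base value of 1000 in the base register, for illustration only.
lw_address = effective_address(1000, 232)            # LW R2, 232(R1)
sw_address = effective_address(1000, -88 & 0xFFFF)   # SW R5, -88(R4)
```

Because the extended offset is a 64-bit two's complement value, a plain 64-bit add handles negative displacements without any special case.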

Datapath Building Blocks: load/store

I-Type: opcode (6) | rs (5) | rt (5) | immediate (16)

[Figure: instruction fields select Read reg 1, Read reg 2 and Write reg of the Registers block (RegWrite); Read data 1 and the sign-extended immediate (16 → 64) feed the ALU (ALUop, zero); the ALU result is the Address of the Data Memory (MemRead, MemWrite); Read data 2 is the memory Write data; memory Read data returns to the register Write data]

I-Type Instruction: bne

I-Type: opcode (6) | rs (5) | rt (5) | immediate (16)

Example: bne R1, R2, Imm

• Branch datapath must compute branch condition & branch address
• rs and rt refer to registers to be compared for branch condition
• if Reg[rs] != Reg[rt],
  – PC = PC + Imm << 2 (note that at this point PC is already incremented. In effect PC_current = (PC_previous + 4) + Imm << 2)
• else if Reg[rs] == Reg[rt]
  – PC remains unchanged: PC_current = (PC_previous + 4)
  – the next sequential instruction is taken
• Required functional elements
  – RegFile, sign extender, adder, shifter

Sign Extend & Shift Operations

• Sign extension is required because
  – 16-bit offset must be expanded to 64 bits in order to be used in the 64-bit adder
  – we are using 2's complement arithmetic
• Shift by 2 is required because
  – instructions are 32 bits wide and are aligned on a word (4 bytes) boundary
  – in effect we are using an 18-bit offset instead of 16

Example: 0xb123 (-20189) → sign extend → 0xffffb123 (-20189) → shift left 2 → 0xfffec48c (-80756)

Datapath Building Blocks: bne

I-Type: opcode (6) | rs (5) | rt (5) | immediate (16)

[Figure: Read data 1 and Read data 2 from the Registers feed the ALU with ALUop = subtract; its zero output goes to the branch control logic. The 16-bit immediate is sign-extended to 64 bits, shifted left 2, and added to PC+4 (from the instruction datapath) in a separate Adder to form the branch target]
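The sign-extend and shift steps, and the resulting choice of next PC for bne, can be checked with a short sketch. The function names and the PC value 0x1000 are assumptions for illustration; the numeric check reuses the slide's 0xb123 example.

```python
MASK64 = (1 << 64) - 1


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def sign_extend_16(imm16):
    imm16 &= 0xFFFF
    return imm16 | (MASK64 ^ 0xFFFF) if imm16 & 0x8000 else imm16


def branch_offset(imm16):
    """Sign-extend, then shift left 2 (instructions are word aligned)."""
    return (sign_extend_16(imm16) << 2) & MASK64


def next_pc_bne(pc, rs_value, rt_value, imm16):
    """PC is incremented first; the offset is added only if the branch is taken."""
    pc_plus_4 = (pc + 4) & MASK64
    if rs_value != rt_value:
        return (pc_plus_4 + branch_offset(imm16)) & MASK64
    return pc_plus_4


extended = sign_extend_16(0xB123)
shifted = branch_offset(0xB123)
taken = next_pc_bne(0x1000, 1, 2, 3)       # registers differ: branch taken
not_taken = next_pc_bne(0x1000, 1, 1, 3)   # registers equal: fall through
```

The low 32 bits of the two intermediate values match the 0xffffb123 and 0xfffec48c shown in the slide's example.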

Computing Address & Branch Condition

• The register operands of bne are compared in the same ALU we use for load/store/arithmetic/logic instructions
  – the ALU provides a ZERO output signal to indicate condition
  – the ZERO signal controls what instruction will be fetched next depending on whether the branch is taken or not
• We also need to compute the address
  – we may not be able to use the ALU if it is being used to compute the branch condition (more on this later)
  – need an additional ADDER (an ALU hardwired to add only) to compute branch address

Putting it All Together

• Combine datapath building blocks to build the full datapath
  – now we must decide some specifics of implementation
• Single-cycle CPU
  – each instruction executes in one clock cycle
  – CPI=1 for all instructions
• Multi-cycle CPU
  – instructions execute in multiples of a shorter clock cycle
  – different instructions have different CPI

Single-Cycle CPU

• One clock cycle for all instructions
• No datapath resource can be used more than once per clock cycle
  – results in resource duplication for elements that must be used more than once
  – examples: separate memory units for instruction and data; two ALUs for conditional branches
• Some datapath elements may be shared through multiplexing as long as they are used once

The Processor: Datapath & Control

• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
  – memory-reference instructions: lw, sw
  – arithmetic-logical instructions: add, sub, and, or, slt
  – control flow instructions: beq, j
• Generic Implementation:
  – use the program counter (PC) to supply instruction address
  – get the instruction from memory
  – read registers
  – use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
  Why? memory-reference? arithmetic? control flow?

MIPS Fetch-Execute Processor Architecture

[Figure, repeated on each slide of this sequence: Program Counter (PC), Program Memory, Instruction Register, Control Logic, ALU, and Data Memory (Register File) with address inputs Rs, Rt and Rdest, a Data In port, and a constant 4 for the PC increment]

The slides step through one instruction in this order:

• Initialize Program Counter (PC) → first instruction
• Activate Control Logic
• Route Address to Program Memory
• Route Instruction to Instruction Register (IR)
• Select Appropriate Data From Register File
• Route Data to the ALU
• Do the Computation
• Store the Result
• Increment PC → Point to Next Instruction
• Execute Next Instruction
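The fetch-execute sequence above can be sketched as a loop. Everything concrete here is an assumption for illustration: the program is a list of already-decoded `(op, rdest, rs, rt)` tuples rather than encoded MIPS words, only four ALU operations are modeled, and there are no loads, stores, or branches.

```python
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def run(program, regs, max_steps=100):
    """Execute (op, rdest, rs, rt) tuples, one per 4-byte instruction slot."""
    pc = 0                                  # initialize PC -> first instruction
    steps = 0
    while pc // 4 < len(program) and steps < max_steps:
        ir = program[pc // 4]               # route address to program memory, result to IR
        op, rdest, rs, rt = ir              # control logic inspects the instruction
        a, b = regs[rs], regs[rt]           # select data from the register file
        result = ALU_OPS[op](a, b)          # route data to the ALU and compute
        regs[rdest] = result                # store the result
        pc += 4                             # increment PC -> point to next instruction
        steps += 1
    return regs


registers = run([("add", 3, 1, 2), ("sub", 0, 3, 1)], [0, 5, 7, 0])
```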

State Elements

• Unclocked vs. Clocked
• Clocks used in synchronous logic
  – when should an element that contains state be updated?

[Figure: clock waveform labelled with falling edge, rising edge, clock period and cycle time]

An unclocked state element

• The set-reset latch
  – output depends on present inputs and also on past inputs

[Figure: set-reset latch with inputs R and S and outputs Q and its complement]
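The claim that the set-reset latch output depends on past as well as present inputs can be seen in a small simulation. This sketch assumes the common cross-coupled NOR construction, which the surviving slide text does not spell out; the function names are hypothetical.

```python
def nor(a, b):
    return 0 if (a or b) else 1


def sr_latch(s, r, q, q_bar):
    """Settle a cross-coupled NOR latch from its previous outputs (q, q_bar)."""
    for _ in range(4):                 # a few passes are enough to settle
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


state = (0, 1)                         # start with Q = 0
after_set = sr_latch(1, 0, *state)     # assert S
held_set = sr_latch(0, 0, *after_set)  # S = R = 0: remembers the set
after_reset = sr_latch(0, 1, *held_set)
held_reset = sr_latch(0, 0, *after_reset)
```

With S = R = 0 the same inputs produce different outputs depending on what came before, which is exactly what makes this a state element.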

Latches and Flip-flops

• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
  – Latches: output changes whenever the inputs change, and the clock is asserted (level-triggered methodology)
  – Flip-flops: state changes only on a clock edge (edge-triggered methodology)
• A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

D-latch

• Two inputs:
  – the data value to be stored (D)
  – the clock signal (C) indicating when to read & store D
• Two outputs:
  – the value of the internal state (Q) and its complement

[Figure: D latch gate diagram with inputs C and D and outputs Q and its complement; timing diagram of D, C and Q]

D flip-flop

• Output changes only on the clock edge

[Figure: two D latches in series driven by the clock C; timing diagram of D, C and Q]

Our Implementation

• An edge triggered methodology
• Typical execution:
  – read contents of some state elements
  – send values through some combinational logic
  – write results to one or more state elements

[Figure: state element 1 → combinational logic → state element 2 within one clock cycle]
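The difference between a level-triggered latch and an edge-triggered flip-flop shows up when both are driven by the same waveform. This is a behavioral sketch only: the class names are hypothetical, and the flip-flop is assumed to sample on the rising edge, following the earlier clocking-methodology slide.

```python
class DLatch:
    """Level-triggered: Q follows D for as long as C is asserted."""

    def __init__(self):
        self.q = 0

    def update(self, c, d):
        if c:
            self.q = d
        return self.q


class DFlipFlop:
    """Edge-triggered: Q takes D only when C goes from 0 to 1 (assumed rising edge)."""

    def __init__(self):
        self.q = 0
        self._last_c = 0

    def update(self, c, d):
        if c and not self._last_c:
            self.q = d
        self._last_c = c
        return self.q


def trace(element, waveform):
    return [element.update(c, d) for c, d in waveform]


# (C, D) samples: D drops to 0 while C is still high at the third sample.
WAVEFORM = [(0, 1), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
latch_q = trace(DLatch(), WAVEFORM)
flop_q = trace(DFlipFlop(), WAVEFORM)
```

At the third sample the latch output follows D down to 0 while the flip-flop keeps the value captured at the edge.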

Register File

• Built using D flip-flops

[Figure: read ports of the register file. Read register number 1 and Read register number 2 each drive a multiplexor that selects among Register 0, Register 1, …, Register n – 2, Register n – 1 to produce Read data 1 and Read data 2. Block symbol: Register file with Read register number 1, Read register number 2, Write register, Write data, Write, Read data 1, Read data 2]

Do you understand? What is the "Mux" above?

Abstraction

• Make sure you understand the abstractions!
• Sometimes it is easy to think you do, when you don't . . .

[Figure: a 32-bit multiplexor (Select, A, B, C) shown as 32 one-bit multiplexors for bits A31/B31/C31, A30/B30/C30, …, A0/B0/C0 sharing one Select line]

Register File

• Note: we still use the real clock to determine when to write

[Figure: write port of the register file. An n-to-2^n decoder on the Register number, combined with Write and the clock, drives the C input of Register 0, Register 1, …, Register n – 2, Register n – 1; Register data drives every D input]

Simple Implementation

• Include the functional units we need for each instruction

[Figure: a. Instruction memory (Instruction address → Instruction); b. Program counter (PC); c. Adder (Add, Sum)]
[Figure: a. Registers (5-bit Read register 1, Read register 2, Write register; Write data; Read data 1, Read data 2; RegWrite); b. ALU (4-bit ALU operation, Zero, ALU result)]
[Figure: a. Data memory unit (Address, Write data, Read data, MemWrite, MemRead); b. Sign-extension unit (16 → 32)]

Building the Datapath

• Use multiplexors to stitch them together

[Figure: single-cycle datapath. PC, instruction memory, registers, sign extend (16 → 32), shift left 2, ALU, two adders, data memory, and multiplexors controlled by PCSrc, ALUSrc and MemtoReg; other controls: RegWrite, 4-bit ALU operation, MemRead, MemWrite]

Control

• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example: add $8, $17, $18
  Instruction Format:
  000000 | 10001 | 10010 | 01000 | 00000 | 100000
  op     | rs    | rt    | rd    | shamt | funct
• ALU's operation based on instruction type and function code
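Since the control information comes from the 32 bits of the instruction, the first step is pulling the fields apart. A minimal sketch, using the slide's `add $8, $17, $18` bit pattern; the function name `decode_r_type` is hypothetical.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $8, $17, $18 as given on the slide, field by field.
word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)
fields = decode_r_type(word)
```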

Control

• e.g., what should the ALU do with this instruction
• Example: lw $1, 100($2)
  35 | 2  | 1  | 100
  op | rs | rt | 16 bit offset
• ALU control input
  0000 AND
  0001 OR
  0010 add
  0110 subtract
  0111 set-on-less-than
  1100 NOR
• Why is the code for subtract 0110 and not 0011?

Control

• Must describe hardware to compute 4-bit ALU control input
  – given instruction type (ALUOp, computed from instruction type)
    00 = lw, sw
    01 = beq
    10 = arithmetic
  – function code for arithmetic
• Describe it using a truth table (can turn into gates):

[Table: ALU control truth table]
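The two-level scheme above (ALUOp from the instruction type, plus the function code for arithmetic) can be written as a lookup. The ALUOp encodings and 4-bit control values are taken from the slides; the funct values are the standard MIPS encodings for add, sub, and, or and slt, which the slides do not list, and the function name is hypothetical.

```python
# funct field (bits 5-0) -> 4-bit ALU control input
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op, funct=0):
    """Return the 4-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw, sw: add base and offset
        return 0b0010
    if alu_op == 0b01:          # beq: subtract to compare
        return 0b0110
    if alu_op == 0b10:          # arithmetic: decided by the function code
        return FUNCT_TO_CONTROL[funct & 0x3F]
    raise ValueError("unused ALUOp encoding")
```

For lw/sw and beq the funct bits are don't cares, which is why the function ignores them in those cases.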

R-type Instruction

[Figure: single-cycle datapath with control. Instruction [31–26] feeds the Control unit, which produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instruction [25–21] and [20–16] select the read registers; a multiplexor chooses Instruction [20–16] or [15–11] as the write register; Instruction [15–0] is sign-extended (16 → 32); Instruction [5–0] feeds the ALU control]

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0
lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0
sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0
beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1

Load Instruction

[Figure]

Branch on Equal

[Figure]

Control

• Simple combinational logic (truth tables)

[Figure: ALU control block with inputs ALUOp (ALUOp0, ALUOp1) and F (5–0) (F3, F2, F1, F0) and outputs Operation2, Operation1, Operation0]
[Figure: main control logic with inputs Op5–Op0, product terms for R-format, lw, sw and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

Our Simple Control Structure

• All of the logic is combinational
• We wait for everything to settle down, and the right thing to be done
  – ALU might not produce "right answer" right away
  – we use write signals along with clock to determine when to write
• Cycle time determined by length of the longest path

[Figure: state element 1 → combinational logic → state element 2 within one clock cycle]

We are ignoring some details like setup and hold times

Single Cycle Implementation

• Calculate cycle time assuming negligible delays except:
  – memory (200ps), ALU and adders (100ps), register file access (50ps)

[Figure: single-cycle datapath with PCSrc, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite and the 4-bit ALU operation]

Where we are headed

• Single Cycle Problems:
  – what if we had a more complicated instruction like floating point?
  – wasteful of area
• One Solution:
  – use a "smaller" cycle time
  – have different instructions take different numbers of cycles
  – a "multicycle" datapath:

[Figure: multicycle datapath. PC, a single Memory for instructions or data, Instruction register, Memory data register, Registers, A and B registers, ALU, ALUOut]
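The cycle-time exercise for the single-cycle design amounts to summing the delays along each instruction's path and taking the longest. The delays are the ones quoted on the slide; the list of units each instruction class passes through is an assumption consistent with the earlier summary of execution steps.

```python
DELAY_PS = {"memory": 200, "alu": 100, "regfile": 50}

# Units on each instruction's path, in order (assumed for illustration).
PATHS = {
    "R-type": ["memory", "regfile", "alu", "regfile"],
    "lw": ["memory", "regfile", "alu", "memory", "regfile"],
    "sw": ["memory", "regfile", "alu", "memory"],
    "beq": ["memory", "regfile", "alu"],
}


def path_delay(units):
    return sum(DELAY_PS[unit] for unit in units)


delays = {name: path_delay(units) for name, units in PATHS.items()}

# A single-cycle design must fit the slowest instruction in one cycle.
cycle_time_ps = max(delays.values())
```

Under these assumptions the load sets the cycle time, so every other instruction wastes part of each cycle; this is the motivation for the multicycle design.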

Multicycle Approach

• Reuse functional units
  – ALU used to compute address and to increment PC
  – Memory used for instruction and data
• Control signals will not be determined directly by instruction
  – e.g., what should the ALU do for a "subtract" instruction?
  – There must be some sequencing involved leading to ….
• Use a finite state machine for control

Multicycle Approach

• Break up the instructions into steps, each step takes a cycle
  – balance the amount of work to be done
  – restrict each cycle to use only one major functional unit
• At the end of a cycle
  – store values for use in later cycles (easiest thing to do)
  – this introduces additional "internal" registers

[Figure: multicycle datapath. PC, Memory (Address, MemData, Write data), Instruction register (Instruction [25–21], [20–16], [15–0], [15–11]), Memory data register, Registers, A, B, sign extend (16 → 32), shift left 2, multiplexors on both ALU inputs (including the constant 4), ALU (Zero, ALU result), ALUOut]

Instructions from ISA perspective

• Consider each instruction from the perspective of ISA
• Example:
  – The add instruction changes a register
  – Instruction specified by the PC
  – Destination register specified by bits 15:11 of instruction
  – New value is the sum ("op") of two registers
  – Source registers specified by bits 25:21 and 20:16 of the instruction

    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

  – In order to accomplish this, we must break up the instruction (kind of like introducing variables when programming)

Breaking down an instruction

• ISA definition of arithmetic:

    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• Could break down to:
  – IR <= Memory[PC]
  – A <= Reg[IR[25:21]]
  – B <= Reg[IR[20:16]]
  – ALUOut <= A op B
  – Reg[IR[15:11]] <= ALUOut
• Don't forget an important part of the operation:
  – PC <= PC + 4

Idea behind multicycle approach

• We define each instruction from the ISA perspective
• Break it down into steps following the rule that data flows through, at most, one major functional unit (e.g., balance work across steps)
• Introduce new registers as needed (A, B, ALUOut, MDR, etc.)
• Finally, try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control and likely hardware, helps to simplify solution)

Result: The textbook's multicycle Implementation.

Five Execution Steps

• Instruction Fetch
• Instruction Decode and Register Fetch
• Execution, Memory Address Computation, or Branch Completion
• Memory Access or R-type instruction completion
• Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES

Step 1: Instruction Fetch

• Use PC to get instruction and put it in the Instruction Register
• Increment the PC by 4 and put the result back in the PC
• Can be described succinctly using RTL "Register-Transfer Language"

    IR <= Memory[PC];
    PC <= PC + 4;

What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:

    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type

Step 3: (Instruction dependent)

• ALU is performing one of three functions, based on instruction type
  – Memory Reference:

      ALUOut <= A + sign-extend(IR[15:0]);

  – R-type:

      ALUOut <= A op B;

  – Branch:

      if (A==B) PC <= ALUOut;

Step 4: (R-type or memory-access)

• Loads and stores access memory

    MDR <= Memory[ALUOut];
      or
    Memory[ALUOut] <= B;

• R-type instructions finish

    Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle on the edge

Step 5: Write-back step

• Reg[IR[20:16]] <= MDR;

Which instruction needs this?

Summary:

[Table]

Simple Questions

• How many cycles will it take to execute this code?

    lw $t2, 0($t3)
    lw $t3, 4($t3)
    beq $t2, $t3, Label
    nop
    add $t5, $t2, $t3
    sw $t5, 8($t3)
    Label: ...

• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?

[Figure: complete multicycle datapath with control. Control unit (Op [5–0]) outputs PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst; jump address formed from Instruction [25–0] shifted left 2 (26 → 28) and PC [31–28]; ALU control driven by Instruction [5–0]]

Review: finite state machines

• Finite state machines:
  – a set of states and
  – next state function (determined by current state and the input)
  – output function (determined by current state and possibly input)

[Figure: inputs and current state feed the next-state function, whose result is clocked into the current state; current state and inputs feed the output function, which produces the outputs]

  – We'll use a Moore machine (output based only on current state)

Review: finite state machines

• Example:

  B.37 A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light "moves" from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
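One possible Moore machine for the electronic eye can be sketched as two tables, one for the next-state function and one for the output function. The four state names are an assumption (the exercise leaves the state assignment open); two "middle" states are used so the machine remembers which way the light is moving.

```python
# Next-state function: there are no inputs, so it depends on the state alone.
NEXT_STATE = {
    "left": "middle_going_right",
    "middle_going_right": "right",
    "right": "middle_going_left",
    "middle_going_left": "left",
}

# Moore output function: (Left, Middle, Right) depends only on the current state.
OUTPUT = {
    "left": (1, 0, 0),
    "middle_going_right": (0, 1, 0),
    "right": (0, 0, 1),
    "middle_going_left": (0, 1, 0),
}


def run_eye(cycles, state="left"):
    """Return the light pattern for each clock cycle."""
    patterns = []
    for _ in range(cycles):
        patterns.append(OUTPUT[state])
        state = NEXT_STATE[state]      # state changes once per clock
    return patterns


sweep = run_eye(6)
```

Four states fit in two state bits, and exactly one output is asserted in every state.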

Implementing the Control

• Value of control signals is dependent upon:
  – what instruction is being executed
  – which step is being performed
• Use the information we've accumulated to specify a finite state machine
  – specify the finite state machine graphically, or
  – use microprogramming
• Implementation can be derived from specification

Graphical Specification of FSM

• Note:
  – don't care if not mentioned
  – asserted if name only
  – otherwise exact value
• How many state bits will we need?

State | Name                                | Outputs
0     | Instruction fetch (Start)           | MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
1     | Instruction decode/register fetch   | ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
2     | Memory address computation          | ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
3     | Memory access (load)                | MemRead, IorD = 1
4     | Memory read completion step         | RegDst = 0, RegWrite, MemtoReg = 1
5     | Memory access (store)               | MemWrite, IorD = 1
6     | Execution                           | ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
7     | R-type completion                   | RegDst = 1, RegWrite, MemtoReg = 0
8     | Branch completion                   | ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
9     | Jump completion                     | PCWrite, PCSource = 10

Finite State Machine for Control

• Implementation:

[Figure: control logic block. Inputs: Op5–Op0 from the instruction register opcode field and S3–S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and next-state bits NS3–NS0]

PLA Implementation

• If I picked a horizontal or vertical line could you explain it?

[Figure: PLA with inputs Op5–Op0 and S3–S0 and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0]

ROM Implementation

• ROM = "Read Only Memory"
  – values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
  – if the address is m-bits, we can address 2^m entries in the ROM.
  – our outputs are the bits of data that the address points to.

Address (m = 3) | Data (n = 4)
0 0 0           | 0 0 1 1
0 0 1           | 1 1 0 0
0 1 0           | 1 1 0 0
0 1 1           | 1 0 0 0
1 0 0           | 0 0 0 0
1 0 1           | 0 0 0 1
1 1 0           | 0 1 1 0
1 1 1           | 0 1 1 1

m is the "height", and n is the "width"

ROM Implementation

• How many inputs are there?
  6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
  16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
  – i.e., opcode is often ignored

ROM vs PLA

• Break up the table into two parts
  – 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
  – 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
  – Total: 4.3K bits of ROM
• PLA is much smaller
  – can share product terms
  – only need entries that produce an active output
  – can take into account don't cares
• Size is (#inputs × #product-terms) + (#outputs × #product-terms)
  For this example = (10x17)+(20x17) = 510 PLA cells
• PLA cells usually about the size of a ROM cell (slightly bigger)

Another Implementation Style

• Complex instructions: the "next state" is often current state + 1

[Figure: PLA or ROM producing the control outputs (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and AddrCtl; a State register, an Adder that adds 1, and address select logic fed by Op[5–0] from the instruction register opcode field]
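The size figures quoted for the ROM and PLA options follow from simple arithmetic, which can be checked directly. The variable names are hypothetical; the bit counts and the 17 product terms are the numbers given on the slides.

```python
OPCODE_BITS = 6
STATE_BITS = 4
CONTROL_OUTPUTS = 16
PRODUCT_TERMS = 17

address_bits = OPCODE_BITS + STATE_BITS         # 10 address lines
total_outputs = CONTROL_OUTPUTS + STATE_BITS    # 20 outputs

# One big ROM: every (opcode, state) pair stores all 20 output bits.
single_rom_bits = 2 ** address_bits * total_outputs

# Split ROMs: control outputs depend on state only; next state needs all 10 bits.
output_rom_bits = 2 ** STATE_BITS * CONTROL_OUTPUTS
next_state_rom_bits = 2 ** address_bits * STATE_BITS
split_rom_bits = output_rom_bits + next_state_rom_bits

# PLA: (#inputs x #product-terms) + (#outputs x #product-terms).
pla_cells = address_bits * PRODUCT_TERMS + total_outputs * PRODUCT_TERMS
```

The split ROM total of 4352 bits is the "4.3K" on the slide, and the single ROM's 20480 bits is the "20K".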

Details

Dispatch ROM 1
Op     | Opcode name | Value
000000 | R-format    | 0110
000010 | jmp         | 1001
000100 | beq         | 1000
100011 | lw          | 0010
101011 | sw          | 0010

Dispatch ROM 2
Op     | Opcode name | Value
100011 | lw          | 0011
101011 | sw          | 0101

State number | Address-control action    | Value of AddrCtl
0            | Use incremented state     | 3
1            | Use dispatch ROM 1        | 1
2            | Use dispatch ROM 2        | 2
3            | Use incremented state     | 3
4            | Replace state number by 0 | 0
5            | Replace state number by 0 | 0
6            | Use incremented state     | 3
7            | Replace state number by 0 | 0
8            | Replace state number by 0 | 0
9            | Replace state number by 0 | 0

[Figure: address select logic. A multiplexor controlled by AddrCtl chooses among 0, Dispatch ROM 1, Dispatch ROM 2 and State + 1 (Adder); the dispatch ROMs are indexed by the instruction register opcode field]

Microprogramming

[Figure: control unit made of a microcode memory (PLA or ROM) producing the datapath outputs (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and AddrCtl; a microprogram counter, an Adder that adds 1, and address select logic fed by the instruction register opcode field]

• What are the "microinstructions"?
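The dispatch ROMs and the AddrCtl table above are enough to sketch the address select logic and walk each instruction class through its states. The table contents are copied from the slides; the function names are hypothetical.

```python
# opcode -> next state, from the two dispatch ROM tables
DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8, 0b100011: 2, 0b101011: 2}
DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}

# state number -> AddrCtl value, from the address-control table
ADDR_CTL = {0: 3, 1: 1, 2: 2, 3: 3, 4: 0, 5: 0, 6: 3, 7: 0, 8: 0, 9: 0}


def next_state(state, opcode):
    """Address select logic: pick the next state from the AddrCtl value."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                          # replace state number by 0
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]     # use dispatch ROM 1
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]     # use dispatch ROM 2
    return state + 1                      # use incremented state


def states_for(opcode):
    """States visited by one instruction, starting at state 0."""
    visited = [0]
    while True:
        following = next_state(visited[-1], opcode)
        if following == 0:
            return visited
        visited.append(following)


lw_states = states_for(0b100011)
sw_states = states_for(0b101011)
r_states = states_for(0b000000)
beq_states = states_for(0b000100)
jmp_states = states_for(0b000010)
```

The number of states visited is the instruction's cycle count, which gives the 3 to 5 cycle range.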

Microprogramming

• A specification methodology
  – appropriate if hundreds of opcodes, modes, cycles, etc.
  – signals specified symbolically using microinstructions

Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch

• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?

Microinstruction format

Field name       | Value        | Signals active                      | Comment
ALU control      | Add          | ALUOp = 00                          | Cause the ALU to add.
ALU control      | Subt         | ALUOp = 01                          | Cause the ALU to subtract; this implements the compare for branches.
ALU control      | Func code    | ALUOp = 10                          | Use the instruction's function code to determine ALU control.
SRC1             | PC           | ALUSrcA = 0                         | Use the PC as the first ALU input.
SRC1             | A            | ALUSrcA = 1                         | Register A is the first ALU input.
SRC2             | B            | ALUSrcB = 00                        | Register B is the second ALU input.
SRC2             | 4            | ALUSrcB = 01                        | Use 4 as the second ALU input.
SRC2             | Extend       | ALUSrcB = 10                        | Use output of the sign extension unit as the second ALU input.
SRC2             | Extshft      | ALUSrcB = 11                        | Use the output of the shift-by-two unit as the second ALU input.
Register control | Read         |                                     | Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
Register control | Write ALU    | RegWrite, RegDst = 1, MemtoReg = 0  | Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
Register control | Write MDR    | RegWrite, RegDst = 0, MemtoReg = 1  | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory           | Read PC      | MemRead, IorD = 0                   | Read memory using the PC as address; write result into IR (and the MDR).
Memory           | Read ALU     | MemRead, IorD = 1                   | Read memory using the ALUOut as address; write result into MDR.
Memory           | Write ALU    | MemWrite, IorD = 1                  | Write memory using the ALUOut as address, contents of B as the data.
PC write control | ALU          | PCSource = 00, PCWrite              | Write the output of the ALU into the PC.
PC write control | ALUOut-cond  | PCSource = 01, PCWriteCond          | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
PC write control | jump address | PCSource = 10, PCWrite              | Write the PC with the jump address from the instruction.
Sequencing       | Seq          | AddrCtl = 11                        | Choose the next microinstruction sequentially.
Sequencing       | Fetch        | AddrCtl = 00                        | Go to the first microinstruction to begin a new instruction.
Sequencing       | Dispatch 1   | AddrCtl = 01                        | Dispatch using the ROM 1.
Sequencing       | Dispatch 2   | AddrCtl = 10                        | Dispatch using the ROM 2.

Maximally vs. Minimally Encoded

• No encoding:
  – 1 bit for each datapath operation
  – faster, requires more memory (logic)
  – used for Vax 780: an astonishing 400K of memory!
• Lots of encoding:
  – send the microinstructions through logic to get control signals
  – uses less memory, slower
• Historical context of CISC:
  – Too much logic to put on a single chip with everything else
  – Use a ROM (or even RAM) to hold the microcode
  – It's easy to add new instructions

Microcode: Trade-offs

• Distinction between specification and implementation is sometimes blurred
• Specification Advantages:
  – Easy to design and write
  – Design architecture and microcode in parallel
• Implementation (off-chip ROM) Advantages
  – Easy to change since values are in memory
  – Can emulate other architectures
  – Can make use of internal registers
• Implementation Disadvantages, SLOWER now that:
  – Control is implemented on same chip as processor
  – ROM is no longer faster than RAM
  – No need to go back and make changes

Historical Perspective

• In the '60s and '70s microprogramming was very important for implementing machines
• This led to more sophisticated ISAs and the VAX
• In the '80s RISC processors based on pipelining became popular
• Pipelining the microinstructions is also possible!
• Implementations of IA-32 architecture processors since 486 use:
  – "hardwired control" for simpler instructions (few cycles, FSM control implemented using PLA or random logic)
  – "microcoded control" for more complex instructions (large numbers of cycles, central control store)
• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Pentium 4

• Pipelining is important (last IA-32 without it was 80386 in 1985)

[Figure: Pentium 4 floorplan with blocks labelled Control (several), I/O interface, Instruction cache, Data cache, Enhanced floating point and multimedia, Integer datapath, Secondary cache and memory interface, Advanced pipelining hyperthreading support; annotations point to Chapter 6 and Chapter 7]

• Pipelining is used for the simple instructions favored by compilers

"Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions"

Pentium 4

• Somewhere in all that "control" we must handle complex instructions

[Figure: Pentium 4 floorplan with blocks labelled Control (several), I/O interface, Instruction cache, Data cache, Enhanced floating point and multimedia, Integer datapath, Secondary cache and memory interface, Advanced pipelining hyperthreading support]

• Processor executes simple microinstructions, 70 bits wide (hardwired)
• 120 control lines for integer datapath (400 for floating point)
• If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions)
• It's complicated!

Chapter 5 Summary

• If we understand the instructions …
  We can build a simple processor!
• If instructions take different amounts of time, multi-cycle is better
• Datapath implemented using:
  – Combinational logic for arithmetic
  – State holding elements to remember bits
• Control implemented using:
  – Combinational logic for single-cycle implementation
  – Finite state machine for multi-cycle implementation