Chapter 4 Let’s build a processor!
1 Introduction
CPU performance factors
Instruction count
Determined by ISA and compiler
CPI and Cycle time
Determined by CPU hardware We will examine three MIPS implementations
A simplified version (single cycle and multi-cycle)
A more realistic pipelined version Simple subset, shows most aspects Memory reference: lw, sw
Arith meti c/l ogi cal : add, sub, and, or, slt
Control transfer: beq, j 2 Instruction Execution
PC instruction memory , fetch instruction
Register numbers register file, read registers
Depending on instruction class
Use ALU to calculate
Arithmetic result
Memory address for load/store
Branchddh target address
Access data memory for load/store
PC target address or PC + 4
3 CPU Overview
Can’t just join wires together
Use multiplexers
5 Control
6 Logic Design Basics
Information encoded in binary
Low voltage = 0, High voltage = 1
OiOne wire per bit
Multi-bit data encoded on multi-wire buses
Combinational element
Operate on data
Output is a function of input
State ((qsequential) elements
Store information 7 Combinational Elements
AND-gate Adder A + Y
Y = A & B Y = A + B B
A Y B
Arithmetic/Logic Unit Multiplexer Y = F(A, B) Y = S ? I1 : I0 A M I0 Y u Y ALU I1 x B
S F
8 SSaetate Elem ent s
Unclocked vs . Clocked
Clocks used in synchronous logic
when should an element that contains state be updated?
falling edge
cycle time riiising ed ge
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 9 An un cl ock ed st at e el em ent
The set -reset latch
output depends on present inputs and also on past inputs
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 10 Laactch es aadnd Fli p-flops
Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
Change of st a te (va lue) i s b ased on th e clock
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 11 Laactch es aadnd Fli p-flops
Latches: whenever the inputs change, and the clock is asserted "logically true" , — could mean electrically low Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written — wouldn't want to read a signal at the same time it was being written
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 12 D-laactch
Two inputs:
the data value to be stored (D)
the cl ock sig inal (C) i ndi cati ng wh en t o read & store D
TttTwo outputs:
the value of the internal state (Q) and it's compltlement
C D Q
C _ Q D Q Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 13 D flip-flop
Output changes only on the clock edge
D D Q D Q Q D D latch latch _ _ C C Q Q
C
D
C
Q
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 14 Sequential Elements
Register: stores data in a circuit
Uses a clock signal to determine when to update the stored value
Edge-triggered: update when Clk changes from 0 to 1
D Q Clk D Clk Q
15 Sequential Elements
Register with write control
Only updates on clock edge when write control input is 1
Used when stored value is required later
Clk
D Q Write Write D Clk Q
16 Clocking Methodology
Combinational logggic transforms data during clock cycles
Between clock edges
Input from state elements, output to state element
Longest tdl delay d dtetermi nes cl ock peri idod
17 Regi st er File
Built using D flip -flops Read register number 1 Register 0 Register 1 M Read register u number 1 Read Read data 1 data 1 Register n – 1 x Read register Register n number 2 Register file Write Read register register number 2 Read Write data 2 data Write M u Read data 2 x
Do you understand? What is the “Mux” above?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 18 Abstraction
Make sure you understand the abstractions!
Sometimes it is easy to think you do, when Select you d’tdon’t A31 M Select u C31 x B31 32 A M 32 u C A30 M 32 x B u C30 x . B30 . . .
A0 M u C0 x B0
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 19 Regi st er File
Note: we still use the real clock to determine when to write
Write
C 0 Register 0 1 D n-to-1 C Register number decoder Register 1 D n – 1 n
C Register n – 1 D C Register n Register data D
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 20 SSpeimple Impeeaoplementation
Include the functional units we need for each instruction
Instruction MemWrite address
PC Instruction Add Sum Address Read Instruction data 16 32 Sign memory extend Write Data data memory a. Instruction memory b. Program counter c. Adder
ALU control MemRead 5 Read 3 register 1 Read a. Data memory unit b. Sign-extension unit data 1 Register 5 Read numbers register 2 Zero Registers Data ALU ALU 5 Write result register Read Write data 2 Why do we need this stuff? Data data
RegWrite
a. Registers b. ALU Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 21 Building a Datapath
Datapath
Elements that process data and addresses in the CPU
Registers, ALUs, mux’s, memories, …
We will build a MIPS datapath incrementally
RfiiRefining th e overvi ew d esig in
22 Instruction Fetch
Increment by 4 for next 32-bit instruction register
23 R-Format Instructions
Read two reggpister operands
Perform arithmetic/logical operation
Write register result
24 Load/Store Instructions
Read register operands Calculate address using 16-bit offset
Use ALU, but sign-extend offset Loaddd: Read memory and dd update register Store: Write register value to memory
25 Branch Instructions
Read register operands
Compare operands
Use ALU, subtract and check Zero output
Calculate target address
Sign-extend displacement
Shift left 2 pp(laces (word disp lacement)
Add to PC + 4
Already calculated by instruction fetch
26 Branch Instructions
Just re-routes wires
Sign-bit wire replicated 27 Composing the Elements
First-cut data path does an instruction in one clock cycle
Eac hdtpthelementh datapath element can nonldoone only do one function at a time
Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions
28 R-Type/Load/Store Datapath
29 Full Datapath
30 Contro l
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexer inputs)
Information comes from the 32 bits of the instruction
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 31 Contro l
Example: add $8, $17, $18
Instruction Format:
000000 10001 10010 01000 00000 100000
op rs rt rd shamt funct
ALU's operation based on instruction type and function code
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 32 Contro l
e. g., what should the ALU do with this instruction Example: lw $1 , 100($2)
35 2 1 100 op rs rt 16 bit number
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 33 ALU Control
ALU used for
Load/Store: F = add
Branch: F = subtract
R-type: F depends on funct field
ALU control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR
34 ALU Control
Assume 2 -bit ALUOp derived from opcode
Combination al logi c de ri ve s ALU cont rol opcode ALUOp Operation funct ALU function ALU control lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111
35 The Main Control Unit
Control signals derived from instruction R-type 0 rs rt rd shamt funct 31:2625:21 20:16 15:11 10:6 5:0
Load/ 35 or 43 rs rt address Store 31:26 25:21 20:16 15:0
Branch 4 rs rt address 31:26 25:21 20:16 15:0
opcode always read, write for sign-extend read except R-type and add fldfor load andld load
36 Contro l
0 M u x Add ALU 1 result Add Shift RegDst left 2 4 Branch MemRead Instruction [31– 26] MemtoReg Control ALUOp MemWrite ALUSrc RegWrite
Instruction [25– 21] Read Read register 1 PC address Read Instruction [20– 16] Read data 1 Zero Instruction register 2 0 Registers ALU [31– 0] Read 0 ALU M Write data 2 result Address Read 1 Instruction register M data u M memory x u Instruction [15– 11] Write x u 1 Data x data 1 memory 0 Write data 16 32 Instruction [15– 0] Sign extend ALU control
Instruction [5– 0]
Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format1 0 0 100 0 1 0 lw 0 1 1 110 0 0 0 sw X 1 X 001 0 0 0 beq X 0 XElectrical & 000 Computer Engineering 1 0 1 School of Engineering THE COLLEGE OF NEW JERSEY 37 Contro l
Simple combinational logic (truth tables) Instruction Inputs Op5 Op4 Op3 Decoding ALUOp Op2 Op1 ALU control block Op0 ALUOp0 ALUOp1
Outputs Operation2 R-format Iw sw beq F3 RegDst Operation F2 Operation1 ALUSrc F (5– 0) F1 MemtoReg Operation0 RegWrite F0 MemRead MemWrite
Branch ALUOp1
ALUOpO
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 38 Our Simp l e Contr ol Struc tu re
All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
ALU might not produce “right answer ” right away
we use wriillite signals along wi ihlkth clock to determine when to write
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 39 Our Simp l e Contr ol Struc tu re
Cycle time determined by length of the longest path
State State element Combinational logic element 1 2
Clock cycle
We are ignoring some details like setup and hold times
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 40 Datapath With Control
41 R-Type Instruction
42 Load Instruction
43 Branch-on-Equal Instruction
44 Implementing Jumps
Jump 2 address 31:26 25:0 Jump uses word address
Update PC with concatenation of
Top 4 bits of old PC
26-bit jump address
00
Need an extra control signal decoded from opcode
Chapter 4 — The Processor — 45 45 Datapath With Jumps Added
46 Performance Issues
Longest delay determines clock period
Critical path: load instruction
Instruction memory register file ALU data memory register file
Not feasible to varyyp period for different instructions
Violates design principle
Making the common case fast
We will improve performance by pipelining 47 Wher e we aaere headed
Single Cycle Problems:
what if we had a more complicated instruction like floating point?
requires more area
OSltiOne Solution:
use a “smaller” cycle time
have different instructions take different numbers of cycles
a “lil”d“multicycle” datapath : Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 48 Wher e we aaere headed
One Solution:
use a “smaller” cycle time
have different instructions take different numbers of cycles
a “multicycle” datapath:
Instruction register Data PC Address A Register # Instruction Memory or data Registers ALU ALUOut Memory Register # data B Data register Register #
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 49 Multi cycl e Appr oach
We will be reusing functional units
ALU used to compute address and to increment PC
Memory used for instruction and data
Our control signals will not be determined soley by instruction
e.g., what should the ALU do for a “subtract” itinstruc ti?tion?
We’ll use a finite state machine for control
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 50 Review: finite state machines
Finite state machines:
a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)
Next state Next-state Current state function
Clock Inputs
Output Outputs function
We’ll use a Moore machine (output based only on current state)
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 51 Review: finite state machines
Example: B. 21 A friend would like you to build an “electronic eye” for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light “moves” from left to right and then from right to left, thus sca ring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye’s movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 52 Multi cycl e Appr oach
Break up the instructions into steps, each step takes a cycle
balance the amo unt of wo rk to be done
restrict each cycle to use only one major functional unit
At the end of a cycle
store values for use in later cycles (easiest thing to do)
itintrod uce additi onal l“it “internal” regi itsters Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 53 Multi cycl e Appr oach
At the end of a cycle
store values for use in later cycles (easiest thing to do)
introduce additional “internal” registers
PC 0 0 M Instruction Read Address [25– 21] register 1 M u u x Instruction Read Read A x 1 Memory Zero [20– 16] register 2 data 1 1 ALU MemData 0 Registers ALU ALUOut result Instruction M Write Read [15– 0] Instruction u register data 2 B 0 Write Instruction [15– 11] x 4 1 M data 1 Write u register data 2 x Instruction 0 3 [15– 0] M u x Memory 1 data 16 32 Shift register Sign extend left 2
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 54 Instructions from ISA perspective
Consider each instruction from ppperspective of ISA.
Example:
The add instruction changes a register.
Register specified by bits 15:11 of instruction.
Instruction specified by the PC.
New value is the sum ((op)“op”)oftworegisters of two registers.
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 55 Instructions from ISA perspective
Example:
Registers specified by bits 25:21 and 20:16 of the instruction Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg [Memory [PC][20: 16]]
In order to accomplish this we must break up the instruction. (kind of like introducing variables when
programming) Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 56 Breaking down an instruction
ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op
Reg[Memory [PC][20: 16]]
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 57 Breaking down an instruction
Could break down to: IR <= Memory[PC] A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= A op B Reg[IR[20:16]] <= ALUOut
We forgot an important part of the definition of arithmetic! PC <= PC + 4 Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 58 Idea behind multicycle approach
We define each instruction from the ISA perspective (do this!)
Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 59 Idea behind multicycle approach
Introduce new registers as needed (e. g, A, B, ALUOut, MDR, etc.)
Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solti)lution) Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 60 Five Executi on SepsSteps
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation , or Branch Completion
Memory Access or R-type instruction completion
Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 61 SSeptep 1: Instruc ucotion Feectch
Use PC to ggpet instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL "Register- Transfer Language" IR = Memory[PC]; PC = PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 62 SSptep 2: Instruction Decode and Re gister Fetch
Read registers rs and rt in case we need them
CtthbhddiCompute the branch address in case th thittiie instruction is a branch
RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type (b"ddi"ititlli)(we are busy "decoding" it in our control logic)
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 63 Stepp( 3 (instruction de pendent )
ALU is performing one of three functions, based on instruction type
Memory Reference: ALUOut = A + sign-extend(IR[15-0]);
R-type: ALUOut = A op B;
Branch: if (A==B) PC = ALUOut;
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 64 Step 4 (R-type or memory- access)
Loads and stores access memory MDR = Memory[ALUOut]; or endMemory[ALUOut] of the cycle on the = edge B;
R-tittifiihtype instructions finish Reg[IR[15-11]] = ALUOut; Th t th ll t it tl k e write actually takes place at the
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 65 Write-back st ep
Reg[IR[20-16]]= MDR;
Which i nstructi on need s thi s? What about all the other instructions?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 66 Summary:
Action for R-type Action for memory-reference Action for Action for Step name instructions instructions branches jumps Instruction fetch IR = Memory[PC] PC = PC + 4 Instruction A = Regg[ [IR[ 25-21]] decode/register fetch B = Reg [IR[20-16]] ALUOut = PC + (sign-extend (IR[15-0]) << 2) Execution, address ALUOut = A op B ALUOut = A + sign-extend if (A ==B) then PC = PC [31-28] II computation, branch/ (IR[15-0]) PC = ALUOut (IR[25-0]<<2) jump completion Memory access or R-type Reg [IR[15-11]] = Load: MDR = Memory[ALUOut] completion ALUOut or Store: Memory [ALUOut] = B
Memoryyp read completion Load: Reg[IR[20-16]] = MDR
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 67 Simp l e Questi on s
How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label #assume not add $t5, $t2, $t3 sw $t5, 8($t3) Label: ...
What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 68 Imppelem entin g the Contr ol
Value of control siggppnals is dependent upon:
what instruction is being executed
which steppgp is being performed
Use the information we’ve acculumated to specify a finite state machine
specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from specification Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 69 Grappphical Specification of FSM
Instruction decode/ Instruction fetch register fetch 0 MemRead 1 ALUSrcA = 0 IorD = 0 ALUSrcA = 0 Start IRWrite ALUSrcB = 11 Note: ALUSrcB = 01 ALUOp = 00 ALUOp = 00 don’t care if PCWrite PCSource = 00 ) ') pe -ty Q not R E B p = ' ') (O = Memory address 'SW p = Branch (O Jump mentioned p 'J') = (Op computation r (O ') o Execution completion completion 'LW p = asserted if 2 (O 6 8 9 ALUSrcA = 1 ALUSrcA = 1 ALUSrcB = 00 name only ALUSrcA =1 PCWrite ALUSrcB = 10 ALUOp = 01 ALUSrcB = 00 PCSource = 10 ALUOp = 00 ALUOp= 10 PCWriteCond otherwise PCSource = 01
exact value (O ) p '' =
'S W ') Memory Memory (Op = 'LW (Op access access R-type completion 3 5 7 How many RegDst = 1 MemRead MemWrite RegWrite sta te bits w ill IorD = 1 IorD = 1 MemtoReg = 0 we need?
Write-back step 4
RegDst=0 RegWrite MemtoReg=1 Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 70 Finite State Machine for CtlControl
Implementation: PCWrite PCWriteCond IorD MemRead MemWrite IRWrite Control logic MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst
NS3 NS2 NS1 Inputs NS0 Op5 Op4 Op3 Op2 Op1 Op0 S3 S2 S1 S0
Instruction register State register opcode field
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 71 PLA Imppelem ent ati on
Op5
Op4
If I picked a horizontal Op3 or vertical line could Op2 Op1 you explain it? Op0 S3
S2
S1
S0
PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 PCSource0 ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 Electrical & Computer Engineering NS2 NS1 School of Engineering NS0 THE COLLEGE OF NEW JERSEY 72 ROM Imppeeaolementation
ROM = "Read Only Memory "
values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table m if the address is m-bits, we can address 2 entries in the ROM.
our outputs are the bits of data that the address points to.
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 73 ROM Imppeeaolementation
A ROM can be used to implement a truth tbltable m if the address is m-bits, we can address 2 entries in the ROM .
our outputs are the bits of data that the address points to. 0 0 0 0 0 1 1 0 0 1 1 1 0 0 mn 0 1 0 1 1 0 0 0 1 1 1 0 0 0 10000001 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 m is the "heigth", and n is the "width" Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 74 ROM Imppeeaolementation
How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 210 = 1024 different addresses)
How many outputs are there? 16 dat apa th-contltrol ou tpu ts, 4tt4 state bits = 20 outputs
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 75 ROM Imppeeaolementation
10 ROM is 2 x 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entr ies, the ou tpu ts are the same — i.e., opcode is often ignored
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 76 ROM vs PLA
Break up the table into two parts — 4 state bits tell you the 16 outputs, 24 x 16 bits of ROM — 10 b its tell you th e 4 n ex t state bits, 210 x 4 bits of ROM — Total: 4 .3K bits of ROM
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 77 ROM vs PLA
PLA is much smaller — can share product terms — only need entries that produce an activ e output — can take into account don't cares
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 78 ROM vs PLA
Size is (#inputs #product-terms) + (#outputs #product-terms) For this example = (10 x17) +( 20 x17) = 460 PLA ce lls
PLA cell s usuall y ab out th e si ze of a ROM cell (slightly bigger)
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 79 Another Imppylementation Style
Complex instructions: the "next state" is often curren t s ta te + 1 Control unit PCWrite PCWriteCond IorD MemRead PLA or ROM MemWrite IRWrite BWrite Outputs MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst Input AddrCtl 1
State
Adder
Address select logic Op[5– 0]
Instruction register opcode field Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 80 Details Dispat ch ROM 1 Dispa tch ROM 2 Op Opcode name Value Op Opcode name Value 000000 R-format 0110 100011 lw 0011 000010 jmp 1001 101011 sw 0101 000100 beq 1000 PLA or ROM 100011 lw 0010 101011 sw 0010 1
State
Adder
Mux AddrCtl 3 2 1 0
0
Dispatch ROM 2 Dispatch ROM 1
Address select logic Op
Instruction register opcode field
State number Address-control action Value of AddrCtl 0 UiUse incremente d state 3 1 Use dispatch ROM 1 1 2 Use dispatch ROM 2 2 3 Use incremented state 3 4 Replace state number by 0 0 5 Replace state number by 0 0 6 Use incremented state 3 7 Replace state number by 0Electrical & Computer Engineering 0 8 Replace state number by 0School of Engineering 0 9ReSpring Semester,place 2004 state number by 00THE COLLEGE OF NEW JERSEY 81 Miccopogaroprogrammin g
Control unit PCWrite PCWriteCond IorD Microcode memory MemRead Datapath MemWrite IRWrite BWrite Outputs MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl Input 1
Microprogram counter
Adder
Address select logic 5– 0] [[ Op
Instruction register opcode field What are the “microinstructions” ?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 82 Miccopogaroprogrammin g
A specification methodology
appropriate if hundreds of opcodes, modes, cycles, etc.
signals specified symbolically using microinstructions
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 83 Miccopogaroprogrammin g
ALU Register PCWrite Label control SRC1 SRC2 control Memory control Sequencing Fetch Add PC 4 Read PC ALU Seq Add PC Extshft Read Dispatch 1 Mem1 Add A Extend Dispatch 2 LW2 RdALURead ALU Seq Write MDR Fetch SW2 Write ALU Fetch Rformat1 Func code A B Seq Write ALU Fetch BEQ1 Subt A B ALUOut-cond Fetch JUMP1 Jump address Fetch Will two imppflementations of the same architecture have the same microcode?
What would a microassembler do?
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 84 Microinstruction format Field name Value Signals active Comment Add ALUOp = 00 Cause the ALU to add. ALU control Subt ALUOp = 01 Cause the ALU to subtract; this implements the compare for branches. Func code ALUOp = 10 Use the instruction's function code to determine ALU control. SRC1 PC ALUSrcA = 0 Use the PC as the first ALU input. A ALUSrcA = 1 Register A is the first ALU input. B ALUSrcB = 00 Register B is the second ALU input. SRC2 4 ALUSrcB = 01 Use 4 as the second ALU input. Extend ALUSrcB = 10 Use output of the sign extension unit as the second ALU input. Extshft ALUSrcB = 11 Use the output of the shift-by-two unit as the second ALU input. Read Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B. WitWrite ALU RWitRegWrite, Writ e a regi st er usi ng th e rd fie ld o f the IR as the reg is ter num ber an d Register RegDst = 1, the contents of the ALUOut as the data. control MemtoReg = 0 Write MDR RegWrite, Write a register using the rt field of the IR as the register number and RegDst = 0, the contents of the MDR as the data. MemtoReg = 1 RdPCRead PC MRdMemRead, RdRead memory us ing thPCddthe PC as address; wr ite resu ltitIR(dlt into IR (and lorD = 0 the MDR). Memory Read ALU MemRead, Read memory using the ALUOut as address; write result into MDR. lorD = 1 Write ALU MemWrite, Write memory using the ALUOut as address, contents of B as the lorD = 1 data. ALU PCSource = 00 Write the output of the ALU into the PC. PCWrite PC write control ALUOut-cond PCSource = 01, If the Zero output of the ALU is active, write the PC with the contents PCWriteCond of the register ALUOut. jump address PCSource = 10, Write the PC with the jump address from the instruction. PCWrite Seq AddrCtl = 11 Choose the next microinstruction sequentially. Sequencing Fetch AddrCtl = 00 Go to the first microinstruction to begin a new instruction. Dispatch 1 AddrCtl = 01 DispatchElectrical using & Computer the ROM Engineering1. Dispatch 2 AddrCtl = 10 Dispatch Schoolusing theof EngineeringROM 2. Spring Semester, 2004 THE COLLEGE OF NEW JERSEY 85 Maximally vs. Minimally Encoded
No encoding:
1 bit for each datapath operation
faster, requires more memory (logic)
used for Vax 780 — an astonishing 400K of memory!
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 86 Maximally vs. Minimally Encoded
Lots of encoding:
send the microinstructions through logic to gettlilt control signals
uses less memory, slower
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 87 Maximally vs. Minimally Encoded
Historical context of CISC:
Too much logic to put on a single chip with everything el se
Use a ROM (or even RAM) to hold the microcode
It’s easy to add new instructions
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 88 Micr ocode: Trade -oosffs
Distinction between specification and implementation is sometimes blurred
Specification Advantages:
Easy to design and write
Design architecture and microcode in parallel
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 89 Micr ocode: Trade -oosffs
Implementation (off-chip ROM) Advantages
Easy to change since values are in memory
Can emulate other architectures
Can make use of internal registers
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 90 Micr ocode: Trade -oosffs
Implementation Disadvantages, SLOWER now that:
Control is implemented on same chip as processor
ROM is no longer faster than RAM
No need to go back and make changes
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 91 The Big Pict ur e
Initial Finite state Microprogram representation diagram
Sequencing Explicit next Microprogram counter control state function + dispatch ROMS
Logic Logic Truth representation equations tables
Implementation Programmable Read only technique logic array memory
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 92 Historical Perspective
In the ‘60s and ‘70s microprogramming was very important for implementing machines This led to more sophisticated ISAs and the VAX In the ‘80s RISC processors based on pipelining became popular Pipelining the microinstructions is also possible!
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 93 Historical Perspective
Implementations of IA -32 architecture processors since 486 use:
“hardwired control ” for simpler instructions (few cycles, FSM control implemented using PLA or random logic)
“ddl”fl“microcoded control” for more complex instructions (large numbers of cycles, central control store)
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 94 Historical Perspective
The IA-64 architecture uses a RISC- style ISA and can be implemented without a large central control store
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 95 Pentium 4
Pippgpelining is important ( last IA-32 without it was 80386 in 1985)
Control
Control I/O interface
Instruction cache Data Chapter 5 cache Enhanced floating point and multimedia Integer datapath Secondary cache and memory Control interface Chapter 4
Advanced pipelining Control hyperthreading support
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 96 Pentium 4 needs to ensure that the simple instructions execute Pitheppgelinin instructiong is used set for penalize the sim thep le instructionscomplex, less favored by compilers
“l“Simply put, hhf a high performance l implementation
quickly and, and that that the burden the burden of the complexities of the complexities of of
frequently used, used, instructions instructions”
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 97 Pentium 4
Somewhere in all that “control” we must handle complex instructions
Control
Control I/O interface
Instruction cache Data cache Enhanced floating point and multimedia Integer datapath Secondary cache and memory Control interface
Advanced pipelining Control hyperthreading support
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 98 Pentium 4
Processor executes simple microinstructions, 70 bits wide (hardwired)
120 control lines for integer datapath (400 for floating point)
If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions)
Its complicated!
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 99 Summary
If we understand the instructions … We can build a simple processor!
If instructions take different amounts of time, multi-cycle is better
Datapath implemented using:
Combinational logic for arithmetic
State holding elements to r emember bits Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 100 Summary
Control implemented using:
Combinational logic for single-cycle implementation
Finite state machine for multi-cycle implementation
Electrical & Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 101 Pipelinin g
Improve performance by increasing inst ruc tion through hput Program execution 2 4 6 8 10 12 14 16 18 order Time (in instructions) Instruction Data lw $1, 100($0) Reg ALU Reg fetch access
Instruction Data lw $2, 200($0) Reg ALU Reg 8 ns fetch access
Instruction lw $3, 300($0) 8 ns fetch ... 8 ns
Program execution 2 4 6 8 10 12 14 Time order (in instructions) Instruction Data lw $1, 100($0) Reg ALU Reg fetch access Instruction Data lw $2, 200($0) Reg ALU Reg 2 ns fetch access
Instruction Data lw $3, 300($0) Reg ALU Reg 2 ns fetch access
2 ns 2 ns 2 ns 2 ns 2 ns Ideal speedup is number of stages in the pipeline. Do we achieve this? Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 102 Pipelining Analogy
Pippyppgelined laundry: overlapping execution
Parallelism improves performance
Four loads:
Spppeedup = 8/3.5 = 2.3
Non-stop:
Speedup = 2n/0.5n + 1.5 ≈ 4 = number of stages
103 Pipelinin g
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 104 Pipelinin g
We’ ll build a simple pipeline and look at these issues
We’ll talk about modern processors and whthat rea lly mak es it ith hard :
exception handling
trying to improve performance with out-of- order execution, etc.
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 105 Basi c Idea
IF: Instruction fetch ID: Instruction decode/ EX: Execute/ MEM: Memory access WB: Write back register file read address calculation 0 M u x 1
Add
4 Add Add result
Shift left 2
Read register 1 PC Address Read data 1 Read register 2 Zero Instruction Registers Read ALU ALU 0 Write data 2 result Address Read 1 data Instruction register M u Data M memory u Write x memory x data 1 0 Write data 16 32 Sign extend
What do we need to add to actually split the dhi?datapath into stages? Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 106 Pipelin ed Dat apath
0 M u x 1
IF/ID ID/EX EX/MEM MEM/WB
Add
Add 4 Add result
Shift left 2
n Read register 1 PC Address Read data 1 Read Zero Instruction Instructio register 2 Registers Read ALU ALU memory 0 Read Write data 2 result Address 1 register M data M u Data u Write x memory x data 1 0 Write data 16 32 Sign extend Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 107 Corr ect ed Datapa th
0 M u x 1
IF/ID ID/EX EX/MEM MEM/WB
Add
4 Add Add result
Shift left 2
Read register 1 PC Address Read data 1 Read Zero
Instruction register 2 Instruction Registers Read ALU ALU memory 0 Write data 2 result Address Read 1 register M data M u Data Write x u memory x data 1 0 Write data 16 32 Sign extend
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 108 Graphically Representing Pipelines
Time (in clock cycles) Program CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 execution order (in instructions) lw $10, 20($1) IM Reg ALU DM Reg
sub $11, $2, $3 IM Reg ALU DM Reg
Can help with answering questions like:
how many cycles does it take to execute this code?
whihALUdidihat is the ALU doing during cycl l4?e 4?
use this representation to help understand dtdatapa ths Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 109 Pipelin e Contr ol
PCSrc
0 M u x 1
IF/ID ID/EX EX/MEM MEM/WB
Add
Add Add 4 result Branch Shift RegWrite left 2
Read MemWrite register 1 PC Address Read uction
rr data 1 Read ALUSrc ZeroZero MemtoReg Instruction Inst register 2 Registers Read ALU ALU memory 0 Read Write data 2 result Address 1 register M data M u Data Write x u memory x data 1 0 Write data Instruction [15– 0] 16 32 6 Sign ALU extend control MemRead
Instruction [20– 16] 0 M ALUOp Instruction u [15– 11] x 1
RegDst
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 110 Pipelin e contr ol
We have 5 stages . What needs to be controlled in each stage?
Instruction Fetch and PC Incr ement
Instruction Decode / Register Fetch
EtiExecution
Memory Stage
Write Back
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 111 Pipelin e contr ol
How would control be handled in an automobile plant?
a fanc y cont r ol cente r telling ev ery one what to do?
should we use a finite state machine?
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 112 Pipelin e Contr ol
Pass control signals along just like the data
Execution/Address Calculation Memory access stage stage control stage control lines control lines lines Reg ALU ALU ALU Mem Mem Reg Mem to Instruction Dst Op1Op0 Src Branch Read Write write Reg R-format110000010 lw 000101011 sw X0010010X beq X0101000X
WB
Instruction Control M WB
EX M WB
IF/ID ID/EX EX/MEM MEM/WB Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 113 Dat apath with CooControl
PCSrc
ID/EX 0 M u WB x EX/MEM 1 Control M WB MEM/WB
EX M WB IF/ID
Add
Add 4 Add result Branch Shift left 2
RegWrite ALUSrc Read
register 1 MemWrite PC Address Read toReg uction
rr data 1 Read mm Zero
Inst register 2 Instruction Me Registers Read ALU memory 0 ALU Write data 2 result Address Read 1 register M data u Data M Write x memory u x data 1 0 Write data
Instruction 16 32 6 [15– 0] Sign ALU MemRead extend control
Instruction [20– 16] 0 ALUOp M Instruction u [15– 11] x 1 RegDst
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 114 Depeepedecesndencies
Problem with starting next instruction before first is finished
dependencies that “go backward in time” are data hazards
Time (in clock cycles)
Value of CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 register $2: 10 10 10 10 10/– 20 – 20 – 20 – 20 – 20 Program execution order (in instructions) sub $2, $1, $3 IM Reg DM Reg
and $12, $2, $5 IM Reg DM Reg
or $13, $6, $2 IM Reg DM Reg
add $14, $2, $2 IM Reg DM Reg
sw $15, 100($2) IM Reg DM Reg
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 115 Softw ar e Sol uti on
Have compiler guarantee no hazards
Where do we insert the “nops” ?
sub $2, $1 , $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2)
Problem: this really slows us down!
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 116 MIPS Pipeline
Five stages , one step per stage 1. IF: Instruction fetch from memory 2. ID: Ins truc tion deco de & regi ist er rea d 3. EX: Execute operation or calculate address 4. MEM: Access memory operand 5. WB: Wr ite resu lt back t o regi ist er
117 Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle datapath Instr Instr fetch Register ALU op Memory Register Total time read access write lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps beq 200ps 100 ps 200ps 500ps
118 Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
119 Pipeline Speedup
If all stages are balanced
i.e., all take the same time
Time bet ween inst ruc tionspipelined = Time between instructionsnonpipelined Number of stages
If not balanced, speedup is less
Speedup due to increased throughput
Latencyy( (time for each instruction) does not decrease
120 Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32-bits
Easier to fetch and decode in one cycle
c.f. x86: 1- to 17-byte instructions
Few and regular instruction formats
Can decode and read registers in one step
Load/store addressing rd Cllddi3Can calculate address in 3 stage, access memory in 4th stage
Alignment of memory operands
Memory access takes only one cycle 121 Hazards
Situations that prevent starting the next instruction in the next cycle Structure hazards
A required resource is busy Data hazard
Need to wait for previous instruction to complete its data read/write Control hazard
Deciding on control action depends on previous instruction 122 Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for thlhat cycle
Would cause a pipeline “bubble”
HiliddhiHence, pipelined datapaths require separate instruction/data memories
Or separate instruction/data caches
123 Data Hazards
An instruction depppends on completion of data access by a previous instruction add $s0,,, $t0, $t1 sub $t2, $s0, $t3
124 Forwarding (aka Bypassing)
Use result when it is computed
Don’t wait for it to be stored in a register
Requires extra co nnectio ns in the da tapa th
125 Load-Use Data Hazard
Can’t alwayyygs avoid stalls by forwarding
If value not computed when needed
Can’t forward backward in time!
126 Fooadgrwarding
Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 Value of register $2 : 10 10 10 10 10/– 20 – 20 – 20 – 20 – 20 Value of EX/MEM : XXX–20XXXXX Value of MEM/WB : XXXX–20XXXX
Program execution order (in instructions) sub $2, $1, $3 IM Reg DM Reg
and $12, $2, $5 IM Reg DM Reg
or $13, $6, $2 IM Reg DM Reg
add $14, $2, $2 IM Reg DM Reg
sw $15, 100($2) IM Reg DM Reg
Electrical and Computer Engineering what if this $2 was $13? School of Engineering THE COLLEGE OF NEW JERSEY 127 Fooadgrwarding
ID/EX
WB EX/MEM
Control M WB MEM/WB
IF/ID EX M WB
M u x Registers
Instruction Instruction Data PC ALU memory memory M u M x u x
IF/ID.RegisterRs Rs IF/ID.RegisterRt Rt IF/ID.RegisterRt Rt M EX/MEM.RegisterRd IF/ID.RegisterRd Rd u x
Forwarding MEM/WB.RegisterRd unit
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 128 Can't alw ays f orw ar d
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to the same register
Time (in clock cycles) Program CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 execution order (in instructions) lw $2, 20($1) IM Reg DM Reg
and $4, $2, $5 IM Reg DM Reg
or $8, $2, $6 IM Reg DM Reg
add $9, $4, $2 IM Reg DM Reg
slt $1, $6, $7 IM Reg DM Reg
Thus, we need a hazard detection unit to “stall” the load instruction Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 129 SSatallin g
We can stall the pipeline by keeping an instruction in the same stage
Program Time (in clock cycles) execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10 order (in instructions)
lw $2, 20($1) IM Reg DM Reg
and $4, $2, $5 IM Reg Reg DM Reg
or $8, $2, $6 IM IM Reg DM Reg
bubble
add $9, $4, $2 IM Reg DM Reg
slt $1, $6, $7 IM Reg DM Reg
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 130 Haaadzard Deecoetection Unit
Stall byyg letting an instruction that won’t write anything go forward
Hazard ID/EX.MemRead detection unit ID/EX
WB EX/MEM M Control u M WB IF/IDWrite x MEM/WB 0 IF/ID EX M WB te M u PCWri x Registers
Instruction Instruction Data PC ALU memory memory M u M x u x
IF/ID.RegisterRs IF/ID.RegisterRt Rt IF/ID.RegisterRt M EX/MEM.RegisterRd IF/ID.RegisterRd Rd u x ID/EX.RegisterRt Rs Forwarding MEM/WB.RegisterRd Rt unit
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 131 Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction CCcodeo code for A = B + E;;C C = B + F;
lw $t1, 0($t0) lw $t1, 0($t0) lw $t2, 4($t0) lw $t2, 4($t0)
stall add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) stall add $t5, $t1, $t4 add $t5, $t1, $t4 sw $t5, 16($t0) sw $t5, 16($t0) 13 cycles 11 cycles 132 Control Hazards
Branch determines flow of control
Fetching next instruction depends on branch outcome
Pipeline can’t always fetch correct instruction
Still working on ID stage of branch In MIPS pipeline
Need to compare registers and compute target early in the pipeline
Add hardware to do it in ID stage
133 Stall on Branch
Wait until branch outcome determined before fetching next instruction
134 Flushin g Instruc ti on s
IF.Flush
Hazard detection unit M ID/EX u x WB EX/MEM M Control u M WB MEM/WB 0 x
IF/ID EX M WB
4 Shift left 2 M u x Registers = Instruction Data PC ALU memory memory M u M x u x
Sign extend
M u x Forwarding unit
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 135 Branch Prediction
Longer pipelines can ’t readily determine branch outcome early
Sta ll pena lty bec omes un accept a ble
Predict outcome of branch
Only stall if prediction is wrong
In MIPS pipeline
Can predict branches not taken
Fetch instruction after branch, with no delay
136 MIPS with Predict Not Taken
Prediction correct
Prediction incorrect
137 More-Realistic Branch Prediction
Static branch prediction
Based on typical branch behavior
Example: loop and if -statement branches
Predict backward branches taken
Predict forward branches not taken
Dynamic branch prediction
Hardware measures actual branch behavior
e. g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetchinggp, and update history
138 Branch Prediction
Sophisticated Techniques:
A “branch target buffer” to help us look up the destination
Correlating predictors that base prediction on gllblbhobal behavi or and recently executed branches (e.g., prediction fififor a specific branch instruction based on what happened in previous branches)
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 139 Branch Prediction
Sophisticated Techniques:
Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.
A“bA “branch hd del ay sl ot” whi hihthch the compil er t tiries t o fill with a useful instruction (make the one cycle dldelay part of fthISA) the ISA)
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 140 Branch Prediction
Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
Modern processors predict correctly 95% of the time!
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 141 Impporovin g Peoerform an ce
Try and avoid stalls! E .g ., reorder these instructions:
lw $t0, 0($t1) lw $t2, 4($t1) sw $t2, 0($t1) sw $t0, 4($t1)
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 142 Impporovin g Peoerform an ce
Dyypnamic Pipeline Scheduling
Hardware chooses which instructions to execute next
Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
Speculates on branches and keeps the pipeline fllfull (may need to rollback if prediction incorrect)
TiTrying t o expl lititoit instructi on-level parall eli sm Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 143 Impporovin g Peoerform an ce
Add a “branch delay slot”
the next instruction after a branch is always executed
rely on compiler to “fill” the slot with something useful
Superscalar: start more than one instruction in the same cycle
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 144 Dyn ami c Sch edulin g
The hardware performs the “scheduling”
hardware tries to find instructions to execute
out of order execution is possible
speculative execution and dynamic branch prediction
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 145 Pipeline Summary
The BIG Picture
Pipelining improves performance by iiittithhtincreasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards
Structure, data, control
Instruction set design affects complexity of pipeline implementation 146 MIPS Pipelined Datapath
MEM
Right-to-left WB flow leads to hazards
147 Pipeline registers
Need regggisters between stages
To hold information produced in previous cycle
148 Pipeline Operation
Cycle-by-cycle flow of instructions through the pipelined datapath
“Single-clock -cycle” pipeline di ag ram
Shows pipeline usage in a single cycle
Highlight resources used
c.f. “multi-clock-cycle” diagram
Graph of operation over time
We’ll look at “single-clock-cycle” diagrams for load & store
149 IF for Load, Store, …
150 ID for Load, Store, …
151 EX for Load
152 MEM for Load
153 WB for Load
Wrong register number
154 Corrected Datapath for Load
155 EX for Store
156 MEM for Store
157 WB for Store
158 Multi-Cycle Pipeline Diagram
Form showing resource usage
159 Multi-Cycle Pipeline Diagram
Traditional form
160 Single-Cycle Pipeline Diagram
State of pppipeline in a g iven cy cle
161 Pipelined Control (Simplified)
162 Pipelined Control
Control signals derived from instruction
As in single-cycle implementation
163 Pipelined Control
164 Daaata Haadsazards in AL U Instruc ucotion s
Consider this sequence: sub $2, $1,$3 and $12,$2,$5 or $13,$6,$2 add $$,14,$2,$2 sw $15,100($2)
We can resolve hazards with forwarding
How do we detect when to forward?
165 Dependencies & Forwarding
166 Detecting the Need to Forward
Pass regggister numbers along pipeline
e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ALU operand register numbers in EX stage are given by
ID//gEX.RegisterRs ,/g, ID/EX.RegisterRt Data hazards when Fwd from 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs EX/MEM pipeline reg 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs Fwd from MEM/WB 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt pipeline reg
167 Detecting the Need to Forward
But only if forwarding instruction will write to a register!
EX/MEM.RegW rite , MEM/WB .RegW rite
And only if Rd for that instruction is not $zero
EX/MEM.RegisterRd ≠ 0, MEM/WB.R egi isterRd ≠ 0
168 Forwarding Paths
169 Forwarding Conditions
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 170 Double Data Hazard
Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4
Both hazards occur
Want to use the most recent
RiRevise MEM MEMh hazard conditi on
Only fwd if EX hazard condition isn’t true
171 Revised Forwarding Condition
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
172 Datapath with Forwarding
173 Load-Use Data Hazard
Need to stall for one cycle
174 Load-Use Hazard Detection
Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by
IF/ID. RegisterRs, IF/ID. RegisterRt Load-use hazard when
ID/EX. MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble
175 How to Stall the Pipeline
Force control values in ID/EX register to 0
EX, MEM and WB do nop (no-operation)
Prevent update of PC and IF/ID register
Using instruction is decoded again
Following instruction is fetched again
1-cycle stall allows MEM to read data for lw
Can subsequently forward to EX stage
176 Stall/Bubble in the Pipeline
Stall inserted here
177 Stall/Bubble in the Pipeline
Or, more accurately… 178 Datapath with Hazard Detection
179 Stalls and Performance
The BIG Picture
Stalls reduce performance
But are required to get correct results
Compiler can arrange code to avoid hazards and stalls
Requires knowledge of the pipeline structure
180 Branch Hazards
If branch outcome determined in MEM
Flush these instructions (Set control values to 0)
PC 181 Reducing Branch Delay
Move hardware to determine outcome to ID stage
Target address adder
Register comparator
Example: branch taken 36: sub $10, $4, $8 40: beq $1, $3, 7 44: and $12,,, $2, $5 48: or $13, $2, $6 52: add $14, $4, $2 56: slt $15, $6, $7 ... 72: lw $4, 50($7) 182 Example: Branch Taken
183 Example: Branch Taken
184 Data Hazards for Branches
If a comppgarison register is a destination of 2nd or 3rd preceding ALU instruction
add $1, $2, $3 IF ID EX MEM WB add $4, $5, $6 IF ID EX MEM WB
… IF ID EX MEM WB beq $1, $4, target IF ID EX MEM WB
Can resolve using forwarding
185 Data Hazards for Branches
If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
Need 1 stall cycle
lw $1, addr IF ID EX MEM WB add $4, $5, $6 IF ID EX MEM WB beq stalled IF ID beq $1, $4, tttarget ID EX MEM WB
186 Data Hazards for Branches
If a comparison register is a destination of immediately preceding load instruction
Need 2 stall cycles
lw $1, addr IF ID EX MEM WB beq stalled IF ID beq stalled ID beq $1, $0, tttarget ID EX MEM WB
187 Dynamic Branch Prediction
In deeper and superscalar pipelines , branch penalty is more significant
Use dynamic prediction
Branch prediction buffer (aka branch history table)
Indexed by recent branch instruction addresses
Stores outcome (taken/not taken)
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
188 1-Bit Predictor: Shortcoming
Inner looppp branches mispredicted twice!
outer: … … inner: … … beq …, …, inner … beq …, …, outer
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around 189 2-Bit Predictor
Only change prediction on two successive mispredictions
190 Calculating the Branch Target
Even with predictor , still need to calculate the target address
1-cycle penalt y fo r a t aken bra nch
Branch target buffer
Cache of target addresses
Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken, can fetch target immediately
191 Exceptions and Interrupts
“Unexpected” events requiring change in flow of control
Different ISAs use the terms differently
Exception
Arises within the CPU
e.g., undefined opcode, overflow, syscall, …
Interrupt
From an external I/O controller
Dealing with them without sacrificing performance is hard
192 Handling Exceptions
In MIPS , exceptions managed by a System Control Coprocessor (CP0) Save PC of offendingg( (or interrup ted) instruction
In MIPS: Exception Program Counter (EPC) SiditifthblSave indication of the problem
In MIPS: Cause register
We’ ll assume 1-bit
0 for undefined opcode, 1 for overflow Jump to handler at 8000 00180
193 An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause Example:
Undefined opcode: C000 0000
Overflow: C000 0020
…: C000 0040 Instructions either
Deal with the interrupt, or
Jump to real handler (ISR)
194 Handler Actions
Read cause, and transfer to relevant handler Determine action required If restartable
Take corrective action
use EPC to return to program Otherwise
Terminate program
Report error using EPC, cause, …
195 Exceptions in a Pipeline
Another form of control hazard Consider overflow on add in EX stage add $1, $2, $1
Prevent $1 from being clobbered
Complete previous instructions Flush add and subsequent instructions
Set Cause and EPC register values
Transfer control to handler Similar to mispredicted branch
Use much of the same hardware 196 Pipeline with Exceptions
197 Exception Properties
Restartable exceptions
Pipeline can flush the instruction
Han dler execut es, then re turns to the instruction
Refetched and executed from scratch
PC saved in EPC register
Identifi es causi ng i nst ructi on
Actually PC + 4 is saved
HdlHandler must tdjt adjust
198 Exception Example
Exception on add in 40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $6 4C add $1, $2, $1 50 slt $15, $6, $7 54 lw $16, 50($7) … Handler 80000180 sw $25, 1000($0) 80000184 sw $26, 1004($0) …
199 Exception Example
200 Exception Example
201 Multiple Exceptions
Pipelining overlaps multiple instructions
Could have multiple exceptions at once
Simple approach: deal with exception from earliest instruction
Flush subsequent instructions
“Precise” exceptions
In compppplex pipelines
Multiple instructions issued per cycle
Out-of-order completion
Maintaining precise exceptions is difficult!
202 Imprecise Exceptions
Just stop pipeline and save state
Including exception cause(s)
Let the handler work out
Which instruction(s) had exceptions
Whicctocopeteoush to complete or flush
May require “manual” completion
Simplifies hardware, but more complex handler software
Not feasible for complex multiple-issue out-of-order pipelines
203 Instruction -Level Parallelism (ILP)
Pipelining: executing multiple instructions in parallel To increase ILP
Deeper pipeline
Less work per stage shorter clock cycle
Multiple issue
Replicate pipeline stages multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Ins truc tions Per Cyc le (IPC)
E.g., 4GHz 4-way multiple-issue
16 BIPS, peak CPI = 0.25, peak IPC = 4
Bu t depen denc ies red uce thi s i n practi ce
204 Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into “issue slots”
Compiler detects and avoids hazards
Dyypnamic multiple issue
CPU examines instruction stream and chooses instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at runtime
205 Speculation
“Guess” what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll-back and do the right thing
Common to static and dynamic multiple issue
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate on load
Roll back if location is updated 206 Compiler/Hardware Speculation
Compiler can reorder instructions
e.g., move load before branch
Cild“fiCan include “fix-up”i” inst ruc tions to recover from incorrect guess
HdHardware can l ook kh ahead df for instructions to execute
Buffer results until it determines they are actually needed
Flus h b uff ers on incorrect specul ati on
207 Speculation and Exceptions
What if exception occurs on a speculatively executed instruction?
e. g., speculative load before null-pointer check Static speculation
Can add ISA support for deferring exceptions Dynamic speculation
Can buffer exceptions until instruction completion (which may not occur)
208 Static Multiple Issue
Compiler groups instructions into “issue packets”
Gr ou p of inst ruc tions tha t ca n be i ssued on a single cycle
Determined by pipeline resources required
Think of an issue packet as a very long itinstructi on
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
209 Scheduling Static Multiple Issue
Compiler must remove some/all hazards
Reorder instructions into issue packets
NdNo depen denci es with a pack ktet
Possibly some dependencies between packets
Varies between ISAs; compiler must know!
Pad with nop if necessary
210 MIPS with Static Dual Issue
Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store
Pad an unused instruction with nop
Address Instruction type Pipeline Stages n ALU/branch IF ID EX MEM WB n + 4 Load/store IF ID EX MEM WB n + 8 ALU/branch IF ID EX MEM WB n + 12 Load/store IF ID EX MEM WB n + 16 ALU/branch IF ID EX MEM WB n + 20 Load/store IF ID EX MEM WB 211 MIPS with Static Dual Issue
212 Hazards in the Dual-Issue MIPS
More instructions executing in parallel
EX data hazard
Forwarding avoided stalls with single-issue
Now can’t use ALU result in load/store in same packet
add $t0,,, $s0, $s1 load $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard
Still one cycle use latency, but now two instructions
More aggressive scheduling required 213 Scheduling Example
Schedule this for dual-issue MIPS
Loop: lw $t0, 0($s1) # $t0=array element addu $t0, $t0, $s2 # add scalar in $s2 sw $t0, 0($s1) # store result addi $s1, $s1,–4 # decrement pointer bne $s1, $zero, Loop # branch $s1!=0
ALU/branch Load/store cycle Loop: nop lw $t0, 0($s1) 1 addi $1$s1, $1$s1,–4 nop 2 addu $t0, $t0, $s2 nop 3 bne $s1, $zero, Loop sw $t0, 4($s1) 4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2) 214 Loop Unrolling
Replicate loop body to expose more parallelism
Reduce s loop -contr ol o ve rhe a d
Use different registers per replication
Called “register renaming”
Avoid loop-carried “anti-dependencies”
Store followed by a load of the same register
Aka “name dependence”
Reuse of a register name
215 Loop Unrolling Example
ALU/branch Load/store cycle Loop: addi $s1, $s1,–16 lw $t0, 0($s1) 1 nop lw $t1, 12($s1) 2 addu $t0, $t0, $s2 lw $t2, 8($s1) 3 addu $t1, $t1, $s2 lw $t3, 4($s1) 4 addu $t2, $t2, $s2 sw $t0, 16($s1) 5 addu $t3, $t4, $s2 sw $t1, 12($s1) 6 nop sw $t2, 8($s1) 7 bne $s1, $zero, Loop sw $t3, 4($s1) 8
IPC = 14/8 = 1.75
Closer to 2, but at cost of registers and code size
216 Dynamic Multiple Issue
“Superscalar” processors
CPU decides whether to issue 0, 1, 2, … eachlh cycle
Avoiding structural and data hazards
Avoids the need for compiler scheduling
Though it may still help
Code semantics ensured by the CPU
217 Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls
Bu t c ommit r esu lt to r egis ters in or der
Example lw $t0, 20($s2) addu $t1, $t0, $t2 sub $s4, $s4, $t3 slti $t5, $s4, 20
Can start sub while addu is waiting for lw
218 Dynamically Scheduled CPU
Preserves dependencies
Hold pending operands
Results also sent to any waiting reservation stations
Reorders buffer for regiitster wr ites Can supply operands for issued instructions 219 Register Renaming
Reservation stations and reorder buffer effectively provide register renaming On instruction issue to reservation station
If operan d is avail abl e i n regi st er fil e or reord er buffer
Copied to reservation station
No longer required in the register; can be overwritten
If operand is not yet available
It will be pr ovi ded to t he rese rv ati on stati on by a function unit
Register update may not be required
220 Speculation
Predict branch and continue issuing
Don’t commit until branch outcome determined
Load speculation
AidldAvoid load and cach e mi ss d dlelay
Predict the effective address
Predict loaded value
Load before completing outstanding stores
Bypass stored values to load unit
Don’t commit load until speculation cleared 221 Why Do Dynamic Scheduling?
Why not just let the compiler schedule code?
NtNot a lltllll stalls are predi diblcable
e.g., cache misses
Can’t always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
222 Does Multiple Issue Work?
The BIG Picture
Yes, but not as much as we’d like
PhlddithtliitILPPrograms have real dependencies that limit ILP
Some dependencies are hard to eliminate
e.g., poitinter ali liiasing
Some parallelism is hard to expose
Limited window size during instruction issue
Memory delays and limited bandwidth
Hard to keep pipelines full
Speculation can help if done well 223 Power Efficiency
Comppylexity of dy namic scheduling and speculations requires power
Multiple simpler cores may be better
Microprocessor Year Clock Rate Pipeline Issue Out-of-order/ Cores Power Stages width Speculation i486 1989 25MHz 5 1 No 1 5W Pentium 1993 66MHz 5 2 No 1 10W Pentium Pro 1997 200MHz 10 3 Yes 1 29W P4 Will amett e 2001 2000MHz 22 3 Yes 1 75W P4 Prescott 2004 3600MHz 31 3 Yes 1 103W Core 2006 2930MHz 14 4 Yes 2 75W UltraS parc III 2003 1950MHz 14 4 No 1 90W UltraSparc T1 2005 1200MHz 6 1 No 8 70W
224 The Opteron X4 Microarchitecture
72 physical registers
225 The Opteron X4 Pipeline Flow
For integgper operations
FP is 5 stages longer
Up to 106 RISC-ops in progress
BttlBottlenec ks
Complex instructions with long dependencies
Branch mispredictions
Memory access delays 226 Fallacies
Pipelining is easy (!)
The basic idea is easy
The devil is in the details
e.g., detecting data hazards
Pipelining is independent of technology
So why haven’t we always done pipelining?
More transistors make more advanced techniques fiblfeasible
Pipeline-related ISA design needs to take account of technology trends
e.g., predicated instructions
227 Pitfalls
Poor ISA design can make pipelining harder
e. g., c omplex in st ruction set s (VAX, IA -32)
Significant overhead to make pipelining work
IA-32 micro -op approach
e.g., complex addressing modes
Register update side effects , memory indirection
e.gg,., delay ed branches
Advanced pipelines have long delay slots
228 Advanced Pipelining
This class has ggyiven you the backg round you need to learn more!
Electrical and Computer Engineering School of Engineering THE COLLEGE OF NEW JERSEY 229 Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall 230