
Lecture 8: Simple & Pipelined Processor Designs

Christos Kozyrakis
Stanford University
http://eeclass.stanford.edu/ee108b

Announcements

• Upcoming deadlines
  – HW2 due today
  – PA1 due on Thursday 2/8
• Quiz 1 review session on Friday
• Quiz 1
  – Tue 2/6, 7pm–9pm, location TBD
  – Local SCPD students must come to Stanford for the midterm
  – Covers lectures 1-7
  – Closed book, 1 page of notes + green card, calculator

C. Kozyrakis EE108b Lecture 8 1 C. Kozyrakis EE108b Lecture 8 2

Review: How to Execute Instructions

• First we need to:
  – Fetch the instruction
• Then we need to:
  – Decode instruction / fetch register operands
• Then we need to:
  – Do the operation
• Then we need to:
  – Write the result into register-file
• Finally we need to:
  – Calculate the next instruction address

Review: Datapath for Instruction Fetch Unit

[Figure: the PC drives the read address of the instruction memory, which outputs Instruction [31–0]; an adder computes PC + 4.]
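The five steps above can be sketched as a simple interpreter loop. This is only an illustration under stated assumptions: the tuple instruction format `(op, rd, rs, rt)`, the `run` function, and the two supported operations are hypothetical and do not reflect real MIPS encodings or the datapath in the figures.

```python
# Minimal sketch of the five execution steps for a toy machine.
# Assumption: instructions are hypothetical tuples (op, rd, rs, rt), not MIPS words.

def run(program, regs):
    pc = 0
    while pc // 4 < len(program):
        instr = program[pc // 4]      # 1. fetch the instruction
        op, rd, rs, rt = instr        # 2. decode / fetch register operands
        a, b = regs[rs], regs[rt]
        if op == "add":               # 3. do the operation
            result = a + b
        elif op == "sub":
            result = a - b
        else:
            raise ValueError(f"unknown op: {op}")
        regs[rd] = result             # 4. write the result into the register file
        pc = pc + 4                   # 5. calculate the next instruction address
    return regs


final_regs = run([("add", 3, 1, 2), ("sub", 4, 3, 1)], [0, 5, 7, 0, 0])
print(final_regs)  # [0, 5, 7, 12, 7]
```

Every instruction passes through all five steps inside one loop iteration, which mirrors the single-cycle view of execution reviewed here.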

Review: Datapath for Arithmetic & Logical Instructions

[Figure: register file, sign-or-zero extender, and ALU, with the RegDst, RegWrite, ALUSrc, and ALUOp control points.]

Review: Load Datapath

• Extend datapath to support other immediate operations
• Extender handles either sign or zero extension
• MUX selects between ALU result and Memory output

[Figure: the same datapath extended with a data memory (MemRead, MemWrite) and a MemtoReg mux on the register write data.]
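A small sketch of the two pieces named above, the extender and the output mux, may help. The function names (`extend`, `mux`, `load_word`) and the dictionary used as data memory are assumptions made for illustration; the polarity `ext_op = 1` for sign extension follows the ExtOp values used for `lw` and `ori` in the control summary later in the lecture.

```python
# Sketch of the extender and the MemtoReg mux in the load datapath.
# Assumption: ext_op = 1 means sign extend, ext_op = 0 means zero extend.

def extend(imm16, ext_op):
    """Extend a 16-bit immediate to a 32-bit value."""
    imm16 &= 0xFFFF
    if ext_op and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000     # replicate the sign bit into the upper half
    return imm16                      # upper half is zero


def mux(sel, in0, in1):
    """Two-input multiplexer: sel = 0 picks in0, sel = 1 picks in1."""
    return in1 if sel else in0


def load_word(regs, data_mem, rs, rt, imm16):
    """lw rt, imm16(rs): address = base + sign-extended offset."""
    addr = (regs[rs] + extend(imm16, 1)) & 0xFFFFFFFF   # ALU performs the add
    mem_out = data_mem[addr]
    regs[rt] = mux(1, addr, mem_out)  # MemtoReg = 1 selects the memory output


regs = [0] * 32
regs[1] = 0x100
load_word(regs, {0xFC: 42}, rs=1, rt=2, imm16=0xFFFC)   # offset of -4
print(regs[2])  # 42
```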

Review: Store Datapath

• Read Register 2 is passed on to Memory
• Memory address calculated just as in lw case

[Figure: store datapath; Read data 2 drives the Write data input of the data memory.]

Putting it All Together

[Figure: complete single-cycle datapath, including the PC, instruction memory, PC + 4 adder, branch adder with shift-left-2, and the Branch and Jump muxes.]

Control

• Since every instruction takes one cycle, control is state free!
  – It is just decoded instruction bits
• There are also few control points
  – Control on the multiplexers
  – Operation type for the ALU
  – Write control on the Instruction & Data memories
• First part of cycle does not have any control
  – Which is good, since we don’t have instruction yet
• Look at setting of the control points for different instructions

At Beginning Of Clock Cycle

[Figure: complete single-cycle datapath before any control signal is set.]

Control for Arithmetic

[Figure: single-cycle datapath with control settings for an arithmetic instruction: RegDst = 1, ALUSrc = 0, MemtoReg = 0, RegWrite = 1, MemWrite = 0, Branch = 0, Jump = 0, ExtOp = x.]

Instruction Fetch at End

[Figure: single-cycle datapath at the end of the cycle, highlighting the instruction fetch unit.]

Arithmetic Immediate (ori)

[Figure: control settings for ori: RegDst = 0, ALUSrc = 1, MemtoReg = 0, RegWrite = 1, MemWrite = 0, Branch = 0, Jump = 0, ExtOp = 0, ALU operation = Or.]

Control for Load

[Figure: control settings for lw: RegDst = 0, ALUSrc = 1, MemtoReg = 1, RegWrite = 1, MemWrite = 0, Branch = 0, Jump = 0, ExtOp = 1, ALU operation = Add.]

Control for Store

[Figure: control settings for sw: RegDst = x, ALUSrc = 1, MemtoReg = x, RegWrite = 0, MemWrite = 1, Branch = 0, Jump = 0, ExtOp = 1, ALU operation = Add.]

Control for Branch (beq)

[Figure: control settings for beq: RegDst = x, ALUSrc = 0, MemtoReg = x, RegWrite = 0, MemWrite = 0, Branch = 1, Jump = 0, ExtOp = x, ALU operation = Sub.]

Control for Jump (j)

[Figure: control settings for j: Jump = 1, Branch = 0, RegWrite = 0, MemWrite = 0; all other signals are don't-cares.]

Summary of Control Signals

              add      sub      ori      lw       sw       beq      jump
func          10 0000  10 0010  (not important)
op            00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst        1        1        0        0        x        x        x
ALUSrc        0        0        1        1        1        0        x
MemtoReg      0        0        0        1        x        x        x
RegWrite      1        1        1        1        0        0        0
MemWrite      0        0        0        0        1        0        0
Branch        0        0        0        0        0        1        0
Jump          0        0        0        0        0        0        1
ExtOp         x        x        0        1        1        x        x
ALUctr<2:0>   Add      Sub      Or       Add      Add      Sub      xxx
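Because the control is state free, it can be modeled as a pure lookup from instruction bits to signal values. The sketch below transcribes the summary table into a Python dictionary; the names `SIGNALS`, `CONTROL`, and `decode`, and the use of strings with "x" for don't-cares, are illustrative choices and not part of the lecture.

```python
# Sketch: single-cycle control as a stateless lookup on (op, func).
# Values are copied from the summary table; "x" marks a don't-care.

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemWrite", "Branch", "Jump", "ExtOp", "ALUctr")

CONTROL = {
    ("000000", "100000"): ("1", "0", "0", "1", "0", "0", "0", "x", "Add"),  # add
    ("000000", "100010"): ("1", "0", "0", "1", "0", "0", "0", "x", "Sub"),  # sub
    ("001101", None):     ("0", "1", "0", "1", "0", "0", "0", "0", "Or"),   # ori
    ("100011", None):     ("0", "1", "1", "1", "0", "0", "0", "1", "Add"),  # lw
    ("101011", None):     ("x", "1", "x", "0", "1", "0", "0", "1", "Add"),  # sw
    ("000100", None):     ("x", "0", "x", "0", "0", "1", "0", "x", "Sub"),  # beq
    ("000010", None):     ("x", "x", "x", "0", "0", "0", "1", "x", "xxx"),  # jump
}


def decode(op, func):
    """Return the control signals for 6-bit op and func fields (bit strings)."""
    # Only R-type instructions (op = 000000) look at the func field.
    key = (op, func) if op == "000000" else (op, None)
    return dict(zip(SIGNALS, CONTROL[key]))


print(decode("100011", "000000"))  # settings for lw; func is ignored
```

The function keeps no state between calls, so the same instruction bits always yield the same settings.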

Multilevel Decoding

• Since only the ALU needs the func field
  – Pass it to the ALU unit, and have a local decoder there

              R-type    ori      lw       sw       beq       jump
op            00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst        1         0        0        x        x         x
ALUSrc        0         1        1        1        0         x
MemtoReg      0         0        1        x        x         x
RegWrite      1         1        1        0        0         0
MemWrite      0         0        0        1        0         0
Branch        0         0        0        0        1         0
Jump          0         0        0        0        0         1
ExtOp         x         0        1        1        x         x
ALUop         “R-type”  Or       Add      Add      Subtract  xxx

Multilevel Decoding (cont)

[Figure: op (6 bits) feeds Main Control, which sends ALUop (N bits) to the local ALU Control; ALU Control also takes func (6 bits) and drives ALUctr (3 bits) into the ALU.]

Putting It All Together

[Figure: single-cycle datapath with the Control unit decoding Instruction [31–26] into RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; a separate ALU control block decodes Instruction [5–0].]

Single Cycle Processor

• Advantages
  – Single cycle per instruction makes logic and clock simple
• Disadvantages
  – Inefficient utilization of memory and functional units since different instructions take different lengths of time
    • ALU only computes values a small amount of the time
  – Cycle time is the worst case path → long cycle times
    • Load instruction
  – All machines would have a CPI of 1

Single Cycle Processor Performance

• Functional unit delay
  – Memory: 2 ns
  – ALU and adders: 1 ns
  – Register file: 0.5 ns

Delay per instruction class (ns):

Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write
R-type       2            0.5       1                  0.5       4
load         2            0.5       1          2       0.5       6
store        2            0.5       1          2                 5.5
branch       2            0.5       1                            3.5
jump         2                                                   2

• CPU clock cycle = 6 ns

Variable Clock Single Cycle Processor Performance

• Instruction Mix
  – 45% ALU
  – 25% loads
  – 10% stores
  – 15% branches
  – 5% jumps
• Per-class delays are the totals from the table above
• CPU clock cycle = 4 x 45% + 6 x 25% + 5.5 x 10% + 3.5 x 15% + 2 x 5% = 4.475 ns ≈ 4.5 ns
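The two clock-cycle figures can be reproduced with a few lines of arithmetic. The sketch below uses the functional unit delays and instruction mix from the slides; the dictionary names and the mapping of each instruction class to the units it uses are written out here for illustration.

```python
# Sketch: fixed vs. variable clock for the single-cycle design.
# Delays (ns) and the instruction mix are taken from the slides.

DELAY = {"instr_mem": 2.0, "reg_read": 0.5, "alu": 1.0,
         "data_mem": 2.0, "reg_write": 0.5}

USES = {
    "R-type": ("instr_mem", "reg_read", "alu", "reg_write"),
    "load":   ("instr_mem", "reg_read", "alu", "data_mem", "reg_write"),
    "store":  ("instr_mem", "reg_read", "alu", "data_mem"),
    "branch": ("instr_mem", "reg_read", "alu"),
    "jump":   ("instr_mem",),
}

MIX = {"R-type": 0.45, "load": 0.25, "store": 0.10, "branch": 0.15, "jump": 0.05}

# Time each instruction class needs: sum of the units on its path.
latency = {cls: sum(DELAY[u] for u in units) for cls, units in USES.items()}

# Fixed clock: every instruction pays for the slowest class (load).
fixed_clock = max(latency.values())

# Variable clock: average cycle weighted by how often each class occurs.
variable_clock = sum(latency[cls] * MIX[cls] for cls in MIX)

print(latency)
print(fixed_clock, variable_clock)
```

Under these numbers the variable clock averages 4.475 ns against a fixed 6 ns, a ratio of about 1.34.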

Pipelining: Increasing Parallelism

• Problem:
  – Each functional unit used once per cycle
  – Most of the time it is sitting waiting for its turn
    • Well it is calculating all the time, but it is waiting for valid data
  – There is no parallelism in this arrangement
• Making instructions take more cycles can make machine faster!
  – Each instruction takes roughly the same time
    • While the CPI is much worse, the clock freq is much higher
  – Overlap execution of multiple instructions at the same time
    • Different instructions will be active at the same time
  – This is called “Pipelining”
  – We will look at a 5-stage pipeline
• Modern machines (Pentium 4) take on the order of 20 cycles/instruction

It’s Natural and You Do It All the Time!

• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folding bench” takes 20 minutes

[Figure: four loads of laundry labeled A, B, C, D.]
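The laundry timings can be worked out directly from the three stage times. This is a minimal sketch; the variable names are illustrative, and the pipelined formula assumes identical loads that each start a stage as soon as it is free.

```python
# Sketch: total time for 4 loads of laundry, sequential vs. pipelined.
# Stage times in minutes: washer, dryer, folding bench.

STAGES = [30, 40, 20]
LOADS = 4

# Sequential: each load finishes all three stages before the next begins.
sequential = LOADS * sum(STAGES)

# Pipelined: the first load takes the full 90 minutes; every later load
# finishes one bottleneck interval (the 40-minute dryer) after the previous one.
pipelined = sum(STAGES) + (LOADS - 1) * max(STAGES)

print(sequential / 60, "hours sequential")  # 6.0
print(pipelined / 60, "hours pipelined")    # 3.5
```

The slowest stage, the dryer, sets the rate at which loads complete once the pipeline is full.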

Sequential Laundry

[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes back to back.]

Sequential laundry takes 6 hours for 4 loads

Pipelined Laundry: Start work ASAP

[Figure: timeline from 6 PM; loads A to D overlapped, with intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup

[Figure: pipelined laundry timeline from 6 PM to 9 PM with loads A to D.]

Dividing Up The Execution

• Since we are going to break execution into different clock cycles
  – We want to balance the work done in each clock
  – Break into a “reasonable” number of cycles
• Imagine we are building a system and know the following:
  – Register file – 1 ns
  – ALU operation – 2 ns
  – Memory access – 2 ns
• How to divide up the cycle?
  – Make memory access one cycle, that is largest factor
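The effect of unbalanced stages and of fill time can be estimated with the unit delays given above. The sketch assumes a five-stage split in which instruction fetch and data access each use the memory (2 ns), register read and register write each use the register file (1 ns), and execute uses the ALU (2 ns); that mapping, the unpipelined baseline, and the function name `speedup` are assumptions for illustration.

```python
# Sketch: clock period, latency, and speedup for a 5-stage split.
# Assumption: each stage is dominated by one unit with the delay given above.

STAGE_DELAY_NS = {"IF": 2, "RF/ID": 1, "EX": 2, "MEM": 2, "WB": 1}

# The pipeline clock must fit the slowest stage.
clock_ns = max(STAGE_DELAY_NS.values())

# Each instruction passes through all stages at the pipeline clock rate.
latency_ns = clock_ns * len(STAGE_DELAY_NS)

# Baseline for comparison: an unpipelined cycle that is just the sum of delays.
unpipelined_ns = sum(STAGE_DELAY_NS.values())


def speedup(n_instructions):
    """Unpipelined time divided by pipelined time, including fill time."""
    pipelined_cycles = len(STAGE_DELAY_NS) + n_instructions - 1
    return (n_instructions * unpipelined_ns) / (pipelined_cycles * clock_ns)


print(clock_ns, latency_ns, unpipelined_ns)
print(speedup(1), speedup(1000))
```

Under these assumptions the speedup approaches 4 rather than 5 for long instruction streams, because the 1 ns stages leave half of their cycle unused, and it is lower still for short streams because of the time to fill the pipeline.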

5 Stage Execution

• IF: Instruction Fetch
  – Fetch the instruction from memory
  – Increment the PC
• RF/ID: Register Fetch and Instruction Decode
  – Fetch base register
• EX: Execute
  – Calculate base + sign-extended offset
• MEM: Memory
  – Read the data from the data memory
• WB: Write back
  – Write the results back to the register file

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    IF       RF/ID    EX       MEM      WB

Processor Pipeline

• Fetch a new instruction each cycle
  – Each stage of the pipeline is working on a different instruction

[Figure: pipeline stages IF (Instruction Fetch), RF (Register Fetch), EX (Execution), and WB (Write back), drawn over the PC, instruction memory, register file, ALU, and register file.]

Pipelining Load

• Load instruction takes 5 stages
  – Five independent functional units work on each stage
    • Each functional unit used only once
  – Another load can start as soon as 1st finishes IF stage
  – Each load still takes 5 cycles to complete
  – The throughput, however, is much higher

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw  IF       RF/ID    EX       MEM      WB
2nd lw           IF       RF/ID    EX       MEM      WB
3rd lw                    IF       RF/ID    EX       MEM      WB

Functional Units Are Busy

• Pipelining now keeps all the functional units busy
  – Fetch a new instruction each cycle
  – Fetch registers every cycle
  – Use the ALU almost every cycle
  – Use the Data Memory many cycles
• Instructions still take 10ns to complete
  – But start a new instruction every 2ns
  – Looks like CPI is 1 cycle

Clock   1     2     3     4     5     6     7     8     9
I1      IF    ID    EX    MEM   WB
I2            IF    ID    EX    MEM   WB
I3                  IF    ID    EX    MEM   WB
I4                        IF    ID    EX    MEM   WB
I5                              IF    ID    EX    MEM   WB
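A pipeline diagram like the ones above can be generated mechanically: instruction i enters IF in cycle i and advances one stage per cycle. The sketch below is illustrative only; the function names are assumptions, and it models an ideal pipeline with no stalls.

```python
# Sketch: build an ideal pipeline diagram (no stalls) for n instructions.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; each row has one cell per clock cycle."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        cells = [""] * total_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name       # instruction i is in stage s during cycle i + s
        rows.append(cells)
    return rows


def render(rows):
    """Format the rows as fixed-width text with a clock header."""
    header = " " * 6 + "".join(f"{c:<5}" for c in range(1, len(rows[0]) + 1))
    lines = [header]
    for i, cells in enumerate(rows, start=1):
        lines.append(f"I{i:<5}" + "".join(f"{c:<5}" for c in cells))
    return "\n".join(lines)


diagram = pipeline_diagram(5)
print(render(diagram))
```

Reading the diagram by column shows which stage each instruction occupies in a given cycle; in cycle 5 all five stages are in use at once.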

Pipeline Datapath

[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages.]

Load Datapath: Stage 1

[Figure: pipelined datapath with lw in Instruction Fetch.]

Load Datapath: Stage 2

[Figure: pipelined datapath with lw in Register Fetch.]

Load Datapath: Stage 3

[Figure: pipelined datapath with lw in Execute.]

Load Datapath: Stage 4

[Figure: pipelined datapath with lw in Memory.]

Load Datapath: Stage 5

[Figure: pipelined datapath with lw in Write Back.]

Pipeline Control

• Need to control functional units
  – But they are working on different instructions!
• Not a problem
  – Just pipeline the control signals along with the data
  – Make sure they line up
• Using labeling conventions often helps
  – Instruction_rf – means this instruction is in RF
  – Every time it gets flopped, changes pipestage
• Make sure right signals go to the right places

Control Signals

• Use a Main Control unit to generate signals during RF/ID Stage
  – Control signals for EX
    • (ExtOp, ALUSrc, …) used 1 cycle later
  – Control signals for Mem
    • (MemWr, Branch) used 2 cycles later
  – Control signals for WB
    • (MemtoReg, RegWr) used 3 cycles later
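Pipelining the control signals along with the data can be sketched as three registers that shift one step on every clock edge. The code below is a simplified model, not the lecture's implementation: `CONTROL_BY_OP`, `simulate`, and the grouping of signals are assumptions, with the WB group taken to hold MemtoReg and RegWr and `None` standing for a don't-care or an empty stage.

```python
# Sketch: control signals generated in RF/ID travel through the
# ID/EX, EX/MEM, and MEM/WB pipeline registers with their instruction.

CONTROL_BY_OP = {
    "lw": {"EX": {"ExtOp": 1, "ALUSrc": 1},
           "MEM": {"MemWr": 0, "Branch": 0},
           "WB": {"MemtoReg": 1, "RegWr": 1}},
    "sw": {"EX": {"ExtOp": 1, "ALUSrc": 1},
           "MEM": {"MemWr": 1, "Branch": 0},
           "WB": {"MemtoReg": None, "RegWr": 0}},   # None = don't-care
}


def simulate(ops):
    """Return the (ID/EX, EX/MEM, MEM/WB) control contents after each clock edge."""
    id_ex = ex_mem = mem_wb = None
    trace = []
    for op in list(ops) + [None] * 3:          # trailing bubbles drain the pipeline
        # Update back to front so each register takes the previous stage's old value.
        mem_wb = {"WB": ex_mem["WB"]} if ex_mem else None
        ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]} if id_ex else None
        id_ex = CONTROL_BY_OP[op] if op else None
        trace.append((id_ex, ex_mem, mem_wb))
    return trace


trace = simulate(["lw", "sw"])
for step, (id_ex, ex_mem, mem_wb) in enumerate(trace, start=1):
    print(step, id_ex, ex_mem, mem_wb)
```

Each register keeps only the groups still needed downstream, so the EX signals are dropped after one step, the MEM signals after two, and the WB signals after three.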

Implementing Control

[Figure: Main Control in RF/ID generates the control signals, which are carried through the IF/ID, ID/Ex, Ex/MEM, and MEM/WB registers with stage labels _rf, _ex, _mem, _wb. ExtOp, ALUSrc, ALUOp, and RegDst travel as far as EX; MemWr and Branch as far as MEM; MemtoReg and RegWr as far as WB.]

Putting it All Together

[Figure: pipelined datapath with control. The Control unit's outputs are grouped as EX, M, and WB; ID/EX holds all three groups, EX/MEM holds M and WB, and MEM/WB holds WB. PCSrc selects the next PC.]

Comparison

[Figure: timing of three implementations against the clock. Single Cycle Implementation: Load in cycle 1 and Store in cycle 2, with part of the Store cycle wasted. Multiple Cycle Implementation: Load (IF Reg EX MEM WB), then Store (IF Reg EX MEM), then R-type starting with IF, over cycles 1 to 10. Pipeline Implementation: Load, Store, and R-type each pass through IF Reg EX MEM WB, overlapped by one cycle.]

But Something Is Fishy Here

• If dividing it into 5 parts made the clock faster
  – And the effective CPI is still one
• Then dividing it into 10 parts would make the clock even faster
  – And wouldn’t the CPI still be one?
• Then why not go to twenty cycles?
• Really two issues
  – Some things really have to complete in a cycle
    • Find next PC from current PC
  – CPI is not really one
    • Sometimes you need the results from the previous instruction
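The comparison of the three implementations can be put into rough numbers for the Load, Store, R-type sequence. The sketch rests on assumptions that are not stated in the lecture: the single-cycle clock is taken to span five short cycles, an R-type instruction is taken to need four cycles in the multiple cycle design, and the `stalls` parameter is a hypothetical knob for the extra cycles that appear when an instruction needs a result from the previous one.

```python
# Sketch: time (in short clock cycles) for Load, Store, R-type
# under the three implementations.

SHORT_CYCLES_PER_LONG_CYCLE = 5   # assumption: single-cycle clock fits a full load
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "R-type": 4}   # R-type count assumed
PIPELINE_DEPTH = 5


def single_cycle(program):
    """Every instruction takes one long cycle, whatever it needs."""
    return len(program) * SHORT_CYCLES_PER_LONG_CYCLE


def multi_cycle(program):
    """Each instruction takes only the short cycles it needs, one after another."""
    return sum(MULTICYCLE_CYCLES[instr] for instr in program)


def pipelined(program, stalls=0):
    """Fill the pipeline once, then finish one instruction per cycle, plus stalls."""
    return PIPELINE_DEPTH + len(program) - 1 + stalls


program = ["load", "store", "R-type"]
print(single_cycle(program), multi_cycle(program), pipelined(program))  # 15 13 7

# With stalls the effective CPI rises above one, even for a long program.
n = 1000
print(pipelined(["load"] * n, stalls=200) / n)
```

The last line illustrates the second issue above: once some instructions have to wait for earlier results, the cycles per instruction no longer approach one.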