3/12/2015

Lecture 19, 20 & 21: Processor Pipelining – Part I
March 9, 11, 13, 2015
Reading: Chapter 4, Patterson & Hennessy textbook
Prof. R. Iris Bahar
© 2015 R.I. Bahar
Portions of these slides taken from Professors S. Reda and D. Patterson

Homework #2 and Lab #4

• Homework #2 has been posted
  • Due Friday, March 20
  • Pipeline hazards covered today and next week.
• Lab #4 has been posted
  • Design a single-cycle processor
• First due date: Friday, March 20
  • Demo the first set of instructions executing through your processor design
• Second due date: Friday, April 3
  • Demo the full set of instructions executing correctly on your processor
  • Final report due at this time
• Before starting on the lab, go through the TimingQuest and PLL tutorials.
• This lab is to be completed INDIVIDUALLY

Single-Cycle MIPS Processor

• Fetch instruction @ PC
• Decode instruction
• Fetch operands
• Execute instruction
• Store result
• Update PC

Complete Single Cycle Processor

• Datapath
  • Datapath for all instructions except jump
• Control


Single Cycle processor with Control [without jumps]

Datapath and control with jumps

Performance Issues

• Longest delay determines clock period
  • Critical path: load instruction
  • Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
  • Making the common case fast
• We will improve performance by pipelining

Pipelining Analogy

• Pipelined laundry: overlapping execution
  • Parallelism improves performance


MIPS stages

Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

MIPS datapath pipeline stages

• Need registers between stages
  • Hold information produced in previous cycle
• Note that the register file is written in the first half of the cycle and read in the second half.

Pipeline datapath abstraction

  • Showing optimal resource usage

Multi-cycle datapath pipeline diagram

  • Traditional form
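As a rough illustration of the traditional form, the sketch below prints which of the five stages each instruction occupies in each cycle. It assumes an ideal pipeline (one instruction enters per cycle, no stalls); the function names and the sample program are illustrative choices, not part of the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one row per instruction; column c holds the stage occupied in cycle c + 1.

    Assumes an ideal pipeline: one instruction issued per cycle, no stalls.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters IF in cycle i + 1
        rows.append(row)
    return rows


def format_diagram(instructions):
    """Render the diagram as text, with a header row of cycle numbers."""
    if not instructions:
        return ""
    rows = pipeline_diagram(instructions)
    width = max(len(text) for text in instructions)
    header = " " * width + "  " + " ".join(f"{c:<3}" for c in range(1, len(rows[0]) + 1))
    lines = [header.rstrip()]
    for text, row in zip(instructions, rows):
        cells = " ".join(f"{cell:<3}" for cell in row)
        lines.append(f"{text:<{width}}  {cells}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample program chosen for illustration only.
    print(format_diagram(["lw $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4"]))
```

Each row is shifted one cycle to the right of the previous one, so n instructions finish in n + 4 cycles when nothing stalls.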


Tracing lw in its journey: 1st cycle

Tracing lw in its journey: 2nd cycle

Tracing lw in its journey: 3rd cycle

Tracing lw in its journey: 4th cycle

Tracing lw in its journey: 5th cycle

• Wrong register number: by the 5th cycle, the write-register field coming from IF/ID belongs to a later instruction, not to lw

Corrected pipeline datapath for lw

Pipeline state in 5th cycle

lw  $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw  $13, 24($1)
add $14, $5, $6

Pipeline Performance

• Assume time for stages is
  • 100ps for register read or write
  • 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps
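The totals in the table can be reproduced with a few lines of Python. This is only a sketch: the stage latencies come from the assumptions on the slide, while the dictionary layout and function names are made up for illustration. It also shows why the single-cycle clock must fit the slowest instruction, whereas a pipelined clock only has to fit the slowest stage.

```python
# Stage latencies in picoseconds, from the slide's assumptions.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses, matching the table rows.
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time_ps(instr):
    """Sum the latencies of the stages this instruction class uses."""
    return sum(STAGE_PS[stage] for stage in USES[instr])


def single_cycle_period_ps():
    """A single-cycle clock must be long enough for the slowest instruction."""
    return max(total_time_ps(instr) for instr in USES)


def pipelined_period_ps():
    """A pipelined clock must be long enough for the slowest stage."""
    return max(STAGE_PS.values())
```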


Single-cycle vs. pipeline performance

• Single-cycle (Tc = 800ps)
• Pipelined (Tc = 200ps)

Pipeline Speedup

• If all stages are balanced
  • i.e., all take the same time
  • $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
• If not balanced, speedup is less
• Speedup due to increased throughput
  • Latency (time for each instruction) does not decrease
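To make the throughput-versus-latency point concrete, here is a small sketch using the 800ps and 200ps clock periods from the slides and a five-stage pipeline. The function names are illustrative, and the model assumes n ≥ 1 instructions with no stalls.

```python
N_STAGES = 5
SINGLE_CYCLE_TC_PS = 800  # clock period of the single-cycle design
PIPELINED_TC_PS = 200     # clock period of the pipelined design


def single_cycle_time_ps(n):
    """Total time for n instructions, one full clock period each."""
    return n * SINGLE_CYCLE_TC_PS


def pipelined_time_ps(n):
    """Total time for n instructions: fill the pipeline, then finish one per cycle."""
    return (n + N_STAGES - 1) * PIPELINED_TC_PS


def speedup(n):
    """Ratio of single-cycle to pipelined execution time for n instructions."""
    return single_cycle_time_ps(n) / pipelined_time_ps(n)


def pipelined_latency_ps():
    """Time for one instruction to pass through every stage of the pipeline."""
    return N_STAGES * PIPELINED_TC_PS
```

Under these numbers the speedup approaches 800/200 = 4 for long instruction streams rather than 5, because the stages are not balanced, and the latency of a single instruction is 1000ps in the pipeline versus 800ps in the single-cycle design.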


Pipeline datapath summary

• How many cycles does it take to execute this code?

Reminder of single-cycle control


ALU Control

• Assume 2-bit ALUOp derived from opcode
• Combinational logic derives ALU control
• Define additional ALU control encodings to expand its functionality

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
R-type  10     subtract          100010  subtract          0110
R-type  10     AND               100100  AND               0000
R-type  10     OR                100101  OR                0001
R-type  10     set-on-less-than  101010  set-on-less-than  0111

Main decoder

Instruction  Op[5:0]  RegWrite  RegDst  ALUSrc  Branch  MemRead  MemWrite  MemtoReg  ALUOp[1:0]
R-type       000000   1         1       0       0       0        0         0         10
lw           100011   1         0       1       0       1        0         1         00
sw           101011   0         X       1       0       0        1         X         00
beq          000100   0         X       0       1       0        0         X         01
addi         001000   1         0       1       0       0        0         0         00
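The two tables can be read as lookup functions. The sketch below transcribes them into Python: the table entries come from the slides, while the function names, the dictionary layout, and the use of None for don't-care (X) entries are assumptions made for illustration.

```python
# Main decoder: opcode -> control signals. None stands for a don't-care (X).
MAIN_DECODER = {
    0b000000: dict(RegWrite=1, RegDst=1, ALUSrc=0, Branch=0,
                   MemRead=0, MemWrite=0, MemtoReg=0, ALUOp=0b10),       # R-type
    0b100011: dict(RegWrite=1, RegDst=0, ALUSrc=1, Branch=0,
                   MemRead=1, MemWrite=0, MemtoReg=1, ALUOp=0b00),       # lw
    0b101011: dict(RegWrite=0, RegDst=None, ALUSrc=1, Branch=0,
                   MemRead=0, MemWrite=1, MemtoReg=None, ALUOp=0b00),    # sw
    0b000100: dict(RegWrite=0, RegDst=None, ALUSrc=0, Branch=1,
                   MemRead=0, MemWrite=0, MemtoReg=None, ALUOp=0b01),    # beq
    0b001000: dict(RegWrite=1, RegDst=0, ALUSrc=1, Branch=0,
                   MemRead=0, MemWrite=0, MemtoReg=0, ALUOp=0b00),       # addi
}

# R-type funct field -> 4-bit ALU control.
ALU_CONTROL_BY_FUNCT = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op, funct):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:   # lw, sw, addi: add (address or immediate sum)
        return 0b0010
    if alu_op == 0b01:   # beq: subtract to compare
        return 0b0110
    if alu_op == 0b10:   # R-type: operation selected by funct
        return ALU_CONTROL_BY_FUNCT[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:02b}")


def decode(instruction_word):
    """Return the control signals for a 32-bit MIPS instruction word."""
    opcode = (instruction_word >> 26) & 0x3F
    funct = instruction_word & 0x3F
    signals = dict(MAIN_DECODER[opcode])
    signals["ALUControl"] = alu_control(signals["ALUOp"], funct)
    return signals
```

For example, decode(0x02324020), the encoding of add $t0, $s1, $s2, yields RegWrite = 1, RegDst = 1 and an ALU control of 0010.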


Control signals

Modifications to pipeline control

• Control signals are derived from instructions
  • Same as in single-cycle implementation
  • Control is carried over to the proper pipeline stage


Pipelined datapath + control

Example: Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

Cycle 8

Cycle 9

Pipelining Hazards

Hazards are situations that prevent starting the next instruction in the next cycle
1. Structural hazards
  • A required resource is busy
2. Data hazards
  • Need to wait for previous instruction to complete its data read/write
3. Control hazards
  • Deciding on control action depends on previous instruction

1. Structural Hazards

• Conflict for use of a resource
• What if in MIPS pipeline we had a single memory for instruction and data?
  • Load/store requires data access
  • Instruction fetch would have to stall for that cycle
  • Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
  • Or separate instruction/data caches
• What about having only one adder in the MIPS pipeline?
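The single-memory question can be explored with a short sketch. It assumes the five-stage pipeline with no other stalls, so that instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4. The function name and the list-of-mnemonics input format are hypothetical.

```python
MEMORY_OPS = ("lw", "sw")


def memory_conflict_cycles(program):
    """Return the cycles (1-indexed) in which IF and MEM would both need a single shared memory.

    `program` is a list of mnemonics in program order. The schedule is the
    ideal stall-free one: instruction i is in IF in cycle i + 1 and in MEM
    in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op in MEMORY_OPS:
            mem_cycle = i + 4
            fetched_index = mem_cycle - 1  # instruction that wants IF in that same cycle
            if fetched_index < len(program):
                conflicts.append(mem_cycle)
    return conflicts
```

With the program ["lw", "add", "sub", "and", "or"], the load's data access in cycle 4 collides with the fetch of the fourth instruction, so that fetch would have to stall for a cycle.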


2. Data Hazards: compute-use

[Pipeline diagram, cycles 1 to 8: each instruction passes through IM, RF, ALU, DM, RF]

add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5

2. Data Hazard: load-use
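For the compute-use sequence (add followed by and, or, sub), the read-after-write dependencies that matter can be found mechanically. The sketch below assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so only consumers one or two instructions behind the producer are at risk. The Instr tuple and function name are illustrative.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write hazard.

    A consumer is at risk if it follows the producer by at most `window`
    instructions and reads the producer's destination register.
    """
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "$0":
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # a later write supersedes this producer
    return hazards


program = [
    Instr("add", "$s0", ("$s2", "$s3")),
    Instr("and", "$t0", ("$s0", "$s1")),
    Instr("or", "$t1", ("$s4", "$s0")),
    Instr("sub", "$t2", ("$s0", "$s5")),
]
print(raw_hazards(program))
```

Under these assumptions the and and or instructions read $s0 too early, while sub reads it in the same cycle that add writes it back and is therefore safe.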


Handling data hazards

A. Compile-time techniques
B. Stall the processor at run time
C. Forward data at run time

A. Compile Time Technique: Code Scheduling

• Reorder code to avoid use of load result in the next instruction
  • C code for A = B + E; C = B + F;
• Compiler must be aware of pipeline structure

Original order (13 cycles):

lw  $t1, 0($t0)
lw  $t2, 4($t0)
stall
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
stall
add $t5, $t1, $t4
sw  $t5, 16($t0)

Reordered (11 cycles):

lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
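The 13-cycle and 11-cycle counts can be checked with a simple model. The sketch assumes a five-stage pipeline in which the only penalty is one stall cycle whenever an instruction uses the result of the load immediately before it; that assumption is inferred from the counts on the slide rather than stated there. The Instr tuple and function name are illustrative.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]    # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]  # registers read


def count_cycles(program, n_stages=5):
    """Cycles to run `program`, charging one stall per load-use pair.

    Simplified model: any read of a register loaded by the immediately
    preceding lw costs one stall cycle; nothing else stalls.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.srcs:
            stalls += 1
    return len(program) + (n_stages - 1) + stalls


original = [
    Instr("lw", "$t1", ("$t0",)),
    Instr("lw", "$t2", ("$t0",)),
    Instr("add", "$t3", ("$t1", "$t2")),
    Instr("sw", None, ("$t3", "$t0")),
    Instr("lw", "$t4", ("$t0",)),
    Instr("add", "$t5", ("$t1", "$t4")),
    Instr("sw", None, ("$t5", "$t0")),
]

# Same instructions with the third load hoisted above the first add.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]
```

Seven instructions need 7 + 4 = 11 cycles with no stalls; the original order adds two load-use stalls, which gives 13.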


A. Compile Time Technique: Insert NOPs

• Insert enough NOPs until result is ready (wastes cycles)
• Doesn’t require HW to detect hazards

[Pipeline diagram, cycles 1 to 10: each instruction passes through IM, RF, ALU, DM, RF]

add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5

B. Run time technique: Stall pipeline

• Detect dependency at run time and insert “bubbles”
• Prevent new instruction from advancing in pipeline

add $s0, $t0, $t1
sub $t2, $s0, $t3
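The NOP-insertion technique is simple enough to sketch as a compiler pass. The version below assumes no forwarding and a register file written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer. The Instr tuple, the NOP constant, and the function name are illustrative.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


NOP = Instr("nop", None, ())


def insert_nops(program, min_distance=3):
    """Return a copy of `program` padded with NOPs so that every consumer is at
    least `min_distance` slots after the instruction producing its operand."""
    out = []
    for instr in program:
        needed = 0
        # Only the last (min_distance - 1) emitted slots are too close to read from.
        for back in range(1, min_distance):
            if back > len(out):
                break
            earlier = out[-back]
            if earlier.dest not in (None, "$0") and earlier.dest in instr.srcs:
                needed = max(needed, min_distance - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    Instr("add", "$s0", ("$s2", "$s3")),
    Instr("and", "$t0", ("$s0", "$s1")),
    Instr("or", "$t1", ("$s4", "$s0")),
    Instr("sub", "$t2", ("$s0", "$s5")),
]
padded = insert_nops(program)
```

For this program the pass places two NOPs after the add, which matches the six-instruction, ten-cycle sequence in the diagram.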


How do we stall the pipeline? Inserting a bubble

• Do not update PC or IF/ID
  • Instruction in ID stage is decoded again, instruction in IF stage is fetched again
• Force control values in ID/EX register to 0
  • Essentially passes on a NOP instruction to the EX stage
• Inserting a 2-cycle stall allows results to be written to register file before reading them in ID stage.

C. Data forwarding during runtime

• Don’t wait for result to be stored in a register
  • Forward the results from wherever they happen to be
• Requires extra connections in the datapath
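The stall mechanism (hold PC and IF/ID, zero the control values going into ID/EX) can be modeled in a few lines. This is a behavioral sketch only, not the hardware: the PipelineState fields, the step function, and the needs_stall rule (no forwarding, stall while the producer is still in EX or MEM) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    pc: int = 0
    if_id: str = "nop"                                  # instruction held in IF/ID
    id_ex_control: dict = field(default_factory=dict)   # control values entering EX


def needs_stall(id_srcs, ex_dest, mem_dest):
    """No-forwarding rule: stall while a producer of a source register is in EX or MEM."""
    pending = {d for d in (ex_dest, mem_dest) if d not in (None, "$0")}
    return any(src in pending for src in id_srcs)


def step(state, fetched, decoded_control, stall):
    """Advance the front of the pipeline by one cycle.

    On a stall, PC and IF/ID keep their values (the same instruction is
    fetched and decoded again) and all control values entering ID/EX are
    forced to 0, which sends a bubble down the pipeline.
    """
    if stall:
        bubble = {name: 0 for name in decoded_control}
        return PipelineState(pc=state.pc, if_id=state.if_id, id_ex_control=bubble)
    return PipelineState(pc=state.pc + 4, if_id=fetched, id_ex_control=dict(decoded_control))
```

With add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, needs_stall is true while add is in EX and again while it is in MEM, which gives the 2-cycle stall described above.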


Dependencies and forwarding

Circuitry for forwarding

One more MUX for immediates

When should data be forwarded?

• EX/MEM.RegWrite and/or MEM/WB.RegWrite are true
• Destination register(s) are equal to the source registers of the next 1-2 instructions. That is,
  • EX/MEM.RegisterRd == ID/EX.RegisterRs
  • EX/MEM.RegisterRd == ID/EX.RegisterRt
  • MEM/WB.RegisterRd == ID/EX.RegisterRs
  • MEM/WB.RegisterRd == ID/EX.RegisterRt
• Dest. reg. in EX/MEM and/or MEM/WB is not $0.
  • EX/MEM.RegisterRd ≠ 0
  • MEM/WB.RegisterRd ≠ 0
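The forwarding conditions combine into a small piece of selection logic. The sketch below is one way to write it, not the circuit from the slides: registers are plain integers, the return values are descriptive strings rather than MUX select encodings, and giving EX/MEM priority over MEM/WB when both match (so the most recent result wins) is an added assumption.

```python
def forward_source(src, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Decide where one ALU operand should come from.

    Returns "EX/MEM" or "MEM/WB" when the value must be forwarded from that
    pipeline register, or "REGFILE" when the value read in ID is correct.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return "EX/MEM"   # result of the immediately preceding instruction
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return "MEM/WB"   # result of the instruction two ahead
    return "REGFILE"


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    """Return the forwarding decision for the Rs and Rt operands as a pair."""
    args = (ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd)
    return forward_source(id_ex_rs, *args), forward_source(id_ex_rt, *args)
```

For example, with and $t0, $s0, $s1 in EX and add $s0, $s2, $s3 one stage ahead (Rs = 16, Rt = 17, EX/MEM.RegisterRd = 16), the unit selects EX/MEM for Rs and the register file for Rt.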
