ECE 486/586
Lecture # 11
Spring 2015
Portland State University

Lecture Topics

• Pipelining – Hazards and Stalls
• Data Hazards – Forwarding
• Control Hazards

Reference:
• Appendix C: Section C.2

RAW Hazards

Clock Cycle             1     2     3     4     5     6     7     8
Add R2, R3, #100        IF    ID    EX    MEM   WB
Subtract R9, R2, #30          IF    ID    stall stall EX    MEM   WB

• Subtract is stalled in the decode stage for two additional cycles to delay reading R2 until the new value of R2 has been written by Add

How is this stall implemented in hardware?
• Control circuitry recognizes the data dependency when it decodes Subtract (comparing source register ids with dest. register ids of prior instructions)
• During cycles 3 to 5:
  – Add proceeds through the pipe
  – Subtract is held in the ID stage
• In cycle 5:
  – Add writes R2 in the first half cycle, Subtract reads R2 in the second half cycle

Stall Penalty

• If the dependent instruction follows right after the producer instruction, we need to stall the pipe for 2 cycles (Stall penalty = 2)
• What happens if the producer and dependent instructions are separated by an intermediate instruction?

Clock Cycle             1     2     3     4     5     6     7     8
Add R2, R3, #100        IF    ID    EX    MEM   WB
Or R4, R5, R6                 IF    ID    EX    MEM   WB
Subtract R9, R2, #30                IF    ID    stall EX    MEM   WB

• In this case, stall penalty = 1
• Stall penalty depends upon the distance between the producer and dependent instruction

Impact of Data Hazards

• Frequent stalls caused by data hazards can impact the performance significantly.
• Example: If every 5th instruction causes a 2-cycle stall due to a data hazard, then the CPI can increase by (1/5)*2 = 0.4
• Would like to avoid stalls caused by RAW hazards
  – Technique: “forwarding”
    • Also called “bypassing” or “short-circuiting”
  – Key Idea: Forward results from later stages in the pipeline to earlier stages

Alleviating Data Hazards

• Stalls due to data dependencies can be mitigated by forwarding
• Consider the two instructions discussed in the previous example
  • The result of the producer instruction is actually available after the completion of the EX stage (cycle 3), when the ALU completes its computation
  • Instead of stalling the dependent instruction, the hardware can forward the result from the output of the EX stage to the ALU, where it can be used by the dependent instruction
  • The figure below shows data being forwarded from the EX stage of the first instruction to the EX stage of the second instruction

Clock Cycle             1     2     3     4     5     6
Add R2, R3, #100        IF    ID    EX    MEM   WB
Subtract R9, R2, #30          IF    ID    EX    MEM   WB      (No stall)

Forwarding path: EX output of Add (cycle 3) -> ALU input of Subtract (cycle 4)
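The stall penalties and the CPI estimate on these slides can be checked with a small Python sketch. It is not part of the original slides: the function names `raw_stall_penalty` and `cpi_with_stalls` are hypothetical, the producer is assumed to be an ALU instruction, and a base CPI of 1.0 (an ideal pipeline) is an assumption.

```python
def raw_stall_penalty(distance: int, forwarding: bool = False) -> int:
    """Stall cycles for a consumer `distance` instructions after an ALU producer.

    Assumes the 5-stage pipeline from the slides, where the producer writes
    the register file in the first half of WB and the consumer reads it in
    the second half of the same cycle.
    """
    if distance < 1:
        raise ValueError("distance must be >= 1")
    if forwarding:
        # EX-to-EX or MEM-to-EX forwarding delivers an ALU result in time.
        return 0
    # Without forwarding, the consumer's ID must coincide with (or follow)
    # the producer's WB, which comes 3 cycles after the producer's ID.
    return max(0, 3 - distance)


def cpi_with_stalls(hazard_fraction: float, stall_cycles: int,
                    base_cpi: float = 1.0) -> float:
    """CPI after adding the average stall cycles per instruction.

    base_cpi = 1.0 is an assumption (ideal pipeline), not a figure from the slides.
    """
    return base_cpi + hazard_fraction * stall_cycles


for d in (1, 2, 3):
    print(f"distance {d}: stall penalty without forwarding = {raw_stall_penalty(d)}")

# Slide example: every 5th instruction causes a 2-cycle stall.
print("CPI:", cpi_with_stalls(1 / 5, 2))
```

The `max(0, 3 - distance)` expression reproduces the two cases worked out on the slides (penalty 2 for adjacent instructions, penalty 1 with one intermediate instruction) and drops to zero once two or more instructions separate producer and consumer.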

Forwarding – Another Example

• In the previous example, data was forwarded from EX stage to EX stage
• Such forwarding mitigates data hazards between successive instructions
• How can we avoid stalls when instruction Ij+2 depends on instruction Ij?
• Solution: Forward operands from the output of the MEM stage to the input of the EX stage
• Example: Forwarding from MEM stage to ALU input

Clock Cycle             1     2     3     4     5     6     7
Add R2, R3, #100        IF    ID    EX    MEM   WB
Or R4, R5, R6                 IF    ID    EX    MEM   WB
Subtract R9, R2, #30                IF    ID    EX    MEM   WB

Forwarding path: MEM output of Add (cycle 4) -> ALU input of Subtract (cycle 5)

Stalls Caused by Loads

• Load instructions can also cause pipeline stalls
• Example 1 (No data dependency): A Load “misses” in the cache and needs to stall for multiple cycles until data is read from memory. All the subsequent instructions stall, even if they have no data dependency with the Load (no data hazard)
  – Operand forwarding cannot mitigate these stalls
• Example 2 (Data dependency): Consider the following instruction sequence:
    Load R2, #60(R3)
    Subtract R9, R2, #30
  – Register R2 causes a data dependency between the two instructions
  – Assume that the Load hits in the cache. Data read from memory is available after the MEM stage and written to R2 in the WB stage
  – In the absence of forwarding, Subtract will stall in the ID stage for 2 extra cycles
  – Can forwarding mitigate this stall?

Operand Forwarding for Load Instructions

Clock Cycle             1     2     3     4     5     6     7
Load R2, #60(R3)        IF    ID    EX    MEM   WB
Subtract R9, R2, #30          IF    ID    stall EX    MEM   WB

Stall Penalty = 1 cycle

• Load instruction cannot forward until the end of cycle 4
• Subtract instruction will need to stall for one cycle
• Forwarding reduces the stall penalty from 2 cycles to 1 cycle

Forwarding for Load instructions reduces stalls but does not always eliminate stalls

Major Hurdle of Pipelining: Hazards

• Three types of pipeline hazards
  – Structural hazard: a situation where two (or more) instructions require the use of a given hardware resource at the same time
  – Data hazard: any condition in which either the source or the destination operands of an instruction are not available when needed in the pipeline
  – Control hazard: a delay in the availability of an instruction or the memory address needed to fetch the instruction
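To see why forwarding removes the stall for an ALU producer but leaves one stall cycle after a Load, the timing rules from the slides can be sketched in Python. This is an illustrative model, not part of the original slides: `Instr`, `ex_cycles` and `total_stalls` are hypothetical names, every Load is assumed to hit in the cache, and each instruction is assumed to write exactly one destination register.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str          # e.g. "Load", "Add", "Subtract"
    dest: str        # destination register
    srcs: List[str]  # source registers


def ex_cycles(program: List[Instr], forwarding: bool = True) -> List[int]:
    """Return the clock cycle in which each instruction enters EX.

    Assumed 5-stage, in-order pipeline (IF ID EX MEM WB); the first
    instruction is fetched in cycle 1 and therefore executes in cycle 3.
    """
    last_writer: Dict[str, Tuple[int, str]] = {}  # reg -> (EX cycle, op)
    cycles: List[int] = []
    prev_ex = 2
    for ins in program:
        ex = prev_ex + 1  # in-order: at best one cycle behind the previous one
        for reg in ins.srcs:
            if reg not in last_writer:
                continue
            producer_ex, producer_op = last_writer[reg]
            if not forwarding:
                gap = 3  # consumer ID must wait for producer WB
            elif producer_op == "Load":
                gap = 2  # data is available only after MEM
            else:
                gap = 1  # ALU result is available after EX
            ex = max(ex, producer_ex + gap)
        cycles.append(ex)
        last_writer[ins.dest] = (ex, ins.op)
        prev_ex = ex
    return cycles


def total_stalls(program: List[Instr], forwarding: bool = True) -> int:
    """Stall cycles accumulated by the last instruction of the program."""
    cycles = ex_cycles(program, forwarding)
    return cycles[-1] - (len(program) + 2) if cycles else 0


load_use = [Instr("Load", "R2", ["R3"]), Instr("Subtract", "R9", ["R2"])]
print("with forwarding:   ", total_stalls(load_use, forwarding=True))
print("without forwarding:", total_stalls(load_use, forwarding=False))
```

The only difference between the Load and ALU cases is the `gap` value, which mirrors the point on the slides: a Load cannot forward until the end of its MEM stage.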

Control Hazards

• Control hazards are caused by a delay in the availability of an instruction or the memory address needed to fetch the instruction
• In ideal pipelined execution, a new instruction is fetched every cycle, while the previous instruction is being decoded
• This will work fine as long as the instruction addresses follow a pre-determined sequence (e.g., PC <- [PC]+4)
• However, branch instructions can alter this sequence
• Branch instructions first need to be executed to determine whether and where to branch
• Pipeline needs to be stalled before the branch outcome is decided

Conditional Branches

• Consider a conditional branch instruction such as
    BEQ R5, R6, LOOP
• Testing the branch condition (comparison between [R5] and [R6]) determines whether the branch is taken or not taken
• Both the comparison and the target address calculation happen in the EX stage (3rd cycle of instruction processing)
• If the branch is taken
  – Cannot fetch the branch target before the start of the 4th cycle
  – No useful instruction fetched in cycles 2 and 3 => Control Hazard

Handling Control Hazards

• “Predicted-not-taken” scheme
  – Increment PC by 4 every cycle and keep fetching sequential instructions after the branch instruction
  – After the branch outcome is computed
    • If the branch condition is false (Not-taken branch):
      – We have already been fetching instructions from the correct path => no problem
    • But if the branch condition is true (Taken branch):
      – Need to start fetching from the branch target
      – What happens to the instructions that have already been fetched on the wrong path?

Control Hazard Example

Clock Cycle             1     2     3     4     5     6     7     8
Ij: BEQ R5, R6, Ik      IF    ID    EX
Ij+1                          IF    ID
Ij+2                                IF

• Let us assume R5 = R6
• In cycle 3, the EX stage determines that the branch is taken and computes the target address
• PC is updated with the branch target address
• Two instructions (Ij+1 and Ij+2) on the wrong path have already been fetched
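The fetch sequence in this example can be traced with a short Python sketch of the predicted-not-taken scheme. It is only an illustration under stated assumptions: the function name `predicted_not_taken` and the `resolve_cycle` parameter are hypothetical, and the branch Ij is assumed to be fetched in cycle 1 and resolved at the end of cycle 3 (its EX stage).

```python
from typing import List, Tuple


def predicted_not_taken(taken: bool, resolve_cycle: int = 3) -> Tuple[List[str], List[str]]:
    """Trace fetches around a branch Ij that is fetched in cycle 1.

    Returns (fetched, flushed): the instruction fetched in each cycle from
    cycle 1 up to the first cycle after the branch resolves, and the
    wrong-path instructions that must be discarded.
    """
    fetched = ["Ij"]
    # Cycles 2..resolve_cycle: keep fetching sequentially (PC <- [PC]+4).
    for n in range(1, resolve_cycle):
        fetched.append(f"Ij+{n}")
    if taken:
        flushed = fetched[1:]      # everything fetched after the branch
        fetched.append("Ik")       # redirect to the branch target
    else:
        flushed = []               # sequential path was correct
        fetched.append(f"Ij+{resolve_cycle}")
    return fetched, flushed


fetched, flushed = predicted_not_taken(taken=True)
print("fetched per cycle:", fetched)
print("flushed:", flushed, "-> branch penalty =", len(flushed), "cycles")
```

In this model the number of flushed instructions equals the branch penalty, so a not-taken branch costs nothing while a taken branch costs one cycle for every wrong-path fetch.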

Control Hazard Example (continued)

Clock Cycle             1     2     3     4     5     6     7     8
Ij: BEQ R5, R6, Ik      IF    ID    EX    MEM   WB
Ij+1                          IF    ID    (discarded)
Ij+2                                IF    (discarded)
Ik                                        IF    ID    EX    MEM   WB

Ij+1 and Ij+2 are discarded after the branch outcome is determined.

• Control transfers to branch target (Ik)
• Instructions Ij+1 and Ij+2 are flushed from the pipeline
• Branch Penalty in this case is 2 cycles
• What would be the Branch Penalty if the branch was not taken?

Reducing Branch Penalty

• How to reduce the branch penalty for conditional branches?
  – Test the branch condition in the ID stage
    • Move the comparator from EX stage to ID stage
    • Branch condition tested in parallel with target address computation
    • Branch penalty reduced to one cycle
  – Also compute the branch target address in the ID stage
    • Need an adder in the ID stage to add the branch offset to the PC
    • When the decoder determines that the instruction is a conditional branch, the computed target address is available before the end of the ID stage
    • Update the PC before the end of ID stage
• Negative Effect: ID stage lengthened; may have to increase the clock period

Reducing Branch Penalty (continued)

Clock Cycle             1     2     3     4     5     6     7     8
Ij: BEQ R5, R6, Ik      IF    ID    EX    MEM   WB
Ij+1                          IF    (discarded)
Ik                                  IF    ID    EX    MEM   WB

Branch outcome is determined in the ID stage of Ij (cycle 2). Only one instruction needs to be discarded.

• Branch condition and target address computed in cycle # 2
• Branch penalty reduced to one cycle
• Only one instruction (Ij+1) is fetched incorrectly and needs to be discarded

Alternative Branch Strategies

• “Predict-taken” scheme: Treat every “branch” as taken
  – In our 5-stage pipeline, we don’t know the target any earlier than we know the branch outcome
  – No advantage, unless we could predict the target address earlier in the pipeline
• Delayed Branch
  – Compiler optimization
  – Use knowledge of the program semantics to rearrange instructions before/after the branch
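The effect of moving branch resolution from EX to ID can be estimated with the same kind of CPI arithmetic used earlier for data hazards. The Python sketch below is illustrative only: the function name `branch_cpi` is hypothetical, and the branch frequency (20%), taken fraction (60%) and base CPI of 1.0 are made-up values chosen for the example, not figures from the slides.

```python
def branch_cpi(branch_fraction: float, taken_fraction: float,
               taken_penalty: int, base_cpi: float = 1.0) -> float:
    """CPI under predicted-not-taken: only taken branches pay the penalty."""
    return base_cpi + branch_fraction * taken_fraction * taken_penalty


# Hypothetical workload: 20% of instructions are branches, 60% of them taken.
BRANCH_FRACTION = 0.20
TAKEN_FRACTION = 0.60

cpi_ex = branch_cpi(BRANCH_FRACTION, TAKEN_FRACTION, taken_penalty=2)  # resolved in EX
cpi_id = branch_cpi(BRANCH_FRACTION, TAKEN_FRACTION, taken_penalty=1)  # resolved in ID

print(f"branch resolved in EX: CPI = {cpi_ex:.2f}")
print(f"branch resolved in ID: CPI = {cpi_id:.2f}")
```

This comparison ignores the negative effect noted on the slide: if the longer ID stage forces a longer clock period, the lower CPI does not translate directly into lower execution time.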