Control (Branch) Hazards

A:  beqz R2, L1
B
C
D
------
L1: P

Naïve (Lazy) Implementation of a Conditional Branch Instruction in the DLX Pipeline:

IF: Fetch branch instruction from IM

ID: Decode instruction and read registers to be used in comparison

EX: Determine branch outcome: compare source register values to zero (or to each other) and set FLAG in the output pipeline register

MEM: Compute target address: (PC + 4) carried along with the instruction + sign-extended and shifted displacement

At end of clock cycle: PC is assigned the target address if the instruction now in MEM is a successful branch; otherwise execution continues as normal

Successful (or Taken) Branch: (i) instruction is a branch and (ii) branch condition is true
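The next-PC decision described above can be sketched in a few lines. This is only an illustration, not the datapath itself: the 16-bit displacement width and the shift by 2 (word-aligned instructions) are assumptions, and the names `sign_extend16` and `next_pc` are hypothetical.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc(pc: int, rs_value: int, displacement: int) -> int:
    """Next PC for BEQZ Rs, d in the naive design (assumed 16-bit d, shifted by 2)."""
    pc_plus_4 = pc + 4                      # carried along with the instruction
    taken = (rs_value == 0)                 # EX: compare source register with zero
    target = pc_plus_4 + (sign_extend16(displacement) << 2)  # MEM: target address
    return target if taken else pc_plus_4   # PC assigned at end of the MEM cycle


print(next_pc(100, 0, 3))       # taken branch: 104 + 12 = 116
print(next_pc(100, 5, 3))       # not taken: falls through to 104
print(next_pc(100, 0, 0xFFFF))  # taken, displacement -1: 104 - 4 = 100
```

A branch is "successful" only when both conditions hold: the instruction is a branch and `rs_value == 0`.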

Naive Implementation of Branch Equal Instruction: BEQZ Rs, d

[Figure: five-stage datapath (IF, ID, EX, MEM, WB) for BEQZ. The outcome of the branch is known at the end of cycle 3 (EX). The target address is computed and the PC is updated with its new value (branch target or PC+4) at the end of cycle 4 (MEM).]

Control Hazard

T = 1 IF A ID EX MEM WB

T = 2 IF ID A EX MEM WB

T = 3 IF ID EX A MEM WB

T = 4 B/P IF ID EX MEM A WB

(At T = 4 the next instruction fetched is B if the branch is NOT TAKEN, P if it is TAKEN.)

Control (Branch) Hazards: Problems

• The target address of the branch is not known until (at least) the instruction is decoded
  • What is the address of instruction P?
• The outcome of the branch (taken / not taken) is determined deep in the pipeline
  • Should we execute B or P after A?

What should the pipeline (processor) do after fetching the branch instruction?

SOLUTION 1
• Delay the next instruction
  • Till we know the outcome of the branch and the address of the next instruction

Software: Add 3 NOPs after every branch instruction
Hardware: Hazard Detection Unit checks for a branch instruction in the ID, EX, or MEM stage and stalls the PC / inserts NOPs

Simple Solution: Insert NOPs

A: beqz R2, label
   NOP
   NOP
   NOP
B
-----
label: P

Possible execution sequences:
Branch Not Taken: A, NOP, NOP, NOP, B
Branch Taken: A, NOP, NOP, NOP, P
• Adds 3 cycles to execution time for every branch

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
NOP        IF  ID  EX  MEM WB
NOP            IF  ID  EX  MEM WB
NOP                IF  ID  EX  MEM WB
B/P                    IF  ID  EX  MEM WB
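The software scheme above amounts to a simple pass over the instruction stream. The sketch below is a hedged illustration: the set of branch mnemonics and the helper names `is_branch` and `insert_nops` are assumptions made for the example, and instructions are represented as plain strings.

```python
BRANCH_DELAY_SLOTS = 3  # from the slides: 3 for this pipeline

# Assumed set of branch mnemonics, for illustration only.
BRANCH_MNEMONICS = {"beqz", "bnez", "beq", "bne"}


def is_branch(instr: str) -> bool:
    """Hypothetical test: the first token is a known branch mnemonic."""
    return instr.split()[0].lower() in BRANCH_MNEMONICS


def insert_nops(program, slots=BRANCH_DELAY_SLOTS):
    """Return a copy of program with `slots` NOPs after every branch."""
    padded = []
    for instr in program:
        padded.append(instr)
        if is_branch(instr):
            padded.extend(["NOP"] * slots)
    return padded


print(insert_nops(["beqz R2, label", "B"]))
# ['beqz R2, label', 'NOP', 'NOP', 'NOP', 'B']
```

Every branch pays the 3 extra cycles whether or not it is taken, which is what the two execution sequences above show.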

Control Hazard

T = 1 NOP IF A ID EX MEM WB

T = 2 NOP IF NOP ID A EX MEM WB

T = 3 NOP IF NOP ID NOP EX A MEM WB

T = 4 P/B IF NOP ID NOP EX NOP MEM A WB

Hardware-Controlled Pipeline Stall

A:  BEQ R1, R2, L1
B:  ---
C:  ---

L1: P

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
B          IF  IF  IF
P                      IF  ID  EX  MEM WB

Branch Taken: 3 Additional Cycles

Hardware-Controlled Pipeline Stall

A:  BEQ R1, R2, L1
B:  ---
C:  ---

L1: P

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
B          IF  IF  IF  IF  ID  EX  MEM WB
C                          IF  ID  EX  MEM

Branch Not Taken: 3 Additional Cycles

Hardware-Controlled Pipeline Stall

A:  BEQ R1, R2, L1
B:  ---
C:  ---

L1: P

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
B          IF  IF  IF  ID  EX  MEM WB
C                      IF  ID  EX  MEM WB

Optimized Branch Not Taken: PC gets address of C

Hazard Detection Unit

[Figure: Hazard Detection Unit (HDU) attached to the pipeline. On a stall it freezes the PC register (do not update) and inserts a NOP into IF/ID.]

Stall PC and insert NOP into IF/ID if there is a branch instruction in either the IF/ID, ID/EX or EX/MEM pipeline register

Hardware Controlled Pipeline Stall
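The stall rule of the Hazard Detection Unit (freeze the PC and insert a NOP into IF/ID while a branch sits in IF/ID, ID/EX or EX/MEM) is a single OR over three pipeline registers. A minimal sketch, assuming each register is modelled by the opcode string it holds (or `None` for a bubble); the opcode set and the function name `stall_fetch` are hypothetical.

```python
# Assumed branch opcodes, for illustration only.
BRANCH_OPCODES = {"BEQ", "BEQZ", "BNE", "BNEZ"}


def stall_fetch(if_id, id_ex, ex_mem) -> bool:
    """True when the PC must be frozen and a NOP injected into IF/ID.

    Each argument is the opcode held in that pipeline register,
    or None when the register holds a bubble.
    """
    return any(op in BRANCH_OPCODES for op in (if_id, id_ex, ex_mem))


print(stall_fetch("BEQ", None, None))   # True: branch just fetched
print(stall_fetch(None, None, "BEQZ"))  # True: branch about to enter MEM
print(stall_fetch("ADD", "SUB", None))  # False: no branch in flight
```

Because the check covers three registers, the instruction after the branch is held for three cycles, matching the 3-cycle penalty of the non-optimized design.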

A:  BEQ R1, R2, L1
B:  ---
C:  ---
L1: P

T = 2 B IF A ID EX MEM WB   (Stall)

T = 3 B IF ID A EX MEM WB   (Stall)

• Instruction B (address) held in PC register until A reaches WB stage
• Internally generated NOPs propagated forward while B is stalled

Hardware Controlled Pipeline Stall

T = 4 B IF ID EX A MEM WB

T = 5 P IF ID EX MEM A WB   (TAKEN BRANCH)

T = 5 B IF ID EX MEM A WB   (BRANCH NOT TAKEN)

OPTIMIZED: C IF B ID EX MEM A WB

Branch Delay Slots

Software Solution:
• Software must delay the execution of the next-in-line instruction after the branch
• Delay depends on the pipeline structure
• Microarchitecture is exposed to the software (compiler)

Branch Delay slots:

• Delay introduced by software to avoid control hazards

• Dummy instructions following branch instruction for purpose of creating delays till the new PC value can be set

• Instructions in the Branch Delay slots always executed

• In our design: 3 Branch Delay Slots

• Microarchitecture might choose not to expose all the delay slots and use some hardware mechanism to provide the remaining delay

Performance of Simple Stall Based Schemes

Stall scheme has a branch penalty of 3 cycles (may be 2 in the optimized hardware design)
1. Software-inserted NOPs (3 cycles)
2. Hardware-inserted stall cycles (3, non-optimized)

Example: Suppose branch frequency is 20% and 60% of branches are taken. Assume the software solution with penalties as above. Assume the compiler is able to fill 20% of the branch delay slots with useful instructions. How is CPI affected?

Each branch instruction incurs an extra delay of 3 cycles, except for the delay slots filled with useful instructions.
Branch penalty (per executed instruction) = 20% x 3 (delay slots) x 80% (unfilled delay slots) = 0.48 cycles
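The penalty arithmetic in this example can be checked with a short calculation. The function name `delay_slot_cpi` and its parameter names are hypothetical; the numbers are the ones given in the example, and a nominal CPI of 1.0 with no other stalls is assumed.

```python
def delay_slot_cpi(branch_freq, delay_slots, fill_rate, nominal_cpi=1.0):
    """CPI when every unfilled delay slot costs one wasted cycle."""
    penalty = branch_freq * delay_slots * (1.0 - fill_rate)
    return nominal_cpi + penalty


cpi = delay_slot_cpi(branch_freq=0.20, delay_slots=3, fill_rate=0.20)
print(f"{cpi:.2f}")  # 1.48
```

The taken/not-taken split (60%) does not enter the result, because with software delay slots the penalty is paid on every branch regardless of its outcome.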

CPI = Nominal CPI + Penalty Cycles (per instruction), assuming no other causes of stalls

CPI = 1.0 + 0.48 = 1.48

Alternate Hardware Solution

A: beqz R2, label
B
C
D
E
--
label: P

Why delay in-line instructions B, C, D etc.?
• Let instructions following A enter the pipeline normally
Works if Branch Not Taken!

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
B          IF  ID  EX  MEM WB
C              IF  ID  EX  MEM WB
D                  IF  ID  EX  MEM WB
E                      IF  ID  EX  MEM WB

Control Hazard: No Stall Cycles

T = 1 B IF A ID EX MEM WB

T = 2 C IF B ID A EX MEM WB

T = 3 D IF C ID B EX A MEM WB

T = 4 E IF D ID C EX B MEM A WB

BRANCH NOT TAKEN

Speculation: Alternate Hardware Solution

A: beqz R2, label
B
C
D
E
--
label: P

What if Branch is Taken?
• B, C, D have not updated machine state at cycle 4
• Flush B, C, D at end of cycle 4

Cycle  1   2   3   4   5   6   7   8   9
A      IF  ID  EX  MEM WB
B          IF  ID  EX  MEM WB
C              IF  ID  EX  MEM WB
D                  IF  ID  EX  MEM WB
P                      IF  ID  EX  MEM WB

(B, C and D are flushed at the end of cycle 4, so from cycle 5 onward their slots carry NOPs.)

Control Hazard

T = 1 B IF A ID EX MEM WB

T = 2 C IF B ID A EX MEM WB

T = 3 D IF C ID B EX A MEM WB

T = 4 P IF D ID C EX B MEM A WB

TAKEN BRANCH

Control Hazard

T = 4 P IF D ID C EX B MEM A WB

T = 5 Q IF P ID D EX C MEM B WB

T = 6 R IF Q ID P EX D MEM C WB

T = 7 S IF R ID Q EX P MEM D WB

TAKEN BRANCH: writes to MEM or REG by B, C or D will result in an error

Alternate Hardware Solution: Taken Branch

T = 1 B IF A ID EX MEM WB

T = 2 C IF B ID A EX MEM WB

T = 3 D IF C ID B EX A MEM WB

Insert NOP in IF/ID, ID/EX, EX/MEM

T = 4 P IF ID EX MEM A WB
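The flush-on-taken behaviour can be reproduced with a toy cycle-by-cycle model. This is a sketch under stated assumptions, not the actual control logic: instructions are just names, a taken branch is given as a map from its index to its target index, the branch is resolved as it leaves MEM, and each snapshot is taken at the end of the cycle. The names `simulate` and `NOP` are hypothetical.

```python
NOP = "-"


def simulate(names, taken, cycles):
    """Return one (next_fetch, [IF/ID, ID/EX, EX/MEM, MEM/WB]) snapshot per cycle.

    names  -- instruction names in memory order
    taken  -- {branch index: target index} for branches that are taken
    """
    def show(index):
        return NOP if index is None else names[index]

    pc = 0
    regs = [None, None, None, None]      # indices held in the 4 pipeline registers
    snapshots = []
    for _ in range(cycles):
        fetched = pc if pc < len(names) else None
        regs = [fetched] + regs[:3]      # every instruction advances one stage
        pc += 1                          # speculate: keep fetching in-line
        if regs[3] in taken:             # a taken branch has just finished MEM
            pc = taken[regs[3]]          # redirect fetch to the branch target
            regs[0:3] = [None, None, None]   # flush the 3 trailing instructions
        next_fetch = names[pc] if pc < len(names) else NOP
        snapshots.append((next_fetch, [show(i) for i in regs]))
    return snapshots


program = ["A", "B", "C", "D", "E", "P", "Q", "R", "S"]
for t, snap in enumerate(simulate(program, {0: 5}, 5), start=1):
    print("T =", t, snap)
```

With `A` taken to `P`, the fourth snapshot shows `P` as the next fetch, NOPs in IF/ID, ID/EX and EX/MEM, and `A` in MEM/WB, so `P` is fetched three cycles later than `B` was.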

Branch Penalty in Modified Hardware Scheme

• More than an optimized implementation of stall

• Simple form of control speculation
  • Speculating it is a NOT TAKEN branch
  • Continue fetching in-line instructions

• Performance depends on accuracy of speculation
  • Speculation correct (NOT TAKEN branch): continue with no stalls (0 penalty cycles)
  • Speculation incorrect (TAKEN branch): flush 3 trailing instructions (3 penalty cycles)

Example:
• Branch frequency: 20%
• 5% of branches are unconditional branches
• 70% of conditional branches are NOT TAKEN

CPI = Nominal CPI + Penalty cycles for TAKEN branch + Penalty cycles for NOT TAKEN branch

Penalty cycles for TAKEN branch
  = Penalty cycles for UNCONDITIONAL branch + Penalty cycles for TAKEN CONDITIONAL branch
  = 20% x 5% x 3 + 20% x 95% x 30% x 3
  = 0.03 + 0.171
  = 0.201
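The same sum can be written as a small function, which makes it easy to vary the branch mix. The function and parameter names are hypothetical; the 3-cycle flush penalty, zero penalty for not-taken branches and nominal CPI of 1.0 are the assumptions of this example.

```python
def predict_not_taken_cpi(branch_freq, uncond_frac, cond_not_taken_frac,
                          flush_penalty=3, nominal_cpi=1.0):
    """CPI when in-line fetch continues and taken branches flush the pipeline."""
    # Unconditional branches are always taken, so they always pay the flush.
    uncond = branch_freq * uncond_frac * flush_penalty
    # Conditional branches pay only when they turn out to be taken.
    cond_taken = (branch_freq * (1.0 - uncond_frac)
                  * (1.0 - cond_not_taken_frac) * flush_penalty)
    return nominal_cpi + uncond + cond_taken


print(f"{predict_not_taken_cpi(0.20, 0.05, 0.70):.3f}")  # 1.201
```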

CPI = 1.0 + 0.201 = 1.201

Predict branch: Not Taken; Actually Not Taken

No Stall Cycles

T = 1 B IF A ID EX MEM WB

T = 2 C IF B ID A EX MEM WB

T = 3 D IF C ID B EX A MEM WB

BRANCH NOT TAKEN DO NOTHING

T = 4 E IF D ID C EX B MEM A WB

Predict branch: Not Taken; Actually Taken

T = 1 B IF A ID EX MEM WB

T = 2 C IF B ID A EX MEM WB

T = 3 D IF C ID B EX A MEM WB

Branch actually taken: FLUSH pipeline (make B, C, D NOPs)

T = 4 P IF ID EX MEM A WB

More Control Speculation

Can we predict the branch as taken?
• Speculatively fetch and execute instructions at the branch target
• Useful only if target address is known earlier than branch outcome
• May require stall cycles until target address is known
• Flush pipeline if prediction is incorrect
• Must ensure that flushed instructions do not update any machine state

• Assume that target address is computed in the ID stage
  • Stall of 1 cycle till PC updated with target address (ALWAYS!)
• Assume branch outcome known at the end of cycle 3 in EX stage

Predict branch taken

T = 1 B IF A ID EX MEM WB

T = 2 P IF ID A EX MEM WB

T = 3 Q IF P ID EX A MEM WB

Branch actually taken: Single stall cycle

T = 4 R IF Q ID P EX MEM A WB

Predict branch taken

T = 1 B IF A ID EX MEM WB

T = 2 P IF ID A EX MEM WB

Branch actually not taken: 2 wasted cycles
FLUSH pipeline: make P a NOP

T = 3 B IF ID EX A MEM WB

T = 4 C IF B ID EX MEM A WB

More Control Speculation (contd …)

Reduce branch delay (from 3 cycles of first design) to 1 or 2
Early Branch Detection hardware to compute:
• Target address: easy to move to ID stage
• Branch outcome: easy to move to EX stage

Predict Not Taken:
• Actually Not Taken: No stalls
• Actually Taken: 2 cycles

Predict Taken:
• Actually Taken: 1 cycle
• Actually Not Taken: 2 cycles
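The four penalties above follow from the stage in which the target address and the branch outcome become available. The sketch below is one way to express that relationship, under the assumption that stages are numbered IF=1, ID=2, EX=3, MEM=4 and that each cycle of waiting costs one penalty cycle; the function name `branch_penalty` is hypothetical.

```python
def branch_penalty(predict_taken, actually_taken,
                   target_stage=2, outcome_stage=3):
    """Penalty cycles for one branch (stages: IF=1, ID=2, EX=3, MEM=4).

    target_stage  -- stage at whose end the target address is known
    outcome_stage -- stage at whose end the branch outcome is known
    """
    if predict_taken == actually_taken:
        # Correct prediction: a taken branch still waits for its target.
        return target_stage - 1 if actually_taken else 0
    if actually_taken:
        # Mispredicted as not taken: need both outcome and target to redirect.
        return max(target_stage, outcome_stage) - 1
    # Mispredicted as taken: in-line fetch resumes once the outcome is known.
    return outcome_stage - 1


for predicted in (False, True):
    for actual in (False, True):
        print(predicted, actual, branch_penalty(predicted, actual))
```

With the target resolved in MEM instead (`target_stage=4`), the predict-not-taken, actually-taken case gives 3 cycles, which matches the first design.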

More Control Speculation

• 16% of instructions were conditional branches
• 4% of instructions were unconditional branches
• 62% of conditional branches were taken

Predict Branch Taken:
• Branch Actually Taken: 1 cycle penalty
• Branch Actually Not Taken: 2 cycle penalty

CPI = 1 + 16% x 62% x 1 (taken conditional branch) + 16% x 38% x 2 (not taken conditional branch) + 4% x 1 (unconditional branch, taken) = 1.26
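The CPI expression is an expected-penalty sum over three branch classes, so one function covers both prediction schemes by swapping the penalty values. A minimal sketch; the function name `speculation_cpi` and its parameter names are hypothetical, and a nominal CPI of 1 with no other stalls is assumed.

```python
def speculation_cpi(cond_freq, uncond_freq, taken_frac,
                    pen_taken, pen_not_taken, pen_uncond, nominal_cpi=1.0):
    """Nominal CPI plus the expected branch penalty per instruction."""
    return (nominal_cpi
            + cond_freq * taken_frac * pen_taken                # taken conditional
            + cond_freq * (1.0 - taken_frac) * pen_not_taken    # not taken conditional
            + uncond_freq * pen_uncond)                         # unconditional (taken)


# Predict taken: penalties 1 (taken), 2 (not taken), 1 (unconditional).
print(round(speculation_cpi(0.16, 0.04, 0.62, 1, 2, 1), 4))  # 1.2608

# Predict not taken: penalties 2 (taken), 0 (not taken), 1 (unconditional).
print(round(speculation_cpi(0.16, 0.04, 0.62, 2, 0, 1), 4))  # 1.2384
```

For this branch mix the two schemes come out close, because the cheaper taken case under predict-taken is offset by the penalty now paid on the 38% of conditional branches that fall through.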

Predict Not Taken:
• Branch Actually Taken: 2 cycle penalty
• Branch Actually Not Taken: 0 cycle penalty
• Unconditional branch: known to be TAKEN, 1 cycle penalty

CPI = 1 + 16% x 62% x 2 + 4% x 1 = 1.24

Summary: Control (Branch) Hazard

• Do branch resolution (outcome and target address) early
• Stall pipeline for required number of cycles

Methods that reduce the branch penalty
1. Prediction
   • Speculatively predict branch outcome
   • Recover if mispredicted

2. Delayed Branch
   • Always execute instruction(s) following the branch
   • Compiler tries to fill the branch delay slot(s) with useful instructions

Summary: Control (Branch) Hazard

Default pipeline: fetch and execute in-line instructions following a branch
• Incorrect if the branch is actually taken

Solutions: delay the actual instruction following a branch

Software: Exposed Branch Delay Slots
• Insert NOP instructions (3 for our pipeline)
• (Optimization) Move useful instructions into the delay slot

Microarchitecture: Insert stall cycles in the pipeline dynamically
• Freeze the PC, insert NOP into the IF/ID pipeline register

Optimizations:
• Do branch resolution (outcome and target address) early
  • Compute target address in ID stage: PC can be updated with target address at end of cycle 2
  • Compute branch outcome in EX stage (or even ID stage!)

Optimizations (contd …)

Use Branch Prediction to reduce Branch Penalty
• Speculatively predict branch outcome
• Recover if mispredicted

Predict Branch Not Taken:
• No penalty if branch is actually not taken
• If branch is actually taken:
  • Flush the pipeline of the in-line instructions that are already in the pipeline
  • Update PC with target address

Predict Branch Taken:
• Compute target address in ID stage and update PC
  • Requires 1 stall cycle to compute address
• Compute branch outcome in (say) EX stage
• If branch is actually taken (prediction is correct): do nothing
• If branch is actually not taken (misspeculation):
  • Flush pipeline of the instruction at target that has entered the pipeline
  • Update PC with in-line instruction address