Control (Branch) Hazards

A: beqz R2, L1
B
C
D
------
L1: P

Naïve (Lazy) Implementation of a Conditional Branch Instruction in the DLX Pipeline:
IF: Fetch the branch instruction from IM
ID: Decode the instruction and read the registers to be used in the comparison
EX: Determine the branch outcome: compare the source register values to zero (or to each other) and set FLAG in the output pipeline register
MEM: Compute the target address: (PC + 4), carried along with the instruction, plus the sign-extended and shifted displacement
At the end of the MEM clock cycle: PC is assigned the target address if the instruction now in MEM is a successful branch; otherwise execution continues as normal

Successful (or Taken) Branch: (i) the instruction is a branch and (ii) the branch condition is true

[Figure: Naive implementation of the Branch Equal instruction BEQZ Rs, d, showing the datapath across IF, ID, EX, MEM, WB. The outcome of the branch is known at the end of cycle 3. The target address is computed in cycle 4, and PC is updated with the new value (branch target or PC + 4) at the end of cycle 4.]

Control Hazard: pipeline contents per cycle (a dash marks a stage whose occupant is not shown). B is fetched if the branch is NOT TAKEN, P if it is TAKEN.

        IF     ID     EX     MEM    WB
T = 1   -      A      -      -      -
T = 2   -      -      A      -      -
T = 3   -      -      -      A      -
T = 4   B/P    -      -      -      A

Control (Branch) Hazards: Problems
• The target address of the branch is not known until (at least) the instruction is decoded
  • What is the address of instruction P?
• The outcome of the branch (taken / not taken) is determined deep in the pipeline
  • Should we execute B or P after A?

What should the pipeline (processor) do after fetching the branch instruction?
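To make the timing of the naive design concrete, the following minimal sketch computes when the instruction after a branch can first be fetched if fetch has to wait for the PC update. It assumes one stage per cycle and that the branch is fetched in cycle 1; the names `STAGES`, `next_fetch_cycle` and `branch_penalty` are illustrative and not part of the original slides.

```python
# Stage order of the five-stage pipeline used in these notes.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def next_fetch_cycle(branch_fetch_cycle, pc_update_stage="MEM"):
    """Earliest cycle in which the instruction after a branch can be fetched,
    assuming fetch waits until the PC is updated at the end of pc_update_stage."""
    # The branch occupies stage number i in cycle branch_fetch_cycle + i.
    update_cycle = branch_fetch_cycle + STAGES.index(pc_update_stage)
    # The new PC value is usable in the cycle after the update.
    return update_cycle + 1


def branch_penalty(pc_update_stage="MEM"):
    """Extra cycles compared with fetching the successor right behind the branch."""
    ideal_fetch_cycle = 2  # branch fetched in cycle 1, successor ideally in cycle 2
    return next_fetch_cycle(1, pc_update_stage) - ideal_fetch_cycle


# Compare the naive design (PC updated in MEM) with earlier resolution points.
for stage in ("ID", "EX", "MEM"):
    print(stage, next_fetch_cycle(1, stage), branch_penalty(stage))
```

With the PC updated at the end of MEM, the successor of the branch is fetched in cycle 5, which corresponds to a penalty of 3 cycles; moving the update to EX or ID would reduce it to 2 or 1.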
SOLUTION 1
• Delay the next instruction
• Until we know the outcome of the branch and the address of the next instruction

Software: Add 3 NOPs after every branch instruction
Hardware: Hazard Detection Unit checks for a branch instruction in the ID, EX, or MEM stage and stalls the PC / inserts NOPs

Simple Software Solution: Insert NOPs

A: beqz R2, label
NOP
NOP
NOP
B
-----
label: P

Possible execution sequences:
Branch Not Taken: A, NOP, NOP, NOP, B
Branch Taken: A, NOP, NOP, NOP, P
• Adds 3 cycles to execution time for every branch

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
NOP          IF   ID   EX   MEM  WB
NOP               IF   ID   EX   MEM  WB
NOP                    IF   ID   EX   MEM  WB
B/P                         IF   ID   EX   MEM  WB

Control Hazard: pipeline contents per cycle

        IF     ID     EX     MEM    WB
T = 1   NOP    A
T = 2   NOP    NOP    A
T = 3   NOP    NOP    NOP    A
T = 4   P/B    NOP    NOP    NOP    A

Hardware-Controlled Pipeline Stall

A: BEQ R1, R2, L1
B: ----
C: ---
L1: P

Branch Taken: 3 additional cycles

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   IF   IF
P                           IF   ID   EX   MEM  WB

Branch Not Taken: 3 additional cycles

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   IF   IF   IF   ID   EX   MEM  WB

Optimized Branch Not Taken: PC gets address of C (2 additional cycles)

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   IF   IF   ID   EX   MEM  WB
C                           IF   ID   EX   MEM  WB

[Figure: Hazard Detection Unit (HDU). The HDU freezes the PC register (do not update) and inserts a NOP.]
Stall the PC and insert a NOP into IF/ID if there is a branch instruction in any of the IF/ID, ID/EX or EX/MEM pipeline registers.

Hardware-Controlled Pipeline Stall: pipeline contents per cycle

                    IF          ID     EX     MEM    WB
T = 2               B (stall)   A
T = 3               B (stall)   NOP    A
T = 4               B           NOP    NOP    A
T = 5 (taken)       P           NOP    NOP    NOP    A
T = 5 (not taken)   B           NOP    NOP    NOP    A
T = 5 (optimized)   C           B      NOP    NOP    A

• Instruction B (address) is held in the PC register until A reaches the WB stage
• Internally generated NOPs are propagated forward while B is stalled

Branch Delay Slots

Software solution:
• Software must delay the execution of the next-in-line instruction after the branch
• The delay depends on the pipeline structure
• The microarchitecture is exposed to the software (compiler)

Branch delay slots:
• Delay introduced by software to avoid control hazards
• Dummy instructions following the branch instruction, placed to create a delay until the new PC value can be set
• Instructions in the branch delay slots are always executed
• In our design: 3 branch delay slots
• The microarchitecture might choose not to expose all the delay slots and use hardware mechanisms to provide the remaining delay

Performance of Simple Stall-Based Schemes

The stall scheme has a branch penalty of 3 cycles (may be 2 in an optimized hardware design):
1. Software-inserted NOPs (3 cycles)
2. Hardware-inserted stall cycles (3, non-optimized)

Example: Suppose the branch frequency is 20% and 60% of branches are taken. Assume the software solution with penalties as above. Assume the compiler is able to fill 20% of the branch delay slots with useful instructions. How is CPI affected?

Each branch instruction incurs an extra delay of 3 cycles, except for the delay slots filled with useful instructions. The penalty is the same whether or not the branch is taken.

Branch Penalty (per executed instruction) = 20% × 3 (delay slots) × 80% (unfilled delay slots) = 0.48 cycles
CPI = Nominal CPI + Penalty Cycles (per instruction)
Assuming no other causes of stalls: CPI = 1.0 + 0.48 = 1.48

Alternate Hardware Solution

A: beqz R2, label
B
C
D
E
--
label: P

Why delay in-line instructions B, C, D etc.?
• Let the instructions following A enter the pipeline normally
• Works if the branch is not taken!

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   ID   EX   MEM  WB
C                 IF   ID   EX   MEM  WB
D                      IF   ID   EX   MEM  WB
E                           IF   ID   EX   MEM  WB

Control Hazard: No Stall Cycles (BRANCH NOT TAKEN)

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   C    B    A
T = 3   D    C    B    A
T = 4   E    D    C    B    A

Speculation: Alternate Hardware Solution

What if the branch is taken?
• B, C, D have not updated machine state at cycle 4
• Flush B, C, D at the end of cycle 4

Cycle   1    2    3    4    5    6    7    8    9
A       IF   ID   EX   MEM  WB
B            IF   ID   EX   (flushed)
C                 IF   ID   (flushed)
D                      IF   (flushed)
P                           IF   ID   EX   MEM  WB

Control Hazard: TAKEN BRANCH without a flush. Writes to MEM or REG by B, C or D will result in error.

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   C    B    A
T = 3   D    C    B    A
T = 4   P    D    C    B    A
T = 5   Q    P    D    C    B
T = 6   R    Q    P    D    C
T = 7   S    R    Q    P    D

Alternate Hardware Solution: Taken Branch. At the end of T = 4, insert NOPs in IF/ID, ID/EX and EX/MEM.

        IF   ID   EX   MEM  WB
T = 2   B    A
T = 3   C    B    A
T = 4   D    C    B    A
T = 5   P    NOP  NOP  NOP  A

Branch Penalty in Modified Hardware Scheme
• More than an optimized implementation of stall
• Simple form of control speculation
  • Speculating that it is a NOT TAKEN branch
  • Continue fetching in-line instructions
• Performance depends on the accuracy of speculation
  • Speculation correct (NOT TAKEN branch): continue with no stalls (0 penalty cycles)
  • Speculation incorrect (TAKEN branch): flush 3 trailing instructions (3 penalty cycles)

Example:
Branch frequency: 20%
5% of branches are unconditional branches
70% of conditional branches are NOT TAKEN

CPI = Nominal CPI + Penalty cycles for TAKEN branches + Penalty cycles for NOT TAKEN branches
Penalty cycles for NOT TAKEN branches = 0
Penalty cycles for TAKEN branches = Penalty cycles for unconditional branches + Penalty cycles for taken conditional branches
= 20% × 5% × 3 + 20% × 95% × 30% × 3
= 0.03 + 0.171 = 0.201
CPI = 1.0 + 0.201 = 1.201

Predict branch: Not Taken; Actually Not Taken (no stall cycles, do nothing)

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   C    B    A
T = 3   D    C    B    A
T = 4   E    D    C    B    A

Predict branch: Not Taken; Actually Taken (flush the pipeline: make B, C, D NOPs)

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   C    B    A
T = 3   D    C    B    A
T = 4   P    NOP  NOP  NOP  A

More Control Speculation

Can we predict the branch as taken?
• Speculatively fetch and execute instructions at the branch target
• Useful only if the target address is known earlier than the branch outcome
• May require stall cycles until the target address is known
• Flush the pipeline if the prediction is incorrect
  • Must ensure that flushed instructions do not update any machine state
• Assume that the target address is computed in the ID stage
  • Stall of 1 cycle until PC is updated with the target address (ALWAYS!)
• Assume the branch outcome is known at the end of cycle 3, in the EX stage

Predict branch taken; branch actually taken: single stall cycle (a dash marks a bubble)

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   P    -    A
T = 3   Q    P    -    A
T = 4   R    Q    P    -    A

Predict branch taken; branch actually not taken: flush the pipeline (make P a NOP), 2 wasted cycles

        IF   ID   EX   MEM  WB
T = 1   B    A
T = 2   P    -    A
T = 3   B    -    -    A
T = 4   C    B    -    -    A

More Control Speculation (contd.)

Reduce the branch delay (from the 3 cycles of the first design) to 1 or 2 cycles.
Early branch detection hardware computes:
• Target address: easy to move to the ID stage
• Branch outcome: easy to move to the EX stage

Predict Not Taken:
• Actually Not Taken: no stalls
• Actually Taken: 2 cycles
Predict Taken:
• Actually Taken: 1 cycle
• Actually Not Taken: 2 cycles

More Control Speculation: CPI example
• 16% of instructions were conditional branches
• 4% of instructions were unconditional branches
• 62% of conditional branches were taken

Predict Taken (branch actually taken: 1 cycle penalty; branch actually not taken: 2 cycle penalty):
CPI = 1 + 16% × 62% × 1 (taken conditional branch) + 16% × 38% × 2 (not taken conditional branch) + 4% × 1 (unconditional branch, taken) ≈ 1.26

Predict Not Taken (branch actually taken: 2 cycle penalty; branch actually not taken: 0 cycle penalty; an unconditional branch is known to be taken, 1 cycle penalty):
CPI = 1 + 16% × 62% × 2 + 4% × 1 ≈ 1.24

Summary: Control (Branch) Hazard
• Do branch resolution (outcome and target address) early
• Stall the pipeline for the required number of cycles

Methods that reduce the branch penalty
1.
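The CPI figures in the worked examples of this section all follow one pattern: nominal CPI plus, for each class of branch, (fraction of all instructions) × (penalty cycles). The sketch below evaluates the expressions as they are written in the examples. The helper name `cpi` and the variable names are illustrative assumptions, not part of the original slides.

```python
def cpi(nominal_cpi, penalty_terms):
    """Nominal CPI plus the sum of (instruction fraction * penalty cycles)."""
    return nominal_cpi + sum(fraction * cycles for fraction, cycles in penalty_terms)


# Software delay slots: 20% branches, 3 slots, 80% of the slots left unfilled.
delay_slots = cpi(1.0, [(0.20 * 0.80, 3)])  # about 1.48

# Predict not taken with a 3-cycle flush: 20% branches, 5% of them
# unconditional, 30% of the conditional branches taken.
flush_3 = cpi(1.0, [(0.20 * 0.05, 3), (0.20 * 0.95 * 0.30, 3)])  # about 1.201

# Early resolution: 16% conditional (62% taken), 4% unconditional.
predict_taken = cpi(1.0, [(0.16 * 0.62, 1), (0.16 * 0.38, 2), (0.04, 1)])  # about 1.26
predict_not_taken = cpi(1.0, [(0.16 * 0.62, 2), (0.04, 1)])  # about 1.24

for name, value in [("delay slots", delay_slots), ("flush", flush_3),
                    ("predict taken", predict_taken),
                    ("predict not taken", predict_not_taken)]:
    print(f"{name}: CPI = {value:.3f}")
```

Keeping each penalty as a (fraction, cycles) pair makes it easy to vary one assumption at a time, for example the share of taken branches, and see how the two prediction schemes compare.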
