Instruction Fetch and Branch Prediction

CprE 581 Computer Systems Architecture Readings: Textbook (4th ed 2.3, 2.9); (5th ed 3.3)

1 Frontend and Backend

[Pipeline figure: Fetch -> select -> Regfile/bypass -> Rename -> Wakeup -> schedule -> execute -> commit, with feedback from the backend to fetch.]

Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict next PC based on current PC and past history of branch targets and directions
- Special handling of function returns

Backend:
- Execute instructions out-of-order
- Provide feedback to the frontend: whether the prediction was correct or not, update predictor info; if incorrect, supply the correct next PC
- Mis-prediction affects performance but not correctness

2 Instruction Flow

Instruction flow must be continuous.

Branch target prediction: what is the target PC; must be done at the fetch stage.

Branch prediction: which direction does a branch take; usually done at fetching.

Return address prediction: special target prediction for return instructions; may be done at fetching or decoding.

3 Instruction Flow

[Figure: single-cycle fetch loop - the PC indexes the instruction memory and the target, branch, and RA predictors, whose outputs select the next PC; fetched instructions go to Decode/Rename, and both the predictors and Decode/Rename provide feedback.]

Design questions:
- What would happen if branch prediction is done after the fetch stage, e.g. at decoding?
- At the fetch stage, how do we know whether an inst is a branch or not?
- How do we know whether an inst is a return inst?

4 Branch Prediction Buffer

[Figure: 5-stage pipeline (IF ID EX M WB); in IF, the PC indexes the I-Cache and the branch prediction buffer (BPB). Fully associative lookup of branch addresses A0..A(k-1) is expensive; instead, log k bits of the PC index the BPB directly.]

Branch Target Buffer (BTB)

[Figure: the PC of the instruction to fetch is compared against the entries of the branch-target buffer; each entry holds a branch PC, a predicted PC, and a taken/untaken prediction.]

- No match: the instruction is not predicted to be a branch; proceed normally.
- Match: the instruction is a branch, and if it is predicted taken, the predicted PC should be used as the next PC.

Branch Prediction Steps

IF: Send the PC to memory and to the branch-target buffer.

- Entry not found in the branch-target buffer:
  - ID: Is the instruction a taken branch?
    - No: normal instruction execution.
    - Yes (EX): enter the branch address and the next PC into the BTB.
- Entry found: send out the predicted PC.
  - EX: Was the branch taken as predicted?
    - Yes: branch predicted correctly; continue execution with no stalls.
    - No: mispredicted branch; kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB.

Adv. Techniques for Instruction Delivery and Speculation

Branch Folding Optimization:
- Larger branch-target buffer
- Add the target instruction into the buffer, to deal with the longer decoding time required by the larger buffer
- "Branch folding": the fetched target instruction can replace the branch itself

Copyright © 2012, Elsevier Inc. All rights reserved.

Mis-prediction Recovery

Pipeline flushing:
- A mis-prediction is detected when the branch is resolved
- May wait until the branch is to be committed, and then flush the whole pipeline
- Selective flushing: immediately and selectively flush the mis-fetched instructions

Fetch stage flushing: special cases, e.g.
- A branch target was wrongly predicted; the correct branch target is known at decoding for most branches
- An unconditional branch (jump) was predicted as not-taken

9 Branch Prediction

Predict branch direction: taken or not taken (T/NT)

    BNE R1, R2, L1   ; taken
    ...              ; not taken (fall through)
L1: ...

Static prediction: the direction is decided before execution.
Dynamic prediction: hardware decides the direction using dynamic information:
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more ...

10 Predictor for a Single Branch

General form:
1. Access: the PC indexes the predictor state
2. Predict: output T/NT
3. Feedback: the actual T/NT outcome updates the state

1-bit prediction: a single bit records the last outcome.
- State 1 (Predict Taken): stays at 1 on T; goes to state 0 on NT
- State 0 (Predict Not Taken): stays at 0 on NT; goes to state 1 on T

11 1-bit BHT Accuracy

Example: in a loop, a 1-bit BHT causes 2 mispredictions per loop execution. Consider a loop of 10 iterations before exit:

for (...) {
  for (i = 0; i < 10; i++)
    a[i] = a[i] * 2.0;
}

- Two mispredictions: the first loop iteration and the last loop iteration.
- Only 80% accuracy.
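The 80% figure can be checked with a small simulation (a sketch, not from the slides; the assumed model is a single bit that remembers the branch's last outcome, initially not-taken):

```python
# Sketch: simulate a 1-bit predictor on the inner-loop branch
# (taken 9 times, then not taken once per execution of the inner loop).
def mispredicts_1bit(outcomes, last_outcome=False):
    """Count mispredictions of a 1-bit predictor that always
    predicts the branch will do what it did last time."""
    count = 0
    for taken in outcomes:
        if taken != last_outcome:
            count += 1          # prediction was wrong
        last_outcome = taken    # 1 bit of state: remember the outcome
    return count

inner = [True] * 9 + [False]    # one execution of the 10-iteration loop
history = inner * 3             # inner loop re-entered by an outer loop
m = mispredicts_1bit(history)
print(m, 1 - m / len(history))  # 6 0.8 -> 80% accuracy
```

Each trip through the inner loop costs one misprediction on entry (the bit still says not-taken from the previous exit) and one on exit.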

12 1-Bit Prediction Drawbacks

Taken: 9 times LOOP: Inst 1 Not taken: 1 time Inst 2 Inst 3 10 iterations . 1-bit prediction mispredicts . twice: 20% misprediction rate Inst k Branch

Outer loop Branch History Table of 1-bit Predictor

BHT is also called Branch Prediction Buffer in the textbook.
- Could use only one 1-bit predictor, but accuracy is low
- BHT: a table of 2^k simple predictors, indexed by k bits from the PC (similar to a direct-mapped cache)
- More entries: more cost, but fewer conflicts and higher accuracy
- BHT entries can also hold complex predictors

14 2-bit Saturating Counter

Solution: a 2-bit scheme that changes the prediction only after mispredicting twice (Figure 3.7, p. 249).

States:
- 11 Predict Taken (strong)
- 10 Predict Taken (weak)
- 01 Predict Not Taken (weak)
- 00 Predict Not Taken (strong)

The counter moves up on a taken outcome and down on a not-taken outcome, saturating at 11 and 00.
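A sketch of the counter version (assumptions: predict taken when the counter is 2 or 3, increment on taken, decrement on not taken, counter initialized to strongly taken). On the same 10-iteration loop it mispredicts only the loop exit:

```python
# Sketch: 2-bit saturating counter branch predictor.
def mispredicts_2bit(outcomes, counter=3):
    """Predict taken when counter >= 2; saturate the counter at 0 and 3."""
    count = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            count += 1                 # prediction was wrong
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return count

inner = [True] * 9 + [False]           # 10-iteration loop branch
print(mispredicts_2bit(inner * 3))     # 3: one misprediction (the exit) per pass
```

Because one not-taken outcome only weakens the counter (3 -> 2), the re-entered loop is still predicted taken, halving the loop's misprediction rate versus the 1-bit scheme.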

Adds hysteresis to decision making.

15 Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.

In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters.
- The plain 2-bit BHT is then a (0,2) predictor.

Correlating Branches

C code:
    if (d == 0)
        d = 1;
    if (d == 1)
        ...

Assembly (d in R1):
        BNEZ  R1, L1        ; branch B1 (taken if d != 0)
        ADDI  R1, R0, #1    ; d == 0, so set d = 1
    L1: SUBUI R3, R1, #1
        BNEZ  R3, L2        ; branch B2 (taken if d != 1)
        ...
    L2:

Are B1 and B2 correlated? Yes: B1 not taken implies B2 not taken.

Correlating Branch Predictor

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as the history of that branch itself).
- The behavior of recent branches selects between, say, 2 predictions of the next branch, updating just that prediction.
- (1,1) predictor: 1-bit global history, 1-bit local predictors.

[Figure: the branch address (4 bits) indexes a table of 1-bit local predictors; a 1-bit global branch history (0 = not taken) selects which prediction to use.]

18 Correlating Branch Predictor

General form: (m, n) predictor
- m bits of global history, n bits of local history
- Records correlation between m+1 branches
- Simple implementation: the global history can be kept in an m-bit shift register
- Example: (2,2) predictor: 2-bit global, 2-bit local

[Figure: the branch address (4 bits) indexes 2-bit local predictors; a 2-bit global branch history (01 = not taken, then taken) selects the prediction.]

20 Correlating Branch Example

Assume d alternates between 2 and 0:

Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
0                  | Yes   | Not taken | 1                    | Yes   | Not taken
1                  | No    | Taken     | 1                    | Yes   | Not taken
2                  | No    | Taken     | 2                    | No    | Taken

d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT

The 1-bit predictor mispredicts every branch!

Correlating Branch Example

Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | Not taken                           | Not taken
NT/T            | Not taken                           | Taken
T/NT            | Taken                               | Not taken
T/T             | Taken                               | Taken

Initial prediction: NT/NT

d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2   | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
2   | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T

Only the first occurrences of b1 and b2 are mispredicted.

Correlating Branches

[Figure, (2,2) predictor: the branch address indexes four columns (00, 01, 10, 11) of 2-bit per-branch predictors; the 2-bit global branch history selects which column supplies the prediction.]

- The behavior of recent branches selects between four predictions of the next branch, updating just that prediction.

Gselect and Gshare predictors

- Keep a global register (GR), the global branch history register (GBHR), with the outcomes of the last k branches; each branch result (taken / not taken) is shifted in.
- Use it in conjunction with the PC to index into a pattern history table (PHT) of 2-bit predictors; the selected counter is decoded into a taken / not-taken prediction.
- Gselect: concatenate PC bits and history bits
- Gshare: XOR them (better)
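A minimal gshare sketch (the table size, initial counter values, and the interface are assumptions, not from the slides):

```python
class Gshare:
    """Sketch of a gshare predictor: PC XOR global history indexes a
    pattern history table (PHT) of 2-bit saturating counters."""
    def __init__(self, k=10):
        self.mask = (1 << k) - 1
        self.history = 0                  # global branch history register
        self.pht = [2] * (1 << k)         # counters start weakly taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask   # gshare: XOR, not concatenate

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Gselect would instead concatenate PC bits with history bits in `_index`; XOR spreads (PC, history) pairs over more distinct PHT entries for the same table size.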

ECE668 - Adapted from Patterson, Katz and Culler; Copyright 2001 UCB & Morgan Kaufmann

Accuracy of Different Schemes (Figure 2.7, page 87)

[Figure 2.7, page 87: frequency of mispredictions (roughly 0% to 18%) on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. The (2,2) predictor generally mispredicts least, and unlimited entries barely improve on 4096.]

Legend: 4,096 entries, 2 bits per entry; unlimited entries, 2 bits per entry; 1,024 entries, (2,2).

25 Re-evaluating Correlation

Several SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

program  | branch % | static # | # responsible for 90%
compress | 14%      | 236      | 13
eqntott  | 25%      | 494      | 5
gcc      | 15%      | 9531     | 2020
mpeg     | 10%      | 5598     | 532
real gcc | 13%      | 17361    | 3214

- Real programs + OS look more like gcc
- Small benefits of correlation beyond benchmarks?
- Mispredict because either:
  - Wrong guess for that branch
  - Got the branch history of the wrong branch when indexing the table
- For SPEC92, 4096 entries are about as good as an infinite table
- Misprediction is mostly due to wrong prediction
- Can we improve using global history?

Estimate Branch Penalty

EX: The BHT correct rate is 95%; the BTB hit rate is 95%.

The average miss penalty is 1 cycle on a BTB miss and 6 cycles on a BHT misprediction.

How much is the branch penalty?
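One way to work the example (an assumption: BTB misses and BHT mispredictions are treated as independent events whose expected penalties simply add):

```python
# Worked estimate of the average branch penalty, in cycles per branch.
btb_hit = 0.95           # BTB hit rate
bht_correct = 0.95       # BHT direction-prediction accuracy
btb_miss_penalty = 1     # cycles lost on a BTB miss
bht_miss_penalty = 6     # cycles lost on a BHT misprediction

penalty = (1 - btb_hit) * btb_miss_penalty + (1 - bht_correct) * bht_miss_penalty
print(round(penalty, 2))  # 0.35 cycles per branch
```

That is 0.05 * 1 + 0.05 * 6 = 0.35 cycles of penalty per branch on average.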

27 Return Address (RA) Prediction

Returns are special register-indirect branches, hard to predict:
- Many callers, one callee
- Jumps to multiple return addresses from a single address (no PC-target correlation)
- In SPEC89, 85% of such branches are procedure returns

Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries leads to a small miss rate.
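A sketch of such a return-address stack (the default size and the discard-oldest overflow policy are assumptions):

```python
class ReturnAddressStack:
    """Small hardware-like stack: push on call, pop to predict a return."""
    def __init__(self, entries=16):
        self.entries = entries
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.entries:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)                       # outer call site
ras.on_call(0x200)                       # nested call site
print(hex(ras.predict_return()))         # 0x200: innermost return first
print(hex(ras.predict_return()))         # 0x100
```

Because calls and returns nest, the stack's top almost always matches the next return, regardless of how many distinct callers the callee has.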

28 Accuracy of Return Address Predictor

29 Tournament Predictors

- Motivation for correlating branch predictors: the 2-bit local predictor failed on important branches; adding global information improved performance
- Tournament predictors: use two predictors, one based on global information and one based on local information, combined with a selector
- The hope is to select the right predictor for the right branch (or the right context of the branch)

Tournament Branch Predictor

Used in the Alpha 21264. Tracks both local and global history; intended for mixed types of applications.

Global history: the T/NT history of the past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T).

[Figure: the PC indexes a local predictor and, together with the global NT/T history, a global predictor; a choice predictor drives a mux that selects between the two predictions.]

31 Tournament Predictor in Alpha 21264

- 4K 2-bit counters choose between a global predictor and a local predictor
- The global predictor also has 4K entries, indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
- 12-bit pattern: the ith bit is 0 => the ith prior branch was not taken; the ith bit is 1 => the ith prior branch was taken

[Figure: the choice predictor is a 4K x 2-bit table implementing four states (strongly/weakly "use predictor 1", strongly/weakly "use predictor 2"); transitions, labeled by whether each predictor was correct, move the state toward the predictor that was right.]

Tournament Predictor in Alpha 21264

The local predictor is a 2-level predictor:
- Top level: a local history table of 1024 10-bit entries; each 10-bit entry holds the most recent 10 branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table (1K x 10 bits) indexes a table of 1K 3-bit saturating counters, which provide the local prediction.

Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits (~180K transistors)
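A rough sketch of the tournament idea (heavily simplified relative to the 21264: 2-bit counters everywhere, a single-level local table, and assumed sizes):

```python
class Tournament:
    """Sketch: choice counters pick between a global and a local
    prediction; the chooser moves toward whichever predictor was right."""
    def __init__(self, k=10):
        self.mask = (1 << k) - 1
        self.history = 0                    # global branch history
        self.global_pht = [2] * (1 << k)    # indexed by global history
        self.local_pht = [2] * (1 << k)     # indexed by PC bits
        self.choice = [2] * (1 << k)        # >= 2 means "use global"

    def predict(self, pc):
        g = self.global_pht[self.history & self.mask] >= 2
        l = self.local_pht[pc & self.mask] >= 2
        use_global = self.choice[self.history & self.mask] >= 2
        return g if use_global else l

    @staticmethod
    def _bump(table, i, up):
        table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)

    def update(self, pc, taken):
        gi, li = self.history & self.mask, pc & self.mask
        g, l = self.global_pht[gi] >= 2, self.local_pht[li] >= 2
        if g != l:                    # train the chooser only on disagreement
            self._bump(self.choice, gi, g == taken)
        self._bump(self.global_pht, gi, taken)
        self._bump(self.local_pht, li, taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Training the chooser only when the two components disagree mirrors the selector idea above: whichever predictor was correct on a disagreement gains credit for that history context.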

Tournament Branch Predictor

Local predictor: uses 10-bit local histories and shared 3-bit counters.

[Figure, local predictor: PC -> local history table (1K x 10) -> 10-bit history -> counters (1K x 3) -> 1-bit NT/T prediction]

Global and choice predictors:

[Figure, global predictor: 12-bit global history -> counters (4K x 2) -> 1-bit NT/T prediction]

[Figure, choice predictor: 12-bit global history (e.g. 010101010101) -> counters (4K x 2) -> 1-bit local/global selection]

34 % of Predictions from Local Predictor in Tournament Prediction Scheme

nasa7     98%
matrix300 100%
tomcatv   94%
doduc     90%
spice     55%
fpppp     76%
gcc       72%
espresso  63%
eqntott   37%
li        69%

Accuracy of Branch Prediction (fig 3.40)

benchmark | Profile-based | 2-bit counter | Tournament
tomcatv   | 99%           | 99%           | 100%
doduc     | 95%           | 84%           | 97%
fpppp     | 86%           | 82%           | 98%
li        | 88%           | 77%           | 98%
espresso  | 86%           | 82%           | 96%
gcc       | 88%           | 70%           | 94%

Profile: branch profile from the last execution (static, in that it is encoded in the instruction).

Branch Prediction Performance

[Figure: branch predictor performance]

Patt-Yeh Predictor

The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well considering only information about the current branch (local information).

38 Patt-Yeh Predictor

39 Patt-Yeh Predictor

Example branch outcome streams:
A: T NT T NT T NT T NT
B: T T T T T T T T

41 PT entries 01 and 10 are "trained" for A, and 11 is "trained" for B. In general, the Yeh-Patt predictor provides 96%-98% accuracy for integer code.
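The two-level idea behind this example can be sketched as follows (a per-branch 2-bit history register indexing a pattern table of 2-bit counters; all sizes and initial values are assumptions):

```python
class TwoLevelLocal:
    """Sketch of a Yeh-Patt style local predictor: a branch history
    register (BHR) indexes a pattern table (PT) of 2-bit counters."""
    def __init__(self, hist_bits=2):
        self.mask = (1 << hist_bits) - 1
        self.bhr = 0                        # branch history register
        self.pt = [1] * (1 << hist_bits)    # pattern table, weakly not-taken

    def predict(self):
        return self.pt[self.bhr] >= 2

    def update(self, taken):
        c = self.pt[self.bhr]
        self.pt[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
```

For the alternating stream A, the BHR cycles between 01 and 10, so those two PT entries learn "next is NT" and "next is T" respectively; a stream like B keeps the BHR at 11, training that one entry to taken.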

42 Branch Predictors

- Smith (bimodal) predictor
- Pattern-based predictors: two-level, gshare, bi-mode, gskewed, agree, ...
- Predictors based on alternative contexts: alloyed history, path history, loop counting, ...
- Hybrid predictors: multiple component predictors + selection/fusion (tournament, multihybrid, prediction fusion, ...)

Reference book: Ch. 9, “Advanced Instruction Flow Techniques”

43 Branch Decoupling

Loop: LD   F0,0(R1)   ; F0 = vector element
      ADDD F4,F0,F2   ; add scalar from F2
      SD   0(R1),F4   ; store result
      SUBI R1,R1,8    ; decrement pointer by 8B (DW)
      BNEZ R1,Loop    ; branch if R1 != zero

Say 100 iterations. Can the branch be pre-computed for each loop iteration?

Branch Determining Instructions (BDIs):

      LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
BDI:  SUBI R1,R1,8
      BNEZ R1,Loop

Branch Decoupling

Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,8
      BNEZ R1,Loop

Branch Stream:                   Program Stream:
BLoop: SUBI R1,R1,8              PLoop: LD   F0,0(R1)
       BNEZ R1,BLoop,PLoop              ADDD F4,F0,F2
                                        SD   0(R1),F4
                                        SUBI R1,R1,8

Branch Decoupled

[Figure: branch-decoupled architecture - the B-processor (B-reg file, B-PC, I-cache) runs the branch stream and enqueues (target + block size) entries into the PPC queue; the P-processor (P-reg file, I-cache, D-cache) runs the program stream, driven by the PPC and a block counter.]

PPC Control:
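A runnable sketch of this PPC control (assumptions: 4-byte instructions, and each (target, block size) entry yields the target address followed by block_size sequential fetches):

```python
from collections import deque

class PPCControl:
    """Sketch: the B-processor enqueues (target, block size) pairs;
    the P-processor's PPC streams through each block sequentially."""
    def __init__(self):
        self.ppcq = deque()        # PPC queue of (target, block_size)
        self.ppc = 0               # program PC
        self.block_counter = 0

    def enqueue(self, target, block_size):
        self.ppcq.append((target, block_size))

    def fetch_pc(self):
        """Return the next fetch address, or None when stalled."""
        if self.block_counter != 0:
            self.block_counter -= 1    # decrement the block counter
            self.ppc += 4              # increment the PPC (4-byte insts)
            return self.ppc
        if self.ppcq:                  # PPCQ not empty: start a new block
            self.ppc, self.block_counter = self.ppcq.popleft()
            return self.ppc
        return None                    # PPCQ empty: wait for the B-processor
```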

if (block counter != 0) {
    decrement block counter and increment PPC;
} else when (PPCQ not empty) {
    dequeue a (target, block size) entry from PPCQ;
    PPC <- target; block counter <- block size;
}

Branch Prediction With n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl's Law => the relative impact of control stalls will be larger with the lower potential CPI of an n-issue processor

49 Modern Design: Frontend and Backend

Frontend: instruction fetch and dispatch
- Supplies high-quality instructions to the backend
- Instructions flow in program order

Backend: schedule/execute, writeback and commit
- Instructions are processed out-of-order

Frontend Enhancements
- Instruction prefetch: fetch ahead to deliver multiple instructions
- To handle multiple branches: may access multiple cache lines in one cycle, using prefetch to hide the cost
- Target and branch predictions may be integrated with the instruction cache, e.g. in the Pentium 4

50 Pitfall: Sometimes bigger and dumber is better

- The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits)
- On the SPEC95 benchmarks, the 21264 outperforms the 21164: avg. 11.5 vs. 16.5 mispredictions per 1000 instructions
- Reversed for transaction processing (TP)! The 21264 averages 17 mispredictions per 1000 instructions; the 21164 averages 15
- TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local-predictor entries in the 21264)

51