Instruction Fetch and Branch Prediction
CprE 581 Computer Systems Architecture Readings: Textbook (4th ed 2.3, 2.9); (5th ed 3.3)
Frontend and Backend

Feedback from backend to frontend: whether each prediction was correct, to update predictor state; on a misprediction, the correct next PC.

(Figure: pipeline stages: Fetch, Rename, Wakeup/Select, Regfile read, Execute (FUs, bypass, D-cache), Commit.)

Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict the next PC based on the current PC and past history of branch targets and directions
- Special handling of function returns

Backend:
- Execute instructions out-of-order
- Provide feedback info to the frontend
- Mis-prediction affects performance but not correctness

Instruction Flow
Instruction flow must be continuous.
- Branch target prediction: what is the target PC? Must be done at the fetch stage.
- Branch direction prediction: which direction does a branch take? Usually done at fetch.
- Return address prediction: special target prediction for return instructions; may be done at fetch or at decode.
Instruction Flow

Design questions:
- What would happen if branch prediction were done after the fetch stage, e.g. at decode?
- At the fetch stage, how do we know whether an inst is a branch?
- How do we know whether an inst is a return inst?

(Figure: single-cycle fetch loop: the PC indexes the instruction memory and the target, branch, and RA predictors; decode/rename provides feedback.)
Branch Prediction Buffer
(Figure: in the IF/ID/EX/M/WB pipeline, the PC indexes the I-cache and a branch prediction buffer (BPB) in parallel. A fully associative lookup would be expensive; instead, log k bits of the PC index a table of k entries, A0..A(k-1), each holding a prediction bit.)

Branch Target Buffer (BTB)
(Figure: the PC of the instruction to fetch is looked up among the BTB entries, each pairing a branch PC with its predicted target PC. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch, and if it is predicted taken the predicted PC should be used as the next PC.)

Branch Prediction Steps
IF: Send the PC to memory and to the branch-target buffer.
- Entry found in the BTB? If yes, send out the predicted PC; if no, fetch proceeds normally.
ID: Is the instruction a taken branch?
- Not found in the BTB but taken: enter the branch address and the next PC into the BTB.
- Not a branch: normal instruction execution.
EX:
- Mispredicted branch: kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB.
- Correctly predicted branch: continue execution with no stalls.

Adv. Techniques for Instruction Delivery and Speculation
Branch folding: an optimization for a larger branch-target buffer. Store the target instruction itself in the buffer (instead of, or in addition to, the target address) to hide the longer access time a larger buffer requires; fetch can proceed directly from the buffered instruction.

Copyright © 2012, Elsevier Inc. All rights reserved.

Mis-prediction Recovery
Pipeline flushing: a mis-prediction is detected when the branch is resolved.
- May wait until the branch is about to commit, then flush the whole pipeline.
- Selective flushing: immediately and selectively flush only the mis-fetched instructions.

Fetch-stage flushing: special cases, e.g.
- A branch target was wrongly predicted; for most branches the correct target is known at decode.
- An unconditional branch (jump) was predicted as not taken.
Branch Prediction
Predict the branch direction: taken or not taken (T/NT).

      BNE R1, R2, L1    ; taken: go to L1
      ...               ; not taken: fall through
  L1: ...

Static prediction: the compiler decides the direction.
Dynamic prediction: hardware decides the direction using run-time information:
1. 1-bit branch-prediction buffer
2. 2-bit branch-prediction buffer
3. Correlating branch prediction buffer
4. Tournament branch predictor
5. and more ...
Predictor for a Single Branch
General form: 1. access the predictor state using the PC; 2. output a prediction, T/NT; 3. feed back the actual outcome (T/NT) to update the state.

1-bit prediction: one bit of state per branch: 1 = Predict Taken, 0 = Predict Not Taken. A taken outcome sets the bit; a not-taken outcome clears it.
1-bit BHT Accuracy

Example: in a loop, a 1-bit BHT causes 2 mispredictions per pass. Consider an inner loop of 10 iterations before exit:

  for (...) {
      for (i = 0; i < 10; i++)
          a[i] = a[i] * 2.0;
  }

Two mispredictions per pass: on the first and the last loop iteration. Only 80% accuracy.
1-Bit Prediction Drawbacks

  LOOP: Inst 1
        Inst 2
        ...
        Inst k
        Branch to LOOP   ; taken 9 times, not taken 1 time per 10 iterations

Over the 10 iterations, 1-bit prediction mispredicts twice: a 20% misprediction rate.
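The loop example can be simulated directly. A minimal sketch (the function and setup are mine, not from the slides):

```python
def simulate_1bit(outcomes):
    """1-bit predictor: remember and predict the last actual outcome."""
    state = True              # assume initialized to "predict taken"
    mispredicts = 0
    for actual in outcomes:
        if state != actual:
            mispredicts += 1
        state = actual        # 1-bit update: store the last outcome
    return mispredicts

one_pass = [True] * 9 + [False]      # loop branch: taken 9 times, then exit
print(simulate_1bit(one_pass * 10))  # 19 of 100 branches mispredicted (~20%)
```

Every pass after the first costs two mispredictions (the loop exit and the re-entry), matching the 20% steady-state rate above.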
Branch History Table of 1-bit Predictor
BHT is also called Branch Prediction Buffer in the textbook.
- One could use a single 1-bit predictor for all branches, but accuracy would be low.
- BHT: a table of 2^k simple predictors, indexed by k bits of the PC; similar to a direct-mapped cache.
- More entries cost more, but give fewer conflicts and higher accuracy.
- A BHT can also contain more complex predictors.
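A sketch of such a table of 1-bit predictors, indexed like a direct-mapped cache (class and parameter names are hypothetical):

```python
class BHT1:
    """Table of 2**k 1-bit predictors indexed by k low bits of the word PC."""
    def __init__(self, k):
        self.k = k
        self.table = [False] * (1 << k)   # one prediction bit per entry

    def index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)  # drop byte offset, keep k bits

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken

bht = BHT1(k=10)              # 1024 entries
bht.update(0x4000, True)
# Like a direct-mapped cache, PCs 4 KiB apart conflict on the same entry:
print(bht.predict(0x4000), bht.predict(0x4000 + (1 << 12)))  # True True
```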
2-bit Saturating Counter
Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions (Figure 3.7 in the textbook).

(State diagram: states 11 and 10 predict taken; 01 and 00 predict not taken. A taken outcome moves the counter toward 11, a not-taken outcome toward 00, saturating at both ends.)

This adds hysteresis to the decision-making process.
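Repeating the 10-iteration loop experiment with a single 2-bit saturating counter (an illustrative sketch, not slide code) shows the benefit of the hysteresis:

```python
def simulate_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""
    mispredicts = 0
    for actual in outcomes:
        if (state >= 2) != actual:
            mispredicts += 1
        # move toward 3 on taken, toward 0 on not taken, saturating
        state = min(3, state + 1) if actual else max(0, state - 1)
    return mispredicts

one_pass = [True] * 9 + [False]      # loop branch: taken 9 times, then exit
print(simulate_2bit(one_pass * 10))  # 10: one misprediction per pass (10%)
```

The loop exit only knocks the counter from 11 to 10, so re-entering the loop is still predicted taken: one misprediction per pass instead of two.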
Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table. In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters. The plain 2-bit BHT is then a (0,2) predictor.

Correlating Branches
  if (d == 0)        BNEZ  R1, L1       ; branch b1 (taken if d != 0)
      d = 1;         ADDI  R1, R0, #1   ; d = 1
  if (d == 1)    L1: SUBUI R3, R1, #1
      ...            BNEZ  R3, L2       ; branch b2 (taken if d != 1)
                 L2: ...

b1 and b2 are correlated: if b1 is not taken (d == 0), then d is set to 1 and b2 is not taken.

Correlating Branch Predictor
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history). The behavior of recent branches selects between, say, 2 predictions of the next branch, updating just the selected prediction.

(1,1) predictor: 1-bit global history, 1-bit local predictors. (Figure: the branch address (4 bits) indexes pairs of 1-bit local predictors; the 1-bit global history, 0 = not taken, selects which of the pair supplies the prediction.)

Correlating Branch Predictor
General form: an (m,n) predictor uses m bits of global history and n-bit counters for local history, recording correlation among m+1 branches. Simple implementation: the global history is kept in a shift register. Example: a (2,2) predictor: 2-bit global history (01 = not taken then taken) and 2-bit local counters, combined with the branch address to select the counter.

Correlating Branch Example

Assume d alternates between 2 and 0. For reference, the branch outcomes as a function of d:

  initial d | d==0? | b1        | d before b2 | d==1? | b2
  ----------+-------+-----------+-------------+-------+----------
      0     | yes   | not taken |      1      | yes   | not taken
      1     | no    | taken     |      1      | yes   | not taken
      2     | no    | taken     |      2      | no    | taken
With a 1-bit predictor per branch (initial prediction: NT):

  d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  2 |   NT    |     T     |      T      |   NT    |     T     |      T
  0 |   T     |     NT    |      NT     |   T     |     NT    |      NT
  2 |   NT    |     T     |      T      |   NT    |     T     |      T
  0 |   T     |     NT    |      NT     |   T     |     NT    |      NT

The 1-bit predictor mispredicts every branch!
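The table can be reproduced with a short simulation (a sketch; variable names are mine). Two independent 1-bit predictors, initialized to not taken, while d alternates between 2 and 0:

```python
pred = {"b1": False, "b2": False}     # 1-bit predictor per branch
mispredicts = total = 0
for d in [2, 0] * 4:
    b1 = (d != 0)                     # BNEZ R1: taken if d != 0
    if not b1:
        d = 1                         # the not-taken path sets d = 1
    b2 = (d != 1)                     # BNEZ R3: taken if d != 1
    for name, actual in (("b1", b1), ("b2", b2)):
        total += 1
        if pred[name] != actual:
            mispredicts += 1
        pred[name] = actual           # 1-bit update
print(mispredicts, total)             # 16 16: every branch is mispredicted
```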
Correlating Branch Example

Each branch now keeps a pair of prediction bits:

  prediction bits | if last branch not taken | if last branch taken
  NT/NT           | not taken                | not taken
  NT/T            | not taken                | taken
  T/NT            | taken                    | not taken
  T/T             | taken                    | taken
Initial prediction: NT/NT for both branches.

  d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  2 |  NT/NT  |     T     |    T/NT     |  NT/NT  |     T     |    NT/T
  0 |  T/NT   |     NT    |    T/NT     |  NT/T   |     NT    |    NT/T
  2 |  T/NT   |     T     |    T/NT     |  NT/T   |     T     |    NT/T
  0 |  T/NT   |     NT    |    T/NT     |  NT/T   |     NT    |    NT/T

The (1,1) predictor mispredicts only in the first iteration.
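The same simulation with a (1,1) predictor (a sketch; each branch's pair of bits is indexed by a 1-bit global history, initially not taken):

```python
pred = {"b1": [False, False], "b2": [False, False]}  # [if last NT, if last T]
ghist = False                          # outcome of the last branch executed
mispredicts = 0
for d in [2, 0] * 4:
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    for name, actual in (("b1", b1), ("b2", b2)):
        sel = 1 if ghist else 0        # global history selects the bit
        if pred[name][sel] != actual:
            mispredicts += 1
        pred[name][sel] = actual       # update only the selected bit
        ghist = actual                 # shift the outcome into the history
print(mispredicts)                     # 2: only the first iteration misses
```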
Correlating Branches

(2,2) predictor: the behavior of the last two branches selects among four 2-bit per-branch predictors, updating just the selected one. (Figure: the branch address indexes four columns of 2-bit predictors; the 2-bit global branch history, 00/01/10/11, selects the column that supplies the prediction.)

Gselect and Gshare predictors
Keep a global branch history register (GBHR): a shift register holding the outcomes (taken / not taken) of the last k branches. Use it in conjunction with the PC to index a pattern history table (PHT) of 2-bit predictors:
- Gselect: concatenate the GBHR with PC bits
- Gshare: XOR the GBHR with PC bits (better)
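A minimal gshare sketch (class and parameters are illustrative, not from the slides). Because the index mixes history with the PC, it learns a strictly alternating branch that defeats any single 2-bit counter:

```python
class Gshare:
    def __init__(self, k=10):
        self.k = k
        self.ghr = 0                       # global branch history register
        self.pht = [1] * (1 << k)          # 2-bit counters, weakly not taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & ((1 << self.k) - 1)  # XOR: gshare

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.k) - 1)

g = Gshare()
last_miss = 0
for i in range(1000):
    actual = (i % 2 == 0)                  # T NT T NT ... alternating branch
    if g.predict(0x400) != actual:
        last_miss = i                      # remember the last misprediction
    g.update(0x400, actual)
assert last_miss < 100                     # perfect prediction after warm-up
```

After the history register fills, the two alternating history values map to two different counters, each of which trains to its own outcome.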
Adapted from Patterson, Katz and Culler, © UCB & Morgan Kaufmann

Accuracy of Different Schemes (Figure 2.7, page 87)
(Chart: frequency of mispredictions on SPEC89 (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT. The FP codes see roughly 0-1% mispredictions and the integer codes up to 18%; the (2,2) predictor is consistently at or below the 2-bit BHTs, and unlimited entries buy almost nothing over 4,096.)

Re-evaluating Correlation

Several SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:

  program  | branch % | static # | # covering 90%
  compress |   14%    |    236   |    13
  eqntott  |   25%    |    494   |     5
  gcc      |   15%    |   9531   |  2020
  mpeg     |   10%    |   5598   |   532
  real gcc |   13%    |  17361   |  3214

Real programs plus the OS behave more like gcc; are the benefits of correlation small beyond the benchmarks? A misprediction happens because either:
- the guess for that branch was wrong, or
- the branch history of the wrong branch was used when indexing the table.
For SPEC92, a 4096-entry table is about as good as an infinite table, so mispredictions are mostly due to wrong predictions. Can we improve using global history?
Estimate Branch Penalty
Example: the BHT correct rate is 95%; the BTB hit rate is 95%.
The average miss penalty is 1 cycle for a BTB miss and 6 cycles for a BHT misprediction.
How much is the average branch penalty?
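One plausible way to work the example (my reading: every branch independently pays 1 cycle on a BTB miss and 6 cycles on a BHT misprediction):

```python
bht_correct = 0.95         # direction predicted correctly 95% of the time
btb_hit     = 0.95         # target found in the BTB 95% of the time
penalty = (1 - btb_hit) * 1 + (1 - bht_correct) * 6
print(round(penalty, 2))   # 0.35 cycles of penalty per branch on average
```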
Return Address (RA) Prediction

Returns are special register-indirect branches, and register-indirect branches are hard to predict: many callers, one callee, so a single instruction jumps to multiple return addresses and there is no PC-target correlation. In SPEC89, 85% of such branches are procedure returns. Since procedure calls follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries leads to a small miss rate.
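A sketch of such a return-address stack (RAS); the fixed depth and the overflow policy here are illustrative assumptions:

```python
class RAS:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # pop the predicted return target; None means "no prediction"
        return self.stack.pop() if self.stack else None

ras = RAS(depth=8)
ras.on_call(0x1004)                      # call f: push the fall-through PC
ras.on_call(0x2008)                      # f calls g: push again
print(hex(ras.on_return()), hex(ras.on_return()))  # 0x2008 0x1004
```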
Accuracy of Return Address Predictor

(Figure omitted: return-address prediction accuracy as a function of buffer size.)

Tournament Predictors
Motivation, from correlating branch predictors: the 2-bit local predictor failed on important branches, and adding global information improved performance. Tournament predictors use two predictors, one based on global information and one based on local information, combined with a selector, hoping to choose the right predictor for the right branch (or the right context of a branch).
Tournament Branch Predictor
Used in the Alpha 21264. Tracks both "local" and global history; intended for mixed types of applications. Global history: the T/NT history of the past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T).

(Figure: the PC indexes a local predictor; the global NT/T history indexes a global predictor; a choice predictor drives a mux selecting between the two.)

Tournament Predictor in Alpha 21264
4K 2-bit counters choose between a global predictor and a local predictor. The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor. In the 12-bit pattern, the ith bit is 0 if the ith prior branch was not taken, and 1 if it was taken.

(Choice-predictor state diagram: four states: strongly and weakly "use predictor 1", weakly and strongly "use predictor 2". When exactly one predictor is correct, the counter moves toward that predictor; when both are right or both are wrong, the state is unchanged.)

Tournament Predictor in Alpha 21264
The local predictor is itself a 2-level predictor:
- Top level: a local history table of 1024 10-bit entries; each entry holds the most recent 10 branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table indexes a table of 1K 3-bit saturating counters, which provide the local prediction.

Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits (~180K transistors).
Tournament Branch Predictor

Local predictor: the PC indexes the 1K x 10 local history table; the selected 10-bit history indexes shared 3-bit counters (1K x 3), which give the local NT/T prediction.

Global and choice predictors: the 12-bit global history (e.g. 010101010101) indexes 4K x 2 counters for the global NT/T prediction, and a separate 4K x 2 counter table chooses between local and global.
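The choice predictor can be sketched as a 2-bit counter updated only when the two components disagree (an illustrative model; stub component predictions stand in for the real local and global tables):

```python
def choose_and_update(choice, local_pred, global_pred, actual):
    """2-bit chooser: choice >= 2 means 'use the local component'."""
    prediction = local_pred if choice >= 2 else global_pred
    if local_pred != global_pred:        # learn only when the components disagree
        if local_pred == actual:
            choice = min(3, choice + 1)  # move toward "use local"
        else:
            choice = max(0, choice - 1)  # move toward "use global"
    return prediction, choice

# Suppose the local component is always right and the global always wrong:
choice = 0                               # start strongly "use global"
for _ in range(5):
    pred, choice = choose_and_update(choice, local_pred=True,
                                     global_pred=False, actual=True)
print(choice >= 2, pred)                 # True True: the chooser switched
```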
% of Predictions from the Local Predictor in the Tournament Prediction Scheme

  nasa7      98%
  matrix300 100%
  tomcatv    94%
  doduc      90%
  spice      55%
  fpppp      76%
  gcc        72%
  espresso   63%
  eqntott    37%
  li         69%
Accuracy of Branch Prediction (fig 3.40)

  benchmark | profile-based | 2-bit counter | tournament
  tomcatv   |      99%      |      99%      |    100%
  doduc     |      84%      |      95%      |     97%
  fpppp     |      82%      |      86%      |     98%
  li        |      77%      |      88%      |     98%
  espresso  |      82%      |      86%      |     96%
  gcc       |      70%      |      88%      |     94%

Profile: a branch profile from the last execution (static in the sense that the profile is encoded in the instruction).
Branch Prediction Performance
(Figure: branch predictor performance.)

Patt-Yeh Predictor

The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well using only information about the current branch (local information).
Patt-Yeh Predictor

Example: two branches with outcome histories

  A: T NT T NT T NT ...   (alternating)
  B: T T T T T T T ...    (always taken)

Each branch's recent history indexes a shared pattern table (PT). With a 2-bit history, PT entries 01 and 10 are "trained" for A, and 11 is "trained" for B. In general, the Yeh-Patt predictor provides 96%-98% accuracy for integer code.
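A sketch of this two-level scheme on branches A and B (names and table sizes are mine: 2-bit per-branch histories index a shared 4-entry pattern table of 2-bit counters):

```python
hist = {"A": 0, "B": 0}            # per-branch 2-bit history registers
pt = [1] * 4                       # pattern table: 2-bit counters, weakly NT

def predict(branch):
    return pt[hist[branch]] >= 2

def update(branch, taken):
    h = hist[branch]
    pt[h] = min(3, pt[h] + 1) if taken else max(0, pt[h] - 1)
    hist[branch] = ((h << 1) | int(taken)) & 0b11  # shift outcome into history

steady_misses = 0
for i in range(100):
    a = (i % 2 == 0)               # A: T NT T NT ...
    if i >= 20 and predict("A") != a:
        steady_misses += 1         # count mispredictions after warm-up only
    update("A", a)
    if i >= 20 and not predict("B"):
        steady_misses += 1         # B: always taken
    update("B", True)
print(steady_misses)               # 0: entries 01/10 serve A, entry 11 serves B
```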
Branch Predictors

- Smith (bimodal) predictor
- Pattern-based predictors: two-level, gshare, bi-mode, gskewed, Agree, ...
- Predictors based on alternative contexts: alloyed history, path history, loop counting, ...
- Hybrid predictors: multiple component predictors plus selection/fusion (tournament, multihybrid, prediction fusion, ...)
Reference book: Ch. 9, “Advanced Instruction Flow Techniques”
Branch Decoupling

  Loop: LD    F0,0(R1)   ; F0 = vector element
        ADDD  F4,F0,F2   ; add scalar from F2
        SD    0(R1),F4   ; store result
        SUBI  R1,R1,8    ; decrement pointer by 8 bytes (DW)
        BNEZ  R1,Loop    ; branch if R1 != zero
Say 100 iterations. Can the branch be pre-computed for each loop iteration?

Branch Determining Instructions (BDIs):

        LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
  BDI:  SUBI  R1,R1,8
        BNEZ  R1,Loop

Branch Decoupling: split the loop into a branch stream and a program stream:

  Branch stream:                 Program stream:
  BLoop: SUBI R1,R1,8            PLoop: LD   F0,0(R1)
         BNEZ R1,BLoop,PLoop            ADDD F4,F0,F2
                                        SD   0(R1),F4
                                        SUBI R1,R1,8

Branch Decoupled Microarchitecture
(Figure: the B-processor, with its own B-reg file, B-PC and I-cache, runs the branch stream; the P-processor, with its P-reg file, I-cache and D-cache, runs the program stream.)
The B-processor passes (target, block size) pairs into a PPC queue (PPCQ); a block counter sequences the P-processor's PC (PPC):

  if (block counter != 0)
      decrement block counter and increment PPC;
  else, when PPCQ is not empty:
      dequeue a (target, block size) entry from PPCQ;
      PPC <- target; block counter <- block size.

Branch Prediction with n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor.
2. Amdahl's Law: the relative impact of control stalls is larger given the lower potential CPI of an n-issue processor.
Modern Design: Frontend and Backend
Frontend: instruction fetch and dispatch: supplies high-quality instructions to the backend; instructions flow in program order.
Backend: schedule/execute, writeback, and commit: instructions are processed out-of-order.
Frontend Enhancements

- Instruction prefetch: fetch ahead to deliver multiple instructions per cycle.
- To handle multiple branches per cycle: may access multiple cache lines in one cycle, using prefetch to hide the cost.
- Target and branch predictions may be integrated with the instruction cache, e.g. the Intel Pentium 4 trace cache.
Pitfall: Sometimes Bigger and Dumber Is Better

The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits). On the SPEC95 benchmarks, the 21264 is clearly better: it averages 11.5 mispredictions per 1000 instructions, while the 21164 averages 16.5.

The result is reversed for transaction processing (TP): the 21264 averages 17 mispredictions per 1000 instructions, the 21164 only 15. TP code is much larger, and the 21164 holds 2x as many branch predictions based on local behavior (2K entries vs. the 1K-entry local predictor in the 21264).