Instruction Fetch and Branch Prediction
CprE 581 Computer Systems Architecture Readings: Textbook (4th ed 2.3, 2.9); (5th ed 3.3)
Frontend and Backend

Feedback from backend to frontend: whether each prediction was correct, to update predictor state; on a misprediction, the correct next PC.

(Figure: pipeline stages: Fetch, Rename, Wakeup/Select, Regfile read, Execute (FUs, bypass, D-cache), Commit.)

Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict the next PC based on the current PC and past history of branch targets and directions
- Special handling of function returns

Backend:
- Execute instructions out-of-order
- Provide feedback info to the frontend
- Mis-prediction affects performance but not correctness

Instruction Flow
Instruction flow must be continuous.
- Branch target prediction: what is the target PC? Must be done at the fetch stage.
- Branch direction prediction: which direction does a branch take? Usually done at fetch.
- Return address prediction: special target prediction for return instructions; may be done at fetch or at decode.
Instruction Flow

Design questions:
- What would happen if branch prediction were done after the fetch stage, e.g. at decode?
- At the fetch stage, how do we know whether an inst is a branch?
- How do we know whether an inst is a return inst?

(Figure: single-cycle fetch loop: the PC indexes the instruction memory and the target, branch, and RA predictors; decode/rename provides feedback.)
Branch Prediction Buffer
(Figure: in the IF/ID/EX/M/WB pipeline, the PC indexes the I-cache and a branch prediction buffer (BPB) in parallel. A fully associative lookup would be expensive; instead, log k bits of the PC index a table of k entries, A0..A(k-1), each holding a prediction bit.)

Branch Target Buffer (BTB)
(Figure: the PC of the instruction to fetch is looked up among the BTB entries, each pairing a branch PC with its predicted target PC. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch, and if it is predicted taken the predicted PC should be used as the next PC.)

Branch Prediction Steps
IF: Send the PC to memory and to the branch-target buffer.
- Entry found in the BTB? If yes, send out the predicted PC; if no, fetch proceeds normally.
ID: Is the instruction a taken branch?
- Not found in the BTB but taken: enter the branch address and the next PC into the BTB.
- Not a branch: normal instruction execution.
EX:
- Mispredicted branch: kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB.
- Correctly predicted branch: continue execution with no stalls.

Adv. Techniques for Instruction Delivery and Speculation
Branch folding: an optimization for a larger branch-target buffer. Store the target instruction itself in the buffer (instead of, or in addition to, the target address) to hide the longer access time a larger buffer requires; fetch can proceed directly from the buffered instruction.

Copyright © 2012, Elsevier Inc. All rights reserved.

Mis-prediction Recovery
Pipeline flushing: a mis-prediction is detected when the branch is resolved.
- May wait until the branch is about to commit, then flush the whole pipeline.
- Selective flushing: immediately and selectively flush only the mis-fetched instructions.

Fetch-stage flushing: special cases, e.g.
- A branch target was wrongly predicted; for most branches the correct target is known at decode.
- An unconditional branch (jump) was predicted as not taken.
Branch Prediction
Predict the branch direction: taken or not taken (T/NT).

      BNE R1, R2, L1    ; taken: go to L1
      ...               ; not taken: fall through
  L1: ...

Static prediction: the compiler decides the direction.
Dynamic prediction: hardware decides the direction using run-time information:
1. 1-bit branch-prediction buffer
2. 2-bit branch-prediction buffer
3. Correlating branch prediction buffer
4. Tournament branch predictor
5. and more ...
Predictor for a Single Branch
General form: 1. access the predictor state using the PC; 2. output a prediction, T/NT; 3. feed back the actual outcome (T/NT) to update the state.

1-bit prediction: one bit of state per branch: 1 = Predict Taken, 0 = Predict Not Taken. A taken outcome sets the bit; a not-taken outcome clears it.
1-bit BHT Accuracy

Example: in a loop, a 1-bit BHT causes 2 mispredictions per pass. Consider an inner loop of 10 iterations before exit:

  for (...) {
      for (i = 0; i < 10; i++)
          a[i] = a[i] * 2.0;
  }

Two mispredictions per pass: on the first and the last loop iteration. Only 80% accuracy.
1-Bit Prediction Drawbacks

  LOOP: Inst 1
        Inst 2
        ...
        Inst k
        Branch to LOOP   ; taken 9 times, not taken 1 time per 10 iterations

Over the 10 iterations, 1-bit prediction mispredicts twice: a 20% misprediction rate.
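The loop example can be simulated directly. A minimal sketch (the function and setup are mine, not from the slides):

```python
def simulate_1bit(outcomes):
    """1-bit predictor: remember and predict the last actual outcome."""
    state = True              # assume initialized to "predict taken"
    mispredicts = 0
    for actual in outcomes:
        if state != actual:
            mispredicts += 1
        state = actual        # 1-bit update: store the last outcome
    return mispredicts

one_pass = [True] * 9 + [False]      # loop branch: taken 9 times, then exit
print(simulate_1bit(one_pass * 10))  # 19 of 100 branches mispredicted (~20%)
```

Every pass after the first costs two mispredictions (the loop exit and the re-entry), matching the 20% steady-state rate above.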
Branch History Table of 1-bit Predictor
BHT is also called Branch Prediction Buffer in the textbook.
- One could use a single 1-bit predictor for all branches, but accuracy would be low.
- BHT: a table of 2^k simple predictors, indexed by k bits of the PC; similar to a direct-mapped cache.
- More entries cost more, but give fewer conflicts and higher accuracy.
- A BHT can also contain more complex predictors.
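A sketch of such a table of 1-bit predictors, indexed like a direct-mapped cache (class and parameter names are hypothetical):

```python
class BHT1:
    """Table of 2**k 1-bit predictors indexed by k low bits of the word PC."""
    def __init__(self, k):
        self.k = k
        self.table = [False] * (1 << k)   # one prediction bit per entry

    def index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)  # drop byte offset, keep k bits

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken

bht = BHT1(k=10)              # 1024 entries
bht.update(0x4000, True)
# Like a direct-mapped cache, PCs 4 KiB apart conflict on the same entry:
print(bht.predict(0x4000), bht.predict(0x4000 + (1 << 12)))  # True True
```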
2-bit Saturating Counter
Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions (Figure 3.7 in the textbook).

(State diagram: states 11 and 10 predict taken; 01 and 00 predict not taken. A taken outcome moves the counter toward 11, a not-taken outcome toward 00, saturating at both ends.)

This adds hysteresis to the decision-making process.
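Repeating the 10-iteration loop experiment with a single 2-bit saturating counter (an illustrative sketch, not slide code) shows the benefit of the hysteresis:

```python
def simulate_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""
    mispredicts = 0
    for actual in outcomes:
        if (state >= 2) != actual:
            mispredicts += 1
        # move toward 3 on taken, toward 0 on not taken, saturating
        state = min(3, state + 1) if actual else max(0, state - 1)
    return mispredicts

one_pass = [True] * 9 + [False]      # loop branch: taken 9 times, then exit
print(simulate_2bit(one_pass * 10))  # 10: one misprediction per pass (10%)
```

The loop exit only knocks the counter from 11 to 10, so re-entering the loop is still predicted taken: one misprediction per pass instead of two.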
Correlating Branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table. In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters. The plain 2-bit BHT is then a (0,2) predictor.

Correlating Branches
  if (d == 0)        BNEZ  R1, L1       ; branch b1 (taken if d != 0)
      d = 1;         ADDI  R1, R0, #1   ; d = 1
  if (d == 1)    L1: SUBUI R3, R1, #1
      ...            BNEZ  R3, L2       ; branch b2 (taken if d != 1)
                 L2: ...

b1 and b2 are correlated: if b1 is not taken (d == 0), then d is set to 1 and b2 is not taken.

Correlating Branch Predictor
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history). The behavior of recent branches selects between, say, 2 predictions of the next branch, updating just the selected prediction.

(1,1) predictor: 1-bit global history, 1-bit local predictors. (Figure: the branch address (4 bits) indexes pairs of 1-bit local predictors; the 1-bit global history, 0 = not taken, selects which of the pair supplies the prediction.)

Correlating Branch Predictor
General form: an (m,n) predictor uses m bits of global history and n-bit counters for local history, recording correlation among m+1 branches. Simple implementation: the global history is kept in a shift register. Example: a (2,2) predictor: 2-bit global history (01 = not taken then taken) and 2-bit local counters, combined with the branch address to select the counter.

Correlating Branch Example

Assume d alternates between 2 and 0. For reference, the branch outcomes as a function of d:

  initial d | d==0? | b1        | d before b2 | d==1? | b2
  ----------+-------+-----------+-------------+-------+----------
      0     | yes   | not taken |      1      | yes   | not taken
      1     | no    | taken     |      1      | yes   | not taken
      2     | no    | taken     |      2      | no    | taken
With a 1-bit predictor per branch (initial prediction: NT):

  d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  2 |   NT    |     T     |      T      |   NT    |     T     |      T
  0 |   T     |     NT    |      NT     |   T     |     NT    |      NT
  2 |   NT    |     T     |      T      |   NT    |     T     |      T
  0 |   T     |     NT    |      NT     |   T     |     NT    |      NT

The 1-bit predictor mispredicts every branch!
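The table can be reproduced with a short simulation (a sketch; variable names are mine). Two independent 1-bit predictors, initialized to not taken, while d alternates between 2 and 0:

```python
pred = {"b1": False, "b2": False}     # 1-bit predictor per branch
mispredicts = total = 0
for d in [2, 0] * 4:
    b1 = (d != 0)                     # BNEZ R1: taken if d != 0
    if not b1:
        d = 1                         # the not-taken path sets d = 1
    b2 = (d != 1)                     # BNEZ R3: taken if d != 1
    for name, actual in (("b1", b1), ("b2", b2)):
        total += 1
        if pred[name] != actual:
            mispredicts += 1
        pred[name] = actual           # 1-bit update
print(mispredicts, total)             # 16 16: every branch is mispredicted
```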
Correlating Branch Example

Each branch now keeps a pair of prediction bits:

  prediction bits | if last branch not taken | if last branch taken
  NT/NT           | not taken                | not taken
  NT/T            | not taken                | taken
  T/NT            | taken                    | not taken
  T/T             | taken                    | taken
Initial prediction: NT/NT for both branches.

  d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
  2 |  NT/NT  |     T     |    T/NT     |  NT/NT  |     T     |    NT/T
  0 |  T/NT   |     NT    |    T/NT     |  NT/T   |     NT    |    NT/T
  2 |  T/NT   |     T     |    T/NT     |  NT/T   |     T     |    NT/T
  0 |  T/NT   |     NT    |    T/NT     |  NT/T   |     NT    |    NT/T

The (1,1) predictor mispredicts only in the first iteration.
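The same simulation with a (1,1) predictor (a sketch; each branch's pair of bits is indexed by a 1-bit global history, initially not taken):

```python
pred = {"b1": [False, False], "b2": [False, False]}  # [if last NT, if last T]
ghist = False                          # outcome of the last branch executed
mispredicts = 0
for d in [2, 0] * 4:
    b1 = (d != 0)
    if not b1:
        d = 1
    b2 = (d != 1)
    for name, actual in (("b1", b1), ("b2", b2)):
        sel = 1 if ghist else 0        # global history selects the bit
        if pred[name][sel] != actual:
            mispredicts += 1
        pred[name][sel] = actual       # update only the selected bit
        ghist = actual                 # shift the outcome into the history
print(mispredicts)                     # 2: only the first iteration misses
```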
Correlating Branches

(2,2) predictor: the behavior of the last two branches selects among four 2-bit per-branch predictors, updating just the selected one. (Figure: the branch address indexes four columns of 2-bit predictors; the 2-bit global branch history, 00/01/10/11, selects the column that supplies the prediction.)

Gselect and Gshare predictors
Keep a global branch history register (GBHR): a shift register holding the outcomes (taken / not taken) of the last k branches. Use it in conjunction with the PC to index a pattern history table (PHT) of 2-bit predictors:
- Gselect: concatenate the GBHR with PC bits
- Gshare: XOR the GBHR with PC bits (better)
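A minimal gshare sketch (class and parameters are illustrative, not from the slides). Because the index mixes history with the PC, it learns a strictly alternating branch that defeats any single 2-bit counter:

```python
class Gshare:
    def __init__(self, k=10):
        self.k = k
        self.ghr = 0                       # global branch history register
        self.pht = [1] * (1 << k)          # 2-bit counters, weakly not taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & ((1 << self.k) - 1)  # XOR: gshare

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.k) - 1)

g = Gshare()
last_miss = 0
for i in range(1000):
    actual = (i % 2 == 0)                  # T NT T NT ... alternating branch
    if g.predict(0x400) != actual:
        last_miss = i                      # remember the last misprediction
    g.update(0x400, actual)
assert last_miss < 100                     # perfect prediction after warm-up
```

After the history register fills, the two alternating history values map to two different counters, each of which trains to its own outcome.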
Adapted from Patterson, Katz and Culler, © UCB & Morgan Kaufmann

Accuracy of Different Schemes (Figure 2.7, page 87)
(Chart: frequency of mispredictions on SPEC89 (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT. The FP codes see roughly 0-1% mispredictions and the integer codes up to 18%; the (2,2) predictor is consistently at or below the 2-bit BHTs, and unlimited entries buy almost nothing over 4,096.)

Re-evaluating Correlation

Several SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:

  program  | branch % | static # | # covering 90%
  compress |   14%    |    236   |    13
  eqntott  |   25%    |    494   |     5
  gcc      |   15%    |   9531   |  2020
  mpeg     |   10%    |   5598   |   532
  real gcc |   13%    |  17361   |  3214

Real programs plus the OS behave more like gcc; are the benefits of correlation small beyond the benchmarks? A misprediction happens because either:
- the guess for that branch was wrong, or
- the branch history of the wrong branch was used when indexing the table.
For SPEC92, a 4096-entry table is about as good as an infinite table, so mispredictions are mostly due to wrong predictions. Can we improve using global history?
Estimate Branch Penalty
Example: the BHT correct rate is 95%; the BTB hit rate is 95%.
The average miss penalty is 1 cycle for a BTB miss and 6 cycles for a BHT misprediction.
How much is the average branch penalty?
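One plausible way to work the example (my reading: every branch independently pays 1 cycle on a BTB miss and 6 cycles on a BHT misprediction):

```python
bht_correct = 0.95         # direction predicted correctly 95% of the time
btb_hit     = 0.95         # target found in the BTB 95% of the time
penalty = (1 - btb_hit) * 1 + (1 - bht_correct) * 6
print(round(penalty, 2))   # 0.35 cycles of penalty per branch on average
```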
Return Address (RA) Prediction

Returns are special register-indirect branches, and register-indirect branches are hard to predict: many callers, one callee, so a single instruction jumps to multiple return addresses and there is no PC-target correlation. In SPEC89, 85% of such branches are procedure returns. Since procedure calls follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries leads to a small miss rate.
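A sketch of such a return-address stack (RAS); the fixed depth and the overflow policy here are illustrative assumptions:

```python
class RAS:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # pop the predicted return target; None means "no prediction"
        return self.stack.pop() if self.stack else None

ras = RAS(depth=8)
ras.on_call(0x1004)                      # call f: push the fall-through PC
ras.on_call(0x2008)                      # f calls g: push again
print(hex(ras.on_return()), hex(ras.on_return()))  # 0x2008 0x1004
```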
Accuracy of Return Address Predictor

(Figure omitted: return-address prediction accuracy as a function of buffer size.)

Tournament Predictors
Motivation, from correlating branch predictors: the 2-bit local predictor failed on important branches, and adding global information improved performance. Tournament predictors use two predictors, one based on global information and one based on local information, combined with a selector, hoping to choose the right predictor for the right branch (or the right context of a branch).
Tournament Branch Predictor
Used in the Alpha 21264. Tracks both "local" and global history; intended for mixed types of applications. Global history: the T/NT history of the past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T).

(Figure: the PC indexes a local predictor; the global NT/T history indexes a global predictor; a choice predictor drives a mux selecting between the two.)

Tournament Predictor in Alpha 21264
4K 2-bit counters choose between a global predictor and a local predictor. The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor. In the 12-bit pattern, the ith bit is 0 if the ith prior branch was not taken, and 1 if it was taken.

(Choice-predictor state diagram: four states: strongly and weakly "use predictor 1", weakly and strongly "use predictor 2". When exactly one predictor is correct, the counter moves toward that predictor; when both are right or both are wrong, the state is unchanged.)

Tournament Predictor in Alpha 21264
The local predictor is itself a 2-level predictor:
- Top level: a local history table of 1024 10-bit entries; each entry holds the most recent 10 branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table indexes a table of 1K 3-bit saturating counters, which provide the local prediction.

Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits (~180K transistors).
Tournament Branch Predictor

Local predictor: the PC indexes the 1K x 10 local history table; the selected 10-bit history indexes shared 3-bit counters (1K x 3), which give the local NT/T prediction.

Global and choice predictors: the 12-bit global history (e.g. 010101010101) indexes 4K x 2 counters for the global NT/T prediction, and a separate 4K x 2 counter table chooses between local and global.
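The choice predictor can be sketched as a 2-bit counter updated only when the two components disagree (an illustrative model; stub component predictions stand in for the real local and global tables):

```python
def choose_and_update(choice, local_pred, global_pred, actual):
    """2-bit chooser: choice >= 2 means 'use the local component'."""
    prediction = local_pred if choice >= 2 else global_pred
    if local_pred != global_pred:        # learn only when the components disagree
        if local_pred == actual:
            choice = min(3, choice + 1)  # move toward "use local"
        else:
            choice = max(0, choice - 1)  # move toward "use global"
    return prediction, choice

# Suppose the local component is always right and the global always wrong:
choice = 0                               # start strongly "use global"
for _ in range(5):
    pred, choice = choose_and_update(choice, local_pred=True,
                                     global_pred=False, actual=True)
print(choice >= 2, pred)                 # True True: the chooser switched
```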
% of Predictions from the Local Predictor in the Tournament Prediction Scheme

  nasa7      98%
  matrix300 100%
  tomcatv    94%
  doduc      90%
  spice      55%
  fpppp      76%
  gcc        72%
  espresso   63%
  eqntott    37%
  li         69%
Accuracy of Branch Prediction (fig 3.40)

  benchmark | profile-based | 2-bit counter | tournament
  tomcatv   |      99%      |      99%      |    100%
  doduc     |      84%      |      95%      |     97%
  fpppp     |      82%      |      86%      |     98%
  li        |      77%      |      88%      |     98%
  espresso  |      82%      |      86%      |     96%
  gcc       |      70%      |      88%      |     94%

Profile: a branch profile from the last execution (static in the sense that the profile is encoded in the instruction).
Branch Prediction Performance
(Figure: branch predictor performance.)

Patt-Yeh Predictor

The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well using only information about the current branch (local information).
Patt-Yeh Predictor

Example: two branches with outcome histories

  A: T NT T NT T NT ...   (alternating)
  B: T T T T T T T ...    (always taken)

Each branch's recent history indexes a shared pattern table (PT). With a 2-bit history, PT entries 01 and 10 are "trained" for A, and 11 is "trained" for B. In general, the Yeh-Patt predictor provides 96%-98% accuracy for integer code.
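A sketch of this two-level scheme on branches A and B (names and table sizes are mine: 2-bit per-branch histories index a shared 4-entry pattern table of 2-bit counters):

```python
hist = {"A": 0, "B": 0}            # per-branch 2-bit history registers
pt = [1] * 4                       # pattern table: 2-bit counters, weakly NT

def predict(branch):
    return pt[hist[branch]] >= 2

def update(branch, taken):
    h = hist[branch]
    pt[h] = min(3, pt[h] + 1) if taken else max(0, pt[h] - 1)
    hist[branch] = ((h << 1) | int(taken)) & 0b11  # shift outcome into history

steady_misses = 0
for i in range(100):
    a = (i % 2 == 0)               # A: T NT T NT ...
    if i >= 20 and predict("A") != a:
        steady_misses += 1         # count mispredictions after warm-up only
    update("A", a)
    if i >= 20 and not predict("B"):
        steady_misses += 1         # B: always taken
    update("B", True)
print(steady_misses)               # 0: entries 01/10 serve A, entry 11 serves B
```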
Branch Predictors

- Smith (bimodal) predictor
- Pattern-based predictors: two-level, gshare, bi-mode, gskewed, Agree, ...
- Predictors based on alternative contexts: alloyed history, path history, loop counting, ...
- Hybrid predictors: multiple component predictors plus selection/fusion (tournament, multihybrid, prediction fusion, ...)
Reference book: Ch. 9, “Advanced Instruction Flow Techniques”
Branch Decoupling

  Loop: LD    F0,0(R1)   ; F0 = vector element
        ADDD  F4,F0,F2   ; add scalar from F2
        SD    0(R1),F4   ; store result
        SUBI  R1,R1,8    ; decrement pointer by 8 bytes (DW)
        BNEZ  R1,Loop    ; branch if R1 != zero
Say 100 iterations. Can the branch be pre-computed for each loop iteration?

Branch Determining Instructions (BDIs):

        LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
  BDI:  SUBI  R1,R1,8
        BNEZ  R1,Loop

Branch Decoupling: split the loop into a branch stream and a program stream:

  Branch stream:                 Program stream:
  BLoop: SUBI R1,R1,8            PLoop: LD   F0,0(R1)
         BNEZ R1,BLoop,PLoop            ADDD F4,F0,F2
                                        SD   0(R1),F4
                                        SUBI R1,R1,8

Branch Decoupled Microarchitecture
(Figure: the B-processor, with its own B-reg file, B-PC and I-cache, runs the branch stream; the P-processor, with its P-reg file, I-cache and D-cache, runs the program stream.)
The B-processor passes (target, block size) pairs into a PPC queue (PPCQ); a block counter sequences the P-processor's PC (PPC):

  if (block counter != 0)
      decrement block counter and increment PPC;
  else, when PPCQ is not empty:
      dequeue a (target, block size) entry from PPCQ;
      PPC <- target; block counter <- block size.

Branch Prediction with n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor.
2. Amdahl's Law: the relative impact of control stalls is larger given the lower potential CPI of an n-issue processor.
Modern Design: Frontend and Backend
Frontend: instruction fetch and dispatch: supplies high-quality instructions to the backend; instructions flow in program order.
Backend: schedule/execute, writeback, and commit: instructions are processed out-of-order.
Frontend Enhancements

- Instruction prefetch: fetch ahead to deliver multiple instructions per cycle.
- To handle multiple branches per cycle: may access multiple cache lines in one cycle, using prefetch to hide the cost.
- Target and branch predictions may be integrated with the instruction cache, e.g. the Intel Pentium 4 trace cache.
Pitfall: Sometimes Bigger and Dumber Is Better

The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits). On the SPEC95 benchmarks, the 21264 is clearly better: it averages 11.5 mispredictions per 1000 instructions, while the 21164 averages 16.5.

The result is reversed for transaction processing (TP): the 21264 averages 17 mispredictions per 1000 instructions, the 21164 only 15. TP code is much larger, and the 21164 holds 2x as many branch predictions based on local behavior (2K entries vs. the 1K-entry local predictor in the 21264).