Branch Predictor Prediction: A Power-Aware Branch Predictor for High-Performance Processors

Amirali Baniasadi, Electrical and Computer Engineering, Northwestern University
Andreas Moshovos, Electrical and Computer Engineering, University of Toronto

ABSTRACT

We introduce Branch Predictor Prediction (BPP) as a power-aware branch prediction technique for high-performance processors. Our predictor reduces branch prediction power dissipation by selectively turning on and off two of the three tables used in the combined branch predictor. BPP relies on a small buffer that stores the addresses and the sub-predictors used by the most recent branches executed. Later, we refer to this buffer to decide if any of the sub-predictors and the selector can be gated without harming performance. In this work we study power and performance trade-offs for a subset of SPEC2K benchmarks. We show that, on average and for an 8-way processor, BPP can reduce branch prediction power dissipation by 28% and 14% compared to non-banked and banked 32K predictors respectively. This comes with a negligible impact on performance (1% max). We show that BPP always reduces power even for smaller predictors and that it offers better overall power and performance compared to simpler predictors.

1. INTRODUCTION

Dynamic power dissipation has emerged as a first-class consideration in modern, high-performance processor design. In this work, we focus on power-efficient and highly-accurate branch predictors. Reportedly, for a typical high-performance processor, branch predictors dissipate a considerable amount (i.e., about 10%) of overall power [1].

Overall power dissipation heavily depends on branch prediction accuracy, making developing power-efficient branch predictors a challenging task. In isolation, it would seem that using smaller predictors would result in reduced power dissipation. However, this is not true, since branch predictor accuracy indirectly impacts the amount of work, and hence the amount of power, that is consumed by the rest of the processor [2]. Accordingly, in this work we focus on predictors that are both power-efficient and highly-accurate.

We propose branch predictor prediction, or BPP, as a power-efficient extension to the commonly used combined predictors. Combined predictors use three underlying sub-predictors that are all accessed for every branch. Out of the three prediction hints, one is used to select among the other two. BPP exploits the temporal and sub-predictor locality characteristics of typical branch streams to allow us to gate two out of the three sub-predictors for many branches. Specifically, we have observed that: 1) often, short sequences of branches tend to appear repeatedly, and 2) they tend to use the same sub-predictors. In BPP, a small buffer is introduced in the fetch stage. BPP entries record the sub-predictors used by recent branches. Each BPP entry is tagged by the PC of a recently seen branch and records the sub-predictors used by the two branches that followed it in the dynamic execution stream. By associating sub-predictor hints with a preceding branch, BPP avoids increasing prediction latency: the hints become available at least one cycle before the actual prediction needs to take place.

We show that our predictor can reduce branch predictor power by 14% over a 32K-entry, conventional, banked combined branch predictor (taking into consideration the BTB and the return address stack, which our BPP does not optimize). Moreover, we show that a processor with BPP is always more power-efficient than one that uses any of the sub-predictors alone. Finally, we show that BPP reduces power even for smaller branch predictors (from 16K entries down to 1K entries).

The rest of the paper is organized as follows. In section 2 we explain BPP. In section 3 we present our experimental evaluation: we report performance and power savings, we report relative power reduction for both the predictor and the entire processor, and we compare our model to three different alternatives. In section 4, we study how BPP reacts to changes in predictor size and BPP configuration. In section 5 we review related work. Finally, in section 6, we summarize our findings.

2. BRANCH PREDICTOR PREDICTION

Branch prediction is essential to sustaining high performance in modern high-end processors. The combined predictor is used in many high-performance designs. Combined predictors use three underlying sub-predictors. Two of the sub-predictors produce predictions for branches. They are typically tuned for different branch behaviors. By using a selector predictor, combined predictors offer the best of both underlying sub-predictors.

While using three tables makes better prediction possible, it also increases power dissipation. Our analysis shows that for a 32K-entry predictor, the three sub-predictors consume close to 65% of the total predictor power (the rest is consumed in the BTB and the return address stack).

BPP relies on typical program behavior to gate two out of the three underlying sub-predictors for most branches. In particular, BPP exploits the following two phenomena: (1) Branch instructions show strong temporal locality. We measured the temporal locality in the dynamic branch stream by studying how often a branch is within the last n branches fetched, where n varies from 8 to 64. On average and for a subset of SPEC2K benchmarks, almost half of the branches appear within eight branches of their previous occurrence, and about 83% of the branches appear within the last 64 branches fetched. (2) Branches often tend to use the same sub-predictor. Our study shows that, on average, a branch uses the same sub-predictor with a 91% probability for a subset of SPEC2K benchmarks.

Based on the above we suggest BPP. At the core of BPP we use a FIFO buffer called the BPP-buffer. The BPP-buffer is a small, power-efficient structure that stores information regarding the most recent branches encountered. The stored information is used to gate the selector and one of the two sub-predictors. This n-entry FIFO buffer records the n most recently fetched branches in sequence. Every buffer entry includes an address field and a two-bit sub-predictor hint field. We use the latter to record the last sub-predictor used by the specific branch. Combinations 00 and 01 indicate that the gshare and bimodal predictors, respectively, were the last predictors used to speculate the branch outcome. We assign combination 11 to indicate that the branch was mispredicted the last time it was speculated. We do not gate the sub-predictors if a mispredicted branch reappears, since we have no confidence in the predictor used last time.

The front-end fetches up to two branches every cycle for a total of eight instructions. Once all instructions are fetched, we check the BPP buffer to see if the last branch fetched is among those recorded in the buffer. If no match is found, all sub-predictors will be probed during the next cycle. If a match is found, we look at the sub-predictor hints of the next two buffer elements. Since branch sequences tend to repeat, we implicitly predict (assume) that these hints are the right ones for the next two branches that may be fetched during the next cycle. We gate two of the three tables if the next two dynamic branches (guessed by the BPP-buffer) have used (and therefore are predicted to use) the same sub-predictor and they have not been mispredicted.

We access the BPP-buffer in parallel with the branch predictor. This is true for both branch lookup and update. We allocate BPP entries at fetch, in order. We update the BPP-buffer speculatively as soon as a branch calculates its direction. We do not flush the BPP entries that follow a mispredicted branch; instead, we mark the mispredicted branch. This allows us to salvage sub-predictor locality down mispredicted paths. For example, due to control independence, some of the sub-predictor hints may be valid even though an earlier branch was mispredicted. We look up the BPP buffer to decide what sub-predictor to gate (if any) for the next cycle. While BPP uses branch temporal locality, it is different from a branch predictor cache of sorts, since it associates sub-predictor hints with preceding branches.

Provided that sufficient branch and sub-predictor locality exists, BPP has the potential for reducing branch prediction power dissipation. However, it introduces extra power overhead and can, in principle, increase overall power dissipation if the necessary behavior is not there. We take this overhead into account in our study and show that, for the programs we studied, BPP is robust.

3. METHODOLOGY AND RESULTS

In this section, we present our analysis of the BPP technique. We report performance results in section 3.1. We report power measurements in section 3.2.

We used programs from the SPEC2K suite compiled for the MIPS-like architecture used by the Simplescalar v3.0 simulation tool set. We used GNU's gcc (flags: -O2 -funroll-loops -finline-functions). Table 1 reports the branch prediction accuracy per benchmark. To obtain reasonable simulation times we simulated 200M instructions after skipping the initialization. We detail the base processor model in table 2.

For most of our experiments we used a combined predictor with 32K entries per sub-predictor. In section 4, we demonstrate that BPP is still effective even with smaller predictors. We chose the 32K-entry predictor after investigating predictors of different sizes and studying performance and accuracy. Our study shows that further increases in predictor size contribute rarely, and only slightly, to performance while increasing predictor power dissipation dramatically. It also shows that while using smaller predictors saves predictor power, it results in higher overall processor power dissipation.

We used WATTCH [3] for power estimation. We modeled an aggressive 2GHz superscalar processor manufactured under a 0.1 micron technology.
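To make the mechanism of section 2 concrete, the BPP-buffer can be sketched in software. The following is a minimal behavioral model, not a hardware description: the class and method names (BPPBuffer, update, gating_decision) and the Python representation are ours; the two-bit hint encodings follow the 00/01/11 combinations given in the text.

```python
from collections import OrderedDict

# Two-bit sub-predictor hint encodings from section 2:
# 00 -> gshare used last, 01 -> bimodal used last,
# 11 -> the branch was mispredicted the last time it was speculated.
GSHARE, BIMODAL, MISPREDICTED = 0b00, 0b01, 0b11

class BPPBuffer:
    """n-entry FIFO recording the n most recently fetched branches
    (PC + two-bit sub-predictor hint), used to decide whether the
    selector and one sub-predictor can be gated next cycle."""

    def __init__(self, size=32):
        self.size = size
        self.fifo = OrderedDict()       # PC -> hint, in fetch order

    def update(self, pc, hint):
        """Allocate or refresh an entry; evict the oldest on overflow."""
        if pc in self.fifo:
            del self.fifo[pc]           # re-insert to refresh FIFO order
        self.fifo[pc] = hint
        if len(self.fifo) > self.size:
            self.fifo.popitem(last=False)

    def gating_decision(self, last_branch_pc):
        """If the last fetched branch matches a buffer entry, inspect the
        hints of the next two buffer elements. If both used the same
        sub-predictor and neither was mispredicted, return that hint
        (gate the other two tables); otherwise return None (probe all)."""
        pcs = list(self.fifo)
        if last_branch_pc not in pcs:
            return None
        idx = pcs.index(last_branch_pc)
        hints = [self.fifo[pc] for pc in pcs[idx + 1 : idx + 3]]
        if len(hints) == 2 and hints[0] == hints[1] != MISPREDICTED:
            return hints[0]
        return None
```

A result of GSHARE, for instance, means only the gshare table needs to be probed for the next two branches; None (note: compare against None, since GSHARE itself is 0) means all three tables stay on.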
Program   Ab.   BP Acc.      Program   Ab.   BP Acc.
ammp      amm   99%          mesa      mes   99%
bzip      bzp   97%          parser    prs   93%
compress  cmp   92%          vortex    vor   98%
equake    equ   98%          vpr       vpr   92%
gcc       gcc   93%          wolf      wlf   92%
mcf       mcf   91%

Table 1: Benchmarks and prediction accuracy (direction and target).

Branch Predictor       32K GShare + 32K bi-modal w/ 32K selector
Scheduler              128 entries, RUU-like
Fetch Unit             Up to 8 instr./cycle, max 2 branches/cycle, 64-entry fetch buffer
Load/Store Queue       128 entries, 4 loads or stores per cycle, perfect disambiguation
OOO Core               Any 8 instructions/cycle
Func. Unit Latencies   Same as MIPS R10000
L1 Instr./Data Caches  64K, 4-way SA, 32-byte blocks, 3-cycle hit latency
Unified L2             256K, 4-way SA, 64-byte blocks, 16-cycle hit latency
Main Memory            Infinite, 100 cycles

Table 2: Base processor configuration.

3.1. Performance

BPP can negatively impact accuracy and hence performance. Therefore, we first investigate how BPP impacts performance. To determine whether our technique is indeed worthwhile, it is important to compare it with a number of obvious alternatives. Accordingly, we compare with a number of alternative machines that use different branch predictor organizations. We compare with three alternative branch predictors: (1) the conventional combined predictor (CMB), (2) the bimodal predictor (BMD), and (3) the gshare predictor (GSH). Comparing to CMB shows whether our power savings are worth the possible performance loss; the BMD and GSH comparisons explore the possible benefits and costs of the combined predictor when compared against its two sub-predictors.

In figure 1 we report performance for the processor that uses a 32-entry BPP (in section 4 we vary BPP size) compared to the 3 different base cases. Bars from left to right report relative performance compared to CMB, BMD and GSH. On average, performance slowdown is 0.3% compared to CMB. In the worst case of mcf, it is only 1%. As can be seen in table 1, mcf exhibits the worst branch behavior. However, even then, the performance loss is only 1%, making BPP an attractive alternative to a conventional combined predictor. Comparing to BMD and GSH, on average, BPP offers higher performance. The performance difference is amplified for some of the integer benchmarks (e.g., gcc), while it is negligible for most of the floating point benchmarks (e.g., mes).

Figure 1: Performance (higher is better). Bars from left to right compare BPP to CMB, BMD and GSH.

3.2. Power

As an indicator of power reduction, we studied how often BPP gates two out of the three sub-predictors per benchmark for the SPEC2K benchmarks. Our measurements show that, on average, BPP gates sub-predictors 54% of the time, reaching a maximum of 96% and a minimum of 34%.

In the rest of the section we first report predictor power dissipation. While exploiting a less complex branch predictor reduces the predictor's power dissipation, it may increase the total power dissipation [1]. Therefore, we also report total power dissipation. In our experiments we take into account the power overhead associated with the BPP buffer. Moreover, we also study how BPP interacts with banked predictors (suggested by previous work [1]). Banking is particularly important for large predictors. We used four banks based on CACTI's analysis.

We make the following assumptions to pessimistically account for power overheads and static power: First, we assume gated structures still consume 10% of their maximum power dissipation. Second, in banked predictors we assume that all banks are always precharged (i.e., banks need to be ready in case they are accessed by the next branch). It would be possible to use BPP to selectively precharge only the sub-predictor banks that we will use.

Figure 2 reports relative predictor and overall power reduction for non-banked and banked predictors. For each benchmark, bars from left to right report power reduction compared to CMB, BMD and GSH. The entire bar shows savings achieved for non-banked predictors. The dark portion of each bar shows savings for banked predictors. Figure 2(a) reports predictor power reduction. On average, non-banked BPP reduces predictor power dissipation 28% compared to a non-banked CMB. For banked predictors, power savings are lower but still considerable. A banked BPP reduces branch predictor power 14% compared to a banked CMB. For both banked and non-banked predictors, as expected, BPP dissipates more power than BMD and GSH.

In figure 2(b) we report overall power reduction for processors using banked and non-banked predictors. For non-banked BPP, compared to non-banked CMB, BMD and GSH, average power reductions are 1.7%, 5.6% and 1.8%. For a banked BPP, compared to a banked CMB, BMD and GSH, average power reductions are 0.6%, 5.2% and 1.4%. This suggests that BPP reduces both the predictor's and the total processor power dissipation compared to CMB. Moreover, it shows that using a combined predictor both increases performance and saves power when compared to using only one of its sub-predictors.

Figure 2: (a) Branch prediction power reduction. (b) Overall power reduction. Bars from left to right compare to CMB, BMD and GSH. The entire bar reports savings for non-banked predictors and the dark portion of the bar reports savings for banked predictors.

4. SENSITIVITY ANALYSIS

In this section, we study how BPP reacts to smaller predictor sizes and different BPP buffer sizes.

First, we report BPP sensitivity to predictor size. We studied relative predictor power dissipation compared to a non-banked combined predictor of similar size. We picked non-banked predictors since banking is reportedly less effective for smaller predictors [1]. Our studies show that savings are smaller but still obtainable for smaller predictors. On average, BPP reduces power by 4%, 8%, 13%, 16% and 21% for predictor sizes 1K, 2K, 4K, 8K and 16K.

We also studied how BPP reacts to various BPP buffer sizes. Our studies show that performance stays within 0.5% for buffer sizes 64, 32, 16 and 8. On average, and for a banked predictor, predictor power reduction is 11.3%, 14%, 10.4% and 8.7% for buffer sizes 64, 32, 16 and 8 respectively. Therefore, savings are maximum for a 32-entry buffer. Apparently, a 32-entry buffer is large enough to store the required information. While 8- and 16-entry buffers appear to be too small, a 64-entry buffer imposes unnecessary power overhead.

5. RELATED WORK

Previous work has introduced banking and the usage of a prediction probe detector (PPD) as two techniques that reduce branch predictor power dissipation without harming accuracy [1]. Banking reduces the active portion of the predictor. In this paper we studied how BPP interacts with banking and demonstrated that BPP can reduce power even when the underlying predictors are banked. PPD aims at eliminating unnecessary BTB and predictor accesses. While BPP does not target reducing the BTB power dissipation, it could be used on top of PPD to further reduce power dissipation by gating the sub-predictors when predictor lookups are required.

6. CONCLUSION

We presented BPP, a technique for reducing power while maintaining the accuracy advantage of combined branch predictors. We affirmed that it is possible to significantly reduce power by exploiting a) the temporal locality amongst branch instructions, and b) the high predictability in their usage of sub-predictors. On average, BPP reduces predictor power dissipation 28% compared to a non-banked and 14% compared to a banked conventional combined predictor. This comes with a negligible performance degradation. We have also shown that while BPP is more effective for larger predictors, it still reduces power even for smaller branch predictors. More importantly, we have shown that when one considers the overall processor power dissipation, BPP-enhanced processors always dissipate less power when compared to ones that use either just the conventional combined predictor, or just one of its underlying sub-predictors.

Because of the considerable power savings and the relatively small cost, BPP is an attractive power-aware enhancement for modern, high-performance processors.

REFERENCES

[1] D. Parikh, K. Skadron, Y. Zhang, M. Barcella, and M. R. Stan. Power Issues Related to Branch Prediction. In Proc. Intl. Symposium on High-Performance Computer Architecture, February 2002.

[2] A. Baniasadi and A. Moshovos. Instruction Flow-Based Front-end Throttling for Power-Aware High-Performance Processors. In Proc. ISLPED'01, August 2001.

[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In Proc. of the 27th Int'l Symp. on Computer Architecture, 2000.
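As an illustration of the temporal-locality measurement described in section 2 (how often a fetched branch appears among the last n branches fetched), the short routine below reproduces the style of that measurement. It is a sketch only: the function name and the synthetic, loop-like trace are ours, not data from the paper.

```python
from collections import deque

def temporal_locality(branch_trace, n):
    """Fraction of dynamic branches whose PC appears among the
    previous n branch PCs fetched (section 2, phenomenon 1)."""
    window = deque(maxlen=n)    # sliding window of the last n branch PCs
    hits = 0
    for pc in branch_trace:
        if pc in window:
            hits += 1
        window.append(pc)
    return hits / len(branch_trace)

# A synthetic loop-like trace: the same 4 branches repeating. Every
# repeat falls well inside a window of n=8, so locality is near 1.
trace = [0x400, 0x410, 0x420, 0x430] * 100
print(temporal_locality(trace, 8))   # -> 0.99 (only the first 4 miss)
```

With a window smaller than the loop's branch footprint (e.g., n=2 here), the locality drops to zero, which mirrors the paper's observation that measured locality grows as n goes from 8 to 64.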