The B&Mode Branch Predictor

Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge EECS Department, 1301 Beal Ave., Ann Arbor, Michigan 48109-2122 {leecc, icheng, tnm} @eecs.umich.edu

Abstract been pushed above 90%. As a result, two-level dynamic branch predictors have been incorporatedin several rcccnt Dynamic branch predictors are popular because they high-performance . Perhaps the best can deliver accurate branch prediction without changes to known examples,at the time of writing, are the Pro the instruction set architecture or pre-existing binaries. [Gwennap and [Gwennap96]. However, to achieve the desiredprediction accuracy, exist- Among two-level predictors, those using global history ing dynamic branch predictors require considerable schemeshave beenshown to yield the bestperformance for amounts of hardware to minimize the integerence effects integer benchmarks[YehPatt93]. However, to achieve high due to aliasing in the prediction tables. We propose a new levels of accuracy, current dynamic branch predictors rc- dynamic predictor, the bi-mode predictor, which divides the quire considerableamounts of hardwarebecause their most prediction tables into two halves and, by dynamically deter- significant weakness,the destructive aliasing problem, is mining the current “mode” of the program, selects the ap- most easily solved by increasing the size of the predictors propriate half of the table for prediction. This approach is [SechrestLeeMudge96].This paper proposesa new tech- shown to preserve the merits of global history basedpredic- nique, the bi-mode branchpredictor, that is economical and tion while reducing destructive aliasing and, as a result, im- simple enough to avoid critical timing paths. Furthermore, proving prediction accuracy. Moreover, it is simple enough we demonstratethat on the IBS and SPECCINT95 bench- that it does not impact a ’s cycle time. We con- marks the bi-mode predictor performs on average better clude by conducting a comprehensive study into the mech- than gshare,one of the best global history basedpredictors, anism underlying two-level dynamic predictors and for the same cost. Finally, we conduct a comprehcnsivc investigate the criteria for their optimal designs. The anal- study into the mechanism underlying two-level dynamic ysis presented provides a general framework for studying predictors and investigate the criteria for their optimal dc- branch predictors. signs. The study explains why our proposed schemepcr- 1. Introduction forms well and provides a general framework for studying branch predictors. The ability to minimize stalls or bubbles that The report is organized into five sections.In section 2, may result from branchesis becoming increasingly critical we summarizethe aliasing problem, and then introduce our asmicroprocessor designs implement greaterdegrees of in- solution for de-aliasing. Section 3 describesour simulation struction level parallelism. There are several techniquesfor methodology and presentsthe simulation results. In section reducing branch penaltiesincluding guardedexecution, ba- 4 we presentan analysis of aliasing in dynamic branch pre- sic block enlargement,and static and dynamic branch pre- dictors that explains the source of the improved perfor- diction [PnevmatikatosSohi94, Hwu93, Smith81, mancefor the bi-mode predictor. Finally, in the conclusion FisherFreudenberger92, YehPatt91, PanSoRahmeh921. we proposefuture directions for this work. Among these, dynamic branch prediction is perhaps the 2. Aliasing and De-aliasing most popular, becauseit yields good results and can be im- plemented without changesto the instruction set architec- 2.1 The aliasing problem ture or pre-existing binaries. The strength of dynamic branch prediction is that it can Branch outcomesare not usually the result of random ac- track branch behavior closely at run-time, providing a de- tivities; most of the time they are correlated with past bc- gree of adaptivity that other approachesare lacking. This havior and the behavior of neighboring branches. By adaptivity is especially critical when behavior of branches keeping track of the history of branch outcomes,it is possi- can be affectedby the input data of different program runs. ble to anticipatewith a high degreeof certainty which direc- With the introduction of two-level schemes[YehPattgl], tion future brancheswill take. the prediction accuracy of dynamic branch predictors has However, current dynamic branchpredictors still exhibit

4 1072-445l/97$10.00@1997IEEE performancelimits. These are due in part to the restricted availability of information upon which to basepredictions, but more importantly due to shortcomingsof design, espe- cially the way that branch outcome history is exploited. In current designs,dynamic predictors spendlarge amountsof hardware to memorize this branch outcome history. Each static (per-address)branch often has a biased behavior so that it is either usually taken or usually not-taken. This can be exploited by the conventional two-bit schemeto ‘l-l- , ChoicelPredictor predict future outcomesof a particular static branch. How- DirectioniTi +,A ever, two-bit counter schemesare limited becausebranches , may behavedifferently from their biasesunder somespecial conditions. These conditions are not difficult to recognize, t but recognition requires memory space. Therefore, to Final prediction for the branch achieve very high prediction accuracy,both the per-address bias and the special conditions need to be identified and Figure 1: Proposed branch prediction scheme memorizedby dynamic predictors. diagram Global history-the outcomes of neighboring branch- es-is a commonway to identify special branch conditions. 2.2 Proposed branch prediction scheme Previous studieshave shown that the global history indexed schemesachieve good performanceby storing the outcomes The bi-mode branch predictor is aimedat the elimination of global history patternsin two-bit counters,e.g., the GAg of destructive aliasing in global history indexed schemes. and GAS schemes[PanSoRahmeh92, YehPatt921. Another This scheme, shown in Figure 1, splits the second-level way to identify special branch conditions is to use per-ad- two-bit counter table into two halves. Given a history pat- dresshistory-the past outcomesof a branch itself, such as tern, two counters,one from eachhalf, are selected.We re- PAg and PASschemes [YehPatt91]. The per-addresshistory fer to theseas the direction predictors. Meanwhile, another schemeis also shown to be effective, especially for loop-in- two-bit counter table, indexed by the branch addressesonly, tensive floating-point programs.However, as we noted ear- is used to provide a final selection for these two counters. lier, [YehPatt93] shows that, for integer programs, global The counter table providing selection will be referred to as history schemestend to perform better than per-addresshis- the choice predictor. The final prediction is then made by tory schemesbecause global schemescan make better pre- the state of the counter selectedfrom the direction predic- dictions for if-then-else branchesdue to their ability to track tors and, importantly, only the selectedcounter will be up- correlation with neighboring branches. datedwith the branch outcome;the statusof the unselected Nevertheless,the global history schemeis still limited by one, will not be altered. The choice predictor is always up- destructive aliasing that occurs when two brancheshave the datedwith the branch outcome,except that when the choice same global history pattern, but opposite biases is opposite to the branch outcome but the selectedcounter [TalcottNemirovskyWood95,YoungGloySmithB.?i]. This is of the direction predictors makesa correct final prediction. not due to the limited availability of information, but to the This partial updatepolicy is particularly effective when the indexing method which does not discriminate between total hardwarebudget is small. brancheswith the sameglobal history patterns. Our proposed scheme can improve global history in- One proposal to overcome the destructive aliasing, dexed schemesbecause although global history patternsare gshare,randomizes the index by xor-ing the global history still kept in the second level tabIe, they are dynamically with the branch address [McFarling93]. It provides only classified before being stored.They are classified by a pre- limited improvement [SechrestLeeMudge96]. Recently, liminary prediction from the choice predictor which is sim- there have been several new proposals to reduce aliasing ply a conventional two-bit counter scheme,and, as such, problems [ChangEversPatt96, Sprangle97, typically can provide 80% or better prediction accuracy (MichaudSeznecUhlig971. The best of these with relatively modest cost. Thus, the bi-mode schemedi- lIvlichaudSeznecUhlig971 employs a hardware hashing vides branchesinto two groups accordingto the per-address scheme. A comparative study of these and the bi-mode bias of the choice predictor, and then usesthe global history schemecan be found in [Lee97]. The study showsthat hard- patterns to identify the special conditions for each of two ware hashing is useful for small low cost systems.For large groups separately.The effect of the choice predictor is to systems the bi-mode scheme is the best cost-effective separatethe destructive aliaseswhile keeping the harmless schemeto date. aliasestogether.

5

I ------___- 3. Experimental Results Benchmarks Input data file compress biglesth, reduced In this section, we demonstratethat our proposed bi- 2 mode branch predictor is more accurateand cost-effective w jump.i 5 2stone9.h, train data, reduced than one of the best two-level branch predictors, gshare.To 6 90 xl&p trainhp evaluatethe improvement,we have conductedtrace-driven Y Ped scrabblh, reduced 8 simulations. vortex train data, reduced 3.1 Description of gshare scheme Table 1: Description of the input data files used in In gshare,the global history is xor-ed together with the the SPEC CINT95 programs low-order addressbits of a branch to form an index. This in- dex is then usedto selecta 2-bit saturatingup-down counter from a pattern history table (PHT)‘. Depending on the sign static conditional dynamic conditional Benchmarks bit of the selected2-bit counter, the branch is either predict- branches branches 1 ed as taken or not taken. I compress 1 402 I 10,114,353 To makea fair comparisonwith the gsharepredictor, the WC 16,035 26,520,618 best configuration of gsharemust be determinedand used. 90 5,112 17,673,772 This point is often overlooked and the single-PHT gshare xlisp 636 25,008,567 configuration is used for comparisons.However, this sin- per1 1,974 39,714,684 gle-PHT gshareconfiguration is not the optimal configura- 6,599 27,792,020 tion as was shown in [SechrestLeeMudge96].To find the 6,333 11,901,481 best configuration, we exhaustively simulated all pair-wise gs 12,852 16,307,247 combinations of history length and addresslength. In gen- mw&v 5,598 9,566,290 eral, the bestcombination hasmultiple PHTs. Since the best nroff 5,249 22,574,884 real-gee 17,361 14,309,867 configuration is different for each benchmark, we present sdet 5,310 5,514,439 results using the configuration that yields the best accuracy verilog 4,636 6,212,381 for the averageof all the benchmarksstudied. video-play 4,606 5,759,231 3.2 Description of input trace Table 2: Static and dynamic branch counts in the To assessthe performanceof the bi-mode branch predic- IBS and SPEC CINT95 programs tor, we conducteda trace-driven simulation using the Ultrix version of the Instruction Benchmark Suite (IBS-Ultrix) benchmarks[Uhlig95] and the SPECCINT9.5 benchmarks SPECCINT95 are summarizedin Table 2. [SPEC95]. 3.3 Simulation results The IBS-Ultrix benchmarksare a set of applications de- Figure 2 shows the misprediction rates for the best signed to reflect realistic workloads. The traces of these gshare and bi-mode predictors. In our simulation the best benchmarkswere generatedthrough hardware monitoring configurations of gshare,which are labeled gshare.best, al- of a MIPS -basedworkstation. Thesetraces were col- ways have multiple PHTs in the second-level table Note lectedunder Ultrix 3.1, and include both kernel and user ac- that gshare.best is the best for the averagedresults, not ncc- tivities. essarythe bestfor individual benchmarks.For easycompar- For the SPEC CINT95 benchmark, we use ATOM ison with other published results, we also include the [EustaceSrivastava95],a code instrumentation tool from misprediction rates for the single-PHT gshare configura- Digital Equipment Corporation, to generateand capturead- tion, which is labeledgshare.IPHT. In Figure 2, the vertical dresstraces. The benchmarkswere first instrumented with axis representsthe branch misprediction rate, and the hori- ATOM, then executed on a DEC 21064 workstation run- zontal axis for the size of predictors. A lower curve indi- ning OSF/l 3.0 to generatetraces. These traces contained catesthat the schemehas better performancefor the same only user-level instructions. The input to the SPEC95 cost. Cost is measuredby counting the numberof bytesused benchmarkswas a reducedinput dataset and is describedin in the 2-bit counters.Note that the bi-mode predictors natu- Table 1. The branch statisticsof tracesfrom the IBS and the rally have a cost that is 1.5 times that of the next smaller gsharescheme2. This reflects the cost of the choice predic- 1.The pattern history tables are the tablesconstituting the second-level tors. table of the two-level predictors,as definedin [YehPatt92].In the two- level predictor model, the numberof PHTs is determinedby the branch Figure 2 shows the bi-mode predictors outperforms addressbits directly usedas the index. gsharepredictors for all sizes of predictors measured.This --

CINT95-AVERAGE M-AVERAGE

2- ; / / 0 ‘ 4 0.25 0.5 1Preddr $ bytes! 16 32 0.25 0.5 1 2 &bytes! 16 32 Size PredictorSize

Figure 2: Averaged misprediction rates for SPEC CINT95 and IBS-Ultrix

20

.I 01 . ’ ’ . 1 025 0.5 1 16 32 025 0.5 1 2 pK bytes; 16 32 025 0.5 1 16 32 Predii siie pK byes; Predictor Size Prediito: Sue &bytes;

pefl gshare.lPM - gshare.lPtri -

01 j 8 / 01 . . 1 01 . * . 025 16 32 025 0.5 1 2 $bytesf 16 32 025 0.5 1 2 4 by&$ 16 32 Predictor Size Predictor Sue (K

Figure 3: Misprediction rates for SPEC CINT95

is indicated by lower curves. In addition, the bi-mode pre- est static branches,have no aliasing problems and thus can dictors are more cost effective, because,for predictors larg- enjoy the benefit from correlation in branch histories. The er than 4K bytes, they needless than half the size of gshare results of these two small benchmarks correspond to the predictors to achieve the samemisprediction rate. findings reportedby Se&rest et al. [SechrestLeeMudge96]. Bi-mode predictors also outperform gshare on most of The caseof the go benchmark, where the bi-mode method the individual benchmark examined, see Figure 3 and is beatenby the multiple-PHTs, will be discussedin more Figure 4. Moreover, the single-PHT gsharescheme is worse detail in the next section. than the multiple-PHTs gsharescheme for all benchmarks except the compress and xlisp, where it outperforms even 4. Analysis the bi-mode scheme.These two benchmarks,with the few- Many brancheshave a tendencyto be either taken or not- taken most of time. Common examplesare branchesfor er- 2. In our experiments,all two-bit countersin gsbareschemes are initial- ized to weakly-takenfor eachbenchmark run. For the bi-mode scheme, ror checking and looping. Thesekinds of branchesare usu- the choice predictor is resetto weakly-taken,and one bank of the direc- ally describedas being strongly biasedin one direction. As tion predictor is resetto weakly-not-takenand the other bank is weakly- taken. might be expected,strongly biasedbranches are much easi-

7 0’ ’ ’ * ’ ’ ’ '25 0.5 1 16 32 025 0.5 1P,&o: ;'K bybfl 10 32 S&o

0 025 0.5 1Pmdd ;'K byl$ 16 32 0.25 0.5 1Predkt: i'K bytes; 16 32 She size Figure 4: Misprediction rates for IBS-Ultrix er to predict than weakly biased branches in dynamic curacy than the traditional two-bit counter scheme branch predictors, and this was confirmed by Chang el al. proposedby Smith [Smith811is because,in addition to the [Chang94]. In the samestudy, they also measuredthe dis- branch address,they incorporate the branch history infor- tribution of branch biases for SPEC CINT92. Their mea- mation to form the index for the second-level two-bit surement showed that on average about 50% of total counter table. The index for the second-level table divides dynamic branchesare attributed to the static branchesthat the dynamic branch streaminto substreamsthat are dircct- are biased in either the taken or not-taken direction for ed to a saturatingtwo-bit counter. Ideally, the index should more than 90% of the time. generatehighly biased substreamsso that the value of the In this section, our analysis extends the idea of bias to saturating counter selectedby the index can stay at one of the dynamic branch substreamsthat arrive at each two-bit the saturatedvalues most of time. Global history, compared counter in the second-level table. Using this concept, we to the branch address,can divide a dynamic branch stream will demonstrate the advantagesand drawbacks of two into more highly biasedsubstreams, as we will show Iatcr. kinds of information usedin the two-level scheme,speciti- However, if the indexing method mixes oppositely biased tally, the branch addressand global history. The analysis substreamstogether, then destructivealiasing can arise and allows us explain why the bi-mode schemecan improve on the associatedcounter will perform badly asa predictor, bc- current dynamic branch predictors. causeit will oscillate betweenthe two saturatedvalues. Our 4.1 Bias measurement for global-history based study will compareusing branch addresseswith global his- tory to separateout oppositely biasedsubstreams, and how schemes aliasing can degradethe performanceof two-level schemes As we have noted before, the reason that two-level dy- using global history. namic branch predictors can achieve higher prediction ac- To contrast the benefits of addressand global history

8 dynamic count count of taken out- branch normalized count from i=b when using comes when using bias class address, i counter c, Isid counter c to G Nbc II 0x0010x005 1 2012 111 I SNTST 20/5012/50 = 40%24%

II 0x100 I a 3 WB 8/50 =I 6% IO 1 SNT IO/50 = 20% ox 150 I Table 3: An example of calculating the normalized count for a counter c bits, we considertwo alternative two-level gsharestyle pre- Thus the two conditions become: dictors. Both have the same size second-level tables, 256 1. (Ci(Nic) 1for those i such that SicE WB) << (Zi(Nic) 1 counters, but differ in that one employs more history bits, for those i such that SicE WB) representinghistory-indexed schemes,while the other rep- 2. (Ci(iVic)1 for those i such that SicE ST) should differ resentsaddress-index schemes. The first schemexors 8 bits greatly from (Ci(Nic) I for those i such that SicE SNT). In an of branch addresswith 8 bits of global history to form the ideal situation, one of the sumsshould be 0. index into the second-level table (“history-indexed”). The Table 3 illustrates the normalized count resulting from secondscheme xors 8 bits of branch addresswith only 2 bits three streamsincident on the samecounter c. In this exam- of global history as the index (“address-indexed”). ple, there is a total of four static branches(i = 1,..,4) whose We define three bias classeson a stream of branch out- addressesare 0x001,0x005,0x1 00 and 0x150, respectively, comes: 1) strongly taken (ST) if the outcomes are taken that usedthe two-bit counter c for prediction during the pro- 90% of the time or more; 2) strongly not taken (SNT) if the gram execution (they may also use other counters too). outcomesare not taken 90% of the time or more; and 3) These four streamsfall into different bias classeswith re- weakly biased(WB) if the neither of the above apply. spect to c. The normalized count of ST class at the counter We are interestedin the streamof branch outcomes,sp c is 24%, the SNT class is 60% (40%+20%), and the WB from a particular static branch, i, to a particular prediction class is 16%. Becausethe SNT class is more frequent than counter,j. This streambelongs to one of the three bias class- the ST class,the SNT is the dominant classin the counter c, es, i.e., exactly one of the following is true: sg E ST, $9 E and the ST is the non-dominantclass. In fact, Table 3 shows SNT, or se E WB. A good indexing method will createthese an undesirable situation becausethe indexing method has streamsso that the following two conditions hold: done a poor job of separatingthe bias classesand the SNT 1. The number of streamsthat are in the WB classis kept classis not overwhelmingly dominant. small. Figure 5 illustrates the bias classesfor all of the predic- 2. Most of the streamsincident on a particular prediction tion counters for the gee benchmark. We have performed counter,j = c, belong to only the ST class, or alternatively, the sameexperiments for other SPECbenchmarks, and we only the SNT class,i.e., SicE ST for most i, or sic E SNT for selectgee becauseit is representativeof the results from the most i. A counter should not see an even mix of streams other benchmarks,see [SechrestLeeMudge96].The X axis from both classesor its prediction ability will be reduced. lists all the countersin the second-leveltable, and the Y axis Condition 2 actually statesthat one of the two strongly representsthe normalized counts of the three bias classesin biasedclass should dominate the other strongly biasedclass eachcounter. The counterslisted in the X axis are sortedac- at a counter. When this domination occurs, the counter will cording to the normalized dynamic frequency of WB class. be biasedat one saturatedvalue with little destructive inter- It can be seenthat the areasize of WB region of the history- ference.We will refer to the more frequent strongly-biased indexed schemeis smaller than that of the address-indexed class at a counter as the donzinanf class, and the other less one. This suggeststhat the schemeemploying more branch frequent strongly-biasedclass as the non-dominant class. history can generatemore highly biasedsubstreams for pre- To be more precise,we should consider streamsweight- dictors. If there is no harmful aliasing problem in the histo- ed by their lengths. If $1 is the number of outcomesin the ry-index scheme,i.e., each counter only needsto deal with stream4, we define the normalized count that a branch, i = substreamsof one bias class,the prediction accuracywill be b, contributes to a particular prediction counter,j = c, to be: very high [TalcottNemirovskyWood95, YoungGloySmith951.

Nbc = However, in the usual situation where harmful aliasing doesexist, the performanceof the history basedscheme de- over all static branchesi grades.As shown in the same figure (Figure 5), the non-

I 9 : I g 00 g 60 5 70 c 70 z .-:: 60 o 60 g 50 g 50 :: 40 =: 40 .i 3. Q 30 dominant dominant

I I 65 129 193 256 65 129 193 256 Individual counters lnividual counters Figure 6: Bias breakdown for the bi-mode scheme This figure shows the bias of branch substreams of each counter for the bi-mode scheme. This bi-mode scheme has a 128-counter choice predictor and two 128.counter direction predictors. As shown in the figure, the dominant substreams dominate most of the counters of the second-level table, implying that interference is reduced significantly.

time, the resulting substreamsmerged at each counter should be as unidirectional as possible; in other words, the 10 dominant area in Figure 5 should be large. Unfortunately, o- t ----.“ll--“---l---__- --I neither the address-indexedscheme nor the history-indexed 1 65 129 193 256 schemecan achieve both of thesetwo design goals simulta- Individual counters neously. Figure 5: Bias breakdown for the gshare scheme 4.2 Bias measurement for the b&mode scheme in the SPEC ClNT95 gee benchmark. History-indexed on the top, address- In this subsection, we repeat the analysis above for the indexed on the bottom bi-mode prediction scheme.The configuration under exam- ination has a 12%counterchoice predictor indexed by the This figure shows the bias of branch outcome substreams arriving at each of 256 counters in a second-level table. The branch addressand two banks of 128 counters in the scc- top graph is for the history-index scheme (8 bits of branch ond-level table, eachof which is indexed by 7 bits of branch address xor-ed with 8 bits of global history); the bottom graph addressxor-ed with 7 bits of global history. This systemhas is for the address-indexed scheme (8 bits of branch address xor-ed with 2 bits of global history). These two graphs illus- about 50% more bytes than the predictors in the previous trate the difference between the two indexing methods. The subsection, so the following analysis should be viewed address-indexed scheme suffers from a larger number of qualitatively. weakly biased (WB) branch substreams, while the history- indexed scheme suffers from more non-dominant sub- Figure 6 presentsthe measurementresults. It can be seen streams, implying a high degree of destructive interference that the weakly biased class in the bi-mode schemeis kept between strongly but oppositely biased streams (between the as small as the one in the history-indexed scheme,indicat- SNT and ST classes). ing that the advantageof employing history information is dominant classin the history-indexed schemeis larger than preserved.On the other hand, Figure 6 also shows the bi- the one in the address-indexedscheme. In other words, al- mode schemeyields a much larger area for the dominant though the history-indexed selects the greater number of class than the history-indexed scheme, implying that dc- highly biasedsubstreams, it doesnot separatethe taken and structive aliasing has beenreduced. not-taken onesas well as address-indexedscheme. The counting argumentsthat we employ to classify the To summarize the analysis above, an ideal dynamic ST, SNT, and WB classesare open to the criticism that they branch predictor should generateas few weakly biasedsub- do not capture the order in which the ST and SNT runs ap- streamsas possible; in other words, the area of the weakly pear. For example, it is undesirable for them to be intcr- biased region should be as small as possible. At the same mixed so that the streamchanges between the two classes. As a final experiment, we have counted the numbers of I

‘sSNT&TmWSj Dominant Non-dominant 1 WB ~ ,2 ---~- --~. _ ._. _. _.- _-.--._____ - _-.-- .~-. history-indexed 1 3.826,578 1 3,589,689 1 2.252.874 bi-mode 3,685,544 2,717,563 1 2,226,353 Table 4: Numbers of changes between different bias classes for the history-indexed and bi-mode schemes This table shows numbers of changes between branch out- come streams of different bias classes in the history-indexed and bi-mode schemes. We first count changes for each bias class in a counter, and then accumulate the counts of all

counters for a scheme. For example, the count for the domi- 256 1K 32K - nant class of the history-indexed scheme is the total number of changes of the dominant class due to interference by the Figure 7: Misprediction contributed by three bias other two classes in the scheme. classes in gee In this figure, three schemes are compared, a gshare using changes between bias classesdue to interference. Table 4 fewer history bits (representing the address-indexed shows the results for the history-indexed and bi-mode scheme), a gshare using more history bits (history-indexed) and the bi-mode scheme. For each scheme, three different schemes.The bi-mode schemehas fewer changes,implying sizes of second-level tables are examined: 256, 1K and 32K that its ST and SNT classes are less intermingled. This counters. gshare (m) represents a gshare scheme that uses means less interference,and further illustrates why our pro- m-bit global history, while bi-mode (m) represents a bi-mode scheme that uses m-bit global history for its direction predic- posed prediction schemeperform better than conventional tors. The choice predictor of the bi-mode scheme is half the two-level schemes. size of its second-level table. As shown in the figure, the address-indexed scheme always has larger misprediction for 4.3 Breakdown of misprediction for the gshare the WI3 class. The history-indexed scheme has less mispre- and bi-mode schemes diction for the WB class, but has more for the SNT and ST classes due to interference. The bi-mode scheme reduces We have also measuredtl,e misprediction contributed by error from the WI3 and reduces, in most cases, the mispredic- three biased classesfor the gshare and bi-mode schemes. tion for the SNT and ST by removing interference. Again, for the gsharescheme, the configurations using few- er global history bits and more global history bits are both q .SNToSTmWB included for comparison. Figure 7 presents the measurementresults for the gee benchmark.Three different sizesare studied for the branch predictors: 256, 1024, and 32,768 counters in the second level table. For eachconfiguration, the misprediction is bro- ken down to three categoriesaccording to the bias classes. In other words, the sum of mispredictionsfrom three classes is the misprediction rate for the correspondingscheme. For the gsharepredictors of the samesize, the one using fewer global history bits always has the least error from the strongly-biased classes,but it suffers from poor prediction 256 1K 32K for the weakly-biased substream. The bi-mode scheme Figure 8: Misprediction contributed by three bias keepsa reducederror for the weakly biasedclass, while suc- classes in go cessfully reducing the error from strongly-biasedclasses. This figure shows misprediction due to three bias classes for 4.4 go benchmark the go benchmark. As the same in Figure 7, three schemes with three different sizes of second-level tables are com- In Section 3, we noted that the bi-mode schemewas not pared. As shown in the figure, the misprediction due to the the best for the go benchmark.In this section, we provide WB class dominates in go for three schemes, and thus the further analysis. interference between SNT and ST classes is not the major concern. To improve prediction accuracy for go, more history The go benchmarkis intrinsically hard to predict because bits should be used because it is an effective way to remove about half of its dynamic branches are in the WB class. the WB class. Note that as more history bits are used, the rel- ative misprediction rates due to the WB class becomes Figure 8 shows the n&prediction contributed by the three smaller. bias classesfor the go benchmark.It is clear that for all the

11 schemesand configurations the misprediction for the WB References classdominates-destructive aliasing is not the major con- [Chang94] Chang. P., Hao, E., Yeh, T., and Patt, Y., “Branch cern. There is not much room for the bi-mode schemeto im- Classification: a New Mechanismfor Improving Branch Predictor prove becauseit is targeted at eliminating harmful aliasing Performance,”IEEE Micro-27, Nov. 1994. rather than improving prediction for the weakly biasedsub- [ChangEversPatt96]Chang, P., Evers, M., andPatt, Y., “Improv- streams.As observed in the previous subsection, the dy- ing Branch Prediction Accuracy by Reducing Pattern History Tn- namic frequency of the weakly biased class is mainly ble Interference,” International Conference on Parallel determinedby the number of global history bits used.From Architecture and Compilation Techniques, Oct. 1996. Figure 8, we seethat the error of the WI3 classis reducedas @ZustaceSrivastava95]Eustace, A. and Srivastava,A., “ATOM: more global history bits are applied. The prediction accura- A flexible interface for building high performanceprogram analy- cy for programslike the go benchmarkwill only improve if sis tools,” Proceedings of the Winter 1995 USENIX Technical more global history information is employed so that more Conference on UNIX andAdvanced Computing Systems, 303-3 14, Jan. 1995. strongly biasedsubstreams can be generated. FisherFreudenberger921 Fisher, J.A., and Freudenberger,S.M. 5. Concluding Remarks “‘PredictingConditional Branch Directions From PreviousRuns of a Program,Proc.” 5th Annual Intl. Con& on Architectural Support In this paper, a new global-history basedbranch predic- for Programming Languages and Operating Systems, Oct. 1992, tion scheme,the bi-mode predictor, is proposed.It is de- [Gwennap Gwennap, L. “’s P6 Uses Decoupled Super- signed to improve predictions by eliminating the aliasing in scalar Design,” Report, Vol. 9, No.2, Feb. 16, dynamic branch predictors. Its successrelies on dynamical- 1995. ly determining the taken or not-taken direction with an ac- [Gwennap%] Gwennap,L. “Digital 21264 Sets New Standard,” curate but simple choice predictor. This classification can Microprocessor Report, Vol. 10, No. 14, Oct. 28,1996. help removing much of the destructive aliasing while keep- [Hwu93] Hwu, W-W., Mahlke, S.A., Chen, W-Y., Chang, P-P., ing the harmlessaliasing togetherfor the two-bit counter ta- Warter, N.j., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyo- bles. hara, T., Haab, G.E., Holm, J.G., and Lavery, D.M., ‘The super- A detailed analysis on the mechanism of two-level block: An effective technique for VLIW and superscnlar scheme’s index system was also presented in the paper. compilation,” Journal of Supercomputing, 7(9-SO), 1993. From the analysis we found that by using more global his- [Lee971Lee, C-C., “Optimizing High Performance Dynamic tory bits in an index can move more branches from the Branch Predictors,” Ph.D Dissertation, Univ. of Michigan, Ann weakly biasedgroup to the strongly biased group, but these Arbor, Nov. 1997. indices suffer from destructive aliasing. Using more branch [McFarIing93] McFarling. S. “Combining Branch Predictors,” addressbits reduces the destructive aliasing but increases WRL Technical Note TN-36, Jun. 1993. the weakly biased group. The benefits of using branch ad- [MichaudSeznecUhlig97]Michaud, P., Seznec, A., and Uhlig, dressesand global history cannot be preserved in current R., ‘“Trading Conflict and Capacity Aliasing in Conditional two-level schemessimultaneously, but they can in the bi- Branch Predictors,” Proc. of the 24th Ann. Int. Symp. on Computer mode scheme. Architecture, May 1997. The bi-mode schemecan outperform other dynamic pre- [PnevmatikatosSohi94] Pnevmatikatos! D.N. and Sohi, G.S., dictors, yet there is still room for improvement. One poten- “Guarded Execution and Branch Prediction in Dynamic ILP Pro- tial shortcoming for the bi-mode schemeis that, though it cessors,”Proc. of the 21st Ann. Int. Symp. on Computer Architec- can distinguish strongly taken and strongly not-taken sub- ture, Apr. 1994. streamsas shown in Figure 6, it can still suffer from inter- [PanSoRahmeh92] Pan,S.T., So, K., and Rahmeh,J.T., “Improv- ference between the weakly biased and strongly biased ing the Accuracy of Dynamic Branch Prediction Using Branch substreams.Therefore, there are at least two directions for Correlation,” Proceedings of the 5th Int. Conf. on @hitectural $p~;$br Programming Languages and Operatmg Systems, the future work: one is to find a cost-effectiveway to reduce . . the weakly biased substreams,and the other is to further separatethe weakly-biased substreamsfrom the strongly- [SechrestLeeMudge96] Sechrest,S, Lee, C-C, and Mudge, T, “Correlation and Aliasing in Dynamic Branch Predictors,” Pro- biasedsubstreams for the counters.We are currently inves- ceedings of the 23rd International Symposium on ComputerArchi- tigating theseissues. tecture, May 1996. Acknowledgment [SmithSl] Smith, J.E. “A Study of Branch Prediction Strategies,” Proceedings of the 8th International Symposium on ComputerAr- This work was supported by DARPA contract DAA chitecture, 135-148,May 1981. H04-94-G-0327. [SPEC95] SPECCPU’95, Technical Manual, Aug. 1995.

12

____- --- I- _-_. -- --.-:-- IO., -.. , . ‘, I -_

[Sprangle97] Sprangle,E., Chappell R., Alsup, M., and Patt, Y., “The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference,” Proc. ofthe 24th Ann. Int. Synp. on ComputerArchitecture, May 1997. nalcottNemirovskyWood95] Talcott, A.R., Nemirovsky, M., and Wood, R.C., “The Influence of Branch Prediction Table Inter- ferenceon Branch Prediction SchemePerformance,” Proceedings of the 3rd International Conferenceon Parallel Architectures and Compilation Techniques,Jun. 1995. [uhIig95] Uhlig, R., Nagle, D, Mudge, T., Sechrest,S., and Emer, J. “Instruction Fetching: Coping with Code Bloat,” Proceedingsof the 22th International Symposiumon ComputerArchitecture, Ita- ly, Jun. 1995. [I’ehPattgl]Yeh, T-Y. and Patt, Y. “Two-Level Adaptive Train- ing Branch Prediction,” Proceedings of the 24th International Synposium on , 51-61, Nov. 1991. [YehPatt92]Yeh, T-Y. and Patt, Y. “Alternative Implementations of two-level adaptivebranch predictions,” Proceedingsof the 19th International Symposium on , 124-134, May 1992. [I’ehPatt93]Yeh, T-Y. and Patt, Y. “A Comparisonof Dynamic Branch Predictorsthat use Two Levels of Branch History,” Pro- ceedingsof the 20th International Symposiumon ComputerArchi- tecture, May 1993. [I’oungGIoySmith95] Young, C., Gloy, N., and Smith, M. “A ComparativeAnalysis of Schemesfor Correlated Branch Predic- tion,” Proceedingsof the 22th International Symposiumon Com- puter Architecture, Itzl,, Jun. 1995.

13

-_- . . -__- _~