(SLB) Predictor: a Compiler Assisted Branch Prediction for Data Dependent Branches

Store-Load-Branch (SLB) Predictor: A Compiler Assisted Branch Prediction for Data Dependent Branches M. Umar Farooq Khubaib Lizy K. John Department of Electrical and Computer Engineering The University of Texas at Austin [email protected], [email protected], [email protected] Abstract This work is based on the following observation: Hard- to-predict data-dependent branches are commonly associ- Data-dependent branches constitute single biggest ated with program data structures such as arrays, linked source of remaining branch mispredictions. Typically, lists, trees etc., and follow store-load-branch execution se- data-dependent branches are associated with program quence similar to one shown in listing 1. A set of memory data structures, and follow store-load-branch execution se- locations is written while building and updating the data quence. A set of memory locations is written at an earlier structure (line 2, listing 1). During data structure traver- point in a program. Later, these locations are read, and sal, these locations are read, and used for evaluating branch used for evaluating branch condition. Branch outcome de- condition (line 7, listing 1). pends on data values stored in data structure, which, typi- 52 50 53 cally do not have repeatable pattern. Therefore, in addition 35 Gshare to history-based dynamic predictor, we need a different kind 30 YAGS BiMode of predictor for handling such branches. TAGE 25 This paper presents Store-Load-Branch (SLB) predic- 20 tor; a compiler-assisted dynamic branch prediction scheme for data-dependent direct and indirect branches. For ev- 15 ery data-dependent branch, compiler identifies store in- 10 5 structions that modify the data structure associated with Mispredictions per 1K instructions the branch. Marked store instructions are dynamically 0 hpg ospf cmy cjpeg djpeg tracked, and stored values are used for computing branch text01 iirflt01 aiifft01 aifirf01 aifftr01 rotate01 ttsprk01 idctrn01 dither01 Average canrdr01 pntrch01 tblook01 rspeed01 basefp01 matrix01 a2time01 cacheb01 coremark ConvEn2 bitmnp01 FFTPulse fBitAlStep puwmod01 pktflow512 ViterbZeros flags ahead of time. Branch flags are buffered, and later routelookup AutCorPulse used for making predictions. On average, compared to standalone TAGE predictor, combined TAGE+SLB predictor re- Figure 1. MPKI for EEMBC benchmark suite duces branch MPKI by 21% and 51% for SPECINT and EEMBC benchmark suites respectively. 30 Gshare 25 YAGS BiMode TAGE 1 Introduction 20 15 Most branch prediction techniques rely on branch his- 10 tory information for predicting future branches. Some use 5 short history [11] [21] [36], while others use longer his- Mispredictions per 1K instructions tory [16] [25] [30] [32]. History-based dynamic branch pre- 0 Average 176_gcc 254_gap 252_eon diction schemes have shown to reach high prediction accu- 181_mcf 164_gzip 473_astar 300_twolf 256_bzip2 186_crafty 197_parser 255_vortex racy for all except few hard-to-predict branches. Figures 253_perlbmk 175_vpr_route 175_vpr_place 1 and 2 show branch mispredictions per 1K instructions for EEMBC and SPECint benchmark suites using several Figure 2. MPKI for SPECint benchmark suite history-based branch predictors. As can be seen from the figures, several benchmarks have higher branch mispredic- This paper proposes Store-Load-Branch (SLB) predictions even when using long history branch predictor. tor, a compiler-assisted dynamic branch prediction scheme 1 Rest of the paper is organized as follows: Section Listing 1. Store-Load-Branch execution sequence 2 presents motivating examples from real benchmarks. 1 f o r (node=head; node!=NULL; node=node! n e x t ) f Section 3 describes SLB prediction scheme. Simulation 2 node!key = ... ; methodology and results are presented in Sections 4 and 3 g 4 . 5 respectively. Section 6 discusses related work. Finally, 5 node = head; Section 7 concludes the paper. 6 while ( node!=NULL) f 7 if(node!key <condition>) f 8 . Listing 2. LD-BR execution sequence (cjpeg) 9 g # d e f i n e DCTSIZE2 64 1 10 node=node! n e x t ; # d e f i n e DIVIDE BY(a,b) a /= b 2 11 g 3 void forward DCT(...) 4 f 5 / ∗ work area for FDCT subroutine ∗ / 6 for data-dependent branches using data value correlation. DCTELEM workspace[DCTSIZE2]; 7 Compiler identifies all program points where data structure 8 / ∗ Perform the DCT ∗ / 9 associated with a hard-to-predict data-dependent branch is (*do dct) (workspace); 10 referenced and modified. At run-time, hardware tracks 11 marked store instructions that modify the data structure, r e g i s t e r DCTELEM temp , q v a l ; 12 computes branch condition flags ahead of time using store register int i ; 13 data values, and buffers them in a structure at store ad- r e g i s t e r JCOEFPTR output p t r = c o e f blocks[bi]; 14 15 dresses. Later, during data structure traversal, pre-computed f o r ( i = 0 ; i < DCTSIZE2; i++) f 16 flags are read using predicted load address, and used for pre- a l = d i v i s o r s [ i ] ; 17qv dicting branch outcome. temp = workspace[i]; 18 Typically, instruction that loads data structure values 19 is quickly followed by branch instruction for evaluating if (temp < 0) f 20 branch condition using loaded values. Therefore, actual = −temp ; 21temp load address is usually not available before branch instruc- += qval >>1; 22temp BY(temp, qval); 23DIVIDE tion gets fetched. Hence, we use predicted load address for = −temp ; 24temp reading pre-computed branch flags. Addresses for simple g e l s e f 25 data structures such as arrays are easier to predict using += qval >>1; 26temp stride-based address predictor. Bekerman et al. proposed BY(temp, qval); 27DIVIDE g 28 load address predictor for irregular data structures such as t p u t ptr[i] = (JCOEF) temp; 29ou linked list and tree [4], which we adapted according to our g 30 requirement. g 31 We compared our design with the state-of-the-art TAGE branch predictor [32]. Results show that, for several benchmarks, top mispredicting branches in the TAGE predictor are accurately predicted using SLB predictor. On aver- 2 Motivating Examples age, compared to standalone TAGE predictor, combined TAGE+SLB predictor reduces branch mispredictions per 2.1 Example 1: Data-Dependent Direct 1K instructions (MPKI) by 21% for SPECint [34] bench- Branches mark suite. Similarly, for EEMBC [12] benchmark suite, MPKI is reduced by 51%. We will use cjpeg benchmark from eembc-consumer This paper makes following contributions: suite as a motivating example. The benchmark performs standard JPEG compression on a given image. Input im- 1. We investigate program patterns that manifest hard-to- age is broken into block of 8x8 pixels, and each block goes predict, data-dependent branches. through discrete cosine transform (DCT), quantization and 2. We propose SLB prediction scheme for data- entropy coding steps. As shown in listing 2, do dct function dependent branches, which predicts branch outcome (line 10) populates an array, workspace, with DCT coeffi- using pre-computed branch flags. cients whose values range between -1024 to 1023. During 3. Our implementation of SLB requires adding couple quantization, each DCT coefficient is read (line 18), com- of hint instructions to ISA, and light-weight hardware pared (line 20) to see if it is positive or negative, and quan- structures. tized accordingly. The branch ‘if (temp < 0)’ (line 20) is 2 jects in the grid (line 15 and 18), a buffer lookup during Listing 3. ST execution sequence (cjpeg) grid traversal yields correct function address correspond- 1 GLOBAL( void ) ing to the read object. Note that, in addition to accurately 2 j p e g f d c t i s l o w ( DCTELEM * data) 3 f predicting target address for ‘oPtr!viewingHit()’ function 4 DCTELEM ∗ d a t a p t r ; call (line 41), another hard-to-predict branch at line 37, ‘if 5 dataptr = data; ( oPtr )’, can also be accurately predicted since it also de- 6 pends on the same object read from the grid. 7 f o r (ctr = DCTSIZE−1; c t r >= 0 ; c t r −−) f 8 dataptr[DCTSIZE*0] = ... ; 9 dataptr[DCTSIZE*4] = ... ; 10 . Listing 4. ST-LD-BR execution sequence (eon) 11 d a t a p t r ++; 12 g c l a s s mrBox : p u b l i c mrSurface f 1 13 g g 2 3 c l a s s mrSphere : p u b l i c mrSurface f 4 g 5 6 an example of a hard-to-predict data-dependent branch. Ta- c l a s s mrSolidTexture : p u b l i c mrSurface f 7 ble 1 (row 3) shows branch characteristics and prediction g 8 accuracy of this branch using short and long history pre- 9 dictors. It shows that using longer history TAGE predictor void mrGrid::insert( double time1 , double time2 , 10 does not improve prediction accuracy for this branch. In- mrSurface *obj ptr) f 11 // other piece of code 12 stead, if branch condition flags are computed and buffered i f (grid(i, j, k) == 0) f 13 1 while populating array workspace in do dct function (see i d ( i , j , k ) = new mrSurfaceList (); 14gr listing 3, lines 8-9), a simple buffer lookup can yield perfect ((mrSurfaceList*)grid(i, j, k))!Add(obj ptr); 15 branch prediction. g 16 e l s e 17 2.2 Example 2: Data-Dependent Indirect ((mrSurfaceList*)grid(i, j, k))!Add(obj ptr); 18 g 19 Branches 20 mrGrid::viewingHit( 21ggBoolean Indirect branches are generally harder to predict than di- c on s t ggRay3& r , 22 rect branches as they may have multiple targets correspond- double time , 23 double tmin , 24 ing to a single static indirect branch.

(SLB) Predictor: a Compiler Assisted Branch Prediction for Data Dependent Branches

Memory Centric Characterization and Analysis of SPEC CPU2017 Suite

MIPS® Architecture for Programmers Volume I-B: Introduction to the Micromips32™ Architecture, Revision 5.03

Overview of the SPEC Benchmarks

Oracle Corporation: SPARC T7-1

Poweredge R940 (Intel Xeon Gold 5122, 3.60 Ghz) Specint Rate Base2006 = 1080 CPU2006 License: 55 Test Date: Jun-2017 Test Sponsor: Dell Inc

IBM Power Systems Performance Report Apr 13, 2021

MIPS32™ Architecture for Programmers Volume I: Introduction to the MIPS32™ Architecture

QSCORES: Trading Dark Silicon for Scalable Energy Efficiency With

MCST~8 Assembly Language Programming Manual PRELIMINARY EDITION

Notes on Usage of the RX Family C/C++ Compiler V.1.00 Release 01 and Corrections in the User's Manual

Benchmarking-HOWTO.Pdf

The Benefits of Hardware-Software Co-Design/Convergence for Large- Scale Enterprise Workloads