Cost Effective Speculation with the Omnipredictor Arthur Perais, André Seznec
To cite this version: Arthur Perais, André Seznec. Cost Effective Speculation with the Omnipredictor. PACT '18 - 27th International Conference on Parallel Architectures and Compilation Techniques, Nov 2018, Limassol, Cyprus. 10.1145/3243176.3243208. hal-01888884

HAL Id: hal-01888884
https://hal.inria.fr/hal-01888884
Submitted on 14 Nov 2018

Cost Effective Speculation with the Omnipredictor

Arthur Perais, André Seznec
INRIA, Univ Rennes, IRISA, CNRS

ABSTRACT

Modern superscalar processors heavily rely on out-of-order and speculative execution to achieve high performance. The conditional branch predictor, the indirect branch predictor and the memory dependency predictor are among the key structures that enable efficient speculative out-of-order execution. Therefore, processors implement these three predictors as distinct hardware components.

In this paper, we propose the omnipredictor, which predicts conditional branches, memory dependencies and indirect branches at state-of-the-art accuracies without paying the hardware cost of the memory dependency predictor and the indirect jump predictor.

We first show that the TAGE prediction scheme based on global branch history can be used to concurrently predict both branch directions and memory dependencies. Thus, we unify these two predictors within a regular TAGE conditional branch predictor whose prediction is interpreted according to the type of the instruction accessing the predictor. Memory dependency prediction is provided at almost no hardware overhead.

We further show that the TAGE conditional predictor can be used to accurately predict indirect branches by using TAGE entries as pointers to Branch Target Buffer entries. Indirect target prediction can thus be blended into the conditional predictor along with memory dependency prediction, forming the omnipredictor.

1. INTRODUCTION

1.1 The Context: Speculation in All Stages

To maximize performance, out-of-order execution superscalar processors speculate on many aspects of the control- and data-flow graphs. Since the fetch-to-execute delay may span several tens of cycles, accurate control-flow speculation is essential to performance. Speculative memory access reordering is also key to efficiency, but may lead to memory order violations and hence frequent pipeline squashes. Therefore, accurate memory data dependency speculation is also a necessity.

Control-flow Speculation.

The instruction fetch front-end of the processor retrieves blocks of contiguous instructions. The correct address of the subsequent instruction block is computed rather late in the pipeline, often as late as the execution stage. Therefore, the instruction fetch engine is in charge of speculatively generating this instruction block address. The instruction address generator of a modern processor features three predictors: the Return Address Stack (RAS) [1], the conditional branch predictor [2] and the indirect jump predictor [3]. It also features a Branch Target Buffer (BTB) [4] to cache decoded targets for direct conditional and unconditional jumps, and in some designs, to predict targets for indirect jumps and/or returns.

Among these structures, the most critical one is the conditional branch direction predictor, since conditional branches are the most frequent and can only be resolved at execute time. Significant research effort [2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 4, 16, 17, 18, 19] was dedicated to this class of predictors throughout the past 25 years. State-of-the-art propositions [9, 19] combine the neural branch prediction approach [18] with the TAGE predictor [16]. The BTB is also very important, since a missing target results in a pipeline bubble in the front-end: a miss in the BTB for a direct branch can be resolved at Decode. The RAS addresses the prediction of return targets very efficiently, provided an adequate management of its speculative state [20]. Lastly, indirect branches were traditionally predicted through the BTB. However, the misprediction rate on this class of branches was shown to be problematic [3], especially given that the targets of such branches are always resolved in the Execute stage. Thus, specific predictors were proposed to handle them.

Data-flow Speculation.

To maximize performance, out-of-order processors should be aware of the complete data-flow graph of the current instruction window. Unfortunately, memory dependencies cannot be determined precisely until the addresses of in-flight memory instructions are resolved in the Execute stage. To tackle this limitation, memory dependency prediction (MDP) was proposed [22]. Ideally, the role of MDP is to determine, for a given load, whether an older speculative store will write to the loaded address, so the load can be marked dependent on the store and wait for it to execute. One of the first documented hardware implementations of MDP can be found in the Alpha 21264 [23]. It categorized loads as either "can issue as soon as register dependencies are satisfied" or "must wait for all older stores to compute their address".
More refined schemes were proposed by Chrysos and Emer [24] and Subramaniam and Loh [25]. However, all these contributions introduce dedicated hardware structures to perform speculation.

Overhead of Speculation.

In summary, modern microarchitectures feature several predictors to speculate on different aspects of the control- and data-flow. These predictors are paramount to performance, but represent a significant amount of hardware and contribute to power and energy consumption. In a fetch block, each instruction can be a conditional branch, a memory load or an indirect branch. To enable high overall front-end bandwidth, the overall potential prediction bandwidth of these predictors is largely over-dimensioned on many superscalar processor designs. For instance, on the Compaq EV8 [16] or on the IBM Power 8 [36], the conditional branch predictor generates one conditional branch prediction and one possible branch target per fetched instruction (16 per cycle on EV8, 8 per cycle on Power 8).

1.2 Unified Speculation in a High Performance Microprocessor

Implementing a conditional and/or MDP predictor able to generate a prediction per fetched instruction for a block of 8 contiguous instructions might be considered a waste of silicon and/or energy, since such a block of 8 instructions rarely features more than 3 branches (including at most one taken) and more than 3 memory loads or stores.

In this paper, we show that the conditional branch predictor, the memory dependency predictor and the indirect branch predictor can be unified within a single hardware predictor, the omnipredictor. This organization saves on silicon real-estate at close to no performance overhead.

As a first contribution, we introduce the MDP-TAGE predictor, which leverages the observation that the multi-value prediction counter (in fact a 3-bit field) of the TAGE branch predictor entry [19] can be interpreted as a distance to a store for a load instruction. This allows us to blend the memory dependency predictor into the conditional branch predictor.

Second, we show that the TAGE predictor storage structure can also be used to predict indirect branches. The 3-bit field in the TAGE entry can be used as a pointer or an indirect pointer to targets stored in the BTB for indirect jumps. The omnipredictor provides close to state-of-the-art indirect target and memory dependency prediction by stealing entries in the conditional branch predictor table and the BTB, while only marginally decreasing the branch prediction accuracy. That is, the omnipredictor significantly reduces the overall hardware real-estate associated with speculative execution, removing the stand-alone memory dependency predictor and indirect jump predictor without provisioning additional entries in the under-utilized predictors: TAGE or the BTB.

1.3 Paper organization

For the sake of simplicity, unless mentioned otherwise, we will consider a fixed instruction length ISA such as ARMv8.

The remainder of this paper is organized as follows. Section 2 briefly presents the state-of-the-art in branch direction, branch target and memory dependency prediction. Section 3 presents the structure of the instruction fetch front-end on wide-issue superscalar processors implementing a fixed instruction length ISA, such as the Compaq EV8 and IBM Power 8; it also points out the prediction bandwidth waste associated with such a structure. Section 4 presents the principle of TAGE-based memory dependency prediction. We also describe how conditional branch prediction and memory dependency prediction can be packed in the same hardware predictor. Section 5 further shows that TAGE associated with the block-BTB of a superscalar processor can also be used to predict indirect targets. Section 6 presents experimental results. Section 7 discusses the omnipredictor approach in the context of a variable instruction length ISA. Finally, Section 8 provides directions for future research as well as concluding remarks.

2. RELATED WORK

2.1 Conditional Branch Prediction

Branch direction predictors have come a long way since Smith's study on branch prediction strategies, in which the PC-indexed bimodal table was introduced [2]. Many schemes were proposed over the last three decades, including 2-level [5, 6], gshare and gselect [10], YAGS [15], PPM [13] and GEHL [14]. The current state-of-the-art is embodied by neural-based predictors [7, 8, 18] as well as the TAGE predictor [16] and its derived predictors [9, 19].
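To make the TAGE family of predictors referenced above concrete, the lookup side of a TAGE-style predictor can be sketched as a bimodal base table backed by tagged tables indexed with geometrically increasing global-history lengths; the longest-history matching table provides the prediction. This is a heavily simplified, hypothetical sketch: real TAGE uses dedicated index/tag hash functions, useful bits, allocation and update policies that are omitted here.

```python
# Minimal sketch of a TAGE-style lookup. Table sizes, hash functions and
# tag widths are illustrative simplifications, not those of [16] or [19].

import hashlib

HIST_LENGTHS = [4, 8, 16, 32]  # geometric series of history lengths


def _hash(pc, hist_bits, salt):
    # Stand-in for the hardware's XOR-folding index/tag hashes.
    h = hashlib.blake2b(f"{salt}:{pc}:{hist_bits}".encode(), digest_size=4)
    return int.from_bytes(h.digest(), "little")


class TAGE:
    def __init__(self, size=1024):
        self.size = size
        self.base = [1] * size                 # bimodal counters (weakly not-taken)
        # One tagged table per history length: index -> (tag, 3-bit counter).
        self.tagged = [dict() for _ in HIST_LENGTHS]
        self.ghist = 0                         # global branch history register

    def predict(self, pc):
        # Default prediction comes from the bimodal base table; any tagged
        # table with a tag match overrides it, longest history winning.
        pred = self.base[pc % self.size] >= 2
        for i, length in enumerate(HIST_LENGTHS):
            hist = self.ghist & ((1 << length) - 1)
            idx = _hash(pc, hist, "idx") % self.size
            tag = _hash(pc, hist, "tag") & 0xFF
            entry = self.tagged[i].get(idx)
            if entry is not None and entry[0] == tag:
                pred = entry[1] >= 4           # 3-bit counter: taken if >= 4
        return pred

    def update_history(self, taken):
        self.ghist = ((self.ghist << 1) | int(taken)) & ((1 << 64) - 1)
```

The 3-bit counter read out by `predict` is exactly the field that the omnipredictor of Section 1.2 reinterprets for loads and indirect branches.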
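The omnipredictor's central idea from Section 1.2, reinterpreting the same 3-bit TAGE field according to the type of the instruction that reads it, can be illustrated as follows. The encodings below (counter threshold, distance meaning, BTB-way pointer) are hypothetical placeholders for the paper's actual schemes.

```python
# Hedged sketch of the omnipredictor's type-dependent interpretation of the
# 3-bit TAGE entry field. Encodings are illustrative, not the paper's exact ones.

from enum import Enum


class InstKind(Enum):
    COND_BRANCH = 0
    LOAD = 1
    INDIRECT = 2


def interpret(field3: int, kind: InstKind):
    """Decode one 3-bit TAGE field three different ways."""
    assert 0 <= field3 < 8, "must be a 3-bit field"
    if kind is InstKind.COND_BRANCH:
        # Usual TAGE meaning: prediction counter, taken in the upper half.
        return ("taken", field3 >= 4)
    if kind is InstKind.LOAD:
        # MDP-TAGE: distance (in stores) from the load to the store it
        # should wait on; 0 could encode "no predicted dependency".
        return ("store_distance", field3)
    # Indirect jump: the field selects among targets associated with this
    # branch in the BTB, i.e., it acts as a pointer to a BTB entry/way.
    return ("btb_way", field3)
```

The appeal of this organization is that no extra storage is added: the meaning of an existing field changes with the instruction type known at prediction time.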