Using Value Prediction to Increase the Power of Speculative Execution Hardware

Freddy Gabbay, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel. [email protected]
Avi Mendelson, National Semiconductor Israel. [email protected]

Abstract

This paper presents an experimental and analytical study of value prediction and its impact on speculative execution in superscalar microprocessors. Value prediction is a new paradigm that suggests predicting outcome values of operations (at run-time) and using these predicted values to trigger the execution of true-data dependent operations speculatively. As a result, stalls to memory locations can be reduced and the amount of instruction-level parallelism can be extended beyond the limits of the program's dataflow graph. This paper examines the characteristics of the value prediction concept from two perspectives: 1. the related phenomena that are reflected in the nature of programs, and 2. the significance of these phenomena to boosting the instruction-level parallelism of superscalar microprocessors that support speculative execution. In order to better understand these characteristics, our work combines both analytical and experimental studies.

1. Introduction

The growing density of gates on a silicon die allows modern microprocessors to employ an increasing number of execution units. Current superscalar architectures assume sequential programs as an input and a parallel execution model. Thus, the hardware is expected to extract the parallelism at run time out of the instruction stream without violating the sequential correctness of the execution. The efficiency of such architectures is highly dependent on both the hardware mechanisms and the application characteristics; i.e., the instruction-level parallelism (ILP) the program exhibits.

Instructions within the sequential program cannot always be ready for parallel execution due to several constraints. These are traditionally classified as: true-data dependencies, name dependencies (false dependencies) and control dependencies ([21], [35]). Neither control dependencies nor name dependencies are considered an upper bound on the extractable ILP, since they can be handled (or even eliminated in several cases) by various hardware and software techniques ([1], [2], [3], [4], [5], [6], [8], [9], [10], [12], [13], [14], [21], [22], [23], [24], [27], [29], [32], [33], [36], [37], [38], [39]). As opposed to name dependencies and control dependencies, only true-data dependencies are considered to be a fundamental limit on the extractable ILP, since they reflect the serial nature of a program by dictating the sequence in which data should be passed between instructions. This kind of extractable parallelism is represented by the dataflow graph of the program ([21]).

In this paper we deal with a superscalar execution model ([9], [11], [20], [21], [34]). This machine model is divided into two major subsystems: instruction-fetch and instruction-execution, which are separated by a buffer, termed the instruction window. The instruction fetch subsystem acts as a producer of instructions. It fetches multiple instructions from a sequential instruction stream, decodes them simultaneously and places them in the program appearance order in the instruction window. The instruction execution subsystem acts as a consumer, since it attempts to execute the instructions placed in the instruction window. If this subsystem executed instructions according to their appearance order in the program (in-order execution), it would have to stall each time an instruction proved unready to be executed. In order to overcome this limitation and better utilize the machine's available resources, most existing modern superscalar architectures are capable of searching for and sending ready instructions to execution beyond the stalling instruction (out-of-order execution). Since the original instruction sequence is not preserved, superscalar processors employ a special mechanism, termed the reorder buffer ([21]). The reorder buffer forces the completion of the execution of instructions to become visible (retire) in-order. Therefore, although instructions may complete their execution, they cannot completely retire, since they are forced to wait in the reorder buffer until all previous instructions complete correctly. This capability is essential in order to keep the correct order of exceptions and interrupts and to deal with control dependencies as well. In order to tolerate the effect of control dependencies, many superscalar processors also support speculative execution. This technique involves the introduction of a branch prediction mechanism ([33], [37], [38], [39]) and a means for allowing the processor to continue executing control dependent instructions before resolving the branch outcome. Execution of such control dependent instructions is termed "speculative execution". In order to maintain the correctness of the execution, a speculatively-executed instruction can only retire if the prediction it relies upon was proven to be correct; otherwise, it is discarded.

In this paper we study the concept of value prediction introduced in [25], [26], [16], [17] and [18]. The new approach attempts to eliminate true-data dependencies by predicting at run-time the outcome values of instructions, and executing the true-data dependent instructions based on that prediction. Moreover, it has been shown that the bound of true-data dependencies can be exceeded without violating sequential program correctness. This claim overcomes two commonly recognized fundamental principles: 1. the ILP of a sequential program is limited by its dataflow graph representation, and 2. in order to guarantee the correctness of the program, true-data dependent instructions cannot be executed in parallel.

The integration of value prediction in superscalar processors introduces a new kind of speculative execution. In this case the execution of instructions becomes speculative when it is not assured that these instructions were fed with the correct input values. Note that since present superscalar processors already employ a certain form of speculative execution, from the hardware perspective value prediction is a feasible and natural extension of the current technology. Moreover, even though both value prediction and branch prediction use a similar technology, there are fundamental differences between the goals of these mechanisms. While branch prediction aims at increasing the number of candidate instructions for execution by executing control-dependent instructions (since the amount of available parallelism within a basic block is relatively small [21]), value prediction aims at allowing the processor to execute operations beyond the limit of true-data dependencies (given by the dataflow graph).

Related work: The pioneer studies that introduced the concept of value prediction were made by Lipasti et al. and Gabbay et al. ([25], [26], [16], [17] and [18]). Lipasti et al. first introduced the notion of "value locality" - the likelihood of a previously-seen value to repeat itself within a storage location. They found that load instructions tend to exhibit value locality, and they suggested exploiting this property in order to reduce memory latency and increase memory bandwidth. They proposed a special mechanism, termed the "Load Value Prediction Unit" (LVP), which attempts to predict the values that are about to be loaded from memory. The LVP was suggested for current processor models (PowerPC 620 and ALPHA AXP 21164), where its relative performance gain was also examined. In their further work ([26]), they extended the notion of value locality and showed that this property may appear not only in load instructions but also in other types of instructions, such as arithmetic instructions. As a result of this observation, they suggested value prediction also in order to collapse true-data dependencies.

Gabbay et al. ([16], [17]) have simultaneously and independently studied the value prediction paradigm. Their approach was different in the sense that they focused on exploring the phenomena related to value prediction and their significance for future processor architectures. In addition, they examined the different effects of value prediction on an abstract machine model. This was useful since it allowed examination of the pure potential of this phenomenon independent of the limitations of individual machines. In this paper and our previous studies we initially review substantial evidence confirming that values computed during the execution of a program are likely to be correctly predicted at run-time. In addition, we extend the notion of value locality and show that programs may exhibit different patterns of predictable values, which can be exploited by various value predictors. Our focus on the statistical patterns of value prediction also provides us with a better understanding of the possible exploitation of value prediction and its mechanisms.

This paper extends our previous studies, broadly examines the potential of these new concepts and presents an analytical model that predicts the expected increase in the ILP as a result of using value prediction. In addition, the pioneer study presented in this paper aims at opening new opportunities for future studies which would examine the related microarchitectural considerations and solutions, such as [18]. The rest of this paper is organized as follows: Section 2 introduces the concept of value prediction and various value predictors and shows how value prediction can exceed current ultimate limits. Section 3 broadly studies and analyzes the characteristics of value prediction. Section 4 presents an analytical model and experimental measurements of the extractable ILP. We conclude this work in Section 5.

2. Value prediction and its potential to overcome dataflow graph limits

In this section we provide formal definitions related to the paradigm of value prediction, describe different techniques to take advantage of this capability and demonstrate the concept through a simple, but detailed example.

2.1. Principles and formal definitions

Almost any computer operation can be regarded as a transformation between the input it consumes and the destination value it computes. We will refer to the process of computing a destination value as the generation of a value, and to the computed destination value as a generated value. For instance, the generation of a value can be explicitly computed in operations such as arithmetic operations and load or store operations, or can be implicitly computed as a result of changing status bits or other side effects. The flow of these data values between instructions in a program is represented by the dataflow graph. The dataflow graph representation of a program is a directed graph in which the nodes represent the computations and the arcs represent true-data dependencies:

Definition 1: True-data dependency ([21]) - If an instruction uses a value generated by a previous instruction, the second instruction has a true-data dependency on the first instruction, and the second instruction is also termed a true-data dependent instruction. True-data dependencies represent the flow of information from the instructions that generate it to the instructions that consume it.

Usually, the dataflow graph assumes an infinite number of resources in terms of execution units, registers etc. (in order to avoid the need to refer to structural conflicts and name dependencies), and the control dependencies are either represented by a special type of node, or are assumed to be known in advance (so they do not have to be considered in the graph). As a result, this representation of true-data dependencies was recognized and considered as the fundamental limit on the ILP that can ever be gained by current processors ([21]).

Value prediction aims at predicting generated values (before their corresponding operations are executed) and allowing their data dependent instructions to be executed on the basis of that prediction. As a result, true-data dependent operations can be executed (speculatively) in parallel. In this paper we assume that the prediction of generated values is made in hardware at run-time, by a special mechanism termed a value predictor. A description of the various value predictors is introduced in Subsection 2.3.

Like branch prediction, value prediction also causes the execution of instructions to become speculative. However, this time the execution of an instruction becomes speculative because it consumes values (generated by other instructions in the program) that have been predicted in advance, and it is not guaranteed that these are the correct values. Thus, we term this way of execution speculative execution based on value prediction:

Definition 2: Speculative execution based on value prediction - is an execution of a true-data dependent instruction where: 1. not all its input values have been computed yet and 2. all the unavailable input values are supplied by the value predictor.

Note that unlike speculative execution based on branch prediction, which seeks to tolerate the effect of control dependencies and schedule instructions in the manner they are presented by the program's dataflow graph, speculative execution based on value prediction attempts to exceed the dataflow graph limits. From the hardware perspective the two kinds of speculative execution resemble one another, since they use similar mechanisms: prediction schemes to generate the predicted values, scheduling mechanisms capable of taking advantage of the prediction and tagging instructions that were executed speculatively, a validation mechanism to verify the correctness of the prediction and a recovery mechanism to allow the machine to recover from incorrect predictions. Because of the availability of similar mechanisms in current superscalar processors, value prediction is considered a feasible concept.

The potential of using value prediction significantly depends on the value prediction accuracy that it can accomplish.

Definition 3: Value prediction accuracy - is the number of successful value predictions out of the overall number of prediction attempts made by the value predictor.

Two different factors determine the value prediction accuracy: (1) the value predictor and its capabilities (in terms of resources etc.), and (2) the nature of the value predictability that resides within the data in the program code.

Definition 4: Value predictability - is the potential that resides in a program to successfully predict the outcome values generated during its execution (out of the entire range of values that the program generates). This parameter depends on the inherent properties of the program itself, its data, its computation algorithm and the capabilities of the value predictor to reveal and exploit these properties.

In this paper, we suggest distinguishing between two different behavioral patterns of value predictability: last-value predictability and stride value predictability:

Definition 5: Last-value predictability - is a measure of the likelihood that an instruction generates an outcome value equal to its most recently generated value.

Definition 6: Stride value predictability - is a measure of the likelihood that an instruction generates an outcome value equal to its most recently generated value plus a fixed delta (stride), where the delta value is determined by the difference between the two most recently generated values. (For instance, an instruction that repeatedly generates 100, 100, 100, ... is last-value predictable, while one that generates 100, 104, 108, ... is stride value predictable with a delta of 4.)

In Section 3 we provide substantial evidence of the existence of these distinctive patterns and show that categorizing value predictability into these two patterns has a major significance.

2.2. Simple example

Since the concept of value prediction is quite new, we now introduce a simple and detailed example in order to illustrate its potential. Figure 2.1 exhibits a simple C program segment that sums the values of two vectors (B and C) into vector A.

for (x = 0; x < 100000; x++)
    A[x] = B[x] + C[x];

Figure 2.1 - A sample C program segment.

Compiling this program with a C compiler which employs simple compiler optimizations* yields the following assembly code (for a Sun-Sparc machine):

(1) 22f0: ld   [%i4+%g0],%l7     //Load B[i]
(2) 22f4: ld   [%i5+%g0],%i0     //Load C[j]
(3) 22f8: add  %i5,0x4,%i5       //Incr. index j
(4) 22fc: add  %l7,%i0,%l7       //A[k]=B[i]+C[j]
(5) 2300: st   %l7,[%i3+%g0]     //Store A[k]
(6) 2304: cmp  %i5,%i2           //Compare index j
(7) 2308: add  %i4,0x4,%i4       //Incr. index i
(8) 230c: bcs  0xfffffff9 <22f0> //Branch
(9) 2310: add  %i3,0x4,%i3       //Incr. index k (in branch delay slot)

Figure 2.2 illustrates the dataflow representation of this program (each node is tagged with the corresponding instruction number), assuming that all the control and name dependencies were resolved.

Figure 2.2 - The dataflow graph of the sample program.

In order to reach the degree of parallelism illustrated by figure 2.2, the machine should satisfy the following conditions:
1. It should know the control flow paths of the program in advance in order to eliminate control dependencies. The machine can seek to satisfy this condition by employing branch prediction.
2. Its resources in terms of execution units, register ports, memory ports etc. should satisfy the needs of the program in order to eliminate structural conflicts.
3. The number of registers should be sufficient to satisfy all name dependencies.
4. Its instruction window size should be big enough to evaluate all the instructions that appear in the dataflow graph.
5. The machine's fetch, decode, issue and execution bandwidth should be sufficient ([18]). This capability is particularly crucial for the fetch bandwidth, since the sequence in which instructions are stored in memory may not necessarily correspond to the execution sequence that is illustrated by the dataflow graph.

A machine that satisfies all these requirements is considered an ideal machine, since the only limitation that prevents it from extracting an unlimited amount of parallelism is the flow of data among instructions in the program (true-data dependencies). Note that current realistic processor architectures can only seek to approach these dataflow graph boundaries, and therefore they are more properly termed restricted dataflow machines ([30]).

When examining our sample program, it can be observed that even if both control and name dependencies were eliminated, the index manipulation still prevents different iterations of the loop from being executed in parallel. Such a phenomenon is termed loop-carried dependencies. Various compilation techniques such as loop-unrolling ([21], [36]) have been proposed to alleviate this problem. However, these techniques cannot remove the true-data dependencies inside the basic block of the loop.

* The compilation was made with the ‘-O2’ optimization flag.

Figure 2.3 - The dataflow execution of the sample loop.

In addition, loop-unrolling has several disadvantages: 1. it significantly enlarges the code size, 2. it is sometimes incapable of evaluating the number of loop iterations at compile time, and 3. it increases the usage of registers. Therefore, we decided to examine the existing code, as generated by a standard gcc compiler (with the simple optimization), and to leave the analysis of the impact of other compilation techniques to future studies.

Figure 2.3 exhibits the overlap between different iterations of the loop during the execution of the program. From this figure we can calculate the ILP when the program runs on an ideal machine. It can be seen that the program is executed in 100,002 clock cycles if we assume that the execution of each instruction takes a single clock cycle. Hence, since the instruction count of the entire program is 900,000, the ILP presented by the dataflow graph is 900000/100002 ≅ 9 instructions/clock cycle.
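For readers who want the arithmetic spelled out, the following is a sketch of the calculation, assuming (as figure 2.3 depicts) that the three-instruction dependence chain load -> add -> store gives each iteration a depth of three cycles and that consecutive iterations can start one cycle apart:

    \mathrm{cycles} = (100{,}000 - 1) + 3 = 100{,}002, \qquad
    \mathrm{ILP} = \frac{9 \times 100{,}000}{100{,}002} = \frac{900{,}000}{100{,}002} \approx 9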
Value prediction, in this example, can help predict the values of the indexes, and if the initial values of the arrays are predictable (for example, if both arrays are 0) then it can predict the outcome of the load instructions and the computations (add instruction) as well. In order to illustrate the potential of value prediction and how it works, we perform various experiments and present the ILP in the following cases:
1. No value prediction is allowed.
2. Value prediction is allowed only for loads.
3. Value prediction is allowed only for ALU (arithmetic logic) instructions.
4. Value prediction is allowed for both ALU and load instructions.

In each experiment the ILP is measured for two different sizes of instruction windows (the term instruction window also represents the fetch, decode and execution bandwidth limit of our machine): 40 and 200. The summary of these simulation results is presented in tables 2.1 and 2.2. The first set of results refers to the case where all the data in the different arrays was initialized to the same value, say arrays of zeroes, before the execution, so that all the values loaded from memory can be predicted correctly at run-time (yet it does not matter how). The second table refers to the situation where the data in the arrays was initialized (before the execution of the code segment) with random values, so that no predictable pattern can be observed for the loaded values (the way this effect was simulated in our experiments is also presented in this subsection).

The first set of experimental results indicates that when value prediction is not used, both instruction windows gain the expected ILP of approximately 9. This result is also equal to the ILP that was previously calculated using the dataflow graph in figures 2.2 and 2.3. The enlargement of the instruction window size does not improve the ILP, since loop-level parallelism cannot be exploited due to the loop-carried dependencies. Allowing value prediction for both ALU and load instructions resolves all true-data dependencies, since the index calculations as well as the add operations (that add the components of the arrays) and the load instructions can always be correctly predicted. Therefore the ILP in this set of experiments (table 2.1) is only limited by the instruction window size, nearly 40 and 200 respectively. Value prediction thus yields a 4 to 20-fold speedup!

Additional results included in the first set of experiments indicate that value prediction of ALU instructions is more significant for the sample program than value prediction of load instructions. When value prediction is allowed only for loads, no boost in the ILP is observed. This observation is indeed accurate, since the elimination of the true-data dependencies that are associated with the load instructions does not allow us to exploit loop-level parallelism across multiple iterations. The pair of load instructions in every basic block (iteration) only increases the available parallelism within the basic block itself; however, as long as the loop-level parallelism cannot be exploited, no further ILP can be gained. When value prediction is allowed only for ALU instructions, it can still gain a significant boost in the ILP relative to the case when value prediction is allowed for both load and ALU instructions. The explanation of this observation is that the loop-level parallelism can be exploited due to the value prediction of the indexes that are computed by the ALU instructions. Only when loop-level parallelism is exploited does value prediction of loads provide an additional contribution to boost the ILP, as illustrated in table 2.1.

                                Instruction window = 40   Instruction window = 200
No value prediction             ILP=9                     ILP=9
Load value prediction           ILP=9                     ILP=9
ALU value prediction            ILP=36                    ILP=180
Load and ALU value prediction   ILP=40                    ILP=200

Table 2.1 - The ILP when the arrays were initialized with 0's.

In order to further investigate the impact of the predictability of the data in the arrays on the overall performance of this sample code, we repeat the experiments, but this time prevent our value predictor from predicting the outcome values of the data being read from the arrays. This means that neither the outcome values of the load instructions nor the outcome values of the add instruction (which adds the array components) are predictable. With such an experiment we can quantify the effectiveness of value prediction (in our sample program) when the data in the arrays is initialized to random values, in such a way that they cannot be predicted by our value predictor. Note that in the previous case, where the arrays were initialized with zero values, even if value prediction of load instructions was not allowed, the add instruction always generated zero values and so it could be predicted correctly. However, in this case the results of the add instruction can no longer be predicted correctly, since it adds two random values. Therefore this case eliminates the capability to predict the load as well as the add instructions. The results of this set of experiments are summarized in table 2.2.

                                Instruction window = 40   Instruction window = 200
No value prediction             ILP=9                     ILP=9
Load value prediction           ILP=9                     ILP=9
ALU value prediction            ILP=30                    ILP=150
Load and ALU value prediction   ILP=30                    ILP=150

Table 2.2 - The ILP when the arrays were initialized with random values.

This table indicates that attempting to predict only the values of load instructions is useless, since the value predictor cannot predict them correctly. When value prediction is allowed only for ALU instructions, the ILP becomes nearly 30 when the instruction window size is 40, and 150 when the size is 200. The significant increase in ILP is again obtained due to the loop-level parallelism that can be exploited. Employing both load and ALU value prediction does not gain ILP beyond the ILP gained by ALU value prediction, since in both cases neither the loads nor the add instructions of the array components can be predicted correctly.

2.3. Various value predictors

We propose three different hardware-based value predictors: the last-value predictor, the stride predictor and the register-file predictor. All these predictors perform a dynamic and adaptive prediction, since they collect and study history information at run-time, and with this information they determine their value prediction. Each of the three predictors has a different prediction formula. The prediction formula determines the predicted value (i.e., the manner in which a predicted destination value is determined). The hardware implementation and considerations of the value predictors are beyond the scope of this paper and are left for future studies in this area ([18]). In this study our purpose is focused on exploring value prediction phenomena from a general viewpoint, hence we discuss the predictor schemes at a high level without referring to detailed hardware implementation. In addition, for the sake of generality, the size of the prediction table employed by these schemes is assumed to be unlimited in our experiments. For simplicity, we also assume that the predictors only predict destination values of register operands (even though all these schemes can be generalized and applied to memory storage operands and condition codes as well) and that they are updated immediately after the prediction is made.

Last-value predictor: predicts the destination value of an individual instruction based on the last previously-seen value it has generated. The predictor is organized as a table (e.g., a cache table - see figure 2.4), and every entry is uniquely associated with an individual instruction. Each entry contains two fields: tag and last-value. The tag field holds the address of the instruction or part of it (high-order bits in case of an associative cache table), and the last-value field holds the previously-seen destination value of the corresponding instruction. In order to obtain the predicted destination value of a given instruction, the table is searched by the absolute address of the instruction.

Figure 2.4 - The "last value", the "stride" and the "register-file" predictors.

Performing the table look-up in such a manner can be very efficient, since it can be done in the early stages of the pipeline (the instruction address is usually known at the fetch stage). A version of the last-value predictor (using an indexed but untagged table) was proposed by Lipasti et al. ([26]).

Stride predictor: predicts the destination value of an individual instruction based on its last previously-seen value and a calculated stride. The predicted value is the sum of the last value and the stride. Each entry in this predictor holds an additional field, termed the stride field, that stores the previously-seen stride of the individual instruction (figure 2.4). The stride field value is the delta between the two most recent consecutive destination values.

Register-file predictor: predicts the destination value of a given instruction according to the last previously-seen value and the stride of its destination register (the recent value and the stride could possibly have been determined by different instructions). The register-file predictor is organized as a table as well (figure 2.4), where each entry is associated with a different (architectural) register. The past information of each register is collected in two fields: a last-value field and a stride field. The last-value field is determined according to the last value written to the corresponding register, and the stride value is the delta between the two most recent consecutive values that were written to the specific register (possibly by different instructions).

The last-value predictor can only take advantage of last-value predictability, since the prediction is made upon the last value, while the stride predictor can exploit both the last-value predictability and the stride value predictability that may reside in a program. The register-file predictor may seem very attractive since its prediction table is relatively small. However, since the predicted values are determined according to the history information of the register, there can be aliasing between different instructions that write to the same register. As a result, this may have a serious influence on the prediction accuracy that it can accomplish.
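The three prediction formulas are compact enough to state in code. The following C sketch is ours and is illustrative only: the paper's predictors are hardware tables of unbounded size, whereas here the direct-mapped organization, the 4096-entry table size and the pc % ENTRIES indexing are assumptions made for concreteness.

    #include <stdint.h>

    #define ENTRIES 4096   /* illustrative; the paper assumes unlimited tables */

    struct lv_entry  { uint32_t tag; uint32_t last; };                  /* last-value */
    struct str_entry { uint32_t tag; uint32_t last; int32_t stride; };  /* stride */
    struct reg_entry { uint32_t last; int32_t stride; };                /* register-file */

    static struct lv_entry  lv_tab[ENTRIES];
    static struct str_entry str_tab[ENTRIES];
    static struct reg_entry reg_tab[32];   /* one entry per architectural register */

    /* Last-value predictor: the table is searched by the instruction address;
       the prediction is simply the previously-seen destination value. */
    int lv_predict(uint32_t pc, uint32_t *pred) {
        struct lv_entry *e = &lv_tab[pc % ENTRIES];
        *pred = e->last;
        return e->tag == pc;               /* hit/miss check, as in figure 2.4 */
    }
    void lv_update(uint32_t pc, uint32_t value) {
        struct lv_entry *e = &lv_tab[pc % ENTRIES];
        e->tag = pc;
        e->last = value;
    }

    /* Stride predictor: predicted value = last value + stride, where the
       stride is the delta of the two most recent destination values. */
    int str_predict(uint32_t pc, uint32_t *pred) {
        struct str_entry *e = &str_tab[pc % ENTRIES];
        *pred = e->last + (uint32_t)e->stride;
        return e->tag == pc;
    }
    void str_update(uint32_t pc, uint32_t value) {
        struct str_entry *e = &str_tab[pc % ENTRIES];
        e->stride = (int32_t)(value - e->last);
        e->last   = value;
        e->tag    = pc;
    }

    /* Register-file predictor: history is kept per destination register
       rather than per instruction, so instructions that write the same
       register alias to one entry. */
    void reg_predict(int reg, uint32_t *pred) {
        *pred = reg_tab[reg].last + (uint32_t)reg_tab[reg].stride;
    }
    void reg_update(int reg, uint32_t value) {
        reg_tab[reg].stride = (int32_t)(value - reg_tab[reg].last);
        reg_tab[reg].last   = value;
    }

On the sample loop of figure 2.1, for example, instruction (3) writes a value sequence that grows by 4 each iteration, so after two iterations str_update has captured a stride of 4 and str_predict is correct from then on, while lv_predict remains one iteration behind and is always wrong for that instruction.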

3. Experimental characterization of value predictability

This section presents the results of various experiments performed in this research. Substantial evidence is provided to show that programs exhibit remarkable potential for value predictability. In addition, a broad study of various aspects and characteristics of this phenomenon is presented.

3.1. Simulation environment

A special trace-driven simulator was developed in order to provide measurements for the experiments that are presented in the following subsections. The simulation environment was fed with the Spec95 benchmarks suite (table 3.1). The benchmarks were traced by the SHADE simulator ([31]) on a Sun-Sparc microprocessor. All benchmarks were compiled with the gcc 2.7.2 compiler with all available optimizations. The set of benchmarks that was used consists of 8 integer benchmarks and 5 floating-point benchmarks. Each integer benchmark was traced for 100 million instructions (our experiments show that using longer traces barely affects our measurements). In addition, two of the integer benchmarks, gcc and perl, were examined using two different input files in order to evaluate the effect of the input file on the characteristics of value predictability. The floating point benchmarks (except mgrid, where the initialization phase is negligible) consist of two major execution phases: an initialization phase and a computation phase.

SPEC95 Benchmarks
Benchmark      Type     Description
go             Integer  Game playing.
m88ksim        Integer  A simulator for the 88100 processor.
gcc1, gcc2     Integer  A C compiler based on GNU C compiler version 2.5.3, compiling 2 different input files.
compress95     Integer  Data compression program using adaptive Lempel-Ziv coding.
li             Integer  Lisp interpreter.
ijpeg          Integer  JPEG encoder.
perl1, perl2   Integer  Anagram search program with two different input files.
vortex         Integer  A single-user object-oriented database transaction benchmark.
tomcatv        FP       A vectorized mesh generation program.
swim           FP       Shallow water model with 1024 x 1024 grid.
su2cor         FP       Quantum physics computation of elementary particle masses.
hydro2d        FP       Hydrodynamical Navier Stokes equation solver to compute galactical jets.
mgrid          FP       Multi-grid solver in computing a three dimensional potential field.

Table 3.1 - The Spec95 benchmarks.

In the initialization phase the data is read from large input files, while the computation phase performs the actual computation. In the experimental results that follow we refer to each of these phases separately. The initialization phase was traced until it completed (the instruction count is in the order of hundreds of millions of instructions) and the computation phase was traced for 100 million instructions.
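Conceptually, the per-benchmark accuracy figures reported below are produced by a loop of the following shape (our sketch: the trace record layout is hypothetical, and str_predict/str_update are the stride-predictor functions sketched in subsection 2.3):

    #include <stdint.h>

    /* from the predictor sketch in subsection 2.3 */
    int  str_predict(uint32_t pc, uint32_t *pred);
    void str_update(uint32_t pc, uint32_t value);

    struct trace_rec { uint32_t pc; uint32_t value; };  /* one retired value-generating instruction */

    /* value prediction accuracy (definition 3): successful predictions out
       of all prediction attempts */
    double measure_accuracy(const struct trace_rec *trace, long n) {
        long hits = 0;
        uint32_t pred;
        for (long i = 0; i < n; i++) {
            if (str_predict(trace[i].pc, &pred) && pred == trace[i].value)
                hits++;
            str_update(trace[i].pc, trace[i].value);  /* update immediately after the prediction */
        }
        return 100.0 * (double)hits / (double)n;      /* percentage, as in the tables */
    }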
3.2. Value prediction accuracy

The potential of value prediction may depend to a significant degree upon the prediction accuracy of the value predictor. There are two fundamental factors which determine the value prediction accuracy: (1) the value predictor scheme itself and (2) the potential for value predictability that resides within the program code. The first component is related to the structure of the predictor and its capabilities, which eventually determine the prediction formula. The second component reflects inherent properties of the program and its data, and also depends on the capabilities of the value predictor to reveal and exploit these properties. Initially, our experiments provide substantial evidence confirming that programs exhibit value predictability. In addition, they focus on the relations between these two factors by examining how efficiently various predictors exploit different value predictability patterns of computer programs (such as last-value predictability and stride value predictability). Note that throughout these experiments our approach is to focus on the related phenomena from as general a perspective as possible, rather than arguing about the hardware implementation and considerations of the predictors. We are convinced that such an approach allows us to better integrate the concept of value prediction into superscalar processors, since we first accumulate substantial knowledge about the related phenomena before making decisions about the hardware considerations.

Our experiments evaluate three value predictors: the last-value predictor, the stride predictor and the register-file predictor, all described in the previous section. Due to our abstract perspective, the prediction table size of both the last-value predictor and the stride predictor is considered to be unlimited in the experiments. The programs that our simulations examine include both integer and floating-point Spec95 benchmarks. In the integer benchmarks the prediction accuracy is measured separately for two sets of instructions: load instructions and ALU (arithmetic-logic) instructions. In the floating-point benchmarks, the prediction accuracy is measured for two additional sets: floating-point load instructions and floating-point computation instructions.

The first set of measurements consists of the value prediction accuracy of each of the value predictors for integer load instructions in the integer benchmarks, as illustrated by table 3.2. This table illustrates remarkable results - nearly 50% (on average) of the values that are generated by load instructions can be correctly predicted by two of the proposed predictors, the last-value predictor and the stride predictor. It can be observed that the value prediction accuracy of the stride predictor and the last-value predictor is quite similar in all the benchmarks, indicating that these integer load instructions barely exhibit stride value predictability. This implies that for this type of instruction in integer programs, the cost of an extra stride field in the prediction table of the stride predictor is not attractive. Moreover, the prediction accuracy of the integer loads does not spread uniformly among the integer benchmarks.

             Prediction accuracy of integer loads [%]   Prediction accuracy of ALU instructions [%]
benchmark    Stride   Last-value   Register file       Stride   Last-value   Register file
go           29.00    36.23         3.08               62.13    61.28         6.84
m88ksim      75.95    75.93        11.60               95.86    75.88        32.04
gcc1         47.24    52.06         7.48               60.10    55.99        13.80
gcc2         46.36    51.30         6.55               61.06    54.19        16.90
compress      9.84    11.87         0.66               35.94    39.49         4.21
li           48.92    48.58         5.95               63.11    55.19        14.96
ijpeg        31.82    36.37        12.54               35.25    25.70        19.90
perl1        58.54    62.67        10.38               57.23    57.34         8.60
perl2        57.52    53.29         5.26               57.39    50.18        13.29
vortex       71.41    73.94         8.02               83.17    52.47        38.96
average      47.66    50.22         7.15               61.12    52.77        16.95

Table 3.2 - Value prediction accuracy of integer load and ALU instructions in Spec-Int95.

It is apparent that in some benchmarks the loaded values are relatively highly predictable, like the benchmarks m88ksim and vortex, where the prediction accuracy of both the last-value and stride predictors is relatively high (more than 70%), while the prediction accuracy of other benchmarks, like the compress benchmark, is relatively low (about 10%). In all the benchmarks, the register-file predictor yields a relatively poor prediction accuracy (less than 10% on average), since in this case it can hardly exploit any kind of value predictability.

Table 3.2 presents additional impressive results about the prediction accuracy which the value predictors gain for ALU instructions in the integer benchmarks. These experiments provide additional encouraging evidence about our capability to predict outcome values. They indicate that a very significant portion of the values generated by ALU instructions are likely to be predicted correctly by our value predictors. In the average case, the stride predictor gains a prediction accuracy of 61%, compared to the last-value predictor which gains only 52%. In several benchmarks, like go and perl, the last-value predictor and the stride predictor gain similar value prediction accuracy. Beyond the last-value predictability that these programs exhibit, they do not exhibit stride value predictability, and therefore both predictors gain similar value prediction accuracy. In these cases, it is expected that most of the correct value predictions of the stride predictor are accomplished by stride values that are actually zero (this expectation is verified in later subsections). In some benchmarks, like m88ksim and vortex, although the load instructions exhibit only last-value predictability, their ALU instructions exhibit a significant amount of both last-value predictability and stride value predictability. This observation is expressed in the significant gap between the value prediction accuracy of the stride predictor and the last-value predictor. Hence, in those benchmarks which also exhibit stride value predictability, it is expected that the contribution of non-zero strides to the correct value predictions in the stride predictor will be more significant compared to the previous benchmarks (this expectation is verified as well). As in the previous case, the register-file predictor yields relatively poor prediction accuracy compared to the other predictors. The range of its prediction accuracy varies from 4.2% in the benchmark compress to 38.96% in vortex, yielding an average value prediction accuracy of nearly 17%.

An additional preliminary observation is that different input files do not dramatically affect the prediction accuracy of programs, as illustrated for the benchmarks gcc (gcc1 and gcc2) and perl (perl1 and perl2). This property has tremendous significance when considering the involvement of the compiler in order to support value prediction. The compiler can be assisted by program profiling in order to detect instructions in the program which tend to be value predictable. This observation may indicate that the information collected by the profiler can be significantly correlated to the true situation where the application runs its real input files. An extensive study of the use of program profiling to support value prediction is presented in [17].

The next set of experiments presents the value prediction accuracy in the floating point benchmarks. The prediction accuracy is measured in each benchmark for two execution phases: initialization (denoted by #1) and computation (denoted by #2). The prediction accuracy measured for integer instructions (load and ALU) is summarized in table 3.3, and for the floating point instructions in table 3.4.

It can be observed (table 3.3) that the behavior of integer load instructions in the floating point benchmarks is different from their behavior in the integer benchmarks (table 3.2). Table 3.3 reveals that, unlike the corresponding case in the integer benchmarks, where loads exhibited last-value predictability, in this case these instructions also exhibit stride value predictability. These stride patterns are exploited by the stride predictor, which achieves an average accuracy of nearly 70% in the initialization phase and 63% in the computation phase, in comparison to the last-value predictor, which achieves an average accuracy of nearly 66% in the first phase and only 37% in the second. The causes for the significant prediction accuracy gap between these predictors in the computation phase are presented later in this paper. In addition, we also notice that, as in the previous cases, the register-file predictor achieves a relatively poor value prediction accuracy of only 2-4%.

When the prediction accuracy is measured for ALU instructions in the floating point benchmarks, it reveals several more interesting results, as illustrated by table 3.3. In the initialization phase, the three predictors do not exhibit exceptional behavior in comparison to the integer benchmarks. However, in the computation phase of all the floating point benchmarks, the gap between the prediction accuracy of the last-value predictor and the prediction accuracy of the stride predictor becomes very significant. In the computation phase most of the ALU instructions exhibit stride patterns rather than repeating their recently generated values, and therefore the stride predictor can take advantage of this pattern of value predictability. The stride predictor gains in the computation phase an average prediction accuracy of 95%, while the last-value predictor gains only 23%. We discuss in detail the reasons for this observation in a later subsection. In addition, unlike previous cases where the register-file predictor gained relatively poor value prediction accuracy, in this case it gains a prediction accuracy of nearly 65%, which even outperforms the last-value predictor.

Table 3.4 exhibits the prediction accuracy for two additional sets of floating point instructions: floating point loads and floating point computation instructions. It illustrates that our three value predictors can hardly gain significant prediction accuracy in these instructions, since floating point values show a relatively poor tendency toward last-value predictability as well as stride value predictability. In floating-point loads the average value prediction accuracy of the last-value predictor and the stride predictor is more than 40%, and in the floating point computation instructions they achieve less than 30% average prediction accuracy. One of the reasons that may explain why it is harder to predict floating point values with these predictors is the representation of these values. Floating point values are represented by three value fields: sign, exponent and fraction (mantissa). It is hard to expect these value predictors, which by their nature tend to fit prediction of integer values, to successfully perform prediction of floating point values. In addition, floating point computations are considerably more complex than integer computations, making them hard to predict. This can also explain why floating point loads exhibit more value predictability in comparison to the floating point computations, since one can expect to find more patterns of regularity and re-use of values in floating point loads rather than in floating point computations.
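The representation argument can be made concrete. The short C program below (our illustration, not part of the paper's experiments) splits IEEE-754 doubles into the three fields just mentioned; even a sequence of values that steps by a constant arithmetic delta exhibits no fixed pattern in its sign/exponent/fraction fields, so neither prediction formula applies to the raw register contents:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* print the sign, exponent and fraction fields of a double */
    static void fields(double d) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);   /* reinterpret the bit pattern safely */
        printf("%4.1f  sign=%llu  exp=%4llu  frac=%013llx\n", d,
               (unsigned long long)(bits >> 63),
               (unsigned long long)((bits >> 52) & 0x7ff),
               (unsigned long long)(bits & 0x000fffffffffffffULL));
    }

    int main(void) {
        /* values stepping by a constant arithmetic delta of 0.5: the
           exponent and fraction fields change irregularly, so a stride
           computed on the raw register contents does not stay fixed */
        for (double d = 0.5; d < 3.0; d += 0.5)
            fields(d);
        return 0;
    }

Between 0.5 and 2.5 the exponent field advances irregularly while the fraction field oscillates, which is exactly the behavior that defeats both the last-value and the stride prediction formulas.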
3.3. Distribution of value prediction accuracy

The value prediction accuracy that was measured in the previous subsection is an average number that is important in order to evaluate the performance of the value predictors. However, this average number does not provide information about the distribution of the prediction accuracy among the instructions in the program. The following measurements attempt to provide a deeper study of the statistical characteristics of value prediction by examining the distribution of the value prediction accuracy, and also discuss how this knowledge can be exploited.

Figure 3.1 illustrates the distribution of the value prediction accuracy of the stride predictor among the instructions in the program (referring only to the value-generating instructions). It indicates that the prediction accuracy does not spread uniformly among the instructions in the program. More than 40% of the instructions are very likely to be correctly predicted, with a prediction accuracy greater than 70%. In addition, approximately the same number of instructions are very unlikely to be correctly predicted. These instructions exhibit less than 40% prediction accuracy.

These results motivate us to develop mechanisms that would allow us to distinguish between the predictable and unpredictable instructions and avoid the unpredictable ones. Such classification contributes to each of the following aspects:
1. The classification can significantly increase the effective value prediction accuracy of the predictor by eliminating the value prediction of the unlikely predictable instructions. A preliminary study of the effect of such classification on overall performance is discussed in a later section of this paper.
2. The replacement mechanism of the prediction table can exploit this classification and prioritize entry replacement for greater efficiency. Eventually, this has the potential to significantly increase the effective utilization of the prediction table.
3. In certain microprocessor architectures mispredicted values may cause a misprediction penalty due to their pipeline organization. By classifying the instructions, the processor may refrain from predicting values from the class of unpredictable instructions and avoid the misprediction penalty.

             Prediction accuracy of integer loads [%]   Prediction accuracy of ALU instructions [%]
benchmark    Stride   Last-value   Register file       Stride   Last-value   Register file
tomcatv#1    59.82    53.61         4.95               50.87    46.97        13.88
tomcatv#2    99.22    61.91         7.57               99.54    41.83        19.58
swim#1       80.98    81.56         0.02               88.95    79.43        11.09
swim#2       14.69    19.48         0.00               99.57     0.07        99.83
su2cor#1     71.31    65.78         5.76               60.09    57.68        14.49
su2cor#2     47.50    34.65         0.02               91.41    44.72        41.22
hydro2d#1    69.51    61.18         7.03               58.42    49.93        15.94
hydro2d#2    -        -             -                  99.23    15.51        87.67
mgrid        91.88    30.47         1.10               88.20    14.21        69.71
average #1   70.40    65.53         4.44               51.66    46.80        11.08
average #2   63.32    36.62         2.17               95.59    23.26        63.60

Table 3.3 - Value prediction accuracy of integer load and ALU instructions in Spec-FP95.

             Prediction accuracy of FP loads [%]        Prediction accuracy of FP computation inst. [%]
benchmark    Stride   Last-value   Register file       Stride   Last-value   Register file
tomcatv#2    22.95     6.32         0.16               21.88    15.08         2.13
swim#1       86.21    82.78         0.00               23.15    19.88         1.79
swim#2       18.03    26.09         1.57               15.54    21.38         0.16
su2cor#2     38.87    39.44        21.22               16.36    16.63         9.99
hydro2d#2    88.72    89.63        46.56               89.68    89.89        42.79
mgrid        18.81    18.33         4.71                7.11     6.87         4.04
average      45.59    43.76        12.37               28.95    28.28        10.15

Table 3.4 - Value prediction accuracy of FP load and computation instructions in Spec-FP95.

Figure 3.1 - The distribution of instructions according to their value prediction accuracy.

Two methods suggest themselves for classification. The first method assigns an individual saturated counter (figure 3.2) to each entry in the prediction table. At each occurrence of a successful prediction the counter is incremented, and at each misprediction it is decremented. By inspecting the saturated counter, the processor can decide whether to consider the suggested prediction or to avoid it. Such a method for value prediction classification was introduced by Lipasti et al. in [25]. The counter has four states: 0 - strongly not predicted, 1 - weakly not predicted, 2 - weakly predicted and 3 - strongly predicted; a correct prediction (P.C.) moves the automaton toward state 3, and an incorrect prediction (N.P.C.) moves it toward state 0.

Figure 3.2 - A 2-bit saturated counter for value prediction classification.
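In software form, the automaton of figure 3.2 amounts to a few lines of C (a sketch; the state encoding follows the figure, while attaching one counter to each prediction-table entry and the threshold for using a prediction are our assumptions):

    /* the four states of figure 3.2 */
    enum conf { STRONGLY_NOT_PREDICTED = 0, WEAKLY_NOT_PREDICTED = 1,
                WEAKLY_PREDICTED = 2, STRONGLY_PREDICTED = 3 };

    /* advance the saturated counter on the outcome of a prediction attempt */
    enum conf classify_update(enum conf c, int predicted_correctly) {
        if (predicted_correctly)                                          /* P.C. */
            return c == STRONGLY_PREDICTED ? STRONGLY_PREDICTED : c + 1;
        return c == STRONGLY_NOT_PREDICTED ? STRONGLY_NOT_PREDICTED : c - 1;  /* N.P.C. */
    }

    /* the processor considers the predicted value only in the two upper states */
    int use_prediction(enum conf c) { return c >= WEAKLY_PREDICTED; }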

Our measurements confirm the results of Lipasti et al., where it is shown that such a classification mechanism can work efficiently to eliminate value mispredictions. We observed that in the integer benchmarks almost 94% of the value mispredictions can be eliminated, and 92% of the correct predictions are correctly classified by this mechanism. In the floating point benchmarks the numbers are very similar: nearly 95% of the mispredictions are eliminated and almost 95% of the correct predictions are correctly classified. Beyond the efficiency of the saturated counters in classifying the value predictability of instructions, we have revealed another interesting phenomenon related to this mechanism in our study. We have measured the ratio of the number of state transitions that each of the automata made to the number of accesses to each of the corresponding entries in the value prediction table. These measurements provide significant information concerning the speed with which this classification method converges to the correct classification. They are summarized in figure 3.3, which indicates that most of the saturated counters lock onto the correct classification after a relatively very small number of transitions. This observation also strengthens our previous experiments about the distribution of the prediction accuracy that was illustrated in figure 3.1. For the two sets of instructions, the highly predictable and the unlikely predictable, the classification mechanism can be very confident about the outcome of the prediction (whether it succeeds or fails).

Figure 3.3 - The distribution of the number of transitions in the saturated counters among instructions in programs.

Another method that we consider for classification is the use of compiler profiling. The compiler can collect information about the value predictability of instructions according to previous runs of the program. Then it can place directives in the opcode of the instructions, providing hints to the processor for classifying the instructions. From a previous subsection we recall that the measurements of the expected correlation of value prediction accuracy between different runs of the program with different input files are encouraging. The use of program profiling for value prediction classification is presented in [17].

3.4. Non-zero strides and immediate operations

The stride predictor extends the capabilities of the last-value predictor, since it can exploit both last-value predictability and stride value predictability. In this subsection we examine how efficiently the stride predictor takes advantage of its additional stride fields, beyond the last-value predictor, to exploit stride value predictability. We consider the additional stride fields to work efficiently only when the predictor accomplishes correct value predictions that are based on non-zero strides. In table 3.5 we present the ratio of successful predictions that were based on non-zero strides to the overall number of correct predictions. In the integer benchmarks this ratio is more than 16%, and in the floating point benchmarks it varies from 12% in the initialization phase to 43% in the computation phase. The relatively high ratio of non-zero strides in the floating point computation phase is explained by the significant contribution of immediate integer add and subtract instructions to the successful predictions.

Ratio of successful non-zero stride-based predictions out of overall successful predictions.
Spec95 integer              Spec95 floating point
Benchmark      [%]          Benchmark      [%]
go              8.83        tomcatv#1      13.93
m88ksim        16.42        tomcatv#2      55.14
gcc1           12.79        swim#1          8.98
gcc2           15.44        swim#2         65.9
compress95      6.52        su2cor#1        9.35
li             15.22        su2cor#2       27.74
ijpeg          36.37        hydro2d#1      16.28
perl1           7.57        hydro2d#2      15.09
perl2          15.27        mgrid          51.4
vortex         30.34        average#1      12.14
average        16.48        average#2      43.06

Table 3.5 - The distribution of non-zero strides.

12 This table, however, may lead the reader to an Correct predictions of Add/Sub immediate instructions out of incorrect conclusion about the effectiveness of the stride [%] correct predictions. predictor in exploiting non-zero strides and its significance 100 90 to the expected ILP improvement. For instance, it shows 80 that 16.4% out of successful predictions in the benchmark 70 60 m88ksim are because of non-zero strides and 15.2% in the 50 benchmark li. Does this mean that the contribution of the 40 30 stride predictor and non-zero strides to these two 20 10 benchmarks is the same? Obviously not; one should be 0 aware of the fact that these results should be given the go m88ksim gcc1 gcc2 compress li ijpeg perl1 perl2 vortex avg appropriate weight, i.e. their relative number of appearances in the program’s execution. Moreover, the SPEC-INT95 connection between the prediction accuracy and the [%]Correct predictions of Add/Sub immediate instructions out 100 expected boost in ILP is not straightforward ([18]) since of correct predictions. 90 the distribution of its contribution may not be uniform. If 80 the importance of non-zero strides is crucial to the ILP of 70 the application, even an improvement of approximately 60 10% in the prediction accuracy can be very valuable. A 50 broader study of the contribution of strides to the ILP is 40 30 presented in Section 4. In addition, knowledge about the 20 characteristics of non-zero strides may motivate future 10 explorations of the potential for hybrid predictors, which 0 tomca tomca swim#swim# su2co su2co hydro hydro mgrid avg#1 avg#2 combine both the last-value predictor and the stride tv#1 tv#2 1 2 r#1 r#2 2d#1 2d#2 predictor. Here, the last-value predictor would be SPEC-FP95 dedicated for zero strides and the stride predictor would be Figure 3.4 - The contribution of Add/Sub immediate dedicated for non-zero strides. instructions to the successful predictions. We find that non-zero strides appear for various reasons, e.g. immediate add and subtract instructions, and [%] Correct predictions of addresses calculations out of computations of addresses of memory references that step 100 the overall correct predictions with a fixed stride on arrays in memory. Figure 3.4 90 80 illustrates the contribution of immediate add and subtract 70 instructions to the overall number of successful predictions 60 in the stride predictor. In the integer benchmarks it is 50 40 nearly 20% (on average) and in the floating point 30 benchmarks it varies from nearly 15% in the initialization 20 phase to more than 30% in the computation phase (on 10 average). The significant gap between the contributions in 0 go m88ksim gcc1 gcc2 compress li ijpeg perl1 perl2 vortex avg the initialization phase and in the computation phase of the floating point benchmarks can be explained by the fact that SPEC-INT95 Correct predictions of addresses calculations out of most memory accesses of floating point benchmarks [%] consist of stride patterns ([28]). When the fraction of the 100 the overall correct predictions successful value predictions which are used for address 90 80 calculations of memory pointers (for data items only) is 70 examined, it reveals that this number is considerably more 60 significant in the computation phase than the initialization 50 phase as illustrated by figure 3.5. The next subsection 40 30 confirms this observation as well. 
Figure 3.5 - Correct predictions of address calculations out of the overall correct predictions. [Per SPEC-INT95 and SPEC-FP95 benchmark.]
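To make the hybrid organization mentioned above concrete, here is a minimal sketch (our illustration, not a mechanism taken from the paper) of a prediction table whose entries carry both a last value and a stride: an entry whose computed stride is zero behaves exactly like a last-value predictor, while a non-zero stride yields stride-based predictions. All names are illustrative.

```python
class HybridValuePredictor:
    """Sketch of a combined last-value/stride predictor (illustrative only).

    Each entry keeps the last outcome value and the stride between the two
    most recent outcomes. A zero stride degenerates to last-value prediction;
    a non-zero stride yields stride-based prediction.
    """

    def __init__(self):
        self.table = {}  # instruction address -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None                     # no prediction for unseen instructions
        last_value, stride = self.table[pc]
        return last_value + stride          # stride == 0 -> last-value prediction

    def update(self, pc, outcome):
        if pc in self.table:
            last_value, _ = self.table[pc]
            self.table[pc] = (outcome, outcome - last_value)
        else:
            self.table[pc] = (outcome, 0)   # first occurrence: assume zero stride

# Usage: an address-like value stepping by 4 yields a stride prediction of 112.
p = HybridValuePredictor()
for outcome in (100, 104, 108):
    p.update(0x400, outcome)
print(p.predict(0x400))                     # -> 112
```

A real design would also include the classification counters (saturated counters) discussed in connection with Figure 3.3; they are omitted here for brevity.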

3.5. Characteristics of value predictability in load instructions

In this subsection we focus on the characteristics of value prediction with respect to load instructions. Load instructions may have a significant impact on program performance in two major aspects: 1. they may cause the processor to stall due to memory latencies, and 2. they may cause true-data dependencies (since they generate values that can be consumed by other instructions). Our main purpose is to examine how the concept of value prediction can affect and support these two aspects. Up to the present, researchers have addressed the first aspect by improving the performance of cache memories or by using prefetching mechanisms.

In order to explore the contribution of value prediction to the first aspect, we examine all the load memory references that are cache misses and their loaded memory values. We refer only to the cache misses, since these are the cases when memory system delays occur. First, we are interested in studying what portion of these loaded memory values are value-predictable and what the relations are between them and the potential to predict their memory address. The potential to predict a memory address (address predictability) is determined by the memory locality patterns of the program. This property was widely exploited by different data prefetching mechanisms ([7], [15], [19], [28]). Examining the value predictability patterns in the load cache misses and comparing them to the address predictability patterns can provide valuable answers to the following questions: 1. What is the effectiveness of value prediction in reducing the penalty of load cache misses by attempting to predict their values (as proposed by Lipasti et al. in [25])? 2. How successfully can value prediction compete with other techniques such as data prefetching?

The structure of the competitive address predictor scheme that was used in our experiments is similar to our value predictor schemes (it is sketched in code below). It is organized as a table (for instance a cache table) with an unlimited number of entries, where each entry is uniquely assigned to a previously-seen load instruction. Each entry contains three fields: Tag - identifies the individual load instruction according to its instruction address; Last address - the address of the last cache block that was fetched by the load instruction; and Address stride - determined according to the difference between two consecutive memory accesses of the individual load. The predicted address which the prefetching scheme fetches is determined according to the last address field plus the stride field. The value predictor chosen to compete with the address predictor is the last-value predictor, since it has gained the best prediction accuracy for load instructions. The data cache parameters that were chosen for the experiments are quite typical of common modern microprocessors (PowerPC™, Pentium™, PentiumPro™): 16 KB size, 4-way set associative and 32-byte line size.

Figure 3.6 illustrates the correlation between the address prediction accuracy and the value prediction accuracy out of the load misses. This figure exhibits four possible sets of load references: 1. load references where both the data values and the address values can be predicted correctly, 2. load references where only the address values can be predicted correctly, 3. load references where only the data values can be predicted correctly, and 4. load references where neither the data values nor the address values can be predicted correctly. It can be observed that out of the load misses in the integer benchmarks, the misses attributable to the third set (predictable data values only) are comparable to the misses falling into the second set (predictable address values only). Therefore, in these benchmarks, value prediction can contribute significantly by handling a substantial set of load references which cannot be handled by data prefetching. However, it can be seen that in the floating point benchmarks, most of the load cache misses which exhibit correctly predicted data values also exhibit correctly predicted address values, i.e., the portion of the third set is negligible. Figure 3.7 is similar to figure 3.6 - it illustrates the correlation between the address prediction accuracy and the value prediction accuracy when the bars of each benchmark are given the appropriate weight according to the load miss rate.

Figure 3.6 - Address regularity versus value regularity (with the last-value predictor) out of the overall load misses. [Bars per benchmark: Prd. Add and Prd. Data / Prd. Add only / Prd. Data only / None Prd.; SPEC-INT95 and SPEC-FP95.]
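In code form, the competitive address predictor described above might be sketched as follows (our reading of the Tag / Last address / Address stride organization; the update-policy details are illustrative, not the paper's exact hardware):

```python
class StrideAddressPredictor:
    """Sketch of the per-load address predictor used as the prefetching
    competitor: each entry records the last fetched block address and the
    stride between the two most recent accesses (field names follow the text).
    """

    def __init__(self, block_size=32):    # 32-byte lines, as in the experiments
        self.block_size = block_size
        self.table = {}                   # Tag (load PC) -> [last_address, stride]

    def predict_next_address(self, load_pc):
        entry = self.table.get(load_pc)
        if entry is None:
            return None
        last_address, stride = entry
        return last_address + stride      # block the prefetcher would fetch

    def update(self, load_pc, address):
        block = address - (address % self.block_size)   # align to cache block
        entry = self.table.get(load_pc)
        if entry is None:
            self.table[load_pc] = [block, 0]
        else:
            entry[0], entry[1] = block, block - entry[0]
```

The prefetch candidate is simply "last address plus stride", mirroring the description above.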

Figure 3.7 - Address regularity versus value regularity weighted with the loads miss rate. [Same four categories as Figure 3.6; bar heights scaled by each benchmark's load miss rate, SPEC-INT95 and SPEC-FP95.]

In order to address the second aspect, we measure the ILP that can be exploited when using value prediction only for load instructions. Our experiments assume a perfect memory system, i.e., memory references never stall the processor. We choose to eliminate the contribution of memory system latency in this set of experiments since the system can seek to satisfy this assumption by using various data prefetching techniques such as [7], [15], [19] and [28]. Note that this assumption even understates the potential benefits of value prediction, since long-latency instructions can take even better advantage of value prediction ([25], [26]). Moreover, we prefer focusing on the pure potential of value prediction to collapse true-data dependencies associated with load instructions rather than dealing with the impact of memory system latency, since such an issue is system dependent. In addition, in order to avoid discussing individual implementation issues, an abstract machine is considered with an unlimited number of execution units and physical registers, but with a restricted instruction window size. Each instruction is assumed to take only a single cycle. In addition, it is assumed that all branches are correctly predicted in advance. The ILP that such a machine can exploit is limited by the dataflow graph of the program and by the size of the instruction window.

Figure 3.8 shows the ILP that can be gained by employing the last-value predictor in comparison to a machine that does not employ value prediction. It indicates that in some benchmarks, like m88ksim, li and perl, the contribution of load value prediction is significant, while in some other benchmarks, like compress, vortex and mgrid, it is barely noticeable. These variations are highly dependent on the value predictability patterns that these programs exhibit and their contribution, in particular the value predictability patterns of the load instructions.

Figure 3.8 - The ILP gained by loads value prediction. [Instruction window = 40, loads only; bars: no val-pred. versus last-val.]

4. Analytical and experimental analysis of the ILP increase

The ILP that a processor can exploit in the case of a serial program is an indicator of the amount of resources that it can efficiently utilize simultaneously. In this section we quantitatively study the ability of value prediction to boost the ILP from two viewpoints: analytical and experimental.

4.1. An analytical model

So far we have discussed the characteristics and potential of using value prediction. Before presenting actual measurements, which depend on the benchmarks we use, we present a simple analytical model that can be employed both for estimating the potential of value prediction to increase the ILP and for further understanding the related phenomena.

The dataflow graph (DFG) presentation of a program is given by a directed graph G(V,S). Each node, v∈V, represents a single operation, or a set of operations which are executed as a single atom. Each arc, s∈S, represents a true-data dependency between two nodes. Given a DFG representation, we define the critical path to be the longest (in terms of execution time) path, C, that connects the entry point to the termination point. The critical path, C, forms a linear graph. For simplicity, we assume that each node (operation) of the critical path C takes T cycles to execute. Since the execution of the operations in C cannot overlap, the total execution time of C is n×T cycles (where n is the number of nodes in C). If we assume that our machine has an unlimited amount of resources, the execution time of the critical path determines the execution time of the entire program, since the execution of the other paths in the DFG can be overlapped with the execution of the critical path C with no resource conflicts.
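As a small illustration of this bound (ours, not from the paper), the critical path of a DFG can be found by a longest-path traversal of the dependency DAG; with single-cycle operations, the dataflow-limited ILP is simply the number of operations divided by the critical path length:

```python
from functools import lru_cache

def critical_path_length(deps):
    """deps maps each node to the nodes it true-depends on (a DAG).
    Returns the number of nodes on the longest dependency chain; with a
    latency of T cycles per node, the dataflow-bound execution time is
    critical_path_length(deps) * T."""
    @lru_cache(maxsize=None)
    def depth(node):
        preds = deps.get(node, ())
        return 1 + max((depth(p) for p in preds), default=0)
    return max(depth(n) for n in deps)

# Tiny example: 6 operations, longest chain 1 -> 2 -> 4 -> 6.
deps = {1: (), 2: (1,), 3: (1,), 4: (2, 3), 5: (3,), 6: (4,)}
n, c = len(deps), critical_path_length(deps)      # c == 4
print(f"dataflow-limited ILP = {n / c:.2f}")      # 6 ops / 4 levels = 1.5
```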

Figure 4.2 - The SDFG, GS(V,S), and the critical path graph C.

The SDFG extends the DFG of the critical path with the possible speculative execution sequences. Nodes which are executed based on a speculative input value are denoted with the subscript letter 's'. Each arc, s∈SE, denotes a possible execution sequence. For each arc we assign two numbers: 1. Pri - the probability of correctly or incorrectly predicting the outcome value of the operation in node i, and 2. ti - the cost in terms of clock cycles of the transition in the case of a correct or an incorrect value prediction. Notice that the probabilities Pri for arcs that come from speculatively executed instructions can be extracted from the SDFG (figure 4.2). When instructions are re-executed due to value misprediction, Pri is considered to equal 1. In addition, if ti=0, it implies that the two operations linked by the arc can be executed in parallel. For instance, one of the possible execution sequences is the one in which all the instructions are predicted correctly; the probability of such an event is $p^{n-1}$ and the entire execution time is T, since all the instructions are executed simultaneously.

From the SDFG we can also extract the critical path C (which forms a linear graph) and evaluate the execution time when instructions are executed speculatively based on their predicted values. In our analysis we focus only on the execution time of the critical path, since we assume that it dictates the execution time of the entire program and that the execution of all the other paths can overlap. We evaluate the critical path execution time in two cases: when the instruction window size is unlimited and when the window size is finite. In both cases our model assumptions are:
1. We consider an abstract machine with an unlimited number of execution units and physical registers.
2. For simplicity, we consider the probabilities pi for 1≤i<n to be equal, i.e., pi = p for all i.

Figure 4.3 - The execution graph of critical path C. [Start feeds node 1; arcs are labeled (Pri, ti): (p, 0) into each speculative node 2s…ns, (1−p, T) into each re-executed node 2…n, and (1, 0) arcs toward END.]

4.1.1. Unlimited instruction window size model

In general, the probability of executing a certain path, $\sigma = (s_1, s_2, \ldots, s_n) \in G_E(V,S)$, is given by $P_\sigma = \prod_{i=1}^{n-1} \Pr_{s_i}$, and the entire execution time of $\sigma$ can be obtained by $T_\sigma = T + T_P$, where $T_P = \sum_{i=1}^{n-1} t_{s_i}$. It can be noticed that the element $T_P$ has a binomial distribution, as illustrated by equation 4.1:

$$\mathrm{Prob}(T_P = k \cdot T) = (1-p)^k \cdot p^{\,n-1-k} \cdot \binom{n-1}{k}, \qquad 0 \le k \le n-1$$

Equation 4.1 - The binomial distribution of $T_P$.

One can also observe that "placing" the value mispredictions in the linear graph C is equivalent to "choosing" k out of n−1, where the probability to "choose" is 1−p (value misprediction). As illustrated by equation 4.2, we can calculate the average of $T_\sigma$, i.e., the average execution time of the critical path C:

$$E(T_\sigma) = T + T \sum_{i=0}^{n-1} i \cdot (1-p)^i \cdot p^{\,n-1-i} \cdot \binom{n-1}{i} = T + T(n-1)(1-p)$$

Equation 4.2 - The average execution time of the critical path C.

The average boost in the ILP (or in the execution time) of C is given by equation 4.3:

$$\mathrm{ILP\ boost}_{avg} = \frac{n \cdot T}{E(T_\sigma)} = \frac{n}{1 + (n-1)(1-p)} \cong \frac{1}{1-p}$$

Equation 4.3 - The average boost in the ILP of the critical path C.

4.1.2. Finite instruction window size model

We can improve our previous model and consider the effect of a finite instruction window size when we use value prediction. For simplicity, we only consider the execution of the critical path C and ignore the effect of the execution of the other paths. In addition, we assume that instructions which are executed correctly evacuate the instruction window and allow other instructions to enter the window as potential candidates for parallel execution. When value prediction is not used, the critical path will be executed in n⋅T cycles, since the execution time of the critical path is bounded by true-data dependencies and not by the window size. In steady state, the number of instructions that evacuate the instruction window is equal to the number of new instructions that enter the window. When value prediction is used, the number of instructions that evacuate the instruction window is equal to the number of instructions that were predicted correctly until the first instruction in the window fails (is incorrectly predicted). Let w be the size of the instruction window (w>1) and let L be the random variable that denotes the number of instructions which evacuate the instruction window at each step. The distribution of L is given by equation 4.4:

$$\mathrm{Prob}(L = k) = \begin{cases} (1-p) \cdot p^{\,k-1} & 1 \le k \le w-1 \\ p^{\,w-1} & k = w \end{cases}$$

Equation 4.4

The average of L is given by equation 4.5 (the closed form follows from summing the geometric series):

$$E(L) = \sum_{k=1}^{w-1} k(1-p)p^{\,k-1} + w \cdot p^{\,w-1} = \frac{1 - p^{\,w}}{1-p}$$

Equation 4.5

Therefore, in this model the average ILP boost of value prediction is:

$$\mathrm{average\ boost\ in\ ILP} = E(L) = \frac{1}{1-p} \cdot \left(1 - p^{\,w}\right)$$

Equation 4.6
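As a numerical sanity check (our sketch, not part of the paper), the following evaluates equations 4.3, 4.5 and 4.6 and verifies the closed form of E(L) against a direct summation over the distribution of equation 4.4:

```python
def boost_infinite_window(p, n):
    """Equation 4.3: average ILP boost of a critical path of n nodes with
    per-node prediction accuracy p; tends to 1/(1 - p) for large n."""
    return n / (1 + (n - 1) * (1 - p))

def expected_evacuations(p, w):
    """Equations 4.5/4.6 in closed form: E(L) = (1 - p**w) / (1 - p)."""
    return (1 - p ** w) / (1 - p)

def expected_evacuations_direct(p, w):
    """E(L) summed directly over the distribution of equation 4.4."""
    tail = w * p ** (w - 1)                       # Prob(L = w) = p**(w-1)
    body = sum(k * (1 - p) * p ** (k - 1) for k in range(1, w))
    return body + tail

p, w = 0.9, 40
print(boost_infinite_window(p, n=10_000))   # ~9.99, i.e. ~1/(1-p)
print(expected_evacuations(p, w))           # ~9.85: finite-window boost
print(expected_evacuations_direct(p, w))    # matches the closed form
```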

Note that this result consists of two elements: the first element, 1/(1−p), represents the boost in the case of an infinite instruction window, and the second element takes into account the effect of the window size.

4.2. Experimental framework

Our experiments consider an abstract machine with a finite instruction window. We have chosen this experimental model since the concept of value prediction is entirely new, and a discussion of particular implementation issues inherent to different processor architectures is beyond the scope of this paper. The ILP which such a machine can gain (when it does not employ value prediction) is dictated by the dataflow graph of the program and its instruction window size. We have previously indicated that in order to reach the dataflow graph boundaries, a machine should employ: 1. an unlimited number of resources (execution units etc.), 2. an unlimited number of registers, 3. perfect (either static or dynamic) branch prediction mechanisms, and 4. sufficient instruction fetch bandwidth. In addition, we also assume, for simplicity, that each instruction can be executed in a single cycle. This abstract machine model is very useful for preliminary studies of value prediction, since it provides us with a means to examine the pure potential of this phenomenon without being affected by the limitations of individual machines. As part of our abstract perspective, we also assume that there is no extra penalty when values are not predicted correctly, since both Lipasti et al. ([26]) and our previous works ([16], [17]) have shown that most of the value mispredictions can be eliminated by employing a classification mechanism. In the following experiments we measure the effect of various value prediction policies and prediction schemes on the ILP under two different instruction window sizes. In the next subsection these measurements are compared against the analytical model.

In the previous section we broadly studied the value prediction accuracy that various value predictor schemes can gain. It is important to indicate that the connection between the value prediction accuracy gained by these predictors and the expected boost in the ILP may not be straightforward. It is not sufficient that these schemes can correctly predict outcome values; these predictable values should also be in the "right places", where their contribution to the ILP would be significant (such as critical paths). In this subsection we present a set of measurements indicating that value prediction can make a substantial contribution to the exploited ILP.

The gain in ILP available with value prediction is examined for two different value predictors, the last-value predictor and the stride predictor. Each of these predictors can operate in two modes: the first mode, termed the scalar generation mode, allows the generation of only a single value prediction for an individual copy of an instruction that resides in the instruction window, while the second mode, termed the eager generation mode, allows the predictor to generate multiple value predictions assigned to multiple copies of an individual instruction (if any) in the instruction window (e.g., in the case of a loop); a brief sketch of the two modes appears below. The hardware implementation and considerations of the eager generation mode are beyond the scope of this paper and are presented in [18].

Figure 4.4 illustrates the ILP achieved using value prediction when the instruction window size is 40. It also compares the ILP achieved by the different predictor schemes (last-value predictor and stride predictor) and prediction modes (scalar mode and eager mode) against the ILP when value prediction is not employed. Indeed, this figure indicates that the potential of value prediction to exceed the current ILP limitations is tremendous; e.g., in the benchmark gcc the ILP is increased from 14 to nearly 22, in m88ksim from 7 to 34, in perl from 15 to nearly 25 and in vortex from nearly 10 to 33. Figure 4.4 also illustrates that the stride predictor accomplishes significantly better performance than the last-value predictor in those benchmarks which exhibited stride value predictability (like m88ksim and vortex) in our previous experiments. For instance, in the benchmark m88ksim the stride predictor boosts the ILP from approximately 7 to 34, while the last-value predictor only achieves an ILP of 13. In the rest of the benchmarks (like go and li) both predictors gain similar ILP, with a relatively smaller advantage to the stride predictor.

Another interesting observation shown by these experiments is that the eager mode barely improves the ILP that the last-value predictor achieves in all the benchmarks. However, the eager mode significantly improves the ILP that the stride predictor gains in those benchmarks exhibiting stride value predictability. For instance, in the benchmark m88ksim the stride predictor operating in eager mode gains an ILP of 34, while the same predictor in scalar mode gains only an ILP of 20. This phenomenon seems reasonable: instructions whose output values tend to appear in strides are likely to appear recurrently in the instruction window, like instructions in loops, and in order to better exploit them the predictor should be allowed to operate in eager mode. In the floating point benchmarks, swim and mgrid, all the value predictors achieve similar ILP.
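The difference between the two generation modes can be sketched as follows (an illustrative model of the prediction-generation step only, not the hardware mechanism presented in [18]): in scalar mode only the first in-flight copy of an instruction receives a prediction, while in eager mode the k-th copy receives the last value advanced by k strides.

```python
def generate_predictions(window, table, eager=True):
    """window: list of instruction addresses currently in the instruction
    window (a loop body may contribute several copies of the same PC).
    table: PC -> (last_value, stride). Returns window slot -> predicted value."""
    predictions = {}
    seen = {}                                   # copies of each PC seen so far
    for slot, pc in enumerate(window):
        if pc not in table:
            continue
        last_value, stride = table[pc]
        copy = seen.get(pc, 0)
        if copy == 0 or eager:
            # eager mode: the k-th copy gets last_value + (k+1)*stride;
            # scalar mode predicts only the first copy (k == 0)
            predictions[slot] = last_value + (copy + 1) * stride
        seen[pc] = copy + 1
    return predictions

table = {0x40: (100, 4)}                        # e.g. an induction variable
window = [0x40, 0x44, 0x40, 0x44, 0x40]         # three unrolled iterations
print(generate_predictions(window, table, eager=False))  # {0: 104}
print(generate_predictions(window, table, eager=True))   # {0: 104, 2: 108, 4: 112}
```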

Since the computation phase of the floating point benchmarks is less constrained by true-data dependencies, our measurements exhibit more ILP in comparison to the initialization phase, which tends to behave like an integer program. It can also be observed that although the stride predictor exhibited significantly better prediction accuracy than the last-value predictor in ALU instructions, the overall ILP increase of both predictors is relatively limited, since:
1. Usually the size of basic blocks and loop bodies in floating point programs is relatively big. As a result, the instruction window becomes too small to hold multiple iterations of a loop or even several basic blocks. Therefore, many of the loop-carried dependencies barely affect the ILP.
2. Both the last-value predictor and the stride predictor gained relatively small prediction accuracy on floating-point instructions, which may also affect their achievable performance.

Figure 4.4 - The ILP gained by value prediction when the instruction window size is 40. [Bars per benchmark: no val-pred / stride-Eager / stride-scalar / last-val-Eager / last-val-scalar.]

Enlarging the instruction window size can enable current processors to look further ahead to find independent candidate instructions for parallel execution. In order to examine how this enlargement affects a machine that employs value prediction, we performed further experiments, illustrated in figure 4.5. This figure exhibits the same measurements as figure 4.4, however this time with an instruction window size of 200. These measurements show that as the instruction window size is increased, the extracted ILP grows as well. However, the most interesting observation these experiments present is that the enlargement of the instruction window particularly affects the performance of the eager generation mode and the stride predictor. A bigger instruction window significantly increases the likelihood that it maintains repeated copies of the same basic block or the same instruction simultaneously, such as multiple iterations of a loop. Such patterns can be usefully exploited by the eager generation mode. This mode allows the predictor to generate multiple value predictions for multiple copies of the same instruction, and hence it can better utilize the deeper look-ahead provided by the enlargement of the instruction window. The stride predictor can also take advantage of these patterns, since appearances of recurrent instructions in the instruction window are also likely to generate output values that progress in strides.

Figure 4.5 - The ILP gained by value prediction when the instruction window size is 200. [Same schemes as Figure 4.4.]

In addition, it can be observed that even benchmarks which did not exhibit significant stride value predictability in the previous experiments, like gcc, li and perl, are significantly affected when they employ both the stride predictor and the eager generation mode. For instance, in the benchmark gcc a stride predictor that operates in eager mode increases the ILP from 36 to 82, while the same predictor in scalar mode gains an ILP of only 60. In addition, the gap between the stride predictor and the last-value predictor, which gains an ILP of approximately 50, becomes more noticeable. In the benchmark li the effect of the eager mode is noticeable even on the last-value predictor: the last-value predictor in eager mode increases the ILP of this benchmark from 28 to 55, while the same predictor operating in scalar mode gains only 42. However, the best ILP among all the predictors in this benchmark is gained by the stride predictor (operating in eager generation mode), which boosts the ILP to 82. In the benchmarks that exhibited stride predictability (m88ksim and vortex) the effect of the instruction window enlargement is the most observable. In m88ksim the stride predictor in eager mode increases the ILP from 7.4 to 144, while the same predictor in scalar mode gains only 31. In vortex similar patterns are observed: the ILP is increased from 13.5 to 142 by the stride predictor operating in eager mode, while the same predictor in scalar mode gains an ILP of nearly 26. In addition, the enlargement of the instruction window affects the extractable ILP in the floating point benchmarks as well. The gap between the stride predictor in eager mode and the other schemes becomes much more significant, since the instruction-window enlargement can better expose the loop-carried dependencies. The stride predictor increases the ILP of swim (in the computation phase) from 47 to 104, and in the benchmark mgrid it increases the ILP from 53 to 73.

These results indicate that the potential of value prediction to increase the ILP beyond the dataflow graph limitations is tremendous.

In addition, until now several studies, such as [21], have indicated that large instruction windows may not be cost-effective, since they do not offer a sufficient increase in the ILP to justify their hardware cost. When value prediction is employed this claim may no longer be true. In addition, we have seen that both the stride predictor and the eager generation mode may gain significantly better ILP, particularly when the size of the instruction window is increased. One of the directions that we consider ([18]) is to maintain a hybrid approach that consists of both predictor schemes (last-value predictor and stride predictor) and both value prediction generation modes (scalar mode and eager mode). This approach, motivated by our experiments, exploits the observation that on the one hand the absolute number of instructions exhibiting stride value predictability is relatively smaller than of those exhibiting last-value predictability, while on the other hand value prediction based on strides can significantly increase the ILP, particularly with big instruction windows. Hence, in order to take advantage of these observations, a machine could partition the limited resources assigned to the value prediction schemes more efficiently, e.g., by employing a small prediction table for the stride predictor and a bigger table for the last-value predictor, and by only allowing value predictions based on strides to be generated in eager mode. These issues and many other implementation considerations are left for possible future studies.

4.3. Comparison between the experimental and analytical results

In our comparison between the experimental measurements and the analytical results we consider the abstract machine model that we described in the previous subsections. The configuration used for the comparison is the stride predictor operating in the eager generation mode. The experimental increase in the ILP is obtained straightforwardly from the experimental measurements that were presented in the previous subsection. In order to calculate the analytical increase in the ILP predicted by our analytical model we use equations 4.3 and 4.6 (sketched below). Notice that these equations need to be supplied with the prediction accuracy, p, that the predictor gains in each benchmark. The prediction accuracy, p, can be obtained from the measurements presented in Subsection 3.1.

Figure 4.6 illustrates the boost in the ILP under three different instruction window sizes: 40, 200 and infinite. For the finite instruction windows it illustrates both the experimental ILP increase and the analytical ILP increase; for the infinite instruction window it illustrates only the analytical evaluation. It can be observed that for most of the benchmarks the analytical model provides a good estimation of the ILP increase that is very close to the experimental results. On the other hand, for some benchmarks the analytical model underestimated the potential of using value prediction. The reason for this observation is that in these programs there are different "hot spots" which have a non-uniform relative contribution to the ILP.
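The comparison itself is mechanical and can be sketched as follows (our sketch; the per-benchmark accuracies are those annotated on the x-axis of figure 4.6 below, to the extent they can be recovered from it):

```python
def analytical_boost(p, w=None, n=10_000):
    """ILP boost predicted by the model: equation 4.3 for an (effectively)
    infinite window (w is None, large n), equation 4.6 for a finite window."""
    if w is None:
        return n / (1 + (n - 1) * (1 - p))   # tends to 1/(1 - p)
    return (1 - p ** w) / (1 - p)

# Stride-predictor accuracies as annotated in figure 4.6 (recovered pairing).
accuracy = {"compress": 0.28, "mgrid": 0.28, "ijpeg": 0.35, "swim#2": 0.35,
            "go": 0.52, "swim#1": 0.53, "gcc": 0.55, "li": 0.58,
            "perl1": 0.6, "vortex": 0.78, "m88ksim": 0.92}
for bench, p in accuracy.items():
    print(f"{bench:10s} w=40: {analytical_boost(p, 40):5.2f}  "
          f"w=200: {analytical_boost(p, 200):5.2f}  "
          f"infinite: {analytical_boost(p):5.2f}")
```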

Figure 4.6 - Experimental versus analytical results of the expected boost in the ILP. [Bars per benchmark: Exp. 40, Ana. 40, Exp. 200, Ana. 200, Ana. Infinite. The x-axis lists each benchmark with its measured prediction accuracy p: compress 0.28, mgrid 0.28, ijpeg 0.35, swim#2 0.35, go 0.52, swim#1 0.53, gcc 0.55, li 0.58, perl1 0.6, vortex 0.78, m88ksim 0.92.]

5. Conclusions, contributions and future directions

In this paper we presented an analytical and experimental study of the characteristics of value prediction. This concept is based on the observation that programs tend to re-use their recently generated values during execution. By taking advantage of this phenomenon, we can allow the system to collapse true-data dependencies and perform speculative execution based on predicted values.

So far, all modern computer systems have been based on the assumption that the dataflow graph of a sequential program forms a fundamental upper bound on the instruction-level parallelism. In order to better utilize the parallel resources in the system, the design of modern processors has focused on resolving name and control dependencies. Value prediction demonstrates that a similar technique can be used to improve parallelism by allowing the execution of data-dependent operations out of order. We believe that this fundamental principle opens new horizons for future computer architectures.

Throughout this study, we have examined the concept of value prediction from three different perspectives: 1. we studied the characteristics of the phenomenon from the viewpoint of the program code, 2. we explored how these characteristics can be taken advantage of in order to improve the ILP extractable from a sequential program, and 3. we provided a probability model in order to attain an analytical evaluation of the potential of value prediction to boost ILP.

The main contributions of this study can be summarized as follows:
1. We extended the concept of value prediction and provided a related terminology. We introduced the notion of value predictability and distinguished between two different types of value predictability: last-value predictability and stride value predictability.
2. We presented substantial evidence confirming that programs tend to re-use their recently generated values and to exhibit predictable patterns of data values. In addition, we showed that programs can exhibit two kinds of value predictability patterns, last-value and stride. We also examined the distribution of these properties across different programs, instruction types and data types.
3. We introduced various value predictors and examined how efficiently they exploit different value predictability patterns.
4. We showed that the prediction accuracy is not distributed uniformly among the instructions in a program. Most programs exhibit two sets of instructions: highly value-predictable instructions and highly unlikely-predictable ones. These observations are the motivation for developing classification mechanisms, such as saturated counters, which were found to be very useful in preventing the "unlikely predictable" instructions from becoming candidates for value prediction.
5. Our preliminary observation indicates that different input files do not greatly change the value predictability of a program. This observation is encouraging for our future intention to use profiling-based compiler techniques that could classify the value predictability of instructions based on previous runs of the programs.
6. We showed that the use of value prediction techniques makes a substantial contribution to the extractable ILP. In addition, we presented two different value prediction modes, the scalar generation mode and the eager generation mode, and examined their impact on the extractable ILP. Our ILP measurements indicate that the stride predictor accomplishes significantly better ILP than the last-value predictor in those benchmarks that exhibited stride value predictability, while in the rest of the benchmarks both predictors gain similar ILP with a relatively smaller advantage to the stride predictor. We also observed that the eager mode particularly improves the overall performance of the stride predictor in those benchmarks which exhibited stride value predictability.
7. We showed that enlarging the instruction window significantly improves the ILP gained by the eager generation mode and the stride predictor. We also observed that in this case programs which exhibit last-value predictability can gain a certain benefit from the eager generation mode as well.
8. We provided an analytical model which can be used to better understand the characteristics of value prediction and to obtain a rough estimation of the potential of using it.

The study presented in this paper shows the importance of value prediction. We believe that using this new technique will lead to new directions in designing future computer architectures. Now that we have established the basis and the evidence of this new phenomenon, we intend to continue our research towards achieving a better understanding of the capabilities of value prediction and how it can be employed. We are in the process of examining different hardware mechanisms for an efficient implementation of value prediction. In addition, we are looking at combining compiler support that can exploit our knowledge about value predictability.

References

[1] T. L. Adam, K. M. Chandy and J. R. Dickson. A Comparison of List Scheduling for Parallel Processing Systems. Communications of the ACM, vol. 17, Dec. 1974, pp. 685-690.
[2] A. V. Aho, R. Sethi and J. D. Ullman. Compilers: Principles, Techniques and Tools. Reading, MA: Addison-Wesley, 1986.
[3] A. Aiken and A. Nicolau. Perfect Pipelining: A New Loop Parallelization Technique. In H. Ganzinger (ed.), Proceedings of the 2nd European Symposium on Programming, pp. 221-235, New York, Mar. 1988.
[4] R. J. Blainey. Instruction Scheduling in the TOBEY Compiler. IBM J. Res. Develop., vol. 38, no. 5, Sep. 1994, pp. 577-593.
[5] P.-Y. Chang, M. Evers and Y. N. Patt. Improving Branch Prediction Accuracy by Reducing Pattern History Table Interference. Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT), 1996, pp. 48-57.
[6] A. E. Charlesworth. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS Family. Computer, vol. 14, Sep. 1981, pp. 18-27.
[7] T. F. Chen and J. L. Baer. Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Transactions on Computers, 44(5):609-623, May 1995.
[8] S. Davidson, D. Landskov, B. D. Shriver and P. W. Mallet. Some Experiments in Local Microcode Compaction for Horizontal Machines. IEEE Transactions on Computers, vol. C-30, no. 7, July 1981, pp. 460-477.
[9] T. A. Diep, C. Nelson and J. P. Shen. Performance Evaluation of the PowerPC 620 Microarchitecture. In proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 163-174.
[10] D. R. Ditzel and H. R. McLellan. Branch Folding in the CRISP Microprocessor: Reducing the Branch Delay to Zero. Proceedings of the 14th International Symposium on Computer Architecture, June 1987.
[11] J. H. Edmondson, P. Rubinfeld, R. Preston and V. Rajagopalan. Superscalar Instruction Execution in the 21164 Alpha Microprocessor. IEEE Micro, April 1995, pp. 33-43.
[12] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass., 1986.
[13] J. A. Fisher. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. dissertation, Technical Report COO-3077-161, Courant Mathematics and Computing Laboratory, New York University, New York, October 1979.
[14] J. A. Fisher and S. M. Freudenberger. Predicting Conditional Branches from Previous Runs of a Program. Proceedings of the 5th Conference on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM, Boston, Oct. 1992.
[15] J. W. C. Fu and J. H. Patel. Stride Directed Prefetching in Scalar Processors. In proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 102-110, Portland, Oregon, Dec. 1992.
[16] F. Gabbay and A. Mendelson. Speculative Execution Based on Value Prediction. EE Department TR #1080, Technion - Israel Institute of Technology, Nov. 1996.
[17] F. Gabbay and A. Mendelson. Can Program Profiling Support Value Prediction? Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, December 1997.
[18] F. Gabbay and A. Mendelson. The Effect of Instruction-Fetch Bandwidth on Value Prediction. In proceedings of the 25th International Symposium on Computer Architecture, June 1998.
[19] J. Gonzalez and A. Gonzalez. Speculative Execution via Address Prediction and Data Prefetching. Proceedings of the 11th International Conference on Supercomputing, pp. 196-203, July 1997.
[20] L. Gwennap. Intel's P6 Uses Decoupled Superscalar Design. Microprocessor Report, vol. 9, no. 2, Feb. 16, 1995.
[21] M. Johnson. Superscalar Microprocessor Design. Prentice Hall, Englewood Cliffs, N.J., 1990.
[22] R. M. Keller. Look-Ahead Processors. Computing Surveys, vol. 7, no. 4, Dec. 1975, pp. 177-195.
[23] M. S. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. SIGPLAN Conference on Programming Language Design and Implementation, ACM, Atlanta, Ga., June 1988, pp. 318-328.
[24] M. Lam. A Systolic Array Optimizing Compiler. Boston: Kluwer, 1989.
[25] M. H. Lipasti, C. B. Wilkerson and J. P. Shen. Value Locality and Load Value Prediction. In proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), Oct. 1996.
[26] M. H. Lipasti and J. P. Shen. Exceeding the Dataflow Limit via Value Prediction. In proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, Dec. 1996.
[27] S. McFarling and J. Hennessy. Reducing the Cost of Branches. Proceedings of the 13th International Symposium on Computer Architecture, June 1986, pp. 396-403.
[28] S. S. Pinter and A. Yoaz. Tango: A Hardware-Based Data Prefetching Technique for Superscalar Processors. In proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, Dec. 1996.
[29] B. R. Rau and C. D. Glaeser. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proceedings of the 14th Annual Workshop on Microprogramming, Oct. 1981, pp. 183-198.
[30] M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar, M. Ramaswami, M. Shebanow, V. Thirumalaiswamy and D. Tovey. Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor. In proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 151-162.
[31] Introduction to Shade. Sun Microsystems Laboratories, Inc., TR 415-960-1300, Revision A of 1/Apr/92.
[32] A. Smith and J. Lee. Branch Prediction Strategies and Branch-Target Buffer Design. Computer, 17:1, Jan. 1984, pp. 6-22.
[33] J. E. Smith. A Study of Branch Prediction Strategies. In proceedings of the 8th International Symposium on Computer Architecture, June 1981.

[34] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM J. Research and Development, 11:1, pp. 25-33, Jan. 1967.
[35] D. W. Wall. Limits of Instruction-Level Parallelism. Proceedings of the 4th Conference on Architectural Support for Programming Languages and Operating Systems, Apr. 1991, pp. 248-259.
[36] S. Weiss and J. E. Smith. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1987, pp. 105-109.
[37] T. Y. Yeh and Y. N. Patt. Two-Level Adaptive Training Branch Prediction. In proceedings of the 24th Annual ACM/IEEE International Symposium on Microarchitecture, Nov. 1991.
[38] T. Y. Yeh and Y. N. Patt. Alternative Implementations of Two-Level Adaptive Branch Prediction. Proceedings of the 19th International Symposium on Computer Architecture, May 1992, pp. 124-134.
[39] T. Y. Yeh and Y. N. Patt. A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History. Proceedings of the 20th International Symposium on Computer Architecture, May 1993, pp. 257-266.

Freddy Gabbay received the B.Sc. (summa cum laude) and M.Sc. degrees in electrical engineering from the Technion - Israel Institute of Technology, Haifa, Israel, in 1994 and 1995 respectively. He is currently a Ph.D. student (since 1995) in the Electrical Engineering Department at the Technion. His main research interest is computer architecture.

Abraham (Avi) Mendelson received the B.Sc. and M.Sc. degrees in computer science from the Technion, Haifa, Israel, in 1979 and 1982, and the Ph.D. degree from the ECE Department, University of Massachusetts, in 1990. He is a Lecturer of Electrical Engineering and a member of the Parallel Systems Laboratory at the Technion, Israel. His main research interests are computer architectures, operating systems and distributed algorithms. Dr. Mendelson is a member of the Association for Computing Machinery and the IEEE Computer Society.
