Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures
Andreas Abel and Jan Reineke
Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
abel, [email protected]

arXiv:2107.14210v1 [cs.PF] 29 Jul 2021

ABSTRACT

Tools to predict the throughput of basic blocks on a specific microarchitecture are useful to optimize software performance and to build optimizing compilers. In recent work, several such tools have been proposed. However, the accuracy of their predictions has been shown to be relatively low.

In this paper, we identify the most important factors for these inaccuracies. To a significant degree, these inaccuracies are due to elements and parameters of the pipelines of recent CPUs that are not taken into account by previous tools. A primary reason for this is that the necessary details are often undocumented. In this paper, we build more precise models of relevant components by reverse engineering using microbenchmarks. Based on these models, we develop a simulator for predicting the throughput of basic blocks. In addition to predicting the throughput, our simulator also provides insights into how the code is executed.

Our tool supports all Intel Core microarchitecture generations released in the last decade. We evaluate it on an improved version of the BHive benchmark suite. On many recent microarchitectures, its predictions are more accurate than the predictions of state-of-the-art tools by more than an order of magnitude.

1. INTRODUCTION

Techniques to predict the performance of software are, for example, required to build compilers, to optimize code for specific platforms, and to analyze the worst-case execution time of software in safety-critical systems. Such techniques are also useful in the area of security, where performance differences may be used for side-channel attacks.

A specific problem in this area that has received a lot of attention in recent work is to predict the performance of basic blocks, under the assumption that all memory accesses lead to cache hits. A popular performance predictor from this class is Intel's IACA [1]. It is, for example, used by the Kerncraft tool [2] to construct Roofline [3] and ECM [4] models. It has also been used by Bhattacharyya et al. [5] to build a speculative execution attack.

However, IACA is closed source, does not support recent microarchitectures, and was discontinued by Intel in 2019. A number of alternatives were proposed in recent work, such as OSACA [6, 7], llvm-mca [8], CQA [9], Ithemal [10], and DiffTune [11].

The accuracy of the throughput predictions of previous tools compared to measurements on the actual hardware has been shown to be relatively low; Chen et al. [12] found that the average error of previous tools compared to measurements is between 9% and 36%.

We have identified two main reasons that contribute to these discrepancies between measurements and predictions: (1) the microarchitectural models of previous tools are not detailed enough; (2) evaluations were partly based on unsuitable benchmarks, and on biased and inaccurate throughput measurements. In this paper, we address both of these issues.

In the first part, we develop a pipeline model that is significantly more detailed than those of previous tools. It is applicable to all Intel Core microarchitectures released in the last decade; the differences between these microarchitectures can be captured by a small number of parameters. A challenge in building such a model is that many of the relevant properties are undocumented. We have developed microbenchmark-based techniques to reverse-engineer these properties.

Based on this model, we then implement a simulator that can compute throughput predictions for basic blocks. In addition, it also provides further insights into how basic blocks are executed, such as the port usage for each instruction.

To compare basic block throughput predictors, Chen et al. [12] proposed the BHive benchmark suite, along with a tool for measuring throughputs. We discovered that this benchmark suite contains a significant number of benchmarks with ambiguous throughputs that can give an unfair advantage to some of the tools. We therefore discard the corresponding benchmarks. Moreover, we discovered that the measurement tool often provides inaccurate measurements, and that it is based on a different definition of throughput than most previous throughput predictors. We significantly improve the accuracy of the measurement tool, and we extend it to support both throughput definitions.
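To make the notion of throughput concrete, the following sketch (ours, not the paper's measurement tool) shows one common way a basic block's steady-state throughput can be estimated from cycle measurements at two different repetition counts, so that constant startup and measurement overhead cancels out; the two definitions mentioned above differ in whether the measured code consists only of unrolled copies of the block or also includes the loop branch. All numbers below are invented for illustration.

def cycles_per_occurrence(cycles_lo, reps_lo, cycles_hi, reps_hi):
    """Estimate steady-state cycles per occurrence of a basic block from
    two runs with different repetition counts; constant overhead cancels
    out in the difference."""
    return (cycles_hi - cycles_lo) / (reps_hi - reps_lo)

# Invented example: 1180 cycles for 100 copies of the block, 2280 cycles
# for 200 copies => about 11.0 cycles per occurrence.
print(cycles_per_occurrence(1180, 100, 2280, 200))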
Using this improved benchmark suite and measurement methodology, we compare our throughput predictor with a number of previous tools on nine different Intel Core microarchitectures. In most cases, the average error of our tool is around or below 1%, which is significantly lower than the error of previous tools.

To summarize, the main contributions of our paper are:
• A parametric pipeline model that is applicable to all Intel Core microarchitectures released in the last ten years (Section 2).
• A throughput predictor (Section 3) that is more accurate than previous work by more than an order of magnitude on several recent microarchitectures (Section 5).
• An improved set of benchmarks for validating throughput predictors, and a more accurate measurement methodology (Section 4).

2. PIPELINE MODEL

Figure 1 shows the general structure of the pipelines of recent Intel Core CPUs. At such a high level, all these CPUs are very similar. For developing an accurate performance predictor, however, a more detailed model is necessary. Unfortunately, the official documentation only provides limited information on these details. Unofficial sources are not always correct, and also do not provide all of the necessary information at the required level of detail.

[Figure 1: Pipeline of Intel Core CPUs. Front end: instruction cache → predecoder → instruction queue (IQ) → DSB / decoder / MS → instruction decode queue (IDQ). Execution engine (back end): renamer/allocator → reorder buffer → scheduler → ports 0–5 (ALU/JMP, ALU/V-MUL, ALU/V-ADD, load/AGU, load/AGU, store data) → retire. Memory: L1 data cache, L2 cache, memory.]

In this section, we describe the pipeline model that we have developed for accurate throughput prediction. Thus, the focus of the presentation is on those aspects that we have found to be relevant for predicting the performance of basic blocks. We describe the general structure and fill in the missing details. We obtained information on undocumented details through reverse engineering via microbenchmarks using hardware performance counters. For evaluating these microbenchmarks, we use nanoBench [13] and an extension to nanoBench that provides cycle-by-cycle performance data, similar to Brandon Falk's "Sushi Roll" approach [14].

The model presented in this section is significantly more detailed than any previous models described in the literature. It is applicable to all Intel Core microarchitectures released in the last ten years; maybe surprisingly, the differences between these microarchitectures can be captured by a small number of parameters.

2.1 Predecoding

The instruction bytes are fetched from the instruction cache in 16-byte blocks and are then predecoded. The predecoder is mainly responsible for detecting where each instruction begins. This is not completely straightforward, as an instruction can be between 1 and 15 bytes long, and detecting its length can require inspecting several bytes of the instruction.

The predecoded instructions are inserted into the instruction queue (IQ). The length of this queue is a parameter of our model; it is between 20 and 25 on recent microarchitectures.

According to our experiments, the predecoder can predecode at most five instructions per cycle on all of the microarchitectures we consider. If there are, e.g., six instructions in a 16-byte block, five instructions would be predecoded in the first cycle, and only one instruction in the next cycle. Several sources claim that the predecoding limit is six instructions per cycle [15, 16, 17]. This does not agree with our experiments. The source for the claims in [15, 16, 17] might be a section in Intel's optimization manual [18] that mentions a limit of six instructions per cycle; however, this section is in a part of the manual that describes predecoding on old Intel Core 2 CPUs (which did not have a µop cache, and thus decoding was a more significant bottleneck). The manual does not describe the predecoding limits for more recent microarchitectures.

Instructions with a so-called length-changing prefix (LCP) are a special case. For such instructions, the predecoder has to use a slower length-decoding algorithm. This results in a penalty of three cycles for each such instruction.
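The predecoding rules described so far can be summarized in a small sketch. This is our own simplified model, not the authors' simulator: it applies only the five-instructions-per-cycle limit and the three-cycle LCP penalty, and it ignores the 16-byte-boundary effects discussed next.

def predecode_cycles(has_lcp, width=5):
    """has_lcp: one boolean per instruction in program order, True if the
    instruction has a length-changing prefix. Returns the number of cycles
    the predecoder needs under this simplified model."""
    cycles = 0
    for i in range(0, len(has_lcp), width):
        group = has_lcp[i:i + width]    # at most `width` instructions per cycle
        cycles += 1 + 3 * sum(group)    # base cycle plus 3-cycle LCP penalties
    return cycles

print(predecode_cycles([False] * 6))           # -> 2: five instructions, then one
print(predecode_cycles([True, False, False]))  # -> 4: one LCP penalty of 3 cycles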
For building an accurate simulator, it is important to know how instructions that cross a 16-byte boundary are predecoded; however, this is undocumented. Our experiments show that instructions are predecoded with the 16-byte block in which they end; they count toward the limit of five instructions in the corresponding cycle. We found that there is a one-cycle penalty if five instructions were predecoded in the current cycle, the next instruction crosses a 16-byte boundary, but its primary opcode is contained in the current 16-byte block; if the current block contains only prefixes or escape opcodes of the next instruction, there is no penalty.

Besides the difference in the length of the IQ, the predecoders appear to be identical on all CPUs that we consider.

2.2 Decoding

The decoding unit fetches up to four instructions from the IQ. It decodes the instructions into a sequence of µops and sends them to the Instruction Decode Queue (IDQ, sometimes also called the µop queue). The IDQ has between 28 and 70 entries on the microarchitectures we consider; this is a parameter of our model.

The decoding unit consists of one complex decoder and three simple decoders. The simple decoders can only decode instructions with one µop. The complex decoder always decodes the first of the fetched instructions, and it can generate up to four µops. Instructions with more than four µops are (at least partially) handled by the Microcode Sequencer (see Section 2.5).
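As with the predecoder, the decoder behavior described above can be captured in a few lines. The following is a minimal sketch under our reading of this section, not the authors' implementation: IQ entries are modeled as µop counts, and the microcode sequencer is not modeled (instructions with more than four µops simply emit all their µops here).

from collections import deque

def decode_cycle(iq, idq, idq_capacity=28):
    """Simulate one decoding cycle: up to four instructions leave the IQ;
    the first goes to the complex decoder, the rest to the three simple
    decoders, which accept only single-µop instructions. Returns the
    number of instructions decoded in this cycle."""
    decoded = 0
    while iq and decoded < 4 and len(idq) + iq[0] <= idq_capacity:
        if decoded > 0 and iq[0] > 1:
            break  # multi-µop instruction must wait for the complex decoder
        idq.extend(range(iq.popleft()))  # placeholder µops
        decoded += 1
    return decoded

# Example: instructions with 2, 1, 1, and 3 µops. Only three are decoded in
# the first cycle; the 3-µop instruction waits for the complex decoder.
iq, idq = deque([2, 1, 1, 3]), deque()
print(decode_cycle(iq, idq))  # -> 3
print(decode_cycle(iq, idq))  # -> 1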