From micro-OPs to abstract resources: constructing a simpler CPU performance model through microbenchmarking
Nicolas Derumigny, Théophile Bastian, Guillaume Iooss, Fabrice Rastello
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, Grenoble, France
[email protected], [email protected], [email protected], [email protected]

Christophe Guillon, Fabian Gruber
STMicroelectronics, Grenoble, France
[email protected], [email protected]

Louis-Noel Pouchet
Colorado State University, Fort Collins, USA
[email protected]

arXiv:2012.11473v3 [cs.AR] 16 Sep 2021

Abstract

In a super-scalar architecture, the scheduler dynamically assigns micro-operations (μOPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into μOPs and lists for each μOP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop.

This paper introduces a dual equivalent representation: the resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes PALMED, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. PALMED does not require hardware performance counters, and relies solely on runtime measurements.

We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks [7]. We compared the throughput predicted by existing machine models to that produced by PALMED, and found comparable accuracy to state-of-the-art tools, achieving a sub-10% mean square error rate on this workload on Intel's Skylake microarchitecture.

1 Introduction

Performance modeling is a critical component for program optimization, assisting compilers as well as developers in predicting the performance of code variations ahead of time. Performance models can be obtained through different approaches that span from precise and complex simulation of a hardware description [22, 23, 37] to application-level analytical formulations [16, 36]. A widely used approach for modeling the CPU of modern pipelined, super-scalar, out-of-order processors consists in decoupling the different sources of bottlenecks, such as the latency-related ones (critical path, dependencies), the memory-related ones (cache behavior, bandwidth, prefetch, etc.), or the port-throughput-related ones (instruction execution units).

Decoupled models allow to pinpoint the source of a performance bottleneck, which is critical both for compiler optimization [20, 28], kernel hand-tuning [14, 38], and performance debugging [5, 17, 21, 24, 30, 33]. Cycle-approximate simulators such as ZSim [32] or MCsimA+ [6] can also take advantage of such an instruction characterization. This paper focuses on modeling the port throughput, that is, estimating the performance of a dependency-free loop where all memory accesses are L1 hits. Such modeling is usually based on the so-called port mapping of a CPU, that is, the list of execution ports each instruction can be mapped to. This motivated several projects to extract such information from available documentation [8, 21]. But the documentation on commercial CPUs, when available, is often vague or outright lacking information. Intel's processor manual [10], for example, does not describe all the instructions implemented by Intel cores, and for those covered, it does not even provide the decomposition of individual instructions into micro-operations (μOPs), nor the execution ports that these μOPs can use.

Another line of work that allows for a more exhaustive and precise instruction characterization is based on micro-benchmarks, such as those developed to characterize the memory hierarchy [9]. While characterizing the latency of instructions is quite easy [13, 15, 18], throughput is more challenging. Indeed, on super-scalar processors, the throughput of a combination of instructions cannot simply be derived from the throughput of the individual instructions. This is because instructions compete for CPU resources, such as functional units or execution ports, which can prevent them from executing in parallel. It is thus necessary not only to characterize the throughput of each individual instruction, but also to come up with a description of the available resources and the way they are shared.

In this work, we present a fully automated, architecture-agnostic approach, fully implemented in PALMED, to automatically build a mapping between instructions (micro-ops) and execution ports. It automatically builds a static performance model of the throughput of sets of instructions to be executed on a particular processor. While prior techniques targeting this problem have been presented, e.g. [4, 29], we make several key contributions in PALMED to improve automation, coverage, scalability, accuracy, and practicality:

• We introduce a dual equivalent representation of the port mapping problem as a conjunctive abstract resource mapping problem, facilitating the creation of specific micro-benchmarks to saturate resources.
• We present several new algorithms: to automatically generate versatile sets of saturating micro-benchmarks, for any instruction and resource; to build efficient linear programming optimization problems exploiting these micro-benchmark measurements; and to compute a complete resource mapping for all instructions.
• We present a complete, automated implementation in PALMED, which we evaluate against numerous other approaches including IACA [17], LLVM-mca [33], PMEvo [29] and uops.info [4].

This paper has the following structure. Sec. 2 discusses the state of practice and presents our novel approach to automatically generate a valid port mapping. Sec. 3 presents formal definitions and the equivalence between our model and the three-level mapping currently in use. Sec. 4 presents our architecture-agnostic approach to deduce the abstract mapping without the use of any performance counters besides elapsed CPU cycles. Sec. 5 extensively evaluates our approach against related work on two off-the-shelf CPUs. Related work is further discussed in Sec. 6 before concluding.

2 Motivation and Overview

2.1 Background

In this work, we consider a CPU as a processing device mainly described by the so-called "port model". Here, instructions are first fetched from memory, then decomposed into one or more micro-operations, also called μOPs. The CPU schedules these μOPs on a free compatible execution port, before the final retirement stage. Even though some instructions such as add %rax, %rax translate into only a single μOP, the x86 instruction set also contains more complex instructions that translate into multiple μOPs. For example, the wbinvd (Write Back and Invalidate Cache) instruction produces as many μOPs as needed to flush every line of the cache, leading to thousands of μOPs [4].

Execution ports are controllers routing μOPs to execution units with one or more different functional capabilities: for example, on the Intel Skylake architecture, only port 4 may store data; and the store address must have previously been computed by an Address Generation Unit, available on ports 2, 3 and 7.

The latency of an instruction is the number of clock cycles elapsed between two dependent computations. The latency of an instruction I can be experimentally measured by creating a micro-benchmark that executes a long chain of instances of I, each depending on the previous one.

The throughput of an instruction is the maximum number of instances of that instruction that can be executed in parallel in one cycle. On every recent x86 architecture, all units but the divider are fully pipelined, meaning that they can reach a maximum throughput of one μOP per cycle, even if their latency is greater than one cycle. For an instruction I, the throughput of I can be experimentally measured by creating a micro-benchmark that executes many non-dependent instances of I. The combined throughput of a multiset¹ of instructions can be defined similarly. For example, the throughput of {ADDSS², BSR}, i.e. two instances of ADDSS and one instance of BSR, is equal to the number of instructions executed per cycle (IPC) by the micro-benchmark:

repeat:
  ADDSS %xmm1 %xmm1; ADDSS %xmm2 %xmm2; BSR %rax %rax;
  ADDSS %xmm3 %xmm3; ADDSS %xmm4 %xmm4; BSR %rbx %rbx;
  ADDSS %xmm5 %xmm5; ADDSS %xmm6 %xmm6; BSR %rcx %rcx;
  ...

A resource mapping describes the resources used by each instruction in a way that can be used to derive the throughput of any multiset of instructions, without having to execute the corresponding micro-benchmark. Such information is crucial for manual assembly optimization to pinpoint the precise cause of slowdowns in highly optimized codes, and to measure the relative usage of the peak performance of the machine.
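Why the combined throughput cannot simply be derived from individual throughputs can be made concrete with a small computation. The sketch below is only a didactic illustration of the underlying scheduling problem (not PALMED's method): assuming hypothetical single-μOP, fully pipelined instructions with the port sets used above, the optimal steady-state cycles per iteration equal the worst load over any subset of ports, which we brute-force here.

```python
from itertools import combinations

# Hypothetical single-uOP port mapping (Skylake-like, ports 0 and 1 only):
# each instruction lists the ports its uOP may be issued on.
PORTS = {"ADDSS": {0, 1}, "BSR": {1}}

def cycles_per_iteration(kernel):
    """Optimal steady-state cycles for one iteration of a multiset kernel.

    For fully pipelined single-uOP instructions, the optimum equals the
    worst load over port subsets S: (uOPs restricted to S) / |S|.
    """
    kernel = dict(kernel)
    all_ports = set().union(*(PORTS[i] for i in kernel))
    best = 0.0
    for k in range(1, len(all_ports) + 1):
        for subset in combinations(sorted(all_ports), k):
            s = set(subset)
            # uOPs that have no choice but to run somewhere inside s
            load = sum(n for i, n in kernel.items() if PORTS[i] <= s)
            best = max(best, load / len(s))
    return best

def ipc(kernel):
    return sum(kernel.values()) / cycles_per_iteration(kernel)

print(ipc({"ADDSS": 2, "BSR": 1}))  # 2.0: p0 and p1 saturated together
print(ipc({"ADDSS": 1, "BSR": 2}))  # 1.5: bottlenecked by p1
```

Note how BSR alone and ADDSS alone each have "good" throughput, yet the mix ADDSS BSR² is limited by their competition for port 1.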
¹A multiset is a set that can contain multiple instances of an element. As with normal sets, the order of elements is not relevant.

In this work, we target the automatic construction of a resource mapping for a given CPU on which we can accurately measure elapsed cycles for a code fragment. Note that PALMED only uses benchmarks that have no dependencies, that is, where all instructions can execute in parallel. Consequently, the order of instructions in the benchmark does not matter².

²We assume, like all related work we are aware of, that the CPU scheduler is able to optimally schedule these simple kernels.

2.2 Constructing a resource mapping

To characterize the throughput of each individual instruction, a description of the available resources and the way they are shared is needed. The most natural way to express this sharing is through a port mapping, a tripartite graph that describes how instructions decompose into μOPs and assigns μOPs to execution ports (see Fig. 1a). The goal of existing work has been to reverse engineer such a port mapping for different CPU architectures.

The first level of this mapping, from instructions to μOPs, is conjunctive, i.e., a given instruction decomposes into one or more of each of the μOPs it maps to. The second level of this mapping, on the other hand, is disjunctive, i.e., a μOP can choose to execute on any one of the ports it maps to. Even with hardware counters that provide the number of μOPs executed per cycle and the usage of each individual port, creating such a mapping is quite challenging and requires a lot of manual effort with ad hoc solutions to handle all the cases specific to each architecture [4, 13, 15, 30].

Such approaches, while powerful and allowing a semi-automatic characterization of basic-block throughput, suffer from several limitations. First, they assume that the architecture provides the required hardware counters. Second, they only allow modeling the throughput bottlenecks associated with port usage, and neglect other resources, such as the front-end or the reorder buffer. In other words, they provide a performance model of an ideal architecture that does not necessarily fully match reality.

To overcome these limitations, we restrict ourselves to only using cycle measurements when building our performance model. Not relying on specialized hardware performance counters may complicate the initial model construction, but in exchange our approach is able to model resources not covered by hardware counters with relative ease. This also paves the way to significantly easing the development of modeling techniques for new CPU architectures. One of the main challenges is to generate a set of micro-benchmarks that allows the detection of all the possible resource sharing. Unfortunately, to be exhaustive, and in the absence of structural properties, this set is combinatorial: all possible mixes of instructions need to be evaluated. A simple way to reduce the set of micro-benchmarks required is to reduce the set of modeled instructions to those that are emitted by compilers [25, 29]. Another natural strategy, followed by Ithemal [25], is to build micro-benchmarks from the "most executed" basic blocks of some representative benchmarks. A third strategy, used by PMEvo [29], is to have kernels that contain repetitions of two different instructions.

Our solution is constructive and follows several successive steps that allow building a non-combinatorial number of micro-benchmarks that stress the usage of each individual resource, thus characterizing the resource usage of all instructions.

The second main challenge addressed by PMEvo is to build an interpretable model, that is, a resource mapping that can be used by a compiler or a performance debugging tool, instead of a black box only able to predict the throughput of a microkernel. One issue with the standard port-mapping approach, as used in [4, 21, 33], is that computing the throughput of a set of instructions requires the resolution of a flow problem; that is, given a set of micro-benchmarks, finding a mapping of μOPs to ports that best expresses the corresponding observed performances requires solving a multi-resolution linear optimization problem. This linear problem also does not scale to larger sets of benchmarks, even when restricting the micro-benchmarks to only contain up to two different instructions. PMEvo addressed this issue by using an evolutionary algorithm that approximates the result.

Table 1. Summary of key features of PALMED vs. related work

                no HW      no manual
                counters   expertise   interpretable   general
llvm-mca [33]      ✗           ✗            ✓             ✓
Ithemal [25]       ✓           ✗            ✗             ✗
IACA [17]         N/A          ✗            ✓             ✓
uops.info [4]      ✗           ✗            ✓             ✓
PMEvo [29]         ✓           ✓            ✓             ✗
PALMED             ✓           ✓            ✓             ✓

2.3 Resource mapping: dual representation

Our approach is based on a crucial observation: a dual representation exists for which computing the throughput is not a linear problem, but a simple formula instead. While it takes several hours to solve the original disjunctive port-mapping formulation, only a few minutes suffice for the corresponding conjunctive resource-mapping formulation.

For the sake of illustration, we consider the Skylake instructions restricted to those that only use ports 0, 1, or 6 (denoted as p0, p1, and p6). Fig. 1a shows the port mapping for six such instructions. In this example, the μOP of BSR has a single port p1 on which it can be issued; as for instruction ADDSS, its μOP can be issued on either p0 or p1. Hence, BSR has a throughput of one, that is, only one instruction can be issued per cycle, whereas ADDSS has a throughput of two: two different instances of ADDSS may be executed in parallel by p0 and p1.

Figure 1. Mappings computed for a few SKL-SP instructions (DIVPS, VCVTT, ADDSS, BSR, JNLE, JMP). (a) Port mapping (disjunctive form) and maximum throughput of each port. (b) Abstract resource mapping (conjunctive form) and maximum throughput of each resource. (c) Normalized conjunctive form for ADDSS and BSR.
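The dual construction can be mechanized on a toy example: for each combination S of ports, introduce an abstract resource r_S with throughput |S|, used by every μOP whose allowed ports all lie in S. The port sets and μOP counts below are assumptions made for illustration, loosely following the figure's instructions; this is a sketch, not PALMED's measured mapping.

```python
from itertools import combinations

# Assumed toy port sets for the figure's instructions.
PORTS = {"DIVPS": {0}, "ADDSS": {0, 1}, "BSR": {1}, "JNLE": {0, 6},
         "JMP": {6}, "VCVTT": {0, 1}}

UOPS = {"VCVTT": 2}  # assumed uOP count per instruction (default 1)

def resources():
    """One abstract resource r_S per port combination S, throughput |S|."""
    ports = sorted(set().union(*PORTS.values()))
    res = {}
    for k in range(1, len(ports) + 1):
        for s in combinations(ports, k):
            res["r" + "".join(map(str, s))] = set(s)
    return res

def execution_time(kernel):
    """Max over abstract resources of (total uses) / throughput."""
    t = 0.0
    for name, s in resources().items():
        uses = sum(n * UOPS.get(i, 1)
                   for i, n in kernel.items() if PORTS[i] <= s)
        t = max(t, uses / len(s))
    return t

print(execution_time({"ADDSS": 2, "BSR": 1}))  # 1.5 (r01 saturated)
```

The result agrees with the flow-problem view on the disjunctive mapping, but is obtained by a direct maximum over resources rather than by solving an optimization problem.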
Figure 2. Disjunctive port assignment examples: (a) ADDSS² BSR; (b) ADDSS BSR².

The throughput of the multiset K = {ADDSS², BSR}, more compactly denoted ADDSS² BSR, is therefore determined by the combined throughput of resources p0 and p1. Indeed, in steady-state mode, the execution can saturate both resources by repeating the pattern represented in Fig. 2a. In this case, there clearly does not exist any better scheduling, and the corresponding execution time for K is 3 cycles for every 6 instructions, that is, an Instructions Per Cycle (IPC) of 2. Now, if we consider the set ADDSS BSR², its throughput is limited by p1. Indeed, the optimal schedule in that case would repeat the pattern represented in Fig. 2b, which requires 2 cycles for 3 instructions, that is, an IPC of 1.5. More generally, the maximum throughput of a multiset on a tripartite port mapping can be computed by expressing the minimal scheduling problem as a flow problem.

The dual representation, advocated in this paper, corresponds to a conjunctive bipartite resource mapping as illustrated in Fig. 1b. In this mapping, an instruction such as ADDSS, which uses one out of two possible ports p0 and p1, will only use the abstract resource r01 representing the combined load on both ports, and will use neither r0 nor r1. In this model, the maximum throughput of r01 is the sum of the throughputs of p0 and p1, that is, 2 uses per cycle. Instructions that may only be computed on p0 will then use r0 and r01, along with all other resources combining the use of p0 with other ports, such as r06 and r016. Consequently, the average execution time of a microkernel is computed as the maximum load over all abstract resources, that is, their number of uses divided by their throughput (see Sec. 3). One can prove (see [1]) the strict equivalence between the two representations, without any combinatorial explosion in the number of combined resources (e.g. 8 for Nehalem to 14 for Skylake-X): indeed, just as r16 in Fig. 1b is subsumed by resources r1 and r6, many combined resources are redundant and not needed.

A key contribution of this paper is thus to provide a less intricate two-level view that can be constructed more quickly than in previous works. Instead of representing the execution flow with the traditional three-level "instructions decomposed into micro-operations (micro-ops) executed by ports" model, we opt for a direct "instructions use abstract resources" model. Whereas an instruction is transformed into several micro-ops which in turn may be executed by different compute units, our bipartite model strictly uses every resource mapped to the instructions. In other words, the or in the mapping graph is replaced with and, which greatly simplifies throughput estimation. This representation may also express other bottlenecks, such as the instruction decoder or the reorder buffer, as additional abstract resources. Note that this corresponds to the user's view, where the micro-ops and their execution paths are kept hidden inside the processor. An important contribution of this paper is a constructive algorithm that produces a non-combinatorial set of representative micro-benchmarks which can be used to characterize all instructions of the architecture.

2.4 PALMED: flow of work

Fig. 3 overviews the major steps of PALMED, which are extensively described in Sec. 4. Our algorithm follows an approach similar to the one developed by uops.info: its principle is to first find a set of basic instructions producing only one μOP and bound to one port. On Intel CPUs, this first step can be done by measuring, through performance counters, the μOPs per cycle on each port for each instruction. Those basic instructions are then used to characterize the port mapping of any general instruction by artificially saturating each individual port one by one and measuring the effect on the usage of the other ports. The challenge addressed by PALMED is to find a mapping even for architectures that do not have such hardware counters. This translates into two major hardships: firstly, in our case, there are no predefined resources; secondly, there is not even a simple technique to find the number of μOPs an instruction decomposes into.
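The idea of saturating one resource at a time can be illustrated numerically on a toy conjunctive mapping. In the sketch below (assumed port sets, not measured data), each kernel's bottleneck is the abstract resource r_S with the highest relative load; ADDSS alone bottlenecks on r01, while pairing it with JNLE shifts the bottleneck to r016.

```python
from itertools import combinations

# Assumed toy port sets, consistent with the figure's instructions.
PORTS = {"DIVPS": {0}, "ADDSS": {0, 1}, "BSR": {1},
         "JNLE": {0, 6}, "JMP": {6}}

def bottleneck(kernel):
    """Return the abstract resource r_S with the highest relative load,
    where load(r_S) = (uses of r_S by the kernel) / |S|."""
    ports = sorted(set().union(*PORTS.values()))
    best, best_load = None, -1.0
    for k in range(1, len(ports) + 1):
        for s in combinations(ports, k):
            uses = sum(n for i, n in kernel.items() if PORTS[i] <= set(s))
            load = uses / len(s)
            if load > best_load:
                best, best_load = "r" + "".join(map(str, s)), load
    return best

print(bottleneck({"DIVPS": 1}))             # r0
print(bottleneck({"ADDSS": 1}))             # r01
print(bottleneck({"ADDSS": 1, "JNLE": 1}))  # r016
```

This mirrors the behavior described in the text: most resources are saturated by a single basic instruction, but some combined resources need a pair of basic instructions.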
As illustrated by Fig. 3, the algorithm of PALMED is composed of three steps: 1. find basic instructions; 2. characterize a set of abstract resources (expressed as a core mapping) and an associated set of saturating micro-kernels (a single instruction might not be enough to saturate a resource); 3. compute the resource usage of every other instruction with respect to the core mapping.

There are 754 x86 Intel Skylake-compatible instructions that use only the ports p0, p1, or p6. Quadratic benchmarking – that is, measuring the execution time of one benchmark per pair of instructions, leading to a quadratic number of measures (567762) – allows us to regroup those that have the same behavior, leading to only 9 classes of instructions. For each class, a single instruction is used as a representative. Among those instructions, two heuristics (described in Sec. 4.1) select the set of basic instructions, outputting DIVPS, BSR, JMP, JNLE, and ADDSS.

Fig. 1b shows the output of the Core Mapping stage of Fig. 3, in bold. In practice, abstract resources are internally named R0, ..., R5. For convenience we renamed them after the hardware execution ports they correspond to: for example, the abstract resource r01 corresponds to the combined use of ports p0 and p1 in an optimal schedule. The Core Mapping stage also computes a set of saturating micro-benchmarks that individually saturate each abstract resource. Here, each basic instruction constitutes by itself a saturating micro-benchmark: DIVPS saturates r0, BSR saturates r1, JMP saturates r6, ADDSS saturates r01, and JNLE saturates r06. Note that this is not the case in general: we may need to combine several basic instructions to saturate a resource. Here, the saturating micro-benchmark for resource r016 is composed of two basic instructions: ADDSS and JNLE. The last phase of our algorithm will, for each of the 742 remaining instructions, build a set of micro-benchmarks that combine the saturating kernels with the instruction, and compute its mapping.

3 The bipartite resource mapping

This section presents the theoretical foundation of PALMED, that is, the equivalence between the disjunctive μOP mapping graph formerly used and a dual conjunctive formulation.

Definition 3.1 (Microkernel). A microkernel K is an infinite loop made up of a finite multiset of instructions, K = I_1^{σ_{K,1}} I_2^{σ_{K,2}} ⋯ I_m^{σ_{K,m}}, without dependencies between instructions. The number of instructions executed during one loop iteration is |K| = Σ_i σ_{K,i}.

In a classical disjunctive port mapping formalism, an instruction i from a microkernel K is assigned to a compatible port (resource r). The execution time of K is determined by the resource which is used the most by its instructions in a given such assignment, and depends on the assignment picked, as presented in Sec. 2. Instead, we consider a conjunctive port mapping:

Definition 3.2 (Conjunctive port mapping). A conjunctive port mapping is a bipartite weighted graph (I, R, E, ρ_{I,R}) where: I represents the set of instructions; R represents the set of abstract resources, each of which has a (normalized) throughput of 1; E ⊂ I × R expresses the required use of abstract resources for each instruction. An instruction i that uses a resource r ((i, r) ∈ E) always uses the same proportion of it (number of cycles) ρ_{i,r} ∈ Q⁺. If i does not use r, then ρ_{i,r} = 0.

Let K = I_1^{σ_{K,I_1}} I_2^{σ_{K,I_2}} ⋯ I_m^{σ_{K,I_m}} be a microkernel. In a steady-state execution of K, for each loop iteration, instruction i must use a resource r for (σ_{K,i} · ρ_{i,r}) cycles. The number of cycles required to execute one loop iteration is:

    t(K) = max_{r ∈ R} ( Σ_{i ∈ K} σ_{K,i} · ρ_{i,r} )

One should observe that Def. 3.2 formally defines a normalized version where the throughputs of abstract resources are set to 1. For the sake of clarity, the example in Sec. 2 considered non-normalized throughputs, that is, throughputs different from 1. Going from the non-normalized form (as in Fig. 1b) to the normalized form (as in Fig. 1c) simply amounts to dividing the incoming edges of a resource by the resource's throughput before setting its throughput to 1. For example, in the non-normalized form, VCVTT uses r01 2 times; as r01 has a throughput of 2, this leads to a normalized ρ_{VCVTT,r01} of 1. Similarly, ρ_{ADDSS,r016} = 1/3.

Definition 3.3 (Throughput). The throughput K̄ of a microkernel K is its instructions-per-cycle rate (IPC), defined as:

    K̄ = |K| / t(K) = ( Σ_{i ∈ K} σ_{K,i} ) / ( max_{r ∈ R} Σ_{i ∈ K} σ_{K,i} · ρ_{i,r} )

Example. If K = ADDSS² BSR, as in Fig. 2a:

    t(K) = max_{r ∈ {r1, r01, r016}} ( 2 × ρ_{ADDSS,r} + ρ_{BSR,r} )
         = max( (r1) 2 × 0 + 1, (r01) 2 × 1/2 + 1/2, (r016) 2 × 1/3 + 1/3 )
         = 1.5

    K̄ = (2 + 1) / 1.5 = 2

On K′ = ADDSS BSR², as in Fig. 2b, the same computation gives t(K′) = 2, the bottleneck being r1; hence, K̄′ = 3/2.

The mathematical definitions, the method to build a conjunctive port mapping from a disjunctive one, and the abstract resources and the equivalence proof can be found in Appendix A.

4 Computing resource mapping

As depicted in Fig. 3, our approach can be decomposed into three different steps. Sec. 4.1 describes the selection of basic instructions, a subset of instructions that map to as few resources as possible. Sec. 4.2 describes the computation of the core mapping for these basic instructions. The core mapping stays fixed for the rest of the algorithm. Along with the core mapping, we also select a saturating micro-benchmark for each resource, called the saturating kernel. The saturating kernel is made up of basic instructions that do not place a heavy load on other resources. Sec. 4.3 describes how PALMED uses the saturating kernels and the core mapping to deduce, one by one, the resources used by the remaining instructions of the targeted architecture.

Figure 3. High-level view of the algorithms of PALMED.
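Definitions 3.2 and 3.3 translate directly into a few lines of code. The sketch below re-runs the worked example with the normalized ρ values of Fig. 1c (only ADDSS and BSR), using exact rationals; the ρ table is copied from the example, not derived from measurements.

```python
from fractions import Fraction as F

# Normalized resource usages rho[i][r] from the worked example (Fig. 1c);
# zero entries are simply omitted.
rho = {
    "ADDSS": {"r01": F(1, 2), "r016": F(1, 3)},
    "BSR":   {"r1": F(1), "r01": F(1, 2), "r016": F(1, 3)},
}
RESOURCES = ["r1", "r01", "r016"]

def t(kernel):
    """Cycles per iteration: max_r sum_i sigma_i * rho_{i,r} (Def. 3.2)."""
    return max(sum(sigma * rho[i].get(r, F(0)) for i, sigma in kernel.items())
               for r in RESOURCES)

def throughput(kernel):
    """IPC: |K| / t(K) (Def. 3.3)."""
    return F(sum(kernel.values())) / t(kernel)

print(t({"ADDSS": 2, "BSR": 1}))           # 3/2
print(throughput({"ADDSS": 2, "BSR": 1}))  # 2
print(throughput({"ADDSS": 1, "BSR": 2}))  # 3/2
```

The maximum is reached on r01 for ADDSS² BSR and on r1 for ADDSS BSR², matching the bottlenecks identified in the example.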
4.1 Basic Instructions selection

The first step of our algorithm trims the instruction set to extract a minimal set of instructions for which the entire mapping will be computed. As this mapping will be reused later, we need enough instructions to detect all resources; but the more instructions we have, the longer the resolution of the linear problem to find the core mapping will take. We thus first apply two simple filters that reduce the number of basic instructions, as depicted in the first half of Algo. 1.

Algorithm 1: Selection of basic instructions. IPC(x) denotes the measured IPC of kernel x (written x̄ in the text).
 1 Function Select_basic_insts(I, n)
 2   I_F := I
     // Remove low-IPC; compute eq. classes
 3   foreach a ∈ I_F do
 4     if IPC(a) ≤ 1 − ε then I_F := I_F − {a}
 5     if ∃ b ∈ I_F, ∀p ∈ I, IPC(ap) = IPC(bp) then
 6       I_F := I_F − {a}
     // Select very basic instructions
 7   foreach a ∈ I_F do
 8     Dj[a] := { b ∈ I_F : IPC(ab) = IPC(a) + IPC(b) }
 9   let a

Low-IPC. If ā < 1, then a is ignored. Assuming every
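The two filters in the first half of Algo. 1 can be sketched as follows; here `ipc` is a stand-in for the measured IPC of a dependency-free kernel, and all numbers in the table are illustrative placeholders rather than real measurements.

```python
# Sketch of the first half of Algo. 1: filter candidate basic instructions.
EPS = 0.05  # assumed noise margin
INSTS = ["ADDSS", "MULSS", "BSR", "SLOWOP"]
TOY_IPC = {  # placeholder "measurements", keyed by sorted kernel tuples
    ("ADDSS",): 2.0, ("MULSS",): 2.0, ("BSR",): 1.0, ("SLOWOP",): 0.3,
    ("ADDSS", "ADDSS"): 2.0, ("ADDSS", "MULSS"): 2.0,
    ("ADDSS", "BSR"): 2.0, ("ADDSS", "SLOWOP"): 0.5,
    ("MULSS", "MULSS"): 2.0, ("BSR", "MULSS"): 2.0,
    ("MULSS", "SLOWOP"): 0.5, ("BSR", "BSR"): 1.0,
    ("BSR", "SLOWOP"): 0.5, ("SLOWOP", "SLOWOP"): 0.3,
}

def ipc(kernel):
    return TOY_IPC[tuple(sorted(kernel))]

def select_candidates(insts, ipc):
    # Low-IPC filter: drop instructions whose own IPC is below 1 - EPS.
    kept = [a for a in insts if ipc((a,)) > 1 - EPS]
    # Equivalence-class filter: a is redundant with representative b when
    # the kernels {a, p} and {b, p} have the same IPC for every p.
    reps = []
    for a in kept:
        if not any(all(abs(ipc((a, p)) - ipc((b, p))) <= EPS for p in insts)
                   for b in reps):
            reps.append(a)
    return reps

print(select_candidates(INSTS, ipc))  # ['ADDSS', 'BSR']
```

In this toy run, SLOWOP is rejected by the low-IPC filter and MULSS is merged into ADDSS's equivalence class, leaving one representative per behavior.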