From micro-OPs to abstract resources: constructing a simpler CPU performance model through microbenchmarking

Nicolas Deruminy, Théophile Bastian, Fabian Gruber, Guillaume Iooss, Fabrice Rastello — Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, Grenoble, France ([email protected])
Christophe Guillon — STMicroelectronics, Grenoble, France ([email protected])
Louis-Noel Pouchet — Colorado State University, Fort Collins, USA ([email protected])

arXiv:2012.11473v3 [cs.AR] 16 Sep 2021

Abstract

In a super-scalar architecture, the scheduler dynamically assigns micro-operations (μOPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into μOPs and lists, for each μOP, the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the throughput of a sequence of instructions repeatedly executed as the core component of a loop.

This paper introduces a dual, equivalent representation: the resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem; building a resource mapping is a more tractable problem and provides a simpler yet equivalent model. This paper describes PALMED, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. PALMED does not require hardware performance counters and relies solely on runtime measurements.

We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks [7]. We compared the throughput predicted by existing machine models to that produced by PALMED, and found comparable accuracy to state-of-the-art tools, achieving a sub-10 % mean square error rate on this workload on Intel's Skylake microarchitecture.

1 Introduction

Performance modeling is a critical component for program optimizations, assisting compilers as well as developers in predicting the performance of code variations ahead of time. Performance models can be obtained through different approaches that span from precise and complex simulation of a hardware description [22, 23, 37] to application-level analytical formulations [16, 36]. A widely used approach for modeling the CPU of modern pipelined, super-scalar, out-of-order processors consists in decoupling the different sources of bottlenecks: the latency-related ones (critical path, dependencies), the memory-related ones (cache behavior, bandwidth, prefetch, etc.), and the port-throughput-related ones (instruction execution units).

Decoupled models allow one to pinpoint the source of a performance bottleneck, which is critical for optimization [20, 28], kernel hand-tuning [14, 38], and performance debugging [5, 17, 21, 24, 30, 33]. Cycle-approximate simulators such as ZSim [32] or MCsimA+ [6] can also take advantage of such an instruction characterization. This paper focuses on modeling port throughput, that is, estimating the performance of a dependency-free loop where all memory accesses are L1 hits. Such modeling is usually based on the so-called port mapping of a CPU, that is, the list of execution ports each instruction can be mapped to. This motivated several projects to extract information from available documentation [8, 21]. But the documentation on commercial CPUs, when available, is often vague or outright lacking information. Intel's processor manual [10], for example, does not describe all the instructions implemented by Intel cores, and for those covered, it does not even provide the decomposition of individual instructions into micro-operations (μOPs), nor the execution ports that these μOPs can use.

Another line of work that allows a more exhaustive and precise instruction characterization is based on micro-benchmarks, such as those developed to characterize the memory hierarchy [9]. While characterizing the latency of instructions is quite easy [13, 15, 18], throughput is more challenging. Indeed, on super-scalar processors, the throughput of a combination of instructions cannot simply be derived from the throughput of the individual instructions, because instructions compete for CPU resources, such as functional units or execution ports, which can prevent them from executing in parallel. It is thus necessary not only to characterize the throughput of each individual instruction, but also to come up with a description of the available resources and the way they are shared.

In this work, we present a fully automated, architecture-agnostic approach, implemented in PALMED, to build a mapping between instructions (micro-ops) and execution ports. It automatically builds a static performance model of the throughput of sets of instructions executed on a particular processor. While prior techniques targeting this problem have been presented, e.g. [4, 29], we make several key contributions in PALMED to improve automation, coverage, scalability, accuracy and practicality:
• We introduce a dual, equivalent representation of the port mapping problem as a conjunctive abstract resource mapping problem, facilitating the creation of specific micro-benchmarks that saturate resources.
• We present several new algorithms: to automatically generate versatile sets of saturating micro-benchmarks for any instruction and resource; to build efficient Linear Programming optimization problems exploiting these micro-benchmark measurements; and to compute a complete resource mapping for all instructions.
• We present a complete, automated implementation in PALMED, which we evaluate against numerous other approaches including IACA [17], LLVM-mca [33], PMEvo [29] and UOPS.info [4].

This paper has the following structure. Sec. 2 discusses the state of practice and presents our novel approach to automatically generate a valid port mapping. Sec. 3 presents formal definitions and the equivalence between our model and the three-level mapping currently in use. Sec. 4 presents our architecture-agnostic approach to deduce the abstract mapping without the use of any performance counters besides elapsed CPU cycles. Sec. 5 extensively evaluates our approach against related work on two off-the-shelf CPUs. Related work is further discussed in Sec. 6 before concluding.

2 Motivation and Overview

2.1 Background

In this work, we consider a CPU as a processing device mainly described by the so-called "port model". Here, instructions are first fetched from memory, then decomposed into one or more micro-operations, also called μOPs. The CPU schedules these μOPs on a free compatible execution port, before the final retirement stage. Even though some instructions such as add %rax, %rax translate into only a single μOP, the x86 instruction set also contains more complex instructions that translate into multiple μOPs. For example, the wbinvd (Write Back and Invalidate Cache) instruction produces as many μOPs as needed to flush every line of the cache, leading to thousands of μOPs [4].

Execution ports are controllers routing μOPs to execution units with one or more different functional capabilities: for example, on the Intel Skylake architecture, only port 4 may store data, and the store address must have previously been computed by an Address Generation Unit, available on ports 2, 3 and 7.

The latency of an instruction is the number of clock cycles elapsed between two dependent computations. The latency of an instruction I can be experimentally measured by creating a micro-benchmark that executes a long chain of instances of I, each depending on the previous one.

The throughput of an instruction is the maximum number of instances of that instruction that can be executed in parallel in one cycle. On every recent x86 architecture, all units but the divider are fully pipelined, meaning that they can reach a maximum throughput of one μOP per cycle, even if their latency is greater than one cycle. For an instruction I, the throughput can be experimentally measured by creating a micro-benchmark that executes many independent instances of I. The combined throughput of a multiset¹ of instructions can be defined similarly. For example, the throughput of {ADDSS², BSR}, i.e. two instances of ADDSS and one instance of BSR, is equal to the number of instructions executed per cycle (IPC) by the micro-benchmark:

    repeat:
        ADDSS %xmm1 %xmm1; ADDSS %xmm2 %xmm2; BSR %rax %rax;
        ADDSS %xmm3 %xmm3; ADDSS %xmm4 %xmm4; BSR %rbx %rbx;
        ADDSS %xmm5 %xmm5; ADDSS %xmm6 %xmm6; BSR %rcx %rcx;
        ...

A resource mapping describes the resources used by each instruction in a way that can be used to derive the throughput of any multiset of instructions, without having to execute the corresponding micro-benchmark. Such information is crucial for manual assembly optimization, to pinpoint the precise cause of slowdowns in highly optimized codes and to measure the relative usage of the peak performance of the machine.
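The two kinds of micro-benchmark described above (a dependent chain for latency, independent instances for throughput) can be generated mechanically. The sketch below is ours, not the paper's implementation; the function names and register-rotation scheme are assumptions, chosen so that the throughput kernel reproduces the ADDSS/BSR example.

```python
# Sketch (not from the paper's code): generating the two kinds of x86
# micro-benchmark bodies described above. All names are ours.

def latency_chain(mnemonic, reg, n):
    """Latency benchmark: n instances of one instruction, each depending
    on the previous one through the same register."""
    return "\n".join(f"{mnemonic} {reg} {reg};" for _ in range(n))

def throughput_kernel(counts, regs, unroll):
    """Throughput benchmark: independent instances of a multiset of
    instructions, e.g. counts = {'ADDSS': 2, 'BSR': 1}. Each instance
    rotates through distinct registers so no dependency is created."""
    lines = []
    for i in range(unroll):
        for mnemonic, k in counts.items():
            for j in range(k):
                r = regs[mnemonic][(i * k + j) % len(regs[mnemonic])]
                lines.append(f"{mnemonic} {r} {r};")
    return "\n".join(lines)

# Reproduces the ADDSS^2 BSR kernel shown above (9 instructions).
body = throughput_kernel(
    {"ADDSS": 2, "BSR": 1},
    {"ADDSS": ["%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6"],
     "BSR": ["%rax", "%rbx", "%rcx"]},
    unroll=3)
```

Timing `latency_chain` and dividing by n yields the latency; timing `throughput_kernel` and dividing instructions by cycles yields the IPC.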

¹A multiset is a set that can contain multiple instances of an element. As with normal sets, the order of elements is not relevant.

In this work, we target the automatic construction of a resource mapping for a given CPU on which we can accurately measure elapsed cycles for a code fragment. Note that PALMED only uses benchmarks that have no dependencies, that is, where all instructions can execute in parallel. Consequently, the order of instructions in the benchmark does not matter².

²We assume, like all related work we are aware of, that the CPU scheduler is able to optimally schedule these simple kernels.

2.2 Constructing a resource mapping

To characterize the throughput of each individual instruction, a description of the available resources and the way they are shared is needed. The most natural way to express this sharing is through a port mapping, a tripartite graph that describes how instructions decompose into μOPs and assigns μOPs to execution ports (see Fig. 1a). The goal of existing work has been to reverse-engineer such a port mapping for different CPU architectures.

The first level of this mapping, from instructions to μOPs, is conjunctive: a given instruction decomposes into one or more of each of the μOPs it maps to. The second level, on the other hand, is disjunctive: a μOP can choose to execute on any one of the ports it maps to. Even with hardware counters that provide the number of μOPs executed per cycle and the usage of each individual port, creating such a mapping is quite challenging and requires a lot of manual effort, with ad hoc solutions to handle all the cases specific to each architecture [4, 13, 15, 30].

Such approaches, while powerful and allowing a semi-automatic characterization of basic-block throughput, suffer from several limitations. First, they assume that the architecture provides the required hardware counters. Second, they only model the throughput bottlenecks associated with port usage, and neglect other resources, such as the front-end or the reorder buffer. In other words, they provide a performance model of an ideal architecture that does not necessarily fully match reality.

To overcome these limitations, we restrict ourselves to only using cycle measurements when building our performance model. Not relying on specialized hardware performance counters may complicate the initial model construction, but in exchange our approach is able to model resources not covered by hardware counters with relative ease. This also paves the way to significantly easing the development of modeling techniques for new CPU architectures. One of the main challenges is to generate a set of micro-benchmarks that allows the detection of all possible resource sharing. Unfortunately, to be exhaustive, and in the absence of structural properties, this set is combinatorial: all possible mixes of instructions need to be evaluated. A simple way to reduce the set of required micro-benchmarks is to reduce the set of modeled instructions to those that are emitted by compilers [25, 29]. Another natural strategy, followed by Ithemal [25], is to build micro-benchmarks from the "most executed" basic blocks of some representative benchmarks. A third strategy, used by PMEvo [29], is to have kernels that contain repetitions of two different instructions. Our solution is constructive and follows several successive steps that allow building a non-combinatorial number of micro-benchmarks that stress the usage of each individual resource, thus characterizing the resource usage of all instructions.

The second main challenge addressed by PMEvo is to build an interpretable model, that is, a resource mapping that can be used by a compiler or a performance debugging tool, instead of a black box only able to predict the throughput of a microkernel. One issue with the standard port-mapping approach, as used in [4, 21, 33], is that computing the throughput of a set of instructions requires the resolution of a flow problem; that is, given a set of micro-benchmarks, finding a mapping of μOPs to ports that best expresses the corresponding observed performance requires solving a multi-resolution linear optimization problem. This linear problem also does not scale to larger sets of benchmarks, even when restricting the micro-benchmarks to only contain up to two different instructions. PMEvo addressed this issue by using an evolutionary algorithm that approximates the result.

Table 1. Summary of key features of PALMED vs. related work

                no HW counters  no manual expertise  interpretable  general
llvm-mca [33]         ✗                 ✗                  ✓           ✓
Ithemal [25]          ✓                 ✗                  ✗           ✗
IACA [17]            N/A                ✗                  ✓           ✓
uops.info [4]         ✗                 ✗                  ✓           ✓
PMEvo [29]            ✓                 ✓                  ✓           ✗
PALMED                ✓                 ✓                  ✓           ✓

2.3 Resource mapping: dual representation

Our approach is based on a crucial observation: a dual representation exists for which computing the throughput is not a linear problem, but a simple formula instead. While it takes several hours to solve the original disjunctive port-mapping formulation, only a few minutes suffice for the corresponding conjunctive resource-mapping formulation.

For the sake of illustration, we consider the Skylake instructions restricted to those that only use ports 0, 1, or 6 (denoted p0, p1, and p6). Fig. 1a shows the port mapping for six such instructions. In this example, the μOP of BSR has a single port, p1, on which it can be issued; as for instruction ADDSS, its μOP can be issued on either p0 or p1. Hence, BSR has a throughput of one, that is, only one instance can be issued per cycle, whereas ADDSS has a throughput of two: two different instances of ADDSS may be executed in parallel on p0 and p1.

Figure 1. Mappings computed for a few SKL-SP instructions (DIVPS, VCVTT, ADDSS, BSR, JNLE, JMP): (a) port mapping (disjunctive form, ports p0, p1, p6) and maximum throughput of each port; (b) abstract resource mapping (conjunctive form, resources r0, r01, r1, r016, r06, r6) and maximum throughput of each resource; (c) normalized conjunctive form for ADDSS and BSR (resources r01, r016, r1).
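The duality can be made concrete with a small computation. For single-μOP, fully pipelined instructions scheduled fractionally on unit-throughput ports, the steady-state cycles per iteration equal the maximum, over every subset S of ports, of the number of instances confined to S divided by |S|; by LP duality this subset bound is tight, and each subset S behaves exactly like one combined abstract resource (r01, r016, ...). The code below is our illustrative sketch, not PALMED's implementation.

```python
from itertools import chain, combinations

def disjunctive_time(uop_ports, counts):
    """Steady-state cycles per loop iteration for a multiset of single-uop
    instructions on a disjunctive port mapping with unit-throughput ports.
    uop_ports: instruction -> set of admissible ports
    counts:    instruction -> number of instances per iteration
    The optimal (fractional) schedule length is the max over port subsets S
    of (instances confined to S) / |S|; each subset S plays the role of one
    conjunctive 'combined resource'."""
    all_ports = sorted(set().union(*uop_ports.values()))
    subsets = chain.from_iterable(combinations(all_ports, k)
                                  for k in range(1, len(all_ports) + 1))
    return max(sum(c for i, c in counts.items()
                   if uop_ports[i] <= set(s)) / len(s)
               for s in subsets)

# Fig. 1a restricted to ADDSS (ports 0 or 1) and BSR (port 1 only):
pmap = {"ADDSS": {0, 1}, "BSR": {1}}
print(disjunctive_time(pmap, {"ADDSS": 2, "BSR": 1}))  # 1.5 -> IPC 2
print(disjunctive_time(pmap, {"ADDSS": 1, "BSR": 2}))  # 2.0 -> IPC 1.5
```

The binding subsets here are {p0, p1} (the combined resource r01) for the first kernel and {p1} (the resource r1) for the second, matching the discussion of Fig. 2 below.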

The throughput of the multiset K = {ADDSS², BSR}, more compactly denoted ADDSS²BSR, is therefore determined by the combined throughput of resources p0 and p1. Indeed, in steady state, the execution can saturate both resources by repeating the pattern represented in Fig. 2a. In this case, there clearly does not exist any better scheduling, and the corresponding execution time for K is 3 cycles for every 6 instructions, that is, an instructions-per-cycle (IPC) rate of 2. Now, if we consider the multiset ADDSS BSR², its throughput is limited by p1: the optimal schedule in that case repeats the pattern represented in Fig. 2b, which requires 2 cycles for 3 instructions, that is, an IPC of 1.5. More generally, the maximum throughput of a multiset on a tripartite port mapping can be obtained by expressing the minimal scheduling problem as a flow problem.

Figure 2. Disjunctive port assignment examples on ports p0 and p1: (a) ADDSS² BSR; (b) ADDSS BSR².

The dual representation, advocated in this paper, corresponds to a conjunctive bipartite resource mapping, as illustrated in Fig. 1b. In this mapping, an instruction such as ADDSS, which uses one out of two possible ports p0 and p1, will only use the abstract resource r01 representing the combined load on both ports, and will use neither r0 nor r1. In this model, the maximum throughput of r01 is the sum of the throughputs of p0 and p1, that is, 2 uses per cycle. Instructions that may only be computed on p0 will then use r0 and r01, along with all other resources combining the use of p0 with other ports, such as r06 and r016. Consequently, the average execution time of a microkernel is computed as the maximum load over all abstract resources, that is, their number of uses divided by their throughput (see Sec. 3). One can prove (see [1]) the strict equivalence between the two representations without the need for any combinatorial explosion in the number of combined resources (e.g. 8 for Nehalem to 14 for Skylake-X): indeed, just as r16 in Fig. 1b is subsumed by resources r1 and r6, many combined resources are redundant and not needed.

A key contribution of this paper is to provide a less intricate two-level view that can be constructed quicker than in previous works. Instead of representing the execution flow with the traditional three-level "instructions decomposed into micro-operations (micro-ops) executed by ports" model, we opt for a direct "instructions use abstract resources" model. Whereas an instruction is transformed into several micro-ops which in turn may be executed by different compute units, our bipartite model strictly uses every resource mapped to the instruction. In other words, the or in the mapping graph is replaced with and, which greatly simplifies throughput estimation. This representation may also capture other bottlenecks, such as the instruction decoder or the reorder buffer, as additional abstract resources. Note that this corresponds to the user view, where the micro-ops and their execution paths are kept hidden inside the processor. An important contribution of this paper is a constructive algorithm that produces a non-combinatorial set of representative micro-benchmarks that can be used to characterize all instructions of the architecture.

2.4 PALMED: flow of work

Fig. 3 overviews the major steps of PALMED, which are extensively described in Sec. 4. Our algorithm follows an approach similar to the one developed by uops.info: its principle is to first find a set of basic instructions producing only one μOP and bound to one port. On Intel CPUs, this first step can be done by measuring, through performance counters, the μOPs per cycle on each port for each instruction. Those basic instructions are then used to characterize the port mapping of any general instruction by artificially saturating each individual port one by one and measuring the effect on the usage of the other ports. The challenge addressed by PALMED is to find a mapping even for architectures that do not have such hardware counters. This translates into two major hardships: firstly, in our case, there are no predefined resources; secondly, there is not even a simple technique to find the number of μOPs an instruction decomposes into.

As illustrated by Fig. 3, the algorithm of PALMED is composed of three steps: 1. find basic instructions; 2. characterize a set of abstract resources (expressed as a core mapping) and an associated set of saturating micro-kernels (a single instruction might not be enough to saturate a resource); 3. compute the resource usage of every other instruction with respect to the core mapping.

There are 754 x86 Intel Skylake-compatible instructions that use only the ports p0, p1, or p6. Quadratic benchmarking, that is, measuring the execution time of one benchmark per pair of instructions, leading to a quadratic number of measures (567,762), allows us to regroup those that have the same behavior, leading to only 9 classes of instructions. For each class, a single instruction is used as a representative. Among those instructions, two heuristics (described in Sec. 4.1) select the set of basic instructions, outputting DIVPS, BSR, JMP, JNLE, and ADDSS.

Fig. 1b shows the output of the Core Mapping stage of Fig. 3, in bold. In practice, abstract resources are internally named R0, ..., R5. For convenience, we renamed them after the hardware execution ports they correspond to: for example, the abstract resource r01 corresponds to the combined use of ports p0 and p1 under an optimal schedule. The Core Mapping stage also computes a set of saturating micro-benchmarks that individually saturate each abstract resource. Here, each basic instruction constitutes by itself a saturating micro-benchmark: DIVPS saturates r0, BSR saturates r1, JMP saturates r6, ADDSS saturates r01, and JNLE saturates r06. Note that this is not the case in general: several basic instructions may need to be combined to saturate a resource. Here, the saturating micro-benchmark for resource r016 is composed of two basic instructions: ADDSS and JNLE. The last phase of our algorithm will, for each of the 742 remaining instructions, build a set of micro-benchmarks that combine the saturating kernels with the instruction, and compute its mapping.

3 The bipartite resource mapping

This section presents the theoretical foundation of PALMED, that is, the equivalence between the disjunctive μOP mapping graph formerly used and a dual conjunctive formulation.

Definition 3.1 (Microkernel). A microkernel K is an infinite loop made up of a finite multiset of instructions, K = I₁^{σ_{K,1}} I₂^{σ_{K,2}} ··· I_m^{σ_{K,m}}, without dependencies between instructions. The number of instructions executed during one loop iteration is |K| = Σ_i σ_{K,i}.

In a classical disjunctive port mapping formalism, an instruction i from a microkernel K is assigned to a compatible port (resource r). The execution time of K is determined by the resource that is used the most by its instructions under a given assignment, and depends on the assignment picked, as presented in Sec. 2. Instead, we consider a conjunctive port mapping:

Definition 3.2 (Conjunctive port mapping). A conjunctive port mapping is a bipartite weighted graph (I, R, E, ρ_{I,R}) where: I represents the set of instructions; R represents the set of abstract resources, each of which has a (normalized) throughput of 1; E ⊂ I × R expresses the required use of abstract resources by each instruction. An instruction i that uses a resource r ((i, r) ∈ E) always uses the same proportion (number of cycles) ρ_{i,r} ∈ Q⁺. If i does not use r, then ρ_{i,r} = 0.

Let K = I₁^{σ_{K,I₁}} I₂^{σ_{K,I₂}} ··· I_m^{σ_{K,I_m}} be a microkernel. In a steady-state execution of K, for each loop iteration, instruction i must use a resource r for σ_{K,i} · ρ_{i,r} cycles. The number of cycles required to execute one loop iteration is:

    t(K) = max_{r ∈ R} ( Σ_{i ∈ K} σ_{K,i} · ρ_{i,r} )

One should observe that Def. 3.2 formally defines a normalized version, where the throughputs of abstract resources are set to 1. For the sake of clarity, the example in Sec. 2 used non-normalized throughputs, that is, throughputs different from 1. Going from the non-normalized form (as in Fig. 1b) to the normalized form (as in Fig. 1c) simply amounts to dividing the incoming edges of a resource by the resource's throughput before setting its throughput to 1. For example, in the non-normalized form, VCVTT uses r01 twice, and r01 has a throughput of 2, leading to a normalized ρ_{VCVTT,r01} of 1. Similarly, ρ_{ADDSS,r016} = 1/3.

Definition 3.3 (Throughput). The throughput K̄ of a microkernel K is its instructions-per-cycle rate (IPC), defined as:

    K̄ = |K| / t(K) = ( Σ_{i ∈ K} σ_{K,i} ) / ( max_{r ∈ R} Σ_{i ∈ K} σ_{K,i} · ρ_{i,r} )

Example. If K = ADDSS² BSR, as in Fig. 2a,

    t(K) = max_{r ∈ {r1, r01, r016}} ( 2·ρ_{ADDSS,r} + ρ_{BSR,r} )
         = max( (r1) 2×0 + 1, (r01) 2×1/2 + 1/2, (r016) 2×1/3 + 1/3 )
         = 1.5
    K̄   = (2 + 1)/1.5 = 2

On K′ = ADDSS BSR², as in Fig. 2b, the same computation gives t(K′) = 2, the bottleneck being r1; hence K̄′ = 3/2.

The mathematical definitions, the method to build a conjunctive port mapping from a disjunctive one, and the abstract-resource equivalence proof can be found in Appendix A.

4 Computing the resource mapping

As depicted in Fig. 3, our approach can be decomposed into three different steps. Sec. 4.1 describes the selection of basic instructions, a subset of instructions that map to as few resources as possible. Sec. 4.2 describes the computation of the core mapping for these basic instructions; the core mapping stays fixed for the rest of the algorithm. Along with the core mapping, we also select a saturating micro-benchmark for each resource, called the saturating kernel. The saturating kernel is made up of basic instructions that do not place a heavy load on other resources. Sec. 4.3 describes how PALMED uses the saturating kernels and the core mapping to deduce, one by one, the resources used by the remaining instructions of the targeted architecture.

Figure 3. High-level view of the algorithms of PALMED.

4.1 Basic instruction selection

The first step of our algorithm trims the instruction set to extract a minimal set of instructions for which the entire mapping will be computed. As this mapping will be reused later, we need enough instructions to detect all resources; but the more instructions we keep, the longer the resolution of the linear problem that finds the core mapping will take. We thus first apply two simple filters that reduce the number of basic instructions, as depicted in the first half of Algo. 1.

Low-IPC. If IPC(a) < 1, then a is ignored: assuming every resource is fully pipelined, such an instruction must use at least one resource more than once.

Then, we compute, for every remaining pair of instructions, …

Algorithm 1 (recoverable parts; garbled fragments marked "…"):

    Function Select_basic_insts(I, n):
        I_F := I
        // Remove low-IPC instructions; compute equivalence classes
        foreach a ∈ I_F do
            if IPC(a) ≤ 1 − ε then I_F := I_F − {a}
            if ∃b ∈ I_F such that ∀p ∈ I, IPC(ap) = IPC(bp) then I_F := I_F − {a}
        // Select very basic instructions
        foreach a ∈ I_F do
            Dj[a] := { b ∈ I_F : IPC(ab) = IPC(a) + IPC(b) }
        let a ≺ b ⇔ (|Dj[a]| > |Dj[b]|) ∨ (|Dj[a]| = |Dj[b]| ∧ IPC(a) > IPC(b))
        I_VB := ∅
        for a ∈ I_F in …
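The two filters in the first half of Algo. 1 can be sketched as follows. This is our illustrative code, not the paper's; the helper names and the IPC-table encoding are assumptions, and the real algorithm continues beyond this excerpt with the ordering on Dj and the final selection loop.

```python
EPS = 0.05  # tolerance on measured IPC (our choice for this sketch)

def select_candidates(insts, ipc1, ipc2):
    """First half of the selection: drop low-IPC instructions, then keep
    one representative per behavioral equivalence class.
    ipc1[a]      : measured IPC of the kernel repeating a alone
    ipc2[(a, b)] : measured IPC of the kernel mixing a and b"""
    # Low-IPC filter: a must reach IPC >= 1 on its own.
    kept = [a for a in insts if ipc1[a] > 1 - EPS]
    # Equivalence filter: a and b are equivalent if they behave
    # identically when paired with every probe instruction p.
    reps = []
    for a in kept:
        if not any(all(abs(ipc2[(a, p)] - ipc2[(b, p)]) <= EPS
                       for p in insts)
                   for b in reps):
            reps.append(a)
    return reps
```

On Skylake this kind of pairwise classification is what collapses the 754 port-0/1/6 instructions into 9 behavioral classes, each keeping a single representative.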


(a) IPC prediction profile heatmaps: predicted IPC ratio (Y axis) against native IPC (X axis); predictions closer to the red line are more accurate.

(b) Translation block coverage (Cov.), root-mean-square error on IPC predictions (Err.) and Kendall's tau correlation coefficient (τ_K) compared to native execution.

                       PALMED          uops.info         PMEvo             IACA            llvm-mca
                  Cov.  Err.   τK  Cov.  Err.   τK  Cov.  Err.   τK  Cov.  Err.   τK  Cov.  Err.   τK
Unit               (%)   (%)  (1)   (%)   (%)  (1)   (%)   (%)  (1)   (%)   (%)  (1)   (%)   (%)  (1)
SKL-SP SPEC2017    N/A   7.8 0.90  99.9  40.3 0.71  71.3  28.1 0.47 100.0   8.7 0.80  96.8  20.1 0.73
SKL-SP Polybench   N/A  24.4 0.78 100.0  68.1 0.29  66.8  46.7 0.14 100.0  15.1 0.67  99.5  15.3 0.65
ZEN1   SPEC2017    N/A  29.9 0.68   N/A   N/A  N/A  71.3  36.5 0.43   N/A   N/A  N/A  96.8  33.4 0.75
ZEN1   Polybench   N/A  32.6 0.46   N/A   N/A  N/A  66.8  38.5 0.11   N/A   N/A  N/A  99.5  28.6 0.40

Figure 4. Accuracy of IPC predictions, compared to native execution, for PALMED versus uops.info, PMEvo, IACA and llvm-mca on SPEC CPU2017 and PolyBench/C 4.2.
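The two aggregate metrics reported per tool in Fig. 4b can be reproduced in a few lines. This is an illustrative sketch (function names, weights and IPC arrays are ours): the weighted root-mean-square relative error, and a simple O(n²) Kendall's τ.

```python
import math

def rms_error(ipc_tool, ipc_native, weight):
    """Weighted RMS of the relative IPC prediction error, in percent."""
    total = sum(weight)
    err2 = sum(w / total * ((t - n) / n) ** 2
               for t, n, w in zip(ipc_tool, ipc_native, weight))
    return 100 * math.sqrt(err2)

def kendall_tau(xs, ys):
    """Kendall's tau: pairwise rank agreement between two series,
    between -1 (full anti-correlation) and 1 (full correlation)."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    concordant = sum(1 for i, j in pairs
                     if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (xs[i] - xs[j]) * (ys[i] - ys[j]) < 0)
    return (concordant - discordant) / len(pairs)

# A predictor that ranks all blocks correctly but is uniformly 10% high:
native = [1.0, 2.0, 4.0]
tool = [1.1, 2.2, 4.4]
print(kendall_tau(tool, native))                     # 1.0
print(round(rms_error(tool, native, [1, 1, 1]), 1))  # 10.0
```

The example shows why both metrics are reported: a systematic bias inflates the RMS error while leaving the rank correlation perfect.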

Root-mean-square method:

    Err_{RMS,tool} = sqrt( Σ_i [ weight_i / (Σ_j weight_j) ] · [ (IPC_{i,tool} − IPC_{i,native}) / IPC_{i,native} ]² )

We also provide Kendall's τ coefficient [19], a measure of the rank correlation of a predictor: for each pair of basic blocks, it measures whether the predictor correctly predicted which block had the higher IPC. The coefficient varies between −1 (full anti-correlation) and 1 (full correlation).

The same table also provides a coverage metric with respect to PALMED. This metric characterizes the proportion of basic blocks supported by PALMED that the tool was able to process. Note that the ability to process a basic block varies from tool to tool: some work in degraded mode when meeting instructions they cannot handle, some crash on the basic block. For PMEvo, we ignored any instruction not supported by its mapping, degrading the quality of the result; hence, a plain error is a basic block in which not a single instruction was supported. Although it would be fairer to other tools to measure absolute coverage, that is, the proportion of basic blocks supported by the tool regardless of what PALMED supports, technical limitations prevented us from doing so: running the various tools requires our back-end to generate assembly code, which can only be done for the instructions it supports.

Table 2. Main features of the obtained mappings

Machine                  SKL-SP        ZEN1
Gen. microbenchmarks  ~1,000,000  ~1,000,000
Resources found              17          17
uops' inst. supported      3313        1104
Instructions mapped        2586        2596

We evaluate two architectures. SKL-SP is an Intel Xeon Silver 4114 CPU at 2.20 GHz, using Debian with Linux kernel 4.19 and PAPI 6.0.0.1 to collect the execution time in cycles for each microbenchmark, restricted to non-AVX-512 instructions. ZEN1 is an AMD EPYC 7401P CPU at 2 GHz, set up similarly. For each of these two architectures, the number of generated microbenchmarks, the resources found and the mapped instructions are gathered in Table 2.

We compare the number of instructions supported by PALMED with those supported by uops.info as a baseline. As uops.info supports AMD's architecture only partially (providing throughput and latencies, but no usable port mapping), fewer than half of the instructions supported by our tool are present for this target. Conversely, on SKL-SP, uops.info supports the AVX-512 extension, leading to a more complete set of supported instructions. PMEvo's mapping behaves poorly in terms of coverage (see Fig. 4b), failing to support all instructions in more than 25 % of the basic blocks on every benchmark and processor tested.

In Fig. 4, we observe that PALMED performs significantly better than uops.info and PMEvo on both platforms. On Skylake, it outperforms all other tested tools in terms of Kendall's tau, and compares well with IACA and LLVM-MCA, achieving a sub-10 % mean square error rate on SPEC2017. However, those two tools rely on manual expertise and are tailored to one platform, whereas our tool is fully automated and generic. On Zen1, PALMED is comparable to LLVM-MCA, but shows a greater error rate than on Intel. This is due to the internal organization of the Zen microarchitecture, which uses a separate pipeline for integer/control-flow and floating-point/vector operations. As PALMED tries to minimize the number of resources, this separation is not properly detected, leading to IPC predictions lower than the actual value, as seen on the heatmaps of Fig. 4a.

6 Related work

Intel has developed a static analyzer named IACA [17], which uses an internal mapping based on proprietary information. However, the project is closed-source and has been deprecated since April 2019. Even though some latencies are given directly in the documentation [10], they are known to contain errors and approximations, in addition to being incomplete.

First attempts on x86 to measure latency and throughput were led by Agner Fog [13] and Granlund [15] using hand-written microbenchmarks. Fog also uses hardware performance counters and hand-crafted benchmarks to reverse-engineer port mappings for Intel, AMD and VIA CPUs. Fog's mappings are considered by the community to be quite accurate: for example, the machine model of the x86 back-end of the LLVM compiler framework [20] is partially based on them [34]. However, Fog's and Granlund's approach is tedious and error-prone, since modern CPU instruction sets have thousands of intricate instructions. Abel and Reineke [3, 4] have tackled this problem by combining an automatic microbenchmark generator with an algorithm for port-mapping construction. Their technique relies on hardware counters for the number of μOPs executed on each execution port, which are only available on recent Intel CPUs. They recently started providing data on the newest generations of AMD CPUs, but for lack of the necessary hardware counters, only latency and throughput are published.

OSACA [21] is an open-source alternative to IACA offering a similar static throughput and latency estimator. It relies on automated benchmarks manually linked with publicly available documentation to infer the port mapping and the latencies of the instructions. The tool Kerncraft [16] focuses on hot loop bodies from HPC applications while also modeling caches; its mapping comes from automated benchmarks generated through Likwid [35] and hardware-counter measurements. CQA [31], a static loop analyzer integrated into the MAQAO framework [12], takes a similar path while also supporting OpenMP routines. It combines dependency analysis, microbenchmarks, a port mapping and previous manual results to offer various types of optimization advice to the user, such as vectorization, or how to avoid port saturation. Both Kerncraft and CQA use a hard-coded port mapping based on Fog's work and official Intel and AMD documentation.

Besides the classic port mappings, machine-learning-based approaches have also been used, e.g. in Ithemal [25], to approximate the throughput of basic blocks with good accuracy. However, the resulting model is completely opaque and cannot be analyzed or used for any purpose other than the prediction of basic-block throughput. For instance, Ithemal does not report on the influence of each instruction, which is critical for manual assembly optimization.

PMEvo [29] is a tool that, like PALMED, automatically generates a set of benchmarks that it uses to build a port mapping. It produces a disjunctive tripartite model with instructions, μOPs, and ports, which is the key difference with PALMED. It does not require hardware performance counters, and relies only on runtime measurements of its benchmarks. The set of benchmarks is determined semi-randomly using a genetic algorithm, and the benchmarks themselves are simpler than those used by PALMED, containing at most two different types of instructions. The main difference between PMEvo and PALMED is that PMEvo internally uses a disjunctive bipartite resource model, instead of the conjunctive model used by PALMED. Such models, while able to accurately predict the execution of pipelined instructions bottlenecked only on the execution ports, cannot represent other bottlenecks like the reorder buffer, or non-pipelined instructions like division. More importantly, PMEvo's approach is less scalable, as handling more instructions may quickly lead to an overwhelming number of microbenchmarks, while our approach specifically generates microbenchmarks that saturate resources. PALMED can compute the full mapping, benchmarking included, in a few hours. Another key to this scalability is our incremental approach to handling complex instructions, using a linear programming formulation to compute the mapping automatically and optimally.

7 Conclusion

We presented PALMED, which automatically builds a resource mapping for CPU instructions. This allows modeling not only execution-port usage, but also other limiting resources, such as the front-end or the reorder buffer. We presented an end-to-end approach to enable the mapping

References

[7] J. Bucek, K. Lange, and J. von Kistowski, "SPEC CPU2017: Next-generation compute benchmark," in Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE 2018), K. Wolter, W. J. Knottenbelt, A. van Hoorn, and M. Nambiar, Eds. ACM, April 2018, pp. 41–42. [Online]. Available: https://doi.org/10.1145/3185768.3185771
[8] C. Chatelet, C. Courbet, O. Sykora, and N. Paglieri. Google exegesis. [Online]. Available: https://llvm.org/docs/CommandGuide/llvm-exegesis.html
[9] C. L. Coleman and J. W. Davidson, "Automatic memory hierarchy characterization," in 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2001, pp. 103–110.
[10] I. Corporation. Intel 64 and IA-32 architectures optimization reference manual. [Online]. Available: https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
[11] ——. Intel x86 encoder decoder (Intel XED). [Online]. Available: https://github.com/intelxed/xed
[12] L. Djoudi, J. Noudohouenou, and W. Jalby, "The design and architecture of MAQAOAdvisor: A live tuning guide," in Proceedings of the 15th International Conference on High Performance Computing (HiPC 2008), P. Sadayappan, M. Parashar, R. Badrinath, and V. K. Prasanna, Eds., vol. 5374. Berlin, Heidelberg: Springer-Verlag, December 2008, pp. 42–56. [Online]. Available: https://doi.org/10.1007/978-3-540-89894-8_8
[13] A. Fog.
(2020) Instruction tables: Lists of instruction latencies, of thousands of instructions in a few hours, including mi- through-puts and micro-operation breakdowns for intel, AMD and VIA CPUs. [Online]. Available: http://www.agner.org/optimize/ crobenchmarking time. Our key contributions include the instruction_tables.pdf mathematically rigorous formulation of the port mapping [14] F. Franchetti, T. M. Low, D. Popovici, R. M. Veras, D. G. problem as solving iteratively linear programs, enabling an Spampinato, J. R. Johnson, M. Püschel, J. C. Hoe, and J. M. F. incremental and scalable approach to handling thousands Moura, “SPIRAL: Extreme performance portability,” Proceedings of of instructions. We provided a method to automatically gen- the IEEE, vol. 106, no. 11, pp. 1935–1968, 2018. [Online]. Available: https://doi.org/10.1109/JPROC.2018.2873289 erate microbenchmarks saturating specific resources, alle- [15] T. Granlund. (2017) Instruction latencies and throughput for AMD and viating the need for statistical sampling. We demonstrated intel x86 processors. [Online]. Available: https://gmplib.org/~tege/x86- on one Intel and one AMD high-performance CPUs that timing.pdf PALMED generates automatically practical port mappings [16] J. Hammer, J. Eitzinger, G. Hager, and G. Wellein, “Kerncraft: A tool that compare favorably with state-of-the-art systems like for analytic performance modeling of loop kernels,” in Tools for High Performance Computing 2016, vol. abs/1702.04653. Cham: Springer IACA and uops.info that either use performance counters or International Publishing, 2017, pp. 1–22. manual expertise. [17] I. Hirsh and G. S. Intel® architecture code analyzer. [Online]. Available: https://software.intel.com/en-us/articles/intel-architecture- References code-analyzer [18] instlatx64. x86, x64 instruction latency, memory latency and cpuid [1] [Online]. Available: https://www.dropbox.com/s/zvsnj4wsx0fj775/ dumps. [Online]. 
Available: http://instlatx64.atw.hu/ PMD-full.pdf?dl=0 [19] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, [2] “QEMU: the FAST! processor emulator,” https://www.qemu.org. no. 1/2, pp. 81–93, 1938. [3] A. Abel and J. Reineke, “nanoBench: A low-overhead tool for running [20] C. Lattner and V. S. Adve, “LLVM: A compilation framework for arXiv e-prints microbenchmarks on x86 systems,” , vol. abs/1911.03282, lifelong program analysis & transformation,” in 2nd IEEE/ACM 2019. [Online]. Available: http://arxiv.org/abs/1911.03282 International Symposium on Code Generation and Optimization (CGO [4] ——, “uops.info: Characterizing latency, throughput, and port usage 2004). San Jose, CA, USA: IEEE Computer Society, March 2004, pp. Proceedings of the of instructions on intel microarchitectures,” in 75–88. [Online]. Available: https://doi.org/10.1109/CGO.2004.1281665 Twenty-Fourth International Conference on Architectural Support [21] J. Laukemann, J. Hammert, J. Hofmann, G. Hager, and G. Wellein, for Programming Languages and Operating Systems, ASPLOS 2019 , “Automated instruction stream throughput prediction for intel and I. Bahar, M. Herlihy, E. Witchel, and A. R. Lebeck, Eds. New AMD microarchitectures,” in 2018 IEEE/ACM Performance Modeling, York, NY, USA: ACM, April 2019, pp. 673–686. [Online]. Available: Benchmarking and Simulation of High Performance Computer Systems https://doi.org/10.1145/3297858.3304062 (PMBS). Dallas, TX, USA: IEEE Computer Society, ACM, November [5] ——, “Accurate throughput prediction of basic blocks on recent intel 2018, pp. 121–131. microarchitectures,” 2021. [22] G. H. Loh, S. Subramaniam, and Y. Xie, “Zesto: A cycle-level [6] J. H. Ahn, S. Li, S. O, and N. P. 
Jouppi, “McSimA+: A manycore simulator simulator for highly detailed microarchitecture exploration,” in with application-level+ simulation and detailed microarchitecture IEEE International Symposium on Performance Analysis of Systems 2012 IEEE International Symposium on Performance modeling,” in and Software, ISPASS 2009. Boston, Massachusetts, USA: IEEE Analysis of Systems and Software . Austin, TX, USA: IEEE Computer Society, April 2009, pp. 53–64. [Online]. Available: Computer Society, April 2013, pp. 74–85. [Online]. Available: https://doi.org/10.1109/ISPASS.2009.4919638 https://doi.org/10.1109/ISPASS.2013.6557148 11 [23] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, Computer Society, September 2010, pp. 207–216. [Online]. Available: M. Andreozzi, A. Armejach, N. Asmussen, S. Bharadwaj, G. Black, https://doi.org/10.1109/ICPPW.2010.38 G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillón, L. Chen, [36] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful N. Derumigny, S. Diestelhorst, W. Elsasser, M. Fariborz, A. F. Farahani, visual performance model for multicore architectures,” Commun. P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, B. Hanindhito, ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, M. Horsnell, http://doi.acm.org/10.1145/1498765.1498785 S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. Jones, M. Jung, [37] M. T. Yourst, “PTLsim: A cycle accurate full system x86-64 S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, T. Marinelli, microarchitectural simulator,” in 2007 IEEE International Symposium C. Menard, A. Mondelli, T. Mück, O. Naji, K. Nathella, H. Nguyen, on Performance Analysis of Systems and Software. San Jose, California, N. Nikoleris, L. E. Olson, M. S. Orr, B. Pham, P. Prieto, T. Reddy, USA: IEEE Computer Society, April 2007, pp. 23–34. [Online]. A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, M. D. 
Available: https://doi.org/10.1109/ISPASS.2007.363733 Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, [38] F. G. V. Zee and R. A. van de Geijn, “BLIS: a framework for I. Vougioukas, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon, rapidly instantiating BLAS functionality,” ACM Transactions on and É. F. Zulian, “The gem5 simulator: Version 20.0+,” 2020. [Online]. Mathematical Software, vol. 41, no. 3, June 2015. [Online]. Available: Available: https://arxiv.org/abs/2007.03152 https://doi.org/10.1145/2764454 [24] G. Marin, J. J. Dongarra, and D. Terpstra, “MIAMI: A framework for application performance diagnosis,” in 2014 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS A Appendix - Proof of the equivalence of 2014. Monterey, CA, USA: IEEE Computer Society, March 2014, the dual representation pp. 158–168. [Online]. Available: https://doi.org/10.1109/ISPASS.2014. 6844480 This section contains the definition of a dual conjunctive [25] C. Mendis, A. Renda, S. P. Amarasinghe, and M. Carbin, “Ithemal: bipartite port mapping, to a disjunctive port mapping. It also Accurate, portable and fast basic block throughput estimation using contains the full proof of equivalence between these two deep neural networks,” in Proceedings of the 36th International representations. Conference on Machine Learning, ICML 2019, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., A.1 Primary definitions vol. 97. Long Beach, California, USA: PMLR, June 2019, pp. 4505–4515. [Online]. Available: http://proceedings.mlr.press/v97/mendis19a.html Definition A.1 (Microkernel). A microkernel 퐾 is an infi- [26] F. Nielsen, Hierarchical Clustering, 02 2016, pp. 195–211. nite loop made up of a finite multiset of instructions, 퐾 = [27] L.-N. Pouchet and T. Yuki, “PolyBench/C: The polyhedral benchmark 퐼휎퐾,1 퐼휎퐾,2 퐼휎퐾,푚 1 2 ··· 푚 without dependencies between instructions. suite, version 4.2,” 2016, http://polybench.sf.net. 
The number of instructions executed during one loop iteration [28] G. C. Project. (1987) GNU compiler collection (gcc). [Online]. Available: 퐾 Í 휎 https://gcc.gnu.org/ is | | = 푖 퐾,푖 . [29] F. Ritter and S. Hack, “Pmevo: portable inference of port mappings for out-of-order processors by evolutionary optimization,” in Proceedings Definition A.2 (Disjunctive port mapping). A disjunctive of the 41st ACM SIGPLAN International Conference on Programming port mapping is a bipartite graph (푉, R, 퐸) where: 푉 represents Language Design and Implementation, PLDI 2020, A. F. Donaldson the set of 휇OPs; R represents the set of resources (corresponding and E. Torlak, Eds. New York, USA: ACM, June 2020, pp. 608–622. to execution ports in a real-world CPU); 퐸 ⊂ 푉 × R expresses [Online]. Available: https://doi.org/10.1145/3385412.3385995 휇 [30] A. C. Rubial, E. Oseret, J. Noudohouenou, W. Jalby, and G. Lartigue, the possible mappings from OPs to ports. In this original form “CQA: A code quality analyzer tool at binary level,” in 21st International each port 푟 ∈ R has a throughput 휌(푟) of 1. 휎 휎 휎 Conference on High Performance Computing, HiPC 2014. Goa, India: 퐾 퐼 퐾,1 퐼 퐾,2 퐼 퐾,푚 Let = 1 2 ··· 푚 be a microkernel where each IEEE Computer Society, December 2014, pp. 1–10. [Online]. Available: instruction is composed of a single 휇OP 푣푖 . https://doi.org/10.1109/HiPC.2014.7116904 A valid assignment represents the choice of which resources [31] ——, “CQA: A code quality analyzer tool at binary level,” in 21st International Conference on High Performance Computing, HiPC 2014. to associate with a given instance of an instruction. However, Goa, India: IEEE Computer Society, December 2014, pp. 1–10. [Online]. this choice might change between iterations. Thus, we represent Available: https://doi.org/10.1109/HiPC.2014.7116904 the valid assignment as a mapping 푝 : I × R ↦→ [0; 1] where [32] D. Sánchez and C. 
Kozyrakis, “ZSim: fast and accurate mi- 푝푖,푟 corresponds to the frequency a given resource is chosen. We croarchitectural simulation of thousand-core systems,” in 40th also define 푅 (푝) = {푟, 푝 ≠ 0}. This assignment is valid if: Annual International Symposium on Computer Architecture, (ISCA’13), 푖 푖,푟 A. Mendelson, Ed. New York, NY, USA: ACM, June 2013, pp. 475–486. ∀퐼푖 ∈ 퐾, ∀푟 ∈ 푅푖 (푝), (푣푖, 푟) ∈ 퐸 [Online]. Available: https://doi.org/10.1145/2485922.2485963 [33] Sony Corporation and L. Project. LLVM analyzer. 퐼 퐾, ∑︁ 푝 [Online]. Available: https://llvm.org/docs/CommandGuide/llvm-mca. ∀ 푖 ∈ 푖,푟 = 1 html 푟 ∈푅푖 (푝) [34] C. Topper. (2018, Mar.) Update to the llvm scheduling model for The of an assignment (푝 ) , is: intel sandy bridge, haswell, broadwell, and skylake processors. execution time 푖,푟 푖,푟 [Online]. Available: https://github.com/llvm/llvm-project/commit/ ∑︁ 푡end = max 휎퐾,푖 · 푝푖,푟 cdfcf8ecda8065fda495d73ed16277668b3b56dc 푟 ∈R [35] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A lightweight 푖 ∈퐾 performance-oriented tool suite for x86 multicore environments,” in The minimal execution time over all valid assignments is 39th International Conference on Parallel Processing (ICPP) Workshops 푡 (퐾) 2010, W. Lee and X. Yuan, Eds. San Diego, California, USA: IEEE denoted (obtained using an optimal assignment).
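As a concrete reading of Definitions A.1 and A.2, the following sketch (hypothetical helper names, not part of PALMED's artifact) represents a microkernel as a multiset of single-μOP instructions, checks that an assignment is valid, and computes its execution time $t_{\mathrm{end}} = \max_r \sum_i \sigma_{K,i}\, p_{i,r}$:

```python
# Minimal sketch (not PALMED's implementation) of Definitions A.1/A.2.
# A disjunctive port mapping gives, for each single-uOP instruction, the
# set of ports it may execute on; an assignment p[i][r] spreads each
# instruction's executions over its allowed ports.

def is_valid(ports, p):
    """p[i][r] in [0,1]: fraction of i's instances run on port r."""
    for i, frac in p.items():
        if any(r not in ports[i] for r in frac):   # (v_i, r) must be in E
            return False
        if abs(sum(frac.values()) - 1.0) > 1e-9:   # fractions sum to 1
            return False
    return True

def t_end(sigma, p):
    """Cycles per loop iteration: max over ports of their total load."""
    load = {}
    for i, frac in p.items():
        for r, f in frac.items():
            load[r] = load.get(r, 0.0) + sigma[i] * f
    return max(load.values())

# Toy two-port machine: ADD may use p0 or p1, MUL only p0.
ports = {"ADD": {"p0", "p1"}, "MUL": {"p0"}}
sigma = {"ADD": 2, "MUL": 1}                       # kernel ADD^2 MUL^1
p = {"ADD": {"p0": 0.25, "p1": 0.75}, "MUL": {"p0": 1.0}}
assert is_valid(ports, p)
print(t_end(sigma, p))                             # 1.5 cycles/iteration
```

For this toy kernel, the assignment shown balances both ports at 1.5 cycles per iteration, which here is also the optimal $t(K)$: three unit μOPs must share two unit-throughput ports, and MUL is pinned to p0.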

Definition A.3 (Conjunctive port mapping). A conjunctive port mapping is a bipartite weighted graph $(I, \overline{\mathcal{R}}, \overline{E}, \rho_{I,\overline{\mathcal{R}}})$ where: $I$ represents the set of instructions; $\overline{\mathcal{R}}$ represents the set of abstract resources; $\overline{E} \subset I \times \overline{\mathcal{R}}$ expresses the required use of abstract resources for each instruction. Each abstract resource $r \in \overline{\mathcal{R}}$ has a (normalized) throughput of 1. An instruction $i$ that uses a resource $r$ ($(i,r) \in \overline{E}$) always uses the same proportion (number of cycles, possibly lower/greater than 1) $\rho_{i,r} \in \mathbb{Q}^+$; if $i$ does not use $r$, then $\rho_{i,r} = 0$.

Let $K = I_1^{\sigma_{K,1}} I_2^{\sigma_{K,2}} \cdots I_m^{\sigma_{K,m}}$ be a microkernel. In a steady-state execution of $K$, for each loop iteration, instruction $i$ must use resource $r$ for $\sigma_{K,i}\,\rho_{i,r}$ cycles. The number of cycles required to execute one loop iteration is:

$$\overline{t}(K) = \max_{r \in \overline{\mathcal{R}}} \left( \sum_{i \in K} \sigma_{K,i} \cdot \rho_{i,r} \right)$$

One should observe that Def. A.3 formally defines a normalized version of the graph used in the illustrative example of Sec. 2.3, where the throughputs of abstract resources are set to 1. For the sake of clarity, we used non-normalized throughputs (that is, different from 1) in Fig. 1b, with the following notations: use stands for the non-normalized usage, and load for the normalized $\rho_{i,r}$, equal to $\#\mathit{use}_i / \mathit{throughput}(r)$. For example, VCVTT uses 2 times $r_{01}$, which has a throughput of 2: its load $\rho_{\mathrm{VCVTT},r_{01}}$ is equal to 1. Similarly, $\rho_{\mathrm{ADDSS},r_{016}} = 1/3$.

Definition A.4 (Throughput). The throughput $\overline{K}$ of a microkernel $K$ is its instructions-per-cycle rate (IPC), defined as:

$$\overline{K} = \frac{|K|}{t(K)} = \max_{\text{valid assignment } p} \frac{\sum_{i \in K} \sigma_{K,i}}{\max_{r \in \mathcal{R}} \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r}}$$

for a disjunctive port mapping, and

$$\overline{K} = \frac{|K|}{\overline{t}(K)} = \frac{\sum_{i \in K} \sigma_{K,i}}{\max_{r \in \overline{\mathcal{R}}} \sum_{i \in K} \sigma_{K,i} \cdot \rho_{i,r}}$$

for a conjunctive port mapping.

Example. Given two instructions $a$ and $b$, $a^{\overline{a}} b^{\overline{b}}$ represents a microkernel repeating $a$ and $b$ as many times as their respective IPCs $\overline{a}$ and $\overline{b}$. We note its throughput $\overline{a^{\overline{a}} b^{\overline{b}}}$.

Definition A.5 (∇-dual conjunctive port mapping). Let $(V, \mathcal{R}, E)$ be a disjunctive port mapping. Let ∇ be a non-empty set of subsets of $\mathcal{R}$. We define its ∇-dual, a conjunctive port mapping, as $(V, \overline{\mathcal{R}}, \overline{E})$ such that:

$$\overline{\mathcal{R}} = \left\{ \overline{r_J},\ J \in \nabla \right\} \qquad \overline{E} = \left\{ (v, \overline{r_J}) \text{ s.t. } \{r,\ (v,r) \in E\} \subseteq J \right\} \qquad \rho(\overline{r_J}) = \sum_{r_j \in J} \rho(r_j) = |J|$$

Then, we can normalize this graph by adding weights to the edges and updating the resource throughputs, noted $\rho^N$ as they are normalized:

$$\rho^N_{i,\overline{r_J}} = \begin{cases} 1/\rho(\overline{r_J}) & \text{if } (i, \overline{r_J}) \in \overline{E} \\ 0 & \text{otherwise} \end{cases} \qquad \rho^N(\overline{r_J}) = 1$$

A.2 Equivalence between disjunctive and conjunctive formulations

This section provides the main intuition needed to understand the equivalence between the disjunctive and the conjunctive form.

Definition A.6 (Saturated port set). Consider a microkernel $K$. Let $(p_{i,r})_{i,r}$ be a valid assignment of $K$ for a disjunctive port mapping $(V, \mathcal{R}, E)$. Its saturated port set is defined as follows:

$$\mathcal{S} = \left\{ r_s \text{ such that } t_{\mathrm{end}} = \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r_s} \right\}$$

That is, $\mathcal{S}$ contains the resources $r_s$ whose loads $\sum_{i \in K} \sigma_{K,i} \cdot p_{i,r_s}$ correspond to $|K|/\overline{K}$, the steady-state execution time of $K$.

Lemma A.1 (Saturated set assumption). Let $(p_{i,r})_{i,r}$ be a valid assignment for a microkernel $K$ in a disjunctive port mapping $(V, \mathcal{R}, E)$ and $\mathcal{S}$ its saturated set. Let $r_s$ and $r_t$ be two resources such that $(v, r_s) \in E$ and $(v, r_t) \in E$; we assume $r_s \in \mathcal{S}$ and $r_t \notin \mathcal{S}$. Then, either there exists a faster valid assignment for which both resources $r_s$ and $r_t$ are saturated, or there exists a valid assignment whose saturated set is strictly smaller than that of $p$.

A direct consequence of this lemma is:

Corollary A.1 (Saturating assignment). Let us consider an optimal assignment $(p_{i,r})_{i,r}$ of a list of μOPs $K$ on a disjunctive port mapping $(V, \mathcal{R}, E)$, such that the size of its saturated set $\mathcal{S}$ is minimal. For all $v \in V$ such that there are $(r_x, r_y) \in \mathcal{R}^2$ connected to $v$ (i.e. $(v, r_x) \in E$ and $(v, r_y) \in E$): if $r_x \in \mathcal{S}$, then $r_y \in \mathcal{S}$. Thus:

$$\forall i \in \mathcal{I},\ \left[ R_i(p) \subset \mathcal{S} \Leftrightarrow \{r,\ (v_i, r) \in E\} \subset \mathcal{S} \right]$$

Theorem A.1 (Equivalence of ∇-duality). Let $K$ be a microkernel. Let $(V, \mathcal{R}, E)$ (with the set of resources $\mathcal{R} = \{r_j\}_j$) be a disjunctive port mapping, ∇ a set of subsets of $\mathcal{R}$, and $(V, \overline{\mathcal{R}}, \overline{E})$ (with the set of resources $\overline{\mathcal{R}}$ also denoted $\{\overline{r_J}\}_{J \in \nabla}$) be its ∇-dual.

(i) Let $(p_{i,r})_{i,r}$ be a valid optimal assignment (i.e. of minimal execution time and minimal saturated set size) of $K$ with regard to $(V, \mathcal{R}, E)$. This assignment can be translated into its ∇-dual, with no change to its execution time. In other words, $\overline{t}(K) \leq t(K)$.

(ii) If ∇ is the set of all subsets of $\mathcal{R}$, then $\overline{t}(K) = t(K)$.

Theorem A.2 (Equivalence). Let $K$ be a microkernel and $(V, \mathcal{R}, E)$ (with the associated throughput function $t$) be a disjunctive port mapping. Then, there exists a set ∇ of subsets of $\mathcal{R}$, and a conjunctive port mapping $(V, \overline{\mathcal{R}}, \overline{E})$, called the dual (with the associated throughput $\overline{t}$), whose set of resources $\overline{\mathcal{R}}$ is indexed by ∇, such that:

(i) Every optimal assignment $(p_{i,r})_{i,r}$ (i.e. of minimal execution time and minimal saturated set size) of $K$ with regard to $(V, \mathcal{R}, E)$ can be translated into $(V, \overline{\mathcal{R}}, \overline{E})$, with no change of its execution time. In other words, $\overline{t}(K) \leq t(K)$.

(ii) If ∇ is the set of all subsets of $\mathcal{R}$, then $\overline{t}(K) = t(K)$.

Proof. (i) Let $p$ be an optimal valid assignment for a list of μOPs $K$ on a disjunctive port mapping $(V, \mathcal{R}, E)$, which minimizes its saturated set size $|\mathcal{S}|$. From the definition of the execution time, we have:

$$\forall r \in \mathcal{R},\ \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r} \leq t(K)$$

Hence, for any subset of resources $J \subset \mathcal{R}$,

$$\sum_{r \in J} \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r} \leq t(K) \cdot |J|$$

Thus,

$$\forall J \subset \mathcal{R},\quad \frac{\sum_{r \in J} \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r}}{|J|} \leq t(K) \qquad (1)$$

Notice that this is an equality when $J$ is a subset of a saturated set $\mathcal{S}$ of any optimal placement $(p_{i,r})_i$: indeed, by the definition of the saturated set (Definition A.6), for each $r_s \in \mathcal{S}$ we have $t(K) = \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r_s}$.

We now prove that $\overline{t}(K) \leq t(K)$. Consider any combined port $\overline{r_J} \in \overline{\mathcal{R}}$, whose throughput is $\rho(\overline{r_J}) = |J|$. For any $\overline{r_J}$, by the definition of the dual:

$$\sum_{i \in K} \sigma_{K,i} \cdot \rho^N_{i,\overline{r_J}} = \sum_{i \in K} \sigma_{K,i} \cdot \frac{\delta_{(v_i,\overline{r_J}) \in \overline{E}}}{|J|} = \sum_{i \in K} \sigma_{K,i} \cdot \frac{\delta_{\{r,\ (v_i,r) \in E\} \subseteq J}}{|J|}$$

where $\delta_{\mathit{stat}} = 1$ if the statement $\mathit{stat}$ is true, and $\delta_{\mathit{stat}} = 0$ if it is false.

Now, let us show that for any assignment $(p_{i,r})_{i,r}$ in the disjunctive graph, we have:

$$\delta_{\{r,\ (v_i,r) \in E\} \subseteq J} \leq \sum_{r \in J} p_{i,r} \qquad (2)$$

In order to prove this statement, we consider the two cases for the value of $\delta$:

- If $\{r,\ (v_i, r) \in E\} \nsubseteq J$, then $\delta = 0$. Because the $p_{i,r}$ are positive values by definition, the inequality is trivially satisfied.
- If $\{r,\ (v_i, r) \in E\} \subseteq J$, then all the neighbors of $v_i$ in the disjunctive graph are in $J$. Thus, $R_i(p) \subseteq \{r,\ (v_i, r) \in E\} \subseteq J$, so $\sum_{r \in J} p_{i,r} = 1$. Notice that in this case, we have an equality.

Therefore, for any $\overline{r_J} \in \overline{\mathcal{R}}$:

$$\sum_{i \in K} \sigma_{K,i} \cdot \rho^N_{i,\overline{r_J}} \leq \sum_{i \in K} \sigma_{K,i} \cdot \frac{\sum_{r \in J} p_{i,r}}{|J|} \leq \frac{\sum_{r \in J} \sum_{i \in K} \sigma_{K,i} \cdot p_{i,r}}{|J|}$$

By using equation (1), we have $\overline{t}(K) \leq t(K)$.

(ii) Now, assuming that ∇ is not limited to a few subsets of $\mathcal{R}$, let us prove that $\overline{t}(K) = t(K)$. For a given optimal assignment $(p_{i,r})_{i,r}$, let us pick $J = \mathcal{S}$, its saturated set of minimal size, and show that for this particular $J$ the previously considered inequalities are equalities.

As mentioned previously, equation (1) is an equality when $J$ is a subset of the saturated set $\mathcal{S}$. Thus, we only need to show that inequality (2) is an equality for this $J$. Notice that for any instruction $v_i$, if there is an edge $(v_i, r) \in E$ with $r \in J = \mathcal{S}$, then by Corollary A.1 we have $\{r,\ (v_i, r) \in E\} \subseteq J$. Thus, given a $v_i$, there are two situations:

- Either there is no edge from $v_i$ to any $r \in J$; then $\delta_{\{r,\ (v_i,r) \in E\} \subseteq J} = 0$ and $\sum_{r \in J} p_{i,r} = 0$. Thus, we have an equality.
- Or there is an edge from $v_i$ to a saturated resource $r \in J$. Then, as mentioned before, $\{r,\ (v_i, r) \in E\} \subseteq J$ and $\delta_{\{r,\ (v_i,r) \in E\} \subseteq J} = 1 = \sum_{r \in J} p_{i,r}$. Thus, we also have an equality.

Therefore, the whole chain of inequalities linking $\overline{t}(K)$ to $t(K)$ consists of equalities. Thus, $\overline{t}(K) = t(K)$. □

We have an equality if ∇ is the set of all subsets of $\mathcal{R}$, whose size is exponential in the number of resources. However, the proof shows that we can restrict ourselves to the saturated set $\mathcal{S}$ of an optimal assignment. In practice, we build ∇ by considering the abstract resources that directly correspond to the sets of resources that a given μOP can be mapped to in the disjunctive mapping. Then, we recursively apply the following rule: if two abstract resources have a non-empty intersection, we add their union as a new abstract resource. Intuitively, this new abstract resource introduces a new constraint on the valid assignments in the dual, corresponding to a potential saturation of these resources. We end up with a set containing fewer than 14 elements in our experiments.
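The recursive construction of ∇ described above (take each μOP's port set, then close under union of intersecting sets) can be written as a small fixed-point computation. The helper below is a sketch for illustration, not PALMED's actual code:

```python
# Sketch (hypothetical helper, not PALMED's implementation) of the
# practical construction of nabla: start from the port set of each uOP,
# then, whenever two abstract resources intersect, add their union,
# until a fixed point is reached.

def build_nabla(port_sets):
    nabla = {frozenset(s) for s in port_sets}
    changed = True
    while changed:
        changed = False
        for a in list(nabla):
            for b in list(nabla):
                if a & b and (a | b) not in nabla:  # intersecting -> add union
                    nabla.add(a | b)
                    changed = True
    return nabla

# uOPs mapped to ports {0,1}, {1,5} and {0,6} (Skylake-like numbering).
nabla = build_nabla([{0, 1}, {1, 5}, {0, 6}])
print(sorted(sorted(s) for s in nabla))
```

On this example the closure adds {0,1,5}, {0,1,6} and {0,1,5,6} to the three initial sets, giving six abstract resources — far fewer than the 15 non-empty subsets of the four ports involved, as the proof above suggests.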
when verifies 1 − 푘 (a) (b) max푤푆,푟 ′ ≤ 푟 ′≠푟 푆 Figure 5. Conjunctive resource mapping and its extended And form; both normalized. ′ ∀푟 ,푤푟,푟 ′ ≠ 0 ⇒ 푤푆,푟 ′ = 0 We call 푆 an exclusive saturation of 푟 when 푆 is a 1-exclusive ′ 2 ′ •( 푟, 푟 ) ∈ 퐵 ⊂ R ⇔ ∀푖, 푤푖,푟 ′ ≥ 푤푖,푟 ∧ 휌(푟 ) > 휌(푟). saturation of 푟, which means that 푆 only uses the resource 푟. ′ Then, 푤푟,푟 ′ = 1, and (푟, 푟 ) ∈ 퐵 is said to be a back edge. 푣, 푟 , 푣, 푟,푤 ′ 퐸′ 푣, 푟,푤 퐸 • ∀( ) ( 푣,푟 ) ∈ ⇔ ( 푣,푟 ) ∈ with weight B.2 Completeness of the mapping 푤 ′ 푤 Í 푤 푤 푣,푟 = 푣,푟 − 푟 ′≠푟 푣,푟 · 푟,푟 ′ > 0. We assume that the solver finds an edge (퐼, 푟) when we pro- 푤 ′ 퐸′ If 푣,푟 is reduced to 0, the edge is not in . vide a microkernel containing at least once 퐼 that saturates 푟. An illustrative example is given in figure 5. This means that the solver is smart enough to detect that an instruction which uses a resource (and its amount) as long Definition B.2 (Resource usage). Given a bipartite conjunc- as we provide a benchmark using the instruction limited by it. tive port mapping and a microkernel K, we note 푤퐾,푟 the use 푟 of the resource during the execution of K, i.e. Theorem B.1 (Completeness of the output mapping). Let ∑︁ 푟, 푟푆푇 and 푟푆 be three resources, and 푆 and 퐼 two microkernels 푤퐾,푟 = 푤푣 ,푟 푖 represented as a single vertex combining the use of all their 푣푖 ∈퐾 instructions, forming a normalized conjunctive bipartite map- ping on the expanded form (see definition B.1). For the sake of Definition B.3 (Load of a resource). Given the conjunctive port mapping (푉, R, 퐸) on the extended form and a microkernel simplicity, we will use greek letters instead of multiple indexes of 휌 in this proof: 퐾, we note load(푟) the normalized use of the resource 푟 during ℓ, ℓ ′ 푟 푟 the execution of the microkernel of execution time 푡푒푛푑 , i.e. • (possibly 0) are the back edge from 푆 (resp. 푆푇 ) to 푟 푟 휌 휌 (resp. 
푆 ), which corresponds to 푟푆 ,푟 and 푟푇 ,푟푆 , respec- Í 푤 + Í ′ 푤 ′푤 ′ (푟) = 푣∈퐾 푣,푟 푟 ∈R 퐾,푟 푟 ,푟 tively load 푡 푒푛푑 훼, 훽 훾 휌 휌 휌 • , and are used to denote 퐼,푟푆푇 ,, 퐼,푟푆 , and 퐼,푟 . 푆 Definition B.4 (Normalised resource mapping). The nor- Let us assume that verifies the following properties: 푉, , 퐸′ 푆 1 푟 malized version ( R ) of a conjunctive (or disjunctive) • realize a 4 -exclusive saturation of 푆 resource mapping is the semantically equivalent resource map- • 푆 uses another resource 푟푇 with a coefficient 훼푆푇 (noted 푉, , 퐸 휌 ping ( R ) where the throughput of every resource has been 푆,푟푇 with the former notation) normalized to 1, thus decreasing the value of the edges: • 푆 does not use 푟 apart from the contribution from 푟푆푇 and 푟 푤퐸 푇 ′ 푣,푟 푤퐸 = 푆4·푆 퐼 퐼 푟 푣,푟 휌(푟) Then the benchmark saturates 푆 , thus allowing the solver to find the link 퐼 → 푟푆 . In the extended resource mapping form, the values of the back- edges become: 푤퐵 Proof. The graph representing the resource usage for one 푤퐵′ 푟,푟 ′ 푆 푟,푟 ′ = execution of the benchmark is represented in figure 6a and 휌(푟 ′) 퐼 푆 1 푟 figure 6b for . As realizes a 4 -exclusive saturation of 푆 , then 휌푆,푟 = 1/푆, so that 푆 repetitions of 푆 are needed to load Lemma B.1 (Bounds of the normalized back edges). On a 푆 resource 푟푆 with the value 1. normalized bipartite resource mapping on the extended form: Let us consider the benchmark 푆4·푆 퐼 퐼 , illustrated in figure 1 6c. The load of 푟푇 is: 0 ≤ 푤 ′ ≤ 푟,푟 2 load(푟푇 ) = 4훼푆푇 푆 + 훾퐼 15 γ Similarly, S I load(푟푇 ) ≤ 4훼푆푇 푆 + 훾퐼 ≤ 3 + 1 α 1/S ST α β ≤ 4

To obtain a lower bound on load(푟푆 ), we use similar bounds: ′ rT rS r 훼 ≥ 0 and ℓ ≥ 0, so rT rS r ′ l'α l/S+ll'α (푟 ) = ℓ ·(4훼 푆 + 훾퐼) + 훼퐼 + 4 ST ST l+ll'γl'γ load 푆 푆푇 ′ (a) (b) ≥ 4 + 퐼 ·(ℓ 훾 + 훼) 푟 푆 훾 γI So, when 푆 is used by , either indirectly by > 0 and ′ S I ℓ > 0 or directly when 훼 > 0, then load(푟푆 ) > load(푟푇 ) and load(푟푆 ) > load(푟). So 푟푆 is the bottleneck, and the solver 4αSTS 4 αI βI will find the edge 퐼 → 푟푆 . □ Note that this proof still stands when 푆 and 퐼 use several rT rS r resources that indirectly contribute to 푟푆 , as it is be equivalent to a bigger value of 훾. l'(4αSTS+γI) l(αI+4)+ ll'(4αSTS+γI) (c)

Figure 6. Saturating benchmarks 푆 and instructions to anal- yse 퐼: individual uses (6b and 6a), and benchmark 푆4·푆 퐼 퐼 (6c).

The load of 푟푆 is: 푆 load(푟 ) = ℓ ′ · load(푟 ) + 훼퐼 + 4 · 푆 푇 푆 ′ = ℓ ·(4훼푆푇 푆 + 훾퐼) + 훼퐼 + 4 ′ ′ = 4ℓ 훼푆푇 푆 + ℓ 훾퐼 + 훼퐼 + 4 And the load of 푅 is:

load(푅) ≤ ℓ · load(푟푆 ) + 훽퐼 ′ ′ = 4ℓ ℓ훼푆푇 푆 + ℓ ℓ훾퐼 + ℓ훼퐼 + 4ℓ + 훽퐼 1 1 By lemma B.1, ℓ ≤ and ℓ ′ ≤ . 2 2 Then 푆 퐼 퐼 4 load(푟) ≤ 4훼 + 훾 + 훼 + + 훽퐼 푆푇 4 4 2 2 1 But, by definition of the IPC, max (훼, 훽,훾) = , so 퐼 1 1 4 1 load(푟) ≤ 훼 푆 + + + + 푆푇 4 2 2 2 13 ≤ 훼 푆 + 푆푇 4 As 푆 realizes a 1 -exclusive saturation of 푟 , then 훼 ≤ 3 , 4 푆 푆푇 4·푆 so 3 · 푆 13 load(푟) ≤ + 4 · 푆 4 ≤ 4
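The load computations in this proof are easy to check numerically. The sketch below (toy parameter values chosen to satisfy the hypotheses, not measurements from the paper) evaluates load($r_{ST}$), load($r_S$) and an upper bound on load($r$) for the benchmark $S^{4\overline{S}} I^{\overline{I}}$, and confirms that $r_S$ dominates when $I$ uses it:

```python
# Numeric sanity check (toy values, not from the paper) of the load
# computations in the proof of Theorem B.1 for the benchmark S^(4*Sbar) I^(Ibar):
#   load(r_ST) = 4*a_ST*Sbar + g*Ibar
#   load(r_S)  = l2*load(r_ST) + a*Ibar + 4     (l2: back edge r_ST -> r_S)
#   load(r)   <= l1*load(r_S) + b*Ibar          (l1: back edge r_S  -> r)

def loads(Sbar, Ibar, a_ST, a, b, g, l1, l2):
    r_ST = 4 * a_ST * Sbar + g * Ibar          # 4*Sbar copies of S, plus I
    r_S = l2 * r_ST + a * Ibar + 4             # back-edge term + direct uses
    r = l1 * r_S + b * Ibar                    # upper bound on load(r)
    return r_ST, r_S, r

# Toy values respecting the hypotheses: a_ST <= 3/(4*Sbar), back edges <= 1/2,
# and I uses r_S directly (a > 0).
Sbar, Ibar = 2.0, 1.0
r_ST, r_S, r = loads(Sbar, Ibar, a_ST=0.25, a=0.5, b=0.25, g=0.25, l1=0.5, l2=0.5)
assert r_S > r_ST and r_S > r   # r_S saturates: the solver sees I -> r_S
```

With these values load($r_{ST}$) and the bound on load($r$) both stay below 4 while load($r_S$) exceeds 4, matching the bounds derived in the proof.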
