Measuring Software Performance on Technical Report

November 21, 2018

Martin Becker Samarjit Chakraborty Chair of Real-Time Computer Systems Chair of Real-Time Computer Systems Technical University of Munich Technical University of Munich Munich, Germany Munich, Germany [email protected] [email protected]

OS program program CPU .text .bss + + .data +/- + instructions cache branch + coherency scheduler misprediction core + pollution + migrations data + + interrupt L1i$ miss access + + + + + + context mode + + (TLB flush) TLB + switch data switch miss L1d$ +/- + (KPTI TLB flush) miss prefetch +/- + + + higher-level + cache miss walk + + multicore + + (TLB shootdown) TLB coherency page DRAM + page fault + cache miss + + + disk + major minor I/O

Figure 1. Event interaction map for a program running on an Intel Core processor on Linux. Each event itself may cause processor cycles, and inhibit (−), enable (+), or modulate (⊗) others.

Abstract that our measurement setup has a large impact on the results. Measuring and analyzing the performance of software has More surprisingly, however, they also suggest that the setup reached a high complexity, caused by more advanced pro- can be negligible for certain analysis methods. Furthermore, cessor designs and the intricate interaction between user we found that our setup maintains significantly better per- formance under background load conditions, which means

arXiv:1811.01412v2 [cs.PF] 20 Nov 2018 programs, the , and the processor’s microar- chitecture. In this report, we summarize our experience on it can be used to improve high-performance applications. how performance characteristics of software should be mea- CCS Concepts • Software and its engineering → Soft- sured when running on a Linux operating system and a ware performance; modern processor. In particular, (1) we provide a general overview about hardware and operating system features Keywords Software performance, Linux, Hardware Coun- that may have a significant impact on timing and explain ters, Microarchitecture, Jitter their interaction, (2) we identify sources of errors that need to be controlled in order to obtain unbiased measurement 1 Introduction results, and (3) we propose a measurement setup for Linux Countless performance tests of software are available online to minimize errors. Although not the focus of this report, we in blogs, articles, and so on, but often their significance can describe the measurement using hardware perfor- be refuted by their measurement setup. Speedups are usu- mance counters, which can faithfully point to performance ally reported in units of time, yet without any introspection bottlenecks on a given processor. Our experiments confirm how exactly the differences came to live. Not only are the results questionable when the software runs on a different Unit Properties machine, but even on identical processors with identical Processor Intel Core i7-2640M @2.8GHz, dual-core memory configurations and peripherals, there are many ex- Microarchitecture “Sandy Bridge”, Microcode version 0x25 ternal factors that can influence the result. One such factor L1i/d cache 32kB, 8-way, 64B/line, per core L2 unified cache 256kB, 8-way, 64B/line, writeback, per core is the operating system (OS) itself. Different configurations L3 uncore cache 4MB, 16-way, 64B/line, shared between all cores and usages of OSes might nullify or magnify some effects, TLB 2-level, second level unified, per core and thus the performance measurements do not necessarily OS Debian 8.11 GNU/Linux SMP 3.16.51-3 reflect the characteristics of the software that we actually Table 1. System specifications want to analyze and improve. In this report, we describe how performance measure- ments of user software should be conducted, when running taken. The numbers are all based on the following system on a mainline Linux OS and a modern multi-core processor. penalties, caused by the typical design of CPUs and OSes. Specifically, we are concerned with real-time measurements Penalties: The various latencies for our system are sum- taken on the real hardware, providing quantitative informa- marized in Tab. 2, and discussed in the following. They are tion like execution time, memory access times and power taken from the processor documentation [12], and extended consumption. We expect that the setup of the OS and hard- by measurements using lmbench [24] and pmbench [32]. The ware have a significant impact on the results, and thata values denoted “best case” stem from the manufacturer doc- proper setup can greatly reduce measurement errors. umentation. Our measurements show that these values are This document is structured as follows: We start with a also the most frequently observed ones. Note, however, that general survey of microarchitectural and OS elements affect- penalties can (and should) be hidden by out-of-order pro- ing performance in modern processors in Section 2. In the cessing. That is, not every page walk delays computation for next section we list possible sources of measurement errors, 30 cycles. followed by a proposed measurement setup in Section 4. Fi- nally, we show several examples of how the setup influences Event Condition Latency [cycles] the results, before we conclude this report. L1 cache hit best case 4 L2 cache hit best case 12 2 Microarchitectural and OS Performance L3 cache hit best case 26..31 Modulators DRAM access best case ≈ 200 branch misprediction most frequent 20 We start by giving an overview on hardware and OS fea- TLB miss 2nd-level TLB hit 7 tures and how they influence the performance of software. page walk most frequent ≈30 In essence, this section is an elaboration of the interactions minor page fault most frequent ≈1,000 .. 4,000 depicted in Figure 1. The well-informed reader might di- major page fault most frequent ≈260k .. 560k rectly proceed with the subsequent section, which identifies context switch best case ≈3,400 sources of measurement errors based on the details presented Table 2. Penalties for system in Table 1 here. Although we try to keep the explanations generic, it would be impractical to make only statements that cover any con- Software Performance: In this report, we look at software ceivable processor. The details are therefore given for an In- performance, mainly from a timing point of view. Specifically, nd TM tel 2 generation Core x86-64 microarchitecture (“Sandy for the execution time of the process, we only consider the Bridge”). This is a pipelined, four-width superscalar multi- time where the processor executes instructions on behalf of core processor, with out-of-order processing, speculative the process (including kernel code and stalls), but not sleep- execution, a multi-level cache hierarchy, prefetching and ing or waiting states, since the latter are either voluntary or (MMU). This kind of processor depend on the execution context and not the program itself. is currently used in high-end consumer computers, and has well-proven architectural features that we expect to see in 2.1 Microarchitecture embedded processors in the coming years; many of them As an overview about the features discussed next, consider are already available in ARM SoCs [3]. The details of our the microarchitectural block diagram shown in Fig.2. machine are summarized in Tab. 1. As operating system (OS) we consider Linux, specifically in an SMP configuration. 2.1.1 Number of Cycles and Instructions Along this report, we will give some numbers to illustrate As a single number indicating program speedup, we first the magnitude of some effects. These numbers are given to look at the number of processor cycles spent on execution. the best of our knowledge, the best available vendor docu- Lower is better. The number of clock cycles is primarily mentation, and supported by measurements that we have driven by the instructions being executed, whereas the exact 2 Front End Instruction Cache Tag L1 Instruction Cache µOP Cache 32KiB 8-Way Instruction Tag TLB 16 Bytes/cycle Branch Predictor Instruction Fetch & PreDecode (16 B window) (BPU) 32B/cycle MOP MOP MOP MOP MOP MOP

Instruction Queue Macro-Fusion (40, 2x20 entries)

MOP MOP MOP MOP MOP

MicroCode 4-Way Decode Sequencer ROM Complex Simple Simple Simple (MS ROM) Decoder Decoder Decoder Decoder

1-4 µOPs µOP µOPµOP

4 µOPs Stack Engine 4 µOPs (SE) Decoded Stream Buffer (DSB) Adder Adder Adder (µOP Cache) 4 µOPs (1.5k µOPs; 8-Way) MUX (32 B window)

Loop Stream Allocation Queue (IDQ) (2x28 µOPs) Micro-Fusion Detector (LSD)

4 µOP µOPµOP µOPµOP Branch Order Buffer Register Alias Table (RAT) (BOB) (48-entry)

Common Data Buses (CDBs) Data Common Load FP Rename / Allocate / Retirement Ones Idioms Zeroing Idioms ReOrder Buffer (168 entries) Int Vect µOPµOP µOPµOP µOPµOP 256KiB 8-Way Scheduler

Integer Physical Register File Vector Physical Register File Uni fi ed STLB Int Unified Reservation Station (RS) L2 Cache (160 Registers) (144 Registers) Store (54 entries) 32B/cycle Port 0 Port 1 Port 5 Port 2 Port 3 Port 4 µOP µOP µOP µOP µOP µOP To L3

INT ALU INT ALU INT ALU AGU AGU Store Data INT DIV INT MUL Vect Shuffle Load Data Load Data INT Vect ALU INT Vect ALU INT Vect ALU INT Vect MUL FP ADD Branch 128bit/cycle FP MUL FP DIV Vect Shuffle EUs Execution Store Buffer & Forwarding

Load Buffer 32B/cycle Engine (36 entries) (64 entries)

16B/cycle 16B/cycle

L1 Data Cache Data TLB 16B/cycle 32KiB 8-Way Line Fill Buffers (LFB) Memory Subsystem (10 entries)

Figure 2. Microarchitecture of Intel Sandy Bridge as example of a superscalar out-of-order processor with caches, branch predictor and prefetcher. Taken from [4], and checked for consistency against manufacturer datasheets [12].

3 relationship is defined by the implementation of the microar- example, an arithmetic division can take several cycles to chitecture. The relationship is usually nontrivial and not complete, meanwhile successor instructions can be executed, fully documented, therefore we have forgone indicating the up to the point where the result of the division is required. relative influence of events of clock cycles in Fig.1. Wewill Similarly, memory access latencies can be hidden. This is provide typical numbers in the following descriptions, as far one of the primary reasons, while it is nontrivial to predict as they are known. the number of clock cycles that a certain event slows down If we count both the number of retired instructions and a program. processor cycles, we can compute the ratio instructions per As a side effect, such out-of-order (OoO) processing en- cycle (IPC). While often used as a first performance indicator, ables superscalarity. By having more than one instance of this figure is highly program-specific, and not helpful for each functional unit (e.g., several ALUs or FPUs), it becomes judging our software. For example, a program may execute possible to execute several instructions in parallel. Thus, OoO faster than before although the IPC has dropped, simply by processors often issue multiple instructions at the same clock virtue of a reduced number of instructions. cycle; for our machine, four uops are issued simultaneously, Micro-operations (uops): Some processors split instruc- thus the maximum IPC is four. tions into multiple smaller operations [3, 12], which pro- An algorithm for OoO was proposed by Robert Tomasulo vides more possibilities for out-of-order processing [8]. On in 1967 [28], and today’s out-of-order processor implemen- such targets, it is more meaningful to measure uops instead tations still follow the same concept [8, 12]. The OoO im- of instructions, since a single instruction may become a plementation is often called execution engine, and for the variable-length series of uops. Some, usually more complex purpose of our performance measurements, the following instructions, may even invoke Microcode. constituents must be considered: 1. Register Renaming and Microcode: On some processors, instructions are not hard- 2. Scheduler. wired but interpreted, to be broken down to hardwired ma- Register Renaming: Register names used by the compiler chine code during execution. This abstraction layer can incur are those defined by the instruction set architecture (ISA), a slowdown, but it allows upgrading processors during their called architectural registers. Due to the limited number of lifetime, e.g., to fix processor errata. On x86, microcode is registers, the compiler might be forced to re-use registers only used for few complex instructions; specifically, only for for computations that are otherwise not related, creating instructions generating more than four uops in our proces- false data dependencies. However, actual implementations sor [8]. The performance can be impacted in two ways: First, of the ISA may have more physical registers than architec- switching between the regular instruction stream and Mi- tural ones. Therefore, the first step is to map architectural to crocode may incur a penalty, and second, Microcode generat- physical registers, while resolving some false dependencies. ing a lot of uops may be limited by the frontend bandwidth. This renaming process therefore improves superscalar and Uops Cache: Instructions are decoded into uops by a hard- OoO processing [8]. ware circuit that can be limited in throughput, causing a Scheduler: The renamed operations now enter the main bottleneck if the average instruction length exceeds its ca- core of OoO processing, the scheduler. Here operations are pacity. Some processors add a cache to store decoded uops, queued up for execution units in reservation stations, and in an attempt to alleviate this problem [8]. will be held there until all operands become available. When this becomes the case, the operation is started on the execu- 2.1.2 Pipelining tion unit. As soon as it completes, results are propagated to Instead of processing one instruction or uop after another, all subscribers (e.g., registers or reservation station entries processing is often is temporally overlapping to reduce ex- waiting for a result), and the operation is forwarded from the ecution time, called pipelining. For example, while one in- reservation station to the Reorder Buffer. From here, most struction is being executed, the subsequent one can already schedulers also take care of retiring the instructions in their be decoded to uops in parallel, and meanwhile the uops original order. A bottleneck may occur when the scheduler from the previous instruction can be retired. Today it is very stalls execution due to lack of free reservation stations. rare to find processors or even microcontrollers that do not implement some form of pipelining. 2.1.4 Branch Mispredictions Many microarchitectures perform some kind of branch pre- 2.1.3 Out-of-Order and Superscalar Processing diction, to hide the latency for loading the instructions and Modern processors implement dynamic of in- data after a branch [3, 8]. Predictions for both the outcome structions [3, 8]. That is, the order in which instructions are of two-way branches, and the (possibly multi-way) target executed, may deviate from the original order of the instruc- address of indirect branches, are being made by the branch tion stream given by the program counter. This enables the prediction unit. processor to hide various processing latencies, by perform- If the branch is incorrectly predicted, then the pipeline and ing other work while waiting for an instruction to finish. For other resources must be flushed, which means there is atime 4 penalty to fetch and decode the correct instructions. The data in the faster cache, before accessing the slow memory. magnitude of the penalty primarily depends on the pipeline If we find the data, called a cache hit, we have circumvented depth. In case of Sandy Bridge, the penalty for flushing and waiting for the slow memory. If the cache does not contain resuming execution is about 20 cycles [8, 12]. the data, called a cache miss, we have to pay the penalty for accessing the slow memory (usually DRAM). After this, the 2.1.5 Speculative Execution data is placed in the cache for future reference. The time window between branch prediction and learning Since caches have to be small to be fast, there are inevitably the actual branch outcome is spent with speculative execution. situations when data is not in the cache, and needs to be That is, the processor continues the control flow at the as- loaded from slower memory. Even if we were able to per- sumed branch target, and buffers the results until the actual fectly predict what data (including instructions) is needed, branch target becomes known. Whenever the prediction was then there are still compulsory misses on first access. As a incorrect, it flushes the pipeline as described above. Ifthe result, execution can be slowed down by one or two orders prediction was correct, the speculatively executed instruc- of magnitude even with fast caches and very high hit ratio. tions are allowed to be released from the buffer (“retire”), For example, let us assume each instruction takes one cycle, and no time was lost waiting for the branch outcome. and that each cache miss costs five extra cycles. Then, even There is one lesser known side effect, however, dubbed with p = 95% hit ratio, a program with X instructions would pollution in Fig.1. Although instructions executed during take X + X(1 − p)5 = 1.25X cycles, i.e., experience a 25% mis-speculation are not retired, they can still cause changes slowdown. Measuring cache behavior is therefore important. in cache and buffer states. These effects are indirect cost of Victim Caches: These are small and fully associative caches, branch mispredictions, which manifest themselves during holding items evicted from a larger, not fully-associative one. later execution. These effects have recently been exploited Each miss in the large cache is first looked up in the victim in the Spectre and Meltdown vulnerabilities [19]. cache, before the slower memory is consulted. This masks Last but not least, even perfect speculation might become miss times for temporally close conflict misses. From a prac- a performance bottleneck in some cases. All speculatively tical point of view, the victim cache need not be considered taken branches are stored in the branch order buffer (BOB) separately; instead, a victim hit can be seen as a hit in the until they are confirmed. However, too many speculations in faster cache. Vice versa, a victim miss can be seen as a hit in a short time window might cause the BOB to fill up, which the next-slower memory or cache. in turn stops the issue of new uops [12]. Hierarchy: As most modern processors, our machine uses a hierarchy of caches. That is, there are three caches in a 2.1.6 Machine Clears cascade (three “levels”) before the slower DRAM is accessed. When multiple threads can run truly in parallel (as on SMP The first level (L1) is the fastest/smallest and separate for systems and especially with OoO processing), the ordering of data (L1d) and instruction (L1i), see Tables 1 and 2. Leaving memory accesses must be monitored and ensured. If the CPU aside special architectural tricks, this is the only level that detects that accesses complete differently to program order, can be accessed directly by the CPU. Higher levels (L2, L3) a machine clear is performed. This entails undoing some are larger and slower, and usually unified. Depending on the operations, flushing the pipeline, and re-starting with the processor, some caches can be shared with peripherals or correct operands [12]. The cost is comparable to a branch other processors. misprediction. Further causes for machine clears are self- modifying code and illegal AVX addresses. 2.1.8 DRAM Access The next-larger memory after the cache hierarchy, is Direct 2.1.7 Cache Hierarchy and Misses Random Access Memory (DRAM). Conceptually, the cache Although memory access is fast these days, it can still be or- hierarchy acts as a frontend and buffer to DRAM accesses. ders of magnitudes slower than the processor, which has be- Only if the lookup of data or instructions missed at all levels come known as the Memory Wall [31]. Consider the penalty in the cache hierarchy (Fig.1 “higher-level cache miss”; not for a major page fault (disk access) shown in Tab. 2: At least necessarily sequentially, though), then the DRAM is con- five orders of magnitude lie between the processor andthe sulted. Accesses to DRAM are typically 100 times slower disk access, although we use a relatively fast Solid State than the CPU, thus the penalty missing the entire cache hier- Disk (SSD). Caching is the omnipresent approach to coun- archy becomes steep. Even worse, DRAM is often accessed teract this issue, by preloading or latching all data in faster, via the Northbridge, possibly suffering contention with other smaller memories, and exploiting the principle of locality. CPUs and DMA transfers [7]. That is, data that has been recently accessed, is likely to be accessed soon in the future again. It thus makes sense to 2.1.9 Hardware Data Prefetching buffer recently used data in caches. Every time a data item To further reduce access times to slow memory, many pro- is requested (“data access” in Fig.1), we check for the desired cessors have a data prefetcher circuit, which predicts future 5 data accesses and actively pre-loads the data into caches. data reference (see Fig.1), thus there is an obvious incen- Most prefetchers are triggered by certain access patterns in tive to minimize delays. Therefore, in practice the address cache misses [7]. Some newer processors may even cross translation process is very intricate and specific to the mi- page boundaries and thus influence the TLB12 [ , §2.4.7]. In croarchitecture. An in-depth description can be found in [7]. summary, prefetch events both depend on and influence L1d Usually a hardware unit called Memory Management Unit cache events, as shown in Fig.1. (MMU) is responisble for the address translation. To en- sure translations do not stall execution, most MMUs have 2.1.10 ISA Extensions, Streaming/Vector/SIMD their own caches, called Translation-look-aside buffers (TLBs), Processors may extend the ISA with instructions applying which hold the most recently translated addresses. In case the same operation on multiple data items (vectors, single- the buffers do not provide the needed translation (TLB miss), instruction-multiple data). Examples are the extensions AVX, then a slow search in memory has to be conducted, called SSE, MMX on Intel and AMD, and NEON on ARM proces- a page walk, see Fig.1. On x86 machines, this means that sors. Using these can greatly speed up certain calculations, a dedicated hardware circuit starts looking for but switching to these modes may also cause extra penal- entries in memory, which incurs a penalty depending on ties [8, §9.1.2]. In general, any switch between ISA modes or memory access latency. For other architectures, like some extensions may cause extra penalties. PowerPC, this might be done in software, and thus is even slower. In case the information was found in the page table, 2.1.11 Direct Memory Access (DMA) the TLB is updated and execution resumes. Otherwise, the Accessing secondary storage, such as hard drives, but also operating system is signaled a page fault, and has to decide traffic from network cards, usually takes place via DMA, whether the access is allowed. If so, the OS updates the page which enables peripherals to exchange data directly with table and TLB with the requested translation, and possibly the DRAM. As mentioned earlier, this might lead to bus bring missing data to the DRAM. The cost for virtual address contention and slow down DRAM accesses for the processor. translation therefore is consisting of cycles spent in TLB lookup (hardware), plus page walk cycles (often hardware), 2.1.12 Cache Coherency Protocols and ring bus plus cycles to handle page faults (OS). On multi-core systems, further cache accesses are caused by Translation-Lookaside-Buffer: TLBs are regularly flushed cache coherency protocols, e.g., between the caches of the by the OS, since it is responsible for the coherency between different cores [7]. It becomes active during core migrations, TLBs and page tables in memory, and that the TLBs do not but also in the presence of data shared between cores. hold stale translations from a formerly running process. They can either be flushed or selectively updated, depending on 2.1.13 Neglected Features the OS and hardware capabilities. Accesses after a flush are TLB misses, therefore flushes degrade performance. In the There are many peculiarities to each microarchitecture. We x86 architecture, there furthermore exists the case of TLB have omitted many of such details, in an attempt to focus on shootdown. This stems from the need to have consistent TLB those features prevalent in most processors. Of course, these entries between multiple cores in the presence of sharing. details can be important for the performance, and a careful Since x86 does not have a coherency protocol in place, TLB study of the microarchitecture is required to see which need contradictions between cores are avoided by triggering flush- to be measured. Omissions include instruction and uop fu- ing in hardware. Last but not least, TLB and cache access sion, loopback buffers, register stalls, exhaustion of execution can be executed in parallel, to reduce latency. ports, cache bank conflicts, misaligned memory access, and Page Walks: Examining the page tables in memory incurs a store forwarding stalls. These are not considered to have a penalty that depends on whether the tables are cached (L1d, large or systematic performance impact on Sandy Bridge [8]. L2, L3), or whether the slow DRAM needs to be consulted. On our machine, a typical page walk costs between 20 and 2.2 Microarchitecture and OS Interaction 60 cycles. In an extreme case (“TLB trashing”), this could 2.2.1 and Paging happen for every instruction, and thus become very expen- Virtual Addressing is common in larger general-purpose sive. On some microarchitectures, as in Intel Sandy Bridge, and application processors, and is done for a variety of rea- page walks go through the caching hierarchy. That is, page sons, chief among which are process isolation and provision tables are buffered in L1d and below, and thus caches canbe of a contiguous address space from the process’ point of modified by page walks. Consequently, caching behavior and view. However, it also enables paging, which helps to sig- TLB misses cannot always be separated, even in the absence nificantly mitigate the latency of accessing slow secondary of page faults. Additionally, it has recently been disclosed storage, such as hard drives. Unfortunately, Virtual Address- that Intel processors speculatively work on possibly invalid ing comes with a performance penalty that can vary highly. Translation needs to be performed for every instruction and 6 cache entries in parallel to page walks (see “L1TF” vulnera- and 4,096 cycles, again coinciding with the median. Note bility [6]). It can therefore be assumed that even the latency that this includes lazy allocations. The numbers are consis- of page walks is partially hidden. tent with recent measurements of Torvalds on the successor Page Faults: If a translation cannot be found in the page microarchitecture Haswell [29]. table, a page fault is signalled from the MMU to the OS (Fig.1 bottom). There are three fundamental types of page faults: 2.2.2 Context Switch 1. Invalid page fault, 2. major page fault, or 3. minor page Switching between processes happens frequently (discussed fault. later in Section 2.2.6), and is an expensive operation that Invalid page faults are those caused by an attempt to ac- influences the TLB and caches, see Fig.1. cess addresses that are beyond the process’ address space, or Besides executing a number of instructions to save and where privileges are insufficient. An example are segmenta- load hardware registers, stack pointer and PC of suspended tion faults. We do not discuss them further, since they are and waking processes, the pipeline is flushed, and the TLB pathological events pointing to faulty software. must also be updated [21]. On most Linux versions the TLB Major page faults require disk access, which is orders update takes the form of a flush, accompanied by switch- of magnitude slower than the effects we are trying to ob- ing the page table pointer. This can only be avoided with serve here. For our SSD-equipped machine, we used the hardware that supports process context identifiers (PCIDs) pmbench [32] tool and found the most frequent latency for and with newer kernels supporting this feature – such as major faults as between 262 thousand and 524 thousand cy- x86 on Kernel 4.14 onwards [22]. The penalty for such TLB cles (coinciding with the median), same for both read and invalidation is called the direct cost of a context switch [21]. write accesses. Additionally, the distribution is tail-heavy, However, the CPU caches are also affected, which incurs similar as in [32]. That is, the estimated average is about indirect costs: Because processes share caches among them- one million cycles, due to some accesses exceeding several selves and also with the kernel, the waking process may not dozens of milliseconds. find its data in the caches as had been left when itgotsus- Closely related to page faults, Linux implements a page cache: pended, and experience cache misses. This effect magnifies the virtual memory buffers data blocks of recently used files. with growing size [21]. Context switches can When files are read (e.g., via fread or mmap), then access is by happen involuntarily due to interrupts, or voluntarily due to default buffered via the page cache. Therefore, if a program system calls (both discussed later). Using lmbench, we found is executed a second time and there was sufficient memory, that context switches on our system take at least 3,400 cycles, there ideally are no major page faults due to file access. To with a frequent value around 30,000 cycles. Note that this further reduce first-time access latency, the number would increase if a core migration happens at the proactively reads file data from disk before it is demanded same time, which is explained next. (“read ahead”), which is no longer counted as major page fault. Finally, if DRAM is exhausted and swapping is enabled, 2.2.3 Core Migrations and Load Balancing unused pages are temporarily written to secondary storage The OS (and the hardware, if Hyperthreading is enabled) may (“swapped out”). When they are needed again, they have to migrate a running process between cores (“core migrations” be brought back to DRAM, which also causes major page in Fig.1), to balance load or thermal stress. This causes a faults [9]. context switch with additional overhead. The process must Minor page faults are caused by memory allocation with- be stopped, copied to another core’s run queue, cache lines out disk access, but may still be problematic for our mea- are moved to the new core [7], and only then both scheduling surements. The memory can either be immediately allocated, domains are released. Naturally, this requires to run a lot or only reserved (“lazy”). In the latter case, which is the de- of kernel code, which in turn increases chances for events fault case in Linux, the page is only created when the first like branch mispredictions, cache pollution and page walks. write occurs (“copy-on-write”). This means that the penalty We have observed latencies of beyond 100,000 cycles for of minor page faults consists of two parts: the cost for the migrations, depending on the working set size. fault handler itself, and the conditional copy-on-write cost. Measuring the cost of a minor page fault is unequally more complicated than major faults. First, the latency is relatively 2.2.4 Mode Switch short, resulting in inaccuracy when we try to measure it Switching between user and kernel mode is not a context using software. Tracing kernel functions is one option that switch, and thus has lower overhead. These switches are we have exercised, but it has some non-negligible overhead caused by system calls done by the running program (thus in our kernel version, only giving us an upper bound. Hard- the bidirectional interaction between instructions and mode ware measurements are not possible in this case. The result switches in Fig.1) for which the kernel is supposed to perform is a somewhat wide range for the minor page fault cost. Us- some work on behalf of the calling process. The kernel is ing pmbench, we found that the mode lies between 1,024 then said to be in process context. 7 In Linux on x86, these calls involve copying the arguments One special periodic interrupt is the local timer interrupt to registers, triggering a trap (CPU changes to kernel mode), driving the scheduler, also called “scheduling clock ticks”, executing the trap handler (which copies the arguments from and discussed next. the registers to the kernel stack and then performs its work), and eventually returning to user mode. Depending on the 2.2.6 Scheduler Ticks processor and OS, the TLB might be invalidated. This is the By default, the Linux scheduler runs periodically, to select case for x86 since the Spectre and Meltdown vulnerabilities in which process is executed on the hardware for the next time 2017, where now kernel page table isolation (KPTI) invalidates slice. This is achieved by a timer interrupt firing typically entries during each mode switch to protect the kernel from every four milliseconds. Considering that interrupts cause attacks, unless the hardware supports PCIDs (see above). mode switches and possibly context switches (see Fig.1), Finally, when exiting the kernel mode, a context switch to a running the scheduler itself may unnecessarily suspend our different process might take place. user task. Luckily, there are two options to configure the Our kernel has KPTI enabled, but no PCID support. We kernel differently, commonly referred to as “tickless” [14]: created a test program to measure the direct penalty from 1. Dyntick-Idle, enabled by the kernel configuration op- KPTI. We found that the fastest syscall drops from at most 660 CONFIG_NO_HZ_IDLE=y 1 tion , describes a kernel config- cycles down to less than 130 cycles, when KPTI is disabled . uration that omits scheduling ticks when there is no Kernel developers have measured large slowdowns as well, task to be executed. While this is often the default for with up to 30% in networking code [5]. That is significant, desktop and embedded systems to save energy, but is and thus needs to be considered during measurements. not useful for our purpose. In summary, mode switches also have a direct penalty 2. Adaptive ticks, enabled by kernel configuration option caused by the call overhead, and indirect penalties caused CONFIG_NO_HZ_FULL=y, describes a kernel configura- by TLB and cache effects, depending on kernel version and tion that additionally turns off ticks when there is only processor. one runnable task, thus preventing unnecessary in- terruptions. On top of the configuration option, it is 2.2.5 Interrupts necessary to add a kernel boot parameter specifying 2 Typically, a few dozens of interrupts can arrive at any point for which CPUs it shall be applied. RCU calls need to in time. For example, there are thermal interrupts in case the be offloaded to other CPUs. hardware overheats, and machine check exceptions indicat- Both of these configurations have disadvantages, as well, ing hardware errors in the CPU. Some interrupts cannot be including increased number of instructions to switch to/from avoided, while others can be disabled or redirected to differ- idle mode, and more expensive mode switches [14]. Using ent cores. Interrupts do not directly cause context switches again the program shown in Appendix A, we have found as described above. Instead, the CPU itself saves and restores that mode switch times in our system increase to about 700 very few registers (notably, the PC), and then a mode switch cycles with adaptive ticks and KPTI disabled. Consequently, happens [2], see also Fig.1. After this, the interrupt handler tickless operation may not be beneficial for workloads with must take care of the rest. Specifically, the kernel then is many syscalls, or for those going idle frequently. running in interrupt context, as opposed to process context, and the work being done here is not attributed to any user 2.2.7 Neglected Features process. The most important mechanisms of OS-hardware interaction The interrupt handler itself is supposed to be fast and have been explained. Nevertheless, there are many other must not suspend execution. This is also called the top half. features and effects that depend on the specific version of Interrupts which require longer-lasting or I/O work to be OS. Some omissions include speculative paging [11], as that done, must defer such work for later processing in a so-called is a recent development not commonly used, yet, and page bottom half [16]. migration between NUMA nodes, since our target is a single Once the top half handler finishes, the OS switches back NUMA node. to user mode. As explained before, a context switch might happen during this transition. Specifically, if the interrupt 3 Measurement Errors had deferred some work to a bottom half, then a context switch to a kernel thread might happen at this point in time, This section summarizes features that can create measure- to finish the interrupt work. ment errors, building on the explanations given before and visualized in Fig.1. By error we subsume all effects that stem from sources other than the software under analysis and, 1This upper bound was measured as the minimum latency over many calls of getuid, with the program given in Appendix A. This program had when not properly controlled, would therefore mislead us in experienced a 3x speedup! a systematic or random direction. For example, we consider 2see /proc/interrupts effects caused by other processes running in the background 8 of an operating system as measurement error. Inter alia, they and the processor share the L3 cache on our machine. share caches with the software under analysis, and thus This means that even different monitor setups can change the caching behavior of our software can look significantly cache miss statistics and bandwidth benchmarks3. worse if there exists a data-heavy background process. Table 3 depicts all major sources of measurement errors, 3.1 Short-Living Programs together with a classification whether they are caused by When an OS is used, there is necessarily an overhead for hardware (HW) or software (SW), their measurable impact, starting and terminating a process. This can become signif- and the time when the effects manifest in the measurements. icant when we measure the performance of programs that Additional explanations are given in the notes below. have a short execution time. After all, most developers are Notes for Table 3: not trying to optimize the OS or its libraries, but their own software. To give a rough number, programs that execute less (a) Only traffic not caused by the software under analysis is than a million instructions should be considered to be too considered an error. Such traffic might be caused byall short, with the specific number being subject to the amount other processes running on the same core, including the of work the OS has to do. kernel. See process isolation. As an example, consider the call graph of a program (the (b) If the number of demand data accesses shall be measured, nsichneu from the Mälardalen WCET benchmarks, com- then the prefetcher skews the results, otherwise it is not piled with gcc and -O3) shown in Figure 3a. The user code, considered a source of error. Also, prefetching has the all in a single function (black node in the upper right half), same effect as demand access on caches and TLB, thusit executes about 40,000 instructions. The entire program, from can skew TLB access metrics. startup to after termination, requires about 150,000 instruc- (c) Only interrupts not caused and required by the software tions. The OS performs tasks such as dynamic loading (starts under analysis are considered source of errors. Interrupts in upper left corner and spreads about two thirds into the may cause other processes to be scheduled to execute picture), which locates and loads into memory all dynamic bottom halves. libraries used by the program. The actual user code is exe- (d) The mode switches themselves are often required to ex- cuted thereafter, and followed by another cascade of calls ecute system calls. But mode switch overhead can be performing cleanup actions. Figure 3b shows the same pro- highly different depending on OS and HW. Therefore, it gram with static linking. The remaining overhead is mainly may or may not contribute a significant error. from the C library, although this programs does not make (e) Allocators for virtual memory have different characteris- any explicit library calls, as evident from the call graph. In tics. Some may consume significantly more/less memory this example, the user code is only responsible for about 26% than others, and some may force context switches. of the processor cycles with dynamic linking, and for 75% (f) Non-tickless kernels may generate unnecessary context with static linking. switches and impair TLB, caches and thus performance. In summary, if we are not aware of such OS and library None of those switches are caused by the software under overhead when looking at the measurement results, we may analysis, and thus considered errors. draw conclusions that merely reflect the operating system (g) Major page faults cause disk access. If the effects to be and its libraries, rather than the software that we are trying measured are rather small compared to disk access times, to analyze and improve. then major faults would mask those effects due to large jitter, and must be avoided. Furthermore, it is worth not- 4 Measurement Setup ing that a process accessing a file might be sent into waiting state, and then the number of unhalted processor In this section, we propose a measurement setup on SMP cycles is not incremented. Consequently, I/O is not di- Linux, that minimizes measurement errors and thus provides rectly visible in counter-based measurements, and even a faithful quantification of the software under analysis, while I/O-limited applications can still exhibit a high IPC. In- suppressing other influences as far as possible. directly, however, it can be seen that the process spends All event interactions described in previous sections are less CPU time than wall-clock time, pointing to an I/O- highly depending on the microarchitecture, the operating bottleneck. system and its configuration, next to the program under (h) Depending on the microarchitecture, DMA might slow analysis itself. We describe our measurement setup for the down DRAM access because of increased traffic on the system detailed in Table 1, as a representative of an advanced Northbridge [7]. Naturally, DMA also generates inter- superscalar out-of-order processor with operating system, rupts. Furthermore, cache coherency might be a con- MMU, caches, prefetchers and so on. For different targets or cern [25], and generate extra cache traffic. OSs, the setup has to be adapted accordingly. (i) Some caches might be shared with peripherals and thus 3We have observed a 10% decline in throughput in the STREAM bench- be influenced by their operation. For example, the GPU mark [23], when executed on a dual-monitor vs. single-monitor setup 9 ucini ihihe nbak ohfgrsso h aeporm hr a sdnmclylne n b ttclylinked. statically and (b) linked (a) isdynamically where program, same the show figures Both black. in highlighted is function 3. Figure

0x0000000000001130

1 1 1

_dl_start 0x0000000000406de6 _dl_init

1 1 4

_dl_sysdep_start (below main) call_init.part.0

1 1 1 1 1 1 1 1 1 1 1 1 1

sbrk brk dl_main exit __cxa_atexit __libc_csu_init main _setjmp 0x0000000004a23588 0x00000000000006d0 _init dlinit_alt init_cacheinfo

algah fasotlvn rga,sgetn htms ftewr en oei S“vred.Teuser The “overhead”. OS is done being work the of most that suggesting program, short-living a of graphs Call 1 3 1 1 4 2 1 1 1 1 1 1 1 1 1 2

handle_ld_preload _dl_next_ld_env_entry 1 _dl_discover_osversion _dl_unload_cache _dl_relocate_object _dl_debug_state _dl_add_to_slotinfo _dl_sysdep_start_cleanup __run_exit_handlers __new_exitfn 0x0000000000406eb0 0x0000000000400370 1 __sigsetjmp __ctype_init __init_misc handle_intel

1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 4

dso_name_valid_for_suid do_preload _dl_map_object_deps 1 uname munmap strcasecmp time gettimeofday strncasecmp memcpy@GLIBC_2.2.5 _IO_cleanup __call_tls_dtors _Exit _dl_fini __sigjmp_save rindex intel_check_word

1 1 2 _dl_protect_relro 1 1 1 1 1 1 1 1 1 1 1

_dl_catch_error 1 _dl_receive_error __init_cpu_features _dl_vdso_vsym _IO_flush_all_lockp __tls_get_addr 1 0x0000000000406e90 0x0000000000406f54 0x0000000004a2385c 0x0000000000000690

(a) 1 1 1 3 2 1 3 2 1 1 1 88 _dl_sort_fini 1 1 1 1 1

3 map_doit __sigsetjmp openaux _dl_init_paths _dl_initial_error_catch_tsd _dl_runtime_resolve 0x0000000000406e10 __cxa_finalize 0x0000000000000600

1 1 2 2 2 1 3 version_check_doit 1 3 1

5 _dl_map_object fillin_rpath 1 3 init_tls 1 _dl_fixup 1 1 __unregister_atfork

1 1 2 1 4 1 3 5 4 1 1 3

_dl_load_cache_lookup 2 open_path 1 1 _dl_map_object_from_fd 2 expand_dynamic_string_token 1 strsep free _dl_allocate_tls_init _dl_important_hwcaps _dl_allocate_tls_storage _dl_determine_tlsoffset _dl_check_all_versions _dl_lookup_symbol_x

9 1 16 12 2 1 1 2 5 2 2 5 2 4 1 4 1 1 3 2 1 1 4 7 91

_dl_setup_hash 2 _dl_cache_libcmp open_verify _dl_sysdep_read_whole_file _xstat 2 2 local_strdup _dl_next_tls_modid _dl_new_object _dl_debug_initialize mprotect index 36 4 memset 1 2 access allocate_dtv 2 _dl_check_map_versions do_lookup_x _dl_add_to_namespace_list

2 8 2 18 1 1 1 6 6 3 5 2 3 1 1 3 3 7 17 84 3 3

bcmp _dl_cache_libcmp'2 read open _fxstat close 6 memcpy strlen 2 mempcpy calloc 1 match_symbol _dl_name_match_p check_match.9457 rtld_lock_default_lock_recursive rtld_lock_default_unlock_recursive

5 8 3 66 96

1 malloc strcmp

27 112

__libc_memalign strcmp'2 1048

3

mmap

0x000000000040789e

1

(below main)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

_dl_aux_init _setjmp index strncasecmp_l memmove strstr strcpy _dl_discover_osversion __libc_init_first stpcpy main exit strcmp __cxa_atexit __libc_csu_init strcasecmp_l bcmp

1 1 1 __pthread_initialize_minimal 1 1 1 1 1 1 1 1 1

10 __sigsetjmp __init_cpu_features uname __ctype_init __libc_init_secure _dl_non_dynamic_init __init_misc __run_exit_handlers __new_exitfn init_cacheinfo 0x0000000000400278 0x0000000000407980

1 7 1 1 1 1 1 2 1

__sigjmp_save getenv _dl_init_paths 1 rindex _IO_cleanup __libc_csu_fini _Exit handle_intel __register_frame_info

1 13 1 1 1 1 1 1 4

strncmp fillin_rpath _dl_important_hwcaps _dl_get_origin _IO_flush_all_lockp 0x000000000048e118 fini 0x0000000000407950 intel_check_word

5 4 4 4 1 1 1 1 1 1

7 1 1 strsep expand_dynamic_string_token free mempcpy access 3 check_free.isra.0 __deregister_frame_info 0x00000000004078d0

4 4 4 4 4 4 1 1 1 1 (b)

__libc_setup_tls strpbrk __GI_strchr local_strdup _int_free __syscall_error __deregister_frame_info_bases

1 4 4 4

strlen memcpy malloc

1

malloc_hook_ini

1 1 12

1 malloc'2 ptmalloc_init.part.7

1 1

_int_malloc __linkin_atfork

2 1

__default_morecore malloc_consolidate

2 1

sbrk malloc_init_state

4

brk Source Class Impact When Note Speed Stepping/Frequency Scaling HW varying execution time immediate TLB Shootdown HW slower address translation lagging DMA Transfers HW/SW slower memory access immediate (h) Cache Coherency* HW additional accesses to cache immediate & lagging (a) Hardware Prefetcher* HW more cache & TLB accesses, less or more misses immediate & lagging (b) Load Balancing/Migrations SW more & TLB accesses, less or more misses immediate & lagging Interrupts HW/SW longer execution time, more mode switches immediate & lagging (c) Mode Switches SW more cache/TLB misses, more context switches immediate & lagging (d) Context Switches SW more cache/TLB misses, longer execution time immediate & lagging Allocator SW more context switches, different cache usage immediate & lagging (e) Scheduler SW more or less context switches immediate & lagging (f) Major Page Fault* SW/HW more context switches immediate & lagging (g) Peripherals HW more variance in memory access, cache misses immediate & lagging (i) Table 3. Summary of sources for measurement errors. Sources marked with (*) may or may not be considered creating measurement errors, depending on the intended observation.

The main focus of our setup is how to control variables run queue, to prevent scheduler ticks from getting generated. that would otherwise lead to the measurement errors listed There are several factors that can cause runnable kernel in the previous section. Therefore, many of the suggestions threads, which are discussed in [15]. One such factor are given here are about parameters and configuration of both RCU calls, which should be disabled for the isolated CPUs the OS and the CPU. In the end, we briefly describe the with kernel boot parameter rcu_nocbs. Another reason are measurement process itself. bottom halves of interrupts as discussed earlier, which can be prevented by setting the interrupt affinity. 4.1 Control Variables Allocator Selection: Yet another reason why tickless oper- 4.1.1 Isolating and Pinning the Process ation fails, may be the kernel’s memory allocator. Depend- First, one or more (yet not all) CPU cores were isolated from ing on the distribution, different algorithms (called “SLAB”, the scheduler and SMP balancing, which prevents other “SLOB” or “SLUB”) can be in use, with different impact on userspace tasks from interfering (kernel boot parameter cache miss/hit metrics and execution time. Notably, the SLAB isolcpus). We recommend to leave out CPU0, since one allocator requires periodic cleanups, which enables a kernel core is required to process the offloaded tasks, and CPU0 process and prevents tickless operation. The best option is usually serves interrupts that cannot be moved to other cores the newest, and in Desktop distributions most commonly (e.g., some DMA controllers). Additionally, Hyperthreading used allocator, the SLUB allocator (one exception is Debian, was turned off in BIOS, to avoid hardware context switches, using SLAB and thus rendering adaptive ticks ineffective). same-core migrations and some known errata with the hard- Changing the allocator requires building a custom kernel. ware counters. The process under analysis was subsequently Watchdogs: More context switches could be caused by the pinned to the isolated cores with command line tool taskset. watchdog. This can be prevented with kernel boot parame- Note that each subprocess/thread needs pinning if a range of ters nowatchdog and nosoftlockup. cores is isolated, since otherwise migrations might happen. Real-Time Priority: Depending on the Linux version, some Interrupts: The affinity of interrupts was set to the non- kernel threads [15] are still active on isolated cores and can isolated cores, preventing them from skewing the measure- cause unwanted context switches (e.g., kworker). To avoid ments4. Thermal interrupts were prevented by throttling the this, we perform our measurements at real-time priority us- CPU speed such that full load does not lead to critical tem- ing the command line tool chrt. Additionally, real-time throt- peratures. This can be done by setting the governor’s limits tling must be disabled (by setting sched_rt_runtime_us in in or tools like cpufreq. Machine check exceptions were ), since otherwise some Linux versions force one invol- disabled by kernel boot parameter mce=off. untary context switch per second for CPU-bound tasks. Ensuring tickless operation: To minimize the number of Summary: The show configuration results in no other user context switches, a kernel with adaptive ticks should be processes running on the isolated cores, and in no CPU mi- used, and enabled with kernel boot parameter nohz_full. grations taking place of our process(es). Some interrupts may The remaining challenge is to keep kernel threads off the

4/proc/irq/default_smp_affinity 11 still be left, but these should only cause mode switches. This to bring data from the slow disk to RAM, as soon as a de- can be verified using perf5. mand is foreseen. However, while this mechanism prevents major page faults quite effectively, it causes DMA transfers 4.1.2 Fixing Processor Speed and might not be desirable. Once a file is buffered in virtual If the processor supports speed stepping/frequency scaling, memory (“page cache”), data can be accessed without disk then the clock speed of the processor should be fixed for access taking place. Obviously, the page cache can only pre- two reasons. First, if the absolute execution time is part of vent all disk access if it is large enough to hold all files that the measured metrics, a non-fixed clock speed could yield will be in use, and not flushed in between. different results in subsequent runs, depending on how the The easiest way to make use of the page cache is to run the governor selects the frequency. Second, we conservatively measurement twice, and discarding the results from the first prevent a potential impact of frequency scaling on processor run. Alternatively, there exist tools which allow checking cycles; usually there is not enough information in the data and manipulating the page cache, e.g., vmtouch. sheets to conclude there is no influence. 4.1.5 Controlling Hardware Data Prefetching Further, fixing the clock speed also ensures more stable DRAM access times when measured in processor clock cycles. Although it is possible to distinguish some events by their DRAM runs at its own bus frequency, which means that a cause (i.e., demand vs. speculation), it cannot be known higher CPU frequency will make DRAM bottlenecks stand how much of the prefetcher-induced penalties are hidden out more. This is important for backend-bound benchmarks, in out-of-order processing, or how many of certain events where the bottleneck becomes more severe with increasing are caused by the software directly. One shortcut to answer processor frequency. this question is to turn off hardware prefetching and run the Fixing the processor speed is best done in BIOS by dis- benchmarks twice, then compare results. This is specific to abling technologies like speed stepping. However, it has to every processor family, and may not be possible on all CPUs. be ensured that the processor has sufficient cooling under For Intel CPUs, machine-specific registers (MSRs) allow con- full load, which often is not the case for mobile or Laptop de- figuring the processor in many ways, inter alia to turnoff vices. Alternatively, for Intel CPUs, the older ACPI driver can prefetchers. be activated with boot parameter intel_pstate=disable, 4.1.6 Controlling Influences of Peripherals which supports to fix the frequency of the processor. If peripherals are sharing memory with the software under 4.1.3 Controlling TLB Flushes and Shootdowns analysis, it might be helpful to turn them off or minimize Even though we have isolated cores from the scheduler, their impact, in case an influence on the results cannot be shootdowns may still happen because the Linux kernel runs ruled out. For example, the graphics engine in Sandy Bridge on the same core as the process it is serving. Thus, every can be made to relinquish parts of the L3 cache by booting core that is used executes a part of the kernel at some point, into a low-resolution mode. where shared kernel data can lead to shootdowns. 4.2 Taking Measurements In our kernel version KPTI is enabled, yet PCID support is not available. Therefore, each mode switch flushes the TLB, The act of taking measurements itself is straightforward, and significantly skewing our measurements (see Section 2.2.4). should provide meaningful results if the previous recommen- We thus disabled KPTI with kernel boot option pti=off6. dations have been followed. We therefore summarize this This step is not recommended when PCIDs are supported by only briefly. both OS and CPU. 4.2.1 Performance Monitoring Units 4.1.4 Controlling Page Faults While there exist different ways and tools to measure per- formance, such as hardware tracing and countless tools, we As earlier, we assume the user wants to avoid major page briefly describe the use of the CPU’s performance monitoring faults as far as possible, since they have a penalty large unit (PMU) [12], which provides hardware event counters. enough to hide other effects that may be of interest. Also, These counters are processor-specific registers that are in- there is not much the user can do if access to secondary cremented on the occurrence of certain events, for example, storage is logically required, unless the the program under the number of L1 cache misses, or the number of processor analysis is redesigned. cycles. Under Linux, the tool allows setting up and read- In Linux, reading or writing data files is by default buffered ing these registers, and to separate event counts by process. through the virtual memory. A read-ahead heuristic is used A growing number of CPUs and SoCs offer an equivalent to the PMU, e.g., many ARM Cortex processors already do. 5perf record -e "sched:sched_switch" ... Additional to hardware events, the perf also reads kernel 6Note that this may open a security vulnerability! (software) counters, such as minor and major page faults, 12 context switches and CPU migrations. Note: We highly rec- and precise results. Otherwise, grouping and multiplexing ommend to use the native names of the registers as opposed should be used, preferably using weak groups. to perf’s names, to avoid misunderstandings (e.g., the perf For the system described in Tab. 1, the CPU supports tool counts STLB hits under the name ITLB loads on Sandy around 475 different events, offering deep insights into the Bridge), and to watch out for errata (such as counter prob- processor. However, looking at single counter values can be lems under HyperThreading). misleading. Counters can have several orders of magnitude Multiplex and Grouping: PMUs have a limited number of in difference (e.g., page faults vs. cache misses), yet have hardware counters (eight, in our case). If we request more a similar impact on the resulting performance. Especially events than counters, perf starts time-multiplexing, and ex- when comparing two versions of the same program, large trapolating the results from the sampling window to the differences may become meaningless when compared tothe entire life time of the process. It is thus possible to miss absolute values. certain events if the specific counter is not currently ac- tive. Therefore, if the process behavior is not stationary for 4.2.2 Hierarchical Bottleneck Analysis long enough, the results may appear inconsistent. One way To identify the true bottleneck of an application, Yasin has around this issue is grouping of counters, which tries to proposed a hierarchical analysis [33]. The analysis follows multiplex all counters of a group at the same time. How- the hierarchical anatomy of general OoO processors, de- ever, some processors have scheduling restrictions for which picted in Fig.4. As a first level, the user should only be counters must not be used together. One way around this is concerned whether the majority of cycles is spent in the to avoid multiplexing at all, and execute the program multi- Frontend (entails decoding instructions, L1d access and Mi- ple times instead. This only makes sense if the workload is crocode switches), or in Bad Speculation (machine clears and repeatable. With Boolean options for multiplexing (M) and branch mispredictions) or in the Backend (L1/2/3 and DRAM grouping (G), there are four different possibilities to measure, data access) or whether most of the time is spent in retiring each with its own consequences: instructions (ideally, 100%). Once the most time-consuming 1. Grouping and Multiplexing: Possibly unsafe and incom- category is identified, the user should focus on its children plete results, because the process might be undersam- at the next level, determine the most prominent one, and so pled from M., and G. may create scheduling conflicts, on. For a full explanation, the reader is referred to [33]. which disables some counters. 2. No Grouping and Multiplexing: Possibly unsafe but complete, because the process might be undersampled and counts inconsistent towards each other, but groups can be built in arbitrary ways to resolve scheduling conflicts. 3. No Multiplexing and no Grouping: Possibly unsafe but complete, only works for repeatable workloads. Multi- plexing, although not enabled, might still take place implicitly in an attempt to resolve conflicting counters. 4. No Multiplexing and Grouping: Safe but possibly in- complete, only works for repeatable workloads. The grouping prevents implicit multiplexing from taking place. By unsafe we mean that the counts may be both imprecise and contradict each other (e.g., it might be possible that Figure 4. Hierarchical View on Processor Performance [33]. the L2 miss count is greater than the L3 access count). By complete we mean that every event that has been requested A free implementation of this analysis for Intel CPUs is was indeed counted at some point in time. available in a tool called toplev, which is part of Intel’s Recently, weak groups have been introduced in perf [17], pmu-tools [18]. It takes measurements using Linux’ perf which can be broken to resolve scheduling conflicts, but oth- tool, and presents the results in the explained hierarchy, erwise ensure that certain events are counted synchronously. but in ratios rather than absolute numbers. Multiplexing It is thus a mixture of cases 1 and 2, resulting in complete and grouping is supported as described earlier. Taking the results with minimal multiplexing artifacts, applicable to measurements thus boils down to invoking the tool. non-repeatable workloads. Only after the bottleneck has been identified, individual We suggest to use grouping and avoid multiplexing as counters should be inspected using perf, since now we know long as the workload is repeatable, to obtain self-consistent they are meaningful. Last but not least, it is worth noting that 13 pmu-tools also includes a convenience wrapper for perf, be isolated, as explained in Section 4. Future releases of called ocperf. This tool allows using native names for Intel perf/toplev might no longer have this caveat. CPUs, and has better support for uncore events. It should be noted that toplev does not allow localizing 5 Experiments causes of undesired behavior in the source code, since coun- ters operate cumulatively and are not associated to specific We provide a few short examples illustrating the difference locations in the source code of a program. A method for between our proposed measurement setup, and a default localization is provided by perf in sampling mode, where Linux environment. We have chosen three programs with a history of events can be logged, annotated in the source different characteristics: code. The tool offers many options explained in the docu- • gnugo: This is a game engine for the Chinese Go game. mentation [13]. The program consists of many branches and jumps, 4.2.3 Uncertainty Propagation many of which are hard to predict. Additionally, it has a low spatial locality of instructions, and will there- All measured events can be associated with a measurement fore like suffer from many instruction cache misses. uncertainty. This uncertainty indicates the precision of the Memory usage is low compared to the next program. measured values, and prevents us from drawing false con- • stream: This is McCalpin’s memory bandwidth bench- clusions if we are comparing two versions of a software. mark [23], which stresses the memory subsystem heav- Crucially, not all events can be measured directly on the ily, but in a predictable pattern. Consequently, this hardware, and thus must be calculated from others, mea- program occupies a lot of memory (1.1GB in our case), sured or themselves calculated, ones. The calculated events but has few branches or jumps. therefore have an uncertainty based on their constituents, • syscalls: This is the program shown in Appendix A, which needs to be properly tracked. As an example, on Sandy which exercises syscalls. The memory usage is low Bridge the cache hit ratio r must be estimated from access a compared to the others, but it causes many mode and hit h counters, and its standard deviation σr depends on switches and thus allows more context switches to both measurements as follows take place. This program should keep the functional r h σh 2 σa 2 σa,h units busy. σr = + − 2 , (1) a h a ah All programs have been executed five times in a row to al- where σa,h is the covariance between accesses and hits. Anal- low determining a standard deviation, which is then used for ogously, the uncertainty is propagated for multiplication, our uncertainty propagation. We have repeated this for two division, addition, and all other operations [20]. different measurement setups; first with the default settings We have contributed patches to toplev, which track the of the system described in Tab. 1, and another time with our standard deviation of counters across multiple runs (--repeat) proposed measurement setup described in Section 4. through all calculations, while assuming statistically inde- All measurements have been taken with similar parame- pendent variables, i.e., σa,h = 0. The resulting uncertainty ters, using the performance monitoring counters (see Sec- for each event is indicated with error bars in our plots. tion 4.2.1. Specifically, multiplexing and HyperThreading Warning: When measurements are taken in system-wide were disabled, to prevent undersampling and to ensure proper mode across more than one logical processor core, then uncertainty propagation, as explained in Section 4.2.2. toplev currently forces perf to not aggregate the results (flag -A). This in turn suppresses perf’s output of the stan- 5.1 Hierarchical Bottleneck Analysis dard deviation, and leads to an optimistic uncertainty. System- wide mode is also forced when HyperThreading (HT) is ac- Figure 5 shows the results of the first level of the hierarchical tivated, thus HT suppresses the standard deviation, as well. analysis using toplev. It can be seen that the measurements There are thus three options to capture measurement uncer- all programs reflect the expected behavior: gnugo indeed tainty correctly: (1) Recommended: Disable HT and avoid spends a lot of time in its frontend (FE), due to many cache system-wide mode. The perf tool will “follow” the software misses and branch mispredictions (see Fig.6). The stream under analysis to whatever core it is going (if not pinned). benchmark spends most of its time in the backend (BE), (2) Alternative 1: Disable HT and use system-wide mode because it is mostly memory-bound (see Fig.7). The syscalls while filtering only for only one single core. The software program shows itself to be mainly backend-heavy, but this under analysis must be pinned to that single core. (3) Alter- time due to core usage (see Fig.8). native 2: Keep HT and use system-wide mode, and specify The difference between the default setup and ours ishow- --single-thread, which forces toplev to aggregate the re- ever barely visible for this type of analysis. Only syscalls sults across all CPUs. The system should otherwise be idle. In shows a larger difference. The results with our proposed all cases, the core where the software is running on should setup make the program appear less back-end bound, in 14 measurement setups, thus the programs would take about 100 the same time to execute in both setups. The next apparent difference is the absence of context switches (one switch is always necessary) and CPU migra- tions in our setup, followed by no major faults in our setup, 50 and a different L1d caching behavior. All in all, the measure- ratio % ment uncertainty again is comparable when the counts are close. The results for stream and syscalls are shown in Appen- dix B. Again, there are some differences in the absolute 0 event counts. This time, IPC0 shows that stream runs faster in our setup ( 0.6/0.8 = 25%), whereas syscalls is signifi- cantly slower ( 0.68/0.15 = 450%), as expected due to the gnugo-dflgnugo-ourstream-dfl stream-oursyscalls-dflsyscalls-our higher overhead for mode switches. Furthermore, syscalls also shows large differences in most counters. The entire FE BAD BE RET characteristics seem to be driven by the differences in the mode switch cost, and even outweighs that the default setup Figure 5. Results of hierarchical performance analysis for performs many more context switches and CPU migrations default setup (dfl) and our proposed setup (our). than our proposed setup. In conclusion, the measurement setup can make a large dif- ference in absolute event counts and software performance. exchange for more time spent retiring instructions. This hi- Our experiments suggest that neither setup is always domi- erarchical analysis does not offer any more explanation, and nating in performance, and that the choice of setup should thus we will revisit this in the next section. depend on the software under analysis. All figures have error bars, but only very few ofthem are visible due to their low magnitude. This suggests that 5.3 Stability towards Background Processes a default setup may be sufficient in terms of measurement So far, all measurements have been taken on an idle system, uncertainty, and that the results of a hierarchical analysis with no interactive user processes being executed. We now may be close enough between the two setups, at least for the examine how measurements change when we run some back- programs tested here. This is surprising, because the absolute ground processes, as often the case in production systems. values of the counters, as well as program performance, differ Specifically, we run four background processes which stress significantly, as we show next. caches and DRAM (stress-ng -C2 --vm 2 --vm-populate --vm-bytes=512m). 5.2 Absolute Event Counts and Absolute Performance 5.3.1 Hierarchical Analysis Hierarchical analysis has not been showing any larger differ- Figure 10 shows the results of a hierarchical analysis, which ences between the measurement setups. It builds on ratios certify that all measurements under all setups are stable. rather than absolute values, therefore large differences could There are small differences, which however do not change become invisible. While this is acceptable and desirable for the results fundamentally. Again, the measurement uncer- bottleneck analysis, it might be undesirable in other sce- tainty surprisingly is always very low. The results suggest narios, such as when absolute performance or the precise that the hierarchical analysis is also robust against back- event counts are required. These absolute counts differ sig- ground processes, at least at the uppermost level of the hier- nificantly, as we show in the following. archy. Figure 9 shows the absolute event counts of gnugo for our proposed setup (“tune”) and for the default one. First, it can 5.3.2 Absolute Event Counts and Absolute be seen that the program runs faster under the default setup. Performance The bar “task-clock (msec)” shows 11.4s vs 12.9s; note that The absolute event counts give a similar picture as before. task-clock only increments for those cycles where the pro- However, the wall-clock time of the programs is considerably gram has been active on the processor, which is always less more stable in our setup. or equal than wall-clock time. As a better metric reflecting Figure 11 shows again absolute event counts for both wall-clock time, all plots show a bar called “IPC0”, which measurement setups of gnugo, when background processes is calculated as I/(f · tw ), where I is the number of instruc- are running. A large change in IPC0 can be seen. In our tions, f the processor frequency, and tw the wall-clock time setup, the program almost maintains the same performance of the program. In Fig.9, IPC0 is almost identical for both as on an idle system (4% degradation, since the IPC0 drops 15 FE Frontend_Bound: 43.94+-0.02% Slots <== BAD Bad_Speculation: 14.16+-0.01% Slots RET Retiring: 35.07+-0.01% Slots below FE Frontend_Bound.Frontend_Latency: 22.77+-0.03% Slots FE Frontend_Bound.Frontend_Bandwidth: 21.13+-0.03% Slots BAD Bad_Speculation.Branch_Mispredicts: 14.07+-0.01% Slots FE Frontend_Bound.Frontend_Latency.ICache_Misses: 17.66 +- 0.00 % Clocks_Estimated FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 14.07 +- 0.01 % Clocks_Estimated

Figure 6. toplev output for gnugo with our proposed measurement setup.

FE Frontend_Bound: 5.62+-0.00% Slots below BAD Bad_Speculation: 0.24+-0.00% Slots below BE Backend_Bound: 72.20+-0.00% Slots RET Retiring: 21.94+-0.00% Slots below BE/Mem Backend_Bound.Memory_Bound: 52.55+-0.02% Slots <== BE/Core Backend_Bound.Core_Bound: 19.65+-0.02% Slots BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 0.00+-0.00% Stalls below BE/Mem Backend_Bound.Memory_Bound.L2_Bound: 1.73+-0.02% Stalls below BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 17.80+-0.03% Stalls below BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 36.13+-0.04% Stalls below BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 10.27+-0.01% Stalls below BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 21.92+-0.03% Clocks

Figure 7. toplev output for stream with our proposed measurement setup.

FE Frontend_Bound: 21.40+-0.05% Slots BE Backend_Bound: 47.52+-0.08% Slots RET Retiring: 26.31+-0.01% Slots below FE Frontend_Bound.Frontend_Latency: 12.94+-0.03% Slots below BE/Mem Backend_Bound.Memory_Bound: 0.00+-0.00% Slots below BE/Core Backend_Bound.Core_Bound: 47.52+-0.08% Slots BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 39.88+-0.20% Clocks <==

Figure 8. toplev output for syscalls with our proposed measurement setup. from 1.25 to 1.21), whereas in the default setup performance Blackfin, MIPS, PowerPC and SPARC processors [30]. Spe- significantly degrades (64% degradation, since IPC0 drops cific to Intel CPUs, we have toplev, which we have men- from 1.22 to 0.45). The result is similar for stream, which tioned earlier [18], and the commercial tool VTune. Inter- degrades by 16% in our setup, but by 67% in the default setup, estingly, some processors also provide counters for power and for syscalls, which degrades by 60% in the default setup, events [27], for which our setup might be equally important. but only by 2% in our setup. Software-based Measurements: There exist many well- In summary, the absolute event counts are considerably known tools for debugging software performance. Perhaps different in our setup, and additionally the performance of best known are software profilers, such as gprof (execution the program under background load is significantly better time) and Visual Studio Profiler. More generally, there ex- maintained in our setup. ists the larger class of dynamic program analyzers, which observe programs during run-time, sometimes by instru- mentation, other times through simulation. A survey about 5.4 Other Influences those was recently given in [10]. One successful framework When performance counters are used, there is a low but worth mentioning is Valgrind [26], which provides tools visible overhead, as with many other measurement methods. to track execution time and memory usage. However, such We have neglected this overhead in our experiments. tools change the program’s characteristics through instru- mentation or simulation, and furthermore do not provide any microarchitectural explanations. Finally, we should also 6 Related Work mention , which can collect software events during Hardware Performance Counters: Since all performance the execution of a program, and is integrated with perf. counters are vendor- and processor-specific, different profil- Reducing OS Noise: Another attempt at reducing the OS’ ing software exists to access them. A generic tool is Linux’ impact on workloads has been described by Akkan et. al [1]. perf, which also supports some many AMD, ARM, ARC, However, their report is mainly concerned with the OS, and

16 setup (dfl) and our proposed setup (our). ourproposed default and for (dfl) processes, setup background active highly without and 10. Figure uncertainty. measurement 9. Figure ratio %

gnugo-dfl-stress 100 gnugo-dfl 20 40 60 80 event count normalized 0 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 boueeetcut for counts event Absolute eut fheacia efrac nlsswith analysis performance hierarchical of Results gnugo-our-stressgnugo-our task-clock (msec) 11,267 12,940 cycles 36,957,230,731 36,027,887,500 45,083,726,454 stream-dfl-stress instructions stream-dfl 45,102,370,615 IPC0 1.22 1.25 FE cpu-migrations 17 0 stream-our-stress context-switches 64 stream-our 1 BAD major-fault-estcycles 333,333 0 major-faults 1.67 0 syscalls-dfl-stress minor-fault-estcycles 4,092,000 syscalls-dfl 4,211,000 BE 4,092

gnugo minor-faults 4,211 branch-misses 403,786,991 syscalls-our-stress 403,450,360 Event countercomparisonfor"gnugo-b5-r10--level17" RET syscalls-our branches 8,655,235,030 8,658,824,329 o eal eu dfut n u rpsdstp(ue.Rderrbr indicate bars error Red (tune). setup proposed our and (default) setup default for branches-indirect 66,111 53,212 mispred-estcycles 8,075,739,820 8,069,007,200 TLB-miss-estcycles 656,780,971 675,439,156 cache-dmd-estcycles 183,866,224,452 95,514,963,258 L1-dmd-estcycles 170,507,919,608 82,275,322,136 L1d-dmd-hit-ratio 0.98 0.39 L1d-dmd-load 14,055,929,726 7,038,697,722 13,810,675,207 17 L1d-dmd-load-hit 2,771,907,876 L1i-hit 15,005,629,488 ehv ecie h otipratfaue fbt the both of features important most the described have We many and applicable, where references the given have We Threading, esrmn eu ae itedfeec.I otat when In contrast, difference. little makes the setup method, measurement analysis performance ratio-based and archical errors. measurement the indicate to with propagation, method uncertainty analysis performance have hierarchical we a that, but Towards extended characterized, analysis. is under hardware program or the OS mainly the er- not measurement that reduce such to rors, aims that have proposed setup we measurement a influences, identified perfor- the software on Based influence mance. they how super- and out-of-order processor, modern scalar a and system operating Linux Remarks Concluding 7 as serves reference. itself definitive depre- code a or source imprecise not Linux of the but doubt documentation, Last faintest cated reader. the even interested erase the to by least, found be can more kernel. the of primary workings invaluable inner been the understanding have for LWN sources and docs, kernel the the var- Linux, in on details aspects technical ious For there. found be can pointers some Hyper- (e.g., effects important some neglects furthermore 15,025,014,782 L1i-hit-ratio 0.97

upiigy u xeiet hwdta o hshier- this for that showed experiments our Surprisingly, 0.97 L1i-miss 532,185,776 533,618,030 L2-dmd-estcycles 11,715,101,256 11,722,296,852 L2-hit-ratio 0.94 0.94 2,359,940,830

ftrace L2-pending-cycles 1,482,028,786 L2d-dmd-access 248,200,161 247,766,802 L2d-dmd-hit 210,268,246 211,694,946 L2i-hit 765,990,192 vredadtc egh.Nevertheless, length). tick and overhead 765,163,125 L2i-miss 25,282,088 23,333,923 LLC-dmd-access 72,493,365 66,713,690 1,643,203,588 iu enlMiigList Mailing Kernel Linux LLC-dmd-estcycles 1,517,344,270 LLC-dmd-hit 63,200,138 58,359,395 LLC-dmd-hit-ratio 0.87 0.88 LLC-dmd-miss 9,293,227 8,354,295

LLC-dmd-pending-cycles 1,162,927,211 tune/gnugo.perf.all default/gnugo.perf.all 740,249,084 dram-dmd-estcycles 1,858,645,400 1,670,859,000 dram-dmd-pending-cycles 1,197,013,618 741,779,701 RAT-stalls 1,973,576,708 1,967,742,453 VirtAddr-estcycles 916,333,402 903,686,146 clears 2,437,673 2,433,139 269,074,549

(LKML), decode-switch-cycles 268,999,803 References results. configurations optimal presented achieve to the chosen of be subset might a thus pro- some and for grams, performance degrade however may This setup. proposed our recommend analysis. we under measurements, robust software more the For measure- of the characteristics for the requirements and the ments both on depends It exist. unsafe. be would presented setup targets the all the of to of extrapolation results an parts but which relevant, evaluate become to may reader givenrefer- the have for We ences OSversions. and processors different system the tune to applications. guideline high-performance a proposed for as Our used setup. be default therefore a can in setup observed been up has of 64% degradation to performance a observ- whereas their performance, in able stability setup. better default considerably a show than Programs processes background to immune more has setup our code, improvements. source significant ac- to shown cache correlated when be e.g., shall counts interest, cess of are counts event absolute visible. are differences performance Large uncertainty. measurement indicate bars error Red 11. Figure [1] ncnlso,teutmt esrmn eu osnot does setup measurement ultimate the conclusion, In with change may here presented results the Naturally, setupis proposed the that found have we effect, side a As fHg efrac optn Applications Computing Performance kernel. High Linux the of in noise Understanding the 2013. isolating Liebrock. and Lorie and Lang, Michael Akkan, Hakan event count normalized 1 0 0 0 boueeetcut for counts event Absolute task-clock (msec) 12,764 13,432 cycles 39,079,715,926 37,317,002,174 instructions 45,062,697,409 45,072,954,120 IPC0 0.45 1.21 cpu-migrations 61 0 context-switches 2,128 1 major-fault-estcycles 0 0 major-faults 0 0 4,090,666 minor-fault-estcycles Event countercomparisonfor"gnugo-b5-r10--level17"understress 4,212,000 minor-faults 4,090 gnugo 4,212 branch-misses 407,232,403 h nentoa Journal International The 7 21) pp.136–146. (2013), 2 27, 404,348,326 branches 8,663,925,837 8,656,365,580 o eal eu dfut n u rpsdstp(ue ihbcgon processes. background with (tune) setup proposed our and (default) setup default for branches-indirect 98,760 55,669 mispred-estcycles 8,144,648,060 8,086,966,520 TLB-miss-estcycles 638,634,549 728,565,250 cache-dmd-estcycles 183,723,194,840 183,445,978,440 L1-dmd-estcycles 170,737,026,416 170,550,891,544 L1d-dmd-hit-ratio 0.98 0.98 L1d-dmd-load 14,008,408,127 14,050,835,500 13,790,516,007

18 L1d-dmd-load-hit 13,816,794,750 L1i-hit 15,103,224,590 15,004,133,386 [13] [12] [11] [10] 0.96 [6] [9] [8] [7] [5] [4] [3] [2] L1i-hit-ratio 0.97 L1i-miss 540,847,964 notmzto ud o sebypormesadcmie mak- compiler and programmers assembly for guide optimization An //perf.wiki.kernel.org/index.php/Tutorial //www.tldp.org/HOWTO/KernelAnalysis-HOWTO-3.html unrblt.Online. vulnerability. enlog 08 iu enlpoiigwt ef Online. perf. with profiling kernel Linux 2018. kernel.org. Manual 2018. Corp. Intel handling. page-fault Online. speculative at attempt Another 2017. Hussein. Nur (FICTA) Applications and pp.113–122. Springer, Theory Computing: Intelligent of In Frontiers tools. and program dynamic techniques of survey analysis A 2015. Sharma. Ganga and Gosain Anjana gorman/html/understand/ at Available License. Publication Open 2007. 3rd. Gorman. Sept Mel 2018 Retrieved online. CPUs: ers. VIA and AMD Intel, of microarchitecture The 2018. Fog. Agner about know should programmer memory. every What 2007. Drepper. Ulrich 30. Oct fault terminal L1 the back: strikes Meltdown 2018. Corbet. space. Jonathan user from kernel the hiding Online. KAISER: 2017. Corbet. Jonathan Online. 30. Oct Diagram. 2018 Block Bridge Sandy en.wikichip.org/wiki/File:sandy_bridge_block_diagram.svg 2017. At32Hz. Ltd. ARM ed.). 0015B 2016. Ltd. ARM 30. Oct. 2018 Online. HowTo. Analysis Kernel TLDP 2018. Arcomano. Robertp 532,782,023 L2-dmd-estcycles 11,660,713,800 11,843,441,340 L2-hit-ratio 0.94 0.95 ne Corp. Intel . 3,758,928,741 https://lwn.net/Articles/730531/ https://lwn.net/Articles/738975/ L2-pending-cycles

e a,Inc Hat, Red 2,421,320,806 L2d-dmd-access 249,961,015 248,366,073 L2d-dmd-hit 210,071,776 otxA7Sfwr piiainGuide Optimization Software Cortex-A57

ne 4adI-2AcietrsOtmzto Reference Optimization Architectures IA-32 and 64 Intel 213,696,025 L2i-hit 761,654,374

nesadn h iu ita eoyManager Memory Virtual Linux the Understanding 773,257,420 L2i-miss 26,095,729 o.1(2007). Vol.11 https://lwn.net/Articles/762570/ 17,040,086 LLC-dmd-access 75,236,960 59,363,431 . LLC-dmd-estcycles 1,325,454,624 1,051,645,556 LLC-dmd-hit 50,979,024 tune/gnugo_stress_vm.perf.all default/gnugo_stress_vm.perf.all 40,447,906 LLC-dmd-hit-ratio 0.68

rc nentoa ofrneon Conference International Proc. 0.68 LLC-dmd-miss 24,257,936 18,915,525 867,934,354 . 30. Oct 2018 retrieved , https://www.kernel.org/doc/ LLC-dmd-pending-cycles . 566,581,589 dram-dmd-estcycles 4,851,587,200 3,783,105,000 dram-dmd-pending-cycles 2,890,994,386 1,854,739,216 RAT-stalls 1,996,515,554 1,975,366,649 VirtAddr-estcycles 1,316,514,520

eree 2018 retrieved , 1,233,145,915 clears 2,861,607

AMUAN (ARM 2,653,249 270,549,814 retrieved , retrieved , decode-switch-cycles 269,312,081 https:// https: https: . . [14] kernel.org. 2018. NO_HZ: Reducing Scheduling-Clock Ticks. On- A Mode Switch Benchmark for x86 line. https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt, retrieved 2018 Oct 30. #include #include [15] kernel.org. 2018. Reducing OS jitter due to per-cpu kthreads. #include Online. https://www.kernel.org/doc/Documentation/ kernel-per-CPU-kthreads.txt, retrieved 2018 Oct 30. #define TWENTY_MILLION 20000000 [16] kernel.org. 2018. Software Interrupt Context: Softirqs and Tasklets. #define NTRIALSTWENTY_MILLION Online. https://www.kernel.org/doc/htmldocs/kernel-hacking/ int main( void ){ basics-softirqs.html, retrieved 2018 Oct. 30. unsigned long ini, end, now, best, tsc; [17] Andi Kleen. 2017. Support standalone metrics and metric groups for int i; perf. Online. https://lwn.net/Articles/732567/, retrieved 2018 Oct 30. #define measure_time(code) \ [18] Andi Kleen. 2018. pmu-tools. github repository. for (i = 0;i

19 event count normalized event count normalized 1 1 1 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 0 0 0 0 iue13. Figure 0 0 1 2 3 iue12. Figure

task-clock (msec) 5,323 task-clock (msec) 7,068 8,928 6,606 cycles 17,473,627,364 cycles 23,156,210,094 24,843,949,952 tune/syscalls.perf.all default/syscalls.perf.all 18,365,073,546 instructions 2,562,671,219 instructions 14,573,967,566 16,851,967,309 14,758,654,789 0.63 boueeetcut for counts event Absolute 0.15 IPC0 IPC0 for counts event Absolute 0.68 0.80 cpu-migrations 9.33 cpu-migrations 14 0 0 context-switches 27 context-switches 45 1 1 major-fault-estcycles 0 major-fault-estcycles 0 0 0 major-faults 0 major-faults 0 0 0 minor-fault-estcycles 47,333 minor-fault-estcycles 293,033,000 168,666 293,153,666 minor-faults 47 minor-faults 293,033 168 293,153 branch-misses 19,194 branch-misses 705,935 20,051,490 962,313 branches 441,360,828 branches 2,455,816,821 3,240,727,471 2,490,873,413 branches-indirect 19,983,495 branches-indirect 10,638 19,989,697 3,373 mispred-estcycles 383,880 mispred-estcycles 14,118,700 Event countercomparisonfor"syscalls" 401,029,800 19,246,260Event countercomparisonfor"stream"

syscalls 2,135,233,233 247,576,714 TLB-miss-estcycles stream TLB-miss-estcycles 40,474 133,555,513 cache-dmd-estcycles 10,527,739,816 cache-dmd-estcycles 27,592,522,522 50,901,626,334 15,357,212,728 L1-dmd-estcycles 10,524,072,608 L1-dmd-estcycles 16,328,996,444 50,899,882,180 4,061,649,464 0.96 0.49 o eal eu dfut n u rpsdstp(tune). setup proposed our and (default) setup default for L1d-dmd-hit-ratio

o eal eu dfut n u rpsdstp(tune). setup proposed our and (default) setup default for L1d-dmd-hit-ratio 0.99 0.20 L1d-dmd-load 903,535,440 L1d-dmd-load 3,998,228,110 5,276,183,832 2,043,088,079 864,832,837 1,972,934,665

20 L1d-dmd-load-hit L1d-dmd-load-hit 5,236,191,145 411,715,569 L1i-hit 901,352,478 L1i-hit 136,379,781 2,252,588,255 191,981,228 L1i-hit-ratio 1 L1i-hit-ratio 0.99 1 1.00 L1i-miss 245,967 L1i-miss 743,192 128,381 440,550 L2-dmd-estcycles 2,746,392 L2-dmd-estcycles 913,864,548 1,214,040 959,521,080 L2-hit-ratio 0.78 L2-hit-ratio 0.23 0.80 0.24 L2-pending-cycles 7,184,139 L2-pending-cycles 12,184,189,648 589,285 8,555,061,379 L2d-dmd-access 57,414 L2d-dmd-access 336,604,405 46,330 336,394,410 L2d-dmd-hit 33,382 L2d-dmd-hit 75,899,633 28,715 79,746,670 L2i-hit 195,484 L2i-hit 255,746 72,455 213,420 L2i-miss 40,369 L2i-miss 801,317 7,584 475,558 LLC-dmd-access 75,922 LLC-dmd-access 490,487,061 20,454 487,571,297 LLC-dmd-estcycles 920,816 LLC-dmd-estcycles 10,349,661,530 530,114 10,336,042,184 LLC-dmd-hit 35,416 LLC-dmd-hit 398,063,905 20,389 397,540,084 LLC-dmd-hit-ratio 0.47 LLC-dmd-hit-ratio 0.81 1.00 0.81 LLC-dmd-miss 40,506 LLC-dmd-miss 92,423,156 65 90,031,213 LLC-dmd-pending-cycles 797,702 LLC-dmd-pending-cycles 4,641,115,268 tune/stream.perf.all default/stream.perf.all 576,421 3,309,123,272 dram-dmd-estcycles 8,101,200 dram-dmd-estcycles 18,484,631,200 13,000 18,006,242,600 dram-dmd-pending-cycles 6,386,436 dram-dmd-pending-cycles 7,543,074,379 12,863 5,245,938,106 RAT-stalls 568,453,277 RAT-stalls 38,775,346 849,727,093 24,203,230 VirtAddr-estcycles 5,037,891,171 VirtAddr-estcycles 948,613,095 106,095 652,336,113 clears 24,000 clears 1,262,047 19,541,817 743,989 decode-switch-cycles 20,027,476 decode-switch-cycles 4,127,875 167,202,822 4,925,309 processes. 15. Figure processes. 14. Figure

event count normalized event count normalized 1 1 1 1 0 0 0 0 0 0 0 1 2 0 boueeetcut for counts event Absolute boueeetcut for counts event Absolute

task-clock (msec) 5,484 task-clock (msec) 8,266 9,032 7,937 cycles 17,624,330,719 cycles 26,848,760,160 25,102,349,089 22,032,050,115 instructions 2,560,374,126 instructions 14,634,956,342 16,861,816,689 14,739,629,779 IPC0 0.06 IPC0 0.21 0.67 0.67 cpu-migrations 25 cpu-migrations 43 0 0 context-switches 925 context-switches 1,450 1 1 major-fault-estcycles 0 major-fault-estcycles 0 0 0 major-faults 0 major-faults 0 0 0 minor-fault-estcycles 47,333 minor-fault-estcycles 293,033,000 169,000 293,153,333 minor-faults 47 minor-faults 293,033 169 293,153 69,947 753,732 syscalls

branch-misses stream branch-misses 20,063,662 1,012,925 441,661,036 2,467,650,054

branches Event countercomparisonfor"syscalls"understress branches 3,240,376,763 2,488,071,633 Event countercomparisonfor"stream"understress branches-indirect 20,026,362 branches-indirect 28,012 20,004,405 4,071 mispred-estcycles 1,398,940 mispred-estcycles 15,074,640 o eal eu dfut n u rpsdstp(ue ihbackground with (tune) setup proposed our and (default) setup default for o eal eu dfut n u rpsdstp(ue ihbackground with (tune) setup proposed our and (default) setup default for 401,273,240 20,258,500 TLB-miss-estcycles 2,239,892,893 TLB-miss-estcycles 204,286,495 16,975 129,393,663 cache-dmd-estcycles 10,401,648,850 cache-dmd-estcycles 27,595,898,998 50,756,129,846 27,774,674,616 L1-dmd-estcycles 10,392,772,344 L1-dmd-estcycles 16,635,589,544 50,748,847,860 17,086,372,004 L1d-dmd-hit-ratio 0.95 L1d-dmd-hit-ratio 0.50 0.99 0.51 L1d-dmd-load 903,651,867 L1d-dmd-load 4,009,025,384 5,277,866,334 4,030,113,364 862,232,090 2,007,000,126

21 L1d-dmd-load-hit L1d-dmd-load-hit 5,233,062,803 2,037,884,165 L1i-hit 873,728,906 L1i-hit 144,897,134 2,221,086,359 195,824,671 L1i-hit-ratio 1.00 L1i-hit-ratio 0.99 1 1.00 L1i-miss 578,121 L1i-miss 1,163,387 682,572 623,943 L2-dmd-estcycles 3,842,880 L2-dmd-estcycles 978,345,384 4,860,996 1,040,060,112 L2-hit-ratio 0.26 L2-hit-ratio 0.23 0.29 0.25 L2-pending-cycles 66,517,961 L2-pending-cycles 14,461,216,745 57,749,845 11,728,151,901 L2d-dmd-access 426,590 L2d-dmd-access 345,695,728 348,694 342,624,015 L2d-dmd-hit 50,924 L2d-dmd-hit 81,103,977 10,904 86,449,824 L2i-hit 269,316 L2i-hit 424,805 394,179 221,852 L2i-miss 541,851 L2i-miss 1,243,444 670,815 635,185 LLC-dmd-access 969,513 LLC-dmd-access 486,663,894 1,051,229 474,170,021 LLC-dmd-estcycles 5,033,626 LLC-dmd-estcycles 9,981,964,070 tune/syscalls_stress_vm.perf.all default/syscalls_stress_vm.perf.all

2,420,990 9,648,242,500 tune/stream_stress_vm.perf.all default/stream_stress_vm.perf.all LLC-dmd-hit 193,601 LLC-dmd-hit 383,921,695 93,115 371,086,250 LLC-dmd-hit-ratio 0.20 LLC-dmd-hit-ratio 0.79 0.09 0.78 LLC-dmd-miss 775,912 LLC-dmd-miss 102,742,199 958,114 103,083,771 LLC-dmd-pending-cycles 2,289,418 LLC-dmd-pending-cycles 5,032,987,798 790,800 3,983,037,298 dram-dmd-estcycles 155,182,400 dram-dmd-estcycles 20,548,439,800 191,622,800 20,616,754,200 dram-dmd-pending-cycles 64,228,542 dram-dmd-pending-cycles 9,428,228,946 56,959,044 7,745,114,602 RAT-stalls 575,155,083 RAT-stalls 44,017,799 862,860,562 31,506,880 VirtAddr-estcycles 5,171,010,255 VirtAddr-estcycles 1,054,229,638 152,115,537 722,636,946 clears 494,719 clears 1,578,824 20,117,960 1,232,148 decode-switch-cycles 20,247,219 decode-switch-cycles 4,726,328 167,604,829 5,070,730