Core Specialization for AVX-512 Using Fault-And-Migrate

Core Specialization for AVX-512 Using Fault-and-Migrate Masterarbeit von cand. inform. Peter Brantsch an der Fakultät für Informatik Erstgutachter: Prof. Dr. Frank Bellosa Zweitgutachter: Prof. Dr. Wolfgang Karl Betreuender Mitarbeiter: Mathias Gottschlag, M.Sc. Bearbeitungszeit: 07. Januar 2019 – 08. Juli 2019 KIT – Die Forschungsuniversität in der Helmholtz-Gemeinscha www.kit.edu Ich versichere wahrheitsgemäß, die Arbeit selbstständig verfasst, alle benutzten Hilfsmittel vollständig und genau angegeben und alles kenntlich gemacht zu haben, was aus Arbeiten anderer unverändert oder mit Abänderungen entnommen wurde sowie die Satzung des KIT zur Sicherung guter wissenschaftlicher Praxis in der jeweils gültigen Fassung beachtet zu haben. Karlsruhe, den 08. Juli 2019 Abstract The Advanced Vector Extensions 512 (AVX-512) are modern Single Instruction Multiple Data (SIMD) extensions to the x86 instruction set using 512-bit wide registers, enabling substantial acceleration of numeric workloads, for example processing eight sets of 64- bit operands in parallel. Because of the high power consumption of the corresponding functional units, a CPU core executing AVX-512 instructions has to temporarily reduce its clock frequency to maintain thermal and electrical limits. This clock frequency reduction can slow down the scalar part of mixed workloads because it persists substantially beyond the last AVX-512 instruction. To mitigate this performance impediment, core specialization can be used, which is the preferred use of certain cores for specific kinds of computation. By running AVX-512 and scalar code on disjoint sets of CPU cores, throttling of cores executing scalar code can be avoided. The Operating Systems Group at the Karlsruhe Institute of Technology has already demonstrated that core specialization can be effectively employed against the aforemen- tioned performance reduction by implementing it in Linux. A new system call is introduced to mark the beginning and end of AVX-512 phases of a task, such that the scheduler can migrate it to a specialized core. However, the existing implementation is neither transparent nor automatic, but instead requires the application to be modified. This thesis presents an extension of the existing core specialization implementation, making it transparent and automatic by efficiently virtualizing AVX-512 to intercept the instructions and subsequently trigger migration. Our extension determines the necessary number of AVX-512 cores at runtime based on CPU time consumed. Because there is no trivial way of detecting the end of an AVX-512 phase, we compare different heuristics for re-migration. We evaluate our prototype in a web server scenario with nginx and OpenSSL using ChaCha20-Poly1305 encryption and brotli compression, using AVX-512 to accelerate the combination of cipher and message authentication code. The benchmarks show that the performance degradation caused by AVX-512-induced frequency reductions can be almost completely mitigated, without having to modify the application. v Contents Abstractv Contents1 1 Introduction3 2 Background and Related Work5 2.1 AVX-512 . .5 2.2 Processor Clock Frequency Behavior . .6 2.3 Research in Mixed AVX-512 Workloads . .7 2.4 Core Specialization . .8 2.5 Scheduling for Heterogeneous Systems . .9 2.6 Staged Computation and Cohort Scheduling . 10 2.7 Reconfigurable Systems . 11 3 Analysis 13 3.1 Core Specialization . 13 3.2 Fault and Migrate . 14 4 Design and Implementation 15 4.1 Fault-and-Migrate . 15 4.1.1 Making AVX-512 Trap . 15 4.1.2 Triggering Migration . 17 4.1.3 Handling AVX-512 in The Kernel . 17 4.2 Core Specialization . 17 4.3 Determining the Number of AVX-512 Cores . 18 4.4 Re-Migration Heuristics . 19 4.5 Orthogonality of Approaches . 20 4.6 Debugging and Configuration . 20 5 Evaluation 23 5.1 Setup and Methods . 24 5.1.1 Patching CPU Feature Detection . 25 5.1.2 Repeatability . 25 5.1.3 Performance Counters . 25 5.1.4 The “Perf” Tools . 26 5.2 Mitigation of Frequency Reduction Effects . 26 5.2.1 Re-Migration Heuristics . 27 1 Contents 5.2.2 Selection of AVX-512 Cores . 28 5.2.3 Influence of the α Parameter . 31 5.2.4 Isolation of Throttling . 31 5.3 Theoretical Limit of Speedup . 32 5.4 Remaining Overhead . 34 5.4.1 Trap Handling . 34 5.4.2 Fault-and-Migrate . 34 5.5 Latency . 38 5.6 The Multi-Queue Skiplist Scheduler . 39 5.7 Discussion . 40 6 Conclusion 41 6.1 Future Work . 42 Bibliography 45 Glossary 49 2 1 Introduction Until around 2007, semiconductor scaling worked approximately like Dennard et al. [7] described it in 1974, with each new technology generation bringing smaller, faster tran- sistors at a constant power density. Then, Dennard scaling ended [2], as leakage currents through the ever thinner gate oxide increased, becoming a noticeable part of chip power consumption. The power density could no longer be kept constant and clock frequencies stopped increasing, though transistor miniaturization continued, and the number of tran- sistors per chip increased. To still increase the computing power of chips, even though single-thread performance scaling is hampered by stagnating clock frequencies, contem- porary designs employ parallelism of both code1 and data2. Because of the high power density and parallelism of processors, an increasing share of die area is under-utilized, either because running all of the chip at its maximum clock rate would exceed thermal or electrical limits, or because programs do not efficiently exploit the processor’s parallelism [8]. This share is called dark silicon. A recent example are the Advanced Vector Extensions 512 (AVX-512) [21, 22], because their functional units can raise the power consumption of a CPU core up to the point that it has to reduce its clock frequency, depending on the type3 and rate of Advanced Vec- tor Extensions 2 (AVX 2) and AVX-512 instructions, and the number of similarly loaded cores. However, these SIMD additions to the x86 instruction set operating on 512-bit wide registers can significantly accelerate computation such as cryptography (see the mi- crobenchmark in Section 5.3), offsetting the otherwise detrimental effect of the frequency reduction on performance. If this frequency reduction only affected the SIMD instructions, it would not pose a problem, but before the core can return to higher clock frequencies, a timer of approximately 2ms first has to expire without any events again necessitating frequency changes [18, section 17.26]. Mixed workloads, in which only small parts of computation are accelerated, can therefore suffer an overall slowdown, because the reduced clock frequency persists beyond the last AVX-512 instruction, affecting the following scalar instructions. This slowdown was demonstrated by Krasnov [28] in a web server using AVX-512-accelerated cryptography functions. Chapter 2 will describe the frequency behavior of Intel CPUs in regard to AVX-512 in more detail. To avoid any possible frequency reduction because of wide SIMD instructions, vectorization could simply be disabled, though only at the cost of all its advantages. No general decision whether to enable or disable vectorization for all mixed applications can be made, though, because the net benefit for even a single application depends on the scenario. 1Multiple Instruction streams, Multiple Data streams (MIMD), i.e. multi-core systems 2SIMD 3floating point or integer, multiplication or simpler operation 3 1 Introduction For the same application, vectorization can be advantageous in some circumstances, and detrimental in others. Instead, a method of maintaining predictable system performance and still reaping the benefits of vectorization is needed. Ideally, to save developers from laboriously instrumenting their applications, and users from unwelcome surprises in the form of library updates bringing performance regressions, such a method would be transparent and automatic. Gottschlag and Bellosa [13, 14] have shown that core specialization can successfully be used to mitigate the performance impediments faced by mixed workloads by restricting the execution of AVX-512 code to certain cores, such that only these cores are affected by frequency reductions. This thesis builds upon their previous work, and makes the following contributions: Eicient Virtualization of AVX-512 The CPU is configured to fault on AVX-512 instructions, hence the name “fault-and-migrate”, such that we can intercept these instructions and accordingly migrate the task. This is equivalent to efficient virtualization as per the criteria of Popek and Goldberg [36]. Transparency The existing implementation introduces a new system call to inform the scheduler about the beginning and end of AVX-512 use, requiring modification of the application code. Using the efficient virtualization described above, the scheduler can detect the beginning of AVX-512 use transparently. Automation To require the least amount of tuning by the user and easily adapt to a variety of mixed workloads, the system determines the number of AVX-512 cores automatically. Evaluation using a web server scenario derived from that of Krasnov [28] shows that core specialization using fault-and-migrate can mitigate the performance reduction of mixed workloads caused by AVX-512-induced frequency reductions. Our mechanism for automatically determining the number of AVX-512 cores is not yet optimal, tendentially choosing too many cores and causing greatly variable performance. However, using a static set of cores for AVX-512, automatic migration and re-migration are working completely satisfac- torily. The web server scenario, originally suffering from an 11.6% performance reduction with AVX-512, reaches 99% of its non-AVX-512 throughput using our prototype. The remainder of this thesis is structured as follows: Chapter 2 gives detailed background information and an overview of related work, Chapter 3 analyses the problem and prerequisites and discusses possible approaches, followed by Chapter 4, which presents a solution and explains the design and implementation decisions. A comprehensive evaluation is performed in Chapter 5.

Load more