Energy Efficiency Features of the Intel Skylake-SP Processor and Their

Energy Efficiency Features of the Intel Skylake-SP Processor and Their Impact on Performance Robert Schöne, Thomas Ilsche, Mario Bielert, Andreas Gocht, Daniel Hackenberg Center for Information Services and High Performance Computing (ZIH) Technische Universität Dresden – 01062 Dresden, Germany Email: {robert.schoene | thomas.ilsche | mario.bielert | andreas.gocht | daniel.hackenberg}@tu-dresden.de Abstract—The overwhelming majority of High Performance Therefore, they are using their systems as is, hoping that Computing (HPC) systems and server infrastructure uses Intel the applied settings will fit their purpose. High Performance x86 processors. This makes an architectural analysis of these Computing (HPC), however, has specific requirements because processors relevant for a wide audience of administrators and performance engineers. In this paper, we describe the effects a single wrongly-configured core in a highly parallel program of hardware controlled energy efficiency features for the Intel can lead to millions of cores waiting. Therefore, a detailed Skylake-SP processor. Due to the prolonged micro-architecture analysis and understanding of the possible impacts of power cycles, which extend the previous Tick-Tock scheme by Intel, our saving mechanisms is key to every performance and energy findings will also be relevant for succeeding architectures. The efficiency evaluation and optimization. findings of this paper include the following: C-state latencies increased significantly over the Haswell-EP processor generation. In this paper, we describe new features of Intel Skylake-SP The mechanism that controls the uncore frequency has a latency processors, which are likely also part of future processors. In of approximately 10 ms and it is not possible to truly fix the previous years, Intel followed a Tick-Tock scheme, where a uncore frequency to a specific level. The out-of-order throttling new architecture and a new process are introduced alternately. for workloads using 512 bit wide vectors also occurs at low Recently, this cycle has been extended with a third step where processor frequencies. Data has a significant impact on processor power consumption which causes a large error in energy models the architecture is slightly improved. This is a chance for relying only on instructions. software and performance engineers, because an adaption of Index Terms—Microprocessors, Performance analysis, Systems software for a specific hardware has now more time to be modeling, Dynamic voltage scaling worthwhile. This paper is structured as follows: Section II describes I. INTRODUCTION AND BACKGROUND relevant micro-architectural changes and energy efficiency In recent years, the number of energy efficiency features in features of the Intel Skylake-SP processor compared to the Intel processors increased significantly. Contemporary proces- predecessor generations. In Section III, we give an overview sors support Per-Core P-States (PCPs) [1], Uncore Frequency about the test system that we use for our experiments. A Scaling (UFS) [1], turbo frequencies [2, Section 14.3.3], core short analysis of standardized mechanisms described in the and package C-states [3], T-states [4], power capping [5], Advanced Configuration and Power Interface (ACPI) is given Energy-Performance-Bias (EPB) [2, Section 14.3.4], and in Section IV. Sections V-VIII present an analysis of hardware other mechanisms. While these features improved the energy- controlled energy efficiency mechanisms and their effect on proportionality significantly, they also have a major influence runtime and power consumption. We conclude our paper with on the performance of the processor. This can be seen when a summary and an outlook in Section IX. comparing the given system configuration for results submitted for the SPECpower_ssj [6] with those submitted to performance II. SKYLAKE-SP ARCHITECTURE AND ENERGY related benchmarks like SPEC CPU2017 [7]. While the former EFFICIENCY FEATURES results got more energy proportional within the last decade, The Skylake server processors introduce several new micro- the latter often disable power saving mechanisms to increase architectural features, which increase performance, scalability, performance and provide repeatable results. Furthermore, new and efficiency. Furthermore, they uphold energy efficiency architectural features, which boost performance, also influence features from older architectures and introduce new ones. the power consumption. One example is the introduction of AVX2 and FMA in Intel Haswell processors. Some workloads A. Microarchitecture using these features, would increase the power consumption On the core level, which is summarized in Table I, AVX- over the thermal design power (TDP) at nominal frequency. 512 (or AVX512F to be precise) is the most anticipated All of these power-related features span a huge design space feature for HPC. With AVX-512, an entire cache line (64 B, with conflicting optimization goals of performance, power one ZMM register) can now be used as input for vector capping, and energy efficiency. Users are often oblivious to the instructions. Furthermore, with the new EVEX prefix, the implications of these mechanisms which are typically controlled number of addressable SIMD registers doubled to 32. Therefore, by hardware, firmware and sometimes the operating system. 2 kiB of data can now be held in vector registers, compared © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works Microarchitecture Haswell-EP Skylake-SP 6100/8100 facilitates higher turbo frequencies on some cores and lower References [8], [9], [10], [11] [8], [11], [12], [13] on others, which in turn influences the performance of a Allocation queue 56/core 64/thread single thread depending on the hardware core that executes Execute 8 micro-ops/cycle 8 micro-ops/cycle Scheduler entries 60 97 it. However, currently none of the currently available Xeon ROB entries 192 224 Scalable Processors supports this feature. An anticipated INT/FP register file 168/168 180/168 feature of Skylake processors is Hardware Duty Cycling SIMD ISA AVX2 AVX-512 FPU width 2×256 Bit FMA 2×512 Bit FMA (HDC) [2, Section 14.5], also known as SoC Duty Cycling. FLOPS/cycle (double) 16 32 HDC implements a more coarse grained duty cycling in Load/store buffers 72/42 72/56 comparison to T-states. In contrast to clock gating, HDC uses L1D accesses 2×32 B load + 2×64 B load + per cycle 1×32 B store 1×64 B store C-states with power gating. However, this feature is targeted L2 B/cycle 64 64 at mobile and desktop processors and is not described in more Supported memory 4×DDR4-2133 6×DDR4-2666 detail in this document. DRAM bandwidth up to 68.2 GB/s up to 128 GB/s Table I: Comparison of Haswell-EP and Skylake-SP processors C. Hardware-Controlled P-States (HWP) to 512 B in Haswell and Broadwell architectures. The L1D Introduced with the Broadwell generation of Intel processors, cache bandwidth has been doubled while L2 cache bandwidth Hardware-Controlled P-States (alias Hardware Power Manage- is still limited to 64 B/cycle. The L2 cache size increased from ment (HWPM) or SpeedShift), move the decision of frequency 256 kiB to 1 MiB, whereas LLC slice sizes decreased from and C-state usage from the operating system to the processor [2, 2.5 MiB to 1.375 MiB. The increased number of store buffers Section 14.4]. This removes the perturbation of an OS control can lead to a faster streamed store throughput. In addition, all loop, which interrupts the workload regularly. Furthermore, it codes benefit from extended out-of-order features. increases the responsiveness because the hardware control loop The uncore also faced a major re-design. In previous can be executed more frequently without perturbation. While architectures (starting with Nehalem-EX), cores have been the Broadwell processors hardware acts mostly autonomously, connected by a ring network. The increased number of cores Skylake-SP processors provide interfaces for a collaboration in Haswell and Broadwell architectures led to the introduction with the OS through interrupts [2, Section 14.4.6]. With the of a second ring with connections between both. Skylake now HWP interface, the OS can define a performance and power introduces a mesh architecture, where each core, including its profile, and set a minimal, efficient, and maximal frequency. LLC slice, connects to a 2D mesh, as depicted in Figure 1. Under Linux, the tool x86_energy_policy can be used The external communication (PCIe, UPI) is placed on one side to interact directly with the hardware interface. HWP is of the mesh. Furthermore, two cores in the remaining mesh accompanied by an incremental counter register MSR_PPERF, are replaced by integrated memory controllers (iMCs), each which holds the Productive Performance Count (PCNT) that of which can host up to three DRAM channels. There are should increase only if a cycle has been used efficiently. multiple flavors of Skylake-SP based processors with varying external connections and numbers of cores. The cluster-on-die (CoD) feature is now called sub-NUMA clustering and enables dividing the cores into two NUMA PCIe UPI PCIe PCIe UPI PCIe domains. Cache coherence is implemented by MESIF and a DMI,CBDMA directory-based home snoop protocol. B. Energy Efficiency Mechanisms Core L3 Core L3 Core L3 Core L3 Core L3 Core L3 Like its predecessors, Skylake-SP processors support Per- DDR4 A DDR4 D DDR4 B DRAM Core L3 Core L3 Core L3 Core L3 DRAM DDR4 E Core P-States (PCPs) and Uncore Frequency Scaling (UFS). DDR4 C DDR4 F This enables fine-grained control over performance and energy efficiency decisions. The Energy Performance Bias (EPB) Core L3 Core L3 Core L3 Core L3 Core L3 Core L3 indicates, whether to balance the profile for runtime or power consumption or something in between.

Energy Efficiency Features of the Intel Skylake-SP Processor and Their

Inside Intel® Core™ Microarchitecture Setting New Standards for Energy-Efficient Performance

Dual-Core Intel® Xeon® Processor 3100 Series Specification Update

Microcode Revision Guidance August 31, 2019 MCU Recommendations

Single Board Computer

IBM Posts SPEC CPU2006 Scores for Quad-Core X3200 M2 X3200 M2 Achieves Leadership Specint2006 Score for a Single-Socket Server Using Intel Xeon X3370 Processor

The New Intel® Xeon® Processor Scalable Family

Instruction Rate with Ivy Bridge Vs Haswell for Some Common Jobs

Intel's Core Microarchitecture Sets New Records in Performance and Energy Efficiency 23 May 2006

Product Brief: Intel® Xeon® Scalable Platform

Download Datasheet

The Intel X86 Microarchitectures Map Version 2.0

Ultimate Workstation Performance