Energy Efficiency Aspects of the AMD Zen 2 Architecture

Energy Efficiency Aspects of the AMD Zen 2 Architecture Robert Schöne1 Thomas Ilsche2 Mario Bielert2 Markus Velten2 Markus Schmidl3 Daniel Hackenberg2 Center for Information Services and High Performance Computing (ZIH), TU Dresden, Dresden, 01062, Germany 1 [email protected] 2 fi[email protected] 3 [email protected] Abstract—In High Performance Computing, systems are eval- only defined by power efficiency during peak performance. uated based on their computational throughput. However, per- Many mechanisms, such as P-states or C-states, concern formance in contemporary server processors is primarily limited operation during scaled-down performance or idle phases. by power and thermal constraints. Ensuring operation within a given power envelope requires a wide range of sophisticated Other mechanisms, such as Turbo frequencies and power control mechanisms. While some of these are handled transpar- capping, aim at maximizing performance under thermal and ently by hardware control loops, others are controlled by the power constraints. They are supported by internal energy operating system. A lack of publicly disclosed implementation measurements, which can also be used for energy-efficiency details further complicates this topic. However, understanding optimizations. This paper analyzes these dynamic and highly these mechanisms is a prerequisite for any effort to exploit the full computing capability and to minimize the energy consumption configurable mechanisms rather than the application-specific of today’s server systems. This paper highlights the various performance per watt. The resulting insight is the foundation energy efficiency aspects of the AMD Zen 2 microarchitecture to to improve the complex interactions between applications, facilitate system understanding and optimization. Key findings operating systems (OSs), and independent hardware control include qualitative and quantitative descriptions regarding core for performance and energy efficiency. frequency transition delays, workload-based frequency limitations, effects of I/O die P-states on memory performance as The paper is structured as follows: Section II and Section III well as discussion on the built-in power monitoring capabilities discuss existing work on the evaluation of energy efficiency and its limitations. Moreover, we present specifics and caveats mechanisms and the Rome architecture, respectively. Sec- of idle states, wakeup times as well as the impact of idling and tion IV introduces our test system and power measurement inactive hardware threads and cores on the performance of active infrastructure. The next three sections highlight particular resources such as other cores. aspects, each including methodologies, test results, and a Keywords-AMD; Zen 2; Epyc Rome; power saving; energy discussion: In Section V, we unveil details on processor efficiency; DVFS; C-State; performance; RAPL frequencies. Section VI covers characterizations of processor I. INTRODUCTION idle states. In Section VII, we validate the accuracy of the internal power monitoring mechanism. We conclude the paper With the Epyc Rome processor generation, AMD processors with a summary and outlook in Section VIII. gained a noticeable share in the TOP500 list of supercomputers for the first time since Opteron Interlagos, which debuted in II. RELATED WORK ON EFFICIENCY MECHANISMS 2011. The new architecture is not only competitive in terms of A. ACPI States performance, but also power efficiency among systems using Power saving interfaces are defined in the Advanced Con- general-purpose x86 processors as Figure 1 shows. figuration and Power Interface (ACPI) [3]. Performance states The Green500 [1] list, used to create Figure 1, ranks top (P-states) provide different performance levels and can be High Performance Computing (HPC) systems by their energy arXiv:2108.00808v1 [cs.AR] 2 Aug 2021 selected during runtime [3, Section 2.6]. Usually processors efficiency under full load. In practice, energy efficiency is not implement these with Dynamic Voltage and Frequency Scaling (DVFS). However, their particular implementation is highly processor-architecture dependent. Mazouz et al. were one of AMD Zen 2 (Rome) the first to investigate this in [4]: They describe how P- Intel Cascade Lake state transition times depend on initial and target processor Intel Xeon Phi frequency. We show in [5] that waking an idling processor Intel Skylake core is also frequency-dependent, but additionally depends on Intel Broadwell the waker-wakee-relation and the applied idle state. Likewise, Intel Haswell we also show in [6] how long it takes to enter an idle state. Finally, we describe the effect of clock modulation (throttling) 0 1 2 3 4 5 6 in [7]. This paper covers P-state and C-state transitions in Power Efficiency [GFlops/W] Section V and Section VI, respectively. Software controlled Fig. 1: Efficiency of systems with x86 processors in the clock modulation, however, is not publicly documented for 2021/07 Green500 list [1] (architectures with > 5 systems). the Zen 2 architecture [2], [8]. © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. B. Processor-internal Power Measurement and Capping on processors with up to 64 cores. Based on the core count, Processor-internal power monitoring is used to select turbo two or one of the CCDs attach to the same switch within frequencies and implement power capping [9], [10]. The the I/O die network. Each of the switches on the I/O die that accuracy of these monitoring mechanisms therefore directly connects the CCDs also attaches a memory controller with influences processor performance. We describe the Intel Run- two memory channels, which can result in four non-uniform ning Average Power Limit (RAPL) for Intel Sandy Bridge and memory access (NUMA) nodes. the AMD Application Power Management (APM) for AMD Each core has a common front-end which fetches instruc- Bulldozer in [11]. We find that both are based on models tions for two independent hardware threads [17]. The fetch that use data from processor internal resource usage monitors. window is 32 B wide and fed to a 4-way decoder. The back- We also analyze RAPL for Intel Haswell processors [12] and end is split into two parts: One part comprises four Arithmetic describe it to be accurate and based on measurements. Hähnel Logical Units and three Address Generation Units (AGUs), et al. measure the update rate of RAPL in [13] as 1 ms. Lipp the other contains two 256-bit-wide Floating-point Multiply- et al. show in [14] that RAPL can provide significantly higher Add (FMA) and two 256-bit-wide floating-point add units. rates for the core power domain (pp0) for certain processors. The AGUs can be used for two loads and one store per cycle, They leverage this for a side channel attack [14]. where each of these can transfer up to 32 B of data. Each processor core holds an op cache for 4096 ops, 32 KiB C. Processor-specific Overviews L1I and L1D caches, and 512 KiB L2 cache, which is used for instructions and data. In addition, each CCX holds 16 MiB We describe the Intel Haswell server architecture in de- of L3 cache, distributed over four slices with 4 MiB each. tail [12]. In addition to the previously mentioned RAPL analysis, the authors found an asynchronous mechanism that B. Energy Efficiency Details for ACPI States sets core frequencies in an interval of 500 µs. We also describe The AMD Zen 2 architecture implements a wide range of the interaction between core and uncore frequency mecha- power saving mechanisms. According to AMD’s Processor nisms and show how concurrency and frequencies influence Programming Reference [2, Section 2.1.14.3], a maximum of memory bandwidths. Gough et al. provide a broad overview eight P-states can be defined. However, on most systems, the of the Haswell architecture as well as suggestions for tun- number of available P-states will be lower. The actual number ing server systems according to user requirements [15]. We can be retrieved by polling the P-state current limit Model describe the Intel Skylake server architecture [16], covering Specific Register (MSR). The definition of single P-states the internal hardware control for uncore frequencies, AVX- includes specifications for frequency, the “expected maximum frequency mechanisms, and the influence of data on the power current dissipation of a single core”, and a “voltage ID”. The consumption of a well-defined workload. We show that uncore latter is not publicly documented. A processor core frequency frequency changes can occur every 1:5 ms. can be higher than nominal when using Core Performance Boost. No implementation details are disclosed for server ar- III. ARCHITECTURAL DETAILS OF “ROME”PROCESSORS chitectures. For desktop processors, AMD describes Precision A. General Architecture Details Boost, where the frequency can be increased in 25 MHz steps 1 Zen 2 uses a modular design on multiple levels [2, Section as part of the SenseMI technology . This would match the 1.8.1]. The structure of the processor is depicted in Figure 2. frequency multiplier entry in the MSR, where multiples of Four cores are clustered in one Core Complex (CCX, also 25 MHz can be defined. CPU Complex). One Core Complex Die (CCD) comprises 1https://community.amd.com/t5/blogs/understanding-precision-boost-2-in- two CCXs. Up to eight CCDs are attached to one I/O die amd-sensemi-technology/ba-p/416073 Core 2 4 MiB L3 4 MiB L3 Core 3 CCD CCD CCD CCD X C DRAM UMC UMC DRAM C Core 0 4 MiB L3 4 MiB L3 Core 1 DRAM UMC UMC DRAM IF I/O, xGMI DRAM UMC UMC DRAM Core 4 4 MiB L3 4 MiB L3 Core 5 X DRAM UMC UMC DRAM C C Core 6 4 MiB L3 4 MiB L3 Core 7 CCD CCD CCD CCD (a) Core Complex Die (CCD) with Core Complexes (CCXs) (b) I/O die with memory controllers (UMC), attached memory, IF- Switches (brown), CCDs, repeaters, and I/O; xGMI attachements not depicted; based on [2] Fig.

Energy Efficiency Aspects of the AMD Zen 2 Architecture

Effective Virtual CPU Configuration with QEMU and Libvirt

H242-Z10 Edge Server - AMD up 2U 4 Nodes

AMD EPYC™ 7371 Processors Accelerating HPC Innovation

Intel/AMD CPU Release Dates, 2018

Evaluation of AMD EPYC

AMD Ryzen™ PRO & Athlon™ PRO Processors Quick Reference Guide

AMD EPYC 7002 Architecture Extends Benefits for Storage-Centric

Amd（Amd.Us）18Q1 点评 2018 年 07 月 30 日

Software-Based Undervolting Faults in AMD Zen Processors Fehler in AMD Zen Prozessoren Durch Software-Basierte Unterspannung

High Performance Linpack Benchmark on AMD EPYC™ Processors

AMD's Early Processor Lines, up to the Hammer Family (Families K8

AMD Raven Ridge