Beyond the Roofline: Cache-Aware Power and Energy-Efficiency
Total Page:16
File Type:pdf, Size:1020Kb
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TC.2016.2582151 TRANSACTIONS ON COMPUTERS, VOL. X, NO. X, MAY 2016 1 Beyond the Roofline: Cache-aware Power and Energy-Efficiency Modeling for Multi-cores Aleksandar Ilic, Member, IEEE, Frederico Pratas, Member, IEEE, and Leonel Sousa, Senior Member, IEEE Abstract—To foster the energy-efficiency in current and future multi-core processors, the benefits and trade-offs of a large set of optimization solutions must be evaluated. For this purpose, it is often crucial to consider how key micro-architecture aspects, such as accessing different memory levels and functional units, affect the attainable power and energy consumption. To ease this process, we propose a set of insightful cache-aware models to characterize the upper-bounds for power, energy and energy-efficiency of modern multi-cores in three different domains of the processor chip: cores, uncore and package. The practical importance of the proposed models is illustrated when optimizing matrix multiplication and deriving a set of power envelopes and energy-efficiency ranges of the micro-architecture for different operating frequencies. The proposed models are experimentally validated on a computing platform with a quad-core Intel 3770K processor by using hardware counters, on-chip power monitoring facilities and assembly micro-benchmarks. Index Terms—Multicore architectures, Modeling and simulation, Performance evaluation and measurement. F 1 INTRODUCTION N the road to pursuing highly energy-efficient execu- The key contribution of this paper is a set of insight- O tion, current and future trends in computer architec- ful cache-aware models to characterize the upper-bounds ture move towards complex and heterogeneous designs by for power, energy and energy-efficiency of modern multi- incorporating specialized functional units and cores [1]. In core architectures. The proposed Power CARMs explicitly this process, it is often crucial to consider the characteris- consider how the accesses to different memory levels and tics/demands of different applications and the capability utilization of specific functional units impact the power con- of relevant micro-architecture components to satisfy these sumption in different domains of the processor chip – cores, demands. In such a broad design space, evaluating benefits off-core components (uncore) and the overall chip. This and trade-offs for a set of different optimization goals and paper also proposes the Total Power CARM that defines the approaches is a hard and time consuming task. complete power envelope of a multi-core processor at a sin- Although cycle-accurate simulators can precisely model gle frequency level. By coupling two fundamental CARMs the functionality of architectures and applications, these (i.e., performance [3] and the proposed Power CARMs) dif- environments are rather complex, hard to use and difficult ferent models can be derived to express the architecture to develop [2]. In contrast, to perform a rapid analysis of efficiency limits. Due to their practical importance, we fo- different design solutions, insightful modeling is of great cus herein on two pivotal models, namely the Energy and practical importance for computer architects and application Energy-efficiency CARMs, using which the maximum energy- designers, especially in early design stages and for fast efficiency of the micro-architecture is formalized. prototyping. This type of modeling is usually focused on The proposed models are experimentally validated on a describing the vital functional aspects of the architecture computing platform with a quad-core Intel 3770K processor via intuitive graphical representation, such as our recently by using hardware counters, on-chip power monitoring proposed Cache-Aware Roofline Model (CARM) [3]. facilities and assembly micro-benchmarks, while their use- CARM is a novel insightful Roofline Modeling concept fulness is illustrated when optimizing matrix multiplication. that allows a description of the performance upper bounds The proposed models are also used to analyze the Dy- for multi-core architectures with complex memory hierar- namic Voltage and Frequency Scaling (DVFS) benefits from chy [3]. Later works, such as [4]–[7], also corroborate the the application and micro-architecture perspectives, where advantages of incorporating multiple memory levels into a different power envelopes and energy-efficiency ranges are single performance model. However, to this date, there are derived for a range of supported operational frequencies. no insightful power/energy consumption and/or energy- efficiency models based on the CARM principles. The work 2 BACKGROUND:ROOFLINE MODELING proposed in this paper aims at closing this gap. The Roofline Modeling represents an insightful approach for describing the attainable performance upper bounds of • Aleksandar Ilic and Leonel Sousa are with INESC-ID, IST, Universidade the micro-architecture (e.g., F in flops=s). It is based on de Lisboa, Lisbon, Portugal a E-mail: fAleksandar.Ilic,[email protected] the observation that the execution time T can be limited • Frederico Pratas is with Imagination Technologies Limited. either by the time to perform computations (Tφ) or mem- E-mail: [email protected] ory operations (Tβ), i.e., it assumes a perfect overlap of Manuscript received August 11, 2015; revised March 20, 2016; May 20, 2016. these operations in time (T = max T ;T ). Hence, in the f φ βg Copyright (c) 2016 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TC.2016.2582151 TRANSACTIONS ON COMPUTERS, VOL. X, NO. X, MAY 2016 2 Original Roofline Model (ORM) Cache-aware Roofline Model (CARM) EXPERIMENTAL AVX Double Precision (DP) PLATFORM Intel 3770K Ivy Bridge (4 Cores @ 3.5GHz) Floating Point (FP) Arithmetic INTENSITY (x-axis) Operational (e.g., flops/DRAMbytes) Arithmetic (flops/bytes) MODEL TYPE Multi-plot (one per memory level) Single-plot (one for all memory levels) Peak Memory Bandwidth (GB/s): AVX Load+Store Peak AVX Performance Fp 1 Core 4 Cores Between two memory levels Complete memory hierarchy Private: 1 Core Shared: 4 Cores MEMORY SCOPE CARM 28 Gflops/s 112 Gflops/s (1 AVX MUL+ADD/clk) (4 cores x 28) Memory bound region Compute bound region L1 L2 L3 DRAM ORM 32kB, bus: 384b 256kB,bus:256b 8MB, bus: 256b 2ch, 2x933MHz MODEL ORM Data-sheets Data-sheets / Hardware Counters RAPL:=PP0=and=PP1 BL1 C BL2 C BL3 C BD C CONSTRUCTION CARM Micro-benchmarking Micro-benchmarking 168→ ≈ 40.9→ ≈ 121.8→ ≈ 13.6→ TIME CPU_CLK_UNHALTED.CORE /3.5 CARM SIMD_FP_256.PACKED_DOUBLE x4 Memory bound region Compute bound region memory bandwidth to the core [7] FLOPS MEM_UOPS_RETIRED.ALL_LOADS EXPERIMENTAL ORM X Possible BL1 C BL2 L1 BL3 L2 BD L3 x32 → → → → bytes MEM_UOPS_RETIRED.ALL_STORES VALIDATION CARM Possible (achievable for all models) 168 112 448 29.9 CARM ORM ORM L1 ORM L2 ORM L3 ORM DRAM OFFCORE_RESPONSE.DATA_IN_ x64❖ Performance Power, Energy and Energy-Efficiency peak bandwidth between two memory levels ORM SOCKET.LLC_MISS.LOCAL_DRAM_0 bytes ❖ EXISTING ORM Williams, et.al. [8] Choi, et.al. [10] cache-line size DISABLED Enhanced Intel SpeedStep Technology (TurboBoost) / Hyperthreading / Turbo Mode / MODELS CARM Ilic, et.al. [3] contribution of this paper BIOS SETTINGS Hardware Prefetcher / Adjacent Cache Line Prefetch / EPU Power Saving Fig. 1: CARM and ORM: Differences and modeling domains. Fig. 2: Experimental platform, setup and hardware counters. micro-architecture, the execution can be limited either by tions and time (clock cycles) from a consistent architecture the processor computation capabilities (e.g., peak Floating point of view. Although Fp can be derived from data-sheets, Point (FP) performance, Fp in flops=s) or by the memory the By values are usually experimentally determined [3]. subsystem capabilities (i.e., the memory bandwidth, B). Model interpretation and experimental validation: The To date, there are two main approaches for performance ORM and CARM are similarly interpreted. In Fig. 3-C/D, Roofline modeling, the Original Roofline Model (ORM) [8] the slanted lines mark the memory-bound regions, while the and our recently proposed CARM [3]. Although both ap- Fp limits the compute-bound region (horizontal line). The proaches model Fa, the fundamental differences lie in the ridge point (where the slanted and horizontal lines intersect) way how memory traffic is considered and how intensity is the minimum intensity to reach maximum performance. is defined, i.e., the x-axis in the plots. The ORM consid- The experimental validation of both ORM and CARM ers data traffic between two subsequent memory levels, is performed with (micro-)benchmarks capable of fully ex- thus the x-axis refers to Operational Intensity (OI), e.g., ercising the functional units and memory subsystem. The flops=DRAMbytes, where DRAMbytes refers to amount CARM is experimentally validated for a range of different of data traffic between the Last Level Cache (LLC) and Intel micro-architectures with accuracy higher than 90% for DRAM [8]. In contrast, CARM considers Arithmetic Intensity all modeled regions [3], [9]. The impossibility of validating (AI) on the x-axis, i.e., flops=bytes, where bytes reflect data ORM with real benchmarks is discussed in [4], while adopt- traffic at the memory ports of the processor pipeline (i.e., ing the micro-benchmark methodology from [3] and [10]. as seen by the cores) [3]. This fundamental difference has Application characterization and optimization: direct repercussions in how the two models are constructed, Roofline models are typically used to simplify detection of experimentally validated, and used for application charac- the main application bottlenecks.