Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels

Ang Li (Pacific Northwest National Lab, USA), Weifeng Liu (University of Copenhagen, Denmark; Norwegian University of Science and Technology, Norway), Mads R.B. Kristensen (University of Copenhagen, Denmark), Brian Vinter (University of Copenhagen, Denmark), Hao Wang (Virginia Tech, USA), Kaixi Hou (Virginia Tech, USA), Andres Marquez (Pacific Northwest National Lab, USA), Shuaiwen Leon Song (Pacific Northwest National Lab; College of William and Mary, USA)

ABSTRACT
High-bandwidth On-Package Memory (OPM) innovates the conventional memory hierarchy by augmenting a new on-package layer between the classic on-chip caches and off-chip DRAM. Due to its relative location and capacity, OPM is often used as a new type of LLC. Despite its adoption in modern processors, the performance and power impact of OPM on HPC applications, especially scientific kernels, is still unknown. In this paper, we fill this gap by conducting a comprehensive evaluation for a wide spectrum of scientific kernels with a large amount of representative inputs, including dense, sparse and medium, on two OPMs: eDRAM on multicore Broadwell and MCDRAM on manycore Knights Landing. Guided by our general optimization models, we demonstrate OPM's effectiveness for easing programmers' tuning efforts to reach ideal throughput for both compute-bound and memory-bound applications.

CCS CONCEPTS
• Hardware → Memory and dense storage; Power estimation and optimization; • Computer systems organization → Multicore architectures; • Computing methodologies → Model development and analysis;

ACM Reference format:
Ang Li, Weifeng Liu, Mads R.B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels. In Proceedings of SC17, Denver, CO, USA, November 12–17, 2017, 14 pages. DOI: 10.1145/3126908.3126931

1 INTRODUCTION
In the past four decades, perhaps no other topic in the HPC domain has drawn more attention than reducing the overheads of data movement between compute units and the memory hierarchy. Right from the very first discussion about exploiting "small buffer memories" for data locality in the 1970's [40], caches have evolved to become the key factor in determining the execution performance and power efficiency of an application on modern processors.

Although today's memory hierarchy design is already deep and complex, the bandwidth mismatch between on-chip and off-chip memory, known as the off-chip "memory wall", is still a big obstacle for high performance delivery on modern systems. This is especially the case when GPU modules or other manycore accelerators are integrated on chip. As a result, high-bandwidth memory (HBM) becomes increasingly popular. Recently, as a good representative of HBM, on-package memory (OPM) has been adopted and promoted by major HPC vendors in many commercial CPU products, such as IBM Power-7 and Power-8 [14], Intel Haswell [16], Broadwell, Skylake [28] and Knights Landing (KNL) [42]. For instance, five of the top ten supercomputers in the newest Top 500 list [1] have equipped OPM (primarily on Intel Knights Landing).

On-Package Memory, as its name implies, is a memory storage manufactured in a separate chip but encapsulated onto the same package where the main processor locates. OPM has several unique characteristics: (i) it is located between the on-chip caches and off-package DRAM; (ii) its capacity is larger than on-chip scratchpad memory or cache but often smaller than off-package DRAM; (iii) its bandwidth is smaller than on-chip storage but significantly larger than off-package DRAM; (iv) its access latency is smaller than or similar to that of off-package DRAM. Although these properties portray OPM as similar to a last-level cache (LLC), its actual impact on performance and power for real HPC applications, especially critical scientific kernels, remains unknown.

Exploring and analyzing such impact of modern OPM on HPC scientific kernels is essential for both application developers and HPC architects. Figure 1 shows an example. For an HPC system, there is often a huge implicit gap between the "claimed" attainable performance and the real "deliverable" performance, even with the same algorithm. This is because the specific implementation used for benchmarking or ranking is usually carefully hand-tuned for parameterization. Such an expert version is significantly different from the one commonly used by an application developer or an HPC user. With regard to OPM, although conventional wisdom may question its usefulness for computation-bound applications such as GEMM, Figure 1 uses eDRAM as an example to show that it can greatly improve the chance for a less-optimized code to harvest near-peak performance, dramatically mitigating the aforementioned performance gap. In addition, recently released OPM such as MCDRAM on Intel KNL has many architectural tuning options available, so it is important to understand how these features affect applications' performance and energy profiles. With these in mind, we conduct a thorough evaluation to quantitatively assess the effectiveness of two types of widely-adopted OPM. To the best of our knowledge, this is the first work that quantitatively evaluates modern OPM on real hardware for essential HPC application kernels while developing an evaluation methodology to assist the analysis of complex results and provide useful insights.

Figure 1: Probability density for achievable GEMM performance (GFlop/s) using 1024 samples with different tiling sizes and problem sizes. With eDRAM, the curve as a whole shifts to the upper-right, implying that more samples can reach near-peak (e.g., 90%) performance. In other words, having eDRAM increases the chance for users' less-optimized applications to reach the "vendor-claimed" performance. However, the right boundary only moves a bit, indicating that eDRAM cannot significantly improve the raw peak performance.

Such evaluation is not only necessary but also fundamental for enabling reliable technical advancement, e.g., motivating software-architecture co-design exploration and assisting the validation of modeling/simulation. Particularly, this study benefits three types of audience: (A) procurement specialists considering purchasing OPM-equipped processors for the applications of interest; (B) application developers who intend to estimate the potential gain through OPM and customize optimizations; and (C) HPC researchers and architects who want to design efficient system software and architectures. Complex scenarios observed from our evaluation can extend the current design knowledge, which can then help pave the road for new research directions.

Contributions. In summary, this work makes the following three-fold contributions.
• We experimentally characterize and analyze the performance impact from two modern Intel OPMs (i.e., eDRAM on the Intel Broadwell architecture and MCDRAM on Intel Knights Landing) on several major HPC scientific kernels that cover a wide design space. We also evaluate a very large set of their representative input matrices (e.g., 968 matrices for the sparse kernels).
• We demonstrate the observations and quantitatively analyze each kernel's performance insights on these two OPMs. We also derive an intuitive visual analytical model to better explain complex scenarios and provide architectural design suggestions.
• We provide a general optimization guideline on how to tune applications and architectures for superior performance/energy efficiency on platforms equipped with these two types of OPM.

2 MODERN ON-PACKAGE MEMORY

2.1 eDRAM
eDRAM is a capacitor-based DRAM that can be integrated on the same die as the main processors. It evolved from DRAM, but is often viewed as a promising solution to supplement, or even replace, SRAM as the last-level cache (LLC). eDRAM can be either DRAM-cell based or CMOS-compatible gain-cell based; both rely on capacitors to store data bits [38]. In comparison, DRAM-cell based eDRAM requires an additional process step to fabricate the capacitor, in turn exhibiting higher storage density. Both IBM and Intel adopt a DRAM-cell based eDRAM design in their recent processors. However, unlike IBM, which digs trench capacitors into the silicon substrate of the main processor chip, Intel fills the metal-insulator-metal (MIM) trench capacitors above the transistors and fabricates an isolated chip (Figure 2). Such fabrication simplifies the manufacturing process and reduces cost. Nevertheless, being on-package is less efficient than being on-die, although Intel proposed a new high-speed On-Package I/O (OPIO) to connect the eDRAM and the CPU chip. To date, IBM has integrated eDRAM in Power-7 and Power-8, while Intel has included it in its Haswell, Broadwell and Skylake architectures. eDRAM acts as the LLC in their memory hierarchies.

Compared with SRAM (the typical building block for today's on-chip caches), eDRAM has two major advantages: (I) Higher density. The Intel eDRAM cell size is reported as 0.029 μm², 3.1x smaller than their densest SRAM cell. Other existing works report similar density advancement [35, 43, 8, 15]. The density increase has two positive impacts on performance when eDRAM is utilized as a cache: (a) with the same area, the 3x cache capacity fits larger applications' footprints and significantly reduces capacity-related cache misses (i.e., approximately a 1.7x miss rate reduction for a 3x capacity increase according to the √2 rule [20]); (b) on the other hand, with the same capacity, area reduction implies wire-length reduction and shorter latency for retrieving data [15]. (II) Lower leakage power. Compared with SRAM, eDRAM can reduce leakage power by a factor of 8 [25]. For large on-chip or on-package LLCs, this is very important as leakage power is becoming one of the major power contributors on modern processors. With the extraordinary growth of the working sets of today's high-end applications, eDRAM turns out to be even more appealing.

For Haswell and Broadwell processors, eDRAM was originally featured to satisfy the high bandwidth requirement of the Iris Pro GPU, as appending several extra DDR channels is too costly. As shown in Figure 2, the on-package eDRAM-based L4 cache is 128 MB and claimed to offer 104 GB/s peak bandwidth at one watt for the OPIO interface. As a non-inclusive victim cache, the eDRAM L4 is reported to bring 30% to 40% performance improvement for selected graphics applications [16]. For Haswell and Broadwell, the eDRAM L4 cache tags are resident in the on-chip L3 cache. Although this setup simplifies the LLC design and allows earlier tag-checking for fetches from the processor, it makes accessing the eDRAM LLC from other devices (e.g., independent GPUs via PCIe) slower, as these memory requests have to be forwarded to the on-chip L3 first before being handled by the LLC. To address this, eDRAM has been moved to a position above the DRAM controllers in Skylake (more like a memory-side buffer than a cache).
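As a quick sanity check of the 1.7x figure quoted for the √2 rule in Section 2.1 (this is our own arithmetic, simply applying the rule that the miss rate scales with the inverse square root of the capacity, not an additional measurement):

\[
\frac{\mathrm{miss}(3C)}{\mathrm{miss}(C)} \approx \sqrt{\frac{C}{3C}} = \frac{1}{\sqrt{3}} \approx 0.58,
\]

i.e., roughly a 1.7x miss-rate reduction for a 3x capacity increase, matching the estimate above.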

Table 1: On-Package Memory Tuning Options

Memory | Option | Remarks
eDRAM | Off | eDRAM is disabled
eDRAM | On | 128 MB high-throughput, low-latency L4
MCDRAM | Off | MCDRAM is not used
MCDRAM | Cache | 16 GB high-throughput L3
MCDRAM | Flat | 16 GB high-throughput memory
MCDRAM | Hybrid | 8 GB L3 + 8 GB memory

Figure 2: Intel Haswell and Broadwell Processor Architecture with eDRAM.
Figure 3: Intel Xeon Phi Knights Landing Architecture with MCDRAM.

2.2 MCDRAM
Another type of popular OPM is MCDRAM, a 3D-stacked DRAM that is adopted in the recent Intel Knights Landing (KNL) processor. MCDRAM has more channels than its DDR memory counterpart. The high channel count brings substantial bandwidth, greatly reducing the memory-wall effect.

As shown in Figure 3, there are 8 MCDRAM modules in a KNL processor, each with 2 GB capacity. These modules are connected to their corresponding on-chip memory controllers (EDCs) via a similar OPIO. Each MCDRAM module has its own read and write ports to its EDC. The 8 modules offer over 400 GB/s bandwidth, around 5 times that of the DDR4 deployed on the same KNL board. Despite the high bandwidth, MCDRAM does not show an obvious latency advantage over DDR for single data accesses; actually, when memory bandwidth demand is low, MCDRAM even has a higher access latency than DDR [27]. It can be configured in BIOS into one of the following three modes:

(i) Cache mode: MCDRAM acts as a direct-mapped LLC that caches the entire addressable memory and is managed by hardware. Unlike the eDRAM-based L4 cache that uploads its tags to the on-chip L3, the tags of MCDRAM in cache mode are stored locally.

(ii) Flat mode: the entire MCDRAM is used as addressable memory and is cached by the LLC (the on-die L2 cache on KNL). If the memory footprint of an application is smaller than the MCDRAM capacity (i.e., 16 GB), the "numactl" command tool can bind the application to the 'MCDRAM NUMA node' (i.e., an MCDRAM-based storage node without cores attached). In this case, all memory allocations within the application will by default be performed in MCDRAM. If the memory footprint exceeds MCDRAM, allocation will occur in DDR. Since numactl is used at runtime, no code modification or recompilation is essentially required.

(iii) Hybrid mode: this is the combination of the cache mode and the flat mode, in which 25% or 50% of MCDRAM can be configured as LLC, while the remaining 75% or 50% serves as flat memory. The hybrid mode is ideal for a multi-user environment or for applications which can benefit from both general caching and throughput boosting for critical and frequently accessed data.

2.3 Comparison: Commonality and Differences
eDRAM and MCDRAM are both high-throughput OPM and can be configured as an LLC between the processor and DRAM, offering substantially amplified bandwidth to the processor via the OPIO interface. However, they exhibit several differences in terms of architectural features and tuning methods: (a) eDRAM can be switched on/off in BIOS, but MCDRAM cannot; (b) eDRAM has a shorter access latency than DDR while MCDRAM does not; (c) the size of eDRAM is much smaller than that of MCDRAM (i.e., 128 MB vs. 16 GB); (d) MCDRAM can be configured in either flat mode or hybrid mode in order to be addressable in the memory space, while eDRAM can only behave as a cache. In theory, eDRAM in Haswell and Broadwell is still a CPU-side cache, which is generally closer to the cores and offers fast data access. In contrast, eDRAM in Skylake and MCDRAM in KNL are more like a memory-side buffer: they show memory semantics and provide a caching service for DRAM (i.e., with no consistency issues, though). The real-system tuning options for eDRAM and MCDRAM are summarized in Table 1.
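As a concrete illustration of the flat-mode usage described in Section 2.2, the sketch below shows the two common ways of steering data into MCDRAM: run-time NUMA binding with numactl (the approach used in this paper, requiring no code changes) and explicit allocation through the memkind library's hbwmalloc interface. The MCDRAM NUMA node id and the availability of memkind are platform assumptions; this is an illustrative sketch only, not part of the evaluated kernels.

/* Option 1 (no code changes, as used in this paper): prefer the MCDRAM NUMA node,
 * spilling to DDR when it is exhausted. The MCDRAM node id is platform-dependent
 * (often node 1 in quadrant/flat mode):
 *     numactl -p 1 ./app
 *
 * Option 2 (explicit allocation): the memkind library's hbwmalloc interface,
 * assumed to be installed on the system. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void) {
    size_t n = 1u << 26;                        /* 64M doubles = 512 MB */
    int use_hbw = (hbw_check_available() == 0); /* 0 means MCDRAM is reachable */
    double *x = use_hbw ? hbw_malloc(n * sizeof(double))
                        : malloc(n * sizeof(double));  /* fall back to DDR */
    if (!x) return 1;
    for (size_t i = 0; i < n; ++i)
        x[i] = (double)i;                       /* touch pages so they are placed */
    printf("allocated %zu doubles in %s\n", n, use_hbw ? "MCDRAM" : "DDR");
    if (use_hbw) hbw_free(x); else free(x);
    return 0;
}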
3 EVALUATION METHODOLOGY
In this section, we describe the targeted scientific algorithms, the detailed platform information, and the evaluation process.

3.1 Scientific Computing Kernels
To analyze the effects from OPM, the algorithms used for evaluation should exhibit divergent characteristics, especially for memory access. We have selected eight scientific algorithms based on the following criteria: (1) they have different arithmetic intensities, or flops-to-byte ratios, so as to cover a wide design spectrum (shown in Figure 4); (2) they should be widely used kernels that play crucial roles in scientific computing; and (3) for each algorithm, we choose the implementation that is state-of-the-art, open-sourced, and reported to achieve the best performance. For fairness, we do not modify the original algorithm codes in our evaluation, treating them as "black boxes". The evaluated algorithms are listed in Table 2, and each of them is evaluated using double precision. Also, all the dense and sparse matrices operated on by the kernels are square matrices. Based on their arithmetic intensity shown in Figure 4, we group them into three categories: dense, sparse and medium, corresponding to strongly compute-bound applications, strongly bandwidth-bound applications, and the types of applications in between.

Table 2: Scientific Kernels Characteristics. All the kernels use double precision (DP). "Threads" is the optimal thread count on Broadwell/KNL.

Algorithm | Implementation | Dwarf Class | Type | Complexity | Operations | Bytes | Arithmetic Intensity | Threads
GEMM | PLASMA [4] | Dense Linear Algebra | Dense | O(n^3) | 2n^3 | 32n^2 | n/16 | 4/64
Cholesky | PLASMA [5, 4] | Dense Linear Algebra | Dense | O(n^3) | n^3/3 | 8n^2 | n/24 | 4/64
SpMV | CSR5 [33] | Sparse Linear Algebra | Sparse | O(nnz) | nnz + 2M | 12nnz + 20M | (nnz + 2M)/(12nnz + 20M) | 8/256
SpTRANS | Scan/MergeTrans [44] | Sparse Linear Algebra | Sparse | O(nnz log nnz) | nnz log nnz | 24nnz + 8M | (nnz log nnz)/(24nnz + 8M) | 4/64
SpTRSV | P2P-SpTRSV [39] | Sparse Linear Algebra | Sparse | O(nnz) | nnz + 2M | 12nnz + 20M | (nnz + 2M)/(12nnz + 20M) | 8/256
FFT | FFTW [17] | Spectral Methods | Others | O(n log n) | 5n log n | 48n | 5 log n / 48 | 8/256
Stencil | YASK [49] | Structured Grid | Others | O(n^2) | 61n^2 | 8n^2 | 7.625 | 8/256
Stream | Stream [36] | N/A | Others | O(1) | 2n | 32n | 0.0625 | 8/256

Table 3: Platform Configuration. "SP Perf." and "DP Perf." are the theoretical maximum single- and double-precision floating-point throughput, respectively. "C/" stands for memory capacity and "B/" for memory bandwidth. "OPM" is the on-package memory; "Cache" refers to the cache level directly above the OPM in the memory hierarchy. Note that all the performance and bandwidth numbers listed here are theoretical values calculated from the spec sheets; the actual numbers can be worse.

CPU | Architecture | Cores | Frequency | SP Perf. | DP Perf. | DRAM | DRAM C/ | DRAM B/ | OPM | OPM C/ | OPM B/ | Cache
i7-5775C | Broadwell | 4 | 3.7 GHz | 473.6 GFlop/s | 236.8 GFlop/s | DDR3-2133 | 16 GB | 34.1 GB/s | eDRAM | 128 MB | 102.4 GB/s | 6 MB L3
Xeon Phi 7210 | Knights Landing | 64 | 1.5 GHz | 6144 GFlop/s | 3072 GFlop/s | DDR4-2133 | 96 GB | 102 GB/s | MCDRAM | 16 GB | 490 GB/s | 32 MB L2

Figure 4: Arithmetic Intensity Spectrum. The eight scientific kernels with different arithmetic intensities cover the whole spectrum.

3.1.1 Dense Algorithms. General Matrix-Matrix Multiplication (GEMM) is one of the most widely used computation kernels in scientific computing. It calculates the product of two dense matrices: C = α·A·B + β·C. GEMM is generally compute-bound and has a high arithmetic intensity. The algorithm complexity is O(n^3). In this paper, we use the implementation from PLASMA [4].

Cholesky Decomposition decomposes a positive-definite Hermitian matrix A into a lower triangular matrix L and its conjugate transpose L*: A = L·L*. When applicable, it is a much more efficient way to solve systems of linear equations than normal LU decomposition. Cholesky is traditionally compute-bound and has a high arithmetic intensity. The algorithm complexity is O(n^3). We use the implementation by Buttari et al. [5] included in PLASMA [4].

3.1.2 Sparse Algorithms. Sparse Matrix-Vector Multiplication (SpMV) is probably the most used and studied sparse linear algebra algorithm. SpMV multiplies a sparse matrix with a dense vector and returns a dense vector. It has low arithmetic intensity and is normally bounded by memory bandwidth. In this paper, we use the recent CSR5-based implementation [33], which uses load-balanced data partitioning and SIMD-friendly operations to achieve better performance than the conventional compressed sparse row (CSR) based SpMV [34] on various multi-/many-core platforms.

Sparse Matrix Transposition (SpTRANS) transposes a sparse matrix A of order m × n to A^T of order n × m, in which the CSR format is converted to the compressed sparse column (CSC) format, or vice versa. SpTRANS mainly rearranges the nonzero entries of the input matrix, thus requiring little computation. We select the implementations ScanTrans and MergeTrans of O(nnz log nnz) complexity [44] for evaluating SpTRANS on the CPU and KNL platforms, respectively. The two methods use two rounds of scan or multiple rounds of merge operations to locate the target position in the output format for each nonzero element, avoiding atomic writes and utilizing the small on-chip caches more efficiently on both devices.

Sparse Triangular Solve (SpTRSV) computes a dense solution vector x from a sparse linear system Lx = b, where L is a square lower triangular sparse matrix and b is a dense vector. Unlike other sparse BLAS routines [30, 44, 33, 34, 32], SpTRSV is generally more difficult to parallelize as it is inherently sequential [31]. It has a complexity of O(nnz) and an arithmetic intensity as low as SpMV's, but it is often much slower than SpMV due to the dependencies among the components of x. Its implementation is chosen from SpMP [39].

3.1.3 Other Algorithms. Stencil applications are a class of iterative kernels which constitute the core of many scientific applications. The kernels iteratively sweep the same stencil computation over the cells of a given multi-dimensional grid. Stencils can exhibit diverse arithmetic intensities depending on the specific stencil function adopted. However, optimized stencil algorithms often see high arithmetic intensity under orchestrated spatial and temporal blocking techniques [23]. The benchmark we evaluate is the 3D Finite Difference with Isotropic stencil ("iso3dfd") in YASK [49], which is a finite difference kernel of 16th order in space and 2nd order in time. The computation conducts 61 operations for each grid cell by accessing its 48 neighboring cells.

Fast Fourier Transform (FFT) is an algorithm to compute the discrete Fourier transform (DFT) of a sequence, or its inverse. It rapidly converts a signal from its original time or space domain into the frequency domain, or vice versa, by factorizing the DFT matrix into a product of sparse factors. In this evaluation, we apply the FFTW library [17] to perform 3D-FFT. The 3D-FFTW implementation first executes 1D-FFTs along the Y-dimension and then 1D-FFTs along the X-dimension using multiple threads in parallel, followed by an all-to-all communication among threads. After that, another round of 1D-FFTs along the Z-dimension is carried out. The Cooley-Tukey based computation requires 5N log N floating-point operations and has an arithmetic intensity of O(log n).

Stream is a synthetic benchmark that measures sustainable memory throughput for simple vector kernels. We use the TRIAD kernel for evaluation, which calculates the element-wise vector computation x = a + b·α. The algorithm complexity is O(1) and the arithmetic intensity is very low. The Stream benchmark is available at [36].
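To make the intensity contrast in Table 2 concrete, the two C loops below are generic textbook forms of the memory-bound end of the spectrum (they are not the CSR5 [33] or STREAM [36] codes actually evaluated, which are used unmodified): each floating-point operation is paired with several bytes of memory traffic, which is why these kernels sit at the left end of Figure 4.

/* CSR sparse matrix-vector multiply, y = A*x: about 2 flops per nonzero versus
 * ~12 bytes of matrix data (an 8-byte value plus a 4-byte column index), in
 * addition to vector and row-pointer traffic -- cf. the
 * (nnz + 2M)/(12nnz + 20M) intensity entry in Table 2. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

/* STREAM TRIAD, x = a + alpha*b: 2 flops versus 32 bytes per element
 * (two 8-byte loads, one 8-byte store, plus the write-allocate read of x),
 * i.e., the arithmetic intensity of 0.0625 listed in Table 2. */
void triad(long n, double alpha, const double *a, const double *b, double *x) {
    for (long i = 0; i < n; ++i)
        x[i] = a[i] + alpha * b[i];
}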

Figure 5: Theoretical Roofline figures for eDRAM and MCDRAM. The specific arithmetic intensities here (i.e., the kernel locations) are calculated using Table 2 under the assumption that n = 1024, nnz = 1024, M = 32.

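The ceilings in Figure 5 follow directly from the roofline model [48]; the sketch below reproduces them from the Table 3 numbers (the peak and bandwidth constants are copied from that table, and the two arithmetic intensities are the Table 2 formulas evaluated at n = 1024, nnz = 1024, M = 32, as in the figure caption).

#include <stdio.h>

/* Roofline attainable performance: min(compute peak, arithmetic intensity x bandwidth). */
static double roofline(double peak_gflops, double bw_gbs, double ai) {
    double mem_bound = ai * bw_gbs;
    return (mem_bound < peak_gflops) ? mem_bound : peak_gflops;
}

int main(void) {
    const double knl_dp_peak = 3072.0;               /* GFlop/s, Table 3 */
    const double mcdram_bw = 490.0, ddr4_bw = 102.0; /* GB/s, Table 3 */
    const double ai_gemm = 1024.0 / 16.0;            /* n/16 with n = 1024 */
    const double ai_spmv = (1024.0 + 2 * 32) / (12 * 1024.0 + 20 * 32);

    printf("GEMM on KNL: DDR %.0f, MCDRAM %.0f GFlop/s\n",
           roofline(knl_dp_peak, ddr4_bw, ai_gemm),
           roofline(knl_dp_peak, mcdram_bw, ai_gemm));
    printf("SpMV on KNL: DDR %.1f, MCDRAM %.1f GFlop/s\n",
           roofline(knl_dp_peak, ddr4_bw, ai_spmv),
           roofline(knl_dp_peak, mcdram_bw, ai_spmv));
    return 0;
}

Under these numbers GEMM is compute-bound on either memory, while SpMV's ceiling moves from roughly 8.6 GFlop/s with DDR4 to about 41 GFlop/s with MCDRAM, consistent with the gap between the two rooflines in Figure 5.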
3.2 Platform Configuration
The two widely-adopted OPM platforms selected for evaluation are shown in Table 3. They represent a contemporary CPU (e.g., Broadwell) and a manycore accelerator (e.g., KNL) that are equipped with OPM. A Broadwell-based platform is tested here because it is the only recent processor offering the tuning options shown in Table 1; Haswell and Skylake processors do not support switching eDRAM on/off in BIOS. Nevertheless, the observations we draw are independent of processor type. For both platforms, it is possible to adopt Simultaneous Multithreading (SMT) to increase parallelism and hide latency. However, this functionality is sometimes disabled in HPC practice since it may harm performance, especially for compute-bound applications. In our evaluation, we choose the number of threads delivering the best performance for each application, shown in Table 2. Figure 5 uses the roofline model [48] to demonstrate how these two OPMs can impact the theoretical peak performance of the aforementioned algorithms (i.e., with and without OPM); we also draw the single-precision performance ceiling for reference.

3.3 Evaluation Process
During the evaluation process, we take both the important architectural tuning options and the algorithmic tuning features into consideration. For the former, each selected algorithm is evaluated on the two platforms shown in Table 3 with the different architectural tuning options illustrated in Table 1. For the latter, algorithms belonging to different categories of arithmetic intensity are evaluated through their individual essential tuning factors, but without modifying the algorithm itself. For the dense algorithms, we focus on gradually changing the input matrix size and tiling size. For the sparse algorithms, we evaluate the impact of their inputs' sparse structures on performance. Specifically, we conduct the testing using a very large set of sparse matrices: we select all the square matrices with more than 200,000 nonzeros from the UF Sparse Matrix Collection [11], a total of 968 out of 2757 matrices. The rows of all matrices are ordered using the segmented sort [22] for best performance. Such a test set is sufficient to cover a large feature space, and it is considerably larger than those of previous research on evaluating sparse algorithms, most of which only test dozens of UF matrices. All the results shown in the figures are the average of multiple executions.

For the applications running on KNL, we use the quadrant-cluster mode. It is the default mode, in which both MCDRAM and DDR are Unified-Memory-Access (UMA) while all the DDR channels connected to KNL have the same capacity. For general applications, quadrant mode normally achieves the optimal performance without explicit NUMA complexity. Furthermore, we use "numactl -p" to allocate data on MCDRAM as the preference; when the MCDRAM node is exhausted, the extra data will be allocated on the DDR node.

Figure 6: Stepping Model. (A) With only memory, the throughput curve looks like a slope. But once a cache is added, a cache peak will appear, which may be followed by a cache valley when the memory-level parallelism at this point is insufficient to saturate the bandwidth of the lower memory hierarchy. (B) With multiple cache levels, there can be a series of cache peaks and possible cache valleys. The declining height of the peaks implies that bandwidth decreases at lower levels of the memory hierarchy.

4 RESULTS AND GENERAL ANALYSIS
In this section, we demonstrate and analyze the general performance results of the two OPMs using the kernels listed in Table 2. After this, we provide a comprehensive results summary of the performance and power features of these two OPMs and then give some general optimization guidelines.

4.1 eDRAM for Broadwell CPU
4.1.1 Dense Kernels. The results for GEMM and Cholesky are shown as 2D heat maps in Figures 7 and 8. The x-axis represents the matrix size while the y-axis represents the tiling size. The color spectrum from blue to red indicates increasing throughput (i.e., GFlop/s). The left and right subfigures show the heat map with (w/) and without (w/o) eDRAM, respectively. We summarize the observations as follows: (1) As both algorithms are compute-bound, having eDRAM does not significantly improve the peak performance (depicted as burnt maroon color; e.g., from 204.5 to 206.1 GFlop/s for GEMM and from 184.3 to 192.6 GFlop/s for Cholesky). (2) The near-peak performance region (orange to burgundy color) is significantly expanded with eDRAM, indicating that it offers more opportunities for less-optimal configurations to achieve higher throughput. (3) The size of the blue region largely remains the same before and after enabling eDRAM, suggesting that it is not a universal performance optimization solution. (4) With or without eDRAM, the "heated area" appears in the right section of each figure, implying that a sufficient data size is required for delivering good performance (maintaining high arithmetic intensity). (5) The impact from tiling is heavily correlated with problem size (a nearly triangular shape) under eDRAM. For achieving better throughput, an appropriate matrix size helps maintain healthy arithmetic intensity while optimized tiling builds better data locality.
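For reference, the loop nest below is a generic cache-blocked GEMM sketch showing what the "block size" axis of the heat maps controls; it is not the PLASMA [4] implementation evaluated here (which is used unmodified), and the 3·B²·8-byte working-set rule of thumb in the comment is a common heuristic rather than a statement about PLASMA's tiling.

#include <stddef.h>

/* C += A * B for n x n row-major matrices, processed in B x B blocks.
 * A common rule of thumb is to pick the block size so that roughly
 * 3 * B * B * sizeof(double) fits in the cache level being targeted
 * (L3/eDRAM on Broadwell, L2/MCDRAM on KNL). */
void dgemm_blocked(int n, int B, const double *A, const double *Bmat, double *C) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; ++i)
                    for (int k = kk; k < kk + B && k < n; ++k) {
                        double aik = A[(size_t)i * n + k];
                        for (int j = jj; j < jj + B && j < n; ++j)
                            C[(size_t)i * n + j] += aik * Bmat[(size_t)k * n + j];
                    }
}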

Figure 7: GEMM on Broadwell CPU. Figure 8: Cholesky on Broadwell CPU.

4.1.2 Sparse Kernels. Figures 9, 10 and 11 show the impact of using eDRAM on the sparse kernels SpMV, SpTRANS and SpTRSV, respectively. We first demonstrate the scattering of the performance points of the 968 matrices according to their memory footprints in the top subfigures (raw throughputs). Memory footprints are utilized here because, for sparse matrices, the key parameter defining the problem size is the volume of nonzero elements, which is proportional to the memory footprint size. To better understand and explain the results of these algorithms with medium or lower arithmetic intensity (e.g., more memory-bound) listed in Figure 4, which often emphasize the impact of the complex memory hierarchy, we modify the original valley model [19, 29] to facilitate analysis. We name it the "Stepping Model", shown in Figure 6. It differs from the valley model in two ways: (I) The x-axis is not thread volume but problem size or memory footprint, to better demonstrate problem-size related insights. The two models share a similar performance shape because a larger problem size often indicates more thread tasks for data-parallel scientific algorithms. (II) The Stepping model can reflect multiple cache peaks or valleys due to the multi-level caches in contemporary hardware.

For SpMV, the on-chip L3 cache peak at around 4 MB can be observed when mapping the data points onto the Stepping model. This is true for both the with- and without-eDRAM scenarios. After that, having eDRAM starts to show a clear advantage (blue points): beyond the L3 cache valley at about 2^3.5 MB, performance rises again to approach the eDRAM cache peak at about 2^7 MB (the capacity of the eDRAM). Afterwards, performance drops significantly due to fierce capacity misses, as the memory footprint size has surpassed the eDRAM size. For SpTRANS and SpTRSV, the L3 cache peak is not very obvious in the figures, but the eDRAM cache peak can be clearly observed. Table 4 shows the comparison statistics for with and without eDRAM (e.g., the max peak throughput gap). Furthermore, the middle subfigures (green spots) show the normalized speedups against the without-eDRAM runs. They clearly illustrate the eDRAM-effective region, i.e., speedups over 1 fall in the middle, before reaching the DRAM memory plateau. Users should consider this eDRAM effectiveness region when tuning their sparse algorithms before deployment.

Another observation from the normalized speedup subfigures is that even with the same memory footprint size, the performance of different input matrices can be significantly different. One major reason for this is the sparse matrix structure. The three bottom subfigures show the heat maps of the speedups with respect to the number of rows and nonzero entries of the tested sparse matrices. As can be seen, the matrices' sparse structures play an important role in determining their throughputs. For SpMV, the peak performance region (reddest area) concentrates at the lower-left corner, with the lower region for the nonzero entries as well. For SpTRANS, the peak performance region is for small matrices (both small numbers of rows and nonzero entries). For SpTRSV, the peak region is for a small number of rows, but a small to modest number of nonzero entries. These observations generally match the algorithm characteristics: both SpMV and SpTRSV commonly reuse vectors of size m (i.e., the number of rows) and thus achieve higher speedups when the vectors are smaller and can be cached better; SpTRANS has less data reuse and thus behaves better when the whole problem size is smaller.

Figure 9: SpMV on Broadwell. Figure 10: SpTRANS on Broadwell. Figure 11: SpTRSV on Broadwell.

4.1.3 Other Algorithms. Figures 12, 13 and 14 show the performance results for Stream, Stencil and FFT, respectively. The x-axis of these figures is log scaled. For Stream, clear L2 and L3 cache peaks can be observed both with and without eDRAM. However, without eDRAM there is an L3 cache valley before the DRAM bandwidth can be fully exploited at the plateau. With eDRAM, this valley is avoided, with an eDRAM cache peak being formed instead; this is then followed by a drop of throughput due to poor eDRAM hit rates. For Stencil, the "iso3dfd" (the essential kernel) implementation is optimized by vector folding and cache blocking, with a blocking dimension of 64x64x96 (3 MB). Thus, the total memory footprint on Broadwell is approximately 24 MB, which is significantly larger than the 6 MB L3 cache but smaller than the eDRAM. This explains why the throughput curve of Stencil with eDRAM continuously outperforms that without eDRAM. For FFT, it can be seen that the L3 cache peak is at about 6 MB, or 2^12.6 KB. After that, the without-eDRAM curve shows a clear cache valley while the with-eDRAM curve offers another performance "sweet spot" at the eDRAM cache peak (≈ 2^14 KB). Beyond this cache peak (after 2^17 KB, or 128 MB), the performance drops and converges to that without eDRAM. Table 4 shows the detailed statistics for these kernels when enabling eDRAM.

Figure 12: Stream on Broadwell CPU. Figure 13: Stencil on Broadwell CPU. Figure 14: FFT on Broadwell CPU.

4.2 Observations on MCDRAM of KNL
4.2.1 Dense Kernels. The results for GEMM and Cholesky on KNL are shown in Figures 15 and 16. The upper-left, upper-right, lower-left and lower-right subfigures represent preferring DDR (i.e., without MCDRAM), using MCDRAM in cache mode, in flat mode, and in hybrid mode, respectively. We have the following observations: (I) Similar to the observations for eDRAM (Figure 7), using MCDRAM as a cache does not significantly increase the peak throughput of GEMM (i.e., from 1425.5 GFlop/s without MCDRAM to 1483.4 GFlop/s in cache mode), because GEMM has been well blocked for the on-chip L2 cache; but it does expand the opportunities for more inputs to reach higher throughput. However, this phenomenon does not apply to Cholesky, in which the peak performance increases noticeably due to suboptimal tiling for L2. (II) MCDRAM as flat memory can significantly boost throughput before a certain boundary (around the center of the x-axis), implying it can help small-sized problems achieve better throughput without requiring a large problem size to maintain high arithmetic intensity. However, beyond that boundary (around the MCDRAM capacity of 16 GB), the performance becomes extremely poor for both GEMM and Cholesky. This is interesting because general wisdom holds that the performance with MCDRAM should be better than, or at least the same as, that without MCDRAM (i.e., only using DDR). However, our results show otherwise: when an array is partially allocated on both MCDRAM and DDR, the performance becomes extremely poor. We suspect this is due to: (a) significantly increased bus conflicts in the network of cores on KNL when fetching data from MCDRAM and DDR at the same time; and (b) significant set conflicts in L2 when two entries in DDR and MCDRAM are mapped to the same L2 cache set. Additional conflicts in L2 also include the dual-port support for DDR and MCDRAM, resulting in extra cache transactions (e.g., synchronization). (III) Another interesting finding is that the hybrid mode can be more effective than the pure cache mode, which is against existing knowledge [27]. This is especially true for GEMM, in which the cache tiling technique has ensured that the memory footprint in cache is smaller than 8 GB (the cache capacity of the hybrid mode). As such, the other 8 GB, when employed as flat memory, can be more beneficial than as pure cache, since a cache is not always hit and requires additional tag-checking overhead (note that the flat memory and the cache in MCDRAM are at the same level of the memory hierarchy since they share the same physical hardware). However, when the memory footprint is larger than 8 GB, the cache mode becomes the better choice. (IV) These figures also demonstrate the advantage of using MCDRAM as a cache, since a hardware-managed cache can dynamically shift its scope as the kernel's hotspot region moves, while flat memory cannot. In summary, if the data size is large (e.g., larger than the MCDRAM capacity) but the frequently accessed memory footprint is small, the hybrid mode (when the footprint is < 8 GB) or the cache mode (when the footprint is ≥ 8 GB) is the better choice.

4.2.2 Sparse Kernels. Figures 17, 18, and 19 illustrate the results of running the three sparse algorithms on KNL: SpMV, SpTRANS and SpTRSV, respectively. Similar to the demonstration for eDRAM, we show the raw performance, the relative speedups (with respect to without MCDRAM), and the impact of the sparse structures.

SpMV: In the top subfigure for raw performance, SpMV's L2 cache peak can be easily identified at around 32 MB. After that, the four modes diverge. Without MCDRAM, the performance drops dramatically and saturates at the DRAM memory plateau. After experiencing a short L2 cache valley, the performance of the other three modes climbs back to approach MCDRAM's throughput peak. Since the memory footprints of most sparse matrices from the UF collection are much smaller than 8 GB (i.e., within the hybrid mode's cache capacity), it is difficult to distinguish the three modes, as they have not reached the state that could separate them. This phenomenon also occurs for SpTRANS and SpTRSV. Therefore, we only draw one heat map to represent all three MCDRAM modes, since they share a similar impact from the sparse structures. Figure 20 shows that SpMV performs well on KNL when the number of rows is small, due to efficient vector caching.

SpTRANS: The behavior of SpTRANS is completely different from SpMV. Based on the speedup figures, the MCDRAM modes do not deliver a clear performance benefit over DDR. This is because the SpTRANS code already targets the L2 cache for optimal tiling, resulting in negligible performance improvement when enabling the MCDRAM modes. SpTRANS' heat map (Figure 21) shows that it performs best (reddest) when both the number of rows and the number of nonzeros are smaller. This is because SpTRANS has less data reuse and thus behaves better when the whole problem size is smaller.

Figure 15: GEMM on KNL. Figure 16: Cholesky on KNL.
Figure 17: SpMV on KNL. Figure 18: SpTRANS on KNL. Figure 19: SpTRSV on KNL.
Figure 20: Structure impact of SpMV on KNL. Figure 21: Structure impact of SpTRANS on KNL. Figure 22: Structure impact of SpTRSV on KNL.

SpTRSV: SpTRSV has the same arithmetic intensity as SpMV, but it has a much lower peak throughput due to heavy input-defined data dependencies. Thus, SpTRSV has lower memory-level parallelism and bandwidth demand, making MCDRAM's access latency higher than DDR's (Section 2.2). As reflected by the speedup subfigure in the middle (e.g., below 1 when the memory footprint is larger), whether enabling the MCDRAM modes is better than DDR depends on whether the algorithm is latency-bound (worse than DDR) or memory-bandwidth-bound (better than DDR). As shown in the heat map (Figure 22), a small number of rows and a moderate number of nonzeros in SpTRSV benefit performance greatly due to better vector caching.

4.2.3 Other Kernels. The figures for Stream, Stencil and FFT are shown in Figures 23, 24, and 25, respectively. As can be seen, the figure for Stream matches the Stepping model very well. The L2 cache peak appears at about 32 MB. All four modes converge before this point but diverge thereafter. We can also observe that the DDR mode's performance quickly drops to the DRAM plateau after the L2 cache peak.

Meanwhile, since Stream has almost no data locality, the cache mode performs worse than the flat and hybrid modes. For the hybrid mode, the 8 GB MCDRAM partition keeps the curve at the same level as the flat mode, but it falls when the memory footprint surpasses its capacity (i.e., one point ahead of the flat mode, since its capacity is half of the flat mode's; note that the x-axis is in log scale). For Stencil, the figure also matches the Stepping model, although we cannot observe the L2 cache peak since the starting data point is already larger than the L2 cache capacity. Compared with Stream, the Stencil kernel exhibits a very significant MCDRAM cache peak at about 2^12 MB. Beyond this point, MCDRAM in cache mode starts to experience significant misses, causing a performance drop. Meanwhile, the hybrid and flat modes do not suffer from cache misses when the memory footprint is less than the 8 GB capacity, remaining consistent with the memory plateau (the plateau is sufficiently long since the x-axis is log scaled). Then the hybrid case drops one step before the flat mode when the memory footprint surpasses its cache capacity.

For FFT, the four modes diverge from the same performance point at 8 MB, and beyond this point the MCDRAM modes show a clear advantage over DDR. It is worth noting that for a large data size (e.g., > 2^14 MB, or 16 GB, which is the total MCDRAM capacity), the flat mode performance drops while the cache and hybrid modes still hold a higher throughput. This again demonstrates the advantage of using MCDRAM as a cache, since it can dynamically shift its scope with the kernel's hotspot region while the flat memory cannot.

Figure 23: Stream on KNL. Figure 24: Stencil on KNL. Figure 25: FFT on KNL.

Table 4: Summarized Statistics for Applying eDRAM
Kernel | w/o eDRAM Best GFlop/s | w/ eDRAM Best GFlop/s | Avg Perf. Gap (GFlop/s) | Max Perf. Gap (GFlop/s) | Average Speedup | Max Speedup
GEMM | 204.5 | 206.1 | 5.19 | 32.14 | 1.034x | 1.243x
Cholesky | 184.3 | 192.6 | 5.27 | 39.55 | 1.051x | 3.537x
SpMV | 8.7 | 9.6 | 1.20 | 4.48 | 1.296x | 2.644x
SpTRANS | 19.0 | 21.8 | 3.79 | 9.40 | 1.574x | 2.979x
SpTRSV | 61.0 | 70.3 | 0.88 | 9.38 | 1.255x | 2.233x
Stream | 201.3 | 201.3 | 7.00 | 38.99 | 1.094x | 2.421x
Stencil | 57.4 | 61.9 | 5.22 | 8.82 | 1.106x | 1.171x
FFT | 38.4 | 44.7 | 1.42 | 18.55 | 1.076x | 1.894x

5 RESULTS SUMMARY

5.1 Performance
eDRAM: Table 4 lists the detailed performance statistics for all the tested kernels using eDRAM. It is a summarized evaluation of the results discussed in Section 4.1. For the dense kernels, although the enhancement of the best achievable peak performance from having eDRAM is only 0.8% and 4.5% for GEMM and Cholesky respectively, eDRAM significantly increases the overall probability for different inputs to reach good throughput without explicit and complicated configuration tuning. For the sparse kernels, the peak performance enhancement through eDRAM is 10.3%, 14.7%, and 15.2% for SpMV, SpTRANS and SpTRSV, respectively. This suggests that eDRAM is effective for memory-bandwidth-bound applications. For Stream, the peak performance is not improved at all by eDRAM, as there is no data locality and the performance is completely bounded by DDR bandwidth. For Stencil and FFT, the peak performance improvements are 7.8% and 16.4%. Note that all these numbers are not necessarily for the same data input.

Across all the kernels and inputs, eDRAM brings an average of 3.8 GFlop/s and up to 39.55 GFlop/s performance improvement, corresponding to an average of 18.6% and up to 3.54x (i.e., Cholesky) performance speedup. This is an impressive gain considering the diversity of the kernels and input scaling.

To summarize, eDRAM behaves best within the eDRAM cache effective region (discussed in Section 4.1.2). Kernels with general data inputs commonly fall in this region, in which a kernel's data locality cannot be fully captured by the upper-level caches. Additionally, it is worth noting that we have not observed worse performance with eDRAM than without it.

MCDRAM: Table 5 lists the performance summary for KNL's MCDRAM evaluation results discussed in Section 4.2. As can be seen, the enhancements of the best achievable peak performance over the pure DDR option are not always positive, e.g., GEMM in flat mode, SpTRANS in hybrid mode, and the SpTRSV cases. The detailed reasons for these have been discussed in Section 4.2.

Across all the kernels and inputs, MCDRAM in flat mode introduces an average of 41.9 GFlop/s and up to 212.2 GFlop/s throughput improvement, corresponding to an average of 65.2% and up to 3.9x performance speedup. MCDRAM in cache mode brings an average of 64.1 GFlop/s and up to 215.9 GFlop/s throughput improvement, corresponding to an average of 58.0% and up to 3.6x speedup. MCDRAM in hybrid mode leads to an average of 77.6 GFlop/s and up to 226.2 GFlop/s throughput improvement, corresponding to an average of 64.9% and up to 3.8x speedup.

Generally, MCDRAM shows impressive performance improvement across the board. However, as discussed in Section 4.2, deciding when to use which hardware mode for the optimal achievable performance is trickier than applying eDRAM. We further discuss the general usage guideline for MCDRAM in Section 6.

5.2 Power and Energy
Power consumption might be an important evaluation metric for some users. To estimate it, we use the widely adopted RAPL [10] and PAPI [45] tools.
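For readers who want to reproduce the power numbers below: the paper's measurements go through RAPL [10] via PAPI [45]; the sketch here reads the same package-energy counter through Linux's powercap sysfs interface instead, which avoids the PAPI dependency. The sysfs path and the assumption that domain 0 is the CPU package are platform-specific, and the counter's wrap-around is ignored in this minimal version.

#include <stdio.h>
#include <unistd.h>

/* Cumulative package energy in microjoules from the intel_rapl powercap driver. */
static long long read_energy_uj(void) {
    long long uj = -1;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1) uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void) {
    const int secs = 5;
    long long e0 = read_energy_uj();
    sleep(secs);                    /* run the kernel of interest here instead */
    long long e1 = read_energy_uj();
    if (e0 >= 0 && e1 >= e0)
        printf("average package power: %.2f W\n", (double)(e1 - e0) / secs / 1e6);
    return 0;
}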

Table 5: Summarized Statistics for Applying Different Modes of MCDRAM (Flat/Cache/Hybrid values separated by "/")
Kernel | DDR Best GFlop/s | Flat/Cache/Hybrid Best GFlop/s | Avg Perf. Gap (GFlop/s) | Max Perf. Gap (GFlop/s) | Average Speedup | Max Speedup
GEMM | 1425.5 | 1404.0/1483.4/1544.4 | -135.7/98.3/124.7 | 379.2/469.6/463.3 | 0.879/1.141/1.160 | 1.544/1.613/1.685
Cholesky | 907.8 | 966.0/1104.7/1079.7 | -24.3/29.6/29.4 | 246.8/269.0/281.4 | 0.998/1.063/1.064 | 5.345/3.242/5.411
SpMV | 39.8 | 46.5/46.3/45.9 | 5.1/5.6/5.5 | 36.3/36.0/35.4 | 1.572/1.623/1.610 | 4.714/4.603/4.609
SpTRANS | 4.3 | 4.6/5.2/3.5 | 0.076/0.307/-0.088 | 2.7/2.6/1.08 | 1.068/1.233/0.915 | 2.527/2.477/1.609
SpTRSV | 10.4 | 25.2/38.8/37.9 | 0.200/0.492/0.436 | 17.4/29.5/28.5 | 1.034/1.059/1.035 | 3.223/4.154/4.054
Stream | 792.9 | 792.9/792.9/792.9 | 126.7/58.9/115.0 | 312.4/243.6/309.4 | 2.808/1.854/2.643 | 5.443/4.723/5.403
Stencil | 189.9 | 808.6/784.4/798.7 | 324.9/279.9/308.3 | 618.7/597.4/609.7 | 2.764/2.522/2.673 | 4.265/4.195/4.226
FFT | 71.5 | 118.0/113.4/114.6 | 37.9/39.3/37.6 | 83.9/79.3/80.5 | 2.095/2.148/2.093 | 3.823/3.961/3.530

Figure 26: Broadwell Power. Figure 27: KNL Power.

The tools are used for measuring and recording the energy consumption of the two systems running all the benchmarks used in this work.

The results for with/without eDRAM and MCDRAM are shown in Figures 26 and 27. We also show the power breakdown of each platform by measuring the whole package and DDR separately. Note that eDRAM on Broadwell can be physically disabled in BIOS, so no static power is consumed from it when turned off. For MCDRAM, however, "w/o MCDRAM" simply implies not accessing MCDRAM, so it still consumes static power; so far, there is no feasible approach to physically turn off MCDRAM on KNL. As can be seen, using eDRAM and MCDRAM (i.e., flat mode) increases the power consumption on average by 5.6 W and 9.8 W across kernels, respectively, corresponding to an average of 8.6% and 6.9% power increase. Figure 27 also shows that using MCDRAM can sometimes help reduce DDR power and even the overall power consumption (e.g., GEMM, Cholesky, SpTRANS and FFT), due to significantly reduced DDR accesses.

In terms of energy, which HPC center administrators mostly care about, using OPM can dissipate extra power. Assume energy consumption is the top priority, and suppose the on-package memory brings P% performance gain at the cost of, on average, W% additional power. To essentially save energy, we should have

Energy_{w/ OPM} = (1 + W) / (1 + P) * Energy_{w/o OPM} < Energy_{w/o OPM}    (1)

This implies that, on average, the performance benefit of having OPM should be larger than 8.6% for eDRAM and 6.9% for MCDRAM to gain energy savings. Other metrics such as the Energy-Delay Product [18] can also be used to adjust users' final optimization objective, i.e., more towards performance or energy, for the optimal tuning option. But how to decide this user-dependent evaluation metric is beyond the scope of this work.

6 OPTIMIZATION GUIDELINES
Previously, we used the Stepping model (Figure 6) to explain the memory access behaviors of different kernels. Here we use the Stepping model to derive a more general tuning guideline for these two OPMs.

Figure 28: eDRAM tuning via Stepping model. Figure 29: MCDRAM tuning via Stepping model.

eDRAM: As shown in Figure 28, after an upper-level cache peak (L3 in this case), if there is no eDRAM, the performance drops and flattens out to the DDR plateau. If eDRAM is integrated, there will be another cache peak, as we have seen previously in Section 4. Beyond the eDRAM cache peak, the performance degrades; whether it converges with the DDR plateau depends on the eDRAM cache hit rate in the steady state. If the hit rate is zero, they converge (Figure 28-(A)); otherwise, there will be a performance gap (Figure 28-(B)). From the performance perspective, eDRAM has shorter latency and higher bandwidth than DDR, and it can effectively improve performance for both latency-bound and bandwidth-bound applications, as long as the data size fits into the eDRAM performance-effective region (PER). Outside of this region, eDRAM still does not degrade the performance. Therefore, to HPC users whose major concern is performance, we suggest keeping this OPM enabled. If the users' priority is saving energy, as shown in Figure 28-(A), the energy-effective region (EER) is narrower than the PER. If the steady-state hit rate of the eDRAM is insufficient to yield a performance gain, having it is probably not cost-effective.

MCDRAM: For MCDRAM, the condition is much more complicated. Based on the experience we have gained from the analysis in the previous sections, we summarize the features of the MCDRAM modes into a Stepping model figure, shown in Figure 29. The "w/o MCDRAM" curve here is similar to that of eDRAM in Figure 28, and the cache mode curve here is similar to the "w/ eDRAM" curve. For the flat mode curve, we represent it using the memory plateau shape, since all accesses are essentially hits.

As discussed in Section 4.2.1, its performance drops sharply beyond the memory capacity, e.g., 16 GB for flat mode. For the hybrid curve, below the allocated cache capacity (e.g., 8 GB), it has very similar throughput to the flat mode. Before the data volume reaches the full MCDRAM capacity, it has slightly worse performance than the flat mode but does have a cache peak (marked as the MCDRAM hybrid peak in Figure 29). From Figure 29, we have the following guidelines: (I) The "w/o MCDRAM" curve generally shows the worst performance. One exception is the flat mode when the data allocation occurs on both MCDRAM and DDR, generating high NoC conflicts and upper-level cache thrashing. (II) Flat mode shows the best performance when the application data set is ensured to be smaller than MCDRAM's capacity (all hits). (III) Hybrid mode shows its advantage when the most-frequently-used memory footprint is smaller than the allocated cache size in MCDRAM but the data size itself is larger than the MCDRAM total capacity. Under this condition, hybrid mode shows superior performance to both the flat mode and the cache mode, as seen in the GEMM example. (IV) Cache mode behaves best when the data size cannot fit into the MCDRAM capacity and the most-frequently-used memory footprint is larger than the allocated cache size of the hybrid mode. This condition mostly occurs for large problem sizes with relatively good locality. Note that we do not analyze the energy consumption tradeoff for MCDRAM, due to the mixed average power results in Figure 27 and the fact that MCDRAM on the current KNL cannot be disabled.

Tuning OPM Hardware for Performance: Using the Stepping model, we can also theoretically analyze how to tune the OPM hardware for better performance. Using eDRAM as an example, Figures 30-(A) and (B) show the results of increasing the OPM cache capacity and the OPM cache bandwidth, respectively. As can be seen, increasing the OPM cache size scales the cache peak, while increasing the OPM bandwidth amplifies the peak.

Figure 30: Example: tuning eDRAM hardware for throughput.
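The tuning rules above can be summarized as a toy first-order bound (our own simplification of the Stepping model for illustration; it reproduces the peaks and plateaus but deliberately ignores the cache valleys and the flat-mode cliff discussed earlier): the attainable throughput of a kernel with footprint F is limited by the compute peak and by its arithmetic intensity times the bandwidth of the lowest memory level whose capacity still covers F.

/* Toy Stepping-model bound: walk the hierarchy from the fastest level down and
 * use the bandwidth of the first level whose capacity covers the footprint. */
typedef struct { double capacity_mb; double bw_gbs; } mem_level;

double stepping_bound(double peak_gflops, double ai, double footprint_mb,
                      const mem_level *levels, int nlevels) {
    double bw = levels[nlevels - 1].bw_gbs;   /* default: the last level (DRAM) */
    for (int i = 0; i < nlevels; ++i)
        if (footprint_mb <= levels[i].capacity_mb) { bw = levels[i].bw_gbs; break; }
    double mem_bound = ai * bw;
    return (mem_bound < peak_gflops) ? mem_bound : peak_gflops;
}

/* Example hierarchy for Broadwell (capacities and the DDR3/eDRAM bandwidths are
 * taken from Table 3; the L3 bandwidth of 200 GB/s is a placeholder assumption):
 *     mem_level bdw[] = { {6, 200.0}, {128, 102.4}, {16 * 1024, 34.1} };
 *     double gf = stepping_bound(236.8, 0.0625, 64.0, bdw, 3);  // Stream-like kernel
 */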
Tuning OPM Hardware for Performance: Using the Stepping model, we can also theoretically analyze how to tune the OPM hardware for better performance. Using eDRAM as an example, Figures 30-(A) and (B) show the effect of increasing OPM cache capacity and OPM cache bandwidth, respectively. As can be seen, increasing the OPM cache size scales the cache peak, while increasing the OPM bandwidth amplifies the peak.

[Figure 30: Example: tuning eDRAM hardware for throughput. (A) increasing OPM cache capacity; (B) increasing OPM cache bandwidth.]

7 RELATED WORK
Scientific kernels are the essential building blocks of today's major applications running in large-scale HPC environments, and they often consume tremendous execution cycles and resources. Analyzing the performance of these scientific kernels on state-of-the-art emerging architectures is therefore a critical task, as it offers useful insights into the potential of the latest hardware for boosting performance as well as into how to further optimize them. Studies of this type have previously been performed on the Cell processor [47], GPGPUs [7], cloud servers [24], IBM Power-8 [2], Intel Xeon Phi Knights Corner [26], etc. Most of these studies focus on the processors themselves. However, the last two decades of HPC practice have already shown that memory effects are becoming the most prominent limiting factor for continuous performance scaling towards exascale [13]. Therefore, we focus the rest of our discussion on prior work related to the two modern OPMs: eDRAM and MCDRAM.

In terms of eDRAM, the majority of the existing works focus on optimizing the eDRAM memory cell structure for low power [9], energy-efficient error-encoding schemes [46], or minimizing eDRAM refresh operations [3, 6]. Mittal et al. [38, 37] provide surveys of embedded DRAM and DRAM caches. For MCDRAM, since Intel KNL is a recently released architecture, only limited work has been conducted to measure and optimize it for particular algorithms, e.g., accelerating sparse tensor factorization [41] and seismology software on KNL [21]. Some other work [12] tried to model KNL performance using the classic roofline model. However, no existing work has provided a comprehensive study targeting modern on-package memory and analyzing its performance and power impact for critical scientific kernels. Furthermore, none of them particularly emphasizes the impact of the recently released MCDRAM.

8 CONCLUSIONS AND FUTURE WORK
In this paper, we conduct a thorough empirical evaluation of the impact of modern OPM on important HPC scientific kernels. We evaluate the different tuning modes of OPM and how they affect application tuning for best throughput. To better understand the observations, we derive a visual analytic model from the valley model and use it to explain the interaction between OPM architectural effects and kernels' tuning features. Finally, we provide a general optimization guideline on how to tune the application and the architecture for best performance on platforms equipped with these two OPMs. Several interesting insights have been obtained through our evaluation, e.g., how sparse structures affect throughput with OPM, whether eDRAM is cost effective for different applications, and under what circumstances hybrid MCDRAM can outperform the other two options.

In the future, we plan to investigate whether the configurations of the OS, libraries, compilers and other features could have a noteworthy impact on the overall system performance under OPM. For example, so far we have tested all the kernels on CentOS-7.2, and it would be interesting to explore the OS impact on performance and power. There are several interesting research directions regarding the OS, including (1) under a multi-user/multi-application scenario (less likely for HPC), how would the OS distribute the OPM resources among applications based on fairness, efficiency and consistency; (2) in a virtualized environment, how would the host OS manage OPM across different guest OSes; (3) would OPM be useful for certain OS functionalities, e.g., buffering the page table? This work serves as an initial study on OPM-related research in HPC, and including a more diverse range of impact factors can definitely expand the current research scope.

ACKNOWLEDGMENTS
This research is supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under the "CENATE" project (award No. 66150), the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie "TICOH" project (grant No. 752321), and the NSF XPS program (grant CCF-1337131). The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.

REFERENCES
[1] 2016 November Top500 list. https://www.top500.org/lists/2016/11/, 2016.
[2] A. V. Adinetz, P. F. Baumeister, H. Böttiger, T. Hater, T. Maurer, D. Pleiter, W. Schenck, and S. F. Schifano. Performance Evaluation of Scientific Applications on POWER8. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation: 5th International Workshop, PMBS 2014, pages 24–45, 2015.
[3] A. Agrawal, P. Jain, A. Ansari, and J. Torrellas. Refrint: Intelligent Refresh to Minimize Power in On-Chip Multiprocessor Cache Hierarchies. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pages 400–411, 2013.
[4] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects. Journal of Physics: Conference Series, 180(1):012037, 2009.
[5] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures. Parallel Computing, 35(1):38–53, 2009.
[6] M. Chang, P. Rosenfeld, S. Lu, and B. Jacob. Technology Comparison for Large Last-level Caches (L3Cs): Low-leakage SRAM, Low Write-energy STT-RAM, and Refresh-optimized eDRAM. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture, HPCA '13, pages 143–154, 2013.
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.
[8] K. C. Chun, P. Jain, J. H. Lee, and C. H. Kim. A sub-0.9V Logic-Compatible Embedded DRAM with Boosted 3T Gain Cell, Regulated Bit-line Write Scheme and PVT-tracking Read Reference Bias. In 2009 Symposium on VLSI Circuits, pages 134–135, 2009.
[9] K. C. Chun, P. Jain, J. H. Lee, and C. H. Kim. A 3T Gain Cell Embedded DRAM Utilizing Preferential Boosting for High Density and Low Power On-Die Caches. IEEE Journal of Solid-State Circuits, 46(6):1495–1505, 2011.
[10] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. RAPL: Memory Power Estimation and Capping. In Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, pages 189–194, 2010.
[11] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011.
[12] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth, M. Lobet, T. M. Malas, J. Vay, and H. Vincenti. Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor. In ISC Workshops, 2016.
[13] J. Dongarra et al. The International Exascale Software Project Roadmap. Int. J. High Perform. Comput. Appl., 25(1):3–60, 2011.
[14] E. J. Fluhr et al. POWER8: A 12-core Server-Class Processor in 22nm SOI with 7.6Tb/s Off-Chip Bandwidth. In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 96–97, 2014.
[15] J. Barth et al. A 500 MHz Random Cycle, 1.5 ns Latency, SOI Embedded DRAM Macro Featuring a Three-Transistor Micro Sense Amplifier. IEEE Journal of Solid-State Circuits, 43(1):86–95, 2008.
[16] P. Hammarlund et al. Haswell: The Fourth-Generation Intel Core Processor. IEEE Micro, 34(2):6–20, 2014.
[17] M. Frigo and S. G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.
[18] R. Ge, X. Feng, and K. W. Cameron. Performance-Constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 34–45, 2005.
[19] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser. Many-Core vs. Many-Thread Machines: Stay Away From the Valley. IEEE Computer Architecture Letters, 8(1):25–28, 2009.
[20] A. Hartstein, V. Srinivasan, T. R. Puzak, and P. G. Emma. Cache Miss Behavior: Is It √2? In Proceedings of the 3rd Conference on Computing Frontiers, pages 313–320, 2006.
[21] A. Heinecke, A. Breuer, M. Bader, and P. Dubey. High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing). In High Performance Computing: 31st International Conference, ISC High Performance 2016, Proceedings, pages 343–362, 2016.
[22] K. Hou, W. Liu, H. Wang, and W. Feng. Fast Segmented Sort on GPUs. In Proceedings of the International Conference on Supercomputing, ICS '17, pages 12:1–12:10, 2017.
[23] K. Hou, H. Wang, and W. Feng. GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs. In Proceedings of the Computing Frontiers Conference, CF '17, pages 107–116, 2017.
[24] A. Iosup, S. Ostermann, M. N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema. Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing. IEEE Transactions on Parallel and Distributed Systems, 22(6):931–945, 2011.
[25] S. S. Iyer, J. E. Barth, P. C. Parries, J. P. Norum, J. P. Rice, L. R. Logan, and D. Hoyniak. Embedded DRAM: Technology Platform for the Blue Gene/L Chip. IBM Journal of Research and Development, 49(2.3):333–350, 2005.
[26] J. Jeffers and J. Reinders. Intel Xeon Phi High-Performance Programming. Morgan Kaufmann, 1st edition, 2013.
[27] J. Jeffers, J. Reinders, and A. Sodani. Intel Xeon Phi Processor High Performance Programming. Morgan Kaufmann, 2nd edition, 2016.
[28] S. Junkins. The Compute Architecture of Intel Processor Graphics Gen8. Technical report, Intel Corp., 2015.
[29] A. Li, S. L. Song, E. Brugel, A. Kumar, D. Chavarría-Miranda, and H. Corporaal. X: A Comprehensive Analytic Model for Parallel Machines. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 242–252, 2016.
[30] W. Liu. Parallel and Scalable Sparse Basic Linear Algebra Subprograms. PhD thesis, University of Copenhagen, 2015.
[31] W. Liu, A. Li, J. Hogg, I. S. Duff, and B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Proceedings, pages 617–630, 2016.
[32] W. Liu and B. Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. Journal of Parallel and Distributed Computing, 85:47–61, 2015.
[33] W. Liu and B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 339–350, 2015.
[34] W. Liu and B. Vinter. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Computing, 2015.
[35] W. Luk, J. Cai, R. Dennard, M. Immediato, and S. Kosonocky. A 3-Transistor DRAM Cell with Gated Diode for Enhanced Speed and Retention Time. In 2006 Symposium on VLSI Circuits, Digest of Technical Papers, pages 184–185, 2006.
[36] J. D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 1995.
[37] S. Mittal and J. S. Vetter. A Survey of Techniques for Architecting DRAM Caches. IEEE Transactions on Parallel and Distributed Systems, 27(6):1852–1863, 2016.
[38] S. Mittal, J. S. Vetter, and D. Li. A Survey of Architectural Approaches for Managing Embedded DRAM and Non-Volatile On-Chip Caches. IEEE Transactions on Parallel and Distributed Systems, 26(6):1524–1537, 2015.
[39] J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver. In Supercomputing: 29th International Conference, ISC 2014, Proceedings, pages 124–140, 2014.
[40] A. J. Smith. Cache Memories. ACM Comput. Surv., 14(3):473–530, 1982.
[41] S. Smith, J. Park, and G. Karypis. Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory. In 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS '17, 2017.
[42] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.
[43] D. Somasekhar, Y. Ye, P. Aseron, S. L. Lu, M. Khellah, J. Howard, G. Ruhl, T. Karnik, S. Y. Borkar, V. De, and A. Keshavarzi. 2GHz 2Mb 2T Gain-Cell Memory Macro with 128GB/s Bandwidth in a 65nm Logic Process. In 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, pages 274–613, 2008.
[44] H. Wang, W. Liu, K. Hou, and W. Feng. Parallel Transposition of Sparse Data Structures. In Proceedings of the 2016 International Conference on Supercomputing, ICS '16, pages 33:1–33:13, 2016.
[45] V. M. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S. Moore. Measuring Energy and Power with PAPI. In Proceedings of the 2012 41st International Conference on Parallel Processing Workshops, pages 262–268, 2012.
[46] C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S. Lu. Reducing Cache Power with Low-cost, Multi-bit Error-correcting Codes. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 83–93, 2010.
[47] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential of the Cell Processor for Scientific Computing. In Proceedings of the 3rd Conference on Computing Frontiers, CF '06, pages 9–20, 2006.
[48] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, 2009.
[49] C. Yount, J. Tobin, A. Breuer, and A. Duran. YASK – Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 30–39, 2016.

A APPENDIX: COMPUTATIONAL RESULTS ANALYSIS OF KERNELS BENCHMARKED
This artifact comprises the system configuration of the two platforms with on-package memories, and the source code links, datasets, and algorithm parameters of the scientific kernels benchmarked in our SC '17 paper "Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels".

A.1 System Configuration
The two platforms, i.e., Intel Core i7-5775c (Broadwell, BRD for short) and Xeon Phi 7210 (Knights Landing, KNL for short), are configured with exactly the same software environment:
• Binary: OpenMP executables
• Compilation: GNU gcc v4.9.2; Intel icc v17.0.1
• Runtime environment: Intel Parallel Studio XE v2017.1.043; Papi v5.5.1 (http://icl.cs.utk.edu/papi/software/view.html?id=250)
• Operating system: 64-bit CentOS v7.3.1611, Linux kernel version v3.10.0-514.6.1.el7.x86_64
Note that their hardware configurations are listed in Table 3.

A.2 Scientific Kernels
A.2.1 General Matrix-Matrix Multiplication (GEMM).
• Package name: PLASMA (Parallel Linear Algebra Software for Multicore Architectures)
• Publication: Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects. Journal of Physics: Conference Series, 180(1):012037, 2009.
• Source code: https://bitbucket.org/icl/plasma
• Serial BLAS back-end: Intel MKL v2017 Update 1
• Dataset: Dense matrices of given size filled with randomly generated values
• Executable argument: ./test dgemm --m=msz --n=msz --k=msz --nb=bsz, where msz is the matrix size in ranges {256 .. 16128 .. 512} and {256 .. 32000 .. 1024} on BRD and KNL, respectively, and bsz is the tiling size in range {128 .. 4096 .. 128} on both platforms
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput

A.2.2 Cholesky Decomposition.
• Package name: PLASMA (Parallel Linear Algebra Software for Multicore Architectures)
• Publication: Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects. Journal of Physics: Conference Series, 180(1):012037, 2009.
• Source code: https://bitbucket.org/icl/plasma
• Serial BLAS back-end: Intel MKL v2017 Update 1
• Dataset: Dense matrices of given size filled with randomly generated values
• Executable argument: ./test dpotrf --m=msz --n=msz --k=msz --nb=bsz, where msz is the matrix size in ranges {256 .. 16128 .. 512} and {256 .. 32000 .. 1024} on BRD and KNL, respectively, and bsz is the tiling size in range {128 .. 4096 .. 128} on both platforms (the blocked loop structure that bsz controls is sketched after this subsection)
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput
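The bsz (tile size) parameter above maps to PLASMA's nb blocking factor. As a rough illustration of what that blocking means for OPM locality, the sketch below is a plain three-level tiled matrix multiply with tile size nb; the function and helper names are ours, and it is a simplified stand-in, not PLASMA's actual tiled algorithm, data layout, or runtime.

    /* Simplified tiled (blocked) multiply, C += A * B, all n x n, row-major.
     * nb plays the same role as PLASMA's --nb: it sets how much of A, B and C
     * is reused while it still fits in the OPM/LLC level. Illustrative only. */
    #include <stddef.h>

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    void dgemm_tiled(size_t n, size_t nb,
                     const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += nb)
            for (size_t kk = 0; kk < n; kk += nb)
                for (size_t jj = 0; jj < n; jj += nb)
                    /* multiply one nb x nb tile of A by one nb x nb tile of B */
                    for (size_t i = ii; i < min_sz(ii + nb, n); i++)
                        for (size_t k = kk; k < min_sz(kk + nb, n); k++) {
                            double aik = A[i * n + k];
                            for (size_t j = jj; j < min_sz(jj + nb, n); j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }

Roughly, three nb-by-nb double-precision tiles occupy 3 * nb * nb * 8 bytes, so whether that per-tile working set fits in the eDRAM or (cached) MCDRAM level is what the bsz sweep above probes.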
A.2.3 Sparse Matrix-Vector Multiplication (SpMV).
• Package name: Benchmark_SpMV_using_CSR5
• Publication: Weifeng Liu and Brian Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 339–350, 2015.
• Source code: https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5
• Dataset: 968 matrices (in the matrix market format) from the University of Florida Sparse Matrix Collection (https://www.cise.ufl.edu/research/sparse/matrices/)
• Executable argument: ./spmv matrix.mtx
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput

A.2.4 Sparse Matrix Transposition (SpTRANS).
• Package name: ScanTrans and MergeTrans (for BRD and KNL, respectively)
• Publication: Hao Wang, Weifeng Liu, Kaixi Hou, and Wu-chun Feng. Parallel Transposition of Sparse Data Structures. In Proceedings of the 2016 International Conference on Supercomputing, ICS '16, pages 33:1–33:13, 2016.
• Source code: https://github.com/vtsynergy/sptrans
• Dataset: 968 matrices (in the matrix market format) from the University of Florida Sparse Matrix Collection (https://www.cise.ufl.edu/research/sparse/matrices/). They are transposed from the CSR to the CSC format.
• Executable argument: export VER=5 for BRD or export VER=7 for KNL, then ./sptranspose matrix.mtx
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput

A.2.5 Sparse Triangular Solve (SpTRSV).
• Package name: SpMP (sparse matrix pre-processing library)
• Publication: Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver. In Supercomputing: 29th International Conference, ISC 2014, Leipzig, Germany, June 22–26, 2014, Proceedings, pages 124–140, 2014.
• Source code: https://github.com/IntelLabs/SpMP
• Dataset: 968 matrices (in the matrix market format) from the University of Florida Sparse Matrix Collection (https://www.cise.ufl.edu/research/sparse/matrices/). Note that a diagonal is added to any singular matrices in the list to make them nonsingular, and the lower triangular part (i.e., forward substitution) is tested; a serial reference version of this kernel is sketched after this subsection.
• Executable argument: ./trsv_test matrix.mtx
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput
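For reference, the operation SpMP parallelizes here is forward substitution on a lower triangular matrix: solving L x = b, where each x[i] depends on previously computed entries. The serial sketch below uses our own naming and is not SpMP code; it assumes the matrix is stored in CSR with column indices sorted so that the diagonal is the last entry of each row, and it is the kind of baseline against which a parallel solver is usually validated.

    /* Serial forward substitution L * x = b, L lower triangular in CSR.
     * Assumes sorted columns, so the diagonal is the last entry of row i.
     * Textbook reference code with our own naming; not taken from SpMP. */
    #include <stddef.h>

    void sptrsv_csr_serial(size_t n,
                           const size_t *row_ptr, const size_t *col_idx,
                           const double *val, const double *b, double *x)
    {
        for (size_t i = 0; i < n; i++) {
            double sum = b[i];
            size_t last = row_ptr[i + 1] - 1;     /* position of the diagonal */
            for (size_t p = row_ptr[i]; p < last; p++)
                sum -= val[p] * x[col_idx[p]];    /* subtract already-solved terms */
            x[i] = sum / val[last];
        }
    }

The row-to-row dependence through x is what makes SpTRSV much harder to parallelize than SpMV; the SpMP package above sparsifies the synchronization needed to respect that dependence.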

A.2.6 Stencil.
• Package name: YASK (Yet Another Stencil Kernel)
• Publication: C. Yount, J. Tobin, A. Breuer, and A. Duran. YASK – Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 30–39, 2016.
• Source code: https://github.com/01org/yask
• Dataset: 3D grid of given size filled with randomly generated values
• Executable argument: First run make stencil=iso3dfd arch=hsw (or knl) mpi=1, then run ./stencil-run.sh -arch hsw (or knl) -nr 1 gridsz -b 64 -bz 96, where gridsz is the 3D grid size in ranges {32 × 16 × 16 .. 1024 × 1024 × 512 .. 2× size in each step} and {128 × 64 × 64 .. 2048 × 2048 × 2048 .. 2× size in each step} on BRD and KNL, respectively
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput

A.2.7 Fast Fourier Transform (FFT).
• Package name: FFTW v3.3.5
• Publication: Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.
• Source code: http://www.fftw.org/download.html
• Dataset: 3D dataset of given size filled with randomly generated values
• Executable argument: ./bench -s irf<size>x<size>x<size> (or orf, irb, orb) -opatient -onthreads=8 for BRD (or -onthreads=256 for KNL), where size is the 3D dataset size in ranges {96 .. 592 .. 16} and {96 .. 1088 .. 32} on BRD and KNL, respectively
• Output: Dataset statistics, elapsed execution time, GFLOPs throughput

A.2.8 Stream.
• Package name: Stream
• Publication: John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 1995.
• Source code: http://www.cs.virginia.edu/stream/ref.html
• Dataset: Array of given size filled with randomly generated values
• Executable argument: First compile the code with icc -O3 -xCORE-AVX2 -ffreestanding -qopenmp -mcmodel=medium -DSTREAM_ARRAY_SIZE="arraysz" -DNTIMES="nrun" stream.c -o stream.omp.AVX2.icc, where arraysz is the array size in ranges {2^4 .. 2^24 .. 2× size in each step} and {2^4 .. 2^26 .. 2× size in each step} on BRD and KNL, respectively, and nrun is 2000 for arrays of small sizes and 20 for arrays of large sizes. Then run ./stream.omp.AVX2.icc
• Output: Dataset statistics, elapsed execution time, GB/s throughput (a minimal version of the measured triad loop is sketched at the end of this appendix)

A.3 Raw Experimental Results
We uploaded our raw experimental data to a repository on GitHub. The link is: https://github.com/bhSPARSE/opm_rawdata.
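Returning to the Stream benchmark in Section A.2.8: the GB/s it reports comes from timing a few simple loops, of which the triad below is the most commonly quoted. The sketch is our own minimal version of that loop, not the reference stream.c; each iteration reads two doubles and writes one, so the sustained rate exposes the bandwidth of whichever memory level the arrays land in (eDRAM, MCDRAM, or DDR), which is what the STREAM_ARRAY_SIZE sweep above varies.

    /* Minimal STREAM-style triad a[i] = b[i] + q * c[i]; our own sketch,
     * not the reference stream.c. Compile with OpenMP, e.g., icc -qopenmp. */
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef STREAM_ARRAY_SIZE
    #define STREAM_ARRAY_SIZE (1 << 24)     /* matches the -DSTREAM_ARRAY_SIZE sweep */
    #endif

    int main(void)
    {
        const size_t n = STREAM_ARRAY_SIZE;
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        double *c = malloc(n * sizeof(double));
        if (!a || !b || !c) return 1;

        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

        const double q = 3.0;
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)      /* 2 reads + 1 write per element */
            a[i] = b[i] + q * c[i];

        printf("a[0] = %g, bytes moved per pass = %zu\n",
               a[0], 3 * n * sizeof(double));
        free(a); free(b); free(c);
        return 0;
    }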