Modeling Non-Uniform Memory Access on Large Compute Nodes

Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa To cite this version: Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa. Modeling Non- Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model. IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2019, 30 (6), pp.1374–1389. 10.1109/TPDS.2018.2883056. hal-01924951 HAL Id: hal-01924951 https://hal.inria.fr/hal-01924951 Submitted on 16 Nov 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model Nicolas Denoyelle∗z, Brice Goglin∗, Aleksandar Ilicy, Emmanuel Jeannot∗x, Leonel Sousay ∗ Inria – Bordeaux - Sud-Ouest, Univ. Bordeaux, France fnicolas.denoyelle, brice.goglin, [email protected] y INESC-ID, Instituto Superior Tecnico,´ Universidade de Lisboa, Lisbon, Portugal faleksandar.ilic, [email protected] z Atos Bull Technologies – 38130, Echirolles, France x LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest Abstract—NUMA platforms, emerging memory architectures with on-package high bandwidth memories bring new opportunities and challenges to bridge the gap between computing power and memory performance. Heterogeneous memory machines feature several performance trade-offs, depending on the kind of memory used, when writing or reading it. Finding memory performance upper-bounds subject to such trade-offs aligns with the numerous interests of measuring computing system performance. In particular, representing applications performance with respect to the platform performance bounds has been addressed in the state-of-the-art Cache-Aware Roofline Model (CARM) to troubleshoot performance issues. In this paper, we present a Locality-Aware extension (LARM) of the CARM to model NUMA platforms bottlenecks, such as contention and remote access. On top of this, the new contribution of this paper is the design and validation of a novel hybrid memory bandwidth model. This new hybrid model quantifies the achievable bandwidth upper-bound under above-described trade-offs with less than 3% error. Hence, when comparing applications performance with the maximum attainable performance, software designers can now rely on more accurate information. Index Terms—Roofline Model, Cache-Aware Roofline Model, Heterogeneous Memory, Knights Landing, Skylake, Platform Modeling, Benchmarking, NUMA, Multi-core/single-chip multiprocessors, Shared Memory, Memory Bandwidth F 1 INTRODUCTION HE increasing demands of current applications, both in in different ways and motivate the joint study of both T terms of computation and amount of data to be manip- systems. ulated, and the modest improvements in the performance of Additionally, each cluster of the KNL may feature tradi- processing cores have led to the development of large multi- tional DDR memory as well as 3D-stacked high-bandwidth core and many-core systems [1]. These platforms embed memory named MCDRAM, which can be used as hardware- complex memory hierarchies, spanning from registers to managed cache or additional software-managed memory. private and shared caches, local main memory, and mem- Managing heterogeneous memories in runtime systems ory accessed remotely through interconnection networks. In brings another level of complexity and makes performance these systems, memory throughput is not uniform, since analysis harder and even more necessary. Hence, being they embed several kinds of memories and the distance able to understand the impact of the memory hierarchy between processor and memories varies. On such Non- and core layout on application performance, as well as on Uniform Memory Access (NUMA) architectures, the way attainable performance upper-bounds, is of most impor- data is allocated and accessed has a significant impact on tance and interest. This is especially true when modeling performance [2]. the architecture and tuning applications to take advantage Recently, the latest Intel Xeon Phi processor, code-name of the architecture characteristics in order to improve per- Knights Landing (KNL) [3], traces the NUMA roadmap with formance. As a reference point, on-chip bandwidth varies a processor organized into 4 Sub-NUMA Clusters (SNC- from several orders of magnitude (' 20 times) between 4 mode). Usually, NUMA platforms include several sock- the caches closest to the cores and the local main memory ets interconnected with processor-specific links (e.g. Quick (cf. Section 4). Moreover, purely remote memory access Path Interconnect [4]) or by custom switches, such as SGI across a tested QPI link delivers half of the local memory NUMAlink or Bull Coherent Switch [5]. However, the KNL throughput, thus pushing further the data transfer time. interconnects NUMA clusters at the chip scale (through a 2D Additionally, increasing the number of concurrent requests mesh of up to 36 dual-core tiles). Though the software may on a single memory also increases the contention and can see both types of systems as similar homogeneous NUMA slow down the perceived local memory bandwidth down to trees, the strong architectural differences between NUMA '46% of its maximum value on a tested Broadwell system sockets and KNL chips, can impact application performance (cf. Section 4). On the KNL system with 64 cores, the same 2 memory access pattern decreases the bandwidth down to ' 25% of its maximum value (cf. Section 5). Hence, it fpeak is clear that the memory access pattern has a significant Application with good locality impact on the delivered throughput. Finally, unlike cross Same application with worse locality QPI bandwidth, our experiments in Section 5 show a nearly uniform bandwidth on the interconnection network of the KNL chip. Therefore, it is obvious that the chip layout has a Cache BandwidthMemory * AI BandwidthRemote Bandwidth * AI * AI significant impact on the achievable performance. [GFlops/s] Performance Tuning application performance and inferring their abil- Arithmetic Intensity [Flops/Byte] ity to fully exploit the capabilities of those complex systems require to model and acquire knowledge about their re- Fig. 1: CARM chart of a hypothetical compute node com- alistically achievable performance upper-bounds and their posed of one cache level. individual components (including the different levels of memory hierarchy and the interconnection network). The Cache-Aware Roofline Model [6] (CARM) has been recently tion 6, discusses the main limitations of the model, instanti- proposed –by some of the authors of this paper- as an ates and validates a heterogeneous bandwidth model, able insightful model and an associated methodology aimed to overcome such limitations. Section 7 gives an overview at visually aiding performance characterization and opti- of the state-of-the-art related works, while Section 8 draws mization of applications running on systems with a cache the final conclusion of this research work. memory subsystem. CARM has been integrated by Intel into their proprietary tools, and it is described as “an incredibly useful diagnosis tool (that can guide the developers in the ap- 2 LOCALITY AWARE ROOFLINE MODELING plication optimization process), ensuring that they can squeeze the maximum performance out of their code with minimal time In general, the Roofline modeling [8] is an insightful ap- and effort.”1 However, the CARM refers to systems based on proach to represent the performance upper-bounds of a pro- a single-socket computational node with uniform memory cessor micro-architecture. Since computations and memory access, which does not exhibit the NUMA effects that can transfers are simultaneously performed, this model is based also significantly impact performance. on the assumption that the overall execution time can be To address these issues, we have proposed, firstly in limited either by the time to perform computations or by our previous contribution [7] extended in this paper, a the time to transfer data. Hence, from the micro-architecture new methodology to enhance the CARM unsightliness and perspective, the overall performance can be limited either provide locality hints for application optimization on con- by the peak performance of the computational units or by temporary large shared memory systems, such as multi- the capabilities of the memory system (i.e. bandwidth). socket NUMA systems and Many Integrated Core proces- To model the performance limits of contemporary multi- sors with heterogeneous memory technologies and multiple core systems, the Cache-Aware Roofline Model (CARM) [6] hardware configurations [7]. However, we noticed that some explicitly considers both the throughput of computational highly optimized synthetic benchmarks are not correctly engine and the realistically achievable bandwidth of each characterized yet by our model. Though they are designed memory

Modeling Non-Uniform Memory Access on Large Compute Nodes

2.5 Classification of Parallel Computers

Chapter 5 Multiprocessors and Thread-Level Parallelism

Chapter 5 Thread-Level Parallelism

Trafficdb: HERE's High Performance Shared-Memory Data Store

Cache-Aware Roofline Model: Upgrading the Loft

CS 575: the Roofline Model

Scheduling of Shared Memory with Multi - Core Performance in Dynamic Allocator Using Arm Processor

Beyond the Roofline: Cache-Aware Power and Energy-Efficiency

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Lecture 17: Multiprocessors

Use of Shared Memory in the Context of Embedded Multi-Core Processors: Exploration of the Technology and Its Limits

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use