Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications

Brice Goglin
Inria Bordeaux - Sud-Ouest, University of Bordeaux, France
[email protected]

MEMSYS '16, October 03-06, 2016, Alexandria, VA, USA. © 2016 ACM. ISBN 978-1-4503-4305-3/16/10. DOI: http://dx.doi.org/10.1145/2989081.2989115

ABSTRACT

High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfers between cores and memory is becoming critical. Locality is therefore a major area of optimization on the road to exascale: tasks and data have to be carefully distributed on the computing and memory resources.

We discuss the current way processor and memory locality information is exposed in the Linux kernel and in user-space libraries such as the hwloc software project. The current de facto standard structural modeling of the platform as a tree is not perfect, but it offers a good compromise between precision and convenience for HPC runtimes.

We present an in-depth study of the software view of the upcoming Knights Landing processor. Its memory locality cannot be properly exposed to user-space applications without a significant rework of the current software stack. We propose an extension of the current hierarchical platform model in hwloc. It correctly exposes new heterogeneous architectures with high-bandwidth or non-volatile memories to applications, while still being convenient for affinity-aware HPC runtimes.

Keywords

Heterogeneous memory; locality; affinity; structural modeling; user-space runtimes; high-performance computing; Linux

1. INTRODUCTION

Parallel platforms are increasingly complex. Processors now have many cores, multiple levels of caches, as well as a NUMA interconnect. Exploiting the computing power of these machines requires deep knowledge of the actual organization of the hardware resources. Indeed, tasks and data buffers have to be carefully distributed on computing and memory resources so as to avoid contention, remote NUMA accesses, etc. Making the most of the platform requires high-performance computing runtimes to match the application requirements with the hardware topology. The memory hierarchy is a key component in this topology awareness. Precise knowledge of the locality of NUMA nodes and caches with respect to CPU cores is required for proper placement of tasks and data buffers.

Operating systems such as Linux already expose some locality information to user-space applications [16], and to HPC runtimes in particular. Modeling the platform as a hierarchical tree is a convenient way to implement locality-aware task and data placement. Even if a tree does not perfectly match the actual hardware topology, it is a good compromise between precision and performance; algorithms such as recursive top-down partitioning can easily be implemented for distributing tasks on the platform according to the hierarchical model of the hardware. Basic queries such as finding which cores and NUMA nodes are close to each other are also straightforward in such a structural model: this is as easy as looking up specific resources in parent or child nodes of the tree.

However, new memory architecture trends are going to deeply modify the actual organization of the memory hierarchy. High-bandwidth and/or non-volatile memories as well as memory-side caches are expected to significantly change the traditional platform model. As a case study, we explain how the Linux kernel and user-space libraries expose the topology of the upcoming Intel Knights Landing architecture. We reveal several flaws in the current HPC software stack, which lacks expressiveness and a generic memory model. We then propose a new structural model of parallel computing platforms. This model comes as an extension to the hwloc library, the de facto standard tool for exposing hardware topology to HPC applications. Our model still satisfies the needs of many existing locality-aware HPC runtimes while supporting new heterogeneous memory hierarchies.
2. EXPOSING MEMORY LOCALITY AS A HIERARCHICAL TREE OF RESOURCES

We explain in this section why memory locality is critical to high-performance computing and we discuss performance and structural platform modeling. We then explain why modeling the platform as a hierarchical tree of hardware resources is a good trade-off between precision and convenience.

2.1 Memory Locality matters to HPC Applications

The importance of NUMA awareness has been known in high-performance computing for decades [4]. Threads should run as close as possible to the data they use, and data should be allocated on NUMA nodes close to these threads. Many research works focused on improving application performance by placing threads and data using information about the application behavior and about the architecture [22, 32]. Such locality issues may be dealt with by looking at hardware performance counters to auto-detect non-local NUMA memory accesses [29], or by having the application provide the runtime with hints about its usage of memory buffers [6].

The democratization of multicore processors in the last twelve years added cache sharing to the reasons why locality matters to performance. Indeed modern processors contain multiple levels of caches, some being private to a single core, others being shared by all or some cores, as depicted on Figure 1. This causes synchronization and data sharing performance to vary with task placement even more, since data may or may not be available in a local cache thanks to another core using it. Information about the affinities between threads and data buffers has therefore been used for better placement based on the impact of caches on performance [25].

Figure 1: AMD platform containing Opteron 6272 processors, simplified to a single processor, as reported by hwloc's lstopo tool. This processor package is made of two parts containing one NUMA node and one L3 cache each. L2 and L1i caches are then shared by Compute Unit pairs of cores, while the L1d is private to each single-threaded core.

There is actually a need to find a trade-off between cache and NUMA affinities [18]. Indeed keeping multiple tasks below a shared cache favors communication between them, as long as the exchanged data sets fit in the cache. However, the opposite placement – spreading tasks on cores of different NUMA nodes – also has the advantage of increasing the total available memory bandwidth by using more NUMA nodes. There is a generic problem of distributing the workload across machines while addressing affinity constraints. Memory-intensive applications may prefer spreading, while communication- or synchronization-intensive applications may prefer compact placement below a shared cache.

These ideas apply to shared-memory programming because of shared buffers between threads, but also to processes as soon as they communicate with each other within the machine. Indeed MPI communication performance varies with the physical distance: intra-socket communication is usually faster than inter-socket, causing performance to increase when processes are placed close to their favorite peers [2].

Placing tasks and data according to the application affinities may be performed dynamically by monitoring performance counters and detecting cache contention or remote NUMA accesses [30]. Another approach consists in having the application help the runtime system by providing hints about its affinities [6]. MPI process launchers may also use the application communication pattern to identify which processes to co-locate [15]. These ideas rely on information about the application software. They also require deep knowledge about the hardware so that the application needs may be mapped onto the machine resources.

2.2 Performance or Structural Modeling

The complexity of modern computing platforms makes them increasingly harder to use, causing the gap between peak performance and application performance to widen. Understanding the platform behavior under different kinds of load is critical to performance optimization. Performance counters are a convenient way to retrieve information about bottlenecks, for instance in the memory hierarchy [31], and to apply feedback to better schedule the next runs [25]. However these strategies remain difficult given the number of parameters that are involved (memory/cache replacement policy, prefetching, bandwidth at each hierarchy level, etc.), many of them being poorly documented.

The platform may also be modeled by measuring the performance of data transfers between pairs of resources, either within a single host or between hosts. Placement algorithms may then use this knowledge to apply weights to all pairs of cores when scheduling tasks [23]. This approach may even lead to experimentally rebuilding the entire platform topology for better task placement [12].

These ideas however lack a precise description of the structural model of the machine. Experimental measurement cannot ensure the reliable detection of the hierarchy of computing and memory resources such as packages, cores, shared caches and NUMA nodes. For instance, it may be difficult to distinguish cores that are physically close from cores that are slightly farther away but still inside the same processor.

Also, the hierarchical organization of hardware resources does not impact performance in a uniform way. Different workloads suffer differently from remote NUMA accesses or cache contention. For instance a larger memory footprint increases the cache miss rate for a given cache size, causing the locality of the target NUMA node to become more important since it is actually accessed much more often.

Performance models only give hints about the impact of the platform on performance. If the application memory access pattern is known, one may guess the cache miss rate based on the memory footprint, reuse distance, cache size, etc. However, the actual impact of these parameters under a parallel load depends on many factors, including data set sizes, access patterns, sharing, etc. Building a performance model that covers all possible cases would lead to a combinatorial explosion. Models have to be simplified to remain usable.

On the other hand, the structural modeling of the platform easily gives precise information such as a cache or a memory link being shared by some of the cores. Even if their actual impact on performance cannot be precisely defined, using such information for task placement will help most workloads, or at least it will not slow them down. OpenMP thread scheduling [6] or MPI process placement [15] are examples of scheduling opportunities that can benefit from deep platform topology knowledge. Indeed hardware resources are physically connected by links that may be modeled by a graph that applications may consult to better place their tasks and data.
2.3 Modeling the Hierarchy of Resources as a Tree

A straightforward way to model the structural organization of computing resources in a machine is to consider a hierarchical model: a computer contains processor packages, that contain cores, that optionally contain several hardware threads (with optional intermediate levels such as the AMD Compute Unit, see Figure 1).

Then, from a locality point of view, memory resources such as caches and NUMA nodes can be considered as embedded into such computing resources. Indeed NUMA nodes are usually attached to sets of cores (some old architectures even had each NUMA node attached to several processors), while caches are usually placed between some cores and the main memory. Therefore we can sensibly extend a hierarchy of computing resources to a hierarchy of memory resources as depicted in Figure 2. This tree of resources is organized by physical inclusion and locality. The more hardware threads share common ancestors (same NUMA node, shared caches, or even same core), the better the locality between them is.

Figure 2: Hierarchical modeling of a dual-processor host. Each processor contains 2 NUMA nodes with 3 dual-threaded cores each, and shared and private caches.

This approach has several advantages:

• First, the tree representation of the topology is very convenient because it exposes the natural inclusion-based organization of the platform resources. Indeed binding memory near a core or finding a shared cache between cores only requires walking up the tree until we find the relevant ancestor object and/or walking down to iterate over children. We can easily iterate over cores and threads close to a given object in the tree, a very common operation in locality-aware runtimes. Another solution would be to represent the structure as a generic graph whose nodes are resources and edges are physical connections. However, the concepts of inclusion, container parent and contained children would not be obvious, making application queries less convenient.

• Secondly, this logical organization based on locality solves many portability issues by hiding platform-specific parameters. Indeed vendor-specific configurations, BIOS or software upgrades may change the numbering of resources even if the processors are identical (for instance, hardware threads #0 and #4, called PUs for Processing Units, are close in Figure 1 while #1 is located in another processor, not displayed). Therefore relying on the hardware numbering of resources to perform explicit task placement makes programs non-portable, even to a similar platform with the same processors but a different BIOS. On the other hand, relying on resource locations in the logical tree does not suffer from such problems.

• Third, using a tree structure is a good trade-off between performance and precision: while using a graph to represent the architecture would be more precise in some cases, tree algorithms are often much more efficient in both memory and time. Indeed, process placement techniques often rely on recursive graph partitioning techniques (a communication graph must be mapped onto an architecture graph) which are much more efficient when generic graphs are replaced with trees (a minimal sketch of such a recursive top-down distribution is given after this list). For instance, the Scotch partitioning software supports a Tleaf architecture definition for enabling specifically optimized algorithms on hierarchical platforms [20].

• Finally, several operating systems (at least Linux, AIX and Windows) expose resource localities as bitmasks of smaller included objects (usually hardware threads). Hence building the hierarchical structure is straightforward.
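To make the recursive top-down distribution concrete, here is a minimal sketch (added for illustration, not part of the original text) that splits a task count across a hwloc tree according to the number of processing units below each child; the function name and the hard-coded task count are illustrative assumptions. Recent hwloc releases ship a similar built-in helper, hwloc_distrib().

```c
#include <hwloc.h>
#include <stdio.h>

/* Recursively split "ntasks" among the children of "obj", proportionally to
 * the number of processing units (PUs) below each child, and report where
 * each group of tasks ends up. */
static void place_tasks(hwloc_topology_t topo, hwloc_obj_t obj, unsigned ntasks)
{
    if (ntasks == 0)
        return;
    if (obj->arity == 0 || ntasks == 1) {
        char type[64];
        hwloc_obj_type_snprintf(type, sizeof(type), obj, 0);
        printf("%u task(s) on %s L#%u\n", ntasks, type, obj->logical_index);
        return;
    }
    unsigned total = hwloc_get_nbobjs_inside_cpuset_by_type(topo, obj->cpuset, HWLOC_OBJ_PU);
    unsigned given = 0;
    for (unsigned i = 0; i < obj->arity; i++) {
        hwloc_obj_t child = obj->children[i];
        unsigned pus = hwloc_get_nbobjs_inside_cpuset_by_type(topo, child->cpuset, HWLOC_OBJ_PU);
        /* the last child receives the remainder so that all tasks are placed */
        unsigned share = (i == obj->arity - 1) ? ntasks - given : ntasks * pus / total;
        place_tasks(topo, child, share);
        given += share;
    }
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    place_tasks(topo, hwloc_get_root_obj(topo), 16); /* distribute 16 tasks */
    hwloc_topology_destroy(topo);
    return 0;
}
```

Weighting each subtree by its number of PUs keeps the split balanced even on asymmetric trees, which is why such algorithms are so much simpler on a tree than on a generic graph.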
2.4 Structural Modeling with hwloc

We explained in the previous section why structural and hierarchical modeling of platforms is a good compromise between precision and convenience. Such a modeling is already widespread in high-performance computing. We now present the actual implementation of this idea.

Locality-awareness became critical to most HPC applications in the last ten years. HPC runtimes (including OpenMP, MPI, task-graph programming, etc.) have to find out the number of CPU cores, accelerators, and NUMA nodes as well as their physical organization so as to properly distribute tasks and data buffers. Some software projects still use their own custom implementation for performing this topology discovery and for binding tasks and memory. However this raises portability issues when it comes to supporting different platforms and operating systems. Indeed the ecosystem is very different between a BlueGene supercomputer [11] and a Linux cluster, even if they run similar applications. Therefore there is a strong tendency towards delegating this work to specialized libraries.

Topology discovery and task binding were implemented in the former PLPA (http://www.open-mpi.org/projects/plpa) and libtopology (http://libtopology.ozlabs.org) libraries, as well as in the LIKWID performance analysis suite [31]. They are now at the core of the hwloc project [5] (http://www.open-mpi.org/projects/hwloc). hwloc is already used by many parallel libraries, HPC runtimes and resource managers. It became the de facto standard tool for discovering platform topologies and binding tasks and memory buffers in HPC.

hwloc builds a generic tree, from the root Machine object down to the leaf Processing Units (hardware threads, logical processors), as depicted on Figure 1. It is based on the natural inclusive order of computing resources and the locality of memory resources. It can also build hierarchical Groups of NUMA nodes to better represent large cc-NUMA machines (such as the SGI Altix UV) based on the latency matrix reported by the hardware (some examples of distance-based grouping are presented at http://www.open-mpi.org/projects/hwloc/lstopo/).

The hwloc programming interface allows walking the tree edges to find neighbor resources. Walking up to ancestors or down to child objects lets applications find which NUMA nodes or processor cores are local. Iterating over objects of the same type (for instance when binding one process per core) is as easy as a breadth-first traversal of the tree. Given its widespread use in high-performance computing and its existing convenient API to retrieve locality information, hwloc looks like the natural candidate for studying how to expose the topology of new heterogeneous memory architectures.
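As an illustration of these queries, the following minimal sketch (added here, not from the original text) discovers the topology, binds the calling thread to the first core, and allocates a buffer on that core's local NUMA node. It assumes the hwloc 1.x model discussed so far, where NUMA nodes appear as ancestors of cores in the tree, and it links with -lhwloc.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    printf("%d cores detected\n", ncores);

    /* Bind the calling thread to the first core. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core)
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);

    /* Walk up the tree to the local NUMA node (an ancestor in the
     * hwloc 1.x model) and allocate a buffer bound to it. */
    hwloc_obj_t node = core ? hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_NUMANODE, core) : NULL;
    if (node) {
        void *buf = hwloc_alloc_membind(topo, 4096, node->cpuset, HWLOC_MEMBIND_BIND, 0);
        /* ... use buf ... */
        if (buf)
            hwloc_free(topo, buf, 4096);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```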
3. CASE STUDY WITH INTEL KNIGHTS LANDING PROCESSOR

We introduce in this section the software issues raised by the heterogeneous memory architecture of the upcoming Intel Knights Landing (KNL) processor.

3.1 Memory Architecture of the KNL

From the memory point of view, the main innovation in the Knights Landing processor is the integration of a high-bandwidth memory in the package [14]. This MCDRAM (Multi-Channel DRAM) is faster than the usual off-package memory (DDR4). Cores in the processor therefore have direct access to two distinct local memories. This is a major change in HPC architecture. Indeed, previous architectures featured a single kind of CPU-accessible, byte-addressable memory (e.g. DDR). Non-volatile memory used in storage systems such as disks cannot be directly accessed by the processors (a DMA to host memory is required first).

Additionally, the MCDRAM may be configured in 3 different modes:

• In Flat mode, it is exposed as a second NUMA node, distinct from the DDR4 NUMA node. Each processor core has two local NUMA nodes.

• In Cache mode, the MCDRAM serves as an additional cache level between the processor and the DDR (similar to an L3 cache). Each processor core has a single local NUMA node but there is an additional level of cache.

• Finally, the Hybrid mode statically splits the MCDRAM into one Flat part and one Cache part.

Figure 3: Overview of the Intel Knights Landing architecture when configured in Sub-NUMA Clustering and Hybrid mode. Each Cluster (quarter of the processor) contains some cores, a part of the MCDRAM as local memory (Flat), a part of the MCDRAM as cache ($), and it is directly connected to a part of the DDR memory.

Moreover, the processor may be split into 4 pieces if the Sub-NUMA Clustering mode is enabled. Each cluster then contains a quarter of the cores. These cores have direct access to a quarter of the MCDRAM and a quarter of the DDR. As depicted on Figure 3, this creates an architecture with 8 distinct NUMA memory nodes – 2 local to each cluster – when the MCDRAM is configured in Flat or Hybrid mode.

3.2 KNL Memory as exposed by the Linux kernel

We explained in the previous section that the MCDRAM is exposed as a separate NUMA node by the hardware (unless the Cache mode is enabled). Therefore the operating system should report that two NUMA nodes are local to each core. This is unfortunately not possible with the current Linux kernel, which assumes that each CPU core is close to only one NUMA node (for instance the cpu_to_node() kernel internal function and the corresponding structures contain a single integer for describing the local NUMA node). Indeed the ACPI SRAT specification (System Resource Affinity Table) was designed with the idea that each APIC ID (processor core identifier) is associated with a single Proximity Domain (NUMA node) [1].

The way KNL is currently exposed by the Linux kernel is to consider that processor cores are local to the DDR while the MCDRAM does not have any local core. Hence memory buffers are allocated on the DDR by default based on the usual First Touch policy (memory is physically allocated on the NUMA node close to the executing core). The advantage of this approach is that the smaller and faster MCDRAM memory cannot be used by mistake or wasted by the operating system for non-performance-critical buffers. The MCDRAM may only be used when explicitly requested by applications.
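The following minimal sketch (added for illustration, not KNL-specific) shows the first-touch behavior described above on Linux: pages are physically allocated when first written, on a NUMA node local to the executing core (hence on the DDR on KNL), and move_pages() from the libnuma headers can be used to check which node backs a page. It links with -lnuma.

```c
#define _GNU_SOURCE
#include <numaif.h>   /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    char *buf = NULL;
    if (posix_memalign((void **)&buf, pagesize, 4 * pagesize))
        return 1;

    /* First touch: the kernel physically allocates each page on a NUMA node
     * local to the core running this thread (on KNL: the DDR node, since the
     * MCDRAM node has no local core). */
    memset(buf, 0, 4 * pagesize);

    void *page = buf;
    int node = -1;
    if (move_pages(0 /* self */, 1, &page, NULL, &node, 0) == 0)
        printf("first page was allocated on NUMA node %d\n", node);

    free(buf);
    return 0;
}
```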
This unobvious way to expose KNL memory affinity was designed to be compatible with the existing ACPI specification and Linux kernel implementation. And it guarantees that memory allocations will not go to the MCDRAM by default. Newer hardware specifications are being developed to better expose these memory architectures but they are not publicly available yet. In the meantime, hardware vendors may expose custom information through the SMBIOS tables [10] but this information is not available to unprivileged user-space. The Linux kernel still needs to be modified to parse these tables and expose new memory attributes to user-space, for instance as new virtual files under /sys/bus/node/devices/node*/.

Another issue with the way the KNL architecture is exposed in software is the MCDRAM in Hybrid mode. This cache only applies to DDR memory accesses. Memory accesses going to the MCDRAM NUMA node do not go through the MCDRAM part that acts as a cache. Unfortunately, again, the Linux kernel currently has no way to report this information. Cache attributes exposed in /sys/bus/cpu/devices/cpu*/cache/index*/ have a CPU affinity attribute (shared_cpu_map). However no attribute says which NUMA nodes are close to this cache, nor which NUMA node accesses go through this cache.

Figure 4: Hierarchical modeling of the Intel Knights Landing architecture when configured in Sub-NUMA Clustering and Hybrid mode, as reported by the Linux kernel NUMA node attributes. Caches are not presented.

The drawback of this approach is that the application does not know which cores are actually close to the MCDRAM, as depicted on Figure 4. However, in the Sub-NUMA Clustering mode, there is a need to find out which one of the four MCDRAM NUMA nodes is close to which quarter of the cores. Also any multi-socket system with such heterogeneous memory in the future would raise the same issue, even without clustering.

This information may be found in the ACPI SLIT table (System Locality Information Table), which provides relative theoretical latencies between NUMA nodes [1]. Table 1 shows that finding the MCDRAM close to a given DDR node is indeed easy, which means finding the MCDRAM local to some cores is feasible. However this table does not really respect the actual ACPI SLIT specification. Indeed the local MCDRAM latency is not actually 3.1 times higher than the local DDR latency (31 against 10 in the table; the real MCDRAM latency is similar to the DDR latency, and possibly even smaller under load). The table is coherent with the idea of showing the MCDRAM as CPU-less to avoid memory allocation by mistake, but it is not coherent with the actual performance of the memory subsystem (these distances are available in /sys/bus/node/devices/node*/distance on Linux). Applications that already use the ACPI SLIT table for precise NUMA topology management will be confused on KNL. They need some KNL-specific changes to support this new architecture.

Table 1: Latency matrix between the 8 NUMA nodes of a KNL in Flat and Sub-NUMA Clustering modes as reported by the ACPI SLIT table (normalized relative latencies). The local MCDRAM quarter is at distance 31 from its local DDR, while others are at distance 41.

              DDR0  DDR1  DDR2  DDR3  MCDRAM0  MCDRAM1  MCDRAM2  MCDRAM3
  DDR0         10    21    21    21     31       41       41       41
  DDR1         21    10    21    21     41       31       41       41
  DDR2         21    21    10    21     41       41       31       41
  DDR3         21    21    21    10     41       41       41       31
  MCDRAM0      31    41    41    41     10       41       41       41
  MCDRAM1      41    31    41    41     41       10       41       41
  MCDRAM2      41    41    31    41     41       41       10       41
  MCDRAM3      41    41    41    31     41       41       41       10
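As an illustration of the kind of KNL-specific logic this currently forces on applications and runtimes, the sketch below (added here) parses the sysfs distance file mentioned above and looks for the node at distance 31, i.e. the local MCDRAM quarter in the configuration of Table 1. The hard-coded 31 constant is exactly the kind of non-portable knowledge the rest of this paper argues should be hidden behind a generic model.

```c
#include <stdio.h>

/* For a given DDR NUMA node, scan the SLIT distances exported by Linux in
 * sysfs and return the index of the node at distance 31, i.e. the local
 * MCDRAM quarter in the KNL configuration of Table 1. Returns -1 if none. */
static int local_mcdram_node(int ddr_node)
{
    char path[64];
    snprintf(path, sizeof(path), "/sys/bus/node/devices/node%d/distance", ddr_node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int dist, index = 0, found = -1;
    while (fscanf(f, "%d", &dist) == 1) {
        if (dist == 31)   /* KNL-specific: local MCDRAM distance */
            found = index;
        index++;
    }
    fclose(f);
    return found;
}

int main(void)
{
    printf("MCDRAM local to DDR node 0: node %d\n", local_mcdram_node(0));
    return 0;
}
```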
3.3 User-Space Tools for Exposing KNL Memory Locality

We explained in the previous section that the hardware and current operating systems cannot precisely expose the locality of the memory nodes of the KNL architecture in a portable way. We now discuss the existing ways to manage this locality in high-performance computing applications.

Memory locality information is available from the Linux kernel through numerous sysfs virtual files. However reading these files requires a lot of work that should rather be factorized in a library. libnuma (http://oss.sgi.com/projects/libnuma/) is the official way to manage NUMA-ness on Linux. However this library does not try to bring any locality information besides what is in the sysfs virtual files. It reports the hardware information in a C programming interface but it cannot improve the information with KNL specifics such as finding the local MCDRAM node. Besides, libnuma is Linux-specific, and it does not cover anything but hardware threads and NUMA nodes. Processor, core, and cache affinities are not provided, while they are widely used in HPC runtimes when placing tasks according to the hardware hierarchy [15, 6, 13].

The memkind library (https://github.com/memkind/memkind) was developed specifically to address KNL-like memory architectures. It offers an API to allocate memory from different memory pools, on different kinds of memory, backed by normal or huge pages, etc. [7]. Unfortunately, memkind does not cover the entire spectrum of memory management requirements on usual HPC architectures. For instance, it cannot allocate memory on a non-local NUMA node or interleave pages between multiple nodes. Again, these features are often used in HPC [8, 26].
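For illustration, a minimal memkind allocation sketch (added here, not from the original text) could look as follows; it prefers the high-bandwidth kind and falls back to the default DDR kind when no high-bandwidth nodes are detected. It links with -lmemkind.

```c
#include <memkind.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    size_t nelems = 1 << 20;

    /* Prefer high-bandwidth memory (MCDRAM on KNL) and fall back to the
     * default DDR kind when no high-bandwidth nodes are available. */
    memkind_t kind = MEMKIND_HBW_PREFERRED;
    if (memkind_check_available(MEMKIND_HBW) != 0)
        kind = MEMKIND_DEFAULT;

    double *buf = memkind_malloc(kind, nelems * sizeof(double));
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... bandwidth-critical work on buf ... */
    memkind_free(kind, buf);
    return 0;
}
```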

Most HPC runtimes actually use hwloc to discover the topology of parallel platforms and bind tasks and memory. Such higher-level tools expose the hierarchy and locality of hardware resources in a portable and abstracted manner. Hence they are good candidates for reworking and better exposing the unobvious locality information reported by the operating system.

Figure 5: hwloc 1.11.3's view of the KNL memory architecture when adding Cluster objects to group local MCDRAM and DDR NUMA node objects. Only 4 cores per cluster are presented. Caches and hardware threads are also hidden.

As explained in Section 2.3, hwloc exposes the hardware resource hierarchy as a tree. It suffers from the issues exposed on Figure 4. As a first workaround for the missing MCDRAM locality, we added KNL-specific quirks to the hwloc library using the ACPI SLIT latencies (this work is already available in hwloc releases since v1.11.3). Figure 5 shows that an additional intermediate Cluster level is added to the hierarchy so that the MCDRAM and DDR of the same quarter of the processor remain together in the tree.

An application using hwloc and willing to find the local MCDRAM now just has to walk up the tree until finding the DDR and then use its neighbor leaf. Object attributes are also used to tell applications about the MCDRAM performance and size. Additionally, we modified hwloc to hide the KNL latency matrix so that applications do not get confused by the MCDRAM latency being reported higher than the DDR.

We envision the following software stack for managing affinities on KNL:

• First, kernel and early user-space allocations may only be managed by the operating system using the default memory allocation policy, which uses the normal DDR memory.

• Affinity-aware task and data placement in HPC runtimes and parallel libraries still relies on the de facto standard hwloc library for getting a global view of the hierarchy of computing and memory resources.

• Fine-grain memory allocations may then be implemented with memkind.

There are some thoughts about porting memkind over hwloc so that memkind focuses on specialized mallocs for different kinds of memory and pages, while hwloc takes care of locating the corresponding hardware resources.

4. EXPOSING HETEROGENEOUS MEMORY LOCALITY

We now take a step back from the specific case of KNL and look at the general case of upcoming heterogeneous memory architectures. The aim is to envision a software structural platform model for exposing locality in a portable and convenient way. It should be generic enough to cover upcoming memory architecture changes. It must also answer the currently existing needs of many locality-aware parallel libraries in terms of locating hardware resources and placing tasks and memory buffers. This includes placement algorithms walking the tree in a recursive top-down manner as well as simple queries such as finding the local memory nodes, the local cores, or specifics such as shared cache sizes.

4.1 Upcoming Memory Architecture Changes

Beside the KNL architecture, there are several hints about the upcoming memory changes. First, most processor vendors have announced their own High Bandwidth Memory (HBM) implementation. These technologies increase memory access performance. However they cannot scale to hundreds of gigabytes yet, for cost and dimension reasons. Hence it is expected that normal off-package memory will still be used as a slower but larger volatile memory pool. Additionally, non-volatile memory is announced as the next storage revolution [9, 17]. Byte-addressable NVDIMMs will be directly attached to processors just like current volatile memory DIMMs [28]. Servers may thus have two or three kinds of memory local to each processor socket in the future (normal, high-bandwidth, and non-volatile). Applications will need ways to identify the kind and locality of each of them.

The next expected memory architecture change regards cache hierarchies. Memory-side caches have been used in the past as eDRAM connected to external memory controllers (for instance on POWER8 platforms [27]). They are basically invisible to the operating system. However applications may still want to be aware of them because memory footprints or algorithms can be tuned according to cache sizes [19]. When KNL is configured in Hybrid mode, the MCDRAM cache part can be considered a memory-side cache. Moreover, it only caches accesses to the DDR, not those to the MCDRAM flat memory. Memory-side caches are specific to some physical memory regions, while caches are usually CPU-side (restricted to some cores but applied to all memory accesses). There are indeed some works on 2-level memory hierarchies where the near memory can optionally be used as a far-memory cache [21].

Then comes the question of the memory interconnect between all these processors and memories. Aside from the specific market of very large NUMA architectures such as the SGI Altix UV, it seems that servers now rarely have more than two sockets, especially in high-performance computing.
However the NUMA interconnect now extends inside processors due to the increasing number of cores. Indeed most modern processors (at least Intel Xeon E5, IBM POWER8, AMD Opteron, and Fujitsu Sparc XIfx) are actually organized as two internal NUMA nodes. For instance, Intel Xeon E5 v3 processors are now made of two rings interconnecting two sets of cores and two memory controllers [3]. The Sub-NUMA Clustering mode of KNL follows the same trend. The NUMA interconnect therefore usually looks like a hierarchy made of an inter-socket network with small intra-processor interconnects. However, this point does not really change the requirements on the software. The Linux kernel may already expose the ACPI SLIT latency matrix. And software such as hwloc already uses it to build a hierarchical organization by grouping close NUMA nodes. This tree representation cannot be a perfect match for non-hierarchical NUMA interconnects but the application may still query the latency matrix to get the exact topology information if needed.

4.2 Extending the Resource Tree for Memory-Side Hierarchies

We explained in the previous section that the software stack has to adapt to different kinds of memory connected to each processor, and to caches applying to only some of these memories. Contrary to the Linux kernel, user-space tools such as hwloc already have the ability to precisely describe such details. Indeed each resource locality is described by a set of close processing units and/or a set of close memories.

We still focus on a structural tree representation because it has many advantages for HPC users, as explained in Section 2.3, even if it is not perfect. Figure 5 shows that we were able to slightly improve hwloc to better expose the MCDRAM locality on KNL. However this workaround is not actually satisfying for existing applications. First, placement algorithms usually walk down the tree to spread application tasks according to intermediate node arities (the set of tasks is split according to the number of processors, then subsets are split again according to the number of cores per processor, etc.). They are confused on KNL because they should split tasks between the DDR and the MCDRAM, while the corresponding cores are actually the same. Secondly, the model of Figure 4 lies by showing no cores that are local to the MCDRAM.

Cores should indeed appear close to both the MCDRAM and the DDR. Also the MCDRAM cache should appear as a memory-side cache that applies to the DDR only. The model should therefore rather look like Figure 6. Unfortunately this representation is not a hierarchical tree anymore. It breaks many existing use cases where applications walk up the tree to find parents containing cores (processors) or close to cores (shared caches or NUMA nodes).

Figure 6: Non-hierarchical modeling of a dual-processor architecture with two kinds of memories connected to each processor. Memory-side caches apply to only one of them.
We rather propose to move the memory-side hierarchy back out of the main compute hierarchy. Inserting CPU caches and NUMA nodes inside the hierarchy of computing units (see Figure 1) is actually useful for showing some sharing of resources (caches, local access to certain parts of the memory, etc.). However it does not mean that these shared resources must appear inside the tree itself, as parents of these processor cores. What is important to locality-aware HPC applications is:

• the structural hierarchy of cores (for recursive top-down task placement);

• the relative locality of cores and memory (for relative placement of tasks and memory with respect to each other).

Figure 7: Hierarchical modeling of a dual-processor architecture with two NUMA nodes in each processor. NUMA nodes are moved to the memory side (dashed edges) outside of the main compute hierarchy, and attached to the object that shares the same local resources (a shared cache here).

Hence the hwloc tree is now based on the compute-side hierarchy, including processor packages, cores, hardware threads, and CPU-side caches. Memory-side hierarchies become new leaves that are attached to the compute-side node according to their locality. For instance, a single processor with a single NUMA node would show that NUMA node object as a single memory-side leaf below the processor package object. If the processor contains two NUMA nodes, each node is attached to the corresponding level of sharing in the compute hierarchy (usually a shared cache, as in Figure 7). Architectures with different kinds of memory and some memory-side caches (for instance KNL) now use multiple memory-side objects as shown on Figure 8. If needed, an intermediate CPU-side hierarchy level could also be added to properly attach the memory objects if their locality only corresponds to a specific part of a processor.

Figure 8: Hierarchical modeling of a dual-processor architecture with two kinds of memories connected to each processor, and memory-side caches applying to only one of them. Each NUMA node is connected to its local processor, either directly or through a memory-side cache.

This model extends the hierarchical structural modeling to cope with heterogeneous memory architectures without breaking many locality-aware HPC runtimes. Finding the NUMA nodes that are close to a given core is now just a matter of walking up the tree until a parent has some memory-side children, and walking down these memory-side edges. Finding the cores that are close to a given memory, either DDR or HBM, just consists in walking up the memory edges to a compute-side parent, and walking down to the cores.
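A sketch of what such a query could look like under the proposed model is given below (added for illustration). The memory_arity and memory_first_child fields are illustrative names for the proposed memory-side edges, not an established hwloc API at the time of writing.

```c
#include <hwloc.h>
#include <stdio.h>

/* Sketch only: walk up the compute hierarchy from a core until an ancestor
 * has memory-side children, then enumerate these memory children. */
static void print_local_memory(hwloc_obj_t core)
{
    hwloc_obj_t parent = core;
    while (parent && parent->memory_arity == 0)
        parent = parent->parent;
    if (!parent)
        return;
    for (hwloc_obj_t mem = parent->memory_first_child; mem; mem = mem->next_sibling)
        printf("core L#%u is local to NUMA node L#%u\n",
               core->logical_index, mem->logical_index);
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core)
        print_local_memory(core);
    hwloc_topology_destroy(topo);
    return 0;
}
```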

4.3 Non-Volatile Memory

Heterogeneous memories may include both technologies with different performance and with different characteristics such as volatility. Upcoming byte-addressable non-volatile memories (NVDIMMs) may be exposed to applications as normal memory nodes as described in the previous section. However operating systems such as Linux are rather going to expose them through storage interfaces [24]. Indeed, after a reboot, applications need a way to find out where and how their data is stored in this persistent storage. File systems have been designed exactly for this purpose. Hence, applications will not be able to directly allocate memory from the non-volatile pool. They will rather have to create files and map them in virtual memory so that the non-volatile memory becomes directly accessible to the CPU and byte-addressable.
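The resulting access path is plain file mapping, as in the minimal sketch below (added here). The /mnt/pmem0 mount point is an assumption; it stands for a file system created on a pmem device, ideally mounted with the DAX option so that loads and stores bypass the page cache.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Assumed path: a file on a file system backed by the pmem0 device. */
    const char *path = "/mnt/pmem0/mydata";
    size_t len = 1 << 20;

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, len) < 0)
        return 1;

    /* Once mapped, loads and stores go directly to the non-volatile memory:
     * it becomes byte-addressable from the CPU like any other memory. */
    char *data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    strcpy(data, "hello persistent world");
    msync(data, len, MS_SYNC); /* request durability before unmapping */

    munmap(data, len);
    close(fd);
    return 0;
}
```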
NVDIMMs are exposed by Linux as pmem block devices just like any other storage device, even if the granularity may be a single byte. The Linux kernel already exposes the locality of these devices (the closest NUMA node to a block device <name> is reported in /sys/class/block/<name>/device/numa_node). And user-space tools such as libnuma and hwloc can already use it to help applications bind tasks and memory buffers close to such devices.

hwloc represents I/O devices using I/O-specific edges that connect PCI and non-PCI I/O hierarchies to the compute-side hierarchy based on locality (see Figure 9). In fact, this is similar to our memory-side proposal in Section 4.2. The compute hierarchy is at the core of the structural platform model, with memory and I/O devices attached according to their physical locality.

Figure 9: Intel dual-socket quad-core hyper-threaded platform with I/O devices as shown by hwloc's lstopo tool. The non-volatile byte-addressable NVDIMM block device pmem0 is attached to its local processor package just like PCI devices (eth0 network interface, sda disk and mlx4_0 InfiniBand HCA).

5. CONCLUSION AND FUTURE WORKS

The increasing complexity of parallel platforms and the gap between theoretical and application performance raised the need to better understand the hardware. Many HPC runtimes use locality information to place tasks and data according to their affinities. The structural modeling of the platform is often used as a way to find out which cores and NUMA nodes are close to each other. We explained why a hierarchical modeling as a tree is a good compromise between precision and convenience for HPC applications. Even if a tree is not a perfect match for certain NUMA interconnects, it provides very nice algorithmic properties without lacking significant performance properties.

We then used the upcoming Intel Knights Landing memory architecture to reveal multiple issues in the current software stack. The Linux kernel affinity attribute files are not expressive enough, and they were not designed for processors with multiple local NUMA nodes. Moreover, user-space libraries such as libnuma, memkind and hwloc all have their own limits. We described a workaround used by current hwloc releases since 1.11.3 to better represent the KNL memory model (available from https://www.open-mpi.org/software/hwloc/v1.11/). We then proposed a new way to extend the hwloc hierarchical tree to upcoming heterogeneous memory architectures and memory-side caches. Our solution maintains convenience for finding local resources and applying recursive top-down placement algorithms while correctly exposing the topology of these new architectures.

We are now working on integrating our proposal in the future hwloc 2.0 release. Future works include further rethinking the model to better support non-hierarchical architectures or even dynamically-reconfigurable hardware.

Acknowledgments

We would like to thank Intel for providing us with hints for designing our new hwloc model.

6. REFERENCES

[1] Advanced Configuration and Power Interface Specification, Revision 5.0a, Nov. 2013. http://www.acpi.info/spec50a.htm.

[2] S. R. Alam, R. F. Barrett, J. A. Kuehn, P. C. Roth, and J. S. Vetter. Characterization of Scientific Workloads on Systems with Multi-Core Processors. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), San Jose, CA, 2006.

[3] B. Bowhill, B. Stackhouse, N. Nassif, Z. Yang, A. Raghavan, C. Morganti, C. Houghton, D. Krueger, O. Franza, J. Desai, J. Crop, D. Bradley, C. Bostak, S. Bhimji, and M. Becker. The Xeon processor E5-2600 v3: A 22nm 18-core product family. In 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2015.

[4] T. Brecht. On the Importance of Parallel Application Placement in NUMA Multiprocessors. In Proceedings of the Fourth Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, Sept. 1993.

[5] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst. hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pages 180-186, Pisa, Italia, Feb. 2010. IEEE Computer Society Press.

[6] F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst. ForestGOMP: an efficient OpenMP environment for NUMA architectures. International Journal on Parallel Programming, Special Issue on OpenMP (Guest Editors: Matthias S. Müller and Eduard Ayguadé), 38(5):418-439, 2010.

[7] C. Cantalupo, V. Venkatesan, J. R. Hammond, and S. Hammond. User Extensible Heap Manager for Heterogeneous Memory Platforms and Mixed Memory Policies, 2015. http://memkind.github.io/memkind/memkind_arch_20150318.pdf.

[8] M. Castro, L. G. Fernandes, C. Pousa, J.-F. Méhaut, and M. S. de Aguiar. NUMA-ICTM: A Parallel Version of ICTM Exploiting Memory Placement Strategies for NUMA Machines. In PDSEC-09: The 10th International Workshop on Parallel and Distributed Scientific and Engineering Computing, held in conjunction with IPDPS 2009, Rome, Italy, May 2009. IEEE Computer Society Press.

[9] A. M. Caulfield, J. Coburn, T. Mollov, A. De, A. Akel, J. He, A. Jagatheesan, R. K. Gupta, A. Snavely, and S. Swanson. Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, Washington, DC, USA, 2010. IEEE Computer Society.

[10] Distributed Management Task Force, Inc. System Management BIOS (SMBIOS) Reference Specification. http://www.dmtf.org/standards/smbios.

[11] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene's CNK. In 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2010.

[12] J. González-Domínguez, G. L. Taboada, B. B. Fraguela, M. J. Martín, and J. Touriño. Automatic Mapping of Parallel Applications on Multicore Architectures using the Servet Benchmark Suite. Computers and Electrical Engineering, 38:258-269, 2012.

[13] J. Hursey and J. M. Squyres. Advancing Application Process Affinity Experimentation: Open MPI's LAMA-Based Affinity Interface. In Recent Advances in the Message Passing Interface. The 20th European MPI User's Group Meeting (EuroMPI 2013), pages 163-168, Madrid, Spain, Sept. 2013. ACM.

[14] Intel Corporation. What public disclosures has Intel made about Knights Landing? https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing.

[15] E. Jeannot, G. Mercier, and F. Tessier. Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques. IEEE Transactions on Parallel and Distributed Systems, 25(4):993-1002, Apr. 2014.

[16] C. Lameter. NUMA (Non-Uniform Memory Access): An Overview. ACM Queue, 11:40, 2013. http://queue.acm.org/detail.cfm?id=2513149.

[17] D. Li, J. S. Vetter, G. Marin, C. McCurdy, C. Cira, Z. Liu, and W. Yu. Identifying Opportunities for Byte-Addressable Non-Volatile Memory in Extreme-Scale Scientific Applications. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 945-956, May 2012.

[18] Z. Majo and T. R. Gross. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. In Proceedings of the International Symposium on Memory Management (ISMM'11), pages 11-20, San Jose, CA, June 2011. ACM.

[19] J.-S. Park, M. Penner, and V. K. Prasanna. Optimizing Graph Algorithms for Improved Cache Performance. IEEE Transactions on Parallel and Distributed Systems, 15(9):769-782, Sept. 2004.

[20] F. Pellegrini. Scotch and libScotch 5.1 User's Guide, Dec. 2010. https://gforge.inria.fr/docman/view.php/248/7104/scotch_user5.1.pdf.

[21] R. Ramanujan, G. Hinton, and D. Zimmerman. Dynamic partial power down of memory-side cache in a 2-level memory hierarchy, Oct. 9, 2014. Intel Corporation. US Patent App. 13/994,726.

[22] N. Robertson and A. Rendell. OpenMP and NUMA Architectures I: Investigating Memory Placement on the SGI Origin 3000. In 3rd International Conference on Computational Science, volume 2660 of Lecture Notes in Computer Science, pages 648-656. Springer Verlag, 2003.

[23] E. Rodrigues, F. Madruga, P. Navaux, and J. Panetta. Multi-core aware process mapping and its impact on communication overhead of parallel applications. In Computers and Communications, 2009. ISCC 2009. IEEE Symposium on, pages 811-817, July 2009.

[24] A. Rudoff. pmem.io - Persistent Memory Programming - Crawl, Walk, Run... http://pmem.io/2014/08/27/crawl-walk-run.html.

[25] F. Song, S. Moore, and J. Dongarra. Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems. In Proceedings of the 2009 IEEE International Conference on Cluster Computing, New Orleans, LA, Aug. 2009.

[26] A. Srinivasa and M. Sosonkina. Nonuniform Memory Affinity Strategy in Multithreaded Sparse Matrix Computations. In Proceedings of the 2012 Symposium on High Performance Computing, HPC '12, pages 9:1-9:8, San Diego, CA, USA, 2012. Society for Computer Simulation International.

[27] W. J. Starke, J. Stuecheli, D. M. Daly, J. S. Dodson, F. Auernhammer, P. M. Sagmeister, G. L. Guthrie, C. F. Marino, M. Siegel, and B. Blaner. The cache and memory subsystems of the IBM POWER8 processor. IBM Journal of Research and Development, 59(1):3:1-3:13, Jan. 2015.

[28] Storage Networking Industry Association. NVDIMM Special Interest Group. http://www.snia.org/forums/sssi/NVDIMM.

[29] M. M. Tikir and J. K. Hollingsworth. Using Hardware Counters to Automatically Improve Memory Performance. In Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, page 46, Pittsburgh, PA, Nov. 2004. IEEE Computer Society.

[30] M. M. Tikir and J. K. Hollingsworth. Hardware Monitors for Dynamic Page Migration. Journal of Parallel and Distributed Computing, 68:1186-1200, 2008.

[31] J. Treibig, G. Hager, and G. Wellein. LIKWID: A lightweight performance-oriented tool suite for multicore environments. In Parallel Processing Workshops (ICPPW), 2010 39th International Conference on, pages 207-216, Sept. 2010.

[32] R. Yang, J. Antony, P. P. Janes, and A. P. Rendell. Memory and Thread Placement Effects as a Function of Cache Usage: A Study of the Gaussian Chemistry Code on the SunFire X4600 M2. In Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks (i-span 2008), pages 31-36, 2008.