
HORIZON 2020 TOPIC FETHPC-02-2017 Transition to Exascale Computing

Exascale Programming Models for Heterogeneous Systems 801039

D 3.1 Report on Current and Emerging Transport Technologies

WP3: Efficient and Simplified Usage of Diverse Memories

Date of preparation (latest version): 31/1/2019
Copyright © 2018-2021 The EPiGRAM-HS Consortium

The opinions of the authors expressed in this document do not necessarily reflect the official opinion of the EPiGRAM-HS partners nor of the European Commission.

DOCUMENT INFORMATION

Deliverable Number: D 3.1
Deliverable Name: Report on Current and Emerging Transport Technologies for Data Movement
Due Date: 31/1/2019 (PM 5)
Deliverable Lead: Adrian Tate
Authors: Tim Dykes, Harvey Richardson, Adrian Tate, Steven W. D. Chien
Responsible Author: Adrian Tate (Cray), e-mail: [email protected]
Keywords: memory data movement, transport
WP/Task: WP 3 / Tasks 3.1, 3.4
Nature: R
Dissemination Level: PU
Planned Date: 28/1/2019
Final Version Date: 31/1/2019
Reviewed by: Stefano Markidis (KTH), Martin Kuehn (FhG)
MGT Board Approval: YES

DOCUMENT HISTORY

Partner   Date         Comment                                Version
Cray UK   15/01/2019   Version for internal review            v0.1
Cray UK   24/01/2019   Including comments from S. Markidis    v0.2
Cray UK   28/01/2019   Final version including all feedback   v1.0

Executive Summary

This report presents an overview of the current and upcoming landscape of heterogeneous memory technologies for pre-Exascale and Exascale high performance computing systems, surveying both hardware and software interfaces and identifying those of greatest importance to the EPiGRAM-HS project. A series of observations is made to guide the Memory Work Package, and the project as a whole, with particular consideration given to the requirement to present, allocate, and move data, and to the design of the memory-abstraction device to be developed as part of EPiGRAM-HS. The report concludes that while the landscape of Exascale memory systems and software is complex, a subset of the hardware and software technologies is robust enough to be relied upon to develop interesting advances in this area.

Contents

1 Introduction

2 Heterogeneous Memory Systems
  2.1 Memory Trends
  2.2 Exascale Memory Systems
  2.3 Exascale Programmes
  2.4 Review of Exascale Candidate Memories
    2.4.1 Static Random-Access Memory (SRAM)
    2.4.2 High Bandwidth Memory (HBM)
    2.4.3 Dynamic Random-Access Memory (DRAM)
    2.4.4 DIMM-Based Non-Volatile Memory (NVDIMM)
    2.4.5 Solid State Drive (SSD)
    2.4.6 Hard Disk Drive (HDD)

3 Interfaces to Memory and Storage
  3.1 Operating System Memory Management
  3.2 Low-Level Hardware Support
    3.2.1 The Peripheral Component Interconnect (PCI)
    3.2.2 Non-Volatile Memory Express (NVMe)
    3.2.3 Storage Device Interfaces
    3.2.4 Accelerators
    3.2.5 Bus and Interconnect Developments
  3.3 Shared Memory
    3.3.1 Basic Memory Allocation Support
    3.3.2 POSIX Shared Memory
    3.3.3 Malloc-Like Memory Allocators
    3.3.4 Memory Copy and Move APIs
    3.3.5 Persistent Memory APIs
    3.3.6 Tmpfs
    3.3.7 XPMEM
  3.4 OS Support for Heterogeneity
    3.4.1 NUMA Memory Management
    3.4.2 Heterogeneous Memory Management
  3.5 Programming Frameworks and APIs
    3.5.1 CUDA
    3.5.2 ROCm
    3.5.3 HSA
    3.5.4 OpenMP
    3.5.5 OpenCL
    3.5.6 OpenACC
    3.5.7 Memkind
    3.5.8 MPI for Memory Access
    3.5.9 Higher Level Programming Frameworks
  3.6 Off-Node Access to Memory
    3.6.1 RDMA
    3.6.2 Burst Buffers
    3.6.3 Object Store
    3.6.4 Data Brokering and Workflow Services

4 Discussion
  Observation 1
  Observation 2
  Observation 3
  Observation 4

1 Introduction

The EPiGRAM-HS project seeks to research and apply the state of the art in programming environments for heterogeneous Exascale platforms, with the goal of 'enabling extreme scale applications on heterogeneous hardware'. Work Package 3 (WP3) of the EPiGRAM-HS project focuses on the memory technologies in such systems, with the goal of providing 'simplified efficient usage of complex memory systems'. This document surveys pre-Exascale and Exascale data movement technologies in order to define a set of relevant hardware and software technologies to be pursued in WP3.

Within this document the term 'transport' is defined as a technology or set of technologies by which data is moved from one storage location to another (possibly remote) storage location. It is important to note that a canonical interpretation of the term 'transport' specifically concerns low-level protocols within buses and networks, e.g. those used within PCIe and NVLink. While such transports are mentioned, this document is primarily concerned with the movement of data into and out of storage locations.

The investigation is motivated by recent and upcoming changes in the balance of computing systems and hence in supercomputers. Today's software systems and programming environments were designed primarily during an earlier paradigm, when arithmetic operations were costly compared to data movement. Since the reverse is true today, the software systems are not well equipped to minimise data movement, or even to present to the programmer toolsets that allow fine control of data movement. This landscape and the constraints acting within it are described in Section 2.1.

Data movement is a broad term from a technical perspective and could encompass large swathes of the HPC and technical computing sectors. Furthermore, there has been an explosion in the variants of storage media that are either available at the time of writing¹ or will become available in the next few years. The potential scope of this document is therefore extremely wide. However, many candidate technologies are not realistically relevant to mainstream HPC and/or Exascale computing. Section 2.2 defines an Exascale candidate technology as one which has the right combination of availability, cost, and performance attributes to be deemed relevant to the EPiGRAM-HS project. Thankfully, the list of such technologies is manageable. Many technologies, for example those that may be interesting for post-Exascale computing but do not fit the above criteria, are therefore not discussed in this document.

Concerning data-movement software, the question of relevance to Exascale is more nuanced, and much of this report serves as supporting information and analysis on this point. Given the variety of storage media that are expected to play a role in an Exascale system (as outlined in Section 2.3), it is important to understand which software technologies are required in order to use each class of memory efficiently. 'Usage' of a memory includes each of: explicit programming, performance-portability features, and abstraction of the memory. The relevant technologies are reviewed in Section 3, and in Section 4 we classify each technology by its relevance to the project and present key observations and analysis as a guide for future activity in the EPiGRAM-HS project.

¹ January 2019

Figure 1: Growth in processor performance over 40 years, from [1].

2 Heterogeneous Memory Systems

2.1 Memory Trends

The slowdown in CPU performance gains over recent years has been extremely well documented. At the turn of the millennium the 50% annual growth rate that had characterised the previous 15 years began to slow, and in recent years the growth rate has been less than 5%, as shown in Figure 1. Although the growth rate and its decline is an extremely nuanced topic, the most significant factor has been the end of Dennard scaling (the linear relationship between supply voltage and transistor size/density that allowed chip makers to effectively make smaller, faster chips), which ended in about 2002. Since this time, chip makers have relied upon increasing the amount of on-chip parallelism, which has stressed software implementations and made high efficiency more difficult to obtain.

Although processor performance increases have slowed, memory performance has consistently decreased relative to processor performance since 1980. The dramatic divergence can be seen in Figure 2, which plots memory performance (measured as DRAM access latency) and single-core processor performance over 35 years. With 1980 as the baseline, modern memories have slowed by three orders of magnitude relative to processor performance. Even if processor performance were to continue to plateau, it would take many decades for this deficit to be corrected. This chart illustrates why DRAM advances are no longer satisfactory and why we are seeing enormous innovation in memory systems (see Section 2.4.2). We have not seen a dramatic change in the software environment to address this three-orders-of-magnitude shift.

2.2 Exascale Memory Systems

As described in Section 1, the Exascale candidate memory technologies that this document reviews are a subset of the total number of technologies that could be considered. We define an Exascale candidate technology as a memory, storage, or data-movement technology that

Figure 2: Relative performances of processors and memory over 35 years, from [1].

possesses the following qualities:

• The technology is, or will become, generally available somewhere in the timeframe 2019-2025.

• The technology has a cost roadmap that makes its GA (general availability) product suitably placed for mainstream supercomputing. For technologies similar in performance to DRAM this requires that the cost is roughly $10 per GB, though this depends on the type, purpose and speed of the memory. At the lower end of the memory-storage hierarchy, the costs per GB of storage are considerably lower.

• The technology possesses some performance (bandwidth or latency) advantage over existing technologies, and the conditions under which applications can exploit this advantage can be classified, at least on paper.

At the time of writing², the first Exascale architectures are becoming concretely defined, and so a first approximation to the list of Exascale memory candidates can be obtained simply by reviewing those architectures.

2.3 Exascale Programmes

There are four active, realistic Exascale programmes across three continents: in Europe, the United States, Japan, and China. Figure 3 compares all four programmes, which are reviewed briefly in this section.

The US initially pursued three distinct architectural approaches to achieving Exascale, using GPUs, Intel Xeon Phi, and ARM processors. In 2018, Intel discontinued the Xeon Phi processor line [3], cancelling the codenamed Knights Hill processor that would have been the basis of the first Exascale system at Argonne (A18). Instead, the so-called A21 system will be built by Intel and Cray [4] using an undisclosed non-von Neumann

² January 2019

Figure 3: Comparison of the four Exascale programmes: (a) timelines and specifications, (b) funding levels, from [2].

processing technology called CSA [5]. While details of CSA are not public, it is well known that the processor will have a data-flow quality combined with reconfigurable processing elements. Despite its exotic processing, it is likely to be deployed with a fairly standard memory system (possibly including specialised buffers) in combination with Intel's DIMM-based NVRAM Optane persistent memory.

The Summit and Sierra systems are pre-Exascale machines combining NVIDIA GPUs and IBM POWER9 CPUs. The node diagram of the Summit architecture is shown in Figure 4 and consists of 2 POWER9 CPUs, each with three NVIDIA Volta GPUs attached via NVIDIA's proprietary NVLink interconnect. Each Volta has 16GB of HBM. Attached to each node is 1.6 TB of NVMe flash.

A third US-based Exascale approach will pursue ARM-based processors using HBM (see Section 2.4.2) but with no accelerator attached. Although a number of vendors now develop 64-bit ARM processors, the number of candidate commodity processors that provide HBM as well as sufficient degrees of SIMD parallelisation is small. The most likely candidate is a node consisting of 2 or 4 ARM processors developed by Marvell Semiconductor Inc. that would be a successor of the current ThunderX2 processor³. That future chip would likely combine 32GB of HBM with DDR4/5 memory. A prototype machine, Astra, is being installed at Sandia National Laboratories [6].

In October 2018 the European Commission announced that it will fund the development of two or more Exascale systems under the umbrella of the 'EuroHPC' joint undertaking, which pools resources from participating EU member states and other partners (e.g. Switzerland). Of the €1.4B allocated to the project, 50% is sourced from the European Commission and the remainder from the member states themselves [7]. Fundamental to the European approach is the development of a home-grown European processor, developed as part of the European Processor Initiative. As shown in Figure 5, the programme will develop a CPU based on an ARM instruction set, and a RISC-V accelerator.

³ https://www.marvell.com/server-processors/thunderx2-arm-processors/

Figure 4: Node architecture of the Summit Supercomputer (WikiChip).

Generation I of both will be deployed into a pre-Exascale system in 2021 and generation II will be integrated into a full Exascale system in 2023-2024. The memory systems of neither have been fully defined, so we assume that each will follow a memory design reminiscent of its cousin architectures. In the case of the ARM-based CPU, it is expected that the attached memory will resemble that attached to the Marvell ARM products of the same era, which would mean a third-generation HBM product combined with DRAM (DDR5), as will be found on the Marvell ThunderX3 package. The memory system of the accelerator product will certainly also use HBM, since SDRAM-based GDDR products will not be competitive in this timeframe.

Japan's approach to Exascale is also largely home-grown. The so-called Post-K system will be deployed in 2021 and be built from Fujitsu's A64FX processor, which is based on the ARMv8.2 instruction set. Interestingly, the processor will exclusively use HBM as main memory. The node design of A64FX is shown in Figure 6. In addition to the HBM physical memory, a multi-tier cache (SRAM) system will be provided. There is no mention of planned usage of additional memories in the public materials.

China will likely install the first Exaflop computer built from homespun processing technologies, possibly in combination with commodity GPUs. Additionally, China is set to install 3 pre-Exascale systems in 2019-2020. Due to low levels of public information, it is not clear what specific memory features the Chinese programmes will develop, though some combination of DRAM, SRAM and HBM is inevitable.

The list of candidate Exascale memory and storage technologies is thus fairly short and is summarised in Table 1.

Figure 5: European Processor Roadmap, from [8].

Figure 6: Fujitsu's Post-K processor, from [9].

Memory       Purpose                                     Programmes
SRAM         cache                                       All
HBM          cache / main memory / accelerator memory    All
DRAM         main memory                                 Europe, US-A21, US-GPU, US-ARM
NV-DIMM      persistent memory                           US-A21
NVMe-Flash   node-local storage                          Europe, US-GPU, US-A21
NA-Flash     burst buffer                                Europe, US-GPU, US-A21, US-ARM
HDD          parallel filesystem                         All

Table 1: Exascale candidate memories.

Although this report discusses both memory and storage media, there is a growing interest in viewing the entire memory-storage hierarchy as a single set of memory/storage devices. There are two motivating factors behind this recent trend: i) a number of software technologies allow existing memories to be accessed like storage devices and vice versa, and ii) some emerging products, such as storage-class memories and DIMM-based non-volatile memories, genuinely blur the line between memory and storage. Although the unification of the memory hierarchy is unarguably a distant goal, this report will adopt the more general term memory to describe either a traditional memory or a storage level. Each memory appearing in a row of Table 1 will be described further in the following section and its relevance to the EPiGRAM-HS project summarised in Section 4.

2.4 Review of Exascale Candidate Memories

2.4.1 Static Random-Access Memory (SRAM)

Unlike DRAM, Static Random-Access Memory does not need to refresh data to preserve it, meaning that its access time is much closer to the clock cycle time and that its power requirement is generally much lower when accessed infrequently. SRAM can provide 100 times the bandwidth of DRAM, though the practical speed obtained depends on its positioning (on-chip versus off-chip) and the size of the cache.

SRAM has been the material of choice for caches for decades, usually presented as a small number of hierarchical memories (2-4 levels, with 3 being very common) with higher levels providing increased performance at lower capacity. Such caches are often hardware controlled by associative replacement logic which loads data (by cacheline) automatically with the intention of reusing that cacheline later. Such caches cannot be programmed explicitly. In general-purpose computing, caches have been shown to be surprisingly effective [1], but for domain-specific architectures such associative replacement can be suboptimal. Hence, in recent times NVIDIA GPUs have introduced programmable 'shared memory' caches, and several AI-specific chip architectures will provide a small amount of SRAM local to each compute core [10].

All Exascale architectures will include SRAM as part of a traditional multi-tier cache hierarchy and some systems will see SRAM in programmable caches. EPiGRAM-HS should therefore be concerned with SRAM, since cache performance remains critical in many applications.

Figure 7: Stacking of DRAM to form HBM, from [1].

Controlling the number of cache misses in associative caches through, for example, loop tiling can be extremely difficult, but it is feasible that an EPiGRAM-HS application could be modified to obtain improved cache performance.
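To make the preceding point concrete, the sketch below shows a loop-tiling (cache-blocking) transformation of a matrix multiplication in C. This is a minimal illustration only: the matrix size N and tile size TILE are hypothetical values chosen for the example, not figures from this report, and in practice the tile size would be tuned to the cache capacities of the target processor.

    /* Minimal sketch of loop tiling (cache blocking) for matrix multiply.
     * N and TILE are illustrative values; TILE is chosen so that a
     * TILE x TILE block of each matrix fits in cache. */
    #include <stddef.h>

    #define N    1024
    #define TILE 64                    /* assumes N is a multiple of TILE */

    void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
    {
        for (size_t ii = 0; ii < N; ii += TILE)
            for (size_t kk = 0; kk < N; kk += TILE)
                for (size_t jj = 0; jj < N; jj += TILE)
                    /* operate on one cache-sized block at a time */
                    for (size_t i = ii; i < ii + TILE; i++)
                        for (size_t k = kk; k < kk + TILE; k++)
                            for (size_t j = jj; j < jj + TILE; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

The same data are touched as in the untiled loop nest; only the order of accesses changes, so that each cache line loaded by the associative replacement logic is reused before it is evicted.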

2.4.2 High Bandwidth Memory (HBM)

One response to the diminishing returns of powering DDR DRAM memories (see Section 2.4.3) has been to stack multiple DRAM slices in so-called 2.5D and 3D configurations, as shown in Figure 7. The family of stacked, on-package, high bandwidth memories represents a set of game-changing technologies for HPC. The main device memory of NVIDIA Pascal GPUs is formed of one such High Bandwidth Memory (HBM) material.

It is worth clarifying the terminology, as the term high bandwidth memory has both a generic and a specific meaning. The generic term has referred to two previous incarnations of memory developed from materials quite distinct from the HBM in Pascal GPUs. The first is GDDR5 memory, a form of synchronous Graphics RAM (SGRAM) built from DDR3 with additional specialist features. The second is MCDRAM, which is formed from the Hybrid Memory Cube material⁴ developed jointly by Intel and Micron, and which was used in the Intel Xeon Phi (Knights Landing) processor package. Since both the Hybrid Memory Cube and the Knights Landing chips are discontinued, such memories do not need to be covered by EPiGRAM-HS except where they can simulate other types of HBM.

The specific term HBM refers to a technical specification defined by JEDEC in document JESD235. In the HBM design, memory dies and a logic die are packaged together with the CPU or GPU on a single substrate. HBM chips borrow heavily from the DDR4 design (see Section 2.4.3). Each stack of HBM DRAM provides 8 independent channels. Each of the channels runs at somewhat low speeds by DDR standards (1-to-2 GT/s). However, the larger width of each channel (128 bits), combined with the high channel count, means that each stack is able to provide 128-to-256 GB/s of bandwidth. The wide-and-slow interface, coupled with the smaller capacitive loading of the short channel to the CPU, also makes HBM much more power-efficient than other solutions. Projections indicate that HBM will deliver full bandwidth at less than half the I/O power used by DDR4. With a data rate of 2 Gbps per signal (Gen-2), the bandwidth is 32 GB/s per channel. With

⁴ http://hybridmemorycube.org/

8 channels the bandwidth is 256 GB/s per stack. Hence with 4 stacks of HBM operating at 2 Gbps, a node can provide an aggregate of 1 TB/s of on-package memory bandwidth. For comparison, previous high-end NVIDIA GPUs had a 384-bit-wide GDDR5 interface (12 x32 devices operating at 7 Gbps) for an aggregate of 336 GB/s. Overall system power is 6-7 pJ/bit versus 18-22 pJ/bit for GDDR5.

Early implementations of HBM, such as those in NVIDIA Pascal, use 2.5D solutions, but later SoCs, especially those with lower thermal design power (TDP), could take advantage of the cost savings of direct 3D stacking. JEDEC took the approach with HBM of standardising the interface to the DRAM chips directly, which differs from the approach taken by Hybrid Memory Cube (HMC). Pascal GPUs shipped with HBM-2 in 2016. HBM-3 will double per-channel bandwidths, providing up to 512 GB/s per stack, while keeping the power consumption constant.

HBM prices remain high today, but are expected to move towards those of DDR as the device and packaging technology matures. It is unlikely that HBM devices will ever reach the same price per GB as DRAM, and even if the costs (per GB) converge the pricing is likely to reflect the higher bandwidth. In the timeframe of the pre-Exascale systems of interest to EPiGRAM-HS, HBM will offer substantially higher bandwidth than DDR5, but the capacity available for constant system cost will be lower. Efficient usage of HBM is likely one of the most critical aspects of heterogeneous memory research in the project.
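For reference, the per-channel, per-stack and per-node bandwidth figures quoted in this section follow from simple arithmetic (a restatement of the numbers above, not additional data):

    \begin{aligned}
    \text{per channel:} \quad & 2\ \text{Gbit/s} \times 128\ \text{bits} = 256\ \text{Gbit/s} = 32\ \text{GB/s} \\
    \text{per stack:}   \quad & 8\ \text{channels} \times 32\ \text{GB/s} = 256\ \text{GB/s} \\
    \text{per node:}    \quad & 4\ \text{stacks} \times 256\ \text{GB/s} \approx 1\ \text{TB/s}
    \end{aligned}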

2.4.3 Dynamic Random-Access Memory (DRAM)

DRAM remains the ubiquitous material for main memory in commodity CPUs. However, as described in the preceding section, HBM is used as the main device memory in NVIDIA GPUs, and will be used exclusively as main memory in the Fujitsu A64FX processor (see Section 2.2). While the topic of DRAM and DDR trends is diverse and detailed, the questions that are of interest to the EPiGRAM-HS project are simply the performance trends and costs of DDR versus alternatives within the project period of interest (2018-2025). As such we first review recent and upcoming DDR specifications.

The initial DDR4 specification was published by JEDEC in November 2013 and has since been updated for clarity and additional details, including higher data rate support. DDR4 is a standard 1-transistor, 1-capacitor (1T1C) DRAM. Initial offerings support 2133 and 2400 data rates, with a goal of reaching 3200 during the lifetime of DDR4. While x4, x8, and x16 devices are available, HPC applications primarily use the x4 and/or x8 devices for higher capacity and RAS purposes. The core frequency of the DRAM array has not increased significantly across generations, but the IO data rates have increased by approximately 2X each generation. Significant resilience features have been added to successive generations of DDR, but with some power cost. Figure 8 summarises the specifications of DDR1-4.

In 2019, the first DDR products based on the DDR5 specification are becoming available. The general goals are to improve bandwidth and capacity while maintaining a low power envelope and low latency. From JEDEC⁵:

⁵ https://www.jedec.org/category/technology-focus-area/main-memory-ddr3-ddr4-sdram

Figure 8: Comparison of DDR1-4 specifications. DDR5, which is available at the time of writing, promises double the data rate and increased density, but practical performance data are still lacking.

The JEDEC DDR5 standard is currently in development in JEDEC's JC-42 Committee for Solid State Memories. JEDEC DDR5 will offer improved performance with greater power efficiency as compared to previous generation DRAM technologies. As planned, DDR5 will provide double the bandwidth and density over DDR4, along with delivering improved channel efficiency. These enhancements, combined with a more user-friendly interface for server and client platforms, will enable high performance and improved power management in a wide variety of applications.

However, the number of DDR5 memory products on the market remains low, and the technical specifications of those in production are not well documented. Technical, physical and power considerations mean that there is some scepticism that DDR5 performance can meet expectations [11]. DRAM prices have recently risen above the $10 per GB norm, and higher-speed and higher-capacity DIMMs are significantly more expensive. It is therefore expected that DDR5 DIMMs will be unfavourably costly in the early part of the 2019-2025 timeframe and that DDR4 DRAM will continue to dominate the market. As such, interest in DRAM alternatives such as HBM, NVDIMM-P and DIMM-based NVRAM is expected to accelerate.

2.4.4 DIMM-Based Non-Volatile Memory (NVDIMM)

One way to package flash memory is in a standard Dual In-line Memory Module (DIMM) format, called an NVDIMM. Typically the non-volatile memory used is NAND flash, a type of flash memory named after its usage of NAND logic gates. There are multiple types of DIMM-based non-volatile memory, as differentiated by the JEDEC taxonomy [12]: NVDIMM-F, NVDIMM-N, and NVDIMM-P; each is briefly described below.

NVDIMM-F consists entirely of DIMM-based NAND flash memory with an integrated ASIC to introduce DRAM-like behaviour. This type of NVDIMM is equivalent to an SSD attached to the system bus and supports large, block-addressable capacities (i.e. equivalent to SSDs, of terabyte order). NVDIMM-F was first popularised by Diablo Technologies in 2013 as Memory Channel Storage (MCS) technology [13], and was employed in IBM's (now withdrawn) eXFlash DIMM product line; however, in recent years interest has died down.

NVDIMM-N is a combination of regular DRAM and on-board NAND flash memory: the DRAM is used as system memory whilst the NAND serves as non-volatile storage to provide persistency. In case of power failure, an on-board battery/capacitor keeps the DIMM powered whilst a control unit copies DRAM contents to the flash storage. This type of NVDIMM provides persistency in a byte- or block-addressable manner at DRAM speeds, but does not typically extend capacity beyond traditional DRAM (i.e. gigabyte order). A variety of NVDIMM-N solutions exist from vendors such as AgigA, Micron and Netlist. Notably, Netlist's HybriDIMM product employs a similar approach but uses a non-volatile storage much larger than the DRAM, with an on-board ASIC that actively moves and caches data between the DRAM and NVRAM to allow a larger addressable capacity [14].

NVDIMM-P is an upcoming JEDEC specification for a new high-capacity memory module interface, where main memory may be presented as RAM but consist of one of a variety of emerging non-volatile memory technologies. It is expected to include a new DDR5 interface, and modifications to DDR4. Details are not yet confirmed, as the specification has not been released; the closest existing product is provided by Intel (see below).

In terms of candidate memories for Exascale, Intel's Optane DC DIMM-based non-volatile memory is the only hardware solution currently considered a likely candidate, due to its planned inclusion in the A18 Exascale system, originally planned as part of the US Exascale programme as discussed in Section 2.3. This system was replaced with A21, for which significant architectural details have not yet been announced; however, Intel Optane DC is a potential candidate memory for this system. Intel's solution for persistent memory is based on 3D XPoint memory [15], a transistor-less stacked memory bearing similarities to both Phase Change Memory (PCM) and Resistive RAM (ReRAM). Intel's DIMM-based DC range (it is also packaged as an NVMe device) offers significant bandwidth improvements over flash, comparable to those of DRAM, packaged densely in DIMMs of 64-128GB today, with the expectation that 1TB DIMMs will be available in coming years. Prices track DRAM, so it is conceivable that a node can see many TB of memory capacity. Similarly, vendors will pursue disaggregated nodes [16] of dense NVDIMM solutions to provide network-attached storage or backup.

2.4.5 Solid State Drive (SSD)

The class of memories that we call Solid State Drives can be constructed from a variety of storage media and presented to the system in a variety of ways. We again restrict the review to those technologies that are immediately under consideration for Exascale systems, which means (i) NAND flash SSDs local to the node and (ii) NAND flash SSDs accessed over the network. A third potential class of SSDs, using non-volatile materials such as Intel's 3D XPoint (see Section 2.4.4), is not deemed relevant to the Exascale programmes since the cost of these devices versus flash makes them unlikely to be considered; the higher bandwidth of 3D XPoint is more interesting in the DIMM-based product.

In HPC, SSD/flash-based storage is a natural successor to traditional hard disk storage systems, providing significantly better bandwidth and latency with lower power requirements.

Figure 9: Bandwidth and capacity disparity of hybrid SSD/HDD tiered storage, from [19].

Furthermore, traditional Lustre HDD-based parallel filesystems are designed for the large streaming reads/writes required by many varieties of simulation checkpointing; modern high performance workloads increasingly require mixed I/O (e.g. small, unaligned, or random reads/writes) [17] that is problematic for current parallel filesystems but manageable through the high IOPS (input/output operations per second) capability of flash-based storage. However, primarily due to the cost and capacity limitations of flash, HDDs currently form the basis of most large storage solutions for HPC systems.

Hybrid flash/HDD storage is currently a popular means of exploiting the benefits of flash storage without the cost of an all-flash storage solution. Tiered storage in the form of burst buffers (discussed further in Section 3.6.2), and hybrid storage such as the Cray ClusterStor L300N, combine HDDs for large capacity and large streaming block I/O with SSDs for high bandwidth and for I/O patterns not suited to HDD-based Lustre storage (i.e. for high IOPS). Tiered storage suffers from the typical drawbacks of tiered memory systems, in that data must be staged in and out of each tier, accounting for the disparate bandwidths and capacities as illustrated in Figure 9.

It is expected that, as the price of flash memory decreases (a clear trend as observed in [18]), HPC storage solutions will eventually become entirely flash-based. Whilst there are already all-flash Lustre storage solutions on the market, such as the Cray ClusterStor L300F, it is expected that in the 2023-2025 timeframe flash-based storage solutions will rival HDDs in cost for large HPC centre requirements (e.g. 50-250 PB) [19].

2.4.6 Hard Disk Drive (HDD)

Storage solutions for current leadership-class high performance computing systems are still predominantly HDD-based, with some usage of hybrid flash storage (e.g. burst buffers). As discussed in Section 2.4.5, all-flash storage is expected to be the norm within the next decade; however, certainly pre-Exascale, and potentially Exascale, systems will inevitably exploit some form of HDD-based Lustre parallel filesystem. It is expected that a user's interaction with spinning disks will remain via the main parallel filesystem or through commercially popular filesystems such as XFS. Although I/O and filesystem tuning will remain a critical component of application optimisation, EPiGRAM-HS does not directly deal with this topic in detail and the relevance of HDD to the project is quite low.

3 Interfaces to Memory and Storage

Memory and storage devices are accessible via a range of interfaces, from low-level and device-specific interfaces to high-level APIs and frameworks targeted at application developers. First, in Section 3.1, the memory management mechanisms in the Linux kernel are introduced; the relevant interfaces are then covered, starting from the hardware level (Section 3.2) and moving on to higher-level memory movement APIs and operating system support for heterogeneity (Sections 3.3 and 3.4 respectively). Programming models and APIs are considered in Section 3.5, and finally off-node memory access in Section 3.6.

3.1 Operating System Memory Management

Many of the memory interfaces described in the following subsections interact to some degree with the Linux memory subsystem. As such, this section first introduces the memory management system in the Linux kernel, which is based on the concept of virtual memory: processes are each given a virtual address space that abstracts the physical memory of the system. This abstraction allows multiple processes to use address ranges larger than physical memory, originally to support systems with a small amount of physical memory augmented by a swap device on disk. It further provides the operating system with a means to transparently allocate, re-map, move, and reclaim physical memory from processes without affecting their virtual memory layout, and to abstract complex physical memory hierarchies.

Physical memory is divided into memory banks. Each memory bank is called a node, and represents a bank of memory typically with Uniform Memory Access (UMA). A node is further divided into zones, representing memory ranges with specific attributes, such as Direct Memory Access (DMA) or non-DMA writeable. Finally, each zone is divided into discrete physical memory chunks known as pages, grouped by page type (for example pages that are unallocated, mapped to processes, or reserved by the kernel), and sized as a multiple of 1024 bytes. Each physical page may be mapped to one or more virtual pages.

Processes are each given a virtual address space, and interact with physical memory by initiating requests for virtual pages from the page allocator, which are mapped into a virtual memory area (VMA) in the process's address space. The mappings between virtual and physical pages are stored in a hierarchical structure known as the page table. Each memory access requested by the process is translated to a physical address using the page table, typically in hardware via a TLB structure in a CPU-integrated Memory Management Unit (MMU). The virtual address space is organised into segments (for example: data, stack, and heap), some of which may be mapped to files (executables, dynamic libraries, and normal files), and which may consist of multiple memory arenas. The size of some of these segments is defined by the format of the executable (ELF, for example). Note that the fact that virtual address space is available does not mean that a virtual page is mapped to a physical page. Mapping happens by demand paging: when an unmapped address is referenced a page fault is generated which traps to the kernel; the kernel then maps the virtual page to a physical page in memory, possibly backed by a file in the filesystem.

Figure 10 presents an illustrative example of the described memory management system, for a fictitious compute node consisting of dual-socket CPUs and a single attached GPU.


Figure 10: Representation of Linux memory management: physical and virtual memory layout for a compute node with two CPU sockets and one attached GPU (for illustrative purposes). The heterogeneous memory management (HMM) system shown for the attached GPU, along with page movement mechanisms, is described in Section 3.4.

Each of the nodes, zones, and pages for the physical address space is represented, along with a process virtual address space split into memory arenas and further into logical pages, with a page mapping mechanism to convert between virtual and physical addresses. Further discussion of heterogeneous memory management (labelled as HMM in Figure 10), as well as of page movement mechanisms, is presented in Section 3.4.

The kernel provides a small number of system calls for memory management, which are typically abstracted by programming languages and higher-level APIs, as outlined in later sections. Full detail on the fundamentals of Linux memory management can be found in [20], and a more up-to-date overview in the current kernel documentation [21].

3.2 Low-Level Hardware Support In this subsection we consider the standard low-level interfaces that are relevant when memory devices are directly attached to a compute node or when devices that expose memory are connected to a compute node.

3.2.1 The Peripheral Component Interconnect (PCI)

The PCI interface is used to connect peripherals both internal and external to a compute node [22, 23, 24]. PCI was originally developed by Intel in the early 1990s. At the time of writing, nodes on HPC systems support PCIe gen3, which has a signalling rate of 8 GT/s; in future, gen4 will deliver 16 GT/s [25]. As with other interconnects, PCI has moved from a parallel bus architecture to one of differential signalling over 'lanes' in PCI Express (PCIe), which delivers a high-performance, bidirectional serial interconnect.

Support for PCIe lanes is implemented directly in modern processors (for example, 32 or more lanes); in addition, processors can connect to PCI devices using support in compatible chipsets.


Figure 11: Illustrative Node Architecture

In Figure 11 we show an illustrative node architecture where the processor provides a direct connection to RAM and PCIe lanes (it could also provide a network connection). In this case the processor uses DMI to communicate with a chipset that provides extra I/O expansion in the form of more PCIe lanes tied to expansion sockets, SATA connections, an M.2 NVMe socket, etc. The precise design of a node varies; for example, a cluster node or consumer desktop would have much more I/O expansion than a node on an HPC supercomputer. Accelerator devices would typically be mounted in PCIe expansion slots. PCI provides the ability to communicate with devices and transfer data (memory and IO) using a transactional interface with flow control.

3.2.2 Non-Volatile Memory Express (NVMe)

Figure 12: M.2 NVMe SSD (Image: Dmitry Nosachev)

NVM Express (NVMe) [26, 27] is an interface used to connect non-volatile memory to a CPU via PCI Express. The protocol supports low-latency parallel data paths to memory and is much more efficient than traditional storage APIs like SCSI, SAS, ATAPI and AHCI [28]. Further efficiencies are obtained through multiple I/O queues. In addition to availability in the traditional 2.5" form factor and on PCI Express expansion cards, NVMe devices may be delivered in an M.2 form factor, requiring an M.2 slot that supports NVMe (with PCIe connections). The M.2 form factor specifies the features of internally mounted expansion cards; these are keyed for capability and can deliver PCI Express 3.0, SATA 3.0 or USB 3.0 lanes to the device. M.2 supports AHCI and NVMe for SSD devices. Support for NVMe devices is provided by a combination of the platform BIOS and an OS driver.

3.2.3 Storage Device Interfaces

Disk-based devices use a range of interfaces, the most recent of which are SCSI, SATA, SAS, AHCI and NVMe [29]. These interfaces have developed from parallel buses (ATA, SCSI) to serial variants (SATA, SAS). AHCI is a more recent standard for ATA/SATA drives that improves queuing support and is appropriate for SSDs. SATA drives can be supported by chipsets on a node, while SCSI/SAS support usually requires a PCI adapter. The interfaces in this section are going to be inefficient when used to access a memory device, both because the device is pretending to be a hard drive and because of the performance limitations of the physical interfaces. Note that accessing a device as a disk drive means that it will be exposed by the kernel as a block device, typically with a filesystem interface on top.

3.2.4 Accelerators

Today, accelerators of interest to the HPC community are hosted on PCIe expansion cards within a compute node. With the exception of POWER-based systems, which can use native NVLink⁶, the PCIe interface is used for the connection of devices. Vendor-specific device drivers are required which support vendor-supplied software stacks, the most mature being that from NVIDIA for their range of GPUs. The interface provided by the CUDA programming model is described elsewhere in this document.

3.2.5 Bus and Interconnect Developments

In 2016, three efforts were announced to standardise the bus/interconnect to allow tighter coupling between processors and accelerators with streamlined software stacks. These efforts are:

• Cache Coherent Interconnect for Accelerators (CCIX) [30]

• Gen-Z [31]

• Open Coherent Accelerator Processor Interface (OpenCAPI) [32]

CCIX aims to provide a processor/ISA-agnostic interface to accelerators and memory, with hardware cache coherence across the link and a driver-less and interrupt-less data sharing framework. Gen-Z targets communication to directly attached components and even to those on a fabric, with unified data access provided as memory operations, even supporting byte-addressable load/store and messaging (put/get).

⁶ https://www.nvidia.com/en-gb/data-center/nvlink/

OpenCAPI addresses a tightly coupled interface between processors, accelerators and memory, and is not as ambitious as Gen-Z, for example in extending to a switched network or fabric. It requires direct processor support for virtual-to-physical translation, which allows attached devices to operate with virtual addresses and gives coherent access from the accelerator to system memory. The OpenCAPI transport is IBM BlueLink. BlueLink will also support NVLink 2.0 connections; NVLink was originally developed by NVIDIA to connect Pascal-architecture GPUs together and to host CPUs (only IBM POWER to date).

At the moment these initiatives are in their infancy, but they may in future lead to much more efficient architectures for data access (both direct and block access) from device to accelerator and from accelerator device to device. A useful comparison of Gen-Z, CCIX and OpenCAPI was presented by Brad Benton at the 13th Annual OpenFabrics Workshop [33].

3.3 Shared Memory

This section covers the interfaces that are available to applications and libraries to use memory on Linux. The most fundamental operations include kernel support to provide memory to a process and to provide access to memory from multiple processes simultaneously. Also covered are libraries that deal with memory allocation, efficient movement of data, and memory usage in particular contexts such as persistent memory and memory-resident filesystems. Note that there is a distinction between interfaces provided by the kernel as system calls (documented in section 2 of the Linux man pages) and interfaces layered on top of these and provided by libraries (which may be standard on Linux), documented in section 3 of the man pages.

3.3.1 Basic Memory Allocation Support

As described in Section 3.1, a Linux process has access to a range of virtual addresses. New virtual addresses can be gained by growing existing memory segments via the brk(2) and sbrk(2) system calls, or by using the mmap(2) system call to provide addresses linked to anonymous memory or to a file.

The mmap(2) system call associates (maps) files into the address space of the calling process at a user- or kernel-supplied location. The user specifies how much of the allocation is initialised from the file before mmap returns. The munmap(2) call deletes a mapping previously created with mmap(2). One use of mmap is to map a real file, but it can also be used to create an anonymous mapping which is not backed by a file. Multiple processes can call mmap and share a mapping; in this case msync(3) must be used to ensure changes made by one process are written to the file.

Most applications and libraries that require explicit memory allocation typically do so by directly or indirectly using the standard memory allocator, as described in Section 3.3.3. Subsequent sections describe in more detail various APIs that are relevant to the allocation and movement of data; note that the specific APIs mentioned are not intended to be a comprehensive list but merely indicative of the core functionality that is available.
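The sketch below illustrates the two mmap(2) usages described above on a Linux system: an anonymous mapping, whose pages are allocated lazily by demand paging, and a read-only file-backed mapping. The file name data.bin is hypothetical and used only for illustration.

    /* Minimal sketch of mmap(2): anonymous and file-backed mappings.
     * "data.bin" is a hypothetical file name used for illustration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1 << 20;                        /* 1 MiB */

        /* Anonymous mapping: physical pages are allocated on first touch. */
        char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (anon == MAP_FAILED) { perror("mmap"); return 1; }
        memset(anon, 0, len);                        /* demand paging occurs here */

        /* File-backed mapping: file contents appear in the address space. */
        int fd = open("data.bin", O_RDONLY);
        if (fd >= 0) {
            struct stat st;
            fstat(fd, &st);
            char *file = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (file != MAP_FAILED) {
                printf("first byte: %d\n", file[0]);
                munmap(file, st.st_size);
            }
            close(fd);
        }

        munmap(anon, len);
        return 0;
    }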

3.3.2 POSIX Shared Memory

As discussed in Section 3.3.1, it is possible to share memory between processes using the mmap(2) system call. As an alternative, it is possible to use the IPC shared memory routines to create shared memory segments that can be shared between unrelated processes. Historically these routines have been available in different variants: BSD, SYSV and POSIX. The POSIX variant (documented in shm_overview(7)) actually uses mmap along with other functions. The crucial API calls are:

shm_open, ftruncate, mmap/munmap, shm_unlink, close, fstat

shm_open(3) is used to open, or create and open, a shared memory object with a given name and obtain a new file descriptor. The object is removed by the corresponding shm_unlink(3) call. The ftruncate(3) call is used to size the memory mapping, which can then be mapped into the process address space using mmap(3). The fstat(3) call can be used to determine the size of an existing shared memory object.
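A minimal sketch of this call sequence is shown below; the object name /epigram_demo is hypothetical. A second, unrelated process would call shm_open with the same name (without O_CREAT) and mmap the same region in order to share the memory.

    /* Minimal sketch of POSIX shared memory: create, size, map, use, remove.
     * "/epigram_demo" is a hypothetical object name. Older glibc versions
     * require linking with -lrt. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *name = "/epigram_demo";
        size_t len = 4096;

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);   /* create and open */
        if (fd < 0) { perror("shm_open"); return 1; }

        if (ftruncate(fd, len) != 0) {                      /* size the object */
            perror("ftruncate"); return 1;
        }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(p, "hello from process A");   /* visible to other mapping processes */

        munmap(p, len);
        close(fd);
        shm_unlink(name);                    /* remove the shared memory object */
        return 0;
    }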

3.3.3 Malloc-Like Memory Allocators

The standard library call (found in glibc on Linux) to allocate memory is malloc(3). The basic API calls supported are:

calloc    Allocate and zero memory of the requested size and return a pointer
malloc    Allocate memory of the requested size and return a pointer
free      Free previously allocated memory associated with a pointer
realloc   Reallocate previously allocated memory associated with a pointer

The memory allocator itself uses chunks of memory acquired from the heap to satisfy allocation requests. Note that memory returned from malloc may still need to be paged in to the process requesting it. Memory allocators are complex, and other implementations exist that may be optimised for particular circumstances, for example jemalloc, mtmalloc and ptmalloc. Other API calls are available to control memory allocations:

posix_memalign   Return an allocation of aligned memory

madvise          Give hints about memory usage

We also note that for HPC workloads it can be advantageous to use huge pages, and these are not satisfactorily implemented on Linux. Transparent huge pages may be enabled statically in the system configuration and provided by default to applications; however, this can lead to problems. Alternatively, solutions like libhugetlbfs can be used, which provide process segments based on huge pages for specific applications that require them. Memory allocation otherwise remains the same.
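The sketch below combines the two calls listed above: an aligned allocation with posix_memalign followed by an madvise(2) hint. The 2 MiB alignment and the MADV_HUGEPAGE advice are illustrative, Linux-specific choices; the kernel is free to ignore the hint.

    /* Minimal sketch: aligned allocation plus an madvise(2) hint.
     * MADV_HUGEPAGE is Linux-specific and only a hint; the alignment and
     * size are illustrative values. */
    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t size = 64UL << 20;                 /* 64 MiB */
        void *buf = NULL;

        /* Request memory aligned to a 2 MiB (huge-page sized) boundary. */
        if (posix_memalign(&buf, 2UL << 20, size) != 0)
            return 1;

        /* Hint that this range is a good candidate for transparent huge pages. */
        madvise(buf, size, MADV_HUGEPAGE);

        memset(buf, 0, size);                     /* touch the pages */
        free(buf);
        return 0;
    }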

3.3.4 Memory Copy and Move APIs

Utilities to move data efficiently between memory locations are provided by Linux libraries and by the libraries that support compiler runtime routines required by language standards (for example C). The most basic routines to move memory are memcpy(3), bcopy(3), memmove(3) and wmemcpy(3); these routines differ in whether or not they allow the source and destination to overlap. Similar routines (strcpy(3) and variants) provide the ability to stop copying on a null character.

Note that these routines are not essential for application developers, since programming languages have built-in support to copy data between variables/objects. However, routines like memcpy(3) are optimised for performance, particularly in the case of large buffers, for example by taking care of data alignment and using the most optimal processor instructions.
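A minimal sketch of the overlap distinction mentioned above: memmove(3) is safe when the source and destination ranges overlap, whereas memcpy(3) assumes that they do not.

    /* Minimal sketch contrasting memcpy(3) and memmove(3). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[16] = "abcdefgh";

        /* Overlapping shift by two characters: memmove handles the overlap. */
        memmove(buf + 2, buf, 6);          /* buf now holds "ababcdef" */
        printf("%s\n", buf);

        /* Non-overlapping copy: memcpy is sufficient (and typically fastest). */
        char dst[16];
        memcpy(dst, buf, sizeof buf);
        printf("%s\n", dst);
        return 0;
    }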

3.3.5 Persistent Memory APIs

Persistent memory standardisation is being driven in part by the non-profit Storage Networking Industry Association [34], which has developed a high-level NVM Programming Model (NPM) describing recommended behaviour between system components (OS/user space) and operational models for NVM access.


Figure 13: SNIA NVM Software Architecture.

Figure 13 illustrates the SNIA architecture for persistent memory access. Both access via a filesystem and a direct access mechanism (DAX, the rightmost arrow) are provided; the latter OS feature provides direct access to memory with no page cache via memory-mapped files. Using a filesystem paradigm provides a way to name persistent data and to provide familiar access controls.

Intel developed a set of APIs delivered in the Persistent Memory Development Kit (PMDK) [35], which uses the DAX feature. This project was started by Intel during development of what was to become the Optane range of persistent memory products. The lowest-level library (libpmem) API follows the POSIX shared memory model but adds features relevant to persistent storage: the ability to flush data, to test if the file is persistent, and a malloc implementation (providing huge pages if possible). The library also provides its own version of memcpy. The librpmem library supports remote access to persistent memory.

Higher-level support can be built on top of libpmem, for example transaction support. The libpmemlog library supports an interface to a persistent-memory-resident log file. The libpmemobj library provides a flexible object store backed by a persistent memory file; it provides persistent memory management, transactions, lists, locking and other features. The libpmemblk library provides interfaces to create arrays of pmem-resident blocks of fixed size that are atomically updated even in the case of program or power failure.
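As an illustration of the lowest-level PMDK library, the sketch below maps a file with libpmem and flushes a store so that it is durable. The path /mnt/pmem/demo is hypothetical and assumes a DAX-mounted filesystem; on storage without real persistent memory the library falls back to msync-based flushing.

    /* Minimal sketch of the libpmem (PMDK) API; /mnt/pmem/demo is a
     * hypothetical path on a DAX-mounted filesystem. Link with -lpmem. */
    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Create (or open) a 4 KiB persistent-memory backed file and map it. */
        char *addr = pmem_map_file("/mnt/pmem/demo", 4096, PMEM_FILE_CREATE,
                                   0600, &mapped_len, &is_pmem);
        if (addr == NULL) { perror("pmem_map_file"); return 1; }

        strcpy(addr, "persistent hello");

        /* Flush the stores so they are durable across a power failure. */
        if (is_pmem)
            pmem_persist(addr, mapped_len);   /* true persistent memory */
        else
            pmem_msync(addr, mapped_len);     /* fallback for other storage */

        pmem_unmap(addr, mapped_len);
        return 0;
    }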

3.3.6 Tmpfs

The tmpfs(7) feature of Linux provides a memory-resident filesystem of fixed maximum size once mounted. The filesystem exists in virtual memory (specifically the page cache). Typical usages of tmpfs are the /tmp filesystem and the /dev/shm filesystem used for POSIX shared memory support. Note that there are other mechanisms to use memory for filesystems (ramfs, for example).

3.3.7 XPMEM

XPMEM (Cross Partition Memory) is a technology comprising a kernel module and a library that allows a process to make segments of its address space available to be mapped by another process. It was originally developed by SGI and was further developed by Cray. It can provide the support needed for the development of PGAS languages, where direct access to the address space of other processes of a distributed application is required. There is also a public version of XPMEM⁷ that is used in the vader BTL transport of Open MPI as a replacement for the sm (shared memory) transport. The API of the publicly available version provides the following routines:

xpmem_make     Export a region of a process's address space

xpmem_get      Acquire permission to attach to a (remote) memory region

xpmem_attach   Attach to a (remote) memory region

XPMEM is still provided by Cray and HPE (SGI) as part of their environments.

⁷ https://code.google.com/archive/p/xpmem and https://github.com/hjelmn/xpmem

3.4 OS Support for Heterogeneity

This section discusses the current state of support for heterogeneous memory in the Linux kernel, existing research efforts, and future directions.

3.4.1 NUMA Memory Management

Non-Uniform Memory Access (NUMA) architectures have been supported in the Linux kernel since the early 2000s. On boot, the memory architecture of the machine is detected using the Advanced Configuration and Power Interface (ACPI), and an appropriate set of nodes is created to represent each memory bank found. For UMA architectures this is a single node, whilst for NUMA architectures a list of nodes is created. A node stores the memory bank characteristics, such as the memory size, how much is unallocated, a list of associated CPUs with local access to that memory bank, and NUMA distances. NUMA distance is a metric used to define the relative cost of non-local memory access, typically measured in latency, bandwidth, or hops. Command line utilities exist to query the current NUMA configuration, such as taskset, numactl, and numastat, plus an accompanying library, libnuma, supplying system call wrappers for NUMA-aware applications to use. A NUMA representation of a two-CPU, one-GPU system is illustrated in Figure 10.

Memory allocation for a NUMA system is defined by memory policies, which can be set system wide, per process, or per VMA in the process's address space. The default policy is node local, meaning an allocation request from a process via e.g. mmap(2) is serviced from the node with which the hosting CPU is associated, if possible. The memory policy can be set explicitly via the set_mempolicy/get_mempolicy system calls, for example allowing a process to explicitly request allocations from alternative NUMA nodes. The typical approach for advanced programmers is to rely on the default first-touch memory allocation policy, where pages are allocated in the NUMA node associated with the thread that first references the page memory; appropriate placement can be obtained by taking care to initially reference allocations correctly from each thread, and by binding threads to CPUs.

Application-level memory movement across NUMA domains is supported in libnuma through explicit page migration (a short sketch follows the list below):

move_pages     Migrate a list of pages to a specified memory node, highlighted in Figure 10.

migrate_pages  Migrate all pages for a specific process belonging to the specified set of old nodes, to the specified set of new nodes.
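The sketch below uses the move_pages(2) interface listed above to migrate a single page of the calling process to another NUMA node. It assumes a system with at least two NUMA nodes and libnuma installed; the target node number is an illustrative choice.

    /* Minimal sketch of explicit page migration with move_pages(2).
     * Assumes >= 2 NUMA nodes and libnuma; link with -lnuma. */
    #include <numa.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        long page_size = sysconf(_SC_PAGESIZE);
        char *buf = aligned_alloc(page_size, page_size);
        if (!buf) return 1;
        memset(buf, 0, page_size);            /* first touch allocates the page */

        void *pages[1]  = { buf };
        int   nodes[1]  = { 1 };              /* illustrative target: node 1 */
        int   status[1] = { -1 };

        /* pid 0 means the calling process. */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
            perror("move_pages");
        else
            printf("page now resides on node %d\n", status[0]);

        free(buf);
        return 0;
    }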

Kernel extensions for automatic migration have been proposed using schemes such as affinity-on-next-touch [36, 37], an approach which is supported on Solaris; however, so far only explicit migration is available in the Linux kernel.

3.4.2 Heterogeneous Memory Management

Heterogeneous Memory Management (HMM) is a recent addition to the Linux kernel (as of 4.14) that provides infrastructure for integrating external device memory, such as that of accelerators, into the kernel memory management system. This infrastructure currently provides two main features:

Shared address space: The typical approach for accelerator devices with on-board memory is a split address space; the CPU maintains a main-memory page table, and the device maintains a device-memory page table. HMM exposes helper functions such that the CPU page table can be mirrored on the device, and both main and device memory can be presented as a shared address space.

Device zones: Data movement currently requires explicit copying through device driver functions. This feature creates an additional memory zone to track device pages, marked as un-mappable by the CPU. This allows page migration between CPU and device memory using regular paging mechanisms, and page faults can trigger a migration from the device back to main memory.

Figure 10 illustrates the new HMM features incorporated into the Linux memory management system. Device memory is included as a new memory zone with appropriate pages in the page tables, which are replicated both on the GPU and on the host to present an extended address space in which identical virtual addresses on the device and the host always map to the same physical address. An important consideration for these features is the necessity of vendor cooperation to provide support in device drivers.

Beyond the existing support, further kernel extensions have been proposed. These include: extending the NUMA abstraction to heterogeneous memories and providing OS services for asynchronous and DMA-accelerated memory movement [38]; allocation and migration services that can transparently target storage-class memories [39]; and expanding the concept of NUMA distance to include heterogeneous memory architectures [40].

3.5 Programming Frameworks and APIs

This section summarises a variety of parallel programming frameworks and APIs that specifically target heterogeneous environments, with accompanying interfaces for the associated heterogeneous memory hierarchies. Many of these frameworks target an assumed architecture of a multi-core CPU paired with an accelerator (GPU, FPGA, etc.); code runs on the host CPU and utilises an attached accelerator device via the programming framework. Throughout this section the term host will refer to the host CPU, whilst device will refer to the paired accelerator, unless otherwise stated.

3.5.1 CUDA

Compute Unified Device Architecture (CUDA) refers to both the hardware architecture and programming framework for NVIDIA GPUs. CUDA supports explicit memory allocation and movement; the memory model was that of a split address space up until CUDA 4.0. Runtime API calls are modelled on standard library functions accepting device and/or host pointers, such as:

cudaMalloc Allocate a linear block of device memory, returning a device pointer.

cudaFree Free a memory block allocated by cudaMalloc.

cudaMalloc* Family of related functions to allocate formatted chunks of device memory (e.g. cudaMalloc3D).

cudaMemcpy Copy a memory block host to host, host to device, device to host, or device to device.

cudaMemcpy* Family of related functions for moving formatted blocks of memory (e.g. cudaMemcpy3D) and for asynchronous movement.

cudaMemset Fill a device memory block with a constant byte value.

cudaMemset* Family of related functions for filling formatted blocks of memory (e.g. cudaMemset3D) and for asynchronous filling.
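To make the split address space model concrete, the following minimal host-side sketch (buffer size illustrative, error checking omitted) allocates device memory, copies a host buffer in, and copies the result back:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_buf = (float *)malloc(bytes);           /* host allocation */
        for (size_t i = 0; i < n; i++) h_buf[i] = 1.0f;

        float *d_buf = NULL;
        cudaMalloc((void **)&d_buf, bytes);              /* device allocation */

        /* Explicit movement across the split address space. */
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        /* ... launch kernels operating on d_buf here ... */
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_buf);
        free(h_buf);
        return 0;
    }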

Subsequently, more advanced features for memory management have been supported through the concepts of unified virtual addressing (UVA), and unified memory. UVA provides a virtual address space encompassing both host and device memory (potentially multiple devices), whilst unified memory takes this concept a step further and provides a managed memory system with features such as automatic page migration and prefetching, which can be integrated into high level data structures. The current API (at the time of writing, CUDA 10.0) supports:

cudaMallocManaged Allocate a managed block of unified memory.

cudaMemAdvise Advise the CUDA memory subsystem on expected utilisation for memory ranges (modelled on system call madvise).

cudaMemPrefetchAsync Pre-fetch memory ranges to specific devices, or host CPU.

cuda*Attributes Family of functions for getting and setting attributes on memory ranges, and getting attributes of device pointers.
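A minimal unified memory sketch using these calls might look as follows; the target device 0, the read-mostly hint and the buffer size are illustrative, and the actual migration behaviour depends on the GPU architecture and driver as described below:

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 1 << 24;
        char *data = NULL;

        /* One pointer, valid on both host and device. */
        cudaMallocManaged((void **)&data, bytes, cudaMemAttachGlobal);

        for (size_t i = 0; i < bytes; i++) data[i] = 1;  /* first touch on the host */

        /* Hint that the range will mostly be read by device 0, then migrate it there. */
        cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);
        cudaMemPrefetchAsync(data, bytes, 0, 0 /* default stream */);

        /* ... launch kernels reading 'data' here ... */

        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }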

Memory management features are architecture dependent; current support for x86 Linux platforms with Pascal and Volta GPU architectures includes GPU page faulting, automatic page migration (on first touch), and an extended 48-bit virtual address space. Support for passing host OS allocated (malloc) memory directly to GPU kernels, with automatic migration management, is planned based on the recently added HMM kernel features (Section 3.4.2). Alternative architectures, such as POWER9 paired with Volta GPUs, already support advanced page migration using access counters (hot pages), hardware coherency over NVLink 2, and hardware-supported address translation (ATS) for integration with system-allocated memory.

Direct access to GPU memory from other devices on the PCI bus is also possible through GPUDirect RDMA (Remote Direct Memory Access), using the CUDA driver and kernel APIs. Memory ranges must be registered, to ensure correct synchronisation, via the driver API functions below (a minimal registration sketch follows the list):

cuPointerSetAttribute Set an attribute of a device pointer; can be used to enable safer synchronised behaviour on an address range, allowing concurrent RDMA operations.

cuPointerGetAttribute Get attribute of device pointer.
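A minimal registration sketch, assuming a single GPU (device 0) and omitting error checking, allocates device memory with the driver API and marks it for synchronised memory operations before it is handed to a third-party driver:

    #include <cuda.h>    /* CUDA driver API */

    int main(void)
    {
        CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
        const size_t bytes = 1 << 20;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuMemAlloc(&dptr, bytes);

        /* Force synchronous memory operations on this allocation so that
           concurrent RDMA transfers observe consistent data. */
        unsigned int flag = 1;
        cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dptr);

        /* The range starting at dptr can now be pinned by a third-party
           kernel driver via nvidia_p2p_get_pages (see below). */

        cuMemFree(dptr);
        cuCtxDestroy(ctx);
        return 0;
    }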

Whilst the following process may be simplified by upcoming HMM integration, currently memory must then be pinned/unpinned for RDMA access via kernel API functions:

nvidia_p2p_get_pages Get the physical pages underlying a range of GPU virtual memory for access by a third-party device.

nvidia_p2p_put_pages Release a set of pages previously made accessible to a third-party device.

nvidia_p2p_free_page_table Free a third-party P2P page table; meant to be invoked during the execution of the nvidia_p2p_get_pages callback.

nvidia_p2p_dma_map_pages Make physical pages retrieved with nvidia_p2p_get_pages accessible to a third-party device.

nvidia_p2p_dma_unmap_pages Unmap pages that were previously mapped with nvidia_p2p_dma_map_pages.

nvidia_p2p_free_dma_mapping Free DMA-mapped pages; meant to be invoked during the execution of the nvidia_p2p_get_pages callback.

Full details may be found in the up-to-date CUDA Toolkit Documentation [41].

3.5.2 ROCm

ROCm is the compute platform for AMD Radeon GPUs, consisting of a kernel driver and a Heterogeneous System Architecture (HSA) enabled runtime (see Section 3.5.3). Interfaces to ROCm-supported GPUs include Heterogeneous Compute (HC) C++, a dialect of C++ with extensions for accelerators supported by the Heterogeneous Compute Compiler (HCC). ROCm also supports the OpenCL parallel programming model, which will be described in Section 3.5.5, and the memory management routines described therein. Furthermore, an automatic tool is included to convert CUDA source to an intermediate C++ dialect, HIP, the Heterogeneous-compute Interface for Portability, which supports a subset of the CUDA memory management routines. The ROCm runtime implements memory management for each of these languages through the HSA API discussed in Section 3.5.3.

RDMA is also supported through the ROCK kernel driver, using an amd_rdma interface that can be queried via amdkfd_query_rdma_interface. This provides the RDMA functions:

get_pages Make the physical pages underlying a range of GPU virtual memory accessible to a third-party device.

put_pages Release a set of pages previously made accessible to a third-party device.

is_gpu_address Check if a page range belongs to the GPU address space.

get_page_size Check the size of a single GPU page.

Full detail is provided in the ROCm Documentation [42].

3.5.3 HSA

The Heterogeneous System Architecture [43] is a set of specifications defining a virtual instruction set intermediate layer (HSAIL), memory model, task dispatcher, and runtime aimed at coherent integration of heterogeneous hardware systems. Whilst not in widespread use today, HSA provides a runtime API that is implemented by the ROCm driver for AMD GPUs (Section 3.5.2), which may potentially feature in some pre-Exascale and Exascale HPC systems. Compilers, such as HCC, may generate HSAIL for high level languages with extensions for parallelism such as HC C++, which can either be JIT-compiled into platform-specific code by an HSA finalizer at runtime, or offline-compiled at application build time.

The memory abstraction of HSA provides shared virtual addressing for all hosts and agents (devices) visible to the runtime. Memory is split into segments at the system architecture level, such as globally accessible, private, read-only and more (as defined in the System Architecture Specification [44]). At the programming API level, memory blocks are split into regions, with associated sizes, segment associations, and allocation characteristics. The runtime allocation and movement API is provided as follows:

hsa_memory_allocate Allocate a block of memory in a given region.

hsa_memory_free Free a block of memory allocated with hsa_memory_allocate.

hsa_memory_copy Copy a memory block from src to dst.

hsa_memory_copy_multiple Copy a memory block from src to multiple dsts.

hsa_memory_assign_agent Change ownership of a memory buffer.

hsa_memory_register Register a memory buffer for access by a kernel agent.

hsa_memory_deregister Deregister a memory buffer registered with hsa_memory_register.

Further API functions are provided for interacting with memory regions. A user-level application may not need to interact directly with the HSA memory API, as it is intended to abstract specific vendor driver APIs below higher level approaches such as C++, OpenCL, Java, and domain specific languages (DSLs). Direct interaction may nevertheless be necessary when working with vendor drivers that implement the HSA runtime API for memory management, such as ROCm.

3.5.4 OpenMP

OpenMP (Open Multi-Processing) is a popular specification for shared memory multi-threaded parallelism, originally built on the fork-join execution model using directives and library calls. As of the 4.0 specification, OpenMP includes extensions to support heterogeneous environments, such as a set of directives to offload work to generic devices (GPU, FPGA, etc.). Each device has a device environment defined by an implicit target data region. Through target directives, variables can be associated to, and mapped between, specific environments, enabling migration of data between device environments.

The most recent OpenMP specification (5.0) aims to provide a portable interface for memory placement. This interface is based on the concepts of memory spaces, traits, and allocators.

A memory space represents a storage resource with a specific set of traits; the predefined list of spaces is as follows (as defined by the specification):

omp_default_mem_space Represents the system default storage.

omp_large_cap_mem_space Represents storage with large capacity.

omp_const_mem_space Represents storage optimized for variables with constant values.

omp_high_bw_mem_space Represents storage with high bandwidth.

omp_low_lat_mem_space Represents storage with low latency.

An allocator is associated with a memory space, from which memory is allocated according to a set of user-specified traits defining requirements such as alignment, threaded access, and pinning behaviour. Default allocators are defined for each of the predefined memory spaces, and for the predefined threaded access models (cgroup, pteam, thread). The available memory spaces are implementation defined, and API functions are provided to initialise an allocator and allocate memory in a specific memory space; a minimal sketch is given below. Directives are provided to specify allocator choice for automatic allocation in the context of a parallel block.
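The following sketch of the OpenMP 5.0 allocator API assumes a compiler and runtime implementing the 5.0 interface; whether omp_high_bw_mem_space maps onto real high bandwidth memory is implementation defined, and the fallback trait here returns default memory otherwise:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Traits: 64-byte alignment, pinned pages, fall back to the
           default memory space if the request cannot be satisfied. */
        omp_alloctrait_t traits[] = {
            { omp_atk_alignment, 64 },
            { omp_atk_pinned,    omp_atv_true },
            { omp_atk_fallback,  omp_atv_default_mem_fb }
        };

        omp_allocator_handle_t hbw =
            omp_init_allocator(omp_high_bw_mem_space, 3, traits);

        double *x = omp_alloc(1024 * sizeof(double), hbw);

        #pragma omp parallel for
        for (int i = 0; i < 1024; i++)
            x[i] = 2.0 * i;

        printf("x[10] = %f\n", x[10]);

        omp_free(x, hbw);
        omp_destroy_allocator(hbw);
        return 0;
    }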

3.5.5 OpenCL

OpenCL (Open Computing Language) is a parallel programming model for heterogeneous environments employing a wide range of hardware, including GPUs, FPGAs and DSPs. Programs are written in OpenCL C, a language based on C99, and an API is provided for interacting with devices and the runtime. More recently, OpenCL 2.2 introduces a C++ kernel language utilising a static subset of C++14 language features.

The OpenCL memory model is divided into host memory and device memory. Device memory is further split into four non-overlapping address regions: global, constant, local, and private. These logical regions are non-overlapping segments of a larger general address space, and are defined by the access characteristics for levels of the OpenCL platform model; for example, global is read-writeable from any device in the context, whilst private is local to a single work item (kernel iteration). Data buffers are stored in memory objects and may be moved between host, device, and specific regions using a series of commands submitted to an OpenCL command queue, as defined by the specification [45]:

read/write/fill The data associated with a memory object is explicitly read and writ- ten between the host and global memory regions using commands enqueued to an OpenCL command queue.

map/unmap Data from a memory object is mapped into a contiguous block of memory accessed through a host accessible pointer.

copy The data associated with a memory object is copied between two buffers, each of which may reside either on the host or on the device.

As of OpenCL 2.0, a shared virtual memory (SVM) address space is supported, with memory movement based on user triggered synchronisation points. SVM allocations can be made using the clSVMAlloc API call, and various granularities of synchronisation can be enforced.
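The explicit buffer commands above can be illustrated with a minimal host-side sketch (default platform and device, error checking omitted; clCreateCommandQueueWithProperties requires an OpenCL 2.0 implementation):

    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q =
            clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

        float host_data[256] = { 0 };

        /* Memory object backing a buffer in the global memory region. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    sizeof(host_data), NULL, NULL);

        /* Explicit, blocking write and read between host memory and the buffer. */
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host_data),
                             host_data, 0, NULL, NULL);
        /* ... enqueue kernels operating on 'buf' here ... */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host_data),
                            host_data, 0, NULL, NULL);

        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }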

3.5.6 OpenACC

OpenACC (Open Accelerators) is a set of directive-based language extensions to parallelise computational kernels in C, C++, or Fortran code, targeting both accelerator devices and CPUs. The memory model of OpenACC is similar to OpenMP; memories are exposed via a device data environment, and data directives are used to define memory behaviours. Movement can be implicitly handled by an OpenACC-capable compiler, auto-generating the appropriate device-specific memory handling code. Finer-grained control is supported through directives to signal expected behaviour for memory migration between environments, and to create mappings between variables. The data clauses defining the behaviour of specified variables are as follows:

copy Allocate device space for specified variables, copy in (to device) and out (to host) for the region duration, and release device space.

copyin Allocate and copy in, release without copy out.

copyout Allocate but do not initialise with copy in, copy out and release.

create Allocate space, no copies in or out.

present Specified variables already present on device, no movement necessary.

deviceptr Specified variables on device but not managed by OpenACC; address translation is disabled.

Further clauses exist for behaviours such as conditional copies and synchronisation, and additional syntax provides further context, such as specifying array shapes. For recent NVIDIA GPUs, OpenACC (depending on implementation) may also support the CUDA unified memory model (see Section 3.5.1), providing automatic page-based data migration. Some limitations currently apply; however, further releases of the PGI compilers are expected to support the full range of unified memory features (including the proposed x86 HMM support discussed in Section 3.5.1) [46].
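The data clauses listed above can be combined as in the following minimal sketch (array size illustrative); an OpenACC-capable compiler generates the allocation, copy-in and copy-out code around the data region:

    #include <stdio.h>
    #define N 1024

    int main(void)
    {
        float a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* copyin: a is copied to the device on entry and released on exit;
           copyout: b is allocated on the device and copied back on exit. */
        #pragma acc data copyin(a[0:N]) copyout(b[0:N])
        {
            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[10] = %f\n", b[10]);
        return 0;
    }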

3.5.7 Memkind

The memkind library [47] provides alternatives to the ISO C standard interface for memory management, extending the interface to support different memory kinds. Memkind aims to be a user-extensible heap manager that exposes the memory management features in the Linux kernel (e.g. as outlined in Sections 3.3 and 3.4) to support heterogeneous memory systems, building on top of the existing heap manager jemalloc8.

8http://jemalloc.net/

The jemalloc memory model partitions memory into arenas, primarily to reduce lock contention; threads are distributed amongst arenas in a round-robin fashion, and allocations are served from the assigned arena. The memkind library exploits this abstraction, and the extension interface of jemalloc, to create arenas with specific memory properties. At allocation time, arenas with specific properties can be selected through a series of flags passed to the memkind API, which mirrors the ISO C standard allocation routines with additional parameters and a memkind prefix.

Default kinds are provided, such as MEMKIND_HBW and MEMKIND_HUGETLB, targeting high bandwidth memory and the Linux hugetlbfs. An API is also defined for users to create their own kinds, providing a decorator interface for mixed allocations. As part of the memkind library, hbwmalloc is provided as an interface explicitly targeting high bandwidth memories, such as Intel Xeon Phi MCDRAM; AutoHBW is further included to automatically replace regular allocations with HBM allocations.
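A minimal sketch of the memkind and hbwmalloc interfaces is shown below (link with -lmemkind); MEMKIND_HBW_PREFERRED is used here so that, where the library allows it, the allocation can fall back to ordinary DRAM on systems without high bandwidth memory:

    #include <memkind.h>
    #include <hbwmalloc.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t n = 1 << 20;

        /* memkind interface: select the kind explicitly at allocation time. */
        double *x = memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double));
        if (x == NULL) { perror("memkind_malloc"); return 1; }
        x[0] = 42.0;
        memkind_free(MEMKIND_HBW_PREFERRED, x);

        /* hbwmalloc interface: targets high bandwidth memory directly. */
        if (hbw_check_available() == 0) {
            double *y = hbw_malloc(n * sizeof(double));
            y[0] = 42.0;
            hbw_free(y);
        }
        return 0;
    }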

3.5.8 MPI for Memory Access

The Message Passing Interface (MPI) [48] is the de facto standard message-passing API. This standard is pervasive in High Performance Computing, so any capability offered that is relevant for allocation of or access to memory is of interest. There are two specific capabilities that are worth mentioning in the context of this document:

• Memory allocation

• RMA Window allocation

MPI provides its own memory allocation routine, MPI_Alloc_mem, which returns memory of the requested size taking account of an implementation-defined info argument. This mechanism could be used to return memory with specific capabilities. Recent work has aimed to support heterogeneous memory allocation transparently via arenas and MPI_Info arguments to MPI_Alloc_mem [49]. MPI also provides the capability to expose memory in a window so that it is accessible to other processes via RMA operations. This mechanism can be used to provide an MPI interface to shared memory on a node using MPI_Win_allocate_shared. In addition, one could augment an MPI implementation to provide access to memory regions using MPI windows.
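The two capabilities can be sketched as follows; the info key bind_to_device is purely illustrative (info keys accepted by MPI_Alloc_mem are implementation defined), and error checking is omitted:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Memory allocation with an implementation-defined info hint. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "bind_to_device", "hbm");   /* illustrative key only */

        double *buf;
        MPI_Alloc_mem(1024 * sizeof(double), info, &buf);
        buf[0] = (double)rank;
        MPI_Free_mem(buf);
        MPI_Info_free(&info);

        /* Shared-memory window: ranks on the same node can load/store
           each other's segments directly. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        double *base;
        MPI_Win win;
        MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node_comm, &base, &win);
        base[0] = (double)rank;        /* write into the local segment */
        MPI_Win_free(&win);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }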

3.5.9 Higher Level Programming Frameworks

Higher level programming frameworks build on the previously discussed libraries and APIs to present more general abstractions for heterogeneous programming, and typically include memory abstractions or interfaces that may be relevant in the context of EPiGRAM-HS.

As an example, Kokkos [50] is a C++ library-based programming model focusing on achieving performance portability on heterogeneous systems by abstracting both the execution and memory models of a user code. Execution spaces encapsulate the locations where code can be executed, such as CPUs or GPUs, with accompanying patterns and policies to further describe the execution structure and scheduling. Data structures are defined in terms of memory spaces, with associated layouts and traits. A memory space defines where the data should reside, such as HBM, DDR, or non-volatile memory. The layout defines the in-memory data organisation, such as row- vs. column-major, tiled, or strided array ordering. Traits define how the memory is accessed, such as random, stream, or atomic. To target specific hardware devices and memory tiers, Kokkos provides a variety of platform-specific back ends, which are utilised based on the chosen execution space. Current parallel models with varying levels of support as back ends are OpenMP, CUDA, pthreads, ROCm, and Qthreads. Memory management in the Kokkos programming model is addressed via a multidimensional array abstraction. A Kokkos array implements custom access operators utilising a defined mapping between the user's logical array indices and the physical data layout in memory. Memory allocation and movement is performed transparently behind this array abstraction, via one of the supported back ends.

Another emerging programming framework that makes data movement the primary concern is Legion [51], which enables full-system data-awareness and improved productivity. The Legion authors describe their central philosophy as follows:

...the cost of data movement within these [modern computer] architectures is now coming to dominate the overall cost of computation, both in terms of power and performance. Despite these conditions, most machines are still pro- grammed using an eclectic mix of programming systems that focus only on de- scribing parallelism (MPI, Pthreads, OpenMP, OpenCL, OpenACC, CUDA). Achieving high performance and power efficiency on future architectures will require programming systems capable of reasoning about the structure of pro- gram data to facilitate efficient placement and movement of data.

A key design feature of Legion is the separation of concerns amongst its three main abstractions:

• Tasks (execution model): Describe parallel execution elements and algorithmic operations, with sequential semantics and out-of-order execution.

• Regions (data model): Describe the decomposition of the computational domain, with privileges (read-write, read-only, reduce) and coherence modes (exclusive, atomic).

• Mapper: Describes how tasks and regions should be mapped to the target architecture.

Legion is being actively developed and pursued seriously by a number of Exascale activities, e.g. [52]. Furthermore, and unlike Kokkos, it is not restricted to C++ codes. As such, Legion is something that the EPiGRAM-HS project should follow carefully.

Another abstraction that has become prominent in recent years is SYCL [53, 54]. It provides a 'single-source' approach to programming heterogeneous processors (and a host) using the latest C++ features and, as of July 2018 (SYCL 1.2.1.r3), is a pure C++ DSL based on OpenCL 1.2. It was originally introduced by Codeplay, with the standardisation effort undertaken by the Khronos Group. SYCL provides a high level programming model using C++ templates and lambda functions, and provides optimisation across a range of OpenCL implementations. A SYCL program defines device kernels through a parallel_for function; these can be queued for execution and may access data defined in a data buffer. SYCL is designed to support source compilation by multiple compilers so that host and device code may be separately compiled. SYCL kernels are despatched in SPMD fashion across the PEs of the targeted OpenCL device. A parallel implementation of the C++ standard template library (STL) is currently under development by Khronos, and there is growing support in existing libraries, such as the SYCL support in Tensorflow implemented by Codeplay [55].

3.6 Off-Node Access to Memory

Beyond on-node interfaces to memory, this section briefly overviews the options for remote memory access and movement, considering RDMA (Section 3.6.1), intermediate storage layers exploiting burst buffers and object stores (Sections 3.6.2, 3.6.3), and finally higher level abstractions for data brokering and workflow services (Section 3.6.4).

3.6.1 RDMA

One of the most efficient mechanisms for off-node data access is via Remote Direct Memory Access (RDMA). RDMA allows direct transfer of data without involving the CPU or, crucially, the OS at either end, and is likely to offer the best efficiency and latency possible. Networks that support RDMA transfers have been popular in HPC systems for many years, for example SCI, Myrinet, Elan, Quadrics, Infiniband and Gemini/Aries.

On Linux the OpenFabrics Enterprise Distribution (OFED) stack has become the standard stack used in Infiniband networks, perhaps the most prevalent of network technologies in current use. Although initially Infiniband-specific low-level APIs were used, a more portable low-level library called Libfabric is available as the main entry point for end-user software that needs to use the fabric.

Described here are popular message-passing and PGAS implementations that can take advantage of RDMA network hardware, although it should be noted that higher level frameworks such as DataSpaces (Section 3.6.4) may also have custom network transport layers capable of exploiting RDMA.

• MPI: DPM / Windows The Message Passing Interface is the de facto API used for message passing. It provides a single-sided API to communicate data between processes by means of a programmer-allocated window which can then be the target of put and get opera- tions. Typical MPI implementations implement these operations using RDMA.

• GASPI, SHMEM, Fortran, UPC etc. There are a range of Partitioned Global Address Space (PGAS) languages that provide at a minimum put and get operations to remote data. The earliest of these is SHMEM and more recently GASPI, UPC and Fortran coarrays.

3.6.2 Burst Buffers

A burst buffer is typically used as an intermediate storage layer between compute nodes and storage systems, providing a high bandwidth intermediary for I/O acceleration based on memory technologies such as NVRAM and SSD.

Burst buffers are designed to address bursty I/O patterns [56], resulting from applications switching between compute and I/O dominant execution phases. The burst buffer provides a high peak bandwidth to allow I/O dominant phases to quickly complete without waiting for data to reach the parallel filesystem (PFS), reducing load and contention on the PFS. Two popular burst buffer solutions are described here:

• Cray DataWarp Cray DataWarp [57] exploits SSDs as directly attached node storage to decouple application I/O from the PFS for Cray XC supercomputers. Usage is typically enabled at job submission time through the workload manager; DataWarp is then exploited via regular POSIX I/O API calls, with a C library also available for manual utilisation. DataWarp can be used, for example, as application cache memory, scratch storage, data sharing, or as swap space.

• Infinite Memory Engine DataDirect Networks' (DDN) Infinite Memory Engine (IME) product provides platform-independent burst buffer features exploiting SSD-based burst buffers. IME can be purchased as a hardware-supported burst buffer or a software-only solution added to existing systems, and supports similar use cases to DataWarp.

The typical expectation is that applications should not require modification to exploit burst buffer systems, instead benefiting from transparent usage behind POSIX I/O calls. However, advanced burst buffer usage is likely to require some modification; for example, data migration between burst buffers and a PFS may be explicitly specified during a checkpoint-restart scenario, or for data sharing during in-transit visualisation. A variety of middlewares are available to abstract specific vendor APIs and simplify the use of tiered storage, for example LibHIO for hierarchical I/O [58], SCR for scalable checkpoint-restart [59], and coordination frameworks for scalable burst buffer utilisation [60].

3.6.3 Object Store

An object storage system is an architecture which manages data as objects instead of files [61]. Due to their scalability, object stores are widely adopted in cloud-based systems. Object store operations are stateless, and in object store semantics there are only two basic operations: GET and PUT. A PUT operation returns an ID which uniquely represents the object. Object store implementations usually provide a facility to map an assigned name to an ID, together with metadata which describes the object. This is often implemented with a key-value store. All objects are stored without structure, and clients communicate directly with the storage node where the data physically resides; the node is located by hashing the object's ID, without requiring a location lookup [62]. Objects are immutable and it is impossible to concurrently create or update the same object. This eliminates the bottleneck caused by locking. In contrast to POSIX I/O, object stores support a weak form of consistency: eventual consistency. This means that a successfully returned PUT operation does not necessarily mean that the object will be visible immediately. Deterministic placement of objects through ID hashing thus eliminates the lookup bottleneck [63].

• Mero is the core object store of the SAGE storage system, with the aim of providing Exascale-capable object storage. Mero consists of a cluster of nodes to which persistent storage is attached. A feature of Mero is its support for different tiers of storage technology, such as NVMe SSDs, NAND flash SSDs and hard disks. Apart from that, it features in-storage compute, where certain computations can be offloaded to the storage system. Operations in Mero are transaction based. A key-value store is also provided.

• DAOS is another object storage system that targets Exascale I/O capability. Similar to Mero, DAOS supports the use of multi-tier storage technology. The entire software stack consists of top level APIs, caching and tiering, sharding and finally persistent memory. Operations are transaction based.

• Ceph is one of the most commonly known object storage systems, providing interfaces for each of object-, block- and file-based access and a set of POSIX I/O extensions which provide relaxed consistency. Unlike Lustre, any party can compute the physical location of an object by hashing its ID. For this reason, location metadata is completely eliminated. This reduces the stress on the metadata cluster. Additionally, it is possible to manipulate the underlying object store directly through librados [64].

3.6.4 Data Brokering and Workflow Services

Memory interfacing and movement is an intrinsic component of workflow management, enabling the output of one workflow task to become the input of another. Built on top of the existing memory interfaces described in earlier sections, there also exist a series of data brokerage services to support movement of memory on and off node in contexts such as workflows, data staging, I/O acceleration and application coupling. The remainder of this section provides a brief representative description of such services.

• DataSpaces DataSpaces [65] provides a shared virtual memory space for the purpose of coupling applications, for example in mixed simulation and analysis workflows. The DataSpaces software architecture is split into three layers: the Data Storage Layer, Communication Layer, and Data Lookup Layer. The Data Storage Layer allocates and manages memory buffers which are hosted on intermediate staging nodes, and data is moved between applications and memory spaces via the Communication Layer exploiting the DART transport library [66]. The Data Lookup Layer further provides high level indexing services via a distributed hash table. Included in the DataSpaces project is DIMES [67], a similar data staging library with the ability to store data in memory on application nodes as opposed to requiring additional data storage nodes.

• DataBroker The IBM DataBroker (DBR, https://github.com/IBM/data-broker) is an in-memory distributed data store for coupling workflow applications. Data is stored in key-value form (or as tuples), within namespaces that provide a sharing mechanism across applications. Memory storage and movement is enabled by a generic back-end runtime interface, the default of which is the Redis (https://redis.io/) in-memory data store, with interfaces to GASNet [68] and IBM runtimes.

• ADIOS The Adaptable IO System (ADIOS) [69] is a framework for parallel I/O in scientific applications. The ADIOS API can be used to insert generic I/O hooks in an application code, with specific I/O behaviour set at application start-up through an XML (eXtensible Markup Language) configuration file. ADIOS relies on a series of underlying transport methods to move data between memory tiers and filesystems, including POSIX I/O, MPI-IO, the Lustre filesystem API, DataSpaces, DIMES, FlexPath [70], HDF5, NetCDF4, ICEE [71], and more. Users may perform simple I/O, data staging, or direct transport for application coupling, with support for aggregated writes.

• GLEAN GLEAN [72] is a framework for data staging and parallel I/O to support analysis in an in-situ or coupled context. GLEAN may be invoked explicitly through its API or transparently via its embedding within popular I/O libraries such as NetCDF and HDF5. Staging nodes are exploited for I/O aggregation, and topology-aware memory movement is performed via MPI.

4 Discussion

This section discusses how the information presented in this document may influence the EPiGRAM-HS project. Figure 14 aims to bring together the wide variety of hardware and software technologies discussed in this report, presenting a simplified view of the relation of Exascale memory candidate technologies to the existing memory hierarchy and software ecosystem. Figure 15 ranks these technologies in terms of relevance to the EPiGRAM-HS project. The final contribution of this document is a series of observations that are intended to guide the EPiGRAM-HS project and in particular the Memory work package.

Observation 1: A memory-focused approach to performance portability would be highly impactful, and is possible (though challenging) under current conditions.

Exascale computing and the current technological climate have made architectures increasingly complex and highly susceptible to further change in the future.



Figure 14: A visual representation of the hardware memory technologies as they fit into the memory and storage hierarchy, as discussed in Section 2, coupled with the software and hardware interfaces discussed in Section 3.

Figure 15 groups the surveyed technologies into three tiers of relevance: low interest for WP3 (HDD, Gen-Z/CCIX/OpenCAPI, tmpfs, PCIe, object stores, NVMe, HSA); be aware of / consider in design (data brokers, XPMEM, OpenACC, SSD, RDMA, memkind, SYCL, Kokkos, Legion, MPI); and focus on / design for WP3 (DRAM, NVDIMM, OpenCL, ROCm, CUDA, OpenMP, SRAM, PMDK, HMM, HBM).

Figure 15: Relevance of Technologies for EPiGRAM-HS Memory Work Package

As such, future-proofing of applications and software (or performance portability) remains one of the most critical goals. Performance portability across heterogeneous hardware is a challenging technical goal and is usually concerned primarily with accessing the right type of processing device (as well as the associated data copies needed to execute on accelerators, for example). EPiGRAM-HS WP3 should view performance portability through the lens of data movement and efficient usage of complex memories. This document has shown that although upcoming systems are complex, that complexity is finite and manageable, and that software frameworks exist at all levels that can be leveraged to provide such portability. This document does not attempt to define such a solution, but it is clear that this would require building on multiple existing or custom packages across programming frameworks, operating system tools, runtimes, and low-level APIs.

Observation 2: Efficient usage of the upper memory hierarchy consisting of SRAM, DRAM and HBM will be vital.

DRAM remains highly relevant, but pricing and unimpressive DDR evolution mean that HBM will be increasingly important, not only as device memory but also as cache and as main memory. Each of these usage models has a different set of requirements, and most applications will eventually encounter all three models. As such, the relevance of the previous observation applies directly here.

It is rare for software to be optimized for memory access via cache except in some performance-critical contexts like numerical libraries or some important application kernels. Cache hardware generally does a good job, and doing better is very hard. Compiler developers implement optimizations for cache, but there is a limit to what can be done, particularly without full knowledge of loop extents and data sizes/relationships at runtime. An additional complexity comes from the advent of programmable cache. For the programmer / application scientist this is bad news, since the burden of getting good performance out of this memory now rests with them. But there is scope to do significantly better than associative algorithms, which have no awareness of the application at all. Performance portability across SRAM implementations is a key concern of EPiGRAM-HS.

Observation 3: Increased diversity in the memory hierarchy will require new software support and increased levels of abstraction.

On-node and network-attached flash are becoming very important; as discussed in Sections 2.4.4 and 2.4.5, SSDs are likely to completely replace HDDs for Exascale storage by 2025 (or at the very least, hybrid systems will be standard). Where flash is a drop-in replacement for spinning disks, these devices will be accessed primarily via a parallel filesystem such as Lustre. Features within Lustre for caching and tiering are under development and lie outside the scope of EPiGRAM-HS. Where the devices are accessed directly, either locally or with domain-specific libraries (e.g. burst buffers), they are in scope for the project. Furthermore, efficient usage of the latter type will be a high priority for a small number of applications, including potentially some within the project.

The general availability of Intel's Optane DC Persistent Memory and the emergence of other specifications such as NVDIMM-P will create interesting performance/capacity considerations that deviate from the traditional model and present opportunities for experimentation. While the price remains similar to DRAM, we are not likely to see these products take a large role, but in future we may see terabytes of persistent memory per node, presenting opportunities for staging. The programmability of such devices remains an open question, and the project should attempt to clarify what is available beyond PMDK.

Observation 4: Useful technologies exist for memory modelling and abstraction that could support a heterogeneous memory abstraction framework.

Kernel support for heterogeneous memory is maturing, as discussed in Section 3.4. This is important for unified memory views and allows driver-based automatic page migration; however, there are some potential drawbacks to relying on OS and driver-based data movement across heterogeneous memories. Despite these limitations, it will be important for any abstraction layer to exploit such OS services where possible, as it will likely be more efficient to do so compared to higher-abstraction software.

Vendor-specific APIs such as CUDA and ROCm are important for optimal hardware usage; however, ultimately performance portable code should rely on generic heterogeneous libraries and programming frameworks. A memory abstraction layer may have to explicitly implement memory management with vendor APIs, e.g. CUDA code on NVIDIA GPUs. However, it will also be necessary to integrate user code written for vendor-specific APIs, i.e. abstracting vendor-API data types while retaining the ability to interact with them and return them to the user if required.

The memory models of pervasive HPC parallel programming frameworks, such as MPI and OpenMP, are maturing to consider heterogeneity. Any library for memory abstraction or movement will inevitably need to integrate with such frameworks, and this should be strongly considered at the design stage. However, for non-local data, internal communications may make better use of RDMA via libfabric or a PGAS language. Such considerations are not of primary concern; the developed toolset should interact well with standard libraries but use the most efficient option internally.

Of the generic heterogeneous programming frameworks, OpenCL appears to be the most widely implemented framework with memory management routines. OpenCL has some level of support on Intel/Arm CPUs, NVIDIA GPUs, AMD GPUs, Intel/Altera FPGAs (Stratix), and Xilinx FPGAs. There has not been a strong uptake in terms of application development, but OpenCL should be considered as one of the most hardware-agnostic programming frameworks.

Memory abstraction (or data movement abstraction between host and device) has been performed well in the Kokkos library. This is an example of how details can be hidden from the user. However, Kokkos' reliance on advanced C++ features such as lambdas means that the abstraction comes at a price. Furthermore, a large number of HPC applications are written in Fortran or C, and do not interface simply to C++ frameworks. Legion has been used in non-C++ applications to similar effect (though with a different emphasis). Recreating Kokkos-style abstractions in languages such as Fortran will be extremely challenging, and alternative approaches including Legion should be considered. Nonetheless, the success of Kokkos should be an inspiration to the project, which itself should have the lofty goal of matching the applicability of the Kokkos abstractions, but in a more general manner.

In this report we have reviewed the relevant technologies that will be required to allocate, place and move data on and between locations in a heterogeneous Exascale system. These operations are fundamental to our design of an abstract memory device in the EPiGRAM-HS project.

References

[1] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 6th edition, 2017.

[2] B. Sorensen. Exascale update. https://insidehpc.com/2018/09/exascale-update-hyperion-research/, 2018. Online; accessed January 9, 2019.

[3] T. Trader. Requiem for a phi: Knights landing discontinued. https://www. hpcwire.com/2018/07/25/end-of-the-road-for-knights-landing-phi/, 2018. Online; accessed January 9, 2019.

[4] T. Morgan. Argonne hints at future architecture of Aurora exascale system. https://www.nextplatform.com/2018/03/19/ argonne-hints-at-future-architecture-of-aurora-exascale-system/, 2018. Online; accessed January 9, 2019.

[5] T. Morgan. Intels exascale dataflow engine drops x86 and von Neumann. https://www.nextplatform.com/2018/08/30/ intels-exascale-dataflow-engine-drops-x86-and-von-neuman/, 2018. Online; accessed January 9, 2019.

[6] M. Feldman. Sandia to install first petascale supercomputer powered by arm processors. https://www.top500.org/news/sandia-to-install- first-petascale-supercomputer-powered-by-arm-processors/, 2018. Online; accessed January 9, 2019.

[7] E. Kelly. Eu launches e1B project to build worlds fastest supercomputer. https://sciencebusiness.net/news/ eu-launches-eu1b-project-build-worlds-fastest-supercomputer, 2018. Online; accessed January 9, 2019.

[8] M. Valero. European processor initiative & risc-v. https://content.riscv.org/wp-content/uploads/2018/05/11.15-11. 45-EXASCALE-RISC-V-Mateo.Valero-9-5-2018-1.pdf, 2018. Online; accessed January 23, 2019.

[9] T. Morgan. Fujitsus a64fx arm chip waves the hpc banner high. https://www.nextplatform.com/2018/08/24/ fujitsus-a64fx-arm-chip-waves-the-hpc-banner-high/, 2018. Online; accessed January 23, 2019.

[10] S. Knowles. How to build a processor for artificial intelligence (part 2). https://www.graphcore.ai/posts/how-to-build-a-processor-for-machine-intelligence-part-2, 2017. Online; accessed January 9, 2019.

[11] M. Sanders. DDR5 RAM expected in 2019 but the figures are hardly impressive. https://www.eteknix.com/ddr5-ram-expected-2019/, 2017. Online; accessed January 9, 2019.

[12] B. Gervasi and J. Hinkle. Overcoming system memory challenges with persistent memory and nvdimm-p. In JEDEC Server Forum, 2017.

[13] Objective Analysis. Using flash as memory. https://objective-analysis.com/uploads/2013-07-30_Objective_Analysis_ White_Paper_-_Using_Flash_as_Memory.pdf, 2013. Online; accessed Jan 10, 2018.

[14] Objective Analysis. Why wait for storage class memory? https://objective-analysis.com/uploads/2016-08-15_Objective_Anlaysis_ Tech_Brief_for_Netlist.pdf, 2016. Online; accessed Jan 10, 2018.

[15] R. Crooke and M. Durcan. 3D XPoint: A breakthrough in non-volatile memory technology. https://www.intel.co.uk/content/www/uk/en/ architecture-and-technology/intel-micron-3d-xpoint-webcast.html, 2015. Online; accessed January 9, 2019.

[16] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch. System-level implications of disaggregated memory. In IEEE International Symposium on High-Performance Comp Architecture, pages 1–12, Feb 2012.

[17] DataDirect Networks. DDN annual high performance computing trends survey reveals that data storage has become the most strategic part of the HPC data center. https://www.ddn.com/press-releases/ddn-annual-high-performance-computing-trends-survey-reveals-that-data-storage-has-become-the-most-strategic-part-of-the-hpc-data-center/, 2015. Online; accessed January 30, 2019.

[18] Hyperion Research. Flash storage trends and impacts. https://www.weka.io/wp-content/uploads/2018/04/ HPE-Flash-Tech-Spotlight-Hyperion-Research.pdf, 2018. Online; accessed January 29, 2018.

[19] Torben Peterson. Flash acceleration of hpc storage - nx. https://www.cray.com/ resources/flash-acceleration-hpc-storage-nxd-performance-comparison, 2018. Online; accessed Jan 10, 2018.

[20] Mel Gorman. Understanding the Linux Virtual Memory Manager. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004.

[21] The kernel development community. The linux kernel user's and administrator's guide: Memory management. https://www.kernel.org/doc/html/latest/admin-guide/mm/index.html, 2018. Online; accessed December 10, 2018.

[22] National Instruments. Introduction to PCI Express. http://www.ni.com/white-paper/3540/en/, 2016. Online; accessed January 3, 2019.

[23] National Instruments. PCI Express an overview of the pci express standard. http://www.ni.com/white-paper/3767/en/, 2014. Online; accessed January 3, 2019.

[24] PCI-SIG. PCI-SIG website. https://pcisig.com/, 2019. Online; accessed January 3, 2019.

[25] PCI-SIG. PCI-SIG publishes PCI Express 4.0, revision 0.9 specification. http://www.businesswire.com/news/home/20170607005319/en/PCI-SIG%C2% AE-Publishes-PCI-Express%C2%AE-4.0-Revision-0.9, 2017. Online; accessed January 3, 2019.

[26] NVM Express. NVM Express website. https://nvmexpress.org/, 2018. Online; accessed January 4, 2019.

[27] Western Digital. NVM Express website. https: //blog.westerndigital.com/nvme-important-data-driven-businesses/, 2019. Online; accessed January 3, 2019.

[28] Seagate. NVMe performance for the SSD age (technical paper). https: //www.seagate.com/files/www-content/product-content/ssd-fam/nvme-ssd/ nytro-xf1440-ssd/_shared/docs/nvme-performance-tp692-1-1610us.pdf, 2015. Online; accessed January 3, 2019.

[29] Tom Coughlin, Roger Hoyt, and Jim Hardy. Digital storage and memory technology (part 1). https://www.ieee.org/content/dam/ieee-org/ieee-web/ pdf/digital-storage-memory-technology.pdf, 2017. Online; accessed January 4, 2019.

[30] CCIX Consortium. CCIX Consortium website. https://www.ccixconsortium.com/, 2019. Online; accessed January 4, 2019.

[31] GEN-Z Consortium. GEN-Z Consortium website. https://genzconsortium.org/, 2019. Online; accessed January 4, 2019.

[32] OpenCAPI Consortium. OpenCAPI Consortium website. https://opencapi.org/, 2019. Online; accessed January 4, 2019.

[33] Benton, Brad. CCIX, GEN-Z, OpenCAPI: Overview & comparison. https://www.openfabrics.org/images/eventpresos/2017presentations/213_ CCIXGen-Z_BBenton.pdf, 2017. Online; accessed January 7, 2019.

[34] The Storage Network Industry Association. The Storage Network Industry Association website. https://www.snia.org, 2019. Online; accessed January 9, 2019.

[35] Intel Corporation. Persistent memory development kit. http://pmem.io/pmdk/, 2009. Online; accessed December 4, 2018.

[36] Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl. Affinity-on-next-touch: An extension to the linux kernel for numa architectures. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I, PPAM’09, pages 576–585, Berlin, Heidelberg, 2010. Springer-Verlag.

[37] B. Goglin and N. Furmento. Enabling high-performance memory migration for multithreaded applications on linux. In 2009 IEEE International Symposium on Parallel Distributed Processing, pages 1–9, May 2009.

[38] Felix Xiaozhu Lin and Xu Liu. Memif: Towards programming heterogeneous memory asynchronously. SIGARCH Comput. Archit. News, 44(2):369–383, March 2016.

[39] M. Giardino, K. Doshi, and B. Ferri. Soft2LM: Application guided heterogeneous memory management. In 2016 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 1–10, Aug 2016.

[40] Sean Williams, Latchesar Ionkov, and Michael Lang. NUMA distance for heterogeneous memory. In Proceedings of the Workshop on Memory Centric Programming for HPC, MCHPC’17, pages 30–34, New York, NY, USA, 2017. ACM.

[41] NVIDIA Corporation. CUDA toolkit documentation. https://docs.nvidia.com/cuda/index.html, 2018. Online; accessed December 12, 2018.

[42] AMD Corporation. ROCm documentation. https://github.com/RadeonOpenCompute/ROCm_Documentation, 2018. Online; accessed December 12, 2018.

[43] Wen-mei Hwu. Heterogeneous system architecture: A New Compute Platform Infrastructure. Morgan Kaufmann, 2016.

[44] HSA Foundation. HSA platform system architecture specification. http://www.hsafoundation.com/standards/, 2017. Online; accessed Jan 3, 2018.

[45] Khronos Group. The OpenCL 2.2 specification. https://www.khronos.org/registry/OpenCL/, 2018. Online; accessed December 6, 2018.

[46] S. Deldon, J. Beyer, and D Miles. OpenACC and CUDA unified memory. In Proceedings of the Cray User Group, 2018.

[47] Christopher Cantalupo, Vishwanath Venkatesan, Jeff R Hammond, Krzysztof Czurylo, and Simon Hammond. User extensible heap manager for heterogeneous memory platforms and mixed memory policies. Architecture document, 2015.

[48] Message Passing Interface Forum. MPI: a message-passing interface standard, version 3.1. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, 2015. Online; accessed January 4, 2019.

[49] Sean Williams, Latchesar Ionkov, Michael Lang, and Jason Lee. Heterogeneous memory and arena-based heap allocation. MCHPC'18: Workshop on Memory Centric High Performance Computing, 2018.

[50] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 74(12):3202–3216, 2014. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.

[51] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.

[52] Sean Treichler, Michael Bauer, Ankit Bhagatwala, Giulio Borghesi, Ramanan Sankaran, Hemanth Kolla, Patrick S McCormick, Elliott Slaughter, Wonchan Lee, Alex Aiken, et al. S3D-Legion: An exascale software for direct numerical simulation of turbulent combustion with complex multicomponent chemistry. In Exascale Scientific Applications, pages 257–278. Chapman and Hall/CRC, 2017.

[53] Khronos OpenCL Working Group — SYCL subgroup. SYCL specification v 1.2.1. https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf, 2019.

[54] Ronan Keryell, Ruyman Reyes, and Lee Howes. Khronos SYCL for OpenCL. In Proceedings of the 3rd International Workshop on OpenCL - IWOCL 15. ACM Press, 2015.

[55] Mehdi Goli, Luke Iwanski, and Andrew Richards. Accelerated machine learning using TensorFlow and SYCL on OpenCL devices. In Proceedings of the 5th International Workshop on OpenCL, page 8. ACM, 2017.

[56] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn. On the role of burst buffers in leadership-class storage systems. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–11, April 2012.

[57] D. Henseler, B. Landsteiner, D. Petesch, C. Wright, and N. Wright. Architecture and design of Cray DataWarp. In Proceedings of the Cray User Group, 2016.

[58] N. Hjelm and C. Wright. libhio: Optimizing IO on Cray XC systems with DataWarp. In Proceedings of the Cray User Group, 2017.

[59] A. Moody, G. Bronevetsky, K. Mohror, and B. R. d. Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, Nov 2010.

[60] Ziqi Fan, Fenggang Wu, Jim Diehl, David H. C. Du, and Doug Voigt. CDBB: An NVRAM-based burst buffer coordination system for parallel file systems. In Proceedings of the High Performance Computing Symposium, HPC ’18, pages 1:1–1:12, San Diego, CA, USA, 2018. Society for Computer Simulation International.

[61] Michael Factor, Kalman Meth, Dalit Naor, Ohad Rodeh, and Julian Satran. Object storage: The future building block for storage systems. In Local to Global Data Interoperability-Challenges and Technologies, 2005, pages 119–123. IEEE, 2005.

[62] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 307–320. USENIX Association, 2006.

[63] S. W. Chien, Stefano Markidis, Rami Karim, Erwin Laure, and Sai Narasimhamurthy. Exploring Scientific Application Performance Using Large Scale Object Storage. arXiv preprint arXiv:1807.02562, 2018.

[64] Red Hat, Inc, and Contributors. Introduction to librados. http://docs.ceph.com/docs/master/rados/api/librados-intro/, 2018. Online; accessed December 12, 2018.

[65] Ciprian Docan, Manish Parashar, and Scott Klasky. Dataspaces: An interaction and coordination framework for coupled simulation workflows. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC ’10, pages 25–36, New York, NY, USA, 2010. ACM.

[66] Ciprian Docan, Manish Parashar, and Scott Klasky. Enabling high-speed asynchronous data extraction and transfer using dart. Concurr. Comput. : Pract. Exper., 22(9):1181–1204, June 2010.

[67] Fan Zhang. Programming and runtime support for enabling data-intensive coupled scientific simulation workflows. PhD thesis, Rutgers University, New Jersey, USA, 5 2017.

[68] Dan Bonachea and Jaein Jeong. Gasnet: A portable high-performance communication layer for global address-space languages. CS258 Parallel Computer Architecture Project, Spring, 2002.

[69] Jay F. Lofstead, Scott Klasky, Karsten Schwan, Norbert Podhorszki, and Chen Jin. Flexible IO and integration for scientific codes through the adaptable io system (adios). In Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, CLADE ’08, pages 15–24, New York, NY, USA, 2008. ACM.

[70] Jai Dayal, Drew Bratcher, Greg Eisenhauer, Karsten Schwan, Matthew Wolf, Xuechen Zhang, Hasan Abbasi, Scott Klasky, and Norbert Podhorszki. Flexpath: Type-based publish/subscribe system for large-scale science analytics. In 2014 14th

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 246–255. IEEE, 2014.

[71] Jong Y Choi, Kesheng Wu, Jacky C Wu, Alex Sim, Qing G Liu, Matthew Wolf, C Chang, and Scott Klasky. Icee: Wide-area in transit data processing framework for near real-time scientific applications. In 4th SC Workshop on Petascale (Big) Data Analytics: Challenges and Opportunities in conjunction with SC13, 11, 2013.

[72] V. Vishwanath, M. Hereld, and M. E. Papka. Toward simulation-time data analysis and I/O acceleration on leadership-class systems. In 2011 IEEE Symposium on Large Data Analysis and Visualization, pages 9–14, Oct 2011.