Architectures for High Performance Computing and Data Analytics Systems using Byte-Addressable Persistent Memory

Adrian Jackson∗, Mark Parsons†, Michèle Weiland‡
EPCC, The University of Edinburgh, Edinburgh, United Kingdom
∗[email protected], †[email protected], ‡[email protected]

Bernhard Homölle§
SVA System Vertrieb Alexander GmbH, Paderborn, Germany
§[email protected]

Abstract—Non-volatile, byte addressable, memory technology with performance close to main memory promises to revolutionise computing systems in the near future. Such memory technology provides the potential for extremely large memory regions (i.e. > 3TB per server), very high performance I/O, and new ways of storing and sharing data for applications and workflows. This paper outlines an architecture that has been designed to exploit such memory for High Performance Computing and High Performance Data Analytics systems, along with descriptions of how applications could benefit from such hardware.

Index Terms—Non-volatile memory, persistent memory, storage class memory, system architecture, systemware, NVRAM, SCM, B-APM

I. INTRODUCTION

There are a number of new memory technologies that are impacting, or likely to impact, computing architectures in the near future. One example of such a technology is so-called high bandwidth memory, already featured today on Intel's latest many-core processor, the Xeon Phi Knights Landing [1], and NVIDIA's latest GPU, Volta [2]. These contain MCDRAM [1] and HBM2 [3] respectively, memory technologies built with traditional DRAM hardware but connected with a very wide memory bus (or series of buses) directly to the processor to provide very high memory bandwidth when compared to traditional main memory (DDR channels).

This has been enabled, in part, by the hardware trend of incorporating memory controllers and memory controller hubs directly onto processors, enabling memory to be attached to the processor itself rather than through the motherboard and associated chipset. However, the underlying memory hardware is the same, or at least very similar, to the traditional volatile DRAM memory that is still used as main memory for computer architectures, and that remains attached to the motherboard rather than the processor.

Non-volatile memory, i.e. memory that retains data even after power is turned off, has been exploited by consumer electronics and computer systems for many years. The flash memory cards used in cameras and mobile phones are an example of such hardware, used for data storage. More recently, flash memory has been used for high performance I/O in the form of Solid State Disk (SSD) drives, providing higher bandwidth and lower latency than traditional Hard Disk Drives (HDD).

Whilst flash memory can provide fast input/output (I/O) performance for computer systems, there are some drawbacks. It has limited endurance when compared to HDD technology, restricted by the number of modifications a flash cell can undertake and thus the effective lifetime of the flash storage [29]. It is often also more expensive than other storage technologies. However, SSD storage, and enterprise level SSD drives, are heavily used for I/O intensive functionality in large scale computer systems because of their random read and write performance capabilities.

Byte-addressable random access persistent memory (B-APM), also known as storage class memory (SCM), NVRAM or NVDIMMs, exploits a new generation of non-volatile memory hardware that is directly accessible via CPU load/store operations, has much higher durability than standard flash memory, and much higher read and write performance.

High-performance computing (HPC) and high-performance data analytics (HPDA) systems currently have different hardware and configuration requirements. HPC systems generally require very high floating-point performance, high memory bandwidth, and high-performance networks, with a high-performance filesystem for production I/O and possibly a larger filesystem for long term data storage. However, HPC applications do not generally have large memory requirements (although there are some exceptions to this) [4].
HPDA systems on the other hand often do have high memory capacity requirements, and also require a high-performance filesystem to enable very large amounts of data to be read, processed, and written.

B-APM, with its very high performance I/O characteristics and vastly increased capacity (compared to volatile memory), offers a potential hardware solution to enable the construction of a compute platform that can support both types of use case, with high performance processors, very large amounts of B-APM in compute nodes, and a high-performance network, providing a scalable compute, memory, and I/O system.

In this paper, we outline the systemware and hardware required to provide such a system. We start by describing persistent memory, and the functionality it provides, in more detail in section II. In section III we discuss how B-APM could be exploited for scientific computation or data analytics. Following this we outline our proposed hardware and systemware architectures in sections IV and V, and describe how applications could benefit from such a system in section VI. We finish by discussing related work in section VII, summarising the paper in the final section.

II. PERSISTENT MEMORY

B-APM takes new non-volatile memory technology and packages it in the same form factor (i.e. using the same connector and dimensions) as main memory (SDRAM DIMM form factor). This allows B-APM to be installed and used alongside DRAM based main memory, accessed through the same memory controller.

As B-APM is installed in a processor's memory channels, applications running on the system can access B-APM directly as if it was main memory, including true random data access at byte or cache line granularity. Such an access mechanism is very different to the traditional block based approaches used for current HDD or SSD devices, which generally require I/O to be done using blocks of data (i.e. 4KB of data written or read in one operation), and rely on expensive kernel interrupts.
The B-APM technology that will be the first to market is Intel and Micron's 3D XPoint™ memory [5]. The performance of this byte-addressable B-APM is projected to be lower than that of main memory (with a latency ∼5-10x that of DDR4 memory when connected to the same memory channels), but much faster than SSDs or HDDs. It is also projected to be of much larger capacity than DRAM, around 2-5x denser (i.e. 2-5x more capacity in the same form factor).

As B-APM is persistent, it also has the potential to be used for resiliency, providing backup for data from active applications, or providing long term storage for databases or data stores required by a range of applications. With support from the systemware, servers can be enabled to handle power loss without experiencing data loss, efficiently and transparently recovering from power failure and resuming applications from their latest running state, and maintaining data with little overhead in terms of performance.

A. Data Access

This new class of memory offers very large memory capacity for servers, as well as long term very high performance persistent storage within the memory space of the servers, and the ability to undertake I/O (reading and writing storage data) in a new way. Direct access (DAX) from applications to individual bytes of data in the B-APM is very different from the block-oriented way I/O is currently implemented. B-APM has the potential to enable synchronous, byte level I/O, moving away from the asynchronous block-based file I/O applications currently rely on. In current asynchronous I/O, user applications pass data to the operating system (OS), which then uses driver software to issue an I/O command, putting the I/O request into a queue on a hardware controller. The hardware controller will process that command when ready, notifying the OS that the I/O operation has finished through an interrupt to the device driver.

B-APM can be accessed simply by using a load or store instruction, as with any other memory operation from an application or program. If the application requires persistence, it must flush the data from the volatile CPU caches and ensure that the same data has also arrived on the non-volatile medium. There are cache flush commands and memory fence instructions for this purpose. To keep the performance of persistent writes high, new cache flush operations have been introduced. Additionally, write buffers in the memory controller may be protected by hardware through modified power supplies (such as those supporting asynchronous DRAM refresh [6]).

With B-APM providing much lower latencies than external storage devices, the traditional I/O access model, using interrupts, becomes inefficient because of the overhead of context switches between user and kernel mode (which can take thousands of CPU cycles [30]). Furthermore, with B-APM it becomes possible to implement remote persistent access to data stored in the memory using RDMA technology over a suitable interconnect. Using high performance networks can enable access to data stored in B-APM in remote nodes faster than accessing local high performance SSDs via traditional I/O interfaces and stacks inside a node.

Therefore, it is possible to use B-APM to greatly improve I/O performance within a server, increase the memory capacity of a server, or provide a remote data store with high performance access for a group of servers to share. Such storage hardware can also be scaled up by adding more B-APM memory in a server, or adding more nodes to the remote data store, allowing the I/O performance of a system to scale as required. However, if B-APM is provisioned in the servers, there must be software support for managing data within the B-APM. This includes moving data as required for the jobs running on the system, and providing the functionality to let applications run on any server and still utilise the B-APM for fast I/O and storage (i.e. applications should be able to access B-APM in remote nodes if the system is configured with B-APM only in a subset of all nodes).
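As a concrete illustration of this load/store persistence model, the sketch below uses the libpmem component of PMDK (see the software ecosystem discussion later in this section). It is a minimal sketch, not a prescribed implementation: the path /mnt/pmem/example is an assumed DAX-mounted location, and pmem_persist() performs the user-space cache flush and fence described above.

Listing 1. Persisting a store to B-APM with PMDK's libpmem (illustrative sketch)

    /* Minimal sketch of persistent load/store access with PMDK's libpmem.
     * Assumes a B-APM-backed, DAX-mounted filesystem at /mnt/pmem
     * (a hypothetical mount point). Build: cc persist.c -lpmem */
    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Map a 4 KiB persistent region into our address space,
         * creating the backing file if necessary. */
        char *buf = pmem_map_file("/mnt/pmem/example", 4096,
                                  PMEM_FILE_CREATE, 0666,
                                  &mapped_len, &is_pmem);
        if (buf == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        /* An ordinary store instruction writes the data... */
        strcpy(buf, "hello, persistent world");

        /* ...but it is only durable once flushed from the volatile CPU
         * caches. pmem_persist() issues the optimised cache flush and
         * fence in user space; pmem_msync() is the fallback when the
         * mapping is not real persistent memory. */
        if (is_pmem)
            pmem_persist(buf, mapped_len);
        else
            pmem_msync(buf, mapped_len);

        pmem_unmap(buf, mapped_len);
        return 0;
    }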

B. B-APM modes of operation

Ongoing developments in memory hierarchies, such as the high bandwidth memory in Xeon Phi manycore processors or NVIDIA GPUs, have provided new memory models for programmers and system designers/implementers. A common model that has been proposed includes the ability to configure main memory and B-APM in two different modes: Single-level and Dual-level memory [8].

Single-level memory, or SLM, has main memory (DRAM) and B-APM as two separate memory spaces, both accessible by applications, as outlined in Figure 1. This is very similar to the Flat Mode [7] configuration of the high bandwidth, on-package, MCDRAM in the current Intel Knights Landing processor. The DRAM is allocated and managed via standard memory APIs such as malloc and represents the OS visible main memory size. The B-APM will be managed by programming APIs and presents the non-volatile part of the system memory. Both will allow direct CPU load/store operations. In order to take advantage of B-APM in SLM mode, systemware or applications have to be adapted to use these two distinct address spaces.

Fig. 1. Single-level memory (SLM) configuration using main memory and B-APM
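One possible way for an application to manage the two distinct SLM address spaces, sketched below purely as an illustration (the architecture described here does not mandate it), is a partitioned heap allocator such as the memkind library; /mnt/pmem is again an assumed DAX mount point.

Listing 2. Managing the two SLM address spaces with the memkind allocator (illustrative sketch)

    /* Sketch: ordinary malloc() serves the DRAM space, while a memkind
     * "kind" backed by a file on B-APM serves the second address space.
     * Build: cc slm.c -lmemkind */
    #include <memkind.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct memkind *pmem_kind;

        /* Create an allocation kind backed by B-APM (1 GiB here). */
        if (memkind_create_pmem("/mnt/pmem", 1ULL << 30, &pmem_kind) != 0) {
            fprintf(stderr, "memkind_create_pmem failed\n");
            return EXIT_FAILURE;
        }

        double *hot  = malloc(1024 * sizeof *hot);    /* DRAM space */
        double *cold = memkind_malloc(pmem_kind,
                                      1024 * sizeof *cold); /* B-APM space */

        /* ... compute with both arrays via ordinary loads and stores ... */

        memkind_free(pmem_kind, cold);
        free(hot);
        memkind_destroy_kind(pmem_kind);
        return EXIT_SUCCESS;
    }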
Dual-level memory, or DLM, configures DRAM as a cache in front of the B-APM, as shown in Figure 2. Only the memory space of the B-APM is available to applications; data being used is stored in DRAM, and moved to B-APM when no longer immediately required by the memory controller (as in standard CPU caches). This is very similar to the Cache Mode [7] configuration of MCDRAM on KNL processors.

Fig. 2. Dual-level memory (DLM) configuration using main memory and B-APM

This mode of operation does not require applications to be altered to exploit the capacity of B-APM, and aims to give memory access performance at main memory speeds whilst providing access to the large memory space of B-APM. However, exactly how well the main memory cache performs will depend on the specific memory requirements and access pattern of a given application. Furthermore, persistence of the B-APM contents can no longer be guaranteed, due to the volatile DRAM cache in front of the B-APM, so the non-volatile characteristics of B-APM are not exploited.

C. Non-volatile memory software ecosystem

The Storage Networking Industry Association (SNIA) have produced a software architecture for B-APM with persistent load/store access, formalised in the Linux Persistent Memory Development Kit (PMDK) [9] library. This approach re-uses the naming scheme of files as traditional persistent entities and maps the B-APM regions into the address space of a process (similar to memory mapped files in Linux). Once the mapping has been done, the file descriptor is no longer needed and can be closed. Figure 3 outlines the PMDK software architecture.

Fig. 3. PMDK software architecture
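Since this model builds on standard memory mapped files, it can be illustrated without any PMDK dependency at all; the following sketch shows the open/map/close pattern just described, again assuming a hypothetical DAX-capable mount at /mnt/pmem.

Listing 3. The SNIA mapping model using a plain memory mapped file (illustrative sketch)

    /* Open a file backed by B-APM, map it into the address space, then
     * close the descriptor; the mapping stays valid until munmap(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/mnt/pmem/region", O_CREAT | O_RDWR, 0666);
        if (fd < 0 || ftruncate(fd, len) != 0) {
            perror("open/ftruncate");
            return EXIT_FAILURE;
        }

        char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        close(fd);   /* descriptor no longer needed, as described above */

        region[0] = 42;  /* loads and stores now go straight to the region */
        msync(region, len, MS_SYNC); /* conservative flush to the medium */
        munmap(region, len);
        return EXIT_SUCCESS;
    }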
III. OPPORTUNITIES FOR EXPLOITING B-APM FOR COMPUTATIONAL SIMULATIONS AND DATA ANALYTICS

Reading data from and writing it to persistent storage is usually not the most time consuming part of computational simulation applications. Analysis of common applications from a range of different scientific areas shows that around 5-20% of runtime for applications is involved in I/O operations [10] [11]. It is evident that B-APM can be used to improve I/O performance for applications by replacing slower SSDs or HDDs in external filesystems. However, such a use of B-APM would be only an incremental improvement in I/O performance, and would neglect some of the significant features of B-APM that can provide performance benefits for applications.

Firstly, deploying B-APM as an external filesystem would require provisioning a filesystem on top of the B-APM hardware. Standard storage devices require a filesystem to enable data to be easily written to or read from the hardware. However, B-APM does not require such functionality, and data can be manipulated directly on B-APM hardware simply through load/store instructions. Adding the filesystem and associated interface guarantees (i.e. POSIX interface [12]) adds performance overheads that will reduce I/O performance on B-APM.

Secondly, an external B-APM based filesystem would require all I/O operations to be performed over a network connection, as current filesystems are external to compute nodes (see Figure 4). This would limit the maximum performance of I/O to that of the network between compute nodes and the nodes the B-APM is hosted in.

Fig. 4. Current external storage for HPC and HPDA systems

Our vision [25] for exploiting B-APM for HPC and HPDA systems is to incorporate the B-APM into the compute nodes, as outlined in Figure 5. This architecture allows applications to exploit the full performance of B-APM within the compute nodes they are using, by giving them the ability to access B-APM through load/store at byte-level granularity, as opposed to block based, asynchronous I/O in traditional storage devices. Incorporating B-APM into compute nodes also has the benefit that I/O capacity and bandwidth can scale with the number of compute nodes in the system. Adding more compute nodes will increase the amount of B-APM in the system and add more aggregate bandwidth for I/O/B-APM operations.

Fig. 5. Internal storage using B-APM in compute nodes for HPC and HPDA systems

For example, current memory bandwidth of a HPC system scales with the number of nodes used. If we assume an achievable memory bandwidth per node of 100GB/s, then it follows that a system with 10 nodes has the potential to provide 1TB/s of memory bandwidth for a distributed application, and a system with 10000 nodes can provide 1PB/s of memory bandwidth. If an application is memory bandwidth bound and can parallelise across nodes then scaling up nodes like this clearly has the potential to improve performance. For B-APM in nodes, and taking 3D XPoint™ as an example, if we assume 20GB/s of memory bandwidth per node (5x less than the volatile memory bandwidth), then scaling up to 10 nodes provides 200GB/s of (I/O) memory bandwidth and 10000 nodes provides 200TB/s of (I/O) memory bandwidth. For comparison, the Titan system at ORNL has a Lustre file system with 1.4TB/s of bandwidth [26] and they are aiming for 50TB/s of burst buffer [28] I/O by 2024 [27].
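Written out, this scaling argument is simple linear aggregation, a worked restatement of the figures already quoted above, with B_node the per-node bandwidth and N the number of nodes:

    B_agg = N × B_node
    DRAM:  10,000 nodes × 100 GB/s = 1 PB/s
    B-APM: 10,000 nodes × 20 GB/s  = 200 TB/s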

Furthermore, there is the potential to optimise not only the performance of a single application, but rather the performance of a whole scientific workflow, from data preparation through simulations, to data analysis and visualisation. Optimising full workflows by sharing data between different stages or steps in the workflow has the potential to completely remove, or greatly reduce, data movement/storage costs for large parts of the workflow altogether. Leaving data in-situ on B-APM for other parts of the workflow can significantly improve the performance of analysis and visualisation steps at the same time as reducing I/O costs for the application when writing the data out.

The total runtime of an application can be seen as the sum of its compute time plus the time spent in I/O. Greatly reduced I/O costs therefore also have the beneficial side effect of allowing applications to perform more I/O within the same total cost of the overall simulation. This enables applications to maintain I/O costs in line with current behaviour whilst being able to process significantly more data. Furthermore, for those applications for which I/O does take up a large portion of the run time, including data analytics applications, B-APM has the potential to significantly reduce runtime.

A. Potential caveats

However, utilising internal storage is not without drawbacks. Firstly, the benefit of external storage is that there is a single namespace and location for compute nodes to use for data storage and retrieval. This means that applications running on the compute nodes can access data trivially as it is stored externally to the compute nodes. With internal storage, this guarantee is not provided. Data written to B-APM is local to specific compute nodes. It is therefore necessary for applications to be able to manage and move data between compute nodes, as well as to external data storage, or for some systemware components to undertake this task.

Secondly, B-APM may be expensive to provision in all compute nodes. It may not be practical to add the same amount of B-APM to all compute nodes, and systems may be constructed with islands of nodes with B-APM, and islands of nodes without B-APM. Therefore, application or systemware functionality to enable access to remote B-APM and to exploit/manage asymmetric B-APM configurations will be required. Both these issues highlight the requirement for an integrated hardware and software (systemware) architecture to enable efficient and easy use of this new memory technology in large scale computational platforms.

IV. HARDWARE ARCHITECTURE

As 3D XPoint™ memory, and other B-APM when it becomes available, is designed to fit into standard memory form factors and be utilised using the same memory controllers that main memory exploits, the hardware aspect of incorporating B-APM into a compute server or system is not onerous. Standard HPC and HPDA systems comprise a number of compute nodes, connected together with a high performance network, along with login nodes and an external filesystem. Inside a compute node there are generally 2 or more multicore processors, connected to a shared motherboard, with associated volatile main memory provided for each processor. One or more network connections are also required in each node, generally connected to the PCIe bus on the motherboard.

To construct a compute cluster that incorporates B-APM, all that is required is a processor and associated memory controller that support such memory. Customised memory controllers are required to intelligently deal with the variation in performance between B-APM and traditional main memory (i.e. DDR). For instance, as B-APM has a higher access latency than DDR memory, it would impact performance if B-APM accesses were blocking, i.e. if the memory controller could not progress DDR accesses whilst a B-APM access was outstanding. However, other than modifying the memory controller to support such variable access latencies, it should be possible to support B-APM in a standard hardware platform, provided that sufficient capacity for memory is provided.

Integrating new memory technology in existing memory channels does mean that providing sufficient locations for both main memory and B-APM to be added is important. Depending on the size of B-APM and main memory technology available, sufficient memory slots must be provided per processor to allow a reasonable amount of both memory types to be added to a node. Therefore, we are designing our system around a standard compute node architecture with sufficient memory slot provision to support large amounts of main memory and B-APM, as shown in Figure 6.

Fig. 6. Compute node hardware architecture

Given both DRAM and B-APM are connected through the same memory controller, and memory controllers have a number of memory channels, it is also important to consider the balance of DRAM and B-APM attached to a processor. If we assume a processor has 6 memory channels, to get full DRAM bandwidth we require at least one DRAM DIMM per memory channel. Likewise, if we want full B-APM bandwidth we need a B-APM DIMM per memory channel. Assuming that a memory channel can support two DIMM slots, this leads us to a configuration with 6 DRAM and 6 B-APM DIMMs per processor, and double that with two processors per node. This configuration is also desirable to enable the DLM configuration, as DLM requires DRAM available to act as a cache for B-APM, meaning at least a DRAM DIMM is required per memory controller.

Pairing DRAM and B-APM DIMMs on memory channels is not required for all systems, and it should be possible to have some memory channels with no B-APM installed, or some memory channels with no DRAM DIMMs installed. However, if DLM mode is required on a system, it is sensible to expect that at least one DRAM DIMM must be installed per memory controller in addition to B-APM. Future system design may consider providing more than two DIMM slots per memory channel to facilitate systems with different memory configurations (i.e. more B-APM than DRAM DIMMs, or memory controllers enabling full B-APM population of memory channels).

The proposed memory configuration allows us to investigate possible system configurations using B-APM memory. Table I outlines different systems, assuming 3TB of B-APM per node, with a node capable of 2TFlop/s compute performance. These projections are achievable with existing processor technology, and demonstrate that very significant I/O bandwidth can be provided to match the compute performance achieved when scaling to very large numbers of nodes.

TABLE I
B-APM ENABLED SYSTEM CONFIGURATIONS

Number of nodes   Compute (PFlop/s)   B-APM Capacity (PB)   B-APM Storage I/O Bandwidth (TB/s)
1                 0.002               0.003                 0.02
768               1.5                 2.3                   15
3072              6                   9                     61
24576             49                  73                    491
196608            393                 589                   3932
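Each row of Table I is a linear projection from the per-node figures (2 TFlop/s of compute, 3 TB of B-APM, and, following the section III assumption, roughly 20 GB/s of B-APM bandwidth per node); for the 768-node row, for example:

    768 × 2 TFlop/s ≈ 1.5 PFlop/s
    768 × 3 TB      ≈ 2.3 PB
    768 × 20 GB/s   ≈ 15 TB/s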
Another aspect, which we are not focusing on in the hardware architecture, is security. As B-APM enables data to be retained inside compute nodes, ensuring the security of that data, and ensuring that it cannot be accessed by users or applications that are not authorised to access the data, is important. The reason that we are not focusing on this in the hardware architecture is because this requirement can be addressed in software, but it may also be sensible to integrate encryption directly in the memory hardware, memory controller, or processor managing the B-APM.

V. SYSTEMWARE ARCHITECTURE

Systemware implements the software functionality necessary to enable users to easily and efficiently utilise the system. We have designed a systemware architecture that provides a number of different types of functionality, related to different methods for exploiting B-APM for large scale computational simulation or data analytics.

From the hardware features B-APM provides, our analysis of current HPC and HPDA applications and the functionality they utilise, and our investigation of future functionality that may benefit such applications, we have identified a number of different kinds of functionality that the systemware architecture should support:

1) Enable users to be able to request systemware components to load/store data in B-APM prior to a job starting, or after a job has completed. This can be thought of as similar to current burst buffer technology. This will allow users to be able to exploit B-APM without changing their applications.
2) Enable users to directly exploit B-APM by modifying their applications to implement direct memory access and management. This offers users the ability to access the best performance B-APM can provide, but requires application developers to undertake the task of programming for B-APM themselves, and ensure they are using it in an efficient manner.
3) Provide a filesystem built on the B-APM in compute nodes. This allows users to exploit B-APM for I/O operations without having to fundamentally change how I/O is implemented in their applications. However, it does not enable the benefit of moving away from file based I/O that B-APM can provide.
4) Provide an object, or key value, store that exploits the B-APM to enable users to explore different mechanisms for storing and accessing data from their applications.
5) Enable the sharing of data between applications through B-APM. For example, this may be sharing data between different components of the same computational workflow, or be the sharing of a common dataset between a group of users.
6) Ensure data access is restricted to those authorised to access that data, and enable deletion or encryption of data to make sure those access restrictions are maintained.
7) Provision of profiling and debugging tools to allow application developers to understand the performance characteristics of B-APM, investigate how their applications are utilising it, and identify any bugs that occur during development.
8) Efficient check-pointing for applications, if requested by users.
9) Provide different memory modes if they are supported by the B-APM hardware.
10) Enable or disable systemware components as required for a given user application, to reduce the performance impact of the systemware on a running application if that application is not using those systemware components.

The systemware architecture we have defined is outlined in Figure 7. Whilst this architecture may appear to have a large number of components and significant complexity, the number of systemware components that are specific to a system that contains B-APM is relatively small. The new or modified components we have identified as required to support B-APM in large scale, multi-user, multi-application compute platforms are as follows:

• Job Scheduler
• Data Scheduler
• Object Store
• Filesystem
• Programming Environment

We will describe these in more detail in the following subsections.

Fig. 7. Systemware architecture to exploit B-APM hardware in compute nodes

A. Job scheduler

As the innovation in our proposed system is the inclusion of B-APM within nodes, one of the key components that must support the new hardware resource is the job scheduler.
Job schedulers, or batch systems, are used to manage, schedule, and run user jobs on the servers that are the compute nodes. Standard job schedulers are configured with the number of nodes in a system, the number of cores per node, and possibly the amount of memory, or whether there are accelerators (like GPUs) in compute nodes in a system. They then use this information, along with a scheduling algorithm and scheduling policies, to allocate user job requests to a set of compute nodes. Users submit job requests specifying the compute resources required (i.e. the number of nodes or number of compute cores a job will require) along with a maximum runtime for the job. This information is used by the job scheduler to accurately, efficiently, and fairly assign applications to resources.

Adding B-APM to compute nodes provides another layer of hardware resource that needs to be monitored and scheduled against by the job scheduler. As data can persist in B-APM, and one of our target use cases is the sharing of data between applications using B-APM, the job scheduler needs to be extended to both be aware of this new hardware resource, and to allow data to be retained in B-APM after an individual job has finished. This functionality is achieved through adding workflow awareness to the job scheduler, providing functionality to allow data to be retained and shared through jobs participating in the workflow, although not indefinitely [24]. The job scheduler also needs to be able to clean up the B-APM after a job has finished, ensuring no data is left behind or B-APM resources consumed, unless specifically requested as part of a workflow.

Furthermore, as the memory system is likely to have different modes of operation, the job scheduler will need to be able to query the current configuration of the memory hardware, and be able to change configuration modes if required by the next job that will be using a particular set of compute nodes. We are also investigating new scheduling algorithms, specifically data aware and energy aware scheduling algorithms, to optimise system efficiency or throughput using B-APM functionality. These will utilise the job scheduler's awareness of B-APM functionality and compute job data requirements.

B. Data scheduler

The data scheduler is an entirely new component, designed to run on each compute node and provide data movement and shepherding functionality. Much of the new functionality we are implementing to exploit B-APM for users involves moving data to and from B-APM asynchronously (i.e. pre-loading data before a job starts, or moving data from B-APM after a job finishes). Furthermore, we also require the ability to move data between different nodes (i.e. in the case that a job runs on a node without B-APM and requires B-APM functionality, or a job runs and needs to access data left on B-APM in a different node by another job).

To provide such support without requiring users to modify their applications, we implement such functionality in the data scheduler component. This component has interfaces for applications to interact with, and is also interfaced with the job scheduler component on each compute node. Through these interfaces the data scheduler can be instructed to move data as required by a given application or workflow; a sketch of what such an interface might look like is given below.
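The paper does not define the data scheduler's API, so the following header is offered purely as an illustration of the kind of interface just described; every identifier in it is hypothetical, invented for this sketch.

Listing 4. Hypothetical data scheduler interface (all names are illustrative, not a real API)

    /* Hypothetical interface for staging data in and out of node-local
     * B-APM asynchronously. Nothing here is part of any existing library. */

    typedef int ds_request_t;  /* handle for an asynchronous staging request */

    /* Ask the node-local data scheduler to copy a file from the external
     * filesystem into this node's B-APM before the job needs it. */
    ds_request_t ds_stage_in(const char *external_path, const char *bapm_path);

    /* Ask for data to be drained from B-APM back to external storage
     * after the job has finished with it. */
    ds_request_t ds_stage_out(const char *bapm_path, const char *external_path);

    /* Request data left on another node's B-APM (workflow data sharing). */
    ds_request_t ds_fetch_remote(const char *node, const char *bapm_path,
                                 const char *local_bapm_path);

    /* Block until a previously issued staging request has completed. */
    int ds_wait(ds_request_t request);

Under this sketch, a scheduler-triggered pre-load such as the one in the use case of section VI would amount to a single ds_stage_in() call issued before the application launches, with ds_stage_out() issued after it completes.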
C. Object store

We recognise the performance and functionality benefits that exploiting new storage technologies can bring to applications. We are therefore investigating the use of object stores, such as DAOS [21] and dataClay [22], and are porting them to the hardware architecture we are proposing, i.e. systems with distributed B-APM as the main storage hardware.

D. Filesystems

As previously discussed, there is a very large pre-existing code base currently exploiting HPC and HPDA systems. The majority of these will undertake I/O using files through an external filesystem. Therefore, the easiest mechanism for supporting such applications, in the first instance, is to provide filesystems hosted on the B-APM hardware. Our architecture provides the functionality for two different types of filesystems using B-APM:

• Local, on-node
• Distributed, cross-node

The local filesystem will provide applications with a space for reading or writing data from/to a filesystem that is separate for each compute node, i.e. a scratch or /tmp filesystem on each node. This will enable very high performance file I/O, but require applications (or the data scheduler) to manage these files. It will also provide a storage space for files to be loaded prior to a job starting (i.e. similar to burst buffer functionality), or for files that should be moved to an external filesystem when a job has finished. The distributed filesystem will provide functionality similar to current parallel filesystems (e.g. Lustre), except it will be hosted directly on the B-APM hardware and will not require external I/O servers or nodes.

E. Programming environment

Finally, the programming environment, i.e. the libraries, compilers, programming languages, and communication libraries used by applications, needs to be modified to support B-APM. An obvious example of such requirements is ensuring that common I/O libraries support using B-APM for storage. For instance, many computational simulation applications use MPI-I/O, HDF5, or NetCDF to undertake I/O operations. Ensuring these libraries can utilise B-APM in some way to undertake high performance I/O will ensure a wide range of existing applications can exploit the system effectively.

Further modifications, or new functionality, may also be beneficial. For instance, we will deploy a task based programming environment, PyCOMPSs [23], which can interact directly with the data scheduler. This will enable us to evaluate whether new parallel programming approaches will enable exploitation of B-APM and the functionality it provides more easily than adapting existing applications to functionality such as object stores or byte-based data accesses to B-APM.

VI. USING B-APM

To allow a fuller understanding of how a system developed from the architectures we have designed could be used, we discuss some of the possible usage scenarios in the following text. We outline the systemware and hardware components used by a given use case, and the lifecycle of the data in those components.

A. Filesystems on B-APM

For this use case we assume an application undertaking standard file based I/O operations, using either parallel or serial I/O functionality. We assume the distributed, cross-node B-APM filesystem is used for I/O whilst the application is running, and the external high performance filesystem is used for data storage after an application has finished. The job scheduling request includes a request for files to be pre-loaded on to the distributed B-APM filesystem prior to the job starting. In this use case we expect the following systemware interaction to occur (also outlined in Figure 8):

1) User job scheduling request is submitted.
   a. At this point the application to be run is either stored on the external high performance filesystem or on the local filesystem on the login nodes, and the data for the application is stored on the external high performance filesystem.
2) Job scheduler allocates resources for an application.
   a. The job scheduler ensures that nodes are in SLM mode and that the multi-node B-APM filesystem component is operational on the nodes being used by this application.
3) Once the nodes that the job has been allocated to are available, the data scheduler (triggered by the job scheduler) copies input data from the external high performance filesystem to the multi-node B-APM filesystem.
   a. This step is optional; an application could read input data directly from the external high performance filesystem, but using the multi-node B-APM filesystem will deliver better performance.
4) Job Launcher starts the user application on the allocated compute nodes.
5) Application reads data from the multi-node B-APM filesystem.
6) Application writes data to the multi-node B-APM filesystem.
7) Application finishes.
8) Data Scheduler is triggered and moves data from the multi-node B-APM filesystem to the external high performance filesystem.

Fig. 8. Sequence diagram for systemware component use by an application requesting data be loaded into the distributed B-APM-based filesystem prior to starting
VII. RELATED WORK

There are existing technological solutions that offer similar functionality to B-APM and that can also be exploited for high performance I/O. One example is NVMe devices: SSDs that are attached to the PCIe bus and support the NVM Express interface. Indeed, Intel already has a line of NVMe devices on the market that use 3D XPoint™ memory technology, called Intel Optane. Other vendors have a large range of NVMe devices on the market, most of them based on different variations of flash technology.

NVMe devices have the potential to provide byte-level storage access, using the PMDK libraries. A file can be opened and presented as a memory space for an application, and then can be used directly as memory by that application, removing the overhead of file access (i.e. data access through file reads and writes) when performing I/O and enabling the development of applications that exploit B-APM functionality. However, given that NVMe devices are connected via the PCIe bus, and have a disk controller on the device through which access is managed, NVMe devices do not provide the same level of performance that B-APM offers. Indeed, as these devices still use block-based data access, fine grained memory operations can require whole blocks of data to be loaded or stored to the device, rather than individual bytes.

There are a wide range of parallel and high performance filesystems designed to enable high performance I/O from large scale compute clusters [13] [14] [15]. However, these provide POSIX compliant block based I/O interfaces, which do not offer byte level data access, requiring conversion of data from program data structures to a flat file format. Furthermore, whilst it is advantageous that such filesystems are external resources, and therefore can be accessed from any compute node in a cluster, this means that filesystem performance does not necessarily scale with compute nodes. Such filesystems are specified and provisioned separately from the compute resource in a HPC or HPDA system. Work has been done to optimise the I/O performance of such high performance filesystems [16] [17] [18] [19], but they do not address B-APM or new mechanisms for storing or accessing data without the overhead of a POSIX-compliant (or weakly-compliant) filesystem.

Another technology that is being widely investigated for improving performance and changing I/O functionality for applications is some form of object, or key value, store [20] [21] [22]. These provide alternatives to file-based data storage, enabling data to be stored in similar formats or structures as those used in the application itself. Object stores can start to approach byte level access granularity; however, they require applications to be significantly re-engineered to exploit such functionality.

We are proposing hardware and systemware architectures in this work that will integrate B-APM into large scale compute clusters, providing significant I/O performance benefits and introducing new I/O and data storage/manipulation features to applications. Our key goal is to create systems that can both exploit the performance of the hardware and support applications whilst they port to these new I/O or data storage paradigms. Indeed, we recognise that there is a very large body of existing applications and data analysis workflows that cannot immediately be ported to new storage hardware (for time and resource constraint reasons). Therefore, our aims in this work are to provide a system that enables applications to obtain best performance if porting work is undertaken to exploit B-APM hardware features, but still allow applications to exploit B-APM and significantly improve performance without major software changes.
VIII. SUMMARY

This paper outlines a hardware and systemware architecture designed to enable the exploitation of B-APM hardware directly by applications, or indirectly by applications using systemware functionality that can exploit B-APM for applications. This dual nature of the system provides support for existing applications to exploit this emerging new memory hardware whilst enabling developers to modify applications to best exploit the hardware over time.

The system outlined provides a range of different functionality. Not all functionality will be utilised by all applications, but providing a wide range of functionality, from filesystems to object stores to data schedulers, will enable the widest possible use of such systems. We are aiming for hardware and systemware that enables HPC and HPDA applications to co-exist on the same platform.

Whilst the hardware is novel and interesting in its own right, we predict that the biggest benefit in such technology will be realised through changes in application structure and data storage approaches, facilitated by the byte-addressable persistent memory that will become routinely available in computing systems.

In time it could be possible to completely remove the external filesystem from HPC and HPDA systems, removing hardware complexity and the energy/cost associated with such functionality. There is also the potential for volatile memory to disappear from the memory stack everywhere except on the processor itself, removing further energy costs from compute nodes. However, further work is required to evaluate the impact of the costs of the active systemware environment we have outlined in this paper, and the memory usage patterns of applications.
Moving data asynchronously to support applications can potentially bring big performance benefits, but the impact such functionality has on applications running on those compute nodes needs to be investigated. This is especially important as, with distributed filesystems or object stores hosted on node-distributed B-APM, such in-node asynchronous data movements will be ubiquitous, even with intelligent scheduling algorithms.

ACKNOWLEDGEMENTS

The NEXTGenIO project (www.nextgenio.eu) and the work presented in this paper were funded by the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951. All the NEXTGenIO Consortium members (EPCC, Allinea, Arm, ECMWF, Barcelona Supercomputing Centre, Fujitsu Technology Solutions, Intel Deutschland, Arctur and Technische Universität Dresden) contributed to the design of the architectures.
REFERENCES

[1] A. Sodani, "Knights Landing (KNL): 2nd Generation Intel Xeon Phi Processor," in Hot Chips 27 Symposium (HCS), IEEE, 2015, pp. 1-24.
[2] NVIDIA Volta: https://www.nvidia.com/en-us/data-center/volta-gpu-architecture
[3] H. Jun, J. Cho, K. Lee, H.-Y. Son, K. Kim, H. Jin and K. Kim, "HBM (High Bandwidth Memory) DRAM Technology and Architecture," 2017 IEEE International Memory Workshop (IMW), pp. 1-4, 2017.
[4] A. Turner and S. McIntosh-Smith, "A survey of application memory usage on a national supercomputer: an analysis of memory requirements on ARCHER," http://www.archer.ac.uk/documentation/white-papers/memory-use/ARCHER_mem_use.pdf
[5] F. T. Hady, A. Foong, B. Veal and D. Williams, "Platform Storage Performance With 3D XPoint Technology," Proceedings of the IEEE, pp. 1-12, 2017. doi: 10.1109/JPROC.2017.2731776
[6] NVDIMM Messaging and FAQ, SNIA website, accessed November 2017: https://www.snia.org/sites/default/files/NVDIMM%20Messaging%20and%20FAQ%20Jan%2020143.pdf
[7] Report on MCDRAM technology from Colfax Research: https://colfaxresearch.com/knl-mcdram/
[8] Intel patent on multi-level memory configuration for nonvolatile memory technology: https://www.google.com/patents/US20150178204
[9] pmem.io: http://pmem.io/
[10] J. Layton, "IO Pattern Characterization of HPC Applications," in High Performance Computing Systems and Applications, Lecture Notes in Computer Science, vol. 5976, Springer, Berlin, Heidelberg, 2010.
[11] H. Luu, M. Winslett, W. Gropp, R. Ross, P. Carns, K. Harms, Prabhat, S. Byna and Y. Yao, "A Multiplatform Study of I/O Behavior on Petascale Supercomputers," in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15), ACM, 2015, pp. 33-44. doi: 10.1145/2749246.2749269
[12] IEEE Std 1003.1-2008 (Revision of IEEE Std 1003.1-2004), IEEE Standard for Information Technology - Portable Operating System Interface (POSIX(R)).
[13] P. Schwan, "Lustre: Building a file system for 1000-node clusters," in Proceedings of the 2003 Linux Symposium, 2003.
[14] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02), USENIX Association, 2002, Article 19.
[15] Introduction to BeeGFS: http://www.beegfs.io/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf
[16] S. Jian, L. Zhan-huai and Z. Xiao, "The performance optimization of Lustre file system," 2012 7th International Conference on Computer Science & Education (ICCSE), Melbourne, VIC, 2012, pp. 214-217. doi: 10.1109/ICCSE.2012.6295060
[17] W. Choi, M. Jung, M. Kandemir and C. Das, "A Scale-Out Enterprise Storage Architecture," IEEE International Conference on Computer Design (ICCD), 2017. doi: 10.1109/ICCD.2017.96
[18] K.-W. Lin, S. Byna, J. Chou and K. Wu, "Optimizing fastquery performance on lustre file system," in Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM), ACM, 2013, Article 29. doi: 10.1145/2484838.2484853
[19] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham and R. Ross, "Understanding and improving computational science storage access through continuous characterization," in Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST '11), IEEE Computer Society, 2011, pp. 1-14. doi: 10.1109/MSST.2011.5937212
[20] J. Kim, S. Lee and J. S. Vetter, "PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures," SC 2017: 57:1-57:14.
[21] J. Lofstead, I. Jimenez, C. Maltzahn, Q. Koziol, J. Bent and E. Barton, "DAOS and Friends: A Proposal for an Exascale Storage System," SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2016, pp. 585-596. doi: 10.1109/SC.2016.49
[22] J. Martí, A. Queralt, D. Gasull, A. Barceló, J. J. Costa and T. Cortes, "dataClay: A distributed data store for effective inter-player data sharing," Journal of Systems and Software, vol. 131, 2017, pp. 129-145. doi: 10.1016/j.jss.2017.05.080
[23] E. Tejedor, Y. Becerra, G. Alomar, A. Queralt, R. M. Badia, J. Torres, T. Cortes and J. Labarta, "PyCOMPSs: Parallel computational workflows in Python," The International Journal of High Performance Computing Applications, vol. 31, issue 1, pp. 66-82. doi: 10.1177/1094342015594678
[24] E. Farsarakis, I. Panourgias, A. Jackson, J. F. R. Herrera, M. Weiland and M. Parsons, "Resource Requirement Specification for Novel Data-aware and Workflow-enabled HPC Job Schedulers," PDSW-DISCS17, 2017. http://www.pdsw.org/pdsw-discs17/wips/farsarakis-wip-pdsw-discs17.pdf
[25] M. Weiland, A. Jackson, N. Johnson and M. Parsons, "Exploiting the Performance Benefits of Storage Class Memory for HPC and HPDA Workflows," Supercomputing Frontiers and Innovations, vol. 5, no. 1, pp. 79-94, 2018. doi: 10.14529/jsfi180105
[26] ORNL Titan specification: http://phys.org/pdf285408062.pdf
[27] V. Anantharaj, F. Foertter, W. Joubert and J. Wells, "Approaching Exascale: Application Requirements for OLCF Leadership Computing," July 2013. https://www.olcf.ornl.gov/wp-content/uploads/2013/01/OLCF_Requirements_TM_2013_Final1.pdf
[28] C. Daley, D. Ghoshal, G. Lockwood, S. Dosanjh, L. Ramakrishnan and N. Wright, "Performance Characterization of Scientific Workflows for the Optimal Use of Burst Buffers," Future Generation Computer Systems, 2017. doi: 10.1016/j.future.2017.12.022
[29] N. R. Mielke, R. E. Frickey, I. Kalastirsky, M. Quan, D. Ustinov and V. J. Vasudevan, "Reliability of Solid-State Drives Based on NAND Flash Memory," Proceedings of the IEEE, vol. 105, no. 9, pp. 1725-1750, Sept. 2017. doi: 10.1109/JPROC.2017.2725738
[30] C. Li, C. Ding and K. Shen, "Quantifying the cost of context switch," in Proceedings of the 2007 workshop on Experimental computer science (ExpCS '07), ACM, New York, NY, USA, Article 2. doi: 10.1145/1281700.1281702