Closing the Performance Gap Between DRAM and PM for In-Memory Index Structures

Closing the Performance Gap between DRAM and PM for In-Memory Index Structures Technical Report UCSC-CRSS-20-01 UCSC-SSRC-20-01 May 29th, 2020 Yuanjiang Ni∗ Shuo Chen Qingda Lu [email protected] [email protected] [email protected] Heiner Litz Zhu Pang [email protected] [email protected] Ethan L. Miller Jiesheng Wu [email protected] [email protected] Center for Research in Storage Systems Storage Systems Research Center Baskin School of Engineering University of California, Santa Cruz Santa Cruz, CA 95064 http://crss.ucsc.edu/ http://ssrc.ucsc.edu/ ∗Work performed while at Alibaba USA. Abstract ficiency [6, 9, 35, 41, 42] and CPU efficiency [25, 47], however, only for conventional DRAM-based sys- Emerging byte-addressable persistent memories tems. such as Intel’s 3D XPoint enable new opportuni- Prior research on PM has focused on topics in- ties and challenges for in-memory indexing struc- cluding consistency and the read-write asymme- tures. While prior work has covered various as- try of PM without focusing on the integration pects of persistent indexing structures, it has also of PM technology and indexing data structures. been limited to performance studies using simulated For instance, there exists a large body of work memories ignoring the intricate details of persistent on providing consistency and durability via write- memory devices. Our work presents one of the first, ahead logging (WAL) [7, 13, 50] or native persis- in-depth performance studies on the interplay of tence [11, 15, 20, 28, 29, 37, 58]. Prior work on real persistent memory hardware and indexing data addressing the asymmetric read-write performance structures. We conduct comprehensive evaluations of PM proposed different techniques for trading ad- of six variants of B+Trees leveraging diverse work- ditional reads for a reduction in writes [10, 60, 61], loads and configurations. We obtain important find- for instance by maintaining the keys of an index ings via thorough investigation of the experimental node in their insertion order [10] to trade-off writes results and detailed micro-architectural profiling. for higher query overheads. Based on our findings, we propose two novel tech- This paper looks at indexing structures from a niques for improving the indexing data structure new PM-based perspective. By studying a sys- performance on persistent memories. Group flush- tem with Intel Optane DC Persistent Memory Mod- ing inserts timely flush operations improving the in- ule (DCPMM), the first commercially available 3D sertion performance of conventional B+-Trees by up XPoint-based PM product that offers desired char- to 24% while Persistency Optimized Log-structuring acteristics of high density, byte addressability and revisits log-structuring for persistent memories im- low latency, we conduct a comprehensive evaluation proving the performance of B+-Trees by up to 41%. of in-PM indexes to answer the following questions: (1) What are the implications of 3D XPoint mem- 1 introduction ory [21, 57] on in-memory index performance? (2) Persistent Memory (PM) devices combine mem- What is the real-world performance impact of vari- ory and storage characteristics by increasing cost- ous PM indexing techniques? (3) Are there device efficiency over DRAM, providing non-volatility and or platform-specific techniques that efficiently opti- byte addressability (via memory load/store instruc- mize indexes in 3D XPoint memory? Our study tar- tions) as well as by supporting sub-µs-scale laten- gets sorted index structures instead of hash tables as cies. The eminent availability of PM opens up the support of efficient range queries is critical for new opportunities for many IO intensive applica- many cloud applications. Among sorted indexing tions. For instance, it improves the viability of in- structures, we focus on the B+-tree and its vari- memory databases by providing higher storage ca- ants, as they are widely used both as standalone pacity for the same cost while providing native dura- software components and as basic building blocks bility without incurring the high write overhead of of advanced indexing structures including Worm- flash SSDs. Due to the limited availability of PM, hole [54] and the MassTree [35]. In comparison, prior work resorted to DRAM-based emulation [58] as demonstrated in prior work [51, 55], index struc- or hardware simulation [39]. Our work is one of the tures such as skip lists and tries provide inferior per- first utilizing real PM hardware evaluating the ac- formance in many scenarios and are often enhanced tual performance benefits and trade-offs of PM. In by incorporating B-tree into their designs. contrast to existing studies of Intel’s Optane mem- Our evaluation results demonstrate that the be- ory [21, 57] that only evaluate raw performance i.e. havioral characteristics of PM are more intricate as latency and bandwidth, our work provides a thor- shown by prior work that leveraged only simula- ough analysis of the interplay of PM and the index- tion techniques. Despite sharing the same interface ing data structures commonly used in databases. with DRAM in load/store instructions, PM should While there exists a large body of work on in- not be be treated as slow RAM. We first conduct a dexing structures as well as on PM, the interplay comprehensive evaluation on the PM indexing tech- between the two remains largely unexplored. For niques with diverse workloads and configurations. instance, prior work on in-memory indexing has in- Several important findings are made and analyzed vestigated scalability [5, 18, 30, 32, 34, 35], cache ef- with detailed profiling. Leveraging our findings we 1 have developed three optimizations to explore more potential techniques to improve the index perfor- CPU Cache mance for PM. The main findings and results of our CPU Core 64-Byte request optimizations are summarized below. iMC Main Findings: 1) PM bandwidth represents a greater performance challenge than PM latency for DDR-T Interface 64-Byte request ADR indexing structures with write-intensive workloads, Apache Pass Controller domain increasing the performance gap between DRAM and PM from 2.2× to 4.4×. PM manufacturers AIT should focus on improving bandwidth by increas- WC Buer DCPMM ing intra-module parallelism to address the most 256-Byte request significant performance issue of PM. 2) The benefits of write-optimized indexing structures are lim- 3D XPoint Storage Media ited in real-world applications. Keeping data un- sorted at tree leaves only improves random insertion performance by 6% in a single threaded work- Figure 1: The internal details of the Intel Optane load. 3) The granularity disparity between CPU DC Persistent Memory Module caches and DCPMM (64 vs 256 bytes) contributes to sub-optimal bandwidth utilization in PMs. 4) The overhead of providing durability and consis- 2 Background tency for PMs is small compared to the raw performance degradation introduced by PM over DRAM. 2.1 3D XPoint Persistent Memory 5) The state of the art consistency mechanism for B-Trees—FAST/FAIR—benefits from continuously While many PM technologies such as PCM (Phase flushing cache-lines to PM. With this optimiza- Change Memory) [43], ReRAM [46] and MRAM [48] tion, enforcing consistency only introduces a latency have been under development for the last three penalty of 1.27× over conventional B-Trees with- decades, Intel recently released 3D XPoint mem- out consistency guarantees. 6) The idiosyncrasies ory as one of the first commercially available PM of DCPMM affect B-Tree parameter tuning such as products. node size. Figure 1 outlines the architecture of Intel’s 3D Optimizations: 1) Interleaving that leverages XPoint platform [21]. DCPMM modules are con- software prefetching and fast context switching can nected to the integrated memory controller (iMC) be used to hide the longer memory access latency on a recent Intel server CPU such as Cascade Lake of PMs. We show that in-PM indexing structures via DDR-T, a proprietary memory protocol. While benefit more from such technique than in-DRAM built on DDR4’s electrical and mechanical inter- indexes. 2) We introduce Group flushing, a tech- face, DDR-T provides an asynchronous command nique that prevents data reordering by flushing and data interface to communicate with the host modified data in groups. An unmodified B-tree iMC. It utilizes a request/grant scheme for initial- with software-directed group flushing achieves write izing requests by sending a command packet from throughput comparable to its write-efficient coun- the host iMC to the DCPMM controller. Similar to terpart while having better search and range-query DDR4, the iMC accesses DIMMS at the cache-line performance. 3) Log-structuring translates random (64-byte) granularity. The iMC and the DCPMM writes to sequential writes at the cost of additional modules connected to it form the Asynchronous reads and garbage collection, effectively addressing DRAM Refresh (ADR) domain. Every cache line the larger line size deployed by PM (265 Byte vs. that reaches the ADR is guaranteed to be persisted, 64 Byte). We show that the performance can be im- even in the event of a power loss. Note that the CPU proved by up to 41% via log structuring, compared caches as well as other buffers in the CPU are out- to the state-of-art PM-aware B+-Tree. side the ADR domain and hence suffer from data loss in the case of a power failure. For wear-leveling and bad-block management DCPMM coordinates accesses to the underlying 3D XPoint storage media via an indirection layer. The Address Indirection Table (AIT) translates system addresses to device-internal addresses. The access 2 Bandwidth (GB/s) Idle Latency (ns) Seq. Rand. Seq. Rand. Seq. Rand. NT Store clwb Read Read Write Write Read Read DRAM 105.9 70.4 52.3 52 81 101 86 57 DCPMM 38.9 10.3 11.5 2.8 169 305 90 62 Table 1: The basic performance characteristics of DCPMM.

Load more