WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Se Kwon Lee, UNIST (Ulsan National Institute of Science and Technology); K. Hyun Lim, Hongik University; Hyunsub Song, Beomseok Nam, and Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)

https://www.usenix.org/conference/fast17/technical-sessions/presentation/lee

This paper is included in the Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST ’17). February 27–March 2, 2017 • Santa Clara, CA, USA. ISBN 978-1-931971-36-2. Open access to the Proceedings of the 15th USENIX Conference on File and Storage Technologies is sponsored by USENIX.

Abstract

Recent interest in persistent memory (PM) has stirred development of index structures that are efficient in PM. Recent such developments have all focused on variations of the B-tree. In this paper, we show that the radix tree, which is another less popular indexing structure, can be more appropriate as an efficient PM indexing structure. This is because the radix tree structure is determined by the prefix of the inserted keys and also does not require tree rebalancing operations and node granularity updates. However, the radix tree as-is cannot be used in PM. As another contribution, we present three radix tree variants, namely, WORT (Write Optimal Radix Tree), WOART (Write Optimal Adaptive Radix Tree), and ART+CoW. Of these, the first two are optimal for PM in the sense that they only use one 8-byte failure-atomic write per update to guarantee the consistency of the structure and do not require any duplicate copies for logging or CoW. Extensive performance studies show that our proposed radix tree variants perform considerably better than recently proposed B-tree variants for PM such as NVTree, wB+Tree, and FPTree for synthetic workloads as well as in implementations within Memcached.

1 Introduction

Previous studies on indexing structures for persistent memory (PM) have concentrated on B-tree variants. In this paper, we advocate that the radix tree can be better suited for PM indexing than B-tree variants. We present radix tree variant indexing structures that are optimal for PM in that consistency is always guaranteed by a single 8-byte failure-atomic write without any additional copies for logging or CoW.

Emerging persistent memory technologies such as phase-change memory, spin-transfer torque MRAM, and 3D XPoint are expected to radically change the landscape of various memory and storage systems [4, 5, 7, 9, 10, 14]. In the traditional block-based storage device, the failure atomicity unit, which is the update unit where consistent state is guaranteed upon any system failure, has been the disk block size. However, as persistent memory, which is byte-addressable and non-volatile, will be accessible through the memory bus rather than via the PCI interface, the failure atomicity unit for persistent memory is generally expected to be 8 bytes or no larger than a cache line [5, 6, 12, 13, 15, 19].

The smaller failure atomicity unit, however, appears to be a double-edged sword in the sense that though this allows for reduction of data written to persistent store, it can lead to high overhead to enforce consistency. This is because in modern processors, memory write operations are often arbitrarily reordered in cache line granularity, and to enforce the ordering of memory write operations, we need to employ memory fence and cache line flush instructions [21]. These instructions have been pointed out as a major cause of performance degradation [3, 9, 15, 20]. Furthermore, if data to be written is larger than the failure-atomic write unit, then expensive mechanisms such as logging or copy-on-write (CoW) must be employed to maintain consistency.

Recently, several persistent B-tree based indexing structures such as NVTree [20], wB+Tree [3], and FPTree [15] have been proposed. These structures focus on reducing the number of calls to the expensive memory fence and cache line flush instructions by employing an append-only update strategy. Such a strategy has been shown to significantly reduce duplicate copies needed for schemes such as logging, resulting in improved performance. However, this strategy does not allow these structures to retain one of the key features of B-trees, that is, having the keys sorted in the nodes. Moreover, this strategy is insufficient in handling node overflows as node splits involve multiple node changes, making logging necessary.

While B-tree based structures have been popular in-memory index structures, there is another such structure, namely, the radix tree, that has been less so. The first contribution of this paper is showing the appropriateness and the limitation of the radix tree for PM storage. That is, since the radix tree structure is determined by the prefix of the inserted keys, the radix tree does not require key comparisons. Furthermore, tree rebalancing operations and updates in node granularity units are also not necessary. Instead, insertion or deletion of a key results in a single 8-byte update operation, which is perfect for PM. However, the original radix tree is known to poorly utilize memory and cache space. In order to overcome this limitation, the radix tree employs a path compression optimization, which combines multiple tree nodes that form a unique search path into a single node. Although path compression significantly improves the performance of the radix tree, it involves node split and merge operations, which are detrimental for PM.

The limitation of the radix tree leads us to the second contribution of this paper. That is, we present three radix tree variants for PM. For the first of these structures, which we refer to as Write Optimal Radix Tree for PM (WORTPM, or simply WORT), we develop a failure-atomic path compression scheme for the radix tree such that it can guarantee failure atomicity with the same memory saving effect as the existing path compression scheme. For the node split and merge operations in WORT, we add memory barriers and persist operations such that the number of writes, memory fence, and cache line flush instructions in enforcing failure atomicity is minimized. WORT is optimal for PM, as is the second variant that we propose, in the sense that they require only one 8-byte failure-atomic write per update to guarantee the consistency of the structure without any duplicate copies.

The second and third structures that we propose are both based on the Adaptive Radix Tree (ART) that was proposed by Leis et al. [11]. ART resolves the trade-off between search performance and node utilization by employing an adaptive node type conversion scheme that dynamically changes the size of a tree node based on node utilization. This requires additional metadata and more memory operations than the traditional radix trees, but has been shown to still outperform other cache conscious in-memory indexing structures. However, ART in its present form does not guarantee failure atomicity. For the second radix tree variant, we present Write Optimal Adaptive Radix Tree (WOART), which is a PM exten-

where CoW can be expensive, with the radix tree, we show that CoW incurs considerably less overhead.

Through an extensive performance study using synthetic workloads, we show that for insertion and search, the radix tree variants that we propose perform better than recent B-tree based persistent indexes such as the NVTree, wB+Tree, and FPTree [3, 15, 20]. We also implement the indexing structure within Memcached and show that similarly to the synthetic workloads, our proposed radix tree variants perform substantially better than the B-tree variants. However, performance evaluations show that our proposed index structures are less effective for range queries compared to the B-tree variants.

The rest of the paper is organized as follows. In Section 2, we present the background on consistency issues with PM and PM targeted B-tree variant indexing structures. In Section 3, we first review the radix tree to help understand the main contributions of our work. Then, we present the three radix tree variants that we propose. We discuss the experimental environment in Section 4 and then present the experimental results in Section 5. Finally, we conclude with a summary in Section 6.

2 Background and Motivation

In this section, we review background work that we deem most relevant to our work and also necessary to understand our study. First, we review the consistency issue of indexing structures in persistent memory. Then, we present variants of B-trees for PM. As the contribution of our work starts with the radix tree, we review the radix tree in Section 3.

2.1 Consistency in Persistent Memory

Ensuring recovery correctness in persistent indexing structures requires additional memory write ordering constraints. In disk-based indexing, arbitrary changes to a volatile copy of a tree node in DRAM can be made without considering memory write ordering because it is a volatile copy and its persistent copy always exists in disk storage and is updated in disk block units. However, with failure-atomic write granularity of 8 bytes in PM, changes to an existing tree node must be carefully ordered to enforce consistency and recoverability. For example, the number of entries in a tree node must be increased after a new entry is stored.