Accordion: Better Memory Organization for LSM Key-Value Stores

Edward Bortnikov (Yahoo Research), Anastasia Braginsky (Yahoo Research), Eshcar Hillel (Yahoo Research), Idit Keidar (Technion and Yahoo Research), Gali Sheffi (Yahoo Research, gsheffi@oath.com)

ABSTRACT

Log-structured merge (LSM) stores have emerged as the technology of choice for building scalable write-intensive key-value storage systems. An LSM store replaces random I/O with sequential I/O by accumulating large batches of writes in a memory store prior to flushing them to log-structured disk storage; the latter is continuously re-organized in the background through a compaction process for efficiency of reads. Though inherent to the LSM design, frequent compactions are a major pain point because they slow down data store operations, primarily writes, and also increase disk wear. Another performance bottleneck in today's state-of-the-art LSM stores, in particular ones that use managed languages like Java, is the fragmented memory layout of their dynamic memory store.

In this paper we show that these pain points may be mitigated via better organization of the memory store. We present Accordion – an algorithm that addresses these problems by re-applying the LSM design principles to memory management. Accordion is implemented in the production code of Apache HBase, where it was extensively evaluated. We demonstrate Accordion's double-digit performance gains versus the baseline HBase implementation and discuss some unexpected lessons learned in the process.

PVLDB Reference Format:
Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, and Gali Sheffi. Accordion: Better Memory Organization for LSM Key-Value Stores. PVLDB, 11(12): 1863-1875, 2018.
DOI: https://doi.org/10.14778/3229863.3229873

1. INTRODUCTION

1.1 LSM Stores

Persistent NoSQL key-value stores (KV-stores) have become extremely popular over the last decade, and the range of applications for which they are used continuously increases. A small sample of recently published use cases includes massive-scale online analytics (Airbnb/Airstream [2], Yahoo/Flurry [7]), product search and recommendation (Alibaba [13]), graph storage (Facebook/Dragon [5], Pinterest/Zen [19]), and many more.

The leading approach for implementing write-intensive key-value storage is log-structured merge (LSM) stores [31]. This technology is ubiquitously used by popular key-value storage platforms [9, 14, 16, 22, 4, 1, 10, 11]. The premise for using LSM stores is the major disk access bottleneck, exhibited even with today's SSD hardware [14, 33, 34].

An LSM store includes a memory store in addition to a large disk store, which is comprised of a collection of files (see Figure 1). The memory store absorbs writes, and is periodically flushed to disk as a new immutable file. This approach improves write throughput by transforming expensive random-access I/O into storage-friendly sequential I/O. Reads search the memory store as well as the disk store.

The number of files in the disk store has an adverse impact on read performance. In order to reduce it, the system periodically runs a background compaction process, which reads some files from disk and merges them into a single sorted one while removing redundant (overwritten or deleted) entries. We provide further background on LSM stores in Section 2.

1.2 LSM Performance Bottlenecks

The performance of modern LSM stores is highly optimized, and yet as technology scales and real-time performance expectations increase, these systems face more stringent demands. In particular, we identify two principal pain points.

First, LSM stores are extremely sensitive to the rate and extent of compactions. If compactions are infrequent, read performance suffers (because more files need to be searched and caching is less efficient when data is scattered across multiple files). On the other hand, if compactions are too frequent, they jeopardize the performance of both writes and reads by consuming CPU and I/O resources, and by indirectly invalidating cached blocks. In addition to their performance impact, frequent compactions increase the disk write volume, which accelerates device wear-out, especially for SSD hardware [27].

Second, as the amount of memory available in commodity servers rises, LSM stores gravitate towards using larger memory stores. However, state-of-the-art memory stores are

organized as dynamic data structures consisting of many small objects, which becomes inefficient in terms of both cache performance and garbage collection (GC) overhead when the memory store size increases. This has been reported to cause significant problems, which developers of managed stores like Cassandra [4] and HBase [1] have had to cope with [12, 3]. Note that the impact of GC pauses is aggravated when timeouts are used to detect faulty storage nodes and initiate fail-over.

Given the important role that compaction plays in LSM stores, it is not surprising that significant efforts have been invested in compaction parameter tuning, scheduling, and so on [6, 18, 17, 32]. However, these approaches only deal with the aftermath of organizing the disk store as a sequence of files created upon memory store overflow events, and do not address memory management. And yet, the amount of data written by memory flushes directly impacts the ensuing flush and compaction toll. Although the increased size of memory stores makes flushes less frequent, today's LSM stores do not physically delete removed or overwritten data from the memory store. Therefore, they do not reduce the amount of data written to disk.

The problem of inefficient memory management in managed LSM stores has received much less attention. The only existing solutions that we are aware of work around the lengthy JVM GC pauses by managing memory allocations on their own, either using pre-allocated object pools and local allocation buffers [3], or by moving the data to off-heap memory [12].

1.3 Accordion

We introduce Accordion, an algorithm for memory store management in LSM stores. Accordion re-applies to the memory store the classic LSM design principles, which were originally used only for the disk store. The main insight is that the memory store can be partitioned into two parts: (1) a small dynamic segment that absorbs writes, and (2) a sequence of static segments created from previous dynamic segments. Static segments are created by in-memory flushes occurring at a higher rate than disk flushes. Note that the key disadvantage of LSM stores – namely, the need to search multiple files on read – is not as significant in Accordion because RAM-resident segments can be searched efficiently, possibly in parallel.

Accordion takes advantage of the fact that static segments are immutable, and optimizes their index layout as flat; it can further serialize them, i.e., remove a level of indirection. The algorithm also performs in-memory compactions, which eliminate redundant data before it is written to disk. Accordion's data organization is illustrated in Figure 2; Section 3 fleshes out its details. The new design has the following benefits:

• Fewer compactions. The reduced footprint of immutable indices as well as in-memory compactions delay disk flushes. In turn, this reduces the write volume (and resulting disk wear) by reducing the amount of disk compactions.

• More keys in RAM. Similarly, the efficient memory organization allows us to keep more keys in RAM, which can improve read latency, especially when slow HDD disks are used.

• Less GC overhead. By dramatically reducing the size of the dynamic segment, Accordion reduces GC overhead, thus improving write throughput and making performance more predictable. The flat organization of the static segments is also readily amenable to off-heap allocation, which further reduces memory consumption by allowing serialization, i.e., eliminating the need to store Java objects for data items.

• Cache friendliness. Finally, the flat nature of immutable indices improves locality of reference and hence boosts hardware cache efficiency.

Accordion is implemented in HBase production code. In Section 4 we experiment with the Accordion HBase implementation in a range of scenarios. We study production-size datasets and data layouts, with multiple storage types (SSD and HDD). We focus on high throughput scenarios, where the compaction and GC toll are significant.

Our experiments show that the algorithm's contribution to overall system performance is substantial, especially under a heavy-tailed (Zipf) key access distribution with small objects, as occurs in many production use cases [35]. For example, Accordion improves the system's write throughput by up to 48%, and reduces read tail latency (in HDD settings) by up to 40%. At the same time, it reduces the write volume (excluding the log) by up to 30%. Surprisingly, we see that in many settings, disk I/O is not the principal bottleneck. Rather, the memory management overhead is more substantial: the improvements are highly correlated with reduction in GC time.

To summarize, Accordion takes a proactive approach for handling disk compaction even before data hits the disk, and addresses GC toll by directly improving the memory store structure and management. It grows and shrinks a sequence of static memory segments resembling accordion bellows, thus increasing memory utilization and reducing fragmentation, garbage collection costs, and the disk write volume. Our experiments show that it significantly improves end-to-end system performance and reduces disk wear, which led to the recent adoption of this solution in HBase production code; it is generally available starting with the HBase 2.0 release. Section 5 discusses related work and Section 6 concludes the paper.

2. BACKGROUND

HBase is a distributed key-value store that is part of the Hadoop open source technology suite. It is implemented in Java. Similarly to other modern KV-stores, HBase follows the design of Google's Bigtable [22]. We sketch out their common design principles and terminology.

Data model: KV-stores hold data items referred to as rows, identified by unique row keys. Each row can consist of multiple columns (sometimes called fields) identified by unique column keys. Co-accessed columns (typically used by the same application) can be aggregated into column families to optimize access. The data is multi-versioned, i.e., multiple versions of the same row key can exist, each identified by a unique timestamp. The smallest unit of data, named cell, is defined by a combination of a row key, a column key, and a timestamp.

The basic KV-store API includes put (point update of one or more cells, by row key), get (point query of one or more

columns, by row key, and possibly column keys), and scan (range query of one or more columns, of all keys between an ordered pair of begin and end row keys).

Data management: KV-stores achieve scalability by sharding tables into range partitions by row key. A shard is called a region in HBase (tablet in Bigtable). A region is a collection of stores, each associated with a column family. Each store is backed by a collection of files sorted by row key, called HFiles in HBase (sst files in Bigtable).

In production, HBase is typically layered on top of the Hadoop Distributed Filesystem (HDFS), which provides a reliable file storage abstraction. HDFS and HBase normally share the same hardware. Both scale horizontally to thousands of nodes. HDFS replicates data for availability (3-way by default). It is optimized for very large files (the default block size is 64MB).

Data access in HBase is provided through region servers (analogous to tablet servers in Bigtable). Each region server controls multiple stores (tens to hundreds in production settings). For locality of access, the HBase management plane (master server) tries to lay out the HFiles in HDFS so that their primary replicas are collocated with the region servers that control them.

Figure 1: A log-structured merge (LSM) store consists of a small memory store (MemStore in HBase) and a large disk store (a collection of HFiles). Put operations update the MemStore. The latter is double-buffered: a flush creates an immutable snapshot of the active buffer and a new active buffer. The snapshot is then written to disk in the background.

LSM stores: Each store is organized as an LSM store, which collects writes in a memory store – MemStore in HBase – and periodically flushes the memory into a disk store, as illustrated in Figure 1. Each flush creates a new immutable HFile, ordered by row key for query efficiency. HFiles are created big, hence flushes are relatively infrequent.

To allow puts to proceed in parallel with I/O, MemStore employs double buffering: it maintains a dynamic active buffer absorbing puts, and a static snapshot that holds the previous version of the active buffer. The latter is written to the filesystem in the background. Both buffers are ordered by key, as are the HFiles.

A flush occurs when either the active buffer exceeds the region size limit (128MB by default) or the overall footprint of all MemStores in the region server exceeds the global size limit (40% of the heap by default). Flush first shifts the current active buffer to be the snapshot (making it immutable) and creates a new empty active buffer. It then replaces the reference to the active buffer, and proceeds to write the snapshot as a new HFile.

In order to guarantee durability of writes between flushes, updates are first written to a write-ahead log (WAL). HBase implements the logical WAL as a collection of one or more physical logs per region server, called HLogs, each consisting of multiple log files on HDFS. As new data gets flushed to HFiles, the old log files become redundant, and the system collects them in the background. If some HLog grows too big, the system may forcefully flush the MemStore before the memory bound is reached, in order to enable the log's truncation. Since logged data is only required at recovery time and is only scanned sequentially, HLogs may be stored on slower devices than HFiles in production (e.g., HDDs vs. SSDs). Real-time applications often trade durability for speed by aggregating multiple log records in memory prior to asynchronously writing them to the WAL in bulk.

To keep the number of HFiles per store bounded, compactions merge multiple files into one, while eliminating redundant (overwritten) versions. If compactions cannot keep up with the flush rate, HBase may either throttle or totally block puts until compactions successfully reduce the number of files.

Most compactions are minor, in the sense that they merge a subset of the HFiles. Major compactions merge all the region's HFiles. Because they have a global view of the data, major compactions also eliminate tombstone versions that indicate row deletions. Typically, major compactions incur a huge performance impact on concurrent operations. In production, they are either carefully scheduled or performed manually [6].

The get operation searches for the key in parallel in the MemStore (both the active and the snapshot buffers) and in the HFiles. The HFile search is expedited through Bloom filters [22], which eliminate most redundant disk reads. The system uses a large RAM cache for popular HFile blocks (by default 40% of the heap).

Memory organization: The MemStore's active buffer is traditionally implemented as a dynamic index over a collection of cells. For the index, HBase uses the standard Java concurrent skiplist map [8]. Data is multi-versioned – every put creates a new immutable version of the row it is applied to, consisting of one or more cells. Memory for data cells is allocated in one of two ways, as explained below.

This implementation suffers from two drawbacks. First, the use of a big dynamic data structure leads to an abundance of small objects and references, which inflate the in-memory index and induce a high GC cost. The overhead is most significant when the managed objects are small, i.e., the metadata-to-data ratio is big [35]. Second, the versioning mechanism makes no attempt to eliminate redundancies prior to flush, i.e., the MemStore size steadily grows, independently of the workload. This is especially wasteful for heavy-tailed (e.g., power law) key access distributions, which are prevalent in production workloads [24].

Memory allocation: HBase implements two schemes for allocating data cells in the MemStore: either (1) each cell is a standard Java object, a POJO (Plain Old Java Object); or (2) cells reside in a bulk-allocated byte array managed by the MSLAB (MemStore Local Allocation Buffer [3]) module. In HBase 2.0, MSLAB is mostly used for off-heap memory allocation. With this approach, the MemStore's memory consists of large blocks, called chunks. Each chunk is fixed in size and holds data for multiple cells pertaining to a single MemStore.
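To make the chunk scheme concrete, the following is a minimal sketch of MSLAB-style allocation: cell bytes are copied into large fixed-size chunks with a bump pointer, so the heap holds a few big buffers instead of millions of small cell objects. The class, constants, and packed-address encoding below are invented for illustration and do not mirror HBase's actual MemStoreLAB implementation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative bump-pointer chunk allocator (not HBase code).
final class ChunkAllocatorSketch {
    private static final int CHUNK_SIZE = 2 * 1024 * 1024;  // chunk size is illustrative

    private final List<ByteBuffer> chunks = new ArrayList<>();
    private ByteBuffer current;

    /** Copies a serialized cell into the current chunk; returns a packed (chunkId, offset) address. */
    synchronized long copyCell(byte[] cellBytes) {
        if (current == null || current.remaining() < cellBytes.length) {
            current = ByteBuffer.allocate(CHUNK_SIZE);   // allocateDirect() would place the chunk off-heap
            chunks.add(current);
        }
        int chunkId = chunks.size() - 1;
        int offset = current.position();
        current.put(cellBytes);                          // bump-pointer copy into the chunk
        return ((long) chunkId << 32) | (offset & 0xFFFFFFFFL);
    }

    /** Reads a cell back given the packed address and its length. */
    synchronized byte[] readCell(long address, int length) {
        ByteBuffer view = chunks.get((int) (address >>> 32)).duplicate();
        view.position((int) (address & 0xFFFFFFFFL));
        byte[] out = new byte[length];
        view.get(out);
        return out;
    }
}
```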

3. ACCORDION

We describe Accordion's architecture and basic operation in Section 3.1. We then discuss in-memory compaction policies in Section 3.2 and implementation details and thread synchronization in Section 3.3. Section 3.4 discusses the index layout in off-heap allocation.

3.1 Overview

Accordion introduces a compacting memory store to the LSM store design framework. In contrast to the traditional memory store, which maintains RAM-resident data in a single monolithic data structure, Accordion manages data as a pipeline of segments ordered by creation time. Each segment contains an index over a collection of data cells. At all times, the most recent segment, called active, is mutable; it absorbs put operations. The rest of the segments are immutable. In addition to searching data in the disk store, get and scan operations traverse all memory segments, similarly to a traditional LSM store read from multiple files. Memory segments do not maintain Bloom filters like HFiles do, but they maintain other metadata like key ranges and time ranges to eliminate redundant reads. In addition, it is possible to search in multiple memory segments in parallel.

Figure 2: Accordion's compacting memory store architecture adds a pipeline of flat segments between the active segment and the snapshot. The memory store includes a small dynamic active segment and a pipeline of flat segments. A disk flush creates a snapshot of the pipeline for writing to disk.

Figure 2 illustrates the Accordion architecture. It is parameterized by two values:

• A – the fraction of the memory store allocated to the active segment; and

• S – an upper bound on the number of immutable segments in the pipeline.

As our experiments show (Section 4), the most effective parameter values are quite small, e.g., 0.02 ≤ A ≤ 0.05, and 2 ≤ S ≤ 5.

Once the active segment grows to its size bound (a fraction A of the memory store's size bound), an in-memory flush is invoked. The in-memory flush makes the active segment immutable and creates a new active segment to replace it. In case there is available space in the pipeline (the number of pipeline segments is smaller than S), the replaced active segment is simply flattened and added to the pipeline. Flattening a segment involves replacing the dynamic segment index (e.g., a skiplist) by a compact ordered array suitable for immutable data, as shown in Figure 3. The indexed data cells are unaffected by the index flattening.

The flat index is more compact than a skiplist, and so reduces the MemStore's memory footprint, which delays disk flushes, positively affecting both read latency (by increasing the MemStore's hit rate) and write volume. It is also cache- and GC-friendly, and supports fast lookup via binary search. In managed environments, it can be allocated in off-heap (unmanaged) memory, which can improve performance predictability as discussed in Section 3.4 below.

Once the number of immutable segments exceeds S, the segments in the pipeline are processed to reduce their number. At a minimum, an in-memory merge replaces the indices of multiple segments by a single index covering data that was indexed by all original segments, as shown in Figure 3c. This is a lightweight process that results in a single segment, but it does not eliminate redundant data versions. For example, if a cell with the same row key and column key is stored in two different segments in the pipeline (with two different timestamps), then after the merge both cells will appear consecutively in the merged segment.

Optionally, an in-memory compaction can further perform redundant data elimination by creating a single flat index with no redundancies and disposing of redundant data cells, as shown in Figure 3d. In the example above, only the more recent cell will appear in the compacted segment. In case the memory store manages its cell data storage internally (via MSLAB), the surviving cells are relocated to a new chunk (to avoid internal chunk fragmentation). Otherwise, the redundant cells are simply de-referenced, allowing the garbage collector to reclaim them. The choice whether to eliminate redundant data (i.e., perform compaction) or not (perform only a merge) is guided by the policies described in Section 3.2 below.

Flushes to disk work the same way as in a standard LSM store: a disk flush first shifts all pipeline segments to the snapshot, which is not part of the pipeline, while the pipeline is emptied so that it may absorb new flat segments. A background flush process merges all snapshot segments while eliminating redundancies, and streams the result to a new file. After the file is written, the snapshot segments are freed.

In case the disk flush process empties the pipeline while an in-memory compaction is attempting to merge some segments, the latter aborts. This behavior is valid since in-memory compactions are an optimization.

[Figure 3 panels: (a) Active segment – a skiplist INDEX over data cells K|V|ts=1 and K|V|ts=2; (b) Flat segment – an ordered-array INDEX over the same cells; (c) Merged segment – a single ordered-array INDEX covering both versions; (d) Compacted segment – an ordered-array INDEX over K|V|ts=2 only.]

Figure 3: The active segment has a skiplist index. After a segment is added to the pipeline, its index is flattened into an ordered array index. When the pipeline consists of S segments, they are either merged or compacted. During the flattening and merge processes the data remains untouched; during a compaction, the indices are merged and redundancies in the data are eliminated. The key K was updated twice, at timestamp ts = 1 and later at timestamp ts = 2. Only in the compacted segment is the redundant version (ts = 1) eliminated.
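Putting the pieces of Section 3.1 together, the following single-threaded sketch illustrates the data flow: puts go into a skiplist-backed active segment, an in-memory flush flattens it into an ordered array that is pushed onto the pipeline, and reads consult the active segment before the flat segments (binary search). All names are invented for illustration; thread safety, MSLAB data storage, and disk flushes are omitted (Section 3.3 describes how the real implementation synchronizes pipeline updates).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative compacting memory store (not the HBase CompactingMemStore code).
final class CompactingMemStoreSketch {
    // A flat (immutable) segment: keys and values in parallel, key-sorted arrays.
    record FlatSegment(byte[][] keys, byte[][] values) {
        byte[] get(byte[] key) {
            int lo = 0, hi = keys.length - 1;
            while (lo <= hi) {                          // binary search over the ordered index
                int mid = (lo + hi) >>> 1;
                int c = Arrays.compare(keys[mid], key);
                if (c == 0) return values[mid];
                if (c < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null;
        }
    }

    private final long memStoreBound;    // overall memory store size bound
    private final double activeFraction; // parameter A (e.g., 0.02)
    private final int pipelineBound;     // parameter S (e.g., 5)

    private ConcurrentSkipListMap<byte[], byte[]> active = new ConcurrentSkipListMap<>(Arrays::compare);
    private final Deque<FlatSegment> pipeline = new ArrayDeque<>();

    CompactingMemStoreSketch(long bound, double a, int s) {
        this.memStoreBound = bound; this.activeFraction = a; this.pipelineBound = s;
    }

    void put(byte[] key, byte[] value) {
        active.put(key, value);
        if (active.size() * 100L > activeFraction * memStoreBound)  // crude size accounting (~100B/entry)
            inMemoryFlush();
    }

    byte[] get(byte[] key) {
        byte[] v = active.get(key);                     // newest data first
        if (v != null) return v;
        for (FlatSegment s : pipeline) {                // then flat segments, newest to oldest
            v = s.get(key);
            if (v != null) return v;
        }
        return null;                                    // a real store would now consult the HFiles
    }

    // In-memory flush: retire the active segment, flatten its index, push it onto the pipeline.
    private void inMemoryFlush() {
        ConcurrentSkipListMap<byte[], byte[]> retired = active;
        active = new ConcurrentSkipListMap<>(Arrays::compare);
        byte[][] keys = new byte[retired.size()][];
        byte[][] vals = new byte[retired.size()][];
        int i = 0;
        for (Map.Entry<byte[], byte[]> e : retired.entrySet()) {   // skiplist iterates in key order
            keys[i] = e.getKey(); vals[i] = e.getValue(); i++;
        }
        pipeline.addFirst(new FlatSegment(keys, vals));
        if (pipeline.size() > pipelineBound)
            mergeOrCompactPipeline();                   // Figure 3(c)/(d); policy choice in Section 3.2
    }

    private void mergeOrCompactPipeline() { /* k-way merge of the sorted segments; omitted */ }
}
```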

3.2 Compaction Policies

Redundant data elimination induces a tradeoff. On the one hand, merging only search indices without removing redundancies is a lighter-weight process. Moreover, this approach is friendly to the managed memory system because the entire segment is freed at once, whereas removing (redundant) cells from existing segments burdens the memory management system (in particular, the garbage collector) by constantly releasing small objects. On the other hand, by forgoing redundant data elimination we continue to consume memory for overwritten data; this is significant in production-like heavy-tailed distributions where some keys are frequently overwritten. Removing these redundancies further delays disk flushes, which both improves read latency (thanks to more queries being satisfied from memory) and reduces the write volume.

Our HBase 2.0 implementation includes the two extreme memory compaction policies:

Basic (low-overhead) never performs redundant data elimination. Rather, once a segment becomes immutable, it flattens its index, and once the pipeline size exceeds S, it merges all segment indices into one.

Eager (high-overhead, high-reward under self-similar workloads) immediately merges a segment that becomes immutable with the current (single) pipeline segment, while eliminating redundant data.

Our experiments (reported in the next section) show that the Eager policy is typically too aggressive, in particular when A is small, and the benefits from reducing the memory footprint are offset by the increased management (and in particular, garbage collection) overhead. We therefore present in this paper a third policy:

Adaptive (the best of all worlds) is a heuristic that chooses whether to eliminate redundant data (as in Eager) or not (as in Basic) based on the level of redundancy in the data and the perceived cost-effectiveness of compaction. Adaptive works at the level of a single LSM store, i.e., it triggers redundancy elimination only for those stores where a positive impact is expected.

Adaptive uses two parameters to determine whether to perform data redundancy elimination:

1. A throttling parameter t grows with the amount of data that can benefit from redundancy elimination. Initially, t = 0.5; it then grows exponentially by 2% with the number of in-memory flushes, and is reset back to the default value (namely 0.5) upon disk flush. Thus, t is bigger when there is more data in the MemStore.

2. The uniqueness parameter, u, estimates the ratio of unique keys in the memory store based on the fraction of unique keys encountered during the previous merge of segment indices.

Note that the accuracy of u at a given point in time depends on the number of merges that have occurred since the last disk flush or data-merge. Initially, u is zero, and so the first in-memory compaction does not employ data-merge. Then, the estimate is based on the S merged components, which at the time of the second in-memory compaction is roughly one half of the relevant data, since the pipeline holds S − 1 unmerged components. Over time, u becomes more accurate while t grows.

Adaptive triggers redundancy elimination with probability t if the fraction of redundant keys 1 − u exceeds a parameter threshold R. The rationale for doing so is that prediction based on u becomes more accurate with time, hence compactions become more important because the component is bigger and more space can be saved.
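The Adaptive trigger can be summarized in a few lines. The sketch below uses invented names (it is not HBase's internal strategy code) and captures the rule stated above: compact with probability t whenever the estimated redundancy 1 − u exceeds the threshold R, with t growing by 2% per in-memory flush and resetting on disk flush.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative Adaptive trigger (not HBase code).
final class AdaptivePolicySketch {
    private final double redundancyThreshold;  // R, e.g., 0.2 / 0.35 / 0.5
    private double throttle = 0.5;             // t: starts at 0.5, reset on every disk flush
    private double uniqueness = 0.0;           // u: estimated from the previous index merge

    AdaptivePolicySketch(double r) { this.redundancyThreshold = r; }

    /** Called when the pipeline is processed; true means eliminate redundant data, false means merge indices only. */
    boolean shouldCompact() {
        boolean redundantEnough = (1.0 - uniqueness) > redundancyThreshold;
        boolean sampled = ThreadLocalRandom.current().nextDouble() < throttle;
        throttle = Math.min(1.0, throttle * 1.02);   // grows by 2% per in-memory flush
        return redundantEnough && sampled;
    }

    /** After merging segment indices, update u from the observed fraction of unique keys. */
    void onIndexMerge(long uniqueKeys, long totalCells) {
        if (totalCells > 0) uniqueness = (double) uniqueKeys / totalCells;
    }

    /** A disk flush empties the memory store, so both estimates restart (an illustrative choice). */
    void onDiskFlush() { throttle = 0.5; uniqueness = 0.0; }
}
```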
WAL Truncation: Since in-memory compaction delays disk flushes, it can cause HLog files (the WAL) to become longer and in some rare cases even unbounded. Although the system protects against such scenarios by forcing a flush to disk whenever the HLog size exceeds a certain threshold, this threshold is fairly high, and it is better to try and keep the HLog files as small as possible without losing data. To facilitate effective WAL truncation, Accordion maintains the oldest version number (smallest timestamp) of any key in the MemStore. This number monotonically increases due to in-memory compactions, which eliminate old versions (and specifically, the oldest version of each key). Whenever this number increases, the WAL is notified and old log files are collected.

3.3 Concurrency

A compacting memstore is comprised of an active segment and a double-ended queue (pipeline) of inactive segments. The pipeline is accessed by read APIs (get and scan), as well as by background disk flushes and in-memory compactions. The latter two modify the pipeline by adding, removing, or replacing segments. These modifications happen infrequently.

The pipeline's readers and writers coordinate through a lightweight copy-on-write, as follows. The pipeline object is versioned, and updates increase the version.

Reads access the segments lock-free, through the version obtained at the beginning of the operation. If an in-memory flush is scheduled in the middle of a read, the active segment may migrate into the pipeline. Likewise, if a disk flush is scheduled in the middle of a read, a segment may migrate from the pipeline to the pre-flush snapshot buffer. The correctness of reads is guaranteed by first taking the reference of the active segment, then the pipeline segments, and finally the snapshot segments. This way, a segment may be encountered twice, but no data is lost. The scan algorithm filters out the duplicates.

Each modification takes the following steps: (1) it promotes the version number; (2) it clones the pipeline, which is a small set of pointers; (3) it performs the update on the cloned version; and (4) it uses a compare-and-swap (CAS) operation to atomically swap the global reference to the new pipeline clone, provided that its version did not change since (1). Note that cloning is inexpensive – only the segment references are copied, since the segments themselves are immutable. For example, an in-memory compaction fails if a disk flush concurrently removes some segments from the pipeline.
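The copy-on-write scheme above condenses into the following sketch. The names are invented (this is not the actual HBase pipeline class): readers grab an immutable snapshot without locking, while writers clone the small list of segment references and publish the clone with a version-guarded compare-and-swap, so that a racing disk flush simply causes an in-memory merge to fail harmlessly.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative versioned copy-on-write pipeline (not HBase code).
final class VersionedPipelineSketch<T> {
    /** Immutable snapshot: a version plus the ordered list of segment references. */
    record Snapshot<T>(long version, List<T> segments) {}

    private final AtomicReference<Snapshot<T>> ref =
            new AtomicReference<>(new Snapshot<>(0L, List.of()));

    /** Read path (get/scan): a consistent immutable snapshot, no locks. */
    List<T> readSegments() {
        return ref.get().segments;
    }

    /** Write path (in-memory flush, merge, disk flush): clone, modify, publish with CAS.
     *  Returns false if another writer raced us, e.g., an in-memory merge aborts
     *  because a disk flush emptied the pipeline first. */
    boolean swap(UnaryOperator<List<T>> change) {
        Snapshot<T> current = ref.get();                                 // (1) read version + segment list
        List<T> updated = List.copyOf(change.apply(current.segments));   // (2)-(3) clone and update
        Snapshot<T> next = new Snapshot<>(current.version + 1, updated);
        return ref.compareAndSet(current, next);                         // (4) publish only if unchanged
    }
}
```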
3.4 Off-Heap Allocation

As explained above, prior to Accordion, HBase allocated its MemStore indices on-heap, using a standard Java skiplist for the active buffer and the snapshot. Accordion continues to use the same data structure – a skiplist – for the active segment, but adopts arrays for the flat segments and snapshot. Each entry in an active or flat segment's index holds a reference to a cell object, which holds a reference to a buffer holding a key-value pair (as a POJO or in MSLAB), as illustrated in Figure 4a. HBase MSLAB chunks may be allocated off-heap.

Accordion's serialized version takes this approach one step further, and allocates the flat segment index using MSLAB as well as the data. A serialized segment has its index and data allocated via the same MSLAB object, but on different chunks. Each MSLAB may reside either on- or off-heap. Frequent in-memory flushes are less suitable for serialized segments, because upon each in-memory flush a new serialized segment, and thus a new index chunk, is allocated. When the index is not big enough to populate the chunk, the chunk is underutilized.

Serialized segments forgo the intermediate cell objects, and have array entries point directly to the chunks holding keys and values, as illustrated in Figure 4b. Removing the cell objects yields a substantial space reduction, especially when data items are small, and eliminates the intermediate level of indirection.

Offloading data from the Java heap has been shown to be effective for read traffic [13]. However, it necessitates recreating temporary cell objects to support HBase's internal scan APIs. Nevertheless, such temporary objects consume a small amount of space on demand, and these objects are deallocated rapidly, making them easier for the GC process to handle.

4. PERFORMANCE STUDY

We fully implemented Accordion in HBase, and it is generally available in HBase 2.0 and up. We now compare Accordion in HBase to the baseline HBase MemStore implementation. Our evaluation explores Accordion's different policies and configuration parameters. We experiment with two types of production machines with directly attached SSD and HDD storage. We exercise the full system with multiple regions, layered over HDFS as in production deployments. We present the experiment setup in Section 4.1 and the evaluation results in Section 4.2.

4.1 Methodology

Experiment setup: Our experiments exploit two clusters with different hardware types. The first consists of five 12-core Intel Xeon 5 machines with 48GB RAM and 3TB SSD storage. The second consists of five 8-core Intel Xeon E5620 servers with 24GB RAM and 1TB HDD storage. Both clusters have a 1Gbps Ethernet interconnect. We refer to these clusters as SSD and HDD, respectively.

In each cluster, we use three nodes for HDFS and HBase instances, which share the hardware. The HDFS data replication ratio is 3x. HBase exploits two machines as region servers, and one as a master server. The workload is driven by the two remaining machines, each running up to 12 client threads.

A region server runs with an 8GB heap, under G1GC memory management. For serialized segment experiments, we use MSLAB memory management. We use the default memory layout, which allocates 40% of the heap (roughly 3GB) to the MemStore area, and 40% more to the read-path block cache. We apply an asynchronous WAL in order to focus on real-time write-intensive workloads (a synchronous WAL implies an order-of-magnitude slower writes). The log aggregation period is one second.

Workloads: Data resides in one table, pre-split into fifty regions (i.e., each region server maintains twenty-five regions). The table has a single column family with four columns. We populate cells with 25-byte values, and so each row comprises 100 bytes; this is a typical size in production workloads [35].

We seek to understand Accordion's performance impact in large data stores. We therefore perform 300M–500M writes, generating 30–50GB of data in each experiment. As a result, experiments are fairly long – the duration of a single experiment varies from approximately 1.5 hours to over 12 hours, depending on the setting (workload, algorithm, and hardware type). In addition, given that Accordion mostly impacts the write-path, we focus our study mainly on write-intensive workloads. For completeness, we also experiment with read-dominated workloads, in order to show that reads (gets and scans) benefit from or are indifferent to the change in the write-path.

We use the popular YCSB benchmarking tool [23] to generate put, get, and scan requests. Each put writes a full row (4 cells, 100 bytes). Each get retrieves a single cell, while scans retrieve all four cells of every scanned row. The length of each scan (number of retrieved cells) is chosen uniformly at random in the range 1–100. In order to produce a high load, updates are batched on the client side in 10KB buffers. In each experiment, all operations draw keys from the
[Figure 4 panels: (a) Flat segment – an INDEX array of cell objects (POJOs), each referencing cell DATA held in MSLAB chunks A and B; (b) Serialized segment – the INDEX itself resides in MSLAB chunk C and references the cell DATA in chunks A and B directly, without intermediate cell objects.]

Figure 4: The difference between the flat segment and the serialized segment. The ordered array of the cell objects cannot be directly serialized and streamed into an off-heap chunk. The transformation shrinks the cell object into just 20 bits, and writes on the index chunk. same distribution over a key range of 100M items. We ex- dancy nor locality, they are 23.8% and 8.9%, respectively. periment with two distributions: heavy-tailed (Zipf) and The gain is significantly larger for the system with SSD stor- uniform (the latter is less representative of real workloads age, which is much faster on the I/O side. Being more CPU- and is studied for reference only). The Zipf distribution is bound than I/O-bound, its write throughput is more depen- generated following the description in [26], with θ = 0.99% dent on the MemStore speed than on background writes to (YCSB standard). the file system. Surprisingly, the Basic policy, which flattens and merges Measurements Our experiments measure HBase’s through- segment indices but avoids redundancy elimination, yields put and latency in the different settings. To reason about the largest throughput gain. We explain this as follows. By the results, we explore additional signals such as I/O statis- reducing the active segment size to A = 2% of the Mem- tics and GC metrics. While in write-only workloads we mea- Store size, all Accordion policies improve insertion time of sure performance throughout the experiment, we precede new versions into the dynamic index through improved lo- each mixed-workload experiment (read/write or scan/write) cality of access and search time in the skiplist. However, with a write-only load phase that populates the table; per- they affect the garbage collection in different ways. Basic formance of the load-phase is not reported. adds negligible overhead to garbage collection by recycling We repeated each experiment (with a particular workload, the index arrays. Similarly to NoCompaction, it releases particular hardware, and particular parameter settings) a memory in big bursts upon disk flush. handful of times (usually five) to verify that the results are Eager is not included in this figure, because its perfor- not spurious. The median result is reported. mance with these parameters is below the baseline, as we show in the sequel. It eliminates redundant data at a con- 4.2 Evaluation Results stant high rate, which is less friendly for generational GC. We compare the different Accordion policies (Basic, Adap- And as we further show below, GC overhead is a strong tive, and Eager) to the legacy MemStore implementation, to predictor for overall performance. which we refer as NoCompaction. Here, Adaptive is studied Adaptive strikes a balance between Basic and Eager. Its under multiple redundancy thresholds: R = 0.2, R = 0.35 performance is much closer to Basic because it is workload- and R = 0.5 (the smaller the ratio, the more aggressively driven and selective about the regions in which it performs the policy triggers in-memory compaction). in-memory compaction. Accordion’s parameters A (active segment fraction bound) Note that Adaptive is only marginally slower than Basic’s and S (pipeline number of segments bound) are tuned to (3% to 7% throughput for the Zipf distribution on SSD, values that optimize performance for the Basic version of immaterial in other cases). In what follows, we substantiate Accordion. We use A = 0.02 and S = 5. Section 4.2.4 its savings on the I/O side. 
Going forward, we focus on the presents our parameter exploration. more realistic Zipf workload.

4.2.1 Write-Only Workloads I/O metrics With respect to write amplification, the pic- ture is reversed. In this regard, Basic provides modest sav- Our first benchmark set exercises only put operations. It ings. It writes 13% less bytes than NoCompaction on SSD, starts from an empty table and performs 500M puts, in par- and 8% on HDD. In contrast, Adaptive slashes the number allel from 12 client threads. We study the Zipf and uniform of flushes and compactions by 58% and 61%, respectively, key distributions. The experiments measure write through- and the number of bytes written by almost 30%. As ex- put (latency is not meaningful because puts are batched), pected, the lower the redundancy estimate threshold R, the as well as disk I/O and GC metrics. more aggressive the algorithm, and consequently, the higher Write throughput Figure5 depicts the performance speed- the savings. Figure6 and Table2 summarize the results. up of Basic and Adaptive over NoCompaction. Table1 pro- Note that the amount of data written to WAL (approxi- vides the absolute throughput numbers. For the Zipf bench- mately 45% of the write volume) remains constant across all mark, the maximal speedup is 47.6% on SSD and 25.4% on experiments. HBase operators often choose placing log files HDD. For the uniform benchmark, which has neither redun- on inexpensive hard disks, since they are only needed for

1869 Table 1: Write throughput (operations per second) of the best Accordion policy (Basic) vs NoCompaction, across multiple key distributions and disk hardware types.

                   Zipf, SSD    Zipf, HDD    Uniform, SSD    Uniform, HDD
NoCompaction          75,861       60,457          74,971          35,342
Basic                115,730       78,392          92,816          38,488

(a) Zipf distribution (b) Uniform distribution

Figure 5: Write throughput speedup vs NoCompaction (HBase legacy MemStore) achieved by Accordion with Basic and Adaptive policies. Measured on the Zipf and uniform key distributions, on SSD and HDD hardware. In SSD systems, Basic increases the write throughput by close to 48%. recovery, and in this case disk wear is not a major concern and single-key gets from two other threads. The keys of all for WAL data. operations are distributed Zipf. The workload is measured All-in-all, there is a tradeoff between the write throughput after loading 30GB of data (300M write operations) to make and storage utilization provided by the different Accordion sure most if not all keys have some version on disk. We policies. Adaptive provides multiple operating points that measure the 50th, 75th, 90th, 95th and 99th get latency trade major storage savings for minor performance losses. percentiles. Figure9 depicts the relative latency slowdown of Basic and Adaptive (R = 0.2) versus NoCompaction. Garbage collection We saw above that redundancy elim- The HDD systems enjoy a dramatic reduction in tail la- ination proved to be less impactful than memory manage- tencies (15% to 40% for Adaptive). We explain this as fol- ment overhead, especially in systems with fast SSD hard- lows. The tail latencies are cache misses that are served ware. Figure7, which studies GC overhead for a range of from disk. The latter are most painful with HDDs. Thanks policies and tuning parameters, further corroborates this ob- to in-memory compaction, Adaptive prolongs the lifetime of servation. It shows a clear negative correlation between the data in MemStore. Therefore, more results are served from GC time and the write throughput. Note that this does not MemStore, i.e., there are less attempts to read from cache, necessarily mean that GC time is the only reason for per- and consequently also from disk. formance reduction. Rather, GC cycles may be correlated In SSD systems, the get latencies are marginally slower with other costs, e.g., memory fragmentation, which reduces in all percentiles (e.g., the 95th percentile is 2.13ms in No- locality. Compaction versus 2.26ms in Adaptive). This happens be- cause most reads are dominated by in-memory search speed, Serialized and off-heap segments We also experiment which depends on the number of segments in the MemStore with the write-only workload using serialized segments, whose pipeline, which was set to 5 in this experiment. The other indices are allocated on MSLAB chunks (as described in Sec- factor is increased GC overhead in Adaptive that stems from tion 3.4). We compare them to un-serialized flat segments, redundancy elimination. to which we refer here shortly as flat. We concentrate on the Zipf key distribution and SSD hardware. In these ex- 4.2.3 Scan-Write Workload periments we run with synchronous WAL. We run two sets Finally, we study scans on SSD under Zipf key distribu- of experiments: in the first all MSLAB chunks are allocated tion. We exercise the standard YCSB scans workload (work- on-heap, and in the second – off-heap. For the serialized seg- load e), where 95% of 500 , 000 operations are short range ments we set parameter A to 0.1, for less frequent in-memory queries (50 keys in expectation), and the remaining 5% are flushes. Figure8 presents the throughput speedup using flat put operations. The workload is measured after loading and serialized segments with the Basic policy over NoCom- 50GB of data (500M write operations). The latency slow- paction. 
For the flat Basic, the speedup is 27% on-heap and down of Basic over NoCompaction is presented in Figure 10. 29% off-heap. For the serialized Basic, the speedup is 33% Here we also experiment with different pipeline sizes (pa- on-heap and 44% off-heap. The gain from the serialized Ba- rameter S). With only two segments in the pipeline (S = 2) sic is larger for the off-heap case, because when the index is the search for each key is a bit faster than with a longer taken off-heap the JVM GC has less work to do. pipeline (S = 5). We can see that the slowdown for the short pipeline is negligible – up to 3%. When the pipeline 4.2.2 Read-Write Workload may grow to five segments, the performance gap slightly in- Our second benchmark studies read latencies under heavy creases to 2% − 8%. These results resemble the SSD reads write traffic. We run batched puts from 10 client threads slowdown.

1870 Table 2: Number of flushes and compactions, measured for NoCompaction vs Accordion policies, for the Zipf key distribution.

Policy               #flushes, SSD   #compactions, SSD   #flushes, HDD   #compactions, HDD
NoCompaction                  1468                 524            1504                 548
Basic                         1224                 355            1210                 443
Adaptive (R=0.5)               922                 309             879                 316
Adaptive (R=0.35)              754                 261             711                 248
Adaptive (R=0.2)               631                 209             630                 216

(a) SSD (b) HDD

Figure 6: Bytes written by flushes and compactions, measured for NoCompaction vs Accordion policies, for the Zipf key distribution. The Adaptive policy provides double-digit savings for disk I/O.


Figure 8: Throughput speedup when using flat (un- serialized) and serialized segments allocated on MSLAB, off-heap and on-heap.

Figure 7: Cumulative GC time vs write throughput, measured on SSD machines for multiple Accordion in Figure 11. The store scales as the active segment memory policies and configurations, for the Zipf key distri- fraction decreases. bution. Each data point depicts the throughput vs The next parameter is the pipeline size. When A is less GC time of one experiment. Both axes are in log-log than 10%, it is clear that merging the active segment into scale. The graph shows a clear negative correlation the much bigger static data structure over and over again between the two metrics. can be inefficient, as it creates a new index and releases the old one. The alternative is to batch multiple segments 4.2.4 Parameter Tuning (S > 1) in the pipeline before merging them into a single segment. This throttles the index creation rate. On the flip One of the main sources of memory management overhead side, the read APIs must scan all the segments, which de- is the skiplist data structure used to index the active mem- grades their performance. Figure 12 depicts the write-only ory segment. Not only is it bigger in size compared to a flat throughput results as a function of the number of segments index, it is also fragmented whereas a static index is stored in the pipeline. The peak throughput is achieved with S = 5 in a consecutive block of memory. Therefore, flat storage in- on SSD and S = 4 on HDD. In Figure 10 above, we observed curs smaller overhead in terms of allocation, GC, and cache the effect of the parameter S on read performance – we saw misses. We first tune the size of this data structure. that the gap in scan latency between S = 5 and S = 1 is at We evaluate the Basic policy with different bounds on most 5%. the active segment fraction: A = 0.25, 0.1, 0.05, and 0.02. NoCompaction has no static memory segments, hence its Parameter Tuning for Eager With the preferred active throughput is designated with a single point A = 1. We segment size chosen for Basic (A = 0.02), Eager performs measure throughput in a write-only Zipf workload, on SSD, poorly since compacting the data so frequently is inefficient. for the four values of A, and with a pipeline size S = 2. The Figure 13 shows that when the active segment is much bigger throughput of all five runs for each segment size are depicted (A = 0.25), Eager performs better, and outperforms the

1871 (a) SSD (b) HDD

Figure 9: Read latency speedup (respectively, slowdown) of Basic and Adaptive versus NoCompaction, under high write contention. Latencies are broken down by percentile. In HDD systems, Adaptive delivers up to 40% tail latency reduction.

1.20 an element is done by applying a binary search on the runs 1.00 starting with the smallest run until the element is found. 0.80 AlphaSort [30] further optimizes a sort algorithm by taking

0.60 none advantage of the entire memory hierarchy (from registers to basic S=1 0.40 disk); each run fits into the size of one level in the hierarchy. basic S=5 Latency slowdown These methods inspired the original work on LSM-trees [31] 0.20 and its variant for multi-versioned data stores [29]. 0.00 50th 75th 90th 95th 99th Because compaction frequency and efficiency has a signifi- percentile percentile percentile percentile percentile cant impact on LSM store performance and disk wear, many previous works have focused on tuning compaction parame- Figure 10: Latency speedup for scans of Basic. ters and scheduling [6, 14, 17, 18, 32] or reducing their cost via more efficient on-disk organization [28]. However, these works focus on disk compactions, which deal with data after it has been flushed to disk. We are not aware of any previous work that added in-memory compactions to LSM stores. Some previous works have focused on other in-memory optimizations. cLSM [25] focused on scaling LSM stores on multi-core hardware by adding lightweight synchroniza- tion to LevelDB. Accordion, on the other hand, is based on HBase, which already uses a concurrent in-memory skiplist; improving concurrency in the line of cLSM is orthogonal to our contribution. Facebook’s RocksDB [14] is the state-of-the-art imple- mentation of a key-value store. All writes to RocksDB Figure 11: Tuning of the active segment memory frac- are first inserted into an in-memory active segment (called tion, A, for the Basic policy, with S = 2, Zipf distri- memtable). Once the active segment is full, a new one is bution on SSD. Five experiments are conducted for created and the old one becomes immutable. At any point each A. The write throughput is higher for small in time there is exactly one active segment and zero or more values of A. immutable segments [15]. Unlike Accordion immutable seg- ments take the same form of mutable segments. Namely, there is no effort done to flatten the layout of the indices of baseline by 20%. We also discovered that the write volume immutable segments or to eliminate redundant data versions for Eager is roughly the same in these two settings, whereas in any way while it is in-memory. Basic’s write volume increases with A; (these results are Similarly to Accordion, the authors of FloDB [20] also not shown). Since Adaptive is always superior to Eager, we observed that using large skiplists is detrimental to LSM excluded Eager from the experiments reported above. store performance, and so FloDB also partitions the memory component. However, rather than using a small skiplist and 5. RELATED WORK large flat segments as Accordion does, FloDB uses a small hash table and a large skiplist. Put operations inserting The basis for LSM data structures is the logarithmic method data to the small hash table are much faster than insertions [21]. It was initially proposed as a way to efficiently trans- to the large skiplist, and the latter are batched and inserted form static search structures into dynamic ones. A binomial via multi-put operations that amortize the cost of the slow list structure stored a sequence of sorted arrays, called runs skiplist access. Unlike Accordion, FloDB does not reduce each of size of power of two. Inserting an element triggers a the memory footprint by either flattening or compaction. cascaded series of merge-sorting of adjacent runs. Searching

1872 (a) SSD (b) HDD

Figure 12: Tuning the pipeline size bound, S, for the Basic policy. Five experiments are conducted for each S.

thus improving performance and also reducing disk wear- out. Originally, we expected an active segment (skiplist) comprising 10–20% of the memory store to be effective, and anticipated that smaller active components would result in excessive flush and compaction overhead. Surprisingly, we found that performance is improved by using a much smaller active component, taking up only 2% of the memory store. This is due in part to the fact that our workload is com- prised of small objects – which are common in production workloads [35] – where the indexing overhead is substantial.

Impact of memory management on write through- put We showed that the write volume can be further re- Figure 13: Throughput (op/sec): Eager versus Basic duced, particularly in production-like self-similar key access in different settings. distributions, by merging memory-resident data, namely, eliminating redundant objects. We expected this reduction in write volume to also improve performance. Surprisingly, LSM-trie [35] is motivated by the prevalence of small ob- we found that disk I/O was not a principal bottleneck, even jects in industrial NoSQL workloads. Like us, they observed on HDD. In fact, write throughput is strongly correlated that skiplists induce high overhead when used to manage with GC time, regardless of the write volume. many small objects. They suggest to replace the LSM store Cost of redundant data elimination We expected re- architecture with an LSM trie, which is based on hashing dundant data elimination to favorably affect performance, instead of ordering and thus eliminates the memory for in- since it frees up memory and reduces the write volume. This dexing. This data structure is much faster, but only for has led us to develop Eager, an aggressive policy that fre- small values, and, more importantly, does not support range quently performs in-memory data merges. Surprisingly, this queries. proved detrimental for performance. The cost of the Eager policy was exacerbated by the fact that we benefitted from a significantly smaller active component than originally ex- 6. DISCUSSION pected (as noted above), where it literally could not keep While disk compactions in LSM stores have gotten a lot of up with the in-memory flush rate. This led us to develop attention, their in-memory organization was, by and large, a new policy, Adaptive, which heuristically decides when to ignored. In this work, we showed that applying the LSM perform data merges based on an estimate of the expected principles also to RAM can significantly improve perfor- yield. While Adaptive does not improve performance com- mance and reduce disk wear. We presented Accordion, a pared to the Basic policy, which merges only indices and not new memory organization for LSM stores, which employs in- data, it does reduce the write volume, which is particularly memory flushes and compaction to reduce the LSM store’s important for longevity of SSD drives. memory footprint and improve the efficiency of memory management and access. We integrated Accordion in Apache Deployment recommendations Following the set of ex- HBase following extensive testing. It is available in HBase periments we ran, we can glean some recommendations for 2.0 and up. selecting policies and parameter values. The recommended In this paper, we evaluated Accordion with different stor- compaction policy is Adaptive, which offers the best perfor- age technologies (HDD and SSD) and various compaction mance vs disk-wear tradeoff. When running without MSLAB policies, leading to some interesting insights and surprises: (i.e., with a flat index), an active segment size of A = 0.02 offered the best performance in all settings we experimented Flattening and active component size We showed with, though bigger values of A might be most appropriate that flattening the in-memory index reduces the memory in settings with big values (and hence lower meta-data-to- management overhead as well as the frequency of disk writes,

1873 data ratios). When using MSLAB allocation, it is wasteful [21] J. L. Bentley and J. B. Saxe. Decomposable searching to use such a small active segment, because an entire chunk problems i. static-to-dynamic transformation. Journal is allocated to it, and so we recommend using A = 0.1. of Algorithms, 1(4):301–358, 1980. Assuming the workload is skewed (as production workloads [22] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. usually are), we recommend using a fairly aggressive com- Wallach, . Burrows, T. Chandra, A. Fikes, and paction threshold R = 0.2. If the workload is more uni- R. E. Gruber. Bigtable: A distributed storage system formly distributed or if write throughput is more crucial for structured data. ACM Trans. Comput. Syst., than write amplification, then one can increase the value R 26(2):4:1–4:26, June 2008. up to 0.5 or even higher to match the throughput of Ba- [23] B. F. Cooper, A. Silberstein, E. Tam, sic. The pipeline size S induces a tradeoff between read R. Ramakrishnan, and R. Sears. Benchmarking cloud and write performance. If the workload is write-intensive serving systems with ycsb. In Proceedings of the 1st then we recommend using S = 5, and if the workload is ACM Symposium on Cloud Computing, SoCC ’10, read-oriented S = 2 is better. We note that most of the rec- pages 143–154, New York, NY, USA, 2010. ACM. ommendations here are set as default values in the coming [24] P. Devineni, D. Koutra, M. Faloutsos, and HBase 2.0 release. C. Faloutsos. If walls could talk: Patterns and anomalies in facebook wallposts. In Proceedings of the Acknowledgments 2015 IEEE/ACM International Conference on We are thankful to the members of the HBase community Advances in Social Networks Analysis and Mining for many helpful discussions during the development of the 2015, ASONAM ’15, pages 367–374, New York, NY, Accordion project. In particular, we would like to thank USA, 2015. ACM. Michael Stack, Ramakrishna Vasudevan, Anoop Sam John, [25] G. Golan-Gueta, E. Bortnikov, E. Hillel, and Duo Zhang, Chia-Ping Tsai, and Ted Yu. I. Keidar. Scaling concurrent log-structured data stores. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 7. REFERENCES 32:1–32:14, New York, NY, USA, 2015. ACM. [1] Apache HBase. http://hbase.apache.org. [26] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and [2] Apache hbase at airbnb. https://www.slideshare. P. J. Weinberger. Quickly generating billion-record net/HBaseCon/apache-hbase-at-airbnb. synthetic databases. In Proceedings of the 1994 ACM [3] Avoiding full gcs with memstore-local allocation SIGMOD International Conference on Management of buffers. http://blog.cloudera.com/blog/2011/02/. Data, SIGMOD ’94, pages 243–252, New York, NY, USA, 1994. ACM. [4] Cassandra. http://cassandra.apache.org. [27] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and [5] Dragon: A distributed graph query engine. https: R. Pletka. Write amplification analysis in flash-based //code.facebook.com/posts/1737605303120405. solid state drives. In Proceedings of SYSTOR 2009: [6] Hbase compaction tuning tips. The Israeli Experimental Systems Conference, https://community.hortonworks.com/articles/ SYSTOR ’09, pages 10:1–10:9, New York, NY, USA, . 52616/hbase-compaction-tuning-tips.html 2009. ACM. [7] Hbase operations in a flurry. . http://bit.ly/2yaTfoP [28] L. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. [8] Java concurrent skiplist map. https: Arpaci-Dusseau, and R. H. Arpaci-Dusseau. 
Wisckey: //docs.oracle.com/javase/7/docs/api/java/util/ Separating keys from values in ssd-conscious storage. concurrent/ConcurrentSkipListSet.html. TOS, 13(1):5:1–5:28, 2017. [9] LevelDB. https://github.com/google/leveldb. [29] P. Muth, P. E. O’Neil, A. Pick, and G. Weikum. [10] Mongodb. https://mongorocks.org. Design, implementation, and performance of the lham [11] Mysql. http://myrocks.io. log-structured history data access method. In [12] Off-heap memtables in cassandra 2.1. Proceedings of the 24rd International Conference on https://www.datastax.com/dev/blog/ Very Large Data Bases, VLDB ’98, pages 452–463, off-heap-memtables-in-cassandra-2-1. 1998. [13] Offheap read-path in production the alibaba story. [30] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and https://blog.cloudera.com/blog/2017/03/. D. Lomet. AlphaSort: A cache-sensitive parallel [14] RocksDB. http://rocksdb.org/. external sort. The VLDB Journal, 4(4):603–628, Oct. [15] Rocksdb tuning guide. 1995. [16] Scylladb. https://github.com/scylladb/scylla. [31] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The [17] Sstable compaction and compaction strategies. log-structured merge-tree (LSM-tree). Acta Inf., http://bit.ly/2wXc7ah. 33(4):351–385, June 1996. [18] Universal compaction. [32] R. Sears and R. Ramakrishnan. blsm: A general [19] Zen: Pinterest’s graph storage service. purpose log structured merge tree. In Proceedings of http://bit.ly/2ft4YDx. the 2012 ACM SIGMOD International Conference on [20] O. Balmau, R. Guerraoui, V. Trigonakis, and Management of Data, SIGMOD ’12, pages 217–228, I. Zablotchi. Flodb: Unlocking memory in persistent New York, NY, USA, 2012. ACM. key-value stores. In Proceedings of the Twelfth [33] A. S. Tanenbaum and H. Bos. Modern Operating European Conference on Computer Systems, EuroSys Systems. Prentice Hall Press, Upper Saddle River, NJ, ’17, pages 80–94, New York, NY, USA, 2017. ACM. USA, 4th edition, 2014.

1874 [34] G. Wu, X. He, and B. Eckart. An adaptive write buffer management scheme for flash-based ssds. Trans. Storage, 8(1):1:1–1:24, Feb. 2012. [35] X. Wu, Y. Xu, Z. Shao, and S. Jiang. Lsm-trie: An lsm-tree-based ultra-large key-value store for small data items. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 71–82, Santa Clara, CA, 2015. USENIX Association.
