Flodb: Unlocking Memory in Persistent Key-Value Stores

FloDB: Unlocking Memory in Persistent Key-Value Stores Oana Balmau1, Rachid Guerraoui1, Vasileios Trigonakis2 ∗, and Igor Zablotchi1 y 1EPFL, {first.last}@epfl.ch 2Oracle Labs, {first.last}@oracle.com Abstract put, remove concurrent Log-structured merge (LSM) data stores enable to store and get, scan FloDB process large volumes of data while maintaining good per- concurrent fast buffer formance. They mitigate the I/O bottleneck by absorbing up- sorted level dates in a memory layer and transferring them to the disk layer in sequential batches. Yet, the LSM architecture fun- L0 Disk ... component damentally requires elements to be in sorted order. As the amount of data in memory grows, maintaining this sorted or- Ln der becomes increasingly costly. Contrary to intuition, existing LSM systems could actually lose throughput with larger Figure 1: LSM data store using FloDB. memory components. In this paper, we introduce FloDB, an LSM memory component architecture which allows throughput to scale on 1. Introduction modern multicore machines with ample memory sizes. The Key-value stores are a crucial component of many systems main idea underlying FloDB is essentially to bootstrap the that require low-latency access to large volumes of data [1, 6, traditional LSM architecture by adding a small in-memory 12, 15, 16]. These stores are characterized by a flat data or- buffer layer on top of the memory component. This buffer ganization and simplified interface, which allow for efficient offers low-latency operations, masking the write latency of implementations. However, the amount of data targeted by the sorted memory component. Integrating this buffer in the key-value stores is usually larger than main memory; thus, classic LSM memory component to obtain FloDB is not triv- persistent storage is generally required. Since accessing per- ial and requires revisiting the algorithms of the user-facing sistent storage is slow compared to CPU speed [41, 42], up- LSM operations (search, update, scan). FloDB’s two layers dating data directly on disk yields a severe bottleneck. can be implemented with state-of-the-art, highly-concurrent To address this challenge, many key-value stores adopt data structures. This way, as we show in the paper, FloDB the log-structured merge (LSM) architecture [35, 36]. Ex- eliminates significant synchronization bottlenecks in classic amples include LevelDB [5], RocksDB [12], cLSM [26], LSM designs, while offering a rich LSM API. bLSM [39], HyperLevelDB [6] and HBase [11]. LSM data We implement FloDB as an extension of LevelDB, stores are suitable for applications that require low latency Google’s popular LSM key-value store. We compare FloDB’s accesses, such as message queues that undergo a high num- performance to that of state-of-the-art LSMs. In short, ber of updates, and for maintaining session states in user- FloDB’s performance is up to one order of magnitude higher facing applications [12]. Basically, the LSM architecture than that of the next best-performing competitor in a wide masks the disk access bottleneck, on the one hand, by range of multi-threaded workloads. caching reads and, on the other hand, by absorbing writes in memory and writing to disk in batches at a later time. ∗ The project was completed while the author was at EPFL. Although LSM key-value stores go a long way addressing y Authors appear in alphabetical order. the challenge posed by the I/O bottleneck, their performance does not however scale with the size of the in-memory component, nor does it scale up with the number of threads. In other words, and maybe surprisingly, increasing the in- memory parts of existing LSMs only benefits performance Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and up to a relatively small size. Similarly, adding threads does the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires not improve the throughput of many existing LSMs, due to prior specific permission and/or a fee. Request permissions from [email protected]. their use of global blocking synchronization. EuroSys ’17, April 23-26, 2017, Belgrade, Serbia © 2017 ACM. ISBN 978-1-4503-4938-3/17/04. $15.00 As we discuss in this paper, the two aforementioned scal- DOI: http://dx.doi.org/10.1145/3064176.3064193 ability limitations are inherent to the design of traditional 80 LSM architectures. We circumvent these limitations by in- obtains up to one order of magnitude better throughput than troducing FloDB, a novel LSM memory component, de- its highest performing competitor. signed to scale up with the number of threads as well as with To summarize, our contributions are threefold: its size in memory. Traditionally, LSMs have been employ- • FloDB, a two-level architecture for the log-structured ing a two-tier storage hierarchy, with one level in memory merge memory component. FloDB scales with the amount and one level on disk. We propose an additional in-memory of main memory, and has a rich API, where reads, writes level. In other words, FloDB is a two-level memory com- and scans can all proceed concurrently. ponent on top of the disk component (Figure 1), each in- • A publicly available implementation in C++ of the FloDB memory level being a concurrent data structure [2, 3, 23, 29]. architecture as an extension of LevelDB, as well as an The top in-memory level is a small and fast data structure, evaluation thereof with up to 192GB memory compo- while the bottom in-memory level is a larger, sorted data nent size on a Xeon multicore machine. structure. New entries are inserted in the top level and then • An algorithm for a novel multi-insert operation for con- drained to the bottom level in the background, before being current skiplists. Besides FloDB, multi-insert can benefit stored onto disk. any other LSM architecture employing a skiplist. This scheme has several advantages. First, it allows scans Despite its advantages, we do not claim FloDB is a sil- and writes to proceed in parallel, on the bottom and on the ver bullet; it does have limitations. First, as in all LSMs, top levels respectively. Second, the use of a small, fast top- the steady-state throughput is bounded by the slowest com- level enables low-latency updates regardless of the size of ponent in the hierarchy: writing the memory component to the memory component, whereas existing LSMs see their disk. The improvements FloDB makes in the memory com- performance drop as the memory component size increases. ponent are orthogonal to potential disk-level improvements, Larger memory components can support longer bursts of which go beyond the scope of this paper. Disk-level im- writes at peak throughput. Third, maintaining the bottom provements could be used in conjunction with FloDB to memory layer sorted allows flushing to disk to proceed with- further boost the overall performance of LSMs. Second, out an additional—and expensive—sorting step. it is possible to devise workloads that prove problematic We implement FloDB using a small high-performance for the scans of FloDB. For example, in write-intensive concurrent hash table [21] for the top in-memory level and workloads with heavy contention the performance of long- a larger concurrent skiplist [29] for the bottom in-memory running scans decreases. Third, for datasets much larger than level. At a first glance, it might seem that implementing main memory, FloDB improves read performance less than FloDB requires nothing more than adding an extra hash write performance. This is due to the fact that in LSMs read table-based buffer level on top of an existing LSM archi- performance largely depends on the effectiveness of caching, tecture. However, this seemingly small step entails subtle which is also outside the scope of our paper. technical challenges. The first challenge is to ensure effi- Roadmap. The rest of this paper is organized as follows. In cient data flow between the two in-memory levels, so as to Section 2, we give an overview of state-of-the-art LSMs and take full advantage of the additional buffer while not deplet- their limitations. In Section 3, we present the FloDB archi- ing system resources. To this end, we introduce the multi- tecture and the flow of its operations. In Section 4, we de- insert operation, a novel operation for concurrent skiplists. scribe in more details the technical solutions we adopted in The main idea is to insert n sorted elements in the skiplist in FloDB. In Section 5, we report on the evaluation of FloDB’s one operation, using the spot of the previously inserted ele- performance. In Section 6, we discuss related work and we ment as a starting point for inserting the next elements, thus conclude in Section 7. reusing already traversed hops. The skiplist multi-insert is of independent interest and can benefit previous single-writer 2. Shortcomings of Current LSMs LSM implementations as well, by increasing the throughput of updates to the memory component. The second challenge In this section, we give an overview of current LSM-based is ensuring the consistency of user-facing LSM operations key-value stores and highlight two of their significant limi- while enabling a high level of concurrency between these tations that we address with this work. operations. In particular, FloDB is the first LSM system to si- 2.1 LSM Key-Value Store Overview multaneously support consistent scans and in-place updates.

Flodb: Unlocking Memory in Persistent Key-Value Stores

Learning React Functional Web Development with React and Redux

UNIVERSITY of CALIFORNIA SAN DIEGO Simplifying Datacenter Fault

Crawling Code Review Data from Phabricator

Artificial Intelligence for Understanding Large and Complex

Unicorn: a System for Searching the Social Graph

2.3 Apache Chemistry

Learning React Modern Patterns for Developing React Apps

Elmulating Javascript

Ecological Momentary Assessments from a Cross-Platform, Native Application

“They Keep Coming Back Like Zombies”: Improving Software

Realtime Data Processing at Facebook

The Compassion Project