
Faster: A Concurrent Key-Value Store with In-Place Updates

Badrish Chandramouli†, Guna Prasaad‡∗, Donald Kossmann†, Justin Levandoski†, James Hunter†, Mike Barnett†

†Microsoft Research  ‡University of Washington
[email protected], [email protected], {donaldk, justin.levandoski, jahunter, mbarnett}@microsoft.com

ABSTRACT

Over the last decade, there has been a tremendous growth in data-intensive applications and services in the cloud. Data is created on a variety of edge sources, e.g., devices, browsers, and servers, and processed by cloud applications to gain insights or take decisions. Applications and services either work on collected data, or monitor and process data in real time. These applications are typically update intensive and involve a large amount of state beyond what can fit in main memory. However, they display significant temporal locality in their access pattern. This paper presents Faster, a new key-value store for point read, blind update, and read-modify-write operations. Faster combines a highly cache-optimized concurrent hash index with a hybrid log: a concurrent log-structured record store that spans main memory and storage, while supporting fast in-place updates of the hot set in memory. Experiments show that Faster achieves orders-of-magnitude better throughput – up to 160M operations per second on a single machine – than alternative systems deployed widely today, and exceeds the performance of pure in-memory data structures when the workload fits in memory.

ACM Reference Format:
Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, Mike Barnett. 2018. Faster: A Concurrent Key-Value Store with In-Place Updates. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196898

∗ Work performed during internship at Microsoft Research.

SIGMOD'18, June 10–15, 2018, Houston, TX, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-4703-7/18/06...$15.00
https://doi.org/10.1145/3183713.3196898

1 INTRODUCTION

There has recently been a tremendous growth in data-intensive applications and services in the cloud. Data created on a variety of edge sources such as Internet-of-Things devices, browsers, and servers is processed by applications in the cloud to gain insights. Applications, platforms, and services may work offline on collected data (e.g., in Hadoop [13] or Spark [42]), monitor and process data in real time as it arrives (e.g., in streaming dataflows [12, 42] and actor-based application frameworks [2]), or accept a stream of record inserts, updates, and reads (e.g., in object stores [30] and web caches [36]).

1.1 Challenges and Existing Solutions

State management is an important problem for all these applications and services. It exhibits several unique characteristics:
• Large State: The amount of state accessed by some applications can be very large, far exceeding the capacity of main memory. For example, a targeted search ads provider may maintain per-user, per-ad, and clickthrough-rate statistics for billions of users. Also, it is often cheaper to retain state that is infrequently accessed on secondary storage [18], even when it fits in memory.
• Update Intensity: While reads and inserts are common, there are applications with significant update traffic. For example, a monitoring application receiving millions of CPU readings every second from sensors and devices may need to update a per-device aggregate for each reading.
• Locality: Even though billions of state objects may be alive at any given point, only a small fraction is typically "hot" and accessed or updated frequently, with a strong temporal locality. For instance, a service that tracks per-user statistics (averaged over one week) may have a billion users "alive" in the system, but only a million users actively surfing at a given instant. Further, the hot set may drift over time; in our example, as users start and stop browsing sessions.
• Point Operations: Given that state consists of a large number of independent objects that are inserted, updated, and queried, a system tuned for (hash-based) point operations is often sufficient. If range queries are infrequent, they can be served with simple workarounds such as indexing histograms of key ranges.
• Analytics Readiness: Updates to state should be readily available for subsequent offline analytics; e.g., to compute average ad clickthrough-rate drift over time.

A simple solution adopted by many systems is to partition the state across multiple machines, and use pure in-memory data structures, such as the Intel TBB Hash Map [5], that are optimized for concurrency and support in-place updates – where data is modified at its current memory location without creating a new copy elsewhere – to achieve high performance. However, the overall solution is expensive and often severely under-utilizes the resources on each machine. For example, the ad serving platform of a search engine may partition its state across the main memory of hundreds of machines, resulting in a low per-machine request rate and poorly utilized compute resources. Further, pure in-memory data structures make recovery from failures complicated.

Key-value stores are a popular alternative for state management. A key-value store is designed to handle larger-than-memory [28] data, and supports failure recovery by storing data on secondary storage. Many such key-value stores [25, 34, 40] have been proposed in the past. However, these systems are usually optimized for blind updates, reads, and range scans, rather than point operations and read-modify-writes (e.g., for updating aggregates). Hence, these systems do not scale to more than a few million updates per second [8], even when the hot set fits entirely in main memory. Caching systems such as Redis [36] and Memcached [30] are optimized for point operations, but are slow and depend on an external system such as a database or key-value store for storage and/or failure recovery.

The combination of concurrency, in-place updates (in memory), and the ability to handle data larger than memory is important in our target applications; but these features are not simultaneously met by existing systems. We discuss related work in detail in Sec. 8.

Figure 1: Overall Faster architecture.

Allocator           | Concurrent & Latch-free | Larger-Than-Memory | In-Place Updates
§4 In-Memory        | ✓                       |                    | ✓
§5 Append-Only-Log  | ✓                       | ✓                  |
§6 Hybrid-Log       | ✓                       | ✓                  | ✓

1.2 Introducing Faster

This paper describes a new concurrent key-value store called Faster, designed to serve applications that involve update-intensive state management. The Faster interface (Sec. 2.2 and Appendix E) supports, in addition to reads, two types of state updates seen in practice: blind updates, where an old value is replaced by a new value blindly; and read-modify-writes (RMWs), where the value is atomically updated based on the current value and an input (optional). RMW updates, in particular, enable us to support partial updates (e.g., updating a single field in the value) as well as mergeable aggregates [39] (e.g., sum, count). Being a point-operations store, Faster achieves an in-memory throughput of 100s of millions of operations per second. Retaining such high performance while supporting data larger than memory required a careful design and architecture.

Faster follows the design philosophy of making the common case fast. By carefully (1) providing fast point-access to records using a cache-efficient concurrent latch-free hash index; (2) choosing when and how expensive or uncommon activities (such as index resizing, checkpointing, and evicting data from memory) are performed; and (3) allowing threads to perform in-place updates most of the time, Faster exceeds the throughput of pure in-memory systems for in-memory workloads, while supporting data larger than memory and adapting to a changing hot set.

Implemented as a high-level-language component using dynamic code generation, Faster blurs the line between traditional key-value stores and update-only "state stores" used in streaming systems (e.g., Spark State Store [14] and Flink [12]). Further, HybridLog is record-oriented and approximately time-ordered, and can be fed into scan-based log analytics systems (cf. Appendix F).

We make the following contributions:
• (Sec. 2) Towards a scalable threading model, we augment standard epoch-based synchronization into a generic framework that facilitates lazy propagation of global changes to all threads via trigger actions. This framework provides Faster threads with unrestricted access to memory under the safety of epoch protection. Throughout the paper, we highlight instances where this generalization helped simplify our scalable concurrent design.
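The two update flavors introduced in Sec. 1.2 – blind updates and RMWs – can be illustrated with a toy, single-threaded counter map (hypothetical names and a plain hash map; Faster's actual generated interface appears in Sec. 2.2 and Appendix E):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy illustration of the two update flavors; not Faster's real API.
struct CounterStore {
    std::unordered_map<uint64_t, uint64_t> map;

    // Blind update: replace the value regardless of what is there.
    void Upsert(uint64_t key, uint64_t value) { map[key] = value; }

    // Read-modify-write: the new value is derived from the current value
    // and an input; an initial value (0 here) is used on first insert.
    void Rmw(uint64_t key, uint64_t input) {
        auto it = map.find(key);
        if (it == map.end()) map[key] = input;  // initial value 0 + input
        else it->second += input;               // update logic: addition
    }

    std::optional<uint64_t> Read(uint64_t key) const {
        auto it = map.find(key);
        if (it == map.end()) return std::nullopt;
        return it->second;
    }
};
```

Because the RMW logic here (addition) is decomposable into mergeable partial values, it is exactly the kind of update the paper later labels a CRDT-style mergeable aggregate.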
• (Sec. 3 and 4) We present the design of a concurrent, latch-free, resizable, cache-friendly hash index for Faster. When coupled with a standard in-memory record allocator, it serves as an in-memory key-value store, which we found to be more performant and scalable than popular pure in-memory data structures.
• (Sec. 5 and 6) Log-structuring [24, 38] is a well-known technique for handling data larger than memory and supporting easy failure recovery. It is based on the read-copy-update strategy, in which updates to a record are made on a new copy on the log. We find that such a design could limit throughput and scalability in Faster. As noted earlier, in-place updates are critical for performance. In this respect, we propose HybridLog: a new hybrid log that seamlessly combines in-place updates with a traditional append-only log. The HybridLog organization allows Faster to perform in-place updates of "hot" records and use read-copy-updates for colder records. Further, it acts as an efficient cache by shaping what resides in memory without any per-record or per-page statistics. Doing so concurrently required us to solve novel technical challenges. Sec. 6.5 sketches the consistency guarantees of Faster in the presence of failures.
• (Sec. 7) We perform a detailed evaluation using the YCSB [4] benchmark, and evaluate the "up to orders-of-magnitude" better performance and near-linear scalability of Faster with HybridLog. Further, we use simulations to better understand the natural caching behavior of HybridLog in comparison to more expensive, state-of-the-art caching techniques.

The main result shown in this paper with the help of Faster and HybridLog is that it is possible to have it all: high update rates, low cost by limiting the memory footprint, support for larger-than-memory data, and performance that exceeds pure in-memory data structures when the working set fits in memory.

2 SYSTEM OVERVIEW

Faster is a concurrent latch-free key-value store that is designed for high performance and scalability across threads. We use latch-free atomic operations such as compare-and-swap, fetch-and-add, and fetch-and-increment (cf. Appendix A) extensively in our design, and heavily leverage an extended epoch-based synchronization framework (Sec. 2.3) to help us support in-place updates.

2.1 Faster Architecture

Fig. 1 shows the overall architecture of Faster. It consists of a hash index that holds pointers to key-value records, and a record allocator that allocates and manages individual records. The index (Sec. 3) provides very efficient hash-based access to hash buckets. The hash bucket is a cache-line-sized array of hash bucket entries. Each entry includes some metadata and an address (either logical or physical) provided by a record allocator. The record allocator stores and manages individual records. Hash collisions that are not resolved at the index level are handled by organizing records as a linked-list. Each record consists of a record header, key, and value. Keys and values may be fixed- or variable-sized. The header contains some metadata and a pointer to the previous record in the linked-list. Note that keys are not part of the Faster hash index, unlike many traditional designs, which provides two benefits:
• It reduces the in-memory footprint of the hash index, allowing us to retain it entirely in memory.
• It separates user data and index metadata, which allows us to mix and match the hash index with different record allocators.

We describe three allocators in this paper (the table in Fig. 1 summarizes their capabilities): an in-memory allocator (Sec. 4) that enables latch-free access and in-place updates to records; an append-only log-structured allocator (Sec. 5) that provides latch-free access and can handle data larger than main memory, but without in-place updates; and finally a novel hybrid-log allocator (Sec. 6) that combines latch-free concurrent access with in-place updates and the ability to handle larger-than-memory data.

2.2 User Interface

Faster supports reads, advanced user-defined updates, and deletes. We use dynamic code generation to integrate the update logic, provided as user-defined delegates at compile time, into the store, resulting in a highly efficient store with native support for advanced updates. We cover code generation in Appendix E. The generated Faster runtime interface consists of the following operations:
• Read: Read the value corresponding to a key.
• Upsert: Replace the value corresponding to a key with a new value blindly (i.e., regardless of the existing value). Insert as new, if the key does not exist.
• RMW: Update the value of a key based on the existing value and an input (optional) using the update logic provided by the user at compile time. We call this a Read-Modify-Write (RMW) operation. The user also provides an initial value for the update, which is used when a key does not exist in the store. Additionally, users can indicate at compile time that an RMW operation is mergeable, also called a CRDT [39] (for conflict-free replicated datatype). Such a data type can be computed as partial values that can later be merged to obtain the final value. For example, a summation-based update can be computed as partial sums, and these can be summed up for the final value.
• Delete: Delete a key from the store.

Further, some operations may go pending in Faster due to several reasons covered in this paper. Faster returns a PENDING status in such cases; threads issue a CompletePending request periodically to process outstanding pending operations related to that thread.

2.3 Epoch Protection Framework

Faster is based on a key design principle aimed at scalability: avoid expensive coordination between threads in the common fast access path. Faster threads perform operations independently with no synchronization most of the time. At the same time, they need to agree on a common mechanism to synchronize on shared system state. To achieve these goals, we extend the idea of multi-threaded epoch protection [23] into a framework enabling lazy synchronization over arbitrary global actions. While systems like Silo [41], Masstree [29], and Bw-Tree [25] have used epochs for specific purposes, we extend it to a generic framework that can serve as a building block for Faster and other parallel systems. We describe epoch protection next, and depict its use as a framework in Sec. 2.4.

Epoch Basics. The system maintains a shared atomic counter E, called the current epoch, that can be incremented by any thread. Every thread T has a thread-local version of E, denoted by ET. Threads refresh their local epoch values periodically. All thread-local epoch values ET are stored in a shared epoch table, with one cache-line per thread. An epoch c is said to be safe if all threads have a strictly higher thread-local value than c, i.e., ∀T : ET > c. Note that if epoch c is safe, all epochs less than c are safe as well. We additionally maintain a global counter Es, which tracks the current maximal safe epoch. Es is computed by scanning all entries in the epoch table and is updated whenever a thread refreshes its epoch. The system maintains the following invariant: ∀T : Es < ET ≤ E.

Trigger Actions. We augment the epoch framework with the ability to execute arbitrary global actions when an epoch becomes safe, using trigger actions. When incrementing the current epoch, say from c to c + 1, threads can additionally associate an action that will be triggered by the system at a future instant of time when epoch c is safe. This is enabled using the drain-list, a list of ⟨epoch, action⟩ pairs, where action is the callback code fragment that must be invoked after epoch is safe. It is implemented using a small array that is scanned for actions ready to be triggered whenever Es is updated. We use atomic compare-and-swap on the array to ensure an action is executed exactly once. We recompute Es and scan through the drain-list only when there is a change in the current epoch, to enhance scalability.

2.4 Using the Epoch Framework

We expose the epoch protection framework using the following four operations that can be invoked by any thread T:
• Acquire: Reserve an entry for T and set ET to E.
• Refresh: Update ET to E, update Es to the current maximal safe epoch, and trigger any ready actions in the drain-list.
• BumpEpoch(Action): Increment counter E from its current value c to (c + 1) and add ⟨c, Action⟩ to the drain-list.
• Release: Remove the entry for T from the epoch table.

Epochs with trigger actions can be used to simplify lazy synchronization in parallel systems. Consider a canonical example, where a function active-now must be invoked when a shared variable status is updated to active. A thread updates status to active atomically and bumps the epoch with active-now as the trigger action. Not all threads will observe this change in status immediately. However, all of them are guaranteed to have observed it when they refresh their epochs (due to sequential memory consistency using memory fences). Thus, active-now will be invoked only after all threads see the status to be active, and hence is safe.

We use the epoch framework in Faster to coordinate system operations such as memory-safe garbage collection (Sec. 4), index resizing (Appendix B), circular buffer maintenance and page flushing (Sec. 5), shared log page boundary maintenance (Sec. 6.2), and checkpointing (Sec. 6.5), while at the same time providing Faster threads unrestricted latch-free access to shared memory locations in short bursts for user operations such as reads and updates.
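The four operations and the drain-list can be sketched as follows. This is a deliberately simplified, illustrative variant: it guards the drain-list with a mutex (the paper's version uses compare-and-swap on a small array) and recomputes the safe epoch on every refresh; all names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

// Simplified epoch-protection sketch: a shared current epoch E, a table of
// per-thread local epochs, and a drain-list of <epoch, action> pairs whose
// actions fire once their epoch becomes safe.
class EpochManager {
    static constexpr uint64_t kIdle = UINT64_MAX;
    std::atomic<uint64_t> current_{1};
    std::vector<std::atomic<uint64_t>> table_;
    std::mutex drain_mu_;  // illustration only; the paper uses CAS on an array
    std::vector<std::pair<uint64_t, std::function<void()>>> drain_;

public:
    explicit EpochManager(size_t max_threads) : table_(max_threads) {
        for (auto& e : table_) e.store(kIdle);
    }
    void Acquire(size_t t) { table_[t].store(current_.load()); }
    void Release(size_t t) { table_[t].store(kIdle); }

    // Safe epoch Es: strictly below every registered thread's local epoch.
    uint64_t SafeEpoch() {
        uint64_t safe = current_.load() - 1;
        for (auto& e : table_) {
            uint64_t v = e.load();
            if (v != kIdle && v - 1 < safe) safe = v - 1;
        }
        return safe;
    }
    // Increment E from c to c+1 and remember <c, action> in the drain-list.
    void BumpEpoch(std::function<void()> action) {
        uint64_t c = current_.fetch_add(1);
        std::lock_guard<std::mutex> g(drain_mu_);
        drain_.emplace_back(c, std::move(action));
    }
    // Move thread t to the current epoch and fire any ready actions.
    void Refresh(size_t t) {
        table_[t].store(current_.load());
        uint64_t safe = SafeEpoch();
        std::lock_guard<std::mutex> g(drain_mu_);
        for (size_t i = 0; i < drain_.size();) {
            if (drain_[i].first <= safe) {
                drain_[i].second();  // trigger the action exactly once
                drain_[i] = std::move(drain_.back());
                drain_.pop_back();
            } else ++i;
        }
    }
};
```

In the active-now example above, the updating thread would call `BumpEpoch(active_now)`; the action runs only after every registered thread has refreshed past the bumped epoch.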
2.5 Lifecycle of a Faster Thread

As our running example in the paper, we use Faster to implement a count store, in which a set of Faster user threads increment the counter associated with incoming key requests. A thread calls Acquire to register itself with the epoch mechanism. Next, it issues a sequence of user operations, along with periodic invocations of Refresh (e.g., every 256 operations) to move the thread to the current epoch, and CompletePending (e.g., every 64K operations) to handle any prior pending operations. Finally, the thread calls Release to deregister itself from using Faster.

3 THE FASTER HASH INDEX

A key building block of Faster is the hash index: a concurrent, latch-free, scalable, and resizable hash-based index. As described in Sec. 2, the index works with a record allocator that returns logical or physical memory pointers. For ease of exposition, we assume a 64-bit machine with 64-byte cache lines. On larger architectures, we expect to have larger atomic operations [19], allowing our design to scale. In Sec. 4, 5, and 6, we will pair this index with different allocators to create key-value stores with increasing capabilities.

Figure 2: Detailed Faster index and record format.
Figure 3: (a) Insert bug; (b) Thread ordering in our solution.

3.1 Index Organization

The Faster index is a cache-aligned array of 2^k hash buckets, where each bucket has the size and alignment of a cache line (Fig. 2). Thus, a 64-byte bucket consists of seven 8-byte hash bucket entries and one 8-byte entry that serves as an overflow bucket pointer. Each overflow bucket has the size and alignment of a cache line as well, and is allocated on demand using an in-memory allocator.

The choice of 8-byte entries is critical, as it allows us to operate latch-free on the entries using 64-bit atomic compare-and-swap operations. On a 64-bit machine, physical addresses typically take up fewer than 64 bits; e.g., Intel machines use 48-bit pointers. Thus, we can steal the additional bits for index operations (at least one bit is required for Faster). We use 48-bit pointers in the rest of the paper, but note that we can support pointers up to 63 bits long.

Each hash bucket entry (Fig. 2) consists of three parts: a tag (15 bits), a tentative bit, and the address (48 bits). An entry with value 0 (zero) indicates an empty slot. In an index with 2^k hash buckets, the tag is used to increase the effective hashing resolution of the index from k bits to k + 15 bits, which improves performance by reducing hash collisions. The hash bucket for a key with hash value h is first identified using the first k bits of h, called the offset of h. The next 15 bits of h are called the tag of h. Tags only serve to increase the hashing resolution, and may be smaller, or removed entirely, depending on the size of the address. The tentative bit is necessary for insert, and will be covered shortly.

3.2 Index Operations

The Faster index is based on the invariant that each (offset, tag) has a unique index entry, which points to the set of records whose keys hash to the same offset and tag. Ensuring this invariant while supporting concurrent latch-free reads, inserts, and deletes of index entries is challenging.

Finding and Deleting an Entry. Locating the entry corresponding to a key is straightforward: we identify the hash bucket using the k hash bits, and scan through the bucket to find an entry that matches the tag. Deleting an entry from the index is also easy: we use compare-and-swap to replace the matching entry (if any) with zero.

Inserting an Entry. Consider the case where a tag does not exist in the bucket, and a new entry has to be inserted. A naive approach is to look for an empty entry and insert the tag using a compare-and-swap. However, two threads could concurrently insert the same tag at two different empty slots in the bucket, breaking our invariant. As a workaround, consider a solution where every thread scans the bucket from left to right, and deterministically chooses the first empty entry as the target. They will compete for the insert using compare-and-swap, and only one will succeed. Even this approach violates the invariant in the presence of deletes, as shown in Fig. 3a. A thread T1 scans the bucket from left to right and chooses slot 5 for inserting tag g5. Another thread T2 deletes tag g3 from slot 3 in the same bucket, and then tries to insert a key with the same tag g5. Scanning left to right will cause thread T2 to choose the first empty entry 3 for this tag. It can be shown that this problem exists with any algorithm that independently chooses a slot and inserts directly: to see why, note that just before thread T1 does a compare-and-swap, it may get swapped out and the database state may change arbitrarily, including another slot with the same tag.

While locking the bucket is a possible (but heavy) solution, Faster uses a latch-free two-phase insert algorithm that leverages the tentative bit entry. A thread finds an empty slot and inserts the record with the tentative bit set.
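Concretely, the offset/tag split of Sec. 3.1 can be sketched as a small bit-slicing helper. This is illustrative only: the paper does not spell out whether the "first k bits" are the low- or high-order bits of the hash; the sketch uses the low bits.

```cpp
#include <cstdint>

// Split a 64-bit hash value: k bits select one of 2^k buckets (the offset),
// and the next 15 bits form the tag stored in the 64-bit bucket entry.
struct HashParts {
    uint64_t offset;  // bucket index in the 2^k-bucket array
    uint64_t tag;     // 15-bit tag for extra hashing resolution
};

inline HashParts Split(uint64_t h, unsigned k) {
    return { h & ((1ULL << k) - 1),   // low k bits: offset
             (h >> k) & 0x7FFF };     // next 15 bits: tag
}
```

Together, offset and tag give the index an effective resolution of k + 15 bits without storing keys in the index itself.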
Entries with a set tentative bit are deemed invisible to concurrent reads and updates. We then re-scan the bucket (note that it already exists in our cache) to check whether there is another tentative entry for the same tag; if yes, we back off and retry. Otherwise, we reset the tentative bit to finalize the insert. Since every thread follows this two-phase approach, we are guaranteed to maintain our index invariant. To see why, Fig. 3b shows the ordering of operations by two threads: there exists no interleaving that could result in duplicate non-tentative tags.
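The bucket-entry layout and the two-phase insert can be sketched as follows (a single-bucket toy with hypothetical names, not Faster's actual code; on back-off it returns to the caller, which would retry its whole find-or-insert operation and then discover the competing entry):

```cpp
#include <atomic>
#include <cstdint>

// 64-bit hash bucket entry: | tag (15 bits) | tentative (1 bit) | address (48 bits) |
constexpr int kAddressBits = 48;
constexpr uint64_t kAddressMask = (1ULL << kAddressBits) - 1;
constexpr uint64_t kTentativeBit = 1ULL << kAddressBits;

inline uint64_t MakeEntry(uint64_t tag, bool tentative, uint64_t address) {
    return (tag << (kAddressBits + 1)) | (tentative ? kTentativeBit : 0) |
           (address & kAddressMask);
}
inline uint64_t TagOf(uint64_t entry) { return entry >> (kAddressBits + 1); }
inline bool IsTentative(uint64_t entry) { return (entry & kTentativeBit) != 0; }

constexpr int kSlotsPerBucket = 7;  // the eighth word is the overflow pointer

// Two-phase tentative insert into one bucket. Returns the slot index,
// -1 if the bucket is full (overflow bucket not modeled), or -2 if a
// duplicate (possibly tentative) entry with the same tag was seen, in
// which case we back off and let the caller retry.
int InsertTag(std::atomic<uint64_t>* bucket, uint64_t tag, uint64_t address) {
    // Phase 1: claim the first empty slot with the tentative bit set.
    int slot = -1;
    for (int i = 0; i < kSlotsPerBucket; i++) {
        uint64_t expected = 0;
        if (bucket[i].compare_exchange_strong(
                expected, MakeEntry(tag, true, address))) {
            slot = i;
            break;
        }
    }
    if (slot < 0) return -1;
    // Phase 2: re-scan the bucket for another entry with the same tag.
    for (int i = 0; i < kSlotsPerBucket; i++) {
        uint64_t e = bucket[i].load();
        if (i != slot && e != 0 && TagOf(e) == tag) {
            bucket[slot].store(0);  // back off: withdraw our tentative entry
            return -2;
        }
    }
    // Finalize: clearing the tentative bit makes the entry visible.
    bucket[slot].store(MakeEntry(tag, false, address));
    return slot;
}
```

Because both racing threads insert tentatively first and re-scan before finalizing, at least one of them observes the other's entry, which rules out duplicate non-tentative tags.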

3.3 Resizing and Checkpointing the Index

For applications where the number of keys may vary significantly over time, we support resizing the index on-the-fly. We leverage epoch protection and a state machine of phases to perform resizing at low overhead (cf. Appendix B). Interestingly, the use of latch-free operations always maintains the index in a consistent state, even in the presence of concurrent operations. This allows us to perform an asynchronous fuzzy checkpoint of the index without obtaining read locks, greatly simplifying recovery (cf. Sec. 6.5).

Figure 4: Tail Portion of the Log-Structured Allocator

4 AN IN-MEMORY KEY-VALUE STORE

We now build a complete in-memory key-value store using the Faster hash index from Sec. 3, along with a simple in-memory allocator such as jemalloc [1]. Records with the same (offset, tag) value are organized as a reverse singly-linked-list. The hash bucket entry points to the tail (most recent record) in the list, which in turn points to the previous record, and so on (see Fig. 1).

Deletes. We delete a record by atomically splicing it out of the linked-list using a compare-and-swap on either a record header or the hash bucket entry (for the first record). When deleting the record from a singleton linked-list, the entry is set to 0, making it available for future inserts. A deleted record cannot be immediately returned to the memory allocator because of concurrent updates at the same location. We use our epoch protection framework to solve this problem: each thread maintains a thread-local free-list (to avoid a concurrency bottleneck) of (epoch, address) pairs. When the epochs become safe, we can safely return the addresses to the allocator.
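The deferred-free scheme for deleted records can be sketched as follows (illustrative only; in Faster the safe epoch would come from the epoch framework of Sec. 2.3, and the list is thread-local):

```cpp
#include <cstdint>
#include <cstdlib>
#include <utility>
#include <vector>

// Thread-local list of <epoch, pointer> pairs: memory retired at epoch e may
// be reclaimed only once e is safe, i.e., no thread can still hold a
// reference to it from before the delete.
struct FreeList {
    std::vector<std::pair<uint64_t, void*>> pending;

    // Record a deleted address together with the epoch at which it was retired.
    void Defer(uint64_t retired_epoch, void* p) {
        pending.emplace_back(retired_epoch, p);
    }

    // Called periodically (e.g., on Refresh) with the maximal safe epoch;
    // frees everything retired at or before it. Returns the number freed.
    size_t Drain(uint64_t safe_epoch) {
        size_t freed = 0;
        for (size_t i = 0; i < pending.size();) {
            if (pending[i].first <= safe_epoch) {
                std::free(pending[i].second);
                pending[i] = pending.back();  // swap-erase, order irrelevant
                pending.pop_back();
                ++freed;
            } else {
                ++i;
            }
        }
        return freed;
    }
};
```

Keeping one such list per thread avoids a shared bottleneck on the reclamation path, at the cost of slightly delayed reclamation.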
Each record may be fixed- or variable-sized, and consists ofa 64-bit record header, 5 HANDLING LARGER DATA IN FASTER the key, and the value. The record header is shown in Fig. 2. Apart As a strawman solution, we transform our in-memory Faster from from the previous pointer, we use a few bits (invalid and tombstone) Sec. 4 into a full-fledged key-value store that can handle data larger to keep track of information that is necessary for log-structured than memory by building a log-structured record allocator. Log allocators (cf. Sec. 5 and 6). These bits are stored as part of the structuring is a well researched area [24, 25, 34], and our approach address word, but may be stored separately as well, if necessary. is a straightforward adaptation of existing techniques, augmented with epoch protection for lower synchronization overhead. Being Operations with In-Memory Allocator append-only, such a system does not perform well; we will show in Sec. 7.4.1 that it achieves a throughput of no more than 20 million In Faster, user threads read and modify record values in the safety operations per second, and does not scale with the number of of epoch protection, with record-level concurrency handled by the threads. However, this design is useful as a building block to help user’s read or update logic. For example, one could use fetch-and- explain our main contribution in Sec. 6, where we will get back add for counters, take a record-level lock, or leverage application- scalable performance using a novel hybrid log allocator. level knowledge of partitioning for latch-free updates. Operations on the store are described next. 5.1 Logical Address Space Reads. We find the matching tag entry from the index asde- We start by defining a global logical address space that spans main scribed in Sec. 3.2, and then traverse the linked-list for that entry memory and secondary storage. The record allocator allocates and to find a record with the matching key. 
returns 48-bit logical addresses corresponding to locations in this address space. Unlike our pure in-memory version, the Faster Updates and Inserts. Both blind updates (upserts) and RMW up- hash index now stores the logical address of a record instead of its dates begin by finding the hash bucket entry for the key. Ifthe physical address. The logical address space is maintained using a entry does not exist, we use the two-phase algorithm described in tail offset, which points to the next free address at the tail of the log. Sec. 3.2 to insert the tag along with the address of the new record, An additional offset, called the head offset, tracks the lowest logical into an empty slot in the hash bucket. If the entry exists, we scan address that is available in memory. The head offset is maintained the linked-list to find a record with a matching key. If such a record at an approximately constant lag from the tail offset, equal to the exists, we perform the operation in-place. A thread has guaran- memory available for the log. In order to minimize overhead, we teed access to the memory location of a record, as long as it does update it only when the tail offset crosses page boundaries. not refresh its epoch. This property allows threads to update a The contiguous address space between the current head and tail value in-place without worrying about memory safety. If such a offsets (i.e., the tail portion of the log) is present in a bounded in- record does not exist, we splice in the new record into the tail of memory circular buffer, as shown in Fig. 4. The circular buffer isa the list using a compare-and-swap. 
In our count store example, linear array of fixed-size page frames, each of size 2F bytes, that are we increment the counter value for an existing key, using either each allocated sector-aligned with the underlying storage device, Increasing Logical Address Read-Copy-Update In-Place-Update Logical Address Action Invalid Make a new record at tail-end < HeadOffset Issue Async IO Request

Stable Read-Only Mutable LA LA = 0

LA LA = ∞ < ReadOnlyOffset Make a mutable copy at tail-end

Disk In-Memory < ∞ Update in-place

in order to allow unbuffered reads and writes without additional memory copies. A logical address L greater than the head address resides in main memory at offset equal to the last F bits of L, in the page frame with position equal to L ≫ F in the circular array.
New record allocation always happens at the tail. We maintain the tail offset as two values in one word – a page number and an offset. For efficiency, a thread allocates memory using a fetch-and-add on the offset; if the offset corresponds to an allocation that would not fit on the current page, it increments the page number and resets the offset. Other threads that see a new offset greater than page size wait for the offset to become valid, and retry.

5.2 Circular Buffer Maintenance

We need to manage the off-loading of log records to secondary storage in a latch-free manner, as threads perform unrestricted memory accesses between epoch boundaries. To achieve this, we maintain two status arrays: a flush-status array tracks if the current page has been flushed to secondary storage, and a closed-status array determines if a page frame can be evicted for reuse. Since we always append to the log, a record is immutable once created. When the tail enters a new page p + 1, we bump the epoch with a flush trigger action that issues an asynchronous I/O request to flush page p to secondary storage. This action is invoked only when the epoch becomes safe – because threads refresh epochs at operation boundaries, we are guaranteed that all threads would have completed writing to addresses in page p, and the flush is safe. When the asynchronous flush operation completes, the flush-status of the page is set to flushed.
As the tail grows, an existing page frame may need to be evicted from memory, but we first need to ensure that no thread is accessing the page. Traditional designs use a latch to pin pages in the buffer pool before every access so that they are not evicted when in use. For high performance, however, we leverage epochs to manage eviction. Recall that the head offset determines if a record is available in memory. To evict pages from memory, we increment the head offset and bump the current epoch with a trigger action to set the closed-status array entry for the older page frame. When this epoch is safe, we know that all threads would have seen the updated head offset value and hence would not be accessing those memory addresses. Note that we must ensure that the to-be-evicted page is completely flushed before updating the head offset, so that threads that need those records can retrieve them from storage.

5.3 Operations with Append-Only Allocator

Blind updates simply append a new record to the tail of the log and update the hash index using a compare-and-swap as before. If the operation fails, we simply mark the log record as invalid (using a header bit) and retry the operation. Deletes insert a tombstone record (again, using a header bit), and require log garbage collection (cf. Appendix C). Read and RMW operations are similar to their in-memory counterparts described in Sec. 4. However, updates are always appended to the tail of the log, and linked to the previous record. Further, logical addresses are handled differently. For a retrieved logical address, we first check if the address is more than the current head offset. If yes, the record is in memory and we can proceed as before. If not, we issue an asynchronous read request for the record to storage. Being a record log, we retrieve only the record and not the entire logical page. In our count store example, every counter increment results in appending the new counter to the tail of the log (reading the older value from storage if necessary), followed by a compare-and-swap to update the index entry.
Every user operation is associated with a context that is used to continue the operation when the I/O completes. Each Faster thread has a thread-local pending queue of contexts of all completed asynchronous requests issued by that thread. Periodically, the thread invokes a CompletePending function to dequeue these contexts and process the continuations. Note that the continuation may need to issue further I/O operations, e.g., for a previous logical address in the linked-list of records.

6 ENABLING IN-PLACE UPDATES IN FASTER

The log allocator design presented in the previous section, in addition to handling data larger than memory, enables a latch-free access path for updates due to its append-only nature. But this comes at a cost: every update involves an atomic increment of the tail offset to create a new record, copying data from the previous location, and an atomic replace of the logical address in the hash index. Further, an append-only log grows fast, particularly with update-intensive workloads, quickly making disk I/O a bottleneck.
On the other hand, in-place updates have several advantages in such workloads: (1) frequently accessed records are likely to be available in higher levels of cache; (2) access paths for keys of different hash buckets do not collide; (3) updating parts of a larger value is efficient as it avoids copying the entire record or maintaining expensive delta chains that require compaction; and (4) most updates do not need to modify the Faster hash index.

6.1 Introducing HybridLog

HybridLog is a novel data structure that combines in-place updates (in memory) and log-structured organization (on disk) while providing latch-free concurrent access to records. HybridLog spans memory and secondary storage, where the in-memory portion acts as a cache for hot records and adapts to a changing hot set.
In HybridLog, the logical address space is divided into 3 contiguous regions: (1) stable region (2) read-only region and (3) mutable region as shown in Fig. 5. The stable region portion is on secondary storage. The in-memory portion is composed of read-only and

Figure 6: Lost Update Anomaly
Figure 7: Thread Local View of Hybrid Log Regions
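A minimal sketch of how an RMW decides its update path against the safe read-only offset, so that the lost-update race of Fig. 6 cannot occur (illustrative only; the function and marker names are assumptions, not the actual Faster API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: addresses in the fuzzy region, between the safe
// read-only offset and the read-only offset, are deferred to a pending
// queue rather than copied to the tail, avoiding the Fig. 6 anomaly.
enum class RmwAction { IssueAsyncIO, CopyToTail, GoPending, UpdateInPlace };

inline RmwAction ChooseRmwAction(uint64_t addr, uint64_t head,
                                 uint64_t safe_read_only, uint64_t read_only) {
  if (addr < head) return RmwAction::IssueAsyncIO;         // on disk
  if (addr < safe_read_only) return RmwAction::CopyToTail; // safely read-only
  if (addr < read_only) return RmwAction::GoPending;       // fuzzy: defer
  return RmwAction::UpdateInPlace;                         // mutable region
}
```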

mutable regions. Records in the mutable region can be modified in-place, while records in the read-only region cannot. In order to update a record currently in the read-only region, we follow the Read-Copy-Update (RCU) strategy: a new copy is created in the mutable region and then updated. Further updates to such a record are performed in-place, as long as it stays in the mutable region.
We implement HybridLog on the log allocator from Sec. 5 using an additional marker called the read-only offset, that corresponds to a logical address residing in the in-memory circular buffer. The region between head-offset and read-only offset is the read-only region, and the region after read-only offset is the mutable region. If a record is at a logical address more than the read-only offset, it is updated in-place. If the address is between read-only and head offset, we create an updated copy at the end of tail and update the hash index to point to the new location; if the address is less than head-offset, it is not available in memory and hence an asynchronous IO request is issued to retrieve the record from secondary storage. Once it is obtained, a new updated copy is created at the end of tail, followed by updating the hash index. This update scheme is summarized in Table 1.
The read-only offset is maintained at a constant lag from tail-offset and is updated only at page boundaries, similar to the head-offset. Since none of the pages with logical address less than the read-only offset are being updated concurrently, it is safe to flush them to secondary storage. As tail-offset grows, the read-only offset shifts along, making pages ready to be flushed. Once they are safely offloaded to disk, they can be evicted from the circular buffer (when necessary) using the head-offset and closed-status array as in Sec. 5. Thus, the read-only offset serves as a light-weight indicator of pages that are ready to be flushed to disk. Note that the read-only offset in HybridLog enables latch-free access to records in the mutable region, whereas in traditional designs, records (or pages) must be pinned in the buffer pool before updating them, to prevent concurrent updates while flushing them to disk.
The lag between read-only and tail offsets determines the division of main memory buffer capacity into fast in-place updatable and immutable read-only regions. In addition to helping flush pages safely to secondary storage, the read-only region also acts as a second-chance cache for records before being off-loaded to disk. We discuss the caching behavior and impact of sizing the hybrid log regions on Faster performance in Sec. 6.4.

6.2 Lost-Update Anomaly

The read-only offset is updated and read atomically. However, it is still possible that a thread decides on the update scheme based on a stale value of the offset, leading to incorrect execution. Consider the scenario shown in Fig. 6, from our count store example. Threads T1 and T3 obtain the same logical address L from the Faster hash index. T1 decides to do an in-place update as L is more than the current read-only offset R1. Meanwhile, a thread T2 updates the read-only offset from R1 to R2 due to shifting of tail-offset. Now, thread T3 compares L with R2 and decides to create a new record at L′ with the updated value of 5. However, thread T1 updates the value to 5 at L. All future accesses will use the value at L′ and hence we have lost the update by T1.
The above anomaly occurs because a thread T2 updates the read-only offset, while T1 is acting based on the current value. We could prevent this by obtaining a read lock on the read-only offset for the entire duration of T1's operation. However, such a locking scheme is expensive and unnecessarily delays shifting of the read-only offset, which is integral to maintaining the circular buffer. On the other hand, even if the read-only offset has shifted, the anomaly occurs because one thread (T1) makes an update decision based on a stale value and another (T3) based on the new value of the offset.
We use another marker called the safe read-only offset to eliminate such incorrect executions. This marker tracks the read-only offset that has been seen by all the threads. It is designed based on the following invariant: the safe read-only offset is the minimum value of the read-only offset seen by any active Faster thread. We maintain this using the epoch-trigger action mechanism as follows: whenever the read-only offset is updated, we bump the current epoch along with a trigger action that updates the safe read-only offset to the new value. This epoch-based update for the safe read-only offset satisfies the invariant because all threads that crossed the current epoch must have seen the new value of the read-only offset.
With an additional marker, the safe read-only offset, HybridLog is divided into 4 regions. We call the region between the safe read-only and read-only offsets the fuzzy region, because some threads might see it as after the read-only offset while some others may see it as before. Threads are guaranteed to obtain the latest values of the safe read-only and read-only offsets only when they refresh their epochs. As a result, each thread might have a thread-local view of these markers, as shown in Fig. 7. Thread T4 has the highest value of the read-only offset because it has refreshed its epoch recently, while T3 has stale values as it has not refreshed recently. However, note that the safe read-only offset for any thread is at most the minimum read-only offset (thread T3), and this is ensured by our epoch protection framework. When the logical address of a record is less than the safe read-only offset, threads may try to create a new record concurrently, and only one will succeed due to the atomic compare-and-swap operation on the Faster hash index.

6.3 Fuzzy Region

When a record falls in the fuzzy region, interestingly, different types of updates can be handled differently. Here, we classify the types of updates into three, namely blind update, read-modify-write and CRDT update. The update scheme for each of these update types is summarized in Table 2.

Table 2: Update scheme for different types of updates
  Logical Address        | Read-Modify-Write                    | CRDT Update                       | Blind Update
  Invalid                | Create a new record at tail-end      | Create a new record at tail-end   | Create a new record at tail-end
  < HeadAddress          | Issue async IO request               | Create a delta record at tail-end | Create a new record at tail-end
  < SafeReadOnlyAddress  | Add to pending list                  | Create a delta record at tail-end | Create a new record at tail-end
  < ReadOnlyAddress      | Create an updated record at tail-end | Create a delta record at tail-end | Create a new record at tail-end
  < ∞                    | Update in-place concurrently         | Update in-place concurrently      | Update in-place concurrently

Blind Update. This update does not read the old value of a key. Even if one thread is updating a previous location in-place, another thread can create a new record at the end of tail with the new value. Since the updates are issued concurrently, the semantics of the application must allow all possible serial orders. Further, we can avoid an expensive retrieval from the disk in case the record is not available in memory, as we do not need the old value.

Read-Modify-Write. This kind of update first reads and then updates a record based on the current value. Since we cannot be entirely sure that no other thread is updating a value concurrently, we cannot create a new copy at the end of tail, precisely to avoid the lost-update anomaly discussed earlier. So, we defer the update by placing the context in a pending queue to be processed later, similar to how records on storage are handled.

CRDTs. CRDT updates are RMWs, but present an interesting middle-ground between blind updates and RMWs. Recall that CRDTs can be computed as independent partial values that can later be merged to obtain the final value. Our running example (count store) is a CRDT, as multiple partial counts can be summed to obtain the overall count value. With CRDT updates, we can handle the fuzzy region similar to blind updates. When a record is in the fuzzy region (or on disk), we simply create and link a new delta record at the tail, with the update performed on the initial (empty) value. A read has to reconcile all delta records to obtain the final converged value. One can imagine a scheme that periodically collapses deltas to maintain a bound on the length of delta chains.

6.4 Analysis of the Hybrid Log

Cache Behavior and Shaping of the Log. The in-memory portion of a key-value store acts like a cache, and so performance heavily depends on its efficiency. Several caching protocols have been proposed in the context of buffer pool management in databases and virtual memory management in operating systems, such as First-In First-Out (FIFO), CLOCK, Least Recently Used (LRU) and an extended version of LRU, the LRU-K [33] protocol. All of them (except FIFO) require fine-grained per-page (or per-record) statistics to work efficiently. Interestingly, Faster achieves good caching behavior at a per-record granularity without any such overheads, by virtue of the access pattern. The hybrid in-place and copy update scheme of Faster results in efficient caching, quite similar to a Second-Chance FIFO protocol. We compare these protocols using a simulation in Sec. 7.5.
Faster shapes the log based on the access pattern and helps keep the hot items in memory. Consider a write-heavy workload on our count store example (Appendix D addresses other kinds of workloads). When a record is retrieved from disk for update, the new record with the updated count is created at the end of tail. The record stays in memory and is available for in-place updates, until it enters the read-only region of the hybrid log. If a key is hot, it is likely that there is a subsequent request before it is evicted from memory, resulting in a new mutable record. This serves as a second chance for the key to remain cached in memory. Otherwise, it is evicted to disk, making space for hotter keys in memory.

Sizing the Hybrid Log Regions. Sizing the mutable and read-only regions in the hybrid log allocator is important. One extreme (lag = 0) is an append-only store, while the other extreme (lag = buffer-size) is an in-memory store when data fits in memory. The size of the read-only region determines the degree of second chance provided to a record to stay cached in memory. A smaller read-only (or larger mutable) region results in better in-memory performance due to in-place updates. However, a hot record might be evicted to disk simply because there was no access to that key for a very short time. A larger read-only region, on the other hand, results in expensive append-only updates, causing the log to grow faster. Further, it causes a replication of records in the read-only and mutable regions, effectively reducing the in-memory cache size. We observe that, in practice, a 90 : 10 division of buffer size for the mutable and read-only regions results in good performance. We evaluate the performance impact of HybridLog region sizes in Sec. 7.4.2.

6.5 Recovery and Consistency in Faster

In the event of a failure, the unflushed tail of HybridLog is lost. However, Faster can recover to a database state that is consistent with respect to the monotonicity property: for any two update requests r1 and r2 issued (in order) by a thread, the state after recovery includes the effects of (1) none; (2) only r1; or (3) both r1 and r2. In other words, the state after recovery cannot include the effects of r2 without also including r1. We can achieve this property using a Write-Ahead-Log (WAL) that logs all the modifications due to a request, similar to traditional databases and modern key-value stores such as RocksDB. Applications can periodically obtain a fuzzy checkpoint of Faster in memory, which can then be used in combination with the WAL to recover to a consistent state. Recovering from a fuzzy checkpoint using a WAL is a well-studied problem [32, 37], and hence we do not cover it in this paper.

Eliminating the WAL. Having a separate WAL could introduce a bottleneck for update-intensive workloads, so we have designed a recovery scheme for Faster that does not require a WAL. We sketch our solution briefly below, but leave a detailed treatment of recovery to future work. The basic idea is that we can treat HybridLog as our WAL, and delay commit in order to allow in-place updates within a limited time window.

Checkpointing Faster. While technically we can rebuild the entire hash-index from the HybridLog, checkpointing the index periodically allows faster recovery. All operations on the Faster index are performed using atomic compare-and-swap instructions. So, the checkpointing thread can read the index asynchronously without acquiring any read locks. However, since the hash index is being updated concurrently, such a checkpoint is fuzzy, and may not be consistent with respect to a location on the HybridLog. However, we can use the HybridLog to recover a consistent version of the hash index from this fuzzy checkpoint, as described next.
We record the tail-offset of the HybridLog before starting (t1) and after completing (t2) the fuzzy checkpoint. All updates to the hash index during this interval correspond only to records between t1 and t2 on the log, because in-place updates do not modify the index. However, some of these updates may be part of the fuzzy checkpoint and some may not. During recovery, we scan through the records between t1 and t2 on the HybridLog in order, and update the recovered fuzzy index wherever necessary. The resulting index is a consistent hash index that corresponds to HybridLog until t2, because all updates to hash index entries after completing the fuzzy checkpoint (and recording the tail-offset t2) correspond only to records after t2 on the log.
Finally, by moving the read-only offset of the HybridLog to t2, we get a checkpoint corresponding to location t2 in the log, after the corresponding flush to disk is complete. Note that our checkpointing algorithm can be performed in the background without quiescing the database. Every such checkpoint in Faster is incremental, as we offload only data modified since the last checkpoint. Incremental checkpointing usually requires a separate bitmap-like data structure to identify data that needs to be flushed, whereas Faster achieves this by organizing data differently.

Discussion. The state after recovery using the above technique may violate monotonicity due to in-place updates: update r1 can modify a location l1 ≥ t2, whereas a later update r2 may modify a location l2 < t2. Our checkpoint until t2, which includes l2 but not l1, violates monotonicity. Interestingly, we can restore monotonicity by using epochs and triggers so that threads can collaboratively switch over to a new version of the database, as identified by a location on HybridLog – the details are left as future work.

7 EVALUATION

We evaluate Faster in four ways. First, we compare overall throughput and multi-thread scalability of Faster to leading key-value stores when the dataset fits in memory. Next, we perform experiments with data larger than main-memory, by varying the memory budget. Third, we perform micro-benchmarks to illustrate the specific design choices of Faster and HybridLog. Finally, we use simulations to evaluate the caching behavior of Faster.

7.1 Implementation, Setup, Workloads

Implementation. We implemented Faster in C# as an embedded component that can be used with any application, and uses code generation to inline user functions for performance. Threads issue a sequence of operations for 30 secs, and we measure the number of operations completed during that period. We point Faster to a file on SSD to store the log. We assume an expiration-based garbage collection scheme (Appendix C) and do not include this cost in results. Costs related to checkpointing and recovery are not included as well. We size the in-place-updatable region at 90% of memory unless indicated otherwise. By default, we size the Faster index with #keys/2 hash bucket entries. While Faster can be paired with in-memory allocators, all experiments in this paper use HybridLog, and thus represent the complete version of the Faster key-value store that can handle data larger than main-memory.

Setup. Experiments are carried out on two identical machines, one running Windows Server 2018 (for Faster) and another running Ubuntu (for other systems, since they are optimized for Linux). Both are Dell PowerEdge R730 machines with 2.60GHz Intel Xeon CPU E5-2690 v4 CPUs. They have 2 sockets and 14 cores (28 hyperthreads) per socket. They have 256GB RAM and a 3.2TB FusionIO NVMe SSD drive for the log. We pin threads to hardware cores where possible. The two-socket experiments shard threads across sockets for a smoother gradient with an increasing number of threads. We preload input datasets into memory and run each test for 30 secs, except for RocksDB, which is described below.

Workloads. We use an extended version of the YCSB-A workload from the Yahoo Cloud Serving Benchmark [4], with 250 million distinct 8 byte keys, and value sizes of 8 bytes and 100 bytes. Workloads are described as R:BU for the fraction of reads and blind updates in the workload. Further, we add read-modify-write (RMW) updates in addition to the blind updates supported by the benchmark. Such updates are denoted as 0:100 RMW in experiments (we only experiment with 100% RMW updates in this paper). RMW updates increment a value by a number from a user-provided input array with 8 entries, to model a running per-key "sum" operation. Apart from the provided Uniform and Zipfian (θ = 0.99) distributions, we add a new distribution, hot set, in some experiments, which models a hot and cold set of keys, with items moving from cold to hot, staying hot for a while, and then becoming cold. This distribution models users of a search engine, for example.

Baseline Systems. We compare Faster with two categories of systems: pure in-memory and larger-than-memory systems. In the pure in-memory category, we evaluate against Masstree [29], a high-performance pure in-memory range index, and the Intel TBB concurrent hash map [5], a highly optimized pure in-memory hash index. In the category of systems that can handle large data, we evaluate against RocksDB [40] and Redis [36], two leading key-value stores. Note that we add comparisons to stores that employ range indices mainly because such systems are deployed widely even for point workloads, and are known to be highly optimized for main memory (Masstree) and larger-than-memory (RocksDB). We configured RocksDB with write-ahead logging and checksums disabled, with parameters as recommended on the RocksDB Wiki [9]. We used direct I/O for reads and writes, and ran each test for 10 minutes. With MassTree, we used the default configuration, pinning software threads to cores. With Intel TBB, we stored values in-line in the hash map. Redis is evaluated separately in Sec. 7.2.4.

7.2 Comparison to Existing Systems

7.2.1 Single Thread Performance. For a single thread, we use YCSB (8 byte payloads) with a 100% RMW workload, as well as


Figure 8: Throughput comparison of Faster to other systems, YCSB dataset fitting in memory. (a) Single thread; uniform distr. (b) Single thread; Zipf distr. (c) All threads; uniform distr. (d) All threads; Zipf distr.


Figure 9: Scalability with increasing #threads, YCSB dataset fitting in memory. (a) RMW updates; 8-byte payloads. (b) Blind updates; 100-byte payloads.
Figure 10: Throughput with increasing memory budget, for 27GB dataset.

varying read percentages with blind updates, with the dataset fitting in memory. The results for uniform and Zipf are shown in Fig. 8a and Fig. 8b respectively. We note that Faster is able to achieve very high single-threaded throughput, which makes it a good fit in embedded environments. Further, Faster, which handles data larger than memory, outperforms pure in-memory systems as well.

7.2.2 All Threads Performance. We use all 56 threads (on two sockets), and compare the systems with the same workload as before. The results for uniform and Zipf are shown in Fig. 8c and Fig. 8d respectively. Faster is able to achieve a throughput of up to 115M ops/sec for uniform, and 165M ops/sec for a Zipfian workload. Interestingly, Intel TBB hash does well with uniform, but faces some contention with the Zipf distribution and is unable to scale.

7.2.4 Comparison to Redis. Redis is an in-memory key-value store (or cache) that differs from Faster in three ways:
• Redis is not concurrent, and the user is expected to incur the overhead of hash partitioning the data and operations.
• While Redis offers optional recoverability, it expects that all data will fit in memory and snapshots the database using fork.
• Redis is designed to be accessed over a network, and as such is designed with this bottleneck in mind. Thus, its performance is expected to be lower than embedded systems such as Faster.
We investigated the last point by running redis-benchmark on a single thread, and with the following parameters: -d 8 -c 10 -P ${PIPELINE} -t set,get -r 1000000 -n 20000000. The
values are 8 bytes, we use 10 client threads, connected to localhost (to avoid network overhead), and ran 20M get and set operations, which access random records in the range between 0 and 1M. We varied the pipeline (batching) depth from 1 to 200. In this simplified scenario, we saw around 1.1M sets/sec and 1.4M gets/sec. For a key space of 250M (similar to our YCSB workload), we saw around 700K sets/sec and 900K gets/sec. These speeds are significantly lower than those for single-threaded Faster.
Other systems show much lower performance, as expected. We also ran experiments varying the size of the tag in the index, to check its impact on throughput. Briefly, for the YCSB 50 : 50 uniform workload on all threads, we found that even with a tag of just 1 bit (or 4 bits), performance decreased by less than 14% (or 5%), verifying that Faster can robustly handle larger address sizes.

7.2.3 Scalability. We plot performance with a Zipf distribution, with an increasing number of threads, using one CPU and using both CPUs. We first evaluate a 100% RMW YCSB workload with 8 byte payloads in Fig. 9a. We see Faster scales very well on both one CPU and two CPUs. Masstree (a pure in-memory range index) also scales well, but has much lower absolute performance. The Intel TBB hash map scales well within one CPU, but falls over at around 20 cores when running on two CPUs, likely because of locking contention with the Zipf workload. We next evaluate a 0 : 100 blind upsert workload with 100 byte payloads in Fig. 9b.

7.3 Larger-than-Memory Experiments

We run YCSB with 100 byte payloads, for a core dataset size of 27GB, and set the number of threads to 14 (7 cores) on one socket. The memory budget for Faster includes 2GB of space for the hash index, which is sized at #key/8 hash buckets for this experiment. First, we use a 50 : 50 Zipf workload, and plot throughput vs. RocksDB in Fig. 10. As expected, Faster slows down with limited
memory because of increased random reads from SSD, but quickly reaches in-memory performance levels once the entire dataset fits in memory. We believe that the steep performance drop-off as we reduce memory can be improved by optimizing the I/O path in Faster; we plan to address this problem in future work.
In this case, performance is linear until 48 threads on two CPUs, but levels off after this point because the larger 100 byte payloads cause the system to reach the maximum cross-socket available bandwidth.
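The tail allocation of Sec. 5.1, whose contention the append-only comparison in Sec. 7.4.1 measures, can be sketched roughly as follows (a simplified illustration under stated assumptions; the paper's version additionally makes threads that observe an out-of-range offset wait for the offset to become valid):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified sketch of Sec. 5.1 tail allocation: page number and page
// offset packed into one 64-bit word, bumped with fetch_add. The page
// size constant is an assumption matching the 4MB pages used in Sec. 7.4.1.
constexpr uint64_t kPageBits = 22;               // 4MB pages
constexpr uint64_t kPageSize = 1ull << kPageBits;

std::atomic<uint64_t> tail{0};                   // high bits: page; low: offset

uint64_t Allocate(uint32_t size) {
  for (;;) {
    uint64_t prev = tail.fetch_add(size);
    uint64_t offset = prev & (kPageSize - 1);
    if (offset + size <= kPageSize) return prev; // fits on the current page
    if (offset < kPageSize) {
      // This thread caused the overflow: install the next page.
      // (Simplification: the real scheme coordinates with threads that
      // saw an invalid offset and are retrying.)
      tail.store(((prev >> kPageBits) + 1) << kPageBits);
    }
  }
}
```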


Figure 11: Throughput with append-only vs. hybrid logs.
Figure 12: Effect of increasing IPU region: (a) throughput & log growth rate; (b) percentage of fuzzy ops.
Figure 13: Percentage of fuzzy ops with increasing #threads.
Figure 14: Cache miss ratio (Uniform). Figure 15: Cache miss ratio (Zipf). Figure 16: Cache miss ratio (Hot Set).

Next, with a 0:100 blind update workload, performance of Faster is slightly lower with enough memory, as expected. As the memory budget drops, more data spills to storage, but throughput does not drop by as much as it does for reads, because of the higher efficiency of bulk sequential log writes (with no random reads) to SSD. We see that RocksDB achieves around 500K operations per second at best, for both these workloads.

Finally, we use a 0:100 workload with an 80% read-only region and a uniform distribution. This stresses the write throughput of the system because of fewer in-place updates. We found Faster to reach a sequential log write bandwidth of 1.74GB/sec, close to the theoretical maximum of 2GB/sec for our SSD drive.

7.4 Detailed Evaluation of Faster

7.4.1 Comparison to Append-Only. We run YCSB-A 50:50 (50% reads and 50% blind updates) with the Faster append-only log allocator. Our circular in-memory buffer has 2^15 pages, and each page is 4MB in size. Fig. 11 shows the performance for uniform and Zipf distributions. We see that while performance scales linearly with the hybrid log, the creation of new entries in the log and the contention on the tail result in the append-only log being significantly slower. In case of the hybrid log, Zipf performs much better than uniform, because the key skew results in better TLB and cache behavior. On the other hand, with append-only, the performance benefit of a Zipf distribution is outweighed by more failed compare-and-swap operations, because of conflicting updates.

7.4.2 Size of IPU Region. We use YCSB-A with a 100% RMW workload, and vary the IPU Region Factor, defined as the fraction of the dataset that is in the in-place-update (IPU) region of the log. Fig. 12a shows the overall throughput achieved (on 56 threads) and the rate at which the log grows (on the secondary axis), as we increase the IPU Region Factor. With a uniform distribution of keys, we see that throughput increases as we increase the IPU region, because we get more opportunities to perform in-place updates. Further, the log grows more slowly, which is highly desirable. Note that we show the log growth rate for 8-byte keys and values, for a total of 24 bytes per record; the growth rate will multiply for larger payloads. Finally, the plots for a Zipfian key distribution indicate that, due to the concentration of keys, throughput is high at even lower IPU region factors than for uniform, owing to the shaping effect of the hybrid log. The log growth rate declines more rapidly as well (with increasing IPU region size), for the same reason.

7.4.3 Fuzzy Region. We execute a 100% RMW workload (uniform) on all 56 threads, and vary the IPU region size from 0.25 to 1.0 of the database. A smaller IPU region causes the log to grow faster, which in turn makes the ReadOnlyAddress move faster, which should result in more updates going pending in the fuzzy region. Fig. 12b shows the percentage of fuzzy updates with increasing IPU region size. Threads refresh their epochs every 256 operations. Even with a uniform key distribution and 56 threads, this percentage rises no more than 3%, and is higher than 0.5% only in the artificial scenario where less than 70% of memory is used for in-place updates.

Next, we keep the log growth rate fixed by fixing the IPU Region Factor at 0.8, and vary the number of threads. As expected, Fig. 13 shows that the percentage of fuzzy operations increases with the number of threads, but stays below 1% even with all 56 threads.

7.5 Simulation of Caching Behavior

We perform a simulation study to compare well-known caching protocols (Sec. 6.4) with the caching behavior of HybridLog (HLOG). We maintain a constant-sized key buffer as a cache, and use each caching protocol to evict a key whenever an accessed key is not in the buffer. For HLOG, we have a read-only marker that is at a constant lag from the tail address; when a key is in the read-only region, we copy it to the end of the tail, as in Faster. We vary the cache size and observe the cache hit rate for 3 access patterns: uniform, Zipfian (θ = 0.99), and the hot-set distribution. The hot-set distribution consists of a shifting hot set (1/5th of total size) accessed uniformly with 90% probability and the remaining cold set accessed uniformly with 10% probability. Our results are shown in Figs. 14, 15, and 16. HLOG performs as well as the other protocols for the uniform distribution. In case of the Zipfian and hot-set distributions, HLOG's cache miss rate is higher than the LRU-1, LRU-2, and CLOCK protocols due to replication of keys in memory: HLOG results in two copies of hot keys, one within the read-only region and one in the mutable region, thus reducing the effective cache size. However, HLOG is better than simple FIFO, as it gives keys a second chance to stay in the cache. Overall, HLOG's caching behavior is competitive with other optimized algorithms, without maintaining the extensive statistics that the other protocols (except FIFO) require, while still providing a latch-free fast-access path.
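The HLOG caching discipline above amounts to a FIFO of keys whose newest slots (the "mutable" region) are left untouched on a hit, while a hit in an older slot (the "read-only" region) re-appends the key at the tail, giving it a second chance. A minimal single-threaded sketch of this key-level simulation follows; the class and parameter names are ours, not from the Faster code.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Trace-driven sketch of the HLOG caching simulation (Sec. 7.5).
// The newest `mutable_size_` slots model the in-place-update region;
// older slots model the read-only region. Evictions happen at the head.
class SecondChanceFifo {
 public:
  SecondChanceFifo(std::size_t capacity, double mutable_frac)
      : cap_(capacity),
        mutable_size_(static_cast<std::size_t>(capacity * mutable_frac)) {}

  // Returns true on a cache hit.
  bool Access(int key) {
    auto it = std::find(q_.begin(), q_.end(), key);
    if (it == q_.end()) {               // miss: evict oldest, append at tail
      if (q_.size() == cap_) q_.pop_front();
      q_.push_back(key);
      return false;
    }
    // Age measured from the tail; hits in the mutable region stay in place.
    std::size_t age_from_tail = static_cast<std::size_t>(q_.end() - it) - 1;
    if (age_from_tail >= mutable_size_) {  // read-only region: copy to tail
      q_.erase(it);
      q_.push_back(key);
    }
    return true;
  }

 private:
  std::size_t cap_, mutable_size_;
  std::deque<int> q_;
};
```

Setting `mutable_frac = 1.0` makes every hit a no-op, which degenerates to the plain FIFO baseline of Figs. 14–16; smaller fractions let hot keys survive longer, at the cost of (in the real log) duplicated copies.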
8 RELATED WORK

Faster contributes to a richly researched space of concurrent system designs, data structures, key-value stores, file systems, and databases. We summarize the current state of the art next.

In-Memory Designs and Structures. Many concurrent design patterns have been proposed in the past, such as hazard pointers [31] and the repeat offender problem [20]. Epoch-based designs [16, 23] have been used by many systems [25, 29, 41] to alleviate specific bottlenecks. We augment epoch protection with trigger actions and design it as a generic framework to enable lazy synchronization; Faster uses it as a building block at several points in the paper (see Sec. 2.4). In-place updates are prevalent in pure in-memory data structures. Read-copy-update was proposed to allow readers to be unaffected by concurrent writers. Fast sequential writes in modern storage led to the popularity of log structuring, which applies read-copy-update to data on an append-only log. Log structuring was first applied to file systems [38], and was later used in key-value stores [24, 34, 35, 40]. The Faster hybrid log combines these techniques to achieve the best of both worlds: high performance for hot data and fast sequential logging to storage.

Hash Key-Value Stores. In-memory caches and stores such as Redis [36], Memcached [30], and improvements such as MemC3 [10] and MICA [27] are used to speed up web deployments and alleviate database load, but do not themselves handle data larger than memory (Redis supports a write-ahead log for recovery). Distributed systems such as RAMCloud [35] and FaRM [7] focus on scaling out the key-value store, e.g., by using partitioning, remote memory, or RDMA, rather than leveraging storage for cold data. Reported single-node performance of all these systems is lower than our target. In contrast, Faster is a single-node concurrent high-level-language component that exploits storage using a hybrid log-based data organization. Streaming state stores, such as the Spark State Store [14] and the Storm Trident State Store [15], use a simple partitioned in-memory store, synchronously checkpointed periodically. They do not support concurrent access and have low reported throughputs. Google Cloud Dataflow [17] stores recent state in memory, and offloads data periodically to Bigtable [3], resulting in potentially high overhead due to the decoupling.

Range Key-Value Stores. Systems such as Masstree [29, 41] are pure in-memory range indices, and as such target different applications. Systems such as Cassandra [11], RocksDB [34, 40], and Bw-Tree [25] are key-value stores that can handle state larger than main memory. However, they are not a good fit for our target applications due to their key-ordered page format, which is optimized for reads and range queries at the cost of greater complexity. Further, their use of read-copy-update can be expensive for update-intensive workloads. RocksDB supports in-place updates in its in-memory component (level 0), but is unable to exploit this to get acceptable performance for in-memory workloads. RocksDB generally achieves throughput less than 1M ops/sec, and its "merge" operation is expensive for RMW workloads. In contrast, Faster targets point operations and update-intensive workloads at very high performance.

Databases. In-memory databases are optimized for more general data processing needs than Faster, but cannot handle data that spills to secondary storage. ERMIA [22] is a fast memory-optimized database that uses techniques such as latch-free structures, epoch protection, and log-structuring in its design. However, it is append-only, and targets fully serializable transactions instead of atomic point operations, leading to lower expected performance for our target applications. Traditional databases handle large data using a buffer pool. Faster avoids a buffer pool and page latching, and instead uses coarse-grained regions in the log to achieve high performance while adapting to a changing working set. H-Store [21] partitions the workload and avoids concurrency, but this strategy can create shuffle overheads, load imbalance, and skew issues. Deuteronomy [26] and SQL Server Hekaton [6] use hashing to index the recovery log and the in-memory database, respectively, but are based on read-copy-update.

9 CONCLUSIONS

We present Faster, a new concurrent key-value store optimized for update-intensive applications. Faster combines concurrent latch-free execution, seamless integration of secondary storage, and in-place update capabilities to achieve very high throughput in a multi-threaded setting. Faster is based on a latch-free index that works with HybridLog, a novel concurrent log that combines an in-place-updatable region with a log-structured organization, to optimize for the hot set without any fine-grained caching statistics. Experiments show that Faster achieves orders-of-magnitude better throughput — up to 160M operations per second on a single machine — than alternative systems deployed widely today, outperforms pure in-memory data structures when the workload fits in memory, and degrades gracefully as memory becomes limited.

Acknowledgments. We would like to thank David Lomet, Umar Minhas, Rahul Potharaju, Ryan Stutsman, Kai Zeng, and the anonymous reviewers for their detailed comments, help, and support.

REFERENCES
[1] 2017. jemalloc. http://jemalloc.net/. [Online; accessed 30-Oct-2017].
[2] Phil Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. 2014. Orleans: Distributed Virtual Actors for Programmability and Scalability. Technical Report.
https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages. https://doi.org/10.1145/1365815.1365816
[4] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In SoCC '10. ACM, 143–154. https://doi.org/10.1145/1807128.1807152
[5] Intel Corporation. 2017. Intel Threading Building Blocks. https://www.threadingbuildingblocks.org/. [Online; accessed 30-Oct-2017].
[6] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server's Memory-optimized OLTP Engine. In SIGMOD '13. ACM, 1243–1254. https://doi.org/10.1145/2463676.2463710
[7] Aleksandar Dragojevic, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast Remote Memory. In NSDI '14. USENIX. https://www.microsoft.com/en-us/research/publication/farm-fast-remote-memory/
[8] Facebook. 2017. RocksDB Performance Evaluation. https://github.com/facebook/rocksdb/wiki/performance-benchmarks. [Online; accessed 30-Oct-2017].
[9] Facebook. 2017. RocksDB Wiki. https://github.com/facebook/rocksdb/wiki. [Online; accessed 30-Oct-2017].
[10] Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In NSDI '13. USENIX, 371–384.
[11] Apache Software Foundation. 2017. Apache Cassandra. http://cassandra.apache.org/. [Online; accessed 30-Oct-2017].
[12] Apache Software Foundation. 2017. Apache Flink. https://flink.apache.org/. [Online; accessed 30-Oct-2017].
[13] Apache Software Foundation. 2017. Apache Hadoop. http://hadoop.apache.org/. [Online; accessed 30-Oct-2017].
[14] Apache Software Foundation. 2017. Spark State Store: A new framework for state management for computing Streaming Aggregates. https://issues.apache.org/jira/browse/SPARK-13809. [Online; accessed 30-Oct-2017].
[15] Apache Software Foundation. 2017. Trident State Store. http://storm.apache.org/releases/current/Trident-state.html. [Online; accessed 30-Oct-2017].
[16] Keir Fraser. 2004. Practical Lock-Freedom. Technical Report.
[17] Google. 2017. Google Cloud Dataflow. https://cloud.google.com/dataflow/. [Online; accessed 30-Oct-2017].
[18] Jim Gray and Goetz Graefe. 1997. The Five-minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. SIGMOD Rec. 26, 4 (Dec. 1997), 63–68. https://doi.org/10.1145/271074.271094
[19] Timothy L. Harris, Keir Fraser, and Ian A. Pratt. 2002. A Practical Multi-Word Compare-and-Swap Operation. In DISC '02. Springer-Verlag, 265–279.
[20] Maurice Herlihy, Victor Luchangco, and Mark Moir. 2002. The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, Lock-Free Data Structures. In DISC '02. Springer-Verlag, 339–353.
[21] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-store: A High-performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1496–1499. https://doi.org/10.14778/1454159.1454211
[22] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD '16. ACM, 1675–1687. https://doi.org/10.1145/2882903.2882905
[23] H. T. Kung and Philip L. Lehman. 1980. Concurrent Manipulation of Binary Search Trees. ACM Trans. Database Syst. 5, 3 (Sept. 1980), 354–382. https://doi.org/10.1145/320613.320619
[24] Justin Levandoski, David Lomet, and Sudipta Sengupta. 2013. LLAMA: A Cache/Storage Subsystem for Modern Hardware. Proc. VLDB Endow. 6, 10 (Aug. 2013), 877–888. https://doi.org/10.14778/2536206.2536215
[25] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree: A B-tree for New Hardware Platforms. In ICDE '13. IEEE Computer Society, 302–313. https://doi.org/10.1109/ICDE.2013.6544834
[26] Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang. 2015. High Performance Transactions in Deuteronomy. In CIDR 2015. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper15.pdf
[27] Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In NSDI '14. USENIX, 429–444.
[28] Lin Ma, Joy Arulraj, Sam Zhao, Andrew Pavlo, Subramanya R. Dulloor, Michael J. Giardino, Jeff Parkhurst, Jason L. Gardner, Kshitij Doshi, and Stanley Zdonik. 2016. Larger-than-memory Data Management on Modern Storage Hardware for In-memory OLTP Database Systems. In DaMoN '16. ACM, Article 9, 7 pages. https://doi.org/10.1145/2933349.2933358
[29] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache Craftiness for Fast Multicore Key-value Storage. In EuroSys '12. ACM, 183–196. https://doi.org/10.1145/2168836.2168855
[30] Memcached. 2017. https://memcached.org/. [Online; accessed 30-Oct-2017].
[31] Maged M. Michael. 2002. Safe Memory Reclamation for Dynamic Lock-free Objects Using Atomic Reads and Writes. In PODC '02. ACM, 21–30. https://doi.org/10.1145/571825.571829
[32] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-granularity Locking and Partial Rollbacks Using Write-ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162. https://doi.org/10.1145/128765.128770
[33] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. 1993. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In SIGMOD '93. ACM, 297–306. https://doi.org/10.1145/170035.170081
[34] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The Log-structured Merge-tree (LSM-tree). Acta Inf. 33, 4 (June 1996), 351–385. https://doi.org/10.1007/s002360050048
[35] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast Crash Recovery in RAMCloud. In SOSP '11. ACM, 29–41. https://doi.org/10.1145/2043556.2043560
[36] Redis. 2017. https://redis.io/. [Online; accessed 30-Oct-2017].
[37] Kun Ren, Thaddeus Diamond, Daniel J. Abadi, and Alexander Thomson. 2016. Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems. In SIGMOD '16. ACM, 1539–1551. https://doi.org/10.1145/2882903.2915966
[38] Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-structured File System. ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26–52. https://doi.org/10.1145/146941.146943
[39] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 2011. Conflict-free Replicated Data Types. In SSS '11. Springer-Verlag, 386–400.
[40] Facebook Open Source. 2017. RocksDB. http://rocksdb.org/. [Online; accessed 30-Oct-2017].
[41] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-memory Databases. In SOSP '13. ACM, 18–32. https://doi.org/10.1145/2517349.2522713
[42] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In HotCloud '10. USENIX.

A ATOMIC OPERATIONS

Faster heavily leverages native latch-free atomic 64-bit operations such as compare-and-swap (CAS), fetch-and-add, and fetch-and-increment. CAS compares a given value to that at the location and swaps it to a desired value atomically (all or nothing). Fetch-and-add adds a given value to the value at the location and returns the original value. Similarly, fetch-and-increment atomically increments the value at the given location.
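These primitives map directly onto C++ `std::atomic`. A brief illustration of their return-value semantics — the wrapper names are ours, not Faster's:

```cpp
#include <atomic>
#include <cstdint>

// CAS: succeeds only if `loc` still holds `expected`; atomic all-or-nothing.
inline bool Cas(std::atomic<std::uint64_t>& loc,
                std::uint64_t expected, std::uint64_t desired) {
  return loc.compare_exchange_strong(expected, desired);
}

// Fetch-and-add returns the ORIGINAL value at the location.
inline std::uint64_t FetchAdd(std::atomic<std::uint64_t>& loc,
                              std::uint64_t delta) {
  return loc.fetch_add(delta);
}

// Fetch-and-increment is fetch-and-add with a delta of one.
inline std::uint64_t FetchIncrement(std::atomic<std::uint64_t>& loc) {
  return loc.fetch_add(1);
}
```

The return-the-original-value contract of fetch-and-add is what lets HybridLog's `Allocate` claim a unique tail offset per thread without a latch.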
We maintain both these val- from the head and copying over live key-values to the tail. ues (version and phase) in a single byte called ResizeStatus.A We prefer to use the expiration-based garbage collection mecha- thread reads ResizeStatus to determine what phase it is in. In the nism, as it reflects our use cases where the log is used for analytics, common stable phase, threads proceed directly with their operation and expires based on data collection guidelines. The Faster index on the active version of the table. keeps track of the earliest valid logical address, and when a thread A hash index is logically divided into n contiguous chunks, where encounters an invalid address in a hash bucket, it simply deletes it. n is set to the smaller of the maximum concurrency and the number Further, any linked list traversal of log records is stopped when it of hash buckets in the active version. Chunks serve as the granu- encounters an invalid previous logical address. larity at which threads can independently perform resizing. There is a shared pin array of n counters, which are used only during Identifying Live Values. In the roll-to-tail approach, we need to resizing to indicate the number of threads updating buckets in a identify whether a given key is live or not, in order to determine if given chunk. When a thread wishes to resize, it allocates an index we should copy it to the tail. We could traverse the linked-list for of double (or half) the size, and sets the phase to prepare-to-resize. the corresponding hash entry, but this may be expensive. Instead, It then bumps the current epoch with a future trigger action to we can reserve an overwrite bit in the record header to indicate that atomically set the phase to resizing and version to new. Threads the record has been overwritten by a subsequent operation. 
We can that are in the prepare-to-resize phase are aware that resizing is set this bit even if the record is in the read-only region (until it gets going to occur, but cannot start because other threads may not flushed to disk). On garbage collection, we perform the linked-list be aware of resizing yet. Therefore, they use fetch-and-increment scan only for records that do not have this bit set. This captures the to increment the pin count (if it is non-negative) in the pin array common case of a data item being hot and frequently updated, and entry corresponding to the chunk (in the old version) that they are then suddenly becoming cold — all earlier versions of the record operating over. Similarly, they decrement the pin count after their would have the overwrite bit set, assuming that the record was operation. hot enough to get copied over to the tail before being flushed. The Threads that are in the resizing phase know that all threads are final version of the record (now cold) likely has an entry inthe using the pin array. Therefore, they compare-and-swap the pin in-memory index, allowing us to avoid a random seek into the log. count of the chunk from 0 to −∞ (until successful) to indicate that they are starting to resize that chunk. Threads in the prepare-to- resize phase that see a negative pin count refresh their value of D HANDLING READ-HOT RECORDS ResizeStatus to enter the resizing state immediately. The single HybridLog design we present in the paper works well When splitting a chunk, a thread iterates over the records in for update-mostly workloads. Reads are simply treated as updates each hash bucket and copies over entries to one of two destination and copied over to the tail of HybridLog. Interestingly, this is a hash buckets in the new index (merging works similarly). 
Finally, good solution for read-mostly workloads where the working set fits it increments a counter (numChunks) to indicate that the chunk in memory as well, because the read-hot records get clustered into is done. Threads co-operatively grab other chunks to resize if the the tail of HybridLog in memory, and provide good in-memory chunk they are accessing is being resized by another thread (indi- performance without significant log growth. cated by a pin count of −∞). Finally, when numChunks reaches n, For a mixed workload with a non-trivial number of read-hot we are done with resizing, and can set ResizeStatus to stable in records, our design can accommodate a separate read cache. In fact, order to resume high-performance normal operation. we can simply create a new instance of HybridLog for this purpose. When using Faster with HybridLog, resizing leaves records The only difference between this log and the primary HybridLog on disk untouched. A split causes both new hash entries to point is that there is no flush to disk on page eviction. Record headers in to the same disk record, whereas a merge creates a meta-record these read-only records point to the corresponding records in the pointing to two disk records, in the two prior linked-lists, and adds primary log. As in normal HybridLog, the size of the “read-only” this meta-record to the linked-list for the merged hash entry. region controls the degree of “second chance” that records get (to move back to the tail) before being evicted from the read-only cache. C GARBAGE COLLECTION FOR HYBRIDLOG For the hash index, we have two options: (1) The hash index can use an additional bit to identify which log the index address points HybridLog is a log-structured record store, and as such requires to to. When a read-only record is evicted, the index entry needs to be be trimmed from the head of the log in order not to grow indefinitely updated with the original pointer to the record on the primary log. on storage. 
Interestingly, HybridLog by its nature has lower garbage Index checkpoints need to overwrite these addresses with addresses collection overhead than traditional logs because in-place updates on the primary log. (2) We can keep a separate read-only hash index significantly reduce the rate at which the tail of the log grows.We to lookup the read-only HybridLog. Read or update operations on can garbage collect HybridLog in two ways: the main index that point to addresses on disk first check this index before issuing an I/O operation. This approach provides clean as adding key suffixes to spread the load of colliding keys are applicable, butare orthogonal to our proposal. separation, at the cost of an additional cache miss for read-hot objects. A detailed evaluation of these techniques is outside the Status Read(Key*, Input*, Output*, Context*); scope of this paper. Status Upsert(Key*, Value*, Context*); Status RMW(Key*, Input*, Context*); Status Delete(Key*, Context*); E INTERFACE AND CODE GENERATION void Acquire(); void Release(); Faster separates a compile-time interface, which accepts user-defined void CompletePending(bool wait); read and update logic in the form of functions; and a customized Read takes a key, an input, and a pre-allocated buffer for storing runtime interface, whose code is generated for an application for the output. Upsert and RMW take a key and value as parameters. the required read, upsert, and RMW operations. Threads call Acquire and Release to register and deregister with The user-defined functions are defined over five types: Key, Value, Faster. They call CompletePending regularly to continue pending Input, Output, and Context. The first two types represent the data operations. A thread may optionally block (when wait = true), stored in Faster. The Input type is used to update or read a value until all outstanding operations issued by the thread are completed. in the store. 
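Option (1) relies on a spare bit in the 64-bit index word alongside the logical address. A sketch of such tagging — the 48-bit address width and all constant and function names here are our own illustrative choices:

```cpp
#include <cstdint>

// Steal one bit of a 64-bit index word to record whether the address
// points into the read cache or the primary log.
// Illustrative layout: [ unused | readCache:1 | address:48 ].
constexpr std::uint64_t kAddressBits  = 48;
constexpr std::uint64_t kAddressMask  = (1ULL << kAddressBits) - 1;
constexpr std::uint64_t kReadCacheBit = 1ULL << kAddressBits;

inline std::uint64_t MakeEntry(std::uint64_t address, bool in_read_cache) {
  return (address & kAddressMask) | (in_read_cache ? kReadCacheBit : 0);
}

inline std::uint64_t Address(std::uint64_t entry) {
  return entry & kAddressMask;
}

inline bool InReadCache(std::uint64_t entry) {
  return (entry & kReadCacheBit) != 0;
}
```

On eviction from the read cache, the entry would be rewritten (via CAS) as `MakeEntry(primary_log_address, false)`, which is also the form an index checkpoint must persist.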
E INTERFACE AND CODE GENERATION

Faster separates a compile-time interface, which accepts user-defined read and update logic in the form of functions, from a customized runtime interface, whose code is generated for an application for the required read, upsert, and RMW operations.

The user-defined functions are defined over five types: Key, Value, Input, Output, and Context. The first two types represent the data stored in Faster. The Input type is used to update or read a value in the store. For instance, we may have a sequence of CPU readings used to update a per-device average: here, the key is a device-id (long), the input is the reading (int), and the value is the average CPU utilization (float). The Output type is for the output read (or computed) from the value and an (optional) input. For example, the input could be a field id that selects a field to be copied from the value on a read. Finally, the Context type represents user state that is used to relate asynchronous callbacks with their corresponding original user operation.

void CompletionCallback(Context*);

// Read functions
void SingleReader(Key*, Input*, Value*, Output*);
void ConcurrentReader(Key*, Input*, Value*, Output*);

// Upsert functions
void SingleWriter(Key*, Value*, Value*);
void ConcurrentWriter(Key*, Value*, Value*);

// RMW functions
void InitialUpdater(Key*, Input*, Value*);
void InPlaceUpdater(Key*, Input*, Value*);
void CopyUpdater(Key*, Input*, Value*, Value*);

For functions that have two parameters of type Value, the first represents the old value and the second represents the new, updated value. Faster invokes CompletionCallback with the user-provided context associated with a pending operation when it completes. To support reads, the user defines two functions. The first, SingleReader, takes a key, an input, and the current value, and allows the user to populate a pre-allocated output buffer; the system guarantees read-only access to the value during the operation. The second, ConcurrentReader, is similar, but may be invoked concurrently with updates or writes; the user is expected to handle concurrency (e.g., using an S-X lock).

Faster supports two kinds of updates: upserts and RMWs. An upsert requires two functions: SingleWriter overwrites the value with a new value, where Faster guarantees exclusive write access; ConcurrentWriter may be called (as its name implies) concurrently with other reads and writes. An RMW requires three update functions: an InitialUpdater to populate the initial value, an InPlaceUpdater to update an existing value in place, and a CopyUpdater to write the updated value into a new location based on the existing value and the input. Initial and copy updaters are guaranteed exclusive access to the value, whereas in-place updaters may be invoked concurrently. Users can optionally indicate that an RMW is mergeable, which allows Faster to apply CRDT optimizations.

We use these functions to generate the Faster runtime interface:

Status Read(Key*, Input*, Output*, Context*);
Status Upsert(Key*, Value*, Context*);
Status RMW(Key*, Input*, Context*);
Status Delete(Key*, Context*);
void Acquire();
void Release();
void CompletePending(bool wait);

Read takes a key, an input, and a pre-allocated buffer for storing the output. Upsert and RMW take a key and value as parameters. Threads call Acquire and Release to register and deregister with Faster. They call CompletePending regularly to continue pending operations. A thread may optionally block (when wait = true) until all outstanding operations issued by the thread are completed.

While it is possible to implement these advanced operations on top of a simple key-value interface, such layering adds significant overhead to end-to-end application performance. For example, one might choose to use an atomic fetch-and-add instead of latches to build a sum-based update store, use non-latched operations in SingleReader and SingleWriter, or even use non-latched operations everywhere if they know that their input arrives partitioned.

F LOG ANALYTICS

The Faster record log is a sequence of updates to the state of the application. Such a log can be fed directly into a stream processing engine to analyze application state across time. For example, one may measure the rate at which values grow over time, or produce hourly dashboards of the hottest keys in the application. The size of the read-only and in-place-updatable regions in HybridLog controls the frequency of updates to values present in the log. Further, one may handle point-in-time queries by scanning the log, or query historical values of a given key (since our record versions are linked in the log). Investigating these possibilities in a hybrid updates-and-analytics system is an interesting direction for future work.

G ALGORITHMS FOR HYBRIDLOG AND FASTER

We present the algorithms for HybridLog in Alg. 1. Allocate is invoked by a thread when it wishes to allocate a new record. New records are allocated at the tail using fetch-and-add. If the address is within a logical page (the common case), we simply return the logical address. The first thread whose Allocate overflows the page handles buffer maintenance and resets the offset for the new page. Other threads spin-wait for this thread to reset the offset.

Algorithm 1: Allocate Algorithm in HybridLog

function Allocate(int size)
    offset = toffset.fetch_and_add(size);
    if offset + size < page_size then
        return (tpage ≪ page_size_bits) | offset;
    else if offset < page_size then
        buffer_maintenance(tpage++);
        index = tpage % buffer_size;
        spin-wait until closed-status[index] == 'C';
        flush-status[index] = 'D'; closed-status[index] = 'O';
        allocate (or clean) frame[index] for tpage;
        if (offset + size) == page_size then
            toffset.store(0);
            return ((tpage - 1) ≪ page_size_bits) | offset;
        else
            toffset.store(size);
            return (tpage ≪ page_size_bits) | 0;
    else
        spin-wait until toffset < page_size;
        return Allocate(size);

procedure buffer_maintenance(long tpage)
    index = tpage % buffer_size;
    if head_offset is lagging then
        new_head = (tpage - buffer_size) ≪ page_size_bits;
        adjust new_head based on flush-status;
        old_head = head_offset;
        head_offset.store(new_head);
        bump_epoch(() ⇒ { close pages in range [old_head, new_head); });
    if ro_offset is lagging then …

Next, we present the Read, Upsert, and RMW algorithms for Faster using HybridLog, in Algs. 2, 3, and 4, respectively. The find_tag procedure finds an existing (non-tentative) entry in the index, while find_or_create_tag returns an existing (non-tentative) entry or creates a new one, using the two-phase insert algorithm described in Sec. 3. The trace_back_until procedure traverses the record linked list that is present in memory to obtain the logical address of the record that corresponds to the key, or of the first on-disk record (obtained from the last record in memory).

A read operation issues a read request to disk if the logical address is less than the head offset, reads using the single reader if the record is in the safe-read-only region, or uses the concurrent reader if it is in the fuzzy or mutable region. Upsert updates in place if the record is in the mutable region, and creates a new copy at the tail otherwise. RMW issues a read request if the logical address is less than the head offset; creates a new record at the tail if it is in the safe-read-only region; puts the operation into a pending list for later processing if it is in the fuzzy region; and updates it in place if it is in the mutable region. For Reads and RMWs, the operation context is enqueued into a pending queue when the asynchronous operation completes. These operations continue processing (using their saved contexts) when the user invokes CompletePending.

Algorithm 3: Upsert Algorithm in Faster

function Status Upsert(key, value, ctx)
    entry = find_or_create_tag(key);
    laddr = trace_back_until(key, entry.address, head_offset);
    if laddr is valid and laddr > read_only_offset then
        paddr = get_physical_address(laddr);
        record = Obtain record at paddr;
        ConcurrentWriter(key, record.value, value);
        return Status.OK;
    else
        new_laddr = Allocate(size);
        new_record = record at new_paddr;
        write key, entry.address at new_record;
        SingleWriter(key, new_record.value, value);
        updated_entry = unset tentative bit, tag, new_laddr;
        if entry.CAS(entry, updated_entry) then
            return Status.OK;
        else
            new_record.invalid = true;
            return Upsert(key, value, ctx);

Algorithm 4: RMW Algorithm in Faster

function Status RMW(key, input, ctx)
    entry = find_or_create_tag(key);
    laddr = trace_back_until(key, entry.address, head_offset);
    if laddr is invalid then
        goto CREATE_RECORD;
    else if laddr < head_offset then
        create context and issue async IO request to disk;
        return Status.Pending;
    else
        paddr = get_physical_address(laddr);
        record = Obtain record at paddr;
        if laddr < safe_ro_offset then
            goto CREATE_RECORD;
        else if laddr < ro_offset then
            add context to pending list;
            return Status.Pending;
        …
30 ro_offset.store((tpage - ro_lag) ≪ page_size_bits); 17 else 18 InplaceUpdater(key, input, record.value); 31 bump_epoch(( ) ⇒ { update_safe_ro(ro_offset)}); 19 return Status.OK; 32 procedure update_safe_ro(long new) 33 old = safe_ro_offset; 20 CREATE_RECORD: 34 safe_ro_offset = new; 21 new_laddr = Allocate (size); 35 foreach page ∈ [old, new) do 22 new_paddr = get_physical_address(new_laddr); 36 async_flush_page(page, callback : ( ) ⇒ {flush-status[index] = ’F’;}); 23 new_record = record at new_paddr; 24 write key, entry.address at new_record; Algorithm 1: Algorithms for HybridLog 25 if laddr is invalid then 26 InitialUpdater(key, input, new_record.value); 27 else CopyUpdater(key, input, record.value, new_record.value); 1 function Status Read(key, input, output, ctx) 28 2 entry = find_tag(key); 29 updated_entry = unset tentative bit, tag, new_laddr; 3 laddr = trace_back_until(key, entry.address, head_offset); 30 if not entry.CAS(entry, updated_entry) then 4 if laddr is invalid then 31 new_record.invalid = true; 5 return Status.Error; 32 return RMW(key, input, ctx); 6 else if laddr < head_offset then 33 return Status.OK; Create context and issue async IO request to disk; 7 RMW Algorithm in Faster 8 return Status.Pending; Algorithm 4: 9 else 10 paddr = get_physical_address(laddr); 11 record = obtain record at paddr; 12 if laddr < safe_ro_offset then 13 SingleReader(key, input, record.value, output); 14 else 15 ConcurrentReader(key, input, record.value, output); 16 return Status.OK; Algorithm 2: Read algorithm in Faster
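As an illustration of the three RMW update functions, consider a running-sum counter. The following C sketch is not FASTER's actual code: the Key, Input, and Value structs are placeholders invented for this example, and a production InPlaceUpdater would need to be concurrency-safe (e.g., an atomic fetch-add), which this sketch elides.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder types for this sketch; not FASTER's definitions. */
typedef struct { uint64_t id; } Key;
typedef struct { uint64_t delta; } Input;
typedef struct { uint64_t sum; } Value;

/* Exclusive access: populates the value for a key with no prior record. */
void InitialUpdater(Key* k, Input* in, Value* v) {
    (void)k;
    v->sum = in->delta;
}

/* May be invoked concurrently; a real implementation would use an
   atomic fetch-add or a lock here. */
void InPlaceUpdater(Key* k, Input* in, Value* v) {
    (void)k;
    v->sum += in->delta;
}

/* Exclusive access: derives the new value from the old value and input. */
void CopyUpdater(Key* k, Input* in, Value* old_v, Value* new_v) {
    (void)k;
    new_v->sum = old_v->sum + in->delta;
}
```

With this triple, an RMW of delta d behaves like sum += d regardless of whether the record is freshly created, updated in-place in the mutable region, or copied to the tail.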
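The fetch-and-add tail allocation of Alg. 1 can be sketched as follows. This is a deliberate simplification for illustration: the circular buffer, flush/closed status arrays, and epoch-protected buffer maintenance are all elided, and the page size and variable names are chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_SIZE_BITS 16
#define PAGE_SIZE (1u << PAGE_SIZE_BITS)

static atomic_uint toffset;  /* offset into the tail page */
static uint64_t tpage;       /* tail page number */

/* Returns a logical address: (page << PAGE_SIZE_BITS) | offset.
   Buffer maintenance (closing/flushing old pages) is elided. */
uint64_t Allocate(unsigned size) {
    unsigned offset = atomic_fetch_add(&toffset, size);
    if (offset + size < PAGE_SIZE)            /* common case: fits in page */
        return (tpage << PAGE_SIZE_BITS) | offset;
    if (offset < PAGE_SIZE) {                 /* first thread to overflow */
        tpage++;                              /* buffer_maintenance elided */
        if (offset + size == PAGE_SIZE) {     /* request exactly fills page */
            atomic_store(&toffset, 0);
            return ((tpage - 1) << PAGE_SIZE_BITS) | offset;
        }
        atomic_store(&toffset, size);         /* take start of the new page */
        return (tpage << PAGE_SIZE_BITS) | 0;
    }
    /* Other overflowing threads: wait for the reset, then retry. */
    while (atomic_load(&toffset) >= PAGE_SIZE) { /* spin */ }
    return Allocate(size);
}
```

Note the three outcomes from the text: the common in-page case returns immediately; the single thread that observes the page boundary installs the new page; and all other overflowing threads spin until the offset is reset, then retry.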
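The dispatch in Algs. 2-4 hinges on comparing a record's logical address against the head, safe-read-only, and read-only offsets. A small sketch of that region test (function and enum names are illustrative, not from FASTER):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ON_DISK, SAFE_READ_ONLY, FUZZY, MUTABLE } Region;

/* Classify a logical address, assuming
   head_offset <= safe_ro_offset <= ro_offset <= tail. */
Region classify(uint64_t laddr, uint64_t head_offset,
                uint64_t safe_ro_offset, uint64_t ro_offset) {
    if (laddr < head_offset)    return ON_DISK;        /* issue async IO */
    if (laddr < safe_ro_offset) return SAFE_READ_ONLY; /* stable, read-only */
    if (laddr < ro_offset)      return FUZZY;          /* RMW goes pending */
    return MUTABLE;                                    /* update in-place */
}
```

Read uses SingleReader for SAFE_READ_ONLY and ConcurrentReader for FUZZY or MUTABLE; Upsert writes in-place only for MUTABLE; RMW goes pending in FUZZY because such a record may be concurrently copied or flushed.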
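Both Upsert and RMW finish by publishing the new tail record with a compare-and-swap on the hash-index entry, invalidating the record and retrying if the CAS fails. A sketch of that step, assuming a bare 64-bit entry for simplicity (a real FASTER entry also carries the tag and tentative bit, which this example omits):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool invalid; } Record;

/* Try to swing the index entry from the value observed earlier in the
   operation to the new record's logical address. On failure, another
   thread won the race: retire the speculative record and report failure
   so the caller can retry the whole operation. */
bool try_install(atomic_uint_fast64_t* entry, uint64_t observed,
                 uint64_t new_laddr, Record* new_record) {
    uint_fast64_t expected = observed;
    if (atomic_compare_exchange_strong(entry, &expected, new_laddr))
        return true;
    new_record->invalid = true;  /* lost the race: record is dead weight */
    return false;
}
```

Marking the loser invalid rather than unlinking it is what lets the log stay append-only: the dead record remains in the log but is ignored by later traversals.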