MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

Bin Fan, David G. Andersen, Michael Kaminsky*
Carnegie Mellon University, *Intel Labs

Abstract

This paper presents a set of architecturally and workload-inspired algorithmic and engineering improvements to the popular Memcached system that substantially improve both its memory efficiency and throughput. These techniques—optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm based upon CLOCK, and comprehensive implementation of optimistic locking—enable the resulting system to use 30% less memory for small key-value pairs, and serve up to 3x as many queries per second over the network. We have implemented these modifications in a system we call MemC3—Memcached with CLOCK and Concurrent Cuckoo hashing—but believe that they also apply more generally to many of today's read-intensive, highly concurrent networked storage and caching systems.

1 Introduction

Low-latency access to data has become critical for many Internet services in recent years. This requirement has led many system designers to serve all or most of certain data sets from main memory—using the memory either as their primary store [19, 26, 21, 25] or as a cache to deflect hot or particularly latency-sensitive items [10].

Two important metrics in evaluating these systems are performance (throughput, measured in queries served per second) and memory efficiency (measured by the overhead required to store an item). Memory consumption is important because it directly affects the number of items that the system can store, and the hardware cost to do so.

This paper demonstrates that careful attention to algorithm and data structure design can significantly improve throughput and memory efficiency for in-memory data stores. We show that traditional approaches often fail to leverage the target system's architecture and expected workload. As a case study, we focus on Memcached [19], a popular in-memory caching layer, and show how our toolbox of techniques can improve Memcached's performance by 3× and reduce its memory use by 30%.

Standard Memcached, at its core, uses a typical hash table design, with linked-list-based chaining to handle collisions. Its cache replacement algorithm is strict LRU, also based on linked lists. This design relies on locking to ensure consistency among multiple threads, and leads to poor scalability on multi-core CPUs [11].

This paper presents MemC3 (Memcached with CLOCK and Concurrent Cuckoo Hashing), a complete redesign of the Memcached internals. This re-design is informed by and takes advantage of several observations. First, architectural features can hide memory access latencies and provide performance improvements. In particular, our new hash table design exploits CPU cache locality to minimize the number of memory fetches required to complete any given operation; and it exploits instruction-level and memory-level parallelism to overlap those fetches when they cannot be avoided.

Second, MemC3's design also leverages workload characteristics. Many Memcached workloads are predominately reads, with few writes. This observation means that we can replace Memcached's exclusive, global locking with an optimistic locking scheme targeted at the common case. Furthermore, many important Memcached workloads target very small objects, so per-object overheads have a significant impact on memory efficiency. For example, Memcached's strict LRU cache replacement requires significant metadata—often more space than the object itself occupies; in MemC3, we instead use a compact CLOCK-based approximation.

The specific contributions of this paper include:

• A novel hashing scheme called optimistic cuckoo hashing. Conventional cuckoo hashing [23] achieves space efficiency, but is unfriendly for concurrent operations. Optimistic cuckoo hashing (1) achieves high memory efficiency (e.g., 95% table occupancy); (2) allows multiple readers and a single writer to concurrently access the hash table; and (3) keeps hash table operations cache-friendly (Section 3).

• A compact CLOCK-based eviction algorithm that requires only 1 bit of extra space per cache entry and supports concurrent cache operations (Section 4).

• Optimistic locking that eliminates inter-thread synchronization while ensuring consistency. The optimistic cuckoo hash table operations (lookup/insert) and the LRU cache eviction operations both use this locking scheme for high-performance access to shared data structures (Section 4).

Finally, we implement and evaluate MemC3, a networked, in-memory key-value cache, based on Memcached-1.4.13.¹ Table 1 compares MemC3 and stock Memcached. MemC3 provides higher throughput using significantly less memory and computation as we will demonstrate in the remainder of this paper.

    ¹ Our prototype does not yet provide the full memcached API.

Figure 1: Memcached data structures. [diagram: a chaining hash table indexing key-value items, with per-slab LRU linked lists and Slab1/Slab2 headers]

Table 1: Comparison of operations. n is the number of existing key-value items.

    function                stock Memcached    MemC3
    Hash table
      concurrency           serialized         concurrent lookup, serialized insert
      lookup performance    slower             faster
      insert performance    faster             slower
      space                 13.3n Bytes        ~9.7n Bytes
    Cache Mgmt
      concurrency           serialized         concurrent update, serialized eviction
      space                 18n Bytes          n bits

2 Background

2.1 Memcached Overview

Interface Memcached implements a simple and lightweight key-value interface where all key-value tuples are stored in and served from DRAM. Clients communicate with the Memcached servers over the network using the following commands:

• SET/ADD/REPLACE(key, value): add a (key, value) object to the cache;
• GET(key): retrieve the value associated with a key;
• DELETE(key): delete a key.

Internally, Memcached uses a hash table to index the key-value entries. These entries are also in a linked list sorted by their most recent access time. The least recently used (LRU) entry is evicted and replaced by a newly inserted entry when the cache is full.

Hash Table To lookup keys quickly, the location of each key-value entry is stored in a hash table. Hash collisions are resolved by chaining: if more than one key maps into the same hash table bucket, they form a linked list. Chaining is efficient for inserting or deleting single keys. However, lookup may require scanning the entire chain.

Memory Allocation Naive memory allocation (e.g., malloc/free) could result in significant memory fragmentation. To address this problem, Memcached uses slab-based memory allocation. Memory is divided into 1 MB pages, and each page is further sub-divided into fixed-length chunks. Key-value objects are stored in an appropriately-sized chunk. The size of a chunk, and thus the number of chunks per page, depends on the particular slab class. For example, by default the chunk size of slab class 1 is 72 bytes and each page of this class has 14563 chunks; while the chunk size of slab class 43 is 1 MB and thus there is only 1 chunk spanning the whole page. To insert a new key, Memcached looks up the slab class whose chunk size best fits this key-value object. If a vacant chunk is available, it is assigned to this item; if the search fails, Memcached will execute cache eviction.
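To make the slab-class logic concrete, the following is a minimal C sketch of how per-class chunk sizes could be derived and a class selected. The 72-Byte base and 1 MB page match the defaults above; the 1.25 growth factor and the helper names are our illustrative assumptions, not Memcached's actual code.

    #include <stddef.h>

    #define PAGE_SIZE   (1024 * 1024)   /* each slab page is 1 MB */
    #define NUM_CLASSES 43

    static size_t chunk_size[NUM_CLASSES + 1];

    /* Build the table of per-class chunk sizes: class 1 holds 72-Byte
       chunks (14563 per page); sizes grow geometrically up to a single
       1 MB chunk per page in the largest class. */
    static void build_classes(void) {
        size_t sz = 72;
        for (int c = 1; c <= NUM_CLASSES; c++) {
            chunk_size[c] = sz < PAGE_SIZE ? sz : PAGE_SIZE;
            sz = (size_t)(sz * 1.25);    /* assumed growth factor */
        }
        chunk_size[NUM_CLASSES] = PAGE_SIZE;
    }

    /* Pick the smallest class whose chunk fits the object; the caller
       then takes a vacant chunk of that class or triggers eviction. */
    static int class_for(size_t object_size) {
        for (int c = 1; c <= NUM_CLASSES; c++)
            if (object_size <= chunk_size[c])
                return c;
        return -1;   /* larger than a page: not storable */
    }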
Cache policy In Memcached, each slab class maintains its own objects in an LRU queue (see Figure 1). Each access to an object causes that object to move to the head of the queue. Thus, when Memcached needs to evict an object from the cache, it can find the least recently used object at the tail. The queue is implemented as a doubly-linked list, so each object has two pointers.

Threading Memcached was originally single-threaded. It uses libevent for asynchronous network I/O callbacks [24]. Later versions support multi-threading but use global locks to protect the core data structures. As a result, operations such as index lookup/update and cache eviction/update are all serialized. Previous work has shown that this locking prevents current Memcached from scaling up on multi-core CPUs [11].

Performance Enhancement Previous solutions [4, 20, 13] shard the in-memory data to different cores. Sharding eliminates the inter-thread synchronization to permit higher concurrency, but under skewed workloads it may also exhibit imbalanced load across different cores or waste the (expensive) memory capacity. Instead of simply sharding, we explore how to scale performance to many threads that share and access the same memory space; one could then apply sharding to further scale the system.

2.2 Real-world Workloads: Small and Read-only Requests Dominate

Our work is informed by several key-value workload characteristics published recently by Facebook [3].

First, queries for small objects dominate. Most keys are smaller than 32 bytes and most values no more than a few hundred bytes. In particular, there is one common type of request that almost exclusively uses 16 or 21 Byte keys and 2 Byte values.

The consequence of storing such small key-value objects is high memory overhead. Memcached always allocates a 56-Byte header (on 64-bit servers) for each key-value object regardless of the size. The header includes two pointers for the LRU linked list and one pointer for chaining to form the hash table. For small key-value objects, this space overhead cannot be amortized. Therefore we seek more memory efficient data structures for the index and cache.

Second, queries are read heavy. In general, a GET/SET ratio of 30:1 is reported for the Memcached workloads in Facebook. Important applications that can increase cache size on demand show even higher fractions of GETs (e.g., 99.8% are GETs, or GET/SET=500:1). Note that this ratio also depends on the GET hit ratio, because each GET miss is usually followed by a SET to update the cache by the application.

Though most queries are GETs, this operation is not optimized and locks are used extensively on the query path. For example, each GET operation must acquire (1) a lock for exclusive access to this particular key, (2) a global lock for exclusive access to the hash table; and (3) after reading the relevant key-value object, it must again acquire the global lock to update the LRU linked list. We aim to remove all mutexes on the GET path to boost the concurrency of Memcached.

3 Optimistic Concurrent Cuckoo Hashing

In this section, we present a compact, concurrent and cache-aware hashing scheme called optimistic concurrent cuckoo hashing. Compared with Memcached's original chaining-based hash table, our design improves memory efficiency by applying cuckoo hashing [23]—a practical, advanced hashing scheme with high memory efficiency and O(1) amortized insertion time and retrieval. However, basic cuckoo hashing does not support concurrent read/write access; it also requires multiple memory references for each insertion or lookup. To overcome these limitations, we propose a collection of new techniques that improve basic cuckoo hashing in concurrency, memory efficiency and cache-friendliness:

• An optimistic version of cuckoo hashing that supports multiple-reader/single-writer concurrent access, while preserving its space benefits;
• A technique using a short summary of each key to improve the cache locality of hash table operations; and
• An optimization for cuckoo hashing insertion that improves throughput.

As we show in Section 5, combining these techniques creates a hashing scheme that is attractive in practice: its hash table achieves over 90% occupancy (compared to 50% for linear probing, or needing the extra pointers required by chaining) [9]. Each lookup requires only two parallel cacheline reads followed by (up to) one memory reference on average. In contrast, naive cuckoo hashing requires two parallel cacheline reads followed by (up to) 2N parallel memory references if each bucket has N keys; and chaining requires (up to) N dependent memory references to scan a bucket of N keys. The hash table supports multiple readers and a single writer, substantially speeding up read-intensive workloads while maintaining equivalent performance for write-heavy workloads.

Interface The hash table provides Lookup, Insert and Delete operations for indexing all key-value objects. On Lookup, the hash table returns a pointer to the relevant key-value object, or "does not exist" if the key can not be found. On Insert, the hash table returns true on success, and false to indicate the hash table is too full.² Delete simply removes the key's entry from the hash table. We focus on Lookup and Insert as Delete is very similar to Lookup.

    ² As in other hash table designs, an expansion process can increase the cuckoo hash table size to allow for additional inserts.

Basic Cuckoo Hashing Before presenting our techniques in detail, we first briefly describe how to perform cuckoo hashing. The basic idea of cuckoo hashing is to use two hash functions instead of one, thus providing each key two possible locations where it can reside. Cuckoo hashing can dynamically relocate existing keys and refine the table to make room for new keys during insertion.

Our hash table, as shown in Figure 2, consists of an array of buckets, each having 4 slots.³ Each slot contains a pointer to the key-value object and a short summary of the key called a tag. To support keys of variable length, the full keys and values are not stored in the hash table, but stored with the associated metadata outside the table and referenced by the pointer. A null pointer indicates this slot is not used.

    ³ Our hash table is 4-way set-associative. Without set-associativity, basic cuckoo hashing allows only 50% of the table entries to be occupied before unresolvable collisions occur. It is possible to improve the space utilization to over 90% by using a 4-way (or higher) set associative hash table [9].

Figure 2: Hash table overview: The hash table is 4-way set-associative. Each key is mapped to 2 buckets by hash functions and associated with 1 version counter; each slot stores a tag of the key and a pointer to the key-value item. The version counters, tags, and pointers are used for optimistic locking and must be accessed atomically.

Each key is mapped to two random buckets, so Lookup checks all 8 candidate keys from every slot. To insert a new key x into the table, if either of the two buckets has an empty slot, it is then inserted in that bucket; if neither bucket has space, Insert selects a random key y from one candidate bucket and relocates y to its own alternate location. Displacing y may also require kicking out another existing key z, so this procedure may repeat until a vacant slot is found, or until a maximum number of displacements is reached (e.g., 500 times in our implementation). If no vacant slot is found, the hash table is considered too full to insert and an expansion process is scheduled. Though it may execute a sequence of displacements, the amortized insertion time of cuckoo hashing is O(1) [23]. A sketch of this basic insertion procedure is shown below.
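The following is a minimal C sketch of the basic (non-concurrent) insertion loop just described—not the authors' code. The table layout, hash1/hash2, and the random victim choice are illustrative assumptions; note how each displacement briefly leaves a key "floating," the problem Section 3.2 addresses.

    #include <stdbool.h>
    #include <stdlib.h>

    #define NUM_BUCKETS (1u << 20)
    #define SLOTS 4
    #define MAX_DISPLACE 500            /* displacement limit from the paper */

    typedef struct { char *key; void *val; } Slot;
    static Slot table[NUM_BUCKETS][SLOTS];

    unsigned hash1(const char *key);    /* two hash functions, assumed given */
    unsigned hash2(const char *key);

    static bool try_bucket(unsigned b, char *key, void *val) {
        for (int i = 0; i < SLOTS; i++)
            if (table[b][i].key == NULL) {          /* vacant slot */
                table[b][i] = (Slot){ key, val };
                return true;
            }
        return false;
    }

    bool basic_cuckoo_insert(char *key, void *val) {
        unsigned b = hash1(key) % NUM_BUCKETS;
        if (try_bucket(b, key, val) ||
            try_bucket(hash2(key) % NUM_BUCKETS, key, val))
            return true;
        for (int n = 0; n < MAX_DISPLACE; n++) {
            int v = rand() % SLOTS;                 /* pick a random victim */
            Slot tmp = table[b][v];
            table[b][v] = (Slot){ key, val };       /* take the victim's slot */
            key = tmp.key; val = tmp.val;           /* now re-home the victim */
            unsigned b1 = hash1(key) % NUM_BUCKETS;
            b = (b1 == b) ? hash2(key) % NUM_BUCKETS : b1;
            if (try_bucket(b, key, val))
                return true;
        }
        return false;   /* too full: an expansion process would be scheduled */
    }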

3.1 Tag-based Lookup/Insert

To support keys of variable length and keep the index compact, the actual keys are not stored in the hash table and must be retrieved by following a pointer. We propose a cache-aware technique to perform cuckoo hashing with minimum memory references by using tags—a short hash of the keys (one-byte in our implementation). This technique is inspired by "partial-key cuckoo hashing" which we proposed in previous work [17], but eliminates the prior approach's limitation in the maximum table size.

Cache-friendly Lookup The original Memcached lookup is not cache-friendly. It requires multiple dependent pointer dereferences to traverse a linked list:

    lookup -> [K|V] -> [K|V] -> [K|V]

Neither is basic cuckoo hashing cache-friendly: checking two buckets on each Lookup makes up to 8 (parallel) pointer dereferences. In addition, displacing each key on Insert also requires a pointer dereference to calculate the alternate location to swap, and each Insert may perform several displacement operations.

Our hash table eliminates the need for pointer dereferences in the common case. We compute a 1-Byte tag as the summary of each inserted key, and store the tag in the same bucket as its pointer. Lookup first compares the tag, then retrieves the full key only if the tag matches. This procedure is as shown below (T represents the tag):

    lookup -> [T T T T] -> [K|V]

It is possible to have false retrievals due to two different keys having the same tag, so the fetched full key is further verified to ensure it was indeed the correct one. With a 1-Byte tag by hashing, the chance of tag-collision is only 1/2^8 = 0.39%. After checking all 8 candidate slots, a negative Lookup makes 8 × 0.39% = 0.03 pointer dereferences on average. Because each bucket fits in a CPU cacheline (usually 64-Byte), on average each Lookup makes only 2 parallel cacheline-sized reads for checking the two buckets plus either 0.03 pointer dereferences if the Lookup misses or 1.03 if it hits.
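A minimal C sketch of this tag-first probe follows; the bucket layout and the key_of() accessor are illustrative assumptions. The point is that the tag comparisons touch only the bucket's own cacheline, and a pointer is dereferenced only on a tag match.

    #include <stdint.h>
    #include <string.h>

    #define SLOTS 4

    typedef struct {
        uint8_t tag[SLOTS];        /* 1-Byte key summaries, checked first */
        void   *ptr[SLOTS];        /* key-value objects stored outside    */
    } Bucket;

    const char *key_of(void *kv);  /* assumed accessor into the object */

    void *probe_bucket(Bucket *b, uint8_t tag, const char *key) {
        for (int i = 0; i < SLOTS; i++) {
            if (b->ptr[i] == NULL || b->tag[i] != tag)
                continue;                      /* no pointer dereference */
            /* tags collide with probability 1/256: verify the full key */
            if (strcmp(key_of(b->ptr[i]), key) == 0)
                return b->ptr[i];
        }
        return NULL;   /* caller probes the alternate bucket next */
    }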
Cache-friendly Insert We also use the tags to avoid retrieving full keys on Insert, which were originally needed to derive the alternate location to displace keys. To this end, our hashing scheme computes the two candidate buckets b1 and b2 for key x by

    b1 = HASH(x)             // based on the entire key
    b2 = b1 ⊕ HASH(tag)      // based on b1 and the tag of x

b2 is still a random variable uniformly distributed⁴; more importantly, b1 can be computed by the same formula from b2 and the tag. This property ensures that to displace a key originally in bucket b—no matter if b is b1 or b2—it is possible to calculate its alternate bucket b' from the bucket index b and the tag stored in bucket b by

    b' = b ⊕ HASH(tag)       (1)

As a result, Insert operations can operate using only information in the table and never have to retrieve keys.

    ⁴ b2 is no longer fully independent from b1. For a 1-Byte tag, there are up to 256 different values of b2 given a specific b1. Microbenchmarks in Section 5 show that our algorithm still achieves close-to-optimal load factor, even if b2 has some dependence on b1.
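In C, the index math looks like the sketch below. The specific hash and tag functions are illustrative assumptions; the one property the scheme needs is the XOR symmetry of Equation (1), which holds as written when the number of buckets is a power of two.

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_BUCKETS (1u << 20)           /* power of two, so XOR   */
    #define BUCKET_MASK (NUM_BUCKETS - 1)    /* stays inside the table */

    uint64_t hash_key(const void *key, size_t len);  /* full-key hash, assumed */

    static inline uint8_t tag_of(uint64_t h) {
        uint8_t t = (uint8_t)(h >> 56);      /* 1-Byte summary of the key */
        return t ? t : 1;                    /* reserve 0 for "empty" (our convention) */
    }

    static inline uint32_t hash_tag(uint8_t tag) {
        return (uint32_t)tag * 0x5bd1e995u;  /* spread the tag's 8 bits; assumed */
    }

    static inline uint32_t bucket1(uint64_t h) {
        return (uint32_t)h & BUCKET_MASK;               /* b1 = HASH(x)         */
    }
    static inline uint32_t alt_bucket(uint32_t b, uint8_t tag) {
        return (b ^ hash_tag(tag)) & BUCKET_MASK;       /* b' = b xor HASH(tag) */
    }

    /* alt_bucket(alt_bucket(b, t), t) == b, so either candidate bucket
       plus the stored tag recovers the other one (Equation 1). */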

3.2 Concurrent Cuckoo Hashing

Effectively supporting concurrent access to a cuckoo hash table is challenging. A previously proposed scheme improved concurrency by trading space [12]. Our hashing scheme is, to our knowledge, the first approach to support concurrent access (multi-reader/single-writer) while still maintaining the high space efficiency of cuckoo hashing (e.g., > 90% occupancy).

For clarity of presentation, we first define a cuckoo path as the sequence of displaced keys in an Insert operation. In Figure 3, "a ⇒ b ⇒ c" is one cuckoo path to make one bucket available to insert key x.

There are two major obstacles to making the sequential cuckoo hashing algorithm concurrent:

1. Deadlock risk (writer/writer): An Insert may modify a set of buckets when moving the keys along the cuckoo path until one key lands in an available bucket. It is not known before swapping the keys how many and which buckets will be modified, because each displaced key depends on the one previously kicked out. Standard techniques to make Insert atomic and avoid deadlock, such as acquiring all necessary locks in advance, are therefore not obviously applicable.

2. False misses (reader/writer): After a key is kicked out of its original bucket but before it is inserted to its alternate location, this key is unreachable from both buckets and temporarily unavailable. If Insert is not atomic, a reader may complete a Lookup and return a false miss during a key's unavailable time. E.g., in Figure 3, after replacing b with a at bucket 4, but before b relocates to bucket 1, b appears at neither bucket in the table. A reader looking up b at this moment may return negative results.

The only scheme previously proposed for concurrent cuckoo hashing [12] that we know of breaks up Inserts into a sequence of atomic displacements rather than locking the entire cuckoo path. It adds extra space at each bucket as an overflow buffer to temporarily host keys swapped from other buckets, and thus avoid kicking out existing keys. Hence, its space overhead (typically two more slots per bucket as buffer) is much higher than the basic cuckoo hashing.

Our scheme instead maintains high memory efficiency and also allows multiple-reader concurrent access to the hash table. To avoid writer/writer deadlocks, it allows only one writer at a time—a tradeoff we accept as our target workloads are read-heavy. To eliminate false misses, our design changes the order of the basic cuckoo hashing insertion by:

1) separating the discovery of a valid cuckoo path from the execution of this path. We first search for a cuckoo path, but do not move keys during this search phase.

2) moving keys backwards along the cuckoo path. After a valid cuckoo path is known, we first move the last key on the cuckoo path to the free slot, and then move the second to last key to the empty slot left by the previous one, and so on. As a result, each swap affects only one key at a time, which can always be successfully moved to its new location without any kickout.

Intuitively, the original Insert always moves a selected key to its other bucket and kicks out another existing key unless an empty slot is found in that bucket. Hence, there is always a victim key "floating" before Insert completes, causing false misses. In contrast, our scheme first discovers a cuckoo path to an empty slot, then propagates this empty slot towards the key for insertion along the path. To illustrate our scheme in Figure 3, we first find a valid cuckoo path "a ⇒ b ⇒ c" for key x without editing any buckets. After the path is known, c is swapped to the empty slot in bucket 3, followed by relocating b to the original slot of c in bucket 1 and so on. Finally, the original slot of a will be available and x can be directly inserted into that slot. A sketch of this backward execution phase appears below.

Figure 3: Cuckoo path. ∅ represents an empty slot. [diagram: buckets 0–7; key x's two candidate buckets are full, and the path a ⇒ b ⇒ c ends at an empty slot]
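A minimal C sketch of the backward execution phase, under the same illustrative table layout as the earlier sketches (the path recording and helper names are our assumptions):

    typedef struct { char *key; void *val; } Slot;
    typedef struct { unsigned bucket; int slot; } Step;

    extern Slot table[][4];   /* (bucket, slot) -> {key, value} */

    /* path[0..len-1] was recorded by the search phase: path[len-1] is the
       empty slot it found, and path[i] holds the key to be moved into
       path[i+1]. Walking backwards, every move targets a slot that is
       empty at that moment, so no key is ever unreachable. */
    void execute_path_backwards(Step *path, int len, char *key, void *val) {
        for (int i = len - 1; i > 0; i--) {
            /* with optimistic locking (Section 3.2.1), the moved key's
               version counter is bumped before and after this assignment */
            table[path[i].bucket][path[i].slot] =
                table[path[i - 1].bucket][path[i - 1].slot];
        }
        /* the head of the path is now free: install the new key there */
        table[path[0].bucket][path[0].slot] = (Slot){ key, val };
    }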
3.2.1 Optimization: Optimistic Locks for Lookup

Many locking schemes can work with our proposed concurrent cuckoo hashing, as long as they ensure that during Insert, all displacements along the cuckoo path are atomic with respect to Lookups. The most straightforward scheme is to lock the two relevant buckets before each displacement and each Lookup. Though simple, this scheme requires locking twice for every Lookup and in a careful order to avoid deadlock.

Optimizing for the common case, our approach takes advantage of having a single writer to synchronize Insert and Lookups with low overhead. Instead of locking on buckets, it assigns a version counter for each key, updates its version when displacing this key on Insert, and looks for a version change during Lookup to detect any concurrent displacement.

Lock Striping [12] The simplest way to maintain each key's version is to store it inside each key-value object. This approach, however, adds one counter for each key and there could be hundreds of millions of keys. More importantly, this approach leads to a race condition: to check or update the version of a given key, we must first lookup in the hash table to find the key-value object (stored external to the hash table), and this initial lookup is not protected by any lock and thus not thread-safe.

Instead, we create an array of counters (Figure 2). To keep this array small, each counter is shared among multiple keys by hashing (e.g., the i-th counter is shared by all keys whose hash value is i). Our implementation keeps 8192 counters in total (or 32 KB). This permits the counters to fit in cache, but allows substantial concurrent access. It also keeps the chance of a "false retry" (re-reading a key due to modification of an unrelated key) to roughly 0.01%. All counters are initialized to 0 and only read/updated by atomic memory operations to ensure the consistency among all threads.

Optimistic Locking [15] Before displacing a key, an Insert process first increases the relevant counter by one, indicating to the other Lookups an on-going update for this key; after the key is moved to its new location, the counter is again increased by one to indicate the completion. As a result, the key version is increased by 2 after each displacement.

Before a Lookup process reads the two buckets for a given key, it first snapshots the version stored in its counter: if this version is odd, there must be a concurrent Insert working on the same key (or another key sharing the same counter), and it should wait and retry; otherwise it proceeds to the two buckets. After it finishes reading both buckets, it snapshots the counter again and compares its new version with the old version. If the two versions differ, the writer must have modified this key, and the Lookup should retry. The proof of correctness in the Appendix covers the corner cases.
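The protocol maps naturally onto C11 atomics. A minimal sketch follows, with the striped-counter layout from the text (8192 counters) and an assumed helper that reads the two candidate buckets; it is not the authors' implementation.

    #include <stdatomic.h>
    #include <stdint.h>

    #define NUM_COUNTERS 8192               /* 32 KB of striped counters */
    static atomic_uint_fast32_t counters[NUM_COUNTERS];

    static uint32_t idx(uint64_t keyhash) {
        return keyhash % NUM_COUNTERS;      /* many keys share a counter */
    }

    /* Writer (the single Insert thread): the counter is odd exactly
       while a displacement of this key is in flight. */
    void displace_begin(uint64_t keyhash) { atomic_fetch_add(&counters[idx(keyhash)], 1); }
    void displace_end(uint64_t keyhash)   { atomic_fetch_add(&counters[idx(keyhash)], 1); }

    void *read_two_buckets(uint64_t keyhash, const char *key);  /* assumed */

    /* Reader: retry if the version is odd (update in progress) or changed
       between the two snapshots (a displacement raced with our read). */
    void *optimistic_lookup(uint64_t keyhash, const char *key) {
        for (;;) {
            uint_fast32_t vs = atomic_load(&counters[idx(keyhash)]);
            if (vs & 1)
                continue;
            void *ptr = read_two_buckets(keyhash, key);
            uint_fast32_t ve = atomic_load(&counters[idx(keyhash)]);
            if (vs == ve)
                return ptr;    /* NULL here means a genuine miss */
        }
    }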
3.2.2 Optimization: Multiple Cuckoo Paths

Our revised Insert process first looks for a valid cuckoo path before swapping the key along the path. Due to the separation of search and execution phases, we apply the following optimization to speed path discovery and increase the chance of finding an empty slot.

Instead of searching for an empty slot along one cuckoo path, our Insert process keeps track of multiple paths in parallel. At each step, multiple victim keys are "kicked out," each key extending its own cuckoo path. Whenever one path reaches an available bucket, this search phase completes.

With multiple paths to search, insert may find an empty slot earlier and thus improve the throughput. In addition, it improves the chance for the hash table to store a new key before exceeding the maximum number of displacements performed, thus increasing the load factor. The effect of having more cuckoo paths is evaluated in Section 5.
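One way to structure such a search is sketched below, tracking two paths round-robin; the bookkeeping (a Step array per path) and the helper names are our assumptions—the paper does not prescribe a particular implementation. Note that nothing is moved during the search; the winning path is handed to the backward execution phase of Section 3.2.

    #define MAX_PATH  250    /* per-path bound on recorded displacements */
    #define NUM_PATHS 2

    typedef struct { unsigned bucket; int slot; } Step;

    int find_empty_slot(unsigned b);                 /* -1 if bucket full */
    int pick_victim_slot(unsigned b);                /* e.g., random slot */
    unsigned alt_bucket_of(unsigned b, int slot);    /* via the stored tag */

    /* Extend each candidate path by one hop per round; return the index
       of the first path that reaches a bucket with an empty slot. */
    int search_paths(unsigned b1, unsigned b2,
                     Step path[NUM_PATHS][MAX_PATH], int *out_len) {
        unsigned frontier[NUM_PATHS] = { b1, b2 };
        for (int len = 0; len < MAX_PATH; len++) {
            for (int p = 0; p < NUM_PATHS; p++) {
                int empty = find_empty_slot(frontier[p]);
                if (empty >= 0) {                    /* path p wins */
                    path[p][len] = (Step){ frontier[p], empty };
                    *out_len = len + 1;
                    return p;
                }
                int v = pick_victim_slot(frontier[p]);
                path[p][len] = (Step){ frontier[p], v };
                frontier[p] = alt_bucket_of(frontier[p], v);
            }
        }
        return -1;   /* no path found: table considered too full */
    }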

4 Concurrent Cache Management

Cache management and eviction is the second important component of MemC3. When serving small key-value objects, this too becomes a major source of space overhead in Memcached, which requires 18 Bytes for each key (i.e., two pointers and a 2-Byte reference counter) to ensure that keys can be evicted safely in a strict LRU order. Strict LRU cache management is also a synchronization bottleneck, as all updates to the cache must be serialized in Memcached.

This section presents our efforts to make the cache management space efficient (1 bit per key) and concurrent (no synchronization to update LRU) by implementing an approximate LRU cache based on the CLOCK replacement algorithm [6]. CLOCK is a well-known algorithm; our contribution lies in integrating CLOCK replacement with the optimistic, striped locking in our cuckoo algorithm to reduce both locking and space overhead.

As our target workloads are dominated by small objects, the space saved by trading perfect for approximate LRU allows the cache to store significantly more entries, which in turn improves the hit ratio. As we will show in Section 5, our cache management achieves 3× to 10× the query throughput of the default cache in Memcached, while also improving the hit ratio.

CLOCK Replacement A cache must implement two functions related to its replacement policy:

• Update to keep track of the recency after querying a key in the cache; and
• Evict to select keys to purge when inserting keys into a full cache.

Memcached keeps each key-value entry in a doubly-linked-list based LRU queue within its own slab class. After each cache query, Update moves the accessed entry to the head of its own queue; to free space when the cache is full, Evict replaces the entry on the tail of the queue by the new key-value pair. This ensures strict LRU eviction in each queue, but unfortunately it also requires two pointers per key for the doubly-linked list and, more importantly, all Updates to one linked list are serialized. Every read access requires an update, and thus the queue permits no concurrency even for read-only workloads.

CLOCK approximates LRU with improved concurrency and space efficiency. For each slab class, we maintain a circular buffer and a virtual hand; each bit in the buffer represents the recency of a different key-value object: 1 for "recently used" and 0 otherwise. Each Update simply sets the recency bit to 1 on each key access; each Evict checks the bit currently pointed to by the hand. If the current bit is 0, Evict selects the corresponding key-value object; otherwise we reset this bit to 0 and advance the hand in the circular buffer until we see a bit of 0.

Integration with Optimistic Cuckoo Hashing The Evict process must coordinate with reader threads to ensure the eviction is safe. Otherwise, a key-value entry may be overwritten by a new (key,value) pair after eviction, but threads still accessing the entry for the evicted key may read dirty data. To this end, the original Memcached adds to each entry a 2-Byte reference counter to avoid this rare case. Reading this per-entry counter, the Evict process knows how many other threads are accessing this entry concurrently and avoids evicting those busy entries.

Our cache integrates cache eviction with our optimistic locking scheme for cuckoo hashing. When Evict selects a victim key x by CLOCK, it first increases key x's version counter to inform other threads currently reading x to retry; it then deletes x from the hash table to make x unreachable for later readers, including those retries; and finally it increases key x's version counter again to complete the change for x. Note that Evict and the hash table Insert are both serialized (using locks) so when updating the counters they can not affect each other.

With Evict as above, our cache ensures consistent GETs by version checking. Each GET first snapshots the version of the key before accessing the hash table; if the hash table returns a valid pointer, it follows the pointer and reads the value associated. Afterwards, GET compares the latest key version with the snapshot. If the versions differ, then GET may have observed an inconsistent intermediate state and must retry. The pseudo-code of GET and SET is shown in Algorithm 1.

Algorithm 1: Pseudo code of SET and GET

    SET(key, value)  // insert (key,value) to cache
    begin
        lock();
        ptr = Alloc();              // try to allocate space
        if ptr == NULL then
            ptr = Evict();          // cache is full, evict old item
        memcpy key,value to ptr;
        Insert(key, ptr);           // index this key in hashtable
        unlock();

    GET(key)  // get value of key from cache
    begin
        while true do
            vs = ReadCounter(key);  // key version
            ptr = Lookup(key);      // check hash table
            if ptr == NULL then
                return NULL;
            prepare response for data in ptr;
            ve = ReadCounter(key);  // key version
            if vs & 1 or vs != ve then
                continue;           // may read dirty data, try again
            Update(key);            // update CLOCK
            return response;
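For concreteness, here is a small C sketch of the Update/Evict pair described above. We use one byte per entry for readability where the paper packs recency into a true bit array; the index space and hand handling are illustrative assumptions.

    #include <stddef.h>

    #define N_ENTRIES (1u << 20)              /* entries in this slab class */
    static unsigned char recency[N_ENTRIES];  /* 1 bit of state, a byte here */
    static size_t hand;                       /* the virtual CLOCK hand */

    /* Update: on every hit; a plain racy store of 1 is harmless, so the
       GET path needs no synchronization. */
    void clock_update(size_t i) {
        recency[i] = 1;
    }

    /* Evict: runs under the same lock that serializes SET/Insert. Returns
       the victim's index; the caller then bumps the victim's version
       counter, deletes it from the hash table, and bumps the counter
       again (Section 4). */
    size_t clock_evict(void) {
        for (;;) {
            size_t i = hand;
            hand = (hand + 1) % N_ENTRIES;
            if (recency[i] == 0)
                return i;          /* not recently used: evict */
            recency[i] = 0;        /* give it a second chance */
        }
    }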

5 Evaluation

This section investigates how the proposed techniques and optimizations contribute to performance and space efficiency. We "zoom out" the evaluation targets, starting with the hash table itself, moving to the cache (including the hash table and cache eviction management), and concluding with the full MemC3 system (including the cache and network). With all optimizations combined, MemC3 achieves 3× the throughput of Memcached. Our proposed core hash table, if isolated, can achieve 5 million lookups/sec per thread and 35 million lookups/sec when accessed by 12 threads.

5.1 Platform

All experiments run on a machine with the following configuration. The CPU of this server is optimized for energy efficiency rather than high performance, and our system is CPU intensive, so we expect the absolute performance would be higher on "beefier" servers.

    CPU      2× Intel Xeon L5640 @ 2.27 GHz
    # cores  2 × 6
    LLC      2 × 12 MB L3-cache
    DRAM     2 × 16 GB DDR SDRAM
    NIC      10Gb Ethernet

5.2 Hash Table Microbenchmark

In the following experiments, we first benchmark the construction of hash tables and measure the space efficiency. Then we examine the lookup performance of a single thread and the aggregate throughput of 6 threads all accessing the same hash table, to analyze the contribution of different optimizations. In this subsection, hash tables are linked into a workload generator directly and benchmarked on a local machine.

Space Efficiency and Construction Speed We insert unique keys into empty cuckoo and chaining hash tables using a single thread, until each hash table reaches its maximum capacity. The chaining hash table, as used in Memcached, stops insertion if 1.5n objects are inserted to a table of n buckets, to prevent imbalanced load across buckets; our cuckoo hash table stops when a single Insert fails to find an empty slot after 500 consecutive displacements. We initialize both types of hash tables to have a similar size (around 1.2 GB, including the space cost for pointers).

Table 2: Comparison of space efficiency and construction speed of hash tables. Results in this table are independent of the key-value size. Each data point is the average of 10 runs.

    Hash table     Size (MB)  # keys (million)  Byte/key  Load factor  Constr. rate (million keys/sec)  Largest bucket
    Chaining       1280       100.66            13.33     −            14.38                            13
    Cuckoo 1-path  1152       127.23            9.49      94.79%       6.19                             4
    Cuckoo 2-path  1152       127.41            9.48      94.93%       7.43                             4
    Cuckoo 3-path  1152       127.67            9.46      95.20%       7.29                             4

Table 2 shows that the cuckoo hash table is much more compact. Chaining requires 1280 MB to index 100.66 million items (i.e., 13.33 bytes per key); cuckoo hash tables are both smaller in size (1152 MB) and contain at least 20% more items, using no more than 10 bytes to index each key. Both cuckoo and chaining hash tables store only pointers to objects rather than the real key-value data; the index size is reduced by 1/3. A smaller index matters more for small key-value pairs.

Table 2 also compares cuckoo hash tables using different numbers of cuckoo paths to search for empty slots (Section 3.2.2). All of the cuckoo hash tables have high occupancy (roughly 95%). While more cuckoo paths only slightly improve the load factor, they boost construction speed non-trivially. The table with 2-way search achieves the highest construction rate (7.43 MOPS), as searching on two cuckoo paths balances the chance to find an empty slot vs. the resources required to keep track of all paths. Chaining table construction is twice as fast as cuckoo hashing, because each insertion requires modifying only the head of the chain. Though fast, its most loaded bucket contains 13 objects in a chain (the average bucket has 1.5 objects). In contrast, bucket size in a cuckoo hash table is fixed (i.e., 4 slots), making it a better match for our targeted read-intensive workloads.

Cuckoo Insert Although the amortized cost to insert one key with cuckoo hashing is O(1), it requires more displacements to find an empty slot when the table is more occupied. We therefore measure the insertion cost—in terms of both the number of displacements per insert and the latency—to a hash table with x% of all slots filled, and vary x from 0% to the maximum possible load factor. Using two cuckoo paths improves insertion latency, but using more than that has diminishing or negative returns. Figure 4 further shows the reciprocal throughput, expressed as latency. When the table is 70% filled, a cuckoo insert can complete within 100 ns. At 95% occupancy, insert delay is 1.3 µs with a single cuckoo path, and 0.84 µs using two.

Figure 4: Cuckoo insert, with different numbers of parallel searches (1, 2, and 3 cuckoo paths). Each data point is the average of 10 runs. [log-scale plot of insert latency vs. load factor; latency grows from about 43 ns on a near-empty table towards 1 µs as the load factor approaches 100%]

Factor Analysis of Lookup Performance This experiment investigates how much each optimization in Section 3 contributes to the hash table. We break down the performance gap between the basic chaining hash table used by Memcached and the final optimistic cuckoo hash table we proposed, and measure a set of hash tables—starting from the basic chaining and adding optimizations cumulatively as follows:

• Chaining is the default hash table of Memcached, serving as the baseline. A global lock is used to synchronize multiple threads.
• +hugepage enables 2MB x86 hugepage support in Linux to reduce TLB misses.
• +int keycmp replaces the default memcmp (used for full key comparison) by casting each key into an integer array and then comparing based on the integers (see the sketch after this list).
• +bucket lock replaces the global lock by bucket-based locks.
• cuckoo applies the naive cuckoo hashing to replace chaining, without storing the tags in buckets, and uses bucket-based locking to coordinate multiple threads.
• +tag stores the 1-Byte hash for each key to improve cache-locality for both Insert and Lookup (Section 3.1).
• +opt lock replaces the per-bucket locking scheme by optimistic locking to ensure atomic displacement (Section 3.2.1).
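The "+int keycmp" optimization can be as simple as the following sketch; the assumption (ours, for illustration) is that stored keys are padded so they can be compared as 8-byte words, avoiding memcmp's per-call startup overhead on short keys.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compare two keys as arrays of 64-bit words instead of bytes. */
    static bool int_keycmp(const uint64_t *a, const uint64_t *b, size_t nwords) {
        for (size_t i = 0; i < nwords; i++)
            if (a[i] != b[i])
                return false;
        return true;
    }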

Figure 5: Contribution of optimizations to the hash table lookup performance. Optimizations are cumulative. Each data point is the average of 10 runs.
(a) Single-thread lookup performance with non-concurrency optimizations, all locks disabled. [bar chart; throughput (million reqs per sec) for positive and negative lookups over chaining, +hugepage, +int keycmp, cuckoo, and +tag; negative-lookup throughput reaches 15.04 M reqs/sec with +tag]
(b) Aggregate lookup performance of 6 threads with all optimizations. [bar chart; throughput (million reqs per sec) for positive and negative lookups over chaining w/ global lock, +hugepage, +int keycmp, +bucket lock, cuckoo w/ bkt lock, +tag, and +opt lock; negative-lookup throughput reaches 47.89 M reqs/sec with +opt lock]

Single-thread lookup performance is shown in Figure 5a with lookups all positive or all negative. No lock is used for this experiment. In general, combining all optimizations improves performance by ~2× compared to the naive chaining in Memcached for positive lookups, and by ~5× for negative lookups. Enabling "hugepage" improves the baseline performance slightly, while "int keycmp" can almost double the performance over "hugepage" for both workloads. This is because our keys are relatively small, so the startup overhead in the built-in memcmp becomes relatively large. Using cuckoo hashing without the "tag" optimization reduces performance, because naive cuckoo hashing requires more memory references to retrieve the keys in all 4 × 2 = 8 candidate locations on each lookup (as described in Section 3.1). The "tag" optimization therefore significantly improves the throughput of read-only workloads (2× for positive lookups and 8× for negative lookups), because it compares the 1-byte tag first before fetching the real keys outside the table and thus eliminates a large fraction of CPU cache misses.

Multi-thread lookup performance is shown in Figure 5b, measured by aggregating the throughput from 6 threads accessing the same hash table. Different from the previous experiment, a global lock is used for the baseline chaining (as in Memcached by default) and replaced by per-bucket locking and finally optimistic locking for the cuckoo hash table. The performance gain (~12× for positive and ~25× for negative lookups) of our proposed hashing scheme over the default Memcached hash table is large. In Memcached, all hash table operations are serialized by a global lock, thus the basic chaining hash table in fact performs worse than its single-thread throughput in Figure 5a. The slight improvement (< 40%) from "hugepage" and "int keycmp" indicates that most performance benefit is from making the data structures concurrent. The "bucket lock" optimization replaces the global lock in chaining hash tables and thus significantly improves the performance, by 5× to 6× compared to "int keycmp". Using the basic concurrent cuckoo reduces throughput (due to unnecessary memory references), while the "tag" optimization is again essential to boost the performance of cuckoo hashing and outperform chaining with per-bucket locks. Finally, the optimistic locking scheme further improves the performance significantly.

Multi-core Scalability Figure 6 illustrates how the total hash table throughput changes as more threads access the same hash table. We evaluate read-only and 10% write workloads. The throughput of the default hash table does not scale for either workload, because all hash table operations are serialized. Due to lock contention, the throughput is actually lower than the single-thread throughput without locks.

Using our proposed cuckoo hashing for the read-only workload, the performance scales linearly to 6 threads because each thread is pinned on a dedicated physical core on the same 6-core CPU. The next 6 threads are pinned to the other 6-core CPU in the same way. The slope of the curve becomes lower due to cross-CPU memory traffic. Threads after the first 12 are assigned to already-busy cores, and thus performance does not further increase.

With 10% Insert, our cuckoo hashing reaches a peak performance of 20 MOPS at 10 threads. Each Insert requires a lock to be serialized, and after 10 threads the lock contention becomes the bottleneck.

Figure 6: Hash table throughput vs. number of threads. Each data point is the average of 10 runs. [line chart; throughput (million reqs/sec) vs. 0–16 threads for cuckoo 100% rd, cuckoo 90% rd, chaining 100% rd, and chaining 90% rd]

We further vary the fraction of insert queries in the workload and measure the best performance achieved by different hash tables. Figure 7 shows this best performance and also the number of threads (between 1 and 16) required to achieve this performance. In general, cuckoo hash tables outperform chaining hash tables. When more write traffic is generated, performance of cuckoo hash tables declines because Inserts are serialized while more Lookups happen concurrently. Consequently, the best performance for 10% insert is achieved using only 9 threads, while with 100% lookup, it scales to 16 threads. In contrast, the best performance of chaining hash tables (with either a global lock or per-bucket locks) stays roughly the same as the workloads become more write-intensive.

5.3 Cache Microbenchmark

Workload We use YCSB [5] to generate 100 million key-value queries, following a zipf distribution. Each key is 16 Bytes and each value 32 Bytes. We evaluate caches with four configurations:

• chaining+LRU: the default Memcached cache configuration, using a chaining hash table to index keys and LRU for replacement;
• cuckoo+LRU: keeping LRU, but replacing the hash table by concurrent optimistic cuckoo hashing with all optimizations proposed;
• chaining+CLOCK: an alternative baseline combining optimized chaining with the CLOCK replacement algorithm. Because CLOCK requires no serialization to update, we also replace the global locking in the chaining hash table with per-bucket locks; we further include our engineering optimizations such as "hugepage" and "int keycmp".
• cuckoo+CLOCK: the data structure of MemC3, using cuckoo hashing to index keys and CLOCK for replacement.

We vary the cache size from 64 MB to 10 GB. Note that this cache size parameter does not count the space for the hash table, only the space used to store key-value objects. All four types of caches are linked into a workload generator and micro-benchmarked locally.

Cache Throughput Because each GET miss is followed by a SET to the cache, to understand the cache performance with heavier or lighter insertion load, we evaluate two settings:

• a read-only workload on a "big" cache (i.e., 10 GB, which is larger than the working set), which has no cache misses or inserts and is the best case for performance;
• a write-intensive workload on a "small" cache (i.e., 1 GB, which is ~10% of the total working set) where about 15% of GETs miss the cache. Since each miss triggers a SET in turn, a workload with 15% inserts is worse than the typical real-world workload reported by Facebook [3].

Figure 8a shows the results of benchmarking the "big cache". Though there are no inserts, the throughput does not scale for the default cache (chaining+LRU), due to lock contention on each LRU update (moving an object to the head of the linked list). Replacing the default chaining with the concurrent cuckoo hash table improves the peak throughput slightly. This suggests that only having a concurrent hash table is not enough for high performance. After replacing the global lock with bucket-based locks and removing the LRU synchronization bottleneck by using CLOCK, the chaining-based cache achieves 22 MOPS at 12 threads, and drops quickly due to the CPU overhead for lock contention after all 12 physical cores are assigned. Our proposed cuckoo hash table combined with CLOCK, however, scales to 30 MOPS at 16 threads.

Figure 8b shows that peak performance for the "small cache" is 6 MOPS, achieved by combining CLOCK and cuckoo hashing. The 15% GET misses result in about 15% hash table inserts, so throughput drops after 6 threads due to serialized inserts.

Figure 7: Tradeoff between lookup and insert performance. Best read + write throughput (with the number of threads needed to achieve it) is shown. Each data point is the average of 10 runs. [bar chart; best throughput (million reqs per sec), with the thread count (1–16) annotated above each bar, for insert fractions 0%, 2.5%, 5%, 7.5%, and 10%, comparing chaining w/ global lock, chaining w/ bkt lock, cuckoo w/ bkt lock, and cuckoo w/ opt lock]

Figure 8: Cache throughput vs. number of threads. Each data point is the average of 10 runs. (a) 10GB "big cache" (> working set): 100% of GETs hit. (b) 1GB "small cache" (< working set): 85% of GETs hit. [line charts; throughput (million reqs per sec) vs. 0–16 threads for cuckoo+CLOCK, chaining+CLOCK (bkt lock), cuckoo+LRU, and chaining+LRU]

Space Efficiency Table 3 compares the maximum number of items (16-Byte key and 32-Byte value) a cache can store given different cache sizes.⁵ The default LRU with chaining is the least memory efficient scheme. Replacing chaining with cuckoo hashing improves the space utilization slightly (7%), because one pointer (for hash table chaining) is eliminated from each key-value object. Keeping chaining but replacing LRU with CLOCK improves space efficiency by 27%, because two pointers (for LRU) and one reference count are saved per object. Combining CLOCK with cuckoo increases the space efficiency by 40% over the default. The space benefit arises from eliminating three pointers and one reference count per object.

Cache Miss Ratio Compared to the linked-list based approach in Memcached, CLOCK approximates LRU eviction with much lower space overhead. This experiment sends 100 million queries (95% GET and 5% SET, in zipf distribution) to a cache with different configurations, and measures the resulting cache miss ratios. Note that each GET miss will trigger a retrieval to the backend database system, therefore reducing the cache miss ratio from 10% to 7% means a reduction of traffic to the backend by 30%. Table 3 shows that when the cache size is smaller than 256 MB, the LRU-based cache provides a lower miss ratio than CLOCK. LRU with cuckoo hashing improves upon LRU with chaining, because it can store more items. In this experiment, 256 MB is only about 2.6% of the 10 GB working set. Therefore, when the cache size is very small, CLOCK—which is an approximation—has a higher chance of evicting popular items than strict LRU. For larger caches, CLOCK with cuckoo hashing outperforms the other two schemes because the extra space improves the hit ratio more than the loss of precision decreases it.

    ⁵ The space to store the index hash tables is separate from the given cache space in Table 3. We set the hash table capacity larger than the maximum number of items that the cache space can possibly allocate. If chaining is used, the chaining pointers (inside each key-value object) are also allocated from the cache space.

    cache size                   64 MB    128 MB   256 MB   512 MB   1 GB     2 GB

    # items stored (million)
      chaining+LRU               0.60     1.20     2.40     4.79     9.59     19.17
      cuckoo+LRU                 0.65     1.29     2.58     5.16     10.32    20.65
      chaining+CLOCK             0.76     1.53     3.05     6.10     12.20    24.41
      cuckoo+CLOCK               0.84     1.68     3.35     6.71     13.42    26.84

    cache miss ratio (95% GET, 5% SET, zipf distribution)
      chaining+LRU               36.34%   31.91%   27.27%   22.30%   16.80%   10.44%
      cuckoo+LRU                 35.87%   31.42%   26.76%   21.74%   16.16%   9.80%
      chaining+CLOCK             37.07%   32.51%   27.63%   22.20%   15.96%   8.54%
      cuckoo+CLOCK               36.46%   31.86%   26.92%   21.38%   14.68%   7.89%

Table 3: Comparison of four types of caches. Results in this table depend on the object size (16-Byte key and 32-Byte value used). Bold entries are the best in their columns. Each data point is the average of 10 runs.

5.4 Full System Performance

Workload This experiment uses the same workload as in Section 5.3, with 95% GETs and 5% SETs generated by YCSB with zipf distribution. MemC3 runs on the same server as before, but the clients are 50 different nodes connected by 10Gb Ethernet. The clients use libmemcached 1.0.7 [16] to communicate with our MemC3 server over the network. To amortize the network overhead, we use the multi-get supported by libmemcached [16], batching 100 GETs per request.
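For reference, a client-side multi-get batch with libmemcached looks roughly like the sketch below. The server name and key set are placeholders; we believe these calls match the libmemcached C API, but treat the details as illustrative rather than our exact benchmark client.

    #include <libmemcached/memcached.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "memc3.example.com", 11211);

        /* build a batch of 100 keys */
        static char keybuf[100][32];
        const char *keys[100];
        size_t key_len[100];
        for (int i = 0; i < 100; i++) {
            snprintf(keybuf[i], sizeof keybuf[i], "key-%d", i);
            keys[i] = keybuf[i];
            key_len[i] = strlen(keybuf[i]);
        }

        /* issue all 100 GETs in one round trip */
        memcached_mget(memc, keys, key_len, 100);

        memcached_return_t rc;
        char key[MEMCACHED_MAX_KEY];
        size_t klen, vlen;
        uint32_t flags;
        char *value;
        while ((value = memcached_fetch(memc, key, &klen, &vlen, &flags, &rc)) != NULL) {
            /* ... process the (key, value) pair ... */
            free(value);
        }
        memcached_free(memc);
        return 0;
    }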

In this experiment, we compare four different systems: original Memcached; optimized Memcached (with non-algorithmic optimizations such as "hugepage", "int keycmp" and tuned CPU affinity); optimized Memcached with sharding (one core per Memcached instance); and MemC3 with all optimizations enabled. Each system is allocated 1GB of memory space (not including hash table space).

Throughput Figure 9 shows the throughput as more server threads are used. Overall, the maximum throughput of MemC3 (4.4 MOPS) is almost 3× that of the original Memcached (1.5 MOPS). The non-algorithmic optimizations improve throughput, but their contribution is dwarfed by the algorithmic and data structure-based improvements.

A surprising result is that today's popular technique, sharding, performs the worst in this experiment. This occurs because the workload generated by YCSB is heavy-tailed, and therefore imposes differing load on the memcached instances. Those serving "hot" keys are heavily loaded while the others are comparatively idle. While the severity of this effect depends heavily upon the workload distribution, it highlights an important benefit of MemC3's approach of sharing all data between all threads.

Figure 9: Full system throughput (over network) vs. number of server threads. [line chart; throughput (million reqs per sec) vs. 0–16 threads for MemC3, modified MC, original MC, and modified MC+sharding]

6 Related Work

This section presents two categories of work most related to MemC3: efforts to improve individual key-value storage nodes in terms of throughput and space efficiency; and the related work applying cuckoo hashing.

Flash-based key-value stores such as BufferHash [1], FAWN-DS [2], SkimpyStash [8] and SILT [17] are optimized for I/O to external storage such as SSDs (e.g., by batching, or log-structuring small writes). Without slow I/O, the optimization goals for MemC3 are saving memory and eliminating synchronization. Previous work in memory-based key-value stores [4, 20, 13] boosts performance on multi-core CPUs or GP-GPUs by sharding data to dedicated cores to avoid synchronization. MemC3 instead targets read-mostly workloads and deliberately avoids sharding to ensure high performance even for "hot" keys. Similar to MemC3, Masstree [18] also applied extensive optimizations for cache locality and optimistic concurrency control, but used very different techniques because it was a variation of a B+-tree to support range queries. RAMCloud [22] focused on fast data reconstruction from on-disk replicas. In contrast, as a cache, MemC3 specifically takes advantage of the transience of the data it stores to improve space efficiency.

Cuckoo hashing [23] is an open-addressing hashing scheme with high space efficiency that assigns multiple candidate locations to each item and allows inserts to kick existing items to their candidate locations. FlashStore [7] applied cuckoo hashing by assigning each item 16 locations so that each lookup checks up to 16 locations, while our scheme requires reading only 2 locations in the hash table. We previously proposed partial-key cuckoo hashing in the SILT system [17] to achieve high occupancy with only two hash functions, but our earlier algorithm limited the maximum hash table size and was therefore unsuitable for large in-memory caches. Our improved algorithm eliminates this limitation while retaining high memory efficiency. To make cuckoo operations concurrent, the prior approach of Herlihy et al. [12] traded space for concurrency. In contrast, our optimistic locking scheme allows concurrent readers without losing space efficiency.

7 Conclusion

MemC3 is an in-memory key-value store that is designed to provide caching for read-mostly workloads. It is built on carefully designed and engineered algorithms and data structures, with a set of architecture-aware and workload-aware optimizations to achieve high concurrency, space-efficiency and cache-locality. In particular, MemC3 uses a new hashing scheme—optimistic cuckoo hashing—that achieves over 90% space occupancy and allows concurrent read access without locking. MemC3 also employs CLOCK-based cache management with only 1 bit per entry to approximate LRU eviction. Compared to Memcached, it reduces space overhead by more than 20 Bytes per entry. Our evaluation shows the throughput of our system is 3× higher than the original Memcached while storing 30% more objects for small key-value pairs.

Acknowledgments

This work was supported by funding from the Intel Science and Technology Center for Cloud Computing, National Science Foundation award CCF-0964474, Google, and the PDL member companies. We thank the NSDI reviewers for their feedback, Iulian Moraru, Dong Zhou, Michael Mitzenmacher, and Rasmus Pagh for their valuable algorithmic suggestions, and Matthew Caesar for shepherding this paper.
A Correctness of the optimistic locking on keys

This appendix examines the possible interleavings of two threads in order to show that the optimistic locking scheme correctly prevents Lookup from returning wrong or corrupted data. Assume that threads T1 and T2 concurrently access the hash table. When both threads perform Lookup, correctness is trivial. When both Insert, they are serialized (Insert is guarded by a lock). The remaining case occurs when T1 is Insert and T2 is Lookup.

During Insert, T1 may perform a sequence of displacement operations where each displacement is preceded and followed by incrementing the counter. Without loss of generality, assume T1 is displacing key1 to a destination slot that originally hosts key0. Each slot contains a tag and a pointer, as shown:

[diagram: T1 displaces key1 into the slot (tag0, ptr0) that currently hosts key0, while T2 concurrently looks up a key that maps to that slot]

key0 differs from key1 because there are no two identical keys in the hash table, which is guaranteed because every Insert effectively does a Lookup first. If T2 reads the same slot as T1 before T1 completes its update:

case 1: T2 is looking for key0. Because Insert moves backwards along a cuckoo path (Section 3.2), key0 must have been displaced to its other bucket (say bucket i), thus
• if T2 has not checked bucket i, it will find key0 when it proceeds to that bucket;
• if T2 checked bucket i and did not find key0 there, the operation that moved key0 to bucket i must have happened after T2 read bucket i. Therefore, T2 will see a change in key0's version counter and retry.

case 2: T2 is looking for key1. Since T1 atomically updates key1's version before and after the displacement, no matter what T2 reads, it will detect the version change and retry.

case 3: T2 is looking for a key other than key0 or key1. No matter what T2 sees in the slot, it will be rejected eventually, either by the tags or by the full key comparison following the pointers. This is because the pointer field of this slot fetched by T2 is either ptr(key0) or ptr(key1), rather than some corrupted pointer, as ensured by the atomic read/write of 64-bit aligned pointers on 64-bit machines.⁶

    ⁶ Quadword memory accesses aligned on a 64-bit boundary are atomic on Pentium and newer CPUs [14].

References

[1] A. Anand, C. Muthukrishnan, S. Kappes, A. Akella, and S. Nath. Cheap and large CAMs for high performance data-intensive networked systems. In Proc. 7th USENIX NSDI, Apr. 2010.
[2] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. Communications of the ACM, 54(7):101–109, July 2011.
[3] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of SIGMETRICS'12, 2012.
[4] M. Berezecki, E. Frachtenberg, M. Paleczny, and K. Steele. Many-core key-value store. In Proceedings of the Second International Green Computing Conference, Aug. 2011.
[5] B. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proc. 1st ACM Symposium on Cloud Computing (SOCC), June 2010.
[6] F. Corbató. A Paging Experiment with the Multics System. MIT Project MAC / Defense Technical Information Center, 1968. URL http://books.google.com/books?id=5wDQNwAACAAJ.
[7] B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. Proc. VLDB Endow., 3:1414–1425, Sept. 2010.
[8] B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash. In Proc. ACM SIGMOD, June 2011.
[9] U. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In Seventh Workshop on Distributed Data and Structures (WDAS 2006), pages 1–6, 2006.
[10] Facebook Engineering Notes. Scaling memcached at Facebook. http://www.facebook.com/note.php?note_id=39391378919.
[11] N. Gunther, S. Subramanyam, and S. Parvu. Hidden scalability gotchas in memcached and friends. In VELOCITY Web Performance and Operations Conference, June 2010.
[12] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN 9780123705914.
[13] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[14] Intel Corporation. Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3A. http://www.intel.com/content/www/us/en/architecture-and-technology/, 2011.
[15] H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Trans. Database Syst., 6(2):213–226, June 1981. ISSN 0362-5915.
[16] libMemcached. http://libmemcached.org/, 2009.
[17] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011.
[18] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proc. EuroSys, pages 183–196, Apr. 2012.
[19] Memcached. A distributed memory object caching system. http://memcached.org/, 2011.
[20] Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: A cache-partitioned hash table. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[21] MongoDB. http://mongodb.com.
[22] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011.
[23] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, May 2004.
[24] N. Provos. libevent. http://monkey.org/~provos/libevent/.
[25] Redis. http://redis.io.
[26] VoltDB. VoltDB, the NewSQL database for high velocity applications. http://voltdb.com/.
