Scaling Memcache at

Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry . Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani rajeshn,hans @fb.com, sgrimm, marc @facebook.com, herman, hcli, rm, mpal, dpeek, ps, dstaff, ttung, veeve @fb.com { } { } { } Facebook Inc.

Abstract: Memcached is a well known, simple, in- however, web pages routinely fetch thousands of key- memory caching solution. This paper describes how value pairs from memcached servers. Facebook leverages memcached as a building block to One of our goals is to present the important themes construct and scale a distributed key-value store that that emerge at different scales of our deployment. While supports the world’s largest social network. Our system qualities like performance, efficiency, fault-tolerance, handles billions of requests per second and holds tril- and consistency are important at all scales, our experi- lions of items to deliver a rich experience for over a bil- ence indicates that at specific sizes some qualities re- lion users around the world. quire more effort to achieve than others. For exam- ple, maintaining data consistency can be easier at small 1 Introduction scales if replication is minimal compared to larger ones where replication is often necessary. Additionally, the Popular and engaging social networking sites present importance of finding an optimal communication sched- significant infrastructure challenges. Hundreds of mil- ule increases as the number of servers increase and net- lions of people use these networks every day and im- working becomes the bottleneck. pose computational, network, and I/O demands that tra- This paper includes four main contributions: (1) ditional web architectures struggle to satisfy. A social We describe the evolution of Facebook’s memcached- network’s infrastructure needs to (1) allow near real- based architecture. (2) We identify enhancements to time communication, (2) aggregate content on-the-fly memcached that improve performance and increase from multiple sources, (3) be able to access and update memory efficiency. (3) We highlight mechanisms that very popular shared content, and (4) scale to process improve our ability to operate our system at scale. (4) millions of user requests per second. We characterize the production workloads imposed on We describe how we improved the open source ver- our system. sion of memcached [14] and used it as a building block to construct a distributed key-value store for the largest so- 2 Overview cial network in the world. We discuss our journey scal- The following properties greatly influence our design. ing from a single cluster of servers to multiple geograph- First, users consume an order of magnitude more con- ically distributed clusters. To the best of our knowledge, tent than they create. This behavior results in a workload this system is the largest memcached installation in the dominated by fetching data and suggests that caching world, processing over a billion requests per second and can have significant advantages. Second, our read op- storing trillions of items. erations fetch data from a variety of sources such as This paper is the latest in a series of works that have MySQL , HDFS installations, and backend recognized the flexibility and utility of distributed key- services. This heterogeneity requires a flexible caching value stores [1, 2, 5, 6, 12, 14, 34, 36]. This paper fo- strategy able to store data from disparate sources. cuses on memcached—an open-source implementation Memcached provides a simple set of operations (set, of an in-memory —as it provides low latency get, and delete) that makes it attractive as an elemen- access to a shared storage pool at low cost. These quali- tal component in a large-scale distributed system. The ties enable us to build data-intensive features that would open-source version we started with provides a single- otherwise be impractical. For example, a feature that machine in-memory hash table. In this paper, we discuss issues hundreds of queries per page request how we took this basic building block, made it more ef- would likely never leave the prototype stage because it ficient, and used it to build a distributed key-value store would be too slow and expensive. In our application, that can process billions of requests per second. Hence-

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 385 web web         server server

1. get k 2. SELECT ... 1. UPDATE ...

2. delete k   3. set (k,v)   memcache database memcache database   Figure 1: Memcache as a demand-filled look-aside   cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.         forth, we use ‘memcached’ to refer to the source code or a running binary and ‘memcache’ to describe the dis- tributed system. Figure 2: Overall architecture memcache Query cache: We rely on to lighten the read We structure our paper to emphasize the themes that memcache load on our databases. In particular, we use emerge at three different deployment scales. Our read- demand-filled look-aside as a cache as shown in Fig- heavy workload and wide fan-out is the primary con- ure 1. When a web server needs data, it first requests cern when we have one cluster of servers. As it becomes memcache the value from by providing a string key. If necessary to scale to multiple frontend clusters, we ad- the item addressed by that key is not cached, the web dress data replication between these clusters. Finally, we server retrieves the data from the database or other back- describe mechanisms to provide a consistent user ex- end service and populates the cache with the key-value perience as we spread clusters around the world. Op- pair. For write requests, the web server issues SQL state- erational complexity and fault tolerance is important at ments to the database and then sends a delete request to all scales. We present salient data that supports our de- memcache that invalidates any stale data. We choose to sign decisions and refer the reader to work by Atikoglu delete cached data instead of updating it because deletes et al. [8] for a more detailed analysis of our workload. At Memcache are idempotent. is not the authoritative source a high-level, Figure 2 illustrates this final architecture in of the data and is therefore allowed to evict cached data. which we organize co-located clusters into a region and While there are several ways to address excessive designate a master region that provides a data stream to read traffic on MySQL databases, we chose to use keep non-master regions up-to-date. memcache. It was the best choice given limited engi- While evolving our system we prioritize two ma- neering resources and time. Additionally, separating our jor design goals. (1) Any change must impact a user- caching layer from our persistence layer allows us to ad- facing or operational issue. Optimizations that have lim- just each layer independently as our workload changes. ited scope are rarely considered. (2) We treat the prob- Generic cache: We also leverage memcache as a more ability of reading transient stale data as a parameter to general key-value store. For example, engineers use be tuned, similar to responsiveness. We are willing to memcache to store pre-computed results from sophisti- expose slightly stale data in exchange for insulating a cated machine learning algorithms which can then be backend storage service from excessive load. used by a variety of other applications. It takes little ef- fort for new services to leverage the existing marcher 3 In a Cluster: Latency and Load infrastructure without the burden of tuning, optimizing, We now consider the challenges of scaling to thousands provisioning, and maintaining a large server fleet. of servers within a cluster. At this scale, most of our As is, memcached provides no server-to-server co- efforts focus on reducing either the latency of fetching ordination; it is an in-memory hash table running on cached data or the load imposed due to a cache miss. a single server. In the remainder of this paper we de- scribe how we built a distributed key-value store based 3.1 Reducing Latency on memcached capable of operating under Facebook’s Whether a request for data results in a cache hit or miss, workload. Our system provides a suite of configu- the latency of memcache’s response is a critical factor ration, aggregation, and routing services to organize in the response time of a user’s request. A single user memcached instances into a distributed system. web request can often result in hundreds of individual

386 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association memcache get requests. For example, loading one of our popular pages results in an average of 521 distinct items UDP direct fetched from memcache.1 by mcrouter (TCP) We provision hundreds of memcached servers in a cluster to reduce load on databases and other services. Items are distributed across the memcached servers through consistent hashing [22]. Thus web servers have to routinely communicate with many memcached servers microseconds to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern

can cause incast congestion [30] or allow a single server 0 200 600 1000 1400 to become the bottleneck for many web servers. Data Average of Medians Average of 95th Percentiles replication often alleviates the single-server bottleneck but leads to significant memory inefficiencies in the Figure 3: Get latency for UDP, TCP via mcrouter common case. We reduce latency mainly by focusing on the connection thereby reducing the overhead. The UDP memcache client, which runs on each web server. This implementation detects packets that are dropped or re- client serves a range of functions, including serializa- ceived out of order (using sequence numbers) and treats tion, compression, request routing, error handling, and them as errors on the client side. It does not provide request batching. Clients maintain a map of all available any mechanism to try to recover from them. In our in- servers, which is updated through an auxiliary configu- frastructure, we find this decision to be practical. Un- memcache ration system. der peak load, clients observe that 0.25% of get requests are discarded. About 80% of these drops Parallel requests and batching: We structure our web- are due to late or dropped packets, while the remainder application code to minimize the number of network are due to out of order delivery. Clients treat get er- round trips necessary to respond to page requests. We rors as cache misses, but web servers will skip insert- construct a directed acyclic graph (DAG) representing ing entries into memcached after querying for data to the dependencies between data. A web server uses this avoid putting additional load on a possibly overloaded DAG to maximize the number of items that can be network or server. fetched concurrently. On average these batches consist of 24 keys per request2. For reliability, clients perform set and delete opera- tions over TCP through an instance of mcrouter run- Client-server communication: Memcached servers do ning on the same machine as the web server. For opera- not communicate with each other. When appropriate, tions where we need to confirm a state change (updates we embed the complexity of the system into a stateless and deletes) TCP alleviates the need to add a retry mech- client rather than in the memcached servers. This greatly anism to our UDP implementation. simplifies memcached and allows us to focus on making it highly performant for a more limited use case. Keep- Web servers rely on a high degree of parallelism and ing the clients stateless enables rapid iteration in the over-subscription to achieve high throughput. The high software and simplifies our deployment process. Client memory demands of open TCP connections makes it logic is provided as two components: a library that can prohibitively expensive to have an open connection be- memcached be embedded into applications or as a standalone proxy tween every web thread and server without mcrouter named mcrouter. This proxy presents a memcached some form of connection coalescing via . Co- server interface and routes the requests/replies to/from alescing these connections improves the efficiency of other servers. the server by reducing the network, CPU and memory resources needed by high throughput TCP connections. Clients use UDP and TCP to communicate with Figure 3 shows the average, median, and 95th percentile memcached servers. We rely on UDP for get requests to latencies of web servers in production getting keys over reduce latency and overhead. Since UDP is connection- UDP and through mcrouter via TCP. In all cases, the less, each thread in the web server is allowed to directly standard deviation from these averages was less than communicate with memcached servers directly, bypass- 1%. As the data show, relying on UDP can lead to a ing mcrouter, without establishing and maintaining a 20% reduction in latency to serve requests. 1The 95th percentile of fetches for that page is 1,740 items. Incast congestion: Memcache clients implement flow- 2The 95th percentile is 95 keys per request. control mechanisms to limit incast congestion. When a

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 387 3.2 Reducing Load 95th Percentile We use memcache to reduce the frequency of fetch- Median ing data along more expensive paths such as database queries. Web servers fall back to these paths when the desired data is not cached. The following subsections describe three techniques for decreasing load.

milliseconds 3.2.1 Leases We introduce a new mechanism we call leases to address two problems: stale sets and thundering herds. A stale 0 10 20 30 40 set occurs when a web server sets a value in memcache 100 200 300 400 500 that does not reflect the latest value that should be cached. This can occur when concurrent updates to Window Size memcache get reordered. A thundering herd happens when a specific key undergoes heavy read and write ac- Figure 4: Average time web requests spend waiting to tivity. As the write activity repeatedly invalidates the re- be scheduled cently set values, many reads default to the more costly path. Our lease mechanism solves both problems. Intuitively, a memcached instance gives a lease to a client requests a large number of keys, the responses client to set data back into the cache when that client ex- can overwhelm components such as rack and cluster periences a cache miss. The lease is a 64-bit token bound switches if those responses arrive all at once. Clients to the specific key the client originally requested. The therefore use a sliding window mechanism [11] to con- client provides the lease token when setting the value trol the number of outstanding requests. When the client in the cache. With the lease token, memcached can ver- receives a response, the next request can be sent. Similar ify and determine whether the data should be stored and to TCP’s congestion control, the size of this sliding win- thus arbitrate concurrent writes. Verification can fail if dow grows slowly upon a successful request and shrinks memcached has invalidated the lease token due to re- when a request goes unanswered. The window applies ceiving a delete request for that item. Leases prevent to all memcache requests independently of destination; stale sets in a manner similar to how load-link/store- whereas TCP windows apply only to a single stream. conditional operates [20]. A slight modification to leases also mitigates thunder- Figure 4 shows the impact of the window size on the ing herds. Each memcached server regulates the rate at amount of time user requests are in the runnable state which it returns tokens. By default, we configure these but are waiting to be scheduled inside the web server. servers to return a token only once every 10 seconds per The data was gathered from multiple racks in one fron- key. Requests for a key’s value within 10 seconds of a tend cluster. User requests exhibit a Poisson arrival pro- token being issued results in a special notification telling cess at each web server. According to Little’s Law [26], the client to wait a short amount of time. Typically, the L = λW, the number of requests queued in the server client with the lease will have successfully set the data (L) is directly proportional to the average time a request within a few milliseconds. Thus, when waiting clients takes to process (W), assuming that the input request retry the request, the data is often present in cache. rate is constant (which it was for our experiment). The To illustrate this point we collect data for all cache time web requests are waiting to be scheduled is a di- misses of a set of keys particularly susceptible to thun- rect indication of the number of web requests in the dering herds for one week. Without leases, all of the system. With lower window sizes, the application will cache misses resulted in a peak database query rate of have to dispatch more groups of memcache requests se- 17K/s. With leases, the peak database query rate was rially, increasing the duration of the web request. As the 1.3K/s. Since we provision our databases based on peak window size gets too large, the number of simultaneous load, our lease mechanism translates to a significant ef- memcache requests causes incast congestion. The result ficiency gain. will be memcache errors and the application falling back Stale values: With leases, we can minimize the appli- to the persistent storage for the data, which will result cation’s wait time in certain use cases. We can further in slower processing of web requests. There is a balance reduce this time by identifying situations in which re- between these extremes where unnecessary latency can turning slightly out-of-date data is acceptable. When a be avoided and incast congestion can be minimized. key is deleted, its value is transferred to a data struc-

388 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association 3.2.3 Replication Within Pools Low−churn High−churn 80 Within some pools, we use replication to improve the la- 60 tency and efficiency of memcached servers. We choose 40 to replicate a category of keys within a pool when (1) Terabytes 20 the application routinely fetches many keys simultane- Daily Weekly Daily Weekly ously, (2) the entire data set fits in one or two memcached Minimum, mean, and maximum servers and (3) the request rate is much higher than what Figure 5: Daily and weekly working set of a high-churn a single server can manage. family and a low-churn key family We favor replication in this instance over further di- viding the key space. Consider a memcached server ture that holds recently deleted items, where it lives for holding 100 items and capable of responding to 500k a short time before being flushed. A get request can re- requests per second. Each request asks for 100 keys. turn a lease token or data that is marked as stale. Appli- The difference in memcached overhead for retrieving cations that can continue to make forward progress with 100 keys per request instead of 1 key is small. To scale stale data do not need to wait for the latest value to be the system to process 1M requests/sec, suppose that we fetched from the databases. Our experience has shown add a second server and split the key space equally be- that since the cached value tends to be a monotonically tween the two. Clients now need to split each request for increasing snapshot of the database, most applications 100 keys into two parallel requests for 50 keys. Con- can use a stale value without any changes. sequently, both servers still have to process∼ 1M requests 3.2.2 Memcache Pools per second. However, if we replicate all 100 keys to mul- tiple servers, a client’s request for 100 keys can be sent Using memcache as a general-purpose caching layer re- to any replica. This reduces the load per server to 500k quires workloads to share infrastructure despite differ- requests per second. Each client chooses replicas based ent access patterns, memory footprints, and quality-of- on its own IP address. This approach requires delivering service requirements. Different applications’ workloads invalidations to all replicas to maintain consistency. can produce negative interference resulting in decreased hit rates. 3.3 Handling Failures To accommodate these differences, we partition a The inability to fetch data from memcache results in ex- cluster’s memcached servers into separate pools. We cessive load to backend services that could cause fur- designate one pool (named wildcard) as the default and ther cascading failures. There are two scales at which provision separate pools for keys whose residence in we must address failures: (1) a small number of hosts wildcard is problematic. For example, we may provi- are inaccessible due to a network or server failure or (2) sion a small pool for keys that are accessed frequently a widespread outage that affects a significant percent- but for which a cache miss is inexpensive. We may also age of the servers within the cluster. If an entire clus- provision a large pool for infrequently accessed keys for ter has to be taken offline, we divert user web requests which cache misses are prohibitively expensive. to other clusters which effectively removes all the load Figure 5 shows the working set of two different sets from memcache within that cluster. of items, one that is low-churn and another that is high- For small outages we rely on an automated remedi- churn. The working set is approximated by sampling all ation system [3]. These actions are not instant and can operations on one out of every one million items. For take up to a few minutes. This duration is long enough to each of these items, we collect the minimum, average, cause the aforementioned cascading failures and thus we and maximum item size. These sizes are summed and introduce a mechanism to further insulate backend ser- multiplied by one million to approximate the working vices from failures. We dedicate a small set of machines, set. The difference between the daily and weekly work- named Gutter, to take over the responsibilities of a few ing sets indicates the amount of churn. Items with differ- failed servers. Gutter accounts for approximately 1% of ent churn characteristics interact in an unfortunate way: the memcached servers in a cluster. low-churn keys that are still valuable are evicted before When a memcached client receives no response to its high-churn keys that are no longer being accessed. Plac- get request, the client assumes the server has failed and ing these keys in different pools prevents this kind of issues the request again to a special Gutter pool. If this negative interference, and allows us to size high-churn second request misses, the client will insert the appropri- pools appropriate to their cache miss cost. Section 7 pro- ate key-value pair into the Gutter machine after querying vides further analysis. the database. Entries in Gutter expire quickly to obviate

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 389 Gutter invalidations. Gutter limits the load on backend Memcache services at the cost of slightly stale data. Note that this design differs from an approach in Mcrouter which a client rehashes keys among the remaining memcached servers. Such an approach risks cascading Storage Server failures due to non-uniform key access frequency. For MySQL McSqueal example, a single key can account for 20% of a server’s Commit Log requests. The server that becomes responsible for this hot key might also become overloaded. By shunting load Update to idle servers we limit that risk. Operations Storage Ordinarily, each failed request results in a hit on the backing store, potentially overloading it. By using Gut- ter to store these results, a substantial fraction of these failures are converted into hits in the gutter pool thereby Figure 6: Invalidation pipeline showing keys that need reducing load on the backing store. In practice, this sys- to be deleted via the daemon (mcsqueal). tem reduces the rate of client-visible failures by 99% write semantics for a single user request and reduce the and converts 10%–25% of failures into hits each day. If amount of time stale data is present in its local cache. a memcached server fails entirely, hit rates in the gutter SQL statements that modify authoritative state are pool generally exceed 35% in under 4 minutes and often amended to include memcache keys that need to be approach 50%. Thus when a few memcached servers are invalidated once the transaction commits [7]. We de- unavailable due to failure or minor network incidents, ploy invalidation daemons (named mcsqueal) on every Gutter protects the backing store from a surge of traffic. database. Each daemon inspects the SQL statements that 4 In a Region: Replication its database commits, extracts any deletes, and broad- casts these deletes to the memcache deployment in every It is tempting to buy more web and memcached servers frontend cluster in that region. Figure 6 illustrates this to scale a cluster as demand increases. However, na¨ıvely approach. We recognize that most invalidations do not scaling the system does not eliminate all problems. delete data; indeed, only 4% of all deletes issued result Highly requested items will only become more popular in the actual invalidation of cached data. as more web servers are added to cope with increased While mcsqueal could con- user traffic. Incast congestion also worsens as the num- Reducing packet rates: tact memcached servers directly, the resulting rate of ber of memcached servers increases. We therefore split packets sent from a backend cluster to frontend clus- our web and memcached servers into multiple frontend ters would be unacceptably high. This packet rate prob- clusters. These clusters, along with a storage cluster that lem is a consequence of having many databases and contain the databases, define a region. This region ar- many memcached servers communicating across a clus- chitecture also allows for smaller failure domains and ter boundary. Invalidation daemons batch deletes into a tractable network configuration. We trade replication fewer packets and send them to a set of dedicated servers of data for more independent failure domains, tractable running mcrouter instances in each frontend cluster. network configuration, and a reduction of incast conges- These mcrouters then unpack individual deletes from tion. each batch and route those invalidations to the right This section analyzes the impact of multiple frontend memcached server co-located within the frontend clus- clusters that share the same storage cluster. Specifically ter. The batching results in an 18 improvement in the we address the consequences of allowing data replica- median number of deletes per packet.× tion across these clusters and the potential memory effi- ciencies of disallowing this replication. Invalidation via web servers: It is simpler for web servers to broadcast invalidations to all frontend clus- 4.1 Regional Invalidations ters. This approach unfortunately suffers from two prob- While the storage cluster in a region holds the authori- lems. First, it incurs more packet overhead as web tative copy of data, user demand may replicate that data servers are less effective at batching invalidations than into frontend clusters. The storage cluster is responsi- mcsqueal pipeline. Second, it provides little recourse ble for invalidating cached data to keep frontend clus- when a systemic invalidation problem arises such as ters consistent with the authoritative versions. As an op- misrouting of deletes due to a configuration error. In the timization, a web server that modifies data also sends past, this would often require a rolling restart of the en- invalidations to its own cluster to provide read-after- tire memcache infrastructure, a slow and disruptive pro-

390 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association A (Cluster) B (Region) 4.3 Cold Cluster Warmup Median number of users 30 1 When we bring a new cluster online, an existing one Gets per second 3.26 M 458 K fails, or perform scheduled maintenance the caches will Median value size 10.7 kB 4.34 kB have very poor hit rates diminishing the ability to in- Table 1: Deciding factors for cluster or regional replica- sulate backend services. A system called Cold Clus- tion of two item families ter Warmup mitigates this by allowing clients in the “cold cluster” (i.e. the frontend cluster that has an empty cess we want to avoid. In contrast, embedding invalida- cache) to retrieve data from the “warm cluster” (i.e. a tions in SQL statements, which databases commit and cluster that has caches with normal hit rates) rather than store in reliable logs, allows mcsqueal to simply replay the persistent storage. This takes advantage of the afore- invalidations that may have been lost or misrouted. mentioned data replication that happens across frontend clusters. With this system cold clusters can be brought 4.2 Regional Pools back to full capacity in a few hours instead of a few days. Each cluster independently caches data depending on Care must be taken to avoid inconsistencies due to the mix of the user requests that are sent to it. If race conditions. For example, if a client in the cold clus- users’ requests are randomly routed to all available fron- ter does a database update, and a subsequent request tend clusters then the cached data will be roughly the from another client retrieves the stale value from the same across all the frontend clusters. This allows us to warm cluster before the warm cluster has received the take a cluster offline for maintenance without suffer- invalidation, that item will be indefinitely inconsistent ing from reduced hit rates. Over-replicating the data can in the cold cluster. Memcached deletes support nonzero be memory inefficient, especially for large, rarely ac- hold-off times that reject add operations for the spec- cessed items. We can reduce the number of replicas by ified hold-off time. By default, all deletes to the cold having multiple frontend clusters share the same set of cluster are issued with a two second hold-off. When a memcached servers. We call this a regional pool. miss is detected in the cold cluster, the client re-requests Crossing cluster boundaries incurs more latency. In the key from the warm cluster and adds it into the cold addition, our networks have 40% less average available cluster. The failure of the add indicates that newer data bandwidth over cluster boundaries than within a single is available on the database and thus the client will re- cluster. Replication trades more memcached servers for fetch the value from the databases. While there is still a less inter-cluster bandwidth, lower latency, and better theoretical possibility that deletes get delayed more than fault tolerance. For some data, it is more cost efficient two seconds, this is not true for the vast majority of the to forgo the advantages of replicating data and have a cases. The operational benefits of cold cluster warmup single copy per region. One of the main challenges of far outweigh the cost of rare cache consistency issues. scaling memcache within a region is deciding whether We turn it off once the cold cluster’s hit rate stabilizes a key needs to be replicated across all frontend clusters and the benefits diminish. or have a single replica per region. Gutter is also used when servers in regional pools fail. 5 Across Regions: Consistency Table 1 summarizes two kinds of items in our appli- There are several advantages to a broader geographic cation that have large values. We have moved one kind placement of data centers. First, putting web servers (B) to a regional pool while leaving the other (A) un- closer to end users can significantly reduce latency. touched. Notice that clients access items falling into cat- Second, geographic diversity can mitigate the effects egory B an order of magnitude less than those in cate- of events such as natural disasters or massive power gory A. Category B’s low access rate makes it a prime failures. And third, new locations can provide cheaper candidate for a regional pool since it does not adversely power and other economic incentives. We obtain these impact inter-cluster bandwidth. Category B would also advantages by deploying to multiple regions. Each re- occupy 25% of each cluster’s wildcard pool so region- gion consists of a storage cluster and several frontend alization provides significant storage efficiencies. Items clusters. We designate one region to hold the master in category A, however, are twice as large and accessed databases and the other regions to contain read-only much more frequently, disqualifying themselves from replicas; we rely on MySQL’s replication mechanism regional consideration. The decision to migrate data into to keep replica databases up-to-date with their mas- regional pools is currently based on a set of manual ters. In this design, web servers experience low latency heuristics based on access rates, data set size, and num- when accessing either the local memcached servers or ber of unique users accessing particular items. the local database replicas. When scaling across mul-

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 391 tiple regions, maintaining consistency between data in marker indicates that data in the local replica database memcache and the persistent storage becomes the pri- are potentially stale and the query should be redirected mary technical challenge. These challenges stem from to the master region. When a web server wishes to up- a single problem: replica databases may lag behind the date data that affects a key k, that server (1) sets a re- master database. mote marker rk in the region, (2) performs the write to Our system represents just one point in the wide the master embedding k and rk to be invalidated in the spectrum of consistency and performance trade-offs. SQL statement, and (3) deletes k in the local cluster. On The consistency model, like the rest of the system, has a subsequent request for k, a web server will be unable evolved over the years to suit the scale of the site. It to find the cached data, check whether rk exists, and di- mixes what can be practically built without sacrificing rect its query to the master or local region depending on our high performance requirements. The large volume the presence of rk. In this situation, we explicitly trade of data that the system manages implies that any minor additional latency when there is a cache miss, for a de- changes that increase network or storage requirements creased probability of reading stale data. have non-trivial costs associated with them. Most ideas We implement remote markers by using a regional that provide stricter semantics rarely leave the design pool. Note that this mechanism may reveal stale in- phase because they become prohibitively expensive. Un- formation during concurrent modifications to the same like many systems that are tailored to an existing use key as one operation may delete a remote marker that case, memcache and Facebook were developed together. should remain present for another in-flight operation. It This allowed the applications and systems engineers to is worth highlighting that our usage of memcache for re- work together to find a model that is sufficiently easy mote markers departs in a subtle way from caching re- for the application engineers to understand yet perfor- sults. As a cache, deleting or evicting keys is always a mant and simple enough for it to work reliably at scale. safe action; it may induce more load on databases, but We provide best-effort eventual consistency but place an does not impair consistency. In contrast, the presence of emphasis on performance and availability. Thus the sys- a remote marker helps distinguish whether a non-master tem works very well for us in practice and we think we database holds stale data or not. In practice, we find both have found an acceptable trade-off. the eviction of remote markers and situations of concur- Writes from a master region: Our earlier decision re- rent modification to be rare. quiring the storage cluster to invalidate data via daemons Operational considerations: Inter-region communica- has important consequences in a multi-region architec- tion is expensive since data has to traverse large geo- ture. In particular, it avoids a race condition in which graphical distances (e.g. across the continental United an invalidation arrives before the data has been repli- States). By sharing the same channel of communication cated from the master region. Consider a web server in for the delete stream as the database replication we gain the master region that has finished modifying a database network efficiency on lower bandwidth connections. and seeks to invalidate now stale data. Sending invalida- The aforementioned system for managing deletes in tions within the master region is safe. However, having Section 4.1 is also deployed with the replica databases to the web server invalidate data in a replica region may be broadcast the deletes to memcached servers in the replica premature as the changes may not have been propagated regions. Databases and mcrouters buffer deletes when to the replica databases yet. Subsequent queries for the downstream components become unresponsive. A fail- data from the replica region will race with the replica- ure or delay in any of the components results in an in- tion stream thereby increasing the probability of setting creased probability of reading stale data. The buffered stale data into memcache. Historically, we implemented deletes are replayed once these downstream components mcsqueal after scaling to multiple regions. are available again. The alternatives involve taking a Writes from a non-master region: Now consider a cluster offline or over-invalidating data in frontend clus- user who updates his data from a non-master region ters when a problem is detected. These approaches result when replication lag is excessively large. The user’s next in more disruptions than benefits given our workload. request could result in confusion if his recent change is missing. A cache refill from a replica’s database should only be allowed after the replication stream has caught 6 Single Server Improvements up. Without this, subsequent requests could result in the The all-to-all communication pattern implies that a sin- replica’s stale data being fetched and cached. gle server can become a bottleneck for a cluster. This We employ a remote marker mechanism to minimize section describes performance optimizations and mem- the probability of reading stale data. The presence of the ory efficiency gains in memcached which allow better

392 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association scaling within clusters. Improving single server cache performance is an active research area [9, 10, 28, 25]. hits 6.1 Performance Optimizations misses We began with a single-threaded memcached which used a fixed-size hash table. The first major optimizations were to: (1) allow automatic expansion of the hash ta- Max sustained items / second ble to avoid look-up times drifting to O(n), (2) make the server multi-threaded using a global lock to protect mul- 0 2M 4M 6M tiple data structures, and (3) giving each thread its own Facebook Facebook−μlocks 1.4.10 UDP port to reduce contention when sending replies and later spreading interrupt processing overhead. The first Figure 7: Multiget hit and miss performance comparison two optimizations were contributed back to the open by memcached version source community. The remainder of this section ex- plores further optimizations that are not yet available in the open source version. TCP Our experimental hosts have an Intel Xeon UDP CPU (X5650) running at 2.67GHz (12 cores and 12 hyperthreads), an Intel 82574L gigabit ethernet

controller and 12GB of memory. Production servers Max sustained items / second have additional memory. Further details have been 0 1M 2M previously published [4]. The performance test setup consists of fifteen clients generating memcache traffic Get 10−key multiget to a single memcached server with 24 threads. The Figure 8: Get hit performance comparison for single clients and server are co-located on the same rack and gets and 10-key multigets over TCP and UDP connected through gigabit ethernet. These tests measure the latency of memcached responses over two minutes forms our TCP implementation by 13% for single gets of sustained load. and 8% for 10-key multigets. Get Performance: We first investigate the effect of re- Because multigets pack more data into each request placing our original multi-threaded single-lock imple- than single gets, they use fewer packets to do the same mentation with fine-grained locking. We measured hits work. Figure 8 shows an approximately four-fold im- by pre-populating the cache with 32-byte values before provement for 10-key multigets over single gets. issuing memcached requests of 10 keys each. Figure 7 shows the maximum request rates that can be sustained 6.2 Adaptive Slab Allocator with sub-millisecond average response times for differ- Memcached employs a slab allocator to manage memory. ent versions of memcached. The first set of bars is our The allocator organizes memory into slab classes, each memcached before fine-grained locking, the second set of which contains pre-allocated, uniformly sized chunks is our current memcached, and the final set is the open of memory. Memcached stores items in the smallest pos- source version 1.4.10 which independently implements sible slab class that can fit the item’s metadata, key, and a coarser version of our locking strategy. value. Slab classes start at 64 bytes and exponentially in- Employing fine-grained locking triples the peak get crease in size by a factor of 1.07 up to 1 MB, aligned on rate for hits from 600k to 1.8M items per second. Per- 4-byte boundaries3. Each slab class maintains a free-list formance for misses also increased from 2.7M to 4.5M of available chunks and requests more memory in 1MB items per second. Hits are more expensive because the slabs when its free-list is empty. Once a memcached return value has to be constructed and transmitted, while server can no longer allocate free memory, storage for misses require a single static response (END) for the en- new items is done by evicting the least recently used tire multiget indicating that all keys missed. (LRU) item within that slab class. When workloads We also investigated the performance effects of us- change, the original memory allocated to each slab class ing UDP instead of TCP. Figure 8 shows the peak re- may no longer be enough resulting in poor hit rates. quest rate we can sustain with average latencies of less than one millisecond for single gets and multigets of 10 3This scaling factor ensures that we have both 64 and 128 byte keys. We found that our UDP implementation outper- items which are more amenable to hardware cache lines.

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 393 We implemented an adaptive allocator that period- ically re-balances slab assignments to match the cur- rent workload. It identifies slab classes as needing more memory if they are currently evicting items and if the next item to be evicted was used at least 20% more re- cently than the average of the least recently used items in other slab classes. If such a class is found, then the slab holding the least recently used item is freed and trans- All requests ferred to the needy class. Note that the open-source com- percentile of requests A popular data intensive page

munity has independently implemented a similar allo- 0 20 40 60 80 100 cator that balances the eviction rates across slab classes while our algorithm focuses on balancing the age of the 20 100 200 300 400 500 600 oldest items among classes. Balancing age provides a distinct memcached servers better approximation to a single global Least Recently Figure 9: Cumulative distribution of the number of dis- Used (LRU) eviction policy for the entire server rather tinct memcached servers accessed than adjusting eviction rates which can be heavily influ- enced by access patterns. 7 Memcache Workload 6.3 The Transient Item Cache We now characterize the memcache workload using data from servers that are running in production. While memcached supports expiration times, entries 7.1 Measurements at the Web Server may live in memory well after they have expired. Memcached lazily evicts such entries by checking ex- We record all memcache operations for a small percent- piration times when serving a get request for that item age of user requests and discuss the fan-out, response or when they reach the end of the LRU. Although effi- size, and latency characteristics of our workload. cient for the common case, this scheme allows short- Fanout: Figure 9 shows the distribution of distinct lived keys that see a single burst of activity to waste memcached servers a web server may need to contact memory until they reach the end of the LRU. when responding to a page request. As shown, 56% of all page requests contact fewer than 20 memcached We therefore introduce a hybrid scheme that relies on servers. By volume, user requests tend to ask for small lazy eviction for most keys and proactively evicts short- amounts of cached data. There is, however, a long tail to lived keys when they expire. We place short-lived items this distribution. The figure also depicts the distribution into a circular buffer of linked lists (indexed by sec- for one of our more popular pages that better exhibits onds until expiration) – called the Transient Item Cache the all-to-all communication pattern. Most requests of – based on the expiration time of the item. Every sec- this type will access over 100 distinct servers; accessing ond, all of the items in the bucket at the head of the several hundred memcached servers is not rare. buffer are evicted and the head advances by one. When we added a short expiration time to a heavily used set of Response size: Figure 10 shows the response sizes from keys whose items have short useful lifespans; the pro- memcache requests. The difference between the median portion of memcache pool used by this key family was (135 bytes) and the mean (954 bytes) implies that there reduced from 6% to 0.3% without affecting the hit rate. is a very large variation in the sizes of the cached items. In addition there appear to be three distinct peaks at ap- 6.4 Software Upgrades proximately 200 bytes and 600 bytes. Larger items tend to store lists of data while smaller items tend to store Frequent software changes may be needed for upgrades, single pieces of content. bug fixes, temporary diagnostics, or performance test- Latency: We measure the round-trip latency to request ing. A memcached server can reach 90% of its peak hit data from memcache, which includes the cost of rout- rate within a few hours. Consequently, it can take us over ing the request and receiving the reply, network transfer 12 hours to upgrade a set of memcached servers as the re- time, and the cost of deserialization and decompression. sulting database load needs to be managed carefully. We Over 7 days the median request latency is 333 microsec- modified memcached to store its cached values and main onds while the 75th and 95th percentiles (p75 and p95) data structures in System V shared memory regions so are 475μs and 1.135ms respectively. Our median end- that the data can remain live across a software upgrade to-end latency from an idle web server is 178μs while and thereby minimize disruption. the p75 and p95 are 219μs and 374μs, respectively. The

394 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association percentile of requests fraction of deletes that failed fraction 1e−06 1e−05 1e−04 1e−03 1s 10s 1m 10m 1h 1d 1s 10s 1m 10m 1h 1d 0 20 40 60 80 100 master region replica region 0 200 400 600 seconds of delay Figure 11: Latency of the Delete Pipeline Bytes Figure 10: Cumulative distribution of value sizes of a million deletes and record the time the delete was is- fetched sued. We subsequently query the contents of memcache across all frontend clusters at regular intervals for the wide variance between the p95 latencies arises from sampled keys and log an error if an item remains cached handling large responses and waiting for the runnable despite a delete that should have invalidated it. thread to be scheduled as discussed in Section 3.1. In Figure 11, we use this monitoring mechanism to re- 7.2 Pool Statistics port our invalidation latencies across a 30 day span. We break this data into two different components: (1) the We now discuss key metrics of four memcache pools. delete originated from a web server in the master region The pools are wildcard (the default pool), app (a pool and was destined to a memcached server in the master re- devoted for a specific application), a replicated pool for gion and (2) the delete originated from a replica region frequently accessed data, and a regional pool for rarely and was destined to another replica region. As the data accessed information. In each pool, we collect average show, when the source and destination of the delete are statistics every 4 minutes and report in Table 2 the high- co-located with the master our success rates are much est average for one month collection period. This data higher and achieve four 9s of reliability within 1 second approximates the peak load seen by those pools. The ta- and five 9s after one hour. However when the deletes ble shows the widely different get, set, and delete rates originate and head to locations outside of the master re- for different pools. Table 3 shows the distribution of re- gion our reliability drops to three 9s within a second and sponse sizes for each pool. Again, the different char- four 9s within 10 minutes. In our experience, we find acteristics motivate our desire to segregate these work- that if an invalidation is missing after only a few sec- loads from one another. onds the most common reason is that the first attempt As discussed in Section 3.2.3, we replicate data failed and subsequent retrials will resolve the problem. within a pool and take advantage of batching to handle the high request rates. Observe that the replicated pool 8 Related Work has the highest get rate (about 2.7 that of the next high- est one) and the highest ratio of bytes× to packets despite Several other large websites have recognized the util- having the smallest item sizes. This data is consistent ity of key-value stores. DeCandia et al. [12] present with our design in which we leverage replication and a highly available key-value store that is used by a batching to achieve better performance. In the app pool, variety of application services at Amazon.com. While a higher churn of data results in a naturally higher miss their system is optimized for a write heavy workload, rate. This pool tends to have content that is accessed for ours targets a workload dominated by reads. Similarly, a few hours and then fades away in popularity in favor LinkedIn uses Voldemort [5], a system inspired by Dy- of newer content. Data in the regional pool tends to be namo. Other major deployments of key-value caching large and infrequently accessed as shown by the request solutions include [6] at Github, Digg, and Bliz- rates and the value size distribution. zard, and memcached at [33] and Zynga. Lak- shman et al. [1] developed Cassandra, a schema-based 7.3 Invalidation Latency distributed key-value store. We preferred to deploy and We recognize that the timeliness of invalidations is a scale memcached due to its simpler design. critical factor in determining the probability of expos- Our work in scaling memcache builds on extensive ing stale data. To monitor this health, we sample one out work in distributed data structures. Gribble et al. [19]

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 395 get set delete packets pool miss rate s s s s outbound bandwidth (MB/s) wildcard 1.76% 262k 8.26k 21.2k 236k 57.4 app 7.85% 96.5k 11.9k 6.28k 83.0k 31.0 replicated 0.053% 710k 1.75k 3.22k 44.5k 30.1 regional 6.35% 9.1k 0.79k 35.9k 47.2k 10.8 Table 2: Traffic per server on selected memcache pools averaged over 7 days

pool mean std dev p5 p25 p50 p75 p95 p99 wildcard 1.11 K 8.28 K 77 102 169 363 3.65 K 18.3 K app 881 7.70 K 103 247 269 337 1.68K 10.4 K replicated 66 2 62 68 68 68 68 68 regional 31.8 K 75.4 K 231 824 5.31 K 24.0 K 158 K 381 K Table 3: Distribution of item sizes for various pools in bytes present an early version of a key-value storage system its graph model to persistent storage and takes respon- useful for Internet scale services. Ousterhout et al. [29] sibility for persistence. Many components, such as our also present the case for a large scale in-memory key- client libraries and mcrouter, are used by both systems. value storage system. Unlike both of these solutions, memcache does not guarantee persistence. We rely on 9 Conclusion other systems to handle persistent data storage. In this paper, we show how to scale a memcached-based Ports et al. [31] provide a library to manage the architecture to meet the growing demand of Facebook. cached results of queries to a transactional database. Many of the trade-offs discussed are not fundamental, Our needs require a more flexible caching strategy. Our but are rooted in the realities of balancing engineering use of leases [18] and stale reads [23] leverages prior resources while evolving a live system under continu- research on cache consistency and read operations in ous product development. While building, maintaining, high-performance systems. Work by Ghandeharizadeh and evolving our system we have learned the following and Yap [15] also presents an algorithm that addresses lessons. (1) Separating cache and persistent storage sys- the stale set problem based on time-stamps rather than tems allows us to independently scale them. (2) Features explicit version numbers. that improve monitoring, debugging and operational ef- While software routers are easier to customize and ficiency are as important as performance. (3) Managing program, they are often less performant than their hard- stateful components is operationally more complex than ware counterparts. Dobrescu et al. [13] address these stateless ones. As a result keeping logic in a stateless issues by taking advantage of multiple cores, multiple client helps iterate on features and minimize disruption. memory controllers, multi-queue networking interfaces, (4) The system must support gradual rollout and roll- and batch processing on general purpose servers. Ap- back of new features even if it leads to temporary het- plying these techniques to mcrouter’s implementation erogeneity of feature sets. (5) Simplicity is vital. remains future work. Twitter has also independently de- veloped a memcache proxy similar to mcrouter [32]. Acknowledgements In Coda [35], Satyanarayanan et al. demonstrate how We would like to thank Philippe Ajoux, Nathan Bron- datasets that diverge due to disconnected operation can son, Mark Drayton, David Fetterman, Alex Gartrell, An- be brought back into sync. Glendenning et al. [17] lever- drii Grynenko, Robert Johnson, Sanjeev Kumar, Anton age Paxos [24] and quorums [16] to build Scatter, a dis- Likhtarov, Mark Marchukov, Scott Marlette, Ben Mau- tributed hash table with linearizable semantics [21] re- rer, David Meisner, Konrad Michels, Andrew Pope, Jeff silient to churn. Lloyd et al. [27] examine causal consis- Rothschild, Jason Sobel, and Yee Jiun Song for their tency in COPS, a wide-area storage system. contributions. We would also like to thank the anony- TAO [37] is another Facebook system that relies heav- mous reviewers, our shepherd Michael Piatek, Tor M. ily on caching to serve large numbers of low-latency Aamodt, Remzi H. Arpaci-Dusseau, and Tayler Hether- queries. TAO differs from memcache in two fundamental ington for their valuable feedback on earlier drafts of ways. (1) TAO implements a graph data model in which the paper. Finally we would like to thank our fellow en- nodes are identified by fixed-length persistent identifiers gineers at Facebook for their suggestions, bug-reports, (64-bit integers). (2) TAO encodes a specific mapping of and support which makes memcache what it is today.

396 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association References [21] HERLIHY, M. P., AND WING, J. M. Linearizability: a correct- [1] . http://cassandra.apache.org/. ness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12, 3 (July 1990), 463– [2] Couchbase. http://www.couchbase.com/. 492. [3] Making Facebook Self-Healing. https://www.facebook. [22] KARGER, D., LEHMAN, E., LEIGHTON,T.,PANIGRAHY, R., com/note.php?note_id=10150275248698920. LEVINE, M., AND LEWIN, D. Consistent Hashing and Random [4] Open Compute Project. http://www.opencompute.org. trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the 29th annual ACM http://project-voldemort.com/ [5] Project Voldemort. . Symposium on Theory of Computing (1997), pp. 654–663. [6] Redis. http://redis.io/. [23] KEETON, K., MORREY, III, C. B., SOULES, C. A., AND [7] Scaling Out. https://www.facebook.com/note.php?note_ VEITCH, A. Lazybase: freshness vs. performance in informa- id=23844338919. tion management. ACM SIGOPS Operating Systems Review 44, 1 (Dec. 2010), 15–19. [8] ATIKOGLU, B., XU,Y.,FRACHTENBERG, E., JIANG, S., AND PALECZNY, M. Workload analysis of a large-scale key-value [24] LAMPORT, L. The part-time parliament. ACM Transactions on store. ACM SIGMETRICS Performance Evaluation Review 40, Computer Systems 16, 2 (May 1998), 133–169. 1 (June 2012), 53–64. [25] LIM, H., FAN, B., ANDERSEN, D. G., AND KAMINSKY, M. [9] BEREZECKI, M., FRACHTENBERG, E., PALECZNY, M., AND Silt: a memory-efficient, high-performance key-value store. In STEELE, K. Power and performance evaluation of memcached Proceedings of the 23rd ACM Symposium on Operating Systems on the tilepro64 architecture. Sustainable Computing: Informat- Principles (2011), pp. 1–13. ics and Systems 2, 2 (June 2012), 81 – 90. [26] LITTLE, J., AND GRAVES, S. Little’s law. Building Intuition [10] BOYD-WICKIZER, S., CLEMENTS, A. T., MAO, Y., (2008), 81–100. PESTEREV, A., KAASHOEK,M.F.,MORRIS, R., AND [27] LLOYD,W.,FREEDMAN, M., KAMINSKY, M., AND ANDER- ZELDOVICH, N. An analysis of scalability to many cores. In Proceedings of the 9th USENIX Symposium on Operating SEN, D. Don’t settle for eventual: scalable causal consistency for Proceedings of the 23rd ACM Systems Design & Implementation (2010), pp. 1–8. wide-area storage with COPS. In Symposium on Operating Systems Principles (2011), pp. 401– [11] CERF, V. G., AND KAHN, R. E. A protocol for packet network 416. intercommunication. ACM SIGCOMM Compututer Communi- cation Review 35, 2 (Apr. 2005), 71–82. [28] METREVELI, Z., ZELDOVICH, N., AND KAASHOEK, M. Cphash: A cache-partitioned hash table. In Proceedings of the [12] DECANDIA, G., HASTORUN, D., JAMPANI, M., KAKULAP- 17th ACM SIGPLAN symposium on Principles and Practice of ATI, G., LAKSHMAN, A., PILCHIN, A., SIVASUBRAMANIAN, Parallel Programming (2012), pp. 319–320. S., VOSSHALL, P., AND VOGELS, W. Dynamo: amazon’s highly available key-value store. ACM SIGOPS Operating Sys- [29] OUSTERHOUT, J., AGRAWAL,P.,ERICKSON, D., tems Review 41, 6 (Dec. 2007), 205–220. KOZYRAKIS, C., LEVERICH, J., MAZIERES` , D., MI- TRA, S., NARAYANAN, A., ONGARO, D., PARULKAR, G., [13] FALL, K., IANNACCONE, G., MANESH, M., RATNASAMY, S., ROSENBLUM, M., RUMBLE, S. M., STRATMANN, E., AND ARGYRAKI, K., DOBRESCU, M., AND EGI, N. Routebricks: STUTSMAN, R. The case for ramcloud. Communications of the enabling general purpose network infrastructure. ACM SIGOPS ACM 54, 7 (July 2011), 121–130. Operating Systems Review 45, 1 (Feb. 2011), 112–125. [30] PHANISHAYEE, A., KREVAT, E., VASUDEVAN,V.,ANDER- [14] FITZPATRICK, B. Distributed caching with memcached. Linux SEN, D. G., GANGER, G. R., GIBSON, G. A., AND SE- Journal 2004, 124 (Aug. 2004), 5. SHAN, S. Measurement and analysis of tcp throughput col- [15] GHANDEHARIZADEH, S., AND YAP, J. Gumball: a race con- lapse in cluster-based storage systems. In Proceedings of the 6th dition prevention technique for cache augmented sql database USENIX Conference on File and Storage Technologies (2008), management systems. In Proceedings of the 2nd ACM SIGMOD pp. 12:1–12:14. Workshop on Databases and Social Networks (2012), pp. 1–6. [31] PORTS, D. R. K., CLEMENTS,A.T.,ZHANG, I., MADDEN, [16] GIFFORD, D. K. Weighted voting for replicated data. In Pro- S., AND LISKOV, B. Transactional consistency and automatic ceedings of the 7th ACM Symposium on Operating Systems Prin- management in an application data cache. In Proceedings of ciples (1979), pp. 150–162. the 9th USENIX Symposium on Operating Systems Design & Implementation (2010), pp. 1–15. [17] GLENDENNING, L., BESCHASTNIKH, I., KRISHNAMURTHY, A., AND ANDERSON, T. Scalable consistency in Scatter. In [32] RAJASHEKHAR, M. Twemproxy: A fast, light-weight proxy for Proceedings of the 23rd ACM Symposium on Operating Systems memcached. https://dev.twitter.com/blog/twemproxy. Principles (2011), pp. 15–28. [33] RAJASHEKHAR, M., AND YUE, Y. Caching with twem- [18] GRAY, C., AND CHERITON, D. Leases: An efficient fault- cache. http://engineering.twitter.com/2012/07/ tolerant mechanism for distributed file cache consistency. ACM caching-with-twemcache.html. SIGOPS Operating Systems Review 23, 5 (Nov. 1989), 202–210. [34] RATNASAMY, S., FRANCIS,P.,HANDLEY, M., KARP, R., [19] GRIBBLE, S. D., BREWER, E. A., HELLERSTEIN, J. M., AND AND SHENKER, S. A scalable content-addressable network. CULLER, D. Scalable, distributed data structures for internet ACM SIGCOMM Computer Communication Review 31, 4 (Oct. service construction. In Proceedings of the 4th USENIX Sym- 2001), 161–172. posium on Operating Systems Design & Implementation (2000), [35] SATYANARAYANAN, M., KISTLER, J., KUMAR,P.,OKASAKI, pp. 319–332. M., SIEGEL, E., AND STEERE, D. Coda: A highly available file [20] HEINRICH, J. MIPS R4000 Microprocessor User’s Manual. system for a distributed workstation environment. IEEE Trans- MIPS technologies, 1994. actions on Computers 39, 4 (Apr. 1990), 447–459.

USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 397 [36] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (Oct. 2001), 149–160. [37] VENKATARAMANI,V.,AMSDEN, Z., BRONSON, N., CABR- ERA III, G., CHAKKA,P.,DIMOV,P.,DING, H., FERRIS, J., GIARDULLO, A., HOON, J., KULKARNI, S., LAWRENCE, N., MARCHUKOV, M., PETROV, D., AND PUZAR, L. Tao: how facebook serves the social graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2012), pp. 791–792.

398 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association