Implementing Linearizability at Large Scale and Low Latency

Collin Lee∗, Seo Jin Park∗, Ankita Kejriwal, Satoshi Matsushita†, and John Ousterhout
Stanford University, †NEC

∗ These authors contributed equally to this work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SOSP'15, October 4–7, 2015, Monterey, CA. Copyright is held by the owner/author(s).
ACM 978-1-4503-3834-9/15/10.
http://dx.doi.org/10.1145/2815400.2815416

Abstract

Linearizability is the strongest form of consistency for concurrent systems, but most large-scale storage systems settle for weaker forms of consistency. RIFL provides a general-purpose mechanism for converting at-least-once RPC semantics to exactly-once semantics, thereby making it easy to turn non-linearizable operations into linearizable ones. RIFL is designed for large-scale systems and is lightweight enough to be used in low-latency environments. RIFL handles data migration by associating linearizability metadata with objects in the underlying store and migrating metadata with the corresponding objects. It uses a lease mechanism to implement garbage collection for metadata. We have implemented RIFL in the RAMCloud storage system and used it to make basic operations such as writes and atomic increments linearizable; RIFL adds only 530 ns to the 13.5 µs base latency for durable writes. We also used RIFL to construct a new multi-object transaction mechanism in RAMCloud; RIFL's facilities significantly simplified the transaction implementation. The transaction mechanism can commit simple distributed transactions in about 20 µs and it outperforms the H-Store main-memory database system for the TPC-C benchmark.

1 Introduction

Consistency is one of the most important issues in the design of large-scale storage systems; it represents the degree to which a system's behavior is predictable, particularly in the face of concurrency and failures. Stronger forms of consistency make it easier to develop applications and reason about their correctness, but they may impact performance or scalability and they generally require greater degrees of fault tolerance. The strongest possible form of consistency in a concurrent system is linearizability, which was originally defined by Herlihy and Wing [12]. However, few large-scale storage systems implement linearizability today.

Almost all large-scale systems contain mechanisms that contribute to stronger consistency, such as reliable network protocols, automatic retry of failed operations, idempotent semantics for operations, and two-phase commit protocols. However, these techniques are not sufficient by themselves to ensure linearizability. They typically result in "at-least-once semantics," which means that a remote operation may be executed multiple times if a crash occurs during its execution. Re-execution of operations, even seemingly benign ones such as simple writes, violates linearizability and makes the system's behavior harder for developers to predict and manage.

In this paper we describe RIFL (Reusable Infrastructure for Linearizability), which is a mechanism for ensuring "exactly-once semantics" in large-scale systems. RIFL records the results of completed remote procedure calls (RPCs) durably; if an RPC is retried after it has completed, RIFL ensures that the correct result is returned without re-executing the RPC. RIFL guarantees safety even in the face of server crashes and system reconfigurations such as data migration. As a result, RIFL makes it easy to turn non-linearizable operations into linearizable ones.

RIFL is novel in several ways:
• Reusable mechanism for exactly-once semantics: RIFL is implemented as a general-purpose package, independent of any specific remote operation. As a result, it can be used in many different situations, and existing RPCs can be made linearizable with only a few additional lines of code. RIFL's architecture and most of its implementation are system-independent.
• Reconfiguration tolerance: large-scale systems migrate data from one server to another, either during crash recovery (to redistribute the possessions of a dead server) or during normal operation (to balance load). RIFL handles reconfiguration by associating RIFL metadata with particular objects and arranging for the metadata to migrate with the objects; this ensures that the appropriate metadata is available in the correct place to handle RPC retries.
• Low latency: RIFL is lightweight enough to be used even in ultra-low-latency systems such as RAMCloud [21] and FaRM [8], which have end-to-end RPC times as low as 5 µs.
• Scalable: RIFL has been designed to support clusters with tens of thousands of servers and one million or more clients. Scalability impacted the design of RIFL in several ways, including the mechanisms for generating unique RPC identifiers and for garbage-collecting metadata.

We have implemented RIFL in the RAMCloud storage system in order to evaluate its architecture and performance. Using RIFL, we were able to make existing operations such as writes and atomic increments linearizable with less than 20 additional lines of code per operation. We also used RIFL to construct a new multi-object transaction mechanism in RAMCloud; the use of RIFL significantly reduced the amount of mechanism that had to be built for transactions. The RAMCloud implementation of RIFL exhibits high performance: it adds less than 4% to the 13.5 µs base cost for writes, and simple distributed transactions execute in about 20 µs. RAMCloud transactions outperform H-Store [15] on the TPC-C benchmark, providing at least 10x lower latency and 1.35x–7x as much throughput.

2 Background and Goals

Linearizability is a safety property concerning the behavior of operations in a concurrent system. A collection of operations is linearizable if each operation appears to occur instantaneously and exactly once at some point in time between its invocation and its completion. "Appears" means that it must not be possible for any client of the system, either the one initiating an operation or other clients operating concurrently, to observe contradictory behavior. Figure 1 shows examples of linearizable and non-linearizable operation histories. Linearizability is the strongest form of consistency for concurrent systems.

Figure 1: Examples of linearizable (a) and non-linearizable (b) histories for concurrent clients performing reads (R()) and writes (W()) on a single object, taken from [12]. Each row corresponds to a single client's history with time increasing to the right. The notation "W(1)" means that client B wrote the value 1 into the object. Horizontal bars indicate the time duration of each operation.

Early large-scale storage systems settled for weak consistency models in order to focus on scalability or partition-tolerance [7, 9, 17, 23], but newer systems have begun providing stronger forms of consistency [3, 6, 19]. They employ a variety of techniques, such as:
• Network protocols that ensure reliable delivery of request and response messages.
• Automatic retry of operations after server crashes, so that all operations are eventually completed.
• Operations with idempotent semantics, so that repeated executions of an operation produce the same result as a single execution.
• Two-phase commit and/or consensus protocols [11, 18], which ensure atomic updates of data on different servers.

However, few large-scale systems actually implement linearizability, and the above techniques are insufficient by themselves. For example, Figure 2 shows how retrying an idempotent operation after a server crash can result in non-linearizable behavior. The problem with most distributed systems is that they implement at-least-once semantics. If a client issues a request but fails to receive a response, it retries the operation. However, it is possible that the first request actually completed and the server crashed before sending a response. In this situation the retry causes the operation to be performed twice, which violates linearizability.

Figure 2: Non-linearizable behavior caused by crash recovery. In this example, the server completes a write from Client B but crashes before responding. After the server restarts, Client B reissues the write, but meanwhile Client A has written a different value. As a result, Client A observes the value 2 being written twice.

In order for a system to provide linearizable behavior, it must implement exactly-once semantics. To do this, the system must detect when an incoming request is a retry of a request that already completed. When this occurs, the server must not re-execute the operation. However, it must still return whatever results were generated by the earlier execution, since the client has not yet received them.

Some storage systems, such as H-Store [15] and FaRM [8], implement strongly consistent operations in the servers but they don't provide exactly-once semantics for clients: after a server crash, a client may not be able to determine whether a transaction completed. As a result, these systems do not guarantee linearizability (linearizability must be implemented on top of the transaction mechanism, as discussed in Section 8).

The overall goal for RIFL is to implement exactly-once semantics, thereby filling in the missing piece for linearizability.

[...] and they consist of two parts: a 64-bit unique identifier for the client and a 64-bit sequence number allocated by that [...]
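The difference between at-least-once and exactly-once semantics can be illustrated with a small Python sketch that replays the Figure 2 scenario. This is an illustration under stated assumptions, not RIFL's implementation: the names RpcId, Server, and replay_figure2 are hypothetical, the in-memory dictionary merely stands in for a durable completion record, and the crash itself is not modeled (only the lost response and the resulting retry are).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcId:
    """Hypothetical RPC identifier: a client identifier plus a per-client sequence number."""
    client_id: int
    sequence: int


class Server:
    """Toy single-object server (an assumption for illustration, not RAMCloud code)."""

    def __init__(self, exactly_once, initial=0):
        self.exactly_once = exactly_once
        self.value = initial
        self.version = 0
        # Maps RpcId -> saved result. A real system would have to store this
        # durably so that it survives crashes; here it is only a dictionary.
        self.completed = {}

    def read(self):
        return self.value

    def write(self, rpc_id, value):
        # Exactly-once: a retry of a completed RPC is not re-executed;
        # the result of the original execution is returned instead.
        if self.exactly_once and rpc_id in self.completed:
            return self.completed[rpc_id]
        self.value = value
        self.version += 1
        if self.exactly_once:
            self.completed[rpc_id] = self.version
        return self.version


def replay_figure2(exactly_once):
    """Replay the Figure 2 history and return the values Client A reads."""
    server = Server(exactly_once, initial=1)
    b_write = RpcId(client_id=2, sequence=1)
    server.write(b_write, 2)                            # B's write completes; response is lost
    observed = [server.read()]                          # A reads 2
    server.write(RpcId(client_id=1, sequence=1), 3)     # A writes 3
    server.write(b_write, 2)                            # B retries the same RPC
    observed.append(server.read())                      # A reads again
    return observed
```

With at-least-once handling, replay_figure2(False) returns [2, 2]: Client A sees the value 2 reappear after it wrote 3. With the completion record enabled, replay_figure2(True) returns [2, 3], because the retry is recognized and answered from the saved result rather than executed a second time.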
