When Is Operation Ordering Required in Replicated Transactional Storage?
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports
University of Washington
{iyzhang, naveenks, aaasz, arvind, drkp}@cs.washington.edu

Abstract

Today's replicated transactional storage systems typically have a layered architecture, combining protocols for transaction coordination, consistent replication, and concurrency control. These systems generally require costly strongly-consistent replication protocols like Paxos, which assign a total order to all operations. To avoid this cost, we ask whether all replicated operations in these systems need to be strictly ordered. Recent research has yielded replication protocols that can avoid unnecessary ordering, e.g., by exploiting commutative operations, but it is not clear how to apply these to replicated transaction processing systems. We answer this question by analyzing existing transaction processing designs in terms of which replicated operations require ordering and which simply require fault tolerance. We describe how this analysis leads to our recent work on TAPIR, a transaction protocol that efficiently provides strict serializability by using a new replication protocol that provides fault tolerance but not ordering for most operations.

1 Introduction

Distributed storage systems for today's large-scale web applications must meet a daunting set of requirements: they must offer high performance, graceful scalability, and continuous availability despite the inevitability of failures. Increasingly, too, application programmers prefer systems that support distributed transactions with strong consistency to help them manage application complexity and concurrency in a distributed environment. Several recent systems [11, 3, 8, 5, 6] reflect this trend. One notable example is Google's Spanner system [6], which guarantees strictly-serializable (aka linearizable or externally consistent) transaction ordering [6].

Generally, distributed transactional storage with strong consistency comes at a heavy performance price. These systems typically integrate several expensive mechanisms, including concurrency control schemes like strict two-phase locking, strongly consistent replication protocols like Paxos, and atomic commitment protocols like two-phase commit. The costs associated with these mechanisms often drive application developers to use more efficient, weakly consistent protocols that fail to provide strong system guarantees.

In conventional designs, these protocols – concurrency control, replication, and atomic commitment – are implemented in separate layers, with each layer providing a subset of the guarantees required for distributed ACID transactions. For example, a system may partition data into shards, where each shard is replicated using Paxos, and use two-phase commit and two-phase locking to implement serializable transactions across shards. Though now frequently implemented together, these protocols were originally developed to address separate issues in distributed systems.
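As a concrete illustration of this layering, the sketch below shows a toy coordinator that runs two-phase commit across shards, where each shard takes locks (two-phase locking) and pushes every prepare/commit record through a totally ordered replicated log, the role Paxos plays in the example above. This is a minimal, hypothetical sketch under those assumptions; all class and function names are illustrative and do not come from Spanner, TAPIR, or any other cited system.

```python
# Toy model of the standard layered architecture (illustrative only).
# Each shard is a replica group that appends every coordination record
# ("prepare", "commit", "abort") to a totally ordered log on all replicas,
# standing in for a Paxos round per operation.

class ReplicatedShard:
    def __init__(self, num_replicas=3):
        self.replica_logs = [[] for _ in range(num_replicas)]  # one ordered log per replica
        self.locks = set()  # keys held under two-phase locking

    def replicate(self, record):
        # Stand-in for consistent replication: every replica applies the
        # record in the same log order.
        for log in self.replica_logs:
            log.append(record)

    def prepare(self, txn_id, keys):
        if any(k in self.locks for k in keys):
            return False  # lock conflict: vote to abort
        self.locks.update(keys)
        self.replicate(("prepare", txn_id, tuple(keys)))
        return True  # vote to commit

    def finish(self, txn_id, keys, commit):
        self.replicate(("commit" if commit else "abort", txn_id))
        # Toy simplification: no per-transaction lock ownership tracking.
        self.locks.difference_update(keys)


def two_phase_commit(txn_id, writes_by_shard):
    """Coordinator: 2PC across shards, each shard internally replicated."""
    votes = {shard: shard.prepare(txn_id, keys)
             for shard, keys in writes_by_shard.items()}
    decision = all(votes.values())
    for shard, keys in writes_by_shard.items():
        shard.finish(txn_id, keys, decision)
    return decision


# Example: a transaction writing key "a" on one shard and "b" on another.
s1, s2 = ReplicatedShard(), ReplicatedShard()
assert two_phase_commit("txn1", {s1: ["a"], s2: ["b"]}) is True
```

The point to notice is that every coordination record, not just user data, is funneled through the ordered replication step; the question raised below is when that per-operation ordering is actually necessary.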
In our recent work [23, 24], we have taken a different approach: we consider the storage system architecture as a whole and identify opportunities for cross-layer optimizations to improve performance.

To improve the performance of replicated transactional systems, we ask the question: do all replicated operations need to be strictly ordered? Existing systems treat the replication protocol as an implementation of a persistent, ordered log. Ensuring a total ordering, however, requires cross-replica coordination on every operation, increasing system latency, and is typically implemented with a designated leader, introducing a system bottleneck. A recent line of distributed systems research has developed more efficient replication protocols by identifying cases when operations do not need to be ordered – typically by having the developer express which operations commute with others [4, 14, 18, 15]. Yet it is not obvious how to apply these techniques to the layered design of transactional storage systems: the operations being replicated are not transactions themselves, but coordination operations like prepare or commit records.

To answer this question, we analyzed the interaction of atomic commitment, consistent replication, and concurrency control protocols in common combinations. We consider both their requirements on the layers below and the guarantees they provide to the layers above. Our analysis leads to several key insights about the protocols implemented in distributed transactional storage systems:

• The concurrency control mechanism is the application for the replication layer, so its requirements impact the replication algorithm.
• Unlike disk-based durable storage, replication can separate ordering/consistency and fault-tolerance guarantees, with different costs for each.
• Consensus-based replication algorithms like Paxos are not the most efficient way to provide ordering.
• Enforcing consistency at every layer is not necessary for maintaining global transactional consistency.

We describe how these observations led us to design a new transactional storage system, TAPIR [23, 24]. TAPIR provides strictly serializable isolation of transactions, but relies on a weakly consistent underlying replication protocol, inconsistent replication (IR). IR is designed to efficiently support fault-tolerant but unordered operations, and TAPIR uses an optimistic timestamp ordering technique to make most of its replicated operations unordered. This article does not attempt to provide a complete explanation of the TAPIR protocol, its performance, or its correctness. Rather, it analyzes TAPIR along with additional protocols to demonstrate the value of separating operation ordering from fault tolerance in replicated transactional systems.
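To make the distinction between ordering and fault tolerance concrete, the following minimal sketch (purely illustrative; it is not the TAPIR or IR interface) shows why replicas can apply commutative operations in whatever order the network delivers them, while non-commutative operations diverge unless some protocol imposes a total order.

```python
# Hypothetical illustration: two replicas receive the same set of
# operations, but possibly in different network orders.

def apply_all(ops, state=None):
    """Apply a sequence of operations to a copy of an initial replica state."""
    state = dict(state or {})
    for op in ops:
        op(state)
    return state

# Commutative operations (e.g., inserting distinct keys): every delivery
# order yields the same final state, so replicas only need the operations
# to survive failures, not to agree on their order.
add_a = lambda s: s.update({"a": 1})
add_b = lambda s: s.update({"b": 2})
assert apply_all([add_a, add_b]) == apply_all([add_b, add_a])

# Non-commutative operations (blind writes to the same key): different
# delivery orders produce divergent replicas, so these must be ordered.
write_1 = lambda s: s.update({"x": 1})
write_2 = lambda s: s.update({"x": 2})
assert apply_all([write_1, write_2]) != apply_all([write_2, write_1])
```

Operations in the first category only need to reach enough replicas to tolerate failures; operations in the second are the ones for which an ordered protocol like Paxos earns its coordination cost.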
2 Coordination Protocols

This paper is about the protocols used to build scalable, fault-tolerant distributed storage systems, like distributed key-value stores or distributed databases. In this section, we specify the category of systems we are considering and review the standard protocols that they use to coordinate replicated, distributed transactions.

2.1 Transactional System Model

The protocols that we discuss in this paper are designed for scalable, distributed transactional storage. We assume the storage system divides stored data across several shards. Within a shard, we assume the system uses replication for availability and fault tolerance. Each shard is replicated across a set of storage servers organized into a replica group. Each replica group consists of 2f + 1 storage servers, where f is the number of faults that can be tolerated by one replica group. Transactions may read and modify data that spans multiple shards; a distributed transaction protocol ensures that transactions are applied consistently to all shards.

Figure 1: Standard protocol architecture for transactional distributed storage systems.

We consider system architectures that layer the distributed transaction protocol atop the replication protocol, as shown in Figure 1. In the context of Agrawal et al.'s recent taxonomy of partitioned replicated databases [2], these are replicated object systems. This is the traditional system architecture, used by several influential systems including Spanner [6], MDCC [11], Scatter [9], and Granola [7]. However, other architectures have been proposed that run the distributed transaction protocol within each site and use replication across sites [17].

Our analysis does not consider the cost of disk writes. This may seem an unconventional choice, as synchronous writes to disk are the traditional way to achieve durability in storage systems – and a major source of their overhead. Rather, we consider architectures where durability is achieved using in-memory replication, combined with asynchronous checkpoints to disk. Several recent systems (e.g., RAMCloud [20], H-Store [21]) have adopted this approach, which offers two advantages over disk. First, it ensures that the data remains continuously available when one system fails, minimizing downtime. Second, recording an operation in memory on multiple machines can have far lower latency than synchronously writing to disk, while still tolerating common failures.

We assume that clients are application servers that access data stored in the system. Clients have access to a directory of storage servers and are able to directly map data to storage servers, using a technique like consistent hashing [10].

Transaction Model. We assume a general transaction model. Clients begin a transaction, then execute operations during the transaction's execution period. During this period, the client is free to abort the transaction. Once the client finishes execution, it can commit the transaction, atomically and durably committing the executed operations to the storage servers.

2.2 Standard Protocols

The standard architecture for these distributed storage systems, shown in Figure 1, layers an atomic commitment protocol, like