Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks

Jiaqing Du (EPFL, Lausanne, Switzerland), Sameh Elnikety (Microsoft Research, Redmond, WA, USA), and Willy Zwaenepoel (EPFL, Lausanne, Switzerland)

Abstract—Clock-SI is a fully distributed protocol that implements snapshot isolation (SI) for partitioned data stores. It derives snapshot and commit timestamps from loosely synchronized clocks, rather than from a centralized timestamp authority as used in current systems. A transaction obtains its snapshot timestamp by reading the clock at its originating partition, and Clock-SI provides the corresponding consistent snapshot across all the partitions. In contrast to using a centralized timestamp authority, Clock-SI has availability and performance benefits: It avoids a single point of failure and a potential performance bottleneck, and improves transaction latency and throughput. We develop an analytical model to study the trade-offs introduced by Clock-SI among snapshot age, delay probabilities of transactions, and abort rates of update transactions. We verify the model predictions using a system implementation. Furthermore, we demonstrate the performance benefits of Clock-SI experimentally using a micro-benchmark and an application-level benchmark on a partitioned key-value store. For short read-only transactions, Clock-SI improves latency and throughput by 50% by avoiding communications with a centralized timestamp authority. With a geographically partitioned data store, Clock-SI reduces transaction latency by more than 100 milliseconds. Moreover, the performance benefits of Clock-SI come with higher availability.

Keywords—snapshot isolation, distributed transactions, partitioned data, loosely synchronized clocks

I. INTRODUCTION

Snapshot isolation (SI) [1] is one of the most widely used concurrency control schemes. While allowing some anomalies not possible with serializability [2], SI has significant performance advantages. In particular, SI never aborts read-only transactions, and read-only transactions do not block update transactions. SI is supported in several commercial systems, such as Microsoft SQL Server, Oracle RDBMS, and Google Percolator [3], as well as in many research prototypes [4], [5], [6], [7], [8].

Intuitively, under SI, a transaction takes a snapshot of the database when it starts. A snapshot is equivalent to a logical copy of the database including all committed updates. When an update transaction commits, its updates are applied atomically and a new snapshot is created. Snapshots are totally ordered according to their creation order using monotonically increasing timestamps. Snapshots are identified by timestamps: The snapshot taken by a transaction is identified by the transaction's snapshot timestamp. A new snapshot created by a committed update transaction is identified by the transaction's commit timestamp.

Managing timestamps in a centralized system is straightforward. Most SI implementations maintain a global variable, the database version, to assign snapshot and commit timestamps to transactions. When a transaction starts, its snapshot timestamp is set to the current value of the database version. All its reads are satisfied from the corresponding snapshot. To support snapshots, multiple versions of each data item are kept, each tagged with a version number equal to the commit timestamp of the transaction that creates the version. The transaction reads the version with the largest version number smaller than its snapshot timestamp. If the transaction is read-only, it always commits without further checks. If the transaction has updates, its writes are buffered in a workspace. When the update transaction requests to commit, a certification check verifies that the transaction writeset does not intersect with the writesets of concurrent committed transactions. If the certification succeeds, the database version is incremented, and the transaction commit timestamp is set to this value. The transaction's updates are made durable and visible, creating a new version of each updated data item with a version number equal to the commit timestamp.
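To make the read rule concrete, the following minimal C++ sketch shows one way a multiversioned item could be represented under the centralized scheme just described; the type and method names are ours, not taken from any particular system.

#include <cstdint>
#include <map>
#include <optional>
#include <string>

// One multiversioned data item: each version is tagged with the commit
// timestamp of the transaction that created it.
struct Item {
    std::map<uint64_t, std::string> versions;  // commit timestamp -> value

    // SI read rule: return the version with the largest version number
    // (commit timestamp) smaller than the snapshot timestamp.
    std::optional<std::string> readAtSnapshot(uint64_t snapshotTs) const {
        auto it = versions.lower_bound(snapshotTs);       // first ts >= snapshotTs
        if (it == versions.begin()) return std::nullopt;  // nothing visible yet
        return std::prev(it)->second;
    }
};

Keying the version map by commit timestamp makes "largest version number smaller than the snapshot timestamp" a single lower_bound lookup.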
Efficiently maintaining and accessing timestamps in a distributed system is challenging. We focus here on partitioning, which is the primary technique employed to manage large data sets. Besides allowing for larger data sizes, partitioned systems improve latency and throughput by allowing concurrent access to data in different partitions. With current large main memory sizes, partitioning also makes it possible to keep all data in memory, further improving performance.

Existing implementations of SI for a partitioned data store [3], [5], [9], [4] use a centralized authority to manage timestamps. When a transaction starts, it requests a snapshot timestamp from the centralized authority. Similarly, when a successfully certified update transaction commits, it requests a commit timestamp from the centralized authority. Each partition does its own certification for update transactions, and a two-phase commit (2PC) protocol is used to commit transactions that update data items at multiple partitions. The centralized timestamp authority is a single point of failure and a potential performance bottleneck. It negatively impacts system availability, and increases transaction latency and messaging overhead. We refer to the implementations of SI using a centralized timestamp authority as conventional SI.

This paper introduces Clock-SI, a fully distributed implementation of SI for partitioned data stores. Clock-SI uses loosely synchronized clocks to assign snapshot and commit timestamps to transactions, avoiding the centralized timestamp authority in conventional SI. Similar to conventional SI, partitions do their own certification, and a 2PC protocol is used to commit transactions that update multiple partitions.

Compared with conventional SI, Clock-SI improves system availability and performance. Clock-SI does not have a single point of failure and a potential performance bottleneck. It saves one round-trip message for a read-only transaction (to obtain the snapshot timestamp), and two round-trip messages for an update transaction (to obtain the snapshot timestamp and the commit timestamp). These benefits are significant when the workload consists of short transactions as in key-value stores, and even more prominent when the data set is partitioned geographically across data centers.

We build on earlier work [10], [11], [12] to totally order events using physical clocks in distributed systems. The novelty of Clock-SI is to efficiently create consistent snapshots using loosely synchronized clocks. In particular, a transaction's snapshot timestamp is the value of the local clock at the partition where it starts. Similarly, the commit timestamp of a local update transaction is obtained by reading the local clock.

The implementation of Clock-SI poses several challenges because of using loosely synchronized clocks. The core of these challenges is that, due to clock skew or a pending commit, a transaction may receive a snapshot timestamp for which the corresponding snapshot is not yet fully available. We delay operations that access the unavailable part of a snapshot until it becomes available. As an optimization, we can assign to a transaction a snapshot timestamp that is slightly smaller than the clock value to reduce the possibility of delayed operations.

We build an analytical model to study the properties of Clock-SI and analyze the trade-offs of using old snapshots. We also verify the model using a system implementation. We demonstrate the performance benefits of Clock-SI on a partitioned key-value store using a micro-benchmark (YCSB [13]) and an application-level benchmark (Twitter feed-following [14]). We show that Clock-SI has significant performance advantages. In particular, for short read-only transactions, Clock-SI improves latency and throughput by up to 50% over conventional SI. This performance improvement comes with higher availability as well.

In this paper, we make the following contributions:

• We present Clock-SI, a fully distributed protocol that implements SI for partitioned data stores using loosely synchronized clocks (Section III).
• We develop an analytical model to study the performance properties and trade-offs of Clock-SI (Section IV).
• We build a partitioned key-value store and experimentally evaluate Clock-SI to demonstrate its performance benefits (Section V).
II. BACKGROUND AND OVERVIEW

In this section, we define the system model and SI, and describe the challenges of using loosely synchronized physical clocks to implement SI.

A. System Model

We consider a multiversion key-value store, in which the dataset is partitioned and each partition resides on a single server. A server has a standard hardware clock. Clocks are synchronized by a clock synchronization protocol, such as the Network Time Protocol (NTP) [15]. We assume that clocks always move forward, perhaps at different speeds, as provided by common clock synchronization protocols [15], [16]. The absolute value of the difference between clocks on different servers is bounded by the clock synchronization skew.

The key-value store supports three basic operations: get, put, and delete. A transaction consists of a sequence of basic operations. A client connects to a partition, selected by a load balancing scheme, and issues transactions to the partition. We call this partition the originating partition of transactions from the connected client. The originating partition executes the operations of a transaction sequentially. If the originating partition does not store a data item needed by an operation, it executes the operation at the remote partition that stores the item.

The originating partition assigns the snapshot timestamp to a transaction by reading its local clock. When an update transaction starts to commit, if it updates data items at a single partition, the commit timestamp is assigned by reading the local clock at that partition. We use a more complex protocol to commit a transaction that updates multiple partitions (see Section III-B).
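A hypothetical C++ shape for a partition under this system model (the names and signatures below are illustrative assumptions, not the paper's API):

#include <cstdint>
#include <optional>
#include <string>

// Basic per-partition operations of the key-value store (Section II-A).
struct Partition {
    virtual std::optional<std::string> get(const std::string& key,
                                           uint64_t snapshotTime) = 0;
    virtual void put(const std::string& key, const std::string& value) = 0;
    virtual void remove(const std::string& key) = 0;  // "delete" is a C++ keyword
    virtual uint64_t clockTime() = 0;                 // local physical clock
    virtual ~Partition() = default;
};
// The originating partition executes a transaction's operations in order,
// forwarding each operation on an item it does not store to the remote
// partition responsible for that item.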
B. Snapshot Isolation

Formally, SI is a multiversion concurrency control scheme with three main properties [1], [17], [18] that must be satisfied by the underlying implementation: (1) Each transaction reads from a consistent snapshot, taken at the start of the transaction and identified by a snapshot timestamp. A snapshot is consistent if it includes all writes of transactions committed before the snapshot timestamp, and if it does not include any writes of aborted transactions or of transactions committed after the snapshot timestamp. (2) Update transactions commit in a total order. Every commit produces a new database snapshot, identified by the commit timestamp. (3) An update transaction aborts if it introduces a write-write conflict with a concurrent committed transaction. Transaction T1 is concurrent with committed update transaction T2 if T1 took its snapshot before T2 committed and T1 tries to commit after T2 committed.
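The concurrency test in property (3) is an interval-overlap check on the two transactions' timestamps. A small C++ sketch, assuming both timestamps are available:

#include <cstdint>

struct TxTimestamps {
    uint64_t snapshotTime;  // when the transaction took its snapshot
    uint64_t commitTime;    // when it committed (or tries to commit)
};

// T1 is concurrent with committed update transaction T2 if T1 took its
// snapshot before T2 committed and T1 tries to commit after T2 committed.
bool concurrentWith(const TxTimestamps& t1, const TxTimestamps& t2) {
    return t1.snapshotTime < t2.commitTime && t1.commitTime > t2.commitTime;
}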

C. Challenges

From the SI definition, a consistent snapshot with snapshot timestamp t includes, for each data item, the version written by the transaction with the greatest commit timestamp smaller than t. This property holds independent of where a transaction starts and gets its snapshot timestamp, where an update transaction gets its commit timestamp, and where the accessed data items reside. Ensuring this property is challenging when assigning snapshot and commit timestamps using clocks, as we illustrate here. While these situations happen relatively rarely, they must be handled for correctness. We show in detail how Clock-SI addresses these challenges in Section III.

Example 1: First, we show that clock skew may cause a snapshot to be unavailable. Figure 1 shows a transaction accessing two partitions. Transaction T1 starts at partition P1, the originating partition. P1 assigns T1's snapshot timestamp to the value t. The clock at P2 is behind by some amount θ, and thus at time t on P1, P2's clock value is t − θ. Later on, T1 issues a read for data item x stored at partition P2. The read arrives at time t′ on P2's clock, before P2's clock has reached the value t, and thus t′ < t. The snapshot with timestamp t at P2 is therefore not yet available. Another transaction on P2 could commit at time t″, between t′ and t, and change the value of x. This new value should be included in T1's snapshot.

Fig. 1. Snapshot unavailability due to clock skew.

Example 2: Second, we show that the pending commit of an update transaction can cause a snapshot to be unavailable. Figure 2 depicts two transactions running in a single partition. T2's snapshot is unavailable due to the commit in progress of transaction T1, which is assigned the value of the local clock, say t, as its commit timestamp. T1 updates item x and commits. The commit operation involves a write to stable storage and completes at time t′. Transaction T2 starts between t and t′, and gets assigned a snapshot timestamp t″, with t < t″ < t′. If T2 issues a read for item x, we cannot return the value written by T1, because we do not yet know whether the commit will succeed, but we can also not return the earlier value, because, if T1's commit succeeds, this older value will not be part of a consistent snapshot at t″.

Fig. 2. Snapshot unavailability due to the pending commit of an update transaction.

Both examples are instances of a situation where the snapshot specified by the snapshot timestamp of a transaction is not yet available. These situations arise because of using physical clocks at each partition to assign snapshot and commit timestamps in a distributed fashion. We deal with these situations by delaying the operation until the snapshot becomes available.

As an optimization, the originating partition can assign to a transaction a snapshot timestamp that is slightly smaller than its clock value, with the goal of reducing the probability and duration that an operation needs to be delayed, albeit at the cost of reading slightly stale data. Returning to Example 1, if we assign a snapshot timestamp by subtracting the expected clock skew from the local clock, then the probability of the snapshot not being available because of clock skew decreases substantially.

III. CLOCK-SI

In this section, we describe how Clock-SI works. We first present the read protocol, which provides transactions consistent snapshots across partitions, and then the commit protocol. Next, we discuss correctness and other properties of Clock-SI.
timestamp by reading the local physical clock, and possibly Both examples are instances of a situation where the snap- subtracting a parameter, ∆, to access an older snapshot as we shot specified by the snapshot timestamp of a transaction is not explain in Section III-C. The assigned timestamp determines yet available. These situations arise because of using physical the snapshot of the transaction. clocks at each partition to assign snapshot and commit times- Consistent Snapshot Reads. A transaction reads a data tamps in a distributed fashion. We deal with these situations by item by its identifier denoted by oid (lines 4-18). To guarantee delaying the operation until the snapshot becomes available. that a transaction reads from a consistent snapshot, Clock-SI delays a read operation until the required snapshot becomes Algorithm 2 Clock-SI commit protocol. available in two cases. 1: CommitTransaction(transaction T) Case 1: Snapshot unavailability due to pending commit. 2: if T updates a single partition 3: then LocalCommit(T) Transaction T tries to access an item that is updated by another 4: else DistributedCommit(T) transaction T 0 which has a commit timestamp smaller than T ’s 5: LocalCommit(transaction T) snapshot timestamp but has not yet completed the commit. For 6: if CertificationCheck(T) is successful 0 example, T is being committed locally but has not completely 7: T.State ← committing committed (lines 6-10) or T 0 is prepared in 2PC (lines 11- 8: T.CommitTime ← GetClockTime() 17). 1 We delay T ’s access to ensure that a snapshot includes 9: log T.CommitTime and T.Writeset only committed updates and all the updates committed before 10: T.State ← committed the snapshot timestamp. The delay is bounded by the time of 11: // two-phase commit 12: DistributedCommit(transaction T) synchronously writing the update transaction’s commit record 13: for p in T.UpdatedPartitions to stable storage, plus one round-trip network latency in the 14: send prepare T to p case that a transaction updates multiple partitions. 15: wait until receiving T prepared from participants Case 2: Snapshot unavailability due to clock skew. 16: T.State ← committing When a transaction tries to access a data item on a remote 17: // choose transaction commit time 18: T.CommitTime ← max(all prepare timestamps) partition and its snapshot timestamp is greater than the clock 19: log T.CommitTime and commit decision time at the remote partition, Clock-SI delays the transaction 20: T.State ← committed until the clock at the remote partition catches up (lines 19- 21: for p in T.UpdatedPartitions 22). The transaction, therefore, does not miss any committed 22: send commit T to p changes included in its snapshot. The delay is bounded by the 23: upon receiving message prepare T maximum clock skew allowed by the clock synchronization 24: if CertificationCheck(T) is successful protocol minus one-way network latency. 25: log T.WriteSet and T’s coordinator ID 26: T.State ← prepared In both cases, delaying a read operation does not introduce 27: T.PrepareTime ← GetClockTime() deadlocks: An operation waits only for a finite time, until a 28: send T prepared to T’s coordinator commit operation completes, or a clock catches up. 
B. Commit Protocol

With Clock-SI, a read-only transaction reads from its snapshot and commits without further checks, even if the transaction reads from multiple partitions. An update transaction modifies items in its workspace. If the update transaction modifies a single partition, it commits locally at that partition. Otherwise, we use a coordinator to either commit or abort the update transaction at the updated partitions. One important aspect is how to assign a commit timestamp to update transactions. Algorithm 2 presents the pseudocode of the commit protocol.

Algorithm 2 Clock-SI commit protocol.
1: CommitTransaction(transaction T)
2:   if T updates a single partition
3:   then LocalCommit(T)
4:   else DistributedCommit(T)
5: LocalCommit(transaction T)
6:   if CertificationCheck(T) is successful
7:     T.State ← committing
8:     T.CommitTime ← GetClockTime()
9:     log T.CommitTime and T.WriteSet
10:    T.State ← committed
11: // two-phase commit
12: DistributedCommit(transaction T)
13:   for p in T.UpdatedPartitions
14:     send prepare T to p
15:   wait until receiving T prepared from participants
16:   T.State ← committing
17:   // choose transaction commit time
18:   T.CommitTime ← max(all prepare timestamps)
19:   log T.CommitTime and commit decision
20:   T.State ← committed
21:   for p in T.UpdatedPartitions
22:     send commit T to p
23: upon receiving message prepare T
24:   if CertificationCheck(T) is successful
25:     log T.WriteSet and T's coordinator ID
26:     T.State ← prepared
27:     T.PrepareTime ← GetClockTime()
28:     send T prepared to T's coordinator
29: upon receiving message commit T
30:   log T.CommitTime
31:   T.State ← committed

Committing a single-partition update transaction. If a transaction updates only one partition, it commits locally at the updated partition (lines 5-10). Clock-SI first certifies the transaction by checking its writeset against those of concurrent committed transactions [18]. Before assigning the commit timestamp, the transaction state changes from active to committing. The updated partition reads its clock to determine the commit timestamp, and writes the commit record to stable storage. Then, the transaction state changes from committing to committed, and its effects are visible in snapshots taken after the commit timestamp.

Committing a distributed update transaction. A multi-partition transaction, which updates two or more partitions, commits using an augmented 2PC protocol (lines 11-31) [2]. The transaction coordinator runs at the originating partition. Certification is performed locally at each partition that executed updates for the transaction (line 24). Each participant writes its prepare record to stable storage, changes its state from active to prepared, obtains the prepare timestamp from its local clock, sends the prepare message with the prepare timestamp to the coordinator, and waits for the response (lines 25-28). The 2PC coordinator computes the commit timestamp as the maximum prepare timestamp of all participants (line 18).

Choosing the maximum of all prepare timestamps as the commit timestamp of a distributed update transaction is important for correctness. Remember from the read protocol that, on a participant, reads from transactions with a snapshot timestamp greater than the prepare timestamp of the committing transaction are delayed. If the coordinator were to return a commit timestamp smaller than the prepare timestamp on any of the participants, then a read of a transaction with a snapshot timestamp smaller than the prepare timestamp but greater than that commit timestamp would not have been delayed and would have read an incorrect version (i.e., a version other than the one created by the committing transaction). Correctness is still maintained if a participant receives a commit timestamp greater than its current clock value: The effects of the update transaction will be visible only to transactions with snapshot timestamps greater than its commit timestamp.
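The coordinator's timestamp choice (Algorithm 2, line 18) is a one-line reduction over the participants' replies; a hedged C++ sketch:

#include <algorithm>
#include <cstdint>
#include <vector>

// The commit timestamp of a distributed update transaction is the maximum
// prepare timestamp reported by the participants (assumed non-empty).
// Since each participant delays reads above its own prepare timestamp,
// no participant can expose the commit to a snapshot that was allowed to
// proceed without delay.
uint64_t chooseCommitTimestamp(const std::vector<uint64_t>& prepareTimestamps) {
    return *std::max_element(prepareTimestamps.begin(), prepareTimestamps.end());
}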
C. Choosing Older Snapshots

In Clock-SI, the snapshot timestamp of a transaction is not restricted to the current value of the physical clock. We can choose the snapshot timestamp to be smaller than the clock value by ∆, as shown on line 2 of Algorithm 1. We can choose ∆ to be any non-negative value, and make this choice on a per-transaction basis. If we want a transaction to read fresh data, we set ∆ to 0. If we want to reduce the delay probability of transactions to close to zero, we choose an older snapshot by setting ∆ to the maximum of (1) the time required to commit a transaction to stable storage synchronously plus one round-trip network latency, and (2) the maximum clock skew minus the one-way network latency between two partitions. These two values can be measured and distributed to all partitions periodically. Since networks and storage devices are asynchronous, such a choice of the snapshot age does not completely prevent the delay of transactions, but it significantly reduces its probability.

While substantially reducing the delay probability of transactions, taking a slightly older snapshot comes at a cost: The transaction observes slightly stale data, and the transaction abort rate increases by a small fraction. We study this trade-off using an analytical model in Section IV and experiments on a prototype system in Section V.
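A sketch of this choice of ∆ in C++, assuming the two bounds have been measured elsewhere and are passed in (all values in microseconds):

#include <algorithm>
#include <cstdint>

// Pick the snapshot age Delta as the maximum of (1) the synchronous
// commit time plus one round trip and (2) the maximum clock skew minus
// the one-way latency. Inputs are assumed to come from periodic
// measurements distributed to all partitions.
uint64_t chooseSnapshotAge(uint64_t commitToStableStorage,
                           uint64_t roundTripLatency,
                           uint64_t maxClockSkew) {
    uint64_t pendingCommitBound = commitToStableStorage + roundTripLatency;
    uint64_t oneWayLatency = roundTripLatency / 2;
    uint64_t skewBound =
        maxClockSkew > oneWayLatency ? maxClockSkew - oneWayLatency : 0;
    return std::max(pendingCommitBound, skewBound);
}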
D. Correctness

We show that Clock-SI implements SI by satisfying the three properties that define SI [18]. Furthermore, we show that safety is always maintained, regardless of clock synchronization precision.

(1) Transactions commit in a total order. Clock-SI assigns commit timestamps to update transactions by reading values from physical clocks. Ties are resolved based on partition ids. The commit timestamp order produces a total order on transaction commits.

(2) Transactions read consistent snapshots. The snapshot in Clock-SI is consistent with respect to the total commit order of update transactions. The snapshot timestamp specifies a snapshot from the totally ordered commit history. A transaction reads all committed changes of transactions with a smaller commit timestamp. By delaying transaction operations in the read protocol, a transaction never misses the version of a data item it is supposed to read. A transaction does not read values from an aborted transaction or from a transaction that commits with a greater commit timestamp.

(3) Committed concurrent transactions do not have write-write conflicts. Clock-SI identifies concurrent transactions by checking whether their execution times overlap, using the snapshot and commit timestamps. Clock skews do not affect the identification of concurrent transactions according to their snapshot and commit timestamps. Clock-SI aborts one of two concurrent transactions with a write-write conflict.

We also point out an important property of Clock-SI: The precision of the clock synchronization protocol does not affect the correctness of Clock-SI, only its performance. Large clock skews increase a transaction's delay probability and duration; safety is, however, always maintained, satisfying the three properties of SI.

Although it does not affect Clock-SI's correctness, the transaction commit order in Clock-SI may be different from the real-time commit order (according to global time) because of clock skews. This only happens to independent transactions at different partitions whose commit timestamp difference is less than the clock synchronization precision. Notice that this phenomenon also happens with conventional SI: A transaction commits on a partition after it obtains the commit timestamp from the timestamp authority. The asynchronous messages signaling the commit may arrive at the partitions in an order different from the order specified by the timestamps. Clock-SI preserves the commit order of dependent transactions when this dependency is expressed through the database, as in a centralized system [2].

E. Discussion

Clock-SI is a fully distributed protocol. Compared with conventional SI, Clock-SI provides better availability and scalability. In addition, it also reduces transaction latency and messaging overhead.

Availability. Conventional SI maintains timestamps using a centralized service, which is a single point of failure. Although the timestamp service can be replicated to tolerate a certain number of replica failures, replication comes with performance costs. In contrast, Clock-SI does not include such a single point of failure in the system. The failure of a data partition only affects transactions accessing that partition; other partitions are still available.

Communication cost. With conventional SI, a transaction needs one round of messages to obtain the snapshot timestamp. An update transaction needs another round of messages to obtain the commit timestamp. In contrast, under Clock-SI, a transaction obtains the snapshot and commit timestamps by reading local physical clocks. As a result, Clock-SI reduces the latency of read-only transactions by one round trip, and the latency of update transactions by two round trips. By sending and receiving fewer messages to start and commit a transaction, Clock-SI also reduces the cost of transaction execution.

Scalability. Since Clock-SI is a fully distributed protocol, the throughput of single-partition transactions increases as more partitions are added. In contrast, conventional SI uses a centralized timestamp authority, which can limit system throughput as it is on the critical path of transaction execution.

Session consistency. Session consistency [19], [20] guarantees that, in a workflow of transactions in a client session, each transaction sees (1) the updates of earlier committed transactions in the same session, and (2) non-decreasing snapshots of the data. Session consistency can be supported under Clock-SI using the following standard approach. When a client finishes a read-only transaction, its snapshot timestamp is returned. When a client successfully commits an update transaction, its commit timestamp is returned. A LatestTime value maintained by the client is updated to the value returned by the last completed transaction. When the client starts a new transaction, it sends LatestTime to the originating partition of that transaction. If LatestTime is greater than the current clock value at that partition, the transaction is blocked until the clock proceeds past LatestTime. Otherwise, it starts immediately.
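A minimal C++ sketch of this bookkeeping, following the description above (the structure and helper names are ours):

#include <cstdint>

// Per-client state for session consistency under Clock-SI.
struct ClientSession {
    uint64_t latestTime = 0;  // the client's LatestTime

    // Update with the snapshot timestamp returned by a finished read-only
    // transaction, or the commit timestamp of a committed update.
    void onTransactionCompleted(uint64_t returnedTs) { latestTime = returnedTs; }
};

// At the originating partition: a new transaction carrying LatestTime is
// blocked until the local clock proceeds past it; otherwise it starts
// immediately.
bool mustBlock(uint64_t latestTime, uint64_t localClockNow) {
    return latestTime > localClockNow;
}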
Recovery. We employ traditional recovery techniques to recover a partition in Clock-SI. Each partition maintains a write-ahead log (WAL) containing the transaction update records (as redo records) and commit records, as well as 2PC prepare records containing the identity of the coordinator. In addition, the partition uses checkpointing to reduce the recovery time. Taking a checkpoint is the same as reading a full snapshot of the partition state. If a partition crashes, it recovers from the latest complete checkpoint, replays the log, and determines the outcome of prepared but not terminated transactions from their coordinators.

IV. ANALYTICAL MODEL

In this section, we assess the performance properties of Clock-SI analytically. Our objective is to reason about how various factors impact the performance of Clock-SI. We show the following: (1) With normal database configurations, the delay probability of a transaction is small and the delay duration is short. (2) Taking an older snapshot reduces the delay probability and duration, but slightly increases the abort rate of update transactions.

We derive formulas for transaction delay probability, delay duration, and abort rate. We verify the model predictions in Section V-E using a distributed key-value store. Readers who are not interested in the mathematical derivations of the analytical model may skip this section.

A. Model Parameters

Our model is based on prior work that predicts the probability of conflicts in centralized [21] and replicated [22], [18] databases. We consider a partitioned data store that runs Clock-SI. The data store has a fixed set of items. The total number of items is DBSize. We assume all partitions have the same number of items. There are two types of transactions: read transactions and update transactions. Each read transaction reads R items and takes Lr time units to finish. Each update transaction updates W items and takes Lu time units to finish. The data store processes TPSu update transactions per unit of time.

The committing window of update transaction Ti, denoted CWi, is the time interval during which Ti is in the committing state. On average, an update transaction takes CW time units to persist its writeset to stable storage synchronously. We denote by RRD the message request-reply delay, which includes one round-trip network delay, data transfer time, and message processing time. We denote by S the time to synchronously commit an update transaction to stable storage. ∆, the snapshot age, is a non-negative value subtracted from the snapshot timestamp read from the physical clock to obtain an older snapshot, as shown in Algorithm 1.
B. Delay due to Pending Commit

A transaction might be delayed when it reads an item updated by another transaction being committed to stable storage.

A read transaction Ti takes its snapshot at time STi. Assume another update transaction Tj obtains its commit timestamp at time CTj. With Clock-SI, at time CTj, Tj still needs CW time to synchronously persist its writeset to stable storage. Assume Tj obtains its commit timestamp before Ti takes its snapshot, i.e., CTj < STi. If STi − CTj < CW, then when Ti reads an item updated by Tj, it is delayed until Tj completes the commit to stable storage. Other update transactions Tk, with STi − CTk > CW, do not delay Ti, because by the time Ti reads items updated by Tk, Tk must have completed. Taking a ∆-old snapshot is equivalent to shortening the committing window of update transactions to CW − ∆. In this case, Ti is delayed by Tj if STi − CTj < CW − ∆. If ∆ > CW, then CW − ∆ effectively becomes zero.

During CW − ∆, (CW − ∆) ∗ TPSu transactions update (CW − ∆) ∗ TPSu ∗ W data items. The probability that Ti reads any particular item in the database is R/DBSize. If CW − ∆ > Lr, then the probability that Ti is delayed is (CW − ∆) ∗ TPSu ∗ W ∗ (R/DBSize). If CW − ∆ < Lr, then only reading the first R ∗ (CW − ∆)/Lr items can possibly delay Ti. The probability becomes (CW − ∆) ∗ TPSu ∗ W ∗ (R ∗ (CW − ∆)/Lr)/DBSize.

For a transaction that updates only one partition, its committing window is CW = S, and its commit timestamp is available at the updated partition. For a transaction that updates multiple partitions, its commit timestamp is only available at its originating partition before 2PC completely finishes. Hence a delayed transaction may take one extra round-trip network latency to obtain a commit timestamp from the update transaction's originating partition. We use CW = S + RRD as an estimate of the committing window of a distributed update transaction.

We assume all update transactions update items at multiple partitions. Combining the above analysis, the delay probability of short read transactions, i.e., CW − ∆ > Lr, is

(S + RRD − ∆) ∗ TPSu ∗ W ∗ R / DBSize

The delay probability of long read transactions, i.e., CW − ∆ < Lr, is

(S + RRD − ∆)² ∗ TPSu ∗ W ∗ R / (DBSize ∗ Lr)

The expected delay duration of read transactions is

0.5 ∗ (S + RRD − ∆)

The above results show that the delay duration is bounded and normally short. Although the delay probability depends on various factors, we show that, with normal database configurations and workloads, it is low, by a numerical example in Section IV-E and by experiments in Section V-E. By assigning an older snapshot to a transaction, we can reduce its delay probability and shorten its delay duration.

C. Delay due to Clock Skew

Imperfect time synchronization causes clock skew. A transaction is delayed when it accesses a remote partition and the remote clock time is smaller than its snapshot timestamp. As we show in Section V, common clock synchronization protocols, such as NTP, work well in practice, and the clock skew is very small. Hence, this type of delay rarely happens.

We assume the clock skew SK between each pair of clocks follows a normal distribution [23] with mean µ and standard deviation δ. The delay probability when a transaction accesses a remote partition is

P(SK > 0.5 ∗ RRD + ∆) = 1 − Φ((0.5 ∗ RRD + ∆ − µ) / δ)

The expected delay duration is

( ∫_{0.5∗RRD+∆}^{+∞} x e^{−(x−µ)²/(2δ²)} dx ) / ( ∫_{0.5∗RRD+∆}^{+∞} e^{−(x−µ)²/(2δ²)} dx ) − (0.5 ∗ RRD + ∆)

that is, the expected skew conditioned on a delay occurring, minus the threshold 0.5 ∗ RRD + ∆.

The results show that if the clock skew is greater than the one-way network latency, it becomes possible for a transaction to be delayed when accessing a remote partition, because the requested snapshot is not yet available when the transaction arrives. By assigning an older snapshot to a transaction, we can reduce its delay probability and shorten its delay duration. Suppose the maximum clock skew is SKmax time units. Taking an older snapshot with ∆ = max(CW, SKmax − 0.5 ∗ RRD) eliminates almost all the delays due to either pending commits or clock skews.
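The delay probability above can be evaluated with the standard-normal tail, since 1 − Φ(z) = 0.5 ∗ erfc(z/√2). A small C++ sketch; the parameter values below are placeholders, not measurements from the paper:

#include <cmath>
#include <cstdio>

// P(SK > 0.5*RRD + Delta) = 1 - Phi((0.5*RRD + Delta - mu)/delta),
// computed via the complementary error function.
double skewDelayProbability(double rrd, double Delta, double mu, double delta) {
    double z = (0.5 * rrd + Delta - mu) / delta;
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

int main() {
    // Placeholder values: RRD = 140us, Delta = 0, skew ~ N(0, 50us).
    std::printf("%g\n", skewDelayProbability(140e-6, 0.0, 0.0, 50e-6));
}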
D. Update Transaction Abort Probability

Assigning old snapshots to transactions reduces their delay probability and duration. However, it increases the abort rate of update transactions, because the execution time of an update transaction is extended and more update transactions run concurrently with it.

We first compute the probability that a transaction Ti has to abort without adjusting the snapshot timestamp [18]. On average, the number of transactions that commit in Lu, the lifetime of Ti, is TPSu ∗ Lu. The number of data items updated by these transactions is W ∗ TPSu ∗ Lu. The probability that one particular item in the database is updated during Lu is W ∗ TPSu ∗ Lu / DBSize. As Ti updates W items in total, the probability that Ti conflicts with its concurrent transactions and has to abort is W² ∗ TPSu ∗ Lu / DBSize. The abort probability is directly proportional to Lu, the duration of update transaction execution.

Clock-SI assigns to each transaction a snapshot that is ∆ older than the latest snapshot. The transaction abort probability becomes

W² ∗ TPSu ∗ (Lu + ∆) / DBSize

This is an upper bound on the abort rate, because assigning an older snapshot to a transaction does not extend its execution time physically. The actual transaction execution time remains unchanged; it only becomes longer logically.

With other parameters fixed, the longer a transaction executes, the more concurrent transactions run with it, increasing the likelihood of write-write conflicts with other transactions. Assigning an older snapshot to a transaction therefore increases its abort probability.

E. Example

We provide a numerical example using the above equations to calculate the transaction delay probability and duration due to pending commits of update transactions. We assume DBSize = 10,000,000, R = 10, W = 10, TPSu = 10,000, RRD = 0.2ms, ∆ = 0, CW = 8ms, and Lu = 32ms. If we use mechanical hard disks to persist transaction updates, it takes a few milliseconds (e.g., 8ms) to commit an update transaction. The delay probability is 0.08% and the expected delay time is 4.1ms. On average, each transaction is delayed for 3.2µs. If a transaction takes a snapshot that is 8.2ms earlier than the clock time, i.e., ∆ = 8.2ms, both the delay probability and the delay time are zero, and the transaction abort probability increases by 25% (from 0.0032 to 0.004). Therefore we can almost eliminate transaction delays at the cost of a slight increase in the abort rate.
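The example's arithmetic can be checked directly from the formulas of Sections IV-B and IV-D. A short C++ recomputation of ours (the paper's 3.2µs average reflects rounding the probability to 0.08%):

#include <cstdio>

// Recomputing the numbers of Section IV-E. Times are in seconds.
int main() {
    const double DBSize = 1e7, R = 10, W = 10, TPSu = 1e4;
    const double RRD = 0.2e-3, Delta = 0.0, S = 8e-3, Lu = 32e-3;

    const double CW = S + RRD;  // committing window of a distributed update
    const double pDelay = (CW - Delta) * TPSu * W * R / DBSize;
    const double dDelay = 0.5 * (CW - Delta);  // expected delay, if delayed

    const double pAbort0 = W * W * TPSu * Lu / DBSize;             // Delta = 0
    const double pAbort1 = W * W * TPSu * (Lu + 8.2e-3) / DBSize;  // Delta = 8.2ms

    std::printf("delay probability: %.3f%%\n", pDelay * 100);           // ~0.08%
    std::printf("expected delay:    %.1f ms\n", dDelay * 1e3);          // 4.1 ms
    std::printf("avg delay per tx:  %.2f us\n", pDelay * dDelay * 1e6); // ~3.4 us
    std::printf("abort prob: %.4f -> %.4f\n", pAbort0, pAbort1);        // +25%
}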

V. EXPERIMENTAL EVALUATION

We evaluate the performance benefits of Clock-SI using a micro-benchmark and an application-level benchmark with a partitioned key-value store in both LAN and WAN environments, and show the following: (1) Compared with conventional SI, Clock-SI reduces transaction response time and increases throughput. (2) Selecting a slightly older snapshot reduces the delay probability and duration of transactions. Furthermore, we verify the predictions of our analytical model using experimental measurements. Notice that we focus only on assessing the performance benefits of Clock-SI, rather than the improvement in availability stemming from avoiding a single point of failure.

A. Implementation and Setup

We build a partitioned multiversion key-value store in C++. It supports Clock-SI as well as conventional SI as implemented in other systems [3]. With conventional SI, a transaction communicates with a centralized timestamp authority running on a separate server to retrieve timestamps. A transaction needs one round of messages to obtain the snapshot timestamp. An update transaction needs another round of messages to obtain the commit timestamp.

The data set is partitioned among a group of servers. Servers have standard hardware clocks synchronized using NTP running in peer mode [15]. A key is assigned to a partition based on its hash. The default sizes of a key and its value are 8 and 64 bytes, respectively. We keep all key-value pairs in a group of hash tables in main memory. A key points to a linked list that contains all the versions of the corresponding value. The transaction commit log resides on the hard disk. The system performs group commit to write multiple transaction commit records in one stable disk write.
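A minimal sketch of group commit as described in the sentence above, assuming a single log file; the buffering and flushing policy here are our illustration, not the system's actual code:

#include <cstdio>
#include <string>
#include <vector>

// Commit records accumulate in a buffer, and a single synchronous write
// makes a whole batch of transactions durable at once.
class GroupCommitLog {
    std::FILE* log;
    std::vector<std::string> pending;  // commit records awaiting the disk

public:
    explicit GroupCommitLog(const char* path) : log(std::fopen(path, "ab")) {}

    void append(const std::string& commitRecord) {
        pending.push_back(commitRecord);
    }

    // One stable disk write covers every queued commit record.
    void flushBatch() {
        for (const auto& r : pending)
            std::fwrite(r.data(), 1, r.size(), log);
        std::fflush(log);  // a real system would also fsync() here
        pending.clear();
    }

    ~GroupCommitLog() { if (log) std::fclose(log); }
};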

We conduct experiments in both LAN and WAN environments. For LAN experiments, we deploy our key-value store in a local cluster. Servers in our local cluster run Linux 3.2.0. Each server has two Intel Xeon processors and 4GB of DDR2 memory. The transaction log resides on a 7200rpm 160GB SATA disk. We disable the disk cache so that synchronous writes reach the disk media. The average time of a synchronous disk write is 6.7ms. The system has eight partitions. Each partition runs on one server and manages one million keys in main memory. All machines are connected to a single Gigabit Ethernet switch. The average round-trip network latency is 0.14 milliseconds. For WAN experiments, we deploy our system on Amazon EC2 using medium instances (size M1) at multiple data centers.

B. Micro-benchmarks

We implement a benchmark tool based on the Yahoo! Cloud Serving Benchmark (YCSB) [13], which is designed to benchmark key-value stores. We extend YCSB to support transactions, since the original YCSB does not.

Latency. Clock-SI takes a shorter time to run a transaction because it does not communicate with a timestamp authority for transaction snapshot and commit timestamps. Figure 3 shows the latency of read-only transactions in our local cluster. Both the clients and the servers are connected to the same switch. Each transaction reads eight items at each partition, with keys chosen uniformly at random. We vary the number of partitions accessed by a transaction from one to three. Compared with conventional SI, Clock-SI saves approximately one round-trip latency. For a single-partition read-only transaction, this amounts to about 50% latency savings. We also see consistent savings of one round-trip latency in the other two cases.

Fig. 3. Latency distribution of read-only transactions in a LAN.

We run the same experiment in a WAN environment on Amazon EC2 and measure transaction latency. We place three data partitions at three data centers in different geographical locations: US West, US East, and Europe. The timestamp server is co-located with the partition in US West. A client always chooses the nearest partition as the originating partition of its transactions.

First, we run clients at our local cluster in Europe and vary the number of partitions accessed by a transaction from one to three. Figure 4 shows the measured latency. Compared with conventional SI, Clock-SI reduces the latency of each transaction by about 160 milliseconds since it does not contact the remote timestamp server.
Next, for transactions issued by the clients near US West and served by the partition co-located with the timestamp server, the latency saved by Clock-SI becomes sub-millisecond, and the latencies are similar to those in Figure 3. Last, for the clients near US East, Clock-SI still reduces their transaction latency by one round trip between two data centers, which is about 170 milliseconds. For update transactions, the latency is reduced similarly, by two network round trips, in both LAN and WAN environments.

Fig. 4. Latency distribution of read-only transactions in a WAN. Partitions are in data centers in Europe, US West and US East. Clients are in our local cluster in Europe.

Throughput. Next we compare the throughput of Clock-SI with that of conventional SI. We run a large number of clients to saturate the servers that host the key-value data partitions. Figure 5 shows the throughput of read-only and update-only single-partition transactions. The number of partitions serving client requests varies from one to eight. Each transaction reads or updates eight items chosen at random at each partition.

We first analyze read-only transactions. Below five partitions, the throughput of Clock-SI is about twice that of conventional SI. The cost of a read-only transaction stems mainly from sending and receiving messages. As Clock-SI does not contact the timestamp authority to obtain snapshot timestamps, it sends and receives one message for each transaction, while conventional SI sends and receives two messages. Beyond five partitions, the throughput gap becomes larger: The throughput of Clock-SI increases linearly as the number of partitions increases. For conventional SI, the throughput levels off at around 64k transactions/second. The timestamp authority becomes the bottleneck because it can only assign about 64k timestamps/second.

Update transactions show similar results in Figure 5. The throughput of update transactions is lower than that of read-only transactions, because an update transaction does more work, including creating new versions of items, updating version metadata, and performing I/O to make updates durable on the disk. Update transactions require two timestamps from the timestamp authority. Given its limit of 64k timestamps/second, the timestamp authority sustains only 32k update transactions/second. The timestamp authority again becomes the bottleneck.

Fig. 5. Throughput comparison for single-partition read-only and update-only transactions. Each transaction reads/updates eight items.

Fig. 6. Throughput of Twitter Feed-Following application. 90% read-tweets and 10% post-tweets.

Batching timestamp requests from the data partitions to the timestamp authority can improve the throughput of conventional SI. However, message batching comes with the cost of increased latency. In addition, in a system with a large number of data partitions, even with message batching, the centralized timestamp authority can still become a bottleneck under heavy workloads.

The results of our micro-benchmarks show that Clock-SI has better performance than conventional SI in both LAN and WAN environments as it does less work per transaction, improving both latency and throughput. We show the results of transactions containing both read and update operations for an application benchmark in the next section.

C. Twitter Feed-Following Benchmark

We build a Twitter-like social networking application on top of the distributed key-value store. The feed-following application supports read-tweet and post-tweet transactions on a social graph. The transactions guarantee that a user always reads consistent data. Each user in this application has a key to store the friend list, a key for the total number of tweets, and one key for each tweet. There is a trade-off between pushing tweets to the followers and pulling tweets from the followees [24]. We choose the push model, which optimizes for the more common read transactions.

We model 800,000 users. The followers of a user are located at three partitions. On average, each user follows 20 other users. The users accessed by the read-tweet and post-tweet transactions are chosen uniformly at random. A post-tweet transaction pushes a tweet of 140 characters to the followers of a user at three different partitions. A read-tweet transaction retrieves the 20 latest tweets from the followees, which accesses one partition. The workload includes 90% read-tweet transactions and 10% post-tweet transactions, as used in prior work [14].

Figure 6 shows the throughput of Clock-SI and conventional SI. With eight partitions, Clock-SI supports more than 38k transactions/second, while conventional SI supports 33k transactions/second. The average transaction latency of Clock-SI is also lower than that of conventional SI, by one round-trip latency for the read-tweet transactions and two round-trip latencies for the post-tweet transactions (figure not shown).

Since the transactions access a reasonably large data set uniformly at random and clocks are well synchronized by NTP, transaction delays rarely occur and do not affect the overall performance. We discuss transaction delays with carefully designed workloads further below.

D. Effects of Taking Older Snapshots

In the previous experiments, the delay probabilities are very small because there are few "conflicts" between reads and pending commits. Here we change the workload to create more conflicts using a small data set, and we show how choosing a proper value for ∆, the snapshot age, in Clock-SI reduces the delay probability and duration. We run a group of clients in which each client issues update transactions that modify ten items. Another group of clients issues transactions that read ten of the items being updated with varying probability. We vary ∆ and measure transaction throughput and latency.

Figure 7 shows the throughput of read-only transactions with different ∆ values against the probability of reading hot-spot items. Figure 8 shows the corresponding latency. As ∆ becomes larger, the throughput increases and the latency decreases. With ∆ greater than the duration of a commit, reads are not delayed by updates at all (for the curves with ∆ = 14ms and ∆ = 21ms). With smaller ∆ values, the probability that reads are delayed increases as the probability that a transaction reads the hot-spot items increases. As a result, the transaction latency increases and the throughput drops. Therefore, choosing a proper snapshot age in Clock-SI effectively reduces the probability of transaction delays.
E. Model Verification

We run experiments to verify our analytical model and the effects of using different ∆ values in Clock-SI. Each transaction both reads and writes a fixed number of data items at one partition. We check whether the delay probability and duration change as the model in Section IV predicts when we change the data set size and the snapshot age.

Fig. 7. Read transaction throughput with write hot spots.

Fig. 9. Transaction delay rate when ∆ = 0 while varying the total number of data items.

Fig. 8. Read transaction latency with write hot spots.

Fig. 10. Transaction delay rate in a small data set while varying ∆, the snapshot age.

Figure 9 shows that the delay probability decreases with the size of the data set. The numbers produced by the analytical model follow the same pattern as the experimental results. As each transaction accesses items uniformly at random, the larger the data set, the less likely it is that a transaction reads an item updated by another committing transaction.

Next we show that taking older snapshots reduces both the transaction delay probability and the delay duration. We choose a small data set with 50,000 items to make the delays happen more often. Figures 10 and 11 demonstrate the effects of choosing different snapshot ages. The older the snapshot, the lower the delay probability and the shorter the delay duration. Figure 12 shows how the transaction abort rate changes. As the analytical model predicts, the transaction abort rate increases as the age of the snapshot increases.

Fig. 11. Transaction delay time in a small data set while varying ∆, the snapshot age.

Fig. 12. Transaction abort rate in a small data set while varying ∆, the snapshot age.

F. NTP Precision

With Clock-SI, a transaction might be delayed when accessing a remote partition if the remote clock time is smaller than the snapshot timestamp of the transaction. The occurrence of this type of delay indicates that the clock skew is greater than the one-way network latency. In all the experiments, we observe that most of the transaction delays are due to the pending commits of update transactions. Only very few delays are due to clock skew.

We measure the synchronization precision of NTP indirectly on our local cluster, since it is difficult to measure the time difference of two clocks located at two different servers directly. We record the clock time, t1, at server s1 and immediately send a short message containing t1 to another server s2 over the network. After the message arrives at s2, we record its clock time, t2. The difference t2 − t1 is the skew between the two clocks at s1 and s2 plus the one-way network latency. If t2 − t1 > 0, Clock-SI does not delay transactions that originate from s1 and access data items at s2. Figure 13 shows the distribution of the clock skew between two clocks plus the one-way network latency, measured every 30 seconds over six weeks. A negative value on the x axis indicates the possibility of delaying a transaction when accessing a remote partition. As we see from the figure, the delay probability due to clock skew is very low and the delay duration is very short.

Fig. 13. Distribution of the clock skew between two clocks plus one-way network latency in LAN.
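A C++ sketch of this measurement (the messaging between s1 and s2 is elided; reading the clock via std::chrono is our stand-in for the servers' hardware clocks):

#include <chrono>
#include <cstdint>
#include <cstdio>

using Micros = int64_t;

static Micros now() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
}

// On server s1: stamp a probe message and send it.
Micros makeProbe() { return now(); }  // t1

// On server s2: stamp the probe on arrival and evaluate.
void onProbe(Micros t1) {
    Micros t2 = now();   // t2
    Micros d = t2 - t1;  // skew between the clocks plus one-way latency
    // d > 0 means transactions originating at s1 are not delayed at s2.
    std::printf("skew + one-way latency: %lld us\n", (long long)d);
}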

0.06 An extension to AOCC lets running transactions read lazily consistent states [12] according to a dependency relation 0.04

Update TX abort rate providing lazy consistency (LC). LC is weaker than SI [17]. Some read histories allowed under LC are forbidden by SI. 0.02 For example, assume two items x0 and y0. Transaction T1 0 writes x1 and commits. Then transaction T2 writes y1 and 0 10 20 30 40 50 60 ∆, snapshot age (millisecond) commits. Next, transaction T3 reads the two items. Under LC, T3 is allowed to read x0 and y1, which is not serializable and Fig. 12. Transaction abort rate in a small data set while varying ∆, the snapshot age. also not allowed under SI. Therefore, even with AOCC+LC read-only transactions need to be validated and may have to 0.025 abort. Both AOCC and AOCC+LC do not provide consistent 0.02 snapshots for running transactions. In comparison, transactions in Clock-SI always receive consistent snapshots and read-only 0.015 transactions do not abort. Spanner [27] implements serializability in a geographically

Percentage 0.01 replicated and partitioned data store. It provides external consistency based on synchronized clocks with bounded un- 0.005 certainty, called TrueTime, requiring access to GPS and atomic clocks. Spanner uses conventional two-phase locking for up- 0 -100 -50 0 50 100 150 200 date transactions to provide serializability. In addition, trans- Time difference (microsecond) actions can be annotated as read-only and executed according Fig. 13. Distribution of the clock skew between two clocks plus one-way to SI. In comparison, Clock-SI relies solely on physical time network latency in LAN. to implement SI. Spanner’s provision of external consistency requires high-precision clocks, and its correctness depends delay duration is very short. on clock synchrony. In contrast, Clock-SI uses conventional physical clocks available on today’s commodity servers, and VI.RELATED WORK its correctness does not depend on clock synchrony. SI in distributed systems. A number of large-scale dis- Granola [28] runs single-partition and independent multi- tributed data stores use SI to support distributed transactions. partition transactions serially at each partition to remove the They all rely on a centralized service for timestamp man- cost of concurrency control. Such a transaction obtains a agement. Percolator [3] adds distributed SI transactions to timestamp before execution, and transactions execute serially Bigtable [25]. Zhang and Sterck [4] implement SI on top of in timestamp order. Coordinated multi-partition transactions HBase in a system that stores all the transaction metadata in a use traditional concurrency control and commit protocols. To number of global tables. An implementation of write snapshot increase concurrency on multicore servers, Granola partitions isolation (WSI) using a centralized transaction certifier is the database among CPU cores. This increases the cost of given in [9]. Clock-SI uses physical clocks rather than a transaction execution, because transactions that access multi- centralized authority for timestamp management, with the ple partitions on the same node need distributed coordination. attendant benefits shown in this paper. In contrast, Clock-SI runs all transactions concurrently and Physical clocks in distributed systems. Liskov provides a does not require partitioning the data set among CPU cores. survey of the use of loosely synchronized clocks in distributed Relaxing SI in distributed systems. Prior work proposes systems [16]. Schneider uses loosely synchronized clocks to relaxing the total order property of SI to achieve better perfor- order operations for state machine replication [10] such that all mance in partitioned and replicated systems. Walter [14] is a replicas execute operations in the same serial order according transactional geo-replicated key-value store that uses parallel to the physical time when the operations are initialized at the snapshot isolation (PSI), which orders transactions only within replicas. In contrast, Clock-SI uses physical clocks for a dif- a site and tracks causally dependent transactions using a vector ferent purpose, providing consistent snapshots for transactions clock at each site, leaving independent transactions unordered accessing a partitioned data store. among sites. Non-monotonic snapshot isolation (NMSI) [29], The Thor project explores the use of loosely synchronized [30] provides non-monotonic snapshots. NMSI does not ex- clocks for distributed concurrency control [11], [12], [26]. 
Relaxing SI in distributed systems. Prior work proposes relaxing the total-order property of SI to achieve better performance in partitioned and replicated systems. Walter [14] is a transactional geo-replicated key-value store that uses parallel snapshot isolation (PSI), which orders transactions only within a site and tracks causally dependent transactions using a vector clock at each site, leaving independent transactions unordered among sites. Non-monotonic snapshot isolation (NMSI) [29], [30] provides non-monotonic snapshots. NMSI does not explicitly order transactions but uses dependency vectors to track causally dependent transactions. NMSI improves on PSI by supporting genuine partial replication. These relaxations of SI may fail to provide application developers with the familiar isolation levels and require extra effort to guarantee that transactions read consistent and monotonic snapshots as in SI. In contrast, Clock-SI provides a complete implementation of SI, including a total order on transaction commits and a guarantee that transactions read consistent and monotonic snapshots across partitions.

Relaxing freshness. Some systems relax freshness to improve performance at the cost of serving stale data. Relaxed currency models [31], [32], [33] allow each transaction to have a freshness constraint. Continuous consistency [34] bounds staleness using a real-time vector. These systems do not use physical clocks to assign transaction snapshot and commit timestamps. Clock-SI provides each transaction with either the latest snapshot or a slightly older snapshot (tunable per transaction) to reduce the delay probability and duration of transactions. In both cases, all the properties of SI are maintained.

Generalized snapshot isolation (GSI) [18] generalizes SI to replicated databases. It uses older snapshots to avoid the delay of waiting for committed updates to be propagated from the certifier to the replicas. In contrast, Clock-SI targets concurrency control for partitioned data stores using physical time. It allows assigning older snapshots to reduce the delay probability and duration of transactions.
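A minimal sketch of this per-transaction knob (the function name and parameter are ours, not Clock-SI's implementation; the trade-off between snapshot age and delay is analyzed earlier in the paper):

import time

def assign_snapshot_timestamp(delta_seconds=0.0):
    # Take the snapshot delta_seconds in the past on the local clock.
    # delta_seconds = 0 gives the freshest snapshot; a small positive value
    # trades snapshot age for a lower delay probability and duration.
    return time.time() - delta_seconds

fresh_ts = assign_snapshot_timestamp()        # latest snapshot
older_ts = assign_snapshot_timestamp(0.010)   # snapshot 10 ms in the past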
VII. CONCLUSIONS

Clock-SI is a fully distributed implementation of SI for partitioned data stores. Its novelty is the provision of consistent snapshots using loosely synchronized clocks: Clock-SI uses physical clocks to assign both snapshot and commit timestamps. It improves over existing systems that use a centralized timestamp authority by eliminating a single point of failure and a potential performance bottleneck. Moreover, Clock-SI avoids the round-trip latency between the partitions and the timestamp authority, yielding better response times in both LAN and WAN environments.

ACKNOWLEDGMENTS

We thank Rachid Guerraoui, Anne-Marie Kermarrec, Fernando Pedone, Masoud Saeida Ardekani, Marc Shapiro, and the anonymous reviewers for their feedback, comments, and discussions, which improved this paper.

REFERENCES

[1] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O’Neil, and P. O’Neil, “A critique of ANSI SQL isolation levels,” in SIGMOD 1995.
[2] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1986.
[3] D. Peng and F. Dabek, “Large-scale incremental processing using distributed transactions and notifications,” in OSDI 2010.
[4] C. Zhang and H. De Sterck, “Supporting multi-row distributed transactions with global snapshot isolation using bare-bones HBase,” in GRID 2010.
[5] F. Junqueira, B. Reed, and M. Yabandeh, “Lock-free transactional support for large-scale storage systems,” in HotDep 2011.
[6] S. Elnikety, S. Dropsho, and F. Pedone, “Tashkent: uniting durability with transaction ordering for high-performance scalable database replication,” in EuroSys 2006.
[7] K. Daudjee and K. Salem, “Lazy database replication with snapshot isolation,” in VLDB 2006.
[8] S. Wu and B. Kemme, “Postgres-R(SI): combining replica control with concurrency control based on snapshot isolation,” in ICDE 2005.
[9] D. Ferro and M. Yabandeh, “A critique of snapshot isolation,” in EuroSys 2012.
[10] F. Schneider, “Implementing fault-tolerant services using the state machine approach: a tutorial,” ACM Computing Surveys (CSUR), vol. 22, no. 4, pp. 299–319, 1990.
[11] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari, “Efficient optimistic concurrency control using loosely synchronized clocks,” in SIGMOD 1995.
[12] A. Adya and B. Liskov, “Lazy consistency using loosely synchronized clocks,” in PODC 1997.
[13] B. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in SOCC 2010.
[14] Y. Sovran, R. Power, M. Aguilera, and J. Li, “Transactional storage for geo-replicated systems,” in SOSP 2011.
[15] “The network time protocol,” http://www.ntp.org, 2012.
[16] B. Liskov, “Practical uses of synchronized clocks in distributed systems,” Distributed Computing, vol. 6, no. 4, pp. 211–219, 1993.
[17] A. Adya, “Weak consistency: a generalized theory and optimistic implementations for distributed transactions,” Ph.D. dissertation, Massachusetts Institute of Technology, 1999.
[18] S. Elnikety, F. Pedone, and W. Zwaenepoel, “Database replication using generalized snapshot isolation,” in SRDS 2005.
[19] K. Daudjee and K. Salem, “Lazy database replication with ordering guarantees,” in ICDE 2004.
[20] K. Krikellas, S. Elnikety, Z. Vagena, and O. Hodson, “Strongly consistent replication for a bargain,” in ICDE 2010.
[21] J. Gray and A. Reuter, Transaction Processing. Morgan Kaufmann, 1993.
[22] J. Gray, P. Helland, P. O’Neil, and D. Shasha, “The dangers of replication and a solution,” in SIGMOD 1996.
[23] J. Elson, L. Girod, and D. Estrin, “Fine-grained network time synchronization using reference broadcasts,” ACM SIGOPS Operating Systems Review, vol. 36, 2002.
[24] A. Silberstein, J. Terrace, B. Cooper, and R. Ramakrishnan, “Feeding frenzy: selectively materializing users’ event feeds,” in SIGMOD 2010.
[25] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, “Bigtable: a distributed storage system for structured data,” ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[26] B. Liskov, M. Castro, L. Shrira, and A. Adya, “Providing persistent objects in distributed systems,” in ECOOP 1999.
[27] J. Corbett, J. Dean et al., “Spanner: Google’s globally-distributed database,” in OSDI 2012.
[28] J. Cowling and B. Liskov, “Granola: low-overhead distributed transaction coordination,” in USENIX ATC 2012.
[29] M. Saeida Ardekani, P. Sutra, N. Preguiça, and M. Shapiro, “Non-monotonic snapshot isolation,” Tech. Rep. RR-7805, Jun. 2013.
[30] M. Saeida Ardekani, P. Sutra, and M. Shapiro, “Non-monotonic snapshot isolation: scalable and strong consistency for geo-replicated transactional systems,” in SRDS 2013.
[31] H. Guo, P. Larson, R. Ramakrishnan, and J. Goldstein, “Relaxed currency and consistency: how to say good enough in SQL,” in SIGMOD 2004.
[32] H. Guo, P. Larson, and R. Ramakrishnan, “Caching with good enough currency, consistency and completeness,” in VLDB 2005.
[33] P. Bernstein, A. Fekete, H. Guo, R. Ramakrishnan, and P. Tamma, “Relaxed-currency serializability for middle-tier caching and replication,” in SIGMOD 2006.
[34] H. Yu and A. Vahdat, “Design and evaluation of a continuous consistency model for replicated services,” in OSDI 2000.