Inferring a Serialization Order for Distributed Transactions∗

Inferring a Serialization Order for Distributed Transactions∗ Khuzaima Daudjee and Kenneth Salem School of Computer Science University of Waterloo Waterloo, Canada fkdaudjee, [email protected] Abstract Consider a distributed database system in which each site’s local concurrency control is rigorous and transactions Data partitioning is often used to scale-up a database are synchronized using a two-phase (2PC) commit proto- system. In a centralized database system, the serialization col. Our choice of 2PC stems from it being the most widely order of commited update transactions can be inferred from used protocol for coordinating transactions in distributed the database log. To achieve this in a shared-nothing dis- database systems [1]. Although serializability is guaranteed tributed database, the serialization order of update trans- at the cluster by the sites’ local concurrency controls and actions must be inferred from multiple database logs. We the 2PC protocol, the issue is how to determine the re- describe a technique to generate a single stream of updates sulting serialization order. The contribution of this paper from logs of multiple database systems. This single stream is a technique that determines a serialization order for dis- represents a valid serialization order of update transactions tributed update transactions that have executed in a parti- at the sites over which the database is partitioned. tioned database over multiple sites. Our technique merges log entries of each site into a single stream that represents a valid serialization order for update transactions. 1. Introduction 1.1. System Model In a centralized database system that guarantees com- 1 mitment ordering , the serialization order of transactions The database is partitioned over one or more sites. No that have committed at the single site can be inferred from restrictions are placed on how the database is partitioned. the database log. This can be achieved by using the log se- Transactions execute at the sites using a 2PC protocol quence numbers of transactions’ commit records. That is, [4, 7, 9]. Each transaction executes at one or more sites. One seq(T1) < seq(T2) if and only if T1 precedes T2 in the of these sites acts as the coordinator site for the transaction; local serialization order. This information can then be ex- other sites are participant sites. Different transactions may tracted from the database log using a standard mechanism have different coordinating sites. Each site is autonomous such as a log sniffer [5]. Since update transactions are pro- and maintains its own database log and recovery informa- cessed at the single site, this makes it relatively easy to de- tion about its local transactions. termine a serialization order for update transactions. In the 2PC protocol, the coordinator generates a prepare Data partitioning is often used to improve scalability by message, which is sent to all participant sites involved in the distributing a database over a number of sites in a shared- distributed transaction. The participant sites generate and nothing architecture. Each site in this cluster consists of an respond with an acknowledgement message, which we shall autonomous database system with a local concurrency con- call prepare-ack. After acknowledgement messages are re- troller which is rigorous [2] and ensures commitment or- ceived from all participant sites, the coordinator commits dering. Rigorousness [2] is enforced by well-known con- and generates a coord-commit message that is sent to all par- currency control algorithms such as strict two phase lock- ticipants. Each participant then either commits, generating ing in which no locks are released until commit. a commit message, or aborts the transaction. Since the 2PC protocol does not commit a transaction until all read/write ∗ University of Waterloo School of Computer Science Technical Report CS-2005-34 operations of the transaction have executed, and each local 1 Commitment ordering [10] ensures that transactions can be serialized concurrency control is rigorous, the cluster of sites guaran- in the order in which they commit. tees 1SR [2]. Each database log contains update, commit and abort SITE s1 SITE s2 records. Each transaction update log record describes an update executed by the corresponding update transaction. Both log: update(T1) update and commit log records of an update transaction con- log: update(T1) tain the transaction’s global transaction id (tid). If a commit log record is that of a transaction’s coordinator (a coord- commit record), it also contains the list of sites that are par- Lamport Clock (LC): 1 LC: 0 ticipating in the transaction. Update information can be extracted from the database logs using a standard mechanism such as a log sniffer [5]. LC: 1 The updates from each log are merged into a single stream using the algorithm described in Section 1.3. LC: 2 prepare−ack(T1), LC=1 log: coord−commit(T1), 1.2. Timestamps ts(T1)=2 participants={s1,s2} In this section, we describe the use of Lamport timestamps to capture causal relationships between updates at multiple sites. These timestamps, together with the up- commit(T1), LC=2 LC: 3 dates committed by transactions, are entered into each site’s log: commit(T1) database log. In the next section, we describe an algorithm ts(T1)=3 that uses these timestamps to merge the updates from the database logs of the sites. Figure 1. 2PC Message Exchange Each site maintains a Lamport clock [6], which is sim- ply a monotonically increasing integer counter. A Lamport timestamp is the value of a site’s Lamport clock at some par- round-robin fashion and writes this information to records 2 ticular time. When a 2PC participant generates a prepare- in a list called the record list, or r-list , at the log merger ack message, it increments its clock, and piggybacks the site. resulting timestamp onto the message. When the coordina- There is one r-list record per transaction. Each record tor has received all of the prepare-ack messages, it sets its in the r-list is uniquely identified by a global transaction Lamport clock to the maximum of its current value and the id (tid) and stores information from all commit and update values of the timestamps attached to the prepare-ack mes- records for that transaction from all sites’ database logs. sages, plus one. Assuming that the transaction commits, Each r-list record holds the minimum transaction times- this value becomes the transaction’s coord-commit timestamp, which is read from the transaction’s coord-commit tamp, and it is logged along with the transaction’s com- record. The r-list record also holds the site id of each log mit record at the coordinator. The coordinator also piggy- record read, and the participant list of the coord-commit backs the coord-commit timestamp onto the commit mes- record. Lastly, the updates of each update record with the sages it sends to the participants. When a participant re- same tid are kept in the r-list record. ceives a commit message, it sets its Lamport clock to the The log merger visits the database sites in a round-robin maximum of its current value and the coord-commit times- fashion. Pseudocode for the actions of the log merger when tamp, plus one. The resulting timestamp is also logged with it visits a site s is shown in Algorithm 1. The log merger the commit record at the participant. A simple example of can choose any size for the batch of records to read at site s message exchange using 2PC and the logging of records is since it does not matter for the correctness of the algorithm shown in Figure 1. (though it may impact the rate at which records are output). A site can be skipped by the log merger if that site does not 1.3. Algorithm have any log records available to be read. Step 14 ensures that a transaction’s r-list record will ulti- mately contain that transaction’s coord-commit timestamp, In this section, we describe our algorithm, which merges which is always the smallest timestamp from among all database log entries of sites into a single stream that repre- commit records of that transaction. sents a valid global serialization order for update transac- s Let ts(Ti ) be the commit timestamp of the subtransac- tions. 0 s tion of Ti running at site s. Let ts (T ) be the prepare-ack The log merging algorithm is executed by a log merger i process, which runs at one of the sites. The algorithm reads update and commit records from sites’ database logs in a 2 All fields of a newly created r-list record are initialized to null. Algorithm 1 Merging of Update Streams Case 1: The coordinators of T1 and T2 run at site s. Since s T1 conflicts with and precedes T2, T1 must be serialized s 1 read batch of log records from head of log site s before T2 . Since the local concurrency control at site s en- s s 2 for each record r in the batch sures commitment ordering, T1 commits before T2 . Since 3 do let t be r-list entry for which t.tid = r.tid the timestamping protocol increments timestamps on each s s 4 if no such t commit operation, ts(T1 ) < ts(T2 ) and thus ts(T1) < 5 then /* create and initialize */ ts(T2). 6 create t in r-list Case 2: T1’s coordinator runs at site s but T2’s does not. 7 t.tid = r.tid Since T1 conflicts with and precedes T2, and the local con- s 0 s 8 t.ts = 1 currency control at site s is rigorous, ts(T1 ) < ts (T2 ).

Inferring a Serialization Order for Distributed Transactions∗

An Efficient Concurrency Control Technique for Mobile Database Environment by Md

ACID Compliant Distributed Key-Value Store

Lecture 15: March 28 15.1 Overview 15.2 Transactions

Distributed Transactions

Effective Technique for Optimizing Timestamp Ordering in Read- Write/Write-Write Operations

On Ordering Transaction Commit

A Study of the Availability and Serializability in a Distributed Database System

An Alternative Model to Overcoming Two Phase Commit Blocking Problem

Transaction Management in the R* Distributed Database Management System

Implementing Distributed Transactions Distributed Transaction Distributed Database Systems ACID Properties Global Atomicity Atom

Distributed Transaction Processing: Reference Model, Version 3

Fast General Distributed Transactions with Opacity