<<

in Distributed Systems

PHILIP A. BERNSTEIN AND NATHAN GOODMAN Corporation of America, Cambridge, Massachusetts 02139

In this paper we survey, consolidate, and present the state of the art in concurrency control. The heart of our analysts is a decomposition of the concurrency control problem into two major subproblems: read-write and write-write synchronization. We describe a series of synchromzation techniques for solving each subproblem and show how to combine these techniques into algorithms for solving the entire concurrency control problem. Such algorithms are called "concurrency control methods." We describe 48 principal methods, including all practical algorithms that have appeared m the literature plus several new ones. We concentrate on the structure and correctness of concurrency control algorithms. Issues of performance are given only secondary treatment. Keywords and Phrases: concurrency control, , dtstnbuted database management systems, locking, senahzability, synchromzation, tunestamp ordering, timestamps, two- phase , two-phase locking CR Categories: 4.33, 4.35

INTRODUCTION Concurrency control has been actively investigated for the past several years, and The Concurrency Control Problem the problem for nondistributed DBMSs is Concurrency control is the activity of co- well understood. A broad mathematical ordinating concurrent accesses to a data- theory has been developed to analyze the base in a multiuser database management problem, and one approach, called two- system (DBMS). Concurrency control per- phase locking, has been accepted as a mits users to access a database in a multi- standard solution. Curre.nt research on non- programmed fashion while preserving the distributed concun'ency control is focused illusion that each user is executing alone on on evolutionary improvements to two- a dedicated system. The main technical phase locking, detailed performance analy- difficulty in attaining this goal is to prevent sis and optimization, and extensions to the database updates performed by one user mathematical theory. from interfering with database retrievals Distributed concurrency control, by con- and updates performed by another. The trast, is in a state of extreme turbulence. concurrency control problem is exacerbated More than 20 concurrency control algo- in a distributed DBMS (DDBMS) because rithms have been proposed for DDBMSs, (1) users may access data stored in many and several have been, or are being, imple- different in a distributed system, mented. These algorithms are usually com- and (2) a concurrency control mechanism plex, hard to understand, and difficult to at one computer cannot instantaneously prove correct (indeed, many are incorrect). know about interactions at other com- Because they are described in different ter- puters. minologies and make different assumptions

Permission to copy without fee all or part of this material m granted provided that the coples are not made or distributedfor directcommercial advantage, the ACM copyright notice and the titleof the publicationand its date appear, and notice is given that copying is by perrmssion of the Associationfor Computing Machinery. To copy otherwise,or to republish,requires a fee and/or specificpermission. © 1981 ACM 0010-4892/81/0600-0185 $00.75

Computing Surveys, Vol. 13, No. 2, June 1981 186 • P. A. Bernstein and N. Goodman CONTENTS cry concurrency control algorithm must in- clude a subalgorithm to solve each subprob- lem. The first step toward understanding a concurrency control algorithm is to isolate INTRODUCTION the subalgorithm employed for each sub- The Concurrency Control Problem problem. Examples of Concurrency Control Anomalies After studying the large number of pro- Comparison to Mutual Exclnslon Problems posed algorithms, we find that they are 1. TRANSACTION-PROCESSING MODEL 1.1 Prelmunary Defimtmns and DDBMS Archi- compositions of only a few subalgorithms. tecture In fact, the subalgorithms used by all prac- 1.2 Centrahzed Transactmn-Processmg Model tical DDBMS concurrency control algo- 1.3 Dmmbuted Transactmn-Processing Model rithms are variations of just two basic tech- 2 DECOMPOSITION OF THE CONCUR- niques: two-phase locking and timestamp RENCY CONTROL PROBLEM 2 1 Selaallzabfllty ordering; thus the state of the art is far 2.2 A Parachgm for Concurrency Control more coherent than a review of the litera- 3. SYNCHRONIZATION TECHNIQUES ture would seem to indicate. BASED ON TWO-PHASE LOCKING 3.1 Basra 2PL Implementation Examples of Concurrency Control Anomalies 3.2 Primary Copy 2PL 3.3 Voting 2PL The goal of concurrency control is to pre- 3.4 Centrahzed 2PL vent interference among users who are si- 3.5 Deadlock Detection and Prevention 4 SYNCHRONIZATION TECHNIQUES multaneously accessing a database. Let us BASED ON TIMESTAMP ORDERING illustrate the problem by presenting two 4.1 Basic T/O Implementatmn "canonical" examples of interuser interfer- 4.2 The Thomas Write Rule ence. Both are examples of an on-line 4.3 MulUversion T/O 4.4 Conservative T/O electronic funds transfer system accessed 4 5 Tnnestamp Management via remote automated teller machines 5 INTEGRATED CONCURRENCY CONTROL (ATMs). In response to customer requests, METHODS ATMs retrieve data from a database, per- 5 1 Pure 2PL Methods form computations, and store results back 5.2 Pure T/O Methods 5.3 MLxed 2PL and T/O Methods into the database. 6. CONCLUSION Anomaly 1: Lost Updates. Suppose two APPENDIX. OTHER CONCURRENCY CON- customers simultaneously try to deposit TROL METHODS AI. Certifiers money into the same account. In the ab- A2. Thomas' MaJority Consensus Algorithm sence of concurrency control, these two ac- A3. Ellis' Ring Algorithm tivities could interfere (see Figure 1). The ACKNOWLEDGMENT two ATMs handling the two customers REFERENCES could read the account balance at approxi- mately the same time, compute new bal- v ances in parallel, and then store the new about the underlying DDBMS environ- balances back into the database. The net ment, it is difficult to compare the many effect is incorrect: although two customers proposed algorithms, even in qualitative deposited money, the database only reflects terms. Naturally each author proclaims his one activity; the other deposit is lost by the or her approach as best, but there is little system. compelling evidence to support the claims. Anomaly 2: Inconsistent Retrievals. Suppose two customers simultaneously ex- To survey the state of the art, we intro- ecute the following transactions. duce a standard terminology for describing DDBMS concurrency control algorithms Customer 1: Move $1,000,000 from Acme and a standard model for the DDBMS en- Corporation's savings ac- vironment. For analysis purposes we de- count to its checking account. compose the concurrency control problem Customer 2: Print Acme Corporation's into two major subproblems, called read- total balance in savings and write and write-write synchronization. Ev- checking.

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems 187

Execut,on of T I Dotobose Execution of T2

READ bolonce I I, 00000 ] 0,000e

Add ~I,000,000 $1,500,000 [ J $2~500,000 ] Add $21000,000

WRITE result bock to dotobose bock to dotobose

Figure 1. Lostupdate anomaly.

In the absence of concurrency control access to shared resources. However, con- these two transactions could interfere (see trol schemes that work for one do not nec- Figure 2). The first transaction might read essarily work for the other, as illustrated by the savings account balance, subtract the following example. Suppose processes $1,000,000, and store the result back in the P1 and P2 require access to resources R1 database. Then the second transaction and R2 at different points in their execution. might read the savings and checking ac- In an , the following inter- count balances and print the total. Then leaved execution of these processes is per- the first transaction might finish the funds fectly acceptable: P1 uses R1, P2 uses R~, Pe transfer by reading the checking account uses R2, P1 uses R2. In a database, however, balance, adding $1,000,000, and finally stor- this execution is not always acceptable. As- ing the result in the database. Unlike sume, for example, that P2 transfers funds Anomaly 1, the final values placed into the by debiting one account (RI), then crediting database by this execution are correct. Still, another (R2). If P2 checks both balances, it the execution is incorrect because the bal- will see R~ after it has been debited, but see ance printed by Customer 2 is $1,000,000 R2 before it has been credited. Other differ- short. ences between concurrency control and mu- These two examples do not exhaust all tual exclusion are discussed in CHAM74. possible ways in which concurrent users can interfere. However, these examples are typical of the concurrency control problems 1. TRANSACTION-PROCESSING MODEL that arise in DBMSs. To understand how a concurrency control algorithm operates, one must understand Comparison to Problems how the algorithm fits into an overall The problem of database concurrency con- DDBMS. In this section we present a sim- trol is similar in some respects to that of ple model of a DDBMS, emphasizing how mutual exclusion in operating systems. The the DDBMS processes user interactions. latter problem is concerned with coordinat- Later we explain how concurrency control ing access by concurrent processes to sys- algorithms operate in the context of this tem resources such as memory, I/O devices, model. and CPU. Many solution techniques have been developed, including locks, sema- 1.1 Preliminary Definitions and DDBMS phores, monitors, and serializers [BRIN73, Architecture DIJK71, HEWI74, HOAR74]. The concurrency control and mutual ex- A distributed database management sys- clusion problems are similar in that both tem (DDBMS) is a collection of sites in- are concerned with controlling concurrent terconnected by a network [DEPP76, ComputingSurveys, Vol 13, No. 2, June 1981 188 • P. A. Bernstein and N. Goodman

Execut,on of T1 Dolobose Execution of T2

READ sowngs bolonce 1,2,ooo,oool Subtroct $1,00O,OOO

WRITE result I .,ooo,oooI

READ checking bolonce READ sov,ngs bolonce Add $1,OOO,OOO 1,5oo,ooo. f oo.o..oooI READ check,ng bolonce 5Ol~,0OO WRITE result $1.500,000 ] l sum " Print Sum \ St,SOD,ODD J

Figure 2. Inconsistent retrieval anomaly.

ROTH77]. Each site is a computer running logical data item is called a stored data one or both of the following mod- item. (When no confusion is possible, we ules: a transaction manager (TM) or a data use the term data item for stored data manager (DM). TMs supervise interactions item.) The stored copies of logical data item between users and the DDBMS while DMs X are denoted xl ..... Xm. We typically use manage the actual database. A network is x to denote an arbitrary stored data item. a computer-to-computer communication A stored database state is an assignment system. The network is assumed to be per- of values to the stored data items in a fectly reliable: if site A sends a message to database. site B, site B is guaranteed to receive the Users interact with the DDBMS by exe- message without error. In addition, we as- cuting transactions. Transactions may be sume that between any pair of sites the on-line queries expressed in a self-contained network delivers messages in the order they , or application programs were sent. written in a general-purpose programming From a user's perspective, a database language. The concurrency control algo- consists of a collection of logical data rithms we study pay no attention to the items, denoted X, Y, Z. We leave the gran- computations performed by transactions. ularity of logical data items unspecified; in Instead, these algorithms make all of their practice, they may be files, records, etc. A decisions on the basis of the data items a logical database state is an assignment of transaction reads and writes, and so details values to the logical data items composing of the form of transactions are unimportant a database. Each logical data item may be in our analysis. However we do assume that stored at any DM in the system or redun- transactions represent complete and cor- dantly at several DMs. A stored copy of a rect computations; each transaction, if ex-

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems ° 189

tronsoctton,

transoctlon

tronsactlon

tronsoctlon /X\

tronsoctlon

tronsoctlon

Figure 3. DDBMS system architecture. ecuted alone on an initially consistent da- DMs manage the data. (TMs do not com- tabase, would terminate, produce correct municate with other TMs, nor do DMs results, and leave the database consistent. communicate with other DMs.) The logical readset (correspondingly, TMs supervise transactions. Each trans- writeset) of a transaction is the set of logical action executed in the DDBMS is super- data items the transaction reads (or writes). vised by a single TM, meaning that the Similarly, stored readsets and stored transaction issues all of its database oper- writesets are the stored data items that a ations to that TM. Any distributed com- transaction reads and writes. putation that is needed to execute the The correctness of a concurrency control transaction is managed by the TM. algorithm is defined relative to users' ex- Four operations are defined at the trans- pectations regarding transaction execution. action-TM interface. READ(X) returns There are two correctness criteria: (1) users the value of X (a logical data item) in the expect that each transaction submitted to current logical database state. WRITE(X, the system will eventually be executed; (2) new-value) creates a new logical database users expect the computation performed by state in which X has the specified new each transaction to be the same whether it value. Since transactions are assumed to executes alone in a dedicated system or in represent complete computations, we use parallel with other transactions in a multi- BEGIN and END operations to bracket programmed system. Realizing this expec- transaction executions. tation is the principal issue in concurrency DMs manage the stored database, func- control. tioning as backend database processors. In A DDBMS contains four components response to commands from transactions, (see Figure 3): transactions, TMs, DMs, TMs issue commands to DMs specifying and data. Transactions communicate with stored data items to be read or written. The TMs, TMs communicate with DMs, and details of the TM-DM interface constitute

Computing Surveys, Vol. 13, No. 2, June 1981 190 • P. A. Bernstein and N. Goodman the core of our transaction-processing A DBMS can avoid such partial results model and are discussed in Sections 1.2 and by having the property of atomic commit- 1.3. Section 1.2 describes the TM-DM in- ment, which requires that either all of a teraction in a centralized database environ- transaction's din-writes are processed or ment, and Section 1.3 extends the discus- none are. The "standard" implementation sion to a distributed database setting. of atomic commitment is a procedure called two-phase commit [LAMP76, GRAY78].1 1.2 Centralized Transaction-Processing Suppose T is updating data items X and Y. Model When T issues its END, the first phase of two-phase commit begins, during which the A centralized DBMS consists of one TM DM issues prewrite commands for X and and one DM executing at one site. A trans- Y. These commands instruct the DM to action T accesses the DBMS by issuing copy the values of X and Y from T's private BEGIN, READ, WRITE, and END oper- workspace onto secure storage. If the ations, which are processed as follows. DBMS fails during the first phase, no harm BEGIN: The TM initializes for T a pri- is done, since none of T's updates have yet vate workspace that functions as a tempo- been applied to the stored database. During rary buffer for values read from and written the second phase, the TM issues din-write into the database. commands for X and Y which instruct the READ(X): The TM looks for a copy of DM to copy the values of X and Y into the X in T's private workspace. If the copy stored database. If the DBMS fails during exists, its value is returned to T. Otherwise the second phase, the database may contain the TM issues din-read(x) to the DM to incorrect information, but since the values retrieve a copy of X from the database, of X and Y are stored on secure storage, gives the retrieved value to T, and puts it this inconsistency can be rectified when the into T's private workspace. system recovers: the recovery procedure WRITE(X, new-value): The TM again reads the values of X and Y from secure checks the private workspace for a copy of storage and resumes the commitment activ- X. If it finds one, the value is updated to ity. new-value; otherwise a copy of X with the We emphasize that this is a mathemati- new value is created in the workspace. The cal model of , an ap- new value of X is not stored in the database proximation to the way DBMSs actually at this time. function. While the implementation details END: The TM issues dm-write(x) for of atomic commitment are important in each logical data item X updated by T. designing a DBMS, they are not central to Each dm-write(x) requests that the DM an understanding of concurrency control. update the value of X in the stored database To explain concurrency control algorithms to the value of X in T's local workspace. we need a model of transaction execution When all dm-writes are processed, T is in which atomic commitment is visible, but finished executing, and its private work- not dominant. space is discarded. 1.3 -Processing The DBMS may restart T any time be- Model fore a din-write has been processed. The effect of restarting T is to obliterate its Our model of transaction processing in a private workspace and to reexecute T from distributed environment differs from that the beginning. As we will see, many concur- in a centralized one in two areas: handling rency control algorithms use transaction private workspaces and implementing two- restarts as a tactic for attaining correct phase commit. executions. However, once a single dm- write has been processed, T cannot be re- started; each dm-write permanently installs The term "two-phase commit" is commonly used to denote the distributed version of this procedure. How- an update into the database, and we cannot ever, since the centralized and distributed versions are permit the database to reflect partial effects identical in structure, we use "two-phase commit" to of transactions. describe both.

Computing Surveys, Vol 13, No. 2, June 1981 Concurrency Control in Database Systems • 191 In a centralized DBMS we assumed that workspace to see if a copy of X is present. (1) private workspaces were part of the TM, If so, that copy's value is made available to and (2) data could freely move between a T. Otherwise the TM selects some stored transaction and its workspace, and between copy of X, say xi, and issues din-read(x,) to a workspace and the DM. These assump- the DM at which x, is stored. The DM tions are not appropriate in a DDBMS responds by retrieving the stored value of because TMs and DMs may run at different x, from the database, placing it in the pri- sites and the movement of data between a vate workspace. The TM returns this value TM and a DM can be expensive. To reduce to T. this cost, many DDBMSs employ query WRITE(X, new-value): The value of X in optimization procedures which regulate T's private workspace is updated to new- (and, it is hoped, reduce) the flow of data value, assuming the workspace contains a between sites. For example, in SDD-1 the copy of X. Otherwise, a copy of X with the private workspace for transaction T is dis- new value is created in the workspace. tributed across all sites at which T accesses END: Two-phase commit begins. For data [BF.RN81]. The details of how T reads each X updated by T, and for each stored and writes data in these workspaces is a copy x, of X, the TM issues a prewrite (x,) problem and has no di- to the DM that stores x,. The DM responds rect effect on concurrency control. by copying the value of X from T's private The problem of atomic commitment is workspace onto secure storage internal to aggravated in a DDBMS by the possibility the DM. After all prewrites are processed, of one site failing while the rest of the the TM issues dm-writes for all copies of all system continues to operate. Suppose T is logical data items updated by T. A DM updating x, y, z stored at DMx, DMy, DMz, responds to dm-write(x,) by copying the and suppose T's TM fails after issuing dm- value of x, from secure storage into the write(x), but before issuing the dm-writes stored database. After all dm-writes are for y and z. At this point the database is installed, T's execution is finished. incorrect. In a centralized DBMS this phe- nomenon is not harmful because no trans- 2. DECOMPOSITION OF THE CONCUR- action can access the database until the RENCY CONTROL PROBLEM TM recovers from the failure. However, in a DDBMS, other TMs remain operational In this section we review concurrency con- and can access the incorrect database. trol theory with two objectives: to define To avoid this problem, prewrite com- "correct executions" in precise terms, and mands must be modified slightly. In addi- to decompose the concurrency control tion to specifying data items to be copied problem into more tractable subproblems. onto secure storage, prewrites also specify which other DMs are involved in the com- 2.1 mitment activity. Then if the TM fails dur- ing the second phase of two-phase commit, Let E denote an execution of transactions the DMs whose dm-writes were not issued T1 ..... T,. E is a serial execution if no transactions execute concurrently in E; that can recognize the situation and consult the other DMs involved in the commitment. If is, each transaction is executed to comple- any DM received a dm-write, the remaining tion before the next one begins. Every serial ones act as if they had also received the execution is defined to be correct, because command. The details of this procedure are the properties of transactions (see Section complex and appear in HAMM80. 1.1) imply that a serial execution terminates properly and preserves database consist- As in a centralized DBMS, a transaction T accesses the system by issuing BEGIN, ency. An execution is serializable if it is READ, WRITE, and END operations. In computationally equivalent to a serial exe- a DDBMS these are processed as follows. cution, that is, if it produces the same out- put and has the same effect on the database BEGIN: The TM creates a private work- as some serial execution. Since serial exe- space for T. We leave the location and cutions are correct and every serializable organization of this workspace unspecified. execution is equivalent to a serial one, every READ(X): The TM checks T's private serializable execution is also correct. The Computing Surveys, Vol. 13, No. 2, June 1981 192 P. A. Bernstein and N. Goodman Transachons Database

T 1 • BEGIN; i----n ~ READ (X); WRITE(Y); END

T2 BEGIN; READ(Y), WRITE(Z); END

T 3 . BEGIN, READ(Z), WRITE(X), END

One possible execution of T1, T2, and T3 is represented by the following logs. (Note. r,[x] denotes the operation din-read(x) issued by T~; w,[x] denotes a din-write(x) issued by T,.) Log for DM A: rl[xl]wl[yl]r2[yl]w3[xl] Log for DM B: wl[y2]w2[z2] Log for DM C. w2[z3]r3[z3]

Figure 4. Modeling executions as logs.

• The execution modeled in Figure 4 is serial. Each the total order, then all of T,'s operations log is itselfserial; that is, there is no interleaving of precede all of Tfs operations in every log operations from differenttransactions. At DM A, Ti where both appear (see Figure 5). Intui- precedes T~ precedes T3; at DM B, % precedes T~; tively, this says that transactions execute and at DM C, T2 precedes T3. Therefore, TI, T2, T3 serially and in the same order at all DMs. is a total order satisfying the definition of serial. • The following execution is not serial.The logs them- Two operations conflict if they operate selves are not serial. on the same data item and one of the op- erations is a dm-write. The order in which DM A: rl[xl]r2[YllW3[Xl]Wl[ yl] DM B: w2[z2]wl[y2] operations execute is computationally sig- DM C: w2[z3lr3[z3] nificant if and only if the operations con- flict. To illustrate the notion of conflict, • The following execution is also not serial Although consider a data item x and transactions T, each log is serial, there is no total order consistent with all logs. and Tj. If T, issues dm-read (x) and T~ issues dm-write(x), the value read by T, will DM A: rl[x~]wl[yl]re[yl]w3[x~] (in general) differ depending on whether DM B: w2[z2]wl[y2] the dm-read precedes or follows the dm- DM C: w2[z3]r3[z3] write. Similarly, if both transactions issue Figure 5. Serial and nonserial loops. dm-write(x) operations, the final value of x depends on which dm-write happens last. goal of database concurrency control is to Those conflict situations are called read- ensure that all executions are serializable. write (rw) conflicts and write-write (ww) The only operations that access the conflicts, respectively. stored database are din-read and din-write. The notion of conflict helps characterize Hence it is sufficient to model an execution the equivalence of executions. Two execu- of transactions by the execution of din- tions are computationally equivalent if (1) reads and din-writes at the various DMs of each dm-read operation reads data item the DDBMS. In this spirit we formally values that were produced by the same dm- model an execution of transactions by a set writes in both executions; and (2) the final of logs, each of which indicates the order in dm-write on each data item is the same in which dm-reads and din-writes are proc- both executions [PAPA77, PAPA79]. Condi- essed at one DM (see Figure 4). An execu- tion (1) ensures that each transaction reads tion is serial if there is a total order of the same input in both executions (and transactions such that if T, precedes Tj in therefore performs the same computation).

Computing Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems • 193

Combined with (2), it ensures that both (3) T, --,ww Tj if in some log of E, T, writes executions leave the database in the same into some data item into which T~ sub- final state. sequently writes; From this we can characterize serializa- (4) T, --,~w~ Tj if T, -*~ T~ or T, --*w~ Tj; ble executions precisely. (5) T~ --* Tj if Tj --*~ T~ or T~ --*ww % Theorem 1 [PAPA77, PAPA79, STEA76] Intuitively, -* (with any subscript) Let T ffi (T1, ..., Tin} be a set of transac- means "in any serialization must precede." tions and let E be an execution of these For example, T, --*~w Tj means "T, in any transactions modeled by logs (Lb .... serialization must precede Tj." This inter- Lm}. E is serializable if there exists a total pretation follows from Theorem 1: If T, ordering of T such that for each pair of reads x before Tj writes into x, then the conflicting operations O~ and Oj from dis- hypothetical serialization in Theorem 1 tinct transactions T, and Tj (respectively), must have T, preceding T~. O~ precedes Oj in any log L~..... Lm if and Every conflict between operations in E is only if T~ precedes T~ in the total ordering. represented by an --, relationship. There- The total order hypothesized in Theorem fore, we can restate Theorem 1 in terms of 1 is called a serialization order. If the --,. According to Theorem 1, E is serializa- transactions had executed serially in the ble if there is a total order of transactions serialization order, the computation per- that is consistent with -*. This latter con- formed by the transactions would have dition holds if and only if --, is acyclic. (A been identical to the computation repre- , --*, is acyclic if there is no sequence sented by E. T1 -* T2, T2 --* Ta ..... Tn-1 --* Tn such that To attain serializability, the DDBMS T1 ffi T~.) Let us decompose --, into its must guarantee that all executions satisfy components, --*rwr and--* ww, and restate the the condition of Theorem 1, namely, that theorem using them. conflicting dm-reads and dm-writes be processed in certain relative orders. Con- currency control is the activity of control- Theorem 2 [BERNSOa] ling the relative order of conflicting opera- tions; an algorithm to perform such control Let "-'>rwr and ---,ww be associated with exe- is called a synchronization technique. To cution E. E is serializable if (a) -'*rwr and be correct, a DDBMS must incorporate "-'>w~ are acyclic, and (b) there is a total synchronization techniques that guarantee ordering of the transactions consistent the conditions of Theorem 1. with all --~ and all ---~w relationships.

2.2 A Paradigm for Concurrency Control Theorem 2 is an immediate consequence of Theorem 1. (Indeed, part (b) of Theorem In Theorem 1, rw and ww conflicts are 2 is essentially a restatement of the earlier treated together under the general notion theorem.) However, this way of character- of conflict. However, we can decompose the izing serializability suggests a way of de- concept of serializability by distinguishing composing the problem into simpler parts. these two types of conflict. Let E be an Theorem 2 implies that rw and ww conflicts execution modeled by a set of logs. We can be synchronized independently except define several binary relations on transac- insofar as there must be a total ordering of tions in E, denoted by -* with various sub- the transactions consistent with both types scripts. For each pair of transactions, T~ of conflicts. This suggests that we can use and Tj one technique to guarantee an acyclic (1) T~ --*~w Tj if in some log of E, T, reads --*~w~relation (which amounts to read-write some data item into which T~ subse- synchronization) and a different technique quently writes; to guarantee an acyclic --*~,~ relation (2) T~ --*~ T~ if in some log of E, T, writes (write-write synchronization). However, in into some data item that Tj subse- addition to both --.~ and -*ww being quently reads; acyclic, there must also be one serial order

Computing Surveys, Vol. 13, No. 2, June 1981 194 • P.A. Bernstein and N. Goodman consistent with all--, relations. This serial require that transactions hold all locks until order is the cement that binds together the termination rw and ww synchronization techniques. Two-phase locking is a correct synchro- Decomposing serializability into rw and nization technique, meaning that 2PL ww synchronization is the cornerstone of attains an acyclic --*~ (--*~) relation our paradigm for concurrency control. It when used for rw (ww) synchronization will be important hereafter to distinguish [BERs79b, EswA76, PAPA79]. The seriali- algorithms that attain either rw or ww syn- zation order attained by 2PL is determined chronization from algorithms that solve the by the order in which transactions obtain entire distributed concurrency control locks. The point at the end of the growing problem. We use the term synchronization phase, when a transaction owns all the locks technique for the former type of algorithm, it ever will own, is called the locked point and concurrency control method for the of the transaction [BERN79b]. Let E be an latter. execution in which 2PL is used for rw (ww) synchronization. The --*~ (--*~) relation induced by E is identical to the relation 3. SYNCHRONIZATION TECHNIQUES induced by a serial execution E' in which BASED ON TWO-PHASE LOCKING every transaction executes at its locked Two-phase locking (2PL) synchronizes point. Thus the locked points of E deter- reads and writes by explicitly detecting and mine a serialization order for E. preventing conflicts between concurrent 3.1 Basic 2PL Implementation operations. Before reading data item x, a transaction must "own" a readlock on An implementation of 2PL amounts to x. Before writing into x, it must "own" a building a 2PL scheduler, a software mod- writelock on x. The ownership of locks is ule that receives requests and lock governed by two rules: (1) different trans- releases and processes them according to actions cannot simultaneously own con- the 2PL specification. flicting locks; and (2) once a transaction The basic way to implement 2PL in a surrenders ownership of a lock, it may never distributed database is to distribute the obtain additional locks. schedulers along with the database, placing The definition of conflicting lock de- the scheduler for data item x at the DM pends on the type of synchronization being were x is stored. In this implementation performed: for rw synchronization two readlocks may be implicitly requested by locks conflict if (a) both are locks on the din-reads and writelocks may be implicitly same data item, and (b) one is a readlock requested by prewrites. If the requested and the other is a writelock; for ww syn- lock cannot be granted, the operation is chronization two locks conflict if (a) both placed on a waiting queue for the desired are locks on the same data item, and (b) data item. (This can produce a deadlock, both are writelocks. as discussed in Section 3.5.) Writelocks are The second lock ownership rule causes implicitly released by din-writes. However, every transaction to obtain locks in a two- to release readlocks, special lock-release op- phase manner. During the growing phase erations are required. These lock releases the transaction obtains locks without re- may be transmitted in parallel with the din- leasing any locks. By releasing a lock the writes, since the dm-writes signal the start transaction enters the shrinking phase. of the shrinking phase. When a lock is During this phase the transaction releases released, the operations on the waiting locks, and, by rule 2, is prohibited from queue of that data item are processed first- obtaining additional locks. When the trans- in/first-out (FIFO) order. action terminates (or aborts), all remaining Notice that this implementation "auto- locks are automatically released. matically" handles redundant data cor- A common variation is to require that rectly. Suppose logical data item X has transactions obtain all locks before begin- copies xl, ..., xm. If basic 2PL is used for ning their main execution. This variation is rw synchronization, a transaction may read called predeclaration. Some systems also any copy and need only obtain a readlock

Coraputmg Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems • 195 on the copy of X it actually reads. However, receives acknowledgments from the DMs, if a transaction updates X, then it must it counts the number of"lock~set" responses: update all copies of X, and so must obtain if the number constitutes a majority, then writelocks on all copies of X (whether basic the TM behaves as if all locks were set. 2PL is used for rw or ww synchronization). Otherwise, it waits for "lockset" operations from DMs that originally said "lock 32 Primary Copy 2PL blocked." aside (see Section 3.5), it will eventually receive enough "lockset" Primary copy 2PL is a 2PL technique that operations to proceed. pays attention to data redundancy Since only one transaction can hold a [STos79]. One copy of each logical data majority of locks on X at a time, only one item is designated the primary copy; before transaction writing into X can be in its accessing any copy of the logical data item, second commit phase at any time. All cop- the appropriate lock must be obtained on ies of X thereby have the same sequence of the primary copy. writes applied to them. A transaction's For readlocks this technique requires locked point occurs when it has obtained a more communication than basic 2PL. Sup- majority of its writelocks on each data item pose xl is the primary copy of logical data in its writeset. When updating many data item X, and suppose transaction T wishes items, a transaction must obtain a majority to read some other copy, x,, of X. To read of locks on every data item before it issues x,, T must communicate with two DMs, the any dm-writes. DM where Xs is stored (so T can lock xl) In principle, voting 2PL could be adapted and the DM where x, is stored. By contrast, for rw synchronization. Before reading any under basic 2PL, T would only communi- copy of X a transaction requests readlocks cate with x,'s DM. For writelocks, however, on all copies of X; when a majority of locks primary copy 2PL does not incur extra com- are set, the transaction may read any copy. munication. Suppose T wishes to update X. This technique works but is overly strong: Under basic 2PL, T would issue prewrites Correctness only requires that a single copy to all copies of X (thereby requesting of X be locked--namely, the copy that is writelocks on these data items) and then read--yet this technique requests locks on issue dm-writes to all copies. Under pri- all copies. For this reason we deem voting mary copy 2PL the same operations would 2PL to be inappropriate for rw synchroni- be required, but only the prewrite (Xl) zation. would request a writelock. That is, pre- writes would be sent for xl, ..., xm, but the prewrites for x2..... xm would not implicitly 3.4 Centralized 2PL request writelocks. Instead of distributing the 2PL schedulers, one can centralize the scheduler at a single 3.3 Voting 2PL site [ALsB76a, GARC79a]. Before accessing Voting 2PL (or majority consensus 2PL) is data at any site, appropriate locks must be another 2PL implementation that exploits obtained from the central 2PL scheduler. data redundancy. Voting 2PL is derived So, for example, to perform dm-read(x) from the majority consensus technique of where x is not stored at the central site, the Thomas [THOM79] and is only suitable for TM must first request a readlock on x from ww synchronization. the central site, walt for the central site to To understand voting, we must examine acknowledge that the lock has been set, it in the context of two-phase commit. Sup- then send dm-read(x) to the DM that holds pose transaction T wants to write into X. x. (To save some communication, one can Its TM sends prewrites to each DM holding have the TM send both the lock request a copy of X. For the voting protocol, the and dm-read (x) to the central site and let DM always responds immediately. It ac- the central site directly forward dm-read(x) knowledges receipt of the prewrite and says to x's DM; the DM then responds to the "lock set" or "lock blocked." (In the basic TM when dm-read (x) has been processed.) implementation it would not acknowledge Like primary copy 2PL, this approach tends at all until the lock is set.) After the TM to require more communication than basic Computing Surveys, Vol. 13, No. 2, June 1981 196 P. A. Bernstein and N. Goodman

Tronsochons Datobose

T 1 : BEGIN; r---~ t<:;:~l READ (X); WRITE(Y); END

T2 BEGIN; READ(Y); WRITE(Z); END

T 3 , BEGIN, READ(Z), WRITE(X), END

• Suppose transactions execute concurrently, with each transaction issuing its READ before any transaction issues its END. • This partial execution could be represented by the following logs DM A: rl [xl] DM B: r~[y2] DM C: r3[z3] • At this point, T~ has readlock on xx T2 has readlock on y2 T3 has readlock on z3 • Before proceeding, all transactions must obtain wntelocks. % requires wntelocks on y~ and ye T2 requires writelocks on z2 and z3 T3 requires writelock on Xl • But % cannot get writelock on y2, until T2 releases readlock T~ cannot get writelock on z3, until T3 releases readlock Ts cannot get wntelock on x~, until Tx releases readlock This is a deadlock

Figure 6. Deadlock.

2PL, since dm-reads and prewrites usually Two general techniques are available for cannot implicitly request locks. deadlock resolution: deadlock prevention and deadlock detection. 3.5 Deadlock Detection and Prevention 3.5.1 Deadlock Prevention The preceding implementations of 2PL force transactions to wait for unavailable Deadlock prevention is a "cautious" locks. If this waiting is uncontrolled, dead- scheme in which a transaction is restarted locks can arise (see Figure 6). when the system is "afraid" that deadlock Deadlock situations can be characterized might occur. To implement deadlock pre- by waits-for graphs [HOLT72, KING74], di- vention, 2PL schedulers are modified as rected graphs that indicate which transac- follows. When a lock request is denied, the tions are waiting for which other transac- scheduler tests the requesting transaction tions. Nodes of the graph represent trans- (say T,) and the transaction that currently actions, and edges represent the "waiting- owns the lock (say T~). If T, and Tj pass the for" relationship: an edge is drawn from test, T, is permitted to wait for T~ as usual. transaction T, to transaction Tj if T, is Otherwise, one of the two is aborted. If T, waiting for a lock currently owned by T~. is restarted, the deadlock prevention algo- There is a deadlock in the system if and rithm is called nonpreemptive; if T~ is re- only if the waits-for graph contains a cycle started, the algorithm is called preemptive. (see Figure 7). The test applied by the scheduler must

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems • 197

T 1 must walt for T2 to release read-lock on Y2 T~.'~'r a

T3 must wa,t for Tl tO ~ /1"2 must wa,t forT3tO release read-lock on x 1 - \T5 release read-look on Z3

Figure 7. Waits-for graph for Figure 6. guarantee that if T, waits for Tj, then dead- Two timestamp-based deadlock preven- lock cannot result. One simple approach is tion schemes are proposed in Rasp,78. never to let T~ wait for Tj. This trivially Wait-Die is the nonpreemptive technique. prevents deadlock but forces many restarts. Suppose transaction T, tries to wait for T~. A better approach is to assign priorities If T, has lower priority than T~ (i.e., T, is to transactions and to test priorities to de- younger than T~), then T, is permitted to cide whether T, can wait for Tj. For exam- wait. Otherwise, it is aborted ("dies") and ple, we could let T, wait for Tj if T, has forced to restart. It is important that T, not lower priority than Tj (if T~ and Tj have be assigned a new timestamp when it re- equal priorities, T, cannot wait for Tj, or starts. Wound.Wait is the preemptive vice versa). This test prevents deadlock counterpart to Wait-Die. If T, has higher because, for every edge (T, Tj) in the waits- priority than Tj, then T, waits; otherwise Tj for graph, T, has lower priority than Tj. is aborted. Since a cycle is a path from a node to itself Both Wait-Die and Wound-Wait avoid and since T, cannot have lower priority cyclic restart. However, in Wound-Wait an thCan itself, no cycle can exist. old transaction may be restarted many One problem with the preceding ap- times, while in Wait-Die old transactions proach is that cyclic restart is possible-- never restart. It is suggested in RosE78 that some unfortunate transaction could be con- Wound-Wait induces fewer restarts in total. tinually restarted without ever finishing. To Care must be exercised in using preemp- avoid this problem, Rosenkrantz et al. tive deadlock prevention with two-phase [RosE78] propose using "timestamps" as commit: a transaction must not be aborted priorities. Intuitively, a transaction's time- once the second phase of two-phase commit stamp is the time at which it begins execut- has begun. If a preemptive technique ing, so old transactions have higher priority wishes to abort Tj, it checks with Tfs TM than young ones. and cancels the abort if Tj has entered the The technique of Ros~.78 requires that second phase. No deadlock can result be- each transaction be assigned a unique cause if Tj is in the second phase, it cannot timestamp by its TM. When a transaction be waiting for any transactions. begins, the TM reads the local clock time Preordering of resources is a deadlock and appends a unique TM identifier to the avoidance technique that avoids restarts low-order bits [THOM79]. The resulting altogether. This technique requires prede- number is the desired timestamp. The TM claration of locks (each transaction obtains also agrees not to assign another timestamp all its locks before execution). Data items until the next clock tick. Thus timestamps are numbered and each transaction re- assigned by different TMs differ in their quests locks one at a time in numeric order. low-order bits (since different TMs have The priority of a transaction is the number different identifiers), while timestamps as- of the highest numbered lock it owns. Since signed by the same TM differ in their high- a transaction can only wait for transactions order bits (since the TM does not use the with higher priority, no deadlocks can oc- same clock time twice). Hence timestamps cur. In addition to requiring predeclaration, are unique throughout the system. Note a principal disadvantage of this technique that this algorithm does not require clocks is that it forces locks to be obtained sequen- at different sites to be precisely synchro- tially, which tends to increase response nized. time.

Computing Surveys, Vol. 13, No. 2, June 1981 198 • P. A. Bernstein and N. Goodman

• Consider the execution illustrated in Figures 6 and 7. • Locks are requested at DMs in the following order:

DM A DM B DM C

readlock xl for T1 readlock y2 for T2 readlock z3 for % writelock yl for T~ writelock z2 for T2 *writelock x~ for T3 *writelock y2 for T1 *writelock z3 for T2 • None of the "starred" locks can be granted and the system is in deadlock. However, the waits-for graphs at each DM are acyclic. DM A DM B DM C ® ,(9 © ® @ ,@ Figure 8. Multisite deadlock.

3.5.2 Deadlock Detection wide waits-for graph by constructing the union of the local graphs. In deadlock detection, transactions wait for In the hierarchical approach, the data- each other in an uncontrolled manner and base sites are organized into a hierarchy (or are only aborted if a deadlock actually oc- tree), with a deadlock detector at each node curs. Deadlocks are detected by explicitly constructing the waits-for graph and of the hierarchy [MENA79]. For example, one might group sites by region, then by searching it for cycles. {Cycles in a graph country, then by continent. Deadlocks that can be found efficiently using, for example, Algorithm 5.2 in AHO75.) If a cycle is found, are local to a single site are detected at that site; deadlocks involving two or more sites one transaction on the cycle, called the of the same region are detected by the victim, is aborted, thereby breaking the deadlock. To minimize the cost of restarting regional deadlock detector; and so on. the victim, victim selection is usually based Although centralized and hierarchical deadlock detection differ in detail, both in- on the amount of resources used by each volve periodic transmission of local waits- transaction on the cycle. for information to one or more deadlock The principal difficulty in implementing detector sites. The periodic nature of the deadlock detection in a distributed data- introduces two problems. First, a base is constructing the waits-for graph ef- deadlock may exist for several minutes ficiently. Each 2PL scheduler can easily without being detected, causing response- construct the waits-for graph based on the time degradation. The solution, executing waits-for relationships local to that sched- the deadlock detector more frequently, in- uler. However, these local waits-for graphs creases the cost of deadlock detection. Sec- are not sufficient to characterize all dead- ond, a transaction T may be restarted for locks in the distributed system (see Figure reasons other than concurrency control 8). Instead, local waits-for graphs must be (e.g., its site crashed). Until T's restart combined into a more "global" waits-for propagates to the deadlock detector, the graph. (CentrAlized 2PL does not have this deadlock detector can find a cycle in the problem, since there is only one scheduler.} waits-for graph that includes T. Such a We describe two techniques for construct- cycle is called a phantom deadlock. When ing global waits-for graphs: centralized and the deadlock detector discovers a phantom hierarchical deadlock detection. deadlock, it may unnecessarily restart a In the centralized approach, one site is transaction other than T. Special precau- designated the deadlock detector for the tions are also needed to avoid unnecessary distributed system [GRAY78, STON79]. Pe- restarts for deadlocks in voting 2PL. 2 riodically (e.g., every few minutes) each scheduler sends its local waits-for graph to 2 Suppose logical data item X has copies x~, x2, and x3, the deadlock detector. The deadlock detec- and suppose usmg voting 2PL T, owns write-locks on tor combines the local graphs into a system- x] and x2 but T,'s lock request for x~ is blocked by Tj.

Computing Surveys, Vol 13, No. 2, June 1981 Concurrency Control in Database Systems • 199 A major cost of deadlock detection is the and outputs these operations according to restarting of partially executed transac- the T/O specification [SHAP77a, SHAP77b]. tions. Predeclaration can be used to reduce In practice, prewrites must also be proc- this cost. By obtaining a transaction's locks essed through the T/O scheduler for two- before it executes, the system will only re- phase commit to operate properly. As was start transactions that have not yet exe- the case with 2PL, the basic T/O imple- cuted. Thus little work is wasted by the mentation distributes the schedulers along restart. with the database [BEBN80a]. If we ignore two-phase commit, the basic 4. SYNCHRONIZATION TECHNIQUES T/O scheduler is quite simple. At each DM, BASED ON TIMESTAMP ORDERING and for each data item x stored at the DM, the scheduler records the largest timestamp Timestamp ordering (T/O) is a technique of any dm-read(x) or din-write(x) that has whereby a serialization order is selected a been processed. These are denoted R-ts(x) priori and transaction execution is forced to and W-ts(x), respectively. For rw synchro- obey this order. Each transaction is as- nization, scheduler S operates as follows. signed a unique timestamp by its TM. The Consider a din-read(x) with timestamp TS. TM attaches the timestamp to all dm-reads If TS < W-ts(x), S rejects the dm-read and and dm-writes issued on behalf of the trans- aborts the issuing transaction. Otherwise S action, and DMs are required to process outputs the dm-read and sets R-ts(x) to conflicting operations in timestamp order. max(R-ts(x)-,TS). For a dm-write(x) with The timestamp of operation O is denoted timestamp TS, S rejects the dm-write if ts(O). TS < R-ts(x); otherwise it outputs the dm- The definition of conflicting operations write and sets W-ts(x) to max(W-ts(x),TS). depends on the type of synchronization For ww synchronization, S rejects a dm- being performed and is analogous to con- write(x) with timestamp TS if TS < W- flicting locks. For rw synchronization, two ts(x); otherwise it outputs the dm-write and operations conflict if (a) both operate on sets W-ts(x) to TS. the same data item, and (b) one is a dm- When a transaction is aborted, it is as- read and the other is a dm-write. For ww signed a new and larger timestamp by its synchronization, two operations conflict if TM and is restarted. Restart issues are (a) both operate on the same data item, and discussed further below. (b) both are dm-writes. Two-phase commit is incorporated by It is easy to prove that T/O attains an timestamping prewrites and accepting or acyclic --.~ (-. ~w) relation when used for rejecting prewrites instead of dm-writes. rw (ww) synchronization. Since each DM Once a scheduler accepts a prewrite, it must processes conflicting operations in time- guarantee to accept the corresponding dm- stamp order, each edge of the --. ~w~ (-~ ww) write no matter when the dm-write arrives. relation is in timestamp order. Conse- For rw (or ww) synchronization, once S quently, all paths in the relation are in accepts a prewrite(x) with timestamp TS it timestamp order and, since all transactions must not output any dm-read(x) (or dm- have unique timestamps, no cycles are pos- write(x)) with timestamp greater than TS sible. In addition, the timestamp order is a until the dm-write(x) is output. The effect valid serialization order. is similar to setting a writelock on x for the duration of two-phase commit.. 4.1 Basic T/O Implementation To implement the above rules, S buffers An implementation of T/O amounts to dm-reads, dm-writes, and prewrites. Let building a T/O scheduler, a software mod- min-R-ts(x) be the minimum timestamp of ule that receives dm-reads and dm-writes any buffered din-read(x), and define min- W-ts(x) and min-P-ts(x) analogously. Rw Insofar as xa's scheduler is concerned, T, is waitmg for synchronization is accomplished as follows: %. However, since T, has a maJority of the copies locked, T, can proceed without waiting for Tj. This 1. Let R be a dm-read(x). If ts(R) < W- fact should be incorporated into the deadlock resolu- ts(x), R is rejected. Else if ts(R) > min- tion scheme to avoid unnecessary restarts. P-ts(x), R is buffered. Else R is output.

Computing Surveys, Vol. 13, No. 2, June 1981 200 P. A. Bernstein and N. Goodman

Let R ffi dm-read (x). Let W ffi rim-write (x). R is ready flit precedes the earliest prewrite request: ff ts(R) < min-P-ts(x). W is ready if it precedes the earliest din-read request: ifts (W) < min-R-ts(x). When a din-write(x) arrives, do the following:

I Bufferit,I

es

I Output all ready W's, and debuffer their ' 1 prewrites. (This may increase min-P-t~(x) and make some R's ready.) I ! i Output all ready R's. (This may increase | min-R-ts(x) and make some W's ready.) I I Figure 9. Buffer emptying for basic T/O rw synchromzation.

2. Let P be a prewrite(x). If ts(P) < R- rejected. If ts(W) > min-P-ts(x), W is ts(x), P is rejected. Else P is buffered. buffered; else W is output. 3. Let W be a dm-write(x). W is never 3. When W is output, the corresponding rejected. If ts(W) > min-R-ts(x), W is prewrite is debuffered. If this causes buffered. (If W were output it would min-P-ts(x) to be increased, the buffered cause a buffered dm-read(x) to be re- dm-writes are retested to see if any can jected.) Else W is output. now be output. See Figure 10. 4. When W is output, the corresponding As with 2PL, a common variation is to prewrite is debuffered. If this causes require that transactions predeclare their min-P-ts(x) to increase, the buffered din- readsets and writesets, issuing all dm-reads reads are retested to see if any of them and prewrites before beginning their main can be output. If this causes min-R-ts(x) execution.3 If all operations are accepted, to be increased, the buffered dm-writes are also retested, and so forth. This proc- ess is diagramed in Figure 9. 3 These prewrites are nonstandard relative to the def- Ww synchronization is accomplished as fol- inition in Section 1.4. Since new values for the data items in the writeset are not yet known, these pre- lows: writes do not instruct DMs to store values on secure storage; instead, prewrite (x) merely "warns" the DM 1. Let P be a prewrite(x). If ts(P) < W- to expect a din-write (x) m the near future. However, ts(x), P is rejected; else P is buffered. these prewrites are processed by synchronization al- 2. Let W be a dm-write(x). W is never gorithms exactly as "standard" ones are.

CompuUng Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems • 201

When a din-write(x) arrives, do the followmg:

I Bufferlt ]

es

Output all ready W's and debuffer their prewrites. (This may increase min-P-ts(x) and make some W's ready.) I Figure 10. Buffer emptying for basic T/O ww synchronization. the transaction is guaranteed to execute (W-ts, value) pairs, called versions. The R- without danger of restart. Another varia- ts's of x record the timestamps of all exe- tion is to delay the processing of operations cuted dm-read(x) operations, and the ver- to wait for operations with smaller time- sions record the timestamps and values of stamps. The extreme version of this heuris- all executed dm-write(x) operations. (In tic is conservative T/O, described in Sec- practice one cannot store R-ts's and ver- tion 4.4. sions forever; techniques for deleting old versions and timestamps are described in 4.2 The Thomas Write Rule Sections 4.5 and 5.2.2.) Multiversion T/O accomplishes rw syn- For ww synchronization the basic T/O chronization as follows (ignoring two-phase scheduler can be optimized using an obser- commit). Let R be a dm-read(x). R is proc- vation of THOM79. Let W be a dm-write(x), essed by reading the version of x with larg- and suppose ts(W) < W-ts(x). Instead of est timestamp less than ts(R) and adding rejecting W we can simply ignore it. We ts(R) to x's set of R-ts's; see Figure lla. R call this the Thomas Write Rule (TWR). is never rejected. Let W be a dm-write(x), Intuitively, TWR applies to a dm-write that and let interval(W) be the interval from tries to place obsolete information into the ts(W) to the smallest W-ts(x) > ts(W); 4 database. The rule guarantees that the ef- see Figure llb. If any R-ts(x) lies in fect of applying a set of dm-writes to x is interval(W), W is rejected; otherwise W is identical to what would have happened had output and creates a new version of x with the dm-writes been applied in timestamp timestamp ts(W). order. To prove the correctness of multiversion If TWR is used, there is no need to in- T/O, we must show that every execution is corporate two-phase commit into the ww equivalent to a serial execution in time- synchronization algorithm; the ww sched- stamp order [BERNS0b]. Let R be a dm- uler always accepts prewrites and never read(x) that is processed "out of order"; buffers dm-writes. that is, suppose R is executed after a dm- write(x) whose timestamp exceeds ts(R). 4.3 Multiversion T/O Since R ignores all versions with time- For rw synchronization the basic T/O scheduler can be improved using multiver- sion data items [REED78]. For each data 4Interval(W) ffi (ts(W),oo) if no W-ts(x) > ts (W) item x there is a set of R-ts's and a set of exists.

Computing Surveys, Vol. 13, No. 2, June 1981 202 P. A. Bernstein and N. Goodman

(a) Let us represent the versions of a data item x on a "time line":

Values V1 V2 V3 "'" Vn-1 V~ W-timestamps ~ 1~0 2~0 ... 912 1[00 ~--~ To process a dm-read(x) with timestamp 95, find the biggest W-timestamp less than 95; in this case 92. That is the version you read. So in this case, the value read by the din-read is V,.].

(b) Let us represent the R-timestamps of x similarly:

R-timestamps ~ ~ ll5 .,. 9[2 915

Values VI1VI~ VI3 " " " V,,-1V. W-timestamps ~ 10 20 92I lloo

Let W be din-write(x) with timestamp 93. Interval(W) ffi (93,100).

To process W we create a new version of x with that timestamp.

R-timestamps 1 I J I I 5 7 15 92 95 Values Vl V2 V3 .. • Vn.1 V Vn W-timestamps 5I 10I 20I • • • 912 913 1100

However, this new version "invalidates" the din-read of part (a), because if the din-read had arrived after the din-write, it would have read value V instead of Vn-1. Therefore, we must reject the din-write.

Figure 11. Multiversion reading and writing. stamps greater than ts(R), the value read > ts(P). Rw synchronization is performed by R is identical to the value it would have as follows: read had it been processed in timestamp order. Now let W be a dm-write(x) that is processed "out of order"; that is, suppose it 1. Let R be a dm-read(x). R is never rejected. is executed after a dm-read(x) whose time- If ts(R) lies in interval(prewrite(x)) for some buffered prewrite(x), then R is stamp exceeds ts(W). Since W was not rejected, there exists a version of x with buffered. Else R is output. timestamp TS such that ts(W) < TS < 2. Let P be a prewrite(x). If some R-ts(x) ts(dm-read). Again the effect is identical to lies in interval(P), P is rejected. Else P a timestamp-ordered execution. is buffered. For ww synchronization, multiversion 3. Let W be a din-write(x). W is always T/O is essentially an embellished version output immediately. of TWR. A dm-write(x) always creates a 4. When W is output, its prewrite is debuf- new version of x with timestamp ts(dm- fered, and the buffered din-reads are re- write) and is never rejected. tested to see if they can now be output. Integrating two-phase commit requires See Figure 12. that dm-reads and prewrites (but not dm- writes) be buffered as in basic T/O. Let P Two-phase commit is not an issue for ww be a buffered prewrite(x): interval(P) is the synchronization, since dm-writes are never interval from ts(P) to the smallest W-ts(x) rejected for ww synchronization.

Computing Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems • 203

Let R ffi din-read(x). R is ready if ts(R) ~ interval buffered. Define min-W-ts(TMi) analo- (P), where P is any buffered gously. prewrite(x). Conservative T/O performs rw synchro- When a dm-write arrives do the following: nization as follows: 1. Let R be a din-read(x). If ts(R) > min- Output It and debuffer its prewrite [ W-ts(TM) for any TM in the system, R is buffered. Else R is output. 2. Let Wbe a dm-write(x). Ifts(W) :> min- R-ts(TM) for any TM, W is buffered. 1 Else W is output. I o t ut ready 's'l 3. When R or W is output or buffered, this may increase min-R-ts(TM) or min-W- ts(TM); buffered operations are retested to see if they can now be output.

The effect is that R is output if and only if (a) the scheduler has a buffered din-write from every TM, and (b) ts(R) < minimum Figure 12. Buffer emptying for multiverslon T/O. timestamp of any buffered dm-write. Simi- larly, W is output if and only if (a) there is 4.4 Conservative T/O a buffered din-read from every TM, and (b) ts(W) < minimum timestamp of any Conservative timestamp ordering is a tech- buffered din-read. Thus R (or W) is output nique for eliminating restarts during T/O ff and only if the scheduler has received scheduling [BERN80a]. When a scheduler every din-write (or din-read) with smaller receives an operation O that might cause a timestamp that it will ever receive. future restart, the scheduler delays 0 until Ww synchronization is accomplished as it is sure that no future restarts are possible. follows: Conservative T/O requires that each scheduler receive dm-reads (or dm-writes) 1. Let Wbe a din-write(x). Ifts(W) > min- from each TM in timestamp order. For W-ts(TM) for any TM in the system, W example, if scheduler Sj receives dm- is buffered; else it is output. read(x) followed by dm-read(y) from TM,, 2. When W is buffered or output, this may then ts(dm-read(x)) _ ts(dm-read(y)). increase min-W-ts(TM); buffered din- Since the network is assumed to be a FIFO writes are retested accordingly. channel, this timestamp ordering is accom- plished by requiring that TM, send din- The effect is that the scheduler waits reads (or din-writes) to S: in timestamp until it has a buffered din-write from every order: TM and then outputs the din-write with Conservative T/O buffers din-reads and smallest timestamp. din-writes as part of its normal operation. Two-phase commit need not be tightly When a scheduler buffers an operation, it integrated into conservative T/O because remembers the TM that sent it. Let min-R- dm-writes are never rejected. Although pre- ts(TM,) be the minimum timestamp of any writes must be issued for all data items buffered din-read from TM~, with min-R- updated, these operations are not processed ts(TM,) ffi -oo if no such din-read is by the conservative T/O schedulers. The above implementation of conserva- tive T/O suffers three major problems: (1) 5 This can be implemented by requiring that TMs If some TM never sends an operation to process transactions serially Alternatively, we can some scheduler, the scheduler will "get require that transactions issue all dm-reads before stuck" and stop outputting. (2) To avoid beginning their main executmn, and all dm-writes after the first problem, every TM must commu- terminating their main execution. Then transactions can execute concurrently, although they must termi- nicate regularly with every scheduler; this nate in timestamp order. is infeasible in large networks. (3) The im-

Computing Surveys, Vol. 13, No. 2, June 1981 204 ° P. A. Bernstein and N. Goodman

• A class is defined by a readset and a writeset. For of class C if readset(T) is a subset of read- example, set(C) and writeset(T) is a subset of write- CI: readset ffi {xl}, writeset ffi {yl, Y2} set(C). (Classes need not be disjoint.) C2: readset ffi {xl, y2}, writeset ffi {yl, y2, z2, z3} Class definitions are not expected to C3: readset ffi {y2, z3}, writeset --- {xl, z2, z3} change frequently during normal operation • A transaction is a member of a class if its readset is of the system. Changing a class definition a subset of the class readset and its writeset is a is akin to changing the database schema subset of the class wnteset. For example, and requires mechanisms beyond the scope Tl: readset ffi {xl}, writeset ffi {Yl, y2} of this paper. We assume that class defini- T2: readset ffi (y2), writeset -- {z2, z3) tions are stored in static tables that are Ts" readset ffi {z3}, writeset ffi {x~} available to any site requiring them. • T~ is a member of C1 and C2 Classes are associated with TMs. Every • T~ is a member of C2 and C3 transaction that executes at a TM must be • T3 Is a member of C3 a member of a class associated with the TM. If a transaction is submitted to a TM Figure 13. Transaction classes. that has no class containing it, the trans- action is forwarded to another TM that plementation is overly conservative; the ww does. We assume that every class is associ- algorithm, for instance, processes all dm- ated with exactly one TM, and vice versa. writes in timestamp order, not merely con- The class associated with TM, is denoted flicting ones. These problems are addressed C,. To execute transactions that are mem- below. bers of class C at two TMs, we define an- Null Operations. To solve the first other class C' with the same definition as C problem, TMs are required to periodically and associate C with one TM and C' with send timestamped operations to each the other. To execute transactions that are scheduler in the absence of "real" traffic. A members of two classes at one site, we null operation is a dm-read or dm-write multiprogram two TMs at that site. whose sole purpose is to convey timestamp Classes are exploited by conservative information and thereby unblock "real" T/O schedulers as follows. Consider rw syn- dm-reads and prewrites. An impatient chronization and suppose scheduler S scheduler can prompt a TM for a null op- wants to output a dm-read(x). Instead of eration by sending a "request message." waiting for dm-writes with smaller time- For example, for rw synchronization sup- stamps from all TMs, S need only wait pose scheduler S wants to process a dm- for dm-writes from those TMs whose class read with timestamp TS, but does not have writeset contains x. Similarly, to process a a buffered dm-write from TM~. S can send dm-write (x), S need only wait for dm-reads a message to TM~ requesting a null-dm- with smaller timestamp from those TMs write with timestamp greater than TS. whose class readset contains x. Thus com- A variation is to use null operations with munication requirements are decreased, very large (perhaps infinite) timestamps. and the level of concurrency in the system For example, if TM~ rarely needs to issue is increased. Ww synchronization proceeds dm-reads to S, TM, can send S a null-dm- similarly. read with infinite timestamp signifying that TM, does not intend to communicate with S until further notice. Conflict Graph Analysis. Conflict graph analysis is a technique for further Transaction Classes. Transaction improving the performance of conservative classes [BER~78a, BERN80d] is a technique T/O with classes. A conflict graph is an for reducing communication in conserva- undirected graph that summarizes poten- tive T/O and for supporting a less conserv- tial conflicts between transactions in differ- ative scheduling policy. As in predeclara- ent classes. For each class C, the graph tion, assume that every transaction's read- contains two nodes, denoted r~ and w,, set and writeset are known in advance. A which represent the readset and writeset of class is defined by a readset and a writeset C,. The edges of the graph are defined as (see Figure 13). Transaction T is a member follows (see Figure 14): (1) For each class

Computing Surveys, Vol 13, No 2, June 1981 Concurrency Control in Database Systems • 205

Define C1, C2, Ca as in Figure 13.

C1 readset = {xl} C2 readset = {x], y2} C3 readset ffi {Y2, za}

C1 writeset = {yl, y2} C2 writeset = {yl, y2, z2, za} Ca writeset ffi {xl, z2, za}

Figure 14. Conflict graph. C~ there is a vertical edge between r~ and cally. The analysis produces a indi- w~; (2) for each pair of classes C, and Cj cating which horizontal and vertical edges (with i ~ j) there is a horizontal edge require synchronization and which do not. between w~ and wj if and only if writeset(C~) This table, like class definitions, is distrib- intersects writeset(C~); (3) for each pair of uted in advance to all schedulers that classes C, and C~ (with i # j) there is a need it. diagonal edge between r~ and w~ if and only if readset(C~) intersects writeset(C~). 4.5 Timestamp Management Intuitively, a horizontal edge indicates A common criticism of T/O schedulers is that a scheduler S may be forced to delay that too much memory is needed to store dm-writes for purposes of ww synchroniza- timestamps. This problem can be overcome tion. Suppose classes C~ and C~ are con- by "forgetting" old timestamps. nected by a horizontal edge (w,, wj), indi- Timestamps are used in basic T/O to cating that their class writesets intersect. If reject operations that "arrive late," for ex- S receives a dm-write from C,, it must delay ample, to reject a dm-read(x) with time- the dm-write until it receives all dm-writes stamp TS1 that arrives after a dm-write(x) with smaller timestamps from Cj. Similarly, with timestamp TS2, where TS1 <: TS2. In a diagonal edge indicates that S may need principle, TS1 and TS2 can differ by an to delay operations for rw synchronization. arbitrary amount. However, in practice it is Conflict graph analysis improves the sit- unlikely that these timestamps will differ uation by identifying interclass conflicts by more than a few minutes. Consequently, that cannot cause nonserializable behavior. timestamps can be stored in small tables This corresponds to identifying horizontal which are periodically purged. and diagonal edges that do not require syn- R-ts's are stored in the R-table with en- chronization. In particular, schedulers need tries of the form (x, R-ts); for any data only synchronize dm-writes from C, and Cj item x, there is at most one entry. In addi- if either (1) the horizontal edge between w~ tion, a variable, R-min, tells the maximum and wj is embedded in a cycle of the conflict value of any timestamp that has been graph; or (2) portions of the intersection of purged from the table. To find R-ts(x), a C~'s writeset and C/s writeset are stored at scheduler searches the R-table for an (x, two or more DMs [BERN80C]. That is, if TS) entry. If such an entry is found, R- conditions (1) and (2) do not hold, schedu- ts(x) = TS; otherwise, R-ts(x) _ R-rain. To ler S need not process dm-writes from C~ err on the side of safety, the scheduler and Cj in timestamp order. Similarly, dm- assumes R-ts(x) ffi R-rain. To update R- reads from C, and dm-writes from Cj need ts(x), the scheduler modifies the (x, TS) only be processed in timestamp order if entry, if one exists. Otherwise, a new entry either (1) the diagonal edge between r, and is created and added to the table. When the wj is embedded in a cycle of the conflict R-table is full, the scheduler selects an ap- graph; or (2) portions of the intersection of propriate value for R-rain and deletes all C,'s readset and Cj's writeset are stored at entries from the table with smaller time- two or more DMs. stamp. W-ts's are managed similarly, and Since classes are defined statically, con- analogous techniques can be devised for flict graph analysis is also performed stati- multiversion .

Computing Surveys, Vol. 13, No. 2, June 1981 206 • P. A. Bernstein and N. Goodman

Maintaining timestamps for conservative have useful properties that cannot be at- T/O is even cheaper, since conservative tained by pure 2PL or T/O methods. T/O requires only timestamped operations, not timestamped data. If conservative T/O 5.1 Pure 2PL Methods is used for rw synchronization, the R-ts's of data items may be discarded. If conserva- The 2PL synchronization techniques of tive T/O is used for both rw and ww syn- Section 3 can be integrated to form 12 chronization, W-ts's can be eliminated also. principal 2PL methods: Method rw techmque ww technique 5. INTEGRATED CONCURRENCY CONTROL 1 Basic2PL Basic 2PL METHODS 2 Basic2PL Primary copy 2PL 3 Basic2PL Voting 2PL An integrated concurrency control method 4 Basic2PL Centralized 2PL consists of two components--an rw and a 5 Primarycopy 2PL Basic2PL ww synchronization technique--plus an in- 6 Primarycopy 2PL Primarycopy 2PL terface between the components that at- 7 Primarycopy 2PL Voting2PL 8 Primarycopy 2PL Centralized2PL tains condition (b) of Theorem 2: a total 9 Centralized2PL Basic 2PL ordering of the transactions consistent with 10 Centralized2PL Primarycopy 2PL all ---*~w~ and --*ww relationships. In this 11 Centralized2PL Voting2PL section we list 48 concurrency control meth- 12 Centralized2PL Centralized2PL ods that can be constructed using the tech- Each method can be further refined by the niques of Sections 3 and 4. choice of deadlock resolution technique Approximately 20 concurrency control (see Section 3.5). methods have been described in the litera- The interface between each 2PL rw tech- ture. Virtually all of them use a single nique and each 2PL ww technique is synchronization technique {either 2PL or straightforward. It need only guarantee T/O) for both rw and ww synchronization. that "two-phasedness" is preserved, mean- Indeed, most methods use the same varia- ing that all locks needed for both the rw tion of a single technique for both kinds of and ww technique must be obtained before synchronization. However, such homoge- any lock is released by either technique. neity is neither necessary nor especially desirable. 5. 1.1 Methods Using Basic 2PL for rw For example, the analysis of Section 3.2 Synchronization suggests that using basic 2PL for rw syn- chronization and primary copy 2PL for ww Methods 1-4 use basic 2PL for rw synchro- synchronization might be superior to using nization. Consider a logical data item X basic 2PL (or primary copy 2PL) for both. with copies xl, ..., Xm. TO read X, a trans- More outlandish combinations may be even action sends a dm-read to any DM that better. For example, one can combine basic stores a copy of X. This dm-read implicitly 2PL with TWR. In this method ww con- requests a readlock on the copy of X at that flicts never cause transactions to be delayed DM. To write X, a transaction sends pre- or restarted; multiple transactions can write writes to every DM that stores a copy of X. into the same data items concurrently (see These prewrites implicitly request write- Section 5.3). locks on the corresponding copies of X. For In Sections 5.1 and 5.2 we describe meth- all four methods, these writelocks conflict ods that use 2PL and T/O techniques for with readlocks on the same copy, and may both rw and ww synchronization. The con- also conflict with other writelocks on the currency control methods in these sections same copy, depending on the specific ww are easy to describe given the material of synchronization technique used by the Sections 3 and 4; the description of each method. method is little more than a description of Since locking conflict rules for writelocks each component technique. In Section 5.3 will vary from copy to copy, we distinguish we list 24 concurrency control methods that three types. An rw writelock only conflicts combine 2PL and T/O techniques. As we with readlocks on the same data item. A show in Section 5.3, methods of this type ww writelock only conflicts with ww write-

Computing Surveys, Vol 13, No. 2, June 1981 Concurrency Control in Database Systems • 207 locks on the same data item. And an rww explicit lock request to Xl'S DM; when the writelock conflicts with readlocks, ww lock is granted the transaction can read any writelocks, and rww writelocks. Thus, using copy of X. basic 2PL for rw synchronization, every To write into X, a transaction sends pre- prewrite sets rw writelocks, and may set writes to every DM that stores a copy of X. stronger locks depending on the ww tech- A prewrite(xl) implicitly requests an rw nique. writelock. Prewrites on other copies of X may also request writelocks depending on Method 1: Basic 2PL for ww synchroni- the ww technique. zation. All writelocks are rww writelocks; that is, for i ffi 1, ..., m, a writelock on x, Method 5: Basic 2PL for ww synchroni- conflicts with either a readlock or a write- zation. For i ffi 2 .... , m, prewrite(x,) re- lock on x, This is the "standard" distrib- quests a ww writelock. Since the writelock uted implementation of 2PL. on xz must also conflict with readlocks on Method 2: Primary copy 2PL for ww xl, prewrite(xl) requests an rww writelock. synchronization. Writelocks only conflict Method 6: Primary copy 2PL for ww on the primary copy. An rww writelock is synchronization. Prewrite(xl) requests an used on the primary copy, while rw write- rww writelock on xl. Prewrites on other locks are used on the others. copies do not request any locks. This Method 3: Voting 2PL for ww synchro- method was originally proposed by SToN79 nization. A DM responds to a prewrite(x,) and is used in Distr~uted INGRES by attempting to set an rww writelock on [STOs77]. x, However, if another transaction already Method 7: Voting 2PL for ww synchro- owns an rww writelock on x,, the DM only nization. When a scheduler receives a pre- sets an rw writelock and leaves a request write(x,) for i ~ 1, it tries to set a ww for an rww writelock pending. A transaction writelock on x,. When it receives a pre- can write into any copy of X after it obtains write(x1), it tries to set an rww writelock on rww writelocks on a majority of copies. This x~; if it cannot, then it sets an rw writelock is similar to the method proposed in on xz (if possible) before waiting for the GIFF79. rww writelock. A transaction can write into Method 4: Centralized 2PL for ww syn- every copy of X after it obtains a ww (or chronization. To write into X, a transaction rww) writelock on a majority of copies must first explicitly request a ww writelock of X. on X from a centralized 2PL scheduler. The Method 8: Centralized 2PL for ww syn- rw writelocks set by prewrites never conflict chronization. Transactions obtain ww with each other. writelocks from a centralized 2PL schedu- In all four methods, readlocks are explic- ler. Thus a prewrite(xl) requests an rw itly released by lock releases while write- writelock on x~; for i ffi 2,..., m, prewrite(x,) locks are implicitly released by dm-writes. does not request any lock. Lock releases may be transmitted in paral- Lock releases for Methods 5-8 are han- lel with dm-writes. In Method 4, after all dled as in Section 5.1.1. din-writes have been executed, additional lock releases must be sent to the centralized scheduler to release writelocks held there. 5 1.3 Methods Using Centrahzed 2PL for rw Synchroniza tion 5.1.2 Methods Using Primary Copy 2PL for rw The remaining 2PL methods use central- Synchromzahon ized 2PL for rw synchronization. Before Methods 5-8 use primary copy 2PL for rw reading (or writing) any copy of logical data synchronization. Consider a logical data item X, a transaction must obtain a read- item X with copies xl ..... xm, and assume lock (or rw writelock) on X from a central- xz is the primary copy. To read X, a trans- ized 2PL scheduler. Before writing X, the action must obtain a readlock on x~. It may transaction must also send prewrites to ev- obtain this lock by issuing a dm-read(Xl). ery DM that stores a copy of X. Some Alternatively, the transaction can send an of these prewrites implicitly request ww

Computing Surveys, Vol. 13, No. 2, June 1981 208 • P. A. Bernstein and N. Goodman writelocks on copies of X, depending on the quirement is that both techniques use the specific method. same timestamp for any given transaction.

Method 9: Basic 2PL for ww synchro- 5.2.1 Methods Using Basic T/O for rw nization. Every prewrite requests a ww Synchronization writelock. Method 10: Primary copy 2PL for ww Methods 1-4 use basic T/O for rw synchro- synchronization. If xl is the primary copy nization. All four methods require R-ts's for of X, a prewrite(xl) requests a ww writelock. each data item. Methods 1, 2, and 4 require Prewrites on other copies do not request W-ts's, while in Method 3 each data item any writelocks. has a set of timestamped versions; for Method 11: Voting 2PL for ww synchro- Method 3, let W-ts(x) denote x's largest nization. Every prewrite attempts to set a timestamp. Each method buffers dm-reads ww writelock. A transaction can write into and prewrites for two-phase commitment every copy of X after it obtains ww write- purposes; let min-R-ts(x) and min-P-ts(x) locks on a majority of copies of X. be the minimum timestamps of any Method 12: Centralized 2PL for ww syn- buffered din-read(x) and prewrite(x), re- chronization. All locks are obtained at the spectively. centralized 2PL scheduler. Before writing These methods can be described by the into any copy of X, an rww writelock on X following steps. Let R be a din-read(x), P a is obtained from the centralized scheduler. prewrite(x), and W a din-write(x). Prewrites set no locks at all. Method 12 is 1. If ts(R) < W-ts(x), R is rejected. Else the "standard" implementation of central- ized 2PL (called primary site in ALSB76a). if ts(R) > min-P-ts(x), R is buffered. Else R is output and R-ts(x) is set to Lock releases for Methods 9-12 are han- max(R-ts(x), ts(R)). dled as in Section 5.1.1. . If ts(P) < R-ts(x) or condition (A) 6 holds, P is rejected. Else P is buffered. 5.2 Pure T/O Methods 3. If ts(W) > min-R-ts(x) or condition (B) 6 holds, W is buffered. Else W is output The T/O synchronization techniques of and W-ts(x) is set to max(W-ts(x), Section 5 can also be integrated to form 12 ts(W)). For Method 3, a new version of principal T/ O methods: x is created with timestamp ts(W).

. When W is output, its prewrite is debuf- Method rw technique ww technique fered and the buffered din-reads and din- 1 Basic T/O Basic T/O writes are retested to see if any can now 2 Basic T/O Thomas Write Rule (TWR) be output. 3 Basic T/O Multiversion T/O 4 Basic T/O Conservatwe T/O Method 1: Basic T/O for ww synchroni- 5 Multiversion T/O Basic T/O zation. Condition (A) is ts(P) < W-ts(x) 6 Multiverslon T/O TWR and condition (B) is ts(W) > min-P-ts(x). 7 Multiversion T/O Multiversion T/O Note that min-R-ts(x) > min-P-ts(x), since 8 Multwersion T/O Conservative T/O R is buffered only if ts(R) > min-P-ts(x). 9 Conservative T/O Basic T/O 10 Conservative T/O TWR Also, when Wis output, ts(W) > W-ts(x), 11 Conservative T/O Multiversion T/O since condition (B) forces dm-writes on a 12 Conservahve T/O Conservative T/O given x to be output in timestamp order. (That there are also 12 2PL methods is Thus step 3 simplifies to coincidental.) 3. If ts(W) > min-P-ts(x), W is buffered. Each T/O method that incorporates a Else W is output and W-ts(x) is set to conservative component can be refined by ts(W). including classes and conflict graph analysis Method 2: TWR for ww synchronization. (see Sections 4.4.2 and 4.4.3). Conditions (A) and (B) are null. However, The interface between rw and ww syn- chronization techniques is even simpler for 6 Conditions (A) and (B) are determmed by the new T/O methods than for 2PL. The only re- technique. See the following.

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems • 209 if ts(W) < W-ts(x), Whas no effect on the R-ts(x), then P is rejected. Else it is database. buffered. Method 3: Multiversion T/O for ww syn- Because of this simplification, the method chronization. Like Method 2 except that only requires that the maximum R-ts(x) be W always creates a new version of x. Method 4: Conservative T/ O for ww syn- stored. chronization. Condition (A) is null. For Condition (B) forces dm-writes on a each TM, let min-W-ts(TM) be the mini- given data item to be output in timestamp mum timestamp of any buffered dm-write order. This supports a systematic technique for "forgetting" old versions. Let max-W- from that TM. Condition (B) is ts(W) > ts(x) be the maximum W-ts(x) and let min- min-W-ts(TM) for some TM. As in Method 1, this causes dm-writes on a given x to be ts be the minimum of max-W-ts(x) over all output in timestamp order, and step 3 sim- data items in the database. No dm-write plifies to with timestamp less than min-ts can be output in the future. Therefore, insofar as 3. If ts{W) > min-R-ts{x) or ts(W) > min- update transactions are concerned, we can W-ts(TM) for some TM, W is buffered. safely forget all versions timestamped less Else W is output and W-ts(x) is set to than min-ts. TMs should be kept informed ts(W). of the current value of min-ts and queries (read-only transactions) should be assigned 5.2.2 Methods Using Multiversion T/O for rw timestamps greater than min-ts. Also, after Synchronization a new min-ts is selected, older versions should not be forgotten immediately, so Methods 5-8 use multiversion T/O for rw that active queries with smaller timestamps synchronization and require a set of R-ts's have an opportunity to finish. and a set of versions for each data item. Method 6: TWR for ww synchronization. These methods can be described by the This method is incorrect. TWR requires following steps. Define R, P, W, min-R-ts, that W be ignored if ts(W) < max W-ts(x). min-W-ts, and min-P-ts as above; let inter- This may cause later dm-reads to be read val{P) be the interval from ts(P) to the incorrect data. See Figure 15. {Method 6 is smallest W-ts(x) > ts(P). the only incorrect method we will encoun- 1. R is never rejected. If ts(R) lies in ter.) interval(prewrite(x)) for some buffered Method 7: Multiversion T/ O for ww syn- prewrite(x), then R is buffered. Else R is chronization. Conditions (A) and (B) are output and ts(R) is added to x's set of null. Note that this method, unlike all pre- R-ts's. vious ones, never buffers dm-writes. 2. If some R-ts(x) lies in interval(P) or Method 8: Conservative T/O for ww syn- condition (A) holds, then P is rejected. chronization. Condition (A) is null. Condi- Else P is buffered. tion (B) is ts(W) > min-W-ts(TM) for some 3. If condition (B) holds, W is buffered. TM. Condition (B) forces dm-writes to Else W is output and creates a new be output in timestamp order, implying version of x with timestamp ts(W). interval(P) = (ts(P), oo). As in Method 5, 4. When W is output, its prewrite is debuf- this simplifies step 2: feted, and buffered dm-reads and dm- 2. If ts(P) < max R-ts(x), P is rejected; else writes are retested. it is buffered.

Method 5: Basic T/O for ww synchron- Like Method 5, this method only requires ization. Condition (A) is ts(P) < max- that the maximum R-ts(x) be stored, and it W-ts(x) and condition (B) is ts(W) > supports systematic "forgetting" of old ver- min-P-ts(x). Condition (A) implies that sions described above. interval(P) = (ts(P), ~); some R-ts(x) lies 5.2.3 Methods Using Conservative T/O for rw in that interval if and only if ts(P) < max- Synchronization imum R-ts(x). Thus step 2 simplifies to The remaining T/O methods use conserv- 2. If ts(P) < max W-ts(x) or ts(P) < max ative T/O for rw synchronization. Methods

ComputingSurveys, Vol. 13, No. 2, June 1981 210 • P. A. Bernstein and N. Goodman

• Consider data items x and y with the foUowmg versions:

Values 0 100

I I v W-tlmestamps 0 lo0 Values 0 y I r W-timestamps 0

• Now suppose T has timestamp 50 and writes x := 50, y := 50. Under Method 6 the update to x is ignored, and the result is

Values 0 lo0 x I I v W-tmaestamps 0 100 Values 0 5O y J I W-timestamps 0 5O

• Finally, suppose T' has timestamp 75 and reads x and y. The values it will read are x = 0, y ffi 50, whmh is incorrect. T' should read x - 50, y = 50.

Figure 15. Inconsistent retrievals in Method 6.

9 and 10 require W-ts's for each data item, This method is essentially the SDD-1 and Method 11 requires a set of versions for concurrency control [BERN80d], although each data item. Method 12 needs no data in SDD-1 the method is refined in several item timestamps at all. Define R, P, W and ways. SDD-1 uses classes and conflict graph min-P-ts as in Section 5.2.1; let min-R- analysis to reduce communication and ts(TM) (or min-W-ts(TM)) be the mini- increase the level of concurrency. Also, mum timestamp of any buffered dm-read SDD-1 requires predeclaration of read-sets (or dm-write) from TM. and only enforces the conservative sched- uling on dm-reads. By doing so, it forces 1. If ts(R) > min-W-ts(TM) for any TM, R dm-reads to wait for dm-writes, but does is buffered; else it is output. not insist that dm-writes wait for all dm- 2. If condition (A) holds, P is rejected. Else reads with smaller timestamps. Hence dm- P is buffered. reads can be rejected in SDD-1. 3. Ifts(W) > min-R-ts(TM) for any TM or condition (B) holds, W is buffered. Else Method 11: Multiversion T/O for ww W is output. synchronization. Conditions (A) and (B) 4. When W is output, its prewrite is debuf- are null. When W is output, it creates a fered. When R or W is output or new version of x with timestamp ts(W). buffered, buffered dm-reads and dm- When R is output it reads the version with writes are retested to see if any can now largest timestamp less than ts(R). be output. This method can be optimized by noting the multiversion T/O "automatically" pre- Method 9: Basic T/O for ww synchroni- vents dm-reads from being rejected, and zation. Condition (A) is ts(P) < W-ts(x), makes it unnecessary to buffer dm-writes. and condition (B) is ts(W) > min-P-ts(x). Thus step 3 can be simplified to Method 10: TWR for ww synchroniza- 3. W is output immediately. tion. Conditions (A) and (B) are null. How- ever, if ts(W) < W-ts(x), W has no effect Method 12: Conservative T/O for ww on the database. synchronization. Condition (A) is null; con-

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems . 211 dition (B) is ts(W) > min-W-ts(TM) for precedes T,+l's locked point, and (b) T, some TM. The effect is to output W if the released a lock on some data item x before scheduler has received all operations with T,+I obtained a lock on x. Let L be the L- timestamps less than ts(W) that it will ever ts(x) retrieved by TI+I. Then ts(T,) < L < receive. Method 12 has been proposed in ts(T,+~), and by induction ts(Ta) < ts(Tn). CI~EN80, KANE79, and SHAP77a. 5.3.2 Mixed Methods Using 2PL for rw Syn- chrontzation 5.3 Mixed 2PL and T/O Methods There are 12 principal methods in which The major difficulty in constructing meth- 2PL is used for rw synchronization and ods that combine 2PL and T/O lies in de- T/O is used for ww synchronization: veloping the interface between the two techniques. Each technique guarantees an Method rw technique ww technique acyclic --*~ (or ---~) relation when used for rw (or ww) synchronization. The inter- 1 Basic 2PL Basic T/O 2 Basic 2PL TWR face between a 2PL and a T/O technique 3 Basic 2PL Multiversion T/O must guarantee that the combined --* rela- 4 Basic 2PL Conservative T/O tion (i.e., --*~ U --->v,~) remains acyclic. That 5 Primary copy 2PL Basic T/O is, the interface must ensure that the seri- 6 Prnnary copy 2PL TWR 7 Primary copy 2PL Multiversion T/O alization order induced by the rw technique 8 Primary copy 2PL Conservative T/O is consistent with that induced by the ww 9 Centrahzed 2PL Basic T/O technique. In Section 5.3.1 we describe an 10 Centralized 2PL TWR interface that makes this guarantee. Given 11 Centrahzed 2PL Multiversion T/O such an interface, any 2PL technique can 12 Centralized 2PL Conservatwe T/O be integrated with any T/O technique. Sec- tions 5.3.2 and 5.3.3 describe such methods. Method 2 best exemplifies this class of methods, and it is the only one we describe in detail. Method 2 requires that every 5.3. 1 The Interface stored data item have an L-ts and a W-ts. The serialization order induced by any 2PL (One timestamp can serve both roles, but technique is determined by the locked we do not consider this optimization here.) points of the transactions that have been Let X be a logical data item with copies synchronized (see Section 3). The seriali- xl .... , xm. To read X, transaction T issues zation order induced by any T/O technique a dm-read on any copy of X, say x,. This is determined by the timestamps of the dm-read implicitly requests a readlock on synchronized transactions. So to interface x, and when the readlock is granted, L- 2PL and T/O we use locked points to in- ts(x,) is returned to T. To write into X, T duce timestamps [BERN80b]. issues prewrites on every copy of X. These Associated with each data item is a lock prewrites implicitly request rw writelocks timestamp, L-ts(x). When a transaction T on the corresponding copies, and as each sets a lock of x, it simultaneously retrieves writelock is granted, the corresponding L-ts L-ts(x). When T reaches its locked point it is returned to T. When T has obtained all is assigned a timestamp, ts(T), greater than of its locks, ts(T) is calculated as in Section any L-ts it retrieved. When T releases 5.3.1. T attaches ts(T) to its dm-writes, its lock on x, it updates L-ts(x) to be which are then sent. max(L-ts(x), ts(T)). Dm-writes are processed using TWR. Let Timestamps generated in this way are W be dm-write(xj). If ts(W) > W-ts(xj), consistent with the serialization order in- the dm-write is processed as usual (xj is duced by 2PL. That is, ts(Tj) < ts(Tk) if Tj updated). If, however, ts(W) < W-ts(xj), W must precede Tk in any serialization in- is ignored. duced by 2PL. To see this, let T1 and Tn be The interesting property of this method a pair of transactions such that T~ must is that writelocks never conflict with write- precede T, in any serialization. Thus there locks. The writelocks obtained by prewrites exist transactions T1, T2 .... T,q, T, such are only used for rw synchronization, and that for i = 1.... , n-1 (a) T,'s locked point only conflict with readlocks. This permits Computing Sm~zeys, Vol. 13, No. 2, June 1981 212 . P. A. Bernstein and N. Goodman transactions to execute concurrently to is to request a ww writelock on x. When the completion even if their writesets intersect. lock is granted, L-ts(x) is returned to T; Such concurrency is never possible in a this is the interface role of P. Also when the pure 2PL method. lock is granted, P is buffered and the rw synchronization mechanism is informed 5.3,3 Mixed Methods Using T/O for rw Syn- that a dm-write with timestamp greater chronizatton than L-ts(x) is pending. This is its rw role. When T has obtained all of its writelocks, There are also 12 principal methods that ts(T) is calculated as in Section 5.3.1 and T use T/O for rw synchronization and 2PL begins its main execution. T attaches ts(T) for ww synchronization: to its dm-reads and dm-writes and rw syn- Method rw technique ww technique chronization is performed by multiversion T/O, as follows: 13 Basic T/O Basic 2PL 14 Basic T/O Primary copy 2PL 1. Let R be a dm-read(x). If there is a 15 Basic T/O Voting 2PL buffered prewrite(x) (other than one is- 16 Basic T/O Centralized 2PL 17 Multiversion T/O Basic 2PL sued by T), and if L-ts(x) < ts(T), then 18 Multiversion T/O Primary copy 2PL R is buffered. Else R is output and reads 19 Multlversion T/O Voting 2PL the version of x with largest timestamp 20 Multiversion T/O Centralized 2PL less than ts(T). 21 Conservative T/O Basic 2PL 2. Let W be a din-write(x). W is output 22 Conservative T/O Primary copy 2PL 23 Conservative T/O Voting 2PL immediately and creates a new version 24 Conservative T/O Centralized 2PL of x with timestamp ts(T). 3. When W is output, its prewrite is debuf- These methods all require predeclaration fered, and its writelock on x is released. of writelocks. Since T/O is used for rw This causes L-ts(x) to be updated to synchronization, transactions must be as- max(L-ts(x), ts(T)) -- ts(T). signed timestamps before they issue dm- reads. However, the timestamp generation One interesting property of this method technique of Section 5.3.1 requires that a is that restarts are needed only to prevent transaction be at its locked point before it or break deadlocks caused by ww synchro- is assigned its timestamp. Hence every nization; rw conflicts never cause restarts. transaction must be at its locked point be- This property cannot be attained by a pure fore it issues any dm-reads; in other words, 2PL method. It can be attained by pure every transaction must obtain all of its T/O methods, but only if conservative T/O writelocks before it begins its main execu- is used for rw synchronization; in many tion. cases conservative T/O introduces exces- To illustrate these methods, we describe sive delay or is otherwise infeasible. Method 17. This method requires that each The behavior of this method for queries stored data item have a set of R-ts's and a is also interesting. Since queries set no set of (W-ts, value) pairs (i.e., versions). writelocks, the timestamp generation rule The L-ts of any data item is the maximum does not apply to them. Hence the system of its R-ts's and W-ts's. is free to assign any timestamp it wishes to Before beginning its main execution, a query. It may assign a small timestamp, transaction T issues prewrites on every in which case the query will read old data copy of every data item in its writeset. 7 but is unlikely to be delayed by buffered These prewrites play a role in ww synchro- prewrites; or it may assign a large time- nization, rw synchronization, and the inter- stamp, in which case the query will read face between these techniques. current data but is more likely to be de- Let P be a prewrite(x). The ww role of P layed. No matter which timestamp is se- lected, however, a query can never cause an update to be rejected. This property 7 Since new values for the data items in the writeset cannot be easily attained by any pure 2PL are not yet known, these prewrites do not instruct DMs to store values on secure storage, they merely or T/O method. "warn" DMs to "expect" the corresponding dm-wntes We also observe that this method creates See footnote 3. versions in timestamp order, and so sys- Computing Surveys, Vol 13, No 2, June 1981 Concurrency Control in Database Systems • 213 tematic forgetting of old versions is possible from algorithm to algorithm, system to sys- (see Section 5.2.2). In addition, the method tem, and application to application. This requires only maximum R-ts's; smaller ones impact is not understood in detail, and a may be instantly forgotten. comprehensive quantitative analysis of per- formance is beyond the state of the art. Recent theses by Garcia-Mo!ina [GARc79a] CONCLUSION and Reis [REm79a] have taken first steps We have presented a framework for the toward such an analysis but there clearly design and analysis of distributed database remains much to be done. concurrency control algorithms. The frame- We hope, and indeed recommend, that work has two main components: (1) a sys- future work on distributed concurrency tem model that provides common termi- control will concentrate on the performance nology and concepts for describing a variety of algorithms. There are, as we have seen, of concurrency control algorithms, and (2) many known methods; the question now is a problem decomposition that decomposes to determine which are best. concurrency control algorithms into read- write and write-write synchronization sub- APPENDIX. OTHER CONCURRENCY algorithms. CONTROL METHODS We have considered synchronization sub- algorithms outside the context of specific In this appendix we describe three concur- concurrency control algorithms. Virtually rency control methods that do not fit the all known database synchronization algo- framework of Sections 3-5: the certifier rithms are variations of two basic tech- methods of Badal [BADA79], Bayer et al. niques-two-phase locking (2PL) and [BAYE80], and Casanova [CASA79], the ma- timestamp ordering (T/O). We have de- jority consensus algorithm of Thomas scribed the principal variations of each [THoM79], and the ring algorithm of Ellis technique, though we do not claim to have [ELLI77]. We argue that these methods are exhausted all possible variations. In addi- not practical in DDBMSs. The certifier tion, we have described ancillary problems methods look promising for centralized {e.g., deadlock resolution) that must be DBMSs, but severe technical problems solved to make each variation effective. must be overcome before these methods We have shown how to integrate the can be extended correctly to distributed described techniques to form complete con- systems. The Thomas and Ellis algorithms, currency control algorithms. We have listed by contrast, are among the earliest algo- 47 concurrency control algorithms, describ- rithms proposed for DDBMS concurrency ing 25 in detail. This list includes almost all control. These algorithms introduced sev- concurrency control algorithms described eral important techniques into the field but, previously in the literature, plus several as we will see, have been surpassed by new ones. This extreme consolidation of the recent developments. state of the art is possible in large part because of our framework set up earlier. A1. Certifiers The focus of this paper has primarily A 1.1 The Certification Approach been the structure and correctness of syn-, chronization techniques and concurrency In the certification approach, din-reads and control algorithms. We have left open a prewrites are processed by DMs first-come/ very important issue, namely, performance. first-served, with no synchronization what- The main performance metrics for con- soever. DMs do maintain summary infor- currency control algorithms are system mation about rw and ww conflicts, which throughput and transaction response time. they update every time an operation is Four cost factors influence these metrics: processed. However, din-reads and pre- intersite communication, local processing, writes are never blocked or rejected on the transaction restarts, and transaction block- basis of the discovery of such a conflict. ing. The impact of each cost factor on sys- Synchronization occurs when a transac- tem throughput and response time varies tion attempts to terminate. When a trans-

Computing Surveys, Vo|. 13, No. 2, June 1981 214 • P. A. Bernstein and N. Goodman action T issues its END, the DBMS decides when they terminate. The main problem in whether or not to certify, and thereby com- the summarization algorithm is avoiding mit, T. the need to store information about al- To understand how this decision is made, ready-certified transactions. The main we must distinguish between "total" and problem in the certification algorithm is "committed" executions. A total execution obtaining a consistent copy of the summary of transactions includes the execution of all information. To do so the certification al- operations processed by the system up to a gorithm often must perform some synchro- particular moment. The committed execu- nization of its own, the cost of which must tion is the portion of the total execution be included in the cost of the entire method. that only includes din-reads and din-writes processed on behalf of committed transac- A1.2 Certificatton Using the--~ Relatton tions. That is, the committed execution is the total execution that would result from One certification method is to construct the aborting all active transactions (and not ---) relation as dm-reads and prewrites are restarting them). processed. To certify a transaction, the sys- When T issues its END, the system tests tem checks that ---> is acyclic [BADA79, whether the committed execution aug- BAYE80, CASA79]. s mented by T's execution is serializable, that To construct --% each site remembers the is, whether after committing T the resulting most recent transaction that read or wrote committed execution would still be serial- each data item. Suppose transactions T, izable. If so, T is committed; otherwise T is and T~ were the last transactions to (re- restarted. spectively) read and write data item x. If There are two properties of certification transaction Tk now issues a din-read(x), that distinguish it from other approaches. Tj --* Tk is added to the summary infor- First, synchronization is accomplished en- mation for the site and Tk replaces T, as tirely by restarts, never by blocking. And the last transaction to have read x. Thus second, the decision to restart or not is pieces of--* are distributed among the sites, made after the transaction has finished ex- reflecting run-time conflicts at each site. ecuting. No concurrency control method To certify a transaction, the system must discussed in Sections 3-5 satisifies both check that the transaction does not lie on these properties. a cycle in --* (see Theorem 2, Section 2). The rationale for certification is based on Guaranteeing acyclicity is sufficient to an optimistic assumption regarding run- guarantee serializability. time conflicts: if very few run-time conflicts There are two problems with this ap- are expected, assume that most executions proach. First, it is in general not correct to are serializable. By processing din-reads delete a certified transaction from --), even and prewrites without synchronization, the if all of its updates have been committed. concurrency control method never delays a For example, if T, --) Tj and T, is active but transaction while it is being processed. Only Tj is committed, it is still possible for Tj ---) a (fast, it is hoped) certification test when T, to develop; deleting Tj would then cause the transaction terminates is required. the cycle T, --~ Tj ---) T, to go unnoticed Given optimistic transaction behavior, the when T, is certified. However, it is ob- test will usually result in committing the viously not feasible to allow ---) to grow transaction, so there are very few restarts. indefinitely. This problem is solved by Ca- Therefore certification simultaneously sanova [CASA79] by a method of encoding avoids blocking and restarts in optimistic information about committed transactions situations. in space proportional to the number of ac- A certification concurrency control tive transactions. method must include a summarization al- A second problem is that all sites must gorithm for storing information about dm- be checked to certify any transaction. Even reads and prewrites when they are proc- essed and a certification algorithm for us- 8 In BAYE80 certification is only used for rw synchro- ing that information to certify transactions nization whereas 2PL is used for ww synchronization.

Computing Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems ° 215 sites at which the transaction never ac- action terminates, the system augments the cessed data must participate in the cycle update list with the.list of data items read checking of--*. For example, suppose we and their timestamps at the time they were want to certify transaction T. T might be read. In addition, the timestamp of the involved in a cycle T --. T1 --) T2 --) ... --* transaction itselfis added to the update list. Tn-1 --> Tn ---> T, where each conflict Tk --) This completes the firstphase of execution. Tk+l occurred at a different site. Possibly T In the second phase the update list is only accessed data at one site; yet the --) sent to every site. Each site (including the relation must be examined at n sites to site that produced the update list)votes on certify T. This problem is currently un- the update list.Intuitively speaking, a site solved, as far as we know. That is, any votes yes on an update list if it can certify correct certifier based on this approach of the transaction that produced it. After a checking cycles in --) must access the --~ site votes yes, the update list is said to be relation at all sites to certify each and every pending at that site. To cast the vote, the transaction. Until this problem is solved, site sends a message to the transaction's we judge the certification approach to be home site,which, when it receives a major- impractical in a distributed environment. ity of yes or no votes, informs all sites of the outcome. If a majority voted yes, then all sites are required to commit the update, A2. Thomas' Majority Consensus Algorithm which is then installed using TWR. If a A2.1 The Algorithm majority voted no, all sites are told to dis- card the update, and the transaction is re- One of the first published algorithms for started. distributed concurrency control is a certifi- The rule that determines when a site may cation method described in THOM79. vote "yes" on a transaction is pivotal to the Thomas introduced several important syn- correctness of the algorithm. To vote on an chronization techniques in that algorithm, update list U, a site compares the time- including the Thomas Write Rule (Section stamp of each data item in the readset of U 3.2.3), majority voting (Section 3.1.1), and with the timestamp of that same data item certification (Appendix A1). Although in the site'slocal database. If any data item these techniques are valuable when consid- has a timestamp in the database different ered in isolation, we argue that the overall from that in U, the site votes no. Otherwise, Thomas algorithm is not suitable for dis- the site compares the readset and writeset tributed databases. We first describe the of U with the readset and writeset of each algorithm and then comment on its appli- pending update listat that site,and if there cation to distributed databases. is no rw conflict between U and any of the Thomas' algorithm assumes a fully re- pending update lists,it votes yes. If there is dundant database, with every logical data an rw conflict between U and one of those item stored at every site. Each copy carries pending requests, the site votes pass (ab- the timestamp of the last transaction that stain) if U's timestamp is larger than that wrote into it. of all pending update lists with which it Transactions execute in two phases. In conflicts. If there is an rw conflict but U's the first phase each transaction executes timestamp is smaller than that of the con- locally at one site called the transaction's flicting pending update list, then it sets U home site. Since the database is fully re- aside on a wait queue and tries again when dundant, any site can serve as the home the conflictingrequest has either been com- site for any transaction. The transaction is mitted or aborted at that site. assigned a unique timestamp when it begins The voting rule is essentially a certifica- executing. During execution it keeps a rec- tion procedure. By making the timestamp ord of the timestamp of each data item it comparison, a site is checking that the read- reads and, when its executes a write on a set was not written into since the transac- data item, processes the write by recording tion read it. If the comparisons are satisfied, the new value in an update list. Note that the situation is as if the transaction had each transaction must read a copy of a data locked its readset at that site and held the item before it writes into it. When the trans- locks until it voted. The voting rule is Computing Stttw~ys, Vgl. 13, N% 2, June 1981 216 • P. A. Bernstein and N. Goodman thereby guaranteeing rw synchronization The second part of the voting rule guaran- with a certification rule approximating rw tees that for every pair of conflicting trans- 2PL. (This fact is proved precisely in actions that received a majority of yes BEm~79b.) votes, all sites that voted yes on both trans- The second part of the voting rule, in actions voted on the two transactions in the which U is checked for rw conflicts against same order. This makes the certification pending update lists, guarantees that con- step behave just as it would if it were cen- flicting requests are not certified concur- tralized, thereby avoiding the problem ex- rently. An example illustrates the problem. emplified in the previous paragraph. Suppose T1 reads X and Y, and writes Y, while T2 reads X and Y, and writes X. A2.3 Partially Redundant Databases Suppose T1 and T2 execute at sites A and B, respectively, and X and Y have time- For the majority consensus algorithm to be stamps of 0 at both sites. Assume that T1 useful in a distributed database environ- and T~ execute concurrently and produce ment, it must be generalized to operate update lists ready for voting at about the correctly when the database is only par- same time. Either T~ or T2 must be re- tially redundant. There is reason to doubt started, since neither read the other's out- that such a generalization can be accom- put; if they were both committed, the result plished without either serious degradation would be nonserializable. However both of performance or a complete change in the Tl's and T2's update lists will (concurrently) set of techniques that are used. satisfy the timestamp comparison at both First, the majority consensus decision A and B. What stops them from both ob- rule apparently must be dropped, since the taining unanimous yes votes is the second voting algorithm depends on the fact that part of the voting rule. After a site votes on all sites perform exactly the same certifi- one of the transactions, it is prevented from cation test. In a partially redundant data- voting on the other transaction until the base, each site would only be comparing first is no longer pending. Thus it is not the timestamps of the data items stored at possible to certify conflicting transactions that site, and the significance of the major- concurrently. (We note that this problem ity vote would vanish. of concurrent certification exists in the al- If majority voting cannot be used to syn- gorithms of Section A1.2, too. This is yet chronize concurrent certification tests, ap- another technical difficulty with the certi- parently some kind of mutual exclusion fication approach in a distributed environ- mechanism must be used instead. Its pur- ment.) pose would be to prevent the concurrent, With the second part of the voting rule, and therefore potentially incorrect, certifi- the algorithm behaves as if the certification cation of two conflicting transactions, and step were atomically executed at a primary would amount to locking. The use of locks site. If certification were centralized at a for synchronizing the certification step is primary site, the certification step at the not in the spirit of Thomas' algorithm, since primary site would serve the same role as a main goal of the algorithm was to avoid the majority decision in the voting case. locking. However, it is worth examining such a locking mechanism to see how cer- tification can be correctly accomplished in A2.2 Correctness a partially redundant database. To process a transaction T, a site pro- No simple proof of the serializability of duces an update list as usual. However, Thomas' algorithm has ever been demon- since the database is partially redundant, it strated, although Thomas provided a de- may be necessary to read portions of T's tailed "plausibility" argument in THOM79. readset from other sites. After T termi- The first part of the voting rule can cor- nates, its update list is sent to every site rectly be used in a centralized concurrency that contains part of T's readset or writeset. control method since it implies 2PL To certify an update list, a site first sets [BERN79b], and a centralized method based local locks on the readset and writeset, and on this approach was proposed in KUNG81. then (as in the fully redundant case) it

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems * 217 compares the update list's timestamps with correctly running, the algorithm runs the database's timestamps. If they are iden- smoothly. Thus, handling a site failure is tical, it votes yes; otherwise it votes no. A free, insofar as the voting procedure is con- unanimous vote of yes is needed to commit cerned. the updates. Local locks cannot be released However, from current knowledge, this until the voting decision is completed. justification is not compelling for several While this version of Thomas' algorithm reasons. First, although there is no cost for partially redundant data works cor- when a site fails, substantial effort may be rectly, its performance is inferior to stand- required when a site recovers. A centralized ard 2PL. This algorithm requires that the algorithm using backup sites, as in same locks be set as in 2PL, and the same ALSB76a, lacks the symmetry of Thomas' deadlocks can arise. Yet the probability of algorithm, but may well be more efficient restart is higher than in 2PL, because even due to the simplicity of site recovery. In after all locks are obtained the certification addition, the majority consensus algorithm step can still vote no (which cannot happen does not consider the problem of atomic in 2PL). commitment and it is unclear how one One can improve this algorithm by des- would integrate two-phase commit into the ignating a primary copy of each data item algorithm. and only performing the timestamp com- Overall, the reliability threats that are parison against the primary copy, making handled by the majority consensus algo- it analogous to primary copy 2PL. How- rithm have not been explicitly listed, and ever, for the same reasons as above, we alternative solutions have not been ana- would expect primary copy 2PL to outper- lyzed. While voting is certainly a possible form this version of Thomas' algorithm too. technique for obtaining a measure of relia- We therefore must leave open the prob- bility, the circumstances under which it is lem of producing an efficient version of cost-effective are unknown. Thomas' algorithm for a partially redun- dant database. A3. Ellis' Ring Algorithm Another early solution to the problem of A2.4 Performance distributed database concurrency control is Even in the fully redundant case, the per- the ring algorithm [ELLI77]. Ellis was prin- formance of the majority consensus algo- cipally interested in a proof technique, rithm is not very good. First, repeating the called L systems, for proving the correct- certification and conflict detection at each ness of concurrent algorithms. He devel- site is more than is needed to obtain seri- oped his concurrency control method pri- alizability: a centralized certifier would marily as an example to illustrate L-system work just as well and would only require proofs, and never made claims about its that certification be performed at one site. performance. Because the algorithm was Second, the algorithm is quite prone to only intended to illustrate mathematical restarts when there are run-time conflicts, techniques, Ellis imposed a number of re- since restarts are the only tactic available strictions on the algorithm for mathemati- for synchronizing transactions, and so will cal convenience, which make it infeasible in only perform well under the most optimistic practice. Nonetheless, the algorithm has circumstances. Finally, even in optimistic received considerable attention in the lit- situations, the analysis in GARC79a indi- erature, and in the interest of completeness, cates that centralized 2PL outperforms the we briefly discuss it. majority consensus algorithm. Ellis' algorithm solves the distributed concurrency control problem with the fol- A2.5 Rehablhty lowing restrictions: Despite the performance problems of the (1) The database must be fully redundant. majority consensus algorithm, one can try (2) The communication medium must be to justify the algorithms on reliability a ring, so each site can only communi- grounds. As long as a majority of sites are cate with its successor on the ring.

Computing Surveys, Vol. 13, No 2, June 1981 218 • P. A. Bernstein and N. Goodman

(3) Each site-to-site communication link is ALSB76a ALSBERG, P. A, AND DAY, J.D. "A prin- pipelined. ciple for resilient of distributed resources," in Proc. 2nd Int. Conf. Soft- (4) Each site can supervise no more than ware Eng., Oct. 1976, pp. 562-570. one active update transaction at a time. ALSB76b ALSBERG, P. A., BELFORD, G.C., DAY, J. (5) To update any copy of the database, a D., AND GRAPLA, E. "Multi-copy resil- transaction must first obtain a lock on iency techniques," Center for Advanced Computation, AC Document No. 202, the entire database at all sites. Univ. Illinois at Urbana-Champaign, May The effect of restriction 5 is to force all 1976. BADA78 BADAL, D. Z., AND POPEK, G.J. "A pro- transactions to execute serially; no concur- posal for distributed concurrency control rent processing is ever possible. For this for partially redundant distributed data reason alone, the algorithm is fundamen- base system," in Pron. 3rd Berkeley tally impractical. Workshop D~str~buted Data Manage- ment and Computer Networks, 1978, pp. To execute, an update transaction mi- 273-288 grates around the ring, (essentially) obtain- BADA79 BADAL, D. Z. "Correctness of concur- ing a lock on the entire database at each rency control and implications in distrib- site. However, the lock conflict rules are uted databases," in Proc COMPSAC 79 nonstandard. A lock request from a trans- Conf., Chicago, Ill., Nov. 1979. BADAS0 BADAL, D.Z. "On the degree of concur- action that originated at site A conflicts at rency provided by concurrency control site C with a lock held by a transaction that mechanisms for distributed databases," in originated from site B if B = C and either Proc. Int Symp. D~stributed Databases, A ffi B or A's priority < B's priority. The Versailles, France, March 1980. BAYER, R., HELLER, H., AND REISER, daisy-chain communication induced by the BAYE80 A. "Parallelism and recovery in data- ring combined with this locking rule pro- base systems," ACM Trans. Database duces a deadlock-free algorithm that does Syst. 5, 2 (June 1980), 139-156. not require deadlock detection and never BZLP76 BELFORD, G. C., SCHWARTZ, P. M., AND SLUIZER, S. "The effect of back-up induces restarts. A detailed description of strategy on database availability," CAC the algorithm appears in GARC79a. Document No. 181, CCTCWAD Docu- There are several problems with this al- ment No. 5515, Center for Advanced gorithm in a distributed database environ- Computation, Univ. Ilhnom at Urbana- ment. First, as mentioned above, it forces Champaign, Urbana, Feb. 1976. BERN78a BERNSTEIN, P. A., GOODMAN, N., ROTH- transactions to execute serially. Second, it NXE, J B., AND PAPADIMITRIOU, C. only applies to a fully redundant database. A. "The concurrency control mecha- And third, the daily-chain communication nism of SDD-I: A system for dmtributed requires that each transaction obtain its databases (the fully redundant case)," lock at one site at a time, which causes IEEE Trans. Softw. Eng. SE-4, 3 (May 1978), 154-168. communication delay to be (at least) lin- BERN79a BERNSTEIN, P. A., AND GOODMAN, early proportional to the number of sites in N. "Approaches to concurrency control the system. in dmtributed databases," in Pron. 1979 A modified version of Ellis' algorithm Natl. Computer Conf., AFIPS Press, Ar- lington, Va., June 1979. that mitigates the firstproblem is proposed BERN79b BERNSTEIN, P. A., SHIPMAN, D. W., AND in GARC79a. Even with this improvement, WONO, W.S. "Formal Aspects of Sen- performance analysis indicates that the ring alizability in Database Concurrency Con- algorithm is inferior to centralized 2PL. trol," IEEE Trans. Softw Eng. SE-5, 3 And, of course, the modified algorithm still (May 1979), 203-215. BERN80a BERNSTEIN, P. A., AND GOODMAN, suffers from the last two problems. N. "Timestamp based algorithms for concurrency control in distributed data- ACKNOWLEDGMENT base systems," Proc 6th Int. Conf. Very This work was supported by Rome Air Development Large Data Bases, Oct. 1980. Center under contract F30602-79-C-0191. BERN80b BERNSTEIN, P. A., GOODMAN, N., AND LAI, M.Y. "Two Part Proof Schema for Database Concurrency Control," in Proc REFERENCES 5th Berkeley Workshop D~str~buted Data AHO75 AHO, A. V., HOPCROFT, E., AND ULLMAN, Management and Computer Networks, J. D. The design and analysts of com- Feb. 1980. puter algorithms, Addison-Wesley, Read- BERNS0c BERNSTEIN, P. A, AND SHIPMAN, D. ing, Mass., 1975. W "The correctness of concurrency Computing Surveys, Vol. 13, No. 2, June 1981 Concurrency Control in Database Systems • 219

control mechanisms in a system for dis- trollers," in Proe. 4th Berkeley Workshop tributed databases (SDD-1)," in ACM D~stnbuted Databases and Computer Trans. Database Syst. 5, 1 (March 1980), Networks, Aug. 1979. 52-68. GARC79C GARCIA-MOLINA, H. "Centrahzed con- BERN80d BERNSTEIN, P, SHIPMAN, D. W., AND trol update algorithms for fully redundant ROTHNIE, J.B. "Concurrency control m distributed databases," in Proe. 1st Int. a system for distributed databases (SDD- Conf. D~stributed Computing Systems 1)," in ACM Trans. Database Syst 5, 1 (IEEE), New York, Oct. 1979, pp. 699- (March 1980), 18-51. 705. BERN81 BERNSTEIN, P. A, GOODMAN,N., WONG, GARD77 GARDARIN,G., AND LEBAUX,P. "Sched- E, REEVE, C. L., AND ROTHNIE, J. uling algorithms for avoiding inconsis- B. "Query processing m SDD-I," ACM tency in large databases," in Proc. 1977 Trans. Database Syst. 6, 2, to appear. Int. Conf. Very Large Data Bases BREI79 BREITWIESER, H., AND KERSTEN, (IEEE), New York, pp. 501-516. U. "Transaction and catalog manage- GELE78 GELEMBE, E., AND SEVCIE, K. "Analysis merit of the distributed file management of update synchronization for multiple system DISCO," in Proc. Very Large copy databases," in Proc. 3rd Berkeley Data Bases, Rm de Janerio, 1979 Workshop Distributed Databases and BRIN73 BRINCH-HANSEN, P. Operating system Computer Networks, Aug. 1978. pnnc~ples, Prentice-Hall, Englewood GIFF79 GIFFORD, D. K. "Weighted voting for Cliffs, N. J., 1973. rephcated data," in Proc. 7th Syrup. Op- CASA79 CASANOVA, M. A. "The concurrency erating Systems Principles, Dec. 1979. control problem for database systems," GRAY75 GRAY, J. N., LORIE, R. A., PUTZULO, G. Ph.D. dissertation, Harvard Univ., Tech. R., AND TRAIGER, I.L. "Granularity of Rep. TR-17-79, Center for Research in locks and degrees of consistency in a Computmg Technology, 1979. shared database," IBM Res. Rep. RJ1654, CHAM74 CHAMBERLIN, D. D., BOYCE, R. F., AND Sept. 1975. TRAIGER, I.L. "A deadlock-free scheme GRAY78 GRAY, J.N. "Notes on database oper- for resource allocation in a database en- ating systems," in Operating Systems: An wronment," Info. Proc. 74, North-Hol- Advanced Course, vol. 60, Lecture Notes land, Amsterdam, 1974. in , Springer-Verlag, CHEN80 CHENG, W. K., AND BELFORD, G. New York, 1978, pp. 393-481. C. "Update Synchromzation in Distrib- HAMM80 HAMMER, M. M., AND SHIPMAN, D. uted Databases," in Proc. 6th Int. Conf. W. "Reliability mechanisms for SDD-I: Very Large Data Bases, Oct. 1980. A system for distributed databases," ACM DEPP76 DEPPE, M. E., AND FRY, J. Trans. Database Syst. 5, 4 (Dec. 1980), P. "Distributed databases' A summary 431-466. of research," in Computer networks, vol. HEWI74 HEWITT, C.E. "Protection and synchro- 1, no. 2, North-Holland, Amsterdam, nization in actor systems," Working Paper Sept. 1976. No. 83, M.I.T. Artificial Intelligence Lab., DIJK71 DIJKSTRA, E.W. "Hmrarchical ordering Cambridge, Mass., Nov. 1974. of sequential processes," Acta Inf. 1, 2 HOAR74 HOARE, C. A.R. "Monitors. An operat- (1971), 115-138. ing system structuring concept," Com- ELLI77 ELLIS, C.A. "A robust algorithm for up- mun. ACM 17, 10 (Oct. 1974), 549-557. dating duphcate databases," in Proe 2nd HOLT72 HOLT, R.C. "Some deadlock propemes Berkeley Workshop D~str~buted Data- of computer systems," Comput. Surv. 4, 3 bases and Computer Networks, May (Dec. 1972) 179-195. 1977. KANE79 KANEKO, A., NISHIHARA,Y., TSURUOKA, ESWA76 ESWARAN, K. P., GRAY, J. N., LORIE, R. K., AND HATTORI, M. "Logical clock A., AND TRAIGER,I.L. "The notions of synchronization method for duplicated consistency and predicate locks in a da- database control," in Proe. 1st Int. Conf. tabase system." Commun. ACM 19, 11 D~stributed Computing Systems (IEEE), (Nov. 1976), 624-633. New York, Oct. 1979, pp. 601-611. GARC78 GARCIA-MOLINA, H "Performance KAWA79 KAWAZU,S, MINAMI,ITOH, S., AND TER- comparisons of two update algorithms for ANAKA, K. "Two-phase deadlock detec- distributed databases," in Proc. 3rd tion algorithm in distributed databases," Berkeley Workshop D~stmbuted Data- in Proc. 1979 Int. Conf. Very Large Data bases and Computer Networks, Aug. Bases (IEEE), New York. 1978. KING74 KING, P. P., AND COLLMEYER, A GARC79a GARCIA-MOLINA, H. "Performance of J. "Database sharing--an efficient update algorithms for replicated data in a method for supporting concurrent pro- distributed database," Ph.D. dlssertatmn, cesses," in Proc. 1974 Nat. Computer Computer Science Dept., Stanford Umv., Conf., vol. 42, AFIPS Press, Arlington, Stanford, Calif., June 1979. Va., 1974. GARc79b GARCIA-MOLINA, H. "A concurrency KUNG79 KUNG, H. T, AND PAPADIMITRIOU, C. control mechanism for distributed data H. "An optimality theory of concur- bases winch use centralized locking con- rency control for databases," in Proe. 1979 ComputingSurveys, VoL 13, No. 2, June 1981 220 • P.A. Bernstein and N. Goodman

ACM-SIGMOD Int. Conf Management algorithm," in Proc. 4th Berkeley Work- of Data, June 1979. shop Dtstributed Data Management and KUNG81 KUNG, H. T., AND ROBINSON, J.T. "On Computer Networks, Aug. 1979. optimistic methods for concurrency con- REED78 REED, D.P. "Naming and synchroniza- trol," ACM Trans. Database Syst. 6, 2, tion m a decentralized computer system~ (June 81), 213-226. Ph.D. dissertation, Dept. of Electrical En- LAMP76 LAMPSON, B., AND STURGIS, H. "Crash gineering, M.I.T., Cambridge, Mass., recovery in a chstnbuted data storage sys- Sept., 1978. tem," Tech. Rep., Computer Science Lab., REIS79a REIS, D. "The effect of concurrency Xerox Palo Alto Research Center, Palo control on database management system Alto, Calif., 1976. performance," Ph.D. dissertation, Com- LAMP78 LAMPORT,L. "Time, clocks and ordering puter Science Dept., Univ. Califorma, of events in a distributed system," Com- Berkeley, April 1979. mun. ACM 21, 7 (July 1978), 558-565. RExs79b REIS, D. "The effects of concurrency LELA78 LELANN, G. "Algorithms for distributed control on the performance of a distrib- data-sharing sytems which use tickets," m uted database management system," in Proe. 3rd Berkeley Workshop Distrib- Proc. 4th Berkeley Workshop D~strtbuted uted Databases and Computer Networks, Data Management and Computer Net. Aug. 1978. works, Aug. 1979. hN79 LIN, W. K. "Concurrency control in RosE79 ROSEN, E.C. "The updating protocol of multiple copy distributed data base sys- the ARPANET's new routing algorithm: tem," in Proc 4th Berkeley Workshop A case study in maintaining identical cop- Distributed Data Management and Com- ies of a changing distributed data base," puter Networks, Aug. 1979. in Proc. 4th Berkeley Workshop Dlstrtb. MENA79 MENASCE, D. A., AND MUNTZ, R. uted Data Management and Computer R. "Locking and deadlock detection in Networks, Aug. 1979. chstributed databases," IEEE Trans. ROSE78 ROSENKRANTZ, D. J., STEARNS, R E., Softw. Eng. SE-5, 3 (May 1979), 195-202. AND LEWIS, P.M. "System level concur- MENAS0 MENASCE, D. A., POPEK, G. J., AND rency control for distributed database sys- MUNTZ, R.R. "A locking protocol for tems," ACM Trans. Database Syst. 3, 2 resource coordination in distributed da- (June 1978), 178-198. tabases," ACM Trans. Database Syst. 5, ROTH77 ROTHNIE, J. B., AND GOODMAN, N. "A 2 (June 1980), 103-138. survey of research and development in MINO78 MINOURA, T. "Maximally concurrent distributed databases systems," in Proe transaction processing," in Proc. 3rd 3rd Int. Conf. Very Large Data Bases Berkeley Workshop D~str~buted Data- (IEEE), Tokyo, Japan, Oct. 1977. bases and Computer Networks, Aug. SCHL78 SCHLAGETER, G. "Process synchromza- 1978. tion in database systems." ACM Trans. MINO79 MINOURA, T. "A new concurrency con- Database Syst. 3, 3 (Sept. 1978), 248-271. trol algorithm for distributed data base SEQU79 SEQUIN, J., SARGEANT,G., AND WILNES, systems," in Proc. 4th Berkeley Work- P. "A majority consensus algorithm for shop D~stributed Data Management and the eonsmtency of duplicated and distrib- Computer Networks, Aug. 1979. uted information," in Proc. 1st Int. Conf. MONT78 MONTGOMERY, W. A. "Robust concur- Distributed Computing Systems (IEEE), rency control for a distributed informa- New York, Oct. 1979, pp. 617-624. tion system," Ph.D. dissertation, Lab. for SHAP77a SHAPIRO, R. M., AND MILLSTEIN, R. Computer Science, M.I.T., Cambridge, E. "Rehability and fault recovery in dis- Mass, Dee. 1978. tributed processing," in Oceans '77 Conf PAPA77 PAPADIMITRIOU,C. H., BERNSTEIN, P A, Record, vol II, Los Angeles, 1977. AND ROTHNIE, J. B. "Some computa- SHAP77b SHAPIRO, R. M., AND MILLSTEIN, R. tmnal problems related to database con- E. "NSW reliability plan," Massachu- currency control," in Proe. Conf. Theoret- setts Tech. Rep. 7701-1411, Computer As- wal Computer Scwnee, Waterloo, Ont., sociates, Wakefield, Mass., June 1977. Canada, Aug. 1977. SILBS0 SILBERSCHATZ, A., AND KEDEM, PAPA79 PAPADIMITRIOU, C. H. "Seriallzability Z. "Consistency in hierarchical database of concurrent updates," J. ACM 26, 4 systems," J. ACM 27, 1 (Jan. 1980), 72- (Oct. 1979), 631-653. 80. RAHI79 RAHIMI, S K., AND FRANTS, W.R. "A STEA76 STEARNS, R. E., LEWIS, P. M. II, AND posted update approach to concurrency ROSENKRANTZ,D.d. "Concurrency con- control in distributed database systems," troll for database systems," in Proe. 17th in Proe. 1st Int. Conf. Dtstr~buted Com- Syrup. Foundatmns Computer Science puting Systems (IEEE), New York, Oct. (IEEE), 1976, pp. 19-32. 1979, pp. 632-641. STEAS1 STEARNS, R. E., AND ROSENKRANTZ, RAMI79 RAMIREZ, R. J , AND SANTORO, J. "Distributed database concurrency N. "Distributed control of updates in controls using fore-values," in Proc 1981 multiple-copy data bases: A time optimal SIGMOD Conf. (ACM).

Computing Surveys, Vol. 13, No 2, June 1981 Concurrency Control in Database Systems • 221

STON77 STONEBRAKER, M., AND NEUHOLD~ 3. Performance: BADA80, GARC78, GARC79a, E. "A distributed database version of GARC79b, GELE78, REIS79a, RExs79b, ROTH77 INGRES," in Proc. 2nd Berkeley Work- 4. Reliabihty shop D~stributed Data Management and General: ALSB76a, ALSB76b, BELF76, BERN79a, Computer Networks, May 1977. HAMMS0, LAMP76 STON79 STONEBRAKER, M. "Concurrency con- Two-phase commzt: HAMM80, LAMP76 trol and consistency of multiple copies of 5. Timestamp-ordered scheduling (T/O) data in distributed INGRES, IEEE General: BADA78, BERN78a, BERN80a, BERN80b, Trans. Soflw. Eng. SE-5, 3 (May 1979), BERN80d, LELA78, LIN79, RAMI79 188-194. Thomas' Wrtte Rule: THOM79 THOM79 THOMAS, R.H. "A solution to the con- Multivers~on t~mestamp ordering: MONT78, currency control problem for multiple REED78 copy databases," in Proc. 1978 COMP- T~mestamp and clock management: LAMP78, CON Conf. (IEEE), New York. THOM79 VERH78 VERHOFSTAD, J. S. M. "Recovery and 6. Two-phase locking (2PL) crash resmtance in a filing system," in General. BERN79b, BREI79, ESWA76, GARD77, Proc. SIGMOD Int Conf. Management GRAY75, GRAY78, PAPA79, SCHL78, SILB80, of Data (ACM), New York, 1977, pp 158- STEA81 167. D~str~buted 2PL: MENA80, MINO79, ROSE78, STON79 A Partial Index of References Primary copy 2PL: STOle77, STON79 1. Cert~fwrs: BADA79, BAYE80, CASA79, KUNG81, Centralized 2PL: ALSB76a, ALSB76b, GARc79b, PAPA79, THOM79 GARC79C 2. Concurrency control theory: BERN79b, BERN80C, Voting 2PL: GIFF79, SEQU79, THOM79 CASA79, ESWA76, KUNG79, MXNO78, PAPA77, Deadlock detection/prevention: GRAY78, KXNG74, PAPA79, SCHL78, SILB80, STEA76 KAWA79, ROSE78, STON79

Received April 1980; final revision accepted February 1981

Computing Surveys, Vol. 13, No 2, June 1981