Concurrency Control and Recovery

Michael J. Franklin

Department of Computer Science and UMIACS

University of Maryland

College Park, MD

Introduction

Many service-oriented businesses and organizations, such as banks, airlines, catalog retailers, hospitals, etc., have grown to depend on fast, reliable, and correct access to their mission-critical data on a constant basis. In many cases, particularly for global enterprises, 7 x 24 access is required; that is, the data must be available seven days a week, twenty-four hours a day. Database Management Systems (DBMSs) are often employed to meet these stringent performance, availability, and reliability demands. As a result, two of the core functions of a DBMS are to protect the data stored in the database and to provide correct and highly available access to that data in the presence of concurrent access by large and diverse user populations, despite various software and hardware failures. The responsibility for these functions resides in the concurrency control and recovery components of the DBMS software. Concurrency control ensures that individual users see consistent states of the database even though operations on behalf of many users may be interleaved by the database system. Recovery ensures that the database is fault tolerant; that is, that the database state is not corrupted as the result of a software, system, or media failure. The existence of this functionality in the DBMS allows applications to be written without explicit concern for concurrency and fault tolerance. This freedom provides a tremendous increase in programmer productivity and allows new applications to be added more easily and safely to an existing system.

Portions of this chapter are reprinted with permission from: M. Franklin, M. Zwilling, C. Tan, M. Carey, and D. DeWitt, "Crash Recovery in Client-Server EXODUS," Proc. ACM International Conference on Management of Data (SIGMOD), San Diego, June 1992. (c) 1992 by the Association for Computing Machinery, Inc. (ACM).

For database systems, correctness in the presence of concurrent access and/or failures is tied to the notion of a transaction. A transaction is a unit of work, possibly consisting of multiple data accesses and updates, that must commit or abort as a single atomic unit. When a transaction commits, all updates it performed on the database are made permanent and visible to other transactions. In contrast, when a transaction aborts, all of its updates are removed from the database, and the database is restored (if necessary) to the state it would have been in if the aborting transaction had never been executed. Informally, transaction executions are said to respect the ACID properties [Gray]:

Atomicity: This is the "all-or-nothing" aspect of transactions discussed above: either all operations of a transaction complete successfully, or none of them do. Therefore, after a transaction has completed (i.e., committed or aborted), the database will not reflect a partial result of that transaction.

Consistency: Transactions preserve the consistency of the data: a transaction performed on a database that is internally consistent will leave the database in an internally consistent state. Consistency is typically expressed as a set of declarative integrity constraints. For example, a constraint may be that the salary of an employee cannot be higher than that of his or her manager.

Isolation: A transaction's behavior is not impacted by the presence of other transactions that may be accessing the same database concurrently. That is, a transaction sees only a state of the database that could occur if that transaction were the only one running against the database, and produces only results that it could produce if it were running alone.

Durability: The effects of committed transactions survive failures. Once a transaction commits, its updates are guaranteed to be reflected in the database even if the contents of volatile (e.g., main memory) or non-volatile (e.g., disk) storage are lost or corrupted.

Of these four transaction properties, the concurrency control and recovery components of a DBMS are primarily concerned with preserving Atomicity, Isolation, and Durability. The preservation of the Consistency property typically requires additional mechanisms, such as compile-time analysis or run-time triggers, in order to check adherence to integrity constraints (1). For this reason, this chapter focuses primarily on the A, I, and D of the ACID transaction properties.

(1) In the case of triggers, the recovery mechanism is typically invoked to abort an offending transaction.

Transactions are used to structure complex processing tasks which consist of multiple data accesses and updates. A traditional example of a transaction is a money transfer from one bank account (say, account A) to another (say, B). This transaction consists of a withdrawal from A and a deposit into B, and requires four accesses to account information stored in the database: a read and write of A, and a read and write of B. The data accesses of this transaction are as follows:

Transfer()
01  A_bal := Read(A)
02  A_bal := A_bal - $100
03  Write(A, A_bal)
04  B_bal := Read(B)
05  B_bal := B_bal + $100
06  Write(B, B_bal)

The value of A in the database is read and decremented by $100, then the value of B in the database is read and incremented by $100. Thus, Transfer preserves the invariant that the sum of the balances of A and B prior to its execution must equal the sum of the balances after its execution, regardless of whether the transaction commits or aborts. Consider the importance of the atomicity property. At several points during the Transfer transaction, the database is in a temporarily inconsistent state. For example, between the time that account A is updated (statement 03) and the time that account B is updated (statement 06), the database reflects the decrement of A but not the increment of B, so it appears as if $100 has disappeared from the database. If the transaction reaches such a point and then is unable to complete (e.g., due to a failure or an unresolvable conflict, etc.), then the system must ensure that the effects of the partial results of the transaction (i.e., the update to A) are removed from the database; otherwise the database state will be incorrect. The durability property, in contrast, only comes into play in the event that the transaction successfully commits. Once the user is notified that the transfer has taken place, he or she will assume that account B contains the transferred funds and may attempt to use those funds from that point on. Therefore, the DBMS must ensure that the results of the transaction (i.e., the transfer of the $100) remain reflected in the database state even if the system crashes.

Atomicity, consistency, and durability address correctness for serial execution of transactions, where only a single transaction at a time is allowed to be in progress. In practice, however, database management systems typically support concurrent execution, in which the operations of multiple transactions can be executed in an interleaved fashion. The motivation for concurrent execution in a DBMS is similar to that for multiprogramming in operating systems, namely, to improve the utilization of system hardware resources and to provide multiple users a degree of fairness in access to those resources. The isolation property of transactions comes into play when concurrent execution is allowed.

Consider a second transaction that computes the sum of the balances of accounts A and B:

ReportSum()
01  A_bal := Read(A)
02  B_bal := Read(B)
03  Print(A_bal + B_bal)

Assume that initially, the balance of account A is $300 and the balance of account B is $200. If a ReportSum transaction is executed on this state of the database, it will print a result of $500. In a database system restricted to serial execution of transactions, ReportSum will also produce the same result if it is executed after a Transfer transaction. The atomicity property of transactions ensures that if the Transfer aborts, all of its effects are removed from the database (so ReportSum would see A = $300 and B = $200), and the durability property ensures that if it commits, then all of its effects remain in the database state (so ReportSum would see A = $200 and B = $300).

Under concurrent execution, however, a problem could arise if the isolation property is not enforced. As shown in Figure 1, if ReportSum were to execute after Transfer has updated account A but before it has updated account B, then ReportSum could see an inconsistent state of the database. In this case, the execution of ReportSum sees a state of the database in which $100 has been withdrawn from account A but has not yet been deposited in account B, resulting in a total of $400; it seems that $100 has disappeared from the database. This result is not one that could be obtained in any serial execution of Transfer and ReportSum transactions. It occurs because, in this example, ReportSum accessed the database when it was in a temporarily inconsistent state. This problem is sometimes referred to as the "inconsistent retrieval" problem. To preserve the isolation property of transactions, the DBMS must prevent the occurrence of this and other potential anomalies that could arise due to concurrent execution. The formal notion of correctness for concurrent execution in database systems is known as serializability, and is described in the "Underlying Principles" section of this chapter.

Transfer                                  ReportSum
01  A_bal := Read(A)
02  A_bal := A_bal - $100
03  Write(A, A_bal)
                                          01  A_bal := Read(A)      (value is $200)
                                          02  B_bal := Read(B)      (value is $200)
                                          03  Print(A_bal + B_bal)  (result = $400)
04  B_bal := Read(B)
05  B_bal := B_bal + $100
06  Write(B, B_bal)

Figure 1: An Incorrect Interleaving of Transfer and ReportSum

Although the transaction processing literature often traces the history of transactions back to antiquity (such as Sumerian tax records) or to early contract law [Gray, Gray, Kort], the roots of the transaction concept in information systems are typically traced back to the early 1970s and the work of Bjork and Davies [Bjor, Davi]. Early systems such as IBM's IMS addressed related issues, but a systematic treatment and understanding of ACID transactions was developed several years later by members of the IBM System R group [Gray, Eswa] and others (e.g., [Rose, Lome]). Since that time, many techniques for implementing ACID transactions have been proposed, and a fairly well accepted set of techniques has emerged. The remainder of this chapter contains an overview of the basic theory that has been developed, as well as a survey of the more widely known implementation techniques for concurrency control and recovery. A brief discussion of work on extending the simple transaction model is presented at the end of the chapter.

It should be noted that issues related to those addressed by concurrency control and recovery in database systems arise in other areas of computing systems as well, such as file systems and memory systems. There are, however, two salient aspects of the ACID model that distinguish transactions from other approaches. First is the incorporation of both isolation (concurrency control) and fault tolerance (recovery) issues. Second is the concern with treating multiple write and/or read operations on multiple data items as an atomic, isolated unit of work. While these aspects of the ACID model provide powerful guarantees for the protection of data, they also can induce significant systems implementation complexity and performance overhead. For this reason, the notion of ACID transactions and their associated implementation techniques have remained largely within the DBMS domain, where the provision of highly available and reliable access to mission-critical data is a primary concern.

Underlying Principles

Concurrency Control

Serializability

As stated in the previous section, the responsibility for maintaining the isolation property of ACID transactions resides in the concurrency control portion of the DBMS software. The most widely accepted notion of correctness for concurrent execution of transactions is serializability. Serializability is the property that a (possibly interleaved) execution of a group of transactions has the same effect on the database, and produces the same output, as some serial (i.e., non-interleaved) execution of those transactions. It is important to note that serializability does not specify any particular serial order, but rather only that the execution is equivalent to some serial order. This distinction makes serializability a slightly less intuitive notion of correctness compared to transaction initiation time or commit order, but it provides the DBMS with significant additional flexibility in the scheduling of operations. This flexibility can translate into increased responsiveness for end users.

A rich theory of database concurrency control has been developed over the years (see [Papa, Bern, Gray]), and serializability lies at the heart of much of this theory. In this chapter we focus on the simplest models of concurrency control, where the operations that can be performed by transactions are restricted to read(x), write(x), commit, and abort. The operation read(x) retrieves the value of a data item from the database, write(x) modifies the value of a data item in the database, and commit and abort indicate successful or unsuccessful transaction completion, respectively (with the concomitant guarantees provided by the ACID properties). We also focus on a specific variant of serializability called conflict serializability. Conflict serializability is the most widely accepted notion of correctness for concurrent transactions because there are efficient, easily implementable techniques for detecting and/or enforcing it. Another well-known variant is called view serializability. View serializability is less restrictive (i.e., it allows more legal schedules) than conflict serializability, but it and other variants are primarily of theoretical interest because they are impractical to implement. The reader is referred to [Papa] for a detailed treatment of alternative serializability models.

Transaction Schedules

Conflict serializability is based on the notion of a schedule of transaction operations. A schedule for a set of transaction executions is a partial ordering of the operations performed by those transactions, which shows how the operations are interleaved. The ordering defined by a schedule can be partial in the sense that it is only required to specify two types of dependencies:

- All operations of a given transaction for which an order is specified by that transaction must appear in that order in the schedule. For example, the definition of ReportSum above specifies that account A is read before account B.

- The ordering of all conflicting operations from different transactions must be specified. Two operations are said to conflict if they both operate on the same data item and at least one of them is a write.

The concept of a schedule provides a mechanism to express and reason about the (possibly concurrent) execution of transactions. A serial schedule is one in which all the operations of each transaction appear consecutively. For example, the serial execution of Transfer followed by ReportSum is represented by the following schedule:

    r0[A] → w0[A] → r0[B] → w0[B] → c0 → r1[A] → r1[B] → c1        (1)

In this notation, each operation is represented by its initial letter, the subscript of the operation indicates the transaction number of the transaction on whose behalf the operation was performed, and a capital letter in brackets indicates a specific data item from the database (for read and write operations). A transaction number (tn) is a unique identifier that is assigned by the DBMS to an execution of a transaction. In the example above, the execution of Transfer was assigned tn 0 and the execution of ReportSum was assigned tn 1. A right arrow (→) between two operations indicates that the left-hand operation is ordered before the right-hand one. The ordering relationship is transitive; the orderings implied by transitivity are not explicitly drawn.

For example, the interleaved execution of Transfer and ReportSum shown in Figure 1 would produce the following schedule:

    r0[A] → w0[A] → r1[A] → r1[B] → c1 → r0[B] → w0[B] → c0        (2)

The formal definition of serializability is based on the concept of equivalent schedules. Two schedules are said to be equivalent if:

- They contain the same transactions and operations, and

- They order all conflicting operations of non-aborting transactions in the same way.

Given this notion of equivalent schedules, a schedule is said to be serializable if and only if it is equivalent to some serial schedule. For example, the following concurrent schedule is serializable because it is equivalent to schedule 1:

    r0[A] → w0[A] → r1[A] → r0[B] → w0[B] → c0 → r1[B] → c1        (3)

In contrast, the interleaved execution of schedule 2 is not serializable. To see why, notice that in any serial execution of Transfer and ReportSum, either both writes of Transfer will precede both reads of ReportSum, or vice versa. However, in schedule 2, w0[A] → r1[A] but r1[B] → w0[B]. Schedule 2, therefore, is not equivalent to any possible serial schedule of the two transactions, so it is not serializable. This result agrees with our intuitive notion of correctness, because recall that schedule 2 resulted in the apparent loss of $100.

Testing for Serializability

A schedule can easily be tested for serializability through the use of a precedence graph. A precedence graph is a directed graph that contains a vertex for each committed transaction execution in a schedule (non-committed executions can be ignored). The graph contains an edge from transaction execution Ti to transaction execution Tj (i ≠ j) if there is an operation in Ti that is constrained to precede an operation of Tj in the schedule. A schedule is serializable if and only if its precedence graph is acyclic. Figure 2(a) shows the precedence graph for schedule 2. That graph has an edge T0 → T1 because the schedule contains w0[A] → r1[A], and an edge T1 → T0 because the schedule contains r1[B] → w0[B]. The cycle in the graph shows that the schedule is non-serializable. In contrast, Figure 2(b) shows the precedence graph for schedule 3. In this case, all ordering constraints are from T0 to T1, so the precedence graph is acyclic, indicating that the schedule is serializable.

[Figure: (a) T0 and T1 with edges in both directions, forming a cycle; (b) a single edge from T0 to T1]

Figure 2: Precedence Graphs for (a) Non-Serializable and (b) Serializable Schedules
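This test is straightforward to mechanize. The following sketch (Python, not from the chapter; the tuple-based schedule encoding is an assumption made for illustration) builds the precedence graph for a schedule of committed transactions and checks it for cycles:

from itertools import combinations

def serializable(schedule):
    """schedule: time-ordered list of (op, tn, item) tuples with op in
    {'r', 'w'}, restricted to committed transactions. Returns True iff
    the precedence graph is acyclic (conflict serializability)."""
    edges = set()
    for (op1, t1, x1), (op2, t2, x2) in combinations(schedule, 2):
        # operations conflict if they come from different transactions,
        # touch the same item, and at least one of them is a write
        if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
            edges.add((t1, t2))      # the earlier op orders t1 before t2

    nodes = {t for (_, t, _) in schedule}
    WHITE, GREY, BLACK = 0, 1, 2     # DFS colors for cycle detection
    color = dict.fromkeys(nodes, WHITE)

    def has_cycle(u):
        color[u] = GREY
        for (a, b) in edges:
            if a == u and (color[b] == GREY
                           or (color[b] == WHITE and has_cycle(b))):
                return True
        color[u] = BLACK
        return False

    return not any(color[n] == WHITE and has_cycle(n) for n in nodes)

# Schedule 2 (non-serializable) and schedule 3 (serializable):
s2 = [('r',0,'A'), ('w',0,'A'), ('r',1,'A'), ('r',1,'B'), ('r',0,'B'), ('w',0,'B')]
s3 = [('r',0,'A'), ('w',0,'A'), ('r',1,'A'), ('r',0,'B'), ('w',0,'B'), ('r',1,'B')]
print(serializable(s2), serializable(s3))   # prints: False True

Running it on schedules 2 and 3 reproduces the analysis above: schedule 2 induces a cycle between T0 and T1, while schedule 3 does not.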

There are a number of practical ways to implement conflict serializability. These and other implementation issues are addressed in the "Best Practices" section of this chapter. Before discussing implementation issues, however, we first survey the basic principles underlying database recovery.

Recovery

Coping With Failures

Recall that the responsibility for the atomicity and durability properties of ACID transactions lies in the recovery component of the DBMS. For recovery purposes it is necessary to distinguish between two types of storage: volatile storage, such as main memory, whose state is lost in the event of a system crash or power outage, and non-volatile storage, such as magnetic disks or tapes, whose contents persist across such events. The recovery subsystem is relied upon to ensure correct operation in the presence of three different types of failures (listed in order of likelihood):

- Transaction Failure: When a transaction that is in progress reaches a state from which it cannot successfully commit, all updates that it made must be removed from the database in order to preserve the atomicity property. This is known as transaction rollback.

- System Failure: If the system fails in a way that causes the loss of volatile memory contents, recovery must ensure that the updates of all transactions that had committed prior to the crash are reflected in the database, and that all updates of other transactions (aborted or in progress at the time of the crash) are removed from the database.

- Media Failure: In the event that data is lost or corrupted on the non-volatile storage (e.g., due to a disk-head crash), the on-line version of the data is lost. In this case, the database must be restored from an archival version of the database and brought up to date using operation logs.

In this chapter we focus on the issues of rollback and crash recovery, the most frequent uses of the DBMS recovery subsystem. Recovery from media crashes requires substantial additional mechanisms and complexity beyond what is covered here. Media recovery is addressed in the recovery-related references listed at the end of this chapter.

Buffer Management Issues

The process of removing the effects of an incomplete or aborted transaction for preserving atomicity is known as UNDO. The process of reinstating the effects of a committed transaction for durability is known as REDO. The amount of work that a recovery subsystem must perform for either of these functions depends on how the DBMS buffer manager handles data that is updated by in-progress and/or committing transactions [Haer, Bern]. Recall that the buffer manager is the DBMS component that is responsible for coordinating the transfer of data between main memory (i.e., volatile storage) and disk (i.e., non-volatile storage). The unit of storage that can be written atomically to non-volatile storage is called a page. Updates are made to copies of pages in the (volatile) buffer pool, and those copies are written out to non-volatile storage at a later time. If the buffer manager allows an update made by an uncommitted transaction to overwrite the most recent committed value of a data item on non-volatile storage, it is said to support a STEAL policy (the opposite is called NO-STEAL). If the buffer manager ensures that all updates made by a transaction are reflected on non-volatile storage before the transaction is allowed to commit, then it is said to support a FORCE policy (the opposite is NO-FORCE).

Support for the STEAL policy implies that in the event that a transaction needs to be rolled back (due to transaction failure or system crash), UNDOing the transaction will involve restoring the values of any non-volatile copies of data that were updated by that transaction back to their previous committed state. In contrast, a NO-STEAL policy guarantees that the data values on non-volatile storage are valid, so they do not need to be restored. A NO-FORCE policy raises the possibility that some committed data values may be lost during a system crash because there is no guarantee that they have been placed on non-volatile storage. This means that substantial REDO work may be required to preserve the durability of committed updates. In contrast, a FORCE policy ensures that the committed updates are placed on non-volatile storage, so that in the event of a system crash, the updates will still be reflected in the copy of the database on non-volatile storage.

From the above discussion it should be apparent that a buffer manager that supports the combination of NO-STEAL and FORCE would place the fewest demands on UNDO and REDO recovery. However, these policies may negatively impact the performance of the DBMS during normal operation (i.e., when there are no crashes or rollbacks) because they restrict the flexibility of the buffer manager. NO-STEAL obligates the buffer manager to retain updated data in memory until a transaction commits, or to write that data to a temporary location on non-volatile storage (e.g., a swap area). The problem with a FORCE policy is that it can impose significant disk write overhead during the critical path of a committing transaction. For these reasons, many buffer managers support the STEAL and NO-FORCE (STEAL/NO-FORCE) policies.

Logging

In order to deal with the UNDO and REDO requirements imposed by the STEAL and NO-FORCE policies, respectively, database systems typically rely on the use of a log. A log is a sequential file that stores information about transactions and the state of the system at certain instances. Each entry in the log is called a log record. One or more log records are written for each update performed by a transaction. When a log record is created, it is assigned a Log Sequence Number (LSN), which serves to uniquely identify that record in the log. LSNs are typically assigned in a monotonically increasing fashion so that they provide an indication of relative position in the log. When an update is made to a data item in the buffer, a log record is created for that update. Many systems write the LSN of this new log record into the page containing the updated data item. Recording LSNs in this fashion allows the recovery system to relate the state of a data page to logged updates in order to tell if a given log record is reflected in a given state of a page.
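For example, the redo decision during recovery can hinge on a single LSN comparison. The following fragment is a hedged sketch of this idea (the record and page layouts are hypothetical, not taken from any particular system):

from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int        # monotonically increasing log sequence number
    page_id: int
    redo: callable  # action that re-applies the update to a page

@dataclass
class Page:
    page_id: int
    page_lsn: int   # LSN of the latest update applied to this page

def maybe_redo(rec: LogRecord, page: Page) -> None:
    # the page already reflects this update iff page_lsn >= rec.lsn
    if page.page_lsn < rec.lsn:
        rec.redo(page)
        page.page_lsn = rec.lsn   # record that the update is now applied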

Log records are also written for transaction management activities such as the commit or abort of a transaction. In addition, log records are sometimes written to describe the state of the system at certain periods of time. For example, such log records are written as part of the checkpointing process. Checkpoints are taken periodically during normal operation to help bound the amount of recovery work that would be required in the event of a crash. Part of the checkpointing process involves the writing of one or more checkpoint records. These records can include information about the contents of the buffer pool and the transactions that are currently active, etc. The particular contents of these records depends on the method of checkpointing that is used. Many different checkpointing methods have been developed, some of which involve quiescing the system to a consistent state, while others are less intrusive. A particularly non-intrusive type of checkpointing is used by the ARIES recovery method [Mohab] that is described in the "Best Practices" section of this chapter.

For transaction update operations, there are two basic types of logging: physical and logical [Gray]. Physical log records typically indicate the location (e.g., the position on a particular page) of modified data in the database. If support for UNDO is provided (i.e., a STEAL policy is used), then the value of the item prior to the update is recorded in the log record. This is known as the before image of the item. Similarly, the after image (i.e., the new value of the item after the update) is logged if REDO support is provided. Thus, physical log records in a DBMS with STEAL/NO-FORCE buffer management contain both the old and new data values of items. Recovery using physical log records has the property that recovery actions (i.e., UNDOs or REDOs) are idempotent, meaning that they have the same effect no matter how many times they are applied. This property is important if recovery is invoked multiple times, as will occur if a system fails repeatedly (e.g., due to a power problem or a faulty device).

Logical logging (sometimes referred to as operational logging) records only high-level information about operations that are performed, rather than recording the actual changes to items (or storage locations) in the database. For example, the insert of a new tuple into a relation might require many physical changes to the database, such as space allocation, index updates, and reorganization, etc. Physical logging would require log records to be written for all of these changes. In contrast, logical logging would simply log the fact that the insertion had taken place, along with the value of the inserted tuple. The REDO process for a logical logging system must determine the set of actions that are required to fully reinstate the insert. Likewise, the UNDO logic must determine the set of actions that make up the inverse of the logged operation.

Logical logging has the advantage that it minimizes the amount of data that must be written to the log. Furthermore, it is inherently appealing because it allows many of the implementation details of complex operations to be hidden in the UNDO/REDO logic. In practice, however, recovery based on logical logging is difficult to implement because the actions that make up the logged operation are not performed atomically. That is, when a system is restarted after a crash, the database may not be in an action-consistent state with respect to a complex operation: it is possible that only a subset of the updates made by the action had been placed on non-volatile storage prior to the crash. As a result, it is difficult for the recovery system to determine which portions of a logical update are reflected in the database state upon recovery from a system crash. In contrast, physical logging does not suffer from this problem, but it can require substantially higher logging activity.

In practice, systems often implement a compromise between physical and logical approaches that has been referred to as physiological logging [Gray]. In this approach, log records are constrained to refer to a single page, but may reflect logical operations on that page. For example, a physiological log record for an insert on a page would specify the value of the new tuple that is added to the page, but would not specify any free-space manipulation or reorganization of data on the page resulting from the insertion; the REDO and UNDO logic for insert would be required to infer the necessary operations. If a tuple insert required updates to multiple pages (e.g., data pages plus multiple index pages), then a separate physiological log record would be written for each page updated. Physiological logging avoids the action-consistency problem of logical logging, while reducing, to some extent, the amount of logging that would be incurred by physical logging. The ARIES recovery method is one example of a recovery method that uses physiological logging.
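To make the contrast concrete, the following hypothetical log-record layouts sketch how the same tuple insert might be logged under each approach (all field names and values are invented for illustration, not taken from any real system):

# Physical: one record per byte range changed, with before/after images.
physical_records = [
    {"type": "physical", "page": 12, "offset": 512,
     "before": b"\x00" * 40, "after": b"<tuple bytes>"},
    {"type": "physical", "page": 12, "offset": 8,       # slot-directory fix-up
     "before": b"\x03\x00", "after": b"\x04\x00"},
]

# Logical: a single record for the whole multi-page operation.
logical_record = {
    "type": "logical", "op": "INSERT", "relation": "account",
    "tuple": ("A", 300),
}

# Physiological: one logical-looking record per page touched by the insert.
physiological_records = [
    {"type": "physiological", "page": 12, "op": "PAGE_INSERT",
     "tuple": ("A", 300)},
    {"type": "physiological", "page": 47, "op": "INDEX_INSERT",  # index page
     "key": "A", "rid": (12, 4)},
]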

Write-Ahead Logging (WAL)

A final recovery principle to be addressed in this section is the Write-Ahead Logging (WAL) protocol. Recall that the contents of volatile storage are lost in the event of a system crash. As a result, any log records that are not reflected on non-volatile storage will also be lost during a crash. WAL is a protocol that ensures that, in the event of a system crash, the recovery log contains sufficient information to perform the necessary UNDO and REDO work when a STEAL/NO-FORCE buffer management policy is used. The WAL protocol ensures that:

1. All log records pertaining to an updated page are written to non-volatile storage before the page itself is allowed to be overwritten in non-volatile storage.

2. A transaction is not considered to be committed until all of its log records (including its commit record) have been written to stable storage.

The first point ensures that UNDO information required due to the STEAL policy will be present in the log in the event of a crash. Similarly, the second point ensures that any REDO information required due to the NO-FORCE policy will be present in the non-volatile log. The WAL protocol is typically enforced with special support provided by the DBMS buffer manager.
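As a rough illustration of that support, the sketch below shows a toy buffer/log manager enforcing both WAL rules; all names (WALManager, flush_log_through, disk_write, page_lsn) are assumptions for this sketch, not a real DBMS interface:

def disk_write(page):
    # stub standing in for a real non-volatile page write
    pass

class WALManager:
    """Toy write-ahead-logging enforcement (illustrative only)."""
    def __init__(self):
        self.log = []            # volatile log tail
        self.flushed_lsn = -1    # highest LSN already on non-volatile storage

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1            # the new record's LSN

    def flush_log_through(self, lsn):
        # write all log records with LSN <= lsn to non-volatile storage
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def write_page(self, page):
        # WAL rule 1: the page's log records reach the disk before the page
        self.flush_log_through(page.page_lsn)
        disk_write(page)

    def commit(self, txn_id):
        # WAL rule 2: the commit record is stable before commit is reported
        commit_lsn = self.append(("COMMIT", txn_id))
        self.flush_log_through(commit_lsn)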

Best Practices

Concurrency Control

Two-Phase Locking

The most prevalent implementation technique for concurrency control is locking. Typically, two types of locks are supported: shared locks (S) and exclusive locks (X). The compatibility of these locks is defined by the compatibility matrix shown in Table 1. The compatibility matrix shows that two different transactions are allowed to hold S locks simultaneously on the same data item, but that X locks cannot be held on an item simultaneously with any other locks (by other transactions) on that item. S locks are used for protecting read access to data (i.e., multiple concurrent readers on an item are allowed) and X locks are used for protecting write access to data. As long as a transaction is holding a lock, no other transaction is allowed to obtain a conflicting lock. If a transaction requests a lock that cannot be granted (due to a lock conflict), that transaction is blocked (i.e., prohibited from proceeding) until all the conflicting locks held by other transactions are released.

     S   X
S    y   n
X    n   n

Table 1: Compatibility Matrix for S and X Locks

S and X locks as defined in Table 1 directly model the semantics of the conflicts used in the definition of conflict serializability. Therefore, locking can be used to enforce serializability. Rather than testing for serializability after a schedule has been produced, as was done in the previous section, the blocking of transactions due to lock conflicts can be used to prevent non-serializable schedules from ever being produced.

A transaction is said to be well-formed with respect to reads if it always holds an S or an X lock on an item while reading it, and well-formed with respect to writes if it always holds an X lock on an item while writing it. Unfortunately, restricting all transactions to be well-formed is not sufficient to guarantee serializability. For example, a non-serializable execution such as that of schedule 2 is still possible using well-formed transactions. Serializability can be enforced, however, through the use of two-phase locking (2PL). Two-phase locking requires that all transactions be well-formed and that they respect the following rule:

    Once a transaction has released a lock, it is not allowed to obtain any additional locks.

This rule results in transactions that have two phases:

1. A growing phase, in which the transaction is acquiring locks, and

2. A shrinking phase, in which locks are released.

The two-phase rule dictates that the transaction shifts from the growing phase to the shrinking phase at the instant it first releases a lock. To see how 2PL enforces serializability, consider again schedule 2. Recall that the problem arises in this schedule because w0[A] → r1[A] but r1[B] → w0[B]. This schedule could not be produced under 2PL because transaction 1 (ReportSum) would be blocked when it attempted to read the value of A, because transaction 0 would be holding an X lock on it. Transaction 0 would not be allowed to release this X lock before obtaining its X lock on B, and thus it would either abort or perform its update of B before transaction 1 is allowed to progress. In contrast, note that schedule 1 (the serial schedule) would be allowed in 2PL. 2PL would also allow the following serializable interleaved schedule:

    r1[A] → r0[A] → r1[B] → c1 → w0[A] → r0[B] → w0[B] → c0        (4)

It is important to note, however, that two-phase locking is sufficient but not necessary for implementing serializability. In other words, there are schedules that are serializable but would not be allowed by two-phase locking. Schedule 3 is an example of such a schedule.

In order to implement 2PL, the DBMS contains a component called a lock manager. The lock manager is responsible for granting or blocking lock requests, for managing queues of blocked transactions, and for unblocking transactions when locks are released. In addition, the lock manager is also responsible for dealing with deadlock situations. A deadlock arises when a set of transactions is blocked, each waiting for another member of the set to release a lock. In a deadlock situation, none of the transactions involved can make progress. Database systems deal with deadlocks using one of two general techniques: avoidance or detection. Deadlock avoidance can be achieved by imposing an order in which locks can be obtained on data items, by requiring transactions to predeclare their locking needs, or by aborting transactions rather than blocking them in certain situations.

Deadlock detection, on the other hand, can be implemented using timeouts or explicit checking. Timeouts are the simplest technique: if a transaction is blocked beyond a certain amount of time, it is assumed that a deadlock has occurred. The choice of a timeout interval can be problematic, however. If it is too short, then the system may infer the presence of a deadlock that does not truly exist. If it is too long, then deadlocks may go undetected for too long a time. Alternatively, the system can explicitly check for deadlocks using a structure called a waits-for graph. A waits-for graph is a directed graph with a vertex for each active transaction. The lock manager constructs the graph by placing an edge from a transaction Ti to a transaction Tj (i ≠ j) if Ti is blocked waiting for a lock held by Tj. If the waits-for graph contains a cycle, all of the transactions involved in the cycle are waiting for each other, and thus they are deadlocked. When a deadlock is detected, one or more of the transactions involved is rolled back. When a transaction is rolled back, its locks are automatically released, so the deadlock will be broken.
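The pieces just described fit together as sketched below (illustrative Python only; a production lock manager would also handle lock upgrades, queue fairness, and victim selection). Lock requests are granted or queued according to the compatibility matrix of Table 1, and deadlocks show up as cycles in the waits-for graph derived from the queues:

from collections import defaultdict

COMPAT = {("S", "S"): True, ("S", "X"): False,
          ("X", "S"): False, ("X", "X"): False}   # Table 1

class LockManager:
    def __init__(self):
        self.held = defaultdict(dict)      # item -> {txn: mode}
        self.waiting = defaultdict(list)   # item -> [(txn, mode)]

    def request(self, txn, item, mode):
        """Grant the lock if compatible with all current holders;
        otherwise enqueue the request (the transaction blocks)."""
        holders = self.held[item]
        if all(COMPAT[(m, mode)] for t, m in holders.items() if t != txn):
            holders[txn] = mode            # grant (upgrades not modeled)
            return True
        self.waiting[item].append((txn, mode))
        return False                       # caller blocks

    def waits_for_edges(self):
        # waiter -> holder edges derived from the lock table
        return {(w, h)
                for item, reqs in self.waiting.items()
                for (w, _) in reqs
                for h in self.held[item]
                if h != w}

    def deadlocked(self):
        """Detect cycles in the waits-for graph by repeatedly removing
        transactions that are not waiting on anyone remaining; whatever
        is left participates in a cycle."""
        edges = self.waits_for_edges()
        nodes = {t for e in edges for t in e}
        changed = True
        while changed:
            changed = False
            for n in list(nodes):
                if not any(w == n and h in nodes for (w, h) in edges):
                    nodes.discard(n)
                    changed = True
        return nodes                       # non-empty set => deadlock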

Isolation Levels

As should be apparent from the previous discussion, transaction isolation comes at a cost in potential concurrency (2). Transaction blocking can add significantly to transaction response time. As stated previously, serializability is typically implemented using two-phase locking, which requires locks to be held at least until all necessary locks have been obtained. Prolonging the holding time of locks increases the likelihood of blocking due to data contention.

(2) Note that other, non-blocking approaches discussed later in this section also suffer from similar problems.

In some applications, however, serializability is not strictly necessary. For example, a data analysis program that computes aggregates over large numbers of tuples may be able to tolerate some inconsistent access to the database in exchange for improved performance. The concept of degrees of isolation or isolation levels has been developed to allow transactions to trade consistency for concurrency in a controlled manner [Gray, Gray, Bere]. In their paper, Gray et al. defined four degrees of consistency, using characterizations based on locking, dependencies, and anomalies (i.e., results that could not arise in a serial schedule). The degrees were named degree 0 through degree 3, with degree 0 being the least consistent and degree 3 intended to be equivalent to serializable execution.

The original presentation has served as the basis for understanding relaxed consistency in many current systems, but it has become apparent over time that the different characterizations in that paper were not specified to an equal degree of detail. As pointed out in a recent paper by Berenson et al. [Bere], the SQL standard suffers from a similar lack of specificity. Berenson et al. have attempted to clarify the issue, but it is too early to determine if they have been successful. In this section we focus on the locking-based definitions of the isolation levels, as they are generally acknowledged to have stood the test of time [Bere]. However, the definition of the degrees of consistency requires an extension to the previous description of locking in order to address the phantom problem.

An example of the phantom problem is the following: assume a transaction Ti reads a set of tuples that satisfy a query predicate. A second transaction Tj inserts a new tuple that satisfies the predicate. If Ti then executes the query again, it will see the new item, so that its second answer differs from the first. This behavior could never occur in a serial schedule, as a "phantom" tuple appears in the midst of a transaction; thus, this execution is anomalous. The phantom problem is an artifact of the transaction model, consisting of reads and writes to individual data items, that we have used so far. In practice, transactions include queries that dynamically define sets of items based on predicates. When a query is executed, all of the tuples that satisfy the predicate at that time can be locked as they are accessed. Such individual locks, however, do not protect against the later addition of further tuples that satisfy the predicate.

One obvious solution to the phantom problem is to lock predicates instead of (or in addition to) individual items [Eswa]. This solution is impractical to implement, however, due to the complexity of detecting the overlap of a set of arbitrary predicates. Predicate locking can be approximated using techniques based on locking clusters of data or ranges of index values. Such techniques, however, are beyond the scope of this chapter. In this discussion we will assume that predicates can be locked, without specifying the technical details of how this can be accomplished (see [Gray, Mohaa] for detailed treatments of this topic).

The locking-oriented definitions of the isolation levels are based on whether or not read and/or write operations are well-formed (i.e., protected by the appropriate lock), and if so, whether those locks are long duration or short duration. Long duration locks are held until the end of a transaction (EOT), i.e., when it commits or aborts; short duration locks can be released earlier. Long duration write locks on data items have important benefits for recovery, namely, they allow recovery to be performed using before images. If long duration write locks are not used, then the following scenario could arise:

    w0[A] → w1[A] → a0

In this case, restoring A with T0's before image of it will be incorrect, because it would overwrite T1's update. Simply ignoring the abort of T0 is also incorrect; in that case, if T1 were to subsequently abort, installing its before image would reinstate the value written by T0. For this reason (and for simplicity), locking systems typically hold long duration locks on data items. This is sometimes referred to as strict locking [Bern].

Given these notions of locks, the degrees of isolation presented in the SQL standard can be obtained using different lock protocols. In the following, all levels are assumed to be well-formed with respect to writes and to hold long duration write (i.e., exclusive) locks on updated data items (3). Four levels are defined, from weakest to strongest:

(3) It should be noted that two-phase locks can be substituted for the long duration locks in these definitions without impacting the consistency provided. Long duration locks are typically used, however, to avoid the recovery-related problems described previously.

READ UNCOMMITTED: This level, which provides the weakest consistency guarantees, allows transactions to read data that has been written by other transactions that have not committed. In a locking implementation, this level is achieved by being ill-formed with respect to reads (i.e., not obtaining read locks). The risks of operating at this level include (in addition to the risks incurred at the more restrictive levels) the possibility of seeing updates that will eventually be rolled back, and the possibility of seeing some of the updates made by another transaction but missing others made by that transaction.

READ COMMITTED: This level ensures that transactions only see updates that have been made by transactions that have committed. This level is achieved by being well-formed with respect to reads on individual data items, but holding the read locks only as short duration locks. Transactions operating at this level run the risk of seeing non-repeatable reads (in addition to the risks of the more restrictive levels). That is, a transaction T0 could read a data item twice and see two different values. This anomaly could occur if a second transaction were to update the item and commit in between the two reads by T0.

REPEATABLE READ: This level ensures that reads to individual data items are repeatable, but does not protect against the phantom problem described previously. This level is achieved by being well-formed with respect to reads on individual data items, and holding those locks for long duration.

SERIALIZABLE: This level protects against all of the problems of the less restrictive levels, including the phantom problem. It is achieved by being well-formed with respect to reads on predicates as well as on individual data items, and holding all locks for long duration.
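In summary, the four levels differ only in how reads are protected. The table below is an illustrative encoding of the locking definitions (not a standard API), with writes always well-formed and long duration:

# (item read locks, predicate read locks): None = not acquired,
# "short" = released before EOT, "long" = held until EOT.
ISOLATION_LEVELS = {
    "READ UNCOMMITTED": {"item_read": None,    "predicate_read": None},
    "READ COMMITTED":   {"item_read": "short", "predicate_read": None},
    "REPEATABLE READ":  {"item_read": "long",  "predicate_read": None},
    "SERIALIZABLE":     {"item_read": "long",  "predicate_read": "long"},
}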

A key aspect of this definition of degrees of isolation is that, as long as all transactions execute at the READ UNCOMMITTED level or higher, they are able to obtain at least the degree of isolation they desire without interference from any transactions running at lower degrees. Thus, these degrees of isolation provide a powerful tool that allows application writers or users to trade off consistency for improved concurrency. As stated earlier, the definition of these isolation levels for concurrency control methods that are not based on locking has been problematic. This issue is addressed in depth in [Bere].

It should be noted that the discussion of locking so far has ignored an important class of data that is typically present in databases, namely, indexes. Because indexes are auxiliary information, they can be accessed in a non-two-phase manner without sacrificing serializability. Furthermore, the hierarchical structure of many indexes (e.g., B-trees) makes them potential concurrency bottlenecks due to high contention at the upper levels of the structure. For this reason, significant effort has gone into developing methods for providing highly concurrent access to indexes. Pointers to some of this work can be found in the "For Further Information" section at the end of this chapter.

Hierarchical Locking

The examples in the preceding discussions of concurrency control primarily dealt with operations on a single granularity of data items (e.g., tuples). In practice, however, the notions of conflicts and locks can be applied at many different granularities. For example, it is possible to perform locking at the granularity of a page, a relation, or even an entire database. In choosing the proper granularity at which to perform locking, there is a fundamental tradeoff between potential concurrency and locking overhead. Locking at a fine granularity, such as an individual tuple, allows for maximum concurrency, as only transactions that are truly accessing the same tuple have the potential to conflict. The downside of such fine-grained locking, however, is that a transaction that accesses a large number of tuples will have to acquire a large number of locks. Each lock request requires a call to the lock manager. This overhead can be reduced by locking at a coarser granularity, but coarse granularity raises the potential for false conflicts. For example, two transactions that update different tuples residing on the same page would conflict under page-level locking but not under tuple-level locking.

The notion of hierarchical or multigranularity locking was introduced to allow concurrent transactions to obtain locks at different granularities in order to optimize the above tradeoff [Gray]. In hierarchical locking, a lock on a granule at a particular level of the granularity hierarchy implicitly locks all items included in that granule. For example, an S lock on a relation implicitly locks all pages and tuples in that relation. Thus, a transaction with such a lock can read any tuple in the relation without requesting additional locks. Hierarchical locking introduces additional lock modes beyond S and X. These additional modes allow transactions to declare their intention to perform an operation on objects at lower levels of the granularity hierarchy. The new modes are IS, IX, and SIX, for Intention Shared, Intention Exclusive, and Shared with Intention Exclusive. An IS (or IX) lock on a granule provides no privileges on that granule, but indicates that the holder intends to obtain S (or X) locks on one or more finer granules. An SIX lock combines an S lock on the entire granule with an IX lock. SIX locks support the common access pattern of scanning the items in a granule (e.g., tuples in a relation) and choosing to update a fraction of them based on their values.

Similarly to S and X locks, these lock modes can be described using a compatibility matrix. The compatibility matrix for these modes is shown in Table 2. In order for transactions locking at different granularities to coexist, all transactions must follow the same hierarchical locking protocol, starting from the root of the granularity hierarchy. This protocol is shown in Table 3.

      IS   IX   S    SIX  X
IS    y    y    y    y    n
IX    y    y    n    n    n
S     y    n    y    n    n
SIX   y    n    n    n    n
X     n    n    n    n    n

Table 2: Compatibility Matrix for Regular and Intention Locks

To Get       Must Have on All Ancestors
IS or S      IS or IX
IX or SIX    IX, SIX, or X

Table 3: Hierarchical Locking Rules

For example, to read a single record, a transaction would obtain IS locks on the database, relation, and page, followed by an S lock on the specific tuple. If a transaction wanted to read all (or most) of the tuples on a page, then it could obtain IS locks on the database and relation, followed by an S lock on the entire page. By following this uniform protocol, potential conflicts between transactions that ultimately obtain S and/or X locks at different granularities can be detected. A useful extension to hierarchical locking is known as lock escalation. Lock escalation allows the DBMS to automatically adjust the granularity at which transactions obtain locks, based on their behavior. If the system detects that a transaction is obtaining locks on a large percentage of the granules that make up a larger granule, it can attempt to grant the transaction a lock on the larger granule, so that no additional locks will be required for subsequent accesses to other objects in that granule. Automatic escalation is useful because the access pattern that a transaction will produce is often not known until runtime.
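A minimal sketch of the protocol in Tables 2 and 3 (Python; the path encoding, function names, and the treatment of X are assumptions made for illustration):

# Ancestor modes that permit acquiring a given mode on a descendant
# (the rules of Table 3; X is assumed to behave like IX/SIX here).
REQUIRED_ON_ANCESTOR = {
    "IS": {"IS", "IX"}, "S": {"IS", "IX"},
    "IX": {"IX", "SIX", "X"}, "SIX": {"IX", "SIX", "X"},
    "X": {"IX", "SIX", "X"},
}

def may_acquire(child_mode, ancestor_mode):
    """True if holding `ancestor_mode` on a granule permits requesting
    `child_mode` on one of its children."""
    return ancestor_mode in REQUIRED_ON_ANCESTOR[child_mode]

def locks_along_path(path, leaf_mode):
    """Granule/mode pairs a transaction requests, root to leaf, to take
    `leaf_mode` on the last granule in `path`."""
    intent = "IS" if leaf_mode in ("IS", "S") else "IX"
    return [(g, intent) for g in path[:-1]] + [(path[-1], leaf_mode)]

# Reading one tuple: IS on the database, relation, and page, then S on it.
print(locks_along_path(["db", "accounts", "page12", "tuple4"], "S"))
# -> [('db', 'IS'), ('accounts', 'IS'), ('page12', 'IS'), ('tuple4', 'S')]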

Other Concurrency Control Methods

As stated previously, two-phase locking is the most generally accepted technique for ensuring serializability. Locking is considered to be a pessimistic technique because it is based on the assumption that transactions are likely to interfere with each other, and takes measures (e.g., blocking) to ensure that such interference does not occur. An important alternative to locking is optimistic concurrency control. Optimistic methods (e.g., [Kung]) allow transactions to perform their operations without obtaining any locks. To ensure that concurrent executions do not violate serializability, transactions must perform a validation phase before they are allowed to commit. Many optimistic protocols have been proposed. In the algorithm of [Kung], the validation process ensures that the reads and writes performed by a validating transaction did not conflict with any other transactions with which it ran concurrently. If, during validation, it is determined that a conflict had occurred, the validating transaction is aborted and restarted.
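The following is a hedged sketch of such a validation step (deliberately simplified and more conservative than the published algorithm, which distinguishes several cases of transaction overlap; all names are illustrative):

def validate(txn, overlapping_committed):
    """txn carries .read_set and .write_set (sets of item ids).
    overlapping_committed: transactions that committed after txn
    started. txn fails validation if any of their writes intersect
    what txn read or wrote."""
    for other in overlapping_committed:
        if other.write_set & (txn.read_set | txn.write_set):
            return False   # conflict detected: caller aborts and restarts
    return True            # safe to commit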

Unlike locking, which depends on blocking transactions to ensure isolation, optimistic policies depend on transaction restart. As a result, although they don't perform any blocking, the performance of optimistic policies can be negatively impacted by data contention, as are pessimistic schemes: a high degree of data contention will result in a large number of unsuccessful transaction executions. The performance tradeoffs between optimistic and pessimistic techniques have been addressed in numerous studies (see [Agra]). In general, locking is likely to be superior in resource-limited environments because blocking does not consume CPU or disk resources. In contrast, optimistic techniques may have performance advantages in situations where resources are abundant, because they allow more executions to proceed concurrently. If resources are abundant, then the resource consumption of restarted transactions will not significantly hurt performance. In practice, however, resources are typically limited, and thus concurrency control in virtually all commercial database systems is based on locking.

Another class of concurrency control techniques is known as multiversion concurrency control (e.g., [Reed]). As updating transactions modify data items, these techniques retain the previous versions of the items on-line. Read-only transactions (i.e., transactions that perform no updates) can then be provided with access to these older versions, allowing them to see a consistent (although possibly somewhat out-of-date) snapshot of the database. Optimistic, multiversion, and other concurrency control techniques (e.g., timestamping) are addressed in further detail in [Bern].

Recovery

The recovery subsystem is generally considered to be one of the more difficult parts of a DBMS to design, for two reasons. First, recovery is required to function in failure situations, and must correctly cope with a huge number of possible system and database states. Second, the recovery system depends on the behavior of many other components of the DBMS, such as concurrency control, buffer management, disk management, and query processing. As a result, few recovery methods have been described in the literature in detail. One exception is the ARIES recovery system developed at IBM [Mohab]. Many details about the ARIES method have been published, and the method has been included in a number of DBMSs. Furthermore, the ARIES method involves only a small number of basic concepts. For these reasons, we focus on the ARIES method in the remainder of this section. The ARIES method is related to many other recovery methods, such as those described in [Bern, Gray]. A comparison with other techniques appears in [Mohab].

Overview of ARIES

ARIES is a fairly recent refinement of the Write-Ahead Logging (WAL) protocol. Recall that the WAL protocol enables the use of a STEAL/NO-FORCE buffer management policy, which means that pages on stable storage can be overwritten at any time and that data pages do not need to be forced to disk in order to commit a transaction. As with other WAL implementations, each page in the database contains a Log Sequence Number (LSN) which uniquely identifies the log record for the latest update that was applied to the page. This LSN (referred to as the pageLSN) is used during recovery to determine whether or not an update for a page must be redone. LSN information is also used to determine the point in the log from which the REDO pass must commence during restart from a system crash. LSNs are often implemented using the physical address of the log record in the log, to enable the efficient location of a log record given its LSN.

Much of the power and relative simplicity of the ARIES algorithm is due to its REDO paradigm of repeating history, in which it redoes updates for all transactions, including those that will eventually be undone. Repeating history enables ARIES to employ a variant of the physiological logging technique described earlier: it uses page-oriented REDO and a form of logical UNDO. Page-oriented REDO means that REDO operations involve only a single page and that the affected page is specified in the log record. This is part of physiological logging. In the context of ARIES, logical UNDO means that the operations performed to undo an update do not need to be the exact inverses of the operations of the original update.

In ARIES, logical UNDO is used to support fine-grained (i.e., tuple-level) locking and high-concurrency index management. For an example of the latter issue, consider a case in which a transaction T1 updates an index entry on a given page P1. Before T1 completes, a second transaction T2 could split P1, causing the index entry to be moved to a new page P2. If T1 must be undone, a physical, page-oriented approach would fail because it would erroneously attempt to perform the UNDO operation on P1. Logical UNDO solves this problem by using the index structure to find the index entry, and then applying the UNDO operation to it in its new location. In contrast to UNDO, page-oriented REDO can be used because the repeating history paradigm ensures that REDO operations will always find the index entry on the page referenced in the log record: any operations that had affected the location of the index entry at the time the log record was created will be replayed before that log record is redone.

ARIES uses a three-pass algorithm for restart recovery. The first pass is the Analysis pass, which processes the log forward from the most recent checkpoint. This pass determines information about dirty pages and active transactions that is used in the subsequent passes. The second pass is the REDO pass, in which history is repeated by processing the log forward from the earliest log record that could require redo, thus ensuring that all logged operations have been applied. The third pass is the UNDO pass. This pass proceeds backwards from the end of the log, removing from the database the effects of all transactions that had not committed at the time of the crash. These passes are shown in Figure 3. Note that the relative ordering of the starting point for the REDO pass, the endpoint for the UNDO pass, and the checkpoint can be different than that shown in the figure. The three passes are described in more detail below.

[Figure: a timeline of the log running from the start of the oldest in-progress transaction, past the checkpoint and the first update potentially lost during the crash, to the end of the log; Analysis and Redo scan forward, Undo scans backward]

Figure 3: The Three Passes of ARIES Restart

ARIES maintains two important data structures during normal operation. The first is the Transaction Table, which contains status information for each transaction that is currently running. This information includes a field called the lastLSN, which is the LSN of the most recent log record written by the transaction. The second data structure, called the Dirty Page Table, contains an entry for each dirty page. A page is considered to be dirty if it contains updates that are not reflected on stable storage. Each entry in the Dirty Page Table includes a field called the recoveryLSN, which is the LSN of the log record that caused the associated page to become dirty. Therefore, the recoveryLSN is the LSN of the earliest log record that might need to be redone for the page during restart. Log records belonging to the same transaction are linked backwards in time using a field in each log record called the prevLSN field. When a new log record is written for a transaction, the value of the lastLSN field in the Transaction Table entry is placed in the prevLSN field of the new record, and the new record's LSN is entered as the lastLSN in the Transaction Table entry.
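A minimal sketch of these two structures and of the prevLSN chaining, assuming an in-memory list as the log so that a record's list index serves as its LSN; the field names follow the text, while the function and variable names are hypothetical.

    transaction_table = {}   # txn_id  -> {"lastLSN": ...}
    dirty_page_table = {}    # page_id -> {"recoveryLSN": ...}
    log = []                 # list index serves as the LSN in this toy model

    def write_update_record(txn_id, page_id, body):
        entry = transaction_table.setdefault(txn_id, {"lastLSN": None})
        lsn = len(log)
        log.append({"type": "update", "txn": txn_id, "page": page_id,
                    "prevLSN": entry["lastLSN"],    # backward chain per transaction
                    "body": body})
        entry["lastLSN"] = lsn                      # newest record for this transaction
        # The first update to a clean page dirties it; the earliest such LSN
        # is remembered as the page's recoveryLSN.
        dirty_page_table.setdefault(page_id, {"recoveryLSN": lsn})
        return lsn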

During normal operation, checkpoints are taken periodically. ARIES uses a form of fuzzy checkpoints, which are extremely inexpensive. When a checkpoint is taken, a checkpoint record is constructed that includes the contents of the Transaction Table and the Dirty Page Table. Checkpoints are efficient since no operations need be quiesced and no database pages are flushed to perform a checkpoint. However, the effectiveness of checkpoints in reducing the amount of the log that must be maintained is limited, in part, by the earliest recoveryLSN of the dirty pages at checkpoint time. Therefore, it is helpful to have a background process that periodically writes dirty pages to nonvolatile storage.
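Continuing the toy model above, a fuzzy checkpoint is simply another log record carrying copies of the two tables; nothing is flushed and no activity pauses. This is only a sketch of the idea (ARIES itself, for example, brackets the checkpoint with begin/end records).

    import copy

    def take_checkpoint(log, transaction_table, dirty_page_table):
        log.append({"type": "checkpoint",
                    "transaction_table": copy.deepcopy(transaction_table),
                    "dirty_page_table": copy.deepcopy(dirty_page_table)})
        # How far the log can later be truncated is bounded by the oldest
        # recoveryLSN still listed, which is why a background writer that
        # cleans old dirty pages keeps this bound tight.
        return min((e["recoveryLSN"] for e in dirty_page_table.values()),
                   default=len(log) - 1)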

Analysis

The job of the Analysis pass of restart recovery is threefold: (1) it determines the point in the log at which to start the REDO pass; (2) it determines which pages could have been dirty at the time of the crash, in order to avoid unnecessary I/O during the REDO pass; and (3) it determines which transactions had not committed at the time of the crash and will therefore need to be undone.

The Analysis pass begins at the most recent checkpoint and scans forward to the end of the log. It reconstructs the Transaction Table and Dirty Page Table to determine the state of the system as of the time of the crash. It begins with the copies of those structures that were logged in the checkpoint record. Then, the contents of the tables are modified according to the log records that are encountered during the forward scan. When a log record for a transaction that does not appear in the Transaction Table is encountered, that transaction is added to the table. When a log record for the commit or the abort of a transaction is encountered, the corresponding transaction is removed from the Transaction Table. When a log record for an update to a page that is not in the Dirty Page Table is encountered, that page is added to the Dirty Page Table, and the LSN of the record that caused the page to be entered into the table is recorded as the recoveryLSN for that page. At the end of the Analysis pass, the Dirty Page Table is a conservative (since some pages may have been flushed to nonvolatile storage) list of all database pages that could have been dirty at the time of the crash, and the Transaction Table contains entries for those transactions that will actually require undo processing during the UNDO phase. The earliest recoveryLSN of all the entries in the Dirty Page Table, called the firstLSN, is used as the spot in the log from which to begin the REDO phase.
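A sketch of the Analysis scan over the toy log used above, assuming the checkpoint record produced by the earlier checkpoint sketch; checkpoint_lsn is the LSN of that record, and all names are hypothetical. CLRs (introduced in the UNDO section below) are treated like updates here, since they too dirty pages.

    import copy

    def analysis(log, checkpoint_lsn):
        chk = log[checkpoint_lsn]
        txn_table = copy.deepcopy(chk["transaction_table"])
        dirty_pages = copy.deepcopy(chk["dirty_page_table"])
        for lsn in range(checkpoint_lsn + 1, len(log)):
            rec = log[lsn]
            if rec["type"] in ("update", "CLR"):
                # New transactions and newly dirtied pages enter the tables.
                txn_table.setdefault(rec["txn"], {})["lastLSN"] = lsn
                dirty_pages.setdefault(rec["page"], {"recoveryLSN": lsn})
            elif rec["type"] in ("commit", "abort"):
                txn_table.pop(rec["txn"], None)     # finished: needs no undo
        first_lsn = min((e["recoveryLSN"] for e in dirty_pages.values()),
                        default=None)               # where REDO will begin
        return txn_table, dirty_pages, first_lsn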

REDO

As stated earlier, ARIES employs a redo paradigm called repeating history. That is, it redoes updates for all transactions, committed or otherwise. The effect of repeating history is that at the end of the REDO pass, the database is in the same state with respect to the logged updates that it was in at the time that the crash occurred. The REDO pass begins at the log record whose LSN is the firstLSN determined by Analysis and scans forward from there. To redo an update, the logged action is reapplied and the pageLSN on the page is set to the LSN of the redone log record. No logging is performed as the result of a redo. For each log record, the following algorithm is used to determine if the logged update must be redone:

1. If the affected page is not in the Dirty Page Table, then the update does NOT require redo.

2. If the affected page is in the Dirty Page Table, then if the recoveryLSN in the page's table entry is greater than the LSN of the record being checked, the update does NOT require redo.

3. Otherwise, the LSN stored on the page (the pageLSN) must be checked. This may require that the page be read in from disk. If the pageLSN is greater than or equal to the LSN of the record being checked, then the update does NOT require redo. Otherwise, the update MUST be redone.
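This three-step test and the REDO scan can be sketched as follows, continuing the toy model; read_page stands in for the buffer manager and is assumed (hypothetically) to return an object with a page_lsn attribute and a data dict, reading the page from disk if necessary. Note that CLRs are redone like ordinary updates, since repeating history must also replay undos.

    def needs_redo(log, lsn, dirty_pages, read_page):
        rec = log[lsn]
        entry = dirty_pages.get(rec["page"])
        if entry is None:
            return False                        # (1) page cannot have been dirty
        if entry["recoveryLSN"] > lsn:
            return False                        # (2) page dirtied only by a later record
        page = read_page(rec["page"])           # (3) may require a disk read
        return page.page_lsn < lsn              # the on-page LSN decides

    def redo_pass(log, first_lsn, dirty_pages, read_page):
        for lsn in range(first_lsn, len(log)):
            rec = log[lsn]
            if rec["type"] in ("update", "CLR") and needs_redo(log, lsn, dirty_pages, read_page):
                page = read_page(rec["page"])
                page.data.update(rec["body"])   # reapply the logged action
                page.page_lsn = lsn             # stamp the page; no logging during redo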

UNDO

The UNDO pass scans backwards from the end of the log. During the UNDO pass, all transactions that had not committed by the time of the crash must be undone. In ARIES, undo is an unconditional operation. That is, the pageLSN of an affected page is not checked, because it is always the case that the undo must be performed. This is due to the fact that the repeating of history in the REDO pass ensures that all logged updates have been applied to the page.

When an update is undone, the undo operation is applied to the page and is logged using a special type of log record called a Compensation Log Record (CLR). In addition to the undo information, a CLR contains a field called the UndoNxtLSN. The UndoNxtLSN is the LSN of the next log record that must be undone for the transaction; it is set to the value of the prevLSN field of the log record being undone. The logging of CLRs in this fashion enables ARIES to avoid ever having to undo the effects of an undo (e.g., as the result of a system crash during an abort), thereby limiting the amount of work that must be undone and bounding the amount of logging done in the event of multiple crashes. When a CLR is encountered during the backwards scan, no operation is performed on the page, and the backwards scan continues at the log record referenced by the UndoNxtLSN field of the CLR, thereby jumping over the undone update and all other updates for the transaction that have already been undone (the case of multiple transactions will be discussed shortly). An example execution is shown in the figure below.

[Figure: a log timeline in which one transaction writes page 1 at LSNs 10, 20, and 30; a crash and restart follow, after which CLRs are written at LSN 40 (for LSN 30) and LSN 50 (for LSN 20); a second crash and restart follow, after which a CLR is written at LSN 60 (for LSN 10).]

Figure: The Use of CLRs for UNDO

In the figure, a transaction logged three updates (LSNs 10, 20, and 30) before the system crashed for the first time. During REDO, the database was brought up to date with respect to the log (i.e., 10, 20, and/or 30 were redone if they weren't on nonvolatile storage), but since the transaction was in progress at the time of the crash, they must be undone. During the UNDO pass, update 30 was undone, resulting in the writing of a CLR with LSN 40, which contains an UndoNxtLSN value that points to 20. Then 20 was undone, resulting in the writing of a CLR (LSN 50) with an UndoNxtLSN value that points to 10. However, the system then crashed for a second time, before 10 was undone. Once again, history is repeated during REDO, which brings the database back to the state it was in after the application of LSN 50 (the CLR for 20). When UNDO begins during this second restart, it will first examine the log record with LSN 50. Since the record is a CLR, no modification will be performed on the page, and UNDO will skip to the record whose LSN is stored in the UndoNxtLSN field of the CLR (i.e., LSN 10). Therefore, it will continue by undoing the update whose log record has LSN 10. This is where the UNDO pass was interrupted at the time of the second crash. Note that no extra logging was performed as a result of the second crash.
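Undoing a single update in the toy model can be sketched as below: the inverse change is applied, a CLR is logged whose UndoNxtLSN is the prevLSN of the record being undone, and the page is stamped. The sketch assumes each update record carries a "before" dict of before-values, which stands in for whatever (possibly logical) undo information a real system would log.

    def undo_update(log, read_page, lsn):
        rec = log[lsn]                               # an update record with before-images
        page = read_page(rec["page"])
        page.data.update(rec["before"])              # toy undo: restore before-values
        clr_lsn = len(log)
        log.append({"type": "CLR", "txn": rec["txn"], "page": rec["page"],
                    "body": dict(rec["before"]),     # CLRs are redoable but never undone
                    "UndoNxtLSN": rec["prevLSN"]})   # next record to undo for this txn
        page.page_lsn = clr_lsn                      # undo, unlike redo, is logged
        return clr_lsn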

In order to undo multiple transactions, restart UNDO keeps a list containing the next LSN to be undone for each transaction being undone. When a log record is processed during UNDO, the prevLSN (or UndoNxtLSN, in the case of a CLR) is entered as the next LSN to be undone for that transaction. Then the UNDO pass moves on to the log record whose LSN is the most recent of the next LSNs to be undone. UNDO continues backward in the log until all of the transactions in the list have been undone up to and including their first log record. UNDO for transaction rollback works similarly to the UNDO pass of the restart algorithm as described above. The only difference is that during transaction rollback, only a single transaction (or part of a transaction) must be undone. Therefore, rather than keeping a list of LSNs to be undone for multiple transactions, rollback can simply follow the backward chain of log records for the transaction to be rolled back.
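The restart UNDO loop can then be sketched as follows, driving undo_update from the previous sketch over all loser transactions; single-transaction rollback is the same loop run with a one-entry dict. As before, the names are hypothetical.

    def undo_pass(log, read_page, txn_table):
        # After Analysis, txn_table holds only losers; start at each lastLSN.
        to_undo = {txn: e["lastLSN"] for txn, e in txn_table.items()}
        while to_undo:
            txn, lsn = max(to_undo.items(), key=lambda kv: kv[1])  # most recent next-LSN
            rec = log[lsn]
            if rec["type"] == "CLR":
                nxt = rec["UndoNxtLSN"]          # jump over work already undone
            else:
                undo_update(log, read_page, lsn)
                nxt = rec["prevLSN"]
            if nxt is None:
                del to_undo[txn]                 # undone back to its first record
            else:
                to_undo[txn] = nxt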

Research Issues and Summary

The model of ACID transactions that has been described in this chapter has proven to be quite durable in its own right, and serves as the underpinnings for the current generation of database and transaction processing systems. This chapter has focused on the issues of concurrency control and recovery in a centralized environment. It is important to note, however, that the basic model is used in many types of distributed and parallel DBMS environments, and the mechanisms described here have been successfully adapted for use in more complex systems. Additional techniques, however, are needed in such environments. One important technique is two-phase commit, which is a protocol for ensuring that all participants in a distributed transaction agree on the decision to commit or abort that transaction.

While the basic transaction model has been a clear success, its limitations have also been apparent for quite some time (e.g., [Gray81]). Much of the ongoing research related to concurrency control and recovery is aimed at addressing some of these limitations. This research includes the development of new implementation techniques as well as the investigation of new and extended transaction models.

The ACID transaction model suffers from a lack of flexibility and the inability to model many types of interactions that arise in complex systems and organizations. For example, in collaborative work environments, strict isolation is not possible, nor even desirable [Kort95]. Workflow management systems are another example where the ACID model, which works best for relatively simple and short transactions, is not directly appropriate. For these types of applications, a richer, multi-level notion of transactions is required.

In addition to the problems raised by complex application environments, there are also many computing environments for which the ACID model is not fully appropriate. These include environments such as mobile wireless networks, where large periods of disconnection are expected, and loosely-coupled wide-area networks (the Internet is an extreme example), in which the availability of systems is relatively low. The techniques that have been developed for supporting ACID transactions must be adjusted to cope with such highly-variable situations. New techniques must also be developed to provide concurrency control and recovery in nontraditional environments, such as heterogeneous systems, dissemination-oriented environments, and others.

A final limitation of ACID transactions in their simplest form is that they are a general mechanism and hence do not exploit the semantics of data and/or applications. Such knowledge could be used to significantly improve system performance. Therefore, the development of concurrency control and recovery techniques that can exploit application-specific properties is another area of active research.

As should be obvious from the preceding discussion, there is still a significant amount of work that remains to be done in the areas of concurrency control and recovery for database systems. The basic concepts, however, such as serializability theory, two-phase locking, and write-ahead logging, will continue to be a fundamental technology, both in their own right and as building blocks for the development of more sophisticated and flexible information systems.

Defining Terms

ACID properties: The transaction properties of Atomicity, Consistency, Isolation, and Durability that are upheld by the DBMS.

abort: The process of rolling back an uncommitted transaction. All changes to the database state made by that transaction are removed.

checkpointing: An action taken during normal system operation that can help limit the amount of recovery work required in the event of a system crash.

commit: The process of successfully completing a transaction. Upon commit, all changes to the database state made by a transaction are made permanent and visible to other transactions.

concurrency control: The mechanism that ensures that individual users see consistent states of the database even though operations on behalf of many users may be interleaved by the database system.

concurrent execution: The (possibly interleaved) execution of multiple transactions simultaneously.

conflicting operations: Two operations are said to conflict if they both operate on the same data item and at least one of them is a write.

deadlock: A situation in which a set of transactions is blocked, each waiting for another member of the set to release a lock. In such a case, none of the transactions involved can make progress.

log: A sequential file that stores information about transactions and the state of the system at certain instances.

log record: An entry in the log. One or more log records are written for each update performed by a transaction.

Log Sequence Number (LSN): A number assigned to a log record that serves to uniquely identify that record in the log. LSNs are typically assigned in a monotonically increasing fashion so that they provide an indication of relative position.

multiversion concurrency control: A concurrency control technique that provides read-only transactions with conflict-free access to previous versions of data items.

nonvolatile storage: Storage, such as magnetic disks or tapes, whose contents persist across power failures and system crashes.

optimistic concurrency control: A concurrency control technique that allows transactions to proceed without obtaining locks and ensures correctness by validating transactions upon their completion.

recovery: The mechanism that ensures that the database is fault tolerant; that is, that the database state is not corrupted as the result of a software, system, or media failure.

schedule: A schedule for a set of transaction executions is a partial ordering of the operations performed by those transactions that shows how the operations are interleaved.

serial execution: The execution of a single transaction at a time.

serializability: The property that a (possibly interleaved) execution of a group of transactions has the same effect on the database, and produces the same output, as some serial (i.e., non-interleaved) execution of those transactions.

STEAL/NO-FORCE: A buffer management policy that allows committed data values to be overwritten on nonvolatile storage (STEAL) and does not require committed values to be written to nonvolatile storage at commit time (NO-FORCE). This policy provides flexibility for the buffer manager at the cost of increased demands on the recovery subsystem.

transaction: A unit of work, possibly consisting of multiple data accesses and updates, that must commit or abort as a single atomic unit. Transactions have the ACID properties of Atomicity, Consistency, Isolation, and Durability.

two-phase locking (2PL): A locking protocol that is a sufficient, but not a necessary, condition for serializability. Two-phase locking requires that all transactions be well-formed and that once a transaction has released a lock, it is not allowed to obtain any additional locks.

volatile storage: Storage, such as main memory, whose state is lost in the event of a system crash or power outage.

well-formed: A transaction is said to be well-formed with respect to reads if it always holds a shared or an exclusive lock on an item while reading it, and well-formed with respect to writes if it always holds an exclusive lock on an item while writing it.

Write-Ahead Logging (WAL): A protocol that ensures that all log records required to correctly perform recovery in the event of a crash are placed on nonvolatile storage.

References

[Agra87] Agrawal, R., Carey, M., Livny, M., "Concurrency Control Performance Modeling: Alternatives and Implications", ACM Transactions on Database Systems, December 1987.

[Bere95] Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P., "A Critique of ANSI SQL Isolation Levels", Proc. of the ACM SIGMOD International Conference on the Management of Data, San Jose, CA, June 1995.

[Bern87] Bernstein, P., Hadzilacos, V., Goodman, N., Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

[Bjor73] Bjork, L., "Recovery Scenario for a DB/DC System", Proc. of the ACM Annual Conference, Atlanta, 1973.

[Davi73] Davies, C., "Recovery Semantics for a DB/DC System", Proc. of the ACM Annual Conference, Atlanta, 1973.

[Eswa76] Eswaran, K., Gray, J., Lorie, R., Traiger, I., "The Notions of Consistency and Predicate Locks in a Database System", Communications of the ACM, November 1976.

[Gray76] Gray, J., Lorie, R., Putzolu, G., Traiger, I., "Granularity of Locks and Degrees of Consistency in a Shared Data Base", IFIP Working Conference on Modelling of Data Base Management Systems, 1976.

[Gray81] Gray, J., "The Transaction Concept: Virtues and Limitations", Proc. of the Seventh International Conference on Very Large Data Bases, Cannes, 1981.

[Gray93] Gray, J., Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, San Mateo, CA, 1993.

[Haer83] Haerder, T., Reuter, A., "Principles of Transaction-Oriented Database Recovery", ACM Computing Surveys, 1983.

[Kort95] Korth, H., "The Double Life of the Transaction Abstraction: Fundamental Principle and Evolving System Concept", Proc. of the Twenty-First International Conference on Very Large Data Bases, Zurich, 1995.

[Kung81] Kung, H., Robinson, J., "On Optimistic Methods for Concurrency Control", ACM Transactions on Database Systems, 1981.

[Lome77] Lomet, D., "Process Structuring, Synchronization, and Recovery Using Atomic Actions", SIGPLAN Notices, March 1977.

[Moha90] Mohan, C., "ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes", Proc. of the 16th International Conference on Very Large Data Bases, Brisbane, August 1990.

[Moha92] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P., "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", ACM Transactions on Database Systems, March 1992.

[Papa86] Papadimitriou, C., The Theory of Database Concurrency Control, Computer Science Press, Rockville, MD, 1986.

[Reed83] Reed, D., "Implementing Atomic Actions on Decentralized Data", ACM Transactions on Computer Systems, February 1983.

[Rose78] Rosenkrantz, D., Stearns, R., Lewis, P., "System Level Concurrency Control for Distributed Database Systems", ACM Transactions on Database Systems, 1978.

For Further Information

For many years, what knowledge existed in the public domain about concurrency control and recovery was passed on primarily through the use of multiple-generation copies of a set of lecture notes written by Jim Gray in the late seventies ("Notes on Database Operating Systems", in Operating Systems: An Advanced Course, published by Springer-Verlag, Berlin). Fortunately, this state of affairs has been supplanted by the publication of Transaction Processing: Concepts and Techniques by Jim Gray and Andreas Reuter (Morgan Kaufmann, San Mateo). This latter book contains a detailed treatment of all of the topics covered in this chapter, plus many others that are crucial for implementing transaction processing systems.

An excellent treatment of concurrency control and recovery theory and algorithms can be found in Concurrency Control and Recovery in Database Systems by Phil Bernstein, Vassos Hadzilacos, and Nathan Goodman (Addison-Wesley, Reading, MA). Another source of valuable information on concurrency control and recovery implementation is the series of papers on the ARIES method by C. Mohan and others at IBM, some of which are referenced in this chapter. The book The Theory of Database Concurrency Control by Christos Papadimitriou (Computer Science Press, Rockville, MD) covers a number of serializability models.

The performance aspects of concurrency control and recovery techniques have been only briefly addressed in this chapter. More information can be found in the recent book Performance of Concurrency Control Mechanisms in Centralized Database Systems, edited by Vijay Kumar (Prentice Hall, Englewood Cliffs, NJ). Also, the performance aspects of transactions are addressed in The Benchmark Handbook for Database and Transaction Processing Systems (second edition), edited by Jim Gray (Morgan Kaufmann, San Mateo).

Finally, extensions to the ACID transaction model are discussed in Database Transaction Models for Advanced Applications, edited by Ahmed Elmagarmid (Morgan Kaufmann, San Mateo). Papers containing the most recent work on related topics appear regularly in the ACM SIGMOD Conference and the International Conference on Very Large Data Bases (VLDB), among others.