Data Management Systems

and recovery Logging • Transactions • Recovery Recovery procedures Implementations

Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich

Recovery 1 Dealing with failures

• Recovery aims at reconstructing a consistent state after a failure
• Failures considered:
  • Transaction failure (transaction aborts, rollbacks, client fails in the middle of executing a transaction, transaction times out)
    • Recovery procedure is to undo the changes made by the transaction while it was active
  • System failure (machine shuts down, machine fails, OS bails out, etc.)
    • Recovery procedure is to bring the database back to a consistent state
  • Media failure (disk fails, permanent storage has errors, etc.)
    • Recovery procedure involves restoring the database to a known consistent state

Recovery 2 Dealing with failures

• Transaction failure:
  • Typically handled using undo logs or undo records (see last section)
  • Transactions must create an undo record every time they modify something
  • The undo record is used to remove changes made by the transaction while it was active
• System failure:
  • Typically handled using redo and undo logs as well as by running a recovery procedure
  • Undo the changes of the transactions active at the time of failure, redo the changes of committed transactions
  • Clean up the data in permanent storage
• Media failure:
  • Typically addressed through replication and by separating data files from log files

Recovery 3 System failure

• In this section we will focus on system failures: the sudden loss of the data in memory
• The engine needs to be able to recover the database up to its last committed state, containing the changes made by all transactions committed as of the time of the failure
• This is called the recovery procedure
• The recovery procedure needs to
  • Operate only with the data on permanent storage
  • Be correct even when successive failures occur in the middle of the recovery procedure

Recovery 4 Logging

Recovery 5 Some notation

• The value that existed in the database before a transaction modifies an item is called a “before image”
• The value that exists in the database after a transaction modifies an item is called an “after image”
• A record that registers what the transaction has done (a before image, an after image, transaction ID, SCN, LSN, etc.) is a log record
• In a database engine, both the data and the log are persistent; recovery procedures will combine both depending on the design
• We will assume serializable and strict execution:
  • A transaction can be undone by restoring its before images
  • A transaction can be redone by applying its after images
• Complication in reality: transactions update tuples, the I/O system updates blocks

Recovery 6 The “log”

• For the moment, we will work with an idealized log
• We will assume physical logging: the log entries contain the actual data modified by the transaction (before image, after image)
• For each modification made by a transaction, there is a log record
• Log records are ordered and reflect the logical sequence of events in the database (use SCN or LSN)
• The log is made persistent (kept in stable storage)
• Other data structures and information can be kept persistent as well
  • Which transactions have committed, aborted, are active …
• Flush operations force something (a block, a log entry, some information) to be written to permanent storage

Recovery 7 Achieving durability

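[Figure: transaction T (update x, update y, insert p, insert q, delete t) modifies blocks in the buffer cache, which are written via I/O to persistent data. Its log records (before images Xbi, Ybi; after images Xai, Yai; inserted items p, q; …) are placed in the in-memory log and written via I/O to the persistent log.]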

Recovery 8 Durability

• Durability ensures that the database remembers what has been committed and that it can recover the last committed state of the database
• This implies writing to persistent storage (I/O), which is an expensive operation. However:
  • Being able to recover the last committed state implies we remember that state
  • Thus, when a transaction commits, we need to store in persistent storage everything we might need to restore that transaction
  • => I/O (of data, logs, or both)

Recovery 9 When to write to persistent storage

• Transactions modify items:
  • If changes from an active transaction can end up on persistent storage, recovery will involve undoing such changes -> REQUIRES UNDO
  • If changes from a committed transaction are not yet on persistent storage when the transaction is declared committed (the client is notified), recovery will involve redoing such changes -> REQUIRES REDO
• In both situations we need the log => when a transaction commits, all its log entries must be persistent
  • To undo changes: we need the before image (UNDO LOG)
  • To redo changes: we need the after image (REDO LOG)
• In the idealized log we study now, assume the log contains everything

Recovery 10 Recovery Procedures

Recovery 11 Recovery manager

• The recovery manager implements the recovery procedure, which will depend on how the system functions:
  • Requires UNDO and REDO
  • Requires UNDO
  • Requires REDO
  • Requires neither
• The procedure will depend on when modifications are written to persistent storage with respect to when the transaction commits
• We will assume updates in place and ignore blocks for the moment (assume items can be made persistent individually)

Recovery 12 From the buffer cache perspective

• Steal Policy
  • STEAL: an uncommitted transaction is allowed to overwrite in persistent storage the changes of a committed transaction
  • This happens by pushing a dirty block to storage before the transaction commits
  • Requires being able to undo this transaction
• Force Policy
  • FORCE: all changes made by a transaction must be in persistent storage before the transaction commits
  • Requires flushing all blocks with updates from the transaction
  • If not in place, it requires being able to redo the transaction
• Steal/no-Force = UNDO/REDO, the most common approach
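The relation between the two buffer policies and the recovery work they imply can be written down as a small truth table. The following is a minimal sketch; the function name `recovery_requirements` is hypothetical and only encodes the rules stated above.

```python
def recovery_requirements(steal, force):
    """Return (needs_undo, needs_redo) for a buffer policy combination."""
    # STEAL lets uncommitted changes reach persistent storage, so recovery
    # must be able to undo them.
    needs_undo = steal
    # Without FORCE, committed changes may still be missing from persistent
    # storage at the time of the failure, so recovery must be able to redo them.
    needs_redo = not force
    return needs_undo, needs_redo


if __name__ == "__main__":
    for steal in (True, False):
        for force in (True, False):
            undo, redo = recovery_requirements(steal, force)
            policy = f"{'steal' if steal else 'no-steal'}/{'force' if force else 'no-force'}"
            print(f"{policy}: {'UNDO' if undo else 'no UNDO'}, {'REDO' if redo else 'no REDO'}")
```

The steal/no-force combination is the one that maps to UNDO/REDO.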

Recovery 13 Lock tuples, update blocks => STEAL

[Figure: transactions T1 (update x, commit) and T2 (update y) modify the same block in the buffer cache; the block is written via I/O to persistent data.]

Transactions that do not conflict may still be updating the same block. If the block is copied to storage, it is possible that some of its changes are committed while others are not. If a failure occurs, there is no guarantee that the data in storage is 100% consistent.

Recovery 14 Requires UNDO and REDO

• READ
  • Just read the value from the block in the buffer cache
• WRITE
  • Create a log entry (before image, after image) and append it to the persistent log
  • Write the after image to the block in the buffer cache
• COMMIT
  • Write a persistent log entry indicating the transaction has committed
• ABORT
  • For all updates, restore the before image using the log entry
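The four operations above can be sketched as a toy engine. This is an illustration under simplifying assumptions, not an actual implementation: items are individual keys rather than blocks, every item exists before it is written, and appending to the Python list `log` stands in for a forced write to the persistent log. The class name and the tuple layout of the log entries are hypothetical.

```python
class UndoRedoEngine:
    """Toy UNDO/REDO engine: only the log is forced, data stays in the cache."""

    def __init__(self, stable):
        self.cache = dict(stable)  # buffer cache (lost on a system failure)
        self.log = []              # stands in for the persistent log

    def read(self, item):
        return self.cache[item]

    def write(self, txn, item, value):
        # Log entry carries the before image and the after image.
        self.log.append(("update", txn, item, self.cache[item], value))
        self.cache[item] = value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def abort(self, txn):
        # Restore before images, newest update first.
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] == txn:
                self.cache[rec[2]] = rec[3]
        self.log.append(("abort", txn))


if __name__ == "__main__":
    engine = UndoRedoEngine({"x": 1, "y": 2})
    engine.write("T1", "x", 10)
    engine.abort("T1")
    engine.write("T2", "y", 5)
    engine.commit("T2")
    print(engine.cache, engine.log)
```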

Recovery 15 Recovery procedure (UNDO/REDO)

• Start from the end of the log and work backwards
• Keep a list of UNDONE items and another of REDONE items
• Procedure terminates when all items are in either the UNDONE or the REDONE list, or we reach the beginning of the log
• For each log entry:
  • Look at the data item being accessed (x); if x is in neither of the two lists:
    • If the log entry is of a committed transaction, apply the after image, add x to the REDONE list
    • If the log entry is of an aborted transaction, apply the before image, add x to the UNDONE list
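The backward scan can be sketched as follows. The sketch assumes the idealized log used so far, with hypothetical tuple layouts `("update", txn, item, before, after)` and `("commit", txn)`, strict execution, and each transaction writing an item at most once (otherwise the before image of its last write would not be the last committed value). Any transaction without a commit entry is treated as aborted.

```python
def recover_undo_redo(stable, log):
    """Rebuild the last committed state from stable data plus the log."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = dict(stable)              # stable storage is only the starting point
    redone, undone = set(), set()

    for rec in reversed(log):      # from the end of the log backwards
        if rec[0] != "update":
            continue
        _, txn, item, before, after = rec
        if item in redone or item in undone:
            continue               # the most recent writer already settled it
        if txn in committed:
            db[item] = after       # REDO
            redone.add(item)
        else:
            db[item] = before      # UNDO
            undone.add(item)
    return db


if __name__ == "__main__":
    log = [
        ("update", "T1", "x", 1, 2), ("commit", "T1"),
        ("update", "T2", "x", 2, 3),               # T2 active at the crash
        ("update", "T3", "y", 0, 7), ("commit", "T3"),
    ]
    # x=3 was stolen to disk, y=7 never made it to disk.
    print(recover_undo_redo({"x": 3, "y": 0}, log))
```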

Recovery 16 UNDO/REDO recovery procedure

• The procedure ignores the data stored on disk, as it could correspond to uncommitted transactions (hence the need for UNDO); it only takes it as the starting point for recovery
• Another way to look at this procedure is as follows:
  • For every item in the database:
    • Find the last committed transaction that modified the item and REDO the modification
    • If no committed transaction modified the item, find the first aborted transaction that modified the item and UNDO the modification
    • If no transaction has touched the item, its value is correct (we assume we start from a consistent state)
• In practice, it is not done like this because there are many more items than log entries; it is easier to process the log sequentially from the end to the beginning

Recovery 17 Advantages of UNDO/REDO recovery

• The only forced I/O is that of log records
• Buffer cache manager has a lot of freedom:
  • No need to flush dirty pages if there is no need to reuse the space
  • I/O on data is minimized and only triggered by the block replacement policy
  • Allows writing dirty data (written by uncommitted transactions) to disk, which simplifies buffer management
• Recovery is more complicated and takes more time, but normal operations are only minimally affected
  • Queries are not affected since there is no forced I/O of data
  • Transactions are affected because of the need to write every operation to the log

Recovery 18 UNDO, no REDO

• READ
  • Just read the value from the block in the buffer cache
• WRITE
  • Create a log entry (before image) and append it to the persistent log
  • Write the after image to the block in the buffer cache
• COMMIT
  • Flush all dirty values modified by the transaction if still in the cache
  • Write a persistent log entry indicating the transaction has committed
• ABORT
  • For all updates, restore the before image using the log entry

Recovery 19 Recovery Procedure (UNDO/no REDO)

• Start from the end of the log and scan backwards
• Keep a list of UNDONE items
• Procedure terminates when all items are in the UNDONE list or we reach the beginning of the log
• For each log entry:
  • Look at the data item being accessed (x)
  • If x is not in the UNDONE list and the transaction is aborted:
    • UNDO the changes by using the before image, add x to the UNDONE list
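A minimal sketch of this scan, assuming log entries of the hypothetical form `("update", txn, item, before)` and `("commit", txn)`, strict execution, and each transaction writing an item at most once. One detail goes beyond the bullets above and is an assumption of this sketch: an item whose most recent writer committed is also marked as settled, so that an older aborted write found later in the backward scan does not overwrite the newer committed value.

```python
def recover_undo_only(stable, log):
    """UNDO/no-REDO recovery: committed values are already in stable storage."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = dict(stable)
    settled = set()   # UNDONE items plus items whose latest writer committed

    for rec in reversed(log):
        if rec[0] != "update":
            continue
        _, txn, item, before = rec
        if item in settled:
            continue
        if txn not in committed:
            db[item] = before      # UNDO with the before image
        settled.add(item)
    return db


if __name__ == "__main__":
    log = [
        ("update", "T1", "x", 1), ("abort", "T1"),
        ("update", "T2", "x", 1), ("commit", "T2"),   # forced x=5 at commit
        ("update", "T3", "y", 4),                     # active at the crash
    ]
    print(recover_undo_only({"x": 5, "y": 9}, log))
```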

Recovery 20 Recovery Procedure (UNDO/no REDO)

• The procedure relies on the fact that all committed values are in persistent storage and, therefore, they have not been lost
• Another way to look at this procedure is as follows:
  • For every item in the database
    • If no aborted transaction has touched it, then it is correct
    • Otherwise, find the last aborted transaction that touched the item and UNDO it
• It works because we are assuming strict execution:
  • There can only be one aborted transaction that modified the correct value at the time of failure (cannot overwrite uncommitted items)
  • It is enough to undo that transaction and we have the last committed value

Recovery 21 Comments on UNDO/no REDO

• Forced I/O on all dirty blocks touched by a transaction when it commits
• Log records no longer need to include after images, making the log records smaller
• Recovery procedure is shorter: it only involves undoing aborted transactions (theoretically, we would not even need the log entries of committed transactions)
• The trade-off induced by no-REDO does not pay off in practice:
  • We still need to write to the log with every update
  • No-REDO requires smaller log records …
  • … but it forces I/O on data blocks, which are often much larger!
  • Flushing the buffer cache interferes with its operation (e.g., queries)

Recovery 22 No UNDO, REDO

• READ
  • If the transaction did not write the item before, read the value from the block in the buffer cache
  • If the transaction has written the item before, read the value from the temporary buffer
• WRITE
  • Create the log entry (after image) and append it to the persistent log
  • Write the after image to some temporary buffer (e.g., shadow page)
• COMMIT
  • Apply all updates in the temporary buffer to the actual data blocks
  • Write a persistent log entry indicating the transaction has committed
• ABORT
  • Discard the temporary buffer

Recovery 23 Recovery procedure (no UNDO/REDO)

• Start from the end of the log and scan backwards
• Keep a list of REDONE items
• Procedure terminates when all items are in the REDONE list or we reach the beginning of the log
• For each log entry:
  • Look at the data item being accessed (x)
  • If x is not in the REDONE list and the transaction is committed:
    • REDO the changes by using the after image, add x to the REDONE list
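A minimal sketch of the REDO-only scan, assuming log entries of the hypothetical form `("update", txn, item, after)` and `("commit", txn)`. Entries of transactions without a commit entry are skipped, since their changes never reached persistent storage.

```python
def recover_redo_only(stable, log):
    """no-UNDO/REDO recovery: reapply the last committed write of each item."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = dict(stable)
    redone = set()

    for rec in reversed(log):
        if rec[0] != "update":
            continue
        _, txn, item, after = rec
        if item not in redone and txn in committed:
            db[item] = after       # REDO with the after image
            redone.add(item)
    return db


if __name__ == "__main__":
    log = [
        ("update", "T1", "x", 10), ("commit", "T1"),
        ("update", "T2", "x", 20), ("commit", "T2"),
        ("update", "T3", "y", 30),                 # never committed
    ]
    print(recover_redo_only({"x": 1, "y": 2}, log))
```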

Recovery 24 Recovery procedure (no UNDO/REDO)

• The procedure relies on the fact that there are never dirty blocks in the buffer cache. All data there is committed and is the last committed version
• Another way to look at this procedure is as follows:
  • For every item in the database
    • Find the last committed transaction that touched it and REDO it
• The step is needed because we are not flushing data blocks at commit and it could be that the changes of a committed transaction are not yet in persistent storage, hence the need to redo those changes upon recovery

Recovery 25 Comments on no-UNDO/REDO

• Forced I/O only on the log records
• Log records no longer need to include before images, making the log records smaller
• Recovery procedure is shorter: it only involves redoing the last committed transaction that touched an item (theoretically, we do not need log entries for aborted transactions)
• Consider similarities with snapshot isolation
  • Read from the snapshot as of the start time
  • Write to a buffer and only apply the buffer at commit
  • Reads and writes do not interfere with each other

Recovery 26 No UNDO, no REDO

• This is a special recovery procedure rarely used in conventional database engines:
  • It does not require a log
  • No undo implies no before images are needed to correct uncommitted data from the stable image of the database (uncommitted data is never written to persistent storage)
  • No redo implies no after images are needed because when transactions commit, all their changes are added to persistent storage (data in memory is never dirty)
• It requires being able to write all changes made by a transaction to persistent storage in a single atomic action
  • Not doable if updates are to conventional data blocks (no way to write several blocks atomically)

Recovery 27 Data structures for no UNDO, no REDO

Bernstein, Hadzilacos, and Goodman, Concurrency Control and Recovery in Database Systems

Recovery 28 No UNDO, no REDO

• READ
  • If the value has not been written before by the transaction, use the current directory to find the latest committed copy
  • If the value has been written before by the transaction, use the shadow directory of that transaction to find the updated copy
• WRITE
  • Write to a buffer and add a pointer in the shadow directory of the transaction
• COMMIT
  • Create a full directory by merging the current one and the shadow directory of the transaction
  • Swap the pointer indicating the latest committed directory
• ABORT
  • Discard the buffer and the shadow directory
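The directory mechanism can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the class `ShadowStore` and its layout are hypothetical, every item read already exists, and rebinding `self.current` stands in for the atomic swap of the pointer to the latest committed directory.

```python
class ShadowStore:
    """Toy no-UNDO/no-REDO store based on a current and per-transaction shadow directories."""

    def __init__(self, initial):
        self.storage = []      # copies of values, addressed by position
        self.current = {}      # committed directory: item -> position
        self.shadow = {}       # txn -> shadow directory {item: position}
        for item, value in initial.items():
            self.current[item] = self._store(value)

    def _store(self, value):
        self.storage.append(value)
        return len(self.storage) - 1

    def read(self, txn, item):
        own = self.shadow.get(txn, {})
        position = own[item] if item in own else self.current[item]
        return self.storage[position]

    def write(self, txn, item, value):
        # New copy goes to a fresh location; committed copies are untouched.
        self.shadow.setdefault(txn, {})[item] = self._store(value)

    def commit(self, txn):
        merged = {**self.current, **self.shadow.pop(txn, {})}
        self.current = merged  # the single atomic action: swap the directory pointer

    def abort(self, txn):
        # Orphaned copies stay in storage until garbage collected.
        self.shadow.pop(txn, None)


if __name__ == "__main__":
    store = ShadowStore({"x": 1})
    store.write("T1", "x", 2)
    print(store.read("T1", "x"), store.read("T2", "x"))
    store.commit("T1")
    print(store.read("T2", "x"))
```

Since a failure before the swap leaves the old directory in place and a failure after it leaves the new one, there is nothing to undo or redo; the price is the indirection and the garbage left behind.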

Recovery 29 Recovery Procedure (no UNDO, no REDO)

• That is the whole point, there is none!

Recovery 30 Comments on no UNDO, no REDO

• Not used in practice, although some of the ideas can be partially applied
• Access to storage requires an indirection through the directory that indicates which one is the latest version. This is too expensive
• It requires garbage collection of all the uncommitted values, shadow directories, etc.
• It moves data all the time, creating problems with the block representation in, e.g., clustered indexes or hash clustered tables

Recovery 31 Implementation of Recovery

Recovery 32 What is a log record?

• So far, the explanations of what to log and what log records are have been a bit abstract. In practice:
  • Data is stored in blocks (tendency is towards large blocks)
  • Log records are stored in blocks (tendency is towards smaller blocks)
  • But they are not blocks of the same size: each log record uses one log block
• Many systems use a log block size of 512 bytes:
  • This is the size of a physical sector on disks
  • Larger units of transfer are possible (see group commit later)

Recovery 33 What is inside a log record?

• Depends on the system but, to a first approximation:
  • LSN = Log Sequence Number, used to navigate the log
  • SCN = System Change Number, used to timestamp events
  • Pointers to other log records of the same transaction
  • Transaction ID and related information
  • REDO related information:
    • Change vectors, each one describing changes to a single block of data (after images)
  • UNDO related information:
    • Before images
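As an illustration, a log record with these fields could be modeled as follows. The field names (`prev_lsn`, `block_id`, and so on) are hypothetical and do not correspond to the layout of any particular system; the helper shows how the pointer to the previous record of the same transaction lets an engine walk that transaction's records backwards, for example to abort it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LogRecord:
    lsn: int                  # Log Sequence Number, position in the log
    scn: int                  # System Change Number, timestamp of the event
    txn_id: str               # transaction that made the change
    prev_lsn: Optional[int]   # previous record of the same transaction, if any
    block_id: int             # block described by this change vector
    before: Any               # UNDO information (before image)
    after: Any                # REDO information (after image)


def records_of(log: Dict[int, LogRecord], last_lsn: Optional[int]) -> List[LogRecord]:
    """Follow prev_lsn pointers from a transaction's newest record to its oldest."""
    chain = []
    lsn = last_lsn
    while lsn is not None:
        chain.append(log[lsn])
        lsn = log[lsn].prev_lsn
    return chain


if __name__ == "__main__":
    log = {
        1: LogRecord(1, 100, "T1", None, 7, "a", "b"),
        2: LogRecord(2, 101, "T2", None, 9, "p", "q"),
        3: LogRecord(3, 102, "T1", 1, 8, "c", "d"),
    }
    print([r.lsn for r in records_of(log, 3)])
```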

Recovery 34 Log Sequence, System Change Numbers

• Systems use a number of timestamps to identify the moment transactions start and also to order events in the system:
  • LSN: Log Sequence Number
  • SCN: System Change Number
  • …
• LSNs are used in the log to order transactions and decide what goes before or after. They are also used to identify log files
• SCNs are used in snapshot isolation to identify correct snapshots
• SCNs are also attached to data to indicate the version of the data (which transaction modified the data last)

Recovery 35 Managing the in-memory log (Oracle)

• Oracle uses a redo log (also contains undo data)
• In memory, it is a circular buffer
• As transactions modify data, redo records are created in memory and placed in the redo log buffer
• When a commit occurs, the redo records are flushed to a file in storage
• A size often mentioned in system manuals is 60KB

Recovery 36 Managing the log in storage (Oracle)

• Several files are used • The log writer only writes to a single redo log file at a time • When a file is full and needs to be archived, the LSN is increased and the system switches to a new redo log file • That way, archival does not interfere with normal operations as the system always has a redo log file where it can write

• Potential bottleneck … https://docs.oracle.com/cd/B28359_01/server.111/b28310/onlineredo001.htm#ADMIN11305

Recovery 37 Group Commit and log buffer flush

• The log buffer has to be written (flushed) to disk when any of these events occurs:
  • A transaction commits
  • The log buffer becomes full
  • Dirty pages are written to storage (see WAL)
  • A checkpoint is taken (see Checkpointing)
• Instead of doing it for every transaction, systems often commit transactions in groups or batches:
  • Slight delay in committing but less I/O since all the log entries are written in one go
  • Can happen anyway as part of committing transactions when using a circular log buffer
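The batching idea can be sketched as follows. This is a simplified illustration: the class `GroupCommitLog` and the `group_size` trigger are hypothetical, a Python list stands in for the log file, and a real system would typically also flush on a timeout or when the buffer fills so that a lone transaction is not left waiting.

```python
class GroupCommitLog:
    """Toy log buffer that acknowledges commits in groups after one flush."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.buffer = []       # in-memory log buffer
        self.persistent = []   # stands in for the log file on storage
        self.waiting = []      # transactions whose commit record is not yet flushed
        self.flushes = 0       # number of I/O operations on the log

    def append(self, record):
        self.buffer.append(record)

    def commit(self, txn):
        """Return the transactions that can now be reported as committed."""
        self.buffer.append(("commit", txn))
        self.waiting.append(txn)
        if len(self.waiting) >= self.group_size:
            return self.flush()
        return []              # commit is delayed until the group is complete

    def flush(self):
        # One I/O writes every buffered entry, including all pending commits.
        self.persistent.extend(self.buffer)
        self.buffer.clear()
        self.flushes += 1
        acknowledged, self.waiting = self.waiting, []
        return acknowledged


if __name__ == "__main__":
    log = GroupCommitLog(group_size=2)
    log.append(("update", "T1", "x", 1, 2))
    print(log.commit("T1"))
    log.append(("update", "T2", "y", 3, 4))
    print(log.commit("T2"), log.flushes)
```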

Recovery 38 Write Ahead Logging

• The most common implementation of these ideas is Write Ahead Logging (WAL):
  • Separate persistent storage for data from persistent storage for the log
  • Log contains enough information to implement whatever policy is chosen (redo/undo)
  • Log records corresponding to a change in the database must be written to the log before changes to the data in the buffer cache are flushed to permanent storage
  • COMMIT record in the log used to mark the end of a transaction
• Typically used to implement UNDO/REDO (on 2PL based systems) or no-UNDO/REDO (on SI based systems)
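The write-ahead rule can be enforced by remembering, for each page, the LSN of the last log record that changed it and comparing it with the highest LSN already on stable storage. The sketch below assumes this per-page LSN bookkeeping (a common technique, not something stated in the slides); the class and attribute names are hypothetical.

```python
class WalBufferManager:
    """Toy buffer manager that never writes a page ahead of its log records."""

    def __init__(self):
        self.log = []          # all log records, in LSN order
        self.flushed_lsn = 0   # highest LSN known to be on stable storage
        self.page_lsn = {}     # page -> LSN of the last record that changed it
        self.cache = {}        # buffer cache
        self.disk = {}         # persistent data

    def update(self, page, value):
        lsn = len(self.log) + 1
        self.log.append((lsn, page, self.cache.get(page), value))
        self.cache[page] = value
        self.page_lsn[page] = lsn
        return lsn

    def flush_log(self, up_to_lsn):
        # Stands in for writing the log buffer up to this LSN to storage.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def flush_page(self, page):
        # WAL: the log record describing the change goes to storage first.
        if self.page_lsn[page] > self.flushed_lsn:
            self.flush_log(self.page_lsn[page])
        self.disk[page] = self.cache[page]


if __name__ == "__main__":
    manager = WalBufferManager()
    manager.update("p", 1)
    manager.update("q", 2)
    manager.flush_page("q")
    print(manager.flushed_lsn, manager.disk)
```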

Recovery 39 Checkpoints

• A log would grow forever if we did not do anything
  • Storage would become a problem
  • Recovery would take too long (replay everything since the database was created)
• Instead, checkpoints are used
• A checkpoint:
  • Push all dirty blocks to disk
  • Push all the log records in the log buffer to disk
  • Active transaction table, dirty page table (system state)
  • Mark the log with a checkpoint label and flush it to the log
• Recovery happens from a checkpoint instead of from the beginning
• Note that a checkpoint is not necessarily a consistent copy of the database

Recovery 40 WAL and Checkpoints (ARIES style recovery)

• If we use WAL, recovery with a checkpoint is as follows
  • Find the latest completed checkpoint in the log
  • Traverse the log to the end analyzing what has been done:
    • Identify transactions that were active at the time of the crash
    • Identify dirty pages that might not have made it to the disk at the time of the crash
  • Apply all updates (redo) starting from the log entry matching the lowest SCN in the dirty pages
  • Undo all transactions that were active at the time of the crash

[Figure: timeline of the log marking the earliest relevant log entry, the last active transaction, the last checkpoint (CKPT), and the time of crash, with the ANALYSIS PHASE, REDO PHASE, and UNDO PHASE.]
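The three phases can be sketched in a heavily simplified form. The assumptions are loud ones: the position of a record in the Python list plays the role of its LSN/SCN, the record layouts `("update", txn, page, before, after)`, `("commit", txn)` and `("checkpoint", active_txns, dirty_pages)` are hypothetical, the checkpoint stores for each dirty page the position of the first record that dirtied it, and the sketch leaves out details a real ARIES implementation needs (per-page LSN checks, compensation records written during undo).

```python
def aries_style_recover(disk, log):
    """Analysis, redo, and undo over a log with checkpoint records."""
    # Find the latest checkpoint and the system state stored in it.
    start, active, dirty = 0, set(), {}
    for pos, rec in enumerate(log):
        if rec[0] == "checkpoint":
            start, active, dirty = pos, set(rec[1]), dict(rec[2])

    # Analysis: from the checkpoint to the end of the log.
    for pos in range(start, len(log)):
        rec = log[pos]
        if rec[0] == "update":
            active.add(rec[1])
            dirty.setdefault(rec[2], pos)   # first record that dirtied the page
        elif rec[0] == "commit":
            active.discard(rec[1])

    # Redo: reapply updates from the earliest position in the dirty page table.
    db = dict(disk)
    redo_start = min(dirty.values(), default=len(log))
    for pos in range(redo_start, len(log)):
        rec = log[pos]
        if rec[0] == "update" and rec[2] in dirty and pos >= dirty[rec[2]]:
            db[rec[2]] = rec[4]

    # Undo: roll back transactions still active at the crash, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] in active:
            db[rec[2]] = rec[3]
    return db


if __name__ == "__main__":
    log = [
        ("update", "T1", "a", 0, 1), ("commit", "T1"),
        ("update", "T2", "b", 0, 2),
        ("checkpoint", ["T2"], {"b": 2}),
        ("update", "T3", "c", 5, 6), ("commit", "T3"),
        ("update", "T2", "b", 2, 3),            # T2 still active at the crash
    ]
    print(aries_style_recover({"a": 1, "b": 0, "c": 5}, log))
```

Note that the undo phase may need to go back past the checkpoint, which is why the earliest relevant log entry can be older than the last checkpoint.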

Recovery 41 As a reference (SQL Server)

• Database Checkpoints
  • https://docs.microsoft.com/en-us/sql/relational-databases/logs/database-checkpoints-sql-server?view=sql-server-ver15
• Transaction Log
  • https://docs.microsoft.com/en-us/sql/relational-databases/logs/the-transaction-log-sql-server?view=sql-server-ver15
• Recovery (including WAL)
  • https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/restore-and-recovery-overview-sql-server?view=sql-server-ver15#TlogAndRecovery

Recovery 42 The log today

Recovery 43 No more disks …

• A lot of the procedures and ideas around how to do logging come from the time when persistent storage was implemented using hard disks
• Today, hard disks are quickly being replaced by a number of different architectures:
  • HDD for data, SSD for logs
  • NVM for logs and checkpoints
• The use of Non-Volatile Memory in particular changes quite a few things:
  • No longer necessary to flush log records upon commit
  • Checkpoints can also be kept in the NVM
