Review: The ACID properties

• A tomicity: All actions in the Xact happen, or none happen. ARIES: Logging and • C onsistency: If each Xact is consistent, and the DB starts Recovery consistent, it ends up consistent. • I solation: Execution of one Xact is isolated from that of other Slides derived from Joe Hellerstein; Xacts. Updated by A. Fekete • D urability: If a Xact commits, its effects persist. If you are going to be in the logging business, one of the things that you have to do is to • The Recovery Manager guarantees Atomicity & Durability. learn about heavy equipment. - Robert VanNatta, Logging History of Columbia County

Motivation Intended Functionality

• Atomicity: – Transactions may abort (“Rollback”). • At any time, each data item contains the value produced by the most recent update done by a • Durability: transaction that committed – What if DBMS stops running? (Causes?)

 Desired Behavior after system restarts: crash! T1 – T1, T2 & T3 should be T2 durable. T3 – T4 & T5 should be T4 aborted (effects not seen). T5 Assumptions Challenge: REDO

Action Buffer Disk • Essential is in effect. Initially 0 – For read/write items: Write locks taken and held T1 writes 1 1 0 till T1 commits 1 0 • Eg Strict 2PL, but read locks not important for recovery. CRASH 0 – For more general types: operations of concurrent transactions commute • Need to restore value 1 to item • Updates are happening “in place”. – Last value written by a committed transaction – i.e. data is overwritten on (deleted from) its location. • Unlike multiversion approaches • Buffer in volatile memory; data persists on disk

Challenge: UNDO Handling the Buffer Pool

Action Buffer Disk • Can you think of a simple scheme Initially 0 to guarantee Atomicity & No Steal Steal T1 writes 1 1 0 Durability? Page flushed 1 Force CRASH 1 Trivial • Force write to disk at commit? • Need to restore value 0 to item – Poor response time. – Last value from a committed transaction – But provides durability. No Force Desired • No Steal of buffer-pool frames from uncommited Xacts (“pin”)? – Poor throughput. – But easily ensure atomicity More on Steal and Force Basic Idea: Logging

•STEAL (why enforcing Atomicity is hard) – To steal frame F: Current page in F (say P) is • Record REDO and UNDO information, for every update, in a log written to disk; some Xact holds lock on P. . • What if the Xact with the lock on P aborts? – Sequential writes to log (put it on a separate disk). • Must remember the old value of P at steal time (to – Minimal info (diff) written to log, so multiple updates support UNDOing the write to page P). fit in a single log page. • NO FORCE (why enforcing Durability is hard) • Log: An ordered list of REDO/UNDO actions – What if system crashes before a modified page is – Log record contains: written to disk? – Write as little as possible, in a convenient place, at – and additional control info (which we’ll see soon) commit time,to support REDOing modifications. – For abstract types, have operation(args) instead of old value new value.

Write-Ahead Logging (WAL) ARIES

• The Write-Ahead Logging Protocol: • Exactly how is logging (and recovery!) done? 1. Must force the log record for an update before – Many approaches (traditional ones used in the corresponding data page gets to disk. relational systems of 1980s); 2. Must write all log records for a Xact before – ARIES algorithms developed by IBM used many of commit. the same ideas, and some novelties that were • #1 (undo rule) allows system to have quite radical at the time Atomicity. • Research report in 1989; conference paper on an • #2 (redo rule) allows system to have extension in 1989; comprehensive journal publication in Durability. 1992 • 10 Year VLDB Award 1999 DB RAM Key ideas of ARIES WAL & the Log LSNs pageLSNs flushedLSN abort • Each log record has a unique Log Sequence Number (LSN). • Log every change (even undos during txn Log records ) – LSNs always increasing. flushed to disk • In restart, first repeat history without • Eachrecord data page contains a backtracking pageLSN. – Even redo the actions of loser transactions – The LSN of the most recent log • Then undo actions of losers for an update to that page. • System keeps track of pageLSN • LSNs in pages used to coordinate state “Log tail” flushedLSN. in RAM between log, buffer, disk – The max LSN flushed so far. Novel features of ARIES in italics

WAL constraints Log Records Possible log record types: • Before a page is written, LogRecord fields: • Update –pageLSN flushedLSN prevLSN • Commit • Commit record included in log; all related XID •Abort update log records precede it in log type •End (signifies end of commit pageID or abort) update length • Compensation Log Records records offset (CLRs) only before-image – for UNDO actions after-image – (and some other tricks!) Other Log-Related State Normal Execution of an Xact

• Transaction : • Series of reads & writes, followed by commit or – One entry per active Xact. abort. –Contains XID, status (running/commited/aborted), – We will assume that page write is atomic on disk. and lastLSN. • In practice, additional details to deal with non-atomic writes. • Dirty Page Table: • Strict 2PL (at least for writes). – One entry per dirty page in buffer pool. • STEAL, NO-FORCE buffer management, with Write- –Contains recLSN -- the LSN of the log record which Ahead Logging. first caused the page to be dirty.

The Big Picture: What’s Stored Where Checkpointingtable

• Periodically, the DBMS creates a checkpoint, in order to minimize the time taken to recover in the event of a system LOG crash. Write to log: DB RAM – begin_checkpoint record: Indicates when chkpt began. LogRecords – end_checkpoint record: Contains current Xact table and dirty page prevLSN Xact Table . This is a `fuzzy checkpoint’: XID Data pages lastLSN • Other Xacts continue to run; so these tables only known to reflect type each status some mix of state after the time of the . pageID with a • No attempt to force dirty pages to disk; effectiveness of checkpoint length pageLSN Dirty Page Table limited by oldest unwritten change to a dirty page. (So it’s a good idea begin_checkpoint recLSN to periodically flush dirty pages to disk!) offset before-image master record – Store LSN of chkpt record in a safe place (master record). record after-image flushedLSN Simple Transaction Abort Abort, cont.

• For now, consider an explicit abort of a Xact. – No crash involved. • To perform UNDO, must have a lock on data! • We want to “play back” the log in reverse – No problem! order, UNDOing updates. • Before restoring old value of a page, write a CLR: – You continue logging while you UNDO!! –Get lastLSN of Xact from Xact table. – CLR has one extra field: undonextLSN – Can follow chain of log records backward via the • Points to the next LSN to undo (i.e. the prevLSN of the record we’re prevLSN field. currently undoing). – Note: before starting UNDO, could write an Abort – CLR contains REDO info log record. –CLRs never Undone • Undo needn’t be idempotent (>1 UNDO won’t happen) • Why bother? • But they might be Redone when repeating history (=1 UNDO guaranteed) • At end of all UNDOs, write an “end” log record.

Transaction Commit Crash Recovery: Big Picture

Oldest log • Write commit record to log. rec. of Xact  Start from a checkpoint (found active at crash • All log records up to Xact’s lastLSN are flushed. via master record). – Guarantees that flushedLSN  lastLSN. Smallest  Three phases. Need to: – Note that log flushes are sequential, synchronous recLSN in dirty page – Figure out which Xacts writes to disk. table after committed since checkpoint, – Many log records per log page. Analysis which failed (Analysis). • Make transaction visible – REDO all actions. – Commit() returns, locks dropped, etc. Last chkpt  (repeat history) • Write end record to log. – UNDO effects of failed Xacts. CRASH A RU Recovery: The Analysis Phase Recovery: The REDO Phase • We repeat History to reconstruct state at crash: – Reapply all updates (even of aborted Xacts!), redo • Reconstruct state at checkpoint. CLRs. –via end_checkpoint record. • Scan forward from log rec containing smallest • Scan log forward from begin_checkpoint. recLSN in D.P.T. For each CLR or update log rec LSN, –Endrecord: Remove Xact from Xact table. REDO the action unless page is already more – Other records: Add Xact to Xact table, set uptodate than this record: lastLSN=LSN, change Xact status on commit. – REDO when Affected page is in D.P.T., and has – Update record: If P not in Dirty Page Table, pageLSN (in DB) < LSN. [if page has recLSN > LSN no • Add P to D.P.T., set its recLSN=LSN. need to read page in from disk to check pageLSN] • To REDO an action: This phase could be skipped; – Reapply logged action. information can be regained in REDO pass following –Set pageLSN to LSN. No additional logging!

Invariant Recovery: The UNDO Phase

• State of page P is the outcome of all changes • Key idea: Similar to simple transaction abort, of relevant log records whose LSN is <= for each loser transaction (that was in flight or P.pageLSN aborted at time of crash) • During redo phase, every page P has – Process each loser transaction’s log records P.pageLSN >= redoLSN backwards; undoing each record in turn and generating CLRs • Thus at end of redo pass, the has a • But: loser may include partial (or complete) state that reflects exactly everything on the rollback actions (stable) log • Avoid to undo what was already undone – undoNextLSN field in each CLR equals prevLSN field from the original action UndoNextLSN Recovery: The UNDO Phase

ToUndo={ l | l a lastLSN of a “loser” Xact} Repeat: – Choose largest LSN among ToUndo. – If this LSN is a CLR and undonextLSN==NULL •Write an End record for this Xact. – If this LSN is a CLR, and undonextLSN != NULL • Add undonextLSN to ToUndo • (Q: what happens to other CLRs?) – Else this LSN is an update. Undo the update, write a CLR, add prevLSN to ToUndo. Until ToUndo is empty. From Mohan et al, TODS 17(1):94-162

Example of Recovery Example: Crash During Restart! LSN LOG LSN LOG 00,05 begin_checkpoint, end_checkpoint RAM 00 begin_checkpoint RAM 10 update: T1 writes P5 05 end_checkpoint 20 update T2 writes P3 undonextLSN Xact Table 10 update: T1 writes P5 prevLSNs Xact Table 30 T1 abort lastLSN lastLSN 20 update T2 writes P3 40,45 CLR: Undo T1 LSN 10, T1 End status status 30 T1 abort Dirty Page Table Dirty Page Table 50 update: T3 writes P1 recLSN 40 CLR: Undo T1 LSN 10 recLSN 60 update: T2 writes P5 flushedLSN 45 T1 End flushedLSN CRASH, RESTART 50 update: T3 writes P1 70 CLR: Undo T2 LSN 60 ToUndo 60 update: T2 writes P5 ToUndo 80,85 CLR: Undo T3 LSN 50, T3 end CRASH, RESTART CRASH, RESTART 90 CLR: Undo T2 LSN 20, T2 end Additional Crash Issues Parallelism during restart

• What happens if system crashes during Analysis? During REDO? • Activities on a given page must be processed in sequence • How do you limit the amount of work in REDO? – Flush asynchronously in the background. • Activities on different pages can be done in parallel – Watch “hot spots”! • How do you limit the amount of work in UNDO? – Avoid long-running Xacts.

Log record contents Physical logging

• Describe the bits (optimization: only those that • What is actually stored in a log record, to allow change) REDO and UNDO to occur? • Eg • Many choices, 3 main types – OLD STATE: 0x47A90E…. –PHYSICAL – NEW STATE: 0x632F00… – LOGICAL – So REDO: set to NEW; UNDO: set to OLD – PHYSIOLOGICAL • Or just delta (OLD XOR NEW) – DELTA: 0x24860E… – So REDO=UNDO=xor with delta • Ponder: XOR is not idempotent, but redo and undo must be; why is this OK? Logical Logging Physiological Logging

• Describe the operation and arguments • Describe changes to a specified page, logically • Eg Update field 3 of record whose key is 37, by within that page adding 32 • Goes with common page layout, with records • We need a programmer supplied inverse indexed from a page header operation to undo this • Allows movement within the page (important for records whose length varies over time) • Eg on page 298, replace record at index 17 from old state to new state • Eg on page 35, insert new record at index 20

ARIES logging Interactions

• ARIES allows different log approaches; • Recovery is designed with deep awareness of common choice is: access methods (eg B-trees) and concurrency • Physiological REDO logging control – Independence of REDO (e.g. indexes & tables) • And vice versa • Can have concurrent commutative logical operations like • Need to handle failure during page split, increment/decrement (“escrow transactions”) reobtaining locks for prepared transactions • Logical UNDO during recovery, etc – To allow for simple management of physical structures that are invisible to users • CLR may act on different page than original action – To allow for escrow Nested Top Actions Summary of Logging/Recovery

• Trick to support physical operations you do not • Recovery Manager guarantees Atomicity & want to ever be undone Durability. –Example? • Use WAL to allow STEAL/NO-FORCE w/o • Basic idea sacrificing correctness. – At end of the nested actions, write a dummy CLR • LSNs identify log records; linked into • Nothing to REDO in this CLR backwards chains per transaction (via – Its UndoNextLSN points to the step before the prevLSN). nested action. • pageLSN allows comparison of data page and log records.

Summary, Cont. Further reading

• Checkpointing: A quick way to limit the • Repeating History Beyond ARIES, amount of log to scan on recovery. – C. Mohan, Proc VLDB’99 • Recovery works in 3 phases: – Reflections on the work 10 years later – Analysis: Forward from checkpoint. • Model and Verification of a Data Manager –Redo:Forward from oldest recLSN. Based on ARIES –Undo: Backward from end to first LSN of oldest – D. Kuo, ACM TODS 21(4):427-479 Xact alive at crash. – Proof of a substantial subset • Upon Undo, write CLRs. • A Survey of B-Tree Logging and Recovery • Redo “repeats history”: Simplifies the logic! Techniques – G. Graefe, ACM TODS 37(1), article 1