CS162: Operating Systems and Systems Programming

Lecture 26: Transactions for Durability

4 August 2015
Charles Reiss
https://cs162.eecs.berkeley.edu/

Recall: Filesystem Inconsistency

Example: system crashes while creating /usr/lib/foo

Maybe there's a directory entry for it

… but the inode is still marked as free

[Figure: /usr/lib holds entries (bar, #1236) … (foo, #7823), while the free inumber list still contains #7820 … #7823 #7824 #7825 – inode #7823 is both referenced and marked free]

Approach #1: Careful Ordering

Sequence operations in a specific order ― Careful design to allow sequence to be interrupted safely

Post-crash recovery ― Read data structures to see if there were any operations in progress ― Clean up/finish as needed

Approach taken in FAT, FFS (fsck), and many app-level recovery schemes (e.g., Word)

FFS: Create a File

Normal operation:

Allocate data block
Write data block

Allocate inode

Write inode block
Update directory with file name -> file number
Update modify time for directory

FFS: Create a File

Normal operation: ― Allocate data block ― Write data block ― Allocate inode ― Write inode block ― Update directory with file name -> file number ― Update modify time for directory

Recovery (fsck): ― Scan entire inode array ― If any unlinked files (not in any directory), delete ● Or relink (make up a name) ― Compare free block bitmap against inode trees ― Scan directories for missing update/access times ― Time proportional to size of disk

Application Level

Normal operation: ― Write name of each open file to app folder ― Write changes to backup file ― Rename backup file to be the file (atomic operation provided by the file system) ― Delete list in app folder on clean shutdown

Recovery: ― On startup, see if any files were left open ― If so, look for backup file ― If so, ask user to compare versions

Approach #2: Copy on Write (1)

To update file system, write a new version of the file system containing the update ― Never update in place ― Reuse existing unchanged disk blocks

Approach taken in network file server appliances (WAFL, ZFS)

Approach #2: Copy on Write (2)

Seems expensive! But ― Updates can be batched ― Almost all disk writes can occur in parallel ― Most of cost of write is in seeks, not actual transfer ― Applications often rewrite whole file anyways

Approach #2: Copy-on-Write (3)

To change a file, make a new copy of the file, then change the directory entry

Why is this consistent? ― Assume directory entry has to point to either the old version or the new one

Why is this durable? ― Don't say the operation is done until we've replaced the directory entry

Emulating COW @ user level

Open/Create a new file foo.v ― where v is the version #

Do all the updates based on the old foo ― Reading from foo and writing to foo.v ― Including copying over any unchanged parts

Update the link ― rename("foo.v", "foo"); /* atomically replace – directory entry now points at the new file # (link() would fail if "foo" already exists; rename() is the atomic replace) */
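A minimal C sketch of this user-level copy-on-write pattern (the file names, the appended "update", and the abbreviated error handling are illustrative only):

    #include <stdio.h>
    #include <unistd.h>

    /* Build the new version alongside the old one, then atomically
     * swap it in. A crash leaves either the old or the new version. */
    int cow_update(const char *path) {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.v", path);  /* new version's name */
        FILE *oldf = fopen(path, "r");
        FILE *newf = fopen(tmp, "w");
        if (!oldf || !newf) {
            if (oldf) fclose(oldf);
            if (newf) fclose(newf);
            return -1;
        }
        int c;
        while ((c = getc(oldf)) != EOF)   /* copy over unchanged parts */
            putc(c, newf);
        fputs("appended update\n", newf); /* the "update" itself */
        fflush(newf);                     /* stdio buffer -> kernel */
        fsync(fileno(newf));              /* kernel -> disk, BEFORE the swap */
        fclose(oldf);
        fclose(newf);
        return rename(tmp, path);         /* atomic swap of the file # */
    }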

Emulating COW @ user level

― Open/Create a new file foo.v ― Do all the updates based on the old foo ― Update the link (rename()/link())

Does it work?

What if multiple updaters at same time?

How to keep track of every version of file? ― Would we want to do that?

Creating a New Version

[Figure: old version and new version of the file's block tree – the write creates new blocks along the path to the changed data]

If file represented as a tree of blocks, just need to update the leading fringe


13 ZFS

Variable sized blocks: 512 B – 128 KB

Symmetric tree ― Know if it is large or small when we make the copy

Store version number with pointers ― Can create new version by adding blocks and new pointers

Buffers a collection of writes before creating a new version with them

Free space represented as tree of extents in each block group ― Delay updates to freespace (in log) and do them all when block group is activated

COW with smaller-radix blocks

[Figure: old version and new version of the block tree with smaller blocks]

If file represented as a tree of blocks, just need to update the leading fringe

More General Solutions

Transactions for Atomic Updates ― Ensure that multiple related updates are performed atomically ― i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates ― Most modern file systems use transactions internally to update the many pieces ― Many applications implement their own transactions

Redundancy for media failures ― Redundant representation (error correcting codes) ― Replication ― E.g., RAID disks

Recall: Transaction

An atomic sequence of actions (reads/writes) on a storage system (or database)

That takes it from one consistent state to another

[Figure: a transaction takes the system from consistent state 1 to consistent state 2]

Recall: Fast AND Right???

The concepts related to transactions appear in many aspects of systems – wherever reliability and performance must be reconciled: ― File Systems ― Database systems ― Concurrent Programming

Example of a powerful, elegant concept simplifying implementation AND achieving better performance. System behavior is viewed from a particular perspective. ― Properties are met from that perspective

Redo Logging

Example: A: Bank of America, W: Wells Fargo. Prepare phase (cont'd, commit case):

A receives guarantee of commit: ― A writes "transfer $100 from W; commit" to log ― A applies transaction ― A → W: I'll commit, too

W finalizes transaction: ― W writes "commit" to log ― W applies transaction

A crashes before applying transaction?

Recall: Redo Logging

Example: A: Bank of America, W: Wells Fargo. Prepare phase (cont'd, commit case):

A receives guarantee of commit: ― A writes "transfer $100 from W; commit" to log ― A applies transaction ― A → W: I'll commit, too

W finalizes transaction: ― W writes "commit" to log ― W applies transaction

A crashes before applying transaction? A reads its log when it is restarted and reapplies the transaction. Called redo logging.

Typical Structure With Redo Logging

Begin a transaction – get transaction id

Do a bunch of updates TO THE LOG ― If any fail along the way, roll-back ― Or, if any conflicts with other transactions, roll-back

Commit the transaction TO THE LOG

Actually do the updates to the non-log version

Crash before committing to the log? When rereading the log, ignore those updates.

Crash after committing to the log? When rereading the log, redo those updates.
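In outline, the logging side of this structure might look like the following C sketch (the record format, the fsync-based ordering, and all names are illustrative, not any real file system's on-disk format):

    #include <unistd.h>

    struct log_rec {
        int  txn_id;                          /* which transaction */
        enum { LOG_UPDATE, LOG_COMMIT } type;
        long block;                           /* target block (updates only) */
        char data[512];                       /* new contents (updates only) */
    };

    /* Append one record and wait until it is really in the log. */
    int log_append(int log_fd, const struct log_rec *r) {
        if (write(log_fd, r, sizeof *r) != (ssize_t)sizeof *r)
            return -1;
        return fsync(log_fd);     /* log write completes before we go on */
    }

    /* The single append of a commit record is the commit point. */
    int log_commit(int log_fd, int txn_id) {
        struct log_rec r = { .txn_id = txn_id, .type = LOG_COMMIT };
        return log_append(log_fd, &r);
    }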

Assumptions about the Log

Log can be written in order ― If "Commit Transaction" is present, then previous parts of the log – the updates – will be too

Log writes actually complete before we do the real updates ― Commit record must always be present if any of the updates it would cause are ― Transaction didn't commit → no changes

What do disks actually do?

Partially written sector? ― Error-correcting codes identify → sector read will fail

Write log in order? ― Well, ….

Recall: Disk Scheduling

Also something controller does (disk controller has its own queue)

More criteria to optimize for, like process scheduling ― Fairness versus Throughput versus Response Time

What do disks actually do?

Partially written sector? ― Error-correcting codes identify → sector read will fail

Write log in order? ― Well, they might reorder writes, keep them queued

OS needs to tell controller to do writes in order ― Wait for confirmation write is actually complete ― Disk controllers have caches + delayed writes!

Transactional File Systems

Better reliability through use of log ― All changes are treated as transactions ― A transaction is committed once it is written to the log ● Data forced to disk for reliability ● Process can be accelerated with NVRAM ― Although File system may not be updated immediately, data preserved in the log

Journaling File System ― Applies updates to system metadata using transactions (using logs, etc.) ― Ex: NTFS, Apple HFS+, Linux XFS, JFS, ext3, ext4

Redo Logging

Prepare ― Write all changes (in transaction) to log

Commit ― Single disk write to make transaction durable

Redo ― Copy changes to disk

Garbage collection ― Reclaim space in log

Recovery ― Read log ― Redo any operations for committed transactions ― Garbage collect log

Creating a file (No transaction)

Find free data block(s)
Find free inode entry
Find dirent insertion point
Write map (i.e., mark used)
Write inode entry to point to block(s)
Write dirent to point to inode

[Figure: on-disk layout – free space map, data blocks, inode table, directory entries]

Creating a file (As transaction)

Find free data block(s)
Find free inode entry
Find dirent insertion point
Write map (used)
Write inode entry to point to block(s)
Write dirent to point to inode

[Figure: the same on-disk structures plus a redo log in non-volatile storage (Flash or on disk); between tail and head the pending transaction's records run from "start" to "commit", with earlier entries already done]

ReDo log

After Commit

All access to file system first looks in log

Eventually copy changes to disk

[Figure: as changes are copied to disk, the tail pointer advances through the committed records in the log in non-volatile storage (Flash)]

Crash during logging – Recover

Upon recovery scan the log

Detect transaction start with no commit

Discard log entries

Disk remains unchanged

[Figure: log in non-volatile storage (Flash or on disk) holding a "start" record with no matching commit]

Recovery After Commit

Scan log, find start

Find matching commit

Redo it as usual ― Or just let it happen later

[Figure: log in non-volatile storage (Flash or on disk) holding matching "start" and "commit" records]

What if we had already started writing back the transaction?

Idempotent – the result does not change if the operation is repeated several times.

Just write them again during recovery.

Example: ― Log might say "mark blocks #300-305 as used". Marking them as used again will not change anything. (See the sketch below.)
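A hedged sketch of why replaying that bitmap update is harmless (free_map is a hypothetical in-memory copy of the free-space bitmap):

    #include <stdint.h>

    /* Mark blocks [first, last] as used in the free-space bitmap.
     * Setting an already-set bit changes nothing, so replaying this
     * log entry during (even repeated) recovery is safe. */
    void mark_used(uint8_t *free_map, int first, int last) {
        for (int b = first; b <= last; b++)
            free_map[b / 8] |= (uint8_t)(1u << (b % 8));
    }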

What if the uncommitted transaction was discarded on recovery? Do it again from scratch (maybe)

Nothing on disk was changed ― No partial results – we wrote to the log first

What if we crash again during recovery? Idempotent

Just redo whatever part of the log hasn't been garbage collected

Redo Logging

Prepare ― Write all changes (in transaction) to log

Commit ― Single disk write to make transaction durable

Redo ― Copy changes to disk

Garbage collection ― Reclaim space in log

Recovery ― Read log ― Redo any operations for committed transactions ― Garbage collect log

A Performance Question

Consider creating 100 small files

Which is faster – ― Journaling; or ― Approach #1 – careful ordering

Note: Logging needs to write strictly more!

Performance of Careful Ordering

4 random writes for each of 100 files – free list, inode, directory entry, data

Can't reorder writes (disk head optimization – SCAN, etc.) – this is how we maintain consistency!

Suppose 5 ms seek time – 100 x 4 x 5 ms = 2 sec

Performance of Logging

4 random writes for each of 100 files – free list, inode, directory entry, data

But first write to log – 400 sequential writes of 1 block each

@ 100 MB/sec → 400 × 0.04 ms = 16 ms

Then the transaction is durable!
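One way to check that number (assuming 4 KB log blocks – the figure that makes one block 0.04 ms at 100 MB/s): 400 × 4 KB = 1.6 MB of log, and 1.6 MB ÷ 100 MB/s = 16 ms of transfer, plus roughly one 5 ms seek to reach the log – on the order of 20 ms, against 2000 ms for careful ordering.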

Performance: Logging v Ordering

Logging: ~16 ms until durable

Careful ordering: 2000 ms until durable

But, wait! Don't we have extra work to apply the operations in the log? ― We can do it in the background! User applications don't need to wait. ― We can use disk head scheduling for it – fewer seeks.

Performance

Log written sequentially ― Sometimes kept in flash storage

Asynchronous write back ― Any order as long as all changes are logged before commit, and all write backs occur after commit

Can process multiple transactions ― Transaction ID in each log entry ― Transaction completed iff its commit record is in log

Degrees of Durability

Different types of journaling ― Sometimes only metadata operations ― Sometimes metadata and user data

Tradeoff: ― only metadata – logged information smaller ― data – duplicating large writes – worse for performance

Redo Log Implementation

[Figure: volatile memory holds pending write-backs plus log-head and log-tail pointers; persistent storage holds the log-head pointer and the log itself. From older to newer: garbage-collected entries (free, available for reuse), committed entries being written back (eligible for GC once write-back completes), then uncommitted new records (in use)]

Isolation

Process A: move foo from dir x to dir y
mv x/foo y/

Process B: grep across a and b
grep 162 a/* b/* > log

Assuming 162 appears only in foo,

what are the possible outcomes of B without transactions?

What if x, y and a, b are disjoint?

What if x == a and y == b?

Must prevent interleaving so as to provide clean semantics….

Multiple Transactions in the Log

[Figure: log between tail and head holding the records of two transactions, each bracketed by its own "start" and "commit"]

Two programs creating two files at the same time? ― Don't want to wait for one transaction to finish before starting the next

Can we interleave transactions in the log?

This is a very subtle question

The answer is “if they are serializable” ― i.e., would be possible to reorder them in series without violating any dependences

Deep theory around consistency, serializability, and memory models in the OS, Database, and Architecture fields, respectively ― A bit more later – and in the graduate course…

Recall: Consistency with TPC

Simplify: One key, initially A; N=2, W=2, R=2

Two clients: C1 and C2; two nodes: N1 and N2

Timeline:
― C1 → N1: PREPARE C1:put(X)
― N1 → C1: VOTE-COMMIT C1:put(X)
― C2 → N1: PREPARE C2:put(Y)
― N1 → C2: VOTE-ABORT C2:put(Y)
― So C2 GLOBAL-ABORTs

Why does N1 vote to abort? It needs to track what it's agreed to do (locking) – C1:put(X) is still pending.

What do we use to prevent interleaving? Locks!

But here we need to acquire multiple locks

What about deadlock?

Locks – in a new form

“Locks” to control access to data

Two types of locks: ― shared (S) lock – multiple concurrent transactions allowed to operate on data ― exclusive (X) lock – only one transaction can operate on data at a time

Lock compatibility matrix:
        S   X
   S    Y   N
   X    N   N

Two-Phase Locking (2PL)

1) Each transaction must obtain: ― S (shared) or X (exclusive) lock on data before reading, ― X (exclusive) lock on data before writing

2) A transaction can not request additional locks once it releases any locks

Thus, each transaction has a "growing phase" followed by a "shrinking phase"; the boundary between them is the lock point.

[Figure: number of locks held vs. time – rising through the growing phase to the lock point, then falling through the shrinking phase]

Avoid deadlock by acquiring locks in some lexicographic order (see the sketch below)
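A C sketch of the two-phase discipline with ordered acquisition (pthread mutexes stand in for transaction locks; the lock table and txn_move are illustrative):

    #include <pthread.h>

    #define NUM_OBJECTS 128

    /* One lock per object; assume each entry was set up with
     * pthread_mutex_init() at startup. */
    pthread_mutex_t lock_table[NUM_OBJECTS];

    /* Move an item from object x to object y (assumes x != y). */
    void txn_move(int x, int y) {
        int lo = x < y ? x : y, hi = x < y ? y : x;
        pthread_mutex_lock(&lock_table[lo]);   /* growing phase: acquire */
        pthread_mutex_lock(&lock_table[hi]);   /* in index order         */
        /* ... apply updates, append log records, append commit ... */
        pthread_mutex_unlock(&lock_table[hi]); /* shrinking phase: only  */
        pthread_mutex_unlock(&lock_table[lo]); /* after commit           */
    }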

Transaction Isolation

Process A: ― LOCK x, y ― mv x/foo y/ (move foo from dir x to dir y) ― Commit and Release x, y

Process B: ― LOCK x, y and log ― grep 162 x/* y/* > log (grep across x and y) ― Commit and Release x, y, log

grep appears either before or after move

Need log/recover AND 2PL to get ACID

Serializability

With two phase locking and redo logging, transactions appear to occur in a sequential order (serializability) ― Either: grep then move or move then grep ― If the operations from different transactions get interleaved in the log, it is because it is OK ● 2PL prevents it if serializability would be violated ● Typically, because they were independent

Other implementations can also provide serializability ― Optimistic concurrency control: abort any transaction that would conflict with serializability

Caveat

Most file systems implement a transactional model internally ― Copy on write ― Redo logging

Most file systems provide a transactional model for individual system calls ― File rename, move, …

Most file systems do NOT provide a transactional model for user data

POSIX view of file consistency

Low-level I/O calls occur in a single order ― read()/write()/... ― As if each is a transaction

But no guarantees about data after crash, except…

fsync(filedescriptor) ― low-level I/O call to force data for a particular file to disk ― promise: it's durable after this returns
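For example, an application that wants an update durable before proceeding might do the following (a sketch; the path and data are illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    /* Append and force to disk: only once fsync() returns does POSIX
     * promise the bytes survive a crash. */
    int durable_append(const char *path, const void *buf, size_t len) {
        int fd = open(path, O_WRONLY | O_APPEND);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }
        return close(fd);
    }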

Distributed File Systems

[Figure: client sends "read file" over the network; server returns the data]

Distributed File System: ― Transparent access to files stored on a remote disk

You've used one! Your course accounts. ― Log into icluster22 or icluster23 or icluster24 – same files ― They're stored on some big box in ???

Naming Options

[Figure: client reads a file from a server over the network]

Explicit server names: ― fopen("inst-fs.eecs.berkeley.edu:/vol/inst/ff/cs162/foo","r")

Local "mount-point": ― (in some local configuration file) /home/ff inst-fs:/vol/inst/ff ― (then in some C code) fopen("/home/ff/cs162/foo","r")

Global namespace: ― meaningful names, some distributed database

Distributed Naming Goals

Location transparency ― application doesn't know where the data really is ― mount points, global namespace provide this

Migration transparency ― application can access same files if it moves to a different machine ― explicit server names, global namespace provide this ― mount points don't in general (could be different on each machine)

Mount Points

Local name for a filesystem ― Remote or not (e.g. multiple disks, inserted USB drive)

Unix way: ― Some filesystem is the root (/) ― Other filesystems appear as ordinary directories within the tree – replacing what was in the "real" filesystem ― Part of file name resolution in kernel

[Figure: mount kubi:/jane, mount coeus:/sue, mount kubi:/prog attached at directories of the local tree]

Traditional Windows way: ― Filesystems appear as "drive-letters" ― C:/foo versus D:/foo

Simple Distributed File System

[Figure: clients send Read (RPC) and Write (RPC) to the server over the network; the server holds the cache and returns data or an ACK]

Remote Disk: Reads and writes forwarded to server ― Use Remote Procedure Calls (RPC) to translate local file system calls into remote ones

Advantage: Completely consistent, easy to implement (if you have RPC)

Problems? Performance! ― Going over network is slower than going to local memory ― Lots of network traffic – waiting for round-trips!

Caching to reduce network load

[Figure: client reads f1 via RPC, the server returns version V1, and the client caches F1:V1; later read(f1) calls are served from the cache]

Clients keep a local copy of files

Example: use buffer cache for remote files – just like a local disk

Fast! No network latency for open/read/write/close

Caching to reduce network load

[Figure: client 1 has F1:V1 cached; client 2 writes f1, making the server's copy F1:V2; client 1's read(f1) still returns V1 from its cache]

What happens when someone else writes the file? ― We'll return the wrong version ― … even if they send the write to the server

Many solutions to this problem (later)

Crash! Failures

What if server crashes? Can client wait until server comes back up and continue as before? ― Any data in server memory but not on disk can be lost ― Shared state across RPC: What if server crashes after seek? Then, when client does “read”, it will fail ― Message retries: suppose server crashes after it does UNIX “rm foo”, but before acknowledgment? ● Message system will retry: send it again ● How does it know not to delete it again? (could solve with two-phase commit protocol, but NFS takes a more ad hoc approach)

Stateless protocol: A protocol in which all information required to process a request is passed with the request ― Server keeps no state about client, except as hints to help improve performance (e.g. a cache) ― Thus, if server crashes and is restarted, requests can continue where they left off (in many cases)

What if client crashes? ― Might lose modified data in client cache

Crash! Failures

What if server crashes? Can client just retry? Example: open A, read 100 bytes, [server crash], read 10 bytes ― If the client did get the result of the first read, will the second read return bytes 101-110? ― If the client did not get the result of the first read, will it get bytes 101-200 when it retries it?

Example: client runs 'rm foo', doesn't get acknowledgement ― Can the client retry this safely? ― Should it tell the user that the rm failed?

NFS solution: stateless protocol ― Definition: protocol where all information required to process a request is passed in the request ― Choose operations such that server doesn't need to remember anything ― Example: server doesn't remember file positions for clients

NFS

POSIX filesystem interface to applications: ― open, read, write, close + file descriptors

Kernel uses the Virtual File System (VFS) layer to dispatch calls to local filesystem or NFS service layer

NFS client layer makes RPC calls to NFS server ― maintains local state about open files

Some caching ― More later

NFSv2 RPC calls (subset)

LOOKUP(filename) → file ID (file #)

GETATTR(file ID) → metadata (size, owner, etc.)

READ(file ID, offset, length) → data

WRITE(file ID, data, offset) → success or failure

CREATE(filename, metadata) → file ID

REMOVE(filename) → success or failure
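Since each READ names the file ID and offset explicitly, a client-side wrapper needs no server-side file position (a sketch; nfs_read() is a hypothetical stub for the READ RPC above):

    #include <sys/types.h>

    /* Hypothetical client stub for the READ RPC. */
    ssize_t nfs_read(int file_id, off_t offset, size_t length, void *buf);

    /* Unlike read(2), every request carries its own offset, so the
     * server keeps no per-client file position between requests. */
    ssize_t read_at(int file_id, void *buf, size_t len, off_t off) {
        return nfs_read(file_id, off, len, buf);
    }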

Local state for NFS

Map file descriptor → (server name, file ID, offset) ― maintained by clients ― servers only see file IDs + offsets

Don't need to recover this state on failure ― "Fate sharing" ― Application that needs it will crash anyways

Retry on Failure + Idempotency

Client's strategy when server goes down: ― Retry until it works

Note: it may have already happened

Need idempotency – doing it twice is the same as doing it once ― Example: repeating WRITE(file ID, data, offset) doesn't change anything – it overwrites the data with the same data
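A sketch of that retry strategy (nfs_write() is a hypothetical stub for the WRITE RPC; a real client would cap or back off the retries):

    #include <unistd.h>
    #include <sys/types.h>

    /* Hypothetical client stub for the WRITE RPC. */
    int nfs_write(int file_id, const void *data, size_t len, off_t off);

    /* Retry until the server answers. Repeating a WRITE that already
     * happened just overwrites the data with the same data. */
    int write_with_retry(int file_id, const void *data, size_t len, off_t off) {
        while (nfs_write(file_id, data, len, off) != 0)
            sleep(1);            /* server down or reply lost: try again */
        return 0;
    }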

Tricky Cases for Idempotency

REMOVE(filename) → success or failure ― What if the file was already deleted, but the reply was lost? ― Retrying won't change anything – but will "fail" ― One (unsatisfying) solution: ignore failures ("advisory")

Caching in NFS

Write-through: ― Server always has latest version of the file

For reads: ― Client checks for changes whenever file is opened ― Client polls periodically for changes (3-30 seconds)

NFS Cache (in)consistency

Client polls server periodically to check for changes ― Every 3-30 seconds ― Use old version until timeout

[Figure: client asks the server "F1 still ok?"; the server replies "No: F1:V2" and the client updates its cache from F1:V1 to F1:V2]
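In outline, the client-side check might look like this (a sketch; nfs_getattr_mtime() is a hypothetical wrapper around GETATTR, and the 30 s interval is the upper end of NFS's 3-30 s range):

    #include <time.h>

    #define POLL_INTERVAL 30      /* seconds; NFS uses 3-30 s */

    /* Hypothetical wrapper around the GETATTR RPC. */
    time_t nfs_getattr_mtime(int file_id);

    struct cached_file {
        int    file_id;
        time_t last_checked;
        time_t server_mtime;      /* modification time last seen from server */
    };

    /* 1 = cached copy may be used, 0 = refetch from server. Within the
     * poll interval the cache is trusted, so reads can be stale. */
    int cache_ok(struct cached_file *f) {
        time_t now = time(NULL);
        if (now - f->last_checked < POLL_INTERVAL)
            return 1;
        f->last_checked = now;
        time_t m = nfs_getattr_mtime(f->file_id);
        if (m == f->server_mtime)
            return 1;
        f->server_mtime = m;
        return 0;
    }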

Caching in NFS

Write-through: ― Server always has latest version of the file

For reads: ― Client polls periodically for changes (3-30 seconds)

This allows inconsistencies! ― NFS's decision: files usually aren't shared close in time, so it's okay

Why no invalidation?

Couldn't server tell clients that the file changed? ― Yes – but that's not stateless ― Dealing with crashes is a lot more complicated ― May hurt server performance – need to track all cached files of each client "just in case"

Cache (In)consistency

Example: Start with file contents = "A" (timeline runs left to right)

Client 1: Read: gets A … Write B … Read: parts of B or C
Client 2: Read: parts of A or B … Write C
Client 3: Read: parts of B or C

Ideal view: As if a single-core system ― Reads and writes apparently occurred in a single order ― Always read all of the last-written version

NFS view: ― 30 second window where reads do not return latest version ― Overlapping writes mixed arbitrarily – may read parts of multiple versions ― Multiple files: May read some files as more up-to-date than others

Cache (In)consistency


Same example as above: not even the same parts – ― Depends when during the read (if ever) client 1 learns about client 2's write of C

Andrew File System (1)

Andrew File System (AFS, late 80’s) → DCE DFS (commercial product)

Callbacks: Server records who has copy of file ― On changes, server immediately tells all with old copy ― No polling (periodic checking)

Means the server is not stateless ― If server goes down, it needs to relearn who has a copy of the file ― How? Asks every client.

Andrew File System (2)

Write through on close() ― Updates visible to other clients only after the file is closed ― Only write whole files

Applications don't see updates until they reopen file ― Client decides which version then

→ Result: never see "parts of B and C"

Andrew File System (3)

Data cached on local disk of client as well as memory ― Cache can be huge – lots of cache hits ― On open with a cache miss (file not on local disk): ● Get file from server, set up callback with server ― On write followed by close: ● Send copy to server; tells all clients with copies to fetch new version from server on next open (using callbacks)

Implementation of NFS: Enabling Factor – Virtual Filesystem (VFS)

VFS: Virtual abstraction similar to local file system ― Provides virtual inodes, files, etc. ― Compatible with a variety of local and remote file systems

VFS allows the same system call interface to be used for different types of file systems ― Common parts of the system call interface (name resolution, permission checking, file descriptors, caching) not implemented again for each kind of filesystem

In Linux, "VFS" stands for "Virtual Filesystem Switch"

VFS: Common File Model in Linux

Four primary object types for VFS: ― superblock object: represents a specific mounted filesystem ― inode object: represents a specific file (or directory) ― dentry object: represents a directory entry ― file object: represents open file associated with process
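Each of these object types carries a table of methods – the "operations objects" described below. As a kernel-side sketch, a filesystem's file-operations table might look like this (simplified; the myfs_* functions are hypothetical implementations):

    #include <linux/fs.h>
    #include <linux/module.h>

    static ssize_t myfs_read(struct file *f, char __user *buf,
                             size_t len, loff_t *pos);
    static ssize_t myfs_write(struct file *f, const char __user *buf,
                              size_t len, loff_t *pos);
    static int myfs_open(struct inode *inode, struct file *f);
    static int myfs_release(struct inode *inode, struct file *f);

    /* The VFS calls through this table when a process does
     * read()/write()/open()/close() on a file in this filesystem. */
    static const struct file_operations myfs_file_ops = {
        .owner   = THIS_MODULE,
        .read    = myfs_read,
        .write   = myfs_write,
        .open    = myfs_open,
        .release = myfs_release,
    };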

May need to fit the model by faking it ― Example: make it look like directories are files ― Example: make it look like you have inodes, superblocks, etc.

Linux VFS

[Figure: write() in user space enters sys_write(), which calls through the VFS to the specific filesystem's write method, which writes the physical media]

An operations object is contained within each primary object type to set the operations of specific filesystems ― "super_operations": methods that kernel can invoke on a specific filesystem, i.e. write_inode() and sync_fs() ― "inode_operations": methods that kernel can invoke on a specific file, such as create() and link() ― "dentry_operations": methods that kernel can invoke on a specific directory entry, such as d_compare() or d_delete() ― "file_operations": methods that process can invoke on an open file, such as read() and write()

Summary: Distributed File Systems