Transactions for Durability

CS162: Operating Systems and Systems Programming
Lecture 26: Transactions for Durability
4 August 2015
Charles Reiss
https://cs162.eecs.berkeley.edu/

Recall: Filesystem Inconsistency
Example: system crashes while creating /usr/lib/foo
Maybe there's a directory entry for it … but the inode is still marked as free
[Figure: /usr/lib contains (bar, #1236) … (foo, #7823), while the free inumber list still holds #7820 … #7823, #7824, #7825]

Approach #1: Careful Ordering
Sequence operations in a specific order
― Careful design to allow sequence to be interrupted safely
Post-crash recovery
― Read data structures to see if there were any operations in progress
― Clean up/finish as needed
Approach taken in FAT, FFS (fsck), and many app-level recovery schemes (e.g., Word)

FFS: Create a File
Normal operation:
― Allocate data block
― Write data block
― Allocate inode
― Write inode block
― Update directory with file name → file number
― Update modify time for directory
Recovery (fsck):
― Scan entire inode array
― If any unlinked files (not in any directory), delete them, or relink them (make up a name)
― Compare free block bitmap against inode trees
― Scan directories for missing update/access times
Time proportional to size of disk

Application Level
Normal operation:
― Write name of each open file to app folder
― Write changes to backup file
― Rename backup file to be file (atomic operation provided by file system)
― Delete list in app folder on clean shutdown
Recovery:
― On startup, see if any files were left open
― If so, look for backup file
― If so, ask user to compare versions

Approach #2: Copy on Write (1)
To update file system, write a new version of the file system containing the update
― Never update in place
― Reuse existing unchanged disk blocks
Approach taken in network file server appliances (WAFL, ZFS)

Approach #2: Copy on Write (2)
Seems expensive! But
― Updates can be batched
― Almost all disk writes can occur in parallel
― Most of the cost of a write is in seeks, not the actual transfer
― Applications often rewrite the whole file anyway

Approach #2: Copy-on-Write (3)
To change a file, make a new copy of the file, then change the directory entry
Why is this consistent?
― Assume the directory entry has to point to either the old version or the new one
Why is this durable?
― Don't say the operation is done until we've replaced the directory entry

Emulating COW @ user level
Open/Create a new file foo.v
― where v is the version #
Do all the updates based on the old foo
― Reading from foo and writing to foo.v
― Including copying over any unchanged parts
Update the link
― rename("foo.v", "foo"); /* atomically replace the name foo; overwrites the old file */
― A plain link("foo.v", "foo") fails with EEXIST while foo already exists, so rename() is the call that replaces it
Does it work?
What if multiple updaters at same time?
How to keep track of every version of file?
― Would we want to do that?

Creating a New Version
[Figure: old version and new version of a file's block tree after a write]
If a file is represented as a tree of blocks, only the leading fringe needs to be updated

ZFS
Variable sized blocks: 512 B – 128 KB
Symmetric tree
― Know if it is large or small when we make the copy
Store version number with pointers
― Can create new version by adding blocks and new pointers
Buffers a collection of writes before creating a new version with them
Free space represented as tree of extents in each block group
― Delay updates to free space (in log) and do them all when block group is activated

COW with smaller-radix blocks
[Figure: old version and new version of a smaller-radix block tree after a write; again only the leading fringe is updated]

More General Solutions
Transactions for Atomic Updates
― Ensure that multiple related updates are performed atomically
― i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
― Most modern file systems use transactions internally to update the many pieces
― Many applications implement their own transactions
Redundancy for media failures
― Redundant representation (error correcting codes)
― Replication
― E.g., RAID disks

Recall: Transaction
An atomic sequence of actions (reads/writes) on a storage system (or database) that takes it from one consistent state to another
[Figure: consistent state 1 → transaction → consistent state 2]

Recall: Fast AND Right???
The concepts related to transactions appear in many aspects of systems
― File Systems
― Data Base systems
― Concurrent Programming
[Figure: Reliability and Performance]
Example of a powerful, elegant concept simplifying implementation AND achieving better performance
System behavior is viewed from a particular perspective
― Properties are met from that perspective

Redo Logging Example (A: Bank of America, W: Wells Fargo)
Prepare phase (con't), commit case
A receives guarantee of commit:
― A writes "transfer $100 from W; <all committed>" to log
― A applies transaction
― A → W: I'll commit, too
W finalizes transaction:
― W writes "<all committed>" to log
― W applies transaction
What if W crashes before applying the transaction?
― W reads the log when it is restarted
― W reapplies the transaction
Called redo logging

Typical Structure With Redo Logging
Begin a transaction – get transaction id
Do a bunch of updates TO THE LOG
― If any fail along the way, roll-back
― Or, if any conflicts with other transactions, roll-back
Commit the transaction TO THE LOG
Actually do the updates to the non-log version
Crash before committing to the log?
― When rereading the log, ignore those updates
Crash after committing to the log?
― When rereading the log, redo those updates

Assumptions about the Log
Log can be written in order
― If "Commit Transaction" is present, then previous parts of the log (the updates) will be too
Log writes actually complete before we do the real updates
― Commit record must always be present if any of the updates it would cause are
― Transaction didn't commit → no changes
Recall: Disk Scheduling
Also something the controller does (the disk controller has its own queue)
More criteria to optimize for, like process scheduling
― Fairness versus Throughput versus Response Time

What do disks actually do?
Partially written sector?
― Error-correcting codes identify it → sector read will fail
Write log in order?
― Well, disks might reorder writes and keep them queued
OS needs to tell the controller to do writes in order
― Wait for confirmation that the write is actually complete
― Disk controllers have caches + delayed writes!

Transactional File Systems
Better reliability through use of log
― All changes are treated as transactions
― A transaction is committed once it is written to the log
   ● Data forced to disk for reliability
   ● Process can be accelerated with NVRAM
― Although the file system may not be updated immediately, data is preserved in the log
Journaling File System
― Applies updates to system metadata using transactions (using logs, etc.)
― Ex: NTFS, Apple HFS+, Linux XFS, JFS, ext3, ext4

Redo Logging
Prepare
― Write all changes (in transaction) to log
Commit
― Single disk write to make transaction durable
Redo
― Copy changes to disk
Garbage collection
― Reclaim space in log
Recovery
― Read log
― Redo any operations for committed transactions
― Garbage collect log

Creating a File (No Transaction)
Find free data block(s)
Find free inode entry
Find dirent insertion point
Write map (i.e., mark used)
Write inode entry to point to block(s)
Write dirent to point to inode
[Figure: on-disk structures: free space map, data blocks, inode table, directory entries]

Creating a File (As Transaction)
Find free data block(s)
Find free inode entry
Find dirent insertion point
Write map (used)
Write inode entry to point to block(s)
Write dirent to point to inode
[Figure: the same on-disk structures plus a log with tail and head pointers; the new records, from "start" through "commit", are pending, older records are done]
Log in non-volatile storage (Flash or on Disk)

ReDo Log After Commit
All access to the file system first looks in the log
Eventually copy changes to disk
[Figure: the log tail advances as committed records are copied to disk]

Crash During Logging – Recover
Upon recovery, scan the log
Detect transaction start with no commit
Discard log entries
Disk remains unchanged

Recovery After Commit
Scan log, find start
Find matching commit
Redo it as usual
― Or just let it happen later

What if we had already started writing back the transaction?
Idempotent – the result does not change if the operation is repeated several times
Just write them again during recovery
Example:
― Log might say "mark blocks #… as used"; setting those bits a second time leaves the map unchanged
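The idempotence argument can be made concrete with a free-space bitmap. In this hypothetical sketch, NBLOCKS, bitmap, free_count, and the redo_* functions are all invented for illustration. Replaying a "mark blocks used" record sets the same bits again. Replaying a record that decrements a free-block counter, by contrast, counts the same blocks twice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical free-space map: bit b set means block b is in use. */
#define NBLOCKS 64

static uint8_t bitmap[NBLOCKS / 8];
static int free_count = NBLOCKS;   /* a non-idempotent counter, for contrast */

/* Idempotent redo: OR-ing a bit to 1 gives the same result however many
 * times recovery replays this record. */
void redo_mark_used(unsigned first, unsigned last) {
    assert(first <= last && last < NBLOCKS);
    for (unsigned b = first; b <= last; b++)
        bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
}

/* NOT idempotent: if this record is replayed after it already reached the
 * disk, the same blocks are subtracted again. */
void redo_decrement_free(unsigned first, unsigned last) {
    assert(first <= last && last < NBLOCKS);
    free_count -= (int)(last - first + 1);
}

int is_used(unsigned b) {
    assert(b < NBLOCKS);
    return (bitmap[b / 8] >> (b % 8)) & 1;
}
```

One common way to keep a counter safe under replay is to log its new absolute value instead of a delta. Replay then overwrites the value rather than accumulating changes.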

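The "What do disks actually do?" slide says the OS must force the log writes to complete in order. On a POSIX system this can be sketched with fsync() as a barrier between the update records and the commit record. This is a minimal sketch under stated assumptions. The text-line log format and the names write_all and log_transaction are hypothetical. The sketch also assumes fsync() returns only after the data has reached stable storage. Whether a drive's volatile write cache honors that depends on the hardware and file-system configuration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write the whole string; write() may accept fewer bytes than requested. */
static int write_all(int fd, const char *s) {
    size_t len = strlen(s);
    while (len > 0) {
        ssize_t n = write(fd, s, len);
        if (n < 0)
            return -1;
        s += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical log format: one text line per update, then "COMMIT <id>".
 * Returns 0 only after both fsync() barriers have succeeded. */
int log_transaction(const char *log_path, int txid,
                    const char *const updates[], size_t n_updates) {
    assert(log_path != NULL && (updates != NULL || n_updates == 0));
    char commit[64];
    int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    for (size_t i = 0; i < n_updates; i++)
        if (write_all(fd, updates[i]) != 0)
            goto fail;
    /* Barrier 1: every update record is on stable storage before the commit
     * record is written, so "commit present" implies "updates present". */
    if (fsync(fd) != 0)
        goto fail;
    snprintf(commit, sizeof commit, "COMMIT %d\n", txid);
    if (write_all(fd, commit) != 0)
        goto fail;
    /* Barrier 2: only after this succeeds may the caller report the
     * transaction as done and start writing the real (non-log) blocks. */
    if (fsync(fd) != 0)
        goto fail;
    return close(fd);
fail:
    close(fd);
    return -1;
}
```

Two fsync() calls per transaction are expensive. This is one reason the Transactional File Systems slide mentions accelerating the commit with NVRAM.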