Reliability, Transactions, Distributed Systems

CS162: Operating Systems and Systems Programming
Lecture 20: Reliability, Transactions, Distributed Systems
November 6th, 2017
Prof. Ion Stoica
http://cs162.eecs.Berkeley.edu
CS162 © UCB Fall 2017

Important “ilities”
• Availability: the probability that the system can accept and process requests
  – Often measured in “nines” of probability; a 99.9% probability is considered “3 nines of availability”
  – Key idea here is independence of failures
• Durability: the ability of a system to recover data despite faults
  – This is fault tolerance applied to data
  – Doesn’t necessarily imply availability: the information on the pyramids was very durable, but could not be accessed until the discovery of the Rosetta Stone
• Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
  – Usually stronger than simple availability: means that the system is not only “up”, but also working correctly
  – Includes availability, security, fault tolerance/durability
  – Must make sure data survives system crashes, disk crashes, etc.

How to Make a File System Durable?
• Disk blocks contain Reed-Solomon error correcting codes (ECC) to deal with small defects in the disk drive
  – Can allow recovery of data from small media defects
• Make sure writes survive in the short term
  – Either abandon delayed writes, or
  – Use special battery-backed RAM (non-volatile RAM, or NVRAM) for dirty blocks in the buffer cache
• Make sure data survives in the long term
  – Need to replicate: more than one copy of the data!
  – Important element: independence of failures
    » Could put copies on one disk, but if the disk head fails…
    » Could put copies on different disks, but if the server fails…
    » Could put copies on different servers, but if the building is struck by lightning…
    » Could put copies on servers on different continents…

RAID: Redundant Arrays of Inexpensive Disks
• Invented by David Patterson, Garth A. Gibson, and Randy Katz here at UCB in 1987
• Data stored on multiple disks (redundancy)
• Implemented either in software or hardware
  – In the hardware case, done by the disk controller; the file system may not even know there is more than one disk in use
• Initially five levels of RAID (more now)

RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its “shadow”
  – For high-I/O-rate, high-availability environments
  – Most expensive solution: 100% capacity overhead
• Bandwidth sacrificed on write:
  – Logical write = two physical writes
  – Highest bandwidth when disk heads and rotation are fully synchronized (hard to do exactly)
• Reads may be optimized
  – Can have two independent reads to the same data
• Recovery:
  – Disk failure ⇒ replace disk and copy data to the new disk
  – Hot spare: an idle disk already attached to the system, used for immediate replacement

RAID 5+: High I/O Rate Parity
• Data striped across multiple disks
  – Successive blocks stored on successive (non-parity) disks
  – Increased bandwidth over a single disk
• Parity block constructed by XORing the data blocks in a stripe
  – P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
  – Can destroy any one disk and still reconstruct the data
  – Suppose Disk 3 fails; then we can reconstruct: D2 = D0 ⊕ D1 ⊕ D3 ⊕ P0
• Can spread information widely across the internet for durability
  – Overview now, more later in the semester
[Figure: stripe units across Disks 1–5; data blocks D0–D23 with parity blocks P0–P5 rotated across the disks; logical disk addresses increase down the recovery group]

Higher Durability/Reliability through Geographic Replication
• Highly durable – hard to destroy all copies
• Highly available for reads – read any copy
• Low availability for writes
  – Can’t write if any one replica is not up
  – Or – need a relaxed consistency model
• Reliability? – availability, security, durability, fault tolerance
[Figure: copies maintained at Replica #1, Replica #2, …, Replica #n]

File System Reliability
• What can happen if the disk loses power or the software crashes?
  – Some operations in progress may complete
  – Some operations in progress may be lost
  – An overwrite of a block may only partially complete
• Having RAID doesn’t necessarily protect against all such failures
  – No protection against writing bad state
  – What if one disk of the RAID group is not written?
• A file system needs durability (as a minimum!)
  – Data previously stored can be retrieved (maybe after some recovery step), regardless of failure

Storage Reliability Problem
• A single logical file operation can involve updates to multiple physical disk blocks
  – inode, indirect block, data block, bitmap, …
  – With sector remapping, a single update to a physical disk block can require multiple (even lower-level) updates to sectors
• At the physical level, operations complete one at a time
  – Want concurrent operations for performance
• How do we guarantee consistency regardless of when the crash occurs?

Threats to Reliability
• Interrupted operation
  – A crash or power failure in the middle of a series of related updates may leave stored data in an inconsistent state
  – Example: transfer funds from one bank account to another
  – What if the transfer is interrupted after the withdrawal and before the deposit?
• Loss of stored data
  – Failure of non-volatile storage media may cause previously stored data to disappear or be corrupted
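The “nines” shorthand from the availability definition above maps directly to allowable downtime per year. A minimal sketch of the arithmetic (the function name `downtime_minutes_per_year` is illustrative, not from any library):

```python
# Hypothetical helper: expected downtime per year for "N nines" of availability.
def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** (-nines)          # e.g., 3 nines -> 0.999
    return (1 - availability) * 365 * 24 * 60  # unavailable fraction of a year

# 3 nines (99.9%) permits roughly 525.6 minutes (~8.8 hours) of downtime a year;
# 5 nines (99.999%) permits only about 5.3 minutes.
```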
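The parity arithmetic on the RAID 5 slide (P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3) can be checked in a few lines. This is a toy sketch of the XOR math only, not a disk driver; all names are illustrative:

```python
# RAID 5 parity over one stripe of four equal-sized data blocks.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
p0 = xor_blocks(d0, d1, d2, d3)          # parity block for the stripe

# If the disk holding D2 fails, XOR of the survivors plus parity rebuilds it:
rebuilt = xor_blocks(d0, d1, d3, p0)
assert rebuilt == d2
```

Because XOR is its own inverse, any single missing block (data or parity) can be recomputed from the remaining four.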
Reliability Approach #1: Careful Ordering
• Sequence operations in a specific order
  – Careful design to allow the sequence to be interrupted safely
• Post-crash recovery
  – Read data structures to see if any operations were in progress
  – Clean up/finish as needed
• Approach taken by
  – FAT and FFS (fsck) to protect file system structure/metadata
  – Many app-level recovery schemes (e.g., Word, emacs autosaves)

FFS: Create a File
Normal operation:
• Allocate data block
• Write data block
• Allocate inode
• Write inode block
• Update bitmap of free blocks and inodes
• Update directory with file name → inode number
• Update modify time for directory

Recovery:
• Scan inode table
• If any unlinked files (not in any directory), delete or put in lost & found dir
• Compare free block bitmap against inode trees
• Scan directories for missing update/access times
• Time proportional to disk size

Reliability Approach #2: Copy-on-Write File Layout
• To update the file system, write a new version of the file system containing the update
  – Never update in place
  – Reuse existing unchanged disk blocks
• Seems expensive! But:
  – Updates can be batched
  – Almost all disk writes can occur in parallel
• Approach taken in network file server appliances
  – NetApp’s Write Anywhere File Layout (WAFL)
  – ZFS (Sun/Oracle) and OpenZFS

COW with Smaller-Radix Blocks
• If a file is represented as a tree of blocks, a write just needs to update the leading fringe
[Figure: old version and new version share all unchanged blocks; a write copies only the blocks on the path from the changed data up to the root]

ZFS and OpenZFS
• Variable-sized blocks: 512 B – 128 KB
• Symmetric tree
  – Know if it is large or small when we make the copy
• Store version number with pointers
  – Can create a new version by adding blocks and new pointers
• Buffers a collection of writes before creating a new version with them
• Free space represented as a tree of extents in each block group
  – Delay updates to free space (in a log) and do them all when the block group is activated

More General Reliability Solutions
• Use transactions for atomic updates
  – Ensure that multiple related updates are performed atomically
  – i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
  – Most modern file systems use transactions internally to update file system structures and metadata
  – Many applications implement their own transactions
• Provide redundancy for media failures
  – Redundant representation on the media (error correcting codes)
  – Replication across media (e.g., a RAID disk array)

Transactions
• Closely related to critical sections for manipulating shared data structures
• They extend the concept of atomic update from memory to stable storage
  – Atomically update multiple persistent data structures
• Many ad hoc approaches
  – FFS carefully ordered the sequence of updates so that if a crash occurred while manipulating directories or inodes, the disk scan on reboot (fsck) would detect and recover from the error
  – Applications use temporary files and rename

Key Concept: Transaction
• An atomic sequence of actions (reads/writes) on a storage system (or database)
• That takes it from one consistent state to another:
  consistent state 1 —[transaction]→ consistent state 2

Typical Structure
• Begin a transaction – get a transaction id
• Do a bunch of updates
  – If any fail along the way, roll back
  – Or, if any conflict with other transactions, roll back
• Commit the transaction

“Classic” Example: Transaction

BEGIN; --BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
COMMIT;
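The same all-or-nothing transfer can be demonstrated with SQLite from Python. This is a sketch, not the slide’s exact schema: it collapses the example to a single `accounts` table, and relies on the `sqlite3` connection context manager to commit on success or roll back on an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("Alice", 500.0), ("Bob", 100.0)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 100.00 "
                     "WHERE name = 'Alice'")
        conn.execute("UPDATE accounts SET balance = balance + 100.00 "
                     "WHERE name = 'Bob'")
except sqlite3.Error:
    pass  # if anything failed mid-transfer, neither update is visible

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# Both updates applied atomically: Alice 400.0, Bob 200.0
```

If a crash or error occurred between the withdrawal and the deposit, the rollback would leave both balances at their original values: the interrupted-transfer threat from the earlier slide.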
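The “temporary files and rename” trick listed under ad hoc approaches works because rename over an existing file is atomic on POSIX file systems: readers see either the old file or the complete new one, never a partial write. A hedged sketch (`atomic_write` is an illustrative helper, not a standard API):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` all-or-nothing using the temp-file-plus-rename pattern."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())              # push the new contents to stable storage
        os.replace(tmp, path)                 # atomic rename over any old version
    except BaseException:
        os.unlink(tmp)                        # on failure, the old file is untouched
        raise

# Usage: even if we crash mid-write, the target is never half-written.
target = os.path.join(tempfile.mkdtemp(), "settings.conf")
atomic_write(target, b"version=2\n")
```

The `fsync` before the rename matters: without it, the rename can become durable before the file’s data does, which is exactly the kind of ordering hazard the careful-ordering slide describes.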
