CS162 Operating Systems and Systems Programming
Lecture 20
Reliability, Transactions
Distributed Systems

November 6th, 2017
Prof. Ion Stoica
http://cs162.eecs.Berkeley.edu

Important "ilities"
• Availability: the probability that the system can accept and process requests
  – Often measured in "nines" of probability. So, a 99.9% probability is considered "3-nines of availability"
  – Key idea here is independence of failures
• Durability: the ability of a system to recover data despite faults
  – This idea is fault tolerance applied to data
  – Doesn't necessarily imply availability: information on pyramids was very durable, but could not be accessed until discovery of Rosetta Stone
• Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
  – Usually stronger than simply availability: means that the system is not only "up", but also working correctly
  – Includes availability, security, fault tolerance/durability
  – Must make sure data survives system crashes, disk crashes, etc.
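A quick worked example of what the "nines" translate to (my arithmetic, not from the slides): 99.9% availability ("3 nines") permits about 0.001 x 8760 = roughly 8.8 hours of downtime per year, while 99.999% ("5 nines") permits only about 5 minutes per year.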

How to Make Durable?
• Disk blocks contain Reed-Solomon error correcting codes (ECC) to deal with small defects in disk drive
  – Can allow recovery of data from small media defects
• Make sure writes survive in short term
  – Either abandon delayed writes or
  – Use special, battery-backed RAM (called non-volatile RAM or NVRAM) for dirty blocks in buffer cache
• Make sure that data survives in long term
  – Need to replicate! More than one copy of data!
  – Important element: independence of failure
    » Could put copies on one disk, but if disk head fails…
    » Could put copies on different disks, but if server fails…
    » Could put copies on different servers, but if building is struck by lightning…
    » Could put copies on servers in different continents…

RAID: Redundant Arrays of Inexpensive Disks
• Invented by David Patterson, Garth A. Gibson, and Randy Katz here at UCB in 1987
• Data stored on multiple disks (redundancy)
• Either in software or hardware
  – In hardware case, done by disk controller; file system may not even know that there is more than one disk in use
• Initially, five levels of RAID (more now)


RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "shadow"
  – For high I/O rate, high availability environments
  – Most expensive solution: 100% capacity overhead
• Bandwidth sacrificed on write:
  – Logical write = two physical writes
  – Highest bandwidth when disk heads and rotation fully synchronized (hard to do exactly)
• Reads may be optimized
  – Can have two independent reads to same data
• Recovery:
  – Disk failure ⇒ replace disk and copy data to new disk
  – Hot Spare: idle disk already attached to system to be used for immediate replacement
[Figure: each data disk duplicated onto a shadow disk; a mirrored pair forms a recovery group]

RAID 5+: High I/O Rate Parity
• Data striped across multiple disks
  – Successive blocks stored on successive (non-parity) disks
  – Increased bandwidth over single disk
• Parity block constructed by XORing data blocks in stripe
  – P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
  – Can destroy any one disk and still reconstruct data
  – Suppose Disk 3 fails, then can reconstruct: D2 = D0 ⊕ D1 ⊕ D3 ⊕ P0
• Can spread information widely across internet for durability
  – Overview now, more later in semester
[Figure: stripe units laid out across Disks 1–5 with rotating parity (shown in green) and increasing logical disk addresses: D0 D1 D2 D3 P0 / D4 D5 D6 P1 D7 / D8 D9 P2 D10 D11 / D12 P3 D13 D14 D15 / P4 D16 D17 D18 D19 / D20 D21 D22 D23 P5]
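The parity relations on the RAID 5 slide are plain bytewise XOR, which is easy to check in code. The following is a minimal sketch of that arithmetic only (tiny made-up block size and contents, not real RAID or controller code): it builds P0 from D0 through D3 and then rebuilds a "failed" D2 from the surviving blocks and the parity.

/* Sketch of RAID 5 parity arithmetic over one stripe (illustration only). */
#include <stdio.h>
#include <string.h>

#define NDATA 4   /* D0..D3         */
#define BLOCK 8   /* toy block size */

/* P = D0 ^ D1 ^ ... ^ D(n-1), byte by byte */
static void make_parity(unsigned char d[NDATA][BLOCK], unsigned char p[BLOCK]) {
    memset(p, 0, BLOCK);
    for (int i = 0; i < NDATA; i++)
        for (int j = 0; j < BLOCK; j++)
            p[j] ^= d[i][j];
}

/* Rebuild a lost data block: XOR of the parity and the surviving data blocks */
static void rebuild(unsigned char d[NDATA][BLOCK], unsigned char p[BLOCK],
                    int lost, unsigned char out[BLOCK]) {
    memcpy(out, p, BLOCK);
    for (int i = 0; i < NDATA; i++)
        if (i != lost)
            for (int j = 0; j < BLOCK; j++)
                out[j] ^= d[i][j];
}

int main(void) {
    unsigned char d[NDATA][BLOCK] = { "D0data", "D1data", "D2data", "D3data" };
    unsigned char p[BLOCK], recovered[BLOCK];

    make_parity(d, p);            /* P0 = D0 xor D1 xor D2 xor D3        */
    rebuild(d, p, 2, recovered);  /* pretend the disk holding D2 failed  */
    printf("recovered D2: %s\n", (char *)recovered);  /* prints "D2data" */
    return 0;
}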

Higher Durability/Reliability through Geographic Replication
• Highly durable – hard to destroy all copies
• Highly available for reads – read any copy
• Low availability for writes
  – Can't write if any one replica is not up
  – Or – need relaxed consistency model
• Reliability? – availability, security, durability, fault-tolerance
[Figure: Replica #1, Replica #2, …, Replica #n]

File System Reliability
• What can happen if disk loses power or software crashes?
  – Some operations in progress may complete
  – Some operations in progress may be lost
  – Overwrite of a block may only partially complete
• Having RAID doesn't necessarily protect against all such failures
  – No protection against writing bad state
  – What if one disk of RAID group not written?
• File system needs durability (as a minimum!)
  – Data previously stored can be retrieved (maybe after some recovery step), regardless of failure
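To see quantitatively why reads stay highly available while writes do not, assume (my numbers, not the slides') that each replica is independently up with probability 0.99 and that there are 3 replicas: a read that needs any one copy succeeds with probability 1 - (1 - 0.99)^3 = 99.9999%, while a write that must reach every copy succeeds with probability 0.99^3, roughly 97%, which is worse than a single machine.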

Storage Reliability Problem
• Single logical file operation can involve updates to multiple physical disk blocks
  – inode, indirect block, data block, bitmap, …
  – With sector remapping, single update to physical disk block can require multiple (even lower level) updates to sectors
• At a physical level, operations complete one at a time
  – Want concurrent operations for performance
• How do we guarantee consistency regardless of when crash occurs?

Threats to Reliability
• Interrupted Operation
  – Crash or power failure in the middle of a series of related updates may leave stored data in an inconsistent state
  – Example: Transfer funds from one bank account to another
  – What if transfer is interrupted after withdrawal and before deposit?
• Loss of stored data
  – Failure of non-volatile storage media may cause previously stored data to disappear or be corrupted


Reliability Approach #1: Careful Ordering
• Sequence operations in a specific order
  – Careful design to allow sequence to be interrupted safely
• Post-crash recovery
  – Read data structures to see if there were any operations in progress
  – Clean up/finish as needed
• Approach taken by
  – FAT and FFS (fsck) to protect filesystem structure/metadata
  – Many app-level recovery schemes (e.g., Word, emacs autosaves)

FFS: Create a File
Normal operation:
• Allocate data block
• Write data block
• Allocate inode
• Write inode block
• Update bitmap of free blocks and inodes
• Update directory with file name → inode number
• Update modify time for directory

Recovery:
• Scan inode table
• If any unlinked files (not in any directory), delete or put in lost & found dir
• Compare free block bitmap against inode trees
• Scan directories for missing update/access times
Time proportional to disk size
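The "app-level recovery schemes" bullet can be made concrete with the classic temporary-file-plus-rename idiom, which also reappears later on the Transactions slide. The snippet below is a minimal runnable sketch of careful ordering at the application level; the file name and contents are made up, and error handling is abbreviated.

/* Write the new contents to a temp file, force them to disk, then atomically
 * rename over the old file. A crash at any point leaves either the complete
 * old file or the complete new one, never a half-written file.
 * (A fully careful version would also fsync the containing directory.) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int update_file_carefully(const char *path, const char *contents) {
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, contents, strlen(contents)) < 0) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }  /* data durable before rename */
    close(fd);

    return rename(tmp, path);  /* atomic replacement of the directory entry */
}

int main(void) {
    return update_file_carefully("demo.txt", "new contents\n");
}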


Reliability Approach #2: Copy on Write File Layout
• To update file system, write a new version of the file system containing the update
  – Never update in place
  – Reuse existing unchanged disk blocks
• Seems expensive! But
  – Updates can be batched
  – Almost all disk writes can occur in parallel
• Approach taken in network file server appliances
  – NetApp's Write Anywhere File Layout (WAFL)
  – ZFS (Sun/Oracle) and OpenZFS

COW with Smaller-Radix Blocks
• If file represented as a tree of blocks, just need to update the leading fringe
[Figure: old version and new version of the block tree; the write creates new copies only along the updated path, and the two versions share all unchanged blocks]
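A minimal in-memory sketch of the "update only the leading fringe" idea (a toy tree of my own, not WAFL or ZFS code): updating one leaf allocates new copies of just the nodes on the root-to-leaf path, and the old and new versions share every other block.

/* Copy-on-write update of a tree of blocks (in-memory illustration only). */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int leaf_value;          /* meaningful only at leaves */
    struct node *child[2];   /* NULL at leaves            */
} node;

static node *mknode(int v, node *l, node *r) {
    node *n = malloc(sizeof *n);
    n->leaf_value = v; n->child[0] = l; n->child[1] = r;
    return n;
}

/* Return the root of a NEW version with the leaf at `path` (bits, MSB first
 * over `depth` levels) replaced; the old version stays valid and unchanged. */
static node *cow_update(node *old, unsigned path, int depth, int new_value) {
    if (depth == 0)
        return mknode(new_value, NULL, NULL);            /* new leaf block     */
    int dir = (path >> (depth - 1)) & 1;
    node *copy = mknode(old->leaf_value, old->child[0], old->child[1]);
    copy->child[dir] = cow_update(old->child[dir], path, depth - 1, new_value);
    return copy;                                         /* new interior block */
}

int main(void) {
    /* two-level tree with 4 leaves: values 10, 20, 30, 40 */
    node *old_root = mknode(0,
        mknode(0, mknode(10, NULL, NULL), mknode(20, NULL, NULL)),
        mknode(0, mknode(30, NULL, NULL), mknode(40, NULL, NULL)));

    node *new_root = cow_update(old_root, 2 /* leaf #2 */, 2, 99);

    printf("old leaf 2 = %d, new leaf 2 = %d\n",
           old_root->child[1]->child[0]->leaf_value,
           new_root->child[1]->child[0]->leaf_value);    /* 30 vs 99 */
    printf("leaf 3 shared: %s\n",
           old_root->child[1]->child[1] == new_root->child[1]->child[1]
               ? "yes" : "no");                          /* yes */
    return 0;
}

Note that after the update the old root still describes a complete, consistent tree, which is what makes batching updates and keeping old versions natural in this design.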


ZFS and OpenZFS
• Variable sized blocks: 512 B – 128 KB
• Symmetric tree
  – Know if it is large or small when we make the copy
• Store version number with pointers
  – Can create new version by adding blocks and new pointers
• Buffers a collection of writes before creating a new version with them
• Free space represented as tree of extents in each block group
  – Delay updates to freespace (in log) and do them all when block group is activated

More General Reliability Solutions
• Use Transactions for atomic updates
  – Ensure that multiple related updates are performed atomically
  – i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
  – Most modern file systems use transactions internally to update filesystem structures and metadata
  – Many applications implement their own transactions
• Provide Redundancy for media failures
  – Redundant representation on media (Error Correcting Codes)
  – Replication across media (e.g., RAID disk array)


Transactions
• Closely related to critical sections for manipulating shared data structures
• They extend concept of atomic update from memory to stable storage
  – Atomically update multiple persistent data structures
• Many ad hoc approaches
  – FFS carefully ordered the sequence of updates so that if a crash occurred while manipulating directory or inodes the disk scan on reboot would detect and recover the error (fsck)
  – Applications use temporary files and rename

Key Concept: Transaction
• An atomic sequence of actions (reads/writes) on a storage system (or database)
• That takes it from one consistent state to another
[Figure: consistent state 1 → transaction → consistent state 2]


Typical Structure
• Begin a transaction – get transaction id
• Do a bunch of updates
  – If any fail along the way, roll-back
  – Or, if any conflicts with other transactions, roll-back
• Commit the transaction

"Classic" Example: Transaction
BEGIN;    --BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100.00 WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
COMMIT;   --COMMIT WORK

Transfer $100 from Alice's account to Bob's account
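The same begin/update/commit structure, including the roll-back path that the SQL above leaves out, can be driven from C. This is a hedged illustration using SQLite as the storage engine (my choice for the example; the lecture does not prescribe one), with the table setup invented here; a typical build line is cc demo.c -lsqlite3.

/* Typical transaction structure: begin, do updates, commit or roll back. */
#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("bank.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS accounts(name TEXT PRIMARY KEY, balance REAL);"
        "INSERT OR IGNORE INTO accounts VALUES('Alice',1000),('Bob',1000);",
        0, 0, 0);

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);         /* begin a transaction        */
    int rc = sqlite3_exec(db,                    /* do a bunch of updates      */
        "UPDATE accounts SET balance = balance - 100 WHERE name = 'Alice';"
        "UPDATE accounts SET balance = balance + 100 WHERE name = 'Bob';",
        0, 0, 0);
    if (rc == SQLITE_OK)
        sqlite3_exec(db, "COMMIT;", 0, 0, 0);    /* both updates become durable */
    else
        sqlite3_exec(db, "ROLLBACK;", 0, 0, 0);  /* neither update takes effect */

    sqlite3_close(db);
    return 0;
}

If either UPDATE fails, the ROLLBACK leaves both balances exactly as they were; if the COMMIT succeeds, both changes become durable together, which is the atomicity and durability discussed on the next slide.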


The ACID properties of Transactions
• Atomicity: all actions in the transaction happen, or none happen
• Consistency: transactions maintain data integrity, e.g.,
  – Balance cannot be negative
  – Cannot reschedule meeting on February 30
• Isolation: execution of one transaction is isolated from that of all others; no problems from concurrency
• Durability: if a transaction commits, its effects persist despite crashes

Administrivia
• Ion out on Wednesday
  – Washington, DC, for an important NSF presentation
  – (I expect this will be the last time I'm out before the end of the semester)
• Wednesday lecture will be given by Aurojit Panda
  – PhD (Networking), 2017
  – Assistant Professor, NYU


Transactional File Systems (1/2)
• Better reliability through use of log
  – All changes are treated as transactions
  – A transaction is committed once it is written to the log
    » Data forced to disk for reliability (improve perf. w/ NVRAM)
  – File system may not be updated immediately, data preserved in the log
• Difference between "Log Structured" and "Journaling"
  – In a Log Structured filesystem, data stays in log form
  – In a Journaling filesystem, Log used for recovery

Break


Transactional File Systems (2/2)
• Full Logging File System
  – All updates to disk are done in transactions (using logs, etc.)
• Journaling File System
  – Applies updates to system metadata using transactions
  – Updates to non-directory files (i.e., user stuff) can be done in place (without logs), full logging optional
  – Ex: NTFS, Apple HFS+, XFS, JFS, ext3, ext4

Logging File Systems (1/2)
• Instead of modifying data structures on disk directly, write changes to a journal/log
  – Intention list: set of changes we intend to make
  – Log/Journal is append-only
  – Single commit record commits transaction
• Once changes are in log, it is safe to apply changes to data structures on disk
  – Recovery can read log to see what changes were intended
  – Can take our time making the changes
    » As long as new requests consult the log first


Logging File Systems (2/2)
• Once changes are copied, safe to remove log
• But, …
  – If the last atomic action is not done … poof … all gone
• Basic assumption:
  – Updates to sectors are atomic and ordered
  – Not necessarily true unless very careful, but key assumption
• Performance
  – Great for random writes: replace with appends to log
  – Impact read performance, but can alleviate this by caching

Redo Logging
• Prepare
  – Write all changes (in transaction) to log
• Commit
  – Single disk write to make transaction durable
• Redo
  – Copy changes to disk
• Garbage collection
  – Reclaim space in log
• Recovery
  – Read log
  – Redo any operations for committed transactions
  – Garbage collect log
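A toy, self-contained version of the redo-logging protocol (the byte-array "disk", the record layout, and all function names are invented for illustration): prepare appends the intended changes to the log, a single commit record makes the transaction durable, redo copies the changes to the disk, and garbage collection reclaims the log. Running redo again after a crash is exactly the recovery step.

/* Toy redo log: illustration of prepare / commit / redo / garbage collection. */
#include <stdio.h>

typedef enum { REC_UPDATE, REC_COMMIT } rec_type;
typedef struct { rec_type type; int txn; int addr; unsigned char value; } log_rec;

static unsigned char disk[16];    /* pretend durable blocks  */
static log_rec       logbuf[32];  /* pretend append-only log */
static int           log_len = 0;

/* Prepare: append intended changes to the log; the disk is untouched. */
static void log_update(int txn, int addr, unsigned char value) {
    logbuf[log_len++] = (log_rec){ REC_UPDATE, txn, addr, value };
}

/* Commit: one appended record makes the whole transaction durable. */
static void log_commit(int txn) {
    logbuf[log_len++] = (log_rec){ REC_COMMIT, txn, 0, 0 };
}

/* Redo: copy the changes of committed transactions out to the disk. */
static void redo(void) {
    for (int i = 0; i < log_len; i++) {
        if (logbuf[i].type != REC_COMMIT) continue;
        for (int j = 0; j < i; j++)
            if (logbuf[j].type == REC_UPDATE && logbuf[j].txn == logbuf[i].txn)
                disk[logbuf[j].addr] = logbuf[j].value;
    }
}

/* Garbage collection: once changes are on disk, reclaim the log space. */
static void gc(void) { log_len = 0; }

int main(void) {
    log_update(1, 3, 'A');   /* prepare: two changes in transaction 1 */
    log_update(1, 7, 'B');
    log_commit(1);           /* commit: transaction 1 is now durable  */
    redo();                  /* copy committed changes to the disk    */
    gc();
    printf("disk[3]=%c disk[7]=%c\n", disk[3], disk[7]);  /* A and B */
    return 0;
}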


Example: Creating a File
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
------
• Write map (i.e., mark used)
• Write inode entry to point to block(s)
• Write dirent to point to inode
[Figure: on-disk structures touched by the create: free space map, data blocks, inode table, directory entries]

Ex: Creating a file (as a transaction)
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
------
• [log] Write map (used)
• [log] Write inode entry to point to block(s)
• [log] Write dirent to point to inode
[Figure: same on-disk structures, plus the log in non-volatile storage (Flash or on Disk); entries between the tail and head pointers are marked done or pending and are bracketed by start and commit records]


ReDo Log
• After Commit
• All access to file system first looks in log
• Eventually copy changes to disk
[Figure: free space map, data blocks, inode table, and directory entries on disk, plus the log in non-volatile storage (Flash); the tail pointer advances past done and pending entries between the start and commit records as changes are copied out to disk]

Crash During Logging – Recover
• Upon recovery scan the log
• Detect transaction start with no commit
• Discard log entries
• Disk remains unchanged
[Figure: log in non-volatile storage (Flash or on Disk) holding a start record and pending entries but no commit; the on-disk free space map, data blocks, inode table, and directory entries are untouched]


Recovery After Commit
• Scan log, find start
• Find matching commit
• Redo it as usual
  – Or just let it happen later
[Figure: log in non-volatile storage (Flash or on Disk) with done and pending entries between tail and head, bracketed by start and commit records; on-disk free space map, data blocks, inode table, and directory entries]

Course Structure: Spiral
[Figure: course topics arranged in a spiral, beginning with the intro]
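The recovery rule on the two log-recovery slides above (a start with a matching commit is redone, a start with no commit is discarded) fits in a few lines. The record layout below is invented for illustration; a real journal would also carry the block contents needed to redo the updates.

/* Sketch of the recovery scan: redo committed transactions, discard the rest. */
#include <stdio.h>

typedef enum { REC_START, REC_UPDATE, REC_COMMIT } rec_type;
typedef struct { rec_type type; int txn; } rec;

int main(void) {
    /* A log as it might look after a crash: txn 1 committed, txn 2 did not. */
    rec logbuf[] = {
        { REC_START, 1 }, { REC_UPDATE, 1 }, { REC_UPDATE, 1 }, { REC_COMMIT, 1 },
        { REC_START, 2 }, { REC_UPDATE, 2 },           /* crash before commit */
    };
    int n = sizeof logbuf / sizeof logbuf[0];

    for (int i = 0; i < n; i++) {
        if (logbuf[i].type != REC_START) continue;
        int committed = 0;
        for (int j = i + 1; j < n; j++)
            if (logbuf[j].type == REC_COMMIT && logbuf[j].txn == logbuf[i].txn)
                committed = 1;
        printf("txn %d: %s\n", logbuf[i].txn,
               committed ? "redo (apply its updates to disk)"
                         : "discard (disk remains unchanged)");
    }
    return 0;
}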


Societal Scale Information Systems
• The world is a large distributed system
  – Microprocessors in everything
  – Vast infrastructure behind them
[Figure: Internet Connectivity linking scalable, reliable, secure services (massive clusters on gigabit Ethernet, databases, information collection, remote storage, online games, commerce, …) down to MEMS for sensor nets]

Centralized vs Distributed Systems
• Centralized System: System in which major functions are performed by a single physical computer
  – Originally, everything on single computer
  – Later: client/server model
[Figure: a server with attached clients (Client/Server Model) contrasted with a Peer-to-Peer Model]

Centralized vs Distributed Systems
• Distributed System: physically separate computers working together on some task
  – Early model: multiple servers working together
    » Probably in the same room or building
    » Often called a "cluster"
  – Later models: peer-to-peer/wide-spread collaboration

Distributed Systems: Motivation/Issues/Promise
• Why do we want distributed systems?
  – Cheaper and easier to build lots of simple computers
  – Easier to add power incrementally
  – Users can have complete control over some components
  – Collaboration: much easier for users to collaborate through network resources (such as network file systems)
• The promise of distributed systems:
  – Higher availability: one machine goes down, use another
  – Better durability: store data in multiple locations
  – More security: each piece easier to make secure


Distributed Systems: Reality
• Reality has been disappointing
  – Worse availability: depend on every machine being up
    » Lamport: "a distributed system is one where I can't do work because some machine I've never heard of isn't working!"
  – Worse reliability: can lose data if any machine crashes
  – Worse security: anyone in world can break into system
• Coordination is more difficult
  – Must coordinate multiple copies of shared state information (using only a network)
  – What would be easy in a centralized system becomes a lot more difficult

Distributed Systems: Goals/Requirements
• Transparency: the ability of the system to mask its complexity behind a simple interface
• Possible transparencies:
  – Location: Can't tell where resources are located
  – Migration: Resources may move without the user knowing
  – Replication: Can't tell how many copies of resource exist
  – Concurrency: Can't tell how many users there are
  – Parallelism: System may speed up large jobs by splitting them into smaller pieces
  – Fault Tolerance: System may hide various things that go wrong
• Transparency and collaboration require some way for different processors to communicate with one another


Summary
• RAID: Redundant Arrays of Inexpensive Disks
  – RAID1: mirroring, RAID5: Parity block
• Use of Log to improve Reliability
  – Journaling file systems such as ext3, NTFS
• Transactions: ACID semantics
  – Atomicity
  – Consistency
  – Isolation
  – Durability

