CS5460: Operating Systems

Lecture 20: Reliability

File System Optimizations

            Technique              Effect
Modern      Disk buffer            Eliminates problem
            Aggregated disk I/O    Reduces seeks
            Prefetching            Overlap/hide disk access
Historic    Disk head scheduling   Reduces seeks
            Disk interleaving      Reduces rotational latency

• Goal: Reduce or hide expensive disk operations

Buffer/Page Cache

• Idea: Keep recently used disk blocks in kernel memory
• Process reads from a file:
  – If blocks are not in buffer cache
    » Allocate space in buffer cache (Q: What do we purge and how?)
    » Initiate a disk read
    » Block the process until disk operations complete
  – Copy data from buffer cache to process memory
  – Finally, system call returns
• Usually, a process does not see the buffer cache directly
• mmap() maps buffer cache pages into process RAM

Buffer/Page Cache

• Process writes to a file:
  – If blocks are not in the buffer cache
    » Allocate pages
    » Initiate disk read
    » Block process until disk operations complete
  – Copy written data from process RAM to buffer cache
• Default: writes create dirty pages in the cache, then the system call returns
  – Data gets written to device in the background
  – What if the file is unlinked before it goes to disk?
• Optional: Synchronous writes, which go to disk before the system call returns
  – Really slow!
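The read and write paths above can be sketched as a toy write-back cache. This is a minimal sketch, not how any particular kernel does it: the class name BufferCache, the LRU eviction policy, the 4-byte block size, and the dict standing in for the disk are all assumptions made for illustration.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # tiny blocks keep the example readable (assumption)


class BufferCache:
    """Toy write-back block cache with LRU eviction (hypothetical design)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # dict: block number -> bytes, stands in for the device
        self.capacity = capacity      # maximum number of cached blocks
        self.blocks = OrderedDict()   # block number -> bytearray, least recently used first
        self.dirty = set()            # cached blocks that are newer than the disk copy
        self.disk_reads = 0
        self.disk_writes = 0

    def _write_back(self, n, data):
        self.disk[n] = bytes(data)
        self.disk_writes += 1
        self.dirty.discard(n)

    def _load(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)          # hit: mark as most recently used
            return self.blocks[n]
        if len(self.blocks) >= self.capacity:   # miss: purge the LRU block first
            victim, data = self.blocks.popitem(last=False)
            if victim in self.dirty:            # a dirty victim must reach the disk
                self._write_back(victim, data)
        self.disk_reads += 1                    # a real process would block here
        self.blocks[n] = bytearray(self.disk.get(n, bytes(BLOCK_SIZE)))
        return self.blocks[n]

    def read(self, n):
        return bytes(self._load(n))             # copy out to "process memory"

    def write(self, n, offset, payload):
        block = self._load(n)                   # a partial write needs the old contents
        block[offset:offset + len(payload)] = payload
        self.dirty.add(n)                       # return without touching the disk

    def sync(self):
        for n in sorted(self.dirty):            # stands in for the background flush
            self._write_back(n, self.blocks[n])
```

A write only dirties the cached copy; the disk copy changes when the block is evicted or when sync() runs, which is why a crash in between loses the data.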

Performing Large File I/Os

• Idea: Try to allocate contiguous chunks of a file in large contiguous regions of the disk
  – Disks have excellent bandwidth, but lousy latency!
  – Amortize expensive seeks over many block reads/writes
• Question: How?
  – Maintain a free-block bitmap (cache parts in memory)
  – When you allocate blocks, use a modified “best fit” algorithm rather than allocating one block at a time (even pre-allocate)
• Problem: Hard to do this when the disk is full/fragmented
  – Solution A: Keep a reserve (e.g., 10%) available at all times
  – Solution B: Run a disk “defragger” occasionally
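One way to picture best-fit allocation over a free-block bitmap is the sketch below. It shows plain best fit, not the "modified" variant the slide refers to, and the function names free_runs and allocate_extent are hypothetical.

```python
def free_runs(bitmap):
    """Yield (start, length) for each maximal run of free blocks.

    bitmap[b] is True when block b is in use and False when it is free.
    """
    start = None
    for i, used in enumerate(bitmap):
        if not used and start is None:
            start = i                      # a free run begins here
        elif used and start is not None:
            yield start, i - start         # the run ended just before block i
            start = None
    if start is not None:
        yield start, len(bitmap) - start   # run extends to the end of the disk


def allocate_extent(bitmap, count):
    """Best fit: claim `count` blocks from the smallest free run that is big enough.

    Returns the first block number of the extent, or None if no run fits.
    """
    candidates = [(length, start) for start, length in free_runs(bitmap)
                  if length >= count]
    if not candidates:
        return None                        # too full or too fragmented
    _, start = min(candidates)             # smallest adequate run, lowest address on ties
    for i in range(start, start + count):
        bitmap[i] = True
    return start
```

Picking the smallest adequate run leaves the large runs intact for later large files; when no single run is big enough the call fails, which is the full/fragmented problem the slide describes.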

Prefetching

• Idea: Read blocks from disk ahead of user request
• Goal: Reduce number of seeks visible to user
  – If a block is read before it is requested → the request hits in the file buffer cache

[Figure: timeline of user requests (Read 0, Read 1, Read 2) alongside the corresponding file system reads]

• Problem: What blocks should we prefetch?
  – Easy: Detect sequential access and prefetch ahead N blocks
  – Harder: Detect periodic/predictable “random” accesses
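The "easy" case can be sketched as follows. This is an illustrative toy under stated assumptions: the class name ReadAhead, the rule "two consecutive block numbers means sequential", and the synchronous prefetch are all made up for the example; a real system would issue the prefetch reads asynchronously so they overlap with the user's computation.

```python
class ReadAhead:
    """Toy sequential-access detector (hypothetical names and policy)."""

    def __init__(self, fetch, window=4):
        self.fetch = fetch        # callable that reads one block from "disk"
        self.window = window      # N: how many blocks to prefetch ahead
        self.cache = {}           # block number -> data already in memory
        self.last = None          # previous block number the user asked for
        self.hits = 0             # requests satisfied without a demand read

    def read(self, n):
        if n in self.cache:
            self.hits += 1
        else:
            self.cache[n] = self.fetch(n)      # demand miss: the user waits for the disk
        if self.last is not None and n == self.last + 1:
            # Two consecutive blocks in a row: guess that this is a sequential scan.
            for ahead in range(n + 1, n + 1 + self.window):
                if ahead not in self.cache:
                    self.cache[ahead] = self.fetch(ahead)
        self.last = n
        return self.cache[n]
```

After the second sequential read, later reads in the scan hit in the cache; a jump to an unrelated block triggers no prefetch.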

CS 5460: Operating Systems and Reliability

Fault Tolerance

• What kinds of failures do we need to consider?
  – OS crash, power failure
    » Data not on disk is lost; rarely, partial writes
  – Disk media failure
    » Data on disk corrupted or unavailable
  – failure
    » Large swaths of data unavailable temporarily or permanently
  – Network failure
    » Clients and servers cannot communicate (transient failure)
    » Only have access to stale data (if any)
  – … (what else?)

Techniques to Tolerate Failure

• Careful disk writes and “fsck”
  – Leave disk in recoverable state even if not all writes finish
  – Run “disk check” program to identify/fix inconsistent disk state
• RAID:
  – Redundant Array of Independent (originally “Inexpensive”) Disks
  – Write each block on more than one independent disk
  – If a disk fails, can recover block contents from non-failed disks
• Logging
  – Rather than overwrite in place, write changes to a log file
  – Use two-phase commit to make log updates transactional
• Clusters
  – Replicate data at the server level

Careful Writes

• Order writes so that disk state is recoverable
  – Accept that disk contents may be inconsistent or stale
  – Run sanity check program to detect and fix problems
• Properties that should hold at all times
  – All blocks pointed to are not marked free
  – All blocks not pointed to are marked free
  – No block belongs to more than one file
• Goal: Avoid major inconsistency
• Not a goal: Never lose data
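The three properties above are what a sanity-check program verifies. The sketch below checks them on a toy in-memory image; the function name check_invariants and the data layout (a dict of inode number to block list, plus a free bitmap) are assumptions for illustration, not the format of any real fsck.

```python
def check_invariants(inodes, free_bitmap):
    """Return a list of (problem, block) pairs for a toy file system image.

    inodes:      dict of inode number -> list of block numbers it points to
    free_bitmap: list where free_bitmap[b] is True if block b is marked free
    """
    problems = []
    owner = {}                                    # block -> first inode that claims it
    for ino, blocks in sorted(inodes.items()):
        for b in blocks:
            if b in owner:
                problems.append(("claimed more than once", b))
                continue
            owner[b] = ino
            if free_bitmap[b]:                    # pointed to, yet marked free
                problems.append(("in use but marked free", b))
    for b, is_free in enumerate(free_bitmap):
        if b not in owner and not is_free:        # not pointed to, yet marked used
            problems.append(("unreferenced but marked used", b))
    return problems
```

Of the three findings, an unreferenced block that is still marked used only wastes space, whereas a block that is in use but marked free can be handed to a second file. A write ordering that can only ever produce the first kind of damage is in line with the stated goal of avoiding major inconsistency.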

Careful Writes Example

• To create a file, you must:
  – Allocate and initialize an inode
  – Allocate and initialize some data blocks
  – Modify the directory file of the directory containing the file
  – Modify the directory file’s inode (last modified time, size)
• In what order should we do these writes?
• How to add transactional (all or nothing) semantics?
• How do careful writes interact with optimizations?

Careful Writes Exercise

• To delete a file, you must:
  – Deallocate the file’s inode
  – Deallocate the file’s disk blocks
  – Modify the directory file of the directory containing the file
  – Update the directory file’s inode
• In what order should we do these operations?
  – Consider what intermediate states are recoverable via fsck

Soft Update Rules

• Never point to a block before initializing it
• Never reuse a block before nullifying pointers to it
• Never reset the last pointer to a live block before setting a new one
• Always mark a block as used in the free-block bitmap before making the directory entry point to it

Careful Writes: More Exercises

• To write a file, you must:
  – Modify (and perhaps allocate) the file’s disk blocks
  – Modify the file’s inode (size and last modified time)
  – Maybe, modify indirect block(s)
• To move a file between directories, you must:
  – Modify the source directory
  – Modify the destination directory
  – Modify the inodes of both directories

RAID

• Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk

[Figure: CPU connected over the I/O bus to a RAID controller]

• Issues to consider:
  – Multiple disks → higher aggregate throughput (more spindles)
  – Multiple disks → (hopefully) independent failure modes
  – Multiple disks → vulnerable to individual disk failures (MTTF)
  – Writing to multiple disks → higher write overhead

Possible Uses of Multiple Disks

• Striping
  – Spread pieces of a single file across multiple disks
  – Advantages:
    » Can service multiple independent requests in parallel
    » Can service single “large” requests in parallel
  – Issues:
    » Interleave factor
    » How the data is striped across disks
• Redundancy (replication)
  – Store multiple copies of blocks on independent disks
  – Advantages:
    » Can tolerate partial system failure → How much?
  – Issues:
    » How widely do you want to spread the data?

Types of RAID

RAID level   Description
0            Striping without redundancy
1            Disk mirroring
2            Parallel array of disks with error-correcting disk (checksum)
3            Bit-interleaved parity
4            Block-interleaved parity
5            Block-interleaved, distributed parity

RAID Level 0

• Striping
  – Spread contiguous blocks of a file across multiple spindles
  – Simple round-robin distribution
• Non-redundant
  – No fault tolerance
• Advantages
  – Higher throughput
  – Larger storage
• Disadvantages
  – Lower reliability: any drive failure destroys the file system
  – Added cost
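Round-robin striping comes down to a small address calculation. The sketch below is one plausible mapping, with a hypothetical function name raid0_locate; the stripe_unit parameter plays the role of the interleave factor, i.e. how many consecutive logical blocks land on one disk before moving to the next.

```python
def raid0_locate(logical_block, num_disks, stripe_unit=1):
    """Map a logical block number to (disk index, block offset on that disk)."""
    # Which stripe unit the block falls in, and its position inside that unit.
    unit_number, within = divmod(logical_block, stripe_unit)
    disk = unit_number % num_disks       # units are dealt out round-robin
    row = unit_number // num_disks       # how many full rounds came before
    return disk, row * stripe_unit + within
```

With 4 disks and a stripe unit of 1, logical blocks 0, 1, 2, 3 land on disks 0, 1, 2, 3 and block 4 wraps around to disk 0, so a large sequential request keeps every spindle busy.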

RAID Level 1

• Mirroring
  – Write complete copies of all blocks to multiple disks
  – How many copies → how much reliability
• No striping
  – No added write bandwidth
  – Potential for pipelined reads
• Advantage:
  – Can tolerate disk failures (“availability”)
• Disadvantage:
  – High cost (extra disks and RAID controller)
• Q: How to recover from drive failure?

RAID Level 5

• Striping + distributed parity
  – Spread contiguous blocks of a file across multiple spindles
  – Adds parity information
    » Example: XOR of other blocks
• Combines features of levels 0 and 1 (striping plus redundancy, but via parity rather than full copies)
• Advantages
  – Higher throughput
  – Lower cost (than level 1)
  – Any single disk can fail
• Disadvantages
  – More complexity in RAID controller
  – Slower recovery time than RAID 1
• RAID 6: 2 parity disks
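XOR parity and single-disk recovery can be shown in a few lines. This is a sketch with hypothetical helper names; in particular, the rotation used in parity_disk is just one possible layout chosen for the example, not a statement about what any given controller does.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, num_disks):
    """Rotate the parity block across disks, one position per stripe (assumed layout)."""
    return (num_disks - 1 - stripe) % num_disks


def rebuild(surviving_blocks):
    """Recover the one missing block of a stripe from all the others (data + parity)."""
    return xor_blocks(surviving_blocks)
```

Because parity is the XOR of the data blocks, XOR-ing the parity with the surviving data blocks cancels them out and leaves the lost block. Recovery has to read every surviving disk in the stripe, which fits the slide's point that rebuilding is slower than copying from a mirror.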

RAID Tradeoffs

• Space efficiency
• Minimum number of disks
• Number of simultaneous failures tolerated
• Read performance
• Write performance
• Time to recover from a failed disk
• Complexity of controller
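The first three tradeoffs are easy to tabulate for the common levels. The helper below is a rough sketch with a hypothetical name, raid_summary; it assumes equal-sized disks and models RAID 1 as an n-way mirror, and it says nothing about performance or rebuild time.

```python
def raid_summary(level, n):
    """Return (usable fraction of raw capacity, disk failures always survivable).

    n is the number of equal-sized disks in the array.
    """
    if level == 0 and n >= 2:
        return 1.0, 0                 # pure striping: all space usable, no redundancy
    if level == 1 and n >= 2:
        return 1 / n, n - 1           # n copies of everything
    if level == 5 and n >= 3:
        return (n - 1) / n, 1         # one disk's worth of parity
    if level == 6 and n >= 4:
        return (n - 2) / n, 2         # two disks' worth of parity
    raise ValueError(f"unsupported combination: RAID {level} with {n} disks")
```

For example, four disks give 75% usable space under RAID 5 but only 50% under RAID 6, in exchange for surviving a second failure.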

RAID Discussion

• RAID can be implemented by hardware or software
  – Hardware RAID implemented by RAID controller
    » Often supports hot swapping using hot spare disks
    » Not totally clear that cheap RAID HW is worth it
  – Software RAID implemented by OS kernel (device driver)
• Multiple parity disks can handle multiple errors
• Nested RAID
  – Can use a RAID array as a “disk” in a higher level RAID
    » RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
    » RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays

RAID Discussion

• What are the risks due to purchasing a large number of disks at the same time for use in a RAID?
• Hot spares can be useful
• What does a RAID look like to the file system code?
• RAID summary
  – Tolerates failed disks
  – May not deal well with correlated failure modes
  – Can improve sustained transfer rate
  – Does not improve individual seek latencies

Logging / Journaling

• Observations:
  – Recreating a consistent disk after failure is problematic
  – Conventional file systems are optimized for large contiguous reads
  – File buffer cache eliminates reads → writes are often the bottleneck
    » Recall “careful writes” → cannot defer metadata writes indefinitely
    » Metadata ops access non-contiguous parts of disk (file, inode, dir)
• Idea: redesign the file system around a “log”
  – Contiguous log structure → append at end
  – Usage is similar to a database transaction log
  – Eliminate random seeks in the critical path
• Sweeper process
  – Copies data from log to “real” locations
  – Kicked off periodically (e.g., log filling up)

[Figure: log laid out as a sequence of StartTransaction … EndTransaction records]

Example: File Creation

• Conventional file system:
  – Allocate and initialize inode
  – Write inode to disk
  – Load directory file
  – Load directory inode
  – Update directory file
  – Write directory file to disk
  – Update directory inode
  – Write directory inode to disk
  – Later: Flush free inode bitmap
  – Result: lots of seeks, lots of small writes
• Log-based file system:
  – Allocate and initialize inode
  – Load directory file
  – Load directory inode
  – Write:
    » BeginTransaction (FileCreate)
    » Filename: /tmp/foo
    » Inode#: 1234
    » Inode Contents: …
    » Directory Contents: …
    » EndTransaction (FileCreate)
  – Later: Copy data from log to “real” structures
  – Result: few seeks + one big write

Using the Operation Log

• Issue: Inconsistency between log contents and “real” contents (for anything not yet copied back)
• Questions:
  – What problems can this cause?
    » Cannot simply read data/metadata from “real” locations
    » Need to check log contents on any lookup/read
  – How do you get around these problems?
    » Maintain index of logged-but-not-flushed state in DRAM
    » Always check index first whenever you want to read data/metadata
• Issue: What if I re-modify file/inode before flush?
  – Correct: Simply flush changes in the order they appear in the log
  – Optimized: If 2nd change negates first, only flush 2nd → be careful!
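The index of logged-but-not-flushed state can be sketched as below. This is a toy model under stated assumptions: the class name JournaledStore is hypothetical, plain dicts and lists stand in for the disk and the log, and the flush applies every record in order (the "correct" option) rather than attempting the optimized one.

```python
class JournaledStore:
    """Toy metadata store with an operation log (hypothetical structure)."""

    def __init__(self):
        self.real = {}      # "real" on-disk locations: key -> value
        self.log = []       # append-only list of (key, value) records
        self.pending = {}   # DRAM index: key -> newest logged-but-unflushed value

    def update(self, key, value):
        self.log.append((key, value))   # one sequential append, no seek to the real location
        self.pending[key] = value       # the index always reflects the newest record

    def lookup(self, key):
        if key in self.pending:         # always check the index first
            return self.pending[key]
        return self.real.get(key)       # nothing newer is waiting in the log

    def flush(self):
        for key, value in self.log:     # apply in log order, so the last write wins
            self.real[key] = value
        self.log.clear()
        self.pending.clear()
```

Until flush() runs, the "real" location holds a stale value, so any read that bypasses the index would return old data.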

What About File Data Writes?

• Option one:
  – Write the new data into a log
  – Later copy data from log to “real” disk blocks
• Option two:
  – Write new data to “real” disk blocks right away
• Tradeoffs?

Crash Recovery

• Question: How do you recover after a crash?
  – What inconsistencies are possible?
  – How do you detect and correct inconsistencies?
• Answer: Run a log sweeper (à la fsck/ChkDsk)
  – Search through the log to find “oldest” valid record
  – Walk log from oldest to newest:
    » If complete transaction present in the log → complete (if necessary)
    » If incomplete transaction found → abort/undo it
  – Recovery analogous to transaction logs in database systems
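The log walk can be sketched as a redo pass. The record format ("begin", "write", "end" tuples) and the function name recover are assumptions for illustration. The sketch also assumes a redo-only design, where uncommitted changes never reach the "real" locations, so aborting an incomplete transaction just means ignoring its records.

```python
def recover(log):
    """Replay complete transactions from a log and drop incomplete ones.

    log is a list of records, oldest first:
      ("begin", txid), ("write", txid, key, value), ("end", txid)
    Returns (state, aborted): state maps key -> value after redo,
    aborted lists the transactions that never reached their end record.
    """
    committed = {rec[1] for rec in log if rec[0] == "end"}
    started = [rec[1] for rec in log if rec[0] == "begin"]
    state = {}
    for rec in log:                          # walk from oldest to newest
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            state[key] = value               # redo; replaying twice gives the same result
    aborted = [tx for tx in started if tx not in committed]
    return state, aborted
```

Because redo just overwrites each key with the logged value, running recovery again after a crash during recovery is harmless.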

Logging vs. Not

• Advantages of logging:
  – Fast metadata operations → one big synchronous write
  – Efficient for small write operations (if normal writes are logged)
  – Clean, fast recovery mechanism
• Disadvantages of logging:
  – Space overhead → log and in-memory structures
  – Complexity → transactions, extra data structures, sweeper process
  – Duplication of effort → write to both log and “real” locations

Logging Filesystems in Practice

• NTFS uses a log
• Recent versions of UFS+ use a log
• EXT2 does not use a log
  – Works using techniques we discussed through the last lecture
• Linux EXT3 is log-based, and is forward-compatible
  – You can take an EXT2 filesystem and start using it as EXT3 by adding a log
  – EXT3 can be converted back to EXT2
•  is more sophisticated than EXT3 but still retains back-compatibility
•  does not use logging

Questions?
