CS5460: Operating Systems
Lecture 20: File System Reliability
CS 5460: Operating Systems File System Optimizations
Technique               Effect
Modern:
  Disk buffer cache     Eliminates problem
  Aggregated disk I/O   Reduces seeks
  Prefetching           Overlap/hide disk access
Historic:
  Disk head scheduling  Reduces seeks
  Disk interleaving     Reduces rotational latency
Goal: Reduce or hide expensive disk operations
CS 5460: Operating Systems Buffer/Page Cache
Idea: Keep recently used disk blocks in kernel memory
Process reads from a file:
– If blocks are not in buffer cache
  » Allocate space in buffer cache (Q: What do we purge and how?)
  » Initiate a disk read
  » Block the process until disk operations complete
– Copy data from buffer cache to process memory
– Finally, system call returns
Usually, a process does not see the buffer cache directly
mmap() maps buffer cache pages into process RAM
CS 5460: Operating Systems Buffer/Page Cache
Process writes to a file:
– If blocks are not in the buffer cache
  » Allocate pages
  » Initiate disk read
  » Block process until disk operations complete
– Copy written data from process RAM to buffer cache
Default: writes create dirty pages in the cache, then the system call returns
– Data gets written to device in the background
– What if the file is unlinked before it goes to disk?
Optional: Synchronous writes which go to disk before the system call returns
– Really slow!
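The read and write paths above can be sketched as a toy write-back cache. This is a minimal illustration, not how any real kernel implements its buffer cache: the class name `BufferCache`, the dict standing in for the disk, and the choice of LRU as the purge policy are all assumptions made for the example.

```python
from collections import OrderedDict


class BufferCache:
    """Toy write-back block cache with LRU replacement (illustrative only)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # dict: block number -> bytes, stands in for the device
        self.capacity = capacity      # maximum number of cached blocks
        self.blocks = OrderedDict()   # block number -> [data, dirty], least recent first

    def _load(self, blockno):
        if blockno in self.blocks:
            self.blocks.move_to_end(blockno)       # hit: now the most recently used
            return self.blocks[blockno]
        if len(self.blocks) >= self.capacity:      # miss with a full cache: purge LRU block
            victim, (data, dirty) = self.blocks.popitem(last=False)
            if dirty:
                self.disk[victim] = data           # write back before reusing the space
        entry = [self.disk.get(blockno, b""), False]   # the "disk read"
        self.blocks[blockno] = entry
        return entry

    def read(self, blockno):
        return self._load(blockno)[0]

    def write(self, blockno, data):
        entry = self._load(blockno)
        entry[0], entry[1] = data, True            # dirty page; no disk I/O yet

    def sync(self):
        """Background flush: write every dirty block to the device."""
        for blockno, entry in self.blocks.items():
            if entry[1]:
                self.disk[blockno] = entry[0]
                entry[1] = False


disk = {0: b"old0", 1: b"old1"}
cache = BufferCache(disk, capacity=2)
cache.write(0, b"new0")
before_sync = disk[0]   # the device still holds the old contents
cache.sync()
after_sync = disk[0]    # the dirty block has now reached the device
```

A crash between `write` and `sync` loses the new data, which is the window that synchronous writes close at the cost of speed.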
CS 5460: Operating Systems Performing Large File I/Os
Idea: Try to allocate contiguous chunks of file in large contiguous regions of the disk
– Disks have excellent bandwidth, but lousy latency!
– Amortize expensive seeks over many block read/writes
Question: How?
– Maintain free block bitmap (cache parts in memory)
– When you allocate blocks, use a modified “best fit” algorithm, rather than allocating a block at a time (pre-allocate even)
Problem: Hard to do this when disk full/fragmented
– Solution A: Keep a reserve (e.g., 10%) available at all times
– Solution B: Run a disk “defragger” occasionally
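As a rough sketch of allocating from a free-block bitmap, the code below uses plain best fit: it picks the smallest free run that can hold the whole request. The slide's "modified" variant is not specified, so this is a stand-in; the function names and the list-of-booleans bitmap are assumptions for illustration.

```python
def find_best_fit(bitmap, count):
    """Return the start of the smallest free run of at least `count` blocks, or None.

    bitmap[i] is True when block i is free.
    """
    best_start, best_len = None, None
    i, n = 0, len(bitmap)
    while i < n:
        if not bitmap[i]:
            i += 1
            continue
        start = i
        while i < n and bitmap[i]:     # scan to the end of this free run
            i += 1
        run = i - start
        if run >= count and (best_len is None or run < best_len):
            best_start, best_len = start, run
    return best_start


def allocate(bitmap, count):
    """Mark `count` contiguous blocks as used and return the first one, or None."""
    start = find_best_fit(bitmap, count)
    if start is None:
        return None                    # disk too full or too fragmented
    for b in range(start, start + count):
        bitmap[b] = False
    return start


# '.' = free block, '#' = used block
free = [c == "." for c in "..##....#...##"]
first = allocate(free, 3)    # takes the exact-size run at blocks 9..11
second = allocate(free, 3)   # falls back to the 4-block run starting at block 4
third = allocate(free, 3)    # only short fragments remain, so this fails
```

The failed third request shows the fragmentation problem: three blocks are still free in total, but no run is long enough.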
CS 5460: Operating Systems Prefetching
Idea: Read blocks from disk ahead of user request
Goal: Reduce number of seeks visible to user
– If block read before request → hits in file buffer cache
[Figure: timelines for User and File System, each showing Read 0, Read 1, Read 2]
Problem: What blocks should we prefetch?
– Easy: Detect sequential access and prefetch ahead N blocks
– Harder: Detect periodic/predictable “random” accesses
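The "easy" case can be sketched as follows. This is a simplified model under stated assumptions: the class name `Readahead`, the `read_block` callable, and the default window of 4 are hypothetical, and the prefetch here runs synchronously, whereas a real file system would issue it asynchronously so it overlaps with the user's work.

```python
class Readahead:
    """Toy sequential-access detector; names and window size are illustrative."""

    def __init__(self, read_block, window=4):
        self.read_block = read_block   # callable: block number -> data (stands in for the disk)
        self.window = window           # N, how many blocks to prefetch ahead
        self.cache = {}                # block number -> data
        self.last = None               # previous block requested by the user
        self.demand_misses = 0         # reads the user actually had to wait for

    def read(self, blockno):
        if blockno not in self.cache:
            self.demand_misses += 1
            self.cache[blockno] = self.read_block(blockno)
        if self.last is not None and blockno == self.last + 1:
            # Sequential pattern detected: fetch the next N blocks early.
            for b in range(blockno + 1, blockno + 1 + self.window):
                if b not in self.cache:
                    self.cache[b] = self.read_block(b)
        self.last = blockno
        return self.cache[blockno]


ra = Readahead(read_block=lambda b: f"block {b}", window=4)
for b in range(10):
    ra.read(b)
misses = ra.demand_misses   # only blocks 0 and 1 miss; the rest were prefetched
```

Random access never triggers the detector, so it pays one miss per read and fetches nothing extra.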
CS 5460: Operating Systems Fault Tolerance and Reliability
CS 5460: Operating Systems Fault Tolerance
What kinds of failures do we need to consider?
– OS crash, power failure
  » Data not on disk is lost; rarely, partial writes
– Disk media failure
  » Data on disk corrupted or unavailable
– Disk controller failure
  » Large swaths of data unavailable temporarily or permanently
– Network failure
  » Clients and servers cannot communicate (transient failure)
  » Only have access to stale data (if any)
– … (what else?)
CS 5460: Operating Systems Techniques to Tolerate Failure
Careful disk writes and “fsck”
– Leave disk in recoverable state even if not all writes finish
– Run “disk check” program to identify/fix inconsistent disk state
RAID
– Redundant Array of Independent (originally “Inexpensive”) Disks
– Write each block on more than one independent disk
– If disk fails, can recover block contents from non-failed disks
Logging
– Rather than overwrite-in-place, write changes to log file
– Use two-phase commit to make log updates transactional
Clusters
– Replicate data at the server level
CS 5460: Operating Systems Careful Writes
Order writes so that disk state is recoverable
– Accept that disk contents may be inconsistent or stale
– Run sanity check program to detect and fix problems
Properties that should hold at all times
– All blocks pointed to are not marked free
– All blocks not pointed to are marked free
– No block belongs to more than one file
Goal: Avoid major inconsistency
Not a goal: Never lose data
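The three properties above are the kind of thing a sanity check program verifies. The sketch below checks them over a deliberately simplified model; the function name and the representation (a dict of files to block lists plus a set of free blocks) are assumptions for illustration, not the layout any real fsck works on.

```python
def check_consistency(files, free_blocks, total_blocks):
    """Report violations of the three block invariants.

    files: dict mapping file name -> list of block numbers it points to
    free_blocks: set of block numbers marked free in the bitmap
    """
    problems = []
    owner = {}                          # block number -> first file seen pointing to it
    for name, blocks in files.items():
        for b in blocks:
            if b in free_blocks:
                problems.append(f"block {b} used by {name} but marked free")
            if b in owner:
                problems.append(f"block {b} shared by {owner[b]} and {name}")
            else:
                owner[b] = name
    for b in range(total_blocks):
        if b not in owner and b not in free_blocks:
            problems.append(f"block {b} unreferenced but not marked free")
    return problems


report = check_consistency(
    files={"a": [0, 1], "b": [1, 2]},
    free_blocks={2, 3},
    total_blocks=5,
)
```

Of the three kinds of violation, an unreferenced block that is not marked free only leaks space, while a shared block or a used block marked free can corrupt data. That difference is why write ordering aims to leave behind only the first kind.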
CS 5460: Operating Systems Careful Writes Example
To create a file, you must:
– Allocate and initialize an inode
– Allocate and initialize some data blocks
– Modify the directory file of the directory containing the file
– Modify the directory file’s inode (last modified time, size)
In what order should we do these writes?
How to add transactional (all or nothing) semantics?
How do careful writes interact with optimizations?
CS 5460: Operating Systems Careful Writes Exercise
To delete a file, you must:
– Deallocate the file’s inode
– Deallocate the file’s disk blocks
– Modify the directory file of the directory containing the file
– Update the directory file’s inode
In what order should we do these operations?
– Consider what intermediate states are recoverable via fsck
CS 5460: Operating Systems Soft Update Rules
Never point to a block before initializing it
Never reuse a block before nullifying pointers to it
Never reset the last pointer to a live block before setting a new one
Always mark a block as used in the free-block bitmap before making a directory entry point to it
CS 5460: Operating Systems Careful Writes: More Exercises
To write a file, you must:
– Modify (and perhaps allocate) the file’s disk blocks
– Modify the file’s inode (size and last modified time)
– Maybe, modify indirect block(s)
To move a file between directories, you must:
– Modify the source directory
– Modify the destination directory
– Modify the inodes of both directories
CS 5460: Operating Systems RAID
Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk
[Figure: CPU, I/O bus, and RAID controller]
Issues to consider:
– Multiple disks → higher aggregate throughput (more spindles)
– Multiple disks → (hopefully) independent failure modes
– Multiple disks → vulnerable to individual disk failures (MTTF)
– Writing to multiple disks for replication → higher write overhead
CS 5460: Operating Systems Possible Uses of Multiple Disks
Striping
– Spread pieces of a single file across multiple disks
– Advantages:
  » Can service multiple independent requests in parallel
  » Can service single “large” requests in parallel
– Issues:
  » Interleave factor
  » How the data is striped across disks
Redundancy (replication)
– Store multiple copies of blocks on independent disks
– Advantages:
  » Can tolerate partial system failure → How much?
– Issues:
  » How widely do you want to spread the data?
CS 5460: Operating Systems Types of RAID
RAID level  Description
0           Data striping w/o redundancy
1           Disk mirroring
2           Parallel array of disks w/ error correcting disk (checksum)
3           Bit-interleaved parity
4           Block-interleaved parity
5           Block-interleaved, distributed parity
CS 5460: Operating Systems RAID Level 0
Striping
– Spread contiguous blocks of a file across multiple spindles
– Simple round-robin distribution
Non-redundant
– No fault tolerance
Advantages
– Higher throughput
– Larger storage
Disadvantages
– Lower reliability: any drive failure destroys the file system
– Added cost
[Figure: CPU, I/O bus, and RAID controller]
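Round-robin striping comes down to simple address arithmetic. The sketch below assumes a stripe unit of one block; real arrays often use larger stripe units, and the function name is made up for the example.

```python
def stripe_location(logical_block, num_disks):
    """Map a logical block to (disk index, block index on that disk).

    Assumes round-robin striping with a stripe unit of one block.
    """
    return logical_block % num_disks, logical_block // num_disks


# The first six logical blocks of a 4-disk array
layout = [stripe_location(b, 4) for b in range(6)]
```

Consecutive logical blocks land on different disks, so a large sequential request keeps all spindles busy at once.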
CS 5460: Operating Systems RAID Level 1
Mirroring
– Write complete copies of all blocks to multiple disks
– How many copies → how much reliability
No striping
– No added write bandwidth
– Potential for pipelined reads
Advantage:
– Can tolerate disk failures (“availability”)
Disadvantage:
– High cost (extra disks and RAID controller)
Q: How to recover from drive failure?
[Figure: CPU, I/O bus, and RAID controller]
CS 5460: Operating Systems RAID Level 5
Striping + distributed parity
– Spread contiguous blocks of a file across multiple spindles
– Adds parity information
  » Example: XOR of other blocks
Combines features of 0 & 1 (striping and redundancy), but uses parity rather than full mirroring
Advantages
– Higher throughput
– Lower cost (than level 1)
– Any single disk can fail
Disadvantages
– More complexity in RAID controller
– Slower recovery time than RAID 1
RAID 6: two parity blocks per stripe
[Figure: CPU, I/O bus, and RAID controller]
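The XOR parity example can be made concrete. This sketch covers a single stripe only and leaves out how RAID 5 rotates the parity block across disks; the helper name `xor_blocks` and the sample data are assumptions for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus their parity block
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: XOR the survivors with the parity
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)   # equals the lost block
```

Rebuilding one block requires reading every surviving block in the stripe, which is one reason recovery is slower than copying from a mirror.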
CS 5460: Operating Systems RAID Tradeoffs
Space efficiency
Minimum number of disks
Number of simultaneous failures tolerated
Read performance
Write performance
Time to recover from a failed disk
Complexity of controller
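Two of these tradeoffs, space efficiency and failures tolerated, reduce to simple arithmetic. The sketch below assumes equally sized disks, treats RAID 1 as an n-way mirror, and reports only the failures each level is guaranteed to survive; the function name is hypothetical.

```python
def raid_summary(level, n):
    """Return (usable fraction of raw capacity, guaranteed failures survived).

    Assumes n equally sized disks; RAID 1 is modeled as an n-way mirror.
    """
    if level == 0:
        return 1.0, 0               # striping only: all space usable, no redundancy
    if level == 1:
        return 1 / n, n - 1         # every disk holds a full copy
    if level == 5:
        return (n - 1) / n, 1       # one disk's worth of parity
    if level == 6:
        return (n - 2) / n, 2       # two disks' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


four_disk_raid5 = raid_summary(5, 4)   # three quarters usable, survives one failure
two_disk_mirror = raid_summary(1, 2)   # half usable, survives one failure
```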
CS 5460: Operating Systems RAID Discussion
RAID can be implemented by hardware or software
– Hardware RAID implemented by RAID controller
  » Often supports hot swapping using hot spare disks
  » Not totally clear that cheap RAID HW is worth it
– Software RAID implemented by OS kernel (device driver)
Multiple parity disks can handle multiple errors
Nested RAID
– Can use a RAID array as a “disk” in a higher level RAID
  » RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
  » RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays
CS 5460: Operating Systems RAID Discussion
What are the risks due to purchasing a large number of disks at the same time for use in a RAID?
Hot spares can be useful
What does a RAID look like to the file system code?
RAID summary
– Tolerates failed disks
– May not deal well with correlated failure modes
– Can improve sustained transfer rate
– Does not improve individual seek latencies
CS 5460: Operating Systems Logging / Journaling
Observations:
– Recreating consistent disk after failure is problematic
– Conventional file systems optimized for large contiguous reads
– File buffer cache eliminates reads → writes often bottleneck
  » Recall “careful writes” → cannot defer metadata writes indefinitely
  » Metadata ops access non-contiguous parts of disk (file, inode, dir)
Idea: redesign the file system around a “log”
– Contiguous log structure → append at end
– Usage is similar to a database transaction log
[Figure label: StartTransaction]
CS 5460: Operating Systems Example: File Creation
Conventional file system:
– Allocate and initialize inode
– Write inode to disk
– Load directory file
– Load directory inode
– Update directory file
– Write directory file to disk
– Update directory inode
– Write directory inode to disk
– Later: Flush free inode bitmap
Lots of seeks
Lots of small writes

Log-based file system:
– Allocate and initialize inode
– Load directory file
– Load directory inode
– Write:
  » BeginTransaction (FileCreate)
  » Filename: /tmp/foo
  » Inode#: 1234
  » Inode Contents: …
  » Directory Contents: …
  » EndTransaction (FileCreate)
– Later: Copy data from log to “real” structures
Few seeks + one big write
CS 5460: Operating Systems Using the Operation Log
Issue: Inconsistency between log contents and “real” contents (for anything not yet copied back)
Questions:
– What problems can this cause?
  » Cannot simply read data/metadata from “real” locations
  » Need to check log contents on any lookup/read
– How do you get around these problems?
  » Maintain index of logged-but-not-flushed state in DRAM
  » Always check index first whenever you want to read data/metadata
Issue: What if I re-modify file/inode before flush?
– Correct: Simply flush changes in order they appear in log
– Optimized: If 2nd change negates first, only flush 2nd → be careful!
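A minimal sketch of the in-memory index and the in-order flush described above. The class `LoggedStore`, the key/value representation, and the in-memory list standing in for the on-disk log are all assumptions made for illustration.

```python
class LoggedStore:
    """Toy store whose updates go to a log first (illustrative only)."""

    def __init__(self):
        self.home = {}    # "real" locations: key -> value
        self.log = []     # append-only list of (key, value) records
        self.index = {}   # DRAM index: key -> position of its newest unflushed record

    def write(self, key, value):
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def read(self, key):
        if key in self.index:                  # always check the index first
            return self.log[self.index[key]][1]
        return self.home.get(key)              # otherwise the "real" copy is current

    def flush(self):
        for key, value in self.log:            # apply changes in log order
            self.home[key] = value
        self.log.clear()
        self.index.clear()


store = LoggedStore()
store.write("inode 1234", "size=1")
store.write("inode 1234", "size=2")            # re-modified before the flush
seen_before_flush = store.read("inode 1234")   # newest logged value
home_before_flush = store.home.get("inode 1234")
store.flush()
home_after_flush = store.home["inode 1234"]
```

Flushing in log order means the second write simply overwrites the first. Skipping the first write is the optimization the slide warns about, and it is only safe when the later record fully supersedes the earlier one.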
CS 5460: Operating Systems What About File Data Writes?
Option one:
– Write the new data into a log
– Later copy data from log to “real” disk blocks
Option two:
– Write new data to “real” disk blocks right away
Tradeoffs?
CS 5460: Operating Systems Crash Recovery
Question: How do you recover after a crash?
– What inconsistencies are possible?
– How do you detect and correct inconsistencies?
Answer: Run a log sweeper (à la fsck/ChkDsk)
– Search through the log to find “oldest” valid record
– Walk log from oldest to newest:
  » If complete transaction present in the log → complete (if necessary)
  » If incomplete transaction found → abort/undo it
– Recovery analogous to transaction logs in database systems
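The log walk can be sketched as below. The record format (`begin`, `update`, and `end` tuples tagged with a transaction id) and the function name are hypothetical, and the log is assumed to be well formed. This sketch is redo-only: updates reach the "real" state only after the transaction's end record is seen, so aborting an incomplete transaction just means discarding its records.

```python
def recover(log, state):
    """Replay complete transactions into `state`; return ids of aborted ones.

    log: list of ("begin", txid), ("update", txid, key, value), ("end", txid)
    """
    pending = {}                                  # txid -> buffered (key, value) updates
    for record in log:                            # walk from oldest to newest
        kind, txid = record[0], record[1]
        if kind == "begin":
            pending[txid] = []
        elif kind == "update":
            pending[txid].append(record[2:])
        elif kind == "end":
            for key, value in pending.pop(txid):  # complete transaction: redo it
                state[key] = value
    return sorted(pending)                        # no end record: aborted


log = [
    ("begin", 1),
    ("update", 1, "dir:/tmp", "foo -> inode 1234"),
    ("update", 1, "inode 1234", "allocated"),
    ("end", 1),
    ("begin", 2),
    ("update", 2, "inode 1235", "allocated"),     # crash before ("end", 2)
]
state = {}
aborted = recover(log, state)
```

Replaying a complete transaction that was already applied before the crash is harmless here because each update overwrites a value rather than incrementing it.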
CS 5460: Operating Systems Logging vs. Not
Advantages of logging:
– Fast metadata operations → one big synchronous write
– Efficient for small write operations (if normal writes are logged)
– Clean, fast recovery mechanism
Disadvantages of logging:
– Space overhead → log and in-memory structures
– Complexity → transactions, extra data structures, sweeper process
– Duplication of effort → write to both log and “real” locations
CS 5460: Operating Systems Logging Filesystems in Practice
NTFS uses a log
Recent versions of UFS+ use a log
Linux EXT2 does not use a log
– Works using techniques we discussed through the last lecture
Linux EXT3 is log-based, and is forward-compatible
– You can take an EXT2 filesystem and start using it as EXT3 by adding a log
– EXT3 can be converted back to EXT2
EXT4 is more sophisticated than EXT3 but still retains back-compatibility
Btrfs does not use logging
CS 5460: Operating Systems Questions?