CS5460: Operating Systems
Lecture 20: File System Reliability

File System Optimizations
  Technique               Effect
  Modern:
    Disk buffer cache     Eliminates problem
    Aggregated disk I/O   Reduces seeks
    Prefetching           Overlap/hide disk access
  Historic:
    Disk head scheduling  Reduces seeks
    Disk interleaving     Reduces rotational latency
Goal: Reduce or hide expensive disk operations

Buffer/Page Cache
Idea: Keep recently used disk blocks in kernel memory
Process reads from a file:
– If blocks are not in buffer cache
  » Allocate space in buffer cache (Q: What do we purge and how?)
  » Initiate a disk read
  » Block the process until disk operations complete
– Copy data from buffer cache to process memory
– Finally, system call returns
Usually, a process does not see the buffer cache directly
mmap() maps buffer cache pages into process RAM

Buffer/Page Cache
Process writes to a file:
– If blocks are not in the buffer cache
  » Allocate pages
  » Initiate disk read
  » Block process until disk operations complete
– Copy written data from process RAM to buffer cache
Default: writes create dirty pages in the cache, then the system call returns
– Data gets written to device in the background
– What if the file is unlinked before it goes to disk?
Optional: Synchronous writes which go to disk before the system call returns
– Really slow!

Performing Large File I/Os
Idea: Try to allocate contiguous chunks of file in large contiguous regions of the disk
– Disks have excellent bandwidth, but lousy latency!
– Amortize expensive seeks over many block read/writes
Question: How?
– Maintain free block bitmap (cache parts in memory)
– When you allocate blocks, use a modified “best fit” algorithm, rather than allocating a block at a time (pre-allocate even)
Problem: Hard to do this when disk full/fragmented
– Solution A: Keep a reserve (e.g., 10%) available at all times
– Solution B: Run a disk “defragger” occasionally

Prefetching
Idea: Read blocks from disk ahead of user request
Goal: Reduce number of seeks visible to user
– If block read before request → hits in file buffer cache
[Figure: timeline of User requests (Read 0, Read 1, Read 2) against File System disk reads (Read 0, Read 1, Read 2)]
Problem: What blocks should we prefetch?
– Easy: Detect sequential access and prefetch ahead N blocks
– Harder: Detect periodic/predictable “random” accesses

Fault Tolerance and Reliability

Fault Tolerance
What kinds of failures do we need to consider?
– OS crash, power failure
  » Data not on disk is lost; rarely, partial writes
– Disk media failure
  » Data on disk corrupted or unavailable
– Disk controller failure
  » Large swaths of data unavailable temporarily or permanently
– Network failure
  » Clients and servers cannot communicate (transient failure)
  » Only have access to stale data (if any)
– … (what else?)
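
Returning to the "easy" case on the Prefetching slide (detect sequential access and prefetch ahead N blocks), the following is a minimal sketch of that heuristic, not how any particular kernel implements read-ahead. The class name `SequentialPrefetcher`, the `read_block` callback, and the `window` and `threshold` parameters are all hypothetical names chosen for illustration. The prefetch here is synchronous for simplicity; a real file system would issue these reads asynchronously so they overlap with the process's own work.

```python
class SequentialPrefetcher:
    """Toy read-ahead: once `threshold` consecutive block numbers have been
    requested, fetch the next `window` blocks into the cache."""

    def __init__(self, read_block, window=4, threshold=2):
        self.read_block = read_block  # hypothetical callback: block number -> bytes
        self.window = window          # N blocks to prefetch ahead (assumed tunable)
        self.threshold = threshold    # sequential reads needed before prefetching
        self.cache = {}               # stands in for the buffer cache (no eviction)
        self.last_block = None
        self.run_length = 0
        self.hits = 0
        self.misses = 0

    def read(self, block):
        # Track how long the current run of consecutive block numbers is.
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block

        # Serve from the cache if possible, otherwise go to "disk".
        if block in self.cache:
            self.hits += 1
            data = self.cache[block]
        else:
            self.misses += 1
            data = self.read_block(block)
            self.cache[block] = data

        # Sequential pattern detected: pull the next `window` blocks in early.
        if self.run_length >= self.threshold:
            for ahead in range(block + 1, block + 1 + self.window):
                if ahead not in self.cache:
                    self.cache[ahead] = self.read_block(ahead)
        return data


def fake_disk(block):
    """Stand-in for a slow disk read (assumption for the demo)."""
    return f"block-{block}".encode()


prefetcher = SequentialPrefetcher(fake_disk, window=4, threshold=2)
for b in range(8):
    prefetcher.read(b)
print(prefetcher.hits, prefetcher.misses)
```

In this sketch only the first two reads of a sequential scan go to the disk on demand; later reads find their block already cached. A random access resets the run length, so no read-ahead is issued for it.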

Techniques to Tolerate Failure
Careful disk writes and “fsck”
– Leave disk in recoverable state even if not all writes finish
– Run “disk check” program to identify/fix inconsistent disk state
RAID:
– Redundant Array of Inexpensive (or Independent) Disks
– Write each block on more than one independent disk
– If disk fails, can recover block contents from non-failed disks
Logging
– Rather than overwrite-in-place, write changes to log file
– Use two-phase commit to make log updates transactional
Clusters
– Replicate data at the server level

Careful Writes
Order writes so that disk state is recoverable
– Accept that disk contents may be inconsistent or stale
– Run sanity check program to detect and fix problems
Properties that should hold at all times
– All blocks pointed to are not marked free
– All blocks not pointed to are marked free
– No block belongs to more than one file
Goal: Avoid major inconsistency
Not a goal: Never lose data

Careful Writes Example
To create a file, you must:
– Allocate and initialize an inode
– Allocate and initialize some data blocks
– Modify the directory file of the directory containing the file
– Modify the directory file’s inode (last modified time, size)
In what order should we do these writes?
How to add transactional (all or nothing) semantics?
How do careful writes interact with optimizations?

Careful Writes Exercise
To delete a file, you must:
– Deallocate the file’s inode
– Deallocate the file’s disk blocks
– Modify the directory file of the directory containing the file
– Update the directory file’s inode
In what order should we do these operations?
– Consider what intermediate states are recoverable via fsck

Soft Update Rules
Never point to a block before initializing it
Never reuse a block before nullifying pointers to it
Never reset last pointer to live block before setting a new one
Always mark free-block bitmap entries as used before making the directory entry point to it

Careful Writes: More Exercises
To write a file, you must:
– Modify (and perhaps allocate) the file’s disk blocks
– Modify the file’s inode (size and last modified time)
– Maybe, modify indirect block(s)
To move a file between directories, you must:
– Modify the source directory
– Modify the destination directory
– Modify the inodes of both directories

RAID
Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk
[Figure: CPU, I/O bus, RAID controller]
Issues to consider:
– Multiple disks → higher aggregate throughput (more spindles)
– Multiple disks → (hopefully) independent failure modes
– Multiple disks → vulnerable to individual disk failures (MTTF)
– Writing to multiple disks for replication → higher write overhead

Possible Uses of Multiple Disks
Striping
– Spread pieces of a single file across multiple disks
– Advantages:
  » Can service multiple independent requests in parallel
  » Can service single “large” requests in parallel
– Issues:
  » Interleave factor
  » How the data is striped across disks
Redundancy (replication)
– Store multiple copies of blocks on independent disks
– Advantages:
  » Can tolerate partial system failure → How much?
– Issues:
  » How widely do you want to spread the data?
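
To make the striping issues concrete (interleave factor, how data is spread across disks), here is a small sketch of one possible round-robin mapping from a logical block number to a disk and an offset on that disk. The function names and the `stripe_unit` parameter (the number of consecutive blocks placed on one disk before moving to the next) are assumptions for illustration, not a layout taken from the slides.

```python
def stripe_location(logical_block, num_disks, stripe_unit=1):
    """Map a logical block to (disk index, block offset on that disk),
    assuming simple round-robin placement of stripe units."""
    chunk, within = divmod(logical_block, stripe_unit)  # which stripe unit, and where in it
    disk = chunk % num_disks                            # round-robin across disks
    row = chunk // num_disks                            # how many full rounds came before
    return disk, row * stripe_unit + within


def disks_for_request(first_block, count, num_disks, stripe_unit=1):
    """Disks that must be accessed to serve `count` blocks starting at `first_block`."""
    return sorted({
        stripe_location(b, num_disks, stripe_unit)[0]
        for b in range(first_block, first_block + count)
    })


# A large request with a small interleave factor spreads over every disk,
# so its pieces can be transferred in parallel.
print(disks_for_request(0, 8, num_disks=4, stripe_unit=1))

# A small request with a larger interleave factor stays on one disk,
# leaving the other disks free to serve independent requests.
print(disks_for_request(0, 2, num_disks=4, stripe_unit=2))
```

The two calls illustrate the trade-off behind the interleave factor: a small stripe unit favors parallelism within a single large request, while a large stripe unit favors parallelism across independent small requests.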

Types of RAID
  RAID level  Description
  0           Data striping w/o redundancy
  1           Disk mirroring
  2           Parallel array of disks w/ error correcting disk (checksum)
  3           Bit-interleaved parity
  4           Block-interleaved parity
  5           Block-interleaved, distributed parity

RAID Level 0
Striping
– Spread contiguous blocks of a file across multiple spindles
– Simple round-robin distribution
Non-redundant
– No fault tolerance
Advantages
– Higher throughput
– Larger storage
Disadvantages
– Lower reliability: any drive failure destroys the file system
– Added cost
[Figure: CPU, I/O bus, RAID controller]

RAID Level 1
Mirroring
– Write complete copies of all blocks to multiple disks
– How many copies → how much reliability
No striping
– No added write bandwidth
– Potential for pipelined reads
Advantage:
– Can tolerate disk failures (“availability”)
Disadvantage:
– High cost (extra disks and RAID controller)
Q: How to recover from drive failure?
[Figure: CPU, I/O bus, RAID controller]

RAID Level 5
Striping + distributed parity (no mirroring)
– Spread contiguous blocks of a file across multiple spindles
– Adds parity information
  » Example: XOR of other blocks
Combines features of levels 0 & 1: striped throughput plus redundancy, without keeping full copies
Advantages
– Higher throughput
– Lower cost (than level 1)
– Any single disk can fail
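
The XOR parity example named on the RAID Level 5 slide can be sketched as follows. This is an illustration under stated assumptions, not a description of any specific controller: `xor_blocks` and `parity_disk` are hypothetical helper names, and the parity rotation in `parity_disk` is just one possible way to distribute parity across disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, num_disks):
    """Disk holding the parity block for a given stripe.
    Assumed layout: parity rotates so no single disk holds all of it."""
    return (num_disks - 1 - stripe) % num_disks


# One stripe: three data blocks plus one parity block (four disks in total).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails. XOR of the surviving blocks
# and the parity block reproduces the lost contents.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Because parity is the XOR of all data blocks in the stripe, XOR-ing the parity with the surviving data blocks cancels them out and leaves the missing block. This also shows why recovery needs to read every surviving disk, and why a second simultaneous failure in the same stripe cannot be repaired with a single parity block.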
Disadvantages
– More complexity in RAID controller
– Slower recovery time than RAID 1
[Figure: CPU, I/O bus, RAID controller]
RAID 6: 2 parity disks

RAID Tradeoffs
Space efficiency
Minimum number of disks
Number of simultaneous failures tolerated
Read performance
Write performance
Time to recover from a failed disk
Complexity of controller

RAID Discussion
RAID can be implemented by hardware or software
– Hardware RAID implemented by RAID controller
  » Often supports hot swapping using hot spare disks
  » Not totally clear that cheap RAID HW is worth it
– Software RAID implemented by OS kernel (device driver)
Multiple parity disks can handle multiple errors
Nested RAID
– Can use a RAID array as a “disk” in a higher level RAID
  » RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
  » RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays

RAID Discussion
What are the risks due to purchasing a large number of disks at the same time for use in a RAID?
Hot spares can be useful
What does a RAID look like to the file system code?
RAID summary
– Tolerates failed disks
– May not deal well with correlated failure modes
– Can improve sustained transfer rate
– Does not improve individual seek latencies

Logging / Journaling
Observations:
– Recreating consistent disk after failure is problematic
– Conventional file systems optimized for large contiguous reads
– File buffer cache eliminates reads → writes often bottleneck
  » Recall “careful writes” → cannot defer metadata writes indefinitely
  » Metadata ops access non-contiguous parts of disk (file, inode, dir)
Idea: redesign the file system around a “log”
[Figure: log contents: StartTransaction, <transaction info>]
– Contiguous log structure → append at end
– Usage is similar to a database transaction log
– Eliminate random seeks in the critical