CS162
Operating Systems and Systems Programming
Lecture 20

Filesystems (Con’t), Reliability, Transactions

April 14th, 2020
Prof. John Kubiatowicz
http://cs162.eecs.Berkeley.edu

Recall: Multilevel Indexed Files (Original 4.1 BSD)
• Sample file in multilevel indexed format:
  – 10 direct ptrs, 1K blocks
  – How many accesses for block #23? (assume file header accessed on open)
    » Two: one for the indirect block, one for the data
  – How about block #5?
    » One: one for the data
  – Block #340?
    » Three: double indirect block, indirect block, and data
• UNIX 4.1 pros and cons
  – Pros: Simple (more or less)
    Files can easily expand (up to a point)
    Small files particularly cheap and easy
  – Cons: Lots of seeks (led to the 4.2 BSD Fast File System optimizations)
• Ext2/3 (Linux):
  – 12 direct ptrs, triply-indirect blocks, settable block size (4K is common)
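A minimal C sketch of the arithmetic behind the block-count answers above, assuming (not stated explicitly on the slide) 4-byte block pointers, so a 1K indirect block holds 256 pointers. The function counts disk reads needed for a given logical block once the inode is already in memory.

    #include <stdio.h>

    #define NDIRECT 10      /* direct pointers in the inode             */
    #define NPTRS   256     /* 1K block / 4-byte pointers (assumption)  */

    /* Disk reads needed to fetch logical block 'b', inode already cached. */
    static int accesses_for_block(long b) {
        if (b < NDIRECT)                      /* direct: just the data block  */
            return 1;
        b -= NDIRECT;
        if (b < NPTRS)                        /* single indirect + data       */
            return 2;
        b -= NPTRS;
        if (b < (long)NPTRS * NPTRS)          /* double indirect chain + data */
            return 3;
        return 4;                             /* triple indirect + ... + data */
    }

    int main(void) {
        printf("block #5:   %d access(es)\n", accesses_for_block(5));    /* 1 */
        printf("block #23:  %d access(es)\n", accesses_for_block(23));   /* 2 */
        printf("block #340: %d access(es)\n", accesses_for_block(340));  /* 3 */
        return 0;
    }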

Recall: Buffer Cache
• Kernel must copy disk blocks to main memory to access their contents and write them back if modified
  – Could be data blocks, inodes, directory contents, etc.
  – Possibly dirty (modified and not written back)
• Key Idea: Exploit locality by caching disk data in memory
  – Name translations: mapping from paths → inodes
  – Disk blocks: mapping from block address → disk content
• Buffer Cache: memory used to cache kernel resources, including disk blocks and name translations
  – Can contain “dirty” blocks (blocks not yet on disk)

File System Buffer Cache
[Diagram: disk structures (data blocks, inodes, directory data blocks, free bitmap) cached in memory blocks; per-process PCBs and file descriptors reference the cached blocks, each of which has a state (initially free)]
• OS implements a cache of disk blocks for efficient access to data, directories, inodes, freemap

File System Buffer Cache: open
[Diagram: directory data blocks read into cache blocks; states go from free to dir (rd)]
• {load block of directory; search for map}+

File System Buffer Cache: open (con’t)
[Diagram: <name>:inumber found in directory; inode block read into cache; states: dir, inode (rd)]
• {load block of directory; search for map}+ ; load inode
• Create reference via open file descriptor

File System Buffer Cache: Read?
[Diagram: data block read into the cache; cached block states: dir, inode, data]
• From inode, traverse index structure to find data block; load data block; copy all or part to read data buffer

File System Buffer Cache: Write?
[Diagram: as for read, but cached blocks are modified before being written back]
• Process similar to read, but may allocate new blocks (update free map); blocks need to be written back to disk; inode?

File System Buffer Cache: Eviction?
[Diagram: cached block states now include dirty; dirty blocks must be written back before their buffers can be reused]
• Blocks being written back to disk go through a transient state

Buffer Cache Discussion
• Implemented entirely in OS software
  – Unlike memory caches and TLB
• Blocks go through transitional states between free and in-use
  – Being read from disk, being written to disk
  – Other processes can run, etc.
• Blocks are used for a variety of purposes
  – inodes, data for dirs and files, freemap
  – OS maintains pointers into them
• Termination – e.g., process exit – open, read, write
• Replacement – what to do when it fills up?
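To make the state transitions above concrete, here is a minimal sketch of a buffer-cache lookup in C. All names (bget, bdirty, disk_read, struct buf) are hypothetical, not actual kernel interfaces; a real cache would also write dirty victims back and synchronize concurrent access.

    #include <stddef.h>

    #define NBUF  64
    #define BSIZE 4096

    enum buf_state { B_FREE, B_READING, B_VALID, B_DIRTY, B_WRITING };

    struct buf {
        enum buf_state state;
        unsigned dev, blockno;        /* which disk block this buffer caches */
        unsigned char data[BSIZE];
        unsigned last_used;           /* timestamp for LRU-style replacement */
    };

    static struct buf cache[NBUF];
    static unsigned ticks;

    /* Assumed low-level disk helper (defined elsewhere). */
    void disk_read(unsigned dev, unsigned blockno, void *buf);

    /* Return a cached copy of (dev, blockno), reading it from disk on a miss. */
    struct buf *bget(unsigned dev, unsigned blockno)
    {
        struct buf *victim = NULL;

        /* One pass: check for a hit and remember the best LRU victim. */
        for (int i = 0; i < NBUF; i++) {
            struct buf *b = &cache[i];
            if ((b->state == B_VALID || b->state == B_DIRTY) &&
                b->dev == dev && b->blockno == blockno) {
                b->last_used = ++ticks;               /* hit: touch for LRU */
                return b;
            }
            if (b->state == B_FREE || b->state == B_VALID)  /* never steal dirty/busy */
                if (victim == NULL || b->last_used < victim->last_used)
                    victim = b;
        }

        if (victim == NULL)
            return NULL;                              /* all buffers busy or dirty */

        /* Miss: claim the victim and fetch the block from disk. */
        victim->dev = dev;
        victim->blockno = blockno;
        victim->state = B_READING;                    /* transient: I/O in flight */
        disk_read(dev, blockno, victim->data);
        victim->state = B_VALID;
        victim->last_used = ++ticks;
        return victim;
    }

    /* Caller modified the block: delay the write-back (write-back cache). */
    void bdirty(struct buf *b) { b->state = B_DIRTY; }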

File System Caching
• Replacement policy? LRU
  – Can afford overhead of a full LRU implementation
  – Advantages:
    » Works very well for name translation
    » Works well in general as long as memory is big enough to accommodate a host’s working set of files
  – Disadvantages:
    » Fails when some application scans through the file system, thereby flushing the cache with data used only once
    » Example: find . -exec grep foo {} \;
• Other Replacement Policies?
  – Some systems allow applications to request other policies
  – Example, ‘Use Once’:
    » File system can discard blocks as soon as they are used

File System Caching (con’t)
• Cache Size: How much memory should the OS allocate to the buffer cache vs. virtual memory?
  – Too much memory to the file system cache → won’t be able to run many applications at once
  – Too little memory to the file system cache → many applications may run slowly (disk caching not effective)
  – Solution: adjust the boundary dynamically so that the disk access rates for paging and file access are balanced
• Read Ahead Prefetching: fetch sequential blocks early
  – Key Idea: exploit the fact that the most common file access is sequential by prefetching subsequent disk blocks ahead of the current read request (if they are not already in memory)
  – Elevator algorithm can efficiently interleave groups of prefetches from concurrent applications
  – How much to prefetch?
    » Too many imposes delays on requests by other applications
    » Too few causes many seeks (and rotational delays) among concurrent file requests
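A small sketch (hypothetical structures, reusing the bget helper from the buffer-cache sketch above) of how read-ahead can be triggered: if the requested block immediately follows the previous one on the same open file, the next few blocks are pulled into the cache as well. For simplicity the prefetch here is synchronous; a real system would issue it asynchronously.

    #define PREFETCH_WINDOW 4    /* how far ahead to prefetch (tunable) */

    struct open_file {
        unsigned dev;
        long     last_blockno;   /* last logical block read through this descriptor */
    };

    struct buf *bget(unsigned dev, unsigned blockno);   /* from the sketch above */

    /* Called on every logical block read for this open file. */
    void read_with_prefetch(struct open_file *f, long blockno)
    {
        bget(f->dev, (unsigned)blockno);             /* the block actually requested */

        if (blockno == f->last_blockno + 1) {        /* access pattern looks sequential */
            for (long b = blockno + 1; b <= blockno + PREFETCH_WINDOW; b++)
                bget(f->dev, (unsigned)b);           /* warm the cache ahead of time */
        }
        f->last_blockno = blockno;
    }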

Delayed Writes
• Delayed Writes: writes to files are not immediately sent to disk
  – So, the Buffer Cache is a write-back cache
• write() copies data from a user-space buffer to a kernel buffer
  – Enabled by presence of buffer cache: can leave written file blocks in cache for a while
  – Other apps read data from cache instead of disk
  – Cache is transparent to user programs
• Flushed to disk periodically
  – In Linux: kernel threads flush buffer cache every 30 sec. in the default setup
• Disk scheduler can efficiently order lots of requests
  – Elevator Algorithm can rearrange writes to avoid random seeks

Delayed Writes (con’t)
• Delay block allocation: may be able to allocate multiple blocks at the same time for a file, keep them contiguous
• Some files never actually make it all the way to disk
  – Many short-lived files
• But what if the system crashes before a buffer cache block is flushed to disk?
• And what if this was for a directory file?
  – Lose pointer to inode
  – ⇒ file systems need recovery mechanisms
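A toy sketch of the delayed-write policy just described, building on the buffer-cache sketch above (and assuming struct buf gains a time_t dirtied_at field). write_block only dirties the cached copy; a periodic kernel thread flushes blocks that have been dirty for roughly 30 seconds, mirroring the Linux default mentioned on the slide. All names are hypothetical.

    #include <string.h>
    #include <time.h>

    #define FLUSH_AGE_SECS 30    /* matches the ~30 second default mentioned above */

    void disk_write(unsigned dev, unsigned blockno, const void *buf);   /* assumed */

    /* Delayed write: update only the cached copy and mark it dirty. */
    void write_block(struct buf *b, const void *src, size_t off, size_t len)
    {
        memcpy(b->data + off, src, len);
        b->state = B_DIRTY;              /* no disk I/O here               */
        b->dirtied_at = time(NULL);      /* remember when it became dirty  */
    }

    /* Run periodically by a flusher thread: write back sufficiently old blocks. */
    void flush_old_blocks(void)
    {
        time_t now = time(NULL);
        for (int i = 0; i < NBUF; i++) {
            struct buf *b = &cache[i];
            if (b->state == B_DIRTY && now - b->dirtied_at >= FLUSH_AGE_SECS) {
                b->state = B_WRITING;                    /* transient during I/O */
                disk_write(b->dev, b->blockno, b->data);
                b->state = B_VALID;                      /* clean again          */
            }
        }
    }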


Important “ilities”
• Availability: the probability that the system can accept and process requests
  – Often measured in “nines” of probability. So, a 99.9% probability is considered “3-nines of availability”
  – Key idea here is independence of failures
• Durability: the ability of a system to recover data despite faults
  – This idea is fault tolerance applied to data
  – Doesn’t necessarily imply availability: information on pyramids was very durable, but could not be accessed until discovery of the Rosetta Stone
• Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
  – Usually stronger than simply availability: means that the system is not only “up”, but also working correctly
  – Includes availability, security, fault tolerance/durability
  – Must make sure data survives system crashes, disk crashes, other problems

How to Make File System Durable?
• Disk blocks contain Reed-Solomon error correcting codes (ECC) to deal with small defects in the disk drive
  – Can allow recovery of data from small media defects
• Make sure writes survive in the short term
  – Either abandon delayed writes, or
  – Use special, battery-backed RAM (called non-volatile RAM or NVRAM) for dirty blocks in buffer cache
• Make sure that data survives in the long term
  – Need to replicate! More than one copy of data!
  – Important element: independence of failure
    » Could put copies on one disk, but if disk head fails…
    » Could put copies on different disks, but if server fails…
    » Could put copies on different servers, but if building is struck by lightning…
    » Could put copies on servers in different continents…

RAID: Redundant Arrays of Inexpensive Disks
• Classified by David Patterson, Garth A. Gibson, and Randy Katz here at UCB in 1987
  – Classic paper was first to evaluate multiple schemes
  – Berkeley researchers were looking for alternatives to big expensive disks
  – Redundancy necessary because cheap disks were more error prone
• Data stored on multiple disks (redundancy)
  – For high I/O rate, high availability environments
• Either in software or hardware
  – In hardware case, done by disk controller; file system may not even know that there is more than one disk in use
• Initially, five levels of RAID (more now)

RAID 1: Disk Mirroring/Shadowing
[Diagram: a recovery group of disks, each fully mirrored onto its shadow]
• Each disk is fully duplicated onto its “shadow”
  – Most expensive solution: 100% capacity overhead
• Bandwidth sacrificed on write:
  – Logical write = two physical writes
  – Highest bandwidth when disk heads and rotation fully synchronized (hard to do exactly)
• Reads may be optimized
  – Can have two independent reads to same data
• Recovery:
  – Disk failure ⇒ replace disk and copy data to new disk
  – Hot Spare: idle disk already attached to system to be used for immediate replacement
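A minimal sketch (hypothetical interface, not any real RAID driver) of the RAID 1 behavior described above: a logical write becomes two physical writes, and reads alternate between the mirrors so two independent reads can proceed in parallel.

    /* Assumed single-disk primitives (defined elsewhere). */
    void physical_write(int disk, unsigned blockno, const void *buf);
    void physical_read(int disk, unsigned blockno, void *buf);

    enum { PRIMARY = 0, SHADOW = 1 };

    /* RAID 1 write: one logical write = two physical writes. */
    void raid1_write(unsigned blockno, const void *buf)
    {
        physical_write(PRIMARY, blockno, buf);
        physical_write(SHADOW,  blockno, buf);
    }

    /* RAID 1 read: either copy is valid; alternating spreads load across mirrors. */
    void raid1_read(unsigned blockno, void *buf)
    {
        static int next = PRIMARY;
        physical_read(next, blockno, buf);
        next = (next == PRIMARY) ? SHADOW : PRIMARY;
    }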

RAID 5+: High I/O Rate Parity
[Diagram: stripe units D0–D23 and parity blocks P0–P5 rotated across Disks 1–5, with logical disk addresses increasing down the stripes]
• Data striped across multiple disks
  – Successive blocks stored on successive (non-parity) disks
  – Increased bandwidth over single disk
• Parity block (in green) constructed by XORing data blocks in stripe
  – P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
  – Can destroy any one disk and still reconstruct data
  – Suppose Disk 3 fails, then can reconstruct: D2 = D0 ⊕ D1 ⊕ D3 ⊕ P0
• Can spread information widely across internet for durability
  – RAID algorithms work over geographic scale

Allow more disks to fail!
• In general: RAID X is an “erasure code”
  – Must have ability to know which disks are bad
  – Treat missing disk as an “erasure”
• Today, disks are so big that RAID 5 is not sufficient!
  – Time to repair disk sooooo long, another disk might fail in the process!
  – “RAID 6” – allow 2 disks in replication stripe to fail
  – But – must do something more complex than just XORing blocks together!
    » Already used up the simple XOR operation
• Simple option: check out the EVENODD code in readings
  – Will generate one additional check disk to support RAID 6
• More general option for a general erasure code: Reed-Solomon codes
  – Based on polynomials in GF(2^k) (i.e., k-bit symbols)
    » A Galois Field is a finite version of the real numbers
  – Data as coefficients (aj), code space as values of the polynomial:
    » P(x) = a0 + a1 x + … + a(m−1) x^(m−1)
    » Coded: P(0), P(1), P(2), …, P(n−1)
  – Can recover the polynomial (i.e., the data) as long as we get any m of the n values; allows n−m failures!

Allow more disks to fail! (Con’t)
• How to use a Reed-Solomon code in practice?
  – Each coefficient has a fixed number (k) of bits, so must encode with symbols of that size
  – Example: k = 16-bit symbols, m = 4, encoding 16×4 bits at a time
    » Take original data, split into 4 chunks. On each encoding step, grab 16 bits from each chunk to use as coefficients
    » Each data point yields a 16-bit symbol, which you distribute to the final encoded chunks
  – (A better version of the Reed-Solomon code for erasure channels is the “Cauchy Reed-Solomon” code; it is isomorphic to the version here)
• Examples (with k = 16):
  – Suppose we have 6 disks and want to tolerate 2 failures
    » Split data into 4 chunks; encode 16 bits from each chunk at a time by generating 6 points (of 16 bits) on a 3rd-degree polynomial
    » Distribute data from polynomial to 6 disks – each disk will ultimately hold data that is ¼ the size of the original data
    » Can handle 2 lost disks for 50% overhead
  – More interesting extreme for Internet-level replication:
    » Split data into 4 chunks, produce 16 chunks
    » Each chunk is ¼ the total size of the original data, overhead = factor of 4
    » But – only need 4 of the 16 fragments! REALLY DURABLE!

Use of Erasure Coding in general: High Durability/overhead ratio!
[Figure: Fraction Blocks Lost Per Year (FBLPY) vs. repair time for replication and fragmentation schemes]
• Exploit law of large numbers for durability!
• 6-month repair, FBLPY with 4x increase in total size of data:
  – Replication (4 copies): 0.03
  – Fragmentation (16 of 64 fragments needed): 10^−35
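A small C sketch of the XOR parity arithmetic on the RAID 5 slide (block size and names are assumptions): the parity block is the byte-wise XOR of the data blocks in the stripe, and any single missing block can be rebuilt by XORing the survivors with the parity.

    #include <stddef.h>

    #define STRIPE_BLOCKS 4      /* D0..D3 per parity block, as in the slide */
    #define BSIZE         4096

    /* P = D0 ^ D1 ^ D2 ^ D3 (byte-wise). */
    void compute_parity(unsigned char data[STRIPE_BLOCKS][BSIZE],
                        unsigned char parity[BSIZE])
    {
        for (size_t i = 0; i < BSIZE; i++) {
            unsigned char p = 0;
            for (int d = 0; d < STRIPE_BLOCKS; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* Rebuild one lost data block, e.g. D2 = D0 ^ D1 ^ D3 ^ P0. */
    void reconstruct_block(unsigned char data[STRIPE_BLOCKS][BSIZE],
                           unsigned char parity[BSIZE],
                           int lost, unsigned char out[BSIZE])
    {
        for (size_t i = 0; i < BSIZE; i++) {
            unsigned char v = parity[i];
            for (int d = 0; d < STRIPE_BLOCKS; d++)
                if (d != lost)
                    v ^= data[d][i];
            out[i] = v;
        }
    }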


Higher Durability/Reliability through Geographic Replication
[Diagram: Replica/Frag #1, Replica/Frag #2, …, Replica/Frag #n spread across sites]
• Highly durable – hard to destroy all copies
• Highly available for reads
  – Simple replication: read any copy
  – Erasure coded: read m of n
• Low availability for writes
  – Can’t write if any one replica is not up
  – Or – need relaxed consistency model
• Reliability? – availability, security, durability, fault-tolerance

File System Reliability (Difference from Block-level reliability)
• What can happen if disk loses power or software crashes?
  – Some operations in progress may complete
  – Some operations in progress may be lost
  – Overwrite of a block may only partially complete
• Having RAID doesn’t necessarily protect against all such failures
  – No protection against writing bad state
  – What if one disk of RAID group not written?
• File system needs durability (as a minimum!)
  – Data previously stored can be retrieved (maybe after some recovery step), regardless of failure

Storage Reliability Problem
• Single logical file operation can involve updates to multiple physical disk blocks
  – inode, indirect block, data block, bitmap, …
  – With sector remapping, a single update to a physical disk block can require multiple (even lower-level) updates to sectors
• At a physical level, operations complete one at a time
  – Want concurrent operations for performance
• How do we guarantee consistency regardless of when a crash occurs?

Threats to Reliability
• Interrupted Operation
  – Crash or power failure in the middle of a series of related updates may leave stored data in an inconsistent state
  – Example: transfer funds from one bank account to another
  – What if transfer is interrupted after withdrawal and before deposit?
• Loss of stored data
  – Failure of non-volatile storage media may cause previously stored data to disappear or be corrupted


Reliability Approach #1: Careful Ordering
• Sequence operations in a specific order
  – Careful design to allow the sequence to be interrupted safely
• Post-crash recovery
  – Read data structures to see if there were any operations in progress
  – Clean up / finish as needed
• Approach taken by
  – FAT and FFS (fsck) to protect filesystem structure/metadata
  – Many app-level recovery schemes (e.g., Word, emacs autosaves)

FFS: Create a File
Normal operation:
• Allocate data block
• Write data block
• Allocate inode
• Write inode block
• Update bitmap of free blocks and inodes
• Update directory with file name → inode number
• Update modify time for directory

Recovery:
• Scan inode table
• If any unlinked files (not in any directory), delete or put in lost & found dir
• Compare free block bitmap against inode trees
• Scan directories for missing update/access times
Time proportional to disk size
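A hedged sketch (hypothetical helper names) of the “normal operation” column above: perform the updates in the stated order and force each one to disk before the next, so a crash at any point leaves at most an unreferenced block or inode for fsck-style recovery to reclaim.

    /* Assumed helpers; each *_sync call returns only once the write is on disk. */
    unsigned alloc_data_block(void);
    unsigned alloc_inode(void);
    void write_block_sync(unsigned blockno, const void *data);
    void write_inode_sync(unsigned inumber, unsigned first_block, unsigned size);
    void write_bitmaps_sync(void);                       /* free block + inode maps */
    void dir_add_entry_sync(unsigned dir_inum, const char *name, unsigned inumber);
    void dir_touch_mtime_sync(unsigned dir_inum);

    /* FFS-style create with careful ordering: referenced structures are written
     * before the structures that point to them. */
    void create_file(unsigned dir_inum, const char *name,
                     const void *data, unsigned size)
    {
        unsigned blk = alloc_data_block();               /* 1. allocate data block  */
        write_block_sync(blk, data);                     /* 2. write data block     */

        unsigned ino = alloc_inode();                    /* 3. allocate inode       */
        write_inode_sync(ino, blk, size);                /* 4. write inode block    */

        write_bitmaps_sync();                            /* 5. update free bitmaps  */

        dir_add_entry_sync(dir_inum, name, ino);         /* 6. name -> inode number */
        dir_touch_mtime_sync(dir_inum);                  /* 7. directory mtime      */
    }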

Reliability Approach #2: Copy on Write File Layout
• To update file system, write a new version of the file system containing the update
  – Never update in place
  – Reuse existing unchanged disk blocks
• Seems expensive! But
  – Updates can be batched
  – Almost all disk writes can occur in parallel
• Approach taken in network file server appliances
  – NetApp’s Write Anywhere File Layout (WAFL)
  – ZFS (Sun/Oracle) and OpenZFS

COW with Smaller-Radix Blocks
[Diagram: old version and new version of a block tree sharing unchanged subtrees; a write copies only the path from the modified leaf up to the root]
• If file represented as a tree of blocks, just need to update the leading fringe
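A small sketch (hypothetical in-memory structures, not WAFL or ZFS code) of the copy-on-write tree update the figure shows: modifying one leaf allocates new nodes only along the path from that leaf to the root, while every unchanged subtree is shared between the old and new versions.

    #include <stdlib.h>
    #include <string.h>

    #define FANOUT 4

    struct node {
        int  is_leaf;
        struct node *child[FANOUT];   /* interior: pointers to subtrees   */
        char data[64];                /* leaf: file contents (toy-sized)  */
    };

    /* Return the root of a NEW version with leaf 'index' (depth levels down)
     * replaced by new_data.  The old version stays valid and shares every
     * untouched subtree with the new one. */
    struct node *cow_update(struct node *old, int depth, int index,
                            const char *new_data)
    {
        struct node *copy = malloc(sizeof *copy);
        *copy = *old;                            /* start as a copy of the old node */

        if (old->is_leaf) {
            strncpy(copy->data, new_data, sizeof copy->data - 1);
            copy->data[sizeof copy->data - 1] = '\0';
            return copy;
        }

        /* Which child leads toward the target leaf at this depth? */
        int slots_per_child = 1;
        for (int i = 1; i < depth; i++) slots_per_child *= FANOUT;
        int which = index / slots_per_child;

        /* Only that child is rewritten; the other pointers still reference
         * the old version's subtrees (only the "leading fringe" changes). */
        copy->child[which] = cow_update(old->child[which], depth - 1,
                                        index % slots_per_child, new_data);
        return copy;
    }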


ZFS and OpenZFS
• Variable sized blocks: 512 B – 128 KB
• Symmetric tree
  – Know if it is large or small when we make the copy
• Store version number with pointers
  – Can create new version by adding blocks and new pointers
• Buffers a collection of writes before creating a new version with them
• Free space represented as tree of extents in each block group
  – Delay updates to freespace (in log) and do them all when block group is activated

More General Reliability Solutions
• Use Transactions for atomic updates
  – Ensure that multiple related updates are performed atomically
  – i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
  – Most modern file systems use transactions internally to update filesystem structures and metadata
  – Many applications implement their own transactions
• Provide Redundancy for media failures
  – Redundant representation on media (Error Correcting Codes)
  – Replication across media (e.g., RAID disk array)

Transactions
• Closely related to critical sections for manipulating shared data structures
• They extend the concept of atomic update from memory to stable storage
  – Atomically update multiple persistent data structures
• Many ad-hoc approaches
  – FFS carefully ordered the sequence of updates so that if a crash occurred while manipulating directory or inodes, the disk scan on reboot would detect and recover the error (fsck)
  – Applications use temporary files and rename

Key Concept: Transaction
• An atomic sequence of actions (reads/writes) on a storage system (or database)
• That takes it from one consistent state to another
[Diagram: consistent state 1 → transaction → consistent state 2]


Typical Structure
• Begin a transaction – get transaction id
• Do a bunch of updates
  – If any fail along the way, roll back
  – Or, if any conflicts with other transactions, roll back
• Commit the transaction

“Classic” Example: Transaction

BEGIN;  --BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
COMMIT; --COMMIT WORK

Transfer $100 from Alice’s account to Bob’s account

The ACID properties of Transactions
• Atomicity: all actions in the transaction happen, or none happen
• Consistency: transactions maintain data integrity, e.g.,
  – Balance cannot be negative
  – Cannot reschedule meeting on February 30
• Isolation: execution of one transaction is isolated from that of all others; no problems from concurrency
• Durability: if a transaction commits, its effects persist despite crashes

Concept of a log
• One simple action is atomic – write/append a basic item
• Use that to seal the commitment to a whole series of actions
[Diagram: an append-only log of records such as “Start Tran N”, “Get 10$ from account A”, “Get 7$ from account B”, “Get 13$ from account C”, “Put 15$ into account X”, “Put 15$ into account Y”, …, “Commit Tran N”]


Transactional File Systems
• Better reliability through use of log
  – All changes are treated as transactions
  – A transaction is committed once it is written to the log
    » Data forced to disk for reliability
    » Process can be accelerated with NVRAM
  – Although file system may not be updated immediately, data preserved in the log
• Difference between “Log Structured” and “Journaled”
  – In a Log Structured filesystem, data stays in log form
  – In a Journaled filesystem, log used for recovery
• Journaled file system
  – Applies updates to system metadata using transactions (using logs, etc.)
  – Updates to non-directory files (i.e., user stuff) can be done in place (without logs), full logging optional
  – Ex: NTFS, Apple HFS+, Linux XFS, JFS, ext3, ext4
• Full Logging File System
  – All updates to disk are done in transactions

Journaling File Systems
• Instead of modifying data structures on disk directly, write changes to a journal/log
  – Intention list: set of changes we intend to make
  – Log/Journal is append-only
  – Single commit record commits transaction
• Once changes are in the log, it is safe to apply changes to data structures on disk
  – Recovery can read log to see what changes were intended
  – Can take our time making the changes
    » As long as new requests consult the log first
• Once changes are copied, safe to remove log
• But, …
  – If the last atomic action is not done … poof … all gone
• Basic assumption:
  – Updates to sectors are atomic and ordered
  – Not necessarily true unless very careful, but key assumption

Example: Creating a File
[Diagram: free space map, data blocks, inode table, and directory entries on disk]
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
------
• Write map (i.e., mark used)
• Write inode entry to point to block(s)
• Write dirent to point to inode

Ex: Creating a file (as a transaction)
[Diagram: same on-disk structures plus an append-only log with head and tail pointers; records run from start to commit, and move from pending to done]
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
------
• [log] Write map (used)
• [log] Write inode entry to point to block(s)
• [log] Write dirent to point to inode
Log: in non-volatile storage (Flash or on Disk)
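A hedged sketch (made-up record format and helper names) of the transactional create above: append the intended updates to the log, force them to stable storage, append a single commit record, and only then apply the changes to the real on-disk structures.

    #include <stdint.h>

    enum rec_type { LOG_START, LOG_UPDATE, LOG_COMMIT };

    struct log_rec {
        enum rec_type type;
        uint64_t      txn_id;
        uint32_t      blockno;        /* which on-disk block this update targets */
        unsigned char payload[64];    /* new contents (toy-sized)                */
    };

    /* Assumed primitives: appends are atomic and ordered (the slide's key assumption). */
    void log_append_sync(const struct log_rec *rec);     /* forced to disk */
    void apply_to_disk(uint32_t blockno, const void *payload);

    /* Create a file as a transaction: bitmap, inode, and dirent updates are
     * committed together or not at all. */
    void create_file_txn(uint64_t txn,
                         const struct log_rec *bitmap_upd,
                         const struct log_rec *inode_upd,
                         const struct log_rec *dirent_upd)
    {
        struct log_rec start  = { LOG_START,  txn, 0, {0} };
        struct log_rec commit = { LOG_COMMIT, txn, 0, {0} };

        log_append_sync(&start);
        log_append_sync(bitmap_upd);      /* [log] write map (used)             */
        log_append_sync(inode_upd);       /* [log] write inode entry -> blocks  */
        log_append_sync(dirent_upd);      /* [log] write dirent -> inode        */
        log_append_sync(&commit);         /* single commit record seals the txn */

        /* Now (or later, lazily) copy the committed changes to their real homes. */
        apply_to_disk(bitmap_upd->blockno, bitmap_upd->payload);
        apply_to_disk(inode_upd->blockno,  inode_upd->payload);
        apply_to_disk(dirent_upd->blockno, dirent_upd->payload);
    }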

“Redo Log” – Replay Transactions
[Diagram: on-disk structures plus the log; committed records between the done and pending markers are replayed]
• After Commit
• All access to file system first looks in log
• Eventually copy changes to disk
Log: in non-volatile storage (Flash or on Disk)

Crash During Logging – Recover
[Diagram: on-disk structures unchanged; the log contains a transaction start with no commit]
• Upon recovery scan the log
• Detect transaction start with no commit
• Discard log entries
• Disk remains unchanged
Log: in non-volatile storage (Flash or on Disk)

Recovery After Commit
[Diagram: log with start … commit records between the done and pending regions]
• Scan log, find start
• Find matching commit
• Redo it as usual
  – Or just let it happen later
Log: in non-volatile storage (Flash or on Disk)

Journaling Summary
Why go through all this trouble?
• Updates atomic, even if we crash:
  – Update either gets fully applied or discarded
  – All physical operations treated as a logical unit
Isn’t this expensive?
• Yes! We’re now writing all data twice (once to log, once to actual data blocks in target file)
• Modern filesystems offer an option to journal metadata updates only
  – Record modifications to file system data structures
  – But apply updates to a file’s contents directly
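A minimal sketch (hypothetical log layout, reusing the record format from the create-file sketch above) of redo-log recovery: scan the log, note which transactions have a commit record, replay the updates of committed transactions, and skip everything else, so the disk is unchanged by transactions that never committed.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_LOG_RECS 1024

    /* Same record format as the create-file sketch above. */
    extern struct log_rec log_recs[MAX_LOG_RECS];
    extern int            log_len;
    void apply_to_disk(uint32_t blockno, const void *payload);

    static bool committed(uint64_t txn)
    {
        for (int i = 0; i < log_len; i++)
            if (log_recs[i].type == LOG_COMMIT && log_recs[i].txn_id == txn)
                return true;
        return false;
    }

    /* Run once at boot, before any new file system activity. */
    void recover(void)
    {
        /* Pass over the log in order: redo updates of committed transactions,
         * and effectively discard updates with no matching commit. */
        for (int i = 0; i < log_len; i++) {
            if (log_recs[i].type == LOG_UPDATE && committed(log_recs[i].txn_id))
                apply_to_disk(log_recs[i].blockno, log_recs[i].payload);
        }
        /* After replay, the log entries can be trimmed (not shown). */
    }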

Going Further – Log Structured File Systems
• The log IS what is recorded on disk
  – File system operations logically replay log to get result
  – Create data structures to make this fast
  – On recovery, replay the log
• Index (inodes) and directories are written into the log too
• Large, important portion of the log is cached in memory
• Do everything in bulk: log is a collection of large segments
• Each segment contains a summary of all the operations within the segment
  – Fast to determine if segment is relevant or not
• Free space is approached as a continual cleaning process of segments
  – Detect what is live or not within a segment
  – Copy live portion to new segment being formed (replay)
  – Garbage collect entire segment
  – No bit map

LFS Paper in Readings
• LFS: write file1 block, write inode for file1, write directory page mapping “file1” in “dir1” to its inode, write inode for this directory page. Do the same for “/dir2/file2”. Then write summary of the new inodes that got created in the segment
• FFS:
• Reads are same in either case (pointer following)
• Buffer cache likely to hold information in both cases
  – But disk I/Os are very different – writes sequential, reads not!
  – Randomness of read layout assumed to be handled by cache
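A toy sketch (invented structures, not the LFS paper's code) of the "do everything in bulk" idea: file, inode, and directory writes are appended to an in-memory segment buffer, and only when the segment fills is the whole thing written sequentially to disk along with its summary of which inode owns each block.

    #include <string.h>

    #define SEG_SIZE   (512 * 1024)    /* large segment, written sequentially */
    #define BLOCK_SIZE 4096
    #define SEG_BLOCKS (SEG_SIZE / BLOCK_SIZE)

    struct seg_summary {               /* who owns each block in the segment  */
        unsigned inumber[SEG_BLOCKS];
        unsigned file_block[SEG_BLOCKS];
        int      nblocks;
    };

    static unsigned char      segment[SEG_SIZE];
    static struct seg_summary summary;

    /* Assumed primitive: one big sequential write at the current log head. */
    void disk_write_segment(const void *seg, const struct seg_summary *sum);

    /* Append one block (file data, inode, or directory page) to the segment. */
    void lfs_append(unsigned inumber, unsigned file_block, const void *data)
    {
        if (summary.nblocks == SEG_BLOCKS) {        /* segment full: flush it */
            disk_write_segment(segment, &summary);
            summary.nblocks = 0;
        }
        int slot = summary.nblocks++;
        memcpy(&segment[slot * BLOCK_SIZE], data, BLOCK_SIZE);
        summary.inumber[slot]    = inumber;         /* recorded in the summary */
        summary.file_block[slot] = file_block;      /* (used later by cleaner) */
    }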

Example: F2FS: A Flash-friendly File System
• File system used on many mobile devices
  – Including the Pixel phones from Google
  – Latest version supports block-encryption for security
  – Has been “mainstream” in Linux for several years now
• Assumes standard SSD interface
  – With built-in Flash Translation Layer (FTL)
  – Random reads are as fast as sequential reads
  – Random writes are bad for flash storage
    » Forces FTL to keep moving/coalescing pages and erasing blocks
    » Sustained write performance degrades / lifetime reduced
• Minimize writes/updates and otherwise keep writes “sequential”
  – Start with log-structured / copy-on-write file systems
  – Keep writes as sequential as possible
  – Node Translation Table (NAT) for “logical” to “physical” translation
    » Independent of FTL
• For more details, check out the paper in the Readings section of the website
  – “F2FS: A New File System for Flash Storage” (from 2015)
  – Design of file system to leverage and optimize NAND flash solutions
  – Comparison with Ext4, Btrfs, Nilfs2, etc.

F2FS: On-disk Layout
• Main Area:
  – Divided into segments (the basic unit of management in F2FS)
  – 4KB blocks; each block typed to be node or data
• Node Address Table (NAT): independent of FTL!
  – Block address table to locate all “node blocks” in the Main Area
• Updates to data sorted by predicted write frequency (Hot/Warm/Cold) to optimize FLASH management
• Checkpoint (CP): keeps the file system status
  – Bitmaps for valid NAT/SIT sets and lists of orphan inodes
  – Stores a consistent F2FS status at a given point in time
• Segment Information Table (SIT):
  – Per-segment information such as the number of valid blocks and the bitmap for the validity of all blocks in the “Main” area
  – Segments used for “garbage collection”
• Segment Summary Area (SSA):
  – Summary representing the owner information of all blocks in the Main area

LFS Index Structure: Forces many updates when updating data
[Figure]
F2FS Index Structure: Indirection and Multi-head logs optimize updates
[Figure]

File System Summary (1/3)
• File System:
  – Transforms blocks into Files and Directories
  – Optimize for size, access and usage patterns
  – Maximize sequential access, allow efficient random access
  – Projects the OS protection and security regime (UGO vs ACL)
• File defined by header, called “inode”
• Naming: translating from user-visible names to actual system resources
  – Directories used for naming for local file systems
  – Linked or tree structure stored in files
• Multilevel Indexed Scheme
  – inode contains file info, direct pointers to blocks, indirect blocks, doubly indirect, etc.
  – NTFS: variable extents, not fixed blocks; tiny files’ data is in the header

File System Summary (2/3)
• File layout driven by freespace management
  – Optimizations for sequential access: start new files in open ranges of free blocks, rotational optimization
  – Integrate freespace, inode table, file blocks and dirs into block group
• FLASH filesystems optimized for:
  – Fast random reads
  – Limiting updates to data blocks
• Buffer Cache: memory used to cache kernel resources, including disk blocks and name translations
  – Can contain “dirty” blocks (blocks not yet on disk)


File System Summary (3/3)
• File system operations involve multiple distinct updates to blocks on disk
  – Need to have all-or-nothing semantics
  – Crash may occur in the midst of the sequence
• Traditional file systems perform check and recovery on boot
  – Along with careful ordering so partial operations result in loose fragments, rather than loss
• Copy-on-write provides richer function (versions) with much simpler recovery
  – Little performance impact since sequential write to storage device is nearly free
• Transactions over a log provide a general solution
  – Commit sequence to durable log, then update the disk
  – Log takes precedence over disk
  – Replay committed transactions, discard partials
