CS162: Operating Systems and Systems Programming
Lecture 21: Filesystems 3: Filesystem Case Studies (Con't), Buffering, Reliability, and Transactions
November 9th, 2020
Prof. John Kubiatowicz
http://cs162.eecs.Berkeley.edu

Recall: Components of a File System
• File path → directory structure → file number ("inumber") → file header structure (inode) → data blocks
• One block = multiple sectors (ex: 512-byte sector, 4K block)


Recall: FAT Properties
• File is a collection of disk blocks (called "clusters" in FAT terminology)
• FAT is an array of integers mapped 1-1 with disk blocks
 – Each integer is either:
  » Pointer to next block in file; or
  » End of file flag; or
  » Free block flag
• File Number is the index of the root of the block list for the file
 – Follow the list to get each block #
 – Directory must map name to the block number at the start of the file
• But: Where is the FAT stored?
 – Beginning of disk, before the data blocks
 – Usually 2 copies (to handle errors)
• (figure: FAT array indexed 0…N-1; entry 31 starts the chain for File 31's blocks 0–3; other entries marked free)

Recall: Multilevel Indexed Files (Original 4.1 BSD)
• Sample file in multilevel indexed format:
 – 10 direct ptrs, 1K blocks
 – How many accesses for block #23? (assume file header accessed on open)
  » Two: One for indirect block, one for data
 – How about block #5?
  » One: One for data
 – Block #340?
  » Three: double indirect block, indirect block, and data
• 4.1 Pros and cons
 – Pros: Simple (more or less); files can easily expand (up to a point); small files particularly cheap and easy
 – Cons: Lots of seeks; very large files must read many indirect blocks (four I/Os per block!)
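A small worked example of the access counts quoted above (a sketch, assuming 4-byte block pointers so each 1K indirect block holds 256 pointers, and assuming the inode itself is already in memory):

#include <stdio.h>

#define NDIRECT   10    /* direct pointers in the inode                */
#define NINDIRECT 256   /* pointers per 1K indirect block (4B each)    */

/* Disk accesses needed to reach a given data block of a file,
 * assuming the file header (inode) was loaded at open time. */
int accesses_for_block(long blockno) {
    if (blockno < NDIRECT)
        return 1;                              /* data block only                    */
    blockno -= NDIRECT;
    if (blockno < NINDIRECT)
        return 2;                              /* indirect + data                    */
    blockno -= NINDIRECT;
    if (blockno < (long)NINDIRECT * NINDIRECT)
        return 3;                              /* double indirect + indirect + data  */
    return 4;                                  /* triple indirect chain              */
}

int main(void) {
    printf("block #5   -> %d access(es)\n", accesses_for_block(5));    /* 1 */
    printf("block #23  -> %d access(es)\n", accesses_for_block(23));   /* 2 */
    printf("block #340 -> %d access(es)\n", accesses_for_block(340));  /* 3 */
    return 0;
}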

CASE STUDY: BERKELEY FAST FILE SYSTEM (FFS)

Fast File System (BSD 4.2, 1984)
• Same inode structure as in BSD 4.1
 – Same file header and triply indirect blocks we just studied
 – Some changes to block sizes: 1024 → 4096 bytes for performance
• Paper on FFS: "A Fast File System for UNIX"
 – Marshall McKusick, William Joy, Samuel Leffler, and Robert Fabry
 – Off the "resources" page of the course website – take a look!
• Optimization for Performance and Reliability:
 – Distribute inodes among different tracks to be closer to data
 – Uses bitmap allocation in place of freelist
 – Attempt to allocate files contiguously
 – 10% reserved disk space
 – Skip-sector positioning (mentioned later)


FFS Changes in Inode Placement: Motivation
• In early UNIX and DOS/Windows' FAT file system, headers stored in special array in outermost cylinders
 – Fixed size, set when disk is formatted
  » At formatting time, a fixed number of inodes are created
  » Each is given a unique number, called an "inumber"
• Problem #1: Inodes all in one place (outer tracks)
 – Head crash potentially destroys all files by destroying inodes
 – Inodes not close to the data that they point to
  » To read a small file, seek to get header, seek back to data
• Problem #2: When you create a file, you don't know how big it will become (in UNIX, most writes are by appending)
 – How much contiguous space do you allocate for a file?
 – Makes it hard to optimize for performance

FFS Locality: Block Groups
• BSD 4.2 (FFS) distributed the header information (inodes) closer to the data blocks
 – Often, inode for a file stored in same "cylinder group" as parent directory of the file
 – Makes an "ls" of that directory run very fast
• File system divided into a set of block groups
 – Each a set of tracks
• Data blocks, metadata, and free space interleaved within block group
 – Avoid huge seeks between user data and system structure
• Put directory and its files in common block group

FFS Locality: Block Groups (Con't)
• First-Free allocation of new file blocks
 – To expand a file, first try successive blocks in bitmap, then choose a new range of blocks
 – Few little holes at start, big sequential runs at end of group
 – Avoids fragmentation
 – Sequential layout for big files
• Important: keep 10% or more free!
 – Reserve space in the block group
• Summary: FFS Inode Layout Pros
 – For small directories, can fit all data, file headers, etc. in same cylinder → no seeks!
 – File headers much smaller than a whole block (a few hundred bytes), so multiple headers fetched from disk at the same time
 – Reliability: whatever happens to the disk, you can find many of the files (even if directories disconnected)

UNIX 4.2 BSD FFS First Fit Block Allocation
• Fills in the small holes at the start of the block group
• Avoids fragmentation, leaves contiguous free space at end


Attack of the Rotational Delay
• Problem #3: Missing blocks due to rotational delay
 – Issue: read one block, do processing, and read the next block. In the meantime, the disk has continued turning: missed next block! Need 1 revolution/block!
 – Solution 1: Skip-sector positioning ("interleaving")
  » Place the blocks from one file on every other block of a track: give time for processing to overlap rotation
  » Can be done by OS or in modern drives by the disk controller
 – Solution 2: Read ahead: read next block right after first, even if application hasn't asked for it yet
  » Can be done either by OS (read ahead), or by disk itself (track buffers): many disk controllers have internal RAM that allows them to read a complete track
• Modern disks + controllers do many things "under the covers"
 – Track buffers, elevator algorithms, bad block filtering

UNIX 4.2 BSD FFS
• Pros
 – Efficient storage for both small and large files
 – Locality for both small and large files
 – Locality for metadata and data
 – No defragmentation necessary!
• Cons
 – Inefficient for tiny files (a 1-byte file requires both an inode and a data block)
 – Inefficient encoding when file is mostly contiguous on disk
 – Need to reserve 10-20% of free space to prevent fragmentation

Administrivia
• Midterm 2: Graded!
 – Mean: 55.34, Stdev: 15.09
 – Historical offset: +26
• No class on Wednesday!
 – It is a holiday
• If you have any group issues going on, make sure you:
 – Make sure that your TA understands what is happening
 – Make sure that you reflect these issues on your group evaluations

Example: Ext2/3 Disk Layout
• Disk divided into block groups
 – Provides locality
 – Each group has two block-sized bitmaps (free blocks/inodes)
 – Block sizes settable at format time: 1K, 2K, 4K, 8K…
• Actual inode structure similar to 4.2 BSD
 – with 12 direct pointers
• Ext3: Ext2 with Journaling
 – Several degrees of protection with comparable overhead
 – We will talk about journaling later
• Example: create a file1.dat under /dir1/ in Ext3


Recall: Directory Abstraction
• Directories are specialized files
 – Contents: list of <file name, file number> pairs
• System calls to access directories
 – open / creat traverse the structure
 – mkdir / rmdir add/remove entries
 – link / unlink (rm)
• libc support
 – DIR * opendir (const char *dirname)
 – struct dirent * readdir (DIR *dirstream)
 – int readdir_r (DIR *dirstream, struct dirent *entry, struct dirent **result)
• (figure: directory tree /usr → /usr/lib → /usr/lib4.3 → /usr/lib4.3/foo)

Hard Links
• Hard link
 – Mapping from name to file number in the directory structure
 – First hard link to a file is made when the file is created
 – Create extra hard links to a file with the link() system call
 – Remove links with the unlink() system call
• When can file contents be deleted?
 – When there are no more hard links to the file
 – Inode maintains a reference count for this purpose
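A minimal sketch of the link()/unlink() behavior described above, using standard POSIX calls; the file names are illustrative only, and st_nlink is the inode's reference count:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Demonstrate hard links and the inode reference count (st_nlink). */
int main(void) {
    struct stat st;

    int fd = open("original.txt", O_WRONLY | O_CREAT, 0644);   /* first link */
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (link("original.txt", "extra.txt") < 0) {                /* second link */
        perror("link");
        return 1;
    }

    stat("original.txt", &st);
    printf("links to inode: %lu\n", (unsigned long) st.st_nlink);   /* prints 2 */

    unlink("original.txt");    /* name removed, contents still reachable via extra.txt */
    unlink("extra.txt");       /* last link removed: contents may now be freed         */
    return 0;
}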

Soft Links (Symbolic Links)
• Soft link, or symbolic link, or shortcut
 – Directory entry contains the path and name of the file
 – Map one name to another name
• Contrast these two different types of directory entries:
 – Normal directory entry: <file name, file #>
 – Symbolic link: <file name, destination file path>
• OS looks up the destination file name each time a program accesses the source file name
 – Lookup can fail (error result from open)
• Unix: create soft links with the symlink syscall

Directory Traversal
• What happens when we open /home/cs162/stuff.txt?
• "/" – inumber for root inode configured into kernel, say 2
 – Read inode 2 from its position in the inode array on disk
 – Extract the direct and indirect block pointers
 – Determine the block that holds the root directory (say block 49358)
 – Read that block, scan it for "home" to get the inumber for this directory (say 8086)
• Read inode 8086 for /home, extract its blocks, read a block (say 7756), scan it for "cs162" to get its inumber (say 732)
• Read inode 732 for /home/cs162, extract its blocks, read a block (say 12132), scan it for "stuff.txt" to get its inumber, say 9909
• Read inode 9909 for /home/cs162/stuff.txt
• Set up file description to refer to this inode so reads / writes can access the data blocks referenced by its direct and indirect pointers
• Check permissions on the final inode and each directory's inode…
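A small sketch using the standard symlink() and readlink() calls, showing that a soft link stores a destination path rather than an inumber (the file names here are hypothetical):

#include <stdio.h>
#include <unistd.h>

/* Create a soft link and show that it stores a path, not an inumber. */
int main(void) {
    char target[256];

    /* hypothetical names, for illustration only */
    if (symlink("/home/cs162/stuff.txt", "mylink") < 0)
        perror("symlink");              /* may already exist; keep going */

    ssize_t n = readlink("mylink", target, sizeof(target) - 1);
    if (n < 0) { perror("readlink"); return 1; }
    target[n] = '\0';

    /* Prints the stored destination path; an open("mylink", ...) would
     * resolve this path at that moment and fails if it is dangling.   */
    printf("mylink -> %s\n", target);
    return 0;
}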


Large Directories: B-Trees (dirhash)

in FreeBSD, NetBSD, OpenBSD

CASE STUDY: WINDOWS NTFS

New Technology File System (NTFS)
• Default on modern Windows systems
• Variable length extents
 – Rather than fixed blocks
• Instead of FAT or inode array: Master File Table
 – Like a database, with max 1 KB size for each table entry
 – Everything (almost) is a sequence of <attribute:value> pairs
  » Metadata and data
• Each entry in MFT contains metadata and:
 – File's data directly (for small files)
 – A list of extents (start block, size) for the file's data
 – For big files: pointers to other MFT entries with more lists

NTFS
• Master File Table
 – Database with flexible 1KB entries for metadata/data
 – Variable-sized attribute records (data or metadata)
 – Extend with variable depth (non-resident)
• Extents – variable length contiguous regions
 – Block pointers cover runs of blocks
 – Similar approach in Linux (ext4)
 – File create can provide a hint as to the size of the file
• Journaling for reliability
 – Discussed later

See: http://ntfs.com/ntfs-mft.htm
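To illustrate the extent idea, here is a hypothetical in-memory extent list and a lookup that maps a file-relative block to a disk block; the actual NTFS (and ext4) on-disk encodings differ, and struct extent / extent_lookup are made-up names:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory form of an extent list: each extent covers a
 * contiguous run of blocks (real on-disk formats encode this differently). */
struct extent {
    uint64_t file_block;   /* first file-relative block this extent covers */
    uint64_t disk_block;   /* where that run starts on disk                */
    uint64_t length;       /* run length in blocks                         */
};

/* Map a file-relative block number to a disk block; return 0 on a hole. */
uint64_t extent_lookup(const struct extent *ext, int n, uint64_t fblock) {
    for (int i = 0; i < n; i++)
        if (fblock >= ext[i].file_block &&
            fblock <  ext[i].file_block + ext[i].length)
            return ext[i].disk_block + (fblock - ext[i].file_block);
    return 0;
}

int main(void) {
    struct extent file[] = {          /* a mostly contiguous file in 3 runs */
        {   0, 1000, 64 },
        {  64, 5000, 64 },
        { 128, 9000, 16 },
    };
    printf("file block 70 -> disk block %llu\n",
           (unsigned long long) extent_lookup(file, 3, 70));   /* prints 5006 */
    return 0;
}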

NTFS Small File: Data Stored with Metadata / NTFS Medium File: Extents for File Data
• (figures: MFT records showing a standard information attribute – create time, modify time, access time, owner id, security specifier, flags (RO, hidden, sys) – a data attribute, and an attribute list; a small file's contents live directly in the data attribute, while a medium file's data attribute holds extents pointing to data blocks)

NTFS Large File: Pointers to Other MFT Records / NTFS Huge, Fragmented File: Many MFT Records
• (figures only: the base MFT record's attribute list points to additional MFT records, each holding further extent lists)


NTFS Directories

• Directories implemented as B-trees
• File's number identifies its entry in the MFT
• MFT entry always has a file name attribute
 – Human readable name, file number of parent dir
• Hard link? Multiple file name attributes in the MFT entry

MEMORY MAPPED FILES

Memory Mapped Files
• Traditional I/O involves explicit transfers between buffers in process address space to/from regions of a file
 – This involves multiple copies into caches in memory, plus system calls
• What if we could "map" the file directly into an empty region of our address space?
 – Implicitly "page it in" when we read it
 – Write it and "eventually" page it out
• Executable files are treated this way when we exec the process!!

Recall: Who Does What, When?
• (figure: a process issues an instruction with a virtual address (page# + offset); the MMU translates through the page table to a frame#, or raises a page fault exception; the OS page fault handler loads the page from disk, updates the PT entry, the scheduler runs something else in the meantime, and the faulting instruction is retried)


Using Paging to mmap() Files
• (figure: on access to the mapped region the MMU takes a page fault; the OS page fault handler creates PT entries for the mapped region as "backed" by the file, so file contents are then read from memory; mmap() maps the file to a region of the VAS)

mmap() system call
• May map a specific region or let the system find one for you
 – Tricky to know where the holes are
• Used both for manipulating files and for sharing between processes

An mmap() Example

#include <sys/mman.h>  /* also stdio.h, stdlib.h, string.h, fcntl.h, unistd.h */

int something = 162;

int main(int argc, char *argv[]) {
  int myfd;
  char *mfile;

  printf("Data at: %16lx\n", (long unsigned int) &something);
  printf("Heap at : %16lx\n", (long unsigned int) malloc(1));
  printf("Stack at: %16lx\n", (long unsigned int) &mfile);

  /* Open the file */
  myfd = open(argv[1], O_RDWR | O_CREAT);
  if (myfd < 0) { perror("open failed!"); exit(1); }

  /* map the file */
  mfile = mmap(0, 10000, PROT_READ|PROT_WRITE, MAP_FILE|MAP_SHARED, myfd, 0);
  if (mfile == MAP_FAILED) { perror("mmap failed"); exit(1); }
  printf("mmap at : %16lx\n", (long unsigned int) mfile);

  puts(mfile);
  strcpy(mfile+20, "Let's write over it");
  close(myfd);
  return 0;
}

Sample run:
$ ./mmap test
Data at: 105d63058
Heap at : 7f8a33c04b70
Stack at: 7fff59e9db10
mmap at : 105d97000
This is line one
This is line two
This is line three
This is line four

$ cat test
This is line one
ThiLet's write over its line three
This is line four

Sharing through Mapped Files
• (figure: two processes, VAS 1 and VAS 2, each with instructions, data, heap, and stack regions from 0x000… to 0xFFF…; both map the same file into their address spaces, so the file's pages in memory are shared)
• Also: anonymous memory between parents and children
 – no file backing – just swap space
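The "anonymous memory" case can be sketched with a shared anonymous mapping (standard POSIX mmap() with MAP_ANONYMOUS, spelled MAP_ANON on some systems); the parent sees the child's write because both share the same physical pages:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent and child share one small anonymous mapping (no file backing). */
int main(void) {
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    *shared = 0;
    if (fork() == 0) {           /* child writes through the shared mapping */
        *shared = 162;
        return 0;
    }
    wait(NULL);                  /* parent waits, then sees the child's update */
    printf("parent reads %d\n", *shared);   /* prints 162 */
    munmap(shared, sizeof(int));
    return 0;
}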

THE BUFFER CACHE

Buffer Cache
• Kernel must copy disk blocks to main memory to access their contents, and write them back if modified
 – Could be data blocks, inodes, directory contents, etc.
 – Possibly dirty (modified and not yet written back)
• Key Idea: exploit locality by caching disk data in memory
 – Name translations: mapping from paths → inodes
 – Disk blocks: mapping from block address → disk content
• Buffer Cache: memory used to cache kernel resources, including disk blocks and name translations
 – Can contain "dirty" blocks (with modifications not on disk)

File System Buffer Cache
• OS implements a cache of disk blocks for efficient access to data, directories, inodes, and the freemap
• (figure: disk holds data blocks, inodes, directory data blocks, and the free bitmap; memory holds cached copies of these blocks, referenced from per-process PCBs and open file descriptors; reads and writes go through the cached blocks)

File System Buffer Cache: open
• Directory lookup, repeat as needed:
 – Load block of directory
 – Search for the mapping (file name → inumber)
• Create reference via open file descriptor

File System Buffer Cache: Read?
• From inode, traverse index structure to find data block
• Load data block
• Copy all or part to the read data buffer

File System Buffer Cache: Write?
• Process similar to read, but may allocate new blocks (update free map)
• Blocks need to be written back to disk; inode too?

File System Buffer Cache: Eviction?
• Blocks being written back to disk go through a transient state
• (figure: across these steps, cached block slots move from free to holding directory blocks, inodes, data blocks, and dirty blocks awaiting writeback)


Buffer Cache Discussion
• Implemented entirely in OS software
 – Unlike memory caches and TLB
• Blocks go through transitional states between free and in-use
 – Being read from disk, being written to disk
 – Other processes can run, etc.
• Blocks are used for a variety of purposes
 – inodes, data for dirs and files, freemap
 – OS maintains pointers into them
• Termination – e.g., process exit – open, read, write
• Replacement – what to do when it fills up?

File System Caching
• Replacement policy? LRU
 – Can afford the overhead of a full LRU implementation
 – Advantages:
  » Works very well for name translation
  » Works well in general as long as memory is big enough to accommodate a host's working set of files
 – Disadvantages:
  » Fails when some application scans through the file system, thereby flushing the cache with data used only once
  » Example: find . –exec grep foo {} \;
• Other Replacement Policies?
 – Some systems allow applications to request other policies
 – Example, 'Use Once':
  » File system can discard blocks as soon as they are used
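To make the LRU discussion concrete, here is a minimal user-level sketch of an LRU-managed block cache (not the real kernel buffer cache; cache_init, get_block, and the stubbed readblock/writeblock are made-up names standing in for real disk I/O):

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NBUF       64                  /* cache capacity in blocks */

struct buf {
    int64_t  blockno;                  /* -1 means the slot is free          */
    int      dirty;                    /* modified and not yet written back  */
    uint64_t last_used;                /* larger = more recently used (LRU)  */
    char     data[BLOCK_SIZE];
};

static struct buf cache[NBUF];
static uint64_t clock_ticks;

/* Stubs standing in for real disk I/O. */
static void readblock(int64_t blockno, void *dst)        { (void)blockno; memset(dst, 0, BLOCK_SIZE); }
static void writeblock(int64_t blockno, const void *src) { (void)blockno; (void)src; }

void cache_init(void) {
    for (int i = 0; i < NBUF; i++)
        cache[i].blockno = -1;
}

/* Return the cached copy of a block, evicting the LRU slot if needed. */
struct buf *get_block(int64_t blockno) {
    struct buf *victim = &cache[0];
    for (int i = 0; i < NBUF; i++) {
        if (cache[i].blockno == blockno) {             /* hit */
            cache[i].last_used = ++clock_ticks;
            return &cache[i];
        }
        if (cache[i].last_used < victim->last_used)    /* remember LRU slot */
            victim = &cache[i];
    }
    if (victim->dirty)                                 /* write back before reuse */
        writeblock(victim->blockno, victim->data);
    readblock(blockno, victim->data);                  /* miss: fetch from "disk" */
    victim->blockno   = blockno;
    victim->dirty     = 0;
    victim->last_used = ++clock_ticks;
    return victim;
}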

File System Caching (con't)
• Cache Size: How much memory should the OS allocate to the buffer cache vs. virtual memory?
 – Too much memory to the file system cache → won't be able to run many applications
 – Too little memory to the file system cache → many applications may run slowly (disk caching not effective)
 – Solution: adjust the boundary dynamically so that the disk access rates for paging and file access are balanced

File System Prefetching
• Read Ahead Prefetching: fetch sequential blocks early
 – Key Idea: exploit the fact that the most common file access is sequential, by prefetching subsequent disk blocks ahead of the current read request
 – Elevator algorithm can efficiently interleave prefetches from concurrent applications
• How much to prefetch?
 – Too much prefetching imposes delays on requests by other applications
 – Too little prefetching causes many seeks (and rotational delays) among concurrent file requests
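An application can also request sequential read-ahead explicitly. A small sketch using the standard posix_fadvise() hint (the kernel is free to treat the advice as best-effort):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hint that we will read this file sequentially, so the OS can prefetch
 * aggressively; then stream through it.                                  */
int main(int argc, char *argv[]) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);   /* whole file, sequential */

    char buf[65536];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;   /* consume the data; reads mostly hit prefetched cache blocks */
    close(fd);
    return 0;
}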


Delayed Writes
• Buffer cache is a writeback cache (writes are termed "Delayed Writes")
• write() copies data from user space to the kernel buffer cache
 – Quick return to user space
• read() is fulfilled by the cache, so reads see the results of writes
 – Even if the data has not reached disk
• When does data from a write syscall finally reach disk?
 – When the buffer cache is full (e.g., we need to evict something)
 – When the buffer cache is flushed periodically (in case we crash)

Delayed Writes (Advantages)
• Performance advantage: return to user quickly without writing to disk!
• Disk scheduler can efficiently order lots of requests
 – Elevator Algorithm can rearrange writes to avoid random seeks
• Delay block allocation:
 – May be able to allocate multiple blocks at the same time for a file, keep them contiguous
• Some files never actually make it all the way to disk
 – Many short-lived files!
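When an application needs a particular write to reach disk sooner than the periodic flush, it can force the delayed write out with fsync(). A minimal sketch using standard POSIX calls:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write() returns once data is in the kernel buffer cache; fsync() blocks
 * until the cached blocks (data and metadata) actually reach the disk.   */
int main(void) {
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "important record\n";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }
    /* At this point a crash could still lose the record: it may exist
     * only as a dirty block in the buffer cache.                       */

    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    /* Now the record (and the inode update) has been forced to stable storage. */

    close(fd);
    return 0;
}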

Buffer Caching vs. Demand Paging
• Replacement Policy?
 – Demand Paging: LRU is infeasible; use approximation (like NRU/Clock)
 – Buffer Cache: LRU is OK
• Eviction Policy?
 – Demand Paging: evict not-recently-used pages when memory is close to full
 – Buffer Cache: write back dirty blocks periodically, even if used recently
  » Why? To minimize data loss in case of a crash

Dealing with Persistent State
• Buffer Cache: write back dirty blocks periodically, even if used recently
 – Why? To minimize data loss in case of a crash
 – Linux does a periodic flush every 30 seconds
• Not foolproof! Can still crash with dirty blocks in the cache
 – What if the dirty block was for a directory?
  » Lose pointer to a file's inode (leak space)
  » File system now in an inconsistent state!


Important "ilities"
• Availability: the probability that the system can accept and process requests
 – Measured in "nines" of probability: e.g. 99.9% probability is "3-nines of availability"
 – Key idea here is independence of failures
• Durability: the ability of a system to recover data despite faults
 – This idea is fault tolerance applied to data
 – Doesn't necessarily imply availability: information on the pyramids was very durable, but could not be accessed until discovery of the Rosetta Stone
• Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
 – Usually stronger than simply availability: means that the system is not only "up", but also working correctly
 – Includes availability, security, fault tolerance/durability
 – Must make sure data survives system crashes, disk crashes, other problems

HOW TO MAKE FILE SYSTEMS MORE DURABLE?

How to Make File Systems More Durable?
• Disk blocks contain Reed-Solomon error correcting codes (ECC) to deal with small defects in the disk drive
 – Can allow recovery of data from small media defects
• Make sure writes survive in the short term
 – Either abandon delayed writes, or
 – Use special, battery-backed RAM (called non-volatile RAM or NVRAM) for dirty blocks in the buffer cache
• Make sure that data survives in the long term
 – Need to replicate! More than one copy of data!
 – Important element: independence of failure
  » Could put copies on one disk, but if disk head fails…
  » Could put copies on different disks, but if server fails…
  » Could put copies on different servers, but if building is struck by lightning…
  » Could put copies on servers in different continents…

RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "shadow" (the pair forms a recovery group)
 – For high I/O rate, high availability environments
 – Most expensive solution: 100% capacity overhead
• Bandwidth sacrificed on write:
 – Logical write = two physical writes
 – Highest bandwidth when disk heads and rotation synchronized (challenging)
• Reads may be optimized
 – Can have two independent reads to the same data
• Recovery:
 – Disk failure → replace disk and copy data to the new disk
 – Hot Spare: idle disk attached to system for immediate replacement

RAID 5+: High I/O Rate Parity
• Data striped across multiple disks
 – Successive blocks stored on successive (non-parity) disks
 – Increased bandwidth over a single disk
• Parity block constructed by XORing the data blocks in its stripe
 – P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
 – Can destroy any one disk and still reconstruct the data
 – Suppose Disk 3 fails; then we can reconstruct: D2 = D0 ⊕ D1 ⊕ D3 ⊕ P0
• (figure: stripe units D0–D23 and parity blocks P0–P5 rotated across Disks 1–5, with logical disk addresses increasing down the stripes)
• Can spread information widely across the Internet for durability
 – RAID algorithms work over geographic scale

RAID 6 and other Erasure Codes
• In general: RAID-X is an "erasure code"
 – Must have the ability to know which disks are bad – treat a missing disk as an "erasure"
• Today, disks are so big that RAID 5 is not sufficient!
 – Time to repair a disk is sooooo long, another disk might fail in the process!
 – "RAID 6" – allow 2 disks in a replication stripe to fail
 – Requires a more complex erasure code, such as the EVENODD code (see readings)
• More general option for a general erasure code: Reed-Solomon codes
 – Based on polynomials in GF(2^k) (i.e., k-bit symbols)
 – m data points define a degree m−1 polynomial; the encoding is n points on the polynomial
 – Any m points can be used to recover the polynomial; n−m failures tolerated
• Erasure codes are not just for disk arrays. For example, geographic replication:
 – E.g., split data into m = 4 chunks, generate n = 16 fragments, and distribute across the Internet
 – Any 4 fragments can be used to recover the original data – very durable!
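A tiny sketch of the parity arithmetic above, using 8-byte "blocks" purely for illustration: build P0 by XORing D0 through D3, then rebuild a lost D2 from the survivors and the parity.

#include <stdio.h>
#include <string.h>

#define STRIPE_UNIT 8      /* tiny blocks, just for illustration */

/* XOR a data block into the running parity block. */
void xor_into(unsigned char *parity, const unsigned char *block) {
    for (int i = 0; i < STRIPE_UNIT; i++)
        parity[i] ^= block[i];
}

int main(void) {
    unsigned char d[4][STRIPE_UNIT] = {"D0data", "D1data", "D2data", "D3data"};
    unsigned char p0[STRIPE_UNIT] = {0};

    /* P0 = D0 xor D1 xor D2 xor D3 */
    for (int i = 0; i < 4; i++)
        xor_into(p0, d[i]);

    /* Disk holding D2 fails: rebuild it from the survivors and the parity. */
    unsigned char rebuilt[STRIPE_UNIT] = {0};
    xor_into(rebuilt, d[0]);
    xor_into(rebuilt, d[1]);
    xor_into(rebuilt, d[3]);
    xor_into(rebuilt, p0);

    printf("recovered D2: %s (matches: %s)\n", rebuilt,
           memcmp(rebuilt, d[2], STRIPE_UNIT) == 0 ? "yes" : "no");
    return 0;
}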

Use of Erasure Coding for High Durability/Overhead Ratio!
• Exploit the law of large numbers for durability!
• 6-month repair time, Fraction of Blocks Lost Per Year (FBLPY) with a 4x increase in total size of data:
 – Replication (4 copies): 0.03
 – Fragmentation (16 of 64 fragments needed): 10^-35
• (figure: FBLPY vs. repair time for replication and fragmentation)

Higher Durability through Geographic Replication
• Highly durable – hard to destroy all copies
• Highly available for reads
 – Simple replication: read any copy
 – Erasure coded: read m of n
• Low availability for writes
 – Can't write if any one replica is not up
 – Or – need relaxed consistency model
• Reliability? – availability, security, durability, fault-tolerance
• (figure: Replica/Frag #1, #2, …, #n spread across geographically separate sites)

File System Reliability: (Difference from Block-level Reliability)
• What can happen if the disk loses power or software crashes?
 – Some operations in progress may complete
 – Some operations in progress may be lost
 – Overwrite of a block may only partially complete
• Having RAID doesn't necessarily protect against all such failures
 – No protection against writing bad state
 – What if one disk of a RAID group is not written?
• File system needs durability (as a minimum!)
 – Data previously stored can be retrieved (maybe after some recovery step), regardless of failure
• But durability is not quite enough…!

HOW TO MAKE FILE SYSTEMS MORE RELIABLE?

Storage Reliability Problem
• A single logical file operation can involve updates to multiple physical disk blocks
 – inode, indirect block, data block, bitmap, …
 – With sector remapping, a single update to a physical disk block can require multiple (even lower-level) updates to sectors
• At a physical level, operations complete one at a time
 – Want concurrent operations for performance
• How do we guarantee consistency regardless of when a crash occurs?

Threats to Reliability
• Interrupted operation
 – Crash or power failure in the middle of a series of related updates may leave stored data in an inconsistent state
 – Example: transfer funds from one bank account to another – what if the transfer is interrupted after withdrawal and before deposit?
• Loss of stored data
 – Failure of non-volatile storage media may cause previously stored data to disappear or be corrupted


Two Reliability Approaches
• Careful Ordering and Recovery (FAT & FFS + fsck)
 – Each step builds structure: data block → inode → free → directory
 – Last step links it in to the rest of the FS
 – Recovery scans the structure looking for incomplete actions
• Versioning and Copy-on-Write (ZFS, …)
 – Version files at some granularity
 – Create new structure linking back to unchanged parts of old
 – Last step is to declare that the new version is ready

Reliability Approach #1: Careful Ordering
• Sequence operations in a specific order
 – Careful design to allow the sequence to be interrupted safely
• Post-crash recovery
 – Read data structures to see if there were any operations in progress
 – Clean up/finish as needed
• Approach taken by
 – FAT and FFS (fsck) to protect filesystem structure/metadata
 – Many app-level recovery schemes (e.g., Word autosaves)

Berkeley FFS: Create a File
• Normal operation:
 – Allocate data block
 – Write data block
 – Allocate inode
 – Write inode block
 – Update bitmap of free blocks and inodes
 – Update directory with file name → inode number
 – Update modify time for directory
• Recovery:
 – Scan inode table
 – If any unlinked files (not in any directory), delete or put in lost & found dir
 – Compare free block bitmap against inode trees
 – Scan directories for missing update/access times
 – Time proportional to disk size

Reliability Approach #2: Copy on Write File Layout
• Recall: the multi-level index structure lets us find the data blocks of a file
• Instead of over-writing existing data blocks and updating the index structure:
 – Create a new version of the file with the updated data
 – Reuse blocks that don't change – much of what is already in place
 – This is called: Copy On Write (COW)
• Seems expensive! But
 – Updates can be batched
 – Almost all disk writes can occur in parallel
• Approach taken in network file server appliances
 – NetApp's Write Anywhere File Layout (WAFL)
 – ZFS (Sun/Oracle) and OpenZFS


COW with Smaller-Radix Blocks
• (figure: old version and new version of a file's block tree; a write copies only the path from the modified leaf up to the new root, sharing all unchanged subtrees with the old version)
• If the file is represented as a tree of blocks, just need to update the leading fringe

Example: ZFS and OpenZFS
• Variable sized blocks: 512 B – 128 KB
• Symmetric tree
 – Know if it is large or small when we make the copy
• Store version number with pointers
 – Can create a new version by adding blocks and new pointers
• Buffers a collection of writes before creating a new version with them
• Free space represented as a tree of extents in each block group
 – Delay updates to freespace (in log) and do them all when the block group is activated

More General Reliability Solutions
• Use Transactions for atomic updates
 – Ensure that multiple related updates are performed atomically
 – i.e., if a crash occurs in the middle, the state of the system reflects either all or none of the updates
 – Most modern file systems use transactions internally to update filesystem structures and metadata
 – Many applications implement their own transactions
• Provide redundancy for media failures
 – Redundant representation on media (Error Correcting Codes)
 – Replication across media (e.g., RAID disk array)

Transactions
• Closely related to critical sections for manipulating shared data structures
• They extend the concept of atomic updates from memory to stable storage
 – Atomically update multiple persistent data structures
• Many ad-hoc approaches
 – FFS carefully ordered the sequence of updates so that if a crash occurred while manipulating directories or inodes, the disk scan on reboot would detect and recover the error (fsck)
 – Applications use temporary files and rename
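The "temporary files and rename" approach mentioned above works because rename() atomically replaces the destination name within a filesystem. A minimal sketch (file names are illustrative only):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomically replace config.txt: write a complete new copy to a temp file,
 * force it to disk, then rename() it over the old name. Readers see either
 * the old file or the new one, never a half-written mix.                  */
int main(void) {
    const char *newdata = "setting = 42\n";

    int fd = open("config.txt.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, newdata, strlen(newdata)) < 0) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }   /* make the new data durable first */
    close(fd);

    if (rename("config.txt.tmp", "config.txt") < 0) {   /* atomic commit point */
        perror("rename");
        return 1;
    }
    return 0;
}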


Key Concept: Transaction
• A transaction is an atomic sequence of reads and writes that takes the system from one consistent state to another.
 – consistent state 1 → [transaction] → consistent state 2
• Recall: code in a critical section appears atomic to other threads
• Transactions extend the concept of atomic updates from memory to persistent storage

Typical Structure
• Begin a transaction – get transaction id
• Do a bunch of updates
 – If any fail along the way, roll back
 – Or, if any conflict with other transactions, roll back
• Commit the transaction

"Classic" Example: Transaction

BEGIN; --BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100.00 WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00 WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00 WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
COMMIT; --COMMIT WORK

Transfer $100 from Alice's account to Bob's account

Concept of a Log
• One simple action is atomic – write/append a basic item
• Use that to seal the commitment to a whole series of actions
• (figure: a log containing records such as "Start Tran N", "Get 10$ from account A", "Get 13$ from account C", "Get 7$ from account B", "Put 15$ into account X", "Put 15$ into account Y", …, "Commit Tran N")


Transactional File Systems
• Better reliability through use of a log
 – Changes are treated as transactions
 – A transaction is committed once it is written to the log
  » Data forced to disk for reliability
  » Process can be accelerated with NVRAM
 – Although the file system may not be updated immediately, data is preserved in the log
• Difference between "Log Structured" and "Journaled"
 – In a Log-Structured filesystem, data stays in log form
 – In a Journaled filesystem, the log is used for recovery

Journaling File Systems
• Don't modify data structures on disk directly
• Write each update as a transaction recorded in a log
 – Commonly called a journal or intention list
 – Also maintained on disk (allocate blocks for it when formatting)
• Once changes are in the log, they can be safely applied to the file system
 – e.g., modify inode pointers and directory mapping
• Garbage collection: once a change is applied, remove its entry from the log
• Linux took the original FFS-like file system (ext2) and added a journal to get ext3!
 – Some options: whether or not to write all data to the journal or just metadata
• Other examples: NTFS, Apple HFS+, Linux XFS, JFS, ext4
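A toy user-level sketch of the journaling idea (the log file name, record format, and journal_write helper are made up, and a real journal would fsync() rather than fflush() at the commit point): updates are appended to the log, a COMMIT record seals the transaction, and recovery replays only transactions whose COMMIT reached the log.

#include <stdio.h>
#include <string.h>

/* Toy journal: each line is a record; a transaction is only considered
 * committed (and thus replayed) if its COMMIT record made it to the log. */

void journal_write(FILE *log, const char *rec) {
    fprintf(log, "%s\n", rec);
    fflush(log);                 /* a real journal would fsync() here */
}

int main(void) {
    FILE *log = fopen("journal.txt", "w+");
    if (!log) { perror("fopen"); return 1; }

    /* Log the whole multi-block update, then commit it. */
    journal_write(log, "START  tx1");
    journal_write(log, "UPDATE free-map: mark block 1021 used");
    journal_write(log, "UPDATE inode 9909: add pointer to block 1021");
    journal_write(log, "UPDATE dir 732: add entry file1.dat -> 9909");
    journal_write(log, "COMMIT tx1");
    /* Only after COMMIT is durable may the real inode/dir/bitmap blocks be
     * written in place; afterwards tx1 can be trimmed from the log.       */

    /* Recovery: scan the log; replay committed transactions, discard others. */
    char line[128];
    int committed = 0;
    rewind(log);
    while (fgets(line, sizeof(line), log))
        if (strncmp(line, "COMMIT tx1", 10) == 0)
            committed = 1;
    printf("tx1 %s\n", committed ? "replayed" : "discarded");

    fclose(log);
    return 0;
}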

Creating a File (No Journaling Yet)
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
• Write map (i.e., mark used)
• Write inode entry to point to block(s)
• Write dirent to point to inode
• (figure: free space map, inode table, directory entries, and data blocks all updated in place on disk)

Creating a File (With Journaling)
• Find free data block(s)
• Find free inode entry
• Find dirent insertion point
• [log] Write map (i.e., mark used)
• [log] Write inode entry to point to block(s)
• [log] Write dirent to point to inode
• (figure: the log lives in non-volatile storage (Flash or on disk); between the transaction's start and commit records, entries move from pending to done as the head and tail pointers advance)


After Commit, Eventually Replay Transaction
• All accesses to the file system first look in the log
 – Actual on-disk data structure might be stale
• Eventually, copy changes to disk and discard the transaction from the log
• (figure: as logged changes are applied, the log's tail pointer advances past the completed transaction)

Crash Recovery: Discard Partial Transactions
• Upon recovery, scan the log
• Detect a transaction start with no commit
• Discard those log entries
• Disk remains unchanged


Why go through all this trouble? • Updates atomic, even if we crash: • Scan log, find start – Update either gets fully applied or discarded Free space – All physical operations treated as a logical unit • Find matching commit map … Data blocks Inode table Isn’t this expensive? • Redo it as usual • Yes! We're now writing all data twice (once to log, once to actual data – Or just let it happen later Directory blocks in target file) entries • Modern filesystems journal metadata updates only tail head – Record modifications to file system data structures – But apply updates to a file’s contents directly done pending start commit Log: in non-volatile storage (Flash or on Disk)


File System Summary (1/3)
• File System:
 – Transforms blocks into Files and Directories
 – Optimize for size, access and usage patterns
 – Maximize sequential access, allow efficient random access
 – Projects the OS protection and security regime (UGO vs ACL)
• File defined by header, called "inode"
• Naming: translating from user-visible names to actual system resources
 – Directories used for naming for local file systems
 – Linked or tree structure stored in files
• 4.2 BSD Multilevel Indexed Scheme
 – inode contains file info, direct pointers to blocks, indirect blocks, doubly indirect, etc.
 – NTFS: variable extents, not fixed blocks; tiny files' data is in the header

File System Summary (2/3)
• File layout driven by freespace management
 – Optimizations for sequential access: start new files in open ranges of free blocks, rotational optimization
 – Integrate freespace, inode table, file blocks and dirs into block group
• Deep interactions between memory management, file system, sharing
 – mmap(): map file or anonymous segment to memory
• Buffer Cache: memory used to cache kernel resources, including disk blocks and name translations
 – Can contain "dirty" blocks (blocks not yet on disk)

11/9/20 Kubiatowicz CS162 © UCB Fall 2020 Lec 21.83 11/9/20 Kubiatowicz CS162 © UCB Fall 2020 Lec 21.84 File System Summary (3/3) • File system operations involve multiple distinct updates to blocks on disk – Need to have all or nothing semantics – Crash may occur in the midst of the sequence • Traditional file system perform check and recovery on boot – Along with careful ordering so partial operations result in loose fragments, rather than loss • Copy-on-write provides richer function (versions) with much simpler recovery – Little performance impact since sequential write to storage device is nearly free • Transactions over a log provide a general solution – Commit sequence to durable log, then update the disk – Log takes precedence over disk – Replay committed transactions, discard partials
