Lecture 24: Filesystem Implementations

Fall 2018 Jason Tang

Slides based upon Operating System Concepts slides, http://codex.cs.yale.edu/avi/os-book/OS9/slide-dir/index.html Copyright Silberschatz, Galvin, and Gagne, 2013

Topics

• Allocation Methods

• Filesystem Performance

• Filesystem Features

• Real Filesystems

Contiguous Allocation

• Allocation Methods: Describes how filesystem allocates disk blocks (similar to a memory allocator)

• Contiguous Allocation: Each file occupies set of contiguous blocks

• Simple implementation: only need starting location (a block number) and length (number of blocks)

• Optimal performance for HDDs: after disk head has finished reading block n, it is in correct position to read block n + 1

Contiguous Allocation

• Difficulty in finding space for file

• Use first fit, best fit, or worst fit algorithm, just like with memory

• Subject to external fragmentation

• How to determine number of blocks when file is created?

• Unable to handle files that grow over time

• Need for offline compaction
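The first-fit search mentioned above can be sketched over a free-block map. This is a minimal illustration, not any particular filesystem's allocator; the `first_fit` name and the boolean free map are assumptions made for the example.

```python
def first_fit(free_map, length):
    """Return the start of the first run of `length` free blocks, or None.

    free_map[i] is True when block i is free. Assumes length >= 1.
    """
    run_start, run_len = 0, 0
    for block, is_free in enumerate(free_map):
        if is_free:
            if run_len == 0:
                run_start = block      # a new run of free blocks begins here
            run_len += 1
            if run_len == length:
                return run_start
        else:
            run_len = 0                # an allocated block breaks the run
    return None

# Hypothetical 10-block disk: True = free, False = allocated
disk = [False, True, True, False, True, True, True, False, True, True]
print(first_fit(disk, 3))   # 4
print(first_fit(disk, 4))   # None
```

The second call fails even though seven blocks are free in total, which is the external fragmentation problem: no single run of free blocks is long enough.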

Extent-Based Allocation

• Similar to contiguous allocation, in that contiguous blocks reserved for file

• When not enough reserved blocks, filesystem allocates another extent

• List of extents stored in file’s metadata (inode)

• A file consists of zero or more extents

• Internal fragmentation if extents too large

• External fragmentation if a file has many extents

• Used in many modern filesystems (including EXT4, HFS+, and NTFS)
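As a rough sketch of how an extent list maps a file's logical blocks to disk blocks, consider the following. The `Extent` type and `logical_to_physical` function are hypothetical names for this example; real filesystems store extents in more compact on-disk formats.

```python
from typing import NamedTuple

class Extent(NamedTuple):
    start: int    # first physical block of the extent
    length: int   # number of contiguous blocks in the extent

def logical_to_physical(extents, logical_block):
    """Map a file-relative block number to a physical block, or None if past EOF."""
    remaining = logical_block
    for extent in extents:
        if remaining < extent.length:
            return extent.start + remaining
        remaining -= extent.length     # skip this extent and keep looking
    return None

# A file made of three extents (14 blocks in total)
file_extents = [Extent(100, 4), Extent(300, 2), Extent(50, 8)]
print(logical_to_physical(file_extents, 3))    # 103
print(logical_to_physical(file_extents, 4))    # 300
print(logical_to_physical(file_extents, 14))   # None
```

Only one (start, length) pair is needed per extent, so the metadata stays small as long as the file has few extents.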

Linked Allocation

• Each file is a linked list of blocks; disk blocks are scattered throughout disk

• Each block contains pointer to next block, or NULL to indicate end of file

• Directory stores pointer to first and last blocks of each file

• No random access; must traverse list sequentially

• Need to defragment disk for optimal HDD performance

• Example: If each block is 512 bytes and a pointer to next block requires 4 bytes, then each block may only hold 508 bytes (0.78% overhead)
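The arithmetic in the example, and the cost of the missing random access, can be checked with a short sketch. The helper names are assumptions for illustration; the block and pointer sizes come from the example above.

```python
BLOCK_SIZE = 512                       # bytes per block
POINTER_SIZE = 4                       # bytes used by the next-block pointer
PAYLOAD = BLOCK_SIZE - POINTER_SIZE    # 508 bytes of file data per block

def blocks_needed(file_size):
    """Blocks required to store file_size bytes (ceiling division)."""
    return -(-file_size // PAYLOAD)

def reads_to_reach(offset):
    """Block reads needed to reach a byte offset when starting at the first block."""
    return offset // PAYLOAD + 1

overhead_percent = 100 * POINTER_SIZE / BLOCK_SIZE
print(overhead_percent)            # 0.78125
print(blocks_needed(1_000_000))    # 1969
print(reads_to_reach(999_999))     # 1969
```

Reaching the last byte of a file of about 1 MB means reading every block before it, because each pointer lives inside the previous block.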

Linked Allocation

• Cluster allocation: Similar to linked allocation, but allocation in multiples of blocks (clusters) instead of individual blocks

• Improves performance, but increases internal fragmentation

• Reliability problem: if a block/cluster is written to a bad sector, the remainder of file is lost

File Allocation Table (FAT)

• Drawback to linked allocation is that seeking to a specific location requires many I/O operations

• FAT filesystem reserves space at beginning of disk that contains a block table

• Each entry in table corresponds to a block on disk

• Each entry contains value of next block, or special end-of-file value, or 0 to indicate an unused block

• Kernel caches entire block table for faster access
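A minimal sketch of following a file's chain through such a block table is shown below. The `END_OF_FILE` value and the choice to reserve entry 0 are assumptions made for this example, not the on-disk FAT encoding.

```python
FREE = 0            # table value for an unused block, as described above
END_OF_FILE = -1    # hypothetical end-of-file marker for this sketch

def file_blocks(fat, first_block):
    """Follow the chain in the table and return the file's blocks in order."""
    blocks = []
    block = first_block
    while block != END_OF_FILE:
        if fat[block] == FREE:
            raise ValueError(f"block {block} is in a chain but marked unused")
        blocks.append(block)
        block = fat[block]     # next block comes from the table, not from disk data
    return blocks

def free_blocks(fat):
    # Entry 0 is treated as reserved here so that the value 0 can mean "unused"
    return [n for n in range(1, len(fat)) if fat[n] == FREE]

fat = [FREE] * 8
fat[2], fat[5], fat[3] = 5, 3, END_OF_FILE    # one file occupying blocks 2 -> 5 -> 3
print(file_blocks(fat, 2))    # [2, 5, 3]
print(free_blocks(fat))       # [1, 4, 6, 7]
```

Because the whole table can sit in memory, finding the k-th block of a file needs no disk reads until the data block itself is fetched.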

Allocation Comparison

[Figure: linked allocation compared with a file allocation table]

Indexed Allocation

• Each file has its own index block

• Contains pointers to actual data blocks, similar to page table

• Can be multilevel, like a hierarchical page table

• Maximum file size based upon number of pointers possible

• More overhead, due to additional index block metadata
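To see how the number of pointers bounds the file size, here is a small calculation. The 4 KiB block size and 4-byte pointer size are assumptions chosen for the example, and `max_file_size` is a hypothetical helper.

```python
def max_file_size(block_size, pointer_size, levels):
    """Largest file addressable through `levels` levels of index blocks."""
    pointers_per_block = block_size // pointer_size
    return pointers_per_block ** levels * block_size

MiB, GiB = 2**20, 2**30
print(max_file_size(4096, 4, 1) // MiB)   # 4  (single index block)
print(max_file_size(4096, 4, 2) // GiB)   # 4  (two-level index)
```

Each added level multiplies the reachable size by the number of pointers per block, at the cost of one more index block read per access.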

Efficiency and Performance

• Efficiency depends on:

• Disk allocation and directory algorithms

• Information kept in file metadata

• Performance depends on:

• Keeping file data and metadata physically close together (for HDDs)

• Buffer cache: main memory reserved to cache frequently used blocks

Filesystem Optimizations

• Read-ahead: when reading data block n, often block n+1 will be needed soon

• A process that is reading a file will need those bytes soon, but a process that is writing to a file will usually not read back those bytes

• Synchronous writes: bytes committed to disk in same order that kernel receives request

• Asynchronous writes: kernel caches write requests, and will commit those changes whenever disk is idle (more common)

• Coherency: needed if one process is writing and a different process is reading same file

Traditional Caching

• Devices use DMA to copy file data into a page cache

• OS buffers I/O transactions in a separate buffer cache

• The same data can be cached twice, which could cause inconsistency

Unified Buffer Cache

• Uses same memory cache for process I/O as well as hardware DMA

• Solves coherency problem; implemented by modern OSes

• Like any other memory cache, requires an algorithm to choose which entries to evict
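One possible eviction algorithm is least-recently-used (LRU). The sketch below is an assumption for illustration only: the slides do not name a policy, and `BlockCache` and its `read_block` callback are hypothetical.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU cache of disk blocks, keyed by block number."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block    # called on a miss to fetch the block
        self.blocks = OrderedDict()     # oldest entry first, newest last

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)   # mark as most recently used
            return self.blocks[block_no]
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return data

disk_reads = []

def fake_read(block_no):
    disk_reads.append(block_no)                 # record every simulated disk access
    return f"data-{block_no}"

cache = BlockCache(capacity=2, read_block=fake_read)
for n in (1, 2, 1, 3, 2):
    cache.get(n)
print(disk_reads)    # [1, 2, 3, 2]
```

The third access hits the cache; reading block 3 evicts block 2, so the final access to block 2 goes back to disk.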

Journaling

• Consistency check: when mounting filesystem, compare data in directory structure with data blocks in disk, and try to fix inconsistencies

• Can be slow and sometimes fails

• Log structured (or journaled) file system records each metadata update to filesystem as a transaction

• All transactions written to a log

• A transaction is considered committed after it is written to log

Journaling

• Log transactions are asynchronously written to filesystem structures

• A transaction is removed from log after filesystem structures are modified

• If filesystem crashes (e.g., power outage), all remaining transactions in the log must still be performed

• A filesystem may have journal for all data changes (slower, more reliable), or only for metadata (so-called writeback mode, which is faster but less reliable)

Metadata Journaling Example

1. Process attempts to append data to file

2. OS writes to journal the file’s new size

3. OS allocates block for new data

4. OS writes to journal the allocated location

5. OS writes data to allocated block

6. OS clears journal

• If the system crashes during step 1 or step 2, the disk remains unchanged (no inconsistency)

• If it crashes during step 3, the OS will replay starting at step 3 (but the file may contain garbage, since the new data was never written)

• If it crashes during step 4 or 5, the OS will replay starting at step 4 (but the file may contain garbage)

• If it crashes during step 6, the OS will have an out-of-date journal, but the file contents are correct
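The recovery cases above can be sketched as a replay routine. This is a toy model under stated assumptions: the journal and inode are plain dictionaries, `recover` and `allocate_block` are hypothetical names, and only metadata is journaled, so the data block itself is never restored.

```python
def recover(journal, inode, allocate_block):
    """Replay a metadata-only journal after a crash and return the inode."""
    if "new_size" not in journal:
        return inode                   # crash in step 1 or 2: nothing was logged
    block = journal.get("block")
    if block is None:
        block = allocate_block()       # crash in step 3: redo the allocation
    if block not in inode["blocks"]:
        inode["blocks"].append(block)  # crash in step 4 or 5: apply logged location
    inode["size"] = journal["new_size"]
    journal.clear()                    # step 6: the transaction is complete
    return inode

# Crash after the new size was logged but before a block was allocated
inode = {"size": 0, "blocks": []}
journal = {"new_size": 100}
print(recover(journal, inode, allocate_block=lambda: 7))
# {'size': 100, 'blocks': [7]}
```

Replay is idempotent in this model: running it against an inode that already reflects the journal changes nothing, which covers a crash during step 6.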

Snapshots

• Some filesystems can automatically make backups of files prior to modifications

• When a change is made to metadata or to block n of a file, the old data is copied to the snapshot before the filesystem is updated

• Similar to copy-on-write

• Metadata updated to indicate where previous data exists

• Filesystems limit the number of snapshots (limit to percent of disk, or automatically remove snapshots older than certain date)
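A minimal sketch of the copy-before-update idea, assuming a single active snapshot and an in-memory list standing in for the disk. `SnapshotVolume` is a hypothetical class for this example.

```python
class SnapshotVolume:
    """Block store that preserves old block contents once a snapshot is taken."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = None           # block number -> contents at snapshot time

    def take_snapshot(self):
        self.snapshot = {}             # nothing is copied until a block changes

    def write(self, n, data):
        if self.snapshot is not None and n not in self.snapshot:
            self.snapshot[n] = self.blocks[n]   # save old data before the first overwrite
        self.blocks[n] = data

    def read_snapshot(self, n):
        if self.snapshot is None:
            raise RuntimeError("no snapshot has been taken")
        # Unchanged blocks are shared with the live filesystem
        return self.snapshot.get(n, self.blocks[n])

vol = SnapshotVolume(["a", "b", "c"])
vol.take_snapshot()
vol.write(1, "B")
vol.write(1, "BB")
print(vol.blocks[1], vol.read_snapshot(1), vol.read_snapshot(0))   # BB b a
```

Only modified blocks consume extra space, which is why snapshot limits are often expressed as a percentage of the disk.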

Checksums

• Mathematical algorithm used to detect errors within a file

• Also known as a hash

• Example: count the number of 1 bits in a file, and store a 1 if that count is odd, 0 if even (so-called parity bit)

• When filesystem is mounted, for each file, check that sum of all 1 bits in file plus parity bit is an even number; if not even then file is corrupted

• More sophisticated codes can also detect where the error occurred and recover from it (so-called error-correcting codes)
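The parity-bit example can be written out directly. The function names are assumptions for this sketch; the scheme is the one described above.

```python
def count_ones(data: bytes) -> int:
    """Total number of 1 bits in the data."""
    return sum(bin(byte).count("1") for byte in data)

def parity_bit(data: bytes) -> int:
    """1 if the number of 1 bits is odd, 0 if it is even."""
    return count_ones(data) % 2

def is_intact(data: bytes, stored_parity: int) -> bool:
    """The 1 bits plus the parity bit must add up to an even number."""
    return (count_ones(data) + stored_parity) % 2 == 0

original = b"\x07"                    # 0000 0111: three 1 bits
parity = parity_bit(original)         # 1
print(is_intact(original, parity))    # True
print(is_intact(b"\x06", parity))     # False: one flipped bit is detected
print(is_intact(b"\x04", parity))     # True: two flipped bits go unnoticed
```

The last line shows the limitation of a single parity bit: any even number of bit errors cancels out, which is one reason real filesystems use stronger checksums.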

FAT32

• Default filesystem for MS-DOS, Windows (prior to Windows XP), and still used on most USB flash drives

• Uses FAT, resulting in good performance for slower CPUs and small storage overhead

• Maximum size of a single file is 2^32 - 1 bytes

• Each FAT entry refers to one of 2^32 sectors; if each sector is 512 bytes, maximum disk size is 2 TiB
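The two limits follow from 32-bit quantities, as this quick check shows. The constant names are chosen for the example.

```python
SECTOR_SIZE = 512        # bytes per sector, as in the bullet above
GiB, TiB = 2**30, 2**40

max_file_bytes = 2**32 - 1             # file size is stored in a 32-bit field
max_disk_bytes = 2**32 * SECTOR_SIZE   # 32-bit sector numbers times sector size

print(max_file_bytes + 1 == 4 * GiB)   # True: files are just under 4 GiB
print(max_disk_bytes == 2 * TiB)       # True
```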

• No journaling, snapshots, checksums, nor ACLs

EXT4

• Default filesystem for many Linux distributions

• Uses extent allocation

• A single inode can hold up to 4 extents, each being up to 128 MiB

• When a file requires more extents, its inode instead holds the root of an extent tree whose blocks index the extents

• Supports journaling and ACLs

• Journal is checksummed for improved reliability
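The 128 MiB extent limit can be reproduced with a short calculation. It assumes the common 4 KiB block size and a limit of 2^15 blocks per extent; both values are assumptions of this sketch and are not stated in the slides.

```python
BLOCK_SIZE = 4 * 1024            # assumed 4 KiB filesystem block
MAX_BLOCKS_PER_EXTENT = 2**15    # assumed limit of 32768 blocks per extent
EXTENTS_IN_INODE = 4             # from the bullet above
MiB = 2**20

max_extent_bytes = BLOCK_SIZE * MAX_BLOCKS_PER_EXTENT
max_inline_bytes = EXTENTS_IN_INODE * max_extent_bytes

print(max_extent_bytes // MiB)   # 128
print(max_inline_bytes // MiB)   # 512
```

Under these assumptions, a file of up to 512 MiB can be described entirely by the four extents held in its inode, provided each extent is maximally long.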

ZFS

• Default filesystem for Solaris; can also be used in Linux

• Every block of data is checksummed; checksum stored in separate area (not with data block)

• Single ZFS filesystem may span across multiple physical devices

• Maximum file size is 2^64 bytes

• Supports journaling, snapshots, and ACLs

UBIFS

• Designed for SSDs without integrated wear controller

• Tolerates frequent power outages, unreliable flash chips

• Traditional hard disk-based filesystems store metadata (FAT, journal, etc) in same physical place

• UBIFS employs wear-leveling, to keep any one block from reaching its erase-count limit

• Can always change a one bit to zero, but not vice versa without an erase

• Uses write-back buffers, to reduce I/O and also to reduce erasing
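The flash constraints above can be modeled with bitwise operations. This is a simplified sketch: the function names are hypothetical, and choosing the least-erased free block is just one simple wear-leveling policy, not necessarily what UBIFS does.

```python
def program(old: int, new: int) -> int:
    """Programming flash can only clear bits (1 -> 0), so the result is old AND new."""
    return old & new

def needs_erase(old: int, new: int) -> bool:
    """True if writing `new` over `old` would require turning some 0 back into 1."""
    return (old & new) != new

def pick_block(erase_counts, free_blocks):
    """Choose the free block that has been erased the fewest times."""
    return min(free_blocks, key=lambda block: erase_counts[block])

print(needs_erase(0xFF, 0x5A))    # False: an erased byte (all ones) accepts any value
print(needs_erase(0x5A, 0xFF))    # True: zeros cannot be set back to one
print(pick_block({0: 10, 1: 3, 2: 7}, [0, 1, 2]))   # 1
```

Batching small writes in a write-back buffer means fewer program operations, and therefore fewer situations where a block must be erased before it can be reused.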

UBIFS

• Checksums all data blocks, including superblocks, to detect faulty flash sectors

• Whereas EXT4’s journal stays at a fixed location, UBIFS’s journal wanders

• When journal is full, kernel selects a new block and writes the new block’s address to end of current journal

• Background kernel thread commits changes to flash:

• When journal is 80% full

• Or when internal timer expires, flushes write-back buffers
