An Introduction to Disk-Based Linux File Systems

Avishay Traeger
IBM Haifa Research Lab
Internal Storage Course, October 2012, v1.3

Outline
- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling

What does a disk-based file system do?
- Provides structure to the array of bits residing on the disk
- File and directory naming and hierarchy
- File access: open, read, write, seek, close, ...
- Knows how to map <file,offset> to <sector,offset>
- Tracks which sectors are used and which are “free”
- Access control
- Extra features (e.g., improved reliability, snapshots, compression, encryption)

Linux File System Types
- Disk (ext2/3/4, xfs, btrfs, ntfs, vfat, etc.)
- Network (nfs, cifs, afs, ceph, etc.)
- Memory (ramfs, tmpfs, etc.)
- Pseudo (proc, sysfs, etc.)
- Stackable (ecryptfs, etc.)
- Object store (exofs)
- FUSE (Filesystem in USErspace): allows developers to implement file systems in userspace (easier to develop, slower to run)
- ...
Approximately 60 file systems currently in the Linux kernel!

Important Metadata Structures
- Superblock: on-disk metadata for entire file system
  - Block size, pointer to fs root directory, ...
- Inode: on-disk metadata for a single file
  - Inode number (unique ID), owners, timestamps, size, data block pointers, ...
- Dentry: metadata for a directory entry, a single component of a path (not synced to disk)
- File: open file structure (not synced to disk)
- File → Dentry → Inode → Superblock
- Each structure has associated operations that are implemented by each file system
Note: All Linux file system implementations have the above structures in memory, but not all have superblocks and inodes on disk (especially file systems not native to Linux/Unix, like FAT). These must map on-disk structures to those in memory.

Directories
- A directory is simply a special type of file
- Can contain other files, directories, links, etc.
- Each entry has an inode number and name
- The file system knows how to find a file based on its inode number
- What are the basic steps for performing a lookup on file /foo/bar?

Hard Links
- Associate several names with one inode
- When creating a link, increment the inode's reference count (refcount)
- The inode and associated data will only be deleted when the refcount is zero
- Can only be used within a single file system
- Can only point to files. This prevents cycles in the directory tree
- Not supported by all file systems
[Figure: two dentries pointing to the same inode, which points to the data blocks]

Symbolic Links (Symlinks)
- A special file that contains a file name
- When the kernel encounters a symlink during a pathname lookup, it replaces the name of the link with its contents (the name of the target file) and restarts the pathname interpretation
- Can point to files on another file system
- Can point to any type of file (e.g., directory)
- Can become a dangling pointer if the target file is deleted
- Use more inodes than hard links (2 vs. 1)
- Higher overhead than hard links for resolution

Device Files
- In Linux, devices can be accessed via special files, generally found under /dev
- Two main types:
  - Character: stream of bytes (keyboard, serial)
  - Block: random access of blocks (hard disk, CD-ROM)

The Virtual File System (VFS)
When we have so many file systems, we need to ensure that:
- User programs do not need to be file-system-aware
- File systems don't re-implement similar functionality
Solution: The VFS.
A kernel layer that:
- Handles all system calls related to a standard Unix file system (all file systems have the same API)
- Handles generic activities (e.g., caching, readahead)
- Has generic file system “library” functions that can be used by any file system (e.g., fs/libfs.c)
Each specific file system implements a set of functions (operations vectors)
- Object oriented programming in C
[Figure: an application in user space enters the kernel through the system call handler and the VFS, which sits above ext3, isofs, and NFS. Below them are the page cache and the disk, CD-ROM, and network drivers. Also shown: scheduler, memory management, interrupt handler.]

Readahead
- Takes advantage of the page cache
- When a page is read, the VFS code may ask the file system to read the next several contiguous blocks. Hopefully, the next block read by the application will already be loaded into the page cache.
- Performed during:
  - Sequential reads on files
  - Directory reads
- The VFS contains the logic to perform readahead effectively

Example File System Operations
[Figure: an ext3 root file system (/) containing home, etc, …, mnt]

    mount -t xfs /dev/sdb1 /home
[Figure: an xfs file system containing 'avishay', mounted on 'home']
The VFS mount operation:
1) Calls the xfs get_sb function to read the superblock from the partition
2) This function also reads the inode of the root directory
Note that performing a lookup on 'home' would have previously invoked ext3, but now it invokes xfs. Any files/directories in 'home' on ext3 will now be hidden by 'home' on xfs.

    mount -t iso9660 /dev/hdc1 /mnt/cdrom
[Figure: an isofs file system containing 'foo', mounted on 'mnt/cdrom']
A similar sequence of events occurs here, this time mounting an isofs file system (mount type iso9660) on a CD-ROM drive.

[Figure: the same tree, with 'bar' now under 'avishay' on xfs]
Lookup operations will be performed on all 3 file systems.
The copy operation
    cp /mnt/cdrom/foo /home/avishay/bar
will read from 'foo' (isofs) and write to 'bar' (xfs). The VFS determines which file system to invoke.

File System Layout
Some considerations:
- Minimize seeks between metadata and related data
- Minimize number of disk reads required to get to data
- Maximize readahead (sequential access)
- Recovery from disk corruption, power outage, etc.
- Management: fragmentation, compaction, etc.

Contiguous Allocation
- Files are allocated contiguously on the disk
- Space for entire file must be requested in advance
- Search bit map or linked list to locate a space
Pros
- Fast sequential access
- Easy random access
Cons
- External fragmentation
- Hard to grow files: may have to move (large) files
- May need compaction
[Figure: disk layout showing files A, B, C, D, and E]

Linked Files (Alto)
- Each file is a linked list
- File header (like inode) points to first block on disk
- Each block points to the next
Pros
- Can grow files dynamically
- Free list is similar to a file
- No external fragmentation or need to move files
Cons
- Random access is horrible
- Even sequential access needs one seek per block
- Unreliable: losing one block means losing the rest
[Figure: file header pointing to file block 1, chained through to file block N]

File Allocation Table (FAT)
- Table of “next pointers”, indexed by block
- Dentry points to 1st block of file
- Two copies of FAT, at the beginning of the volume
Pros
- Faster random access
- Cache the FAT and traverse it in memory
Cons
- The FAT may be too large to cache, leading to long seeks
- Pointers for all files are interspersed in the FAT
- Need full table in memory, even for one file
Solution: indexed files

Single-Level Indexed Files
- User declares maximum file size
- A file header holds an array of pointers to disk blocks
Pros
- Random access is fast
- Better metadata caching than FAT
Cons
- Clumsy to grow beyond the limit
- Many seeks
[Figure: file header pointing to disk blocks]

ext2: Block Groups
[Figure: disk layout: boot block, then block group 0, block group 1, ..., block group n. Each block group holds: super block, group descriptors, data block bitmap, inode bitmap, inode table, data blocks.]
Improved reliability
- Control structures are replicated
- Easy to recover the superblock
Improved performance
- Reduces the distance between the inodes and related data blocks
- It is possible to reduce the disk head seeks during I/O on files

ext2: Multi-Level Indexed Files
The inode contains 15 pointers:
- 12 direct pointers
- 13: 1-level indirect
- 14: 2-level indirect
- 15: 3-level indirect
Pros & Cons
- Favors small files
- Can grow
- Lots of seeking (somewhat limited by block groups)
ext3: same on-disk format plus journal (covered later)
[Figure: inode whose pointers 1, 2, ... lead directly to data blocks, while pointers 13, 14, and 15 lead to data blocks through indirect blocks]

ext4/xfs/Btrfs: Extents & Trees
- Extent: set of logically contiguous blocks within a file that are stored contiguously on disk
- Single ext4 extent: up to 128MB with 4KB block size
- Less metadata. Only need to remember: <1st logical block, # blocks, 1st physical block>
- xfs and Btrfs store extents in B-tree variants
- These are newer and very interesting Linux disk-based file systems and have become more “standard”

Log-structured File System
- Will be covered separately tomorrow

File System Corruption
- Some FS operations require multiple writes, which may not all complete (power failure, crash)
- The on-disk state will be invalid on next mount
- Example: To write to a file, 3 main operations:
  1. Write data to disk block
  2. Update the free space map
  3. Update pointer from inode to block
- With no help, detecting and recovering from errors requires examining all data structures
- In Linux, this is done by fsck (file system check)
- This was acceptable in the past, but takes too long for larger file systems

Journaling
- Journal: a special file that logs the changes destined for the file system in a circular buffer
- Idea: use a journal to log changes before they're committed to the file system to avoid metadata corruption
- Examples: JFS/JFS2, ext3/4, XFS, ReiserFS

ext3 Journaling Modes
- Writeback: Only metadata is journaled. Data is written independently. Preserves file system structure and avoids corruption, but files may contain stale data (like ext2 + fast fsck).
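
The idea of logging metadata changes before applying them can be illustrated with a toy model. The sketch below is not ext3's implementation or on-disk format. It assumes a hypothetical in-memory "disk" (ToyDisk) with two made-up metadata locations, and the helper names log_transaction, apply_in_place, and recover are invented for illustration. As in writeback mode, only metadata is logged. The journal here is a plain list rather than a circular buffer.

```python
class Crash(Exception):
    """Simulated power failure."""


class ToyDisk:
    """Hypothetical in-memory 'disk': metadata home locations plus a journal."""

    def __init__(self):
        # Home locations of two metadata items (names are made up).
        self.meta = {"bitmap[7]": 0, "inode[3].ptr": None}
        # Journal area; a real journal is a fixed-size circular buffer.
        self.journal = []


def log_transaction(disk, txid, updates, commit=True):
    """Append the intended updates to the journal, then a commit record."""
    for key, value in updates:
        disk.journal.append(("update", txid, key, value))
    if commit:
        disk.journal.append(("commit", txid))


def apply_in_place(disk, updates, crash_after=None):
    """Write updates to their home locations; optionally crash part-way."""
    for done, (key, value) in enumerate(updates):
        if done == crash_after:
            raise Crash(f"power lost after {done} in-place write(s)")
        disk.meta[key] = value


def recover(disk):
    """Replay committed transactions in log order; drop uncommitted ones."""
    committed = {rec[1] for rec in disk.journal if rec[0] == "commit"}
    for rec in disk.journal:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, value = rec
            disk.meta[key] = value
    disk.journal.clear()


# Allocate block 7 to inode 3: mark it used, then point the inode at it.
disk = ToyDisk()
updates = [("bitmap[7]", 1), ("inode[3].ptr", 7)]
log_transaction(disk, txid=1, updates=updates)
try:
    apply_in_place(disk, updates, crash_after=1)
except Crash:
    pass
torn = dict(disk.meta)  # block marked used, but no inode points at it
recover(disk)           # replay makes both updates visible
print("after crash:  ", torn)
print("after recover:", disk.meta)
```

Recovery only has to scan the journal, not every on-disk structure, which is why it is much faster than a full fsck. Replaying is safe to repeat in this sketch because each record stores the final value rather than a delta.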
