An Introduction to Disk-Based Linux File Systems
Avishay Traeger
IBM Haifa Research Lab Internal Storage Course, October 2012, v1.3

Outline
  The Basics
  The Virtual File System (VFS)
  File System Layout
  Journaling

What does a disk-based file system do?
  Provides structure to the array of bits residing on the disk
  File and directory naming and hierarchy
  File access: open, read, write, seek, close, ...
  Knows how to map
  Extra features (e.g., improved reliability, snapshots, compression, encryption)

Linux File System Types
  Disk (ext2/3/4, xfs, btrfs, ntfs, vfat, etc.)
  Network (nfs, cifs, afs, ceph, etc.)
  Memory (ramfs, tmpfs, etc.)
  Pseudo (proc, sysfs, etc.)
  Stackable (ecryptfs, etc.)
  Object store (exofs)
  FUSE (Filesystem in USErspace): allows developers to implement file systems in userspace (easier to develop, slower to run)
  ...
  Approximately 60 file systems currently in the Linux kernel!

Important Metadata Structures
  Superblock: on-disk metadata for the entire file system
    Block size, pointer to fs root directory, ...
  Inode: on-disk metadata for a single file
    Inode number (unique ID), owners, timestamps, size, data block pointers, ...
  Dentry: metadata for a directory entry, a single component of a path (not synced to disk)
  File: open file structure (not synced to disk)
  File → Dentry → Inode → Superblock
  Each structure has associated operations that are implemented by each file system
  Note: All Linux file system implementations have the above structures in memory, but not all have superblocks and inodes on disk (especially file systems not native to Linux/Unix, like FAT). Such file systems must map their on-disk structures to the in-memory ones.

Directories
  A directory is simply a special type of file
    Can contain other files, directories, links, etc.
    Each entry has an inode number and name
  The file system knows how to find a file based on its inode number
  What are the basic steps for performing a lookup on file /foo/bar?
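
One possible answer to the lookup question, as a minimal sketch: start at the root directory's inode and, for each path component, search the current directory's entries for the name and continue from the inode number found there. The inode table, its contents, and the name `lookup` are made up for illustration. The root inode number 2 follows the ext2 convention and is an assumption for other file systems. A real lookup also involves the dentry cache, mount points, symlinks, and permission checks, none of which are modeled here.

```python
# Hypothetical in-memory "disk": inode number -> inode.
# A directory inode maps entry names to inode numbers; a file inode holds data.
ROOT_INO = 2  # assumption: ext2 uses inode 2 for the root directory

INODES = {
    2: {"type": "dir", "entries": {"foo": 11, "etc": 12}},
    11: {"type": "dir", "entries": {"bar": 21}},
    12: {"type": "dir", "entries": {}},
    21: {"type": "file", "data": b"hello"},
}


def lookup(path):
    """Resolve an absolute path to an inode number, one component at a time."""
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    ino = ROOT_INO
    for name in (c for c in path.split("/") if c):
        inode = INODES[ino]                 # "read" the current inode
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in inode["entries"]:    # search the directory's entries
            raise FileNotFoundError(path)
        ino = inode["entries"][name]        # continue from the entry's inode
    return ino
```

For /foo/bar this reads the root inode, finds 'foo' in the root directory, reads the inode of 'foo', finds 'bar' in that directory, and returns the inode number of 'bar'.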

Hard Links
  Associate several names with one inode
  When creating a link, increment the inode's reference count (refcount)
  The inode and associated data will only be deleted when the refcount is zero
  Can only be used within a single file system
  Can only point to files. This prevents cycles in the directory tree
  Not supported by all file systems
  [Figure: two dentries pointing to one inode, which points to the data blocks]

Symbolic Links (Symlinks)
  A special file that contains a file name
  When the kernel encounters a symlink during a pathname lookup, it replaces the name of the link with its contents (the name of the target file) and restarts the pathname interpretation
  Can point to files on another file system
  Can point to any type of file (e.g., directory)
  Can become a dangling pointer if the target file is deleted
  Use more inodes than hard links (2 vs. 1)
  Higher overhead than hard links for resolution
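
The differences between the two link types can be observed from user space. The sketch below assumes a POSIX system and a temporary directory on a file system that supports both hard links and symlinks. The function name `link_demo` and the file names are made up for illustration. It uses only Python's standard `os` and `tempfile` modules.

```python
import os
import tempfile


def link_demo():
    """Create a file, a hard link, and a symlink; report what stat shows."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        hard = os.path.join(d, "hard")
        soft = os.path.join(d, "soft")
        with open(target, "w") as f:
            f.write("data")

        os.link(target, hard)     # a second name for the same inode
        os.symlink(target, soft)  # a new inode whose contents are a path

        result = {
            "same_inode": os.stat(target).st_ino == os.stat(hard).st_ino,
            "nlink_after_link": os.stat(target).st_nlink,
            # lstat examines the symlink itself rather than its target
            "symlink_has_own_inode":
                os.lstat(soft).st_ino != os.stat(target).st_ino,
        }

        os.unlink(target)  # removes one name; the data survives via 'hard'
        result["nlink_after_unlink"] = os.stat(hard).st_nlink
        with open(hard) as f:
            result["data_via_hard_link"] = f.read()
        # The symlink still exists but its target name is gone.
        result["dangling"] = os.path.islink(soft) and not os.path.exists(soft)
        return result
```

The link count rises when the hard link is created and falls when one name is removed, while the symlink is left dangling once the name it stores no longer exists.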

Device Files
  In Linux, devices can be accessed via special files, generally found under /dev
  Two main types:
    Character: stream of bytes (keyboard, serial)
    Block: random access of blocks (hard disk, CD-ROM)

The Virtual File System (VFS)
  When we have so many file systems, we need to ensure that:
    User programs do not need to be file-system-aware
    File systems don't re-implement similar functionality
  Solution: the VFS, a kernel layer that:
    Handles all system calls related to a standard Unix file system (all file systems have the same API)
    Handles generic activities (e.g., caching, readahead)
    Has generic file system “library” functions that can be used by any file system (e.g., fs/libfs.c)
  Each specific file system implements a set of functions (operations vectors)
    Object-oriented programming in C
  [Figure: VFS architecture. Labels: Application (user space); in the kernel: system call handler, VFS, ext3, isofs, NFS, page cache, scheduler, memory management, interrupt handler, and drivers (disk, CD-ROM, network)]
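
The operations-vector idea from the VFS slides can be sketched as a table of function references that each file system fills in and that the generic layer calls through. This is a Python analogue and not kernel code. The names `FileOperations`, `File`, `vfs_read`, and the two toy file systems are made up for illustration, loosely modeled on the way the kernel's C structures hold function pointers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileOperations:
    """Analogue of an operations vector: a table of function pointers."""
    read: Callable[["File", int], bytes]


@dataclass
class File:
    f_op: FileOperations  # filled in by whichever file system owns the file
    data: bytes
    pos: int = 0


def plain_read(file, count):
    """Toy file system 1: return the stored bytes unchanged."""
    chunk = file.data[file.pos:file.pos + count]
    file.pos += len(chunk)
    return chunk


def upper_read(file, count):
    """Toy file system 2: transform the data on the way out."""
    return plain_read(file, count).upper()


PLAINFS_FOPS = FileOperations(read=plain_read)
UPPERFS_FOPS = FileOperations(read=upper_read)


def vfs_read(file, count):
    """Generic layer: knows nothing about the file system, only dispatches."""
    return file.f_op.read(file, count)
```

The caller of `vfs_read` never names a file system. Which implementation runs depends only on the operations table attached to the open file, which is the sense in which this is object-oriented programming without language support.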

Readahead
  Takes advantage of the page cache
  When a page is read, the VFS code may ask the file system to read the next several contiguous blocks
    Ideally, the next block read by the application will already be loaded into the page cache
  Performed during:
    Sequential reads on files
    Directory reads
  The VFS contains the logic to perform readahead effectively

Example File System Operations
  [Figure: an ext3 root file system; / contains home, etc, …, mnt]

  mount -t xfs /dev/sdb1 /home
  [Figure: the same tree after the mount; /home is now an xfs file system containing avishay]
  The VFS mount operation:
    1) Calls the xfs get_sb function to read the superblock from the partition
    2) This function also reads the inode of the root directory
  Note that performing a lookup on 'home' would previously have invoked ext3, but now invokes xfs. Any files/directories in 'home' on ext3 are now hidden by 'home' on xfs.
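
The effect of the mount on lookups can be sketched with a table from mount points to file systems, where the deepest matching mount point decides which file system handles a path. The class `MountTable` and its methods are made up for illustration. The kernel does not match path prefixes as strings; it notices mount points while walking the path component by component. The prefix match below is a simplifying assumption that gives the same answer for these examples.

```python
class MountTable:
    """Maps mount points to file system names; the deepest match wins."""

    def __init__(self, root_fs):
        self.mounts = {"/": root_fs}

    def mount(self, fs, mount_point):
        # Whatever the covered directory held before is now hidden.
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def resolve(self, path):
        """Return (file system, path relative to that file system's root)."""
        best = "/"
        for mp in self.mounts:
            if mp == "/":
                continue
            # Match whole components only: /home must not match /homework.
            if (path == mp or path.startswith(mp + "/")) and len(mp) > len(best):
                best = mp
        rest = path if best == "/" else path[len(best):]
        return self.mounts[best], rest or "/"
```

Before the mount, a lookup of /home/avishay resolves to ext3. After `mount("xfs", "/home")`, the same path resolves to xfs, with the remaining path interpreted relative to the root directory of the xfs file system.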

  mount -t xfs /dev/sdb1 /home
  mount -t isofs /dev/hdc1 /mnt/cdrom
  [Figure: the same tree; /mnt/cdrom is now an isofs file system containing foo]
  A similar sequence of events occurs here, this time mounting an isofs file system on a CD-ROM drive.

  mount -t xfs /dev/sdb1 /home
  mount -t isofs /dev/hdc1 /mnt/cdrom
  cp /mnt/cdrom/foo /home/avishay/bar
  [Figure: the same tree; bar now exists under /home/avishay on xfs]
  Lookup operations will be performed on all 3 file systems. The copy operation will read from 'foo' (isofs) and write to 'bar' (xfs). The VFS determines which file system to invoke.

File System Layout
  Some considerations:
    Minimize seeks between metadata and related data
    Minimize number of disk reads required to get to data
    Maximize readahead (sequential access)
    Recovery from disk corruption, power outage, etc.
    Management: fragmentation, compaction, etc.

Contiguous Allocation
  Files are allocated contiguously on the disk
  Space for entire file must be requested in advance
  Search bit map or linked list to locate a space
  Pros
    Fast sequential access
    Easy random access
  Cons
    External fragmentation
    Hard to grow files: may have to move (large) files
    May need compaction
  [Figure: files A, B, C, and D on disk, and a file E]
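
The bitmap search and the external fragmentation problem can be sketched in a few lines. The first-fit policy and the function names are assumptions made for illustration. The slide only says that a bitmap or linked list is searched for a space.

```python
def first_fit(bitmap, nblocks):
    """Return the start of the first run of nblocks free blocks, or None.

    bitmap[i] is True when block i is in use. nblocks must be at least 1.
    """
    run_start, run_len = None, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = None, 0  # the run of free blocks is broken
            continue
        if run_start is None:
            run_start = i
        run_len += 1
        if run_len == nblocks:
            return run_start
    return None


def allocate(bitmap, nblocks):
    """Mark a contiguous run as used and return its start, or None."""
    start = first_fit(bitmap, nblocks)
    if start is not None:
        for i in range(start, start + nblocks):
            bitmap[i] = True
    return start


def release(bitmap, start, nblocks):
    """Mark a previously allocated run as free again."""
    for i in range(start, start + nblocks):
        bitmap[i] = False
```

On a 10-block disk, allocating three 3-block files and then releasing the first and third leaves 7 free blocks, yet a request for 5 contiguous blocks fails because the longest free run is 4 blocks. Satisfying it would require moving the middle file, which is the compaction mentioned under Cons.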

Linked Files (Alto)
  Each file is a linked list
  File header (like inode) points to first block on disk
  Each block points to the next
  Pros
    Can grow files dynamically
    Free list is similar to a file
    No external fragmentation or need to move files
  Cons
    Random access is horrible
    Even sequential access needs one seek per block
    Unreliable: losing one block means losing the rest
  [Figure: file header pointing to file block 1, chained through to file block N]
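
The cost of random access in a linked file can be made concrete by counting block reads. In this sketch the "disk" is a dictionary from block number to a block holding a next pointer. The layout, the sentinel `END`, and the function name are made up for illustration.

```python
END = -1  # assumption: sentinel meaning "no next block"


def nth_block(disk, first_block, n):
    """Find the physical block holding a file's n-th block (0-based).

    Returns (physical block number, number of block reads needed).
    """
    block, reads = first_block, 0
    for _ in range(n):
        reads += 1                   # the pointer lives inside the block,
        block = disk[block]["next"]  # so the block must be read to follow it
        if block == END:
            raise IndexError("file has fewer blocks than requested")
    return block, reads


# A three-block file scattered over the disk: 7 -> 2 -> 9
DISK = {
    7: {"next": 2},
    2: {"next": 9},
    9: {"next": END},
}
```

Reaching block n costs n reads of earlier blocks, each potentially a seek. If any block in the chain is lost, its next pointer is lost with it, so every later block becomes unreachable.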

File Allocation Table (FAT)
  Table of “next pointers”, indexed by block
  Dentry points to 1st block of file
  Two copies of FAT, at the beginning of the volume
  Pros
    Faster random access
    Cache the FAT and traverse it in memory
  Cons
    The FAT may be too large to cache, causing long seeks
    Pointers for all files are interspersed in the FAT
    Need full table in memory, even for one file
  Solution: indexed files
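
Moving the next pointers out of the data blocks and into one table means a chain can be followed without reading any data block. A minimal sketch, with the table as a Python list indexed by block number. The markers `EOF` and `FREE` are assumptions for illustration; the real FAT formats reserve specific entry values for these purposes.

```python
EOF = -1     # assumption: marks the last block of a file
FREE = None  # assumption: marks an unallocated block


def chain(fat, first_block):
    """Follow next pointers through the table; no data blocks are read."""
    blocks, seen = [], set()
    b = first_block
    while b != EOF:
        if b is FREE or b in seen:
            raise ValueError("corrupt chain")  # free entry or a loop
        seen.add(b)
        blocks.append(b)
        b = fat[b]
    return blocks


def block_for_offset(fat, first_block, offset, block_size):
    """Physical block that holds the given byte offset of the file."""
    return chain(fat, first_block)[offset // block_size]


# One file occupying blocks 7 -> 2 -> 9 -> 4 -> 5; all other blocks are free.
FAT = [FREE, FREE, 9, FREE, 5, EOF, FREE, 2, FREE, 4]
```

Random access still walks the chain, but the walk happens in the cached table rather than on disk. The table has one entry per block of the whole volume regardless of which files are open, which is the caching problem listed under Cons.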

Single-Level Indexed Files
  User declares maximum file size
  A file header holds an array of pointers to disk blocks
  Pros
    Random access is fast
    Better metadata caching than FAT
  Cons
    Clumsy to grow beyond the limit
    Many seeks
  [Figure: file header with pointers to disk blocks]

ext2: Block Groups
  [Figure: disk layout: boot block, then block group 0, block group 1, ..., block group n. Each block group holds: super block, group descriptors, data block bitmap, inode bitmap, inode table, data blocks]
  Improved reliability
    Control structures are replicated
    Easy to recover the superblock
  Improved performance
    Reduces the distance between the inodes and related data blocks
    It is possible to reduce the disk head seeks during I/O on files
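
Because every block group has its own inode table, an inode number alone is enough to tell which group to look in. A minimal sketch of that arithmetic follows. The function name is made up, and the inodes-per-group value in the usage note is an assumption; a real file system records it in the superblock.

```python
def locate_inode(ino, inodes_per_group):
    """Return (block group, index within that group's inode table).

    ext2 numbers inodes starting from 1, hence the subtraction.
    """
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    return divmod(ino - 1, inodes_per_group)
```

With an assumed 8192 inodes per group, inode 2 is entry 1 of group 0 and inode 8193 is entry 0 of group 1.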

ext2: Multi-Level Indexed Files
  The inode contains 15 pointers:
    12 direct pointers
    13: 1-level indirect
    14: 2-level indirect
    15: 3-level indirect
  Pros & Cons
    In favor of small files
    Can grow
    Lots of seeking (somewhat limited by block groups)
  [Figure: inode with pointers 1, 2, ..., 13, 14, 15 leading to data blocks, directly or through indirect blocks]
  ext3: same on-disk format plus journal (covered later)

ext4/xfs/Btrfs: Extents & Trees
  Extent: set of logically contiguous blocks within a file that are stored contiguously on disk
    Single ext4 extent: up to 128MB with 4KB block size
    Less metadata. Only need to remember: <1st logical block, # blocks, 1st physical block>
    xfs and Btrfs store extents in B-tree variants
  These are newer and very interesting Linux disk-based file systems and have become more “standard”
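
The two mapping schemes from the last two slides can be compared in a short sketch. The first function reports how many indirect blocks must be read to reach a logical block under the 15-pointer scheme, assuming 4 KB blocks and 4-byte block pointers (so 1024 pointers per indirect block). The second maps a logical block through a sorted list of <1st logical block, # blocks, 1st physical block> triples. All names and the sample extents are made up for illustration, and the flat sorted list stands in for the tree structures real file systems use.

```python
from bisect import bisect_right

BLOCK_SIZE = 4096                 # assumption: 4 KB blocks
PTRS_PER_BLOCK = BLOCK_SIZE // 4  # assumption: 4-byte block pointers
N_DIRECT = 12                     # from the slide: 12 direct pointers


def indirection_level(lblock):
    """Number of indirect blocks read to reach logical block lblock."""
    if lblock < 0:
        raise ValueError("negative block number")
    if lblock < N_DIRECT:
        return 0
    lblock -= N_DIRECT
    for level in (1, 2, 3):
        span = PTRS_PER_BLOCK ** level  # blocks reachable at this level
        if lblock < span:
            return level
        lblock -= span
    raise ValueError("beyond the triple-indirect range")


def extent_lookup(extents, lblock):
    """Map a logical block to a physical block, or None for a hole.

    extents is a list of (first_logical, n_blocks, first_physical),
    sorted by first_logical and non-overlapping.
    """
    i = bisect_right([e[0] for e in extents], lblock) - 1
    if i >= 0:
        first_logical, n_blocks, first_physical = extents[i]
        if lblock < first_logical + n_blocks:
            return first_physical + (lblock - first_logical)
    return None


# Two extents describe 150 blocks that would need 150 pointers when indexed.
EXTENTS = [(0, 100, 5000), (100, 50, 9000)]
```

Under these assumptions the first 12 blocks (48 KB) need no indirect reads, which is why the indexed scheme favors small files. The 128 MB extent limit quoted on the slide corresponds to 32768 blocks of 4 KB.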

Log-structured File System
  Will be covered separately tomorrow

File System Corruption
  Some FS operations require multiple writes, which may not all complete (power failure, crash)
    The on-disk state will be invalid on next mount
  Example: to write to a file, 3 main operations:
    1. Write data to disk block
    2. Update the free space map
    3. Update pointer from inode to block
  With no help, detecting and recovering from errors requires examining all data structures
    In Linux, this is done by fsck (file system check)
    This was acceptable in the past, but takes too long for larger file systems

Journaling
  Journal: a special file that logs the changes destined for the file system in a circular buffer
  Idea: use a journal to log changes before they're committed to the file system to avoid metadata corruption
  Examples: JFS/JFS2, ext3/4, XFS, ReiserFS

ext3 Journaling Modes
  Writeback: Only metadata is journaled. Data is written independently. Preserves file system structure and avoids corruption, but files may contain stale data (like ext2 + fast fsck).
  Ordered (default): Data is written to disk before metadata transactions commit → no stale data blocks.
  Journal: Journals all data and metadata, so data is written twice (same consistency guarantees as 'ordered', different performance).
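
The log-before-commit idea can be sketched with the three-step write from the File System Corruption slide. This is a toy model and not ext3's journal: the class `JournaledStore`, its methods, and the key names are made up, the journal is an unbounded list rather than a circular buffer, and every update is logged, which is closest to 'journal' mode.

```python
class JournaledStore:
    """Toy write-ahead journal over a dictionary standing in for the disk."""

    def __init__(self):
        self.disk = {}     # "home" locations of data and metadata
        self.journal = []  # log records, in the order they were written

    def log(self, txid, updates):
        """Record intended changes in the journal; the disk is untouched."""
        for key, value in updates.items():
            self.journal.append(("update", txid, key, value))

    def commit(self, txid):
        """A commit record marks the transaction as complete in the log."""
        self.journal.append(("commit", txid))

    def apply_in_place(self, txid, crash_after=None):
        """Copy logged updates home; optionally stop early to mimic a crash."""
        done = 0
        for kind, tid, *rest in self.journal:
            if kind != "update" or tid != txid:
                continue
            if crash_after is not None and done == crash_after:
                return False  # simulated power failure
            key, value = rest
            self.disk[key] = value
            done += 1
        return True

    def recover(self):
        """On the next mount: replay committed transactions, drop the rest."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for kind, tid, *rest in self.journal:
            if kind == "update" and tid in committed:
                key, value = rest
                self.disk[key] = value
        self.journal.clear()


# The three writes that make up one file write in the earlier example.
WRITE = {
    "data_block[17]": b"new data",
    "free_map[17]": "in use",
    "inode[5].block_ptr": 17,
}
```

If the crash happens after the commit record is written but before all three updates reach their home locations, `recover` replays the log and the disk ends up with all three. If the crash happens before the commit record, `recover` discards the partial transaction and the disk keeps its old, consistent state. Either way, recovery reads only the journal instead of examining every structure as fsck does.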

References & Further Reading
  References in this presentation refer to Linux 2.6.35
    http://lxr.linux.no/#linux+v2.6.35/
  Further reading
    Linux Kernel Development (Love): Good for overview (3rd edition recently published)
      2nd edition: http://linuxkernel2.atw.hu/ (hopefully posted with the author's permission...)
    Understanding the Linux Kernel (Bovet & Cesati): Good for reference
    btrfs: http://lwn.net/Articles/342892/
  Some of the content in these slides is taken from:
    http://www.cs.princeton.edu/courses/archive/fall09/cos318/lectures/FileLayout.pdf
    http://www.ntfs.com/fat-allocation.htm
    http://www.ibm.com/developerworks/library/l-journaling-filesystems/index.html
    Tel-Aviv University advanced storage course slides by Ronen Kat and Ohad Rodeh
    Various Wikipedia articles
    http://static.usenix.org/event/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html