An Introduction to Disk-Based File Systems

Avishay Traeger

IBM Haifa Research Lab Internal Storage Course, October 2012, v1.3

Outline

- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling

What does a disk-based file system do?

- Provides structure to the array of bits residing on the disk
- File and directory naming and hierarchy
- File access: open, read, write, seek, close, ...
- Knows how to map file contents to locations on disk
- Tracks which sectors are used and which are “free”
- Access control
- Extra features (e.g., improved reliability, snapshots, compression, encryption)

Linux File System Types

- Disk (ext2/3/4, vfat, etc.)
- Network (nfs, cifs, afs, etc.)
- Memory (ramfs, etc.)
- Pseudo (proc, etc.)
- Stackable
- Object store (exofs)
- FUSE (Filesystem in Userspace): allows developers to implement file systems in userspace (easier to develop, slower to run)
- ...
- Approximately 60 file systems currently in the Linux kernel!

Important Metadata Structures

- Superblock: on-disk metadata for the entire file system
  - Size, pointer to the fs root, ...
- Inode: on-disk metadata for a single file
  - Inode number (unique ID), owners, timestamps, size, data block pointers, ...
- Dentry: metadata for a directory entry, a single component of a path (not synced to disk)
- File: open file structure (not synced to disk)
- File → Dentry → Inode → Superblock
- Each structure has associated operations that are implemented by each file system

Note: All Linux file system implementations have the above structures in memory, but not all have superblocks and inodes on disk (especially file systems not native to Linux, like FAT). These must map on-disk structures to those in memory.

Directories

- A directory is simply a special type of file
- Can contain other files, directories, links, etc.
- Each entry has an inode number and name
- The file system knows how to find a file based on its inode number
- What are the basic steps for performing a lookup on file /foo/bar?
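One possible answer to the lookup question is sketched below. This is a toy model, not kernel code: the `INODES` table, the inode numbers, and the `lookup` helper are all invented for illustration, and a directory is modelled as a mapping from entry name to inode number.

```python
# Toy model: inode number -> inode. Directory inodes hold {name: inode number}.
# All inode numbers and names here are illustrative assumptions.
ROOT_INO = 2

INODES = {
    2: {"type": "dir", "entries": {"foo": 11}},
    11: {"type": "dir", "entries": {"bar": 12}},
    12: {"type": "file", "data": b"hello"},
}


def lookup(path):
    """Resolve an absolute path to an inode number, one component at a time."""
    ino = ROOT_INO  # start at the root directory's inode
    for name in [part for part in path.split("/") if part]:
        inode = INODES[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        ino = inode["entries"][name]  # follow the entry to the next inode
    return ino


print(lookup("/foo/bar"))
```

Each loop iteration reads one directory and follows one entry, so /foo/bar takes two directory reads (the root, then foo) before the inode of bar is known.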

Hard Links

- Associate several names with one inode
- When creating a link, increment the inode's reference count (refcount)
- The inode and associated data will only be deleted when the refcount is zero
- Can only be used within a single file system
- Can only point to files. This prevents cycles in the directory hierarchy
- Not supported by all file systems

[Figure: two dentries pointing to one inode, which points to its data blocks]

Symbolic Links (Symlinks)

- A special file that contains a file name
- When the kernel encounters a symlink during a pathname lookup, it replaces the name of the link with its contents (the name of the target file) and restarts the pathname interpretation
- Can point to files on another file system
- Can point to any type of file (e.g., directory)
- Can become a dangling pointer if the target file is deleted
- Use more inodes than hard links (2 vs. 1)
- Higher overhead than hard links for resolution
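The difference between the two link types can be observed from user space. The sketch below uses Python's standard `os` module in a temporary directory; the file names are arbitrary, and it assumes a POSIX system whose temporary directory supports both hard links and symlinks.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "target")
    hard = os.path.join(d, "hard")
    soft = os.path.join(d, "soft")
    with open(target, "w") as f:
        f.write("data")

    os.link(target, hard)      # second name for the same inode
    os.symlink(target, soft)   # new file whose content is the target's name

    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
    nlink_after_link = os.stat(target).st_nlink
    symlink_own_inode = os.lstat(soft).st_ino != os.stat(target).st_ino

    os.unlink(target)          # drop one name; the inode survives via "hard"
    with open(hard) as f:
        hard_still_readable = f.read() == "data"
    symlink_dangling = os.path.lexists(soft) and not os.path.exists(soft)

print(same_inode, nlink_after_link, symlink_own_inode,
      hard_still_readable, symlink_dangling)
```

On a typical Linux file system the link count should read 2 after `os.link`, the data should stay readable through the hard link once the original name is removed, and the symlink should be left dangling.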

Device Files

- In Linux, devices can be accessed via special files, generally found under /dev
- Two main types:
  - Character: accessed as a stream (keyboard, serial)
  - Block: random access of blocks (hard disk, CD-ROM)

Outline

- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling

The Virtual File System (VFS)

- When we have so many file systems, we need to ensure that:
  - User programs do not need to be file-system-aware
  - File systems don't re-implement similar functionality
- Solution: The VFS. A kernel layer that:
  - Handles all file-related system calls through a standard interface (all file systems have the same API)
  - Handles generic activities (e.g., caching)
  - Has generic file system “library” functions that can be used by any file system (e.g., fs/libfs.c)
- Each specific file system implements a set of functions (operations vectors)
  - Object oriented programming in C
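The operations-vector idea can be illustrated with a small sketch. It is written in Python for brevity, whereas the kernel uses C structs of function pointers; the file system names (`ramfs`, `upperfs`) and every function here are hypothetical and only show the dispatch pattern.

```python
# Each "file system" supplies its own table of operations (hypothetical names).
def ramfs_read(inode):
    return inode["data"]


def upperfs_read(inode):
    return inode["data"].upper()


RAMFS_OPS = {"read": ramfs_read}
UPPERFS_OPS = {"read": upperfs_read}


def vfs_read(inode):
    """Generic layer: dispatch through the inode's operations vector."""
    return inode["ops"]["read"](inode)


a = {"data": "hello", "ops": RAMFS_OPS}
b = {"data": "hello", "ops": UPPERFS_OPS}
print(vfs_read(a), vfs_read(b))
```

The caller of `vfs_read` never names a file system; the object it is handed carries the table that decides which implementation runs.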

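[Figure: VFS architecture. An application in user space calls into the kernel, where the VFS sits above the individual file systems (e.g., isofs, NFS) and alongside the scheduler, memory management, and the cache; the file systems reach hardware through drivers (disk, CD-ROM, network).]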

Readahead

- Takes advantage of the page cache
- When a page is read, the VFS code may ask the file system to read the next several contiguous blocks
- Hopefully, the next block read by the application will already be loaded into the page cache
- Performed during:
  - Sequential reads on files
  - Directory reads
- The VFS contains the logic to perform readahead effectively

Example File System Operations

[Figure: an ext3 root file system "/" containing the directories home, etc, ..., mnt]

mount -t xfs /dev/sdb1 /home

[Figure: the same tree, with an xfs file system mounted on /home; it contains the directory avishay]

The VFS mount operation:
1) Calls the xfs get_sb function to read the superblock from the partition
2) This function also reads the inode of the root directory

Note that performing a lookup on 'home' would have previously invoked ext3, but now it is xfs. Any files/directories in 'home' on ext3 will now be hidden by 'home' on xfs.
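The hiding effect of a mount can be sketched with a toy mount table. This is a simplification: the kernel resolves mount points while walking the path one component at a time, whereas the hypothetical `fs_for` helper below picks the longest matching mount point by string prefix.

```python
# Hypothetical mount table: mount point -> file system type.
mounts = {"/": "ext3"}


def fs_for(path):
    """Return the file system whose mount point is the longest prefix of path."""
    best = "/"
    for mp in mounts:
        prefix = mp.rstrip("/") + "/"
        if (path == mp or path.startswith(prefix)) and len(mp) > len(best):
            best = mp
    return mounts[best]


before = fs_for("/home/avishay")
mounts["/home"] = "xfs"          # models: mount -t xfs /dev/sdb1 /home
after = fs_for("/home/avishay")
print(before, after, fs_for("/etc"))
```

Before the mount, everything under /home is served by ext3; afterwards the same path is routed to xfs, while /etc is unaffected.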

mount -t isofs /dev/hdc1 /mnt/cdrom

[Figure: the tree with xfs mounted on /home (containing avishay) and isofs mounted on /mnt/cdrom (containing foo)]

A similar sequence of events occurs here, this time mounting an isofs file system on a CD-ROM drive.

cp /mnt/cdrom/foo /home/avishay/bar

[Figure: the same tree, with the new file bar under /home/avishay]

Lookup operations will be performed on all 3 file systems. The copy operation will read from 'foo' (isofs) and write to 'bar' (xfs). The VFS determines which file system to invoke.

Outline

- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling

File System Layout

- Some considerations:
  - Minimize seeks between metadata and related data
  - Minimize number of disk reads required to get to data
  - Maximize readahead (sequential access)
  - Recovery from disk corruption, power outage, etc.
  - Management: fragmentation, compaction, etc.

Contiguous Allocation

- Files are allocated contiguously on the disk
- Space for the entire file must be requested in advance
- Search a bitmap or linked list to locate a space
- Pros
  - Fast sequential access
  - Easy random access
- Cons
  - External fragmentation
  - Hard to grow files: may have to move (large) files
  - May need compaction

[Figure: files A, B, C, and D stored contiguously on disk, with a new file E to be placed]
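External fragmentation is easy to reproduce with a bitmap search. The sketch below assumes a first-fit policy and a made-up ten-block disk; `first_fit` is an illustrative helper, not taken from any real file system.

```python
def first_fit(bitmap, nblocks):
    """Return the start of the first run of nblocks free entries, or None."""
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0  # a used block breaks the run
        else:
            run_len += 1
            if run_len == nblocks:
                return run_start
    return None


# True = used, False = free. Ten blocks, four of them free but scattered.
bitmap = [True, True, False, False, True, True, True, False, False, True]
free_total = bitmap.count(False)
print(free_total, first_fit(bitmap, 2), first_fit(bitmap, 3))
```

Four blocks are free in total, yet a three-block file cannot be placed because no three free blocks are adjacent; compaction would be needed to make room.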

Linked Files (Alto)

- Each file is a linked list
- File header (like an inode) points to the first block on disk
- Each block points to the next
- Pros
  - Can grow files dynamically
  - Free list is similar to a file
  - No external fragmentation or need to move files
- Cons
  - Random access is horrible
  - Even sequential access needs one seek per block
  - Unreliable: losing one block means losing the rest

[Figure: a file header pointing to file block 1, chained through to file block N]

File Allocation Table (FAT)

- Table of “next pointers”, indexed by block
- Dentry points to the first block of the file
- Two copies of the FAT, at the beginning of the volume
- Pros
  - Faster random access
  - Cache the FAT and traverse it in memory
- Cons
  - The FAT may be too large to cache: long seeks
  - Pointers for all files are interspersed in the FAT
  - Need the full table in memory, even for one file
- Solution: indexed files
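A toy table of next pointers shows why random access improves once the FAT is cached. The block numbers, the `EOF` marker value, and both helper functions below are illustrative assumptions rather than the real on-disk FAT format.

```python
EOF = -1  # end-of-chain marker (illustrative value)

# fat[b] is the block that follows block b in its file.
fat = {4: 7, 7: 2, 2: EOF,      # file A: 4 -> 7 -> 2
       5: 6, 6: EOF}            # file B: 5 -> 6


def nth_block(fat, first_block, n):
    """Follow n 'next pointers' from first_block, entirely in memory."""
    block = first_block
    for _ in range(n):
        block = fat[block]
        if block == EOF:
            raise IndexError("offset past end of file")
    return block


def chain(fat, first_block):
    """List every block of the file that starts at first_block."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]
    return blocks


print(chain(fat, 4), nth_block(fat, 4, 2))
```

Finding the third block of file A still takes two pointer hops, but they are table lookups rather than disk reads. The two files' entries also share one table, which is the interleaving the slide lists as a drawback.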

Single-Level Indexed Files

- User declares maximum file size
- A file header holds an array of pointers to disk blocks
- Pros
  - Random access is fast
  - Better metadata caching than FAT
- Cons
  - Clumsy to grow beyond the limit
  - Many seeks

[Figure: a file header whose pointer array references disk blocks]

ext2: Block Groups

[Figure: disk layout. A boot block is followed by block group 0, block group 1, ..., block group n. Each block group contains: super block, group descriptors, data block bitmap, inode bitmap, inode table, data blocks.]

- Improved reliability
  - Control structures are replicated
  - Easy to recover the superblock
- Improved performance
  - Reduces the distance between the inodes and related data blocks
  - It is possible to reduce the disk head seeks during I/O on files

ext2: Multi-Level Indexed Files

- The inode contains 15 pointers:
  - 12 direct pointers
  - 13: 1-level indirect
  - 14: 2-level indirect
  - 15: 3-level indirect
- Pros & Cons
  - In favor of small files
  - Can grow
  - Lots of seeking (somewhat limited by block groups)
- ext3: same on-disk format plus journal (covered later)

[Figure: an inode with pointers 1, 2, ..., 12 to data blocks, and pointers 13, 14, 15 to indirect blocks leading to data blocks]

ext4/xfs/Btrfs: Extents & Trees

- Extent: set of logically contiguous blocks within a file that are stored contiguously on disk
  - Single ext4 extent: up to 128MB with 4KB block size
  - Less metadata. Only need to remember: <1st logical block, # blocks, 1st physical block>
- xfs and Btrfs store extents in B-tree variants
- These are newer and very interesting Linux disk-based file systems and have become more “standard”
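The <1st logical block, # blocks, 1st physical block> triple is enough to translate any block of a file. The sketch below keeps extents in a sorted Python list and uses binary search as a stand-in for the tree structures real file systems use; the extent values and the `map_block` helper are made up for illustration.

```python
from bisect import bisect_right

# Each extent: (first logical block, number of blocks, first physical block).
# Values are made up; the list is sorted by first logical block.
extents = [(0, 4, 100), (4, 8, 500), (12, 2, 50)]


def map_block(extents, logical):
    """Translate a logical block number within the file to a physical block."""
    starts = [e[0] for e in extents]
    i = bisect_right(starts, logical) - 1  # last extent starting at or before
    if i < 0:
        return None
    first_logical, count, first_physical = extents[i]
    if logical >= first_logical + count:
        return None  # hole or past end of file
    return first_physical + (logical - first_logical)


print(map_block(extents, 5), map_block(extents, 13), map_block(extents, 14))

# The size limit quoted above: 128 MB divided into 4 KB blocks.
max_blocks = (128 * 1024 * 1024) // (4 * 1024)
print(max_blocks)
```

Three triples describe a 14-block file here. At the quoted limit, one extent stands in for 32768 individual block pointers.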

Log-structured File System

- Will be covered separately tomorrow

Outline

- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling

File System Corruption

- Some FS operations require multiple writes, which may not all complete (power failure, crash)
- The on-disk state will be invalid on next mount
- Example: To write to a file, 3 main operations:
  1. Write data to disk block
  2. Update the free space map
  3. Update pointer from inode to block
- With no help, detecting and recovering from errors requires examining all data structures
  - In Linux, this is done by fsck (file system check)
  - This was acceptable in the past, but takes too long for larger file systems

Journaling
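To see why the three-step write is fragile, and how logging it first helps, consider the minimal sketch below. It is a toy model under stated assumptions: the "disk" is a Python dict, the block and inode numbers are invented, the journal is assumed to have been written completely (with its commit record) before the crash, and data is logged along with metadata for simplicity.

```python
import copy


def apply_writes(disk, writes, crash_after=None):
    """Apply writes in order; stop early to simulate a crash."""
    for i, (area, key, value) in enumerate(writes):
        if crash_after is not None and i == crash_after:
            return False  # power failed before write i
        disk[area][key] = value
    return True


def consistent(disk):
    """Blocks referenced by inodes must be exactly the blocks marked used."""
    used = {b for b, in_use in disk["bitmap"].items() if in_use}
    return set(disk["inode"].values()) == used


empty = {"blocks": {}, "bitmap": {7: False}, "inode": {}}
# The three-step write, for hypothetical block 7 and inode 0.
writes = [("blocks", 7, b"data"), ("bitmap", 7, True), ("inode", 0, 7)]

# Without a journal: crash after the bitmap update, before the inode update.
disk = copy.deepcopy(empty)
apply_writes(disk, writes, crash_after=2)
no_journal_ok = consistent(disk)

# With a journal: the writes and a commit record are logged first.
disk = copy.deepcopy(empty)
journal = {"writes": list(writes), "committed": True}
apply_writes(disk, writes, crash_after=2)      # same crash
if journal["committed"]:                       # recovery at next mount
    apply_writes(disk, journal["writes"])      # replaying is idempotent
journal_ok = consistent(disk)

print(no_journal_ok, journal_ok)
```

Without the log, the crash leaves block 7 marked used but unreferenced, which only a full scan would detect. With the log, recovery replays one committed transaction instead of examining every structure; a transaction lacking its commit record would simply be discarded.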

- Journal: a special file that logs the changes destined for the file system in a circular buffer
- Idea: use a journal to log changes before they're committed to the file system to avoid metadata corruption
- Examples: JFS/JFS2, ext3/4, XFS, ReiserFS

ext3 Journaling Modes

- Writeback: Only metadata is journaled. Data is written independently. Preserves file system structure and avoids corruption, but files may contain stale data (like ext2 + fast fsck).
- Ordered (default): Data is written to disk before metadata transactions commit → no stale data blocks.
- Journal: Journals all data and metadata, so data is written twice (same consistency guarantees as 'ordered', different performance).

References & Further Reading

- References in this presentation refer to Linux 2.6.35
  - http://lxr.linux.no/#linux+v2.6.35/
- Further reading
  - Linux Kernel Development (Love): Good for overview; 3rd edition recently published
    - 2nd edition: http://linuxkernel2.atw.hu/ (hopefully posted with the author's permission...)
  - Understanding the Linux Kernel (Bovet & Cesati): Good for reference
  - btrfs: http://lwn.net/Articles/342892/
- Some of the content in these slides was taken from:
  - http://www.cs.princeton.edu/courses/archive/fall09/cos318/lectures/FileLayout.pdf
  - http://www.ntfs.com/fat-allocation.htm
  - http://www.ibm.com/developerworks/library/l-journaling-filesystems/index.html
  - Tel-Aviv University advanced storage course slides by Ronen Kat and Ohad Rodeh
  - Various Wikipedia articles
  - http://static.usenix.org/event/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html
