IBM Linux Technology Center

The Linux ext2/3/4 Filesystem: Past, Present, and Future

Theodore Ts'o IBM Linux Technology Center September 11, 2006

© 2006 IBM Corporation IBM Linux Technology Center

Agenda

● A brief history of the ext2/3 filesystem
● The filesystem format
● Features added to ext3 in Linux 2.6
● New features planned for ext3/4
● Why ?
● Conclusion

A brief history of Linux filesystems

● The Minix filesystem (1991, used to bootstrap Linux)
   ○ Max FS size: 64MB
   ○ Max file size: 64MB
   ○ Max filename: 14/30 bytes (fixed-length directory entries)
   ○ Only supported a modification timestamp
● First attempt to improve on Minixfs: the ext filesystem (1992)
   ○ Max FS size: 2GB
   ○ Max file size: 2GB
   ○ Max filename: 255 bytes
   ○ Still only one timestamp
   ○ Linked lists for free blocks/inodes caused performance problems

The xiafs and ext2fs filesystems

● Xiafs: minimal changes from Minix (January 1993)
   ○ Max FS size: 2GB
   ○ Max file size: 64MB (instead of 2GB)
   ○ Max filename: 248 bytes (fixed-length directory entries)
   ○ ctime/mtime/atime timestamps
● Ext2fs: improvements to extfs (January 1993)
   ○ Max FS size extended to 4TB
   ○ Variable block sizes
   ○ ctime/mtime/atime timestamps
   ○ Improved block/inode allocation using bitmaps and block groups

Competition between xiafs and ext2fs

● Since xiafs only made minor changes to Minix, it was initially more stable (or at least appeared to be).
● Frank Xia tried to rename xiafs to Linuxfs, which drew a negative reaction to marketing-driven changes.
● Ext2 had a larger development community (so more features were added) and a more scalable design. In the end it became the dominant “default” filesystem.
● Features added to ext2 over the years
   ○ Sparse superblocks
   ○ Large file support (> 2GB)
   ○ Extended attributes
   ○ ACLs

The ext3 filesystem

● Journaling added to ext2 in 2000 (work started in 1998).
● Since it required many changes to the code base, a new version of the filesystem code was created in the kernel.
   ○ Hence, ext3
   ○ But really just ext2 with the COMPAT_HAS_JOURNAL feature (from the filesystem format point of view)
● Other journaling filesystems
   ○ Reiserfs, JFS, XFS
● Advantages of ext3
   ○ Backwards compatibility with ext2
   ○ Robustness against hardware errors is the highest priority

The ext2/3 filesystem format

● Very similar to the BSD FFS
● Cylinder groups have become “block groups”
● Compatibility feature sets allow controlled addition of new features via three bitmasks:
   ○ R/W Compat: the kernel may mount the filesystem even if it does not understand a feature in this bitmask. (E2fsck, however, will refuse to touch a filesystem it doesn't understand.)
   ○ R/O Compat: the kernel may mount the filesystem read-only if it does not understand a feature in this bitmask
   ○ Incompat: the kernel must not mount the filesystem if it does not understand a feature in this bitmask
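
The three-bitmask rule above can be sketched as a small decision function. This is only an illustration of the logic described on the slide: the function name, the parameter names, and the bit values used in the example are hypothetical and are not the kernel's actual identifiers.

```python
def mount_decision(fs_compat, fs_ro_compat, fs_incompat,
                   known_compat, known_ro_compat, known_incompat):
    """Decide how a kernel may mount a filesystem (illustrative sketch).

    The fs_* arguments are the feature bitmasks stored in the superblock;
    the known_* arguments are the bits this (hypothetical) kernel understands.
    """
    # Any unknown incompat bit: the kernel must not mount at all.
    if fs_incompat & ~known_incompat:
        return "refuse"
    # Any unknown R/O compat bit: a read-only mount is still safe.
    if fs_ro_compat & ~known_ro_compat:
        return "ro"
    # Unknown R/W compat bits never restrict the kernel.
    return "rw"


if __name__ == "__main__":
    # Made-up bit values: a kernel that understands no optional features.
    print(mount_decision(0x4, 0x0, 0x0, 0, 0, 0))  # unknown compat bit
    print(mount_decision(0x0, 0x1, 0x0, 0, 0, 0))  # unknown R/O compat bit
    print(mount_decision(0x0, 0x0, 0x2, 0, 0, 0))  # unknown incompat bit
```

Checking the incompat mask first matters: a filesystem can carry unknown bits in more than one mask, and the most restrictive answer has to win.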

Ext2 Filesystem Layout

Disk:              Boot block | BG #0 | BG #1 | ... | BG #N

Each block group:  Superblock | FS descriptors | Block bitmap | Inode bitmap | Inode table | Data blocks
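
As a rough illustration of how this layout partitions the disk, the sketch below computes which block group a block falls into. It assumes (this is not stated on the slide) that each group's block bitmap occupies exactly one block, so a group covers at most 8 × block size blocks; the function names and the `first_data_block` parameter are chosen here for illustration.

```python
def blocks_per_group(block_size):
    """Blocks covered by one group, assuming a one-block bitmap (1 bit per block)."""
    return 8 * block_size


def group_of_block(block, block_size, first_data_block=0):
    """Index of the block group that contains the given block number."""
    return (block - first_data_block) // blocks_per_group(block_size)


if __name__ == "__main__":
    bs = 4096
    print(blocks_per_group(bs))        # blocks per group for 4 KB blocks
    print(group_of_block(100_000, bs)) # group index of block 100000
```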

Ext2 Inode structure

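[Figure: ext2 inode. Fields: mode, owners, size, timestamps, ..., followed by the block pointers: direct blocks (each pointing to a data block), an indirect block, a double indirect (d. indirect) block, and a triple indirect (t. indirect) block, which reach data blocks through one, two, and three levels of pointer blocks.]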
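
A minimal sketch of how a logical block number is routed through this structure. The defaults are assumptions rather than values taken from the slide: 12 direct pointers, 4-byte block pointers, and a 4 KB block size. The function name is hypothetical.

```python
def block_map_level(logical_block, block_size=4096, n_direct=12, ptr_size=4):
    """Return which part of the inode block map covers a logical block."""
    ptrs = block_size // ptr_size  # pointers that fit in one pointer block
    if logical_block < 0:
        raise ValueError("logical block must be non-negative")
    if logical_block < n_direct:
        return "direct"
    logical_block -= n_direct
    if logical_block < ptrs:
        return "indirect"
    logical_block -= ptrs
    if logical_block < ptrs ** 2:
        return "double indirect"
    logical_block -= ptrs ** 2
    if logical_block < ptrs ** 3:
        return "triple indirect"
    raise ValueError("logical block is beyond the reach of the block map")


if __name__ == "__main__":
    for n in (0, 11, 12, 1035, 1036, 1036 + 1024 ** 2):
        print(n, block_map_level(n))
```

Under these assumptions only the first 12 blocks are reachable without reading a pointer block; everything else costs one to three extra block reads per lookup.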

Ext2 Directory Layout

Directory entries (each pairs a name with an inode in the inode table):

   I1  name1
   I2  name2
   I3  name3
   I3  name4
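
To make the name-to-inode pairing concrete, here is a sketch that packs and parses variable-length directory entries. The record layout used (4-byte inode number, 2-byte record length, 1-byte name length, 1-byte file type, then the name, little-endian, padded to 4 bytes) is an assumption that is not shown on the slide, and `pack_entry` and `parse_entries` are hypothetical helper names.

```python
import struct

HEADER = "<IHBB"  # inode, record length, name length, file type
HEADER_SIZE = struct.calcsize(HEADER)  # 8 bytes


def pack_entry(inode, name, rec_len=None):
    """Build one directory record; rec_len defaults to the padded minimum."""
    raw = name.encode()
    min_len = (HEADER_SIZE + len(raw) + 3) & ~3  # round up to a multiple of 4
    rec_len = rec_len or min_len
    header = struct.pack(HEADER, inode, rec_len, len(raw), 0)
    return header + raw.ljust(rec_len - HEADER_SIZE, b"\0")


def parse_entries(block):
    """Walk the records in a directory block, skipping unused ones (inode 0)."""
    entries, off = [], 0
    while off + HEADER_SIZE <= len(block):
        inode, rec_len, name_len, _ = struct.unpack_from(HEADER, block, off)
        if rec_len < HEADER_SIZE:
            break  # corrupt record: stop rather than loop forever
        if inode != 0:
            start = off + HEADER_SIZE
            entries.append((inode, block[start:start + name_len].decode()))
        off += rec_len
    return entries


if __name__ == "__main__":
    block = b"".join(pack_entry(i, n) for i, n in
                     [(1, "name1"), (2, "name2"), (3, "name3"), (3, "name4")])
    print(parse_entries(block))
```

Two names pointing at the same inode, as with name3 and name4 in the diagram, is how hard links appear in a directory.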

Features added to Linux 2.6

● BKL removal and other scalability improvements (Andrew Morton, Alex Tomas)
● Directory indexing (Daniel Phillips, Theodore Ts'o)
● Extended attributes (Andreas Gruenbacher)
● Online resizing (Andreas Dilger, Stephen Tweedie)
● Reservation-based block preallocation (Mingming Cao, Andrew Morton, Stephen Tweedie, Badari Pulavarty)

Reducing Lock Contention

● Motivation: scaling issues for 2.4's ext3/jbd under workloads with concurrent I/O
● To address this problem:
   ○ Replaced the per-filesystem superblock lock in ext3 with finer-grained locks
   ○ Removed the big (global) kernel lock from the JBD layer
● Result: SDET benchmark throughput improved by a factor of 10

Directory Indexing

● Motivation: large directories took a long time to search
● Solution: add a search tree, indexed by the hash of the filename, to the directory
   ○ Variation of a B+tree
      – Directory entries stored only in leaf nodes
      – The use of fixed-length, 64-bit hashes as keys results in a high fanout factor
● Fully backwards compatible with older kernels
   ○ Interior nodes look like deleted directory entries
   ○ Older kernels will clear the directory-indexed bit when they modify a directory, thus invalidating the interior nodes until they can be regenerated
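
The lookup idea can be sketched with a single index level over hash-sorted leaves. This is a simplified model, not the on-disk format: `zlib.crc32` stands in for the real filename hash, the leaf capacity is arbitrary, and `build_index` and `lookup` are hypothetical names.

```python
import bisect
import zlib


def name_hash(name):
    """Stand-in hash; the real directory index uses its own hash function."""
    return zlib.crc32(name.encode())


def build_index(names, leaf_capacity=4):
    """Sort entries by hash, split them into leaves, and record each leaf's lowest hash."""
    pairs = sorted((name_hash(n), n) for n in names)
    leaves = [pairs[i:i + leaf_capacity]
              for i in range(0, len(pairs), leaf_capacity)]
    lower_bounds = [leaf[0][0] for leaf in leaves]
    return lower_bounds, leaves


def lookup(index, name):
    """Binary-search the index, then scan only the leaves that can hold the hash."""
    lower_bounds, leaves = index
    h = name_hash(name)
    # Start one leaf early so equal hashes split across two leaves are still found.
    start = max(bisect.bisect_left(lower_bounds, h) - 1, 0)
    for leaf in leaves[start:]:
        if leaf[0][0] > h:
            break
        if any(entry_name == name for _, entry_name in leaf):
            return True
    return False


if __name__ == "__main__":
    index = build_index([f"file{i}" for i in range(1000)])
    print(lookup(index, "file123"), lookup(index, "missing"))
```

Only the index and one or two leaves are touched per lookup, instead of every entry in the directory.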

Extended Attributes

● Motivation: need to store small amounts of custom metadata associated with files or directories
   ○ Also needed to support Access Control Lists (ACLs)
● EAs are stored in a single EA block, which can be shared by inodes that have the same extended attributes
● In Linux 2.6.11+, EAs can be stored in the expanded inode as well.
   ○ This EA-in-inode support makes ext3 the top filesystem on Samba4 benchmarks
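
The sharing of one EA block among inodes with identical attributes can be modeled as a content-keyed table with reference counts. This is an in-memory illustration only; the class name, the block numbering, and the attribute names in the example are hypothetical.

```python
class EABlockTable:
    """Hand out one shared 'block' per distinct attribute set (illustrative)."""

    def __init__(self):
        self._by_content = {}  # frozen attribute set -> block id
        self._refcount = {}    # block id -> number of inodes sharing it
        self._next_block = 1

    def assign(self, attrs):
        """Return the block id holding attrs, reusing an identical block if one exists."""
        key = tuple(sorted(attrs.items()))
        block = self._by_content.get(key)
        if block is None:
            block = self._next_block
            self._next_block += 1
            self._by_content[key] = block
            self._refcount[block] = 0
        self._refcount[block] += 1
        return block

    def refcount(self, block):
        return self._refcount[block]


if __name__ == "__main__":
    table = EABlockTable()
    a = table.assign({"user.mime": "text/plain"})
    b = table.assign({"user.mime": "text/plain"})
    c = table.assign({"user.mime": "image/png"})
    print(a == b, a == c, table.refcount(a))
```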

Online Resizing

● Motivation: taking advantage of new disk space after a logical volume has been grown by the LVM subsystem, without needing to unmount the filesystem
● Solution: reserve space so that the number of blocks needed for the block group descriptors can be grown
   ○ An additional 4k block is required for every 32 block groups
   ○ Block group descriptors must be stored contiguously after the superblock
● Integrated into the kernel as of 2.6.10 and into e2fsprogs as of 1.39

Reservation-based block preallocation

● Block preallocation helps reduce file fragmentation caused by concurrent allocation
● Ext3 has supported block preallocation since the 2.6.10 kernel
● Ext3 uses in-memory block reservation to support a large preallocation

[Figure: on-disk placement of the blocks of files 1 to 4 under concurrent allocation, "Ext3 (before)" versus "Ext3 (after)".]

[Figure: Reservation Tree. Files 1 to 4 each hold a reservation window over the disk blocks; the tree contains the windows (0, 7), (8, 31), (32, 63), and (64, 71).]
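
A minimal sketch of per-file reservation windows, assuming windows are inclusive (start, end) block ranges that must not overlap. A sorted list stands in for the tree here, and the class name, the `goal` parameter, and the assignment of windows to particular files are illustrative assumptions.

```python
import bisect


class ReservationMap:
    """Track non-overlapping reservation windows, kept sorted by start block."""

    def __init__(self):
        self._windows = []  # sorted list of (start, end, owner), end inclusive

    def reserve(self, owner, size, goal=0):
        """Reserve the first free run of `size` blocks at or after `goal`."""
        start = goal
        for w_start, w_end, _ in self._windows:
            if w_end < start:
                continue                 # window lies entirely before the candidate
            if w_start >= start + size:
                break                    # the gap before this window is big enough
            start = w_end + 1            # overlap: retry just past this window
        window = (start, start + size - 1, owner)
        bisect.insort(self._windows, window)
        return window[:2]


if __name__ == "__main__":
    rmap = ReservationMap()
    for owner, size in [("file 1", 8), ("file 2", 24), ("file 3", 32), ("file 4", 8)]:
        print(owner, rmap.reserve(owner, size))
```

Because each file allocates only inside its own window, blocks written concurrently by different files do not interleave on disk.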

tiobench sequential write

[Figure: bar chart of throughput (MB/sec, axis 0 to 40) at 4, 16, and 64 threads for ext3 2.4.29, ext3 2.6.11, JFS, and XFS.]

Features planned for ext3/4

● Extents
● Support for large disks (48- and 64-bit block numbers)
● Fine-grained timestamps
● Asynchronous (background) unlink/truncate
● Support for > 32,000 subdirectories
● Finer-grained locking to support parallel directory operations

Why Extents?

[Figure: Ext2/3 indirect block map. The i_data array holds direct block pointers, followed by pointers to an indirect block, a double indirect block, and a triple indirect block; each logical block is mapped to its disk block by its own entry.]

Extents

● Extents are an efficient way to represent large files
● An extent is a single descriptor for a range of contiguous blocks

   logical   length   physical
   0         1000     200
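
The example descriptor above (logical block 0, length 1000, physical block 200) can be turned into a small lookup sketch. It assumes the extent list is sorted by logical start and non-overlapping; the `Extent` tuple and `map_block` function are illustrative names, not the on-disk structures.

```python
import bisect
from collections import namedtuple

Extent = namedtuple("Extent", "logical length physical")


def map_block(extents, logical_block):
    """Translate a logical block to a physical block, or None if it is unmapped."""
    starts = [e.logical for e in extents]
    i = bisect.bisect_right(starts, logical_block) - 1
    if i < 0:
        return None                      # before the first extent
    e = extents[i]
    if logical_block >= e.logical + e.length:
        return None                      # falls in a hole after this extent
    return e.physical + (logical_block - e.logical)


if __name__ == "__main__":
    extents = [Extent(logical=0, length=1000, physical=200)]
    print(map_block(extents, 0), map_block(extents, 999), map_block(extents, 1000))
```

One descriptor covers all 1000 blocks here, where a per-block map would need 1000 separate entries.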

Extent Map

[Figure: the i_data area holds a header followed by extent entries; each entry points to a contiguous run of disk blocks.]

Extent Tree

[Figure: the root in i_data holds a header and index entries; index nodes hold a node header and further index entries; leaf nodes hold a node header and extents, which point to the disk blocks.]

Extent-Related Work

● Multiple block allocation
   ○ An efficient way to allocate a chunk of contiguous blocks at a time
● Delayed allocation
   ○ Enables multiple block allocation by deferring and clustering single-block allocations
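
The clustering step behind delayed allocation can be sketched as follows: single-block requests are collected first and then merged into contiguous runs, each of which could be handed to a multiple-block allocator in one call. The function name and the (start, count) return format are assumptions made for this illustration.

```python
def cluster_requests(logical_blocks):
    """Merge deferred single-block requests into (start, count) runs."""
    runs = []
    for block in sorted(set(logical_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1             # extends the current run
        else:
            runs.append([block, 1])      # starts a new run
    return [tuple(run) for run in runs]


if __name__ == "__main__":
    # Blocks dirtied in arbitrary order before anything is allocated.
    print(cluster_requests([3, 0, 1, 2, 10, 11]))
```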

Evaluation of Extents Patches

● Improvements for large file creation, removal, sequential read, and sequential rewrite
● Benchmarks used: dbench, tiobench, FFSB filemark, sqlbench, iozone, etc.

Tiobench Sequential Write Comparison With Extents

[Figure: bar chart of throughput (MB/sec, axis 0 to 40) at 4, 16, and 64 threads for ext3 2.6.11, ext3+extents, JFS, and XFS.]

Large File Sequential I/O Comparison Using FFSB

[Figure: bar chart of throughput (MB/sec, axis 0 to 180) for sequential read, sequential write, and sequential re-write on ext3, ext3+extents, JFS, and XFS. Values printed on the chart: 166.3, 153.7, 156.3, 127, 104.3, 102.7, 100, 94.8, 91.9, 89.3, 75.7, 71.]

Ext4: The next-generation ext3

● When initial versions of the extents patches were sent out for comment, some developers expressed concern:
   ○ Ext3 was too important to risk destabilizing its code quality
   ○ Backwards-incompatible extensions could cause user confusion
● After much discussion, a consensus was reached on moving forward:
   ○ Ext3 cleanup patches would be applied
   ○ The ext3 code base would be forked to fs/ext4, with the filesystem name ext4-dev
   ○ New work would happen in ext4-dev; once the feature set for ext4 is stabilized, it would be renamed from ext4-dev to ext4

Conclusion

● The ext2/3/4 filesystem is the oldest filesystem that is still being actively developed in Linux
● It has served the Linux community well for over 10 years
● With new improvements constantly being proposed, implemented, and placed into production, ext3/4 development remains vital and exciting!
