XFS In Rapid Development

Jeff Liu

"We have many requests to provide a supported option for XFS on Oracle Linux" – Oracle Linux Blog

Feb 28, 2013

2 About This Talk

• Introduction
  - About XFS
  - XFS Development Community
• How Fast XFS Is Going
  - Kernel changes (> Linux 3.0)
  - User space programs
  - XFS test suite
• Upcoming Features
  - Kernel and user space
  - Preview of the self-describing metadata

3 About XFS

• Full 64-bit journaling file system
• Well known for high performance and scalability
• Maximum filesystem/file size: 16 EiB / 8 EiB
• Variable block sizes: 512 bytes to 64 KB
• Freeze/thaw to support volume-level snapshots - xfs_freeze(8) (see the sketch below)
• Online filesystem/file defragmentation - xfs_fsr(8)
• Online filesystem resize - xfs_growfs(8)
• Internal log space / external log volume
• Realtime subvolume - provides very deterministic data rates suitable for media streaming applications
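For illustration, filesystem freeze/thaw can be driven from the generic FIFREEZE/FITHAW ioctls, which is what xfs_freeze(8) boils down to on current kernels; a minimal sketch (the /xfs mount point is an example, root privileges required):

#include <fcntl.h>
#include <linux/fs.h>    /* FIFREEZE, FITHAW */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/xfs", O_RDONLY);  /* any fd on the filesystem root */
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, FIFREEZE, 0) < 0)   /* block writes, flush the log */
                perror("FIFREEZE");
        /* ... take the volume-level snapshot here ... */
        if (ioctl(fd, FITHAW, 0) < 0)     /* resume normal operation */
                perror("FITHAW");

        close(fd);
        return 0;
}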

4 XFS Development Community

• Developers from corporations - SGI, Red Hat, Oracle, SUSE, IBM
• Main contributors, in alphabetical order - Dave Chinner, Christoph Hellwig
• Preeminent individual contributors - Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen, Jan Kara, Jeff Liu, Mark Tinguely
• Maintainer - Ben Myers @ SGI
• Join us via the mailing list (xfs@oss.sgi.com) and IRC channel (#xfs on irc.freenode.net)
• Newcomers are always welcome!

5 How Fast XFS Is Going

Statistics of code changes between Linux v3.0 and v3.10-rc1 (Jul 21 2011 - May 11 2013): Btrfs / Ext4 with JBD2 / XFS

$ git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/[btrfs|xfs|ext4 with jbd2]

[Bar chart: the number of files changed, insertions and deletions for Ext4&JBD2, XFS and Btrfs, Linux v3.0 ~ v3.10-rc1]

6 How Fast XFS Is Going

• XFS changes were made up of:
  - Improvements - performance/scalability improvements, code base refactoring
  - New features - anything new
  - Bug fixes
  - Misc - trivial fixes, code style adjustments, dead code cleanups

7 How Fast XFS Is Going

The proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the number of patches

[Pie chart: Improvement / New feature / Bug fix / Misc]

8 How Fast XFS Is Going

The proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on lines changed (+/-)

[Pie chart: Improvement / New feature / Bug fix / Misc]

9 How Fast XFS Is Going

• Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013)
  - 15 contributors
  - 106 patches

$ git diff --stat --minimal -C -M v3.1.6 v3.1.11 | grep changed
108 files changed, 11113 insertions(+), 11418 deletions(-)

10 How Fast XFS Is Going

• XFS test suite - xfstests
  - A generic test tool for Linux local filesystems
  - 300+ test cases overall
  - 170+ XFS-specific test cases
• Test cases are well organized for different filesystems

$ ls -l xfstests/tests/
btrfs/ ext4/ generic/ Makefile shared/ udf/ xfs/

11 Speedup Direct-IO R/W On High IOPS Devices

• XFS inode locking modes, e.g. shared/exclusive
  - The naming convention is inherited from SGI IRIX
  - Equivalent to the read/write lock modes on Linux
• Issues faced before Linux 3.2
  - The exclusive lock range is too extensive
  - Concurrent direct-IO reads are serialized on the page cache check
  - The exclusive lock mode is used for direct-IO writes by default

12 Speedup Direct-IO R/W On High IOPS Devices

• Solutions (see the sketch below)
  - Use the shared lock for direct-IO reads; take the exclusive mode only if page invalidation is needed
  - Use the shared lock for direct-IO writes by default; take the exclusive lock during IO submission if extent allocation is required
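The shape of this change, as a user-space analogue: the pthread rwlock below stands in for the XFS iolock, and the page-cache check is illustrative rather than the actual kernel logic.

#include <pthread.h>
#include <stdbool.h>

/* Analogue of the direct-IO read path described above: take the lock
 * shared by default, and only reacquire it exclusively when the page
 * cache must be flushed and invalidated first. */
static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;

static void dio_read(bool pages_cached)
{
        pthread_rwlock_rdlock(&iolock);          /* shared: reads run concurrently */
        if (pages_cached) {
                pthread_rwlock_unlock(&iolock);  /* upgrade to exclusive */
                pthread_rwlock_wrlock(&iolock);
                /* re-check the condition: it may have changed while unlocked */
                /* ... flush and invalidate cached pages ... */
                /* (the kernel demotes back to shared before issuing the IO) */
        }
        /* ... submit the direct IO ... */
        pthread_rwlock_unlock(&iolock);
}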

13 Speedup Direct-IO R/W On High IOPS Devices

FIO scenario (fio version 2.1):

direct=1
rw=randrw
bs=4k
size=10G
numjobs=10   # [20, 40, 80]
runtime=120
thread
ioengine=psync

Simplified xfs_info(8) output, storage formatted with default options:

Metadata: isize=256 agcount=4 agsize=937408 blks sectsz=512
Data:     bsize=4096 blocks=3749632 sunit=0 swidth=0 blks
Log:      internal bsize=4096 blocks=2560 version=2

14 Speedup Direct-IO R/W On High IOPS Devices

XFS read IOPS, SATA3 SSD: vanilla 3.7.0 vs 2.6.39 in delaylog mode

[Bar chart: input/output operations per second at 10, 20, 40 and 80 threads, comparing 2.6.39 and 3.7.0]

15 Speedup Direct-IO R/W On High IOPS Devices

XFS write IOPS, SATA3 SSD: vanilla 3.7.0 vs 2.6.39 in delaylog mode

[Bar chart: input/output operations per second at 10, 20, 40 and 80 threads, comparing 2.6.39 and 3.7.0]

16 Sync Story

• Improved concurrency for fsync(2) on files
  - Unlock the inode before the log force
• Optimizations for fsync(2) on directories (see the example below)
  - Directories are only updated transactionally
  - No file data needs to be flushed
  - Disk caches need not be flushed except as part of a transaction commit
• Improved sync behavior in the face of aggressive dirtying
  - XFS no longer writes data out twice per filesystem sync, which used to override the livelock protection in the core writeback code
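For context, cheap directory fsync matters for the common durability pattern of fsyncing the parent directory after creating a file; a minimal sketch, with example paths:

#define _GNU_SOURCE              /* for O_DIRECTORY */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Persist the file data first, then persist the directory entry itself. */
int main(void)
{
        int fd = open("/xfs/dir/file", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open file"); return 1; }
        /* ... write data ... */
        fsync(fd);                        /* flush file data and metadata */
        close(fd);

        int dirfd = open("/xfs/dir", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) { perror("open dir"); return 1; }
        fsync(dirfd);                     /* on XFS, a pure transactional update */
        close(dirfd);
        return 0;
}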

17 Sync Story

• The xfssyncd workqueue was removed; instead:
  - A new dedicated workqueue for inode reclaim
  - A new dedicated workqueue for log operations
  - The periodic sync work is now only the log work, driven by the xfssyncd_centisecs sysctl

18 Efficient Sparse File Handling

• SEEK_DATA/SEEK_HOLE options to lseek(2) (see the example below)
  - Derived from Solaris ZFS
  - A neater call interface than the FIEMAP ioctl(2)
• Use scenarios
  - GNU tar(1), etc.
  - Virtual machine image (Xen, KVM) backup
  - Sparse file detection
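A short sketch of walking a sparse file's data segments with the new interface; the file path is an example, and _GNU_SOURCE is needed for the SEEK_DATA/SEEK_HOLE definitions:

#define _GNU_SOURCE              /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Enumerate the data segments of a sparse file by alternating
 * lseek(SEEK_DATA) and lseek(SEEK_HOLE); the loop ends with ENXIO
 * once no further data exists. */
int main(void)
{
        int fd = open("/xfs/sparse", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t data = 0;
        while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
                off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of this segment */
                printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
                data = hole;
        }
        close(fd);
        return 0;
}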

19 Efficient Sparse File Handling

• Refinement for unwritten extents
• Create a sparse file with unwritten extents mixed with data and holes:

#!/bin/bash

xfs_io -F -f -c "falloc 0 10G" /xfs/sparse

for i in $(seq 0 30 120); do
    offset=$(($i * $((1 << 20))))
    xfs_io -c "pwrite $offset 500m" /xfs/sparse
done

20 Efficient Sparse File Handling

• Layout of the created sparse file

$ filefrag -v sparse
Filesystem type is: 58465342
File size of sparse is 10737418240 (2621440 blocks, blocksize 4096)
 ext   logical  physical  expected   length  flags
   0         0  43547551             151040
   1    151040  43698591            1946111  unwritten
   2   2097151  43008572  45644702   524289  unwritten,eof
sparse: 2 extents found

21 Efficient Sparse File Handling

Sparse file copy via xfstests/seek_copy_test on a laptop with a normal SATA disk, with and without the unwritten extent refinement

[Bar chart: copy time in seconds (0-140 scale), Improved vs. Non-improved]

22 Quota Improvements

• XFS disk quota supports
  - User quota
  - Group quota
  - Project quota - directory tree quota (limits disk usage per directory)

23 Quota Improvements

• Bad scalability when searching tens of thousands of in-memory dquots. Why?
  - User/group/project dquots are stored in a global hash table shared between all file systems
  - A hash table is O(n) worst case for search/insert/delete, while a radix tree is O(k) worst case for insertion and deletion
• Solutions (see the lookup sketch below)
  - Replace the global hash table with per-filesystem radix trees
  - Replace the global dquot LRU list with per-filesystem lists
  - Remove the global xfs_Gqm structure
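A kernel-style sketch of what a per-filesystem dquot lookup looks like after this change; the helper and field names are illustrative and simplified from the upstream idea, not the exact sources:

/* Look up a user dquot in this filesystem's own radix tree rather
 * than in a hash table shared by every mounted XFS filesystem. */
static struct xfs_dquot *
xfs_dquot_find(struct xfs_mount *mp, xfs_dqid_t id)
{
        struct xfs_quotainfo *qi = mp->m_quotainfo;
        struct xfs_dquot *dqp;

        mutex_lock(&qi->qi_tree_lock);
        dqp = radix_tree_lookup(&qi->qi_uquota_tree, id);  /* O(key length) */
        /* reference counting and dquot locking elided for brevity */
        mutex_unlock(&qi->qi_tree_lock);
        return dqp;
}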

24 Fighting With Process 8K Stack Space Limitation

• 8K process stack space for x86_64 in Linux 2.6 by default
  - Every process has a dedicated kernel stack
  - Kernel stacks are a fixed size and cannot be expanded as required
  - They cannot be swapped
• Extreme stack use in the Linux VM/VFS call chain
• The old problem for XFS
  - Significant issues with the amount of stack that allocation in XFS uses, especially in memory-reclaim situations (the writeback path)

25 Fighting With Process 8K Stack Space Limitation

• Buffer cache misses that trigger I/O vs. CPU cache misses
• Solution (see the sketch below)
  - Reduce stack use in the allocation call chain, e.g. delayed allocation
  - Move allocations to a worker thread combined with a completion
  - Avoid the context-switch overhead when an allocation request comes in with plenty of stack remaining
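A kernel-style sketch of the worker-plus-completion hand-off described above, modeled on the idea rather than the exact upstream code; the struct and workqueue names are illustrative:

/* Queue the allocation to a workqueue thread, which runs it on a
 * fresh stack, and wait for the completion instead of recursing
 * deeper on the caller's nearly exhausted stack. */
struct xfs_alloc_args_work {
        struct work_struct work;
        struct completion  done;
        /* ... allocation arguments and result ... */
};

static void xfs_alloc_worker(struct work_struct *work)
{
        struct xfs_alloc_args_work *aw =
                container_of(work, struct xfs_alloc_args_work, work);

        /* ... perform the extent allocation on the worker's stack ... */
        complete(&aw->done);
}

static void xfs_alloc_via_worker(struct xfs_alloc_args_work *aw)
{
        INIT_WORK_ONSTACK(&aw->work, xfs_alloc_worker);
        init_completion(&aw->done);
        queue_work(xfs_alloc_wq, &aw->work);  /* xfs_alloc_wq: dedicated workqueue */
        wait_for_completion(&aw->done);
}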

26 Bounds Checking Enabled XFS Kernel

• Alternative CONFIG_XFS_WARN support
  - Depends on XFS_FS && !XFS_DEBUG
  - Converts ASSERT() checks into WARN_ON(1)
  - Does not modify algorithms
  - Does not cause the kernel to panic on non-fatal errors
  - Makes it easier to find strange "out of bounds" problems
  - Already turned on in Fedora kernel-debug packages
• Suggest applying this feature in other Linux distributions with XFS support

27 Bounds Checking Enabled Kernel

• XFS with CONFIG_XFS_DEBUG
  - A very efficient buddy for developers
  - Weak points from a user perspective:
    . Significant overhead in production environments
    . Changes the behavior of algorithms (such as allocation) to improve test coverage, e.g. xfs_alloc_ag_vextent_near()
    . Intentionally panics the machine on non-fatal errors, by design
• Only advisable for debugging purposes

28 Misc Changes

• Mount options
  - The nodelaylog mode was removed; delaylog is the default (>= Linux 3.3)
  - inode64 is re-mountable
  - inode32 is re-mountable
• Speculative preallocation improvements
  - Trim speculative preallocation near ENOSPC, near quota limits, and for sparse files
• Discontiguous buffers
  - Virtually contiguous in the buffer, but non-contiguous on disk

29 Upcoming – Self Describing Metadata Preview

30 Upcoming – Self Describing Metadata Preview

• XFS utilities for forensic analysis of the file system structures
  - xfs_repair(8)
  - xfs_db(8)
• Analyzing the structure of 100TB to 1PB storage :(
• Primary concern for supporting PB-scale file systems
  - Minimize the time and effort required for basic forensic analysis of the file system structures

31 Self Describing Metadata Preview

• Problems with the current metadata format
  - A magic number is the only means of identification
  - AGFL blocks, remote symlink blocks and remote attribute blocks lack even an identifying magic number

32 Self Describing Metadata Preview

• Additional information needs to be recorded
  - CRC32c validation
  - Filesystem identifier
  - The owner of a metadata block
  - Logical sequence number (LSN) of the most recent transaction

33 Self Describing Metadata Preview

• The typical on-disk structure

struct xfs_ondisk_hdr {
        __be32  magic;   /* magic number */
        __be32  crc;     /* CRC, not logged */
        uuid_t  uuid;    /* filesystem identifier */
        __be64  owner;   /* parent object */
        __be64  blkno;   /* location on disk */
        __be64  lsn;     /* last modification in log, not logged */
};

34 Self Describing Metadata Preview

• Additional information format
  - Varies according to the type of metadata block
• Runtime validation (see the sketch below)
  - Immediately after a successful read from disk
  - Immediately prior to write IO submission
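A sketch of what the read-side check can look like: recompute the CRC32c over the block while skipping the stored checksum field, then compare with the on-disk value. The helper name and the exact seeding/inversion convention here are illustrative, not the upstream routine:

/* Verify a metadata block's checksum immediately after reading it. */
static bool
xfs_hdr_crc_ok(const void *block, size_t len, size_t crc_off)
{
        const char *p = block;
        const __be32 *stored = (const __be32 *)(p + crc_off);
        u32 crc;

        /* checksum the block before and after the crc field itself */
        crc = crc32c(~0U, p, crc_off);
        crc = crc32c(crc, p + crc_off + sizeof(__be32),
                     len - crc_off - sizeof(__be32));
        return *stored == cpu_to_be32(~crc);
}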

35 Self Describing Metadata Preview

• Compatibility
  - No forward compatibility: existing filesystems will not support the new disk format
  - No backward compatibility: old kernels and userspace will not be able to read the new format
  - Kernels and userspace that support the new format will still work just fine with the old, non-CRC-enabled format
  - Two different, incompatible disk formats are supported from this point onwards
  - Tools to convert the format of an existing file system will not be provided

36 Upcoming - Kernel

• Splitting project quota support from group quota support
  - The same quota inode is currently used for project and group quotas
  - Introduce a third quota inode for project quota so that both can be enabled at the same time
• Online shrink
  - An initial patch set has already been posted to the mailing list; it depends on xfs_agstate(8) with kernel changes, as well as xfs_reno(8)
  - Moving the internal log blocks remains a challenge

37 Upcoming – User space

• xfs_reno(8)
  - Allows an inode64 filesystem to be converted to inode32
  - The file system has to be mounted with inode32 in advance
• xfs_agstate(8)
  - Allows turning an allocation group offline or back online

38 XFS's 20th birthday is coming in October :)

Thank you!

39 References

• http://xfs.org/index.php/XFS_status_update_for_2011
• http://xfs.org/index.php/XFS_status_update_for_2012
• http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html
• http://oss.sgi.com/archives/xfs/2013-04/msg00100.html
• http://lwn.net/Articles/476267/
• http://lwn.net/Articles/476263/
• http://lwn.net/Articles/84583/
• http://en.wikipedia.org/wiki/Hash_table

40 Acknowledgments

• Thanks to the reviewers of this document for their helpful comments, in alphabetical order: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely
• I would like to thank Christoph Hellwig for writing the XFS development status update for every official Linux release between 2011 and 2012; those updates saved me a lot of time in tracking the progress made over that period.
