XFS In Rapid Development

Jeff Liu

"We have many requests to provide a supported option for XFS on Oracle Linux" – Oracle Linux Blog

Feb 28, 2013

2 About This Talk

• Introduction
  - About XFS
  - XFS Development Community
• How Fast XFS Is Going
  - Kernel changes (> Linux 3.0)
  - User space programs
  - XFS test suite
• Upcoming Features
  - Kernel and user space
  - Preview of the self-describing metadata

3 About XFS

• Full 64-bit journaling file system
• Well known for high performance and scalability
• Maximum filesystem/file size: 16 EiB / 8 EiB
• Variable block sizes: 512 bytes to 64 KB
• Freeze/thaw to support volume-level snapshots - xfs_freeze(8) (see the sketch below)
• Online filesystem/file defragmentation - xfs_fsr(8)
• Online filesystem resize - xfs_growfs(8)
• Internal log space / external log volume
• Realtime subvolume - provides very deterministic data rates suitable for media streaming applications
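For illustration, filesystem freeze/thaw can be driven from the generic FIFREEZE/FITHAW ioctls, which is what xfs_freeze(8) boils down to on current kernels; a minimal sketch (the /xfs mount point is an example, root privileges required):

#include <fcntl.h>
#include <linux/fs.h>    /* FIFREEZE, FITHAW */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/xfs", O_RDONLY);  /* any fd on the filesystem root */
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, FIFREEZE, 0) < 0)   /* block writes, flush the log */
                perror("FIFREEZE");
        /* ... take the volume-level snapshot here ... */
        if (ioctl(fd, FITHAW, 0) < 0)     /* resume normal operation */
                perror("FITHAW");

        close(fd);
        return 0;
}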

4 XFS Development Community

• Developers from corporations - SGI, Red Hat, Oracle, SUSE, IBM
• Main contributors, in alphabetical order - Dave Chinner, Christoph Hellwig
• Preeminent individual contributors - Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen, Jan Kara, Jeff Liu, Mark Tinguely
• Maintainer - Ben Myers @ SGI
• Join us via the mailing list (xfs@oss.sgi.com) and IRC channel (#xfs on irc.freenode.net)
• Newcomers are always welcome!

5 How Fast XFS Is Going

Statistics of code changes between Linux v3.0 and v3.10-rc1 (Jul 21 2011 - May 11 2013): Btrfs / Ext4 with JBD2 / XFS

$ git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/[btrfs|xfs|ext4 with jbd2]

[Bar chart: the number of files changed, insertions and deletions for Ext4&JBD2, XFS and Btrfs, Linux v3.0 ~ v3.10-rc1]

6 How Fast XFS Is Going

• XFS changes were made up of:
  - Improvements - performance/scalability improvements, code base refactoring
  - New features - anything new
  - Bug fixes
  - Misc - trivial fixes, code style adjustments, dead code cleanups

7 How Fast XFS Is Going

The proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the number of patches

[Pie chart: Improvement / New feature / Bug fix / Misc]

8 How Fast XFS Is Going

The proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on lines changed (+/-)

[Pie chart: Improvement / New feature / Bug fix / Misc]

9 How Fast XFS Is Going

• Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013)
  - 15 contributors
  - 106 patches

$ git diff --stat --minimal -C -M v3.1.6 v3.1.11 | grep changed
108 files changed, 11113 insertions(+), 11418 deletions(-)

10 How Fast XFS Is Going

• XFS test suite - xfstests
  - A generic test tool for Linux local filesystems
  - 300+ test cases overall
  - 170+ XFS-specific test cases
• Test cases are well organized for different filesystems

$ ls -l xfstests/tests/
btrfs/ ext4/ generic/ Makefile shared/ udf/ xfs/

11 Speedup Direct-IO R/W On High IOPS Devices

• XFS inode locking modes, e.g. shared/exclusive
  - The naming convention is inherited from SGI IRIX
  - Equivalent to the read/write lock modes on Linux
• Issues faced before Linux 3.2
  - The exclusive lock range is too extensive
  - Concurrent direct-IO reads are serialized on the page cache check
  - The exclusive lock mode is used for direct-IO writes by default

12 Speedup Direct-IO R/W On High IOPS Devices

• Solutions (see the sketch below)
  - Use the shared lock for direct-IO reads; take the exclusive mode only if page invalidation is needed
  - Use the shared lock for direct-IO writes by default; take the exclusive lock during IO submission if extent allocation is required
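The shape of this change, as a user-space analogue: the pthread rwlock below stands in for the XFS iolock, and the page-cache check is illustrative rather than the actual kernel logic.

#include <pthread.h>
#include <stdbool.h>

/* Analogue of the direct-IO read path described above: take the lock
 * shared by default, and only reacquire it exclusively when the page
 * cache must be flushed and invalidated first. */
static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;

static void dio_read(bool pages_cached)
{
        pthread_rwlock_rdlock(&iolock);          /* shared: reads run concurrently */
        if (pages_cached) {
                pthread_rwlock_unlock(&iolock);  /* upgrade to exclusive */
                pthread_rwlock_wrlock(&iolock);
                /* re-check the condition: it may have changed while unlocked */
                /* ... flush and invalidate cached pages ... */
                /* (the kernel demotes back to shared before issuing the IO) */
        }
        /* ... submit the direct IO ... */
        pthread_rwlock_unlock(&iolock);
}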

13 Speedup Direct-IO R/W On High IOPS Devices

FIO scenario (fio version 2.1):

direct=1
rw=randrw
bs=4k
size=10G
numjobs=10   # [20, 40, 80]
runtime=120
thread
ioengine=psync

Simplified xfs_info(8) output, storage formatted with default options:

Metadata: isize=256 agcount=4 agsize=937408 blks sectsz=512
Data:     bsize=4096 blocks=3749632 sunit=0 swidth=0 blks
Log:      internal bsize=4096 blocks=2560 version=2

14 Speedup Direct-IO R/W On High IOPS Devices

XFS read IOPS, SATA3 SSD: vanilla 3.7.0 vs 2.6.39 in delaylog mode

[Bar chart: input/output operations per second at 10, 20, 40 and 80 threads, comparing 2.6.39 and 3.7.0]

15 Speedup Direct-IO R/W On High IOPS Devices

XFS write IOPS, SATA3 SSD: vanilla 3.7.0 vs 2.6.39 in delaylog mode

[Bar chart: input/output operations per second at 10, 20, 40 and 80 threads, comparing 2.6.39 and 3.7.0]

16 Sync Story

• Improved concurrency for fsync(2) on files
  - Unlock the inode before the log force
• Optimizations for fsync(2) on directories (see the example below)
  - Directories are only updated transactionally
  - No file data needs to be flushed
  - Disk caches need not be flushed except as part of a transaction commit
• Improved sync behavior in the face of aggressive dirtying
  - XFS no longer writes data out twice per filesystem sync, which used to override the livelock protection in the core writeback code
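For context, cheap directory fsync matters for the common durability pattern of fsyncing the parent directory after creating a file; a minimal sketch, with example paths:

#define _GNU_SOURCE              /* for O_DIRECTORY */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Persist the file data first, then persist the directory entry itself. */
int main(void)
{
        int fd = open("/xfs/dir/file", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open file"); return 1; }
        /* ... write data ... */
        fsync(fd);                        /* flush file data and metadata */
        close(fd);

        int dirfd = open("/xfs/dir", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) { perror("open dir"); return 1; }
        fsync(dirfd);                     /* on XFS, a pure transactional update */
        close(dirfd);
        return 0;
}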

17 Sync Story

• The xfssyncd workqueue was removed; instead:
  - A new dedicated workqueue for inode reclaim
  - A new dedicated workqueue for log operations
  - The periodic sync work is now only the log work, driven by the xfssyncd_centisecs sysctl

18 Efficient Sparse File Handling

• SEEK_DATA/SEEK_HOLE options to lseek(2) (see the example below)
  - Derived from Solaris ZFS
  - A neater call interface than the FIEMAP ioctl(2)
• Use scenarios
  - GNU tar(1), etc.
  - Virtual machine image (Xen, KVM) backup
  - Sparse file detection
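A short sketch of walking a sparse file's data segments with the new interface; the file path is an example, and _GNU_SOURCE is needed for the SEEK_DATA/SEEK_HOLE definitions:

#define _GNU_SOURCE              /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Enumerate the data segments of a sparse file by alternating
 * lseek(SEEK_DATA) and lseek(SEEK_HOLE); the loop ends with ENXIO
 * once no further data exists. */
int main(void)
{
        int fd = open("/xfs/sparse", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t data = 0;
        while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
                off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of this segment */
                printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
                data = hole;
        }
        close(fd);
        return 0;
}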

19 Efficient Sparse File Handling

• Refinement for unwritten extents
• Create a sparse file with unwritten extents mixed with data and holes:

#!/bin/bash

xfs_io -F -f -c "falloc 0 10G" /xfs/sparse

for i in $(seq 0 30 120); do
    offset=$(($i * $((1 << 20))))
    xfs_io -c "pwrite $offset 500m" /xfs/sparse
done

20 Efficient Sparse File Handling

• Layout of the created sparse file

$ filefrag -v sparse
Filesystem type is: 58465342
File size of sparse is 10737418240 (2621440 blocks, blocksize 4096)
 ext   logical  physical  expected   length  flags
   0         0  43547551             151040
   1    151040  43698591            1946111  unwritten
   2   2097151  43008572  45644702   524289  unwritten,eof
sparse: 2 extents found

21 Efficient Sparse File Handling

Sparse file copy via xfstests/seek_copy_test on a laptop with a normal SATA disk, with and without the unwritten extent refinement

[Bar chart: copy time in seconds (0-140 scale), Improved vs. Non-improved]

22 Quota Improvements

• XFS disk quota supports
  - User quota
  - Group quota
  - Project quota - directory tree quota (limits disk usage per directory)

23 Quota Improvements

• Bad scalability when searching tens of thousands of in-memory dquots. Why?
  - User/group/project dquots are stored in a global hash table shared between all file systems
  - A hash table is O(n) worst case for search/insert/delete, while a radix tree is O(k) worst case for insertion and deletion
• Solutions (see the lookup sketch below)
  - Replace the global hash table with per-filesystem radix trees
  - Replace the global dquot LRU list with per-filesystem lists
  - Remove the global xfs_Gqm structure
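A kernel-style sketch of what a per-filesystem dquot lookup looks like after this change; the helper and field names are illustrative and simplified from the upstream idea, not the exact sources:

/* Look up a user dquot in this filesystem's own radix tree rather
 * than in a hash table shared by every mounted XFS filesystem. */
static struct xfs_dquot *
xfs_dquot_find(struct xfs_mount *mp, xfs_dqid_t id)
{
        struct xfs_quotainfo *qi = mp->m_quotainfo;
        struct xfs_dquot *dqp;

        mutex_lock(&qi->qi_tree_lock);
        dqp = radix_tree_lookup(&qi->qi_uquota_tree, id);  /* O(key length) */
        /* reference counting and dquot locking elided for brevity */
        mutex_unlock(&qi->qi_tree_lock);
        return dqp;
}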

24 Fighting With Process 8K Stack Space Limitation

• 8K process stack space for x86_64 in Linux 2.6 by default
  - Every process has a dedicated kernel stack
  - Kernel stacks are a fixed size and cannot be expanded as required
  - They cannot be swapped
• Extreme stack use in the Linux VM/VFS call chain
• The old problem for XFS
  - Significant issues with the amount of stack that allocation in XFS uses, especially in memory-reclaim situations (the writeback path)

25 Fighting With Process 8K Stack Space Limitation

• Buffer cache misses that trigger I/O vs. CPU cache misses
• Solution (see the sketch below)
  - Reduce stack use in the allocation call chain, e.g. delayed allocation
  - Move allocations to a worker thread combined with a completion
  - Avoid the context-switch overhead when an allocation request comes in with plenty of stack remaining
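A kernel-style sketch of the worker-plus-completion hand-off described above, modeled on the idea rather than the exact upstream code; the struct and workqueue names are illustrative:

/* Queue the allocation to a workqueue thread, which runs it on a
 * fresh stack, and wait for the completion instead of recursing
 * deeper on the caller's nearly exhausted stack. */
struct xfs_alloc_args_work {
        struct work_struct work;
        struct completion  done;
        /* ... allocation arguments and result ... */
};

static void xfs_alloc_worker(struct work_struct *work)
{
        struct xfs_alloc_args_work *aw =
                container_of(work, struct xfs_alloc_args_work, work);

        /* ... perform the extent allocation on the worker's stack ... */
        complete(&aw->done);
}

static void xfs_alloc_via_worker(struct xfs_alloc_args_work *aw)
{
        INIT_WORK_ONSTACK(&aw->work, xfs_alloc_worker);
        init_completion(&aw->done);
        queue_work(xfs_alloc_wq, &aw->work);  /* xfs_alloc_wq: dedicated workqueue */
        wait_for_completion(&aw->done);
}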

26 Bounds Checking Enabled XFS Kernel

• Alternative CONFIG_XFS_WARN support
  - Depends on XFS_FS && !XFS_DEBUG
  - Converts ASSERT() checks into WARN_ON(1)
  - Does not modify algorithms
  - Does not cause the kernel to panic on non-fatal errors
  - Makes it easier to find strange "out of bounds" problems
  - Already turned on in Fedora kernel-debug packages
• Suggest applying this feature in other Linux distributions with XFS support

27 Bounds Checking Enabled Kernel

• XFS with CONFIG_XFS_DEBUG
  - A very efficient buddy for developers
  - Weak points from a user perspective:
    . Significant overhead in production environments
    . Changes the behavior of algorithms (such as allocation) to improve test coverage, e.g. xfs_alloc_ag_vextent_near()
    . Intentionally panics the machine on non-fatal errors, by design
• Only advisable for debugging purposes

28 Misc Changes

• Mount options
  - The nodelaylog mode was removed; delaylog is the default (>= Linux 3.3)
  - inode64 is re-mountable
  - inode32 is re-mountable
• Speculative preallocation improvements
  - Trim speculative preallocation near ENOSPC, near quota limits, and for sparse files
• Discontiguous buffers
  - Virtually contiguous in the buffer, but non-contiguous on disk

29 Upcoming – Self Describing Metadata Preview

30 Upcoming – Self Describing Metadata Preview

• XFS utilities for forensic analysis of the file system structures
  - xfs_repair(8)
  - xfs_db(8)
• Analyzing the structure of 100TB to 1PB storage :(
• Primary concern for supporting PB-scale file systems
  - Minimize the time and effort required for basic forensic analysis of the file system structures

31 Self Describing Metadata Preview

• Problems with the current metadata format
  - A magic number is the only means of identification
  - AGFL blocks, remote symlink blocks and remote attribute blocks lack even an identifying magic number

32 Self Describing Metadata Preview

• Additional information needs to be recorded
  - CRC32c validation
  - Filesystem identifier
  - The owner of a metadata block
  - Logical sequence number (LSN) of the most recent transaction

33 Self Describing Metadata Preview

• The typical on-disk structure

struct xfs_ondisk_hdr {
        __be32  magic;   /* magic number */
        __be32  crc;     /* CRC, not logged */
        uuid_t  uuid;    /* filesystem identifier */
        __be64  owner;   /* parent object */
        __be64  blkno;   /* location on disk */
        __be64  lsn;     /* last modification in log, not logged */
};

34 Self Describing Metadata Preview

• Additional information format
  - Varies according to the type of metadata block
• Runtime validation (see the sketch below)
  - Immediately after a successful read from disk
  - Immediately prior to write IO submission
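A sketch of what the read-side check can look like: recompute the CRC32c over the block while skipping the stored checksum field, then compare with the on-disk value. The helper name and the exact seeding/inversion convention here are illustrative, not the upstream routine:

/* Verify a metadata block's checksum immediately after reading it. */
static bool
xfs_hdr_crc_ok(const void *block, size_t len, size_t crc_off)
{
        const char *p = block;
        const __be32 *stored = (const __be32 *)(p + crc_off);
        u32 crc;

        /* checksum the block before and after the crc field itself */
        crc = crc32c(~0U, p, crc_off);
        crc = crc32c(crc, p + crc_off + sizeof(__be32),
                     len - crc_off - sizeof(__be32));
        return *stored == cpu_to_be32(~crc);
}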

35 Self Describing Metadata Preview

• Compatibility
  - No forward compatibility: existing filesystems will not support the new disk format
  - No backward compatibility: old kernels and userspace will not be able to read the new format
  - Kernels and userspace that support the new format will still work just fine with the old, non-CRC-enabled format
  - Two different, incompatible disk formats are supported from this point onwards
  - Tools to convert the format of an existing file system will not be provided

36 Upcoming - Kernel

• Splitting project quota support from group quota support
  - The same quota inode is currently used for project and group quotas
  - Introduce a third quota inode for project quota so that both can be enabled at the same time
• Online shrink
  - An initial patch set has already been posted to the mailing list; it depends on xfs_agstate(8) with kernel changes, as well as xfs_reno(8)
  - Moving the internal log blocks remains a challenge

37 Upcoming – User space

• xfs_reno(8)
  - Allows an inode64 filesystem to be converted to inode32
  - The file system has to be mounted with inode32 in advance
• xfs_agstate(8)
  - Allows turning an allocation group offline or back online

38 XFS's 20th birthday is coming in October :)

Thank you!

39 References

• http://xfs.org/index.php/XFS_status_update_for_2011
• http://xfs.org/index.php/XFS_status_update_for_2012
• http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html
• http://oss.sgi.com/archives/xfs/2013-04/msg00100.html
• http://lwn.net/Articles/476267/
• http://lwn.net/Articles/476263/
• http://lwn.net/Articles/84583/
• http://en.wikipedia.org/wiki/Hash_table

40 Acknowledgments

• Thanks to the reviewers of this document for their helpful comments, in alphabetical order: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely
• I would like to thank Christoph Hellwig for writing the XFS development status update for every official Linux release between 2011 and 2012; those updates saved me a lot of time in tracking the progress made over that period.
