Local file systems update

Lukáš Czerner, Red Hat
February 23, 2013

Copyright © 2013 Lukáš Czerner. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the COPYING file.

Agenda

1 File systems overview
2 Challenges we're facing today
3 What's new in xfs
4 What's new in ext4
5 What's new in btrfs
6 Questions?

Part I: File systems overview

File systems in Linux kernel

The Linux kernel has a number of file systems
  Cluster, network, local
  Special purpose file systems
  Virtual file systems
Interaction with other Linux kernel subsystems
  Memory management layer
  VFS - virtual file system switch
  Optional stackable device drivers (mdraid, device mapper)

The Linux I/O Stack Diagram (version 0.1, 2012-03-06) outlines the Linux I/O stack as of kernel version 3.3

[Diagram: the Linux I/O stack — applications and system calls (stat, read, write, including direct I/O with O_DIRECT), the VFS, block-based file systems (ext4, xfs, btrfs, gfs, ocfs, iso9660, ...), network file systems (NFS, smbfs, ...), pseudo and special-purpose file systems (proc, sysfs, pipefs, ramfs, devtmpfs, usbfs, ...), the page cache, the block I/O layer with optional stackable devices (lvm, mdraid, drbd, device mapper) working on BIOs, I/O schedulers (cfq, deadline, noop) mapping BIOs to requests, device drivers and device mapper targets (dm-multipath), the SCSI upper/mid/low layers with transport classes (scsi_transport_fc, scsi_transport_sas, ...), driver modules (libata, ahci, ata_piix, megaraid_sas, aacraid, qla2xxx, lpfc, virtio_blk, nvme, mtip32xx, iomemory-vsl, ...), and the physical devices (HDD, SSD, DVD, RAID controllers, HBAs, PCIe flash cards)]
The Linux I/O Stack Diagram (version 0.1, 2012-03-06)
http://www.thomas-krenn.com/en/oss/linux-io-stack-diagram.html
Created by Werner Fischer and Georg Schönberger
License: CC-BY-SA 3.0, see http://creativecommons.org/licenses/by-sa/3.0/

Most active local file systems

File system   Commits   Developers   Active developers
Ext4          648       112          13
Ext3          105       43           2
Xfs           650       61           8
Btrfs         1302      114          21

Number of lines of code

[Chart: lines of code over time, 01/2005 to 01/2014, for Ext3, Btrfs, Xfs and Ext4; y-axis from 10,000 to 80,000 lines]

Part II: Challenges we're facing today

Scalability

Common hardware storage capacity increases
  You can buy a single 4TB drive for a reasonable price
  Bigger file systems and file sizes
Common hardware computing power and parallelism increases
  More processes/threads accessing the file system
  Locking issues
I/O stack designed for high-latency, low-IOPS devices
  Similar problems have been solved in the networking subsystem

Reliability

Scalability and reliability are closely coupled problems
Being able to fix your file system
  In reasonable time
  With reasonable memory requirements
Detect errors before your application does
  Metadata checksumming
  Metadata should be self-describing
  Online file system scrub

New types of storage

Non-volatile memory
  Wear levelling more or less solved in firmware
  Block layer has its IOPS limitations
  We can expect bigger erase blocks
Thinly provisioned storage
  Lying to users to get more out of expensive storage
  File systems can throw away most of their locality optimizations
    Cuts down performance
  Device mapper dm-thinp target
Hierarchical storage
  Hide inexpensive slow storage behind expensive fast storage
    Performance depends on working set size
    Improves performance
  Device mapper dm-cache target

Maintainability issues

More file systems with different use cases
  Multiple sets of incompatible applications
  Different sets of features and defaults
  Each file system has different management requirements
Requirements from different types of storage
  SSD
  Thin provisioning
  Bigger sector sizes
Deeper storage technology stack
  mdraid, device mapper, multipath
Having a centralized management tool is incredibly useful
Having a central source of information is a must
System Storage Manager http://storagemanager.sf.net

Part III: What's new in xfs

Scalability improvements

Delayed logging
  Impressive improvements in metadata modification performance
  Single-threaded workloads still slower than ext4, but not by much
  With more threads, scales much better than ext4
  On-disk format change
XFS scales well up to hundreds of terabytes
  Allocation scalability
  Free space indexing
  Locking optimization
Pretty much the best choice for beefy configurations with lots of storage

Reliability improvements

Metadata checksumming
  CRC to detect errors
  Metadata verification as it is written to or read from disk
  On-disk format change
Future work
  Reverse mapping allocation tree
  Online transparent error correction
  Online metadata scrub

Part IV: What's new in ext4

Scalability improvements

Based on a very old architecture
  Free space tracked in bitmaps on disk
  Static metadata positions
  Limited size of allocation groups
  Limited maximum file size (16TB)
  Advantages are a resilient on-disk format and backward and forward compatibility
Some improvements with the bigalloc feature
  Groups a number of blocks into clusters
  Cluster is now the smallest allocation unit
  Trade-off between performance and space utilization
Extent status tree for tracking delayed extents
  No longer need to scan the page cache to find delalloc blocks
Scalability is very much limited by design, on-disk format and backwards compatibility

Reliability improvements

Better memory utilization of user space tools
  No longer stores whole bitmaps - converted to extents
  Biggest advantage for e2fsck
Faster file system creation
  Inode table initialization postponed to the kernel
  Huge time saver when creating bigger file systems
Metadata checksumming
  CRC to detect errors
  Not enabled by default

Part V: What's new in btrfs

Getting stabilized

Performance improvement is not where the focus is right now
  Design-specific performance problems
  Optimization needed in the future
Still under heavy development
  Not all features are ready yet, or even implemented
  Stabilization takes a long time

Reliability in btrfs

Userspace tools not in very good shape
  fsck utility still not fully finished
Neither kernel nor userspace handles errors gracefully
Very good design to build on
  Metadata and data checksumming
  Back references
  Online filesystem scrub

Resources

Linux Weekly News http://lwn.net
Kernel mailing lists http://vger.kernel.org
  linux-fsdevel, linux-ext4, linux-btrfs, linux-xfs
Linux kernel code http://kernel.org
Linux I/O stack diagram http://www.thomas-krenn.com/en/oss/linux-io-stack-diagram.html

The end.

Thanks for listening.