OpenZFS Performance Improvements
LUG Developer Day 2015
April 16, 2015
Brian Behlendorf
LLNL-PRES-669602

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Agenda
. I/O Performance
  • Large Block Feature
. Metadata Performance
  • Large Dnode Feature
. Planned Performance Investigations
I/O Performance – Large Blocks
All objects in ZFS are divided into blocks
Virtual Devices (VDEVs)
[Diagram: VDEV tree with a root VDEV at the top, top-level mirror and RAIDZ VDEVs beneath it, and leaf devices holding data and parity]
Where blocks are stored is controlled by the VDEV topology
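The two topologies in the diagram can be sketched with `zpool create`; this is illustration only, and the pool names and `/dev/sd*` device paths below are hypothetical placeholders:

```shell
# Sketch only: pool names and devices are placeholders.
# A 2-disk mirror top-level VDEV:
zpool create tank mirror /dev/sda /dev/sdb

# A single-parity RAIDZ top-level VDEV over five disks:
zpool create tank2 raidz /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

# Inspect the resulting VDEV tree:
zpool status tank
```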
Why is 128K the Max Block Size?
. Historical reasons
  • Memory and disk capacities were smaller 10 years ago
  • 128K blocks were enough to saturate the hardware
. Good for a wide variety of workloads
. Good for a range of configurations
. Performed well enough for most users
But times have changed and it’s now limiting performance
Simulated Lustre OSS Workload
. fio – Flexible I/O Tester
  • 64 threads
  • 1M I/O size
  • 4G files
  • Random read/write workload
. All tests run after import to ensure a cold cache
. All default ZFS tunings, except:
  • zfs_prefetch_disable=1
  • zfs set recordsize={4K..1M} pool/dataset
Workload designed to benchmark a Lustre OSS
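A sketch of the benchmark setup described above; the pool/dataset name and mount point are placeholders, and the exact fio job file used for the talk is not given, so the options below are one plausible rendering of the listed parameters:

```shell
# Placeholder dataset; recordsize was swept from 4K to 1M in the tests.
zfs set recordsize=1M tank/fio

# Disable ZFS prefetch, as in the test configuration:
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# 64 threads, 1M I/Os, 4G file per job, mixed random read/write:
fio --name=oss-sim --directory=/tank/fio \
    --rw=randrw --bs=1M --size=4G --numjobs=64 \
    --ioengine=psync --group_reporting
```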
Random 1M R/W – 2-Disk Mirror

[Chart: MB/s vs. ZFS block size; read and write bandwidth for the 2-disk mirror, roughly 0–180 MB/s]
Read bandwidth increases with block size
Random 1M R/W – 10-Disk RAIDZ

[Chart: MB/s vs. ZFS block size; read and write bandwidth for RAIDZ, RAIDZ2, and RAIDZ3, roughly 0–1400 MB/s]
Read and write bandwidth increase with block size
Block Size Tradeoffs
. Bandwidth vs. IOPS
  • Larger blocks: better bandwidth
  • Smaller blocks: better IOPS
. ARC memory usage
  • Small and large blocks have the same fixed overhead
  • Larger blocks may result in unwanted cached data
  • Larger blocks are harder to allocate memory for
. Overwrite cost
. Checksum cost
It’s all about tradeoffs
Block Size Tradeoffs
. Disk fragmentation and allocation cost
  • Gang blocks
. Disk space usage
  • Snapshots
. Compression ratio
  • Larger blocks compress better
. De-duplication ratio
  • Smaller blocks de-duplicate better
The “right” block size depends on your needs
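The fixed-overhead point above can be made concrete with back-of-the-envelope arithmetic: caching the same amount of data in larger blocks means fewer per-block headers. The 256-byte header size here is an assumed round number for illustration, not the real size of ZFS's per-block ARC structures:

```shell
DATA_BYTES=$((1024 * 1024 * 1024))   # 1 GiB of cached file data
HDR_BYTES=256                        # assumed fixed per-block overhead

# Compare 128K and 1M record sizes:
for RECORDSIZE in $((128 * 1024)) $((1024 * 1024)); do
    BLOCKS=$((DATA_BYTES / RECORDSIZE))
    OVERHEAD=$((BLOCKS * HDR_BYTES))
    echo "recordsize=$((RECORDSIZE / 1024))K blocks=$BLOCKS overhead=$((OVERHEAD / 1024))KiB"
done
```

With these assumed numbers, 1M blocks need 8x fewer headers than 128K blocks for the same cached data.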
Let's Increase the Block Size
. The 128K limit
  • Imposed only by the original ZFS implementation
  • The on-disk format supports up to 16M blocks
. New "Large Block" feature flag
. Easy to implement for prototyping / benchmarking
. Hard to polish for real-world production use
  • Cross-platform compatibility
  • 'zfs send/recv' compatibility
  • Requires many subtle changes throughout the code base
We expect bandwidth to improve…
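A sketch of how the feature is used once available; the pool and dataset names are placeholders, and the flag name matches the upstream large_blocks feature flag:

```shell
# Feature flags are enabled per pool; large_blocks becomes active
# the first time a block larger than 128K is written:
zpool set feature@large_blocks=enabled tank

# Raise the dataset record size past the old 128K cap:
zfs set recordsize=1M tank/ost0
zfs get recordsize tank/ost0
```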
Random 1M R/W – 2-Disk Mirror

[Chart: MB/s vs. ZFS block size with large blocks enabled; mirror read and write bandwidth, roughly 0–250 MB/s]
63% performance improvement for 1M random reads!
Random 1M R/W – 10-Disk RAIDZ

[Chart: MB/s vs. ZFS block size with large blocks enabled; RAIDZ, RAIDZ2, and RAIDZ3 read and write bandwidth, roughly 0–1400 MB/s]
100% performance improvement for 1M random reads!
I/O Performance Summary
. 1M blocks can improve random read performance by a factor of 2x for RAIDZ
. Blocks up to 16M are possible
  • Requires ARC improvements (issue #2129)
  • Early results show 1M blocks may be the sweet spot
  • >1M blocks may help large RAIDZ configurations
. Planned for the next Linux OpenZFS release
. Large blocks already merged upstream
Large blocks can improve performance
Metadata Performance – Large Dnodes
. MDT stores file striping in an extended attribute
. Optimized for LDISKFS
. 1K inodes with inline xattrs
. Single I/O per xattr

[Diagram: inode table of 1024B inodes with the xattr stored inline; an optional external 4K xattr block]
LDISKFS was optimized for Lustre long ago
File Forks used on illumos / FreeBSD
. Linux "xattr=on" property
  • Maps each xattr to a file fork
  • No compatibility issues
  • No limit on xattr size
  • No limit on number of xattrs
. Three I/Os per xattr

[Diagram: 512B dnode in a 16K dnode block; the bonus space points to an xattr ZAP, which references separate xattr objects]
The need for fast xattrs was known early in development
Extended Attributes in SAs
. Goal is to reduce I/Os
. Linux "xattr=sa" property
  • Maps xattrs to SAs
  • 1 block of SAs (128K…)
. Much faster
. Two I/Os per xattr

[Diagram: 512B dnode in a 16K dnode block; the bonus space holds the xattr ZAP, and a spill block holds the xattrs]
The first step was to store xattrs as a system attribute (SA)
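Switching a dataset to SA-based xattrs is a one-line property change; the dataset name below is a placeholder:

```shell
# xattr=sa stores extended attributes as system attributes in the
# dnode's bonus/spill area instead of in a hidden xattr directory:
zfs set xattr=sa tank/mdt0
zfs get xattr tank/mdt0
```

Note the property only affects xattrs written after it is set; existing directory-based xattrs are not converted.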
Large Dnode Feature

[Diagram: variable-size 512B–16K dnodes in the dnode block; the enlarged bonus space holds the xattr ZAP and SA xattrs inline, making the spill block optional]
Only a single I/O is needed to read multiple dnodes with xattrs
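A sketch of how the feature is exposed, assuming the dnodesize dataset property and large_dnode feature flag as this work later shipped in ZFS on Linux (at the time of the talk it was still in progress); the pool and dataset names are placeholders:

```shell
# Enable the feature flag on the pool:
zpool set feature@large_dnode=enabled tank

# dnodesize=auto lets ZFS pick a dnode large enough to hold the
# SA xattrs inline; a fixed size (e.g. 1k) can also be set:
zfs set dnodesize=auto tank/mdt0
zfs get dnodesize tank/mdt0
```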
Metadata Workload
. xattrtest – performance and correctness test
  • https://github.com/behlendorf/xattrtest
  • 4096 files per process, one 512-byte xattr per file
  • 64 processes
. All tests run after import to ensure a cold cache
. 16-disk (HDD) mirror pool
. All default ZFS tunings
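The four per-file operations the benchmark measures (create, setxattr, getxattr, unlink) can be mimicked with the standard attr tools; the directory and xattr name below are placeholders, not xattrtest's actual invocation:

```shell
# Sketch only: /tank/xattrtest and user.test are placeholders.
mkdir -p /tank/xattrtest
for i in $(seq 1 4096); do
    f=/tank/xattrtest/file.$i
    touch "$f"                                             # create
    setfattr -n user.test \
        -v "$(head -c 512 /dev/zero | tr '\0' 'x')" "$f"   # setxattr (512B)
    getfattr --only-values -n user.test "$f" > /dev/null   # getxattr
    rm "$f"                                                # unlink
done
```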
Xattrtest Performance

[Chart: creates, setxattrs, getxattrs, and unlinks per second (0–60000) for v0.6.4 xattr=on dnodesize=512, v0.6.4 xattr=sa dnodesize=512, and v0.6.4 xattr=sa dnodesize=1024]
20 Lawrence Livermore National Laboratory LLNL-PRES-669602 ARC Memory Footprint
120 120 v0.6.3 100 100 xattr=sa dnodesize=512 80 80 v0.6.4 xattr=sa 60 60 dnodesize=512 40 40 v0.6.4 xattr=sa 20 20 dnodesize=1024
0 0 ARC Size (%) ARC Blocks (%)
16x fewer blocks reduces I/O and saves ARC memory
mds_survey.sh

[Chart: creates and destroys per second (0–10000) for v0.6.3 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=512, and v0.6.4 xattr=sa dnodesize=1024]
30% performance improvement for creates and destroys
Large Dnode Performance Summary
. Improves performance across the board
. Total I/O required for cold files reduced
. ARC
  • Fewer cached blocks (no spill blocks)
  • Reduced memory usage
  • Smaller MRU/MFU results in faster memory reclaim and reduced lock contention in the ARC
. Reduces pool fragmentation
Increasing the dnode size has many benefits
Large Dnode Work in Progress
. Developed by Prakash Surya and Ned Bass
. Conceptually straightforward, but challenging to implement
. Undergoing rigorous testing
  • ztest, zfs-stress, xattrtest, mds_survey, ziltest
. 80% done; under active development:
  • ZFS Intent Log (ZIL) support
  • "zfs send/recv" compatibility
. Patches will be submitted for upstream review
Planned Performance Investigations
. Support the ZFS Intent Log (ZIL) in Lustre
. Improve Lustre's use of ZFS-level prefetching
. Optimize Lustre-specific objects (llogs, etc.)
. Explore TinyZAP for the Lustre namespace
. More granular ARC locking (issue #3115)
. Page-backed ARC buffers (issue #2129)
. Hardware-optimized checksums / parity
There are many areas where Lustre MDT performance can be optimized
Questions?
SPL-0.6.4 / ZFS-0.6.4 Released
. Released April 8th, 2015. Packages available for:
. Six new feature flags
  • *Spacemap Histograms
  • Extensible Datasets
  • Bookmarks
  • Enabled TXGs
  • Hole Birth
  • *Embedded Data
SPL-0.6.4 / ZFS-0.6.4 Released
. New functionality
  • Asynchronous I/O (AIO) support
  • Hole punching via fallocate(2)
  • Two new properties (redundant_metadata, overlay)
  • Various enhancements to the command-line tools
. Performance improvements
  • 'zfs send/recv'
  • Spacemap histograms
  • Faster memory reclaim
. Over 200 bug fixes
New KMOD Repo for RHEL/CentOS
. DKMS and KMOD packages for EPEL 6
  • Versions: spl-0.6.4, zfs-0.6.4, lustre-2.5.3
. Enable the repository
  • vim /etc/yum.repos.d/zfs.repo
  • Disable the default zfs (DKMS) repository
  • Enable the zfs-kmod repository
. Install ZFS, Lustre server, or Lustre client
  • yum install zfs
  • yum install lustre-osd-zfs
  • yum install lustre
See zfsonlinux.org/epel for complete instructions
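The repository switch above can also be done without hand-editing the file, assuming yum-config-manager (from yum-utils) is installed and the repo IDs are zfs and zfs-kmod as shipped in zfs.repo:

```shell
# Equivalent to editing /etc/yum.repos.d/zfs.repo by hand:
yum-config-manager --disable zfs
yum-config-manager --enable zfs-kmod

# Then install as needed:
yum install zfs              # ZFS only
yum install lustre-osd-zfs   # Lustre server
yum install lustre           # Lustre client
```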