
OpenZFS Performance Improvements
LUG Developer Day 2015
April 16, 2015

Brian Behlendorf

LLNL-PRES-669602

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Agenda

• I/O Performance
  • Large Block Feature
• Metadata Performance
  • Large DNode Feature
• Planned Performance Investigations

I/O Performance – Large Blocks

All objects in ZFS are divided into blocks

Virtual Devices (VDEVs)

[Diagram: VDEV tree – a root vdev at the top, top-level mirror and RAIDZ vdevs below it, and leaf devices holding data (and parity, for RAIDZ) at the bottom]

Where blocks are stored is controlled by the VDEV topology
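A minimal sketch of how these topologies are built; the pool and device names below are placeholders, not the configuration used in these tests.

    # Two-way mirror top-level vdev: every block is written to both leaf disks.
    zpool create tank mirror /dev/sdb /dev/sdc

    # Or a single-parity RAIDZ top-level vdev: each block is split into data
    # and parity sectors spread across the leaf disks.
    zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

    # Inspect the resulting vdev tree (root -> top-level -> leaf).
    zpool status tank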

Why is 128K the Max Block Size?

• Historical reasons
  • Memory and disk capacities were smaller 10 years ago
  • 128K blocks were enough to saturate the hardware
• Good for a wide variety of workloads
• Good for a range of configurations
• Performed well enough for most users

But times have changed and it’s now limiting performance

Simulated Lustre OSS Workload

• fio – Flexible I/O Tester
  • 64 threads
  • 1M I/O size
  • 4G files
  • Random read/write workload
• All tests run after import to ensure a cold cache
• All default ZFS tunings, except:
  • zfs_prefetch_disable=1
  • set recordsize={4K..1M} pool/dataset

Workload designed to benchmark a Lustre OSS
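A rough sketch of this setup, assuming a dataset named pool/dataset mounted at /pool/dataset; the actual fio job file from the talk is not reproduced here, so the options below are illustrative only.

    # Export and re-import the pool before each run to ensure a cold cache.
    zpool export pool && zpool import pool

    # Disable prefetch and select the record size under test (swept from 4K to 1M).
    echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
    zfs set recordsize=128K pool/dataset

    # 64 threads doing 1M random reads/writes against 4G files.
    fio --name=oss-sim --directory=/pool/dataset \
        --rw=randrw --bs=1M --size=4G --numjobs=64 \
        --ioengine=psync --group_reporting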

Random 1M R/W – 2-Disk Mirror

[Chart: MB/s vs ZFS block size; series: Read Mirror, Write Mirror]

Read bandwidth increases with block size

Random 1M R/W – 10-Disk RAIDZ

[Chart: MB/s vs ZFS block size; series: Read/Write for RAIDZ, RAIDZ2, RAIDZ3]

Read and write bandwidth increase with block size

Block Size Tradeoffs

• Bandwidth vs. IOPS
  • Larger blocks give better bandwidth
  • Smaller blocks give better IOPS
• ARC memory usage
  • Small and large blocks have the same fixed overhead
  • Larger blocks may result in unwanted cached data
  • Larger blocks are harder to allocate memory for
• Overwrite cost
• Checksum cost

It’s all about tradeoffs

Block Size Tradeoffs

• Disk fragmentation and allocation cost
  • Gang blocks
• Disk space usage
  • Snapshots
• Compression ratio
  • Larger blocks compress better
• De-duplication ratio
  • Smaller blocks de-duplicate better

The “right” block size depends on your needs

Let's Increase the Block Size

• The 128K limit
  • Limited only by the original ZFS implementation
  • The on-disk format supports up to 16M blocks
• New "Large Block" feature flag (usage sketch below)
• Easy to implement for prototyping / benchmarking
• Hard to polish for real-world production use
  • Cross-platform compatibility
  • 'zfs send/recv' compatibility
  • Requires many subtle changes throughout the code base
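For reference, a sketch of how the feature is used once a release that includes it is installed; the upstream flag is named large_blocks, and the pool/dataset names are placeholders.

    # Enable the feature flag on the pool (a one-way upgrade).
    zpool set feature@large_blocks=enabled pool

    # Datasets may then use record sizes larger than 128K.
    zfs set recordsize=1M pool/dataset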

We expect bandwidth to improve…

Random 1M R/W – 2-Disk Mirror

[Chart: MB/s vs ZFS block size; series: Read Mirror, Write Mirror]

63% performance improvement for 1M random reads!

Random 1M R/W – 10-Disk RAIDZ

[Chart: MB/s vs ZFS block size; series: Read/Write for RAIDZ, RAIDZ2, RAIDZ3]

100% performance improvement for 1M random reads!

I/O Performance Summary

• 1M blocks can improve random read performance by a factor of 2x for RAIDZ
• Blocks up to 16M are possible
  • Requires ARC improvements (issue #2129)
  • Early results show 1M blocks may be the sweet spot
  • >1M blocks may help large RAIDZ configurations
• Planned for the next OpenZFS release
• Large blocks already merged upstream

Large blocks can improve performance

Metadata Performance – Large Dnodes

• MDT stores file striping in an extended attribute
• Optimized for LDISKFS
  • 1K inode with inline xattr
  • Single I/O per xattr

[Diagram: 1024B inode in the inode table with an inline xattr (required block), plus an optional external 4K xattr block]

LDISKFS was optimized for Lustre long, long ago

File Forks used on illumos / FreeBSD

• Linux "xattr=on" property
  • Maps each xattr to a file fork
  • No compatibility issues
  • No limit on xattr size
  • No limit on the number of xattrs
• Three I/Os per xattr

[Diagram: 512B dnode with bonus space inside a 16K dnode block (required), pointing to an xattr ZAP and individual xattr objects (optional blocks)]

The need for fast xattrs was known early in development

Extended Attributes in SAs

• Goal is to reduce I/Os
• Linux "xattr=sa" property
  • Maps xattrs to system attributes (SAs)
  • 1 block of SAs (128K…)
• Much faster
• Two I/Os per xattr

[Diagram: 512B dnode with bonus space inside a 16K dnode block (required); SA xattrs are stored in the bonus space and an optional spill block, with the xattr ZAP and xattr blocks as optional fallbacks]

The first step was to store xattrs as a system attribute (SA)
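A quick sketch of the two Linux xattr modes discussed above; the dataset name is a placeholder.

    # Directory-based xattrs (default): each xattr is stored as a hidden file
    # fork, matching the illumos/FreeBSD on-disk layout.
    zfs set xattr=on pool/dataset

    # SA-based xattrs: store xattrs as system attributes in the dnode's
    # bonus/spill area, cutting the number of I/Os per xattr.
    zfs set xattr=sa pool/dataset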

Large DNode Feature

[Diagram: side-by-side comparison of a 512B dnode and a 512B–16K dnode, both inside a 16K dnode block; with a larger dnode the bonus space can hold the SA xattrs directly, so the optional spill block, xattr ZAP, and xattr blocks are no longer needed]

Only a single I/O is needed to read multiple dnodes with xattrs
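As a forward-looking sketch (the interface was still in development at the time of this talk), the feature as eventually merged upstream is controlled roughly like this; the property and flag names reflect that later implementation, and the dataset name is a placeholder.

    # Enable the large_dnode feature flag on the pool.
    zpool set feature@large_dnode=enabled pool

    # Size each new dnode automatically so SA xattrs fit in the bonus space,
    # avoiding the spill block (fixed sizes such as 1k-16k also work).
    zfs set dnodesize=auto pool/dataset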

Metadata Workload

• xattrtest – performance and correctness test
  • https://github.com/behlendorf/xattrtest
  • 4096 files per process, 512-byte xattr per file
  • 64 processes
• All tests run after import to ensure a cold cache
• 16-disk (HDD) mirror pool
• All default ZFS tunings
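xattrtest's own command-line options are not listed here; the per-file pattern it stresses can be sketched with the standard setfattr/getfattr tools (names and paths below are placeholders).

    touch /pool/dataset/file.0
    setfattr -n user.test -v "$(printf 'x%.0s' {1..512})" /pool/dataset/file.0   # ~512-byte xattr
    getfattr -n user.test /pool/dataset/file.0
    rm /pool/dataset/file.0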

Xattrtest Performance

[Chart: operations per second for Create/s, Setxattr/s, Getxattr/s, and Unlink/s; series: v0.6.4 xattr=on dnodesize=512, v0.6.4 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=1024]

ARC Memory Footprint

[Chart: ARC Size (%) and ARC Blocks (%); series: v0.6.3 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=1024]

16x fewer blocks reduces I/O and saves ARC memory

mds_survey.sh

[Chart: Create/s and Destroy/s; series: v0.6.3 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=1024]

30% performance improvement for creates and destroys

Large DNode Performance Summary

• Improves performance across the board
• Total I/O required for cold files is reduced
• ARC
  • Fewer cached blocks (no spill blocks)
  • Reduced memory usage
  • Smaller MRU/MFU results in faster memory reclaim and reduced lock contention in the ARC
• Reduces pool fragmentation

Increasing the dnode size has many benefits

Large DNode Work in Progress

• Developed by Prakash Surya and Ned Bass
• Conceptually straightforward but challenging
• Undergoing rigorous testing
  • ztest, zfs-stress, xattrtest, mds_survey, ziltest
• 80% done, under active development:
  • ZFS Intent Log (ZIL) support
  • 'zfs send/recv' compatibility
• Patches will be submitted for upstream review

Planned Performance Investigations

• Support the ZFS Intent Log (ZIL) in Lustre
• Improve Lustre's use of ZFS-level prefetching
• Optimize Lustre-specific objects (llogs, etc.)
• Explore TinyZAP for the Lustre namespace
• More granular ARC locking (issue #3115)
• Scatter-gather list backed ARC buffers (issue #2129)
• Hardware-optimized checksums / parity

There are many areas where Lustre MDT performance can be optimized

Questions?

SPL-0.6.4 / ZFS-0.6.4 Released

• Released April 8th, 2015. Packages available for:
• Six New Feature Flags
  • *Spacemap Histograms
  • Extensible Datasets
  • Bookmarks
  • Enabled TXGs
  • Hole Birth
  • *Embedded Data

SPL-0.6.4 / ZFS-0.6.4 Released

• New Functionality
  • Asynchronous I/O (AIO) support
  • Hole punching via fallocate(2)
  • Two new properties (redundant_metadata, overlay; example below)
  • Various enhancements to command tools
• Performance Improvements
  • 'zfs send/recv'
  • Spacemap histograms
  • Faster memory reclaim
• Over 200 Bug Fixes
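The two new properties are set like any other dataset property; a brief sketch, with example values and a placeholder dataset name.

    # Keep only the most important extra metadata copies (the default is "all").
    zfs set redundant_metadata=most pool/dataset

    # Allow the dataset to be mounted on top of a non-empty directory.
    zfs set overlay=on pool/dataset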


New KMOD Repo for RHEL/CentOS

• DKMS and KMOD packages for EPEL 6
  • Versions: spl-0.6.4, zfs-0.6.4, lustre-2.5.3
• Enable the repository (see the sketch below)
  • vim /etc/yum.repos.d/zfs.repo
  • Disable the default zfs (DKMS) repository
  • Enable the zfs-kmod repository
• Install ZFS, Lustre server, or Lustre client
  • yum install zfs
  • yum install lustre-osd-zfs
  • yum install lustre
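The repository switch can also be sketched non-interactively with yum-config-manager instead of editing zfs.repo by hand; the repository ids ([zfs], [zfs-kmod]) are assumed to match the zfs.repo shipped for EPEL 6, so follow zfsonlinux.org/epel if they differ.

    # Disable the default DKMS repository and enable the KMOD repository.
    yum install yum-utils
    yum-config-manager --disable zfs
    yum-config-manager --enable zfs-kmod

    # Then install the desired packages.
    yum install zfs                  # ZFS only
    yum install lustre-osd-zfs       # Lustre server (ZFS OSD)
    yum install lustre               # Lustre client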

See zfsonlinux.org/epel for complete instructions
