Best Practices for OpenZFS L2ARC in the Era of NVMe
Ryan McKenzie, iXsystems
2019 Storage Developer Conference

Agenda
❏ Brief Overview of OpenZFS ARC / L2ARC
❏ Key Performance Factors
❏ Existing “Best Practices” for L2ARC
    ❏ Rules of Thumb, Tribal Knowledge, etc.
❏ NVMe as L2ARC
❏ Testing and Results
❏ Revised “Best Practices”

ARC Overview
● Adaptive Replacement Cache
● Resides in system memory
● Shared by all pools
● Used to store/cache:
    ○ All incoming data
    ○ “Hottest” data and metadata (a tunable ratio)
● Balances between
    ○ Most Frequently Used (MFU)
    ○ Most Recently Used (MRU)
[Diagram: NIC/HBA, the ARC inside OpenZFS, and a zpool with SLOG vdevs, data vdevs, and L2ARC vdevs on a FreeBSD + OpenZFS server]

L2ARC Overview
● Level 2 Adaptive Replacement Cache
● Resides on one or more storage devices
    ○ Usually flash
● Device(s) added to a pool
    ○ Only services data held by that pool
● Used to store/cache:
    ○ “Warm” data and metadata (about to be evicted from ARC)
    ○ Indexes to L2ARC blocks, stored in ARC headers

ZFS Writes
● All writes go through the ARC; written blocks are “dirty” until they are on stable storage
    ○ An async write ACKs immediately
    ○ A sync write is copied to the ZIL (on the data vdevs or a SLOG), then ACKs
    ○ Dirty blocks are copied to the data vdevs in a TXG
● When no longer dirty, written blocks stay in the ARC and move through the MRU/MFU lists normally
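To make the write path above concrete, here is a minimal Python sketch of the behaviour on the ZFS Writes slide. It is a toy model under stated assumptions, not OpenZFS code: the ToyPool class and its dict/list stand-ins for the ARC, ZIL/SLOG, and data vdevs are invented purely for illustration.

    # Toy model of the ZFS write path described above (not OpenZFS code).
    class ToyPool:
        def __init__(self):
            self.arc = {}          # block_id -> {"data": ..., "dirty": bool}; lives in RAM
            self.zil = []          # sync writes land here (data vdevs or SLOG) before the ACK
            self.data_vdev = {}    # stable storage

        def write(self, block_id, data, sync=False):
            # Every write goes through the ARC and stays dirty until the next TXG.
            self.arc[block_id] = {"data": data, "dirty": True}
            if sync:
                # Sync write: copy to the ZIL/SLOG first, then ACK.
                self.zil.append((block_id, data))
            # Async write: nothing else to do before acknowledging.
            return "ACK"

        def txg_sync(self):
            # TXG commit: dirty blocks are copied to the data vdevs. They stay
            # in the ARC (now clean) and age through MRU/MFU normally.
            for entry_id, entry in self.arc.items():
                if entry["dirty"]:
                    self.data_vdev[entry_id] = entry["data"]
                    entry["dirty"] = False
            self.zil.clear()       # ZIL records are not needed after the commit

    pool = ToyPool()
    pool.write("blk0", b"async data")            # ACKs before reaching stable storage
    pool.write("blk1", b"sync data", sync=True)  # ACKs only after the ZIL/SLOG copy
    pool.txg_sync()                              # both blocks now clean and on the data vdevs

The point the sketch mirrors is the ACK timing: async writes return as soon as the block is in the ARC, sync writes return after the ZIL/SLOG copy, and the TXG commit later lands the data on the data vdevs while the blocks remain cached.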
ZFS Reads
● Requested block is in the ARC
    ○ Respond with the block from ARC
● Requested block is in the L2ARC
    ○ Get the index to the L2ARC block (from ARC)
    ○ Respond with the block from L2ARC
● Otherwise: read miss
    ○ Copy the block from a data vdev to the ARC (demand read; ZFS may also prefetch more blocks to avoid another read miss)
    ○ Respond with the block from ARC
    ○ “ARC promote”: read blocks stay in the ARC and move through the MRU/MFU lists normally

ARC Reclaim
● Called to maintain/make room in the ARC for incoming writes
● The L2ARC Feed runs asynchronously of ARC Reclaim to speed up ZFS write ACKs
● The feed periodically scans the tails of the MRU/MFU lists
    ○ Copies blocks from ARC to L2ARC
    ○ The index to each L2ARC block resides in the ARC headers
    ○ The feed rate is tunable
● Writes to the L2ARC device(s) sequentially (round robin)
[Diagram: MRU and MFU lists from head to tail, with the L2ARC Feed copying blocks near the tails onto the L2ARC vdevs; the block indexes live in the ARC headers]

L2ARC “Reclaim”
● There is not really a concept of reclaim
● When the L2ARC device(s) fill up (sequentially, at the “end”), the L2ARC Feed simply starts over at the “beginning”
    ○ It overwrites whatever was there before and updates the index in the ARC header
● If an L2ARC-cached block is (re)written, the ARC handles the write of the dirty block; the L2ARC copy is stale and its index is dropped
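A similarly minimal Python sketch of the read path from the ZFS Reads slide above. The dicts standing in for the ARC, the L2ARC index kept with the ARC headers, the L2ARC device, and the data vdevs are hypothetical names for illustration only.

    arc = {}            # block_id -> data (in RAM)
    l2arc_index = {}    # block_id -> offset on the L2ARC device (kept with the ARC headers)
    l2arc_device = {}   # offset -> data (on flash)
    data_vdev = {}      # block_id -> data (pool data vdevs)

    def read(block_id):
        # 1. ARC hit: respond straight from RAM.
        if block_id in arc:
            return arc[block_id]
        # 2. L2ARC hit: the ARC header holds the index, the payload comes from flash.
        if block_id in l2arc_index:
            data = l2arc_device[l2arc_index[block_id]]
            arc[block_id] = data      # promoted back into the ARC
            return data
        # 3. Read miss: copy the block from a data vdev into the ARC, then respond.
        #    (A prefetcher might also pull in neighbouring blocks here.)
        data = data_vdev[block_id]
        arc[block_id] = data          # "ARC promote": stays in ARC, ages via MRU/MFU
        return data

    data_vdev["blk7"] = b"payload"
    print(read("blk7"))               # read miss: served from the data vdev, promoted into ARC
    print(read("blk7"))               # ARC hit this time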
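And a toy model of the L2ARC Feed plus the wrap-around “reclaim” behaviour just described. The capacities, the one-block-per-eviction scan, and every name here are assumptions made for the sketch; the real feed thread, scan depth, and device write hand in OpenZFS are more involved.

    from collections import OrderedDict

    ARC_CAPACITY = 3        # pretend the ARC holds 3 blocks
    L2_SLOTS = 4            # pretend the L2ARC device holds 4 blocks

    arc = OrderedDict()     # oldest (about-to-be-evicted) entries at the front
    l2arc_index = {}        # block_id -> device slot, kept with the ARC headers in RAM
    l2arc_device = [None] * L2_SLOTS
    write_hand = 0          # next slot to write; wraps when the device fills up

    def l2arc_feed():
        """Copy the block at the ARC tail to the L2ARC before it is evicted,
        overwriting the oldest slot (and dropping its index) once the device wraps."""
        global write_hand
        block_id = next(iter(arc))              # tail of the toy MRU list
        if block_id in l2arc_index:
            return                              # already cached on the L2ARC
        slot = write_hand % L2_SLOTS
        overwritten = l2arc_device[slot]
        if overwritten is not None:
            l2arc_index.pop(overwritten, None)  # stale block: its index is dropped
        l2arc_device[slot] = block_id
        l2arc_index[block_id] = slot
        write_hand += 1

    def arc_insert(block_id, data):
        arc[block_id] = data
        while len(arc) > ARC_CAPACITY:
            l2arc_feed()                        # feed the tail block, then evict it
            arc.popitem(last=False)

    for i in range(10):
        arc_insert(f"blk{i}", b"...")

    print(sorted(l2arc_index))  # only the four most recently fed blocks keep an index

Note how, in the sketch, a block only ever reaches the L2ARC by passing through the ARC tail, and how the device is written sequentially and simply overwritten once it wraps.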
Some Notes: ARC / L2ARC Architecture
● ARC/L2ARC “blocks” are variable size:
    ○ = volblocksize for zvol data blocks
    ○ = recordsize for dataset data blocks
    ○ = indirect block size for metadata blocks
● Smaller volblock/record sizes yield more metadata blocks (overhead) in the system
    ○ May need to tune the metadata % of the ARC

Some Notes: ARC / L2ARC Architecture
● Blocks get into the ARC via any ZFS write, or by demand/prefetch on a ZFS read miss
● Blocks cannot get into the L2ARC unless they are in the ARC first (primarycache/secondarycache settings)
    ○ CONFIGURATION PITFALL
● The L2ARC is not persistent
    ○ Blocks persist on the L2ARC device(s), but the indexes to them are lost if the ARC headers are lost (main memory)

Some Notes: ARC / L2ARC Architecture
● Write-heavy workloads can “churn” the tail of the ARC MRU list quickly
● Prefetch-heavy workloads can “scan” the tail of the ARC MRU list quickly
● In both cases…
    ○ The ARC’s MFU bias maintains the integrity of the cache
    ○ The L2ARC Feed “misses” many blocks
    ○ Blocks that the L2ARC does feed may not be hit again: “wasted L2ARC Feed” (see the sketch after the Key Performance Factors slides below)

Agenda
✓ Brief Overview of OpenZFS ARC / L2ARC
❏ Key Performance Factors
❏ Existing “Best Practices” for L2ARC
    ❏ Rules of Thumb, Tribal Knowledge, etc.
❏ NVMe as L2ARC
❏ Testing and Results
❏ Revised “Best Practices”

Key Performance Factors
● Random read-heavy workload
    ○ The L2ARC was designed for this; expect it to be very effective once warmed
● Sequential read-heavy workload
    ○ The L2ARC was not originally designed for this, but it was specifically noted that it might work with “future SSDs or other storage tech.”
        ■ Let’s revisit this with NVMe!

Key Performance Factors
● Write-heavy workload (random or sequential)
    ○ ARC MRU churn and memory pressure; the L2ARC never really warms because many L2ARC blocks get invalidated/dirtied
        ■ “Wasted” L2ARC Feeds
    ○ The ARC and L2ARC are designed to primarily benefit reads
    ○ The design expectation is “do no harm”, but there may be a constant background performance impact from the L2ARC Feed
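The “wasted L2ARC Feed” effect under a write-heavy mix can be illustrated with a crude simulation. This is not how the feed thread actually selects blocks; it only shows why frequent rewrites invalidate L2ARC copies before they ever serve a read. The 80/20 write/read split, block count, and iteration count are arbitrary assumptions.

    import random

    random.seed(1)
    blocks = [f"blk{i}" for i in range(1000)]
    on_l2arc = {}            # block_id -> True once its L2ARC copy has served a read
    wasted = useful = 0

    for step in range(20000):
        block = random.choice(blocks)
        if random.random() < 0.8:            # write: any L2ARC copy becomes stale
            if block in on_l2arc:
                if on_l2arc.pop(block):
                    useful += 1              # the copy served at least one read first
                else:
                    wasted += 1              # fed, invalidated, never read: wasted feed
        else:                                # read
            if block in on_l2arc:
                on_l2arc[block] = True       # L2ARC hit
            else:
                on_l2arc[block] = False      # pretend the feed caches it shortly afterwards

    print(f"wasted feeds: {wasted}  feeds that served a read: {useful}")

With most accesses being writes, the majority of cached copies are invalidated before they are ever read back, which is the background cost the slide calls “wasted” feeds.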
Key Performance Factors
● Small active data set (ADS) relative to the ARC and L2ARC
    ○ Higher possibility for the L2ARC to fully warm and for the L2ARC Feed to slow down
● Large active data set (ADS) relative to the ARC and L2ARC
    ○ Higher possibility of a constant background L2ARC Feed
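A back-of-envelope check of the active data set against the cache sizes, echoing the two bullets above. The sizes in the example calls are made-up numbers, not recommendations.

    GiB = 1024**3

    def cache_fit(ads_bytes, arc_bytes, l2arc_bytes):
        """Rough classification of whether the L2ARC can ever fully warm for this ADS."""
        if ads_bytes <= arc_bytes:
            return "ADS fits in the ARC; the L2ARC adds little"
        if ads_bytes <= arc_bytes + l2arc_bytes:
            return "ADS fits in ARC + L2ARC; the L2ARC can fully warm and the feed can slow down"
        return "ADS is larger than ARC + L2ARC; expect a constant background L2ARC feed"

    print(cache_fit(ads_bytes=100 * GiB,  arc_bytes=200 * GiB, l2arc_bytes=800 * GiB))
    print(cache_fit(ads_bytes=500 * GiB,  arc_bytes=200 * GiB, l2arc_bytes=800 * GiB))
    print(cache_fit(ads_bytes=4000 * GiB, arc_bytes=200 * GiB, l2arc_bytes=800 * GiB))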