ZFS Readzilla and Logzilla

Claudia Hildebrandt, System Engineer/Consultant, Sun Microsystems GmbH

Sun Microsystems 2 Hybrid Storage Pools
• Hybrid Storage Pools
  > Pools with SSDs
  > SSDs as write flash accelerators and separate log devices for the ZIL – aka Logzilla
  > SSDs as read flash accelerators – aka Readzilla
• OpenStorage at this time
  > 18 GB Logzilla
  > 100 GB Readzilla
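A hybrid pool combines ordinary data disks with a separate log vdev and a cache vdev. A minimal sketch of creating one in a single command; the pool name tank and the c*t*d* device paths are illustrative, not from the original slides:

# zpool create tank mirror c1t0d0 c1t1d0 \
      log mirror c2t0d0 c2t1d0 \
      cache c2t2d0

The log vdev plays the Logzilla role (it absorbs synchronous ZIL writes), the cache vdev the Readzilla role (it becomes the L2ARC).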

Sun Microsystems 3 Logzilla

• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous requirements
• By default, the ZIL uses allocated blocks within the main storage pool
• Better performance with a separate ZIL (slog) – the ZIL is allocated on separate devices such as a dedicated disk, SSDs or NVRAM
  > # zpool add <pool> log <device>
• Note: use mirrored log devices; RAID-Z is not supported for log devices
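A hedged example of retrofitting a mirrored slog onto an existing pool; pool and device names are hypothetical:

# zpool add tank log mirror c2t0d0 c2t1d0   # mirrored SSD pair as the separate ZIL
# zpool status tank                         # the new devices appear under "logs"

Mirroring matters here because, until the next TXG commit, the slog can hold the only stable copy of a synchronous write.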

Sun Microsystems 4 Readzilla
• aka L2ARC – a secondary caching tier between DRAM (the ZFS ARC) and disk
• ZFS ARC – ZFS adjustable replacement cache
  > Stores ZFS data and metadata from all active storage pools in physical memory, by default using as much as possible except 1 GB of RAM
  > The ZFS ARC consumes free memory as long as there is free memory, and releases it to the system only when free memory is requested by another application
  > With Readzilla, information in RAM can be moved to the cache device and kept there as long as there is free space
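Adding a Readzilla follows the same pattern with the cache vdev type; a sketch with hypothetical names:

# zpool add tank cache c2t2d0   # read-optimized SSD becomes the L2ARC
# zpool iostat -v tank          # the per-vdev view includes the cache device

Unlike log devices, cache devices need no mirroring: L2ARC contents are checksummed, and a failed read simply falls back to the main pool.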

Sun Microsystems 5 ZFS – Features

“Pooled Storage Model” “All or nothing”

“Always consistent” “Self healing”

Sun Microsystems 6 ZFS Architecture

Sun Microsystems 7 ARC Overview and Purpose
• ZFS does not use the VM page cache like UFS does (exception: mmap(2))
• Adaptive Replacement Cache
  > Based on Megiddo & Modha (IBM) at FAST 2003 – “ARC: A Self-Tuning, Low Overhead Replacement Cache”
  > The ZFS ARC differs slightly in implementation – ZFS: variable-sized cache and contents, non-evictable contents
• The DMU uses the ARC to cache data objects based on the DVA
• 1 ARC per system
• 2 LRU (Least Recently Used) lists plus history
  > Recency (MRU) and frequency (MFU) – ARC data survives a large file scan
  > 1c cache and 1c history (c = cache size)
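The running ARC can be observed through its kstat counters; a minimal sketch (statistic names as exported by the arcstats kstat in OpenSolaris arc.c):

# kstat -p zfs:0:arcstats:size   # current ARC size in bytes
# kstat -p zfs:0:arcstats:c      # target cache size c
# kstat -p zfs:0:arcstats:p      # target size p of the MRU portion

Tools such as arc_summary.pl (shown on slide 12) are essentially pretty-printers for these counters.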

Sun Microsystems 8 Adjustable Replacement Cache (ARC )

• Central point of memory management for the SPA
  > Ability to evict buffers as a result of memory pressure
• Dynamic, adaptive and self-tuning
  > The cache adjusts based on the I/O workload
• Scan-resistant

Sun Microsystems 9 6 states of arc_buf

• ARC_anon:
  > Buffers not associated with a DVA
  > They hold dirty block copies before being written to storage
  > They are considered part of ARC_mru
• ARC_mru: recently used and currently cached
• ARC_mru_ghost: recently used, no longer in cache
• ARC_mfu: frequently used and currently cached
• ARC_mfu_ghost: frequently used, no longer cached
• ARC_l2c_only: exists only in the L2ARC
• Ghost caches contain only ARC buffer headers

Sun Microsystems 10 ARC Diagram of Caches

[Diagram: the ARC of size c, split at p into an MRU and an MFU list, with MRU and MFU ghost caches alongside]

• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the ghost caches are twice the size of the cache c
• The ARC adapts c (the cache size) and p (the size of the MRU portion) in response to workloads
• ARC parameters are initialised to:
  arc_c_min = MAX(1/32 of all mem, 64 MB)
  arc_c_max = MAX(3/4 of all mem, all but 1 GB)
  arc_c = MIN(1/8 physmem, 1/8 VM size)
  arc_p = arc_c / 2
• Example: on the 4052 MB machine of slide 12, arc_c_max = MAX(3/4 × 4052 MB, 4052 MB - 1024 MB) = 3039 MB – exactly the target/max size reported there

Sun Microsystems 11 How it works

[Diagram: ARC = c; the MRU list of size p and the MFU list of size c - p, each ordered from LRU to MRU end; below them the MRU ghost and MFU ghost caches]

Sun Microsystems 12
claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl

System Memory:
        Physical RAM:  4052 MB
        Free Memory :  2312 MB
        LotsFree:      63 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:             772 MB (arcsize)
        Target Size (Adaptive):   3039 MB (c)
        Min Size (Hard Limit):    379 MB (zfs_arc_min)
        Max Size (Hard Limit):    3039 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    50%   1519 MB (p)
        Most Frequently Used Cache Size:  49%   1519 MB (c-p)

Sun Microsystems 13 Data is read – arc_read request for buffer A

[Diagram: an arc_read request for A arrives at the still empty ARC]

Sun Microsystems 14 Data buffer is in MRU

[Diagram: A now sits in the MRU list]

Sun Microsystems 15 Same data buffer read again – arc_read request for A

[Diagram: a second arc_read request hits A in the MRU list]

Sun Microsystems 16 Data buffer moves to MFU

[Diagram: A is promoted from the MRU list to the MFU list]

Sun Microsystems 17 Cache fills up

[Diagram: further reads fill the cache – D, E, F in the MRU list; C, B, A in the MFU list]

Sun Microsystems 18 MRU data buffer is read again – arc_read request for D

[Diagram: the request hits D in the MRU list]

Sun Microsystems 19 MFU list is dynamically adjusted

[Diagram: D is promoted to the head of the MFU list; the MRU list now holds E and F]

Sun Microsystems 20 Data buffer in MFU is read again – arc_read request for B

[Diagram: the request hits B in the MFU list]

Sun Microsystems 21 Data buffer moves to the first position

[Diagram: B moves to the head of the MFU list]

Sun Microsystems 22 ARC Caches in Action
• If evicting during cache insert, then:
  > 1. Inserting in MRU & MRU < p then arc_evict(MFU)
  > 2. Inserting in MRU & MRU > p then arc_evict(MRU)
  > 3. Inserting in MFU & MFU < (c-p) then arc_evict(MRU)
  > 4. Inserting in MFU & MFU > (c-p) then arc_evict(MFU)
• Buffers change state (i.e. cache) in response to access
  > If the current state is MRU, and at least ARC_MINTIME (62 ms) has passed since the last access, the new state is MFU
  > All other repeated accesses result in a state of MFU
    – Exception: prefetching in MRU or the ghost caches results in MRU
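Which list is serving reads can be seen from the hit counters; a sketch (names as in the arcstats kstat):

# kstat -p zfs:0:arcstats:mru_hits   # hits on the recency list
# kstat -p zfs:0:arcstats:mfu_hits   # hits on the frequency list

A hit counted under mru_hits generally moves that buffer toward MFU, per the state-change rules above.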

Sun Microsystems 23 Least-recency data buffer evicted – arc_read request for G

[Diagram: G is inserted into the MRU list; the least recently used MRU buffer is evicted and its header moves to the MRU ghost cache]

Sun Microsystems 24 Least-frequency data buffer evicted – arc_read request for G

[Diagram: the least frequently used MFU buffer is evicted and its header moves to the MFU ghost cache]

Sun Microsystems 25 ARC Adapting and Adjusting
• Adapting... adapting to the workload
  > When adding new content:
    – If (hit in MRU_Ghost) then increase p
    – If (hit in MFU_Ghost) then decrease p
    – If (arc_size within (2*maxblocksize) of c) then increase c
• Adjusting... adjusting contents to fit
  > When shrinking or reclaiming:
    – If (MRU > p) then arc_evict(MRU)
    – If (MRU+MRU_Ghost > c) then arc_evict(MRU_Ghost)
    – If (arc_size > c) then arc_evict(MFU)
    – If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)
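Whether p is currently being pulled toward recency or frequency can be watched live; a sketch using the ghost-hit counters of the arcstats kstat:

# kstat -p zfs:0:arcstats:mru_ghost_hits   # hits here push p up
# kstat -p zfs:0:arcstats:mfu_ghost_hits   # hits here push p down
# kstat -p zfs:0:arcstats:p                # watch p move between the two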

Sun Microsystems 26 Data buffer not in cache – arc_read request for F

[Diagram: F is no longer in the ARC; only its header survives in the MRU ghost cache]

Sun Microsystems 27 ARC adaptive self-tuning

[Diagram: the ghost hit on F increases p – the MRU list grows at the expense of the MFU list and F is read back into the cache]

Sun Microsystems 28 ARC is too small – arc_read requests for A, E and C

[Diagram: the requested buffers are only found in the ghost caches – the working set no longer fits into c]

Sun Microsystems 29 ARC Reclaiming
• Reclaim... reclaiming kernel memory
  > Every second (or sooner if adapting, or on a kmem callback)
  > Check VM parameters: freemem, lotsfree, needfree, desfree
  > If required:
    – Set arc_no_grow – suspends ARC growth through adaption
    – Set the aggressive reclaim policy, which triggers an ARC shrink
    – Shrinks by MAX(1/32 of current size, VM needfree), down to arc_min
    – Calls arc_adjust() to adjust (i.e. evict) cache contents to the new sizes
    – Calls kmem_cache_reap_now() on the ZIO buffer caches
• Megiddo/Modha: “We think of ARC as dynamically, adaptively and continually balancing between recency and frequency - in an online and self-tuning fashion - in response to evolving and possibly changing access patterns”

Sun Microsystems 30 L2ARC
• Enhances the ARC
• Second cache layer between main memory (the ARC) and disk
• Boosts random read performance
• Devices used can be:
  > Short-stroked disks
  > Solid state disks
  > Devices with smaller read latency
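How much the L2ARC actually contributes shows up in the same arcstats kstat; a sketch using the l2_* counters:

# kstat -p zfs:0:arcstats:l2_hits     # reads served from the cache device
# kstat -p zfs:0:arcstats:l2_misses   # reads that had to go to the pool disks
# kstat -p zfs:0:arcstats:l2_size     # bytes currently held in the L2ARC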

Sun Microsystems 31 L2ARC – How does it populate the cache?
• The L2ARC attempts to cache data from the ARC before it is evicted
  > There is no eviction path from the ARC to the L2ARC
• A kernel thread scans the eviction lists of MFU/MRU and copies the buffers to the L2ARC devices
  > Refer to l2arc_feed_thread()

Sun Microsystems 32 L2ARC – Tuning
• The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads:
  > l2arc_write_max: max bytes written per interval
  > l2arc_noprefetch: skip caching prefetched buffers
  > l2arc_headroom: number of max device writes to precache
  > l2arc_feed_secs: seconds between L2ARC writes
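On Solaris these are variables in the zfs kernel module and can be set in /etc/system; a hedged sketch – the values are illustrative, not recommendations:

* /etc/system – example L2ARC tuning, effective after the next reboot
* allow up to 64 MB per feed interval:
set zfs:l2arc_write_max = 67108864
* also cache prefetched (streaming) buffers:
set zfs:l2arc_noprefetch = 0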

Sun Microsystems 33 ZIL – ZFS Intent Log
• Filesystems buffer write requests and sync them to storage periodically to improve performance
• On power loss, filesystems can become corrupted and/or suffer data loss
  > Corruption is solved by TXG commits – always-on-disk consistency
• Applications that require data to be on stable storage by the time a system call returns use synchronous semantics
  > Open the file with O_DSYNC
  > Flush buffered contents with fsync(3C)
• The ZIL provides these synchronous semantics for ZFS

Sun Microsystems 34 ZIL Operational Overview
• The ZFS Intent Log (ZIL) saves in-memory transaction records of system calls that change the file system, with enough information to replay them
• ZFS operations are organized by the DMU as transactions; whenever a DMU transaction is opened, a ZIL transaction is opened as well
  > A log record holds one system call transaction
  > A log block can hold many log records; blocks are chained together
  > Log blocks are dynamically allocated and freed as needed
    – a) ZIL blocks are freed on TXG commit by the DMU (discard), or
    – b) flushed and committed to stable storage due to synchronous requirements, e.g. fsync(3C), O_DSYNC
• In the event of a power failure/panic the transactions are replayed from the ZIL
• 1 ZIL per file system

Sun Microsystems 35 ZIL
• ZIL logs reside in memory or on disk
• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log
• ZIL logs are written to disk in variable block sizes
  > min. 4 KB, max. 128 KB

Sun Microsystems 36 Separate ZIL

• Enables the use of limited-capacity but fast block devices such as NVRAM and SSDs
• ZIL allocation from the main pool leads to pool fragmentation
• Performance increase
  > Databases and NFS rely on speed and on the assurance that the data are not lost
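Whether the slog really absorbs the synchronous writes can be verified per vdev; a sketch with a hypothetical pool name:

# zpool iostat -v tank 5   # per-vdev I/O statistics, repeated every 5 seconds

While an NFS or database workload issues synchronous writes, the write operations should land in the output's logs section rather than on the data disks.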

Sun Microsystems 37 ZFS Hybrid Storage Pool

Sun Microsystems 38 OpenStorage – 7000 Series
• Logzilla devices – 18 GB flash-based SSDs backed by a supercapacitor
  > 10,000 write IOPS
• Readzilla devices – up to 6 read-optimized 100 GB SSDs
  > 50-100 microsecond latency

Sun Microsystems 39 Sun Storage 7000 Unified Storage System

Sun Microsystems 40 NEW
• Since 2009.03
  > Triple-parity RAID-Z (RAID-Z3)
  > Triple mirroring storage profile
  > Enhanced iSCSI support
  > InfiniBand support
  > Improved management
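Triple-parity RAID-Z is requested with the raidz3 vdev type on pool versions that support it; a sketch with hypothetical device names:

# zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

The resulting vdev survives the loss of any three of its disks.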

Sun Microsystems 41 Links – Hybrid Storage Pools

http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future
http://blogs.sun.com/ahl/entry/hsp_goes_glossy

Demo: Storage Simulator
http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp?intcmp=2992

Sun Microsystems 42 Thank you

Claudia Hildebrandt [email protected]
