CONFIDENTIAL
ZFS Readzilla and Logzilla
Claudia Hildebrandt – System Engineer/Consultant, Sun Microsystems GmbH
Sun Microsystems
Hybrid Storage Pools
• Hybrid Storage Pools
> Pools with SSDs
> SSDs as write flash accelerators and separate log devices for the ZIL – aka Logzilla
> SSDs as read flash accelerators – aka Readzilla
• OpenStorage at this time:
> 18 GB Logzilla
> 100 GB Readzilla
Logzilla
• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous write requirements
• By default, the ZIL uses allocated blocks within the main storage pool
• Better performance with a separate ZIL (slog) – the ZIL is allocated on separate devices such as a dedicated disk, SSD or NVRAM
• # zpool add
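The `# zpool add` command on the slide is truncated; the documented syntax for attaching a separate log device is `zpool add <pool> log <device>`. A sketch with placeholder pool and device names:

```
# zpool add tank log c4t0d0                 # single slog device
# zpool add tank log mirror c4t0d0 c5t0d0   # mirrored slog for redundancy
# zpool status tank                         # the pool now shows a "logs" vdev
```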
Readzilla
• aka L2ARC – a secondary caching tier between the DRAM cache (the ZFS ARC) and disk
• ZFS ARC – ZFS Adjustable Replacement Cache
> By default stores ZFS data and metadata from all active storage pools in physical memory, as much as possible except for 1 GB of RAM
> The ZFS ARC consumes free memory as long as there is free memory, and releases it to the system only when free memory is requested by another application
> With Readzilla, the information in RAM can be moved to the cache device and kept cached as long as there is free space
ZFS – Features
• “Pooled storage model”
• “All or nothing”
• “Always consistent”
• “Self healing”
ZFS Architecture
ARC Overview and Purpose
• ZFS does not use the page cache like UFS does (exception: mmap(2))
• Adaptive Replacement Cache
> Based on Megiddo & Modha (IBM) at FAST 2003 – “ARC: A Self-Tuning, Low Overhead Replacement Cache”
> The ZFS ARC differs slightly in implementation
– ZFS: variable-sized cache and contents, non-evictable contents
• The DMU uses the ARC to cache data objects based on their DVA
• 1 ARC per system
• 2 LRU (Least Recently Used) caches plus history
> Recency (MRU) and frequency (MFU) – ARC data survives a large file scan
> 1c cache and 1c history (c = cache size)
Adjustable Replacement Cache (ARC)
• Central point of memory management for the SPA
> Ability to evict buffers as a result of memory pressure
• Dynamic, adaptive and self-tuning
> The cache adjusts based on the I/O workload
• Scan-resistant
6 states of arc_buf
• ARC_anon:
> Buffers not associated with a DVA
> They hold dirty block copies before being written to storage
> They are considered part of ARC_mru
• ARC_mru: recently used and currently cached
• ARC_mru_ghost: recently used, no longer in cache
• ARC_mfu: frequently used and currently cached
• ARC_mfu_ghost: frequently used, no longer cached
• ARC_l2c_only: exists only in the L2ARC
• Ghost caches contain only ARC buffer headers
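The six states can be sketched as a simple enumeration (illustrative Python, not the kernel implementation; names follow the slide):

```python
from enum import Enum, auto

class ArcBufState(Enum):
    ANON = auto()       # dirty buffer, not yet written; counted as part of MRU
    MRU = auto()        # recently used, currently cached
    MRU_GHOST = auto()  # recently used, data evicted, header only
    MFU = auto()        # frequently used, currently cached
    MFU_GHOST = auto()  # frequently used, data evicted, header only
    L2C_ONLY = auto()   # buffer exists only on an L2ARC device

# the ghost caches hold buffer headers, never data
GHOST_STATES = {ArcBufState.MRU_GHOST, ArcBufState.MFU_GHOST}
```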
ARC Diagram of Caches
[Diagram: the ARC of size c split into MRU and MFU lists (MRU target size p), each backed by a ghost cache]
• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the ghost caches are twice the size of the cache c
• The ARC adapts c (cache size) and p (target size of the MRU list) in response to workloads
• ARC parameters are initialised to:
arc_c_min = MAX(1/32 of all mem, 64 MB)
arc_c_max = MAX(3/4 of all mem, all but 1 GB)
arc_c = MIN(1/8 physmem, 1/8 VM size)
arc_p = arc_c / 2
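The initialisation formulas above can be checked with a small sketch (sizes in bytes; the final clamp of arc_c into [arc_c_min, arc_c_max] is an assumption about how the starting value is bounded):

```python
MB = 1024 * 1024
GB = 1024 * MB

def arc_init(physmem, vmsize):
    """Initial ARC sizing per the slide's formulas."""
    arc_c_min = max(physmem // 32, 64 * MB)
    arc_c_max = max(physmem * 3 // 4, physmem - 1 * GB)
    arc_c = min(physmem // 8, vmsize // 8)
    arc_c = max(arc_c_min, min(arc_c, arc_c_max))  # assumed clamp
    arc_p = arc_c // 2
    return arc_c_min, arc_c_max, arc_c, arc_p
```

For a 4 GB machine this yields arc_c_min = 128 MB, arc_c_max = 3 GB, arc_c = 512 MB and arc_p = 256 MB.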
How it works
[Diagram: ARC = c, split into MRU (size p, LRU-ordered) and MFU (size c − p, LRU-ordered), plus the ghost caches MRU ghost and MFU ghost]
claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl
System Memory:
        Physical RAM:  4052 MB
        Free Memory:   2312 MB
        LotsFree:      63 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:            772 MB (arcsize)
        Target Size (Adaptive):  3039 MB (c)
        Min Size (Hard Limit):   379 MB (zfs_arc_min)
        Max Size (Hard Limit):   3039 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:   50%  1519 MB (p)
        Most Frequently Used Cache Size: 49%  1519 MB (c-p)
Data is read
[Diagram: an arc_read request for buffer A arrives. ARC = c, MRU = p, MFU = c − p; ghost caches: MRU ghost, MFU ghost]
Data buffer is in MRU
[Diagram: buffer A now sits in the MRU list]
Same data buffer is read again
[Diagram: a second arc_read request for buffer A hits it in the MRU list]
Data buffer moves to MFU
[Diagram: buffer A moves from the MRU list to the MFU list]
Cache fills up
[Diagram: MRU holds D, E, F and MFU holds C, B, A; the cache is full]
MRU data buffer is read again
[Diagram: an arc_read request hits buffer D in the MRU list]
MFU list is dynamically adjusted
[Diagram: buffer D moves from the MRU list to the head of the MFU list]
Data buffer in MFU is read again
[Diagram: an arc_read request hits buffer B in the MFU list]
Data buffer moves to 1st position
[Diagram: buffer B moves to the head of the MFU list]
ARC Caches in Action
• If evicting during cache insert, then:
> 1. Inserting in MRU & MRU < p, then arc_evict(MFU)
> 2. Inserting in MRU & MRU > p, then arc_evict(MRU)
> 3. Inserting in MFU & MFU < (c-p), then arc_evict(MRU)
> 4. Inserting in MFU & MFU > (c-p), then arc_evict(MFU)
• Buffers change state (i.e. cache) in response to access
> If the current state is MRU and at least ARC_MINTIME (62 ms) has passed since the last access, then the new state is MFU
> All other repeated accesses result in a state of MFU
– Exception: prefetching in MRU or the ghost caches results in MRU
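The four insert-time rules reduce to a small decision function (a sketch; the list names and sizes are simplifications of the real arc_evict() path):

```python
def evict_target(insert_list, mru_size, mfu_size, p, c):
    """Pick the list to evict from when inserting, per rules 1-4 above."""
    if insert_list == 'MRU':
        # rule 1: MRU under target -> evict MFU; rule 2: over target -> evict MRU
        return 'MFU' if mru_size < p else 'MRU'
    # rule 3: MFU under its share -> evict MRU; rule 4: over -> evict MFU
    return 'MRU' if mfu_size < (c - p) else 'MFU'
```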
Least recently used data buffer is evicted
[Diagram: an arc_read request for new buffer G arrives; the oldest MRU buffer (E) is evicted and its header moves to the MRU ghost cache]
Least frequently used data buffer is evicted
[Diagram: another arc_read request arrives; the oldest MFU buffer (A) is evicted and its header moves to the MFU ghost cache]
ARC Adapting and Adjusting
• Adapting... adapting to the workload
> When adding new content:
– If (hit in MRU_Ghost) then increase p
– If (hit in MFU_Ghost) then decrease p
– If (arc_size within (2*maxblocksize) of c) then increase c
• Adjusting... adjusting contents to fit
> When shrinking or reclaiming:
– If (MRU > p) then arc_evict(MRU)
– If (MRU + MRU_Ghost > c) then arc_evict(MRU_Ghost)
– If (arc_size > c) then arc_evict(MFU)
– If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)
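The adaptation of p on ghost hits can be sketched as follows (in the real ARC the step size depends on the relative ghost-list sizes; here it is passed in as delta):

```python
def adapt_p(p, c, ghost_hit, delta):
    """Adjust the MRU target size p after a ghost-cache hit."""
    if ghost_hit == 'MRU_GHOST':   # recency was under-served: grow MRU
        return min(c, p + delta)
    if ghost_hit == 'MFU_GHOST':   # frequency was under-served: shrink MRU
        return max(0, p - delta)
    return p
```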
Data buffer not in cache
[Diagram: an arc_read request for buffer F misses the cache but hits its header in the MRU ghost cache]
ARC adaptive self-tuning
[Diagram: after the ghost hit, the target size p grows: F re-enters the cache and the MRU portion grows at the expense of MFU]
ARC is too small
[Diagram: arc_read requests for A, E and C all hit only the ghost caches – a sign that the ARC (and thus c) is too small for the working set]
ARC Reclaiming
• Reclaim... reclaiming kernel memory
> Every second (or sooner if adapting or on a kmem callback)
> Check the VM parameters: freemem, lotsfree, needfree, desfree
> If required:
– Set arc_no_grow – suspends ARC adaption and growth
– Set the aggressive reclaim policy – triggers an ARC shrink
– Shrink by MAX(1/32 of current size, VM needfree), down to arc_min
– Call arc_adjust() to adjust (i.e. evict) cache contents to the new sizes
– Call kmem_cache_reap_now() on the ZIO buffer caches
• Megiddo/Modha said: “We think of ARC as dynamically, adaptively and continually balancing between recency and frequency – in an online and self-tuning fashion – in response to evolving and possibly changing access patterns”

L2ARC
• Enhances the ARC
• A second cache layer between main memory and disk, typically on SSDs
• Boosts random read performance
• Devices used can be:
> Short-stroked disks
> Solid state disks
> Devices with smaller read latency than the pool disks
L2ARC – How does it populate the cache?
• The L2ARC attempts to cache data from the ARC before it is evicted
> There is no eviction path from the ARC to the L2ARC
• A kernel thread scans the eviction end of the MFU/MRU lists and copies those buffers to the L2ARC devices
> Refer to l2arc_feed_thread()
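The feed behaviour can be sketched in a few lines (illustrative; the real l2arc_feed_thread() works on buffer headers and respects l2arc_headroom and write limits):

```python
def l2arc_feed(mru, mfu, l2arc, headroom):
    """Scan the eviction (tail) end of the MRU and MFU lists and copy
    soon-to-be-evicted buffers to the L2ARC device. Buffers are copied,
    not moved - there is no eviction path from ARC to L2ARC."""
    for lst in (mru, mfu):
        for buf in lst[-headroom:]:   # tail = next eviction candidates
            if buf not in l2arc:
                l2arc[buf] = True     # write a copy to the cache device
```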
L2ARC – Tuning
• The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads:
> l2arc_write_max: max bytes written per interval
> l2arc_noprefetch: skip caching prefetched buffers
> l2arc_headroom: number of max-device-writes to precache
> l2arc_feed_secs: seconds between L2ARC writes
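On Solaris such tunables are set in /etc/system with the `set zfs:` prefix; a hedged example (the values are illustrative, not recommendations):

```
* /etc/system - example L2ARC tuning (illustrative values)
set zfs:l2arc_write_max = 0x8000000    * 128 MB written per feed interval
set zfs:l2arc_noprefetch = 0           * also cache prefetched buffers
set zfs:l2arc_feed_secs = 1            * seconds between feed-thread runs
```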
ZIL – ZFS Intent Log
• Filesystems buffer write requests and sync them to storage periodically to improve performance
• On power loss a filesystem can be corrupted and/or lose data
> Corruption is solved by TXG commits – always on-disk consistency
• Applications that require data to be on stable storage by the time a system call returns use synchronous semantics:
> Open the file with O_DSYNC
> Flush buffered contents with fsync(3C)
• The ZIL provides these synchronous semantics for ZFS
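Both ways of requesting synchronous semantics can be shown in a short sketch (Python; the file path is a throwaway temp file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sync-demo")

# 1) O_DSYNC: each write() returns only after the data is on stable storage
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
os.write(fd, b"synchronous write\n")
os.close(fd)

# 2) buffered writes, then fsync() to flush before the call returns
with open(path, "ab") as f:
    f.write(b"buffered write\n")
    f.flush()
    os.fsync(f.fileno())
```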
ZIL Operational Overview
• The ZFS Intent Log (ZIL) saves in-memory transaction records of system calls that change the file system, with enough information to replay them
• ZFS operations are organized by the DMU as transactions; whenever a DMU transaction is opened, a ZIL transaction is opened as well
> A log record holds one system call transaction
> A log block can hold many log records; blocks are chained together
> Log blocks are dynamically allocated and freed as needed
> a) ZIL blocks are freed on TXG commit by the DMU (discard)
> b) or flushed and committed to stable storage due to synchronous requirements, e.g. fsync(3C), O_DSYNC
• In the event of a power failure or panic, the transactions are replayed from the ZIL
• 1 ZIL per file system
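The lifecycle above (log, flush on sync, discard on TXG commit, replay after a crash) can be modelled with a toy class (a sketch, not the on-disk format):

```python
class ZIL:
    """Toy intent log: records accumulate in memory, are flushed to stable
    storage for synchronous callers, and are discarded on TXG commit."""
    def __init__(self):
        self.in_memory = []   # pending log records
        self.on_disk = []     # records committed to stable storage

    def log(self, record):
        """Every file-system-changing system call appends a log record."""
        self.in_memory.append(record)

    def commit(self):
        """fsync(3C) / O_DSYNC path: flush pending records to stable storage."""
        self.on_disk.extend(self.in_memory)
        self.in_memory.clear()

    def txg_commit(self):
        """DMU TXG commit: the data is in the pool, the log is discarded."""
        self.in_memory.clear()
        self.on_disk.clear()

    def replay(self):
        """After a power failure or panic, replay what reached stable storage."""
        return list(self.on_disk)
```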
ZIL
• ZIL logs reside in memory or on disk
• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log
• ZIL logs are written to disk in variable block sizes
> min. 4 KB, max. 128 KB
Separate ZIL
• Enables the use of limited-capacity but fast block devices such as NVRAM and SSDs
• Allocating the ZIL from the main pool leads to pool fragmentation
• Performance increase
> Databases and NFS rely on speed and need assurance that data is not lost
ZFS Hybrid Storage Pool
OpenStorage – 7000 Series
• Logzilla devices – 18 GB flash-based SSDs backed by a supercapacitor
> 10,000 write IOPS
• Readzilla devices – up to 6 read-optimized 100 GB SSDs
> 50–100 microseconds read latency
Sun Storage 7000 Unified Storage System
NEW
• Since 2009.03:
> Triple-parity RAID-Z (RAID-Z3)
> Triple mirroring storage profile
> Enhanced iSCSI support
> InfiniBand support
> Improved management
Links
Hybrid Storage Pools
http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future
http://blogs.sun.com/ahl/entry/hsp_goes_glossy

Demo: Storage Simulator
http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp?intcmp=2992
Thank you
Claudia Hildebrandt [email protected]