ZFS Readzilla and Logzilla

Claudia Hildebrandt, System Engineer/Consultant, Sun Microsystems GmbH

Sun Microsystems 2 Hybrid Storage Pools
• Hybrid Storage Pools
  > Pools with SSDs
  > SSDs as write flash accelerators and separate log devices for the ZIL – aka Logzilla
  > SSDs as read flash accelerators – aka Readzilla
• OpenStorage at this time
  > 18 GB Logzilla
  > 100 GB Readzilla
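A hybrid pool combines ordinary data disks with a separate log vdev and a cache vdev. A minimal sketch of creating one in a single command; the pool name tank and the c*t*d* device paths are illustrative, not from the original slides:

# zpool create tank mirror c1t0d0 c1t1d0 \
      log mirror c2t0d0 c2t1d0 \
      cache c2t2d0

The log vdev plays the Logzilla role (it absorbs synchronous ZIL writes), the cache vdev the Readzilla role (it becomes the L2ARC).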

Sun Microsystems 3 Logzilla

• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous requirements
• By default, the ZIL uses allocated blocks within the main storage pool
• Better performance with a separate ZIL (slog) – the ZIL is allocated on separate devices such as a dedicated disk, SSDs or NVRAM
  > # zpool add <pool> log <device>
• Note: use mirrored log devices; RAID-Z is not supported for log devices
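A hedged example of retrofitting a mirrored slog onto an existing pool; pool and device names are hypothetical:

# zpool add tank log mirror c2t0d0 c2t1d0   # mirrored SSD pair as the separate ZIL
# zpool status tank                         # the new devices appear under "logs"

Mirroring matters here because, until the next TXG commit, the slog can hold the only stable copy of a synchronous write.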

Sun Microsystems 4 Readzilla
• aka L2ARC – a secondary caching tier between DRAM (the ZFS ARC) and disk
• ZFS ARC – ZFS adjustable replacement cache
  > Stores ZFS data and metadata from all active storage pools in physical memory, by default using as much as possible except 1 GB of RAM
  > The ZFS ARC consumes free memory as long as there is free memory, and releases it to the system only when free memory is requested by another application
  > With Readzilla, information in RAM can be moved to the cache device and kept there as long as there is free space
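Adding a Readzilla follows the same pattern with the cache vdev type; a sketch with hypothetical names:

# zpool add tank cache c2t2d0   # read-optimized SSD becomes the L2ARC
# zpool iostat -v tank          # the per-vdev view includes the cache device

Unlike log devices, cache devices need no mirroring: L2ARC contents are checksummed, and a failed read simply falls back to the main pool.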

Sun Microsystems 5 ZFS – Features

“Pooled Storage Model” “All or nothing”

“Always consistent” “Self healing”

Sun Microsystems 6 ZFS Architecture

Sun Microsystems 7 ARC Overview and Purpose
• ZFS does not use the VM page cache like UFS does (exception: mmap(2))
• Adaptive Replacement Cache
  > Based on Megiddo & Modha (IBM) at FAST 2003 – “ARC: A Self-Tuning, Low Overhead Replacement Cache”
  > The ZFS ARC differs slightly in implementation – ZFS: variable-sized cache and contents, non-evictable contents
• The DMU uses the ARC to cache data objects based on the DVA
• 1 ARC per system
• 2 LRU (Least Recently Used) lists plus history
  > Recency (MRU) and frequency (MFU) – ARC data survives a large file scan
  > 1c cache and 1c history (c = cache size)
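The running ARC can be observed through its kstat counters; a minimal sketch (statistic names as exported by the arcstats kstat in OpenSolaris arc.c):

# kstat -p zfs:0:arcstats:size   # current ARC size in bytes
# kstat -p zfs:0:arcstats:c      # target cache size c
# kstat -p zfs:0:arcstats:p      # target size p of the MRU portion

Tools such as arc_summary.pl (shown on slide 12) are essentially pretty-printers for these counters.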

Sun Microsystems 8 Adjustable Replacement Cache (ARC )

• Central point of memory management for the SPA
  > Ability to evict buffers as a result of memory pressure
• Dynamic, adaptive and self-tuning
  > The cache adjusts based on the I/O workload
• Scan-resistant

Sun Microsystems 9 6 states of arc_buf

• ARC_anon:
  > Buffers not associated with a DVA
  > They hold dirty block copies before being written to storage
  > They are considered part of ARC_mru
• ARC_mru: recently used and currently cached
• ARC_mru_ghost: recently used, no longer in cache
• ARC_mfu: frequently used and currently cached
• ARC_mfu_ghost: frequently used, no longer cached
• ARC_l2c_only: exists only in the L2ARC
• Ghost caches contain only ARC buffer headers

Sun Microsystems 10 ARC Diagram of Caches

[Diagram: the ARC of size c, split at p into an MRU and an MFU list, with MRU and MFU ghost caches alongside]

• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the ghost caches are twice the size of the cache c
• The ARC adapts c (the cache size) and p (the size of the MRU portion) in response to workloads
• ARC parameters are initialised to:
  arc_c_min = MAX(1/32 of all mem, 64 MB)
  arc_c_max = MAX(3/4 of all mem, all but 1 GB)
  arc_c = MIN(1/8 physmem, 1/8 VM size)
  arc_p = arc_c / 2
• Example: on the 4052 MB machine of slide 12, arc_c_max = MAX(3/4 × 4052 MB, 4052 MB - 1024 MB) = 3039 MB – exactly the target/max size reported there

Sun Microsystems 11 How it works

[Diagram: ARC = c; the MRU list of size p and the MFU list of size c - p, each ordered from LRU to MRU end; below them the MRU ghost and MFU ghost caches]

Sun Microsystems 12
claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl

System Memory:
        Physical RAM:  4052 MB
        Free Memory :  2312 MB
        LotsFree:      63 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:             772 MB (arcsize)
        Target Size (Adaptive):   3039 MB (c)
        Min Size (Hard Limit):    379 MB (zfs_arc_min)
        Max Size (Hard Limit):    3039 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    50%   1519 MB (p)
        Most Frequently Used Cache Size:  49%   1519 MB (c-p)

Sun Microsystems 13 Data is read – arc_read request for buffer A

[Diagram: an arc_read request for A arrives at the still empty ARC]

Sun Microsystems 14 Data buffer is in MRU

[Diagram: A now sits in the MRU list]

Sun Microsystems 15 Same data buffer read again – arc_read request for A

[Diagram: a second arc_read request hits A in the MRU list]

Sun Microsystems 16 Data buffer moves to MFU

[Diagram: A is promoted from the MRU list to the MFU list]

Sun Microsystems 17 Cache fills up

[Diagram: further reads fill the cache – D, E, F in the MRU list; C, B, A in the MFU list]

Sun Microsystems 18 MRU data buffer is read again – arc_read request for D

[Diagram: the request hits D in the MRU list]

Sun Microsystems 19 MFU list is dynamically adjusted

[Diagram: D is promoted to the head of the MFU list; the MRU list now holds E and F]

Sun Microsystems 20 Data buffer in MFU is read again – arc_read request for B

[Diagram: the request hits B in the MFU list]

Sun Microsystems 21 Data buffer moves to the first position

[Diagram: B moves to the head of the MFU list]

Sun Microsystems 22 ARC Caches in Action
• If evicting during cache insert, then:
  > 1. Inserting in MRU & MRU < p then arc_evict(MFU)
  > 2. Inserting in MRU & MRU > p then arc_evict(MRU)
  > 3. Inserting in MFU & MFU < (c-p) then arc_evict(MRU)
  > 4. Inserting in MFU & MFU > (c-p) then arc_evict(MFU)
• Buffers change state (i.e. cache) in response to access
  > If the current state is MRU, and at least ARC_MINTIME (62 ms) has passed since the last access, the new state is MFU
  > All other repeated accesses result in a state of MFU
    – Exception: prefetching in MRU or the ghost caches results in MRU
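Which list is serving reads can be seen from the hit counters; a sketch (names as in the arcstats kstat):

# kstat -p zfs:0:arcstats:mru_hits   # hits on the recency list
# kstat -p zfs:0:arcstats:mfu_hits   # hits on the frequency list

A hit counted under mru_hits generally moves that buffer toward MFU, per the state-change rules above.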

Sun Microsystems 23 Least-recency data buffer evicted – arc_read request for G

[Diagram: G is inserted into the MRU list; the least recently used MRU buffer is evicted and its header moves to the MRU ghost cache]

Sun Microsystems 24 Least-frequency data buffer evicted – arc_read request for G

[Diagram: the least frequently used MFU buffer is evicted and its header moves to the MFU ghost cache]

Sun Microsystems 25 ARC Adapting and Adjusting
• Adapting... adapting to the workload
  > When adding new content:
    – If (hit in MRU_Ghost) then increase p
    – If (hit in MFU_Ghost) then decrease p
    – If (arc_size within (2*maxblocksize) of c) then increase c
• Adjusting... adjusting contents to fit
  > When shrinking or reclaiming:
    – If (MRU > p) then arc_evict(MRU)
    – If (MRU+MRU_Ghost > c) then arc_evict(MRU_Ghost)
    – If (arc_size > c) then arc_evict(MFU)
    – If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)
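Whether p is currently being pulled toward recency or frequency can be watched live; a sketch using the ghost-hit counters of the arcstats kstat:

# kstat -p zfs:0:arcstats:mru_ghost_hits   # hits here push p up
# kstat -p zfs:0:arcstats:mfu_ghost_hits   # hits here push p down
# kstat -p zfs:0:arcstats:p                # watch p move between the two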

Sun Microsystems 26 Data buffer not in cache – arc_read request for F

[Diagram: F is no longer in the ARC; only its header survives in the MRU ghost cache]

Sun Microsystems 27 ARC adaptive self-tuning

[Diagram: the ghost hit on F increases p – the MRU list grows at the expense of the MFU list and F is read back into the cache]

Sun Microsystems 28 ARC is too small – arc_read requests for A, E and C

[Diagram: the requested buffers are only found in the ghost caches – the working set no longer fits into c]

Sun Microsystems 29 ARC Reclaiming
• Reclaim... reclaiming kernel memory
  > Every second (or sooner if adapting, or on a kmem callback)
  > Check VM parameters: freemem, lotsfree, needfree, desfree
  > If required:
    – Set arc_no_grow – suspends ARC growth through adaption
    – Set the aggressive reclaim policy, which triggers an ARC shrink
    – Shrinks by MAX(1/32 of current size, VM needfree), down to arc_min
    – Calls arc_adjust() to adjust (i.e. evict) cache contents to the new sizes
    – Calls kmem_cache_reap_now() on the ZIO buffer caches
• Megiddo/Modha: “We think of ARC as dynamically, adaptively and continually balancing between recency and frequency - in an online and self-tuning fashion - in response to evolving and possibly changing access patterns”

Sun Microsystems 30 L2ARC
• Enhances the ARC
• Second cache layer between main memory (the ARC) and disk
• Boosts random read performance
• Devices used can be:
  > Short-stroked disks
  > Solid state disks
  > Devices with smaller read latency
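How much the L2ARC actually contributes shows up in the same arcstats kstat; a sketch using the l2_* counters:

# kstat -p zfs:0:arcstats:l2_hits     # reads served from the cache device
# kstat -p zfs:0:arcstats:l2_misses   # reads that had to go to the pool disks
# kstat -p zfs:0:arcstats:l2_size     # bytes currently held in the L2ARC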

Sun Microsystems 31 L2ARC – How does it populate the cache?
• The L2ARC attempts to cache data from the ARC before it is evicted
  > There is no eviction path from the ARC to the L2ARC
• A kernel thread scans the eviction lists of MFU/MRU and copies the buffers to the L2ARC devices
  > Refer to l2arc_feed_thread()

Sun Microsystems 32 L2ARC – Tuning
• The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads:
  > l2arc_write_max: max bytes written per interval
  > l2arc_noprefetch: skip caching prefetched buffers
  > l2arc_headroom: number of max device writes to precache
  > l2arc_feed_secs: seconds between L2ARC writes
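On Solaris these are variables in the zfs kernel module and can be set in /etc/system; a hedged sketch – the values are illustrative, not recommendations:

* /etc/system – example L2ARC tuning, effective after the next reboot
* allow up to 64 MB per feed interval:
set zfs:l2arc_write_max = 67108864
* also cache prefetched (streaming) buffers:
set zfs:l2arc_noprefetch = 0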

Sun Microsystems 33 ZIL – ZFS Intent Log
• Filesystems buffer write requests and sync them to storage periodically to improve performance
• On power loss, filesystems can become corrupted and/or suffer data loss
  > Corruption is solved by TXG commits – always-on-disk consistency
• Applications that require data to be on stable storage by the time a system call returns use synchronous semantics
  > Open the file with O_DSYNC
  > Flush buffered contents with fsync(3C)
• The ZIL provides these synchronous semantics for ZFS

Sun Microsystems 34 ZIL Operational Overview
• The ZFS Intent Log (ZIL) saves in-memory transaction records of system calls that change the file system, with enough information to replay them
• ZFS operations are organized by the DMU as transactions; whenever a DMU transaction is opened, a ZIL transaction is opened as well
  > A log record holds one system call transaction
  > A log block can hold many log records; blocks are chained together
  > Log blocks are dynamically allocated and freed as needed
    – a) ZIL blocks are freed on TXG commit by the DMU (discard), or
    – b) flushed and committed to stable storage due to synchronous requirements, e.g. fsync(3C), O_DSYNC
• In the event of a power failure/panic the transactions are replayed from the ZIL
• 1 ZIL per file system

Sun Microsystems 35 ZIL
• ZIL logs reside in memory or on disk
• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log
• ZIL logs are written to disk in variable block sizes
  > min. 4 KB, max. 128 KB

Sun Microsystems 36 Separate ZIL

• Enables the use of limited-capacity but fast block devices such as NVRAM and SSDs
• ZIL allocation from the main pool leads to pool fragmentation
• Performance increase
  > Databases and NFS rely on speed and on the assurance that the data are not lost
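Whether the slog really absorbs the synchronous writes can be verified per vdev; a sketch with a hypothetical pool name:

# zpool iostat -v tank 5   # per-vdev I/O statistics, repeated every 5 seconds

While an NFS or database workload issues synchronous writes, the write operations should land in the output's logs section rather than on the data disks.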

Sun Microsystems 37 ZFS Hybrid Storage Pool

Sun Microsystems 38 OpenStorage – 7000 Series
• Logzilla devices – 18 GB flash-based SSDs backed by a supercapacitor
  > 10,000 write IOPS
• Readzilla devices – up to 6 read-optimized 100 GB SSDs
  > 50-100 microsecond latency

Sun Microsystems 39 Sun Storage 7000 Unified Storage System

Sun Microsystems 40 NEW
• Since 2009.03
  > Triple-parity RAID-Z (RAID-Z3)
  > Triple mirroring storage profile
  > Enhanced iSCSI support
  > InfiniBand support
  > Improved management
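Triple-parity RAID-Z is requested with the raidz3 vdev type on pool versions that support it; a sketch with hypothetical device names:

# zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

The resulting vdev survives the loss of any three of its disks.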

Sun Microsystems 41 Links – Hybrid Storage Pools

http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future
http://blogs.sun.com/ahl/entry/hsp_goes_glossy

Demo: Storage Simulator
http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp?intcmp=2992

Sun Microsystems 42 Thank you

Claudia Hildebrandt [email protected]
