ZFS ARC Cache And

CONFIDENTIAL

ZFS Readzilla and Logzilla
Claudia Hildebrandt
System Engineer/Consultant
Sun Microsystems GmbH

Hybrid Storage Pools
• Hybrid Storage Pools
  > Pools with SSDs
  > SSDs as write flash accelerators and separate log devices for the ZIL, aka Logzilla
  > SSDs as read flash accelerators, aka Readzilla
• OpenStorage at this time
  > 18 GB Logzilla
  > 100 GB Readzilla

Logzilla
• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous requirements
• By default, the ZIL uses blocks allocated within the main storage pool
• Better performance with a separate ZIL (slog): the ZIL is allocated on separate devices such as a dedicated disk, SSDs or NVRAM
• # zpool add <pool_name> log <log_device1> <log_device2>
• Note: use mirrored log devices (for example, zpool add <pool_name> log mirror <log_device1> <log_device2>); RAIDZ is not supported

Readzilla
• aka L2ARC, a secondary caching tier between the DRAM cache (the ZFS ARC) and disk
• ZFS ARC: the ZFS Adaptive Replacement Cache
  > Stores ZFS data and metadata from all active storage pools in physical memory, by default using as much RAM as possible except for 1 GB
  > The ARC consumes free memory as long as there is free memory, and releases memory to the system only when another application requests it
  > With Readzilla, the information in RAM can be moved to disk and cached as long as there is free space

ZFS Features
• “Pooled storage model”
• “All or nothing”
• “Always consistent”
• “Self healing”

ZFS Architecture
[Figure: ZFS architecture diagram]

ARC Overview and Purpose
• ZFS does not use the page cache like UFS (except for mmap(2))
• Adaptive Replacement Cache
  > Based on Megiddo & Modha (IBM) at FAST 2003
    - ARC: A Self-Tuning, Low Overhead Replacement Cache
  > ZFS ARC differs slightly in implementation
    - ZFS: variable-sized cache and contents, non-evictable contents
• DMU uses the ARC to cache data objects based on DVA
• 1 ARC per system
• 2 LRU (Least Recently Used) caches plus history
  > Recency (MRU) and Frequency (MFU)
    - ARC data survives a large file scan
  > 1c cache and 1c history (c = cache size)

Adaptive Replacement Cache (ARC)
• Central point of memory management for the SPA
  > Ability to evict buffers as a result of memory pressure
• Dynamic, adaptive and self-tuning
  > Cache adjusts based on I/O workload
• Scan-resistant

6 states of arc_buf
• ARC_anon:
  > Buffers not associated with a DVA
  > They hold dirty block copies before being written to storage
  > They are considered part of ARC_mru
• ARC_mru: recently used and currently cached
• ARC_mru_ghost: recently used, no longer in cache
• ARC_mfu: frequently used and currently cached
• ARC_mfu_ghost: frequently used, no longer cached
• ARC_l2c_only: exists only in L2ARC
• Ghost caches only contain ARC buffer headers

ARC Diagram of Caches
[Diagram labels: c, c, ARC, MRU, MFU, p, Ghost Caches]
• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the Ghost Caches are twice the size of the cache c
• ARC adapts c (cache size) and p (used pages in MRU) in response to workloads
• ARC parameters initialised to:
    arc_c_min = MAX(1/32 of all mem, 64 MB)
    arc_c_max = MAX(3/4 of all mem, all but 1 GB)
    arc_c     = MIN(1/8 physmem, 1/8 VM size)
    arc_p     = arc_c / 2

How it works
[Diagram labels: LRU, p, MRU, MRU, c - p, LRU. Legend: ARC = c, MRU = p, MFU = c - p; ghost caches: MRU ghost, MFU ghost]

    claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl
    System Memory:
        Physical RAM: 4052 MB
        Free Memory : 2312 MB
        LotsFree: 63 MB
    ZFS Tunables (/etc/system):
    ARC Size:
        Current Size: 772 MB (arcsize)
        Target Size (Adaptive): 3039 MB (c)
        Min Size (Hard Limit): 379 MB (zfs_arc_min)
        Max Size (Hard Limit): 3039 MB (zfs_arc_max)
    ARC Size Breakdown:
        Most Recently Used Cache Size: 50% 1519 MB (p)
        Most Frequently Used Cache Size: 49% 1519 MB (c-p)

Data is read
[Diagram labels: A, arc_read request. Legend: ARC = c, MRU = p, MFU = c - p; ghost caches: MRU ghost, MFU ghost]
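As a rough check, the initial sizing rules from the "ARC Diagram of Caches" slide can be applied to the 4052 MB machine in the arc_summary.pl output. The Python sketch below is illustrative only: the function name arc_initial_sizes is hypothetical, only the arc_c_min and arc_c_max rules are evaluated, and the arithmetic is done in MB rather than in the units a kernel would use.

```python
def arc_initial_sizes(physmem_mb):
    """Apply the slide's initial sizing rules; all values in MB.

    Hypothetical helper for illustration, not the kernel code.
    """
    # arc_c_min = MAX(1/32 of all mem, 64 MB)
    arc_c_min = max(physmem_mb / 32, 64)
    # arc_c_max = MAX(3/4 of all mem, all but 1 GB)
    arc_c_max = max(physmem_mb * 3 / 4, physmem_mb - 1024)
    return arc_c_min, arc_c_max


# Physical RAM as reported by arc_summary.pl in the slide.
c_min, c_max = arc_initial_sizes(4052)

# Target Size (c) as reported by arc_summary.pl; arc_p = arc_c / 2.
target_c_mb = 3039
arc_p = target_c_mb // 2

print(f"arc_c_min = {c_min:.3f} MB")
print(f"arc_c_max = {c_max:.0f} MB")
print(f"arc_p     = {arc_p} MB")
```

Under these assumptions the maximum comes out at 3039 MB and p at 1519 MB, which agrees with the reported Max Size and MRU share. The minimum rule yields about 127 MB, while the tool reports 379 MB (zfs_arc_min), so the minimum on that system apparently did not come from the rule as written on the slide.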
Data buffer is in MRU
[Diagram labels: A. Legend: ARC = c, MRU = p, MFU = c - p; ghost caches: MRU ghost, MFU ghost]

Same data buffer read again
[Diagram labels: A, arc_read request, A]

Data buffer moves in MFU
[Diagram labels: A]

Cache fills up
[Diagram labels: D E F C B A]

MRU data buffer is read again
[Diagram labels: D, arc_read request, D E F C B A]

MFU list is dynamically adjusted
[Diagram labels: E F D C B A]

Data buffer in MFU is read again
[Diagram labels: arc_read request, B, E F D C B A]

Data buffer moves at 1st position
[Diagram labels: E F B D C A]

ARC Caches in Action
• If evicting during cache insert, then:
  > 1. Inserting in MRU & MRU < p then arc_evict(MFU)
  > 2. Inserting in MRU & MRU > p then arc_evict(MRU)
  > 3. Inserting in MFU & MFU < (c-p) then arc_evict(MRU)
  > 4. Inserting in MFU & MFU > (c-p) then arc_evict(MFU)
• Buffers change state (i.e. cache) in response to access
  > If the current state is MRU, and at least ARC_MINTIME (62ms) has passed since the last access, then the new state is MFU
  > All other repeated accesses result in a state of MFU
    - Exception: prefetching in MRU or Ghosts results in MRU

Least recency data buffer evicted
[Diagram labels: G, arc_read request, E F G B D C A e]

Least frequency data buffer evicted
[Diagram labels: G, arc_read request, F G G B D C A e a]

ARC Adapting and Adjusting
• Adapting...adapting to workload
  > When adding new content:
    - If (hit in MRU_Ghost) then increase p
    - If (hit in MFU_Ghost) then decrease p
    - If (arc_size within (2*maxblocksize) of c) then increase c
• Adjusting...adjusting contents to fit
  > When shrinking or reclaiming:
    - If (MRU > p) then arc_evict(MRU)
    - If (MRU+MRU_Ghost > c) then arc_evict(MRU_Ghost)
    - If (arc_size > c) then arc_evict(MFU)
    - If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)

Data buffer not in cache
[Diagram labels: F, arc_read request, I J G B D C h f e a]

ARC adaptive self tuning
[Diagram labels: I J F G B h e a c d]

ARC is too small
[Diagram labels: A E, arc_read request, C I J F G B D h e c a]

ARC Reclaiming
• Reclaim...reclaiming kernel memory
  > Every second (or sooner if adapting or on a kmem callback)
  > Check VM parameters: freemem, lotsfree, needfree, desfree
  > If required:
    - Set arc_no_grow: suspend ARC adaptive growth
    - Set Aggressive Reclaim Policy (triggers ARC shrink)
    - Shrinks by MAX(1/32 of current size, VM needfree) down to arc_min
    - Calls arc_adjust() to adjust (i.e. evict) cache contents to the new sizes
    - Calls kmem_cache_reap_now() on ZIO buffers
• Megiddo/Modha said: “We think of ARC as dynamically, adaptively and continually balancing between recency and frequency - in an online and self-tuning fashion - in response to evolving and possibly changing access patterns”

L2ARC
• Enhances the ARC
• Second cache layer between main memory and disk or SSD
• Boosts random read performance
• Devices used can be:
  > Short-stroked disks
  > Solid state disks
  > Devices with smaller read latency

L2ARC: How does it populate the cache?
• L2ARC attempts to cache data from the ARC before it is evicted
  > There is no eviction path from ARC to L2ARC
• A kernel thread scans the MFU/MRU eviction lists and copies their buffers to the L2ARC devices
  > Refer to l2arc_feed_thread()

L2ARC: Tuning
• The performance of the L2ARC can be tweaked with a number of tunables, which may be necessary for different workloads:
  > l2arc_write_max: max write bytes per interval
  > l2arc_noprefetch: skip caching prefetched buffers
  > l2arc_headroom: number of max device writes to precache
  > l2arc_feed_secs: seconds between L2ARC writing

ZIL (ZFS Intent Log)
• File systems buffer write requests and sync these to storage periodically to improve performance
• Power loss can corrupt file systems and/or cause data loss
  > Corruption solved with TXG commits
    - Always on-disk consistency
• Use synchronous semantics for applications that require data to be flushed to the stable pool by the time a system call returns
  > Open file with O_DSYNC
  > Flush buffered contents with fsync(3C)
• The ZIL provides synchronous semantics for ZFS

ZIL Operational Overview
• The ZFS intent log (ZIL) saves transaction records of system calls that change the file system in memory, with enough information to replay them
• ZFS operations are organized by the DMU as transactions.
  Whenever a DMU transaction is opened, a ZIL transaction is opened as well
  > A log record holds a system call transaction
  > A log block can hold many log records, and blocks are chained together
  > Log blocks are dynamically allocated and freed as needed
  > a) ZIL blocks are freed on TXG commit by the DMU (discard)
  > b) or flushed due to synchronous requirements, e.g. fsync(3C) or O_DSYNC, and committed to stable storage
• In the event of power failure or panic, the transactions are replayed from the ZIL
• 1 ZIL per file system

ZIL
• ZILogs reside in memory or on disk
• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log
• ZILogs are written to disk in variable block sizes
  > min.
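The two ways of requesting synchronous semantics named on the "ZIL (ZFS Intent Log)" slide, opening with O_DSYNC or flushing with fsync(3C), can be sketched from the application side as follows. This is a minimal Python illustration, not anything taken from the deck: the helper name write_durably and the file name journal.dat are hypothetical, and whether the write is actually served by a ZIL depends on the file living on a ZFS dataset, which the code neither checks nor requires.

```python
import os
import tempfile


def write_durably(path, payload, use_o_dsync=False):
    """Write payload so it is on stable storage before returning.

    Hypothetical helper: uses O_DSYNC when requested and available,
    otherwise falls back to an explicit fsync().
    """
    # O_DSYNC is not defined on every platform, so look it up defensively.
    dsync_flag = getattr(os, "O_DSYNC", 0) if use_o_dsync else 0
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync_flag
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        while written < len(payload):
            # os.write may write fewer bytes than requested.
            written += os.write(fd, payload[written:])
        if not dsync_flag:
            # Without O_DSYNC, force buffered data out explicitly.
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "journal.dat")  # hypothetical file name
    count_fsync = write_durably(target, b"commit record\n")
    count_dsync = write_durably(target, b"commit record\n", use_o_dsync=True)
    print(count_fsync, count_dsync)
```

With O_DSYNC each write call returns only after the data is stable, while the fsync variant lets several writes be buffered and pays the synchronous cost once at the end; which one suits an application depends on how many writes make up one logical commit.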
