
Practices for accelerating Ceph OSD with bcache

花瑞 [email protected]

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Caching choices for Ceph

• Ceph cache tiering
  • Complex to operate and maintain; too many strategies to tune
  • Data migration between the cache pool and the base pool costs too much
  • Coarse-grained object promotion, slower performance in some workloads
  • Longer IO path on a cache miss
• OSD cache (block-level SSD cache in front of the HDD under each OSD)
  • Simple to deploy, with a simple replacement strategy
  • OSD metadata and journal are especially worth accelerating
  • Fine-grained sensitivity to active and inactive data

Caching choices for Ceph: block caching choices

Bcache
• First committed in kernel 3.10
• Good performance
• SSD-friendly design
• Pools SSD resources, thin provisioning
• Rich features

Flashcache
• Supported by Facebook
• Uses the kernel device-mapper
• Normal performance
• No longer developed and maintained

EnhanceIO
• Derived from Flashcache
• Normal performance
• Easy to maintain
• Easy to develop and debug

Dm-cache
• First committed in kernel 3.9
• Uses the kernel device-mapper
• Normal performance
• Poor features

Why bcache: feature comparison

Management
• Bcache: SSD pooled, thin provisioning, easy to add backing HDDs
• Flashcache/EnhanceIO: SSD (partition) bound to one backing HDD, not flexible

Hit ratio
• Bcache: extent-based / B+tree index, high hit ratio
• Flashcache/EnhanceIO: block-based / hash index, starvation for some cache blocks, low hit ratio

Writeback
• Bcache: flushes dirty data by HDD id, with throttling, good sequentiality
• Flashcache/EnhanceIO: flushes dirty data by bucket (2 MB), bad sequentiality

SSD-friendly design
• Bcache: full COW, reduces write amplification, media wears out slowly
• Flashcache/EnhanceIO: fixed metadata zone on the SSD, flash media wears out quickly when updating the index

IO sensitivity
• Bcache: honors REQ_SYNC / REQ_META / REQ_FLUSH / REQ_FUA, good compatibility with FileStore (XFS)
• Flashcache/EnhanceIO: --

Other features
• Bcache: sequential IO bypass, congestion control, SSD IO error handling
• Flashcache/EnhanceIO: --

Why bcache

• Performance comparison
• Source: http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/

[Chart: fio 4 KB random write and mixed read/write, libaio, iodepth=128, with writeback running; Flashcache vs. Bcache]

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Practices for accelerating OSD with bcache

• Best configuration
  • For each SSD, create one cache pool and attach an equal number of HDDs to each cache pool
  • Use an independent thin-flash LUN to accelerate the ObjectMap (and journal); a setup sketch follows the diagram below

[Diagram: each OSD's FileStore, ObjectMap, and FileJournal sit on bcache devices (bcache0, bcache1, bcache2, bcache3, …); one SSD cache pool fronts multiple backing HDDs]
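The layout above boils down to a handful of bcache commands. The sketch below is a minimal illustration rather than production tooling: the device names, the thin-flash volume size, and the <...> placeholders are assumptions to be replaced with real values; make-bcache and the attach/flash_vol_create sysfs interfaces are the standard bcache ones.

```python
#!/usr/bin/env python3
"""Sketch: print the bcache commands for one possible mapping -- one cache set
per SSD, an equal share of backing HDDs attached to each, plus a flash-only
("thin-flash") volume per SSD for ObjectMap/journal data."""

SSDS = ["/dev/nvme0n1", "/dev/nvme1n1"]                   # assumed cache devices
HDDS = ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"]   # assumed backing devices
THIN_FLASH_BYTES = 10 * 1024**3                           # assumed thin-flash LUN size

def commands():
    per_pool = len(HDDS) // len(SSDS)                      # equal number of HDDs per cache pool
    for i, ssd in enumerate(SSDS):
        yield f"make-bcache -C {ssd}"                      # format the SSD as a cache set
        for hdd in HDDS[i * per_pool:(i + 1) * per_pool]:
            yield f"make-bcache -B {hdd}"                  # format the HDD as a backing device
            # attach the new bcache device to this SSD's cache set by cset UUID
            yield f"echo <cset-uuid-of-{ssd}> > /sys/block/<bcacheN-of-{hdd}>/bcache/attach"
        # carve a flash-only volume out of the cache set for ObjectMap/journal
        yield f"echo {THIN_FLASH_BYTES} > /sys/fs/bcache/<cset-uuid-of-{ssd}>/flash_vol_create"

if __name__ == "__main__":
    for cmd in commands():
        print(cmd)
```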

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Bcache introduction

• Features
  • Cache modes: writeback, writethrough, writearound
  • Replacement algorithms: LRU, FIFO, random
  • Others: sequential IO bypass, congestion control, effective dirty-data flushing (tuning sketch below)

[Diagram: cache pool on SSD sdb; backing HDDs sdc, sdd, sde and a thin-flash LUN attach to it as bcache0..bcache3]
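These features are exposed through sysfs. The sketch below is a minimal tuning illustration, assuming a device named bcache0 and a placeholder cache-set UUID, and it only prints the corresponding echo commands; the attribute names follow the upstream bcache documentation while the concrete values are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: emit sysfs writes for the bcache knobs listed above (cache mode,
sequential bypass, writeback water level, congestion control, replacement
policy). Paths and values are illustrative."""

from pathlib import Path

BDEV = Path("/sys/block/bcache0/bcache")       # assumed bcache device
CSET = Path("/sys/fs/bcache/<cset-uuid>")      # placeholder cache-set path

def sysfs_write(path: Path, value: str) -> None:
    # Dry run: print the command; applying it needs root and real paths.
    print(f"echo {value} > {path}")

sysfs_write(BDEV / "cache_mode", "writeback")                 # cache mode
sysfs_write(BDEV / "sequential_cutoff", "4M")                 # bypass sequential IO >= 4 MB
sysfs_write(BDEV / "writeback_percent", "10")                 # target dirty-data water level
sysfs_write(CSET / "congested_read_threshold_us", "2000")     # congestion control
sysfs_write(CSET / "congested_write_threshold_us", "20000")
sysfs_write(CSET / "cache0/cache_replacement_policy", "lru")  # LRU / FIFO / random
```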

Bcache introduction——Layout

SSD layout

• Bucket
  • Unit of SSD space allocation
  • Typically 512 KB
• Data zone
  • COW allocator (toy sketch below)
  • Contiguous extents in data buckets
• Metadata zone
  • B+tree index to the extents
  • Updates can be done append-only, except for the SB bucket

[Diagram: SSD layout with SB bucket, journal buckets, btree buckets, uuid buckets, prio buckets, and data buckets]
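As a rough mental model of the bucket/COW allocation described above (not bcache's real code), here is a toy allocator; the class and field names are invented for illustration.

```python
"""Sketch: bucket-based SSD space allocation with append-only (COW) writes."""

from dataclasses import dataclass

BUCKET_SIZE = 512 * 1024          # typical bucket size from the slide (512 KB)

@dataclass
class Bucket:
    index: int
    kind: str = "free"            # "data", "btree", "journal", "uuid", "prio", "sb"
    gen: int = 0                  # generation, bumped whenever the bucket is reused
    used: int = 0                 # bytes appended so far

class CacheSet:
    """COW-style allocator: extents are only appended into a bucket; rewriting
    data means writing into a freshly allocated bucket instead of in place."""

    def __init__(self, n_buckets: int):
        self.buckets = [Bucket(i) for i in range(n_buckets)]
        self.free = list(range(n_buckets))

    def alloc(self, kind: str) -> Bucket:
        b = self.buckets[self.free.pop()]
        b.kind, b.used = kind, 0
        b.gen += 1                # stale pointers into the old contents are now invalid
        return b

    def append_extent(self, bucket: Bucket, nbytes: int) -> int:
        assert bucket.used + nbytes <= BUCKET_SIZE, "does not fit: allocate a new bucket"
        off = bucket.used
        bucket.used += nbytes
        return off                # offset of the new extent inside the bucket
```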

Bcache introduction——Index

• Index
  • Addresses all HDD space in one key space
  • B+tree key-value store; looks up cached data by HDD id + LBA (toy sketch below)
    • key: HDD id + HDD offset (request LBA)
    • value: SSD offset + gen
  • Each B+tree node maps to one btree bucket
  • Each node is cached in memory (metadata cache)
  • A journal/WAL accelerates B+tree updates

[Diagram: bkeys in btree buckets pointing at extents in data buckets]
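A toy version of that index is sketched below, with invented field names (hdd_id, lba, bucket, gen): one sorted key space over all backing devices, looked up by HDD id + LBA, standing in for the on-SSD B+tree.

```python
"""Sketch: a minimal stand-in for the bcache index described above."""

from __future__ import annotations
from dataclasses import dataclass
import bisect

@dataclass(frozen=True, order=True)
class BKey:
    hdd_id: int     # which backing device
    lba: int        # offset on that device

@dataclass
class BVal:
    bucket: int     # data bucket on the SSD
    offset: int     # offset of the extent inside that bucket
    length: int
    gen: int        # must match the bucket's gen, otherwise the pointer is stale
    dirty: bool

class Index:
    """Sorted key/value store standing in for the on-SSD B+tree."""
    def __init__(self):
        self._keys: list[BKey] = []
        self._vals: list[BVal] = []

    def insert(self, key: BKey, val: BVal) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = val         # overwrite: the new extent replaces the old one
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, val)

    def lookup(self, key: BKey) -> BVal | None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None                     # cache miss: read from the backing HDD
```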

Bcache introduction——GC

• Garbage collection
  • GC is only for reusing/reclaiming buckets
  • The GC thread traverses the buckets of the cache pool one by one, marking and compacting them
  • The allocator thread reclaims buckets according to the marked info
• Metadata GC——btree GC
  • Mark btree buckets / uuid buckets / prio buckets
  • Compact these buckets after marking
• Data GC——move GC (toy sketch below)
  • Mark reclaimable buckets / dirty buckets
  • Compact these buckets after marking, moving the cached data between buckets
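Below is a toy sketch of the mark and move-GC idea above; the bucket model, the quarter-bucket occupancy threshold, and all names are illustrative assumptions rather than bcache internals.

```python
"""Sketch: mark buckets by live data, reclaim empty ones, and move the few
live extents out of mostly-empty buckets so they can be reclaimed too."""

from dataclasses import dataclass

BUCKET_SIZE = 512 * 1024

@dataclass
class Extent:          # one cached piece of data, as pointed to by a bkey
    bucket: int
    length: int

@dataclass
class Bucket:
    index: int
    free: bool = False

def gc_pass(buckets: list, extents: list) -> None:
    # Mark phase: how many live bytes does each bucket still hold?
    live = {b.index: 0 for b in buckets}
    for e in extents:
        live[e.bucket] += e.length

    # Buckets with no live data are handed straight back to the allocator.
    for b in buckets:
        if not b.free and live[b.index] == 0:
            b.free = True

    # Move GC: copy the live extents out of mostly-empty buckets into a spare
    # bucket, then reclaim the source bucket as well.
    victims = [b for b in buckets if not b.free and live[b.index] < BUCKET_SIZE // 4]
    spares = [b for b in buckets if b.free]
    for src in victims:
        if not spares:
            break
        dst = spares.pop()
        dst.free = False
        for e in extents:
            if e.bucket == src.index:
                e.bucket = dst.index    # data is rewritten into the new bucket
        src.free = True
```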

Bcache introduction——Writeback

• Writeback thread
  • Each writeback thread works for one backing HDD
  • Fetch dirty bkeys by HDD id and push them into a buffer
  • Reorder the dirty bkeys by LBA, read the data from the SSD, then flush it to the HDD
• Throttle
  • PD controller for the water level (sketch below)
  • The more dirty data in the cache, the more aggressive the flushing speed
  • The faster the water level changes, the more aggressive the flushing speed

[Chart: writeback throttle versus dirty-data water level]
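The throttle can be pictured as a small proportional-derivative controller: the flush rate grows with how far the dirty data sits above its target water level and with how fast it is rising. The sketch below is an illustration with made-up gains and clamps, not bcache's actual writeback-rate code.

```python
"""Sketch: PD-style writeback throttle driven by the dirty-data water level."""

def writeback_rate(dirty_bytes: int, target_bytes: int, prev_dirty_bytes: int,
                   kp: float = 1e-3, kd: float = 5e-3,
                   min_rate: int = 8, max_rate: int = 512 * 1024) -> int:
    """Return a flush rate (sectors per second, clamped to a sane range)."""
    error = dirty_bytes - target_bytes            # P term: distance above the target level
    derivative = dirty_bytes - prev_dirty_bytes   # D term: change since the last period
    rate = kp * error + kd * derivative
    return int(max(min_rate, min(max_rate, rate)))

# Example: dirty data is 2 GiB over target and rose by 64 MiB this period.
print(writeback_rate(dirty_bytes=6 * 2**30, target_bytes=4 * 2**30,
                     prev_dirty_bytes=6 * 2**30 - 64 * 2**20))
```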

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Challenge for production ready

• More features the OSD needs
  • Stability promotion
  • SSD & HDD hot-plug support
  • The detach operation on a backing HDD blocks until the dirty data is clean
  • Dirty data can't be pruned when the backing HDD is missing
  • Recovering the thin-flash LUN after a reboot or crash
• Performance issues
  • Performance drops until the dirty data hits 0, then performance comes back
  • Performance fluctuates when the GC thread works
  • More read IO on the SSD in some workloads when memory is low

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Next step

• BlueStore is coming…
  • FileStore works on a filesystem, and bcache has some joint optimizations with the filesystem
  • BlueStore works on a raw block device
• Define different replication/recovery policies for cache-only data
• Enhance BlueStore with a full-space SSD cache
• NVMe optimization

[Diagram: BlueStore architecture: data and metadata, RocksDB over BlueFS and the Allocator, SSD plus HDDs]

Thanks!
