06 Practices for accelerating Ceph OSD with bcache, by 花瑞
Practices for accelerating Ceph OSD with bcache
花瑞 [email protected]
www.szsandstone.com

Outline
- Caching choices for Ceph
- Practices for accelerating OSD with bcache
- Bcache introduction
- Challenges for production readiness
- Next steps

01 Caching choices for Ceph

Ceph Cache Tiering:
- Complex to operate and maintain, with too many strategies to tune
- Data migration between the cache pool and the base pool is costly
- Coarse-grained object promotion; slower performance in some workloads
- Longer IO path on a cache miss

OSD cache (SSD in front of HDD):
- Simple to deploy, with a simple replacement strategy
- Accelerating OSD metadata and the journal is where the cache pays off most
- Fine-grained sensitivity to hot and cold data

02 Caching choices for Ceph

Linux block caching choices:
- Bcache: first merged into kernel 3.10; good performance; SSD-friendly design; pools SSD resources with thin provisioning; rich feature set
- Flashcache: backed by Facebook; built on the kernel device mapper; normal performance; easy to maintain, develop, and debug
- EnhanceIO: derived from Flashcache; normal performance; poor feature set; no longer developed or maintained
- Dm-cache: first merged into kernel 3.9; built on the kernel device mapper; normal performance

03 Why bcache

Feature comparison:
- Management: bcache pools SSDs with thin provisioning, so backing HDDs are easy to add; Flashcache/EnhanceIO bind an SSD (or partition) to one backing HDD, which is inflexible
- Hit ratio: bcache uses an extent-based B+tree index, giving a high hit ratio; Flashcache/EnhanceIO use a block-based hash index, which can starve some cache blocks and yields a low hit ratio
- Writeback: bcache flushes dirty data ordered by HDD id, with throttling, giving good sequentiality; Flashcache/EnhanceIO flush dirty data by bucket (2 MB), with bad sequentiality
- SSD-friendly design: bcache is fully copy-on-write, which reduces write amplification so the flash medium wears out slowly; Flashcache/EnhanceIO keep a fixed metadata zone on the SSD, so the medium wears out quickly when the index is updated
- IO sensitivity: bcache honors REQ_SYNC/REQ_META/REQ_FLUSH/REQ_FUA, giving good compatibility with FileStore (XFS); Flashcache/EnhanceIO do not
- Other features: bcache adds sequential IO bypass, congestion control, and an SSD IO error handler

04 Why bcache

Performance comparison:
- http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/
- fio, 4 KB blocks, random, libaio, iodepth=128, with writeback running; read, write, and mixed read/write workloads
(chart comparing Flashcache and bcache omitted)

06 Practices for accelerating OSD with bcache

Best configuration:
- For each SSD, create one cache pool and attach an equal number of HDDs to each cache pool
- Use an independent thin-flash LUN to accelerate the ObjectMap (and journal)
(diagram: an OSD's FileStore, ObjectMap, and FileJournal mapped onto bcache0-bcache3, backed by one SSD and several HDDs)

08 Bcache introduction

Features:
- Thin-flash LUNs
- Cache modes: writeback, writethrough, writearound
- Replacement algorithms: LRU, FIFO, random
- Others: sequential IO bypass, congestion control, efficient dirty-data flushing
(diagram: bcache0-bcache3 exported from a cache pool built on SSD sdb, with HDDs sdc, sdd, and sde attached)

09 Bcache introduction: Layout

SSD layout:
- Bucket: the unit of SSD space allocation, typically 512 KB
- Data zone: a COW allocator; contiguous extents stored in data buckets
- Metadata zone: a B+tree index over the extents
- Bucket types: SB bucket, journal bucket, btree bucket, uuid bucket, prio bucket, data bucket
- Updates can be done append-only, except for the SB bucket

10 Bcache introduction: Index

- One key space addresses all HDD space: HDD id plus HDD offset (the request LBA)
- A B+tree key-value store looks up cached data by HDD id + LBA (plus a generation number); the value records the SSD offset
- Each B+tree node maps to one btree bucket
- Nodes are cached in memory (a metadata cache)
- A journal/WAL accelerates B+tree updates
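The extent-based index is the heart of bcache's hit-ratio advantage: one key can describe an arbitrarily long cached run, where a fixed-block hash cache needs one entry per 4 KB block. The following is a minimal, hypothetical Python sketch of the idea; a sorted extent list stands in for the on-SSD B+tree, and all names and values here are illustrative, not bcache's actual structures.

```python
import bisect

class ExtentIndex:
    """Toy extent-based cache index: (dev, start, length) -> ssd_offset.

    A sorted list plays the role of bcache's B+tree over bkeys. Each
    entry says "length sectors of device dev, starting at start, are
    cached at ssd_offset on the SSD".
    """

    def __init__(self):
        self.keys = []  # sorted list of (dev, start, length, ssd_offset)

    def insert(self, dev, start, length, ssd_offset):
        bisect.insort(self.keys, (dev, start, length, ssd_offset))

    def lookup(self, dev, lba):
        # Find the rightmost extent starting at or before (dev, lba).
        i = bisect.bisect_right(self.keys, (dev, lba, float("inf"), 0)) - 1
        if i >= 0:
            d, start, length, ssd_off = self.keys[i]
            if d == dev and start <= lba < start + length:
                # Cache hit: translate the HDD LBA to an SSD offset.
                return ssd_off + (lba - start)
        return None  # cache miss: the IO goes to the backing HDD

idx = ExtentIndex()
idx.insert(dev=1, start=1000, length=512, ssd_offset=8192)
print(idx.lookup(1, 1200))  # inside the cached extent: a hit
print(idx.lookup(1, 2000))  # beyond it: a miss
```

A single insert here covers 512 sectors; the block-based hash design the comparison table mentions would need one hash entry per block to cache the same run, which is why long sequential runs can starve other cache blocks there.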
(diagram: bkeys stored in btree buckets pointing to extents in data buckets)

11 Bcache introduction: GC

Garbage collection thread:
- GC exists only to reuse and reclaim buckets
- The GC thread traverses the buckets in the cache pool one by one, marking and compacting them
- The allocator thread reclaims buckets based on the mark information

Metadata GC (btree GC):
- Mark the btree, uuid, and prio buckets
- Compact these buckets after marking

Data GC (move GC):
- Mark reclaimable and dirty buckets
- Compact these buckets after marking, moving the cached data between buckets

12 Bcache introduction: Writeback

Writeback thread:
- Each writeback thread works for one backing HDD
- It fetches dirty bkeys by HDD id and pushes them into a buffer
- It reorders the dirty bkeys by LBA, reads the data from the SSD, then flushes it to the HDD

Throttle:
- A PD controller tracks the dirty-data water level
- The more dirty data in the cache, the more aggressively it flushes
- The faster the water level rises, the more aggressively it flushes

14 Challenges for production readiness

More features the OSD needs:
- Stability improvements
- SSD and HDD hot-plug support
- Detaching a backing HDD blocks until all dirty data is clean
- Dirty data cannot be pruned when the backing HDD is missing
- Thin-flash LUNs must be recovered after a reboot or crash

Performance issues:
- Performance drops until the dirty data reaches zero, then recovers
- Performance fluctuates while the GC thread is working
- Some workloads issue more read IO to the SSD when memory is low

16 Next steps

BlueStore is coming:
- FileStore works on a filesystem, and bcache has some joint optimization with the filesystem
- BlueStore works on a raw block device
- Define a different replication/recovery policy for the cache only
- Enhance BlueStore with a full user-space SSD cache
- NVMe optimization
(diagram: BlueStore with data on raw HDDs and metadata in RocksDB on BlueFS, backed by SSD)

17 Thanks!
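The PD-controller throttle from the writeback section can be sketched in a few lines of Python. This is only an illustration of the proportional-derivative idea the slides describe; the gains, units, and function name are made up, and bcache's real controller works on sector counts and its own tunables.

```python
def writeback_rate(dirty, target, prev_dirty, kp=0.5, kd=2.0, dt=1.0):
    """Hypothetical PD-controller sketch for a writeback throttle.

    The flush rate grows with how far the dirty data sits above the
    target water level (proportional term) and with how fast the level
    is rising (derivative term).
    """
    error = dirty - target               # distance above the water level
    change = (dirty - prev_dirty) / dt   # how quickly the level is moving
    rate = kp * error + kd * change
    return max(rate, 0.0)                # never flush at a negative rate

# More dirty data in the cache -> more aggressive flushing:
low = writeback_rate(dirty=100, target=80, prev_dirty=100)
high = writeback_rate(dirty=200, target=80, prev_dirty=200)
# A quickly rising water level also speeds up flushing:
rising = writeback_rate(dirty=120, target=80, prev_dirty=100)
print(low, high, rising)
```

The derivative term is what lets the controller react before the cache fills: a steady level near the target gets a gentle background flush, while a sudden burst of dirty data gets flushed aggressively even if the absolute level is still moderate.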