
Practices for accelerating Ceph OSD with bcache

花瑞 [email protected]

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Caching choices for Ceph

• Ceph cache tiering
  • Complex to operate and maintain; too many strategies to tune
  • Data migration between the cache pool and the base pool costs too much
  • Coarse-grained object promotion, slower performance in some workloads
  • Longer IO path on a cache miss
• OSD cache (block-level SSD cache in front of the HDD under each OSD)
  • Simple to deploy, with a simple replacement strategy
  • OSD metadata and journal are especially worth accelerating
  • Fine-grained sensitivity to active and inactive data

Caching choices for Ceph: block caching choices

Bcache
• First committed in kernel 3.10
• Good performance
• SSD-friendly design
• Pools SSD resources, thin provisioning
• Rich features

Flashcache
• Supported by Facebook
• Uses the kernel device-mapper
• Normal performance
• No longer developed and maintained

EnhanceIO
• Derived from Flashcache
• Normal performance
• Easy to maintain
• Easy to develop and debug

Dm-cache
• First committed in kernel 3.9
• Uses the kernel device-mapper
• Normal performance
• Poor features

Why bcache: feature comparison

Management
• Bcache: SSD pooled, thin provisioning, easy to add backing HDDs
• Flashcache/EnhanceIO: SSD (partition) bound to one backing HDD, not flexible

Hit ratio
• Bcache: extent-based / B+tree index, high hit ratio
• Flashcache/EnhanceIO: block-based / hash index, starvation for some cache blocks, low hit ratio

Writeback
• Bcache: flushes dirty data by HDD id, with throttling, good sequentiality
• Flashcache/EnhanceIO: flushes dirty data by bucket (2 MB), bad sequentiality

SSD-friendly design
• Bcache: full COW, reduces write amplification, media wears out slowly
• Flashcache/EnhanceIO: fixed metadata zone on the SSD, flash media wears out quickly when updating the index

IO sensitivity
• Bcache: honors REQ_SYNC / REQ_META / REQ_FLUSH / REQ_FUA, good compatibility with FileStore (XFS)
• Flashcache/EnhanceIO: --

Other features
• Bcache: sequential IO bypass, congestion control, SSD IO error handling
• Flashcache/EnhanceIO: --

Why bcache

• Performance comparison
• Source: http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/

[Chart: fio 4 KB random write and mixed read/write, libaio, iodepth=128, with writeback running; Flashcache vs. Bcache]

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Practices for accelerating OSD with bcache

• Best configuration
  • For each SSD, create one cache pool and attach an equal number of HDDs to each cache pool
  • Use an independent thin-flash LUN to accelerate the ObjectMap (and journal); a setup sketch follows the diagram below

[Diagram: each OSD's FileStore, ObjectMap, and FileJournal sit on bcache devices (bcache0, bcache1, bcache2, bcache3, …); one SSD cache pool fronts multiple backing HDDs]
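The layout above boils down to a handful of bcache commands. The sketch below is a minimal illustration rather than production tooling: the device names, the thin-flash volume size, and the <...> placeholders are assumptions to be replaced with real values; make-bcache and the attach/flash_vol_create sysfs interfaces are the standard bcache ones.

```python
#!/usr/bin/env python3
"""Sketch: print the bcache commands for one possible mapping -- one cache set
per SSD, an equal share of backing HDDs attached to each, plus a flash-only
("thin-flash") volume per SSD for ObjectMap/journal data."""

SSDS = ["/dev/nvme0n1", "/dev/nvme1n1"]                   # assumed cache devices
HDDS = ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"]   # assumed backing devices
THIN_FLASH_BYTES = 10 * 1024**3                           # assumed thin-flash LUN size

def commands():
    per_pool = len(HDDS) // len(SSDS)                      # equal number of HDDs per cache pool
    for i, ssd in enumerate(SSDS):
        yield f"make-bcache -C {ssd}"                      # format the SSD as a cache set
        for hdd in HDDS[i * per_pool:(i + 1) * per_pool]:
            yield f"make-bcache -B {hdd}"                  # format the HDD as a backing device
            # attach the new bcache device to this SSD's cache set by cset UUID
            yield f"echo <cset-uuid-of-{ssd}> > /sys/block/<bcacheN-of-{hdd}>/bcache/attach"
        # carve a flash-only volume out of the cache set for ObjectMap/journal
        yield f"echo {THIN_FLASH_BYTES} > /sys/fs/bcache/<cset-uuid-of-{ssd}>/flash_vol_create"

if __name__ == "__main__":
    for cmd in commands():
        print(cmd)
```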

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Bcache introduction

• Features
  • Cache modes: writeback, writethrough, writearound
  • Replacement algorithms: LRU, FIFO, random
  • Others: sequential IO bypass, congestion control, effective dirty-data flushing (tuning sketch below)

[Diagram: cache pool on SSD sdb; backing HDDs sdc, sdd, sde and a thin-flash LUN attach to it as bcache0..bcache3]
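These features are exposed through sysfs. The sketch below is a minimal tuning illustration, assuming a device named bcache0 and a placeholder cache-set UUID, and it only prints the corresponding echo commands; the attribute names follow the upstream bcache documentation while the concrete values are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: emit sysfs writes for the bcache knobs listed above (cache mode,
sequential bypass, writeback water level, congestion control, replacement
policy). Paths and values are illustrative."""

from pathlib import Path

BDEV = Path("/sys/block/bcache0/bcache")       # assumed bcache device
CSET = Path("/sys/fs/bcache/<cset-uuid>")      # placeholder cache-set path

def sysfs_write(path: Path, value: str) -> None:
    # Dry run: print the command; applying it needs root and real paths.
    print(f"echo {value} > {path}")

sysfs_write(BDEV / "cache_mode", "writeback")                 # cache mode
sysfs_write(BDEV / "sequential_cutoff", "4M")                 # bypass sequential IO >= 4 MB
sysfs_write(BDEV / "writeback_percent", "10")                 # target dirty-data water level
sysfs_write(CSET / "congested_read_threshold_us", "2000")     # congestion control
sysfs_write(CSET / "congested_write_threshold_us", "20000")
sysfs_write(CSET / "cache0/cache_replacement_policy", "lru")  # LRU / FIFO / random
```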

Bcache introduction——Layout

SSD layout

• Bucket
  • Unit of SSD space allocation
  • Typically 512 KB
• Data zone
  • COW allocator (toy sketch below)
  • Contiguous extents in data buckets
• Metadata zone
  • B+tree index to the extents
  • Updates can be done append-only, except for the SB bucket

[Diagram: SSD layout with SB bucket, journal buckets, btree buckets, uuid buckets, prio buckets, and data buckets]
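As a rough mental model of the bucket/COW allocation described above (not bcache's real code), here is a toy allocator; the class and field names are invented for illustration.

```python
"""Sketch: bucket-based SSD space allocation with append-only (COW) writes."""

from dataclasses import dataclass

BUCKET_SIZE = 512 * 1024          # typical bucket size from the slide (512 KB)

@dataclass
class Bucket:
    index: int
    kind: str = "free"            # "data", "btree", "journal", "uuid", "prio", "sb"
    gen: int = 0                  # generation, bumped whenever the bucket is reused
    used: int = 0                 # bytes appended so far

class CacheSet:
    """COW-style allocator: extents are only appended into a bucket; rewriting
    data means writing into a freshly allocated bucket instead of in place."""

    def __init__(self, n_buckets: int):
        self.buckets = [Bucket(i) for i in range(n_buckets)]
        self.free = list(range(n_buckets))

    def alloc(self, kind: str) -> Bucket:
        b = self.buckets[self.free.pop()]
        b.kind, b.used = kind, 0
        b.gen += 1                # stale pointers into the old contents are now invalid
        return b

    def append_extent(self, bucket: Bucket, nbytes: int) -> int:
        assert bucket.used + nbytes <= BUCKET_SIZE, "does not fit: allocate a new bucket"
        off = bucket.used
        bucket.used += nbytes
        return off                # offset of the new extent inside the bucket
```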

Bcache introduction——Index

• Index
  • Addresses all HDD space in one key space
  • B+tree key-value store; looks up cached data by HDD id + LBA (toy sketch below)
    • key: HDD id + HDD offset (request LBA)
    • value: SSD offset + gen
  • Each B+tree node maps to one btree bucket
  • Each node is cached in memory (metadata cache)
  • A journal/WAL accelerates B+tree updates

[Diagram: bkeys in btree buckets pointing at extents in data buckets]
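A toy version of that index is sketched below, with invented field names (hdd_id, lba, bucket, gen): one sorted key space over all backing devices, looked up by HDD id + LBA, standing in for the on-SSD B+tree.

```python
"""Sketch: a minimal stand-in for the bcache index described above."""

from __future__ import annotations
from dataclasses import dataclass
import bisect

@dataclass(frozen=True, order=True)
class BKey:
    hdd_id: int     # which backing device
    lba: int        # offset on that device

@dataclass
class BVal:
    bucket: int     # data bucket on the SSD
    offset: int     # offset of the extent inside that bucket
    length: int
    gen: int        # must match the bucket's gen, otherwise the pointer is stale
    dirty: bool

class Index:
    """Sorted key/value store standing in for the on-SSD B+tree."""
    def __init__(self):
        self._keys: list[BKey] = []
        self._vals: list[BVal] = []

    def insert(self, key: BKey, val: BVal) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = val         # overwrite: the new extent replaces the old one
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, val)

    def lookup(self, key: BKey) -> BVal | None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None                     # cache miss: read from the backing HDD
```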

Bcache introduction——GC

• Garbage collection
  • GC is only for reusing/reclaiming buckets
  • The GC thread traverses the buckets of the cache pool one by one, marking and compacting them
  • The allocator thread reclaims buckets according to the marked info
• Metadata GC——btree GC
  • Mark btree buckets / uuid buckets / prio buckets
  • Compact these buckets after marking
• Data GC——move GC (toy sketch below)
  • Mark reclaimable buckets / dirty buckets
  • Compact these buckets after marking, moving the cached data between buckets
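Below is a toy sketch of the mark and move-GC idea above; the bucket model, the quarter-bucket occupancy threshold, and all names are illustrative assumptions rather than bcache internals.

```python
"""Sketch: mark buckets by live data, reclaim empty ones, and move the few
live extents out of mostly-empty buckets so they can be reclaimed too."""

from dataclasses import dataclass

BUCKET_SIZE = 512 * 1024

@dataclass
class Extent:          # one cached piece of data, as pointed to by a bkey
    bucket: int
    length: int

@dataclass
class Bucket:
    index: int
    free: bool = False

def gc_pass(buckets: list, extents: list) -> None:
    # Mark phase: how many live bytes does each bucket still hold?
    live = {b.index: 0 for b in buckets}
    for e in extents:
        live[e.bucket] += e.length

    # Buckets with no live data are handed straight back to the allocator.
    for b in buckets:
        if not b.free and live[b.index] == 0:
            b.free = True

    # Move GC: copy the live extents out of mostly-empty buckets into a spare
    # bucket, then reclaim the source bucket as well.
    victims = [b for b in buckets if not b.free and live[b.index] < BUCKET_SIZE // 4]
    spares = [b for b in buckets if b.free]
    for src in victims:
        if not spares:
            break
        dst = spares.pop()
        dst.free = False
        for e in extents:
            if e.bucket == src.index:
                e.bucket = dst.index    # data is rewritten into the new bucket
        src.free = True
```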

Bcache introduction——Writeback

• Writeback thread
  • Each writeback thread works for one backing HDD
  • Fetch dirty bkeys by HDD id and push them into a buffer
  • Reorder the dirty bkeys by LBA, read the data from the SSD, then flush it to the HDD
• Throttle
  • PD controller for the water level (sketch below)
  • The more dirty data in the cache, the more aggressive the flushing speed
  • The faster the water level changes, the more aggressive the flushing speed

[Chart: writeback throttle versus dirty-data water level]
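The throttle can be pictured as a small proportional-derivative controller: the flush rate grows with how far the dirty data sits above its target water level and with how fast it is rising. The sketch below is an illustration with made-up gains and clamps, not bcache's actual writeback-rate code.

```python
"""Sketch: PD-style writeback throttle driven by the dirty-data water level."""

def writeback_rate(dirty_bytes: int, target_bytes: int, prev_dirty_bytes: int,
                   kp: float = 1e-3, kd: float = 5e-3,
                   min_rate: int = 8, max_rate: int = 512 * 1024) -> int:
    """Return a flush rate (sectors per second, clamped to a sane range)."""
    error = dirty_bytes - target_bytes            # P term: distance above the target level
    derivative = dirty_bytes - prev_dirty_bytes   # D term: change since the last period
    rate = kp * error + kd * derivative
    return int(max(min_rate, min(max_rate, rate)))

# Example: dirty data is 2 GiB over target and rose by 64 MiB this period.
print(writeback_rate(dirty_bytes=6 * 2**30, target_bytes=4 * 2**30,
                     prev_dirty_bytes=6 * 2**30 - 64 * 2**20))
```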

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Challenge for production ready

• More features the OSD needs
  • Stability promotion
  • SSD & HDD hot-plug support
  • The detach operation on a backing HDD blocks until the dirty data is clean
  • Dirty data can't be pruned when the backing HDD is missing
  • Recovering the thin-flash LUN after a reboot or crash
• Performance issues
  • Performance drops until the dirty data hits 0, then performance comes back
  • Performance fluctuates when the GC thread works
  • More read IO on the SSD in some workloads when memory is low

Outline

• Caching choices for Ceph
• Practices for accelerating OSD with bcache
• Bcache introduction
• Challenge for production ready
• Next step

Next step

• BlueStore is coming…
  • FileStore works on a filesystem, and bcache has some joint optimizations with the filesystem
  • BlueStore works on a raw block device
• Define different replication/recovery policies for cache-only data
• Enhance BlueStore with a full-space SSD cache
• NVMe optimization

[Diagram: BlueStore architecture: data and metadata, RocksDB over BlueFS and the Allocator, SSD plus HDDs]

Thanks!
