GOODBYE, XFS: BUILDING A NEW, FASTER STORAGE BACKEND FOR CEPH

SAGE WEIL – 2017.09.12

OUTLINE

● Ceph background and context – FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Current status, future
● Summary

MOTIVATION

CEPH

● Object, block, and file storage in a single cluster

● All components scale horizontally

● No single point of failure

● Hardware agnostic, commodity hardware

● Self-manage whenever possible

● Open source (LGPL)

● “A Scalable, High-Performance Distributed File System”

● “performance, reliability, and scalability”

● Luminous v12.2.0 released two weeks ago

● Erasure coding support for block and file (and object)

● BlueStore (our new storage backend)

CEPH COMPONENTS

OBJECT: RGW – A web services gateway for object storage, compatible with S3 and Swift
BLOCK: RBD – A reliable, fully-distributed block device with cloud platform integration
FILE: CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management

LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS (OSDS)

[Diagram: each OSD runs on top of a local filesystem (FS) on its own disk; monitors (M) sit alongside the OSDs.]

OBJECT STORAGE DAEMONS (OSDS)

[Diagram: the same stack with FileStore made explicit – each OSD runs FileStore on top of a local filesystem (xfs, btrfs, or ext4) on its own disk, alongside the monitors (M).]

IN THE BEGINNING

[Timeline 2004–2016: EBOFS and FakeStore were the original local storage backends.]

OBJECTSTORE AND DATA MODEL

● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FakeStore
● Object – “file”
  – data (file-like stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – actually a shard of a RADOS pool
  – Ceph placement group (PG)
● All writes are transactions (see the sketch below)
  – Atomic + Consistent + Durable
  – Isolation provided by OSD
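To make the transaction model concrete, the sketch below shows how an OSD-style compound update could be expressed against an ObjectStore-like interface. It is a minimal illustration loosely modeled on Ceph's ObjectStore::Transaction; the types, method signatures, and object names here are assumptions for illustration, not the actual Ceph API.

// Minimal sketch, not the real Ceph API: an ObjectStore-style transaction that
// batches several mutations so they commit atomically (and durably) as one unit.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Op {
  std::string name;        // "write", "setattr", "omap_setkeys", ...
  std::string collection;  // PG / collection id
  std::string oid;         // object name
  std::string detail;      // human-readable summary of the payload
};

struct Transaction {
  std::vector<Op> ops;

  void write(const std::string& c, const std::string& o, uint64_t off, uint64_t len) {
    ops.push_back({"write", c, o,
                   "off=" + std::to_string(off) + " len=" + std::to_string(len)});
  }
  void setattr(const std::string& c, const std::string& o,
               const std::string& key, const std::string& val) {
    ops.push_back({"setattr", c, o, key + "=" + val});
  }
  void omap_setkeys(const std::string& c, const std::string& o,
                    const std::map<std::string, std::string>& kv) {
    ops.push_back({"omap_setkeys", c, o, std::to_string(kv.size()) + " key(s)"});
  }
};

int main() {
  // A typical "simple" transaction: a data write, an xattr update, and a
  // key/value insert for the PG log -- all or nothing.
  Transaction t;
  t.write("0.6_head", "object23", 0, 4 * 1024 * 1024);
  t.setattr("0.6_head", "object23", "_", "<object_info>");
  t.omap_setkeys("0.6_head", "pg_meta", {{"pg_log_entry_5.6", "<entry>"}});

  for (const auto& op : t.ops)
    std::cout << op.name << " " << op.collection << "/" << op.oid
              << " (" << op.detail << ")\n";
}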

LEVERAGE KERNEL FILESYSTEMS!

[Timeline 2004–2016: EBOFS (later removed); FakeStore evolves into FileStore, used with BtrFS and XFS.]

FILESTORE

● FileStore
  – evolution of FakeStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Example on-disk layout:
  /var/lib/ceph/osd/ceph-123/
    current/
      meta/
        osdmap123
        osdmap124
      0.1_head/
        object1
        object12
      0.7_head/
        object3
        object5
      0.a_head/
        object4
        object6
      omap/
        …

POSIX FAILS: TRANSACTIONS

● Most transactions are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (kv insert)
  ...but others are arbitrarily large/complex
● Serialize and write-ahead txn to journal for atomicity
  – We double-write everything!
  – Lots of ugly hackery to make replayed events idempotent

Example transaction dump:
  [
    {
      "op_name": "write",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "length": 4194304,
      "offset": 0,
      "bufferlist length": 4194304
    },
    {
      "op_name": "setattrs",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "attr_lens": {
        "_": 269,
        "snapset": 31
      }
    },
    {
      "op_name": "omap_setkeys",
      "collection": "0.6_head",
      "oid": "#0:60000000::::head#",
      "attr_lens": {
        "0000000005.00000000000000000006": 178,
        "_info": 847
      }
    }
  ]
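To see where the double-write comes from, here is a rough, self-contained sketch of the write-ahead pattern described above (my own illustration, not FileStore code): the serialized transaction is appended and synced to a journal first, and only afterwards applied to the backing file, so the payload hits the disk twice. The file names are hypothetical.

// Rough illustration of write-ahead journaling (not FileStore's implementation):
// append and fsync the serialized transaction to the journal, then apply the same
// bytes to the backing file -- every byte is written twice.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

static void append_and_sync(const char* path, const std::vector<char>& buf) {
  int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) return;                      // error handling elided
  ::write(fd, buf.data(), buf.size());     // first copy: the journal
  ::fsync(fd);                             // durable before the write is acked
  ::close(fd);
}

static void apply_to_object(const char* path, const std::vector<char>& buf) {
  int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) return;
  ::pwrite(fd, buf.data(), buf.size(), 0); // second copy: the object file itself
  ::fsync(fd);
  ::close(fd);
}

int main() {
  std::vector<char> txn(4096, 'x');        // stand-in for a serialized transaction
  append_and_sync("journal.bin", txn);     // write-ahead: this is the commit point
  apply_to_object("object.dat", txn);      // apply; a read has to wait for this step
  // After a crash the journal is replayed, which is why every logged operation
  // has to be made idempotent.
}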

POSIX FAILS: KERNEL CACHE

● Kernel cache is mostly wonderful
  – automatically sized to use all available memory
  – automatically shrinks if memory is needed elsewhere
  – efficient

● No cache implemented in app, yay!

● Read/modify/write vs write-ahead – the write must be persisted to the journal – then applied to the file system – only then can it be read

POSIX FAILS: ENUMERATION

● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – data rebalancing, recovery
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch below)
  – split big directories, merge small ones
  – read entire directory, sort in-memory

Example directory layout:
  …
  DIR_A/
  DIR_A/A03224D3_qwer
  DIR_A/A247233E_zxcv
  …
  DIR_B/
  DIR_B/DIR_8/
  DIR_B/DIR_8/B823032D_foo
  DIR_B/DIR_8/B8474342_bar
  DIR_B/DIR_9/
  DIR_B/DIR_9/B924273B_baz
  DIR_B/DIR_A/
  DIR_B/DIR_A/BA4328D2_asdf
  …
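As a toy illustration of the hash-prefix scheme in the listing above (not FileStore's actual HashIndex code), an object's directory is derived from the leading hex digits of its hash, so a large directory can be "split" simply by adding one more level of nesting for that prefix:

// Toy sketch of hash-prefix placement (not FileStore's real HashIndex code):
// nest an object under DIR_<nibble>/ directories taken from the leading hex
// digits of its 32-bit hash, down to the current split depth for that prefix.
#include <cstdint>
#include <cstdio>
#include <string>

std::string object_path(uint32_t hash, const std::string& name, int depth) {
  char hex[9];
  std::snprintf(hex, sizeof(hex), "%08X", static_cast<unsigned>(hash));
  std::string path;
  for (int i = 0; i < depth; ++i) {   // one directory level per leading nibble
    path += "DIR_";
    path += hex[i];
    path += '/';
  }
  return path + hex + "_" + name;
}

int main() {
  // Splitting a big directory just means using one more nibble for that prefix.
  std::printf("%s\n", object_path(0xA03224D3, "qwer", 1).c_str()); // DIR_A/A03224D3_qwer
  std::printf("%s\n", object_path(0xB823032D, "foo", 2).c_str());  // DIR_B/DIR_8/B823032D_foo
}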

THE HEADACHES CONTINUE

● New FileStore+XFS problems continue to surface – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison – Cannot bound deferred writeback work, even with fsync(2) – QoS efforts thwarted by deep queues and periodicity in FileStore throughput – {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create object clones

ACTUALLY, OUR FIRST PLAN WAS BETTER

[Timeline 2004–2016: EBOFS (later removed), FakeStore → FileStore with BtrFS and XFS, then NewStore, and finally BlueStore.]

BLUESTORE

● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable inline compression
  – full data checksums
● Target hardware
  – current gen HDD, SSD (SATA, SAS, NVMe)
● We must share the block device with RocksDB

[Architecture diagram: ObjectStore → BlueStore; metadata goes through RocksDB → BlueRocksEnv → BlueFS → BlockDevice, while data goes directly from BlueStore to the BlockDevice.]

ROCKSDB: BlueRocksEnv + BlueFS

● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes “file” operations to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata lives in the journal
  – all metadata is loaded into RAM on start/mount
  – journal is rewritten/compacted when it gets large
  – no need to store a block free list
● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space

[BlueFS on-disk layout: superblock, journal extents, and data extents interleaved across the device (journal … data … more journal … data).]
[Journal contents example: file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, …]
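The directory-to-device mapping is essentially a prefix match on the paths RocksDB hands to its Env. The sketch below is my own simplification of that policy, not the actual BlueRocksEnv/BlueFS code; the enum names are invented for illustration.

// Simplified sketch of BlueFS-style placement (not the real BlueRocksEnv/BlueFS
// code): RocksDB opens files under db.wal/, db/, and db.slow/, and each prefix is
// directed to a different class of block device.
#include <iostream>
#include <string>

enum class Device { WalFast, DbSsd, DbSlowHdd };   // invented names for illustration

Device pick_device(const std::string& rocksdb_path) {
  if (rocksdb_path.rfind("db.wal/", 0) == 0)  return Device::WalFast;    // NVRAM/NVMe/SSD
  if (rocksdb_path.rfind("db.slow/", 0) == 0) return Device::DbSlowHdd;  // cold SSTs spill to HDD
  return Device::DbSsd;                                                  // level0 / hot SSTs
}

int main() {
  std::cout << int(pick_device("db.wal/000123.log")) << "\n";  // 0: WAL device
  std::cout << int(pick_device("db/000456.sst")) << "\n";      // 1: SSD
  std::cout << int(pick_device("db.slow/000789.sst")) << "\n"; // 2: HDD
}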

METADATA

ONODE – OBJECT

● Per-object metadata
  – lives directly in the key/value pair
  – serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)

  struct bluestore_onode_t {
    uint64_t size;
    map<string, bufferptr> attrs;
    uint64_t flags;

    // extent map metadata
    struct shard_info {
      uint32_t offset;
      uint32_t bytes;
    };
    vector<shard_info> shards;

    bufferlist inline_extents;
    bufferlist spanning_blobs;
  };

CNODE – COLLECTION

● Collection metadata
  – interval of the object namespace

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
  };

  struct bluestore_cnode_t {
    uint32_t bits;
  };

Key space example (C = collection key: pool, hash → bits; O = object key: pool, hash, name, snap):

  C<12,3d3e0000> “12.e3d3” = <19>
  O<12,3d3d880e,foo,NOSNAP> = …
  O<12,3d3d9223,bar,NOSNAP> = …
  O<12,3d3e02c2,baz,NOSNAP> = …
  O<12,3d3e125d,zip,NOSNAP> = …
  O<12,3d3e1d41,dee,NOSNAP> = …
  O<12,3d3e3832,dah,NOSNAP> = …

● Nice properties
  – ordered enumeration of objects
  – we can “split” collections by adjusting collection metadata only

BLOB AND SHAREDBLOB

● Blob
  – extent(s) on device
  – checksum metadata
  – data may be compressed

  struct bluestore_blob_t {
    uint32_t flags = 0;
    vector<bluestore_pextent_t> extents;
    uint16_t unused = 0;              // bitmap
    uint8_t csum_type = CSUM_NONE;
    uint8_t csum_chunk_order = 0;
    bufferptr csum_data;
    uint32_t compressed_length_orig = 0;
    uint32_t compressed_length = 0;
  };

● SharedBlob
  – extent ref count on cloned blobs

  struct bluestore_shared_blob_t {
    uint64_t sbid;
    bluestore_extent_ref_map_t ref_map;
  };
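To make csum_type and csum_chunk_order concrete, here is a hedged sketch (an illustration, not BlueStore's implementation) of the checksum layout: each blob is checksummed in chunks of 2^csum_chunk_order bytes, one entry per chunk, and a read must verify every chunk it overlaps. The 4 KB default and the toy struct's names are assumptions.

// Illustration only (not BlueStore code): per-chunk checksum layout for a blob.
// chunk_size = 1 << csum_chunk_order; csum_data holds one value per chunk, so a
// read of [off, off+len) has to verify every chunk the range overlaps.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

struct BlobCsum {
  uint8_t csum_chunk_order = 12;        // assumed default: 4 KB chunks
  std::vector<uint32_t> csum_data;      // e.g. one crc per chunk; a weaker checksum
                                        // simply stores fewer bits per chunk

  uint64_t chunk_size() const { return 1ull << csum_chunk_order; }

  // Range of checksum entries a read of [off, off+len) must verify: [first, last)
  std::pair<uint64_t, uint64_t> chunks_for(uint64_t off, uint64_t len) const {
    uint64_t first = off / chunk_size();
    uint64_t last = (off + len + chunk_size() - 1) / chunk_size();
    return {first, last};
  }
};

int main() {
  BlobCsum c;
  c.csum_data.resize(16);                          // a 64 KB blob -> 16 x 4 KB chunks
  auto [first, last] = c.chunks_for(6000, 10000);  // read spanning chunk boundaries
  std::printf("verify checksum chunks [%llu, %llu)\n",
              (unsigned long long)first, (unsigned long long)last);  // [1, 4)
}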

EXTENT MAP

● Map object extents → blob extents
● Serialized in chunks
  – stored inline in the onode value if small
  – otherwise stored in adjacent keys
● Blobs stored inline in each shard
  – unless a blob is referenced across shard boundaries
  – “spanning” blobs are stored in the onode key

Key layout example:
  …
  O<,,foo,,>  = onode + inline extent map
  O<,,bar,,>  = onode + spanning blobs
  O<,,bar,,0> = extent map shard
  O<,,bar,,4> = extent map shard
  O<,,baz,,>  = onode + inline extent map
  …

DATA PATH

DATA PATH BASICS

Terms
● TransContext
  – state describing an executing transaction
● Sequencer
  – an independent, totally ordered queue of transactions
  – one per PG

Three ways to write (see the sketch below)
● New allocation
  – any write larger than min_alloc_size goes to a new, unused extent on disk
  – once that IO completes, we commit the transaction
● Unused part of an existing blob
● Deferred writes
  – commit a temporary promise to (over)write data with the transaction
    ● includes the data!
  – do the async (over)write
  – then clean up the temporary k/v pair
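The choice among the three write paths boils down to size and available blob space. The sketch below is a simplification of the decision as described on this slide, not BlueStore's actual logic; the 64 KB min_alloc_size is an assumed HDD-style default.

// Simplified sketch of the write-path decision described above (not BlueStore's
// actual logic): large writes get a fresh allocation and commit after the data IO
// completes; small overwrites are deferred through the key/value transaction.
#include <cstdint>
#include <cstdio>

enum class WritePath { NewAllocation, UnusedPartOfBlob, Deferred };

WritePath choose_write_path(uint64_t len, bool fits_in_unused_blob_space,
                            uint64_t min_alloc_size) {
  if (len >= min_alloc_size)
    return WritePath::NewAllocation;       // write to a new extent, then commit txn
  if (fits_in_unused_blob_space)
    return WritePath::UnusedPartOfBlob;    // fill never-used space in an existing blob
  return WritePath::Deferred;              // data rides in the kv txn, applied async
}

int main() {
  const uint64_t min_alloc = 64 * 1024;    // assumed HDD-style min_alloc_size
  std::printf("%d\n", (int)choose_write_path(4096, false, min_alloc));     // 2: Deferred
  std::printf("%d\n", (int)choose_write_path(1 << 20, false, min_alloc));  // 0: NewAllocation
}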

IN-MEMORY CACHE

● OnodeSpace per collection – in-memory name → Onode map of decoded onodes

● BufferSpace for in-memory Blobs
  – all in-flight writes
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – TwoQCache – implements the 2Q cache replacement algorithm (default)

● Cache is sharded for parallelism – Collection → shard mapping matches OSD's op_wq – same CPU context that processes client requests will touch the LRU/2Q lists
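As a small illustration of the sharding idea (my own sketch; the real mapping reuses the OSD's op_wq shard assignment rather than a fresh hash), a collection id can be hashed to pick which cache shard owns its onodes and buffers:

// Toy sketch of cache sharding (not the real OSD code): hash the collection (PG)
// id to pick a cache shard, so a PG's onodes and buffers are always touched from
// the same sharded work-queue context that handles its client requests.
#include <cstdio>
#include <functional>
#include <string>

size_t cache_shard_for(const std::string& pgid, size_t num_shards) {
  return std::hash<std::string>{}(pgid) % num_shards;
}

int main() {
  const size_t num_shards = 8;   // assumed shard count
  std::printf("pg 12.e3 -> shard %zu\n", cache_shard_for("12.e3", num_shards));
  std::printf("pg 12.1a -> shard %zu\n", cache_shard_for("12.1a", num_shards));
}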

PERFORMANCE

HDD: RANDOM WRITE

[Charts: Bluestore vs Filestore HDD random write throughput (MB/s) and IOPS across IO sizes from 4 KB to 4 MB. Series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw).]

HDD: SEQUENTIAL WRITE

[Charts: Bluestore vs Filestore HDD sequential write throughput (MB/s) and IOPS across the same IO sizes and series as above; one point on the IOPS chart is annotated “WTH”.]

RGW ON HDD+NVME, EC 4+2

[Chart: 4+2 erasure coding RadosGW write tests – 32 MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients. Throughput (MB/s) for Filestore 512KB chunks, Filestore 4MB chunks, Bluestore 512KB chunks, and Bluestore 4MB chunks across 1 and 4 buckets (1 RGW server), 128 and 512 buckets (4 RGW servers), and rados bench.]

ERASURE CODE OVERWRITES

● Luminous allows overwrites of EC objects – Requires two-phase commit to avoid “RAID-hole” like failure conditions – OSD creates temporary rollback objects

  – clone_range $extent to a temporary object
  – overwrite $extent with new data
● BlueStore can do this easily…

● FileStore literally copies (reads and writes to-be-overwritten region)
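Expressed as transaction steps, the rollback scheme looks roughly like the sketch below. This is an illustration of the sequence described above, not the actual OSD code path; the temporary object name and step encoding are invented.

// Illustration of the EC-overwrite rollback sequence described above (not the
// actual OSD code): stash the to-be-overwritten extent in a temporary rollback
// object, then overwrite; if the two-phase commit aborts, the stash is restored.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Step { std::string op, src, dst; uint64_t off, len; };

std::vector<Step> ec_overwrite_steps(const std::string& obj, uint64_t off, uint64_t len) {
  std::string tmp = obj + "_rollback";           // hypothetical temp-object name
  return {
    // a cheap metadata op in BlueStore; a literal read+write copy in FileStore
    {"clone_range", obj, tmp, off, len},
    {"write", "<new data>", obj, off, len},      // overwrite the extent
  };
}

int main() {
  for (const auto& s : ec_overwrite_steps("ec_shard_obj", 65536, 4096))
    std::printf("%s %s -> %s off=%llu len=%llu\n", s.op.c_str(), s.src.c_str(),
                s.dst.c_str(), (unsigned long long)s.off, (unsigned long long)s.len);
}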

BLUESTORE vs FILESTORE: 3X vs EC 4+2 vs EC 5+1

[Chart: RBD 4K random writes – 16 HDD OSDs, 8× 32 GB volumes, 256 IOs in flight. IOPS for 3× replication, EC 4+2, and EC 5+1 on Bluestore vs Filestore (HDD/HDD).]

OTHER CHALLENGES

USERSPACE CACHE

● Built ‘mempool’ accounting infrastructure – easily annotate/tag C++ classes and containers – low overhead – debug mode provides per-type (vs per-pool) accounting

● Requires configuration – bluestore_cache_size (default 1GB)

● not as convenient as auto-sized kernel caches

● Finally have meaningful implementation of fadvise NOREUSE (vs DONTNEED)

MEMORY EFFICIENCY

● Careful attention to struct sizes: packing, redundant fields

● Checksum chunk sizes – client hints to expect sequential read/write → large csum chunks – can optionally select weaker checksum (16 or 8 bits per chunk)

● In-memory red/black trees (e.g., std::map<>) bad for CPUs – low temporal write locality → many CPU cache misses, failed prefetches – use btrees where appropriate – per-onode slab allocators for extent and blob structs

ROCKSDB

● Compaction – Overall impact grows as total metadata corpus grows – Invalidates rocksdb block cache (needed for range queries) – Awkward to control priority on kernel libaio interface

● Many deferred write keys end up in L0

● High write amplification – SSDs with low-cost random reads care more about total write overhead

● Open to alternatives for SSD/NVM

STATUS

● Stable and recommended default in Luminous v12.2.z (just released!)

● Migration from FileStore – Reprovision individual OSDs, let cluster heal/recover

● Current efforts – Optimizing for CPU time – SPDK support! – Adding some repair functionality to fsck

FUTURE WORK

● Tiering – hints to underlying block-based tiering device (e.g., dm-cache, …)? – implement directly in BlueStore?

● host-managed SMR? – maybe...

● Persistent memory + 3D NAND future?

● NVMe extensions for key/value, blob storage?

SUMMARY

● Ceph is great at scaling out
● POSIX was a poor choice for storing objects
● Our new BlueStore backend is so much better
  – Good (and rational) performance!
  – Inline compression and full data checksums
● Designing internal storage interfaces without a POSIX bias is a win… eventually
● We are definitely not done yet
  – Performance!
  – Other stuff
● We can finally solve our IO problems ourselves

BASEMENT CLUSTER

● 2 TB 2.5” HDDs

● 1 TB 2.5” SSDs (SATA)

● 400 GB SSDs (NVMe)

● Luminous 12.2.0

● CephFS

● Cache tiering

● Erasure coding

● BlueStore

● Untrained IT staff!

THANK YOU!

Sage Weil CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas