Goodbye, XFS: Building a New, Faster Storage Backend for Ceph

SAGE WEIL – RED HAT – 2017.09.12

OUTLINE

● Ceph background and context
  – FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Current status, future
● Summary

MOTIVATION

CEPH

● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”

● Released two weeks ago
● Erasure coding support for block and file (and object)
● BlueStore (our new storage backend)

CEPH COMPONENTS

● OBJECT: RGW – a web services gateway for object storage, compatible with S3 and Swift
● BLOCK: RBD – a reliable, fully-distributed block device with cloud platform integration
● FILE: CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS (OSDS)

[Diagram: monitors (M) alongside OSDs; each OSD sits on a local file system (xfs, btrfs, or ext4) on its own disk.]

[Diagram: the same stack, with FileStore shown as the layer between each OSD and its file system.]

IN THE BEGINNING

[Timeline, 2004–2016: the original local backends, EBOFS and FakeStore.]

OBJECTSTORE AND DATA MODEL

● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FakeStore
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – actually a shard of a RADOS pool
  – Ceph placement group (PG)
● All writes are transactions
  – Atomic + Consistent + Durable
  – Isolation provided by the OSD

LEVERAGE KERNEL FILESYSTEMS!

[Timeline, 2004–2016: EBOFS (eventually removed); FakeStore evolves into FileStore, running on XFS and BtrFS.]

FILESTORE

● FileStore
  – evolution of FakeStore
  – PG = collection = directory
  – object = file
● LevelDB
  – large xattr spillover
  – object omap (key/value) data
● On-disk layout:

  /var/lib/ceph/osd/ceph-123/
    current/
      meta/
        osdmap123
        osdmap124
      0.1_head/
        object1
        object12
      0.7_head/
        object3
        object5
      0.a_head/
        object4
        object6
      omap/
        <leveldb files>

POSIX FAILS: TRANSACTIONS

● Most transactions are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (kv insert)
  ...but others are arbitrarily large/complex
● Serialize and write-ahead txn to journal for atomicity
  – We double-write everything!
  – Lots of ugly hackery to make replayed events idempotent

  Example transaction dump:

  [
    { "op_name": "write",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "length": 4194304,
      "offset": 0,
      "bufferlist length": 4194304 },
    { "op_name": "setattrs",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "attr_lens": { "_": 269, "snapset": 31 } },
    { "op_name": "omap_setkeys",
      "collection": "0.6_head",
      "oid": "#0:60000000::::head#",
      "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } }
  ]

POSIX FAILS: KERNEL CACHE

● Kernel cache is mostly wonderful
  – automatically sized to all available memory
  – automatically shrinks if memory is needed
  – efficient
● No cache implemented in the app, yay!
● Read/modify/write vs write-ahead (sketched below)
  – write must be persisted
  – then applied to the file system
  – only then can you read it
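To make the write-ahead ordering concrete, here is a minimal, single-threaded sketch of journaling a transaction on top of a POSIX file system. The Journal, Transaction, and Op types are invented stand-ins rather than Ceph's actual ObjectStore/FileStore API; the point is only the ordering the slide describes: the serialized transaction is persisted first (so the data is written twice), then applied to the file system, and only after the apply can the object be read back.

  // Minimal sketch of write-ahead journaling over a POSIX file system.
  // Hypothetical types; not Ceph's ObjectStore/FileStore API.
  #include <cstdio>
  #include <fstream>
  #include <string>
  #include <vector>

  struct Op {                       // one operation in a transaction
    std::string object;             // relative path of the object file
    std::string data;               // bytes to write (offset 0 for brevity)
  };

  struct Transaction {
    std::vector<Op> ops;
    std::string serialize() const { // encode the ops for a journal entry
      std::string out;
      for (const Op& op : ops)
        out += op.object + "\n" + std::to_string(op.data.size()) + "\n" + op.data;
      return out;
    }
  };

  class Journal {
    std::ofstream log_;
  public:
    explicit Journal(const std::string& path) : log_(path, std::ios::app) {}
    // 1) Write-ahead: persist the serialized txn (the data is written twice overall).
    void commit(const Transaction& t) {
      log_ << t.serialize();
      log_.flush();                 // stand-in for fsync() on the journal fd
    }
  };

  // 2) Apply: only now does the data land in the backing file system,
  //    and only after this point can a read see the new bytes.
  void apply(const Transaction& t) {
    for (const Op& op : t.ops) {
      std::ofstream f(op.object, std::ios::trunc);
      f << op.data;
    }
  }

  int main() {
    Journal j("journal.bin");
    Transaction t{{{"object1", "hello bluestore"}}};
    j.commit(t);                    // durable promise first...
    apply(t);                       // ...then the real write; reads must wait for this
    std::printf("committed and applied %zu op(s)\n", t.ops.size());
  }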
POSIX FAILS: ENUMERATION

● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – data rebalancing, recovery
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix
  – split big directories, merge small ones
  – read entire directory, sort in-memory

  …
  DIR_A/
  DIR_A/A03224D3_qwer
  DIR_A/A247233E_zxcv
  …
  DIR_B/
  DIR_B/DIR_8/
  DIR_B/DIR_8/B823032D_foo
  DIR_B/DIR_8/B8474342_bar
  DIR_B/DIR_9/
  DIR_B/DIR_9/B924273B_baz
  DIR_B/DIR_A/
  DIR_B/DIR_A/BA4328D2_asdf
  …

THE HEADACHES CONTINUE

● New FileStore+XFS problems continue to surface
  – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison
  – Cannot bound deferred writeback work, even with fsync(2)
  – QoS efforts thwarted by deep queues and periodicity in FileStore throughput
  – {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create object clones

ACTUALLY, OUR FIRST PLAN WAS BETTER

[Timeline, 2004–2016: EBOFS (removed); FakeStore → FileStore (+ XFS, + BtrFS); NewStore → BlueStore.]

BLUESTORE

● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable inline compression
  – full data checksums
● Target hardware
  – current-gen HDD, SSD (SATA, SAS, NVMe)
● We must share the block device with RocksDB

[Diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice, while metadata goes to RocksDB, which runs on BlueRocksEnv/BlueFS over the same BlockDevice.]

ROCKSDB: BlueRocksEnv + BlueFS

● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes “file” operations to BlueFS
● BlueFS is a super-simple “file system” (journal replay sketched below)
  – all metadata lives in the journal
  – all metadata loaded into RAM on start/mount
  – journal rewritten/compacted when it gets large
  – no need to store a block free list
● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space

[Diagram: BlueFS on-disk layout – superblock, journal extents, and data extents; the journal records file events such as “file 10, file 11, file 12, file 13, rm file 12, …”.]
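A minimal sketch of the “all metadata lives in the journal” idea, loosely following the layout diagram above. The event record and BlueFSLikeFS class are invented for illustration and are not BlueFS's real on-disk format or API; mount() simply replays the journal into an in-memory map, which is why no separate free list or inode table has to be stored.

  // Sketch: BlueFS-style metadata-in-journal. All file metadata is a log of
  // events; mount() replays the log into RAM. The record format and class
  // are invented for this sketch, not the real BlueFS implementation.
  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  struct Extent { uint64_t offset, length; };          // raw block device extent

  struct JournalEvent {
    enum Kind { CREATE, ADD_EXTENT, REMOVE } kind;
    std::string file;
    Extent extent{};                                   // used by ADD_EXTENT
  };

  class BlueFSLikeFS {
    std::map<std::string, std::vector<Extent>> files_; // entire FS metadata, in RAM
  public:
    // Replay the journal on mount; afterwards files_ describes every file,
    // and free space is simply whatever the recorded extents do not cover.
    void mount(const std::vector<JournalEvent>& journal) {
      for (const auto& ev : journal) {
        switch (ev.kind) {
          case JournalEvent::CREATE:     files_[ev.file];                      break;
          case JournalEvent::ADD_EXTENT: files_[ev.file].push_back(ev.extent); break;
          case JournalEvent::REMOVE:     files_.erase(ev.file);                break;
        }
      }
    }
    void dump() const {
      for (const auto& [name, extents] : files_)
        std::printf("%s: %zu extent(s)\n", name.c_str(), extents.size());
    }
  };

  int main() {
    // Mirrors the journal in the diagram: create files, then remove one.
    std::vector<JournalEvent> journal = {
      {JournalEvent::CREATE, "file 10"},
      {JournalEvent::CREATE, "file 11"},
      {JournalEvent::CREATE, "file 12"},
      {JournalEvent::ADD_EXTENT, "file 12", {4096, 65536}},
      {JournalEvent::CREATE, "file 13"},
      {JournalEvent::REMOVE, "file 12"},
    };
    BlueFSLikeFS fs;
    fs.mount(journal);   // all metadata reconstructed in RAM
    fs.dump();
  }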
METADATA

ONODE – OBJECT

● Per-object metadata
  – lives directly in a key/value pair
  – serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)

  struct bluestore_onode_t {
    uint64_t size;
    map<string,bufferptr> attrs;
    uint64_t flags;

    // extent map metadata
    struct shard_info {
      uint32_t offset;
      uint32_t bytes;
    };
    vector<shard_info> shards;

    bufferlist inline_extents;
    bufferlist spanning_blobs;
  };

CNODE – COLLECTION

● Collection metadata
  – interval of the object namespace

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
  };

  struct bluestore_cnode_t {
    uint32_t bits;
  };

    pool  hash      name        bits
  C<12,3d3e0000>   “12.e3d3”  = <19>

    pool  hash      name  snap
  O<12,3d3d880e,foo,NOSNAP> = …
  O<12,3d3d9223,bar,NOSNAP> = …
  O<12,3d3e02c2,baz,NOSNAP> = …
  O<12,3d3e125d,zip,NOSNAP> = …
  O<12,3d3e1d41,dee,NOSNAP> = …
  O<12,3d3e3832,dah,NOSNAP> = …

● Nice properties
  – ordered enumeration of objects
  – we can “split” collections by adjusting collection metadata only

BLOB AND SHAREDBLOB

● Blob
  – extent(s) on device
  – checksum metadata
  – data may be compressed

  struct bluestore_blob_t {
    uint32_t flags = 0;
    vector<bluestore_pextent_t> extents;
    uint16_t unused = 0;               // bitmap
    uint8_t csum_type = CSUM_NONE;
    uint8_t csum_chunk_order = 0;
    bufferptr csum_data;
    uint32_t compressed_length_orig = 0;
    uint32_t compressed_length = 0;
  };

● SharedBlob
  – extent ref count on cloned blobs

  struct bluestore_shared_blob_t {
    uint64_t sbid;
    bluestore_extent_ref_map_t ref_map;
  };

EXTENT MAP

● Map object logical offsets → blob extents
● Serialized in chunks
  – stored in keys adjacent to the onode
  – stored inline in the onode value if small
● Blobs stored inline in each shard
  – unless referenced across shard boundaries
  – “spanning” blobs stored in the onode key

  ...
  O<,,foo,,>   = onode + inline extent map
  O<,,bar,,>   = onode + spanning blobs
  O<,,bar,,0>  = extent map shard
  O<,,bar,,4>  = extent map shard
  O<,,baz,,>   = onode + inline extent map
  ...

[Diagram: an object’s logical extents mapped onto blobs across shard #1 and shard #2, with one spanning blob.]

DATA PATH

DATA PATH BASICS

Terms
● TransContext
  – State describing an executing transaction
● Sequencer
  – An independent, totally ordered queue of transactions
  – One per PG

Three ways to write (see the decision sketch below)
● New allocation
  – Any write larger than min_alloc_size goes to a new, unused extent on disk
  – Once that IO completes, we commit the transaction
● Unused part of existing blob
● Deferred writes
  – Commit temporary promise to (over)write data with transaction – includes data!
  – Do async (over)write
  – Then clean up temporary k/v pair
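A compressed sketch of the three-way write decision. The names (WriteCtx, choose_write_mode, the 64 KB min_alloc_size) are illustrative rather than BlueStore's actual code, and the real path also weighs alignment, checksum chunking, and compression; this only captures the basic size and placement logic.

  // Sketch: pick one of the three write strategies for an incoming write.
  // Hypothetical types and thresholds; not BlueStore's real decision code.
  #include <cstdint>
  #include <cstdio>

  enum class WriteMode {
    NEW_ALLOCATION,   // write to fresh extents, then commit the txn
    UNUSED_IN_BLOB,   // fill never-written space in an existing blob
    DEFERRED,         // commit data into the k/v txn, overwrite in place later
  };

  struct WriteCtx {
    uint64_t length;                 // bytes being written
    bool     targets_unused_space;   // lands entirely in an unused part of a blob
  };

  WriteMode choose_write_mode(const WriteCtx& w, uint64_t min_alloc_size) {
    if (w.length >= min_alloc_size)  // big writes: cheapest to allocate fresh space
      return WriteMode::NEW_ALLOCATION;
    if (w.targets_unused_space)      // small write into never-used blob space is safe in place
      return WriteMode::UNUSED_IN_BLOB;
    return WriteMode::DEFERRED;      // small overwrite: journal it in RocksDB, apply async
  }

  int main() {
    const uint64_t min_alloc_size = 64 * 1024;  // e.g. 64 KB on HDD (illustrative value)
    WriteCtx big{4u << 20, false}, small_unused{4096, true}, small_overwrite{4096, false};
    std::printf("big write        -> %d\n", static_cast<int>(choose_write_mode(big, min_alloc_size)));
    std::printf("small, unused    -> %d\n", static_cast<int>(choose_write_mode(small_unused, min_alloc_size)));
    std::printf("small, overwrite -> %d\n", static_cast<int>(choose_write_mode(small_overwrite, min_alloc_size)));
  }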
IN-MEMORY CACHE

● OnodeSpace per collection
  – in-memory name → Onode map of decoded onodes
● BufferSpace for in-memory Blobs
  – all in-flight writes
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – TwoQCache – implements the 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism (see the sharding sketch after the performance charts)
  – Collection → shard mapping matches the OSD’s op_wq
  – the same CPU context that processes client requests touches the LRU/2Q lists

PERFORMANCE

HDD: RANDOM WRITE

[Charts: “Bluestore vs Filestore HDD Random Write Throughput” (MB/s) and “Bluestore vs Filestore HDD Random Write IOPS”, plotted against IO size (4–4096); series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw).]

HDD: SEQUENTIAL WRITE

[Chart: “Bluestore vs Filestore HDD Sequential Write Throughput” (MB/s), plotted against IO size.]
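Looping back to the in-memory cache: a minimal sketch of the sharding idea only. ShardedCache, the hash, and the plain LRU are invented for the illustration (BlueStore's default policy is 2Q, not LRU); what matters is that a collection always hashes to the same shard, so the CPU context that owns that PG's requests keeps touching the same lists and different shards do not contend on a single structure.

  // Sketch: shard an onode cache so that all objects of a collection (PG)
  // map to one shard, mirroring the OSD's op_wq sharding. Hypothetical
  // structure; BlueStore's default replacement policy is 2Q, not this LRU.
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <list>
  #include <string>
  #include <vector>

  struct CacheShard {
    std::list<std::string> lru;   // most-recently-used object names at the front
    void touch(const std::string& name) {
      lru.remove(name);           // O(n); fine for a sketch
      lru.push_front(name);
    }
  };

  class ShardedCache {
    std::vector<CacheShard> shards_;
  public:
    explicit ShardedCache(size_t n) : shards_(n) {}
    // All objects of a collection hash to the same shard, so the CPU context
    // that owns that collection's requests keeps hitting the same lists.
    CacheShard& shard_for(uint64_t pool, uint32_t pg_hash) {
      size_t idx = std::hash<uint64_t>{}(pool ^ pg_hash) % shards_.size();
      return shards_[idx];
    }
  };

  int main() {
    ShardedCache cache(5);
    cache.shard_for(12, 0x3d3e0000).touch("foo");
    cache.shard_for(12, 0x3d3e0000).touch("bar");   // same PG -> same shard
    cache.shard_for(7,  0x11110000).touch("baz");   // different PG may land elsewhere
    std::printf("ok\n");
  }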
