GOODBYE, XFS: BUILDING A NEW, FASTER STORAGE BACKEND FOR CEPH

SAGE WEIL – 2017.09.12

OUTLINE

● Ceph background and context – FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Current status, future
● Summary

MOTIVATION

CEPH

● Object, block, and file storage in a single cluster

● All components scale horizontally

● No single point of failure

● Hardware agnostic, commodity hardware

● Self-manage whenever possible

● Open source (LGPL)

● “A Scalable, High-Performance Distributed File System”

● “performance, reliability, and scalability”

● Luminous v12.2.0 released two weeks ago

● Erasure coding support for block and file (and object)

● BlueStore (our new storage backend)

CEPH COMPONENTS

OBJECT: RGW – A web services gateway for object storage, compatible with S3 and Swift
BLOCK: RBD – A reliable, fully-distributed block device with cloud platform integration
FILE: CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management

LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS (OSDS)

[Diagram: each OSD runs on top of a local filesystem (FS) on its own disk; monitors (M) sit alongside the OSDs.]

OBJECT STORAGE DAEMONS (OSDS)

[Diagram: the same stack with FileStore made explicit – each OSD runs FileStore on top of a local filesystem (xfs, btrfs, or ext4) on its own disk, alongside the monitors (M).]

IN THE BEGINNING

[Timeline 2004–2016: EBOFS and FakeStore were the original local storage backends.]

OBJECTSTORE AND DATA MODEL

● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FakeStore
● Object – “file”
  – data (file-like stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – actually a shard of a RADOS pool
  – Ceph placement group (PG)
● All writes are transactions (see the sketch below)
  – Atomic + Consistent + Durable
  – Isolation provided by OSD
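To make the transaction model concrete, the sketch below shows how an OSD-style compound update could be expressed against an ObjectStore-like interface. It is a minimal illustration loosely modeled on Ceph's ObjectStore::Transaction; the types, method signatures, and object names here are assumptions for illustration, not the actual Ceph API.

// Minimal sketch, not the real Ceph API: an ObjectStore-style transaction that
// batches several mutations so they commit atomically (and durably) as one unit.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Op {
  std::string name;        // "write", "setattr", "omap_setkeys", ...
  std::string collection;  // PG / collection id
  std::string oid;         // object name
  std::string detail;      // human-readable summary of the payload
};

struct Transaction {
  std::vector<Op> ops;

  void write(const std::string& c, const std::string& o, uint64_t off, uint64_t len) {
    ops.push_back({"write", c, o,
                   "off=" + std::to_string(off) + " len=" + std::to_string(len)});
  }
  void setattr(const std::string& c, const std::string& o,
               const std::string& key, const std::string& val) {
    ops.push_back({"setattr", c, o, key + "=" + val});
  }
  void omap_setkeys(const std::string& c, const std::string& o,
                    const std::map<std::string, std::string>& kv) {
    ops.push_back({"omap_setkeys", c, o, std::to_string(kv.size()) + " key(s)"});
  }
};

int main() {
  // A typical "simple" transaction: a data write, an xattr update, and a
  // key/value insert for the PG log -- all or nothing.
  Transaction t;
  t.write("0.6_head", "object23", 0, 4 * 1024 * 1024);
  t.setattr("0.6_head", "object23", "_", "<object_info>");
  t.omap_setkeys("0.6_head", "pg_meta", {{"pg_log_entry_5.6", "<entry>"}});

  for (const auto& op : t.ops)
    std::cout << op.name << " " << op.collection << "/" << op.oid
              << " (" << op.detail << ")\n";
}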

LEVERAGE KERNEL FILESYSTEMS!

[Timeline 2004–2016: EBOFS (later removed); FakeStore evolves into FileStore, used with BtrFS and XFS.]

FILESTORE

● FileStore
  – evolution of FakeStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● Example on-disk layout:
  /var/lib/ceph/osd/ceph-123/
    current/
      meta/
        osdmap123
        osdmap124
      0.1_head/
        object1
        object12
      0.7_head/
        object3
        object5
      0.a_head/
        object4
        object6
      omap/
        …

POSIX FAILS: TRANSACTIONS

● Most transactions are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (kv insert)
  ...but others are arbitrarily large/complex
● Serialize and write-ahead txn to journal for atomicity
  – We double-write everything!
  – Lots of ugly hackery to make replayed events idempotent

Example transaction dump:
  [
    {
      "op_name": "write",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "length": 4194304,
      "offset": 0,
      "bufferlist length": 4194304
    },
    {
      "op_name": "setattrs",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "attr_lens": {
        "_": 269,
        "snapset": 31
      }
    },
    {
      "op_name": "omap_setkeys",
      "collection": "0.6_head",
      "oid": "#0:60000000::::head#",
      "attr_lens": {
        "0000000005.00000000000000000006": 178,
        "_info": 847
      }
    }
  ]
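To see where the double-write comes from, here is a rough, self-contained sketch of the write-ahead pattern described above (my own illustration, not FileStore code): the serialized transaction is appended and synced to a journal first, and only afterwards applied to the backing file, so the payload hits the disk twice. The file names are hypothetical.

// Rough illustration of write-ahead journaling (not FileStore's implementation):
// append and fsync the serialized transaction to the journal, then apply the same
// bytes to the backing file -- every byte is written twice.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

static void append_and_sync(const char* path, const std::vector<char>& buf) {
  int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) return;                      // error handling elided
  ::write(fd, buf.data(), buf.size());     // first copy: the journal
  ::fsync(fd);                             // durable before the write is acked
  ::close(fd);
}

static void apply_to_object(const char* path, const std::vector<char>& buf) {
  int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) return;
  ::pwrite(fd, buf.data(), buf.size(), 0); // second copy: the object file itself
  ::fsync(fd);
  ::close(fd);
}

int main() {
  std::vector<char> txn(4096, 'x');        // stand-in for a serialized transaction
  append_and_sync("journal.bin", txn);     // write-ahead: this is the commit point
  apply_to_object("object.dat", txn);      // apply; a read has to wait for this step
  // After a crash the journal is replayed, which is why every logged operation
  // has to be made idempotent.
}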

POSIX FAILS: KERNEL CACHE

● Kernel cache is mostly wonderful
  – automatically sized to use all available memory
  – automatically shrinks if memory is needed elsewhere
  – efficient

● No cache implemented in app, yay!

● Read/modify/write vs write-ahead – the write must be persisted to the journal – then applied to the file system – only then can it be read

POSIX FAILS: ENUMERATION

● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – data rebalancing, recovery
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix (see the sketch below)
  – split big directories, merge small ones
  – read entire directory, sort in-memory

Example directory layout:
  …
  DIR_A/
  DIR_A/A03224D3_qwer
  DIR_A/A247233E_zxcv
  …
  DIR_B/
  DIR_B/DIR_8/
  DIR_B/DIR_8/B823032D_foo
  DIR_B/DIR_8/B8474342_bar
  DIR_B/DIR_9/
  DIR_B/DIR_9/B924273B_baz
  DIR_B/DIR_A/
  DIR_B/DIR_A/BA4328D2_asdf
  …
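As a toy illustration of the hash-prefix scheme in the listing above (not FileStore's actual HashIndex code), an object's directory is derived from the leading hex digits of its hash, so a large directory can be "split" simply by adding one more level of nesting for that prefix:

// Toy sketch of hash-prefix placement (not FileStore's real HashIndex code):
// nest an object under DIR_<nibble>/ directories taken from the leading hex
// digits of its 32-bit hash, down to the current split depth for that prefix.
#include <cstdint>
#include <cstdio>
#include <string>

std::string object_path(uint32_t hash, const std::string& name, int depth) {
  char hex[9];
  std::snprintf(hex, sizeof(hex), "%08X", static_cast<unsigned>(hash));
  std::string path;
  for (int i = 0; i < depth; ++i) {   // one directory level per leading nibble
    path += "DIR_";
    path += hex[i];
    path += '/';
  }
  return path + hex + "_" + name;
}

int main() {
  // Splitting a big directory just means using one more nibble for that prefix.
  std::printf("%s\n", object_path(0xA03224D3, "qwer", 1).c_str()); // DIR_A/A03224D3_qwer
  std::printf("%s\n", object_path(0xB823032D, "foo", 2).c_str());  // DIR_B/DIR_8/B823032D_foo
}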

THE HEADACHES CONTINUE

● New FileStore+XFS problems continue to surface – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison – Cannot bound deferred writeback work, even with fsync(2) – QoS efforts thwarted by deep queues and periodicity in FileStore throughput – {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create object clones

ACTUALLY, OUR FIRST PLAN WAS BETTER

[Timeline 2004–2016: EBOFS (later removed), FakeStore → FileStore with BtrFS and XFS, then NewStore, and finally BlueStore.]

BLUESTORE

● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to block device
  – pluggable inline compression
  – full data checksums
● Target hardware
  – current gen HDD, SSD (SATA, SAS, NVMe)
● We must share the block device with RocksDB

[Architecture diagram: ObjectStore → BlueStore; metadata goes through RocksDB → BlueRocksEnv → BlueFS → BlockDevice, while data goes directly from BlueStore to the BlockDevice.]

ROCKSDB: BlueRocksEnv + BlueFS

● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes “file” operations to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata lives in the journal
  – all metadata is loaded into RAM on start/mount
  – journal is rewritten/compacted when it gets large
  – no need to store a block free list
● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space

[BlueFS on-disk layout: superblock, journal extents, and data extents interleaved across the device (journal … data … more journal … data).]
[Journal contents example: file 10, file 11, file 12, file 12, file 13, rm file 12, file 13, …]
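The directory-to-device mapping is essentially a prefix match on the paths RocksDB hands to its Env. The sketch below is my own simplification of that policy, not the actual BlueRocksEnv/BlueFS code; the enum names are invented for illustration.

// Simplified sketch of BlueFS-style placement (not the real BlueRocksEnv/BlueFS
// code): RocksDB opens files under db.wal/, db/, and db.slow/, and each prefix is
// directed to a different class of block device.
#include <iostream>
#include <string>

enum class Device { WalFast, DbSsd, DbSlowHdd };   // invented names for illustration

Device pick_device(const std::string& rocksdb_path) {
  if (rocksdb_path.rfind("db.wal/", 0) == 0)  return Device::WalFast;    // NVRAM/NVMe/SSD
  if (rocksdb_path.rfind("db.slow/", 0) == 0) return Device::DbSlowHdd;  // cold SSTs spill to HDD
  return Device::DbSsd;                                                  // level0 / hot SSTs
}

int main() {
  std::cout << int(pick_device("db.wal/000123.log")) << "\n";  // 0: WAL device
  std::cout << int(pick_device("db/000456.sst")) << "\n";      // 1: SSD
  std::cout << int(pick_device("db.slow/000789.sst")) << "\n"; // 2: HDD
}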

METADATA

ONODE – OBJECT

● Per-object metadata
  – lives directly in the key/value pair
  – serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)

  struct bluestore_onode_t {
    uint64_t size;
    map<string, bufferptr> attrs;
    uint64_t flags;

    // extent map metadata
    struct shard_info {
      uint32_t offset;
      uint32_t bytes;
    };
    vector<shard_info> shards;

    bufferlist inline_extents;
    bufferlist spanning_blobs;
  };

CNODE – COLLECTION

● Collection metadata
  – interval of the object namespace

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
  };

  struct bluestore_cnode_t {
    uint32_t bits;
  };

Key space example (C = collection key: pool, hash → bits; O = object key: pool, hash, name, snap):

  C<12,3d3e0000> “12.e3d3” = <19>
  O<12,3d3d880e,foo,NOSNAP> = …
  O<12,3d3d9223,bar,NOSNAP> = …
  O<12,3d3e02c2,baz,NOSNAP> = …
  O<12,3d3e125d,zip,NOSNAP> = …
  O<12,3d3e1d41,dee,NOSNAP> = …
  O<12,3d3e3832,dah,NOSNAP> = …

● Nice properties
  – ordered enumeration of objects
  – we can “split” collections by adjusting collection metadata only

BLOB AND SHAREDBLOB

● Blob
  – extent(s) on device
  – checksum metadata
  – data may be compressed

  struct bluestore_blob_t {
    uint32_t flags = 0;
    vector<bluestore_pextent_t> extents;
    uint16_t unused = 0;              // bitmap
    uint8_t csum_type = CSUM_NONE;
    uint8_t csum_chunk_order = 0;
    bufferptr csum_data;
    uint32_t compressed_length_orig = 0;
    uint32_t compressed_length = 0;
  };

● SharedBlob
  – extent ref count on cloned blobs

  struct bluestore_shared_blob_t {
    uint64_t sbid;
    bluestore_extent_ref_map_t ref_map;
  };
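To make csum_type and csum_chunk_order concrete, here is a hedged sketch (an illustration, not BlueStore's implementation) of the checksum layout: each blob is checksummed in chunks of 2^csum_chunk_order bytes, one entry per chunk, and a read must verify every chunk it overlaps. The 4 KB default and the toy struct's names are assumptions.

// Illustration only (not BlueStore code): per-chunk checksum layout for a blob.
// chunk_size = 1 << csum_chunk_order; csum_data holds one value per chunk, so a
// read of [off, off+len) has to verify every chunk the range overlaps.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

struct BlobCsum {
  uint8_t csum_chunk_order = 12;        // assumed default: 4 KB chunks
  std::vector<uint32_t> csum_data;      // e.g. one crc per chunk; a weaker checksum
                                        // simply stores fewer bits per chunk

  uint64_t chunk_size() const { return 1ull << csum_chunk_order; }

  // Range of checksum entries a read of [off, off+len) must verify: [first, last)
  std::pair<uint64_t, uint64_t> chunks_for(uint64_t off, uint64_t len) const {
    uint64_t first = off / chunk_size();
    uint64_t last = (off + len + chunk_size() - 1) / chunk_size();
    return {first, last};
  }
};

int main() {
  BlobCsum c;
  c.csum_data.resize(16);                          // a 64 KB blob -> 16 x 4 KB chunks
  auto [first, last] = c.chunks_for(6000, 10000);  // read spanning chunk boundaries
  std::printf("verify checksum chunks [%llu, %llu)\n",
              (unsigned long long)first, (unsigned long long)last);  // [1, 4)
}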

EXTENT MAP

● Map object extents → blob extents
● Serialized in chunks
  – stored inline in the onode value if small
  – otherwise stored in adjacent keys
● Blobs stored inline in each shard
  – unless a blob is referenced across shard boundaries
  – “spanning” blobs are stored in the onode key

Key layout example:
  …
  O<,,foo,,>  = onode + inline extent map
  O<,,bar,,>  = onode + spanning blobs
  O<,,bar,,0> = extent map shard
  O<,,bar,,4> = extent map shard
  O<,,baz,,>  = onode + inline extent map
  …

DATA PATH

DATA PATH BASICS

Terms
● TransContext
  – state describing an executing transaction
● Sequencer
  – an independent, totally ordered queue of transactions
  – one per PG

Three ways to write (see the sketch below)
● New allocation
  – any write larger than min_alloc_size goes to a new, unused extent on disk
  – once that IO completes, we commit the transaction
● Unused part of an existing blob
● Deferred writes
  – commit a temporary promise to (over)write data with the transaction
    ● includes the data!
  – do the async (over)write
  – then clean up the temporary k/v pair
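The choice among the three write paths boils down to size and available blob space. The sketch below is a simplification of the decision as described on this slide, not BlueStore's actual logic; the 64 KB min_alloc_size is an assumed HDD-style default.

// Simplified sketch of the write-path decision described above (not BlueStore's
// actual logic): large writes get a fresh allocation and commit after the data IO
// completes; small overwrites are deferred through the key/value transaction.
#include <cstdint>
#include <cstdio>

enum class WritePath { NewAllocation, UnusedPartOfBlob, Deferred };

WritePath choose_write_path(uint64_t len, bool fits_in_unused_blob_space,
                            uint64_t min_alloc_size) {
  if (len >= min_alloc_size)
    return WritePath::NewAllocation;       // write to a new extent, then commit txn
  if (fits_in_unused_blob_space)
    return WritePath::UnusedPartOfBlob;    // fill never-used space in an existing blob
  return WritePath::Deferred;              // data rides in the kv txn, applied async
}

int main() {
  const uint64_t min_alloc = 64 * 1024;    // assumed HDD-style min_alloc_size
  std::printf("%d\n", (int)choose_write_path(4096, false, min_alloc));     // 2: Deferred
  std::printf("%d\n", (int)choose_write_path(1 << 20, false, min_alloc));  // 0: NewAllocation
}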

IN-MEMORY CACHE

● OnodeSpace per collection – in-memory name → Onode map of decoded onodes

● BufferSpace for in-memory Blobs
  – all in-flight writes
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – TwoQCache – implements the 2Q cache replacement algorithm (default)

● Cache is sharded for parallelism – Collection → shard mapping matches OSD's op_wq – same CPU context that processes client requests will touch the LRU/2Q lists
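As a small illustration of the sharding idea (my own sketch; the real mapping reuses the OSD's op_wq shard assignment rather than a fresh hash), a collection id can be hashed to pick which cache shard owns its onodes and buffers:

// Toy sketch of cache sharding (not the real OSD code): hash the collection (PG)
// id to pick a cache shard, so a PG's onodes and buffers are always touched from
// the same sharded work-queue context that handles its client requests.
#include <cstdio>
#include <functional>
#include <string>

size_t cache_shard_for(const std::string& pgid, size_t num_shards) {
  return std::hash<std::string>{}(pgid) % num_shards;
}

int main() {
  const size_t num_shards = 8;   // assumed shard count
  std::printf("pg 12.e3 -> shard %zu\n", cache_shard_for("12.e3", num_shards));
  std::printf("pg 12.1a -> shard %zu\n", cache_shard_for("12.1a", num_shards));
}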

PERFORMANCE

HDD: RANDOM WRITE

[Charts: Bluestore vs Filestore HDD random write throughput (MB/s) and IOPS across IO sizes from 4 KB to 4 MB. Series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw).]

HDD: SEQUENTIAL WRITE

[Charts: Bluestore vs Filestore HDD sequential write throughput (MB/s) and IOPS across the same IO sizes and series as above; one point on the IOPS chart is annotated “WTH”.]

RGW ON HDD+NVME, EC 4+2

[Chart: 4+2 erasure coding RadosGW write tests – 32 MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients. Throughput (MB/s) for Filestore 512KB chunks, Filestore 4MB chunks, Bluestore 512KB chunks, and Bluestore 4MB chunks across 1 and 4 buckets (1 RGW server), 128 and 512 buckets (4 RGW servers), and rados bench.]

ERASURE CODE OVERWRITES

● Luminous allows overwrites of EC objects – Requires two-phase commit to avoid “RAID-hole” like failure conditions – OSD creates temporary rollback objects

  – clone_range $extent to a temporary object
  – overwrite $extent with new data
● BlueStore can do this easily…

● FileStore literally copies (reads and writes to-be-overwritten region)
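Expressed as transaction steps, the rollback scheme looks roughly like the sketch below. This is an illustration of the sequence described above, not the actual OSD code path; the temporary object name and step encoding are invented.

// Illustration of the EC-overwrite rollback sequence described above (not the
// actual OSD code): stash the to-be-overwritten extent in a temporary rollback
// object, then overwrite; if the two-phase commit aborts, the stash is restored.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Step { std::string op, src, dst; uint64_t off, len; };

std::vector<Step> ec_overwrite_steps(const std::string& obj, uint64_t off, uint64_t len) {
  std::string tmp = obj + "_rollback";           // hypothetical temp-object name
  return {
    // a cheap metadata op in BlueStore; a literal read+write copy in FileStore
    {"clone_range", obj, tmp, off, len},
    {"write", "<new data>", obj, off, len},      // overwrite the extent
  };
}

int main() {
  for (const auto& s : ec_overwrite_steps("ec_shard_obj", 65536, 4096))
    std::printf("%s %s -> %s off=%llu len=%llu\n", s.op.c_str(), s.src.c_str(),
                s.dst.c_str(), (unsigned long long)s.off, (unsigned long long)s.len);
}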

BLUESTORE vs FILESTORE: 3X vs EC 4+2 vs EC 5+1

[Chart: RBD 4K random writes – 16 HDD OSDs, 8× 32 GB volumes, 256 IOs in flight. IOPS for 3× replication, EC 4+2, and EC 5+1 on Bluestore vs Filestore (HDD/HDD).]

OTHER CHALLENGES

USERSPACE CACHE

● Built ‘mempool’ accounting infrastructure – easily annotate/tag C++ classes and containers – low overhead – debug mode provides per-type (vs per-pool) accounting

● Requires configuration – bluestore_cache_size (default 1GB)

● not as convenient as auto-sized kernel caches

● Finally have meaningful implementation of fadvise NOREUSE (vs DONTNEED)

MEMORY EFFICIENCY

● Careful attention to struct sizes: packing, redundant fields

● Checksum chunk sizes – client hints to expect sequential read/write → large csum chunks – can optionally select weaker checksum (16 or 8 bits per chunk)

● In-memory red/black trees (e.g., std::map<>) bad for CPUs – low temporal write locality → many CPU cache misses, failed prefetches – use btrees where appropriate – per-onode slab allocators for extent and blob structs

ROCKSDB

● Compaction – Overall impact grows as total metadata corpus grows – Invalidates rocksdb block cache (needed for range queries) – Awkward to control priority on kernel libaio interface

● Many deferred write keys end up in L0

● High write amplification – SSDs with low-cost random reads care more about total write overhead

● Open to alternatives for SSD/NVM

STATUS

● Stable and recommended default in Luminous v12.2.z (just released!)

● Migration from FileStore – Reprovision individual OSDs, let cluster heal/recover

● Current efforts – Optimizing for CPU time – SPDK support! – Adding some repair functionality to fsck

FUTURE WORK

● Tiering – hints to underlying block-based tiering device (e.g., dm-cache, …)? – implement directly in BlueStore?

● host-managed SMR? – maybe...

● Persistent memory + 3D NAND future?

● NVMe extensions for key/value, blob storage?

SUMMARY

● Ceph is great at scaling out
● POSIX was a poor choice for storing objects
● Our new BlueStore backend is so much better
  – Good (and rational) performance!
  – Inline compression and full data checksums
● Designing internal storage interfaces without a POSIX bias is a win… eventually
● We are definitely not done yet
  – Performance!
  – Other stuff
● We can finally solve our IO problems ourselves

BASEMENT CLUSTER

● 2 TB 2.5” HDDs

● 1 TB 2.5” SSDs (SATA)

● 400 GB SSDs (NVMe)

● Luminous 12.2.0

● CephFS

● Cache tiering

● Erasure coding

● BlueStore

● Untrained IT staff!

THANK YOU!

Sage Weil CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas