ERASURE CODING AND CACHE TIERING
SAGE WEIL - SDC 2014.09.16
ARCHITECTURE
CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source
ARCHITECTURAL COMPONENTS
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
ROBUST SERVICES BUILT ON RADOS
ARCHITECTURAL COMPONENTS
THE RADOS GATEWAY
(Diagram: applications issue REST requests to RADOSGW instances, which use LIBRADOS over a socket to store objects in the RADOS cluster.)
MULTI-SITE OBJECT STORAGE
(Diagram: web applications and app servers talk to Ceph Object Gateways (RGW) in front of two Ceph storage clusters, US-EAST and EU-WEST.)
RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
– Stripes large RESTful objects across many RADOS objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
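To make the S3 compatibility concrete, here is a minimal sketch using boto3 against a hypothetical RGW endpoint; the endpoint URL, credentials, and bucket name are placeholders, not from the talk:

```python
import boto3

# Point a standard S3 client at the RADOS Gateway instead of AWS.
# The endpoint, credentials, and bucket name are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo")
s3.put_object(Bucket="demo", Key="hello.txt", Body=b"hello rgw")
print(s3.get_object(Bucket="demo", Key="hello.txt")["Body"].read())
```

A Swift client can be pointed at the same gateway in the same way.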
ARCHITECTURAL COMPONENTS
STORING VIRTUAL DISKS
(Diagram: a VM's hypervisor uses LIBRBD to store the virtual disk in the RADOS cluster.)
KERNEL MODULE
(Diagram: a Linux host uses the KRBD kernel module to access RBD images in the RADOS cluster.)
RBD FEATURES
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones (see the sketch after this list)
● Broad integration
– Qemu
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, Nebula, Ganeti, Proxmox
● Incremental backup (relative to snapshots)
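A minimal sketch of the snapshot and clone workflow above using the python-rbd binding; pool and image names are placeholders, and it assumes format-2 images with layering enabled (the default on recent Ceph releases):

```python
import rados
import rbd

# Connect to the cluster and open an I/O context on a pool (names are placeholders).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")

rbd_inst = rbd.RBD()
# 10 GiB image; cloning below requires layering, the default image feature
# on recent Ceph releases.
rbd_inst.create(ioctx, "base-image", 10 * 1024 ** 3)

with rbd.Image(ioctx, "base-image") as image:
    image.create_snap("golden")      # read-only snapshot
    image.protect_snap("golden")     # parent snapshots must be protected before cloning

# Copy-on-write clone of the protected snapshot.
rbd_inst.clone(ioctx, "base-image", "golden", ioctx, "vm-disk-1")

ioctx.close()
cluster.shutdown()
```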
ARCHITECTURAL COMPONENTS
SEPARATE METADATA SERVER
(Diagram: a Linux host running the CephFS kernel module sends metadata operations to a separate metadata server and reads/writes file data directly in the RADOS cluster.)
SCALABLE METADATA SERVERS
METADATA SERVER:
● Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
● Clients stripe file data in RADOS
– MDS not in data path
● MDS stores metadata in RADOS
– Key/value objects
● Dynamic cluster scales to 10s or 100s of MDS daemons
● Only required for the shared filesystem
RADOS ARCHITECTURAL COMPONENTS
RADOS
● Flat object namespace within each pool
● Rich object API (librados; see the sketch after this list)
– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound operations
– RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
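A small librados sketch of the API features listed above; pool and object names are placeholders:

```python
import rados

# Pool and object names are placeholders.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")

# Bytes and attributes on a single object.
ioctx.write_full("greeting", b"hello rados")
ioctx.set_xattr("greeting", "lang", b"en")

# Key/value (omap) data applied as part of a compound write operation.
# (The WriteOpCtx helper's exact signature varies slightly across releases.)
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("color",), (b"blue",))
    ioctx.operate_write_op(op, "greeting")

print(ioctx.read("greeting"))

ioctx.close()
cluster.shutdown()
```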
RADOS CLUSTER
(Diagram: an application talks directly to a RADOS cluster made up of many OSD nodes and a small number of monitors.)
RADOS COMPONENTS
OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path
OBJECT STORAGE DAEMONS
(Diagram: each OSD runs on top of a local filesystem (xfs, btrfs, or ext4) on its own disk; monitors run alongside the OSDs.)
DATA PLACEMENT
WHERE DO OBJECTS LIVE?
(Diagram: an application holds an object and must determine which node in the cluster should store it.)
A METADATA SERVER?
(Diagram: one option is a lookup service: the application (1) asks a metadata server where the object lives, then (2) contacts that location.)
CALCULATED PLACEMENT
(Diagram: alternatively, the application computes the location directly, e.g., by hashing object names into ranges A-G, H-N, O-T, and U-Z, each owned by a server.)
CRUSH
(Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster.)
CRUSH IS A QUICK CALCULATION
(Diagram: given an object name and the cluster map, the client computes the object's location in the RADOS cluster directly; no lookup service is involved.)
CRUSH AVOIDS FAILED DEVICES
(Diagram: when a device fails, CRUSH maps the affected placement onto a different, healthy OSD.)
CRUSH: DECLUSTERED PLACEMENT
● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
– Highly parallel recovery
CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
● Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
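A toy sketch of calculated, hash-based placement in the spirit described above. This is not CRUSH itself: real CRUSH walks a weighted hierarchy of failure domains and honors placement rules, but the deterministic, lookup-free flavor is the same. All names and sizes are illustrative:

```python
import hashlib

# Toy stand-in for CRUSH: deterministic, lookup-free placement.
PG_NUM = 128
OSDS = [f"osd.{i}" for i in range(12)]        # pretend cluster map

def object_to_pg(name):
    # Objects hash into a fixed number of placement groups.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_NUM

def pg_to_osds(pg, replicas=3):
    # Rank OSDs by a hash of (pg, osd) and take the first `replicas`:
    # repeatable, statistically uniform, and stable while the OSD list is.
    ranked = sorted(OSDS, key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest())
    return ranked[:replicas]

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))
```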
DATA IS ORGANIZED INTO POOLS
(Diagram: objects in pools A, B, C, and D map to each pool's own placement groups, which in turn map onto the shared cluster.)
TIERED STORAGE
TWO WAYS TO CACHE
● Within each OSD
– Combine SSD and HDD for each OSD
– Make localized promote/demote decisions
– Leverage existing tools
● dm-cache, bcache, FlashCache
● Variety of caching controllers
– We can help with hints
● Cache on separate devices/nodes
– Different hardware for different tiers
● Slow nodes for cold data
● High performance nodes for hot data
– Add, remove, scale each tier independently
● Unlikely to choose right ratios at procurement time
TIERED STORAGE
(Diagram: the application talks to a replicated cache pool layered on top of an erasure-coded backing pool within a single Ceph storage cluster.)
RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
– May be replicated or erasure coded
● Tiers are durable
– e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
– e.g., map to SSDs devices/hosts only
● librados clients adapt to tiering topology
– Transparently direct requests accordingly
● e.g., to cache
– No changes to RBD, RGW, CephFS, etc.
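As a rough sketch of wiring up such a tier from a client, using librados monitor commands (the equivalent of the ceph osd tier CLI). Pool names are placeholders, and the exact command argument names are assumptions that may vary by release:

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def mon_cmd(**kwargs):
    # Send a monitor command as JSON, like the `ceph` CLI does.
    ret, out, err = cluster.mon_command(json.dumps(kwargs), b"")
    if ret != 0:
        raise RuntimeError(err)
    return out

# Layer a replicated SSD pool ("hot") over an erasure-coded pool ("cold").
mon_cmd(prefix="osd tier add", pool="cold", tierpool="hot")
mon_cmd(prefix="osd tier cache-mode", pool="hot", mode="writeback")
mon_cmd(prefix="osd tier set-overlay", pool="cold", overlaypool="hot")

cluster.shutdown()
```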
WRITE INTO CACHE POOL
(Diagram: the client writes to the writeback cache pool on SSD and receives the ACK from the cache tier; the HDD backing pool is not touched.)
WRITE INTO CACHE POOL
(Diagram: if the object is not yet in the cache, it is first promoted from the backing pool, then the write is applied and acknowledged in the cache pool.)
READ (CACHE HIT)
(Diagram: the client's read is served directly from the writeback cache pool.)
READ (CACHE MISS)
(Diagram: on a miss, the cache tier can redirect the client to read directly from the backing pool.)
READ (CACHE MISS)
(Diagram: alternatively, the object is promoted from the backing pool into the cache pool, which then serves the read.)
ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
– Insert records on both read and write
– Each filter covers a configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Maintain most recent N filters on disk
● Estimate temperature
– Has the object been accessed in any of the last N periods?
– ...in how many of them?
– Informs flush/evict decisions
● Estimate “recency”
– How many periods has it been since the object was last accessed?
– Informs read-miss behavior: promote vs. redirect
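A minimal sketch of the idea, using plain Python sets in place of the per-PG bloom filters; the period length and filter count are illustrative:

```python
import time
from collections import deque

PERIOD = 3600        # seconds covered by each filter (illustrative)
N_FILTERS = 8        # number of recent periods kept

# Each set stands in for one bloom filter; a real implementation uses a
# space-efficient probabilistic filter with a tunable false-positive rate.
hit_sets = deque([set()], maxlen=N_FILTERS)
period_start = time.time()

def record_access(obj):
    global period_start
    if time.time() - period_start > PERIOD:
        hit_sets.appendleft(set())   # roll over to a new period
        period_start = time.time()
    hit_sets[0].add(obj)

def temperature(obj):
    # In how many of the last N periods was the object accessed?
    return sum(obj in s for s in hit_sets)

def recency(obj):
    # How many periods ago was the object last accessed? (N_FILTERS if never seen)
    for i, s in enumerate(hit_sets):
        if obj in s:
            return i
    return N_FILTERS
```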
AGENT: FLUSH COLD DATA
(Diagram: the tiering agent flushes cold objects from the cache pool to the HDD backing pool, which acknowledges the writes; the client is not involved.)
TIERING AGENT
● Each PG has an internal tiering agent
– Manages the PG based on administrator-defined policy
● Flush dirty objects
– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back to the base pool
● Evict clean objects
– Greater “effort” as pool/PG size approaches target size
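A simplified sketch of that flush/evict policy. The pg interface, thresholds, and effort curve are hypothetical, for illustration only:

```python
def agent_step(pg, target_dirty_ratio=0.4, target_full_ratio=0.8):
    """One pass of a toy tiering agent over a placement group.

    `pg` is assumed to expose dirty_ratio, full_ratio, coldest_dirty(),
    coldest_clean(), flush(obj) and evict(obj); all of these names are
    hypothetical, for illustration only.
    """
    # Flush dirty objects once the pool holds too much dirty data,
    # preferring the coldest objects first.
    while pg.dirty_ratio > target_dirty_ratio:
        pg.flush(pg.coldest_dirty())     # write back to base pool, mark clean

    # Evict clean objects with greater effort as the pool approaches its target size.
    effort = min(1.0, pg.full_ratio / target_full_ratio)
    for _ in range(int(effort * 10)):    # crude "effort" knob
        pg.evict(pg.coldest_clean())     # drop from cache; data still lives in base pool
```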
READ ONLY CACHE TIER
(Diagram: writes go straight to the replicated backing pool and are acknowledged there; reads are served from a read-only cache pool, into which objects are promoted on demand.)
ERASURE CODING
(Diagram: a replicated pool stores full copies of each object on several OSDs; an erasure-coded pool stores data chunks 1-4 plus coding chunks X and Y.)
Replicated pool: full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery
Erasure-coded pool: one copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
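The overhead numbers follow directly from the layouts: replication stores one full copy per replica, while erasure coding stores (k + m) / k bytes per usable byte. A quick check with the figures from this slide (3 replicas vs. a k=4, m=2 code):

```python
def replication_overhead(copies):
    return copies                # raw bytes stored per usable byte

def ec_overhead(k, m):
    return (k + m) / k           # k data chunks plus m coding chunks

print(replication_overhead(3))   # 3.0 -> "3x (200% overhead)"
print(ec_overhead(4, 2))         # 1.5 -> "1.5x (50% overhead)"
```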
ERASURE CODING SHARDS
(Diagram: an object is split into data shards 1-4 and coding shards X and Y, each stored on a different OSD of the erasure-coded pool.)
ERASURE CODING SHARDS
(Diagram: stripe layout across data shards 1-4 and coding shards X and Y:
stripe 0 → chunks 0 1 2 3 + A A'
stripe 1 → chunks 4 5 6 7 + B B'
stripe 2 → chunks 8 9 10 11 + C C'
stripe 3 → chunks 12 13 14 15 + D D'
stripe 4 → chunks 16 17 18 19 + E E')
● Variable stripe size
● Zero-fill shards (logically) in partial tail stripe
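A toy sketch of splitting an object into k data shards plus a single XOR parity shard. Real Ceph uses pluggable codes (e.g., jerasure Reed-Solomon) that interleave fixed-size stripes and tolerate more than one lost shard, but the zero-filled tail and the shard/recovery mechanics look similar:

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=4):
    """Split data into k equal data shards plus one XOR parity shard."""
    data = data.ljust(-(-len(data) // k) * k, b"\0")   # zero-fill the tail
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))           # parity shard
    return shards

def recover(shards, lost):
    """Rebuild one missing shard by XOR-ing all surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    return reduce(xor_bytes, survivors)

shards = encode(b"the quick brown fox jumps over the lazy dog")
assert recover(shards, lost=2) == shards[2]
```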
PRIMARY
(Diagram: one OSD of the erasure-coded PG acts as the primary and coordinates I/O for the other shards.)
EC READ
(Diagram: the client sends its read to the PG's primary OSD.)
EC READ
(Diagram: the primary reads enough shards from the other OSDs to reconstruct the requested data.)
EC READ
(Diagram: the primary reassembles the data and sends the reply to the client.)
EC WRITE
(Diagram: the client sends its write to the PG's primary OSD.)
EC WRITE
(Diagram: the primary encodes the update and writes the resulting shards to the other OSDs.)
EC WRITE
(Diagram: once the shards are written, the primary acknowledges the write to the client.)
EC WRITE: DEGRADED
(Diagram: with one OSD down, the primary still writes the remaining shards; the PG is degraded but the write completes.)
EC WRITE: PARTIAL FAILURE
(Diagram: the primary begins writing shards, but some OSDs fail partway through the update.)
EC WRITE: PARTIAL FAILURE
(Diagram: the shards are now inconsistent: some OSDs hold the new version B while others still hold the old version A.)
EC RESTRICTIONS
● Overwrite in place will not work in general
● A log and two-phase commit would increase complexity and latency
● We chose to restrict allowed operations
– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
– create → delete
– append → truncate
– remove → roll back to previous generation
● Object attrs are preserved in the existing PG logs (they are small)
● Key/value data is not allowed on EC pools
EC WRITE: PARTIAL FAILURE
(Diagram: the PG is left with a mix of new (B) and old (A) shards that must be reconciled.)
EC WRITE: PARTIAL FAILURE
(Diagram: the partially applied write is rolled back, leaving every shard at the previous version A.)
EC RESTRICTIONS
● This is a small subset of allowed librados operations
– Notably cannot (over)write any extent
● Coincidentally, these operations are also inefficient for erasure codes
– Generally require read/modify/write of affected stripe(s)
● Some applications can consume EC directly
– RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD, CephFS)
– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data
WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
– jerasure (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code; layers over existing plugins)
● Parameterized
– Pick k and m, stripe size
● OSD handles data path, placement, rollback, etc.
● Plugin handles
– Encode and decode
– Given these available shards, which ones should I fetch to satisfy a read?
– Given these available shards and these missing shards, which ones should I fetch to recover?
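For completeness, a sketch of creating an erasure-code profile and an EC pool through the monitor command interface, mirroring the ceph osd erasure-code-profile set and ceph osd pool create ... erasure CLI; as with the tiering sketch earlier, the argument names are assumptions and may differ by release:

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def mon_cmd(**kwargs):
    ret, out, err = cluster.mon_command(json.dumps(kwargs), b"")
    if ret != 0:
        raise RuntimeError(err)
    return out

# k=4 data chunks and m=2 coding chunks with the jerasure plugin
# (profile name, pool name, and PG count are placeholders).
mon_cmd(prefix="osd erasure-code-profile set", name="k4m2",
        profile=["k=4", "m=2", "plugin=jerasure"])
mon_cmd(prefix="osd pool create", pool="cold", pg_num=128,
        pool_type="erasure", erasure_code_profile="k4m2")

cluster.shutdown()
```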
COST OF RECOVERY
(Diagram: a 1 TB OSD fails and its contents must be recovered.)
COST OF RECOVERY (REPLICATION)
(Diagram: 1 TB of replica data must be re-copied to restore full redundancy.)
COST OF RECOVERY (REPLICATION)
(Diagram: with declustered placement the work is spread across many OSDs, each copying roughly 0.01 TB.)
COST OF RECOVERY (REPLICATION)
(Diagram: in total, roughly 1 TB, the amount that was lost, is read from surviving replicas and re-written.)
COST OF RECOVERY (EC)
(Diagram: to rebuild the lost shards of an erasure-coded 1 TB OSD, roughly 1 TB must be read from each of the surviving data/coding shards.)
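A back-of-the-envelope comparison of the recovery read traffic for a failed OSD holding 1 TB, assuming a k=4 code as in the diagrams:

```python
lost_tb = 1.0
k = 4                          # data shards per stripe (assumed profile)

replication_reads = lost_tb    # read one surviving replica of each lost object
ec_reads = k * lost_tb         # read k surviving shards to rebuild each lost chunk

print(replication_reads, ec_reads)   # 1.0 TB vs 4.0 TB of reads
```

Local recovery codes (next slide) add local parity shards so that a single lost shard can usually be rebuilt from a small local group rather than from k shards, trading some extra storage for cheaper recovery.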
LOCAL RECOVERY CODE (LRC)
(Diagram: in addition to data shards 1-4 and coding shards X and Y, local parity shards A, B, and C are stored on extra OSDs in the erasure-coded pool.)
BIG THANKS TO
● Ceph
– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– Sam Just (Inktank / Red Hat)
– David Zafman (Inktank / Red Hat)
● jerasure
– Jim Plank (University of Tennessee)
– Kevin Greenan (Box)
ROADMAP
WHAT'S NEXT
● Erasure coding
– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure
● Cache pools
– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
● e.g., slow / “cheap” flash can read just as fast
– Complex topologies
● Multiple read-only cache tiers in multiple sites
● Tiering
– Support “redirects” to cold tier below base pool
– Dynamic spin-down
OTHER ONGOING WORK
● Performance optimization (SanDisk, Mellanox)
● Alternative OSD backends
– leveldb, rocksdb, LMDB
– hybrid key/value and file system
● Messenger (network layer) improvements
– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)
● Multi-datacenter RADOS replication
● CephFS
– Online consistency checking
– Performance, robustness
THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
@liewegas