ERASURE CODING AND CACHE TIERING

SAGE WEIL - SDC 2014.09.16

ARCHITECTURE

CEPH MOTIVATING PRINCIPLES

● All components must scale horizontally

● There can be no single point of failure

● The solution must be hardware agnostic

● Should use commodity hardware

● Self-manage whenever possible

● Open source

ARCHITECTURAL COMPONENTS

[Diagram: APP, HOST/VM, and CLIENT sit above RGW, RBD, and CEPHFS respectively, all built on LIBRADOS and RADOS.]

● RGW: A web services gateway for object storage, compatible with S3 and Swift
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
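
Since LIBRADOS exposes RADOS directly to applications (Python is among the listed bindings), here is a minimal sketch of connecting and storing an object with the rados Python module; the config path and the "data" pool name are assumptions, not part of the slides:

```python
import rados

# Connect using a local ceph.conf (assumed path) and default credentials.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    # Open an I/O context on a pool assumed to exist, e.g. "data".
    ioctx = cluster.open_ioctx('data')
    try:
        # Write a whole object, then read it back.
        ioctx.write_full('hello-object', b'hello from librados')
        print(ioctx.read('hello-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```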

ROBUST SERVICES BUILT ON RADOS


THE RADOS GATEWAY

[Diagram: applications issue REST requests to RADOSGW instances; each RADOSGW uses LIBRADOS (over a socket) to reach the RADOS cluster and its monitors (M).]

MULTI-SITE OBJECT STORAGE

[Diagram: web applications and app servers talk to a Ceph Object Gateway (RGW) in each site, one backed by a Ceph Storage Cluster in US-EAST and one in EU-WEST.]

RADOSGW MAKES RADOS WEBBY

RADOSGW:

● REST-based object storage proxy
● Uses RADOS to store objects
  – Stripes large RESTful objects across many RADOS objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
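
Because RGW is S3-compatible, a stock S3 client can simply be pointed at it. A hedged sketch using the boto3 library (the endpoint URL, credentials, and bucket name below are assumptions for illustration):

```python
import boto3

# Point a standard S3 client at an RGW endpoint (assumed host/port and keys).
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# Buckets and objects behave as they would against S3 itself.
s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'stored via RGW')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
```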


STORING VIRTUAL DISKS

[Diagram: a VM's disk is provided by the hypervisor through LIBRBD, which talks directly to the RADOS cluster and its monitors (M).]

KERNEL MODULE

[Diagram: a Linux host uses the KRBD kernel module to talk directly to the RADOS cluster and its monitors (M).]

RBD FEATURES

● Stripe images across entire cluster (pool)

● Read-only snapshots

● Copy-on-write clones

● Broad integration

– Qemu
– kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, Nebula, Ganeti, Proxmox

● Incremental backup (relative to snapshots)
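
As an illustration of read-only snapshots and copy-on-write clones, a minimal sketch with the rbd Python bindings; the pool name, image names, and size are assumptions, and cloning presumes an image format that supports layering:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # assumed pool

try:
    # Create a 1 GiB image (assumes the default image format with layering).
    rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3)

    image = rbd.Image(ioctx, 'vm-disk')
    try:
        image.create_snap('base')      # read-only snapshot
        image.protect_snap('base')     # required before cloning
    finally:
        image.close()

    # Copy-on-write clone of the protected snapshot.
    rbd.RBD().clone(ioctx, 'vm-disk', 'base', ioctx, 'vm-disk-clone')
finally:
    ioctx.close()
    cluster.shutdown()
```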


SEPARATE METADATA SERVER

[Diagram: a Linux host with the kernel module sends metadata operations to a separate metadata server and reads/writes file data directly in the RADOS cluster.]

SCALABLE METADATA SERVERS

METADATA SERVER:

● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Clients stripe file data in RADOS
  – MDS not in data path
● MDS stores metadata in RADOS
  – Key/value objects
● Dynamic cluster scales to 10s or 100s
● Only required for shared filesystem


RADOS

● Flat object namespace within each pool

● Rich object API (librados)

– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound operations
– RADOS classes (stored procedures)

● Strong consistency (CP system)

● Infrastructure aware, dynamic topology

● Hash-based placement (CRUSH)

● Direct client to server data path
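
To make the "rich object API" bullet above concrete, a small sketch of per-object bytes, attributes, and partial overwrite via the Python rados bindings (pool and object names are assumptions):

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')  # assumed pool

try:
    # Bytes: write an object, then partially overwrite it at an offset.
    ioctx.write_full('doc', b'hello world')
    ioctx.write('doc', b'WORLD', offset=6)          # partial overwrite in place

    # Attributes: small named metadata stored alongside the object.
    ioctx.set_xattr('doc', 'content-type', b'text/plain')

    print(ioctx.read('doc'))                        # b'hello WORLD'
    print(ioctx.get_xattr('doc', 'content-type'))   # b'text/plain'
finally:
    ioctx.close()
    cluster.shutdown()
```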

RADOS CLUSTER

[Diagram: an application talks directly to the nodes of a RADOS cluster; a handful of monitors (M) sit alongside the storage nodes.]

RADOS COMPONENTS

OSDs:

● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:

● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path

OBJECT STORAGE DAEMONS

[Diagram: each OSD daemon sits on a local filesystem (e.g., xfs or ext4) on its own disk; monitors (M) run alongside the OSDs.]

DATA PLACEMENT

WHERE DO OBJECTS LIVE?

[Diagram: an application holds an object and needs to know which node in the cluster should store it.]

A METADATA SERVER?

[Diagram: one option is a central metadata server: the application first asks it where the object lives (1), then contacts that node (2).]

CALCULATED PLACEMENT

[Diagram: instead, the application calculates where object "F" lives from a fixed partitioning of the namespace (A-G, H-N, O-T, U-Z), with no lookup step.]

CRUSH

[Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster.]

CRUSH IS A QUICK CALCULATION

[Diagram: placing an object in the RADOS cluster is a fast calculation against the cluster map, not a lookup in a table.]

CRUSH AVOIDS FAILED DEVICES

[Diagram: an object that would have landed on a failed OSD is instead placed on a different, healthy OSD.]

CRUSH: DECLUSTERED PLACEMENT

● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
  – Highly parallel recovery

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH:

● Pseudo-random placement algorithm
  – Fast calculation, no lookup
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighting
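
The following is not the real CRUSH algorithm, only a toy Python sketch of the same idea: a deterministic, weighted, pseudorandom choice of OSDs for a placement group that stays stable when the OSD set changes. All names and weights are made up:

```python
import hashlib

def stable_hash(*parts):
    """Deterministic 64-bit value (stands in for CRUSH's internal hash)."""
    digest = hashlib.sha256('/'.join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], 'big')

def place_pg(pg_id, osds, replicas=3):
    """Pick `replicas` OSDs for a PG: highest weighted pseudorandom draw wins.

    Like CRUSH, the result is computed rather than looked up, and removing
    an OSD only disturbs the PGs that had mapped to it.
    """
    draws = []
    for osd, weight in osds.items():
        # Scale each OSD's draw by its weight so larger OSDs win more often.
        draws.append((stable_hash(pg_id, osd) * weight, osd))
    return [osd for _, osd in sorted(draws, reverse=True)[:replicas]]

osds = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 2.0, 'osd.3': 1.0, 'osd.4': 1.0}
print(place_pg('pg.1.3f', osds))   # repeatable: same answer every time
del osds['osd.2']                  # simulate a failure/removal
print(place_pg('pg.1.3f', osds))   # only PGs that used osd.2 move
```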

DATA IS ORGANIZED INTO POOLS

[Diagram: objects belong to pools (A, B, C, D); each pool has its own placement groups, which CRUSH maps onto the cluster.]

TIERED STORAGE

TWO WAYS TO CACHE

● Within each OSD

– Combine SSD and HDD for each OSD
– Make localized promote/demote decisions
– Leverage existing tools
  ● dm-cache, …, FlashCache
  ● Variety of caching controllers
– We can help with hints

● Cache on separate devices/nodes

– Different hardware for different tiers
  ● Slow nodes for cold data
  ● High performance nodes for hot data
– Add, remove, scale each tier independently
  ● Unlikely to choose right ratios at procurement time

TIERED STORAGE

[Diagram: the application talks to a replicated cache pool that sits in front of an erasure coded backing pool, all within one Ceph storage cluster.]

RADOS TIERING PRINCIPLES

● Each tier is a RADOS pool

– May be replicated or erasure coded

● Tiers are durable

– e.g., replicate across SSDs in multiple hosts

● Each tier has its own CRUSH policy

– e.g., map to SSD devices/hosts only

● librados clients adapt to tiering topology

– Transparently direct requests accordingly

● e.g., to cache
– No changes to RBD, RGW, CephFS, etc.

WRITE INTO CACHE POOL

[Diagram sequence: the Ceph client WRITEs to the cache pool (SSD, writeback mode) and receives the ACK from the cache tier; if the object is not yet cached, it is first PROMOTEd from the backing pool (HDD).]

READ (CACHE HIT)

[Diagram: the client's READ is answered directly by the cache pool (SSD); the backing pool (HDD) is not involved.]

READ (CACHE MISS)

[Diagram sequence: on a miss, the cache tier either returns a REDIRECT so the client reads the object straight from the backing pool (HDD), or PROMOTEs the object into the cache pool (SSD) and serves the READ REPLY from there.]

ESTIMATING TEMPERATURE

● Each PG constructs in-memory bloom filters

– Insert records on both read and write
– Each filter covers configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Maintain most recent N filters on disk

● Estimate temperature

– Has object been accessed in any of the last N periods?
– ...in how many of them?
– Informs flush/evict decision

● Estimate “recency”

– How many periods has it been since the object was last accessed?
– Informs read miss behavior: promote vs. redirect
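
A toy Python sketch of this idea (not Ceph's actual HitSet implementation): keep one small bloom filter per period, record every access, and derive temperature and recency from the most recent N filters. The BloomFilter class is a deliberately crude stand-in:

```python
import hashlib
from collections import deque

class BloomFilter:
    """Crude stand-in for a real bloom filter: fixed bit array, k hashes."""
    def __init__(self, bits=8192, hashes=4):
        self.bits, self.hashes, self.array = bits, hashes, bytearray(bits // 8)
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f'{i}:{key}'.encode()).digest()
            yield int.from_bytes(h[:4], 'big') % self.bits
    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)
    def __contains__(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class HitSets:
    """Keep the most recent N per-period filters; record reads and writes."""
    def __init__(self, n_periods=8):
        self.filters = deque([BloomFilter()], maxlen=n_periods)
    def record(self, obj):        # called on every read and write
        self.filters[-1].add(obj)
    def rotate(self):             # called when a period (e.g., 1 hour) ends
        self.filters.append(BloomFilter())
    def temperature(self, obj):   # in how many recent periods was obj seen?
        return sum(obj in f for f in self.filters)
    def recency(self, obj):       # periods since the most recent hit (None = never)
        for age, f in enumerate(reversed(self.filters)):
            if obj in f:
                return age
        return None

hits = HitSets()
hits.record('obj.A'); hits.rotate(); hits.record('obj.A'); hits.rotate()
print(hits.temperature('obj.A'), hits.recency('obj.A'))   # -> 2 1
```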

AGENT: FLUSH COLD DATA

[Diagram: the tiering agent FLUSHes cold objects from the cache pool (SSD, writeback) down to the backing pool (HDD), which ACKs each flush; the client is not involved.]

TIERING AGENT

● Each PG has an internal tiering agent

– Manages PG based on administrator defined policy

● Flush dirty objects

– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back to the base pool

● Evict clean objects

– Greater “effort” as pool/PG size approaches target size
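
A hedged sketch of that policy in Python: flush the coldest dirty objects once the dirty ratio passes its target, and evict the coldest clean objects as the pool approaches its target size. The thresholds, the object map, and the reuse of the HitSets idea from the earlier sketch are all assumptions:

```python
def agent_pass(objects, hitsets, target_dirty_ratio=0.4, target_full_ratio=0.8,
               capacity=1000):
    """One pass of a toy tiering agent over a cache PG.

    `objects` maps name -> {'dirty': bool}; returns (to_flush, to_evict).
    """
    dirty = [o for o, meta in objects.items() if meta['dirty']]
    clean = [o for o, meta in objects.items() if not meta['dirty']]
    to_flush, to_evict = [], []

    # Flush: when too much of the pool is dirty, write the coldest dirty
    # objects back to the base pool (they become clean afterwards).
    if len(dirty) / capacity > target_dirty_ratio:
        excess = len(dirty) - int(target_dirty_ratio * capacity)
        to_flush = sorted(dirty, key=hitsets.temperature)[:excess]

    # Evict: as the pool approaches its target size, drop the coldest clean
    # objects; the "effort" grows with fullness.
    if len(objects) / capacity > target_full_ratio:
        excess = len(objects) - int(target_full_ratio * capacity)
        to_evict = sorted(clean, key=hitsets.temperature)[:excess]

    return to_flush, to_evict
```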

READ ONLY CACHE TIER

[Diagram: writes go straight to the replicated backing pool and are ACKed there; reads are served from a read-only cache pool (SSD), which PROMOTEs objects from the backing pool as needed.]

ERASURE CODING

[Diagram: a replicated pool stores full COPIES of each object on several OSDs; an erasure coded pool stores each object as data shards 1-4 plus coding shards X and Y.]

REPLICATED POOL:

● Full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery

ERASURE CODED POOL:

● One copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
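
The overhead figures above follow directly from the replica count and the k/m split; a quick check using the shard counts shown in these diagrams (4 data shards plus 2 coding shards):

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with n full copies."""
    return copies

def ec_overhead(k, m):
    """Raw bytes stored per byte of user data with k data + m coding shards."""
    return (k + m) / k

print(replication_overhead(3))   # 3   -> 3x raw storage, i.e. 200% extra
print(ec_overhead(4, 2))         # 1.5 -> 1.5x raw storage, i.e. 50% extra
```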

ERASURE CODING SHARDS

[Diagram: an object is split into data shards 1-4 and coding shards X and Y, each stored on a different OSD in the erasure coded pool.]

ERASURE CODING SHARDS

[Diagram: stripe units 0-19 are laid out round-robin across data shards 1-4, with coding shards X and Y holding parity blocks A-E and A'-E' for each stripe.]

● Variable stripe size
● Zero-fill shards (logically) in partial tail stripe
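
As an illustration of that layout (not Ceph's implementation), a sketch that splits a byte string into k data shards stripe by stripe, zero-filling the logical tail of a partial stripe; the stripe unit size is an assumption:

```python
def stripe_object(data: bytes, k: int = 4, stripe_unit: int = 4096):
    """Split `data` into per-shard chunk lists.

    Each stripe holds k units; a partial tail stripe is logically padded
    with zeros so every shard of that stripe has the same length.
    """
    shards = [[] for _ in range(k)]
    stripe_width = k * stripe_unit
    for offset in range(0, len(data), stripe_width):
        stripe = data[offset:offset + stripe_width]
        stripe = stripe.ljust(stripe_width, b'\0')       # zero-fill the tail
        for i in range(k):
            shards[i].append(stripe[i * stripe_unit:(i + 1) * stripe_unit])
    return shards

shards = stripe_object(b'x' * 20000)   # one full stripe plus a partial tail stripe
print([sum(len(c) for c in chunks) for chunks in shards])   # 8192 bytes per shard
```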

PRIMARY

[Diagram: one of the shard OSDs (1-4, X, Y) is the primary for the placement group and coordinates I/O for the erasure coded pool.]

EC READ

[Diagram sequence: the Ceph client sends a READ to the PG's primary OSD; the primary issues READS to the other shard OSDs, reconstructs the object from enough shards, and returns the READ REPLY to the client.]

EC WRITE

[Diagram sequence: the client sends a WRITE to the primary; the primary encodes the update into shards and issues WRITES to the other shard OSDs; once those complete, the primary ACKs the write to the client.]

EC WRITE: DEGRADED

[Diagram: the same write path with one shard OSD down; the primary writes the shards that remain available.]

EC WRITE: PARTIAL FAILURE

[Diagram sequence: the primary fans out shard WRITES, but only some OSDs apply them before a failure, leaving a mix of new (B) and old (A) shard versions.]

EC RESTRICTIONS

● Overwrite in place will not work in general

● Log and 2PC would increase complexity, latency

● We chose to restrict allowed operations

– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)

● These operations can all easily be rolled back locally

– create → delete
– append → truncate
– remove → roll back to previous generation

● Object attrs preserved in existing PG logs (they are small)

● Key/value data is not allowed on EC pools

EC WRITE: PARTIAL FAILURE

[Diagram sequence: the surviving shards initially disagree (B B A B A A); using the local rollback operations above, the partially applied write is undone and every shard returns to the previous version (A A A A A A).]

EC RESTRICTIONS

● This is a small subset of allowed librados operations

– Notably cannot (over)write any extent

● Coincidentally, these operations are also inefficient for erasure codes

– Generally require read/modify/write of affected stripe(s)

● Some applications can consume EC directly

– RGW (no object data update in place)

● Others can combine EC with a cache tier (RBD, CephFS)

– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data

WHICH ERASURE CODE?

● The EC algorithm and implementation are pluggable

– jerasure (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code; layers over existing plugins)

● Parameterized

– Pick k and m, stripe size

● OSD handles data path, placement, rollback, etc.

● Plugin handles

– Encode and decode
– Given these available shards, which ones should I fetch to satisfy a read?
– Given these available shards and these missing shards, which ones should I fetch to recover?
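
To make the encode/decode part of that contract concrete, a toy sketch with a single XOR parity shard (k data shards, m = 1). This is nothing like jerasure's real codes, but it shows how one missing shard can be rebuilt from the others:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_shards):
    """k data shards in, k+1 shards out: the extra shard is the XOR parity."""
    parity = data_shards[0]
    for shard in data_shards[1:]:
        parity = xor_bytes(parity, shard)
    return list(data_shards) + [parity]

def decode(shards):
    """Reconstruct at most one missing shard (None) by XORing the others."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError('a single XOR parity can only repair one lost shard')
    if missing:
        present = [s for s in shards if s is not None]
        rebuilt = present[0]
        for s in present[1:]:
            rebuilt = xor_bytes(rebuilt, s)
        shards[missing[0]] = rebuilt
    return shards[:-1]    # drop the parity, return the k data shards

shards = encode([b'AAAA', b'BBBB', b'CCCC', b'DDDD'])   # k = 4, m = 1
shards[2] = None                                         # lose one shard
print(decode(shards))    # [b'AAAA', b'BBBB', b'CCCC', b'DDDD']
```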

COST OF RECOVERY

[Diagram: a 1 TB OSD fails and its contents must be recovered.]

COST OF RECOVERY (REPLICATION)

[Diagram: with replication, recovering the failed 1 TB OSD means copying about 1 TB of replica data in total; with declustered placement that load is spread as roughly .01 TB read from each of many peer OSDs.]

COST OF RECOVERY (EC)

[Diagram: with erasure coding, rebuilding the failed 1 TB OSD requires reading roughly 1 TB from each of the k surviving shard OSDs (4 TB of reads in this example).]
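
A back-of-the-envelope comparison of the bytes read to rebuild a failed 1 TB OSD, using the replica count and k value shown in these slides (3x replication, k = 4 data shards); LRC, next, exists precisely to shrink the erasure coded figure for the common single-failure case:

```python
failed_tb = 1.0

# Replication: one surviving replica of each lost object is read, so ~1 TB
# in total, spread across many peer OSDs thanks to declustered placement.
replication_read_tb = failed_tb

# Standard erasure coding (k = 4): each lost shard is rebuilt from k surviving
# shards, so roughly k bytes are read per byte recovered.
k = 4
ec_read_tb = k * failed_tb

print(replication_read_tb, ec_read_tb)   # 1.0 4.0
```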

LOCAL RECOVERY CODE (LRC)

[Diagram: in addition to data shards 1-4 and global coding shards X and Y, LRC stores local parity shards A, B, and C, so a single lost shard can be rebuilt from a small local group rather than from k shards.]

BIG THANKS TO

● Ceph

– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– Sam Just (Inktank / Red Hat)
– David Zafman (Inktank / Red Hat)

● jerasure

– Jim Plank (University of Tennessee)
– Kevin Greenan (Box)

ROADMAP

WHAT'S NEXT

● Erasure coding

– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure

● Cache pools

– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
  ● e.g., slow / “cheap” flash can read just as fast
– Complex topologies
  ● Multiple readonly cache tiers in multiple sites

● Tiering

– Support “redirects” to cold tier below base pool
– Dynamic spin-down

OTHER ONGOING WORK

● Performance optimization (SanDisk, Mellanox)

● Alternative OSD backends

– leveldb, rocksdb, LMDB
– hybrid key/value and …

● Messenger (network layer) improvements

– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)

● Multi-datacenter RADOS replication

● CephFS

– Online consistency checking
– Performance, robustness

THANK YOU!

Sage Weil, CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas