Ceph, Programmable Storage, and Data Fabrics
Carlos Maltzahn, UC Santa Cruz
Fermilab, 6/9/17

Background

• Adjunct Professor, Computer Science, UC Santa Cruz
• Director, UCSC Systems Research Laboratory (SRL)
• Director, Center for Research in Open Source Software (CROSS), cross.ucsc.edu
• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)
• 1999-2004: Performance Engineer, Netapp
• Advising 6 Ph.D. students; graduated 5 Ph.D. students
• I do this 100% of my time!

• Current Research
  • High-performance ultra-scale storage and data management
  • End-to-end performance management and QoS
  • Reproducible evaluation of systems
  • Network intermediaries
• Other Research
  • Data management games
  • Information retrieval
  • Cooperation dynamics

Agenda

• Overview of Ceph
• Programmable Storage
• CDNs/Data Fabrics

Ceph History

• 2005: Started as a summer project, funded by DOE/NNSA (LANL, LLNL, SNL)
• Quickly turned into Sage Weil's Ph.D. project
• 2006: Publications at OSDI and SC
• 2007: Sage graduated at the end of 2007 and turned the prototype into an open-source project
• 2010: Ceph kernel client merged into Linux 2.6.34
• 2011: Inktank startup founded
• 2014: Red Hat acquires Inktank for $175M

ARCHITECTURAL COMPONENTS

[Diagram] The Ceph stack:
• RGW (APP): a web-services gateway for object storage, compatible with S3 and Swift
• RBD (HOST/VM): a reliable, fully-distributed block device with cloud platform integration
• CEPHFS (CLIENT): a distributed file system with POSIX semantics and scale-out metadata management
• RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS

[Diagram] Each OSD daemon manages one local file system (btrfs, xfs, or ext4) on one disk; a small set of monitors (M) runs alongside the OSDs.

RADOS CLUSTER

[Diagram] An application talks to a RADOS cluster: many OSDs plus a small number of monitors (M).

RADOS COMPONENTS

OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer for replication & recovery

Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients

WHERE DO OBJECTS LIVE?

[Diagram] An application holds an object, but which node in the cluster should store it?

A METADATA SERVER?

[Diagram] Option 1: the application (1) asks a central metadata server where the object lives, then (2) accesses that location. The metadata server becomes a bottleneck and a single point of failure.

CALCULATED PLACEMENT

[Diagram] Option 2: calculate placement from the object name, e.g. static ranges A-G, H-N, O-T, U-Z across nodes; the application maps object "F" to the A-G node without any lookup.

EVEN BETTER: CRUSH!

[Diagram] Objects are hashed into placement groups (PGs), and CRUSH maps each placement group onto devices in the cluster.

CRUSH IS A QUICK CALCULATION

[Diagram] Given an object name, a client computes the object's location in the RADOS cluster directly; no lookup is required.

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH:
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
  • Limited data migration on change
• Rule-based configuration
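A toy sketch of this placement style (not the actual CRUSH algorithm; all names and parameters are invented for illustration): the same pure calculation run on any client or OSD yields the same object-to-PG-to-device mapping, so no lookup table is needed and placement stays stable as long as the inputs do.

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash (unlike Python's salted hash()), same on every host.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def object_to_pg(obj_name: str, pg_num: int) -> int:
    # Objects are hashed into one of pg_num placement groups.
    return stable_hash(obj_name) % pg_num

def pg_to_devices(pg: int, osds: list[str], replicas: int = 3) -> list[str]:
    # Toy stand-in for CRUSH: rank OSDs by a pseudo-random but deterministic
    # score per (pg, osd) pair and take the top `replicas` devices.
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"))
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(8)]
pg = object_to_pg("myobject", pg_num=128)
print(pg, pg_to_devices(pg, osds))  # same answer on every client
```

Adding or removing one OSD only perturbs the per-PG ranking locally, which is the intuition behind CRUSH's "limited data migration on change."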

DATA IS ORGANIZED INTO POOLS

[Diagram] Objects are grouped into named pools (Pool A, Pool B, Pool C, Pool D); each pool contains its own placement groups, which map onto the shared cluster.

ACCESSING A RADOS CLUSTER

[Diagram] An application links LIBRADOS and sends object reads and writes over a socket directly to the RADOS cluster (OSDs plus monitors).

LIBRADOS: RADOS ACCESS FOR APPS

LIBRADOS:
• Direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
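A minimal sketch using the librados Python binding; it assumes a reachable cluster described by /etc/ceph/ceph.conf and an already-existing pool named "mypool":

```python
import rados

# Connect using a local ceph.conf; the pool name "mypool" is an
# assumption and must already exist in the cluster.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")
    try:
        ioctx.write_full("greeting", b"hello rados")  # whole-object write
        print(ioctx.read("greeting"))                 # b'hello rados'
        ioctx.set_xattr("greeting", "lang", b"en")    # per-object metadata
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```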

THE RADOS GATEWAY

[Diagram] Applications speak REST to RADOSGW instances; each gateway uses LIBRADOS over a socket to store objects in the RADOS cluster.

RADOSGW MAKES RADOS WEBBY

RADOSGW:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
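Because RGW speaks the S3 protocol, a stock S3 client works unchanged. A sketch using boto3; the endpoint URL, credentials, and bucket name below are placeholders for a real RGW deployment and user:

```python
import boto3

# Point a standard S3 client at an RGW endpoint (placeholder values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.create_bucket(Bucket="demo")
s3.put_object(Bucket="demo", Key="greeting", Body=b"hello rgw")
print(s3.get_object(Bucket="demo", Key="greeting")["Body"].read())
```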

STORING VIRTUAL DISKS

[Diagram] A VM's virtual disk lives in the cluster: the hypervisor links LIBRBD, which stripes the disk image across the RADOS cluster.

KERNEL MODULE FOR MAX FLEXIBLE!

[Diagram] Alternatively, a Linux host maps an RBD image directly through the KRBD kernel module, with no hypervisor involved.

RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux kernel (2.6.39+)
  • Qemu/KVM, native coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
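A sketch of raw block I/O through librbd's Python binding; the pool name "rbdpool" and the image name are assumptions:

```python
import rados
import rbd

# Create a 1 GiB image in an assumed pool "rbdpool" and do raw block I/O.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbdpool")
try:
    rbd.RBD().create(ioctx, "vm-disk-0", 1 << 30)  # size in bytes
    image = rbd.Image(ioctx, "vm-disk-0")
    try:
        image.write(b"boot sector bytes", 0)  # write at offset 0
        print(image.read(0, 17))              # read the bytes back
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()
```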

SEPARATE METADATA SERVER

[Diagram] With CephFS, a Linux host (kernel module) uses separate paths: metadata operations go through the metadata server, while file data is exchanged directly with the RADOS cluster.

SCALABLE METADATA SERVERS

METADATA SERVER:
• Manages metadata for a POSIX-compliant shared file system
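A sketch using the libcephfs Python binding: metadata operations (mkdir, open) involve the MDS, while the file data itself goes to RADOS. The config path and directory names are assumptions:

```python
import cephfs

# Mount the (assumed) CephFS file system via libcephfs and do POSIX-style I/O.
fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
fs.mount()
try:
    fs.mkdir("/demo", 0o755)                      # metadata op via the MDS
    fd = fs.open("/demo/hello.txt", "w", 0o644)   # create for writing
    fs.write(fd, b"hello cephfs", 0)              # data goes to RADOS
    fs.close(fd)
finally:
    fs.shutdown()
```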

Programmable Storage

Students: Michael Sevilla, Noah Watkins, Ivo Jimenez
Collaborators: Jeff LeFevre, Peter Alvaro, Shel Finkelstein, Carlos Maltzahn

M. Sevilla et al. “Malacology: A Programmable Storage System,” EuroSys 2017, Belgrade, Serbia, April 23-26, 2017.

Narrow waist of storage

• Traditional interfaces
  • Blocks: fixed-size, small data chunks (512B, 4KB), flat name space
  • Objects: flexibly sized, large data chunks (100KBs to MBs), flat name space
  • Files: flexibly sized byte streams, hierarchical name space
• Narrow waist creates a semantic gap
  • Storage systems have no idea what the data means
  • Applications have no idea how to best store the data: lots of magic numbers!
• Advent of open-source software storage systems
  • No more fear of vendor lock-in
  • New opportunity to construct better storage interfaces

Programmable Storage: Generalization of existing storage abstractions
• Classes of objects with class-specific access methods

[Excerpt from the Malacology paper, shown on the slide:]

…scrubbing [50]. Ceph already provides some degree of programmability; the object storage daemons support domain-specific code that can manipulate objects on the server that has the data local. These "interfaces" are implemented by composing existing low-level storage abstractions that execute atomically. They are written in C++ and are statically loaded into the system.

The Ceph community provides empirical evidence that developers are already beginning to embrace programmable storage. Figure 2 shows a dramatic growth in the production use of domain-specific interfaces in the Ceph community since 2010. In that figure, classes are functional groupings of methods on storage objects (e.g. remotely computing and caching the checksum of an object extent). What is most remarkable is that this trend contradicts the notion that API changes are a burden for users. Rather it appears that gaps in existing interfaces are being addressed through ad-hoc approaches to programmability. In fact, Table 1 categorizes existing interfaces and we clearly see a trend towards reusable services.

The takeaway from Figure 2 is that programmers are already trying to use programmability because their needs, whether they be related to performance, availability, consistency, convenience, etc., are not satisfied by the existing default set of interfaces. The popularity of the custom object interface facility of Ceph could be due to a number of reasons, such as the default algorithms/tunables of the storage system being insufficient for the application's performance goals, programmers wanting to exploit application-specific semantics, and/or programmers knowing how to manage resources to improve performance. A solution based on application-specific object interfaces is a way to work around the traditionally rigid storage APIs because custom object interfaces give programmers the ability to tell the storage system about their application: if the application is CPU or I/O bound, if it has locality, if its size has the potential to overload a single node, etc. Programmers often know what the problem is and how to solve it, but without the ability to modify object interfaces, they had no way to express to the storage system how to handle their data.

Our approach is to expose more of the commonly used, code-hardened subsystems of the underlying storage system as interfaces. The intent is that these interfaces, which can be as simple as a redirection to the persistent data store or as complicated as a strongly consistent directory service, should be used and re-used in many contexts to implement a wide range of services. By making programmability a 'feature', rather than a 'hack' or 'workaround', we help standardize a development process that now is largely ad-hoc.

3. Challenges

Implementing the infrastructure for programmability into existing services and abstractions of distributed storage systems is challenging, even if one assumes that the source code of the storage system and the necessary expertise for understanding it is available. Some challenges include:

• Storage systems are generally required to be highly available, so that any complete restart of the storage system to reprogram it is usually unacceptable.
• Policies and optimizations are usually hard-wired into the services, and one has to be careful when factoring them to avoid introducing additional bugs. These policies and optimizations are usually cross-cutting solutions to concerns or trade-offs that cannot be fully explored at the time the code is written (as they relate to workload or hardware). Given these policies and optimizations, decomposition of otherwise orthogonal internal abstractions can be difficult or dangerous.
• Mechanisms that are often only exercised according to hard-wired policies, and not in their full generality, have hidden bugs that are revealed as soon as those mechanisms are governed by different policies. In our experience, introducing programmability into a storage system proved to be a great debugging tool.
• Programmability, especially in live systems, implies changes that need to be carefully managed by the system itself, including versioning and propagation of those changes without affecting correctness.

Figure 2: Since 2010, the growth in the number of co-designed object storage interfaces in Ceph has been accelerating. This plot is the number of object classes (a group of interfaces), and the total number of methods (the actual API end-points).

Table 1: A variety of object storage classes exist to expose interfaces to applications. # is the number of methods that implement these categories.

Category            | Example                                                            | #
Logging             | Geographically distribute replicas                                 | 11
Metadata Management | Snapshots in the block device; scan extents for file system repair | 74
Locking             | Grants clients exclusive access                                    | 6
Other               | Garbage collection, reference counting                             | 4
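As a client-side illustration of these custom object interfaces: librados' Python binding can invoke an object-class method so that it executes on the OSD holding the object, next to the data. This sketch assumes a pool named "mypool" and that Ceph's example "hello" object class is available on the OSDs; real interfaces are written in C++ and loaded into the OSDs as described above:

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")
try:
    # Run the object-class method server-side, where the data lives;
    # ("hello", "say_hello") is Ceph's example class and method.
    ret, out = ioctx.execute("myobject", "hello", "say_hello", b"")
    print(out)
finally:
    ioctx.close()
    cluster.shutdown()
```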

Narrow Waist to Storage → Redundant Functionality

[Diagram] "My App" needs batching, atomic ops, consensus, data access, and migration, but can only reach storage through the narrow file/block/object interface. Meanwhile, the storage system already implements consensus, batching, atomic ops, data access, and migration internally.

CDNs / Data Fabric: A Global Semantic Fabric

• Data access by declarative "query" using meaningful data names instead of byte ranges
• Data location service that associates access costs with locations of data components useful to a given query
• Automatic construction of efficient data access plan for a given query based on probing of nearby caches and the data location service
• Composition of query response from potentially many data sources
• Caching and prefetching policies based on meaningful data names
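A toy illustration of the access-plan idea above (every name, cost, and data structure here is invented for the sketch): probe nearby caches first, then fall back to the cheapest source known to the location service.

```python
# data name -> [(source, access cost), ...], as a stand-in for a
# hypothetical data location service.
LOCATION_SERVICE = {
    "sensor/2017/06/temps": [("cache.local", 1), ("site-b", 40), ("origin", 100)],
}

# Contents of caches that are "nearby" this client (also invented).
NEARBY_CACHES = {"cache.local": {"sensor/2017/06/temps"}}

def plan_access(name: str) -> str:
    # Probe nearby caches first; fall back to the cheapest known source.
    for cache, contents in NEARBY_CACHES.items():
        if name in contents:
            return cache
    sources = LOCATION_SERVICE.get(name, [])
    return min(sources, key=lambda s: s[1])[0] if sources else "origin"

print(plan_access("sensor/2017/06/temps"))  # -> cache.local
```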

Thanks!

Contact:
• [email protected]
• https://users.soe.ucsc.edu/~carlosm

Acknowledgements:
• Ceph slides (slides 5-29) by Sage Weil, Red Hat
• Funding of this work comes from the Center for Research in Open Source Software, DOE Award DE-SC0016074, and NSF Award 1450488
