Ceph, Programmable Storage, and Data Fabrics
Carlos Maltzahn, UC Santa Cruz
Fermilab, 6/9/17
Background
• Adjunct Professor, Computer Science, UC Santa Cruz
• Director, UCSC Systems Research Laboratory (SRL)
• Director, Center for Research in Open Source Software (CROSS), cross.ucsc.edu
• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)
• 1999-2004: Performance Engineer, Netapp
• Advising 6 Ph.D. students; graduated 5 Ph.D. students
• I do this 100% of my time!

Current Research
• High-performance ultra-scale storage and data management
• End-to-end performance management and QoS
• Reproducible Evaluation of Systems
• Network Intermediaries

Other Research
• Data Management Games
• Information Retrieval
• Cooperation Dynamics

Agenda
• Overview of Ceph
• Programmable Storage
• CDNs/Data Fabrics

Ceph History
• 2005: Started as a summer project, funded by DOE/NNSA (LANL, LLNL, SNL); quickly turned into Sage Weil's Ph.D. project
• 2006: Publications at OSDI and SC
• 2007: Sage graduated end of 2007, turned the prototype into an open-source project
• 2010: Ceph Linux kernel client merged in 2.6.34
• 2011: Inktank startup
• 2014: Red Hat acquires Inktank for $175M

ARCHITECTURAL COMPONENTS
• RGW (APP): A web services gateway for object storage, compatible with S3 and Swift
• RBD (HOST/VM): A reliable, fully-distributed block device with cloud platform integration
• CEPHFS (CLIENT): A distributed file system with POSIX semantics and scale-out metadata management
• LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
• RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS
[diagram: each OSD daemon runs on top of a local file system (btrfs, xfs, ext4) on its own disk; monitors (M) run alongside the OSDs]

RADOS CLUSTER
[diagram: an application talks directly to a RADOS cluster of OSDs and monitors (M)]

RADOS COMPONENTS
OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group, ...)
• Serve stored objects to clients
• Intelligently peer for replication & recovery
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients

WHERE DO OBJECTS LIVE?
[diagram: an application holds an object; which OSD in the cluster should store it?]

A METADATA SERVER?
[diagram: option 1 — the application (1) asks a metadata server for the object's location, then (2) accesses that OSD]

CALCULATED PLACEMENT
[diagram: option 2 — a function F maps each object name to the OSD responsible for a name range (A-G, H-N, O-T, U-Z)]

EVEN BETTER: CRUSH!
[diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto OSDs in the cluster]

CRUSH IS A QUICK CALCULATION
[diagram: the object-to-PG-to-OSD mapping is computed on the client; no lookup against the RADOS cluster is needed]

CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping: limited data migration on change
• Rule-based configuration: infrastructure topology aware, adjustable replication, weighting
(A simplified placement sketch follows below.)
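The CRUSH bullets above amount to this: placement is a pure function of the object name and the cluster map, so any client can compute where data lives without asking a directory service. The sketch below is not Ceph's CRUSH algorithm, only a minimal Python stand-in that shows the shape of the idea (object → placement group → OSDs via deterministic hashing); the pool id, PG count, and OSD list are illustrative parameters.

    import hashlib

    def place(object_name, pool_id, pg_count, osds, replicas=3):
        """Toy stand-in for CRUSH: map an object to a placement group (PG),
        then deterministically rank OSDs for that PG. Real CRUSH also honors
        cluster topology, per-device weights, and placement rules."""
        # Object -> PG: a stable hash, no central lookup table required.
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        pg = (pool_id, h % pg_count)

        # PG -> ordered OSD list: a rendezvous-hashing-style draw, so adding
        # or removing an OSD only disturbs the PGs that actually selected it.
        def score(osd, replica):
            return int(hashlib.md5(f"{pg}-{osd}-{replica}".encode()).hexdigest(), 16)

        chosen = []
        for r in range(replicas):
            candidates = [o for o in osds if o not in chosen]
            chosen.append(max(candidates, key=lambda o: score(o, r)))
        return pg, chosen

    # Every client computes the same mapping independently -- that is the point.
    print(place("myobject", pool_id=1, pg_count=128, osds=list(range(12))))

The "stable mapping / limited data migration" bullet corresponds to the rendezvous-style draw here: removing one OSD only remaps the PGs that had chosen it, rather than reshuffling everything.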
DATA IS ORGANIZED INTO POOLS
[diagram: objects from pools A, B, C, and D map into per-pool placement groups, which are spread across the cluster]

ACCESSING A RADOS CLUSTER
[diagram: an application links LIBRADOS and talks to the monitors and OSDs over a socket]

LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
• Direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
(A minimal librados sketch appears at the end of this overview.)

THE RADOS GATEWAY
[diagram: applications speak REST to RADOSGW instances, which use LIBRADOS to talk to the RADOS cluster]

RADOSGW MAKES RADOS WEBBY
RADOSGW:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications

STORING VIRTUAL DISKS
[diagram: a VM's hypervisor uses LIBRBD to store the virtual disk in the RADOS cluster]

KERNEL MODULE FOR MAX FLEXIBLE!
[diagram: a Linux host uses the KRBD kernel module to access the RADOS cluster directly]

RBD STORES VIRTUAL DISKS
RADOS BLOCK DEVICE:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux kernel (2.6.39+)
  • Qemu/KVM; native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox

SEPARATE METADATA SERVER
[diagram: a Linux host's kernel module sends metadata operations to the metadata servers and data directly to the RADOS cluster]

SCALABLE METADATA SERVERS
METADATA SERVER:
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
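To make the LIBRADOS slide concrete, here is a minimal sketch using the Python binding (the rados module that ships with Ceph). It assumes a reachable cluster, a readable ceph.conf and keyring, and an existing pool named 'data'; the pool name and config path are placeholders.

    import rados

    # Connect to the cluster described by a local ceph.conf (path is a placeholder).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # All I/O goes through an I/O context bound to a pool ('data' is assumed to exist).
        ioctx = cluster.open_ioctx('data')
        try:
            # Write a whole object, read it back, and attach a piece of metadata.
            ioctx.write_full('hello-object', b'hello from librados')
            print(ioctx.read('hello-object'))
            ioctx.set_xattr('hello-object', 'lang', b'python')
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

RADOSGW, RBD, and CephFS all sit above this same layer: the gateway stores buckets and objects in RADOS, librbd stripes disk images across a pool, and the metadata servers keep the filesystem's metadata in RADOS.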
Programmable Storage
Students: Michael Sevilla, Noah Watkins, Ivo Jimenez
Collaborators: Jeff LeFevre, Peter Alvaro, Shel Finkelstein, Carlos Maltzahn
M. Sevilla et al., "Malacology: A Programmable Storage System," EuroSys 2017, Belgrade, Serbia, April 23-26, 2017.

Narrow waist of storage
• Traditional interfaces
  • Blocks: fixed-size, small data chunks (512B, 4KB), flat name space
  • Objects: flexibly sized, large data chunks (100KBs to MBs), flat name space
  • Files: flexibly sized byte streams, hierarchical name space
• The narrow waist creates a semantic gap
  • Storage systems have no idea what the data means
  • Applications have no idea how to best store the data: lots of magic numbers!
• Advent of open-source software storage systems
  • No more fear of vendor lock-in
  • New opportunity to construct better storage interfaces

Excerpt from the Malacology paper (Sevilla et al., EuroSys 2017):

"... scrubbing [50]. Ceph already provides some degree of programmability; the object storage daemons support domain-specific code that can manipulate objects on the server that has the data local. These "interfaces" are implemented by composing existing low-level storage abstractions that execute atomically. They are written in C++ and are statically loaded into the system.

The Ceph community provides empirical evidence that developers are already beginning to embrace programmable storage. Figure 2 shows a dramatic growth in the production use of domain-specific interfaces in the Ceph community since 2010. In that figure, classes are functional groupings of methods on storage objects (e.g. remotely computing and caching the checksum of an object extent). What is most remarkable is that this trend contradicts the notion that API changes are a burden for users. Rather, it appears that gaps in existing interfaces are being addressed through ad-hoc approaches to programmability. In fact, Table 1 categorizes existing interfaces and we clearly see a trend towards reusable services.

... programmers wanting to exploit application-specific semantics, and/or programmers knowing how to manage resources to improve performance. A solution based on application-specific object interfaces is a way to work around the traditionally rigid storage APIs, because custom object interfaces give programmers the ability to tell the storage system about their application: if the application is CPU or I/O bound, if it has locality, if its size has the potential to overload a single node, etc. Programmers often know what the problem is and how to solve it, but without the ability to modify object interfaces they had no way to express to the storage system how to handle their data.

Our approach is to expose more of the commonly used, code-hardened subsystems of the underlying storage system as interfaces. The intent is that these interfaces, which can be as simple as a redirection to the persistent data store or as complicated as a strongly consistent directory service, should be used and re-used in many contexts to implement a wide range of services. By making programmability a 'feature', rather than a 'hack' or 'workaround', we help standardize a development process that now is largely ad-hoc."
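As a concrete illustration of the "custom object interfaces" described in the excerpt, the sketch below calls an OSD object class from a client through librados. It assumes Ceph's bundled example class cls_hello (with a say_hello method) is built into the OSDs and permitted by the OSD's class load/allow settings, and that the Python binding exposes Ioctx.execute(); availability and the exact return shape vary by release, so treat this as a sketch rather than a guaranteed API.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # path is a placeholder
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')  # assumed existing pool
        try:
            # Create an object, then invoke a server-side method on it. The
            # method runs inside the OSD that holds 'greeting', next to the data.
            ioctx.write_full('greeting', b'')
            result = ioctx.execute('greeting', 'hello', 'say_hello', b'')
            print(result)  # return shape differs across binding versions
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

The class itself is the C++ side the excerpt mentions: it is compiled against the OSD's cls interface and loaded into the storage daemons, which is the mechanism the Malacology work builds on when it proposes exposing more code-hardened subsystems as reusable interfaces.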