UPDATE
SAGE WEIL – RED HAT
SC16 CEPH BOF – 2016.11.16

JEWEL
[Architecture diagram: APP, APP, HOST/VM, and CLIENT consumers sit above RADOSGW, RBD, CEPHFS, and LIBRADOS, all built on RADOS]
● RADOSGW (AWESOME) – A bucket-based REST gateway, compatible with S3 and Swift
● RBD (AWESOME) – A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPHFS (NEARLY AWESOME) – A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● LIBRADOS (AWESOME) – A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOS (AWESOME) – A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

2016 = FULLY AWESOME
● OBJECT (RGW) – S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication
● BLOCK (RBD) – A virtual block device with snapshots, copy-on-write clones, and multi-site replication
● FILE (CEPHFS) – A distributed POSIX file system with coherent caches and snapshots on any directory
● LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP); see the sketch below
● RADOS – A software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)
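To make the LIBRADOS layer above concrete, here is a minimal, hedged sketch using the python-rados bindings to write and read back a single object. The pool name "mypool", the object name, and the ceph.conf path are illustrative assumptions, not anything taken from the slides.

```python
# Minimal librados sketch (python-rados).
# Assumes a reachable cluster, an existing pool named "mypool", and a
# usable client keyring referenced by /etc/ceph/ceph.conf.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                      # open a session with the monitors

ioctx = cluster.open_ioctx('mypool')   # I/O context bound to one pool
try:
    ioctx.write_full('greeting', b'hello from librados')  # whole-object write
    print(ioctx.read('greeting'))                         # read the object back
finally:
    ioctx.close()
    cluster.shutdown()
```

RADOSGW and RBD are built on this same library, which is what the stacked diagram above is illustrating.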
WHAT IS CEPH ALL ABOUT
● Distributed storage
● All components scale horizontally
● No single point of failure
● Software
● Hardware agnostic, commodity hardware
● Object, block, and file in a single cluster
● Self-manage whenever possible
● Free software (LGPL)

HYBRID CLUSTER
[Diagram: a single RADOS cluster with monitors (M M M) and HDD, SSD, and NVMe OSDs]
– Single Ceph cluster
– Separate RADOS pools for different storage hardware types

CEPH ON FLASH
● Samsung
  – up to 153 TB in 2U
  – 700K IOPS, 30 GB/s
● SanDisk
  – up to 512 TB in 3U
  – 780K IOPS, 7 GB/s

CEPH ON SSD+HDD
● SuperMicro
● Dell
● HP
● Quanta / QCT
● Fujitsu (Eternus)
● Penguin OCP

CEPH ON HDD (LITERALLY)
● Microserver on an 8 TB He WD HDD
  – 1 GB RAM
  – 2-core 1.3 GHz Cortex-A9
  – 2x 1GbE SGMII instead of SATA
● 504-OSD, 4 PB cluster
  – SuperMicro 1048-RT chassis
● Next gen will be aarch64

RELEASE CADENCE
● Hammer (LTS) – Spring 2015 – 0.94.z
● Infernalis – Fall 2015
● Jewel (LTS) – Spring 2016 – 10.2.z (WE ARE HERE)
● Kraken – Fall 2016
● Luminous (LTS) – Spring 2017 – 12.2.z

KRAKEN

BLUESTORE
● BlueStore = Block + NewStore
  – key/value database (RocksDB) for metadata
  – all data written directly to raw block device(s)
  – can combine HDD, SSD, NVMe, NVRAM
● Full data checksums (crc32c, xxhash)
● Inline compression (zlib, snappy, zstd)
● ~2x faster than FileStore
  – better parallelism, efficiency on fast devices
  – no double writes for data
  – performs well with very small SSD journals
[Diagram: OSD → ObjectStore → BlueStore; data goes straight to raw BlockDevices, metadata goes to RocksDB via BlueRocksEnv and BlueFS, with Allocator and Compressor alongside]

TCP OVERHEAD IS SIGNIFICANT
[Charts: writes and reads]

DPDK STACK
[Diagram]

RDMA STACK
● RDMA for data path (ibverbs)
● Keep TCP for control path
● Working prototype in ~1 month

LUMINOUS

MULTI-MDS CEPHFS

ERASURE CODE OVERWRITES
● Current EC RADOS pools are append only
  – simple, stable, suitable for RGW or behind a cache tier
● EC overwrites will allow RBD and CephFS to consume EC pools directly (see the sketch after this list)
● It's hard
  – implementation requires two-phase commit
  – complex to avoid full-stripe updates for small writes
  – relies on an efficient implementation of a "move ranges" operation by BlueStore
● It will be huge
  – tremendous impact on TCO
  – make RBD great again
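As a hedged illustration of what consuming an EC pool looks like once overwrites land (Luminous-era behavior, not something shipping in the Jewel/Kraken timeframe of this talk), the sketch below uses python-rados mon_command to create an erasure-coded pool and flip its allow_ec_overwrites flag. The pool name, pg_num, and the JSON field names (which mirror the ceph CLI arguments) are assumptions for illustration.

```python
# Hedged sketch: create an erasure-coded pool and, on a release that
# supports EC overwrites, enable allow_ec_overwrites so RBD or CephFS
# data can be placed in it. Field names mirror the ceph CLI and are
# assumptions here; check the command reference for your release.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    commands = [
        {'prefix': 'osd pool create', 'pool': 'ecpool',
         'pg_num': 64, 'pool_type': 'erasure'},
        {'prefix': 'osd pool set', 'pool': 'ecpool',
         'var': 'allow_ec_overwrites', 'val': 'true'},
    ]
    for cmd in commands:
        ret, outbuf, outs = cluster.mon_command(json.dumps(cmd), b'')
        print(ret, outs)  # non-zero ret means the monitors rejected the command
finally:
    cluster.shutdown()
```

The intent is that RBD keeps image metadata in a small replicated pool and places the bulk of its data in the EC pool, which is where the TCO impact mentioned above comes from.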
CEPH-MGR
● ceph-mon monitor daemons currently do a lot
  – more than they need to (PG stats to support things like 'df')
  – this limits cluster scalability
● ceph-mgr moves non-critical metrics into a separate daemon
  – that is more efficient
  – that can stream to graphite, influxdb
  – that can efficiently integrate with external modules (even Python!)
● Good host for
  – integrations, like the Calamari REST API endpoint
  – coming features like 'ceph top' or 'rbd top'
  – high-level management functions and policy
[Diagram: monitors (M M) alongside a new "???" daemon – time for new iconography]

QUALITY OF SERVICE
● Set policy for both
  – reserved/minimum IOPS
  – proportional sharing of excess capacity
● by
  – type of IO (client, scrub, recovery)
  – pool
  – client (e.g., VM)
● Based on the mClock paper from OSDI '10
  – IO scheduler
  – distributed enforcement with cooperating clients

RBD ORDERED WRITEBACK CACHE
● Low-latency writes to local SSD
● Persistent cache (across reboots, etc.)
● Fast reads from cache
● Ordered writeback to the cluster, with batching, etc.
● RBD image is always point-in-time consistent
– Local SSD alone: low latency; a SPoF; no data survives its loss
– RBD + ordered writeback cache: low latency; SPoF only for recent writes; stale but crash-consistent image after its loss
– Ceph RBD alone: higher latency; no SPoF; perfect durability

RADOSGW INDEXING
[Diagram: REST endpoints in front of Zones A, B, and C, each served by RADOSGW/LIBRADOS instances on top of separate RADOS clusters A, B, and C]

GROWING DEVELOPMENT COMMUNITY
● Red Hat
● Digiware
● Mirantis
● Mellanox
● SUSE
● Intel
● SanDisk
● Walmart Labs
● XSKY
● DreamHost
● ZTE
● Tencent
● LETV
● Deutsche Telekom
● Quantum
● Igalia
● EasyStack
● Fujitsu
● H3C
● DigitalOcean
● UnitedStack
● University of Toronto

TAKEAWAYS
● Nobody should have to use proprietary storage systems out of necessity
● The best storage solution should be built with free software
● Ceph is growing up, but we're not done yet
  – performance, scalability, features, ease of use
● We are highly motivated

THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
[email protected]
@liewegas