Petabytes of Ceph: From Tests to Production

Dan van der Ster, CERN IT Storage Group [email protected]

Ceph Day Switzerland, 14 June 2016

Outline

• Storage for CERN and Particle Physics

• Ceph@CERN, and how we use it

• Operating Ceph at Scale

• The Ceph Particle Physics Community

Storage for CERN and Particle Physics

• Huge data requirements (150 PB now, +50 PB/year in future)
• Worldwide LHC Grid standards for accessing and moving data
  • GridFTP and XrootD to access data, FTS to move data, SRM to manage data

[Diagram: CERN storage services, including AFS, NFS, RBD, CASTOR, EOS and Ceph]

• Storage operated by the FDO team at CERN IT
• Elsewhere: Hadoop, NFS appliances, …

• Shrinking OpenAFS, growing EOS (plus added CERNbox for sync)
• Ceph getting its first taste of physics data/apps in CASTOR and CVMFS

Ceph & Cinder @ CERN

Our Ceph Clusters

• Beesly + Wigner (3.6 PB + 433 TB, v0.94.7):
  • Cinder (various QoS types) + Glance + RadosGW
  • Isolated pools/disks for volumes, volumes++, RadosGW
  • Hardware reaching EOL this summer
• Dwight (0.5 PB, v0.94.7):
  • Preprod cluster for development (client side), testing upgrades / crazy ideas
• Erin (2.9 PB, v10.2.1++):
  • New cluster for CASTOR: disk buffer/cache in front of the tape drives

• Bigbang (~30 PB, master):
  • Playground for short-term scale tests whenever CERN receives new hardware

Growth of the Beesly cluster

[Plot: growth of the Beesly cluster, from ~200 TB total to ~450 TB of RBD + 50 TB of RGW]

OpenStack Glance + Cinder

• OpenStack is still Ceph's killer app. We've doubled usage in the past year.
• Very stable, almost 100% uptime*. No data durability issues.
  • libnss KVM crashes and the 0.94.6 -> 0.94.7 upgrade broke our record :(
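For reference, pointing Cinder and Glance at RBD is done with configuration along these lines; a minimal sketch, where the backend section, pool names, user names and secret UUID are illustrative rather than our actual settings:

    # cinder.conf (one RBD-backed volume type; referenced from enabled_backends in [DEFAULT])
    [rbd-standard]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret uuid>

    # glance-api.conf (images stored as RBD objects)
    [glance_store]
    stores = rbd
    default_store = rbd
    rbd_store_pool = images
    rbd_store_user = glance
    rbd_store_ceph_conf = /etc/ceph/ceph.conf

The different volume QoS types mentioned above can be backed by separate sections like this one, each pointing at its own pool.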

Example Use-cases

Just a couple…

NFS on RBD

• ~50 TB across 28 servers:
  • OpenStack VM + RBD
  • CentOS 7 with ZFS for DR
• Example: ~25 Puppet masters reading node configurations at up to 40 kHz
• Not highly available, but…
  • cheap, thinly provisioned, resizable, trivial to add new filers
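A minimal sketch of what one such filer looks like on the VM side, assuming the RBD volume arrives via Cinder and shows up as a plain block device (the device, pool and dataset names are illustrative):

    # on the filer VM: the attached RBD/Cinder volume appears as e.g. /dev/vdb
    zpool create filerpool /dev/vdb                  # ZFS pool on top of the RBD-backed device
    zfs create filerpool/projects                    # one dataset per export
    zfs set sharenfs=on filerpool/projects           # export over NFS (ZFS on Linux sharenfs)
    zfs snapshot filerpool/projects@2016-06-14       # ZFS snapshots back the DR story

Growing a filer is then a matter of resizing or adding RBD volumes underneath the pool, which is what makes these filers cheap and trivially resizable.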

/cvmfs on RBD

• Read-only POSIX filesystem to deliver software (also the root fs!) over the WAN

• How it works: stratums of preloaded HTTP servers + a CDN of Squid proxies + the CVMFS FUSE client
  • Content-addressable storage, inherent deduplication
  • Data compression at publish time

• Same architecture as our NFS filers: ZFS on RBD
• Stratum 0 on S3 is a future scaling option

https://cernvm.cern.ch/
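On the Stratum 0 side, publishing new content into such a repository is a transaction that content-hashes, deduplicates and compresses the files; a sketch with a hypothetical repository name:

    cvmfs_server transaction example.cern.ch          # open a writable transaction on the repo
    cp -r /tmp/new-release /cvmfs/example.cern.ch/    # stage the new software release
    cvmfs_server publish example.cern.ch              # hash, deduplicate and compress into the store

The resulting object store lives on the ZFS-on-RBD volume; moving it to S3/RadosGW is the scaling option mentioned above.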

RadosGW for Volunteer Computing

• LHC@Home uses BOINC for volunteer computing
  • Donate your home CPU cycles to LHC data processing
  • >10,000 volunteer cores running in parallel
• Data stage-in/out with our Ceph RadosGW via Dynafed

• Dynafed: the Dynamic Federation project
  • presents a set of distributed repositories as if it were one
  • redirects GET/PUT requests to the nearest copy
• Auth with pre-signed URLs: keep secrets off the desktops (see the sketch below)
• Now in EPEL

http://lhcathome.web.cern.ch/ http://lcgdm.web.cern.ch/dynafed-dynamic-federation-project
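The pre-signed URL idea can be illustrated with s3cmd against a RadosGW endpoint: the S3 keys stay on the server that signs a time-limited URL, and only that URL is handed to the volunteer's machine (bucket and object names are hypothetical):

    # with the RadosGW endpoint and keys configured in ~/.s3cfg on the server side:
    s3cmd signurl s3://boinc-jobs/input-1234.tar.gz +3600   # URL valid for one hour; the client never sees the keys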

Operating Ceph at Scale

Provisioning Large Clusters

• We still use puppet-ceph (originally from eNovance, but heavily modified)
  • Installs software/configuration/tuning and copies in keys, but doesn't touch the disks

• New: ceph-disk-prepare-all
  • Inspects the system to discover empty non-system drives/SSDs
  • Guesses an ideal mapping of journals to HDDs (see the sketch below)
  • https://github.com/cernceph/ceph-scripts/blob/master/ceph-disk/ceph-disk-prepare-all

• Deploying a large cluster takes one afternoon.
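For a single OSD, the Hammer/Jewel-era ceph-disk call that ceph-disk-prepare-all presumably wraps looks roughly like this (device names are hypothetical; picking them sensibly is what the script aims to automate):

    ceph-disk prepare /dev/sdb /dev/sdk     # data on the HDD, journal partition carved from the shared SSD
    ceph-disk prepare /dev/sdc /dev/sdk     # next HDD, next journal partition on the same SSD
    ceph-disk activate /dev/sdb1            # usually triggered automatically by udev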

Ceph Hardware Replacement

• Need to replace 960 x 3 TB OSDs with 1152 new 6 TB drives

• How not to do it: add new OSDs and remove old OSDs all at once
  • Would lead to massive re-peering, re-balancing and unacceptable IO latency

• How to do it: gradually add new OSDs and remove old ones
• How quickly? OSD by OSD, server by server, rack by rack? Tweaking weights as we go?
• Considerations:
  • We want to reuse the low OSD IDs (implies an add/remove/add/remove/… loop)
  • We don't want to have to babysit it (need to automate the process)
  • We want to move the RGW pools to another cluster!
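A drain-and-remove cycle of the kind described above can be scripted around standard Ceph commands; a sketch (osd.123 is a hypothetical ID, and each step waits for recovery to finish and HEALTH_OK to return):

    ceph osd crush reweight osd.123 0       # drain: backfill all data off the old OSD
    # ... wait for recovery to complete ...
    ceph osd out 123
    systemctl stop ceph-osd@123             # or 'service ceph stop osd.123' on pre-systemd hosts
    ceph osd crush remove osd.123
    ceph auth del osd.123
    ceph osd rm 123                         # frees the low OSD ID so a new 6 TB drive can reuse it

Preparing the replacement drive right afterwards picks up the freed number, since new OSD IDs are allocated lowest-free-first; that is how the add/remove loop reuses the low IDs.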

RadosGW: One endpoint, many clusters

• *.cs3.cern.ch is a DNS load balanced alias

• HAProxy (>= 1.6) listens on the public side
  • A mapping file maps each bucket name to its cluster (see the sketch below)

• RadosGW listens on loopback

https://gist.github.com/cernceph/4a03316a31ce7abe49167c392fc827da
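Not the configuration from the gist above, just a simplified sketch of the idea, assuming HAProxy 1.6 map files and dynamic use_backend (all names and addresses are made up):

    frontend rgw_public
        bind *:443 ssl crt /etc/haproxy/cs3.cern.ch.pem
        # first path component = bucket name; look it up in the map, else fall back to the default cluster
        use_backend %[path,field(2,/),map(/etc/haproxy/bucket-to-cluster.map,rgw_beesly)]

    backend rgw_beesly
        server rgw_local 127.0.0.1:8080 check     # radosgw on this host, listening on loopback

    backend rgw_erin
        server rgw_remote 192.0.2.10:8080 check   # gateway in front of the other cluster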

Bigbang: 30 PB Ceph Testing

• To really make an impact, we need Ceph to scale to many tens of PBs

• At OpenStack Vancouver we presented a 30 PB Ceph test:
  • It worked, but had various issues related to OSD-to-mon messaging volume, pool creation/deletion, osdmap churn and memory usage

• We later worked with the Ceph devs on further scale testing:
  • Ceph Jewel incorporates these improvements

Bigbang Part II

• Bigbang II is a second 30 PB test we've been running during May 2016
  • Previous issues all solved
  • Benchmarking: ~30 GB/s seems doable
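Aggregate numbers like this are typically measured with rados bench fanned out over many client nodes; a sketch (pool name is hypothetical):

    rados bench -p bigbang-test 60 write -t 16 --no-cleanup   # 60 s of 4 MB object writes, 16 in flight
    rados bench -p bigbang-test 60 seq -t 16                  # then sequential reads of the same objects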

• New Jewel features:
  • ms type = async
    • Fewer threads, no tcmalloc thrashing, lower RAM usage
    • Rare peering glitches, hopefully fixed in the next Jewel release
  • op queue = wpq
    • Better recovery transparency
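In ceph.conf terms these are the following options (a sketch; the section placement is a convention, the full name of the second option is osd op queue, and defaults have evolved in later releases):

    [global]
    ms type = async          # AsyncMessenger: fewer threads, no tcmalloc thrashing, lower RAM usage

    [osd]
    osd op queue = wpq       # weighted priority queue: smoother recovery alongside client IO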

(Minor) Pain points

Scrubbing

• Scrubbing has been a problem historically
  • Too many concurrent scrubs increase latency

• Hammer / Jewel randomize the scrub schedule (see plot)

• Jewel scrub IOs go via the OSD op queue
  • Better fair sharing of disk time :)
  • But still needs tuning :(
  • Need to throttle on high-bandwidth clusters

Minimal scrubbing:

    osd scrub chunk max = 1
    osd scrub chunk min = 1
    osd scrub priority = 1
    osd scrub sleep = 0.1
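These throttles can also be tried out on a running cluster without restarting the OSDs, for example via injectargs (same values as above; verify the option names against your release):

    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1'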

Balancing OSD data

• We often want to fill a cluster
  • Imagine not being able to use 10% of a 10 PB cluster!

[Plot: # PGs per OSD (minimal reweighting)]

• Hammer 0.94.7 and Jewel have a new gradual (test-)reweight-by-utilization feature
  • This is a good workaround, but it decreases the flexibility of the OSD tree
  • Proactive reweighting of an empty cluster is much more effective than fixing things later
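The gradual variant lets you preview and bound each step; roughly, with the Jewel-era syntax (arguments: overload threshold in %, maximum weight change, maximum number of OSDs to touch):

    ceph osd test-reweight-by-utilization 110 0.05 50   # dry run: report what would be changed
    ceph osd reweight-by-utilization 110 0.05 50        # apply: at most 50 OSDs, each by at most 0.05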

Ceph in High Energy Physics

The Broader HEP Community

• Ceph is gaining popularity across the Worldwide LHC Grid:
  • Many OpenStack/RBD deployments + growing usage for physics

• U Chicago ATLAS Tier-2 (http://cern.ch/go/6T9q): running CephFS + RBD
• OSiRIS Project (http://cern.ch/go/F6zS): three universities in Michigan building a distributed Ceph infrastructure:
  • transparent & performant access to the same storage from any campus
• Orsay/Saclay in France have a similar distributed Ceph project
• STFC/RAL in the UK: 12 PiB cluster for their WLCG Tier-1, details on the next slides

• Meeting monthly to discuss Ceph in HEP
  • [email protected] mailing list for discussions: http://cern.ch/go/H6Zh
  • Thanks to Alastair Dewhurst (STFC/RAL) for the initiative

Echo Cluster at RAL: Ceph for WLCG Data

• 12.1 PiB raw disk: 63 storage nodes, 36 x 5.5 TiB HDDs each

• Platform: SL7, Ceph Jewel, 16+3 erasure coding (see the sketch below)
• Config: Quattor, ncm-ceph, ceph-deploy
  • 3 hours to re-deploy from scratch

• LHC users: GridFTP and XrootD gateways built on libradosstriper
  • Development: Sebastien Ponce (CERN) and Ian Johnson (STFC)

• LHC + other users: RadosGW S3 endpoints

Slide by Bruno Canning (RAL)
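For the 16+3 erasure coding mentioned above, the corresponding pool setup is along these lines (profile name, failure domain and PG count are illustrative; option names as in Jewel):

    ceph osd erasure-code-profile set wlcg-16-3 k=16 m=3 ruleset-failure-domain=host
    ceph osd pool create atlas-data 4096 4096 erasure wlcg-16-3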

Which brings us to Sebastien…

… Ceph and Particle Physics Data
