Petabytes of Ceph: From Tests to Production

Dan van der Ster, CERN IT Storage Group
[email protected]
Ceph Day Switzerland, 14 June 2016

Outline
• Storage for CERN and Particle Physics
• Ceph@CERN, and how we use it
• Operating Ceph at Scale
• The Ceph Particle Physics Community

Storage for CERN and Particle Physics
• Huge data requirements: 150 PB now, growing by 50+ PB per year in the future
• Worldwide LHC Computing Grid (WLCG) standards for accessing and moving data: GridFTP and Xrootd to access data, FTS to move data, SRM to manage data
• Storage operated by the FDO team at CERN IT: AFS, NFS, CASTOR, EOS, CERNbox, CVMFS, Ceph/RBD (elsewhere: Hadoop, NFS appliances, …)
[Diagram: the CERN storage services (AFS, NFS, CERNbox, CVMFS, CASTOR, EOS) with Ceph and RBD underneath]
• Shrinking OpenAFS, growing EOS (plus CERNbox added for sync)
• Ceph is getting its first taste of physics data and applications in CASTOR and CVMFS

Ceph & Cinder @ CERN

Our Ceph Clusters
• Beesly + Wigner (3.6 PB + 433 TB, v0.94.7):
  • Cinder (various QoS types) + Glance + RadosGW
  • Isolated pools/disks for volumes, volumes++, RadosGW
  • Hardware reaching end of life this summer
• Dwight (0.5 PB, v0.94.7):
  • Pre-production cluster for client-side development and for testing upgrades and crazy ideas
• Erin (2.9 PB, v10.2.1++):
  • New cluster for CASTOR: a disk buffer/cache in front of the tape drives
• Bigbang (~30 PB, master):
  • Playground for short-term scale tests whenever CERN receives new hardware

Growth of the beesly cluster
• From ~200 TB total to ~450 TB of RBD + 50 TB of RGW

OpenStack Glance + Cinder
• OpenStack is still Ceph's killer app; we have doubled usage in the past year.
• Very stable, almost 100% uptime, and no data durability issues.
• The only blemishes: libnss KVM crashes and the 0.94.6 -> 0.94.7 upgrade broke our uptime record.

Example Use-cases (just a couple…)

NFS on RBD
• ~50 TB across 28 servers: each an OpenStack VM + RBD, running CentOS 7 with ZFS for DR
• Example: ~25 puppet masters reading node configurations at up to 40 kHz
• Not highly available, but cheap, thinly provisioned, resizable, and trivial to add new filers (see the sketch just below)
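
To make the thin-provisioning and resizing point concrete, here is a minimal sketch of creating and later growing a filer's backing image using the python-rados and python-rbd bindings; the pool and image names are hypothetical, not the production ones:

    import rados
    import rbd

    # Connect using the local ceph.conf and the default keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('volumes')  # hypothetical pool name
        try:
            # Create a thin-provisioned 2 TiB image to back one NFS filer VM.
            rbd.RBD().create(ioctx, 'nfs-filer-01', 2 * 1024**4)

            # Growing the filer later is a single call here, followed by a
            # zpool/filesystem expansion inside the VM.
            image = rbd.Image(ioctx, 'nfs-filer-01')
            try:
                image.resize(4 * 1024**4)
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
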
/cvmfs on RBD
• A read-only POSIX filesystem to deliver software (also root filesystems!) over the WAN
• How it works: Stratum servers of preloaded HTTP content + a CDN of Squids + the CVMFS FUSE client
• Content-addressable storage with inherent deduplication, and data compression at publish time
• Same architecture as our NFS filers: ZFS on RBD
• A Stratum 0 on S3 is a future scaling option
• https://cernvm.cern.ch/

RadosGW for Volunteer Computing
• LHC@Home uses BOINC for volunteer computing: donate your home CPU cycles to LHC data processing, with >10,000 volunteer cores running in parallel
• Data stage-in/out goes to our Ceph radosgw via Dynafed
• Dynafed, the Dynamic Federation project (now in EPEL):
  • presents a distributed repository as if it were one
  • redirects GET/PUT requests to the nearest copy
  • authenticates with pre-signed URLs, keeping secrets off the desktops (see the sketch below)
• http://lhcathome.web.cern.ch/
• http://lcgdm.web.cern.ch/dynafed-dynamic-federation-project
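
As an illustration of the pre-signed URL mechanism, here is a minimal sketch using boto3 against a radosgw S3 endpoint; the endpoint, credentials, bucket and object key are hypothetical placeholders, and the signature version may differ depending on the radosgw release:

    import boto3
    from botocore.client import Config

    # These values are placeholders: the real credentials stay on the server
    # side, and only the resulting URL is handed to the volunteer's machine.
    s3 = boto3.client(
        's3',
        endpoint_url='https://radosgw.example.org',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        config=Config(signature_version='s3'),  # v2 signatures
    )

    # Time-limited GET URL for one input file, valid for one hour.
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'boinc-data', 'Key': 'input/job-1234.tar.gz'},
        ExpiresIn=3600,
    )
    print(url)

The volunteer's host then performs a plain HTTP GET against that URL, so the S3 secret never leaves the server side.
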
Operating Ceph at Scale

Provisioning Large Clusters
• We still use puppet-ceph (originally from eNovance, but heavily modified): it installs software, configuration and tuning and copies in the keys, but does not touch the disks
• New: ceph-disk-prepare-all
  • inspects the system to discover empty non-system drives/SSDs
  • guesses an ideal mapping of journals to HDDs
  • https://github.com/cernceph/ceph-scripts/blob/master/ceph-disk/ceph-disk-prepare-all
• Deploying a large cluster takes one afternoon.

Ceph Hardware Replacement
• We need to replace 960 3 TB OSDs with 1152 new 6 TB drives
• How not to do it: add all the new OSDs and remove all the old ones at once, which would lead to massive re-peering, re-balancing and unacceptable IO latency
• How to do it: gradually add new and remove old OSDs
  • How quickly? OSD by OSD, server by server, rack by rack? Tweaking weights as we go?
• Considerations:
  • We want to reuse the low OSD ids (which implies an add/remove/add/remove/… loop)
  • We do not want to babysit the process (it needs to be automated)
  • We want to move the rgw pools to another cluster!

RadosGW: One Endpoint, Many Clusters
• *.cs3.cern.ch is a DNS load-balanced alias
• HAProxy (>= 1.6) listens on the public side and uses a mapping file from bucket name to cluster
• RadosGW listens on the loopback interface
• https://gist.github.com/cernceph/4a03316a31ce7abe49167c392fc827da

Bigbang: 30 PB Ceph Testing
• To really make an impact, we need Ceph to scale to many tens of PB
• At OpenStack Vancouver we presented a 30 PB Ceph test: it worked, but had various issues related to osd->mon messaging volume, pool creation/deletion, osdmap churn and memory usage
• We later worked with the Ceph developers on further scale testing, and Ceph jewel incorporates these improvements

Bigbang Part II
• Bigbang II is a second 30 PB test we have been running during May 2016
• The previous issues are all solved, and benchmarking suggests ~30 GB/s is doable
• New jewel features:
  • ms type = async: fewer threads, no tcmalloc thrashing, lower RAM usage; rare peering glitches, hopefully fixed in the next jewel release
  • op queue = wpq: better recovery transparency

(Minor) Pain Points

Scrubbing
• Scrubbing has historically been a problem: too many concurrent scrubs increase latency
• Hammer and jewel randomize the scrub schedule (see plot)
• In jewel, scrub IOs go via the OSD op queue: better fair sharing of disk time, but it still needs tuning, and high-bandwidth clusters need throttling
• Minimal scrubbing settings:
  osd scrub chunk max = 1
  osd scrub chunk min = 1
  osd scrub priority = 1
  osd scrub sleep = 0.1

Balancing OSD Data
• We often want to fill a cluster: imagine not being able to use 10% of a 10 PB cluster!
• Hammer 0.94.7 and jewel have a new gradual (test-)reweight-by-utilization feature
  • a good workaround, but it decreases the flexibility of the OSD tree
• Proactive reweighting of an empty cluster is much more effective than fixing things later.
[Plot: # PGs per OSD (minimal reweighting)]

Ceph in High Energy Physics

The Broader HEP Community
• Ceph is gaining popularity across the Worldwide LHC Computing Grid:
  • many OpenStack/RBD deployments, plus growing usage for physics
  • U Chicago ATLAS Tier-2 (http://cern.ch/go/6T9q): running CephFS + RBD
  • OSiRIS project (http://cern.ch/go/F6zS): three universities in Michigan building a distributed Ceph infrastructure with transparent and performant access to the same storage from any campus
  • Orsay/Saclay in France have a similar distributed Ceph project
  • STFC/RAL in the UK: a 12 PiB cluster for a WLCG Tier-1, details below
• We meet monthly to discuss Ceph in HEP
  • [email protected] mailing list for discussions: http://cern.ch/go/H6Zh
  • Thanks to Alastair Dewhurst (STFC/RAL) for the initiative

Echo Cluster at RAL: Ceph for WLCG Data (slide by Bruno Canning, RAL)
• 12.1 PiB of raw disk: 63 storage nodes, each with 36 x 5.5 TiB HDDs
• Platform: SL7, Ceph jewel, 16+3 erasure coding (a minimal sketch of creating such a pool follows at the end of these notes)
• Configuration: Quattor, ncm-ceph, ceph-deploy; 3 hours to re-deploy from scratch
• LHC users: GridFTP and XrootD gateways built on libradosstriper, developed by Sebastien Ponce (CERN) and Ian Johnson (STFC)
• LHC and other users: RadosGW S3 endpoints

Which brings us to Sebastien… Ceph and Particle Physics Data
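
As a closing technical footnote, referenced from the RAL Echo slide above, here is a minimal sketch of creating a 16+3 erasure-code profile and a pool that uses it; it assumes the python-rados bindings, and the profile name, pool name and PG count are hypothetical:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # Define a k=16, m=3 erasure-code profile (the name is hypothetical).
        ret, out, errs = cluster.mon_command(json.dumps({
            'prefix': 'osd erasure-code-profile set',
            'name': 'wlcg-16-3',
            'profile': ['k=16', 'm=3'],
        }), b'')
        assert ret == 0, errs

        # Create an erasure-coded pool using that profile; the PG count here
        # is illustrative and must be sized for the real cluster.
        ret, out, errs = cluster.mon_command(json.dumps({
            'prefix': 'osd pool create',
            'pool': 'wlcg-data',
            'pg_num': 2048,
            'pgp_num': 2048,
            'pool_type': 'erasure',
            'erasure_code_profile': 'wlcg-16-3',
        }), b'')
        assert ret == 0, errs
    finally:
        cluster.shutdown()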
