Petabytes of Ceph: From Tests to Production

Dan van der Ster, CERN IT Storage Group [email protected]

Ceph Day Switzerland, 14 June 2016

Outline

• Storage for CERN and Particle Physics

• Ceph@CERN, and how we use it

• Operating Ceph at Scale

• The Ceph Particle Physics Community

Storage for CERN and Particle Physics

• Huge data requirements (150 PB now, +50 PB/year in future)
• Worldwide LHC Grid standards for accessing and moving data
  • GridFTP and XrootD to access data, FTS to move data, SRM to manage data

[Diagram: CERN storage services, including AFS, NFS, RBD, CASTOR, EOS and Ceph]

• Storage operated by the FDO team at CERN IT
• Elsewhere: Hadoop, NFS appliances, …

• Shrinking OpenAFS, growing EOS (plus added CERNbox for sync)
• Ceph getting its first taste of physics data/apps in CASTOR and CVMFS

Ceph & Cinder @ CERN

Our Ceph Clusters

• Beesly + Wigner (3.6 PB + 433 TB, v0.94.7):
  • Cinder (various QoS types) + Glance + RadosGW
  • Isolated pools/disks for volumes, volumes++, RadosGW
  • Hardware reaching EOL this summer
• Dwight (0.5 PB, v0.94.7):
  • Preprod cluster for development (client side), testing upgrades / crazy ideas
• Erin (2.9 PB, v10.2.1++):
  • New cluster for CASTOR: disk buffer/cache in front of the tape drives

• Bigbang (~30 PB, master):
  • Playground for short-term scale tests whenever CERN receives new hardware

Growth of the Beesly cluster

[Plot: growth of the Beesly cluster, from ~200 TB total to ~450 TB of RBD + 50 TB of RGW]

OpenStack Glance + Cinder

• OpenStack is still Ceph's killer app. We've doubled usage in the past year.
• Very stable, almost 100% uptime*. No data durability issues.
  • libnss KVM crashes and the 0.94.6 -> 0.94.7 upgrade broke our record :(
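For reference, pointing Cinder and Glance at RBD is done with configuration along these lines; a minimal sketch, where the backend section, pool names, user names and secret UUID are illustrative rather than our actual settings:

    # cinder.conf (one RBD-backed volume type; referenced from enabled_backends in [DEFAULT])
    [rbd-standard]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret uuid>

    # glance-api.conf (images stored as RBD objects)
    [glance_store]
    stores = rbd
    default_store = rbd
    rbd_store_pool = images
    rbd_store_user = glance
    rbd_store_ceph_conf = /etc/ceph/ceph.conf

The different volume QoS types mentioned above can be backed by separate sections like this one, each pointing at its own pool.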

Example Use-cases

Just a couple…

NFS on RBD

• ~50 TB across 28 servers:
  • OpenStack VM + RBD
  • CentOS 7 with ZFS for DR
• Example: ~25 Puppet masters reading node configurations at up to 40 kHz
• Not highly available, but…
  • cheap, thinly provisioned, resizable, trivial to add new filers
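A minimal sketch of what one such filer looks like on the VM side, assuming the RBD volume arrives via Cinder and shows up as a plain block device (the device, pool and dataset names are illustrative):

    # on the filer VM: the attached RBD/Cinder volume appears as e.g. /dev/vdb
    zpool create filerpool /dev/vdb                  # ZFS pool on top of the RBD-backed device
    zfs create filerpool/projects                    # one dataset per export
    zfs set sharenfs=on filerpool/projects           # export over NFS (ZFS on Linux sharenfs)
    zfs snapshot filerpool/projects@2016-06-14       # ZFS snapshots back the DR story

Growing a filer is then a matter of resizing or adding RBD volumes underneath the pool, which is what makes these filers cheap and trivially resizable.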

/cvmfs on RBD

• Read-only POSIX filesystem to deliver software (also the root fs!) over the WAN

• How it works: stratums of preloaded HTTP servers + a CDN of Squid proxies + the CVMFS FUSE client
  • Content-addressable storage, inherent deduplication
  • Data compression at publish time

• Same architecture as our NFS filers: ZFS on RBD
• Stratum 0 on S3 is a future scaling option

https://cernvm.cern.ch/
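On the Stratum 0 side, publishing new content into such a repository is a transaction that content-hashes, deduplicates and compresses the files; a sketch with a hypothetical repository name:

    cvmfs_server transaction example.cern.ch          # open a writable transaction on the repo
    cp -r /tmp/new-release /cvmfs/example.cern.ch/    # stage the new software release
    cvmfs_server publish example.cern.ch              # hash, deduplicate and compress into the store

The resulting object store lives on the ZFS-on-RBD volume; moving it to S3/RadosGW is the scaling option mentioned above.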

RadosGW for Volunteer Computing

• LHC@Home uses BOINC for volunteer computing
  • Donate your home CPU cycles to LHC data processing
  • >10,000 volunteer cores running in parallel
• Data stage-in/out with our Ceph RadosGW via Dynafed

• Dynafed: the Dynamic Federation project
  • presents a set of distributed repositories as if it were one
  • redirects GET/PUT requests to the nearest copy
• Auth with pre-signed URLs: keep secrets off the desktops (see the sketch below)
• Now in EPEL

http://lhcathome.web.cern.ch/ http://lcgdm.web.cern.ch/dynafed-dynamic-federation-project
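The pre-signed URL idea can be illustrated with s3cmd against a RadosGW endpoint: the S3 keys stay on the server that signs a time-limited URL, and only that URL is handed to the volunteer's machine (bucket and object names are hypothetical):

    # with the RadosGW endpoint and keys configured in ~/.s3cfg on the server side:
    s3cmd signurl s3://boinc-jobs/input-1234.tar.gz +3600   # URL valid for one hour; the client never sees the keys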

Operating Ceph at Scale

Provisioning Large Clusters

• We still use puppet-ceph (originally from eNovance, but heavily modified)
  • Installs software/configuration/tuning and copies in keys, but doesn't touch the disks

• New: ceph-disk-prepare-all
  • Inspects the system to discover empty non-system drives/SSDs
  • Guesses an ideal mapping of journals to HDDs (see the sketch below)
  • https://github.com/cernceph/ceph-scripts/blob/master/ceph-disk/ceph-disk-prepare-all

• Deploying a large cluster takes one afternoon.
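For a single OSD, the Hammer/Jewel-era ceph-disk call that ceph-disk-prepare-all presumably wraps looks roughly like this (device names are hypothetical; picking them sensibly is what the script aims to automate):

    ceph-disk prepare /dev/sdb /dev/sdk     # data on the HDD, journal partition carved from the shared SSD
    ceph-disk prepare /dev/sdc /dev/sdk     # next HDD, next journal partition on the same SSD
    ceph-disk activate /dev/sdb1            # usually triggered automatically by udev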

Ceph Hardware Replacement

• Need to replace 960 x 3 TB OSDs with 1152 new 6 TB drives

• How not to do it: add new OSDs and remove old OSDs all at once
  • Would lead to massive re-peering, re-balancing and unacceptable IO latency

• How to do it: gradually add new OSDs and remove old ones
• How quickly? OSD by OSD, server by server, rack by rack? Tweaking weights as we go?
• Considerations:
  • We want to reuse the low OSD IDs (implies an add/remove/add/remove/… loop)
  • We don't want to have to babysit it (need to automate the process)
  • We want to move the RGW pools to another cluster!
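A drain-and-remove cycle of the kind described above can be scripted around standard Ceph commands; a sketch (osd.123 is a hypothetical ID, and each step waits for recovery to finish and HEALTH_OK to return):

    ceph osd crush reweight osd.123 0       # drain: backfill all data off the old OSD
    # ... wait for recovery to complete ...
    ceph osd out 123
    systemctl stop ceph-osd@123             # or 'service ceph stop osd.123' on pre-systemd hosts
    ceph osd crush remove osd.123
    ceph auth del osd.123
    ceph osd rm 123                         # frees the low OSD ID so a new 6 TB drive can reuse it

Preparing the replacement drive right afterwards picks up the freed number, since new OSD IDs are allocated lowest-free-first; that is how the add/remove loop reuses the low IDs.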

RadosGW: One endpoint, many clusters

• *.cs3.cern.ch is a DNS load balanced alias

• HAProxy (>= 1.6) listens on the public side
  • A mapping file maps each bucket name to its cluster (see the sketch below)

• RadosGW listens on loopback

https://gist.github.com/cernceph/4a03316a31ce7abe49167c392fc827da
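Not the configuration from the gist above, just a simplified sketch of the idea, assuming HAProxy 1.6 map files and dynamic use_backend (all names and addresses are made up):

    frontend rgw_public
        bind *:443 ssl crt /etc/haproxy/cs3.cern.ch.pem
        # first path component = bucket name; look it up in the map, else fall back to the default cluster
        use_backend %[path,field(2,/),map(/etc/haproxy/bucket-to-cluster.map,rgw_beesly)]

    backend rgw_beesly
        server rgw_local 127.0.0.1:8080 check     # radosgw on this host, listening on loopback

    backend rgw_erin
        server rgw_remote 192.0.2.10:8080 check   # gateway in front of the other cluster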

Bigbang: 30 PB Ceph Testing

• To really make an impact, we need Ceph to scale to many tens of PBs

• At OpenStack Vancouver we presented a 30 PB Ceph test:
  • It worked, but had various issues related to OSD-to-mon messaging volume, pool creation/deletion, osdmap churn and memory usage

• We later worked with the Ceph devs on further scale testing:
  • Ceph Jewel incorporates these improvements

Bigbang Part II

• Bigbang II is a second 30 PB test we've been running during May 2016
  • Previous issues all solved
  • Benchmarking: ~30 GB/s seems doable
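Aggregate numbers like this are typically measured with rados bench fanned out over many client nodes; a sketch (pool name is hypothetical):

    rados bench -p bigbang-test 60 write -t 16 --no-cleanup   # 60 s of 4 MB object writes, 16 in flight
    rados bench -p bigbang-test 60 seq -t 16                  # then sequential reads of the same objects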

• New Jewel features:
  • ms type = async
    • Fewer threads, no tcmalloc thrashing, lower RAM usage
    • Rare peering glitches, hopefully fixed in the next Jewel release
  • op queue = wpq
    • Better recovery transparency
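In ceph.conf terms these are the following options (a sketch; the section placement is a convention, the full name of the second option is osd op queue, and defaults have evolved in later releases):

    [global]
    ms type = async          # AsyncMessenger: fewer threads, no tcmalloc thrashing, lower RAM usage

    [osd]
    osd op queue = wpq       # weighted priority queue: smoother recovery alongside client IO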

(Minor) Pain points

Scrubbing

• Scrubbing has been a problem historically
  • Too many concurrent scrubs increase latency

• Hammer / Jewel randomize the scrub schedule (see plot)

• Jewel scrub IOs go via the OSD op queue
  • Better fair sharing of disk time :)
  • But still needs tuning :(
  • Need to throttle on high-bandwidth clusters

Minimal scrubbing:

    osd scrub chunk max = 1
    osd scrub chunk min = 1
    osd scrub priority = 1
    osd scrub sleep = 0.1
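These throttles can also be tried out on a running cluster without restarting the OSDs, for example via injectargs (same values as above; verify the option names against your release):

    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1'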

Balancing OSD data

• We often want to fill a cluster
  • Imagine not being able to use 10% of a 10 PB cluster!

[Plot: # PGs per OSD (minimal reweighting)]

• Hammer 0.94.7 and Jewel have a new gradual (test-)reweight-by-utilization feature
  • This is a good workaround, but it decreases the flexibility of the OSD tree
  • Proactive reweighting of an empty cluster is much more effective than fixing things later
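The gradual variant lets you preview and bound each step; roughly, with the Jewel-era syntax (arguments: overload threshold in %, maximum weight change, maximum number of OSDs to touch):

    ceph osd test-reweight-by-utilization 110 0.05 50   # dry run: report what would be changed
    ceph osd reweight-by-utilization 110 0.05 50        # apply: at most 50 OSDs, each by at most 0.05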

Ceph in High Energy Physics

The Broader HEP Community

• Ceph is gaining popularity across the Worldwide LHC Grid:
  • Many OpenStack/RBD deployments + growing usage for physics

• U Chicago ATLAS Tier-2 (http://cern.ch/go/6T9q): running CephFS + RBD
• OSiRIS Project (http://cern.ch/go/F6zS): three universities in Michigan building a distributed Ceph infrastructure:
  • transparent & performant access to the same storage from any campus
• Orsay/Saclay in France have a similar distributed Ceph project
• STFC/RAL in the UK: 12 PiB cluster for their WLCG Tier-1, details on the next slides

• Meeting monthly to discuss Ceph in HEP
  • [email protected] mailing list for discussions: http://cern.ch/go/H6Zh
  • Thanks to Alastair Dewhurst (STFC/RAL) for the initiative

Echo Cluster at RAL: Ceph for WLCG Data

• 12.1 PiB raw disk: 63 storage nodes, 36 x 5.5 TiB HDDs each

• Platform: SL7, Ceph Jewel, 16+3 erasure coding (see the sketch below)
• Config: Quattor, ncm-ceph, ceph-deploy
  • 3 hours to re-deploy from scratch

• LHC users: GridFTP and XrootD gateways built on libradosstriper
  • Development: Sebastien Ponce (CERN) and Ian Johnson (STFC)

• LHC + other users: RadosGW S3 endpoints

Slide by Bruno Canning (RAL)
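For the 16+3 erasure coding mentioned above, the corresponding pool setup is along these lines (profile name, failure domain and PG count are illustrative; option names as in Jewel):

    ceph osd erasure-code-profile set wlcg-16-3 k=16 m=3 ruleset-failure-domain=host
    ceph osd pool create atlas-data 4096 4096 erasure wlcg-16-3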

Which brings us to Sebastien…

… Ceph and Particle Physics Data
