Site Reliability Engineering best practices implemented in EOS

“Hope is not a strategy”

Hugo Gonzalez Labrador (CERN IT-ST)

Scope

• Logging
• Data deletion
• Lame duck state
• Retry mechanism
• Quality of Traffic

Logging

SRE best practice:
• Allow verbosity levels to be increased on the fly, with no need to restart the process, so that the real traffic that is failing can be inspected

• How EOS implements it
  • Runtime log levels per component and subcomponent
    $ eos debug err /eos/*/fst
    $ eos debug crit /eos/*/mgm
    $ eos debug debug --filter MgmOfsMessages
  • Log IDs: find the lifecycle of an operation
    func=open level=INFO logid=13727e80-7a6e-11eb-886e-a4bf0112e0f8 …
  • Error suppression à la syslog:
    ---- high rate error messages suppressed ----

• Next steps?
  • Add a selection language for deeper dives:
    $ eos debug debug user=andreas and protocol=fusex and fromip=1.2.3.4
  • Sample rate
    • “Oh, but isn’t it very costly to run debug in production?” Fine, sample the rate: don’t emit debug lines for every single request, only for 1 out of 100 (sketched below).
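A minimal sketch of such rate sampling (hypothetical Python, not EOS code; the 1-in-100 rate, logger name and function names are assumptions):

import logging
import random

logger = logging.getLogger("eos.debug")

DEBUG_SAMPLE_RATE = 0.01  # emit a debug line for roughly 1 out of 100 requests

def debug_sampled(msg, *args):
    """Log at DEBUG level, but only for a sampled fraction of calls,
    so that full debug verbosity stays affordable in production."""
    if random.random() < DEBUG_SAMPLE_RATE:
        logger.debug(msg, *args)

def handle_open(path, user):
    # Called on every request, but only ~1% of them produce a debug line.
    debug_sampled("func=open user=%s path=%s", user, path)
    # ... actual open() handling ...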

Data deletion

• Google best practice

• How EOS implements it?
  • Soft deletion
    • Versioning
    • Recycle bin
    • Recovery
  • File layout
  • Clone mechanism: just learned about it :)
  • External

Data deletion: Soft Deletion: Versioning

• File versioning triggers on file overwrite
• Versioning is smart: a mix of time buckets + last N changes (see the sketch after the examples)
• Examples:
$ eos ls -la /eos/user/g/gonzalhu
drwx--s--+ 1 gonzalhu it 12638 Jan 12 14:44 .sys.v#.tus-eos.drawio
-rw-r--r-- 2 gonzalhu it  1409 Jan 12 14:44 tus-eos.drawio

$ eos file versions /eos/user/g/gonzalhu/tus-eos.drawio
-rw-r--r-- 2 gonzalhu it    0 Jan 12 14:36 1610458605.3084bd5f
-rw-r--r-- 2 gonzalhu it 1101 Jan 12 14:38 1610458701.3084c3cf
-rw-r--r-- 2 gonzalhu it 1117 Jan 12 14:38 1610458708.3084c3e5
-rw-r--r-- 2 gonzalhu it 1101 Jan 12 14:38 1610458711.3084c3ee

$ eos file versions /eos/user/g/gonzalhu/tus-eos.drawio 1610458711.3084c3ee
success: staged '/eos/user/g/gonzalhu/.sys.v#.tus-eos.drawio/1610458711.3084c3ee' back to '/eos/user/g/gonzalhu/tus-eos.drawio' - the previous file is now '/eos/user/g/gonzalhu/.sys.v#.tus-eos.drawio/1610459070.3084d399'
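As an illustration of the time-bucket + last-N idea mentioned above, a minimal sketch (hypothetical Python, not the actual EOS policy; the last-10 window and daily buckets are assumptions):

def select_versions_to_keep(versions, keep_last_n=10):
    """versions: list of (timestamp: datetime.datetime, version_id), newest first.
    Keep the last N versions unconditionally, plus at most one version
    per daily time bucket beyond that window."""
    keep = {vid for _, vid in versions[:keep_last_n]}    # last-N window
    seen_days = set()
    for ts, vid in versions[keep_last_n:]:               # one per time bucket
        day = ts.date()
        if day not in seen_days:
            seen_days.add(day)
            keep.add(vid)
    return keep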

Data deletion: Soft Deletion: Recycle bin

• The recycle bin can be configured at a very fine granularity: down to the per-folder level
• The recycle retention policy is fully configurable (6 months, 1 year, …); see the sketch after the examples
• Examples:
$ eos recycle ls
Wed Jan 27 13:01:20 2021 file fxid:00000000316f9198 /eos/user/g/gonzalhu/P/PRIVATE/keys/dns-vero/dns.key
Wed Jan 27 13:01:16 2021 file fxid:0000000036f9197 /eos/user/g/gonzalhu/P/PRIVATE/keys/contacts.

$ eos recycle restore fxid:0000000036f9197
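A toy sketch of how such a retention policy could be evaluated during a purge pass (hypothetical Python, not EOS code; the 6-month default just mirrors the example above):

import time

SIX_MONTHS = 6 * 30 * 24 * 3600  # retention period in seconds

def purge_candidates(recycle_entries, retention_seconds=SIX_MONTHS, now=None):
    """recycle_entries: iterable of (deletion_time_epoch, fxid).
    Yields the fxids whose configured retention period has expired."""
    now = now if now is not None else time.time()
    for deleted_at, fxid in recycle_entries:
        if now - deleted_at > retention_seconds:
            yield fxid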

Data deletion: File layout

• The file layout is the first protection against major failures (e.g. a dead disk)
• EOS offers many smart data placement policies (see the sketch after the attribute example):
  • Replica-based: server, rack, geographic region
  • Erasure coding: server, rack, data center, geographic region(?)
• Fine-grained data placement: down to the folder level
$ eos attr ls /eos/user/g/gonzalhu/

sys.forced.nstripes="2"
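A minimal sketch of what spreading stripes across failure domains looks like (hypothetical Python, not the EOS scheduler; the filesystem records and domain names are assumptions):

import random
from collections import defaultdict

def place_stripes(filesystems, nstripes, domain_key):
    """Pick nstripes filesystems such that no two share the same failure domain.

    filesystems: list of dicts, e.g. {"id": 17, "server": "fst-01", "rack": "R12"}
    domain_key:  attribute that defines the failure domain ("server", "rack", ...)
    """
    by_domain = defaultdict(list)
    for fs in filesystems:
        by_domain[fs[domain_key]].append(fs)
    if len(by_domain) < nstripes:
        raise RuntimeError("not enough failure domains for the requested layout")
    chosen_domains = random.sample(list(by_domain), nstripes)
    return [random.choice(by_domain[d]) for d in chosen_domains]

# e.g. a directory with sys.forced.nstripes="2" spread across racks:
#   place_stripes(all_filesystems, nstripes=2, domain_key="rack")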

Data deletion: Disaster Recovery: external

• EOS exposes many protocols (HTTP, XROOTD, POSIX)
• This gives enough flexibility to plug in arbitrary backup tools
• EOS optimizes change discovery thanks to recursive etags/mtimes (see the sketch below)
• Example:
  • Restic: Backing up CERNBox – Roberto – Wed @14:20
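A sketch of how a backup tool can exploit recursive etags/mtimes to skip unchanged subtrees (hypothetical Python; the tree representation is an assumption, not an EOS or Restic API):

def changed_subtrees(tree, last_backup_etags):
    """Yield the paths that need rescanning, pruning subtrees whose
    recursive etag is unchanged since the previous backup run.

    tree: {"path": str, "etag": str, "children": [subtrees]}
    last_backup_etags: {path: etag} recorded during the previous run.
    """
    if last_backup_etags.get(tree["path"]) == tree["etag"]:
        return  # whole subtree unchanged: skip it entirely
    yield tree["path"]
    for child in tree.get("children", []):
        yield from changed_subtrees(child, last_backup_etags)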

What can go wrong? Resource exhaustion

• Not enough CPU
  • Increased number of in-flight requests
  • Longer queue lengths
  • Thread starvation
  • Missed deadlines/timeouts
  • Reduced CPU caching
• Not enough memory
  • Dying tasks
  • Reduction in cache hits
• Not enough threads
• Not enough file descriptors => no network connections

[Diagram: server states Down / Degraded / Recovered]

Lame duck state

• From a client perspective, a server can be in 3 states:
  • A) Healthy
  • B) Refusing connections: the backend is unresponsive or dead
  • C) Lame duck: the server is operational but is asking clients to stop sending requests

• Lame duck mode is very important to move from “Degraded” to “Recovered” rather than from “Degraded” to “Down”

Lame duck state

• How EOS implements it?
  • The EOS MGM asks clients to stall their requests for a while
    • “Wait 60 seconds before trying to contact me again, okay?”
  • The EOS MGM can ask clients to connect to a different server
    • “Hey, I’m super busy now, please connect to this other server, okay?”
  • Useful when read/write requests towards/from the FSTs fail

[Diagram: the EOS MGM tells the client (FUSEX) to sleep for a while before retrying]
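A sketch of what honouring stall and redirect responses could look like on the client side (hypothetical Python pseudocode, not the xrootd/FUSEX client; server.call and the response fields are assumptions):

import time

def send_request(server, request, max_hops=8):
    """Issue a request, honouring 'stall' responses (wait, then retry the
    same server) and 'redirect' responses (retry against another server)."""
    for _ in range(max_hops):
        response = server.call(request)       # stand-in for the protocol layer
        if response.kind == "stall":          # "wait 60 seconds, then ask me again"
            time.sleep(response.seconds)
        elif response.kind == "redirect":     # "I'm busy, talk to this other server"
            server = response.target
        else:
            return response                   # normal answer (or a hard error)
    raise RuntimeError("stall/redirect limit exceeded")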

Retry logic

• Google best practice: try to recover from internal failures before bubbling up to the end-user
• How EOS implements it?
  • Configurable retry logic on the client side (timeouts, deadlines, redirections): man xrdcopy
  • Client-side fallback to another MGM based on round-robin hostname resolution – EOS for Physics – Cristi – Monday@15:20
  • Configurable redirection limit: tune it to your file layout/deployment, then bubble the error up to the client
• Next steps?
  • Infinite retries can take the service down, as they will exhaust the available resources
  • Client retry budget of 10%: once it is reached, do not retry, as retrying would imply up to 3X rejected requests
  • Client-side adaptive throttling (see the sketch below)
    • Each client keeps the last 2 minutes of history
      • Number of requests attempted
      • Number of requests accepted
    • Once the cutoff value is reached, clients fail requests locally
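A minimal sketch of the proposed client-side adaptive throttling, following the scheme described in the Google SRE book (hypothetical Python, not EOS code; the cutoff multiplier and window handling are assumptions):

import random
import time
from collections import deque

class AdaptiveThrottle:
    """Keep ~2 minutes of request history and start failing requests locally
    once the backend accepts too small a fraction of what this client sends."""

    def __init__(self, window_seconds=120, cutoff_multiplier=2.0):
        self.window = window_seconds
        self.k = cutoff_multiplier       # how far accepts may lag behind requests
        self.history = deque()           # (timestamp, accepted: bool)

    def _trim(self, now):
        while self.history and now - self.history[0][0] > self.window:
            self.history.popleft()

    def record(self, accepted):
        now = time.time()
        self.history.append((now, accepted))
        self._trim(now)

    def should_fail_locally(self):
        self._trim(time.time())
        requests = len(self.history)
        accepts = sum(1 for _, ok in self.history if ok)
        # Rejection probability grows as accepted requests fall behind requests / k.
        p_reject = max(0.0, (requests - self.k * accepts) / (requests + 1))
        return random.random() < p_reject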

Traffic QoS: Google best practices

• Different queries have different costs
  • Online vs batch, external vs internal network, authenticated vs public
• Don’t measure query cost by moving targets
  • One query may list one inode, another may list 1k inodes
• Measure query cost by available capacity
  • For example, CPU cost per query
• Requests sent to the backend are classified with levels of criticality (see the sketch below)
  • CRITICAL_PLUS: serious user-visible impact
  • CRITICAL: default value for production jobs
  • SHEDDABLE_PLUS: partial unavailability is acceptable (batch jobs)
  • SHEDDABLE: partial or full unavailability is expected
• Criticality is propagated across services from the first to the last
• Batch traffic protected with a proxy
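A sketch of how these criticality levels could be represented and used for load shedding (hypothetical Python; the load thresholds are made up, only the four levels come from the slide):

from enum import IntEnum

class Criticality(IntEnum):
    """Request criticality, ordered from least to most important."""
    SHEDDABLE = 0        # partial or full unavailability is expected
    SHEDDABLE_PLUS = 1   # partial unavailability is acceptable (batch jobs)
    CRITICAL = 2         # default value for production jobs
    CRITICAL_PLUS = 3    # serious user-visible impact

def admit(request_criticality, load_fraction):
    """Shed the least critical traffic first as the server gets loaded.
    load_fraction: 0.0 (idle) .. 1.0 (saturated)."""
    if load_fraction < 0.7:
        threshold = Criticality.SHEDDABLE        # accept everything
    elif load_fraction < 0.9:
        threshold = Criticality.CRITICAL         # drop sheddable traffic
    else:
        threshold = Criticality.CRITICAL_PLUS    # keep only the most critical
    return request_criticality >= threshold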

Traffic QoS: How EOS implements it?

• Protocol-based connection pools: xrootd, http, unix socket (ipc)
  • Exhaustion in one thread pool should not affect the others (see the sketch below)
• Shield batch traffic: CMS use-case, a dedicated MGM to shield the traffic
• Next steps?
  • App-based resource configuration: info=eos.app=fuse::samba
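A sketch of the isolation that per-protocol pools give (hypothetical Python using thread pools; the pool sizes and names are assumptions, not EOS configuration):

from concurrent.futures import ThreadPoolExecutor

# One bounded pool per protocol front-end: a flood of HTTP requests can at
# worst fill the "http" pool, without starving xrootd or ipc traffic.
POOLS = {
    "xrootd": ThreadPoolExecutor(max_workers=64, thread_name_prefix="xrootd"),
    "http":   ThreadPoolExecutor(max_workers=32, thread_name_prefix="http"),
    "ipc":    ThreadPoolExecutor(max_workers=8,  thread_name_prefix="ipc"),
}

def dispatch(protocol, handler, *args):
    """Run a request handler on the pool of the protocol it arrived on."""
    return POOLS[protocol].submit(handler, *args)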

[Diagram: 20 000 batch/lambda clients (e.g. opening a PDF) hitting the EOS MGM; cancel the fan-out effect and throttle the traffic]

Conclusion

• EOS is a modern open-source storage solution; many SRE functionalities that Google uses internally are available to you
• EOS offers many functionalities to cope with the inner nature of a complex distributed system, with extreme use-cases (batch, online, offline) and different requirements (latency, throughput, bandwidth)
• Supporting the previous SRE best practices helps to:
  • Increase the efficiency of EOS operators when performing investigations
  • Protect the system against major downtimes

• However, there is still much room for improvement!