Site Reliability Engineering Best Practices Implemented in EOS

“Hope is not a strategy”
Hugo Gonzalez Labrador (CERN IT-ST)

Scope
• Logging
• Data deletion
• Lame duck state
• Retry mechanism
• Quality of Traffic

Logging
• Google SRE best practice:
  • Allow verbosity levels to be raised on the fly, with no need to restart the process, so you can inspect the real traffic that is failing
• How EOS implements it:
  • Runtime log levels per component and subcomponent:
    $ eos debug err /eos/*/fst
    $ eos debug crit /eos/*/mgm
    $ eos debug debug --filter MgmOfsMessages
  • Log ids: follow the lifecycle of an operation (see the tracing sketch after the data-deletion slides)
    func=open level=INFO logid=13727e80-7a6e-11eb-886e-a4bf0112e0f8 …
  • Error suppression à la syslog:
    ---- high rate error messages suppressed ----
• Next steps?
  • Add a selection language for deeper dives:
    $ eos debug debug user=andreas and protocol=fusex and fromip=1.2.3.4
  • Sample rate
    • “But isn’t it very costly to run debug in production?” Fine: sample the rate and don’t emit a debug line for every single request, only 1 out of 100.

Data deletion
• Google best practice
• How EOS implements it:
  • Soft deletion
    • Versioning
    • Recycle bin
  • Disaster recovery
    • File layout
    • Clone mechanism (just learned about it :-))
    • External backup

Data deletion: Soft Deletion: Versioning
• File versioning triggers on file overwrite
• Versioning is smart: a mix of time buckets + last N changes (see the configuration sketch after the data-deletion slides)
• Examples:
  $ eos ls -la /eos/user/g/gonzalhu
  drwx--s--+ 1 gonzalhu it 12638 Jan 12 14:44 .sys.v#.tus-eos.drawio
  -rw-r--r-- 2 gonzalhu it  1409 Jan 12 14:44 tus-eos.drawio

  $ eos file versions /eos/user/g/gonzalhu/tus-eos.drawio
  -rw-r--r-- 2 gonzalhu it    0 Jan 12 14:36 1610458605.3084bd5f
  -rw-r--r-- 2 gonzalhu it 1101 Jan 12 14:38 1610458701.3084c3cf
  -rw-r--r-- 2 gonzalhu it 1117 Jan 12 14:38 1610458708.3084c3e5
  -rw-r--r-- 2 gonzalhu it 1101 Jan 12 14:38 1610458711.3084c3ee

  $ eos file versions /eos/user/g/gonzalhu/tus-eos.drawio 1610458711.3084c3ee
  success: staged '/eos/user/g/gonzalhu/.sys.v#.tus-eos.drawio/1610458711.3084c3ee' back to '/eos/user/g/gonzalhu/tus-eos.drawio' - the previous file is now '/eos/user/g/gonzalhu/.sys.v#.tus-eos.drawio/1610459070.3084d399'

Data deletion: Soft Deletion: Recycle bin
• The recycle bin can be configured at fine granularity, down to the folder level
• The recycle retention policy is fully configurable (6 months, 1 year, …)
• Examples:
  $ eos recycle ls
  Wed Jan 27 13:01:20 2021 file fxid:00000000316f9198 /eos/user/g/gonzalhu/P/PRIVATE/keys/dns-vero/dns.key
  Wed Jan 27 13:01:16 2021 file fxid:0000000036f9197 /eos/user/g/gonzalhu/P/PRIVATE/keys/contacts.

  $ eos recycle restore fxid:0000000036f9197

Data deletion: Disaster Recovery: File layout
• The file layout is the first protection against major failures (dead disk)
• EOS offers many smart data placements:
  • Replica-based: server, rack, data center, geographic region
  • Erasure-coding: server, rack, data center, geographic region(?)
• Fine-grained data placement, down to the folder level (see the placement sketch after the data-deletion slides):
  $ eos attr ls /eos/user/g/gonzalhu/
  sys.forced.nstripes="2"

Data deletion: Disaster Recovery: External backup
• EOS exposes many protocols (HTTP, XROOTD, POSIX)
• This gives enough flexibility to plug in arbitrary backup tools
• EOS optimizes change discovery thanks to recursive etags/mtimes
• Example:
  • Restic: Backing up CERNBox – Roberto – Wed @14:20
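A hedged sketch of following one operation end-to-end via its log id: grep the id across the MGM and FST logs. The log locations below are assumptions for a default EOS installation and may differ on your nodes.

  # Trace the lifecycle of a single open() across services by its logid
  # (assumed default log paths; adjust to your deployment).
  $ grep 13727e80-7a6e-11eb-886e-a4bf0112e0f8 /var/log/eos/mgm/xrdlog.mgm
  $ grep 13727e80-7a6e-11eb-886e-a4bf0112e0f8 /var/log/eos/fst/xrdlog.fst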
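A minimal configuration sketch for versioning and the recycle-bin retention, assuming the sys.versioning attribute and the eos recycle config flags of recent EOS releases; the exact attribute name and flag should be checked with eos attr -h and eos recycle -h on your instance.

  # Keep up to 10 versions of files created under this folder (assumed attribute).
  $ eos attr set sys.versioning=10 /eos/user/g/gonzalhu/
  # Set the recycle-bin retention to roughly 6 months, expressed in seconds (assumed flag).
  $ eos recycle config --lifetime 15552000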
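A short placement sketch for per-folder layout, assuming the usual sys.forced.* attributes shown above; the available layout names depend on the EOS version.

  # Force a 2-replica layout for new files under this folder (assumed values).
  $ eos attr set sys.forced.layout=replica /eos/user/g/gonzalhu/
  $ eos attr set sys.forced.nstripes=2 /eos/user/g/gonzalhu/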
What can go wrong?

Resource exhaustion
• Not enough CPU:
  • Increased number of in-flight requests
  • Longer queue lengths
  • Thread starvation
  • Missed deadlines/timeouts
  • Reduced CPU caching
• Not enough memory:
  • Dying tasks
  • Reduction in cache hits
• Not enough threads
• Not enough file descriptors => no network connections
[Diagram: server states Down / Degraded / Recovered]

Lame duck state
• From a client perspective, a server can be in 3 states:
  • A) Healthy
  • B) Refusing connections: the backend is unresponsive or dead
  • C) Lame duck: the server is operational but asks clients to stop sending requests
• Lame duck mode is very important to move from “Degraded” to “Recovered” rather than from “Degraded” to “Down”

Lame duck state
• How EOS implements it:
  • The EOS MGM asks clients to stall their requests for a while
    • “Wait 60 seconds before trying to contact me again, okay?”
  • The EOS MGM can ask clients to connect to a different server
    • “Hey, I’m super busy now, please connect to this other server, okay?”
  • Useful when read/write requests towards/from the FSTs fail
[Diagram: the EOS MGM telling a FUSEX client to sleep for a while before retrying]

Retry logic
• Google best practice: try to recover from internal failures before bubbling them up to the end user
• How EOS implements it:
  • Configurable retry logic on the client side (timeouts, deadlines, redirections): man xrdcopy (see the hedged example after this section)
  • Client-side fallback to another MGM based on round-robin hostname resolution – EOS for Physics – Cristi – Monday @15:20
  • The redirection limit is configurable: tune it to your file layout/deployment, then bubble the error up to the client
• Next steps?
  • Infinite retries can bring the service down, as they will exhaust the available resources
  • Client retry budget at 10%: once reached, do not retry, as retrying would imply up to 3x the rejected requests
  • Client-side adaptive throttling (see the sketch after this section):
    • Each client keeps the last 2 minutes of history:
      • Number of requests attempted
      • Number of requests accepted
    • Once the cutoff value is reached, clients fail requests locally
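A hedged illustration of the client-side retry knobs: the XRootD client reads a family of XRD_* environment variables documented in xrdcopy(1). The exact variable names and sensible values below are assumptions to be checked against your client version.

  # Assumed XRootD client environment variables controlling the request timeout,
  # the number of connection retries and the redirection limit for a single copy.
  $ XRD_REQUESTTIMEOUT=60 XRD_CONNECTIONRETRY=3 XRD_REDIRECTLIMIT=16 \
      xrdcp root://eosuser.cern.ch//eos/user/g/gonzalhu/tus-eos.drawio /tmp/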
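Client-side adaptive throttling is listed as a next step, not an existing EOS feature; one plausible cutoff rule, borrowed from the “Handling Overload” chapter of the Google SRE book, is to reject a new request locally with probability

  p_reject = max(0, (requests - K * accepts) / (requests + 1))

where requests and accepts are counted over the trailing two-minute window and K is the allowed attempts-to-accepts ratio (the book uses K = 2). Once accepted traffic lags far behind attempted traffic, the client starts failing requests locally instead of sending them to the MGM.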
Traffic QoS: Google best practices
• Different queries have different costs
  • Online vs batch, external vs internal network, authenticated vs public
• Don’t measure query cost with moving targets
  • One query listing a single inode vs one query listing 1k inodes
• Measure query cost by available capacity
  • For example, CPU cost per query
• Requests sent to the backend are classified with levels of criticality:
  • CRITICAL_PLUS: serious user-visible impact
  • CRITICAL: default value for production jobs
  • SHEDDABLE_PLUS: partial unavailability is acceptable (batch jobs)
  • SHEDDABLE: partial or full unavailability is expected
• Criticality is propagated across services, from the first to the last
• Batch traffic is protected with a proxy

Traffic QoS: How EOS implements it?
• Protocol-based connection pools: xrootd, http, unix socket (ipc)
  • Exhaustion in one thread pool should not affect the others
• Shield batch traffic: CMS use case, a dedicated MGM to shield the traffic
• Next steps?
  • App-based resource configuration: info=eos.app=fuse::samba
[Diagram: 20 000 batch/Lambda clients fanning out to the EOS MGM vs. a single user opening a PDF; cancel the fan-out effect, throttle the traffic]

Conclusion
• EOS is a modern open-source storage solution; many SRE functionalities that Google uses internally are available to you
• EOS offers many functionalities to cope with the inner nature of a complex distributed system, with extreme use cases (batch, online, offline) and different requirements (latency, throughput, bandwidth)
• Supporting the previous SRE best practices helps to:
  • Increase the efficiency of EOS operators when performing investigations
  • Protect the system against major downtimes
• However, there is still plenty of room to improve!
