OSG Operations: GRACC, StashCache, OASIS
John Thiltges, Derek Weitzel, Marian Zvada
San Diego, Jan 28-31

GRACC Architecture
Quick reminder of the GRACC Architecture
● Gratia probes run on CEs and submit hosts
● Each of these boxes is actually multiple processes
GRACC
Hardware
Hosted on the OpenStack platform:
● Elasticsearch cluster (ELK), Ceph storage
● 1 front-end VM (64 GB RAM, 2 TB data volume)
● 5 data node VMs (32 GB RAM, 5 TB data volume)
Backup
● Data records to FNAL tape storage
● ES snapshots to HDFS@HCC (snapshot API sketch below)
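A minimal sketch of triggering a snapshot to the HDFS repository via the ES snapshot API, assuming the repository-hdfs plugin is installed and a repository has already been registered. The repository name, snapshot name, and index pattern below are placeholders, not the production configuration.

```python
# Sketch: take an ES snapshot into a registered HDFS snapshot repository.
# "hdfs_hcc" and the index pattern are hypothetical names.
import requests

ES = "http://localhost:9200"
resp = requests.put(
    f"{ES}/_snapshot/hdfs_hcc/snapshot-2019-01-28",
    json={"indices": "gracc.osg.raw-*", "include_global_state": False},
)
resp.raise_for_status()
print(resp.json())
```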
Reporting
● GRACC-APEL reporting
● resource usage reports for PIs sent weekly/monthly
  ○ probe for missing records

GRACC
Monitoring
● check_mk with automated notifications
Deployment
● fully puppetized
● Docker containers (not for everything)
GRACC
Monitoring dashboards
● status of ES health (health-check sketch below)
● status of nodes
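A minimal sketch of the kind of check behind an "ES health" panel: polling the cluster health API and flagging anything that is not green. The endpoint is a placeholder for the actual front-end host.

```python
# Sketch: poll Elasticsearch cluster health and report non-green status.
import requests

health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["status"], health["number_of_nodes"], "nodes")
if health["status"] != "green":
    print("warning: unassigned shards:", health["unassigned_shards"])
```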
Message Bus
The message bus is used by GRACC, network monitoring, and StashCache federation accounting (a publish sketch follows the bullets below).
● Hosted on a commercial provider: CloudAMQP
● Monitored through Grafana alerts and CloudAMQP alerts
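A minimal sketch of publishing a record onto the CloudAMQP bus with the pika AMQP client. The URL, exchange name, and record fields are placeholders, not the real GRACC settings.

```python
# Sketch: publish a toy usage record to an AMQP exchange on CloudAMQP.
import json
import pika

# Hypothetical CloudAMQP connection URL
params = pika.URLParameters("amqps://user:[email protected]/vhost")
connection = pika.BlockingConnection(params)
channel = connection.channel()

record = {"ResourceType": "Batch", "WallDuration": 3600}  # toy record
channel.basic_publish(
    exchange="gracc.osg.raw",  # hypothetical exchange name
    routing_key="",
    body=json.dumps(record),
)
connection.close()
```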
StashCache (XCache)
Hardware
● 1 StashCache (cache) host (FIONA box)
● 2 redirectors behind redirector.opensciencegrid.org
  ○ hosts (hcc-osg-redirector{1,2}.unl.edu) deployed via Puppet
  ○ XRootD configured in HA mode (each host subscribes to the DNS alias as meta-manager)
Monitoring
● check_mk
  ○ missing discovery of the cache nodes (no such option in XRootD); a local-check sketch is shown below
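One way to cover the gap is a check_mk local check that probes the cache port directly. A minimal sketch, assuming the standard local-check output format "<status> <service_name> <perfdata> <message>"; the host and port are placeholders for an actual cache node.

```python
#!/usr/bin/env python3
# Sketch: check_mk local check probing an XRootD cache port.
import socket

HOST, PORT = "cache.example.org", 1094  # hypothetical cache endpoint

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"0 XRootD_cache - port {PORT} on {HOST} is accepting connections")
except OSError as err:
    print(f"2 XRootD_cache - cannot connect to {HOST}:{PORT} ({err})")
```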
Deployment/Development
● v1.0.0 introduces xcache packaging, which provides a more generic way to install caches and origins
● k8s approach to deploying caches; see Edgar's presentation about the I2 caches

OASIS
● OSG Application Software Installation Service
● using CernVM File System (CVMFS) technology
Hardware (physical)
● oasis.opensciencegrid.org (hcc-oasis.unl.edu)
● oasis-login.opensciencegrid.org (hcc-oasis-login.unl.edu)
● oasis-replica.opensciencegrid.org (hcc-cvmfs.unl.edu)
● for each we have an -itb host with an equivalent setup
  ○ replica-itb (hcc-cvmfs-itb.unl.edu) is borrowed from CMS!
  ○ dedicated physical hardware for replica-itb to be purchased (whose responsibility?)
Monitoring
● Automated alarms to e-group: [email protected]
● check_mk

OASIS
Deployment
● fully puppetized (both -itb and production hosts), except:
● a custom script to install and configure (pretty much tied to the hostnames)
  ○ quite a bit of this should be moved to Puppet
Operations
● creating tasks for what we can improve/change (umbrella ticket OO-231)
● packaging in goc(-itb).repo
● some parts of the documentation still need rewriting
● several places still mention the GOC (e.g. "The GOC will enable…")
  ○ https://opensciencegrid.org/docs/data/update-oasis/#enable-oasis
● do we have a common name? Are we GOC or OSG Ops?
  ○ it matters for building the OSG Ops image
CVMFS Indexer
Turns an origin’s data into /cvmfs/stash.osgstorage.org/user/dweitzel….
● deployed with Puppet
● the node has many hundreds of GB of state
● the configuration is backed up, but not the state, which can be reproduced
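A minimal sketch of how an indexer pass might publish new origin content into the repository, assuming the standard cvmfs_server transaction/publish workflow. The repository name and source path are placeholders, and the real indexer logic is more involved.

```python
# Sketch: publish origin content into a CVMFS repository.
import subprocess

REPO = "stash.osgstorage.org"          # repository served via OASIS
SRC = "/origin-export/user/example"    # hypothetical origin export path
DST = f"/cvmfs/{REPO}/user/example"

subprocess.run(["cvmfs_server", "transaction", REPO], check=True)
try:
    # copy new/changed files from the origin into the open transaction
    subprocess.run(["rsync", "-a", SRC + "/", DST + "/"], check=True)
    subprocess.run(["cvmfs_server", "publish", REPO], check=True)
except subprocess.CalledProcessError:
    # roll back the half-finished transaction on any failure
    subprocess.run(["cvmfs_server", "abort", "-f", REPO], check=True)
    raise
```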
Monitoring
● check_mk
Topology
Hardware
● Two VMs at Nebraska (production and ITB)
Monitoring
● check_mk
Possible improvements
● Can Topology be run in Heroku, or at least behind Cloudflare?
  ○ Is GGUS the only gridsite client? Can this be split apart (under a separate name)?
OSG Repo
Hardware
● Two VMs at Nebraska (production and ITB)
Monitoring
● check_mk
Possible improvements
● Place a Cloudflare cache in front of the repo?
● We are unable to recreate tarball installations if the original, our backups, and the Wisconsin backups all fail.
OSG HTCondor-CE Collectors
Hardware
● Two VMs at Nebraska
Monitoring
● check_mk
Deployment
● configured with Puppet
● no state, so the configuration is dead simple; certs are the only moderately difficult part
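As a quick sanity check, one can list the CEs currently advertising to the central collector. A minimal sketch with the htcondor Python bindings; the collector hostname is the public alias and may need adjusting.

```python
# Sketch: query the HTCondor-CE central collector for advertised CEs.
import htcondor

coll = htcondor.Collector("collector.opensciencegrid.org")
ads = coll.query(htcondor.AdTypes.Schedd, projection=["Name"])
for ad in ads:
    print(ad.get("Name"))
```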
Network Services
We host a few services for network monitoring
● perfSONAR collectors, which transform data and inject it into ES instances at UNL and UChicago
See Shawn’s slides later...
XRootD Monitor
Receives monitoring data from all StashCache caches, and soon all of CMS
Hardware
● 1 VM running the monitor in a Docker container
Deployment
● Docker
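A minimal sketch of the kind of UDP listener that receives XRootD monitoring reports, assuming the caches are configured to report to this host/port (e.g. via an xrd.report directive). The port is a placeholder, and packet parsing is elided; real packets are XML (summary) or binary (detailed monitoring).

```python
# Sketch: UDP listener for XRootD monitoring packets.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9931))  # hypothetical reporting port

while True:
    data, addr = sock.recvfrom(65536)
    print(f"{len(data)}-byte report from {addr[0]}")
```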
Future efforts to consider
More containers!
● Puppet is a good step, making server configs documented and reproducible
● but Puppet is still "sticky": moving the config to a different site is non-trivial
● containers are far more portable, for disaster recovery or future funding situations that lead to service moves
● developers can have an ITB environment on their own workstation in seconds (see the sketch below)
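For illustration, a minimal sketch of spinning up a throwaway ITB-style service container on a workstation with the Docker SDK for Python. The image name is a placeholder; any containerized OSG service image would work the same way.

```python
# Sketch: run a disposable service container on a developer workstation.
import docker

client = docker.from_env()
container = client.containers.run(
    "opensciencegrid/example-service:latest",  # hypothetical image
    detach=True,
    ports={"8080/tcp": 8080},
    name="itb-sandbox",
)
print(container.status, container.short_id)
```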
Kubernetes?
High availability
● Cloudflare load balancing?