OSG Operations: GRACC, StashCache, OASIS

John Thiltges, Derek Weitzel, Marian Zvada
San Diego, Jan 28-31

GRACC Architecture

Quick reminder of the GRACC Architecture

● Gratia probes run on CEs and submit hosts (probe config sketched below)
● Each of these boxes represents multiple actual processes
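
A rough sketch of what a Gratia probe config on a CE might look like; the attribute values below are placeholders, not production settings:

    <!-- Illustrative ProbeConfig sketch (e.g. under /etc/gratia/); values are placeholders -->
    <ProbeConfiguration
        CollectorHost="gracc-collector.example.org:80"
        ProbeName="htcondor-ce:ce1.example.edu"
        SiteName="Example_Site"
        EnableProbe="1"
    />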

GRACC

Hardware hosted on OpenStack platform

● Elasticsearch cluster (ELK), Ceph storage
● 1 VM front-end (64GB RAM, 2TB data volume)
● 5 VM data nodes (32GB RAM, 5TB data volume)

Backup

● Data records to FNAL tape storage
● ES snapshots to HDFS@HCC (snapshot sketch below)
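
As a rough illustration of the snapshot flow (the repository name, namenode URI, and snapshot name are placeholders; the hdfs repository type requires the repository-hdfs plugin):

    # Register an HDFS snapshot repository; name, URI, and path are placeholders
    curl -X PUT 'localhost:9200/_snapshot/hdfs_backup' -H 'Content-Type: application/json' -d '
    {
      "type": "hdfs",
      "settings": { "uri": "hdfs://namenode.example.edu:8020", "path": "/gracc/es-snapshots" }
    }'

    # Take a snapshot (snapshot name is illustrative)
    curl -X PUT 'localhost:9200/_snapshot/hdfs_backup/snapshot-2019.01?wait_for_completion=false'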

Reporting

● GRACC-APEL reporting
● resource usage reports for PIs sent weekly/monthly
    ○ probe for missing records

GRACC

Monitoring

● check_mk with automated notifications

Deployment

● fully puppetized
● containers (not for everything)

GRACC

Monitoring dashboards

● status of ES health (health-check sketch below)
● status of nodes
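
For reference, the same health information can be pulled straight from the ES REST API; the hostname below is a placeholder:

    # Cluster-level health (green/yellow/red)
    curl -s 'http://gracc-es.example.edu:9200/_cluster/health?pretty'

    # Per-node heap, CPU, and load
    curl -s 'http://gracc-es.example.edu:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'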

Message Bus

The message bus is used by GRACC, Network Monitoring, and StashCache federation accounting (publish sketch below)

● Hosted on a commercial provider: CloudAMQP
● Monitored through our own alerts and through CloudAMQP alerts
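
A minimal sketch of publishing a record to the bus with pika; the AMQP URL, exchange, and record fields are placeholders, not the production names:

    # Publish sketch using pika; URL, exchange, and record fields are placeholders
    import json
    import pika

    params = pika.URLParameters("amqps://user:[email protected]/vhost")
    connection = pika.BlockingConnection(params)
    channel = connection.channel()

    record = {"ProbeName": "htcondor-ce:ce1.example.edu", "WallDuration": 3600}
    channel.basic_publish(exchange="gracc.example.raw",
                          routing_key="",
                          body=json.dumps(record))
    connection.close()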

StashCache (XCache)

Hardware

● 1 StashCache (cache) host (FIONA box)
● 2 redirectors behind redirector.opensciencegrid.org (config sketch below)
    ○ hosts (hcc-osg-redirector{1,2}.unl.edu) deployed via puppet
    ○ xrootd config set up in HA mode (each host subscribes to the DNS alias as meta-manager)
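
Roughly, the HA piece on each redirector host amounts to a cmsd/xrootd config fragment like the sketch below; the directives should be checked against the XRootD docs, and the port is illustrative:

    # Illustrative fragment for a redirector host in HA mode
    all.role meta manager
    # Both hosts point at the DNS alias, so either can serve as the meta-manager
    all.manager meta redirector.opensciencegrid.org+ 1213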

Monitoring

● check_mk
    ○ missing discovery of the cache nodes (no such option in XRootD)

Deployment/Development

● v1.0.0 introduces xcache packaging, which provides a more generic way to install caches and origins
● k8s approach to deploying caches; see Edgar's presentation about I2 caches

OASIS

● OSG Application Software Installation Service
● using CernVM File System (CVMFS) technology (client config sketch below)
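
For context, clients mount OASIS with an ordinary CVMFS client configuration; a minimal sketch (the squid proxy host is a placeholder):

    # /etc/cvmfs/default.local -- minimal client sketch; the proxy host is a placeholder
    CVMFS_REPOSITORIES=oasis.opensciencegrid.org
    CVMFS_HTTP_PROXY="http://squid.example.edu:3128"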

Hardware (physical)

● oasis.opensciencegrid.org (hcc-oasis.unl.edu)
● oasis-login.opensciencegrid.org (hcc-oasis-login.unl.edu)
● oasis-replica.opensciencegrid.org (hcc-cvmfs.unl.edu) (replica setup sketched below)
● for each we have an -itb host with an equivalent setup
    ○ replica-itb (hcc-cvmfs-itb.unl.edu) is borrowed from CMS!
    ○ dedicated physical hardware for replica-itb to be purchased (whose responsibility?)
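
As a rough sketch, the replica host serves a Stratum-1 copy created and kept in sync with cvmfs_server; the URL and key path below are illustrative, not the exact production invocation:

    # Illustrative Stratum-1 setup on the replica host
    cvmfs_server add-replica -o root \
        http://oasis.opensciencegrid.org/cvmfs/oasis.opensciencegrid.org \
        /etc/cvmfs/keys/opensciencegrid.org.pub

    # Periodic synchronization (normally driven by cron)
    cvmfs_server snapshot oasis.opensciencegrid.org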

Monitoring

● Automated alarms to e-group: [email protected]

● check_mk

OASIS

Deployment

● fully puppetized (both -itb and production hosts), except:
● custom script to install and configure (pretty much tied to the hostnames)
    ○ quite a bit of this still to be moved to puppet

Operations

● creating tasks for what we can improve/change (umbrella ticket OO-231)
● packaging in goc(-itb).repo
● rewriting some parts of the documentation is still needed
● seeing places where we mention GOC (e.g. "The GOC will enable…")
    ○ https://opensciencegrid.org/docs/data/update-oasis/#enable-oasis
● do we have a common name, i.e. are we GOC or OSG Ops?
    ○ it matters for building the OSG Ops image

CVMFS Indexer

Turns an origin’s data into /cvmfs/stash.osgstorage.org/user/dweitzel….

● Deployed with Puppet
● The node has hundreds of GB of state
● Configuration is backed up, but not the state, which can be reproduced

Monitoring

● Check_mk

Topology

Hardware

● Two VMs at Nebraska (production and ITB)

Monitoring

● check_mk

Possible improvements

● Can topology be run in Heroku, or at least behind Cloudflare?
    ○ Is GGUS the only gridsite client? Can this be split apart (separate name)?

OSG Repo

Hardware

● Two VMs at Nebraska (production and ITB)

Monitoring

● check_mk

Possible improvements

● Place a Cloudflare cache in front of the repo?
● Unable to recreate tarball installations if the originals, our backups, and the Wisconsin backups all fail

OSG HTCondor-CE Collectors

Hardware

● Two VMs at Nebraska

Monitoring

● check_mk

Deployment

● Puppet configured (collector config sketched below)
● No state, so dead simple configuration; certs are the only moderately difficult part
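
As a rough idea of how small the configuration is, a collector-only HTCondor node needs little more than the sketch below; the file name and port are illustrative, not necessarily the production values:

    # Illustrative /etc/condor/config.d/99-collector.config
    DAEMON_LIST = MASTER, COLLECTOR
    # Non-default port (9619) shown here as an example only
    COLLECTOR_HOST = $(FULL_HOSTNAME):9619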

Network Services

We host a few services for network monitoring

● perfSONAR collectors, which transform data and inject it into ES instances at UNL and UChicago (indexing sketch below)
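
A minimal sketch of the inject step, assuming a 7.x-style elasticsearch-py client; the host, index name, and document fields are placeholders:

    # Sketch of indexing a transformed perfSONAR measurement; values are placeholders
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://es.example.edu:9200"])

    doc = {
        "timestamp": "2019-01-28T12:00:00Z",
        "source": "psum01.example.edu",
        "destination": "psum02.example.edu",
        "throughput_bps": 8.2e9,
    }
    es.index(index="ps_throughput-2019.01", body=doc)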

See Shawn’s slides later...

XRootD Monitor

Receives monitoring data from all StashCache caches, and soon all of CMS

Hardware

● 1 VM running the monitor in a docker container

Deployment

● Docker
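
Roughly how the container is run; the image name, tag, and UDP port below are placeholders (XRootD ships its monitoring stream over UDP):

    # Illustrative only: image name and port are placeholders
    docker run -d --name xrootd-monitor \
        -p 9930:9930/udp \
        opensciencegrid/xrootd-monitoring-collector:latest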

Future efforts to consider

More containers!

● Puppet is a good step, making documented and reproducible configs
● But puppet is still "sticky" and non-trivial to move the config to a different site
● Containers are far more portable, for disaster recovery or future funding situations leading to service moves
● Developers can have an ITB environment on their own workstation in seconds (sketch below)
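
For example, standing up a throwaway ITB-style instance of a containerized service on a laptop could be as simple as the sketch below; the image name, port, and mounted config path are hypothetical:

    # Hypothetical example; image, tag, port, and mounted config are placeholders
    docker run --rm -it \
        -p 8080:8080 \
        -v $PWD/itb-config:/etc/osg:ro \
        opensciencegrid/topology-itb:latest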

Kubernetes?

High-availability

● Cloudflare load balancing?
