Linux Clusters Institute: Monitoring Steve Bird, Systems Administrator, Indiana University XCRI Engineer, XSEDE

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

August 2019 1 Why monitor?

August 2019 2 Service Level Agreement (SLA)

• Which services must be provided by you? • Which services must be provided to you?

• Regulatory requirements • Contractual requirements • Business requirements

• Common Deliverables • Availability of services (Uptime) • mean time between failures (MTBF) • mean time to repair or mean time to recovery (MTTR)

August 2019 3 Monitoring and Notification Basic Flow

Metric Metric Automated Gathering of collection aggregation Metrics

Metric Metric Evaluate Metrics against Transformation Analysis Requirements

Presentation Notification Prepare and share metrics for Stakeholders

August 2019 4 What to Collect (Metrics) • Overall cluster health • Storage • Queue size • Capacity • Jobs running • Degraded status • Jobs Queued • Connectivity • Overall network usage • Number of responding nodes • Security • Logs of everything • Individual node health • Load average • Power status • Memory used • Network bandwidth • Temperatures • CPU usage • Cold-aisle • Temperature • Switches exhausts • CPU temperatures

August 2019 5 Metric Collection

• Collection Tools (Common Tools) Collection tools already exist to • Ganglia capture most metrics. • Collectd • Perfmon No single tool will do everything • Performance Co-pilot (PCP) you need unless you write it • Nagios yourself • Unified Fabric Manager (UFM) • Cacti Try to avoid re-inventing the • Syslog wheel. • TACC stats • Scripts

August 2019 6 Metric Aggregation

• Aggregation Tools Metrics need to be gathered from • Ganglia all over the cluster to a single • Collectd place for analysis and storage • Performance Co-pilot (PCP) • Nagios Most metrics should transfer over • Unified Fabric Manager (UFM) the Management Ethernet to • Cacti avoid interference with Job • Syslog performance in Low Latency interconnect • Round Robin Database (RRD)

August 2019 7 Metric Analysis and Transformation

• Monitoring Conundrum • Data is useless unless we do something with it • We can collect much more data than we can analyze • We generally won’t know what data we need until we need it • Exception: Data we must provide for SLA requirements • Limited storage and processing capacity for metric analysis

August 2019 8 Notification

• Notification Tools Basic functionality of all alerts: • Nagios Red, Yellow, Green • • Zenoss Most notification tools are forks • or clones of Nagios • PRTG • OpenNMS Notification tools can be passive • OP5 or active in querying the status of • Pandora FMS the cluster • Unified Fabric Manager (UFM)

August 2019 9 Notification

• Monitoring for known evil • Basis for all notifications • Only alert if something known bad happens

• Metrics -> Notifications • Most tools will require extensive configuration to be useful • Most tools will have a way to query metrics and create alerts • Some tools, such as Nagios, have this entire process built in • Others will have ways to bolt on this functionality • Nagios can query Ganglia • Ganglia can query Nagios

August 2019 10 How should we get notified?

• Emergency • Fire and smoke exiting machine

• Urgent: • Email or text or phone call • Define this carefully

• Not-so urgent: • Web page updates • Especially helpful for historical data • Email (filtered) • End-user support requests

August 2019 11 SLA based Alerts

• Alerts on Deliverables • Availability of services (Uptime) • Example: Alert if less than 98% of batch nodes are online

• mean time between failures (MTBF) • Example: Send email report of time between failures

• mean time to repair or mean time to recovery (MTTR) • Example: Alert if a down node does not come online after 4 hours down

August 2019 12 How often to alert?

• SLA requirements • If your SLA requires it, you may get called off-hours or even on holidays

• You will quickly get a feel for this • Too much info is often worse than too little info • The “urgent” – continually • The “not-so-urgent” – anywhere from a few times per day to once per week • There’s nothing wrong with trial and error • Consider aggregated reports for ‘not-so-urgent’

August 2019 13 Security Alerts

• Securing the cluster • Security alerts may need to go to specific groups or people instead of normal operations

• Regulations and Security rules may apply to cluster which must be enforced • Compliance to Regulations: Sarbanes Oxley, Fisma, HIPAA, etc. • NIST 800-171 (CUI – Controlled Unclassified Information) has specific reporting timeframes

• Active response may need to be required such as blocking IPs

• Security status updates

• Alerts on security failures • sudo reports • Network login failures (e.g. fail2ban) • crontab failures • Logfile errors (customize to fit)

August 2019 14 Log Management

• Centralized Log Management • Make troubleshooting easier

• Analysis of your logs

• Reduce the data loss risk

• ELK stack (https://www.elastic.co): • Elasticse arch: Stores and indexes logs • Logsta sh: Collects and parses logs • : Virtualization and analysis.

August 2019 15 Example: Nagios

August 2019 16 Example: Nagios

• Nagios/NRPE (Nagios Remote Plugin Executor) • Generic executable that runs “plugins” • Plugins can monitor just about anything you can think of monitoring

• Even works with Windows

• Nagios (http://www.nagios.org/) is by far the most common monitoring system

August 2019 17 Example: Check_MK

August 2019 18 Example: Check_MK

• Check_MK (http://mathias-kettner.com/check_mk.html) • Much faster than Nagios • Graphing tools and web GUI • Agent-side and server-side checks • Easy install and configure with OMD • Used by CERN (and HCC).

August 2019 19 Example: Ganglia

August 2019 20 Example: Ganglia

• Ganglia (http://ganglia.sourceforge.net/) - for historical and resource monitoring • Ours are public • RRD files give historical data (a.k.a. “lots of pretty graphs”)

August 2019 21 Example: Grafana

August 2019 22 Example: Grafana

• Grafana (https://grafana.com) – General purpose dashboard • InfluxDB (https://www.influxdata.com) – • Collectd (https://collectd.org) – Collect metric and send to InfluxDB

August 2019 23 Example: ELK stack

August 2019 24 Monitoring Future

• Large data analysis using machine learning • Noise reduction • Anomaly detection • Correlation

• Examples: • Moogsoft (https://www.moogsoft.com) • Metricly (https://www.metricly.com/product) • Anodot (https://www.anodot.com)

August 2019 25 Questions?

August 2019 26