Status reporting basics
S. Teige February 2, 2016
1 Determination of the status of a service
The status of a service[1] monitored by the service monitor probe is reported in a form similar to the following:

OK
timestamp:1454357761
generated:Mon Feb 1 20:16:01 UTC 2016
uptime:516949.70
kernel:2.6.32-573.12.1.el6.x86_64
filesystem_use:/var at 42%

Only the first two fields are mandatory; everything else is optional and informative. The report file is generated on the machine being monitored, so any logic that can be reduced to a status can be implemented there, as described below. The service history[2] and availability metrics[3] are derived from these reports.
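As a minimal sketch of how a consumer might read the two mandatory fields of such a report, consider the shell function below. The function name `read_status_report` and the variables `REPORT_STATUS` and `REPORT_TIMESTAMP` are hypothetical and are not part of RSV; the sketch assumes one field per line, as in the sample above.

```shell
# Hypothetical helper: read the two mandatory fields of a status report.
# Sets REPORT_STATUS and REPORT_TIMESTAMP; returns non-zero if the file
# is unreadable or either field is missing.
read_status_report() {
    report_file=$1
    [ -r "$report_file" ] || return 1
    # Line 1 is the status word (OK, WARNING or CRITICAL).
    REPORT_STATUS=$(sed -n '1p' "$report_file")
    # Line 2 has the form "timestamp:<seconds since the epoch>".
    REPORT_TIMESTAMP=$(sed -n '2s/^timestamp://p' "$report_file")
    [ -n "$REPORT_STATUS" ] && [ -n "$REPORT_TIMESTAMP" ]
}
```

Any further lines (generated, uptime, kernel, filesystem_use) are ignored here, since they are optional.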
1.1 Detection

The RSV infrastructure at the GOC accesses the above information either via a wget command (for HTTP-capable services) or via a shared file system (for other services). The status is given by the first (required) line of the status report and takes one of three values: OK, WARNING or CRITICAL. The logic used to determine this value is implemented entirely on the machine being monitored. The scripts are: /net/nas01/Public/status/

overall_status=0

## system stuff
kern=$(uname -r)
utime=$(cat /proc/uptime | awk '{ print $1 }')
now=$(date +%s)
dire=$(hostname | awk -F. '{print $1}')

## largest use percentage over all file systems except nas01
## (the closing "| tail -n 1" is assumed; the original line is cut off after "sort -n")
max_frac=$(df -h | grep -v nas01 | awk '{print $5}' | sed s/%//g | grep -v Use | sort -n | tail -n 1)
max_sys=$(df -h | grep "$max_frac%" | awk '{print $6" at "$5}')
if [ "$max_frac" -gt 90 ]; then overall_status=1; fi
if [ "$max_frac" -gt 95 ]; then overall_status=2; fi

## first line of the report: the status word
if [ $overall_status -eq 0 ]; then
echo "OK" > /net/nas01/Public/status/$dire/stamp
fi
if [ $overall_status -eq 1 ]; then
echo "WARNING" > /net/nas01/Public/status/$dire/stamp
fi
if [ $overall_status -eq 2 ]; then
echo "CRITICAL" > /net/nas01/Public/status/$dire/stamp
fi

## second line (mandatory timestamp), then the optional fields
echo "timestamp:$now" >> /net/nas01/Public/status/$dire/stamp
now=$(date)
echo "generated:$now" >> /net/nas01/Public/status/$dire/stamp
echo "uptime:$utime" >> /net/nas01/Public/status/$dire/stamp
echo "kernel:$kern" >> /net/nas01/Public/status/$dire/stamp
echo "filesystem_use:$max_sys" >> /net/nas01/Public/status/$dire/stamp

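The threshold logic of the script can be isolated in a small function, which makes it easier to test without touching the shared file system. This is only a sketch: the function name `status_for_usage` is hypothetical, and the 90% and 95% limits are the ones the script uses.

```shell
# Hypothetical helper: map a file system use percentage (an integer,
# without the % sign) to a status word, with the same limits as the script.
status_for_usage() {
    if [ "$1" -gt 95 ]; then
        echo "CRITICAL"
    elif [ "$1" -gt 90 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

For example, `status_for_usage 42` prints OK, while a value of 96 yields CRITICAL. As in the script, the comparisons are strict, so exactly 90 is still OK and exactly 95 is still WARNING.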
1.2 Reporting

The RSV infrastructure at the GOC obtains the status file from the monitored machine via wget or from the shared file system. If it cannot obtain the file, the reported status is UNKNOWN. If the file is obtained successfully, the reported status is determined from the first and second fields of the status report. The timestamp line is checked to ensure that the report was generated recently, where "recently" is configurable. Typical values are a report age of twenty minutes to generate a WARNING status and thirty minutes to generate a CRITICAL status.

If the timestamp is current, the reported status is the first field of the status file. That is, RSV simply passes through the status diagnosed remotely.
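The decision described in this section can be sketched as a shell function. This is an illustration under stated assumptions, not the actual RSV probe: the function name `evaluate_report` is hypothetical, the file is assumed to have already been fetched (by wget or from the shared file system) to a local path, the defaults of 1200 and 1800 seconds are the typical values quoted above, and treating a missing or non-numeric timestamp as UNKNOWN is an assumption the text does not state.

```shell
# Hypothetical sketch of the RSV-side decision; not the actual probe code.
# Usage: evaluate_report FILE NOW [WARN_AGE] [CRIT_AGE]   (times in seconds)
evaluate_report() {
    file=$1
    now=$2
    warn_age=${3:-1200}   # 20 minutes, the typical WARNING limit
    crit_age=${4:-1800}   # 30 minutes, the typical CRITICAL limit

    # The file could not be obtained: the status is UNKNOWN.
    [ -r "$file" ] || { echo "UNKNOWN"; return; }

    status=$(sed -n '1p' "$file")
    stamp=$(sed -n '2s/^timestamp://p' "$file")

    # Assumption: an absent or non-numeric timestamp is treated as UNKNOWN.
    case $stamp in
        ''|*[!0-9]*) echo "UNKNOWN"; return ;;
    esac

    age=$((now - stamp))
    if [ "$age" -gt "$crit_age" ]; then
        echo "CRITICAL"
    elif [ "$age" -gt "$warn_age" ]; then
        echo "WARNING"
    else
        # Current report: pass the remotely diagnosed status through.
        echo "$status"
    fi
}
```

The current time is passed in as an argument so that the function can be tested deterministically; in practice it would be called with something like `evaluate_report "$file" "$(date +%s)"`.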