Status Reporting Basics

S. Teige
February 2, 2016

1 Determination of the status of a service

The status of a service[1] monitored by the service monitor probe is reported in a form similar to the following:

    OK
    timestamp:1454357761
    generated:Mon Feb 1 20:16:01 UTC 2016
    uptime:516949.70
    kernel:2.6.32-573.12.1.el6.x86_64
    filesystem_use:/var at 42%

Only the first two fields are mandatory; everything else is optional and informative. This reporting file is generated on the machine being monitored. Any logic desired (that can be reduced to a status) can be implemented as described below. The service history[2] and availability metrics[3] are derived from this reporting.

[1] For example, the service status: http://tinyurl.com/gqanunl
[2] History: http://tinyurl.com/j59e5lc
[3] Availability: http://tinyurl.com/jb28a7s

1.1 Detection

The above information is accessed by the RSV infrastructure at the GOC either via a wget command (for http-capable services) or via a shared file system (for other services). The status reported is given by the first (required) line of the status report. It may have one of three values: OK, WARNING or CRITICAL. The logic used to determine this value is implemented entirely on the machine being monitored. The scripts are:

    /net/nas01/Public/status/<machine name>/status_stamp.sh

These are run on the monitored machine via an entry in /etc/cron.d, typically every fifteen minutes. Any desired diagnostic test is implemented by modifying these scripts. They should be publicly editable, so you can change them as you see fit. The first two lines of the report must be present; the remaining content is unconstrained. An example is the bare-bones script currently monitoring perfsonar1:

    overall_status=0

    ## system stuff
    kern=`uname -r`
    utime=`cat /proc/uptime | awk '{ print $1 }'`
    now=`date +%s`
    # short host name, used as the directory name under /net/nas01/Public/status
    dire=`hostname | awk -F. '{print $1}'`

    # Largest "Use%" value over all file systems except the nas01 mount.
    # NOTE: the end of this pipeline is cut off in the source;
    # "| tail -n 1" (take the largest value) is an assumed reconstruction.
    max_frac=`df -h | grep -v nas01 | awk '{print $5}' | sed s/%//g | grep -v Use | sort -n | tail -n 1`
    max_sys=`df -h | grep $max_frac% | awk '{print $6" at "$5}'`

    # more than 90% full is a WARNING, more than 95% full is CRITICAL
    if [ $max_frac -gt 90 ]; then overall_status=1; fi
    if [ $max_frac -gt 95 ]; then overall_status=2; fi

    # first line of the report: the status itself
    if [ $overall_status -eq 0 ]; then
        echo "OK" > /net/nas01/Public/status/$dire/stamp
    fi
    if [ $overall_status -eq 1 ]; then
        echo "WARNING" > /net/nas01/Public/status/$dire/stamp
    fi
    if [ $overall_status -eq 2 ]; then
        echo "CRITICAL" > /net/nas01/Public/status/$dire/stamp
    fi

    # second line of the report: the timestamp, followed by the optional fields
    echo "timestamp:$now" >> /net/nas01/Public/status/$dire/stamp
    now=`date`
    echo "generated:$now" >> /net/nas01/Public/status/$dire/stamp
    echo "uptime:$utime" >> /net/nas01/Public/status/$dire/stamp
    echo "kernel:$kern" >> /net/nas01/Public/status/$dire/stamp
    echo "filesystem_use:$max_sys" >> /net/nas01/Public/status/$dire/stamp

1.2 Reporting

The RSV infrastructure at the GOC obtains the reported status file from the monitored machine via wget or from the shared file system. If it cannot obtain this file, the reported status is UNKNOWN.

If the file is obtained successfully, the status reported is determined from the first and second fields of the status report. The timestamp line is checked to ensure that the status report was generated recently, where "recently" is configurable. Typical values are twenty minutes, after which a WARNING status is generated, and thirty minutes, after which a CRITICAL status is generated.

If the timestamp is current, the status reported is given by the first field of the status file. That is, RSV becomes a simple pass-through of the status diagnosed remotely.
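
The evaluation described in Section 1.2 can be sketched as a small shell function. This is only an illustration under stated assumptions, not the RSV code: the names evaluate_status, WARN_AGE and CRIT_AGE are hypothetical, the thresholds are the "typical values" quoted above, and two behaviours are assumed because the text does not specify them: a stale report overrides whatever status the file contains, and a missing or malformed timestamp line yields UNKNOWN.

```shell
#!/bin/sh
# Sketch of the collector-side check from Section 1.2 (hypothetical names,
# not the actual RSV implementation).

WARN_AGE=1200   # 20 minutes in seconds: older reports become WARNING
CRIT_AGE=1800   # 30 minutes in seconds: older reports become CRITICAL

# usage: evaluate_status <status file> [current time in epoch seconds]
evaluate_status() {
    file=$1
    now=${2:-$(date +%s)}

    # The file could not be obtained: the status is UNKNOWN.
    if [ ! -r "$file" ]; then
        echo "UNKNOWN"
        return
    fi

    # First field: the status. Second field: "timestamp:<epoch seconds>".
    status=$(sed -n '1p' "$file")
    stamp=$(sed -n '2s/^timestamp://p' "$file")

    # Assumption: a missing or non-numeric timestamp is reported as UNKNOWN.
    case $stamp in
        ''|*[!0-9]*) echo "UNKNOWN"; return ;;
    esac

    # Assumption: a stale report overrides the status written in the file.
    age=$((now - stamp))
    if [ "$age" -gt "$CRIT_AGE" ]; then
        echo "CRITICAL"
    elif [ "$age" -gt "$WARN_AGE" ]; then
        echo "WARNING"
    else
        # Timestamp is current: pass the remotely diagnosed status through.
        echo "$status"
    fi
}
```

For a machine on the shared file system, a call such as evaluate_status /net/nas01/Public/status/<machine name>/stamp would print the status to report. The optional second argument exists only so that the age check can be exercised with a fixed clock.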

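
Section 1.1 states that additional diagnostic tests are added by modifying status_stamp.sh. One possible pattern, shown here as a sketch only, is a pair of helpers placed after overall_status=0 and before the status is written to the stamp file. The helper names raise_status and check_process and the process name in the usage comment are hypothetical, and pgrep is assumed to be installed on the monitored machine.

```shell
overall_status=0

# Raise overall_status to the given level (0 = OK, 1 = WARNING, 2 = CRITICAL).
# The status is never lowered, so checks can run in any order.
raise_status() {
    if [ "$1" -gt "$overall_status" ]; then
        overall_status=$1
    fi
}

# Hypothetical check: raise the status to level $2 when no process with the
# exact name $1 is running (also triggers if pgrep itself is unavailable).
check_process() {
    if ! pgrep -x "$1" > /dev/null 2>&1; then
        raise_status "$2"
    fi
}

# Example use (placeholder process name):
# check_process my_service_daemon 2
```

Because raise_status only ever increases the level, a later, milder check cannot mask an earlier CRITICAL result, which matches the way the example script lets the 95% test override the 90% test.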