Automated System Monitoring Josh Malone Systems Administrator
[email protected] National Radio Astronomy Observatory Charlottesville, VA https://blogs.nrao.edu/jmalone 2 One night, about 8 or 9 years ago, the chiller in our DC failed. Co-worker arrive in the morning to find room was 90F ambient. Quickly set up fans to vent the room. Checked servers - found that main web server had lost both disks in its OS RAID mirror. (15k disks, ran hot) Main page was being served from memory, but the OS was freaking out. We had minimal monitoring scripts. No environment monitoring. No disk health checks. Failure caught us completely by surprise. We decided that we weren’t going to let this happen ever again. Over the next year or so we implemented 2 independent monitoring systems - one for servers/ services and one for environmentals. Set up each system to also monitor the other. WHAT IS AUTOMATED MONITORING? 7 Some sort of dedicated, automatic instrumentation to check services and/or servers Detect and report service problems, server hardware issues Usually provides a central “dashboard” to track problems Can be distributed; but still under control of a central daemon * Diferentiates it from “a bunch of scripts” used to check on things; that doesn’t have the ability to determine cause or eliminate false alarms. Automated Monitoring Workflow 8 Most packages implement this type of workflow Not all packages provide event handlers ack’ing page is important - let’s other admins know that someone is working on the problem so they don’t step on each other’s toes Monitoring Packages: Open Source • • Pandora FMS • Opsview Core • Naemon • • • • • • Captialware ServerStatus • Core • Sensu All Trademarks and Logos are property of their respective trademark or copyright holders and are used by permission or fair use for education.