Local Service Monitoring Status of Linux Operating Systems

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

BACHELOR THESIS

Jakub Svoboda

Brno, Spring 2012

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Jakub Svoboda

Advisor: Mgr. Pavel Tuček

Acknowledgement

I would like to thank my advisor, Mgr. Pavel Tuček, for his patience, guidance, invaluable assistance and encouragement. I would also like to thank Jan Konečný for programming advice in the course of designing the application.

Abstract

The theoretical part of the thesis analyzes methods of monitoring the Linux operating system and the monitoring requirements of the Institute of Computer Science. In the practical part of the thesis, a Linux monitoring application is designed and implemented. The application is developed as a part of the ICS Large Enterprise Monitoring (Lemon) project.

Keywords

Linux, monitoring, Mono, Lemon, LinMon

Contents

1 Introduction
2 Operating system monitoring in general
  2.1 Operating system purpose
  2.2 Reliability of operating system
  2.3 Reasons for monitoring
  2.4 Existing GNU/Linux-compatible solutions
    2.4.1 SYSSTAT
    2.4.2 Dstat
    2.4.3 vmstat
    2.4.4 Collectd
    2.4.5 Munin
    2.4.6 Nagios, Shinken and Icinga
    2.4.7 PCP
    2.4.8 Xymon
3 System monitoring at the Institute of Computer Science
  3.1 Lemon project
    3.1.1 Lemon architecture
      Event generators
      Transport system
      Processing system
      Web service and presentation application
  3.2 Monitoring of operating systems
  3.3 Requirements for monitoring of GNU/Linux machines
    3.3.1 Report format
    3.3.2 Scope of monitoring
    3.3.3 Runtime requirements
    3.3.4 Packaging requirements
  3.4 Suitability of existing applications
4 Implementation of GNU/Linux monitoring application
  4.1 Chosen goals
    4.1.1 Operating system and programming language
    4.1.2 Configuration
    4.1.3 Types of reports
    4.1.4 Monitored areas
      Disk usage
      Iptables
      Network interfaces
      Users
      Groups
      Recent logins (both physical and remote)
      Recent physical logins
      Recent login attempts over ssh (both successful and unsuccessful)
      Installed packages
      Available package updates
      Recent reboots
      Time of last reboot
      Processes using CPU above a set limit
      Information about operating system
      System installation date
  4.2 Application architecture
    4.2.1 Classes and interfaces
    4.2.2 Error handling
    4.2.3 File access and permissions
    4.2.4 Settings file and alternative settings
    4.2.5 Report production
    4.2.6 Report comparison
      Comparison without a primary key
      Comparison with a primary key
5 Testing of LinMon
  5.1 Testing environment
    5.1.1 Personal computers
      GNU/Linux
      Other than GNU/Linux
    5.1.2 ICS server
      Problems that occurred during testing
  5.2 LinMon in production environment
6 Conclusion
Bibliography

1 Introduction

Masaryk University operates a large computer network. The need to manage the network effectively resulted in a computer monitoring project called Lemon. Lemon is a system that collects data about computers using agents installed on those computers and processes the data on centralized servers. It allows Masaryk University to check computer health, localize problems and prevent misuse. Lemon is modular and consists of several components. At this time Lemon can monitor only Windows systems; Linux support is planned. This thesis deals with the Linux monitoring agent.

The second chapter of the thesis describes the monitoring of operating systems in general and Linux-compatible monitoring applications. The third chapter presents system monitoring at the Institute of Computer Science, the requirements for the monitoring application and the suitability of existing applications. The fourth chapter presents the chosen goals and describes the architecture of the developed application. The fifth chapter covers testing and its results. The last chapter evaluates the developed application and its impact on system monitoring at Masaryk University.

The thesis is accompanied by three appendices. Appendix A lists all source code, Appendix B contains class diagrams and Appendix C is an Administrator's Guide, which helps administrators use LinMon.

2 Operating system monitoring in general

2.1 Operating system purpose

Computers and the software they run are complex mechanisms with an inherent risk of failure¹.
Most computers today run an operating system so that multiple applications can run at once, use operating system routines instead of accessing the bare hardware, and use safe mechanisms of inter-process communication.²

2.2 Reliability of operating system

The apparent advantage of running an operating system comes at a cost. A failure in the operating system can leave all applications unable to run properly, or unable to run at all. (An application running on bare metal can fail too, but an operating system adds a further possible point of failure besides the application itself.) Microkernel and nanokernel operating systems try to solve this problem by minimizing the amount of code in the very core of the operating system and running the rest as independent modules that can be restarted after a failure without affecting the rest of the system³. There are even operating systems capable of taking snapshots (containing the state of all running programs, memory and necessary data) and then recovering to the latest snapshot after a failure, with no need to restart programs or check filesystems; they effectively restart quickly and lose only the last few minutes of data. This property is called orthogonal persistence⁴.

However, general-purpose operating systems are not built in such a safe and uninterruptible way. They consist of interdependent components, and a failure can propagate throughout the system. The majority of workstations and a large share of servers run general-purpose operating systems nowadays⁵.

When a failure occurs, manual intervention may be required. The system administrator⁶ must collect all the information needed to find and resolve the problem. This might include:

• Errors logged in a system log.
• Disk usage information. (Is there a full disk that consequently caused the system to fail?)
• Installed software and changes in installed software. (Has new software or a software update caused the failure?)
• Users, groups and changes in users and groups. (Is there a new user who may have run something that caused the failure?)
• Logins to the system. (The system administrator can see who was logged in at the time of the failure and limit the scope of the investigation.)
• Network interfaces. (Was there a change in the network configuration that caused the system to fail?)
• Firewall rules. (Was a firewall rule changed, or does a rule conflict with something?)
• Reboots. (Whether and when the machine was rebooted.)
• The exact version of the operating system. (There might be a bug specific to this OS version.)

¹ Proving the correctness of software ranges from very hard to nearly impossible. Bruce Schneier has an informative post with a discussion on his blog [1].
² Solutions that go against this approach do exist, such as running a single application on bare hardware, or using an exokernel (such as the MIT Exokernel Operating System [2]) merely to multiplex the bare hardware and isolate the applications from each other. However, the former is limited to microcontrollers and embedded systems, and the latter is simply very rare.
³ One such operating system is QNX Neutrino [3].
⁴ Examples of such operating systems are KeyKOS [4] and EROS [5].
⁵ According to W3Techs statistics from March 25th, 2012, "Unix" and Windows web servers together have a 100% market share. http://w3techs.com/technologies/overview/operating_system/all
⁶ For the purpose of this thesis, I will call anyone who is experienced in maintaining and repairing an operating system a "system administrator".

2.3 Reasons for monitoring

The need to collect information frequently and repeatedly makes automation relevant. Beyond that, there are cases where it is useful to monitor the state of the operating system. A history of the states can be inspected
