SERVER HARDWARE HEALTH STATUS MONITORING
Examining the Reliability of a Centralized Monitoring Architecture
Bachelor Degree Project in Computer Science G2E, 22.5 ECTS
Spring term 2018
Victor Jarlow
Supervisor: Dennis Modig
Examiner: Jianguo Ding

Abstract

Monitoring servers over the network is important for detecting anomalies in datacenter servers. Systems management software exists that can receive messages from servers on which such anomalies occur, and network monitoring software is often used to periodically poll servers for their hardware health status. This thesis presents a centralized approach to network monitoring, in which a systems management software receives messages from servers and is in turn polled by a network monitoring software. Through an experiment, the thesis examines the reliability of this centralized approach under varying degrees of traffic, in terms of how accurately it responds when polled and how long it takes to respond with the correct hardware health status. The results of the experiment show that the monitoring architecture is accurate when exposed to a level of load in line with the scalability guidelines offered by the company developing the systems management software, and that for the majority of the measurements the time it takes for a hardware health status to become pollable lies within the interval of 0 to 15 seconds.

Keywords: network monitoring, hardware health status monitoring, centralized, distributed, accuracy, scalability, presentation-time

Table of contents

1 INTRODUCTION
2 BACKGROUND
2.1 SIMPLE NETWORK MANAGEMENT PROTOCOL (SNMP)
2.2 WEB-APPLICATION PROGRAMMING INTERFACE (WEB-API)
2.3 OUT-OF-BAND CONTROLLER
2.3.1 Dell Remote Access Controller (DRAC)
2.4 SYSTEM MONITORING
2.4.1 Network monitoring software
2.5 SYSTEMS MANAGEMENT SOFTWARE
2.5.1 OpenManage Essentials (OME)
2.6 CENTRALIZATION VS DISTRIBUTED SYSTEMS
2.7 MONITORING ARCHITECTURES
2.7.1 Distributed monitoring architecture
2.7.2 Centralized monitoring architecture
2.8 RELATED WORK
3 PROBLEM DESCRIPTION
3.1 MOTIVATION
3.2 AIM
3.3 RESEARCH QUESTION
3.4 LIMITATIONS
4 METHOD
4.1 TESTING ARCHITECTURE RELIABILITY
4.1.1 Triggering alerts
4.1.2 Collecting data
4.1.3 Generating load
4.1.4 Extracting data
4.1.5 Lab environment
4.2 THREATS TO VALIDITY
5 RESULTS
5.1 PILOT TEST
5.2 ACCURACY
5.3 PRESENTATION-TIME
5.3.1 Best-case scenario
5.3.2 Worst-case scenario
6 CONCLUSIONS
7 DISCUSSION
7.1 VALIDITY
7.2 ETHICAL CONSIDERATIONS
7.3 FUTURE WORK
REFERENCES
APPENDIX A – DISTRIBUTED MONITORING PLUGIN
APPENDIX B – CENTRALIZED MONITORING PLUGIN
APPENDIX C – DATA-COLLECTION SCRIPT
APPENDIX D – DATA-EXTRACTION SCRIPT
APPENDIX E – TRAP-GENERATION SCRIPT
APPENDIX F – ACTIVITY DIAGRAMS OF FUNCTIONS IN DATA-COLLECTION SCRIPT
APPENDIX G – PILOT TEST RESULTS
APPENDIX H – EXPERIMENT RESULTS, BASELINE
APPENDIX I – EXPERIMENT RESULTS, LOW LOAD LEVEL
APPENDIX J – EXPERIMENT RESULTS, MEDIUM LOAD LEVEL
APPENDIX K – EXPERIMENT RESULTS, HIGH LOAD LEVEL

1 Introduction

Accurate hardware monitoring of servers is an integral part of properly managing and maintaining a large-scale datacenter: inaccurate hardware monitoring wastes company resources, in the sense that undiscovered alerts can lead to equipment breaking prematurely, while false alarms can lead to employees spending time solving non-existent errors (Barroso, Clidaras & Hölzle, 2013). When monitoring servers over the network, so-called network monitoring software is often used to periodically poll servers for their hardware health status by executing status checks using plugins (Nagios, n.d.). Server manufacturers provide software for centrally managing their products, often called systems management software, which provides functionality such as discovering and inventorying servers, monitoring server health, performing updates, performing remote tasks, and enforcing compliance policies (Zahoor, Qamar & ur Rasool, 2015). Such systems management software can receive alerts from servers about the status of their hardware, which can provide useful detailed information about which component has failed or is about to fail, as well as informational messages such as a threshold value returning to normal. This thesis aims to explore the viability of monitoring servers through such a centralized architecture, in which the systems management software receives alerts from the servers and is in turn polled by a network monitoring software.
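The status checks mentioned above follow the Nagios plugin convention: a plugin prints a one-line status message and reports OK, WARNING, CRITICAL, or UNKNOWN through its exit code (0, 1, 2, or 3). The plugins actually used in this thesis are listed in Appendices A and B; the sketch below is only a minimal Python illustration of that convention, assuming SNMPv2c access via the pysnmp library. The host name, health OID, and status mapping are Dell-style placeholders chosen for illustration, not values taken from the thesis.

#!/usr/bin/env python3
# Minimal Nagios-style plugin sketch: poll one hardware health OID over
# SNMP and map the answer to the standard Nagios exit codes.
import sys
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # Nagios plugin exit codes

HOST = "ome.example.local"   # placeholder: host exposing the health status
# Example Dell-style global system status OID (assumed for illustration).
HEALTH_OID = "1.3.6.1.4.1.674.10892.1.200.10.1.2.1"

def main():
    error_indication, error_status, _, var_binds = next(
        getCmd(SnmpEngine(),
               CommunityData("public", mpModel=1),   # SNMPv2c, community "public"
               UdpTransportTarget((HOST, 161), timeout=5, retries=1),
               ContextData(),
               ObjectType(ObjectIdentity(HEALTH_OID))))
    if error_indication or error_status:
        print("UNKNOWN - SNMP error: %s" % (error_indication or error_status))
        return UNKNOWN
    status = int(var_binds[0][1])
    # Assumed Dell-style status values: 3 = ok, 4 = non-critical,
    # 5 = critical, 6 = non-recoverable.
    if status == 3:
        print("OK - hardware health is normal")
        return OK
    if status == 4:
        print("WARNING - non-critical hardware condition")
        return WARNING
    print("CRITICAL - hardware health status %d" % status)
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())

The exit code is what the network monitoring software records as the check result; the printed line is what an operator sees in its interface.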
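The alerts flowing in the other direction are SNMP traps. In the experiment such traps are generated by a script (Appendix E, not reproduced here); purely as an illustration of the mechanism, the following sketch sends a single SNMPv2c trap with pysnmp, using a placeholder receiver address and the generic coldStart trap OID in place of a vendor-specific hardware alert OID.

#!/usr/bin/env python3
# Minimal sketch: send one SNMPv2c trap to a systems management server,
# emulating the kind of unsolicited hardware alert a server would emit.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, NotificationType, ObjectIdentity,
                          sendNotification)

TRAP_RECEIVER = "ome.example.local"   # placeholder: the trap destination

error_indication, _, _, _ = next(
    sendNotification(
        SnmpEngine(),
        CommunityData("public", mpModel=1),         # SNMPv2c community
        UdpTransportTarget((TRAP_RECEIVER, 162)),   # standard trap port
        ContextData(),
        "trap",                                     # unconfirmed notification
        # Generic coldStart trap OID; a real hardware alert would carry a
        # vendor-specific OID and accompanying varbinds instead.
        NotificationType(ObjectIdentity("1.3.6.1.6.3.1.1.5.1"))))

if error_indication:
    print("Trap not sent:", error_indication)
else:
    print("Trap sent")

Note that "trap" requests unconfirmed delivery, so the sender receives no acknowledgement; SNMP also supports confirmed "inform" notifications.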