SERVER HARDWARE HEALTH STATUS MONITORING Examining the reliability of a centralized monitoring architecture

Bachelor Degree Project in Science G2E, 22,5 ECTS Spring term 2018

Victor Jarlow

Supervisor: Dennis Modig
Examiner: Jianguo Ding

Abstract

Monitoring of servers over the network is important for detecting anomalies in servers in a datacenter. Systems management software exists that can receive messages from servers on which such anomalies occur, while network monitoring software is often used to periodically poll servers for their hardware health status. A centralized approach to network monitoring is presented in this thesis, in which a systems management software receives messages from servers and is polled by a network monitoring software. Through an experiment, this thesis examines the reliability of the centralized monitoring approach in terms of how accurate its response is, as well as the time it takes to respond with the correct hardware health status when polled while affected by varying degrees of traffic. The results of the experiment show that the monitoring architecture is accurate when exposed to a level of load which is in line with the scalability guidelines offered by the company developing the systems management software, and that the time it takes for a hardware health status to become poll-able lies within the interval 0 to 15 seconds for the majority of the measurements.

Keywords: network monitoring, hardware health status monitoring, centralized, distributed, accuracy, scalability, presentation-time

Table of contents

1 Introduction
2 Background
  2.1 Simple Network Management Protocol (SNMP)
  2.2 Web-Application Programming Interface (Web-API)
  2.3 Out-of-band controller
    2.3.1 Dell Remote Access Controller (DRAC)
  2.4 System monitoring
    2.4.1 Network monitoring software
  2.5 Systems management software
    2.5.1 OpenManage Essentials (OME)
  2.6 Centralization vs distributed systems
  2.7 Monitoring Architectures
    2.7.1 Distributed monitoring architecture
    2.7.2 Centralized monitoring architecture
  2.8 Related work
3 Problem description
  3.1 Motivation
  3.2 Aim
  3.3 Research question
  3.4 Limitations
4 Method
  4.1 Testing architecture reliability
    4.1.1 Triggering alerts
    4.1.2 Collecting data
    4.1.3 Generating load
    4.1.4 Extracting data
    4.1.5 Lab environment
  4.2 Threats to validity
5 Results
  5.1 Pilot test
  5.2 Accuracy
  5.3 Presentation-time
    5.3.1 Best-case scenario
    5.3.2 Worst-case scenario
6 Conclusions
7 Discussion
  7.1 Validity
  7.2 Ethical considerations
  7.3 Future work
References
Appendix A – Distributed monitoring plugin
Appendix B – Centralized monitoring plugin
Appendix C – Data-collection script
Appendix D – Data-extraction script
Appendix E – Trap-generation script
Appendix F – Activity diagrams of functions in data-collection script
Appendix G – Pilot test results
Appendix H – Experiment results, baseline
Appendix I – Experiment results, low load level
Appendix J – Experiment results, medium load level
Appendix K – Experiment results, high load level

1 Introduction

Accurate hardware monitoring of servers is an integral part of properly managing and maintaining a large-scale datacenter, since inaccurate hardware monitoring can lead to a waste of company resources: undiscovered alerts can lead to equipment breaking prematurely, and false alarms can lead to employees having to spend time solving non-existent errors (Barroso, Clidaras & Hölzle, 2013).

When monitoring servers over the network, what is called network monitoring software is often used to periodically poll servers for their hardware health status by executing status checks using plugins (n.d.).

Server manufacturers provide software for centrally managing their products, often called systems management software, which provide functionality such as: discovering and inventorying servers, monitoring the health of servers, performing updates, performing remote tasks and enforcing compliance policies (Zahoor, Qamar & ur Rasool, 2015). This systems management software can receive alerts from servers about the status of the hardware, which can provide useful detailed information about what component has failed or if a component is about to fail, as well as informational messages such as a threshold value returning to normal.

This thesis aims to explore the viability of monitoring servers through an integration with a central systems management software, in terms of how reliable it is. Two different implementations, one centralized and one distributed, will be presented. The distributed architecture will be used to verify the centralized one. The presentation-time, accuracy and scalability will be examined in an experiment in which a systems management software is polled several times, in order to draw conclusions about how these three factors are influenced by the monitoring being performed through the systems management software. The experiment is carried out by performing a series of actions automated through scripting.


2 Background

This chapter will explain concepts which help the reader understand later parts of the thesis. Technical and non-technical concepts as well as related research will be presented.

2.1 Simple Network Management Protocol (SNMP)

SNMP is a networking protocol operating in the application layer of the internet protocol suite, designed for collecting and organizing information, as well as modifying parameters related to devices on IP networks. Examples of devices that often support SNMP include routers, switches, servers, workstations, printers and more (Mauro & Schmidt, 2005).

SNMP messages are sent using the transport layer protocol UDP, which is a stateless protocol, meaning that no acknowledgements are sent over the network notifying each of the communicating parties that the data sent has been received. SNMP messages contain the version of SNMP used, the community string, which is like a password that allows access to the agent, and the Protocol Data Unit (PDU), which contains the bulk of the information transferred in an SNMP exchange (Mauro & Schmidt, 2005).

In SNMP terminology the monitored object is called an element, the system doing the monitoring is called a Network Management Station (NMS) and the software running within the element is called an agent.

Another concept central to SNMP is the notion of a Management Information Base (MIB). A MIB is a hierarchical tree structure which contains objects. The way the MIB is structured is decided by the Structure of Management Information version 1 or 2 (SMIv1 or SMIv2). The SMI decides how the objects should be named, either numerically or in human-readable form. The encoding of the objects is also described by the SMI, in order to provide a standard way for devices of different make to be able to communicate. What can be thought of as the address of an object in the MIB is called an Object Identifier (OID). An OID is a sequence of integers which traverses a global tree. The tree consists of a root node connected to labeled nodes via edges. Each node may, in turn, have child nodes of its own, which are labeled with an integer (McCloghrie & Rose, 1990). An example OID would be 1.3.6.1.4.1.42, where 1 represents iso, 3 represents org, 6 represents dod, 1 represents internet, 4 represents private, 1 represents enterprises, and 42 would represent the enterprise which has reserved that integer for use with its own SNMP-capable products.

SNMP exists in three versions (v1, v2c and v3). Depending on which version of SNMP is used, different operations can be carried out by an agent or NMS. Two of the operations available in SNMPv1, v2c and v3 for retrieving information are get and trap. The get operation is carried out when an NMS polls an agent. The NMS sends a getRequest-PDU, in which it requests the information contained within an object with a certain OID, and the agent responds with the information contained within that object through a Response-PDU. The trap operation is carried out when the element wants to notify the NMS of some event happening, for example a hardware component failing. The element asynchronously sends a Trap-PDU containing the object of a certain OID and its value to the NMS, which is configured to constantly listen for such messages (Mauro & Schmidt, 2005).
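As an illustration of the get operation (a generic sketch, not one of the scripts used in this thesis), the following Python example uses the pysnmp library to send a getRequest-PDU for a single OID; the agent address, community string and OID are placeholder values:

from pysnmp.hlapi import (getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity)

# Placeholder agent address, community string and OID (sysDescr.0 from MIB-2)
error_indication, error_status, error_index, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData('public', mpModel=1),             # SNMPv2c
           UdpTransportTarget(('192.0.2.10', 161)),        # agent IP and port
           ContextData(),
           ObjectType(ObjectIdentity('1.3.6.1.2.1.1.1.0')))
)

if error_indication:
    print('No response from agent:', error_indication)     # e.g. request timed out
else:
    for oid, value in var_binds:
        print(oid.prettyPrint(), '=', value.prettyPrint())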


2.2 Web-Application Programming Interface (Web-API)

Richardson and Amundsen (2013) describe an API as a set of methods of communication between different software components. It provides a way for one piece of software to access or modify information in another piece of software. APIs may be implemented for web-based systems (using web-based protocols such as HTTP or HTTPS), operating systems or databases, for example.

Web-APIs employ a server-client architecture, where a client requests to modify or read from a certain resource offered by the server in a way that abstracts the client from the computations performed to carry out the request. In order to retrieve monitoring information using a web-API, an HTTP GET request can be made to a resource. The HTTP standard defines methods, or verbs, to be performed on a resource residing on a webserver. Some examples of methods include GET, PUT, POST and DELETE. The GET method should only retrieve data, having no other effects (Fielding, Gettys, Mogul, Frystyk, Masinter, Leach & Berners-Lee, 1999).

The data being retrieved can be formatted using different data formats, such as XML (eXtensible Markup Language). XML is a standard for serialization, meaning that it is a way to convert objects or data structures into a format that can be stored in for example a file, or to be transmitted over a network link, to then be deserialized, i.e. converted back into an object or data structure (Microsoft, 2015).
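As a small, generic illustration of serialization and deserialization (the element names below are made up and not taken from any product discussed in this thesis), the following Python sketch builds an XML document and parses it back using the standard library:

import xml.etree.ElementTree as ET

# Serialize: build an XML document describing a (made-up) server status object
root = ET.Element('Server', attrib={'name': 'example-server'})
ET.SubElement(root, 'HealthStatus').text = 'Normal'
xml_bytes = ET.tostring(root, encoding='utf-8')   # serialized form, ready to store or transmit

# Deserialize: convert the XML back into objects and read the value
parsed = ET.fromstring(xml_bytes)
print(parsed.get('name'), parsed.findtext('HealthStatus'))   # example-server Normal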

2.3 Out-of-band controller

An OOB (Out-Of-Band) controller is the hardware and its corresponding software which enables the management and monitoring of a device without the device ever having booted. This can be beneficial for troubleshooting an error which is preventing a server from booting, or when a non-operating system component of a device needs to be configured remotely (Wijekoon et al., 2011). Access to the different components of a computer is gained through the wired or wireless LAN controller, as described by Figure 1.

Figure 1. Intel AMT hardware architecture (Wijekoon, J., et al., 2011, p. 411)

The OOB controller is connected to sensors which lets it collect information about for example temperature, power supply, memory devices, etc.

Server manufacturers often create their own proprietary OOB management systems, like Dell’s iDRAC (integrated Dell Remote Access Controller), HP’s iLO (integrated Lights-Out) and Intel’s AMT (Active Management Technology). These OOB management systems offer a wide array of features which can be used to manage and monitor devices.

2.3.1 Dell Remote Access Controller (DRAC)

DRAC is the proprietary OOB management system offered by Dell. The management system is sold either as a separate expansion card or comes integrated on the server motherboard. The controller has its own memory, processor, network interface and access to the system bus. DRAC offers features such as power management, remote console abilities and virtual media access, available through a command-line interface or web browser.

The DRAC interfaces with the baseboard management controller (BMC) and is based on the Intelligent Platform Management Interface (IPMI) 2.0 standard (Dell, 2006).

Examples of server components which can be configured or monitored using the iDRAC management system include temperatures, memory, network controllers, fans and more (Dell, 2017).


2.4 System monitoring

In computer systems monitoring there are two scopes of monitoring:

The first scope is single system monitoring, in which the scope of the monitoring is limited to a specified computer. The monitoring is performed by software running on the system being monitored. The software collects information about, for example, CPU and memory utilization or the status of hardware components. The software can also collect and present information about services and processes active on the monitored system (Monit, 2018).

The second scope is networked monitoring, in which a system monitors several network-connected computers. The monitoring system either polls the computers at a set interval by sending out messages which the computers respond to with the requested information, or it receives messages on a per-event basis, meaning that the monitored subject (the computer) sends a message if it recognizes that, for example, a hardware component has started failing (Limoncelli, Hogan and Chalup, 2007).

Limoncelli et al. (2007) mention that two primary types of monitoring exist: real-time monitoring and historical monitoring. The purpose of historical monitoring is to collect utilization or other statistical data from systems, such as CPU usage. Monitoring of this sort has use cases related to anomaly detection, usage-based billing and presenting data in a graph to visualize past resource utilization.

Real-time monitoring, which is the type of monitoring this thesis will examine closer, reports if a host is down or if some kind of problem has arisen. If a problem arises in a server, an alert is sent to a monitoring system. This makes real-time monitoring a prime tool for reactive system administration, allowing system administrators to spring into action and tend to any detected problem as soon as it is discovered.

2.4.1 Network monitoring software

Network monitoring software is special software that monitors devices connected to the network. One such software is called Nagios. According to Nemeth, Snyder, Hein & Whaley (2011), Nagios specializes in real-time monitoring of error conditions. Its strength is its modular, heavily customizable configuration system that allows custom scripts to be written to monitor any conceivable metric. Nagios also has the capability to present the status of servers in a way that gives an overview of problems in the network, and on which host they have occurred. Alerts can also be sent to network administrators by email or pager, and tickets in case-management systems can be generated automatically.

Using Nagios terminology, the scripts are called “plugins”, which are standalone extensions that make it possible to monitor anything. A plugin executes a script or a program and then returns the results to Nagios Core (Nagios, n.d.). Nagios Core is the foundation upon which many monitoring projects are built; some examples include GroundWork, OP5 and Icinga (PandoraFMS, 2017). In a typical use case of a network monitoring software, plugin executions are scheduled, leading to periodic polling of the desired resources.


An example of plugin execution relevant to this thesis would be executing a plugin which checks the hardware health status of a server every 5 minutes.
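A plugin communicates its result to Nagios Core through its exit code (0 = OK, 1 = Warning, 2 = Critical, 3 = Unknown) together with a one-line status text on standard output. A minimal, hypothetical plugin written in Python (not one of the plugins found in the appendices) could look as follows:

#!/usr/bin/env python
# Minimal, hypothetical Nagios-style plugin: the exit code carries the state,
# standard output carries a one-line status message.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_hardware():
    # Placeholder for a real status check (e.g. an SNMP poll or a web-API call)
    return OK, 'No hardware errors.'

if __name__ == '__main__':
    state, message = check_hardware()
    label = {OK: 'OK', WARNING: 'Warning', CRITICAL: 'Critical', UNKNOWN: 'Unknown'}[state]
    print('Hardware %s: %s' % (label, message))
    sys.exit(state)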

2.5 Systems management software

A systems management software is a piece of software which centrally manages and monitors hosts. As stated by Zahoor, Qamar & ur Rasool (2015), a systems management software can be used to handle tasks such as monitoring activities of hardware and software, including resource utilization and alarm systems which help in diagnosing and fixing any faults that arise during the operation of the datacenter, as well as provisioning servers and configuring network devices. Management of the inventory of hardware and software is also possible. Simply put, a systems management software can be described as a platform in which several servers can be inventoried, and a simple overview of all servers is given to the user. Provisioning servers can be made less time-consuming by utilizing templates.

2.5.1 OpenManage Essentials (OME)

OME is the proprietary systems management software offered by Dell. It offers several capabilities for managing the hardware of multiple Dell servers. OME is a one-to-many systems management software for Dell hardware and other devices. With OME it is possible to:

• Discover and inventory a server.
• Monitor the health of the server.
• View and manage system alerts.
• Perform system updates and remote tasks.
• View hardware inventory and compliance reports (Dell Support, 2018a).

Dell’s OME offers an API and a graphical user interface (see figure 2). This API can be used to integrate other systems with OME (Dell EMC, 2018). OME also requires a database in which it stores alerts and general information about servers, and from which the API retrieves the information it presents.

Figure 2. The graphical user interface of Dell Openmanage Essentials (Dell Support, 2014)


2.6 Centralization vs distributed systems

Centralization can be described as the action or process of bringing activities together in one place (Merriam-Webster, 2018). A distributed system can be described as a system whose components, located on different networked computers, communicate and coordinate their actions by passing messages (Coulouris, Dollimore, Kindberg & Blair, 2012).

By introducing centralized points in a system, performance can become a bottleneck. As stated by Coulouris, Dollimore, Kindberg & Blair (2012), the predecessor to DNS (the Domain Name System) relied on a single server which kept a master record, available for download by anyone wishing to resolve a hostname. When the internet grew, the single server could not serve all the computers wishing to download the master record, leading to the service being unavailable. DNS removed this bottleneck by partitioning the master record between many different servers. Distributing the workload between several servers minimizes the work each server has to do as the internet grows, as it is a solution that scales better.

Limoncelli et al. (2007) argue that centralization does not innately improve efficiency; what improves it is standardization, a by-product of centralization, which in turn can improve efficiency. This means that a system with central points can allow for greater standardization, thus improving efficiency in other aspects than pure computational performance, for example the efficiency of only having to communicate with one interface (the centralized point) instead of potentially numerous different ones.


2.7 Monitoring Architectures

In order to perform an experiment examining the reliability of a centralized monitoring architecture, such an architecture had to be implemented. The implemented monitoring architecture is designed for use with Dell’s proprietary hardware and software, including the iDRAC and OpenManage Essentials. The choice to examine the reliability of a centralized monitoring architecture using Dell’s proprietary systems management software and out-of-band controllers was made since, according to Statista (2018), Dell holds a market share of 17.5% of all server system vendors, meaning that the conclusions drawn in this thesis can be useful for many decision-makers at datacenters debating whether to implement a centralized monitoring architecture or not.

In the distributed monitoring architecture, the information needed to poll a server is retrieved by directly contacting each server’s OOB controller, while in the centralized monitoring architecture the information needed to poll a server is retrieved through a systems management software. In the centralized architecture, the systems management software acts as a central point through which all information flows (see figure 3).

Figure 3. Illustration depicting where server information is retrieved from in the centralized- and distributed monitoring architectures


2.7.1 Distributed monitoring architecture

The distributed monitoring architecture consists of three components:

• A network monitoring software
• A plugin script written in bash (see Appendix A)
• A server with an iDRAC

The distributed monitoring architecture is distributed in the sense that there is a single network monitoring software polling several servers directly. For every network monitoring software there are several servers with iDRAC.

As can be seen in figure 4, a network monitoring software executes the plugin. The plugin then sends a GetRequest-PDU to the iDRAC of the monitored server, requesting the hardware health status of the server. The server responds with a Response-PDU. The response-PDU contains an integer that corresponds with a severity level.

Dell (2018) defines three different severity levels for hardware-related errors: Informational, Warning and Critical. An informational status indicates that an operation completed successfully or that a threshold value returned to normal. A warning status indicates that an event has occurred which may not be significant but may lead to future problems, such as a warning threshold being crossed. A critical status indicates that a significant event has occurred, which may lead to the immediate loss of data or function.

The plugin receives the Response-PDU, processes it to interpret what the hardware health status is, and forwards the result to the network monitoring software.

When polling a server using the distributed plugin, the iDRAC may return either an “OK” or a “Critical” return-code. If the server is unresponsive or non-existent, the plugin sets the return-code to “Unknown”.
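The actual distributed plugin is written in bash (see Appendix A); the decision logic it implements can be sketched in Python as below. The severity value treated as healthy is an assumption made for this illustration and depends on the iDRAC MIB used:

import sys

OK_SEVERITIES = {3}   # assumed value reported by the iDRAC when the system is healthy

def to_nagios_result(severity):
    # severity is the integer from the Response-PDU, or None if no response was received
    if severity is None:
        return 3, 'Unknown: server unresponsive or non-existent'
    if severity in OK_SEVERITIES:
        return 0, 'OK: no hardware errors'
    return 2, 'Critical: hardware error reported by the iDRAC'

if __name__ == '__main__':
    code, message = to_nagios_result(3)   # made-up severity value for demonstration
    print(message)
    sys.exit(code)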

Figure 4. An example of the network monitoring software executing the plugin to monitor a server – distributed architecture


2.7.2 Centralized monitoring architecture

The centralized architecture consists of four components:

• A network monitoring software
• A plugin script written in Python (see Appendix B)
• A systems management software (OpenManage Essentials)
• A server with an iDRAC

The centralized approach is centralized in the sense that there is a single network monitoring software that polls one or more systems management software instances (in this case OME), which in turn have information about the servers saved in a database.

A poll of a server in the centralized monitoring architecture works as follows (see Figure 5): The network monitoring software executes the plugin, which performs an API call, requesting the information of a specific server to be sent back to the plugin as an XML-document. The plugin parses this XML-document and extracts the hardware health status of the server as well as other information, such as if any warranty has expired, or an alert message corresponding to the hardware health status of the server.

Figure 5. An example of the network monitoring software executing the plugin to poll a server – centralized architecture

Dell EMC (2018) defines five status codes which are used by OME to describe a server’s hardware health status: 0, 2, 4, 8 and 16. These integer values map to the descriptions “None”, “Unknown”, “Normal”, “Warning” and “Critical” respectively. That means that if a server has its hardware health status set to 8 in the OME database, then it has the status “Warning”. The centralized plugin may return “Normal”, “Warning” or “Critical” if the server has been discovered by OME and no unforeseen error has occurred. If the server has not been discovered, the plugin will return the status code “Unknown”.
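A condensed Python sketch of the centralized plugin’s flow is shown below. The endpoint URL and the XML element name are hypothetical placeholders; the real plugin (see Appendix B) uses the actual OME API schema and handles authentication and additional fields such as warranty information:

import sys
import xml.etree.ElementTree as ET
import requests

STATUS_MAP = {0: 'None', 2: 'Unknown', 4: 'Normal', 8: 'Warning', 16: 'Critical'}
EXIT_CODE = {'Normal': 0, 'Warning': 1, 'Critical': 2}

def check_server(ome_host, device_name, user, password):
    # Hypothetical endpoint and element name, for illustration only
    url = 'https://%s/api/OME.svc/Devices' % ome_host
    response = requests.get(url, params={'name': device_name},
                            auth=(user, password), verify=False, timeout=30)
    root = ET.fromstring(response.content)
    code = int(root.findtext('.//GlobalStatus', default='2'))   # assumed element name
    status = STATUS_MAP.get(code, 'Unknown')
    return EXIT_CODE.get(status, 3), 'Hardware %s' % status

if __name__ == '__main__':
    exit_code, message = check_server('ome.example.com', 'server-01', 'user', 'password')
    print(message)
    sys.exit(exit_code)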


OME receives information from servers and places it in its database using two methods (see Figure 6): The first method involves a server sending an SNMP-trap to OME when a hardware error has occurred. The second method involves OME polling all servers. The OME status polls are executed periodically and can detect offline servers, or servers with network errors which prevent SNMP-traps from being sent.

Figure 6. OpenManage Essentials poll as well as a server trap


2.8 Related work

Several papers which are related to the monitoring of datacenters exist. Differences between SNMP-based monitoring and web-API-based monitoring are explored by Pras, Drevers, van de Meent & Quartel (2004) in their study, which examines the differences in bandwidth consumption with and without compression, resource usage and round-trip delay between XML/SOAP and SNMP. Text objects were collected using SNMP agents for SNMP and a custom web services application written in C for the web-API-based monitoring.

They concluded that SNMP performs better both when retrieving many objects and when retrieving a single object when observing how much bandwidth was used. However, the web services proved to perform better when compression is utilized and a large quantity of objects is retrieved. The CPU time needed for coding SNMP messages proved to be 3 to 7 times less than that of coding the web services (XML). The round-trip delay proved to be comparable between the two. However, Pras et al. (2004) did not examine the reliability of the two protocols, which is part of what this thesis will examine.

Lu, An, Du, Cao & Jiang (2015) present a platform for datacenter monitoring based on the network monitoring software Nagios. Their platform makes use of plugins that collect information about hardware temperature, CPU statistics and running services. In their research they develop a prototype collecting this information and storing it in the Nagios database. They present a web interface in which the data is visualized and presented for users. Lu et al. (2015) did not perform any type of performance or stability testing on their prototype, which could have helped determine the usefulness of the systems monitoring software.

Issariyapat, Pongpaibool, Mongkolluksame & Meesublak (2012) also present a solution for what they claim is a better network monitoring system than most modern monitoring systems. Nagios is used as groundwork in their prototype, which consisted of building a tool for visualizing network topologies. The prototype tool visualizes the status of links between network equipment, as well as the status of the hosts themselves. They make use of the colors green, yellow and red for host and link status in their web interface in order to make their system easily understandable and user-friendly. Their prototype tool which is called NetHAM (Net Health Analysis and Monitor) is developed using the PHP language and open source technology such as JQuery, AJAX and Adobe Flex 3. They did not test the reliability of their tool, which could have been useful to determine the performance of the tool.


3 Problem description

This chapter will detail the motivation, aim, research question and limitations of the thesis.

3.1 Motivation

Server availability is very important to the organisation or company running a datacenter, since the servers perform computations or contain data which is of importance to the organisation’s customers. Because of this, an error should be reported by the monitoring architecture so that the right measures to mitigate it can be taken by appropriate personnel, so as not to deny the customers the service they require (Barroso, Clidaras & Hölzle, 2013).

The monitoring system should be accurate, since a server reporting no error when there is an error, or a server reporting an error when there is none, can be expensive in terms of equipment not being able to deliver a service to customers and in terms of wasted time of system administrators trying to fix a non-existent error.

Scalability is an important factor to consider when deciding what monitoring architecture to implement. According to Neuman (1994) a system is said to be scalable if it can handle the addition of users or resources without noticeable loss of performance or increase in administrative complexity. Neuman (1994) continues to argue that system load is affected by scale, and that as a system gets bigger, the amount of data that must be managed by network services grows, as does the total number of requests for service. With that in mind it is important to choose a monitoring architecture that scales well with a potential increase of monitored servers or a large influx of alarms happening at the same time.

A distributed approach and a centralized approach to monitoring differ in that distributed approaches generally perform better in regard to reliability, since the computing resources are spread out over several servers (Coulouris et al., 2012). A centralized approach may provide a single interface to a wide array of different hardware models of monitored subjects, enabling the use of a single plugin for monitoring instead of having to maintain one plugin for each OOB controller hardware version, i.e. iDRAC 6, iDRAC 7, iDRAC 8 etc. Out of the two monitoring architectures presented in this thesis, the centralized variant provides more detailed output, which can be useful for service operators when troubleshooting hardware errors.

Therefore, it is interesting to investigate if the usage of a centralized monitoring architecture can be reliable as well, thus allowing the use of only one monitoring plugin to monitor several hardware versions, providing more information and providing stable monitoring.


The aspects related to reliability that will be examined in this thesis include:

• Accuracy: how often the monitoring method is correct in what it outputs (no false positives or incorrect alerts).

• Presentation-time: the monitoring solution cannot take too much time presenting a hardware health status change; if it does, inconsistencies may arise, leading to a worsened quality of monitoring.

• Scalability: how accurate the monitoring method is when many servers are failing simultaneously.

3.2 Aim

This thesis aims to evaluate whether a centralized approach to monitoring is viable to use, in terms of how accurate it is, how long it takes to respond and how well it scales. The conclusions drawn will provide a decision basis for decision-makers at datacenters when choosing between implementing a centralized or a distributed architecture.

3.3 Research question

This thesis will answer the following research question:

RQ: How often does the centralized monitoring architecture detect a correct and timely response when the hardware health status changes in the monitored servers?

The research question will be answered by conducting an experiment designed to test the accuracy of the centralized architecture. The experiment involves triggering a change of hardware health status in the iDRAC of a server between the hardware health statuses “OK” and “Warning”, which causes a trap to be sent to OME. The centralized plugin is then executed to see if the centralized architecture discovers the error in a correct manner, as well as how long it took to discover. The result is then written to a text-file. This procedure is repeated for 1000 iterations, meaning that 1000 measurements of correctness and timeliness are taken.

The experiment will test the correctness and timeliness of the architecture when there is no load disturbing the system, as well as with varying degrees of load being simulated, to be able to test how the architecture scales.

Upon completion of all iterations of the tests, relevant data will be extracted and analyzed to be able to draw conclusions from the results.


3.4 Limitations

Some limitations had to be made due to the nature of the systems tested in this thesis and the fact that the thesis had to be completed within a deadline:

• Since OME can only be installed on the Windows Server operating system, this is the operating system that is going to be used for the systems management software.
• Since it is not possible to integrate a network monitoring software like Nagios with other components in the experiment, like the iDRAC, in a way that makes it possible to perform the experiments, CentOS was installed on a machine, instead of a network monitoring software, to poll the centralized architecture.
• The CentOS machine will also be used to generate traps. This is due to the fact that the package “net-snmp” has a stable working package in the default CentOS repositories.
• Experimental scripts that require the execution of other scripts will be written in the scripting language bash, to allow for easy integration with the utilities in the Linux shell. As Cooper (2014) describes, shell scripting follows the classic UNIX philosophy of breaking complex projects into simpler subtasks and chaining together components and utilities.
• Experimental scripts that do not have to integrate with the utilities in the Linux shell will be written in the Python scripting language. Kalb (2016) describes Python as a general-purpose programming language that can be used in many different fields because of its versatility, often used by system administrators to automate tasks.
• Hardware and software bound to the server manufacturer Dell will be used in this experiment, since there was not enough time to implement scripts which make use of hardware and software bound to other server manufacturers.


4 Method

In this chapter, the experiment to be conducted is described.

The tools used as well as the lab environment will be explained in greater detail to give the reader an insight into how the experiment can be replicated and verified.

4.1 Testing architecture reliability

To be able to answer the research question, the reliability of the centralized architecture will be tested and verified using automation in the form of scripts which will carry out actions without the intervention of human operators. As described by Wohlin et al. (2012), the basic principle of reliability testing is that when you measure a phenomenon twice, the outcome shall be the same. The methods presented intend to provide a way to measure just that.

According to Guo, Pohl & Gerokostopoulos (2013) the non-parametric binomial equation is often used in practice to determine the sample size of a test when using a risk control approach. Using this approach, the sample size needed to achieve a certain confidence level and a certain reliability can be determined.

In the experiment, the number of iterations the data-collection script was to run to take measurements was chosen using the non-parametric binomial equation:

1 - CL = \sum_{i=0}^{f} \binom{n}{i} (1 - R)^{i} R^{n-i}

where CL is the test confidence level, R is the reliability requirement, f is the number of allowed failures and n is the sample size.

Using the values CL = 0.99 and R = 0.99, for a confidence level of 99% and a reliability requirement of 99%, the f values 1, 2, 3 and 4 give n values of 661, 838, 1001 and 1157. If 0 failures are allowed, the formula becomes 1 - CL = R^n, in which case the n value equates to 459. The number of iterations was chosen to be 1000, since this would demonstrate a 99% reliability at a 99% confidence level in the case of 0, 1, 2 or 3 failures.
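For reference, the sample sizes above can be reproduced with a short Python calculation of the non-parametric binomial equation (a sketch, independent of the experiment scripts):

from math import comb

def min_sample_size(cl, r, f):
    # Smallest n such that sum_{i=0}^{f} C(n, i) * (1 - r)^i * r^(n - i) <= 1 - cl
    n = f + 1
    while sum(comb(n, i) * (1 - r) ** i * r ** (n - i) for i in range(f + 1)) > 1 - cl:
        n += 1
    return n

for f in range(4):
    print(f, min_sample_size(0.99, 0.99, f))
# prints n values of roughly 459, 662, 838 and 1001 for f = 0, 1, 2 and 3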

4.1.1 Triggering alerts

To be able to measure reliability there must be some way to induce and resolve artificial errors, i.e. change the hardware health status of the server in the iDRAC, to have some sort of expected result which the output of each of the architectures can be compared to. The iDRAC has support for executing commands through a shell called ‘racadm’, which stands for remote access controller admin (Dell EMC, 2016). This shell allows commands to be executed on the controller for monitoring or modification purposes. The following commands are executed to change the server hardware health status:

racadm -r "hostname" -u "username" -p "password" sensorsettings set iDRAC.Embedded.1#SystemBoardInletTemp -level max 7 # set inlet temp status warning


racadm -r "hostname" -u "username" -p "password" sensorsettings set iDRAC.Embedded.1#SystemBoardInletTemp -level max 40 # set inlet temp status ok

The “-r”, “-u” and “-p” flags decide which hostname the command should be issued to, as well as which username and password will be used to authenticate when connecting to racadm. The “sensorsettings set” part of the command means that the subcommand for adjusting a sensor threshold value will be executed (Dell EMC, 2016). This subcommand takes a FQDD (Fully Qualified Device Descriptor), which in this case is the FQDD for the inlet temperature sensor. The acceptable level before the iDRAC triggers an alert is decided by the “-level max n” part of the command, where n is an integer that represents the limit in degrees Celsius at which the iDRAC should change its hardware health status to “Warning”. In the case of this script, n is either 7 or 40, since the temperature of the server used in the experiment ranged between 23 and 25 degrees Celsius. If the experiment is to be conducted under conditions where the temperature of the server is in a different range, other values may have to be set.

4.1.2 Collecting data

To enable the collection of statistical data needed to answer the research question, a data-collection script was produced (see Appendix C). The purpose of the script is to:

1. Induce a hardware health status change in the iDRAC.
2. Check that said hardware health status was correctly set.
3. Execute the check of the centralized plugin.
4. If the check is incorrect, retry until timeout.
5. Write the results of the checks to a text-file together with a timestamp.

The script works by first setting the hardware health status of the server in the iDRAC using racadm. Changing the hardware health status in racadm sometimes does not change the state of the server in the MIB of the SNMP agent in the iDRAC, which it must do for a trap to be sent alerting OME of the change. This behaviour is unreliable.

Therefore, the hardware change is first introduced, then compared to the output of the distributed plugin. If the output of the distributed plugin doesn’t match the desired status, the hardware health status of the server is set to the opposite of the desired status, and then back to the desired status.

For example: if the script is trying to set the hardware health status of the server to “OK”, it sends a command to racadm to change the hardware health status to “OK”. The script uses the “sleep” program in bash, which stops the script from executing further instructions for a set amount of time (ss64, n.d.). The script sleeps for 5 seconds to allow the hardware health status change to propagate, and then checks the distributed plugin (i.e. sends a GetRequest-PDU for the hardware health status of the server). If the distributed plugin check shows that the hardware health status is “Critical”, then the script sends a command to racadm to change the hardware health status to “Critical”, and then a command to change the hardware health status


back to “OK”. The process is repeated until the hardware health status is set to the desired status both in racadm and the SNMP agent of the iDRAC.

When the status is successfully set, a line containing the current time and which status has been set (i.e. “OK” or “Warning”) is written to a text-file.

The script sleeps for 5 seconds, after which it executes the centralized plugin to see if the hardware health status change has propagated to OME. If the hardware health status change did not propagate to OME, the script sleeps for another 5 seconds and then checks again, and continues to do so for a total of 60 times, for a maximum of 300 seconds (5 minutes). If the hardware health status change still has not propagated to OME, the current time and the string “Did not detect hardware change” are written as a line to the text-file. If, however, the hardware health status change does propagate to OME, the current time and the output of the centralized architecture are written to the text-file.

This limit of retrying for 5 minutes was chosen to prevent the script from getting stuck in an infinite loop in the case that a hardware health status change never gets discovered by OME.
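The data-collection script itself is written in bash (see Appendix C); its retry logic can be sketched in Python as follows, where check_centralized() is a hypothetical stand-in for executing the centralized plugin and comparing its output to the expected status:

import time
from datetime import datetime, timezone

MAX_RETRIES = 60     # 60 attempts, 5 seconds apart, gives the 300-second (5-minute) limit
SLEEP_SECONDS = 5

def wait_for_ome(check_centralized, logfile):
    for _ in range(MAX_RETRIES):
        time.sleep(SLEEP_SECONDS)
        if check_centralized():
            logfile.write('%s hardware health status change detected\n'
                          % datetime.now(timezone.utc).isoformat())
            return True
    logfile.write('%s Did not detect hardware health status change\n'
                  % datetime.now(timezone.utc).isoformat())
    return False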

Figure 7. Diagram describing the sequence of events as they happen in an iteration of the data-collection script.

The experiment is carried out according to figure 7, which shows how the different components of the experiment work together. The numbers within parentheses describe in which sequence the events are performed.

Activity diagrams of the functions which make up the data-collection script can be found in Appendix F.


Example output from the data-collection script when the centralized architecture’s plugin is correct can look as follows:

2018-04-15 03:34:43+02:00 threshold set to: 40

2018-04-15 03:35:03+02:00 Hardware OK: No hardware errors.

2018-04-15 03:35:23+02:00 threshold set to: 7

2018-04-15 03:35:43+02:00 Hardware Warning: Message: The system inlet temperature is greater than the upper warning threshold

Each line has a prepended timestamp. In this example the “40”, or “OK” state is induced in the iDRAC on line 1, followed by the check by the centralized architecture on line 2. On line 3 the iDRAC is set to display a warning, lastly the centralized architecture’s plugin is run which also detects that the server displays a warning.

Example output from the data-collection script when the centralized plugin is incorrect looks as follows:

1 2018-04-15 03:35:23+02:00 threshold set to: 40

2 2018-04-15 03:40:25+02:00 Did not detect hardware health status change

The “OK” status is induced in the iDRAC on line 1; on line 2 the centralized plugin failed to detect the hardware change within the 60-retries limit.

While the script is running, the program Wireshark captures packets coming in to the OME machine with the capture filter dst port 162 and src host ip.address.of.iDRAC. This allows the incoming SNMP-trap packets from the source address of the iDRAC to be captured. Once the packets have been captured they can be exported from Wireshark. The data is exported to a JavaScript Object Notation (JSON) file. JSON is a type of data-serialization format (Microsoft, 2017).
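As an illustration of how such an export can be consumed (assuming the standard structure of Wireshark’s JSON export, where each packet carries a frame.time_epoch field), the arrival times of the captured traps can be read as follows; the file name is a placeholder:

import json

with open('capture.json') as f:       # placeholder file name for the exported capture
    packets = json.load(f)

# Arrival time (Unix epoch, seconds) of each captured SNMP-trap packet
trap_times = [float(pkt['_source']['layers']['frame']['frame.time_epoch'])
              for pkt in packets]
print('%d traps captured, first at epoch %.3f' % (len(trap_times), trap_times[0]))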


4.1.3 Generating load

To be able to test how the centralized architecture performs with regards to reliability under load, several servers failing in rapid succession have to be simulated. In order to accomplish this, a script which sends SNMP-traps to OME, called the trap-generation script (see Appendix E), was produced. The trap-generation script is started by the data-collection script and is subsequently stopped when the data-collection script finishes.

A custom SNMP-trap, designed to have the same contents as a real SNMP-trap coming from a failing server, is sent to OME using the program “snmptrap”, which can use the SNMP trap operation to send information (Net-snmp, 2009). A trap is sent with the OID 1.3.6.1.4.1.674.10892.5.3.2.1.0.2089, which is the OID of the alertNetworkWarning trap in the iDRAC MIB (Dell Support, 2018b). This trap contains the objects found at OIDs 1.3.6.1.4.1.674.10892.5.3.1.1 through 1.3.6.1.4.1.674.10892.5.3.1.12. Each time a trap is sent to OME, the information of the 12 objects is sent with it. The following command is used to send said trap:

snmptrap -v 2c -c public $TRAP_RECEIVER '' 1.3.6.1.4.1.674.10892.5.3.2.1.0.2089 \
    1.3.6.1.4.1.674.10892.5.3.1.1 s "$counter" \
    1.3.6.1.4.1.674.10892.5.3.1.2 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.3 s "3" \
    1.3.6.1.4.1.674.10892.5.3.1.4 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.5 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.6 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.7 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.8 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.9 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.10 s "testtrap" \
    1.3.6.1.4.1.674.10892.5.3.1.11 s "testtrap"

The -v flag is used to specify the version of SNMP to be used to send the trap. In this case version 2c is used. The -c specifies which community string is used for authentication. The variable $TRAP_RECEIVER specifies the IP address or hostname of the target that should receive the trap.

The object at OID 1.3.6.1.4.1.674.10892.5.3.1.1 contains the alert message ID, the value of this object is set to be whatever the counter variable is, which starts out at 0, and is incremented by 1 for each trap sent. The rest of the objects contain the text “testtrap”, with the exception of 1.3.6.1.4.1.674.10892.5.3.1.3 which is set to 3, which is the error code for a critical event.

The trap-generation script is started and killed by the data-collection script (see Appendix C). When the script is killed, it writes the contents of the counter variable to a text-file (see Appendix E), thus allowing the total number of traps sent to be used later to calculate the average number of traps sent per second.

The tests will be performed under certain load-levels according to table 1.


Load level   Sleep time between traps (sec)   Measurements   Estimated traps/sec   Estimated traps/minute
Baseline     -                                1000           0                     0
Low          0.1                              1000           8.33                  500
Medium       0.05                             1000           14.29                 900
High         0.02                             1000           25                    1500

Table 1. Number of hardware health status changes, sleep time between traps sent and load level.

The load level ranges from “Low” to “High”.

In the “Low” load level, a trap is sent, then the script sleeps for 1/10 of a second, i.e. 0.1 seconds.

In the “Medium” load level, a trap is sent, then the script sleeps for 1/20 of a second, i.e. 0.05 seconds.

In the “High” load level, a trap is sent, then the script sleeps for 1/50 of a second, i.e. 0.02 seconds.

The load levels chosen were based on the reasoning that it takes around 20 milliseconds (ms) to send a trap, meaning that it would take 120 ms to send each trap when a 0.1 second sleep time is introduced between traps. This would cause 8.33 traps to be sent each second. 8.33 traps being sent per second equates to around 500 traps sent per minute on average. A corresponding estimate was calculated for every load level, as can be seen in table 1.
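The estimates in table 1 can be reproduced with a small calculation, assuming the roughly 20 ms transmission time per trap stated above (the medium value comes out slightly below the rounded figure in the table):

SEND_TIME = 0.02   # assumed time in seconds to send one trap

for level, sleep_time in [('Low', 0.1), ('Medium', 0.05), ('High', 0.02)]:
    per_second = 1 / (SEND_TIME + sleep_time)
    print('%s: %.2f traps/sec, about %.0f traps/minute' % (level, per_second, per_second * 60))
# Low: 8.33 traps/sec (about 500/min), Medium: 14.29 (about 857/min), High: 25.00 (1500/min)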

According to Dell Support (2015), a medium deployment size is used in this experiment for OME (see chapter 4.1.5). Up to 500 monitored devices are recommended for the medium deployment size. This would mean that, at the high load level of around 1500 traps per minute, the simulated scenario corresponds to each of 500 monitored servers changing hardware health status roughly three times per minute.


4.1.4 Extracting data

Since it would be cumbersome for a human to look through all the lines of the text-file in order to extract relevant data and perform calculations, a data-extraction script was created (see Appendix D). It takes in both the text-file generated by the data-collection script and the corresponding JSON-file containing the Wireshark capture for the set of measurements. The purpose of the script is to:

• Count the number of measurements
• Count how many times the hardware health status change was/was not discovered
• Calculate the total runtime of the test
• Calculate the time it took for OME to present the hardware status change of each measurement, and calculate the absolute error of each measurement

The number of measurements taken is calculated by looking at the size of the array which contains all the matches.

The lines of the text-file are parsed to determine if any line contains the text “Did not detect hardware health status change”. If the text is found, the measurement is counted as the hardware health status change not having been discovered, meaning that the result of this measurement is incorrect.

The runtime is calculated by subtracting the first matched timestamp from the last.

The results text-file is compared to an exported JSON-formatted version of the Wireshark capture. Since the capture contains the time at which each trap arrived at OME, this can be used to calculate how long it took from the trap arriving at OME until the status was poll-able by the network monitoring software. For each line pair, the time it took for the centralized plugin to respond to the poll from the network monitoring software is calculated, as well as the uncertainty interval (absolute error). Calculating the absolute error is necessary due to the fact that the data-collection script incorporates sleeping to avoid overloading OME, with no set interval between polls. This is done in the script with the following lines:

discoverydelta = (checkCent - 2.5 - ((checkDistri - trapSentToOME) / 2)) - trapSentToOME
absoluteError = 2.5 + ((checkDistri - trapSentToOME) / 2)

The checkCent variable contains the timestamp when the centralized plugin polled OME. The checkDistri variable contains the timestamp of when the distributed plugin was used to verify that the correct hardware health status was set in the SNMP agent of the iDRAC. The trapSentToOME variable contains the timestamp of when OME received the SNMP-trap from the iDRAC.
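As a worked example with hypothetical timestamps (not taken from the experiment data): if OME received the trap at t = 100.0 s, the SNMP agent of the iDRAC was verified at t = 103.0 s and the centralized plugin returned the correct status at t = 110.0 s, the calculation gives:

trapSentToOME = 100.0   # hypothetical: OME received the SNMP-trap
checkDistri = 103.0     # hypothetical: SNMP agent of the iDRAC verified
checkCent = 110.0       # hypothetical: centralized plugin returned the correct status

discoverydelta = (checkCent - 2.5 - ((checkDistri - trapSentToOME) / 2)) - trapSentToOME
absoluteError = 2.5 + ((checkDistri - trapSentToOME) / 2)
print(discoverydelta, absoluteError)   # 6.0 and 4.0, so the true value lies in [2.0, 10.0]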


4.1.5 Lab environment

The Ericsson GIC in Linköping volunteered their hardware, consisting of a Dell server with an OOB controller and network infrastructure, on which the experiment could be carried out, which is why their production environment was chosen as the location to conduct the experiment. Although the experiments in this thesis are conducted at the Ericsson GIC in Linköping, they can be replicated in other environments using similar systems and methods.

The hardware in the lab-setup consists of a Dell PowerEdge R220 server, running an ESXi Hypervisor, where the iDRAC as well as two virtual machines are used. All virtual machines are connected by a virtual switch inside ESXi. The iDRAC interface is connected to a management network, which is reachable from the hypervisor.

Figure 8. Hardware used in the experiment as well as the steps performed

The hardware of the PowerEdge R220 server consists of an Intel Xeon E3-1220 v3 CPU @ 3.10GHz, 16GB of DIMM DDR-3 single-bit ECC RAM @ 1600MHz and a 1TB hard drive.

The network monitoring software is a virtual machine on which the CentOS 7 distribution with Linux kernel version 3.10.0-693.21.1.el7.x86_64 was installed. It was given 4 vCPUs, 512 MB of RAM and 16 GB of storage. The network monitoring software also sent traps when testing load levels low through high.

OME was installed on top of a Windows Server 2016 64-bit (version 1607, OS build 14393.693) installation. The version of OME used is 2.4.0.930. It was given 4 vCPUs, 6 GB of RAM and 100 GB of storage. According to Dell Support (2015), this fits the recommended deployment size of up to 500 devices. The database which OME uses is located on the same virtual machine and is a Microsoft SQL Server 2014, version 12.2.5000.0.

No setting was changed in OME, thus ensuring that the default configuration was used. The PowerEdge R220 server was discovered using WS-MAN. There were no other inventoried servers. The iDRAC is also configured with the default settings. All components (OME, iDRAC, network monitoring software) had their clocks synced using the same central NTP server.


4.2 Threats to validity

Wohlin et al. (2016) describe some threats to validity which apply to qualitative experiments related to statistical data; these are described below.

Fishing and error rate

When a researcher is looking for a specific outcome, she may influence the results by either tampering with the data or by choosing to ignore data, also known as fishing for a certain result. If that is the case, the analyses become invalid.

Reliability of measures

Bad instrumentation and bad instrumentation layout can affect the validity of an experiment. An example includes: Lines of code are more reliable than function points since it does not involve human judgement.

Random irrelevancies in experimental setting

Elements outside the experimental setting may disturb the experiment, such as noise or sudden interrupts.

History

In an experiment, different treatments may be applied to the same object at different times. The risk that the history affects the outcome of the results then becomes apparent, since the circumstances are not the same on both occasions.


5 Results

This chapter will detail the results obtained using the aforementioned method.

5.1 Pilot test

The initial tests showed that the centralized architecture was more inaccurate than expected (see figure 9). This was discovered to be due to a semantic error in the data-collection script, since no check was made to ensure that the value was set correctly in the iDRAC. Initially, this was the function that was used:

changeTempTresholdRacadm() {
    # set the iDRAC thermal sensor threshold
    /opt/dell/srvadmin/sbin/racadm -r "hostname" -u "username" -p "password" \
        sensorsettings set iDRAC.Embedded.1#SystemBoardInletTemp -level max $1
}

The function simply sets the inlet temperature threshold to the value of the "$1" variable, which is either 7 for a “Warning” alert or 40 for an “OK” alert. There was no mechanism for checking that the intended value actually was set in the iDRAC and that the hardware health status change had propagated to the SNMP agent. This resulted in the centralized architecture failing to produce the correct output at times. An example would be that the centralized architecture failed to detect the hardware health status change at the timestamp 0:0:1:43. 45 more such cases can be observed in Appendix G.

Figure 9. Pilot test results for the reliability experiment


The function was updated to check that the correct value was set, as described in chapter 4.1.2. To make sure that the value set in racadm propagates to the SNMP agent, a 5-second sleep time was introduced.

The following function was first used to check the results of the centralized architecture:

checkCentralizedAndWriteResultsToFile() {
    # execute the centralized plugin, write a timestamp and its output to a file
    python $pluginspath"check_dell_ome_hardware.py" -u "USER47" -p "USER48" -H "seliics00668" \
        | (echo -n `date --rfc-3339=seconds`" " && cat) >> $triggerpath"results_"$filename".log"
}

This was later changed since the distributed plugin allows an error to be discovered some seconds faster than the centralized plugin, because in the centralized architecture the iDRAC needs to send an SNMP-trap to OpenManage Essentials, which in turn needs some time to present this hardware change and make it available to poll through the API. In the distributed architecture the iDRAC just needs to update its SNMP agent to reflect the state of the server (see figures 3 and 4). This led to implementing a sleep in the script for 5 seconds after the hardware health status has been confirmed to have propagated to the SNMP agent. If the error is not detected within 5 seconds, the script keeps sleeping and retrying for at most 60 times; if the hardware health status change has still not been detected, it is assumed that the centralized architecture has failed, since having no cut-off point could lead to the script being stuck in an infinite loop. The pilot tests showed that in the majority of the measurements, the centralized architecture managed to respond on the first, second or third try, which is why the arbitrary cut-off point of 60 attempts was chosen.


5.2 Accuracy

The results of the experiment conducted show that in the baseline and the low and medium load levels, the centralized monitoring architecture managed to respond to the polling of the network monitoring software within the 60-retries limit for all 1000 hardware health status changes. In the high load level there was one hardware health status change which the centralized monitoring architecture failed to acknowledge when polled, as can be seen in figure 10.

Figure 10. Accuracy experiment results

5.3 Presentation-time

There was some variation in the time it took to present the correct hardware health status when polled in the different load levels. Important to note is that the presentation-time presented can vary, as described by the equation

p_{1,2} = (r - c) - 2.5 - \frac{s - c}{2} \mp \left( 2.5 + \frac{s - c}{2} \right)

where r is the time at which the centralized plugin returned a value, s is the time at which the data-collection script checked the SNMP agent of the iDRAC and a value was returned, and c is the time at which OME received an SNMP-trap from the iDRAC. The true result of each measurement lies somewhere in the interval [p_1, p_2], since the script sleeps for 5 seconds before checking the SNMP agent of the iDRAC and since the script sleeps 5 seconds after the hardware health status has been changed before the poll is performed.

The absolute error of each measurement in each load level can be seen in the corresponding appendix (Appendix H through K).

To be able to perform any kind of statistical analysis on the results obtained, a best-case and a worst-case scenario of the results were calculated, by subtracting the absolute error from the results for the best-case scenario and adding it to the results for the worst-case scenario.

5.3.1 Best-case scenario

Figure 11. Averages and standard deviations of the best-case scenario for each load level. Error bars represent the standard deviation

As can be seen in figure 11, for the best-case scenario the baseline had the lowest average presentation-time and the high load level had the highest. The high load level also had the highest standard deviation, 11.81 seconds. The averages of the baseline and the low and medium load levels are similar in that they all lie close to 2 seconds (see table 2). Worth noting is that most of the measurements lie in the range 0 to 15 seconds presentation-time, with the exception of one or two measurements in the baseline and in the low and high load levels.

Load level | Min presentation-time (sec) | Max presentation-time (sec) | Average presentation-time (sec) | Standard deviation (sec) | Measurements between 0 and 15 sec | Traps per second
Baseline | 0 | 19 | 2.03 | 2.19 | 99.9% | 0
Low | 0 | 47 | 2.22 | 2.84 | 99.8% | 8.75
Medium | 0 | 13 | 2.13 | 2.35 | 100% | 14.92
High | 0 | 364 | 3.42 | 11.81 | 99.8% | 27.14

Table 2. Summary of the results in the best-case scenario


5.3.2 Worst-case scenario

The mean and standard deviation for the worst-case scenario follow the trend in the best-case scenario, with the exception of a shorter average presentation-time for the medium load level.

Figure 12. Averages and standard deviations of the worst-case scenario for each load level. Error bars represent the standard deviation

Load level | Min presentation-time (sec) | Max presentation-time (sec) | Average presentation-time (sec) | Standard deviation (sec) | Measurements between 0 and 15 sec | Traps per second
Baseline | 7 | 26 | 9.54 | 2.14 | 98.9% | 0
Low | 7 | 54 | 9.58 | 2.86 | 98.6% | 8.75
Medium | 3 | 18 | 6.81 | 2.32 | 99.4% | 14.92
High | 7 | 373 | 10.85 | 11.83 | 93.8% | 27.14

Table 3. Summary of the results in the worst-case scenario


6 Conclusions

The aim of this thesis was to examine the reliability of a centralized monitoring architecture, in terms of its accuracy, scalability and presentation-time, which was accomplished by performing an experiment.

The research question posed was: “How often does the centralized monitoring architecture detect a correct and timely response when the hardware health status changes in the monitored servers?”.

The experimental results for the baseline and the low and medium load levels show that there were 0 incorrect measurements. According to the non-parametric binomial equation presented in chapter 4, the conclusion can be drawn that, with a 99% confidence level, the centralized monitoring architecture is 99.5% reliable when introducing a medium load level of around 15 traps per second. Since the high load level failed once, it is possible to conclude that, with a 99% confidence level, the centralized monitoring architecture is 99.3% reliable when exposed to around 27 traps per second for the duration of the test. A sketch of this calculation is given below.
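The sketch below assumes that the non-parametric binomial equation referred to in chapter 4 is the standard binomial reliability bound; under that assumption it reproduces the figures quoted above (99.5% with zero failures and 99.3% with one failure out of 1000 measurements, at a 99% confidence level). The helper names are chosen for illustration only.

def binom(n, k):
    # binomial coefficient, computed iteratively
    result = 1
    for i in range(k):
        result = result * (n - i) // (i + 1)
    return result

def reliability_lower_bound(n, failures, confidence):
    # Largest reliability R such that observing at most `failures` failures in
    # n trials is this unlikely: P(F <= failures | R) = 1 - confidence.
    target = 1.0 - confidence
    def cumulative(r):
        return sum(binom(n, i) * (1 - r) ** i * r ** (n - i)
                   for i in range(failures + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):                 # bisection; cumulative() grows with r
        mid = (lo + hi) / 2.0
        if cumulative(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(reliability_lower_bound(1000, 0, 0.99))   # ~0.995 (baseline, low and medium)
print(reliability_lower_bound(1000, 1, 0.99))   # ~0.993 (high load level)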

The discrepancy of 1 out of 1000 in the high load level occurred because OME failed to present the hardware health status change in its API: measurement no. 182 did not get fully processed by OME. This was verified by looking at the Wireshark capture, which showed that the trap was indeed sent from the iDRAC and received by OME, meaning that the failure to make the hardware health status change poll-able occurred after OME had received the trap.

The experiment showed that when the centralized architecture experiences load, its presentation-time increases, and so does the variability of the presentation-time. In the high load level an average of around 27 traps was sent each second, that is, around 1620 traps every minute. 4 vCPUs and 6 GB of RAM were allocated to the OME instance, which fits the “medium” recommended deployment size of 500 devices, as described by Dell Support (2015). Spread over 500 devices, 1620 traps per minute corresponds to each server changing hardware health status roughly three times per minute on average, which is well above the number of hardware health status changes that would occur under normal conditions in a real-life situation, though this type of behaviour may occur during specific events that affect the hardware health status of all servers simultaneously.


7 Discussion

The experiment conducted in this thesis successfully produced results for two scenarios, the best-case and the worst-case scenario, from which conclusions could be drawn.

More precise results would have allowed more solid conclusions about the presentation-time to be drawn. The precision of the results would have increased if the sleep-time used in the data-collection script had been reduced by a few seconds. The choice of a 5-second sleep before executing the centralized plugin was somewhat arbitrary, but some sleep time had to be chosen, since polling OME too often would have introduced additional load on OME, which might have skewed the results. That being said, a polling interval of 1 second would have been more suitable, and would have been chosen if there had been more time to re-do the experiment.

The sleep timer controlling the wait time after setting the iDRAC error, which was also set to 5 seconds, could have been reduced as well, but not by too much, since the hardware health status change would then not have time to propagate to the SNMP agent of the iDRAC before it is polled, which would lead to the hardware health status being reset indefinitely.

The biggest disadvantage of the centralized architecture is that it introduces a single point of failure. Both monitoring architectures presented in this thesis have a single point of failure in that the network monitoring software has to execute all plugins and process the returned information, meaning that if the network monitoring software stops working, all network monitoring is unavailable, granted that there is no redundancy. What the centralized architecture does is that it introduces a second single point of failure, which the distributed monitoring architecture does not have. This means that if the systems management software stops working, the centralized monitoring architecture is unable to perform its task of monitoring the servers. This can be prevented by having some degree of redundancy, for example deploying two or more systems management software instances on separate hardware, in separate physical locations. This is something to take into consideration when deciding whether to implement this style of monitoring.

7.1 Validity

Some threats to validity which apply to the type of experiment conducted in this thesis were mentioned in chapter 4.2; they are discussed further in this chapter.

Precautions were taken to ensure that I remained impartial throughout the experimentation process, and that no fishing for certain results took place. After the data-collection script was finalized, all results were presented in this written report; there was no selection of the “best” or “worst” results.

To mitigate validity threats that have to do with human error when, for example, calculating or analyzing data, the experiment in this thesis was performed using scripting and automation. This means that the experiment is more likely to provide results that reflect reality, in that no human could have made an error when measuring.

An isolated testing environment was used to mitigate the threat of unexpected events while taking the measurements, such as congestion in the network or other disturbances.

The data for each of the load levels were collected sequentially, one level after the other, to reduce the threat to validity associated with measuring the same object at different times.

7.2 Ethical considerations

Since the experiment conducted in this thesis did not involve humans in any way, the methodology used did not harm or make anyone uncomfortable.

The results and conclusions drawn in this thesis can be used by any company that wishes to implement a centralized monitoring solution. This means that this thesis promotes inclusion, in that anyone can make use of the information presented. Since the centralized monitoring architecture presented in this thesis can increase the verbosity of the alerts sent by servers, it can also make the workplace less stressful for operators working with monitoring where such an architecture is implemented.

7.3 Future work

Since this thesis focused on the reliability of a centralized monitoring architecture using Dell hardware, it would be interesting to find out the results of a similar experiment conducted using proprietary hardware and software from other manufacturers.

The experiment in this thesis was conducted using 1000 measurements. An experiment where more measurements are taken per experiment could yield a more accurate result.

As the SQL database resided on the same physical hardware, it would be interesting to investigate how the results of the experiment would turn out if another database setup was used, perhaps over the network. This would change the way OME stores and retrieves data and could therefore change the outcome of the experiment.

It would be interesting to examine how security-related aspects affect the reliability of network monitoring. OME and iDRAC have the capability to make use of SNMPv3 traps, which support encryption with DES, 3DES and AES. The additional overhead of encrypting and decrypting each trap would tax the hardware more than sending messages in plain text. An experiment could be conducted in which the reliability is tested with and without the use of encryption.

Since this experiment did not examine the CPU, RAM and I/O usage of the different components, and how the different load levels affected them, an experiment which examines the resource usage of each component at different load levels could yield interesting results. An example of such an experiment would be using SNMP to periodically poll the hypervisor for resource information for each VM operating inside it.


References

Cooper, M. (2014). Advanced Bash-Scripting Guide: An in-depth exploration of the art of shell scripting. Retrieved May 23, 2018 from: http://ldp.huihoo.org/LDP/abs/abs-guide.pdf

Coulouris, G., Dollimore, J., Kindberg, T. & Blair, G. (2012). Distributed Systems: Concepts and Design (5. Ed.) Boston: Pearson Education. Retrieved April 2, 2018 from: http://www.gecg.in/papers/ds5thedn.pdf

Dell. (2006). Exploring the DRAC 5: The Next-Generation Dell Remote Access Controller. Retrieved April 8, 2018 from: http://www.dell.com/downloads/global/power/ps3q06-20060118-McGary.pdf

Dell. (2017). Advantages of iDRAC & iSM (Out-of-band) and OMSA (in-band) tools in Systems Management functions for Dell EMC PowerEdge servers. Retrieved February 28, 2018 from: http://en.community.dell.com/techcenter/extras/m/white_papers/20444257/download

Dell. (2018). Dell OpenManage SNMP Reference Guide Version 8.1. Retrieved May 01, 2018 from: http://www.dell.com/support/manuals/se/sv/sebsdt1/integrated-dell-remote-access-cntrllr-8-with-lifecycle-controller-v2.00.00.00/snmp_ref_guide-v5/understanding-trap-severity?guid=guid-ae65587b-4aa2-4c9a-a84c-1508b4d6da3d&lang=en-us

Dell EMC. (2016). iDRAC 8/7 v2.40.40.40 RACADM CLI Guide. Retrieved May 5, 2018 from: http://topics-cdn.dell.com/pdf/idrac7-8-lifecycle-controller-v2.40.40.40_Reference-Guide_en-us.pdf

Dell EMC. (2018). REST API Guide – OpenManage Essentials. Retrieved May 02, 2018 from: http://downloads.dell.com/manuals/all-products/esuprt_software/esuprt_ent_sys_mgmt/openmanage-essentials-v24_white-papers_en-us.pdf

Dell Support. (2014). Dell OpenManage Essentials Version 2.0.1 User's Guide. Retrieved May 14, 2018 from: http://topics-cdn.dell.com/pdf/dell-ome-v2.0.1_users-guide_en-us.pdf

Dell Support. (2015). Minimum requirements for OpenManage Essentials. Retrieved May 14, 2018 from: http://topics-cdn.dell.com/pdf/dell-openmanage-essentials-v2.1_reference-guide_en-us.pdf

Dell Support. (2018a). Dell OpenManage Essentials. Retrieved May 2, 2018 from: http://www.dell.com/support/contents/us/en/04/article/product-support/self-support-knowledgebase/enterprise-resource-center/systemsmanagement/ome


Dell Support. (2018b). Dell OpenManage SNMP Reference Guide for iDRAC and Chassis Management Controller. Retrieved May 18, 2018 from: http://www.dell.com/support/manuals/us/en/04/dell-opnmang-srvr-admin-v8.2/snmp_idrac8-v4/system-trap-group?guid=guid-d84a905f-d3d5-4ad3-ba5b-d300d67dfee3&lang=en-us

Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. & Berners-Lee, T. (1999). Hypertext Transfer Protocol -- HTTP/1.1. Retrieved April 17, 2018 from: https://tools.ietf.org/html/rfc2616

Guo, Pohl & Gerokostopoulos. Determining the Right Sample Size for Your Test: Theory and Application. Retrieved May 4, 2018 from: https://pdfs.semanticscholar.org/2cd1/ff702d6bda9e320c9f03627d15218a965c94.pdf

Issariyapat, C., Pongpaibool, P., Mongkolluksame, S. & Meesublak, K. (2012). Using Nagios as a Groundwork for Developing a Better Network Monitoring System. Proceedings of PICMET '12: Technology Management for Emerging Technologies, 2771-2777. Retrieved April 2, 2018 from: http://ieeexplore-ieee-org.libraryproxy.his.se/stamp/stamp.jsp?tp=&arnumber=6304293

Kalb, I. (2016). Learn to Program with Python. Retrieved May 24, 2018 from: https://link-springer-com.libraryproxy.his.se/content/pdf/10.1007%2F978-1-4842-2172-3.pdf

Limoncelli, A. T., Hogan, J. C. & Chalup, R. S. (2007). The Practice of System and Network Administration (2. Ed.). California: Pearson Education, Inc.

Lu, C., An, X., Du, G., Cao, J. & Jiang, Z. (2015). A Platform for Gathering Multilayer Runtime Data in a Datacenter. 2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS), 449-452. Retrieved April 2, 2018 from: http://ieeexplore-ieee-org.libraryproxy.his.se/stamp/stamp.jsp?tp=&arnumber=7339094

Mauro, R. D. & Schmidt, J. K. (2005). Essential SNMP (2. Ed.). California: O’Reilly Media. Retrieved March 6, 2018 from: http://www.reedbushey.com/124Essential%20SNMP%202nd%20Edition.pdf

McCloghrie, K. & Rose, M. (1988). Structure and identification of management information for TCP/IP-based internets. Retrieved April 15, 2018 from: https://tools.ietf.org/html/rfc1155

Merriam-Webster. (2018). Centralize. Retrieved May 22, 2018 from: https://www.merriam-webster.com/dictionary/centralization

Microsoft. (2015). Serialization (C#). Retrieved April 17, 2018 from: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/serialization/

Microsoft. (2017). How to: Serialize and Deserialize JSON Data. Retrieved May 18, 2018 from: https://docs.microsoft.com/en-us/dotnet/framework/wcf/feature-details/how-to-serialize-and-deserialize-json-data


Monit. (2018). Monit Documentation. Retrieved April 17, 2018 from: https://mmonit.com/monit/documentation/monit.html

Nagios. (n.d.). Nagios Plugins. Retrieved May 4, 2018 from: https://www.nagios.org/projects/nagios-plugins/

Nemeth, E., Snyder, G., Hein, R. T. & Whaley, B. (2011). UNIX and Linux System Administration Handbook (4. Ed.). Boston: Pearson Education.

Net-snmp. (2009). SNMPTRAP. Retrieved May 5, 2018 from: http://www.net-snmp.org/docs/man/snmptrap.html

PandoraFMS. (2017). Top 16 best network monitoring tools for 2016. Retrieved May 17, 2018 from: https://blog.pandorafms.org/network-monitoring-tools/

Pras, A., Drevers, T., van de Meent, R. & Quartel, D. (2004). Comparing the Performance of SNMP and Web Services-Based Management. IEEE Transactions on Network and Service Management, 1, 72-82. Retrieved March 9, 2018 from: http://ieeexplore-ieee-org.libraryproxy.his.se/stamp/stamp.jsp?tp=&arnumber=4798292

Richardson, L. & Amundsen, M. (2013). RESTful Web APIs. California: O’Reilly Media. Retrieved April 15, 2018 from: http://sd.blackball.lv/library/RESTful_Web_APIs_(2013).pdf

Ss64. (n.d.). sleep. Retrieved May 18, 2018 from: https://ss64.com/bash/sleep.html

Statista. (2018). Quarterly market share held by server system vendors worldwide from 2009 to 2017. Retrieved May 9, 2018 from: https://www.statista.com/statistics/269396/global-market-share-held-by-server-system-vendors-since-1st-quarter-2009/

Wijekoon, J., Wijesundara, M., Dassanayaka, T., Samarathunga, D., Dissanayaka, R. & Perera, D. (2011). The Advanced Remote PC Management Suite. Paper presented at the 6th International Conference on Industrial and Information Systems, Sri Lanka. 411. Retrieved February 28, 2018 from: http://ieeexplore-ieee-org.libraryproxy.his.se/stamp/stamp.jsp?tp=&arnumber=6038103

Wohlin, C., Runeson, P., Höst, M., Ohlsson, C. M., Regnell, B. & Wesslén, A. (2012). Experimentation in Software Engineering. Retrieved April 15, 2018 from: https://link-springer-com.libraryproxy.his.se/content/pdf/10.1007%2F978-3-642-29044-2.pdf

Zahoor, B., Qamar, B. & ur Rasool, R. (2015). Handbook of Data Centers. New York: Science+Business Media. Retrieved March 7, 2018 from: https://link-springer-com.libraryproxy.his.se/content/pdf/10.1007%2F978-1-4939-2092-1.pdf


Appendix A – Distributed monitoring plugin

#!/bin/bash

# Declare EXIT-codes
OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3

# Make sure hostname is sent as first argument
if [ $# -ne 1 ]; then
    echo "usage: $0 "
    exit $UNKNOWN
fi

# Split host & domain-name and append -o or -ilo
HOST=$(echo $1 | cut -d. -f1)
DOMAIN=$(echo $1 | cut -d. -f2-)
HOSTNAME_GIC=$(host "$HOST"-o."$DOMAIN")
RETVAL_GIC=$?
HOSTNAME_GIC=$(echo "$HOSTNAME_GIC" | awk 'END{print $4}')
HOSTNAME_HUB=$(host "$HOST"-ilo."$DOMAIN")
RETVAL_HUB=$?
HOSTNAME_HUB=$(echo "$HOSTNAME_HUB" | awk 'END{print $4}')
if [ $RETVAL_GIC -eq 0 ] ; then
    RESULT=$(snmpwalk -v 2c -c public "$HOSTNAME_GIC" 1.3.6.1.4.1.674.10892.2.2.1.0)
    if [ $? -eq 1 ] ; then
        echo "UNKNOWN: iLO-interface not reachable"
        exit $UNKNOWN
    fi
elif [ $RETVAL_HUB -eq 0 ] ; then
    RESULT=$(snmpwalk -v 2c -c public "$HOSTNAME_HUB" 1.3.6.1.4.1.674.10892.2.2.1.0)
    if [ $? -eq 1 ] ; then
        echo "UNKNOWN: iLO-interface not reachable"
        exit $UNKNOWN
    fi
else
    echo "CRITICAL: $1 not found in DNS"
    exit $CRITICAL
fi

# Strip result to a single integer
RESULT=$(echo $RESULT | awk '{print $4}')

# Send SNMP-result to naemon
if [ "$RESULT" != "3" ] ; then
    echo CRITICAL: Hardware Error=$RESULT
    exit $CRITICAL
else
    echo OK: Hardware OK
fi


Appendix B – Centralized monitoring plugin

#!/usr/bin/env python

#dependencies
import requests
import base64
import xmltodict
import urllib3
import argparse
import time
import subprocess
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

#handle arguments
parser = argparse.ArgumentParser()
parser.add_argument('-u','--username', help='Username for ome (retrieved from resources.cfg)',required=True)
parser.add_argument('-p','--password', help='Password for ome (retrieved from resources.cfg)',required=True)
parser.add_argument('-H','--hostname', help='Hostname for dell server',required=True)
parser.add_argument('-w', '--warranty', action="store_true", default=False, help='Specifies that warranty warnings should be output')
args = vars(parser.parse_args())
warrantybool=args['warranty']
username=args['username']
password=args['password']
hostname=args['hostname']
hostname = hostname.split('.')[0]

#Get username and password from textfile
cmd = "grep "+ username + " /opt/monitor/etc/resource.cfg | awk -F= '{print $2}'"
ps = subprocess.Popen(cmd,shell=True,stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
output1 = ps.communicate()[0]
username=output1.strip()
cmd = "grep "+ password + " /opt/monitor/etc/resource.cfg | awk -F= '{print $2}'"
ps = subprocess.Popen(cmd,shell=True,stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
output2 = ps.communicate()[0]
password=output2.strip()

#Constants
statuscodes={
    4:0,   #Normal (OME:4/OP5:0)
    8:1,   #Warning (OME:8/OP5:1)
    16:2,  #Critical (OME:16/OP5:2)
    2:3    #Unknown (OME:2/OP5:3)
}
base64encoded = base64.b64encode(username+":"+password)
header={'Authorization':'Basic ' + base64encoded}
hostname=hostname.split(".")[0]


#Full path of the file containing all dell server servicetag:id pair file
path_to_device_file = "/opt/contrib-plugins/hydra_dell.json"

def checkWarrantyDaysRemaining (warrantydict):
    daysremaining=int(warrantydict['DaysRemaining'])
    if daysremaining > 30:
        string=""
    elif 1 <= daysremaining <= 30:
        string="Warranty warning: \"" + warrantydict['WarrantyDescription'] + "\" will end within 30 days at the date: " + warrantydict['EndDate'] + ". "
    elif daysremaining <= 0:
        string="Warranty warning: \"" + warrantydict['WarrantyDescription'] +"\" ended at the date: " + warrantydict['EndDate'] + ". "
    return string

#if warrantydicts is a str, then it's been called knowing the host dosnt exist in OME.
def performRequest (url,header,warrantydicts,timedifference):
    try:
        r=requests.get(url, headers=header, verify=False, timeout=30)
    except requests.exceptions.RequestException as exception:
        message="cannot connect to OME. " + str(exception)
        if type(warrantydicts) == str:
            print message
            exit (3)
        else:
            exitProgram(warrantydicts,3,message,timedifference)
    if r.status_code == 401 or r.status_code == 403:
        message="Authorization error while connecting to url " + url + ", http status code: " + str(r.status_code)
        if type(warrantydicts) == str:
            print message
            exit (3)
        else:
            exitProgram(warrantydicts,3,message,timedifference)
    elif r.status_code >= 402 and r.status_code <= 418:
        message="Client error while connecting to url " + url + ", http status code: " + str(r.status_code)
        if type(warrantydicts) == str:
            print message
            exit (3)
        else:
            exitProgram(warrantydicts,3,message,timedifference)
    elif r.status_code >= 500 and r.status_code <= 509:
        message="Server error while connecting url " + url + ", http status code: " + str(r.status_code)
        if type(warrantydicts) == str:
            print message
            exit (3)
        else:
            exitProgram(warrantydicts,3,message,timedifference)
    else:
        return (r.text)

def exitProgram (warrantydicts, errorcode, previouserrors,timedifference):
    string=previouserrors + " "
    if type(timedifference) is int:
        if timedifference > 86400:
            string+="Warning: hydra_dell.json is older than 24 hours. "
    if type(warrantydicts) is list:
        for warranty in warrantydicts:
            string+=checkWarrantyDaysRemaining(warranty)
    elif str(type(warrantydicts)) == "":
        string+=checkWarrantyDaysRemaining(warrantydicts)
    elif type(warrantydicts) is False:
        string+="Tried to get warranty information, but none availiable in OME. "
    print string
    exit(errorcode)

def checkOMEAlert(status):
    if status == 8:
        message="Hardware Warning: "
    elif status == 16:
        message="Hardware Critical: "
    alerts_url = "https://"+ omeinstance +":2607/api/ome.svc/Devices/" + id + "/Alerts/$top=1?Severity=" + str(status)
    alert=performRequest(alerts_url,header,"",timedifference)
    alert=xmltodict.parse(alert)
    if len(alert['DeviceAlerts2Response']['DeviceAlerts2Result'])!=1:
        alertmessage=alert['DeviceAlerts2Response']['DeviceAlerts2Result']['Alert']['Message'].split('.,')[0]
        message+=alertmessage+ ", Occured: " + alert['DeviceAlerts2Response']['DeviceAlerts2Result']['Alert']['Time']
    else:
        message+="No message found, idrac version: " + device['DeviceInventoryResponse']['DeviceInventoryResult']['Agents']['Agent']['Version'] + ". "
    return message

#Main program
#Try to read from ome device textfile
try:
    f = open(path_to_device_file, 'r').read()
    devices = eval(f)
except:
    print "Could not read from file: " + path_to_device_file
    exit(3)

#look through each of the instances for the hostname supplied to the plugin, if its found, break and save id and omeinstace, if not found exit
found=False
for instance in devices:
    #continue if we run into Z_date key, since its not an ome instace
    if instance == "Z_date":
        continue
    for hosts in devices[instance]:
        if hostname in hosts:
            id=hosts[hostname]
            omeinstance=instance
            found=True
            break
if found == False:
    print "Host does not exist in OME"
    exit(3)
currenttime=int(time.time())
timedifference = currenttime-int(devices['Z_date'])

#Check if device exist, get status information for device
device_url="https://"+ omeinstance +":2607/api/ome.svc/Devices/" + id
device=performRequest(device_url,header,"","")
device=xmltodict.parse(device)
try:
    status=int(device['DeviceInventoryResponse']['DeviceInventoryResult']['Agents']['Agent']['Status'])
except:
    message="Host does not exist in OME. "
    exitProgram("",3,message,timedifference)

#Check if system is powered off
if device['DeviceInventoryResponse']['DeviceInventoryResult']['Device']['PowerStatus'] != "1":
    status=16

#Check if -w flag is set. if it is, get warranty information, if its not set warrantydicts is empty string
if warrantybool == True:
    warrantydicts=device['DeviceInventoryResponse']['DeviceInventoryResult']['Warranty']['Warranty']
    # print warrantydicts
else:
    warrantydicts=""

#If status is critical or warning check if firmware version is availiable, then exit
if status == 8 or status == 16:
    message=checkOMEAlert(status)
if status == 4:
    message="Hardware OK: No hardware errors. "
if status == 2:
    message="Hardware Unknown: Unknown hardware status "
exitProgram(warrantydicts,statuscodes[status], message,timedifference)


Appendix C – Data-collection script

#!/bin/bash
pluginspath="/opt/contrib-plugins/"
triggerpath=${pluginspath}"TriggerAlerts/"
filename=$1
setracadm="/opt/dell/srvadmin/sbin/racadm -r "10.36.156.25" -u "ejarvic" -p "Syp93939494#" sensorsettings set iDRAC.Embedded.1#SystemBoardInletTemp -level max"
warningtemp=7
oktemp=40

setTempTresholdRacadm() {
    retry=0
    #set the iDRAC thermal sensor threshold
    while true; do
        invalue=$1
        #make sure that the iDRAC hw state has changed
        if [ $retry -ne 1 ];then
            $setracadm $invalue
        fi
        sleep 5
        distributedcheck=$(bash $pluginspath"check_dell_hardware.sh" "seliics00668.seli.gic.ericsson.se")
        #check if the value we just set is what the SNMP agent is showing
        if [[ $distributedcheck =~ OK ]] && [[ $invalue == 7 ]];then
            $setracadm 40
            sleep 5
            $setracadm 7
            retry=1
        elif [[ $distributedcheck =~ CRITICAL ]] && [[ $invalue == 40 ]];then
            $setracadm 7
            sleep 5
            $setracadm 40
            retry=1
        elif [[ $distributedcheck == "" ]] || [[ $distributedcheck == " " ]] || [[ -z $distributedcheck ]] || [[ $distributedcheck =~ "UNKNOWN" ]];then
            continue
        else
            echo `date --rfc-3339=seconds` "threshold set to: "$invalue >> $triggerpath"results_"$filename".log"
            break
        fi
    done
}

checkCentralized() {
    #execute the centralized plugin, write a timestamp and its output to a file
    counter=0
    while [ $counter -lt 61 ];do
        let "counter++"
        sleep 5
        centralizedcheck=$(python $pluginspath"check_dell_ome_hardware.py" -u "USER47" -p "USER48" -H "seliics00668")
        if [[ $centralizedcheck =~ OK ]] && [[ $1 == 7 ]];then
            continue
        elif [[ $centralizedcheck =~ Warning ]] && [[ $1 == 40 ]];then
            continue
        elif [[ $centralizedcheck == "" ]] || [[ $centralizedcheck == " " ]] || [[ -z $centralizedcheck ]] || [[ ! $centralizedcheck =~ OK|Warning ]];then
            continue
        else
            echo "$(date --rfc-3339=seconds) $centralizedcheck" >> $triggerpath"results_"$filename".log"
            return 0
        fi
    done
    echo "$(date --rfc-3339=seconds) Hardware timeout" >> $triggerpath"results_"$filename".log"
    return 0
}

cleanup() {
    kill $PID
    exit
}

#main program
$triggerpath/trapgen.sh &
export PID=$!
trap cleanup SIGINT
trap cleanup SIGTERM
MIN=1
MAX=500
#perform 500 errors and 500 error-restores
for (( j=$MIN; j<=$MAX; j++));do
    setTempTresholdRacadm $warningtemp
    checkCentralized $warningtemp
    setTempTresholdRacadm $oktemp
    checkCentralized $oktemp
done
kill $PID

Appendix D – Data-extraction script

import re
import sys
import time
import json
from datetime import datetime, timedelta

#capture the timestamps and status in groups
pattern = "(\d+\-\d+\-\d+\s\d{2}\:\d{2}\:\d{2}\+\d{2}\:\d{2}) threshold set to: (\d+)\n(\d+\-\d+\-\d+\s\d{2}\:\d{2}\:\d{2}\+\d{2}\:\d{2}) Hardware (\w+)"
with open(sys.argv[1], 'r') as myfile:
    data=myfile.read()
with open(sys.argv[1]+".json", 'r') as jsonfile:
    jsondata=json.load(jsonfile)
regexgroups=re.findall(pattern,data)
centcorr=0
centincorr=0
timediff=[]

#Convert a date string of format '%Y-%m-%d %H:%M:%S+02:00' to seconds since epoch
def convertStringToEpoch(datestring):
    pattern = '%Y-%m-%d %H:%M:%S+02:00'
    epochtime = int(time.mktime(time.strptime(datestring, pattern)))
    return epochtime

#convert epoch time to Day:Hour:Minute:Second
def convertSecondsToDHMS(epoch):
    sec=timedelta(seconds=int(epoch))
    d = datetime(1,1,1) + sec
    datetimestring=str(d.day-1) + ":" + str(d.hour) + ":" + str(d.minute) + ":" + str(d.second)
    return datetimestring

amountoftests=len(regexgroups)
#find out total time for which the tests ran
startepoch = convertStringToEpoch(regexgroups[0][0])
endepoch = convertStringToEpoch(regexgroups[amountoftests-1][0])
delta=convertSecondsToDHMS(endepoch-startepoch)
print "test ran for: " + delta + "(Day:Hour:Minute:Second)"
print "amount of simulated errors " + str(amountoftests)
counter=0
for group in regexgroups:
    trapSentToOME=int(round(float(jsondata[counter]['_source']['layers']['frame']['frame.time_epoch'])))
    counter+=1
    checkDistri=convertStringToEpoch(group[0])
    checkCent=convertStringToEpoch(group[2])
    discoverydelta=(checkCent-2.5-((checkDistri-trapSentToOME)/2))-trapSentToOME
    absoluteError=2.5+((checkDistri-trapSentToOME)/2)
    print str(discoverydelta) + " error: " + str(absoluteError)
    string=str(counter) + ";" + str(discoverydelta) + ";" + str(absoluteError) + ";"
    if group[1] == "40":
        if group[3] == "OK":
            centcorr+=1
            string+="yes"
        else:
            string+="no"
            centincorr+=1
    if group[1] == "7":
        if group[3] == "Warning":
            centcorr+=1
            string+="yes"
        else:
            centincorr+=1
            string+="no"
    timediff.append(string)
print "correct: " + str(centcorr)
print "incorrect: " + str(centincorr)

#print to csv file
f=open(sys.argv[1]+'_timestamps.csv','w')
f.write("starttime: "+regexgroups[0][0]+" endtime: "+regexgroups[amountoftests-1][0]+" runtime: " + delta + "\n")
f.write("amount of simulated errors " + str(amountoftests) + "\n")
f.write("correct: " + str(centcorr) + "\n")
f.write("incorrect: " + str(centincorr) + "\n")
f.write("iterationnumber;presentationtime;absoluteError;iscorrect\n")
for infostring in timediff:
    f.write(str(infostring)+"\n")
f.close()


Appendix E – Trap-generation script

#!/bin/bash

#divide
divisor=50
quotient=$(bc <<< "scale = 10; (1/$divisor)")
counter=0

terminate() {
    echo $counter >> /opt/contrib-plugins/TriggerAlerts/trapssent.txt
    exit
}

#main program
trap terminate SIGINT
trap terminate SIGTERM
while true; do
    let "counter++"

    TRAP_RECEIVER=testome

    snmptrap -v 2c -c public $TRAP_RECEIVER '' 1.3.6.1.4.1.674.10892.5.3.2.1.0.2089 \
        1.3.6.1.4.1.674.10892.5.3.1.1 s "$counter" \
        1.3.6.1.4.1.674.10892.5.3.1.2 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.3 s "3" \
        1.3.6.1.4.1.674.10892.5.3.1.4 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.5 s "seliics00668.seli.gic.ericsson.se" \
        1.3.6.1.4.1.674.10892.5.3.1.6 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.7 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.8 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.9 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.10 s "testtrap" \
        1.3.6.1.4.1.674.10892.5.3.1.11 s "testtrap"

    sleep $quotient
done


Appendix F – Activity diagrams of functions in data-collection script

checkCent() function

setRacadm() and checkDistri() functions


Appendix G – Pilot test results

starttime: 2018-04-10 00:23:23+02:00
endtime: 2018-04-10 08:06:34+02:00
total runtime: 0:7:43:11

timestamp

0:0:1:43 0:0:2:37 0:0:15:37 0:0:45:17 0:1:0:2 0:1:10:12 0:1:27:47 0:1:32:23 0:1:39:49 0:1:49:5 0:1:51:9 0:1:53:55 0:2:3:11 0:2:6:3 0:2:18:16 0:3:9:20 0:3:11:10 0:3:24:7 0:3:26:53 0:3:32:28 0:3:42:36 0:3:50:56 0:4:18:45 0:4:33:37 0:4:37:20 0:4:38:15 0:4:40:32 0:4:45:38 0:4:59:35 0:5:21:50 0:5:28:18 0:6:0:45 0:6:5:23 0:6:6:17 0:6:15:32 0:6:42:22 0:6:49:45 0:6:54:20 0:6:55:15


0:7:1:43 0:7:10:3 0:7:26:44 0:7:36:0 0:7:36:55 0:7:37:57 0:7:38:51


Appendix H – Experiment results, Baseline

Starttime: 2018-05-10 11:06:18+02:00
Endtime: 2018-05-10 16:04:27+02:00
Runtime: 0:4:58:9
amount of simulated errors: 1000
correct: 1000
incorrect: 0

Presentation-time (Y-axis in seconds) of each measurement. Error bar represents absolute error rate


Appendix I – Experiment results, Low Load level

Total traps sent: 164177
Runtime (sec): 18773
Traps/sec: 8.74
Starttime: 2018-05-12 9:23:42+02:00
Endtime: 2018-05-12 14:36:35+02:00
Runtime: 0:5:12:53
amount of simulated errors: 1000
correct: 1000
incorrect: 0

Presentation-time (Y-axis in seconds) of each measurement. Error bar represents absolute error rate


Appendix J – Experiment results, Medium Load Level

Total traps sent: 277348
Runtime (sec): 18587
Traps/sec: 14.92161188
Starttime: 2018-05-11 08:22:08+02:00
Endtime: 2018-05-11 13:31:55+02:00
Runtime: 0:5:9:47
amount of simulated errors: 1000
correct: 1000
incorrect: 0

Presentation-time (Y-axis in seconds) of each measurement. Error bar represents absolute error rate


Appendix K – Experiment results, High Load Level

Total traps sent: 553786
Runtime (sec): 20404
Traps/sec: 27.14105
Starttime: 2018-05-10 16:23:04+02:00
Endtime: 2018-05-10 22:03:08+02:00
Runtime: 0:5:40:4
correct: 999
incorrect: 1

Presentation-time (Y-axis in seconds) of each measurement. Error bar represents absolute error rate
