An Introduction to Proactive Server Preservation in an HPC Environment

Chad Feller1,2 Karen Schlauch2 Frederick Harris, Jr.1

1Department of Computer Science and Engineering, 2Center for Bioinformatics, University of Nevada, Reno, Reno, NV. [email protected]

Abstract

Monitoring has long been a challenge for the server administrator. Disk health, system load, network congestion, and environmental conditions such as temperature are all things that can be tied into monitoring systems. Monitoring systems vary in scope and capabilities, and can fire off alerts for just about anything they are configured for. The sysadmin then has the responsibility of weighing the alert and deciding if and when to act. In a High Performance Computing (HPC) environment, some of these failures can have a ripple effect, affecting a larger area than the physical problem. Furthermore, some temperature and load swings can be more drastic in an HPC environment than they would be otherwise. Because of this, a timely, measured response is critical. When a timely response is not possible, conditions can escalate rapidly in an HPC environment, leading to component failure. In this situation, an intelligent, automatic, measured response is critical. Here we present a novel approach to server monitoring, coupled with a response system designed to deliver an intelligent response with High Performance Computing in mind.

Keywords: System Monitoring, Environmental Monitoring, IPMI.

1 Introduction

A common, if not typical, approach to monitoring systems is to have a central host that monitors servers over the network. A most basic implementation would have a central monitoring host that pings the monitored server to see if it responds, and sends requests to specified ports to ensure that services are still running on those ports. This basic approach can answer the questions of whether a server is up or down, and whether or not specified services are available over the network. A more intelligent approach involves installing a daemon, or system service, on the monitored server, allowing the monitoring system to not only observe external behavior as in the basic model, but also to see what is going on inside of the monitored server. The benefit here is obvious: CPU load and memory issues can be observed, among other things such as process characteristics. The downside is that these daemons are typically OS specific, and can themselves fail for various reasons. They may be slow to respond if the monitored server is under heavy load, or can be terminated by the OS under memory pressure. Ultimately, if the OS has a kernel panic due to a hardware or software failure, the daemon is ineffective.

The remainder of this section presents a few select examples of other monitoring systems. Section 2 discusses our novel design from a high-level overview, along with an analysis of the components leveraged by the design. Section 3 takes an in-depth look at the implementation, first from a high level and then on a component-by-component basis, finishing with a walkthrough of the algorithm. Section 4 concludes with the current state of the project, and Section 5 discusses future work.

1.1 System and Network Monitors

Currently there are several open source system monitoring and network monitoring software solutions freely available over the internet. There are yet more available if commercial offerings are also considered. However, many existing monitoring solutions are either overly broad or narrowly focused. Here we take a look at some of the existing solutions:

Nagios

According to their website, Nagios is "a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes" [1, 2]. In practice, Nagios is a monitoring tool capable of monitoring a range of network and system services, as well as host resources via plugins and daemons that reside on the remote host. It also allows the system administrator to provide a custom response: the system administrator writes custom event handlers, which could be as simple as "text this group of people" or as elaborate as "write this message to the logs, and reboot the server." Nagios uses a flat file for its database backend by default.

Monit

Monit is a monitoring tool, written in C, primarily designed to monitor system (daemon) processes and take predefined actions should certain conditions occur. In fact, Monit's site says this is what sets it apart: "In difference to many monitoring systems, Monit can act if an error situation should occur, e.g.; if sendmail is not running, Monit can start sendmail again automatically or if apache is using too many resources (e.g. if a DoS attack is in progress) Monit can stop or restart apache and send you an alert message" [3]. Additionally, Monit can be used to check system resources, much like other tools.

Ganglia

Ganglia [4, 5] is a distributed system monitoring tool, designed to log and graph CPU, memory, and network loads, both historically and in real time. It is used primarily in environments such as HPC clusters. Ganglia was written modularly, in several languages including C, Perl, Python, and PHP. Unlike Nagios and Monit, Ganglia isn't designed to do anything beyond reporting collected data. That is, it won't alert the system administrator if certain events occur.

Cacti

Cacti [6] is a monitoring tool, written in PHP, designed to monitor, log, and graph network traffic, both historically and in real time. It is designed for, and commonly used to, monitor network switches and routers. However, Cacti can be configured to monitor just about any data source that can be stored in an RRD (Round Robin Database). Like Ganglia, it is only designed to report data that it collects, and doesn't support the ability to send alerts or custom responses.

Munin

Munin [7] is a monitoring tool, written in Perl, designed to monitor system and network performance. It aims to be as plug and play as possible (to "just work") out of the box. Additionally, Munin, like Cacti, presents both historical and current data, attempting to make it as easy as possible to see what is different today when performance issues arise. While Munin doesn't have the ability to send alerts like Nagios, Munin can integrate into Nagios, so that Nagios' alert and response system can be leveraged.

1.2 Environmental Monitors

Environmental monitors are commonly employed in server rooms to monitor humidity and temperature, two important elements of server room climate control. While temperature is an obvious concern, humidity can also be an issue. If there is too much humidity in the air, condensation can occur. Too little humidity can enable the buildup of static electricity in the air. Traditional environmental monitors are standalone units, which may simply sound an audible alarm or, more commonly nowadays, send out alert emails if they are network enabled. Temperature-only monitors are also available, many of which are meant to hang off of a server, as a USB dongle for instance, or come as network-aware standalone units capable of feeding temperature readings into other system monitoring software.

2 Our Approach

The motivation for this project comes from the fact that HPC environments are acutely susceptible to environmental failures, particularly cooling system failures. A cooling system failure in a server room with HPC equipment can spiral out of control much more quickly than in other server rooms hosting more conventional computing equipment such as simple mail and web servers. (Although in this era of increased density due to virtualization, the difference in temperature output of HPC vs non-HPC equipment may begin to diminish.)

If an environmental failure happens, particularly after working hours, and the system administrator's pager or cell phone goes off because the environmental monitor sends out an alert, is the system administrator going to be able to get to a remote console in time? Will there be enough time to log in remotely, evaluate the situation, determine the best course of action, and then execute that action before the temperature crests into the critical zone, causing hardware failure?

Witnessing these types of events is what motivated this project.

The fact that a 1U server can be ordered with more than 12 CPU cores makes the potential for heat creation greater than it was four years ago, when that same 1U server was only available with 4 CPU cores. Naturally the HVAC system is going to be upgraded to handle the additional capacity, but when the HVAC system fails, the temperature is going to rise quite a bit faster in that same server room than it did a few years ago, giving that same system administrator a lot less time to respond effectively.

Additionally, in an HPC environment, there is the consideration of jobs. Many HPC jobs run for days, weeks, or even months. Telling a researcher that they lost a month of compute time isn't easy, or pleasant. It would be challenging for the system administrator to ascertain which jobs to try to save - if possible - versus which to shut down, in a server room with rapidly escalating temperatures. Most likely the response is something along the lines of "shut down everything now." But what if we could do all of this more intelligently? This line of thinking led to a system that can automatically and intelligently respond to an environmental failure.

In thinking of how to implement this system, we took into account the fact that we wanted to be able to monitor not only global cooling failures, but also localized cooling failures: those caused by, perhaps, a bad fan, or a piece of debris - from a server getting unboxed, for instance - that got sucked into the front of a server rack and is now obstructing airflow into one or more servers. To best detect localized failures, temperature sensors would need to be deployed on every server. Before discussing this idea further, an overview of some key components of the environment is in order.

2.1 IPMI

Intelligent Platform Management Interface [8], commonly referred to as IPMI, or even as a BMC (Baseboard Management Controller), is a subsystem providing the system administrator with a method to communicate with and manage another device. It is specifically well suited for LOM, or Lights Out Management, tasks. An IPMI system is available for most mid to high end servers today from all major vendors - Sun/Oracle, Dell, HP, and IBM - as well as on some high end, specialized workstations, such as those provided by Fujitsu. They are known by such names as ILOM (Sun/Oracle), DRAC (Dell), iLO (HP), and RSA (IBM).

An IPMI system runs independently of the operating system and other hardware on the system, allowing the system administrator to interface with the server or workstation even when it is off, or has locked up due to a kernel panic.

At a most basic level, the IPMI system allows a system administrator to query things like chassis power status, view event logs, query hardware configuration such as sensor information or FRU data, and turn the server or workstation on and off.

The IPMI system is accessible several different ways. In this work, we will be accessing it over the network. This is typically achieved in one of two ways: either by sharing an interface with the system NIC, via what is known as side-band management, or through a dedicated NIC, giving true out-of-band management. Whether available via side-band or out-of-band methods, the IPMI system is typically communicated with over the network using tools that speak the IPMI messaging protocol. Many IPMI systems also run an SSH server by default, allowing communication over that protocol.

The IPMI product in use for the environment in this paper is Sun's ILOM.
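
As a concrete illustration, the query below is a minimal sketch of talking to a service processor over the network with ipmitool; the hostname, credentials, and the interpretation of the output are placeholders for illustration, not the code used in this project.

    # Minimal sketch: read chassis power state and the temperature sensors of a
    # remote service processor by shelling out to ipmitool over the lanplus
    # interface. Host and credentials are placeholders.
    require 'open3'

    def ipmi(host, user, pass, *args)
      cmd = ['ipmitool', '-I', 'lanplus', '-H', host, '-U', user, '-P', pass] + args
      out, status = Open3.capture2(*cmd)
      raise "ipmitool #{args.join(' ')} failed for #{host}" unless status.success?
      out
    end

    puts ipmi('ilom-n001', 'root', 'changeme', 'chassis', 'power', 'status')
    puts ipmi('ilom-n001', 'root', 'changeme', 'sdr', 'type', 'Temperature')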

2.2 BMC

Some vendors will refer to their IPMI system as a BMC. Technically the BMC, or Baseboard Management Controller, is the core of the IPMI subsystem; it is the microcontroller that ties the whole IPMI system together. See Figure 1.

Figure 1: IPMI Block Diagram (courtesy of [9]), illustrating the Baseboard Management Controller as the center of an IPMI system.

2.3 Schedulers

Job schedulers are a key component of HPC environments. They ensure fair and equal job scheduling - that no single job starves, and that no single job consumes all system resources. Two prominent schedulers are SGE (Sun Grid Engine) and PBS (Portable Batch System).

A batch job scheduler can be configured several different ways, depending on the preferences of the system administrator, or the requirements handed to the system administrator by researchers or management. In a most basic configuration, each user can submit jobs into an HPC environment, and the job scheduler queues the jobs up, running them in order as resources become available.

In more advanced configurations, custom job queues can be created, with some having more priority than others, allowing low priority jobs to be temporarily suspended while high priority jobs run. Custom job queues can also be used to partition where jobs run, such that certain researchers or research groups are only allowed to run on certain nodes.

Servers in an HPC environment are commonly referred to as compute nodes, or just nodes. In this paper, the terms node and server can be used interchangeably.

The job scheduler in use for the environment in this paper is Sun Grid Engine version 6.1u4.
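
The scheduler operations this design relies on - suspend, resume, and terminate - are driven through DRMAA in our implementation (Section 3.3). As a hedged sketch, the SGE command line equivalents look like the following; the job id is hypothetical.

    # Suspend, resume, and terminate an SGE job from Ruby by shelling out to
    # the SGE command line tools. The job id 4242 is a placeholder.
    job_id = '4242'

    system('qmod', '-sj', job_id)    # suspend: SGE stops the job's processes
    system('qmod', '-usj', job_id)   # unsuspend: the job picks up where it left off
    system('qdel', job_id)           # terminate the job outright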

2.4 Putting it together

Our idea revolves around tying the IPMI systems into a larger system, giving us their built-in temperature monitoring at a localized level, but with a global perspective. Local temperature events can be monitored and responded to, while at the same time the larger system is smart enough to realize if this is happening on a larger scale, allowing it to act at a global level rather than waiting for each individual system to trigger a temperature event. The IPMI system gives us further control over each server, allowing us to shut it down by turning off the power, if need be. The scheduler gives us the ability to suspend and restart jobs.

If we had a global temperature event, such as losing one of the A/C compressors, and the temperature in the server room began to rise, pushing the ambient temperature above where it should be, the larger system could respond by first telling the scheduler to suspend (pause) all jobs running on the HPC cluster, and then shutting down all compute nodes not running any jobs. If this caused, say, half of the compute nodes to turn off, and caused the other compute nodes to drop their temperature output by some significant percentage, that would very possibly be manageable by the remaining A/C capacity. If the temperature continued to rise, additional steps could be taken to curb heat output while still trying to strike a balance between saving long running compute jobs and saving hardware.

After the HVAC techs repair the A/C compressor, the system administrator can resume the paused jobs and power the compute nodes in question back on.

3 Implementation

The implementation of this project occurred in stages. First we had to find the best way to talk to the ILOMs. OpenIPMI has Python bindings, so early testing leveraged those bindings. We quickly discovered that OpenIPMI didn't work quite the way we wanted, not to mention that there were other issues with the Python bindings for OpenIPMI, because they were created with SWIG, so other options were explored.

We knew that we could get exactly the information we wanted from IPMItool, a standalone program. So why not wrap calls to IPMItool in another language? We knew that there would eventually be a web based frontend to this program, and Rails was a desirable choice on a number of levels, so the decision was made to use Ruby all the way through the development process.

From a high level, the project consists of two components:

• The first component is a backend daemon that runs in the background, polling the temperature sensors across all of the servers every minute and writing the values to two types of databases. An SQL database keeps all of the Node (server and ILOM) data, as well as Location data (which server rack, and where in that rack). RRD databases are exceptionally well suited for time series data, which in this case is the temperature values. The backend daemon is also responsible for taking specified actions when read temperature values exceed defined thresholds.

• The second component is a Rails based frontend. The frontend has no knowledge of the backend program, nor does the backend have knowledge of the frontend. The frontend program is merely a window into the databases.

3.1 Ilom class

The Ilom Ruby class object grabs a bunch of information upon being initialized, as illustrated by Figure 2.

Figure 2: An Ilom class object.

A quick walkthrough of the class instance variables is as follows:

@hostname is the hostname of the IPMI device.

@ipmi_authcap and @ipmi_endianseq are workaround flags for FreeIPMI to communicate with certain types of IPMI devices. Setting them in the object eliminates the need to recheck the need for them every time. If the instance variable is set, we just send that workaround along with the FreeIPMI command.

@mbt_amb and @mbt_amb_id are the name and numeric id, respectively, of the ambient temperature sensor. We store these values so that in the future we can ask things like "what is the temperature of @mbt_amb_id."

@cur_temp is the current temperature, as last read from the ambient temperature sensor.

@temp_status is the temperature status returned by the IPMI device. When things are good it will be an "OK" string.

@thresh_unc, @thresh_ucr, and @thresh_unr are the Upper Non Critical, Upper Critical, and Upper Non Recoverable thresholds, respectively. An interesting point here is that some platforms don't define all three of these values. Some omit the Upper Non Recoverable value; other IPMI devices don't define any of them.

@watermark_high and @watermark_low are historic - over the life of the class object - high and low temperature values. At class initialization they are no different from the initial temperature reading.

The Ilom class also includes other functionality, such as querying the chassis power state and being able to turn the server on and off. For this other functionality we didn't feel there was a need to have class instance variables defined.
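
To make the structure concrete, here is a hypothetical skeleton of such a class - a sketch rather than the actual implementation. The helper that shells out to an IPMI tool, the sensor name, and the parsing of its output are all assumptions; the real class also carries the FreeIPMI workaround flags and the power-control methods described above.

    # Hypothetical sketch of an Ilom-style class built around the instance
    # variables described above. The external command and the field labels
    # parsed below are placeholders, not the code used in this project.
    class Ilom
      attr_reader :hostname, :cur_temp, :temp_status,
                  :thresh_unc, :thresh_ucr, :thresh_unr,
                  :watermark_high, :watermark_low

      def initialize(hostname, user, pass)
        @hostname    = hostname
        @user, @pass = user, pass
        @mbt_amb     = 'T_AMB'      # ambient sensor name; vendor specific
        refresh
        @watermark_high = @cur_temp # equal to the first reading at creation
        @watermark_low  = @cur_temp
      end

      # Re-read the ambient sensor and keep the historic high/low watermarks.
      def update!
        refresh
        @watermark_high = [@watermark_high, @cur_temp].compact.max
        @watermark_low  = [@watermark_low,  @cur_temp].compact.min
        @cur_temp
      end

      private

      def refresh
        out = `ipmitool -I lanplus -H #{@hostname} -U #{@user} -P #{@pass} sensor get '#{@mbt_amb}'`
        num = lambda { |label| m = out[/#{label}\s*:\s*([\d.]+)/, 1]; m && m.to_f }
        @cur_temp    = num.call('Sensor Reading')
        @temp_status = out[/Status\s*:\s*(\S+)/, 1]
        @thresh_unc  = num.call('Upper non-critical')
        @thresh_ucr  = num.call('Upper critical')
        @thresh_unr  = num.call('Upper non-recoverable')
      end
    end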

3.2 IlomRRD class

This class sits on top of the official RRDtool Ruby bindings [10]. It serves two main purposes.

First, the class creates and updates RRD databases every minute when new temperature values are polled. These values are written to an RRD database for that particular server; each server has its own RRD database.

Secondly, the class knows how to create graphs from the RRD databases. The graphs are updated every five minutes. The graphs include hostname, current ambient temperature, and the temperature thresholds. Time values correspond to the X axis, while temperature values correspond to the Y axis.

Values required by RRDtool to create and update the RRD databases, as well as to graph the temperature values, are read from a YAML [11] configuration file. Values such as the RRD database size have to be declared upon creation. RRD graph creation has many options, such as line type and color, and the temperature viewing window (e.g., 1 hour of temperature data). If we wanted to change any of these parameters, it would be as simple as changing the YAML configuration file.

3.3 IlomSGE class

This class sits on top of the DRMAA (Distributed Resource Management Application API) Ruby bindings. We wanted to leverage DRMAA as much as possible for this class, as DRMAA provides a standardized way to communicate with any scheduler that supports it, such as SGE.

An issue that prevented us from using DRMAA exclusively is that there appears to be no way to query all of the running jobs using DRMAA. After searching through a good deal of DRMAA documentation and several mailing lists, this seemed to be confirmed.

To get around this, we had to wrap the output of the command line SGE program qstat to get a list of running jobs. We ask qstat to return the job list in XML format, and leverage the nokogiri Ruby gem to parse the XML output and build an array containing all of the jobs.

Our remaining class methods leverage the DRMAA interface, allowing us to suspend, resume, and terminate jobs.
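
The qstat workaround can be sketched as follows. The element names reflect SGE's qstat XML output, but which fields to keep, and the method name, are illustrative choices rather than the project's actual code.

    # Sketch of the qstat workaround: ask SGE for every user's jobs as XML and
    # build an array of job hashes with nokogiri. Field selection is illustrative.
    require 'nokogiri'

    def running_jobs
      xml = `qstat -u '*' -xml`
      Nokogiri::XML(xml).xpath('//job_list').map do |job|
        { :id    => job.at_xpath('JB_job_number').text.to_i,
          :owner => job.at_xpath('JB_owner').text,
          :state => job.at_xpath('state').text }   # e.g. "r" running, "s" suspended
      end
    end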

3.4 Database backend

The backend consists of RRD databases for storing time series data (the temperature values), as well as an SQL database for recording server characteristics and server location. The Ruby ActiveRecord gem is used to work with the SQL database. There are two main Ruby models that we interact with: a Node model and a Location model.

The Node model largely contains the same information as the Ilom class, discussed in Section 3.1, but here it is stored persistently. It also contains additional values for dealing with thermal events.

The Location model is used exclusively by the web frontend. It contains a rack and rank value, which are server room coordinates for which rack, and which slot in that rack. A foreign key ID is what allows a Location to find its corresponding Node.
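
A hypothetical sketch of the two models is shown below; the associations follow the foreign key described above, while the column names beyond rack, rank, and the thresholds already discussed are assumptions.

    # Hypothetical sketch of the two ActiveRecord models backing the daemon and
    # the frontend. Column names are assumed from the discussion above.
    require 'active_record'

    class Node < ActiveRecord::Base
      has_one :location                # Location carries the node_id foreign key

      # True once the stored ambient reading crosses the upper non-critical threshold.
      def over_threshold?
        cur_temp >= thresh_unc
      end
    end

    class Location < ActiveRecord::Base
      belongs_to :node

      # Server room coordinates: which rack, and which slot (rank) in that rack.
      def coordinates
        [rack, rank]
      end
    end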

3.5 Rails frontend

The system frontend is illustrated by Figure 3.

Figure 3: The default frontend view: RRDtool generated temperature graphs of all of the nodes. The layout of the graphs on the webpage matches the physical layout of servers in the server room. Clicking on any of the graphs brings up a larger graph, along with other characteristics of the server, pulled from the SQL database. Some of these values, such as the temperature thresholds, can be overridden and written back to the SQL database from here.

The frontend is primarily designed to give the system administrator a visual representation of what is going on, as well as the ability to override some of the database values. For instance, the ILOM may report an upper temperature threshold of 70 degrees. The system administrator could, from this web interface, override that value, dropping it to 50 degrees. If a threshold value isn't defined by an IPMI based device, that value can be specified here.

3.6 Runtime

A runtime data structure is assembled, internally called a nodelist. Each element in the nodelist is a nested hash structure, as shown in Figure 4.

    {hostname => { :bmc     => Ilom object,
                   :rrd     => IlomRRD object,
                   :node_id => Node ID }}

Figure 4: The structure of a nodelist entry.

So each entry in the nodelist can reference its Ilom object, its IlomRRD object, and its Node ID for database access.

The nodelist is used by the main loop of the program, which is where the backend daemon lives after initialization, until termination. A high level overview of the logic of the backend daemon is as follows (a sketch of the loop follows the list):

• During each loop, which runs once every minute, every Node is addressed. Upon completion of addressing every Node, the loop sleeps until the remainder of one minute has elapsed.

• At the beginning of the loop, an inner loop is run, where each Node in the nodelist is polled via IPMI, and temperature values are updated in the SQL and RRD databases accordingly. If it is a 5th loop, RRD graphs are updated. If a thermal threshold has been reached - that is, if the ambient temperature has exceeded one or more thresholds - appropriate actions are taken. Note that any actions taken here would be local thermal event actions.

• After each entry in the nodelist has been addressed, determine whether there is a global temperature event taking place, and if so take appropriate actions to deal with it.
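
A condensed sketch of that loop is given below. The event handlers (handle_local_event, global_event?, handle_global_event) and the Ilom/IlomRRD method names are hypothetical stand-ins for the logic described above, not the project's actual code.

    # Condensed sketch of the backend daemon's main loop. nodelist is the hash
    # of Figure 4; the event handlers and the Ilom/IlomRRD method names are
    # placeholders for the behavior described in the list above.
    loop_count = 0
    loop do
      started = Time.now
      loop_count += 1

      nodelist.each do |hostname, entry|
        temp = entry[:bmc].update!                    # poll the ILOM over IPMI
        entry[:rrd].update(temp)                      # append to this node's RRD database
        node = Node.find(entry[:node_id])
        node.cur_temp = temp                          # persist the latest reading
        node.save
        entry[:rrd].graph if loop_count % 5 == 0      # regenerate graphs every 5th pass
        handle_local_event(entry) if entry[:bmc].thresh_unc && temp >= entry[:bmc].thresh_unc
      end

      handle_global_event(nodelist) if global_event?(nodelist)

      elapsed = Time.now - started                    # sleep out the remainder of the minute
      sleep(60 - elapsed) if elapsed < 60
    end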

4 Conclusion

This project was tested throughout the development process. Each component was tested quite a bit in a standalone capacity before it was integrated with the rest of the code. Since everything was built from the ground up, we were able to let each iteration sit and run, to observe its behavior and see if it crashed or did anything unexpected, before we added the next component.

The Ilom class is the oldest piece of code from this project, and has been running on HPC and non-HPC servers for over six months. A lot has been learned about ILOMs and other IPMI based devices throughout this process, such as the fact that certain IPMI devices can and will return garbage values from time to time. We learned to handle these types of situations gracefully, so that we don't crash the system. After months of work, much trial and error, many code revisions, and database model changes, everything has finally come together.

Since turning off the A/C in the server room wasn't a realistic way to test this project, nor was waiting for the next environmental failure, should it come, testing was achieved by manipulating the temperature thresholds. We tested this project by lowering the temperature thresholds to, or below, the ambient temperature on individual servers, attempting to create a local event on those servers.

Both non-critical and critical events were triggered in this manner. Global events were triggered in a similar manner: by lowering the temperature thresholds on 5% of the nodes, we were able to achieve a global event.

Based on these actions, we are confident that should a real environmental failure occur, not only will hardware be saved, but also long running HPC jobs.

More details and illustrations on this work can be found in [12].

5 Future Work

We have identified numerous future directions for this project. A lot of the future direction would simply be the next logical iteration of this work.

From server model to server model, ambient temperature sensors are placed in different locations. Some are close to the front bezel, others are more recessed; the downfall with the ones that are recessed is that, under load, the reading on the ambient temperature sensor can be influenced by the heat output of the server. Some high end servers have multiple ambient temperature sensors. One piece of future work would be to give the system administrator, via the web frontend, a method to select which ambient temperature sensor to monitor, or possibly to aggregate the output of multiple ambient temperature sensors, in an effort to get a more stable, accurate reading.

Another future extension would be to determine the priority of jobs in the event that nodes with suspended jobs had to be shut down to further cut heat output. Killing the job that had been running 1 month over the job that had been running 6 months would be desirable in this situation. Similarly, killing the job of a CS I student over the job of a research faculty member would also be preferred.

A more difficult future project would be to automatically and accurately determine when the A/C capacity has returned to 100% in the event of a partial cooling failure, such as in the global temperature event example given in Section 2.4.

We have also, as part of our development and testing, branched this project and are working on a version that works in a virtualization environment. Virtual hosts, like HPC nodes, generate a lot of heat, and would be another obvious application. Libvirt could be leveraged to manage virtual guests, similar to how SGE is being used to manage jobs in this paper. We have already begun to extend the Ilom class to work with other IPMI compliant devices such as the Dell DRAC.

Acknowledgements

This work was partially supported by NIH Grant Number P20 RR-016464 from the INBRE Program of the National Center for Research Resources.

References

[1] Nagios - Nagios Overview. Retrieved 15 March 2012, from http://www.nagios.org/about/overview/

[2] Wolfgang Barth, Nagios: System and Network Monitoring (2nd edition). Open Source Press GmbH, 2008.

[3] Easy, proactive monitoring of processes, programs, files, directories and filesystems on Linux/Unix | Monit. Retrieved 15 March 2012, from http://mmonit.com/monit/

[4] Ganglia Monitoring System. Retrieved 15 March 2012, from http://ganglia.info/

[5] Matthew L. Massie, Brent N. Chun, David E. Culler, The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, Volume 30, Issue 7, July 2004, Pages 817-840, ISSN 0167-8191, 10.1016/j.parco.2004.04.001.

[6] Cacti: The Complete RRDTool-based Graphing Solution. Retrieved 15 March 2012, from http://cacti.net/

[7] Munin - Trac. Retrieved 15 March 2012, from http://munin-monitoring.org/

[8] A Gentle Introduction with OpenIPMI. Retrieved 15 March 2012, from http://openipmi.sourceforge.net/IPMI.pdf

[9] Wikipedia (16 June 2010). Retrieved 21 March 2012, from http://en.wikipedia.org/wiki/File:IPMI-Block-Diagram.png

[10] RRDtool - rrdruby. Retrieved 15 March 2012, from http://oss.oetiker.ch/rrdtool/prog/rrdruby.en.html

[11] The Official YAML Web Site. Retrieved 21 March 2012, from http://yaml.org/

[12] Chad Feller, Beyond Monitoring: Proactive Server Preservation in an HPC Environment. Master's thesis, University of Nevada, Reno, May 2012.
