Assessment and Improvement of Hang Detection in the Linux Operating System

2009 28th IEEE International Symposium on Reliable Distributed Systems Assessment and Improvement of Hang Detection in the Linux Operating System Domenico Cotroneo∗, Roberto Natella∗†, and Stefano Russo∗† ∗Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Via Claudio 21, Naples, Italy †Laboratorio CINI-ITEM “Carlo Savy”, Complesso Universitario Monte Sant’Angelo, Ed.1, Via Cinthia, Naples, Italy {cotroneo, roberto.natella, stefano.russo}@unina.it Abstract—We propose a fault injection framework to assess detection mechanisms within OSs can fail, and they may hang detection facilities within the Linux Operating System lead to missing, misleading, or false detections. This is the (OS). The novelty of the framework consists in the adoption of case of hang failures, which are caused by a significant a more representative faultload than existing ones, and in the effectiveness in terms of number of hang failures produced; part of software faults in OS code [15]; they cause the OS representativeness is supported by a field data study on the to get partially or completely stalled, and to not provide Linux OS. Using the proposed fault injection framework, any response to user-space applications. It is difficult to along with realistic workloads, we find that the Linux OS detect hang failures, because they hamper the execution is unable to detect hangs in several cases. We experience a of the OS, and, unlike other types of failures (e.g., illegal relative coverage of 75%. To improve detection facilities, we propose a simple yet effective hang detector, which periodically memory access), they are not notified by the hardware. tests OS liveness, as perceived by applications, by means of Focus of this paper is on the assessment and the I/O system calls; it is shown that this approach can improve improvement of hang detection mechanisms in commodity relative coverage up to 94%. The hang detector can be OSs. The problem of hang detection has been faced in deployed on any Linux system, with an acceptable overhead. past research studies (see §II). However, they assume knowledge about workload or availability of special I. INTRODUCTION hardware, thus hampering their adoption in commodity Operating Systems (OSs) act as the core of a wide range OSs. Our driving idea is to propose a fault injection of computer systems, enabling the development of complex framework to assess hang detection facilities, and to and distributed services. Commodity OSs are being used guide their improvement. To the best of our knowledge, even in critical scenarios, since they allow a reduction of there are not fault injection techniques tailored for the development time and costs. However, they suffer the well- assessment of hang detection mechanisms. Fault injection known dependability pitfalls that characterize Off-the-Shelf is a valuable approach for experimental evaluation of (OTS) software modules [1]. A failure in the OS may com- fault tolerance mechanisms, as it provides the means for promise the correct execution and application performance, studying of complex interactions between faults, errors, as well as the mission of the overall system [2]. Hence, failures and fault tolerance mechanisms [16], and it can be characterizing the dependability of OSs is a major concern. used to complement and increase the confidence of other It has been demonstrated that software faults (i.e., bugs) validation approaches [17]. However, existing fault injection are a major cause of system failures [3]–[5]. As they can techniques and tools, based on bit-flips (e.g., NFTAPE) and manifest transiently, depending on environmental conditions generic software faults (e.g., G-SWFIT), are able to induce (e.g., hardware and software configuration, workload, and a hang only in a small minority of cases [18], [19], and no timing events), they often elude testing efforts, thus resulting previous work discusses fault representativeness with respect in failures on the field; this is especially true for OSs, which to hang scenarios. In this paper, we propose a fault injection are very complex and often made up of millions of lines of framework for experimental evaluation of OS hang detection code, often written by third-party developers (e.g., device mechanisms. The contribution of this paper is twofold: drivers). Several research studies have been conducted on 1) The design of a representative and effective fault injection fault tolerance issues within commodity OSs, such as repli- framework, which is made up of the following phases: cation, checkpointing, driver restarting, and microreboots • Collection and analysis of software fault reports. Our [6]–[10]. All these mechanisms rely on the OS capability study on the Linux OS confirms that concurrency of decting the occurrence of a failure. Failure detection issues are a major cause of hang failures, as observed capabilities are a fundamental requirement for enabling from previous field data studies [5], [15]; fault tolerance strategies, and to achieve self-management • Definition of a set of faults based on the collected in complex and distributed systems [11]. However, there are software fault reports; failures that may occur without being explicitly notified by • Identification of fault locations by system profiling. the system, for which a recovery action cannot be started. It is worth noting that although several model checking As shown by several studies on field failures [4], [12]–[14], approaches [20] have been proposed, the fault injection framework is complementary to them, since it is difficult This work has been partially supported by the Regione Campania in to comprehensively model software systems. the context of the projects L.R n. 5/2002, “Reti di sensori senza filo per l’identificazione ed il tracciamento di target mobili” and Misura 3.17 2) It proposes a hang detection mechanism that overcomes POR 200/2006, “REMOAM Reti di sensori per il monitoraggio dei rischi limitations of hang detectors in the Linux OS. The ambientali.” proposed mechanism tests OS liveness, as perceived 1060-9857/09 $26.00 © 2009 IEEE 288 DOI 10.1109/SRDS.2009.26 Authorized licensed use limited to: Universita degli Studi di Napoli. Downloaded on January 6, 2010 at 13:33 from IEEE Xplore. Restrictions apply. by applications, by means of periodic system call characterized by minimal overhead and cannot be affected invocations, in order to stimulate most important OS by software faults. However, hardware support may be not subsystems (e.g., memory management, file system, available, and they are not able to detect failures in which device drivers); failure detection is performed by the OS is stalled but the monitored events still occur. More monitoring of kernel I/O throughput. flexible software-implemented techniques are needed to cope An experimental campaign is conducted on Linux OS, with these failures. In this paper we consider hang detection using the proposed framework and workloads representative mechanisms that do not rely on hardware instrumentation. of long-running applications in real-world systems. Workload modeling techniques. They provide a synthetic Experiments show that, in 75% of cases, the kernel is model of system performance, in order to detect anomalies able to detect a hang (see §IV). The proposed detection due to both applications and OS faults [27]–[29]. OS level mechanism has been able to detect 94% of hangs, thus metrics are considered (e.g., CPU utilization, paging, I/O improving detection relative coverage of 19%, with respect operations, and free memory), which are modeled by means to the considered faultload and workloads (see §VI). The of time series analysis and machine learning (e.g., PCA, detector exhibits low overhead and no false positives, and decision trees). They are particularly effective when the it does not require in-depth knowledge of system internals. operational profile is known a priori. Regression techniques The remainder of the paper is organized as follows: have been proposed in [30] to reduce false detections due §II discusses the state of the art in the field of OS hang to changes in the operational profile; nevertheless, workload detection; §III describes the proposed fault injection frame- modeling approaches are not immune to false positives. work for OS hang detection assessment; §IV describes the Moreover, workload modeling cannot distinguish between experimental analysis we conduct on Linux; §V proposes a OS faults and application faults, therefore it is difficult to mechanism for improving hang detection, which is analyzed correctly apply OS fault tolerance mechanisms. in §VI; §VII ends up with conclusions and future work. B. Technical Background II. BACKGROUND AND RELATED RESEARCH Kernel data collection tools. Several monitoring facilities A. Related Work on Hang Detection are provided by the Linux kernel, which have been exploited in this work. In particular, we use KProbes1, which inserts Debugging techniques. They use static and dynamic breakpoints in arbitrary binary code locations in charge of source code analysis to identify hang root causes. In [21], triggering user-defined handler functions. Handlers can be the disk I/O subsystem is modeled analytically; the model is used to collect information about internal kernel variables; then compared to execution traces, to identify workload con- subsequently, kernel execution is restored. Kdump2 is a ditions under which performance is suspiciously low, and to tool for failure data collection based on the execution fix anomalies

Load more