Evaluating Linux Kernel Crash Dumping Mechanisms

Fernando Luis Vázquez Cao
NTT Data Intellilink
2006 Linux Symposium, Volume One

Abstract

There have been several kernel crash dump capturing solutions available for Linux for some time now and one of them, kdump, has even made it into the mainline kernel.

But the mere fact of having such a feature does not necessarily imply that we can obtain a dump reliably under any conditions. The LKDTT (Linux Kernel Dump Test Tool) project was created to evaluate crash dumping mechanisms in terms of success rate, accuracy and completeness.

A major goal of LKDTT is maximizing the coverage of the tests. For this purpose, LKDTT forces the system to crash by artificially recreating crash scenarios (panic, hang, exception, stack overflow, etc.), taking into account the hardware conditions (such as ongoing DMA or interrupt state) and the load of the system. The latter are key to the significance and reproducibility of the tests.

Using LKDTT the author could confirm the superior reliability of the kexec-based approach to crash dumping, although several deficiencies in kdump were revealed too. Since the final goal is having the best crash dumping mechanism possible, this paper also addresses how the aforementioned problems were identified and solved. Finally, possible applications of kdump beyond crash dumping will be introduced.

1 Introduction

Mainstream Linux lacked a kernel crash dumping mechanism for a long time despite the fact that there were several solutions (such as Diskdump [1], Netdump [2], and LKCD [3]) available out of tree. Concerns about their intrusiveness and reliability prevented them from making it into the vanilla kernel.

Eventually, a handful of crash dumping solutions based on kexec [4, 5] appeared: Kdump [6, 7], Mini Kernel Dump [8], and Tough Dump [9]. On paper, the kexec-based approach seemed very reliable and the impact on the kernel code was certainly small. Thus, kdump was eventually proposed as the Linux kernel's crash dumping mechanism and subsequently accepted.

However, having a crash dumping mechanism does not necessarily imply that we can get a dump under any crash scenario. It is necessary to do proper testing, so that the success rate and accuracy of the dumps can be estimated and the different solutions compared fairly. Besides, having a standardised test suite would also help establish a quality standard and, collaterally, make detecting regressions much easier.

Unless otherwise indicated, henceforth all the explanations will refer to the i386 and x86_64 architectures, and the Linux 2.6.16 kernel.

1.1 Shortcomings of current testing methods

Typically, to test crash dumping mechanisms a kernel module is created that artificially causes the system to die. Common methods to bring the system down from this module consist of directly invoking panic, making a null pointer dereference and other similar techniques.

Sometimes, to ease testing, a user space tool is provided that sends commands to the kernel-space part of the testing tool (via the /proc file system or a new device file), so that things like the crash type to be generated can be configured at run-time.

Beyond the crash type, there are no provisions to further define the crash scenario to be recreated. In other words, parameters like the load of the machine and the state of the hardware are undefined at the time of testing.

Judging from the results obtained with this approach to testing, all crash dumping solutions seem to be very close in terms of reliability, regardless of whether they are kexec-based or not, which seems to contradict theory. The reason is that the coverage of the tests is too limited as a consequence of leaving important factors out of the picture. Just to give some examples, the hardware conditions (such as ongoing DMA or interrupt state), the system load, and the execution context are not taken into consideration. This greatly diminishes the relevance of the results.

1.2 LKDTT motivation

The critical role crash dumping solutions play in enterprise systems calls for proper testing, so that we can have an estimate of their success rate under realistic crash scenarios. This is something the current testing methods cannot achieve and, as an attempt to fill this gap, the LKDTT project [10] was created.

Using LKDTT many deficiencies in kdump, LKCD, mkdump and other similar projects were found. Over time, some regressions were observed too. This type of information is of great importance to both Linux distributions and end-users, and making sure it does not pass unnoticed is one of the commitments of this project.

To create meaningful tests it is necessary to understand the basics of the different crash dumping mechanisms. A brief introduction follows in the next section.

2 Crash dump

A variety of crash dumping solutions have been developed for Linux and other UNIX®-like operating systems over time. Even though implementations and design principles may differ greatly, all crash dumping mechanisms share a multistage nature:

1. Crash detection.

2. Minimal machine shutdown.

3. Crash dump capture.

2.1 Crash detection

For the crash dump capturing process to start a trigger is needed. And this trigger is, most interestingly, a system crash.

The problem is that this peculiar trigger sometimes passes unnoticed or, in other words, the kernel is unable to detect that it has crashed.

The culprits of system crashes are software errors and hardware errors. Often a hardware error leads to a software error, and vice versa, so it is not always easy to identify the original problem. For example, behind a panic in the VFS code a damaged memory module might be lurking.

There is one principle that applies to both software and hardware errors: if the intention is to capture a dump, as soon as an error is detected control of the system should be handed to the crash dumping functionality. Deferring the crash dumping process by delegating invocation of the dump mechanism to functions such as panic is potentially fatal, because the crashing kernel might well lose control of the system completely before getting there (due to a stack overflow for example).

As one might expect, the detection stage of the crash dumping process does not show marked implementation-specific differences. As a consequence, a single implementation could be easily shared by the different crash dumping solutions.

2.1.1 Software errors

A list of the most common crash scenarios the kernel has to deal with is provided below:

• Oops: Occurs when a programming mistake or an unexpected event causes a situation that the kernel deems grave. Since the kernel is the supervisor of the entire system it cannot simply kill itself as it would do with a user-space application that goes nuts. Instead, the kernel issues an oops (which results in a stack trace and error message to the console) and strives to get out of the situation. But often, after the oops, the system is left in an inconsistent state the kernel cannot recover from and, to avoid further damage, the system panics (see panic below). For example, a driver might have been in the middle of talking to hardware or holding a lock at the time of the crash and it would not be safe to resume execution. Hence, a panic is issued instead.

• Panic: Panics are issued by the kernel upon detecting a critical error from which it cannot recover. After printing an error message the system is halted.

• Faults: Faults are triggered by instructions that cannot or should not be executed by the CPU. Some of them are perfectly valid, and in fact play an essential role in important parts of the kernel (for example, page faults in virtual memory management). However, there are certain faults caused by programming errors, such as divide-error, invalid TSS, or double fault (see below), which the kernel cannot recover from.

• Double and triple faults: A double fault indicates that the processor detected a second exception while calling the handler for a previous exception. This might seem a rare event but it is possible. For example, if the invocation of an exception handler causes a stack overflow a page fault is likely to happen, which, in turn, would cause a double fault. In i386 architectures, if the CPU faults again while invoking the double fault handler, then it triple faults, entering a shutdown cycle that is followed by a system RESET.

• Hangs: Bugs that cause the kernel to loop in kernel mode, without giving other tasks the chance to run. Hangs can be classified in two big groups:

  – Soft lockups: These are transitory lockups that delay execution and scheduling of other tasks. Soft lockups can be detected using a software watchdog.

  – Hard lockups: These are lockups that leave the system completely unresponsive. They occur, for example, when a CPU disables interrupts and gets stuck trying to acquire a spinlock that is not freed due to a locking error.

[...] whether trying to capture a crash dump in the event of a serious hardware error is a sensible thing to do. When the underlying hardware cannot be trusted one would rather bring the system down to avoid greater havoc.

The Linux kernel can make use of some error detection facilities of computer hardware. Currently the kernel is furnished with several [...]
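
The idea that soft lockups can be detected using a software watchdog can be illustrated with a small user-space sketch. This is not the kernel's watchdog code: the softdog structure, the function names, and the tick-based clock are all hypothetical and chosen only for illustration. The sketch assumes a monotonically increasing tick counter, a context that refreshes a timestamp whenever normal scheduling works, and a periodic check that reports a lockup once the timestamp becomes too old.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state for one CPU; all names are illustrative. */
struct softdog {
    uint64_t last_touch; /* tick at which scheduling last made progress */
    uint64_t threshold;  /* ticks without a touch that count as a lockup */
};

/* Arm the watchdog; the threshold must be a positive number of ticks. */
static void softdog_init(struct softdog *wd, uint64_t now, uint64_t threshold)
{
    assert(threshold > 0);
    wd->last_touch = now;
    wd->threshold = threshold;
}

/* Meant to be called from a context that only runs while scheduling works. */
static void softdog_touch(struct softdog *wd, uint64_t now)
{
    wd->last_touch = now;
}

/*
 * Meant to be called from a periodic tick. Returns true when more than
 * `threshold` ticks have passed since the last touch. Assumes `now` never
 * goes backwards.
 */
static bool softdog_expired(const struct softdog *wd, uint64_t now)
{
    return now - wd->last_touch > wd->threshold;
}
```

In this sketch a loop that never yields would stop calling softdog_touch, so a later periodic call to softdog_expired would return true. A hard lockup with interrupts disabled would also silence the periodic check itself, which is why this kind of watchdog only covers the soft case.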
