Evaluating Linux Kernel Crash Dumping Mechanisms

Total Page:16

File Type:pdf, Size:1020Kb

Evaluating Linux Kernel Crash Dumping Mechanisms Evaluating Linux Kernel Crash Dumping Mechanisms Fernando Luis Vázquez Cao NTT Data Intellilink [email protected] Abstract 1 Introduction There have been several kernel crash dump cap- turing solutions available for Linux for some Mainstream Linux lacked a kernel crash dump- time now and one of them, kdump, has even ing mechanism for a long time despite the made it into the mainline kernel. fact that there were several solutions (such as Diskdump [1], Netdump [2], and LKCD [3]) But the mere fact of having such a feature does available out of tree . Concerns about their in- not necessary imply that we can obtain a dump trusiveness and reliability prevented them from reliably under any conditions. The LKDTT making it into the vanilla kernel. (Linux Kernel Dump Test Tool) project was created to evaluate crash dumping mechanisms Eventually, a handful of crash dumping so- in terms of success rate, accuracy and com- lutions based on kexec [4, 5] appeared: pleteness. Kdump [6, 7], Mini Kernel Dump [8], and Tough Dump [9]. On paper, the kexec-based A major goal of LKDTT is maximizing the approach seemed very reliable and the impact coverage of the tests. For this purpose, LKDTT in the kernel code was certainly small. Thus, forces the system to crash by artificially recre- kdump was eventually proposed as Linux ker- ating crash scenarios (panic, hang, exception, nel’s crash dumping mechanism and subse- stack overflow, hang, etc.), taking into ac- quently accepted. count the hardware conditions (such as ongoing DMA or interrupt state) and the load of the sys- tem. The latter being key for the significance However, having a crash dumping mechanism and reproducibility of the tests. does not necessarily imply that we can get a dump under any crash scenario. It is necessary Using LKDTT the author could constate the su- to do proper testing, so that the success rate and perior reliability of the kexec-based approach accuracy of the dumps can be estimated and the to crash dumping, although several deficiencies different solutions compared fairly. Besides, in kdump were revealed too. Since the final having a standardised test suite would also help goal is having the best crash dumping mech- establishing a quality standard and, collaterally, anism possible, this paper also addresses how detecting regressions would be much easier. the aforementioned problems were identified and solved. Finally, possible applications of Unless otherwise indicated, henceforth all the kdump beyond crash dumping will be intro- explanations will refer to i386 and x86_64 ar- duced. chitectures, and Linux 2.6.16 kernel. 154 • Evaluating Linux Kernel Crash Dumping Mechanisms 1.1 Shortcomings of current testing meth- achieve and, as an attempt to fill this gap, the ods LKDTT project [10] was created. Using LKDTT many deficiencies in kdump, Typically to test crash dumping mechanisms a LKCD, mkdump and other similar projects kernel module is created that artificially causes were found. Over the time, some regressions the system to die. Common methods to bring were observed too. This type of information the system down from this module consist of is of great importance to both Linux distribu- directly invoking panic, making a null pointer tions and end-users, and making sure it does dereference and other similar techniques. not pass unnoticed is one of the commitments of this project. Sometimes, to ease testing a user space tool is provided that sends commands to the kernel- To create meaningful tests it is necessary to un- /proc space part of the testing tool (via the derstand the basics of the different crash dump- file system or a new device file), so that things ing mechanisms. A brief introduction follows like the crash type to be generated can be con- in the next section. figured at run-time. Beyond the crash type, there are no provisions to further define the crash scenario to be recre- 2 Crash dump ated. In other words, parameters like the load of the machine and the state of the hardware are undefined at the time of testing. A variety of crash dumping solutions have been developed for Linux and other UNIX R - Judging from the results obtained with this ap- like operating systems over the time. Even proach to testing all crash dumping solutions though implementations and design principles seem to be very close in terms of reliability, may differ greatly, all crash dumping mecha- regardless of whether they are kexec-based or nisms share a multistage nature: not, which seems to contradict theory. The rea- son is that the coverage of the tests is too lim- ited as a consequence of leaving important fac- 1. Crash detection. tors out of the picture. Just to give some exam- ples, the hardware conditions (such as ongoing 2. Minimal machine shutdown. DMA or interrupt state), the system load, and the execution context are not taken into consid- 3. Crash dump capture. eration. This greatly diminishes the relevance of the results. 2.1 Crash detection 1.2 LKDTT motivation For the crash dump capturing process to start a trigger is needed. And this trigger is, most The critical role crash dumping solutions play interestingly, a system crash. in enterprise systems calls for proper testing, so that we can have an estimate of their suc- The problem is that this peculiar trigger some- cess rate under realistic crash scenarios. This is times passes unnoticed or, in the words, the ker- something the current testing methods cannot nel is unable to detect that itself has crashed. 2006 Linux Symposium, Volume One • 155 The culprits of system crashes are software er- state the kernel cannot recover from and, rors and hardware errors. Often a hardware er- to avoid further damage, the system panics ror leads to a software errors, and vice versa, (see panic below). For example, a driver so it is not always easy to identify the original might have been in the middle of talking problem. For example, behind a panic in the to hardware or holding a lock at the time VFS code a damaged memory module might of the crash and it would not be safe to re- be lurking. sume execution. Hence, a panic is issued instead. There is one principle that applies to both soft- ware and hardware errors: if the intention is • Panic: Panics are issued by the kernel to capture a dump, as soon as an error is de- upon detecting a critical error from which tected control of the system should be handed it cannot recover. After printing and error to the crash dumping functionality. Deferring message the system is halted. the crash dumping process by delegating in- • Faults: Faults are triggered by instructions vocation of the dump mechanism to functions that cannot or should not be executed by such as panic is potentially fatal, because the the CPU. Even though some of them are crashing kernel might well lose control of the perfectly valid, and in fact play an essen- system completely before getting there (due to tial role in important parts of the kernel a stack overflow for example). (for example, pages faults in virtual mem- ory management); there are certain faults As one might expect, the detection stage of the caused by programming errors, such as crash dumping process does not show marked divide-error, invalid TSS, or double fault implementation specific differences. As a con- (see below), which the kernel cannot re- sequence, a single implementation could be cover from. easily shared by the different crash dumping so- lutions. • Double and triple faults: A double fault indicates that the processor detected a sec- ond exception while calling the handler for 2.1.1 Software errors a previous exception. This might seem a rare event but it is possible. For exam- A list of the most common crash scenarios the ple, if the invocation of an exception han- kernel has to deal with is provided below: dler causes a stack overflow a page fault is likely to happen, which, in turn, would cause a double fault. In i386 architectures, • Oops: Occurs when a programming mis- if the CPU faults again during the incep- take or an unexpected event causes a situa- tion of the double fault, then it triple faults, tion that the kernel deems grave. Since the entering a shutdown cycle that is followed kernel is the supervisor of the entire sys- by a system RESET. tem it cannot simply kill itself as it would • Hangs: Bugs that cause the kernel to loop do with a user-space application that goes in kernel mode, without giving other tasks nuts. Instead, the kernel issues and oops the chance to run. Hangs can be classified (which results in a stack trace and error in two big groups: message to the console) and strives to get out of the situation. But often, after the – Soft lockups: These are transitory oops, the system is left in an inconsistent lockups that delay execution and 156 • Evaluating Linux Kernel Crash Dumping Mechanisms scheduling of other tasks. Soft lock- whether trying to capture a crash dump in the ups can be detected using a software event of a serious hardware error is a sensi- watchdog. tive thing to do. When the underlying hard- – Hard lockups: These are lockups that ware cannot be trusted one would rather bring leave the system completely unre- the system down to avoid greater havoc. sponsive. They occur, for example, The Linux kernel can make use of some er- when a CPU disables interrupts and ror detection facilities of computer hardware. gets stuck trying to get spinlock that Currently the kernel is furnished with several is not freed due to a locking error.
Recommended publications
  • Kernel Panic
    Kernel Panic Connect, Inc. 1701 Quincy Avenue, Suites 5 & 6, Naperville, IL 60540 Ph: (630) 717-7200 Fax: (630) 717-7243 www.connectrf.com Table of Contents SCO Kernel Patches for Panics on Pentium Systems…………………………………..1 Kernel Panic Debug Methodologies..............................................................................3 SCO Kernel Patches for Panics on Pentium Systems SCO Kernel Patches for Panics on Pentium Systems Web Links http://www.sco.com/cgi-bin/ssl_reference?104530 http://www.sco.com/cgi-bin/ssl_reference?105532 Pentium System Panics with Trap Type 6; no dump is saved to /dev/swap. Keywords panic trap kernel invalid opcode pentium querytlb querytbl 386 mpx unix v4 dump swap double type6 no patch /etc/conf/pack.d/kernel/locore.o image locore Release SCO UNIX System V/386 Release 3.2 Operating System Version 4.0, 4.1 and 4.2 SCO Open Desktop Release 2.0 and 3.0 SCO Open Server System Release 2.0 and 3.0 Problem My system panics with a Trap Type 6, but no memory dump gets written. Cause There is a flaw in the kernel querytlb() routine that allows the Pentium to execute a 386-specific instruction which causes it to panic with an invalid opcode. querytlb() is only called when your system is panicking. The fault causes the system to double- panic and it thus fails to write a panic dump to disk. This means that the panic trap type 6 is concealing another type of panic which will show itself after the following patch has been applied. Solution You can apply the following patch, as root, to: /etc/conf/pack.d/kernel/locore.o Use the procedure that follows.
    [Show full text]
  • SUMURI Macintosh Forensics Best Practices
    ! MAC FORENSICS - STEP BY STEP Disclaimer: Before using any new procedure, hardware or software for forensics you must do your own validation and testing before working on true evidence. These best practices are summarized from SUMURI’s Macintosh Forensic Survival Courses which is a vendor- neutral training course taught to law enforcement, government and corporate examiners worldwide. More information about SUMURI can be found at SUMURI.com STEP ACTIONS DESCRIPTION STEP-1: PRE-SEARCH INTELLIGENCE Find out as much as you can about your target: • Number and types of Macs (MacBook, iMac or Mac Pro). • Operating System Versions (for collecting/processing volatile data and Copy Over Procedure). • Type/s and number of ports. • Does it contain a T2 chipset with Secure Boot? • Is FileVault Active? STEP-2: ISOLATE Assign one trained Digital Evidence Collection Specialist to handle the computers to minimize contamination and Chain of Custody. Prohibit anyone else from handling the devices. STEP-3: ALWAYS ASK FOR THE PASSWORD Most newer Macs have enhanced security features such as T2 Security Chipsets, APFS File Systems, Secure Boot, FileVault and more. Any one or combination of these security features can stop you from getting the data. ALWAYS ASK FOR THE PASSWORD. SUMURI.com © SUMURI 2010-2019 ! STEP ACTIONS DESCRIPTION STEP-4: IF COMPUTER IS ON - SCREEN SAVER PASSWORD ACTIVE Options are: • Ask for the Password - Confirm password and proceed to Step-6. • Restart to Image RAM - Connect a RAM Imaging Utility to the Mac such as RECON IMAGER. Conduct a soft-restart (do not power off if possible and image the RAM).
    [Show full text]
  • Kdump, a Kexec-Based Kernel Crash Dumping Mechanism
    Kdump, A Kexec-based Kernel Crash Dumping Mechanism Vivek Goyal Eric W. Biederman Hariprasad Nellitheertha IBM Linux NetworkX IBM [email protected] [email protected] [email protected] Abstract important consideration for the success of a so- lution has been the reliability and ease of use. Kdump is a crash dumping solution that pro- Kdump is a kexec based kernel crash dump- vides a very reliable dump generation and cap- ing mechanism, which is being perceived as turing mechanism [01]. It is simple, easy to a reliable crash dumping solution for Linux R . configure and provides a great deal of flexibility This paper begins with brief description of what in terms of dump device selection, dump saving kexec is and what it can do in general case, and mechanism, and plugging-in filtering mecha- then details how kexec has been modified to nism. boot a new kernel even in a system crash event. The idea of kdump has been around for Kexec enables booting into a new kernel while quite some time now, and initial patches for preserving the memory contents in a crash sce- kdump implementation were posted to the nario, and kdump uses this feature to capture Linux kernel mailing list last year [03]. Since the kernel crash dump. Physical memory lay- then, kdump has undergone significant design out and processor state are encoded in ELF core changes to ensure improved reliability, en- format, and these headers are stored in a re- hanced ease of use and cleaner interfaces. This served section of memory. Upon a crash, new paper starts with an overview of the kdump de- kernel boots up from reserved memory and pro- sign and development history.
    [Show full text]
  • Linux Kernal II 9.1 Architecture
    Page 1 of 7 Linux Kernal II 9.1 Architecture: The Linux kernel is a Unix-like operating system kernel used by a variety of operating systems based on it, which are usually in the form of Linux distributions. The Linux kernel is a prominent example of free and open source software. Programming language The Linux kernel is written in the version of the C programming language supported by GCC (which has introduced a number of extensions and changes to standard C), together with a number of short sections of code written in the assembly language (in GCC's "AT&T-style" syntax) of the target architecture. Because of the extensions to C it supports, GCC was for a long time the only compiler capable of correctly building the Linux kernel. Compiler compatibility GCC is the default compiler for the Linux kernel source. In 2004, Intel claimed to have modified the kernel so that its C compiler also was capable of compiling it. There was another such reported success in 2009 with a modified 2.6.22 version of the kernel. Since 2010, effort has been underway to build the Linux kernel with Clang, an alternative compiler for the C language; as of 12 April 2014, the official kernel could almost be compiled by Clang. The project dedicated to this effort is named LLVMLinxu after the LLVM compiler infrastructure upon which Clang is built. LLVMLinux does not aim to fork either the Linux kernel or the LLVM, therefore it is a meta-project composed of patches that are eventually submitted to the upstream projects.
    [Show full text]
  • Oracle VM Virtualbox User Manual
    Oracle VM VirtualBox R User Manual Version 5.0.16 c 2004-2016 Oracle Corporation http://www.virtualbox.org Contents 1 First steps 11 1.1 Why is virtualization useful?............................. 12 1.2 Some terminology................................... 12 1.3 Features overview................................... 13 1.4 Supported host operating systems.......................... 15 1.5 Installing VirtualBox and extension packs...................... 16 1.6 Starting VirtualBox.................................. 17 1.7 Creating your first virtual machine......................... 18 1.8 Running your virtual machine............................ 21 1.8.1 Starting a new VM for the first time.................... 21 1.8.2 Capturing and releasing keyboard and mouse.............. 22 1.8.3 Typing special characters.......................... 23 1.8.4 Changing removable media......................... 24 1.8.5 Resizing the machine’s window...................... 24 1.8.6 Saving the state of the machine...................... 25 1.9 Using VM groups................................... 26 1.10 Snapshots....................................... 26 1.10.1 Taking, restoring and deleting snapshots................. 27 1.10.2 Snapshot contents.............................. 28 1.11 Virtual machine configuration............................ 29 1.12 Removing virtual machines.............................. 30 1.13 Cloning virtual machines............................... 30 1.14 Importing and exporting virtual machines..................... 31 1.15 Global Settings...................................
    [Show full text]
  • Linux Core Dumps
    Linux Core Dumps Kevin Grigorenko [email protected] Many Interactions with Core Dumps systemd-coredump abrtd Process Crashes Ack! 4GB File! Most Interactions with Core Dumps Poof! Process Crashes systemd-coredump Nobody abrtd Looks core kdump not Poof! Kernel configured Crashes So what? ● Crashes are problems! – May be symptoms of security vulnerabilities – May be application bugs ● Data corruption ● Memory leaks – A hard crash kills outstanding work – Without automatic process restarts, crashes lead to service unavailability ● With restarts, a hacker may continue trying. ● We shouldn't be scared of core dumps. – When a dog poops inside the house, we don't just `rm -f $poo` or let it pile up, we try to figure out why or how to avoid it again. What is a core dump? ● It's just a file that contains virtual memory contents, register values, and other meta-data. – User land core dump: Represents state of a particular process (e.g. from crash) – Kernel core dump: Represents state of the kernel (e.g. from panic) and process data ● ELF-formatted file (like a program) User Land User Land Crash core Process 1 Process N Kernel Panic vmcore What is Virtual Memory? ● Virtual Memory is an abstraction over physical memory (RAM/swap) – Simplifies programming – User land: process isolation – Kernel/processor translate virtual address references to physical memory locations 64-bit Process Virtual 8GB RAM Address Space (16EB) (Example) 0 0 16 8 EB GB How much virtual memory is used? ● Use `ps` or similar tools to query user process virtual memory usage (in KB): – $ ps -o pid,vsz,rss -p 14062 PID VSZ RSS 14062 44648 42508 Process 1 Virtual 8GB RAM Memory Usage (VSZ) (Example) 0 0 Resident Page 1 Resident Page 2 16 8 EB GB Process 2 How much virtual memory is used? ● Virtual memory is broken up into virtual memory areas (VMAs), the sum of which equal VSZ and may be printed with: – $ cat /proc/${PID}/smaps 00400000-0040b000 r-xp 00000000 fd:02 22151273 /bin/cat Size: 44 kB Rss: 20 kB Pss: 12 kB..
    [Show full text]
  • SUSE Linux Enterprise Server 15 SP2 Autoyast Guide Autoyast Guide SUSE Linux Enterprise Server 15 SP2
    SUSE Linux Enterprise Server 15 SP2 AutoYaST Guide AutoYaST Guide SUSE Linux Enterprise Server 15 SP2 AutoYaST is a system for unattended mass deployment of SUSE Linux Enterprise Server systems. AutoYaST installations are performed using an AutoYaST control le (also called a “prole”) with your customized installation and conguration data. Publication Date: September 24, 2021 SUSE LLC 1800 South Novell Place Provo, UT 84606 USA https://documentation.suse.com Copyright © 2006– 2021 SUSE LLC and contributors. All rights reserved. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”. For SUSE trademarks, see https://www.suse.com/company/legal/ . All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its aliates. Asterisks (*) denote third-party trademarks. All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its aliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof. Contents 1 Introduction to AutoYaST 1 1.1 Motivation 1 1.2 Overview and Concept 1 I UNDERSTANDING AND CREATING THE AUTOYAST CONTROL FILE 4 2 The AutoYaST Control
    [Show full text]
  • Linux Crash HOWTO
    Linux Crash HOWTO Norman Patten [email protected] 2002−01−30 Revision History Revision 1.0 2002−01−30 Revised by: NM Initial release. This document describes the installation and usage of the LKCD (Linux Kernel Crash Dump) package. Linux Crash HOWTO Table of Contents 1. Introduction.....................................................................................................................................................1 1.1. Copyright and License......................................................................................................................1 2. How LKCD Works.........................................................................................................................................2 2.1. What You Need.................................................................................................................................2 3. Installation of lkcd..........................................................................................................................................3 3.1. Installing From Source Code............................................................................................................3 3.2. Building and Installing LKCD Utilities............................................................................................3 3.3. What Gets Installed...........................................................................................................................3 3.4. Installing LKCD Utilities From RPM..............................................................................................3
    [Show full text]
  • SUSE Linux Enterprise Server 12 Does Not Provide the Repair System Anymore
    General System Troubleshooting Sascha Wehnert Premium Service Engineer Attachmate Group Germany GmbH [email protected] What is this about? • This session will cover the following topics: ‒ How to speed up a service request ‒ How to gather system information using supportconfig ‒ Configure serial console in grub to trace kernel boot messages ‒ Accessing a non booting systems using the rescue system ‒ System crash situations and how to prepare (i586/x86_64 only) 2 The challenge of a service request • Complete service request description: “We need to increase our disk space.” 3 The challenge of a service request • Which SUSE® Linux Enterprise Server version? • Is this a physical or virtual environment? • If virtual, what virtualization solution is being used? • If physical, local SCSI RAID array? What hardware? • If using HBAs, dm-multipathing or iSCSI connected disks or a 3rd party solution? • Disk and system partition layout? • What has been done so far? What was achieved? What failed? • What information do I need in order to help? 4 What information would be needed? • SUSE Linux Enterprise Server version → /etc/SuSE-release, uname -a • Physical → dmidecode XEN → /proc/xen/xsd_port KVM → /proc/modules • Hardware information → hwinfo • Partition information → parted -l, /etc/fstab • Multipathing/iSCSI → multipath, iscsiadm • Console output or /var/log/YaST2/y2log in case YaST2 has been used 5 supportconfig • Since SUSE Linux Enterprise Server 10 SP4 included in default installation. • Maintained package, updates available via patch channels. For best results always have latest version installed from channels installed. • One single command to get (almost) everything. • Splits data into files separated by topic. • Can be modified to exclude certain data, either via /etc/supportconfig.conf or command options.
    [Show full text]
  • Embedded Linux Conference Europe 2019
    Embedded Linux Conference Europe 2019 Linux kernel debugging: going beyond printk messages Embedded Labworks By Sergio Prado. São Paulo, October 2019 ® Copyright Embedded Labworks 2004-2019. All rights reserved. Embedded Labworks ABOUT THIS DOCUMENT ✗ This document is available under Creative Commons BY- SA 4.0. https://creativecommons.org/licenses/by-sa/4.0/ ✗ The source code of this document is available at: https://e-labworks.com/talks/elce2019 Embedded Labworks $ WHOAMI ✗ Embedded software developer for more than 20 years. ✗ Principal Engineer of Embedded Labworks, a company specialized in the development of software projects and BSPs for embedded systems. https://e-labworks.com/en/ ✗ Active in the embedded systems community in Brazil, creator of the website Embarcados and blogger (Portuguese language). https://sergioprado.org ✗ Contributor of several open source projects, including Buildroot, Yocto Project and the Linux kernel. Embedded Labworks THIS TALK IS NOT ABOUT... ✗ printk and all related functions and features (pr_ and dev_ family of functions, dynamic debug, etc). ✗ Static analysis tools and fuzzing (sparse, smatch, coccinelle, coverity, trinity, syzkaller, syzbot, etc). ✗ User space debugging. ✗ This is also not a tutorial! We will talk about a lot of tools and techniches and have fun with some demos! Embedded Labworks DEBUGGING STEP-BY-STEP 1. Understand the problem. 2. Reproduce the problem. 3. Identify the source of the problem. 4. Fix the problem. 5. Fixed? If so, celebrate! If not, go back to step 1. Embedded Labworks TYPES OF PROBLEMS ✗ We can consider as the top 5 types of problems in software: ✗ Crash. ✗ Lockup. ✗ Logic/implementation error. ✗ Resource leak. ✗ Performance.
    [Show full text]
  • Linux Kernel Debugging Advanced Operating Systems 2020/2021
    Linux Kernel Debugging Advanced Operating Systems 2020/2021 Vlastimil Babka Linux Kernel Developer, SUSE Labs [email protected] Agenda – Debugging Scenarios • Debugging production kernels – Post-mortem analysis: interpreting kernel oops/panic output, creating and analyzing kernel crash dumps – Kernel observability – dynamic debug, tracing, alt-sysrq dumps, live crash session • Debugging during individual kernel development – Debug prints – printk() facilitiy – Debugger (gdb) support • Finding (latent) bugs during collaborative development – Optional runtime checks configurable during build – Testing and fuzzing – Static analysis 2 Enterprise Linux Distro and Bugs (incl. Kernel) • The software installation (e.g. ISO) itself is free (and open source, obviously) • Customers pay for support subscription – Delivery of (tested by QA) package updates – fixing known bugs, CVE’s… but not more! – Getting reported bugs fixed • Bugs specific to customer’s workload, hardware, “luck”, large number of machines… • Upstream will also fix reported bugs, but only with latest kernel and no effort guarantees • Dealing with kernel bugs, e.g. crashes – Find out the root cause (buggy code) with limited possibilities (compared to local development) • Typically no direct access to customer’s system or workload • Long turnarounds for providing a modified debug kernel and reproducing the bug – Write and deliver a fix (upstream first!) or workaround; fix goes to next update • Possibly a livepatch in some cases – Is a lot of fun ;-) 3 Kernel Oops - The Real World Example
    [Show full text]
  • Generating a White List for Hardware Which Works with Kexec/Kdump
    Generating a White List for Hardware which Works with Kexec/Kdump Fernando Luis Vázquez Cao NTT Open Source Software Center [email protected] Abstract nally, the capture of the crash dump. Kdump is very good at the first two but there are still some The mainstream Linux kernel lacked a crash issues when the dump capture kernel takes control dumping mechanism from its inception until the of the system. In particular the new kernel may fail recent adoption of kdump. This, despite the fact to initialise the underlying devices which, in turn, that there were several solutions available out-of- is likely to lead to a kernel panic or an oops. tree and some of them were even included in major distributions. However concerns about their intru- siveness and reliability prevented them from mak- The underlying problem is that the state of the ing it into the mainstream kernel (aka vanilla ker- devices during a kdump boot is not predictable nel), the main argument being that relying on the because no device shutdown is performed in the resources of a crashing kernel to capture a dump, crashed kernel (it cannot be trusted), and the as they did, is inherently dangerous. firmware stage of the standard boot process is skipped (the dump capture kernel is a soft-booted The appearance of Eric Biederman’s kexec kernel after all). In other words, the inherent as- patches and its subsequent inclusion in the kernel sumption that the firmware (known as the BIOS on as a new system call paved the way for the imple- some systems) is always there to do the dirty work mentation of an idea that had been floating around is not valid anymore.
    [Show full text]