Mini-Ckpts: Surviving OS Failures in Persistent Memory

David Fiala, Frank Mueller (North Carolina State University∗), Kurt Ferreira (Sandia National Laboratories†), Christian Engelmann (Oak Ridge National Laboratory‡§)

ABSTRACT

Concern is growing in the high-performance computing (HPC) community over the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs — and the parallel nature of HPC applications means any single node failure generally forces all processes of this application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures in a robust system. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of between 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.

∗ This work was supported in part by a subcontract from Sandia National Laboratories and NSF grants CNS-1058779, CNS-0958311.
† Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
‡ This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
§ This work was sponsored by the U.S. Department of Energy's Office of Advanced Scientific Computing Research.

1. INTRODUCTION

The operating system (OS) kernel is responsible for supervising running applications, providing services, and governing hardware device access. While applications are isolated from each other and a failure in one application does not necessarily affect others, a failure in the kernel may result in a full system crash, stopping all applications without warning due to system services that run across the entire machine.
In current high-performance computing (HPC) systems with 100s of thousands of processor cores and 100s of TB of main memory, faults are becoming increasingly common and are predicted to become even more common in future systems. Faults can affect memory, processors, power supplies, and other hardware. Research indicates that servers tend to crash twice per year (2-4% failure rate) and unrecoverable DRAM errors occur in 2% of all DIMMs per year [37]. A number of studies indicate that ECC mechanisms alone are unlikely to be sufficient in correcting a significant number of these memory errors [26, 38]. While not all faults result in failure, when they do, intermediate results from entire HPC application executions may be lost¹.

The objective of this work is to isolate applications from faults that originate within the processor or main system memory and result in OS kernel failures and crashes. This work is critical to the scalability of current and future large-scale systems due to the fact that while the kernel occupies a relatively small portion of memory (10-100 MB) relative to an HPC application (up to 32 GB per node), errors in kernel memory may be more likely [26], and the kernel is critical to all other processes running on a node — and in HPC even to other nodes whose progress depends on being able to communicate with the faulted node. In this work, we target the Linux OS due to its widespread adoption on current HPC platforms (97% of the Top500 [5]), although the methodology described in this work applies directly to most other OSs.

While failures due to hardware faults are a concern in large clusters, software bugs pose an equally fatal outcome in the kernel, regardless of scale. Bugs may result in kernel crashes (a kernel "panic"), CPU hangs, and memory leaks resulting in eventual kernel crashes, or other outcomes leaving system performance degraded or only partially functional. Therefore, we have designed mini-ckpts to recover the system in the face of both kernel bugs and transient hardware faults.

¹ While a failure is one possible consequence of a fault, we do not strictly distinguish these terms in the following since the common term "kernel fault" actually refers to a kernel failure resulting from a fault, i.e., the terminology is used inconsistently in practice but the context allows one to infer the meaning.
The primary objective of mini-ckpts is to demonstrate the feasibility of voluntary or failure-induced system reboots by serializing and isolating the most essential parts of the application to persistent memory. To this end, we make minimal modifications to the OS for saving the CPU state during transitions between user space and kernel space, while migrating application virtual memory to an execute-in-place (XIP), directly mapped physical memory region. Finally, failures are mitigated by warm booting an identical copy of the kernel and resuming the protected applications via restoring the saved state of user processes inside the newly booted kernel.

While other research, such as Otherworld [14], has proposed the idea of a secondary crash kernel (activated upon failure) that parses the data structures of the original, failed kernel, this work is novel as it is the first to demonstrate complete recovery independence from the data structures of an earlier kernel. The premise is that a kernel failure may corrupt the kernel's content. The failure itself might be due to a corruption or bug in the first place, which could impede access to the old kernel's data. mini-ckpts' unique contribution is that it has no dependencies on the failed kernel, and that all state required to transparently resume a protected process can be proactively managed prior to a failure. This allows HPC applications to avoid costly rollbacks (during kernel panics) to previous checkpoints, especially when considering that rolling back on a massively-parallel HPC system typically requires all communicating nodes to coordinate and rollback together unless coupled with other more costly techniques such as message logging [30].

This work makes the following contributions:
• It identifies the minimal data structures required to save a process' state during a kernel failure.
• It further identifies instrumentation locations generically needed in OSs for processor state saving.
• It introduces mini-ckpts, a framework that implements process protection against kernel failures by saving application state in persistent memory.
• It evaluates the overhead and time savings of mini-ckpts' warm reboots and application protection.
• It experimentally evaluates application survival of mini-ckpts protected programs in the face of injected kernel faults and bugs for a number of key OpenMP and MPI parallelized workloads.

2. MINI-CKPTS OVERVIEW

mini-ckpts protects applications by taking a checkpoint of the process and its resources, modifying its address space for persistence, and then periodically saving the state of the process' registers whenever the kernel is entered. With the checkpointed information and register state, mini-ckpts later recovers a process after a kernel crash occurs. Recovery is performed by attempting to recreate the process in the exact state prior to the crash using well established techniques employed by traditional checkpoint/restart software, such as Berkeley Lab Checkpoint/Restart (BLCR) [15]. The basic checkpointing features mini-ckpts supports are shown in Table 1. Additionally, mini-ckpts support for persistent virtual memory is comprehensive; a summary is available in Table 2.

Table 1: Summary of process features mini-ckpts protects in a checkpoint during failures
  Multi-threaded processes        Process ID
  Mutex/condition variables       Process UID & GID
  Regular files (excl. seek)      System calls
  Signal handlers & masks         Unflushed file buffers
  Stdin/out/err                   mmap'd files
  Block dev. /dev/{null,zero}     mprotect
  FPU state                       CPU registers
  Process credentials             Process parent/child

Table 2: Persistent virtual memory support protected during failures
  BSS, data, & heap sections      Executable (text)
  Anonymous mmap'd regions        Shared libraries
  File-based mmap regions         vdso, vsyscall
To use mini-ckpts, one first needs to boot a kernel with mini-ckpts modifications, followed by additional bootstrapping steps during start-up:
• The mini-ckpts kernel is booted and filesystems such as /proc and /dev are mounted.
• The PRAMFS persistent memory filesystem is either initialized as empty or mounted if it already exists from a prior boot.
• A fresh copy of the mini-ckpts kernel is loaded into a reserved section of memory for later use (upon a system crash).
• If we are booting after a failure, the last checkpoint is restarted automatically. The mini-ckpts protected application resumes exactly where it left off.

Once an application is launched on a mini-ckpts enabled kernel, failure protection begins after the first checkpoint operation. A checkpoint operation is either self-triggered programmatically via a call to a shared library (preloaded with LD_PRELOAD) or triggered by sending a specific signal to the process. The initial checkpoint migrates the process' virtual memory to the protected and persistent file system (PRAMFS), described in the following section. It then stores its open files, memory mappings, signal handlers, and other state detailed in Tables 1 and 2. Once the first checkpoint operation is complete, the system will ensure that the CPU registers and FPU (Floating Point Unit) state are always saved in persistent memory when the kernel is entered. Any further kernel entries, whether system call or interrupt handler, will further update the saved checkpoint with the correct CPU and FPU state. Once a checkpoint is committed to persistent stable storage, any fault in the kernel will automatically warm reboot the system and trigger the application to resume exactly at the previous kernel entry which failed. This process can repeat indefinitely until the mini-ckpts protected application completes its execution successfully.
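For illustration, the two trigger paths might look as follows from the application's side. This is a minimal sketch only: the function name mini_ckpt_protect() and the choice of signal are assumptions made here, not the actual mini-ckpts interface.

```c
/* Hedged sketch of the two checkpoint-trigger paths; mini_ckpt_protect()
 * and the signal choice are hypothetical names, not the real interface. */
#include <stdio.h>

/* Declared weak so the program still links and runs when the mini-ckpts
 * library is not LD_PRELOADed. */
extern int mini_ckpt_protect(void) __attribute__((weak));

int main(void)
{
    if (mini_ckpt_protect) {
        /* Path 1: self-triggered, synchronous initial checkpoint. */
        if (mini_ckpt_protect() != 0)
            fprintf(stderr, "initial checkpoint failed\n");
    } else {
        /* Path 2: protection is armed externally instead, e.g. a monitor
         * runs `kill -USR2 <pid>`; the preloaded library's signal handler
         * then performs the same initial checkpoint. */
        fprintf(stderr, "mini-ckpts library not preloaded\n");
    }
    /* ... long-running computation continues under protection ... */
    return 0;
}
```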
3. PERSISTENT VIRTUAL MEMORY ACROSS REBOOTS

An application's virtual memory consists of file-mapped memory regions, anonymous ("private" or lacking a file backing) memory regions, and special (vsyscall or VDSO) regions. Also, file-mapped memory regions may be either private CoW (copy on write) or shared between processes. Some memory regions are subject to write modifications during program execution (e.g., "anonymous" regions: data, stack, heap space).

Internally, the kernel provides applications with anonymous memory virtual pages by designating frames from unused physical memory. The physical-to-virtual layout and structure is stored within the kernel and is volatile, i.e., it can be lost or corrupted should the kernel be rebooted. Another volatile memory type is tmpfs, the 100% in-memory RAM file system. Although tmpfs is entirely stored in physical memory, it is incapable of surviving a kernel restart because the mappings that describe both the layout and contents of files are not stored in persistent memory.

3.1 PRAMFS: Protected and Persistent RAM Filesystem

Our approach requires a 100% in-memory RAM filesystem that can survive a kernel crash and reboot to persist anonymous and/or private memory regions across kernel failures without the need for perfect kernel data structure state.

PRAMFS [4] is a persistent NVRAM filesystem for Linux that is capable of storing both the filesystem's metadata as well as its contents 100% in-memory while bypassing the Linux page cache. A PRAMFS partition can be utilized on volatile RAM as well as NVRAM, although the lifetime of a PRAMFS partition would then be restricted to the duration of power being applied to the RAM (i.e., a PRAMFS filesystem in RAM cannot survive once the power is turned off). PRAMFS can be combined with file mapping in processes as well as execute-in-place (XIP) support to allow a process' file-backed mappings to not only read/write directly to physical memory but also execute directly from PRAMFS [25]. Since memory mappings to PRAMFS are one-to-one with physical memory, an application whose executable, data, stack, and heap are mapped to files based in PRAMFS would, in fact, be entirely executing, reading, and writing within the PRAMFS storage without any volatility should the process be terminated at any point. To put this another way, if all of a process' memory was mapped to PRAMFS files and the process was terminated prematurely at any arbitrary point, then the image in PRAMFS would be very similar to a process' core dump.

3.1.1 Protecting PRAMFS from Spurious Writes During Faults

Some faults may result in spurious writes to parts of memory related to or referenced by the faulting code. mini-ckpts assumes that the memory stored in PRAMFS is guarded by user space protection mechanisms, such as algorithm-based fault tolerance, software fault tolerance, redundancy, or some other means outside the scope of OS resilience [23, 20]. Otherworld investigated the use of write-protected page tables to protect user space memory and reported a runtime overhead between 4%-12% for this additional protection [14], while protecting NVRAM may result in 2x to 4x runtime overheads [22].

3.2 Preparing a Process for Mini-Ckpts

mini-ckpts' goal is to ultimately preserve process state in the event of a kernel crash; therefore, it is imperative that the contents of all memory mappings are safely retained. However, there are several regions of memory that will be lost during a kernel crash, mostly due to their dependence on the volatile page cache, such as copy-on-write mappings, private (anonymous) mappings, data, and the stack.

To always ensure consistency of the non-volatile backing store of applications, mini-ckpts migrates all memory mappings to PRAMFS so that the process' memory may survive a kernel crash. To achieve this, we added functionality to the kernel to pause all running threads of a process, iterate through each virtual memory mapping of the process, and either copy or move each memory mapping to a new, distinct file in PRAMFS. For each memory mapping we created a new file in PRAMFS sized to match each distinct memory region. Then, for each region we copy the contents of the existing memory to the file, unmap the old region, and then remap the new PRAMFS file into the same location with the same read, write, and/or execute flags, but with a shared instead of a private mapping, which ensures that writes instantly update the PRAMFS region.

Note: Once the PRAMFS files are virtually mapped into a process, there are no further dependencies within the kernel to maintain the mappings. The hardware accesses the page tables directly, and even a failing kernel would not necessarily affect an application's direct access to its memory as there is no buffering/caching between the application, kernel, and physical memory.
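The per-region migration can be pictured with the following user-space analogue (in mini-ckpts the equivalent work happens inside the kernel while all threads are paused); the PRAMFS mount point and file name passed in are assumptions for illustration only:

```c
/* User-space analogue of the copy/unmap/remap step described above; the
 * PRAMFS path is an assumption, and the real implementation is in-kernel. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Move [addr, addr+len) onto a PRAMFS-backed shared mapping at the same
 * virtual address; prot carries the region's original protection flags. */
static int migrate_region(void *addr, size_t len, int prot, const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)len) != 0)   /* size file to match region */
        return -1;

    /* Copy the region's current contents into the PRAMFS file. */
    void *staging = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (staging == MAP_FAILED)
        return -1;
    memcpy(staging, addr, len);
    munmap(staging, len);

    /* MAP_FIXED atomically replaces the old private mapping with a shared,
     * file-backed one at the same address; writes now land in PRAMFS.      */
    if (mmap(addr, len, prot, MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
        return -1;
    return close(fd);
}
```

The shared, file-backed mapping is the key design point: stores update the persistent region immediately, so no flush is required before a crash.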
4. INITIAL CHECKPOINTING

Protection for applications in mini-ckpts begins with the first initial checkpoint. This initial checkpoint performs two major functions: first, it remaps all memory into mappings that are backed by PRAMFS, and second, it stores a serialized version of the process' state. A checkpoint can be triggered either locally (synchronously) or through another process (via a signal).

The initial checkpoint modifies the process' memory mappings. To ensure that the application is not running during this time, mini-ckpts interrupts all threads of the target process with a signal and invokes a system call within the threads' signal handlers. The system call traps all threads in the kernel and puts them to sleep until the checkpoint commit operation is complete. Once within the kernel, mini-ckpts remaps all process memory to PRAMFS as previously described. All memory regions, except the VDSO and vsyscall, are individually mapped to separate files in PRAMFS. Since memory regions may be several GB in size, which exceeds PRAMFS' limits, mini-ckpts maps large contiguous mappings into separate files, each no larger than 512MB.

After remapping process memory, mini-ckpts then records process state information to a separate persistent region of memory. This recording largely mirrors the BLCR checkpoint library's storing of information, such as open file descriptors, threads, virtual memory layout and protections, process IDs, signal handlers, etc. (see [15] and Table 1). TCP sockets are not restarted automatically, which we address later in the context of how to restore MPI communication.

Before allowing the application to resume, mini-ckpts marks each running thread as tracked within the kernel. This enables mini-ckpts to capture the registers of each such thread every time it is interrupted by the kernel. Thus, the kernel will always maintain a safe copy of each thread's registers prior to a possible kernel crash. This allows the threads to restart immediately without any loss in computation upon a kernel failure.

5. CAPTURE AND RESTORATION OF REGISTERS

A mini-ckpts checkpoint is taken whenever a thread relinquishes control over a processor: at a (1) system call, (2) interrupt, or (3) non-maskable interrupt. These transitions are triggered by signals, sleeping, synchronization (waiting for mutual exclusion), and both blocking and non-blocking system calls. Each type of transition makes a guarantee as to the state of the processor and registers when control is returned to the interrupted process, even though some may change certain registers. When the hardware encounters an error, or when there is a software bug, such as a stuck processor, non-maskable interrupts may be triggered to initiate the mini-ckpts recovery. The only locations that require kernel-level instrumentation in order to save the registers of an interrupted process are the entry points to both system calls and interrupt handlers within the kernel.

5.1 Implementing Mini-Ckpts at the Interrupt and System Call Level

mini-ckpts needs to save the general purpose registers while entering the kernel. An interrupt already provides most of the functionality needed: mini-ckpts simply calls a routine that copies the saved registers to the mini-ckpt location in persistent memory. If a kernel crash occurs during the later handling of an interrupt, then the most recent register state is safely stored. mini-ckpts also saves the state of the floating point registers besides the general purpose registers.

System calls are handled in a slightly different manner than interrupts, because some general purpose registers containing arguments to system calls are not saved. Since mini-ckpts depends on saving the state of all registers, the entry point for system calls required modifications to allocate kernel stack space for the complete set of registers as well as to both save and restore the registers during the system call entry and exit points.
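A kernel-flavoured sketch of this entry-point hook follows; the structure layout and the field name mini_ckpt_regs_dst are placeholders for illustration, not the actual patch:

```c
/* Sketch of the per-entry register save; pt_regs_snapshot stands in for the
 * architecture's register frame and mini_ckpt_regs_dst is a hypothetical
 * per-thread pointer into persistent memory (NULL => thread unprotected). */
#include <string.h>

struct pt_regs_snapshot {
    unsigned long gpr[16];            /* general purpose registers          */
    unsigned long ip, sp, flags;      /* instruction/stack pointer, RFLAGS  */
};

struct protected_thread {
    struct pt_regs_snapshot *mini_ckpt_regs_dst;
};

/* Called from the system call and interrupt entry paths once the entry stub
 * has spilled the interrupted context onto the kernel stack.               */
static void mini_ckpt_save_entry_regs(struct protected_thread *t,
                                      const struct pt_regs_snapshot *frame)
{
    if (t->mini_ckpt_regs_dst)
        memcpy(t->mini_ckpt_regs_dst, frame, sizeof(*frame));
    /* FPU/SSE state is captured analogously so floating point results
     * survive the warm reboot as well.                                     */
}
```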
If a system call ultimately results in a kernel crash, mini-ckpts will later restore the process' registers, but it still needs to consider what to do about the initiated system call. We provide two solutions:
• Return to the instruction immediately after the SYSCALL instruction. This simulates the interruption of a system call by setting the register containing the system call return value to EINTR, which is a standard Linux return value indicating that the kernel was interrupted and that the user process should try to repeat the system call.
• Try to repeat the system call using the exact same arguments immediately after restoring the thread.

We recommend the former approach because: (1) best programming practices require developers to check and handle the return values from system calls, (2) libraries may already repeat the call for the programmer if they detect EINTR, and (3) if the system call is repeated automatically by mini-ckpts, then it risks potentially containing a handle to an object that no longer exists (such as a socket) after a restore.
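Because the first option surfaces to the application as an ordinary EINTR, well-written code (or the libraries beneath it) already copes with it using the familiar retry idiom:

```c
/* Standard EINTR retry wrapper; after a mini-ckpts warm reboot the
 * interrupted system call simply appears to have returned -1/EINTR. */
#include <errno.h>
#include <unistd.h>

ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```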
5.2 Restoring Registers during Restart

For each thread of a process that is restored after a restart, its general purpose registers must be returned to their state immediately prior to the failure. A helper routine is used to recreate each thread using standard clone system calls, and then each newly created thread requests its registers to be restored through a system call (via an ioctl) to mini-ckpts. To ensure that registers are not clobbered upon returning from a syscall, a trampoline was added that uses a combination of the user space stack and injected code to simulate a transparent restart. This builds on common platform capabilities: (1) the ability for a system call to return to a specific address, and (2) the ability of the kernel to modify the virtual memory of a process.

Our technique resembles setjmp/longjmp and depends on using the stack to feed registers to injected trampoline code:
• A small region of executable helper code is injected into the process being restored.
• A new thread is created for each one lost during the failure.
• Thread registers are restored via a system call.
• The user space stack pointer is retrieved from the registers stored during the system call entry.
• The stack pointer is advanced as the values of all previously saved registers are written directly to the user space stack.
• The kernel's saved return instruction pointer is set to a region of injected code. The system call will now return to a new location.
• The injected code restores each register by popping it from the stack, including the saved flags. The final retq instruction returns execution back to the point immediately prior to the previous kernel failure.

Once the trampoline finishes restoring the thread's registers, the process is unable to discern that it was interrupted at all, except for time measurements spanning between the last mini-ckpt during failure, restart, and the time up until the retq completes in the trampoline.
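A simplified x86-64 rendering of such a trampoline is sketched below. It assumes the restore path has already written the interrupted RIP (deepest), then RFLAGS, then the general-purpose registers onto the user-space stack in the pop order shown; the symbol name and register order are illustrative only, not the actual injected code:

```c
/* Illustrative trampoline (GCC file-scope asm). The restorer is assumed to
 * have laid out the user stack as: GPRs (top), then RFLAGS, then saved RIP. */
__asm__(
    ".globl mini_ckpt_trampoline\n"
    "mini_ckpt_trampoline:\n"
    "    popq %r15\n"
    "    popq %r14\n"
    "    popq %r13\n"
    "    popq %r12\n"
    "    popq %r11\n"
    "    popq %r10\n"
    "    popq %r9\n"
    "    popq %r8\n"
    "    popq %rbp\n"
    "    popq %rdi\n"
    "    popq %rsi\n"
    "    popq %rdx\n"
    "    popq %rcx\n"
    "    popq %rbx\n"
    "    popq %rax\n"
    "    popfq\n"            /* restore the saved RFLAGS                      */
    "    retq\n"             /* pop the saved RIP: resume exactly where the   */
                             /* thread was when the failed kernel was entered */
);
```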
6. HANDLING KERNEL FAULTS

A failure in the kernel should result in a software fault handler known as a kernel panic. Within the Linux kernel, the default action of the panic handler is to attempt a backtrace of the current stack, print the current registers, print an error message, if provided, and then halt the system. A kernel panic may be triggered by memory hardware faults, software bugs, or assertion failures within kernel code. Depending on the severity of the failure, it might not be possible for the kernel to perform its final debug print messages prior to halting. Additionally, in multicore (SMP) systems, the panic handler is tasked with attempting to halt all other processors.

6.1 Warm Rebooting

The Linux kernel provides a crash-kernel mechanism (kexec) wherein the panic handler attempts to perform a warm reboot into a secondary crash kernel and simultaneously preserve a copy of its own memory for later observation and debugging from the crash kernel. Because the intent of a crash kernel is merely to provide the most basic services in order to inspect or save the memory of the previous kernel, the system is warm booted in an unusual configuration: SMP is unavailable, and the core on which the new kernel is running is typically the same core that previously experienced a kernel panic. For the purposes of HPC applications, any warm reboot would require a stable system with all SMP cores available. Once in this state, recovering would normally require a full reboot through the BIOS.

To provide a fully functional SMP system after a kernel panic, mini-ckpts implements a migrate-and-shutdown protocol that is used during a failure. First, since a failure may occur on any core, mini-ckpts installs a specialized NMI (Non-Maskable Interrupt) handler once a panic is detected. Next, since other cores running in parallel may not yet be aware of the failure, the failing core sends an NMI to all other cores. This NMI forces all cores to immediately jump to the NMI handler. As part of the NMI handler, any and all mini-ckpts protected processes save their registers. Once in the NMI handler, core 0 takes the lead (if it was not already the first core to panic) since it must be the first CPU to perform a warm reboot. Once core 0 has detected that all other cores have signaled that they are halted, core 0 begins unpacking and relocating a fresh copy of the kernel, sets up basic page tables, passes memory mapping information for the PRAMFS and persistent regions, and jumps to the entry point of the new kernel. It has essentially performed an emergency shutdown of all cores and then completed the same steps a traditional bootloader would do to start an operating system.

6.2 Requirements for a Warm Reboot

Minimizing the data structures required to perform a warm reboot is critical to mini-ckpts' success. The code paths needed for a successful warm reboot begin at a call to panic. The paths to reaching panic are very broad but may include dereferencing invalid pointers or assertion failures within the kernel. Once panic is reached, mini-ckpts ensures that no further memory allocation will be required. Its dependencies then involve functions that typically execute between 1-5 instructions, such as shutting down the local APIC and high performance timers, and then signaling NMI interrupts for all cores. For NMI interrupts to work, we must also assume that the code paths and interrupt vectors themselves remain valid.

Within the NMI handler, we must assume that the per-CPU interrupt stacks remain available. This also assumes that hardware registers were not corrupted as part of the failure. Once in the NMI handler, if the panic has interrupted a mini-ckpts protected task, mini-ckpts must be able to access a 64-bit pointer in the thread's thread_info struct. It is fine if any other thread or task related members have been corrupted. The single pointer both indicates (if non-zero) that the thread is protected and, in doing so, points directly to where the interrupted thread's registers should be saved in persistent memory. Locating the thread_info struct is fortunately trivial if the thread's registers are valid, since it is calculated based on the kernel's stack pointer.

Finally, mini-ckpts assumes that the persistent region of memory where mini-ckpts stores a serialized copy of the protected process remains safe. The checkpoint itself is typically under 100KB, which is small relative to today's main memory size. Additionally, software correction codes could be applied to the persistent memory if desired.
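The single kernel-state dependency described above can be pictured as follows; the structure layout, field name, and THREAD_SIZE value are placeholders rather than the real Linux 3.12 definitions:

```c
/* Sketch: inside the NMI handler, derive thread_info from the interrupted
 * kernel stack pointer and trust only one added field in it.               */
#include <string.h>

#define THREAD_SIZE 16384UL                    /* assumed kernel stack size  */

struct reg_frame { unsigned long regs[21]; };  /* stand-in for pt_regs       */

struct thread_info_stub {
    /* ...any other members may already be corrupted by the fault...        */
    struct reg_frame *mini_ckpt_dst;           /* non-NULL => protected; points
                                                  into the persistent region */
};

static void nmi_save_protected_regs(unsigned long kernel_sp,
                                    const struct reg_frame *frame)
{
    /* thread_info sits at the base of the per-thread kernel stack in this
     * kernel generation, so masking the stack pointer locates it.          */
    struct thread_info_stub *ti =
        (struct thread_info_stub *)(kernel_sp & ~(THREAD_SIZE - 1));

    if (ti->mini_ckpt_dst)
        memcpy(ti->mini_ckpt_dst, frame, sizeof(*frame));
}
```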
7. SUPPORTING MPI APPLICATIONS

Traditional checkpoint/restart tools do not support automatic reconnection of network sockets for applications during restart. Instead, they tend to provide a callback procedure for applications that moves the burden (and programming logic required) to the developer, if they wish to support the checkpoint/restart paradigm. But kernels typically buffer both file I/O and network traffic, so an application may falsely believe that it has successfully sent data that is still buffered and then lost on a kernel crash. As a result, MPI implementations that do support checkpoint/restart may require network traffic to quiesce and all buffers to be drained before a checkpoint is taken. Moreover, MPI implementations consider the loss of communication with a peer process as a critical error forcing an entire job to terminate without saving its computation.

mini-ckpts supports HPC workloads via run-through / forward progress fault tolerance for MPI applications through librlmpi. librlmpi specifically handles transient network failures, network buffer loss, and incomplete message transmission, which all may occur during a local or remote (peer) kernel failure. librlmpi is reliable in that it supports handling lost messages either on the network wire or in kernel buffers, and in that it can tolerate a network failure during any point of its execution or any of its system calls.

7.1 librlmpi Internals

Internally, during normal execution, librlmpi depends on poll, writev, and readv. It uses a threaded progress engine that monitors the return values of system calls. It can detect when mini-ckpts has transparently warm rebooted the system by watching for an EINTR system call error as discussed earlier. Upon detection of a warm reboot, librlmpi resets any message buffers in transit (pessimistically assuming they failed) and reestablishes connections with its peers. Likewise, a peer that has not undergone a failure is able to accept a new connection from a lost peer and resume communication where it left off. Any message transmissions that were cut off during a failure are restarted. Any data that had been buffered in kernel send or receive buffers will be resent, as librlmpi employs positive acknowledgments between peers.

librlmpi uses internal message IDs to uniquely identify and acknowledge transmissions between peers. After a failure, the peer node that failed (and subsequently warm rebooted) reestablishes connections with its peers. Both peers then send recovery details on the last messages they received even if they were not acknowledged earlier, i.e., a message is only resent if it was (1) not yet sent, or (2) lost in the kernel buffer or network. During normal execution and recovery, any message IDs that preceded the most recent acknowledgment are marked as complete (i.e., MPI_Wait or MPI_Test finish) and the space occupied by the message may be reused, per the normal MPI interface.

librlmpi employs the equivalent of a ready-send transmission protocol. Although ready-send semantics are not enforced at the application level, librlmpi will not transmit an outgoing message until its peer has sent a receive envelope indicating it is ready and has a buffer available to receive the message. If an MPI application attempts to send a message prior to a receive being posted, the transmission will be delayed until the receive is posted. From an application's point of view, we follow the MPI standard and support properly written MPI applications without any modifications.

Overall, librlmpi combined with mini-ckpts transparently allows an MPI application to be interrupted by a kernel panic and resume execution without loss of progress. It does not require any modification to an MPI application. If an MPI application wishes to begin its mini-ckpts protection automatically (instead of using an external signal), an API is provided which allows the application to initiate protection.
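The acknowledgment bookkeeping described above amounts to a simple per-message decision after a reconnect; the types and function below are illustrative, not librlmpi's API:

```c
/* Sketch of librlmpi's resend decision after reestablishing a connection. */
#include <stdbool.h>
#include <stdint.h>

enum msg_state { MSG_QUEUED, MSG_SENT, MSG_ACKED };

struct out_msg {
    uint64_t       id;       /* monotonically increasing per peer           */
    enum msg_state state;
};

/* Resend iff the message was never sent, or was sent but is not covered by
 * the peer's most recent acknowledgment (possibly lost in a kernel buffer
 * or on the wire).                                                         */
static bool needs_resend(const struct out_msg *m, uint64_t last_acked_id)
{
    if (m->id <= last_acked_id)      /* acked: MPI_Wait/MPI_Test may finish */
        return false;
    return m->state != MSG_ACKED;
}
```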
7.2 MPI and Language Support

librlmpi supports basic MPI point-to-point communication and many collectives, and it provides specialized routines to allow an MPI application to enable mini-ckpts protection and inject kernel faults for testing kernel panic recovery.

8. EXPERIMENTAL SETUP

mini-ckpts is evaluated in both physical and virtualized environments to demonstrate mini-ckpts effectiveness and application performance overheads. Fault injections are provided as callable triggers that deliberately corrupt kernel data structures or directly invoke a kernel panic. mini-ckpts is evaluated on both OpenMP parallelized applications and MPI-based applications using a prototype MPI implementation, librlmpi, to handle network connection loss at any point during execution.

Our test environments used for evaluation include: (1) 4 nodes with AMD Opteron 6128 CPUs, (2) 1 node with an Intel Xeon E5-2650 CPU, and (3) KVM virtualized environments running on the same AMD and Intel hosts. (1) and (2) are referred to as bare metal in the results section. The bare metal environments are all booted in a diskless NFS root environment. The virtual machine environments use a "share9p" root filesystem with their hosts, which, for practical purposes, is similar to a diskless NFS root for the virtual machine. In all configurations we provide 128MB of memory for storing persistent checkpoint information (although typically <100KB is used) and an additional 4GB of memory for the PRAMFS partition, which is where the memory of protected processes will reside during execution. All systems have been configured conservatively to optimize boot time: the system's boot order prioritizes booting directly to our custom kernel, which bypasses network controller initialization and any potential scanning for other boot options.

The experimental operating system modified for mini-ckpts is based on Linux 3.12.33, and our modified user space checkpoint library is a branch of BLCR [15] version 0.8.6 b4. Our bare metal and virtual machine experiments both use the same root environment based on Fedora 22. The kernel is configured for loadable module support, but all drivers, etc. are compiled directly into the kernel to tune it for a cluster environment. The only modules that are loaded at runtime are our modified BLCR checkpointing module and modules that are used to inject faults during our evaluation. As a result, when evaluating boot times and methods, we omitted the use of an initial ramdisk. Our hypervisor is qemu-kvm 1.6.2.

The NAS Parallel Benchmarks (NPB), OpenMP and MPI versions [8], are used to evaluate the effectiveness of mini-ckpts in multithreaded/MPI environments. NPB includes computational kernels whose performance is indicative of other large-scale workloads and is a common benchmark suite used in HPC. NPB DC (Data Cube) is excluded as it depends on heavy disk I/O. The benchmark classes (input/problem sizes) used are listed in Table 3.

For MPI, we evaluate the performance of the CG (Conjugate Gradient), IS (Integer Sort), LU (Lower-Upper Gauss-Seidel solver), and EP (Embarrassingly Parallel) benchmarks, as others depend on MPI functionality not implemented within our librlmpi prototype, such as MPI_Comm_split. We further evaluate for MPI the PENNANT mini-app version 0.7, an unstructured physics mesh mini-application from Los Alamos National Laboratory [3], and the Clover Leaf mini-app version 1.0 from Sandia National Laboratories, an Euler equation solver [1]. The OSU Micro Benchmarks version 4.4.1 are used to evaluate librlmpi's point-to-point latency, bandwidth, and collective operations [2]. Open MPI version 1.8.5 is included in the experiments to compare the performance of a mainstream MPI implementation against our prototype implementation, librlmpi, and to establish any relative difference. All MPI benchmarks are run across four AMD Opteron 6128 nodes using Gigabit Ethernet for communication.

Table 3: Benchmark classes (input/problem sizes) used for 8 threads / 4 MPI tasks
  NPB OpenMP:  CG=B, EP=B, IS=C, LU=A, BT=A, FT=B, MG=B, SP=A, UA=A
  MPI:         CG=C, EP=C, IS=C, LU=B;  PENNANT: sedov (400 iter.);  Clover Leaf: clover_bm2_short.in

Table 4: Observed times for both cold and warm booting a mini-ckpts-enabled system (measured in seconds)
                     BIOS    Kernel      Netw. driver +   Kernel   Software   Cold total   Warm boot
                     boot    boot total  NFS-root mount   misc     stack      w/ BIOS      total
  AMD Bare Metal     37.4    5.3         1.5              4.8      0.7        50.3         6.0
  Intel Bare Metal   50.8    6.7         3.0              3.7      0.7        73.0         7.4
  AMD VM             —       0.8         < 0.2            < 0.6    3.0        —            3.8
  Intel VM           —       0.7         < 0.2            < 0.5    1.3        —            1.9
9. RESULTS

We first measure the time required for a full vs. warm reboot on both bare metal and virtualized systems (Table 4). The cost of booting, measured from the BIOS to the stage at which the bootloader is reached, is approximately 37 and 50 seconds on our bare metal test systems. As explained in the experimental setup section, the systems' boot order was optimized by directly booting to our kernel. Once the bootloader is reached and the kernel begins executing, the boot process requires between 5-7 seconds before the kernel is fully initialized. Of this, device initialization accounts for the majority of time. The network card drivers require between 1-3 seconds to initialize (even though IP addressing is statically configured vs. DHCP and the addressing is configured by the kernel instead of user space utilities). The remaining time is spread out among the initializations of various kernel subsystems and other devices that are configured relatively quickly. In virtualized environments on the same systems as the AMD and Intel bare metal measurements, we observe that the same kernel boots in less than one second. The primary difference is the lack of slowly initializing hardware devices: the Ethernet driver required less than 0.1 seconds to complete, and the disk driver (share9p) completed initialization and mounting in under 0.2 seconds.

mini-ckpts never requires a full reboot in either a virtualized or bare metal environment, regardless of whether it is activated simply to rejuvenate a kernel or in response to a failure. The important metric to consider is the difference in boot times excluding the BIOS, which is between 5-7 seconds vs. approximately 2-4 seconds for a virtual machine (VM). This number represents the approximate downtime incurred by a kernel panic when combined with the cost of booting our software stack. We observe that the total cost of warm rebooting either bare metal environment is between 6-7.4 seconds. After the kernel boots, our software stack usually requires about one second to start up services and modules before resuming a rescued application.
9.1 Basic Applications

To ensure mini-ckpts' basic correctness, we developed test applications that iterated through hardware registers while setting them to known deterministic patterns in single and multithreaded environments. During execution, we periodically injected a kernel panic. Through over 100 repeated kernel panics, mini-ckpts was able to continuously warm reboot the system and continue executing the application without affecting its state.

We next developed a similar test that performed floating point (FPU) calculations on multiple cores, designed to only print a failure if the FPU operations (multiply, divide, add, subtract) diverged from their expected value by more than 10^-5, to account for FPU rounding errors. Again, mini-ckpts was able to protect this FPU test against failures for over 100 consecutive panics during the execution of the same application.

mini-ckpts was then evaluated against the sh shell, vi text editor, and python interpreter. While mini-ckpts has not been extended to support updating file descriptor seek pointers, it was able to continue executing these applications correctly provided they did not perform extensive file I/O. The vi editor did require a command to be blindly typed to reset its terminal display after each panic, since the terminal would not be aware that it needs to refresh after a transparent warm reboot.

9.2 OpenMP Application Performance

We next evaluated mini-ckpts with HPC benchmarks from the NAS Parallel Benchmarks (NPB) suite, OpenMP version 3.3 [8]. The results for all environments (8x runs each) are in Figure 1. Bars represent the average runtime per benchmark, excluding initialization, and the error bars indicate the observed minimum and maximum time. We next discuss the potential for hardware architectures to create performance variability in terms of mini-ckpts runtime cost.

9.2.1 NUMA Constraints Imposed by PRAMFS

Naively remapping application memory to PRAMFS may have implications for performance: in Linux, anonymous memory mappings (i.e., heap and stack) may be provided by any free memory. A specialized memory allocator may request memory local to a specific CPU (closest memory controller). PRAMFS, however, is statically allocated to a specific region of physical memory. In non-uniform memory access (NUMA) architectures, this implies that PRAMFS mapped memory regions will have differing latencies between memory accesses on varying cores. We observed that the minimum and maximum runtimes vary by 62% for the CG benchmark when no core pinning was applied to OpenMP threads (figures omitted due to space constraints). As the scheduler moved threads between cores, the overall runtime of the application varied based on NUMA locality. While CG had the most pronounced effect, the other benchmarks displayed some degree of variance as well.

We were able to confirm that NUMA was responsible for the runtime variance through two experiments:

(1) We created a microbenchmark that used mmap to map 64MB of memory into our application. In a linear fashion, we then wrote 6GB worth of cumulative writes over this region. Using core pinning, we timed this experiment running on each core. This experiment was repeated for both the AMD and Intel bare metal systems as well as within virtual machines on both. The results of each experiment are shown in Table 5.

Table 5: NUMA microbenchmark interacting with PRAMFS memory mappings (all times in seconds)
  Cores      0-3     4-7     8-11    12-16
  AMD        1.42    2.04    3.25    3.30
  AMD VM     3.2 - 3.4
  Intel      0.90    1.12
  Intel VM   0.95

We see locality effects for PRAMFS, and we also observed that by changing the physical location of the PRAMFS the NUMA localities also move, as expected. Additionally, we compared the runtime of writing to a PRAMFS mapping vs. an anonymous mmap memory mapping and found the runtimes had a difference of 2.7% on average (in either direction) for the AMD machine and a 1% difference (in favor of PRAMFS) on the Intel machine. On the virtual machines, we found that regardless of core pinning within the VM, the hypervisor treats each virtual CPU as a thread, which is subject to migration by the hypervisor's scheduler. As shown in Table 5, the AMD VM experienced performance near its bare metal host's worst NUMA mappings. The Intel VM showed a performance between its host's best and worst performance as the scheduler migrated the virtual cores during execution.

(2) To isolate the effects of PRAMFS from mini-ckpts, we repeated the benchmarks by modifying mini-ckpts to only perform PRAMFS remappings, but not to make any further modifications or checkpoints on the process. When combined with core pinning and knowledge of the NUMA latency of each core, we were able to repeat both the best and worst case runtimes for each experiment. Repeated experiments with best-case NUMA mappings and core pinning are shown in Figure 1.

In comparing the two OpenMP experiments with (Figure 1) and without core pinning, we observed that mini-ckpts in combination with intelligent physical memory mapping of PRAMFS reduced the AMD host runtime overhead on average from 26% to 6.9%. The Intel host runtime overhead was reduced on average from 25% to 5.9%.
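The microbenchmark from experiment (1) can be approximated as below; the anonymous mapping shown here was compared against the same region backed by a PRAMFS file, and core pinning (e.g., via taskset) is applied outside the snippet:

```c
/* Approximate reconstruction of the NUMA microbenchmark: 6 GB of linear
 * writes streamed over a 64 MB mapping, timed per pinned core.            */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define REGION (64UL << 20)               /* 64 MB mapping                  */
#define TOTAL  (6UL  << 30)               /* 6 GB of cumulative writes      */

int main(void)
{
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long done = 0; done < TOTAL; done += REGION)
        memset(buf, (int)(done / REGION), REGION);    /* one linear pass    */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%.2f s\n",
           (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
```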
[Figure 1: Runtime performance of the NPB OpenMP benchmarks running with 8 threads, core pinning applied. Panels: (a) Bare Metal AMD Opteron 6128, (b) Bare Metal Intel E5-2650, (c) KVM Hypervisor on AMD Opteron 6128 Host, (d) KVM Hypervisor on Intel E5-2650 Host. Each panel plots runtime in seconds for BT, CG, EP, FT, IS, LU, MG, SP, and UA, comparing Baseline against Mini-ckpts Enabled.]

9.3 MPI Application Performance

mini-ckpts requires customized MPI implementation support to handle lost network connections and buffers on both the sender and receiver. An implementation must also be prepared to handle unexpected failures during system calls by anticipating an EINTR return value from system calls that mini-ckpts recovered from after a warm reboot. Our prototype implementation, librlmpi, supports these requirements. To provide a meaningful evaluation of the performance of mini-ckpts with MPI, we also show performance comparisons between vanilla librlmpi (without mini-ckpts enabled) and a mainstream MPI implementation, Open MPI. Additionally, mini-ckpts requires remapping all process memory to PRAMFS, so we include an experiment that triggers PRAMFS remapping but does not activate mini-ckpts protection. This experiment provides insight into the effects of PRAMFS memory placement (NUMA) without incurring any mini-ckpts overhead. Open MPI currently cannot handle recovery after a mini-ckpt, but the relative performance comparison provides the reader with a grasp of librlmpi's performance against a known implementation.

Benchmarks are reported from 8x runs each with bars for minimum/maximum values. We omit detailed results from micro benchmarks in favor of a brief summary due to space.

Figure 2 shows the runtimes of four NPB MPI benchmarks, PENNANT, and Clover Leaf over 4 AMD nodes using Gigabit Ethernet. Compared to Open MPI, librlmpi on its own performs better for IS (6% faster). Performance is similar for CG (-1%), EP (0.7%), LU (2%), and PENNANT (3%), and is 8% slower for Clover Leaf. The results, as both percent differences and differences in geometric mean (librlmpi in various configurations vs. Open MPI), are shown in Table 6. These results show that librlmpi provides point-to-point message performance comparable to Open MPI, at least at our MPI job size, when message latency is not a bottleneck to application performance.

Table 6: Differences in runtime & geometric mean of MPI benchmarks with librlmpi (L), PRAMFS (P), and mini-ckpts (M) relative to Open MPI
                    A: L     B: L+P   C: L+P+M   B-A (P)   C-B (M)
  NPB CG            -1.0%    -0.2%     1.0%       0.8%      1.2%
  NPB LU             2.0%     5.5%     6.5%       3.5%      1.1%
  NPB EP             0.7%     0.7%     0.8%       0.0%      0.1%
  NPB IS            -6.0%    -5.1%    -6.1%       0.9%     -1.0%
  PENNANT            3.1%     1.3%     2.8%      -1.8%      1.5%
  Clover Leaf        7.9%    11.2%    13.4%       3.2%      2.2%
  Averages           1.1%     2.2%     3.1%       1.1%      0.9%
  Diff. Geo. Mean    1.0%     2.1%     2.9%       1.0%      0.8%

[Figure 2: Runtime comparisons of various MPI benchmarks in differing MPI stack configurations: Open MPI, librlmpi, librlmpi with PRAMFS remap, and librlmpi with mini-ckpts & PRAMFS. Runtime in seconds for NPB CG, NPB LU, NPB EP, NPB IS, PENNANT, and Clover Leaf.]

Relative to Open MPI, the overall average difference of just librlmpi for all benchmarks is 1.1% with a difference in geometric mean of 1.0%. As we enable PRAMFS with librlmpi, the average difference becomes 2.2% with a difference in geometric mean of 2.1%. Finally, with full librlmpi, PRAMFS, and mini-ckpts enabled, the average difference becomes 3.1% with a difference in geometric mean of 2.9%.

Table 6 additionally compares librlmpi against itself. Since librlmpi is a prototype demonstrating the capability of mini-ckpts to work with an MPI implementation, the overheads of PRAMFS and mini-ckpts relative to baseline librlmpi (instead of Open MPI) are now discussed. By subtracting column A from column B, it is observed that on average running an application with data migrated to PRAMFS accounts for 1.1% of the overhead. Further, considering only the cost of mini-ckpts by subtracting column B from column C, it is observed that on average mini-ckpts accounts for 0.9% of the overhead.

9.3.1 Injections During MPI Execution

The performance of the librlmpi prototype is used as a tool to demonstrate mini-ckpts functionality rather than its MPI performance. mini-ckpts provides protection at a per-node level. We investigated whether failures incur a constant cost, independent of the number of processes in a job as well as independent of the type of job running. We designed injection experiments for MPI applications to evaluate how mini-ckpts scales with MPI. The first experiment involves picking a single node to repeatedly inject kernel failures into. We vary the number of injections seen per run of the application from zero (failure-free case) to four. Our second experiment alternates between all nodes, each time picking a new node to fail. For instance, in our four-process jobs, we first fail node 1, followed by 2, then 3, and finally node 4.

Figure 3 summarizes our injections during MPI application execution. As each benchmark's failure-free runtime differs, we start the y-axis at 0 and measure only the additional time required when one or more kernel failures are injected. Each MPI application run was conducted with zero to four failures, and each failure count was repeated in (1) a scenario where one node receives all of the failures, or (2) a scenario where the failures were distributed ("alternating") between all nodes. A separate experiment was run for every possible failure count and target. Each experiment was repeated 8 times, resulting in error bars that show the observed minimum and maximum runtime, while the points represent the average. We observe a linear runtime increase as the number of injections is incremented for each benchmark.

[Figure 3: Runtime costs per kernel panic injection for MPI applications demonstrating linear slowdown. Additional runtime in seconds vs. number of kernel panic injections (0-4) for CG, IS, and LU, with same-target and alternating-target injection series.]

On average, the additional time increase per injection was between 10-13 seconds, as shown in Table 7. These experiments were run on AMD nodes where the expected total warm reboot time (as shown in Table 4) was previously observed to be 6 seconds, excluding the time to restart an application. We expect that warm reboots will incur a slight additional overhead besides the time taken for the reboot itself, since the CPU caches will have been flushed when the operating system restarts. This is similar to a benchmark performing a "warm-up" prior to starting the primary computation. A mini-ckpts restart is equivalent to jumping straight into main execution without warming up the cache. The restart times from the MPI injections show that failures increase the application runtime in a linear manner.

Table 7: Average additional runtime with varying number of kernel panics in MPI applications (all times in seconds)
  # Kernel Panics   1       2       3       4
  Time Increase     12.53   11.89   10.76   10.14
These types of injections, if they such as an infinite loop in a kernel driver. While Linux were to occur in the wild, would only be remedied by is preemptible, a kernel thread may disable preemption mini-ckpts if the kernel implements a sanity check or to avoid being interrupted during some work. If a fault bug check that ultimately detects and reacts (by pan- or bug occurs during this time, it is possible that the icking). This is an assumption on our work, but we will CPU will become indefinitely hung. While mini-ckpts point out that — even during the development of mini- cannot directly protection against this type of failure, ckpts — there were several times that mini-ckpts mis- an NMI watchdog may help: behaved only to be caught by one of the kernel’s safe An NMI watchdog within modern hardware period- guards, such as hang detection or file I/O errors. This ically sends a non-maskable interrupt to the OS as a ultimately caused mini-ckpts to reboot the system to forward progress / liveness test. When it fails, the a sane state immediately, making kernel development watchdog can directly call a kernel panic (which mini- easier than on bare metal without our improvements. ckpts uses to free the system from its hang and warm 9.4.1 Invalid Pointers reboots). As part of our testing, we injected hangs in regular system calls and interrupt handlers. We were We experimented with pointer injections, either set- able to successfully detect and recover from all failures. ting a pointer to NULL or switching it to a random value. NULL pointers are the easiest to catch and were 9.4.4 Soft Hangs protected 100% of the time, since NULL pointer derefer- Unlike hard hangs, a soft hang is a software loop that encing is an easily caught exception. Changing a pointer is not making forward progress but is still on a code with a non-NULL value had differing outcomes: If the path that clears the NMI watchdog timer. These types pointer was not a valid address, then it was caught as of hangs are much more difficult to detect because the an exception (leading to a panic), just as easily as a operating system appears to be making progress despite NULL pointer. However, changing the low order bits on being indefinitely trapped in a loop. While we did not a valid kernel pointer typically resulted in a call chain explicitly test for these types of hangs, we did notice that would randomly jump through the kernel until an that the read-copy-update (RCU) mechanism was able invalid reference or sanity check was reached. to detect hangs within mini-ckpts’ file I/O routines During each of the NAS benchmarks we performed taking several minutes to complete during debugging the following experiments: (1) We forced a direct call and development. to panic. (2) We injected a NULL or invalid pointer to task_struct fs member, signal handlers, parent mem- ber, and/or files member. We were able to detect and 10. RELATED WORK recover from each injection. We also scheduled forced Traditional checkpoint/restart of tightly-coupled panics via a NULL pointer on each core of the system. HPC applications is a well known and researched field. This test showed that mini-ckpts is able to recover Typically, checkpointing involves saving the state of one or more processes to stable storage (usually a disk), in- tem with a reliable write-back file cache in memory that cluding the memory, memory layout, open files, threads, is capable of surviving warm reboots. Rio guarantees and handles to other services provided by the kernel. 
9.4.3 Hard Hangs

Some types of faults result in hanging the system, such as an infinite loop in a kernel driver. While Linux is preemptible, a kernel thread may disable preemption to avoid being interrupted during some work. If a fault or bug occurs during this time, it is possible that the CPU will become indefinitely hung. While mini-ckpts cannot directly protect against this type of failure, an NMI watchdog may help: an NMI watchdog within modern hardware periodically sends a non-maskable interrupt to the OS as a forward-progress/liveness test. When this test fails, the watchdog can directly call a kernel panic, which mini-ckpts uses to free the system from its hang and warm reboot. As part of our testing, we injected hangs in regular system calls and interrupt handlers. We were able to successfully detect and recover from all failures.
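A hard-hang injection of this kind could be sketched as a further hypothetical command in the module above: it disables preemption and local interrupts and spins, so the CPU stops servicing timer interrupts. Assuming the hardlockup (NMI) watchdog is enabled and configured to panic (for example, kernel.hardlockup_panic=1 -- an assumed configuration, not necessarily our testbed's), the hang is converted into the panic path that a warm reboot can handle.

    /*
     * Hypothetical FI_HARD_HANG handler: spin with preemption and local
     * interrupts disabled so this CPU makes no forward progress.  With the
     * hardlockup (NMI) watchdog enabled and set to panic, the hang becomes
     * a panic that the warm-reboot path can recover from.
     */
    #include <linux/preempt.h>
    #include <linux/irqflags.h>
    #include <asm/processor.h>   /* cpu_relax() */

    static void fi_hard_hang(void)
    {
            preempt_disable();
            local_irq_disable();
            for (;;)
                    cpu_relax();  /* never returns; the watchdog NMI still fires */
    }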
9.4.4 Soft Hangs

Unlike hard hangs, a soft hang is a software loop that is not making forward progress but is still on a code path that clears the NMI watchdog timer. These types of hangs are much more difficult to detect because the operating system appears to be making progress despite being indefinitely trapped in a loop. While we did not explicitly test for these types of hangs, we did notice during debugging and development that the read-copy-update (RCU) mechanism was able to detect hangs within mini-ckpts' file I/O routines that took several minutes to complete.

10. RELATED WORK

Traditional checkpoint/restart of tightly-coupled HPC applications is a well-known and well-researched field. Typically, checkpointing involves saving the state of one or more processes to stable storage (usually a disk), including the memory, memory layout, open files, threads, and handles to other services provided by the kernel. Checkpointing usually comes in two flavors: (1) system-level checkpointing, such as BLCR [15], which aims to provide transparent checkpoint and restart support to applications via kernel modules without requiring user-space application modifications, and (2) user-level checkpointing, such as MTCP [35] and CRIU [17], which also aims to provide transparent or assisted checkpointing of processes without requiring kernel modifications or kernel modules. Although traditional checkpointing methods may vary in how they are implemented or the services they provide, all restore a process to an earlier known good state. For long-running applications that are infrequently checkpointed due to the overhead associated with committing a checkpoint, the loss of progress due to a restart from failure may dramatically reduce the efficiency of an application.

To mitigate the overhead incurred by checkpointing, a number of optimizations have been studied. These include incremental checkpointing [6, 41, 19, 18], which typically saves only modified portions of a program, checkpoint compression [34, 28, 27], and moving incremental checkpoint logic into the kernel [21]. Even with such optimizations, when checkpointing, restarts, and rework are modeled to match future exascale systems' expected failure rates, studies show that applications may spend more than 50% of their time in checkpoint/restart/rework phases instead of making forward progress [16, 33, 36]. In these application-based systems, failures that occur within the operating system require an application to roll back and rework, as there simply is no other recourse. However, mini-ckpts can eliminate unnecessary restart/rework when the application itself remains recoverable, despite a failed kernel.

The idea of an operating system resilient to software errors is not new [7, 29, 12]. The MVS [7] operating system was designed with fault tolerance in mind for both software and hardware errors by requiring that all services be accompanied by an error recovery routine that executes in the event of a failure. This protection increases both the complexity and size of the operating system. In the case of MVS, recovery routines were provided for only 65% of the critical jobs, and despite the efforts of the recovery routines that were provided, in cases where recovery was activated due to a fault, MVS was only successful in preventing an abort 50% of the time [40]. Noting that many kernel failures occur due to bugs within device drivers, the NOOKS [39] framework wrapped calls between the core Linux kernel and device drivers to isolate the effects of one from the other. For cases where bugs are known to exist in drivers, NOOKS provides a transparent solution, but it incurs a latency and bandwidth cost due to the wrapping of communication to and from drivers.

The Rio File Cache [13] provides the operating system with a reliable write-back file cache in memory that is capable of surviving warm reboots. Rio guarantees that a file write operation buffered in the kernel during a failure will not be lost, provided that the kernel is able to perform a warm reboot, recover the cache's unperturbed contents, and resume the write operation to stable storage on subsequent reboots. Rio is an example of an independent part of the operating system making guarantees about forward progress regardless of failures by providing a near-zero-cost virtual write-through resilience cache.

Protecting the operating system in other ways, such as remote patching in Backdoors [11] or recovering from disk and memory, has been studied, but these approaches have not focused on generalized, transparent, high-performance application restart in an HPC environment where automatic, low-latency recovery is essential for efficiency [10]. Additionally, work that requires remote intervention or local disk access may not be capable of functioning in circumstances where driver or kernel support for the network or disk is not guaranteed. mini-ckpts is specialized and superior in this area in that it aims to ensure minimal internal kernel dependencies during fault handling.

Otherworld [14] is perhaps the most similar work to ours in that it attempts to recover applications in the event that a kernel fails. Otherworld maintains a "crash kernel" loaded in separate memory, which receives control of the system upon failures. Their crash kernel depends on known offsets of data structures within the old kernel to attempt to parse and reconstruct the old kernel's data structures. This includes a wide variety of traversals and parsing of memory mappings, process structures, file structures, and so forth. Thus, Otherworld depends heavily on an intact and correct kernel state upon failure, which is a significant limitation. The event that caused a kernel to fail may be the effect of a corrupt data structure (i.e., one that would continue to exist upon restart), or the crash itself may create corruption within kernel data structures. Many corruptions may make Otherworld's reconstruction impossible, while mini-ckpts provides non-stop execution without this limitation.

mini-ckpts may serve a secondary role beyond reactive fault tolerance for operating systems by providing a means of fast kernel rejuvenation when used proactively, outside the existence of an imminent crash or fault [24]. Outside of OS designs involving non-volatile memory, to the best of our knowledge, this work is the first in its class to provide fast operating system restarts without checkpointing or interrupting an application [9]. Rejuvenating the OS in HPC workloads and virtual machine monitors has been shown to be beneficial in reducing downtime [32, 31].
11. CONCLUSION

Today's operating systems are not designed with fault tolerance in mind, despite the fact that OS memory appears more likely to fail than the remainder of memory. The default mechanism for handling a failure within Linux and commodity OSs is to print an error message and reboot, causing a loss of all unsaved application data on the node and typically triggering a restart of the remainder of the nodes in the application. For HPC applications, where checkpointing, rollback, and rework are expensive, mitigating an OS crash by allowing warm reboots and recovering applications without data loss can provide a safeguard against memory corruption, system hangs, and other unexpected failures. This work has identified the key points of instrumentation within the Linux kernel required to save the critical state of an application during an OS failure and has provided mechanisms via persistent memory to enable restoration of application data across reboots. We provide an experimental implementation and evaluation of our prototype, mini-ckpts, which is capable of protecting HPC applications from crashes within the kernel by providing non-stop forward progress immediately after a short warm reboot. We show that a mini-ckpts-enabled kernel can successfully protect an application from computation loss due to a kernel failure, perform a warm reboot, and transparently resume execution within 3-6 seconds of a fault occurring. We demonstrate the effectiveness of our work by injecting faults into OpenMP and MPI HPC benchmarks and observe runtime overheads that average between 5% and 6% for OpenMP applications and 3.1% for MPI applications.
12. REFERENCES

[1] Clover Leaf. http://uk-mac.github.io/CloverLeaf/.
[2] OSU MPI micro benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/.
[3] PENNANT. https://github.com/losalamos/PENNANT.
[4] Protected and persistent RAM filesystem. http://sourceforge.net/projects/pramfs/.
[5] Top 500 list. http://www.top500.org/, June 2002.
[6] S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In Proceedings of the 18th Annual International Conference on Supercomputing, ICS '04, pages 277-286, New York, NY, USA, 2004. ACM.
[7] M. A. Auslander, D. C. Larkin, and A. L. Scherr. The evolution of the MVS operating system. IBM Journal of Research and Development, 25(5):471-482, 1981.
[8] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications, 5(3):63-73, Fall 1991.
[9] K. Bailey, L. Ceze, S. D. Gribble, and H. M. Levy. Operating system implications of fast, cheap, non-volatile memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS'13, pages 2-2, Berkeley, CA, USA, 2011. USENIX Association.
[10] M. Baker and M. Sullivan. The recovery box: Using fast recovery to provide high availability in the Unix environment. In Proceedings of the USENIX Summer Conference, pages 31-43, 1992.
[11] A. Bohra, I. Neamtiu, P. Gallard, F. Sultan, and L. Iftode. Remote repair of operating system state using backdoors. In International Conference on Autonomic Computing (ICAC-04), New York, NY, May 2004. Initial version published as Technical Report DCS-TR-543, Rutgers University.
[12] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot - A technique for cheap recovery. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI'04, pages 3-3, Berkeley, CA, USA, 2004. USENIX Association.
[13] P. M. Chen, W. T. Ng, S. Chandra, C. Aycock, G. Rajamani, and D. Lowell. The Rio File Cache: Surviving operating system crashes. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pages 74-83, New York, NY, USA, 1996. ACM.
[14] A. Depoutovitch and M. Stumm. Otherworld: Giving applications a chance to survive OS kernel crashes. In Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pages 181-194, New York, NY, USA, 2010. ACM.
[15] J. Duell. The design and implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical report, Lawrence Berkeley National Laboratory, 2000.
[16] E. N. Elnozahy and J. S. Plank. Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing, 1(2):97-108, Apr. 2004.
[17] P. Emelianov and S. Hallyn. State of CRIU and integration with LXC. Linux Plumbers Conference, Sept. 2013.
[18] K. B. Ferreira, R. Riesen, R. Brightwell, P. G. Bridges, and D. Arnold. Libhashckpt: Hash-based incremental checkpointing using GPUs. In Proceedings of the 18th EuroMPI Conference, Santorini, Greece, September 2011.
[19] K. B. Ferreira, R. Riesen, P. G. Bridges, D. Arnold, and R. Brightwell. Accelerating incremental checkpointing for extreme-scale computing. Future Generation Computer Systems (FGCS), 2013.
[20] D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 78:1-78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[21] R. Gioiosa, J. Sancho, S. Jiang, and F. Petrini. Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing 2005), pages 9-9, Nov. 2005.
[22] K. Greenan and E. L. Miller. Reliability mechanisms for file systems using non-volatile memory as a metadata store. In Proceedings of the 6th ACM & IEEE Conference on Embedded Software (EMSOFT '06), pages 178-187, Oct. 2006.
[23] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528, June 1984.
[24] Y. Huang, C. Kintala, N. Kolettis, and N. Fulton. Software rejuvenation: Analysis, module and applications. In Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), Digest of Papers, pages 381-390, June 1995.
[25] J. Hulbert. Introducing the advanced XIP file system. In Linux Symposium, page 211, 2008.
[26] A. A. Hwang, I. A. Stefanovici, and B. Schroeder. Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design. SIGPLAN Notices, 47(4):111-122, Mar. 2012.
[27] D. Ibtesham, D. Arnold, P. G. Bridges, K. B. Ferreira, and R. Brightwell. On the viability of compression for reducing the overheads of checkpoint/restart-based fault tolerance. In Proceedings of the International Conference on Parallel Processing (ICPP), 2012.
[28] D. Ibtesham, K. B. Ferreira, and D. Arnold. A study of checkpoint compression for high-performance computing systems. International Journal of High Performance Computing Applications (IJHPCA), 2015.
[29] D. Jewett. Integrity S2: A fault-tolerant Unix platform. In Twenty-First International Symposium on Fault-Tolerant Computing (FTCS-21), Digest of Papers, pages 512-519, June 1991.
[30] D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using asynchronous message logging and checkpointing. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC '88, pages 171-181, New York, NY, USA, 1988. ACM.
[31] K. Kourai and S. Chiba. A fast rejuvenation technique for server consolidation with virtual machines. In DSN, pages 245-255. IEEE Computer Society, 2007.
[32] N. Naksinehaboon, N. Taerat, C. Leangsuksun, C. Chandler, and S. L. Scott. Benefits of software rejuvenation on HPC systems. In ISPA, pages 499-506. IEEE, 2010.
[33] R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, R. Riesen, M. R. Varela, and P. C. Roth. Modeling the impact of checkpoints on next-generation systems. In Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, San Diego, CA, September 2007.
[34] J. S. Plank, J. Xu, and R. H. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, 1995.
[35] M. Rieker, J. Ansel, and G. Cooperman. Transparent user-level checkpointing for the native POSIX thread library for Linux. In The International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, June 2006.
[36] B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1):012022, 2007.
[37] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. In Proceedings of the 11th Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2009), pages 193-204, Seattle, WA, USA, June 11-13, 2009. ACM Press.
[38] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi. Memory errors in modern systems: The good, the bad, and the ugly. SIGPLAN Notices, 50(4):297-310, Mar. 2015.
[39] M. M. Swift, S. Martin, H. M. Levy, and S. J. Eggers. Nooks: An architecture for reliable device drivers. In Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, EW 10, pages 102-107, New York, NY, USA, 2002. ACM.
[40] P. Velardi and R. K. Iyer. A study of software failures and recovery in the MVS operating system. IEEE Transactions on Computers, 33(6):564-568, 1984.
[41] S. Yi, J. Heo, Y. Cho, and J. Hong. Adaptive page-level incremental checkpointing based on expected recovery time. In Proceedings of the 2006 ACM Symposium on Applied Computing, SAC '06, pages 1472-1476, New York, NY, USA, 2006. ACM.