Mini-Ckpts: Surviving OS Failures in Persistent Memory

Mini-Ckpts: Surviving OS Failures in Persistent Memory David Fiala, Frank Mueller Kurt Ferreira Christian Engelmann North Carolina State University∗ Sandia National Laboratories† Oak Ridge National Laboratory‡§ davidfi[email protected], [email protected] [email protected] [email protected] ABSTRACT minate due to tight communication in HPC. Therefore, Concern is growing in the high-performance computing the OS itself must be capable of tolerating failures in a (HPC) community on the reliability of future extreme- robust system. In this work, we introduce mini-ckpts, scale systems. Current efforts have focused on appli- a framework which enables application survival despite cation fault-tolerance rather than the operating system the occurrence of a fatal OS failure or crash. mini- (OS), despite the fact that recent studies have suggested ckpts achieves this tolerance by ensuring that the crit- that failures in OS memory may be more likely. The OS ical data describing a process is preserved in persistent is critical to a system’s correct and efficient operation of memory prior to the failure. Following the failure, the the node and processes it governs — and the parallel na- OS is rejuvenated via a warm reboot and the applica- ture of HPC applications means any single node failure tion continues execution effectively making the failure generally forces all processes of this application to ter- and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three ∗This work was supported in part by a subcontract from to six seconds and has a failure-free overhead of between Sandia National Laboratories and NSF grants CNS- 3-5% for a number of key HPC workloads. In contrast 1058779, CNS-0958311. to current fault-tolerance methods, this work ensures † Sandia is a multiprogram laboratory operated by San- that the operating and runtime systems can continue dia Corporation, a Lockheed Martin Company, for the in the presence of faults. This is a much finer-grained United States Department of Energy’s National Nu- clear Security Administration under contract DE-AC04- and dynamic method of fault-tolerance than the current 94AL85000. coarse-grained application-centric methods. Handling ‡This manuscript has been authored by UT-Battelle, faults at this level has the potential to greatly reduce LLC under Contract No. DE-AC05-00OR22725 with overheads and enables mitigation of additional faults. the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the 1. INTRODUCTION article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, The operating system (OS) kernel is responsible for irrevocable, world-wide license to publish or repro- supervising running applications, providing services, duce the published form of this manuscript, or allow and governing hardware device access. While applica- others to do so, for United States Government pur- tions are isolated from each other and a failure in one poses. The Department of Energy will provide pub- application does not necessarily affect others, a failure lic access to these results of federally sponsored research in accordance with the DOE Public Access Plan in the kernel may result in a full system crash, stopping (http://energy.gov/downloads/doe-public-access-plan). all applications without warning due to system services §This work was sponsored by the U.S. Department of that run across the entire machine Energy’s Office of Advanced Scientific Computing Re- In current high-performance computing (HPC) sys- search. tems with 100s of thousands of processor cores and 100s ACM acknowledges that this contribution was authored or co-authored by an of TB of main memory, faults are becoming increasingly employee, or contractor of the national government. As such, the Government common and are predicted to become even more com- retains a nonexclusive, royalty-free right to publish or reproduce this article, or mon in future systems. Faults can affect memory, pro- to allow others to do so, for Government purposes only. Permission to make digital or hard copies for personal or classroom use is granted. Copies must bear cessors, power supplies, and other hardware. Research this notice and the full citation on the first page. Copyrights for components indicates that servers tend to crash twice per year (2- of this work owned by others than ACM must be honored. To copy otherwise, 4% failure rate) and unrecoverable DRAM errors occur distribute, republish, or post, requires prior specific permission and/or a fee. Re- quest permissions from [email protected]. in 2% of all DIMMs per year [37]. A number of stud- ICS ’16, June 01-03, 2016, Istanbul, Turkey ies indicate that ECC mechanisms alone are unlikely to c 2016 ACM. ISBN 978-1-4503-4361-9/16/06. $15.00 be sufficient in correcting a significant number of these DOI: http://dx.doi.org/10.1145/2925426.2926295 memory errors [26, 38]. While not all faults result in failure, when they do, intermediate results from entire process can be proactively managed prior to a failure. HPC application executions may be lost 1. This allows HPC applications to avoid costly rollbacks The objective of this work is to isolate applications (during kernel panics) to previous checkpoints, espe- from faults that originate within the processor or main cially when considering that rolling back on a massively- system memory and result in OS kernel failures and parallel HPC system typically requires all communicat- crashes. This work is critical to the scalability of cur- ing nodes to coordinate and rollback together unless rent and future large-scale systems due to the fact that coupled with other more costly techniques such as mes- while the kernel occupies a relatively small portion of sage logging [30]. memory (10-100 MB) relative to an HPC application This work makes the following contributions: (up to 32 GB per node), errors in kernel memory may • It identifies the minimal data structures required to be more likely [26], and the kernel is critical to all other save a process’ state during a kernel failure. processes running on a node — and in HPC even other • It further identifies instrumentation locations generi- nodes whose progress depends on being able to commu- cally needed in OSs for processor state saving. nicate with the faulted node. In this work, we target • It introduces mini-ckpts, a framework that imple- the Linux OS due to its widespread adoption on current ments process protection against kernel failures by HPC platforms (97% of the Top500 [5]), although the saving application state in persistent memory. methodology described in this work applies directly to • It evaluates the overhead and time savings of mini- most other OSs. ckpts’ warm reboots and application protection. While failures due to hardware faults are a concern in • It experimentally evaluates application survival of large clusters, software bugs pose an equally fatal out- mini-ckpts protected programs in the face of in- come in the kernel, regardless of scale. Bugs may result jected kernel faults and bugs for a number of key in kernel crashes (a kernel “panic”), CPU hangs, and OpenMP and MPI parallelized workloads. memory leaks resulting in eventual kernel crashes, or other outcomes leaving system performance degraded or only partially functional. Therefore, we have designed 2. MINI-CKPTS OVERVIEW mini-ckpts to recover the system in the face of both mini-ckpts protects applications by taking a check- kernel bugs and transient hardware faults. point of the process and its resources, modifying its ad- The primary objective of mini-ckpts is to demon- dress space for persistence, and then periodically saving strate the feasibility of voluntary or failure-induced sys- the state of the process’ registers whenever the kernel tem reboots by serializing and isolating the most es- is entered. With the checkpointed information and reg- sential parts of the application to persistent memory. ister state, mini-ckpts later recovers a process after a To this end, we make minimal modifications to the OS kernel crash occurs. Recovery is performed by attempt- for saving the CPU state during transitions between ing to recreate the process in the exact state prior to user space and kernel space while migrating application the crash using well established techniques employed by virtual memory to an execute-in-place (XIP) directly traditional checkpoint/restart software, such as Berke- mapped physical memory region. Finally, failures are ley Lab Checkpoint/Restart (BLCR) [15]. The basic mitigated by warm booting an identical copy of the ker- checkpointing features mini-ckpts supports are shown nel and resuming the protected applications via restor- in Table 1. Additionally, mini-ckpts support for per- ing the saved state of user processes inside the newly sistent virtual memory is comprehensive. A summary booted kernel. is available in Table 2. While other research, such as Otherworld [14], has To use mini-ckpts, one first needs to boot a kernel proposed the idea of a secondary crash kernel (activated with mini-ckpts modifications followed by additional upon failure) that parses the data structures of the orig- bootstrapping steps during start-up: inal, failed kernel, this work is novel as it is the first to Table 1: Summary of process features mini-ckpts pro- demonstrate complete recovery independence from the tects in checkpoint during failures data structures of an earlier kernel. The premise is that Multi Threaded Proc. Process ID a kernel failure may corrupt the kernel’s content. The Mutex/Conditions Variables Process UID & GID failure itself might be due to a corruption or bug in Regular Files (excl. seek) System Calls the first place, which could impede access to the old Signal Handlers & Masks Unflushed File Buffers kernel’s data. mini-ckpts’ unique contribution is that Stdin/Out/Err mmap’d files it has no dependencies on the failed kernel, and that Block Dev.

Mini-Ckpts: Surviving OS Failures in Persistent Memory

Checkpoint and Restoration of Micro-Service in Docker Containers

Evaluating and Improving LXC Container Migration Between

The Aurora Operating System

Checkpoint and Restore of Singularity Containers

Instant OS Updates Via Userspace Checkpoint-And

Task Migration at Scale Using CRIU Linux Plumbers Conference 2018

Open Source Software Packages

Red Hat Enterprise Linux 7 7.8 Release Notes

Seccomp Update

Netlink Sockets

Eyeglass Search OSS Licenses and Packages V9

Restoring Uniqueness in Microvm Snapshots