Container Live Migration

Total Page:16

File Type:pdf, Size:1020Kb

Container Live Migration Container Live Migration Adrian Reber 2020, July 10 Red Hat Blog: Container migration with Podman on RHEL https://www.redhat.com/en/blog/container-migration-podman-rhel 2 DEVOOPS 2020 Agenda Use cases Details Demos Future 3 DEVOOPS 2020 Definition: Container Live Migration 4 DEVOOPS 2020 Transfer Running Container 5 DEVOOPS 2020 Serialize on Source System 6 DEVOOPS 2020 Transfer to Destination System 7 DEVOOPS 2020 Checkpoint/Restore in Userspace CRIU 8 DEVOOPS 2020 Multiple Integrations Exist 9 DEVOOPS 2020 Use Cases 10 DEVOOPS 2020 Reboot and Save State 11 DEVOOPS 2020 Container Host 12 DEVOOPS 2020 Container Host 13 DEVOOPS 2020 Container Host 14 DEVOOPS 2020 Container Host 15 DEVOOPS 2020 Quick Startup 16 DEVOOPS 2020 Container Host 17 DEVOOPS 2020 Container Container Host 18 DEVOOPS 2020 Container Container ContainerHost Container 19 DEVOOPS 2020 Container Live Migration 20 DEVOOPS 2020 Container Source Destination 21 DEVOOPS 2020 Container Container Source Destination 22 DEVOOPS 2020 Container Container Container Source Destination Container 23 DEVOOPS 2020 CRIU 24 DEVOOPS 2020 First Step: Checkpointing 25 DEVOOPS 2020 Seize Process Using ptrace() 26 DEVOOPS 2020 Collect Details From /proc/<PID>/* 27 DEVOOPS 2020 Parasite Code 28 DEVOOPS 2020 Parasite Code Most favorite part 29 DEVOOPS 2020 Parasite Code And the craziest 30 DEVOOPS 2020 Parasite Code Injected into the process 31 DEVOOPS 2020 Parasite Code Daemon waiting for commands 32 DEVOOPS 2020 Parasite Code Removed after usage 33 DEVOOPS 2020 Original Process Code To Be Checkpointed 34 DEVOOPS 2020 Original Process Code To Be Checkpointed 35 DEVOOPS 2020 Original Process Parasite Code To Be Checkpointed 36 DEVOOPS 2020 Original Process Code To Be Checkpointed 37 DEVOOPS 2020 Checkpointing Finished 38 DEVOOPS 2020 Checkpointing Finished All relevant information written 39 DEVOOPS 2020 Checkpointing Finished Target process is killed 40 DEVOOPS 2020 Checkpointing Finished Or continues to run 41 DEVOOPS 2020 Container Live Migration SELinux Linux Security Summit EU 2019 https://sched.co/Tymj 42 DEVOOPS 2020 2015: CRIU LSM support 43 DEVOOPS 2020 During checkpointing Read /proc/PID/attr/current 44 DEVOOPS 2020 During restore Write /proc/PID/attr/current 45 DEVOOPS 2020 1 if (!strstartswith(last, "unconfined_")){ 2 pr_err("Non unconfined selinux contexts not supported %s\n", last); 3 freecon(ctx); 4 return -1; 5 } 46 DEVOOPS 2020 • setsockcreatecon(3) for parasite daemon • Write to /proc/PID/attr/current • Allow dyntransition • Do not set context of threads • Allow writing to /proc/sys/kernel/ns_last_pid • Fix socket labels • Pre-create CRIU log files with appropriate labels • Fix file descriptor leaks 47 DEVOOPS 2020 Second/Last Step: Restoring 48 DEVOOPS 2020 Read Checkpoint Images 49 DEVOOPS 2020 clone() For Each PID/TID LPC: CRIU and the PID dance clone3() with Linux 5.5 https://linuxplumbersconf.org/event/4/contributions/472/ 50 DEVOOPS 2020 PID dance open() /proc/sys/kernel/ns_last_pid write() (PID - 1) to ns_last_pid close() ns_last_pid clone() getpid() 51 DEVOOPS 2020 Avoiding the PID dance (2010): eclone() https://lore.kernel.org/patchwork/patch/198220/ 52 DEVOOPS 2020 Avoiding the PID dance (2019): clone3() 53 DEVOOPS 2020 ”In general, clone3() is extensible and allows for the implementation of new features.” https://git.kernel.org/pub/scm/linux/kernel/git/ torvalds/linux.git/commit/?id=7f192e3cd316ba58c 54 DEVOOPS 2020 clone3() with set_tid 55 DEVOOPS 2020 1 struct clone_args args = {0}; 2 pid_t *set_tid; 3 set_tid[0] = 2020; 4 args.set_tid = set_tid; 5 args.set_tid_size = 1; 6 syscall(__NR_clone3, args, sizeof(struct clone_args)); 56 DEVOOPS 2020 CRIU Morphs Itself Open and position file descriptors 57 DEVOOPS 2020 CRIU Morphs Itself Map memory pages 58 DEVOOPS 2020 CRIU Morphs Itself Load security settings 59 DEVOOPS 2020 CRIU Morphs Itself Jump into restored process 60 DEVOOPS 2020 Container Live Migration 61 DEVOOPS 2020 Container Live Migration OpenVZ 62 DEVOOPS 2020 Container Live Migration Borg 63 DEVOOPS 2020 Container Live Migration LXC/LXD 64 DEVOOPS 2020 Container Live Migration Docker 65 DEVOOPS 2020 Container Live Migration Podman 66 DEVOOPS 2020 Podman: daemonless 67 DEVOOPS 2020 Podman: rootless 68 DEVOOPS 2020 Podman: Checkpoint/Restore October 2018 69 DEVOOPS 2020 Podman: Checkpoint/Restore Required runc and CRIU changes 70 DEVOOPS 2020 Podman: Container Live Migration June 2019 71 DEVOOPS 2020 Podman: Container Live Migration Required runc, CRIU, SELinux changes 72 DEVOOPS 2020 Checkpoint includes File System Changes 73 DEVOOPS 2020 1 # podman run --rm -d adrianreber/wildfly-hello 2 699f33eb7fecbc5bbb00400be0aa79c888dbc63a54cac7bd2eed836a57d8a68a 3 # podman inspect -l --format "{{.NetworkSettings.IPAddress}}" 4 10.88.0.247 5 # curl 10.88.0.247:8080/helloworld/ 6 0 7 # curl 10.88.0.247:8080/helloworld/ 8 1 9 # podman container checkpoint -l --export=/tmp/chkpt.tar.gz 10 699f33eb7fecbc5bbb00400be0aa79c888dbc63a54cac7bd2eed836a57d8a68a 11 # scp /tmp/chkpt.tar.gz rhel08:/tmp 74 DEVOOPS 2020 1 # podman container restore --import=/tmp/chkpt.tar.gz 2 699f33eb7fecbc5bbb00400be0aa79c888dbc63a54cac7bd2eed836a57d8a68a 3 # podman inspect -l --format "{{.NetworkSettings.IPAddress}}" 4 10.88.0.247 5 # curl 10.88.0.247:8080/helloworld/ 6 2 7 # curl 10.88.0.247:8080/helloworld/ 8 3 75 DEVOOPS 2020 1 # podman container restore --import=/tmp/chkpt.tar.gz -n hello1 2 d02feeec894d77f66cc82484fe77ae369396a85f6d05594dc156c21e685942dd 3 # podman container restore --import=/tmp/chkpt.tar.gz -n hello2 4 735efb4fee6961d3eee069beb28dde5cbc6fc46c1a32a43ecc993d04c02015b2 5 # podman inspect --format "{{.NetworkSettings.IPAddress}}" hello1 6 10.88.0.248 7 # podman inspect --format "{{.NetworkSettings.IPAddress}}" hello2 8 10.88.0.249 9 # curl 10.88.0.248:8080/helloworld/ 10 2 11 # curl 10.88.0.249:8080/helloworld/ 12 2 76 DEVOOPS 2020 Future: kubectl migrate 77 DEVOOPS 2020 Future: Non-root checkpoint/restore 78 DEVOOPS 2020 Summary • CRIU can checkpoint and restore containers • Integrated in different containers engines • Used in production • Reboot into new kernel without losing container state • Start multiple copies • Migrate running containers 79 DEVOOPS 2020 https://lisas.de/~adrian/container-live-migration-article.pdf https://asciinema.org/a/249922 https://asciinema.org/a/249918 https://lisas.de/~adrian/posts/2019-Apr-10-criu-and-selinux.html https://criu.org/Podman https://twitter.com/adrian__reber https://www.redhat.com/en/blog/container-migration-podman-rhel https://cfp.all-systems-go.io/ASG2019/talk/E88Z7V/ https://sched.co/Tymj https://linuxplumbersconf.org/event/4/contributions/472/ 80 DEVOOPS 2020 Thank you.
Recommended publications
  • Checkpoint and Restoration of Micro-Service in Docker Containers
    3rd International Conference on Mechatronics and Industrial Informatics (ICMII 2015) Checkpoint and Restoration of Micro-service in Docker Containers Chen Yang School of Information Security Engineering, Shanghai Jiao Tong University, China 200240 [email protected] Keywords: Lightweight Virtualization, Checkpoint/restore, Docker. Abstract. In the present days of rapid adoption of micro-service, it is imperative to build a system to support and ensure the high performance and high availability for micro-services. Lightweight virtualization, which we also called container, has the ability to run multiple isolated sets of processes under a single kernel instance. Because of the possibility of obtaining a low overhead comparable to the near-native performance of a bare server, the container techniques, such as openvz, lxc, docker, they are widely used for micro-service [1]. In this paper, we present the high availability of micro-service in containers. We investigate capabilities provided by container (docker, openvz) to model and build the Micro-service infrastructure and compare different checkpoint and restore technologies for high availability. Finally, we present preliminary performance results of the infrastructure tuned to the micro-service. Introduction Lightweight virtualization, named the operating system level virtualization technology, partitions the physical machines resource, creating multiple isolated user-space instances. Each container acts exactly like a stand-alone server. A container can be rebooted independently and have root access, users, IP address, memory, processes, files, etc. Unlike traditional virtualization with the hypervisor layer, containerization takes place at the kernel level. Most modern operating system kernels now support the primitives necessary for containerization, including Linux with openvz, vserver and more recently lxc, Solaris with zones, and FreeBSD with Jails [2].
    [Show full text]
  • Evaluating and Improving LXC Container Migration Between
    Evaluating and Improving LXC Container Migration between Cloudlets Using Multipath TCP By Yuqing Qiu A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering Carleton University Ottawa, Ontario © 2016, Yuqing Qiu Abstract The advent of the Cloudlet concept—a “small data center” close to users at the edge is to improve the Quality of Experience (QoE) of end users by providing resources within a one-hop distance. Many researchers have proposed using virtual machines (VMs) as such service-provisioning servers. However, seeing the potentiality of containers, this thesis adopts Linux Containers (LXC) as Cloudlet platforms. To facilitate container migration between Cloudlets, Checkpoint and Restore in Userspace (CRIU) has been chosen as the migration tool. Since the migration process goes through the Wide Area Network (WAN), which may experience network failures, the Multipath TCP (MPTCP) protocol is adopted to address the challenge. The multiple subflows established within a MPTCP connection can improve the resilience of the migration process and reduce migration time. Experimental results show that LXC containers are suitable candidates for the problem and MPTCP protocol is effective in enhancing the migration process. i Acknowledgement I would like to express my sincerest gratitude to my principal supervisor Dr. Chung-Horng Lung who has provided me with valuable guidance throughout the entire research experience. His professionalism, patience, understanding and encouragement have always been my beacons of light whenever I go through difficulties. My gratitude also goes to my co-supervisor Dr.
    [Show full text]
  • The Aurora Operating System
    The Aurora Operating System Revisiting the Single Level Store Emil Tsalapatis Ryan Hancock Tavian Barnes RCS Lab, University of Waterloo RCS Lab, University of Waterloo RCS Lab, University of Waterloo [email protected] [email protected] [email protected] Ali José Mashtizadeh RCS Lab, University of Waterloo [email protected] ABSTRACT KEYWORDS Applications on modern operating systems manage their single level stores, transparent persistence, snapshots, check- ephemeral state in memory, and persistent state on disk. En- point/restore suring consistency between them is a source of significant developer effort, yet still a source of significant bugs inma- ACM Reference Format: ture applications. We present the Aurora single level store Emil Tsalapatis, Ryan Hancock, Tavian Barnes, and Ali José Mash- (SLS), an OS that simplifies persistence by automatically per- tizadeh. 2021. The Aurora Operating System: Revisiting the Single sisting all traditionally ephemeral application state. With Level Store. In Workshop on Hot Topics in Operating Systems (HotOS recent storage hardware like NVMe SSDs and NVDIMMs, ’21), June 1-June 3, 2021, Ann Arbor, MI, USA. ACM, New York, NY, Aurora is able to continuously checkpoint entire applications USA, 8 pages. https://doi.org/10.1145/3458336.3465285 with millisecond granularity. Aurora is the first full POSIX single level store to han- dle complex applications ranging from databases to web 1 INTRODUCTION browsers. Moreover, by providing new ways to interact with Single level storage (SLS) systems provide persistence of and manipulate application state, it enables applications to applications as an operating system service. Their advantage provide features that would otherwise be prohibitively dif- lies in removing the semantic gap between the in-memory ficult to implement.
    [Show full text]
  • Checkpoint and Restore of Singularity Containers
    Universitat politecnica` de catalunya (UPC) - BarcelonaTech Facultat d'informatica` de Barcelona (FIB) Checkpoint and restore of Singularity containers Grado en ingenier´ıa informatica´ Tecnolog´ıas de la Informacion´ Memoria 25/04/2019 Director: Autor: Jordi Guitart Fernandez Enrique Serrano G´omez Departament: Arquitectura de Computadors 1 Abstract Singularity es una tecnolog´ıade contenedores software creada seg´unlas necesidades de cient´ıficos para ser utilizada en entornos de computaci´onde altas prestaciones. Hace ya 2 a~nosdesde que los usuarios empezaron a pedir una integraci´onde la fun- cionalidad de Checkpoint/Restore, con CRIU, en contenedores Singularity. Esta inte- graci´onayudar´ıaen gran medida a mejorar la gesti´onde los recursos computacionales de las m´aquinas. Permite a los usuarios guardar el estado de una aplicaci´on(ejecut´andose en un contenedor Singularity) para poder restaurarla en cualquier momento, sin perder el trabajo realizado anteriormente. Por lo que la posible interrupci´onde una aplicaci´on, debido a un fallo o voluntariamente, no es una p´erdidade tiempo de computaci´on. Este proyecto muestra como es posible realizar esa integraci´on. Singularity ´esuna tecnologia de contenidors software creada segons les necessitats de cient´ıfics,per ser utilitzada a entorns de computaci´od'altes prestacions. Fa 2 anys desde que els usuaris van comen¸cara demanar una integraci´ode la funcional- itat de Checkpoint/Restore, amb CRIU, a contenidors Singularity. Aquesta integraci´o ajudaria molt a millorar la gesti´odels recursos computacionals de les m`aquines.Permet als usuaris guardar l'estat d'una aplicaci´o(executant-se a un contenidor Singularity) per poder restaurar-la en qualsevol moment, sense perdre el treball realitzat anteriorment.
    [Show full text]
  • Instant OS Updates Via Userspace Checkpoint-And
    Instant OS Updates via Userspace Checkpoint-and-Restart Sanidhya Kashyap, Changwoo Min, Byoungyoung Lee, and Taesoo Kim, Georgia Institute of Technology; Pavel Emelyanov, CRIU and Odin, Inc. https://www.usenix.org/conference/atc16/technical-sessions/presentation/kashyap This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0 Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX. Instant OS Updates via Userspace Checkpoint-and-Restart Sanidhya Kashyap Changwoo Min Byoungyoung Lee Taesoo Kim Pavel Emelyanov† Georgia Institute of Technology †CRIU & Odin, Inc. # errors # lines Abstract 50 1000K 40 100K In recent years, operating systems have become increas- 10K 30 1K 20 ingly complex and thus more prone to security and per- 100 formance issues. Accordingly, system updates to address 10 10 these issues have become more frequently available and 0 1 increasingly important. To complete such updates, users 3.13.0-x 3.16.0-x 3.19.0-x May 2014 must reboot their systems, resulting in unavoidable down- build/diff errors #layout errors Jun 2015 time and further loss of the states of running applications. #static local errors #num lines++ We present KUP, a practical OS update mechanism that Figure 1: Limitation of dynamic kernel hot-patching using employs a userspace checkpoint-and-restart mechanism, kpatch. Only two successful updates (3.13.0.32 34 and → which uses an optimized data structure for checkpoint- 3.19.0.20 21) out of 23 Ubuntu kernel package releases.
    [Show full text]
  • Task Migration at Scale Using CRIU Linux Plumbers Conference 2018
    Task Migration at Scale Using CRIU Linux Plumbers Conference 2018 Victor Marmol Andy Tucker [email protected] [email protected] 2018-11-15 Confidential + Proprietary Who we are Outside of Google, we’ve worked on open source cluster management and containers lmctfy Confidential + Proprietary 2 Who we are Inside Google: we’re part of the Borg team ● Manages all compute jobs ● Runs on every server ConfidentialImages + Proprietary by Connie Zhou3 What is Borg? Google’s cluster management system ● Borgmaster: Cluster control and main API entrypoint ● Borglet: On-machine management daemon ● Suite of tools and UIs for managing jobs ● Many purpose-built platforms created on top of Borg ● Everything runs on Borg and everything runs in containers Confidential + Proprietary 4 Borg basics Allocated Ports Base compute primitive: Task ● A priority signals how quickly a task should Processes Packages schedule Processes Packages Processes Packages ● It’s appclass describes a task as either Processes Packages serving (latency sensitive) or batch ● Static content/binaries provided by Container packages Task ● A container isolates a task’s resources ● Native Linux processes Borg Machine ● Share an IP with the machine, ports are allocated for each task Confidential + Proprietary 5 Borg basics: evictions When a task is forcefully terminated by Borg ● Typically receive a notification: 1-5min ● Our SLO allows for quite a few evictions ● Applications must handle them Reasons for evictions ● Preemption: a higher priority task needs the resources ● Software upgrades (e.g.: kernel, firmware) ● Re-balancing for availability or performance Confidential + Proprietary 6 Evictions are impactful and hard to handle Technical Complexity ● Handling evictions requires state management ○ How and what state to serialize and where to store it ● Application-specific and not very reusable Lost Compute ● Batch jobs run at lower priorities and get preempted often ● Even platforms that handle them for users, don’t do a great job Task 0 Evicted!Shard 1 Task 2 Task 3 Task 4 Compute is lost..
    [Show full text]
  • Open Source Software Packages
    Hitachi Ops Center 10.5.1 Open Source Software Packages Contact Information: Hitachi Ops Center Project Manager Hitachi Vantara LLC 2535 Augustine Drive Santa Clara, California 95054 Name of package Version License agreement @agoric/babel-parser 7.12.7 The MIT License @angular-builders/custom-webpack 8.0.0-RC.0 The MIT License @angular-devkit/build-angular 0.800.0-rc.2 The MIT License @angular-devkit/build-angular 0.901.12 The MIT License @angular-devkit/core 7.3.8 The MIT License @angular-devkit/schematics 7.3.8 The MIT License @angular/animations 9.1.11 The MIT License @angular/animations 9.1.12 The MIT License @angular/cdk 9.2.4 The MIT License @angular/cdk-experimental 9.2.4 The MIT License @angular/cli 8.0.0 The MIT License @angular/cli 9.1.11 The MIT License @angular/common 9.1.11 The MIT License @angular/common 9.1.12 The MIT License @angular/compiler 9.1.11 The MIT License @angular/compiler 9.1.12 The MIT License @angular/compiler-cli 9.1.12 The MIT License @angular/core 7.2.15 The MIT License @angular/core 9.1.11 The MIT License @angular/core 9.1.12 The MIT License @angular/forms 7.2.15 The MIT License @angular/forms 9.1.0-next.3 The MIT License @angular/forms 9.1.11 The MIT License @angular/forms 9.1.12 The MIT License @angular/language-service 9.1.12 The MIT License @angular/platform-browser 7.2.15 The MIT License @angular/platform-browser 9.1.11 The MIT License @angular/platform-browser 9.1.12 The MIT License @angular/platform-browser-dynamic 7.2.15 The MIT License @angular/platform-browser-dynamic 9.1.11 The MIT License @angular/platform-browser-dynamic
    [Show full text]
  • Red Hat Enterprise Linux 7 7.8 Release Notes
    Red Hat Enterprise Linux 7 7.8 Release Notes Release Notes for Red Hat Enterprise Linux 7.8 Last Updated: 2021-03-02 Red Hat Enterprise Linux 7 7.8 Release Notes Release Notes for Red Hat Enterprise Linux 7.8 Legal Notice Copyright © 2021 Red Hat, Inc. The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/ . In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version. Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux ® is the registered trademark of Linus Torvalds in the United States and other countries. Java ® is a registered trademark of Oracle and/or its affiliates. XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and other countries. Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
    [Show full text]
  • Seccomp Update
    seccomp update https://outflux.net/slides/2015/lss/seccomp.pdf Linux Security Summit, Seattle 2015 Kees Cook <[email protected]> (pronounced “Case”) What is seccomp? ● Programmatic kernel attack surface reduction ● Used by: – Chrome – vsftpd – OpenSSH – Systemd (“SystemCallFilter=...”) – LXC (blacklisting) – … and you too! (easiest via libseccomp) seccomp update 2/8 Linux Security Summit 2015 Seattle, Aug 21 Architecture support ● x86: v3.5 ● s390: v3.6 ● arm: v3.8 ● mips: v3.15 ● arm64: v3.19, AKASHI Takahiro ● powerpc: linux-next (v4.3), Michael Ellerman seccomp update 3/8 Linux Security Summit 2015 Seattle, Aug 21 split-phase internals ● v3.19, Andy Lutomirski ● Splits per-architecture calls to seccomp into 2 phases ● Speeds up simple (no tracing) callers ● Only used on x86 so far seccomp update 4/8 Linux Security Summit 2015 Seattle, Aug 21 Regression tests ● v4.2: moved the 48 tests from github into the kernel: tools/testing/selftests/seccomp/ ● Shows some interesting glitches with restart_syscall on arm (hidden) and arm64 (hidden, unless compat, then exposed) ● Gained big-endian support during powerpc port ● Added s390 seccomp support today seccomp update 5/8 Linux Security Summit 2015 Seattle, Aug 21 Minor changes ● v4.0: SECCOMP_RET_ERRNO capped at MAX_ERRNO – Avoid confusing userspace ● v4.1: asm-generic for seccomp.h – Easier architecture porting seccomp update 6/8 Linux Security Summit 2015 Seattle, Aug 21 Future ● Argument inspection ● CRIU (checkpoint/restore) – PTRACE_O_SUSPEND_SECCOMP with CAP_SYS_ADMIN: linux-next (v4.3), Tycho Andersen – Serialize dump/restore of filters. ● eBPF – Use maps or tail calls instead of balanced if/else trees for checking syscall numbers. seccomp update 7/8 Linux Security Summit 2015 Seattle, Aug 21 Questions? https://outflux.net/slides/2015/lss/seccomp.pdf @kees_cook [email protected] [email protected] [email protected] seccomp update 8/8 Linux Security Summit 2015 Seattle, Aug 21.
    [Show full text]
  • Mini-Ckpts: Surviving OS Failures in Persistent Memory
    Mini-Ckpts: Surviving OS Failures in Persistent Memory David Fiala, Frank Mueller Kurt Ferreira Christian Engelmann North Carolina State University∗ Sandia National Laboratories† Oak Ridge National Laboratory‡§ davidfi[email protected], [email protected] [email protected] [email protected] ABSTRACT minate due to tight communication in HPC. Therefore, Concern is growing in the high-performance computing the OS itself must be capable of tolerating failures in a (HPC) community on the reliability of future extreme- robust system. In this work, we introduce mini-ckpts, scale systems. Current efforts have focused on appli- a framework which enables application survival despite cation fault-tolerance rather than the operating system the occurrence of a fatal OS failure or crash. mini- (OS), despite the fact that recent studies have suggested ckpts achieves this tolerance by ensuring that the crit- that failures in OS memory may be more likely. The OS ical data describing a process is preserved in persistent is critical to a system’s correct and efficient operation of memory prior to the failure. Following the failure, the the node and processes it governs — and the parallel na- OS is rejuvenated via a warm reboot and the applica- ture of HPC applications means any single node failure tion continues execution effectively making the failure generally forces all processes of this application to ter- and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three ∗This work was supported in part by a subcontract from to six seconds and has a failure-free overhead of between Sandia National Laboratories and NSF grants CNS- 3-5% for a number of key HPC workloads.
    [Show full text]
  • Netlink Sockets
    CHAPTER 2 Netlink Sockets Chapter 1 discusses the roles of the Linux kernel networking subsystem and the three layers in which it operates. The netlink socket interface appeared first in the 2.2 Linux kernel as AF_NETLINK socket. It was created as a more flexible alternative to the awkward IOCTL communication method between userspace processes and the kernel. The IOCTL handlers cannot send asynchronous messages to userspace from the kernel, whereas netlink sockets can. In order to use IOCTL, there is another level of complexity: you need to define IOCTL numbers. The operation model of netlink is quite simple: you open and register a netlink socket in userspace using the socket API, and this netlink socket handles bidirectional communication with a kernel netlink socket, usually sending messages to configure various system settings and getting responses back from the kernel. This chapter describes the netlink protocol implementation and API and discusses its advantages and drawbacks. I also talk about the new generic netlink protocol, discuss its implementation and its advantages, and give some illustrative examples using the libnl library. I conclude with a discussion of the socket monitoring interface. The Netlink Family The netlink protocol is a socket-based Inter Process Communication (IPC) mechanism, based on RFC 3549, “Linux Netlink as an IP Services Protocol.” It provides a bidirectional communication channel between userspace and the kernel or among some parts of the kernel itself. Netlink is an extension of the standard socket implementation. The netlink protocol implementation resides mostly under net/netlink, where you will find the following four files: •฀ af_netlink.c •฀ af_netlink.h •฀ genetlink.c •฀ diag.c Apart from them, there are a few header files.
    [Show full text]
  • Eyeglass Search OSS Licenses and Packages V9
    ECA 15.1 opensuse https://en.opensuse.org/openSUSE:License Package Licence Name Version Type Key Licence Name opensuse 15.1 OS annogen:annogen 0.1.0 JAR Not Found antlr:antlr 2.7.7 JAR BSD Berkeley Software Distribution (BSD) aopalliance:aopalliance 1 JAR Public DomainPublic Domain asm:asm 3.1 JAR Not Found axis:axis 1.4 JAR Apache-2.0 The Apache Software License, Version 2.0 axis:axis-wsdl4j 1.5.1 JAR Not Found backport-util-concurrent:backport-util-concurrent3.1 JAR Public DomainPublic Domain com.amazonaws:aws-java-sdk 1.1.7.1 JAR Apache-2.0 The Apache Software License, Version 2.0 com.beust:jcommander 1.72 JAR Apache-2.0 The Apache Software License, Version 2.0 com.carrotsearch:hppc 0.6.0 JAR Apache-2.0 The Apache Software License, Version 2.0 com.clearspring.analytics:stream 2.7.0 JAR Apache-2.0 The Apache Software License, Version 2.0 com.drewnoakes:metadata-extractor 2.4.0-beta-1JAR Public DomainPublic Domain com.esotericsoftware:kryo-shaded 3.0.3 JAR BSD Berkeley Software Distribution (BSD) com.esotericsoftware:minlog 1.3.0 JAR BSD Berkeley Software Distribution (BSD) com.fasterxml.jackson.core:jackson-annotations2.6.1 JAR Apache-2.0 The Apache Software License, Version 2.0 com.fasterxml.jackson.core:jackson-annotations2.5.0 JAR Apache-2.0 The Apache Software License, Version 2.0 com.fasterxml.jackson.core:jackson-annotations2.2.0 JAR Apache-2.0 The Apache Software License, Version 2.0 com.fasterxml.jackson.core:jackson-annotations2.9.0 JAR Apache-2.0 The Apache Software License, Version 2.0 com.fasterxml.jackson.core:jackson-annotations2.6.0
    [Show full text]