
How Containers Work
Sasha Goldshtein (@goldshtn), CTO, Sela Group
github.com/goldshtn

Agenda

• Why containers?
• Container building blocks
• Security profiles
• Namespaces
• Control groups
• Building tiny containers

Hardware vs. OS Virtualization

[Diagram: hardware vs. OS virtualization. On the VM side, each app (ASP.NET, Express, RavenDB, Nginx) runs on its own runtime (.NET Core, V8, libc), guest OS (Ubuntu, RHEL), and Linux kernel, all on a hypervisor — the OS layers are duplicated. On the container side, the same apps and runtimes share a single Linux kernel, with only the image layers above it.]

Key Container Usage Scenarios

• Simplifying configuration: infrastructure as code
• Consistent environment from development to production
• Application isolation without an extra copy of the OS
• Server consolidation, thinner deployments
• Rapid deployment and scaling

Linux Container Building Blocks

[Diagram: Linux container building blocks. Three containers (node:6, ms/aspnetcore:2, openjdk:8), each with its own files (/app/server.js, /app/server.dll, /app/server.jar on top of /usr/bin/node, /usr/bin/dotnet, /usr/bin/java and /lib64/libc.so), isolated by user namespaces, capped by control groups (1 CPU/1 GB, 2 CPU/4 GB, 1 CPU/2 GB), stored on copy-on-write layered filesystems, and making syscalls into the shared kernel through a security profile.]

Security Profiles

• Docker supports additional security modules that restrict what containerized processes can do
• AppArmor: text-based configuration that restricts access to certain files, network operations, etc.
• Docker uses seccomp to restrict system calls performed by the containerized process
• The default seccomp profile blocks syscalls like reboot, stime, umount, specific types of socket protocols, and others
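The whitelist structure of such a profile can be sketched in Python. The rule layout below mirrors Docker's default seccomp profile (a default action plus whitelisted syscall names), but the trimmed syscall list and the allow_syscall helper are illustrative, not part of Docker:

```python
import json

# Skeleton of a seccomp whitelist profile: anything not listed gets
# SCMP_ACT_ERRNO; listed syscalls are allowed. (Only a few names shown.)
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["openat", "pause", "pipe", "pipe2"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

def allow_syscall(profile, name):
    """Add a syscall to the whitelist rule if it is not already present."""
    names = profile["syscalls"][0]["names"]
    if name not in names:
        names.append(name)
    return profile

# This is the edit the perf experiment below performs by hand with diff:
allow_syscall(profile, "perf_event_open")
print(json.dumps(profile, indent=2))
```

The resulting JSON is what you would pass to docker run via --security-opt seccomp=….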

• See also: Docker seccomp profile, AppArmor for Docker

Experiment: Using a Custom Profile (1/4)

host% docker run -d --rm microsoft/dotnet:runtime …app
host% docker exec -it $CID bash
cont# apt update && apt install linux-perf
cont# /usr/bin/perf_4.9 record -F 97 -a
... perf_event_open(..., 0) failed unexpectedly ...
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid, ...

Experiment: Using a Custom Profile (2/4)

host% echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
0
cont# /usr/bin/perf_4.9 record -F 97 -ag
... perf_event_open(..., 0) failed unexpectedly ...
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid, ...

Experiment: Using a Custom Profile (3/4)

host% diff --color=always --context=2 default.json perf.json
*** default.json 2018-04-18 06:06:19.612371527 +0000
--- perf.json 2018-04-18 06:06:38.423793797 +0000
***************
*** 217,220 ****
--- 217,221 ----
  "openat",
  "pause",
+ "perf_event_open",
  "pipe",
  "pipe2",

Experiment: Using a Custom Profile (4/4)

host% docker run -d --security-opt seccomp=./perf.json microsoft/dotnet:runtime …app
host% docker exec -it app sh -c 'apt update && \
  apt install -y linux-perf && \
  /usr/bin/perf_4.9 record \
  -o /app/perf.data -F 97 -a -- sleep 5'
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.064 MB /app/perf.data (970 samples) ]

Capabilities

• Docker also disables capabilities such as CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_SYS_BOOT, and CAP_SYS_TIME, which you can add back if necessary
  • docker run --cap-add ...
• Docker’s default seccomp policy is configured to also enable certain syscalls when the relevant capability is enabled
  • E.g. ptrace(2) when CAP_SYS_PTRACE is enabled
  • E.g. reboot(2) when CAP_SYS_BOOT is enabled
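A process's effective capabilities show up as the CapEff hex bitmask in /proc/<pid>/status. Decoding that mask is simple bit arithmetic; a minimal Python sketch using the bit numbers from the kernel's capability.h (the decode_caps helper is illustrative, and only a few of the capabilities Docker drops are listed):

```python
# Bit numbers from linux/capability.h for a few capabilities Docker
# disables by default.
CAPS = {
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
    25: "CAP_SYS_TIME",
}

def decode_caps(mask):
    """Return the names of the known capabilities set in a CapEff mask."""
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

# A mask with only CAP_SYS_PTRACE and CAP_SYS_BOOT set, e.g. after
# docker run --cap-add=SYS_PTRACE --cap-add=SYS_BOOT:
mask = (1 << 19) | (1 << 22)
print(decode_caps(mask))  # ['CAP_SYS_PTRACE', 'CAP_SYS_BOOT']
```

In practice you would parse the CapEff line of /proc/self/status (a hex string) and feed int(value, 16) into the same function.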

• See also: capabilities(7)

Namespaces

• Namespaces isolate processes from each other and restrict visibility
• PID: container gets its own PIDs
• mnt: container gets its own mount points (view of the filesystem)
• net: container gets its own network interfaces
• user: container gets its own user and group ids
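Namespace membership is visible under /proc: two processes share a namespace exactly when their /proc/<pid>/ns/* symlinks resolve to the same "type:[inode]" identifier. A small Python sketch, assuming a Linux /proc (the same_namespace helper is illustrative):

```python
import os

def same_namespace(pid_a, pid_b, ns_type="pid"):
    """True iff the two PIDs are in the same namespace of the given type.

    The kernel exposes each namespace as a symlink like pid:[4026532324];
    equal link targets mean equal namespaces.
    """
    link_a = os.readlink(f"/proc/{pid_a}/ns/{ns_type}")
    link_b = os.readlink(f"/proc/{pid_b}/ns/{ns_type}")
    return link_a == link_b

me = os.getpid()
print(os.readlink(f"/proc/{me}/ns/uts"))  # e.g. uts:[4026531838]
print(same_namespace(me, me, "uts"))      # True (trivially: same process)
```

Comparing a host shell's PID against a containerized PID with the same function shows the container boundary: the link targets differ for every unshared namespace type.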

• See also: unshare(2), setns(2), namespaces(7)

Docker Process Architecture

[Diagram: Docker process architecture. The docker CLI (run from bash) talks to dockerd, which delegates to docker-containerd; each container (e.g. ubuntu:xenial) is launched by a docker-containerd-shim, which invokes runc.]

Experiment: Namespace System Calls

host[1]% CONTAINERD=$(pgrep -o -f docker-containerd)
host[1]% strace -f -eunshare -qq -p $CONTAINERD
[pid 16274] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|CLONE_NEWPID) = 0
...
host[2]% docker run -it --rm ubuntu:xenial bash

Experiment: Namespace Isolation (1/2)

host% docker run -d --rm microsoft/dotnet:runtime …app
host% docker run -d --rm microsoft/dotnet:runtime …app
cont1# ls /proc | head -2
1
14
cont2# ls /proc | head -2
1
25
host% ps -e | grep simpleapp | grep -v grep
  316 ?        00:00:00 simpleapp
32709 ?        00:00:00 simpleapp

Experiment: Namespace Isolation (2/2)

cont1# touch /tmp/file1
cont2# ls /tmp
clr-debug-pipe-1-502146227-in  clr-debug-pipe-1-502146227-out

Experiment: Listing Namespaces (1/2)

host% ls -l /proc/$PID/ns
total 0
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 cgroup -> cgroup:[4026531835]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 ipc -> ipc:[4026532323]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 mnt -> mnt:[4026532321]
lrwxrwxrwx. 1 root root 0 Apr 17 10:02 net -> net:[4026532326]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 pid -> pid:[4026532324]
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 pid_for_children -> ...
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 uts -> uts:[4026532322]

Experiment: Listing Namespaces (2/2)

host% lsns -t pid
NS         TYPE NPROCS PID   USER COMMAND
4026531836 pid  92     1     root /usr/lib//systemd ...
4026532260 pid  1      32709 root /app/simpleapp
4026532324 pid  1      316   root /app/simpleapp
host% lsns -p 316
NS         TYPE NPROCS PID USER COMMAND
...
4026532321 mnt  1      316 root /app/simpleapp
4026532322 uts  1      316 root /app/simpleapp
4026532323 ipc  1      316 root /app/simpleapp
4026532324 pid  1      316 root /app/simpleapp
4026532326 net  1      316 root /app/simpleapp

Experiment: Entering Namespaces (1/2)

host% nsenter -t 316 -m ls -la /tmp
total 16
...
host% strace -f -esetns nsenter -t 316 -a true
setns(3, CLONE_NEWCGROUP) = 0
setns(4, CLONE_NEWIPC) = 0
setns(5, CLONE_NEWUTS) = 0
setns(6, CLONE_NEWNET) = 0
setns(7, CLONE_NEWPID) = 0
setns(8, CLONE_NEWNS) = 0
...

Experiment: Entering Namespaces (2/2)

host[2]% docker exec $CID true
host[1]% strace -f -esetns -qq -p $CONTAINERD
[pid 712] setns(5, CLONE_NEWIPC) = 0
[pid 712] setns(8, CLONE_NEWUTS) = 0
[pid 712] setns(9, CLONE_NEWNET) = 0
[pid 712] setns(10, CLONE_NEWPID) = 0
[pid 712] setns(11, CLONE_NEWNS) = 0
...

Experiment: “Attaching” A Container (1/4)

host% docker run -it -v $PWD:/src --rm microsoft/dotnet:2.1-sdk sh -c 'cd /src && dotnet new console'
host% vim Program.cs
host% docker run --rm -v $PWD:/src microsoft/dotnet:2.1-sdk sh -c 'cd /src && dotnet restore && dotnet publish -c Release -o ./out -r linux-x64'
host% docker run --name app --rm -d -v $PWD/out:/app microsoft/dotnet:2.1-runtime-deps /app/simpleapp

Experiment: “Attaching” A Container (2/4)

host% sudo docker exec app ps
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"ps\": executable file not found in $PATH": unknown
host% sudo docker exec app ls /proc
1
15
...

Experiment: “Attaching” A Container (3/4)

host% sudo docker exec -t app strace -ewrite -p 15
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"strace\": executable file not found in $PATH": unknown
host% sudo docker exec -it app sh
cont# apt update && apt install strace
...
cont# strace -ewrite -p 15
strace: attach: ptrace(PTRACE_ATTACH, 15): Operation not permitted

Experiment: “Attaching” A Container (4/4)

host% docker run --rm -it --pid=container:app --cap-add=SYS_PTRACE alpine sh
cont# apk add strace --no-cache
...
cont# strace -ewrite -p 1
strace: Process 1 attached with 7 threads
[pid 1] write(20, ".", 1) = 1
[pid 1] write(20, ".", 1) = 1
^Cstrace: Process 1 detached

See also: capabilities(7)

Control Groups

• Control groups place quotas on processes or process groups
• cpu,cpuacct: used to cap CPU usage and apply CPU shares
  • docker run --cpus --cpu-shares
• memory: used to cap user and kernel memory usage
  • docker run --memory --kernel-memory
• blkio: used to cap IOPS and throughput per block device, and to assign weights (shares)
  • docker run --device-{read,write}-{ops,bps}

• See also: cgroups(7)

Experiment: Controlling CPU Usage

host% docker run --rm -d --cpus=0.5 progrium/stress -c 4
host% top
...
%CPU %MEM TIME+   COMMAND
12.7 0.0  0:09.74 stress
12.7 0.0  0:09.87 stress
12.7 0.0  0:09.79 stress
12.3 0.0  0:09.78 stress

Experiment: Identifying Throttling (1/2)

host% docker run --rm -d --cpus=0.5 progrium/stress -c 4
host% for pid in `pidof stress`; do echo $pid; \
  grep nonvoluntary /proc/${pid}/status; done
12652
nonvoluntary_ctxt_switches: 12487
12651
nonvoluntary_ctxt_switches: 13920
12650
nonvoluntary_ctxt_switches: 14001
12649
nonvoluntary_ctxt_switches: 12489
12617
nonvoluntary_ctxt_switches: 31

Experiment: Identifying Throttling (2/2)

host% CONTAINER=$(docker inspect --format='{{.Id}}' ...)
host% CGROUPDIR=/sys/fs/cgroup/cpu,cpuacct/docker/${CONTAINER}
host% cat ${CGROUPDIR}/cpu.cfs_period_us
100000
host% cat ${CGROUPDIR}/cpu.cfs_quota_us
50000
host% cat ${CGROUPDIR}/cpu.stat
nr_periods 8354
nr_throttled 8352
throttled_time 1249595091213

Experiment: Controlling Memory Usage (1/2)

host% docker run --name app -d --memory=128m ...
host% cat /sys/fs/cgroup/memory/docker/.../memory.limit_in_bytes
134217728
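The cpu.stat counters from the throttling experiment are easy to post-process: nr_throttled / nr_periods tells you in what fraction of CFS periods the container's tasks were stopped. A minimal Python sketch (parse_cpu_stat is a hypothetical helper; the sample values are the ones shown above):

```python
# cgroup v1 cpu.stat content, as read from the throttling experiment.
sample = """\
nr_periods 8354
nr_throttled 8352
throttled_time 1249595091213
"""

def parse_cpu_stat(text):
    """Parse 'key value' lines into a dict of ints."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

stat = parse_cpu_stat(sample)
ratio = stat["nr_throttled"] / stat["nr_periods"]
print(f"throttled in {ratio:.1%} of periods")
```

With the numbers above the container was throttled in essentially every period, which is exactly what --cpus=0.5 with four busy workers should produce.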

Note: overcommit behavior and uncommitted memory dramatically complicate things. How much memory is your process really using?

From /proc/$PID/status:
VmPeak:  2533932 kB
VmSize:  2533560 kB
VmLck:         4 kB
VmPin:         0 kB
VmHWM:     28144 kB
VmRSS:     28144 kB
RssAnon:    5232 kB
RssFile:   22912 kB
RssShmem:      0 kB
VmData:    69012 kB
VmStk:       132 kB
VmExe:       292 kB
VmLib:     56172 kB
VmPTE:       304 kB
VmSwap:        0 kB

From /sys/fs/cgroup/memory/.../memory.stat:
cache 8192
rss 5783552
rss_huge 0
shmem 8192
mapped_file 8192

Experiment: Controlling Memory Usage (2/2)

host% dmesg
[5027568.316065] simpleapp invoked oom-killer: ...
[5027568.349751] oom_kill_process+0x218/0x420
[5027568.352827] out_of_memory+0x2ea/0x4f0
[5027568.355900] mem_cgroup_out_of_memory...
[5027568.359454] mem_cgroup_oom_synchronize...
[5027568.363031] ? get_mem_cgroup_from_mm+...
[5027568.366437] pagefault_out_of_memory...
[5027568.369902] __do_page_fault+0x4a7/0x4e0
[5027568.373138] do_page_fault+0x32/0x110
[5027568.376057] ? page_fault+0x36/0x60
[5027568.378846] page_fault+0x4c/0x60
...
[5027568.413222] Task in /docker/fad6... killed as a result of limit of /docker/fad6...
[5027568.424013] memory: usage 131072kB, limit 131072kB, failcnt 8
[5027568.436555] Memory cgroup stats for /docker/fad6...: cache:8KB rss:129536KB rss_huge:0KB shmem:8KB mapped_file:8KB ...
[5027568.463326] Memory cgroup out of memory: Kill process 2758 (simpleapp) score 1166 or sacrifice child
[5027568.469250] Killed process 2758 (simpleapp) total-vm:2607292kB, anon-rss:129296kB, file-rss:23008kB, shmem-rss:0kB
[5027568.480651] oom_reaper: reaped process 2758 (simpleapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Experiment: Controlling Disk I/O

host% docker run --rm --device-write-bps=/dev/xvda:10kb ubuntu dd oflag=direct if=/dev/zero of=/file bs=4k count=10
10+0 records in
10+0 records out
40960 bytes (41 kB, 40 KiB) copied, 4.00837 s, 10.2 kB/s

CLR CGroup Awareness
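The VmRSS figure from /proc/$PID/status shown in the memory experiments can also be read programmatically, which is what monitoring agents (and a cgroup-aware runtime) effectively do. A minimal Python sketch, assuming a Linux /proc (the rss_kb helper is illustrative):

```python
def rss_kb(pid="self"):
    """Return the resident set size of a process in kB, from /proc.

    Parses the VmRSS line of /proc/<pid>/status, the same field shown
    in the memory experiment above.
    """
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return None

print(f"current RSS: {rss_kb()} kB")
```

Comparing this per-process number against the cgroup's memory.limit_in_bytes is the crude version of the awareness the runtime needs.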

• The CLR needs to be cgroup-aware:
  • Start aggressive GC when getting close to the cgroup limit
  • Plan the generation sizes and GC work according to the cgroup limit
  • Correctly determine the number of CPU cores assigned to the process (Environment.ProcessorCount, TPL planning)
  • Create the appropriate number of server GC heaps (based on # of cores)
• A lot of work is still in progress, see for example:
  • https://github.com/dotnet/coreclr/issues/13489
  • https://github.com/dotnet/corefx/issues/25193
  • https://github.com/dotnet/coreclr/issues/14991
  • https://github.com/dotnet/coreclr/pull/10064
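One piece of that cgroup awareness — deriving an effective processor count from the CFS quota — can be sketched in Python. The effective_cpus helper is hypothetical and only mirrors the general idea, not CoreCLR's actual implementation:

```python
import math
import os

def effective_cpus(quota_us, period_us, online=None):
    """Derive an effective CPU count from cgroup v1 CFS settings.

    quota_us/period_us come from cpu.cfs_quota_us and cpu.cfs_period_us;
    --cpus=0.5 sets quota=50000, period=100000. A quota of -1 means
    'no limit', in which case the machine's online core count wins.
    """
    online = online or os.cpu_count()
    if quota_us <= 0:
        return online
    return min(online, max(1, math.ceil(quota_us / period_us)))

print(effective_cpus(50000, 100000))             # 1: 0.5 CPUs rounds up to one core
print(effective_cpus(400000, 100000, online=2))  # 2: capped by online cores
```

A runtime that sizes its thread pool and server GC heaps from this number, instead of from the raw core count, avoids massive over-subscription inside a throttled container.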

• High-level, Docker-specific: docker stats
• systemd-cgtop
• Control group metrics
  • E.g. /sys/fs/cgroup/cpu,cpuacct/*/*/*
• Third-party monitoring solutions: cAdvisor, Snap, PCP, collectd

Docker Image Layers

• Container images consist of layers mounted on top of each other in a copy-on-write filesystem
• The top layer is writable, and writes create a just-in-time copy (CoW)
• The rest of the layers are read-only
• View the layer history of an image with docker history
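The fall-through read / copy-up write behavior can be modeled with a toy Python class. LayeredFS is purely illustrative (real overlay filesystems work at the file and block level, not on dicts), but it captures why base layers can be shared between containers:

```python
class LayeredFS:
    """Toy model of copy-on-write image layers.

    Reads fall through from the writable top layer to the read-only
    layers below; writes always land in the top layer, leaving the
    shared layers untouched.
    """

    def __init__(self, *ro_layers):
        self.ro_layers = list(ro_layers)  # bottom-up read-only layers
        self.top = {}                     # per-container writable layer

    def read(self, path):
        if path in self.top:
            return self.top[path]
        for layer in reversed(self.ro_layers):  # top-most RO layer first
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.top[path] = data  # copy-on-write: base layers never change

base = {"/lib64/libc.so": "libc"}
runtime = {"/usr/bin/dotnet": "dotnet"}
fs = LayeredFS(base, runtime)
fs.write("/usr/bin/dotnet", "patched")
print(fs.read("/usr/bin/dotnet"))   # patched (served from the top layer)
print(runtime["/usr/bin/dotnet"])   # dotnet (shared layer untouched)
```

A second LayeredFS over the same base and runtime dicts would see the original files, which is exactly the sharing the diagram below depicts.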

[Diagram: two containers’ layer stacks. Each container has its own writable R/W layer; both share the read-only compiled app image (/app/myapp.dll, /app/mylib.dll), the .NET Core base image (/usr/bin/dotnet, /lib64/libcurl.so), and the base image (/lib64/libc.so, /etc/hosts).]

Experiment: Examining Image History

host% docker history microsoft/dotnet-nightly:2.1-runtime
…edited for brevity…
/bin/sh -c curl -sL --output dotnet.tar.gz … && mkdir -p /usr/share/dotnet && tar -xzf dotnet.tar.gz -C /usr/share/dotnet && rm dotnet.tar.gz && ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet   74MB
/bin/sh -c apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*   7.02MB
/bin/sh -c apt-get update && apt-get install -y ca-certificates libc6 libgcc1 …   43.8MB
…edited for brevity…

Sidebar: Tips For Smaller Images

• Don’t install debugging/development tools in the production container (e.g. gdb, curl, wrk, …)
• Remove package caches in the same layer where they are installed:
  • Debian: rm -rf /var/lib/apt/lists/*
  • Alpine: apk add --no-cache
  • CentOS/RHEL: yum clean all
• Minimize the number of layers by installing packages together:
  • apt install -y curl net-tools wget gcc
• Use Dockerfile linters, e.g. https://www.fromlatest.io

The Smallest Container Image

host% docker images hello-world
REPOSITORY  TAG    SIZE
hello-world latest 1.85kB

• We can go smaller. The scratch container is 0 bytes. But that’s cheating. The smallest ELF executable is 45 bytes:

host% docker images smallest
REPOSITORY TAG    SIZE
smallest   latest 45B

How About .NET Core?

host% docker images microsoft/dotnet:2.1-*
REPOSITORY       TAG              SIZE
microsoft/dotnet 2.1-sdk          1.72GB
microsoft/dotnet 2.1-runtime      180MB
microsoft/dotnet 2.1-runtime-deps 99MB

• Use the sdk images for building only
• Use the runtime images for running the app (dotnet myapp), or …
• Build with --self-contained and use the runtime-deps images
  • ”Hello World” app output with --self-contained is 72MB
  • The experimental IL Linker can further reduce the size of self-contained images

Use Multi-Stage Builds

FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY . /src
RUN dotnet restore
RUN dotnet publish -c Release -o out

FROM microsoft/dotnet:2.1-runtime
WORKDIR /app
COPY --from=build /src/out .
ENTRYPOINT ["dotnet", "myapp.dll"]

Consider Using Alpine-Based Images

• Alpine is a lightweight Linux distribution very well-suited for running in containers (musl instead of glibc, busybox for the shell)

host% docker images alpine
REPOSITORY TAG    SIZE
alpine     latest 4.15MB
host% docker images microsoft/*alpine*
REPOSITORY               TAG                     SIZE
microsoft/dotnet-nightly 2.1-runtime-alpine      87.5MB
microsoft/dotnet-nightly 2.1-runtime-deps-alpine 13.1MB

How About a Self-Contained Alpine Image?

FROM microsoft/dotnet-nightly:2.1-sdk AS build
WORKDIR /src
COPY . /src
RUN dotnet publish -c Release -o /app -r alpine.3.6-x64

FROM microsoft/dotnet-nightly:2.1-runtime-deps-alpine
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["/app/simpleapp"]

How About a Self-Contained Alpine Image?

host% docker build -t app:selfcontained .
...
host% docker images app:selfcontained
REPOSITORY TAG           SIZE
app        selfcontained 54.3MB

Some Perspective (Empty App)

Platform         Distribution           Size (MB)
.NET Core        Alpine (runtime)       87
.NET Core        Alpine (runtime-deps)  54
.NET Core        Debian (runtime)       180
.NET Core        Debian (runtime-deps)  140
Node.js          Alpine                 23
Node.js          Debian                 202
OpenJDK 8 (JRE)  Alpine                 57
OpenJDK 8 (JRE)  Debian (headless JVM)  79

And Some More Perspective

Summary

• Why containers?
• Container building blocks
• Security profiles
• Namespaces
• Control groups
• Building tiny containers

Questions?

Sasha Goldshtein (@goldshtn), CTO, Sela Group
github.com/goldshtn