
How Containers Work
Sasha Goldshtein (@goldshtn), CTO, Sela Group
github.com/goldshtn

Agenda

• Why containers?
• Container building blocks
• Security profiles
• Namespaces
• Control groups
• Building tiny containers

Hardware vs. OS Virtualization

[Diagram: hardware vs. OS virtualization. On the VM side, each app (ASP.NET, Express, RavenDB, Nginx) runs on its own runtime (.NET Core, V8, libc), guest OS (Ubuntu, RHEL), and Linux kernel, all on a hypervisor — the OS layers are duplicated. On the container side, the same apps and runtimes share a single Linux kernel, with only the image layers above it.]

Key Container Usage Scenarios

• Simplifying configuration: infrastructure as code
• Consistent environment from development to production
• Application isolation without an extra copy of the OS
• Server consolidation, thinner deployments
• Rapid deployment and scaling

Linux Container Building Blocks

[Diagram: Linux container building blocks. Three containers (node:6, ms/aspnetcore:2, openjdk:8), each with its own files (/app/server.js, /app/server.dll, /app/server.jar on top of /usr/bin/node, /usr/bin/dotnet, /usr/bin/java and /lib64/libc.so), isolated by user namespaces, capped by control groups (1 CPU/1 GB, 2 CPU/4 GB, 1 CPU/2 GB), stored on copy-on-write layered filesystems, and making syscalls into the shared kernel through a security profile.]

Security Profiles

• Docker supports additional security modules that restrict what containerized processes can do
• AppArmor: text-based configuration that restricts access to certain files, network operations, etc.
• Docker uses seccomp to restrict system calls performed by the containerized process
• The default seccomp profile blocks syscalls like reboot, stime, umount, specific types of socket protocols, and others
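The whitelist structure of such a profile can be sketched in Python. The rule layout below mirrors Docker's default seccomp profile (a default action plus whitelisted syscall names), but the trimmed syscall list and the allow_syscall helper are illustrative, not part of Docker:

```python
import json

# Skeleton of a seccomp whitelist profile: anything not listed gets
# SCMP_ACT_ERRNO; listed syscalls are allowed. (Only a few names shown.)
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["openat", "pause", "pipe", "pipe2"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

def allow_syscall(profile, name):
    """Add a syscall to the whitelist rule if it is not already present."""
    names = profile["syscalls"][0]["names"]
    if name not in names:
        names.append(name)
    return profile

# This is the edit the perf experiment below performs by hand with diff:
allow_syscall(profile, "perf_event_open")
print(json.dumps(profile, indent=2))
```

The resulting JSON is what you would pass to docker run via --security-opt seccomp=….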

• See also: Docker seccomp profile, AppArmor for Docker

Experiment: Using a Custom Profile (1/4)

host% docker run -d --rm microsoft/dotnet:runtime …app
host% docker exec -it $CID bash
cont# apt update && apt install linux-perf
cont# /usr/bin/perf_4.9 record -F 97 -a
... perf_event_open(..., 0) failed unexpectedly ...
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid, ...

Experiment: Using a Custom Profile (2/4)

host% echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
0
cont# /usr/bin/perf_4.9 record -F 97 -ag
... perf_event_open(..., 0) failed unexpectedly ...
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid, ...

Experiment: Using a Custom Profile (3/4)

host% diff --color=always --context=2 default.json perf.json
*** default.json 2018-04-18 06:06:19.612371527 +0000
--- perf.json 2018-04-18 06:06:38.423793797 +0000
***************
*** 217,220 ****
--- 217,221 ----
  "openat",
  "pause",
+ "perf_event_open",
  "pipe",
  "pipe2",

Experiment: Using a Custom Profile (4/4)

host% docker run -d --security-opt seccomp=./perf.json microsoft/dotnet:runtime …app
host% docker exec -it app sh -c 'apt update && \
  apt install -y linux-perf && \
  /usr/bin/perf_4.9 record \
  -o /app/perf.data -F 97 -a -- sleep 5'
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.064 MB /app/perf.data (970 samples) ]

Capabilities

• Docker also disables capabilities such as CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_SYS_BOOT, and CAP_SYS_TIME, which you can add back if necessary
  • docker run --cap-add ...
• Docker’s default seccomp policy is configured to also enable certain syscalls when the relevant capability is enabled
  • E.g. ptrace(2) when CAP_SYS_PTRACE is enabled
  • E.g. reboot(2) when CAP_SYS_BOOT is enabled
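A process's effective capabilities show up as the CapEff hex bitmask in /proc/<pid>/status. Decoding that mask is simple bit arithmetic; a minimal Python sketch using the bit numbers from the kernel's capability.h (the decode_caps helper is illustrative, and only a few of the capabilities Docker drops are listed):

```python
# Bit numbers from linux/capability.h for a few capabilities Docker
# disables by default.
CAPS = {
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
    25: "CAP_SYS_TIME",
}

def decode_caps(mask):
    """Return the names of the known capabilities set in a CapEff mask."""
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

# A mask with only CAP_SYS_PTRACE and CAP_SYS_BOOT set, e.g. after
# docker run --cap-add=SYS_PTRACE --cap-add=SYS_BOOT:
mask = (1 << 19) | (1 << 22)
print(decode_caps(mask))  # ['CAP_SYS_PTRACE', 'CAP_SYS_BOOT']
```

In practice you would parse the CapEff line of /proc/self/status (a hex string) and feed int(value, 16) into the same function.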

• See also: capabilities(7)

Namespaces

• Namespaces isolate processes from each other and restrict visibility
• PID: container gets its own PIDs
• mnt: container gets its own mount points (view of the filesystem)
• net: container gets its own network interfaces
• user: container gets its own user and group ids
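Namespace membership is visible under /proc: two processes share a namespace exactly when their /proc/<pid>/ns/* symlinks resolve to the same "type:[inode]" identifier. A small Python sketch, assuming a Linux /proc (the same_namespace helper is illustrative):

```python
import os

def same_namespace(pid_a, pid_b, ns_type="pid"):
    """True iff the two PIDs are in the same namespace of the given type.

    The kernel exposes each namespace as a symlink like pid:[4026532324];
    equal link targets mean equal namespaces.
    """
    link_a = os.readlink(f"/proc/{pid_a}/ns/{ns_type}")
    link_b = os.readlink(f"/proc/{pid_b}/ns/{ns_type}")
    return link_a == link_b

me = os.getpid()
print(os.readlink(f"/proc/{me}/ns/uts"))  # e.g. uts:[4026531838]
print(same_namespace(me, me, "uts"))      # True (trivially: same process)
```

Comparing a host shell's PID against a containerized PID with the same function shows the container boundary: the link targets differ for every unshared namespace type.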

• See also: unshare(2), setns(2), namespaces(7)

Docker Process Architecture

[Diagram: Docker process architecture. The docker CLI (run from bash) talks to dockerd, which delegates to docker-containerd; each container (e.g. ubuntu:xenial) is launched by a docker-containerd-shim, which invokes runc.]

Experiment: Namespace System Calls

host[1]% CONTAINERD=$(pgrep -o -f docker-containerd)
host[1]% strace -f -eunshare -qq -p $CONTAINERD
[pid 16274] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|CLONE_NEWPID) = 0
...
host[2]% docker run -it --rm ubuntu:xenial bash

Experiment: Namespace Isolation (1/2)

host% docker run -d --rm microsoft/dotnet:runtime …app
host% docker run -d --rm microsoft/dotnet:runtime …app
cont1# ls /proc | head -2
1
14
cont2# ls /proc | head -2
1
25
host% ps -e | grep simpleapp | grep -v grep
  316 ?        00:00:00 simpleapp
32709 ?        00:00:00 simpleapp

Experiment: Namespace Isolation (2/2)

cont1# touch /tmp/file1
cont2# ls /tmp
clr-debug-pipe-1-502146227-in  clr-debug-pipe-1-502146227-out

Experiment: Listing Namespaces (1/2)

host% ls -l /proc/$PID/ns
total 0
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 cgroup -> cgroup:[4026531835]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 ipc -> ipc:[4026532323]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 mnt -> mnt:[4026532321]
lrwxrwxrwx. 1 root root 0 Apr 17 10:02 net -> net:[4026532326]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 pid -> pid:[4026532324]
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 pid_for_children -> ...
lrwxrwxrwx. 1 root root 0 Apr 17 10:07 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Apr 17 10:03 uts -> uts:[4026532322]

Experiment: Listing Namespaces (2/2)

host% lsns -t pid
NS         TYPE NPROCS PID   USER COMMAND
4026531836 pid  92     1     root /usr/lib//systemd ...
4026532260 pid  1      32709 root /app/simpleapp
4026532324 pid  1      316   root /app/simpleapp
host% lsns -p 316
NS         TYPE NPROCS PID USER COMMAND
...
4026532321 mnt  1      316 root /app/simpleapp
4026532322 uts  1      316 root /app/simpleapp
4026532323 ipc  1      316 root /app/simpleapp
4026532324 pid  1      316 root /app/simpleapp
4026532326 net  1      316 root /app/simpleapp

Experiment: Entering Namespaces (1/2)

host% nsenter -t 316 -m ls -la /tmp
total 16
...
host% strace -f -esetns nsenter -t 316 -a true
setns(3, CLONE_NEWCGROUP) = 0
setns(4, CLONE_NEWIPC) = 0
setns(5, CLONE_NEWUTS) = 0
setns(6, CLONE_NEWNET) = 0
setns(7, CLONE_NEWPID) = 0
setns(8, CLONE_NEWNS) = 0
...

Experiment: Entering Namespaces (2/2)

host[2]% docker exec $CID true
host[1]% strace -f -esetns -qq -p $CONTAINERD
[pid 712] setns(5, CLONE_NEWIPC) = 0
[pid 712] setns(8, CLONE_NEWUTS) = 0
[pid 712] setns(9, CLONE_NEWNET) = 0
[pid 712] setns(10, CLONE_NEWPID) = 0
[pid 712] setns(11, CLONE_NEWNS) = 0
...

Experiment: “Attaching” A Container (1/4)

host% docker run -it -v $PWD:/src --rm microsoft/dotnet:2.1-sdk sh -c 'cd /src && dotnet new console'
host% vim Program.cs
host% docker run --rm -v $PWD:/src microsoft/dotnet:2.1-sdk sh -c 'cd /src && dotnet restore && dotnet publish -c Release -o ./out -r linux-x64'
host% docker run --name app --rm -d -v $PWD/out:/app microsoft/dotnet:2.1-runtime-deps /app/simpleapp

Experiment: “Attaching” A Container (2/4)

host% sudo docker exec app ps
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"ps\": executable file not found in $PATH": unknown
host% sudo docker exec app ls /proc
1
15
...

Experiment: “Attaching” A Container (3/4)

host% sudo docker exec -t app strace -ewrite -p 15
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"strace\": executable file not found in $PATH": unknown
host% sudo docker exec -it app sh
cont# apt update && apt install strace
...
cont# strace -ewrite -p 15
strace: attach: ptrace(PTRACE_ATTACH, 15): Operation not permitted

Experiment: “Attaching” A Container (4/4)

host% docker run --rm -it --pid=container:app --cap-add=SYS_PTRACE alpine sh
cont# apk add strace --no-cache
...
cont# strace -ewrite -p 1
strace: Process 1 attached with 7 threads
[pid 1] write(20, ".", 1) = 1
[pid 1] write(20, ".", 1) = 1
^Cstrace: Process 1 detached

See also: capabilities(7)

Control Groups

• Control groups place quotas on processes or process groups
• cpu,cpuacct: used to cap CPU usage and apply CPU shares
  • docker run --cpus --cpu-shares
• memory: used to cap user and kernel memory usage
  • docker run --memory --kernel-memory
• blkio: used to cap IOPS and throughput per block device, and to assign weights (shares)
  • docker run --device-{read,write}-{ops,bps}

• See also: cgroups(7)

Experiment: Controlling CPU Usage

host% docker run --rm -d --cpus=0.5 progrium/stress -c 4
host% top
...
%CPU %MEM TIME+   COMMAND
12.7 0.0  0:09.74 stress
12.7 0.0  0:09.87 stress
12.7 0.0  0:09.79 stress
12.3 0.0  0:09.78 stress

Experiment: Identifying Throttling (1/2)

host% docker run --rm -d --cpus=0.5 progrium/stress -c 4
host% for pid in `pidof stress`; do echo $pid; \
  grep nonvoluntary /proc/${pid}/status; done
12652
nonvoluntary_ctxt_switches: 12487
12651
nonvoluntary_ctxt_switches: 13920
12650
nonvoluntary_ctxt_switches: 14001
12649
nonvoluntary_ctxt_switches: 12489
12617
nonvoluntary_ctxt_switches: 31

Experiment: Identifying Throttling (2/2)

host% CONTAINER=$(docker inspect --format='{{.Id}}' ...)
host% CGROUPDIR=/sys/fs/cgroup/cpu,cpuacct/docker/${CONTAINER}
host% cat ${CGROUPDIR}/cpu.cfs_period_us
100000
host% cat ${CGROUPDIR}/cpu.cfs_quota_us
50000
host% cat ${CGROUPDIR}/cpu.stat
nr_periods 8354
nr_throttled 8352
throttled_time 1249595091213

Experiment: Controlling Memory Usage (1/2)

host% docker run --name app -d --memory=128m ...
host% cat /sys/fs/cgroup/memory/docker/.../memory.limit_in_bytes
134217728
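The cpu.stat counters from the throttling experiment are easy to post-process: nr_throttled / nr_periods tells you in what fraction of CFS periods the container's tasks were stopped. A minimal Python sketch (parse_cpu_stat is a hypothetical helper; the sample values are the ones shown above):

```python
# cgroup v1 cpu.stat content, as read from the throttling experiment.
sample = """\
nr_periods 8354
nr_throttled 8352
throttled_time 1249595091213
"""

def parse_cpu_stat(text):
    """Parse 'key value' lines into a dict of ints."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

stat = parse_cpu_stat(sample)
ratio = stat["nr_throttled"] / stat["nr_periods"]
print(f"throttled in {ratio:.1%} of periods")
```

With the numbers above the container was throttled in essentially every period, which is exactly what --cpus=0.5 with four busy workers should produce.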

Note: overcommit behavior and uncommitted memory dramatically complicate things. How much memory is your process really using?

From /proc/$PID/status:
VmPeak:  2533932 kB
VmSize:  2533560 kB
VmLck:         4 kB
VmPin:         0 kB
VmHWM:     28144 kB
VmRSS:     28144 kB
RssAnon:    5232 kB
RssFile:   22912 kB
RssShmem:      0 kB
VmData:    69012 kB
VmStk:       132 kB
VmExe:       292 kB
VmLib:     56172 kB
VmPTE:       304 kB
VmSwap:        0 kB

From /sys/fs/cgroup/memory/.../memory.stat:
cache 8192
rss 5783552
rss_huge 0
shmem 8192
mapped_file 8192

Experiment: Controlling Memory Usage (2/2)

host% dmesg
[5027568.316065] simpleapp invoked oom-killer: ...
[5027568.349751] oom_kill_process+0x218/0x420
[5027568.352827] out_of_memory+0x2ea/0x4f0
[5027568.355900] mem_cgroup_out_of_memory...
[5027568.359454] mem_cgroup_oom_synchronize...
[5027568.363031] ? get_mem_cgroup_from_mm+...
[5027568.366437] pagefault_out_of_memory...
[5027568.369902] __do_page_fault+0x4a7/0x4e0
[5027568.373138] do_page_fault+0x32/0x110
[5027568.376057] ? page_fault+0x36/0x60
[5027568.378846] page_fault+0x4c/0x60
...
[5027568.413222] Task in /docker/fad6... killed as a result of limit of /docker/fad6...
[5027568.424013] memory: usage 131072kB, limit 131072kB, failcnt 8
[5027568.436555] Memory cgroup stats for /docker/fad6...: cache:8KB rss:129536KB rss_huge:0KB shmem:8KB mapped_file:8KB ...
[5027568.463326] Memory cgroup out of memory: Kill process 2758 (simpleapp) score 1166 or sacrifice child
[5027568.469250] Killed process 2758 (simpleapp) total-vm:2607292kB, anon-rss:129296kB, file-rss:23008kB, shmem-rss:0kB
[5027568.480651] oom_reaper: reaped process 2758 (simpleapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Experiment: Controlling Disk I/O

host% docker run --rm --device-write-bps=/dev/xvda:10kb ubuntu dd oflag=direct if=/dev/zero of=/file bs=4k count=10
10+0 records in
10+0 records out
40960 bytes (41 kB, 40 KiB) copied, 4.00837 s, 10.2 kB/s

CLR CGroup Awareness
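The VmRSS figure from /proc/$PID/status shown in the memory experiments can also be read programmatically, which is what monitoring agents (and a cgroup-aware runtime) effectively do. A minimal Python sketch, assuming a Linux /proc (the rss_kb helper is illustrative):

```python
def rss_kb(pid="self"):
    """Return the resident set size of a process in kB, from /proc.

    Parses the VmRSS line of /proc/<pid>/status, the same field shown
    in the memory experiment above.
    """
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return None

print(f"current RSS: {rss_kb()} kB")
```

Comparing this per-process number against the cgroup's memory.limit_in_bytes is the crude version of the awareness the runtime needs.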

• The CLR needs to be cgroup-aware:
  • Start aggressive GC when getting close to the cgroup limit
  • Plan the generation sizes and GC work according to the cgroup limit
  • Correctly determine the number of CPU cores assigned to the process (Environment.ProcessorCount, TPL planning)
  • Create the appropriate number of server GC heaps (based on # of cores)
• A lot of work is still in progress, see for example:
  • https://github.com/dotnet/coreclr/issues/13489
  • https://github.com/dotnet/corefx/issues/25193
  • https://github.com/dotnet/coreclr/issues/14991
  • https://github.com/dotnet/coreclr/pull/10064
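One piece of that cgroup awareness — deriving an effective processor count from the CFS quota — can be sketched in Python. The effective_cpus helper is hypothetical and only mirrors the general idea, not CoreCLR's actual implementation:

```python
import math
import os

def effective_cpus(quota_us, period_us, online=None):
    """Derive an effective CPU count from cgroup v1 CFS settings.

    quota_us/period_us come from cpu.cfs_quota_us and cpu.cfs_period_us;
    --cpus=0.5 sets quota=50000, period=100000. A quota of -1 means
    'no limit', in which case the machine's online core count wins.
    """
    online = online or os.cpu_count()
    if quota_us <= 0:
        return online
    return min(online, max(1, math.ceil(quota_us / period_us)))

print(effective_cpus(50000, 100000))             # 1: 0.5 CPUs rounds up to one core
print(effective_cpus(400000, 100000, online=2))  # 2: capped by online cores
```

A runtime that sizes its thread pool and server GC heaps from this number, instead of from the raw core count, avoids massive over-subscription inside a throttled container.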

• High-level, Docker-specific: docker stats
• systemd-cgtop
• Control group metrics
  • E.g. /sys/fs/cgroup/cpu,cpuacct/*/*/*
• Third-party monitoring solutions: cAdvisor, Snap, PCP, collectd

Docker Image Layers

• Container images consist of layers mounted on top of each other in a copy-on-write filesystem
• The top layer is writable, and writes create a just-in-time copy (CoW)
• The rest of the layers are read-only
• View the layer history of an image with docker history
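The fall-through read / copy-up write behavior can be modeled with a toy Python class. LayeredFS is purely illustrative (real overlay filesystems work at the file and block level, not on dicts), but it captures why base layers can be shared between containers:

```python
class LayeredFS:
    """Toy model of copy-on-write image layers.

    Reads fall through from the writable top layer to the read-only
    layers below; writes always land in the top layer, leaving the
    shared layers untouched.
    """

    def __init__(self, *ro_layers):
        self.ro_layers = list(ro_layers)  # bottom-up read-only layers
        self.top = {}                     # per-container writable layer

    def read(self, path):
        if path in self.top:
            return self.top[path]
        for layer in reversed(self.ro_layers):  # top-most RO layer first
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.top[path] = data  # copy-on-write: base layers never change

base = {"/lib64/libc.so": "libc"}
runtime = {"/usr/bin/dotnet": "dotnet"}
fs = LayeredFS(base, runtime)
fs.write("/usr/bin/dotnet", "patched")
print(fs.read("/usr/bin/dotnet"))   # patched (served from the top layer)
print(runtime["/usr/bin/dotnet"])   # dotnet (shared layer untouched)
```

A second LayeredFS over the same base and runtime dicts would see the original files, which is exactly the sharing the diagram below depicts.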

[Diagram: two containers’ layer stacks. Each container has its own writable R/W layer; both share the read-only compiled app image (/app/myapp.dll, /app/mylib.dll), the .NET Core base image (/usr/bin/dotnet, /lib64/libcurl.so), and the base image (/lib64/libc.so, /etc/hosts).]

Experiment: Examining Image History

host% docker history microsoft/dotnet-nightly:2.1-runtime
…edited for brevity…
/bin/sh -c curl -sL --output dotnet.tar.gz … && mkdir -p /usr/share/dotnet && tar -xzf dotnet.tar.gz -C /usr/share/dotnet && rm dotnet.tar.gz && ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet   74MB
/bin/sh -c apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*   7.02MB
/bin/sh -c apt-get update && apt-get install -y ca-certificates libc6 libgcc1 …   43.8MB
…edited for brevity…

Sidebar: Tips For Smaller Images

• Don’t install debugging/development tools in the production container (e.g. gdb, curl, wrk, …)
• Remove package caches in the same layer where they are installed:
  • Debian: rm -rf /var/lib/apt/lists/*
  • Alpine: apk add --no-cache
  • CentOS/RHEL: yum clean all
• Minimize the number of layers by installing packages together:
  • apt install -y curl net-tools wget gcc
• Use Dockerfile linters, e.g. https://www.fromlatest.io

The Smallest Container Image

host% docker images hello-world
REPOSITORY  TAG    SIZE
hello-world latest 1.85kB

• We can go smaller. The scratch container is 0 bytes. But that’s cheating. The smallest ELF executable is 45 bytes:

host% docker images smallest
REPOSITORY TAG    SIZE
smallest   latest 45B

How About .NET Core?

host% docker images microsoft/dotnet:2.1-*
REPOSITORY       TAG              SIZE
microsoft/dotnet 2.1-sdk          1.72GB
microsoft/dotnet 2.1-runtime      180MB
microsoft/dotnet 2.1-runtime-deps 99MB

• Use the sdk images for building only
• Use the runtime images for running the app (dotnet myapp), or …
• Build with --self-contained and use the runtime-deps images
  • ”Hello World” app output with --self-contained is 72MB
  • The experimental IL Linker can further reduce the size of self-contained images

Use Multi-Stage Builds

FROM microsoft/dotnet:2.1-sdk AS build
WORKDIR /src
COPY . /src
RUN dotnet restore
RUN dotnet publish -c Release -o out

FROM microsoft/dotnet:2.1-runtime
WORKDIR /app
COPY --from=build /src/out .
ENTRYPOINT ["dotnet", "myapp.dll"]

Consider Using Alpine-Based Images

• Alpine is a lightweight Linux distribution very well-suited for running in containers (musl instead of glibc, busybox for the shell)

host% docker images alpine
REPOSITORY TAG    SIZE
alpine     latest 4.15MB
host% docker images microsoft/*alpine*
REPOSITORY               TAG                     SIZE
microsoft/dotnet-nightly 2.1-runtime-alpine      87.5MB
microsoft/dotnet-nightly 2.1-runtime-deps-alpine 13.1MB

How About a Self-Contained Alpine Image?

FROM microsoft/dotnet-nightly:2.1-sdk AS build
WORKDIR /src
COPY . /src
RUN dotnet publish -c Release -o /app -r alpine.3.6-x64

FROM microsoft/dotnet-nightly:2.1-runtime-deps-alpine
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["/app/simpleapp"]

How About a Self-Contained Alpine Image?

host% docker build -t app:selfcontained .
...
host% docker images app:selfcontained
REPOSITORY TAG           SIZE
app        selfcontained 54.3MB

Some Perspective (Empty App)

Platform         Distribution           Size (MB)
.NET Core        Alpine (runtime)       87
.NET Core        Alpine (runtime-deps)  54
.NET Core        Debian (runtime)       180
.NET Core        Debian (runtime-deps)  140
Node.js          Alpine                 23
Node.js          Debian                 202
OpenJDK 8 (JRE)  Alpine                 57
OpenJDK 8 (JRE)  Debian (headless JVM)  79

And Some More Perspective

Summary

• Why containers?
• Container building blocks
• Security profiles
• Namespaces
• Control groups
• Building tiny containers

Questions?

Sasha Goldshtein (@goldshtn), CTO, Sela Group
github.com/goldshtn