& a Quantitative Comparison

Xu Wang, hyper.sh; Kata Containers Architecture Committee

Fupan Li, hyper.sh; Kata Containers Upstream Developer

“All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.”

---- David Wheeler

Agenda

• Secure Containers, or ABI Oriented Virtualization

• Architecture Comparison

• Benchmarks

• The Future of the Secure Container Projects

What’s Kata Containers

• A container runtime, like runC

• Built w/ virtualization tech, like VM

• Initiated by hyper.sh and Intel®

• Hosted by OpenStack Foundation

• Contributed by Huawei, MSFT, etc.

Kata Containers is a virtualized container.

What’s gVisor

• A container runtime, like runC, too

• A user-space kernel, written in Go

• Implements a substantial portion of the Linux system surface

• Developed at Google and open-sourced in mid-2018

gVisor implements Linux by way of Linux.

The Similarity

• OCI compatible container runtime

• Linux ABI for container applications

• Support OCI runtime spec (k8s, zun…)

• Extra isolation layer

• Guest Kernel, or say, Linux* in Linux

• Do not expose system interface of the host to the container apps

• Goals
  • Less overhead, better isolation

Figure: General Architecture of Kata Containers and gVisor (Container App, Linux ABI, black box handling the OCI interface and the Linux ABI, Host OS)

Architecture Comparison

Run Kata Containers

• Employ existing VMM (or improved) and standard Linux kernel

• Per-container shims and per-sandbox proxy are stand-alone binaries, or built in (such as in the containerd shim v2 implementation)

• An agent in guest OS based on libcontainer

Figure: Kata runtime (runC-like cmdline), shim (IO-stub process for container apps), proxy (no proxy when vsock is enabled), virt-serial or vsock, Kata Agent, Container Apps, Guest Linux Kernel, VMM (e.g. Qemu), Host Kernel

Run gVisor

• All in one binary (runsc + Sentry + Gofer)

• Sentry as a user-space kernel, and per-container process stub

• Gofer for 9p IO

Figure: runsc (runC-like cmdline), Container App, Sentry, Gofer (9p), syscall redirect, Host Kernel

Detailed Architecture – Kata

• Classic hardware-assisted virtualization

• Provide virtual hardware interface to Guest

• 2 layers of isolation
  • Qemu kvm and CPU ring protection

• Many IO optimization ways
  • Pass through
  • Vhost and vhost-user

Figure: App proc, agent, devices, drivers, Guest Kernel, Qemu, vhost-user, kvm, vhost, passthru, Host Kernel

Detailed Architecture – gVisor (1)

• 2 layers of isolation

• Minimal kernel, written in Golang

• Safer, but the compatibility…

• Ring protection and a small set of syscalls

• Avoids access to the more buggy syscalls

• File operations are processed in a separate Gofer process

• Syscall interception by ptrace or kvm

• The performance impact…

Figure: App proc, Sentry, Gofer, syscall interception, seccomp/namespace, Host Kernel

Detailed Architecture – gVisor (2)

• In gVisor-kvm:
  • Sentry is both host process and guest kernel
  • Ring0 lower in guest
  • Upper in VM
  • In Host

• Kvm is for higher syscall interception performance rather than isolation

• No isolation between the different modes of Sentry itself

Figure: App proc, Sentry in host, Gofer, Sentry as guest kernel, kvm: syscall interception, seccomp/namespace, Host Kernel

Expectations based on the Arch

• Isolation:
  • Both include 2 isolation layers, so better isolation than runC

• Compatibility:
  • Kata should have better compatibility than gVisor

• Performance:
  • Both should have little overhead on CPU/Memory
  • gVisor should have a smaller memory footprint than Kata, and may boot faster
  • gVisor may have some pain on syscall-heavy (including IO-heavy) workloads
  • Kata may have some IO optimization ways

Benchmarks and Analytics

Benchmarks & Setup

• Functional Tests
  • Syscall coverage

• Standard Benchmarks
  • Boot time
  • Memory footprint
  • CPU/Memory Performance
  • IO Performance
  • Networking Performance

• Real-life cases
  • nginx, redis
  • gVisor didn’t complete a mysql test, and a jenkins test (pull and build) failed due to the ‘git clone’ failure

Testing Setup

• Most of the cases: server on packet.net
  CPU: 4 Physical Cores @ 2.0 GHz (1 × E3-1578L)
  Memory: 32GB DDR4-2400 ECC RAM
  Storage: 240 GB of SSD (1 × 240 GB)
  Network: 10 Gbps Network (2 × INTEL X710 NIC'S IN TLB)

• Disk IO cases: a more powerful server with additional disks
  CPU: 24 Physical Cores @ 2.2 GHz (2 × E5-2650 V4)
  Memory: 256 GB of DDR4 ECC RAM (16 × 16 GB)
  Storage: 2.8 TB of SSD (6 × 480GB SSD)

• Most of the test containers have 8GB memory if not further noted

• For networking tests, two servers play roles as client and server

Figure: testing client, host, container disks

Syscall Coverage & Memory Footprint

• gVisor will call about 70 syscalls according to the code

• For kata, we collect statistics of the qemu processes in different cases
  • apt-get: 53
  • redis: 35
  • nginx: 36

• With (512MB memory)

• We enabled template for Kata

Figure: Memory Footprint (KB) for kata containers and gvisor; series: single container, average of 20

Boot time

• Boot busybox image

• Kata (different configurations), gvisor, runc

Figure: Boot time (seconds); series: cli_dis-factory, cli_en-factory, shimv2_dis-factory, shimv2_en-factory, gvisor, runc
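The slides do not say how boot time was measured. One possible way to collect comparable numbers is to time a short-lived busybox container under each runtime, as in the sketch below. It assumes Docker is the front end and that the runtimes are registered under the hypothetical names `runc`, `kata-runtime` and `runsc`. It times the whole `docker run` round trip, not the runtime alone, and it does not cover the shim v2 configurations.

```python
import subprocess
import sys
import time


def time_command(cmd, rounds=5):
    """Run cmd `rounds` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(rounds):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def boot_command(runtime):
    """Build a `docker run` command that starts busybox and exits at once.

    Assumption: `runtime` is a name registered in the Docker daemon
    configuration (for example "runc", "kata-runtime" or "runsc").
    """
    return ["docker", "run", "--rm", "--runtime", runtime, "busybox", "true"]
```

Calling `time_command(boot_command("runsc"))` on a host with that runtime configured would give five samples to average. The same helper can time any other command line.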

CPU/Memory Performance

• Sysbench CPU and memory test

• Test 5 rounds with sysbench and take the average result

• CPU overhead is very limited for both kata and gVisor

• Random memory write: 2.5% (kata) vs 5.5% (gVisor) overhead compared to host

• Seq memory write: 8% (kata) vs 13% (gVisor) overhead compared to host
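The overhead percentages can be cross-checked against the throughput values on this slide's memory-write chart. The sketch below assumes the chart values are listed in the order host, kata, gvisor, and defines overhead as the relative throughput drop against the host.

```python
# Throughput in MB/s as read from the memory-write chart.
# Assumption: values appear in the order host, kata, gvisor.
SEQ_WRITE = {"host": 5246.84, "kata": 4868.7, "gvisor": 4561.58}
RAND_WRITE = {"host": 1917.63, "kata": 1870.29, "gvisor": 1815.2}


def overhead_pct(host, sandboxed):
    """Relative throughput loss compared to the host, in percent."""
    return (host - sandboxed) / host * 100.0


def overheads(series):
    """Map each non-host runtime to its overhead, rounded to one decimal."""
    host = series["host"]
    return {
        name: round(overhead_pct(host, value), 1)
        for name, value in series.items()
        if name != "host"
    }


print("seq write :", overheads(SEQ_WRITE))
print("rand write:", overheads(RAND_WRITE))
```

Under that assumption the results land close to the quoted figures: roughly 7% and 13% for sequential writes, and roughly 2.5% and 5.3% for random writes.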

CPU Test (sec), lower is better

                         host        kata        gvisor
CPU test (sec)           10.00048    10.0004     10.00066

Memory Write

                         host        kata        gvisor
avg seq write (MB/s)     5246.84     4868.7      4561.58
avg rand write (MB/s)    1917.63     1870.29     1815.2

IO performance (1)

• IO tests with FIO; gVisor didn’t complete the test

• Note: The “kata passthru fs” has multi-queue support (in the default kata guest kernel), which is not present in the host driver

Figure: Buffer IO (MiB/s) and Direct IO (MiB/s), each for 128k randread, 128k randwrite, 4k randread and 4k randwrite. Legends: host fs, runC, kata passthru (virtio-blk), kata passthru (virtio-scsi), kata 9p; host fs, runC, kata passthru fs

IO Performance (2)

• IO test with dd command
  • dd if=/dev/zero of=/test/fio-rand-read bs=4k count=25000000
  • dd if=/dev/zero of=/test/fio-rand-read bs=128k count=780000

• Note: pay attention to the cache of 9p
  • The DIO on a network FS like 9p only means no guest (client) page cache, but there may exist cache on the host (server) side

Figure: dd test (MB/s) at block sizes 128k and 4k
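The two dd invocations use very different block sizes but write nearly the same amount of data. The sketch below makes that arithmetic explicit. It assumes the `k` suffix means 1024 bytes, as GNU dd interprets it; `parse_size` and `total_bytes` are helper names introduced only for this illustration.

```python
def parse_size(text):
    """Convert a dd-style size such as '4k' or '128k' to bytes (k = 1024)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def total_bytes(bs, count):
    """Total bytes written by `dd bs=<bs> count=<count>`."""
    return parse_size(bs) * count


for bs, count in (("4k", 25_000_000), ("128k", 780_000)):
    written = total_bytes(bs, count)
    print(f"bs={bs} count={count}: {written} bytes = {written / 1024 ** 3:.1f} GiB")
```

Both runs write roughly 95 GiB, so the comparison between the 4k and 128k cases isolates the effect of block size rather than data volume.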

Legend (dd test chart): host fs, runC, kata passthru fs, kata 9p, gvisor 9p

Networking Performance

• Networking performance is tested on 10Gb Ethernet

• Default configurations are used

Figure: Network Throughput (Gbps) for host, docker container, kata and gvisor

Real-life case: Nginx

• Apache Bench version 2.3

• ab -n 50000 -c 100 http://10.100.143.131:8080/

• General result

                                                      runC        Kata*       gVisor
Concurrency Level                                     100         100         100
Time taken for tests (seconds)                        3.455       3.439       161.338
Complete requests                                     50000       50000       50000
Failed requests                                       0           0           0
Total transferred (bytes)                             42250000    42700000    42250000
HTML transferred (bytes)                              30600000    30600000    30600000
Requests per second [#/sec] (mean)                    14473.73    14541.18    309.91
Time per request [ms] (mean)                          6.909       6.877       322.677
Time per request [ms] (mean, across all concurrent)   0.069       0.069       3.227
Transfer rate [Kbytes/sec] received                   11943.65    12127.12    255.73

* The www-root of the kata test case is not located on 9pfs

Real-life case: Nginx (Cont)
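The summary rates in the General result table follow from three inputs: completed requests, concurrency level and elapsed time. The sketch below recomputes them from the "Time taken for tests" values. The function name `ab_summary` is introduced only for this illustration. Small differences from the table are expected, because the elapsed times printed there are rounded to milliseconds.

```python
REQUESTS = 50_000
CONCURRENCY = 100
# Elapsed seconds per runtime, taken from "Time taken for tests".
ELAPSED = {"runC": 3.455, "Kata": 3.439, "gVisor": 161.338}


def ab_summary(requests, concurrency, seconds):
    """Derive ab-style summary rates from the raw run parameters."""
    rps = requests / seconds
    return {
        "requests_per_second": rps,
        # Mean latency seen by each of the concurrent clients.
        "ms_per_request": concurrency * seconds * 1000.0 / requests,
        # Mean time per request across all concurrent clients.
        "ms_per_request_all": seconds * 1000.0 / requests,
    }


for name, seconds in ELAPSED.items():
    s = ab_summary(REQUESTS, CONCURRENCY, seconds)
    print(
        f"{name}: {s['requests_per_second']:.2f} req/s, "
        f"{s['ms_per_request']:.3f} ms, {s['ms_per_request_all']:.3f} ms"
    )
```

The elapsed times alone already show the gap: the gVisor run took about 47 times as long as the runC run for the same 50000 requests.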

• Detailed connection time (ms) and percentage

runc          min    mean   [+/-sd]   median   max
Connect       0      0      0.3       0        7
Processing    1      7      0.3       7        14
Waiting       1      7      0.3       7        8
Total         3      7      0.4       7        15

kata          min    mean   [+/-sd]   median   max
Connect       0      0      0.1       0        7
Processing    2      7      1         7        207
Waiting       2      7      1         7        207
Total         3      7      1         7        207

gvisor        min    mean   [+/-sd]   median   max
Connect       0      0      0.1       0        2
Processing    44     322    119.8     305      836
Waiting       21     322    119.9     304      836
Total         45     322    119.8     305      836

Percentage of the requests served within a certain time (ms)

          50%    66%    75%    80%    90%    95%    98%    99%    100%
runC      7      7      7      7      7      7      7      7      15
kata      7      7      7      7      7      8      8      8      207
gvisor    305    355    390    415    485    553    634    674    836

* The www-root of the kata test case is not located on 9pfs

Real-life case: Redis

Figure: Redis-benchmark (requests per second) for runC, kata and gvisor. Tests: PING_INLINE, PING_BULK, SET, GET, INCR, LPUSH, RPUSH, LPOP, RPOP, SADD, HSET, SPOP, LPUSH (needed to benchmark LRANGE), LRANGE_100 (first 100 elements), LRANGE_300 (first 300 elements), LRANGE_500 (first 450 elements), LRANGE_600 (first 600 elements), MSET (10 keys)

Real-life case: Tensorflow (CPU)

Note: this is a CPU test (No GPU)

Model          resnet50
Dataset        imagenet (synthetic)
Mode           training
SingleSess     False
Batch size     32 global, 32 per device
Num batches    100
Num epochs     0
Devices        ['/cpu:0']
Data format    NHWC
Optimizer      sgd

Figure: TensorFlow 1.10 CNN benchmarks (images/sec) for runc, kata, gvisor-kvm and gvisor-ptrace

Summary

• Both kata and gVisor
  • Small attack interface
  • CPU performance identical to host and runC

• On Kata Containers
  • Some suggestions: Passthru fs, Factory and/or shimv2, etc.
  • Good performance in most cases with proper configuration

• On gVisor
  • Fast boot performance and low memory footprint
  • Still needs to improve compatibility for general usage

• Further tests
  • Fine-tuned benchmarks, vhost-user/dpdk etc., NUMA-specific tests, nested virtualization tests
  • Different kinds of networking tests
  • More real-life cases

The Future of the Projects

Kata Containers

• Kata is on the way to becoming more efficient
  • The shim v2 shown in the benchmarks is going to be merged soon
  • The nemu will reduce the VMM overhead
  • Multiple networking improvements are under development or testing

• And more featureful
  • Better GPU and other accelerator support

• Contributions are welcome

gVisor and Similar Projects

• gVisor shows real lightweight secure sandboxing tech

• gVisor needs more work on
  • More efficient syscall interception and processing
  • Better compatibility for more applications

• Another project, linuxd, is based on the Linux kernel and does a similar job
  • Lai, Jiangshan, Containerize Linux Kernel, Open Source Summit 2018, Vancouver (https://schd.ws/hosted_files/ossna18/db/Containerize%20Linux%20Kernel.pdf)

Q&A

Contributions are Welcome