Kata Containers and Gvisor a Quantitative Comparison

& a Quantitative Comparison Xu Wang hyper.sh; Kata Containers Architecture Committee Fupan Li hyper.sh; Kata Containers Upstream Developer “All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.” ----David Wheeler Agenda • Secure Containers, or Linux ABI Oriented Virtualization • Architecture Comparison • Benchmarks • The Future of the Secure Container Projects What’s Kata Containers • A container runtime, like runC • Built w/ virtualization tech, like VM • Initiated by hyper.sh and Intel® • Hosted by OpenStack Foundation • Contributed by Huawei, Google, MSFT, etc. Kata Containers is Virtualized Container What’s gVisor • A container runtime, like runC, too • A user-space kernel, written in Go • implements a substantial portion of the Linux system surface • Developed in Google and was open-sourced in mid-2018 gVisor implements Linux by way of Linux. The Similarity • OCI compatible container runtime • Linux ABI for container applications Container App • Support OCI runtime spec (docker, k8s, zun…) Linux ABI • Extra isolation layer Handling BLACK BOX OCIInterface • Guest Kernel, or say, Linux* in Linux Linux ABI • Do not expose system interface of the host to the container apps Host OS • Goals General Architecture of Kata Containers and gVisor • Less overhead, better isolation Architecture Comparison Run Kata Containers • Employ existing VMM (or improved) and standard Linux kernel • Per-container shims and per-sandbox proxy are stand-alone binaries, or built-in as library (such as in containerd shim v2 implementation) • An agent in guest OS based on libcontainer runC-like IO-stub process shim cmdline for container Apps Container App shim proxy Kata Agent Kata runtime Container App No proxy when virt-serial Guest Linux Kernel vscok enabled or vsock VMM (e.g. Qemu) Host Kernel Run gVisor • All in one binary (runsc + Sentry +Gofer) • Sentry as a user-space kernel, and per-container process stub • Gofer for 9p IO runC-like 9p cmdline Container Sentry Gofer ContainerApp Sentry Gofer runsc App Syscall redirect Host Kernel Detailed Architecture – Kata • Classic hardware-assisted virtualization App App agent • Provide virtual hardware proc proc interface to Guest • 2 layers of isolation devices Drivers • Qemu kvm and CPU ring protection vhost-user • Many IO optimization ways Guest Kernel • Pass through kvm vhost • Vhost and vhost-user Host Kernel passthru Detailed Architecture – gVisor (1) • 2 layers of isolations • Minimal kernel and written in Golang • Safer but the compatibility… • Ring protection and small set of syscalls • Avoid access more buggy syscalls App proc sentry Gofer • File operations are proceeded in separated gofer process • Syscall interception by ptrace or kvm Syscall seccomp/ • The performance impactions… interception namespace Host Kernel Detailed Architecture – gVisor (2) • In gVisor-kvm: • Sentry is both host process and guest kernel App proc Sentry in • Ring0 lower in guest host • Upper in VM Gofer • In Host Sentry as guest kernel • Kvm is for higher syscall interception performance rather than isolation kvm: Syscall seccomp/ • No insolation between different modes interception namespace of sentry itself Host Kernel Expectations based on the Arch • Isolation: • Both include 2 isolation layer – better isolation than runC • Compatibility: • Kata should have better compatibility over gVisor • Performance: • Both should have little overhead on CPU/Memory • gVisor should have smaller memory footprint over kata, and may boot faster • gVisor may have some pain on syscall heavy (include IO heavy) workloads • Kata may have some IO optimization ways Benchmarks and Analytics Benchmarks & Setup • Functional Tests • Syscall coverage • Standard Benchmarks • Boot time • Memory footprint • CPU/Memory Performance • IO Performance • Networking Performance • Real-life cases • nginx, redis, tensorflow • gVisor didn’t complete a mysql test, and a jenkins test (pull and build) failed due to the ‘git clone’ failure Testing Setup • Testing Setup • Most of the Cases: server on packet.net Testing CPU: 4 Physical Cores @ 2.0 GHz (1 × E3-1578L) client Memory: 32GB DDR4-2400 ECC RAM Storage: 240 GB of SSD (1 × 240 GB) Network: 10 Gbps Network (2 × INTEL X710 NIC'S IN TLB) • Disk IO cases: a more powerful server with additional disks CPU: 24 Physical Cores @ 2.2 GHz (2 × E5-2650 V4) Memory: 256 GB of DDR4 ECC RAM (16 × 16 GB) container disks Storage: 2.8 TB of SSD (6 × 480GB SSD) • Most of the test containers have 8GB memory if not further noted host • For networking tests, two servers play roles as client and server Syscall Coverage & Memory Footprint • gVisor will call about 70 syscalls • With busybox (512MB memory) according to the code • We enabled template for Kata • For kata, we collect statistics of Memory Footprint (KB) the qemu processes in different 80000 70000 cases 60000 50000 • apt-get: 53 40000 • redis: 35 30000 20000 • nginx: 36 10000 0 kata containers gvisor single container average of 20 Boot time • Boot busybox image • Kata (different configurations), gvisor, runc Boot time (seconds) 2.5 2 cli_dis-factory 1.5 cli_en-factory shimv2_dis-factory 1 shimv2_en-factory gvisor 0.5 runc 0 1 CPU/Memory Performance • Sysbench CPU and memory test • Test 5 rounds with sysbench and get the average result • CPU overhead are very limited for both kata and gVisor • Random memory write: 2.5% vs 5.5% overhead comparing to host • Seq memory write: 8% vs 13% overhead comparing to host CPU Test (sec), Lower is better Memory Write 12 10.00048 10.0004 10.00066 6000 5246.84 4868.7 10 5000 4561.58 8 4000 6 3000 1917.63 1870.29 1815.2 4 2000 2 1000 0 0 host kata gvisor host kata gvisor avg seq write (MB/s) avg rand write (MB/s) IO performance (1) • IO tests with FIO, gVisor didn’t complete the test • Note: The “kata passthru fs” has multi-queue support (in the default kata guest kernel), which is not present in the host driver Buffer IO (MiB/s) Direct IO (MiB/s) 30000 1200 25000 1000 20000 800 15000 600 10000 400 5000 200 0 0 128k randread 128k randwrite 4k randread 4k randwrite 128k randread 128k randwrite 4k randread 4k randwrite host fs runC kata passthru(virtio-blk) kata passthru(virtio-scsi) kata 9p host fs runC kata passthru fs IO Performance (2) • IO test with dd command • dd if=/dev/zero of=/test/fio-rand-read bs=4k count=25000000 • dd if=/dev/zero of=/test/fio-rand-read bs=128k count=780000 • Note: pay attention to the cache of 9p • The DIO on a network FS like 9p only means no guest (client) page cache, but there may exist cache on the host (server) side dd test (MB/s) 800 700 600 500 400 300 200 100 0 128k 4k host fs runC kata passthru fs kata 9p gvisor 9p Networking Performance • The networking performance is tested on 10Gb Ethernet • In default configurations Network Throughput (Gbps) 10 9 8 7 6 5 4 3 2 1 0 1 host docker container kata gvisor Real-life case: Nginx • Apache Bench version 2.3 • ab -n 50000 -c 100 http://10.100.143.131:8080/ • General result runC Kata* gVisor Concurrency Level 100 100 100 Time taken for tests 3.455 3.439 161.338 seconDs Complete requests 50000 50000 50000 FaileD requests 0 0 0 Total transferred 42250000 42700000 42250000 bytes HTML transferreD 30600000 30600000 30600000 bytes Requests per seconD 14473.73 14541.18 309.91 [#/sec] (mean) Time per request 6.909 6.877 322.677 [ms] (mean) Time per request 0.069 0.069 3.227 [ms] (mean, across all concurrent requests) Transfer rate 11943.65 12127.12 255.73 [Kbytes/sec] receiveD * The www-root of the kata test case is not located on 9pfs Real-life case: Nginx (Cont) • Detailed Connection time(ms) and percentage Percentage of the requests served within a certain time runc min mean[+/-sd] median max (ms) Connect 0 0 0.3 0 7 Processing 1 7 0.3 7 14 900 836 Waiting 1 7 0.3 7 8 800 runC Total 3 7 0.4 7 15 kata 700 gvisor 674 kata min mean[+/-sd] median max 634 600 Connect 0 0 0.1 0 7 553 Processing 2 7 1 7 207 500 485 Waiting 2 7 1 7 207 415 400 390 Total 3 7 1 7 207 355 300 305 gvisor min mean[+/-sd] median max 207 200 Connect 0 0 0.1 0 2 Processing 44 322 119.8 305 836 100 7 7 7 7 7 8 8 8 Waiting 21 322 119.9 304 836 0 7 7 7 7 7 7 7 7 15 Total 45 322 119.8 305 836 50% 66% 75% 80% 90% 95% 98% 99% * The www-root of the kata test case is not located on 9pfs 100% Real-life case: Redis RedIs-benchmark (reQuests per second) 160000 140000 120000 100000 80000 60000 40000 20000 0 SET GET INCR LPOP RPOP SADD HSET SPOP LPUSH RPUSH PING_BULK PING_INLINE MSET (10 keys) LRANGE_100LRANGE_300 (fIrst 100 elements)LRANGE_500 (fIrst 300 elements)LRANGE_600 (fIrst 450 elements) (fIrst 600 elements) runC kata gvIsorLPUSH (needed to benchmark LRANGE) Real-life case: Tensorflow (CPU) Note: this is a CPU test (No GPU) TensorFlow 1.10 CNN benchmarks (images/sec) Model resnet50 1.6 Dataset imagenet (synthetic) 1.4 Mode training 1.2 SingleSess False Batch size 32 global 1 32 per device 0.8 Num batches 100 0.6 Num epochs 0 0.4 Devices ['/cpu:0'] 0.2 Data format NHWC Optimizer sgd 0 runc kata gvisor-kvm gvisor-ptrace Summary • Both kata and gVisor • Small attack interface • CPU performance identical to host and runC • On Kata Containers • Some suggestions: Passthru fs, Factory and/or shimv2, etc. • Good performance in most cases with proper configuration • On gVisor • Fast boot performance and low memory footprint • Still need improve the compatibility for general usage • Furthermore tests • Fine tuned benchmarks, vhost-user/dpdk etc.

Kata Containers and Gvisor a Quantitative Comparison

An Event-Driven Embedded Operating System for Miniature Robots

Benchmarking, Analysis, and Optimization of Serverless Function Snapshots

What Are the Problems with Embedded Linux?

Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform

Performance Study of Real-Time Operating Systems for Internet Of

Qualifikationsprofil #10309

Gvisor Is a Project to Restrict the Number of Syscalls That the Kernel and User Space Need to Communicate

Our Tas Top 11 Technologies of the Decade Tinyos and Nesc Hardware Evolu On

Surviving Software Dependencies

Firecracker: Lightweight Virtualization for Serverless Applications

Building for I.MX Mature Boards

Architectural Implications of Function-As-A-Service Computing