Kata Containers and gVisor: a Quantitative Comparison
Xu Wang, hyper.sh; Kata Containers Architecture Committee
Fupan Li, hyper.sh; Kata Containers Upstream Developer

"All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections." -- David Wheeler

Agenda
• Secure Containers, or Linux ABI Oriented Virtualization
• Architecture Comparison
• Benchmarks
• The Future of the Secure Container Projects

What's Kata Containers
• A container runtime, like runC
• Built with virtualization technology, like a VM
• Initiated by hyper.sh and Intel®
• Hosted by the OpenStack Foundation
• Contributions from Huawei, Google, Microsoft, etc.
• In short: Kata Containers is a virtualized container

What's gVisor
• A container runtime, like runC, too
• A user-space kernel, written in Go, that implements a substantial portion of the Linux system call surface
• Developed at Google and open-sourced in mid-2018
• In short: gVisor implements Linux by way of Linux

The Similarity
• OCI-compatible container runtime
  • Linux ABI for the container applications
  • Supports the OCI runtime spec (Docker, Kubernetes, Zun, ...)
• Extra isolation layer
  • A guest kernel, or say, Linux* in Linux
  • Does not expose the host's system interface to the container apps
• Goals: less overhead, better isolation
[Figure: general architecture of Kata Containers and gVisor -- the container app's Linux ABI is handled by a "black box" behind the OCI interface, which in turn talks to the host OS through its own Linux ABI]

Architecture Comparison

Run Kata Containers
• Employs an existing (or improved) VMM and a standard Linux kernel
• Per-container shims and a per-sandbox proxy run as stand-alone binaries, or are built in as a library (such as in the containerd shim v2 implementation)
• An agent in the guest OS, based on libcontainer
• No proxy is needed when vsock is enabled
[Figure: Kata runtime (runC-like command line) -> per-container shims (I/O stub processes for the container apps) and proxy -> VMM (e.g. QEMU) on the host kernel -> guest Linux kernel -> Kata Agent -> container apps, connected over virtio-serial or vsock]
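Both runtimes plug in wherever runC does. A minimal sketch of wiring them into Docker and selecting them per container; the runtime names and binary paths are illustrative and depend on how Kata and gVisor were installed:

    # /etc/docker/daemon.json (illustrative) -- register the extra OCI runtimes:
    #   {
    #     "runtimes": {
    #       "kata-runtime": { "path": "/usr/bin/kata-runtime" },
    #       "runsc":        { "path": "/usr/local/bin/runsc" }
    #     }
    #   }
    sudo systemctl restart docker

    # Same image, three isolation levels, chosen per container:
    docker run --rm busybox uname -r                          # runC: reports the host kernel
    docker run --rm --runtime=kata-runtime busybox uname -r   # Kata: reports the guest kernel
    docker run --rm --runtime=runsc busybox uname -r          # gVisor: reports the Sentry's kernel version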
Run gVisor
• All in one binary (runsc + Sentry + Gofer)
• Sentry acts as a user-space kernel, with a per-container process stub
• Gofer handles 9p I/O
[Figure: runsc (runC-like command line) starts a per-container Sentry and Gofer; the container app's syscalls are redirected to the Sentry, file I/O goes to the Gofer over 9p, and only a reduced set of syscalls reaches the host kernel]

Detailed Architecture – Kata
• Classic hardware-assisted virtualization
• Provides a virtual hardware interface to the guest
• 2 layers of isolation: QEMU/KVM and CPU ring protection
• Many I/O optimization options: device pass-through, vhost, and vhost-user
[Figure: apps and the agent run as processes on the guest kernel; devices and drivers are backed on the host kernel by KVM, vhost, vhost-user, or pass-through]

Detailed Architecture – gVisor (1)
• 2 layers of isolation
  • A minimal kernel written in Go: safer, but the compatibility...
  • Ring protection and a small set of host syscalls: avoids touching the more obscure (and potentially buggy) syscalls
• File operations are handled in a separate Gofer process
• Syscall interception by ptrace or KVM: the performance impact...
[Figure: app process -> Sentry (syscall interception) and Gofer, confined by seccomp and namespaces on the host kernel]

Detailed Architecture – gVisor (2)
• In gVisor-kvm:
  • The Sentry is both a host process and the guest kernel: it runs in ring 0 inside the VM and as an ordinary process on the host
  • KVM is used for faster syscall interception rather than for isolation
  • There is no isolation between the different modes of the Sentry itself
[Figure: app process on top of the Sentry acting as guest kernel inside the VM; the Sentry and Gofer remain confined by seccomp and namespaces on the host kernel]

Expectations based on the Architecture
• Isolation:
  • Both include 2 isolation layers -- better isolation than runC
• Compatibility:
  • Kata should have better compatibility than gVisor
• Performance:
  • Both should have little overhead on CPU/memory
  • gVisor should have a smaller memory footprint than Kata, and may boot faster
  • gVisor may have some pain on syscall-heavy (including I/O-heavy) workloads
  • Kata has more I/O optimization options

Benchmarks and Analytics

Benchmarks & Setup
• Functional tests
  • Syscall coverage
• Standard benchmarks
  • Boot time
  • Memory footprint
  • CPU/memory performance
  • IO performance
  • Networking performance
• Real-life cases
  • nginx, redis, tensorflow
  • gVisor didn't complete a MySQL test, and a Jenkins test (pull and build) failed due to a 'git clone' failure

Testing Setup
• Most of the cases: a server on packet.net
  • CPU: 4 physical cores @ 2.0 GHz (1 × E3-1578L)
  • Memory: 32 GB DDR4-2400 ECC RAM
  • Storage: 240 GB of SSD (1 × 240 GB)
  • Network: 10 Gbps network (2 × Intel X710 NICs in TLB)
• Disk IO cases: a more powerful server with additional disks
  • CPU: 24 physical cores @ 2.2 GHz (2 × E5-2650 v4)
  • Memory: 256 GB of DDR4 ECC RAM (16 × 16 GB)
  • Storage: 2.8 TB of SSD (6 × 480 GB SSD)
• Most of the test containers have 8 GB of memory if not noted otherwise
• For networking tests, two servers play the roles of client and server

Syscall Coverage & Memory Footprint
• Syscall coverage
  • gVisor calls about 70 host syscalls, according to the code
  • For Kata, we collected statistics from the QEMU processes in different cases: apt-get: 53, redis: 35, nginx: 36
• Memory footprint
  • Measured with busybox (512 MB memory); the VM template was enabled for Kata
[Chart: Memory Footprint (KB) -- kata containers vs gvisor, for a single container and the average of 20 containers]

Boot time
• Boot a busybox image
• Kata (different configurations: CLI and shim v2, factory disabled/enabled), gVisor, runC
[Chart: Boot time (seconds) -- cli_dis-factory, cli_en-factory, shimv2_dis-factory, shimv2_en-factory, gvisor, runc]
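The deck reports the boot-time and memory-footprint results in the charts above but not the measurement commands. A rough sketch, not the authors' exact method, of one way to take both measurements (container names, memory limit, and process selection are illustrative):

    # Cold-start latency of a trivial container under each runtime:
    time docker run --rm --runtime=kata-runtime busybox true
    time docker run --rm --runtime=runsc busybox true
    time docker run --rm busybox true                       # runC baseline

    # Per-container memory footprint: keep a busybox container alive and add up
    # the resident set size of its sandbox processes on the host.
    docker run -d --runtime=kata-runtime --name kata-bb -m 512m busybox sleep 3600
    ps -eo rss,comm | grep -i qemu                          # Kata: VMM RSS (plus shim/proxy processes)
    docker run -d --runtime=runsc --name gvisor-bb -m 512m busybox sleep 3600
    ps -eo rss,comm | grep runsc                            # gVisor: Sentry and Gofer RSS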
CPU/Memory Performance
• Sysbench CPU and memory tests
• Each test was run 5 rounds with sysbench and the results averaged
• CPU overhead is very limited for both Kata and gVisor
• Random memory write: 2.5% (Kata) vs 5.5% (gVisor) overhead compared to the host
• Sequential memory write: 8% (Kata) vs 13% (gVisor) overhead compared to the host

  CPU test (sec), lower is better:  host 10.00048   kata 10.0004   gvisor 10.00066

  Memory write (MB/s)      host       kata       gvisor
  avg seq write            5246.84    4868.7     4561.58
  avg rand write           1917.63    1870.29    1815.2

IO performance (1)
• IO tests with fio; gVisor didn't complete the test
• Note: the "kata passthru fs" case has multi-queue support (in the default Kata guest kernel), which is not present in the host driver
[Charts: buffered IO and direct IO (MiB/s) for 128k/4k randread and randwrite -- host fs, runC, kata passthru (virtio-blk), kata passthru (virtio-scsi), kata 9p]

IO Performance (2)
• IO test with the dd command
  • dd if=/dev/zero of=/test/fio-rand-read bs=4k count=25000000
  • dd if=/dev/zero of=/test/fio-rand-read bs=128k count=780000
• Note: pay attention to the caching behavior of 9p
  • Direct IO on a network filesystem like 9p only means no guest (client) page cache; there may still be a cache on the host (server) side
[Chart: dd test (MB/s) at 128k and 4k block sizes -- host fs, runC, kata passthru fs, kata 9p, gvisor 9p]

Networking Performance
• The networking performance was tested on 10 Gb Ethernet
• Default configurations
[Chart: network throughput (Gbps) -- host, docker container (runC), kata, gvisor]

Real-life case: Nginx
• Apache Bench version 2.3
• ab -n 50000 -c 100 http://10.100.143.131:8080/
• General result:

                                                   runC        Kata*       gVisor
  Concurrency level                                100         100         100
  Time taken for tests (seconds)                   3.455       3.439       161.338
  Complete requests                                50000       50000       50000
  Failed requests                                  0           0           0
  Total transferred (bytes)                        42250000    42700000    42250000
  HTML transferred (bytes)                         30600000    30600000    30600000
  Requests per second [#/sec] (mean)               14473.73    14541.18    309.91
  Time per request [ms] (mean)                     6.909       6.877       322.677
  Time per request [ms] (mean, across all
    concurrent requests)                           0.069       0.069       3.227
  Transfer rate [Kbytes/sec] received              11943.65    12127.12    255.73

* The www-root of the Kata test case is not located on 9pfs

Real-life case: Nginx (Cont)
• Detailed connection times (ms) and percentiles:

  runC         min    mean[+/-sd]    median    max
  Connect        0      0    0.3         0       7
  Processing     1      7    0.3         7      14
  Waiting        1      7    0.3         7       8
  Total          3      7    0.4         7      15

  Kata         min    mean[+/-sd]    median    max
  Connect        0      0    0.1         0       7
  Processing     2      7    1           7     207
  Waiting        2      7    1           7     207
  Total          3      7    1           7     207

  gVisor       min    mean[+/-sd]    median    max
  Connect        0      0    0.1         0       2
  Processing    44    322  119.8       305     836
  Waiting       21    322  119.9       304     836
  Total         45    322  119.8       305     836

[Chart: percentage of requests served within a certain time (ms), at the 50% to 100% percentiles, for runC, kata, and gvisor]
* The www-root of the Kata test case is not located on 9pfs

Real-life case: Redis
[Chart: redis-benchmark (requests per second) for runC, kata, and gvisor -- PING_INLINE, PING_BULK, SET, GET, INCR, LPUSH, RPUSH, LPOP, RPOP, SADD, HSET, SPOP, LPUSH (needed to benchmark LRANGE), LRANGE_100 (first 100 elements), LRANGE_300 (first 300 elements), LRANGE_500 (first 450 elements), LRANGE_600 (first 600 elements), MSET (10 keys)]
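The Redis chart above lists the operations exercised but not the invocation. A plausible redis-benchmark run against a server started under one of the runtimes; the request count, concurrency, and container name are illustrative and not taken from the deck:

    # Start Redis under the runtime being measured (repeat with --runtime=runsc, or omit for runC):
    docker run -d --runtime=kata-runtime --name redis-kata -p 6379:6379 redis

    # Drive it from the client machine; -t limits the run to the operations shown in the chart:
    redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -c 50 -q \
        -t ping,set,get,incr,lpush,rpush,lpop,rpop,sadd,hset,spop,lrange,mset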
Real-life case: TensorFlow (CPU)
• Note: this is a CPU test (no GPU)
• TensorFlow 1.10 CNN benchmarks (images/sec)
• Benchmark parameters:
  • Model: resnet50
  • Dataset: imagenet (synthetic)
  • Mode: training
  • SingleSess: False
  • Batch size: 32 global, 32 per device
  • Num batches: 100
  • Num epochs: 0
  • Devices: ['/cpu:0']
  • Data format: NHWC
  • Optimizer: sgd
[Chart: training throughput (images/sec) -- runc, kata, gvisor-kvm, gvisor-ptrace]

Summary
• Both Kata and gVisor
  • Small attack surface
  • CPU performance identical to the host and runC
• On Kata Containers
  • Some suggestions: passthru fs, factory (VM templating) and/or shim v2, etc.
  • Good performance in most cases with a proper configuration
• On gVisor
  • Fast boot and low memory footprint
  • Compatibility still needs improvement for general usage
• Further tests
  • Fine-tuned benchmarks, vhost-user/DPDK, etc.
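A hedged sketch of where the Kata-side suggestions in the summary are applied; the exact configuration keys and subcommands vary across Kata 1.x releases, so treat this as illustrative rather than authoritative:

    # Inspect the effective Kata configuration and host capabilities:
    kata-runtime kata-env

    # The suggested options live in Kata's configuration.toml (key names may differ by release):
    #   [hypervisor.qemu]
    #   block_device_driver = "virtio-scsi"   # pass block devices through instead of sharing over 9p
    #   [factory]
    #   enable_template = true                # boot new sandboxes from a pre-created VM template
    #
    # With templating enabled, the template can be pre-created up front
    # (if the factory subcommand is available in the installed build):
    sudo kata-runtime factory init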