KubeCon China 2018
Nabla containers: a new approach to container isolation
Brandon Lum, Ricardo Koller, Dan Williams, Sahil Suneja
IBM Research
https://nabla-containers.github.io
Containers are not securely isolated
- What exactly does this mean?
- Why are VMs considered secure, but not containers?
- How do we improve container isolation?
Overview
• Threat model: isolation
• Isolation through surface reduction
• Our approach: Nabla
• Measuring isolation
• Nabla vs. VMs?
What does it mean to be isolated?
• Co-located containers should not be able to access each other's data
• Scenarios:
  • Horizontal attacks from vulnerable services
  • Container-native multi-tenant cloud
[Figure: attacker container next to Service A, which holds a secret]
Container isolation reality
• Containers == namespaced processes → kernel exploits mostly work
  • Sep 2018: CVE-2018-14634
  • DirtyCOW (CVE-2016-5195)
  • Many more (see the CVE database); 2018: code execution (3), memory corruption (8)
• Horizontal attacks are possible via the shared privileged component (the kernel)
[Figure: attacker container exploiting the kernel via syscalls to reach Service A's secret]

DirtyCOW
• DirtyCOW exploit sketch (from https://dirtycow.ninja/):
  • mmap a read-only file as a private mapping
  • Create a thread that invokes madvise
  • Create a thread that read/writes the mapping via procfs (/proc/self/mem)
  • Triggers a race condition in kernel memory-management code

    // FROM: https://dirtycow.ninja/ (madviseThread and
    // procselfmemThread are defined in the original PoC)
    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
    printf("mmap %zx\n\n", (uintptr_t)map);
    /* You have to do it on two threads. */
    pthread_create(&pth1, NULL, madviseThread, argv[1]);     // madvise
    pthread_create(&pth2, NULL, procselfmemThread, argv[2]); // R/W procfs
    /* You have to wait for the threads to finish. */
    pthread_join(pth1, NULL);
    pthread_join(pth2, NULL);
    return 0;
Container isolation reality
[Figure: attacker and Service A containers sharing the host kernel]

Kernel footprint
• Exploits target vulnerable parts of the kernel via syscalls (> 300 syscalls)
• If we restrict the number of syscalls:
  → fewer reachable kernel functions
  → fewer potential vulnerabilities
  → fewer possible exploits
[Figure: application calling through the syscall interface into the kernel's FS stack and disk]

Docker default seccomp policy
• Docker's default seccomp policy (a whitelist) disables around 44 of the 300+ system calls, leaving ~280 allowed
• Generic seccomp policies are hard to craft so that they are actually secure
• Syscall profiling is mostly heuristics-based
[Figure: seccomp narrowing the syscall interface; greyed boxes are unreachable kernel functions]

Nabla
• Deterministic and generic seccomp policy
• Only 7 syscalls (down from the original 300+ syscall interface*)
• Uses LibOS techniques
[Figure: application + LibOS (FS included) above seccomp, which narrows the kernel interface to 7 syscalls]

Nabla
• Taking unikernel ideas and putting them into containers
• Using tools/technologies from the rumprun and Solo5 communities
• Modify the unikernel to work as a process
  • "Unikernels as Processes" (ACM SoCC '18)
    https://dl.acm.org/citation.cfm?id=3267845
Making and running a Nabla
• Build the application with a custom LibOS build process*: the result is a Nabla binary that needs only 7 syscalls instead of 300+
• The Nabla runtime, runnc, loads the Nabla binaries and sets up the seccomp profiles (as runc does for regular containers)
*current limitation of the build process; we are investigating ways to remove the custom build process

Demo
strace/ftrace measurements (lower is better)
• strace measures the syscalls invoked (across the > 300-syscall interface)
• ftrace measures the number of kernel-function "boxes" touched
[Figure: application calling through the syscall interface into the kernel's FS stack and disk]

ftrace measurements (lower is better)
• Kata Containers (VMs) vs. Nabla ftrace results
• What does this say about our isolation vs. VMs?
[Chart: kernel functions accessed — Kata Containers (VMs) vs. Nabla]
Have we surpassed VM isolation?
• We explored and contested this idea in our paper:
  "Say Goodbye to Virtualization for a Safer Cloud" (USENIX HotCloud 2018)
  https://www.usenix.org/conference/hotcloud18/presentation/williams
• Maybe… but several questions remain:
  • Implementation-specific comparisons? KVM vs. other hypervisors
  • A hardware-inclusive threat model (Spectre/Meltdown, etc.)
  • Other metrics
17 What’s Next?
• We want to engage the community:
• Development work for runnc/nabla-base-build/nabla-demo-apps • Remove need to rebuild nabla containers (Support for dynamic linking LibOS) • Create new images and more language support for applications
• Chime in on Improving Security Analysis/Metrics • https://github.com/nabla-containers/nabla-measurements
Thank You! https://nabla-containers.github.io
Brandon Lum (@lumjjb) – [email protected]
#NablaContainers
Backup
ftrace measurements (lower is better)
[Figure: application → > 300 syscalls → kernel FS stack → disk; measuring the number of kernel-function boxes touched]

Throughput (higher is better)
Demo
• Kubelet pulls the image from the image registry (OCI image spec)
• CRI → cri-containerd → containerd; other config comes from the podSpec (mounts, security, etc.)
• containerd runs the container via runnc as the container runtime, instead of runc (OCI runtime spec)
• Networking is set up via a CNI plugin
Inside a Nabla container
• Unmodified user code (e.g., Node.js, redis, nginx, etc.) and libc, as in the original container
• Rumprun library OS
  • Unmodified NetBSD code (TCP/IP, FS, …) + some glue
  • Runs on the thin Solo5 unikernel interface
• Nabla tender
  • Sets up the seccomp policy
  • Translates Solo5 calls to system calls

Backup: Containers vs. VMs
Overview
• Threat model: isolation
• What makes VMs isolated?
• Nabla: how do we get those isolation properties without overhead?
• Disclaimer: in this talk we are doing a 1:1 comparison; defense in depth is a valid discussion with a different set of trade-offs.
Containers vs. VMs
• Containers: processes on a shared host kernel; high-level interface (syscalls) — filesystem interface, socket interface, etc.
• VMs: a guest OS on a hypervisor (+ host kernel (root)); low-level interface (VT) — block device interface, TAP interface, etc.
[Figure: a compromised process attacks the host kernel directly; a compromised guest OS sits behind the hypervisor]
Containers vs. VMs
• A LOT more exploitable code in the VM infrastructure!
[Figure: container — FS in the host kernel behind the high-level interface; VM — FS in the guest, with a much larger infrastructure stack behind the low-level interface, down to the disk]

Lower level interface
→ Less code
→ Fewer vulnerabilities
→ Stronger isolation

Kernel functions accessed by applications
• Compared to standard containers: 5-6x fewer kernel functions accessed, 8-14x fewer syscalls
• About half the number of kernel functions accessed compared to VMs
[Chart: unique kernel functions accessed — process vs. ukvm vs. nabla — for nginx, nginx-large, node-express, redis-get, redis-set]

Accessible kernel functions under Nabla policy
• Trinity kernel fuzz tester used to try to access as much of the kernel as possible
• The Nabla policy reduces the amount of accessible kernel functions by 98%
[Chart: unique kernel functions reachable per syscall number — allowed vs. blocked under the Nabla policy]

Unikernel isolation comes from the interface
• Direct mapping between the 10 hypercalls and system call/resource pairs
• 6 for I/O: network at packet level, storage at block level
• vs. > 350 syscalls

  Hypercall   System call     Resource
  ---------   -------------   --------
  walltime    clock_gettime   —
  puts        write           stdout
  poll        ppoll           net_fd
  blkinfo     —               —
  blkwrite    pwrite64        blk_fd
  blkread     pread64         blk_fd
  netinfo     —               —
  netwrite    write           net_fd
  netread     read            net_fd
  halt        exit_group      —
SOCC

Implementation: nabla ∇
• Extended the Solo5 unikernel ecosystem and ukvm
• Prototype supports: MirageOS, IncludeOS, Rumprun
• https://github.com/solo5/solo5
Measuring isolation: common applications
• Code reachable through the interface is a metric for attack surface
• Used kernel ftrace
• Results: processes access 5-6x more unique kernel functions than Nabla; VMs access 2-3x more
[Chart: unique kernel functions accessed — process vs. ukvm vs. nabla — for nginx, nginx-large, node-express, redis-get, redis-set]
Measuring isolation: fuzz testing
• Used kernel ftrace
• Used the Trinity system call fuzzer to try to access more of the kernel
• Result: the Nabla policy reduces unique kernel functions by 98% over a "normal" process
[Chart: unique kernel functions reachable per syscall number — allowed vs. blocked under the Nabla policy]
Measuring performance: throughput
• Applications include: web servers, Python benchmarks, Redis, etc.
• Results: 101%-245% of ukvm's throughput
[Chart: normalized throughput — ukvm vs. nabla vs. QEMU/KVM — for py_tornado, py_chameleon, py_2to3, node_fib, node_express, mirage_HTTP, nginx, nginx_large, redis_get, redis_set, includeos_TCP, includeos_UDP; no-I/O and with-I/O groups]
38 Measuring performance: CPU utilization
120 100 • vmexits have an effect on 80 60 instructions per cycle (a) 40 CPU % 20 100 0 80 60
• Experiment with MirageOS (b) 40 20 VMexits/ms web server 0 1.5 • Results: 1 (c) • 12% reduction in cpu 0.5 nabla
IPC (ins/cycle) ukvm utilization over ukvm 0 0 5000 10000 15000 20000 Requests/sec
39 Measuring performance: startup time
• Startup time is important QEMU/KVM ukvm nabla process for serverless, NFV 30 30 30 750 QEMU/KVM 20 20 20 ukvm 500 500 nabla 10 10 10 process Hello world • Results: 250 Hello world 0 • Ukvm has 30-370% higher 0 0 0 0 200 200 200 latency than nabla 1500 1500 150 150 150 1000 1000 100 100 100 500 500 50 50 50 HTTP POST • Mostly due avoiding KVM HTTP Post 0 0 0 0 0 overheads 246810121416 246810121416 246810121416 246810121416 02468101214 Number of cores