Container isolation with gVisor

Abdellfetah SGHIOUAR

Strategic Cloud Engineer

Google Stockholm

@boredabdel

Containers do not contain (Dan Walsh, 2014)

● Year 2013: Docker Inc. released its container engine
  ○ Million downloads and about 8,000 Docker images that year
  ○ Now the technology has really taken off.
  ○ Many advantages: portability, visibility, repeatability...

● Google has been developing and using containers to manage our applications for more than a decade.
  ○ We launch over 4 billion containers per week on Borg.
● Security concerns remain
  ○ Security teams have to rework processes and tooling to implement security for containers.

● The last decade has seen a lot of work on isolation mechanisms
  ○ Namespaces
  ○ Cgroups
  ○ Users
  ○ Capabilities
  ○ Chroot
  ○ Seccomp
  ○ Linux Security Modules (LSM)...

Once upon a time: Run as root

● All devices accessible
● Host filesystem accessible
● All resources consumable
● Network reconfigurable
● Can perform any kernel call
● Can SIGKILL others...

[Diagram: /bin/sh running as root, PID 234]

Once upon a time: Run as root

● What can go wrong? (Everything)
  ○ Fat-finger errors, e.g. rm -rf /*
  ○ Software bugs
  ○ Vulnerable software or libraries
  ○ ...

Once upon a time: Run as unprivileged user

● Limited device access, including network devices.
● Limited filesystem access.
● Permissions for kernel calls are checked before execution.
● Limited ability to send signals.

[Diagram: /bin/sh running as user "one", PID 234, inside a User / Group boundary]

But what about setuid? Or access to raw sockets?

Drop Capabilities

● Container engines drop some capabilities.
  ○ SYS_MODULE, SYS_ADMIN
  ○ SYS_TIME, SYS_RESOURCE
  ○ NET_ADMIN, SYSLOG, …

[Diagram: a capabilities layer added around the User / Group boundary]

● Fewer capabilities == better isolation!
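To check which capabilities a process really holds, one option is to decode the CapEff bitmask that Linux exposes in /proc/<pid>/status. The sketch below is only an illustration and not part of the talk: the helper names are hypothetical, the bit numbers are taken from linux/capability.h, and the sample mask is assumed to be the effective set of a container started with Docker's default settings.

```python
# Bit numbers from linux/capability.h for the capabilities discussed here.
CAP_BITS = {
    "NET_ADMIN": 12,
    "NET_RAW": 13,
    "SYS_MODULE": 16,
    "SYS_ADMIN": 21,
    "SYS_RESOURCE": 24,
    "SYS_TIME": 25,
    "SYSLOG": 34,
}


def held_caps(mask: int) -> set:
    """Return the names in CAP_BITS whose bit is set in the mask."""
    return {name for name, bit in CAP_BITS.items() if (mask >> bit) & 1}


def read_cap_eff(status_path: str = "/proc/self/status") -> int:
    """Read the effective capability mask of a process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise ValueError("no CapEff line in " + status_path)


# Assumed sample: effective set of a default Docker container.
DOCKER_DEFAULT = 0x00000000A80425FB
print(sorted(held_caps(DOCKER_DEFAULT)))
```

Under that assumption, NET_RAW is the only one of these capabilities still held, which is why raw sockets deserve a second look even after the engine has dropped the rest.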

What about resource isolation?

Enter Cgroups

● Cgroups limit, account for, and isolate resource usage:
  ○ cpu - limits access to the CPU
  ○ cpuacct - accounts for CPU usage by cgroup
  ○ cpuset - assigns cores and memory nodes to a cgroup
  ○ devices - controls device access by cgroup
  ○ memory - limits and accounts for memory usage
  ○ and more
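A process can find out which cgroups it belongs to by reading /proc/self/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal parsing sketch; the type and function names are hypothetical and the sample lines are made up for illustration.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_cgroup_line(line: str) -> CgroupEntry:
    """Parse one 'hierarchy-ID:controller-list:cgroup-path' line."""
    hierarchy, controllers, path = line.rstrip("\n").split(":", 2)
    # cgroup v2 entries have an empty controller list ("0::/path").
    names = controllers.split(",") if controllers else []
    return CgroupEntry(int(hierarchy), names, path)


def read_cgroups(path: str = "/proc/self/cgroup") -> List[CgroupEntry]:
    """Read the cgroup membership of the current process (Linux only)."""
    with open(path) as cgroup_file:
        return [parse_cgroup_line(line) for line in cgroup_file if line.strip()]


# Illustrative sample lines, not taken from a real system.
for sample in ("4:cpu,cpuacct:/docker/abc123", "0::/user.slice"):
    print(parse_cgroup_line(sample))
```

Inside a container the path component usually names the container's own cgroup, which is where the limits for cpu, memory and the other controllers are attached.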

But a process can still see all processes, network interfaces, and mount points on the system!

Enter namespaces
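Namespace membership is visible from user space: every entry under /proc/<pid>/ns is a symlink whose target looks like net:[4026531992], and two processes share a namespace exactly when those IDs match. The sketch below reads them; the function names are hypothetical and the ID in the example is an arbitrary illustrative value.

```python
import os
import re
from typing import Dict, Tuple

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target such as 'net:[4026531992]' into type and ID."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: " + target)
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> Dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace ID (Linux only)."""
    ns_dir = os.path.join("/proc", pid, "ns")
    result = {}
    for entry in os.listdir(ns_dir):
        _, ns_id = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = ns_id
    return result


print(parse_ns_link("net:[4026531992]"))
```

Comparing namespaces_of() for a containerized process and for a host process shows which resources are actually separated and which are still shared.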

● Provide isolation for each namespace type
● Currently 7 different namespaces are supported:
  ○ Network, PID, mount, user, IPC, UTS, cgroup
  ○ More to come

Still, containers do not contain

● Privilege escalation
  ○ CVE-2016-5195 (Dirty COW)
  ○ CVE-2017-5753/5715/5754 (Spectre/Meltdown)

● A container has its own network interface but shares the host TCP/IP stack
  ○ “Each container also gets its own network stack”
  ○ CVE-2013-4348: a single malformed packet from a remote host can crash your kernel

Why?

● Still sharing the same kernel.
● Sharing the same device drivers.
● A container makes direct syscalls to the kernel.
● The Linux kernel represents a large attack surface...

In the quest for better isolation, we need to keep in mind:

● An image I pulled from a random corner of the world should not exploit my Linux box.
● Little or no work required from users.
  ○ Not overly restricted
  ○ No modification to the application
● Feels like a container
  ○ Fast startup
  ○ Cheap to run: low memory consumption

Isolation with VMs

● Each app could get its own kernel (guest kernel).

● Good isolation, compatibility

● High overhead, inflexible

Rule Based Access Control

● Seccomp, SELinux, AppArmor...

● Good isolation, compatibility, Linux Native

● Low overhead.

● Someone has to define rules for every single app --> Hard

● Hard to define rules for unknown apps
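To make that cost concrete, here is a sketch of an allowlist in the JSON layout used by Docker seccomp profiles (a defaultAction plus a list of syscall rules). The small evaluator is a hypothetical helper that only shows how such rules read; the kernel enforces real profiles as BPF filters, and the syscall list is an illustrative assumption that is far too short for any real application.

```python
import json


def build_profile(allowed_syscalls):
    """Build an allowlist profile: deny by default, allow the named syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def decide(profile, syscall_name):
    """Toy evaluator: the first matching rule wins, else the default action."""
    for rule in profile["syscalls"]:
        if syscall_name in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


profile = build_profile({"read", "write", "exit_group", "futex"})
print(json.dumps(profile, indent=2))
print(decide(profile, "read"))
print(decide(profile, "mount"))
```

Every syscall the application legitimately needs has to appear in the list, which is exactly the per-app effort the slide calls hard.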

Best of both worlds

● Key ingredients:
  ○ Independent kernels
  ○ Virtualization hardware is an important defensive layer
    ■ Clear privilege separation and state encapsulation

● Collaterals
  ○ Virtualized hardware interface
    ■ Inflexible
    ■ Obscure primitives (I/O ports, interrupts, exceptions)
  ○ The Linux kernel
    ■ One-size-fits-all --> More code == more bugs waiting to be exploited.
    ■ Monolithic (everything in the same address space)

Enter gVisor

● User mode mini-kernel

● Intercepts syscalls in user space (fast).
  ○ 211 syscalls implemented so far

● One kernel/container, low overhead.

● Secure by default, no need for SELinux or AppArmor complexity.

gVisor Architecture

● Two processes:
  ○ Sentry for syscalls and network
  ○ Gofer for filesystems

● Communicate over IPC (9P)
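The split between the two processes can be pictured with a toy model. This is only an illustration under loud assumptions: the class and method names are hypothetical, the "syscalls" are plain Python calls, and real gVisor runs the Sentry and the Gofer as separate processes talking 9P rather than as objects in one interpreter.

```python
import errno
import posixpath


class Gofer:
    """Stands in for the file proxy: the only component that sees files."""

    def __init__(self, files):
        self._files = dict(files)  # path -> contents, standing in for the host

    def read_file(self, path):
        # Normalize so that ".." cannot climb out of the root.
        path = posixpath.normpath("/" + path.lstrip("/"))
        if path not in self._files:
            return -errno.ENOENT, b""
        return 0, self._files[path]


class Sentry:
    """Stands in for the user-space kernel that receives application syscalls."""

    def __init__(self, gofer):
        self._gofer = gofer
        self._pid = 1
        # Only a subset of syscalls is implemented; the rest fail cleanly.
        self._table = {"getpid": self._getpid, "read_file": self._read_file}

    def syscall(self, name, *args):
        handler = self._table.get(name)
        if handler is None:
            return -errno.ENOSYS
        return handler(*args)

    def _getpid(self):
        return self._pid

    def _read_file(self, path):
        # The Sentry never opens files itself; it asks the Gofer.
        status, data = self._gofer.read_file(path)
        return status if status < 0 else data


sentry = Sentry(Gofer({"/etc/hostname": b"sandbox\n"}))
print(sentry.syscall("getpid"))
print(sentry.syscall("read_file", "/etc/hostname"))
print(sentry.syscall("read_file", "/etc/shadow"))
print(sentry.syscall("reboot"))
```

The point of the model is the shape of the call path: the application only ever reaches the Sentry, and file access is delegated to a second component with its own, narrower view.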

gVisor use cases

● What is it good for?
  ○ Small containers
  ○ High density
  ○ Fast startup
  ○ Untrusted containers

● What is it NOT good for?
  ○ Syscall-heavy workloads
  ○ Direct access to specialized hardware (passthrough access)

Learn more

● Join: https://groups.google.com/forum/#!forum/gvisor-users
● Get involved:
  ○ https://github.com/google/gvisor
  ○ Join sig-node for discussion on Slack

Thank you

Questions?

@boredabdel