Container isolation with gVisor Abdellfetah SGHIOUAR
Strategic Cloud Engineer
Google Stockholm
@boredabdel Containers do not contain Dan Walsh, 2014 ● Year 2013: Docker Inc. released its container engine ○ Million downloads and about 8,000 docker images that year ○ Now the technology has really taken off. ○ Many advantages: Portability, visibility, repeatability...
● Google has been developing and using containers to manage our applications for more than a decade. ○ Launch over 4 billion containers per week on Borg. ● Security concerns remain ○ Security teams have to rework on processes and tooling to implement security for containers.....
● The last decade has seen a lot of work on isolation mechanisms ○ Namespaces ○ Cgroups ○ Users ○ Capabilities ○ Chroot ○ Seccomp ○ Linux Security Modules (LSM)... Onces upon a time: Run as root
● All devices accessible ● Host filesystem accessible ● All resources consumable ● Network reconfigurable root 234 /bin/sh ● Can perform any kernel call ● Can SIGKILL others.... Onces upon a time: Run as root
● What can go wrong ? (everything) ○ Fat-fingers errors ? rm -rf /* ○ Software bugs root 234 /bin/sh ○ Vulnerable software or libraries ○ .... Onces upon a time: Run as unprivileged user
● Limited devices access including network device. ● Limited filesystem access. User / Group ● Permissions of kernel calls are checked before execute. ● Limited ability to send signals one 234 /bin/sh
But if setuid? Or access to raw sockets ? Drop Capabilities
capabilities ● Container engines drop some capabilities. ○ SYS_MODULE,SYS_ADMIN User / Group ○ SYS_TIME, SYS_RESOURCE ○ NET_ADMIN, SYS_LOG, … one 234 /bin/sh
● Fewer capabilities == better isolation!
What about resource isolation? Enter Cgroups cgroup
● Cgroup limits, accounts for, and isolates the capabilities resource usage: ○ cpu - limits access to the CPU User / Group ○ cpuacct - accounts cpu usage by cgroup ○ cpuset - assign cores & memory nodes to cgroup one 234 /bin/sh ○ devices - control device access by cgroup ○ memory - limits & accounts memory usage and more
But still can see all processes, network interfaces, mount points on the system! Enter namespaces
namespace
● Provide isolation for each namespace cgroup type ● Currently support 7 different capabilities namespaces: ○ Network, PID, mount, user, IPC, User / Group UTS, cgroup ○ More to come one 234 /bin/sh Still containers doesn't contain
● Privilege escalation ○ CVE-2016-5195 DirtyCOW. ○ CVE-2017-5753/5715/5754 Spectre/Meltdown.
● A container has it's own network interface but shares the host TCP/IP stack ○ “Each container also gets its own network stack” ○ CVE-2013-4348 A single malformed packet from remote can crash your kernel Why ?
● Still sharing the same kernel. ● Share same device drivers. ● A container makes direct syscalls to the kernel ● Linux kernel represents a large attack surface... In the quest for better isolation, we need tokeep in mind.
● An image I pulled from a random corner of the world should not exploit my Linux box. ● Little work or no work required from users. ○ Not overly restricted ○ No modification to the application ● Feels like a container ○ Fast startup ○ Cheap to run: low memory consumption Isolation with VM's
● Each app could get it's own Kernel (guest kernel).
● Good isolation, compatibility
● High overhead, inflexible Rule Based Access Control
● Seccomp, SELinux, AppArmor...
● Good isolation, compatibility, Linux Native
● Low overhead.
● Someone has to define rules for every single app --> Hard
● Hard to define rules for unknown apps Rule Based Access Control
● Seccomp, SELinux, AppArmor...
● Good isolation, compatibility, Linux Native
● Low overhead.
● Someone has to define rules for every single app --> Hard
● Hard to define rules for unknown apps Best of both world
● Key Ingredients: ○ Independent Kernels • ○ Virtualization hardware is an important defensive layer ■ Clear privilege separation and state encapsulation
● Collaterals ○ Virtualized hardware interface ■ Inflexible ■ Obscure primitives (I/O ports, interrupts, exceptions) ○ The Linux kernel ■ One-size-fit-all --> More code == More bugs waiting to be exploited. ■ Monolithic (everything in the same address space) Enter gVisor
● User mode mini-kernel
● Intercepts syscalls in user space (fast). ○ 211 SysCalls implemeted so far
● One kernel/container, low overhead.
● Secure by default, no need for SELinux, AppArmor complexity. gVisor Architecture
● Two processes: ○ Sentry for Syscalls and Network ○ Gofer for filesystems.
● Communicate over IPC (9P)
● One kernel/container, low overhead.
● Secure by default, no need for SELinux, AppArmor complexity. gVisor use cases
● What it is good for ? ○ Small containers. ○ High density. ○ Start fast. ○ Untrusted containers.
● What it's NOT good for ? ○ Syscall heavy workloads ○ Direct access to specialized hardware (passthrough access) Learn more
● Join: https://groups.google.com/forum/#!forum/gvisor-users ● Get involved: ○ https://github.com/google/gvisor ○ Join sig-node for discussion on Slack Thank you
Questions ?
@boredabdel