arXiv:2102.06972v1 [cs.CR] 13 Feb 2021

BPFCONTAIN: Fixing the Soft Underbelly of Container Security

William Findlay, David Barrera, Anil Somayaji
Carleton University

Abstract

Linux containers currently provide limited isolation guarantees. While containers separate namespaces and partition resources, the patchwork of mechanisms used to ensure separation cannot guarantee consistent security semantics. Even worse, attempts to ensure complete coverage result in a mishmash of policies that are difficult to understand or audit. Here we present BPFCONTAIN, a new container confinement mechanism designed to integrate with existing container management systems. BPFCONTAIN combines a simple yet flexible policy language with an eBPF-based implementation that allows for deployment on virtually any Linux system running a recent kernel. In this paper, we present BPFCONTAIN's policy language, describe its current implementation as integrated into Docker, and present benchmarks comparing it with current container confinement technologies.

1 Introduction

Linux containers have become the preferred unit of application management in the cloud, forming the foundation of Docker [11], Kubernetes [20], Snap [46], Flatpak [14], and others. By including just the binaries, libraries, and configuration files needed by an application, containers enable simplified deployment of vendor-packaged applications, rapid horizontal application scaling, and direct developer-to-production DevOps workflows, all without the overhead of hypervisor-based virtual machines (HVMs).

Many have recognized that containers currently offer weak isolation guarantees [25, 49, 50]. Weak container confinement is less of a risk when all containers on a given host are deployed by the same party. However, strong container isolation mitigates privilege escalation attacks and is a critical requirement in multi-tenancy environments where attacker-controlled containers co-exist with benign containers.

The key to improving container isolation lies in recognizing that isolation is not the same as virtualization. Virtualization can exhibit isolation characteristics as a side effect; however, such isolation is rarely enough on its own to serve as a security barrier. In networking, network address translation (NAT) virtualizes IP addresses so one IP address can be shared by an entire network of systems. While this many-to-one mapping provides a significant degree of isolation to systems behind the NAT, it is no substitute for a network firewall, especially one that is configured, e.g., to block outbound connections. Linux containers are virtualized using cgroups [5] and namespaces [18], while confinement is enforced through seccomp-bpf [29] and mandatory access control mechanisms such as SELinux [45] and AppArmor [10]. While these confinement mechanisms are powerful and flexible, container isolation was not their primary design goal, and currently they can only accomplish the task with complex policies that are difficult to write and audit.

To properly isolate containers, the kernel requires clear rules about what interactions are permissible, whether between containers or with the host OS. Much as firewall rules specify which packets may traverse a firewall, OSes need unambiguous, auditable rules to determine what kernel-level operations are and are not allowed based on container boundaries. Following Unix tradition, the Linux kernel provides only security mechanisms and abstractions for defining security policies, but leaves policies themselves to be managed in userspace. However, unlike processes, files, and network connections, the Linux kernel has no unified abstraction around containers.

We assert that the lack of strong isolation guarantees for containers arises from a semantic gap between the security mechanisms that currently exist in the Linux kernel and the policies that we wish to define to isolate containers. As long as the kernel does not implement container-level access control mechanisms, the result will be complex, circumventable container isolation.

The Linux kernel has recently gained an alternative way to implement security abstractions: eBPF. An extension of the Berkeley Packet Filter [33], Linux's eBPF now allows for complex monitoring and manipulation of kernel-level events. Thanks to implicit load-time verification of eBPF bytecode, eBPF provides strong performance, portability, and safety guarantees. More recent improvements to eBPF [9] have enabled interfacing with Linux Security Module (LSM) hooks, allowing eBPF code to implement new kernel-level security mechanisms.

This paper proposes BPFCONTAIN, a novel approach to container security under the Linux kernel, rectifying over-privileged and insecure containers. Leveraging eBPF, BPFCONTAIN uses runtime security instrumentation to implement container-aware policy enforcement and harden the host kernel against privilege escalation attacks mounted from within containers. Specifically, BPFCONTAIN attaches eBPF programs to LSM hooks and critical functions within the kernel to enforce per-container policy in kernelspace. Thanks to eBPF's dynamic instrumentation capabilities, this integration occurs at runtime and requires no modification or patching of the kernel. BPFCONTAIN addresses the container userspace/kernelspace semantic gap by defining a new YAML-based policy language for container confinement that allows for simple default-deny and default-allow policies, high-level semantically meaningful permissions, and (where necessary) fine-grained control over the LSM API, all at the level of containers.

While BPFCONTAIN can co-exist with seccomp-bpf, SELinux, and other existing Linux security mechanisms, it does not rely on them, and in fact, in the context of container isolation, BPFCONTAIN makes them redundant. It also has modest userspace requirements, consisting only of a small daemon and a control program, both of which are written in Rust. In summary, we make the following contributions:

• We present the design, implementation, and evaluation of BPFCONTAIN, a container-aware security enforcement mechanism for Linux. BPFCONTAIN is available (https://github.com/willfindlay/bpfcontain-rs) under a GPLv2 license, and installation requires only a 5.10 or newer Linux kernel. While there is past work on using eBPF for process sandboxing [13], BPFCONTAIN is both more general and more deployable, implementing mechanisms and a policy language specific to container confinement and built using tools such as BPF CO-RE and Rust that have much lower space and runtime overhead.

• We design a flexible policy language for confining containers that offers optional layers of granularity to meet a wide range of real-world container use cases. The policy language is YAML-based, and expressive enough that it can be used to confine individual system resources, yet simple enough that it affords ad-hoc confinement use cases.

• We discuss integrating BPFCONTAIN with existing container management frameworks without modifying their source code, offering significant security advantages over traditional approaches to container isolation and least privilege.

The rest of this paper proceeds as follows. Section 2 presents background and our motivation for implementing an eBPF-based container security mechanism. Section 3 presents an overview of BPFCONTAIN and discusses our threat model and design goals. Section 4 describes BPFCONTAIN's policy language. Section 5 discusses the design and implementation of its userspace components and enforcement engine. Section 6 presents an evaluation of BPFCONTAIN's security and performance. Section 7 covers related work and Section 8 discusses limitations and opportunities for future work. Section 9 concludes.

2 Background and Motivation

This section reviews current techniques for achieving virtualization, isolation, and least privilege on Linux containers. It also provides background on the classic and extended Berkeley Packet Filters (BPF and eBPF), and discusses the motivation behind a new container-focused security enforcement mechanism.

A container is a userspace representation of a set of processes that share the same virtualized view of files and system resources. Containers run directly on the host operating system and share the underlying OS kernel, and thus do not require a full guest operating system or hypervisor to provide the virtualized view of the system to applications (see Figure 2.1). To support security beyond basic process isolation, modern container runtimes rely on a variety of low-level Linux kernel facilities to enforce virtualization, isolation, and least privilege. Note that because these runtimes rely on a disparate set of mechanisms and corresponding policies, a failure in the enforcement or policy of any single mechanism exposes the container to attacks.

2.1 Container Security

Namespaces and Control Groups (cgroups). These mechanisms allow for further confinement of processes by restricting the system resources that a process or group of processes can access. Namespaces isolate access by providing a process group a private, virtualized naming of a class of resources, such as process IDs, filesystem mountpoints, and user IDs. Cgroups (control groups) limit available quantities of system resources, such as CPU, memory, and block device I/O. Namespaces and cgroups on their own cannot enforce fine-grained access policies (e.g., to resources within a namespace), and thus require additional mechanisms as described below.

POSIX Capabilities. These provide a finer-grained alter-
