Universidad de Los Andes

Master’s Thesis

Hardening Linux Processes: Extending Grsecurity to Integrate System Call Filters and Namespaces

David Derby Cardona

Advisor: Sandra Rueda Rodríguez
Jury: Rafael Gómez Díaz, Fabian Molina Molina

Facultad de Ingeniería, Departamento de Ingeniería de Sistemas y Computación
June 2016

Abstract

The area of Linux sandboxing has seen various developments in recent years with the introduction of containers and the ever present need to harden the security of applications. Two of the more prominent technologies that have been used when creating sandboxes are namespaces and system call filters. Whilst these technologies have been effective for creating sandboxes, they are limited in that they require a developer to integrate them into their software. This work proposes to use these two technologies to enforce the Principle of Least Privilege on every process on a system. The solution extends a grsecurity hardened Linux kernel and allows the user to define security policies for each process which permit them to behave as intended. The presented results demonstrate the effectiveness of the extended Linux kernel and its impact on performance. The results provide a basis that may be built upon to deliver a comprehensive solution that would be appealing for use in real world environments.

Contents

Abstract
Index of Figures
Index of Tables

1 Introduction

2 Context and Problem Description
  2.1 Linux
    2.1.1 Processes
    2.1.2 Permissions
    2.1.3 Authorisation
  2.2 Least privilege
  2.3 Limitations of existing solutions
    2.3.1 Type enforcement solutions
    2.3.2 Other solutions
      2.3.2.1 Android
  2.4 The problem
    2.4.1 General objective
    2.4.2 Specific objectives

3 Related Work
  3.1 Integrated kernel tools
  3.2 User space tools
  3.3 Other isolation work
  3.4 Access control technologies
  3.5 Summary

4 Proposal
  4.1 Design
    4.1.1 Constrained access
    4.1.2 Enforcing constrained execution
    4.1.3 Default policies
      4.1.3.1 IPC namespaces
      4.1.3.2 Network namespaces
      4.1.3.3 Mount namespaces
      4.1.3.4 PID namespaces
      4.1.3.5 User namespaces
      4.1.3.6 UTS namespaces
      4.1.3.7 Secure computing (seccomp)
    4.1.4 Policy creation
  4.2 Limitations

5 Implementation
  5.1 Components
    5.1.1 Grsecurity
      5.1.1.1 Gradm
      5.1.1.2 Policy file structure
      5.1.1.3 Gradm special roles
  5.2 Extending grsecurity
    5.2.1 Extension to gradm
      5.2.1.1 Implementation of new policy types
    5.2.2 Extension to the grsecurity hardened Linux kernel
      5.2.2.1 Implementation of namespaces
      5.2.2.2 Implementation of system call filters
  5.3 Evaluation
    5.3.1 System call filtering
    5.3.2 Namespaces
    5.3.3 Performance
  5.4 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future work
    6.2.1 Feature completeness
    6.2.2 Blacklists
    6.2.3 Namespace bind mounts
    6.2.4 Further namespace customisation
    6.2.5 Further seccomp customisation
    6.2.6 Policy file separation
    6.2.7 Learning mode
    6.2.8 Code refactoring

A Gradm patch

B Kernel patch

Bibliography

List of Figures

2.1 Execution of a command in Linux
2.2 Access control representation
2.3 The Android execution environment

4.1 Creation of constrained processes

5.1 Structure of the grsecurity policy file
5.2 Flex code to match namespace and system call filter rules
5.3 Bison code to handle tokens
5.4 Example of a policy in the policy file
5.5 Sequence of interaction of grsecurity’s RBAC system
5.6 The sequence of interaction between a process, execve() and stored policies
5.7 The C source code of the “rootshell” program
5.8 Demonstration of “rootshell” malware succeeding
5.9 Finding setuid programs with the find command
5.10 Demonstration of “rootshell” malware failing
5.11 Demonstration of a network namespace
5.12 Demonstration of an IPC namespace
5.13 Demonstration of a UTS namespace
5.14 Tests run on a standard grsecurity enhanced Linux kernel
5.15 Tests run on a grsecurity Linux kernel including the extension presented in this work

6.1 Policy rule for an IPC namespace with a bind mount
6.2 Policy rule for a network namespace with a virtual network link
6.3 Policy rule for a system call filter with a thread synchronisation flag

List of Tables

3.1 Isolation through virtualisation of operating systems
3.2 User space tools for sandboxing
3.3 Linux Security Modules
3.4 Other security facilities provided by mainline Linux
3.5 Out-of-tree kernel patches

Chapter 1

Introduction

The use of Linux based operating systems has grown in server, desktop/laptop and mobile deployments [1]. Linux has a significant share of the Internet’s web servers, as it is the most popular operating system among hosting companies. Eight out of ten of the most reliable company websites were hosted on Linux machines as reported by Netcraft in April 2016 [2]. The desktop and laptop market has seen growth with the Linux based ChromeOS, the operating system which runs on Chromebooks, recently outselling Apple’s range of Macs for the first time in the United States of America [3] [4]. In the mobile sector, Linux remains the dominant player with the Linux based Android operating system holding 80% market share [5]. The significant presence that Linux holds in the industry highlights the need to research new ways to keep these systems secure.

Malicious programs have frequently been developed for Windows systems; however, the increased popularity of other operating systems, including Linux, has led to an increase in attacks targeting them too. This rise shows that attackers are not only interested in targeting single clients but also the infrastructure that they use [6]. Although the volume of attacks targeting Linux is lower than those targeting Windows, the number of attacks against Linux has considerably increased, meaning that Linux system administrators have an increased responsibility to consider additional defence mechanisms to protect their systems.

Unix based systems traditionally use discretionary access control (DAC) as a means of restricting access to objects based on the identity of subjects and/or groups to which they belong [7]. When a program is executed, it uses default permission inheritance, which means that the created process will run with the same permissions as the user that started it. The process has the same visibility and access rights to the system as the user would have, thus each process exposes a surface that would be attractive to an attacker who may take advantage by exploiting a vulnerability in a program or by means of malware executed by that user. A user does not need to have the same system access rights as a privileged user such as root for an attacker to cause damage or to obtain confidential information.

Existing solutions which improve upon the traditional DAC restriction on computer programs include the use of: role based access control (RBAC), security modules or enhancements which provide mandatory access control (MAC), control of operating system privileges, operating system capabilities, memory page protections, integrity protection and a variety of methods to create a sandbox. These solutions often present their own set of challenges which impact their widespread adoption.


Recent additions to the Linux kernel provide new facilities that allow a software developer to isolate processes and restrict their behaviour. Given that these isolation and restriction techniques are relatively new, much of the large back catalogue of software already available for Linux does not yet take advantage of them. In order to do so, each software package would have to be analysed and then modified. A software developer may not necessarily take the time and effort to secure the software that they have already written or that they may write in future.

This work explores the security features provided by the Linux kernel and proposes an original solution that complements other methods of locking down the runtime environment of all Linux processes. It aims to help Linux distribution maintainers and system administrators to implement policies that restrict the environment of existing unmodified software. The proposed solution not only helps to prevent attacks that originate from vulnerabilities in authorised software, including zero-day threats, but can also prevent attacks from purpose-built malware.

The rest of the document is organised as follows: Chapter 2 presents the problem in depth; Chapter 3 reviews related work; Chapters 4 and 5 propose a solution that addresses the limitations of previous work; finally Chapter 6 analyses the proposed solution and its results.

Chapter 2

Context and Problem Description

A typical computer program that runs on Linux: is granted visibility to the global file system (relying on DAC to restrict access), is able to obtain a list of every process running on the system, has access to network devices, has access to a variety of interprocess communication (IPC) systems, and is capable of making any system call. This defies the Principle of Least Privilege, also called the Principle of Least Authority, which implies that every program must only be able to access the resources that are necessary for its legitimate purpose [8].

2.1 Linux

2.1.1 Processes

Like other Unix based operating systems, Linux assigns a unique user identifier (UID) to each user. It also handles groups of users, assigning a group identifier (GID) to each group. Each user belongs to a single primary group and may also be a member of secondary groups, thus each user has a unique UID, one primary GID, and zero or more secondary GIDs.

Linux assigns a unique process identifier (PID) and a set of credentials to each running process. These credentials include a real UID and a real GID which determine who owns the process. Other credentials include an effective user identifier (EUID) and an effective group identifier (EGID). The EUID and EGID are used by the kernel to determine the permissions that the process will have when accessing shared resources. Unlike other operating systems based on Unix, Linux also has credentials for file system user and group identifiers (FSUID and FSGID) which determine permissions for accessing files [9].

If a user runs a program with the setuid file mode bit set, the EUID of the process will change to be the UID of the owner of that file. Linux uses the setuid bit to handle temporary domain transitions; a process temporarily runs within a protection domain different from the owner’s domain, with a different set of permissions. Files, processes and other resources are assigned the UID and GID of the process that created them [10].

Figure 2.1 illustrates the sequence of events that takes place when a user runs a command in Linux. The user first runs a command (cmd) from the shell and a fork() system call is made to create a duplicate of the calling process, in this case the user’s shell. This duplicate process has the same UID and EUID as the initial shell but it has a new PID. The new process then makes an execve() system call to replace the program code with the code associated with the command. It is assumed that the program does not have the setuid file mode bit set, meaning that the EUID is that of the user which started the command. Linux may also use the clone() system call to create a new process in a similar manner to fork(). Unlike fork(), it allows the child process to share parts of its execution context with the calling process.

Figure 2.1: Execution of a command in standard Linux. The user shell forks a new process, with the same UID but a new PID, then the new process replaces the old code with the code corresponding to the invoked command.
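The sequence in Figure 2.1 can be observed from user space. The following is a minimal sketch in Python, assuming a Linux or other POSIX system; the function name run_command and the pipe used to report the child’s credentials are illustrative choices and are not taken from this work.

```python
import os
import sys


def run_command(argv):
    """Fork, report the child's credentials, then exec argv in the child.

    Returns (child_pid, child_uid, child_euid, exit_status).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # duplicate the calling process
    if pid == 0:
        # Child: same UID and EUID as the parent, but a new PID.
        os.close(read_fd)
        report = f"{os.getpid()} {os.getuid()} {os.geteuid()}"
        os.write(write_fd, report.encode())
        os.close(write_fd)
        try:
            # Replace the program code with that of the command.
            os.execvp(argv[0], argv)
        finally:
            # Only reached if execvp() failed.
            os._exit(127)
    # Parent: read the child's report, then wait for the command to finish.
    os.close(write_fd)
    report = os.read(read_fd, 128).decode()
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    child_pid, child_uid, child_euid = (int(x) for x in report.split())
    return child_pid, child_uid, child_euid, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_command([sys.executable, "-c", "pass"]))
```

The command run here has no setuid bit set, so the reported UID and EUID should match those of the caller while the PID differs.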

2.1.2 Permissions

Linux implements a simplified version of an access control list (ACL) to represent permissions assigned to resources. In this model, every resource has an associated list of the subjects that have access to it and the permissions of those subjects. Figure 2.2 illustrates this model.

Figure 2.2: Access control representation. The system keeps a list of subjects for every object (resource) with permissions to operate on that object.

The classic model of permissions in Unix and Linux does not allow for an ACL of individual users to be applied to a resource. Permissions are limited to three categories:

• The resource owner, represented by a UID.

• The resource group, represented by a GID. These permissions apply to all members of this group.

• The other category. These permissions apply to everyone else, that is to say everyone who is not the resource owner or a member of the resource group.

The permissions that can be applied to each category are read, write and execute, represented as r, w and x respectively. For example, a file with the permissions r-xr-xr-- authorises the owner to read and execute the file (the first group of three: r-x), authorises members of the file’s group to read and execute the file (the second group of three: r-x) and authorises the rest of the users in the system to only read the file (the last group of three: r--). When a process creates a new file it defines the file’s permissions, thus establishing who can access that file and how [10].
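The reading of a permission string such as r-xr-xr-- can be made concrete with a short sketch. The function name parse_permissions is hypothetical, and the sketch assumes the plain nine-character form used in the example above; special mode characters such as s or t are deliberately out of scope.

```python
def parse_permissions(text):
    """Split a nine-character permission string into three categories.

    Returns a dictionary mapping "owner", "group" and "other" to the set
    of granted permissions, each drawn from "r", "w" and "x".
    """
    if len(text) != 9:
        raise ValueError("expected nine characters, for example 'r-xr-xr--'")
    result = {}
    for index, category in enumerate(("owner", "group", "other")):
        triplet = text[3 * index:3 * index + 3]
        # Each position may only hold its own letter or a dash.
        for flag, expected in zip(triplet, "rwx"):
            if flag not in (expected, "-"):
                raise ValueError(f"unexpected character {flag!r} in {triplet!r}")
        result[category] = {flag for flag in triplet if flag != "-"}
    return result


if __name__ == "__main__":
    for category, granted in parse_permissions("r-xr-xr--").items():
        print(category, sorted(granted))
```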

2.1.3 Authorisation

Algorithm 1 shows pseudocode to evaluate whether a given user is authorised to act on a file.

Algorithm 1 Permission Evaluation
  if UID_user == UID_file then
    permissions ← permissionsPerCategory(owner)
  else if any GID_user == GID_file then
    permissions ← permissionsPerCategory(group)
  else
    permissions ← permissionsPerCategory(others)
  end if

  if requestedAction in permissions then
    authorise
  end if
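Algorithm 1 can be expressed as a small runnable sketch. The function names and the representation of a user as a UID plus a set of GIDs are assumptions made for illustration; the permission bit constants come from Python’s standard stat module. Special cases that the algorithm does not mention, such as the root user, are not modelled.

```python
import stat

# Permission bits for each category, taken from the standard stat module.
CATEGORY_BITS = {
    "owner": {"r": stat.S_IRUSR, "w": stat.S_IWUSR, "x": stat.S_IXUSR},
    "group": {"r": stat.S_IRGRP, "w": stat.S_IWGRP, "x": stat.S_IXGRP},
    "other": {"r": stat.S_IROTH, "w": stat.S_IWOTH, "x": stat.S_IXOTH},
}


def select_category(user_uid, user_gids, file_uid, file_gid):
    """Pick the single category whose permissions apply to this user."""
    if user_uid == file_uid:
        return "owner"
    if file_gid in user_gids:
        return "group"
    return "other"


def is_authorised(user_uid, user_gids, file_uid, file_gid, mode, action):
    """Return True if action ("r", "w" or "x") is permitted by mode."""
    category = select_category(user_uid, user_gids, file_uid, file_gid)
    return bool(mode & CATEGORY_BITS[category][action])


if __name__ == "__main__":
    # 0o554 corresponds to r-xr-xr--.
    print(is_authorised(1000, {1000}, 1000, 100, 0o554, "w"))
```

Only one category is ever consulted, so an owner who is denied an action is not granted it through the group or other permissions.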

Linux makes authorisation decisions based on the UIDs and GIDs of the entities requesting access (users or processes) and the targets of those requests (resources). This means that a Linux process runs within the same protection domain (i.e. with the same permissions) associated with the user that owns the process or with the user that owns the executable file, thus a new process will often run with more permissions than those that are actually needed.

2.2 Least privilege

The Principle of Least Privilege, a design principle for information protection mechanisms [11], states that every program and every privileged user of the system should operate using the least set of privileges necessary to complete the job [8]. Many operating systems, including Linux, do not adhere to this principle. Processes on Unix and Linux inherit the UID from the user that starts a process or from the file owner if the setuid file mode bit is set. Each process has the same visibility of, and access rights to, the entire system as its owner or the EUID in effect. In addition to this, traditionally there is no restriction on the system calls that a process is allowed to execute. Whilst there have been attempts to restrict which system calls a process may execute, none of these solutions enforce system call restrictions on the entire system.

These defects create an attack surface that is larger than necessary. The more of a system’s surface that is exposed, the more attack opportunities it presents and the more likely it is to be the target of an attack. System security may therefore be improved by reducing the attack surface [12].

A system’s attack surface may be described by three abstract dimensions: targets and enablers, channels and protocols, and access rights. Targets are the resources that an attacker aims to take control of and enablers are other resources that an attacker may use to execute the attack. Channels and protocols are the means of communication that an attacker uses to control resources. Access rights are what constrains the attacker’s ability to control resources [12]. The principle of least privilege addresses the access rights dimension of a system’s attack surface. By imposing access rights on resources that do not already implement them, a system’s attack surface can be reduced.

2.3 Limitations of existing solutions

A variety of solutions are available for Linux which aim to apply the principle of least privilege through type enforcement. Type enforcement gives priority to mandatory access control over discretionary access control [13]. Existing solutions, however, are limited in certain respects.

2.3.1 Type enforcement solutions

Existing type enforcement solutions have no notion of namespaces. Namespaces are a relatively new concept in Linux and have only recently been added to the mainline kernel. A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource [14]. Changes to the global resource are visible to other processes that are members of the namespace but are invisible to all other processes [14].

Another recent addition to the mainline Linux kernel that is not used by existing type enforcement solutions is SECure COMPuting with filters, commonly known as seccomp filters or seccomp-bpf. Seccomp filtering provides a way of restricting a process to a subset of the system calls that it may use [15]. There is no type enforcement solution at present which takes advantage of the benefits provided by seccomp filters.
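Namespace membership can be inspected from user space, which helps to illustrate what it means for processes to be members of the same namespace. The sketch below assumes a Linux system that exposes the /proc/<pid>/ns directory, where each entry is a symbolic link whose target identifies a namespace instance. The function names are hypothetical, and the sketch only reads information; it does not create namespaces.

```python
import os


def list_namespaces(pid="self"):
    """Map each namespace type of a process to its identifier.

    Reads the symbolic links under /proc/<pid>/ns and returns an empty
    dictionary when that directory is unavailable.
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in names:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable with the caller's privileges
    return namespaces


def share_namespace(pid_a, pid_b, kind):
    """Return True when both processes report the same namespace of a kind."""
    first = list_namespaces(pid_a).get(kind)
    second = list_namespaces(pid_b).get(kind)
    return first is not None and first == second


if __name__ == "__main__":
    for kind, identifier in list_namespaces().items():
        print(kind, identifier)
```

Two processes that report the same identifier for a given type, for example uts, see the same instance of that resource; a process placed in a new namespace would report a different identifier.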

2.3.2 Other solutions

Existing software solutions that do utilise namespaces or seccomp filters tend to be either focused on providing isolation for operating system level (OS level) virtualisation technologies [16] or on providing a sandbox for individual programs and applications [17] [18] [19] [20]. Process sandboxing is often implemented independently for each program [17] [18]. Using namespaces or seccomp filters to sandbox a process requires a software developer to implement these mechanisms in each of the programs that they write. As this practice is completely optional, there is no guarantee that a software developer will implement these security features, thus leaving many programs with broad access rights to system resources.

Solutions such as Firejail [19] and Mbox [20] allow a user to run any program they desire inside a sandbox by prefixing the command to run the program with another command that calls the sandbox program. Although these solutions are effective at ensuring the correct behaviour of any given program, they are user space programs which must be manually invoked by the user and therefore do not enforce the principle of least privilege. As a consequence, these solutions do not function automatically on a system-wide level and they will not mitigate attacks that originate from malware.

2.3.2.1 Android

Modern mobile devices frequently use an operating system capable of pre-emptive multitasking, meaning that they can handle multiple processes simultaneously. Android is one of these operating systems. It uses a Linux kernel [21] and is the most widely used operating system for mobile devices [5]. Unlike traditional Unix or Linux systems, mobile devices are typically used by only one user, the device’s owner. On a traditional system this would mean that all of the processes run by the device owner share the same UID. Android, however, runs each application with a different UID, meaning that each process has different access rights. Figure 2.3 illustrates the Android environment, where each application process runs with a different UID.

Figure 2.3: The Android execution environment. Each Android application process runs with a unique UID as if the processes were executed by different users.
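The idea of giving every application its own UID can be sketched as a simple allocator. This is a conceptual model only and not Android’s implementation; the class name, the package names used in the usage example and the starting UID value are assumptions chosen for illustration.

```python
FIRST_APP_UID = 10000  # assumed starting value, chosen for illustration


class UidAllocator:
    """Hand out one stable, unique UID per installed application."""

    def __init__(self, first_uid=FIRST_APP_UID):
        self._next_uid = first_uid
        self._uids = {}

    def uid_for(self, package):
        """Return the UID of a package, allocating one on first use."""
        if package not in self._uids:
            self._uids[package] = self._next_uid
            self._next_uid += 1
        return self._uids[package]


def owner_only_access(file_uid, process_uid):
    """DAC check for a file that only its owner may access."""
    return file_uid == process_uid


if __name__ == "__main__":
    allocator = UidAllocator()
    mail = allocator.uid_for("com.example.mail")  # hypothetical package
    game = allocator.uid_for("com.example.game")  # hypothetical package
    # The game cannot open a file that the mail application keeps private.
    print(owner_only_access(mail, game))
```

Because the two applications hold different UIDs, the ordinary DAC checks of the kernel are enough to keep their private files apart, even though both run on behalf of the same person.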

Android also has other mechanisms to control access to resources but this particular feature is of interest because it enables the system to assign different access rights to different applications. This is important because although all applications are run on behalf of the same user, each application may have been developed by a different software developer, has a different job, and should have access to different resources.

Although Android’s mechanism of running each application with a different UID is a novel approach to solve the problem, implementing this in traditional Linux systems may not be so straightforward. The software development model for Android is quite different from that of traditional Linux. Android has its own model where developers create applications which are submitted to an application store. This allows Google, the company that develops Android, to enforce certain requirements that software must meet before it can be released. This differs from traditional Linux software development, which has a much more open model. In this model, a developer may choose to develop software which could be a program or an application, there are no constraints set by a third party on what the software must or must not do, and the developer may choose to distribute it themselves. Whilst it is also possible to do this with Android, the user must enable an option to allow installation of applications from unknown sources and accept a disclaimer which states that the device will be more vulnerable to attacks [22] [23].

2.4 The problem

Many operating systems run many applications which may have been developed by different software developers or companies. These applications should run with only the access rights required for their legitimate purpose; however, this is not always the case. Although traditional Linux systems will not automatically launch applications with a unique UID as Android does, some applications will be configured with their own UID when they are installed. Furthermore, some system administrators will manually configure other applications to run with their own unique user identifiers. This behaviour, however, is not enforced and may leave a system vulnerable.

Recently developed technologies enable software developers to create sandboxes to enhance the security of their programs and applications. These technologies allow a software developer to control namespaces and system call filters. Tools that use these technologies, aimed at users, have also been created. These tools allow users to run any given program inside a sandbox; however, they run in user space, meaning that they do not automatically enforce the principle of least privilege system-wide.

Automatic enforcement of the principle of least privilege can only occur when implemented in the kernel. Existing security enhancements which have been implemented in the kernel do not make use of recently developed sandboxing technologies.

2.4.1 General objective

This work proposes the design and implementation of a security enhancement for Linux to coordinate several technologies and automatically start processes in policy defined constrained environments.

2.4.2 Specific objectives

• Evaluate the scope and limitations of existing security technologies for Linux and develop a solution that further enhances system security by creating a constrained environment for process execution.

• Extend the Linux kernel to enforce a policy that will start processes in constrained environments using namespaces and system call filters.

• Provide a solution that helps Linux system administrators to secure the systems that they maintain by enabling them to define policies for the kernel extension.

Chapter 3

Related Work

This chapter discusses the technologies considered to be the most pertinent to this proposal. An analysis of other related work is catalogued at the end of the chapter.

3.1 Integrated kernel tools

The Linux kernel provides a variety of security tools via Linux Security Module (LSM) hooks [24]. SELinux [25] is a commonly used module which provides a mechanism for supporting MAC security policies. SELinux allows fine-grained type enforcement policies which give the user a high level of control; however, these policies are difficult to maintain, which often leads to users granting broad rights [26].

Grsecurity [27] is a large patch for the Linux kernel which provides security enhancements that defend against a wide range of security threats through intelligent access control, memory corruption based exploit prevention and a host of other system hardening techniques that generally require no configuration. It is notable in that it is not implemented as an LSM [28].

Capsicum [29] is a cross platform operating system capabilities and sandbox framework that extends the standard POSIX API to provide sandboxed capability mode and capabilities kernel primitives. It also provides a user space sandbox API [26]. The Capsicum framework takes an approach that focuses on helping software developers to harden their programs and applications.

3.2 User space tools

Linux tools such as Firejail [19] and Mbox [20] also utilise kernel technologies to isolate or restrict applications inside a sandbox according to a given policy. These tools are good solutions to help users to provide a sandbox for a given application; however, they require the user to prefix the command they wish to run inside the sandbox with a command to invoke the sandbox technology. This places the onus on the user to take the initiative to sandbox each and every application they wish to run and does not provide any operating system integration, leaving the system vulnerable to malware attacks.

Systrace is a cross platform solution which limits an application’s access to the system by enforcing access policies for system calls [30]. Its implementation consists of a kernel patch and a user space tool. The Linux implementation of systrace predates the introduction of seccomp to the Linux kernel as a way of enabling system call policies. The user space tool is required to execute programs inside a sandbox. Systrace also provides a graphical user interface for generating policies interactively. As with other user space tools for sandboxing applications, it is limited in that it cannot automatically enforce system-wide MAC policies for all processes.

Oz is a sandboxing system for Linux that isolates applications [31]. It achieves isolation through the use of namespaces, seccomp filters, capabilities and Xpra. Xpra is a persistent remote display server and client for forwarding applications and desktop screens and its use in Oz is to provide isolation for X11 applications. Oz is designed to sandbox applications automatically. It is limited in that a user must first declare which applications will be sandboxed automatically. In other words, it only supports blacklisting and cannot be used to enforce the principle of least privilege.
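The difference between sandboxing only the programs that a user has listed and constraining every program by default can be shown with a small conceptual model. The function names, paths and profile names below are hypothetical and do not correspond to the configuration format of Oz or of any other tool.

```python
UNCONFINED = "unconfined"
MOST_RESTRICTED = "most-restricted"  # hypothetical system-wide default


def profile_opt_in(program, listed):
    """Opt-in model: only programs the user has listed are confined."""
    return listed.get(program, UNCONFINED)


def profile_default(program, policies, default=MOST_RESTRICTED):
    """System-wide model: unlisted programs fall back to a strict default."""
    return policies.get(program, default)


if __name__ == "__main__":
    rules = {"/usr/bin/browser": "browser-profile"}  # hypothetical rule
    unknown = "/tmp/downloaded-binary"               # hypothetical path
    print(profile_opt_in(unknown, rules))   # runs with full user rights
    print(profile_default(unknown, rules))  # still constrained
```

In the opt-in model a program that nobody anticipated, such as malware, runs unconfined, which is why such a model cannot by itself enforce least privilege for every process.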

3.3 Other isolation work

Qubes OS [32] is a complete operating system that uses the Xen hypervisor to run individual or groups of applications in separate virtual machines. This approach provides strong isolation; however, it is heavy on system resource usage due to its employment of full virtual machines per application or application group.

OS level virtualisation solutions such as LXC [33] provide a type of virtual machine called a container, which shares its kernel with that of the host operating system. The LXC project makes use of the Linux kernel’s namespace isolation features to ensure that its containers can function as isolated, unprivileged operating system instances [16].

OpenBSD offers a new mitigation mechanism called Pledge. Pledge allows software developers to filter system calls and create security policies for their programs with just a few lines of additional code. To facilitate the job of creating system call policies, Pledge has categorised all system calls into several groups which may be declared by a software developer for filtering [34] [35].
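The idea of declaring groups of system calls rather than individual calls can be modelled in a few lines. The sketch below is a conceptual illustration only: the group names and their members are hypothetical and are not OpenBSD’s actual definitions, and the check is performed in user space rather than by a kernel.

```python
# Hypothetical groups and members, chosen only for illustration; they are
# not OpenBSD's actual definitions.
SYSCALL_GROUPS = {
    "io": {"read", "write", "close"},
    "files_read": {"open", "stat"},
    "network": {"socket", "connect", "bind"},
}


def allowed_syscalls(declared_groups):
    """Expand the declared groups into the set of permitted system calls."""
    allowed = set()
    for group in declared_groups:
        allowed |= SYSCALL_GROUPS[group]  # KeyError for an unknown group
    return allowed


def is_permitted(syscall, declared_groups):
    """Return True if the system call belongs to a declared group."""
    return syscall in allowed_syscalls(declared_groups)


if __name__ == "__main__":
    declared = ["io", "files_read"]
    print(is_permitted("read", declared))
    print(is_permitted("socket", declared))
```

A program that declares only the groups it needs is denied everything else, so a short declaration stands in for a long list of individual system calls.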

3.4 Access control technologies

This section presents a tabular summary of the advantages and disadvantages of the technologies mentioned in the previous sections and also other related technologies.

3.5 Summary

The range of security enhancing technologies available for Linux is both broad and diverse. There is no universal solution that has been established and alternatives are constantly being researched and developed. The following chapter proposes to build upon some of these technologies to offer a solution which adds value to the range of technologies already available.

Qubes OS [32]
Description: A complete operating system that uses the Xen type 1 hypervisor to run individual or groups of applications in separate virtual machines.
Advantages:
· Full virtual machines managed by Xen provide strong isolation [36].
Disadvantages:
· Heavy on system resource usage due to its use of full virtual machines.
· Its implementation as a complete, independent operating system makes it difficult to adapt for integration into existing operating systems.

LXC [33]
Description: OS level virtualisation technology for Linux. Achieves isolation by utilising namespaces.
Advantages:
· Relatively lightweight as it shares the kernel with the host OS.
· Does not require a custom kernel, recent mainline kernels provide full support [37].
· Containers can be run as an unprivileged user [16].
Disadvantages:
· Application isolation is possible, however LXC is a solution for isolating complete operating systems.
· Poor documentation for configuring unprivileged containers.

Linux-VServer [38]
Description: OS level virtualisation technology for Linux.
Advantages:
· Offers an experimental kernel that includes grsecurity patches [38].
Disadvantages:
· Not provided by mainline Linux [38].
· Root privilege isolation is not guaranteed [39].

OpenVZ [40]
Description: OS level virtualisation technology for Linux.
Advantages:
· Offers checkpointing and live migration [41].
Disadvantages:
· Not provided by mainline Linux [42].

Virtuozzo [43]
Description: OS level and full virtualisation technology for Linux based on OpenVZ. Full virtualisation provided via a hypervisor.
Advantages:
· Offers enterprise features like integrated backup and P2V migration [44].
Disadvantages:
· Not provided by mainline Linux [42].
· Commercial/proprietary software licence [45].

systemd-nspawn [46]
Description: OS level virtualisation technology for systemd. Uses systemd to spawn namespace containers for debugging, testing and building.
Advantages:
· Integrated OS level virtualisation solution for Linux installations that use systemd.
Disadvantages:
· Requires systemd which may not be installed or available for a given Linux instance [47].

Table 3.1: Isolation through virtualisation of operating systems

Docker [48]
Description: Provides application containers to facilitate the deployment of applications.
Advantages:
· Integration with various infrastructure tools [49] [50] [51].
· Container images can be run without modification on multiple Linux distributions [52].
Disadvantages:
· Portability means that containers are large and must include all of the application’s dependencies [53] which may result in a large memory footprint.
· Applications must be repackaged as container images [53].
· Difficult to build container images for large applications [54].

Firejail [19]
Description: An SUID security sandbox program that reduces the risk of security breaches by restricting the running environment of untrusted applications using Linux namespaces, seccomp-bpf and POSIX capabilities.
Advantages:
· A general purpose sandbox that can easily be applied to any program.
· Can replace a user’s default command interpreter.
· It does not have the overhead of installing or executing an application inside an OS container.
Disadvantages:
· User space tool which must be manually invoked for each command in order to provide the sandbox. This means that there is no automatic enforcement of least privilege by the operating system.

Mbox [20]
Description: Prevents programs from modifying the host file system while giving them the impression that they are in fact making those modifications by providing a layered sandbox file system and by interposing on system calls with ptrace() and seccomp-bpf.
Advantages:
· A general purpose sandbox that can easily be applied to any program.
· It does not have the overhead of installing or executing an application inside an OS container.
· Allows an unprivileged user to install a software package [20].
Disadvantages:
· Requires user interaction at the end of program execution to examine changes in the sandbox file system and then selectively commit them back to the host file system.
· User space tool which must be manually invoked for each command in order to provide the sandbox. This means that there is no automatic enforcement of least privilege by the operating system.

Oz [31]
Description: A sandboxing system targeting everyday workstation applications. It uses namespaces, seccomp filters, capabilities and Xpra to achieve isolation.
Advantages:
· A general purpose sandbox that can easily be applied to any program.
· Provides automatic sandboxing for applications that the user has identified to be untrusted.
Disadvantages:
· User space tool which only supports blacklisting. This means that there is no automatic enforcement of least privilege by the operating system.

PIP [55] [56]
Description: An information flow system focused on usability to provide integrity protection. The system combines a userland approach with DAC to enforce security policies. It does not rely on the kernel.
Advantages:
· Fine-grained policies [56].
· Decomposed sandbox architecture [55].
· Techniques for inferring policies from runtime and profile data for untrusted processes [55].
· Spares users from making security-critical policy decisions [55].
Disadvantages:
· User space tool which must be manually invoked for each command in order to provide the sandbox. This means that there is no automatic enforcement of least privilege by the operating system.

Table 3.2: User space tools that provide sandboxes for programs and applications.

SELinux [57]
Description: LSM that provides a mechanism for supporting MAC policies.
Advantages:
· Allows fine-grained Type Enforcement policies [24] [26].
· Supports RBAC [24].
· Optionally supports multi-level security [24].
· Supports the concept of a “remote policy server” [58].
Disadvantages:
· Broad rights are often granted as fine-grained Type Enforcement policies are difficult to write and maintain [26].
· Requires that the file system supports “security labels” [59].
· LSMs are not stackable [28].

AppArmor [60]
Description: LSM that supplements the traditional Unix DAC model by providing MAC.
Advantages:
· Easier to learn and use compared to SELinux [61].
· File system agnostic [62].
Disadvantages:
· File system objects are identified by path name instead of inode meaning that policies can easily be bypassed by creating a new hard link [63].
· Limited set of operations on files [64].
· LSMs are not stackable [28].

Smack [65]
Description: LSM that protects data and process interaction from malicious manipulation using a set of custom MAC rules.
Advantages:
· Simplicity is its main design goal [66].
Disadvantages:
· LSMs are not stackable [28].

TOMOYO Linux [67]
Description: LSM that focuses on the behaviour of a system. Every process is created to achieve a purpose and is allowed to declare behaviours and resources needed to achieve their purpose.
Advantages:
· Can be used as a system analysis tool [67].
· Easy to use [67].
Disadvantages:
· Version 2.x has not yet been completely mainlined [68].
· LSMs are not stackable [28].

AKARI [69]
Description: LSM forked from TOMOYO Linux 1.x.
Advantages:
· Fewer features than TOMOYO Linux 1.x [68] which may be desirable for certain use cases.
Disadvantages:
· A few advanced networking operations are limited or unavailable for restriction [68].
· Restricting use of capabilities is not possible [68].
· LSMs are not stackable [28].

Yama [70]
Description: LSM that collects a number of system-wide DAC security protections that are not handled by the core kernel itself.
Advantages:
· Aims to prevent an attacker from attaching other running processes to a compromised application [70].
Disadvantages:
· Only ptrace() is restricted for now [70].
· LSMs are not stackable [28].

Linux Intrusion Detection System [71]
Description: LSM that implements MAC.
Advantages:
· Offers a port scan detector [72].
· Offers file protection, even from root [72].
· Offers process protection [72].
Disadvantages:
· Not provided by mainline Linux.
· Appears to be discontinued.
· LSMs are not stackable [28].

Table 3.3: Linux Security Modules

Linux namespaces [14]
Description: A global system resource abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.
Advantages:
· Strong process isolation alternative.
· Linux provides namespaces for Control Groups (cgroups), IPC, Network, Mount, PID, User and UTS [14].
Disadvantages:
· A Linux-only kernel feature.

seccomp [15]
Description: A Linux kernel feature that restricts which system calls a thread may make.
Advantages:
· Strong process confinement.
· The filtering extension allows a developer to filter system calls using a configurable policy implemented using Berkeley Packet Filter (BPF) rules [73].
· Time Of Check to Time Of Use (TOCTOU) attacks are impossible [15].
Disadvantages:
· It may be difficult to establish which system calls are required by a process.
· The programming interface may only be used from user space.

POSIX capabilities [74]
Description: An abstraction of a set of operating system operations as process privileges that can be granted or denied by a software developer.
Advantages:
· A set of privileges was defined to be portable across POSIX compliant operating systems.
· An extended set of OS specific privileges is available for Linux and other operating systems [74].
Disadvantages:
· The POSIX.1e draft standard was withdrawn [74] [75].
· Extended privileges are not portable [74].
· No default enforcement policy is available.
· Must be implemented by the software developer of each program.

Table 3.4: Other security facilities provided by mainline Linux

RSBAC [76]
Description: Access control framework that gives detailed access control information and can implement almost any access control model, e.g. as a runtime registered kernel module.
Advantages:
· Several well-known and new security models, like MAC, ACL and RC [77].
· Detailed control over individual user and program network accesses [77].
· Virtual User Management, in kernel and fully access controlled [77].
· On-access virus scanning with the Dazuko interface [77].
· Any combination of security models is possible [77].
· Easily extensible: write your own model for runtime registration [77].
· Powerful logging system which makes intrusion attempts easily detectable [77].
Disadvantages:
· Not provided by mainline Linux.

PaX [78]
Description: A patch for the Linux kernel that implements least privilege protections for memory pages.
Advantages:
· Offers executable space protection [79].
· Offers ASLR [80].
· Blocks attacks which introduce and execute arbitrary code [81].
· Blocks attacks which attempt to execute existing program code in the intended order with arbitrary data [81].
Disadvantages:
· Not provided by mainline Linux.
· Cannot block fundamental design flaws in either executable programs or in the kernel that allow an exploit to abuse supplied services.

grsecurity [27]
Description: An extensive security enhancement to the Linux kernel that defends against a wide range of security threats through intelligent access control, memory corruption based exploit prevention, and a host of other system hardening techniques that generally require no configuration.
Advantages:
· Provides protection against zero-day and other advanced threats [27].
· Mitigates shared-host/controller weaknesses [27].
· Not using LSM may provide various advantages [28] [82].
· Includes PaX patches [27] [81].
Disadvantages:
· Not provided by mainline Linux.
· Stable patches are restricted to paying customers only [83].

Openwall GNU/*/Linux [84]
Description: Small security-enhanced Linux distribution for servers, appliances, and virtual appliances.
Advantages:
· Proactive source code review for several classes of software vulnerabilities [84].
· Compatible with other Linux distributions as a patchset.
Disadvantages:
· Not provided by mainline Linux.

Systrace [85]
Description: A utility which limits an application's access to the system by enforcing access policies for system calls.
Advantages:
· Confines untrusted binary applications [30].
· Interactive policy generation with graphical user interface [30].
· Non-interactive policy enforcement [30].
· Remote monitoring and intrusion detection [30].
· Privilege elevation mode makes it possible to eliminate setuid binaries [30].
Disadvantages:
· Not provided by mainline Linux.
· User space tool which must be manually invoked for each command in order to provide the sandbox. This means that there is no automatic enforcement of least privilege by the operating system.

Capsicum [86]
Description: OS capabilities and sandbox framework that extends the POSIX API to provide new kernel primitives and a user space sandbox API.
Advantages:
· Fine-grained sandboxing available to the software developer provides strong process isolation [29].
Disadvantages:
· Must be implemented by the software developer of each program [29].
· When adapting existing software to run in capability mode, identifying capability requirements can be tricky [29].
· Only partially provided by mainline Linux.

NUSE [87]
Description: Network stack in USerspacE. A new project to move the Linux network stack into user space.
Advantages:
· Similar to micro-kernel design, the network stack is isolated from the Linux kernel.
Disadvantages:
· Not a complete solution for isolating processes or applications.
· Not a very mature technology.

Table 3.5: Out-of-tree kernel patches

Chapter 4

Proposal

This chapter examines how namespace and system call filtering technologies may be used to improve the overall security of a Linux system. The proposed solution overcomes the limitations of user space tools by initiating these technologies from kernel space in order to enforce MAC policies. This is achieved by extending the Linux kernel so that every process is forced to spawn automatically inside a sandbox that uses namespaces and system call filters. This observes the principle of least privilege in ways that have not been explored by other security enhancements. The sandbox uses namespaces to isolate the process, which constrains its view of the global system, and uses system call filters to restrict what it is able to do.

Rather than offering a replacement for existing kernel security enhancements, the proposed solution is intended to complement them by adding extra layers of security. The end-user of this solution is intended to be a Linux distribution maintainer or a system administrator. The level of access required for a process to be fully functional may be granted by applying policies that have been previously defined by the end-user. This proposal limits the scope of attacks that exploit vulnerabilities in authorised software, helps to prevent the use of unauthorised software that a user may have installed, and can also mitigate attacks that originate from malware.

4.1 Design

An operating system kernel provides system calls to allow user space programs to make kernel service requests. When a system call is executed, the kernel takes over execution to run code in kernel space, which has a higher privilege level than user space. In Linux operating systems, a process begins running a new program when either the execve() or execveat() system call is invoked from user space. The execve() system call is used for executing new programs. The execveat() system call is similar to execve() but allows a program to be executed relative to a directory file descriptor. Both system calls execute common code. The proposed kernel enhancement is achieved by modifying the common code that is called by the execve() and execveat() system calls.

The kernel plays a key role in running an operating system; it is therefore critically important that the design of this kernel extension does not introduce any vulnerabilities which may compromise the security of the system, thus negating the whole objective of providing a security enhancement. New code introduced should be lean and kept to a minimum. Figure 4.1 shows a high level view of constrained process creation.

Figure 4.1: Creation of constrained processes. (1) A process makes a request to the kernel to create a new process. (2) The modified kernel function uses a previously defined policy to create a process that will run within a constrained environment.

4.1.1 Constrained access

To build constrained processes, two resources are controlled: namespaces and system calls.

• The Linux kernel provides namespaces which wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource [14]. Changes to the global resource are visible to other processes that are members of the namespace but are invisible to processes that belong to other namespaces [14]. The Linux kernel currently provides seven different types of namespaces: Cgroup, IPC, Network, Mount, PID, User and UTS. Cgroup namespaces were introduced in Linux 4.6 and were not available when this solution was developed; they are therefore not covered by this thesis.

• A process must execute a system call to request particular services from the kernel. The kernel receives the call, changes execution mode from user mode to kernel mode, and runs the routine that handles the request. When the routine finishes the task, the kernel changes execution mode back to user mode and returns control to the user program. Filtering system calls can significantly restrict a process by limiting the actions that it is capable of performing, thus minimising the exposed kernel surface [15]. A process may filter incoming system calls using the filter mode of the kernel's seccomp facility.

Namespace and seccomp technologies have typically been used in containers and user space driven sandbox programs but their potential to be incorporated into MAC policy has yet to be explored. By tackling the problem inside the kernel, the principle of least privilege is extended to use these relatively new mechanisms to confine a process automatically inside a sandbox.

4.1.2 Enforcing constrained execution

This thesis proposes to modify the execve() system call to automatically execute new processes inside a sandbox with isolated namespaces and a limited set of system calls that may be used. By modifying execve(), the principle of least privilege may be enforced at the earliest opportunity, providing an effective way of hardening a Linux system that is not currently used by existing MAC security enhancements in the kernel.

Processes will need to be granted the access that they require in order to be fully functional. To achieve this, a policy file may be defined by the end-user. This policy file defines namespace and system call policies for each authorised program. The file will be parsed by a user space utility which passes the policies to the kernel through an interface, so that the modified execve() system call can apply MAC policy prior to invoking a program's main function. The policy for a given program will be identified by the filename argument of execve() and matched against an entry in the policy file. If a policy for the program pointed to by the filename is not found, or if the policy file is not accessible, then the modified execve() will enforce a default restricted policy. This will curtail the capabilities of any unauthorised software and mitigate a substantial number of attacks that originate from malware.

This solution should not be the only means that a user has to protect their system. There are already many well established methods for securing a Linux system which provide strong protection. It is not the aim of this solution to replace these methods, but to supplement them. For greatest flexibility, the proposed kernel modification should not be implemented as an LSM as this kernel feature does not support stacking of multiple modules.
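The lookup-with-fallback behaviour described above can be sketched in user space C. This is only an illustration of the matching rule, not the kernel code of this thesis: the `exec_policy` struct, its field names and the `find_policy()` helper are hypothetical, and the fallback is assumed to isolate the process in all six namespace types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/sched.h>   /* CLONE_NEW* constants */

/* Hypothetical in-memory policy entry; names are illustrative only. */
struct exec_policy {
    const char *filename;  /* program path as passed to execve() */
    uint32_t    ns_flags;  /* namespaces the process is started in */
};

/* Assumed restrictive fallback: a new namespace of every type. */
static const struct exec_policy default_policy = {
    NULL,
    CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWNS |
    CLONE_NEWPID | CLONE_NEWUSER | CLONE_NEWUTS
};

/*
 * Return the entry whose filename matches exactly, or the default
 * policy when there is no match or no policy table is available.
 */
const struct exec_policy *
find_policy(const struct exec_policy *table, size_t n, const char *filename)
{
    if (table != NULL && filename != NULL) {
        for (size_t i = 0; i < n; i++) {
            if (strcmp(table[i].filename, filename) == 0)
                return &table[i];
        }
    }
    return &default_policy;
}
```

A caller would pass the filename argument of execve() and apply whatever entry comes back, so an unknown program never runs with less isolation than the default.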

4.1.3 Default policies

In order to maximise the protection of a system and enforce the principle of least privilege, it is desired that every process will spawn inside a sandbox which will severely inhibit its capabilities unless an overriding policy is explicitly defined by the policy file. It is expected that enforcing such a policy will result in the breakage of many processes and will prevent their normal operation. This is a frequent challenge when enforcing mandatory access control. A user that encounters a program that has broken due to MAC policy may become frustrated and decide to completely disable MAC in order to make their program work, which defeats the purpose of enforcing the principle of least privilege. Whilst it is possible to set a default policy that completely neuters a process, such a strict policy may therefore be undesirable. There is no perfect solution.

What follows is a brief description of each of the six types of namespaces offered by Linux at the time of development, and an evaluation of their usage as part of a default policy for isolating processes. Likewise, the suitability of seccomp for use in default policy is also discussed.

4.1.3.1 IPC namespaces

IPC namespaces provide isolation for System V IPC objects and POSIX message queues [14]. System V IPC provides message queue, semaphore, and shared memory facilities. POSIX message queues provide similar functionality to System V message queues but have a distinct API. These are all mechanisms that allow processes to share data with one another. They operate on a system-wide level using object ownership, octal permissions and an identifier to control access. Any user may obtain information on all IPC objects on a system, regardless of whether they have permission to access the object, by using the ipcs command. A default policy that isolates processes inside separate IPC namespaces would eliminate this flaw and provide an extra layer of security.

4.1.3.2 Network namespaces

Network namespaces provide isolation of the system resources associated with networking. These resources include network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewalls, the /proc/net and /sys/class/net directories and port numbers associated with sockets [14]. Physical network devices can be allocated to exactly one network namespace or can be shared across network namespaces through the configuration of virtual network devices. Spawning a process automatically inside an unconfigured namespace will prevent it from having any network access. This is a sensible default policy as it means that network access must be declared for any program that requires it.

4.1.3.3 Mount namespaces

The mount namespace is relatively old compared to the other namespaces. It was first implemented in Linux 2.4.19 in 2002 and no other namespaces were planned at the time. Mount namespaces isolate the set of file system mount points meaning that processes in different mount namespaces can have different views of the file system hierarchy [14]. A process running in its own mount namespace inherits the mount points of its parent process but may add or remove mount points that are private to the namespace. The privacy that this provides makes mount namespaces a reasonable choice to include in a default policy that aims to enforce the principle of least privilege.

4.1.3.4 PID namespaces

PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID [88]. A typical process does not need to know about other processes running on the system, yet it is able to read the /proc file system and obtain information on all running processes. This information may be very useful to an attacker that wants to explore a system. This can be mitigated by using a default policy that forces every process to start in a new PID namespace.

4.1.3.5 User namespaces

User namespaces isolate security related identifiers and attributes, in particular, UIDs and GIDs, the root directory, keys and capabilities [89]. This allows a process inside a user namespace to run as UID 0 whilst having a different UID outside of its namespace. This gives it super user privileges inside the namespace, yet it is prevented from performing privileged operations outside the namespace. Some exploits have been known to break out of user namespaces. It is hoped that this will improve as the technology matures and bugs are ironed out [90]. Nevertheless, with an up to date kernel, user namespaces still provide an extra layer of security that must be broken. If the process is started by an unprivileged user, the default policy will run the process inside a new user namespace as the nobody user with a primary group of nogroup. If the process is started by the root user, the process will be started inside a new namespace with a UID of 0.

4.1.3.6 UTS namespaces

UNIX Time-Sharing System namespaces, commonly referred to as UTS namespaces, provide isolation for the hostname and NIS domainname system identifiers [14]. These are typically used in containers to allow each container to have unique system identifiers. A default policy that automatically starts a process inside a new UTS namespace would prevent an attacker from being able to change these identifiers in the global namespace.

4.1.3.7 Secure computing (seccomp)

The original implementation of seccomp did not support custom system call policies but allowed a process to use a fixed set of only four system calls. This operational mode, now known as strict mode, is still available in the seccomp kernel API. The only system calls that a calling thread in strict mode is permitted to make are read(), write(), _exit() and sigreturn(). Other system calls result in the delivery of a SIGKILL signal which will immediately kill the process.

The current implementation of seccomp includes an alternate mode, known as filter mode, which supports configurable system call policies using BPF rules. BPF originated as a means to provide a raw interface to the data link layer of the OSI model of computer networking: whenever a packet is received by an interface, all file descriptors listening on that interface apply a user defined filter. In Linux, BPF has been extended and now allows a user to define filters to control system calls using seccomp.

The proposed extension to grsecurity uses the filter operational mode to allow the end-user to use granular system call policies by defining a whitelist containing each of the system calls that a process is permitted to use. As with strict mode, any attempt to use a system call that has not been defined in the whitelist will use a SIGKILL signal to immediately kill the process. It is possible to specify alternate behaviour for filter mode which drops the system call but allows the process to continue instead of sending a SIGKILL signal. This may be specified in the definition of the socket filter of the BPF program; however, this behaviour falls outside of the scope of the presented solution. The structure of a seccomp BPF program and its filter is described in the seccomp documentation [15] [91].

Setting a default policy for filtering system calls is not an easy task due to the sheer volume of available system calls and the unpredictable nature of which ones a process will use. For this reason, system call filter policies will be left to be established by the end-user.
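To make the whitelist idea concrete, the following is a minimal sketch of how a classic BPF program for seccomp filter mode might be assembled from a list of permitted system call numbers. It is not the code of this thesis: `build_whitelist()` is a hypothetical helper, the architecture check against `AUDIT_ARCH_X86_64` is an assumption that follows from the x86-64 focus of the implementation, and the sketch only builds the instruction array without installing it.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

/*
 * Write a whitelist filter into out[] and return the number of
 * instructions used, or 0 if cap is too small. allowed[] holds the
 * permitted system call numbers (hypothetical helper).
 */
size_t build_whitelist(const uint32_t *allowed, size_t n,
                       struct sock_filter *out, size_t cap)
{
    size_t i = 0;

    if (cap < 2 * n + 5)
        return 0;

    /* Kill anything that is not using the x86-64 system call table. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   offsetof(struct seccomp_data, arch));
    out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                   AUDIT_ARCH_X86_64, 1, 0);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                   SECCOMP_RET_KILL);

    /* Load the system call number and test it against each entry. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   offsetof(struct seccomp_data, nr));
    for (size_t j = 0; j < n; j++) {
        /* On a match fall through to ALLOW, otherwise skip over it. */
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                       allowed[j], 0, 1);
        out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                       SECCOMP_RET_ALLOW);
    }

    /* Nothing matched: the call is not on the whitelist. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                   SECCOMP_RET_KILL);
    return i;
}
```

From user space, such an array would normally be wrapped in a `struct sock_fprog` and installed with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)` after setting `PR_SET_NO_NEW_PRIVS`. Replacing the final return value with another `SECCOMP_RET_*` action is how the alternate behaviours mentioned above would be selected.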

4.1.4 Policy creation

An end-user who has chosen to use this solution will want to define their own policies. This process should be relatively straightforward for namespaces as their behaviour is clearly defined. It should be fairly easy to figure out whether a process requires access to the network, requires access to the list of processes running on a system, requires access to IPC facilities and so forth. On the other hand, identifying which system calls a process may need to use can be quite unpredictable. Fortunately seccomp provides a feature to catch a failed system call and report it. This significantly simplifies the task of creating new system call policies and will therefore be incorporated into the solution.

4.2 Limitations

The solution can only enforce policy rules at the moment that a process is spawned. Although this solution provides better protection than traditional process execution, a process may not need the complete access to a particular resource that has been granted by a policy throughout its lifetime. It is therefore not recommended as a replacement for making sure that software has been properly sandboxed through the modification of source code.

The life cycle of a process may be generalised into two phases: the initialisation phase followed by the main loop phase. The resources that a process requires for each of these phases normally differ substantially and will often have further changes in requirements during the main loop phase. A software developer will be able to identify exactly when their software will or will not require a particular resource and will be able to code their software to grant or deny access to that resource at runtime in order to provide better security. This solution is purely intended to be a facility that assists the end-user to harden their systems.

Chapter 5

Implementation

To demonstrate this proposal, the solution will be built on top of an existing, well established security enhancement that permits users to define new policies. This minimises the amount of new code introduced and allows efforts to be focused on the core part of the proposed solution. To meet this requirement, grsecurity was chosen as the kernel security enhancement on which this solution would be built. It is one of the leading security enhancements for Linux that is not implemented as an LSM and provides security features that are not available in mainline kernel LSMs.

Although grsecurity is not the most popular security enhancement and is not available in mainline Linux, it is installed by default in security-oriented distributions of Linux based operating systems such as Alpine Linux and Subgraph OS [92] [93], whilst distributions like Debian and Gentoo provide facilities to install it [94] [95]. It provides protections for several classes of attacks that are not protected in mainline Linux. This means that grsecurity hardened kernels have not been affected by some known kernel vulnerabilities [96].

The solution takes into account the design aspects described in the previous chapter. The result is an extension to grsecurity which implements policies to control system call filters and namespaces. At present it implements system call filters and three types of namespaces: network, IPC and UTS. The initial implementation supports the x86-64 architecture only.

5.1 Components

This section describes the components used and how they have been modified to achieve the desired solution.

5.1.1 Grsecurity

Grsecurity is a comprehensive security enhancement for Linux. It adds several enhancements that are not included in the mainline Linux kernel and are missing from security modules such as SELinux and AppArmor [97]. It consists of memory page protections including PaX, an RBAC system, file system protections, kernel auditing, executable protections, network protections and physical protections for USB devices. The Kernel Self Protection Project aims to bring some of grsecurity’s protections to mainline Linux [98].


Grsecurity is available for download in several parts but this solution is only concerned with two of them. The main part is the kernel patch which may be applied to a range of kernels; the compatible range is defined in the patch’s filename. Kernel patches come in two series, stable and test. Since September 2015, the stable series is only made available to commercial customers. The proposed solution is built using the publicly available test series. The other part of grsecurity is a user space program called gradm which serves as an RBAC administration utility. This solution has been developed on top of Linux version 4.4.6 with the grsecurity patch named grsecurity-3.1-4.4.6-201604100830.patch and gradm version 3.1-201603152148.

5.1.1.1 Gradm

Gradm is an administration program for the grsecurity RBAC system. It is used in conjunction with a policy file which allows the end-user to define system security policies. The main tasks of gradm include authentication, parsing of the policy file and validation of policies, enabling and disabling of the RBAC system and enabling and disabling of learning mode. Gradm reads policies from the policy file, validates them and then passes them to the kernel.

The implementation of gradm uses the programs Flex and Bison to define the policy file structure. Flex is used to generate lexical analysers that create tokenised input data to define the lexicon of the policy file. Bison is a parser generator which uses the tokenised data generated from Flex to define the language of the policy file. Both of these tools generate C code which is integrated into the gradm program.

5.1.1.2 Policy file structure

Grsecurity’s RBAC system uses a policy file with rules that consist of role, subject and object abstractions:

• A role describes a traditional Unix user or group or a special role that is specific to grsecurity.

• A subject describes a directory, a program or a script.

• Objects define the resources that are to be applied to subjects. They may represent files, capabilities, resource restrictions, PaX flags or IP ACLs.

Roles and subjects may specify additional attributes. Roles, subjects and objects may also use modes to provide information pertinent to the rule. These rules establish the behaviour of a resource and who it applies to. A generalisation of the policy file structure is illustrated in Figure 5.1. Further information describing the policy file structure may be found in the official grsecurity documentation [99].

role
subject /
        /
        [+-]
        [+-]
subject /
...
role
...

Figure 5.1: Structure of the grsecurity policy file.

5.1.1.3 Gradm special roles

One example of a gradm special role is the admin role. If an end-user wishes to perform an administrative task which would be inhibited by the active policies, they may authenticate to the admin special role which requires a password. This will allow them to perform the administrative task without any restrictions whilst keeping the rest of the system protected by the active policies. An end-user may use gradm to remove themselves from the authenticated special role once the administrative task is complete.

5.2 Extending grsecurity

This section describes the extension to gradm implemented in this solution to recognise new rules that allow the end-user to define policies for system call filters and namespaces.

5.2.1 Extension to gradm

This extension to grsecurity augments the object abstraction of grsecurity policies by introducing two new object types for system call filters and namespaces. This enables the definition of new types of policy rules. Specific namespace and system call policies are defined as objects in a similar fashion to how capabilities and PaX flag policies are defined for a subject. A line begins with a plus (+) or a minus (-) character to indicate that the rule will be either granted or denied respectively.

In the case of namespaces, this character is followed by the “CLONE_” string to indicate that a namespace policy is being defined. Specific namespace policies are defined by the same namespace constants as used by the Linux kernel for the clone(), setns() and unshare() system calls. These are CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER and CLONE_NEWUTS [14]. Which constant is used for which namespace is fairly self explanatory with the exception of CLONE_NEWNS, which is the constant that is used for the mount namespace.
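As an illustration of the rule syntax just described, the sketch below maps a rule string such as `+CLONE_NEWNET` onto the kernel's flag value and folds it into a single mask. It is a stand-alone example and not the gradm code: `ns_table` and `apply_ns_rule()` are hypothetical names, and treating a minus rule as clearing the corresponding bit is an assumption about how a denied rule might be represented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/sched.h>   /* CLONE_NEW* constants */

/* Name/value pairs for the six namespace constants (illustrative). */
struct ns_entry {
    const char *name;
    uint32_t    flag;
};

static const struct ns_entry ns_table[] = {
    { "CLONE_NEWIPC",  CLONE_NEWIPC  },
    { "CLONE_NEWNET",  CLONE_NEWNET  },
    { "CLONE_NEWNS",   CLONE_NEWNS   },  /* mount namespace */
    { "CLONE_NEWPID",  CLONE_NEWPID  },
    { "CLONE_NEWUSER", CLONE_NEWUSER },
    { "CLONE_NEWUTS",  CLONE_NEWUTS  },
};

/*
 * Apply one rule to *mask: '+' sets the namespace bit, '-' clears it.
 * Returns 0 on success, -1 for a missing prefix or an unknown name.
 */
int apply_ns_rule(uint32_t *mask, const char *rule)
{
    if (rule == NULL || (rule[0] != '+' && rule[0] != '-'))
        return -1;

    for (size_t i = 0; i < sizeof ns_table / sizeof ns_table[0]; i++) {
        if (strcmp(rule + 1, ns_table[i].name) == 0) {
            if (rule[0] == '+')
                *mask |= ns_table[i].flag;
            else
                *mask &= ~ns_table[i].flag;
            return 0;
        }
    }
    return -1;
}
```

Because the mask uses the kernel's own bit values, the aggregated result could be handed to clone() or unshare() without any translation step.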

To define a system call policy, the string consists of the system call name followed by parentheses “()”. System calls also use the plus or minus prefix character at the start of the string to indicate whether the rule will be either granted or denied respectively.

To allow gradm to handle new types of policies for system call filters and namespaces, new Flex and Bison rules were added. To implement the recognition of these new rules, the lexicon must first be expanded by introducing new rules to gradm’s Flex code. The Flex code that matches the strings that represent namespace and system call filter rules is shown in Figure 5.2. The namespace and system call filter Flex rules return the tokens NS_NAME and SC_NAME respectively.

[+-]"CLONE_"[A-Z]+ {
    /* namespace rule, for example +CLONE_NEWNET */
    gradmlval.string = gr_strdup(yytext);
    return NS_NAME;
}

[+-][a-z][_a-z0-9]+\(\) {
    /* system call rule, for example +getcwd() */
    gradmlval.string = gr_strdup(yytext);
    return SC_NAME;
}

Figure 5.2: Flex code to match namespace and system call filter rules.
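To make the accepted rule syntax concrete, the following C sketch re-implements by hand the two patterns as they appear in the gradm.l patch of Appendix A. It is only an illustration and is not part of gradm: the names rule_kind and classify_rule() are hypothetical, and unlike a Flex scanner the function judges a complete string rather than a token within an input stream.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds mirroring NS_NAME and SC_NAME. */
enum rule_kind { RULE_INVALID, RULE_NAMESPACE, RULE_SYSCALL };

/*
 * Classify one complete policy rule according to the two patterns:
 *   [+-]"CLONE_"[A-Z]+        -> RULE_NAMESPACE
 *   [+-][a-z][_a-z0-9]+\(\)   -> RULE_SYSCALL
 */
enum rule_kind classify_rule(const char *s)
{
    assert(s != NULL);

    if (s[0] != '+' && s[0] != '-')
        return RULE_INVALID;
    s++;                                /* skip the grant/deny prefix */

    if (strncmp(s, "CLONE_", 6) == 0) {
        const char *p = s + 6;

        if (!isupper((unsigned char)*p))
            return RULE_INVALID;        /* [A-Z]+ needs at least one letter */
        while (isupper((unsigned char)*p))
            p++;
        return *p == '\0' ? RULE_NAMESPACE : RULE_INVALID;
    }

    if (islower((unsigned char)s[0])) {
        const char *p = s + 1;
        size_t n = 0;

        while (*p == '_' || islower((unsigned char)*p) ||
               isdigit((unsigned char)*p)) {
            p++;
            n++;
        }
        /* [_a-z0-9]+ needs at least one character before "()" */
        if (n >= 1 && strcmp(p, "()") == 0)
            return RULE_SYSCALL;
    }
    return RULE_INVALID;
}
```

Under these patterns a rule such as +CLONE_NEWNET is a namespace rule and +exit_group() is a system call rule, whereas a name without the leading plus or minus is rejected.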

The current language definition is quite simple as the current implementation does not support an object mode or any additional options that would drastically affect the language of namespace and system call filter policy definitions. When the parser generated by Bison matches a line with the NS_NAME or SC_NAME tokens, it calls the relevant function to add the relevant ACL object. The Bison code is shown in Figure 5.3. An example of a policy defined in the policy file is illustrated in Figure 5.4.

object_ns_label: NS_NAME
    {
        add_namespace_acl(current_subject, $1);
        free($1);
    }
    ;

object_sc_label: SC_NAME
    {
        add_systemcall_acl(current_subject, $1);
        free($1);
    }
    ;

Figure 5.3: Bison code to handle tokens.

role default G
subject /bin/pwd
+CLONE_NEWNS
+CLONE_NEWUSER
+CLONE_NEWPID
+CLONE_NEWNET
+CLONE_NEWIPC
+CLONE_NEWUTS
+brk()
+access()
+mmap()
+open()
+fstat()
+close()
+read()
+mprotect()
+arch_prctl()
+munmap()
+getcwd()
+write()
+exit_group()

Figure 5.4: An example of a policy defined in the policy file for the pwd command. This policy explicitly defines new namespaces of all six types and permits all of the system calls necessary for the correct functionality of pwd.

5.2.1.1 Implementation of new policy types

Two new source files have been added to gradm to handle the new types of policies, gradm_ns.c and gradm_sc.c. These files contain the add_namespace_acl() and add_systemcall_acl() functions respectively and are responsible for further parsing and validation of the policies. gradm_ns.c contains an array of structs which represents the six namespace constants available in Linux 4.4 and the value of each. Likewise, gradm_sc.c contains an array of structs which represents all of the system call names and system call numbers for the x86-64 architecture that are implemented in Linux 4.4. These arrays are used to compare the policy object names with the set of valid names. This completes the validation of namespace and system call policies.

A policy subject and all of the objects associated with it may be thought of as an ACL. Grsecurity stores these ACLs in memory as structs, one for each subject. Gradm defines this struct as proc_acl in the header file gradm_defs.h. The namespace and system call filter rules of a given subject are aggregated and stored in bit fields which form part of the proc_acl struct. The namespace bit field is stored as an unsigned 32-bit integer, which matches the representation of the bit mask used for the CLONE flags within the kernel. There are 326 system calls for the x86-64 architecture in Linux 4.4 and the number of system calls may grow with each release. To store this quantity of system calls in a bit field, an array of twelve unsigned 32-bit integers is used for a total of 384 bits. This leaves 58 spare bits, which should accommodate new system calls in future Linux versions for several years before the size of this array must be increased.

Gradm’s proc_acl struct has an analogous struct inside the kernel named acl_subject_label which is used to store ACLs for subjects in kernel space. This struct is defined in the include/linux/gracl.h header file of the grsecurity patched Linux source code. Gradm transfers the policy structs to the kernel via the device node /dev/grsec. Gradm’s sequence of interaction is illustrated in Figure 5.5. The source code of the extension to gradm is listed as a patch in Appendix A.
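The system call bit field described above can be sketched in a few lines of C. This is a minimal illustration, assuming the same layout of twelve 32-bit words; the type syscall_set and the function names are hypothetical and only loosely modelled on the macros in the Appendix A patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYSCALL_WORDS 12                     /* 12 x 32 bits = 384 bits */
#define SYSCALL_BITS  (SYSCALL_WORDS * 32)

/* Illustrative stand-in for the bit field kept in proc_acl. */
typedef struct {
    uint32_t syscall[SYSCALL_WORDS];
} syscall_set;

/* Word that holds the bit for system call number nr (nr / 32). */
static unsigned int word_index(unsigned int nr) { return nr >> 5; }

/* Mask selecting the bit for nr inside its word (nr % 32). */
static uint32_t bit_mask(unsigned int nr) { return 1U << (nr & 31); }

void syscall_set_clear(syscall_set *s)
{
    memset(s, 0, sizeof *s);
}

/* Mark system call number nr as permitted. */
void syscall_set_allow(syscall_set *s, unsigned int nr)
{
    assert(nr < SYSCALL_BITS);
    s->syscall[word_index(nr)] |= bit_mask(nr);
}

/* Return non-zero if system call number nr is permitted. */
int syscall_set_allowed(const syscall_set *s, unsigned int nr)
{
    assert(nr < SYSCALL_BITS);
    return (s->syscall[word_index(nr)] & bit_mask(nr)) != 0;
}
```

For example, getcwd() has the number 79 on x86-64, so allowing it sets bit 15 of the third word (79 = 2 x 32 + 15).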

5.2.2 Extension to the grsecurity hardened Linux kernel

The Linux kernel maintains a task structure for every running process on the system. This struct, defined as task_struct in include/linux/sched.h, holds all of the information that the kernel needs to maintain for a running process. In a grsecurity hardened kernel, this includes a pointer to the acl_subject_label struct which holds all of the ACL policies for a process.

When a program is executed, a call is made to the kernel using either execve() or execveat(). It is critical that the new code which enforces namespaces and system call filters upon processes is invoked after a program issues either of these system calls but before any further user space code is processed. This guarantees that the program is not vulnerable to time-of-check to time-of-use attacks. Figure 5.6 illustrates the sequence of interaction between a process, the Linux kernel and the policy file.

Both the execve() and execveat() system calls call the common function do_execveat_common(). This function has already been extended in a grsecurity enhanced kernel to invoke grsecurity specific code. One part of this code, inside the function gr_set_proc_label(), sets up grsecurity process resource restrictions according to policies that can be defined in the grsecurity policy file. It is at this point, right after the process resource restrictions function has returned, that this extension adds calls to two new functions which set up namespaces and system call filters for the process. These two new functions, named gr_set_proc_ns() and gr_set_proc_sc(), are detailed in the following sections.

5.2.2.1 Implementation of namespaces

The gr_set_proc_label() function was modified to call the new function gr_set_proc_ns(), which receives one argument: a pointer to the task_struct of the running process. The sys_unshare() kernel function is the function that is called when the unshare() system call is made from user space. This function is called from gr_set_proc_ns(), passing the namespace policy which is retrieved from the ACL in the task_struct. The use of CLONE_NEWNET, CLONE_NEWIPC and CLONE_NEWUTS requires the CAP_SYS_ADMIN capability. This would normally be granted before the call to sys_unshare() and removed when the function returns. This step is absent from the current implementation and will be added in a future version; in the meantime, the capability may be granted by defining the appropriate object in the grsecurity policy file. This is all that is needed for the implementation of network, IPC and UTS namespaces.
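The kernel code calls sys_unshare() directly, but the same effect can be approximated from user space with the unshare() library call. The sketch below is such a user-space analogue and not the kernel implementation; the macro SUPPORTED_NS_FLAGS and both function names are assumptions made for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes unshare() and the CLONE_NEW* flags */
#endif
#include <assert.h>
#include <sched.h>

/* Namespace types that the current implementation is reported to support. */
#define SUPPORTED_NS_FLAGS (CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS)

/* Keep only the supported namespace flags from a policy bit mask. */
int supported_ns_flags(int policy_mask)
{
    return policy_mask & SUPPORTED_NS_FLAGS;
}

/*
 * Move the calling process into new namespaces according to the policy.
 * Returns 0 on success or when there is nothing to do, and -1 with errno
 * set on failure (typically EPERM when CAP_SYS_ADMIN is missing).
 */
int enter_new_namespaces(int policy_mask)
{
    int flags = supported_ns_flags(policy_mask);

    if (flags == 0)
        return 0;
    return unshare(flags);
}
```

A call such as enter_new_namespaces(CLONE_NEWUTS) made by an unprivileged process is expected to fail, which mirrors the CAP_SYS_ADMIN requirement described above.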

5.2.2.2 Implementation of system call filters

When the namespaces have been set up and the gr_set_proc_ns() function returns, the modified gr_set_proc_label() function calls the new function gr_set_proc_sc(), which also receives a pointer to the task_struct of the running process as its only argument. The gr_set_proc_sc() function calls sys_prctl() to set the no_new_privs bit, which is needed in order to use the SECCOMP_SET_MODE_FILTER operation without the CAP_SYS_ADMIN capability. Once this is set, a new function install_sc_whitelist_filter() is called and the system call filter policy is passed to it. This function sets up a whitelist system call filter for the process. It builds an array of socket filters, each of which is one BPF instruction, that together form the BPF program explained in section 4.1.3.7. This array is constructed as follows:

• The first four socket filters of this array are used to validate the architecture which in this case is x86-64.

• The next socket filter examines the system call.

• Following this are two socket filters for each system call that is allowed to be run according to the policy.

• Finally, the last socket filter kills the process when it issues a system call that is not in the whitelist.

The BPF program is created using this array of socket filters and the length of the array. The function sys_seccomp() is called in order to operate on the seccomp state of the process. The sys_seccomp() function is the kernel function that is called when the seccomp() system call is made from user space. This function had to be modified in order to be usable from kernel space. In normal seccomp usage, the BPF program is created in user space. In order to get the BPF program from user space into kernel space, the kernel calls the copy_from_user() inline function, whose sole purpose is to copy data from a user space address to kernel space. As one might expect, copy_from_user() does not work when the BPF program already resides in kernel space. This solution includes a modification which bypasses this function call. This modification is all that is needed to make seccomp work when initiated from kernel space. The source code of the extension to the grsecurity enhanced kernel is listed as a patch in Appendix B.
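The layout of the socket filter array can be sketched in user space with the standard seccomp headers. This is not the kernel code from Appendix B: the function names build_whitelist() and install_whitelist() are hypothetical, and this sketch spends three instructions on the architecture check where the implementation described above uses four.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/*
 * Fill out[] with a whitelist filter for the n system call numbers in
 * allowed[]. Returns the number of instructions written, or 0 if out[]
 * (capacity cap) is too small.
 */
size_t build_whitelist(struct sock_filter *out, size_t cap,
                       const unsigned int *allowed, size_t n)
{
    size_t need = 3 + 1 + 2 * n + 1;
    size_t i, k = 0;

    assert(out != NULL);
    if (cap < need)
        return 0;

    /* Architecture check: kill unless the call was made as x86-64. */
    out[k++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   offsetof(struct seccomp_data, arch));
    out[k++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                   AUDIT_ARCH_X86_64, 1, 0);
    out[k++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

    /* Load the system call number for the comparisons that follow. */
    out[k++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   offsetof(struct seccomp_data, nr));

    /* Two instructions per permitted call: compare, then allow. */
    for (i = 0; i < n; i++) {
        out[k++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                       allowed[i], 0, 1);
        out[k++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                       SECCOMP_RET_ALLOW);
    }

    /* Default action: kill on any system call not matched above. */
    out[k++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);
    return k;
}

/* Install the filter on the calling process (user-space equivalent). */
int install_whitelist(struct sock_filter *insns, size_t len)
{
    struct sock_fprog prog = { .len = (unsigned short)len, .filter = insns };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Because every permitted system call adds two instructions, the filter length grows linearly with the size of the whitelist, and calls listed later in the policy are reached only after all earlier comparisons have failed.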

5.3 Evaluation

The implementation at present provides support for controlling system call filters and network, IPC and UTS namespaces on a per program basis.

5.3.1 System call filtering

The current implementation blocks all system calls by default, thus complying with the principle of least privilege. The end-user may use the policy file to specify a whitelist of permitted system calls per subject to allow each one to function correctly.

The process of determining which system calls may be required by a subject is not straightforward. Whilst it is possible to determine which system calls a process uses by using a system call tracing program such as strace or by using seccomp’s reporting facility, both of these methods report system calls only when they are executed. A running program does not necessarily traverse every path through its code, meaning that some system calls that are critical to the program’s correct behaviour may be omitted from this analysis.

There are often cases where correct functionality of a program is preferred to enforcement of the principle of least privilege, meaning that the option to allow blacklist filtering would be desirable. It is sometimes easier to identify which system calls may be dangerous or unnecessary for the correct behaviour of a program. For example, one may know that a program should not need to run with the privileges of a different user or group to those that the program was started with. In this case, the system calls setuid(), setgid(), setreuid(), setregid(), setresuid(), setresgid(), setfsuid() and setfsgid() may be deemed dangerous and would be specified in the blacklist. It is possible to mimic the functionality of a blacklist with the current implementation; however, that would mean that the end-user would have to specify every system call except for those that they want to omit for each subject. Given that there are over 300 system calls for Linux, this approach is not very practical.

A simple program has been written in C as an example of code that may be found in malware that may find its way onto a system. The program attempts to start a new shell as the super user. It relies on the file being owned by root and having the setuid file mode bit set. The source code is listed in Figure 5.7 and an example of how it is used is shown in Figure 5.8. A reasonable default policy may consider the setuid() system call to be dangerous and thus block it by default for all processes. Every program that legitimately needs setuid() would then have to be specified inside the policy file with a rule to allow this system call. Identifying which files may require the use of setuid() is not such a difficult task. The find command may be used to find all files on a system which have the setuid file mode bit set. This is demonstrated in Figure 5.9. Once the end-user has verified that each of these programs legitimately requires the use of setuid(), rules may be added to the policy file to grant the use of this system call for each of them. Gradm is then reloaded with the new policy file. Any program that has not been granted the use of setuid() will be killed as soon as it tries to execute this system call. Those processes which have been granted the use of setuid() continue to work as normal. This is illustrated in Figure 5.10.
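The blacklist emulation mentioned in this section amounts to inverting a set: every known system call is whitelisted except the denied ones. The following C sketch shows the idea on a bit field of twelve 32-bit words; the function names are hypothetical, and a tool that generates policy rules this way is an assumption of this example rather than a feature of the implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SYSCALL_MAX   326   /* x86-64 system calls in Linux 4.4, as stated */
#define SYSCALL_WORDS 12    /* 12 x 32 bits = 384 bits */

/*
 * Build a whitelist that contains every system call number below
 * SYSCALL_MAX except the n numbers listed in denied[].
 */
void whitelist_from_blacklist(uint32_t whitelist[SYSCALL_WORDS],
                              const unsigned int *denied, size_t n)
{
    unsigned int nr;
    size_t i;

    memset(whitelist, 0, SYSCALL_WORDS * sizeof whitelist[0]);

    /* Start by permitting every known system call. */
    for (nr = 0; nr < SYSCALL_MAX; nr++)
        whitelist[nr >> 5] |= 1U << (nr & 31);

    /* Then clear the bit of each blacklisted system call. */
    for (i = 0; i < n; i++) {
        assert(denied[i] < SYSCALL_MAX);
        whitelist[denied[i] >> 5] &= ~(1U << (denied[i] & 31));
    }
}

/* Count how many system calls the whitelist permits. */
unsigned int whitelist_count(const uint32_t whitelist[SYSCALL_WORDS])
{
    unsigned int nr, count = 0;

    for (nr = 0; nr < SYSCALL_WORDS * 32; nr++)
        if (whitelist[nr >> 5] & (1U << (nr & 31)))
            count++;
    return count;
}
```

Denying only setuid() and setgid(), numbers 105 and 106 on x86-64, would still leave 324 rules to be written by hand for each subject, which illustrates why native blacklist support is the more practical option.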

5.3.2 Namespaces

The current implementation provides basic support for network, IPC and UTS namespaces. Newly executed programs spawn processes inside new namespaces. Forked or cloned processes are spawned inside the same namespaces as their parents. Figure 5.11 demonstrates the effect of network namespaces, Figure 5.12 demonstrates the effect of IPC namespaces and Figure 5.13 demonstrates the effect of UTS namespaces.

5.3.3 Performance

Due to time constraints, it was not possible to carry out extensive performance testing. The relevance of performance testing is questionable at this stage of development given that there is still room for further code optimisation. Nevertheless, basic performance testing was carried out on the current implementation.

To see the impact that this kernel modification has on performance, the duration of a running process was measured on a system running a grsecurity hardened Linux kernel with the RBAC system enabled, with and without the modification provided by this solution. All tests were run on the same system with the same operating system instance, the only difference being that each set of tests was run with a different Linux kernel. Each kernel was built from an identical kernel configuration file, the only difference being the kernel extension described by the patch listed in Appendix B. The time shell keyword of Bash was used to obtain the time that the process spends in kernel space. The test was run on the decompression program xz, which was executed 10 times against each kernel. The results are presented in Figures 5.14 and 5.15.

The statistic of interest from the results is the sys value, which represents the system time, in other words the time spent in kernel space. The average time that the xz program spent in kernel space across the 10 tests without the kernel extension was 0.0260 seconds. Across the 10 tests with the extension that isolates processes using namespaces and system call filters, the average time was 0.0307 seconds. This is an increase of 0.0047 seconds, or about 18% of the system time, but less than 0.1% of the roughly 5.46 seconds that each run takes in total. The strace utility was used to obtain further information that shows where this time was spent. The xz program makes a total of 89956 system calls when decompressing the Linux kernel source tree. 79165 of these were write() system calls, 10664 were read() system calls and the remaining 127 system calls were spread across 17 different system call types.

When adding an extra layer of security, a decrease in performance is to be expected. When running a simple program like xz, the performance difference is negligible. This may not be true for more complex programs; however, further code optimisation would reduce the impact on performance. The most obvious way to improve the performance of this kernel extension would be to construct the seccomp BPF program in user space after the policies have been validated by the gradm utility. The BPF program would then be copied from user space to kernel space. The current implementation constructs the BPF program every time execve() is called.
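The averages quoted in this section can be reproduced from the sys values listed in Figures 5.14 and 5.15. The short C sketch below does so; the array and function names are chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* sys times in seconds, copied from Figure 5.14 (standard kernel). */
static const double SYS_BASELINE[10] = {
    0.050, 0.023, 0.020, 0.023, 0.017, 0.030, 0.020, 0.027, 0.020, 0.030
};

/* sys times in seconds, copied from Figure 5.15 (extended kernel). */
static const double SYS_EXTENDED[10] = {
    0.013, 0.033, 0.023, 0.027, 0.030, 0.050, 0.037, 0.047, 0.020, 0.027
};

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(v != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Increase in mean system time relative to the standard kernel. */
double relative_sys_overhead(void)
{
    double base = mean(SYS_BASELINE, 10);
    double ext = mean(SYS_EXTENDED, 10);

    return (ext - base) / base;
}
```

With ten samples per kernel and individual values ranging from 0.013 to 0.050 seconds, the spread within each set is larger than the difference between the two means, so the figures are best read as an indication rather than a precise measurement.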

5.4 Summary

This chapter has described the main aspects of grsecurity’s RBAC system, including the gradm administration program and the format of its policy file. The implementation of the extension to gradm to handle new types of policies for namespaces and system call filters has been described, as well as that of the extension to the grsecurity enhanced kernel which processes these policies. The usage and effectiveness of the new policy types have been evaluated, and also the impact that they have on performance.

Figure 5.5: The sequence of interaction that happens when a user enables the grsecurity RBAC system. (1) The user enables the RBAC system by running the command gradm -E. (2) gradm opens and reads the policy file. (3) The file is parsed and the policies are validated. (4) and (5) The policies are transmitted to the kernel via the grsecurity device node.

Figure 5.6: The sequence of interaction between a process, the execve() system call and the kernel structure that holds policy information. The pwd command is presented as an example.

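#include <stdlib.h>     /* system() */
#include <sys/types.h>  /* uid_t */
#include <unistd.h>     /* setuid() */

int main()
{
    setuid(0);          /* become root; works only from a setuid root binary */
    system("/bin/sh");  /* start a shell with the new identity */
}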

Figure 5.7: The C source code of the “rootshell” program.

$ chmod u+s rootshell
$ ls -l rootshell
-r-sr-xr-x 1 root root 6864 Apr 29 20:05 rootshell
$ whoami
user
$ ./rootshell
# whoami
root
#

Figure 5.8: Demonstration of how the “rootshell” program may be used to gain root access.

# find / -perm -u+s -print
/bin/mount
/bin/umount
/bin/su
/usr/bin/chsh
/usr/bin/newgidmap
/usr/bin/gpasswd
/usr/bin/chfn
/usr/bin/sudo
/usr/bin/newuidmap
/usr/bin/newgrp
/usr/bin/passwd
/usr/bin/pkexec

Figure 5.9: The find command is used to obtain a list of all programs which have the setuid file mode bit set.

$ ls -l rootshell
-r-sr-xr-x 1 root root 6864 Apr 29 20:05 rootshell
$ ./rootshell
Bad system call
$ whoami
user
$ passwd
Changing password for user.
(current) UNIX password:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

Figure 5.10: Demonstration of how the “rootshell” malware program fails but the passwd command which also uses the setuid() system call completes successfully.

# ifconfig -a
eth0: flags=4163 mtu 1500
        inet 192.168.0.31 netmask 255.255.255.0 broadcast 192.168.0.255
        inet6 fe80::52e5:49ff:fe39:13e1 prefixlen 64 scopeid 0x20
        ether 50:e5:49:39:13:e1 txqueuelen 1000 (Ethernet)
        RX packets 15732 bytes 4456127 (4.2 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 10735 bytes 1778785 (1.6 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73 mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10
        loop txqueuelen 1 (Local Loopback)
        RX packets 68 bytes 6033 (5.8 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 68 bytes 6033 (5.8 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

# gradm -E
# ifconfig -a
lo: flags=8 mtu 65536
        loop txqueuelen 1 (Local Loopback)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Figure 5.11: Demonstration of a network namespace before and after the grsecurity RBAC system has been enabled. The ifconfig command with the -a option lists all of the network interfaces currently available. When the RBAC system is enabled, ifconfig runs inside a new network namespace and cannot see the eth0 network interface, which is available in the global namespace.

# ipcs

------ Message Queues ------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments ------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          kdm        700        7056000    2          dest

------ Semaphore Arrays ------
key        semid      owner      perms      nsems

# gradm -E
# ipcs

------ Message Queues ------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments ------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays ------
key        semid      owner      perms      nsems

Figure 5.12: Demonstration of an IPC namespace before and after the grsecurity RBAC system has been enabled. The ipcs command shows information on IPC facilities. When the RBAC system is enabled, ipcs runs inside a new IPC namespace and does not know about the shared memory segment which is available in the global namespace.

# gradm -E
# hostname
foo
# hostname bar
# hostname
foo

Figure 5.13: Demonstration of a UTS namespace after the grsecurity RBAC system has been enabled. The hostname command gets or sets the system hostname. Every time hostname is run, it runs inside a new UTS namespace. The hostname is changed to “bar” but it has no effect on the global namespace or on other namespaces.

$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m6.954s
user 0m6.823s
sys 0m0.050s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.461s
user 0m5.433s
sys 0m0.023s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.441s
user 0m5.417s
sys 0m0.020s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.443s
user 0m5.413s
sys 0m0.023s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.462s
user 0m5.440s
sys 0m0.017s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.452s
user 0m5.420s
sys 0m0.030s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.453s
user 0m5.427s
sys 0m0.020s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.473s
user 0m5.440s
sys 0m0.027s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.452s
user 0m5.427s
sys 0m0.020s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.488s
user 0m5.453s
sys 0m0.030s

Figure 5.14: Tests run on a standard grsecurity enhanced Linux kernel. The xz program was run 10 times to decompress a Linux kernel source tree.

$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.471s
user 0m5.450s
sys 0m0.013s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.445s
user 0m5.410s
sys 0m0.033s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.466s
user 0m5.437s
sys 0m0.023s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.456s
user 0m5.423s
sys 0m0.027s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.471s
user 0m5.433s
sys 0m0.030s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.468s
user 0m5.413s
sys 0m0.050s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.468s
user 0m5.423s
sys 0m0.037s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.475s
user 0m5.423s
sys 0m0.047s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.456s
user 0m5.430s
sys 0m0.020s
$ time xz --decompress --stdout linux-4.4.6.tar.xz > /dev/null
real 0m5.461s
user 0m5.430s
sys 0m0.027s

Figure 5.15: Tests run on a grsecurity enhanced Linux kernel that has been extended to isolate processes using namespaces and system call filters. The xz program was run 10 times to decompress a Linux kernel source tree.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

Applying system call filters on a per process basis provides a powerful means of controlling a process’ behaviour. The initial goal of applying the principle of least privilege through the use of system call filters was achieved. In reality however, the use of whitelist policies can be difficult to manage. Even if one does manage to identify all of the system calls required for the correct behaviour of a particular program, a software update may break it by introducing new system calls. It is an arduous task to maintain these policies. The use of namespaces in this solution provides a simple means to isolate processes. The effectiveness of user namespaces as a security feature has been challenged [96] [90] [100] and there may be better ways to impose restrictions on processes. Taking network namespaces as an example, an end-user may wish to create a policy that prevents network access. Whilst it may be possible to achieve this with other types of policy like system call filters or grsecurity’s capability or socket policies, the use of network namespaces adds another layer that an attacker would need to bypass in order to gain network access. It is hoped that the isolation provided by namespaces will improve as the technology matures [90]. The current implementation of namespaces in this solution is limited in that it only allows processes to run either inside the global namespace or inside new namespaces. It would be much more useful if processes could join existing namespaces. The following section describes how this might be achieved through the use of bind mounts. The implementation at present demonstrates how namespaces and system call filters may be used as an effective means to extend the security of a Linux system and adhere to the principle of least privilege with little impact on performance.

6.2 Future work

The implementation is quite raw and would benefit from more work in order to be more appealing to the end-user. Many desirable features could not be implemented due to time constraints. This section discusses how this solution will be further developed.


6.2.1 Feature completeness

Support will be added for mount, PID and user namespaces, and there will also be system call filter support for architectures other than x86-64. The grsecurity extension will be brought up to date by supporting the newest system calls that have been added to the most recent versions of Linux. Once grsecurity has been adapted to Linux 4.6, support for cgroup namespaces may also be added.

6.2.2 Blacklists

A desirable feature would be to support blacklist policies. Although blacklist policies do not abide by the principle of least privilege, in certain situations some system administrators may prefer to define a blacklist instead of a whitelist. This is particularly apparent when defining system call filter policies for complex applications where a whitelist may be difficult to determine. It would be better to additionally provide an option to allow blacklisting than for a system administrator to disable the use of policies altogether. Blacklist policies would not be implemented in the default policy, leaving it up to the system administrator to decide whether they are going to use blacklists.

6.2.3 Namespace bind mounts

It would be very desirable for each program to have the ability to join an existing namespace. One example where this would be particularly useful is in programs that require the use of IPC. It is preferable that programs which share IPC facilities run inside a shared but isolated namespace, instead of running them inside the global namespace. This may be achieved through the use of bind mounts. A namespace can optionally be made persistent by bind mounting its proc file system entry /proc/pid/ns/type to a file system path. The Bison language rule for namespaces would be modified to recognise a new policy file object mode which would allow a bind mount path to be specified. An example is illustrated in Figure 6.1.

subject /usr/bin/ipcs
+CLONE_NEWIPC /run/ipcns/ipcgroup1

Figure 6.1: Policy rule for an IPC namespace with a bind mount.
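Joining a namespace that has been made persistent by a bind mount is possible from user space with the setns() call, which suggests how the proposed object mode might behave. The sketch below is a user-space illustration under that assumption; the function join_namespace() is hypothetical and no such code exists in the current implementation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes setns() and the CLONE_NEW* flags */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Join the namespace that is bind mounted at path. nstype names the
 * expected namespace type, for example CLONE_NEWIPC, or 0 to accept
 * any type. Returns 0 on success and -1 with errno set on failure.
 */
int join_namespace(const char *path, int nstype)
{
    int fd, rc, saved_errno;

    assert(path != NULL);

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    rc = setns(fd, nstype);

    /* Preserve the errno from setns() across close(). */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}
```

With the policy of Figure 6.1, the equivalent call would be join_namespace("/run/ipcns/ipcgroup1", CLONE_NEWIPC), made before any user space code of the program runs.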

6.2.4 Further namespace customisation

Policy rules which enable further customisation of namespaces would also be worthwhile. This would be achieved with the addition of new object modes for the namespace policy objects. Here are some suggestions of how namespaces could be made more useful:

• The default implementation of a network namespace prevents network access. For a network namespace to have access to the network, it would need to have access to at least one network interface. Virtual network links may be added to a network namespace in order to allow a physical network device to be shared with it. An object mode could be added to specify the name of a virtual network link. An example is illustrated in Figure 6.2.

• When a new user namespace is created, it does not have any UID nor GID mappings. An end-user may wish to provide these mappings and these would be specified by a new object mode.

• An end-user may wish to set a specific hostname or NIS domainname system identifiers for each UTS namespace. New object modes could be added to specify these.

• Another customisation idea for UTS namespaces would be to have a default policy which automatically assigns new system identifiers to the UTS namespaces that new processes start inside. This would expose less information about the real system to an attacker.

subject /bin/ping
+CLONE_NEWNET /run/netns/netgroup1 veth1

Figure 6.2: Policy rule for a network namespace with a bind mount and a virtual network link.

6.2.5 Further seccomp customisation

Seccomp supports the flag SECCOMP_FILTER_FLAG_TSYNC. This flag tells seccomp to synchronise all other threads of the calling process to the same seccomp filter tree when adding a new filter. It would be desirable to support this flag with a new object mode that could be specified in the policy file. An example is illustrated in Figure 6.3.

subject /usr/bin/pwd
+read() TSYNC
...

Figure 6.3: Policy rule for a system call filter with a thread synchronisation flag.

System call filtering allows for fine-grained policies. A frequent problem with fine-grained policies is that they are often difficult to define and this often leads to users granting broad rights [26], negating their advantage. To tackle this problem, groups of system calls would be established, similar to those defined by Pledge in OpenBSD, as discussed in Section 3.3. These groups may be declared as policy objects.

6.2.6 Policy file separation

Support for multiple policy files would allow package maintainers of Linux distributions to distribute separate policy files with software packages. This simplifies the task of distributing policies with packages as it avoids the need for the package installer to manipulate a single policy file. These policy files could be placed in an /etc/grsec.d directory.

6.2.7 Learning mode

Grsecurity supports a learning mode capable of learning all aspects that the RBAC system supports. This learning mode should be augmented to be able to learn system calls and certain aspects of namespaces, like IPC usage.

6.2.8 Code refactoring

There is still room for further code optimisation. The code should also be refactored to follow grsecurity conventions where they are not already followed. This would increase the possibility of this extension being accepted upstream and integrated into the grsecurity project.

Appendix A

Gradm patch

diff -rupN gradm-orig/gradm_defs.h gradm-final/gradm_defs.h
--- gradm-orig/gradm_defs.h	2016-03-15 20:47:45.000000000 -0500
+++ gradm-final/gradm_defs.h	2016-06-07 00:27:22.614163739 -0500
@@ -43,6 +43,8 @@
 // CAP_AUDIT_READ
 #define CAP_MAX 37

+#define SYSCALL_MAX 326
+
 #define MAX_INCLUDE_DEPTH 20
 #define MAX_NEST_DEPTH 8
 #define MAX_SYMLINK_DEPTH 8
@@ -138,6 +140,18 @@
 #define cap_raised(c, flag) ((c).cap[CAP_TO_INDEX(flag)] & CAP_TO_MASK(flag))
 #define CAP_SETUID 7
 #define CAP_SETGID 6
+
+#undef SYSCALL_TO_INDEX
+#undef SYSCALL_TO_MASK
+#undef syscall_raise
+#undef syscall_lower
+#undef syscall_raised
+#define SYSCALL_TO_INDEX(x) ((x) >> 5) /* 1 << 5 == bits in __u32 */
+#define SYSCALL_TO_MASK(x) (1U << ((x) & 31)) /* mask for indexed __u32 */
+#define syscall_raise(sc, flag) ((sc).syscall[SYSCALL_TO_INDEX(flag)] |= SYSCALL_TO_MASK(flag))
+#define syscall_lower(sc, flag) ((sc).syscall[SYSCALL_TO_INDEX(flag)] &= ~SYSCALL_TO_MASK(flag))
+#define syscall_raised(sc, flag) ((sc).syscall[SYSCALL_TO_INDEX(flag)] & SYSCALL_TO_MASK(flag))
+
 enum {
 	GRADM_DISABLE = 0,
 	GRADM_ENABLE = 1,
@@ -253,6 +267,10 @@ typedef struct _gr_cap_t {
 	u_int32_t cap[2];
 } gr_cap_t;

+typedef struct _gr_syscall_t {
+	u_int32_t syscall[12];
+} gr_syscall_t;
+
 struct capability_set {
 	const char *cap_name;
 	int cap_val;
@@ -268,6 +286,16 @@ struct paxflag_set {
 	u_int16_t paxflag_val;
 };

+struct systemcall_set {
+	const char *syscall_name;
+	u_int16_t syscall_val;
+};
+
+struct namespace_set {
+	const char *namespace_name;
+	u_int32_t namespace_val;
+};
+
 struct rlimconv {
 	const char *name;
 	unsigned short val;
@@ -412,6 +440,11 @@ struct proc_acl {
 	struct file_acl **obj_hash;
 	u_int32_t obj_hash_size;
 	u_int16_t pax_flags;
+
+	gr_syscall_t syscall_mask;
+	gr_syscall_t syscall_drop;
+
+	u_int32_t namespaces;
 };

 struct gr_learn_ip_node {
@@ -612,6 +645,8 @@ struct gr_arg_wrapper {
 extern const char *rlim_table[GR_NLIMITS];
 extern struct capability_set capability_list[CAP_MAX+2];
 extern struct paxflag_set paxflag_list[5];
+extern struct namespace_set namespace_list[6];
+extern struct systemcall_set systemcall_list[SYSCALL_MAX+2];
 extern struct family_set sock_families[AF_MAX+2];

 extern int is_24_kernel;
diff -rupN gradm-orig/gradm_func.h gradm-final/gradm_func.h
--- gradm-orig/gradm_func.h	2016-03-15 20:47:45.000000000 -0500
+++ gradm-final/gradm_func.h	2016-06-07 00:27:22.614163739 -0500
@@ -47,6 +47,8 @@ int add_proc_object_acl(struct proc_acl
 			u_int32_t mode, int type);
 void add_cap_acl(struct proc_acl *subject, const char *cap, const char *audit);
 void add_paxflag_acl(struct proc_acl *subject, const char *paxflag);
+void add_systemcall_acl(struct proc_acl *subject, const char *systemcall);
+void add_namespace_acl(struct proc_acl *subject, const char *namespace);
 void add_gradm_acl(struct role_acl *role);
 void add_gradm_pam_acl(struct role_acl *role);
 void add_grlearn_acl(struct role_acl *role);
diff -rupN gradm-orig/gradm.l gradm-final/gradm.l
--- gradm-orig/gradm.l	2016-03-15 20:47:45.000000000 -0500
+++ gradm-final/gradm.l	2016-06-07 00:27:22.614163739 -0500
@@ -381,6 +381,14 @@ IP [0-9]{1,3}"."[0-9]{1,3}"."[0-9]{1,3}"
 					  gradmlval.string = gr_strdup(yytext);
 					  return PAX_NAME;
 					}
+[+-][a-z][_a-z0-9]+\(\)		{
+					  gradmlval.string = gr_strdup(yytext);
+					  return SC_NAME;
+					}
+[+-]"CLONE_"[A-Z]+		{
+					  gradmlval.string = gr_strdup(yytext);
+					  return NS_NAME;
+					}
 "RES_"[A-Z]+			{
 					  BEGIN(RES_STATE);
 					  gradmlval.string = gr_strdup(yytext);
diff -rupN gradm-orig/gradm_ns.c gradm-final/gradm_ns.c
--- gradm-orig/gradm_ns.c	1969-12-31 19:00:00.000000000 -0500
+++ gradm-final/gradm_ns.c	2016-06-07 00:27:22.614163739 -0500
@@ -0,0 +1,50 @@
+#include "gradm.h"
+
+struct namespace_set namespace_list[] = {
+	{"CLONE_NEWNS", 0x00020000},
+	{"CLONE_NEWUTS", 0x04000000},
+	{"CLONE_NEWIPC", 0x08000000},
+	{"CLONE_NEWUSER", 0x10000000},
+	{"CLONE_NEWPID", 0x20000000},
+	{"CLONE_NEWNET", 0x40000000},
+};
+
+u_int32_t
+namespace_conv(const char *namespace)
+{
+	int i;
+
+	for (i = 0; i < sizeof (namespace_list) / sizeof (struct namespace_set); i++)
+		if (!strcmp(namespace, namespace_list[i].namespace_name))
+			return (namespace_list[i].namespace_val);
+
+	fprintf(stderr, "Invalid namespace name \"%s\" on line %lu of %s.\n"
+		"The RBAC system will not load until this"
+		" error is fixed.\n", namespace, lineno, current_acl_file);
+
+	exit(EXIT_FAILURE);
+
+	return 0;
+}
+
+void
+add_namespace_acl(struct proc_acl *subject, const char *namespace)
+{
+	u_int32_t knamespace = namespace_conv(namespace + 1);
+
+	if (!subject) {
+		fprintf(stderr, "Error on line %lu of %s. Attempt to "
+			"add a namespace without a subject declaration.\n"
+			"The RBAC system will not load until this "
+			"error is fixed.\n", lineno, current_acl_file);
+		exit(EXIT_FAILURE);
+	}
+
+	if (*namespace == '+')
+		subject->namespaces |= knamespace;
+	else
+		subject->namespaces |= (knamespace << 8);
+
+	return;
+}
+
diff -rupN gradm-orig/gradm_sc.c gradm-final/gradm_sc.c
--- gradm-orig/gradm_sc.c	1969-12-31 19:00:00.000000000 -0500
+++ gradm-final/gradm_sc.c	2016-06-07 00:27:22.614163739 -0500
@@ -0,0 +1,452 @@
+#include "gradm.h"
+
+struct systemcall_set systemcall_list[] = {
+	/* x86-64 system calls for Linux 4.4 */
+	{"read()", 0},
+	{"write()", 1},
+	{"open()", 2},
+	{"close()", 3},
+	{"stat()", 4},
+	{"fstat()", 5},
+	{"lstat()", 6},
+	{"poll()", 7},
+	{"lseek()", 8},
+	{"mmap()", 9},
+	{"mprotect()", 10},
+	{"munmap()", 11},
+	{"brk()", 12},
+	{"rt_sigaction()", 13},
+	{"rt_sigprocmask()", 14},
+	{"rt_sigreturn()", 15},
+	{"ioctl()", 16},
+	{"pread64()", 17},
+	{"pwrite64()", 18},
+	{"readv()", 19},
+	{"writev()", 20},
+	{"access()", 21},
+	{"pipe()", 22},
+	{"select()", 23},
+	{"sched_yield()", 24},
+	{"mremap()", 25},
+	{"msync()", 26},
+	{"mincore()", 27},
+	{"madvise()", 28},
+	{"shmget()", 29},
+	{"shmat()", 30},
+	{"shmctl()", 31},
+	{"dup()", 32},
+	{"dup2()", 33},
+	{"pause()", 34},
+	{"nanosleep()", 35},
+	{"getitimer()", 36},
+	{"alarm()", 37},
+	{"setitimer()", 38},
+	{"getpid()", 39},
+	{"sendfile()", 40},
+	{"socket()", 41},
+	{"connect()", 42},
+	{"accept()", 43},
+	{"sendto()", 44},
+	{"recvfrom()", 45},
+	{"sendmsg()", 46},
+	{"recvmsg()", 47},
+	{"shutdown()", 48},
+	{"bind()", 49},
+	{"listen()", 50},
+	{"getsockname()", 51},
+	{"getpeername()", 52},
+	{"socketpair()", 53},
+	{"setsockopt()", 54},
+	{"getsockopt()", 55},
+	{"clone()", 56},
+	{"fork()", 57},
+	{"vfork()", 58},
+	{"execve()", 59},
+	{"exit()", 60},
+	{"wait4()", 61},
+	{"kill()", 62},
+	{"uname()", 63},
+	{"semget()", 64},
+	{"semop()", 65},
+	{"semctl()", 66},
+	{"shmdt()", 67},
+	{"msgget()", 68},
+	{"msgsnd()", 69},
+	{"msgrcv()", 70},
+	{"msgctl()", 71},
+	{"fcntl()", 72},
+	{"flock()", 73},
+	{"fsync()", 74},
+	{"fdatasync()", 75},
+	{"truncate()", 76},
+	{"ftruncate()", 77},
+	{"getdents()", 78},
+	{"getcwd()", 79},
+	{"chdir()", 80},
+	{"fchdir()", 81},
+	{"rename()", 82},
+	{"mkdir()", 83},
+	{"rmdir()", 84},
+	{"creat()", 85},
+	{"link()", 86},
+	{"unlink()", 87},
+	{"symlink()", 88},
+	{"readlink()", 89},
+	{"chmod()", 90},
+	{"fchmod()", 91},
+	{"chown()", 92},
+	{"fchown()", 93},
+	{"lchown()", 94},
+	{"umask()", 95},
+	{"gettimeofday()", 96},
+	{"getrlimit()", 97},
+	{"getrusage()", 98},
+	{"sysinfo()", 99},
+	{"times()", 100},
+	{"ptrace()", 101},
+	{"getuid()", 102},
+	{"syslog()", 103},
+	{"getgid()", 104},
+	{"setuid()", 105},
+	{"setgid()", 106},
+	{"geteuid()", 107},
+	{"getegid()", 108},
+	{"setpgid()", 109},
+	{"getppid()", 110},
+	{"getpgrp()", 111},
+	{"setsid()", 112},
+	{"setreuid()", 113},
+	{"setregid()", 114},
+	{"getgroups()", 115},
+	{"setgroups()", 116},
+	{"setresuid()", 117},
+	{"getresuid()", 118},
+	{"setresgid()", 119},
+	{"getresgid()", 120},
+	{"getpgid()", 121},
+	{"setfsuid()", 122},
+	{"setfsgid()", 123},
+	{"getsid()", 124},
+	{"capget()", 125},
+	{"capset()", 126},
+	{"rt_sigpending()", 127},
+	{"rt_sigtimedwait()", 128},
+	{"rt_sigqueueinfo()", 129},
+	{"rt_sigsuspend()", 130},
+	{"sigaltstack()", 131},
+	{"utime()", 132},
+	{"mknod()", 133},
+	{"uselib()", 134},
+	{"personality()", 135},
+	{"ustat()", 136},
+	{"statfs()", 137},
+	{"fstatfs()", 138},
+	{"sysfs()", 139},
+	{"getpriority()", 140},
+	{"setpriority()", 141},
+	{"sched_setparam()", 142},
+	{"sched_getparam()", 143},
+	{"sched_setscheduler()", 144},
+	{"sched_getscheduler()", 145},
+	{"sched_get_priority_max()", 146},
+	{"sched_get_priority_min()", 147},
+	{"sched_rr_get_interval()", 148},
+	{"mlock()", 149},
+	{"munlock()", 150},
+	{"mlockall()", 151},
+	{"munlockall()", 152},
+	{"vhangup()", 153},
+	{"modify_ldt()", 154},
+	{"pivot_root()", 155},
+	{"_sysctl()", 156},
+	{"prctl()", 157},
+	{"arch_prctl()", 158},
+	{"adjtimex()", 159},
+	{"setrlimit()", 160},
+	{"chroot()", 161},
+	{"sync()", 162},
+	{"acct()", 163},
+	{"settimeofday()", 164},
+	{"mount()", 165},
+	{"umount2()", 166},
+	{"swapon()", 167},
+	{"swapoff()", 168},
+	{"reboot()", 169},
+	{"sethostname()", 170},
+	{"setdomainname()", 171},
+	{"iopl()", 172},
+	{"ioperm()", 173},
+	{"create_module()", 174},
+	{"init_module()", 175},
+	{"delete_module()", 176},
+	{"get_kernel_syms()", 177},
+	{"query_module()", 178},
+	{"quotactl()", 179},
+	{"nfsservctl()", 180},
+	{"getpmsg()", 181},
+	{"putpmsg()", 182},
+	{"afs_syscall()", 183},
+	{"tuxcall()", 184},
+	{"security()", 185},
+	{"gettid()", 186},
+	{"readahead()", 187},
+	{"setxattr()", 188},
+	{"lsetxattr()", 189},
+	{"fsetxattr()", 190},
+	{"getxattr()", 191},
+	{"lgetxattr()", 192},
+	{"fgetxattr()", 193},
+	{"listxattr()", 194},
+	{"llistxattr()", 195},
+	{"flistxattr()", 196},
+	{"removexattr()", 197},
+	{"lremovexattr()", 198},
+	{"fremovexattr()", 199},
+	{"tkill()", 200},
+	{"time()", 201},
+	{"futex()", 202},
+	{"sched_setaffinity()", 203},
+	{"sched_getaffinity()", 204},
+	{"set_thread_area()", 205},
+	{"io_setup()", 206},
+	{"io_destroy()", 207},
+	{"io_getevents()", 208},
+	{"io_submit()", 209},
+	{"io_cancel()", 210},
+	{"get_thread_area()", 211},
+	{"lookup_dcookie()", 212},
+	{"epoll_create()", 213},
+	{"epoll_ctl_old()", 214},
+	{"epoll_wait_old()", 215},
+	{"remap_file_pages()", 216},
+	{"getdents64()", 217},
+	{"set_tid_address()", 218},
+	{"restart_syscall()", 219},
+	{"semtimedop()", 220},
+	{"fadvise64()", 221},
+	{"timer_create()", 222},
+	{"timer_settime()", 223},
+	{"timer_gettime()", 224},
+	{"timer_getoverrun()", 225},
+	{"timer_delete()", 226},
+	{"clock_settime()", 227},
+	{"clock_gettime()", 228},
+	{"clock_getres()", 229},
+	{"clock_nanosleep()", 230},
+	{"exit_group()", 231},
+	{"epoll_wait()", 232},
+	{"epoll_ctl()", 233},
+	{"tgkill()", 234},
+	{"utimes()", 235},
+	{"vserver()", 236},
+	{"mbind()", 237},
+	{"set_mempolicy()", 238},
+	{"get_mempolicy()", 239},
+	{"mq_open()", 240},
+	{"mq_unlink()", 241},
+	{"mq_timedsend()", 242},
+	{"mq_timedreceive()", 243},
+	{"mq_notify()", 244},
+	{"mq_getsetattr()", 245},
+	{"kexec_load()", 246},
+	{"waitid()", 247},
+	{"add_key()", 248},
+	{"request_key()", 249},
+	{"keyctl()", 250},
+	{"ioprio_set()", 251},
+	{"ioprio_get()", 252},
+	{"inotify_init()", 253},
+	{"inotify_add_watch()", 254},
+	{"inotify_rm_watch()", 255},
+	{"migrate_pages()", 256},
+	{"openat()", 257},
+	{"mkdirat()", 258},
+	{"mknodat()", 259},
+	{"fchownat()", 260},
+	{"futimesat()", 261},
+	{"newfstatat()", 262},
+	{"unlinkat()", 263},
+	{"renameat()", 264},
+	{"linkat()", 265},
+	{"symlinkat()", 266},
+	{"readlinkat()", 267},
+	{"fchmodat()", 268},
+	{"faccessat()", 269},
+	{"pselect6()", 270},
+	{"ppoll()", 271},
+	{"unshare()", 272},
+	{"set_robust_list()", 273},
+	{"get_robust_list()", 274},
+	{"splice()", 275},
+	{"tee()", 276},
+	{"sync_file_range()", 277},
+	{"vmsplice()", 278},
+	{"move_pages()", 279},
+	{"utimensat()", 280},
+	{"epoll_pwait()", 281},
+	{"signalfd()", 282},
+	{"timerfd_create()", 283},
+	{"eventfd()", 284},
+	{"fallocate()", 285},
+	{"timerfd_settime()", 286},
+	{"timerfd_gettime()", 287},
+	{"accept4()", 288},
+	{"signalfd4()", 289},
+	{"eventfd2()", 290},
+	{"epoll_create1()", 291},
+	{"dup3()", 292},
+	{"pipe2()", 293},
+	{"inotify_init1()", 294},
+	{"preadv()", 295},
+	{"pwritev()", 296},
+	{"rt_tgsigqueueinfo()", 297},
+	{"perf_event_open()", 298},
+	{"recvmmsg()", 299},
+	{"fanotify_init()", 300},
+	{"fanotify_mark()", 301},
+	{"prlimit64()", 302},
+	{"name_to_handle_at()", 303},
+	{"open_by_handle_at()", 304},
+	{"clock_adjtime()", 305},
+	{"syncfs()", 306},
+	{"sendmmsg()", 307},
+	{"setns()", 308},
+	{"getcpu()", 309},
+	{"process_vm_readv()", 310},
+	{"process_vm_writev()", 311},
+	{"kcmp()", 312},
+	{"finit_module()", 313},
+	{"sched_setattr()", 314},
+	{"sched_getattr()", 315},
+	{"renameat2()", 316},
+	{"seccomp()", 317},
+	{"getrandom()", 318},
+	{"memfd_create()", 319},
+	{"kexec_file_load()", 320},
+	{"bpf()", 321},
+	{"execveat()", 322},
+	{"userfaultfd()", 323},
+	{"membarrier()", 324},
+	{"mlock2()", 325},
+	{"ALL()", ~0}
+};
+

+gr_syscall_t syscall_combine(gr_syscall_t a, gr_syscall_t b)
+{
+	int i;
+	gr_syscall_t ret;
+
+	for (i = 0; i < 12; i++)
+		ret.syscall[i] = a.syscall[i] | b.syscall[i];
+
+	return ret;
+}
+
+gr_syscall_t syscall_drop(gr_syscall_t a, gr_syscall_t b)
+{
+	int i;
+	gr_syscall_t ret;
+
+	for (i = 0; i < 12; i++)
+		ret.syscall[i] = a.syscall[i] &~ b.syscall[i];
+
+	return ret;
+}
+
+gr_syscall_t syscall_intersect(gr_syscall_t a, gr_syscall_t b)
+{
+	int i;
+	gr_syscall_t ret;
+
+	for (i = 0; i < 12; i++)
+		ret.syscall[i] = a.syscall[i] & b.syscall[i];
+
+	return ret;
+}
+
+gr_syscall_t syscall_invert(gr_syscall_t a)
+{
+	int i;
+	gr_syscall_t ret;
+
+	for (i = 0; i < 12; i++)
+		ret.syscall[i] = ~a.syscall[i];
+
+	return ret;
+}
+
+int syscall_isclear(gr_syscall_t a)
+{
+	if (a.syscall[0] || a.syscall[1])
+		return 0;
+
+	return 1;
+}
+
+int syscall_same(gr_syscall_t a, gr_syscall_t b)
+{
+	if (a.syscall[0] == b.syscall[0] && a.syscall[1] == b.syscall[1])
+		return 1;
+
+	return 0;
+}
+
+gr_syscall_t
+syscall_conv(const char *syscall)
+{
+	gr_syscall_t retsyscall = {{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }};
+	int i;
+
+	for (i = 0; i < sizeof (systemcall_list) / sizeof (struct systemcall_set); i++)
+		if (!strcmp(syscall, systemcall_list[i].syscall_name)) {
+			if (i == (sizeof (systemcall_list) /
+				  sizeof (struct systemcall_set) - 1)) {
+				retsyscall.syscall[0] = ~0;
+				retsyscall.syscall[1] = ~0;	/* ALL() */
+			} else
+				syscall_raise(retsyscall, systemcall_list[i].syscall_val);
+			return retsyscall;
+		}
+
+	fprintf(stderr, "Invalid system call name \"%s\" on line %lu of %s.\n"
+		"The RBAC system will not load until this"
+		" error is fixed.\n", syscall, lineno, current_acl_file);
+
+	exit(EXIT_FAILURE);
+
+	return retsyscall;
+}
+
+void
+add_systemcall_acl(struct proc_acl *subject, const char *syscall)
+{
+	gr_syscall_t ksyscall = syscall_conv(syscall + 1);
+
+	if (!subject) {
+		fprintf(stderr, "Error on line %lu of %s. Attempt to "
+			"add a system call without a subject declaration.\n"
+			"The RBAC system will not load until this "
+			"error is fixed.\n", lineno, current_acl_file);
+		exit(EXIT_FAILURE);
+	}
+
+	if (*syscall == '+') {
+		subject->syscall_drop = syscall_drop(subject->syscall_drop, ksyscall);
+		subject->syscall_mask = syscall_combine(subject->syscall_mask, ksyscall);
+	}
+	else {
+		subject->syscall_drop = syscall_combine(subject->syscall_drop, ksyscall);
+		subject->syscall_mask = syscall_combine(subject->syscall_mask, ksyscall);
+	}
+
+	return;
+}
+
+void
+modify_syscalls(struct proc_acl *proc, int syscall)
+{
+	syscall_lower(proc->syscall_drop, syscall);
+	syscall_raise(proc->syscall_mask, syscall);
+
+	return;
+}
diff -rupN gradm-orig/gradm.y gradm-final/gradm.y
--- gradm-orig/gradm.y	2016-03-15 20:47:46.000000000 -0500
+++ gradm-final/gradm.y	2016-06-07 00:27:22.614163739 -0500
@@ -40,6 +40,7 @@ int current_nest_depth = 0;
 %token ROLE ROLE_NAME SUBJECT SUBJ_NAME OBJ_NAME HOSTNAME
 %token RES_NAME RES_SOFTHARD CONNECT BIND IPTYPE
 %token IPPROTO CAP_NAME ROLE_ALLOW_IP PAX_NAME
+%token SC_NAME NS_NAME
 %token ROLE_TRANSITION VARIABLE DEFINE DEFINE_NAME DISABLED
 %token ID_NAME USER_TRANS_ALLOW GROUP_TRANS_ALLOW
 %token USER_TRANS_DENY GROUP_TRANS_DENY DOMAIN_TYPE DOMAIN
@@ -80,6 +81,8 @@ various_acls: role_label
 	| object_file_label
 	| object_cap_label
 	| object_paxflag_label
+	| object_sc_label
+	| object_ns_label
 	| object_res_label
 	| object_connect_ip_label
 	| object_bind_ip_label
@@ -357,6 +360,20 @@ object_paxflag_label: PAX_NAME
 		free($1);
 	}
 	;
+
+object_sc_label: SC_NAME
+	{
+		add_systemcall_acl(current_subject, $1);
+		free($1);
+	}
+	;
+
+object_ns_label: NS_NAME
+	{
+		add_namespace_acl(current_subject, $1);
+		free($1);
+	}
+	;

 object_res_label: RES_NAME RES_SOFTHARD RES_SOFTHARD
 	{
diff -rupN gradm-orig/Makefile gradm-final/Makefile
--- gradm-orig/Makefile	2016-03-15 20:47:45.000000000 -0500
+++ gradm-final/Makefile	2016-06-07 00:27:22.614163739 -0500
@@ -48,7 +48,7 @@ OBJECTS=gradm.tab.o lex.gradm.o learn_pa
 	lex.fulllearn_pass1.o lex.fulllearn_pass2.o \
 	lex.fulllearn_pass3.o lex.learn_pass1.o lex.learn_pass2.o \
 	grlearn_config.tab.o lex.grlearn_config.o gradm_globals.o \
-	gradm_replace.o
+	gradm_replace.o gradm_sc.o gradm_ns.o

 all: $(GRADM_BIN) $(GRADM_PAM) grlearn
 nopam: $(GRADM_BIN) grlearn

Appendix B

Kernel patch
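The function `install_sc_whitelist_filter()` in the patch below assembles a classic BPF program for seccomp: it checks the architecture, loads the system call number, returns `SECCOMP_RET_ALLOW` for each whitelisted number and ends in `SECCOMP_RET_KILL`. The same layout can be sketched in user space with the `BPF_STMT` and `BPF_JUMP` macros from `<linux/filter.h>`. The sketch assumes Linux UAPI headers are available; `build_whitelist` is a hypothetical helper that only builds the instruction array and does not install it.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: write a whitelist filter for `count` system call
 * numbers into `out`. Returns the number of instructions written
 * (count * 2 + 5), or 0 if `out` is too small.
 */
static size_t build_whitelist(const unsigned int *syscalls, size_t count,
			      struct sock_filter *out, size_t out_len)
{
	size_t need = count * 2 + 5;
	size_t n = 0;
	size_t i;

	if (out_len < need)
		return 0;

	/* Load the architecture and kill unless it is x86-64. */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			offsetof(struct seccomp_data, arch));
	out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			AUDIT_ARCH_X86_64, 1, 0);
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

	/* Load the system call number. */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			offsetof(struct seccomp_data, nr));

	/* One compare-and-allow pair per whitelisted system call. */
	for (i = 0; i < count; i++) {
		out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
				syscalls[i], 0, 1);
		out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
				SECCOMP_RET_ALLOW);
	}

	/* Anything that did not match is killed. */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

	return n;
}
```

A user-space process would normally install such a program through `prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)` followed by `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)`, where `prog` is a `struct sock_fprog` pointing at the array. The patch differs in that it installs the filter from inside the kernel on behalf of the task.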

diff -rupN linux-4.4.6/grsecurity/gracl.c linux-4.4.6-final/grsecurity/gracl.c
--- linux-4.4.6/grsecurity/gracl.c	2016-06-06 20:55:02.674280689 -0500
+++ linux-4.4.6-final/grsecurity/gracl.c	2016-06-07 10:12:02.528419115 -0500
@@ -40,6 +40,14 @@
 #include
 #include

+#include
+#include
+#include
+#include
+#include
+
+#define X32_SYSCALL_BIT 0x40000000
+
 #define FOR_EACH_ROLE_START(role) \
 	role = running_polstate.role_list; \
 	while (role) {
@@ -48,6 +56,11 @@
 		role = role->prev; \
 	}

+#define arch_nr (offsetof(struct seccomp_data, arch))
+
+#define SYSCALL_TO_INDEX(x) ((x) >> 5) /* 1 << 5 == bits in __u32 */
+#define SYSCALL_TO_MASK(x) (1U << ((x) & 31)) /* mask for indexed __u32 */
+
 extern struct path gr_real_root;

 static struct gr_policy_state running_polstate;
@@ -1200,6 +1213,109 @@ gr_set_proc_res(struct task_struct *task
 	return;
 }

+static int
+read_syscalls(kernel_syscall_t sc_mask, int *syscalls)
+{
+	int i;
+	int syscall_count = 0;
+
+	for (i = 0; i < _KERNEL_SYSTEMCALL_U32S * 32; i++)
+		if (sc_mask.syscall[SYSCALL_TO_INDEX(i)] & 1U << (31 & i)) {
+			syscalls[syscall_count] = i;
+			syscall_count++;
+		}
+
+	return syscall_count;
+}
+
+static void
+install_sc_whitelist_filter(kernel_syscall_t sc_mask, int t_arch, int f_errno)
+{
+	long err;
+	int i, j;
+	int syscalls[_KERNEL_SYSTEMCALL_U32S * 32];
+	int syscall_count = read_syscalls(sc_mask, syscalls);
+	int filter_size = syscall_count * 2 + 5;
+	struct sock_filter filter[filter_size];
+
+	struct sock_fprog prog;
+
+	/* VALIDATE_ARCHITECTURE */
+	filter[0].code = (unsigned short)(BPF_LD | BPF_W | BPF_ABS);
+	filter[0].jt = 0;
+	filter[0].jf = 0;
+	filter[0].k = arch_nr;
+	filter[1].code = (unsigned short)(BPF_JMP | BPF_JEQ | BPF_K);
+	filter[1].jt = 1;
+	filter[1].jf = 0;
+	filter[1].k = t_arch;
+	filter[2].code = (unsigned short)(BPF_RET | BPF_K);
+	filter[2].jt = 0;
+	filter[2].jf = 0;
+	filter[2].k = SECCOMP_RET_KILL;
+	/* EXAMINE_SYSCALL */
+	filter[3].code = (unsigned short)(BPF_LD | BPF_W | BPF_ABS);
+	filter[3].jt = 0;
+	filter[3].jf = 0;
+	filter[3].k = offsetof(struct seccomp_data, nr);
+
+	for (i = 0, j = 4; i < syscall_count ; i++) {
+		/* ALLOW_SYSCALL */
+		filter[i + j].code = (unsigned short)(BPF_JMP | BPF_JEQ | BPF_K);
+		filter[i + j].jt = 0;
+		filter[i + j].jf = 1;
+		filter[i + j++].k = syscalls[i];
+		filter[i + j].code = (unsigned short)(BPF_RET | BPF_K);
+		filter[i + j].jt = 0;
+		filter[i + j].jf = 0;
+		filter[i + j].k = SECCOMP_RET_ALLOW;
+	}
+
+	/* KILL_PROCESS */
+	filter[filter_size - 1].code = (unsigned short)(BPF_RET | BPF_K);
+	filter[filter_size - 1].jt = 0;
+	filter[filter_size - 1].jf = 0;
+	filter[filter_size - 1].k = SECCOMP_RET_KILL;
+
+	prog.len = (unsigned short)(sizeof(filter)/sizeof(filter[0]));
+	prog.filter = filter;
+
+	err = sys_seccomp(SECCOMP_SET_MODE_FILTER | SECCOMP_SET_MODE_KERNEL, 0, (const char*) &prog);
+	if (err)
+		printk(KERN_CRIT "sys_seccomp failed: %ld\n", err);
+
+	return;
+}
+
+static void
+gr_set_proc_sc(struct task_struct *task)
+{
+	struct acl_subject_label *proc;
+
+	proc = task->acl;
+
+	sys_prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+
+	install_sc_whitelist_filter(proc->syscall_mask, AUDIT_ARCH_X86_64, 1);
+
+	return;
+}
+
+static void
+gr_set_proc_ns(struct task_struct *task)
+{
+	struct acl_subject_label *proc;
+	int err = 0;
+
+	proc = task->acl;
+
+	err = sys_unshare(proc->namespaces);
+	if (err)
+		printk(KERN_CRIT "sys_unshare failed: 0x%08X: Error %d\n", proc->namespaces, err);
+
+	return;
+}
+
 /* both of the below must be called with
 	rcu_read_lock();
 	read_lock(&tasklist_lock);
@@ -1888,6 +2004,8 @@ skip_check:
 	task->is_writable = 1;

 	gr_set_proc_res(task);
+	gr_set_proc_ns(task);
+	gr_set_proc_sc(task);

 #ifdef CONFIG_GRKERNSEC_RBAC_DEBUG
 	printk(KERN_ALERT "Set subject label for (%s:%d): role:%s, subject:%s\n", task->comm, task_pid_nr(task), task->role->rolename, task->acl->filename);
diff -rupN linux-4.4.6/include/linux/filter.h linux-4.4.6-final/include/linux/filter.h
--- linux-4.4.6/include/linux/filter.h	2016-03-16 10:43:17.000000000 -0500
+++ linux-4.4.6-final/include/linux/filter.h	2016-06-07 01:19:30.949206424 -0500
@@ -443,6 +443,8 @@ typedef int (*bpf_aux_classic_check_t)(s
 int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 			      bpf_aux_classic_check_t trans, bool save_orig);
+int bpf_prog_create_from_kernel(struct bpf_prog **pfp, struct sock_fprog *fprog,
+			      bpf_aux_classic_check_t trans, bool save_orig);
 void bpf_prog_destroy(struct bpf_prog *fp);

 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
diff -rupN linux-4.4.6/include/linux/gracl.h linux-4.4.6-final/include/linux/gracl.h
--- linux-4.4.6/include/linux/gracl.h	2016-06-06 20:55:02.690947365 -0500
+++ linux-4.4.6-final/include/linux/gracl.h	2016-06-07 01:19:31.735873518 -0500
@@ -45,6 +45,8 @@ enum {

 #define GR_NLIMITS 32

+#define _KERNEL_SYSTEMCALL_U32S 12
+
 /* Begin Data Structures */

 struct sprole_pw {
@@ -99,6 +101,10 @@ struct gr_hash_struct {
 	int type;
 };

+typedef struct kernel_syscall_struct {
+	__u32 syscall[_KERNEL_SYSTEMCALL_U32S];
+} kernel_syscall_t;
+
 /* Userspace Grsecurity ACL data structures */

 struct acl_subject_label {
@@ -138,6 +144,11 @@ struct acl_subject_label {
 	struct acl_object_label **obj_hash;
 	__u32 obj_hash_size;
 	__u16 pax_flags;
+
+	kernel_syscall_t syscall_mask;
+	kernel_syscall_t syscall_drop;
+
+	__u32 namespaces;
 };

 struct role_allowed_ip {
diff -rupN linux-4.4.6/include/uapi/linux/seccomp.h linux-4.4.6-final/include/uapi/linux/seccomp.h
--- linux-4.4.6/include/uapi/linux/seccomp.h	2016-03-16 10:43:17.000000000 -0500
+++ linux-4.4.6-final/include/uapi/linux/seccomp.h	2016-06-07 01:19:31.735873518 -0500
@@ -13,6 +13,7 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT 0
 #define SECCOMP_SET_MODE_FILTER 1
+#define SECCOMP_SET_MODE_KERNEL 2

 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC 1
diff -rupN linux-4.4.6/kernel/seccomp.c linux-4.4.6-final/kernel/seccomp.c
--- linux-4.4.6/kernel/seccomp.c	2016-03-16 10:43:17.000000000 -0500
+++ linux-4.4.6-final/kernel/seccomp.c	2016-06-07 01:19:32.425873898 -0500
@@ -343,7 +343,7 @@ static inline void seccomp_sync_threads(
  *
  * Returns filter on success or an ERR_PTR on failure.
  */
-static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
+static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog, bool from_kernel)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
@@ -370,7 +370,11 @@ static struct seccomp_filter *seccomp_pr
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);

-	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
+	if (from_kernel)
+		ret = bpf_prog_create_from_kernel(&sfilter->prog, fprog,
+					seccomp_check_filter, save_orig);
+	else
+		ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
 		kfree(sfilter);
@@ -405,7 +409,7 @@ seccomp_prepare_user_filter(const char _
 #endif
 	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
 		goto out;
-	filter = seccomp_prepare_filter(&fprog);
+	filter = seccomp_prepare_filter(&fprog, false);
 out:
 	return filter;
 }
@@ -762,7 +766,7 @@ out:
  * Returns 0 on success or -EINVAL on failure.
  */
 static long seccomp_set_mode_filter(unsigned int flags,
-				    const char __user *filter)
+				    const char __user *filter, bool from_kernel)
 {
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
@@ -773,7 +777,10 @@ static long seccomp_set_mode_filter(unsi
 		return -EINVAL;

 	/* Prepare the new filter before holding any locks. */
-	prepared = seccomp_prepare_user_filter(filter);
+	if (from_kernel)
+		prepared = seccomp_prepare_filter(filter, from_kernel);
+	else
+		prepared = seccomp_prepare_user_filter(filter);
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);

@@ -807,7 +814,7 @@ out_free:
 }
 #else
 static inline long seccomp_set_mode_filter(unsigned int flags,
-					   const char __user *filter)
+					   const char __user *filter, bool from_kernel)
 {
 	return -EINVAL;
 }
@@ -817,13 +824,16 @@ static inline long seccomp_set_mode_filt
 static long do_seccomp(unsigned int op, unsigned int flags,
 		       const char __user *uargs)
 {
-	switch (op) {
+	switch (op & (1 << 0)) {
 	case SECCOMP_SET_MODE_STRICT:
 		if (flags != 0 || uargs != NULL)
 			return -EINVAL;
 		return seccomp_set_mode_strict();
 	case SECCOMP_SET_MODE_FILTER:
-		return seccomp_set_mode_filter(flags, uargs);
+		if (op & (1 << 1))
+			return seccomp_set_mode_filter(flags, uargs, true);
+		else
+			return seccomp_set_mode_filter(flags, uargs, false);
 	default:
 		return -EINVAL;
 	}
diff -rupN linux-4.4.6/net/core/filter.c linux-4.4.6-final/net/core/filter.c
--- linux-4.4.6/net/core/filter.c	2016-06-06 20:55:02.754280731 -0500
+++ linux-4.4.6-final/net/core/filter.c	2016-06-07 01:19:33.225874333 -0500
@@ -1137,6 +1137,46 @@ int bpf_prog_create_from_user(struct bpf
 }
 EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);

+int bpf_prog_create_from_kernel(struct bpf_prog **pfp, struct sock_fprog *fprog,
+			      bpf_aux_classic_check_t trans, bool save_orig)
+{
+	unsigned int fsize = bpf_classic_proglen(fprog);
+	struct bpf_prog *fp;
+	int err;
+
+	/* Make sure new filter is there and in the right amounts. */
+	if (fprog->filter == NULL)
+		return -EINVAL;
+
+	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
+	if (!fp)
+		return -ENOMEM;
+
+	memcpy(fp->insns, fprog->filter, (size_t) fsize);
+
+	fp->len = fprog->len;
+	fp->orig_prog = NULL;
+
+	if (save_orig) {
+		err = bpf_prog_store_orig_filter(fp, fprog);
+		if (err) {
+			__bpf_prog_free(fp);
+			return -ENOMEM;
+		}
+	}
+
+	/* bpf_prepare_filter() already takes care of freeing
+	 * memory in case something goes wrong.
+	 */
+	fp = bpf_prepare_filter(fp, trans);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	*pfp = fp;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bpf_prog_create_from_kernel);
+
 void bpf_prog_destroy(struct bpf_prog *fp)
 {
 	__bpf_prog_release(fp);

Bibliography

[1] “Linux Growth Rate, Trends,” http://geeknizer.com/linux-growth-rate-trends- visualized-infographic/. [2] “Most Reliable Hosting Company Sites in April 2016,” http://news.netcraft.com/archives/2016/05/04/most-reliable-hosting-company-sites- in-april-2016.html, April 2016. [3] “Chromebooks outsold Macs for the first time in the US,” http://www.theverge.com/2016/5/19/11711714/chromebooks-outsold-macs-us-idc- figures, May 2016. [4] “PC Shipment Decline Continued in First Quarter as Expected, with Hopes for Im- provement Depending on Commercial Replacements & Economic Stability, According to IDC,” https://www.idc.com/getdoc.jsp?containerId=prUS41176916, May 2016. [5] “Gartner Says Worldwide Smartphone Sales Grew 9.7 Percent in Fourth Quarter of 2015,” https://www.gartner.com/newsroom/id/3215217, Feburary 2016. [6] “Symantec Internet Security Threat Report,” http://csrc.nist.gov/publications/history/dod85.pdf, April 2016. [7] “Department of Defense Trusted Computer System Evaluation Criteria,” http://csrc.nist.gov/publications/history/dod85.pdf, December 1985. [8] J. H. Saltzer, “Protection and the Control of Information Sharing in Multics,” Commun. ACM, vol. 17, no. 7, pp. 388–402, Jul. 1974. [Online]. Available: http://doi.acm.org/10.1145/361011.361067 [9] M. Kerrisk, The Linux Programming Interface - A Linux and UNIX System Program- ming Handbook. No Starch Press, 2009. [10] A. Tanenbaum, Sistemas Operativos Modernos. Prentice Hall, 2009. [11] J. Saltzer and M. D. Schroeder, “The protection of information in computer systems,” Proceedings of the IEEE, 1975. [12] M. Howard, J. Pincus, and J. Wing, “Measuring relative attack surfaces,” in Workshop Advanced Developments in Software and Systems Security, 2003. [13] W. E. Boebert and R. Y. Kain, “A practical alternative to hierarchical integrity poli- cies,” in Proceedings of the 8th National Conference, 1985.

59 BIBLIOGRAPHY 60

[14] “Linux namespaces reference manual,” http://man7.org/linux/man- pages/man7/namespaces.7.html.

[15] “Seccomp documentation,” https://www.kernel.org/doc/Documentation/prctl/seccomp filter.txt.

[16] “Introduction to unprivileged containers,” https://www.stgraber.org/2014/01/17/lxc- 1-0-unprivileged-containers/.

[17] “Explaining Chrome’s Linux Sandbox,” http://www.insanitybit.com/2013/04/29/explaining- chromes-linux-sandbox/.

[18] “Security/Sandbox,” https://wiki.mozilla.org/Sandbox.

[19] “Firejail project website,” https://l3net.wordpress.com/projects/firejail/.

[20] “Mbox project website,” http://pdos.csail.mit.edu/mbox/.

[21] “Android Interfaces and Architecture,” https://source.android.com/devices/index.html.

[22] “Alternative Distribution Options,” https://developer.android.com/distribute/tools/open- distribution.html.

[23] “Protect against harmful apps,” https://support.google.com/nexus/answer/2812853?hl=en.

[24] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman, “Linux security modules: General security support for the linux kernel,” in Proceedings of the 11th USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2002, pp. 17–31. [Online]. Available: http://dl.acm.org/citation.cfm?id=647253.720287

[25] S. Smalley, “Configuring the SELinux policy,” NSA, Tech. Rep., 2002.

[26] R. N. M. Watson, J. Anderson, B. Laurie, and K. Kennaway, “A taste of capsicum: Practical capabilities for unix,” Commun. ACM, vol. 55, no. 3, pp. 97–104, Mar. 2012. [Online]. Available: http://doi.acm.org/10.1145/2093548.2093572

[27] “Grsecurity project website,” https://grsecurity.net/.

[28] “Why doesn’t grsecurity use LSM?” https://grsecurity.net/lsm.php.

[29] R. N. M. Watson, J. Anderson, B. Laurie, and K. Kennaway, “Capsicum: practical capabilities for unix,” in Proceedings of the 19th USENIX Security Symposium, 2010.

[30] N. Provos, “Improving Host Security with System Call Policies,” in Proceedings of the 12th Conference on USENIX Security Symposium - Volume 12, ser. SSYM’03. Berkeley, CA, USA: USENIX Association, 2003, pp. 18–18. [Online]. Available: http://dl.acm.org/citation.cfm?id=1251353.1251371

[31] “Oz project home,” https://github.com/subgraph/oz.

[32] “Qubes OS project website,” https://www.qubes-os.org/.

[33] “Linux Containers project website,” https://linuxcontainers.org/. BIBLIOGRAPHY 61

[34] “Hackfest Conference Pledge presentation slides,” http://www.openbsd.org/papers/hackfest2015- pledge/index.html, 2015.

[35] “Pledge reference manual,” http://man.openbsd.org/cgi-bin/man.cgi/OpenBSD- current/man2/pledge.2.

[36] “Introducing Qubes OS,” http://blog.invisiblethings.org/2010/04/07/introducing- qubes-os.html.

[37] “Resource management: Linux kernel Namespaces and cgroups,” https://www.cs.ucsb.edu/˜rich/class/cs290-cloud/papers/lxc-namespace.pdf.

[38] “Linux-VServer project website,” http://linux-vserver.org/.

[39] “Linux-VServer Security - Secure Capabilities,” http://linux- vserver.org/Paper#Secure Capabilities.

[40] “OpenVZ project website,” http://openvz.org/.

[41] “Checkpointing and live migration,” https://wiki.openvz.org/Checkpointing and live migration.

[42] “Quick installation - Kernel installation,” https://wiki.openvz.org/Quick installation.

[43] “Virtuozzo website,” http://www.odin.com/products/virtuozzo/.

[44] “Comparison,” https://openvz.org/Comparison.

[45] “Frequently Asked Questions - Licensing,” http://www.odin.com/products/virtuozzo/#tab4.

[46] “systemd-nspawn reference manual,” http://www.freedesktop.org/software/systemd/man/systemd- nspawn.html.

[47] “Free/Open Source Operating systems without systemd in the default installation,” http://without-systemd.org/wiki/index.php/Main Page.

[48] “Docker project website,” https://www.docker.com/.

[49] “Amazon EC2 Installation,” https://docs.docker.com/engine/installation/amazon/.

[50] “jenkins-docker-plugin,” https://github.com/georgebashi/jenkins-docker-plugin.

[51] “OpenStack-Docker: How to manage your Linux Containers with Nova,” https://blog.docker.com/2013/06/openstack-docker-manage-linux-containers-with- /.

[52] “What is Docker?” https://www.docker.com/what-docker.

[53] “Security Risks and Benefits of Docker Application Containers,” https://zeltser.com/security-risks-and-benefits-of-docker-application/.

[54] S. H. Eskildsen, “Why Docker is Not Yet Succeeding Widely in Production,” http://sirupsen.com/production-docker/, Jul. 2015. BIBLIOGRAPHY 62

[55] “PIP project website,” http://www.seclab.cs.sunysb.edu/seclab/pip/.

[56] W. K. Sze and R. Sekar, “Comprehensive integrity protection for desktop linux,” in Proceedings of the 19th ACM Symposium on Access Control Models and Technologies, ser. SACMAT ’14. New York, NY, USA: ACM, 2014, pp. 89–92. [Online]. Available: http://doi.acm.org/10.1145/2613087.2613112

[57] “SELinux project website,” http://selinuxproject.org/.

[58] “SELinux Global Configuration Files,” http://selinuxproject.org/page/GlobalConfigurationFiles.

[59] “Kernel Korner - Filesystem Labeling in SELinux,” http://www.linuxjournal.com/node/7426/print.

[60] “AppArmor project website,” http://apparmor.net/.

[61] R. Spenneberg, “Shutting out intruders with AppArmor - PROTECTIVE ARMOR,” Linux Magazine, August 2006.

[62] “AppArmor,” http://www.softpanorama.org/Commercial_linuxes/Suse/Security/apparmor.shtml.

[63] “SELinux & AppArmor - Comparison of Secure OSes,” http://elinux.org/images/3/39/SecureOS_nakamura.pdf.

[64] “apparmor.d ref. manual,” http://manpages.ubuntu.com/manpages/hardy/man5/apparmor.d.5.html.

[65] “Smack project website,” http://schaufler-ca.com/.

[66] “Description from the Linux source tree,” http://schaufler-ca.com/description_from_the_linux_source_tree.

[67] “TOMOYO Linux project website,” http://tomoyo.osdn.jp/.

[68] “AKARI/TOMOYO functionality comparison table,” http://akari.osdn.jp/comparison.html.

[69] “AKARI project website,” http://akari.osdn.jp/.

[70] “Yama documentation,” https://www.kernel.org/doc/Documentation/security/Yama.txt.

[71] “Linux Intrusion Detection System project website,” http://www.lids.org/.

[72] “Linux Intrusion Detection System FAQ,” https://roedie.nl/lids-faq/lids-faq.html.

[73] “Yet another new approach to seccomp,” https://lwn.net/Articles/475043/.

[74] “Linux Capabilities reference manual,” http://man7.org/linux/man-pages/man7/capabilities.7.html.

[75] “The meaning of Posix.1e,” http://wt.tuxomania.net/publications/posix.1e/.

[76] “RSBAC project website,” https://www.rsbac.org/.

[77] “What is RSBAC?” https://www.rsbac.org/why.

[78] “PaX project website,” https://pax.grsecurity.net/.

[79] “PaX - the main document, overall description,” https://pax.grsecurity.net/docs/pax.txt.

[80] “PaX - address space layout randomization,” https://pax.grsecurity.net/docs/aslr.txt.

[81] “Grsecurity Features,” https://grsecurity.net/features.php.

[82] “Why RSBAC does not use LSM,” https://www.rsbac.org/documentation/why_rsbac_does_not_use_lsm, May 2006.

[83] “Important Notice Regarding Public Availability of Stable Patches,” https://grsecurity.net/announce.php.

[84] “Openwall GNU/*/Linux project website,” http://openwall.com/Owl/.

[85] “Systrace project website,” http://www.systrace.org/.

[86] “Capsicum project website,” https://www.cl.cam.ac.uk/research/security/capsicum/.

[87] “NUSE project website,” https://libos-nuse.github.io/.

[88] “Linux PID namespaces reference manual,” http://man7.org/linux/man-pages/man7/pid_namespaces.7.html.

[89] “Linux user namespaces reference manual,” http://man7.org/linux/man-pages/man7/user_namespaces.7.html.

[90] “How containers work in Linux,” http://mirrors.dotsrc.org/fosdem/2016/k1105/how-containers-work-in-linux.mp4.

[91] “Seccomp reference manual,” http://man7.org/linux/man-pages/man2/seccomp.2.html.

[92] “About — Alpine Linux,” http://alpinelinux.org/about/.

[93] “Subgraph - Hardening,” https://subgraph.com/sgos/hardening/index.en.html.

[94] “Debian Package Tracking System - linux-grsec,” https://packages.qa.debian.org/l/linux-grsec.html.

[95] “Hardened/Grsecurity2 Quickstart,” https://wiki.gentoo.org/wiki/Hardened/Grsecurity2_Quickstart.

[96] “The Truth about Linux 4.6,” https://forums.grsecurity.net/viewtopic.php?t=4476&p=16309.

[97] ““Which is better, grsecurity or SELinux?”,” https://grsecurity.net/compare.php.

[98] “Kernel Self Protection Project,” http://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project.

[99] “Grsecurity documentation,” https://en.wikibooks.org/wiki/Grsecurity.

[100] “User namespaces not used?” https://github.com/subgraph/oz/issues/11#issuecomment-163396758.