RICE UNIVERSITY

Safe and Secure Subprocess Virtualization in Userspace

By

Bumjin Im

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE

Doctor of Philosophy

APPROVED, THESIS COMMITTEE

Nathan Dautenhahn (Aug 12, 2021 19:01 CDT)
Assistant Professor of Computer Science

Ang Chen (Aug 12, 2021 16:01 CDT)
Assistant Professor of Computer Science

Dan Wallach (Aug 12, 2021 16:05 CDT)

Professor of Computer Science and of Electrical and Computer Engineering

Kaiyuan Yang (Aug 12, 2021 16:32 CDT)

Assistant Professor of Electrical and Computer Engineering

HOUSTON, TEXAS
August 2021

ABSTRACT

Safe and Secure Subprocess Virtualization in Userspace

by

Bumjin Im

Commodity operating systems isolate applications at the process boundary, and developers build applications on this principle. However, applications cannot simply trust process-based isolation. Virtually every application links at least one dynamic library at runtime, and those libraries share all resources within the same process boundary. Unfortunately, application developers do not fully understand the libraries they use, and for some complex applications doing so is infeasible. If a single malicious or buggy library is linked into an application, it can breach the entire application, because everything inside the process boundary shares the same privileges. Since process-based isolation is likely to remain the norm for some time, achieving least privilege is difficult. We propose a new process model, Endokernel, to resolve this issue. Endokernel places a monitor inside a standard process on a commodity operating system and provides safe isolation between subprocesses, their maintenance, and secure interactions between them. Endokernel also proposes an endoprocess virtualization technique; endoprocess virtualization realizes a more fine-grained least-privilege principle in commodity computing environments. We develop Intravirt as a prototype of Endokernel. Intravirt realizes the Endokernel model on Intel CPUs and Linux by actively using Intel® Memory Protection Keys (MPK) and Control-flow Enforcement Technology (CET) as its core security mechanisms. Because MPK and CET are hardware mechanisms, Intravirt achieves secure and high-performance endoprocess virtualization. We then evaluate the security and performance of Intravirt with microbenchmarks and real applications across several secure-computing use cases. Throughout this research, we verify that Endokernel is a feasible, lightweight, applicable, and effective security model.

Acknowledgments

It was a reckless decision for a middle-aged man to start an advanced academic degree in a foreign country, in a foreign language, after resigning from a well-paid and recently promoted job. Indeed, few people understood this decision, and many said it was a mistake. Nevertheless, I started a new life in Houston, Texas, becoming a student again after 13 years, earned a master's degree, published a conference paper, and finally received a Ph.D. This achievement would have been impossible without the enormous help and support of many people. Without them, there would have been no research, no conference paper, no admission to the university; I would never even have been able to dream about this. Professor Dan Wallach guided me into the Ph.D. program at Rice University. Without him, I would never have thought of applying to Rice. Instead of rushing me to finish my coursework quickly, he gave me enough time to settle into the new culture. He also gave me invaluable advice as a father, a neighbor, and a teacher, which helped me so much in carrying out the program and supporting my family. Lastly, when I decided to change advisors, he did not hesitate to allow and support my new decision, so the lost momentum of my research was able to grow again. It was the beginning of my 5th year when I decided to join Nathan's group. I was in my mid-40s, had a family, and my background knowledge was different from the group's research projects. It was therefore a risky gamble for him to admit me as his student. However, he welcomed me without hesitation and supported me in making that decision. He also understood and waited patiently through months of a distracting working environment and slow progress, caused by family obligations during the pandemic and my lack of background knowledge. Without Professor Nathan Dautenhahn, I would have stopped the program during my 5th year. I am sure he thought hard about admitting me as the first graduate student of his academic career.
He must also have been anxious about the research after admitting me, and I appreciate his endless patience while waiting for my research progress. Mr. Hyunjin Choi became my boss about ten years after I started working at Samsung, and working with him was an auspicious event for me. He tried to make the most rational and practical decisions, and he always tried to reduce unnecessary burdens on my work. He gave me his best advice not only on projects but also on career and personal issues; he was not simply a boss but a teacher in my life. A few years into working with him, when I was frustrated about continuing my career at Samsung and in Korea, his advice was to consider an advanced academic degree abroad and to develop a new career there, instead of telling me to work with him forever. An ordinary manager would tell his coworker to stay with sweet promises like promotion, but he guided me toward a different career path and chose to let me go. He is truly one of the people who influenced my life. Fangfei Yang is my lucky elf in this research. At the beginning of the research, I could not code in assembly and had no detailed knowledge of low-level code or hardware; the only thing I had was the research idea. His deep knowledge of low-level operating systems and hardware, and his never-decreasing passion, kept the research rolling all the time and kept injecting even more fascinating ideas into it. I admire him as a fellow student and appreciate his efforts very much; without his contributions, the research could have stalled at any time. Daniel Song joined Rice University two years earlier than I did, also working with Professor Dan Wallach, and he is Korean. He gave me enormous help and tips for surviving in a foreign country without trouble, and he kept in touch with my family as well, becoming an uncle to my kids. He still gives me even more tips and help about graduation and career paths, as well as stories of his own mistakes.
He spent a noticeable amount of his time and resources on my family and me, so that I could start life in a foreign country without hassle and my children gained an uncle. Lastly, I have to say thank you to my family. Most of all, my wife gave up all the privileges and assets she possessed and simply followed me; I appreciate her sacrifice, and I also feel deeply sorry toward her. Her husband was a recognized employee at Samsung, her children enjoyed their school life, and there was no trouble on the horizon, so no one else supported my decision to go abroad for this program. Yet she supported me from the moment I started thinking about the Ph.D. program at Rice University, and she still perseveres in a foreign country with only her immediate family around her. She also makes an endless effort to support my program and to get us through this pandemic. She is, without a doubt, the cornerstone of my life. I can clearly recall my children's first day of school in Houston. They were dropped into unfamiliar schools, could not understand English at all, faced a completely different culture, and had no friends. But they did not complain about the new schools, and fortunately they adapted quickly. The pandemic has kept my kids stuck at home all the time, but they still do not complain, and they keep doing what they need to do. I really appreciate my adorable kids.

Contents

1 Introduction
  1.1 Ideal Solution: Use Safe Languages for Everything
  1.2 Straightforward Solution: More Process Separations
  1.3 Efficient Solution: Subprocess Isolations
  1.4 Problems in Subprocess Isolation
  1.5 Endokernel: Safe Subprocess Isolation in Commodity OS
  1.6 Contributions

2 Subprocess Isolations and System Call Virtualizations
  2.1 Subprocess Separation
    2.1.1 Language Based Separation
    2.1.2 Operating System Based Separation
    2.1.3 Hardware Accelerated Separation
  2.2 System Call and Signal Virtualization
    2.2.1 Linux Security Module
    2.2.2 System Call Filtering
    2.2.3 System Call Tracing and Interposition

3 Threats
  3.1 Unauthorized Memory Access
  3.2 Unauthorized File Access
  3.3 Unauthorized System Call Execution
  3.4 Attack on Subprocess Isolation: PKU Pitfall

4 Endokernel Architecture
  4.1 Assumptions
  4.2 Requirements
  4.3 Mechanism Gaps and Challenges
  4.4 Endoprocess Model
  4.5 Design Principles
  4.6 Authority Model
  4.7 Nested Endokernel Organization
    4.7.1 In-Process Policy
    4.7.2 Interface
  4.8 Separation Facilities: Nested Boxing
  4.9 Intel® Memory Protection Key

5 Design and Implementation
  5.1 Privilege and Memory Virtualization
    5.1.1 Virtual Privilege Switch
    5.1.2 Securing the Domain Switch
    5.1.3 Instruction Capabilities
    5.1.4 Controlling Mode Switches
  5.2 System Call Monitor and Handling
    5.2.1 Passthrough
    5.2.2 No Syscall from Untrusted Domain Subspaces
    5.2.3 Complete Mediation for Mapped Syscalls
  5.3 OS Object Virtualization
    5.3.1 Sensitive but Unvirtualized System Calls
    5.3.2 Files
    5.3.3 Mappings
    5.3.4 Processes
    5.3.5 Forbidden System Calls
  5.4 Signal Virtualization
    5.4.1 Signals for Ephemeral System Call Trampoline
    5.4.2 Multithreading Design
    5.4.3 CET
    5.4.4 Multiple Subdomains
  5.5 Multi-threading and Concurrency
    5.5.1 Concurrency in Subprocess Isolation
    5.5.2 Multithreading Model
    5.5.3 Thread Local Data Structure
    5.5.4 Required Atomicity
    5.5.5 sysret-gadget Race Condition
    5.5.6 Clone
    5.5.7 Multi-Domain
  5.6 Implementation Details

6 Use Cases
  6.1 Library Isolation
    6.1.1 Reference Application: zlib
    6.1.2 Safeboxing OpenSSL in NGINX
  6.2 Module Sandboxing
    6.2.1 Sandboxing HTTP Parser in NGINX
    6.2.2 Preventing sudo Privilege Escalation
  6.3 Endo-process System Call Policy Enhancement
    6.3.1 NGINX Private Key File Protection
    6.3.2 Directory Protection

7 Evaluation
  7.1 Security Evaluation
    7.1.1 Fake Signal
    7.1.2 Fork Bomb
    7.1.3 Syscall Argument Abuse
    7.1.4 Race Condition Using Shared Memory
    7.1.5 TSX Attack
    7.1.6 Race Condition Using Multithreading
  7.2 Performance Evaluation
    7.2.1 Microbenchmarks
    7.2.2 Macrobenchmarks
  7.3 Performance Evaluation of the Use Cases
    7.3.1 zlib
    7.3.2 Safeboxing OpenSSL and Sandboxing Parser in NGINX
    7.3.3 File and Directory Protection

8 Conclusion and Future Work

References

List of Figures

1.1 Problems of privilege separation approaches

4.1 Intravirt Architecture

5.1 Signal Entrypoint
5.2 State Transition with Signal; UT: Untrusted; T: Trusted; Sig: Signal Handler, Signal Masked by Kernel; Smi: Semi-Trusted Domain

7.1 System call latency of LMBench benchmark
7.2 Normalized latency of reading a 40MB file
7.3 Latency of getppid for different rerandomization scaling
7.4 Random read bandwidth for different numbers of threads, measured with sysbench
7.5 Normalized overhead of different Linux applications
7.6 Normalized overhead of isolated zlib
7.7 Normalized throughput of privilege-separated NGINX using TLS v1.2 with ECDHE-RSA-AES128-GCM-SHA256, 2048, 128
7.8 System call latency of LMBench benchmark with different protection methodologies
7.9 Normalized throughput of NGINX downloading a 64KB file for different private key protection methodologies
7.10 Normalized latency of zip for different file protection methodologies

List of Tables

7.1 Quantitative security analysis based on attacks demonstrated in [1] and attacks found by us. ◦ indicates the variant of Intravirt in this column is vulnerable, • that it prevents this attack. × indicates this attack is beyond Intravirt's threat model.
7.2 Performance overhead of the zlib test due to CET. No Intravirt involved.
7.3 xcall count for different file sizes in the test scenarios, including startup of the process.

Chapter 1

Introduction

Recently, security has become one of the most critical requirements in computing. Modern operating systems (OSes), such as Windows, Unix, Linux, FreeBSD, and Mac OS X, provide various mechanisms and abstractions for security and continuously introduce new mechanisms to protect the system from ever more advanced attacks. To provide this security abstraction, they use the process as the unit of security management. Because of this, all code in a single process shares the same privilege level, memory, and files. Therefore, if a small part of a process is compromised, the entire process is affected.

The current security architecture is not entirely wrong: any security architecture needs some unit of isolation. If the unit is too small, application development becomes challenging; if the unit is too big, serious security issues arise, so this granularity trade-off is present in any computing environment. However, modern applications are complicated: they are feature-rich, visually polished, and require many common facilities such as security. Due to this complexity and these massive requirements, application developers cannot develop every feature from scratch, so using third-party libraries is very common. For example, the Linux version of the Google Chrome web browser links about 100 libraries. Those libraries mostly implement common but labor-intensive functions, such as cryptography, mathematical routines, and 3D graphics. As a result, a single bug in any one of those libraries can result in a full breach of the victim application process and of the system itself. Moreover, this trend shows no sign of stopping any time soon.

There are several real-world instances of this problem. The most famous incident is HeartBleed [2]. HeartBleed is a vulnerability caused by a bug in the OpenSSL [3] library, the de facto standard for cryptography and secure communication. OpenSSL has been used by most Unix-based systems, including Linux and FreeBSD. The bug is very simple: a missing bounds check in the SSL heartbeat handler was exploitable, so a maliciously crafted heartbeat message could leak memory contents from the target system. The vulnerability was phenomenal in that a single simple bug affected millions of computers worldwide, because everyone used OpenSSL.

Libraries are not the only problem; each module should also be treated as a security unit, because every module in an application shares its privileges and resources with all other modules. Consider a web server: its HTTP parser module only requires access to the HTTP messages received from the network, and it needs no other resources or privileges. However, in the current architecture, any buggy HTTP parser module can lead to a complete compromise, giving the attacker full access to the web server and to the underlying system as well. CVE-2009-2629 [4], CVE-2013-2028 [5], and CVE-2013-2070 [6] show this type of vulnerability.

There is one more example of this case, CVE-2021-3156 [7]. Sudo [8] is a Unix utility that executes processes with root privilege. Usually, system administrators use sudo to gain root privilege to manage system configuration. To run a command-line utility as root, the administrator executes sudo with the target utility as its command-line argument. When sudo runs, it first asks the user for a password and continues execution only if the password is correct and the user is in the sudoers group. However, a bug in sudo's command-line parser module allowed an attacker to execute arbitrary code without any verification. This attack works because the command-line argument parser shares the privilege of the sudo utility, which is a setuid application.

In conclusion, we have to reconsider the problematic process-based privilege model in the complicated modern computing environment. If we can devise a new privilege model with the same applicability, finer granularity, and minor performance overhead, we could help the billions of people whose data is at risk.

1.1 Ideal Solution: Use Safe Languages for Everything

This type of issue is not new, and there have been enormous efforts with various approaches to solving the problem. The most ideal approach is to write code in a type-safe language, such as Java or Rust. By doing so, most memory corruption bugs disappear. However, most libraries are still developed in unsafe languages such as C and C++, and this trend does not seem likely to change any time soon. Also, even if everything were developed in safe languages, an application would still be vulnerable to intentionally malicious libraries.

1.2 Straightforward Solution: More Process Separations

A more affordable approach is to separate modules and libraries into different processes. It is the easiest approach to take and the most straightforward to apply, and many existing applications use it to preserve security, including mail servers, web servers, and even web browsers. Process separation relies on the separation facilities provided by the operating system and the hardware, which are already well proven and easy to apply. However, the application has to be redesigned to insert IPC routines wherever the separated processes must share data, introducing performance overhead. It also adds further overhead due to context switching between processes.

Google Chrome [9] is a web browser based on an open-source project developed by Google. The significant difference between traditional web browsers like Internet Explorer and Chrome is that Chrome runs a separate process for each opened tab. Chrome has two types of processes: the browser process handles I/O and the main loop of the browser application, and a renderer process handles content rendering for each tab. When a new tab is created, the browser process forks a renderer process; the renderer process receives HTTP data from the Internet via the browser process, renders the contents, and lets the browser process display the result on the screen. As a consequence, the memory footprint is significantly higher, and system performance degrades markedly when too many tabs are open, due to the massive amount of IPC and context switching. However, the performance overhead is not a serious issue in practice, because the bottleneck of web browsing is the speed of Internet traffic.

Postfix [10] is an email server developed at IBM to replace the obsolete sendmail email server. Sendmail was initially developed in the 1980s and performs all email-related functions, such as email transmission, inbox management, and user management, in a single root process. Due to its complexity, it has had many vulnerabilities, such as CA-1988-01, CA-1990-01, and CA-1994-12, and those small bugs led to complete breaches of the system because sendmail runs as a single root process. Postfix spreads this risk over multiple processes: it launches more than ten processes during startup, each module uses IPC to communicate with the others, and none of the processes runs as root. The performance overhead could be much more significant than sendmail's, but it is not a severe problem given the nature of email performance requirements.

More recently, a web server was released that is resistant to the HeartBleed attack. The H2O project [11] is architecturally a single-process, event-driven web server, but one small module is separated into its own process. H2O also uses OpenSSL [3] as its cryptographic and TLS library, but it isolates the private key operations in a separate process. Whenever the server requires a private key computation, it sends a request to the private key module, which returns only the results. The private key always stays in the memory of the separated module and never leaves it. Performance is a crucial requirement for a web server, and process separation could burden it; fortunately, the overall performance overhead is only 2%, because the private key is needed only during web session startup. However, H2O has drawbacks. First, it protects only the private key in memory: other resources in memory are not protected, and the private key file itself is not protected, so an attacker could simply open the file. Also, since only the private key module is separated, H2O has to use many low-level OpenSSL APIs to perform secure communication, even though OpenSSL itself provides one-line high-level APIs.

1.3 Efficient Solution: Subprocess Isolations

Subprocess isolation is a relatively new approach that separates resources within a single process, defines a policy for each compartment, and enforces access control inside that process. This approach requires relatively more effort than process separation, and there is almost no underlying operating system support. However, it is much faster than process separation, applicable to various operating systems and architectures, and can be optimized for particular applications.

The most common technique for subprocess isolation is Software Fault Isolation (SFI) [12], which compartmentalizes code, memory, and resources into two or more domains and monitors the interactions between them. There are multiple approaches to achieving this: modifying the compiler [13–17], extending the underlying operating system kernel [17–23], modifying userspace libraries [22–25], utilizing hardware features [20, 22–24], and, in some extreme cases, designing a new computing environment [23, 26, 27]. In this approach, the application runs in a single address space, so it requires neither kernel-level context switches nor performance-heavy IPC; it is therefore generally much faster than process separation. The applications of this approach are extensive, but the most famous one is securing the foreign function interface (FFI) [14, 15, 17, 28] between safe and unsafe languages, where the unsafe part is compartmentalized to prevent unauthorized memory access.
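The core SFI mechanism can be sketched as address masking: before every store, the untrusted code's pointer is coerced into its domain's address range. The snippet below is a simplified illustration, not any particular system's implementation; real SFI systems emit the mask at compile time or via binary rewriting, and the base and size constants here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DOMAIN_BASE 0x10000000UL   /* hypothetical sandbox base address */
#define DOMAIN_SIZE 0x00100000UL   /* 1 MiB region, a power of two */

/* Coerce any address into [DOMAIN_BASE, DOMAIN_BASE + DOMAIN_SIZE):
 * even a fully attacker-controlled pointer cannot escape the domain. */
static inline uintptr_t sfi_mask(uintptr_t addr)
{
    return DOMAIN_BASE | (addr & (DOMAIN_SIZE - 1));
}

/* Every store by the untrusted domain goes through the mask first. */
static inline void sandboxed_store(uintptr_t addr, uint8_t value)
{
    *(volatile uint8_t *)sfi_mask(addr) = value;
}
```

Because the region size is a power of two, the mask is a single AND and OR per memory access, which is why SFI's overhead can stay low even though every untrusted store is checked.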

1.4 Problems in Subprocess Isolation

Subprocess isolation is a novel approach to privilege separation, and its expected security is promising. However, we have to recognize that the underlying operating system is not aware of it. The operating system manages the hardware and software and exposes interfaces, the system calls, through which users manage resources. System calls provide privilege separation and access control, but their base unit is the process, not the subprocess isolation domain. Therefore, even under subprocess separation, the resources accessed through system calls are still shared: file descriptors and signal handlers are shared, and memory access via ptrace remains available, unless the system calls are properly accounted for. Every subprocess separation technique should be aware of this threat, which means we must pay careful attention to the interfaces and functionality provided by the operating system kernel.

Connor et al. [1] show this issue precisely. In Linux, for example, the operating system provides several ways to access memory other than direct access by address. First, it provides a file, /proc/[pid]/mem, in the proc file system. This is a virtual file that maps to the virtual memory of the corresponding process [pid]: simply opening the file, then reading and writing it, is equivalent to direct memory access. Therefore, if a subprocess isolation abstraction does not account for this interface, the technique is insecure. Beyond this file-backed memory access, further interfaces provide similar access, such as signals, ptrace, and debugging facilities. Thus, the underlying operating system interfaces have to be considered carefully.
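As a minimal illustration of this pitfall (a sketch written for this discussion, not code from any of the surveyed systems), the following helper writes through /proc/self/mem into a page that mprotect() has made read-only. The kernel services this file with forced access, so the in-process page protections that a subprocess isolation scheme might rely on are simply bypassed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `len` bytes from `buf` to address `addr` in our own address
 * space, going through /proc/self/mem instead of a direct store. */
static int write_via_proc_mem(void *addr, const void *buf, size_t len)
{
    int fd = open("/proc/self/mem", O_RDWR);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, buf, len, (off_t)(uintptr_t)addr);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}
```

A direct store to such a page would fault with SIGSEGV; the file-backed write succeeds, which is exactly why a subprocess isolation monitor must mediate opens of paths like /proc/[pid]/mem.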

We can categorize existing works' responses to this issue into three types. The first, and the majority, is simply not to solve the problem, which means that many existing works have serious security holes. The second is to prohibit system calls; this closes the security hole but significantly decreases applicability. The last is to intercept and virtualize system calls via ptrace or an equivalent debugging facility. This approach can achieve both security and applicability, but since the mechanism requires multiple processes and mediation by the kernel, performance drops dramatically. We therefore need a new type of technique that satisfies security, applicability, and performance at the same time.
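The second category, prohibiting dangerous system calls outright, is typically implemented with seccomp-BPF on Linux. The sketch below (illustrative only; it omits the architecture check a production filter needs) installs a filter that makes one chosen syscall fail with EPERM while allowing everything else, which shows both the mechanism and its applicability cost: once installed, the filter is irrevocable, so any legitimate later use of the blocked call fails too.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/utsname.h>

/* Install a seccomp-BPF filter denying syscall number `nr` with EPERM. */
static int deny_syscall(unsigned int nr)
{
    struct sock_filter filter[] = {
        /* load the syscall number from the seccomp data */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* if it equals nr, fall through to the EPERM return */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    /* required before installing a filter without CAP_SYS_ADMIN */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Note that the filter can only inspect syscall numbers and raw argument registers; it cannot follow pointers or track per-domain state, which is one reason pure filtering trades away so much applicability.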


Figure 1.1: Problems of privilege separation approaches

Figure 1.1 summarizes the typical problems of privilege separation approaches. First, process separation has a clear advantage in separating memory and resources, but it pays a heavy penalty for sharing data through IPC and for context switching. Second, sandboxing techniques protect the application from untrusted code inside it, but they must limit system call features to prevent attacks via system calls. Last, subprocess isolation protects sensitive data within the application process, but untrusted code can bypass the protection through operating system interfaces.

Lastly, multi-threading is crucial in the modern computing environment, but it is hard to support concurrency together with subprocess isolation. First, the underlying OS shares all resources between threads, so isolation in one thread can interfere with another. Second, thread-local storage has to be extended to securely provide per-thread isolation. Lastly, the isolation must securely yet flexibly support communication between threads.

1.5 Endokernel: Safe Subprocess Isolation in Commodity OS

Our goal is to develop a new subprocess separation technique with very low performance overhead, support for multiple separation domains, and awareness of the operating system interface, all without requiring modification of the hardware or the operating system. For performance, we utilize hardware-accelerated memory protection mechanisms. We separate the target application process into multiple domains and provide a monitor, called the endokernel, to support domain management, domain switches, cross-domain function calls, and system call virtualization. The endokernel prevents unauthorized system call execution: all system calls are executed in a trampoline that the endokernel provides. We also provide a minimal and straightforward system call virtualization policy that protects the application from memory-protection bypasses. The endokernel supports concurrency and protects the application from attacks such as Time-Of-Check-To-Time-Of-Use (TOCTOU) attacks.
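The hardware mechanism this design builds on can be sketched with the Linux protection key API (a simplified illustration under stated assumptions, not Intravirt's actual code): a region is tagged with an MPK key via pkey_mprotect(), and access rights for that key can later be toggled from userspace with pkey_set(), which compiles down to the unprivileged WRPKRU instruction, making a domain switch far cheaper than an mprotect() system call. This requires an MPK-capable Intel CPU and glibc 2.27 or later; pkey_alloc() fails otherwise.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Tag `region` with a freshly allocated protection key. On success the
 * key is returned through `out_pkey`; the caller can then flip access
 * rights for the whole region with a single pkey_set() call. */
static int protect_with_pkey(void *region, size_t len, int *out_pkey)
{
    int pkey = pkey_alloc(0, 0);       /* one of at most 16 keys */
    if (pkey < 0)
        return -1;                     /* no MPK support, or keys exhausted */
    if (pkey_mprotect(region, len, PROT_READ | PROT_WRITE, pkey) != 0) {
        pkey_free(pkey);
        return -1;
    }
    *out_pkey = pkey;
    return 0;
}
```

The 16-key limit is why the prototype described later supports up to 16 MPK domains per process.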

Lastly, we develop the prototype of the endokernel, called Intravirt, in userspace. Developing such features in the kernel might look more suitable and convenient due to the privilege level and the controllability, but it has a critical drawback: all Unix-based operating systems take the process as the unit of privilege separation, so code developed in the kernel would not be upstreamed unless the kernel developers changed this fundamental architecture, and we would have to port the Intravirt implementation to every new kernel release. By contrast, maintaining the Intravirt code purely in userspace makes its applicability and deployability much better. We finally evaluate the security and performance of Intravirt and perform case studies on a few use case scenarios.

We implemented Intravirt in a Linux environment, on Ubuntu 20.04 with kernel version 5.9.8 plus a few recent feature patches from upstream. Our code runs entirely in userspace and consists of 15,000 lines of C code and 4,000 lines of assembly code. We reused about 6,000 lines of open-source C code, so our contribution is about 9,000 lines of C code along with 400 lines of assembly code.

1.6 Contributions

This dissertation presents the following artifacts and contributions:

Endokernel Architecture: a new subprocess isolation abstraction with the following contributions.

• Provides a monitor that creates and maintains subprocess isolation domains, linked into the application process during its startup.

• Runs entirely in userspace; no kernel modification is required.

• Implements the endokernel prototype, Intravirt.

• Provides an Intel MPK based mechanism supporting up to 16 MPK domains.

• Provides hardware-accelerated memory protection with very low performance overhead on domain switches.

System call virtualization framework: monitors all system calls and virtualizes them to protect the system.

• Provides a trampoline to execute syscall instructions safely and in a controlled manner.

• Virtualizes all system calls, preventing attackers from executing arbitrary syscall instructions.

• Protects the system from indirect jumps into the trampoline.

• Provides concurrency in the trampoline for multi-threaded environments.

• Requires no modification of applications to virtualize their system calls.

Signal virtualization framework: provides a virtualized environment for signal handling that monitors all signals and prevents them from compromising Intravirt.

• Protects the sigframe data structure to prevent malicious modification of important registers and variables.

• Prevents signal spoofing, so an attacker cannot artificially invoke a signal handler.

System call baseline policy: prevents attacks carried out through malicious system calls.

• Systematically analyzes all system calls to find any possible MPK bypass.

• Enforces the policy at runtime.

• Provides concurrency in policy enforcement.

Compelling use cases: applications of the endokernel in compelling scenarios.

• Selects applications whose known problems the endokernel resolves.

• Designs and implements protection policies for those applications.

• Provides actual performance data.

Chapter 2

Subprocess Isolations and System Call Virtualizations

In this chapter, we survey work related to subprocess isolation. The goal of the endokernel is to isolate parts of the memory space within the application process, to virtualize system calls and signals, and to enforce a security policy that protects the application from various attacks. We survey isolation techniques provided by languages, operating systems, and hardware, and we also investigate system call and signal virtualization techniques.

2.1 Subprocess Separation

Traditional address space separation provided by operating systems is widely used because process separation is a proven technique that is easy to apply in most computing environments. However, it has clear limitations due to the significant overhead of context switching and IPC, and it is not trivial to effectively and securely share memory between processes. Therefore, numerous alternative techniques have been published. Subprocess privilege separation involves a few basic operations: identify the address space of the application, compartmentalize the memory space into two or more domains, trust only one of the compartmentalized domains, and enforce a policy that only code in the allowed domain can access the corresponding memory space. The most common technique providing such isolation is Software Fault Isolation (SFI) [12]. In addition, to make SFI safe, there are further techniques such as Control Flow Integrity (CFI) [29] and Code Pointer Integrity (CPI) [30]. The existing works are mostly based on one or more of these techniques.

In this section, we analyze the existing works and address their contributions and drawbacks. To provide a well-structured survey, we categorize the existing works into language based, operating system based, and hardware based approaches. The language based works focus on compilers to provide the abstractions; the operating system based works enhance existing operating systems to provide the separation; and the hardware based works use hardware features to provide such abstractions.

2.1.1 Language Based Separation

The easiest way is to write the code in memory-safe languages. Because this approach is applied statically, performance optimization is relatively straightforward, but all the libraries linked into the application must also be written in memory-safe languages. Unfortunately, an enormous number of libraries are written in unsafe languages such as C or C++, so in practice this approach is not the easiest one after all. To provide memory protection in these unsafe environments, some early techniques insert memory boundary check routines at compile time and check memory boundaries on the heap. One application of these techniques is foreign function interface protection, which protects the safe language from the unsafe part of the application, such as the Java Native Interface (JNI), WebAssembly, or Android native applications.

CCured CCured by Necula et al. [13] is the pioneer of this approach. The goal of CCured is to provide type safety in non-type-safe languages such as C without modifying existing source code, requiring only recompilation, and the result is remarkable. CCured provides type safety by categorizing pointers into different types and inserting type checking and boundary check code into the original code at compile time. The overhead ranges from 0 to 100%, depending on the test application. Many of the existing applications did not need to be modified to apply CCured when the paper was published. CCured also provides a formal definition and verification of its safety.

However, CCured has several disadvantages. Since the technique works at the compiler stage, CCured cannot cover dynamically linked code or self-modifying code. Also, it does not work on uniquely designed data types. In addition, the memory footprint increases due to the excessive tags and indexes, and performance overhead is unavoidable due to the boundary check code inserted by the compiler. Lastly, CCured only provides type safety in C; therefore, any direct memory access, such as file backed memory access through /proc/self/mem, can bypass CCured.

Safe Java Native Interface Java Native Interface (JNI) is known to be vulnerable to buggy native code, so Safe JNI by Tan et al. [14] was proposed to provide security at the interface between Java and native code. Safe JNI consists of three parts. First, it uses CCured to provide type safety in the native library. Second, it adds dynamic type checking code to the JNI interface. Lastly, it provides a new memory management module for JNI applications: each pointer has a boolean validity tag to prevent dereferencing after free, acts as a simple reference counter, and is managed by a C level garbage collector. They tested Safe JNI with Zlib against the full Java implementation of Zlib, and Safe JNI was about 10% faster.

Since Safe JNI utilizes CCured internally, it inherits the disadvantages of CCured. In addition, since C code can invoke any low level function, it can easily bypass the Java security framework. As a result, Safe JNI does provide memory safety, but it does not provide an overall security enhancement for the Java native interface.

Native Client Native Client (NaCl) [15] is a sandboxing framework designed by engineers at Google in 2009 to run native code safely in the Google Chrome browser. By providing a sandbox for the native code, NaCl separates code and memory between the web and native parts, protecting the web browser from any attack by malicious native code. A new set of interfaces called NPAPI is defined to provide communication between native code and the web. To protect the browser from malicious native code, NaCl has its own dedicated compiler. The binary created by the compiler has a few unique properties: instructions are aligned to 32 bits and to page size, hlt instructions are used as padding, and only a dedicated indirect jump pseudo instruction is allowed, so that attackers cannot perform return oriented or jump oriented attacks. In addition, the memory regions of the web and native parts are separated, and data sharing is allowed only through NPAPI.

NaCl is a novel abstraction for providing such a sandboxing environment, but it does have disadvantages. To prevent illegal resource access from the native code, NaCl enforces a very strict system call filter. Most system calls are not allowed for the native code, making the native code less useful: native code in NaCl is dedicated to faster computation rather than to providing rich native features.

CompARTist CompARTist [16] separates advertising libraries from the rest of the code in Android applications at compile time. The compiler analyzes the intermediate representation of the application source code, identifies the advertisement library and the application code, separates them into different processes, substitutes function calls between the application and the ad library with binder calls, and then compiles the application. To provide seamless application functionality with the advertisement, CompARTist identifies the location of the ad banner on the screen and overlays the ad banner on top of the application window. Since there is not much interaction between the application and the ad banner, the overhead is relatively small. Because of the process separation, a malicious advertisement library cannot access the memory of the application.

Even though CompARTist provides an effective and powerful separation between the ad library and the application, it does have critical limitations. First, the compiler performs very complicated tasks to identify the ad library and the application code, analyze the display location, and seamlessly overlay the two windows so they look like one. Because of this complexity, the applicability is very low: in the paper, about 62% of the selected applications in the Google Play Store worked correctly. In addition, because it depends on the Android platform, any change in the platform would affect the technique. Lastly, even though the ad library is separated into a different process, the permissions of the ad banner process were not adequately separated, which means a malicious ad banner process has the same privileges as the application.

RLBox RLBox [31] is a library isolation technique for the Mozilla Firefox [32] web browser. Its main focus is not the isolation mechanism itself, for which SFI or process separation can be used. Instead, its contribution lies in securing the computing environment around library isolation. The authors carefully analyze the attack surfaces and potential issues of library calls to fulfill secure isolation, and propose an automated safe library isolation framework. Firefox commercially uses this technique, and its analysis of the attacks and potential issues is significantly valuable.

Since the technique is applied in commercial software, its level of completion is very high, and most isolation research can refer to its analysis. However, like Native Client [15], it does not allow system calls, which lowers its applicability, and the performance overhead is high, over 20% in some cases.

2.1.2 Operating System Based Separation

As mentioned above, process based isolation suffers from performance overhead. There have been several efforts to provide finer grained and more lightweight separation than process based isolation by modifying operating systems.

Lightweight Context Lightweight Context (LwC) [18] provides a context separation technique that behaves similarly to traditional process separation but in a simpler form. LwC provides LwC_create to copy an LwC instance, similar to the fork system call, and provides context switching APIs between LwC instances. As a result, it behaves much like process separation in that memory and file descriptors are separated between LwC instances, but it does not provide concurrency between LwC instances: only one instance can run at a time. LwC also provides a resource overlay feature to share resources between LwC instances in a process. Since LwC provides lightweight separation and context switches, its overhead is relatively smaller than a context switch between processes.

Even though LwC provides robust separation and resource sharing features, it has a critical disadvantage in its implementation. LwC is implemented in the FreeBSD kernel, so porting it to other platforms such as Linux requires additional research effort, and potential corner cases on other platforms are unknown, which could be a big obstacle. In addition, FreeBSD itself keeps evolving, so LwC must be ported to the latest version of the kernel every time a new version is released.

Secure Memory Views Secure Memory Views (SMV) [19] implements intra-process memory separation in a unique way. The design uses the monolithic Linux kernel as its codebase and modifies the way page table entries (PTEs) are managed to provide memory separation between threads in a process. Since it relies on page table entry management, the overhead is minimal and the memory separation is very efficient. Application developers must modify their applications to call the proper SMV APIs to utilize the isolation and enforce access policies such as granting and revoking. SMV shows less than 1% overhead in a web server test scenario because it utilizes the virtual memory management mechanism in Linux.

Even though SMV has very low performance overhead due to its unique PTE based design, it has several drawbacks. First, it does not provide privilege separation for non-threaded third party libraries; in this case, application developers must modify the application to provide such isolation. SMV is a thread based isolation technique, so it is inadequate for memory isolation in a single threaded application.

NativeGuard NativeGuard [17] is a technique to separate the Java part and the native part of Android applications. Instead of being applied during application development, it repackages existing applications: it analyzes the original application package, identifies the Java part and the native part of the application, substitutes API calls with binder IPC messages, and repackages them into separate applications. As a result, NativeGuard can be a very effective technique for separating the foreign function interface in the Android environment. Since NativeGuard splits one application into two, the separation is very effective due to the process separation, so it cannot be categorized as a subprocess separation technique.

Because the design of NativeGuard is simple and straightforward, its drawbacks are equally simple and straightforward. Since it separates one application into two different applications, the overhead of NativeGuard is relatively high due to the IPC and context switches. They performed several simple performance tests, and the performance overhead is up to 200% depending on the test scenario. In addition, there is a critical integrity problem: since NativeGuard repackages the original package, signature verification will fail due to the signature mismatch.

2.1.3 Hardware Accelerated Separation

The techniques mentioned above add extra functionality whose performance overhead is inevitable. Since minimizing the overhead is the most crucial goal of these techniques, many works try to use hardware features. The most common hardware feature is the VT-x x86 virtualization extension, and another frequently explored hardware feature is Intel Memory Protection Keys (MPK). Some other efforts design new hardware to fulfill such an isolated environment.

Shreds Shreds [20] provides a subprocess separation technique similar to LwC [18]. Shreds uses the Domain Access Control Register (DACR) [33] as its memory protection mechanism. DACR supports up to 16 memory protection domains: one domain is assigned to each page table entry, the access permission of each domain is stored in the DACR register, and the CPU automatically enforces the access control whenever a process accesses memory. The application developer calls the shreds_enter API when sensitive data must be accessed, which changes the DACR domain. After finishing the sensitive operations, the developer calls shreds_exit to exit the domain, and Shreds returns the DACR to the normal domain. Operations that manage the DACR are privileged, so Shreds provides a kernel module and a userspace interface to manage the DACR properly. Shreds also provides compiler mechanisms to verify the usage of the Shreds APIs and CFI mechanisms to prevent attacks like ROP. Shreds has no performance overhead for memory protection during memory access because DACR is a hardware feature; however, it does have overhead from Shreds context switches. The paper reports performance overhead of up to 5% in their tests and, due to the compiler modification, compilation time increased by up to 40%.

Shreds provides very concise, fast, and powerful subprocess isolation. However, the dependencies between the CPU architecture, operating system kernel, and compiler could make maintenance difficult when each module is updated. More importantly, Shreds does not address the security of the operating system interface: an attacker could successfully bypass the memory protection by using system calls or signal handlers.

Dune In 2012, a creative technique for memory protection and privilege separation called Dune [21] was published. CPU virtualization features had been supported much earlier, but Dune used them to provide application privilege separation instead of safely running a virtual machine. In Dune, applications run on top of a newly created hypervisor instead of virtualizing the whole operating system. Dune then utilizes virtualization features of Intel CPUs, such as ring management, in userspace.

Dune also uses the system call trap mechanism to intercept all the system calls of the application running on top of the hypervisor and passes them to the operating system running in a different hypervisor context. Since Dune uses Intel's hardware features, the overhead of memory access is minimal, but overall system call performance is relatively slow due to the system call trap. However, some virtual memory management operations, such as the Appel1 benchmark, are much faster than with native system calls.

Since Dune's abstraction is very different from other subprocess separation techniques, it is hard to compare Dune to them directly. However, due to its uniqueness, Dune has a unique drawback. The application running on top of the hypervisor requires the libdune library to manage page tables, access control policy, system calls, and signals, and this library is incredibly complex. Therefore, applying Dune to other platforms and hardware could be a bothersome task. In addition, even though Dune supports the system call trap, it does not particularly address the security issues of the system calls. It would be straightforward to protect the system from system calls executed by untrusted code because the system call trap is already supported, but Dune lacks this consideration of the system calls.

ERIM ERIM [24] provides a very similar abstraction to Shreds [20], but ERIM uses Intel's Memory Protection Keys (MPK) [34] instead of ARM's DACR [33]. MPK is a memory protection feature of Intel's latest CPUs that is very similar to DACR: MPK uses a dedicated register called the Protection Key Rights register for Userspace (PKRU), much as ARM uses the DACR register, to manage the access control policy for memory pages. Unlike DACR, however, MPK operations are unprivileged userspace operations, so anyone can execute the instructions that modify MPK settings. For example, the WRPKRU and XRSTOR instructions can directly modify the PKRU value, and system calls like pkey_alloc and pkey_mprotect can modify the protection key in the page table entry, and the PKRU value as well. Because of that, ERIM scans all code regions of the application and the linked libraries to prevent the execution of such instructions. If such instructions exist, ERIM replaces them with different instructions or adds safety checks after the instruction to prevent such attacks. In the same sense, ERIM prohibits memory allocations that are both writable and executable, since an attacker might otherwise link them with a benign page and insert such an instruction after the allocation. One more difference from Shreds is that ERIM is a mostly userspace driven abstraction; since it lives in userspace, it is straightforward to apply ERIM to other platforms. However, to prevent memory attacks, memory allocation related system calls such as mmap and mprotect are intercepted by either ptrace or a Linux Security Module (LSM). Since ERIM also uses hardware for memory protection, there is no performance overhead for the memory protection itself, but there is context switch overhead between domains. The test results for NGINX with an AES session key protection scenario show an overall overhead of up to 4%.

Even though ERIM provides a very concise and valuable abstraction, there are some critical disadvantages. First of all, ERIM only supports two domains: even though MPK supports up to 16 protection domains, ERIM utilizes only 2 of them. Second, since MPK operations are unprivileged userspace operations, additional routines are required to protect them from untrusted and malicious code, which is challenging to achieve. Third, ERIM lacks multi-threading consideration. Moreover, as mentioned in PKU Pitfalls [1], ERIM does not consider the system calls at all; therefore, attackers can easily bypass ERIM's protection model by executing dangerous system calls.

HODOR HODOR [22] is very similar to ERIM [24]; the two were published concurrently by different groups. HODOR supports not only MPK but also VMFunc as its memory protection mechanism. Due to its similarity to ERIM, HODOR with MPK has almost the same characteristics as ERIM, including the performance overhead. The main difference from ERIM is that HODOR requires both kernel and userspace modification. Another difference is the number of domains: ERIM only supports two domains, but HODOR supports all 16 MPK domains. The most interesting difference is the way unauthorized WRPKRU instructions are prevented: ERIM scans and rewrites all possible WRPKRU candidates into different instructions, while HODOR uses hardware watchpoints to dynamically inspect WRPKRU instructions and trap them when executed.

Since HODOR is very similar to ERIM, it shares most of ERIM's advantages and disadvantages. However, because HODOR requires both kernel and userspace modification, it has relatively more dependency issues than ERIM.

Donky Donky [23], published in 2020, is very similar to the design of this research. Donky supports both the Intel and RISC-V architectures. For RISC-V in particular, the authors designed a new register and memory protection mechanism that provides a feature similar to Intel MPK, but supports up to 1024 domains instead of MPK's 16. Donky provides safe system call filtering in userspace, which is also very similar to this research. However, the most interesting contribution of Donky is supporting memory protection on the RISC-V architecture: they added a new memory protection mechanism and a system call filtering feature to the open source CPU architecture. Since the design is hardware based, the performance overhead is minimal as well.

On the other hand, Donky lacks system call filtering on the Intel architecture. Donky utilizes a hypervisor to prevent arbitrary system call attacks and indirect jump attacks, which requires more dependencies and modules.

libmpk libmpk [25] provides another level of indirection for MPK. MPK has a critical limitation for applicability: it only supports up to 16 keys, so only 16 different domains are allowed in a single application. If the number of threads increases, this limitation brings severe issues in concurrency and security. libmpk overcomes the limitation by virtualizing MPK, similar to virtual memory in the modern computing environment. First of all, when a new application is executed, libmpk assigns one domain for managing the virtual domains. After that, whenever the application requests a new domain, libmpk creates a virtual domain and maps it to one of the 15 physical domains. On every domain switch, libmpk provides the virtual to physical mapping for the domain as well. Therefore, in theory, libmpk can support an infinite number of MPK domains. In addition, libmpk supports up to 15 MPK cache entries for performance gain.

There are two critical issues with libmpk. First, it does not take care of system calls; any attacker who can bypass MPK by invoking such system calls and signals can destroy the libmpk management system. Also, an MPK cache miss in libmpk can cause serious performance issues depending on the memory footprint of the application. On a cache miss, libmpk maps the target virtual domain to the least recently used physical domain and modifies all the page table entries of the missed and the victim domains. The authors claim this is still much faster than calling the same number of mprotect system calls, but it remains very slow.

FlexOS FlexOS [35] addresses the security issue in library OSes that untrusted application code and the most critical system libraries share the same process space. FlexOS isolates the libraries and applies MPK based protection. For this, library developers must provide a specification for formal verification, and FlexOS provides an isolated environment based on the specification. FlexOS focuses on enhancing the security of the library OS and has 6-230% overhead depending on the test scenario.

FlexOS provides a function very similar to this dissertation, but it lacks a few crucial aspects. First of all, it focuses only on library separation and misses addressing a policy to prevent MPK bypass through system calls. Also, it does not consider that MPK does not protect the execution of code. As a result, FlexOS does not look deployable any time soon.

Sung et al. Sung et al. [36] provide intra-unikernel isolation with MPK, which is very similar to FlexOS [35]. The scheme provides type safety and memory safety by utilizing the Rust language, and MPK isolates the kernel. Performance overhead in microbenchmarks is relatively high, but it is much faster than the previous Linux-KVM based scheme.

However, it has downsides similar to those of FlexOS. It does not take care of the system call policy, so it is trivial to bypass MPK through system calls, and it also does not consider that MPK does not protect executing code.

CHERI CHERI [27, 37] is a capability based computing environment project driven by a group at the University of Cambridge over multiple years. CHERI has its own CPU architecture [27, 37], its own operating systems [38, 39], its own compiler [26], and applications [28] that make use of the CHERI architecture. The main contribution of CHERI is to provide an overall computing environment for capability based computing: CHERI extends the concept of the pointer to provide memory safety. A traditional pointer holds only the memory address, but in CHERI the pointer consists of the address, the boundary, and the permission data, with a size of up to 256 bits. Therefore, any process that wants to access memory must hold the proper capability for it. This type of memory safety was proposed long ago, but CHERI's main contribution is to provide a whole computing environment, from the hardware to the applications, including the compiler and the operating system.

Even though CHERI introduces a complete capability based computing environment, its drawback also stems from its contribution. CHERI lacks applicability because it requires dedicated hardware, a dedicated operating system, and a dedicated compiler, and the applications must be redesigned.

EdgeOS EdgeOS [40] is a subprocess virtualization scheme that provides fast 5G network services in the edge cloud. It is applied on a microkernel operating system and introduces a radically lightweight subprocess called the featherweight process (FWP). An FWP runs inside a process on the microkernel based operating system with an extremely short launch time due to the caching and reuse of FWP instances. EdgeOS has one more crucial module, the memory management accelerator (MMA), for communication between FWPs. The MMA enables communication between FWPs by copying messages, and it mediates access control to provide security. Since EdgeOS is implemented on a microkernel based OS, it can provide more flexible and secure memory isolation than a monolithic OS like Linux.

EdgeOS has clear advantages: it introduces a fast and concise FWP concept that could serve IoT services in 5G networks with good performance. However, it only works on a microkernel, which is quite far from most of the applications and platforms running in industry. Therefore, its applicability is very low, though it could be a good opportunity if ported to the more prevalent monolithic OSes.

2.2 System Call and Signal Virtualization

As discussed in section 2.1, there are techniques at various layers, for various architectures, and for various systems. However, many works focus only on isolation and do not consider the operating system interfaces that can be used to bypass such isolation, as PKU Pitfalls [1] pointed out. In this section, we investigate the existing works on system call virtualization, what they provide, and what they are missing.

2.2.1 Linux Security Module

The Linux Security Module (LSM) [41] framework was proposed about two decades ago and has been in the Linux kernel upstream since then. Even though LSM is named a module, it cannot be configured at runtime because it is a build time configurable security framework. LSM provides hook functions for all the system call routines in the kernel, and a registered module uses the hooks to add functionality to the system call procedures. LSM hooks are executed before the actual system call operation is performed in the kernel; the module performs its checks and returns 0 on success or another value on error, which makes the system call eventually return an error. The overhead of LSM is so small that it is negligible.

The most useful function provided via LSM is mandatory access control, such as SELinux [42], Tomoyo [43], AppArmor [44], and Smack [45]. Minor access control schemes like YAMA [46] and Linux capabilities [47] also use LSM. LSM is suitable for additional access control mechanisms, but it is not easy to use for system call virtualization because of its limitations in the input and output parameters.

2.2.2 System Call Filtering

Applications do not need all system calls. There are common system calls that most applications use, such as read, write, and exit, but there are more than 300 system calls that most applications do not use. However, all system calls are allowed by default to every process, so any buggy or malicious application could execute unintended system calls to attack the system. Also, applications link untrusted third party libraries for common functions, but the libraries eventually share all the permissions of the application, including system calls. Therefore, much research effort has been invested in limiting the system calls available to applications, and some of the techniques are widely used in industry.

First of all, many techniques using LSM [41] provide such system call filtering features. For example, SELinux [42] enforces a fine grained access control policy for each process, so unintended system calls can be filtered by simply allowing only the required system calls. However, SELinux is overkill for system call filtering because it provides too many functionalities beyond simple filtering.

The other popular technique is Seccomp [48]. Seccomp has been in the upstream Linux kernel since 2005, and the objective of the module is system call filtering. Seccomp uses the Berkeley Packet Filter (BPF) [49], which works like a network firewall and provides powerful filtering rules with low performance overhead. Once a filtering rule is established, more rules can be added, but rules cannot be removed or modified, even after fork.

Many other works provide more powerful and more effective system call filtering for various environments. In this section, we investigate a few of them.

Janus In 1996, Janus [50] was published, providing system call filtering and a preliminary form of mandatory access control. The technique looks very simple and premature by current standards, but it was fascinating when it was published. Janus is a process based system call filtering mechanism: a second process, called the framework, is launched whenever an application process is launched. The framework process attaches itself as a debugger and sets breakpoints on every system call instruction in the application. Whenever the application tries to execute a system call instruction, it stops and wakes up the framework process, which decides whether the system call is safe based on a configurable access control policy.

The most important contribution of Janus is introducing a new concept, sandboxing, in which system calls are filtered to provide a safe execution environment. Its performance evaluation, however, was inadequate: the authors only measured two different applications with simple input data, so it did not fully show the cost of the system call filtering.

sysfilter sysfilter [51], published in 2020, is a technique that automatically derives and enforces a system call filtering policy. Generally speaking, a system call filtering policy is created manually by developers and applied at runtime with filtering tools like Seccomp [48]. sysfilter, in contrast, automatically analyzes the binary, derives the complete call graph of the binary file, creates a BPF filter, and applies it at runtime. Therefore, if the binary analysis were perfect, sysfilter would achieve least privilege over system calls. In the evaluation, the authors analyzed over 30,000 binaries from the packages of a Linux distribution and successfully detected 90% of the system calls. From a performance perspective, they tested sysfilter on the NGINX web server [52] and observed up to 18% performance overhead due to the linear rule search in Seccomp.

Automated filtering rule creation is the most significant contribution of this work, and the most important aspect of an automated rule is its accuracy. There are two types of accuracy errors: false positives and false negatives. A false positive means that an unneeded system call is allowed, which extends the attack surface; a false negative means that a required system call is denied, which causes application failure. As a result, both types of errors must be carefully mitigated. In addition, the authors used a special compiler for the binary analysis, so in a real computing environment, binaries produced by various other compilers could be problematic.

Temporal Specialization Temporal Specialization [53], also published in 2020, is a technique very similar to sysfilter [51]. The most significant difference between them is that sysfilter performs binary analysis and therefore requires only the binary, whereas Temporal Specialization performs source code analysis and therefore requires the source code. After analyzing the source code, it detects all the possible system calls, creates Seccomp [48] policy rules, and inserts the Seccomp policy update code at the program's starting point. The most interesting contribution of this work is that it can identify the initialization phase and the service phase of the application and insert one more Seccomp policy update right after initialization and before the service phase. Therefore, it achieves a more mature form of the least privilege principle compared to sysfilter.

Temporal Specialization shares the accuracy issues of sysfilter due to their technical similarity. It also requires source code to perform the analysis, so it cannot be applied in binary-only environments. In addition, the paper does not report any performance measurements, which is a disappointing omission.

Jigsaw Jigsaw [54], published in 2014, is an effective vulnerability detection tool based on system call monitoring. Jigsaw primarily targets confused deputy attacks by locating the application's request filtering code and analyzing the actual filtering functions; it then also enforces an access control policy. Jigsaw uses static and dynamic analysis to detect binding filters and name filters and to identify missing filters. In addition, at runtime, a kernel module intercepts system calls and checks whether the filters are correctly applied and whether any unauthorized system call is filtered out.

Jigsaw introduces an interesting concept, a clean design, and well-defined formal verification, along with a working implementation. The performance overhead is no more than 10% in their measurements, and the tool detected several confused deputy attacks in real applications. However, since it relies on heuristics and on static and dynamic analysis, it can produce incorrect detections, both false positives and false negatives. It is also hard to respond to new attacks with this method.

2.2.3 System call tracing and interposition

Along with system call filtering, system call tracing and interposition provide deeper system call management. In system call tracing, a tracer follows the execution of the target application, pausing at every system call to record the executed system call, its input parameters, and, if possible, its return value. These techniques usually rely on the ptrace [55] mechanism: the tracer process attaches to the tracee process and intercepts execution at every system call. strace [56] is the most popular system call tracing tool.
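The basic mechanics can be sketched in a few lines. In the following Python fragment (using ctypes to reach ptrace(2), for illustration only), a child volunteers for tracing with PTRACE_TRACEME, after which the parent reads the child's memory with PTRACE_PEEKDATA, the same primitive a tracer uses to inspect syscall arguments. Some sandboxes deny ptrace entirely, so the sketch tolerates that case.

```python
import ctypes, os, signal

libc = ctypes.CDLL(None, use_errno=True)
libc.ptrace.restype = ctypes.c_long
libc.ptrace.argtypes = [ctypes.c_long] * 4

PTRACE_TRACEME, PTRACE_PEEKDATA = 0, 2

# Allocated before fork(), so it sits at the same address in the child.
secret = ctypes.create_string_buffer(b"hello!!")   # 8 bytes incl. NUL
addr = ctypes.addressof(secret)

pid = os.fork()
if pid == 0:
    if libc.ptrace(PTRACE_TRACEME, 0, 0, 0) != 0:
        os._exit(1)                   # ptrace denied (e.g., by a sandbox)
    os.kill(os.getpid(), signal.SIGSTOP)   # stop so the parent can act
    os._exit(0)

_, status = os.waitpid(pid, os.WUNTRACED)
peeked = None
if os.WIFSTOPPED(status):
    ctypes.set_errno(0)
    word = libc.ptrace(PTRACE_PEEKDATA, pid, addr, 0)
    if word != -1 or ctypes.get_errno() == 0:
        # One machine word of the child's memory, read by the tracer.
        peeked = (word & (2**64 - 1)).to_bytes(8, "little")
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
```

On an unrestricted 64-bit Linux system, peeked now holds the child's copy of the buffer; a full tracer like strace repeats this around every syscall stop.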

System call interposition extends system call tracing. Instead of merely recording which system call is being executed, it intercepts the execution, performs additional functionality on behalf of the tracee process, and then returns to the tracee [57]. Ptrace is widely used for this technique as well, and most system call interposition techniques aim to enforce some security policy.

In this section, we investigate a few related works and examine the characteristics of each.

DroidTrace DroidTrace [58], published in 2014, provides anomaly detection on Android based on ptrace and dynamic analysis. DroidTrace first dynamically analyzes the Java part of the application and derives its call graph. It then uses ptrace to trace all system call executions, comparing the observed behavior against a predefined policy during each system call; if any anomaly is detected, it alerts the user. Notably, DroidTrace targets dynamically linked libraries, which many dynamic analysis tools missed at the time, and the authors showed that it could detect several real vulnerabilities.

Ostia Ostia [59] is a system call filtering and interposition technique published in 2004. Ostia's system call filtering is implemented as a kernel module: when a system call is executed and the context switch into the kernel occurs, the kernel module intercepts the call, looks up the policy, and enforces it. If the policy allows the call, the kernel module invokes a callback function defined in a library linked into the original user process, and this callback sends the system call information to another user process called the agent. The agent process receives the request, performs the system call virtualization on behalf of the original process, and returns the result.

Because Ostia is an old work, its design looks inefficient from a modern viewpoint, though it was creative at the time of publication. One drawback of the work is that a system call filtering mechanism, the Linux Security Module [41], already existed. Another issue is the complicated call flow of the system call interposition. When the application invokes a system call, a context switch into the kernel occurs. The Ostia kernel module then takes over, looks up the policy, and calls the callback function in the application. The callback function builds an IPC message containing the system call information and sends it to the agent process. Another context switch transfers control to the agent, which finally performs the system call, triggering yet another context switch into the kernel. After the system call completes, the return value propagates back along the reverse path, taking multiple further context switches. The authors' performance measurements show that system call overhead ranges from at least seven times the cost of the original system call to tens of times in some cases.

Chapter 3

Threats

3.1 Unauthorized memory access

All data in a process resides in memory at some point. It may be stored for a brief amount of time or stay in memory until the process is destroyed; it may be a constant, a variable, or even code. It may live in the stack, the heap, the RODATA section, or the BSS section, but all of these are in memory in the end. The data could be a constant value, a state variable, cryptographic keys, or a control variable governing code execution. Therefore, protecting memory contents is crucial for application security. However, most modern operating systems allow arbitrary memory access within the same process, so an application compromised through a bug or a malicious library is at significant risk.

Direct access The most straightforward threat is to access the target memory directly. In this scenario, the attacker acquires the target address and then executes load and store instructions to access the memory. This type of attack is simple, but it is also easy to detect and defend against. However, the attacker can hide the target address using various techniques, so a memory protection mechanism is required beyond instruction detection, and that memory protection mechanism must itself be protected.

Access by system calls A memory protection mechanism must be evaluated carefully because direct memory access is not the only way to access memory. Operating systems expose kernel-space resources to userspace processes through system calls; Linux has more than 300 of them. System calls execute in kernel space and return their results to the calling application in userspace. Therefore, we must carefully evaluate whether a memory protection mechanism works as intended in kernel space. For example, Intel Memory Protection Keys (MPK) [34] enforce additional memory protection in the CPU using the permission bits in the PKRU register. However, PKRU is reset in kernel space, so the kernel can access all the memory without restriction.
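As an illustration, the MPK userspace interface can be exercised through the glibc pkey wrappers. The sketch below (Python with ctypes, purely illustrative) tags a page with a freshly allocated protection key; it runs to completion only with glibc 2.27 or later on MPK-capable hardware and degrades gracefully elsewhere. The key point is that the tag restricts userspace loads and stores only: a read(2) into the same page is performed in kernel space, where PKRU is reset.

```python
import ctypes, mmap

libc = ctypes.CDLL(None, use_errno=True)

def pkey_demo():
    try:
        pkey_alloc = libc.pkey_alloc          # glibc >= 2.27 only
        pkey_mprotect = libc.pkey_mprotect
        pkey_free = libc.pkey_free
    except AttributeError:
        return None
    pkey = pkey_alloc(0, 0)                   # no PKRU restrictions yet
    if pkey < 0:
        return None                           # kernel/CPU without MPK
    buf = mmap.mmap(-1, mmap.PAGESIZE)        # page-aligned region
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    # Tag the page: from now on every userspace load/store checks the
    # PKRU bits for this key in addition to the page-table permissions.
    if pkey_mprotect(addr, mmap.PAGESIZE,
                     mmap.PROT_READ | mmap.PROT_WRITE, pkey) != 0:
        pkey_free(pkey)
        return None
    buf[0:4] = b"data"                        # allowed: PKRU bits are clear
    out = bytes(buf[0:4])
    pkey_free(pkey)
    return out

result = pkey_demo()
```

Writing PKEY_DISABLE_ACCESS into PKRU for this key would make the same load fault in userspace, yet a syscall that copies from the page in kernel space would still succeed, which is precisely the gap discussed above.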

Operating systems provide various ways to access memory [1]. For example, opening, reading, and writing the /proc/self/mem file is entirely equivalent to accessing the memory directly. There are also special system calls for accessing other processes' memory, such as process_vm_readv, process_vm_writev, and ptrace, which are mostly used for debugging. As a result, we have to carefully analyze all the system calls the operating system provides and investigate how the memory protection mechanism being designed interoperates with the operating system.
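For instance, the following sketch (Python, for illustration) shows that the /proc/self/mem pseudo-file gives byte-level read and write access to the process's own memory through the kernel, without any load or store instruction ever touching the target page:

```python
import ctypes

# An ordinary in-process buffer and its virtual address.
buf = ctypes.create_string_buffer(b"secret")
addr = ctypes.addressof(buf)

# Reads and writes on /proc/self/mem are performed by the kernel, so
# userspace protections such as MPK never see the access.
with open("/proc/self/mem", "r+b", buffering=0) as mem:
    mem.seek(addr)
    assert mem.read(6) == b"secret"
    mem.seek(addr)
    mem.write(b"PWNED!")

assert buf.value == b"PWNED!"
```

Any memory protection mechanism that only mediates userspace instructions is blind to this path.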

Access by signals Along with system calls, signals are another useful interface for performing unauthorized memory accesses. Whenever a signal occurs, the kernel calls the registered signal handler function to process it, so we have to evaluate the memory protection mechanism and the signal handler carefully. For example, on Linux with MPK, the kernel resets the PKRU value whenever a signal handler is called. An attacker could therefore register a malicious signal handler and trigger the signal, and the handler would be able to access all memory without MPK enforcement. In addition, the sigframe contains the PKRU value to be restored on return from the signal handler, and the handler is allowed to rewrite that value to any 32-bit integer. The attacker could thus easily manipulate the PKRU value from a signal handler. As a result, we have to analyze the interaction between signals and the memory protection mechanism being designed.

3.2 Unauthorized file access

To perform TLS communication, we need a public/private key pair, the most critical data in secure communication. Most TLS libraries, such as OpenSSL [3], acquire the key pair and certificate by reading files stored on the local machine and load them into memory for future use. In most operating systems, those files are protected by file permissions so that no other users can access them. In some cases, mandatory access control mechanisms like SELinux [42] are applied so that only the application can access the files. However, suppose the application has a bug and is compromised, or a linked library is malicious. In that case, the attacker can open the key files and read the keys with a few simple system call executions. The attacker can also read the keys if the files are already open and file descriptors are available in the application. Memory protection does not defend against this attack; therefore, we need to extend our protection abstraction to the system calls.

3.3 Unauthorized system call execution

As mentioned above, we need to virtualize system calls to protect the system from malicious system call execution. However, the system call itself is an unprivileged two-byte instruction in the x86 architecture that can be executed at any time, anywhere in the code. That is, the attacker can execute any system call with arbitrary input parameters by executing the syscall instruction directly instead of calling the glibc wrapper functions. Attackers do not even need a syscall instruction in their own code: they can fill the required registers with the system call's input parameters and simply jump to a code address where a syscall instruction is located.
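To make this concrete, the sketch below (Python with ctypes; the syscall numbers are per-architecture values from the kernel's tables and an assumption of this sketch) issues a system call by raw number through libc's generic syscall(2) entry point, bypassing the per-call glibc wrappers. An attacker with arbitrary code execution can do the same with a bare two-byte syscall instruction and hand-loaded registers.

```python
import ctypes, os, platform

libc = ctypes.CDLL(None, use_errno=True)

# __NR_getpid differs per architecture (illustrative values only).
SYS_getpid = {"x86_64": 39, "aarch64": 172}.get(platform.machine())

if SYS_getpid is not None:
    # No per-call glibc wrapper involved: just a number and raw arguments.
    pid = libc.syscall(SYS_getpid)
    assert pid == os.getpid()
```

Nothing about this path is privileged, which is why syscall execution itself, not just the wrapper functions, must be mediated.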

3.4 Attack on Subprocess Isolation: PKU Pitfall

Connor et al. [1] provide a significant starting point for this research. Their work introduces several attack scenarios against MPK-based subprocess isolation systems such as ERIM and Hodor. Some of the attacks apply universally to non-MPK-based isolation techniques as long as they run on top of Linux or a similar Unix-like operating system. The common factor of these attacks is the use of system calls as the attack surface, which means that the most critical threat to subprocess isolation is the underlying operating system. The essential attack scenarios are as follows.

First, some system calls bypass MPK by design. For example, process_vm_readv and process_vm_writev access the memory of other processes, mainly for debugging purposes. The actual memory access in these system calls happens in kernel space, where MPK is not applied. These calls also bypass the Linux Security Module, so this attack surface is not MPK specific.
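A minimal sketch of this interface (Python with ctypes, illustrative only; it reads our own address space so no second process is needed, and it tolerates sandboxes that deny the call):

```python
import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)

class IOVec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

def read_via_kernel(pid, addr, length):
    """Copy `length` bytes from `addr` in process `pid` using
    process_vm_readv(2); the copy happens entirely in kernel space."""
    out = ctypes.create_string_buffer(length)
    local = IOVec(ctypes.addressof(out), length)
    remote = IOVec(addr, length)
    n = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                              ctypes.byref(remote), 1, 0)
    if n < 0:
        raise OSError(ctypes.get_errno(), "process_vm_readv failed")
    return out.raw[:n]

key = ctypes.create_string_buffer(b"key-material")
copied = None
try:
    copied = read_via_kernel(os.getpid(), ctypes.addressof(key), 12)
except OSError:
    pass   # some seccomp sandboxes deny process_vm_readv outright
```

Because the kernel performs the copy, any PKRU restriction on the source page is simply never consulted.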

Second, the attacker can use ptrace [55]. Ptrace is a tracing mechanism designed for debugging and profiling in which the tracer attaches to the tracee and accesses its memory freely. Ptrace even bypasses MMU permissions, so it has to be evaluated seriously. Several techniques, such as YAMA [46], exist to restrict ptrace.

Third, file-backed memory access is allowed. In Linux, /proc/[pid]/mem is a virtual file that maps to the virtual memory of the process. Opening, reading, and writing this file is therefore entirely equivalent to memory access, and it bypasses MPK protection. This interface can be a critical attack surface for many subprocess isolation techniques, not only MPK-based ones.

Lastly, signaling is also a critical attack surface. As mentioned above, the kernel resets MPK before calling a signal handler, so any signal handler can access any memory in the process. Signal handlers therefore have to be monitored and virtualized. The sigframe data structure is also critical: it contains the important configuration and register values to be restored when the signal handler returns to the application, yet the signal handler is allowed to read and write this structure. The attacker could thus modify the PKRU value in a malicious signal handler and return to normal execution with a compromised MPK setting.

Chapter 4

Endokernel Architecture

4.1 Assumption

This work focuses on subprocess isolation in userspace, so we do not consider the security of the underlying operating system and hardware. We therefore assume the kernel and the hardware are not vulnerable. We also assume that there are no side channels. Lastly, we assume the Intravirt implementation itself has no bugs.

4.2 Requirements

The requirements for this work are as follows. All of them must be satisfied for the research to be considered successful.

Memory isolation and protection Memory isolation is the building block of intra-process isolation; without it, subprocess isolation is not possible. The isolated memory of each domain must not be accessible by other domains, and each domain should have a dedicated call stack and heap. Lastly, the memory isolation mechanism should impose negligible performance overhead.

Safe domain switch Sharing data between separated domains and calling functions in other domains require a context switch. This context switch must be performed only through the feature provided and controlled by Intravirt. That is, an attacker must not be able to switch to another domain arbitrarily without going through Intravirt, and data must be shared only via Intravirt. The overhead of this context switch also has to be tiny compared to a process context switch.

System call virtualization Because system calls and signals enable unauthorized memory accesses, system call virtualization is necessary. All system calls must be executed only by Intravirt, and applications should call only the glibc wrapper functions. We have two requirements for preventing arbitrary syscall execution. First, no syscall instruction outside of Intravirt may be executed; this prevents the attacker from inserting a syscall instruction into her own code area. Second, indirect jumps to the syscall instructions inside Intravirt have to be detected and prevented.

Along with arbitrary syscall prevention, all syscalls have to be analyzed, and we need to provide and enforce a policy that prevents unauthorized access through system calls and signals. The performance overhead of the system call virtualization also has to be smaller than that of traditional ptrace-based system call interposition techniques.

Programmable Security Abstractions Much like the Exokernel argument [60], today's process-based isolation is inflexible. Unlike Exokernel, however, the key challenge is not exposing state for managing performance, but rather making the policy language match the needs of applications more closely. This influences: 1) Ease of use: a primary reason fine-grained security is not applied is the complexity and diverse nature of application demands. We argue that an abstraction that works for one application won't necessarily be the easiest to apply to another. 2) Performance: we believe that an extensible protection architecture will ameliorate these issues by putting control into application-specific abstractions.

Mechanism Portability The key problem is identifying the essential elements independent of the mechanisms. Subprocess isolation mechanisms will clearly see increased exploration, which fractures the landscape of approaches for applying them. Each new system provides some properties, but how do we compare them? We believe it is necessary to establish a model that prescribes a set of clear abstractions and security properties so that diverse systems can be reasonably and systematically applied and compared.

4.3 Mechanism Gaps and Challenges

Several facets must be preserved to achieve meaningful privilege separation and to compare related efforts. The key gaps and challenges are described below.

Subdomain Identifiability One solution would be to extend the kernel with subprocess abstractions. However, a userspace monitor is still necessary to track the current protection domain; otherwise, every domain switch would require a transition into the kernel, which is prohibitively costly.

Programmability and Optimizations A general interface would be ideal, but as prior work (Exokernel, etc.) shows, applications tend to be severely constrained by one. What is worse, the existing process abstraction needs a separate interface to accommodate each different interaction pattern efficiently. Thus, customizability of the abstractions is critical, and most prior work does not handle it properly.

Leaky System Objects Since OSs are unaware of subprocess domains, an untrusted portion of an application can request access and the OS will gladly service it. Although we show several bypass attacks, the primary challenge is to systematically assess all interfaces and integrate them into a unified policy management interface. It is easier to reason about policy for a relatively strict interface, but things like ioctls make comprehensive defenses nearly impossible.

System Flow Policies A basic property is that information in a subspace should never flow into or out of system objects unless explicitly granted. However, deriving the system flows themselves is hard due to system complexity. Although prior work such as ERIM and Hodor shows that one can reason about the flows through a specific system object, the approach is hard to broaden into a systematic solution.

syscall Monitor The need to monitor syscalls is clear, but how to do it is not. A deny-all policy, as used by intra-app sandboxing [15, 61, 62], would indiscriminately deny all access and neglect a large application space; for example, deny-all sandboxing cannot prevent Heartbleed [63]. In general, applications should be able to benefit from privilege separation without losing functionality. Alternatively, we could modify the OS so that it recognizes and enforces endoprocesses [18, 19, 64]. Unfortunately, this introduces significant complexity, as indicated by Sirius [64].

Instead of the in-kernel approach, we propose enforcing nested flow policies at the syscalls: allowing some to pass through unchanged, denying others, and securely emulating the rest. This is not supported by well-known systems in Linux: MBOX uses ptrace for similar protections [65], but only virtualizes the filesystem interface and is inefficient. Seccomp [48] with eBPF [49] and LSM [41] enforce syscall policies, but lack the ability to modify syscall semantics, which would require modifying the LSM hooks extensively.

Multi-Process An attacker can fork an exploited process and access the original address space directly through load and store instructions, or indirectly through read system calls. The endokernel must be inside the new process to maintain the protections, or the memory must be scrubbed. Prior approaches [22, 24, 64] do not consider this threat and would have to disallow fork system calls.

Signals Signals create several exploitable gaps and challenges. First, Linux exposes virtual CPU state, including PKRU, to the signal handler, which an attacker can exploit. Second, the kernel does not change the domain on signal delivery and will trap if the handler is not properly set up. Third, the kernel always delivers the signal to a default domain, exposing the monitor to control-flow attacks. Fourth, properly virtualizing signals requires complex synchronization and modified semantics to be correct, safe, and efficient at once. Overall, properly handling signals introduces significant complexity into the endokernel.

Multi-Threaded While existing works claim designs that support multi-threading, none of them implement concurrency control in the runtime monitor, which introduces TOCTOU attacks and memory leaks, and they neglect to measure scalability.

Multi-Domain Prior work isolates one domain per thread but not multiple domains per thread. The challenge is that switching from the untrusted domain to the monitor exposes less data than executing a cross-domain call, because the stack requires tracking to ensure return integrity.

4.4 Endoprocess Model

The Endokernel is a general-purpose model for nesting a monitor, the endokernel, into the process address space, which is responsible for self-protection—enforcing the abstraction

[Figure: inside the process boundary, untrusted application domains (sandboxes, safeboxes, and an untrusted glibc) run on top of libintravirt.so, which provides the domain switch, trampoline, syscall and signal virtualization, the syscall policy, and the domain and thread managers; only libintravirt.so issues syscall instructions to the kernel below.]

Figure 4.1: Intravirt Architecture.

of two privilege levels within the process—and presenting a lightweight virtual machine, the endoprocess, to the application. The Endokernel is designed to sit directly below application logic and directly on top of the abstractions provided by the OS and hardware. The core methodology is to systematically identify 1) what needs to be protected, 2) how that information can be interacted with (through the CPU, memory, or OS interfaces), and 3) the set of abstractions that must be in place to secure endoprocess isolation. The basic goal is an architecture-level description that is portable and independent of the exact layers above and below, so that it properly encapsulates the endoprocess internals. The architecture has two main elements: 1) the authority model and 2) the nested endokernel architecture that ensures isolation. We show how to use these to create least-authority separation services, nested boxing, for application use. Figure 4.1 depicts these elements together in the architecture.

4.5 Design Principle

We share the trusted monitor principles outlined by Needham—tamper-proof, non-bypassable, and small enough to verify [66]—and add the following:

Nested Separation Kernel Address space switches and kernel interactions are slow, so eliminate all OS interactions [67, 68]—i.e., stay in pure userspace—while being smaller than a microkernel and tolerating elements inside only if they support primitive separation mechanisms with a minimal interface.

Self-Contained and Secure Userspace Avoid implementing system object isolation in the kernel, which would add yet another security framework hacked on top of thousands of kernel objects. Nesting requires part of the mechanism to be in-process; however, certain resources could be virtualized by the OS. While that might seem the best choice, if parts of the process were virtualized by the monitor and others by the OS, then: 1) complexity arises in bridging the semantic gap between the abstractions, 2) bugs can arise from complex concurrency, access, and exception control, and 3) the endoprocess abstraction becomes tied to a specific kernel implementation instead of the semantics of its interface.

General and Extensible The design should permit many implementations, i.e., using various hardware (MPK) or software (SFI) isolation techniques that might present valuable tradeoff points in the security-performance space. The architecture should enable safe extensibility of the security abstractions so that custom, least-authority protection services can be built.

4.6 Authority Model

The Endokernel represents and enforces authority based on a protection domain, called an endoprocess. As outlined by Lampson [69] and instantiated by Mondrix [70], an endoprocess must provide the basic properties of data abstraction—protected entry/return and memory isolation—while also protecting access through OS objects. Most existing work multiplexes regions of the virtual address space and uses hardware mechanisms to protect entry and exit; however, these works neglect to map these properties onto the other ways in which the environment can be used to avoid mediation. Thus, in addition to traditional CPU and memory virtualization, the Endokernel also virtualizes CPU registers, the file system, address spaces and memory management, and exceptions (as implemented through signals). We use the term authority context to avoid confusion with many other names; it is a lightweight virtual machine, while being more precise than domain.

Definition 1 (endoprocess) An endoprocess is an authority context: a tuple of (instruction, subspace, entry-point, return-point, file system, address space, and exception) capabilities.

Instruction capabilities specify which instructions are permitted without monitoring and are required to fully virtualize the CPU—similar to the hosted architecture of VMMs, SFI, and Nested Kernel approaches. Explicitly representing instructions is critical, as many protection models operate by allowing instructions enforced either by privilege-level hardware (rings), capability hardware, or software-based techniques like SFI (inline monitors) or deprivileging (static verifiers with runtime code integrity). As an example, recent work uses memory protection keys to isolate virtual regions; however, the hardware exposes the key register to corruption through WRPKRU. As we show in our prototype, we implement a restricted view of CPU state by preventing any access to WRPKRU and syscall instructions from non-endokernel code, but do so using diverse mechanisms. The way we virtualize the CPU also influences the low-level mechanisms that enforce protected entries and exits.

Memory capabilities allow an endoprocess to read, write, or execute a subspace, which is a subregion of the virtual address space. The default subspaces of each endoprocess include the stack, heap, and code. File system capabilities specify the operations permitted for opening, reading, and writing runtime state through the file system. Address space and memory management capabilities determine what changes to the address space (e.g., mmap, mprotect) an endoprocess may make. Exception capabilities allow an endoprocess to securely register for and handle signals (e.g., SIGSEGV). Entry-point capabilities denote the points at which an endoprocess transition is permitted, much like converting function calls into RPCs for context switching and message passing. Return-point capabilities are generated dynamically whenever a cross-domain call (RPC), an xcall, is invoked, and require the machine to return in nested order. Each endoprocess, by default, is granted exclusive access to its own code, data, and stack subspaces.

An execution context is the combination (endoprocess × thread context), which includes the program counter, stack pointer, and other per-domain CPU registers. We have explicitly chosen to model the endoprocess similarly to a traditional process by allowing multiple threads to coexist concurrently in a single authority context. As a thread executes, it traverses various contexts. This model allows a greater range of flexibility for developing extensible protection and is the same execution model as provided by Mondrix. To support it, the monitor provides the following interface: program start, interrupted state, signals, up-calls, and xcalls.

Property 1 (Endoprocess Isolation) Each endoprocess is granted exclusive access to its code, data, and stack subspaces; guaranteed secure entry/return; mapping capabilities for its own subspaces; and capabilities to OS-level interfaces unless explicitly excluded to isolate other endoprocess state.

With these capabilities, the Endokernel can fully virtualize each resource while restricting access to privileged in-process state (e.g., monitor memory). This is essential, as many applications cannot be deployed without a certain level of access, yet the monitor must ensure its own protection by reducing that functionality. This is one of the most critical features gained under the Endokernel model relative to existing ad hoc approaches.

4.7 Nested Endokernel Organization

The Endokernel Architecture is a process model in which a security monitor, the endokernel, is nested within the address space with full authority. The endokernel is then responsible for multiplexing the process to enforce modularity across a set of endoprocesses. The first goal of the endokernel is to self-isolate, i.e., to secure the endokernel state and the endoprocess abstraction from untrusted-domain bypass. This section details this architecture explicitly, leaving the protection abstractions as extensions on top of this basic isolation.

4.7.1 In-Process Policy

The endokernel is granted full authority over all process resources, and the untrusted domain is granted access to all process resources except the following: endokernel subspaces; memory management (e.g., protection registers via WRPKRU) and direct OS call (e.g., syscall) instructions; file system operations that would allow access to endokernel subspaces (e.g., read/write of /proc/self/mem/endokernel-subspace); address space manipulation (e.g., mmap) that would expose endokernel subspaces; and signal capabilities that could otherwise be used to bypass subspace isolation. In this way, the endokernel virtualizes privilege within the address space while also inserting itself between the untrusted domain and all privileged resources, where the higher privilege level holds all protection state, including everything that could allow unmediated access by the lower-privilege untrusted domain. As in any kernelized system, protected gates ensure the endokernel is securely entered when a protection domain switch occurs. This architecture is similar to, and inspired by, the hosted VMM architecture and the Nested Kernel Architecture.

Definition 2 (Endokernel Architecture) An Endokernel Architecture is a split process model in which the endokernel is nested within the address space.

The endokernel is responsible for exporting the basic endoprocess abstractions for all

untrusted domain endoprocesses, thus enabling a new method for virtualizing subprocess

resources and enforcing the following property:

Property 2 (Complete Mediation) A non-bypassable endokernel that is simple and guarantees isolation.

To achieve this, the Endokernel enforces the following policies: secure loading and initialization so that all protection is configured appropriately; exported call gates for cross-domain calls, with argument integrity and context switching; a monitor inserted on all system calls so that they can be fully virtualized; monitoring of all address space and protection-bit modifications to ensure isolation is not disabled; control of all signals so that they route through the endokernel before reaching any untrusted-domain endoprocess; and concurrency handling to support multi-threaded execution.

4.7.2 Interface

In the basic architecture, the endokernel transparently inserts itself and presents a minimal interface to the protected resources. All access to privileged resources must become calls into the endokernel. In the process model, this typically means only a system call interface, as that is the mechanism by which most resources are accessed and typically the only one that must pass through the monitor. Access to address spaces and file systems is monitored through the system call interface. Other resources are memory based, and since the untrusted domain has no access to the endokernel state, there is no need for an explicit interface. We do not define an endoprocess creation/destruction interface, as that is the responsibility of the extension implementing endoprocess modularity, which we believe is best tailored to the application itself.

4.8 Separation Facilities: Nested Boxing

Least authority is hard to apply in practice because security policies are highly dependent on the objects being protected. As indicated, many abstractions are rigid and do not allow for specialization by application developers. The Endokernel Architecture gives us the ability to use the endokernel to explore diverse endoprocess and sharing models on top of it. To improve programmability and make use of the nested endokernel, we present the nested boxing abstraction, which effectively creates three virtual privilege rings in the process. The nested boxing model allows each level access to all resources of the less privileged layers, while removing the ability of those domains to access more privileged domains. In this thesis we fix the number of domains at four, from most to least privileged: endokernel, safebox, unbox, and sandbox. Each domain is given an initialized endoprocess that provides capabilities for accessing domain resources. To make programming easier, we also use a libos that aids in allocation and separation policy management.

Dynamic Memory Management One of the core challenges with privilege separation is modifying the code so that data is statically and dynamically separated. Static separation is easily done using loader modification, but dynamic memory management is harder, in particular when we must ensure subspace isolation. In our system we provide a nested endokernel allocator that transparently replaces whatever allocator the code originally used and automatically manages the heap and associated privilege policies.

Memory Sharing Endoprocesses share data through a simple, manual page-level grant/revoke model. An endoprocess grants access to any of its pages to a lower-privilege domain and removes access through the revoke operation.
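As a rough illustration, the grant/revoke bookkeeping can be modeled as a per-page key table. The domain ordering, key numbering, and function names below are hypothetical; real enforcement would retag pages with pkey_mprotect on MPK hardware.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical per-page key table: each page is tagged with the key of the
 * domain currently allowed to access it. Lower key = more privileged,
 * mirroring the nested boxing order (0=endokernel ... 3=sandbox). */
static int page_key[NPAGES];

/* Grant a page to a strictly less privileged domain; fails if the caller
 * does not own the page or tries to grant "upward". Returns 0 on success. */
static int grant(size_t page, int owner, int grantee) {
    if (grantee <= owner || page_key[page] != owner) return -1;
    page_key[page] = grantee;
    return 0;
}

/* Revoke a previous grant, restoring the owner's key. */
static int revoke(size_t page, int owner, int grantee) {
    if (page_key[page] != grantee) return -1;
    page_key[page] = owner;
    return 0;
}
```

The key design point is that sharing is always directed downward in privilege, so a revoke can never strand a more privileged domain without access to its own page.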

Protected Entry and Return Cross-domain calls, or xcalls, are invoked by the calling domain and can only enter the called domain at predefined entry points as specified by the endoprocess definition. This interface rejects all attempts to access the safebox that do not target a preloaded entry point. It then performs the domain switch: it switches the stack and current domain ID, stores the return address in a protected memory subspace, and transfers control to the safebox. When the called function finishes, it returns to the interface function, which domain-switches back to the untrusted domain. Entry points can be defined either manually or, as we show for full library separation, by using the library export list. This model of control flow allows the called domain to subsequently call less privileged code; if it does, the called code operates within the endoprocess context and is thus in the TCB. We allow users to determine when and how to use these features, granting greater flexibility at the cost of more complexity in reasoning about security if a callback is issued. This can implement the Shreds abstraction if used in code with no callbacks.
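The entry-point check and dispatch can be sketched as follows. The table and function names are illustrative, and a plain function call stands in for the real stack switch, domain-ID update, protected return-address save, and WRPKRU.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16
typedef int (*entry_fn)(int);

/* Preloaded entry points, populated from the endoprocess definition
 * (e.g., a library's export list). */
static entry_fn entry_table[MAX_ENTRIES];
static size_t   n_entries;

static void register_entry(entry_fn f) { entry_table[n_entries++] = f; }

/* xcall: reject any target that is not a preloaded entry point; otherwise
 * dispatch. The real gate would also switch stacks, update the current
 * domain ID, save the return address in protected memory, and WRPKRU. */
static int xcall(entry_fn target, int arg) {
    for (size_t i = 0; i < n_entries; i++)
        if (entry_table[i] == target)
            return target(arg);
    return -1; /* attempted entry at a non-exported address */
}

/* Example safebox functions: only the first is exported. */
static int demo_exported(int x) { return x * 2; }
static int demo_internal(int x) { return x + 1; }
```

A jump to demo_internal is refused even though it lives in the safebox, which is exactly the property the call gate enforces for arbitrary addresses.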

4.9 Intel® Memory Protection Key

Hsu et al. [19] describe three generations of privilege separation, ranging from manual address-space isolation to a third generation that efficiently enables concurrent per-thread memory views. The key is new hardware that extends paging with user-level tags for fast but insecure isolation.

MPK [34] extends page tables with a 4-bit tag for labeling each mapping. A new 32-bit CPU register, called PKRU, specifies access control policies for each tag: 2 bits per tag control read and write access for each of the 16 tag values. The policy is updated via a new ring-3 instruction called WRPKRU. On each access, the CPU checks the access control policy specified by the mapping's tag and the associated policy in the PKRU. If the access is not permitted, the CPU faults and delivers an exception.
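Concretely, the PKRU encodes two bits per key: bit 2k is access-disable (AD) and bit 2k+1 is write-disable (WD) for key k. A small sketch of the resulting policy values:

```c
#include <assert.h>
#include <stdint.h>

/* PKRU layout: bit 2k = AD (access-disable), bit 2k+1 = WD (write-disable)
 * for protection key k, covering all 16 keys in one 32-bit register. */
#define PKRU_AD(k) (1u << (2 * (k)))
#define PKRU_WD(k) (1u << (2 * (k) + 1))

/* deny_write: reads allowed, writes blocked for key k. */
static uint32_t deny_write(uint32_t pkru, int k) { return pkru | PKRU_WD(k); }

/* deny_all: all data accesses blocked for key k. */
static uint32_t deny_all(uint32_t pkru, int k) {
    return pkru | PKRU_AD(k) | PKRU_WD(k);
}
```

For example, a policy that write-protects key 0 and fully hides key 1 (the untrusted-domain view used in Chapter 5) is deny_all(deny_write(0, 0), 1) = 0xE, while allow_all is simply a PKRU of 0.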

MPK Security vs. Performance Unfortunately, the PKRU can be modified by any user-level WRPKRU instruction: MPK is bypassable using gadget-based attacks. As such, MPK trades security for performance by allowing protection changes without switching into the kernel.

Preventing MPK Policy Corruption Nested privilege separation reconciles the exposure of protection state by ensuring WRPKRU instructions are only used safely by the endokernel. Such systems achieve this by removing all WRPKRU instructions from the untrusted binary and crafting nested call gates that prevent abuse [20, 22, 24, 35, 36, 71].

Chapter 5

Design and Implementation

Intravirt is a userlevel-only Endokernel system that fully virtualizes privilege and prevents bypass attacks. Beyond memory and CPU virtualization, it emphasizes full virtualization of system calls and signals, and exposes and addresses concurrency, multi-threading, and multi-domain challenges. Intravirt injects the monitor into the application as the trusted-domain endokernel and removes the untrusted domain's ability to directly modify privileged state. Privileged state includes: protection information (PKRU and memory mappings), code, endokernel code and data, direct system call invocation, raw signal handler data, CPU registers on transitions and control flow, and system objects. The endokernel is inserted on startup by hooking all system call execution and initializing the protection state so that the trusted domain is isolated with no files opened or mapped.

5.1 Privilege and Memory Virtualization

While we build on and extend ERIM, we describe the mechanism here for a complete view of Intravirt; we encourage the reader to review the detailed methodology in the original work. An initial configuration partitions the application into the trusted domain and the untrusted domain, where the trusted domain contains the trusted monitor and the untrusted domain contains the rest. Once the application is separated so that the parts are differentiated, the system is configured so that all pages of the trusted domain have key 0 or 1, depending on their confidentiality requirement, and all pages of the untrusted domain have key 2. Some pages have other keys if they belong to other subdomains of the untrusted domain.

5.1.1 Virtual Privilege Switch

One of the most important elements when nesting the endokernel into the same address space is the need for secure context switching, which is complex to get right because an attacker has access to whatever is mapped into the address space. While the trusted domain is executing, the PKRU is configured as allow_all (read/write to all domains), and the CPU operates in the trusted domain's virtual privileged mode. While the untrusted domain is executing, the PKRU value for key 0 is deny_write and for key 1 is deny_all. The virtual domain switch is implemented as a change of the protection policies in the PKRU: when entering the monitor, set the policy to allow_all; when exiting, restore the original keys based on the previous state. This means that whenever the value of the PKRU changes, so does the currently executing domain. Each entry point into Intravirt is set up as a call gate with a WRPKRU that transitions the domain. The basic idea is to nest monitor code directly into the address space of the application and wrap each entry and exit point with a WRPKRU operation. By doing this, the system can transition between contexts and allow only monitor code to access protected state: a virtual privilege switch. A similar technique is also used to switch between different subdomains, enabling the use of other keys in the untrusted domain.

5.1.2 Securing the Domain Switch

Unlike systems with real hardware gates, this software/hardware virtual privilege switch has challenges because the instruction must be mapped as executable to allow fast privilege switching. The first thing an attacker could do is use a direct jump to any code in the monitor and thus bypass the entry gate; this would in fact allow the attacker to execute monitor code. One way to thwart this would be to modify the executable policy on the monitor pages, but that would require a call into the OS, which defeats the purpose of MPK's fast domain switching in the first place. Instead, we observe that even if an attacker jumps into the middle of the monitor, the domain has never switched; therefore none of the protected state is available for access, and the basic memory protection property holds. The only way to change the domain is to enter through the entry gate. Since the switch is a single instruction, we can verify the result of the switch immediately after the WRPKRU instruction and loop back if the PKRU was not switched to the intended state. This ensures that the PKRU state at all exits of the gate sequence is the intended one.

Effectively, the attacker now faces a dilemma: jumping into the middle of the gate code accomplishes nothing, since it is equivalent to running the same code at any other location, while jumping to the entry gate means that any landing place within the gate will only switch to the correct PKRU value and continue execution with deterministic control flow. No code can be abused.

5.1.3 Instruction Capabilities

Alternatively, an attacker could generate their own unprotected variant of WRPKRU: if an attacker can inject or abuse a WRPKRU instruction, they could switch domains and gain access to the monitor's protected state. To deal with this, ERIM and systems like it use a technique called instruction capabilities: by combining static transformations, code validation, and dynamic protections, an instruction becomes much like a capability. The static analysis removes all instances of the WRPKRU opcode so that the attacker has no aligned or unaligned instructions that could write the value without monitoring, and the dynamic runtime is configured so that all code is writable or executable but never both.

5.1.4 Controlling mode switches

Processes may switch into 32-bit compatibility mode, which changes how some instructions are decoded and executed. The security monitor code may not enforce the intended checks when executed in compatibility mode. Thus, we insert a short instruction sequence immediately after WRPKRU or XRSTOR instructions that will fault if the process is in compatibility mode.

64-bit processes on Linux are able to switch to compatibility mode, e.g., by performing a far jump to a 32-bit code segment included in the Global Descriptor Table (GDT). Executing code in compatibility mode can change its semantics compared to running it in 64-bit mode. For example, the REX prefixes that are used to select a 64-bit register operand size and to index the expanded register file in 64-bit mode are interpreted as INC and DEC instructions in compatibility mode. Another example is that the RIP-relative addressing mode in 64-bit mode is interpreted as specifying an absolute displacement in compatibility mode.

Executing the trusted code in compatibility mode may undermine its intended operation in a way that leads to security vulnerabilities. For example, if the trusted code attempts to load internal state using a RIP-relative data access, in compatibility mode that executes as an access to an absolute displacement. The untrusted code may control the contents of memory at that displacement, depending on the memory layout of the program; this may lead the trusted code to make access control decisions based on forged data. Conversely, if the trusted code stores sensitive data using a RIP-relative data access, executing the store in compatibility mode may cause the data to be stored to a memory region accessible to the untrusted code.

To check that the program is executing in 64-bit mode when it enters the trusted code, a sequence of instructions such as the following may be used:

1. Shift RAX left by 1 bit. In compatibility mode, this is executed as a decrement of EAX followed by a 1-bit left shift of EAX.

2. Increment RAX, which sets the least-significant bit of RAX. In compatibility mode, this first decrements EAX and then increments EAX, resulting in no net change to the value of EAX.

3. Execute a BT (bit test) instruction referencing the least-significant bit of EAX, which is valid in both 64-bit mode and compatibility mode. The BT instruction updates CF, the carry flag, to match the value of the specified bit. It does not affect the value of EAX.

4. Execute a JC instruction that will jump past the next instruction if CF is set.

5. Include a UD2 instruction that will unconditionally generate an invalid-opcode exception, giving the OS an opportunity to terminate the application. The security monitor should prevent the untrusted code from intercepting any signal generated by an invalid-opcode exception from this code sequence.

6. Shift RAX right by 1 bit to restore its original value. This instruction is unreachable in compatibility mode.
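On x86-64, the six steps above can be written as the following GNU inline-assembly sketch. Only the 64-bit path can actually be exercised here; the compatibility-mode decoding is noted in the comments. It assumes, as the text states, that only the low 32 bits of RAX are set on entry.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 64-bit-mode check. In 64-bit mode the sequence preserves RAX
 * and falls through past the UD2; in 32-bit compatibility mode the REX
 * prefixes decode as DEC EAX, the bit test then sees a clear low bit, and
 * control reaches the UD2 trap. */
static uint64_t mode_test(uint64_t rax) {
    __asm__ volatile(
        "shlq $1, %%rax\n\t"    /* 1: shift RAX left by one bit          */
        "incq %%rax\n\t"        /* 2: set the least-significant bit      */
        "btq  $0, %%rax\n\t"    /* 3: CF = bit 0 (1 in 64-bit mode)      */
        "jc   1f\n\t"           /* 4: skip the trap when CF is set       */
        "ud2\n\t"               /* 5: reached only in compatibility mode */
        "1: shrq $1, %%rax\n\t" /* 6: restore the original RAX value     */
        : "+a"(rax) : : "cc");
    return rax;
}
```

In 64-bit mode the shift-left/shift-right pair round-trips any value whose top 32 bits are clear, so the register is restored without touching memory.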

The preceding description of how these instructions behave in compatibility mode assumes that the default operand size is set to 32 bits. However, a program may use the modify_ldt system call to install a code segment with a default operand size of 16 bits, which would cause the instructions described above as accessing EAX to instead access AX. Even so, the instruction sequence still detects that the program is not executing in 64-bit mode and generates an invalid-opcode exception. Furthermore, Intravirt can block the use of modify_ldt to install new segment descriptors, and none of the default segment descriptors in Linux specify a 16-bit default operand size.

It is convenient to use EAX/RAX in the preceding instructions, because the REX prefix for accessing RAX in the instructions used in the test happens to be interpreted as DEC EAX, which enables our test to distinguish between 64-bit mode and compatibility mode by modifying the value of the register that is subsequently tested by the BT instruction. However, we need to restore the value of EAX/RAX after the mode test. One option would be to store RAX to the stack, but that may introduce a TOCTTOU vulnerability if the untrusted code can modify the saved value. That is why we use shift operations to save and restore the original value of RAX, relying on the property that only the least-significant 32 bits of RAX are ever set at the locations where mode checks are needed.

The mode test comprises 11 bytes of instructions in total. The mode test instruction sequence overwrites the flags register. If the value of the flags register needs to be retained across the mode test, that can be accomplished using a matching pair of PUSHF and POPF instructions surrounding the mode test; these instructions are encoded identically in 64-bit mode and compatibility mode. It may be possible for untrusted code to overwrite the flags register value while it is saved on the stack. However, trusted code should not depend on flags register values set by untrusted code, regardless of whether that register was loaded from stack memory or set by the processor directly as a side effect of executing instructions in untrusted code.

If the instruction sequence for testing the value of EAX/RAX used with an XRSTOR or WRPKRU instruction that is not followed by trusted code is valid in all modes reachable by the untrusted code, then the mode test code may be omitted prior to that value test code.

5.2 System Call Monitor and Handling

Intravirt must ensure that access to system objects is virtualized. We could place this monitor in the kernel; however, that would separate the memory protection logic from the mechanism and create greater external dependencies. Furthermore, it would push the policy specification into the kernel, but the abstractions supported need to be extensible, which would endanger the whole OS. Instead, we observe that system resources are provided via the system call interface, and that the semantics of that layer are stable and allow for reasoning about and enforcement of endoprocess isolation policies. Additionally, Intravirt gains greater portability by targeting POSIX. Finally, locating the monitor in the kernel would also add extra context switches and layers of complexity in the kernel for handling the virtualization.

As such, Intravirt virtualizes system objects by monitoring all control transfers between the untrusted domain and the OS through a novel in-address-space syscall monitor, called the nexpoline.

Property 3 (Nexpoline) All legitimate syscalls go through endokernel checks and virtualization.

The basic way Intravirt does this is to 1) prevent all system call operations from untrusted-domain subspaces and 2) mediate and virtualize all others. We could use a control-flow integrity monitor to provide both, like CPI [30], but that would add unnecessary overhead, require compiler-level instrumentation, and violate our minimal-mechanism principle. Alternatively, we could extend the OS; however, this would break our principle of no kernel dependencies and add cost.

5.2.1 Passthrough

The first step in handling a system call is to determine what virtualization, if any, is necessary, because many system calls do not allow endoprocess bypass. Additionally, if a system call creates an interface to read or write memory, it will use the application's virtual addresses, which means that the MPK domain will be enforced even if the memory is accessed from supervisor mode. This is something we learned only through failing, so it is important to note: by default, the kernel leaves the MPK domain untouched, and thus the hardware continues to enforce MPK-based policies even on supervisor-mode accesses. The benefit is that any kernel access to endoprocess subspaces not permitted by the current PKRU value will trap and be delivered to the endokernel, yielding a powerful deny-by-default policy that is enforced even on ioctls with unknown semantics. This does not mean the kernel cannot remap pages and get around the domains, but it does mean that a common access path must be coded around, adding greater confidence that access paths have been protected. For these passthrough system calls, we use our protected nexpoline control path, and right before executing the syscall we transition the PKRU domain to the original caller so the kernel will respect the memory policies in place. After the syscall, Intravirt switches to the trusted domain to finalize the syscall and then transitions back to the calling endoprocess.

5.2.2 No syscall from untrusted domain subspaces

To prevent direct invocation of syscall operations, we could remove all syscall instructions from the untrusted domain and ensure integrity as we do for WRPKRU; however, the syscall opcode is short and might lead to high false positives. Instead, we use OS sandboxes that restrict syscall use to a protected trusted-domain subspace. Two are available: seccomp [48] and dispatch [72]. When we started, seccomp was the only option, but it has significant drawbacks: 1) the filter cannot grow or be modified, which makes supporting multiple threads and forks challenging, and 2) it adds significant overhead. The only way to address the second is to use a different mechanism. Thus, we also explore the recently released kernel dispatch mechanism, a lightweight filter that restricts system calls to a particular subspace. Both mechanisms work by specifying the virtual address region that is permitted to invoke system calls, which we use to restrict syscalls to endokernel subspaces.
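Both filters reduce to the same check: the instruction pointer issuing the syscall must lie inside the permitted region. A minimal sketch of that predicate follows; the addresses are illustrative, and with syscall user dispatch the kernel performs the equivalent check itself once the region is installed via prctl(PR_SET_SYSCALL_USER_DISPATCH, ...).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A syscall is permitted only when issued from within the endokernel's
 * designated region (the nexpoline subspace); all others trap. */
static int syscall_permitted(uintptr_t ip, uintptr_t start, size_t len) {
    return ip >= start && ip - start < len;
}
```

Making the region per-thread, as the ephemeral design below requires, simply means each thread carries its own (start, len) pair.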

5.2.3 Complete mediation for mapped syscall

Unfortunately, the only way to invoke a syscall is for the opcode to exist in the runtime, meaning it must be placed in memory that the untrusted domain can jump to. Ideally, protection keys would govern executability and we could use an endoprocess switch, but they do not: Intel relies on NX mappings. Alternatively, subspaces with syscall opcodes could be marked NX, but the nexpoline would then require another syscall to enable execute access to the page.

Instead, the nexpoline protects each instance of the syscall; return; instruction sequence, called the sysret-gadget, so that if control neglects to enter through the call gate, the syscall is inaccessible. The basic control flow is to enter through the call gate, perform system virtualization, set up the nexpoline code subspace, jump to the syscall, and then return to the handler for cleanup.

Randomized Location To abuse the sysret-gadget, the attacker must know where it is located. As such, the first isolation approach randomizes the location of the sysret-gadget. We create one pointer that points to the sysret-gadget and make it readable only by the endokernel endoprocess. This means that to get access to the pointer, the endoprocess must be switched to first, thus guaranteeing protected entry. The pointer is looked up immediately after the switch, which means that all code between that instruction and the sysret-gadget will execute: the endokernel performs all virtualization and, once approved, invokes the sysret-gadget. This ensures complete mediation because the only way to get the sysret-gadget location is to enter at the beginning, which ensures full virtualization. The sysret-gadget can then be re-randomized at various intervals to provide stronger or weaker security; we measure the cost of randomizing at differing numbers of system calls. Multi-threading creates some complexity, as the location could leak; we address this by creating a per-thread pointer and giving each thread enough virtual space to remain probabilistically secure. The benefit of this technique is that it is the simplest and, most of the time, yields the best performance.
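The placement step can be sketched as follows. The constants and the use of rand() are illustrative only; the real system would use a secure randomness source, keep the returned pointer in endokernel-protected memory, and re-randomize periodically.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* syscall; ret encodes as 0F 05 C3: the sysret-gadget. */
static const uint8_t SYSRET[] = {0x0F, 0x05, 0xC3};

/* Copy the gadget to a random page-aligned slot inside a large reserved
 * region; the only pointer to it would live in a protected subspace. */
static uint8_t *place_gadget(uint8_t *region, size_t npages) {
    size_t page = (size_t)rand() % npages; /* stand-in for a secure RNG */
    uint8_t *slot = region + page * PAGE_SIZE;
    memcpy(slot, SYSRET, sizeof SYSRET);
    return slot;
}
```

The security of this scheme is probabilistic: it rests on the region being large enough that guessing the slot is impractical between re-randomizations.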

Ephemeral On-Demand While randomization, especially when randomizing on each syscall, creates a high degree of separation, it is not guaranteed. To provide deterministic isolation, we present the ephemeral nexpoline, which achieves isolation by writing the sysret-gadget into an executable endokernel subspace on gate entry and rewriting it with trap instructions (int 3) after completion. This requires Intravirt to create a single page for the nexpoline in the trusted domain, with read and write permission restricted to the endokernel (via MPK) and execute permission for all domains. Intravirt ensures that while the untrusted domain executes, the entire page is filled with int 3 instructions, which raise a signal if the untrusted domain jumps to this page. The endokernel interposes on all control transfers from the OS to the untrusted domain, thus ensuring that prior to any control transfer back to the untrusted domain, the sysret-gadget is removed. The resulting enforced property is that no executable sysret-gadget exists while the untrusted domain is in control.

Handling multi-threaded execution is challenging because the sysret-gadget is callable by other threads running in the process. To address this issue, Intravirt creates a per-thread filter that restricts each thread's syscalls to a per-thread subspace. The OS syscall filter thus ensures that if a thread invokes the sysret-gadget of another thread (while that thread's system call is being handled), it will trap. In this way, the syscall instruction is ephemeral and only exists while the thread is executing the nexpoline. This creates complexity, as signals may modify the control flow of system calls, which we describe in § 5.4.
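The arm/disarm byte-level bookkeeping looks roughly like this; the MPK keying, the actual executable mapping, and the per-thread filters are omitted, and the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NEXPOLINE_PAGE 4096

static const uint8_t GADGET[] = {0x0F, 0x05, 0xC3}; /* syscall; ret */

/* While the untrusted domain runs, the page is all int3 (0xCC) traps,
 * so any stray jump into it raises a signal. */
static void disarm(uint8_t *page) { memset(page, 0xCC, NEXPOLINE_PAGE); }

/* On gate entry, the gadget is written for the duration of one mediated
 * syscall, then the page is disarmed again before returning control. */
static void arm(uint8_t *page) { memcpy(page, GADGET, sizeof GADGET); }
```

The invariant the real system enforces is that every control transfer back to the untrusted domain is preceded by disarm(), so the gadget never coexists with untrusted execution.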

Control-flow Enforcement Technology (CET) [73] CET provides hardware to enforce control-flow policies. While designed for enforcing Control-Flow Integrity [29], we show how to (ab)use CET to implement a virtual call gate, which ensures syscall; return; is not directly executable by the untrusted domain. Briefly, CET guarantees that all returns go back to the caller and that indirect jumps only target locations prefaced with the end-branch instruction. CET also supports legacy code by exporting a bitmap that marks pages allowed to bypass indirect-jump enforcement, but the shadow stack must be used across the whole application.

Intravirt allocates a shadow stack for each endoprocess and ensures that a stack cannot be used by a different endoprocess by assigning each one to a protected subspace. Intravirt marks all endokernel entry points with ENDBR64, denying transitions into the endokernel from any other indirect-jump targets. This creates a problem, though, because indirect jumps within the endokernel would also require end-branch instructions and could then be used as alternative entry points into the endokernel. Thus, all jumps within the endokernel are direct jumps with a fixed offset from the current IP and are therefore not exploitable. This allows syscall; return; to be placed anywhere in the trusted domain, since the hardware automatically ensures that all syscalls start from a legitimate entry point. While CET can provide greater security for the whole application, our evaluation shows significant overheads compared to the other approaches (see §7.2).

5.3 OS Object Virtualization

The primary goal of Intravirt is to preserve endoprocess isolation, which requires system object virtualization to eliminate cross-endoprocess flows. Intravirt represents these in three core system abstractions, over which it can systematically reason about and specify policies: files (including sockets), address spaces, and processes.

5.3.1 Sensitive but Unvirtualized System Calls

A key class of system interfaces (ioctls, sendto, etc.) may index into regions of the address space that the kernel accesses on behalf of a process, but as discussed, the kernel uses the user-level virtual addresses, which are protected by the hardware enforcing MPK domain isolation even for privileged accesses. These interfaces do not require full system-level virtualization; however, if the kernel did not implement that strategy, they could be fully virtualized by analyzing the arguments and denying any access that crosses endoprocess isolation.

5.3.2 Files

The Linux kernel exposes (via procfs) several sensitive files that may leak an endoprocess's memory, because the kernel does not enforce page permissions on them, e.g., /proc/self/mem [1]. To prevent any file-related system call from ever pointing to such a sensitive file, Intravirt tracks the inode of each opened file. Conveniently, inodes are the same even when accessed through soft or hard links. This allows Intravirt to enforce that no open inode matches the inode of a sensitive file. The associated rules are transitively forwarded to child processes, as they inherit the file descriptor table of the parent.
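The inode comparison at the heart of this check can be sketched with stat(2); the file paths used below are hypothetical stand-ins for a sensitive file and an alias to it.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Two paths alias the same file iff they resolve to the same
 * (device, inode) pair; this holds across hard links and resolved soft
 * links, which is why tracking inodes catches linked or renamed views
 * of a sensitive file. Returns 1 if same, 0 if different, -1 on error. */
static int is_same_inode(const char *a, const char *b) {
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0) return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Comparing (st_dev, st_ino) rather than path strings is what makes the policy robust to link-based aliasing.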

5.3.3 Mappings

In addition, one could break the isolation property of Intravirt by aliasing the same file mapping multiple times with different access permissions. For instance, one mapping may allow read/execute while an alias mapping of the same file permits read/write accesses. We prevent such attacks by emulating the mapping through the regular file interface: the file is first copied to a read/write page, which is later turned read/execute after all security checks pass. As a result, an executable page is never backed by a mapped file.

Memory system calls create, modify, or change access permissions of memory pages. Across such system calls we prevent an endoprocess from accessing or altering another endoprocess's memory, e.g., by never permitting an endoprocess to map another endoprocess's memory. In addition, new memory mappings made by an endoprocess are tagged as belonging to that endoprocess. Intravirt enforces these policies by building a memory map that associates access permissions with endoprocesses.

5.3.4 Processes

The kernel permits virtual memory accesses of other processes via the process_vm_readv and process_vm_writev system calls. These calls access the memory of remote processes or of the current process itself. For these two system calls, we apply the same restrictions as for file-backed system calls, preventing a domain from accessing another domain's memory. In addition, we completely prevent access to another process's memory via process_vm_readv/writev.

fork and vfork Due to the insecure behavior of vfork, we emulate it using fork instead. fork must be altered to enforce transitive policy enforcement across process boundaries.

exec A process's application can be replaced using the exec system call. In this case, the kernel loads the new executable and starts executing it. This is problematic because we need to initialize its protections before the application runs. Hence, any exec system call is intercepted to ensure policy enforcement is enabled after exec.

5.3.5 Forbidden system calls

Several system calls access protection state. Intravirt currently denies the following and leaves their virtualization to future work: clone with shared memory, the pkey_* system calls, modify_ldt, rt_tgsigqueueinfo, seccomp, prctl accessing seccomp, shmdt, shmat, and ptrace.

5.4 Signal virtualization

Signals modify the execution flow of a process by pushing a signal frame onto the process stack and transferring control to the point indicated by the signal handler. The primary reasons we must fully virtualize signals are that 1) Linux always resets PKRU to a semi-privileged state in which domain 0 is made RW-accessible and all other domains are read-only, and 2) signals expose processor state through struct sigframe, potentially leaking sensitive state or allowing corruption of PKRU, which could lead to untrusted-domain control while in the trusted domain context. As such, Intravirt must interpose on all signal delivery, at minimum to transition protection back to the untrusted domain mode, and must virtualize signal handler state to avoid leakage and corruption.

Intravirt accomplishes this by virtualizing signals so that all signal handlers are registered with Intravirt first and, second, by registering signals with the kernel so that Intravirt always gains control of initial signal delivery. When a signal occurs, Intravirt first copies

sig_entry:
        movq  $1, __flag_from_kernel(%rip)
        erim_switch
        cmpq  $1, __flag_from_kernel(%rip)
        jne   __sigexit
        movq  $0, __flag_from_kernel(%rip)
        call  _shim_sig_entry

Figure 5.1: Signal Entrypoint

the signal handler context info to protected memory so that the untrusted domain cannot read or corrupt it. Next, Intravirt must deliver the signal to the untrusted domain, but to do so it must 1) push the signal info onto the untrusted domain stack and 2) switch the protection domain to the untrusted domain. Unfortunately, the semi-protected PKRU state does not map the untrusted domain stack as writable, so Intravirt first modifies PKRU so that it is fully in the trusted domain and then pushes the signal information onto the untrusted domain stack. Then Intravirt transitions to the untrusted domain mode, giving control to the handler registered in the first step.

The next challenge is that the domain switch into the trusted domain places a WRPKRU in the control path, which can be abused by the untrusted domain to launch a signal spoofing attack. By spoofing a signal, the untrusted domain could hijack the return path to its own code while setting PKRU to the trusted domain. As such, Intravirt must first add a mechanism to detect whether the signal legitimately comes from the kernel or from the untrusted domain. Figure 5.1 shows our approach, which uses a special flag that resides in the trusted domain as proof of the PKRU status before WRPKRU. This flag is allocated with key 0, so it is writable only if the signal handler is invoked by the kernel, which resets PKRU to the default. A spoofed signal handler invocation from the untrusted domain would result in a segmentation fault that can be detected by the signal handler.

Figure 5.2: State Transition with Signal; UT: Untrusted; T: Trusted; Sig: Signal Handler, Signal masked by Kernel; Smi: Semi-Trusted Domain

The next major issue is dealing with signals delivered while Intravirt's system call virtualization is working in the trusted domain. This can cause bugs due to reentry, leading to potential security violations from corrupted state. We must guarantee that our signal handler can only be invoked by the kernel once until we decide to either deliver or defer the signal and return to the corresponding state. The second problem arises from the complexity of placing Intravirt between the untrusted domain and the kernel, in the case where a signal is delivered during Intravirt's handling of a syscall. Unfortunately, we cannot simply ignore these signals, because that would break functionality. In this case, Intravirt must defer the signal until after the syscall has completed.

The solution to interrupted signal delivery is to emulate the kernel's behavior almost exactly. As depicted in Figure 5.2, signals occurring while in the trusted domain are deferred by adding them to an internal pending-signal queue and masking that particular type of signal in the kernel. The latter step is not strictly necessary, but it pushes the complexity of managing multiple signals of the same type to the kernel. Once the current operation has completed, Intravirt selects the last available signal that has not been masked by the user and delivers it.

Signals represent the most complex aspect of Intravirt. They present subtle but fundamental attack vectors while also exposing significant concurrency and compatibility issues. Intravirt appropriately handles all these cases and identifies several issues not mentioned by prior work [1].

5.4.1 Signals for Ephemeral System Call Trampoline

Typically, signals return via the sigreturn system call. In the case of the ephemeral nexpoline, we cannot rely on the kernel infrastructure to return from a signal, since we have to clean up the system call instructions after the sigreturn system call. Unfortunately, this is quite hard to achieve, since sigreturn may return directly to the untrusted domain. Hence, emulating sigreturn in userspace is easier, especially since the x86 instruction set provides the XRSTOR instruction to load the CPU state from a memory location.

Another issue of signal virtualization for secc-eph is that a signal could occur during the cleanup sequence. This results in a race condition between the cleanup and the signal handling. To overcome this issue, we created a 'transaction' for the cleanup phase of a system call. If a signal occurs within the cleanup procedure, Intravirt does not resume at the trap source, but rather at the start of the cleanup phase. Therefore, we reset the rip to the beginning of the cleanup phase and try to restore the signal context. This procedure guarantees that signals occurring at any point within the trusted domain clean the nexpoline.

5.4.2 Multithreading Design

In our first design, Intravirt was single-threaded, and the kernel delivered each signal to the interrupted thread. If the kernel delivers a signal on the back end of a syscall, execution is already in domain 0, so the kernel can write the signal frame to the key-0 secure stack. The problem arises when domain 1 is interrupted by a signal: the kernel copies the current PKRU value and therefore cannot push the frame onto the domain-0 stack, so the kernel's copy to user space faults.

To solve this, we initially placed signal delivery on an untrusted trampoline page that the kernel could always write to, and which jumped directly into Intravirt to handle the signal. This worked, but it opened a signal spoofing attack, because an untrusted domain could now jump to the untrusted-stack trampoline itself. We solved this with a nexpoline-type solution.

Multithreading creates a new problem: this open page no longer works, because it would be accessible to other threads in the same default domain. We realized that the interface provided by the kernel is simply broken for our purposes. To fix it, we modify the kernel to allow a signal to return both to a registered stack, which the kernel already supports, and to a specific registered key value. We thus always return to domain 0 and the domain-0 stack and never expose the data. We must then ensure that no one else registers, so we deny any registrations after initialization.

To summarize: a small kernel patch allows a default domain and denies any further registration.

5.4.3 CET

CET also complicates the design of signal handling by adding another stack, the shadow stack, that must be taken care of during signal delivery. We add a special system call to write to the shadow stack, which allows us to push the RIP of the restore address and the RIP of the signal handler, along with a restore token, onto the shadow stack, so that we have the token required for switching stacks when exiting Intravirt. A similar trick is used for the virtualized sigreturn to switch back to the old stack.

5.4.4 Multiple subdomains

As we discussed, control flow and the corresponding CPU state are critical to the integrity of a sensitive application. This applies not only to Intravirt itself but also to the sandbox and the safebox. Since users can run arbitrary code in the subdomains, any interruption during the execution of boxed code can be exploited to leak data. For this reason, we block signals from the view of the subdomains. The kernel can still deliver a signal to the Intravirt signal entrypoint, but we treat it as a signal delivered in the trusted domain and mark it pending.

5.5 Multi-threading and Concurrency

Since multi-threading is one of the essential elements of the modern computing environment, subprocess isolation also has to support it securely. However, supporting concurrency in such an isolated environment is not trivial.

5.5.1 Concurrency in subprocess isolation

First, the underlying OS makes all threads share memory and OS objects such as file descriptors without limitation. In this environment, a concurrent thread could easily interfere with the isolation abstraction.

Second, many applications use thread-local storage (TLS) for per-thread data, such as the call stack and thread maintenance information. The isolation abstraction requires management data structures for the domains, but neither the OS nor the userspace threading libraries (e.g., pthread) provide multiple TLS areas per thread.

Lastly, in a multithreaded application, shared data structures, locks, and notification mechanisms are commonly used to communicate between threads. The isolation abstraction can hinder effective communication, because some of these data structures may be isolated while others are not; isolation and thread-communication data structures must therefore be considered jointly, in a matrix-like design.

In summary, to abstract subprocess isolation accurately, we have to make concurrency one of the main priorities, design for it clearly and extensively, and test the implementation tightly.

5.5.2 Multithreading model

To provide concurrency, we need to select a multi-threading model to design a proper environment. For example, we could consider a one-to-many model in which a single Intravirt thread mediates all system call executions for every thread in the process. In this model, however, significant performance and concurrency problems are to be expected, so we do not select it. The multi-threading model in Intravirt is instead one-to-one: each thread has its own Intravirt instance, maintaining local data structures for stacks, PKRU state, and (in some designs) the trampoline, but sharing policy enforcement information (e.g., the memory map). Therefore, all system call virtualization and policy enforcement are performed by each thread itself.

5.5.3 Thread Local Data Structure

In Intravirt, there are various types of thread-local data that require protection from unauthorized access, safe management to prevent collisions, and efficient access by each thread without complex address derivation. To provide these features, we focus on the GS register supported by the x86 architecture.

The GS register, along with the FS register, is a user-level segment register that applications may make use of. However, the FS register is widely used by gcc and pthread for the stack canary and thread-local data, so Intravirt uses the GS register, which no application is known to use explicitly. Intravirt stores its thread-local data in a data structure and stores the pointer to that structure in the GS register, so the data can be accessed easily using an offset from the segment register, as with other segment registers.

The thread-local data structure is protected by MPK, so only the monitor domain can access this area; any attempt by an untrusted domain to access it is rejected by the CPU. An attacker could still create a maliciously crafted thread-local data structure and modify the GS register to point to the malformed data. Fortunately, the GS register can only be set through the arch_prctl system call, which we can easily virtualize to prevent unauthorized access.

5.5.4 Required Atomicity

Linux does not guarantee the order of system call execution when multiple threads execute system calls at the same time. It only provides internal locks to prevent critical collisions, such as two threads accessing the same file descriptor simultaneously. Overall, Linux has no strict atomicity policy where no critical collision exists. In Intravirt, however, the system calls are virtualized and security policy is enforced on them. There are therefore numerous security condition checks, and some system calls execute a series of other system calls along with those checks. In a multi-threaded environment, such checks and calls can be interrupted by other threads, and such interruptions can be exploited by attackers. For example, as presented in PKU Pitfalls [1], the attacker could access protected memory through /proc/self/mem, so one of the base policies must check whether a thread is trying to access it. In Intravirt, /proc/self/mem is treated as a special file: a flag is set on the open system call, and the file offset is checked on every file access such as read and write. But there are many different TOCTOU attack scenarios: the attacker could spawn another thread and manipulate the file offset using lseek, and if the flag is set before the actual file descriptor is assigned by the kernel, the attacker could access the file before the flag is set.

Therefore, Intravirt provides an internal locking mechanism for the atomicity needed to prevent such TOCTOU attacks. In the current implementation, one lock is provided for memory-related system calls such as mmap and mprotect, one for signal-related system calls such as rt_sigaction, and one for each opened file descriptor. For files, we do not take the lock on every access, in order to preserve the same semantics as stock Linux; only close is blocked while another thread is in the sysret-gadget. We block close to prevent an attacker from simultaneously closing and opening a new file with the same file descriptor to mount an attack.

5.5.5 sysret-gadget Race Condition

We have already argued that protecting the sysret-gadget is very important to prevent unauthorized system call execution. In a multi-threaded environment, such protection has to be carefully designed. For example, in Ephemeral Intravirt, the sysret-gadget location is fixed and the gadget exists whenever a thread is executing a system call. Therefore, any attacking thread could simply jmp to the gadget while another thread is calling a system call. Likewise, in Randomized Intravirt, there is a probability that two threads collide on the same gadget location, and a shared gadget location would increase the probability of a successful guess. Therefore, each design configuration needs its own protection mechanism.

First, in Randomized Intravirt, each thread's sysret-gadget area does not overlap with any other thread's, so there are no collisions and the guessing probability remains unchanged.

In Ephemeral Intravirt, we apply a per-thread seccomp filter to prevent access to other threads' sysret-gadgets. However, a seccomp filter is always inherited from the parent and cannot be altered, so a child thread would have the same seccomp filter as its parent. In Ephemeral Intravirt, we therefore have a special thread, called the Queen thread, which spawns other threads on behalf of the application threads and to which no seccomp filter is applied. When the application creates a new thread by calling the clone system call, the Queen thread receives the request, creates the new thread, applies a new seccomp filter, and jumps to the application code to start the thread.

In CET Intravirt, we have a per-thread shadow stack, so any unauthorized indirect jump is easily detected and rejected.

5.5.6 Clone

In Linux, the clone system call is used to create new threads and processes. For Intravirt to properly maintain its integrity, this process must be handled carefully. As described in the previous section, for example, the seccomp_eiv variant needs special consideration for the sysret-gadget. When the clone system call is invoked, we first distinguish whether it is about to create a new process or a new thread, based on the flags. For all Intravirt variants except seccomp_eiv, the syscall is invoked directly, and both the old and new processes continue executing after the kernel returns from the system call. In the case of seccomp_eiv, which requires a Queen thread, simply invoking the syscall without preparation would create a new process that discards all other threads in the child's address space, including the Queen thread; as a result, the new process would lose the ability to create new threads. In this case, we spawn the new process from the Queen thread instead of the caller thread; the newly created process's Queen thread spawns another thread, restores the old context into it, and jumps to the point where clone was called in the parent process. In the case of creating a new thread, the calling thread maps a new stack for the thread, allocates the local data structures on that stack, and copies the untrusted domain's context onto the stack so the new thread can return to the caller with the proper context.

In the CET variant, new shadow stacks for the untrusted domain and the other subdomains are also created, with restore tokens. For the untrusted domain's shadow stack, the RIP from the old thread is also pushed onto the stack, and the addresses are stored in the local data structure for the new thread. The sysret-gadget and trampoline (if needed) are also prepared in this data structure. Then, the initialization arguments are pushed onto the stack. The clone system call is invoked by the current thread, or by the Queen thread if seccomp_eiv is used. The old thread returns immediately once the new TID is available, while the new thread jumps directly to the thread start code. Now the new thread has state identical to the old one, but it cannot use the normal syscall code, since the GS segment is not yet initialized, and the seccomp filter or syscall dispatch has been disabled by the Queen thread or the kernel, respectively. We must re-establish them before handing control to the untrusted domain. In the case of CET, we can simply use a syscall instruction to call arch_prctl to set the GS segment, since CET prevents the syscall from being abused. For the other variants, we have the address of the thread-local data, which contains all the information needed for the system call, and we use this information instead of GS-based addressing to utilize the sysret-gadget for the same arch_prctl. Next, we enable the syscall filter by calling seccomp or by setting the dispatch address range, as we did when initializing Intravirt. Finally, we restore all context data and jump to monitor_ret to return from Intravirt and restore the context.

5.5.7 Multi-Domain

Intel MPK allows changing the PKRU register through the WRPKRU instruction. As we discussed earlier in the thesis, this threatens any privilege model based on domains, since an attacker could always override the current protection domain using this instruction; we use binary inspection to eliminate WRPKRU in untrusted code. However, this also means that untrusted code cannot switch domains for its own use.

While Intravirt itself uses two of the MPK domains for its private data and, in general, one as the untrusted domain, there are still 13 domains that the untrusted part can use as memory domains for private data to ensure security and confidentiality. To this end, we repackage the MPK interface in our multi-domain Intravirt design by providing all the essential components: isolated encapsulations for code, data, and context; tracking of the current PKRU inside Intravirt; a call gate from the untrusted domain to the encapsulation through a set of fixed entrypoints; and a library that assists the user in annotating sensitive data and code.

Secure Dynamic Loading Since we do not allow WRPKRU in user code, we add a new virtual system call, iv_domain, to complete the encapsulation of a domain. iv_domain accepts pointers to the code and data segments, a pointer to a function table containing the legitimate entrypoints, and a pointer to a stub function. Intravirt assigns an unused MPK domain to, and only to, this encapsulation, and maps the code and data into this domain to prevent other domains from accessing them.

The stub function is our solution for placing the domain switch, which contains the WRPKRU, closer to the user code: no indirect call to a switching function is needed, and user code can use it as a normal function symbol, while we still ensure that the WRPKRU cannot be reused against our system. The stub is loaded when iv_domain is called and is mapped as Intravirt memory.

To ensure that code in the application that calls into the box cannot be compromised, after any use of the iv_domain system call all executable pages are locked down: mapping, remapping, or unmapping an executable page in any form becomes an illegal operation.

Secure xcall These encapsulations have data and code memory marked with their own MPK domains. We then allow the user to change the MPK domain in order to use that data, not arbitrarily, but only through our xcall interface.

The xcall is a stub function when linked by the user; it is replaced during loading. It first looks up the called function in a protected function table, which is copied from the original function table into protected memory, to check whether the call targets a legitimate entrypoint. Then it switches to the trusted domain; updates the variables tracking the current PKRU, the previous PKRU, and the stacks (including the shadow stack if CET is used); fetches the address of the context associated with the target domain from Intravirt memory; and switches to that domain and context. A special case is that the system might not be in the untrusted domain: if the PKRU state indicates that the program is already running inside the requesting domain, we bypass the xcall gate by not updating the data structures and jump directly to the target address, as if the xcall did not exist. After switching, the gate calls the function. This ensures that every xcall enters the correct MPK domain and transfers control statically to a fixed set of entrypoints, rather than to arbitrary addresses given by the caller; even addresses inside the encapsulation would leak privilege to a potential attacker.

Signal delivery is disabled after switching the protection domain, for the same reason it is disabled while Intravirt code is running: to prevent leaking CPU state and to protect the control flow from being disrupted by a signal.

When the called function returns, it returns to our xcall gate, which switches back to the untrusted domain: it first switches to Intravirt, updates the current PKRU back to the untrusted value, and finally performs the switch with the current PKRU while also switching the stack back to the old caller's stack.

Note that there is a configurable limit on the size of the stack-carried arguments for calls through xcall; raising this limit increases the switching overhead.

Whole library isolation Besides the fine-grained semantics that isolate only sensitive functions and data, we also provide another way of creating an isolated domain: adding a few lines of code that use iv_domain to inform Intravirt of the base address of the current library. Intravirt reads all exported symbols from the ELF symbol table at that address, creates stub code that calls into the same secure xcall as the fine-grained mechanism, and redirects (hooks) all these symbols to the new stubs. Any lookup of these symbols returns our new addresses, so every call into the library automatically switches to the isolated domain and switches back when returning from the exported functions, while calls inside the library remain normal function calls. This feature is mostly intended for isolating sensitive libraries, but the programmer should be aware of the use of outside functions, especially libc functions that might leak sensitive data (e.g., memcpy), as well as functions in the library that leak sensitive data directly (e.g., dump_key) or indirectly (e.g., bn_mul). Domain isolation only ensures that certain memory is not accessible from outside; it cannot do anything about data flows that are intended to leak data.

In practice, the developer uses the libOS functions described above, which execute via glibc, and Intravirt performs the setup: (1) dynamically hook the allocators and free by linking the libOS; (2) obtain the library base address of libcrypto; (3) walk the symbol table to find libssl; and (4) issue a visor call to Intravirt to set up the library domain isolation, i.e., safebox(libaddress, domainid). Intravirt then finds all code, data, and BSS pages and sets up the keys, while the libOS provides a lazy slab allocator. One property of this technique is that callbacks do not switch back, which could be problematic.

Safebox libOS On the Intravirt side, we provide only the most basic and essential building blocks for describing the encapsulation structure. There are a few important elements in this design. First, all related data must be put into secure memory. We provide this with a macro, ISO_DATA, which adds a section attribute; the same applies to code with the ISO_CODE macro. We also mark all entrypoints with a macro, ISO_ENTRY. This is achieved through a special section for the function table, with symbols generated automatically by the GNU linker to mark the start and stop of the section. These two symbols can later be used by iv_domain to provide the entrypoints. The libOS also generates the encapsulation, the stub functions, and the initialization function automatically, and tries to pad code and data to page boundaries.

To simplify the use of xcall, every entrypoint gets a wrapping function, which loads its function ID and jumps to the actual xcall generated by Intravirt. Any use of these wrapping functions can go through the xcall macro, which the GCC compiler translates into a call of the wrapping function with all arguments. In short, all that is needed is to replace ‘func(args)’ with ‘xcall(func, args)’.

We also support a simple thread-safe slab allocator, which the user can enable by including a single header for the allocator.

5.6 Implementation Details

Intravirt is built out of five primary components: secure loading, privilege and memory virtualization, syscall virtualization, signal virtualization, and xcall gates. We use the Graphene passthrough LibOS [74–76] to load securely, insert syscall hooking into glibc, and separate the trusted domain from untrusted domain memory regions. We use ERIM [24] to isolate memory and protect WRPKRU, plus 200 LoC for tracking page attributes. We implement all syscall and signal virtualization code ourselves. In total, our system comprises approximately 15k lines of code, of which ∼6,400 are new Intravirt code.

Chapter 6

Use Cases

In this chapter, we address actual application scenarios that Intravirt enables. Because Intravirt provides an isolated endoprocess environment, there are numerous applications to which it can be applied. Using Intravirt's system call virtualization, we can apply different system call policies to each endoprocess, similar to mandatory access control mechanisms like SELinux [42], but at the endoprocess level.

6.1 Library Isolation

First, we address applications that can make use of library separation. Using Intravirt, we can safely separate the code and data of libraries, with xcalls providing a compelling yet fast isolated environment.

6.1.1 Reference Application: zlib

The very first use case for Intravirt is zlib [77]. There is no noticeable security benefit to isolating zlib, but zlib serves as a reference use case for virtually all library isolation techniques. This use case can therefore act as the baseline application for Intravirt, against which we can easily compare other techniques in terms of performance, applicability, and compatibility.

We isolate zlib with the whole-library separation technique. The implementation is relatively simple: we modify the zlib code to add a constructor function that obtains the symbolic information of the library by calling the dladdr function in the loader and then calls the iv_domain system call to assign a new domain for zlib. In addition, the allocator can be replaced with a new one if required. For this, we added ten lines of C code and a dependency on the loader (ld.so). Applications that use zlib require no modification; calling the zlib API automatically invokes a domain switch. We address the performance evaluation in § 7.3.

6.1.2 Safeboxing OpenSSL in NGINX

OpenSSL [3] is responsible for secure communication and cryptographic operations in the NGINX [52] web server. Once it is compromised, the impact is significant: leaked session keys could expose the encrypted messages, and the server's identity could be impersonated using the leaked private key. Unfortunately, OpenSSL is a dynamically linked library loaded during the startup of the NGINX process, sharing all memory with the rest of the application. Therefore, any small vulnerability in the NGINX web server could lead to a complete breach of the secret information.

There have been numerous efforts to separate OpenSSL from the application to prevent such attacks. For example, the H2O web server project [11] separates the private key management module into a different process: the separated process performs all operations involving the private key and communicates with the primary web server process via IPC. They therefore claim that the private key remains protected even after the web server is compromised. However, due to its process separation design, it carries a performance overhead, overall about 2%, because the private key is required only at the beginning of a web session. Unfortunately, H2O cannot protect its session keys. If session key management were also separated into a different process, the performance overhead would increase significantly due to the frequent use of the session key. Also, because it separates cryptographic operations into different processes and reimplements the cryptographic functions, it depends on the complex low-level OpenSSL crypto APIs, increasing dependency and complexity.

In contrast, ERIM [24] chose OpenSSL session key protection as its use case. In their implementation, they modified OpenSSL to add domain switch code before and after AES session key operations, so that no other part of the program can access the session keys without switching the domain. The performance overhead is about 2–3%. However, due to its ad-hoc implementation, it only supports the AES algorithm and only one protection domain.

We utilize the same whole-library separation as for zlib. All the code, functions, and data in libssl.so and libcrypto.so are isolated, and xcall is invoked whenever any OpenSSL API is called. Therefore, all the secure communication resources are protected, including the private key and the session keys. However, this does not protect against a bug inside the OpenSSL library itself, like Heartbleed. We added 215 lines of code to OpenSSL, and we cover the performance evaluation in § 7.3.

6.2 Module sandboxing

Intravirt can isolate part of a program as well as a whole library. Since it can isolate units smaller than a library, Intravirt provides finer-grained isolation. In this case, however, the protected memory has to be page-aligned because MPK provides per-page protection.

6.2.1 Sandboxing HTTP Parser in NGINX

The NGINX parser performs straightforward functionality. It first reads the message received by the network module, interprets the message contents, fills in the output data structure, and returns to the caller. However, the parser module is located in the same address space and shares all resources with the other parts of the process, so a compromised parser could lead to severe data exposure, such as personal and financial information. The parser is also a frontend module in NGINX, making it a natural target for an attacker to exploit. There are actual buffer overflow vulnerabilities in the parser [4-6]; by exploiting them, an attacker could gain complete control of the web server. As a result, we need to sandbox the parser to strip it of most of its privileges.

In this use case, we modified the NGINX HTTP request handler code to acquire the addresses of the parser functions, insert the call gate, and invoke the iv_domain system call so that Intravirt assigns a new sandbox endoprocess and prepares the call stack to isolate the parser. As a result, when the parser is invoked, xcall is called instead of the parser directly; xcall performs the domain switch and then calls the parser function. The current policy for a sandbox endoprocess is that it cannot call any functions or system calls, and cannot access any memory pages outside of the endoprocess.

There is a potential problem with the data structures. If the output data structure is allocated outside the parser, the parser cannot access it. In this case, we have two solutions. The first is to move the allocator inside the sandbox. This approach has a performance advantage, but the implementation could be challenging. The second solution is to demote the already allocated memory pages before feeding them into the parser. This approach is easier to implement, but it incurs performance overhead due to the MPK key change on the pages whenever the parser is called. We used the second solution and added page-alignment code in the allocator. With this approach, we added 377 lines of code to NGINX. We address the performance overhead in § 7.3.

6.2.2 Preventing sudo Privilege Escalation

A recent bug was found in the sudo argument parser that allows an attacker to corrupt a function pointer and gain control with root access [7]. We compartmentalize sudo so that the parser code, in file parse_args.c, is sandboxed and restricted to only the command line arguments and an output buffer. The worst an attack can now do is overflow the parser's internal buffer and eventually segfault, causing no further harm. In summary, by changing approximately 200 lines of code, importing our libsep into sudo, and using Intravirt, we confine the argument parser and successfully prevent the root exploit. More generally, almost all parsers exhibit a similar type of behavior and could benefit from similar changes, possibly applied automatically.

6.3 Endo-process System Call Policy Enhancement

The system call virtualization feature can be used for additional OS object protection alongside library separation. For example, we can provide a different system call virtualization policy for each endoprocess.

6.3.1 NGINX Private Key File Protection

As mentioned in § 6.1.2, we showed that we can protect the session keys and the private keys stored in memory. However, this is not complete protection: we also have to protect the private key files stored on disk. H2O [11] separates private key management into another process, but a compromised web server could simply open the private key file and read its contents. To prevent such an attack, the administrator must employ other access control mechanisms such as Unix user IDs, or mandatory access controls such as SELinux [42]. However, applying these access controls requires significant modification of the H2O source code and the application runtime model, because a simple fork system call cannot provide different subject identifiers for the processes.

In Intravirt, the private key files can also be protected by implementing additional system call virtualization policies within the safeboxed OpenSSL in NGINX. In this section, we introduce a file capability system based on Intravirt. To provide a secure and efficient file capability system, we need to analyze the threats and the relevant system calls. We also need to define the system call policy as well as the concurrency considerations. We discuss the performance evaluation in § 7.3.

Assumption

In Linux, all OS objects are abstracted as files, but we only consider regular files actually stored on disk. Files such as sockets, pipes, and device nodes are out of scope. Likewise, we do not consider attackers executing other programs to manipulate the files; attacks are only considered within the same process boundary.

Identifying the Private Key Files

Identifying the private key files is the first task in this use case. The identifier must be immutable and not copiable, and should remain valid until the end of the application process. Many identification mechanisms fulfill these requirements, but we discuss only two of them.

First, we could make use of the most basic file system identifier, the inode. The inode is a unique integer within a file system, which we could use as an identifier in a tuple with the file system's device identifier. However, in this case we would have to maintain a data structure indicating which inodes are private key files, and that data structure would have to be protected by Intravirt. We would also need an interface to manage the identification data structures, which could be implemented via configuration file access or pseudo system calls.

The second approach is to borrow a similar concept from other techniques. For example, mandatory access control mechanisms such as SELinux [42] and AppArmor [44] support file object labeling by using extended attributes in the Linux file system [78]. In this approach, each file can carry an extended attribute item indicating that it is a private key file, and Intravirt enforces the policy by reading the attribute in the virtualized system call routines. This approach is straightforward to apply, but it does not work on file systems that do not support extended attributes.

This dissertation uses the second approach, with the private key files stored in the ext4 file system. The policies are: 1) if the label does not exist or the label says unbox, then both the unbox and the safebox domains can access the files, and 2) if the label says safebox, then only the safebox domains are allowed to access the files. We use additional system calls, getxattr and fgetxattr, to retrieve the labels of the files.
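The two rules above can be captured in a small predicate. The sketch below is illustrative Python, not Intravirt's actual code; the function name is ours, and in the real system the label would be fetched from the file with getxattr or fgetxattr rather than passed in directly.

```python
from typing import Optional

def can_access(caller_domain: str, label: Optional[str]) -> bool:
    """Return True if `caller_domain` may access a file carrying `label`.

    Rule 1: a missing label, or an explicit "unbox" label, is accessible
            to both the unbox and the safebox domains.
    Rule 2: a "safebox" label is accessible only to safebox domains.
    """
    if label is None or label == "unbox":
        return caller_domain in ("unbox", "safebox")
    if label == "safebox":
        return caller_domain == "safebox"
    return False  # unknown labels are denied by default
```

Denying unknown labels by default keeps the policy fail-closed if a file is mislabeled.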

Possible Attacks and Mitigations

There are multiple ways to access the private key files through various system calls.

Direct Access This type of access reads or writes files directly. Example system calls in this category are read, write, preadv, and pwritev. Some system calls like truncate can modify files without opening or reading them. For these, we check the permission of the files and then execute the system calls.

Control Takeover These system calls do not access the file contents directly, but they control file operations by modifying file information or file handles. Examples are open, close, rename, unlink, chdir, and chmod. We also enforce the policy in these system calls.

Data Theft These system calls copy file contents rather than accessing them directly, such as dup and sendfile. They require a permission check on the source file as well as on the destination file, because the kernel overwrites the destination file in some of these system calls.

Indirect Access These system calls, such as execve, do not access the file contents, but the attacker could infer the contents by executing them. We need to enforce the file protection policy for them as well.

Denial of Service These system calls do not access the files, but they can disrupt normal file operations. For example, lseek could change the file offset to disturb benign access, and flock could lock the files to prevent access by others. These also require policy enforcement.

Attachment These system calls do not access the target files directly, but they create a reference to the files. Examples are link and symlink. We do not directly enforce the policy in these system calls, but we check whether the path is a symlink or the real path during the permission checks.

Reading Information These system calls read file information such as size, type, and timestamps. We do not enforce anything in particular on these system calls. Examples are stat and getxattr.
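One way to organize the classification above is a dispatch table from system call name to category, with the enforcement decision keyed on the category. This is our own sketch of the idea, not Intravirt's internal representation; the names and the table contents are illustrative and deliberately abbreviated.

```python
# Map each system call to the attack category described above, then
# decide whether the virtualized handler must run a permission check
# before forwarding the call to the kernel.
DIRECT, CONTROL, THEFT, INDIRECT, DOS, ATTACH, INFO = range(7)

SYSCALL_CATEGORY = {
    "read": DIRECT, "write": DIRECT, "preadv": DIRECT, "pwritev": DIRECT,
    "truncate": DIRECT,
    "open": CONTROL, "close": CONTROL, "rename": CONTROL,
    "unlink": CONTROL, "chdir": CONTROL, "chmod": CONTROL,
    "dup": THEFT, "sendfile": THEFT,
    "execve": INDIRECT,
    "lseek": DOS, "flock": DOS,
    "link": ATTACH, "symlink": ATTACH,
    "stat": INFO, "getxattr": INFO,
}

def needs_permission_check(syscall: str) -> bool:
    # "Reading information" calls pass through unchecked; every other
    # category is subject to the label policy. Attachment calls are
    # checked indirectly (symlink/realpath resolution), which we fold
    # into the same decision here for simplicity. Unknown calls are
    # treated conservatively, as if they accessed files directly.
    return SYSCALL_CATEGORY.get(syscall, DIRECT) != INFO
```

Keying the decision on a category rather than on individual calls keeps the 57 virtualized handlers consistent with each other.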

Policy Enforcement

First, we assigned the “safebox” label to the private key files used by NGINX with the setfattr command [79] in the shell. We then newly virtualize, or modify the existing policy of, 57 system calls in total. In each virtualized system call, we check the labels of the files passed as input parameters by calling the getxattr or fgetxattr system calls. We then continue the system call execution when the label and the caller domain match; otherwise, we return a permission error (EPERM). Additionally, we also virtualize a few more system calls for consistency, such as access and faccessat, because their purpose is to check permissions.

Concurrency Consideration

In a multithreaded environment, assuming the attackers have full control of synchronization and timing, we have to be aware of TOCTOU attacks because the implementation performs permission checks before executing the system calls. First of all, for the system calls that receive file descriptors as input parameters, we can achieve safe concurrency by using the locking scheme described in § 5.5. On the other hand, system calls that take a path as an input parameter require more caution. Relative paths are allowed, so an attacker might change the current working directory after the permission check. Symbolic links pose another problem: the attacker could replace the original symbolic link with a malicious one pointing to a different target path.

To deal with these problems, we introduce a new lock: the symlink, chdir, and fchdir system calls are blocked while other threads are inside any system call that uses path information. Lastly, some system calls with the “at” suffix, such as openat, take a directory file descriptor as the root of the path resolution, which might be substituted by an attacker thread at any time. Therefore, we also apply locks on directory file descriptors, just as for other file descriptors.

6.3.2 Directory Protection

We can extend the use case of § 6.3.1 by protecting a whole ssl directory, including all files and subdirectories. We could simply label every file and subdirectory in the directory we want to protect, but it would be hard to keep up with changes to the files. Therefore, we need a new approach to protect a directory.

This use case is the same as chroot [80] with an inverted security policy, and it is very useful for providing private storage to each endo-process. We address the design and implementation in this section. The performance of this application is covered in § 7.3.

Identifying the Protected Directory

As we discussed in § 6.3.1, we could use a unique value of the files, such as the inode, or an additional attribute of the files, such as an extended attribute. Another method is to use a new system call to identify the directory to be protected and keep it protected for the lifetime of the process, just like chroot does. We assigned a new system call, endo_toorhc, and let Intravirt intercept the call and manage the protected directories.

After selecting the directory to protect, we need to identify all its files and subdirectories. We could perform this task file by file, but that would incur severe performance overhead, and it is not easy to handle events such as file creation or deletion. In the end, we label only the root directory to be protected, and we must determine the location of each file in every file operation, which has to be reasonably fast.

The system calls taking file paths allow relative paths and indirect components like “..”. Also, the user could create a symbolic link pointing to any file in the system, so it is hard to determine the correct absolute path of a file in userspace from the given path information alone. To solve this issue, we use the /proc/self/fd/ directory. The kernel provides this interface in the proc file system: the absolute path of an opened file is exposed as a symbolic link, so we use the readlink system call to read the absolute path and identify the exact location of the file. However, it only shows the paths of opened files, so we need to open the target every time we want to check its absolute path. This approach incurs performance overhead due to up to two extra system calls, open and readlink, per file operation. In addition, the result of readlink is a string, so we need to perform string comparisons. Nevertheless, this appears to be the most accurate way to retrieve the absolute path of a file in userspace. We can cache a file's location once it is opened, reducing the overall overhead.

Policy Enforcement

First of all, the application calls the newly added endo_toorhc system call to select a directory to be protected. During this selection, the application can also choose the domain allowed to access the directory. Just as chroot is a privileged system call, endo_toorhc is also privileged: only a safebox domain may call it. After selecting the protected directory, all files and subdirectories in the directory are accessible only to the given domain.

For every system call that takes a path as an input parameter, Intravirt first opens the file, retrieves its absolute path by calling readlink on /proc/self/fd/[FD], and compares the prefix of the absolute path with the selected protected directory. If it matches, Intravirt decides the access by comparing the caller's domain with the domain selected for the protected directory. Since Intravirt has already opened the file, we substitute some system calls with their file-descriptor equivalents, such as chmod with fchmod.
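The open-then-readlink trick, and the prefix comparison it feeds, can be sketched as follows (illustrative Python, Linux only; the function names are ours). Note that a naive startswith check would wrongly match /etc/ssl-backup against a protected /etc/ssl, so the comparison must respect path component boundaries.

```python
import os

def absolute_path_of(path: str) -> str:
    """Resolve a path the way the monitor does: open it, then read the
    kernel-provided symlink /proc/self/fd/<fd> to obtain the true
    absolute path of the opened file (Linux-specific)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.readlink(f"/proc/self/fd/{fd}")
    finally:
        os.close(fd)

def inside_protected_dir(abspath: str, protected: str) -> bool:
    """Prefix check that will not confuse /etc/ssl with /etc/ssl-backup:
    the match must end exactly at a path separator."""
    protected = protected.rstrip("/")
    return abspath == protected or abspath.startswith(protected + "/")
```

Caching the resolved path per file descriptor, as the text describes, avoids repeating the open/readlink pair on every operation.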

Also, every system call that opens a new file descriptor caches the label of the opened file to reduce the performance overhead.

Lastly, this use case is orthogonal to the use case in § 6.3.1, so we can apply both at the same time. However, we need to consider collision cases in which a file and its directory are protected with different labels. In this case, the current implementation takes the higher label for policy enforcement. For example, for a file labeled safebox inside a directory protected for unbox, we take safebox because it is the higher label.

Concurrency consideration

We have the same concurrency issues as discussed in § 6.3.1. In addition, we need one more lock to prevent any race condition on endo_toorhc.

Applying endo_chroot

This use case is very similar to chroot, except for the inverted security policy. Therefore, we could also consider another use case, endo_chroot. However, chroot is much more complex than this use case. For example, chroot modifies path visibility: all path information after chroot must resolve below the new root directory. In Intravirt, it is not trivial to remap all paths under the new root directory; we would have to prepend the new root directory to every path. Also, some system calls with directory file descriptors (i.e., openat) require even more complex path manipulation. We would also need to virtualize the current working directory and file descriptors for each domain, which is likewise very complex.

Chapter 7

Evaluation

7.1 Security Evaluation

Table 7.1 summarizes the quantitative security analysis based on known attacks described by Conner et al. [1] and additional attacks we found. In general, Intravirt defends against the attacks raised in [1].

Intravirt's system call and signal virtualization guarantees security properties that prevent exploitation by these attacks. We exclude the two race-condition attacks, due to their requirement for multi-threading, which Intravirt does not support.

In addition to the attacks described by Conner et al., we found several attacks against subprocess system call and signal virtualization. For the evaluation, we created a fixed-address secret inside the trusted domain. All test cases try to steal this secret and would hence break Intravirt's isolation guarantees. The attacks try to bypass our system call virtualization by performing system calls that modify the protection policy of the secret, or they try to elevate themselves to trusted status by overwriting the PKRU register. They specifically target the implementation of Intravirt and highlight the degree to which Intravirt has followed through on its security guarantees. Ideally, Intravirt prevents all attacks.

7.1.1 Fake Signal

Intravirt effectively prevents the basic sigreturn attack from [1]. However, the kernel places signals on the untrusted stack and delivers the signal to our monitor signal entrypoint.

Attack                                    secc-rand  secc-eph  CET
Inconsistency of PKU Permission [1]           •          •      •
Inconsistency of PT Permissions [1]           •          •      •
Mappings with Mutable Backings [1]            •          •      •
Changing Code by Relocation [1]               •          •      •
Modifying PKRU via sigreturn [1]              •          •      •
Race condition in Signal Delivery [1]         ×          ×      ×
Race condition in Memory Scanning [1]         ×          ×      ×
Determination of Trusted Mappings [1]         •          •      •
Influencing Behavior with seccomp [1]         •          •      •
Modifying Trusted Mappings [1]                •          •      •
Fake Signal                                   •          •      •
Fork Bomb                                     ◦          •      •
Syscall Arguments Abuse                       •          •      •
Race condition using shared memory            •          •      •
TSX attack                                    ×          •      •

Table 7.1: Quantitative security analysis based on attacks demonstrated in [1] and attacks found by us. ◦ indicates that the Intravirt variant in this column is vulnerable; • that it prevents the attack; × that the attack is beyond Intravirt's threat model.

The untrusted application may forge a signal frame and directly call the monitor's

signal entrypoint. As a result, it can, for example, choose the PKRU value and the return address. Therefore, the entrypoint has to distinguish between a fake signal from the untrusted application and a real signal from the kernel. The entrypoint is carefully constructed such that a signal returning from the kernel returns with the privileges of the trusted domain and is hence capable of writing trusted memory. We rely on this observation and place an instruction at the beginning of the monitor which raises a flag in trusted memory. A fake signal crafted by the untrusted application cannot raise this flag, which violates a check in the monitor's signal entrypoint that cannot be bypassed.

7.1.2 Fork Bomb

This attack targets the random location of the system call instruction in Intravirt. To perform a system call, the untrusted application may guess the random location of the system call instruction. Assuming the trampoline size is 16 pages, there are 65534 possible locations for the system call instruction. If the untrusted application is capable of forking children, it can try a different location in each child. If the child crashes, the system call was unsuccessful and the untrusted application retries. Using this brute-force approach, the untrusted application keeps forking until a child does not crash. At that point, the untrusted application has access to a child process that has bypassed Intravirt's security guarantees and may perform arbitrary system calls. It should be noted that only secc-rand is susceptible to this attack, since secc-eph removes the system call instruction completely when returning control to the untrusted application.
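The search space here is small: a 16-page trampoline holds 16 × 4096 = 65536 bytes, and subtracting the two-byte syscall instruction gives the 65534 candidate locations cited above, so a forking attacker needs roughly half that many guesses on average. A quick check of the arithmetic:

```python
PAGE_SIZE = 4096
TRAMPOLINE_PAGES = 16
SYSCALL_LEN = 2  # the x86-64 syscall instruction is two bytes (0F 05)

# Candidate start positions for the hidden syscall instruction.
positions = TRAMPOLINE_PAGES * PAGE_SIZE - SYSCALL_LEN

# With a uniformly random location, a brute-force forker needs about
# half the positions on average before a child survives.
expected_guesses = positions // 2

print(positions, expected_guesses)  # 65534 32767
```

This is why the attack is practical against secc-rand: tens of thousands of forks are cheap for an attacker who can tolerate crashing children.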

7.1.3 Syscall Arguments Abuse

Intravirt virtualizes a subset of all system calls. System calls that are not virtualized could be exploited to read secret memory unless Intravirt verifies that all pointers provided to a system call lie within untrusted memory. We perform an attack based on the rename system call, passing it a memory pointer from the trusted domain as an argument. Intravirt successfully prevents this attack by checking the pointer locations.
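The pointer check amounts to a range test: every pointer argument, together with the extent it addresses, must fall entirely within untrusted memory before the call is forwarded. A schematic version, with made-up region bounds (in Intravirt the bounds come from the monitor's memory map, not a constant):

```python
# Hypothetical untrusted region for illustration only.
UNTRUSTED_REGIONS = [(0x10000000, 0x20000000)]

def pointer_is_untrusted(ptr: int, length: int) -> bool:
    """True iff [ptr, ptr + length) lies entirely inside some untrusted
    region, so the kernel cannot be tricked into reading or writing
    trusted memory on the caller's behalf."""
    end = ptr + length
    return any(lo <= ptr and end <= hi for lo, hi in UNTRUSTED_REGIONS)
```

Checking the full extent, not just the start address, matters: a buffer that begins in untrusted memory but spills into a trusted page must be rejected.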

7.1.4 Race condition using shared memory

Shared memory may be used across multiple processes to bypass Intravirt's checks on system call arguments. In particular, we consider a pwritev-based attack in which a child process performs a pwritev system call using an IO vector placed in shared memory. If the parent is permitted access to the same shared memory, it can, with the right timing, alter the IO vector's values to point to trusted memory. The attack has to be timed such that the child's monitor has already performed the security checks, but the system call has not yet read the affected IO vector. Intravirt prevents such attacks by copying pointer-typed system call arguments into the trusted memory region and only then performing the system call using the copied arguments.
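The copy-then-call defense can be illustrated with os.pwritev (Linux). Here mutable bytearrays stand in for the attacker-reachable shared memory, and snapshotting them into immutable bytes before the syscall plays the role of the copy into the trusted region; the wrapper name is ours.

```python
import os
import tempfile

def guarded_pwritev(fd: int, iov, offset: int) -> int:
    """Snapshot the IO vector before issuing the syscall, so a
    concurrent writer cannot retarget it between the check and the
    use (the TOCTOU window described above)."""
    frozen = [bytes(buf) for buf in iov]  # the copy into 'trusted' memory
    # ...permission and pointer checks on `frozen` would run here...
    return os.pwritev(fd, frozen, offset)

fd, path = tempfile.mkstemp()
iov = [bytearray(b"hello "), bytearray(b"world")]
written = guarded_pwritev(fd, iov, 0)
iov[0][:] = b"HACKED"  # attacker mutates the vector, but too late
with open(path, "rb") as f:
    data = f.read()
os.close(fd)
os.remove(path)
print(written, data)  # 11 b'hello world'
```

Because the kernel only ever sees the frozen copy, later mutations of the shared vector have no effect on the write.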

7.1.5 TSX attack

TSX is an extension to support transactional memory in x86. It follows a similar principle to exception handling, but at the hardware level: when anything considered a violation of the transaction happens, the hardware rolls back all modifications and jumps to a preset restore point. Unfortunately, because the rollback feature provides a crash-free way of probing memory contents, it has been used as a source of memory leakage ever since it was first introduced. It has been deprecated in the latest Intel CPUs but still exists in many products that also ship MPK. Our attack uses TSX to probe the randomized trampoline. First, xbegin is used to start a transaction. Then, we call an address within the trampoline region. There are three possible contents at the target address: int3, syscall, or ret. In the first two cases, the transaction aborts, but in the third case the ret instruction executes successfully. This difference is observable by the attacker, so any address containing ret is exposed. Because our sysret gadget is syscall; return;, this reveals the secret address of syscall. Fortunately, TSX can be disabled through the kernel or BIOS, and among all Intravirt configurations, only secc-rand is secret-based and therefore susceptible.

7.1.6 Race condition using multi threading

Supporting multi-threading is essential in modern computing environments, and Intravirt supports it. However, a few attack surfaces exploit race conditions in a multithreaded environment. First, an indirect jump to syscall; return; is possible in ephemeral Intravirt. For example, one thread calls a syscall that takes a very long time, and the attacker thread jumps to the still-active syscall; return;. To prevent such attacks, we use either syscall user dispatch or per-thread Seccomp filters. Second, attackers could perform TOCTOU attacks against the syscall virtualization. For example, one thread opens a normal file and calls a file-backed syscall, while another thread closes the file descriptor and opens a sensitive file that is not allowed for untrusted code. In Intravirt, we provide per-file-descriptor locks so that the close system call blocks while another thread is using that file descriptor. Likewise, Intravirt provides locks for memory management system calls and signal-related system calls.
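The per-descriptor locking can be sketched with an ordinary mutex table: a thread entering a file-backed syscall takes the lock for that descriptor, and a concurrent close on the same descriptor must wait until it is released. This is our simplification; Intravirt's real locks live inside the trusted monitor.

```python
import threading

class FdLocks:
    """Per-file-descriptor locks guarding the window between a
    permission check and the actual system call."""
    def __init__(self):
        self._guard = threading.Lock()   # protects the table itself
        self._locks = {}

    def acquire(self, fd: int) -> None:
        with self._guard:
            lock = self._locks.setdefault(fd, threading.Lock())
        lock.acquire()

    def release(self, fd: int) -> None:
        self._locks[fd].release()

locks = FdLocks()
locks.acquire(3)  # thread A enters read(3)
# A concurrent close(3) would block here until thread A finishes:
blocked = not locks._locks[3].acquire(blocking=False)
print(blocked)  # True
locks.release(3)
```

The inner table lock only serializes lock creation, so uncontended descriptors pay almost nothing.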

7.2 Performance Evaluation

In this section we characterize the performance overhead of Intravirt. First, we explore microbenchmarks focusing on the cost of intercepting system calls and signals. Second, we demonstrate the performance of Intravirt for common applications. Third, we evaluate the cost of the least-privilege NGINX use case.

We perform all experiments on an Intel 11th-generation CPU (i7-1165G7) with 4 cores at 2.8 GHz (Turbo Boost and hyper-threading disabled) and 16 GB of memory, running Ubuntu 20.04 with kernel version 5.9.8 including CET and syscall user dispatch support. For all experiments we average over 100 repetitions and analyze different Intravirt configurations. Intravirt relies on a Seccomp filter or syscall user dispatch (denoted Sec or Dis) for system call interception, and a random, ephemeral, or CET trampoline (denoted rnd, eph, cet). In this configuration space we evaluate 5 different configurations ((sec|dis)_(rnd|eph|cet)) and do not evaluate the insecure dis_rnd configuration.

Throughout this section, we compare against MBOX [65] and strace, both ptrace-based system call monitors. MBOX fails for experiments using common applications; in these cases we approximate the performance of MBOX using strace. In our microbenchmarks, strace outperforms MBOX by 2.7%, providing a conservative lower bound for MBOX.

Figure 7.1: System call latency in the LMBench benchmark for (a) open, (b) read, (c) write, (d) mmap, (e) install signal, and (f) catch signal, comparing native, ptrace, strace, and the Intravirt configurations (secc_rand_1, secc_eph, disp_eph, secc_cet, disp_cet).

7.2.1 Microbenchmarks

System call overhead

We evaluate Intravirt's overhead on system calls and signal delivery in comparison to native execution and the ptrace-based techniques. Figure 7.1 depicts the latency of LMBench v2.5 [81] for common system calls. Each Intravirt configuration and the ptrace-based techniques intercept system calls and provide a virtualized environment to LMBench while protecting its privileged state.

secc-eph and secc-rand_1 modify the trampoline on every system call, but secc-eph saves the cost of randomizing the trampoline location and hence incurs less overhead. secc-eph/disp-eph and secc-cet/disp-cet demonstrate the performance difference between using a Seccomp filter and syscall user dispatch to intercept system call invocations. Overall, disp-eph outperforms all other configurations, while secc-rand_1 is the slowest. Even though CET relies on hardware support, it does not outperform the other configurations. Intravirt adds 0.5-2 µsec per system call for disp-eph for policy enforcement and domain switches. In comparison, the ptrace-based technique incurs about 20 µsec per invocation, which is 4.7-26.8 times slower than disp-eph.

We observe high overheads when Intravirt protects fast system calls like 1-byte read or write (126%-900%), whereas long-lasting system calls like open or mmap only see 29%-150% overhead. We demonstrate the difference with a file IO throughput experiment. Figure 7.2 shows high overheads for small read sizes which amortize with larger buffer sizes. Since Intravirt's overhead is incurred per system call, reading a file with a larger buffer size incurs much less overhead than with a smaller one. Even though we observe high overheads for some system calls, applications use them infrequently and

Figure 7.2: Normalized throughput of reading a 40 MB file for read sizes from 1 KB to 512 KB, comparing secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, and ptrace.

Figure 7.3: getppid latency for different rerandomization frequencies (native, secc_eph, and secc_rand_1 through secc_rand_1024).

observe far less overhead as shown for common applications in § 7.2.2.

Randomization and performance tradeoff

The secc-rand configuration rerandomizes the trampoline for each system call, generating a random number using RDRAND (approx. 460 cycles). We explore alternative rerandomization frequencies to amortize the cost of randomization over several system calls. We trade performance off against security, since the system call address becomes easier to guess when rerandomization happens less frequently. The goal is to find a reasonably secure, but fast, rerandomization frequency.

Figure 7.4: Random read bandwidth for different numbers of threads (1 to 32), measured with sysbench, comparing native, secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, and strace.

Figure 7.3 evaluates the getppid system call for different randomization frequencies. getppid is the fastest system call and hence yields Intravirt's highest overhead. The overhead of secc-rand amortizes with less frequent randomization and does not improve much beyond 16 system calls per randomization. secc-rand at 4 system calls per randomization shows performance similar to secc-eph's, which we also observed for other LMBench microbenchmarks.

Thread scalability

To prevent race conditions and TOCTOU attacks in Intravirt, locks protect Intravirt's policy enforcement, as addressed in § 5.5.6. We demonstrate the scalability of Intravirt in figure 7.4 using the sysbench [82] tool, which concurrently reads a 1 GB file from a varying number of threads. Due to the additional locks in Intravirt, the number of futex system calls increases with the number of threads.

Figure 7.5: Normalized overhead of different Linux applications (curl, NGINX, sqlite3, zip) for secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, and strace.

At 4 threads, all CPU cores are busy and we observe the best performance. The overhead of each configuration is similar to the microbenchmarks. secc-cet and disp-cet suffer a performance decrease of up to 60%, because the syscall performance of the CET-based configurations is the lowest. Intravirt outperforms strace by 4.3-8.2 times.

7.2.2 Macrobenchmarks

Alongside the microbenchmarks, we analyze the performance of common applications such as lighttpd [83], NGINX [52], curl [84], the SQLite database [85], and zip [86] protected by Intravirt. Figure 7.5 shows the overall overhead of each application compared to native execution.

curl [84]

downloads a 1 GB file from a local web server. It is a particularly challenging workload for Intravirt, since curl makes a system call every 8 KB and frequently installs signal handlers. In total it issues more than 130,000 write system calls and more than 30,000 rt_sigaction system calls to download a 1 GB file. However, libcurl supports an option not to use signals, which reduces Intravirt's overhead by about 10% on average but makes strace about 140% worse.

Lighttpd [83] and NGINX [52]

serve a 64 KB file requested 1,000 times by an apachebench [87] client on the same machine. All configurations perform within 94% of native. disp-eph outperforms all other configurations and highlights Intravirt's ability to protect applications at near-zero cost, with a throughput degradation of 1%. In contrast, strace incurs about 30% overhead.

SQLite [85]

runs its speedtest benchmark [85], performing read and write system calls with very small buffer sizes to serve individual SQL requests. Contrary to the microbenchmarks, the difference between configurations is larger: configurations using syscall user dispatch (disp-eph and disp-cet) see about 30% less overhead than their Seccomp alternatives (secc-eph and secc-cet). Strace performs poorly, at more than 500% overhead.

zip [86]

compresses the full Linux kernel 5.9.8 source tree, a massive task which opens every file in the source tree, reads its contents, compresses it, and archives the results into a zip file. The observed performance degradation is in line with the microbenchmarks for the openat, read, and write system calls.

Summary:

Network-based applications like lighttpd and NGINX perform close to native, whereas file-based applications observe overheads between 4% and 55% depending on the test scenario. Most impacted are applications that access small files, like SQLite. Compared to ptrace-based techniques, Intravirt outperforms by 38-529%.

7.3 Performance Evaluation of the Use Cases

7.3.1 zlib

As discussed in § 6.1.1, the value of isolating zlib [77] is as a reference implementation that we can easily compare to other techniques. We use a whole-library-separation approach to isolate the zlib library and measure the time to perform zlib API calls with a simple test application. The test application takes an English text file as input, reads 4 KB, compresses it, uncompresses it, and compares the result to the original data. It measures the time to repeat the compression, uncompression, and memory comparison 10,000 times. There are six zlib API calls in each test iteration, resulting in 12 xcalls. From the system call point of view, the test issues about 40,000 brk calls and almost no other system calls at all.
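The test loop can be sketched as follows; this is a minimal stand-in using Python's zlib bindings rather than the C API, and the iteration count is reduced here for brevity (the benchmark uses 10,000):

```python
import zlib

CHUNK = 4 * 1024       # 4 KB input block, as in the test application
ITERATIONS = 1_000     # the benchmark uses 10,000; reduced here

data = b"some English sample text, repeated to fill the block. " * 100
block = data[:CHUNK]   # the 4 KB slice read from the input file

for _ in range(ITERATIONS):
    compressed = zlib.compress(block)       # compression step
    restored = zlib.decompress(compressed)  # uncompression step
    assert restored == block                # memory comparison step
```

Under Intravirt, each compress and decompress call above would cross into the isolated zlib endoprocess via an xcall.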

Figure 7.6 shows the normalized overhead of the zlib test application compared to the native implementation for each Intravirt configuration, allowing us to compare isolated zlib against non-isolated zlib. First of all, secc-rand_16, secc-eph, and disp-eph show about 20% overhead due to system call virtualization in Intravirt, plus 2-3% overhead due to the xcalls. Using this simple use case, we can easily estimate the overhead of an xcall.

Figure 7.6: Normalized overhead of isolated zlib, comparing the not-isolated and isolated cases for each Intravirt configuration.

For each xcall, the process switches to the trusted domain, acquires the required information about the domain switch, such as the stack pointer and function pointer, switches to the target domain, and performs the same steps in reverse on return. This procedure consists of dozens of memory accesses and two WRPKRU instructions. The overhead of a single xcall is 116 cycles with non-CET-based Nexpoline and 269 cycles with CET-based Nexpoline.
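The WRPKRU step in each switch loads a new value into the PKRU register, which holds two permission bits (access-disable and write-disable) per protection key. The sketch below shows only the bit arithmetic for such a value; the helper name is illustrative, not Intravirt's actual code, and a real switch would execute WRPKRU with this value in EAX and ECX = EDX = 0:

```python
# PKRU layout: 2 bits per protection key; bit 2k is access-disable
# and bit 2k+1 is write-disable for key k (16 keys total).
def pkru_for_domain(key: int) -> int:
    """Illustrative helper: deny all 16 keys, then re-enable `key`."""
    deny_all = 0xFFFFFFFF
    return deny_all & ~(0b11 << (2 * key))

print(hex(pkru_for_domain(0)))   # only key 0 fully accessible
print(hex(pkru_for_domain(1)))   # only key 1 fully accessible
```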

However, the overhead of secc-cet and disp-cet is abnormally significant, at 2-3 times the native implementation. Since we had no prior experience running code under CET, we needed to analyze this issue, so we tested CET-enabled zlib in the native environment. Table 7.2 shows the result of the same test application without Intravirt but with a CET-enabled zlib library: it takes more than twice as long as with the CET-disabled zlib library on the same kernel. This slowdown deserves further discussion, but this thesis does not focus on CET itself, so we leave the issue out of scope.

Table 7.2: Performance overhead of the zlib test due to CET. No Intravirt involved.

  Setup        Without CET   With CET
  Time (sec)   1.343         3.029

7.3.2 Safeboxing OpenSSL and Sandboxing Parser in NGINX

§ 6.1.2 and 6.2.1 describe NGINX using Intravirt to safebox the OpenSSL library and sandbox the parser module. Based on this privilege separation, we perform a throughput experiment downloading differently-sized files, as shown in figure 7.7. The measurement relies on TLS v1.2 with a self-signed private root CA certificate and a server certificate signed by the root CA, with the cipher suite ECDHE-RSA-AES128-GCM-SHA256, 2048, 128.

Figure 7.7: Normalized throughput of privilege-separated NGINX using TLS v1.2 with ECDHE-RSA-AES128-GCM-SHA256, 2048, 128, for file sizes from 1 KB to 512 KB (configurations: secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, strace).

The performance of the ptrace-based system is also shown as a reference data point, even though it does not provide safeboxing or sandboxing.

In most cases, Intravirt with safebox and sandbox performs within 10% of native, about 3-4% more than Intravirt without safebox and sandbox (see figure 7.5). The normalized throughput decreases for bigger file sizes because NGINX does not read the whole file in one system call; it issues read system calls bounded by a predefined buffer size, so the total number of system calls grows with the file size.

Since Intravirt's overhead is directly impacted by the number of xcalls and the time per switch, we need to discuss the number of xcalls to understand figure 7.7. Table 7.3 shows the number of xcalls during the measurement for each file size. During startup, NGINX performs 89 xcalls in total to load configuration files and initialize OpenSSL with the private key. Each new connection results in a TLS handshake using 16 xcalls, plus 6 xcalls for initializing the session. Every 16 KB of request message requires 3 additional xcalls. For every HTTP request, the parser module is called five times, resulting in 5 more xcalls. After receiving the request, NGINX sends the target binary file as the response, which requires seven xcalls for initialization and three xcalls for each 16 KB of the file. The resulting totals are shown in table 7.3.

Table 7.3: xcall count for different file sizes in the test scenarios, including startup of the process.

  File size   1k    4k    16k   64k   256k   1024k
  Count       129   129   132   141   177    312

7.3.3 File and Directory Protection

As discussed in § 6.3, this application is an extension of the system call virtualization policy that provides additional protection for files and directories. Therefore, we need to understand the system's overhead and then figure out how that overhead affects actual applications. We perform a microbenchmark to measure the performance overhead of the affected system calls, and we also measure a few actual applications that use those system calls.

Microbenchmark

We use LMBench [81] again for the microbenchmark. Since this use case only extends the file-related system calls, we measure four typical file-based system calls: open, read, write, and mmap.

Figure 7.8 compares the normalized LMBench latency between Intravirt alone, Intravirt with file protection, and Intravirt with both file and directory protection, for each Intravirt configuration. As shown in the figure, read, write, and mmap show no significant overhead differences between the system call virtualization policies. However, open does incur significant overhead in the directory protection case. This is because file protection performs an additional fgetxattr system call to get the label of the file, while directory protection additionally calls readlink on /proc/self/fd/[FD] and performs a string comparison. In our test environment, file protection takes 0.7 μs per open, and directory protection takes 3.3 μs. However, once the file is open, Intravirt caches the permissions of the file descriptor, so there is no further overhead. The string comparison cost also grows with the number of protected directories; in this test, we have one protected directory.
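The directory check described above can be sketched as follows. This is a minimal illustration, not Intravirt's code; the protected-directory list and helper name are hypothetical, and the trailing comment only names the call that file protection uses:

```python
import os

# Hypothetical policy input: directories whose contents are protected.
PROTECTED_DIRS = ["/tmp/protected"]

def fd_in_protected_dir(fd: int) -> bool:
    """Resolve the fd's path via /proc/self/fd/[FD], as directory
    protection does, then string-compare against each protected prefix."""
    path = os.readlink(f"/proc/self/fd/{fd}")
    return any(path == d or path.startswith(d + "/") for d in PROTECTED_DIRS)

# File protection instead fetches the file's label with one fgetxattr
# call (os.getxattr on Linux), which is why it is cheaper per open.
```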

Like open, all the system calls taking a path as an input parameter will have a similar

Figure 7.8: System call latency of LMBench with different protection methodologies: (a) open, (b) read, (c) write, (d) mmap; each panel compares not protected, file protected, and directory protected for every Intravirt configuration.

overhead, because they perform the same functions. Therefore, the overhead of those system calls will also be similar. However, we estimate that such system calls occur far less frequently than open, read, and write, so the overall overhead in actual application environments should be small.

NGINX

We use NGINX again for the performance evaluation. NGINX is one of the best applications for protecting secret keys and private keys, and, as one of the most well-known event-driven, single-process web servers, it is a fitting application for Intravirt.

Figure 7.9: Normalized throughput of NGINX downloading a 64 KB file for different private-key protection methodologies (not isolated, isolated, private key protected, directory protected).

Figure 7.9 shows the normalized throughput of the different Intravirt configurations, illustrating the overhead of file protection and directory protection. We measured the bandwidth to download 64 KB files from the local NGINX web server running on Intravirt. As shown in the figure, the overhead of each feature does not differ significantly, and thus the overhead is independent of the protection policy. Therefore, it is safe to say that file protection and directory protection do not contribute to the overhead.

Analyzing the measurement more systematically requires system call execution statistics. The test downloads a 64 KB file from a local server running on Intravirt and measures the throughput over 1,000 repetitions. Within this test, 5,000 read, 9,000 write, 2,000 close, 2,000 pread64, and 1,000 openat calls are executed. As discussed earlier, thanks to the permission cache there is almost no overhead other than for open, so there is not

Figure 7.10: Normalized latency of zip for different file protection methodologies (not protected, file protected, directory protected).

much overhead in total. Also, the cost of opening files is still small compared to the other computation, making the overall overhead due to the protection negligible.

In summary, Intravirt can protect a web server's sensitive in-memory data, such as session keys and private keys, as well as data stored on disk, such as private key files, within a single process and with less than 10% overhead.

Zip

To show the overhead of file and directory protection effectively, we pick another application: the zip test scenario from § 7.2.2. Since it opens and reads every file in the Linux kernel source tree, it is a perfect test scenario for this feature.

Figure 7.10 shows the normalized latency to compress the whole source tree of Linux 5.9.8 under different protection policies. As shown in the figure, the file protection policy adds 1-2% overhead, and the directory protection policy adds another 1-2%. This total 2-3% overhead remains relatively small given the massive number of file operations.

We also analyze the system call frequency in this test case. There are 193K reads, 309K writes, and 79K openats and closes, but the total computational overhead that Intravirt adds is small relative to the compression work itself. Therefore, Intravirt remains valuable even in this file-operation-heavy environment.

Chapter 8

Conclusion and Future Work

This dissertation identifies drawbacks of existing privilege separation techniques. Existing techniques can be categorized as 1) separating processes and communicating via IPC, 2) sandboxing data and code within a process so that neither can dereference the other, or 3) controlling memory visibility within a process using various software and hardware technologies. However, each category has problems: process separation suffers from performance issues, and sandboxing struggles with the interaction between boxes, which is why subprocess isolation has drawn attention. Unfortunately, existing subprocess isolation techniques share a flaw: they do not consider the underlying operating system as a threat. Commodity operating systems like Linux treat the process as the unit of separation, and the OS interfaces share resources within the process, so an attacker can easily cross the isolation boundaries through those interfaces.

This dissertation proposes a new subprocess isolation model, Endokernel. Endokernel introduces a virtualized endoprocess model in which each endoprocess runs in a virtualized environment within a process and uses xcalls to interact with other endoprocesses safely. We also develop a prototype of the Endokernel, Intravirt, and evaluate the value and efficiency of the Endokernel model. Intravirt is a userspace solution: it requires no modification to the operating system kernel or the runtime environment, and applications need not be modified to run on Intravirt.

Intravirt has several advantages as a new subprocess isolation mechanism. It is certainly very secure, but above all it has very low overhead due to its subprocess nature. The low overhead is its most significant advantage because it greatly increases applicability. Since Intravirt is a userspace solution, it is straightforward to port to any commodity operating system, which further increases compatibility and applicability. In addition, applications require no significant modification, lowering the adoption hurdle. Intravirt provides endoprocess virtualization by virtualizing the system calls and the signals, which substantially reduces attacks through the underlying operating system interfaces and provides an endoprocess virtual machine. Unlike many existing techniques, Intravirt carefully handles concurrency in the endoprocess virtualization. Lastly, Intravirt adopts a brand-new security feature, Intel CET, a hardware-accelerated control-flow integrity technology, and is among the pioneers in using it.

Intravirt helps applications achieve performance and least privilege at the same time. For example, the NGINX web server is designed around an event-driven, single-process model for which Intravirt can provide many features simultaneously: it can separate the memory regions for session keys and private keys, isolate access to sensitive OS objects such as private key files and user data files, and minimize overhead by utilizing endoprocess virtualization. By using xcalls, NGINX gains safe and fast communication between endoprocesses. Lastly, some applications can enforce optimized, fine-grained endoprocess access control policies using system call and signal virtualization; for example, an endoprocess firewall would also be possible.

This dissertation presents Endokernel as a new model of privilege separation, and Intravirt evaluates the model, demonstrating its security and low performance overhead. Still, the work is not complete, and several aspects require more effort. First of all, Intravirt uses Intel MPK as its separation mechanism, which efficiently and securely isolates memory pages within a process, but the total number of keys is only 16. Intravirt takes three of them for the monitor and the application endoprocess, so at most 13 domains are possible, which significantly limits applicability. There are techniques to overcome this limit, such as libmpk [25], but they increase the performance overhead dramatically. Therefore, we will need to overcome the limit by utilizing other hardware technologies or by finding a new isolation approach.

There is no doubt that CET is one of the most crucial components in this dissertation. However, we do not consider CET a mature technology yet. CET is a hardware-accelerated control-flow integrity technology, but our evaluation shows that it is not faster than software-based control-flow integrity, and in some cases it is much slower. Since this dissertation does not focus on CET as a research topic, we did not perform a deeper analysis. In the future, we will need to understand CET and its implementation more thoroughly, which could lead to significant improvements in the performance and security of Intravirt.

Intravirt is a solid prototype of Endokernel, but we did not focus on performance optimization. We took care with performance, but several opportunities remain to optimize it while retaining the same functionality. An optimized design and implementation of Endokernel would increase the value and extensibility of this research.

Lastly, Endokernel proposes a robust security system that preserves performance, but one critical aspect is missing: Endokernel does not sufficiently consider the endoprocess life cycle. That is, the policy for creating and destroying an endoprocess is absent, so attacks could create a process or endoprocess that bypasses the separation. We need to design such policies to be compatible with and applicable to existing applications without significant modification.

References

[1] R. J. Connor, T. McDaniel, J. M. Smith, and M. Schuchard, “PKU pitfalls: Attacks on

PKU-based memory isolation systems,” in 29th USENIX Security Symposium (USENIX

Security 20), pp. 1409–1426, USENIX Association, Aug. 2020.

[2] “CVE-2014-0160.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160.

(Accessed on 07/05/2021).

[3] “OpenSSL, Cryptography and SSL/TLS Toolkit.” https://openssl.org. (Accessed on

07/04/2021).

[4] “CVE-2009-2629.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-2629.

(Accessed on 06/28/2021).

[5] “CVE-2013-2028.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2028.

(Accessed on 06/28/2021).

[6] “CVE-2013-2070.”

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2070. (Accessed on

06/28/2021).

[7] “CVE-2021-3156.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3156.

(Accessed on 06/08/2021).

[8] “sudo Main Page.” https://sudo.ws. (Accessed on 07/04/2021).

[9] “Chromium Multi-process Architecture.” https:

//www.chromium.org/developers/design-documents/multi-process-architecture.

(Accessed on 07/04/2021).

[10] W. Venema, “Postfix: Past, present, and future,” in Invited Talk at the 24th Large

Installation System Administration Conference, LISA, vol. 146, 2010.

[11] “H2O, the optimized HTTP/1.x,HTTP2 server.” https://h2o.examp1e.net/.

[12] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham, “Efficient software-based fault

isolation,” in Proceedings of the Fourteenth ACM Symposium on Operating Systems

Principles, SOSP ’93, (New York, NY, USA), p. 203–216, Association for Computing

Machinery, 1993.

[13] G. C. Necula, S. McPeak, and W. Weimer, “CCured: Type-safe retrofitting of legacy

code,” in Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of

Programming Languages, POPL ’02, (New York, NY, USA), p. 128–139, Association for

Computing Machinery, 2002.

[14] G. Tan, A. W. Appel, S. Chakradhar, A. Raghunathan, S. Ravi, and D. Wang, “Safe

java native interface,” in Proceedings of IEEE International Symposium on Secure

Software Engineering, vol. 97, p. 106, Citeseer, 2006.

[15] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka, N. Narula,

and N. Fullagar, “Native client: A sandbox for portable, untrusted x86 native code,” in

2009 30th IEEE Symposium on Security and Privacy, pp. 79–93, 2009.

[16] J. Huang, O. Schranz, S. Bugiel, and M. Backes, “The art of app compartmentalization:

Compiler-based library privilege separation on stock android,” in Proceedings of the

2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17,

(New York, NY, USA), p. 1037–1049, Association for Computing Machinery, 2017.

[17] M. Sun and G. Tan, “Nativeguard: Protecting android applications from third-party

native libraries,” in Proceedings of the 2014 ACM Conference on Security and Privacy in

Wireless Mobile Networks, WiSec ’14, (New York, NY, USA), p. 165–176, Association

for Computing Machinery, 2014.

[18] J. Litton, A. Vahldiek-Oberwagner, E. Elnikety, D. Garg, B. Bhattacharjee, and

P. Druschel, “Light-weight contexts: An OS abstraction for safety and performance,”

in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI

16), (Savannah, GA), pp. 49–64, USENIX Association, Nov. 2016.

[19] T. C.-H. Hsu, K. Hoffman, P. Eugster, and M. Payer, “Enforcing least privilege

memory views for multithreaded applications,” in Proceedings of the 2016 ACM

SIGSAC Conference on Computer and Communications Security, CCS ’16, (New York,

NY, USA), p. 393–405, Association for Computing Machinery, 2016.

[20] Y. Chen, S. Reymondjohnson, Z. Sun, and L. Lu, “Shreds: Fine-grained execution

units with private memory,” in 2016 IEEE Symposium on Security and Privacy (SP),

pp. 56–71, 2016.

[21] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis, “Dune:

Safe user-level access to privileged CPU features,” in 10th USENIX Symposium on

Operating Systems Design and Implementation (OSDI 12), (Hollywood, CA),

pp. 335–348, USENIX Association, Oct. 2012.

[22] M. Hedayati, S. Gravani, E. Johnson, J. Criswell, M. L. Scott, K. Shen, and M. Marty,

“Hodor: Intra-process isolation for high-throughput data plane libraries,” in 2019

USENIX Annual Technical Conference (USENIX ATC 19), (Renton, WA), pp. 489–504,

USENIX Association, July 2019.

[23] D. Schrammel, S. Weiser, S. Steinegger, M. Schwarzl, M. Schwarz, S. Mangard, and

D. Gruss, “Donky: Domain keys – efficient in-process isolation for RISC-V and x86,” in

29th USENIX Security Symposium (USENIX Security 20), pp. 1677–1694, USENIX

Association, Aug. 2020.

[24] A. Vahldiek-Oberwagner, E. Elnikety, N. O. Duarte, M. Sammler, P. Druschel, and

D. Garg, “ERIM: Secure, efficient in-process isolation with protection keys (MPK),” in

28th USENIX Security Symposium (USENIX Security 19), (Santa Clara, CA),

pp. 1221–1238, USENIX Association, Aug. 2019.

[25] S. Park, S. Lee, W. Xu, H. Moon, and T. Kim, “libmpk: Software abstraction for intel

memory protection keys (intel MPK),” in 2019 USENIX Annual Technical Conference

(USENIX ATC 19), (Renton, WA), pp. 241–254, USENIX Association, July 2019.

[26] D. Chisnall, C. Rothwell, R. N. Watson, J. Woodruff, M. Vadera, S. W. Moore, M. Roe,

B. Davis, and P. G. Neumann, “Beyond the PDP-11: Architectural support for a

memory-safe c abstract machine,” in Proceedings of the Twentieth International

Conference on Architectural Support for Programming Languages and Operating

Systems, ASPLOS ’15, (New York, NY, USA), p. 117–130, Association for Computing

Machinery, 2015.

[27] R. N. Watson, J. Woodruff, P. G. Neumann, S. W. Moore, J. Anderson, D. Chisnall,

N. Dave, B. Davis, K. Gudka, B. Laurie, et al., “CHERI: A hybrid capability-system

architecture for scalable software compartmentalization,” in 2015 IEEE Symposium on

Security and Privacy, pp. 20–37, IEEE, 2015.

[28] B. Davis, R. N. M. Watson, A. Richardson, P. G. Neumann, S. W. Moore, J. Baldwin,

D. Chisnall, J. Clarke, N. W. Filardo, K. Gudka, A. Joannou, B. Laurie, A. T. Markettos,

J. E. Maste, A. Mazzinghi, E. T. Napierala, R. M. Norton, M. Roe, P. Sewell, S. Son, and

J. Woodruff, “CheriABI: Enforcing valid pointer provenance and minimizing pointer

privilege in the posix c run-time environment,” in Proceedings of the Twenty-Fourth

International Conference on Architectural Support for Programming Languages and

Operating Systems, ASPLOS ’19, (New York, NY, USA), p. 379–393, Association for

Computing Machinery, 2019.

[29] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flow integrity,” in

Proceedings of the 12th ACM Conference on Computer and Communications Security,

CCS ’05, (New York, NY, USA), p. 340–353, Association for Computing Machinery,

2005.

[30] V. Kuznetsov, L. Szekeres, M. Payer, G. Candea, R. Sekar, and D. Song, “Code-pointer

integrity,” in 11th USENIX Symposium on Operating Systems Design and

Implementation (OSDI 14), (Broomfield, CO), pp. 147–163, USENIX Association, Oct.

2014.

[31] S. Narayan, C. Disselkoen, T. Garfinkel, N. Froyd, E. Rahm, S. Lerner, H. Shacham,

and D. Stefan, “Retrofitting fine grain isolation in the firefox renderer,” in 29th

USENIX Security Symposium (USENIX Security 20), pp. 699–716, USENIX Association,

Aug. 2020.

[32] Mozilla, “Firefox - Protect your life online with privacy-first product.”

https://www.mozilla.org/en-US/firefox/. (Accessed on 08/07/2021).

[33] ARM, “Domain Access Control Register.” https://developer.arm.com/documentation/

ddi0434/b/System-Control/Register-descriptions/Domain-Access-Control-Register.

[34] Intel Cooperation, “Intel(R) 64 and IA-32 Architectures Software Developer’s

Manual.” https://software.intel.com/en-us/articles/intel-sdm, 2016.

[35] H. Lefeuvre, V.-A. Bădoiu, P. Olivier, T. Mosnoi, R. Deaconescu, F. Huici, and

C. Raiciu, “FlexOS: Making OS isolation flexible,” in HotOS’21: Workshop on Hot Topics

in Operating Systems, 2021.

[36] M. Sung, P. Olivier, S. Lankes, and B. Ravindran, “Intra-unikernel isolation with intel

memory protection keys,” in Proceedings of the 16th ACM SIGPLAN/SIGOPS

International Conference on Virtual Execution Environments, VEE ’20, (New York, NY,

USA), p. 143–156, Association for Computing Machinery, 2020.

[37] J. Woodruff, R. N. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie,

P. G. Neumann, R. Norton, and M. Roe, “The CHERI capability model: Revisiting RISC in

an age of risk,” in Proceeding of the 41st Annual International Symposium on Computer

Architecuture, ISCA ’14, p. 457–468, IEEE Press, 2014.

[38] B. Davis, R. N. M. Watson, A. Richardson, P. G. Neumann, S. W. Moore, J. Baldwin,

D. Chisnall, J. Clarke, N. W. Filardo, K. Gudka, A. Joannou, B. Laurie, A. T. Markettos,

J. E. Maste, A. Mazzinghi, E. T. Napierala, R. M. Norton, M. Roe, P. Sewell, S. Son, and

J. Woodruff, “CheriABI: Enforcing valid pointer provenance and minimizing pointer

privilege in the posix c run-time environment,” in Proceedings of the Twenty-Fourth

International Conference on Architectural Support for Programming Languages and

Operating Systems, ASPLOS ’19, (New York, NY, USA), p. 379–393, Association for

Computing Machinery, 2019.

[39] H. Xia, J. Woodruff, H. Barral, L. Esswood, A. Joannou, R. Kovacsics, D. Chisnall,

M. Roe, B. Davis, E. Napierala, J. Baldwin, K. Gudka, P. G. Neumann, A. Richardson,

S. W. Moore, and R. N. M. Watson, “CheriRTOS: A capability model for embedded

devices,” in 2018 IEEE 36th International Conference on Computer Design (ICCD),

pp. 92–99, 2018.

[40] Y. Ren, G. Liu, V. Nitu, W. Shao, R. Kennedy, G. Parmer, T. Wood, and A. Tchana,

“Fine-grained isolation for scalable, dynamic, multi-tenant edge clouds,” in 2020

USENIX Annual Technical Conference (USENIX ATC 20), pp. 927–942, USENIX

Association, July 2020.

[41] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman, “Linux security

modules: General security support for the linux kernel,” in 11th USENIX Security

Symposium (USENIX Security 02), (San Francisco, CA), USENIX Association, Aug.

2002.

[42] P. Loscocco and S. Smalley, “Integrating flexible support for security policies into the

linux operating system,” in 2001 USENIX Annual Technical Conference (USENIX ATC

01), (Boston, MA), USENIX Association, June 2001.

[43] T. Harada, T. Horie, and K. Tanaka, “Task oriented management obviates your onus

on linux,” in Linux Conference, vol. 3, p. 23, 2004.

[44] M. Bauer, “Paranoid penguin: An introduction to novell apparmor,” Linux Journal,

vol. 2006, p. 13, Aug. 2006.

[45] C. Schaufler, “Smack in embedded computing,” in Proc. Ottawa Linux Symposium,

p. 23, 2008.

[46] “YAMA - The Linux Kernel documentation.”

https://kernel.org/doc/html/v4.14/admin-guide/LSM/Yama.html. (Accessed on

07/04/2021).

[47] S. E. Hallyn and A. G. Morgan, “Linux capabilities: Making them work,” 2008.

[48] “SECure COMPuting with filters - The Linux Kernel documentation.”

https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt.

[49] M. Fleming, “A thorough introduction to eBPF [LWN.net],”

[50] I. Goldberg, D. Wagner, R. Thomas, and E. A. Brewer, “A secure environment for

untrusted helper applications confining the wily hacker,” in Proceedings of the 6th

Conference on USENIX Security Symposium, Focusing on Applications of Cryptography

- Volume 6, SSYM’96, (USA), p. 1, USENIX Association, 1996.

[51] N. DeMarinis, K. Williams-King, D. Jin, R. Fonseca, and V. P. Kemerlis, “sysfilter:

Automated system call filtering for commodity software,” in 23rd International

Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), (San

Sebastian), pp. 459–474, USENIX Association, Oct. 2020.

[52] “NGINX v1.24.0.” https://nginx.org/. (Accessed on 07/04/2021).

[53] S. Ghavamnia, T. Palit, S. Mishra, and M. Polychronakis, “Temporal system call

specialization for attack surface reduction,” in 29th USENIX Security Symposium

(USENIX Security 20), pp. 1749–1766, USENIX Association, Aug. 2020.

[54] H. Vijayakumar, X. Ge, M. Payer, and T. Jaeger, “JIGSAW: Protecting resource access

by inferring programmer expectations,” in 23rd USENIX Security Symposium (USENIX

Security 14), (San Diego, CA), pp. 973–988, USENIX Association, Aug. 2014.

[55] “ptrace.” https://man7.org/linux/man-pages/man2/ptrace.2.html. (Accessed on

07/04/2021).

[56] “strace.” https://man7.org/linux/man-pages/man1/strace.1.html. (Accessed on

07/04/2021).

[57] K. Jain and R. Sekar, “User-level infrastructure for system call interposition: A

platform for intrusion detection and confinement,” in Proc. Network and

Distributed Systems Security Symposium, 1999.

[58] M. Zheng, M. Sun, and J. C. Lui, “Droidtrace: A ptrace based android dynamic

analysis system with forward execution capability,” in 2014 international wireless

communications and mobile computing conference (IWCMC), pp. 128–133, IEEE, 2014.

[59] T. Garfinkel, B. Pfaff, and M. Rosenblum, “Ostia: A delegating architecture for secure

system call interposition,” in Proc. Network and Distributed Systems Security

Symposium, 2003.

[60] D. R. Engler, M. F. Kaashoek, and J. O’Toole, “Exokernel: An operating system

architecture for application-level resource management,” in Proceedings of the

Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, (New York, NY,

USA), p. 251–266, Association for Computing Machinery, 1995.

[61] WebAssembly Community, “Security - WebAssembly.”

[62] S. Narayan, C. Disselkoen, T. Garfinkel, N. Froyd, E. Rahm, S. Lerner, H. Shacham,

and D. Stefan, “Retrofitting fine grain isolation in the firefox renderer,” in 29th

USENIX Security Symposium (USENIX Security 20), pp. 699–716, USENIX Association,

Aug. 2020.

[63] Z. Durumeric, F. Li, J. Kasten, J. Amann, J. Beekman, M. Payer, N. Weaver, D. Adrian,

V. Paxson, M. Bailey, and J. A. Halderman, “The matter of heartbleed,” in Proceedings

of the 2014 Conference on Internet Measurement Conference, IMC ’14, (New York, NY,

USA), p. 475–488, Association for Computing Machinery, 2014.

[64] Z. Tarkhani and A. Madhavapeddy, “Sirius: Enabling system-wide isolation for

trusted execution environments,” CoRR, vol. abs/2009.01869, 2020.

[65] T. Kim and N. Zeldovich, “Practical and effective sandboxing for non-root users,” in

2013 USENIX Annual Technical Conference (USENIX ATC 13), (San Jose, CA),

pp. 139–144, USENIX Association, June 2013.

[66] R. M. Needham, “Protection systems and protection implementations,” in Proceedings

of the December 5-7, 1972, fall joint computer conference, part I, AFIPS ’72, (New York,

NY, USA), pp. 571–578, 1972.

[67] J. M. Rushby, “Design and verication of secure systems,” in Proceedings of the Eighth

ACM Symposium on Operating Systems Principles, SOSP ’81, (New York, NY, USA),

p. 12–21, Association for Computing Machinery, 1981.

[68] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve, “Nested kernel: An

operating system architecture for intra-kernel privilege separation,” in Proceedings of

the Twentieth International Conference on Architectural Support for Programming

Languages and Operating Systems, ASPLOS ’15, (New York, NY, USA), p. 191–206,

Association for Computing Machinery, 2015.

[69] B. W. Lampson, “Protection,” SIGOPS Oper. Syst. Rev., vol. 8, p. 18–24, Jan. 1974.

[70] E. Witchel, J. Rhee, and K. Asanović, “Mondrix: Memory isolation for linux using

mondriaan memory protection,” in Proceedings of the Twentieth ACM Symposium on 127

Operating Systems Principles, SOSP ’05, (New York, NY, USA), p. 31–44, Association

for Computing Machinery, 2005.

[71] A. Ghosn, M. Kogias, M. Payer, J. R. Larus, and E. Bugnion, “Enclosure:

Language-based restriction of untrusted libraries,” in Proceedings of the 26th ACM

International Conference on Architectural Support for Programming Languages and

Operating Systems, ASPLOS 2021, (New York, NY, USA), p. 255–267, Association for

Computing Machinery, 2021.

[72] “Syscall User Dispatch.”

https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html.

(Accessed on 07/13/2021).

[73] V. Shanbhogue, D. Gupta, and R. Sahita, “Security analysis of processor instruction

set architecture for enforcing control-ow integrity,” in Proceedings of the 8th

International Workshop on Hardware and Architectural Support for Security and

Privacy, HASP ’19, (New York, NY, USA), Association for Computing Machinery,

2019.

[74] C.-C. Tsai, “Passthru-libos.” https://github.com/chiache/passthru-libos. (Accessed on

07/04/2021).

[75] C.-C. Tsai, K. S. Arora, N. Bandi, B. Jain, W. Jannen, J. John, H. A. Kalodner,

V. Kulkarni, D. Oliveira, and D. E. Porter, “Cooperation and security isolation of

library oses for multi-process applications,” in Proceedings of the Ninth European

Conference on Computer Systems, EuroSys ’14, (New York, NY, USA), Association for

Computing Machinery, 2014. 128

[76] C. che Tsai, D. E. Porter, and M. Vij, “Graphene—SGX: A practical library OS for

unmodied applications on SGX,” in 2017 USENIX Annual Technical Conference

(USENIX ATC 17), (Santa Clara, CA), pp. 645–658, USENIX Association, July 2017.

[77] “zlib — a massively spiy yet delicately unobtrusive compression library.”

https://https://zlib.net/. (Accessed on 07/04/2021).

[78] “xattr(7) — Linux manual page.”

https://man7.org/linux/man-pages/man7/xattr.7.html. (Accessed on 06/29/2021).

[79] “setfattr(1) — Linux manual page.”

https://man7.org/linux/man-pages/man1/setfattr.1.html. (Accessed on 06/29/2021).

[80] “chroot(2) — Linux manual page.”

https://man7.org/linux/man-pages/man2/chroot.2.html. (Accessed on 07/04/2021).

[81] L. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in

USENIX 1996 Annual Technical Conference (USENIX ATC 96), (San Diego, CA),

USENIX Association, Jan. 1996.

[82] A. Kopytov et al., “Scriptable database and system performance benchmark.”

https://github.com/akopytov/sysbench. (Accessed on 06/08/2021).

[83] “Lighttpd v1.4.59.” https://www.lighttpd.net/. (Accessed on 07/04/2021).

[84] “CURL: Command line tool and library for transferring data with URLs v7.77.0.”

https://curl.haxx.se/. (Accessed on 07/04/2021).

[85] “SQLite Database Engine v.3.36.0.” https://www.sqlite.org/index.html. (Accessed on

07/04/2021). 129

[86] “Info-zip’s zip.” http://infozip.sourceforge.net/Zip.html. (Accessed on 06/08/2021).

[87] “Ab - Apache HTTP server benchmarking tool v2.4.”

https://httpd.apache.org/docs/2.4/en/programs/ab.html. (Accessed on 07/04/2021).