RICE UNIVERSITY Safe and Secure Subprocess Virtualization in Userspace
By
Bumjin Im
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
Doctor of Philosophy
APPROVED, THESIS COMMITTEE
Nathan Dautenhahn, Assistant Professor of Computer Science
Ang Chen, Assistant Professor of Computer Science
Dan Wallach, Professor of Computer Science and of Electrical and Computer Engineering
Kaiyuan Yang, Assistant Professor of Electrical and Computer Engineering
HOUSTON, TEXAS
August 2021

ABSTRACT
Safe and Secure Subprocess Virtualization in Userspace
by
Bumjin Im
Commodity operating systems isolate applications at the process boundary, and developers build applications upon this principle. However, applications cannot simply trust process-based isolation. Virtually every application links at least one dynamic library at runtime, and those libraries share all resources within the same process boundary. Unfortunately, application developers do not fully understand the libraries they use, and for some complex applications it may even be infeasible to do so. If a single malicious or buggy library is linked into an application, it can breach the entire application because of the process-boundary principle. Since process-based isolation is likely to persist for some time, achieving least privilege under it remains difficult. We propose a new process model, Endokernel, to resolve this issue. Endokernel places a monitor inside a standard process of a commodity operating system and provides safe isolation between subprocesses, their maintenance, and secure interactions between them. Endokernel also proposes an endoprocess virtualization technique. Endoprocess virtualization can realize a more fine-grained least-privilege principle in commodity computing environments. We develop Intravirt as a prototype of Endokernel. Intravirt realizes the Endokernel model on Intel CPUs and Linux by actively employing Intel® Memory Protection Keys (MPK) and Control-flow Enforcement Technology (CET) as its core security mechanisms. Because MPK and CET are hardware mechanisms, Intravirt aims to provide secure and high-performance endoprocess virtualization. We then evaluate the security and performance of Intravirt with microbenchmarks and real applications across several use cases for secure computing. Throughout this research, we verify that Endokernel is a feasible, lightweight, applicable, and effective security model.

Acknowledgments
It was a reckless decision for a middle-aged man to start an advanced academic degree in a foreign country, in a foreign language, after resigning from a well-paid and recently promoted job. Indeed, hardly anyone understood this decision, and many people said it was a mistake. However, I started a new life in Houston, Texas, becoming a student again after 13 years, earned a master's degree, published a conference paper, and finally earned a Ph.D. degree. This tremendous achievement would have been impossible without enormous help and support from many people. Without them, there would be no research achievement, no conference paper, no admission to the university, and I would never have been able to dream of this. Professor Dan Wallach guided me to join the Ph.D. program at Rice University. Without him, I would never have thought of applying to Rice University. Instead of rushing me to finish the coursework quickly, he gave me enough time to settle into the new culture. He also gave me enormous advice as a father, neighbor, and teacher, which helped me so much in carrying out the program and supporting my family. Lastly, when I decided to change advisors, he did not hesitate to allow and support my new decision, so the lost momentum of the research was able to grow again. It was the beginning of my 5th year when I decided to join Nathan's group. I was in my mid-40s, had a family, and my background knowledge differed from the group's research projects. Hence, it was a risky gamble for him to admit me as his student. However, he welcomed me without hesitation and supported my decision. He also understood and waited patiently through my months of a distracting working environment and slow progress, caused by supporting my family during the pandemic and by my lack of background knowledge. Without Professor Nathan Dautenhahn, I would have decided to stop the program during my 5th year. I am sure he thought carefully about admitting me as the first graduate student of his academic career.
He must also have been anxious about the research after admitting me. I appreciate his endless patience while waiting for my research progress. Mr. Hyunjin Choi became my boss about ten years after I started working at Samsung. Working with him was an auspicious event for me. He tried to make the most rational and practical decisions, and he always tried to reduce unnecessary burdens on my work. He always gave me his best advice, not only on projects but also on career and personal issues, so he was not simply a boss but a teacher in my life. When, after a few years of working with him, I grew frustrated with continuing my career at Samsung and in Korea, his advice was to consider an advanced academic degree abroad and to develop a new career there, instead of telling me to keep working with him forever. An ordinary manager would tell his coworker to stay with sweet promises like promotion, but he guided me toward a different career path and chose to let me go. He is truly one of the people who influenced my life. Fangfei Yang is my lucky elf in this research. At the beginning of the research, I could not code in assembly and had no detailed knowledge of low-level code and hardware. The only thing I had was the research idea. His deep knowledge of low-level operating systems and hardware, and his never-decreasing passion, kept the research rolling all the time and injected even more fascinating ideas into it. I admire him as a fellow student and greatly appreciate his efforts. Without his contribution, the research could have stalled at any time. Daniel Song joined Rice University two years earlier than I did, working with Professor Dan Wallach, and he is Korean. He gave me enormous help and tips for surviving in a foreign country without trouble, and he kept in touch with my family as well, becoming an uncle to my kids. He still gives me even more tips and help about graduation and career paths, as well as stories of his own mistakes.
He spent a noticeable amount of his time and resources on my family and me, so that I could start my life in a foreign country without hassle, and my children gained an uncle. Lastly, I have to thank my family. Most of all, my wife gave up all the privileges and assets she possessed and simply followed me; I appreciate her sacrifice, and I also feel deeply sorry for her. Her husband was a recognized employee at Samsung, her children enjoyed their school life, and there was no trouble in sight, so no one else supported my decision to go abroad for this program. Yet she supported me from the moment I started thinking about the Ph.D. program at Rice University, and she still perseveres in a foreign country with only her immediate family members. She also still makes endless efforts to support my program and to overcome this pandemic. She is the cornerstone of my life, without a doubt. I can clearly recall my children's first day of school in Houston. They were dropped into unfamiliar schools, could not understand English at all, faced a completely different culture, and had no friends. But they did not complain about the new schools, and fortunately they adapted quickly. The pandemic has kept my kids stuck at home all the time, but they still do not complain, and they keep doing what they need to do. I really appreciate my adorable kids.

Contents
1 Introduction 1 1.1 Ideal Solution: Use Safe Languages for Everything ...... 3 1.2 Straightforward Solution: More Process Separations ...... 3 1.3 Efficient Solution: Subprocess Isolations ...... 5 1.4 Problems in Subprocess Isolation ...... 6 1.5 Endokernel: Safe Subprocess Isolation in Commodity OS ...... 8 1.6 Contributions ...... 9
2 Subprocess Isolations and System Call Virtualizations 12 2.1 Subprocess Separation ...... 12 2.1.1 Language Based Separation ...... 13 2.1.2 Operating System Based Separation ...... 17 2.1.3 Hardware Accelerated Separation ...... 19 2.2 System Call and Signal Virtualization ...... 26 2.2.1 Linux Security Module ...... 26 2.2.2 System call Filtering ...... 27 2.2.3 System call tracing and interposition ...... 30
3 Threats 33 3.1 Unauthorized memory access ...... 33 3.2 Unauthorized file access ...... 35 3.3 Unauthorized system call execution ...... 35 3.4 Attack on Subprocess Isolation: PKU Pitfall ...... 36 4 Endokernel Architecture 38 4.1 Assumption ...... 38 4.2 Requirements ...... 38 4.3 Mechanisms Gaps and Challenges ...... 40 4.4 Endoprocess Model ...... 42 4.5 Design Principle ...... 44 4.6 Authority Model ...... 45 4.7 Nested Endokernel Organization ...... 47 4.7.1 In-Process Policy ...... 47 4.7.2 Interface ...... 48 4.8 Separation Facilities: Nested Boxing ...... 49 4.9 Intel® Memory Protection Key ...... 51
5 Design and Implementation 52 5.1 Privilege and Memory Virtualization ...... 52 5.1.1 Virtual Privilege Switch ...... 53 5.1.2 Securing the Domain Switch ...... 53 5.1.3 Instruction Capabilities ...... 54 5.1.4 Controlling mode switches ...... 55 5.2 System Call Monitor and Handling ...... 58 5.2.1 Passthrough ...... 59
5.2.2 No syscall from untrusted domain subspaces ...... 59
5.2.3 Complete mediation for mapped syscall ...... 60 5.3 OS Object Virtualization ...... 63 5.3.1 Sensitive but Unvirtualized System Calls ...... 63 5.3.2 Files ...... 63 5.3.3 Mappings ...... 64 5.3.4 Processes ...... 64 5.3.5 Forbidden system calls ...... 65 5.4 Signal virtualization ...... 65 5.4.1 Signals for Ephemeral System Call Trampoline ...... 68 5.4.2 Multithreading Design ...... 69 5.4.3 CET ...... 69 5.4.4 Multiple subdomains ...... 70 5.5 Multi-threading and Concurrency ...... 70 5.5.1 Concurrency in subprocess isolation ...... 70 5.5.2 Multithreading model ...... 71 5.5.3 Thread Local Data Structure ...... 71 5.5.4 Required Atomicity ...... 72 5.5.5 sysret-gadget Race Condition ...... 73
5.5.6 Clone ...... 74 5.5.7 Multi-Domain ...... 76 5.6 Implementation Details ...... 80
6 Use Cases 81 6.1 Library Isolation ...... 81 6.1.1 Reference Application: zlib ...... 81 6.1.2 Safeboxing OpenSSL in NGINX ...... 82 6.2 Module sandboxing ...... 83 6.2.1 Sandboxing HTTP Parser in NGINX ...... 83 6.2.2 Preventing sudo Privilege Escalation ...... 85 6.3 Endo-process System Call Policy Enhancement ...... 85 6.3.1 NGINX Private Key File Protection ...... 85 6.3.2 Directory Protection ...... 90
7 Evaluation 93 7.1 Security Evaluation ...... 93 7.1.1 Fake Signal ...... 93 7.1.2 Fork Bomb ...... 94 7.1.3 Syscall Arguments Abuse ...... 95 7.1.4 Race condition using shared memory ...... 95 7.1.5 TSX attack ...... 96 7.1.6 Race condition using multi threading ...... 96 7.2 Performance Evaluation ...... 97 7.2.1 Microbenchmarks ...... 99 7.2.2 Macrobenchmarks ...... 102 7.3 Performance Evaluation of the Use Cases ...... 104 7.3.1 zlib ...... 104 7.3.2 Safeboxing OpenSSL and Sandboxing Parser in NGINX ...... 106 7.3.3 File and Directory Protection ...... 108
8 Conclusion and Future Works 113
References 117 List of Figures
1.1 Problems of privilege separation approaches ...... 7
4.1 Intravirt Architecture...... 43
5.1 Signal Entrypoint ...... 66 5.2 State Transition with Signal; UT:Untrusted; T: Trusted; Sig: Signal Handler, Signal masked by Kernel; Smi: Semi-Trusted Domain ...... 67
7.1 System call latency of LMBench benchmark...... 98 7.2 Normalized latency of reading a 40MB file...... 100
7.3 Latency of getppid for different rerandomization scaling...... 100 7.4 Random read bandwidth for diff. number of threads measured with
sysbench...... 101 7.5 Normalized overhead of diff. Linux applications...... 102 7.6 Normalized overhead of isolated zlib...... 105 7.7 Normalized throughput of privilege separated NGINX using TLS v1.2 with
ECDHE-RSA-AES128-GCM-SHA256, 2048, 128...... 106 7.8 System call latency of LMBench benchmark with different protection methodologies...... 109 7.9 Normalized throughput of NGINX to download 64KB file for different private key protection methodologies...... 110 7.10 Normalized latency of zip for different file protection methodologies. . . . 111 List of Tables
7.1 Quantitative security analysis based on attacks demonstrated in [1] and attacks found by us. ◦ indicates the variant of Intravirt in this column is vulnerable, • if it prevents this attack. × indicates this attack is beyond Intravirt’s threat model...... 94 7.2 Performance overhead of zlib test due to CET. No Intravirt involved. . . . . 106
7.3 xcall count for different file sizes in the test scenarios including startup of the process...... 107
Chapter 1
Introduction
Security has recently become one of the most critical requirements in computing.
Modern operating systems (OSes), such as Windows, Unix, Linux, FreeBSD, and macOS,
provide various mechanisms and abstractions for security and continuously
introduce new mechanisms to protect the system from ever more advanced attacks. To provide
this security abstraction, they use the process as the unit of security management. Because of
this, all the code in a single process shares the same privilege level, memory, and files.
Therefore, if a small part of the process is compromised, the entire process is affected.
The current security architecture is not entirely incorrect: any security architecture
needs some unit of protection. If the unit is too small, application development
becomes challenging; if the unit is too big, serious security issues arise, so this
granularity trade-off is present in any computing environment. However, modern
applications are complicated: they provide feature-rich functions, are visually polished, and
require many common functionalities such as security. Due to this complexity and these massive
requirements, application developers cannot develop every feature from scratch,
so using third-party libraries is very common. For example, the Linux version of
the Google Chrome web browser links about 100 libraries. Those libraries mostly implement
common but labor-intensive functions, such as cryptography, mathematical routines, and
3D graphics. As a result, a single bug in any one of these libraries can result in a full
breach of the victim application process and of the system itself. Moreover, this trend does
not seem likely to stop anytime soon.
There are several real-world cases of this problem. The most famous incident is HeartBleed [2]. HeartBleed is a vulnerability caused by a bug in the OpenSSL [3] library, the de facto standard for cryptography and secure communication. OpenSSL has been used by most Unix-based systems, including Linux and FreeBSD. The bug is very simple: a missing bounds check on the SSL heartbeat message made it exploitable, so a maliciously crafted heartbeat message could expose memory contents on the target system. The vulnerability was phenomenal precisely because everyone used OpenSSL: a single simple bug affected millions of computers around the world.
The library is not the only problem; each module should also be treated as a security unit. For example, every module in a web server application shares its privileges and resources with the other modules. The HTTP parser module only requires access to the HTTP message received from the network; it does not need access to any other resources, and no privilege is required. However, in the current architecture, any buggy HTTP parser module can lead to a complete compromise, and the attacker can acquire full access to the web server and to the underlying system as well. CVE-2009-2629 [4],
CVE-2013-2028 [5], and CVE-2013-2070 [6] show this type of vulnerability.
There is one more example of this case, CVE-2021-3156 [7]. Sudo [8] is a Unix utility that executes commands with root privilege.
Usually, system administrators use sudo to gain root privilege to manage the system configuration. To execute a command-line utility as root, the administrator runs sudo with the target utility as its command-line argument. When sudo is executed, it first asks the user for a password and continues execution only when the password is correct and the user is in the sudoers group. However, a bug in sudo's command-line parser module allows an attacker to execute arbitrary code without any verification. This attack is possible because the command-line argument parser shares the privilege of the sudo utility, which is a setuid application.
In conclusion, we have to reconsider the problematic process-based privilege model in
the complicated modern computing environment. If we can devise a new privilege model with the same applicability, finer granularity, and minor performance overhead, we could
help the billions of people whose data is at risk.
1.1 Ideal Solution: Use Safe Languages for Everything
This type of issue is not new, and there have been enormous efforts with various approaches
to solving the problem. The most ideal approach is to write code in a type-safe language, such
as Java or Rust. By doing so, most memory corruption bugs disappear.
However, most libraries are still developed in unsafe languages such as C and C++,
and this trend does not seem likely to change anytime soon. Also, even if everything
is developed in safe languages, an application is still vulnerable to intentionally malicious libraries.
1.2 Straightforward Solution: More Process Separations
A more affordable approach is to separate modules and libraries into different processes.
It is the easiest and most straightforward approach to apply, and many existing
applications use it to preserve security, including mail servers, web servers,
and even web browsers. Process separation relies on the separation features provided by
operating systems and hardware, which are already well proven and easy to apply.
However, the application has to be redesigned to insert IPC routines wherever the separated
processes must share data, which introduces performance overhead. It also creates
further performance overhead due to context switching between processes.
Google Chrome [9] is a web browser based on an open source project developed by
Google. The significant difference between traditional web browsers like Internet Explorer
and Chrome is that Chrome uses a separate process for each opened tab. In Chrome, there
are two types of processes. The browser process handles I/O and the main loop of
the browser application, and a renderer process handles content rendering for each
tab. When a new tab is created, the browser process forks a renderer process; the
renderer process receives HTTP data from the Internet via the browser process, renders
the contents, and lets the browser process show the result on the screen. As a result, the
memory footprint is significantly higher, and system performance degrades considerably when too many tabs are open, due to the massive number of IPCs and context switches.
However, the performance overhead is not a serious issue because the bottleneck of web
browsing is the speed of Internet traffic.
Postfix [10] is an email server developed at IBM to replace the obsolete sendmail email
server. Sendmail was initially developed in the 1980s and performs all email-related
functions, such as email transmission, inbox management, and user management, in a single
root process. Due to its complexity, there have been many vulnerabilities, such as CA-1988-01,
CA-1990-01, and CA-1994-12. Small bugs in the program led to a
complete breach of the system because it is a single root process. Postfix spreads this
risk across multiple processes. Postfix launches more than ten processes during startup, and
each module uses IPC to communicate with the other modules. Also, none of the processes
runs as root. The performance overhead can be much more significant than sendmail's,
but that is not a severe problem given the nature of email performance requirements.
More recently, web server software was released that is resistant to the HeartBleed
attack. The H2O project [11] architecture is a single-process, event-driven web server, but
one small module is separated into its own process. H2O also uses OpenSSL [3] as its
cryptographic and TLS library, but it separates the private key operations into a dedicated process. Whenever the server requires a private key computation, it requests the computation from the private key module, and the module returns only the results. The private key always stays in the memory of the separated module and never leaves it. Performance is a crucial requirement for a web server, and process separation could burden performance; fortunately, the overall performance overhead is only 2% because the private key is only required during web session startup. However, H2O has drawbacks.
First, it only protects the private key in memory. Other resources in memory are not protected, and the private key file itself is not protected, so an attacker could simply open the private key file. Also, since only the private key module is separated, H2O has to use many low-level OpenSSL APIs to perform secure communication, even though OpenSSL itself provides one-line high-level APIs.
1.3 Efficient Solution: Subprocess Isolations
Subprocess isolation is a relatively new approach that separates resources within a single process, defines a policy for each compartment, and enforces access control inside that single process. This approach requires relatively more effort than process separation, and there is almost no underlying operating system support. However, it is much faster than process separation, applicable across operating systems and architectures, and can be optimized for various applications.
The most common technique for subprocess isolation is Software Fault Isolation (SFI) [12], which compartmentalizes code, memory, and resources into two or more domains and monitors the interactions between domains. There are multiple approaches to achieving this: modifying the compiler [13–17], extending the underlying operating system kernel [17–23], modifying userspace libraries [22–25], utilizing hardware functionalities [20, 22–24], and, in some extreme cases, designing a new computing environment [23, 26, 27]. In this approach, the application runs in the same address
space, so it requires neither kernel-level context switches nor performance-heavy IPC.
Therefore, it is generally much faster than process separation. The applications of this
approach are extensive, but the most famous is securing the foreign function
interface (FFI) [14, 15, 17, 28] between safe and unsafe languages, where the unsafe part is
compartmentalized to prevent unauthorized memory access.
1.4 Problems in Subprocess Isolation
Subprocess isolation is a novel approach to privilege separation, and its expected security is
promising. However, we have to address the fact that the underlying operating system is not aware
of it. The operating system manages the hardware and software and provides interfaces, the system calls, for users to manage resources. System calls provide privilege separation
and access control, but their base unit is the process, not the subprocess separation domain.
Therefore, even under subprocess separation, the resources accessed by system calls are
shared. For example, file descriptors and signal handlers are shared, and memory access
via ptrace remains available unless the system calls are properly handled.
Therefore, every subprocess separation technique should be aware of this threat. As a
result, we have to be aware of the interfaces and functionality provisioned by the operating
system kernel.
Connor et al. [1] show this issue precisely. In Linux, for example, the operating
system provides several ways to access memory other than direct access by
address. First, it provides a file, /proc/[pid]/mem, in the proc file system. This virtual
file maps to the virtual memory address space of the corresponding process [pid]. Simply
by opening this file, reading and writing it becomes equivalent to direct memory access. Therefore,
if a subprocess isolation abstraction does not take care of this interface, the technique
is insecure. Along with this file-backed memory access, further interfaces provide such
memory access, such as signals, ptrace, and debugging. Thus, the underlying operating
system interfaces have to be considered carefully.
We can categorize the responses of existing works to this issue into three types. The first,
and the majority, is not to solve the problem at all, which means that many existing works have serious security holes. The second is to prohibit system calls; in this
case the security hole is closed, but applicability decreases significantly. The last
is to intercept and virtualize system calls via ptrace or an equivalent debugging feature. This
approach can achieve both security and applicability, but since the mechanism requires
multiple processes and mediation by the kernel, performance drops dramatically.
Therefore, we need a new type of technique that satisfies security, applicability,
and performance at the same time.
Figure 1.1: Problems of privilege separation approaches
Figure 1.1 summarizes the typical problems of privilege separation. First, process
separation has a clear advantage in separating memory and resources, but it pays
a massive penalty for sharing data through IPC and for context switching. Second,
sandboxing techniques protect the application from untrusted code running
inside it, but they require limiting system call features to prevent attacks via system
calls. Last, subprocess isolation protects sensitive data within the application process, but
untrusted code can bypass the protection through operating system interfaces.
Lastly, multithreading is crucial in the modern computing environment, but it is
tough to support concurrency together with subprocess isolation. First, the underlying
OS allows all resources to be shared between threads, so the isolation could interfere
across threads. Second, thread-local storage must be extended to securely provide
isolation for each thread. Lastly, the isolation has to support communication between
threads securely yet extensively.
1.5 Endokernel: Safe Subprocess Isolation in Commodity OS
Our goal is to develop a new subprocess separation technique with very low performance
overhead that supports multiple separation domains and accounts for the operating system
interface, without requiring modification of the hardware or the operating system. For
performance, we utilize hardware-accelerated memory protection mechanisms. We then
separate the target application process into multiple domains, and we provide a monitor,
called the endokernel, to support domain management, domain switches, cross-domain function calls,
and system call virtualization. The endokernel prevents unauthorized system call execution,
and all system calls are executed in a trampoline that the endokernel provides. We also provide a minimal and straightforward system call virtualization policy to protect the
application from memory protection bypasses. The endokernel supports concurrency
and protects the application from attacks such as time-of-check-to-time-of-use (TOCTOU)
attacks.
Lastly, we develop a prototype of the endokernel, called Intravirt, in userspace. Developing
such features in the kernel may look more suitable and convenient due to the privilege level
and controllability, but it has a critical drawback. All Unix-based operating systems take
the process as the unit of privilege separation, so code developed in the kernel would
not be upstreamed unless the kernel developers changed this fundamental architecture.
We would therefore have to port the implementation of Intravirt to every new release
of the kernel. In userspace, however, we only need to maintain the Intravirt code itself, which makes
applicability and deployability much better. Finally, we evaluate the security and
performance of Intravirt, and we perform case studies on a few use case scenarios.
We implemented Intravirt in a Linux environment, on Ubuntu 20.04 with kernel version 5.9.8, including a few recent feature patches from upstream. Our
code works entirely in userspace and consists of 15,000 lines of C code and 4,000 lines of
assembly code. We reused about 6,000 lines of open source C code, so our contribution is
about 9,000 lines of C code along with 400 lines of assembly code.
1.6 Contributions
This dissertation presents the following artifacts and contributions:
Endokernel Architecture a new subprocess isolation abstraction with the following contributions.
• Provides a monitor that creates and maintains subprocess isolation domains, linked into
the application process during its startup.
• All functions run in userspace, and no kernel modification is required.
• Implements the endokernel prototype, Intravirt.
• Provides an Intel MPK based mechanism and supports multiple domains, up to 16 MPK
domains.
• Provides hardware-accelerated memory protection with very low performance overhead on domain switches.
System call virtualization framework monitors all system calls and virtualizes them to protect the system.
• Provides a trampoline to execute syscall instructions safely and in a controlled manner.
• Virtualizes all system calls, preventing arbitrary syscall instruction execution
by attackers.
• Protects the system from indirect jumps into the trampoline.
• Provides concurrency in the trampoline for multi-threaded environments.
• The framework does not require modifying applications to virtualize their system
calls.
Signal virtualization framework provides a virtualized environment for signal handling that monitors all signals and prevents them from compromising Intravirt.
• Protects the sigframe data structure to prevent malicious modification of important
registers and variables.
• Prevents signal spoofing, so that an attacker cannot artificially invoke a signal handler.
System call baseline policy provides protection against attacks via malicious system calls.
• Systematically analyzes all system calls to find any possible MPK bypass.
• Enforces the policy at runtime.
• Provides concurrency in policy enforcement.
Compelling use cases provides applications of the endokernel with compelling scenarios.
• Selects applications whose known problems the endokernel resolves.
• Designs and implements the protection policy for each application.
• Provides actual performance data.
Chapter 2
Subprocess Isolations and System Call Virtualizations
In this chapter, we survey related work on subprocess isolation. The goal of the endokernel
is to isolate parts of the memory space within the application process, to virtualize
system calls and signals, and to enforce a security policy that protects the application
from various attacks. We survey isolation techniques provided by languages,
operating systems, and hardware, and we also investigate system call and signal virtualization techniques.
2.1 Subprocess Separation
Traditional address-space separation provided by operating systems is widely used because process separation is a proven technique and is easy to apply in most computing environments. However, it has clear limitations: context switching and IPC impose significant overhead, and it is not trivial to share memory between processes effectively and securely. Numerous techniques have therefore been published to address these issues.
Subprocess privilege separation involves a few basic operations: identify the address space of the application, compartmentalize the memory into two or more domains, trust only one of the compartmentalized domains, and enforce a policy so that only code in the allowed domain can access the corresponding memory. The most common technique providing such isolation is Software Fault Isolation (SFI) [12]. To make SFI safe, there are further techniques such as Control Flow Integrity (CFI) [29] and Code Pointer Integrity (CPI) [30]. Existing works are mostly based on one or more of these techniques.
In this section, we analyze the existing works, addressing their contributions and their drawbacks. To provide a well-structured survey, we categorize them into language-based, operating-system-based, and hardware-based works. Language-based works focus on compilers to provide the abstractions; operating-system-based works enhance an existing operating system to provide the separation; and hardware-based works use hardware features to provide such abstractions.
2.1.1 Language Based Separation
The easiest way is to write the code in memory-safe languages. Because this approach is applied statically, performance optimization is relatively straightforward, but every library linked into the application must also be written in a memory-safe language. Unfortunately, numerous libraries are written in unsafe languages such as C or C++, so this approach is not the easiest one in practice. To provide memory protection in these unsafe environments, some early techniques insert memory-boundary-check routines at compile time and check memory bounds on the heap. One application of these techniques is foreign-function-interface protection, which protects the safe language from the unsafe parts of the application, as in the Java Native Interface (JNI), WebAssembly, or Android native applications.
CCured CCured by Necula et al. [13] is the pioneer of this approach. The goal of CCured is to provide type safety in non-type-safe languages such as C without modifying existing source code, requiring only recompilation, and the result is remarkable. CCured provides type safety by categorizing pointers into different types and inserting type-checking and bounds-checking code into the original program at compile time. The overhead ranges from 0 to 100%, depending on the test application. At the time of publication, many existing applications could adopt CCured without modification. CCured also provides a formal definition and verification of its safety.
However, CCured has several disadvantages. Since the technique works at the compiler stage, CCured cannot cover dynamically linked code or self-modifying code. It also cannot handle uniquely designed data types. In addition, the memory footprint increases due to the extra tags and indexes, and performance overhead is unavoidable due to the compiler-inserted bounds checks. Lastly, CCured only provides type safety in C; therefore, any direct memory access, such as file-backed memory access through /proc/self/mem, can bypass CCured.
Safe Java Native Interface Java Native Interface (JNI) is known to be vulnerable to buggy native code, so Safe JNI by Tan et al. [14] was proposed to provide security at the interface between Java and native code. Safe JNI consists of three parts. First, it uses CCured to provide type safety in the native library. Second, it adds dynamic type-checking code at the JNI interface. Lastly, it provides a new memory-management module for JNI applications. In Safe JNI, each pointer has a boolean validity tag to prevent dereferencing after free, acts as a simple reference counter, and is tracked by a C-level garbage collector. The authors tested Safe JNI with Zlib and found it about 10% faster than a pure-Java implementation of Zlib.
Since Safe JNI uses CCured internally, it inherits the disadvantages of CCured. In addition, since C code can invoke arbitrary low-level functions, it can easily bypass the Java security framework. As a result, Safe JNI provides memory safety, but it does not provide an overall security enhancement for the Java native interface.
Native Client Native Client (NaCl) [15] is a sandboxing framework designed by engineers at Google in 2009 to run native code safely in the Google Chrome browser. By sandboxing the native code, NaCl separates code and memory between the web and native sides, protecting the web browser from attacks by malicious native code. A new set of interfaces called NPAPI is defined to provide communication between native code and the web. To protect the browser from malicious native code, NaCl has its own dedicated compiler. The binary created by this compiler has a few unique properties: instructions are aligned to 32-byte bundles and page boundaries, hlt instructions are used as padding, and only a dedicated indirect-jump pseudo-instruction is allowed, so that an attacker cannot perform return-oriented or jump-oriented attacks. In addition, the memory regions of the web and native sides are separated, and data sharing is allowed only through NPAPI.
NaCl is a novel abstraction for providing such a sandboxed environment, but it does have disadvantages. To prevent illegal resource access from native code, NaCl applies a very strict system call filter. Most system calls are not allowed for native code, making the native code less useful. The native code in NaCl is dedicated to faster computation rather than providing rich native features.
CompARTist CompARTist [16] separates advertising libraries from the rest of the code in Android applications at compile time. The compiler analyzes the intermediate representation of the application's source code, identifies the advertisement library and the application code, separates them into different processes, substitutes function calls between the application and the ad library with binder calls, and then compiles the application. To preserve application functionality along with the advertisement, CompARTist identifies the location of the ad banner on the screen and overlays the ad banner on top of the application window. Since there is not much interaction between the application and the ad banner, the overhead is relatively small. Because of the process separation, a malicious advertisement library cannot access the memory of the application.
Even though CompARTist provides an effective and powerful separation between the ad library and the application, it has critical limitations. First, the compiler performs very complicated tasks to identify the ad library and the application code, analyze the display location, and seamlessly overlay the two windows so that they look like one. Because of this complexity, the applicability is very low: in the paper, only about 62% of the selected applications from the Google Play Store worked correctly. In addition, because it depends on the Android platform, any change in the platform would affect the technique. Lastly, even though the ad library is separated into a different process, the authors did not adequately separate the permissions of the ad-banner process, which means that a malicious ad-banner process has the same privileges as the application.
RLBox RLBox [31] is a library isolation technique for the Mozilla Firefox [32] web browser. It does not focus mainly on the isolation mechanism itself, which could be SFI or process separation. Instead, its contribution lies in making the computing environment of library isolation secure. The authors carefully analyze the attack surfaces and potential issues of library calls to fulfill secure isolation and propose an automated safe library-isolation framework. The Firefox web browser uses this technique commercially, and its analysis of the attacks and potential issues is significantly valuable.

Since the technique is applied in commercial software, the level of completion is very high, and most isolation research can refer to its analysis. However, like Native Client [15], it does not allow system calls, which lowers its applicability, and the performance overhead is high, exceeding 20% in some cases.
2.1.2 Operating System Based Separation
As mentioned above, process-based isolation suffers from performance overhead. There have been several efforts to provide finer-grained and more lightweight separation than process-based isolation by modifying the operating system.
Lightweight Context Lightweight Context (LwC) [18] provides a context-separation technique that behaves similarly to traditional process separation but in a simpler form. LwC provides LwC_create to copy an LwC instance, similar to the fork system call, and provides a context-switching API between LwC instances. As a result, it behaves much like process separation in that memory and file descriptors are separated between LwC instances, but it does not provide concurrency between LwC instances: only one instance can run at a time. LwC also provides a resource-overlay feature to share resources between LwC instances in a process. Since LwC provides lightweight separation and context switches, the overhead is smaller than a context switch between processes.
Even though LwC provides robust separation and resource-sharing features, it has a critical disadvantage in its implementation. LwC is implemented in the FreeBSD kernel, so porting it to other platforms such as Linux requires additional research effort, and unknown corner cases on other platforms could be a big obstacle. In addition, FreeBSD itself keeps evolving, so LwC must be ported to the latest version of the kernel every time a new version is released.
Secure Memory Views Secure Memory Views (SMV) [19] implements intra-process memory separation in a unique way. Its design uses the monolithic Linux kernel as the codebase and modifies how page table entries (PTEs) are managed to provide memory separation between threads in a process. Since it builds on page-table-entry management, the overhead is minimal and the memory separation is very efficient. Application developers must modify their applications to call the proper SMV APIs to utilize the isolation and to enforce access policies such as granting and revoking. SMV shows less than 1% overhead in a Cherokee web server test scenario because it leverages the virtual-memory-management mechanism in Linux.
Even though SMV has very low performance overhead due to its unique PTE-based design, it has several drawbacks. First, it does not provide privilege separation for non-threaded third-party libraries; in this case, application developers must modify the application to provide such isolation. SMV is a thread-based isolation technique, so it is inadequate for memory isolation within a single-threaded application.
NativeGuard NativeGuard [17] is a technique to separate the Java part and the native part of Android applications. Instead of applying NativeGuard during application development, it repackages existing applications: it analyzes the original application package, identifies the Java part and the native part, substitutes API calls with binder IPC messages, and repackages them into separate applications. As a result, NativeGuard can be a very effective technique for separating the foreign function interface in the Android environment. Since NativeGuard splits one application into two, the separation is very effective due to the process boundary, but strictly speaking it cannot be categorized as a subprocess separation technique.
Because the design of NativeGuard is simple and straightforward, so are its drawbacks. Since it separates one application into two different applications, the overhead of NativeGuard is relatively high due to IPC and context switches. The authors performed several simple performance tests, and the overhead is up to 200% depending on the test scenario. In addition, there is a critical integrity problem: since NativeGuard repackages the original package, signature verification will fail due to the signature mismatch.
2.1.3 Hardware Accelerated Separation
The techniques mentioned above add extra functionality, so performance overhead is inevitable. Since minimizing this overhead is the most crucial goal of these techniques, many works try to use hardware features. The most common hardware feature is the VT-x x86 virtualization extension, and another frequently explored feature is Intel Memory Protection Keys (MPK). Some other efforts design new hardware to realize such isolated environments.
Shreds Shreds [20] provides a subprocess separation technique similar to LwC [18]. Shreds uses the Domain Access Control Register (DACR) [33] as its memory-protection mechanism. DACR supports up to 16 memory-protection domains: a domain is assigned to each page table entry, the access permissions of each domain are stored in the DACR register, and the CPU automatically enforces the access control whenever a process accesses memory. Application developers call the shreds_enter API when sensitive data must be accessed, which changes the DACR domain. After finishing the sensitive operations, the application calls shreds_exit to exit the domain, and Shreds switches the DACR back to the normal domain. Operations that manage DACR are privileged, so Shreds provides a kernel module and a userspace interface to manage DACR properly. Shreds also provides compiler mechanisms to verify the usage of the Shreds APIs and CFI mechanisms to prevent attacks such as ROP. Shreds has no performance overhead for memory protection during memory access because DACR is a hardware feature, but it does have overhead from Shreds context switches. The paper reports a performance overhead of up to 5% in its tests; in addition, due to the compiler modification, compilation time increases by up to 40%.
Shreds provides very concise, fast, and powerful subprocess isolation. However, the dependencies between the CPU architecture, the operating system kernel, and the compiler could make maintenance difficult whenever any of these modules is updated. More importantly, Shreds does not address the security of the operating-system interface: an attacker could successfully bypass the memory protection by using system calls or signal handlers.
Dune In 2012, a creative technique for memory protection and privilege separation, called Dune [21], was published. CPU virtualization features had been supported much earlier, but Dune used them to provide application privilege separation instead of running a virtual machine. In Dune, applications run on top of a newly created hypervisor rather than virtualizing a whole operating system, and Dune exposes virtualization features of Intel CPUs, such as ring management, to userspace. Dune also uses the system call trap mechanism to intercept all system calls of the application running on top of the hypervisor and pass them to the host operating system. Since Dune uses Intel's hardware features, the overhead of memory access is minimal, but overall system call performance is relatively slow due to the system call trap. However, some operations that manage virtual memory, such as the appel1 benchmark, are much faster than with native system calls.
Since Dune's abstraction is very different from other subprocess separation techniques, it is quite hard to compare Dune to them. However, due to its uniqueness, Dune has a unique drawback: an application running on top of the hypervisor requires the libdune library to manage page tables, access control policies, system calls, and signals, and that library is incredibly complex. Therefore, applying Dune to other platforms and hardware could be a bothersome task. In addition, even though Dune supports the system call trap, it does not particularly address the security issues of system calls. It would be straightforward to protect the system from system calls executed by untrusted code because the trap is already in place, but Dune lacks this consideration.
ERIM ERIM [24] provides an abstraction very similar to Shreds [20], but ERIM uses Intel's Memory Protection Keys (MPK) [34] instead of ARM's DACR [33]. MPK is a memory-protection hardware feature in Intel's recent CPUs that is very similar to DACR: MPK uses a dedicated register called the Protection Key Rights for Userspace (PKRU) register for access-control management of memory pages, just as ARM has the DACR register. However, unlike DACR, MPK operations are unprivileged userspace operations, so any code can execute the instructions that modify MPK settings. For example, the WRPKRU and XRSTOR instructions can directly modify the PKRU value, and system calls like pkey_alloc and pkey_mprotect can modify the protection key in the page table entry and the PKRU value as well. Because of this, ERIM scans all code regions of the application and its linked libraries to prevent the execution of such instruction sequences: if such instructions exist, ERIM replaces them with different instructions or adds safety checks after the instruction to prevent such attacks. In the same vein, ERIM prohibits memory that is both writable and executable, because an attacker might otherwise inject such an instruction into a benign page after allocation. One more difference from Shreds is that ERIM is a mostly userspace-driven abstraction; since it lives in userspace, it is straightforward to apply ERIM to other platforms. However, to prevent memory attacks, memory-allocation-related system calls such as mmap and mprotect are intercepted by either ptrace or a Linux Security Module (LSM). Since ERIM also uses hardware for memory protection, there is no performance overhead for the protection itself, but there is context-switch overhead between domains. Test results for NGINX with an AES session-key protection scenario show an overall overhead of up to 4%.
Even though ERIM provides a very concise and valuable abstraction, it has some critical disadvantages. First of all, ERIM only supports two domains: even though MPK supports up to 16 protection domains, ERIM utilizes only 2 of them. Second, since MPK operations are unprivileged userspace operations, additional routines are required to protect them from untrusted and malicious code, which is challenging to achieve. Third, ERIM lacks multi-threading consideration. Moreover, as mentioned in PKU Pitfalls [1], ERIM does not consider system calls at all; therefore, attackers could easily bypass ERIM's protection model by executing dangerous system calls.
HODOR HODOR [22] is very similar to ERIM [24]; the two were published concurrently by different groups. HODOR supports not only MPK but also VMFUNC as its memory-protection mechanism. Due to its similarity with ERIM, HODOR with MPK has almost the same characteristics as ERIM, including the performance overhead. The main difference from ERIM is that HODOR requires both kernel and userspace modification. Another difference is the number of domains: ERIM only supports two domains, but HODOR supports all 16 MPK domains. The most interesting difference is the way unauthorized WRPKRU instructions are prevented: ERIM scans and rewrites all possible WRPKRU candidates into different instructions, whereas HODOR uses hardware watchpoints to dynamically inspect WRPKRU instructions and trap them when executed.

Since HODOR is very similar to ERIM, it shares most of ERIM's advantages and disadvantages. However, because HODOR requires both kernel and userspace modification, it has relatively more dependency issues than ERIM.
Donky Donky [23], published in 2020, is very similar in design to this research. Donky supports both the Intel and RISC-V architectures. For RISC-V in particular, the authors designed a new register and memory-protection mechanism that provides a feature similar to Intel MPK but supports up to 1024 domains instead of MPK's 16. Donky provides safe system call filtering in userspace, which is also very similar to this research. However, the most interesting contribution of Donky is its support for memory protection on the RISC-V architecture: the authors added a new memory-protection mechanism and system-call-filtering feature to an open-source CPU architecture. Since the design is hardware-based, the performance overhead is minimal as well.
On the other hand, Donky lacks system call filtering on the Intel architecture. There, Donky relies on a hypervisor to prevent arbitrary system call attacks and indirect-jump attacks, which requires more dependencies and modules.
libmpk libmpk [25] provides another level of indirection for MPK. MPK has a critical limitation for applicability: it only supports up to 16 keys, so only 16 different domains are allowed in a single application. As the number of threads increases, this limitation brings severe issues in concurrency and security. libmpk overcomes the limitation by virtualizing MPK, similar to virtual memory in modern computing environments. First of all, when a new application is executed, libmpk assigns one domain for managing the virtual domains. After that, whenever the application requests a new domain, libmpk creates a virtual domain and maps it to one of the 15 remaining physical domains. On every domain switch, libmpk provides the virtual-to-physical mapping for the domain as well. Therefore, in theory, libmpk can support an unlimited number of MPK domains. In addition, libmpk maintains up to 15 MPK cache entries for performance.
There are two critical issues with libmpk. First, it does not take care of system calls; therefore, an attacker who can bypass MPK by invoking such system calls and signals could destroy the libmpk management system. Also, an MPK cache miss in libmpk causes serious performance issues depending on the memory footprint of the application: on a cache miss, libmpk maps the target virtual domain to the least-used physical domain and modifies all the page table entries of both the missed and the victim domain. The authors claim this is still much faster than calling the equivalent number of mprotect system calls, but it remains very slow.
FlexOS FlexOS [35] addresses the security issue in library OSes that untrusted application code and the most critical system libraries share the same process space. FlexOS isolates the libraries and applies MPK-based protection. For this, library developers must provide a specification for formal verification, and FlexOS provides an isolated environment based on that specification. FlexOS focuses on enhancing the security of the library OS and has 6-230% overhead depending on the test scenario.

FlexOS provides functionality very similar to this dissertation, but it lacks a few crucial aspects. First of all, it focuses only on library separation and misses the policy needed to prevent MPK bypass through system calls. Also, it does not consider that MPK does not protect code from execution. As a result, FlexOS does not look deployable anytime soon.
Sung et al. Sung et al. [36] provide intra-unikernel isolation with MPK, which is very similar to FlexOS [35]. This scheme provides type safety and memory safety by utilizing the Rust language, while MPK isolates the kernel. Performance overhead in microbenchmarks is relatively high, but it is much faster than the previous Linux-KVM-based scheme.

However, it has downsides similar to those of FlexOS. It does not take care of the system call policy, so it is trivial to bypass MPK through system calls, and it also does not consider that MPK does not protect executing code.
CHERI CHERI [27, 37] is a capability-based computing environment project driven by a group at the University of Cambridge over many years. CHERI has its own CPU architecture [27, 37], its own operating systems [38, 39], its own compiler [26], and applications [28] built for the CHERI architecture. The main contribution of CHERI is to provide an entire computing environment for capability-based computing: CHERI extends the concept of the pointer to provide memory safety. A traditional pointer holds only a memory address, but in CHERI a pointer consists of the address, the bounds, and permission data, with a size of up to 256 bits. Therefore, any process that wants to access memory must hold the proper capability. This type of memory safety was proposed long ago, but CHERI's main contribution is to provide a whole computing environment from the hardware to the applications, including the compiler and the operating system.

Even though CHERI introduced a complete capability-based computing environment, its drawback also stems from its contribution: CHERI lacks applicability because it requires dedicated hardware, a dedicated operating system, and a dedicated compiler, and applications must be redesigned.
EdgeOS EdgeOS [40] is a subprocess virtualization scheme to provide fast 5G network services in the edge cloud. It applies to microkernel operating systems and introduces a radically lightweight subprocess called the featherweight process (FWP). An FWP is a subprocess that runs inside a process on a microkernel-based operating system with an extremely short launch time, thanks to caching and reuse of FWP instances. EdgeOS has one more crucial module, the memory management accelerator (MMA), for communication between FWPs. The MMA enables communication between FWPs by copying messages as memory copies, and it mediates access control to provide security. Since EdgeOS is implemented on a microkernel-based OS, it can provide more flexible and secure memory isolation than a monolithic OS like Linux.

EdgeOS has clear advantages: it introduces a fast and concise FWP concept that could serve IoT services in 5G networks with good performance. However, it only works on a microkernel, which is quite far from most applications and platforms running in industry. Therefore, its applicability is very low, although porting it to the more prevalent monolithic OSes could be a good opportunity.
2.2 System Call and Signal Virtualization
As mentioned in Section 2.1, we introduced several techniques across various layers, architectures, and systems. However, many works focus only on isolation and do not consider the operating-system interfaces that could be used to bypass such isolation, as PKU Pitfalls [1] pointed out. In this section, we investigate the existing works on system call virtualization, what they provide, and what is missing.
2.2.1 Linux Security Module
The Linux Security Module (LSM) [41] framework was proposed about two decades ago and has been in the upstream Linux kernel since then. Even though LSM is named a module, it cannot be configured at runtime; it is a build-time-configurable security framework. LSM provides hook functions for all system call routines in the kernel, and a registered module uses the hooks to add functionality to the system call procedures. LSM hooks are executed before the actual system call operation is performed in the kernel: the module performs its checks and returns 0 on success, or another value on error, which eventually makes the system call return an error. The overhead of LSM is so small as to be negligible.
The most useful functionality provided via LSM is mandatory access control, as in SELinux [42], Tomoyo [43], AppArmor [44], and Smack [45]. Minor access control schemes like YAMA [46] and Linux capabilities [47] also use LSM. LSM is suitable for additional access control mechanisms, but it is not easy to use for system call virtualization because of its limited input and output parameters.
2.2.2 System Call Filtering
Applications do not need all system calls. There are common system calls that most applications use, such as read, write, and exit, but there are more than 300 system calls that most applications never use. However, all system calls are allowed by default to every process, so any buggy or malicious application can execute unintended system calls to attack the system. Also, applications link untrusted third-party libraries for common functions, but those libraries end up sharing all the permissions of the application, including system calls. Therefore, many research efforts have been invested in limiting the system calls available to applications, and some of them are widely used in industry.
First of all, many techniques using LSM [41] provide such system call filtering features. For example, SELinux [42] enforces a fine-grained access control policy for each process, so any unintended system call can be filtered by simply allowing only the required system calls. However, SELinux is overkill for system call filtering because it provides many functionalities beyond simple filtering.
The other popular technique is Seccomp [48]. Seccomp has been upstream in the Linux kernel since 2005, and the objective of the module is system call filtering. Seccomp uses the Berkeley Packet Filter (BPF) [49], which works like a network firewall and provides powerful filtering rules with low performance overhead. Once a filtering rule is established, more rules can be added, but rules cannot be removed or modified, even after fork.
Many other works provide more powerful and more effective system call filtering for various environments. In this section, we investigate a few of them.
Janus Janus [50], published in 1996, provides system call filtering and preliminary mandatory access control. The technique looks very simple and premature by current standards, but it was fascinating when it was published. Janus is a process-based system call filtering mechanism: another process, called the framework, is launched along with the application process. The framework process attaches itself as a debugger and sets breakpoints on every system call instruction in the application. Whenever the application tries to execute a system call instruction, it stops and wakes up the framework process, which decides whether the system call is safe based on a configurable access control policy.
The most important contribution of Janus is introducing a new concept of sandboxing in which system calls are filtered to provide a safe environment. The performance measurement, however, was inadequate: the authors only measured two applications with simple input data, so it did not fully show the cost of system call filtering.
sysfilter sysfilter [51] is an automatic system call filtering policy enforcement technique, published in 2020. Generally speaking, a system call filtering policy is manually created by developers and applied at runtime with filtering tools like Seccomp [48]. sysfilter, however, automatically analyzes a binary, derives the whole call graph of the binary file, creates a BPF filter, and applies it at runtime. Therefore, if the binary analysis is perfect, it achieves least privilege in the system calls. The evaluation covers over 30,000 binaries from a Linux distribution's packages, with 90% of the system calls successfully detected. From the performance perspective, the authors tested sysfilter on the NGINX web server [52] and observed up to 18% performance overhead due to the nature of linear rule search in Seccomp.
Automated filtering rule creation is the most significant contribution of this work. The most important aspect of an automated rule is accuracy, and there are two types of accuracy errors: false positives and false negatives. A false positive means that system calls that are not required are allowed, which extends the attack surface. A false negative means that required system calls are not allowed, which introduces application failures. As a result, both types of accuracy errors must be carefully mitigated. In addition, they used a special compiler for the binary analysis; therefore, in actual computing environments, binaries produced by various compilers could have issues.
Temporal Specialization Temporal Specialization [53] is a technique very similar to sysfilter [51], also published in 2020. The most significant difference is that sysfilter requires the binary and performs binary analysis, whereas Temporal Specialization requires source code and performs source code analysis. After analyzing the source code, it detects all possible system calls, creates Seccomp [48] policy rules, and inserts Seccomp policy update code at the program's starting point. The most interesting contribution of this work is that it can identify the initialization phase and the service phase of the application and insert one more Seccomp policy update right after initialization and before the service phase. Therefore, it achieves a more mature least privilege principle compared to sysfilter.
Temporal Specialization has the same accuracy issue as sysfilter due to their technical similarity. This work also requires source code to perform the analysis, so it cannot be applied in binary-only environments. In addition, they did not perform any performance tests in the paper, which is disappointing.
Jigsaw Jigsaw [54], published in 2014, is a very effective vulnerability detection tool that uses system call monitoring. Jigsaw is primarily dedicated to confused deputy attacks: it detects the application's request filtering code, analyzes the actual filtering functions, and also enforces the access control policy. The filters Jigsaw tries to detect are the binding filters and the name filters, which it finds through static and dynamic analysis, flagging any missing filters. In addition, at runtime, a kernel module intercepts the system calls and checks whether the filters are correctly applied and whether any unauthorized system call is filtered out.
Jigsaw introduces an interesting concept, a brilliant design, and well-defined formal verification along with a working implementation. The performance overhead is no more than 10% in their measurement, and it could detect several confused deputy attacks in actual applications. However, since it relies on heuristics along with static and dynamic analysis, it can produce incorrect detections, both false positives and false negatives. It is also hard to respond to new attacks with this method.
2.2.3 System call tracing and interposition
Along with system call filtering, system call tracing and interposition provide deeper system call management. In system call tracing, the tracer tracks the execution of the target application, pauses at every system call, and records the executed system call, its input parameters, and, if possible, its return value. These techniques usually utilize the ptrace [55] mechanism, in which the tracer process attaches to the tracee process and intercepts execution at every system call. strace [56] is the most popular tool for tracing system calls.
System call interposition extends system call tracing. Instead of just tracking which system call is being executed, it intercepts the execution, performs additional functionality on behalf of the tracee process, and returns control to the tracee [57]. Ptrace is widely used for this technique as well, and most techniques using system call interposition try to enforce some security policy. In this section, we investigate a few related works and examine the characteristics of each.
DroidTrace In 2014, DroidTrace [58] was published; it provides anomaly detection on the Android system based on ptrace and dynamic analysis. First, DroidTrace dynamically analyzes the Java part of the application and derives the call graph. Then it utilizes ptrace to trace all system call executions. During system call execution, it compares the observed behavior with a pre-defined policy and alerts the user if any anomaly is detected. In particular, DroidTrace targets dynamically linked libraries, which many dynamic analysis tools could miss at the time, and the authors showed it could detect several actual vulnerabilities.
Ostia Ostia [59] is a system call filtering and interposition technique published in 2004. The system call filtering of Ostia is implemented as a kernel module: when a system call is executed and the context switch to the kernel happens, the kernel module intercepts the system call, looks up the policy, and enforces it. If the policy allows the call, the kernel module invokes a callback function defined in a library linked into the original user process, and the callback function sends the system call information to another user process called the agent. The agent process receives the request, performs the system call virtualization on behalf of the original process, and returns the result.
Because Ostia is an old paper, the design does not look efficient from a modern viewpoint, although it was creative at the time of publication. One drawback of the work is that there was already an existing system call filtering mechanism, the Linux Security Module [41]. Another issue is the complicated call flow of the system call interposition. Once the application calls a system call, a context switch to the kernel happens. Then the kernel module of Ostia takes over, looks up the policy, and calls the callback function in the application again. After that, the callback function creates an IPC message with the system call information and sends the message to the agent process. Then another context switch is performed to the agent process, and the agent finally performs the system call, which eventually triggers yet another context switch into the kernel to execute it. After the system call is executed, the return value is propagated in reverse along the same path, which takes multiple context switches. The performance measurement performed by the authors shows that the system call overhead ranges from at least seven times the original system call cost to tens of times in some cases.
Chapter 3
Threats
3.1 Unauthorized memory access
Any data in a process is stored in memory at some point. It may be stored for a brief amount of time or stay in memory until the process is destroyed. The data could be constant or variable, or even code. It could be in the stack, the heap, the RODATA section, or the BSS section, all of which are in memory after all. The data could be a constant value, a state variable, cryptographic keys, or a control variable for code execution. Therefore, protecting memory contents is crucial for application security. However, most modern operating systems allow arbitrary memory access within the same process, so an application compromised by a bug or a malicious library is at significant risk.
Direct access The most straightforward threat is to access the target memory directly. In this scenario, the attacker acquires the target address and then executes instructions such as LD and ST to access the memory. This type of attack is simple, but it is also easy to detect and defend against. However, the attacker could hide the target address using various techniques, so a memory protection mechanism is required beyond instruction detection. In that case, the memory protection mechanism itself has to be protected.
Access by system calls Any memory protection mechanism should be carefully evaluated, because direct memory access is not the only way to access memory. The operating system provides interfaces between userspace processes and kernel-space resources in the form of system calls; Linux has more than 300 of them. System calls are executed in kernel space and return their results to the caller application in userspace. Therefore, we must carefully evaluate whether a memory protection mechanism works as intended in kernel space. For example, Intel Memory Protection Keys (MPK) [34] enforce additional memory protection in the CPU through the permission bits in the PKRU register. However, PKRU is reset in kernel space, and the kernel can access all memory without restriction.
Operating systems provide various ways to access memory [1]. For example, opening, reading, and writing the /proc/self/mem file is entirely equivalent to accessing the memory directly. There are also special system calls for accessing other processes' memory, such as process_vm_readv, process_vm_writev, and ptrace, which are mostly used for debugging. As a result, we have to carefully analyze all the system calls the operating system provides and investigate how the memory protection mechanism being designed interoperates with the operating system.
Access by signals Along with system calls, signals are also a useful interface for performing unauthorized memory accesses. Whenever a signal occurs, the kernel calls the registered signal handler function to process it. Therefore, we have to evaluate memory protection and signal handlers carefully. For example, in Linux with MPK, the kernel resets the PKRU value whenever a signal handler is called, so an attacker could register a malicious signal handler and trigger the signal; the handler would then be able to access all memory without MPK enforcement. In addition, the sigframe contains the PKRU value to be restored after returning from the signal handler, and the handler can rewrite that value to any 32-bit integer. Therefore, the attacker can easily manipulate the PKRU value through a signal handler. As a result, we have to analyze the effect of signals on the memory protection mechanism being designed.
3.2 Unauthorized file access
To perform TLS communication, we need a public key pair, which is the most critical data in secure communication. Most TLS libraries, like OpenSSL [3], acquire the public key pair and the certificate by reading files stored on the local computer and load them into memory for future use. Those files are protected by file permissions in most operating systems, so no other users can access them. In some cases, mandatory access control mechanisms like SELinux [42] are applied so that only the application can access the files. However, suppose the application has a bug and is compromised, or a library is malicious. In that case, the attacker could open the key files and read the keys with simple system call executions. The attacker could also read the key files if the files are already open in the application and there are open file descriptors. Memory protection does not protect against this attack. Therefore, we need to extend our protection abstraction to system calls.
3.3 Unauthorized system call execution
As mentioned above, we need to virtualize system calls to protect the system from malicious system calls. However, the syscall instruction itself is an unprivileged two-byte instruction in the x86 architecture that can be executed at any time, anywhere in the code. That is, the attacker can execute any system call with arbitrary input parameters by executing the syscall instruction directly instead of calling the glibc wrapper functions. Attackers do not even need a syscall instruction in their own code: the attacker can fill the required registers with the system call's input parameters and simply jump to a code address where a syscall instruction is located.
3.4 Attack on Subprocess Isolation: PKU Pitfalls
Connor et al. [1] provide a significant hint for this research. They introduce several attack scenarios against MPK-based subprocess isolation systems such as ERIM and Hodor. Some of the attacks apply universally to non-MPK-based isolation techniques as long as those techniques run on top of Linux or similar Unix-like operating systems. The common factor of these attacks is the use of system calls as the attack surface, which means that the most critical threat to subprocess isolation is the underlying operating system. Several essential attack scenarios are as follows.
First, some system calls bypass MPK by design. For example, process_vm_readv and process_vm_writev access the memory of other processes, mainly for debugging purposes. The actual memory access in these system calls happens in kernel space, where MPK is not applied. These calls also bypass the Linux Security Module, so this is not an MPK-specific attack surface.
Second, the attacker could use ptrace [55]. Ptrace is a tracing mechanism designed for debugging and profiling, in which the tracer attaches to the tracee and accesses its memory freely. Ptrace even bypasses MMU permissions, so it has to be evaluated seriously. There are several techniques to restrict ptrace, such as YAMA [46].
Third, file-backed memory access is allowed. In Linux, /proc/[pid]/mem is a virtual file that maps to the virtual memory of the process. Therefore, opening, reading, and writing the file is entirely equivalent to memory access, and it bypasses MPK protection. This interface could be a critical attack surface for many subprocess isolation techniques, not only MPK-based ones.
Lastly, signaling is also a critical attack surface. As mentioned above, the signal handler is called with PKRU reset by the kernel, so any signal handler can access any memory in the process. Therefore, signal handlers have to be monitored and virtualized. The sigframe data structure is also critical: the sigframe contains the important configuration and register values to be restored when the signal handler returns to the application, yet the signal handler is allowed to read and write this data structure. Therefore, the attacker could modify the PKRU value in a malicious signal handler and return to normal execution with a compromised MPK setting.
Chapter 4
Endokernel Architecture
4.1 Assumption
This work focuses on subprocess isolation in userspace, so we do not consider the security of the underlying operating system and hardware. Therefore, we assume the kernel and the hardware are not vulnerable. Also, we assume that there are no side channels. Lastly, we assume the Intravirt implementation has no bugs.
4.2 Requirements

The requirements for this work are as follows. All of them should be satisfied for the research to be considered successful.
Memory isolation and protection Memory isolation is the building block of intra-process isolation. Without memory isolation, subprocess isolation is not possible. The isolated memory of each domain should not be accessible by other domains. Each domain should have a dedicated call stack and heap. Lastly, the memory isolation mechanism should not introduce significant performance overhead.
Safe domain switch Sharing data between separated domains and calling functions in other domains require a context switch. This context switch has to be performed only through the features provided and controlled by Intravirt. That is, the attacker should not be able to arbitrarily switch to other domains without Intravirt, and data should be shared only through Intravirt. Also, the overhead of the context switch has to be tiny compared to a process context switch.
System call virtualization Because of the unauthorized memory accesses possible through system calls and signals, it is necessary to provide system call virtualization. All system calls should be executed only by Intravirt, and applications should call only glibc wrapper functions. We have two requirements to prevent arbitrary syscall execution. First, no syscall instruction outside of Intravirt should be executable, so the attacker cannot insert a syscall instruction into her own code area. Second, indirect jumps to the syscall instructions inside Intravirt have to be detected and prevented.
Along with arbitrary syscall prevention, all syscalls have to be analyzed. We then need to provide and enforce a policy that prevents unauthorized access through system calls and signals. Also, the performance overhead of system call virtualization has to be smaller than that of traditional ptrace-based system call interposition techniques.
Programmable Security Abstractions Much like the Exokernel argument [60], today's process-based isolation is inflexible. However, unlike Exokernel, the key challenge is not about exposing state for managing performance, but rather about making the policy language match the needs of applications more closely. This influences: 1) Ease of use. A primary reason fine-grained security is not applied is the complexity and diverse nature of application demands. We argue that an abstraction that works for one application will not necessarily be the easiest to apply to another. 2) Performance. We believe that an extensible protection architecture will ameliorate these issues by putting control into application-specific abstractions.
Mechanism Portability The key problem is identifying the essential elements independent of the mechanisms. It is clear that subprocess isolation mechanisms are only going to see increased exploration, which fractures the landscape of approaches for applying them. Each new system provides some properties, but how do we compare them? We believe it is necessary to establish a model that prescribes a set of clear abstractions and security properties so that diverse systems can be reasonably and systematically applied and compared.
4.3 Mechanism Gaps and Challenges
Several facets must be preserved to achieve meaningful privilege separation and to compare related efforts. The key gaps and challenges are described below.
Subdomain Identifiability One solution would be to extend the kernel with subprocess abstractions. However, a userspace monitor is still necessary to track the current protection domain; otherwise, a transition into the kernel is required on each switch, which is prohibitively costly.
Programmability and Optimizations Having a general interface would be ideal, but as prior work argues (Exokernel, etc.), applications tend to be severely constrained. What is worse, the existing process abstraction needs a separate interface to accommodate different interaction patterns and to be efficient. Thus, customizability of the abstractions is critical, and most prior work does not handle it properly.
Leaky System Objects Since OSs are unaware of subprocess domains, an untrusted portion of an application can request access and the OS will gladly service it. Although we show several bypass attacks, the primary challenge is to systematically assess all interfaces and integrate them into a unified policy management interface. It is easier to reason about the policy for a relatively strict interface, but things like ioctls make it impossible to have comprehensive defenses.
System Flow Policies A basic property is that information in a subspace should never flow in or out of system objects unless explicitly granted. However, deriving the system flows themselves is hard due to system complexity. Although prior work such as ERIM and Hodor shows that one can reason about the flows through a specific system object, the approach is hard to broaden into a systematic solution.

syscall Monitor The need to monitor syscalls is clear, but how to do it is not. A deny-all policy, as used by intra-app sandboxing [15, 61, 62], would indiscriminately deny all access and neglect a large application space. For example, deny-all sandboxing cannot prevent Heartbleed [63]. In general, applications should be able to benefit from privilege separation without losing functionality. Alternatively, we could modify the OS so that it recognizes and enforces endoprocesses [18, 19, 64]. Unfortunately, this introduces significant complexity, as indicated by Sirius [64].
Instead of the in-kernel approach, we propose enforcing nested flow policies at the syscall layer: allowing some syscalls to pass through unchanged, denying others, and securely emulating the rest. This is not supported by well-known systems in Linux: MBOX uses ptrace for similar protections [65], but it only virtualizes the filesystem interface and is inefficient. Seccomp [48] with eBPF [49] and LSM [41] enforce syscall policies but lack the ability to modify syscall semantics, which would require modifying the LSM hooks extensively.
Multi-Process An attacker can fork an exploited process and then access the original address space directly through load and store instructions, or indirectly through read system calls. The endokernel must be inside the new process to ensure the protections, or the memory must be scrubbed. Prior approaches [22, 24, 64] do not consider this threat and would have to disallow fork system calls.
Signals Signals create several exploitable gaps and challenges. First, Linux exposes virtual CPU state, including PKRU, to the signal handler, which can be exploited by an attacker. Second, the kernel does not change the domain on signal delivery and will trap if the handler is not properly set up. Third, the kernel always delivers the signal to a default domain, exposing the monitor to control-flow attacks. Fourth, properly virtualizing signals requires complex synchronization and modified semantics to be correct, safe, and efficient. Overall, properly handling signals introduces significant complexity into the endokernel.
Multi-Threaded While existing work claims designs that support multi-threading, none of them implement concurrency control in the runtime monitor, which introduces TOCTOU attacks and memory leaks, and they neglect to measure scalability.
Multi-Domain Prior work isolates one domain per thread but does not support multiple domains per thread. The challenge is that switching from the untrusted domain to the monitor exposes less data than executing a cross-domain call, because the stack requires tracking to ensure return integrity.
4.4 Endoprocess Model

The Endokernel is a general-purpose model for nesting a monitor, the endokernel, into the process address space. The endokernel is responsible for self-protection (enforcing the abstraction of two privilege levels within the process) and for presenting a lightweight virtual machine, the endoprocess, to the application. The Endokernel is designed to sit directly below application logic and directly on top of the OS- and hardware-provided abstractions. The core methodology is to systematically identify 1) what needs to be protected, 2) how that information can be interacted with (through the CPU, memory, or OS interfaces), and 3) a set of abstractions that must be in place to secure endoprocess isolation. The basic goal is to identify an architecture-level description that is portable and independent of the exact layers above and below, so as to properly encapsulate the endoprocess internals. The architecture has two main elements: 1) the authority model and 2) the nested endokernel architecture that ensures isolation. We show how to use these to create least-authority separation services, nested boxing, for application use. Figure 4.1 depicts these elements together in the architecture.

[Figure 4.1: Intravirt Architecture. The application's sandbox and safebox domains, together with untrusted glibc, run above libintravirt.so, which provides domain switching, syscall and signal virtualization, syscall policy, and domain and thread management between the process and the kernel.]
4.5 Design Principles

We share the trusted monitor principles outlined by Needham (tamper-proof, non-bypassable, and small enough to verify [66]) and add the following:
Nested Separation Kernel Address spaces and kernel interactions are slow, so eliminate all OS interactions [67, 68], i.e., operate in pure userspace, while being smaller than a microkernel and only tolerating elements inside the monitor if they support primitive separation mechanisms with a minimal interface.
Self-Contained and Secure Userspace Avoid implementing system object isolation in the kernel: that would add yet another security framework hacked on top of thousands of kernel objects. Nesting requires part of the mechanism to be in-process; however, certain resources could be virtualized by the OS. While that seems like it might be the best choice, if parts of the process were virtualized by the monitor and others by the OS, then: 1) complexity arises in bridging the semantic gap between the abstractions, 2) bugs can arise from complex concurrency, access, and exception control, and 3) the endoprocess abstraction becomes tied to a specific kernel implementation instead of the semantics of its interface.
General and Extensible The design should permit many implementations, i.e., using various hardware (MPK) or software (SFI) isolation techniques that might present valuable tradeoff points in the security-performance space. The architecture should enable safe extensibility of the security abstractions so that custom, least-authority protection services can be built.
4.6 Authority Model
The Endokernel represents and enforces authority based on a protection domain called an endoprocess. As outlined by Lampson [69] and instantiated by Mondrix [70], an endoprocess must provide the basic properties of data abstraction, protected entry/return and memory isolation, while also protecting access through OS objects. Most existing work multiplexes regions of the virtual address space and uses hardware mechanisms to protect entry and exit; however, these works neglect to map these properties to the other ways in which the environment can be used to avoid mediation. Thus, in addition to traditional CPU and memory virtualization, the Endokernel also virtualizes CPU registers, the file system, address spaces and memory management, and exceptions (as implemented through signals). We use the term authority context to avoid confusion with the many other names in use; an authority context is a lightweight virtual machine, while being more precise than domain.
Definition 1 (endoprocess) An endoprocess is an authority context: a tuple of (instruction, subspace, entry-point, return-point, file system, address space, and exception) capabilities.
Instruction capabilities specify which instructions are permitted without monitoring and are required to fully virtualize the CPU, similar to the hosted architecture of VMMs, SFI, and Nested Kernel approaches. Explicitly representing instructions is critical, as many protection models operate by allowing instructions enforced either by privilege-level hardware (rings), capability hardware, or software-based techniques like SFI (inline monitors) or deprivileging (static verifiers with runtime code integrity). As an example, recent work uses memory protection keys to isolate virtual regions; however, the hardware exposes the key register to corruption through WRPKRU. As we show in our prototype, we implement a restricted view of CPU state by preventing any access to WRPKRU and syscall instructions from non-endokernel code, but we do so using diverse mechanisms. The way we virtualize the CPU also influences the low-level mechanisms that enforce protected entry and exit.
Memory capabilities allow an endoprocess to read, write, or execute a subspace, which is a subregion of the virtual address space. The default subspaces for each endoprocess include the stack, heap, and code. File system capabilities specify the operations permitted for opening, reading, and writing runtime state through the file system. Address space and memory management capabilities determine what changes to the address space (e.g., mmap, mprotect, etc.) an endoprocess may make. Exception capabilities allow an endoprocess to securely register for and handle signals (e.g., SIGSEGV). Entry-point capabilities denote points at which an endoprocess transition is permitted, much like converting a function call into an RPC for context switching and message passing. Return-point capabilities are dynamically generated whenever a cross-domain call (RPC), an xcall, is invoked, and they require the machine to return in nested order. Each endoprocess, by default, is granted exclusive access to its own code, data, and stack subspaces.
An execution context is the combination of (endoprocess × thread context), which includes the program counter, stack pointer, and other per-domain CPU registers. We have chosen explicitly to model the endoprocess in a similar way to a traditional process by allowing multiple threads to coexist in a single authority context concurrently. As a thread executes, it traverses various contexts. This model allows a greater range of flexibility for developing extensible protection. This execution model is the same as that provided by Mondrix. To support it, the monitor supports the following interface: program start, interrupted state, signals, up-calls, and xcalls.
Property 1 (Endoprocess Isolation) Each endoprocess is granted exclusive access to its code, data, and stack subspaces; guaranteed secure entry/return; mapping capabilities for its own subspaces; and capabilities to OS-level interfaces unless explicitly excluded to isolate other endoprocess state.
With these capabilities, the Endokernel exposes the ability to fully virtualize each resource while restricting access to privileged in-process state (e.g., monitor memory). This is essential, as many applications cannot be deployed without a certain level of access, yet the monitor must ensure its own protection by reducing that functionality. This is one of the most critical features gained under the Endokernel model relative to existing ad hoc approaches.
4.7 Nested Endokernel Organization
The Endokernel Architecture is a process model in which a security monitor, the endokernel, is nested within the address space with full authority. The endokernel is then responsible for multiplexing the process to enforce modularity across a set of endoprocesses. The first goal of the endokernel is to self-isolate, i.e., to secure the endokernel state and the endoprocess abstraction against untrusted-domain bypass. This section details this architecture explicitly, leaving the protection abstractions as extensions on top of this basic isolation.
4.7.1 In-Process Policy
The endokernel is granted full authority over all process resources, and the untrusted domain is granted access to all process resources except the following: endokernel subspaces; memory management (e.g., protection registers via WRPKRU) and direct OS call (e.g., syscall) instructions; file system operations that would allow access to endokernel subspaces (e.g., reading or writing /proc/self/mem at endokernel subspaces); address space manipulation (e.g., mmap) that would expose endokernel subspaces; and signal capabilities that could otherwise be used to bypass subspace isolation. In this way, the endokernel virtualizes privilege within the address space while inserting itself between the untrusted domain and all privileged resources, where the higher-privilege state holds all protection state, including everything that could allow unmediated access by the lower-privilege untrusted domain. Just like any kernelized system, protected gates ensure that the endokernel is securely entered when a protection domain switch occurs. This architecture is similar to, and inspired by, the hosted VMM architecture and the Nested Kernel architecture.
Definition 2 (Endokernel Architecture) An Endokernel Architecture is a split process model where the endokernel is nested within the address space.
The endokernel is responsible for exporting the basic endoprocess abstractions for all
untrusted domain endoprocesses, thus enabling a new method for virtualizing subprocess
resources and enforcing the following property:
Property 2 (Complete Mediation) A non-bypassable endokernel that is simple and guarantees isolation.
To achieve this the Endokernel enforces the following policies: secure loading and initialization
so that all protection is configured appropriately; exports call gates for cross
domain calls and ensures argument integrity and context switching; inserts a monitor
for all system calls so that they can be fully virtualized; monitors all address space and
protection bit modifications to ensure isolation is not disabled; controls all signals so
that they route through the endokernel before going to any untrusted domain endoprocess;
and handles concurrency to support multi-threaded execution.
4.7.2 Interface
In the basic architecture, the endokernel transparently inserts itself and presents a minimal
interface to the protected resources. All access to the privileged resources must become
calls into the endokernel. In the process model, this typically means only a system call
interface, as that is the mechanism by which most resources are accessed and typically the
only interface that must pass through the monitor. Access to address spaces and file systems
is monitored through the system call interface. Other resources are memory based, and
since the untrusted domain has no access to the endokernel state, there is no need for an
explicit interface. We do not define an endoprocess creation/destruction interface, as that
is the responsibility of the extension implementing endoprocess modularity, which we
believe is best tailored to the application itself.
4.8 Separation Facilities: Nested Boxing
Least-authority is hard to apply in practice because security policies are highly dependent
on the objects being protected. As indicated, many abstractions are rigid and do not allow
for specialization by application developers. The Endokernel Architecture gives
us the ability to use the endokernel to explore diverse endoprocess and sharing models
on top. To improve programmability and make use of the nested endokernel, we present
the nested boxing abstraction, which effectively creates three virtual privilege rings in
the process. The nested boxing model allows each level access to all resources of the less
privileged layers, while removing the ability of those domains to access more privileged
domains. In this thesis we fix the number of domains to four, from most to least privileged:
endokernel, safebox, unbox, and sandbox. Each domain is given an initialized endoprocess
that provides capabilities for accessing domain resources. To make programming easier, we also use a libos that aids in allocation and separation policy management.
Dynamic Memory Management One of the core challenges with privilege separation is modifying the code so that data is statically and dynamically separated. Static separation
is easily done using loader modification, but dynamic memory management is harder, in
particular when we have to ensure subspace isolation. In our system we provide a nested
endokernel allocator that transparently replaces whatever allocator the code originally
used and automatically manages the heap and associated privilege policies.
Memory Sharing Endoprocesses share data through a simple manual page-level grant/revoke model. An endoprocess grants access to any of its pages to a lower privilege domain and
removes access through the revoke operation.
Protected Entry and Return Cross domain calls, or xcalls, are invoked by the calling domain and can only enter the called domain at predefined entry points as specified by the
endoprocess definition. This interface rejects all attempts to access the safebox that do not
target a preloaded entrypoint. It then performs the domain switch: it switches the stack and current
domain ID, stores the return address in a protected memory subspace, and transfers control
to the safebox. When the called function finishes, it returns to the interface function, which domain-switches back to the untrusted domain. Entry points can either be defined
manually or, as we show for full library separation, by using the library export list. This
model of control flow allows the called domain to subsequently call less privileged code;
if it does so, the called code operates within the endoprocess context and is thus in the
TCB. We allow users to determine when and how to use these features, granting greater
flexibility at the cost of more complexity in reasoning about security if a callback is issued.
This can implement the Shreds abstraction, if used in code with no callbacks.
4.9 Intel® Memory Protection Key
Hsu et al. [19] describe three generations of privilege separation, each increasing from
manual, address-space isolation to the third generation that efficiently enables concurrent
per-thread memory views. The key is new hardware that extends paging with user-level
tags for fast but insecure isolation.
MPK [34] extends page tables with a 4-bit tag for labeling each mapping. A new 32-bit
CPU register, called PKRU, specifies access control policies for each tag, with two bits per
tag controlling read and write access for each of the 16 tag values. The policy is updated via a new
ring-3 instruction called WRPKRU. On each access, the CPU checks the access control policy
as specified by the mapping’s tag and the associated policy in the PKRU. If the access is not permitted,
the CPU faults and delivers an exception.
MPK Security vs. Performance Unfortunately, the PKRU can be modified by any user-level
WRPKRU instruction: MPK is bypassable using gadget-based attacks. As such, MPK
trades some security for performance by allowing protection changes without switching into
the kernel.
Preventing MPK Policy Corruption Nested privilege separation reconciles the exposure
of protection state by ensuring WRPKRU instructions are only used safely by the
endokernel. These systems achieve this by removing all WRPKRU instructions from the untrusted
binary and crafting nested call gates that prevent abuse [20, 22, 24, 35, 36, 71].
Chapter 5
Design and Implementation
Intravirt is a user-level-only Endokernel system that fully virtualizes privilege and prevents
bypass attacks. Beyond memory and CPU virtualization, it emphasizes full virtualization of
system calls and signals, and it exposes and addresses concurrency, multi-threading, and
multi-domain challenges. Intravirt injects the monitor into the application as the trusted
domain endokernel and removes the ability of the untrusted domain to directly modify
privileged state. Privileged state includes: protection information (PKRU and memory
mappings), code, endokernel code and data, direct system call invocation, raw signal
handler data, CPU registers on transitions and control flow, and system objects. The
endokernel is inserted on startup by hooking all system call execution and initializing the
protection state so that the trusted domain is isolated with no files opened or mapped.
5.1 Privilege and Memory Virtualization
While we build on and extend ERIM, we include it here for a complete view of Intravirt;
we encourage the reader to review the detailed methodology in the original work. An
initial configuration partitions the application into the trusted domain and untrusted
domain, where the trusted domain contains the trusted monitor and the untrusted domain
contains the rest. Once the application is separated so that the parts are differentiated, the
system is configured so that all pages of the trusted domain have key 0 or key 1, depending on
their confidentiality requirement, and all pages of the untrusted domain have key 2. Some pages have other keys if they belong to other subdomains in the untrusted domain.
5.1.1 Virtual Privilege Switch
One of the most important elements when nesting the endokernel into the same address
space is the need for secure context switching, which is complex to get correct because
an attacker has access to whatever is mapped into the address space. While the trusted
domain is executing, the PKRU is configured to allow_all (read/write to all domains),
operating in the trusted domain’s virtual privileged mode. While the untrusted domain is
executing, the PKRU policy for key 0 is deny_write and for key 1 is deny_all. The virtual
domain switch is implemented as a change in the protection policies in the PKRU: when
entering the monitor, set the policy to allow_all; when exiting, restore the original policy
based on the previous state. This means that whenever the value of PKRU changes, so too
does the currently executing domain. Each entry point into Intravirt is set up with a call
gate containing a WRPKRU that transitions the domain. The basic idea is to nest monitor code
directly into the address space of the application and wrap each entry and exit point with a
WRPKRU operation. By doing this the system can transition between contexts and only allow
monitor code to access protected state: a virtual privilege switch. A similar technique
is also used to switch between different subdomains, enabling the use of other keys in
the untrusted domain.
5.1.2 Securing the Domain Switch
Unlike systems with real hardware gates, this software/hardware virtual privilege switch
has challenges because the instruction must be mapped as executable to allow fast privilege
switching. The first thing an attacker could do is use a direct jump to any code in the
monitor and thus bypass the entry gate. This would in fact allow the attacker to execute
monitor code. One way to thwart this would be to modify the executable policy on the monitor
pages, but that would require a call into the OS, which defeats the purpose of MPK’s fast domain
switching in the first place. Instead, we observe that even if an attacker is able to
jump into the middle of the monitor, the domain would never have switched; therefore,
none of the protected state is available for access and the basic memory protection
property holds. The only way to change the domain is to enter through the entry gate.
Since the switch is a single instruction, we can easily verify the result of the switch
immediately after the WRPKRU instruction and loop back if it did not switch to the intended
PKRU state. This ensures that the PKRU state at all exits of the gate sequence is the
intended one.
Effectively, the attacker now faces a dilemma: jumping into the middle of the
code accomplishes nothing, since it is equivalent to running the same code at any other
location, and jumping to the entry gate means that any landing place within the gate will only
switch to the correct PKRU value and continue execution with deterministic control
flow. No code can be abused.
5.1.3 Instruction Capabilities
Alternatively, an attacker could generate their own unprotected variant of WRPKRU: if
an attacker can inject or abuse the WRPKRU instruction, they could switch domains and
gain access to the monitor’s protected state. To deal with this, ERIM and others like it
use a technique called instruction capabilities: by combining static
transformation, code validation, and dynamic protections, an instruction becomes
much like a capability. The static analysis removes all instances of the WRPKRU opcode
so that the attacker has no aligned or unaligned instructions that could write the value without monitoring, and the runtime is configured so that all code is writable or
executable but never both.
5.1.4 Controlling mode switches
Processes may switch into 32-bit compatibility mode, which changes how some instructions
are decoded and executed. The security monitor code may not enforce the intended
checks when executed in compatibility mode. Thus, we insert a short instruction sequence
immediately after WRPKRU or XRSTOR instructions that will fault if the process is in
compatibility mode.
64-bit processes on Linux are able to switch to compatibility mode, e.g., by performing
a far jump to a 32-bit code segment that is included in the Global Descriptor Table (GDT).
Executing code in compatibility mode can change the semantics of that code compared to
running it in 64-bit mode. For example, the REX prefixes that are used to select a 64-bit
register operand size and to index the expanded register file in 64-bit mode are interpreted
as INC and DEC instructions in compatibility mode. Another example is that the RIP-relative
addressing mode in 64-bit mode is interpreted as specifying an absolute displacement in
compatibility mode.
Executing the trusted code in compatibility mode may undermine its intended operation
in a way that leads to security vulnerabilities. For example, if the trusted code attempts to
load internal state using a RIP-relative data access, that will be executed in compatibility
mode as an access to an absolute displacement. The untrusted code may have control
over the contents of memory at that displacement, depending on the memory layout of
the program. This may lead to the trusted code making access control decisions based
on forged data. Conversely, if the trusted code stores sensitive data using a RIP-relative
data access, executing the store in compatibility mode may cause the data to be stored to a
memory region that can be accessed by the untrusted code.
To check that the program is executing in 64-bit mode when it enters the trusted code,
a sequence of instructions such as the following may be used:
1. Shift RAX left by 1 bit. In compatibility mode, this is executed as a decrement of EAX
followed by a 1-bit left shift of EAX.
2. Increment RAX, which sets the least-significant bit of RAX. In compatibility mode,
this first decrements EAX and then increments EAX, resulting in no net change to the
value of EAX.
3. Execute a BT (bit test) instruction referencing the least-significant bit of EAX, which is
valid in both 64-bit mode and compatibility mode. The BT instruction updates
CF, the carry flag, to match the value of the specified bit. It does not affect the value
of EAX.
4. Execute a JC instruction that will jump past the next instruction if CF is set.
5. Include a UD2 instruction that will unconditionally generate an invalid opcode exception,
which will provide an opportunity for the OS to terminate the application. The
security monitor should prevent the untrusted code from intercepting any signal
generated due to an invalid opcode exception from this code sequence.
6. Shift RAX right by 1 bit to restore its original value. This instruction is unreachable
in compatibility mode.
The preceding description of the operation of the instructions in compatibility mode
assumes that the default operand size is set to 32 bits. However, a program may use the
modify_ldt system call to install a code segment with a default operand size of 16 bits. That would cause the instructions that are described above as accessing EAX to instead access AX.
That still results in the instruction sequence detecting that the program is not executing in
64-bit mode and generating an invalid opcode exception. Furthermore, Intravirt can block
the use of modify_ldt to install new segment descriptors. None of the default segment
descriptors in Linux specify a 16-bit default operand size.
It is convenient to use EAX/RAX in the preceding instructions, because the REX prefix
for accessing RAX in the instructions used in the test happens to be interpreted as DEC
EAX, which enables our test to distinguish between 64-bit mode and compatibility mode
as described above by modifying the value of the register that is subsequently tested by
the BT instruction. However, we need to restore the value of EAX/RAX after the mode test.
One option would be to store RAX to the stack, but that may introduce a TOCTTOU vulnerability if the untrusted code can modify the saved value. That is why we use shift
operations to save and restore the original value of RAX, depending on the property that
only the least-significant 32 bits of RAX are ever set at the locations where mode checks
are needed.
The mode test comprises 11 bytes of instructions in total. The mode-test instruction
sequence overwrites the value of the flags register. If the value of the flags register needs
to be retained across the mode test, that can be accomplished using a matching pair of
PUSHF and POPF instructions surrounding the mode test. These instructions are encoded
identically in 64-bit mode and compatibility mode. It may be possible for untrusted code
to overwrite the flags register value while it is saved to the stack. However, trusted code
should not depend on flags register values set by untrusted code, regardless of whether that
register has been loaded from stack memory or set by the processor directly as
a side effect of executing instructions in untrusted code.
If the instruction sequence for testing the value of EAX/RAX used with an XRSTOR
or WRPKRU instruction that is not followed by trusted code is valid in all modes that are
reachable by the untrusted code, then the mode-test code may be omitted prior to that value-test code.
5.2 System Call Monitor and Handling
Intravirt must ensure that access to system objects is virtualized. We could place this
monitor in the kernel; however, that would separate the memory protection logic from
the mechanism and create greater external dependencies. Furthermore, it would push the
policy specification into the kernel, where the extensible abstractions we need to support
would endanger the whole OS. Instead, we observe that system resources are provided via the system call interface, and that the semantics of that layer are stable and allow for
reasoning about and enforcement of endoprocess isolation policies. Additionally, Intravirt will
have greater portability if it targets POSIX. Furthermore, locating the monitor in the
kernel would add extra context switches and layers of complexity in the
kernel for handling the virtualization.
As such, Intravirt virtualizes system objects by monitoring all control transfers between
the untrusted domain and the OS through a novel in-address space syscall monitor, called
the nexpoline.
Property 3 (Nexpoline) All legitimate syscalls go through endokernel checks and virtualization.
The basic way Intravirt does this is to 1) prevent all system call operations from untrusted
domain subspaces and 2) mediate and virtualize all others. We could use a control-flow
integrity monitor to provide both of these, like CPI [30], but that would add unnecessary
overhead, require compiler-level instrumentation, and violate our minimal mechanism
principle. Alternatively, we could extend the OS; however, this would break our principle
of no kernel dependencies and add cost.
5.2.1 Passthrough
The first step in handling is to determine what virtualization, if any, is necessary, because
many system calls do not allow endoprocess bypass. Additionally, if a system call creates
an interface to read or write memory, it will use the application’s virtual addresses, which means that the MPK domain will be enforced even if the memory is accessed from supervisor
mode. This is something we learned only through failing, so it is important to note:
by default the kernel leaves the MPK domain untouched, and thus the hardware continues
to enforce MPK-based policies even on supervisor mode accesses. The benefit of this is
that any kernel access to endoprocess subspaces not permitted by the current PKRU value will trap and be delivered to the endokernel, meaning a powerful deny-by-default policy
that is enforced even on ioctls with unknown semantics. It does not mean that the kernel
cannot remap pages and get around the domains, but it does mean that a common path
for access must be coded around, adding greater confidence that access paths have been
protected. With these passthrough system calls, we use our protected nexpoline control
path, and right before executing the syscall we transition the PKRU domain to the original
caller so the kernel will respect the memory policies in place. After the syscall, Intravirt switches to the trusted domain to finalize the syscall and then transitions back to the
calling endoprocess.
5.2.2 No syscall from untrusted domain subspaces
To prevent direct invocation of syscall operations, we could remove all syscall instructions
from the untrusted domain and ensure integrity as we do for WRPKRU; however, the
syscall opcode is short and might lead to a high false positive rate. Instead, we use OS sandboxes
that restrict syscall use to a protected trusted domain subspace. There are two
that can be used: seccomp [48] and dispatch [72]. When we started, seccomp was the
only available one, but it has many drawbacks: 1) the filter cannot be grown or modified, which makes support for multiple threads and forks challenging, and 2) it adds significant
overhead. The only way to address the second is to use a different mechanism. Thus, we
also explore the recently released kernel dispatch mechanism, which is a lightweight filter
that restricts system calls to a particular subspace. Both of these mechanisms work by
specifying the virtual address region that is permitted to invoke system calls, which we
use to restrict syscalls to endokernel subspaces.
5.2.3 Complete mediation for mapped syscall
Unfortunately, the only way to invoke a syscall is for the opcode to exist in the runtime,
meaning it must be placed in memory that the untrusted domain can jump to. Ideally,
protection keys would also govern executability and we could use an endoprocess switch,
but they do not: Intel relies on NX mappings. Alternatively, subspaces with syscall
opcodes could be marked NX, but the nexpoline would then require another syscall to change the page’s permissions.
Instead, the nexpoline protects each instance of the syscall; return; instruction sequence,
called the sysret-gadget, so that if control does not enter through the call gate
the syscall is inaccessible. The basic control flow is to enter through the call gate,
perform system virtualization, set up the nexpoline code subspace, jump to the syscall,
then return to the handler for cleanup.
Randomized Location To abuse the sysret-gadget, the attacker must know where it is located. As such, the first isolation approach randomizes the location of the sysret-gadget.
We create one pointer that points to the sysret-gadget and make it readable only by
the endokernel endoprocess. This means that to get access to the pointer, the endoprocess
must be switched to first, which guarantees protected entry. The pointer is looked
up immediately after the switch, which means that all code between that instruction and
the sysret-gadget will execute: the endokernel performs all virtualization and, once the call is approved,
invokes the sysret-gadget. This ensures complete mediation, because the only way to get
the sysret-gadget location is to enter at the beginning, which ensures full virtualization.
The sysret-gadget can then be re-randomized at various intervals to provide stronger or weaker security; we measure the cost of randomizing at differing numbers of system calls.
Multi-threading creates some complexity, as the location could leak, which we address by creating a
per-thread pointer and giving each thread enough virtual space to remain probabilistically secure. The
benefit of this technique is that it is the simplest and, most of the time, results in the best
performance.
Ephemeral On-Demand While randomization, especially if randomizing on each
syscall, creates a high degree of separation, it is not guaranteed. To provide deterministic
isolation, we present the ephemeral nexpoline, which achieves isolation by writing the
sysret-gadget into an executable endokernel subspace on gate entry and rewriting it with trap
instructions (int 3) after completion. This requires Intravirt to create a single page for
the nexpoline in the trusted domain with read and write permission restricted to the
endokernel (via MPK) and execute permission for all domains. Intravirt ensures that while
the untrusted domain executes, the entire page is filled with int 3 instructions, which would raise a signal if the untrusted domain were to jump to this page. The endokernel
interposes on all control transfers from the OS to the untrusted domain, thus ensuring that
prior to any control transfer back to the untrusted domain the sysret-gadget is removed.
The resulting enforced property is that there is no executable sysret-gadget while the untrusted
domain is in control.
Handling multi-threaded execution is challenging because the sysret-gadget is callable by other threads running in the process. To address this issue, Intravirt creates a per-thread
filter that restricts each thread’s syscalls to a per-thread subspace. This means that the OS syscall filter ensures that if a thread invokes the sysret-gadget of another thread (while that thread’s system call is being handled) it will trap. In this way, the syscall instruction is ephemeral and only exists while the owning thread is executing the nexpoline. This creates complexity, as signals may modify the control flow of system calls, which we describe in §5.4.
Control-flow Enforcement Technology (CET) [73] CET provides hardware to enforce control flow policies. While designed for enforcing Control-Flow Integrity [29], we show how to (ab)use CET to implement a virtual call gate, which ensures syscall; return; is not directly executable by the untrusted domain. Briefly, CET guarantees that all returns return to the caller and that indirect jumps only target locations that are prefaced with the end-branch instruction. CET also supports legacy code by exporting a bitmap that marks all pages that can bypass indirect jump enforcement, but the shadow stack must be used across the whole application.
Intravirt allocates a shadow stack for each endoprocess and ensures that a stack cannot be used by a different endoprocess by assigning each one to a protected subspace. Intravirt marks all endokernel entrypoints with ENDBR64, denying transitions into the endokernel from any other indirect jump target. This creates a problem, though, because indirect jumps within the endokernel also require end-branch instructions and could be used as alternative entrypoints to the endokernel. Thus, all jumps within the endokernel are direct jumps with a fixed offset from the current IP and are not exploitable. This allows syscall; return; to be placed anywhere in the trusted domain, since the hardware automatically ensures
all syscalls start from a legitimate entrypoint. While CET can provide greater security for
the whole application, our evaluation shows significant overheads compared to the other
approaches (see §7.2).
5.3 OS Object Virtualization
The primary goal of Intravirt is to preserve endoprocess isolation, which requires system
object virtualization to eliminate cross-endoprocess flows. Intravirt represents these
in three core system abstractions and systematically reasons about and
specifies policies for them: files (including sockets), address spaces, and processes.
5.3.1 Sensitive but Unvirtualized System Calls
A key class of system interfaces (ioctls, sendto, etc.) may index into regions of the address
space that the kernel accesses on behalf of a process, but as discussed, the kernel uses the user-level virtual addresses, which are protected by the hardware enforcing
MPK domain isolation even for privileged accesses. These do not require full system-level
virtualization, but if the kernel did not implement that strategy, they could be fully virtualized by analyzing the arguments and denying any access that crosses endoprocess
isolation.
5.3.2 Files
The Linux kernel exposes (via procfs) several sensitive files that may leak an endoprocess’s
memory, because the kernel does not enforce page permissions on them, e.g., /proc/self/mem [1].
To prevent any file-related system call from ever pointing to such a sensitive file, Intravirt
tracks the inode of each opened file. Conveniently, inodes are the same even when reached via
soft or hard links. This allows Intravirt to enforce that no open inode matches the inode
of a sensitive file. The associated rules are transitively forwarded to child processes as
they inherit the file descriptor table of the parent.
5.3.3 Mappings
In addition, one may break the isolation property of Intravirt by aliasing the same file
mapping multiple times with different access permissions. For instance, one mapping
may allow read/execute, while another alias mapping to the same file permits read/write
access. We prevent such attacks by emulating the mapping using the regular file interface:
the file is first copied to a read/write page, which is later turned read/execute after all
security checks pass. As a result, an executable page is never backed by a mapped file.
Memory system calls create, modify, or change access permissions of memory pages.
Across such system calls we prevent an endoprocess from accessing or altering another endoprocess’s
memory, e.g., by never permitting an endoprocess to map another endoprocess’s
memory. In addition, new memory mappings created by an endoprocess are tagged as belonging to
that endoprocess. Intravirt enforces these policies by building a memory map that associates
access permissions with each endoprocess.
5.3.4 Processes
The kernel permits virtual memory accesses of other processes via the process_vm_readv
and process_vm_writev system calls. These calls access memory of remote processes
or of the current process itself. For these two system calls, we apply the same restrictions
as for file-backed system calls, preventing a domain from accessing another domain’s
memory. In addition, we completely prevent access to another process’s memory via
process_vm_readv/writev. [fork and vfork] Due to the insecure behavior of vfork, we
emulate it using fork instead. fork needs to be altered to enforce transitive policy
enforcement across process boundaries. [exec] A process’s application can be replaced using
the exec system call. In this case, the kernel loads the new executable and starts executing
it. This is problematic, because we need to initialize its protections before the application runs.
Hence, any exec system call needs to be intercepted to ensure policy enforcement is
enabled after exec.
5.3.5 Forbidden system calls
Several system calls access protection state. Intravirt currently denies access to the following
and leaves their virtualization to future work: clone with shared memory, the pkey_*
system calls, modify_ldt, rt_tgsigqueueinfo, seccomp, prctl accessing seccomp, shmdt,
shmat, and ptrace.
5.4 Signal virtualization
Signals modify the execution flow of a process by pushing a signal frame onto the process
stack and transferring control to the point indicated by the signal handler. The primary reasons we must fully virtualize signals are that 1) Linux always resets PKRU to a semi-privileged
state where domain 0 is made RW-accessible and all other domains are read-only,
and 2) signals expose processor state through struct sigframe, potentially leaking
sensitive state or allowing corruption of PKRU, which could lead to untrusted domain
control while in the trusted domain context. As such, Intravirt must interpose on all
signal delivery, minimally to transition protection back to the untrusted domain mode, and must virtualize signal handler state to avoid leakage and corruption.
Intravirt accomplishes this by virtualizing signals: first, all signal handlers are registered
with Intravirt, and second, Intravirt registers its own handlers with the kernel so that it
always gains control of initial signal delivery. When a signal occurs, Intravirt first copies
sig_entry:
    movq $1, __flag_from_kernel(%rip)
    erim_switch
    cmpq $1, __flag_from_kernel(%rip)
    jne  __sigexit
    movq $0, __flag_from_kernel(%rip)
    call _shim_sig_entry
Figure 5.1: Signal Entrypoint
the signal handler context info to protected memory so that the untrusted domain cannot read or corrupt it. Next, Intravirt must deliver the signal to the untrusted domain; to do so it must 1) push the signal info onto the untrusted domain stack and 2) switch the protection domain to the untrusted domain. Unfortunately, the semi-privileged PKRU state does not map the untrusted domain stack as writable, so Intravirt first modifies PKRU so that it is fully in the trusted domain and then pushes the signal information onto the untrusted domain stack. Then Intravirt transitions to the untrusted domain mode, giving control to the handler registered in the first step.
The next challenge is that the domain switch into the trusted domain places a WRPKRU in the control path, which the untrusted domain can abuse to launch a signal-spoofing attack. By spoofing a signal, the untrusted domain could hijack the return path to its own code while setting PKRU to the trusted domain. As such, Intravirt must add a mechanism to detect whether the signal legitimately comes from the kernel or from the untrusted domain. Figure 5.1 shows our approach, which uses a special flag residing in the trusted domain as proof of the PKRU status before WRPKRU. This flag is allocated with key 0, so it is writable only if the signal handler is invoked by the kernel, which resets PKRU to the default. A spoofed signal handler invocation from the untrusted domain instead results in a segmentation fault that our signal handler can detect.
Figure 5.2: State transition with signals. UT: untrusted; T: trusted; Sig: signal handler (signal masked by kernel); Smi: semi-trusted domain.
The next major issue is dealing with signals delivered while Intravirt's system call virtualization is working in the trusted domain. This can cause reentrancy bugs, leading to potential security violations due to corrupted state. We must guarantee that our signal handler can only be invoked by the kernel once until we decide either to deliver or to defer the signal and return to the corresponding state. The second problem arises from the complexity of placing Intravirt between the untrusted domain and the kernel, in the case where a signal is delivered during Intravirt's handling of a syscall. Unfortunately, we cannot simply ignore these signals because that would break functionality. In this case, Intravirt must defer the signal until after the syscall completes.
Our solution for interrupted signal delivery is to emulate the kernel's behavior almost exactly. As depicted in Figure 5.2, signals occurring while in the trusted domain are deferred by adding them to an internal pending-signal queue and masking that particular type of signal in the kernel. The latter step is not strictly necessary, but it pushes the complexity of managing multiple signals of the same type to the kernel. Once the current operation completes, Intravirt selects the last available signal that has not been masked by the user and delivers it.
Signals represent the most complex aspect of Intravirt. They present subtle but fundamental attack vectors while also exposing significant concurrency and compatibility issues. Intravirt appropriately handles all these cases and identifies several issues not mentioned by prior work [1].
5.4.1 Signals for Ephemeral System Call Trampoline
Typically, signals return via the sigreturn system call. In the case of the ephemeral nexpoline, we cannot rely on the kernel infrastructure to return from a signal, since we have to clean up the system call instructions after the sigreturn system call. Unfortunately, this is quite hard to achieve, since sigreturn may return directly to the untrusted domain. Hence, emulating sigreturn in userspace is easier, especially since the x86 instruction set provides the XRSTOR instruction to load CPU state from a memory location.
Another issue of signal virtualization for secc-eph is that a signal could occur during the cleanup sequence, resulting in a race condition between the cleanup and the signal handling. To overcome this, we created a 'transaction' for the cleanup phase of a system call. If a signal occurs within the cleanup procedure, Intravirt does not resume at the trap source but at the start of the cleanup phase. Therefore, we reset rip to the beginning of the cleanup phase and then restore the signal context. This procedure guarantees that signals occurring at any point within the trusted domain clean the nexpoline.
5.4.2 Multithreading Design
In the first version, Intravirt was single-threaded, and the kernel delivers a signal to the interrupted thread. If the kernel delivers a signal on the back end of a syscall, we always arrive from domain 0, so the kernel can write the signal frame to the key-0 secure stack. The problem arises when domain 1 is interrupted by a signal: the kernel copies the interrupted thread's PKRU value and therefore cannot push to the domain-0 stack. The kernel's copy to user memory fails, and the delivery faults.
To solve this, we initially placed signal delivery on an untrusted trampoline page that the kernel can always write, which jumps directly into Intravirt to handle the signal. This worked, but it reintroduced the signal-spoofing attack, because the untrusted domain could now jump to the untrusted stack trampoline itself. We solved this with a nexpoline-style solution.
Multithreading creates a new problem: this open page no longer works, because it is accessible to other threads in the same default domain. We concluded that the interface provided by the kernel is simply broken for this use. To fix it, we modify the kernel to allow a signal to return both to a registered stack, which is already supported, and to a specific registered key value. We thus always return to domain 0 and the domain-0 stack and never expose the data. We must then ensure that no one else registers, so we deny any registrations after initialization.
To summarize: a small kernel patch allows a default domain and denies any further registration.
5.4.3 CET
CET also complicates the signal design by adding another stack that must be taken care of during signal delivery. We add a special system call to write to the shadow stack, which allows us to push the restore-address RIP, the signal handler RIP, and a restore token onto the shadow stack, so that we hold the token required for switching stacks when exiting Intravirt. The virtualized sigreturn uses a similar trick to switch back to the old stack.
5.4.4 Multiple subdomains
As we discussed, control flow and the corresponding CPU state are critical to the integrity of a sensitive application. This applies not only to Intravirt but also to the sandbox and the safebox. Since users can run arbitrary code in the subdomains, any interruption during the execution of boxed code can be exploited to leak data. For this reason, we block signals from the view of subdomains. The kernel can still deliver a signal to the Intravirt signal entrypoint, but we treat it as a signal delivered in the trusted domain and leave it pending.
5.5 Multi-threading and Concurrency
Since multi-threading is an essential element of the modern computing environment, subprocess isolation must also support it securely. However, supporting concurrency in such an isolated environment is not trivial.
5.5.1 Concurrency in subprocess isolation
First, the underlying OS makes all threads share memory and OS objects such as file descriptors without limitation. In this environment, a concurrent thread could easily interfere with the isolation abstraction.
Second, many applications use thread-local storage (TLS) to store per-thread data such as the call stack and thread maintenance information. The isolation abstraction requires management data structures for the domains, but neither the OS nor userspace threading libraries (e.g., pthread) provide multiple TLS areas per thread.
Lastly, in a multithreaded application, shared data structures, locks, and notification mechanisms are commonly used to communicate between threads. The isolation abstraction can hinder effective communication, because some of these data structures may be isolated while others are not; the isolation domains and the thread communication structures therefore interact in a matrix-like design.
In summary, to implement subprocess isolation correctly, we must make concurrency one of the main priorities, design for it clearly and extensively, and test the implementation rigorously.
5.5.2 Multithreading model
To provide concurrency, we must select a multi-threading model around which to design a proper environment. For example, we could consider a one-to-many model in which a single Intravirt thread mediates all system call executions for every thread in the process. This model, however, would clearly suffer significant performance and concurrency problems, so we reject it. The multi-threading model in Intravirt is instead one-to-one: each thread has its own Intravirt instance, maintaining local data structures for its stacks, PKRU state, and (in some designs) trampoline, but sharing policy enforcement information (e.g., the memory mapping). Therefore, all system call virtualization and policy enforcement are performed by each thread itself.
5.5.3 Thread Local Data Structure
In Intravirt, there are various types of thread-local data for which we must provide protection from unauthorized access, safe management to prevent collisions, and efficient access by each thread without complex address derivation. To provide these features, we use the GS register supported by the x86 architecture.
The GS register, along with the FS register, is a user-level segment register that applications may use. However, the FS register is widely used by gcc and pthread for the stack canary and thread-local data, so Intravirt uses the GS register, which no common application is known to use explicitly. Intravirt stores its thread-local data in a data structure and keeps a pointer to that structure in the GS register, from which it can be accessed easily via segment-relative offsets, as with other segment registers.
The thread-local data structure is protected by MPK, so only the monitor domain can access it, and any access attempt from an untrusted domain is rejected by the CPU. An attacker could, however, create a maliciously crafted thread-local data structure and modify the GS register to point at the malformed data. Fortunately, the GS register is only set via the arch_prctl system call, which we can easily virtualize to prevent unauthorized modification.
5.5.4 Required Atomicity
Linux does not guarantee the order of system call execution when multiple threads execute system calls at the same time. It only uses internal locks to prevent direct collisions, such as two simultaneous accesses to the same file descriptor. Overall, Linux has no strict atomicity policy where there is no critical collision. In Intravirt, however, system calls are virtualized and security policy may be enforced on them. There are therefore numerous security condition checks, and some virtualized system calls execute a series of other system calls along with those checks. In a multi-threaded environment, such checks and calls can be interleaved with other threads, and attackers can exploit the interleaving. For example, as presented in PKU Pitfalls [1], an attacker can access protected memory through /proc/self/mem, so one of the base policies must check whether a thread is trying to access it. In Intravirt, /proc/self/mem is treated as a special file: a flag is set on the open system call, and the file offset is checked on every file access such as read and write. But there are many different TOCTOU attack scenarios: the attacker could spawn another thread and manipulate the file offset using lseek. Likewise, if the flag were set before the actual file descriptor is assigned by the kernel, the attacker could access the file before the flag is set.
Therefore, Intravirt provides an internal locking mechanism that delivers the atomicity needed to prevent such TOCTOU attacks. In the current implementation, one lock covers memory-related system calls such as mmap and mprotect, one covers signal-related system calls such as rt_sigaction, and one exists for each opened file descriptor. For files, we do not lock on every access; to preserve the same semantics as stock Linux, only close is blocked while another thread is in the sysret-gadget. We block close to prevent an attacker from simultaneously closing the file, opening a new file that receives the same file descriptor, and attacking through it.
5.5.5 sysret-gadget Race Condition
We have already argued that protecting the sysret-gadget is essential to preventing unauthorized system call execution. In a multi-threaded environment, this protection must be carefully designed. For example, in Ephemeral Intravirt, the sysret-gadget location is fixed, and the gadget exists while a thread is executing a system call. Therefore, an attacking thread could simply jmp to the gadget while another thread is calling a system call. Likewise, in Randomized Intravirt, two threads could collide at the same gadget location, and a shared gadget location would increase the probability of a successful guess. Each design configuration therefore needs its own protection mechanism.
First, in Randomized Intravirt, each thread's sysret-gadget area does not overlap with those of other threads, so there is no collision and the guessing probability remains unchanged.
In Ephemeral Intravirt, we apply a per-thread seccomp filter to prevent access to other threads' sysret-gadgets. However, a seccomp filter is always inherited from the parent and cannot be removed, so a child thread would receive the same filter as its parent. We therefore introduce a special thread, called the Queen thread, which only spawns other threads on behalf of the application, and to which we apply no seccomp filter. When the application creates a new thread by calling the clone system call, the Queen thread receives the request, creates the new thread, applies a new seccomp filter to it, and jmps to the application code to start the thread.
In CET Intravirt, we have a per-thread shadow stack, so any kind of unauthorized indirect jump can be detected and rejected easily.
5.5.6 Clone
In Linux, the clone system call creates both new threads and new processes. For Intravirt to maintain its integrity, this process must be handled carefully. As described above, the seccomp_eiv variant, for example, needs special consideration for the sysret-gadget. When clone is called, we first determine from the flags whether the call creates a new process or a new thread. For all Intravirt variants except seccomp_eiv, the syscall is invoked directly, and both the old and the new process continue executing after the kernel returns. In seccomp_eiv, which requires a Queen thread, simply invoking the syscall without preparation would create a new process that discards all other threads in the child's address space, including the Queen thread, leaving the new process unable to create threads. In this case, the Queen thread spawns the new process instead of the caller thread; the newly created Queen thread then spawns another thread, restores the old context into it, and jumps to the point in the parent process where clone was called. When creating a new thread, the calling thread maps a new stack for it, allocates the local data structures on that stack, and copies the untrusted domain's context onto the stack so the new thread can return to the caller with the proper context.
In the CET variant, new shadow stacks for the untrusted domain and the other subdomains are also created, with restore tokens. For the untrusted domain's shadow stack, the RIP from the old thread is also pushed onto the stack, and the stack addresses are stored in the new thread's local data structure. The sysret-gadget and trampoline (if needed) are also prepared in this data structure, and then the initialization arguments are pushed onto the stack. The clone system call is invoked by the current thread, or by the Queen thread if seccomp_eiv is used. In the old thread, clone returns as soon as the new tid is available, while the new thread jumps directly to the thread start code. The new thread now has state identical to the old one, but it cannot use the normal syscall path, since its GS segment is not yet initialized, and its seccomp filter or syscall dispatch is disabled (by the Queen thread or the kernel, respectively). We must reestablish these before handing control to the untrusted domain. In the CET variant, we can simply use a syscall instruction to call arch_prctl to set the GS segment, since CET prevents the syscall instruction from being abused. For the other variants, we hold the address of the thread-local data, which contains all information needed for the system call; we use this information instead of GS-based addressing to reach the sysret-gadget for the same arch_prctl. Next, we enable the syscall filter by calling seccomp or by setting the dispatch address range, as during Intravirt initialization. Finally, we restore all context data and jump to monitor_ret to return from Intravirt and restore the context.
5.5.7 Multi-Domain
Intel MPK allows changing the PKRU register through the WRPKRU instruction. As we discussed earlier in the thesis, this threatens any domain-based privilege model, since an attacker can always override the current protection domain using this instruction, and we use binary inspection to eliminate WRPKRU from untrusted code. However, this also means that untrusted code cannot switch domains for its own use.
While Intravirt itself uses two MPK domains for its private data and one as the general untrusted domain, 13 domains remain that the untrusted part can use as private memory domains for security and confidentiality. To enable this, we wrap the MPK interface in our multi-domain Intravirt design, providing all the essential components: isolated encapsulations for code, data, and context; tracking of the current PKRU inside Intravirt; a call gate from the untrusted domain into an encapsulation through a set of fixed entrypoints; and a library that helps the user annotate sensitive data and code.
Secure Dynamic Loading Since we do not allow WRPKRU in user code, we add a new virtual system call, iv_domain, to complete the encapsulation of a domain. iv_domain accepts pointers to the code and data segments, a pointer to a function table containing the legitimate entrypoints, and a pointer to a stub function. Intravirt assigns an unused MPK domain to this encapsulation alone and maps the code and data into that domain to prevent other domains from accessing them.
The stub function is our solution for placing the domain switch, which contains WRPKRU, close to the user code, so that no indirect call to a switching function is needed and user code can refer to it as a normal function symbol, while ensuring the WRPKRU cannot be reused against our system. The stub is loaded when iv_domain is called and is mapped as Intravirt memory.
To ensure that any code in the current application that calls into the box cannot be compromised, after any use of the iv_domain system call all executable pages are locked down: mmap, mremap, or munmap of an executable page in any form becomes an illegal operation.
Secure xcall These encapsulations have their data and code memory marked with their own MPK domains. We then allow the user to change MPK domains in order to use this data, though not arbitrarily, but only through our xcall interface.
When linked by the user, xcall is a stub function that is replaced during loading. It first looks up the called function in a protected function table, which is copied from the original function table into protected memory, to check that the call targets a legitimate entrypoint. It then switches to the trusted domain; updates the variables tracking the current PKRU, the previous PKRU, and the stacks (including the shadow stack if CET is used); fetches the address of the context associated with the target domain from Intravirt memory; and switches to that domain and context. A special case is that the system might not be in the untrusted domain: if the PKRU state indicates the program is already running inside the requested domain, we bypass the xcall gate, leave the data structures untouched, and jmp to the target address as if the xcall did not exist. After switching, the gate calls the target function. This ensures that every xcall enters the correct MPK domain and transfers control statically to a fixed set of entrances, rather than to arbitrary addresses supplied by the caller (even addresses inside the encapsulation), which would leak privilege to a potential attacker.
Signal delivery is disabled after the protection domain switch, for the same reason it is disabled while Intravirt code is running: to prevent leaking CPU state and to keep the control flow from being disrupted by a signal.
When the called function returns, it returns to our xcall gate, which switches back to untrusted: it first switches to Intravirt, updates the current PKRU back to the untrusted value, and finally performs the switch with the current PKRU while also switching back to the old caller stack.
Note that there is a configurable limit on the size of the stack-carried arguments for calls made through xcall; raising this limit increases the overhead of each switch.
Whole library isolation Besides the fine-grained semantics that isolate only sensitive functions and data, we provide another way of creating an isolated domain: adding a few lines of code that use iv_domain to inform Intravirt of the base address of the current library. Intravirt reads all exported symbols from the ELF symbol table at that address, creates stub code that calls into the same secure xcall as in the fine-grained case, and redirects all these symbols to the new stubs. Any lookup of the symbols then yields our new addresses, so every call into the library automatically switches to the isolated domain and switches back on return from the exported functions, while calls inside the library remain normal function calls. This feature is mostly intended for isolating sensitive libraries, but the programmer should be aware of calls to outside functions, especially libc functions that might leak sensitive data (e.g., memcpy), as well as functions in the library that leak sensitive data directly (e.g., dump_key) or indirectly (e.g., bn_mul). Domain isolation only ensures that certain memory is not accessible from the outside; it cannot do anything about a data flow that is intended to leak data.
Concretely, the developer uses the libOS functionality described above, which executes via glibc while Intravirt performs the setup: 1) the libOS is linked and dynamically hooks the allocation and free routines; 2) the base address of libcrypto is obtained; 3) the symbol table is walked to find libssl; 4) a visor call into Intravirt sets up the library domain isolation via safebox(libaddress, domainid). Intravirt then finds all code, data, and bss pages and sets up the keys, while the libOS provides a lazy slab allocator. One property of this technique is that callbacks do not switch back to the caller's domain, which can be problematic.
Safebox libOS On the Intravirt side, we provide only the most basic and essential building blocks for describing the encapsulation structure supplied by the user. There are a few important elements in this design. First, all related data must be placed in secure memory; we provide this with a macro, ISO_DATA, which adds a section attribute. The same is done for code with a macro ISO_CODE. We also mark all entrypoints with a macro ISO_ENTRY; this likewise works through a special section for the function table, with symbols generated automatically by the GCC linker to mark the start and end of the section. These two symbols can later be used by iv_domain to provide the entrypoints. The libOS also generates the encapsulation, the stub function, and the initialization function automatically, and tries to pad code and data to page boundaries.
To simplify the use of xcall, every entrypoint gets a wrapping function that loads its function id and jumps to the actual xcall gate generated by Intravirt. Any use of these wrapping functions goes through the xcall macro, which the GCC compiler translates into a call to the wrapping function with all its arguments. In short, all you need to do is replace 'func(args)' with 'xcall(func, args)'.
We also support a simple thread-safe slab allocator, which the user can enable by including a single header for the allocator.
5.6 Implementation Details
Intravirt is built from five primary components: secure loading, privilege and memory virtualization, syscall virtualization, signal virtualization, and xcall gates. We use the Graphene passthrough LibOS [74-76] to load securely, insert syscall hooking into glibc, and separate the trusted domain from untrusted domain memory regions. We use ERIM [24] to isolate memory and protect WRPKRU, plus 200 LoC for tracking page attributes. We implement all syscall and signal virtualization code ourselves. In total, our system comprises approximately 15k lines of code, of which ~6,400 are new Intravirt code.
Chapter 6
Use Cases
In this chapter, we present actual application scenarios that Intravirt enables. Because Intravirt provides an isolated endoprocess environment, there are numerous applications we could build on it. Using Intravirt's system call virtualization feature, we can apply different system call policies to each endoprocess, similar to mandatory access control mechanisms such as SELinux [42], but at the endoprocess level.
6.1 Library Isolation
First, we present applications that make use of library separation. Using Intravirt, we can safely separate the code and data of libraries, with xcalls providing a compelling yet fast isolated environment.
6.1.1 Reference Application: zlib
The very first use case for Intravirt is zlib [77]. Isolating zlib brings no noticeable security benefit, but zlib serves as a reference use case for essentially all library isolation techniques. This use case can therefore act as a baseline application for Intravirt, allowing easy comparison with other techniques in terms of performance, applicability, and compatibility.
We isolate zlib with the whole-library separation technique. The implementation is relatively simple: we modify zlib to add a constructor function that obtains the library's symbol information by calling the dladdr function in the loader and then calls the iv_domain system call to assign a new domain to zlib. In addition, the allocator can be replaced with a new one if required. For this we added ten lines of C code, plus a dependency on the loader (ld.so). Applications that use zlib require no modification; calling the zlib API automatically invokes the domain switch. We present the performance evaluation in § 7.3.
6.1.2 Safeboxing OpenSSL in NGINX
OpenSSL [3] is responsible for secure communication and cryptographic operations in the NGINX [52] web server. Once it is compromised, the impact is significant: leaked session keys expose the encrypted messages, and the server's identity can be stolen through the leaked private key. Unfortunately, OpenSSL is a dynamically linked library loaded at NGINX process startup that shares all memory with the rest of the application. Therefore, any small vulnerability in the NGINX web server can lead to a complete breach of the secret information.
There have been substantial efforts to separate OpenSSL from the application to prevent such attacks. For example, the H2O web server project [11] separates the private key management module into a different process that performs all operations involving the private key and communicates with the primary web server process over IPC. They therefore claim that the private key remains protected even after the web server is compromised. The process-separation design incurs a performance overhead, but the overall overhead is only about 2%, because the private key is needed only at the beginning of a web session. Unfortunately, H2O cannot protect its session keys: if session key management were also separated into a different process, the performance overhead would increase significantly due to the frequent use of the session key. The design also relies on the complex low-level OpenSSL crypto APIs, because separating cryptographic operations into different processes required reimplementing the cryptographic functions, increasing dependency and complexity.
In contrast, ERIM [24] chose OpenSSL session key protection as its use case. Its authors modified OpenSSL to add domain switch code before and after AES session key operations, so no other part of the program can access the session keys without switching domains. The performance overhead is about 2-3%. However, due to its ad hoc implementation, it supports only the AES algorithm and only one protection domain.
We apply the same whole-library separation as for zlib. All code, functions, and data in libssl.so and libcrypto.so are isolated, and an xcall is made whenever any OpenSSL API is called. All secure communication resources are thus protected, including the private key and the session keys. However, this does not protect against bugs inside the OpenSSL library itself, such as Heartbleed. We added 215 lines of code to OpenSSL; we cover the performance evaluation in § 7.3.
6.2 Module sandboxing
Intravirt can also isolate parts of a program, not just whole libraries. Since it can isolate units smaller than a library, Intravirt can provide finer-grained isolation. In this case, however, the protected memory must be page-aligned, because MPK provides per-page protection.
6.2.1 Sandboxing HTTP Parser in NGINX
The NGINX parser performs straightforward functionality: it reads the message received by the network module, interprets the message contents, fills the output data structure, and returns to the caller. However, the parser module lives in the same address space and shares all resources with the rest of the process, so a compromised parser can lead to severe data exposure, such as personal or financial information. The parser also acts as a frontend module in NGINX, making it a likely target for attackers, and there are actual buffer overflow vulnerabilities in the parser [4-6]. By exploiting these vulnerabilities, an attacker could gain complete control of the web server. We therefore need to sandbox the parser to strip it of most of its privileges.
In this use case, we modified the NGINX HTTP request handler to acquire the addresses of the parser functions, insert the call gate, and invoke the iv_domain system call, so that Intravirt assigns a new sandbox endoprocess and prepares the call stack to isolate the parser. As a result, when the parser is invoked, NGINX calls xcall instead of the parser directly; xcall performs the domain switch and then calls the parser function. The current policy for the sandbox endoprocess is that it cannot call any functions or system calls and cannot access any memory pages outside the endoprocess.
The data structures pose a problem. If the output data structure is allocated outside the parser, the parser cannot access it. In this case, we have two solutions. The first is to move the allocator inside the sandbox. This approach has a performance advantage, but the implementation could be challenging. The second solution is to demote the already-allocated memory pages before feeding them into the parser. This approach is easier to implement, but it can incur performance overhead due to the MPK key change on the pages whenever the parser is called. We used the second solution, and we added page-alignment code to the allocator. With this approach, we added 377 lines of code to NGINX. We address the performance overhead in § 7.3.
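The demotion approach can be sketched as follows. This is our own illustration, not Intravirt's code: the allocator returns page-aligned buffers (so no unrelated data shares a page), and the pages are handed to the sandbox's MPK key via pkey_mprotect right before the parser runs. The function names and the key-handling fallback are ours; on CPUs without MPK the sketch simply leaves the buffer as-is.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

/* Allocate the parser's output buffer page-aligned, since MPK keys
 * apply per page and the buffer must not share a page with other data. */
static void *alloc_parser_buffer(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    if (posix_memalign(&buf, page, (len + page - 1) & ~(page - 1)))
        return NULL;
    memset(buf, 0, len);
    return buf;
}

/* Demote the pages to the sandbox's protection key (a hypothetical key
 * handed out by the monitor).  Without MPK support, or with a negative
 * key, this sketch is a no-op. */
static int demote_to_sandbox(void *buf, size_t len, int sandbox_pkey) {
#ifdef SYS_pkey_mprotect
    if (sandbox_pkey >= 0)
        return (int)syscall(SYS_pkey_mprotect, buf, len,
                            PROT_READ | PROT_WRITE, sandbox_pkey);
#endif
    (void)buf; (void)len; (void)sandbox_pkey;
    return 0;
}
```

The key change on every parser invocation is exactly the overhead the second solution trades for its simpler implementation.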
6.2.2 Preventing sudo Privilege Escalation
A recent bug was found in the sudo argument parser that allows an attacker to corrupt a function pointer and gain control with root access [7]. We compartmentalize sudo so that the parser code, in file parse_args.c, is sandboxed and restricted to only the command line arguments and an output buffer. The worst an attack can now achieve is overflowing the parser's internal buffer, eventually causing a segmentation fault but nothing more harmful. In summary, by changing approximately 200 lines of code, importing our libsep into sudo, and using Intravirt, we confine the argument parser and successfully prevent the root exploit. More generally, almost all parsers exhibit a similar type of behavior and could benefit from similar changes, possibly even applied automatically.
6.3 Endo-process System Call Policy Enhancement
The system call virtualization feature can be used for additional OS object protection along with library separation. For example, we can provide a different system call virtualization policy for each endoprocess.
6.3.1 NGINX Private Key File Protection
As mentioned in § 6.1.2, we can protect the session keys and the private keys stored in memory. However, this is not complete protection: we also have to protect the private key files stored on disk. H2O [11] separates private key management into another process, but a compromised web server could simply open the private key file and read its contents. To prevent such an attack, the administrator must employ other access control mechanisms such as Unix user IDs or mandatory access controls such as SELinux [42]. However, applying these access controls requires significant modification of the H2O source code and the application runtime model, because a simple fork system call cannot provide different subject identifiers for the processes.
In Intravirt, the private key files can also be protected by implementing additional system call virtualization policies within the safeboxed OpenSSL in NGINX. In this section, we introduce a file capability system based on Intravirt. To provide a secure and efficient file capability system, we need to analyze the threats and the relevant system calls. We also need to define the system call policy as well as the concurrency considerations. We discuss the performance evaluation in § 7.3.
Assumption
In Linux, all OS objects are abstracted as files, but we only consider regular files actually stored on disk. Files like sockets, pipes, and device nodes are out of scope. We also do not consider an attacker executing other programs to manipulate the files; attacks are considered only within the same process boundary.
Identifying the Private Key Files
Identifying the private key files is the first task in this use case. The identifier must be immutable and not copyable, and it should remain valid until the end of the application process. Many identification mechanisms could fulfill these requirements, but we discuss only two of them.
First, we could make use of the most basic file system identifier, the inode. The inode is a unique integer in the file system, which we could use as an identifier in a tuple with the file system's device identifier. However, in this case we would have to maintain a data structure indicating which inodes are private key files, and this data structure would have to be protected by Intravirt. We would also need to provide an interface to manage the identification data structures, which could be implemented via configuration file access or pseudo system calls.
The second approach is to import a similar concept from other techniques. For example, mandatory access control mechanisms such as SELinux [42] and AppArmor [44] support a file object labeling mechanism by using extended attributes in the Linux file system [78]. In this approach, each file can carry an extended attribute item indicating that the file is a private key file, and Intravirt enforces the policy by reading the attribute in the virtualized system call routines. This approach is straightforward to apply, but it does not work on file systems that do not support extended attributes.
This dissertation uses the second approach, with the private key file stored in an ext4 file system. The policies are: 1) if the label does not exist or the label says unbox, then both the unbox and safebox domains can access the file, and 2) if the label says safebox, then only safebox domains are allowed to access the file. We use additional system calls such as getxattr and fgetxattr to retrieve the labels of the files.
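The two policies above amount to a small decision function. The sketch below is our own illustration (the domain and label names follow the text, but the actual Intravirt code may differ); in a real virtualized system call the label string would come from getxattr or fgetxattr on the target file.

```c
#include <string.h>
#include <stddef.h>

enum domain { DOMAIN_UNBOX, DOMAIN_SAFEBOX };

/* Policy from the text: files with no label or labeled "unbox" are
 * accessible from both domains; files labeled "safebox" only from
 * safebox domains.  `label` is the extended-attribute value read via
 * getxattr()/fgetxattr(), or NULL when the attribute does not exist.
 * Returns 1 to allow; 0 to deny (the virtualized syscall would then
 * fail with EPERM). */
static int keyfile_access_allowed(const char *label, enum domain caller) {
    if (label == NULL || strcmp(label, "unbox") == 0)
        return 1;                        /* rule 1: unlabeled or unbox */
    if (strcmp(label, "safebox") == 0)
        return caller == DOMAIN_SAFEBOX; /* rule 2: safebox-only */
    return 0;                            /* unknown label: deny by default */
}
```

Denying unknown labels by default is our own conservative choice; the text only defines the two labels above.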
Possible Attacks and Mitigations
There are multiple ways to access the private key files via various system calls.
Direct Access These system calls access files directly; examples are read, write, preadv, and pwritev. Some system calls, like truncate, can modify a file without opening or reading it. For these, we check the permission of the file and then execute the system call.
Control Takeover These system calls do not access file contents directly, but they control file operations by modifying file information or file handles. Examples are open, close, rename, unlink, chdir, and chmod. We enforce the policy in these system calls as well.
Data Theft These system calls copy file contents rather than accessing them directly; examples are dup and sendfile. They require a permission check on the source file as well as on the destination file, because the kernel overwrites the destination file in some of these system calls.
Indirect Access These system calls, such as execve, do not access file contents, but an attacker could infer the contents by executing them. We need to enforce the file protection policy for them as well.
Denial of Service These system calls do not access the files, but they can induce malfunctions in normal file operations. For example, lseek can change the file offset to disturb benign access, and flock can lock a file to prevent access by others. These also require policy enforcement.
Attachment These system calls do not access the target files directly, but they create a reference to them; examples are link and symlink. We do not directly enforce the policy in this type of system call, but we check whether the path is a symlink or the real path during the permission checks.
Reading Information These system calls read file information such as size, type, and timestamps; examples are stat and getxattr. We do not enforce anything in particular on these system calls.
Policy Enforcement
First, we assigned the “safebox” label to the private key files used by NGINX, using the setfattr command [79] in the shell. We then newly virtualized, or modified the existing policy of, 57 system calls in total. In each virtualized system call, we check the label of the files passed as input parameters by calling the getxattr or fgetxattr system call. We continue the system call execution when the label and the caller domain match; otherwise, we return a permission error (EPERM). Additionally, we virtualize a few more system calls for consistency, such as access and faccessat, because their purpose is to check permissions.
Concurrency Consideration
In a multithreaded environment, assuming the attackers have full control of synchronization and timing, we have to be aware of TOCTOU attacks, because the implementation performs permission checks before executing the system calls. First, for system calls that receive file descriptors as input parameters, we can achieve safe concurrency by using the locking system addressed in § 5.5. On the other hand, system calls that take a path as an input parameter need more caution. Relative paths are allowed for these system calls, so an attacker might change the current working directory after the permission check. Symbolic links pose another problem: an attacker could replace the original symbolic link with a malicious one pointing to a different target path.
To deal with these problems, we introduce a new lock: the symlink, chdir, and fchdir system calls are blocked while other threads are inside any system call that uses path information. Lastly, some system calls with the “at” suffix, such as openat, allow a directory file descriptor to serve as the root of the path lookup, and that descriptor might be substituted by an attacker thread at any time. Therefore, we also apply locks on directory file descriptors, similar to other file descriptors.
6.3.2 Directory Protection
We can extend the use case of § 6.3.1 by protecting the whole ssl directory, including all its files and subdirectories. We could simply label all the files and subdirectories in the directory we want to protect, but then it would be hard to handle changes to the files. Therefore, we need a new approach to protect a directory.
This use case is the same as chroot [80] with an inverted security policy, and it is very useful for providing private storage to each endoprocess. We address the design and implementation in this section. The performance of this application is covered in § 7.3.
Identifying the Protected Directory
As we discussed in § 6.3.1, we could use a unique value of the file, such as its inode, or an additional attribute of the file, such as an extended attribute. Another method is to use a new system call to identify the directory to be protected, which then remains protected for the lifetime of the process, just as chroot does. We assigned a new system call, endo_toorhc, and let Intravirt intercept the call and manage the protected directories.
After selecting the directory to protect, we need to identify all its files and subdirectories. We could perform this task file by file, but that could have severe performance overhead, and it is not easy to handle events such as file creation or deletion. Instead, we label only the root directory to be protected, and we must determine the location of the file in every file operation, which has to be effectively fast.
The system calls taking a file path allow relative paths and indirect components like “..”. Also, a user can create a symbolic link pointing to any file in the system, so it is hard to determine the correct absolute path of a file in userspace from the given path information alone. To solve this issue, we use the /proc/self/fd/ directory. The kernel provides this interface in the proc file system: the absolute path of every opened file is exposed as a symbolic link, so we use the readlink system call to read that link and identify the exact location of the file. However, this only shows the paths of opened files, so we need to open the target every time we want to check its absolute path. This approach incurs performance overhead due to up to two additional system calls (open and readlink) for each file operation. In addition, the result of readlink is provided as a string, so we must perform string comparison operations. Nevertheless, this appears to be the most accurate way to retrieve the absolute path of a file in userspace. We can cache the file's location once it is open, reducing the overall overhead.
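The lookup just described can be sketched as below. This is a minimal Linux-only illustration of the open-then-readlink pattern, written by us; Intravirt's actual routine additionally handles errors, caching, and directory descriptors.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve the absolute path of `path` as described in the text: open
 * it, then readlink() the /proc/self/fd/<fd> symlink the kernel
 * maintains for every open descriptor.  Returns 0 on success and
 * writes the NUL-terminated absolute path into `out`. */
static int resolve_abspath(const char *path, char *out, size_t outlen) {
    char link[64];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link, out, outlen - 1);
    close(fd);              /* the cost: an extra open+readlink per check */
    if (n < 0)
        return -1;
    out[n] = '\0';          /* readlink does not NUL-terminate */
    return 0;
}
```

Because the result is a plain string, the subsequent protected-directory check reduces to string comparison, which is why caching the resolved path per open descriptor pays off.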
Policy Enforcement
First of all, the application calls the newly added endo_toorhc system call to select a directory to be protected. During this selection procedure, the application can also select the domain allowed to access the directory. Like chroot, endo_toorhc is privileged: only a safebox domain can call this new system call. After the protected directory is selected, all files and subdirectories in it are accessible only to the given domain.
For all system calls taking a path as an input parameter, Intravirt first opens the file, retrieves its absolute path by calling readlink on /proc/self/fd/[FD], and compares the prefix of the absolute path with the selected protected directory. If it matches, Intravirt decides access by comparing the caller's domain with the domain selected for the protected directory. Since Intravirt has already opened the file in this case, we substitute some system calls with their identical file-descriptor-based counterparts, such as chmod with fchmod. Also, all system calls that open a new file descriptor cache the label of the opened file to reduce the performance overhead.
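The prefix comparison needs a small amount of care at the directory boundary. The following helper is our own sketch (the example paths are hypothetical): a naive strncmp would wrongly treat "/srv/ssl-old" as lying inside "/srv/ssl".

```c
#include <string.h>

/* Check whether `abspath` (as resolved via /proc/self/fd) lies inside
 * the protected directory `dir`.  Beyond the plain prefix test, the
 * character after the prefix must be a path separator (or the end of
 * the string, for the directory itself). */
static int in_protected_dir(const char *abspath, const char *dir) {
    size_t n = strlen(dir);
    while (n > 1 && dir[n - 1] == '/')   /* ignore trailing slashes */
        n--;
    if (strncmp(abspath, dir, n) != 0)
        return 0;
    return abspath[n] == '/' || abspath[n] == '\0';
}
```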
Lastly, this use case is orthogonal to the use case in § 6.3.1, so we can apply both at the same time. However, we need to consider collision cases in which a file and its containing directory are protected with different labels. The current implementation takes the higher label for policy enforcement: for example, for a file labeled safebox inside a directory protected for unbox, we take safebox because it is higher.
Concurrency Consideration
We have the same concurrency issues as discussed in § 6.3.1. In addition, we need one more lock to prevent any race condition on endo_toorhc.
Applying endo_chroot
This use case is very similar to chroot, except for the inverted security policy. Therefore, we could also consider implementing endo_chroot itself. However, chroot is much more complex than this use case. For example, chroot modifies path visibility: all path information after chroot has to resolve below the new root directory. In Intravirt, it is not trivial to rewrite all paths under the new root directory, since we would have to insert the new root directory as a prefix of every path. Also, some system calls with directory file descriptors (i.e., openat) require even more complex path manipulation. We would also need to virtualize the current working directory and the file descriptors for each domain, which is likewise very complex.
Chapter 7
Evaluation
7.1 Security Evaluation
Table 7.1 summarizes our quantitative security analysis based on known attacks described by Conner et al. [1] and additional attacks we found. In general, Intravirt defends against the attacks raised in [1]: its system call and signal virtualization guarantees security properties that prevent the exploitation described in those attacks. We exclude the two race-condition attacks, due to their requirement for multi-threading, which Intravirt does not support.
In addition to the attacks described by Conner et al., we found several attacks against subprocess system call and signal virtualization. For the evaluation, we created a fixed-address secret inside the trusted domain. All test cases try to steal this secret and hence would break Intravirt's isolation guarantees. The attacks try to bypass our system call virtualization by performing system calls that modify the protection policy of the secret, or try to elevate themselves to trusted status by overriding the PKRU register. They specifically target the implementation of Intravirt and highlight the degree to which Intravirt has followed through on its security guarantees. Ideally, Intravirt prevents all attacks.
7.1.1 Fake Signal
Intravirt effectively prevents the basic sigreturn attack from [1]. However, the kernel places signals on the untrusted stack and delivers the signal to our monitor's signal entrypoint.

Table 7.1: Quantitative security analysis based on attacks demonstrated in [1] and attacks found by us. • indicates the Intravirt variant in this column prevents the attack, ◦ that it is vulnerable, and × that the attack is beyond Intravirt's threat model.

Attack                                   secc-rand  secc-eph  CET
Inconsistency of PKU Permission [1]          •         •       •
Inconsistency of PT Permissions [1]          •         •       •
Mappings with Mutable Backings [1]           •         •       •
Changing Code by Relocation [1]              •         •       •
Modifying PKRU via sigreturn [1]             •         •       •
Race condition in Signal Delivery [1]        ×         ×       ×
Race condition in Memory Scanning [1]        ×         ×       ×
Determination of Trusted Mappings [1]        •         •       •
Influencing Behavior with seccomp [1]        •         •       •
Modifying Trusted Mappings [1]               •         •       •
Fake Signal                                  •         •       •
Fork Bomb                                    ◦         •       •
Syscall Arguments Abuse                      •         •       •
Race condition using shared memory           •         •       •
TSX attack                                   ×         •       •

The untrusted application may forge a signal frame and directly call the monitor's signal entrypoint; as a result, it can, e.g., choose the PKRU value and the return address. Therefore, the entrypoint has to distinguish between a fake signal from the untrusted application and a real signal from the kernel. The entrypoint is carefully constructed such that a signal delivered by the kernel returns with the privileges of the trusted domain and hence is capable of writing trusted memory. We rely on this observation and place an instruction at the beginning of the monitor's entrypoint which raises a flag in trusted memory. A fake signal created by the untrusted application cannot raise this flag, which trips a check that cannot be bypassed in the monitor's signal entrypoint.
7.1.2 Fork Bomb
This attack targets the random location of the system call instruction in Intravirt. To perform a system call, the untrusted application may guess the random location of the system call instruction. Assuming the trampoline size is 16 pages, there are 65534 possible locations of the system call instruction. If the untrusted application is capable of forking children, it may try different locations within each child. If a child crashes, the system call was unsuccessful and the untrusted application has to retry. Using this brute-force algorithm, the untrusted application keeps trying until a child does not crash. At this point, the untrusted application has a child process that bypassed Intravirt's security guarantees and may perform arbitrary system calls. It should be noted that only secc-rand is susceptible to this attack, since secc-eph removes the system call instruction completely when returning control to the untrusted application.
7.1.3 Syscall Arguments Abuse
Intravirt virtualizes a subset of all system calls. System calls that are not virtualized could be exploited to read secret memory unless Intravirt verifies that all pointers provided to a system call lie within untrusted memory. We perform an attack based on the rename system call and pass it a memory pointer from the trusted domain as an argument. Intravirt successfully prevents this attack by checking the pointer locations.
7.1.4 Race condition using shared memory
Shared memory may be used across multiple processes to bypass Intravirt's checks on system call arguments. In particular, we consider a pwritev-based attack in which a child process performs a pwritev system call using an IO vector in shared memory. If the parent is permitted access to the same shared memory, it can attempt, with careful timing, to alter the IO vector's values to point into trusted memory. The attack has to be timed such that the child's monitor has already performed the security checks but the system call has not yet read the affected IO vector. Intravirt prevents such attacks by copying pointers in system call arguments into the trusted memory region and only then performing the system call using the copied arguments.
7.1.5 TSX attack
TSX is an x86 extension supporting transactional memory. It follows a principle similar to exception handling, but at the hardware level: when anything considered a violation of the transaction occurs, the hardware rolls back all modifications and jumps to preset recovery code. Unfortunately, because the rollback feature provides a harmless way of probing memory contents, TSX has been used as a source of memory leakage ever since it was first introduced. It has been removed from the latest Intel CPUs but still exists in many products with MPK. Our attack utilizes TSX as a probe into the randomized trampoline. First, xbegin is used to start a TSX transaction. Then, we call an address within the trampoline region. There are three possibilities for the content at the target address: int3, syscall, and ret. In the first two cases the transaction aborts, whereas a ret instruction executes successfully. This difference is observable by the attacker, so an address containing ret is exposed. Because our sysret gadget is syscall; return;, this reveals the secret address of the syscall instruction. Fortunately, TSX can be disabled through the kernel or BIOS, and among all Intravirt configurations only secc-rand is secret-based and thus susceptible.
7.1.6 Race condition using multi threading
Supporting multi-threading is essential in modern computing environments, so Intravirt supports it as well. However, a few attack surfaces exploit race conditions in a multi-threaded environment. First, an indirect jump to syscall; return; is possible in ephemeral Intravirt: one thread invokes a system call that takes a very long time, and the attacker thread jumps to the still-active syscall; return;. To prevent such attacks, we use either syscall user dispatch or a per-thread Seccomp filter. Second, attackers could perform TOCTOU attacks on the syscall virtualization: one thread opens a normal file and invokes a file-backed syscall, while another thread closes the file descriptor and opens a sensitive file not allowed to the untrusted code. Intravirt therefore provides per-file-descriptor locks, so a close system call blocks while another thread is using that file descriptor. Intravirt likewise provides locks for memory-management and signal-related system calls.
7.2 Performance Evaluation
In this section we characterize the performance overhead of Intravirt. First, we explore microbenchmarks focusing on the cost of intercepting system calls and signals. Second, we demonstrate the performance of Intravirt for common applications. Third, we evaluate the cost of the least-privilege NGINX use case.
We perform all experiments on an Intel 11th-generation CPU (i7-1165G7) with 4 cores at 2.8 GHz (Turbo Boost and hyper-threading disabled) and 16 GB memory, running Ubuntu 20.04 with kernel version 5.9.8 with CET and syscall-user-dispatch support. For all experiments we average over 100 repetitions and analyze different Intravirt configurations. Intravirt relies on a Seccomp filter or syscall user dispatch (denoted by sec or dis) for system call interception, and a random, ephemeral, or CET trampoline (denoted as rnd, emp, cet). In this configuration space, (sec|dis)_(rnd|emp|cet), we evaluate 5 different configurations and do not evaluate the insecure dis_rnd configuration.
Throughout this section, we compare against MBOX [65] and strace, both ptrace-based system call monitors. MBOX fails for the experiments using common applications; in these cases we approximate the performance of MBOX using strace. In our microbenchmarks, strace outperforms MBOX by 2.7%, providing a conservative lower bound for MBOX.
Figure 7.1: System call latency of the LMBench benchmark for (a) open, (b) read, (c) write, (d) mmap, (e) install signal, and (f) catch signal, comparing native, ptrace, strace, secc_rand_1, secc_eph, disp_eph, secc_cet, and disp_cet.
7.2.1 Microbenchmarks
System call overhead
We evaluate Intravirt's overhead on system calls and signal delivery in comparison to native execution and the ptrace-based techniques. Figure 7.1 depicts the latency of LMBench v2.5 [81] for common system calls. Each Intravirt configuration and the ptrace-based techniques intercept system calls and provide a virtualized environment to LMBench while protecting its privileged state.
secc-eph and secc-rand_1 modify the trampoline on every system call, but secc-eph saves the cost of randomizing the trampoline location and hence incurs less overhead. secc-eph/disp-eph and secc-cet/disp-cet demonstrate the performance difference between using a Seccomp filter and syscall user dispatch to intercept system call invocations. Overall, disp-eph outperforms all other configurations, while secc-rand_1 is the slowest. Even though CET relies on hardware support, it does not outperform the other configurations. Intravirt adds 0.5-2 usec per system call for disp-eph for policy enforcement and domain switches. In comparison, the ptrace-based technique incurs about 20 usec per invocation, which is 4.7-26.8 times slower than disp-eph. We observe high overheads for Intravirt protecting fast system calls like reading or writing 1 byte (126%-900%), whereas long-lasting system calls like open or mmap observe only 29%-150% overhead.
We demonstrate this difference with a file IO throughput experiment. Figure 7.2 shows high overheads for reading with small buffer sizes, which amortize with larger buffer sizes. Since the overhead induced by Intravirt is incurred per system call, reading a file with a larger buffer size has much less overhead than with a smaller buffer size. Even though we observe high overheads for some system calls, applications use them infrequently and
Figure 7.2: Normalized throughput of reading a 40 MB file for read sizes from 1 KB to 512 KB (secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, ptrace).
Figure 7.3: Latency of getppid for different rerandomization frequencies (native, secc_eph, secc_rand_1 through secc_rand_1024).
observe far less overhead as shown for common applications in § 7.2.2.
Randomization and performance tradeoff
The secc-rand configuration rerandomizes the trampoline for each system call, generating a random number using RDRAND (approx. 460 cycles). We explore alternative rerandomization frequencies to amortize the cost of randomization over several system calls. This trades performance against security, since the system call address is easier to guess if rerandomization happens less frequently. The goal is to find a reasonably secure, but fast, rerandomization frequency.

Figure 7.4: Random read bandwidth for different numbers of threads (1-32), measured with sysbench (native, secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, strace).
Figure 7.3 evaluates the getppid system call for different randomization frequencies. getppid is the fastest system call and hence shows the highest Intravirt overhead. The overhead of secc-rand amortizes with less frequent randomization and does not improve much beyond 16 system calls per randomization. secc-rand at 4 system calls per randomization shows performance similar to secc-eph's, which we also observed for the other LMBench microbenchmarks.
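The amortization can be made explicit with a rough cost model. This is our own sketch, not taken from the thesis; only the RDRAND estimate of roughly 460 cycles comes from the text.

```latex
% c_base: fixed per-syscall interception cost; c_r \approx 460 cycles (RDRAND).
% Rerandomizing only once every k system calls spreads the RDRAND cost:
c_{\mathrm{syscall}}(k) = c_{\mathrm{base}} + \frac{c_r}{k}
% At k = 16 the added randomization cost is 460/16 \approx 29 cycles per
% call, so larger k yields little further improvement, consistent with
% the plateau visible in Figure 7.3.
```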
Thread scalability
To prevent race conditions and TOCTOU attacks, locks protect Intravirt's policy enforcement as addressed in § 5.5.6. We demonstrate the scalability of Intravirt in Figure 7.4 using the sysbench [82] tool, which concurrently reads a 1 GB file from a varying number of threads. Due to the additional locks in Intravirt, the number of futex system calls increases with the number of threads.
At 4 threads, all CPU cores are busy and we observe the best performance. The overhead of each configuration is similar to the microbenchmarks. secc-cet and disp-cet suffer a performance decrease of up to 60% because the syscall performance of the CET-based configurations is the lowest. Compared to strace, Intravirt is 4.3-8.2 times faster.

Figure 7.5: Normalized overhead of different Linux applications (curl, lighttpd, NGINX, sqlite3, zip) for secc-rand_16, secc-eph, disp-eph, secc-cet, disp-cet, and strace.
7.2.2 Macrobenchmarks
Along with the microbenchmarks, we analyze the performance of common applications protected by Intravirt, namely lighttpd [83], NGINX [52], curl [84], the SQLite database [85], and zip [86]. Figure 7.5 shows the overall overhead of each application compared to native execution.
curl [84]
downloads a 1 GB file from a local web server. It is a particularly challenging workload for Intravirt, since curl makes a system call for every 8 KB and frequently installs signal handlers. In total, it makes more than 130,000 write system calls and more than 30,000 rt_sigaction system calls to download a 1 GB file. However, libcurl supports an option not to use signals, which reduces the overhead by about 10% on average for Intravirt, while strace becomes about 140% worse.
Lighttpd [83] and NGINX [52]
serve a 64 KB file requested 1,000 times by an apachebench [87] client on the same machine. All configurations perform within 94% of native. disp-eph outperforms all other configurations and highlights Intravirt's ability to protect applications at near-zero cost, with a throughput degradation of 1%. In contrast, strace has about 30% overhead.
SQLite [85]
runs its speedtest benchmark [85], performing read and write system calls with very small buffer sizes to serve individual SQL requests. Contrary to the microbenchmarks, the difference between configurations is larger: configurations using syscall user dispatch (disp-eph and disp-cet) observe about 30% less overhead than their Seccomp alternatives (secc-eph and secc-cet). strace performs poorly, at more than 500% overhead.
zip [86]
compresses the full Linux kernel 5.9.8 source tree, a massive task which opens every file in the source tree, reads its contents, compresses it, and archives it into a zip file. The observed performance degradation is in line with the microbenchmarks for the openat, read, and write system calls.
Summary:
Network-based applications like lighttpd and NGINX perform close to native, whereas file-based applications observe overheads between 4% and 55% depending on the test scenario. Most impacted are applications that access small files, like SQLite. Compared to ptrace-based techniques, Intravirt outperforms by 38-529%.
7.3 Performance Evaluation of the Use Cases
7.3.1 zlib
As discussed in § 6.1.1, the value of isolating zlib [77] is for a reference implementation that we could easily compare to other techniques. We use a whole-library-separation approach
to isolate zlib library and measure the time to perform zlib API by creating a simple test
application. The test application gets a text le written in English as an input, reads 4KB,
compresses it, uncompress it, and compares it to the original data. It measures the time to
repeat the compression, uncompression, and memory comparison 10, 000 times. There are
six zlib API calls for each test iteration, so there will be 12 xcalls in total. The system call
point of view, it calls about 40, 000 brk, and almost no other system calls at all.
Figure 7.6 shows the overhead of the zlib test application for each Intravirt configuration, normalized to the native implementation, allowing us to compare isolated and non-isolated zlib. First of all, secc-rand_16, secc-eph, and disp-eph show about 20% overhead due to the system call virtualization in Intravirt, and 2-3% overhead due to the xcalls. Using this simple use case, we can easily estimate the overhead of xcall.
Figure 7.6: Normalized overhead of isolated zlib.
For each xcall, the process switches to the trusted domain, acquires the information required for the domain switch, such as the stack pointer and function pointer, and then switches to the target domain; it performs the same steps in reverse on return. This procedure consists of dozens of memory accesses and two WRPKRU instructions. The overhead of a single xcall is 116 cycles with non-CET-based Nexpoline and 269 cycles with CET-based Nexpoline.
However, the overhead of secc-cet and disp-cet is abnormally large, at 2-3 times the native implementation. Since our code contains nothing CET-specific that would explain this, we analyzed the issue by testing a CET-enabled zlib in the native environment. Table 7.2 shows the result of the same test application without Intravirt but with a CET-enabled zlib library. The same test application takes more than twice as long with the CET-enabled zlib library as with the CET-disabled one on the same kernel. This result merits further investigation, but this thesis does not focus on CET itself, so we leave the issue out of scope.

Table 7.2: Performance overhead of the zlib test due to CET. No Intravirt involved.

Setup:        Without CET    With CET
Time (sec.):  1.343          3.029
7.3.2 Safeboxing OpenSSL and Sandboxing Parser in NGINX
§ 6.1.2 and § 6.2.1 describe NGINX using Intravirt to safebox the OpenSSL library and sandbox the parser module. Based on this privilege separation, we perform a throughput experiment downloading differently sized files, as shown in figure 7.7. The measurement uses TLS v1.2 with a self-signed private root CA certificate and a server certificate signed by that root CA, with the cipher suite ECDHE-RSA-AES128-GCM-SHA256,2048,128.
Figure 7.7: Normalized throughput of privilege-separated NGINX using TLS v1.2 with ECDHE-RSA-AES128-GCM-SHA256, 2048, 128.
The performance of the ptrace-based system is also shown as a reference, even though it does not provide safeboxing or sandboxing.
For most cases, Intravirt with safebox and sandbox performs within 10% of native, about 3-4% below Intravirt without safebox and sandbox (see figure 7.5). The normalized throughput decreases for bigger file sizes because NGINX does not read the whole file in one system call; it issues read system calls in units of a predefined buffer size, so the total number of system calls grows with the file size.
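The effect is easy to see with a sketch that reads a file the way a buffered server does, in fixed-size chunks; the 16 KB buffer size here is illustrative, not NGINX's configured value:

```python
import os

def count_reads(path: str, bufsize: int = 16 * 1024) -> int:
    """Read a file in bufsize chunks and count the read() system calls,
    including the final empty read that signals EOF."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            calls += 1
            if not os.read(fd, bufsize):  # empty result means EOF
                break
    finally:
        os.close(fd)
    return calls
```

A 64 KB file thus costs five reads (four full buffers plus the EOF read) where a 1 KB file costs two, so the syscall count, and with it Intravirt's virtualization overhead, scales with file size.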
Since Intravirt's overhead is directly determined by the number of xcalls and the switching time, we need to examine the number of xcalls to understand figure 7.7. Table 7.3 shows the number of xcalls during the measurement for each file size. During startup, NGINX performs 89 xcalls in total to load configuration files and initialize OpenSSL with the private key. Each new connection results in a TLS handshake using 16 xcalls, plus 6 xcalls for initializing the session. Every 16 KB of request message requires 3 additional xcalls. For every HTTP request, the parser module is called five times, resulting in 5 more xcalls. After the request is received, the server sends the target binary file as the response, which requires 7 xcalls for initialization and 3 xcalls for each 16 KB of the file. Summing up all the required xcalls gives the totals shown in table 7.3.
Table 7.3: xcall count for different file sizes in the test scenarios, including startup of the process.

File size:  1k    4k    16k   64k   256k   1024k
Count:      129   129   132   141   177    312
7.3.3 File and Directory Protection
We discussed this application in § 6.3: it is an extension of the system call virtualization policy that provides additional protection for files and directories. Therefore, we first need to understand the system's overhead, and then determine how that overhead affects actual applications. We perform a microbenchmark to measure the performance overhead of the affected system calls, and we also measure a few actual applications which use those system calls.
Microbenchmark
We use LMBench [81] again for the microbenchmark. This use case only extends the file-related system calls, so we only measure four typical file-based system calls: open, read, write, and mmap.
Figure 7.8 compares the normalized latency of LMBench between Intravirt alone, Intravirt with file protection, and Intravirt with both file and directory protection, for each Intravirt configuration. As shown in the figure, read, write, and mmap show no significant overhead across the different system call virtualization policies. However, open does incur significant overhead in the directory protection case. This is because file protection performs an additional fgetxattr system call to fetch the file's label, and directory protection additionally calls readlink on /proc/self/fd/[FD] and performs a string comparison. In our test environment, file protection takes 0.7 μsec and directory protection 3.3 μsec. However, once the file is open, Intravirt caches the permission of the file descriptor, so there is no further overhead. Also, the string comparison cost grows with the number of protected directories; in this test, we have one protected directory.
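The two checks can be sketched as follows. This is a userspace illustration on Linux, not Intravirt's monitor code: the xattr name and the policy list are hypothetical, and Intravirt performs these checks inside the trusted domain before permitting the open:

```python
import os

PROTECTED_DIRS = ["/etc/ssl/private"]   # illustrative policy, not Intravirt's

def file_label(fd: int):
    """File protection: one fgetxattr-style call fetching the file's label.
    Returns the label bytes, or None if no label is set."""
    try:
        return os.getxattr(f"/proc/self/fd/{fd}", "user.intravirt.label")
    except OSError:
        return None

def in_protected_dir(fd: int) -> bool:
    """Directory protection: resolve the fd's path via /proc/self/fd,
    then string-compare it against each protected directory prefix."""
    path = os.readlink(f"/proc/self/fd/{fd}")
    return any(path.startswith(d + "/") for d in PROTECTED_DIRS)
```

The readlink plus per-directory string comparison is what makes the directory check the more expensive of the two, and why its cost grows with the number of protected directories.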
Figure 7.8: System call latency of the LMBench benchmark with different protection methodologies: (a) open, (b) read, (c) write, (d) mmap.

Like open, all system calls which take a path as an input parameter will have a similar overhead because they perform the same checks, so the overhead of those system calls will also be similar. However, we estimate that such system calls occur much less frequently than open, read, and write, so the overall overhead in an actual application environment should be small.
NGINX
We use NGINX again for the performance evaluation. NGINX is one of the best applications in which to protect secret keys and private keys, and it is also one of the best-known event-driven, single-process web servers, which makes it a fitting application for Intravirt.

Figure 7.9: Normalized throughput of NGINX downloading a 64KB file for different private key protection methodologies.
Figure 7.9 shows the normalized throughput of different Intravirt configurations to illustrate the overhead of file protection and directory protection. We measured the bandwidth of downloading 64KB files from the local NGINX web server running on Intravirt. As shown in the figure, the overhead does not differ significantly between features, and thus the overhead is independent of the protection policy. Therefore, it is safe to say that file protection and directory protection do not contribute to the overhead.
Analyzing the measurement more systematically requires system call execution statistics. The test downloads a 64KB file from a local server running on Intravirt and measures the throughput over 1,000 repetitions. Within this test, 5,000 read, 9,000 write, 2,000 close, 2,000 pread64, and 1,000 openat calls are executed. As we discussed earlier, there is almost no overhead other than for open, thanks to the permission cache, so there is not much overhead in total. Also, the overhead of opening files is still small compared to the rest of the computation, making the overall overhead due to the protection negligible.

Figure 7.10: Normalized latency of zip for different file protection methodologies.
In summary, Intravirt can protect a web server's sensitive information both in memory, such as session keys and private keys, and in storage, such as private key files, within one single process at less than 10% overhead.
Zip
We pick another application to show the overhead of file and directory protection effectively: the zip test scenario from § 7.2.2. Since it opens and reads every file in the Linux kernel source tree, it is a fitting test scenario for this feature.
Figure 7.10 shows the normalized latency of compressing the whole Linux 5.9.8 source tree under different protection policies. As shown in the figure, the file protection policy adds 1-2% overhead, and the directory protection policy adds another 1-2%. This 2-3% is the overall overhead of the use case, which remains relatively small given the massive number of file operations.
We also analyze the system call frequency in this test case. There are 193K reads, 309K writes, and 79K openats and closes, but the total overhead Intravirt adds is small relative to the compression work itself. Therefore, we conclude that Intravirt remains valuable in this file-operation-heavy environment.
Chapter 8
Conclusion and Future Work
This dissertation identifies drawbacks of existing privilege separation techniques. Existing techniques can be categorized as 1) separating processes and communicating via IPC, 2) sandboxing data and code within a process so that they cannot dereference each other, or 3) controlling memory visibility within a process using various software and hardware technologies. However, each category has problems: process separation suffers from performance issues, and sandboxing has trouble with interaction between boxes, which is why subprocess isolation has recently been in the spotlight. Unfortunately, existing subprocess isolation techniques share an issue of their own: they do not consider the underlying operating system as a threat. Commodity operating systems like Linux take the process as the unit of separation, and the OS interfaces share resources within the process, so an attacker can easily penetrate between separations by using those interfaces.
This dissertation proposes a new subprocess isolation model, Endokernel. Endokernel introduces a virtualized endoprocess model in which each endoprocess runs like a virtualized environment within a process, and provides xcall for endoprocesses to interact with each other safely. We also develop a prototype of the Endokernel, Intravirt, and verify and evaluate the value and efficiency of the Endokernel model. Intravirt is a userspace solution: we need to modify neither the operating system kernel nor the runtime environment, and applications run on Intravirt without modification.
Intravirt has several advantages as a new subprocess isolation technique. It is certainly very secure, but most of all it has very low overhead due to its subprocess characteristics. The low overhead is a significant advantage above all because it increases applicability. Since Intravirt is a userspace solution, it is straightforward to apply to any commodity operating system, which significantly increases compatibility and applicability. In addition, applications need no significant modification, which lowers the barrier to adoption. Intravirt provides endoprocess virtualization by virtualizing the system calls and the signals, which significantly reduces attacks through the underlying operating system interfaces and provides an endoprocess virtual machine. Unlike many existing techniques, Intravirt takes concurrency in endoprocess virtualization seriously. Lastly, Intravirt adopts a brand new security feature, Intel CET, a hardware-accelerated control-flow integrity technology, and is among the pioneers in using it.
Intravirt helps applications achieve performance and least privilege at the same time. For example, the NGINX web server is designed around an event-driven, single-process model, for which Intravirt can provide many features at once. It can separate the memory regions for the session keys and the private keys, isolate access to sensitive OS objects such as private key files and user data files, and minimize the overhead at the same time by utilizing endoprocess virtualization. Also, by using xcall, it provides safe and fast communication between endoprocesses. Lastly, some applications can enforce optimized and fine-grained endoprocess access control policies using system call and signal virtualization; for example, an endoprocess firewall would also be possible.
This dissertation presents Endokernel as a new model of privilege separation, and Intravirt, which evaluates the model and demonstrates its security and low performance overhead. Still, the work is not complete, and several aspects require more effort. First of all, Intravirt uses Intel MPK as its separation mechanism, which efficiently and securely isolates memory pages within a process, but the total number of keys is only 16. Intravirt takes three of them for the monitor and the application endoprocess, so at most 13 domains are possible, which is a significant limitation on applicability. There are several techniques to overcome this limit, like libmpk [25], but the performance overhead increases dramatically. Therefore, we will need to overcome the limit by utilizing other hardware technologies or finding a new approach to isolation.
There is no doubt that CET is one of the most crucial components of this dissertation. However, we do not yet see CET as a complete technology. CET is a hardware-accelerated control-flow integrity technology, but our evaluation shows that it is not faster than software-based control-flow integrity, and in some cases it is much slower. Unfortunately, this dissertation does not focus on CET as a research topic, and we did not perform a more profound analysis. In the future, we will need to understand more about CET and its implementation, which could lead to significant improvements in the performance and security of Intravirt, making it one of the most powerful security solutions.
Intravirt is an excellent prototype of Endokernel, but we did not focus on performance optimization. We were careful about performance, but there remain several opportunities to optimize it further while keeping the same functionality. An optimized design and implementation of Endokernel would increase the value and extensibility of this research.
Lastly, Endokernel proposes a robust security system that preserves performance, but it lacks one critical aspect: it does not sufficiently consider the endoprocess life cycle. That is, a policy for creating and destroying an endoprocess is missing, so there could be attacks that create a process or endoprocess to bypass the separation. We need to design such policies to be compatible and applicable to existing applications, again without significant modification.
References
[1] R. J. Connor, T. McDaniel, J. M. Smith, and M. Schuchard, “PKU pitfalls: Attacks on
pku-based memory isolation systems,” in 29th USENIX Security Symposium (USENIX
Security 20), pp. 1409–1426, USENIX Association, Aug. 2020.
[2] “CVE-2014-0160.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160.
(Accessed on 07/05/2021).
[3] “OpenSSL, Cryptography and SSL/TLS Toolkit.” https://openssl.org. (Accessed on
07/04/2021).
[4] “CVE-2009-2629.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-2629.
(Accessed on 06/28/2021).
[5] “CVE-2013-2028.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2028.
(Accessed on 06/28/2021).
[6] “CVE-2013-2070.”
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2070. (Accessed on
06/28/2021).
[7] “CVE-2021-3156.” https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3156.
(Accessed on 06/08/2021).
[8] “sudo Main Page.” https://sudo.ws. (Accessed on 07/04/2021).
[9] “Chromium Multi-process Architecture.” https:
//www.chromium.org/developers/design-documents/multi-process-architecture.
(Accessed on 07/04/2021).
[10] W. Venema, “Postfix: Past, present, and future,” in Invited Talk at the 24th Large
Installation System Administration Conference, LISA, vol. 146, 2010.
[11] “H2O, the optimized HTTP/1.x, HTTP/2 server.” https://h2o.examp1e.net/.
[12] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham, “Efficient software-based fault
isolation,” in Proceedings of the Fourteenth ACM Symposium on Operating Systems
Principles, SOSP ’93, (New York, NY, USA), p. 203–216, Association for Computing
Machinery, 1993.
[13] G. C. Necula, S. McPeak, and W. Weimer, “CCured: Type-safe retrofitting of legacy
code,” in Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’02, (New York, NY, USA), p. 128–139, Association for
Computing Machinery, 2002.
[14] G. Tan, A. W. Appel, S. Chakradhar, A. Raghunathan, S. Ravi, and D. Wang, “Safe
java native interface,” in Proceedings of IEEE International Symposium on Secure
Software Engineering, vol. 97, p. 106, Citeseer, 2006.
[15] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka, N. Narula,
and N. Fullagar, “Native client: A sandbox for portable, untrusted x86 native code,” in
2009 30th IEEE Symposium on Security and Privacy, pp. 79–93, 2009.
[16] J. Huang, O. Schranz, S. Bugiel, and M. Backes, “The art of app compartmentalization:
Compiler-based library privilege separation on stock android,” in Proceedings of the
2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17,
(New York, NY, USA), p. 1037–1049, Association for Computing Machinery, 2017.
[17] M. Sun and G. Tan, “Nativeguard: Protecting android applications from third-party
native libraries,” in Proceedings of the 2014 ACM Conference on Security and Privacy in
Wireless Mobile Networks, WiSec ’14, (New York, NY, USA), p. 165–176, Association
for Computing Machinery, 2014.
[18] J. Litton, A. Vahldiek-Oberwagner, E. Elnikety, D. Garg, B. Bhattacharjee, and
P. Druschel, “Light-weight contexts: An OS abstraction for safety and performance,”
in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI
16), (Savannah, GA), pp. 49–64, USENIX Association, Nov. 2016.
[19] T. C.-H. Hsu, K. Hoffman, P. Eugster, and M. Payer, “Enforcing least privilege
memory views for multithreaded applications,” in Proceedings of the 2016 ACM
SIGSAC Conference on Computer and Communications Security, CCS ’16, (New York,
NY, USA), p. 393–405, Association for Computing Machinery, 2016.
[20] Y. Chen, S. Reymondjohnson, Z. Sun, and L. Lu, “Shreds: Fine-grained execution
units with private memory,” in 2016 IEEE Symposium on Security and Privacy (SP),
pp. 56–71, 2016.
[21] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis, “Dune:
Safe user-level access to privileged CPU features,” in 10th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 12), (Hollywood, CA),
pp. 335–348, USENIX Association, Oct. 2012.
[22] M. Hedayati, S. Gravani, E. Johnson, J. Criswell, M. L. Scott, K. Shen, and M. Marty,
“Hodor: Intra-process isolation for high-throughput data plane libraries,” in 2019
USENIX Annual Technical Conference (USENIX ATC 19), (Renton, WA), pp. 489–504,
USENIX Association, July 2019.
[23] D. Schrammel, S. Weiser, S. Steinegger, M. Schwarzl, M. Schwarz, S. Mangard, and
D. Gruss, “Donky: Domain keys – efficient in-process isolation for risc-v and x86,” in
29th USENIX Security Symposium (USENIX Security 20), pp. 1677–1694, USENIX
Association, Aug. 2020.
[24] A. Vahldiek-Oberwagner, E. Elnikety, N. O. Duarte, M. Sammler, P. Druschel, and
D. Garg, “ERIM: Secure, efficient in-process isolation with protection keys (MPK),” in
28th USENIX Security Symposium (USENIX Security 19), (Santa Clara, CA),
pp. 1221–1238, USENIX Association, Aug. 2019.
[25] S. Park, S. Lee, W. Xu, H. Moon, and T. Kim, “libmpk: Software abstraction for intel
memory protection keys (intel MPK),” in 2019 USENIX Annual Technical Conference
(USENIX ATC 19), (Renton, WA), pp. 241–254, USENIX Association, July 2019.
[26] D. Chisnall, C. Rothwell, R. N. Watson, J. Woodruff, M. Vadera, S. W. Moore, M. Roe,
B. Davis, and P. G. Neumann, “Beyond the pdp-11: Architectural support for a
memory-safe c abstract machine,” in Proceedings of the Twentieth International
Conference on Architectural Support for Programming Languages and Operating
Systems, ASPLOS ’15, (New York, NY, USA), p. 117–130, Association for Computing
Machinery, 2015.
[27] R. N. Watson, J. Woodruff, P. G. Neumann, S. W. Moore, J. Anderson, D. Chisnall,
N. Dave, B. Davis, K. Gudka, B. Laurie, et al., “Cheri: A hybrid capability-system
architecture for scalable software compartmentalization,” in 2015 IEEE Symposium on
Security and Privacy, pp. 20–37, IEEE, 2015.
[28] B. Davis, R. N. M. Watson, A. Richardson, P. G. Neumann, S. W. Moore, J. Baldwin,
D. Chisnall, J. Clarke, N. W. Filardo, K. Gudka, A. Joannou, B. Laurie, A. T. Markettos,
J. E. Maste, A. Mazzinghi, E. T. Napierala, R. M. Norton, M. Roe, P. Sewell, S. Son, and
J. Woodruff, “CheriABI: Enforcing valid pointer provenance and minimizing pointer
privilege in the posix c run-time environment,” in Proceedings of the Twenty-Fourth
International Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS ’19, (New York, NY, USA), p. 379–393, Association for
Computing Machinery, 2019.
[29] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flow integrity,” in
Proceedings of the 12th ACM Conference on Computer and Communications Security,
CCS ’05, (New York, NY, USA), p. 340–353, Association for Computing Machinery,
2005.
[30] V. Kuznetsov, L. Szekeres, M. Payer, G. Candea, R. Sekar, and D. Song, “Code-pointer
integrity,” in 11th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 14), (Broomfield, CO), pp. 147–163, USENIX Association, Oct.
2014.
[31] S. Narayan, C. Disselkoen, T. Garfinkel, N. Froyd, E. Rahm, S. Lerner, H. Shacham,
and D. Stefan, “Retrofitting fine grain isolation in the firefox renderer,” in 29th
USENIX Security Symposium (USENIX Security 20), pp. 699–716, USENIX Association,
Aug. 2020.
[32] Mozilla, “Firefox - Protect your life online with privacy-first products.”
https://www.mozilla.org/en-US/firefox/. (Accessed on 08/07/2021).
[33] ARM, “Domain Access Control Register.” https://developer.arm.com/documentation/
ddi0434/b/System-Control/Register-descriptions/Domain-Access-Control-Register.
[34] Intel Cooperation, “Intel(R) 64 and IA-32 Architectures Software Developer’s
Manual.” https://software.intel.com/en-us/articles/intel-sdm, 2016.
[35] H. Lefeuvre, V.-A. Bădoiu, P. Olivier, T. Mosnoi, R. Deaconescu, F. Huici, and
C. Raiciu, “Flexos: Making os isolation flexible,” in HotOS’21: Workshop on Hot Topics
in Operating Systems, 2021.
[36] M. Sung, P. Olivier, S. Lankes, and B. Ravindran, “Intra-unikernel isolation with intel
memory protection keys,” in Proceedings of the 16th ACM SIGPLAN/SIGOPS
International Conference on Virtual Execution Environments, VEE ’20, (New York, NY,
USA), p. 143–156, Association for Computing Machinery, 2020.
[37] J. Woodruff, R. N. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie,
P. G. Neumann, R. Norton, and M. Roe, “The cheri capability model: Revisiting risc in
an age of risk,” in Proceeding of the 41st Annual International Symposium on Computer
Architecuture, ISCA ’14, p. 457–468, IEEE Press, 2014.
[38] B. Davis, R. N. M. Watson, A. Richardson, P. G. Neumann, S. W. Moore, J. Baldwin,
D. Chisnall, J. Clarke, N. W. Filardo, K. Gudka, A. Joannou, B. Laurie, A. T. Markettos,
J. E. Maste, A. Mazzinghi, E. T. Napierala, R. M. Norton, M. Roe, P. Sewell, S. Son, and
J. Woodruff, “CheriABI: Enforcing valid pointer provenance and minimizing pointer
privilege in the posix c run-time environment,” in Proceedings of the Twenty-Fourth
International Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS ’19, (New York, NY, USA), p. 379–393, Association for
Computing Machinery, 2019.
[39] H. Xia, J. Woodruff, H. Barral, L. Esswood, A. Joannou, R. Kovacsics, D. Chisnall,
M. Roe, B. Davis, E. Napierala, J. Baldwin, K. Gudka, P. G. Neumann, A. Richardson,
S. W. Moore, and R. N. M. Watson, “CheriRTOS: A capability model for embedded
devices,” in 2018 IEEE 36th International Conference on Computer Design (ICCD),
pp. 92–99, 2018.
[40] Y. Ren, G. Liu, V. Nitu, W. Shao, R. Kennedy, G. Parmer, T. Wood, and A. Tchana,
“Fine-grained isolation for scalable, dynamic, multi-tenant edge clouds,” in 2020
USENIX Annual Technical Conference (USENIX ATC 20), pp. 927–942, USENIX
Association, July 2020.
[41] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman, “Linux security
modules: General security support for the linux kernel,” in 11th USENIX Security
Symposium (USENIX Security 02), (San Francisco, CA), USENIX Association, Aug.
2002.
[42] P. Loscocco and S. Smalley, “Integrating flexible support for security policies into the
linux operating system,” in 2001 USENIX Annual Technical Conference (USENIX ATC
01), (Boston, MA), USENIX Association, June 2001.
[43] T. Harada, T. Horie, and K. Tanaka, “Task oriented management obviates your onus
on linux,” in Linux Conference, vol. 3, p. 23, 2004.
[44] M. Bauer, “Paranoid penguin: An introduction to novell apparmor,” Linux Journal,
vol. 2006, p. 13, Aug. 2006.
[45] C. Schaufler, “Smack in embedded computing,” in Proc. Ottawa Linux Symposium,
p. 23, 2008.
[46] “YAMA - The Linux Kernel documentation.”
https://kernel.org/doc/html/v4.14/admin-guide/LSM/Yama.html. (Accessed on
07/04/2021).
[47] S. E. Hallyn and A. G. Morgan, “Linux capabilities: Making them work,” 2008.
[48] “SECure COMPuting with filters - The Linux Kernel documentation.”
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt.
[49] M. Fleming, “A thorough introduction to eBPF,” LWN.net.
[50] I. Goldberg, D. Wagner, R. Thomas, and E. A. Brewer, “A secure environment for
untrusted helper applications confining the wily hacker,” in Proceedings of the 6th
Conference on USENIX Security Symposium, Focusing on Applications of Cryptography
- Volume 6, SSYM’96, (USA), p. 1, USENIX Association, 1996.
[51] N. DeMarinis, K. Williams-King, D. Jin, R. Fonseca, and V. P. Kemerlis, “sysfilter:
Automated system call filtering for commodity software,” in 23rd International
Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), (San
Sebastian), pp. 459–474, USENIX Association, Oct. 2020.
[52] “NGINX v1.24.0.” https://nginx.org/. (Accessed on 07/04/2021).
[53] S. Ghavamnia, T. Palit, S. Mishra, and M. Polychronakis, “Temporal system call
specialization for attack surface reduction,” in 29th USENIX Security Symposium
(USENIX Security 20), pp. 1749–1766, USENIX Association, Aug. 2020.
[54] H. Vijayakumar, X. Ge, M. Payer, and T. Jaeger, “JIGSAW: Protecting resource access
by inferring programmer expectations,” in 23rd USENIX Security Symposium (USENIX
Security 14), (San Diego, CA), pp. 973–988, USENIX Association, Aug. 2014.
[55] “ptrace.” https://man7.org/linux/man-pages/man2/ptrace.2.html. (Accessed on
07/04/2021).
[56] “strace.” https://man7.org/linux/man-pages/man1/strace.1.html. (Accessed on
07/04/2021).
[57] K. Jain and R. Sekar, “User-level infrastructure for system call interposition: A
platform for intrusion detection and confinement,” in Proc. Network and
Distributed Systems Security Symposium, 1999.
[58] M. Zheng, M. Sun, and J. C. Lui, “Droidtrace: A ptrace based android dynamic
analysis system with forward execution capability,” in 2014 international wireless
communications and mobile computing conference (IWCMC), pp. 128–133, IEEE, 2014.
[59] T. Garfinkel, B. Pfaff, and M. Rosenblum, “Ostia: A delegating architecture for secure
system call interposition,” in In Proc. Network and Distributed Systems Security
Symposium, 2003.
[60] D. R. Engler, M. F. Kaashoek, and J. O’Toole, “Exokernel: An operating system
architecture for application-level resource management,” in Proceedings of the
Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, (New York, NY,
USA), p. 251–266, Association for Computing Machinery, 1995.
[61] WebAssembly Community, “Security - WebAssembly.”
[62] S. Narayan, C. Disselkoen, T. Garfinkel, N. Froyd, E. Rahm, S. Lerner, H. Shacham,
and D. Stefan, “Retrofitting fine grain isolation in the firefox renderer,” in 29th
USENIX Security Symposium (USENIX Security 20), pp. 699–716, USENIX Association,
Aug. 2020.
[63] Z. Durumeric, F. Li, J. Kasten, J. Amann, J. Beekman, M. Payer, N. Weaver, D. Adrian,
V. Paxson, M. Bailey, and J. A. Halderman, “The matter of heartbleed,” in Proceedings
of the 2014 Conference on Internet Measurement Conference, IMC ’14, (New York, NY,
USA), p. 475–488, Association for Computing Machinery, 2014.
[64] Z. Tarkhani and A. Madhavapeddy, “Sirius: Enabling system-wide isolation for
trusted execution environments,” CoRR, vol. abs/2009.01869, 2020.
[65] T. Kim and N. Zeldovich, “Practical and effective sandboxing for non-root users,” in
2013 USENIX Annual Technical Conference (USENIX ATC 13), (San Jose, CA),
pp. 139–144, USENIX Association, June 2013.
[66] R. M. Needham, “Protection systems and protection implementations,” in Proceedings
of the December 5-7, 1972, fall joint computer conference, part I, AFIPS ’72, (New York,
NY, USA), pp. 571–578, 1972.
[67] J. M. Rushby, “Design and verification of secure systems,” in Proceedings of the Eighth
ACM Symposium on Operating Systems Principles, SOSP ’81, (New York, NY, USA),
p. 12–21, Association for Computing Machinery, 1981.
[68] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve, “Nested kernel: An
operating system architecture for intra-kernel privilege separation,” in Proceedings of
the Twentieth International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’15, (New York, NY, USA), p. 191–206,
Association for Computing Machinery, 2015.
[69] B. W. Lampson, “Protection,” SIGOPS Oper. Syst. Rev., vol. 8, p. 18–24, Jan. 1974.
[70] E. Witchel, J. Rhee, and K. Asanović, “Mondrix: Memory isolation for linux using
mondriaan memory protection,” in Proceedings of the Twentieth ACM Symposium on
Operating Systems Principles, SOSP ’05, (New York, NY, USA), p. 31–44, Association
for Computing Machinery, 2005.
[71] A. Ghosn, M. Kogias, M. Payer, J. R. Larus, and E. Bugnion, “Enclosure:
Language-based restriction of untrusted libraries,” in Proceedings of the 26th ACM
International Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS 2021, (New York, NY, USA), p. 255–267, Association for
Computing Machinery, 2021.
[72] “Syscall User Dispatch.”
https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html.
(Accessed on 07/13/2021).
[73] V. Shanbhogue, D. Gupta, and R. Sahita, “Security analysis of processor instruction
set architecture for enforcing control-flow integrity,” in Proceedings of the 8th
International Workshop on Hardware and Architectural Support for Security and
Privacy, HASP ’19, (New York, NY, USA), Association for Computing Machinery,
2019.
[74] C.-C. Tsai, “Passthru-libos.” https://github.com/chiache/passthru-libos. (Accessed on
07/04/2021).
[75] C.-C. Tsai, K. S. Arora, N. Bandi, B. Jain, W. Jannen, J. John, H. A. Kalodner,
V. Kulkarni, D. Oliveira, and D. E. Porter, “Cooperation and security isolation of
library OSes for multi-process applications,” in Proceedings of the Ninth European
Conference on Computer Systems, EuroSys ’14, (New York, NY, USA), Association for
Computing Machinery, 2014.
[76] C.-C. Tsai, D. E. Porter, and M. Vij, “Graphene-SGX: A practical library OS for
unmodified applications on SGX,” in 2017 USENIX Annual Technical Conference
(USENIX ATC 17), (Santa Clara, CA), pp. 645–658, USENIX Association, July 2017.
[77] “zlib — a massively spiffy yet delicately unobtrusive compression library.”
https://zlib.net/. (Accessed on 07/04/2021).
[78] “xattr(7) — Linux manual page.”
https://man7.org/linux/man-pages/man7/xattr.7.html. (Accessed on 06/29/2021).
[79] “setfattr(1) — Linux manual page.”
https://man7.org/linux/man-pages/man1/setfattr.1.html. (Accessed on 06/29/2021).
[80] “chroot(2) — Linux manual page.”
https://man7.org/linux/man-pages/man2/chroot.2.html. (Accessed on 07/04/2021).
[81] L. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in
USENIX 1996 Annual Technical Conference (USENIX ATC 96), (San Diego, CA),
USENIX Association, Jan. 1996.
[82] A. Kopytov et al., “Scriptable database and system performance benchmark.”
https://github.com/akopytov/sysbench. (Accessed on 06/08/2021).
[83] “Lighttpd v1.4.59.” https://www.lighttpd.net/. (Accessed on 07/04/2021).
[84] “CURL: Command line tool and library for transferring data with URLs v7.77.0.”
https://curl.haxx.se/. (Accessed on 07/04/2021).
[85] “SQLite Database Engine v.3.36.0.” https://www.sqlite.org/index.html. (Accessed on
07/04/2021).
[86] “Info-zip’s zip.” http://infozip.sourceforge.net/Zip.html. (Accessed on 06/08/2021).
[87] “Ab - Apache HTTP server benchmarking tool v2.4.”
https://httpd.apache.org/docs/2.4/en/programs/ab.html. (Accessed on 07/04/2021).