<<

Harmonizing Performance and Isolation in with Efficient Intra-kernel Isolation and Communication Jinyu Gu, Xinyue Wu, Wentai Li, Nian Liu, Zeyu Mi, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University https://www.usenix.org/conference/atc20/presentation/gu

This paper is included in the Proceedings of the 2020 USENIX Annual Technical Conference. July 15–17, 2020 978-1-939133-14-4

Open access to the Proceedings of the 2020 USENIX Annual Technical Conference is sponsored by USENIX. Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel Isolation and Communication

Jinyu Gu, Xinyue Wu, Wentai Li, Nian Liu, Zeyu Mi, Yubin Xia, Haibo Chen Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University

Abstract bination of various optimizations such as in-register param- This paper presents UnderBridge, a redesign of traditional eter passing and scheduling avoidance, the performance of OSes to harmonize the tension between messag- highly optimized IPC has reached less than 1500 cycles per ing performance and isolation. UnderBridge moves the OS roundtrip [13]. The state-of-the-art SkyBridge IPC design, components of a microkernel between and ker- which retrofits Intel vmfunc to optimize IPCs, has further re- nel space at runtime while enforcing consistent isolation. It duced the IPC cost to around 400 cycles per roundtrip [62]. retrofits Intel Memory Protection Key for Userspace (PKU) However, such cost is still considerable compared with the in kernel space to achieve such isolation efficiently and de- cost of invoking kernel components in monolithic kernels sign a fast IPC mechanism across those OS components. (.g., calling function pointers takes around 24 cycles). Thanks to PKU’s extremely low overhead, the inter- There is always a tension between isolation and perfor- communication (IPC) roundtrip cost in UnderBridge can be mance for OS kernel designs. In this paper, we present a as low as 109 cycles. We have designed and implemented a new design named UnderBridge, which redesigns the run- new microkernel called ChCore based on UnderBridge and time structure of microkernels to harmonize performance have also ported UnderBridge to three mainstream microker- and isolation. The key idea is building isolated domains in su- pervisor mode while providing efficient cross-domain inter- nels, i.e., seL4, Google Zircon, and Fiasco.OC. Evaluations 1 show that UnderBridge speeds up the IPC by 3.0× compared actions, and enabling user-space system servers of a micro- with the state-of-the-art (e.g., SkyBridge) and improves the kernel OS to run in those domains. A traditional microkernel performance of IPC-intensive applications by up to 13.1× OS usually consists of a core kernel in supervisor mode and for the above three microkernels. several system servers in different user processes. With Un- derBridge, a system server can also run in an isolated kernel 1 Introduction space domain besides a user process. The system servers that The microkernel OS design has been studied for decades [3, run in kernel can interact with each other as well as the core 27, 35, 44, 49, 52, 70]. Microkernels minimize code running kernel efficiently without traditional expensive IPCs, and ap- in supervisor mode by moving OS components, such as file plications can invoke them with only two privilege switches, systems and the network stack, as well as device drivers, into similar to a monolithic OS. Although the number of isolated isolated user processes, which achieves good extensibility, se- domains is limited and may be smaller than the number of curity, and fault isolation. Other than the success of microker- system servers, UnderBridge supports server migration. The nels in safety-critical scenarios [1, 40, 66], there is a resur- microkernel can dynamically decide to run a server either in gent interest in designing microkernels for more general- a user process or a kernel domain based on how performance- purpose applications, such as Google’s next-generation ker- critical it is. However, it is challenging to efficiently provide nel Zircon [3]. mutually-isolated domains together with fast cross-domain However, a cost coming with microkernel is its commonly interactions in kernel space. 
lower performance compared with its monolithic counter- Protection Keys for Userspace (PKU [7], Section-2.7, parts, which forces a tradeoff between performance and iso- Volume-3), also named as Intel memory protection keys lation in many cases. One key factor of such cost is the com- (MPK), has been introduced in recent Intel processors and munication (IPC) overhead between OS components, which investigated by researchers to achieve intra-process isolation is considered as the Achilles’ Heel of microkernels [30, 33, in user space [29, 39, 64, 77]. As the name “Userspace” in- 51, 53, 62, 76]. Hence, there has been a long line of re- dicates, PKU is a mechanism that appears to be only effec- search work to improve the IPC performance for microker- 1We name the microkernel OS components implementing system func- nels [19, 30, 36, 44, 50, 52, 62, 79, 82]. Through a com- tionalities as system servers. and drivers are typical examples.

USENIX Association 2020 USENIX Annual Technical Conference 401 tive in user space. After a detailed investigation, we observed which, inherently, prevent them from executing privileged in- that no matter in kernel space (Ring-0) or user space (Ring- structions and exclude them from the system’s trusted com- 3), PKU-capable CPUs transparently enforce permission puting base (TCB). To do not bloat the TCB, we also pre- checks on memory accesses only if the User/Kernel (U/K) bit vent the in-kernel system servers from executing any privi- of the corresponding page table entry is User (means user- leged instruction. We leverage hardware virtualization and accessible). Hence, PKU, as a lightweight hardware feature, run ChCore in non-root mode and deploy a tiny secure mon- also offers an opportunity to achieve efficient intra-kernel itor in root mode. For most privileged instructions that are isolation if all the page table entries for kernel memory are not in the critical path, we configure them to trigger VMExit marked with U bit instead of K bit. However, marking ker- and enforce permission checks in the monitor. For others, we nel memory as user-accessible is dangerous since unprivi- carefully handle them using binary scanning and rewriting. leged applications may directly access kernel memory. Fortu- We have implemented ChCore on a real server with Intel nately, today’s OS kernels are usually equipped with kernel- Xeon Gold 6138 CPUs and conducted evaluations to show page-table-isolation (KPTI) when preferring stronger secu- the efficiency of UnderBridge. To demonstrate the general- rity guarantees, including defending against Meltdown-like ity of UnderBridge, we have also ported it to three popu- attacks [20, 55] and protecting kernel-address-space-layout- lar microkernels, i.e., seL4 [14], Google Zircon [4], and Fi- randomization [8, 42]. User processes and the kernel use dif- asco.OC [2]. In the micro-benchmark, UnderBridge achieves ferent page tables with KPTI, so marking kernel memory as 3.0× speedup compared with SkyBridge. In IPC-intensive user-accessible in a separate page table does not risk allow- application benchmarks, UnderBridge also shows better per- ing applications to access kernel space. formance than SkyBridge (up to 65%) and improves the per- Hence, UnderBridge allocates an individual page table for formance of the above three microkernels by 2.5×∼13.1×. the kernel and builds isolated execution domains atop MPK In summary, this paper makes the following contributions: memory domains in kernel space. Unlike software fault iso- • A new IPC design called UnderBridge that retrofits In- lation (SFI), guaranteeing memory isolation with MPK hard- tel MPK/PKU in kernel to achieve ultra-low overhead ware incurs nearly zero runtime overhead. Meanwhile, a interactions across system servers. new instruction, wrpkru, can help switching domains by writ- • A microkernel prototype ChCore which uses Under- ing a specific register, PKRU (protection key rights register Bridge to move system servers back to kernel space for user pages), which only takes 28 cycles. Thus, domain while keeping the same isolation properties as tradi- switches can be quick. With UnderBridge, we design and im- tional microkernels. plement a prototype microkernel named ChCore, which com- • A detailed evaluation of UnderBridge in ChCore and prises a core kernel and different system servers similar to three widely-used microkernels, which demonstrates existing microkernels. 
ChCore still preserves the IPC inter- the efficiency of the design. faces with fast domain switch for the servers but embraces better performance by significantly reducing the IPC costs. 2 Motivation However, we find merely using MPK in kernel fails to achieve the same isolation guarantee as traditional microker- 2.1 Invoking Servers with IPCs is Costly nels. On the one hand, since MPK only checks read/write per- To obviate the severe consequences incurred by crashes or missions when accessing memory but applies no restrictions security breaches [9, 22, 24], the microkernel architecture on instruction fetching, any server can execute all the code places most kernel functionalities into different user-space in the same virtual address space. Thus, an IPC gate (a code system servers and only keeps crucial functionalities in the piece), which is used in UnderBridge to establish the con- privileged kernel, as depicted in Figure-1. Therefore, a fault nection between two specific servers, can be abused by any in a single system server can be caught before it propagates server to issue an illegal IPC. To address the problem, Under- to the whole system. Bridge authenticates the caller of an IPC gate through check- However, compared with a monolithic OS like , a ing its (unique) memory-access permission and ensures the system service invocation usually becomes more expensive authentication is non-bypassable by validating a secret token at both sides of the IPC gate. Process-1 Process-2 Process-3 On the other hand, system servers running in kernel space System System Application Server A Server B can execute privileged instructions. Although the servers E.g., SQLite3 E.g., xv6fs E.g., RAMdisk (initially running in user space) should not contain any Ring-3 Ring-0 privileged instruction, such instructions may arise inadver- Core Kernel tently on x86, e.g., being composed of the bytes of adjacent Figure 1: A simplified microkernel architecture. Even without instructions. Thus, a compromised server may use return- KPTI, calling a server with IPC requires two user-kernel roundtrips oriented programming (ROP) [21, 69] to execute them. Tradi- with two process switches. A vertical arrow represents one tional microkernels confine system servers in user processes, roundtrip, and a dotted line means two process switches.

402 2020 USENIX Annual Technical Conference USENIX Association in such an OS architecture. Figure-1 shows a service invo- in different user-space processes (different address spaces), cation procedure that involves two system servers and thus the kernel has to change the address space from the caller leads to two IPCs. In this case, a microkernel requires four to the callee. With KPTI enabled, the kernel further needs to roundtrips between user and kernel in total, while Linux only change the address spaces twice during the privilege switch, requires one (i.e., the leftmost arrow between the application which inflates the overhead. The third part is other logics in and the OS). It is because Linux invokes different kernel com- IPC, such as permissions and fast-path conditions checks. ponents (like system servers in microkernel) directly through Besides its inherent cost, an IPC will inevitably cause pol- function calls. lution to the CPU internal structures such as the pipeline, instruction, data caches, and translation look-aside buffers IPC Cost Parts (cycles) w/o KPTI w/ KPTI Real Work in Servers (TLB), which has been evaluated in prior work [18, 32, 62, 100%

80% PrivilegeSwitch 158 690 71]. According to [71], the pollution caused by frequent priv- Process Switch 295 Included above 60% ilege switches can degrade the performance by up to 65% for Others 277 320 40% SPEC CPU programs. 20% Total 730 * 1010 In summary, the switches of privilege and address space Zircon seL4 seL4 *The result conforms to the officially reported w/ kpti w/o kpti data (722∼736) [13]. in IPC bring considerable overhead, which motivates our (a) (b) lightweight IPC design to remove these switches. Figure 2: (a) Invoking servers with IPCs is expensive. (b) A break- 2.3 Using Intel MPK in Kernel down of seL4 fast-path IPC. Background: Intel memory protection keys (MPK) [7]isa To measure the performance cost of server invocations new hardware feature to restrict memory accesses. MPK as- with IPC, we run SQLite3 [15] on Zircon and seL4 (§ 6 gives signs a four-bit domain ID (aka, protection key) to each page the detailed setup). We measure (i) the total time spent on in a virtual address space by filling the ID in previously un- invoking system servers (i.e., an FS server and a RAMdisk used bits of page table entries. Thus, MPK can partition user server) and (ii) the effective time used in system servers for pages within a virtual address space into at most 16 memory handling the requests. The difference between the two time domains. To determine the access permissions (read-only, durations is considered as the IPC cost. As presented in Fig- read-write, none) on each memory domain, it introduces a ure 2(a), the IPC cost is as high as 79% in Zircon. Even in per-core register, PKRU, and a new instruction, wrpkru, to seL4, which uses highly-optimized IPCs, 44% of the time is modify the PKRU register in only 28 cycles. It is worth men- spent on IPC when KPTI enabled (38% without KPTI). tioning that MPK checks on memory accesses incur nearly Moreover, a system functionality may involve even more zero runtime overhead [39, 77]. Nevertheless, MPK does not IPCs to invoke multiple system servers. For instance, launch- enforce permission checks on execution permission. One ex- ing an application requires 8 IPC roundtrips (among Shell, ecutable memory page is always executable to any domain, Loader, FS, and Driver) on Zircon. In contrast, it only needs even if PKRU forbids the domain from reading the page. one or two system calls on Linux (e.g., fork + ). There- Observation: After a detailed investigation, we observe that fore, invoking system servers with IPCs is time-consuming no matter in Ring-0 or Ring-3, MPK enforces permission in microkernels, which motivates our work in this paper. checks on any user-accessible memory page. To enable MPK checks in Ring-0, the User/Kernel (U/K) bits of all the corre- 2.2 IPC Overhead Analysis sponding page table entries in a four-level page table have to To further understand the overhead of IPC, we break down be set as User (i.e., 1). If there exists one entry that contains and measure the cost of each step in the IPC procedure in the Kernel bit at any level, MPK will not check the access seL4, which is known to be an efficient implementation of on the corresponding memory pages. Furthermore, the Su- microkernel IPC. Here we use the ideal configuration by re- pervisor Mode Access Prevention (SMAP) and Supervisor ferring to [13] and measure a one-way IPC (not a roundtrip) Mode Execution Prevention (SMEP) should also be disabled without transferring data. 
for accessing or executing these pages (tagged with User) in We find that the direct cost of the IPC consists of three Ring-0. main parts as shown in Figure-2(b). The first part is the priv- 2.4 Building Isolated Domains ilege switch. A user-space caller starts an IPC by using a syscall instruction to trap into the kernel. The kernel needs There are many approaches to build lightweight and isolated to save the caller’s context, which will be restored when re- domains. SFI (Software Fault Isolation), which has been ac- suming the caller. To invoke the target user-space callee, the tively studied over 20 years [34, 46, 60, 68, 80, 87], is one kernel transfers the control flow with a sysret instruction after of the most mature candidates. However, although being a restoring the callee’s context. The second part is the process general solution to achieve memory/fault isolation, SFI in- switch. The major cost in this part is the CR3 modification in- curs non-negligible runtime overhead due to excessive code struction (270 cycles). Since the caller and callee are isolated instrumentations. For example, two representative studies

USENIX Association 2020 USENIX Annual Technical Conference 403 show that SFI introduces around 15% overhead for SPEC intuitive design is adopting the MPK-based intra-process iso- CPU programs on average (Table-2 in [68]), even with the lation [39, 77] to run system servers within an application ad- help of the latest boundary-checking hardware MPX (Figure- dress space. However, this design has three major problems. 3 in [46]). Some other approaches [48, 88] uses x86 segmen- First, it requires to map a server into multiple applications’ tation for memory isolation, which avoids software checks page tables, which makes updating the server’s memory map- on memory accesses and is suitable for sandbox execution, pings especially expensive (i.e., update all the page tables). but it is not widely used anymore [68]. Second, its cost to setting up the IPC is non-negligible due to intensive page table modifications. Third, it restricts the Instruction Cost(cycles) applications’ ability to use the whole address space and the Indirect Call + Return 24 MPK hardware freely. syscall + sysret 150 Write CR3 (no TLB flush) 226 Instead, UnderBridge boosts IPC performance by putting vmfunc (switch EPT) 146 the system servers, which are user-space processes in tradi- wrpkru 28 tional microkernels, back into kernel space. Figure-3 shows Table 1: Cost comparison of selected instructions. the system overview. The core kernel resembles the tra- Leveraging advanced hardware features to build isolated ditional microkernel, which provides crucial functionali- domains can achieve better runtime performance. For exam- ties such as managing memory protection, capability en- ple, prior work [39, 46, 56, 57, 62, 65, 77] utilizes (extended) forcement, scheduling, and establishing IPC connections. page tables to provide isolated domains in user space and ex- With UnderBridge, system servers can run in kernel space ploits instructions like vmfunc and wrpkru for fast domain (e.g., Server-A/B) but are confined in isolated environments switches. We list the costs of these frequently-used instruc- (called execution domain). UnderBridge makes the core ker- tions in Table-1. The cost of wrpkru is the closest to indirect nel and in-kernel system servers share the same (kernel) ad- call, which is used to invoke kernel components (with func- dress space while leveraging Intel MPK to guarantee mem- tion pointers) in monolithic kernels. Therefore, considering ory isolation. that MPK is applicable to the kernel and has good perfor- To use MPK in kernel space, UnderBridge tags all the ker- mance property, we propose that MPK can be leveraged to nel memory pages with “User” bits in the kernel page table, implement an efficient fine-grained isolation mechanism in as introduced in § 2.3. However, marking kernel memory as the kernel. user-accessible enables unprivileged user-space applications to access kernel memory directly. UnderBridge prevents this 3 UnderBridge by allocating a separate page table to each application. An application’s page table does not contain the kernel space Process Process memory except for a small trampoline region (tagged with Application Server-C “Kernel”), which is used for privilege switch (e.g., syscall). Building IPC connections with in-kernel system servers

Code Data takes the following steps. First, a system server proactively User Mode Traditional IPC registers a function address (IPC function) in the core kernel before serving requests. The core kernel will check whether Execution Domain-0 (Privileged) the address is legal, i.e., both executable and belonging to

Core Kernel the server. Second, another server can ask the core kernel to Kernel Mode establish an IPC connection with the registered server. Third,

IPC Gate the core kernel generates an IPC Gate, which helps to accom- plish the IPC function invocation. Memory Domain-0 If an application needs to invoke the in-kernel server, it also needs to establish the connection first. Later, it invokes Server-A Server-B the system call for IPC and traps into the core kernel. Then, IPC Data the kernel will help to invoke the requested server via the Execution Domain-2 Execution Domain-1 corresponding IPC gate (between the kernel and the server). Memory Memory Domain-3 Memory Domain-1 (Shared by A & B) Domain-2 3.1 Execution Domains Figure 3: The overview of ChCore based on UnderBridge. As shown in Figure-3, UnderBridge constructs isolated ex- The goal of UnderBridge is to optimize the synchronous ecution domains over MPK memory domains and confines IPC2 while simultaneously maintaining strong isolation. An each in-kernel system server in an individual execution do- main. Specifically, UnderBridge builds 16 execution do- 2Synchronous IPC is commonly used in microkernels, especially when calling system servers. After issuing a synchronous IPC, the caller blocks mains in kernel space and assigns a unique domain ID until the callee returns. (0∼15) to each of them. Execution domain 0 is specialized

404 2020 USENIX Annual Technical Conference USENIX Association for running the (trustworthy) core kernel and can access all originally from T1. Thus, the scheduling overhead is avoided the memory. Every other execution domain (1∼15) has a pri- in the IPC gates. vate memory domain with a specific ID and can only access Besides intra-server, IPC gates also exist between the core its private memory domain by default. A system server, ex- kernel and the servers, which enables the core kernel to in- clusively running in one execution domain, stores its data, teract with the servers. Specifically, those in-kernel system stack, and heap regions in its private memory domain, which servers invoke the core kernel’s service through dedicated cannot be accessed by other servers. Nevertheless, its code re- IPC gates connected with the core kernel instead of using gion resides in memory domain 0 that can only be read/writ- syscall instructions. To handle exceptions/interrupts during ten by the core kernel. In this way, the server cannot read- the execution of in-kernel servers, UnderBridge provides /write its own code but can still execute it (i.e., execute-only similar gates at the beginning of each handler to switch the memory) as MPK memory domains do not affect instruction execution domain to the core kernel (execution domain 0). fetching. UnderBridge ensures an execution domain can only access allowed memory domains by configuring its PKRU 3.3 Server Migration register. It also forbids an execution domain (a server) from As MPK provides 16 memory domains in total, UnderBridge modifying this register by itself (details in § 4.2). can only support at most 16 execution domains concurrently, between two servers is allocated by al- including the reserved one for the core kernel. Neverthe- lowing them to access a free memory domain together (e.g., less, more system servers can be required, considering the Memory Domain-3). Shared memory between a server and number of different device drivers. When the number of the core kernel is achieved by letting the core kernel directly concurrent system servers exceeds 15, one solution is time- access the server’s private memory domain. Shared memory multiplexing [64] but will bring non-negligible overhead if between an application and a server is achieved by mapping frequently stopping and restarting servers. some private memory of the server in the application’s page Instead, UnderBridge enables server migration, which dy- table, which does not require a free memory domain. namically moves servers between user and kernel space. 3.2 IPCGates Each server is compiled as a position-independent exe- cutable, and the core kernel assigns disjoint Even though system servers reside in the same kernel address regions for different system servers. Whenever runs in an ex- space, UnderBridge still preserves the well-defined IPC inter- ecution domain in the kernel or a user process, a server al- faces for them. When connecting two system servers (in two ways uses the same virtual addresses. So, when migrating a execution domains), the core kernel generates an IPC gate for server between user and kernel, UnderBridge does not need them, which resides in memory domain 0. Specifically, it first to do relocations because all the memory references in the allocates a piece of memory for the gate and loads the gate server are always valid (no changes), which significantly sim- code, which is small, as shown in Figure-4, into the memory. plifies the migration procedure. 
Moreover, the system call Then, it fills specific values (e.g., per-gate SECRET_TOKEN) layer of the LibC used by servers is also modified, and thus into the gate. After that, it gives the gate address to the two all the syscall instructions are organized in one memory page, system servers connected by this gate. namely the syscall page. When migrating a server from user Later, during an IPC invocation, the gate transfers the con- space to kernel space, the core kernel overrides this page with trol flow from a caller to a callee. To be more specific, it saves another prepared page which contains IPC gates connected to the caller’s execution state, switches to the callee’s domain the core kernel. Therefore, a server can seamlessly perform by setting the PRKU register, and restores the callee’s execu- system calls no matter in user or kernel space. tion state. UnderBridge allows the caller and callee to define Specifically, there are four steps to migrate a server from their calling conventions (the gates only save/restore neces- kernel space to user space. First, the core kernel will wait for sary state by default), which is flexible and efficient. Trans- the server to enter a quiescent state by blocking new IPC re- ferring messages by registers and shared memory are both quests to the server temporarily and waiting for the finishes supported. of on-going ones. Second, it will modify the syscall page Since the system server in UnderBridge only executes of the server and making the server perform system calls when being called through an IPC gate, we adopt the mech- through syscall instruction instead of IPC gates later. Third, anism of decoupling the execution context, which contains it will free the execution domain (ID) of the server by set- the execution state (e.g., register values), with the schedul- ting the domain ID of the server’s page table entries to 0 ing context, which contains the scheduling information (e.g., and flush TLBs. If the server installs shared memory with an- time slice used in the scheduler) [47, 58, 72] and mark those other in-kernel server (S), the domain ID of the shared mem- servers as passive. When an application (T1) invokes the sys- ory will also be freed because the shared memory can use the tem server (T2) through an IPC, T2 inherits T1’s scheduling domain ID of S in the kernel page table. Last, the core ker- context and then starts executing. When T2 invokes another nel will activate the in-user server and allow clients to issue server (T3), T3 also inherits the scheduling context, which is IPCs to it. Migrating a server from user space to kernel space

USENIX Association 2020 USENIX Annual Technical Conference 405 1 ··· // Save & clear the caller’s states 1 ··· // Save & clear the caller’s states 1 ··· // Save & clear the caller’s states 2 xor %rcx, %rcx // Only core-kernel knows SECRET_TOKEN 2 mov $SECRET_TOKEN , %r15 3 xor %rdx, %rdx + 2 mov $SECRET_TOKEN , %r15 3 xor %rcx, %rcx 4 mov $PKRU_CALLEE , %rax 3 xor %rcx, %rcx 4 xor %rdx, %rdx 5 wrpkru // switch to the callee’s domain 4 xor %rdx, %rdx /* 6 cmp $PKRU_CALLEE , %rax // Authenticates the caller’s identity * Replace Line 5-7 with Line-8. 7 jne abort + 5 rdpkru * Only the legal caller can access + 6 cmp $PKRU_CALLER , %rax * ADDR_IN_CALLER. 8 ··· // Execute the callee’s function + 7 jne handle_abuse */ 8 mov $PKRU_CALLEE , %rax - 5 rdpkru 9 xor %rcx, %rcx 9 wrpkru - 6 cmp $PKRU_CALLER , %rax 10 xor %rdx, %rdx // Line-10 is removed due to Line-11 - 7 jne handle_abuse 11 mov $PKRU_CALLER , %rax - 10 cmp $PKRU_CALLEE , %rax + 8 mov %rsp, ADDR_IN_CALLER 12 wrpkru // switch to the caller’s domain + 11 cmp $SECRET_TOKEN , %r15 9 mov $PKRU_CALLEE , %rax 13 cmp $PKRU_CALLER , %rax // DoS attacks is also not allowed 10 wrpkru 14 jne abort - 12 jne abort 11 cmp $SECRET_TOKEN , %r15 15 ··· // Restore & return to the caller + 13 jne handle_abuse 12 jne handle_abuse

(a) (b) (c) Figure 4: (a) A basic IPC gate for switching execution domains: Line 1-7 is from the caller to the callee and Line 9-15 is just a reverse process. (b) A security-enhanced IPC gate (from the caller to the callee only) that solves the arbitrary IPC problem. (c) An optimized IPC gate based on the secure one. works similarly. The mappings of a system server in the ker- mission. Although recent work on MPK-based intra-process nel page table (used when running in the kernel) will not be isolation [39, 77] does not consider such gate abusing prob- removed, and the system server’s page table (used when run- lem, we cannot neglect it because it violates the enforce- ning in user) always exists. And the core kernel will keep ment of the IPC capability in microkernels. Therefore, Un- corresponding mappings in the two page tables the same. derBridge ensures only a legal caller (allowed by the core With server migration, UnderBridge can run frequently- kernel) can successfully use an IPC gate by adding manda- used servers (according to either online or offline profiling) tory authentications in the gate. § 4.1 explains the secure de- in kernel space while accommodating other servers in user sign of IPC gates. processes. Besides that, we think the number of frequently- The second threat is that untrusted in-kernel system used servers may usually be small according to a preliminary servers run in supervisor mode (Ring-0). Thus, a compro- survey on some popular applications including Memcached, mised server can execute any privileged instructions theo- MySQL, GCC, and a ROS-based (robot) application. We run retically, which threatens the whole system. For example, it those applications on Linux and find the most required sys- could install a new page table and freely access all memory. tem calls are only related to the File System, Network, Syn- One possible defense solution is to enforce control-flow in- chronization, and Memory-Management. For a microkernel, tegrity (CFI), which ensures that servers cannot execute any these system calls are usually implemented in only several illegal control flow leading to executing privileged instruc- system servers or directly in the core kernel, which indicates tions. However, CFI instrumentations inevitably bring obvi- the server migration may rarely happen. ous runtime overhead. Instead, we choose hardware virtual- ization technology and run ChCore in non-root mode. A tiny 4 Enforcing Isolation in UnderBridge secure monitor in root mode audits the execution of most Threat Model and Assumptions. UnderBridge aims to privileged instructions, as summarized in Table 2, by simply achieve the same security guarantee as existing microkernels trapping them through VMExits. In the meanwhile, we use and inherits the same trust model. Specifically, the (trusted) the binary rewriting technique to avoid the expensive VMEx- core kernel is assumed to be bug-free and correctly imple- its on the critical paths. § 4.2 gives more details. mented because it has relatively small codebase (e.g., 8,500 LoC in our implementation) and is amenable to formal ver- 4.1 Unauthorized IPCs Prevention ification [44]. We do not trust applications or the operating As shown in Figure-4(a), the responsibility of an IPC gate system servers, which may have vulnerabilities or even be is saving/restoring the (necessary) execution context of the maliciously crafted and can be fully compromised by attack- caller/callee and switching the execution domain from the ers. 
Physical attacks and hardware bugs are out of the scope caller to the callee. Line 2-4 prepares the argument registers of this paper. for wrpkru, which requires eax to store the target permission, Two Security Challenges. Although UnderBridge achieves and both ecx and edx to be zero. The wrpkru instruction memory isolation by utilizing MPK hardware, there are still in Line 5 sets PKRU_CALLEE (the callee’s permission) to two security threats. the PKRU register, which specifies the memory-access per- The first threat is the arbitrary IPC problem. The core mission, so the execution domain changes to the callee’s kernel generates IPC gates in memory domain 0 to ensure domain after this instruction finishes. Line 6-7 prevents a any server cannot modify the gates. However, an IPC gate compromised from directly jumping (e.g., ROP) to can still be invoked by any (in-kernel) system server as the Line 5 with some carefully chosen value in eax. Existing MPK does not enforce permission check on execution per- work [39, 77] on MPK also designs similar gates for intra-

406 2020 USENIX Annual Technical Conference USENIX Association process isolation. instructions operating on those registers to trigger VMExits. However, such a design faces the gate abusing problem. And, the secure monitor in root mode checks whether they Since MPK has no restriction on execution permission, a do- are executed by the core kernel according to the memory- main can use the gates belonging to other domains to issue access permission in PKRU. Nevertheless, rdmsr/wrmsr may IPCs. UnderBridge solves this problem by authenticating the also be used to operate FS/GS in the critical path. Under- caller’s identity in the IPC gates. As the core kernel is re- Bridge can configure these two specific registers not to trap sponsible for generating IPC gates when two execution do- or replace them with some newer non-privileged instructions mains establish the IPC connection, it knows which domain (e.g., wrfsbase). is the legal caller of the gate. Besides, since each domain Similarly, since control registers CR0 and CR4 are set at has unique memory-access permission, UnderBridge regards boot time only, UnderBridge also traps the setting instruc- the permission as the domain’s identity and checks the iden- tions. Nevertheless, accessing CR2 and reading CR0/CR4 tity with rdpkru in Line 5-7 of Figure-4(b). Moreover, Un- cannot trigger VMExits. UnderBridge clears CR2, where derBridge must ensure that the identity check cannot be by- CPU saves the page fault address in the fault handler to passed. Otherwise, a compromised thread can jump to Line prevent information leakage, and hides the real CR0/CR4 8/9 without going through the check. To this end, each gate value with a shadow value by leveraging virtualization hard- adds two cheap instructions (Line 2 and Line 11 in Figure- ware functionality. As for CR3, it points to page tables and 4(b)) to guarantee that a successful IPC invocation must go needs modifications when switching address spaces, which through the check. When setting up an IPC gate, the core ker- frequently appears in the critical path. So, triggering VMEx- nel randomly produces a 64-bit secret token with rdrand in- its on CR3 modifications is expensive. Instead, we use the fol- struction and inserts it to the gate. Note that any server cannot lowing lightweight solution (a special method). When load- read the token value since the IPC gates belong to domain 0 ing system servers, UnderBridge leverages binary scanning (core kernel). Thus, any caller who wants to pass the check at and rewriting to guarantee the servers contain no CR3 modifi- Line 11 must execute Line 2 first, which ensures the identity cation instructions, including at unaligned instruction bound- check at Line 5-7 is non-bypassable for successful invoca- aries. While in the core kernel, this privileged instruction tions. Similarly, Line 8 is also non-bypassable for successful must exist to switch address spaces. UnderBridge prevents IPCs; thus, Line 10 is no more required. a compromised system server from executing this instruc- Figure-4(c) further gives a more efficient design, which tion in the core kernel through defenses in depth. First (a eliminates the overhead of identity check. Although rdpkru simple defense), the instruction location is unknown to the only takes about 9 cycles, it as well as the extra comparison servers. Second, to achieve higher security, the core kernel (Line 6-7) are still on the critical path of IPC. 
To remove the can write this instruction right before executing it and imme- overhead, UnderBridge authenticates IPC callers by reusing diately remove it after the execution. Thus, a system server the stack-pointer saving instruction (Line 8), which is ini- cannot execute it even if knowing its location. Furthermore, tially located in the procedure of state saving (Line 1). In on different cores, UnderBridge makes the page table map- this way, any illegal caller will trigger a fault when accessing ping for this instruction (page) different. So, when one core ADDR_IN_CALLER (Line 8) and get caught. writes this instruction, other cores still cannot execute it. Instructions that invalidate cache/TLB may be used by 4.2 Privilege Deprivation compromised servers for conducting performance attacks. In traditional microkernels, system servers only have Ring-3 While trapping cache invalidation instructions via VMEx- privilege. To achieve the same security/isolation guarantee, its does not affect overall performance since they are rarely UnderBridge should restrict the servers’ behavior when run- executed, flushing TLB frequently appears on the critical ning them in execution domains (Ring-0). However, a com- path. So, we use binary rewriting to ensure there are no promised server may find and execute privileged instructions TLB flush instructions in system servers instead of trapping at unaligned instruction boundaries with ROP to attack the these instructions. The core kernel must contain these instruc- whole system. Based on an in-depth analysis of privileged tions, but they cannot be abused by faulting servers because instructions (briefly summarized in Table-2), UnderBridge we make sure they are followed by instructions that access combines virtualization hardware support, binary rewriting the core kernel memory. Cacheline flush instructions (non- technique, and some specific solutions to de-privilege the ex- privileged instructions, e.g., clflush and clflushopt) are also ecution domains for servers with negligible overhead. considered because the system servers share the same ad- According to our survey on four microkernels, system reg- dress space with the core kernel. Nevertheless, these instruc- isters such as IDTR are only configured at boot time; de- tions obey MPK memory (read) checks and thus, cannot be bug/profile registers are only accessed at debugging/profil- utilized by a server to flush others’ memory areas. ing time; most model-specific registers (MSRs) are also not For other privileged instructions and I/O related opera- accessed in the critical paths. Thus, UnderBridge runs the tions, we take similar solutions as listed in Table-2. One thing microkernel in non-root mode and configures the privileged to notice is that a compromised server may disable interrupts

USENIX Association 2020 USENIX Annual Technical Conference 407 Related Instructions Categories Usages in Zircon/Fiasco.OC/seL4/ChCore or Brief Explanations Solutions or Registers Although seL4 uses "LTR" instruction when switching processes, Load/Store IDTR, GDTR, LDTR it can be removed by setting different TSSs at boot time as do in other microkernels. VMExit System Registers TR, XCR0 ... Others are required at boot time only. Debug/Profile Debug registers Required for debugging and profiling, which are not performance-oriented. VMExit Registers RDPMC Model Specific Usually, they are mostly used at boot time or debug time and RDMSR/WRMSR Selected VMExit Registers can trigger VMExits for selected registers according to configuration bitmaps. mov CRn, reg - In the four microkernels, CR0/CR4 is written at boot time only and CR8 is not used. - VMExit Read/Write mov reg, CRn - No VMExits: accessing CR2, reading CR0 and CR4. - Clear/hide Control Registers CLTS (modify CR0) - CR3 is used for switching address space. So, we handle it with a special method. - Special method - INVD/WBINVD - Whole cache eviction. WBINVD is rarely used, and INVD is even not used. - VMExit Cache/TLB - INVLPG/INVPCID - TLB clear is needed after updating page tables. So, triggering VMExits is costly here. - Binary rewritng States - clflush instructions - (executable in Ring-3) Evicting a single cache line. MPK checks take effects. - None - Port I/O - Port I/O is not performance-oriented and can trigger VMExits with I/O bitmaps. - VMExit I/O Related - MMIO - MMIO operations go through MPK checks. - None Operations - DMA - The core kernel initializes DMA devices at boot time and takes the control plane. - None - SMSW, RSM, HLT ... - Either related to other modes like legacy and SMM mode or rarely used. - VMExit Other Privilege - SWAPGS, SYSRET... - Cannot break the system states, otherwise leading to the execution of fault handlers. - None Instructions - CLI, POPFQ... - Can be used by compromised servers to disable interrupts. - Check in VMM PKRU Register - xsave set instructions - For restoring extended processor states, which may include PKRU state. - No restoring (Ring-0/3) - WRPKRU - For changing the value of PRKU register as used in IPC gates. - Binary rewriting Table 2: Deprive the execution domains for system servers of the ability to execute privileged instructions. through instructions like cli to monopolize the CPU. Fortu- attack the core kernel, other servers, or the applications. nately, it cannot disable the host timer interrupts, which un- We assume that an attacker has fully compromised an in- conditionally trigger VMExits. Thus, the secure monitor can kernel server and can execute arbitrary instructions. Since easily detect such malicious behaviors by checking PKRU the server runs in another execution domain (no access to and interrupt state. the memory domain 0), it cannot directly access the memory Last but not least, we must forbid system servers from of the core kernel. As long as it tries to read or write any changing the PKRU register, i.e., changing the memory- disallowed memory, a CPU exception will immediately be access permission, by themselves. There are two kinds triggered and handled by the core kernel. PKRU of instructions that can modify . For the first kind There are four ways to bypass the memory isolation mech- xrstor/xrstors ( ), the core kernel configures them not to man- anism enforced by MPK: the first one is to run the disabling age PKRU by setting a control bit (bit-9) in XRC0. 
For the instructions, e.g., by setting CR4.PKE to 0 or setting CR0.WP wrpkru second kind ( ), the core kernel ensures it does not ex- to 0; the second is to change the PKRU register to gain access ist outside the IPC gates by rewriting the binary code (simi- permission of other memory domains illegally; the third is to wrpkru lar to [77]). The in the IPC gates cannot be abused as change the page table base address by setting the CR3 regis- specified in § 4.1. ter; the fourth is to modify the page table directly. We omit the detailed policies of the binary rewriting in ChCore can defend against all these attacks. Before load- this paper as it is a mature technique [17, 25, 77]. Neverthe- ing a server, ChCore uses binary scanning/rewriting to elimi- less, it is worth noting that using binary rewriting to directly nate the undesired privileged instructions. At runtime, the se- eliminating all privilege instructions is undecidable because cure monitor will prevent a server from executing other privi- some privilege instructions only take one byte (e.g., hlt). Our leged instructions. Compared with running on traditional mi- hybrid approach is both effective and efficient. crokernels, servers, including maliciously crafted ones, have 4.3 Security Analysis no more attack means on microkernels with UnderBridge. First, when the malicious server executes the disabling in- By introducing in-kernel servers, our system has one major structions, it will trigger VMExits, and the monitor will lo- difference from existing microkernels, which may lead to a cate the compromised server. Second, as described in the last larger attack surface. The in-kernel servers run in the ker- paragraph of § 4.2, the two ways of modifying the PKRU reg- nel mode, which means a compromised server is able to ex- ister are prevented. Third, the malicious server has no way to ecute any privileged instruction. We will analyze the attacks modify CR3 since the binary rewriting guarantees no CR3 caused by the difference and illustrate how to defend against modification instructions exist in any server’s address space. them. Fourth, the malicious server cannot modify the page table Restricting privileged in-kernel servers: System servers because the kernel page table resides in memory domain 0. are not trusted in our threat model. Although they can run Meanwhile, it cannot modify or add instructions which re- in the kernel mode, they are highly restricted when trying to quire to change the page table first. The isolation between

408 2020 USENIX Annual Technical Conference USENIX Association in-kernel servers is the same as the isolation between an in- ables and leverages the PCID hardware feature for avoiding kernel server and the core kernel. Since an in-kernel server unnecessary TLB flushing. As native Fiasco.OC does not does not share address space with user applications, it can- support to use this feature, we also add a simple extension not access applications’ memory either. to assign different PCIDs to an application and the kernel. Defending side-channel attacks: Since all the in-kernel servers share one address space, it is easier for a malicious 6 Performance Evaluation one to issue Spectre [45] and Meltdown [54] attacks com- Basic Setup. We conduct all the experiments on a Dell Pow- pared with the case where all servers have their own address erEdge R640 server, which is equipped with a 20-core Intel spaces. Although these attacks are caused by CPU bugs (out- Xeon Gold 6138 2.0GHz CPU and 128GB memory. Both of-scope), ChCore can mitigate them with existing software Turbo Boost and Hyper-threading are disabled. ChCore runs defenses like using address randomization makes a compro- on Linux/KVM-4.19 and QEMU-4.1. mised server hard to locate the sensitive memory area. Con- Systems for Comparison. We evaluate the native IPC per- sidering the secret tokens leakage on buggy CPUs, extra formance of three popular microkernels (Zircon, seL4, and checks can be added in the IPC gates, e.g., Line-10 in Figure- Fiasco.OC) on bare metal. Also, we evaluate SkyBridge [62], 4(b), to prevent malicious PKRU modification and ensure the which is the state-of-the-art optimization for IPC in micro- memory isolation. Besides that, most known hardware vul- kernels by using vmfunc. SkyBridge deploys a small hyper- nerabilities have been fixed by major CPU vendors in their visor called RootKernel and runs microkernels in non-root latest products [5, 6], which is orthogonal to ChCore. mode. When evaluating UnderBridge, we deploy the secure 5 Implementation monitor of UnderBridge in RootKernel and run microker- nels with UnderBridge on it. As Zircon has no kernel-page- Based on our UnderBridge design, we have implemented a table-isolation (KPTI) support, we simulate the overhead of prototype microkernel ChCore, which contains about 8500 page table switching by writing CR3 twice when evaluating lines of C code (LOC). ChCore runs in guest-mode, i.e., UnderBridge on it, this is because UnderBridge requires to non-root mode on x86_64, and a small secure monitor make the kernel and applications use different page tables. (around 300 LOC) runs in hypervisor-mode, i.e., root mode on x86_64. We have implemented the tiny secure monitor 6.1 IPC Performance Analysis in a minimal virtualization-layer, RootKernel (1,500 LOC) Cross-server IPC. Firstly, we analyze the IPC performance of SkyBridge [62]. Note that RootKernel is not a hypervisor between two servers (we abbreviate “system server” as and works only for running one OS in the guest-mode and “server” in this section) in a micro-benchmark, which uses thus avoids most overhead caused by virtualization. There- rdtsc instruction to measure the round-trip latency of invok- fore, although our system requires hardware virtualization, ing an empty function in server B from server A. it still can be deployed in bare-metal machines with RootK-

8500 ernel and achieve close-to-native performance. Considering 8151 the above, our system increases the trusted computing base 8000 7500 (TCB) by 1,800 LOC in total when running on bare-metal 5000 4145 4000 machines, which is acceptable. Besides, we also integrate 3057 3000 Cycles 2035 the tiny secure monitor in a commercial hypervisor, KVM, 2000 1450 1000 437 which makes deploying our system in cloud feasible. Even if 24 109 0 Monolithic ChCore SkyBridge seL4 seL4 Fiasco.OCFiasco.OC Zircon nested virtualization is required, our secure monitor can still (UnderBridge) -KPTI -KPTI work because it simply utilizes the hardware-provided capa- bility to trap sensitive instructions. Because the instructions Figure 5: Round-trip latency of cross-server IPC. to trap are deliberately selected and do not exist on critical Figure-5 gives the absolute cost of cross-server IPCs in paths, they will not degrade the overall performance. Never- different designs. Since invocations between servers (com- theless, our current implementation requires cloud providers ponents) in monolithic kernels are usually achieved by using to patch their hypervisors. In such a case, our system in- (indirect) call/ret instructions, the round-trip latency is only creases TCB by 300 LOC. Alternatively, we may leverage 24 cycles. The latency of an IPC round-trip in ChCore is eBPF [61, 81] to deploy our secure monitor without modifi- 109 cycles, which is dominated by two wrpkru instructions cations to the commercial hypervisor (left as future work). (56 cycles in total). Note that the IPC is achieved by Under- We also apply UnderBridge to three state-of-the-art mi- 3 Bridge and thus only involves the lightweight wrpkru and crokernels, i.e., seL4 , Zircon, and Fiasco.OC, to demon- the procedure of saving necessary registers. SkyBridge re- strate the generality of the design. The porting effort is about quires two much heavier vmfunc instruction (292 cycles in 1000∼1500 LOC for each of them. Since UnderBridge uses total) and therefore has larger latency. different page tables for applications and kernel, it also en- The round-trip latency of a native IPC in the other three 3We do not retain formal correctness guarantees of seL4. microkernels are much more noticeable. Among them, seL4

USENIX Association 2020 USENIX Annual Technical Conference 409 shows the best performance as it will directly switch the metric and cost the same cycles. In contrast, the latency of caller thread to callee thread when executing fast-path IPCs. UnderBridge grows much slower because its cross-server Although Fiasco.OC applies a similar strategy, it has a more IPC is much more lightweight. As shown in Table-3, the complex IPC capability handling procedure. Zircon does not performance of UnderBridge becomes better than SkyBridge support direct-switch and thus has the worst performance due when involving one cross-server IPC, and is better than that to the high scheduling cost. by 42% when involving two. The performance speedup is Compared with the native IPC in the three microkernels, expected to grow along with the increasing number of the our UnderBridge design is more than 12.3× faster. The per- cross-server IPCs and finally close to 3.0× as in Figure-5. formance improvement mainly comes from two parts. First, UnderBridge avoids the time-consuming privilege switches 6.2 Application Benchmarks in traditional IPC designs, as measured in § 2.2. Second, Un- We further evaluate the performance of UnderBridge with derBridge avoids the complex validation and invocation of two real-world applications: a database application and the IPC capability on the critical path. An IPC gate is only an HTTP server application. In the following experiments, generated after the corresponding capabilities having been shared memory is used to transfer data during IPCs. checked, so UnderBridge needs not to check the capability Database Evaluation. To faithfully compare with Sky- at runtime. Also, the gate only requires several lightweight Bridge, we use the benchmarks in [62]. Specifically, for serv- instructions for the domain switching (details in § 4.1). ing a relational database, SQLite3 [15], we run two system Another benefit of UnderBridge is that the in-kernel servers: one is a file system named xv6fs [16, 23], and the servers can invoke system calls faster via the IPC gates in- other is a RAMdisk (memory-only). When SQLite3 oper- stead of using syscall instructions. ates on a file, it will first invoke the xv6fs server by an Application-to-Server IPC. We further analyze the IPC application-to-server IPC, and then the xv6fs will read or (round-trip) performance between a user application and a write the RAMdisk by cross-server IPCs. server in this part. Commonly, a system call involves mul- Basic Operations. We first evaluate the performance of basic tiple servers in microkernels (or multiple kernel components operations, including insert, update, query, and delete opera- in monolithic kernels), which means an application-to-server tions. Our evaluation includes three IPC approaches: the na- IPC may involve several cross-server IPCs. To simulate such tive IPC, UnderBridge, and SkyBridge. Specifically, for each cases, we design a micro-benchmark that includes one appli- microkernel, we not only evaluate the performance with its cation and several servers. Each server will do nothing but native IPC designs but also test the performance after apply- routing IPCs to another server. We vary the number of cross- ing UnderBridge and SkyBridge to it. We also emulate the server IPCs in the benchmark. performance of a monolithic kernel by replacing all the IPC gates in UnderBridge with function calls. 
Cross-server IPCs Approaches Figure-6(a), 6(b) and 6(c) show the normalized throughput 0 1 2 of basic operations in the three microkernels, separately. The SkyBridge 437 931 1390 ChCore (UnderBridge) 723 856 981 baseline is set as the performance of native IPC design in seL4 1450 2932 4266 each microkernel. Table 3: Round-trip latency (cycles) of one application-to-server UnderBridge achieves up to 13.1×, 9.0×, 1.6× and 11.3× IPC and different number of cross-server IPCs. speedup for each of these operations, individually. The im- provement of query operations is relatively small since Table-3 compares the performance of SkyBridge (applied SQLite3 has an internal buffer for storing recent data and on Fiasco.OC) and UnderBridge (applied on ChCore) in this may handle the queries without issuing IPCs. Compared with micro-benchmark. Results are similar when we apply them SkyBridge, the performance improvement of UnderBridge to other microkernels and thus omitted. If an application in- (up to 65%) is because a single IPC from SQLite3 to xv6fs vokes a server without causing any cross-server IPCs, Sky- is likely to trigger multiple cross-server IPCs between xv6fs Bridge (437 cycles) shows better performance than Under- and RAMdisk. Even compared with the emulated perfor- Bridge (723 cycles). It is because UnderBridge has to switch mance of monolithic kernels, UnderBridge only introduces the privilege level and the address space for transferring the about 5.0% overhead. control flow from user space to kernel space, while Sky- The above-tested xv6fs (exactly the same one used in Bridge applies more lightweight EPT switching via vmfunc. the SkyBridge paper [62]) contains no page cache, which, Nevertheless, UnderBridge still outperforms the best native thus, emulates an IPC-intensive scenario. We further en- IPC (1450 cycles in seL4), because of reducing one privilege able the page cache in xv6fs to show how UnderBridge switch and minimizing the software logic of an IPC. performs with fewer cross-server IPCs between xv6fs and As the number of cross-server IPCs increases, the latency RAMdisk. As shown in Figure-6(d), 6(e) and 6(f), the per- of SkyBridge increases in proportion. It is because that IPC formance improvement of UnderBridge is still obvious. Un- from application to the server and between servers are sym- derBridge shows up to 4.0×, 2.9×, 0.7× and 3.2× speedup

Figure 6: Normalized throughput of basic SQLite3 operations (insert, update, query, delete). (a) Zircon (xv6fs w/o page cache); (b) seL4 (xv6fs w/o page cache); (c) Fiasco.OC (xv6fs w/o page cache); (d) Zircon (xv6fs w/ page cache); (e) seL4 (xv6fs w/ page cache); (f) Fiasco.OC (xv6fs w/ page cache). Each panel compares the native IPC (w/ and w/o KPTI), the corresponding kernel with UnderBridge and with SkyBridge, and the emulated monolithic kernel (w/ and w/o KPTI).

Since UnderBridge runs in non-root mode, we also count the number of VMExits, which is a known cost of virtualization. Thanks to the careful design of RootKernel and our secure monitor, there are almost zero VMExits during the experiments. For example, at most one VMExit (due to the timer) happens during the query test.

YCSB Benchmark. We also evaluate SQLite3 against YCSB workloads. Figure-7(a) and 7(b) show the normalized throughput of YCSB-A (50% update and 50% query) with the page cache disabled and enabled in xv6fs, separately. We use the same baseline as in the basic operation evaluation. Even for seL4, which is the most efficient among the three microkernels, UnderBridge improves the application's throughput by 35% to 105%. UnderBridge also brings better performance (by 7% to 35%) than SkyBridge. Besides, it is only slightly slower (3.3% on average) than the monolithic kernel. Other YCSB workloads give similar results.

Furthermore, Figure-7(c) gives a detailed analysis of the experiments with the page cache enabled in xv6fs. First, the ratio of IPCs from SQLite3 to xv6fs (application-to-server) and from xv6fs to RAMdisk (cross-server) is about 1:2. Thus, UnderBridge outperforms SkyBridge, according to Table-3. Second, it helps to reduce the ratio of time spent on IPCs to around 11%, while the other three microkernels spend at least 30% of the time on IPCs. Third, it makes the application and the servers execute faster (by about 10%) owing to fewer indirect costs such as cache/TLB pollution.

Server Migration. We also evaluate the performance of server migration, although it should rarely happen. We still run SQLite3, xv6fs, and RAMdisk as above on Zircon and trigger the server migration. Taking RAMdisk (a 128 MB virtual memory range) as an example, migrating it from kernel to user takes about 84,361 cycles. Most of the cycles (83,517) are spent on modifying the kernel page table to free the domain ID (i.e., the third step for migrating a server specified in § 3.3). Migrating RAMdisk from user to kernel takes more cycles (90,189), mainly because more cycles are spent waiting for on-going IPCs to finish.
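The dominant cost reported above, freeing the domain ID, amounts to rewriting the protection-key field of every kernel page-table entry that maps the server's memory. The sketch below only illustrates that step under stated assumptions; it is not ChCore's actual code, and kernel_pte_lookup() and tlb_shootdown() are assumed helpers.

/*
 * Hypothetical sketch: retag the protection-key bits of the kernel PTEs
 * covering a migrated server so its 4-bit domain ID can be reused.
 * On x86-64, the protection key occupies PTE bits 62:59.
 */
#include <stdint.h>
#include <stddef.h>

#define PTE_PKEY_SHIFT 59
#define PTE_PKEY_MASK  (0xFULL << PTE_PKEY_SHIFT)
#define PAGE_SIZE      4096UL

extern uint64_t *kernel_pte_lookup(uintptr_t va);  /* assumed helper */
extern void tlb_shootdown(uintptr_t va);           /* assumed helper */

/* Retag [start, start + size) from old_pkey to new_pkey (e.g., key 0). */
static void retag_domain_range(uintptr_t start, size_t size,
                               unsigned old_pkey, unsigned new_pkey)
{
    for (uintptr_t va = start; va < start + size; va += PAGE_SIZE) {
        uint64_t *pte = kernel_pte_lookup(va);
        if (!pte || ((*pte & PTE_PKEY_MASK) >> PTE_PKEY_SHIFT) != old_pkey)
            continue;           /* unmapped, or tagged for another domain */
        *pte = (*pte & ~PTE_PKEY_MASK) |
               ((uint64_t)new_pkey << PTE_PKEY_SHIFT);
        tlb_shootdown(va);      /* the stale tag must not linger in TLBs */
    }
}

The cost of this walk grows with the number of PTEs covering the server's virtual memory range, which is consistent with a 128 MB region dominating the measured migration time.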
HTTP Server Evaluation. For running an HTTP server (a user-space application), we create three system servers: a socket server, a TCP/IP protocol stack server, and a loopback network device driver (not involving a real network device) atop the lwIP library [31]. We measure the throughput of a simple HTTP server from the client side; the server receives requests from the network and sends back a static HTML page. We perform this evaluation on Zircon. As shown in Figure-7(d), UnderBridge improves the throughput of the HTTP server by 4.4×. We also implement the same benchmark with SkyBridge. UnderBridge outperforms SkyBridge by about 24% because a network request also triggers multiple cross-server IPCs.
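As a rough illustration of the measured workload, a static-page responder on lwIP's raw TCP API can look like the following sketch; the port, page content, and error handling are simplified, and this is not the benchmark's actual code.

/* Minimal static-page HTTP responder using lwIP's raw TCP API. */
#include "lwip/tcp.h"

static const char resp[] =
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
    "Content-Length: 15\r\n\r\n<html>ok</html>";

static err_t on_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
{
    if (p == NULL)                  /* remote side closed the connection */
        return tcp_close(pcb);
    tcp_recved(pcb, p->tot_len);    /* acknowledge the received request  */
    pbuf_free(p);
    /* Reply with the static page; COPY so resp need not stay referenced. */
    tcp_write(pcb, resp, sizeof(resp) - 1, TCP_WRITE_FLAG_COPY);
    tcp_output(pcb);
    return ERR_OK;
}

static err_t on_accept(void *arg, struct tcp_pcb *newpcb, err_t err)
{
    tcp_recv(newpcb, on_recv);      /* serve each new connection */
    return ERR_OK;
}

void http_server_init(void)
{
    struct tcp_pcb *pcb = tcp_new();
    tcp_bind(pcb, IP_ADDR_ANY, 80);
    pcb = tcp_listen(pcb);
    tcp_accept(pcb, on_accept);
}

Every request received by the socket server is handed to the TCP/IP stack server and the loopback driver through IPCs, which is why the end-to-end throughput is sensitive to the cross-server IPC cost.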

Figure 7: (a) YCSB-A (xv6fs w/o page cache); (b) YCSB-A (xv6fs w/ page cache); (c) Breakdown analysis; (d) HTTP server. (a), (b), and (d) share the same legend (Native w/ KPTI, Native w/o KPTI, UnderBridge, SkyBridge, Monolithic, Monolithic w/o KPTI) and show normalized throughput. (c) shows the time breakdown of the YCSB-A benchmark into SQLite3-xv6fs IPC, xv6fs-RAMdisk IPC, and real work.

7 Related Work

Reconstructing monolithic kernels. The development of monolithic kernels follows the philosophy of modularization, but the kernel components are not isolated from each other. With reliability and security attracting a fair amount of attention, we witness interest in reconstructing monolithic kernels to achieve better fault isolation and higher security [17, 25, 26, 38, 41, 59, 63, 65, 67, 74, 75, 84, 86]. Dautenhahn et al. [25] build one memory protection domain inside the BSD kernel and run a small trusted kernel in that domain to control memory mappings. Mondrix [84] implements a compartmentalized Linux kernel with eleven isolated modules based on customized security hardware [83]. Proskurin et al. [65] propose to use Intel EPT and vmfunc to isolate critical kernel data in different domains. Nooks [74] and LXFI [59] focus on improving the reliability of Linux by isolating kernel modules, especially device drivers.

Our work does share some similarities with prior work on intra-(monolithic)kernel isolation. Nevertheless, we focus on accelerating IPCs for microkernel architectures while maintaining strong isolation (on both ends). We need little modification/instrumentation of the system servers of a microkernel. This is because the system servers of a microkernel are designed to run in different user processes, and all the interactions are explicit IPCs, which is different from the subsystems in Linux (which have no clear boundaries and have complex shared-memory references). We achieve intra-kernel isolation by retrofitting Intel MPK, which is lightweight and commercially available, to build multiple execution domains in kernel space. Furthermore, UnderBridge may also be generalized to other kernel scopes with more effort in the future. On one side, the proposed abstraction of the execution domain can be extended to accommodate different kernel modules in monolithic kernels, and the IPC gates can still be used to handle interactions between those modules. On the other side, our design can also be applied in kernels written in memory-safe languages [11, 24] to isolate some unsafe code (e.g., the code with the "unsafe" tag in Rust).

Accelerating IPCs. Optimizing the performance of IPC in microkernels has been studied continuously for a long time [19, 36, 44, 50, 52, 82]. For example, LRPC [19] eliminates the scheduling overhead during an IPC by using the thread-migration model [36, 37]; seL4 [44] provides the fast-path IPC, which also avoids scheduling and passes arguments through registers. Nevertheless, even with these software-based optimizations, IPC-intensive applications on microkernels still suffer from the IPC overhead.

Recent studies present new designs to accelerate IPCs with advanced hardware features. SkyBridge [62] utilizes vmfunc to allow a process to invoke a function in another process directly without the kernel's involvement. XPC [30] is a hardware proposal that accelerates IPCs by implementing efficient context switching and memory granting. dIPC [79] relies on another hardware proposal [78] to put processes into the same address space and thus make IPCs faster. Our design shows better performance than SkyBridge and requires no hardware modification.

Usage of Intel MPK. Intel MPK/PKU has been utilized in prior work [39, 64, 77] to achieve efficient intra-process isolation, which can build mutually-distrusted execution environments in a single user process. There are two main differences between UnderBridge and this prior work. First, UnderBridge retrofits MPK, which is designed for user space, in kernel space to improve the IPC performance of microkernels (the first such effort to our knowledge), and it also faces more security challenges. Applications can still utilize MPK in user space as they want since they have different page tables from the kernel. Second, UnderBridge authenticates the caller of each MPK gate to enforce the IPC capability and prevent illegal domain switches, while prior work allows arbitrary domain switches (a simplified gate sketch is given at the end of this section).

There exist two studies [38, 73] concurrent with UnderBridge. They also propose to utilize MPK in kernel mode but with different goals and designs. IskiOS [38] leverages MPK to defend against code-reuse attacks (e.g., protecting shadow stacks) in the kernel. Sung et al. [73] use MPK to enhance the isolation of a unikernel but do not solve the security challenges identified in UnderBridge.

MPK-like features on other architectures. Tagged memory [28, 43, 85], which can provide MPK-like features, has been added to other architectures, which brings the potential to make UnderBridge more general. Recently, the RISC-V security community has also considered enhancing PMP (physical memory protection) isolation with the tagged memory mechanism [12]. However, UnderBridge cannot work on current ARMv8 (aarch64). Yet, ARMv8.5 has included memory tagging extensions [10]; by extending them with more mechanisms, we may provide a similar mechanism workable on future ARM platforms.
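To make the difference concrete, the following is a simplified, hypothetical sketch of an intra-kernel PKU gate that authenticates its caller before switching domains; the gate layout, the allowed_callers bitmap, and server_entry are illustrative assumptions, not ChCore's implementation.

/*
 * Hypothetical PKU gate sketch. PKRU holds two bits per protection key
 * (access-disable and write-disable), so "open only domain d" clears
 * d's two bits (plus key 0, assumed shared).
 */
#include <stdint.h>

struct ipc_gate {
    uint16_t callee_domain;     /* protection key of the target server  */
    uint16_t allowed_callers;   /* bitmap: domains that hold the cap    */
    long (*server_entry)(long arg);
};

static inline void wrpkru(uint32_t pkru)
{
    __asm__ volatile("wrpkru" : : "a"(pkru), "c"(0), "d"(0) : "memory");
}

/* PKRU value that opens only protection key d (and shared key 0). */
static inline uint32_t pkru_allow(unsigned d)
{
    return ~((3u << (2 * d)) | 3u);
}

long ipc_gate_call(struct ipc_gate *gate, unsigned caller_domain, long arg)
{
    long ret;

    /* Authenticate the caller: only capability holders may pass. */
    if (!(gate->allowed_callers & (1u << caller_domain)))
        return -1;

    wrpkru(pkru_allow(gate->callee_domain));   /* enter callee's domain  */
    ret = gate->server_entry(arg);             /* run the server routine */
    wrpkru(pkru_allow(caller_domain));         /* switch back to caller  */
    return ret;
}

A real design must additionally ensure that wrpkru is reachable only through such checked gate code (e.g., by controlling executable mappings), which is part of the security challenges mentioned above.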
8 Conclusion

This paper introduces UnderBridge, a redesign of the runtime structure of microkernel OSes for faster OS services. To demonstrate UnderBridge's efficiency, we have built a prototype microkernel named ChCore and ported it to three existing microkernels. Performance evaluations showed that UnderBridge can achieve better performance in IPC-intensive workloads compared with prior work.

9 Acknowledgement

We sincerely thank our shepherd Antonio Barbalace and the anonymous reviewers for their insightful suggestions. This work is supported in part by the China National Natural Science Foundation (No. 61925206, 61972244, and U19A2060) and a grant from the Science and Technology Commission of Shanghai Municipality (No. 19511121100). Yubin Xia is the corresponding author.

References

[1] Apple iOS Security — iOS 12.1. https://www.apple.com/business/site/docs/iOS_Security_Guide.pdf. Referenced December 2019.
[2] Fiasco.OC repository. https://l4re.org/download/snapshots/. Referenced December 2019.
[3] Fuchsia. https://fuchsia.dev/fuchsia-src. Referenced December 2019.
[4] Fuchsia repository. https://fuchsia.dev/fuchsia-src/development/source_code. Referenced December 2019.
[5] Ian Cutress: Analyzing Core i9-9900K performance with Spectre and Meltdown hardware mitigations. https://www.anandtech.com/show/13659/analyzing-core-i9-9900k-performance-with-spectre-and-meltdown-hardware-mitigations. Referenced December 2019.
[6] Intel Corporation. Engineering new protections into hardware. https://www.intel.com/content/www/us/en/architecture-and-technology/engineering-new-protections-into-hardware.html. Referenced December 2019.
[7] Intel software developer's manual. https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf. Referenced December 2019.
[8] Kernel page-table isolation. https://en.wikipedia.org/wiki/Kernel_page-table_isolation. Referenced December 2019.
[9] Linux kernel CVEs. https://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33. Referenced December 2019.
[10] Memory tagging in ARMv8.5-A. https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-a-profile-architecture-2018-developments-armv85a. Referenced May 2020.
[11] Redox. https://www.redox-os.org/. Referenced December 2019.
[12] RISC-V ISA specification. https://riscv.org/specifications/. Referenced May 2020.
[13] seL4 performance report. http://sel4.systems/About/Performance/. Referenced December 2019.
[14] seL4 repository. https://github.com/seL4/seL4. Referenced December 2019.
[15] SQLite. https://www.sqlite.org/index.html. Referenced December 2019.
[16] xv6 repository. https://github.com/mit-pdos/fscq/tree/master/xv6. Referenced December 2019.
[17] Ahmed M. Azab, Peng Ning, Jitesh Shah, Quan Chen, Rohan Bhutkar, Guruprasad Ganesh, Jia Ma, and Wenbo Shen. Hypervision across worlds: Real-time kernel protection from the ARM TrustZone secure world. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 90–102. ACM, 2014.
[18] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 29–44, New York, NY, USA, 2009. ACM.
[19] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight remote procedure call. ACM Trans. Comput. Syst., 8(1):37–55, February 1990.
[20] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking data on Meltdown-resistant CPUs. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS '19, pages 769–784, New York, NY, USA, 2019. ACM.
[21] Stephen Checkoway, Lucas Davi, Alexandra Dmitrienko, Ahmad-Reza Sadeghi, Hovav Shacham, and Marcel Winandy. Return-oriented programming without returns. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS '10, pages 559–572, New York, NY, USA, 2010. ACM.
[22] Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, Nickolai Zeldovich, and M. Frans Kaashoek. Linux kernel vulnerabilities: State-of-the-art defenses and open problems. In Proceedings of the Second Asia-Pacific Workshop on Systems, APSys '11, pages 5:1–5:5, New York, NY, USA, 2011. ACM.
[23] Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. Using Crash Hoare Logic for certifying the FSCQ file system. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 18–37, New York, NY, USA, 2015. Association for Computing Machinery.
[24] Cody Cutler, M. Frans Kaashoek, and Robert T. Morris. The benefits and costs of writing a POSIX kernel in a high-level language. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 89–105, 2018.
[25] Nathan Dautenhahn, Theodoros Kasampalis, Will Dietz, John Criswell, and Vikram Adve. Nested kernel: An operating system architecture for intra-kernel privilege separation. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 191–206, New York, NY, USA, 2015. ACM.

[26] Lucas Davi, David Gens, Christopher Liebchen, and Ahmad-Reza Sadeghi. PT-Rand: Practical mitigation of data-only attacks against page tables. In NDSS, 2017.
[27] Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, and Roy H. Campbell. CuriOS: Improving reliability through operating system structure. pages 59–72, 2008.
[28] Udit Dhawan, Nikos Vasilakis, Raphael Rubin, Silviu Chiricescu, Jonathan M. Smith, Thomas F. Knight Jr., Benjamin C. Pierce, and André DeHon. PUMP: A programmable unit for metadata processing. In Proceedings of the Third Workshop on Hardware and Architectural Support for Security and Privacy, pages 1–8, 2014.
[29] Mingkai Dong, Heng Bu, Jifei Yi, Benchao Dong, and Haibo Chen. Performance and protection in the ZoFS user-space NVM file system. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 478–493, New York, NY, USA, 2019. ACM.
[30] Dong Du, Zhichao Hua, Yubin Xia, Binyu Zang, and Haibo Chen. XPC: Architectural support for secure and efficient cross process call. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, pages 671–684, New York, NY, USA, 2019. ACM.
[31] Adam Dunkels. lwIP - a lightweight TCP/IP stack. Available from World Wide Web: http://www.sics.se/~adam/lwip/index.html, 2002.
[32] Kevin Elphinstone and Gernot Heiser. From L3 to seL4: What have we learnt in 20 years of L4 microkernels? In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 133–150, New York, NY, USA, 2013. ACM.
[33] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP '95, pages 251–266, New York, NY, USA, 1995. ACM.
[34] Úlfar Erlingsson, Martín Abadi, Michael Vrable, Mihai Budiu, and George C. Necula. XFI: Software guards for system address spaces. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, pages 75–88, Berkeley, CA, USA, 2006. USENIX Association.
[35] Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back, and Stephen Clawson. Microkernels meet recursive virtual machines. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, OSDI '96, pages 137–151, New York, NY, USA, 1996. ACM.
[36] Bryan Ford and Jay Lepreau. Evolving Mach 3.0 to a migrating thread model. In Proceedings of the USENIX Winter 1994 Technical Conference, WTEC'94, pages 9–9, Berkeley, CA, USA, 1994. USENIX Association.
[37] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 87–100, Berkeley, CA, USA, 1999. USENIX Association.
[38] Spyridoula Gravani, Mohammad Hedayati, John Criswell, and Michael L. Scott. IskiOS: Lightweight defense against kernel-level code-reuse attacks. arXiv preprint arXiv:1903.04654, 2019.
[39] Mohammad Hedayati, Spyridoula Gravani, Ethan Johnson, John Criswell, Michael L. Scott, Kai Shen, and Mike Marty. Hodor: Intra-process isolation for high-throughput data plane libraries. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '19, pages 489–503, Berkeley, CA, USA, 2019. USENIX Association.
[40] Dan Hildebrand. An architectural overview of QNX. In Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, pages 113–126, Berkeley, CA, USA, 1992. USENIX Association.
[41] Charles Jacobsen, Muktesh Khole, Sarah Spall, Scotty Bauer, and Anton Burtsev. Lightweight capability domains: Towards decomposing the Linux kernel. SIGOPS Oper. Syst. Rev., 49(2):44–50, January 2016.
[42] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address space layout randomization with Intel TSX. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 380–392, New York, NY, USA, 2016. ACM.
[43] Alexandre Joannou, Jonathan Woodruff, Robert Kovacsics, Simon W. Moore, Alex Bradbury, Hongyan Xia, Robert N. M. Watson, David Chisnall, Michael Roe, Brooks Davis, et al. Efficient tagged memory. In 2017 IEEE International Conference on Computer Design (ICCD), pages 641–648. IEEE, 2017.
[44] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 207–220, New York, NY, USA, 2009. ACM.
[45] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, et al. Spectre attacks: Exploiting speculative execution. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1–19. IEEE, 2019.

[46] Koen Koning, Xi Chen, Herbert Bos, Cristiano Giuffrida, and Elias Athanasopoulos. No need to hide: Protecting safe regions on commodity hardware. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 437–452, New York, NY, USA, 2017. ACM.
[47] Adam Lackorzyński, Alexander Warg, Marcus Völp, and Hermann Härtig. Flattening hierarchical scheduling. In Proceedings of the Tenth ACM International Conference on Embedded Software, EMSOFT '12, pages 93–102, New York, NY, USA, 2012. ACM.
[48] Hojoon Lee, Chihyun Song, and Brent Byunghoon Kang. Lord of the x86 rings: A portable user mode privilege separation architecture on x86. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, pages 1441–1454, New York, NY, USA, 2018. Association for Computing Machinery.
[49] R. Levin, E. Cohen, W. Corwin, F. Pollack, and W. Wulf. Policy/mechanism separation in Hydra. In Proceedings of the Fifth ACM Symposium on Operating Systems Principles, SOSP '75, pages 132–140, New York, NY, USA, 1975. ACM.
[50] Henry M. Levy. Capability-Based Computer Systems. Butterworth-Heinemann, Newton, MA, USA, 1984.
[51] Chuanpeng Li, Chen Ding, and Kai Shen. Quantifying the cost of context switch. In Proceedings of the 2007 Workshop on Experimental Computer Science, ExpCS '07, New York, NY, USA, 2007. ACM.
[52] Jochen Liedtke. Improving IPC by kernel design. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, SOSP '93, pages 175–188, New York, NY, USA, 1993. ACM.
[53] Jochen Liedtke. A persistent system in real use - experiences of the first 13 years. pages 2–11, 1994.
[54] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, et al. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium (USENIX Security 18), pages 973–990, 2018.
[55] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading kernel memory from user space. In Proceedings of the 27th USENIX Conference on Security Symposium, SEC'18, pages 973–990, Berkeley, CA, USA, 2018. USENIX Association.
[56] James Litton, Anjo Vahldiek-Oberwagner, Eslam Elnikety, Deepak Garg, Bobby Bhattacharjee, and Peter Druschel. Light-weight contexts: An OS abstraction for safety and performance. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 49–64, Berkeley, CA, USA, 2016. USENIX Association.
[57] Yutao Liu, Tianyu Zhou, Kexin Chen, Haibo Chen, and Yubin Xia. Thwarting memory disclosure with efficient hypervisor-enforced intra-domain isolation. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS '15, pages 1607–1619, New York, NY, USA, 2015. ACM.
[58] Anna Lyons, Kent McLeod, Hesham Almatary, and Gernot Heiser. Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, pages 26:1–26:16, New York, NY, USA, 2018. ACM.
[59] Yandong Mao, Haogang Chen, Dong Zhou, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. Software fault isolation with API integrity and multi-principal modules. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 115–128, 2011.
[60] Stephen McCamant and Greg Morrisett. Evaluating SFI for a CISC architecture. In Proceedings of the 15th Conference on USENIX Security Symposium - Volume 15, USENIX-SS'06, Berkeley, CA, USA, 2006. USENIX Association.
[61] Steven McCanne and Van Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In USENIX Winter, volume 46, 1993.
[62] Zeyu Mi, Dingji Li, Zihan Yang, Xinran Wang, and Haibo Chen. SkyBridge: Fast and secure inter-process communication for microkernels. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, pages 9:1–9:15, New York, NY, USA, 2019. ACM.
[63] Vikram Narayanan, Yongzhe Huang, Gang Tan, Trent Jaeger, and Anton Burtsev. Lightweight kernel isolation with virtualization and VM functions. In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 157–171, 2020.
[64] Soyeon Park, Sangho Lee, Wen Xu, Hyungon Moon, and Taesoo Kim. libmpk: Software abstraction for Intel memory protection keys (Intel MPK). In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '19, pages 241–254, Berkeley, CA, USA, 2019. USENIX Association.
[65] Sergej Proskurin, Marius Momeu, Seyedhamed Ghavamnia, Vasileios P. Kemerlis, and Michalis Polychronakis. xMP: Selective memory protection for kernel and user space. In Proceedings of the 41st IEEE Symposium on Security and Privacy, S&P '20, 2020.

[66] Franklin Reynolds. An architectural overview of Alpha: A real-time, distributed kernel. In Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, pages 127–146, Berkeley, CA, USA, 1992. USENIX Association.
[67] O. Schwahn, S. Winter, N. Coppik, and N. Suri. How to fillet a penguin: Runtime data driven partitioning of Linux code. IEEE Transactions on Dependable and Secure Computing, 15(6):945–958, Nov 2018.
[68] David Sehr, Robert Muth, Cliff Biffle, Victor Khimenko, Egor Pasko, Karl Schimpf, Bennet Yee, and Brad Chen. Adapting software fault isolation to contemporary CPU architectures. In Proceedings of the 19th USENIX Conference on Security, USENIX Security'10, pages 1–1, Berkeley, CA, USA, 2010. USENIX Association.
[69] Hovav Shacham. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, pages 552–561, New York, NY, USA, 2007. ACM.
[70] Jonathan S. Shapiro, Jonathan M. Smith, and David J. Farber. EROS: A fast capability system. In Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, SOSP '99, pages 170–185, New York, NY, USA, 1999. ACM.
[71] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 33–46, Berkeley, CA, USA, 2010. USENIX Association.
[72] Udo Steinberg and Bernhard Kauer. NOVA: A microhypervisor-based secure virtualization architecture. In Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pages 209–222, New York, NY, USA, 2010. ACM.
[73] Mincheol Sung, Pierre Olivier, Stefan Lankes, and Binoy Ravindran. Intra-unikernel isolation with Intel memory protection keys. In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 143–156, 2020.
[74] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the reliability of commodity operating systems. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 207–222, New York, NY, USA, 2003. ACM.
[75] Donghai Tian, Xi Xiong, Changzhen Hu, and Peng Liu. A policy-centric approach to protecting OS kernel from vulnerable LKMs. Software: Practice and Experience, 48(6):1269–1284, 2018.
[76] Dan Tsafrir. The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In Experimental Computer Science, ECS '07, pages 3–3, Berkeley, CA, USA, 2007. USENIX Association.
[77] Anjo Vahldiek-Oberwagner, Eslam Elnikety, Nuno O. Duarte, Michael Sammler, Peter Druschel, and Deepak Garg. ERIM: Secure, efficient in-process isolation with protection keys (MPK). In Proceedings of the 28th USENIX Conference on Security Symposium, SEC'19, pages 1221–1238, Berkeley, CA, USA, 2019. USENIX Association.
[78] Lluís Vilanova, Muli Ben-Yehuda, Nacho Navarro, Yoav Etsion, and Mateo Valero. CODOMs: Protecting software with code-centric memory domains. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 469–480. IEEE, 2014.
[79] Lluís Vilanova, Marc Jordà, Nacho Navarro, Yoav Etsion, and Mateo Valero. Direct inter-process communication (dIPC): Repurposing the CODOMs architecture to accelerate IPC. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 16–31, New York, NY, USA, 2017. ACM.
[80] Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. Efficient software-based fault isolation. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, SOSP '93, pages 203–216, New York, NY, USA, 1993. ACM.
[81] Xi Wang, David Lazar, Nickolai Zeldovich, Adam Chlipala, and Zachary Tatlock. Jitk: A trustworthy in-kernel interpreter infrastructure. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 33–47, 2014.
[82] Robert N. M. Watson, Robert M. Norton, Jonathan Woodruff, Simon W. Moore, Peter G. Neumann, Jonathan Anderson, David Chisnall, Brooks Davis, Ben Laurie, Michael Roe, Nirav H. Dave, Khilan Gudka, Alexandre Joannou, A. Theodore Markettos, Ed Maste, Steven J. Murdoch, Colin Rothwell, Stacey D. Son, and Munraj Vadera. Fast protection-domain crossing in the CHERI capability-system architecture. IEEE Micro, 36(5):38–49, September 2016.
[83] Emmett Witchel, Josh Cates, and Krste Asanović. Mondrian memory protection. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 304–316, New York, NY, USA, 2002. ACM.
[84] Emmett Witchel, Junghwan Rhee, and Krste Asanović. Mondrix: Memory isolation for Linux using Mondriaan memory protection. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP '05, pages 31–44, New York, NY, USA, 2005. ACM.

[85] Jonathan Woodruff, Robert N. M. Watson, David Chisnall, Simon W. Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G. Neumann, Robert Norton, and Michael Roe. The CHERI capability model: Revisiting RISC in an age of risk. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 457–468. IEEE, 2014.
[86] Xi Xiong, Donghai Tian, Peng Liu, et al. Practical protection of kernel integrity for commodity OS from untrusted extensions. In NDSS, volume 11, 2011.
[87] Bennet Yee, David Sehr, Gregory Dardyk, J. Bradley Chen, Robert Muth, Tavis Ormandy, Shiki Okasaka, Neha Narula, and Nicholas Fullagar. Native Client: A sandbox for portable, untrusted x86 native code. In Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, SP '09, pages 79–93, Washington, DC, USA, 2009. IEEE Computer Society.
[88] Bennet Yee, David Sehr, Gregory Dardyk, J. Bradley Chen, Robert Muth, Tavis Ormandy, Shiki Okasaka, Neha Narula, and Nicholas Fullagar. Native Client: A sandbox for portable, untrusted x86 native code. In 2009 30th IEEE Symposium on Security and Privacy, pages 79–93. IEEE, 2009.
