TECHNICAL UNIVERSITY OF MUNICH

DEPARTMENT OF INFORMATICS

Master’s Thesis in Informatics

Leveraging Hardware-Assisted TEEs to Protect Host Secrets in an OS-Level Virtualization Environment

Nutzung hardwareunterstützter TEEs zum Schutz von Hostgeheimnissen in einer Virtualisierungsumgebung auf Betriebssystemebene

Author: Martin Radev
Supervisor: Prof. Dr. Claudia Eckert
Advisor: Christian Epple
Submission Date: 15.11.2020

I confirm that this master’s thesis in informatics is my own work and I have documented all sources and material used.

Munich, 15.11.2020

Martin Radev

Abstract

Linux is a prevalent kernel containing millions of lines of complex code, whose correctness and security are difficult to verify. Thousands of bugs are discovered in the kernel each year, and hundreds of them are considered to have a security impact. Modern kernels rely on software and hardware defenses to mitigate the exploitation of such bugs, but the usefulness of these defenses can vary. One security-critical component of the kernel is the Kernel Crypto API (KCAPI), which is used by various security-related components to store encryption keys and to perform cryptographic operations. If an attacker can exploit a memory disclosure vulnerability in Linux to steal the encryption keys, the attacker may be able to read the content of encrypted storage mediums, or to read and modify messages over encrypted communication channels. For this reason, various academic solutions have been proposed for protecting cryptographic secrets in Linux, but their applicability and performance vary significantly.

This thesis researches, designs and implements a new solution — SE-Vault — for protecting cryptographic secrets by storing the secrets in a Trusted Execution Environment (TEE) and by performing the cryptographic transformations within it. The TEE is built as a Virtual Machine (VM) whose memory is encrypted with the AMD Secure Encrypted Virtualization (SEV) hardware feature. By using SEV, SE-Vault transparently addresses memory disclosure attacks through hardware-assisted memory encryption, and additionally hardens the cryptographic component against other attack vectors. In this work, I present two implementations of the TEE: one using a Linux VM with built-in SEV support, and one using the seL4 microkernel, to which I added SEV support. Both implementations use the Vhost and VirtIO interfaces for efficient communication of encryption keys and encryption requests.

An empirical security and performance evaluation shows that SE-Vault can protect various Host cryptographic secrets against memory disclosure attacks, and can significantly outperform other similar solutions in request throughput. The solution can protect disk encryption keys, keys registered to the KCAPI and OpenSSL keys, while degrading performance by 50%.

This thesis shows that recent technologies for confidential computing can be repurposed to protect security-critical components of the Linux kernel. Modern operating system design can build upon this work by isolating and protecting other components in encrypted virtual machines.

Contents

Abstract ...... iii

1 Introduction ...... 1
  1.1 Research Goals ...... 2
  1.2 Results ...... 2
  1.3 Outline ...... 4

2 Background ...... 5
  2.1 Virtualization ...... 5
    2.1.1 QEMU ...... 5
    2.1.2 KVM ...... 6
    2.1.3 AMD Virtualization ...... 7
  2.2 AMD Secure Encrypted Virtualization ...... 10
    2.2.1 Secure Encrypted Virtualization ...... 11
    2.2.2 SEV - Encrypted State ...... 14
    2.2.3 Attestation and Secret Provisioning ...... 17
  2.3 Trusted Execution Environment ...... 20
  2.4 VirtIO ...... 22
  2.5 Vhost ...... 27
  2.6 Kernel Crypto API ...... 29
  2.7 Disk Encryption ...... 30

3 Related Work ...... 33
  3.1 Protection of Kernel Cryptographic Secrets ...... 33
  3.2 Development of Secure Virtualized Environments ...... 36

4 Design of SE-Vault ...... 39
  4.1 Design Overview ...... 39
  4.2 Attacker Model ...... 42
  4.3 Host-Guest Communication ...... 42
  4.4 User Interfaces ...... 45

5 Implementation ...... 49
  5.1 QEMU Device ...... 50
  5.2 Host Linux Driver ...... 51
    5.2.1 Initialization ...... 51
    5.2.2 Data Communication ...... 52
    5.2.3 ABI ...... 58
    5.2.4 Kernel Crypto API Cipher ...... 60
  5.3 Guest Linux Driver ...... 60
  5.4 Code Hardenings against Memory Disclosure Attacks ...... 63
  5.5 Guest Driver Portability to the seL4 Microkernel ...... 65
    5.5.1 Booting seL4 ...... 66
    5.5.2 SEV Guest Support ...... 67
    5.5.3 VirtIO Support ...... 69
    5.5.4 Porting SE-Vault ...... 72

6 Evaluation ...... 74
  6.1 Security Evaluation ...... 75
  6.2 Correctness Evaluation ...... 78
  6.3 Performance Evaluation ...... 79
    6.3.1 Throughput Measurements ...... 80
    6.3.2 Latency Measurements ...... 85

7 Attacks ...... 88
  7.1 Kernel Memory Disclosure Attacks ...... 88
  7.2 Register State Attacks ...... 89
  7.3 NPT Corruption Attacks ...... 90
  7.4 Memory Corruption Attacks ...... 91
  7.5 Attacks Summary ...... 95

8 Discussion and Future Work ...... 97

9 Conclusion ...... 100

List of Figures ...... 103

List of Tables ...... 107

Bibliography ...... 113

1 Introduction

Widely available Operating Systems rely on complex monolithic kernels to manage computer resources, and the correctness and security of these kernels are difficult to assess. To improve the security of such complex systems, both software and hardware designers have implemented code hardenings and new hardware features to mitigate certain attack vectors. Examples of such code hardenings are Kernel Address Space Layout Randomization (KASLR) and stack canaries [KZ13], and examples of such hardware security features are Supervisor Mode Access Prevention (SMAP) and Supervisor Mode Execution Prevention (SMEP) [Cor20b]. However, one critical issue remains: software complexity leads to security vulnerabilities, which will be exploited if not fixed. The Linux kernel is an example of a popular monolithic kernel, which is used on mobile devices, personal computers, and servers. Due to the complexity of the kernel and its device drivers, multiple security vulnerabilities in the Linux kernel have been discovered and reported throughout the years [Det]. In 2019 alone, five code execution and 18 information disclosure vulnerabilities were recorded. A malicious actor may use one of these vulnerabilities to exfiltrate secret data from typically inaccessible memory. Such vulnerabilities are especially critical on shared systems, where the users may not be known and may have malicious intentions. Such a scenario occurs naturally in a cloud environment, where a user can be provided computational resources in a restricted session such as a VM. If the user is able to escape the restricted environment, the user may attack the Host kernel or other VMs to exfiltrate precious secrets. Examples of such secrets are cryptographic keys, which are used to secure communication with external entities, to protect storage devices, and to provide confidentiality of expensive digital content. Such keys are high-value targets, since with them an attacker may be able to read or modify confidential communication, decrypt storage devices, and steal valuable digital content. Multiple academic solutions have been proposed to provide additional protection for cryptographic keys in the Linux kernel; these rely on typically unused hardware resources [MFD11; Sim11; Gua+14b], on the side-effects of hardware features [Gua+15], or on explicit hardware protection from the CPU [RGM16]. The proposal of Richter et al. [RGM16] — Tresor-SGX — is a promising solution which stores kernel cryptographic secrets in a TEE built with the Intel Software Guard Extensions (SGX) feature [CD16]. Tresor-SGX

protects the confidentiality and integrity of the secrets even from privileged adversaries, but suffers from two major limitations: availability and performance. The solution is only available on recent Intel CPUs, and is not deployable on CPUs of other vendors. The solution also introduces a 1000x performance degradation, which essentially renders its current implementation unusable in a production system. This raises at least two research questions: Can a similar solution be built on CPUs of other vendors, such as AMD? And can its performance be improved to make the solution practical?

1.1 Research Goals

The first goal of this thesis is to research and design a Trusted Execution Environment (TEE) based on the AMD SEV hardware feature which can protect encryption keys against memory disclosure attacks, and which can perform cryptographic transformations with these keys without exposing them. As the main use case, the solution should protect disk encryption and network communication keys, which imposes the requirement that the solution must not incur significant performance degradation over existing software. Thus, the first goal also mandates the research and development of an efficient communication channel between the Host and the TEE. The second goal of this thesis is to implement the proposed design and verify its portability across operating systems. By providing a Proof-of-Concept (PoC) implementation, the security and performance of the design can be empirically verified, which is paramount in determining the practical usefulness of this work. Although SEV is only supported in Linux at the time of writing, other virtualizable Operating Systems can also benefit from having the confidentiality of their memory protected with SEV. The third goal of this thesis is to verify the security guarantees and performance of the implementation. Performing this verification is necessary for reasoning about the usefulness of this work.

1.2 Results

The research goals from the previous section are fulfilled in the following way: Goal 1) Design. I designed an OS-agnostic solution — SE-Vault — which stores Host secrets in the encrypted memory of a virtualized environment protected with the AMD SEV hardware feature. The solution relies on the standardized VirtIO interface for the transmission of secrets and cryptographic transformation requests, and handles these requests internally without exposing the secret keys to unencrypted memory. The proposed design additionally considers the wide usability and good performance of the solution. By placing the Host SE-Vault component in a central system location —

the Host kernel — the solution is made available to various other software components, and is better connected to other crypto-reliant components like the KCAPI and the disk encryption facilities. The design is discussed in detail in Chapter 4, but I first refer the reader to Chapter 2 for the necessary background information. Goal 2) Implementation. To verify the soundness of my design, I first implemented a PoC which uses Linux for both the Host and the Guest. The implementation is divided into four components: the SE-Vault host driver, the SE-Vault guest driver, the SE-Vault OpenSSL engine, and the SE-Vault KCAPI cipher. The SE-Vault host driver receives cryptographic keys and transformation requests, and forwards the information to the SE-Vault guest driver using the VirtIO interface. The SE-Vault guest driver operates in an SEV-protected VM, and has the responsibility to securely store the cryptographic keys and perform the requested cryptographic transformations. The SE-Vault OpenSSL engine and KCAPI cipher act as entry points to the SE-Vault host driver, thus allowing OpenSSL and KCAPI users to benefit from SE-Vault transparently. As part of this implementation, I analyzed the implementation of the Host’s dm-crypt driver and of the Guest’s DMA SWIOTLB component, and then contributed hardenings to erase traces of provisioned secrets. Lastly, I ported the SE-Vault guest driver from Linux to the seL4 microkernel, and additionally added SEV support to the seL4 kernel. This shows that my design is portable to other kernels whose designs can differ significantly from that of Linux. The details of the implementation are available in Chapter 5. Goal 3) Evaluation. I performed three evaluations on my Linux-based implementation: a security evaluation, a correctness evaluation, and a performance evaluation. The security evaluation verifies whether the implementation matches the attacker model described in Section 4.2. The evaluation carefully emulates the capabilities of the adversary, and performs three tests: one only for SE-Vault, one for the KCAPI, and one for disk encryption with dm-crypt. All three tests succeed when my code hardenings from Section 5.4 are applied, which shows that SE-Vault can in fact protect cryptographic keys under the defined attacker model. The correctness evaluation verifies whether the implementation correctly performs encryption and decryption requests under different circumstances and with different inputs. This evaluation is performed by running 20 tests over 1000 times in randomized order. On very rare occasions, the implementation would hang, but no memory corruption and no erroneous results have been observed. The performance evaluation computes the average throughput, the average latency, the performance overhead of using SE-Vault, the performance overhead of using SEV, and the benefit of my proposed performance improvements. This evaluation compares the results against the default AES implementation in the Host kernel. While SE-Vault significantly increases the average latency of a cryptographic transformation, request throughput averages at 445 MiB/s, which is 60% slower than that of the default AES implementation in Linux. In comparison, Tresor-SGX [RGM16] achieves a request

throughput of only 1 MiB/s, which is significantly lower than that of SE-Vault. Finally, the results show that my performance optimizations significantly reduce the overhead of IO communication under SEV.

1.3 Outline

The rest of this thesis is organized in the following way. Chapter 2 introduces relevant information on virtualization, VirtIO devices, the SEV family of features, the KCAPI, and disk encryption in Linux. Chapter 3 introduces recent academic solutions for protecting Host cryptographic secrets in Linux, and compares them against the solution developed in this thesis. Chapter 4 discusses the design of the proposed TEE solution: 1) how it uses the SEV hardware feature to protect Host secrets, 2) how those Host secrets are efficiently communicated to SE-Vault, and 3) how other software components can make use of SE-Vault. Chapter 5 presents the details of the implementation of SE-Vault in Linux, the optimizations applied to the implementation, the hardenings to the Linux kernel, and how the virtualized component is ported to the seL4 microkernel. Chapter 6 evaluates the security, correctness and performance aspects of the implementation. Chapter 7 lists attacks on the SEV family of features and on SE-Vault, and discusses their practical details and applicability to SE-Vault. Chapter 8 discusses the obtained results and lays out opportunities for improvement, and Chapter 9 concludes the thesis.

2 Background

2.1 Virtualization

Hardware virtualization allows the creation of virtual hardware configurations through a combination of software and hardware. This technology enables multiple VMs to run on the same physical host. Each VM runs its own OS and can have its own hardware configuration. The VMs are managed by the Hypervisor (HV), a combination of user-space and kernel-space code which performs the necessary operations to support the correct execution of the VM. Such operations include allocating physical memory for the VM, providing CPU capability information and emulating special instructions. In order to guarantee the security of the HV and of other VMs on the same system, VMs do not have complete access to the underlying hardware. For security and correct emulation, the HV emulates certain CPU instructions such as IO instructions, MSR accesses, CPUID and RDTSC. Communication with hardware devices happens through a carefully designed device virtualization layer. Normally, the VM is not aware of the actual hardware device model and its configuration, but can only view the model and configuration of the virtual device. In the rest of this section, I examine common software and hardware extensions used for virtualization: QEMU, the Kernel-based Virtual Machine (KVM), and the AMD Virtualization extension.

2.1.1 QEMU

QEMU [Bel05] is an open-source full-system emulation software which supports the emulation of a wide range of Central Processing Unit (CPU) architectures and devices. QEMU is capable of emulating common Instruction Set Architectures (ISAs) such as x86-32, x86-64, ARM and PowerPC, and of emulating features of the corresponding CPU chip families. The emulation of an ISA without native support is achieved through dynamic binary translation [Ebc+01]. The technique dynamically takes a block of instructions and converts it into a sequence of micro-operations in an intermediate representation. After optimizations are performed, the micro-operations are transformed to the machine code of the host system and copied to the corresponding location in memory.


QEMU handles self-modifying code by marking code regions as write-protected and by performing further code translations when a write access to a code region is intercepted. For complete system emulation, QEMU also needs to correctly emulate the Memory Management Unit (MMU), to handle modifications to the Interrupt Descriptor Table (IDT) and Global Descriptor Table (GDT), and to support the handling of exceptions and interrupts. QEMU provides a software MMU and GDT by emulating the translation of virtual to physical addresses. Exceptions such as invalid memory accesses and divisions by zero are captured by the QEMU process through a registered signal handler and are propagated to the emulated software. Injected interrupts are handled at the end of each translation block. QEMU further supports selecting the emulated chipset and which emulated devices are wired to the bus. Supported devices include hard drives, optical drives, network cards, graphics cards, USB devices and many other devices which can be attached to the exposed Peripheral Component Interconnect (PCI) bus.
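The translate-and-execute cycle described above can be summarized by the following minimal sketch. All names are hypothetical; it only illustrates the shape of a dynamic binary translation loop with a translation cache, not QEMU's actual implementation.

    #include <stdint.h>

    typedef void (*host_code_fn)(void);

    /* Hypothetical helpers: look up a previously translated block, or
       translate a guest block (guest code -> micro-ops -> host code)
       and insert it into the translation cache. */
    host_code_fn lookup_translation(uint64_t guest_pc);
    host_code_fn translate_block(uint64_t guest_pc);
    uint64_t     execute_block(host_code_fn fn); /* returns the next guest PC */

    void emulation_loop(uint64_t guest_pc)
    {
        for (;;) {
            host_code_fn fn = lookup_translation(guest_pc);
            if (!fn)
                fn = translate_block(guest_pc); /* translate on a cache miss */
            guest_pc = execute_block(fn);       /* run the translated block */
        }
    }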

2.1.2 KVM

Although QEMU (Section 2.1.1) supports the correct emulation of many systems, the performance of the emulated software and hardware can be significantly lower than that of native execution on the actual system. To alleviate the cost of emulation, QEMU supports the use of hardware virtualization accelerators such as the Kernel-based Virtual Machine (KVM). KVM [Got11] is a Linux kernel module which makes use of hardware extensions, such as AMD Virtualization (AMD-V) [Inc20b] and Intel Virtualization Technology (Intel VT) [Cor20b], to offer performance close to that of native execution. KVM exposes the /dev/kvm device node and defines an ioctl Application Binary Interface (ABI) [Kerb] for creating, configuring and launching a VM. However, KVM does not have the responsibility of populating memory with the corresponding kernel image or BIOS, or of initializing the register state to correct values upon launching the VM. Rather, this is the responsibility of KVM users like QEMU. The KVM ABI exposes commands for creating a VM instance, specifying the number of virtual CPUs, allocating and initializing Guest Physical Memory, injecting interrupts, updating a subset of Model Specific Register (MSR) values and more. A Virtual CPU (vCPU) of a VM can be launched by performing the KVM_RUN ioctl. KVM handles the command by updating the guest state structure for the vCPU with the correct register values and interrupt status, and by modifying the MSR state and the Time Stamp Counter (TSC). Subsequently, KVM saves the Host CPU context and executes a sequence of instructions for loading the guest state structure, context switching to the Guest state and then saving the Guest state upon preemption.
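To make the described ABI concrete, the following sketch shows the minimal sequence of ioctl calls a KVM user such as QEMU performs to create and run a VM. It is a simplified illustration: error handling is omitted, and a real user must also load firmware or kernel code into the guest memory and initialize the registers before issuing KVM_RUN.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Back 64 KiB of Guest Physical Memory with an anonymous mapping. */
        void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0x0,
            .memory_size     = 0x10000,
            .userspace_addr  = (unsigned long)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        /* Create one vCPU and map its shared kvm_run structure. */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* ... load guest code into mem, set registers via KVM_SET_REGS ... */

        ioctl(vcpu, KVM_RUN, 0); /* enter the guest until the next VM-exit */
        /* run->exit_reason now indicates why the guest exited (IO, MMIO, ...) */
        return run->exit_reason;
    }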


With Advanced Micro Devices Inc. (AMD)’s virtualization extension, the corresponding sequence of instructions is vmload, vmrun, vmsave. More information about virtualization on recent AMD CPUs is provided in Section 2.1.3. QEMU uses KVM if requested by the user upon starting the VM. Through the use of KVM, the Guest is isolated from the Host system also through hardware support, not just through careful emulation in QEMU. QEMU no longer needs to dynamically translate instructions when the Host and Guest ISAs match, or to translate memory addresses in software. This not only offers better performance, but also improves security, since the VM can only communicate with the HV through carefully defined interfaces. The management of Guest Physical Memory is arguably the most complex part of the KVM implementation. The VM must not have direct access to all of physical memory, to prevent a potentially malicious VM from taking over the Host system. Yet, the x86 ISA does not provide the necessary memory abstraction mechanisms to allocate only a portion of physical memory to the VM and to handle address translation transparently. KVM implements a software workaround named shadow page tables, which keeps track of changes to the Guest’s Page Table by marking it as write-protected. Loading a new Page Table via a write to the CR3 register, as well as any modification of the current Guest’s Page Table, is always intercepted. KVM maintains a new Page Table — the shadow page table — in which the permission bits are as written by the Guest but the actual Host Physical Addresses are specified by the Host. Whenever the VM vCPU is started, the shadow page table is used instead of the one populated by the Guest. With this technique, the Guest can also utilize hardware units such as the Page Table Walker [Yam+00] and the Translation Lookaside Buffer (TLB). Unfortunately, the technique is inefficient if the Guest often modifies the Page Table, and it also adds complexity to the MMU implementation in KVM. The next section discusses an alternative, preferred solution for translating Guest Virtual Addresses to HPAs.

2.1.3 AMD Virtualization

KVM relies on hardware virtualization extensions to support native execution of the Guest’s software without compromising security. Such hardware extensions make it possible to execute a privileged process which has restricted access to the system’s resources, such as physical memory, and restricted execution of special instructions. On recent AMD CPUs, KVM can make use of the AMD Secure Virtual Machine (SVM) extension, which provides facilities for hardware-assisted virtualization. With SVM, the HV can initialize the Guest’s context, such as the architectural registers, can control the Guest’s view of physical memory, and can intercept the execution of special instructions. The HV specifies the context of execution by populating a structure named the Virtual Machine Control Block (VMCB). The VMCB is divided into two areas: the


Control Area and the State Save Area. The Control Area includes fields for specifying the enabled features and the vector of intercepted instructions, for injecting interrupts, for reading the cause of a VM-exit, and for specifying or reading the Guest’s RIP. The State Save Area contains a few special registers which cannot be easily stored or loaded by the HV in software. Such registers include the segment registers (ES, CS, ...), the descriptor registers (GDTR, IDTR, LDTR), the control registers (CR0, CR3, ...), RAX, RSP, RIP, and the debug registers, among others.
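The following struct sketch illustrates this division. It is purely illustrative: the authoritative layout, with exact offsets and many more fields, is specified in the AMD Architecture Programmer’s Manual and mirrored in Linux’s arch/x86/include/asm/svm.h.

    #include <stdint.h>

    /* Simplified, illustrative VMCB layout; not the authoritative one. */
    struct vmcb_seg { uint16_t selector, attrib; uint32_t limit; uint64_t base; };

    struct vmcb_control_area {
        uint64_t intercepts;      /* bit vector: CPUID, RDTSC, RDMSR, ... */
        uint64_t exit_code;       /* cause of the last VM-exit */
        uint64_t exit_info_1;
        uint64_t exit_info_2;
        uint64_t event_injection; /* interrupt/exception to inject */
        uint64_t nested_cr3;      /* root of the Nested Page Table (see below) */
        /* ... many more fields ... */
    };

    struct vmcb_save_area {
        struct vmcb_seg es, cs, ss, ds;   /* segment registers */
        struct vmcb_seg gdtr, idtr, ldtr; /* descriptor registers */
        uint64_t cr0, cr3, cr4;           /* control registers */
        uint64_t rax, rsp, rip;
        uint64_t dr6, dr7;                /* debug registers */
        /* ... */
    };

    struct vmcb {
        struct vmcb_control_area control;
        struct vmcb_save_area    save;
    };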

     1  ; Save host registers
     2  ...
     3
     4  ; Load guest registers
     5  mov RDI_OFFSET(svm), %rdi
     6  mov RCX_OFFSET(svm), %rcx
     7  mov RDX_OFFSET(svm), %rdx
     8  ...
     9
    10  ; Enter guest mode
    11  vmload
    12  vmrun
    13  vmsave
    14
    15  ; Save guest registers
    16  mov %rdi, RDI_OFFSET(svm)
    17  mov %rcx, RCX_OFFSET(svm)
    18  mov %rdx, RDX_OFFSET(svm)
    19  ...
    20
    21  ; Restore host registers
    22  ...

Figure 2.1: Example code for entering and exiting VM execution. The Host’s registers are first saved, and then the guest registers are loaded. The VM starts execution when the HV executes vmrun. On a VM-exit, the HV saves the VM’s registers and restores the Host’s registers.

Some of the architectural registers — RDI, RSI, R8-R15, etc. — are not contained in the VMCB, and the HV is left with the responsibility to save and restore them on every context switch between the Guest and the Host. Figure 2.1 shows an example snippet in x86-64

assembly for transferring execution to the Guest. In lines 1-8, the HV saves the Host’s context to the stack and then loads the Guest’s registers from the svm structure. In lines 11-13, the HV executes a sequence of instructions for loading the state from the VMCB, executing the Guest and then saving the VMCB when the Guest is preempted. In lines 15-22, the HV saves the remaining Guest registers to the svm structure and then restores the Host registers from the stack.


Figure 2.2: Sequence of steps for translating a Guest Virtual Address (GVA) to a Host Physical Address (HPA). On a Guest access, the GVA is first translated to a Guest Physical Address (GPA) using the Guest Page Table (gPT). The GPA is then translated to a HPA using the Nested Page Table.

In Section 2.1.2, shadow page tables were introduced as a technique to restrict the Guest’s access to physical memory. The technique requires that any change to the Guest’s page tables is intercepted and reflected into the shadow page tables. Creating a new process or allocating physical memory at runtime would therefore lead to expensive VM-exits which degrade the Guest’s performance. SVM includes an alternative approach for limiting the Guest’s view of physical memory: Nested Paging [Inc20b]. With Nested Paging, the Page Table Walker in the CPU is enhanced to handle two page tables: the Guest Page Table and the Nested Page Table. Figure 2.2 shows an example of how both page tables are used in address translation for the memory accesses of a Guest process. The memory access is performed with a Guest Virtual Address (GVA), which is translated to a GPA using the Guest Page Table. The Page Table Walker then translates the GPA to a HPA using the Nested Page Table.
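Conceptually, the two-level translation can be summarized by the sketch below. The function and variable names are hypothetical, and the sketch elides that, in a real nested walk, every access to the Guest Page Table itself is also translated through the Nested Page Table.

    #include <stdint.h>

    extern uint64_t gCR3; /* root of the Guest Page Table (VMCB Control Area) */
    extern uint64_t nCR3; /* root of the Nested Page Table (Host CR3) */

    /* Hypothetical 4-level page table walk starting at the given root. */
    uint64_t walk_page_table(uint64_t root, uint64_t addr);

    uint64_t translate_gva(uint64_t gva)
    {
        uint64_t gpa = walk_page_table(gCR3, gva); /* GVA -> GPA via the gPT */
        return walk_page_table(nCR3, gpa);         /* GPA -> HPA via the nPT */
    }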


Correspondingly, the Guest Page Table is pointed to by the gCR3 register in the Control Area of the VMCB, and the Nested Page Table by the architectural register CR3 of the Host. With Nested Paging, virtualization of memory is greatly simplified, and performance is typically improved for workloads which modify the Guest Page Table often. The Host does not need to monitor changes to the Guest Page Table, but only has to maintain the Nested Page Table and respond to the incurred Page Faults (PFs). As long as the Host keeps the used physical pages present in the Nested Page Table, address translation can transparently make use of the TLB, which avoids the expensive page table walk. The HV cannot ensure security and correct emulation by only managing access to physical memory and maintaining the Guest architectural registers. Direct access to the MSRs, the execution of special instructions and access to peripheral devices also have to be controlled by the HV. The Control Area in the VMCB contains a bit vector which lists the instructions that are intercepted by the HV. When launching the Guest, the HV initializes the bit vector with the special instructions which need to be intercepted. Such instructions can include cpuid for checking CPU capabilities, rdtsc for reading the TSC, invlpg for invalidating TLB entries, rdmsr and wrmsr for accessing MSR values, and so on. When the Guest attempts to execute an intercepted instruction, control is transferred to the HV to emulate the instruction. Once the instruction is emulated, the HV updates the Guest’s registers and resumes execution of the Guest. By intercepting instructions, the HV can limit the hardware features exposed to the VM, provide correct emulation of special instructions and prevent the Guest from accessing MSRs with security implications.
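A hedged sketch of the HV’s side of this mechanism is shown below, reusing the simplified vmcb struct from the earlier listing and the register-save convention from Figure 2.1. The exit code values are the ones defined in the AMD manual; the emulation helpers are hypothetical.

    #include <stdint.h>

    #define SVM_EXIT_CPUID 0x72 /* VM-exit codes from the AMD manual */
    #define SVM_EXIT_MSR   0x7c

    /* struct vmcb as sketched in the earlier listing. */
    /* Hypothetical helpers implementing the actual emulation. */
    void emulate_cpuid(struct vmcb *vmcb, uint64_t *guest_regs);
    void emulate_msr_access(struct vmcb *vmcb, uint64_t *guest_regs);

    /* guest_regs holds the registers saved in software (Figure 2.1). */
    void handle_vmexit(struct vmcb *vmcb, uint64_t *guest_regs)
    {
        switch (vmcb->control.exit_code) {
        case SVM_EXIT_CPUID:
            emulate_cpuid(vmcb, guest_regs); /* fill RAX/RBX/RCX/RDX */
            vmcb->save.rip += 2;             /* skip the 2-byte cpuid opcode */
            break;
        case SVM_EXIT_MSR:
            emulate_msr_access(vmcb, guest_regs);
            break;
        default:
            /* nested page fault, physical interrupt, ... */
            break;
        }
        /* Resume the guest with the vmload/vmrun/vmsave sequence. */
    }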

2.2 AMD Secure Encrypted Virtualization

Cloud computing is a convenient model for renting resources and computation time on remote servers operated by a third party called the cloud provider [MG+11]. A common service model is Infrastructure as a Service (IaaS), which offers deploying and running arbitrary software on a variety of different system configurations. Cloud providers typically attempt to maximize the utilization of hardware resources by means of virtualization. Multiple VMs can be scheduled to co-run on the same system, and the HV provides the distribution and isolation of hardware resources. An obstacle of cloud computing is the difficulty of providing security [AGM16], especially in the IaaS model. With IaaS, multiple tenants share the hardware resources, and each tenant has unrestricted access to the software inside the dedicated VM. Hardware resources include the network infrastructure, GPUs, physical memory, disk drives, etc. If a bug exists in the software for these devices or in the HV, a malicious VM

could attack the HV. On success, the VM could gain unrestricted access to all system resources, which is known as a VM escape [Sca11]. After a VM escape, the attacker would have access to the data of all of the other tenants on the system. The AMD SVM extension does not provide any mechanisms to protect the confidentiality and integrity of a VM. For example, if a bug in KVM or QEMU exists, then a malicious attacker in one VM could gain unrestricted access to the data of all other VMs running on the same system. Since 2016, AMD has announced three hardware features for protecting a VM’s data in memory, even against attacks from the HV. The features can address the security risk of using cloud computing platforms [KPW16], and can also be used to create a Trusted Execution Environment (TEE) for user applications on a local system [PNG19]. In the rest of this section, I examine the design of all three hardware features for secure virtualization.

2.2.1 Secure Encrypted Virtualization

In 2016, AMD announced two hardware features for protecting data in DRAM: Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV). Both technologies protect data by means of encryption. Whenever the CPU writes data into DRAM, the content is encrypted by a hardware Advanced Encryption Standard (AES) engine before being transferred over the DRAM bus [KPW16]. Correspondingly, reads from DRAM are decrypted, and the decrypted data is stored in the cache. Despite the added complexity of performing cryptographic operations in the DRAM memory access path, a process only requires accesses to main memory in case of a cache miss. Because processes typically exhibit locality of memory accesses and the L3 cache is large on SME/SEV-supporting CPUs, most memory accesses can be served by the cache without requiring any encryption or decryption. SME and SEV share the similarity of using memory encryption to protect code and data, but the two have different use cases and different attacker models. SME is a technology for encrypting almost all of the system memory in order to protect the OS and user software against cold boot attacks [Hal+09], DRAM snooping and DMA attacks [AD10]. Its attacker model includes an attacker with physical access to the system or peripheral devices, but without access to the software running on the system. In contrast, the SEV hardware feature can be used to encrypt the code and data of a VM with an encryption key which is bound to the VM instance and is inaccessible from software. SEV considers an attacker model in which the attacker has gained the highest execution privilege on the system, but does not have access to the software running in the SEV-protected VMs. For the rest of this section, I focus only on SEV, because SME is not relevant for the goals of my thesis.


A VM instance can be identified by its dedicated Address Space ID (ASID) which is allocated by the HV. With SEV, the memory of a VM is encrypted using a unique 128-bit-long AES key which is also bound to the ASID. The chosen AES cipher mode varies between CPU generations but ultimately uses a tweakable block cipher mode which depends on the GPA. The AES keys are derived and managed by an ARM microcontroller called the AMD Platform Security Processor (PSP) [KPW16; Inc20b]. The keys are never exposed to software in either the HV or VM, and the hardware AES encryption engines operate transparently to system software. Whenever a VM resumes execution, the AMD PSP can read the VM’s ASID and set the corresponding AES key into the hardware AES encryption engine before the VM begins accessing memory. The HV cannot observe the VM’s data in plaintext or obtain the AES key to decrypt it because the key is only associated with the SEV-protected VM and can never be accessed by software. The HV also cannot access the VM’s cleartext data in the cache. Each cache line is tagged with the ASID and thus only the VM can observe cache hits for the cache lines loaded by its memory accesses.


Figure 2.3: Depiction of the two memory paths for an SEV-protected VM. The C-bit in the page table determines if the data in DRAM is interpreted as encrypted or not. If the VM performs a page table walk or an instruction fetch, then memory in DRAM is always interpreted as encrypted.

When SEV is used, instruction fetches and page table walks in the VM always go through the decryption memory path when a DRAM access is required. However, the VM can mark a data page as unencrypted via a flag in its Guest Page Table [Inc20b]. The VM and HV can establish data communication by reading and writing data to unencrypted pages. The flag for marking data pages as encrypted or unencrypted is named the C-bit. Figure 2.3 shows an example of the memory path taken based on the type of memory access. For every access to DRAM made by an


SEV-protected VM, the memory controller checks the type of the memory access and takes one of two paths. If the memory access is a page table walk, an instruction fetch or a data access to a page with the C-bit set, then the memory controller would take the encryption/decryption memory path. In this path, the AES engine would encrypt any data being written to DRAM using the AES key identified by the ASID of the running VM. Similarly, a data read in this memory path would first read data from DRAM and then the data would be decrypted by the hardware AES engine before delivering the data to the cache. In contrast, if the memory access is to an unencrypted page (C-bit = 0), then the memory controller would take the memory path which does not involve encryption or decryption of data.


Figure 2.4: Physical Address Space of an SEV-protected VM. While the VM Firmware (OVMF), kernel, and user space applications are encrypted, the Direct Memory Access (DMA) region is unencrypted and used for communication with external devices.

Figure 2.4 shows an example of which Guest memory regions are encrypted and which need to be marked as unencrypted. The Guest protects the confidentiality of its data and code by keeping the sensitive regions encrypted, such as the VM firmware, the kernel image and all memory of user space applications. When QEMU provides the VM firmware and kernel images, the AMD PSP automatically encrypts the images using the AES key of the Guest. When the VM is booted with SEV, it is the responsibility of the VM firmware and the kernel to populate the page table entries with the C-bit set, to guarantee that the VM boots correctly and that code and data remain protected. However, the Guest cannot have all of its physical memory encrypted, because the Guest needs to exchange data with the outside world through external devices such as network interfaces and virtual disk drives. As shown in Figure 2.4, the kernel marks the pages in the DMA region as unencrypted (C-bit = 0), so that it can exchange data with the virtual devices configured by the HV. To send data to the outside world, the Guest’s kernel can copy data to the DMA region and then signal the HV that data is sent to a virtual device. The HV can read the cleartext data from the unencrypted region and propagate it to the corresponding interface. Similarly, the HV can copy data to the unencrypted DMA region and send a notification to the Guest that a virtual device has sent data.
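In a Linux guest, marking such a shared region is a one-call operation. The sketch below shows how a guest kernel driver could share a single page with the HV; error handling is omitted, and in practice Linux’s DMA API and SWIOTLB perform this transparently for SEV guests.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/set_memory.h>

    /* Sketch: allocate one page and make it visible to the HV in cleartext. */
    static void *alloc_shared_page(void)
    {
        struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
        void *va = page_address(page);

        /* Clears the C-bit in the guest page table entries for this page,
           so DRAM accesses bypass the AES engine and the HV can read it. */
        set_memory_decrypted((unsigned long)va, 1);
        return va;
    }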


The HV also needs to emulate special instructions, such as cpuid, wrmsr, etc., which requires access to the Guest’s registers on a VM-exit. Because SEV does not encrypt the Guest’s Save State Area and registers on a VM-exit, emulation of special instructions does not require any changes with SEV. To handle an emulated instruction, the HV first saves the Guest’s state after a VM-exit as shown in Figure 2.1. Next, the HV can read the EXITINFO field of the VMCB to determine the cause of the VM-exit. The HV can then emulate the instruction, update the Guest’s general-purpose registers and the RIP to point to the next instruction, and then resume execution for the Guest. Although this allows SEV to be supported with few changes to the code in the VM’s kernel initialization phase and to the code in the HV, it also creates an attack vector for data exfiltration attacks, control flow attacks and rollback attacks. For example, the HV can cause a VM-exit at a convenient time to read secret data, such as passwords or cryptographic keys, while stored in the VM’s registers. Also, the HV can manipulate the VM’s RIP register and general-purpose registers to call any function in the VM as long as the VM’s address space is known.

2.2.2 SEV - Encrypted State

Section 2.2.1 discussed the design of SEV and briefly examined a design decision in the communication of the VM’s register state which opens various attack possibilities against an SEV-protected VM. This design decision facilitates easy integration of the feature in HV and VM software, but ultimately renders the feature vulnerable. In 2017, AMD announced SEV Encrypted State (SEV-ES), which encrypts and protects the integrity of the VM’s register state on top of the available memory encryption [Kap17]. The feature addresses a wide variety of attacks which aim to exfiltrate secret data via the shared VM register state, or aim to corrupt the register state with suitable register values. To protect the VM’s register state, the register values are not exposed to the HV on a VM-exit for an SEV-ES-protected VM. All Guest registers are instead stored in a dedicated encrypted and integrity-protected structure named the VM Save Area (VMSA). The VMSA structure is allocated and initialized by the HV, and the HPA of the VMSA is stored inside the VMCB. During the VM setup phase, the AMD PSP encrypts the initial state and computes a cryptographic hash to include into the attestation process (Section 2.2.3). On a VM-exit, the Guest’s register state is saved encrypted into the VMSA, and a checksum is computed over the VMSA’s contents and stored to an unspecified location, likely inaccessible by software. When the HV resumes execution of the Guest with the vmrun instruction, the CPU computes a checksum over the VMSA and compares it against the previously saved checksum. If the integrity check fails, the vmrun instruction fails and the Guest is not resumed.


Full encryption of the VM’s register state causes difficulties in creating traditional virtualized environments, which require the emulation of special instructions like cpuid, wrmsr, MMIO operations, etc. In such virtual environments, the HV needs to be able to retrieve instruction operands from registers and to communicate the new register values after emulation. However, the HV cannot read or modify the register state of an SEV-ES-protected VM because the VMSA is encrypted. To facilitate this, the SEV-ES specification introduces a new structure called the Guest-Hypervisor Communication Block (GHCB) [Adv20b]. The GHCB is stored in unencrypted physical memory inside the VM’s address space. The structure contains information about the emulated intercepted operation (rdtsc, MMIO, etc.) and the register values associated with the operation (RAX, RCX, etc.). The VM is responsible for writing the information to the GHCB, invoking the HV, and reading the new registers from the GHCB after emulation. There needs to be a mechanism to ensure that the VM can expose the operation type and the associated registers before handing control to the HV for emulation. Examining and rewriting legacy software to adhere to the requirements imposed by SEV-ES is clearly not feasible. SEV-ES solves the problem of providing the necessary information for intercepted instructions by means of indirection. Each possible VM-exit falls into one of two categories under SEV-ES: Automatic Exit (AE) and Non-Automatic Exit (NAE) events. An AE is an event which causes a VM-exit but does not require the communication of registers. Such events include physical and virtual interrupts to the Host system, nested page faults, etc. An NAE is an event which causes a VM-exit but requires register values and instruction information to be communicated to the HV. Such events include instructions such as cpuid and rdmsr, MMIO operations, and special events related to handling Non-Maskable Interrupts in the VM. Because an NAE event requires communicating information through the unencrypted GHCB, such an event raises a special exception under SEV-ES — the VMM Communication Exception (#VC) — which gets handled by the VM. In handling the NAE, the VC exception handler checks the intercepted instruction type, then writes the necessary registers to the GHCB and executes a new instruction — vmgexit — which causes an AE. The HV can then proceed to read the exposed information from the GHCB, emulate the instruction, write the new register values into the GHCB and resume the VM. The VC handler can then read the new registers from the GHCB, update the VM’s registers and return to the context which caused the NAE. Figure 2.5 shows the steps in performing an NAE for the cpuid instruction. The VM executes the cpuid instruction, which causes an NAE, and the CPU redirects execution to the VC handler in the VM. The VC handler copies the values of the EAX and ECX registers into the GHCB and executes the vmgexit instruction. When the HV receives the AE event, it first reads the EAX and ECX registers from the GHCB. Afterwards the HV emulates the cpuid instruction, updates the EAX, EBX, ECX and EDX registers in the



Figure 2.5: Sequence for intercepting an instruction under SEV-ES. On a Non-Automatic exit, the VM first passes control to its VC handler. Only afterwards, the VC handler gives the control to the HV. After resuming the VM, the VC handler validates the information from the HV before returning to the original instruction.

GHCB, and then resumes the VM. The VC handler can then read the four register values from the GHCB, validate them and update the VM’s registers. Lastly, the VC handler executes a return from interrupt (IRET) to continue execution immediately after the cpuid instruction which caused the NAE. Although SEV-ES is conceptually only slightly more complex than SEV, it requires non-trivial changes to the VM’s firmware and kernel [Roe20]. To use SEV-ES, the software in the VM must reserve an unencrypted memory region for the GHCB, must communicate the GPA of the GHCB to the HV, and must register a VC exception handler before any NAE is caused. Otherwise, the VM would get corrupted and execution would halt. The major difficulty in performing the described steps comes from the necessity to mark the GHCB as unencrypted and to communicate its GPA. As discussed in Section 2.2.1, a guest physical page can only be marked as unencrypted by clearing the C-bit in the page tables. Since the VM is booted in protected mode, the VM must first generate a page table, write to CR3 and switch to long mode. To communicate the location of the GHCB to the HV, the VM writes the address of the GHCB to the MSR 0xC0010130. To accommodate SEV-ES, all of the software in the VM — firmware, bootloader and kernel — must switch to long mode as early as possible, before any NAE operation is performed. This requires a careful redesign of the early initialization in such software to enable communication between the HV and the VM.
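The following sketch illustrates the VC handler’s side of the cpuid flow from Figure 2.5. It is modeled on the handling in Linux’s arch/x86/kernel SEV-ES code but heavily simplified: the bookkeeping of the GHCB valid bitmap, the error paths and the advancing of RIP are omitted, and the struct field names are assumptions based on the Linux definitions.

    #define SVM_EXIT_CPUID 0x72

    /* Simplified #VC handling of a cpuid NAE inside the guest; pt_regs and
       ghcb are the kernel-internal types of the guest OS. */
    static void vc_handle_cpuid(struct pt_regs *regs, struct ghcb *ghcb)
    {
        /* 1. Expose the operands in the unencrypted GHCB. */
        ghcb->save.rax = regs->ax;
        ghcb->save.rcx = regs->cx;
        ghcb->save.sw_exit_code = SVM_EXIT_CPUID;

        /* 2. Trigger an Automatic Exit so the HV can emulate cpuid. */
        asm volatile("rep; vmmcall"); /* encoding of vmgexit */

        /* 3. Copy the results back; a careful guest validates them first. */
        regs->ax = ghcb->save.rax;
        regs->bx = ghcb->save.rbx;
        regs->cx = ghcb->save.rcx;
        regs->dx = ghcb->save.rdx;
        /* The handler then advances RIP past cpuid and returns via IRET. */
    }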


2.2.3 Attestation and Secret Provisioning

Runtime protection of a VM alone is insufficient for providing a Trusted Execution Environment (TEE) to a remote user if the system has already been compromised, or if the Host has malicious intentions. Without the possibility of verifying the authenticity of the system and the launch state of the VM, a malicious Host may still be able to read the data and observe the operations performed inside the VM in various ways. For example, the Host can lie about the authenticity of the server and about whether it supports the SEV or SEV-ES features. The Host can also decide to launch the VM without any of these security features enabled. The attacker can launch the VM with the features enabled, but disable the NODBG policy, which allows the attacker to use the SEV Application Programming Interface (API) [Adv20a] to decrypt the memory of the VM. Additionally, the attacker can launch the VM with all protections enabled, but load a vulnerable VM firmware or kernel image which includes a backdoor. The attacker can use the backdoor to gain execution inside the protected VM and inspect its state. Verifying the authenticity of the Host system is the first step of the attestation process from the perspective of the Guest Owner. The AMD attestation process supports verifying the authenticity of the CPU and its firmware version, and also binding the system to a platform owner such as a cloud provider. The actors in the attestation process are the Guest Owner, the System Owner (HV), the on-chip PSP and the online AMD Key Distribution Server [Dev].


Figure 2.6: Certificate chains used in platform authentication.


Figure 2.6 shows all of the keys in the authentication process and their dependencies. The AMD Key Distribution Server stores the AMD Root Signing Key (ARK) and the AMD SEV Signing Key (ASK), which are both bound to the product line lifetime. The ARK signs the ASK public key. The Chip Endorsement Key (CEK) is derived by applying a Key Derivation Function to a secret contained in a One Time Programmable fuse in the PSP. The CEK uniquely identifies an AMD CPU and the firmware it is using. The System Owner uses the SEV API [Adv20a] to retrieve the CEK public key and then sends it to the AMD Key Distribution Server. The AMD Key Distribution Server signs the public key using the ASK private key and returns the signed CEK public key to the System Owner. During the initialization of the SEV platform, the SEV firmware generates the Platform Endorsement Key (PEK) and the Platform Diffie-Hellman Key (PDH) using a Key Derivation Function which is fed input from a secure entropy source. The PEK public key is signed with the CEK private key, and in turn the PDH public key is signed by the PEK private key. The PEK public key can additionally be signed with the Owner Certificate Authority Key (OCA) to bind the PEK to the System Owner. The lifetime of both keys is bound to the lifetime of the platform. The PDH public key is used in the Diffie-Hellman protocol in the secret provisioning stage to agree on a master secret with the Guest Owner. Although the CEK, PEK and PDH private keys are stored on the PSP, they are inaccessible to the System Owner, as that would compromise the attestation process. The described signing process forms two chains of trust: 1) ARK ← ASK ← CEK ← PEK ← PDH, and 2) OCA ← PEK ← PDH. With 1), the Guest Owner can verify that the platform contains an authentic SEV-capable CPU. With 2), the Guest Owner can also optionally verify the identity of the System Owner. Additionally, the Guest Owner can establish a secure communication channel with the SEV firmware in order to provide secrets to the VM or to retrieve an authenticated measurement of the VM. The steps described so far only provide a mechanism for establishing the authenticity of the SEV CPU and for deriving the PDH public key. However, a remote entity, such as the Guest Owner, must verify that the VM is started with the expected state. This can be achieved by retrieving a measurement hash of the VM’s state and by verifying it against the expected result. The generation of the measurement is tightly coupled with the secret provisioning process. As briefly mentioned before, the PDH public key can be used to establish a secure communication channel between the SEV firmware and a remote entity. The communication channel is protected by a shared secret, called the master secret. In order to agree on the master secret, the Guest Owner retrieves the PDH public key, generates its own Elliptic-Curve Diffie-Hellman (ECDH) key pair, and forwards the ECDH public key to the System Owner, who provides it to the SEV firmware. Using this sequence, the master secret can be derived [Adv20a]. Afterwards, the Guest Owner and the SEV firmware derive the Key Encryption Key (KEK) and Key Integrity


Key (KIK) using the same Key Derivation Function, as follows: KDF(Master Secret, "sev-kek") and KDF(Master Secret, "sev-kik"). The Guest Owner generates the Transport Encryption Key (TEK) and the Transport Integrity Key (TIK), which are encrypted and integrity-protected using the KEK and KIK, and then sent to the SEV firmware. The SEV firmware mixes the TIK into the VM measurement it produces. Because the Guest Owner knows the expected initial state of the VM and the TIK value, the Guest Owner can always compare the received measurement from the untrusted System Owner with the ground truth measurement computed locally.

Key Name                        | Abbr. | Algorithm     | Lifetime
--------------------------------|-------|---------------|------------------------
AMD Root Signing Key            | ARK   | RSA 2048      | Product line lifetime
AMD SEV Signing Key             | ASK   | RSA 2048      | Product line lifetime
Chip Endorsement Key            | CEK   | ECDSA         | Chip lifetime
Platform Endorsement Key        | PEK   | ECDSA         | Platform/Owner lifetime
Owner Certificate Authority Key | OCA   | ECDSA         | Owner lifetime
Platform Diffie-Hellman Key     | PDH   | ECDH          | Platform/Owner lifetime
Key Encryption Key              | KEK   | AES-128       | VM Launch
Key Integrity Key               | KIK   | HMAC SHA-256  | VM Launch
Transport Encryption Key        | TEK   | AES-128       | VM Launch
Transport Integrity Key         | TIK   | HMAC SHA-256  | VM Launch

Table 2.1: Summary of all keys in the attestation and secret provisioning process.

Table 2.1 shows all keys used to verify the system authenticity, to establish a secure communication channel, and to compute the final measurement of the VM’s state. Each row includes the full name of the key, its abbreviation, the algorithm the key is used for, and the duration of the key’s lifetime. Finally, it is important to discuss the contents of the VM’s measurement for two crucial reasons. First, the Guest Owner must be aware of all data which is included in the measurement, in order to compute the measurement locally. Second, all sensitive data and state must be included in the measurement; otherwise, a malicious HV would be able to compromise the security of the VM. The SEV API [Adv20a] exposes two commands for setting the initial data and state of the SEV VM: LAUNCH_UPDATE_DATA and LAUNCH_UPDATE_VMSA. Notably, the second command can only be used with SEV-ES. The commands can be used by an HV like QEMU, which forwards the commands to KVM using the ioctl ABI. Then, KVM sends them to the AMD Platform Security Processor (PSP) driver in the Linux kernel, which in turn communicates them to the secure processor. In the current state of the SEV eco-system, the LAUNCH_UPDATE_DATA command is only used to include the Open Virtual Machine Firmware (OVMF) code and data into the measurement. For example, the kernel is only provided after the VM

is launched, via DMA or by reading it from disk. The LAUNCH_UPDATE_VMSA command is used to set the initial state of the VM Save Area (VMSA) structure when SEV-ES is used. After the launch sequence is finished, the HV can execute the LAUNCH_MEASURE command to retrieve the measurement of the VM’s data and state. The measurement computes a SHA-256 hash over the SEV API version and PSP firmware version, the loaded data and VMSA state, a nonce, the Transport Integrity Key (TIK), and the launch policy. The launch policy contains additional information specifying whether SEV-ES is used, whether debugging is enabled, whether relocation of the VM to other platforms is allowed, etc. Once the measurement is retrieved, the HV can provide it to the Guest Owner for validation. The Guest Owner can derive the measurement locally and compare it against the received measurement from the System Owner. If the measurements match, the Guest Owner can then use the secure channel to transmit a secret, such as a disk decryption key, to the VM.
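As a sketch of the Guest Owner’s side of this check: because the TIK is mixed in as an HMAC key, the Guest Owner can recompute the measurement locally and compare it to the value received from the HV. The field order and sizes below follow my reading of the SEV API specification and should be treated as an assumption rather than a normative layout; error handling and a constant-time comparison are omitted for brevity.

    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    /* Returns 1 when the measurement received from the untrusted System
       Owner matches the one computed locally by the Guest Owner. */
    int verify_measurement(const unsigned char received[32],
                           unsigned char api_major, unsigned char api_minor,
                           unsigned char build, uint32_t policy,
                           const unsigned char digest_ld[32], /* SHA-256 of loaded data */
                           const unsigned char mnonce[16],
                           const unsigned char tik[16])
    {
        unsigned char buf[1 + 3 + 4 + 32 + 16], out[32];
        unsigned int n = 0, outlen = 0;

        buf[n++] = 0x04;                        /* measurement context byte (assumed) */
        buf[n++] = api_major;
        buf[n++] = api_minor;
        buf[n++] = build;
        memcpy(buf + n, &policy, 4);    n += 4; /* launch policy, little endian */
        memcpy(buf + n, digest_ld, 32); n += 32;
        memcpy(buf + n, mnonce, 16);    n += 16;

        /* measurement = HMAC-SHA-256(TIK, fields) */
        HMAC(EVP_sha256(), tik, 16, buf, n, out, &outlen);
        return memcmp(out, received, 32) == 0;
    }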

2.3 Trusted Execution Environment

In today’s digital world, a system often contains high-profile secrets whose protection is of great importance to the system’s users or to large companies. A system may contain sensitive online banking information or cryptocurrency wallet keys which the user would like to protect from malicious actors who may have gained access through vulnerable software. Video streaming companies like Netflix operate on a subscription-based business model, which may suffer if the latest highest-quality content is recorded and uploaded to the internet. Thus, such a company would often protect its interests by providing high-quality content in an encrypted format which can never be accessed by software on the system. However, ensuring protection in both scenarios is difficult, considering that the malicious actor may have full access to the OS. In a traditional environment, such an actor would be able to examine physical memory, suspend processes, set breakpoints, inject malicious code into a running process, etc. To provide better protection to their customers, various system vendors have developed products which aim to protect sensitive information or high-profile software from malicious actors with the highest privileges. Such products offer the creation of a TEE which is protected through a combination of hardware and software support. By the definition established by Sabt et al. [SAB15b], a TEE must fulfill a few crucial and difficult requirements. First, the system must be able to prove the authenticity of the system and of the TEE to a remote user. This ensures that the TEE is created with the expected code, data and state. This requirement also ensures that the system uses authentic hardware with a specific firmware version. Second, the

solution must provide confidentiality of the TEE’s code, data and state. The TEE’s contents should remain secret and never be accessible to an external entity. Third, the solution should provide integrity for the TEE’s code, data and state. A privileged external entity should not be able to alter the contents of the TEE, such as instructions, data, registers, etc. A TEE can be further subdivided into two components: the Dynamic Root of Trust Measurement (DRTM) and the Secure Execution Environment (SEE) [SAB15b]. The SEE only provides runtime protection by guaranteeing the confidentiality and integrity of its contents. However, if the system is compromised before the SEE is created, the initial state of the software in the SEE cannot be protected. The DRTM can be used to establish the authenticity of the system and the desired initial state of a particular system component. For example, this concept carries over to how Secure Boot operates: the contents of a component are verified against the expected ground truth before loading the component and passing execution to it. Only when the SEE and DRTM are combined is a Trusted Execution Environment (TEE) formed. In the theoretical description of a TEE, the security guarantees of the environment are established and provided by the separation kernel [SAB15b]. The separation kernel, first introduced by Rushby [Rus81], has the purpose of distributing and isolating the software components of the system. No component can influence the state of another, and communication happens only through secure and limited channels. The concept of a separation kernel is further described in the context of a TEE by Sabt et al. [SAB15b; SAB15a], who describe that the separation kernel and the software in the TEE must ensure a secure boot sequence, secure storage, secure scheduling, inter-environment communication and trusted I/O paths. In the recent past, various TEE solutions with varying use cases have been designed and implemented by CPU vendors. In these designs, the separation kernel is often represented by a combination of hardware, firmware and traditional software. Such solutions vary in the adopted attacker model and aim at protecting different system entities: a secure separate kernel, a user space process or a VM. The three most prominent TEE solutions are ARM TrustZone [Win08], Intel SGX [CD16] and AMD SEV [KPW16]. ARM TrustZone provides hardware-enforced isolation between the untrusted world and the trusted world. The trusted world typically runs a separate kernel and has the purpose of storing cryptographic secrets and performing operations with them internally. The authenticity of the trusted world is asserted during the boot process, which verifies the signature of the trusted world image. Communication between the trusted and untrusted world happens only through a well-defined interface which is designed by the developer of the trusted world. TrustZone also provides confidentiality and integrity of the secure world.


Intel SGX is a technology for creating a TEE, called an enclave, for a user space application. The authenticity and runtime protection of the enclave are ensured by the CPU, CPU firmware, and a few special enclaves provided by Intel. Intel SGX adopts a strong attacker model in which the kernel and all hardware except the CPU package are considered malicious. To provide such protection, the design of Intel SGX ensures the authenticity, confidentiality and integrity of the enclave. The last examined TEE solution is AMD Secure Encrypted Virtualization (SEV), which is also used in this thesis. In this solution, the separation kernel is a combination of the AMD Platform Security Processor (PSP) and the main CPU. The PSP can be used to establish the authenticity of the CPU, CPU firmware and the VM. The PSP further stores the per-VM memory encryption key, the secrecy of which is crucial for the protection of the VM during its lifespan. The confidentiality of the VM’s code, data and registers is provided by the main CPU and its firmware. However, neither SEV nor SEV-ES provides integrity of the VM’s code and data, and thus the SEV and SEV-ES features would not classify as TEE solutions under the description suggested by Sabt et al. [SAB15b]. Additionally, the SEV features rely on the potentially malicious Hypervisor (HV) to set up scheduling, emulate a few special instructions, track execution via configurable trap events, and modify the nested page table of the VM. Thus, SEV and SEV-ES also fail the requirements of a separation kernel suggested by Sabt et al. [SAB15b]. However, designing mature software under SEV and SEV-ES is still valuable since they provide authenticity and confidentiality of the VM’s data. Attacks based on the remaining missing requirements can be mitigated by the software running inside the VM. Additionally, SEV and SEV-ES are only the first two iterations of the TEE solution from AMD. Recently, AMD announced the newest iteration of the technology — SEV Secure Nested Paging (SEV-SNP) — which provides integrity protection and further hardens the interfaces between the HV and the VM. General-purpose software which executes correctly on SEV and SEV-ES would likely run on SEV-SNP without any required changes. Thus, developing solutions for SEV and SEV-ES is still meaningful even if these two features do not fulfill the ideal attacker model for the user’s product.

2.4 VirtIO

On a native system, the Host kernel has full access to the devices and can apply the necessary driver optimizations to reach high performance. Furthermore, the Host and the devices can operate concurrently: the Host can communicate with the device using Memory-Mapped IO (MMIO) or IO instructions [Inc20b], perform other computations immediately after, and the device can eventually notify the Host kernel via an interrupt. MMIO operations and IO instructions have higher latency than

regular memory accesses since they have to reach the device, but the latency can be hidden through the out-of-order execution of instructions performed by modern CPUs. In a virtualized environment, MMIO operations and IO instructions cause a VM-exit because they need to be intercepted by the HV [Inc20b]. Thus, virtualization of legacy OSs poses an issue for the performance of IO communication between the Guest and the outside world due to frequent VM-exits. Initially, HVs like KVM, Xen [Chi08] and VMI [Ams+06] provided their own independent network, block device and console virtual device implementations [Rus08]. All three included overlapping functionality and optimizations, which led to maintenance difficulties and limited code reuse [Rus08]. VirtIO [Rus08] was introduced to provide a standardized common interface which can be used by all HVs, OSs and device drivers. The VirtIO specification [TH] defines the initialization steps, feature bits, supported communication buses, and facilities for data and event communication. The VirtIO interface is specifically designed to allow easy integration into existing drivers and HVs, to be feature-extensible and to offer close-to-native performance. The wide adoption of VirtIO in various HVs and OSs, and the development of new VirtIO devices, have proven its success. VirtIO devices are most commonly attached to the PCI bus, but other mediums such as MMIO or Channel IO can be used. The HV exposes the VirtIO device configurations over the PCI bus, and the Guest probes for available PCI devices after booting its kernel. The Guest’s kernel reads the PCI Device ID and, if it matches that of a supported VirtIO device, calls the initialization function of the corresponding VirtIO Guest driver. The Guest VirtIO driver reads the offered configuration, processes it and eventually updates the Device Status Field with DRIVER_OK to signal that the Guest driver is set up and can communicate. The layers of abstraction provided by the implementation of the VirtIO specification in the Linux kernel and QEMU are shown in Figure 2.7. The figure shows the layers in a para-virtualized network based on virtio-net. The delivery of events and configuration happens through the virtual PCI bus, and data exchange happens through physical memory. The Linux VirtIO implementation provides two Shim layers (marked with A), correspondingly for the Host and Guest. The Shim layers provide an API to both Host and Guest VirtIO drivers for sending and retrieving data, for exchanging configuration information and for sending notifications. For example, the Host Shim layer exports the following functions: virtio_add_queue for adding a communication queue to a device, virtqueue_pop for reading data from a queue, virtqueue_push for sending data, etc. Marked with B in Figure 2.7, the Host virtio-net device implementation in QEMU uses the Shim to communicate information. Also marked with B, the Guest virtio-net driver in Linux uses a similar Shim implementation.
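To illustrate the Host-side Shim API, the following minimal sketch shows how a QEMU device implementation might drain a Virtqueue in its output handler. The handler name and the process_request helper are illustrative; virtio_add_queue, virtqueue_pop, virtqueue_push and virtio_notify are the Shim functions mentioned above.

    #include "qemu/osdep.h"
    #include "hw/virtio/virtio.h"

    static void sevault_handle_request(VirtIODevice *vdev, VirtQueue *vq)
    {
        VirtQueueElement *elem;

        /* Pop every descriptor chain the Guest has made available. */
        while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
            /* elem->out_sg holds Guest-to-Host data; elem->in_sg holds
             * the buffers into which the Host may write a response. */
            size_t written = process_request(elem); /* hypothetical helper */

            /* Return the chain to the used ring with the response size. */
            virtqueue_push(vq, elem, written);
            g_free(elem);
        }
        /* Inject an interrupt so the Guest driver processes the results. */
        virtio_notify(vdev, vq);
    }

    /* During device realization, a queue is registered with its handler:
     * virtio_add_queue(vdev, 256, sevault_handle_request); */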


Figure 2.7: Layers of abstraction in virtio-net, a network virtualization solution based on VirtIO. The Host and VM both rely on a shim layer to abstract common VirtIO functionality. The corresponding virtio-net drivers in the Host and in the VM communicate network packets and notifications on two separate Virtqueues.

Correspondingly, the Host net device driver (C) in QEMU and the Guest net driver (D) in Linux communicate with the VirtIO virtio-net device and the virtio-net driver. This layered design allows for better code reuse of common functionality and reduces the code complexity of adding new para-virtualized devices.

The channel for communicating data between the Host and the Guest is called a Virtqueue. The number of Virtqueues and their configuration are determined by the device implementation in QEMU. The driver implementation in the Guest kernel recognizes and accepts the provided information or notifies that the device is not supported. Shown in Figure 2.8, a Virtqueue consists of a descriptor table, an available ring and a used ring. The Guest driver allocates buffers from its physical address space and populates entries in the descriptor table with the physical addresses, lengths and flag information of the allocated buffers. The Next desc field specifies whether descriptors form a chain and should be processed together. The available ring contains information about the descriptor chains which can be used by the Host driver and is only written to by the Guest. The used ring is updated by the Host driver, when it is done processing a descriptor, and is only read by the Guest.

Figure 2.8: Steps in VirtIO drivers for sending data to the Guest. The VM first makes buffers available by updating the descriptor table and the available ring, and then notifies the Host driver. The Host copies the data into an available buffer, updates the used ring and notifies the VM by injecting an interrupt.

Figure 2.8 shows the sequence of steps in sending data from the Host driver in QEMU to the Guest driver. Initially, the Guest driver carves out memory from its physical address space which it would use for receiving data. In step 1, the Guest driver updates an entry in the descriptor table with the physical address 0x74ff33316 and length 540 bytes of a buffer in the carved-out region. The driver then writes the descriptor id into the available elements ring buffer and updates the index to point to that entry in its ring buffer, correspondingly in steps 2 and 3. In step 4, the Guest driver notifies the QEMU driver via MMIO by writing to a dedicated address in its physical address space. The address is marked as write-protected by the Host and the write leads to a PF which gets handled by KVM. KVM would in turn signal an eventfd polled by a QEMU thread, and the QEMU thread would propagate the event to the event handler of the corresponding VirtIO driver. In steps 5 and 6, the driver would read the available element index and the entry in the descriptor table. The driver then writes the data — network packet, hard drive block, etc. — into the retrieved buffer. In steps 7 and 8, the QEMU VirtIO driver updates the used ring with the descriptor index and payload size, and then notifies the Guest VirtIO driver. The notification request is propagated to KVM which injects an interrupt into the Guest. The Guest’s interrupt handler propagates the event to the notification handler function in the corresponding Guest VirtIO driver. When data is being sent from the Guest to the Host, the only difference is that the Guest VirtIO driver writes into the buffer and the Host VirtIO driver reads from the buffer. The other steps remain the same.

An immediate observation is that issuing a notification between the Host and Guest is expensive. First, sending an event from the Guest to the Host requires handling a PF in KVM which leads to a VM-exit. Second, sending an event from the Host to the


Guest requires injecting an interrupt into the VM, which prevents the VM from performing useful computation while it is in the interrupt handler. Third, issuing a notification requires a transition between the user space QEMU process and KVM, which is also expensive. Additional performance overhead is added by the need to transfer data, such as packets and hard drive blocks, from the Host kernel to the QEMU driver. For example, when the Guest needs to access a virtual hard drive, QEMU would need to request the data from the kernel and forward it to the Guest using VirtIO. In a non-virtualized environment, a user process reading from the hard drive requires fewer context switches and less data movement than a process in a virtualized environment.
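The descriptor table, available ring and used ring described above follow the split Virtqueue layout of the VirtIO specification [TH]; in C, the in-memory structures look roughly as follows (field names follow the specification):

    #include <stdint.h>

    struct virtq_desc {
        uint64_t addr;   /* Guest-physical address of the buffer */
        uint32_t len;    /* Length of the buffer in bytes */
        uint16_t flags;  /* VIRTQ_DESC_F_NEXT, VIRTQ_DESC_F_WRITE, ... */
        uint16_t next;   /* Index of the chained descriptor, if any */
    };

    struct virtq_avail {
        uint16_t flags;
        uint16_t idx;    /* Where the Guest will place the next entry */
        uint16_t ring[]; /* Descriptor indices made available to the Host */
    };

    struct virtq_used_elem {
        uint32_t id;     /* Head of the processed descriptor chain */
        uint32_t len;    /* Number of bytes written into the buffers */
    };

    struct virtq_used {
        uint16_t flags;
        uint16_t idx;
        struct virtq_used_elem ring[];
    };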


Figure 2.9: Sequence of steps for sending a packet from the Host network driver to the Guest network driver. Because the data plane layer is located in QEMU, two context switches often need to occur.

Figure 2.9 illustrates the problem by depicting the sequence of operations for sending a network packet from the Host network driver to the Guest network driver. Three domains are visualized in the diagram: the Host kernel, user space in the Host, and the Guest. The main actors in the domains are QEMU, the Host network driver, KVM and the VirtIO network driver. In step 1, the QEMU process polls the network driver for newly delivered packets and reads the packet when one is available. In step 2, the QEMU VirtIO driver copies the network packet to Guest Physical Memory and updates the corresponding Virtqueue data structure. Next, the QEMU VirtIO driver needs to send a notification to the Guest VirtIO driver that new data is made available. The notification happens via interrupt injection into the Guest and requires assistance from KVM. In step 3, QEMU sends an ioctl command to the KVM virtual device to send the corresponding interrupt to the Guest. In steps 4 and 5, KVM injects the interrupt into the Guest, the Guest’s interrupt handler is invoked and it forwards the notification to the Guest VirtIO driver. In step 6, the VirtIO driver reads the packets

and forwards them to the Guest network driver. The described performance overhead is displayed in steps 1 and 3. In step 1, data is copied from kernel space to user space and a context switch is performed. In step 3, a context switch from user space to kernel space is performed in order to inject an interrupt into the Guest. These context switches can happen once per network packet in the worst case, and can have a dramatic effect on overall system performance. In Chapter 2.5, I describe an approach used by high-performance VirtIO drivers to reduce the number of context switches between kernel space and the user space QEMU process.

2.5 Vhost

Chapter 2.4 described the design of VirtIO and how the Host and Guest can exchange data using the interface. The chapter’s closing discussion described the performance overhead caused by the separation of functionality into different security domains: the Host kernel, the QEMU user space process, and the Guest. A context switch between any of the three security domains adds performance overhead due to the necessity to flush the TLB, and also the L1 cache and branch prediction buffers when certain security settings are enabled. The Vhost protocol is designed to address some of the performance issues in communicating data between the Host and the Guest. The protocol specifies an interface for providing the necessary VirtIO information for data communication to the Host kernel. For example, QEMU can set up the Virtqueues, provide the Virtqueue information to a kernel module using the Vhost interface, and then enable communication. The Host driver would be responsible for communicating data to the Guest using the VirtIO interface and would request KVM to inject an interrupt into the Guest. In this scenario, network packets and hard drive blocks would not have to be transferred to the HV operating in user space. Figure 2.10 illustrates the sequence of steps for transferring a packet from the Host network driver to the Guest network driver. The figure shows two security domains: the Host kernel and the Guest. The QEMU user space process is omitted from the diagram because QEMU is only responsible for configuring the Virtqueues when vhost is used. In step 1, the vhost-net driver polls the network driver for newly delivered packets and reads them. In steps 2 and 3, the vhost-net driver updates the corresponding Virtqueue structure and requests KVM to inject an interrupt into the Guest. In steps 4 and 5, KVM injects an interrupt into the Guest, the Guest’s interrupt handler is invoked, and the interrupt is propagated to the VirtIO driver. Lastly, the virtio-net driver forwards the packet to the Guest’s network driver. Unlike the steps in Figure 2.9,

the steps in Figure 2.10 do not include copying data to the QEMU process and sending an interrupt injection request from QEMU to KVM. By reducing the number of context switches and memory copies, performance is significantly improved.

Figure 2.10: Steps for sending a network packet from the Host to the VM when vhost-net is used. In this design, no context switches are necessary between kernel and user space.
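To make the handoff concrete, the following is a rough sketch of the vhost setup sequence as a HV could perform it over the vhost character device. Error handling and some steps (e.g., VHOST_SET_VRING_NUM and VHOST_NET_SET_BACKEND) are omitted, and the caller-provided mem_table, addr, kick_fd and call_fd are assumed to be prepared by the HV.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    static int setup_vhost_net_queue(struct vhost_memory *mem_table,
                                     struct vhost_vring_addr *addr,
                                     int kick_fd, int call_fd)
    {
        int vhost_fd = open("/dev/vhost-net", O_RDWR);
        if (vhost_fd < 0)
            return -1;

        /* Bind the device to this process and describe Guest memory so
         * the kernel can translate GPAs to Host virtual addresses. */
        ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
        ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem_table);

        /* Communicate the Virtqueue layout configured during VirtIO setup. */
        ioctl(vhost_fd, VHOST_SET_VRING_ADDR, addr);

        /* kick_fd is signaled by KVM when the Guest notifies the device
         * (ioeventfd); call_fd lets vhost request an interrupt injection
         * into the Guest (irqfd). */
        struct vhost_vring_file kick = { .index = 0, .fd = kick_fd };
        struct vhost_vring_file call = { .index = 0, .fd = call_fd };
        ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);
        ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);

        return vhost_fd;
    }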


Figure 2.11: Sequence of operations when an IO Virtual Address (IOVA) translation miss occurs. The vhost layer needs to notify QEMU, after which QEMU queries each IOVA-miss and sends the GPA to the vhost layer. In this design, IOVA-misses are very expensive.

Although using Vhost can lead to better IO performance overall, a disadvantage occurs when a VirtIO device is configured to use a virtual IOMMU. In this setting, a device address is typed as an IOVA and needs to be translated to a GPA before Vhost can use it. Because Vhost does not have full knowledge of the device configuration and its address space, the translation needs to be performed by QEMU. Such queries

are extremely expensive, and the current Vhost implementation stores IOVA-to-GPA translations in a translation cache as an optimization. Figure 2.11 shows the process of handling a translation failure in Vhost for the case of reading a buffer address from the descriptor table of a Virtqueue. In step 1, the Vhost Shim reads the IOVA from the descriptor table and attempts to translate the address to a GPA. However, the translation cache does not have an entry for this IOVA and exhibits an IOTLB miss. In step 2, Vhost notifies QEMU of the IOTLB miss by writing to an eventfd. In steps 3 and 4, the QEMU thread reads all IOTLB misses, performs the translations and updates the translation cache inside Vhost. These steps are done by performing ioctl commands on the corresponding driver for which the IOTLB miss occurred. In step 5, the Vhost Shim re-attempts to read and translate the IOVA. The translation succeeds and the Vhost driver can continue execution. An IOTLB miss is detrimental to performance since multiple context switches are necessary to update the translation cache. Until the translation cache is updated, processing of requests on this Virtqueue cannot continue, which adds latency and reduces IO throughput significantly.
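The IOTLB messages exchanged between Vhost and QEMU follow a small, fixed format; the Linux UAPI definition (from linux/vhost_types.h) looks roughly as follows:

    /* Message exchanged between the Vhost kernel driver and QEMU
     * for IOVA translations. */
    struct vhost_iotlb_msg {
        __u64 iova;   /* IO virtual address of the miss or update */
        __u64 size;   /* Size of the mapped region */
        __u64 uaddr;  /* QEMU user space address backing the region */
    #define VHOST_ACCESS_RO 0x1
    #define VHOST_ACCESS_WO 0x2
    #define VHOST_ACCESS_RW 0x3
        __u8 perm;
    #define VHOST_IOTLB_MISS        1 /* Vhost -> QEMU: translation needed */
    #define VHOST_IOTLB_UPDATE      2 /* QEMU -> Vhost: install translation */
    #define VHOST_IOTLB_INVALIDATE  3
    #define VHOST_IOTLB_ACCESS_FAIL 4
        __u8 type;
    };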

2.6 Kernel Crypto API

Parts of the kernel and some kernel modules utilize cryptographic operations to protect high-profile secrets like passwords, to establish a secure communication channel with an external entity, etc. As a way to standardize and unify implementations of these operations, Linux provides the Kernel Crypto API (KCAPI) which offers a rich set of cryptographic ciphers and message digest algorithms [MV]. These operations are exposed not only to the Linux kernel and Linux Kernel Modules (LKMs), but also to user space applications via a Netlink-based interface. To facilitate usage in user space applications, the libkcapi library [Mue] and the kcapi-enc command line tool provide access to the KCAPI without exposing the complexity of the Netlink layer. The KCAPI only offers an API but does not mandate the implementation of any cipher. A kernel driver can register an implementation of a cryptographic cipher which can either be implemented in software or be served by an external hardware device. Upon registering a cryptographic cipher implementation, the kernel driver needs to specify the priority of this implementation. When a cipher request is sent to the KCAPI, the interface selects among all cipher implementations the one with the highest priority [MV]. By default, each request is handled asynchronously: the submitted request may be buffered, handled later and a notification would be sent to the requester when the request is finished. Additionally, the KCAPI implements the alternative synchronous functions on top of the asynchronous API by sending an asynchronous

request and blocking until it completes. AES is one of the cryptographic ciphers supported by the KCAPI. AES is a symmetric-key block cipher which transforms a block of data based on the provided secret key. As part of the AES standard, the block size is 128 bits and the key length can be 128, 192 or 256 bits. When handling large inputs, the input is split into blocks which are transformed individually. In such cases, the application of each transformation is dictated by the block cipher mode of operation. The cipher mode describes the derivation of a unique binary sequence — the initialization vector (IV) — which is mixed with the input block before performing the AES transformation. Examples of cipher modes include Electronic Code Book (ECB), Cipher Block Chaining (CBC) and Galois/Counter Mode (GCM), among others. A cipher mode is engineered to fulfill a combination of security and performance requirements, and a compromise is often made in favour of either. For example, the CBC mode offers stronger security guarantees than ECB does, but has lower performance due to the difficulty of parallelization and additional computation. Because AES is widely used and its performance is important for crypto-reliant software, both Intel and AMD have added support for the AES-NI extension. The AES-NI extension offers specialized instructions for performing AES encryption and decryption. The instructions significantly outperform alternative implementations, and are always preferred when the AES-NI extension is available. In the Linux kernel, the default AES cipher implementation, under both Intel and AMD, is the one provided by the aesni-intel driver.
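To illustrate the user space path via libkcapi [Mue], a minimal sketch of a one-shot AES-CBC encryption could look as follows; buffer handling is simplified and the function names follow the libkcapi documentation:

    #include <stdint.h>
    #include <unistd.h>
    #include <kcapi.h>

    static int encrypt_cbc(const uint8_t *key, uint32_t keylen,
                           const uint8_t *iv,
                           const uint8_t *in, uint8_t *out, size_t len)
    {
        struct kcapi_handle *handle;
        ssize_t processed;

        /* Ask the kernel for its highest-priority cbc(aes) implementation. */
        if (kcapi_cipher_init(&handle, "cbc(aes)", 0))
            return -1;

        /* The key is registered with the kernel and kept in kernel memory. */
        if (kcapi_cipher_setkey(handle, key, keylen)) {
            kcapi_cipher_destroy(handle);
            return -1;
        }

        processed = kcapi_cipher_encrypt(handle, in, len, iv,
                                         out, len, KCAPI_ACCESS_HEURISTIC);
        kcapi_cipher_destroy(handle);
        return processed == (ssize_t)len ? 0 : -1;
    }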

2.7 Disk Encryption

Hard Disk Drives (HDD) and Solid State Drives (SSD) store long-term information which is reused between power downs and power ups of the system. Such storage mediums typically contain sensitive data — the users’ files, browsing history, website passwords, etc. — and should be protected against attackers with physical access to the system. If the HDD or SSD is stolen, the thief can perform forensic analysis to locate high-profile secrets like passwords or compromising information. Full Disk Encryption (FDE) is an approach for protecting the storage medium’s contents by means of encrypting the stored data. The data is encrypted using a secret key which is typically derived from a user-provided password. If the malicious actor does not know the password or the key, decrypting the disk’s contents would likely require time-consuming brute-forcing if a strong cryptographic cipher and key derivation algorithm are used. Linux provides disk encryption support via the dm-crypt module, which only supports the encryption and decryption of block devices. A block device is only an

abstraction, and its memory can be backed by a hardware device, such as a storage medium, by a file located on the system, by memory on a remote system, or even by another block device. In the context of disk or partition encryption, dm-crypt is provided an encrypted block device and a cryptographic key as input. In turn, dm-crypt maps the encrypted block device to another block device which can be later mounted by the user. Reads and writes to the block devices are handled transparently by dm-crypt, which decrypts or encrypts the data using the cryptographic key provided earlier. Disk encryption via dm-crypt can be managed from user space by using the dmsetup command line tool. However, the tool is complex and another higher-level command line tool can be used instead: cryptsetup. When manipulating an encrypted block device, the cryptsetup tool always needs to be provided the password to the device. Additionally, when a device is formatted by cryptsetup, the disk encryption format needs to be specified. The disk encryption format describes how to interpret the data on the encrypted block device. A commonly-used disk encryption format is the LUKS format [Fru11]. The format employs a two-level key hierarchy: a master key to transform the user’s data, and a password-derived key to transform the master key. The master key is generated from a secure random source and never changes until the user decides to format the block device. Metadata of the master key — cipher name, cipher mode and master key hash — is stored in the LUKS header. The master key is always stored encrypted in one of the eight key material blobs which follow the header. The master key is encrypted with the corresponding password-derived key. To decrypt the device’s data, a user must input a password which is then exhaustively checked against all key material blobs. For each blob, a key is derived and the key is used to decrypt the master key. Then the hash of the decrypted master key candidate is verified against the master key hash located in the LUKS header. If the hashes match, the master key candidate is considered to be the master key and the device’s data can then be decrypted. This hierarchical approach has the benefit of being able to change the user password without having to re-encrypt the device’s memory. However, this approach also generates multiple high-profile secrets: the master key and the password-derived keys. If any of the keys is disclosed to a malicious actor, the disk can be decrypted. Figure 2.12 displays a high-level overview of the stack for device encryption using cryptsetup. The hardware device (HDD, SSD, etc.) is located on level 0 and is operated by a device-specific driver on level 1. The unit for reading and writing to the device is the sector size of the device, which for disk drives can vary between 512 and 4096 bytes. On level 2 is located the block device driver which communicates device reads and writes with the device-specific driver. The block device driver performs reads and writes at block size granularity, which is 512 bytes by default but can often be configured by the user. On level 3 is the device mapper, which includes the dm-crypt module. The

device mapper propagates access requests to the block device driver and transparently cryptographically transforms the requests if necessary. In case dm-crypt is used, the module relies on the KCAPI for serving the cryptographic operations. On level 4 is the cryptsetup tool which operates in user space. The tool is responsible for setting up the device mapper configuration and forwarding the master key.

Figure 2.12: Software stack for encrypting the memory of a hardware device using cryptsetup.

Although disk encryption is an approach for protecting the contents of the storage medium, volatile memory such as DRAM is often not protected and the master key is left present in plain text. Thus, if an attacker has access to kernel memory via a memory disclosure vulnerability, then the master key can be leaked and the contents of the storage medium decrypted. Additionally, this issue is exacerbated by emerging technologies like Non-Volatile RAM, with which cold boot attacks become more applicable [Yit+17].
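For reference, a typical LUKS workflow with cryptsetup might look roughly as follows; the device path and the mapping name are illustrative:

    # Format the partition with LUKS; a key is derived from the password
    # and the encrypted master key is stored in a key material blob.
    cryptsetup luksFormat /dev/sdb1

    # Open the device: the password-derived key decrypts the master key,
    # and dm-crypt exposes the plaintext view under /dev/mapper/secret.
    cryptsetup open /dev/sdb1 secret

    # The mapped block device can now be used like any other.
    mkfs.ext4 /dev/mapper/secret
    mount /dev/mapper/secret /mnt

    # Tearing down removes the mapping again.
    umount /mnt
    cryptsetup close secret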

3 Related Work

SE-Vault is a solution for protecting cryptographic keys and other system secrets inside an encrypted VM using any of the SEV hardware features. However, its design is implementable with other TEE solutions such as Intel Trust Domain Extensions [Cor20a], Intel Software Guard Extensions [CD16] and IBM Protected Execution Facility [Kera]. The encrypted VM receives encryption keys, stores them during its lifetime and processes cryptographic transformation requests sent by various users such as the KCAPI or OpenSSL. Because the VM’s memory is inaccessible in cleartext to the HV, to other software and to devices, the cryptographic keys cannot be read by an attacker. This design protects against a plethora of attacks which traverse RAM to locate and extract cryptographic keys. The work presented in this Master’s thesis is one of a few solutions for protecting system secrets by utilizing hardware features. In the rest of this chapter, I discuss related research on protecting cryptographic secrets and on protecting virtualized environments.

3.1 Protection of Kernel Cryptographic Secrets

Tresor [MFD11] is a solution for protecting disk encryption AES keys by storing them in hardware debug registers. The x86-64 architecture includes eight debug CPU registers which can be utilized by debuggers such as GDB to set breakpoints or watch for memory accesses to specific addresses. Because the registers are typically unused on a production system, Tresor can reserve the registers to store an encryption key, but due to architectural limitations only a single 256-bit AES key can be stored per CPU core. The AES algorithm is implemented using the AES-NI instruction set which stores intermediate results in SIMD registers. A limitation of using the few available debug registers is that the AES round keys need to be recomputed for each transformation request, which causes performance degradation. To avoid spilling the keys and intermediate results into DRAM, the code is written in assembly and interrupts are disabled while a transformation request is handled. Tresor is available as a Linux kernel patch which restricts usage of hardware debug registers and modifies the default AES implementation in the Kernel Crypto API (KCAPI). The design of Tresor poses limitations on the usage of the system: debug capabilities are

reduced, the maximum key length is 256 bits, and the number of protected keys is at most the number of CPU cores. Additionally, the keys are not protected against an attacker who can execute code inside the kernel, as shown by Blass et al. [BR12]. Amnesia [Sim11] is a solution similar to Tresor for protecting encryption keys against memory disclosure attacks. However, Amnesia is designed to protect arbitrarily long and arbitrarily many keys, unlike Tresor which has limitations imposed by its design and by the system’s hardware. To achieve this, Amnesia stores a master key across hardware performance counter registers and keeps all other encryption keys in DRAM, but encrypted by the master key. When a cryptographic transformation needs to be performed, the corresponding key is read from DRAM and decrypted with the master key, and only then is the original request performed. Interrupts are disabled during this period to avoid spilling registers to DRAM. Just like Tresor, Amnesia stores the secret key in feature-specific hardware registers inside the CPU and also requires that the feature be disabled for the entire system. Disabling performance counters is not a great limitation since these counters are only used when profiling and tracing applications to optimize code. Thus, enabling Amnesia on a production system would not cause any issues since profiling and tracing are typically performed during the development phase. Similarly to Tresor, Amnesia is vulnerable to attacks which gain code execution inside the kernel [BR12]. Copker [Gua+14b] is another solution for protecting encryption keys and its design comes as a combination of Tresor and Amnesia. Copker utilizes debug registers to store a master key which is then used to decrypt the corresponding key for a particular transformation request. The difference in Copker’s design is that it utilizes the CPU caches to store sensitive data such as partial results from the cryptographic transformation. Similarly to all previous solutions, Copker disables interrupts to guarantee atomicity of cryptographic transformations and to prevent the spilling of the context to DRAM. Additionally, Copker maintains the state of all CPU caches and carefully selects which virtual addresses are used. This is necessary to guarantee that the cache lines with sensitive information are never accidentally evicted from the L1 and L2 caches into DRAM. Unfortunately, Copker requires non-trivial changes to code running on microarchitectures with different cache inclusion and replacement policies. On recent CPU microarchitectures, Copker requires all cores to enter no-fill mode, which can lead to significant performance degradation for the entire system [Gua+14b]. A more recent solution for cryptographic key protection is Mimosa [Gua+15]. Mimosa is an extension of Tresor but uses the L1 data cache to store temporary results and stores secret keys encrypted in DRAM. The keys are encrypted by a master key which is contained in the core’s debug registers. The sensitive temporary results inside the L1 data cache are protected by using Hardware Transactional Memory (HTM) to monitor

accesses to the data. If a memory access from another thread is observed, the sensitive data is immediately cleared. HTM is a mechanism for wrapping memory accesses in a transaction which completes only if the same data is not read or invalidated by another thread, among other constraints. The Mimosa implementation uses Intel Transactional Synchronization Extensions (TSX) to protect temporary results in its L1 data cache. Unfortunately, extensions for HTM are available only on a few architectures, like Intel64 and IBM POWER, and even then the technology may need to be disabled since it has been widely used in microarchitectural and side-channel attacks. For example, Mimosa is not applicable to any AMD processor at the time of writing. Tresor-SGX [RGM16] is another recent solution for protecting cryptographic keys. Unlike all previous solutions, Tresor-SGX stores the cryptographic keys inside an enclave in a user space application protected with Intel SGX. Intel SGX is a TEE solution designed to protect user space applications from a malicious privileged attacker. The Intel SGX attacker model is very similar to that of SEV but has fewer attack vectors by design. Tresor-SGX keeps cryptographic secrets inside a dedicated region in DRAM called the Enclave Page Cache. The Enclave Page Cache is encrypted similarly to the memory of an SEV-protected VM. Tresor-SGX loads a Linux Kernel Module (LKM) which communicates cryptographic transformation requests from the KCAPI to the SGX enclave using the Netlink inter-process communication interface. The requests are handled inside the enclave without exposing the keys outside the encrypted memory region, and results are communicated back to the module. Tresor-SGX is very similar to SE-Vault in its overall design but unfortunately offers poor performance due to the usage of a high-level interface like Netlink and the requirement of multiple context switches between security domains. For example, a request from the KCAPI needs to be forwarded from the Tresor-SGX LKM to the user space application, which requires the first context switch. The application then communicates the request to the Tresor-SGX enclave by using an SGX-specific interface for cross-domain communication, which requires another context switch. Sending the result back to the KCAPI requires the same sequence of steps, which leads to a total of four context switches for a single transformation request. In comparison to Tresor-SGX, SE-Vault requires only two context switches between security domains for sending a single transformation request and receiving its result. Additionally, my choice of communication interface — VirtIO — is designed for high-throughput communication and has a significantly simpler implementation than that of the Netlink interface. The design decisions behind SE-Vault lead to significantly better performance than that of Tresor-SGX, as shown in Chapter 6, while keeping the security guarantees the same. Table 3.1 shows a summary comparison of SE-Vault and all examined solutions for protecting cryptographic secrets. Tresor, Amnesia and Copker reserve hardware capabilities such as debug registers and performance counters, which limits their

applicability across systems. Copker relies on a specific microarchitectural design of the cache system, its implementation requires tailoring based on the cache inclusion and replacement policies, and it can lead to performance degradation of the entire system. The three solutions protect against memory disclosure attacks but not against an attacker who escalated their privileges on the system. Additionally, the attack presented by Blass et al. [BR12] is applicable to all three solutions without additional hardware protections. The next two solutions — Mimosa and Tresor-SGX — are only applicable on recent Intel CPUs. Although Intel TSX is available on various Intel CPUs, it is often disabled due to the security implications of the extension [Gup][Sch+19]. Tresor-SGX relies on SGX which is available on fewer Intel CPUs. As discussed earlier in this chapter, Tresor-SGX stores cryptographic keys inside an SGX enclave in a user space application, which causes a performance degradation due to multiple context switches between security domains. The encryption and decryption throughput of Tresor-SGX is significantly lower than that of any other examined solution. The last row shows information about the work of this thesis: SE-Vault. Although the current implementation is only tested with SEV, it would also work transparently with SEV-ES and SEV-SNP if a Linux VM is used. Additionally, SE-Vault can easily be re-purposed to function with other secure virtualization extensions, not just the ones from AMD. By design, SE-Vault can support any cryptographic operation and does not have a limit on the number of cryptographic secrets it can store. SE-Vault offers significantly better performance than Tresor-SGX because it relies on a proven performant interface like VirtIO and additionally employs techniques to reduce the overhead of context switches and extra memory copies.

Name                     Dependency                              Max Key Length   Max Key Count   Performance   Vulnerabilities
Tresor                   Debug Registers                         256 bits         Core count      Poor          Privilege escalation
Amnesia                  Performance Registers                   Unlimited        Unlimited       Poor          Privilege escalation
Copker                   Debug Registers, Cache Policy Control   Unlimited        Unlimited       Poor          Privilege escalation
Mimosa                   Intel TSX                               Unlimited        Unlimited       Good          Privilege escalation
Tresor-SGX               Intel SGX                               Unlimited        Unlimited       Poor          Attacks on SGX
SE-Vault (this thesis)   AMD SEV                                 Unlimited        Unlimited       Good          Attacks on SEV

Table 3.1: Summary comparison of solutions for protecting cryptographic keys.

3.2 Development of Secure Virtualized Environments

The SEV feature family is primarily focused on protecting a virtualized environment which includes an OS and additional user space software. However, SEV can also be used to protect legacy user space applications [Kap16] by wrapping the application in a minimal bare-metal OS which propagates special events to the HV. SEVGuard [PNG19]

is such a solution, which creates a minimal SEV-protected environment for legacy applications and operates very similarly to an Intel SGX enclave. The design of Palutke et al. consists of two components: the unprotected SEVGuard runtime and the Guest application. The SEVGuard runtime is responsible for setting up the Guest application and for establishing a communication channel between the Host kernel and the Guest application. The runtime sets up the Guest application by directly communicating with KVM using ioctl commands, which avoids the dependency on QEMU or any other complex hypervisor. During the setup, the SEVGuard runtime allocates a memory region for the Guest application, copies over the application’s code and data, creates a Guest Page Table for the application and sets the initial architectural state of the Guest application. Additionally, the SEVGuard runtime copies a small stub into the environment to perform initialization when the SEV-protected Guest application is launched. When the Guest application executes a syscall to call into the kernel, a handler in the Guest application is invoked. The handler copies the syscall request information to a decrypted page and then calls into the HV by executing the vmmcall instruction. The HV would then propagate the vmmcall to the SEVGuard runtime. The runtime can then collect the syscall information from the decrypted page and execute the syscall. The result of the syscall would then be copied to the decrypted page and the Guest application would be resumed. The syscall handler in the Guest application can retrieve the information from the decrypted page, update the state and execute the IRET instruction to resume execution. Palutke et al. report a few limitations of the current SEVGuard implementation. First, the Guest application cannot make use of multithreading, which leads to limited compute performance when SEVGuard is used. Second, the IO throughput is reduced by 10x due to the multiple context switches between different security domains. Third, SEVGuard does not support the virtualization of arbitrary legacy applications, and the authors discuss that applications may require code modifications. Although validating the state and results of the SEVGuard implementation would be interesting, no code has been made publicly available. A reason for this may be the pre-existing patent of the idea from AMD [Kap16]. SE-Vault can borrow ideas from the SEVGuard design to reduce dependencies and minimize the trusted computing base. For example, SE-Vault can be implemented to work as a "bare-metal" OS by relying on the HV to perform the initial state setup. This would eliminate the dependency on QEMU, as well as the necessity to use a general-purpose kernel like Linux or seL4. Wu et al. [Wu+18] propose an extension to the Xen HV to harden it against known attacks on an SEV VM. As shown in Chapter 7, the HV can exploit missing integrity protection

and validation of an SEV- or SEV-ES-protected VM to exfiltrate secret data or even gain code execution inside the VM. The authors propose a hardening against such attacks by separating security-sensitive features and data from the HV into a separate protected component: Fidelius. Page table management, the VM’s memory and access to special commands are moved from the Xen HV to the Fidelius component. Additionally, the authors provide a tool to search for gadgets in the Xen binary which may be used to attack the SEV-protected VM. Although this academic project is definitely interesting, the provided solution only hardens the HV against attacks aimed at the SEV- or SEV-ES-protected VM. If the System Owner has malicious intentions or the kernel is fully compromised, an attacker would be able to circumvent the protections of Fidelius and apply the same attacks from Chapter 7 to the SEV- or SEV-ES-protected VM. However, the software hardenings of Fidelius can be useful in protecting SE-Vault, whose attacker model (Chapter 4.2) assumes a benign but vulnerable HV. For example, SEV does not protect the VM’s register state, which is stored unencrypted in the HV’s memory on a VM-exit. In case of a memory disclosure vulnerability, an attacker would be able to sample the VM’s register state and potentially read sensitive information. Fidelius mitigates such an attack by moving the register state structure into its own address space, which makes the registers directly inaccessible to the HV. Another solution for enhancing the security of a VM was proposed by Guan et al. [Gua+14a] and later extended by Chu et al. [Chu+19]. The solution opens the possibility for VMs to store cryptographic keys in the Virtual Machine Manager (VMM) and have the VMM process cryptographic operations. If the VM is compromised due to a vulnerability in its software stack, cryptographic keys would not be exposed to the attacker since they are stored in the VMM’s memory. The VMM’s memory is only accessible to a privileged user of the Host system. Both works rely on the VirtIO interface for the communication of cryptographic keys and transformation requests. The work of Guan et al. [Gua+14a] and Chu et al. [Chu+19] relates to SE-Vault in protecting cryptographic keys and in using the VirtIO interface for communication. In contrast, SE-Vault’s purpose is to protect Host secrets by storing them in a TEE, and it operates under a stricter attacker model. Additionally, Chu et al. report a performance degradation of 30x for AES encryption and decryption, while SE-Vault incurs a 2.5x reduction in throughput performance. SE-Vault can be used as an addition to the work of Chu et al. by having the cryptographic secrets of all VMs be stored in the SE-Vault VM instead of in the VMM. Since the SE-Vault VM is protected against memory disclosure attacks, unlike the VMM, all cryptographic keys can remain protected.

4 Design of SE-Vault

SE-Vault is designed to protect cryptographic secrets, such as AES keys, from malicious users of the system. SE-Vault stores the cryptographic keys in a protected Trusted Execution Environment (TEE) and performs cryptographic transformations, requested by other software components, without ever exposing their secret keys. The TEE’s design and implementation are based on the AMD Secure Encrypted Virtualization (SEV) family of features [KPW16]. SEV protects a virtualized environment by means of memory encryption, and additionally allows the initial state of the environment to be attested remotely. Although SEV is used, the design of SE-Vault is also applicable to other recent TEE technologies like Intel Trusted Domain Extensions [Cor20a], IBM Protected Execution Facility [Kera] and Intel SGX [CD16]. The design of SE-Vault does not focus on implementing any cryptographic algorithms but rather depends on the software inside the VM to provide an efficient implementation of the supported transformations. At the core of its design, SE-Vault includes secure key provisioning and efficient communication with the software components which use it. Efficient communication is achieved via VirtIO — a standardized interface for communication between a HV and a VM. The design choice of using VirtIO for communication renders SE-Vault portable to popular OSs (Linux, OpenBSD, L4) and HVs (KVM and Xen [Chi08]).

4.1 Design Overview

SE-Vault has two design goals: to provide a secure environment for storing cryptographic keys, and to expose an interface for handling cryptographic transformations within that environment without exposing the stored secrets to untrusted software and hardware. SE-Vault’s design considers an attacker who exploits a memory disclosure vulnerability to gain read access to all of physical memory. To ensure the protection of the stored secrets, SE-Vault stores them in a VM protected with SEV or SEV-ES. However, SE-Vault is not limited to specific features for secure virtualization because the design does not make assumptions about the offered protection and does not rely on any specific features of the Host CPU. SEV and SEV-ES were chosen due to the availability of CPUs which support them, but one can also employ other secure virtualization features.


SE-Vault’s interfaces are designed to be easily integrated into other software which performs cryptographic operations. Users of SE-Vault can be of any kind — user space software or the kernel — and thus the efficiency of handling users’ requests can have varying impact on system performance and responsiveness. For example, a poor design or implementation can lead to slow reads from encrypted partitions, which would make dependent software run with degraded performance or be unresponsive for the user. Thus, an efficient design is of high importance to render this solution practical.


Figure 4.1: Design overview of SE-Vault. Initially, the SE-Vault QEMU device and the SE-Vault guest driver establish communication using the VirtIO setup specification. The device communicates the information to the SE-Vault host driver using the vhost interface. User processes and the kernel can send cryptographic secrets and requests via two different interfaces to the SE-Vault host driver. The host driver sends the information to the Guest driver, which processes transformation requests using its crypto engine, and then returns back the results.

Figure 4.1 gives an overview of the main components in SE-Vault and of their location in the whole system. In this design, I rely on QEMU for launching the VM but other


HVs which support VirtIO and SEV can be used. Shown in 1 is the SE-Vault device in QEMU. The device implementation has the responsibility of registering itself as a PCI device, of exposing the supported VirtIO configuration to the Guest and of orchestrating the transition to using the vhost interface. With the vhost interface in use, the VirtIO data plane is moved into a kernel device driver, which leaves the QEMU device implementation small and with limited responsibility. Such a design decision not only improves performance, but also has the advantage that the device implementation is easily portable to other HVs, such as the Native Linux KVM Tool (LKVM) [Enb+]. Shown in 2 is the SE-Vault host driver. The driver exposes an ioctl ABI to user space and an internal Kernel API. Both interfaces can be used to transmit cryptographic keys, to send requests for cryptographic operations to the Guest, and to read the returned responses from the Guest. Users of these interfaces can be other kernel drivers or the kernel itself, as well as user space programs like OpenSSL. The SE-Vault host driver receives requests via the interfaces, propagates the information to the Guest using VirtIO and sends an event to the Guest to perform the operations. The host driver tracks unresolved requests and already completed operations. When a user wants to read the result of a cryptographic operation, the user performs the appropriate ioctl command or function call, and then receives either the result or an error that the operation has not yet completed. Shown in 3 is the SE-Vault guest driver. The guest driver stores cryptographic keys and receives requests to perform cryptographic transformations over the VirtIO interface. The guest driver processes the requests, sends the results back to the Host and notifies the Host of completion. The guest driver can make use of the already existing software in the VM to perform the cryptographic transformations. This relieves the SE-Vault guest driver of providing a secure and efficient implementation of cryptographic primitives, which makes an implementation more easily portable. For example, if the VM’s software contains a cryptographic library which makes use of ISA extensions such as AES-NI, the SE-Vault guest driver can directly use it. The SE-Vault guest driver stores valuable cryptographic keys and other long-term secrets which have to be protected while the system is powered. The confidentiality of the VM’s memory can be protected with SEV or other hardware extensions for secure virtualization. Because the SE-Vault guest driver stores the secrets in the encrypted memory region within the VM, no software outside the VM can read the secrets in plain text.
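As an illustration of the described flow, a sketch of what such an ioctl ABI could look like follows; all identifiers here are hypothetical, not the actual implementation:

    /* Hypothetical sketch of the SE-Vault host driver ioctl ABI. */
    #include <linux/ioctl.h>
    #include <linux/types.h>

    struct sevault_add_key {
        __u32 key_type;   /* e.g., AES-128/192/256 */
        __u32 key_len;    /* length of key[] in bytes */
        __u8  key[64];
        __u32 key_id;     /* returned: private handle for this user */
    };

    struct sevault_request {
        __u32 key_id;     /* handle returned by SEVAULT_IOC_ADD_KEY */
        __u32 op;         /* encrypt, decrypt, ... */
        __u64 data;       /* user pointer to the payload */
        __u32 len;
        __u32 req_id;     /* returned: used to poll for the result */
    };

    #define SEVAULT_IOC_ADD_KEY  _IOWR('s', 0x01, struct sevault_add_key)
    #define SEVAULT_IOC_REQUEST  _IOWR('s', 0x02, struct sevault_request)
    #define SEVAULT_IOC_READ_RES _IOWR('s', 0x03, struct sevault_request)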


4.2 Attacker Model

The design of SE-Vault considers an attacker model in which a malicious user has unrestricted read access to physical memory (DRAM). The malicious actor may have achieved this by exploiting a memory disclosure vulnerability in the kernel. Notably, the attacker model excludes the scenario of the user having arbitrary write access, kernel-level code execution capabilities, or physical access to the system. In such a scenario, the confidentiality guarantees of the SEV and SEV-ES features can be compromised (see Chapter 7), which would allow the attacker to extract any stored secrets in the SE-Vault guest driver regardless of the design and implementation of SE-Vault. This attacker model assumes that secrets, such as AES keys, are transferred into the encrypted VM before an attacker has gained read access to physical memory. This assumption guarantees that the secrets cannot be read while they are being sent to the VM. After the encrypted VM has received the secrets, all occurrences of the secrets are carefully erased from memory to avoid leaking them implicitly. With this attacker model and the design of SE-Vault, an attacker may be able to read secret messages and data, like encrypted packets or an encrypted partition, but with severe limitations. First, the attacker can only sample memory and must always know which small portion of physical memory to read, which may require guessing. Second, short-term secret data may be implicitly inaccessible depending on the design of the CPU’s caches. For example, short-term secret data may be located only in the L1 and L2 data caches, but not in the L3 cache and DRAM. If the attacker operates on a separate physical core, the attacker would not be able to access the L1 and L2 caches of another core, thus hindering the attack. Third, stealing short-term secret data would require the attacker to sample the precise physical region at the precise time, which would be difficult. If the secret data is deallocated, it may become partially corrupted or even zeroed out, rendering the stolen information useless for the attacker.

4.3 Host-Guest Communication

An efficient and scalable design of the interface for communication between the Host and the Guest is of paramount importance for the performance of SE-Vault. Because this communication channel acts as a sink and all transformation requests have to be transmitted through it, the overall performance of SE-Vault is limited by the performance of sending and receiving data through it. Figure 4.2 shows the data communication path between the SE-Vault host driver and the SE-Vault guest driver. The data exchange happens through streams which use VirtIO’s Virtqueues underneath.



Figure 4.2: Multi-stream design approach in SE-Vault. Keys and requests are sent to the SE-Vault guest driver using a data stream — a pair of an Input Virtqueue and an Output Virtqueue. Cryptographic keys are transferred using the Key stream. Transformation requests are communicated using one of many IO streams, each suitable for a different request size. For example, large requests can be better served by the IO stream with buffers of size 8192 bytes. A crypto worker in the Guest handles the request and returns the result.

Delivery of keys to the SE-Vault guest driver happens through the Key stream. The Key stream uses two Virtqueues: one to send keys from the Host, and one to receive an acknowledgment from the Guest that the key is valid and can be used for cryptographic transformations. When the SE-Vault user wants to send a key, it populates a structure with information about the key type and the key itself, and then sends it to the SE-Vault host driver. The host driver allocates a private identifier for the key, sends the key to the SE-Vault guest driver, and then returns the key identifier to the user. The user can later use the key identifier to request cryptographic transformations. No user has access to the key identifier of another user. Upon receiving the key, the SE-Vault guest driver stores it, allocates the necessary objects to process future transformations with the key, and sends back an acknowledgment to the host driver. When a key is successfully registered inside the SE-Vault guest driver, the host driver can proceed to send cryptographic transformation requests to the guest driver. Such transformations can be encryption, decryption or signing requests, and are transmitted over the IO streams. Figure 4.2 shows four streams accepting payloads of size 128, 512, 4096 or 8192 bytes. Each stream consists of two Virtqueues: an input queue and an output queue. The SE-Vault host driver sends a request over the input queue, and the SE-Vault guest driver sends back the result over the output queue. The provided request includes a request identifier, the key identifier, the transformation type (encryption, decryption, etc.), and the data to be transformed.
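As an illustration of how the host driver could map payloads to the four IO streams, consider the following hypothetical helper; the helper name and the splitting policy are illustrative, not part of the implementation:

    #include <stddef.h>

    /* Buffer sizes of the four IO streams shown in Figure 4.2. */
    static const size_t stream_sizes[] = { 128, 512, 4096, 8192 };

    /* Pick the smallest IO stream whose buffer size fits the payload,
     * so small requests do not occupy large buffers. */
    static int select_io_stream(size_t payload_len)
    {
        for (int i = 0; i < 4; i++) {
            if (payload_len <= stream_sizes[i])
                return i;
        }
        /* Larger payloads are split into 8192-byte requests on stream 3,
         * keeping them on a single stream to preserve ordering. */
        return 3;
    }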


The delivery of keys and the delivery of transformation requests are purposely split into the Key stream and the IO streams because the two operations have significantly different performance requirements. For example, a single disk encryption key is responsible for the encryption or decryption of data in a disk partition, but the disk partition itself contains megabytes or gigabytes of data which can be divided into millions of transformation requests. Transmitting the transformation requests and retrieving the results has an effect on long-term system performance and responsiveness, but sending the disk encryption key happens only once and adds only a small amount of latency when SE-Vault is first used. The design choice is also beneficial from a software engineering perspective. The code for processing and storing cryptographic keys is different from the code for performing cryptographic operations. The separation between the Key stream and the IO streams results in easier-to-maintain code and a clear communication ABI, since no distinction is necessary for what is transmitted over a Virtqueue. The data communication plane features multiple IO streams in order to better serve the workloads of different users of SE-Vault, and to also allow for parallel processing of cryptographic operations if the VM is started with sufficient processing resources. As discussed in Chapter 2.4, the SE-Vault guest driver is responsible for allocating buffers and sending their GPAs to the SE-Vault host driver. The VirtIO device implementation in QEMU is responsible for sending the device configuration to the Guest VirtIO driver, which contains the number of descriptors supported by each individual Virtqueue. Thus, each Virtqueue contains a fixed number of data buffers and the size of each is determined by the SE-Vault guest driver. These constraints pose a serious limitation for cases when the transmitted payload can vary dynamically in size. VirtIO device drivers in Linux must deal with varying payload sizes if high performance is a goal. For example, virtio-net — a network driver — dynamically adjusts the buffer sizes based on the size of the most recently received packets. The heuristic predicts the size of future packets and adjusts the size of the exposed buffers accordingly to match the prediction. SE-Vault cannot use such a technique since the variance of the payload size can be much higher than that of a packet size. Rather, SE-Vault dedicates a set of streams, each using only a specific buffer size without making any dynamic adjustments to it. Figure 4.2 shows four streams which correspondingly handle buffers of sizes 128, 512, 4096 and 8192 bytes. Additionally, a system administrator can configure SE-Vault to use other sizes which may be better suited for another system. Each request received by the SE-Vault guest driver is forwarded to a dedicated crypto worker. The crypto worker performs the cryptographic transformation using the corresponding cryptographic key, and then returns the result back. Figure 4.3 shows a crypto worker in more detail. The crypto worker receives requests in-order over the input queue and passes them to the External Crypto Engine.



Figure 4.3: Overview of a crypto worker. A transformation request is sent using the Input Virtqueue to the crypto worker in the VM, which uses a crypto engine to process the request. Results are transferred back to the crypto worker, which propagates them to the Host using the Output Virtqueue. Requests on one IO stream are received, processed and returned in the same order.

After the engine performs the transformation, it passes the results to the crypto worker, which transmits them in-order to the Host driver over the output queue. The requests and their results are always received, processed and returned in-order for one IO stream. The reason for this is that requests such as cipher transformations can have sequential dependencies. For example, the initialization vector for AES-CBC is updated with the preceding ciphertext block. If requests were reordered, the encryption or decryption would yield erroneous results. However, requests and results on separate IO streams are not ordered relative to each other. Relaxing the ordering rules among requests on separate IO streams leads to an implementation which is trivially parallelizable at a stream granularity. A drawback of this design is that scheduling and correct synchronization of requests over multiple IO streams is left to the users of SE-Vault. However, this should not lead to high complexity or poor performance since the SE-Vault users are often aware of the size of the requests which will be sent, as illustrated by the sketch below. For example, a disk encryption program can often select the IO stream with the largest capacity since the processed data is likely to be many megabytes or gigabytes. Only the final request would under-utilize the Virtqueue buffer, but this also avoids any synchronization between IO streams since a single stream is used for all of the data. This design can be parallelized by increasing the number of vCPUs in the configuration of the VM.
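To illustrate the scheduling responsibility left to SE-Vault users, the following sketch shows how a user could pick the best-fitting IO stream for a payload. The enum, helper name and capacities mirror the four streams of Figure 4.2 but are otherwise hypothetical.

/* Hypothetical helper: pick the smallest IO stream whose buffer fits the
 * payload; the capacities mirror the four streams of Figure 4.2. */
enum io_stream { STREAM_128, STREAM_512, STREAM_4096, STREAM_8192 };

static const size_t stream_capacity[] = { 128, 512, 4096, 8192 };

static enum io_stream select_io_stream(size_t payload_len)
{
    enum io_stream s;

    for (s = STREAM_128; s < STREAM_8192; s++)
        if (payload_len <= stream_capacity[s])
            return s;
    /* Larger payloads are split into 8192-byte requests on one stream,
     * which keeps all chunks ordered without cross-stream synchronization. */
    return STREAM_8192;
}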

4.4 User Interfaces

SE-Vault is designed to offer various system users secure storage of secrets and secure processing of cryptographic transformations. SE-Vault achieves this by positioning the SE-Vault Host driver inside a central entity within the system: the Kernel. This design decision allows other components — such as the kernel itself, other kernel

modules and also all userspace software — to have a short software path to the SE-Vault Host driver. This not only allows for a very wide interface without compromising security, but also for better performance, since 1) fewer context switches are necessary, and 2) data has to be transferred fewer times than with other possible designs.


Figure 4.4: Example integration of SE-Vault into OpenSSL and dm-crypt. Integration of SE-Vault into OpenSSL happens via the SE-Vault OpenSSL engine. Because dm-crypt uses the KCAPI, keys and requests are forwarded to the cipher implementation of the SE-Vault Crypto Shim.

Figure 4.4 provides an example of integrating SE-Vault into dm-crypt [Sao14] and OpenSSL. OpenSSL is a multi-feature toolkit which implements the Transport Layer Security (TLS) and Secure Socket Layer (SSL) protocols, and includes a comprehensive general-purpose cryptographic library. dm-crypt is an encryption subsystem based on the device mapper framework in the Linux kernel. dm-crypt can be used to encrypt or decrypt disks, partitions and files. The two components rely on or offer cryptographic operations and live in different parts of the system's software stack: OpenSSL is a userspace library and dm-crypt is part of the Linux kernel. Notably, SE-Vault can serve the needs of both software components as shown in Figure 4.4. dm-crypt makes use of the KCAPI [MV] to delegate the cryptographic operations on data from block devices to cryptographic drivers inside the Kernel or to external devices. The KCAPI offers a rich interface for requesting cryptographic transformations on data, but it does not itself implement the various ciphers or hash functions it exposes. The KCAPI only has a high-level understanding of the ciphers and delegates the serving of requests to drivers which offer support for the type of transformation. By default, the KCAPI provides scheduling of requests and only propagates information

such as cryptographic keys, initialization vectors and the data to the corresponding driver. A driver can offer support for a certain cryptographic transformation by calling crypto_register_skcipher with a properly initialized struct skcipher_alg instance. SE-Vault can serve a subset of the users of the KCAPI by registering support for some cryptographic algorithms. For example, SE-Vault can register support for the CBC(AES) transformation with a high priority, which guarantees that SE-Vault would be considered for handling all CBC(AES) requests. The SE-Vault Crypto Shim in the SE-Vault Host module is a thin layer which handles communication between the KCAPI implementation and the SE-Vault Host driver. The SE-Vault Crypto Shim is responsible for registering the supported transformations, receiving and delegating requests to the SE-Vault Host driver, and delivering back the results of processed requests.

User space processes can also use SE-Vault to store secrets inside the VM and request cryptographic transformations to be applied to data without exposing the secrets. Processes can establish communication with SE-Vault using the ioctl interface. A process can open the virtual device exposed by SE-Vault and send ioctl commands for registering keys, sending data transformation requests and reading back already processed requests. Notably, the process must be started with sufficient privileges to open the SE-Vault virtual device, but its privileges can be dropped immediately afterwards.

Figure 4.4 shows an example of how OpenSSL can communicate cryptographic keys and data to SE-Vault. The OpenSSL library includes the capability of dynamically registering an OpenSSL engine. An OpenSSL engine is a shared library which implements an interface that the main OpenSSL library uses to determine the type of cryptographic operations the engine supports. When the engine is loaded, the OpenSSL library queries the capability information and the entry points for initializing and operating the cryptographic algorithms. The OpenSSL library can then call into these functions to perform cryptographic operations without having knowledge of their actual implementation. OpenSSL can use SE-Vault through an OpenSSL engine as shown in Figure 4.4. The OpenSSL library loads the SE-Vault OpenSSL engine — a shared library — and queries the cryptographic transformations supported by the engine. The engine opens the SE-Vault virtual device to communicate cryptographic secrets and transformation requests in the future. OpenSSL would call the functions to create a context for the corresponding cryptographic transformation, provide the secret key, and later send transformation requests for a particular context. All of the requests to the SE-Vault OpenSSL engine would be delegated to the SE-Vault Host driver via the ioctl ABI. The Host driver would then use VirtIO to communicate the requests to the SE-Vault Guest driver running inside the SEV-protected VM.
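A minimal engine skeleton could look as follows. The identifiers bind_sevault and sevault_ciphers, as well as the engine id, are illustrative; only the OpenSSL ENGINE API calls themselves are the real library interface.

#include <openssl/engine.h>

/* Callback through which OpenSSL queries the supported ciphers; the
 * SE-Vault-specific implementation is omitted here. */
static int sevault_ciphers(ENGINE *e, const EVP_CIPHER **cipher,
                           const int **nids, int nid);

/* Entry point invoked by OpenSSL when the shared library is loaded. */
static int bind_sevault(ENGINE *e, const char *id)
{
    if (!ENGINE_set_id(e, "sevault") ||
        !ENGINE_set_name(e, "SE-Vault offload engine") ||
        !ENGINE_set_ciphers(e, sevault_ciphers))
        return 0;
    /* Here the engine would open /dev/sevault-user for later requests. */
    return 1;
}

IMPLEMENT_DYNAMIC_BIND_FN(bind_sevault)
IMPLEMENT_DYNAMIC_CHECK_FN()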


Providing an interface to user space processes can also aid in validating and testing the implementation of the SE-Vault Host and Guest drivers. Tests can be implemented in commodity languages, such as C, C++ or Python, which support opening a virtual device and communicating via ioctl commands. The tests can also be implemented on top of existing test frameworks such as Google Test [Inc], which allows one to easily add new test cases and run them with various arguments. Running the tests from a user space process also shortens the iteration time when designing tests. For example, without the ioctl ABI, tests would have to be implemented inside the kernel. If one of the tests inside the kernel were erroneously implemented, the kernel could fault and enter a kernel panic, or, even worse, the file system could become corrupted. An erroneous test in a user space process can only lead to the process being terminated, which allows the programmer to correct the mistake, recompile, and run the test again without having to restart the whole system.

5 Implementation

Chapter 4 proposes a design of a Trusted Execution Environment (TEE) for storing long-term secrets, like AES keys, and for communicating cryptographic requests. The design of SE-Vault assumes that the long-term secrets are protected with SEV and that VirtIO is used for communication. Although the design is certainly implementable, the choice of a TEE solution – namely SEV – creates software dependencies which require discussion. First, although SEV was added to the Linux kernel in 2018 [Kap], no other OS has added support for the feature. Thus, SE-Vault inherits the limitation that both the Host OS and the Guest OS can only be Linux, since only Linux supports SEV. Second, only QEMU fully implements the KVM API for creating an SEV-protected VM. Although it is possible to write a minimal HV [PNG19], such a HV would also need to implement the VirtIO initialization sequence and be compliant with the VirtIO specification. Because such an endeavor is outside the scope of this thesis, only QEMU is a suitable HV for fulfilling the design of SE-Vault. Third, using SEV or SEV-ES for the protection of long-term secrets raises a difficulty during the boot sequence of the VM. As discussed in Chapters 2.2 and 2.2.2, SEV and SEV-ES require additional steps for communicating information with the HV. SEV requires the Guest to explicitly mark a memory region as unencrypted in order to exchange data with the HV, and SEV-ES additionally requires the GHCB page to be marked as unencrypted to communicate register state information for the emulation of "special" instructions like cpuid and rdmsr. Because QEMU needs to communicate initial state information and the VM always executes "special" instructions at the boot stage, the VM firmware and/or bootloader need to be SEV and SEV-ES aware. The only such boot software is the Open Virtual Machine Firmware (OVMF). Thus, the full list of dependencies is: QEMU, OVMF and Linux. Although it is possible to drop these dependencies by implementing a minimal HV and OS, such an effort is outside the scope of this thesis. Additionally, SEV-ES and its successor SEV-SNP require non-trivial changes to the Guest OS and HV which would further complicate the implementation.

This chapter is structured as follows. Section 5.1 introduces the implementation of the SE-Vault QEMU device. Sections 5.2 and 5.3 discuss the implementations of the SE-Vault host driver and of the SE-Vault guest driver, respectively. Section 5.4

discusses necessary code hardenings of the Host and Guest OSs. Section 5.5 showcases an implementation of the SE-Vault guest driver using the seL4 microkernel [Kle+09] instead of Linux.

5.1 QEMU Device

The design of SE-Vault requires that QEMU is used as the HV. This requirement imposes that VirtIO communication between the Host and the Guest needs to be established via a VirtIO QEMU device. The QEMU virtual device is attached to the virtual PCI bus, where it can later be identified by the Guest. The implementation of the QEMU SE-Vault device is straightforward because the data communication plane is moved to a vhost-based LKM. The QEMU device only needs to perform the device initialization, to communicate the configuration to the Guest, and to forward the VirtIO information to the vhost-based LKM. Relieving the QEMU device implementation of data communication also has the benefit that the code can be easily ported to another HV like LKVM. When launching the SE-Vault VM, QEMU needs to be provided the -device virtio-sevault-pci argument in order to create the QEMU SE-Vault device and attach it to the PCI bus.
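For illustration, a launch command could look as follows. The SEV-related options are standard QEMU flags, while the image paths and the cbitpos value are system-specific assumptions, and virtio-sevault-pci is the device added by this work.

qemu-system-x86_64 \
    -enable-kvm -machine q35,memory-encryption=sev0 \
    -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 \
    -drive if=pflash,format=raw,unit=0,file=OVMF.fd,readonly=on \
    -drive file=sevault-guest.img,format=qcow2 \
    -device virtio-sevault-pci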

IO Stream      Block Size (bytes)   Descriptor Table Size
BLOCK_TYPE_0   1024                 256
BLOCK_TYPE_1   2048                 256
BLOCK_TYPE_2   4096                 256
BLOCK_TYPE_3   8192                 256
BLOCK_TYPE_4   16384                128

Table 5.1: Block size and descriptor table size for all IO Streams. While BLOCK_TYPE_0 is suitable for small- and medium-sized payloads, BLOCK_TYPE_4 is suitable for large-sized payloads.

Discussed in Chapter 4.3, communication between the SE-Vault host driver and the SE-Vault guest driver happens via streams which use input and output Virtqueues underneath. Communication of cryptographic requests happens via multiple IO streams which are suited for different workloads. The current QEMU SE-Vault device implementation supports five IO streams with varying block sizes and varying descriptor table sizes. Table 5.1 lists the available IO streams with their block size and descriptor table size. The block size varies in order to accommodate different user workloads, from encrypting small network packets to encrypting large files. The descriptor table size is selected based on a limitation of the SE-Vault guest driver implementation, which

currently allocates physically-contiguous memory for all the blocks in the descriptor table.

5.2 Host Linux Driver

The Host SE-Vault driver is the central component for data communication between SE-Vault users and the SE-Vault Guest driver. At a high level, the Host driver only receives data from the SE-Vault users, sends it to the Guest driver and propagates back the result of the processed request once the Guest returns it. However, the details of the implementation can vary, which can have a dramatic effect on performance. In this Section, I describe my PoC implementation of the Host driver and the added performance optimizations.

5.2.1 Initialization

The SE-Vault host driver is implemented as an LKM which gives the driver fast access to the KCAPI and to the Guest event functionality in KVM. However, the Linux kernel does not implement a full-fledged HV and has no notion of the address space contents of a Guest. Thus, the Host driver does not know the PCI configuration space and cannot use the VirtIO interface directly.


Figure 5.1: The SE-Vault QEMU device provides the data plane configuration to the SE-Vault host driver. The Host driver relies on the VirtIO implementation in the vhost LKM to communicate with the SE-Vault guest driver.


The Host LKM relies on QEMU to provide the device configuration using the vhost interface. Shown in Figure 5.1, the QEMU SE-Vault device opens the /dev/vhost-sevault device. In handling the open operation, the LKM initializes its context, lists the Virtqueues it supports and calls vhost_dev_init to create a vhost context. The QEMU SE-Vault device immediately afterwards provides the VirtIO device configuration using the vhost ioctl interface. For each Virtqueue, QEMU provides the Virtqueue's start address, number of descriptors, base descriptor and corresponding eventfd for notifications from the Guest. The SE-Vault Host driver does not need to handle these ioctl commands itself but can simply propagate the ioctl number and arguments to the vhost LKM. When the /dev/vhost-sevault node is first opened, the LKM does not immediately expose interfaces to future users of SE-Vault. The reason is that there is a delay between QEMU opening the node and providing the data plane configuration. The Host driver cannot process requests during this period because it does not have information about the Virtqueues' configuration. Without handling of this special case, a user could erroneously send requests to the SE-Vault host driver before communication with the Guest SE-Vault driver can be performed. Such a scenario would likely corrupt the state of the Host driver and possibly of the whole system. For example, the KCAPI implementation always performs tests on a registered cipher algorithm if the CONFIG_CRYPTO_MANAGER_DISABLE_TESTS option is not enabled, and thus this special case would always need to be handled. As a workaround, the Host driver exposes the VHOST_SEVAULT_SEND_READY ioctl command which is used by the QEMU SE-Vault device to signal that the data plane is now functional. When the Host driver is notified, the driver opens the interfaces described in Section 4.4. First, the driver establishes a channel with KCAPI users by registering the supported cipher definitions using the corresponding KCAPI API calls. In this PoC implementation, the driver only supports the AES cipher with the CBC cipher mode. Second, the SE-Vault Host driver also registers a separate virtual device with path /dev/sevault-user. User space programs can open the /dev/sevault-user device and communicate cryptographic keys and data using the ioctl ABI discussed in Section 5.2.3.
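The gating described above can be sketched as follows. The structure and helper names are assumptions; vhost_dev_ioctl is the real vhost entry point for forwarding commands to the vhost LKM.

/* Sketch of the ioctl handler with ready gating; names are illustrative. */
static long vhost_sevault_ioctl(struct file *f, unsigned int ioctl,
                                unsigned long arg)
{
    struct vhost_sevault *vs = f->private_data;

    if (ioctl == VHOST_SEVAULT_SEND_READY) {
        /* QEMU signals that the Virtqueues are configured: only now
         * register the cipher with the KCAPI and create /dev/sevault-user. */
        atomic_set(&vs->data_plane_ready, 1);
        sevault_register_kcapi_cipher();
        sevault_create_user_device();
        return 0;
    }
    /* All other commands are forwarded to the vhost LKM. */
    return vhost_dev_ioctl(&vs->dev, ioctl, (void __user *)arg);
}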

5.2.2 Data Communication

The exchange of data between the Host driver and Guest driver is the most complex part of the design and implementation of SE-Vault. The complexity arises from the requirements to have the Guest isolated and protected with SEV, and to provide efficient data communication for different system users. The SE-Vault design relies on VirtIO to provide a standardized interface for communication between a VM and a

virtual device. Although the specification is well defined and the vhost LKM provides sufficient functionality to exchange data, the Host and Guest drivers still need to manage memory regions and buffers into which data is written. Furthermore, SEV imposes that the Guest must mark the memory region for data exchange, as well as the descriptor table and ring buffers of each Virtqueue, as shared by setting the C-bit to 0. Another difficulty in implementing data communication between the Host and the Guest is that the Host driver can receive work at a higher rate than what the Guest driver can handle. For example, if a Virtqueue contains only 30 available buffers but the Host needs to send 100 buffers, the implementation needs to provide a mechanism for resolving such situations by either buffering requests or requesting that the Guest driver make more buffers available. In this Section, I describe the part of the SE-Vault Host driver implementation which is responsible for sending cryptographic keys and encryption/decryption requests to the SE-Vault guest driver.

struct sevault_key_info {
    u32 key_id;
    u32 transformation_type;
    u16 key_length;
    u16 iv_length;
    u8 key[32];
    u8 iv[32];
};

struct vhost_sevault_key_info_item {
    struct sevault_key_info *info;
    struct list_head list;
};

struct sevault_key_recv_ack {
    u32 key_id;
    int ret;
};

Figure 5.2: Structures used in sending a key to the SE-Vault guest driver. The structure sevault_key_info is used for sending a key, and sevault_key_recv_ack is used for receiving a key-registered acknowledgment from the SE-Vault guest driver. The structure vhost_sevault_key_info_item is used for buffering a key request.


Cryptographic keys need to be sent to the SE-Vault guest driver before the Host can send any transformation requests. Figure 5.2 shows the structures used for delivering keys to the SE-Vault guest driver. The sevault_key_info structure contains the cryptographic key information which is sent to the SE-Vault guest driver. The key_id field contains a unique identifier for the key and is generated by the SE-Vault host driver. Future transformation requests use the key_id to select which key has to be used by the Guest for performing the transformation. The transformation_type field contains an identifier of the cipher algorithm the key is used for. The current PoC implementation is limited to AES-CBC, which is selected by specifying the value SEVAULT_AES_CBC in this field. Next follow the key_length and iv_length fields which correspond to the lengths of the cryptographic key and initialization vector. Afterwards, the structure contains the key and iv fields which hold the key and the initialization vector themselves. The initialization vector only needs to be communicated at the time of registering the transformation with the CBC cipher mode because each subsequent initialization vector corresponds to the output of the preceding AES transformation. The vhost_sevault_key_info_item structure is internal to the SE-Vault host driver and stores keys which have not yet been sent to the Guest. The sevault_key_recv_ack structure is sent by the SE-Vault guest driver when a cipher key is received and processed. The key_id field contains the identifier of the key which was processed, and the ret field contains a return value to indicate success or failure. This allows the error to be propagated back to the user, for example in the case of an unsupported key length.


Figure 5.3: Sequence of steps for sending a key from the KCAPI or sevault-user interfaces to the SE-Vault guest driver.

Figure 5.3 shows the sequence of steps when a user sends a key to the SE-Vault guest driver. Discussed in Chapter 4.1, SE-Vault exposes interfaces to the KCAPI by registering a cipher algorithm, and to user space by creating a device node. When a key is communicated, either entry point calls the function vhost_sevault_send_key_info(...) in step 1. In steps 2 and 3, the key structure is added to the vss_key_info_list queue and a call is made to the Linux scheduler to wake up the vhost key worker. When the vhost key worker resumes execution, it iterates over the key queue and sends each sevault_key_info entry to the SE-Vault guest driver using the key Virtqueue, as shown in steps 5 and 6. In step 7, the Guest sends back the sevault_key_recv_ack for the corresponding key using the ack Virtqueue.

When a user sends a cryptographic key to the SE-Vault guest driver, the operation returns before the Guest has registered the key. However, users typically want to receive information about the success or failure of registering the key. The SE-Vault host driver therefore exposes the function host_sevault_flush_keys_queue(..., key_id) which does not return until the key request with key_id has completed. This allows user space and KCAPI users to send multiple keys and then wait for completion before sending transformation requests.

After a user has registered a key in the SE-Vault guest driver, the user can begin sending transformation requests and eventually read the transformation result. Figure 5.4 shows the structures used in the SE-Vault host driver for holding a request and its result. The sevault_transform_req structure holds all of the request information which is transmitted to the Guest driver. The key_id field contains the identifier of the registered key and the req_id field is a unique identifier for the request. The request_type field specifies what transformation is performed on the data, and in my PoC is limited to encryption and decryption. The iv_len and iv fields specify information about the initialization vector used with the transformation. These fields are not necessary for the CBC mode of operation, but are added in case other cipher modes are supported in the future. The data and data_size fields contain respectively the data which is to be transformed and its length. The data field is represented as a variable-length array to conform to the multi-stream design discussed in Chapter 4.3. Each transformation request structure can be allocated to tightly fit all of the data being transmitted, and the input Virtqueue with the best fit can be selected dynamically. The sevault_transform_result structure contains the transformation result. The SE-Vault guest driver is responsible for creating the result structure and sending it to the Host driver. The req_id field contains the identifier of the processed transformation request. The data and data_size fields correspondingly hold the transformed data and its length. Processing of requests happens asynchronously and requires additional bookkeeping to track pending and already processed requests. In the SE-Vault host driver, this is achieved with the sevault_result_item structure. The result field contains a pointer to a pre-allocated structure for the result, and the done field specifies whether the result is available or not. The req_origin field specifies the origin of the request:


struct sevault_transform_req {
    u32 key_id;
    u64 req_id;
    u8 request_type;
    u8 iv_len;
    u8 iv[32];
    u32 data_size;
    u8 data[];
};

struct sevault_transform_result {
    u64 req_id;
    u32 data_size;
    u8 data[];
};

struct sevault_result_item {
    struct sevault_transform_result *result;
    bool done;
    enum sevault_request_origin req_origin;
    struct list_head user_node;
    struct list_head vhost_node;
};

Figure 5.4: The structure sevault_transform_req holds information about a transformation request which is sent to the SE-Vault guest driver, and the structure sevault_transform_result is used to hold the result of the transformation. The structure sevault_result_item is used for keeping track of the result of a request.

either KCAPI or userspace. The distinction is necessary due to a difference in the PoC implementation of how KCAPI and userspace users are notified of the completion of a request. The user_node and vhost_node fields are intrusive list nodes which are used to track separately all of the results being received and the results which are available for one particular user. The vhost_node is used to include the item in the list of all pending requests in the SE-Vault host driver, and the user_node is used to include the item in the list of pending requests for the corresponding user.

Figure 5.5 shows the sequence of steps for propagating a transformation request from the KCAPI cipher backend to the SE-Vault guest driver and then returning back the result. In step 1, the cipher entry point calls the function vhost_sevault_send_request to initiate sending a request. In step 2, the function adds the request item to vhost_input_list, and a sevault_result_item instance to vhost_result_list and



Figure 5.5: Sequence of steps for sending a transformation request from the KCAPI cipher implementation to the SE-Vault guest driver and receiving back the result.

kcapi_result_list. In step 3, the vhost input worker gets scheduled for execution and iterates over all input data streams, processing the corresponding vhost_input_list. Each processed sevault_transform_req is removed from the input list and deallocated. Although only a single list is shown in Figure 5.5, there are multiple data streams to conform to the workloads of different SE-Vault users. In step 4, the vhost input worker consumes the required number of available VirtIO buffers, marks the buffers as used and then signals the SE-Vault guest driver that work is available. In step 5, the Guest eventually receives an interrupt, and the SE-Vault guest driver processes the request and sends its result to the SE-Vault host driver. To send the result, the Guest allocates a sevault_transform_result object, adds its physical address to the available ring of the corresponding output Virtqueue and kicks the Host to indicate that transformation results are available. The kick results in an MMIO write which is handled by KVM, and eventually the handler function — vhost_sevault_output_handle_kick(...) — is called. In step 6, the handler function copies the result to the corresponding sevault_transform_result object and sets the done field in the sevault_result_item to true. In step 7, the handler function deletes the entry from the vhost_result_list and notifies the KCAPI cipher backend that there are available results. The cipher backend can then consume the available results from the kcapi_result_list and copy the results to the

corresponding skcipher_request.

The sequence of steps for sending a request from user space is not shown because the majority of the steps are the same as the ones taken in the KCAPI path. Differences only arise at the entry and exit points for sending a request. A user space process can send a request using the ioctl ABI, and the request gets buffered into the vhost_input_list. The result structure is pre-allocated and stored in the vhost_result_list and userspace_result_list lists. In this PoC implementation, the user space process is not notified of request completion but rather receives completion information by attempting to read the result.

5.2.3 User Space ioctl ABI

The SE-Vault host driver implementation exposes a user space interface which can be used to integrate SE-Vault into cryptographic libraries like OpenSSL, or to implement unit tests. The interface is exposed through a device node, named /dev/sevault-user, which can be opened by users with sufficient privileges. A user can then issue ioctl commands to communicate with the SE-Vault host driver.

Command                   Description
SEND_KEY                  Send a cryptographic key to the SE-Vault host driver.
FLUSH_KEY_OPS             Block until all SEND_KEY requests have completed.
SEND_TRANSFORMATION_REQ   Send a transformation request to the SE-Vault host driver.
READ_OP                   Attempt to read the result of a request.

Table 5.2: ioctl commands exposed by /dev/sevault-user.

Table 5.2 lists the exposed ioctl commands and provides a short description for each one. The SEND_KEY command is used to send a cryptographic key to the SE-Vault host driver and accepts a pointer to a sevault_key_info object as an argument. When a SEND_KEY command is received, the SE-Vault host driver updates the key_id field in the sevault_key_info structure with a unique key identifier before returning from the syscall. The user can then read the key_id and use the value for future transformation requests. To wait until all sent key requests are handled by the SE-Vault guest driver, the user can send a FLUSH_KEY_OPS command to the device node. The calling thread blocks until the key requests are processed. After a key has been successfully registered, the user can send transformation requests which use the key. The SEND_TRANSFORMATION_REQ command is used to send a transformation request to the SE-Vault host driver. To describe the request, the user populates the sevault_transform_user_req structure, shown in Figure 5.6. The block_type field specifies which data stream should be used for sending the data.


struct sevault_transform_user_req {
    enum sevault_io_block_type block_type;
    u64 request_id; // written by driver.
    u32 key_id;
    u8 request_type;
    u32 data_length;
    const u8 *data;
};

struct sevault_read_op {
    enum sevault_io_block_type block_type;
    u8 *user_output;
    u32 buffer_length;
    u32 max_requests;

    // Written by driver.
    u32 consumed_bytes;
    u32 num_requests;
};

Figure 5.6: Structures for a user transformation request and a result read operation.

The request_id field specifies the unique request identifier and is set by the Host driver before returning. The key_id field specifies which cryptographic key should be used for the transformation. The request_type field specifies whether the request encrypts or decrypts data. The data and data_length fields specify a pointer to the data in user space and the length of the data. After the SEND_TRANSFORMATION_REQ ioctl returns, the user can use the request_id to identify retrieved results. To retrieve a result, the user executes the command READ_OP which takes the structure sevault_read_op as an argument. With this command, the user can read multiple sevault_transform_result objects from a given data stream at once. The fields of the sevault_read_op are listed in Figure 5.6. The block_type field specifies which data stream the results should be read from. The user_output and buffer_length fields specify the user space buffer, where the sevault_transform_result objects are copied, and the length of the buffer. The max_requests field specifies the maximum number of results which should be provided to the user by the SE-Vault host driver. Before returning from the syscall, the SE-Vault host driver updates the consumed_bytes and num_requests fields with the number of copied bytes and the number of

processed results, respectively. If no results have been consumed, the two fields are set to 0 and the driver returns -ENODATA as an error. The user can later retry the ioctl command until all results are consumed.
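A user space client built on this ABI could look as follows. The uapi header name, the exact ioctl macros and the SEVAULT_OP_ENCRYPT constant are assumptions based on Table 5.2 and Figures 5.2 and 5.6.

/* Illustrative user space client; header and ioctl macros are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include "sevault_uapi.h"   /* hypothetical header with structures and commands */

int main(void)
{
    int fd = open("/dev/sevault-user", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct sevault_key_info key = { 0 };
    key.transformation_type = SEVAULT_AES_CBC;
    key.key_length = 16;
    key.iv_length = 16;
    memset(key.key, 0x41, key.key_length);        /* demo key only */
    if (ioctl(fd, SEND_KEY, &key) < 0) { perror("SEND_KEY"); return 1; }
    ioctl(fd, FLUSH_KEY_OPS, 0);   /* wait until the Guest registered the key */

    unsigned char buf[128] = "plaintext to be encrypted";
    struct sevault_transform_user_req req = { 0 };
    req.block_type = BLOCK_TYPE_0;
    req.key_id = key.key_id;                      /* filled in by the driver */
    req.request_type = SEVAULT_OP_ENCRYPT;        /* assumed constant */
    req.data_length = sizeof(buf);
    req.data = buf;
    ioctl(fd, SEND_TRANSFORMATION_REQ, &req);

    unsigned char out[4096];
    struct sevault_read_op rd = { 0 };
    rd.block_type = BLOCK_TYPE_0;
    rd.user_output = out;
    rd.buffer_length = sizeof(out);
    rd.max_requests = 1;
    while (ioctl(fd, READ_OP, &rd) < 0)           /* -ENODATA until done */
        ;
    close(fd);
    return 0;
}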

5.2.4 Kernel Crypto API Cipher

The other exposed interface is to the Kernel Crypto API (KCAPI), which allows primarily kernel drivers, and secondarily user space applications, to make use of SE-Vault. For example, having this interface allows other components, like dm-crypt, to use SE-Vault transparently without any code modifications.

In order to expose the SE-Vault implementation to the KCAPI, the SE-Vault host driver needs to register cipher implementations for all supported cryptographic ciphers. In this PoC, the only supported cipher is AES with the CBC cipher mode. Discussed in Chapter 2.6, the KCAPI runtime delegates cryptographic keys and transformation requests to the registered cipher with the highest priority. Thus, in order to ensure that the SE-Vault implementation takes precedence over other implementations, a high priority must be set when registering the cipher. The current PoC calls the function crypto_register_skcipher to register the SE-Vault cbc(aes) cipher with a priority of 1450.

Figure 5.5 shows at a high level the sequence of steps for propagating a transformation request from the SE-Vault cipher implementation to the SE-Vault guest driver. The current cipher implementation only supports transferring transformation requests over a single IO stream, which supports a block size of up to 2048 bytes. This is only a minor limitation of the implementation because some users of the KCAPI do not require much larger block sizes either. For example, dm-crypt can be configured to use block sizes between 512 and 4096 bytes. While my implementation only supports block sizes in the middle of this range, this can be viewed as a compromise to offer good performance on average for all possible configurations.
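The registration itself can be sketched as follows. The callback names are placeholders for the SE-Vault shim's entry points; the struct layout, constants and crypto_register_skcipher are the standard Linux KCAPI interface.

#include <crypto/aes.h>
#include <crypto/internal/skcipher.h>
#include <linux/module.h>

/* Placeholder callbacks: setkey forwards the key to the Guest driver,
 * encrypt/decrypt queue a transformation request over an IO stream. */
static int sevault_setkey(struct crypto_skcipher *tfm, const u8 *key,
                          unsigned int keylen);
static int sevault_encrypt(struct skcipher_request *req);
static int sevault_decrypt(struct skcipher_request *req);

static struct skcipher_alg sevault_cbc_aes = {
    .base = {
        .cra_name        = "cbc(aes)",
        .cra_driver_name = "cbc-aes-sevault",
        .cra_priority    = 1450,   /* above the generic and AES-NI drivers */
        .cra_blocksize   = AES_BLOCK_SIZE,
        .cra_module      = THIS_MODULE,
    },
    .min_keysize = AES_MIN_KEY_SIZE,
    .max_keysize = AES_MAX_KEY_SIZE,
    .ivsize      = AES_BLOCK_SIZE,
    .setkey      = sevault_setkey,
    .encrypt     = sevault_encrypt,
    .decrypt     = sevault_decrypt,
};

static int sevault_register_cipher(void)
{
    return crypto_register_skcipher(&sevault_cbc_aes);
}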

5.3 Guest Linux Driver

The SE-Vault guest driver is the component responsible for storing cryptographic secrets in encrypted memory, and for performing the cryptographic transformations which use the stored secrets. I implemented the SE-Vault guest driver as a Linux Kernel Module (LKM) running inside an SEV-protected Linux VM. Because Linux provides sufficient functionality for establishing communication over VirtIO and for performing cryptographic transformations, the SE-Vault guest driver implementation has the responsibility of bookkeeping request and key information, and also of delegating transformation requests to the KCAPI in the VM.
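The guest driver's blocking delegation to the KCAPI, described later in this section, can be sketched as follows. The helper name is illustrative, while the skcipher request calls are the standard Linux KCAPI interface.

#include <crypto/skcipher.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>

/* Sketch: delegate one in-place transformation to the KCAPI and block
 * until it completes, as the prototype crypto worker does. */
static int sevault_guest_transform(struct crypto_skcipher *tfm, u8 *iv,
                                   u8 *data, unsigned int len, bool enc)
{
    DECLARE_CRYPTO_WAIT(wait);
    struct scatterlist sg;
    struct skcipher_request *req;
    int ret;

    req = skcipher_request_alloc(tfm, GFP_KERNEL);
    if (!req)
        return -ENOMEM;

    sg_init_one(&sg, data, len);
    skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                  crypto_req_done, &wait);
    skcipher_request_set_crypt(req, &sg, &sg, len, iv);
    /* crypto_wait_req() blocks until the KCAPI reports completion. */
    ret = crypto_wait_req(enc ? crypto_skcipher_encrypt(req)
                              : crypto_skcipher_decrypt(req), &wait);
    skcipher_request_free(req);
    return ret;
}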



Figure 5.7: Example high-level handling of transformation requests with two IO streams, identified by 0 and 1. Requests are buffered into separate work lists, which get traversed by an identical number of crypto workers implemented as kthreads.

When a cryptographic key is received by the SE-Vault guest driver, the driver stores the key in a structure which holds all registered keys. The key and future transformation requests are linked by the Key ID specifier which is included in both the request for registering the key and the transformation request. This prototype implementation stores the keys in a small array, but a future implementation can use a more sophisticated structure such as a hash table or a tree. After a key is registered in the SE-Vault guest driver, the SE-Vault host driver can start sending transformation requests to the Guest. Figure 5.7 shows the sequence of steps for handling a transformation request in a setting with two IO streams. In step 1, a transformation request is sent over Input Queue 0 to the SE-Vault guest driver and the sevault_guest_recv_input function is invoked to handle the event. The function copies the received data into a new object and stores the object into the corresponding work list in step 2. After all transformations from the input queue are received and stored into their respective work lists, the function sevault_guest_recv_input wakes up the crypto worker responsible for handling the work list. In this prototype, a crypto worker is implemented as a kthread (Linux kernel thread). In step 3, the kthread iterates through its work list and forwards each transformation request to the Kernel Crypto API (KCAPI). The KCAPI handles the request asynchronously and notifies the requester when the result is available. My current implementation blocks until the

result is available, but an optimized solution can send more requests to the KCAPI if the transformations are independent. In steps 4 and 5, the result is received by the crypto worker, which then copies it to the corresponding Output Queue. After the work list becomes empty, the crypto worker signals the SE-Vault host driver that new transformation results are available.

Although the design choice of buffering requests into a work list and delegating the work to separate threads may seem complex and unnecessary, it ensures good performance and also that no blocking operations are executed while an interrupt is served. Discussed in Chapter 2.4, a notification to the VM happens via the injection of an interrupt which is forwarded to the corresponding handler in the VirtIO driver. Performing the cryptographic transformations requires allocating memory and making requests to the KCAPI, which are blocking operations and cannot be executed while the interrupt is handled. By moving the responsibility of handling the transformation to another kernel thread, this design only requires a few non-blocking operations in the interrupt handler, such as allocating memory with the GFP_ATOMIC hint, adding items to a linked list and waking up the respective crypto workers. As a consequence, this design also improves performance by allowing requests belonging to separate data streams to be executed in parallel.

At the end of Chapter 2.5, I described the sequence of steps necessary to simulate an IOMMU with QEMU and Vhost. My description showed that resolving an IOMMU miss requires multiple context switches, which is detrimental to IO performance. This observation is important for the performance of VirtIO communication in SEV-protected Linux VMs because the current Linux implementation requires that each VirtIO device is configured to use a virtual IOMMU if one is available. This can be accomplished by launching the VM with the iommu_platform=on argument set for the VirtIO device in QEMU. This in turn sets the configuration flag VIRTIO_F_IOMMU_PLATFORM which is queried by the corresponding VirtIO driver in the VM. When this flag is set, two changes follow in how the VirtIO communication is performed. The first change is that the VM's VirtIO shim always uses the unencrypted DMA region for communication, which is necessary with SEV enabled. The second change is that the Host VirtIO shim needs to translate every IO Virtual Address (IOVA) to a Guest Physical Address (GPA).

During the development phase of SE-Vault, I continuously tested and benchmarked the implementation to guarantee correctness and catch code changes which introduce performance degradation. I observed through my tests a significant performance difference between SE-Vault running without and with SEV. With SEV, IO throughput would be 6x slower than running the VM without SEV. The slowdown is attributed to frequently occurring IOTLB misses which require multiple context switches to be resolved. Notably, the requirement to set the VIRTIO_F_IOMMU_PLATFORM flag only originates from an implementation quirk in the Linux VirtIO shim which forces all VirtIO rings and buffers to use the unencrypted DMA region. However, the code in the Linux VirtIO shim can be modified to also use the DMA region if SEV is active. The decision for using the DMA region is made in the function vring_use_dma_api, shown in Figure 5.8.


1 static bool vring_use_dma_api(struct virtio_device *vdev) {
2     if (!virtio_has_iommu_quirk(vdev))
3         return true;
4     if (mem_encrypt_active())
5         return true;
6     ...

Figure 5.8: The function decides whether the unencrypted DMA region should be used. Lines 4 and 5 are added by me in order to force usage under SEV without enabling the expensive iommu_platform=on setting in QEMU.

On lines 2-3, the code checks and returns true if the VIRTIO_F_IOMMU_PLATFORM flag is set. The proposed code change can be seen on lines 4-5, where the code checks and returns true if SEV is active. This code modification removes the requirement of launching the VM with the iommu_platform=on argument and thus bypasses the IOVA-to-GPA translations. In my experiments, this led to a significant performance improvement when SEV is used for protecting the VM.

5.4 Code Hardenings against Memory Disclosure Attacks

The SE-Vault guest driver operates in a VM whose memory is encrypted with SEV, and thus the protected Host secrets cannot be directly read from the encrypted memory of the VM. However, the Host secrets are sent from the Host to the VM using multiple memory copies, and remnants of the secrets may still be available in unencrypted memory. If a copy of a Host secret is left in unencrypted memory, an attacker can steal the secret by reading the copy which was unintentionally left behind. Thus, the SE-Vault implementation must prevent such situations by carefully erasing the memory used for temporarily storing the keys. The first set of changes is performed on the SE-Vault implementation itself: the SE-Vault guest driver and the SE-Vault host driver. Before each temporary buffer used for buffering requests is freed, the buffer is either explicitly cleared with zeroes, or freed with the Linux kernel function kzfree, which zeroes out the memory before returning it to the heap. This set of changes ensures that the SE-Vault implementation

itself does not unintentionally leave any Host secrets in unencrypted memory after the Host secrets have been delivered to the SE-Vault Guest.

The second code hardening happens in the Linux DMA API implementation in the SE-Vault VM. To understand the necessity of this hardening, it is necessary first to examine the data path of sending a key from the SE-Vault host driver to the SE-Vault guest driver. Discussed in the previous section, the Host and Guest use the unencrypted DMA region for communication. The VirtIO shim transparently uses the DMA region for communication by issuing calls to the Linux DMA API. With SEV, the DMA API implementation relies on the SWIOTLB implementation to link private physical memory with unencrypted memory, which is then exposed to the external device. The SWIOTLB implementation uses a SLAB allocator to efficiently carve out memory from the unencrypted memory region, and the carved-out memory chunks are then linked to the specified private memory buffer.


Figure 5.9: Data path for sending a key from the Host VirtIO driver to the Guest VirtIO Driver. Under SEV, the key is stored in a temporary buffer in the DMA region, which is never cleared after the temporary buffer is freed by the function swiotlb_tbl_unmap_single().

The details of copying a secret key from the Host VirtIO driver to the Guest VirtIO driver are shown in Figure 5.9. When the Host VirtIO driver sends a key to the Guest, the key is first copied into a temporary buffer inside the DMA region. This temporary buffer is carved out and maintained by the SWIOTLB SLAB implementation. When the Guest VirtIO driver decides to read the key from the used ring, the VirtIO shim calls the DMA API to synchronize the temporary buffer with the private buffer allocated by the VirtIO driver. This call eventually leads to the execution of the function swiotlb_bounce(), shown in step 1, which copies the contents of the temporary

buffer to the private buffer. Afterwards, the function swiotlb_tbl_unmap_single(), shown in step 2, is called, which frees the temporary buffer and returns the memory chunk to the DMA region. However, the contents of the temporary buffer are never cleared, which leaves the copy of the secret key in unencrypted memory. An attacker can later find the secret key if the temporary buffer has not been overwritten with other data. The second code hardening clears the contents of the temporary buffer after the function swiotlb_bounce() is called. This ensures that no secrets remain in the unencrypted DMA region of the VM.

The last code hardening is performed on the dm-crypt implementation in the Host Linux kernel. During the security evaluation in Chapter 6.1, I discovered that dm-crypt keeps a copy of the LUKS master key in its configuration context. In particular, the key is stored in the key[] field of the structure crypt_config. In this scenario, an attacker would be able to locate the master key and decrypt the encrypted contents of the disk. In my analysis, the key no longer needs to be stored in the context after it is registered in the KCAPI. Thus, the applied code hardening is to zero out the contents of the key field in the crypt_config structure after the key is registered in the KCAPI. In my evaluation, I verified that erasing the key does not later result in memory corruption or kernel error messages. However, this change requires more careful validation to ensure that it does not result in erroneous behavior for other dm-crypt users.
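The SWIOTLB hardening can be sketched as follows. The helper name is an assumption and the exact integration point depends on the kernel version; memzero_explicit and phys_to_virt are standard kernel primitives.

#include <linux/io.h>
#include <linux/string.h>

/* Sketch: scrub the unencrypted bounce buffer before its slot is freed in
 * swiotlb_tbl_unmap_single(), so no key material survives in the DMA region. */
static void sevault_scrub_bounce_slot(phys_addr_t tlb_addr, size_t size)
{
    /* memzero_explicit() cannot be optimized away by the compiler. */
    memzero_explicit(phys_to_virt(tlb_addr), size);
}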

5.5 Guest Driver Portability to the seL4 Microkernel

In this Chapter, I examine the portability of the proposed design to the seL4 microkernel [LF]. The seL4 kernel is a performant and formally verified microkernel [Kle+09] which can run on x86-64, ARM and RISC-V CPUs. As a microkernel, seL4 has a significantly smaller code base than a monolithic kernel such as Linux. The reduction is achieved by relocating drivers, such as the file system and network drivers, to operate as user space processes known as services. The kernel only includes minimal infrastructure for virtual memory management, inter-process communication and scheduling. A service can communicate with the kernel using a minimal syscall interface to spawn new threads, request access to ranges of physical memory or make communication requests to another service. If a service crashes, the kernel can restart just that service without having to reboot the whole system. In comparison, a fault in a Linux driver would require a system restart to use the driver again. The seL4 kernel and open-source services are designed to run on bare-metal hardware and not in a virtualized environment. Although seL4 can also be run in a VM, the

seL4 ecosystem lacks VirtIO and SEV support. Additionally, the ecosystem does not include a verified cryptographic library. These limitations of seL4 pose difficulties in porting the SE-Vault guest driver, but such difficulties can be overcome as I show in the following sections.

5.5.1 Booting seL4

The first step of porting the SE-Vault Guest driver is to be able to boot seL4 inside a VM. Discussed at the beginning of Chapter 5, using SEV imposes the requirement that OVMF is used, which implies that the seL4 kernel needs to boot under UEFI. However, the seL4 build system only produces an ELF image which is not directly bootable by OVMF. OVMF has the capability to boot either an EFI application (PE format) or directly into the Linux EFI Boot Stub [Thea] of a bzImage file. Thus, launching seL4 on a UEFI-enabled system requires a bootloader which supports both UEFI and loading an ELF image.

Additionally, the seL4 kernel needs to be aware of the bootloader used on the UEFI-enabled system and comply with its specification. On a UEFI-enabled system, the firmware passes system information, such as the ACPI tables, to the bootloader. The bootloader parses the provided tables, generates a standardized boot information structure, loads the kernel and transfers control to the kernel's defined entry point, passing along the boot information structure. The kernel then uses the boot information structure to perform its initialization. In Linux, the boot information structure corresponds to the boot_params structure which is standardized in the Linux x86 Boot Protocol [Theb].

menuentry "seL4 kernel" --class os "seL4-kernel" {
    load_video
    insmod gzio
    insmod part_msdos
    insmod ext2
    set root="hd0,gpt2"
    multiboot2 /boot/sel4_kernel
    module2 /boot/sel4_modules
}

Figure 5.10: GRUB entry for the seL4 kernel and module. When the VM is started, GRUB would be able to load and boot them.

The seL4 kernel supports the multiboot2 interface [Oku+19] which in turn is also supported by the GRUB2 bootloader. The GRUB2 bootloader is responsible for creating

the boot information structure and handing control to the seL4 kernel in a manner compliant with the multiboot2 specification. Although it is possible to build GRUB2 from source and place the PE binary and configuration files into a file system readable by OVMF, the process can be time consuming and requires installing additional dependencies on the Host system. For the purpose of testing, I followed an easier approach which requires a Linux distribution, such as Debian or Ubuntu, to first be installed onto a virtual disk image. After the installation is done, the seL4 kernel and module images need to be copied to the /boot/ partition. Afterwards, the GRUB2 menu entry from Figure 5.10 must be added to the /etc/grub.d/40_custom file and update-grub must be run. Following these instructions, one can boot seL4 with the Open Virtual Machine Firmware (OVMF).

5.5.2 SEV Guest Support

Adding SEV support to a kernel can be separated into two types of code modifications. If the kernel is part of the Host OS, changes are required in its HV components in order to establish communication with the AMD PSP and to follow the SVM specification for launching SEV VMs. On the other hand, if the kernel is part of the Guest OS, changes are required in the code responsible for managing page tables and for communication over external interfaces such as disk IO and networking. As discussed in Chapter 2.2, the Guest OS needs to map all of its private memory — code and data — with the C-bit set in the page table hierarchy. If the Guest does not set the C-bit in its page tables, data accesses to memory already encrypted by the PSP would read or write the ciphertext while the target should be the plaintext. This would immediately result in memory corruption and the Guest kernel would fault. Additionally, regions used for communication with devices or with the HV must be mapped as shared by setting the C-bit to zero. This ensures that the Guest OS and the Host OS have an identical view of shared data.

In the context of protecting the SE-Vault driver in seL4, only SEV Guest support needs to be implemented in the seL4 kernel. Launching the seL4 kernel with SEV enabled, but without any SEV support, shows that the kernel crashes early in the initialization phase without any output to the serial port. In a typical environment, QEMU can be instructed to start a GDB server to which a GDB client can attach in order to examine the cause of the crash in the kernel. However, QEMU does not support examining an SEV-protected VM at the time of writing, so using GDB to debug issues in the Guest kernel is infeasible. Another option for debugging the Guest kernel under SEV is to use intercepted instructions to communicate information to the HV. Such an instruction is the vmmcall instruction, which is never used by the seL4 kernel. For debugging purposes, I added

inline assembly to store information in registers and issue the vmmcall instruction. I modified KVM to print the VM's registers when vmmcall is intercepted.
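The debugging primitive can be sketched as follows. The register choice and tag convention are my own; under plain SEV the register state is visible to the HV when vmmcall is intercepted, which is what makes this technique work.

/* Sketch of a debug primitive: expose two values to the HV via vmmcall.
 * KVM is modified to print the guest registers on this intercept. */
static inline void sev_debug_vmmcall(unsigned long tag, unsigned long value)
{
    asm volatile("vmmcall"
                 : /* no outputs */
                 : "a"(tag), "b"(value)
                 : "memory");
}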

1  BEGIN_FUNC(get_sev_mask)
2      ...
3      /* Query the SEV C-bit and create mask. */
4      mov $0x8000001f, %eax
5      xor %ecx, %ecx
6      cpuid
7      ...
8      /* Store upper 32-bits of mask into EAX */
9      mov %ebx, %edx
10     and $0x3f, %edx
11     sub $32, %edx
12     xor %eax, %eax
13     bts %edx, %eax
14     jmp .Lsev_exit
15 .Lno_sev:
16     xor %eax, %eax
17 .Lsev_exit:
18     ...
19     ret
20 END_FUNC(get_sev_mask)

Figure 5.11: Assembly code for determining the location of the C-bit.

To determine the necessary changes to the seL4 kernel, I examined the seL4 code for creating and managing virtual memory. The first page table is created in the boot setup code by the function setup_pml4 immediately before the transition to long mode. By adding vmmcalls, I established that the first fault happens immediately after the seL4 kernel switches to long mode. The reason for the fault is that the Guest's page tables do not have the C-bit set, and thus the Guest accesses the ciphertext without decrypting it. Such accesses include data reads from the stack, which ensures that the kernel crashes upon returning from the routine used for switching to long mode. The first step in modifying the page table setup is to determine the location of the C-bit which needs to be set in page table entries. Figure 5.11 shows a shortened version of the x86-32 assembly code for determining the location of the C-bit. Under SEV, the location can be queried by issuing the cpuid instruction and examining the returned result. On line 13, the upper 32 bits of the mask are set in the EAX register. Only the

upper 32 bits of the mask are saved because the C-bit is always above bit 32. After the position of the C-bit has been determined, it can be set on each physical address in the page table hierarchy. With the proposed changes, the seL4 kernel successfully transitions to long mode, leaves the boot stub and enters the kernel initialization stage.

The memory mapping created by the boot stub is a one-to-one mapping of the first 512 GiBs of memory. Thus, the kernel code and data are located at low virtual addresses which are typically used by user space applications. To address this, the seL4 kernel additionally maps the first 1 GiB of physical memory to the high virtual address 0xffffff8000000000. The memory mapping is performed in the function map_kernel_window which needs to be modified to also set the C-bit. With the same changes applied, the seL4 kernel is able to finish initialization. After the kernel initialization stage is finished, the kernel transfers control to the user space root task. Similarly, the same changes need to be applied to the generation of the user space page table. The code is contained in the makeUser* memory mapping functions. After applying the C-bit to all of the user memory mappings, the user space root task is started and finishes execution without any observed memory corruption.

With the proposed changes, the user space task can execute but does not have a means of communicating data with external devices. To allow this, I added the function ps_io_map_enc_dec to the utility library which takes an extra flag to select whether mapped physical memory needs to be interpreted as encrypted or shared. The information is propagated to the seL4 kernel and handled in the three functions responsible for mapping memory in the user space page table: makeUserPTE, makeUserPDELargePage and makeUserPDPTEHugePage. The functions are modified to set the C-bit only if the user requested encrypted memory, and to not set the bit otherwise. Additionally, I added code to flush and invalidate the cache and TLB to ensure that newly mapped memory is correctly interpreted by memory accesses and instruction fetches. This newly added interface allows user space applications to explicitly request mapping memory as shared and use it for communication. As part of the design, calls to the ps_io_map interface always mark memory as encrypted, which prevents legacy code from erroneously exposing secret memory to the HV.
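The effect of the page table changes can be summarized by the following sketch. The helper and variable names are illustrative and do not appear in the seL4 sources.

#include <stdint.h>

/* Position of the C-bit, filled in once from CPUID 0x8000001F (EBX[5:0]). */
static uint64_t sev_c_bit_mask;

/* Sketch: construct a PTE, setting the C-bit only for private (encrypted)
 * mappings; shared mappings for HV communication leave the bit clear. */
static uint64_t make_pte(uint64_t paddr, uint64_t flags, int encrypted)
{
    return paddr | flags | (encrypted ? sev_c_bit_mask : 0);
}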

5.5.3 VirtIO Support

SE-Vault uses the VirtIO interface for communication between the Host and Guest driver. The implementation of the Linux SE-Vault guest driver makes use of the available VirtIO shim layer which is responsible for initializing and managing the Virtqueues. As a microkernel, seL4 does not implement any functionality for external devices and leaves this task to user space libraries and services. The seL4 ecosystem

provides a single work-in-progress library for communicating data using the VirtIO interface: libvirtqueue. Unlike the Linux VirtIO shim layer, the library only supports management of the Virtqueue descriptor table, available ring and used ring. The library does not provide functionality for establishing communication with the VirtIO device, and also does not support communicating events between the device and the driver. Thus, porting the SE-Vault guest driver to seL4 requires adding code for discovering and initializing the SE-Vault VirtIO device, for finding Virtqueues and initializing their structures, and for communicating events between the device and driver. In the rest of this chapter, I describe the implementation of the seL4 SE-Vault port with respect to establishing communication over VirtIO.

The first step in establishing communication is to locate the VirtIO device and its configuration. The device is registered as a PCI device by QEMU and can be located by scanning the PCI bus for attached devices. Per the PCI standard, each attached PCI device is associated with a configuration space of 256 bytes. The first 64 bytes of the configuration space are taken by the PCI header, which contains a Vendor ID and a Device ID by which the device can be identified. The seL4 ecosystem offers the libpci library which can be used to scan the PCI buses, parse the PCI header of each attached device and provide the information to the user. Because I selected the Device ID and Vendor ID in the SE-Vault QEMU implementation, I know exactly which PCI device to ask libpci to search for.
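The probe amounts to a standard configuration-space scan. The following C sketch uses the legacy port I/O mechanism for clarity; SEVAULT_VENDOR_ID and SEVAULT_DEVICE_ID stand in for the values chosen in the QEMU device (the device ID shown is illustrative), and outl/inl are assumed to be supplied by the platform layer. In the actual port, libpci performs this scan.

    #include <stdbool.h>
    #include <stdint.h>

    #define PCI_CONFIG_ADDR   0xCF8
    #define PCI_CONFIG_DATA   0xCFC
    #define SEVAULT_VENDOR_ID 0x1af4 /* VirtIO vendor ID */
    #define SEVAULT_DEVICE_ID 0x105a /* illustrative placeholder */

    /* Assumed platform-provided port I/O helpers. */
    extern void outl(uint16_t port, uint32_t val);
    extern uint32_t inl(uint16_t port);

    static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev,
                                   uint8_t fn, uint8_t off)
    {
        uint32_t addr = (1u << 31) | ((uint32_t)bus << 16) |
                        ((uint32_t)dev << 11) | ((uint32_t)fn << 8) |
                        (off & 0xfc);
        outl(PCI_CONFIG_ADDR, addr);
        return inl(PCI_CONFIG_DATA);
    }

    /* Scan all buses and slots for the SE-Vault VirtIO device. */
    static bool find_sevault_device(uint8_t *bus_out, uint8_t *dev_out)
    {
        for (unsigned bus = 0; bus < 256; bus++) {
            for (unsigned dev = 0; dev < 32; dev++) {
                uint32_t id = pci_cfg_read32(bus, dev, 0, 0x00);
                if (id == 0xffffffff) /* no device present */
                    continue;
                if ((id & 0xffff) == SEVAULT_VENDOR_ID &&
                    (id >> 16) == SEVAULT_DEVICE_ID) {
                    *bus_out = bus;
                    *dev_out = dev;
                    return true;
                }
            }
        }
        return false;
    }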

[Figure 5.12: (a) PCI configuration space of a VirtIO device. Located after the PCI header, multiple PCI capability structures (virtio_pci_cap) describe the VirtIO device and its MMIO regions. (b) Common Configuration Capability structure (virtio_pci_common_cfg), located through the BAR index and offset; it contains information and commands for communication.]


With the VirtIO device found, the next step is to locate its configuration and to perform the initialization sequence specified by the VirtIO specification. The configuration information is contained in the remaining part of the PCI configuration space. The VirtIO specification dictates that immediately after the PCI header, the device exposes five virtio_pci_cap capability structures. Each capability provides sufficient information to compute the physical addresses of the MMIO regions of the device. Figure 5.12a shows a sparse representation of the PCI configuration space of a VirtIO device and its relevant capability structures. Immediately after the PCI header, at offset 0x40, the first capability structure starts. A capability structure can be identified by the value of its cfg_type field, and the bar_index and bar_offset fields give information about the MMIO address space of the capability. The two capability structures relevant for adding SE-Vault support are the Common Configuration Capability and the Notify Configuration Capability.

In the example of Figure 5.12a, the Common Configuration structure starts at offset 0x40. By using the Base Address Registers array from the PCI header and the bar_index and bar_offset fields from the capability structure, one can compute the start address of the virtio_pci_common_cfg structure as shown in Figure 5.12b. The structure defines the hardware registers which can be used to initialize the VirtIO device and gather information about the supported Virtqueues. The fields — device_feature_select, device_feature, driver_feature_select, driver_feature and device_status — are used for exchanging the features which are supported by the VirtIO device and driver respectively. The field num_queues specifies the number of Virtqueues offered by the device. For each Virtqueue, the driver writes the respective index into the queue_select field to select the Virtqueue which is being interacted with. Once the Virtqueue is selected, the driver can write to the remaining fields to set up the state of the Virtqueue. By writing to the fields queue_desc, queue_driver and queue_device, the driver communicates the physical addresses of the descriptor table, available ring and used ring respectively.

In order to initialize all Virtqueues, the driver needs to allocate a sufficiently large DMA region to contain all Virtqueue structures and the buffers which will be provided to the device. The DMA region needs to be physically contiguous in order to accommodate buffers at arbitrary physical page offsets. A DMA region of 32 MiBs is sufficient for the purpose of the SE-Vault guest driver.

After the structures for all Virtqueues are set up, the driver needs to locate the notify register of each Virtqueue. To accomplish this, the driver needs to parse the Notify Configuration capability shown in Figure 5.12a and collect the queue_notify_off value from the virtio_pci_common_cfg structure shown in Figure 5.12b. The driver can then use the collected values to compute the physical address of the notify register for each Virtqueue. A write access to the notify register triggers a PF which is handled by KVM and propagated to the SE-Vault host driver. The layout of the capability structure and the notify address computation are sketched below.
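The sketch follows the structure layouts in the VirtIO 1.0 specification (the specification names the BAR field bar rather than bar_index); the address arithmetic for the notify register is the part the seL4 port has to implement by hand.

    #include <stdint.h>

    /* Generic VirtIO PCI capability (VirtIO 1.0 specification). */
    struct virtio_pci_cap {
        uint8_t  cap_vndr;   /* 0x09: vendor-specific capability */
        uint8_t  cap_next;   /* offset of the next capability */
        uint8_t  cap_len;    /* length of this capability */
        uint8_t  cfg_type;   /* COMMON_CFG, NOTIFY_CFG, ... */
        uint8_t  bar;        /* which BAR holds the region */
        uint8_t  padding[3];
        uint32_t offset;     /* offset of the region within the BAR */
        uint32_t length;     /* length of the region */
    };

    /* The notify capability appends one extra field. */
    struct virtio_pci_notify_cap {
        struct virtio_pci_cap cap;
        uint32_t notify_off_multiplier;
    };

    /* Physical address of a Virtqueue's notify register:
     * BAR base + capability offset + queue_notify_off * multiplier. */
    static uint64_t notify_reg_addr(uint64_t bar_base,
                                    const struct virtio_pci_notify_cap *n,
                                    uint16_t queue_notify_off)
    {
        return bar_base + n->cap.offset +
               (uint64_t)queue_notify_off * n->notify_off_multiplier;
    }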


The proposed changes are sufficient to establish communication and send notifications to the device. However, the libvirtqueue library does not implement the VirtIO specification correctly and additionally does not protect against possible race conditions when populating the descriptor table and ring buffers. For the purpose of a PoC, I only fixed the discovered bugs which completely prevent data communication, but I did not address the race condition bugs. Without addressing race conditions caused by memory access reorderings introduced by the compiler and by the memory model of the CPU, the device or driver state of the Virtqueues can become invalid, or incorrect data can be transmitted.
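As an illustration of the missing ordering: publishing a buffer to the available ring requires a release barrier between writing the ring entry and bumping the ring index, otherwise the device may observe the new index before the entry contents. The sketch below uses the split-ring layout from the VirtIO specification; the struct and helper names are illustrative, not libvirtqueue's actual API.

    #include <stdint.h>

    struct virtq_avail {
        uint16_t flags;
        uint16_t idx;    /* next free slot, shared with the device */
        uint16_t ring[]; /* descriptor chain heads */
    };

    struct virtq {
        unsigned num;              /* queue size */
        struct virtq_avail *avail; /* lives in the shared DMA region */
    };

    /* Correctly publish one buffer to the available ring. */
    static void publish_buffer(struct virtq *vq, uint16_t desc_head)
    {
        /* 1. Write the ring entry before exposing it. */
        vq->avail->ring[vq->avail->idx % vq->num] = desc_head;

        /* 2. Ensure the entry is visible before the index bump;
         *    this is the fence that libvirtqueue currently omits. */
        __atomic_thread_fence(__ATOMIC_RELEASE);

        /* 3. Publish by incrementing the available index. */
        vq->avail->idx++;
    }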

5.5.4 Porting SE-Vault

The SE-Vault guest driver in seL4 operates similarly to the Linux SE-Vault guest driver described in Section 5.3. Before being able to process any requests, the driver populates the Key and Input Virtqueues with available buffers and sends a notification to each Virtqueue that buffers have been added. This implementation uses the reserved DMA region to allocate the physically contiguous buffers for all Virtqueues. Afterwards, the Host driver can store cryptographic keys and data transformation requests into these buffers and update the corresponding Virtqueue structure.

In a standard Linux implementation, the Guest driver would be notified of newly added buffers by receiving and handling an interrupt. The Linux VirtIO shim layer allows specifying a handler for receiving an interrupt, but this is not available in the libvirtqueue implementation. To avoid adding additional complexity to the prototype implementation, I skipped setting up interrupt handlers in seL4 and designed a solution which busy-loops and processes any received requests. Figure 5.13 shows the implementation of the busy-loop approach in pseudo-code. On lines 2-4, each key request is read from the Key-receive Virtqueue, processed, and an acknowledgment is stored into the Key-acknowledgement Virtqueue. After all key requests are processed, on lines 5-6 a notification is sent over the Key-acknowledgement Virtqueue to the SE-Vault host driver. On lines 7-12, the driver goes through the transformation requests of each Input Virtqueue, performs the transformations and adds the results to the corresponding Output Virtqueue. A notification is sent to each Output Virtqueue if a result has been added. On lines 13-18, new descriptors are added to each Virtqueue used for receiving if its available ring is empty. Correspondingly, the used ring of each Virtqueue used for sending is emptied if it is full.

The prototype implementations of both the Linux and seL4 SE-Vault guest drivers only support the CBC-AES transformation. However, unlike Linux, the seL4 kernel and the available libraries do not come with an implementation of cryptographic algorithms. Thus, the seL4 SE-Vault guest driver additionally needs to be compiled with an implementation of CBC-AES.


1  while true:
2      for each key in recv_key_vq:
3          ack = process_key(key)
4          add_acknowledgement_to_vq(ack)
5      if ack_was_sent:
6          notify_ack_vq()
7      for input_vq, output_vq in io_vqs:
8          for each request in input_vq:
9              res = perform_crypto_req(request)
10             add_res_to_output_vq(res, output_vq)
11         if result_was_sent:
12             notify_output_vq(output_vq)
13     if not recv_key_vq.has_available_buffers():
14         add_available_buffers(recv_key_vq)
15     wait_and_clean_used_buffers(ack_vq)
16     for input_vq, output_vq in io_vqs:
17         if not input_vq.has_available_buffers():
18             add_available_buffers(input_vq)
19         wait_and_clean_used_buffers(output_vq)

Figure 5.13: Pseudo-code implementation of the busy-loop responsible for handling all requests. Lines 2-6 process each key request, send back the acknowledgment and notify the acknowledgment Virtqueue. Lines 7-12 process each transformation request, send back the result and notify the corresponding output Virtqueue. Lines 13-18 reallocate descriptors for all Virtqueues if their resources are used up.

I used Tiny AES C [kok] for the purpose of developing this prototype. This AES implementation is not designed to be efficient, does not make use of the AES-NI instruction set and is likely susceptible to side-channel attacks. A more sophisticated implementation of the seL4 SE-Vault port could use an optimized AES implementation.

6 Evaluation

In this chapter, I present the methods and results of evaluating the security guarantees, correctness and performance of SE-Vault. The security evaluation validates empirically that my prototype implementation is invulnerable to memory disclosure attacks against the Host kernel, which precisely fits the attacker model of Section 4.2. The correctness evaluation verifies that the implementation correctly stores keys, and correctly encrypts and decrypts data through all interfaces. The performance evaluation measures the throughput and latency of all available interfaces in SE-Vault, which is important for assessing the practical usability of SE-Vault. The evaluation is performed on a system with the following specification:

• CPU Model: AMD EPYC 3251

• Memory: 2 x 32 GiB DDR4 2133 MT/s

• Disk: 500 GB Samsung SSD 750

• Kernel: Linux 5.6.0

• QEMU: 4.2.50

For performing the evaluation, a VM with the following configuration was used:

• Processor Count: 1

• Memory: 2 GiBs

• Kernel: Linux 5.6.0

• Protection: AMD SEV

• OS Distribution: Ubuntu 20.04

Although the implementation of SE-Vault can take advantage of multi-core processing, only a single core is used for the performance evaluation. Testing and benchmarking the system with a multi-core VM is left for future work. The VM's physical memory is set to 2 GiBs, but it can be reduced if the Host system has limited memory capacity. In my empirical evaluation, there was no significant

performance difference between using 1 GiB and 2 GiBs of memory. The reason for this is that the VM can only be provided a fixed amount of data to transform, limited by the number of Virtqueues, the size of the descriptor table and the size of each buffer in the table. However, the VM cannot be launched with less than 1 GiB of memory as the kernel fails to boot. Although the Host system in this evaluation supports AMD SEV-ES, the evaluation was performed on a VM with SEV only. The reason for this decision is that SEV-ES is not officially supported in Linux at the time of writing and that I do not expect a noticeable performance difference for SE-Vault.

6.1 Security Evaluation

The design and implementation of SE-Vault aim at protecting cryptographic keys and other long-term Host secrets. The presented design of SE-Vault considers an attacker model in which an attacker exploits a memory disclosure vulnerability to read all of physical memory. The strength of SE-Vault resides in its ability to store any long-term secrets without explicitly exposing these secrets to unprotected memory. However, the system's components — kernels, SE-Vault components, KCAPI — have complex implementations and may leave traces of the Host secrets in unprotected memory. This potentially leaves the opportunity for an attacker to steal such Host secrets even if they are delivered to the SE-Vault guest driver. Such a scenario is undesirable and should be considered a vulnerability of the design and implementation of SE-Vault.

To verify that the SE-Vault attacker model holds, I performed a test which captures the capabilities of an attacker using a memory disclosure vulnerability. The test sends a pseudo-randomly generated key to SE-Vault, waits for the delivery of the key, saves all accessible RAM regions to a file and then searches for occurrences of the secret key in the saved memory. If the key is found in the memory snapshot, then a trace of the secret key is present in unprotected memory and the SE-Vault implementation is considered insecure. If the key is not found, then the test succeeds. The memory snapshot contains all physical memory accessible by the kernel, which guarantees that the secret would be found if it were exposed in unprotected memory. Although this test might be prone to false-negative outcomes for some possible implementations of AES, I verified that the test conditions are sufficient for the Linux aesni-intel driver, which performs encryption and decryption on all CPUs supporting the AES-NI extension. This driver implementation stores all round keys contiguously in memory, which guarantees that the proposed test case would always find the secret key if it were located in unprotected memory.

In order to take a snapshot of physical memory, I used the /dev/mem interface which gives a privileged user read access to physical memory. Since all recent Linux

distributions restrict access to this interface, it was additionally necessary to compile the Host kernel with the configuration option CONFIG_STRICT_DEVMEM disabled. However, a privileged user cannot read all of physical memory since some regions are reserved for the BIOS or other system components like the AMD PSP. Accessing such regions is prevented by the memory system and can lead to system malfunction. Fortunately, Linux provides the /proc/iomem interface which lists all accessible memory regions. To automate the process of taking a physical memory snapshot, I wrote a script to parse the output of the /proc/iomem interface and save the corresponding regions from /dev/mem to a file. After the memory snapshot is saved to a file, the contents of the file can be searched for the key.
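A condensed C sketch of the snapshot step is given below. It keeps only the System RAM ranges, omits most error handling, and assumes root privileges on a kernel built with CONFIG_STRICT_DEVMEM disabled; the output file name is arbitrary.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *iomem = fopen("/proc/iomem", "r");
        int mem = open("/dev/mem", O_RDONLY);
        FILE *out = fopen("snapshot.bin", "wb");
        char line[256];

        if (!iomem || mem < 0 || !out)
            return 1;
        while (fgets(line, sizeof(line), iomem)) {
            unsigned long start, end;

            /* Only dump ranges marked as System RAM; other
             * regions (BIOS, PSP, ...) must not be touched. */
            if (!strstr(line, "System RAM"))
                continue;
            if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                continue;
            if (lseek(mem, start, SEEK_SET) < 0)
                continue;
            for (unsigned long off = start; off <= end; ) {
                char buf[4096];
                ssize_t n = read(mem, buf, sizeof(buf));
                if (n <= 0)
                    break;
                fwrite(buf, 1, n, out);
                off += n;
            }
        }
        /* snapshot.bin is then scanned for the key, e.g. with a
         * simple memmem-style search. */
        return 0;
    }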

Figure 6.1: Security tests for SE-Vault. Test 1 verifies the behavior of SE-Vault only, Test 2 additionally verifies the behavior of the KCAPI, and Test 3 also verifies the behavior for disk encryption.

Figure 6.1 depicts the three tests which were performed to check the security of three components in the implementation. The first test verifies that the SE-Vault implementation correctly stores a secret key inside the encrypted memory region of the VM without leaving any traces of the key in unprotected memory. In this test, a test application registers a key using the ioctl ABI, then saves all physical memory and searches for the key in the file. The second test verifies that the KCAPI implementation sends the key to the SE-Vault guest driver without leaving a copy of it in unprotected memory. To verify this, the test uses the kcapi-enc command line tool to encrypt a very large file while saving all of physical memory at the same time. Since the key needs to be registered and available during the encryption process, it cannot be cleared until encryption finishes. The third test verifies that disk encryption keys can also be protected by the SE-Vault implementation. This test uses the cryptsetup command line tool to open an encrypted device, which underneath uses the dm-crypt driver to parse and decrypt the device's memory. Notably, this test is the most demanding, because two keys need to be registered using the KCAPI and are copied multiple times before they reach the SE-Vault guest driver. If any of the components along the way

erroneously leaves a trace of any of the secret keys, the key would be found and the test would fail.

The tests were run in three configurations whenever applicable: the default AES crypto module (aesni-intel), SE-Vault without SEV, and SE-Vault with SEV. The expected result is that the first two configurations leave the secret key in unprotected memory, while the third configuration never exposes the secret key since it is stored in the VM's encrypted memory region. To offer some robustness against mistakes or timing issues, each test was performed 32 times.

The first test was run 32 times on the SE-Vault implementation both without SEV and with SEV. As expected, the test succeeded with the SE-Vault and SEV configuration, but failed when SEV was not used. When SEV is not used, no memory is protected and the key can always be read by using a memory disclosure vulnerability. When SEV is used, the key is only stored in protected memory and an attacker cannot read it in clear text. Notably, this result can only be achieved if the DMA memory buffers are cleared as discussed in Section 5.4. Otherwise, the test would fail even if SE-Vault is protected with SEV.

The second test was performed on all three configurations. As in the previous test, the secret key is never located in unprotected memory when SE-Vault is deployed with SEV. The other two configurations always have the key present in memory, which allows an attacker to extract the key using a memory disclosure vulnerability. This test shows that the KCAPI implementation does not leave traces of the secret in unprotected memory when SE-Vault is deployed with SEV.

The last test was also performed on all three configurations. Because the encrypted device relies on the Linux Unified Key Setup (LUKS) format, two keys are always generated when an encrypted device is opened: the password-derived key and the master key (see Section 2.7). Both keys need to be protected, otherwise the memory of the encrypted device can always be decrypted using either of the keys. When running the test, the password-derived key is never found in memory for any of the three test configurations, which is likely the result of dm-crypt clearing the key. This is expected since the password-derived key protects the master key, and is not needed for the decryption of the device's memory. Copies of the master key are never found when SE-Vault is deployed with SEV, and the key is always found for the other test configurations. Notably, the SE-Vault with SEV configuration only succeeds when the code hardenings from Section 5.4 are applied.

Table 6.1 shows a summary of all performed security evaluation tests. Overall, the table shows that the SE-Vault implementation can protect Host secrets, such as cryptographic keys, from an attacker with read access to all of physical memory. Additionally, the performed tests also show that legacy applications must ensure that any copies of the secret are erased after storing the secret into SE-Vault.


Test     Targets                      Requires                          Outcome (SE-Vault with SEV)
Test 1   SE-Vault                     SWIOTLB hardenings                Success
Test 2   KCAPI, SE-Vault              SWIOTLB hardenings                Success
Test 3   dm-crypt, KCAPI, SE-Vault    dm-crypt and SWIOTLB hardenings   Success

Table 6.1: Summary of results for the security evaluation. The targets column specifies the software components which are tested. The requires column specifies the software hardenings which need to be applied. The outcome column specifies the result of the test.

Although storing secrets in SE-Vault incurs changes to applications, I expect the changes to be simple and minimal.

6.2 Correctness Evaluation

The correctness of the SE-Vault implementation was verified throughout the development process to catch newly introduced bugs early on. Once the implementation had reached a mature state, I ran various tests to verify its correctness and resilience to race conditions. The final tests were executed on the system described at the beginning of this chapter. Additionally, the tests were also run on two other machines which were used during the development phase. The tests included:

• 20 unit tests for exercising the /dev/sevault-user interface.

• openssl to encrypt and decrypt files for testing the SE-Vault OpenSSL engine.

• kcapi-enc for testing the SE-Vault KCAPI interface.

• a custom LKM to exercise the KCAPI interface directly.

• all performance tests from Section 6.3.

The 20 unit tests are implemented using the Google Test [Inc] framework and all use the /dev/sevault-user interface. The tests include a few smoke tests to verify basic functionality, stress tests which send many requests, and a few negative tests to detect bugs in the handling of user input. The tests are executed 1000 times with their order randomized to capture test sequences which might corrupt the state of the implementation. The tests succeeded in all 1000 runs. After running the unit tests many times during the development stage, I noticed that a test would hang extremely rarely, which could be the result of a race condition in my prototype implementation. Although this would definitely be an issue on a production system, the provided implementation is only a prototype and lacks sufficient validation to be used in practice.


However, the provided code base is mature enough to allow logging and testing, which makes such bugs feasible to root-cause.

The SE-Vault OpenSSL engine was tested by running the openssl command-line tool to encrypt a few files of various sizes. Each file is first encrypted and saved as a temporary file, after which the temporary file is decrypted and saved as the final file. The original file and the final file are then compared using the cmp command-line tool. The procedure was executed 100 times and no file differences were observed. Additionally, the OpenSSL performance test was executed multiple times to catch any race conditions or erroneous results. Finally, I would like to emphasize that the SE-Vault OpenSSL engine uses the /dev/sevault-user interface. Thus, the OpenSSL engine tests further verify the correctness of that interface and the code paths which follow.

For validating the SE-Vault KCAPI interface, I used the kcapi-enc tool to transform files and also used a custom LKM which communicates directly with the KCAPI interface. Similarly to the OpenSSL engine test, the kcapi-enc tool encrypted the original file into a temporary file, which was then decrypted into a final file. The original file was compared with the final file. This test was executed 100 times and succeeded in all 100 rounds. Additionally, the kcapi-enc performance tests from Section 6.3 were also used to verify the correctness of this code path, and no erroneous results were observed. The LKM was used to directly exercise the KCAPI code path. Such a test can send transformation requests at a higher rate, which aids in detecting race conditions. The LKM allocates 8 MiBs of memory and continuously sends encryption and decryption requests. Similarly to the unit tests, a hang can be observed extremely rarely. However, no memory corruption was observed.

This evaluation shows that my Linux-based prototype implementation is mostly correct, except for the few observed hangs. However, the hangs only occurred very rarely, after sending hundreds of millions of requests. Although fixing these issues would be necessary for deploying SE-Vault to a production system, I did not consider it important for reaching the goals of my thesis.

6.3 Performance Evaluation

In this section, I describe the methods used for measuring the performance of SE-Vault and discuss the gathered results. The performance evaluation of SE-Vault has two goals. The first goal is to present throughput and latency metrics for all user interfaces of SE-Vault. These interfaces include the OpenSSL SE-Vault interface, the KCAPI interface and the dm-crypt interface. The second goal is to showcase the overhead of using SE-Vault. Although SE-Vault provides protection against a variety of attacks, it also

needs to incur little overhead in order to be applicable in production systems. For a comprehensive comparison, I ran my benchmark on four different test scenarios for each implemented interface. The four test scenarios include the default implementation of the interface, optimized SE-Vault without SEV, unoptimized SE-Vault with SEV, and the optimized SE-Vault implementation with SEV. The goal of this comparison is to showcase the overhead of using SE-Vault, the overhead of using SEV, and the improved performance due to my optimizations.

6.3.1 Throughput Measurements

The performance goal of SE-Vault is to achieve high throughput rather than low latency for encryption and decryption requests. SE-Vault cannot achieve low latency for serving a request since multiple context switches and data copies are required before a request is served. In comparison, an AES library implementation only needs to read the data once, transform it and write it out. However, SE-Vault can still be competitive in throughput for large inputs where the context-switch-per-byte ratio is low. In such a scenario, the throughput would be bound by the maximum memory bandwidth of the system. For example, the memory bandwidth of a single core on the AMD EPYC 3251 CPU is around 20 GiB/s, which is higher than the measured throughput of AES encryption with any key length and cipher mode. Thus, the movement of data from the Host's address space to the VM's address space can theoretically lead to a tolerable performance degradation. The SE-Vault throughput performance was measured for four different interfaces: the SE-Vault OpenSSL engine, the kcapi-enc command-line tool, the KCAPI from a custom LKM, and disk encryption via dm-crypt.

[Figure 6.2: three panels plotting throughput (MiB/s) over block size (bytes) for the default SSL engine and the three SE-Vault configurations (no SEV, SEV, SEV optimized). Panels: (a) Test for CBC AES-128. (b) Test for CBC AES-192. (c) Test for CBC AES-256.]

Figure 6.2: SE-Vault OpenSSL engine throughput test using the openssl speed benchmark. Higher is better.

Figure 6.2 shows the result of using the openssl speed benchmark to measure the

throughput for all four configurations: the default OpenSSL engine and the three instances of SE-Vault. Figures 6.2a, 6.2b and 6.2c respectively show the results for using AES-CBC with a 128-bit-long, 192-bit-long and 256-bit-long key. The x-axis shows the data block size sent to the encryption engine. A block needs to be processed and the result returned before the next data block can be sent. The y-axis shows the measured throughput in MiB/s. Notably, the default OpenSSL engine uses its internal AES implementation which uses the AES-NI extension, while the SE-Vault implementation uses the aesni-intel implementation in the Linux kernel. Thus, one should expect some difference in performance if either of the implementations uses the AES-NI extension inefficiently.

As seen in all three figures, SE-Vault's throughput improves as the block size increases. This result is expected since the context switch overhead is comparatively high for small block sizes and low for large block sizes. Intuitively, more useful work is performed for a large block size than for a small block size. Another observation from the figures is that there is almost no performance difference between the three SE-Vault configurations for small block sizes. The reasoning for this result is similar: most of the computation time is spent on context switches, whose overhead is identical for all three SE-Vault configurations. For large block sizes, a large discrepancy in throughput can be observed among all four configurations. First, the optimized SE-Vault implementation with SEV and the SE-Vault implementation without SEV significantly outperform the unoptimized SE-Vault implementation with SEV. This performance difference originates in the necessity to perform IOVA-to-GPA translation in the unoptimized version, while the two other versions do not need to perform such address translation. The optimized SE-Vault version with SEV has slightly lower throughput than the version of SE-Vault without SEV because it needs to copy data from the DMA region to its private buffer. Surprisingly, the two SE-Vault versions perform better than the default OpenSSL engine for large block sizes. This could be attributed to a more efficient implementation of AES in the KCAPI or to a bug in the SE-Vault OpenSSL engine implementation. However, the correctness tests from Section 6.2 did not reveal any bugs in my implementation. Investigation of the discrepancy is left for future work.

The next throughput benchmark relies on the kcapi-enc command line tool to encrypt or decrypt files of increasing size and save the result to disk. The file size varies from 32 MiBs to 480 MiBs with a step of 64 MiB. After performing encryption or decryption of a file, the disk writes are synchronized by executing the sync command. Although this ensures that the overhead of writing to disk is included in the measured time, it can also introduce noise into the measurement since other disk writes are synchronized as well. To reduce the influence of the noise on the results, I performed 8 runs for each test and measured the average throughput.



Figure 6.3: Throughput comparison among multiple SE-Vault configurations and the default AES engine (aesni-intel) for file encryption (a) and decryption (b) with kcapi-enc. The cipher is AES with a 256-bit-long key and CBC as the cipher mode. The x-axis shows the size of the file in MiB, and the y-axis shows the throughput in MiB/s. Higher is better.

Figure 6.3 shows the results for encryption (left) and decryption (right) using a 256-bit-long AES key with the CBC mode. I measured four different configurations: the default AES implementation (aesni-intel) against the three SE-Vault configurations. As in the previous test, the unoptimized SE-Vault implementation with SEV offers the worst throughput. Also, the optimized SE-Vault implementation with SEV has roughly the same performance as the SE-Vault implementation without SEV, which shows that using SEV has negligible overhead. For both encryption and decryption, it can be observed that the optimized SE-Vault implementation achieves 50% of the throughput of the default AES implementation in the kernel. Although this performance degradation is significant, the achieved throughput can still be sufficient for many applications. Additionally, although the aesni-intel module offers better performance, it does not protect the keys from a privileged attacker, which may be unsatisfactory for certain applications.

Although the kcapi-enc command line tool is useful for using the KCAPI from user space, the results obtained with it include noise from context switches and disk IO. To provide a better overview of KCAPI performance with SE-Vault, I implemented a simple Linux Kernel Module (LKM) which uses the KCAPI to encrypt memory. The module allocates 32 MiBs of memory and directly makes asynchronous requests to the KCAPI to encrypt the buffers. In the current SE-Vault implementation, each request is split into blocks of 4096 bytes as a compromise to offer good average performance for different workloads. A sketch of the LKM's request path follows.
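The snippet below is a minimal sketch of that request path using the standard Linux skcipher API; buffer management and error handling are trimmed, a synchronous wait stands in for the asynchronous completion handling of the actual module, and the comment about the dispatch to SE-Vault reflects an assumption about how the host driver registers itself with the KCAPI.

    #include <crypto/skcipher.h>
    #include <linux/module.h>
    #include <linux/scatterlist.h>

    /* Encrypt one 4096-byte block through "cbc(aes)". Assuming the
     * SE-Vault host driver registers its cbc(aes) implementation
     * with the KCAPI, the request is forwarded to the VM instead
     * of being served by aesni-intel. */
    static int sevault_bench_encrypt(void *buf, u8 *key, u8 *iv)
    {
        struct crypto_skcipher *tfm;
        struct skcipher_request *req;
        struct scatterlist sg;
        DECLARE_CRYPTO_WAIT(wait);
        int ret;

        tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);
        if (IS_ERR(tfm))
            return PTR_ERR(tfm);
        ret = crypto_skcipher_setkey(tfm, key, 32);
        if (ret)
            goto out_tfm;

        req = skcipher_request_alloc(tfm, GFP_KERNEL);
        if (!req) {
            ret = -ENOMEM;
            goto out_tfm;
        }
        sg_init_one(&sg, buf, 4096);
        skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                      crypto_req_done, &wait);
        skcipher_request_set_crypt(req, &sg, &sg, 4096, iv);
        ret = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

        skcipher_request_free(req);
    out_tfm:
        crypto_free_skcipher(tfm);
        return ret;
    }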



Figure 6.4: Kernel Crypto API performance when used from a Linux Kernel Module. The optimized SE-Vault implementation achieves 40% of the throughput of the default KCAPI AES implementation.

The results are shown in Figure 6.4. As observed previously, the unoptimized SE-Vault implementation with SEV achieves the worst throughput, which is caused by the IOVA-to-GPA translations. The default KCAPI AES implementation (aesni-intel) achieves the highest throughput: 1113 MiB/s. The optimized SE-Vault implementations with and without SEV both achieve around 40% of the performance of the default AES implementation. Notably, the performance gap can be made smaller by increasing the block size from 4096 bytes to 16384 bytes. However, such a change would need to carefully schedule smaller buffers to the Virtqueues which handle small inputs, otherwise memory and performance would be wasted.

The next test examines the throughput for disk decryption when the SE-Vault implementation is deployed. In the initialization stage, a loop device is created which is backed by a file of size 1280 MiBs. The device is initialized with a sector size of 2048 bytes, uses the LUKS2 format, and relies on AES-CBC with a 256-bit-long key for encryption. After the device is first mapped, an ext4 file system is created on it, on which a 1024-MiB-large file is created and filled with random data. After initialization is finished, the actual performance test is executed. In the performance test, the device is opened and mounted to a local directory. Immediately after, the md5sum

command is used to compute the md5 hash of the contents of the contained file. Computing the md5 hash achieves two goals: 1) it asserts that the whole file is read, and 2) the hash can be compared against a recorded value to verify the integrity of the file. Afterwards, the device is unmounted and closed. The performance test is repeated eight times and the average is reported.


Figure 6.5: Disk decryption performance test for a device size of 1240 MiB. The optimized SE-Vault implementation with SEV reaches 51% of the performance of the default KCAPI AES implementation.

The results are shown in Figure 6.5. As in previous tests, the unoptimized SE-Vault implementation with SEV achieves the worst throughput. The optimized SE-Vault implementation reaches 51% of the performance of the default AES engine in the KCAPI. Notably, the throughput results can be improved by using a sector size of 4096 bytes and by increasing the block size in the SE-Vault host driver KCAPI implementation from 4096 bytes to 16384 bytes.

My last throughput experiment has the goal of determining the time breakdown of handling a request: 1) how much time is spent on data communication and 2) how much time is spent on performing the AES encryption. To calculate the cost of each, I examined two ciphers by using the LKM from earlier to send 32 MiBs of data for encryption to the optimized SE-Vault implementation with SEV enabled. The first KCAPI cipher is ecb(aes), which provides the total cost of data communication and AES encryption. The second KCAPI cipher is ecb(cipher_null), which only copies data

but skips the AES encryption. One can compute the time share of AES encryption as (T1 − T2)/T1, where T1 and T2 are respectively the measured times for the first and second cipher. In my test, only 4% of the computation time is spent on AES encryption, and the remaining 96% is spent on copying data, context switches and scheduling.
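As a worked example with illustrative numbers (not the measured values): if encrypting the 32 MiBs takes T1 = 250 ms with ecb(aes) and T2 = 240 ms with ecb(cipher_null), then

\[
\frac{T_1 - T_2}{T_1} = \frac{250\,\mathrm{ms} - 240\,\mathrm{ms}}{250\,\mathrm{ms}} = 4\%.
\]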

6.3.2 Latency Measurements

Due to its design, SE-Vault can only achieve high throughput for request processing, not low latency for an individual request. The abundance of memory copies, context switches and scheduling operations adds substantial latency to a request. However, it is meaningful to quantify the added latency overhead since even some throughput-reliant applications may have a latency threshold in order to provide good system responsiveness. In this section, I measure the latency of processing a request for SE-Vault's interfaces: the user space ioctl interface and the KCAPI interface.


Figure 6.6: Average latency of an encryption request using the ioctl interface for all supported request sizes. Lower is better.

My first test measures the average latency of sending and reading back an encryption request from user space via the ioctl interface. The test includes measurements for the optimized SE-Vault implementation with and without SEV, as well as for the unoptimized SE-Vault implementation with SEV. Each result is computed as the average latency of


65536 requests. The results of the test are shown in Figure 6.6. The x-axis lists each supported block size from 1024 to 16384 bytes. The y-axis shows the average latency of a request in microseconds. Three key observations can be made from the results. The first observation is that the latency of a request increases with the block size, which is expected: more work — data movement and encryption — needs to be performed for a large block size than for a small one. The second observation is that the unoptimized SE-Vault implementation with SEV has the highest latency for each request size. This result is also expected because the unoptimized version requires communication with QEMU for IOVA-to-GPA translation, which adds extra latency. The third observation is that using SEV adds 5-10 microseconds of latency since it requires an extra copy of the data from the shared DMA region to a private memory buffer.


Figure 6.7: Average latency of an encryption request using the kcapi interface from a custom LKM for all supported request sizes. Lower is better.

The next test measures the average latency of an encryption request sent from an LKM using the KCAPI. The test includes the candidates from the previous test and, in addition, the default KCAPI AES engine (aesni-intel). Each result is computed as the average latency of 65536 requests. Figure 6.7 shows the results from the test. The x-axis shows the block size and the y-axis shows the average latency in microseconds. As before, the unoptimized SE-Vault implementation with SEV achieves the highest latency. Also, enabling SEV with the optimized SE-Vault implementation only adds

a negligible 4% increase in latency. The default KCAPI AES engine has significantly lower latency than any of the other test candidates. This result is expected since the default AES engine can immediately process the request without any request buffering, context switches or scheduling.


Figure 6.8: Average percentage of computation time spent on AES encryption. The x-axis shows the block size and the y-axis shows the percentage delta between ecb(aes) and ecb(cipher_null) when the optimized SE-Vault implementation is used with SEV.

The last test measures the percentage of computation time spent on performing AES encryption (useful work) when the optimized SE-Vault implementation with SEV is used. The residual percentage consists of data movement, context switches and scheduling, which can be considered wasted work. The results are obtained by measuring the average latency for the two ciphers — ecb(aes) and ecb(cipher_null) — and then computing the overhead of using AES as a cipher. Figure 6.8 shows the results for this test. The x-axis shows the request size and the y-axis shows what percentage of the total time serving a request is used for performing AES encryption. Depending on the block size, between 6% and 12% of the time is spent on AES encryption. This result is expected since serving a request requires multiple memory copies, context switches and scheduling, all of which have high cost in comparison to AES encryption. Another observation is that the percentage of useful work increases as the block size increases. This result is also expected since the ratio of context switches and scheduling operations per processed byte decreases as the block size increases.

7 Attacks

Section 2.2 discusses the features and security hardenings of the SEV family of extensions. Each new generation of SEV improves the design to address vulnerabilities which can be used to leak confidential data from the VM, to gain code execution inside the VM, or to circumvent the attestation process. In this chapter, I examine the applicability and limitations of possible attack vectors against SE-Vault when using SEV, SEV-ES and SEV-SNP. Although the attacks are discussed in the context of SE-Vault, many are applicable to any software protected with the SEV family of features.

7.1 Kernel Memory Disclosure Attacks

A monolithic kernel, like Linux, consists of millions of lines of complex code written in an unsafe language (C). Due to the sheer complexity and limited opportunities for code safety guarantees, bugs are introduced, discovered and fixed regularly in the Linux kernel. Some bugs may lead to the exploitation of the kernel and are classified as vulnerabilities. According to Niu et al. [Niu+14], there has been a significant increase in the number of discovered vulnerabilities per year in the Linux kernel: from 1020 in 2000 to 5186 in 2013. One recurring class is the memory disclosure vulnerability, also known as an information disclosure vulnerability. Such a vulnerability grants an attacker the opportunity to read restricted memory, such as the memory of the Linux kernel, which can lead to the leakage of precious system secrets. Such secrets can be cryptographic keys used for communication channels or for disk encryption. Although the Linux kernel employs various security hardenings, like KASLR and stack canaries, and utilizes static analysis, vulnerabilities in the Linux kernel still exist. For example, CVE-2020-8835 is a recent Linux vulnerability which allows an attacker to read arbitrary memory addresses by exploiting a bug in the eBPF verifier [Mit]. Although this bug is fixed in recent versions of Linux, similar bugs are unlikely to disappear from a complex code base such as Linux.

Precisely for these reasons, using SE-Vault is reasonable for a production system even if there is a noticeable performance degradation. SE-Vault is invulnerable to memory disclosure attacks by design, because it relies on the SEV hardware feature to encrypt the VM's memory. Even if a memory disclosure vulnerability exists in the Host kernel,

the attacker cannot read the memory of the SE-Vault VM in plaintext because the CPU would not decrypt it. This result is expected since it precisely fits the attacker model of SE-Vault (see Section 4.2).

7.2 Register State Attacks

When the VM exits and transfers execution to the HV, the VM's architectural state has to be saved. The architectural state includes all of the architectural registers: RAX, RBX, RIP, the segment registers, the CR registers, descriptor registers, etc. Some of the registers are saved automatically by the CPU in the VMCB on a VM-exit. Other registers are saved manually by a short snippet of assembly code immediately after the vmenter instruction. The SEV feature does not encrypt or protect the integrity of the VMCB and of the other registers exposed to the HV on a VM-exit. Thus, the HV can read the registers in plain text after every VM-exit and modify them to any value prior to resuming the VM.

This and other flaws of SEV's design were first examined in the work of Hetzelt et al. [HB17]. The authors propose and implement an attack which diverts the Guest's control flow to suitable sequences of instructions by updating the Guest's RIP register after a VM-exit. The attack requires that the attacker can statically analyze the Guest's kernel image to find the instruction sequences. Werner et al. [Wer+19] proposed a new attack for tracing execution and examining the results of instructions in the VM by using debug registers. By examining the changes in the saved registers on each VM-exit, the authors can infer the sequence of instructions executed by the VM. First, Werner et al. apply the attack to read exchanged HTTPS messages between the SEV-protected VM and a remote user. Second, the authors use the attack to inject a malicious private key into OpenSSH in the VM to open an ssh session.

This attack can be applied to SE-Vault if the VM is protected only with SEV. The attacker can gain code execution by using the attacks from [HB17] and [Wer+19], and then extract all of the keys stored in SE-Vault. Notably, the attacker does not even need to gain code execution inside the VM to extract the keys. If the attacker has access to the SE-Vault kernel image, then the attacker can statically analyze it and find locations in the code where a secret key is stored inside registers. Such a situation would naturally occur since the Kernel Crypto Engine used by SE-Vault needs to have the key loaded in a register to apply cryptographic transformations to the handled data. The attacker can force a VM-exit by modifying the Second Level Address Translation of the Guest for a suitable Guest Physical Page.

These attack vectors are not applicable against VMs protected with SEV-ES and SEV-SNP. SEV-ES and SEV-SNP save the registers to an encrypted and integrity-protected

structure, called the VMSA, and clear the registers on every VM-exit to limit information leakage. An attacker can no longer apply the described attacks to infer computation or to gain execution in the VM. This attack vector also requires that the attacker has gained code execution in the Host's kernel. Although this may be feasible in recent versions of Linux, the attacker model of SE-Vault considers only memory disclosure vulnerabilities, not code execution vulnerabilities. SE-Vault can be further enhanced to cover code execution vulnerabilities in the Host kernel, but the implementation would require the use of SEV-SNP, which is not supported by Linux or any AMD CPU at the time of writing.

7.3 NPT Corruption Attacks

By the design of the AMD Virtualization extensions, the HV always has access to the Nested Page Table of a VM. Through this access, a HV can better utilize system memory by swapping out pages, or can detect memory accesses to MMIO regions, which is necessary for correct virtualization. This leaves a security-critical component of the VM in the hands of a potentially vulnerable HV.

One of the first attacks on SEV was presented by Hetzelt et al. [HB17], who exploit the missing integrity protection of the Nested Page Table (also referred to as Second Level Address Translation in the literature). The attack modifies some Page Table Entries of the Nested Page Table in order to trick the VM into accessing unintended data or code. In order to understand this attack vector, we first need to briefly discuss the contents of a PTE in the Nested Page Table. A PTE contains the HPA, the Present bit, the Non-Executable (NX) bit and the Write (W) bit, among other meta-information bits. The HV is in control of a PTE in the Nested Page Table, and thus can modify the aforementioned PTE fields. For example, by removing the Present, NX or Write bits, the HV can track memory reads, memory writes, and instruction fetches from a page. This essentially allows the HV to track the VM's control and data flow at a page granularity. Additionally, the HV can manipulate the HPA field, which allows the HV to force the VM to access other memory than its execution expected.

The attack proposed by Hetzelt et al. uses the HPA field to trick the VM into executing malicious code. The idea is visualized in Figure 7.1. The HV is only in control of the Nested Page Table, while the Guest Page Table and the Guest Physical Memory remain encrypted, and thus inaccessible. The attacker can modify the HPA field in the Nested Page Table in order to remove the entry for Page B and instead set an entry for Page A. In this example, Page A contains malicious code, and Page B contains the code which the VM is currently executing. When the VM starts executing the code, the VM would fetch the instructions from Page A, which leads to the malicious code being executed instead of the expected VM code.



Figure 7.1: A malicious HV modifies a PTE in the Nested Page Table to change the control flow of the VM. When the VM attempts to fetch code from Page B, the VM would instead fetch code from Page A, which happens to contain malicious code in this example.

Hetzelt et al. used this approach to modify the control flow of an SSH server in the VM, which allowed them to open a shell inside the SEV-encrypted VM. This attack vector was later extended by Morbitzer et al. [Mor+18], who managed to extract the encrypted memory of a VM in plaintext by making use of a service running inside the VM. While these attacks are applicable to both SEV and SEV-ES, performing them requires that the attacker gains code execution in the Host kernel. For this reason, this attack vector is not considered in the attacker model of SE-Vault.
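To make the primitive concrete, the following sketch shows the two PTE manipulations the attacks build on; the helper names are illustrative and do not correspond to an actual hypervisor interface.

    #include <stdint.h>

    #define NPT_PRESENT   (1ULL << 0)
    #define NPT_ADDR_MASK 0x000ffffffffff000ULL /* HPA bits 51:12 */

    /* Clearing the Present bit forces a nested page fault on the
     * guest's next access, letting the HV trace control and data
     * flow at page granularity. */
    static inline void npt_track_page(uint64_t *pte)
    {
        *pte &= ~NPT_PRESENT;
    }

    /* Swapping the HPA field remaps the guest-physical page to a
     * different host page (Page B -> Page A in Figure 7.1), so the
     * guest fetches attacker-chosen memory as its own code. */
    static inline void npt_remap_page(uint64_t *pte, uint64_t new_hpa)
    {
        *pte = (*pte & ~NPT_ADDR_MASK) | (new_hpa & NPT_ADDR_MASK);
    }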

7.4 Memory Corruption Attacks

Memory encryption is an integral part of the design of the SEV family of extensions. However, encrypting data written to memory or decrypting data read from memory has to be performed efficiently to avoid significant performance degradation in comparison to native execution. With SEV and SEV-ES, the AMD memory controller only encrypts the Guest’s memory but does not protect its integrity or restrict accesses to it from the HV. The HV can still read the cipher text of the Guest’s encrypted memory, overwrite it with other encrypted data or move encrypted blocks from one address to another. Thus, SEV’s cipher mode must prevent an attacker from easily injecting arbitrary code or data into

the Guest by moving encrypted blocks. Du et al. first reverse-engineered the block cipher used on the first generation of Ryzen CPUs [Du+17]. The authors revealed that the block cipher mode is Xor-Encrypt (XE). The XE mode takes a 16-byte-long block and performs an xor with a random pattern, called the tweak function. The tweak function is constructed only from the block identifier. Finally, the result is encrypted using the secret AES key for the VM.

\[ \mathrm{Enc}^j_K = \mathrm{AES}_K(P \oplus T^j) \tag{7.1} \]

\[ P = \mathrm{AES}^{-1}_K(\mathrm{Enc}^j_K) \oplus T^j \tag{7.2} \]

The XE mode was constructed by Rogaway [Rog04] to provide an efficient tweakable block cipher which is also resilient to Chosen Plaintext Attacks (CPA). Equation 7.1 shows the procedure for encrypting a block P with block identifier j and an encryption key K. With respect to memory encryption, the block identifier j is the GPA of the block. The tweak function T^j is random, can be efficiently derived from the physical address of the block, and must be independent from the tweak functions for other physical addresses. Correspondingly, Equation 7.2 shows the function for decrypting a message with XE.

Rogaway discusses an efficient technique for deriving new tweak functions by incrementing one or more components of a preceding tweak function in the formed lattice. Namely, the tweak function for the state 0110 can be derived by incrementing the second component of the state 0010. Rogaway suggests that an increment along one of the components be implemented as a multiplication with a suitable unique number.

\[ \begin{aligned} T^{010000000} &= \texttt{82 25 38 38 82 25 38 38 82 25 38 38 82 25 38 38} \\ T^{000010000} &= \texttt{b0 10 b2 c0 b0 10 b2 c0 b0 10 b2 c0 b0 10 b2 c0} \end{aligned} \tag{7.3} \]

\[ T^{010010000} = T^{010000000} \oplus T^{000010000} \tag{7.4} \]

Unfortunately, Rogaway's scheme for incrementing components of a tweak function is difficult to implement in hardware and would incur additional latency when encrypting or decrypting a block. Du et al. reverse-engineered the tweak generation scheme implemented on first generation Ryzen CPUs and showed that it is vulnerable [Du+17]. The authors show that the number of components is limited by the number of bits in a physical address and that the component increment operation is an xor with another tweak function. Equation 7.3 shows a few initial states from which new tweak functions can be derived. Equation 7.4 shows the derivation of the tweak function for a physical address with the binary representation 010010000. Du et al. also discuss that the tweak functions are the same on all examined first generation Ryzen CPUs and that each

unique tweak function has very little entropy. As shown in Equation 7.3, the tweak functions exhibit a repeating pattern of length four bytes. If all of the functions exhibit the same pattern, the entropy of the tweak functions is only 32 bits, which makes them susceptible to a brute-force attack.

\[ \begin{aligned} \mathrm{Enc}^j_K &= \mathrm{AES}_K(P_j \oplus T^j) \\ P_i &= \mathrm{AES}^{-1}_K(\mathrm{Enc}^j_K) \oplus T^i = P_j \oplus T^j \oplus T^i \end{aligned} \tag{7.5} \]

These vulnerabilities in the tweak function derivation allow an attacker to efficiently compute the initial tweak functions and to apply a block-moving attack. In this attack, a malicious actor with access to the HV moves known cipher blocks from one physical address to another in order to modify code or data in the desired way. Equation 7.5 shows the value of a block at physical address i after the HV moves a block from physical address j to address i. Namely, the data at physical address i will be equal to the data at address j xor-ed with the tweak functions for addresses i and j. If a malicious HV knows the plain text of parts of the Guest's memory, then the HV can find suitable blocks in that memory which yield useful data or instructions when moved to a target address in the Guest's memory.

Du et al. design an attack which uses a web server in the SEV-protected VM as an encryption oracle. The authors can send messages with arbitrary data to the server, which get automatically encrypted using the described tweakable block cipher mode. The authors craft messages which are transformed to valid instruction sequences when the blocks are moved to a specific physical address. To achieve this, the attacker needs to xor the machine code with the tweak functions T^i and T^j if the cipher block is moved from physical address i to physical address j. In their tests, Du et al. succeed in gaining code execution inside the SEV-protected VM by moving the cipher block to a physical address where code is executed frequently.

The attack of Du et al. [Du+17] is limited by its requirement of finding an encryption oracle inside the VM. Wilke et al. showed that an attacker can use the Linux kernel image as a source of known plaintext blocks which can be used in the block-moving attack [Wil+20]. As part of their attack chain, the authors need to bypass KASLR, which randomizes the physical and virtual offset of the kernel image, and also the offsets of other memory regions such as page_offset_base, vmalloc_base and vmemmap_base. The authors apply the block-moving attack to patch a function which checks whether KASLR should be used or not. If KASLR is not used, the offsets will always be the same and known to the attacker. Wilke et al. also briefly propose an alternative approach which monitors the sequence of Page Faults to find the physical offset of the kernel image. Before the kernel is decompressed to a random physical address determined by KASLR, the sequence of Page Faults is deterministic, which

93 7 Attacks allows the attacker to wait for the first random PF. The first random PF determines the KASLR physical offset of the kernel image.
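The block-moving identity of Equation 7.5 holds for any keyed permutation, not just AES. The following toy sketch demonstrates it, with a trivial invertible byte mapping standing in for AES_K; this is an illustration of the algebra, not the real cipher.

#include <stdint.h>

#define BS 16

/* Toy stand-in for AES_K and its inverse: b -> 5*b + 17 (mod 256).
 * Any keyed permutation exhibits the same block-moving algebra. */
static void toy_perm(uint8_t b[BS])
{
    for (int i = 0; i < BS; i++)
        b[i] = (uint8_t)(b[i] * 5 + 17);
}

static void toy_perm_inv(uint8_t b[BS])
{
    for (int i = 0; i < BS; i++)
        b[i] = (uint8_t)((b[i] - 17) * 205); /* 5 * 205 = 1025 = 1 (mod 256) */
}

/* XE encryption at an address: Enc = perm(P ^ T(addr)). */
static void xe_encrypt(const uint8_t tweak[BS], uint8_t block[BS])
{
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
    toy_perm(block);
}

/* XE decryption at an address: P = perm_inv(Enc) ^ T(addr). Decrypting a
 * block written at address j with the tweak of address i therefore yields
 * P_j ^ T(j) ^ T(i), which is exactly Equation 7.5. */
static void xe_decrypt(const uint8_t tweak[BS], uint8_t block[BS])
{
    toy_perm_inv(block);
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
}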

Figure 7.2: A chain of two-byte-long gadgets at the beginning and end of cipher blocks. The cpuid instruction at the end of the first block sets registers, and the first two bytes of the next block jump to the end of the next block. By carefully selecting the gadgets, an attacker can execute arbitrary code inside the VM.

Once the physical address of the kernel image is known, the attack continues by searching for suitable blocks which form executable code when moved. This step requires that the attacker has access to the vmlinux image of the kernel deployed in the VM. Because the kernel image is typically only about 10 MiB in size, an attacker has few cipher blocks to choose from. As mentioned by Wilke et al., the small search space and the repetitive tweak pattern limit the attacker to reliably controlling only two bytes of a 16-byte-long cipher block. The authors suggest combining the last two bytes of one block with the first two bytes of another block, as shown in Figure 7.2. In this way, an attacker can execute an arbitrary two-byte-long instruction and then execute a near jmp to the end of the next block. However, this idea alone does not help with loading useful data into registers, since many x86-64 instructions are encoded in three bytes or more. Wilke et al. propose augmenting the block-moving attack by relying on intercepted instructions to load useful values into the general-purpose registers of the Guest. The authors discuss using the cpuid instruction, which is emulated by the HV. Although SEV-ES first intercepts the cpuid instruction in the VC handler, control is eventually passed to the HV to emulate it. The HV can then update the EAX, EBX, ECX and EDX values of the Guest and resume execution. With this approach, the HV can inject instructions by moving blocks and also update register values by relying on intercepted instructions. Figure 7.2 shows an example of the idea, and the sketch below lists the corresponding gadget bytes. First, cpuid is executed to update the lower 32 bits of some of the Guest's registers. Then a jmp is executed to continue execution with the last two bytes of the next block. Last, the value of the RDI register is updated with the value of RAX. By using both techniques, the authors report having developed a high-speed encryption and decryption oracle which works on both SEV and SEV-ES.
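Based on the opcodes shown in Figure 7.2, the controlled byte positions can be summarized as follows; the 12 bytes between the head and tail of each 16-byte block remain uncontrolled:

/* Gadget bytes from Figure 7.2. Only the first and last two bytes of each
 * 16-byte cipher block are reliably controlled by the attacker. */
static const unsigned char block_a_tail[2] = { 0x0f, 0xa2 }; /* cpuid */
static const unsigned char block_b_head[2] = { 0xeb, 0x1c }; /* jmp rel8:
        0x1c bytes past the next instruction, i.e. +0x1e from the jmp */
static const unsigned char block_b_tail[2] = { 0x50, 0x5f }; /* push rax;
                                                                pop rdi */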

$\mathrm{Enc}^j_K = \mathrm{AES}_K(P_j \oplus T^j) \oplus T^j$ (7.6)

$P_j = \mathrm{AES}^{-1}_K(\mathrm{Enc}^j_K \oplus T^j) \oplus T^j$ (7.7)

AMD updated the block cipher mode from XE to XEX on its second generation of Ryzen CPUs [Wil+20]. The application of the XEX cipher for encrypting and decrypting a block is shown in Equation 7.6 and Equation 7.7, respectively. The only difference from XE is that XEX additionally applies an xor with the tweak function after the block is encrypted. As discussed by Wilke et al. [Wil+20], the initial tweak functions are constructed using the same incrementing method as with XE, are still constant for the entire system, and still have only 32 bits of entropy. The authors were able to compute the initial tweak functions by brute-forcing the values in 30 minutes. This extends the block-moving attack to SEV and SEV-ES VMs running on second-generation Ryzen CPUs. However, the attack does not apply to SEV-SNP from software, because the memory controller restricts access to the Guest's memory from the HV.

The attack of Wilke et al. [Wil+20] can be used against SE-Vault if the VM relies on the SEV or SEV-ES features and the system has a first- or second-generation Ryzen CPU. The attack requires write and read access to the VM's memory, obtained either by exploiting a vulnerability in the Host kernel or by gaining execution in the QEMU process. Additionally, the attacker must have access to the vmlinux image of the SE-Vault VM and must be able to locate the image in the VM's memory. Notably, this attack requires write access or even code execution, and is thus considered outside the scope of the SE-Vault attacker model.
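Extending the toy XE sketch from earlier in this chapter (reusing BS, toy_perm and toy_perm_inv), the XEX construction of Equations 7.6 and 7.7 only adds an outer xor with the same tweak. It does not change how the tweaks are derived, which is why the 32-bit brute force still recovers them:

/* XEX with the toy permutation from the XE sketch above (Equations 7.6
 * and 7.7). The only difference from XE is the second xor with the tweak
 * after the permutation. */
static void xex_encrypt(const uint8_t tweak[BS], uint8_t block[BS])
{
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
    toy_perm(block);
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
}

static void xex_decrypt(const uint8_t tweak[BS], uint8_t block[BS])
{
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
    toy_perm_inv(block);
    for (int i = 0; i < BS; i++)
        block[i] ^= tweak[i];
}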

7.5 Attacks Summary

The previous sections discuss various attacks which can be used to steal the secrets stored in SE-Vault. The attacks differ in what they achieve, in what they require, and in their applicability to the SEV family of features and to SE-Vault. This section summarizes the presented attacks in order to facilitate comparison.

Table 7.1 lists all attacks discussed in this chapter. The target column specifies whether the attack leads to memory disclosure or to code execution (CE). The requirement column specifies what context is necessary to apply the attack: code execution in the Host kernel, or a memory disclosure bug. The remaining columns specify whether the attack is applicable to SEV, SEV-ES, SEV-SNP, and SE-Vault. The value N/A states that the attack is not applicable due to a different attacker model.

Attack                     Target             Requirement              SEV / SEV-ES   SEV-SNP   SE-Vault
Kernel Memory Disclosure   Read memory        Memory disclosure bug    Yes            No        No
Register State             Read memory / CE   Host kernel CE           Only SEV       No        N/A
NPT Corruption             Read memory / CE   Host kernel CE           Yes            No        N/A
Memory Corruption          CE                 Host kernel CE           Yes            No        N/A

Table 7.1: Summary of all discussed attacks. For each attack, the table shows the target of the attack (code execution (CE), read memory), the necessary requirement for applying the attack (e.g. host root privileges), and whether it is applicable to SEV, SEV-ES, SEV-SNP and to SE-Vault. If the attack is not applicable, N/A is set.

Two main observations can be made from the summary. The first concerns SE-Vault. Kernel memory disclosure attacks cannot be applied to SE-Vault, as was empirically shown in the security evaluation in Section 6.1. Although the other attacks can be used to extract the stored secrets from SE-Vault, they require code execution in the Host kernel, which is excluded from the attacker model of SE-Vault. Because their requirements are incompatible with the attacker model of SE-Vault, these attacks are labeled as Not Applicable (N/A).

The second observation is that none of the attacks can be applied to SEV-SNP. SEV-SNP is an improvement over SEV-ES and thus also keeps the VM's registers encrypted, which removes all attack vectors that read or try to corrupt the register state of the VM. Additionally, with SEV-SNP the CPU's memory controller prevents any memory writes to the VM's memory from other software [Inc20a], which prevents memory corruption attacks such as the one of Wilke et al. [Wil+20]. SEV-SNP also prevents Nested Page Table corruption attacks, because any change of a physical address mapping in the Nested Page Table requires the cooperation of the VM [Inc20a]. Thus, a malicious HV cannot forcefully remap an encrypted page, and the attacks of Hetzelt et al. [HB17] and Morbitzer et al. [Mor+18] cannot be used on an SEV-SNP VM.

8 Discussion and Future Work

This thesis presents the design, implementation and evaluation of an SEV-based TEE which is capable of securely storing cryptographic secrets and of performing cryptographic transformations with these secrets. The considered attacker model assumes the presence of a malicious unprivileged user who exploits a kernel bug to gain only read access to all of physical memory. The thesis provides two implementations of the TEE: one using an LKM in a Linux VM, and one extending seL4 and implementing the main functionality in a user space service. Both implementations support a single cipher, AES with the CBC cipher mode, but the design considers the addition of more ciphers and cipher modes. With this cipher, the SE-Vault implementation achieves 50% of the request throughput of aesni_intel (the default AES implementation in the Linux kernel).

This thesis fulfills the research goals of Section 1.1, but various improvements can be made in future extensions of SE-Vault. In this chapter, I discuss directions for extending and improving this work.

Attacker model. The proposed attacker model in Section 4.2 assumes that the attacker may have unrestricted read access to all of physical memory, but excludes cases in which the attacker has unrestricted write access or gains code execution in the kernel. In the case of gaining code execution, the attacker would be able to encrypt or decrypt messages without even extracting the stored secrets. This scenario limits the usefulness of SE-Vault, but such actions would reduce system responsiveness, introduce noise and may be noticed by an administrator or another user. An attacker may therefore prefer to steal the stored secrets using the attacks of Chapter 7. Alternatively, the attacker may install a rootkit which monitors system usage and steals the secret keys when they are delivered to a fresh instance of SE-Vault.

In this context, two improvements can be made to elevate the attacker model of SE-Vault to include a privileged attacker. The first improvement is to attest the initial state of the SE-Vault VM to guarantee that it is launched with the correct state. With this addition, an attacker cannot introduce malicious code into the SE-Vault VM's content, and the delivery of cryptographic keys can be based on the SEV secret provisioning scheme [Adv20a]. The second improvement is to protect the SE-Vault VM with the latest SEV iteration, SEV Secure Nested Paging (SEV-SNP), which addresses various security vulnerabilities of SEV and SEV-ES (see Chapter 7).


By adding these changes to SE-Vault, the stored secrets can be protected against a privileged attacker who has gained code execution in the Host kernel. Unfortunately, such work cannot currently be performed since 1) the software supporting attestation is not yet complete, and 2) SEV-SNP is supported neither by Linux nor by any available AMD CPU.

Extension of the implementation. My PoC implementation supports a single cipher and cipher mode: AES with CBC. However, SE-Vault can benefit from the addition of other cryptographic transformations, such as RSA, and of other cipher modes, such as XTS and CTR. While the CTR cipher mode is easy to parallelize and offers better performance than CBC, the XTS cipher mode is more suitable for random access patterns such as the usage of a disk.

The PoC seL4 implementation can also benefit from supporting more ciphers and cipher modes. Additionally, the current implementation relies on the Tiny AES library [kok], which does not make use of the AES-NI extension and has poor performance. Without improving the performance of the cipher implementation in the seL4 PoC, its performance will remain many times lower than that of aesni_intel (the default AES implementation in the Linux kernel); a sketch of an AES-NI-based block encryption routine is shown below.

As discussed in Section 5.5.4, the root service starts busy-looping after establishing communication with the SE-Vault host driver. The seL4 PoC can benefit from reacting to VirtIO notifications by handling injected interrupts instead of busy-looping. This could improve overall system performance, since work in the SE-Vault guest driver would be performed only on demand.

In investigating the portability of SE-Vault to the seL4 microkernel, I only added SEV support to seL4. As discussed in Section 2.2, SEV encrypts only the memory of the VM but not the register state, which allows the attacks of Section 7.2. Thus, my seL4-based SE-Vault implementation can benefit from using SEV-ES instead of SEV. However, adding SEV-ES support to a kernel is a complex task, since it requires non-trivial changes to the early boot code and the kernel initialization stage, as well as the handling of VC exceptions.
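As a sketch of the suggested improvement, encrypting a single AES-128 block with the AES-NI intrinsics could look as follows. The key schedule is assumed to be already expanded into eleven 128-bit round keys (key expansion is omitted here); compile with -maes.

#include <wmmintrin.h> /* AES-NI intrinsics */

/* Encrypt one 16-byte block with AES-128, given the expanded round keys.
 * rk[0] is the whitening key, rk[1..9] the full rounds, rk[10] the last. */
static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);        /* initial AddRoundKey */
    for (int round = 1; round < 10; round++)
        block = _mm_aesenc_si128(block, rk[round]);
    return _mm_aesenclast_si128(block, rk[10]); /* final round, no MixColumns */
}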


Performance evaluation. The Linux-based PoC implementation achieves 50% of the request throughput of the default AES engine in the Host Linux kernel, and also introduces significant latency overhead. While the performance evaluation of Section 6.3 provides intuition on the achieved average throughput and latency of the implementation through microbenchmarking, the evaluation can be extended to give a more complete picture of system performance and of the overhead to real-world applications. Additionally, the SE-Vault implementation is only compared against the default KCAPI AES engine, and not against similar solutions like the ones discussed in Chapter 3.

The first step towards improving the evaluation is to compare the SE-Vault implementation against similar solutions like Tresor-SGX [RGM16], Tresor [MFD11] and Amnesia [Sim11]. This comparison can be useful in determining optimization opportunities for SE-Vault and in providing a clear comparison of alternative solutions for protecting cryptographic secrets.

The performance evaluation can also be extended by using readily available benchmarks for disk encryption and for usage of the KCAPI and OpenSSL. Such benchmarks typically include hybrid workloads which attempt to mimic the real-world usage of a particular component. These workloads may affect performance in unforeseen ways under SE-Vault and may help discover overlooked performance optimizations.

The performance evaluation of Section 6.3 reports the final figures, average throughput and latency, but does not provide a breakdown of the time spent in each component along the path of sending a cryptographic request and reading the result back. Measuring the time spent in each component can be beneficial for finding performance bugs and improving the implementation (a minimal instrumentation sketch is shown at the end of this chapter). While performing such an evaluation is not difficult, it was not included in this thesis due to time constraints.

Another interesting extension of the evaluation is to measure the overhead of using SEV-ES instead of SEV. With SEV-ES, MMIO operations trigger a VC exception in the VM, which adds additional latency to IO communication. Since there is no publicly available work on the overhead introduced by SEV-ES, such an examination can be interesting to cloud providers who offer SEV-ES-protected VMs to customers.

Lastly, the OpenSSL performance measurements look suspicious, although no bug was found in the correctness tests. The performance discrepancy between the SE-Vault OpenSSL engine and the default OpenSSL engine should be investigated further. This could help fix an existing bug in SE-Vault or uncover performance issues in the default OpenSSL engine.

Correctness evaluation and fixing bugs. The correctness evaluation of Section 6.2 implements 20 tests for verifying the correctness of the SE-Vault implementation. While these tests were very helpful in detecting bugs, they are incomplete. If SE-Vault is considered for production use, the implementation needs to be thoroughly tested for handling erroneous inputs, race conditions, and similar issues. The evaluation could benefit from coverage testing tools such as gcov [GNU], which is already supported in the Linux kernel.
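For the per-component breakdown suggested above, a minimal instrumentation sketch using the kernel's ktime API could look as follows. The function sevault_submit_request() and the probe placement are hypothetical; the struct name is the one used by the SE-Vault host driver.

#include <linux/ktime.h>
#include <linux/printk.h>

/* Hypothetical instrumentation of one hop in the request path. Similar
 * probes around the VirtIO notification and the result read-back would
 * yield the full per-component breakdown. */
static void sevault_submit_timed(struct sevault_transform_req *req)
{
    ktime_t t0 = ktime_get();

    sevault_submit_request(req); /* hypothetical: enqueue on an IO stream */

    pr_info("sevault: submit path took %lld ns\n",
            ktime_to_ns(ktime_sub(ktime_get(), t0)));
}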

9 Conclusion

The Linux kernel stores various security-critical secrets, among which are cryptographic keys. Protecting cryptographic secrets is crucial for ensuring the confidentiality and integrity of storage mediums and communication channels. While many solutions exist for protecting cryptographic secrets in the Linux kernel, some restrict system usage [MFD11; Sim11; Gua+14b], while others rely on features bound to a single CPU vendor [Gua+15] or have poor performance [RGM16].

My thesis presents SE-Vault: a TEE-based solution for protecting cryptographic secrets. SE-Vault stores cryptographic secrets and processes cryptographic transformations inside a VM whose memory is encrypted with the AMD SEV hardware feature. This thesis discusses two implementations of SE-Vault: one based on a Linux VM with built-in SEV support, and one based on the seL4 microkernel, to which I added SEV support.

By means of encryption, SE-Vault addresses memory disclosure attacks, which may exploit a bug in the Linux kernel to gain read access to all of physical memory. Because the memory of SE-Vault is encrypted with SEV, any reads from the Host only return ciphertext. The Kernel Crypto API (KCAPI) and OpenSSL-reliant applications can use SE-Vault transparently with few code modifications, and benefit from the added protection of keys without significant performance degradation. The security evaluation of Section 6.1 shows that the provided prototype can successfully protect against an attacker who has read access to all of physical memory.

The added protection comes at a performance cost, as shown in the performance evaluation of Section 6.3. While the average latency of serving a cryptographic request is significantly increased, the average request throughput decreases by only 50%. The implementation can benefit from performance optimizations, which would help reduce latency and increase throughput.

Finally, let us examine the resolution of the three main research goals established in Section 1.1:

Research goal 1). Research and design a performant TEE-based solution with AMD SEV for protecting cryptographic secrets.
Research goal 2). Provide a prototype implementation of this solution, and verify its portability across operating systems.
Research goal 3). Evaluate the security and performance of the implementation.


The first research goal is fulfilled in Chapter 4, which discusses a design that can serve various system components efficiently. The SE-Vault host driver resides in the Host kernel, giving the Host kernel and user space applications quick access to SE-Vault. This design choice also reduces the number of context switches and data copies when data or notifications are communicated over the VirtIO interface. This significantly improves performance and renders SE-Vault practical for production systems. Notably, the attacker model of SE-Vault is handled transparently by SEV without introducing any SEV-specific design decisions. Thus, other protection mechanisms can be employed, such as the IBM Protected Execution Facility [Kera] or the Intel Trusted Domain Extensions [Cor20a].

The second research goal is accomplished in Chapter 5, which provides details on the implementation of the Linux-based and seL4-based drivers, as well as on the necessary security hardenings. The SE-Vault host driver relies on the Vhost interface in the Linux kernel to receive the data plane information from the QEMU device implementation. Communication between the Host and the VM is established using the widely-used and efficient VirtIO interface. To show the portability of the design, I ported the SE-Vault guest driver to a VM which uses the seL4 microkernel. As part of this goal, I added SEV and VirtIO support to the seL4 kernel and user space libraries.

The third research goal is satisfied in Chapter 6, which examines the security, correctness and performance of the Linux-based SE-Vault implementation. The security evaluation consists of three tests and is performed in correspondence with the attacker model of Section 4.2: the attacker has full read access to all of physical memory. Each test verifies a different system component, and the results show that cryptographic keys used by dm-crypt, the KCAPI and user space applications can be protected with SE-Vault. The correctness evaluation verifies that the SE-Vault implementation correctly handles the registration of keys and the processing of transformation requests. The performance evaluation details the achieved average throughput and average latency of a request with SE-Vault. The results show that the average request throughput is reduced by 50%, which is due to the increase in memory copies and context switches. The average latency of a request is significantly increased over that of the aesni_intel AES cipher implementation. While an increase in latency can have a negative impact on some applications, disk encryption and encrypted network communication rely more on high throughput than on low latency. Thus, SE-Vault remains widely applicable despite the degradation in performance.

Although the presented results are interesting for the ongoing research on protecting cryptographic secrets, I believe the real benefit of this work is more general. The outcomes of this thesis show that recent technologies for confidential computing can be re-purposed to protect security-critical components of the Linux kernel without significant performance degradation. Modern operating system design can build upon this work by isolating and protecting other components in encrypted virtual machines, which can lead to more secure operating systems in the future.

List of Figures

2.1 Example code for entering and exiting VM execution. The Host's registers are first saved, and then the guest registers are loaded. The VM starts execution when the HV executes vmrun. On a VM-exit, the HV saves the VM's registers and restores the Host's registers. ...... 8
2.2 Sequence of steps for translating a Guest Virtual Address (GVA) to a Host Physical Address (HPA). On a Guest access, the GVA is first translated to a GPA using the Guest Page Table (gPT). The GPA is then translated to a HPA using the Nested Page Table. ...... 9
2.3 Depiction of the two memory paths for an SEV-protected VM. The C-bit in the page table determines if the data in DRAM is interpreted as encrypted or not. If the VM performs a page table walk or an instruction fetch, then memory in DRAM is always interpreted as encrypted. ...... 12
2.4 Physical Address Space of an SEV-protected VM. While the VM Firmware (OVMF), kernel, and user space applications are encrypted, the DMA region is unencrypted and used for communication with external devices. ...... 13
2.5 Sequence for intercepting an instruction under SEV-ES. On a Non-Automatic Exit, the VM first passes control to its VC handler. Only afterwards, the VC handler gives the control to the HV. After resuming the VM, the VC handler validates the information from the HV before returning to the original instruction. ...... 16
2.6 Certificate chains used in platform authentication. ...... 17
2.7 Layers of abstraction in virtio-net, a network virtualization solution based on VirtIO. The Host and VM both rely on a shim layer to abstract common VirtIO functionality. The corresponding virtio-net drivers in the Host and in the VM communicate network packets and notifications on two separate Virtqueues. ...... 24
2.8 Steps in VirtIO drivers for sending data to the Guest. The VM first makes buffers available by updating the descriptor table and the available ring, and then notifies the Host driver. The Host copies the data into an available buffer, updates the used ring and notifies the VM by injecting an interrupt. ...... 25


2.9 Sequence of steps for sending a packet from the Host network driver to the Guest network driver. Because the data plane layer is located in QEMU, two context switches often need to occur. ...... 26
2.10 Steps for sending a network packet from the Host to the VM when vhost-net is used. In this design, no context switches are necessary between kernel and user space. ...... 28
2.11 Sequence of operations when an IOVA translation miss occurs. The vhost layer needs to notify QEMU, after which QEMU queries each IOVA-miss and sends the GPA to the vhost layer. In this design, IOVA-misses are very expensive. ...... 28
2.12 Software stack for encrypting the memory of a hardware device using cryptsetup. ...... 32

4.1 Design overview of SE-Vault. Initially, the SE-Vault QEMU device and the SE-Vault guest driver establish communication using the VirtIO setup specification. The device communicates the information to the SE-Vault host driver using the vhost interface. User processes and the kernel can send cryptographic secrets and requests via two different interfaces to the SE-Vault host driver. The host driver sends the information to the Guest driver, which processes transformation requests using its crypto engine, and then returns back the results. ...... 40
4.2 Multi-stream design approach in SE-Vault. Keys and requests are sent to the SE-Vault guest driver using a data stream - a pair of an Input Virtqueue and an Output Virtqueue. Cryptographic keys are transferred using the Key stream. Transformation requests are communicated using one of many IO streams, each suitable for a different request size. For example, large requests can be better served by the IO stream with buffers of size 8192 bytes. A crypto worker in the Guest handles the request and returns the result. ...... 43
4.3 Overview of a crypto worker. A transformation request is sent using the Input Virtqueue to the crypto worker in the VM, which uses a crypto engine to process the request. Results are transferred back to the crypto worker, which propagates them to the Host using the Output Virtqueue. Requests for one IO stream are received, processed and returned in the same order. ...... 45
4.4 Example integration of SE-Vault into OpenSSL and dm-crypt. Integration of SE-Vault into OpenSSL happens via the SE-Vault OpenSSL engine. Because dm-crypt uses the KCAPI, keys and requests are forwarded to the cipher implementation of the SE-Vault Crypto Shim. ...... 46


5.1 The SE-Vault QEMU device provides the data plane configuration to the SE-Vault host driver. The Host driver relies on the VirtIO implementation in the vhost LKM to communicate with the SE-Vault guest driver. ...... 51
5.2 Structures used in sending a key to the SE-Vault guest driver. The structure sevault_key_info is used for sending a key, and sevault_key_recv_ack is used for receiving a key-registered acknowledgment from the SE-Vault guest driver. The structure vhost_sevault_key_info_item is used for buffering a key request. ...... 53
5.3 Sequence of steps for sending a key from the KCAPI or sevault-user interfaces to the SE-Vault guest driver. ...... 54
5.4 The structure sevault_transform_req holds information about a transformation request which is sent to the SE-Vault guest driver, and the structure sevault_transform_result is used to hold the result of the transformation. The structure sevault_result_item is used for keeping track of the result of a request. ...... 56
5.5 Sequence of steps for sending a transformation request from the KCAPI cipher implementation to the SE-Vault guest driver and receiving back the result. ...... 57
5.6 Structures for a user transformation request and a result read operation. ...... 59
5.7 Example high-level handling of transformation requests with two IO streams, identified by 0 and 1. Requests are buffered into separate work lists, which get traversed by an identical number of crypto workers implemented as kthreads. ...... 61
5.8 The function decides whether the unencrypted DMA region should be used. Lines 4 and 5 are added by me, in order to force usage under SEV without enabling the expensive iommu_platform=on setting in QEMU. ...... 63
5.9 Data path for sending a key from the Host VirtIO driver to the Guest VirtIO Driver. Under SEV, the key is stored in a temporary buffer in the DMA region, which is never cleared after the temporary buffer is freed by the function swiotlb_tbl_unmap_single(). ...... 64
5.10 GRUB entry for the seL4 kernel and module. When the VM is started, GRUB would be able to load and boot them. ...... 66
5.11 Assembly code for determining the location of the C-bit. ...... 68
5.13 Pseudo-code implementation of the busy-loop responsible for handling all requests. Lines 2-6 process each key request, send back the acknowledgment and notify the acknowledgment Virtqueue. Lines 7-12 process each transformation request, send back the result and notify the corresponding output Virtqueue. Lines 13-18 reallocate descriptors for all Virtqueues if their resources are used up. ...... 73


6.1 Security tests for SE-Vault. Test 1 verifies the behavior of SE-Vault only, Test 2 additionally verifies the behavior of the KCAPI, and Test 3 also verifies the behavior for disk encryption. ...... 76
6.2 SE-Vault OpenSSL Engine throughput test using the openssl speed benchmark. Higher is better. ...... 80
6.3 Throughput comparison among multiple SE-Vault configurations and the default AES engine (aesni_intel) for file encryption and decryption with kcapi-enc. The cipher is AES with a 256-bit-long key and CBC as a cipher mode. The x-axis shows the size of the file in MiB, and the y-axis shows the throughput in MiB/s. Higher is better. ...... 82
6.4 Kernel Crypto API performance when used from a Linux Kernel Module. The optimized SE-Vault implementation achieves 40% of the throughput of the default KCAPI AES implementation. ...... 83
6.5 Disk decryption performance test for a device size of 1240 MiB. The optimized SE-Vault implementation with SEV reaches 51% of the performance of the default KCAPI AES implementation. ...... 84
6.6 Average latency of an encryption request using the ioctl interface for all supported request sizes. Lower is better. ...... 85
6.7 Average latency of an encryption request using the kcapi interface from a custom LKM for all supported request sizes. Lower is better. ...... 86
6.8 Average percentage of computation time spent on AES encryption. The x-axis shows the block size and the y-axis shows the percentage delta of ecb(aes) and ecb(cipher_null) when the optimized SE-Vault implementation is used with SEV. ...... 87

7.1 A malicious HV modifies a PTE in the Nested Page Table to change the control flow of the VM. When the VM attempts to fetch code from Page B, the VM would instead fetch code from Page A, which happens to contain malicious code in this example. ...... 91
7.2 A chain of two-byte-long gadgets at the beginning and end of cipher blocks. The cpuid instruction at the end of the first block sets registers, and the first two bytes of the next block jump to the end of the next block. By carefully selecting the gadgets, an attacker can execute arbitrary code inside the VM. ...... 94

List of Tables

2.1 Summary of all keys in the attestation and secret provisioning process. 19

3.1 Summary comparison of solutions for protecting cryptographic keys. . 36

5.1 Block size and descriptor table size for all IO Streams. While BLOCK_TYPE_0 is suitable for small- and medium-sized payloads, BLOCK_TYPE_4 is suitable for large-sized payloads. ...... 50
5.2 ioctl commands exposed by /dev/sevault-user. ...... 58

6.1 Summary of results for the security evaluation. The target column specifies the software components which are tested. The requires column specifies the software hardenings that must be applied. The outcome column specifies the result of the test. ...... 78

7.1 Summary of all discussed attacks. For each attack, the table shows the target of the attack (code execution (CE), read memory), the necessary requirement for applying the attack (e.g. host root privileges), and whether it is applicable to SEV, SEV-ES, SEV-SNP and to SE-Vault. If the attack is not applicable, N/A is set...... 96

Glossary

SE-Vault guest driver The SE-Vault driver running in the VM. 3, 40, 42–44, 49–51, 53–58, 60, 61, 63, 64, 66, 69–72, 75, 76, 98, 101, 104, 105

SE-Vault host driver The SE-Vault driver which runs in the Host and propagates information to the VM. 3, 40, 42–44, 49–52, 54–64, 71, 72, 84, 98, 101, 104, 105

ABI Application Binary Interface. 6, 19, 41, 44, 47, 48, 58, 76

AE Automatic Exit. 15

AES Advanced Encryption Standard. 11–13, 30, 39, 75, 77, 81, 85, 87, 92, 98

AES-NI AES instruction set extension for Intel and AMD CPUs. 30, 41, 81

AMD Advanced Micro Devices Inc. 6, 7, 11–14, 17–19, 22, 67, 76

AMD Key Distribution Server The server that holds the ASK private key and can sign the CEK public key. 17, 18

AMD-V AMD Virtualization. 6

API Application Programming Interface. 17–20, 23, 41, 49, 52

ARK AMD Root Signing Key. 18

ASID Address Space ID. 12, 13

ASK AMD SEV Signing Key. 18

CEK Chip Endorsement Key. 18

CPA Chosen Plaintext Attack. 92

CPU Central Processing Unit. 5–7, 9–12, 14, 15, 17, 18, 21–23, 33, 34, 36, 39, 65, 72, 75, 89, 92, 95, 96, 112

CR3 x86 control register holding the physical address of the page table. 7, 16


DMA Direct Memory Access. 13, 20, 62–64, 71, 77, 81, 86, 103

DRTM Dynamic Root of Trust Measurement. 21

ECDH Elliptic-Curve Diffie-Hellman. 18

FDE Full Disk Encryption. 30

GDT Global Descriptor Table. 6

GHCB Guest-Hypervisor Communication Block. 15, 16, 49

GPA Guest Physical Address. 9, 12, 16, 28, 29, 44, 62, 63, 81, 83, 86, 92, 103, 104

GPU Graphics Processing Unit. 10

Guest Software managed by the Hypervisor, and synonym for Virtual Machine (VM) in this thesis. 6–10, 13, 14, 23–26, 41, 42, 44, 47–55, 57, 64, 89, 91–95, 103

Guest Page Table A page table used when nested page tables are employed. The first translation happens via the GPT, and the second translation via the NPT. The GPT is managed by the VM. 9, 10, 12, 37, 90

GVA Guest Virtual Address. 7, 9, 103

Host Physical system on which the software is run. 6–10, 15, 17, 22–25, 39, 41, 42, 45–48, 50–55, 57, 59, 64, 80, 103

HPA Host Physical Address. 7, 9, 14, 90, 103

HTM Hardware Transactional Memory. 34, 35

HV Hypervisor. 5, 7–17, 19, 20, 22, 23, 27, 33, 36–39, 41, 49–51, 67, 69, 89–91, 93–96, 103, 106, 110, 112

IDT Interrupt Descriptor Table. 6

Intel VT Intel Virtualization Technology. 6

IOVA IO Virtual Address. 28, 29, 62, 63, 81, 83, 86, 104

ISA Instruction Set Architecture. 5, 7, 41

KASLR Kernel Address Space Layout Randomization. 1, 88, 93, 94


KCAPI Kernel Crypto API. iii, 3, 4, 29, 30, 32, 33, 35, 46, 47, 51, 52, 54–58, 60–62, 65, 75–87, 98–101, 104–106

KEK Key Encryption Key. 18, 19

Key Derivation Function A function to generate a new secret from an old secret. 18, 19

KIK Key Integrity Key. 18, 19

KVM Kernel-based Virtual Machine. 5–7, 11, 19, 23, 25–28, 37, 49, 51, 57, 68, 71

LKM Linux Kernel Module. 29, 35, 50–53, 60, 78–80, 82, 84, 86, 97, 105, 106

LKVM Native Linux KVM Tool. 41, 50

LUKS Linux Unified Key Setup. 77

MMIO Memory-Mapped IO. 22, 23, 25, 57, 71, 90, 99

MMU Memory Management Unit. 6, 7

MSR Model Specific Register. 6, 10, 16

NAE Non-Automatic Exit. 15, 16

Nested Page Table A page table used for address translation in a virtualized environment. This page table is managed by the hypervisor. 9, 10, 90, 91, 96, 106

Nested Paging A feature of AMD-V which allows the HV to manage the VM's access to physical memory via a Nested Page Table, used in second-level address translation. 9, 10

OCA Owner Certificate Authority Key. 18

OS Operating System. 2, 5, 11, 20, 23, 36, 39, 49, 50, 67

OVMF Open Virtual Machine Firmware. 19, 49, 66, 67

Page Table Structure used for translation of a virtual address to a physical address by the Page Table Walker. 7


Page Table Walker Hardware or software component which traverses the page table structure to translate a virtual address to a physical address. 7, 9

PCI Peripheral Component Interconnect. 6, 23, 41, 50, 51, 70, 71

PDH Platform Diffie-Hellman Key. 18

PEK Platform Endorsement Key. 18

PF Page Fault. 10, 25, 71, 93, 94

PoC Proof-of-Concept. 2, 3, 51, 52, 54–56, 58, 60, 72, 98

PSP Platform Security Processor. 12–14, 17–20, 22, 67, 76

PTE Page Table Entry. 90, 91, 106

Ryzen CPU line by AMD. 92, 95

SE-Vault A software and hardware defense against memory disclosure attacks. This solution is developed in this thesis. iii, 2–4, 33, 35–40, 42–64, 66, 67, 69–91, 95–101, 104–108

Second Level Address Translation A mechanism for managing the physical address space of a VM by means of performing an additional translation from a guest physical address to host physical address. Synonym for nested paging. 89, 90

SEE Secure Execution Environment. 21

SEV Secure Encrypted Virtualization. iii, 2–4, 11–14, 16–20, 22, 33, 35–39, 41, 42, 47, 49, 52, 53, 60, 62–64, 66–68, 75, 77, 78, 80–91, 93–101, 103, 105–107, 111

SEV-ES SEV Encrypted State. 14–17, 19, 20, 22, 36, 38, 39, 42, 49, 75, 88, 89, 91, 94–99, 103, 107

SEV-SNP SEV Secure Nested Paging. 22, 36, 49, 88–90, 95–98, 107

Shim Thin software abstraction layer. 23, 29

SMAP Supervisor Mode Access Prevention. 1

SME Secure Memory Encryption. 11

SMEP Supervisor Mode Execution Prevention. 1

SVM Secure Virtual Machine. 7, 9, 11, 67


TEE Trusted Execution Environment. iii, 1, 2, 4, 11, 17, 20–22, 33, 35, 38, 39, 49, 97, 100

TEK Transport Encryption Key. 19

TIK Transport Integrity Key. 19, 20

TLB Translation Lookaside Buffer. 7, 10, 27, 69

TSC Time Stamp Counter. 6, 10

VC VMM Communication Exception. 15, 16, 94, 98, 99

vCPU Virtual CPU. 6, 7, 45

vhost An interface to communicate the data plane configuration from a user space virtual device to a kernel module. The kernel module can handle communication more efficiently. 40, 41, 50, 52, 104

VirtIO Interface for data communication between a Guest and a virtual device on the Host. 2–4, 23–28, 35, 38–42, 44, 47, 49–52, 57, 60, 62–64, 66, 69–72, 98, 101, 103–105, 112

Virtqueue A structure for data communication in VirtIO. 24, 26, 27, 29, 42–45, 50, 52, 53, 55, 57, 69–73, 75, 83, 103, 105

VM Virtual Machine. iii, 1, 3, 5–8, 10–22, 24–26, 28, 33, 35–42, 44, 45, 47, 49, 50, 52, 60, 62–68, 74–77, 80, 88–101, 103–106

VM-exit The event of handing execution over to the HV. 8, 9, 14, 15, 23, 25, 89, 90

VMCB Virtual Machine Control Block. 7–10, 14, 89

VMM Virtual Machine Manager. 38

VMSA VM Save Area. 14, 15, 20, 90

XE Xor-Encrypt. 92

Bibliography

[AD10] D. Aumaitre and C. Devine. "Subverting windows 7 x64 kernel with dma attacks." In: HITBSecConf Amsterdam (2010).
[Adv20a] Advanced Micro Devices. Secure Encrypted Virtualization API Version 0.24. https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf. 2020.
[Adv20b] Advanced Micro Devices. SEV-ES Guest-Hypervisor Communication Block Standardization. https://developer.amd.com/wp-content/resources/56421.pdf. 2020.
[AGM16] M. Almorsy, J. Grundy, and I. Müller. "An analysis of the cloud computing security problem." In: arXiv preprint arXiv:1609.01107 (2016).
[Ams+06] Z. Amsden, D. Arai, D. Hecht, A. Holler, P. Subrahmanyam, et al. "VMI: An interface for paravirtualization." In: Proc. of the Linux Symposium. 2006, pp. 363–378.
[Bel05] F. Bellard. "QEMU, a fast and portable dynamic translator." In: USENIX Annual Technical Conference, FREENIX Track. Vol. 41. 2005, p. 46.
[BR12] E.-O. Blass and W. Robertson. "TRESOR-HUNT: attacking CPU-bound encryption." In: Proceedings of the 28th Annual Computer Security Applications Conference. 2012, pp. 71–78.
[CD16] V. Costan and S. Devadas. "Intel SGX Explained." In: IACR Cryptol. ePrint Arch. 2016.86 (2016), pp. 1–118.
[Chi08] D. Chisnall. The definitive guide to the xen hypervisor. Pearson Education, 2008.
[Chu+19] D. Chu, K. Zhu, Q. Cai, J. Lin, F. Li, L. Guan, and L. Zhang. "Secure Cryptography Infrastructures in the Cloud." In: 2019 IEEE Global Communications Conference (GLOBECOM). IEEE. 2019, pp. 1–7.
[Cor20a] Intel Corporation. "Intel Trusted Domain Extensions." In: White paper (2020).
[Cor20b] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual Combined Volumes. 2020.


[Det] CVE Details. Linux Kernel: CVE Security Vulnerabilities. url: https://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33 (visited on 10/23/2020).
[Dev] Advanced Micro Devices. AMD Key Distribution Server. url: https://kdsintf.amd.com/cek/ (visited on 07/07/2020).
[Du+17] Z.-H. Du, Z. Ying, Z. Ma, Y. Mai, P. Wang, J. Liu, and J. Fang. "Secure encrypted virtualization is unsecure." In: arXiv preprint arXiv:1712.05090 (2017).
[Ebc+01] K. Ebcioglu, E. Altman, M. Gschwind, and S. Sathaye. "Dynamic binary translation and optimization." In: IEEE Transactions on Computers 50.6 (2001), pp. 529–548.
[Enb+] P. Enberg, C. Gorcunov, A. He, S. Levin, and P. Joshi. Native Linux KVM Tool. url: https://github.com/lkvm/lkvm (visited on 07/01/2020).
[Fru11] C. Fruhwirth. LUKS On-Disk Format Specification Version 1.2. 2011.
[GNU] GNU. gcov—a Test Coverage Program. url: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html (visited on 10/23/2020).
[Got11] Y. Goto. "Kernel-based virtual machine technology." In: Fujitsu Scientific and Technical Journal 47.3 (2011), pp. 362–368.
[Gua+14a] L. Guan, F. Li, J. Jing, J. Wang, and Z. Ma. "virtio-ct: A secure cryptographic token service in hypervisors." In: International Conference on Security and Privacy in Communication Networks. Springer. 2014, pp. 285–300.
[Gua+14b] L. Guan, J. Lin, B. Luo, and J. Jing. "Copker: Computing with Private Keys without RAM." In: NDSS. 2014, pp. 23–26.
[Gua+15] L. Guan, J. Lin, B. Luo, J. Jing, and J. Wang. "Protecting private keys against memory disclosure attacks using hardware transactional memory." In: 2015 IEEE Symposium on Security and Privacy. IEEE. 2015, pp. 3–19.
[Gup] P. Gupta. x86/msr: Add the IA32_TSX_CTRL MSR. url: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2955f270a84762343000f103e0640d29c7a96f3 (visited on 08/24/2020).
[Hal+09] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten. "Lest we remember: cold-boot attacks on encryption keys." In: Communications of the ACM 52.5 (2009), pp. 91–98.
[HB17] F. Hetzelt and R. Buhren. "Security analysis of encrypted virtual machines." In: ACM SIGPLAN Notices 52.7 (2017), pp. 129–142.


[Inc] Google Inc. Google Test Github. url: https://github.com/google/googletest (visited on 07/14/2020).
[Inc20a] AMD Inc. "AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More." In: White paper (2020).
[Inc20b] AMD Inc. AMD64 Architecture Programmer's Manual. 2020.
[Kap] D. Kaplan. AMD SEV Update - Linux Security Summit 2018. url: https://events19.linuxfoundation.org/wp-content/uploads/2017/12/AMD-Encrypted-Virtualization-Update-David-Kaplan-AMD.pdf (visited on 10/16/2020).
[Kap16] D. Kaplan. System and method for virtualized process isolation including preventing a kernel from accessing user address space. US Patent 15/270,231. 2016.
[Kap17] D. Kaplan. "Protecting vm register state with sev-es." In: White paper, Feb (2017).
[Kera] Linux Kernel. IBM Protected Execution Facility. url: https://www.kernel.org/doc/html/latest/powerpc/ultravisor.html (visited on 08/25/2020).
[Kerb] Linux Kernel. The Definitive KVM (Kernel-based Virtual Machine) API Documentation. url: https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt (visited on 06/23/2020).
[Kle+09] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, et al. "seL4: Formal verification of an OS kernel." In: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. 2009, pp. 207–220.
[kok] kokke. Tiny AES in C. url: https://github.com/kokke/tiny-AES-c (visited on 08/24/2020).
[KPW16] D. Kaplan, J. Powell, and T. Woller. "AMD memory encryption." In: White paper (2016).
[KZ13] T. Kim and N. Zeldovich. "Practical and effective sandboxing for non-root users." In: 2013 USENIX Annual Technical Conference (USENIX ATC 13). 2013, pp. 139–144.
[LF] LF Projects, LLC. The seL4 Microkernel. url: https://sel4.systems/ (visited on 07/14/2020).
[MFD11] T. Müller, F. C. Freiling, and A. Dewald. "TRESOR Runs Encryption Securely Outside RAM." In: USENIX Security Symposium. Vol. 17. 2011.


[MG+11] P. Mell, T. Grance, et al. "The NIST definition of cloud computing." In: (2011).
[Mit] Mitre. CVE-2020-8835. url: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-8835 (visited on 10/19/2020).
[Mor+18] M. Morbitzer, M. Huber, J. Horsch, and S. Wessel. "Severed: Subverting amd's virtual machine encryption." In: Proceedings of the 11th European Workshop on Systems Security. 2018, pp. 1–6.
[Mue] S. Mueller. Linux Kernel Crypto API User Space Interface Library. url: http://www.chronox.de/libkcapi.html (visited on 10/02/2020).
[MV] S. Mueller and M. Vasut. Linux Kernel Crypto API. url: https://www.kernel.org/doc/html/v5.6/crypto/index.html (visited on 06/23/2020).
[Niu+14] S. Niu, J. Mo, Z. Zhang, and Z. Lv. "Overview of linux vulnerabilities." In: 2nd International Conference on Soft Computing in Information Communication Technology. Atlantis Press. 2014.
[Oku+19] Y. Okuji, B. Ford, E. Boleyn, K. Ishiguro, V. Serbinenko, and D. Kiper. The Multiboot2 Specification version 2.0. https://www.gnu.org/software/grub/manual/multiboot2/multiboot.pdf. 2019.
[PNG19] R. Palutke, A. Neubaum, and J. Götzfried. "SEVGuard: Protecting User Mode Applications Using Secure Encrypted Virtualization." In: International Conference on Security and Privacy in Communication Systems. Springer. 2019, pp. 224–242.
[RGM16] L. Richter, J. Götzfried, and T. Müller. "Isolating operating system components with intel sgx." In: Proceedings of the 1st Workshop on System Software for Trusted Execution. 2016, pp. 1–6.
[Roe20] J. Roedel. [PATCH v5 00/75] x86: SEV-ES Guest Support. https://lkml.org/lkml/2020/7/24/807. 2020.
[Rog04] P. Rogaway. "Efficient instantiations of tweakable blockciphers and refinements to modes OCB and PMAC." In: International Conference on the Theory and Application of Cryptology and Information Security. Springer. 2004, pp. 16–31.
[Rus08] R. Russell. "virtio: towards a de-facto standard for virtual I/O devices." In: ACM SIGOPS Operating Systems Review 42.5 (2008), pp. 95–103.
[Rus81] J. M. Rushby. "Design and verification of secure systems." In: ACM SIGOPS Operating Systems Review 15.5 (1981), pp. 12–21.


[SAB15a] M. Sabt, M. Achemlal, and A. Bouabdallah. "The dual-execution-environment approach: Analysis and comparative evaluation." In: IFIP International Information Security and Privacy Conference. Springer. 2015, pp. 557–570.
[SAB15b] M. Sabt, M. Achemlal, and A. Bouabdallah. "Trusted execution environment: what it is, and what it is not." In: 2015 IEEE Trustcom/BigDataSE/ISPA. Vol. 1. IEEE. 2015, pp. 57–64.
[Sao14] C. Saout. "dm-crypt: a device-mapper crypto target, 2007." In: URL http://www.saout.de/misc/dm-crypt (2014).
[Sca11] K. Scarfone. Guide to security for full virtualization technologies. Vol. 800. 125. DIANE Publishing, 2011.
[Sch+19] M. Schwarz, M. Lipp, D. Moghimi, J. Van Bulck, J. Stecklina, T. Prescher, and D. Gruss. "ZombieLoad: Cross-privilege-boundary data sampling." In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 2019, pp. 753–768.
[Sim11] P. Simmons. "Security through amnesia: a software-based solution to the cold boot attack on disk encryption." In: Proceedings of the 27th Annual Computer Security Applications Conference. 2011, pp. 73–82.
[TH] M. Tsirkin and C. Huck. Virtual I/O Device (VIRTIO) Version 1.1. url: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html (visited on 06/24/2020).
[Thea] The Linux Foundation. The EFI Boot Stub. url: https://www.kernel.org/doc/Documentation/efi-stub.txt (visited on 08/10/2020).
[Theb] The Linux Foundation. THE LINUX/x86 BOOT PROTOCOL. url: https://www.kernel.org/doc/Documentation/x86/boot.txt (visited on 08/10/2020).
[Wer+19] J. Werner, J. Mason, M. Antonakakis, M. Polychronakis, and F. Monrose. "The SEVerESt Of Them All: Inference Attacks Against Secure Virtual Enclaves." In: Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security. 2019, pp. 73–85.
[Wil+20] L. Wilke, J. Wichelmann, M. Morbitzer, and T. Eisenbarth. "SEVurity: No Security Without Integrity–Breaking Integrity-Free Memory Encryption with Minimal Assumptions." In: arXiv preprint arXiv:2004.11071 (2020).
[Win08] J. Winter. "Trusted computing building blocks for embedded linux-based ARM trustzone platforms." In: Proceedings of the 3rd ACM workshop on Scalable trusted computing. 2008, pp. 21–30.


[Wu+18] Y. Wu, Y. Liu, R. Liu, H. Chen, B. Zang, and H. Guan. "Comprehensive VM protection against untrusted hypervisor through retrofitted AMD memory encryption." In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE. 2018, pp. 441–453.
[Yam+00] K. Yamada, G. N. Hammond, J. Hays, J. K. Ross, S. Burger, and W. R. Bryg. Page table walker that uses at least one of a default page size and a page size selected for a virtual address space to position a sliding field in a virtual address. US Patent 6,088,780. 2000.
[Yit+17] S. F. Yitbarek, M. T. Aga, R. Das, and T. Austin. "Cold boot attacks are still hot: Security analysis of memory scramblers in modern processors." In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE. 2017, pp. 313–324.
