
SGX Explained

Victor Costan and Srinivas Devadas
[email protected], [email protected]
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

ABSTRACT

Intel's Software Guard Extensions (SGX) is a set of extensions to the Intel architecture that aims to provide integrity and confidentiality guarantees to security-sensitive computation performed on a computer where all the privileged software (kernel, hypervisor, etc.) is potentially malicious.

This paper analyzes Intel SGX, based on the 3 papers [14, 79, 139] that introduced it, on the Intel Software Developer's Manual [101] (which supersedes the SGX manuals [95, 99]), on an ISCA 2015 tutorial [103], and on two patents [110, 138]. We use the papers, reference manuals, and tutorial as primary data sources, and only draw on the patents to fill in missing information.

This paper does not reflect the information available in two papers [74, 109] that were published after the first version of this paper.

This paper's contributions are a summary of the Intel-specific architectural and micro-architectural details needed to understand SGX, a detailed and structured presentation of the publicly available information on SGX, a series of intelligent guesses about some important but undocumented aspects of SGX, and an analysis of SGX's security properties.

1 OVERVIEW

Secure remote computation (Figure 1) is the problem of executing software on a remote computer owned and maintained by an untrusted party, with some integrity and confidentiality guarantees. In the general setting, secure remote computation is an unsolved problem. Fully Homomorphic Encryption [61] solves the problem for a limited family of computations, but has an impractical performance overhead [140].

Figure 1: Secure remote computation. A user relies on a remote computer, owned by an untrusted party, to perform some computation on her data. The user has some assurance of the computation's integrity and confidentiality.

Intel's Software Guard Extensions (SGX) is the latest iteration in a long line of trusted computing (Figure 2) designs, which aim to solve the secure remote computation problem by leveraging trusted hardware in the remote computer. The trusted hardware establishes a secure container, and the remote computation service user uploads the desired computation and data into the secure container. The trusted hardware protects the data's confidentiality and integrity while the computation is being performed on it.

SGX relies on software attestation, like its predecessors, the TPM [71] and TXT [70]. Attestation (Figure 3) proves to a user that she is communicating with a specific piece of software running in a secure container hosted by the trusted hardware. The proof is a cryptographic signature that certifies the hash of the secure container's contents. It follows that the remote computer's owner can load any software in a secure container, but the remote computation service user will refuse to load her data into a secure container whose contents' hash does not match the expected value.

The remote computation service user verifies the attestation key used to produce the signature against an endorsement certificate created by the trusted hardware's manufacturer. The certificate states that the attestation key is only known to the trusted hardware, and only used for the purpose of attestation.

Figure 2: Trusted computing. The user trusts the manufacturer of a piece of hardware in the remote computer, and entrusts her data to a secure container hosted by the secure hardware.

Figure 3: Software attestation proves to a remote computer that it is communicating with a specific secure container hosted by a trusted platform. The proof is an attestation signature produced by the platform's secret attestation key. The signature covers the container's initial state, a challenge nonce produced by the remote computer, and a message produced by the container. In the figure, the data owner's computer and the secure container perform a Diffie-Hellman key exchange (g^A and g^B); the container returns g^B together with SignAK(g^A, g^B, M), where M = Hash(Initial State), and the resulting shared key K = g^AB is used to protect the secret code/data sent to the container and the results sent back.
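To make the verifier's side of Figure 3 concrete, the sketch below walks through the checks a data owner would perform before releasing secrets into a container. It is a minimal illustration of the decision logic only: the cryptographic helpers are stub stand-ins (not any real library's API), and the endorsement-certificate check is collapsed into a single hypothetical call.

```c
/* Minimal sketch of the data owner's attestation checks (Figure 3).
 * The helper functions are hypothetical stubs, not a real crypto API. */
#include <stdint.h>
#include <string.h>

#define HASH_SIZE 32

typedef struct { uint8_t bytes[HASH_SIZE]; } hash_t;

/* Stub primitives; a real verifier would call an actual crypto library. */
static int cert_proves_attestation_key(const void *endorsement_cert,
                                       const void *attestation_pub) {
    (void)endorsement_cert; (void)attestation_pub; return 1;
}
static int signature_valid(const void *attestation_pub, const void *msg,
                           size_t len, const void *signature) {
    (void)attestation_pub; (void)msg; (void)len; (void)signature; return 1;
}

/* Returns 1 if it is safe to derive K = g^AB and send EncK(secret data). */
int accept_attestation(const void *endorsement_cert,
                       const void *attestation_pub,
                       const void *signed_blob, size_t signed_len,
                       const void *signature,
                       const hash_t *reported_measurement,
                       const hash_t *expected_measurement) {
    /* 1. The manufacturer vouches that AK lives only in trusted hardware. */
    if (!cert_proves_attestation_key(endorsement_cert, attestation_pub))
        return 0;
    /* 2. The attestation signature must cover g^A, g^B and M.            */
    if (!signature_valid(attestation_pub, signed_blob, signed_len, signature))
        return 0;
    /* 3. M must equal the hash of the container contents the data owner
     *    expects; otherwise she refuses to upload her data.              */
    return memcmp(reported_measurement->bytes,
                  expected_measurement->bytes, HASH_SIZE) == 0;
}
```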

SGX stands out from its predecessors by the amount of code covered by the attestation, which is in the Trusted Computing Base (TCB) for the system using hardware protection. The attestations produced by the original TPM design covered all the software running on a computer, and TXT attestations covered the code inside a VMX [181] virtual machine. In SGX, an enclave (secure container) only contains the private data in a computation, and the code that operates on it.

For example, a cloud service that performs image processing on confidential medical images could be implemented by having users upload encrypted images. The users would send the encryption keys to software running inside an enclave. The enclave would contain the code for decrypting images, the image processing algorithm, and the code for encrypting the results. The code that receives the uploaded encrypted images and stores them would be left outside the enclave.

An SGX-enabled processor protects the integrity and confidentiality of the computation inside an enclave by isolating the enclave's code and data from the outside environment, including the operating system and hypervisor, and hardware devices attached to the system bus. At the same time, the SGX model remains compatible with the traditional software layering in the Intel architecture, where the OS kernel and hypervisor manage the computer's resources.

This work discusses the original version of SGX, also referred to as SGX 1. While SGX 2 brings very useful improvements for enclave authors, it is a small incremental improvement from a design and implementation standpoint. After understanding the principles behind SGX 1 and its security properties, the reader should be well equipped to face Intel's reference documentation and learn about the changes brought by SGX 2.

1.1 SGX Lightning Tour

SGX sets aside a memory region, called the Processor Reserved Memory (PRM, § 5.1). The CPU protects the PRM from all non-enclave memory accesses, including kernel, hypervisor and SMM (§ 2.3) accesses, and DMA accesses (§ 2.9.1) from peripherals.

The PRM holds the Enclave Page Cache (EPC, § 5.1.1), which consists of 4 KB pages that store enclave code and data. The system software, which is untrusted, is in charge of assigning EPC pages to enclaves. The CPU tracks each EPC page's state in the Enclave Page Cache Metadata (EPCM, § 5.1.2), to ensure that each EPC page belongs to exactly one enclave.

The initial code and data in an enclave is loaded by untrusted system software. During the loading stage (§ 5.3), the system software asks the CPU to copy data from unprotected memory (outside PRM) into EPC pages, and assigns the pages to the enclave being set up (§ 5.1.2). It follows that the initial enclave state is known to the system software.

After all the enclave's pages are loaded into EPC, the system software asks the CPU to mark the enclave as initialized (§ 5.3), at which point application software can run the code inside the enclave. After an enclave is initialized, the loading method described above is disabled.

While an enclave is loaded, its contents are cryptographically hashed by the CPU. When the enclave is initialized, the hash is finalized, and becomes the enclave's measurement hash (§ 5.6).

A remote party can undergo a software attestation process (§ 5.8) to convince itself that it is communicating with an enclave that has a specific measurement hash, and is running in a secure environment.

Execution flow can only enter an enclave via special CPU instructions (§ 5.4), which are similar to the mechanism for switching from user mode to kernel mode. Enclave execution always happens in protected mode, at ring 3, and uses the address translation set up by the OS kernel and hypervisor.

To avoid leaking private data, a CPU that is executing enclave code does not directly service an interrupt, fault (e.g., a page fault) or VM exit. Instead, the CPU first performs an Asynchronous Enclave Exit (§ 5.4.3) to switch from enclave code to ring 3 code, and then services the interrupt, fault, or VM exit. The CPU performs an AEX by saving the CPU state into a predefined area inside the enclave and transfers control to a pre-specified instruction outside the enclave, replacing CPU registers with synthetic values.

The allocation of EPC pages to enclaves is delegated to the OS kernel (or hypervisor). The OS communicates its allocation decisions to the SGX implementation via special ring 0 CPU instructions (§ 5.3). The OS can also evict EPC pages into untrusted DRAM and later load them back, using dedicated CPU instructions. SGX uses cryptographic protections to assure the confidentiality, integrity and freshness of the evicted EPC pages while they are stored in untrusted memory.
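The loading flow sketched in the tour above can be compressed into a few lines of code. The sgx_* helpers below are hypothetical wrappers around the ring 0 instruction sequences covered in § 5.3 (conceptually the ECREATE, EADD, EEXTEND and EINIT leaves); they are given trivial stub bodies so the outline compiles, and they are not Intel's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

typedef struct { uint64_t base, size; } enclave_t;

/* Hypothetical stand-ins for the ring 0 SGX instruction sequences. */
static enclave_t g_enclave;
static enclave_t *sgx_create(uint64_t base, uint64_t size) {
    g_enclave.base = base; g_enclave.size = size; return &g_enclave;
}
static void sgx_add_page(enclave_t *e, uint64_t va, const void *src) {
    (void)e; (void)va; (void)src;   /* copy 4 KB into a free EPC page     */
}
static void sgx_extend_measurement(enclave_t *e, uint64_t va) {
    (void)e; (void)va;              /* fold the page into the hash        */
}
static void sgx_initialize(enclave_t *e) {
    (void)e;                        /* finalize hash; disable loading     */
}

/* Untrusted system software loads an enclave's initial state. */
void load_enclave(const uint8_t *image, size_t page_count, uint64_t base) {
    enclave_t *e = sgx_create(base, page_count * PAGE_SIZE);
    for (size_t i = 0; i < page_count; i++) {
        uint64_t va = base + i * PAGE_SIZE;
        sgx_add_page(e, va, image + i * PAGE_SIZE); /* contents known to OS */
        sgx_extend_measurement(e, va);              /* hashed by the CPU    */
    }
    sgx_initialize(e); /* after this, applications can enter the enclave */
}
```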
1.2 Outline and Troubling Findings

Reasoning about the security properties of Intel's SGX requires a significant amount of background information that is currently scattered across many sources. For this reason, a significant portion of this work is dedicated to summarizing this prerequisite knowledge.

Section 2 summarizes the relevant subset of the Intel architecture and the micro-architectural properties of recent Intel processors. Section 3 outlines the security landscape around trusted hardware systems, including cryptographic tools and relevant attack classes. Last, section 4 briefly describes the trusted hardware systems that make up the context in which SGX was created.

After having reviewed the background information, section 5 provides a (sometimes painstakingly) detailed description of SGX's programming model, mostly based on Intel's Software Development Manual.

Section 6 analyzes other public sources of information, such as Intel's SGX-related patents, to fill in some of the missing details in the SGX description. The section culminates in a detailed review of SGX's security properties that draws on information presented in the rest of the paper. This review outlines some troubling gaps in SGX's security guarantees, as well as some areas where no conclusions can be drawn without additional information from Intel.

That being said, perhaps the most troubling finding in our security analysis is that Intel added a launch control feature to SGX that forces each computer's owner to gain approval from a third party (which is currently Intel) for any enclave that the owner wishes to use on the computer. § 5.9 explains that the only publicly documented intended use for this launch control feature is a licensing mechanism that requires software developers to enter a (yet unspecified) business agreement with Intel to be able to author software that takes advantage of SGX's protections. All the official documentation carefully sidesteps this issue, and has a minimal amount of hints that lead to Intel's patents on SGX. Only these patents disclose the existence of licensing plans.

The licensing issue might not bear much relevance right now, because our security analysis reveals that the limitations in SGX's guarantees mean that a security-conscious software developer cannot in good conscience rely on SGX for secure remote computation. At the same time, should SGX ever develop better security properties, the licensing scheme described above becomes a major problem, given Intel's near-monopoly market share of desktop and laptop CPUs. Specifically, the licensing limitations effectively give Intel the power to choose winners and losers in industries that rely on cloud computing.

2 BACKGROUND

This section attempts to summarize the general architectural principles behind Intel's most popular computer processors, as well as the peculiarities needed to reason about the security properties of a system running on these processors. Unless specified otherwise, the information here is summarized from Intel's Software Development Manual (SDM) [101].

Analyzing the security of a software system requires understanding the interactions between all the parts of the software's execution environment, so this section is quite long. We do refrain from introducing any security concepts here, so readers familiar with x86's intricacies can safely skip this section and refer back to it when necessary.

We use the terms Intel processor or Intel CPU to refer to the server and desktop versions of Intel's Core line-up. In the interest of space and mental sanity, we ignore Intel's other processors, such as the embedded line of Atom CPUs, or the failed Itanium line. Consequently, the terms Intel computers and Intel systems refer to computer systems built around Intel's Core processors.

In this paper, the term Intel architecture refers to the x86 architecture described in Intel's SDM. The x86 architecture is overly complex, mostly due to the need to support executing legacy software dating back to 1990 directly on the CPU, without the overhead of software interpretation. We only cover the parts of the architecture visible to modern 64-bit software, also in the interest of space and mental sanity.

The 64-bit version of the x86 architecture, covered in this section, was actually invented by Advanced Micro Devices (AMD), and is also known as AMD64, x86_64, and x64. The term "Intel architecture" highlights our interest in the architecture's implementation in Intel's chips, and our desire to understand the mindsets of Intel SGX's designers.

2.1 Overview

A computer's main resources (§ 2.2) are memory and processors. On Intel computers, Dynamic Random-Access Memory (DRAM) chips (§ 2.9.1) provide the memory, and one or more CPU chips expose logical processors (§ 2.9.4). These resources are managed by system software. An Intel computer typically runs two kinds of system software, namely operating systems and hypervisors.

The Intel architecture was designed to support running multiple application software instances, called processes. An operating system (§ 2.3) allocates the computer's resources to the running processes. Server computers, especially in cloud environments, may run multiple operating system instances at the same time. This is accomplished by having a hypervisor (§ 2.3) partition the computer's resources between the operating system instances running on the computer.

System software uses virtualization techniques to isolate each piece of software that it manages (process or operating system) from the rest of the software running on the computer. This isolation is a key tool for keeping software complexity at manageable levels, as it allows application and OS developers to focus on their software, and ignore the interactions with other software that may run on the computer.

A key component of virtualization is address translation (§ 2.5), which is used to give software the impression that it owns all the memory on the computer. Address translation provides isolation that prevents a piece of buggy or malicious software from directly damaging other software, by modifying its memory contents.

The other key component of virtualization is the software privilege levels (§ 2.3) enforced by the CPU. Hardware privilege separation ensures that a piece of buggy or malicious software cannot damage other software indirectly, by interfering with the system software managing it.

Processes express their computing power requirements by creating execution threads, which are assigned by the operating system to the computer's logical processors. A thread contains an execution context (§ 2.6), which is the information necessary to perform a computation. For example, an execution context stores the address of the next instruction that will be executed by the processor.

Operating systems give each process the illusion that it has an infinite amount of logical processors at its disposal, and multiplex the available logical processors between the threads created by each process. Modern operating systems implement preemptive multithreading, where the logical processors are rotated between all the threads on a system every few milliseconds. Changing the thread assigned to a logical processor is accomplished by an execution context switch (§ 2.6).

Hypervisors expose a fixed number of virtual processors (vCPUs) to each operating system, and also use context switching to multiplex the logical CPUs on a computer between the vCPUs presented to the guest operating systems.

The execution core in a logical processor can execute instructions and consume data at a much faster rate than DRAM can supply them. Many of the complexities in modern computer architectures stem from the need to cover this speed gap. Recent Intel CPUs rely on hyper-threading (§ 2.9.4), out-of-order execution (§ 2.10), and caching (§ 2.11), all of which have security implications.

An Intel processor contains many levels of intermediate memories that are much faster than DRAM, but also orders of magnitude smaller. The fastest intermediate memory is the logical processor's register file (§ 2.2, § 2.4, § 2.6). The other intermediate memories are called caches (§ 2.11). The Intel architecture requires application software to explicitly manage the register file, which serves as a high-speed scratch space. At the same time, caches transparently accelerate DRAM requests, and are mostly invisible to software.

Intel computers have multiple logical processors. As a consequence, they also have multiple caches distributed across the CPU chip. On multi-socket systems, the caches are distributed across multiple CPU chips. Therefore, Intel systems use a cache coherence mechanism (§ 2.11.3), ensuring that all the caches have the same view of DRAM. Thanks to cache coherence, programmers can build software that is unaware of caching, and still runs correctly in the presence of distributed caches. However, cache coherence does not cover the dedicated caches used by address translation (§ 2.11.5), and system software must take special measures to keep these caches consistent.

CPUs communicate with the outside world via I/O devices (also known as peripherals), such as network interface cards and display adapters (§ 2.9). Conceptually, the CPU communicates with the DRAM chips and the I/O devices via a system bus that connects all these components.

Software written for the Intel architecture communicates with I/O devices via the I/O address space (§ 2.4) and via the memory address space, which is primarily used to access DRAM. System software must configure the CPU's caches (§ 2.11.4) to recognize the memory address ranges used by I/O devices. Devices can notify the CPU of the occurrence of events by dispatching interrupts (§ 2.12), which cause a logical processor to stop executing its current thread, and invoke a special handler in the system software (§ 2.8.2).

Intel systems have a highly complex computer initialization sequence (§ 2.13), due to the need to support a large variety of peripherals, as well as a multitude of operating systems targeting different versions of the architecture. The initialization sequence is a challenge to any attempt to secure an Intel computer, and has facilitated many security compromises (§ 2.3).

Intel's engineers use the processor's microcode facility (§ 2.14) to implement the more complicated aspects of the Intel architecture, which greatly helps manage the hardware's complexity. The microcode is completely invisible to software developers, and its design is mostly undocumented. However, in order to evaluate the feasibility of any architectural change proposals, one must be able to distinguish changes that can be implemented in microcode from changes that can only be accomplished by modifying the hardware.

2.2 Computational Model

This section pieces together a highly simplified model for a computer that implements the Intel architecture, illustrated in Figure 4. This simplified model is intended to help the reader's intuition process the fundamental concepts used by the rest of the paper. The following sections gradually refine the simplified model into a detailed description of the Intel architecture.

Figure 4: A computer's core is its processors and memory, which are connected by a system bus. Computers also have I/O devices, such as keyboards, which are also connected to the processor via the system bus.

The building blocks for the model presented here come from [165], which introduces the key abstractions in a computer system, and then focuses on the techniques used to build software systems on top of these abstractions.

The memory is an array of storage cells, addressed using natural numbers starting from 0, and implements the abstraction depicted in Figure 5. Its salient feature is that the result of reading a memory cell at an address must equal the most recent value written to that memory cell.

WRITE(addr, value) → ∅
Store value in the storage cell identified by addr.

READ(addr) → value
Return the value argument to the most recent WRITE call referencing addr.

Figure 5: The memory abstraction

A logical processor repeatedly reads instructions from the computer's memory and executes them, according to the flowchart in Figure 6.

Figure 6: A processor fetches instructions from the memory and executes them. The RIP register holds the address of the instruction to be executed.

The processor has an internal memory, referred to as the register file. The register file consists of Static Random Access Memory (SRAM) cells, generally known as registers, which are significantly faster than DRAM cells, but also a lot more expensive.

An instruction performs a simple computation on its inputs and stores the result in an output location. The processor's registers make up an execution context that provides the inputs and stores the outputs for most instructions. For example, ADD RDX, RAX, RBX performs an integer addition, where the inputs are the registers RAX and RBX, and the result is stored in the output register RDX.

The registers mentioned in Figure 6 are the instruction pointer (RIP), which stores the memory address of the next instruction to be executed by the processor, and the stack pointer (RSP), which stores the memory address of the topmost element in the call stack used by the processor's procedural programming support. The other execution context registers are described in § 2.4 and § 2.6.

Under normal circumstances, the processor repeatedly reads an instruction from the memory address stored in RIP, executes the instruction, and updates RIP to point to the following instruction. Unlike many RISC architectures, the Intel architecture uses a variable-size instruction encoding, so the size of an instruction is not known until the instruction has been read from memory.

While executing an instruction, the processor may encounter a fault, which is a situation where the instruction's preconditions are not met. When a fault occurs, the instruction does not store a result in the output location. Instead, the instruction's result is considered to be the fault that occurred. For example, an integer division instruction DIV where the divisor is zero results in a Division Fault (#DIV).

When an instruction results in a fault, the processor stops its normal execution flow, and performs the fault handler process documented in § 2.8.2. In a nutshell, the processor first looks up the address of the code that will handle the fault, based on the fault's nature, and sets up the execution environment in preparation to execute the fault handler.
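The flowchart in Figure 6 boils down to the loop below. This is a deliberately simplified software model invented for illustration (the types, stub helpers, and single-fault-id encoding are ours), not a description of the actual pipeline; it only shows how RIP advances and how a fault diverts execution to a handler instead of committing a result.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t rip;      /* address of the next instruction */
    uint64_t rsp;      /* top of the call stack            */
    uint64_t regs[16]; /* general-purpose registers        */
} context_t;

typedef struct {
    unsigned size;     /* variable-size encoding: known only after fetch */
    bool     faulted;  /* e.g., DIV with a zero divisor -> #DIV          */
    int      fault_id;
} insn_result_t;

/* Stub helpers standing in for the fetch/decode/execute machinery. */
static insn_result_t fetch_decode_execute(context_t *ctx) {
    (void)ctx;
    insn_result_t r = { .size = 1, .faulted = false, .fault_id = 0 };
    return r;
}
static uint64_t lookup_fault_handler(int fault_id) { (void)fault_id; return 0; }

void logical_processor(context_t *ctx) {
    for (;;) {
        insn_result_t r = fetch_decode_execute(ctx); /* read memory at RIP */
        if (r.faulted) {
            /* The instruction's result is the fault: do not commit outputs;
             * set up the environment for the fault handler (§ 2.8.2).      */
            ctx->rip = lookup_fault_handler(r.fault_id);
        } else {
            ctx->rip += r.size;  /* point RIP at the following instruction */
        }
    }
}
```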
For example, when the processor wishes to read a The registers mentioned in Figure 6 are the instruction memory location, it sends a message with the operation pointer (RIP), which stores the memory address of the code READ-REQUEST and the bus address corresponding next instruction to be executed by the processor, and the to the desired memory location. The memory sees the stack pointer (RSP), which stores the memory address message on the bus and performs the READ operation.

The computer communicates with the outside world via I/O devices, such as keyboards, displays, and network cards, which are connected to the system bus. Devices mostly respond to requests issued by the processor. However, devices also have the ability to issue interrupt requests that notify the processor of outside events, such as the user pressing a key on a keyboard.

Interrupt triggering is discussed in § 2.12. On modern systems, devices send interrupt requests by issuing writes to special bus addresses. Interrupts are considered to be hardware exceptions, just like faults, and are handled in a similar manner.

2.3 Software Privilege Levels

In an Infrastructure-as-a-Service (IaaS) cloud environment, such as Amazon EC2, commodity CPUs run software at four different privilege levels, shown in Figure 8.

Figure 8: The privilege levels in the x86 architecture, and the software that typically runs at each security level.

Each privilege level is strictly more powerful than the ones below it, so a piece of software can freely read and modify the code and data running at less privileged levels. Therefore, a software module can be compromised by any piece of software running at a higher privilege level. It follows that a software module implicitly trusts all the software running at more privileged levels, and a system's security analysis must take into account the software at all privilege levels.

System Management Mode (SMM) is intended for use by the motherboard manufacturers to implement features such as fan control and deep sleep, and/or to emulate missing hardware. Therefore, the bootstrapping software (§ 2.13) in the computer's firmware is responsible for setting up a continuous subset of DRAM as System Management RAM (SMRAM), and for loading all the code that needs to run in SMM mode into SMRAM. The SMRAM enjoys special hardware protections that prevent less privileged software from accessing the SMM code.

IaaS cloud providers allow their customers to run their operating system of choice in a virtualized environment. Hardware virtualization [181], called Virtual Machine Extensions (VMX) by Intel, adds support for a hypervisor, also called a Virtual Machine Monitor (VMM) in the Intel documentation. The hypervisor runs at a higher privilege level (VMX root mode) than the operating system, and is responsible for allocating hardware resources across multiple operating systems that share the same physical machine. The hypervisor uses the CPU's hardware virtualization features to make each operating system believe it is running in its own computer, called a virtual machine (VM). Hypervisor code generally runs at ring 0 in VMX root mode.

Hypervisors that run in VMX root mode and take advantage of hardware virtualization generally have better performance and a smaller codebase than hypervisors based on binary translation [161].

The systems research literature recommends breaking up an operating system into a small kernel, which runs at a high privilege level, known as the kernel mode or supervisor mode and, in the Intel architecture, as ring 0. The kernel allocates the computer's resources to the other system components, such as device drivers and services, which run at lower privilege levels. However, for performance reasons [1], mainstream operating systems have large amounts of code running at ring 0. Their monolithic kernels include device drivers, filesystem code, networking stacks, and video rendering functionality.

[1] Calling a procedure in a different ring is much slower than calling code at the same privilege level.

Application code, such as a Web server or a game client, runs at the lowest privilege level, referred to as user mode (ring 3 in the Intel architecture). In IaaS cloud environments, the virtual machine images provided by customers run in VMX non-root mode, so the kernel runs in VMX non-root ring 0, and the application code runs in VMX non-root ring 3.

2.4 Address Spaces

Software written for the Intel architecture accesses the computer's resources using four distinct physical address spaces, shown in Figure 9. The address spaces overlap partially, in both purpose and contents, which can lead to confusion. This section gives a high-level overview of the physical address spaces defined by the Intel architecture, with an emphasis on their purpose and the methods used to manage them.

Figure 9: The four physical address spaces used by an Intel CPU. The registers and MSRs are internal to the CPU, while the memory and I/O address spaces are used to communicate with DRAM and other devices via system buses.

The register space consists of names that are used to access the CPU's register file, which is the only memory that operates at the CPU's clock frequency and can be used without any latency penalty. The register space is defined by the CPU's architecture, and documented in the SDM.

Some registers, such as the Control Registers (CRs), play specific roles in configuring the CPU's operation. For example, CR3 plays a central role in address translation (§ 2.5). These registers can only be accessed by system software. The rest of the registers make up an application's execution context (§ 2.6), which is essentially a high-speed scratch space. These registers can be accessed at all privilege levels, and their allocation is managed by the software's compiler. Many CPU instructions only operate on data in registers, and only place their results in registers.

The memory space, generally referred to as the address space, or the physical address space, consists of 2^36 (64 GB) to 2^40 (1 TB) addresses. The memory space is primarily used to access DRAM, but it is also used to communicate with memory-mapped devices that read memory requests off a system bus and write replies for the CPU. Some CPU instructions can read their inputs from the memory space, or store the results using the memory space.

A better-known example of memory mapping is that at computer startup, memory addresses 0xFFFF0000 - 0xFFFFFFFF (the 64 KB of memory right below the 4 GB mark) are mapped to a flash memory device that holds the first stage of the code that bootstraps the computer.

The memory space is partitioned between devices and DRAM by the computer's firmware during the bootstrapping process. Sometimes, system software includes motherboard-specific code that modifies the memory space partitioning. The OS kernel relies on address translation, described in § 2.5, to control the applications' access to the memory space. The hypervisor relies on the same mechanism to control the guest OSs.

The input/output (I/O) space consists of 2^16 I/O addresses, usually called ports. The I/O ports are used exclusively to communicate with devices. The CPU provides specific instructions for reading from and writing to the I/O space. I/O ports are allocated to devices by formal or de-facto standards. For example, ports 0xCF8 and 0xCFC are always used to access the PCI express (§ 2.9.1) configuration space.

The CPU implements a mechanism for system software to provide fine-grained I/O access to applications. However, all modern kernels restrict application software from accessing the I/O space directly, in order to limit the damage potential of application bugs.

The Model-Specific Register (MSR) space consists of 2^32 MSRs, which are used to configure the CPU's operation. The MSR space was initially intended for the use of CPU model-specific firmware, but some MSRs have been promoted to architectural MSR status, making their semantics a part of the Intel architecture. For example, architectural MSR 0x10 holds a high-resolution monotonically increasing time-stamp counter.

The CPU provides instructions for reading from and writing to the MSR space. The instructions can only be used by system software. Some MSRs are also exposed by instructions accessible to applications. For example, applications can read the time-stamp counter via the RDTSC and RDTSCP instructions, which are very useful for benchmarking and optimizing software.
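As a concrete example of an MSR-backed feature that is exposed to ring 3, the snippet below reads the time-stamp counter with RDTSC. It uses GCC/Clang inline assembly on x86-64; this is a standard benchmarking idiom, not anything specific to SGX or to this paper.

```c
#include <stdint.h>
#include <stdio.h>

/* RDTSC returns the 64-bit time-stamp counter split across EDX:EAX. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t start = rdtsc();
    /* ... code being benchmarked ... */
    uint64_t end = rdtsc();
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```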

2.5 Address Translation

System software relies on the CPU's address translation mechanism for implementing isolation among less privileged pieces of software (applications or operating systems). Virtually all secure architecture designs bring changes to address translation. We summarize the Intel architecture's address translation features that are most relevant when establishing a system's security properties, and refer the reader to [108] for a more general presentation of address translation concepts and its other uses.

2.5.1 Address Translation Concepts

From a systems perspective, address translation is a layer of indirection (shown in Figure 10) between the virtual addresses, which are used by a program's memory load and store instructions, and the physical addresses, which reference the physical address space (§ 2.4). The mapping between virtual and physical addresses is defined by page tables, which are managed by the system software.

Figure 10: Virtual addresses used by software are translated into physical memory addresses using a mapping defined by the page tables.

Operating systems use address translation to implement the virtual memory abstraction, illustrated by Figure 11. The virtual memory abstraction exposes the same interface as the memory abstraction in § 2.2, but each process uses a separate virtual address space that only references the memory allocated to that process. From an application developer standpoint, virtual memory can be modeled by pretending that each process runs on a separate computer and has its own DRAM.

Figure 11: The virtual memory abstraction gives each process its own virtual address space. The operating system multiplexes the computer's DRAM between the processes, while application developers build software as if it owns the entire computer's memory.

Address translation is used by the operating system to multiplex DRAM among multiple application processes, isolate the processes from each other, and prevent application code from accessing memory-mapped devices directly. The latter two protection measures prevent an application's bugs from impacting other applications or the OS kernel itself. Hypervisors also use address translation, to divide the DRAM among operating systems that run concurrently, and to virtualize memory-mapped devices.

The address translation mode used by 64-bit operating systems, called IA-32e by Intel's documentation, maps 48-bit virtual addresses to physical addresses of at most 52 bits [2]. The translation process, illustrated in Figure 12, is carried out by dedicated hardware in the CPU, which is referred to as the address translation unit or the memory management unit (MMU).

[2] The size of a physical address is CPU-dependent, and is 40 bits for recent desktop CPUs and 44 bits for recent high-end server CPUs.

The bottom 12 bits of a virtual address are not changed by the translation. The top 36 bits are grouped into four 9-bit indexes, which are used to index into the page tables. Despite its name, the page tables data structure closely resembles a full 512-ary search tree where nodes have fixed keys. Each node is represented in DRAM as an array of 512 8-byte entries that contain the physical addresses of the next-level children as well as some flags. The physical address of the root node is stored in the CR3 register. The arrays in the last-level nodes contain the physical addresses that are the result of the address translation.

Figure 12: IA-32e address translation takes in a 48-bit virtual address and outputs a 52-bit physical address.
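The 512-ary tree walk described above can be written down compactly. The sketch below follows the IA-32e structure (four levels, 9-bit indexes, 4 KB pages) but is a software model rather than MMU hardware: phys_to_virt() is a hypothetical helper that lets the model dereference physical addresses (e.g., an identity mapping in a simulator), only the present flag is checked, and large pages are ignored.

```c
#include <stdint.h>

#define ENTRY_PRESENT 0x1ULL
#define ADDR_MASK     0x000FFFFFFFFFF000ULL  /* bits 51:12 of an entry */

/* Hypothetical helper: maps a physical address to a dereferenceable
 * pointer for the purposes of this model.                             */
static inline uint64_t *phys_to_virt(uint64_t phys) {
    return (uint64_t *)(uintptr_t)phys;
}

/* Walks the four-level IA-32e page tables rooted at CR3.
 * Returns 0 on a missing (non-present) entry.                         */
uint64_t translate(uint64_t cr3, uint64_t vaddr) {
    uint64_t table = cr3 & ADDR_MASK;          /* physical address of PML4 */
    for (int level = 3; level >= 0; level--) {
        unsigned index = (vaddr >> (12 + 9 * level)) & 0x1FF; /* 9-bit index */
        uint64_t entry = phys_to_virt(table)[index];
        if (!(entry & ENTRY_PRESENT))
            return 0;                          /* would raise a #PF (§ 2.8.2) */
        table = entry & ADDR_MASK;             /* next node, or the final page */
    }
    return table | (vaddr & 0xFFF);            /* PPN plus the untouched offset */
}
```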

The address translation function, which does not change the bottom bits of addresses, partitions the memory address space into pages. A page is the set of all memory locations that only differ in the bottom bits which are not impacted by address translation, so all the memory addresses in a virtual page translate to corresponding addresses in the same physical page. From this perspective, the address translation function can be seen as a mapping between Virtual Page Numbers (VPN) and Physical Page Numbers (PPN), as shown in Figure 13.

Figure 13: Address translation can be seen as a mapping between virtual page numbers and physical page numbers.

In addition to isolating application processes, operating systems also use the address translation feature to run applications whose collective memory demands exceed the amount of DRAM installed in the computer. The OS evicts infrequently used memory pages from DRAM to a larger (but slower) memory, such as a hard disk drive (HDD) or solid-state drive (SSD). For historical reasons, this slower memory is referred to as the disk.

The OS ability to over-commit DRAM is often called page swapping, for the following reason. When an application process attempts to access a page that has been evicted, the OS "steps in" and reads the missing page back into DRAM. In order to do this, the OS might have to evict a different page from DRAM, effectively swapping the contents of a DRAM page with a disk page. The details behind this high-level description are covered in the following sections.

The CPU's address translation is also referred to as "paging", which is a shorthand for "page swapping".

2.5.2 Address Translation and Virtualization

Computers that take advantage of hardware virtualization use a hypervisor to run multiple operating systems at the same time. This creates some tension, because each operating system was written under the assumption that it owns the entire computer's DRAM. The tension is solved by a second layer of address translation, illustrated in Figure 14.

Figure 14: Address translation in the presence of hardware virtualization: virtual addresses are translated into guest-physical addresses by the kernel's page tables, and guest-physical addresses are translated into physical memory addresses by the hypervisor's extended page tables (EPT).

When a hypervisor is active, the page tables set up by an operating system map between virtual addresses and guest-physical addresses in a guest-physical address space. The hypervisor multiplexes the computer's DRAM between the operating systems' guest-physical address spaces via the second layer of address translations, which uses extended page tables (EPT) to map guest-physical addresses to physical addresses.

The EPT uses the same data structure as the page tables, so the process of translating guest-physical addresses to physical addresses follows the same steps as IA-32e address translation. The main difference is that the physical address of the data structure's root node is stored in the extended page table pointer (EPTP) field in the Virtual Machine Control Structure (VMCS) for the guest OS. Figure 15 illustrates the address translation process in the presence of hardware virtualization.

Figure 15: Address translation when hardware virtualization is enabled. The kernel-managed page tables contain guest-physical addresses, so each level in the kernel's page table requires a full walk of the hypervisor's extended page table (EPT). A translation requires up to 20 memory accesses (the bold boxes), assuming the physical address of the kernel's PML4 is cached.

2.5.3 Page Table Attributes

Each page table entry contains a physical address, as shown in Figure 12, and some Boolean values that are referred to as flags or attributes. The following attributes are used to implement page swapping and software isolation.

The present (P) flag is set to 0 to indicate unused parts of the address space, which do not have physical memory associated with them. The system software also sets the P flag to 0 for pages that are evicted from DRAM. When the address translation unit encounters a zero P flag, it aborts the translation process and issues a hardware exception, as described in § 2.8.2. This hardware exception gives system software an opportunity to step in and bring an evicted page back into DRAM.

The accessed (A) flag is set to 1 by the CPU whenever the address translation machinery reads a page table entry, and the dirty (D) flag is set to 1 by the CPU when an entry is accessed by a memory write operation. The A and D flags give the hypervisor and kernel insight into application memory access patterns and inform the algorithms that select the pages that get evicted from RAM.

The main attributes supporting software isolation are the writable (W) flag, which can be set to 0 to prohibit [3] writes to any memory location inside a page, the disable execution (XD) flag, which can be set to 1 to prevent instruction fetches from a page, and the supervisor (S) flag, which can be set to 1 to prohibit any accesses from application software running at ring 3.

[3] Writes to non-writable pages result in #GP exceptions (§ 2.8.2).
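In the actual page table entry format, these attributes occupy fixed bit positions. The constants below reflect the standard IA-32e encoding (P is bit 0, W is bit 1, the user/supervisor bit is bit 2, A is bit 5, D is bit 6, XD is bit 63); note that the hardware bit is "user-accessible" rather than "supervisor", so the S attribute in the text corresponds to this bit being clear. The access_allowed() helper is our own illustrative policy check, not part of any real MMU interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard IA-32e page table entry bits (see the SDM for the full list). */
#define PTE_P   (1ULL << 0)   /* present: 0 means unused or evicted          */
#define PTE_W   (1ULL << 1)   /* writable: 0 prohibits writes to the page    */
#define PTE_U   (1ULL << 2)   /* user: 0 blocks ring 3 (the "S" attribute)   */
#define PTE_A   (1ULL << 5)   /* accessed: set by the CPU on any access      */
#define PTE_D   (1ULL << 6)   /* dirty: set by the CPU on writes             */
#define PTE_XD  (1ULL << 63)  /* execute-disable: 1 prevents instruction fetch */

/* Example policy check a software model of the MMU might apply. */
bool access_allowed(uint64_t pte, bool is_write, bool is_fetch, bool is_user) {
    if (!(pte & PTE_P))              return false;  /* translation aborts    */
    if (is_write && !(pte & PTE_W))  return false;
    if (is_fetch &&  (pte & PTE_XD)) return false;
    if (is_user  && !(pte & PTE_U))  return false;  /* supervisor-only page  */
    return true;
}
```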

2.6 Execution Contexts

Application software targeting the 64-bit Intel architecture uses a variety of CPU registers to interact with the processor's features, shown in Figure 16 and Table 1. The values in these registers make up an application thread's state, or execution context.

OS kernels multiplex each logical processor (§ 2.9.4) between multiple software threads by context switching, namely saving the values of the registers that make up a thread's execution context, and replacing them with another thread's previously saved context. Context switching also plays a part in executing code inside secure containers, so its design has security implications.

Figure 16: CPU registers in the 64-bit Intel architecture. RSP can be used as a general-purpose register (GPR), e.g., in pointer arithmetic, but it always points to the top of the program's stack. Segment registers are covered in § 2.7.

Integers and memory addresses are stored in 16 general-purpose registers (GPRs). The first 8 GPRs have historical names: RAX, RBX, RCX, RDX, RSI, RDI, RSP, and RBP, because they are extended versions of the 32-bit Intel architecture's GPRs. The other 8 GPRs are simply known as R8-R15. RSP is designated for pointing to the top of the procedure call stack, which is simply referred to as the stack. RSP and the stack that it refers to are automatically read and modified by the CPU instructions that implement procedure calls, such as CALL and RET (return), and by specialized stack handling instructions such as PUSH and POP.

All applications also use the RIP register, which contains the address of the currently executing instruction, and the RFLAGS register, whose bits (e.g., the carry flag - CF) are individually used to store comparison results and control various instructions.

Software might use other registers to interact with specific processor features, some of which are shown in Table 1.

Feature    Registers                           XCR0 bit
FPU        FP0 - FP7, FSW, FTW                 0
SSE        MM0 - MM7, XMM0 - XMM15, MXCSR      1
AVX        YMM0 - YMM15                        2
MPX        BND0 - BND3                         3
MPX        BNDCFGU, BNDSTATUS                  4
AVX-512    K0 - K7                             5
AVX-512    ZMM0_H - ZMM15_H                    6
AVX-512    ZMM16 - ZMM31                       7
PK         PKRU                                9

Table 1: Sample feature-specific Intel architecture registers.

The Intel architecture provides a future-proof method for an OS kernel to save the values of feature-specific registers used by an application. The XSAVE instruction takes in a requested-feature bitmap (RFBM), and writes the registers used by the features whose RFBM bits are set to 1 in a memory area. The memory area written by XSAVE can later be used by the XRSTOR instruction to load the saved values back into feature-specific registers. The memory area includes the RFBM given to XSAVE, so XRSTOR does not require an RFBM input.

Application software declares the features that it plans to use to the kernel, so the kernel knows what XSAVE bitmap to use when context-switching. When receiving the system call, the kernel sets the XCR0 register to the feature bitmap declared by the application. The CPU generates a fault if application software attempts to use features that are not enabled by XCR0, so applications cannot modify feature-specific registers that the kernel wouldn't take into account when context-switching.

The kernel can use the CPUID instruction to learn the size of the XSAVE memory area for a given feature bitmap, and compute how much memory it needs to allocate for the context of each of the application's threads.
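The CPUID query mentioned in the last paragraph can be issued directly. The snippet below asks leaf 0xD (the XSAVE leaf), sub-leaf 0, which reports in EBX the save-area size for the state components currently enabled in XCR0 and in ECX the size needed if every supported component were enabled. It is a user-mode sketch using GCC/Clang inline assembly, and it assumes the CPU supports XSAVE; a kernel would perform the equivalent query during boot.

```c
#include <stdint.h>
#include <stdio.h>

static inline void cpuid_count(uint32_t leaf, uint32_t subleaf,
                               uint32_t *eax, uint32_t *ebx,
                               uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void) {
    uint32_t eax, ebx, ecx, edx;
    /* Leaf 0xD, sub-leaf 0: EBX = XSAVE area size for the features enabled
     * in XCR0; ECX = size if all supported features were enabled.          */
    cpuid_count(0x0D, 0, &eax, &ebx, &ecx, &edx);
    printf("XSAVE area for enabled features: %u bytes\n", ebx);
    printf("XSAVE area for all supported features: %u bytes\n", ecx);
    return 0;
}
```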

2.7 Segment Registers

The Intel 64-bit architecture gained widespread adoption thanks to its ability to run software targeting the older 32-bit architecture side-by-side with 64-bit software [169]. This ability comes at the cost of some warts. While most of these warts can be ignored while reasoning about the security of 64-bit software, the segment registers and vestigial segmentation model must be understood.

The semantics of the Intel architecture's instructions include the implicit use of a few segments which are loaded into the processor's segment registers shown in Figure 16. Code fetches use the code segment (CS). Instructions that reference the stack implicitly use the stack segment (SS). Memory references implicitly use the data segment (DS) or the destination segment (ES). Via segment override prefixes, instructions can be modified to use the unnamed segments FS and GS for memory references.

Modern operating systems effectively disable segmentation by covering the entire addressable space with one segment, which is loaded in CS, and one data segment, which is loaded in SS, DS and ES. The FS and GS registers store segments covering thread-local storage (TLS).

Due to the Intel architecture's 16-bit origins, segment registers are exposed as 16-bit values, called segment selectors. The top 13 bits in a selector are an index in a descriptor table, and the bottom 2 bits are the selector's ring number, which is also called requested privilege level (RPL) in the Intel documentation. Also, modern system software only uses rings 0 and 3 (see § 2.3).

Each segment register has a hidden segment descriptor, which consists of a base address, limit, and type information, such as whether the descriptor should be used for executable code or data. Figure 17 shows the effect of loading a 16-bit selector into a segment register. The selector's index is used to read a descriptor from the descriptor table and copy it into the segment register's hidden descriptor.

Figure 17: Loading a segment register. The 16-bit value loaded by software is a selector consisting of an index and a ring number. The index selects a GDT entry, which is loaded into the descriptor part of the segment register.

In 64-bit mode, all segment limits are ignored. The base addresses in most segment registers (CS, DS, ES, SS) are ignored. The base addresses in FS and GS are used, in order to support thread-local storage. Figure 18 outlines the address computation in this case. The instruction's address, named logical address in the Intel documentation, is added to the base address in the segment register's descriptor, yielding the virtual address, also named linear address. The virtual address is then translated (§ 2.5) to a physical address.

Figure 18: Example address computation process for MOV FS:[RDX], 0. The segment's base address is added to the address in RDX before address translation (§ 2.5) takes place.
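Both the selector layout and the FS-relative address computation amount to simple arithmetic. The helpers below illustrate Figures 17 and 18 in plain C, operating on values that, on real hardware, live in the segment registers and their hidden descriptors; the function names are ours.

```c
#include <stdint.h>

/* A 16-bit segment selector: a descriptor-table index in the top 13 bits,
 * a table-indicator bit, and the requested privilege level (RPL) in the
 * bottom 2 bits. For example, selector 0x1B decodes to index 3, ring 3.   */
static inline unsigned selector_index(uint16_t sel) { return sel >> 3; }
static inline unsigned selector_rpl(uint16_t sel)   { return sel & 0x3; }

/* In 64-bit mode only FS and GS contribute a base address: the linear
 * (virtual) address is the descriptor's base plus the effective address,
 * and is then translated by the page tables (§ 2.5).                      */
static inline uint64_t fs_linear_address(uint64_t fs_base, uint64_t ea) {
    return fs_base + ea;
}
```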

Outside the special case of using FS or GS to reference thread-local storage, the logical and virtual (linear) addresses match. Therefore, most of the time, we can get away with completely ignoring segmentation. In these cases, we use the term "virtual address" to refer to both the virtual and the linear address.

Even though CS is not used for segmentation, 64-bit system software needs to load a valid selector into it. The CPU uses the ring number in the CS selector to track the current privilege level, and uses one of the type bits to know whether it's running 64-bit code, or 32-bit code in compatibility mode.

The DS and ES segment registers are completely ignored, and can have null selectors loaded in them. The CPU loads a null selector in SS when switching privilege levels, discussed in § 2.8.2.

Modern kernels only use one descriptor table, the Global Descriptor Table (GDT), whose virtual address is stored in the GDTR register. Table 2 shows a typical GDT layout that can be used by 64-bit kernels to run both 32-bit and 64-bit applications.

Descriptor              Selector
Null (must be unused)   0
Kernel code             0x08 (index 1, ring 0)
Kernel data             0x10 (index 2, ring 0)
User code               0x1B (index 3, ring 3)
User data               0x1F (index 4, ring 3)
TSS                     0x20 (index 5, ring 0)

Table 2: A typical GDT layout in the 64-bit Intel Architecture.

The last entry in Table 2 is a descriptor for the Task State Segment (TSS), which was designed to implement hardware context switching, named task switching in the Intel documentation. The descriptor is stored in the Task Register (TR), which behaves like the other segment registers described above.

Task switching was removed from the 64-bit architecture, but the TR segment register was preserved, and it points to a repurposed TSS data structure. The 64-bit TSS contains an I/O map, which indicates what parts of the I/O address space can be accessed directly from ring 3, and the Interrupt Stack Table (IST), which is used for privilege level switching (§ 2.8.2).

Modern operating systems do not allow application software any direct access to the I/O address space, so the kernel sets up a single TSS that is loaded into TR during early initialization, and used to represent all applications running under the OS.

2.8 Privilege Level Switching

Any architecture that has software privilege levels must provide a method for less privileged software to invoke the services of more privileged software. For example, application software needs the OS kernel's assistance to perform network or disk I/O, as that requires access to privileged memory or to the I/O address space.

At the same time, less privileged software cannot be offered the ability to jump arbitrarily into more privileged code, as that would compromise the privileged software's ability to enforce security and isolation invariants. In our example, when an application wishes to write a file to the disk, the kernel must check if the application's user has access to that file. If the ring 3 code could perform an arbitrary jump in kernel space, it would be able to skip the access check.

For these reasons, the Intel architecture includes privilege-switching mechanisms used to control transitions from less privileged software to well-defined entry points in more privileged software. As suggested above, an architecture's privilege-switching mechanisms have deep implications for the security properties of its software. Furthermore, securely executing the software inside a protected container requires the same security considerations as privilege level switching.
Due to historical factors, the Intel architecture has a vast number of execution modes, and an intimidating amount of transitions between them. We focus on the privilege level switching mechanisms used by modern 64-bit software, summarized in Figure 19.

Figure 19: Modern privilege switching methods in the 64-bit Intel architecture.

2.8.1 System Calls

On modern processors, application software uses the SYSCALL instruction to invoke ring 0 code, and the kernel uses SYSRET to switch the privilege level back to ring 3. SYSCALL jumps into a predefined kernel location, which is specified by writing to a pair of architectural MSRs (§ 2.4).

All MSRs can only be read or written by ring 0 code. This is a crucial security property, because it entails that application software cannot modify SYSCALL's MSRs. If that was the case, a rogue application could abuse the SYSCALL instruction to execute arbitrary kernel code, potentially bypassing security checks.

The SYSRET instruction switches the current privilege level from ring 0 back to ring 3, and jumps to the address in RCX, which is set by the SYSCALL instruction. The SYSCALL / SYSRET pair does not perform any memory access, so it out-performs the Intel architecture's previous privilege switching mechanisms, which saved state on a stack. The design can get away without referencing a stack because kernel calls are not recursive.
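For concreteness, the sketch below shows the shape of the kernel-side setup for SYSCALL: the entry point and flag mask are written into architectural MSRs during boot. The MSR numbers are the standard IA32_STAR / IA32_LSTAR / IA32_FMASK values; the wrmsr() wrapper, the syscall_entry symbol, and the surrounding function are illustrative assumptions (ring 0 only), not taken from any particular kernel.

```c
#include <stdint.h>

#define IA32_STAR   0xC0000081u  /* ring 0 / ring 3 segment selectors     */
#define IA32_LSTAR  0xC0000082u  /* 64-bit SYSCALL entry point (RIP)      */
#define IA32_FMASK  0xC0000084u  /* RFLAGS bits cleared on SYSCALL entry  */

/* WRMSR takes the MSR number in ECX and the value in EDX:EAX.
 * It only executes at ring 0; at ring 3 it raises #GP.          */
static inline void wrmsr(uint32_t msr, uint64_t value) {
    __asm__ volatile("wrmsr" : : "c"(msr),
                     "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

extern void syscall_entry(void);  /* the kernel's SYSCALL handler (assumed) */

void setup_syscall(uint64_t star_selectors) {
    wrmsr(IA32_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
    wrmsr(IA32_STAR,  star_selectors);  /* CS/SS selectors for both rings   */
    wrmsr(IA32_FMASK, 1u << 9);         /* e.g., clear IF while in the kernel */
}
```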

2.8.2 Faults

The processor also performs a switch from ring 3 to ring 0 when a hardware exception occurs while executing application code. Some exceptions indicate bugs in the application, whereas other exceptions require kernel action.

A general protection fault (#GP) occurs when software attempts to perform a disallowed action, such as setting the CR3 register from ring 3.

A page fault (#PF) occurs when address translation encounters a page table entry whose P flag is 0, or when the memory inside a page is accessed in a way that is inconsistent with the access bits in the page table entry. For example, when ring 3 software accesses the memory inside a page whose S bit is set, the result of the memory access is #PF.

When a hardware exception occurs in application code, the CPU performs a ring switch, and calls the corresponding exception handler. For example, the #GP handler typically terminates the application's process, while the #PF handler reads the swapped out page back into RAM and resumes the application's execution.

The exception handlers are a part of the OS kernel, and their locations are specified in the first 32 entries of the Interrupt Descriptor Table (IDT), whose structure is shown in Table 3. The IDT's physical address is stored in the IDTR register, which can only be accessed by ring 0 code. Kernels protect the IDT memory using page tables, so that ring 3 software cannot access it.

Field                               Bits
Handler RIP                         64
Handler CS                          16
Interrupt Stack Table (IST) index   3

Table 3: The essential fields of an IDT entry in 64-bit mode. Each entry points to a hardware exception or interrupt handler.

Each IDT entry has a 3-bit index pointing into the Interrupt Stack Table (IST), which is an array of 8 stack pointers stored in the TSS described in § 2.7.
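Table 3 lists only the essential fields; in memory, a 64-bit IDT entry is a 16-byte gate descriptor with the handler address split across three fields. The struct below shows the standard layout as it is commonly written in C; the field names are ours, and the type/attribute byte (gate type, DPL, present bit) is left uninterpreted.

```c
#include <stdint.h>

/* 64-bit interrupt gate descriptor: 16 bytes per IDT entry. */
struct idt_entry {
    uint16_t handler_rip_15_0;   /* handler address, bits 15:0             */
    uint16_t handler_cs;         /* code segment selector                  */
    uint8_t  ist;                /* bits 2:0: Interrupt Stack Table index  */
    uint8_t  type_attr;          /* gate type, DPL, present bit            */
    uint16_t handler_rip_31_16;  /* handler address, bits 31:16            */
    uint32_t handler_rip_63_32;  /* handler address, bits 63:32            */
    uint32_t reserved;
} __attribute__((packed));
```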
trusted to not leak or tamper with the information in an The exception handlers are a part of the OS kernel, application’s execution context.
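To make the layout in Table 3 concrete, the sketch below expresses a 64-bit interrupt gate descriptor as a C++ struct. The field packing follows the publicly documented x86-64 gate format; the struct name, the helper function, and the comments are ours, not taken from the SDM.

    #include <cstdint>

    // 64-bit IDT gate descriptor (16 bytes). The handler's RIP is split
    // across three fields; the IST index selects one of the 8 known-good
    // stack pointers stored in the TSS.
    struct IdtEntry64 {
      std::uint16_t handler_rip_low;   // handler RIP bits 15..0
      std::uint16_t handler_cs;        // handler code segment selector
      std::uint8_t  ist_index : 3;     // Interrupt Stack Table index (0 = none)
      std::uint8_t  reserved0 : 5;
      std::uint8_t  type_attributes;   // gate type, DPL, present bit
      std::uint16_t handler_rip_mid;   // handler RIP bits 31..16
      std::uint32_t handler_rip_high;  // handler RIP bits 63..32
      std::uint32_t reserved1;
    };
    static_assert(sizeof(IdtEntry64) == 16, "IDT entries are 16 bytes");

    // Reassembles the 64-bit handler address from the three RIP fields.
    constexpr std::uint64_t handler_rip(const IdtEntry64& e) {
      return (static_cast<std::uint64_t>(e.handler_rip_high) << 32) |
             (static_cast<std::uint64_t>(e.handler_rip_mid) << 16) |
             e.handler_rip_low;
    }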

2.8.3 VMX Privilege Level Switching
Intel systems that take advantage of the hardware virtualization support to run multiple operating systems at the same time use a hypervisor that manages the VMs. The hypervisor creates a Virtual Machine Control Structure (VMCS) for each operating system instance that it wishes to run, and uses the VMENTER instruction to assign a logical processor to the VM.
When a logical processor encounters a fault that must be handled by the hypervisor, the logical processor performs a VM exit. For example, if the address translation process encounters an EPT entry with the P flag set to 0, the CPU performs a VM exit, and the hypervisor has an opportunity to bring the page into RAM.
The VMCS shows a great application of the encapsulation principle [130], which is generally used in high-level software, to computer architecture. The Intel architecture specifies that each VMCS resides in DRAM and is 4 KB in size. However, the architecture does not specify the VMCS format, and instead requires the hypervisor to interact with the VMCS via CPU instructions such as VMREAD and VMWRITE.
This approach allows Intel to add VMX features that require VMCS format changes, without the burden of having to maintain backwards compatibility. This is no small feat, given that huge amounts of complexity in the Intel architecture were introduced due to compatibility requirements.
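As a sketch of this opaque interface, the wrappers below (our illustration, not code from the SGX papers) access a VMCS field through VMREAD / VMWRITE instead of dereferencing the 4 KB region directly. They only succeed in VMX root operation, at ring 0, with a current VMCS loaded, and a real hypervisor would pass field encodings taken from the SDM.

    #include <cstdint>

    // Reads a VMCS field identified by its architectural encoding.
    inline std::uint64_t vmcs_read(std::uint64_t field_encoding) {
      std::uint64_t value;
      asm volatile("vmread %1, %0" : "=r"(value) : "r"(field_encoding) : "cc");
      return value;
    }

    // Writes a VMCS field identified by its architectural encoding.
    inline void vmcs_write(std::uint64_t field_encoding, std::uint64_t value) {
      asm volatile("vmwrite %1, %0" : : "r"(field_encoding), "r"(value) : "cc");
    }

Because software never relies on the VMCS memory layout, Intel remains free to change the in-memory format between processor generations, which is exactly the encapsulation benefit described above.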
2.9 A Computer Map
This section outlines the hardware components that make up a computer system based on the Intel architecture.
§ 2.9.1 summarizes the structure of a motherboard. This is necessary background for reasoning about the cost and impact of physical attacks against a computing system. § 2.9.2 describes Intel's Management Engine, which plays a role in the computer's bootstrap process, and has significant security implications.
§ 2.9.3 presents the building blocks of an Intel processor, and § 2.9.4 models an Intel execution core at a high level. This is the foundation for implementing defenses against physical attacks. Perhaps more importantly, reasoning about software attacks based on information leakage, such as timing attacks, requires understanding how a processor's computing resources are shared and partitioned between mutually distrusting parties.
The information here is either contained in the SDM or in Intel's Optimization Reference Manual [96].

2.9.1 The Motherboard
A computer's components are connected by a printed circuit board called a motherboard, shown in Figure 20, which consists of sockets connected by buses. Sockets connect chip-carrying packages to the board. The Intel documentation uses the term "package" to specifically refer to a CPU.

Figure 20: The motherboard structures that are most relevant in a system security analysis.

The CPU (described in § 2.9.3) hosts the execution cores that run the software stack shown in Figure 8 and described in § 2.3, namely the SMM code, the hypervisor, operating systems, and application processes. The computer's main memory is provided by Dynamic Random-Access Memory (DRAM) chips.
The Platform Controller Hub (PCH) houses (relatively) low-speed I/O controllers driving the slower buses in the system, like SATA, used by storage devices, and USB, used by input peripherals. The PCH is also known as the chipset. At a first approximation, the south bridge term in older documentation can also be considered as a synonym for PCH.
Motherboards also have a non-volatile (flash) memory chip that hosts firmware which implements the Unified Extensible Firmware Interface (UEFI) specification [180]. The firmware contains the boot code and the code that executes in System Management Mode (SMM, § 2.3).

The components we care about are connected by the following buses: the QuickPath Interconnect (QPI [91]), a network of point-to-point links that connect processors, the double data rate (DDR) bus that connects a CPU to DRAM, the Direct Media Interface (DMI) bus that connects a CPU to the PCH, the Peripheral Component Interconnect Express (PCIe) bus that connects a CPU to peripherals such as a Network Interface Card (NIC), and the Serial Programming Interface (SPI) used by the PCH to communicate with the flash memory chip.
The PCIe bus is an extended, point-to-point version of the PCI standard, which provides a method for any peripheral connected to the bus to perform Direct Memory Access (DMA), transferring data to and from DRAM without involving an execution core and spending CPU cycles. The PCI standard includes a configuration mechanism that assigns a range of DRAM to each peripheral, but makes no provisions for restricting a peripheral's DRAM accesses to its assigned range.
Network interfaces consist of a physical (PHY) module that converts the analog signals on the network media to and from digital bits, and a Media Access Control (MAC) module that implements a network-level protocol. Modern Intel-based motherboards forego a full-fledged NIC, and instead include an Ethernet [84] PHY module.

Figure 21: The Intel Management Engine (ME) is an embedded computer hosted in the PCH. The ME has its own execution core, ROM and SRAM. The ME can access the host's DRAM via a memory controller and a DMA controller. The ME is remotely accessible over the network, as it has direct access to an Ethernet PHY via the SMBus.

2.9.2 The Intel Management Engine (ME)
Intel's Management Engine (ME) is an embedded computer that was initially designed for remote system management and troubleshooting of server-class systems that are often hosted in data centers. However, all of Intel's recent PCHs contain an ME [80], and it currently plays a crucial role in platform bootstrapping, which is described in detail in § 2.13. Most of the information in this section is obtained from an Intel-sponsored book [162].
The ME is part of Intel's Active Management Technology (AMT), which is marketed as a convenient way for IT administrators to troubleshoot and fix situations such as failing hardware, or a corrupted OS installation, without having to gain physical access to the impacted computer.
The Intel ME, shown in Figure 21, remains functional during most hardware failures because it is an entire embedded computer featuring its own execution core, bootstrap ROM, and internal RAM. The ME can be used for troubleshooting effectively thanks to an array of abilities that include overriding the CPU's boot vector and a DMA engine that can access the computer's DRAM. The ME provides remote access to the computer without any CPU support because it can use the System Management bus (SMBus) to access the motherboard's Ethernet PHY or an AMT-compatible NIC [100].
The Intel ME is connected to the motherboard's power supply using a power rail that stays active even when the host computer is in the Soft Off mode [100], known as ACPI G2/S5, where most of the computer's components are powered off [87], including the CPU and DRAM. For all practical purposes, this means that the ME's execution core is active as long as the power supply is still connected to a power source.
In S5, the ME cannot access the DRAM, but it can still use its own internal memories. The ME can also still communicate with a remote party, as it can access the motherboard's Ethernet PHY via SMBus. This enables applications such as AMT's theft prevention, where a device equipped with a cellular modem can be tracked and permanently disabled as long as it has power and connectivity.
As the ME remains active in deep power-saving modes, its design must rely on low-power components. The execution core is an Argonaut RISC Core (ARC) clocked at 200-400MHz, which is typically used in low-power embedded designs. On a very recent PCH [100], the internal SRAM has 640KB, and is shared with the Integrated Sensor Hub (ISH)'s core. The SMBus runs at 1MHz and, without CPU support, the motherboard's Ethernet PHY runs at 10Mbps.
When the host computer is powered on, the ME's execution core starts running code from the ME's bootstrap ROM. The bootstrap code loads the ME's software stack from the same flash chip that stores the host computer's firmware. The ME accesses the flash memory chip via an embedded SPI controller.

2.9.3 The Processor Die
An Intel processor's die, illustrated in Figure 22, is divided into two broad areas: the core area implements the instruction execution pipeline typically associated with CPUs, while the uncore provides functions that were traditionally hosted on separate chips, but are currently integrated on the CPU die to reduce latency and power consumption.

Figure 22: The major components in a modern CPU package. § 2.9.3 gives an uncore overview. § 2.9.4 describes execution cores. § 2.11.3 takes a deeper look at the uncore.

At a conceptual level, the uncore of modern processors includes an integrated memory controller (iMC) that interfaces with the DDR bus, an integrated I/O controller (IIO) that implements PCIe bus lanes and interacts with the DMI bus, and a growing number of integrated peripherals, such as a Graphics Processing Unit (GPU). The uncore structure is described in some processor family datasheets [97, 98], and in the overview sections in Intel's uncore performance monitoring documentation [37, 90, 94].
Security extensions to the Intel architecture, such as Trusted Execution Technology (TXT) [70] and Software Guard Extensions (SGX) [14, 139], rely on the fact that the processor die includes the memory and I/O controller, and thus can prevent any device from accessing protected memory areas via Direct Memory Access (DMA) transfers. § 2.11.3 takes a deeper look at the uncore organization and at the machinery used to prevent unauthorized DMA transfers.

2.9.4 The Core
Virtually all modern Intel processors have core areas consisting of multiple copies of the execution core circuitry, each of which is called a core. At the time of this writing, desktop-class Intel CPUs have 4 cores, and server-class CPUs have as many as 18 cores.
Most Intel CPUs feature hyper-threading, which means that a core (shown in Figure 23) has two copies of the register files backing the execution context described in § 2.6, and can execute two separate streams of instructions simultaneously. Hyper-threading reduces the impact of memory stalls on the utilization of the fetch, decode and execution units.

Figure 23: CPU core with two logical processors. Each logical processor has its own execution context and LAPIC (§ 2.12). All the other core resources are shared.

A hyper-threaded core is exposed to system software as two logical processors (LPs), also named hardware threads in the Intel documentation. The logical processor abstraction allows the code used to distribute work across processors in a multi-processor system to function without any change on multi-core hyper-threaded processors.
The high level of resource sharing introduced by hyper-threading introduces a security vulnerability. Software running on one logical processor can use the high-resolution performance counter (RDTSCP, § 2.4) [152] to get information about the instructions and memory access patterns of another piece of software that is executed on the other logical processor on the same core.
That being said, the biggest downside of hyper-threading might be the fact that writing about Intel processors in a rigorous manner requires the use of the cumbersome term Logical Processor instead of the shorter and more intuitive "CPU core", which can often be abbreviated to "core".
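As an illustration of the fine-grained measurements that this resource sharing exposes, the helper below times a single memory access using the RDTSCP counter mentioned above, via the __rdtscp compiler intrinsic. It is a sketch of the measurement primitive only; interpreting the timings is the subject of the timing-attack discussion in § 3.8.

    #include <cstdint>
    #include <x86intrin.h>

    // Returns the number of time-stamp counter ticks taken by one load.
    // RDTSCP waits for earlier instructions to finish before sampling the
    // counter, which is what makes it usable for fine-grained timing.
    inline std::uint64_t time_one_load(const volatile std::uint8_t* address) {
      unsigned int aux;                     // receives IA32_TSC_AUX; unused here
      std::uint64_t start = __rdtscp(&aux);
      (void)*address;                       // the memory access being timed
      std::uint64_t end = __rdtscp(&aux);
      return end - start;
    }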

2.10 Out-of-Order and Speculative Execution
CPU cores can execute instructions orders of magnitude faster than DRAM can read data. Computer architects attempt to bridge this gap by using hyper-threading (§ 2.9.3), out-of-order and speculative execution, and caching, which is described in § 2.11. In CPUs that use out-of-order execution, the order in which the CPU carries out a program's instructions (execution order) is not necessarily the same as the order in which the instructions would be executed by a sequential evaluation system (program order).
An analysis of a system's information leakage must take out-of-order execution into consideration. Any CPU actions observed by an attacker match the execution order, so the attacker may learn some information by comparing the observed execution order with a known program order. At the same time, attacks that try to infer a victim's program order based on actions taken by the CPU must account for out-of-order execution as a source of noise.
This section summarizes the out-of-order and speculative execution concepts used when reasoning about a system's security properties. [150] and [76] cover the concepts in great depth, while Intel's optimization manual [96] provides details specific to Intel CPUs.
Figure 24 provides a more detailed view of the CPU core components involved in out-of-order execution, and omits some less relevant details from Figure 23.

Figure 24: The structures in a CPU core that are relevant to out-of-order and speculative execution. Instructions are decoded into micro-ops, which are scheduled on one of the reservation station's ports. The branch predictor enables speculative execution when a branch is encountered.

The Intel architecture defines a complex instruction set computer (CISC). However, virtually all modern CPUs are architected following reduced instruction set computer (RISC) principles. This is accomplished by having the instruction decode stages break down each instruction into micro-ops, which resemble RISC instructions. The other stages of the execution pipeline work exclusively with micro-ops.

2.10.1 Out-of-Order Execution
Different types of instructions require different logic circuits, called functional units. For example, the arithmetic logic unit (ALU), which performs arithmetic operations, is completely different from the load and store unit, which performs memory operations. Different circuits can be used at the same time, so each CPU core can execute multiple micro-ops in parallel.
The core's out-of-order engine receives decoded micro-ops, identifies the micro-ops that can execute in parallel, assigns them to functional units, and combines the outputs of the units so that the results are equivalent to having the micro-ops executed sequentially in the order in which they come from the decode stages.
For example, consider the sequence of pseudo micro-ops4 in Table 5 below. The OR uses the result of the LOAD, but the ADD does not. Therefore, a good scheduler can have the load store unit execute the LOAD and the ALU execute the ADD, all in the same clock cycle.
The out-of-order engine in recent Intel CPUs works roughly as follows. Micro-ops received from the decode queue are written into a reorder buffer (ROB) while they are in-flight in the execution unit. The register allocation table (RAT) matches each register with the last reorder buffer entry that updates it. The renamer uses the RAT to rewrite the source and destination fields of micro-ops when they are written in the ROB, as illustrated in Tables 6 and 7.

4 The set of micro-ops used by Intel CPUs is not publicly documented. The fictional examples in this section suffice for illustration purposes.

#  Micro-op            Meaning
1  LOAD RAX, RSI       RAX ← DRAM[RSI]
2  OR RDI, RDI, RAX    RDI ← RDI ∨ RAX
3  ADD RSI, RSI, RCX   RSI ← RSI + RCX
4  SUB RBX, RSI, RDX   RBX ← RSI - RDX

Table 5: Pseudo micro-ops for the out-of-order execution example.

Note that the ROB representation makes it easy to determine the dependencies between micro-ops.

#  Op    Source 1  Source 2  Destination
1  LOAD  RSI       ∅         RAX
2  OR    RDI       ROB #1    RDI
3  ADD   RSI       RCX       RSI
4  SUB   ROB #3    RDX       RBX

Table 6: Data written by the renamer into the reorder buffer (ROB), for the micro-ops in Table 5.

Register  RAX  RBX  RCX  RDX  RSI  RDI
ROB #     #1   #4   ∅    ∅    #3   #2

Table 7: Relevant entries of the register allocation table after the micro-ops in Table 5 are inserted into the ROB.

The scheduler decides which micro-ops in the ROB get executed, and places them in the reservation station. The reservation station has one port for each functional unit that can execute micro-ops independently. Each reservation station port holds one micro-op from the ROB. The reservation station port waits until the micro-op's dependencies are satisfied and forwards the micro-op to the functional unit. When the functional unit completes executing the micro-op, its result is written back to the ROB, and forwarded to any other reservation station port that depends on it.
The ROB stores the results of completed micro-ops until they are retired, meaning that the results are committed to the register file and the micro-ops are removed from the ROB. Although micro-ops can be executed out-of-order, they must be retired in program order, in order to handle exceptions correctly. When a micro-op causes a hardware exception (§ 2.8.2), all the following micro-ops in the ROB are squashed, and their results are discarded.
In the example above, the ADD can complete before the LOAD, because it does not require a memory access. However, the ADD's result cannot be committed before LOAD completes. Otherwise, if the ADD is committed and the LOAD causes a page fault, software will observe an incorrect value for the RSI register.
The ROB is tailored for discovering register dependencies between micro-ops. However, micro-ops that execute out-of-order can also have memory dependencies. For this reason, out-of-order engines have a load buffer and a store buffer that keep track of in-flight memory operations and are used to resolve memory dependencies.
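The renaming step shown in Table 6 can be expressed as a few lines of code. The sketch below is our own illustration: the operand encoding and the 16-register RAT are invented for clarity, since Intel's actual micro-op and RAT formats are not public.

    #include <array>
    #include <cstddef>
    #include <initializer_list>
    #include <optional>
    #include <vector>

    // An operand names an architectural register; after renaming it may
    // instead point at the ROB entry that will produce the value.
    struct Operand {
      int reg = -1;                  // architectural register, -1 if unused
      std::optional<int> rob_entry;  // filled in by the renamer
    };

    struct MicroOp {
      Operand src1, src2;
      int dest_reg;
    };

    // Register allocation table: for each register, the in-flight ROB entry
    // that will produce its newest value, if any.
    using Rat = std::array<std::optional<int>, 16>;

    // Renames sources as micro-ops are written into the ROB in program order.
    void rename_into_rob(std::vector<MicroOp>& rob, Rat& rat) {
      for (std::size_t i = 0; i < rob.size(); ++i) {
        for (Operand* src : {&rob[i].src1, &rob[i].src2}) {
          if (src->reg >= 0) {
            // Read from the in-flight producer if there is one; otherwise the
            // value comes from the architectural register file.
            src->rob_entry = rat[src->reg];
          }
        }
        rat[rob[i].dest_reg] = static_cast<int>(i);  // this entry now owns dest
      }
    }

Running this over the four micro-ops of Table 5 (numbering ROB entries from 0 instead of 1) reproduces the rewrites in Table 6 and leaves the RAT in the state shown in Table 7.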
2.10.2 Speculative Execution
Branch instructions, also called branches, change the instruction pointer (RIP, § 2.6), if a condition is met (the branch is taken). They implement conditional statements (if) and looping statements, such as while and for. The most well-known branching instructions in the Intel architecture are in the jcc family, such as je (jump if equal).
Branches pose a challenge to the decode stage, because the instruction that should be fetched after a branch is not known until the branching condition is evaluated. In order to avoid stalling the decode stage, modern CPU designs include branch predictors that use historical information to guess whether a branch will be taken or not.
When the decode stage encounters a branch instruction, it asks the branch predictor for a guess as to whether the branch will be taken or not. The decode stage bundles the branch condition and the predictor's guess into a branch check micro-op, and then continues decoding on the path indicated by the predictor. The micro-ops following the branch check are marked as speculative.
When the branch check micro-op is executed, the branch unit checks whether the branch predictor's guess was correct. If that is the case, the branch check is retired successfully. The scheduler handles mispredictions by squashing all the micro-ops following the branch check, and by signaling the instruction decoder to flush the micro-op decode queue and start fetching the instructions that follow the correct branch.
Modern CPUs also attempt to predict memory read patterns, so they can prefetch the memory locations that are about to be read into the cache. Prefetching minimizes the latency of successfully predicted read operations, as their data will already be cached. This is accomplished by exposing circuits called prefetchers to memory accesses and cache misses. Each prefetcher can recognize a particular access pattern, such as sequentially reading an array's elements. When memory accesses match the pattern that a prefetcher was built to recognize, the prefetcher loads the cache line corresponding to the next memory access in its pattern.

2.11 Cache Memories
At the time of this writing, CPU cores can process data ≈ 200× faster than DRAM can supply it. This gap is bridged by a hierarchy of cache memories, which are orders of magnitude smaller and an order of magnitude faster than DRAM. While caching is transparent to application software, the system software is responsible for managing and coordinating the caches that store address translation (§ 2.5) results.
Caches impact the security of a software system in two ways. First, the Intel architecture relies on system software to manage address translation caches, which becomes an issue in a threat model where the system software is untrusted. Second, caches in the Intel architecture are shared by all the software running on the computer. This opens up the way for cache timing attacks, an entire class of software attacks that rely on observing the time differences between accessing a cached memory location and an uncached memory location.
This section summarizes the caching concepts and implementation details needed to reason about both classes of security problems mentioned above. [170], [150] and [76] provide a good background on low-level cache implementation concepts. § 3.8 describes cache timing attacks.

2.11.1 Caching Principles
At a high level, caches exploit the high locality in the memory access patterns of most applications to hide the main memory's (relatively) high latency. By caching (storing a copy of) the most recently accessed code and data, these relatively small memories can be used to satisfy 90%-99% of an application's memory accesses.
In an Intel processor, the first-level (L1) cache consists of a separate data cache (D-cache) and an instruction cache (I-cache). The instruction fetch and decode stage is directly connected to the L1 I-cache, and uses it to read the streams of instructions for the core's logical processors. Micro-ops that read from or write to memory are executed by the memory unit (MEM in Figure 23), which is connected to the L1 D-cache and forwards memory accesses to it.
Figure 25 illustrates the steps taken by a cache when it receives a memory access. First, a cache lookup uses the memory address to determine if the corresponding data exists in the cache. A cache hit occurs when the address is found, and the cache can resolve the memory access quickly. Conversely, if the address is not found, a cache miss occurs, and a cache fill is required to resolve the memory access. When doing a fill, the cache forwards the memory access to the next level of the memory hierarchy and caches the response. Under most circumstances, a cache fill also triggers a cache eviction, in which some data is removed from the cache to make room for the data coming from the fill. If the data that is evicted has been modified since it was loaded in the cache, it must be written back to the next level of the memory hierarchy.

Figure 25: The steps taken by a cache memory to resolve an access to a memory address A. A normal memory access (to cacheable DRAM) always triggers a cache lookup. If the access misses the cache, a fill is required, and a write-back might be required.
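The decision flow in Figure 25 can be summarized by the toy model below, which is our own illustration rather than a description of any real cache: it tracks only line-aligned addresses, a dirty bit and a crude capacity limit, and both the victim choice and the backing next level are stand-ins for the real mechanisms.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    struct ToyCache {
      struct Line { std::uint64_t data; bool dirty; };

      std::unordered_map<std::uint64_t, Line> lines;                // keyed by line address
      std::unordered_map<std::uint64_t, std::uint64_t> next_level;  // stands in for DRAM
      std::size_t capacity = 512;

      std::uint64_t read(std::uint64_t line_address) {
        auto it = lines.find(line_address);               // cache lookup
        if (it != lines.end()) return it->second.data;    // hit: resolve quickly
        if (lines.size() >= capacity) {                   // miss with a full cache:
          auto victim = lines.begin();                    //   evict an (arbitrary) line
          if (victim->second.dirty)                       //   dirty data is written back
            next_level[victim->first] = victim->second.data;
          lines.erase(victim);
        }
        std::uint64_t data = next_level[line_address];    // fill from the next level
        lines.emplace(line_address, Line{data, false});
        return data;
      }

      void write(std::uint64_t line_address, std::uint64_t data) {
        read(line_address);                               // write-allocate on a miss
        lines[line_address] = Line{data, true};           // mark the line dirty
      }
    };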

Table 8 shows the key characteristics of the memory hierarchy implemented by modern Intel CPUs. Each core has its own L1 and L2 cache (see Figure 23), while the L3 cache is in the CPU's uncore (see Figure 22), and is shared by all the cores in the package.

Memory          Size    Access Time
Core Registers  1 KB    no latency
L1 D-Cache      32 KB   4 cycles
L2 Cache        256 KB  10 cycles
L3 Cache        8 MB    40-75 cycles
DRAM            16 GB   60 ns

Table 8: Approximate sizes and access times for each level in the memory hierarchy of an Intel processor, from [127]. Memory sizes and access times differ by orders of magnitude across the different levels of the hierarchy. This table does not cover multi-processor systems.

The numbers in Table 8 suggest that cache placement can have a large impact on an application's execution time. Because of this, the Intel architecture includes an assortment of instructions that give performance-sensitive applications some control over the caching of their working sets. PREFETCH instructs the CPU's prefetcher to cache a specific memory address, in preparation for a future memory access. The memory writes performed by the MOVNT instruction family bypass the cache if a fill would be required. CLFLUSH evicts any cache lines storing a specific address from the entire cache hierarchy.
The methods mentioned above are available to software running at all privilege levels, because they were designed for high-performance workloads with large working sets, which are usually executed at ring 3 (§ 2.3). For comparison, the instructions used by system software to manage the address translation caches, described in § 2.11.5 below, can only be executed at ring 0.
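The corresponding compiler intrinsics make these instructions easy to demonstrate. The fragment below is a usage sketch of ours; buffer is assumed to point at a sufficiently large, writable int array.

    #include <emmintrin.h>  // _mm_stream_si32, _mm_clflush
    #include <xmmintrin.h>  // _mm_prefetch

    void cache_control_demo(int* buffer) {
      // PREFETCH: ask the prefetcher to pull the line in ahead of use.
      _mm_prefetch(reinterpret_cast<const char*>(buffer), _MM_HINT_T0);
      int value = buffer[0];
      // MOVNT family: store without triggering a cache fill.
      _mm_stream_si32(&buffer[16], value);
      // CLFLUSH: evict the line holding buffer[0] from every cache level.
      _mm_clflush(buffer);
    }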

2.11.2 Cache Organization
In the Intel architecture, caches are completely implemented in hardware, meaning that the software stack has no direct control over the eviction process. However, software can gain some control over which data gets evicted by understanding how the caches are organized, and by cleverly placing its data in memory.
The cache line is the atomic unit of cache organization. A cache line has data, a copy of a continuous range of DRAM, and a tag, identifying the memory address that the data comes from. Fills and evictions operate on entire lines.
The cache line size is the size of the data, and is always a power of two. Assuming n-bit memory addresses and a cache line size of 2^l bytes, the lowest l bits of a memory address are an offset into a cache line, and the highest n − l bits determine the cache line that is used to store the data at the memory location. All recent processors have 64-byte cache lines.
The L1 and L2 caches in recent processors are multi-way set-associative with direct set indexing, as shown in Figure 26. A W-way set-associative cache has its memory divided into sets, where each set has W lines. A memory location can be cached in any of the W lines in a specific set that is determined by the highest n − l bits of the location's memory address. Direct set indexing means that the S sets in a cache are numbered from 0 to S − 1, and the memory location at address A is cached in the set numbered A_{n−1…l} mod S.
In the common case where the number of sets in a cache is a power of two, so S = 2^s, the lowest l bits in an address make up the cache line offset, and the next s bits are the set index. The highest n − s − l bits in an address are not used when selecting where a memory location will be cached. Figure 26 shows the cache structure and lookup process.

Figure 26: Cache organization and lookup, for a W-way set-associative cache with 2^l-byte lines and S = 2^s sets. The cache works with n-bit memory addresses. The lowest l address bits point to a specific byte in a cache line, the next s bits index the set, and the highest n − s − l bits are used to decide if the desired address is in one of the W lines in the indexed set.
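The address breakdown in Figure 26 can be written out directly. The sketch below is a generic helper of ours, parameterized by l and s; as a concrete data point, a 32 KB, 8-way cache with 64-byte lines has 32768 / (64 × 8) = 64 sets, so l = 6 and s = 6.

    #include <cstdint>

    struct CacheIndex {
      std::uint64_t line_offset;  // lowest l bits
      std::uint64_t set_index;    // next s bits
      std::uint64_t tag;          // highest n - s - l bits
    };

    // Splits an address for a cache with 2^l-byte lines and 2^s sets.
    constexpr CacheIndex split_address(std::uint64_t address,
                                       unsigned l, unsigned s) {
      return CacheIndex{
          address & ((std::uint64_t{1} << l) - 1),
          (address >> l) & ((std::uint64_t{1} << s) - 1),
          address >> (l + s)};
    }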

2.11.3 Cache Coherence
The Intel architecture was designed to support application software that was not written with caches in mind. One aspect of this support is the Total Store Order (TSO) [147] memory model, which promises that all the logical processors in a computer see the same order of DRAM writes.
The same memory location might be simultaneously cached by different cores' caches, or even by caches on separate chips, so providing the TSO guarantees requires a cache coherence protocol that synchronizes all the cache lines in a computer that reference the same memory address.
The cache coherence mechanism is not visible to software, so it is only briefly mentioned in the SDM. Fortunately, Intel's optimization reference [96] and the datasheets referenced in § 2.9.3 provide more information. Intel processors use variations of the MESIF [66] protocol, which is implemented in the CPU and in the protocol layer of the QPI bus.
The SDM and the CPUID instruction output indicate that the L3 cache, also known as the last-level cache (LLC), is inclusive, meaning that any location cached by an L1 or L2 cache must also be cached in the LLC. This design decision reduces complexity in many implementation aspects. We estimate that the bulk of the cache coherence implementation is in the CPU's uncore, thanks to the fact that cache synchronization can be achieved without having to communicate to the lower cache levels that are inside execution cores.
The QPI protocol defines cache agents, which are connected to the last-level cache in a processor, and home agents, which are connected to memory controllers. Cache agents make requests to home agents for cache line data on cache misses, while home agents keep track of cache line ownership, and obtain the cache line data from other cache line agents, or from the memory controller. The QPI routing layer supports multiple agents per socket, and each processor has its own caching agents, and at least one home agent.
Figure 27 shows that the CPU uncore has a bidirectional ring interconnect, which is used for communication between execution cores and the other uncore components. The execution cores are connected to the ring by CBoxes, which route their LLC accesses. The routing is static, as the LLC is divided into same-size slices (common slice sizes are 1.5 MB and 2.5 MB), and an undocumented hashing scheme maps each possible physical address to exactly one LLC slice.

Figure 27: The stops on the ring interconnect used for inter-core and core-uncore communication.

The number of LLC slices matches the number of cores in the CPU, and each LLC slice shares a CBox with a core. The CBoxes implement the cache coherence engine, so each CBox acts as the QPI cache agent for its LLC slice. CBoxes use a Source Address Decoder (SAD) to route DRAM requests to the appropriate home agents. Conceptually, the SAD takes in a memory address and access type, and outputs a transaction type (coherent, non-coherent, IO) and a node ID. Each CBox contains a SAD replica, and the configurations of all SADs in a package are identical.
The SAD configurations are kept in sync by the UBox, which is the uncore configuration controller, and connects the System agent to the ring. The UBox is responsible for reading and writing physically distributed registers across the uncore. The UBox also receives interrupts from the system and dispatches them to the appropriate core.
On recent Intel processors, the uncore also contains at least one memory controller. Each integrated memory controller (iMC or MBox in Intel's documentation) is connected to the ring by a home agent (HA or BBox in Intel's datasheets).
Each home agent contains a Target Address Decoder (TAD), which maps each DRAM address to an address suitable for use by the DRAM chips, namely a DRAM channel, bank, rank, and a DIMM address. The mapping in the TAD is not documented by Intel, but it has been reverse-engineered [151].
The integration of the memory controller on the CPU brings the ability to filter DMA transfers. Accesses from a peripheral connected to the PCIe bus are handled by the integrated I/O controller (IIO), placed on the ring interconnect via the UBox, and then reach the iMC. Therefore, on modern systems, DMA transfers go through both the SAD and TAD, which can be configured to abort DMA transfers targeting protected DRAM ranges.
Intel's documentation states that the hashing scheme mapping physical addresses to LLC slices was designed to avoid having a slice become a hotspot, but stops short of providing any technical details. Fortunately, independent researchers have reverse-engineered the hash functions for recent processors [85, 135, 197].
The hashing scheme described above is the reason why the L3 cache is documented as having a "complex" indexing scheme, as opposed to the direct indexing used in the L1 and L2 caches.
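Published reverse-engineering results model the slice hash as a small set of parity functions over physical address bits. The sketch below only shows that general shape; the selector masks are placeholders we made up, not the constants reported in [85, 135, 197].

    #include <cstdint>

    constexpr unsigned parity64(std::uint64_t x) {
      x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
      x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
      return static_cast<unsigned>(x & 1);
    }

    // Placeholder bit selectors; real processors use different, undocumented ones.
    constexpr std::uint64_t kSelector0 = 0x0F0F0F0F00;
    constexpr std::uint64_t kSelector1 = 0x3333333300;

    // Each output bit is the parity of a subset of physical address bits;
    // two output bits select one of four LLC slices in this illustration.
    constexpr unsigned llc_slice(std::uint64_t physical_address) {
      unsigned bit0 = parity64(physical_address & kSelector0);
      unsigned bit1 = parity64(physical_address & kSelector1);
      return (bit1 << 1) | bit0;
    }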

2.11.4 Caching and Memory-Mapped Devices
Caches rely on the assumption that the underlying memory implements the memory abstraction in § 2.2. However, the physical addresses that map to memory-mapped I/O devices usually deviate from the memory abstraction. For example, some devices expose command registers that trigger certain operations when written, and always return a zero value. Caching addresses that map to such memory-mapped I/O devices will lead to incorrect behavior.
Furthermore, even when the memory-mapped devices follow the memory abstraction, caching their memory is sometimes undesirable. For example, caching a graphics unit's framebuffer could lead to visual artifacts on the user's display, because of the delay between the time when a write is issued and the time when the corresponding cache lines are evicted and written back to memory.
In order to work around these problems, the Intel architecture implements a few caching behaviors, described below, and provides a method for partitioning the memory address space (§ 2.4) into regions, and for assigning a desired caching behavior to each region.
Uncacheable (UC) memory has the same semantics as the I/O address space (§ 2.4). UC memory is useful when a device's behavior is dependent on the order of memory reads and writes, such as in the case of memory-mapped command and data registers for a PCIe NIC (§ 2.9.1). The out-of-order execution engine (§ 2.10) does not reorder UC memory accesses, and does not issue speculative reads to UC memory.
Write Combining (WC) memory addresses the specific needs of multimedia devices. WC memory is similar to UC memory, but the out-of-order engine may reorder memory accesses, and may perform speculative reads. The processor stores writes to WC memory in a write combining buffer, and attempts to group multiple writes into a (more efficient) line write bus transaction.
Write Through (WT) memory is cached, but write misses do not cause cache fills. This is useful for preventing large memory-mapped device memories that are rarely read, such as framebuffers, from taking up cache memory. WT memory is covered by the cache coherence engine, may receive speculative reads, and is subject to operation reordering.
DRAM is represented as Write Back (WB) memory, which is optimized under the assumption that all the devices that need to observe the memory operations implement the cache coherence protocol. WB memory is cached as described in § 2.11, receives speculative reads, and operations targeting it are subject to reordering.
Write Protected (WP) memory is similar to WB memory, with the exception that every write is propagated to the system bus. It is intended for memory-mapped buffers, where the order of operations does not matter, but the devices that need to observe the writes do not implement the cache coherence protocol, in order to reduce hardware costs.
On recent Intel processors, the cache's behavior is mainly configured by the Memory Type Range Registers (MTRRs) and by Page Attribute Table (PAT) indices in the page tables (§ 2.5). The behavior is also impacted by the Cache Disable (CD) and Not-Write through (NW) bits in Control Register 0 (CR0, § 2.4), as well as by equivalent bits in page table entries, namely Page-level Cache Disable (PCD) and Page-level Write-Through (PWT).
The MTRRs were intended to be configured by the computer's firmware during the boot sequence. Fixed MTRRs cover pre-determined ranges of memory, such as the memory areas that had special semantics in the computers using 16-bit Intel processors. The ranges covered by variable MTRRs can be configured by system software. The representation used to specify the ranges is described below, as it has some interesting properties that have proven useful in other systems.
Each variable memory type range is specified using a range base and a range mask. A memory address belongs to the range if computing a bitwise AND between the address and the range mask results in the range base. This verification has a low-cost hardware implementation, shown in Figure 28.

Figure 28: The circuit for computing whether a physical address matches a memory type range. Assuming a CPU with 48-bit physical addresses, the circuit uses 36 AND gates and a binary tree of 35 XNOR (equality test) gates. The circuit outputs 1 if the address belongs to the range. The bottom 12 address bits are ignored, because memory type ranges must be aligned to 4 KB page boundaries.
Each variable memory type range must have a size that is an integral power of two, and a starting address that is a multiple of its size, so it can be described using the base / mask representation described above. A range's starting address is its base, and the range's size is one plus its mask.
Another advantage of this range representation is that the base and the mask can be easily validated, as shown in Listing 1. The range is aligned with respect to its size if and only if the bitwise AND between the base and the mask is zero. The range's size is a power of two if and only if the bitwise AND between the mask and one plus the mask is zero. According to the SDM, the MTRRs are not validated, but setting them to invalid values results in undefined behavior.

    constexpr bool is_valid_range(
        size_t base, size_t mask) {
      // Base is aligned to size.
      return (base & mask) == 0 &&
          // Size is a power of two.
          (mask & (mask + 1)) == 0;
    }

Listing 1: The checks that validate the base and mask of a memory-type range can be implemented very easily.

No memory type range can partially cover a 4 KB page, which implies that the range base must be a multiple of 4 KB, and the bottom 12 bits of the range mask must be set. This simplifies the interactions between memory type ranges and address translation, described in § 2.11.5.
The PAT is intended to allow the operating system or hypervisor to tweak the caching behaviors specified in the MTRRs by the computer's firmware. The PAT has 8 entries that specify caching behaviors, and is stored in its entirety in a MSR. Each page table entry contains a 3-bit index that points to a PAT entry, so the system software that controls the page tables can specify caching behavior at a very fine granularity.

2.11.5 Caches and Address Translation
Modern system software relies on address translation (§ 2.5). This means that all the memory accesses issued by a CPU core use virtual addresses, which must undergo translation. Caches must know the physical address for a memory access, to handle aliasing (multiple virtual addresses pointing to the same physical address) correctly. However, address translation requires up to 20 memory accesses (see Figure 15), so it is impractical to perform a full address translation for every cache access. Instead, address translation results are cached in the translation look-aside buffer (TLB).
Table 9 shows the levels of the TLB hierarchy. Recent processors have separate L1 TLBs for instructions and data, and a shared L2 TLB. Each core has its own TLBs (see Figure 23). When a virtual address is not contained in a core's TLB, the Page Miss Handler (PMH) performs a page walk (page table / EPT traversal) to translate the virtual address, and the result is stored in the TLB.

Memory       Entries            Access Time
L1 I-TLB     128 + 8 = 136      1 cycle
L1 D-TLB     64 + 32 + 4 = 100  1 cycle
L2 TLB       1536 + 8 = 1544    7 cycles
Page Tables  2^36 ≈ 6 · 10^10   18 cycles - 200ms

Table 9: Approximate sizes and access times for each level in the TLB hierarchy, from [4].

In the Intel architecture, the PMH is implemented in hardware, so the TLB is never directly exposed to software and its implementation details are not documented. The SDM does state that each TLB entry contains the physical address associated with a virtual address, and the metadata needed to resolve a memory access. For example, the processor needs to check the writable (W) flag on every write, and issue a General Protection fault (#GP) if the write targets a read-only page. Therefore, the TLB entry for each virtual address caches the logical-and of all the relevant W flags in the page table structures leading up to the page.
The TLB is transparent to application software. However, kernels and hypervisors must make sure that the TLBs do not get out of sync with the page tables and EPTs. When changing a page table or EPT, the system software must use the INVLPG instruction to invalidate any TLB entries for the virtual address whose translation changed. Some instructions flush the TLBs, meaning that they invalidate all the TLB entries, as a side-effect.
TLB entries also cache the desired caching behavior (§ 2.11.4) for their pages. This requires system software to flush the corresponding TLB entries when changing MTRRs or page table entries. In return, the processor only needs to compute the desired caching behavior during a TLB miss, as opposed to computing the caching behavior on every memory access.
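A minimal sketch of the invalidation step described above, assuming ring-0 code that has just modified a page table entry; other logical processors still require the TLB shootdown discussed next.

    // Invalidates the TLB entries for the page containing virtual_address on
    // the current logical processor. INVLPG is privileged, so this faults if
    // executed at ring 3.
    inline void invalidate_tlb_entry(const void* virtual_address) {
      asm volatile("invlpg (%0)" : : "r"(virtual_address) : "memory");
    }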

The TLB is not covered by the cache coherence mechanism described in § 2.11.3. Therefore, when modifying a page table or EPT on a multi-core / multi-processor system, the system software is responsible for performing a TLB shootdown, which consists of stopping all the logical processors that use the page table / EPT about to be changed, performing the changes, executing TLB-invalidating instructions on the stopped logical processors, and then resuming execution on the stopped logical processors.
Address translation constrains the L1 cache design. On Intel processors, the set index in an L1 cache only uses the address bits that are not impacted by address translation, so that the L1 set lookup can be done in parallel with the TLB lookup. This is critical for achieving a low latency when both the L1 TLB and the L1 cache are hit.
Given a page size P = 2^p bytes, the requirement above translates to l + s ≤ p. In the Intel architecture, p = 12, and all recent processors have 64-byte cache lines (l = 6) and 64 sets (s = 6) in the L1 caches, as shown in Figure 29. The L2 and L3 caches are only accessed if the L1 misses, so the physical address for the memory access is known at that time, and can be used for indexing.

Figure 29: Virtual addresses from the perspective of cache lookup and address translation. The bits used for the L1 set index and line offset are not changed by address translation, so the page tables do not impact L1 cache placement. The page tables do impact L2 and L3 cache placement. Using large pages (2 MB or 1 GB) is not sufficient to make L3 cache placement independent of the page tables, because of the LLC slice hashing function (§ 2.11.3).

2.12 Interrupts
Peripherals use interrupts to signal the occurrence of an event that must be handled by system software. For example, a keyboard triggers interrupts when a key is pressed or depressed. System software also relies on interrupts to implement preemptive multi-threading.
Interrupts are a kind of hardware exception (§ 2.8.2). Receiving an interrupt causes an execution core to perform a privilege level switch and to start executing the system software's interrupt handling code. Therefore, the security concerns in § 2.8.2 also apply to interrupts, with the added twist that interrupts occur independently of the instructions executed by the interrupted code, whereas most faults are triggered by the actions of the application software that incurs them.
Given the importance of interrupts when assessing a system's security, this section outlines the interrupt triggering and handling processes described in the SDM.
Peripherals use bus-specific protocols to signal interrupts. For example, PCIe relies on Message Signaled Interrupts (MSI), which are memory writes issued to specially designed memory addresses. The bus-specific interrupt signals are received by the I/O Advanced Programmable Interrupt Controller (IOAPIC) in the PCH, shown in Figure 20.
The IOAPIC routes interrupt signals to one or more Local Advanced Programmable Interrupt Controllers (LAPICs). As shown in Figure 22, each logical CPU has a LAPIC that can receive interrupt signals from the IOAPIC. The IOAPIC routing process assigns each interrupt to an 8-bit interrupt vector that is used to identify the interrupt sources, and to a 32-bit APIC ID that is used to identify the LAPIC that receives the interrupt.
Each LAPIC uses a 256-bit Interrupt Request Register (IRR) to track the unserviced interrupts that it has received, based on the interrupt vector number. When the corresponding logical processor is available, the LAPIC copies the highest-priority unserviced interrupt vector to the In-Service Register (ISR), and invokes the logical processor's interrupt handling process.
At the execution core level, interrupt handling reuses many of the mechanisms of fault handling (§ 2.8.2). The interrupt vector number in the LAPIC's ISR is used to locate an interrupt handler in the IDT, and the handler is invoked, possibly after a privilege switch is performed. The interrupt handler does the processing that the device requires, and then writes the LAPIC's End Of Interrupt (EOI) register to signal the fact that it has completed handling the interrupt.
Interrupts are treated like faults, so interrupt handlers have full control over the execution environment of the application being interrupted. This is used to implement pre-emptive multi-threading, which relies on a clock device that generates interrupts periodically, and on an interrupt handler that performs context switches.
System software can cause an interrupt on any logical processor by writing the target processor's APIC ID into the Interrupt Command Register (ICR) of the LAPIC associated with the logical processor that the software is running on. These interrupts, called Inter-Processor Interrupts (IPI), are needed to implement TLB shootdowns (§ 2.11.5).

2.13 Platform Initialization (Booting)
When a computer is powered up, it undergoes a bootstrapping process, also called booting, for simplicity. The boot process is a sequence of steps that collectively initialize all the computer's hardware components and load the system software into DRAM. An analysis of a system's security properties must be aware of all the pieces of software executed during the boot process, and must account for the trust relationships that are created when a software module loads another module.
This section outlines the details of the boot process needed to reason about the security of a system based on the Intel architecture. [92] provides a good reference for many of the booting process's low-level details. While some specifics of the boot process depend on the motherboard and components in a computer, this section focuses on the high-level flow described by Intel's documentation.

2.13.1 The UEFI Standard
The firmware in recent computers with Intel processors implements the Platform Initialization (PI) process in the Unified Extensible Firmware Interface (UEFI) specification [180]. The platform initialization follows the steps shown in Figure 30 and described below.

Figure 30: The phases of the Platform Initialization process in the UEFI specification.

The computer powers up, reboots, or resumes from sleep in the Security phase (SEC). The SEC implementation is responsible for establishing a temporary memory store and loading the next stage of the firmware into it. As the first piece of software that executes on the computer, the SEC implementation is the system's root of trust, and performs the first steps towards establishing the system's desired security properties.
For example, in a measured boot system (also known as trusted boot), all the software involved in the boot process is measured (cryptographically hashed, and the measurement is made available to third parties, as described in § 3.3). In such a system, the SEC implementation takes the first steps in establishing the system's measurement, namely resetting the special register that stores the measurement result, measuring the PEI implementation, and storing the measurement in the special register.
SEC is followed by the Pre-EFI Initialization phase (PEI), which initializes the computer's DRAM, copies itself from the temporary memory store into DRAM, and tears down the temporary storage. When the computer is powering up or rebooting, the PEI implementation is also responsible for initializing all the non-volatile storage units that contain UEFI firmware and loading the next stage of the firmware into DRAM.
PEI hands off control to the Driver eXecution Environment phase (DXE). In DXE, a loader locates and starts firmware drivers for the various components in the computer. DXE is followed by a Boot Device Selection (BDS) phase, which is followed by a Transient System Load (TSL) phase, where an EFI application loads the operating system selected in the BDS phase. Last, the OS loader passes control to the operating system's kernel, entering the Run Time (RT) phase.
When waking up from sleep, the PEI implementation first initializes the non-volatile storage containing the system snapshot saved while entering the sleep state. The rest of the PEI implementation may use optimized re-initialization processes, based on the snapshot contents. The DXE implementation also uses the snapshot to restore the computer's state, such as the DRAM contents, and then directly executes the operating system's wake-up handler.

2.13.2 SEC on Intel Platforms
Right after a computer is powered up, circuitry in the power supply and on the motherboard starts establishing the reference voltages on the power rails in a specific order, documented as "power sequencing" [184] in chipset specifications such as [102]. The rail powering up the Intel ME (§ 2.9.2) in the PCH is powered up significantly before the rail that powers the CPU cores.
When the ME is powered up, it starts executing the code in its boot ROM, which sets up the SPI bus connected to the flash memory chip (§ 2.9.1) that stores both

the UEFI firmware and the ME's firmware. The ME then loads its firmware from flash memory, which contains the ME's operating system and applications.
After the Intel ME loads its software, it sets up some of the motherboard's hardware, such as the PCH bus clocks, and then it kicks off the CPU's bootstrap sequence. Most of the details of the ME's involvement in the computer's boot process are not publicly available, but initializing the clocks is mentioned in a few public documents [5, 7, 42, 107], and is made clear in firmware bringup guides, such as the leaked confidential guide [93] documenting firmware bringup for Intel's Series 7 chipset.
The beginning of the CPU's bootstrap sequence is the SEC phase, which is implemented in the processor circuitry. All the logical processors (LPs) on the motherboard undergo hardware initialization, which invalidates the caches (§ 2.11) and TLBs (§ 2.11.5), performs a Built-In Self Test (BIST), and sets all the registers (§ 2.6) to pre-specified values.
After hardware initialization, the LPs perform the Multi-Processor (MP) initialization algorithm, which results in one LP being selected as the bootstrap processor (BSP), and all the other LPs being classified as application processors (APs).
According to the SDM, the details of the MP initialization algorithm for recent CPUs depend on the motherboard and firmware. In principle, after completing hardware initialization, all LPs attempt to issue a special no-op transaction on the QPI bus. A single LP will succeed in issuing the no-op, thanks to the QPI arbitration mechanism, and to the UBox (§ 2.11.3) in each CPU package, which also serves as a ring arbiter. The arbitration priority of each LP is based on its APIC ID (§ 2.12), which is provided by the motherboard when the system powers up. The LP that issues the no-op becomes the BSP. Upon failing to issue the no-op, the other LPs become APs, and enter the wait-for-SIPI state.
Understanding the PEI firmware loading process is unnecessarily complicated by the fact that the SDM describes a legacy process consisting of having the BSP set its RIP register to 0xFFFFFFF0 (16 bytes below 4 GB), where the firmware is expected to place an instruction that jumps into the PEI implementation.
Recent processors do not support the legacy approach at all [156]. Instead, the BSP reads a word from address 0xFFFFFFE8 (24 bytes below 4 GB) [40, 203], and expects to find the address of a Firmware Interface Table (FIT) in the memory address space (§ 2.4), as shown in Figure 31. The BSP is able to read firmware contents from non-volatile memory before the computer is initialized, because the initial SAD (§ 2.11.3) and PCH (§ 2.9.1) configurations map a region in the memory address space to the SPI flash chip (§ 2.9.1) that stores the computer's firmware.

Figure 31: The Firmware Interface Table (FIT) in relation to the firmware's memory map.

The FIT [153] was introduced in the context of Intel's Itanium architecture, and its use in Intel's current 64-bit architecture is described in an Intel patent [40] and briefly documented in an obscure piece of TXT-related documentation [89]. The FIT contains Authenticated Code Modules (ACMs) that make up the firmware, and other platform-specific information, such as the TPM and TXT configuration [89].
The PEI implementation is stored in an ACM listed in the FIT. The processor loads the PEI ACM, verifies the trustworthiness of the ACM's public key, and ensures that the ACM's contents matches its signature. If the PEI passes the security checks, it is executed. Processors that support Intel TXT only accept Intel-signed ACMs [55, p. 92].

2.13.3 PEI on Intel Platforms
[92] and [35] describe the initialization steps performed by Intel platforms during the PEI phase, from the perspective of a firmware programmer. A few steps provide useful context for reasoning about threat models involving the boot process.
When the BSP starts executing PEI firmware, DRAM is not yet initialized. Therefore the PEI code starts executing in a Cache-as-RAM (CAR) mode, which only relies on the BSP's internal caches, at the expense of imposing severe constraints on the size of the PEI's working set.

One of the first tasks performed by the PEI implementation is enabling DRAM, which requires discovering and initializing the DRAM chips connected to the motherboard, and then configuring the BSP's memory controllers (§ 2.11.3) and MTRRs (§ 2.11.4). Most firmware implementations use Intel's Memory Reference Code (MRC) for this task.

After DRAM becomes available, the PEI code is copied into DRAM and the BSP is taken out of CAR mode. The BSP's LAPIC (§ 2.12) is initialized and used to send a broadcast Startup Inter-Processor Interrupt (SIPI, § 2.12) to wake up the APs. The interrupt vector in a SIPI indicates the memory address of the AP initialization code in the PEI implementation.

The PEI code responsible for initializing APs is executed when the APs receive the SIPI wake-up. The AP PEI code sets up the AP's configuration registers, such as the MTRRs, to match the BSP's configuration. Next, each AP registers itself in a system-wide table, using a memory synchronization primitive, such as a semaphore, to avoid having two APs access the table at the same time. After the AP initialization completes, each AP is suspended again, and waits to receive an INIT Inter-Processor Interrupt from the OS kernel.

The BSP initialization code waits for all APs to register themselves into the system-wide table, and then proceeds to locate, load and execute the firmware module that implements DXE.

2.14 CPU Microcode

The Intel architecture features a large instruction set. Some instructions are used infrequently, and some instructions are very complex, which makes it impractical for an execution core to handle all the instructions in hardware. Intel CPUs use a microcode table to break down rare and complex instructions into sequences of simpler core instructions. Architectural extensions that only require microcode changes are significantly cheaper to implement and validate than extensions that require changes in the CPU's circuitry.

It follows that a good understanding of what can be done in microcode is crucial to evaluating the cost of security features that rely on architecture extensions. Furthermore, the limitations of microcode are sometimes the reasoning behind seemingly arbitrary architecture design decisions.

The first sub-section below presents the relevant facts pertaining to microcode in Intel's optimization reference [96] and SDM. The following subsections summarize information gleaned from Intel's patents and other researchers' findings.

2.14.1 The Role of Microcode

The frequently used instructions in the Intel architecture are handled by the core's fast path, which consists of simple decoders (§ 2.10) that can emit at most 4 micro-ops per instruction. Infrequently used instructions and instructions that require more than 4 micro-ops use a slower decoding path that relies on a sequencer to read micro-ops from a microcode store ROM (MSROM).

The 4 micro-ops limitation can be used to guess intelligently whether an architectural feature is implemented in microcode. For example, it is safe to assume that XSAVE (§ 2.6), which takes over 200 micro-ops on recent CPUs [53], is most likely performed in microcode, whereas simple arithmetic and memory accesses are handled directly by hardware.

The core's execution units handle common cases in fast paths implemented in hardware. When an input cannot be handled by the fast paths, the execution unit issues a microcode assist, which points the microcode sequencer to a routine in microcode that handles the edge cases. The most commonly cited example in Intel's documentation is floating point instructions, which issue assists to handle denormalized inputs.

The REP MOVS family of instructions, also known as string instructions because of their use in strcpy-like functions, operate on variable-sized arrays. These instructions can handle small arrays in hardware, and issue microcode assists for larger arrays.

Modern Intel processors implement a microcode update facility. The SDM describes the process of applying microcode updates from the perspective of system software. Each core can be updated independently, and the updates must be reapplied on each boot cycle. A core can be updated multiple times. The latest SDM at the time of this writing states that a microcode update is up to 16 KB in size.

Processor engineers prefer to build new architectural features as microcode extensions, because microcode can be iterated on much faster than hardware, which reduces development cost [193, 194]. The update facility further increases the appeal of microcode, as some classes of bugs can be fixed after a CPU has been released.

Intel patents [110, 138] describing Software Guard Extensions (SGX) disclose that SGX is entirely implemented in microcode, except for the memory encryption engine. A description of SGX's implementation

28 could provide great insights into Intel’s microcode, but, 2.14.2 Microcode Structure unfortunately, the SDM chapters covering SGX do not include such a description. We therefore rely on other According to a 2013 Intel patent [83], the avenues con- public information sources about the role of microcode sidered for implementing new architectural features are in the security-sensitive areas covered by previous sec- a completely microcode-based implementation, using tions, namely memory management (§ 2.5, § 2.11.5), existing micro-ops, a microcode implementation with the handling of hardware exceptions (§ 2.8.2) and inter- hardware support, which would use new micro-ops, and rupts (§ 2.12), and platform initialization (§ 2.13). a complete hardware implementation, using finite state machines (FSMs). The use of microcode assists can be measured using The main component of the MSROM is a table of the Precise Event Based Sampling (PEBS) feature in re- micro-ops [193, 194]. According to an example in a cent Intel processors. PEBS provides counters for the 2012 Intel patent [194], the table contains on the order number of micro-ops coming from MSROM, including of 20,000 micro-ops, and a micro-op has about 70 bits. complex instructions and assists, counters for the num- On embedded processors, like the Atom, microcode may bers of assists associated with some micro-op classes be partially compressed [193, 194]. (SSE and AVX stores and transitions), and a counter for The MSROM also contains an event ROM, which is an assists generated by all other micro-ops. array of pointers to event handling code in the micro-ops table [160]. Microcode events are hardware exceptions, The PEBS feature itself is implemented using mi- assists, and interrupts [24, 36, 149]. The processor de- crocode assists (this is implied in the SDM and con- scribed in a 1999 patent [160] has a 64-entry event table, firmed by [120]) when it needs to write the execution where the first 16 entries point to hardware exception context into a PEBS record. Given the wide range of handlers and the other entries are used by assists. features monitored by PEBS counters, we assume that all The execution units can issue an assist or signal a fault execution units in the core can issue microcode assists, by associating an event code with the result of a micro- which are performed at micro-op retirement. This find- op. When the micro-op is committed (§ 2.10), the event ing is confirmed by an Intel patent [24], and is supported code causes the out-of-order scheduler to squash all the by the existence of a PEBS counter for the “number of micro-ops that are in-flight in the ROB. The event code is microcode assists invoked by hardware upon micro-op forwarded to the microcode sequencer, which reads the writeback.” micro-ops in the corresponding event handler [24, 149]. The hardware exception handling logic (§ 2.8.2) and Intel’s optimization manual describes one more inter- interrupt handling logic (§ 2.12) is implemented entirely esting assist, from a memory system perspective. SIMD in microcode [149]. Therefore, changes to this logic are masked loads (using VMASKMOV) read a series of data relatively inexpensive to implement on Intel processors. elements from memory into a vector register. A mask This is rather fortunate, as the Intel architecture’s stan- register decides whether elements are moved or ignored. 
dard hardware exception handling process requires that If the memory address overlaps an invalid page (e.g., the the fault handler is trusted by the code that encounters P flag is 0, § 2.5), a microcode assist is issued, even if the exception (§ 2.8.2), and this assumption cannot be the mask indicates that no element from the invalid page satisfied by a design where the software executing in- should be read. The microcode checks whether the ele- side a secure container must be isolated from the system ments in the invalid page have the corresponding mask software managing the computer’s resources. bits set, and either performs the load or issues a page fault. The execution units in modern Intel processors support microcode procedures, via dedicated microcode call and The description of machine checks in the SDM men- return micro-ops [36]. The micro-ops manage a hard- tions page assists and page faults in the same context. ware data structure that conceptually stores a stack of We assume that the page assists are issued in some cases microcode instruction pointers, and is integrated with out- when a TLB miss occurs (§ 2.11.5) and the PMH has to of-order execution and hardware exceptions, interrupts walk the page table. The following section develops this and assists. assumption and provides supporting evidence from In- Asides from special micro-ops, microcode also em- tel’s assigned patents and published patent applications. ploys special load and store instructions, which turn into

29 special bus cycles, to issue commands to other functional not be able to file new patents for the same specifications, units [159]. The memory addresses in the special loads we cannot present newer patents with the information and stores encode commands and input parameters. For above. Fortunately, we were able to find newer patents example, stores to a certain range of addresses flush spe- that mention the techniques described above, proving cific TLB sets. their relevance to newer CPU models. Two 2014 patents [78, 154] mention that the PMH is 2.14.3 Microcode and Address Translation executing a FSM which issues stuffing loads to obtain Address translation (§ 2.5) is configured by CR3, which page table entries. A 2009 patent [62] mentions that stores the physical address of the top-level page table, microcode is invoked after a PMH walk, and that the and by various bits in CR0 and CR4, all of which are microcode can prevent the translation result produced by described in the SDM. Writes to these control registers the PMH from being written to the TLB. are implemented in microcode, which stores extra infor- A 2013 patent [83] and a 2014 patent [155] on scatter mation in microcode-visible registers [62]. / gather instructions disclose that the newly introduced When a TLB miss (§ 2.11.5) occurs, the memory exe- instructions use a combination of hardware in the ex- cution unit forwards the virtual address to the Page Miss ecution units that perform memory operations, which Handler (PMH), which performs the page walk needed include the PMH. The hardware issues microcode assists to obtain a physical address. In order to minimize the for slow paths, such as gathering vector elements stored latency of a page walk, the PMH is implemented as in uncacheable memory (§ 2.11.4), and operations that a Finite-State Machine (FSM) [78, 154]. Furthermore, cause Page Faults. the PMH fetches the page table entries from memory A 2014 patent on APIC (§ 2.12) virtualization [168] by issuing “stuffed loads”, which are special micro-ops describes a memory execution unit modification that in- that bypass the reorder buffer (ROB) and go straight vokes a microcode assist for certain memory accesses, to the memory execution units (§ 2.10), thus avoiding based on the contents of some range registers. The patent the overhead associated with out-of-order scheduling also mentions that the range registers are checked when [63, 78, 159]. the TLB miss occurs and the PMH is invoked, in or- The FSM in the PMH handles the fast path of the entire der to decide whether a fast hardware path can be used address translation process, which assumes no address for APIC virtualization, or a microcode assist must be translation fault (§ 2.8.2) occurs [63, 64, 149, 160], and issued. no page table entry needs to be modified [63]. The recent patents mentioned above allow us to con- When the PMH FSM detects the conditions that trigger clude that the PMH in recent processors still relies on an a Page Fault or a General Protection Fault, it commu- FSM and stuffed loads, and still uses microcode assists to nicates a microcode event code, corresponding to the handle infrequent and complex operations. This assump- detected fault condition, to the execution unit (§ 2.10) tion plays a key role in estimating the implementation responsible for memory operations [63, 64, 149, 160]. In complexity of architectural modifications targeting the turn, the execution unit triggers the fault by associating processor’s address translation mechanism. 
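As a way to visualize the division of labor described in this section, the toy model below mimics a PMH fast path that only fills the TLB when no fault occurs and no Accessed/Dirty update is needed, deferring everything else to a microcode assist. The single-level page table and the event names are illustrative; only the P/A/D bit positions mirror the real x86 page table entry format.

```python
# Illustrative model (not Intel's implementation) of the PMH fast path:
# the hardwired FSM handles fault-free walks, and raises a microcode event
# whenever it detects a fault or a PTE that needs its A/D bits updated.
P, A, D = 1 << 0, 1 << 5, 1 << 6   # Present, Accessed, Dirty flags

PAGE_FAULT_EVENT = "page-fault microcode event"
PAGE_WALK_ASSIST = "page-walk (A/D update) microcode assist"

def pmh_fast_path(page_table, virt_page, is_write):
    """Return (phys_page, event): event is None only on the fast path."""
    entry = page_table.get(virt_page, 0)        # "stuffed load" of the PTE
    if not entry & P:
        return None, PAGE_FAULT_EVENT           # fault: hand off to microcode
    needs_ad_update = (not entry & A) or (is_write and not entry & D)
    if needs_ad_update:
        return None, PAGE_WALK_ASSIST           # the FSM never writes PTEs
    return entry >> 12, None                    # fast path: fill the TLB

# A present page that was never accessed triggers the A/D assist, so the
# walk is redone by the (modeled) microcode handler rather than the FSM.
table = {0x42: (0x1234 << 12) | P}
print(pmh_fast_path(table, 0x42, is_write=False))
```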
the event code with the micro-op that caused the address 2.14.4 Microcode and Booting translation, as described in the previous section. The PMH FSM does not set the Accessed or Dirty The SDM states that microcode performs the Built-In attributes (§ 2.5.3) in page table entries. When it detects Self Test (BIST, § 2.13.2), but does not provide any de- that a page table entry must be modified, the FSM issues tails on the rest of the CPU’s hardware initialization. a microcode event code for a page walk assist [63]. The In fact, the entire SEC implementation on Intel plat- microcode handler performs the page walk again, setting forms is contained in the processor microcode [40, 41, the A and D attributes on page table entries when neces- 168]. This implementation has desirable security proper- sary [63]. This finding was indirectly confirmed by the ties, as it is significantly more expensive for an attacker description for a PEBS event in the most recent SDM to tamper with the MSROM circuitry (§ 2.14.2) than it release. is to modify the contents of the flash memory chip that The patents at the core of our descriptions above [24, stores the UEFI firmware. § 3.4.3 and § 3.6 describe 63, 64, 149, 160] were all issued between 1996 and 1999, the broad classes of attacks that an Intel platform can be which raises the concern of obsolescence. As Intel would subjected to.

30 The microcode that implements SEC performs MP update is signed with a 2048-bit RSA key and a (possibly initialization (§ 2.13.2), as suggested in the SDM. The non-standard) 256-bit hash algorithm, which agrees with microcode then places the BSP into Cache-as-RAM the findings above. (CAR) mode, looks up the PEI Authenticated Code Mod- The microcode update implementation places the ule (ACM) in the Firmware Interface Table (FIT), loads core’s cache into No-Evict Mode (NEM, documented the PEI ACM into the cache, and verifies its signature by the SDM) and copies the microcode update into the (§ 2.13.2) [40, 41, 144, 202, 203]. Given the structure of cache before verifying its signature [202]. The update fa- ACM signatures, we can conclude that Intel’s microcode cility also sets up an MTRR entry to protect the update’s contains implementations of RSA decryption and of a contents from modifications via DMA transfers [202] as variant of SHA hashing. it is verified and applied. The PEI ACM is executed from the CPU’s cache, after While Intel publishes the most recent microcode up- it is loaded by the microcode [40, 41, 202]. This removes dates for each of its CPU models, the release notes asso- the possibility for an attacker with physical access to the ciated with the updates are not publicly available. This SPI flash chip to change the firmware’s contents after the is unfortunate, as the release notes could be used to con- microcode computes its cryptographic hash, but before it firm guesses that certain features are implemented in is executed. microcode. On motherboards compatible with LaGrande Server However, some information can be inferred by read- Extensions (LT-SX, also known as Intel TXT for servers), ing through the Errata section in Intel’s Specification the firmware implementing PEI verifies that each CPU Updates [88, 104, 106]. The phrase “it is possible for connected to motherboard supports LT-SX, and powers BIOS5 to contain a workaround for this erratum” gen- off the CPU sockets that don’t hold processors that im- erally means that a microcode update was issued. For plement LT-SX [144]. This prevents an attacker from example, Errata AH in [88] implies that string instruc- tampering with a TXT-protected VM by hot-plugging tions (REP MOV) are implemented in microcode, which a CPU in a running computer that is inside TXT mode. was confirmed by Intel [12]. When a hot-plugged CPU passes security tests, a hy- Errata AH43 and AH91 in [88], and AAK73 in [104] pervisor is notified that a new CPU is available. The imply that address translation (§ 2.5) is at least partially hypervisor updates its internal state, and sends the new implemented in microcode. Errata AAK53, AAK63, CPU a SIPI. The new CPU executes a SIPI handler, in- and AAK70, AAK178 in [104], and BT138, BT210, side microcode, that configures the CPU’s state to match in [106] imply that VM entries and exits (§ 2.8.2) are the state expected by the TXT hypervisor [144]. This implemented in microcode, which is confirmed by the implies that the AP initialization described in § 2.13.2 is APIC virtualization patent [168]. implemented in microcode. 3 SECURITY BACKGROUND 2.14.5 Microcode Updates Most systems rely on some cryptographic primitives for The SDM explains that the microcode on Intel CPUs security. Unfortunately, these primitives have many as- can be updated, and describes the process for applying sumptions, and building a secure system on top of them an update. 
However, no detail about the contents of an is a highly non-trivial endeavor. It follows that a sys- update is provided. Analyzing Intel’s microcode updates tem’s security analysis should be particularly interested seems like a promising avenue towards discovering the in what cryptographic primitives are used, and how they microcode’s structure. Unfortunately, the updates have are integrated into the system. so far proven to be inscrutable [32]. § 3.1 and § 3.2 lay the foundations for such an anal- The microcode updates cannot be easily analyzed be- ysis by summarizing the primitives used by the secure cause they are encrypted, hashed with a cryptographic architectures of interest to us, and by describing the most like SHA-256, and signed using RSA or common constructs built using these primitives. § 3.3 elliptic curve cryptography [202]. The update facility builds on these concepts and describes software attesta- is implemented entirely in microcode, including the de- tion, which is the most popular method for establishing cryption and signature verification [202]. 5Basic Input/Output System (BIOS) is the predecessor of UEFI- [75] independently used fault injection and timing based firmware. Most Intel documentation, including the SDM, still analysis to conclude that each recent uses the term BIOS to refer to firmware.
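Since the update format is undocumented, the sketch below only illustrates the generic verify-before-apply pattern implied above, using PKCS #1 v1.5 RSA signatures over SHA-256 (via the pyca/cryptography package) as a stand-in for Intel's actual, unpublished scheme.

```python
# Hedged sketch of the "verify before apply" pattern described above.
# The real update format, hash algorithm, and signature scheme are not
# publicly documented; PKCS #1 v1.5 RSA over SHA-256 is only a stand-in.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def verify_update(vendor_public_key, update_body: bytes, signature: bytes) -> bool:
    """Accept the update only if the vendor's signature covers its contents."""
    try:
        vendor_public_key.verify(
            signature, update_body, padding.PKCS1v15(), hashes.SHA256()
        )
        return True
    except InvalidSignature:
        return False
```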

31 trust in a secure architecture. Guarantee Primitive Having looked at the cryptographic foundations for Confidentiality Encryption building secure systems, we turn our attention to the Integrity MAC / Signatures attacks that secure architectures must withstand. Asides Freshness Nonces + integrity from forming a security checklist for architecture design, Table 10: Desirable security guarantees and primitives that provide these attacks build intuition for the design decisions in them the architectures of interest to us. Guarantee Symmetric Asymmetric The attacks that can be performed on a computer sys- Keys Keys tem are broadly classified into physical attacks and soft- Confidentiality AES-GCM, RSA with ware attacks. In physical attacks, the attacker takes ad- AES-CTR PKCS #1 v2.0 vantage of a system’s physical implementation details Integrity HMAC-SHA-2 DSS-RSA, to perform an operation that bypasses the limitations set AES-GCM DSS-ECC by the computer system’s software abstraction layers. In software attacks Table 11: Popular cryptographic primitives that are considered to contrast, are performed solely by execut- be secure against today’s adversaries ing software on the victim computer. § 3.4 summarizes the main types of physical attacks. A message whose confidentiality is protected can be The distinction between software and physical attacks transmitted over an insecure medium without an adver- is particularly relevant in cloud computing scenarios, sary being able to obtain the information in the message. where gaining software access to the computer running When integrity protection is used, the receiver is guaran- a victim’s software can be accomplished with a credit teed to either obtain a message that was transmitted by card backed by modest funds [157], whereas physical the sender, or to notice that an attacker tampered with access is a more difficult prospect that requires trespass, the message’s content. coercion, or social engineering on the cloud provider’s When multiple messages get transmitted over an un- employees. trusted medium, a freshness guarantee assures the re- However, the distinction between software and phys- ceiver that she will obtain the latest message coming ical attacks is blurred by the attacks presented in § 3.6, from the sender, or will notice an attack. A freshness which exploit programmable peripherals connected to guarantee is stronger than the equivalent integrity guar- the victim computer’s bus in order to carry out actions antee, because the latter does not protect against replay that are normally associated with physical attacks. attacks where the attacker replaces a newer message with While the vast majority of software attacks exploit an older message coming from the same sender. a bug in a software component, there are a few attack The following example further illustrates these con- classes that deserve attention from architecture designers. cepts. Suppose Alice is a wealthy investor who wishes Memory mapping attacks, described in § 3.7, become a to either BUY or SELL an item every day. Alice cannot possibility on architectures where the system software is trade directly, and must relay her orders to her broker, not trusted. Cache timing attacks, summarized in § 3.8 Bob, over a network connection owned by Eve. 
exploit microarchitectural behaviors that are completely A communication system with confidentiality guaran- observable in software, but dismissed by the security tees would prevent Eve from distinguishing between a analyses of most systems. BUY and a SELL order, as illustrated in Figure 32. With- out confidentiality, Eve would know Alice’s order before 3.1 Cryptographic Primitives it is placed by Bob, so Eve would presumably gain a This section overviews the cryptosystems used by se- financial advantage at Alice’s expense. cure architectures. We are interested in cryptographic A system with integrity guarantees would prevent Eve primitives that guarantee confidentiality, integrity, and from replacing Alice’s message with a false order, as freshness, and we treat these primitives as black boxes, shown in Figure 33. In this example, without integrity focusing on their use in larger systems. [116] covers the guarantees, Eve could replace Alice’s message with a mathematics behind cryptography, while [51] covers the SELL-EVERYTHING order, and buy Alice’s assets at a topic of building systems out of cryptographic primitives. very low price. Tables 10 and 11 summarize the primitives covered in Last, a communication system that guarantees fresh- this section. ness would ensure that Eve cannot perform the replay

Each cryptographic primitive has an associated key generation algorithm that uses random data to produce

Alice Eavesdrop Bob a unique key. The random data is produced by a cryp- tographically strong pseudo-random number generator Buy Yes (CSPRNG) that expands a small amount of random seed Eve Sell No data into a much larger amount of data, which is compu- Figure 32: In a confidentiality attack, Eve sees the message sent by tationally indistinguishable from true random data. The Alice to Bob and can understand the information inside it. In this random seed must be obtained from a true source of ran- case, Eve can tell that the message is a buy order, and not a sell order. domness whose output cannot be predicted by an adver- Network sary, such as the least significant bits of the temperature Eve’s Message readings coming from a hardware sensor. Alice Bob Symmetric key cryptography requires that all the par- Drop Send own message message ties in the system establish a shared secret key, which is usually referred to as “the key”. Typically, one party Eve Sell Everything executes the key generation algorithm and securely trans- mits the resulting key to the other parties, as illustrated Figure 33: In an integrity attack, Eve replaces Alice’s message with in Figure 35. The channel used to distribute the key must her own. In this case, Eve sends Bob a sell-everything order. In this provide confidentiality and integrity guarantees, which case, Eve can tell that the message is a buy order, and not a sell order. is a non-trivial logistical burden. The symmetric key attack pictured in Figure 34, where she would replace primitives mentioned here do not make any assumption Alice’s message with an older message. Without fresh- about the key, so the key generation algorithm simply ness guarantees, Eve could mount the following attack, grabs a fixed number of bits from the CSPRNG. which bypasses both confidentiality and integrity guaran- Hardware Sensor tees. Over a few days, Eve would copy and store Alice’s Random Seed messages from the network. When an order would reach Bob, Eve would observe the market and determine if the Cryptographically Secure order was BUY or SELL. After building up a database Pseudo-Random Number of messages labeled BUY or SELL, Eve would replace Generator (CSPRNG) Alice’s message with an old message of her choice. random data Bob Alice Key Generation Secret private Secret Algorithm Key communication Key

Figure 35: In symmetric key cryptography, a secret key is shared by the parties that wish to communicate securely. The defining feature of asymmetric key cryptography is that it does not require a private channel for key distri- bution. Each party executes the key generation algorithm, Figure 34: In a freshness attack, Eve replaces Alice’s message with a message that she sent at an earlier time. In this example, Eve builds which produces a private key and a public key that are a database of labeled messages over time, and is able to send Bob her mathematically related. Each party’s public key is dis- choice of a BUY or a SELL order. tributed to the other parties over a channel with integrity guarantees, as shown in Figure 36. Asymmetric key 3.1.1 Cryptographic Keys primitives are more flexible than their symmetric coun- All cryptographic primitives that we describe here rely terparts, but are more complicated and consume more on keys, which are small pieces of information that must computational resources. only be disclosed according to specific rules. A large part of a system’s security analysis focuses on ensuring that 3.1.2 Confidentiality the keys used by the underlying cryptographic primitives Many cryptosystems that provide integrity guarantees are produced and handled according to the primitives’ are built upon block ciphers that operate on fixed-size assumptions. message blocks. The sender transforms a block using an

Figure 36: An asymmetric key generation algorithm produces a private key and an associated public key. The private key is held confidential, while the public key is given to any party who wishes to securely communicate with the private key's holder.

The most popular block cipher based on symmetric keys at the time of this writing is the Advanced Encryption Standard (AES) [39, 141], with two variants that operate on 128-bit blocks using 128-bit keys or 256-bit keys. AES is a secure permutation function, as it can transform any 128-bit block into another 128-bit block. Recently, the United States National Security Agency (NSA) required the use of 256-bit AES keys for protecting sensitive information [143].

The most deployed asymmetric key block cipher is the Rivest-Shamir-Adleman (RSA) [158] algorithm. RSA has variable key sizes, and 3072-bit key pairs are considered to provide the same security as 128-bit AES keys [20].

A block cipher does not necessarily guarantee confidentiality when used on its own. A noticeable issue is that, in our previous example, a block cipher would generate the same encrypted output for any of Alice's BUY orders, as they all have the same content. Furthermore, each block cipher has its own assumptions that can lead to subtle vulnerabilities if the cipher is used directly.

Symmetric key block ciphers are combined with operating modes to form symmetric encryption schemes. Most operating modes require a random initialization vector (IV) to be used for each message, as shown in Figure 39. When analyzing the security of systems based on these cryptosystems, an understanding of the IV generation process is as important as ensuring the confidentiality of the encryption key.
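The points above — a random IV for every message, layered on top of AES by an operating mode — can be illustrated with the pyca/cryptography package. AES-GCM, an authenticated mode discussed in § 3.1.3, is used here as the operating mode; the nonce plays the role of the IV and must never repeat under the same key.

```python
# Minimal sketch of a symmetric encryption scheme built from AES plus an
# operating mode (AES-GCM), using the pyca/cryptography package. The key
# point from the text: a fresh, never-reused nonce (IV) accompanies every
# message encrypted under the same key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)    # 128-bit AES key from a CSPRNG
aesgcm = AESGCM(key)

def encrypt(message: bytes) -> tuple[bytes, bytes]:
    nonce = os.urandom(12)                   # random IV, new for each message
    return nonce, aesgcm.encrypt(nonce, message, None)

def decrypt(nonce: bytes, ciphertext: bytes) -> bytes:
    return aesgcm.decrypt(nonce, ciphertext, None)

nonce, ct = encrypt(b"BUY")
assert decrypt(nonce, ct) == b"BUY"
```

Because a fresh nonce is drawn for every call, two encryptions of the same BUY order produce different ciphertexts, avoiding the repeated-output problem described above.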

Figure 37: In a symmetric key secure permutation (block cipher), the same secret key must be provided to both the encryption and the decryption algorithm.


Bob’s Bob’s Public Encryption Decryption Private Network Key Key Encrypted IV Message

Figure 38: In an asymmetric key block cipher, the encryption algorithm operates on a public key, and the decryption algorithm uses the corresponding private key.

Figure 39: Symmetric key block ciphers are combined with operating modes. Most operating modes require a random initialization vector (IV) to be generated for each encrypted message.

Counter (CTR) and Cipher Block Chaining (CBC) are examples of operating modes recommended [45] by

the United States National Institute of Standards and Technology (NIST), which informs the NSA's requirements. Combining a block cipher, such as AES, with an operating mode, such as CTR, results in an encryption method, such as AES-CTR, which can be used to add confidentiality guarantees.

In the asymmetric key setting, there is no concept equivalent to operating modes. Each block cipher has its own assumptions, and requires a specialized scheme for general-purpose usage.

The RSA algorithm is used in conjunction with padding methods, the most popular of which are the methods described in the Public-Key Cryptography Standard (PKCS) #1 versions 1.5 [112] and 2.0 [113]. A security analysis of a system that uses RSA-based encryption must take the padding method into consideration. For example, the padding in PKCS #1 v1.5 can leak the private key under certain circumstances [23]. While PKCS #1 v2.0 solves this issue, it is complex enough that some implementations have their own security issues [134].

Asymmetric encryption algorithms have much higher computational requirements than symmetric encryption algorithms. Therefore, when non-trivial quantities of data are encrypted, the sender generates a single-use secret key that is used to encrypt the data, and encrypts the secret key with the receiver's public key, as shown in Figure 40.

Figure 40: Asymmetric key encryption is generally used to bootstrap a symmetric key encryption scheme.

3.1.3 Integrity

Many cryptosystems that provide integrity guarantees are built upon secure hashing functions. These hash functions operate on an unbounded amount of input data and produce a small fixed-size output. Secure hash functions have a few guarantees, such as pre-image resistance, which states that an adversary cannot produce input data corresponding to a given hash output.

At the time of this writing, the most popular secure hashing function is the Secure Hashing Algorithm (SHA) [48]. However, due to security issues in SHA-1 [173], new software is recommended to use at least 256-bit SHA-2 [21] for secure hashing.

The SHA hash functions are members of a large family of block hash functions that consume their input in fixed-size message blocks, and use a fixed-size internal state. A block hash function is used as shown in Figure 41. An INITIALIZE algorithm is first invoked to set the internal state to its initial values. An EXTEND algorithm is executed for each message block in the input. After the entire input is consumed, a FINALIZE algorithm produces the hash output from the internal state.

Figure 41: A block hash function operates on fixed-size message blocks and uses a fixed-size internal state.

In the symmetric key setting, integrity guarantees are obtained using a Message Authentication Code (MAC) cryptosystem, illustrated in Figure 42. The sender uses a MAC algorithm that reads in a symmetric key and a variable-length message, and produces a fixed-length, short MAC tag. The receiver provides the original message, the symmetric key, and the MAC tag to a MAC verification algorithm that checks the authenticity of the message.

The key property of MAC cryptosystems is that an adversary cannot produce a MAC tag that will validate a message without the secret key.

Figure 42: In the symmetric key setting, integrity is assured by computing a Message Authentication Code (MAC) tag and transmitting it over the network along with the message. The receiver feeds the MAC tag into a verification algorithm that checks the message's authenticity.

Many MAC cryptosystems do not have a separate MAC verification algorithm. Instead, the receiver checks the authenticity of the MAC tag by running the same algorithm as the sender to compute the expected MAC tag for the received message, and compares the output with the MAC tag received from the network.

This is the case for the Hash Message Authentication Code (HMAC) [124] generic construction, whose operation is illustrated in Figure 43. HMAC can use any secure hash function, such as SHA, to build a MAC cryptosystem.

Figure 43: In the symmetric key setting, integrity is assured by computing a Hash-based Message Authentication Code (HMAC) and transmitting it over the network along with the message. The receiver re-computes the HMAC and compares it against the version received from the network.

Asymmetric key primitives that provide integrity guarantees are known as signatures. The message sender provides her private key to a signing algorithm, and transmits the output signature along with the message, as shown in Figure 44. The message receiver feeds the sender's public key and the signature to a signature verification algorithm, which returns TRUE if the message matches the signature, and FALSE if the message has been tampered with.

Figure 44: Signature schemes guarantee integrity in the asymmetric key setting. Signatures are created using the sender's private key, and are verified using the corresponding public key. A cryptographically secure hash function is usually employed to reduce large messages to small hashes, which are then signed.

Signing algorithms can only operate on small messages and are computationally expensive. Therefore, in practice, the message to be transmitted is first run through a cryptographically strong hash function, and the hash is provided as the input to the signing algorithm.

At the time of this writing, the most popular choice for guaranteeing integrity in shared secret settings is HMAC-SHA, an HMAC function that uses SHA for hashing. Authenticated encryption, which combines a block cipher with an operating mode that offers both confidentiality and integrity guarantees, is often an attractive alternative to HMAC. The most popular authenticated encryption operating mode is the Galois/Counter Mode (GCM) [137], which has earned NIST's recommendation [47] when combined with AES to form AES-GCM.

The most popular signature scheme combines the RSA encryption algorithm with a padding scheme specified in PKCS #1, as illustrated in Figure 45. Recently, elliptic curve cryptography (ECC) [121] has gained a surge in popularity, thanks to its smaller key sizes. For example, a 384-bit ECC key is considered to be as secure as a 3072-bit RSA key [20, 143]. The NSA requires the Digital Signature Standard (DSS) [142], which specifies schemes based on RSA and ECC.

Figure 45: The RSA signature scheme with PKCS #1 v1.5 padding specified in RFC 3447 combines a secure hash of the signed message with a DER-encoded specification of the secure hash algorithm used by the signature, and a padding string whose bits are all set to 1. Everything except for the secure hash output is considered to be a part of the PKCS #1 v1.5 padding.
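Returning to the symmetric key setting, the HMAC-SHA construction mentioned above maps onto Python's standard library as sketched below; the shared key is assumed to have been distributed securely, and verification recomputes the tag and compares it in constant time.

```python
# Minimal sketch of the HMAC pattern described above, using the standard
# library. Verification recomputes the expected tag from the received
# message and compares it against the received tag.
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)                 # shared symmetric key

def mac(message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(mac(message), tag)

tag = mac(b"BUY")
assert verify(b"BUY", tag)
assert not verify(b"SELL", tag)               # tampered message is rejected
```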

3.1.4 Freshness

Freshness guarantees are typically built on top of a system that already offers integrity guarantees, by adding a unique piece of information to each message. The main challenge in freshness schemes comes down to economically maintaining the state needed to generate the unique pieces of information on the sender side, and verify their uniqueness on the receiver side.

A popular solution for gaining freshness guarantees relies on nonces, single-use random numbers. Nonces are attractive because the sender does not need to maintain any state; the receiver, however, must store the nonces of all received messages.

Nonces are often combined with a message timestamping and expiration scheme, as shown in Figure 46. An expiration can greatly reduce the receiver's storage requirement, as the nonces for expired messages can be safely discarded. However, the scheme depends on the sender and receiver having synchronized clocks. The message expiration time is a compromise between the desire to reduce storage costs, and the need to tolerate clock skew and delays in message transmission and processing.

Figure 46: Freshness guarantees can be obtained by adding timestamped nonces on top of a system that already offers integrity guarantees. The sender and the receiver use synchronized clocks to timestamp each message and discard unreasonably old messages. The receiver must check the nonce in each new message against a database of the nonces in all the unexpired messages that it has seen.

Alternatively, nonces can be used in challenge-response protocols, in a manner that removes the storage overhead concerns. The challenger generates a nonce and embeds it in the challenge message. The response to the challenge includes an acknowledgement of the embedded nonce, so the challenger can distinguish between a fresh response and a replay attack. The nonce is only stored by the challenger, and is small in comparison to the rest of the state needed to validate the response.
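A minimal sketch of the timestamped-nonce scheme of Figure 46 is shown below, assuming roughly synchronized clocks and an integrity layer (e.g., HMAC) protecting each message; the 60-second expiration window and the field names are illustrative choices, not part of the scheme described in the text.

```python
# Minimal sketch of Figure 46's timestamped-nonce freshness check.
import secrets
import time

EXPIRATION = 60.0                      # seconds a message stays valid (illustrative)

def make_message(payload: bytes) -> dict:
    return {"payload": payload, "timestamp": time.time(),
            "nonce": secrets.token_bytes(16)}

class Receiver:
    def __init__(self):
        self.seen = {}                 # nonce -> timestamp of unexpired messages

    def accept(self, msg: dict) -> bool:
        now = time.time()
        # Drop expired nonces so the database stays small.
        self.seen = {n: t for n, t in self.seen.items() if now - t < EXPIRATION}
        if now - msg["timestamp"] >= EXPIRATION:
            return False               # unreasonably old: reject
        if msg["nonce"] in self.seen:
            return False               # replayed: reject
        self.seen[msg["nonce"]] = msg["timestamp"]
        return True
```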
3.2 Cryptographic Constructs

This section summarizes two constructs that are built on the cryptographic primitives described in § 3.1, and are used in the rest of this work.

3.2.1 Certificate Authorities

Asymmetric key cryptographic primitives assume that each party has the correct public keys for the other parties. This assumption is critical, as the entire security argument of an asymmetric key system rests on the fact that certain operations can only be performed by the owners of the private keys corresponding to the public keys. More concretely, if Eve can convince Bob that her own public key belongs to Alice, Eve can produce message signatures that seem to come from Alice.

The introductory material in § 3.1 assumed that each party transmits their public key over a channel with integrity guarantees. In practice, this is not a reasonable assumption, and the secure distribution of public keys is still an open research problem.

The most widespread solution to the public key distribution problem is the Certificate Authority (CA) system, which assumes the existence of a trusted authority whose public key is securely transmitted to all the other parties in the system.

The CA is responsible for securely obtaining the public key of each party, and for issuing a certificate that binds a party's identity (e.g., “Alice”) to its public key, as shown in Figure 47.
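The binding just described, and the validation steps discussed below (Figure 48), can be modeled with a small data structure; the field and helper names here are illustrative, and the signature check itself is left abstract.

```python
# Illustrative model of the certificate fields of Figure 47 and the checks
# of Figure 48. The verify_signature callable stands in for a real signature
# scheme (e.g., RSA or ECC, as discussed in § 3.1.3).
from dataclasses import dataclass

@dataclass
class Certificate:
    subject_identity: str
    subject_public_key: bytes
    valid_from: float
    valid_until: float
    usage: str                  # e.g., "sign e-mail messages"
    issuer_public_key: bytes
    signature: bytes            # issuer's signature over all fields above

def validate(cert: Certificate, now: float, expected_subject: str,
             expected_usage: str, trusted_issuer_key: bytes,
             verify_signature) -> bool:
    """Mirror Figure 48: expected subject, validity window, usage,
    trusted issuer, and a valid issuer signature."""
    return (cert.subject_identity == expected_subject
            and cert.valid_from <= now <= cert.valid_until
            and cert.usage == expected_usage
            and cert.issuer_public_key == trusted_issuer_key
            and verify_signature(cert))
```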


Figure 47: A certificate is a statement signed by a certificate author- Yes ity (issuer) binding the identity of a subject to a public key. Valid No A certificate is essentially a cryptographic signature signature? produced by the private key of the certificate’s issuer, Yes who is generally a CA. The message signed by the issuer states that a public key belongs to a subject. The cer- Accept Reject tificate message generally contains identifiers that state Public Key Certificate the intended use of the certificate, such as “the key in Figure 48: A certificate issued by a CA can be validated by any this certificate can only be used to sign e-mail messages”. party that has securely obtained the CA’s public key. If the certificate The certificate message usually also includes an identifier is valid, the subject public key contained within can be trusted to for the issuer’s certification policy, which summarizes belong to the subject identified by the certificate. the means taken by the issuer to ensure the authenticity of the subject’s public key. turn, are responsible for generating certificates for the A major issue in a CA system is that there is no obvi- other parties in the system, as shown in Figure 49. ous way to revoke a certificate. A revocation mechanism In hierarchical CA systems, the only public key that is desirable to handle situations where a party’s private gets distributed securely to all the parties is the root key is accidentally exposed, to avoid having an attacker CA’s public key. Therefore, when two parties wish to use the certificate to impersonate the compromised party. interact, each party must present their own certificate, as While advanced systems for certificate revocation have well as the certificate of the issuing CA. For example, been developed, the first line of defense against key com- given the hierarchy in Figure 49, Alice would prove the promise is adding expiration dates to certificates. authenticity of her public key to Bob by presenting her In a CA system, each party presents its certificate certificate, as well as the certificate of Intermediate CA along with its public key. Any party that trusts the CA 1. Bob would first use the steps in Figure 48 to validate and has obtained the CA’s public key securely can verify Intermediate CA 1’s certificate against the root CA’s any certificate using the process illustrated in Figure 48. public key, which would assure him of the authenticity of One of the main drawbacks of the CA system is that Intermediate CA 1’s public key. Bob would then validate the CA’s private key becomes a very attractive attack tar- Alice’s certificate using Intermediate CA 1’s public key, get. This issue is somewhat mitigated by minimizing the which he now trusts. use of the CA’s private key, which reduces the opportuni- In most countries, the government issues ID cards for ties for its compromise. The authority described above its citizens, and therefore acts as as a certificate authority. becomes the root CA, and their private key is only used An ID card, shown in Figure 50, is a certificate that binds to produce certificates for the intermediate CAs who, in a subject’s identity, which is a full legal name, to the

38 subject’s physical appearance, which is used as a public key. The CA system is very similar to the identity document (ID card) systems used to establish a person’s identity, and a comparison between the two may help further the Secure Storage reader’s understanding of the concepts in the CA system. Root CA’s Private Key Subject Public Key Certificate Signature Root CA’s Public Key Root CA is replaced by physical Fictional Country security features



Figure 50: An ID card is a certificate that binds a subject's full legal name (identity) to the subject's physical appearance, which acts as a public key.

Each government's ID card issuing operations are regulated by laws, so an ID card's issue date can be used to track down the laws that make up its certification policy. Last, the security of ID cards does not (yet) rely on cryptographic primitives. Instead, ID cards include physical security measures designed to deter tampering and prevent counterfeiting.

3.2.2 Key Agreement Protocols

The initial design of symmetric key primitives, introduced in § 3.1, assumed that when two parties wish to interact, one party generates a secret key and shares it with the other party using a communication channel with confidentiality and integrity guarantees. In practice, a pre-existing secure communication channel is rarely available.

Alice’s Public Key Bob’s Public Key Key agreement protocols are used by two parties to establish a shared secret key, and only require a com- munication channel with integrity guarantees. Figure 51

outlines the Diffie-Hellman Key Exchange (DKE) [43] protocol, which should give the reader an intuition for how key agreement protocols work.

Figure 49: A hierarchical CA structure minimizes the usage of the root CA's private key, reducing the opportunities for it to get compromised. The root CA only signs the certificates of intermediate CAs, which sign the end users' certificates.

This work is interested in using key agreement protocols to build larger systems, so we will neither explain the mathematical details in DKE, nor prove its correctness. We note that both Alice and Bob derive the same shared secret key, K = g^(AB) mod p, without ever transmitting K. Furthermore, the messages transmitted in DKE, namely g^A mod p and g^B mod p, are not sufficient

for an eavesdropper Eve to determine K, because efficiently solving for x in g^x mod p is an open problem assumed to be very difficult.

Figure 51: In the Diffie-Hellman Key Exchange (DKE) protocol, Alice and Bob agree on a shared secret key K = g^(AB) mod p. An adversary who observes g^A mod p and g^B mod p cannot compute K.
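The exchange outlined in Figure 51 can be reproduced with Python integers; the tiny modulus below keeps the example readable and offers no security, whereas a real deployment would use a large standardized group (for example, the RFC 3526 MODP groups).

```python
# Sketch of the DKE exchange of Figure 51. The toy prime below is only for
# readability and is NOT secure; the base g is likewise illustrative.
import secrets

p = 0xFFFFFFFB  # 2^32 - 5, a small prime; real systems use much larger primes
g = 5           # public base (illustrative)

a = secrets.randbelow(p - 2) + 1        # Alice's secret exponent A
b = secrets.randbelow(p - 2) + 1        # Bob's secret exponent B

msg_a = pow(g, a, p)                    # Alice transmits g^A mod p
msg_b = pow(g, b, p)                    # Bob transmits g^B mod p

k_alice = pow(msg_b, a, p)              # (g^B)^A mod p
k_bob = pow(msg_a, b, p)                # (g^A)^B mod p
assert k_alice == k_bob                 # both derive K = g^(AB) mod p
```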

Key agreement protocols require a communication channel with integrity guarantees. If an active adversary Eve can tamper with the messages transmitted by Alice and Bob, she can perform a man-in-the-middle (MITM) attack, as illustrated in Figure 52.

Figure 52: Any key agreement protocol is vulnerable to a man-in-the-middle (MITM) attack. The active attacker performs key agreements and establishes shared secrets with both parties. The attacker can then forward messages between the victims, in order to observe their communication. The attacker can also send its own messages to either, impersonating the other victim.

In a MITM attack, Eve intercepts Alice's first key exchange message, and sends Bob her own message. Eve then intercepts Bob's response and replaces it with her own, which she sends to Alice. Eve effectively performs key exchanges with both Alice and Bob, establishing a shared secret with each of them, with neither Bob nor Alice being aware of her presence.

After establishing shared keys with both Alice and Bob, Eve can choose to observe the communication between Alice and Bob, by forwarding messages between them. For example, when Alice transmits a message, Eve can decrypt it using K1, the shared key between herself and Alice. Eve can then encrypt the message with K2, the key established between Bob and herself. While Bob still receives Alice's message, Eve has been able to see its contents.

Furthermore, Eve can impersonate either party in the communication. For example, Eve can create a message, encrypt it with K2, and then send it to Bob. As Bob thinks that K2 is a shared secret key established between himself and Alice, he will believe that Eve's message comes from Alice.

MITM attacks on key agreement protocols can be foiled by authenticating the party who sends the last message in the protocol (in our examples, Bob) and having them sign the key agreement messages. When a CA system is in place, Bob uses his private key to sign the messages in the key agreement and also sends Alice his certificate, along with the certificates for any intermediate CAs. Alice validates Bob's certificate, ensures that the subject identified by the certificate is whom she expects (Bob), and verifies that the key agreement messages exchanged between herself and Bob match the signature provided by Bob.

In conclusion, a key agreement protocol can be used to bootstrap symmetric key primitives from an asymmetric key signing scheme, where only one party needs to be able to sign messages.

3.3 Software Attestation Overview

The security of systems that employ trusted processors hinges on software attestation. The software running inside an isolated container established by trusted hardware can ask the hardware to sign (§ 3.1.3) a small piece of attestation data, producing an attestation signature. Aside from the attestation data, the signed message includes a measurement that uniquely identifies the software inside the container. Therefore, an attestation signature can be used to convince a verifier that the attestation data was produced by a specific piece of software, which is hosted inside a container that is isolated by trusted hardware from outside interference.

Each hardware platform discussed in this section uses a slightly different software attestation scheme. Platforms differ by the amount of software that executes

40 inside an isolated container, by the isolation guarantees Manufacturer Certificate Authority provided to the software inside a container, and by the process used to obtain a container’s measurement. The PubRK PrivRK Manufacturer Root Key

threat model and security properties of each trusted hardware platform follow directly from the design choices outlined above, so a good understanding of attestation is a prerequisite to discussing the differences between existing platforms.

3.3.1 Authenticated Key Agreement

Software attestation can be combined with a key agreement protocol (§ 3.2.2), as software attestation provides the authentication required by the key agreement protocol. The resulting protocol can assure a verifier that it has established a shared secret with a specific piece of software, hosted inside an isolated container created by trusted hardware. The next paragraph outlines the augmented protocol, using Diffie-Hellman Key Exchange (DKE) [43] as an example of the key exchange protocol.

The verifier starts executing the key exchange protocol, and sends the first message, gA, to the software inside the secure container. The software inside the container produces the second key exchange message, gB, and asks the trusted hardware to attest the cryptographic hash of both key exchange messages, h(gA||gB). The verifier receives the second key exchange message and the attestation signature, and authenticates the software inside the secure container by checking all the signatures along the attestation chain of trust shown in Figure 53.

Figure 53: The chain of trust in software attestation. The root of trust is a manufacturer key, which produces an endorsement certificate for the secure processor's attestation key. The processor uses the attestation key to produce the attestation signature, which contains a cryptographic hash of the container and a message produced by the software inside the container.

The chain of trust used in software attestation is rooted at a signing key owned by the hardware manufacturer, which must be trusted by the verifier. The manufacturer acts as a Certificate Authority (CA, § 3.2.1), and provisions each secure processor that it produces with a unique attestation key, which is used to produce attestation signatures. The manufacturer also issues an endorsement certificate for each secure processor's attestation key. The certificate indicates that the key is meant to be used for software attestation. The certification policy generally states that, at the very least, the private part of the attestation key be stored in tamper-resistant hardware, and only be used to produce attestation signatures.

A secure processor identifies each isolated container by storing a cryptographic hash of the code and data loaded inside the container. When the processor is asked to sign a piece of attestation data, it uses the cryptographic hash associated with the container as the measurement in the attestation signature. After a verifier validates the processor's attestation key using its endorsement certificate, the verifier ensures that the signature is valid, and that the measurement in the signature belongs to the software with which it expects to communicate. Having checked all the links in the attestation chain, the verifier has authenticated the other party in the key exchange, and is assured that it now shares a secret with the software that it expects, running in an isolated container on hardware that it trusts.

3.3.2 The Role of Software Measurement

The measurement that identifies the software inside a secure container is always computed using a secure hashing algorithm (§ 3.1.3). Trusted hardware designs differ in their secure hash function choices, and in the data provided to the hash function. However, all the designs share the principle that each step taken to build a secure container contributes data to its measurement hash.

The philosophy behind software attestation is that the computer's owner can load any software she wishes in a secure container. However, the computer owner is assumed to have an incentive to participate in a distributed system where the secure container she built is authenticated via software attestation. Without the requirement to undergo software attestation, the computer owner can build any container without constraints, which would make it impossible to reason about the security properties of the software inside the container.
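As a concrete illustration of the protocol outlined in § 3.3.1 and the measurement check described above, the following C sketch shows the verifier's side of an attestation-authenticated key exchange. This is a minimal sketch rather than any specific vendor's protocol: cert_check, sig_check and sha256 are hypothetical stand-ins for a real cryptographic library, and the exact layout of the signed attestation data varies between designs.

    /* Sketch of the verifier's checks in the attestation-augmented
     * Diffie-Hellman exchange from Section 3.3.1.  The helper functions
     * are hypothetical placeholders; the structure of the checks, not
     * the API, is the point. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint8_t bytes[32]; } hash_t;

    /* Hypothetical primitives assumed to exist elsewhere. */
    hash_t sha256(const uint8_t *data, size_t len);
    bool cert_check(const uint8_t *endorsement_cert, const uint8_t *attestation_pubkey);
    bool sig_check(const uint8_t *pubkey, const uint8_t *msg, size_t len, const uint8_t *sig);

    bool verify_attested_exchange(
            const uint8_t *transcript, size_t transcript_len,   /* gA || gB */
            const uint8_t *endorsement_cert,
            const uint8_t *attestation_pubkey,
            const uint8_t *attestation_sig,
            const hash_t  *reported_measurement,
            const hash_t  *expected_measurement)
    {
        /* 1. Endorsement certificate: the manufacturer vouches that the
              attestation key belongs to tamper-resistant hardware. */
        if (!cert_check(endorsement_cert, attestation_pubkey))
            return false;

        /* 2. Attestation signature: must cover the container's measurement
              and h(gA || gB), binding the key exchange to the container. */
        hash_t th = sha256(transcript, transcript_len);
        uint8_t signed_data[sizeof(hash_t) * 2];
        memcpy(signed_data, reported_measurement->bytes, sizeof(hash_t));
        memcpy(signed_data + sizeof(hash_t), th.bytes, sizeof(hash_t));
        if (!sig_check(attestation_pubkey, signed_data, sizeof(signed_data), attestation_sig))
            return false;

        /* 3. The measurement must match the software the verifier expects. */
        return memcmp(reported_measurement->bytes,
                      expected_measurement->bytes, sizeof(hash_t)) == 0;
    }

The ordering matters: the endorsement certificate authenticates the attestation key, the attestation signature binds the key exchange transcript to the container's measurement, and only then is the measurement compared against the expected value.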

By the argument above, a trusted hardware design based on software attestation must assume that each container is involved in software attestation, and that the remote party will refuse to interact with a container whose reported measurement does not match the expected value set by the distributed system's author.

For example, a cloud infrastructure provider should be able to use the secure containers provided by trusted hardware to run any software she wishes on her computers. However, the provider makes money by renting her infrastructure to customers. If security-savvy customers are only willing to rent containers provided by trusted hardware, and use software attestation to authenticate the containers that they use, the cloud provider will have a strong financial incentive to build the customers' containers according to their specifications, so that the containers pass the software attestation.

A container's measurement is computed using a secure hashing algorithm, so the only method of building a container that matches an expected measurement is to follow the exact sequence of steps specified by the distributed system's author. The cryptographic properties of the secure hash function guarantee that if the computer's owner strays in any way from the prescribed sequence of steps, the measurement of the created container will not match the value expected by the distributed system's author, so the container will be rejected by the software attestation process.

Therefore, it makes sense to state that a trusted hardware design's measurement scheme guarantees that a property has a certain value in a secure container. The precise meaning of this phrase is that the property's value determines the data used to compute the container's measurement, so an expected measurement hash effectively specifies an expected value for the property. All containers in a distributed system that correctly uses software attestation will have the desired value for the given property.

For example, the measuring scheme used by trusted hardware designed for cloud infrastructure should guarantee that the container's memory was initialized using the customer's content, often referred to as an image.

3.4 Physical Attacks

Physical attacks are generally classified according to their cost, which factors in the equipment needed to carry out the attack and the attack's complexity. Joe Grand's DefCon presentation [69] provides a good overview with a large number of intuition-building figures and photos.

The simplest type of physical attack is a denial of service attack performed by disconnecting the victim computer's power supply or network cable. The threat models of most secure architectures ignore this attack, because denial of service can also be achieved by software attacks that compromise system software such as the hypervisor.

3.4.1 Port Attacks

Slightly more involved attacks rely on connecting a device to an existing port on the victim computer's case or motherboard (§ 2.9.1). A simple example is a cold boot attack, where the attacker plugs a USB flash drive into the victim's case and causes the computer to boot from the flash drive, whose malicious system software receives unrestricted access to the computer's peripherals.

More expensive physical attacks that still require relatively little effort target the debug ports of various peripherals. The cost of these attacks is generally dominated by the expense of acquiring the development kits needed to connect to the debug ports. For example, recent Intel processors include the Generic Debug eXternal Connection (GDXC) [126, 199], which collects and filters the data transferred by the uncore's ring bus (§ 2.11.3), and reports it to an external debugger.

The threat models of secure architectures generally ignore debug port attacks, under the assumption that devices sold for general consumption have their debug ports irreversibly disabled. In practice, manufacturers have strong incentives to preserve debugging ports in production hardware, as this facilitates the diagnosis and repair of defective units. Due to insufficient documentation on this topic, we ignore the possibility of GDXC-based attacks.

3.4.2 Bus Tapping Attacks

More complex physical attacks consist of installing a device that taps a bus on the computer's motherboard (§ 2.9.1). Passive attacks are limited to monitoring the bus traffic, whereas active attacks can modify the traffic, or even place new commands on the bus. Replay attacks are a notoriously challenging class of active attacks, where the attacker first records the bus traffic, and then selectively replays a subset of the traffic. Replay attacks bypass systems that rely on static signatures or HMACs, and generally aim to double-spend a limited resource.

The cost of bus tapping attacks is generally dominated by the cost of the equipment used to tap the bus, which

42 increases with bus speed and complexity. For example, requires ion beam microscopy. the flash chip that stores the computer’s firmware is con- The least expensive classes of chip attacks are destruc- nected to the PCH via an SPI bus (§ 2.9.1), which is tive, and only require imaging the chip’s circuitry. These simpler and much slower than the DDR bus connecting attacks rely on a microscope capable of capturing the DRAM to the CPU. Consequently, tapping the SPI bus is necessary details in each layer, and equipment for me- much cheaper than tapping the DDR bus. For this reason, chanically removing each layer and exposing the layer systems whose security relies on a cryptographic hash below it to the microscope. of the firmware will first copy the firmware into DRAM, Imaging attacks generally target global secrets shared hash the DRAM copy of the firmware, and then execute by all the chips in a family, such as ROM masks that store the firmware from DRAM. global encryption keys or secret boot code. They are also Although the speed of the DDR bus makes tapping used to reverse-engineer undocumented functionality, very difficult, there are well-publicized records of suc- such as debugging backdoors. E-fuses and polyfuses are cessful attempts. The original console’s booting particularly vulnerable to imaging attacks, because of process was reverse-engineered, thanks to a passive tap their relatively large sizes. on the DRAM bus [82], which showed that the firmware Non-destructive passive chip attacks require measur- used to boot the console was partially stored in its south- ing the voltages across a module at specific times, while bridge. The protection mechanisms of the PlayStation 3 the chip is operating. These attacks are orders of magni- hypervisor were subverted by an active tap on its memory tude more expensive than imaging attacks, because the bus [81] that targeted the hypervisor’s page tables. attacker must maintain the integrity of the chip’s circuitry, The Ascend secure processor (§ 4.10) shows that con- and therefore cannot de-layer the chip. cealing the addresses of the DRAM cells accessed by The simplest active attacks on a chip create or destroy a program is orders of magnitude more expensive than an electric connection between two components. For protecting the memory’s contents. Therefore, we are example, the debugging functionality in many chips is interested in analyzing attacks that tap the DRAM bus, disabled by “blowing” an e-fuse. Once this e-fuse is but only use the information on the address lines. These located, an attacker can reconnect its two ends, effec- attacks use the same equipment as normal DRAM bus tively undoing the “blowing” operation. More expensive tapping attacks, but require a significantly more involved attacks involve changing voltages across a component as analysis to learn useful information. One of the dif- the chip is operating, and are typically used to reverse- ficulties of such attacks is that the memory addresses engineer complex circuits. observed on the DRAM bus are generally very different Surprisingly, active attacks are not significantly more from the application’s memory access patterns, because expensive to carry out than passive non-destructive at- of the extensive cache hierarchies in modern processors tacks. This is because the tools used to measure the (§ 2.11). 
voltage across specific components are not very different We are not aware of any successful attack based on from the tools that can tamper with the chip’s electric tapping the address lines of a DRAM bus and analyzing circuits. Therefore, once an attacker develops a process the sequence of memory addresses. for accessing a module without destroying the chip’s circuitry, the attacker can use the same process for both 3.4.3 Chip Attacks passive and active attacks. The most equipment-intensive physical attacks involve At the architectural level, we cannot address physical removing a chip’s packaging and directly interacting with attacks against the CPU’s chip package. Active attacks its electrical circuits. These attacks generally take advan- on the CPU change the computer’s execution semantics, tage of equipment and techniques that were originally leaving us without any hardware that can be trusted to developed to diagnose design and manufacturing defects make security decisions. Passive attacks can read the in chips. [22] covers these techniques in depth. private data that the CPU is processing. Therefore, many The cost of chip attacks is dominated by the required secure computing architectures assume that the processor equipment, although the reverse-engineering involved chip package is invulnerable to physical attacks. is also non-trivial. This cost grows very rapidly as the Thankfully, physical attacks can be deterred by reduc- circuit components shrink. At the time of this writing, ing the value that an attacker obtains by compromising the latest Intel CPUs have a 14nm feature size, which an individual chip. As long as this value is below the cost

43 of carrying out the physical attack, a system’s designer by a keyboard and learn the password that its operator can hope that the processor’s chip package will not be typed. [148] applied similar techniques to learn a user’s targeted by the physical attacks. input on a ’s on-screen keyboard, based on Architects can reduce the value of compromising an data from the device’s accelerometer. individual system by avoiding shared secrets, such as In general, power attacks cannot be addressed at the global encryption keys. Chip designers can increase the architectural level, as they rely on implementation de- cost of a physical attack by not storing a platform’s se- tails that are decided during the manufacturing process. crets in hardware that is vulnerable to destructive attacks, Therefore, it is unsurprising that the secure computing ar- such as e-fuses. chitectures described in § 4 do not protect against power analysis attacks. 3.4.4 Power Analysis Attacks 3.5 Privileged Software Attacks An entirely different approach to physical attacks con- sists of indirectly measuring the power consumption of a The rest of this section points to successful exploits that computer system or its components. The attacker takes execute at each of the privilege levels described in § 2.3, advantage of a known correlation between power con- motivating the SGX design decision to assume that all sumption and the computed data, and learns some prop- the privileged software on the computer is malicious. erty of the data from the observed power consumption. [163] describes all the programmable hardware inside The earliest power analysis attacks have directly mea- Intel computers, and outlines the security implications of sured the processor chip’s power consumption. For ex- compromising the software running it. ample, [122] describes a simple power analysis (SPA) SMM, the most privileged execution level, is only used attack that exploits the correlation between the power to handle a specific kind of interrupts (§ 2.12), namely consumed by a smart card chip’s CPU and the type of System Management Interrupts (SMI). SMIs were ini- instruction it executed, and learned a DSA key that the tially designed exclusively for hardware use, and were smart card was supposed to safeguard. only triggered by asserting a dedicated pin (SMI#) in the While direct power analysis attacks necessitate some CPU’s chip package. However, in modern systems, sys- equipment, their costs are dominated by the complexity tem software can generate an SMI by using the LAPIC’s of the analysis required to learn the desired informa- IPI mechanism. This opens up the avenue for SMM- tion from the observed power trace which, in turn, is based software exploits. determined by the complexity of the processor’s circuitry. The SMM handler is stored in System Manage- Today’s smart cards contain special circuitry [179] and ment RAM (SMRAM) which, in theory, is not acces- use hardened algorithms [77] designed to frustrate power sible when the processor isn’t running in SMM. How- analysis attacks. ever, its protection mechanisms were bypassed multi- Recent work demonstrated successful power analysis ple times [44, 114, 164, 189], and SMM-based rootk- attacks against full-blown out-of-order Intel processors its [49, 186] have been demonstrated. Compromising using inexpensive off-the-shelf sensor equipment. 
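The simple power analysis attack of [122] works because the smart card's instruction sequence depends on secret key bits. The sketch below shows the textbook square-and-multiply exponentiation pattern that makes this possible; it illustrates the general weakness, and is not code from the attacked devices.

    /* Textbook left-to-right square-and-multiply modular exponentiation.
     * The multiply step runs only when the corresponding exponent (key)
     * bit is 1, so an instruction-level power trace reveals the key bit
     * by bit.  Uses the GCC/Clang __uint128_t extension to avoid
     * overflow in the intermediate products. */
    #include <stdint.h>

    uint64_t modexp_leaky(uint64_t base, uint64_t exponent, uint64_t modulus) {
        uint64_t result = 1;
        base %= modulus;
        for (int bit = 63; bit >= 0; bit--) {
            /* always executed: square */
            result = (uint64_t)(((__uint128_t)result * result) % modulus);
            if ((exponent >> bit) & 1) {
                /* executed only for 1 bits: multiply */
                result = (uint64_t)(((__uint128_t)result * base) % modulus);
            }
        }
        return result;
    }

Hardened implementations, such as those referenced above [77], remove this key-dependent difference in the operation sequence.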
[60] the SMM grants an attacker access to all the software on extracts an RSA key from GnuPG running on a laptop the computer, as SMM is the most privileged execution using a microphone that measures its acoustic emissions. mode. [59] and [58] extract RSA keys from power analysis- [200] is a very popular representative of the fam- resistant implementations using a voltage meter and a ily of hypervisors that run in VMX root mode and use radio. All these attacks can be performed quite easily by hardware virtualization. At 150,000 lines of code [11], a disgruntled data center employee. Xen’s codebase is relatively small, especially when com- Unfortunately, power analysis attacks can be extended pared to a kernel. However, Xen still has had over 40 to displays and human input devices, which cannot be security vulnerabilities patched in each of the last three secured in any reasonable manner. For example, [182] years (2012-2014) [10]. documented a very early attack that measures the radia- [136] proposes using a very small hypervisor together tion emitted by a CRT display’s ion beam to reconstitute with Intel TXT’s dynamic root of trust for measurement the image on a computer screen in a different room. [125] (DRTM) to implement trusted execution. [183] argues extended the attack to modern LCD displays. [201] used that a dynamic root of trust mechanism, like Intel TXT, a directional microphone to measure the sound emitted is necessary to ensure a hypervisor’s integrity. Unfor-

44 tunately, the TXT design requires an implementation 3.6.2 DRAM Attacks complex enough that exploitable security vulnerabilities have creeped in [190, 191]. Furthermore, any SMM The rowhammer DRAM bit-flipping attack [72, 119, attack can be used to compromise TXT [188]. 166] is an example of a different class of software attacks The monolithic kernel design leads to many opportu- that exploit design defects in the computer’s hardware. nities for security vulnerabilities in kernel code. Rowhammer took advantage of the fact that some mobile is by far the most popular kernel for IaaS cloud environ- DRAM chips (§ 2.9.1) refreshed the DRAM’s contents ments. Linux has 17 million lines of code [16], and has slowly enough that repeatedly changing the contents of a had over 100 security vulnerabilities patched in each of could impact the charge stored in a neigh- the last three years (2012-2014) [8, 33]. boring cell, which resulted in changing the bit value obtained from reading the cell. By carefully targeting specific memory addresses, the attackers caused bit flips 3.6 Software Attacks on Peripherals in the page tables used by the CPU’s address translation Threat models for secure architectures generally only (§ 2.5) mechanism, and in other data structures used to consider software attacks that directly target other com- make security decisions. ponents in the software stack running on the CPU. This The defect exploited by the rowhammer attack most assumption results in security arguments with the very likely stems from an incorrect design assumption. desirable property of not depending on implementation The DRAM engineers probably only thought of non- details, such as the structure of the motherboard hosting malicious software and assumed that an individual the processor chip. DRAM cell cannot be accessed too often, as repeated ac- The threat models mentioned above must classify at- cesses to the same memory address would be absorbed by tacks from other motherboard components as physical the CPU’s caches (§ 2.11). However, malicious software attacks. Unfortunately, these models would mis-classify can take advantage of the CLFLUSH instruction, which all the attacks described in this section, which can be flushes the cache line that contains a given DRAM ad- carried out solely by executing software on the victim dress. CLFLUSH is intended as a method for applications processor. The incorrect classification matters in cloud to extract more performance out of the cache hierarchy, computing scenarios, where physical attacks are signifi- and is therefore available to software running at all priv- cantly more expensive than software attacks. ilege levels. Rowhammer exploited the combination of CLFLUSH’s availability and the DRAM engineers’ in- valid assumptions, to obtain capabilities that are normally 3.6.1 PCI Express Attacks associated with an active DRAM bus attack. The PCIe bus (§ 2.9.1) allows any device connected to the bus to perform Direct Memory Access (DMA), read- ing from and writing to the computer’s DRAM without 3.6.3 The Performance Monitoring Side Channel the involvement of a CPU core. Each device is assigned a range of DRAM addresses via a standard PCI config- Intel’s Software Development Manual (SDM) [101] and uration mechanism, but can perform DMA on DRAM Optimization Reference Manual [96] describe a vast ar- addresses outside of that range. 
ray of performance monitoring events exposed by recent Without any additional protection mechanism, an at- Intel processors, such as branch mispredictions (§ 2.10). tacker who compromises system software can take ad- The SDM also describes digital temperature sensors em- vantage of programmable devices to access any DRAM bedded in each CPU core, whose readings are exposed region, yielding capabilities that were traditionally asso- using Model-Specific Registers (MSRs) (§ 2.4) that can ciated with a DRAM bus tap. For example, an early im- be read by system software. plementation of Intel TXT [70] was compromised by pro- An attacker who compromises a computer’s system gramming a PCIe NIC to read TXT-reserved DRAM via software and gains access to the performance monitoring DMA transfers [190]. Recent versions have addressed events or the temperature sensors can obtain the informa- this attack by adding extra security checks in the DMA tion needed to carry out a power analysis attack, which bus arbiter. § 4.5 provides a more detailed description of normally requires physical access to the victim computer Intel TXT. and specialized equipment.
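To make the CLFLUSH discussion in § 3.6.2 concrete, the following sketch shows the kind of access loop rowhammer builds on. The addresses, the iteration count, and whether any bits actually flip depend entirely on the DRAM module and its address mapping, so this is an illustration of the access pattern rather than a working exploit.

    /* Sketch of the "hammering" loop enabled by CLFLUSH (Section 3.6.2).
     * Repeatedly reading two addresses that map to different rows of the
     * same DRAM bank, and flushing them from the cache so that every
     * read is served from DRAM, is the access pattern rowhammer relies
     * on.  CLFLUSH is available to ring 3 software, which is what makes
     * this a software attack. */
    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush */

    void hammer(volatile uint8_t *row_a, volatile uint8_t *row_b, long iterations) {
        for (long i = 0; i < iterations; i++) {
            (void)*row_a;                        /* DRAM access to the first row  */
            (void)*row_b;                        /* DRAM access to the second row */
            _mm_clflush((const void *)row_a);    /* evict so the next read        */
            _mm_clflush((const void *)row_b);    /* goes to DRAM again            */
        }
    }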

45 3.6.4 Attacks on the Boot Firmware and Intel ME tains the SHA-256 cryptographic hash of the RSA public key, and uses it to validate the full Intel public key stored Virtually all motherboards store the firmware used to boot in the signature. Similarly, the microcode bootstrap pro- the computer in a flash memory chip (§ 2.9.1) that can be cess in recent CPUs will only execute firmware in an written by system software. This implementation strategy Authenticated Code Module (ACM, § 2.13.2) signed by provides an inexpensive avenue for deploying firmware an Intel key whose SHA-256 hash is hard-coded in the bug fixes. At the same time, an attack that compromises microcode ROM. the system software can subvert the firmware update However, both the computer firmware security checks mechanism to inject malicious code into the firmware. [54, 192] and the ME security checks [178] have been The malicious code can be used to carry out a cold boot subverted in the past. While the approaches described attack, which is typically considered a physical attack. above are theoretically sound, the intricate details and Furthermore, malicious firmware can run code at the complex interactions in Intel-based systems make it very highest software privilege level, System Management likely that security vulnerabilities will creep into im- Mode (SMM, § 2.3). Last, malicious firmware can mod- plementations. Further proving this point, a security ify the system software as it is loaded during the boot analysis [185] found that early versions of Intel’s Active process. These avenues give the attacker capabilities Management Technology (AMT), the flagship ME appli- that have traditionally been associated with DRAM bus cation, contained an assortment of security issues that tapping attacks. allowed an attacker to completely take over a computer The Intel Management Engine (ME) [162] loads its whose ME firmware contained the AMT application. firmware from the same flash memory chip as the main 3.6.5 Accounting for Software Attacks on Peripherals computer, which opens up the possibility of compromis- ing its firmware. Due to its vast management capabilities The attacks described in this section show that a system (§ 2.9.2), a compromised ME would leak most of the pow- whose threat model assumes no software attacks must ers that come with installing active probes on the DRAM be designed with an understanding of all the system’s bus, the PCI bus, and the (SM- buses, and the programmable devices that may be at- Bus), as well as power consumption meters. Thanks to tached to them. The system’s security analysis must its direct access to the motherboard’s Ethernet PHY, the argue that the devices will not be used in physical-like probe would be able to communicate with the attacker attacks. The argument will rely on barriers that prevent while the computer is in the Soft-Off state, also known untrusted software running on the CPU from communi- as S5, where the computer is mostly powered off, but is cating with other programmable devices, and on barriers still connected to a power source. The ME has signifi- that prevent compromised programmable devices from cantly less computational power than probe equipment, tampering with sensitive buses or DRAM. however, as it uses low-power embedded components, Unfortunately, the ME, PCH and DMI are Intel- such as a 200-400MHz execution core, and about 600KB proprietary and largely undocumented, so we cannot of internal RAM. 
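The firmware verification approach that § 3.6.4 attributes to the ME's boot ROM and the microcode bootstrap can be summarized by the following hedged sketch. The real verification flows are undocumented; sha256 and rsa_verify are hypothetical stand-ins, and only the structure is intended to be accurate: a hash of the vendor's public key is hard-coded in ROM, the full key shipped with the firmware is validated against that hash, and the firmware image's signature is then verified with the key.

    /* Conceptual sketch of a boot-ROM firmware check in the spirit of
     * Section 3.6.4.  The ROM hard-codes only the SHA-256 hash of the
     * vendor's public key; sha256() and rsa_verify() stand in for a real
     * cryptographic implementation. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern const uint8_t ROM_PUBKEY_HASH[32];   /* burned into the boot ROM */

    void sha256(const uint8_t *data, size_t len, uint8_t out[32]);      /* hypothetical */
    bool rsa_verify(const uint8_t *pubkey, size_t pubkey_len,
                    const uint8_t *msg, size_t msg_len,
                    const uint8_t *sig, size_t sig_len);                 /* hypothetical */

    bool firmware_is_acceptable(const uint8_t *image, size_t image_len,
                                const uint8_t *pubkey, size_t pubkey_len,
                                const uint8_t *sig, size_t sig_len) {
        uint8_t key_hash[32];
        sha256(pubkey, pubkey_len, key_hash);
        if (memcmp(key_hash, ROM_PUBKEY_HASH, sizeof(key_hash)) != 0)
            return false;                        /* unknown signing key */
        return rsa_verify(pubkey, pubkey_len, image, image_len, sig, sig_len);
    }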
assess the security of the measures set in place to pro- The computer and ME firmware are protected by a tect the ME from being compromised, and we cannot few security measures. The first line of defense is a reason about the impact of a compromised ME that runs security check in the firmware’s update service, which malicious software. only accepts firmware updates that have been digitally 3.7 Address Translation Attacks signed by a manufacturer key that is hard-coded in the § 3.5 argues that today’s system software is virtually firmware. This protection can be circumvented with guaranteed to have security vulnerabilities. This suggests relative ease by foregoing the firmware’s update services, that a cautious secure architecture should avoid having and instead accessing the flash memory chip directly, via the system software in the TCB. the PCH’s SPI bus controller. However, removing the system software from the TCB The deeper, more powerful, lines of defense against requires the architecture to provide a method for isolat- firmware attacks are rooted in the CPU and ME’s hard- ing sensitive application code from the untrusted system ware. The bootloader in the ME’s ROM will only load software. This is typically accomplished by designing flash firmware that contains a correct signature generated a mechanism for loading application code in isolated by a specific Intel RSA key. The ME’s boot ROM con- containers whose contents can be certified via software

attestation (§ 3.3). One of the more difficult problems these designs face is that application software relies on the memory management services provided by the system software, which is now untrusted.

Intel's SGX [14, 139] leaves the system software in charge of setting up the page tables (§ 2.5) used by address translation, inspired by Bastion [31], but instantiates access checks that prevent the system software from directly accessing the isolated container's memory.

This section discusses some attacks that become relevant when the application software does not trust the system software, which is in charge of the page tables. Understanding these attacks is a prerequisite to reasoning about the security properties of architectures with this threat model. For example, many of the mechanisms in SGX target a subset of the attacks described here.

3.7.1 Passive Attacks

System software uses the CPU's address translation feature (§ 2.5) to implement page swapping, where infrequently used memory pages are evicted from DRAM to a slower storage medium. Page swapping relies on the accessed (A) and dirty (D) page table entry attributes (§ 2.5.3) to identify the DRAM pages to be evicted, and on a page fault handler (§ 2.8.2) to bring evicted pages back into DRAM when they are accessed.

Unfortunately, the features that support efficient page swapping turn into a security liability when the system software managing the page tables is not trusted by the application software using the page tables. The system software can be prevented from reading the application's memory directly by placing the application in an isolated container. However, potentially malicious system software can still infer partial information about the application's memory access patterns, by observing the application's page faults and page table attributes.

We consider this class of attacks to be passive attacks that exploit the CPU's address translation feature. It may seem that the page-level memory access patterns provided by these attacks are not very useful. However, [195] describes how this attack can be carried out against Intel's SGX, and implements the attack in a few practical settings. In one scenario, which is particularly concerning for medical image processing, the outline of a JPEG image is inferred while the image is decompressed inside a container protected by SGX's isolation guarantees.

3.7.2 Straightforward Active Attacks

We define active address translation attacks to be the class of attacks where malicious system software modifies the page tables used by an application in a way that breaks the virtual memory abstraction (§ 2.5). Memory mapping attacks do not include scenarios where the system software breaks the memory abstraction by directly writing to the application's memory pages.

We begin with an example of a straightforward active attack. In this example, the application inside a protected container performs a security check to decide whether to disclose some sensitive information. Depending on the security check's outcome, the enclave code either calls an errorOut procedure, or a disclose procedure.

The simplest version of the attack assumes that each procedure's code starts at a page boundary, and takes up less than a page. These assumptions are relaxed in more complex versions of the attack.

In the most straightforward setting, the malicious system software directly modifies the page tables of the application inside the container, as shown in Figure 54, so the virtual address intended to store the errorOut procedure is actually mapped to a DRAM page that contains the disclose procedure. Without any security measures in place, when the application's code jumps to the virtual address of the errorOut procedure, the CPU will execute the code of the disclose procedure instead.

Figure 54: An example of an active memory mapping attack. The application's author intends to perform a security check, and only call the procedure that discloses the sensitive information if the check passes. Malicious system software maps the virtual address of the procedure that is called when the check fails to a DRAM page that contains the disclosing procedure.
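The scenario in Figure 54 can be written out as a short C sketch. The procedure names follow the figure; the sketch is only meant to show that the branch target is a virtual address, so its meaning is entirely at the mercy of whoever controls the page tables.

    /* The check-then-call pattern that Figure 54's active mapping attack
     * targets.  The assumption that errorOut() and disclose() each start
     * at a page boundary and fit in one page is the attack's simplifying
     * assumption from Section 3.7.2, not something the C code enforces. */
    #include <stdbool.h>

    void errorOut(void);      /* intended target when the check fails */
    void disclose(void);      /* reveals the sensitive information    */
    bool securityCheck(void);

    void handle_request(void) {
        if (securityCheck()) {
            disclose();       /* in the author's intent, only reachable
                                 after the check passes */
        } else {
            errorOut();       /* if the OS remaps this procedure's page to
                                 the page holding disclose(), the CPU runs
                                 disclose() here instead */
        }
    }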

3.7.3 Active Attacks Using Page Swapping

The most obvious active attacks on memory mapping can be defeated by tracking the correct virtual address for each DRAM page that belongs to a protected container. However, a naive protection measure based on address tracking can be defeated by a more subtle active attack that relies on the architectural support for page swapping. Figure 55 illustrates an attack that does not modify the application's page tables, but produces the same corrupted CPU view of the application as the straight-forward attack described above.

Figure 55: An active memory mapping attack where the system software does not modify the page tables. Instead, two pages are evicted from DRAM to a slower storage medium. The malicious system software swaps the two pages' contents then brings them back into DRAM, building the same incorrect page mapping as the direct attack shown in Figure 54. This attack defeats protection measures that rely on tracking the virtual and disk addresses for DRAM pages.

In the swapping attack, malicious system software evicts the pages that contain the errorOut and disclose procedures from DRAM to a slower medium, such as a hard disk. The system software exchanges the hard disk bytes storing the two pages, and then brings the two pages back into DRAM. Remarkably, all the steps taken by this attack are indistinguishable from legitimate page swapping activity, with the exception of the I/O operations that exchange the disk bytes storing evicted pages.

The subtle attack described in this section can be defeated by cryptographically binding the contents of each page that is evicted from DRAM to the virtual address to which the page should be mapped. The cryptographic primitive (§ 3.1) used to perform the binding must obviously guarantee integrity. Furthermore, it must also guarantee freshness, in order to foil replay attacks where the system software "undoes" an application's writes by evicting one of its DRAM pages to disk and bringing in an older version of the same page.

3.7.4 Active Attacks Based on TLBs

Today's multi-core architectures can be subjected to an even more subtle active attack, illustrated in Figure 56, which can bypass any protection measures that solely focus on the integrity of the page tables.

Figure 56: An active memory mapping attack where the system software does not invalidate a core's TLBs when it evicts two pages from DRAM and exchanges their locations when reading them back in. The page tables are updated correctly, but the core with stale TLB entries has the same incorrect view of the protected container's code as in Figure 54.

For performance reasons, each execution core caches address translation results in its own translation look-aside buffer (TLB, § 2.11.5). For simplicity, the TLBs are not covered by the cache coherence protocol that synchronizes data caches across cores. Instead, the system software is responsible for invalidating TLB entries across all the cores when it modifies the page tables.

Malicious system software can take advantage of the design decisions explained above by carrying out the following attack. While the same software used in the previous examples is executing on a core, the system software executes on a different core and evicts the errorOut and disclose pages from DRAM. As in the previous attack, the system software loads the disclose code in the DRAM page that previously held errorOut. In this attack, however, the system software also updates the page tables.

The core where the system software executed sees the code that the application developer intended. Therefore, the attack will pass any security checks that rely upon cryptographic associations between page contents and page table data, as long as the checks are performed by the core used to load pages back into DRAM.

However, the core that executes the protected container's code still uses the old page table data, because the system software did not invalidate its TLB entries. Assuming the TLBs are not subjected to any additional security checks, this attack causes the same private information leak as the previous examples.

In order to avoid the attack described in this section, the trusted software or hardware that implements protected containers must also ensure that the system software invalidates the relevant TLB entries on all the cores when it evicts a page from a protected container to DRAM.

3.8 Cache Timing Attacks

Cache timing attacks [19] are a powerful class of software attacks that can be mounted entirely by application code running at ring 3 (§ 2.3). Cache timing attacks do not learn information by reading the victim's memory, so they bypass the address translation-based isolation measures (§ 2.5) implemented in today's kernels and hypervisors.

3.8.1 Theory

Cache timing attacks exploit the unfortunate dependency between the location of a memory access and the time it takes to perform the access. A cache miss requires at least one memory access to the next level cache, and might require a second memory access if a write-back occurs. On the Intel architecture, the latency between a cache hit and a miss can be easily measured by the RDTSC and RDTSCP instructions (§ 2.4), which read a high-resolution time-stamp counter. These instructions have been designed for benchmarking and optimizing software, so they are available to ring 3 software.

The fundamental tool of a cache timing attack is an attacker process that measures the latency of accesses to carefully designated memory locations in its own address space. The memory locations are chosen so that they map to the same cache lines as those of some interesting memory locations in a victim process, in a cache that is shared between the attacker and the victim. This requires in-depth knowledge of the shared cache's organization (§ 2.11.2).

Armed with the knowledge of the cache's organization, the attacker process sets up the attack by accessing its own memory in such a way that it fills up all the cache sets that would hold the victim's interesting memory locations. After the targeted cache sets are full, the attacker allows the victim process to execute. When the victim process accesses an interesting memory location in its own address space, the shared cache must evict one of the cache lines holding the attacker's memory locations.

As the victim is executing, the attacker process repeatedly times accesses to its own memory locations. When the access times indicate that a location was evicted from the cache, the attacker can conclude that the victim accessed an interesting memory location in its own cache. Over time, the attacker collects the results of many measurements and learns a subset of the victim's memory access pattern. If the victim processes sensitive information using data-dependent memory fetches, the attacker may be able to deduce the sensitive information from the learned memory access pattern.

3.8.2 Practical Considerations

Cache timing attacks require control over a software process that shares a cache memory with the victim process. Therefore, a cache timing attack that targets the L2 cache would have to rely on the system software to schedule a software thread on a logical processor in the same core as the target software, whereas an attack on the L3 cache can be performed using any logical processor on the same CPU. The latter attack relies on the fact that the L3 cache is inclusive, which greatly simplifies the processor's cache coherence implementation (§ 2.11.3). The cache sharing requirement implies that L3 cache attacks are feasible in an IaaS environment, whereas L2 cache attacks become a significant concern when running sensitive software on a user's desktop.

Out-of-order execution (§ 2.10) can introduce noise in cache timing attacks. First, memory accesses may not be performed in program order, which can impact the lines selected by the cache eviction algorithms. Second, out-of-order execution may result in cache fills that do not correspond to executed instructions. For example, a load that follows a faulting instruction may be scheduled and executed before the fault is detected.

Cache timing attacks must account for speculative execution, as mispredicted memory accesses can still cause cache fills. Therefore, the attacker may observe cache fills that don't correspond to instructions that were actually executed by the victim software. Memory prefetching adds further noise to cache timing attacks, as the attacker may observe cache fills that don't correspond to instructions in the victim code, even when accounting for speculative execution.
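A skeleton of the measurement loop described in § 3.8.1 is sketched below. Constructing the eviction set and calibrating the hit/miss threshold are the cache-specific parts of a real attack and are assumed to be done elsewhere; the sketch only shows the RDTSCP-based timing and the prime and probe phases.

    /* Skeleton of the prime-and-probe measurement from Section 3.8.1. */
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp */

    /* Time one load with the timestamp counter, as the section describes. */
    static inline uint64_t timed_read(volatile uint8_t *p) {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }

    /* Prime: touch every line in the eviction set, so the targeted cache
     * sets are filled with the attacker's data. */
    static void prime(volatile uint8_t **eviction_set, int n) {
        for (int i = 0; i < n; i++)
            (void)*eviction_set[i];
    }

    /* Probe: re-time each line; a slow access suggests the victim evicted
     * it, i.e. the victim touched a location mapping to that cache set. */
    static int probe(volatile uint8_t **eviction_set, int n, uint64_t threshold) {
        int evictions = 0;
        for (int i = 0; i < n; i++)
            if (timed_read(eviction_set[i]) > threshold)
                evictions++;
        return evictions;
    }

As the next subsection notes, out-of-order execution, speculation, and prefetching add noise, so a real attack repeats this loop many times and relies on statistics over the measurements.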

49 3.8.3 Known Cache Timing Attacks fetches must also be taken into consideration. [115] Despite these difficulties, cache timing attacks are known gives an idea of the level of effort required to remove to retrieve cryptographic keys used by AES [25, 146], data-dependent accesses from AES, which is a relatively RSA [28], Diffie-Hellman [123], and elliptic-curve cryp- simple data processing algorithm. At the time of this tography [27]. writing, we are not aware of any approach that scales to Early attacks required access to the victim’s CPU core, large pieces of software. but more sophisticated recent attacks [131, 196] are able While the focus of this section is cache timing at- to use the L3 cache, which is shared by all the cores on tacks, we would like to point out that any shared re- a CPU die. L3-based attacks can be particularly dev- source can lead to information leakage. A worrying astating in cloud computing scenarios, where running example is hyper-threading (§ 2.9.4), where each CPU software on the same computer as a victim application core is represented as two logical processors, and the only requires modest statistical analysis skills and a small threads executing on these two processors share execu- amount of money [157]. Furthermore, cache timing at- tion units. An attacker who can run a process on a logical tacks were recently demonstrated using JavaScript code processor sharing a core with a victim process can use in a page visited by a Web browser [145]. RDTSCP [152] to learn which execution units are in use, Given this pattern of vulnerabilities, ignoring cache and infer what instructions are executed by the victim timing attacks is dangerously similar to ignoring the process. string of demonstrated attacks which led to the depreca- tion of SHA-1 [3, 6, 9]. 4 RELATED WORK 3.8.4 Defending against Cache Timing Attacks This section describes the broader picture of trusted hard- ware projects that SGX belongs to. Table 12 summarizes Fortunately, invalidating any of the preconditions for the security properties of SGX and the other trusted hard- cache timing attacks is sufficient for defending against ware presented here. them. The easiest precondition to focus on is that the attacker must have access to memory locations that map 4.1 The IBM 4765 Secure to the same sets in a cache as the victim’s memory. This assumption can be invalidated by the judicious use of a Secure [198] encapsulate an entire com- cache partitioning scheme. puter system, including a CPU, a cryptographic accel- Performance concerns aside, the main difficulty asso- erator, caches, DRAM, and an I/O controller within a ciated with cache partitioning schemes is that they must tamper-resistant environment. The enclosure includes be implemented by a trusted party. When the system hardware that deters attacks, such as a Faraday cage, as software is trusted, it can (for example) use the prin- well as an array of sensors that can detect tampering ciples behind page coloring [117, 177] to partition the attempts. The secure coprocessor destroys the secrets caches [129] between mutually distrusting parties. This that it stores when an attack is detected. This approach comes down to setting up the page tables in such a way has good security properties against physical attacks, that no two mutually distrusting software module are but tamper-resistant enclosures are very expensive [15], stored in physical pages that map to the same sets in relatively to the cost of a computer system. any cache memory. 
However, if the system software The IBM 4758 [172], and its most current-day suc- is not trusted, the cache partitioning scheme must be cessor, the IBM 4765 [2] (shown in Figure 57) are rep- implemented in hardware. resentative examples of secure coprocessors. The 4758 The other interesting precondition is that the victim was certified to withstand physical attacks to FIPS 140-1 must access its memory in a data-dependent fashion that Level 4 [171], and the 4765 meets the rigors of FIPS allows the attacker to infer private information from the 140-2 Level 4 [1]. observed memory access pattern. It becomes tempting The 4765 relies heavily on physical isolation for its to think that cache timing attacks can be prevented by security properties. Its system software is protected from eliminating data-dependent memory accesses from all attacks by the application software by virtue of using the code handling sensitive data. a dedicated service processor that is completely sepa- However, removing data-dependent memory accesses rate from the application processor. Special-purpose bus is difficult to accomplish in practice because instruction logic prevents the application processor from accessing

privileged resources, such as the battery-backed memory that stores the system software's secrets.

Table 12: Security features overview for the trusted hardware projects related to Intel's SGX.

Figure 57: The IBM 4765 secure coprocessor consists of an entire computer system placed inside an enclosure that can deter and detect physical attacks. The application and the system use separate processors. Sensitive memory can only be accessed by the system code, thanks to access control checks implemented in the system bus' hardware. Dedicated hardware is used to clear the platform's secrets and shut down the system when a physical attack is detected.

The 4765 implements software attestation. The coprocessor's attestation key is stored in battery-backed memory that is only accessible to the service processor. Upon reset, the service processor executes a first-stage bootloader stored in ROM, which measures and loads the system software. In turn, the system software measures the application code stored in NVRAM and loads it into the DRAM chip accessible to the application processor. The system software provides attestation services to the application loaded inside the coprocessor.

4.2 ARM TrustZone

ARM's TrustZone [13] is a collection of hardware modules that can be used to conceptually partition a system's resources between a secure world, which hosts a secure container, and a normal world, which runs an untrusted software stack. The TrustZone documentation [18] describes semiconductor intellectual property cores (IP blocks) and ways in which they can be combined to achieve certain security properties, reflecting the fact that ARM is an IP core provider, not a chip manufacturer. Therefore, the mere presence of TrustZone IP blocks in a system is not sufficient to determine whether the system is secure under a specific threat model. Figure 58 illustrates a design for a smartphone System-on-Chip (SoC) that uses TrustZone IP blocks.

Figure 58: Smartphone SoC design based on TrustZone. The red IP blocks are TrustZone-aware. The red connections ignore the TrustZone secure bit in the bus address. Defining the system's security properties requires a complete understanding of all the red elements in this figure.

TrustZone extends the address lines in the AMBA AXI system bus [17] with one signal that indicates whether an access belongs to the secure or normal (non-secure) world. ARM processor cores that include TrustZone's "Security Extensions" can switch between the normal world and the secure world when executing code. The address in each bus access executed by a core reflects the world in which the core is currently executing.

The reset circuitry in a TrustZone processor places it in secure mode, and points it to the first-stage bootloader stored in on-chip ROM. TrustZone's TCB includes this bootloader, which initializes the platform, sets up the TrustZone hardware to protect the secure container from untrusted software, and loads the normal world's bootloader. The secure container must also implement a monitor that performs the context switches needed to transition an execution core between the two worlds. The monitor must also handle hardware exceptions, such as interrupts, and route them to the appropriate world.

The TrustZone design gives the secure world's monitor unrestricted access to the normal world, so the monitor can implement inter-process communication (IPC) between the software in the two worlds. Specifically, the monitor can issue bus accesses using both secure and non-secure addresses. In general, the secure world's software can compromise any level in the normal world's software

52 stack. For example, the secure container’s software can caches described in TrustZone’s documentation do not jump into arbitrary locations in the normal world by flip- enforce a complete separation between worlds, as they al- ping a bit in a register. The untrusted software in the low a world’s memory accesses to evict the other world’s normal world can only access the secure world via an cache lines. This exposes the secure container software instruction that jumps into a well-defined location inside to cache timing attacks from the untrusted software in the the monitor. normal world. Unfortunately, hardware manufacturers Conceptually, each TrustZone CPU core provides sep- that license the TrustZone IP cores are reluctant to dis- arate address translation units for the secure and normal close all the details of their designs, making it impossible worlds. This is implemented by two page table base for security researchers to reason about TrustZone-based registers, and by having the page walker use the page hardware. table base corresponding to the core’s current world. The The TrustZone components do not have any counter- physical addresses in the page table entries are extended measures for physical attacks. However, a system that to include the values of the secure bit to be issued on the follows the recommendations in the TrustZone documen- AXI bus. The secure world is protected from untrusted tation will not be exposed to physical attacks, under a software by having the CPU core force the secure bit in threat model that trusts the processor chip package. The the address translation result to zero for normal world AXI bus is designed to connect components in an SoC address translations. As the secure container manages its design, so it cannot be tapped by an attacker. The Trust- own page tables, its memory accesses cannot be directly Zone documentation recommends having all the code observed by the untrusted OS’s page fault handler. and data in the secure world stored in on-chip SRAM, TrustZone-aware hardware modules, such as caches, which is not subject to physical attacks. However, this ap- are trusted to use the secure address bit in each bus access proach places significant limits on the secure container’s to enforce the isolation between worlds. For example, functionality, because on-chip SRAM is many orders of TrustZone’s caches store the secure bit in the address magnitude more expensive than a DRAM chip of the tag for each cache line, which effectively provides com- same capacity. pletely different views of the memory space to the soft- TrustZone’s documentation does not describe any soft- ware running in different worlds. This design assumes ware attestation implementation. However, it does out- that memory space is partitioned between the two worlds, line a method for implementing secure boot, which so no aliasing can occur. comes down to having the first-stage bootloader verify a The TrustZone documentation describes two TLB con- signature in the second-stage bootloader against a public figurations. If many context switches between worlds key whose cryptographic hash is burned into on-chip are expected, the TLB IP blocks can be configured to One-Time Programmable (OTP) polysilicon fuses. A include the secure bit in the address tag. Alternatively, hardware measurement root can be built on top of the the secure bit can be omitted from the TLBs, as long as same components, by storing a per-chip attestation key the monitor flushes the TLBs when switching contexts. 
in the polyfuses, and having the first-stage bootloader The hardware modules that do not consume Trust- measure the second-stage bootloader and store its hash Zone’s address bit are expected to be connected to the in an on-chip SRAM region allocated to the secure world. AXI bus via IP cores that implement simple partition- The polyfuses would be gated by a TZMA IP block that ing techniques. For example, the TrustZone Memory makes them accessible only to the secure world. Adapter (TZMA) can be used to partition an on-chip ROM or SRAM into a secure region and a normal region, 4.3 The XOM Architecture and the TrustZone Address Space Controller (TZASC) The execute-only memory (XOM) architecture [128] in- partitions the memory space provided by a DRAM con- troduced the approach of executing sensitive code and troller into secure and normal regions. A TrustZone- data in isolated containers managed by untrusted host aware DMA controller rejects DMA transfers from the software. XOM outlined the mechanisms needed to iso- normal world that reference secure world addresses. late a container’s data from its untrusted software envi- It follows that analyzing the security properties of a ronment, such as saving the register state to a protected TrustZone system requires a precise understanding of memory area before servicing an interrupt. the behavior and configuration of all the hardware mod- XOM supports multiple containers by tagging every ules that are attached to the AXI bus. For example, the cache line with the identifier of the container owning it,

53 and ensures isolation by disallowing memory accesses the TPM chip. It follows that the measurement included to cache lines that don’t match the current container’s in an attestation signature covers the entire OS kernel and identifier. The operating system and the untrusted appli- all the kernel modules, such as device drivers. However, cations are considered to belong to a container with a commercial computers use a wide diversity of devices, null identifier. and their system software is updated at an ever-increasing XOM also introduced the integration of encryption pace, so it is impossible to maintain a list of acceptable and HMAC functionality in the processor’s memory con- measurement hashes corresponding to a piece of trusted troller to protect container memory from physical attacks software. Due to this issue, the TPM’s software attes- on DRAM. The encryption and HMAC functionality is tation is not used in many security systems, despite its used for all cache line evictions and fetches, and the wide deployment. ECC bits in DRAM chips are repurposed to store HMAC The TPM design is technically not vulnerable to any values. software attacks, because it trusts all the software on the XOM’s design cannot guarantee DRAM freshness, so computer. However, a TPM-based system is vulnerable the software in its containers is vulnerable to physical to an attacker who has physical access to the machine, replay attacks. Furthermore, XOM does not protect a as the TPM chip does not provide any isolation for the container’s memory access patterns, meaning that any software on the computer. Furthermore, the TPM chip piece of malicious software can perform cache timing receives the software measurements from the CPU, so attacks against the software in a container. Last, XOM TPM-based systems are vulnerable to attackers who can containers are destroyed when they encounter hardware tap the communication bus between the CPU and the exceptions, such as page faults, so XOM does not support TPM. paging. Last, the TPM’s design relies on the software running XOM predates the attestation scheme described above, on the CPU to report its own cryptographic hash. The and relies on a modified software distribution scheme TPM chip resets the measurements stored in Platform instead. Each container’s contents are encrypted with Configuration Registers (PCRs) when the computer is a symmetric key, which also serves as the container’s rebooted. Then, the TPM expects the software at each identity. The symmetric key, in turn, is encrypted with boot stage to cryptographically hash the software at the the public key of each CPU that is trusted to run the next stage, and send the hash to the TPM. The TPM up- container. A container’s author can be assured that the dates the PCRs to incorporate the new hashes it receives, container is running on trusted software by embedding a as shown in Figure 59. Most importantly, the PCR value secret into the encrypted container data, and using it to at any point reflects all the software hashes received by authenticate the container. While conceptually simpler the TPM up to that point. This makes it impossible for than software attestation, this scheme does not allow the software that has been measured to “remove” itself from container author to vet the container’s software environ- the measurement. ment. 
For example, the firmware on most modern comput- ers implements the platform initialization process in the 4.4 The Trusted Platform Module (TPM) Unified Extensible Firmware Interface (UEFI) specifi- The Trusted Platform Module (TPM) [71] introduced cation [180]. Each platform initialization phase is re- the software attestation model described at the beginning sponsible for verifying or measuring the firmware that of this section. The TPM design does not require any implements the next phase. The SEC firmware initializes hardware modifications to the CPU, and instead relies the TPM PCR, and then stores the PEI’s measurement on an auxiliary tamper-resistant chip. The TPM chip into a measurement register. In turn, the PEI imple- is only used to store the attestation key and to perform mentation measures the DXE firmware and updates the software attestation. The TPM was widely deployed on measurement register that stores the PEI hash to account commodity computers, because it does not rely on CPU for the DXE hash. When the OS is booted, the hash in modifications. Unfortunately, the cost of this approach the measurement register accounts for all the firmware is that the TPM has very weak security guarantees, as that was used to boot the computer. explained below. Unfortunately, the security of the whole measurement The TPM design provides one isolation container, cov- scheme hinges on the requirement that the first hash sent ering all the software running on the computer that has to the TPM must reflect the software that runs in the first

54 0 (zero) secure container to a virtual machine (guest operating TPM MR system and application) hosted by the CPU’s hardware after reboot Boot Loader virtualization features (VMX [181]). TXT isolates the software inside the container from SHA-1( ) untrusted software by ensuring that the container has sent to TPM exclusive control over the entire computer while it is OS Kernel SHA-1( ) active. This is accomplished by a secure initialization TPM MR when SHA-1( ) authenticated code module (SINIT ACM) that effectively boot loader performs a warm system reset before starting the con- executes sent to TPM Kernel module tainer’s VM. SHA-1( ) TXT requires a TPM chip with an extended register SHA-1( ) TPM MR when set. The registers used by the measured boot process de- OS kernel sent to TPM executes scribed in § 4.4 are considered to make up the platform’s SHA-1( ) Static Root of Trust Measurement (SRTM). When a TXT TPM MR when VM is initialized, it updates TPM registers that make Kernel Module executes up the Dynamic Root of Trust Measurement (DRTM). Figure 59: The measurement stored in a TPM platform configura- While the TPM’s SRTM registers only reset at the start of tion register (PCR). The PCR is reset when the system reboots. The a boot cycle, the DRTM registers are reset by the SINIT software at every boot stage hashes the next boot stage, and sends ACM, every time a TXT VM is launched. the hash to the TPM. The PCR’s new value incorporates both the old TXT does not implement DRAM encryption or PCR value, and the new software hash. HMACs, and therefore is vulnerable to physical DRAM boot stage. The TPM threat model explicitly acknowl- attacks, just like TPM-based designs. Furthermore, early edges this issue, and assumes that the firmware respon- TXT implementations were vulnerable to attacks where sible for loading the first stage bootloader is securely a malicious operating system would program a device, embedded in the motherboard. However, virtually ev- such as a network card, to perform DMA transfers to the ery TPM-enabled computer stores its firmware in a flash DRAM region used by a TXT container [188, 191]. In memory chip that can be re-programmed in software recent Intel CPUs, the memory controller is integrated (§ 2.9.1), so the TPM’s measurement can be subverted on the CPU die, so the SINIT ACM can securely set by an attacker who can reflash the computer’s firmware up the memory controller to reject DMA transfers tar- [29]. geting TXT memory. An Intel chipset datasheet [105] On very recent Intel processors, the attack described documents an “Intel TXT DMA Protected Range” IIO above can be defeated by having the initialization mi- configuration register. crocode (§ 2.14.4) hash the computer’s firmware (specifi- Early TXT implementations did not measure the cally, the PEI code in UEFI [180] firwmare) and commu- SINIT ACM. Instead, the microcode implementing the nicate the hash to the TPM chip. This is marketed as the TXT launch instruction verified that the code module Measured Boot feature of Intel’s Boot Guard [162]. contained an RSA signature by a hard-coded Intel key. Sadly, most computer manufacturers use Verified Boot SINIT ACM signatures cannot be revoked if vulnerabili- (also known as “secure boot”) instead of Measured Boot ties are found, so TXT’s software attestation had to be (also known as “trusted boot”). Verified Boot means that revised when SINIT ACM exploits [190] surfaced. 
Cur- the processor’s microcode only boots into PEI firmware rently, the SINIT ACM’s cryptographic hash is included that contains a signature produced by a key burned into in the attestation measurement. the chip’s e-fuses. Verified Boot does not impact the Last, the warm reset performed by the SINIT ACM measurements stored on the TPM, so it does not improve does not include the software running in System Manage- the security of software attestation. ment Mode (SMM). SMM was designed solely for use by firmware, and is stored in a protected memory area 4.5 Intel’s Trusted Execution Technology (TXT) (SMRAM) which should not be accessible to non-SMM Intel’s Trusted Execution Technology (TXT) [70] uses software. However, the SMM handler was compromised the TPM’s software attestation model and auxiliary on multiple occasions [44, 49, 164, 186, 189], and an tamper-resistant chip, but reduces the software inside the attacker who obtains SMM execution can access the

55 memory used by TXT’s container. cations running inside unmodified, untrusted operating 4.6 The Aegis Secure Processor systems. Bastion’s hypervisor ensures that the operating system does not interfere with the secure containers. We The Aegis secure processor [174] relies on a security only describe Bastion’s virtualization extensions to ar- kernel in the operating system to isolate containers, and chitectures that use nested page tables, like Intel’s VMX includes the kernel’s cryptographic hash in the measure- [181]. ment reported by the software attestation signature. [176] The hypervisor enforces the containers’ desired mem- argued that Physical Unclonable Functions (PUFs) [56] ory mappings in the OS page tables, as follows. Each can be used to endow a secure processor with a tamper- Bastion container has a Security Segment that lists the resistant private key, which is required for software attes- virtual addresses and permissions of all the container’s tation. PUFs do not have the fabrication process draw- pages, and the hypervisor maintains a Module State Table backs of EEPROM, and are significantly more resilient that stores an inverted page map, associating each physi- to physical attacks than e-fuses. cal memory page to its container and virtual address. The Aegis relies on a trusted security kernel to isolate each processor’s hardware page walker is modified to invoke container from the other software on the computer by the hypervisor on every TLB miss, before updating the configuring the page tables used in address translation. TLB with the address translation result. The hypervisor The security kernel is a subset of a typical OS kernel, checks that the virtual address used by the translation and handles virtual memory management, processes, and matches the expected virtual address associated with the hardware exceptions. As the security kernel is a part of physical address in the Module State Table. the trusted code base (TCB), its cryptographic hash is Bastion’s cache lines are not tagged with container included in the software attestation measurement. The identifiers. Instead, only TLB entries are tagged. The security kernel uses processor features to isolate itself hypervisor’s TLB miss handler sets the container iden- from the untrusted part of the operating system, such as tifier for each TLB entry as it is created. Similarly to device drivers. XOM and Aegis, the secure processor checks the TLB The Aegis memory controller encrypts the cache lines tag against the current container’s identifier on every in one memory range, and HMACs the cache lines in one memory access. other memory range. The two memory ranges can over- Bastion offers the same protection against physical lap, and are configurable by the security kernel. Thanks DRAM attacks as Aegis does, without the restriction that to the two ranges, the memory controller can avoid the a container’s data must be stored inside a continuous latency overhead of cryptographic operations for the DRAM range. This is accomplished by extending cache DRAM outside containers. Aegis was the first secure lines and TLB entries with flags that enable memory processor not vulnerable to physical replay attacks, as it encryption and HMACing. The hypervisor’s TLB miss uses a Merkle tree construction [57] to guarantee DRAM handler sets the flags on TLB entries, and the flags are freshness. The latency overhead of the Merkle tree is propagated to cache lines on memory writes. 
greatly reduced by augmenting the L2 cache with the The Bastion hypervisor allows the untrusted operat- tree nodes for the cache lines. ing system to evict secure container pages. The evicted Aegis’ security kernel allows the OS to page out con- pages are encrypted, HMACed, and covered by a Merkle tainer memory, but verifies the correctness of the paging tree maintained by the hypervisor. Thus, the hypervisor operations. The security kernel uses the same encryption ensures the confidentiality, authenticity, and freshness and Merkle tree algorithms as the memory controller to of the swapped pages. However, the ability to freely guarantee the confidentiality and integrity of the con- evict container pages allows a malicious OS to learn a tainer pages that are swapped out from DRAM. The OS container’s memory accesses with page granularity. Fur- is free to page out container memory, so it can learn a thermore, Bastion’s threat model excludes cache timing container’s memory access patterns, at page granular- attacks. ity. Aegis containers are also vulnerable to cache timing attacks. Bastion does not trust the platform’s firmware, and computes the cryptographic hash of the hypervisor af- 4.7 The Bastion Architecture ter the firmware finishes playing its part in the booting The Bastion architecture [31] introduced the use of a process. The hypervisor’s hash is included in the mea- trusted hypervisor to provide secure containers to appli- surement reported by software attestation.

56 4.8 Intel SGX in Context microcode, similarly to TXT’s SINIT ACM. Intel’s Software Guard Extensions (SGX) [14, 79, 139] As SGX does not protect against cache timing at- implements secure containers for applications without tacks, the privileged enclave’s authors cannot use data- making any modifications to the processor’s critical ex- dependent memory accesses. For example, cache attacks ecution path. SGX does not trust any layer in the com- on the Quoting Enclave, which computes attestation sig- puter’s software stack (firmware, hypervisor, OS). In- natures, would provide an attack with a processor’s EPID stead, SGX’s TCB consists of the CPU’s microcode and signing key and completely compromise SGX. a few privileged containers. SGX introduces an approach Intel’s documentation states that SGX guarantees to solving some of the issues raised by multi-core pro- DRAM confidentiality, authentication, and freshness by cessors with a shared, coherent last-level cache. virtue of a Memory Encryption Engine (MEE). The MEE SGX does not extend caches or TLBs with container is informally described in an ISCA 2015 tutorial [103], identity bits, and does not require any security checks and appears to lack a formal specification. In the absence during normal memory accesses. As suggested in the of further information, we assume that SGX provides TrustZone documentation, SGX always ensures that a the same protection against physical DRAM attacks that core’s TLBs only contain entries for the container that Aegis and Bastion provide. it is executing, which requires flushing the CPU core’s 4.9 Sanctum TLBs when context-switching between containers and untrusted software. Sanctum [38] introduced a straightforward software/hard- SGX follows Bastion’s approach of having the un- ware co-design that yields the same resilience against trusted OS manage the page tables used by secure con- software attacks as SGX, and adds protection against tainers. The containers’ security is preserved by a TLB memory access pattern leaks, such as page fault monitor- miss handler that relies on an inverted page map (the ing attacks and cache timing attacks. EPCM) to reject address translations for memory that Sanctum uses a conceptually simple cache partitioning does not belong to the current container. scheme, where a computer’s DRAM is split into equally- Like Bastion, SGX allows the untrusted operating sys- sized continuous DRAM regions, and each DRAM re- tem to evict secure container pages, in a controlled fash- gion uses distinct sets in the shared last-level cache ion. After the OS initiates a container page eviction, (LLC). Each DRAM region is allocated to exactly one it must prove to the SGX implementation that it also container, so containers are isolated in both DRAM and switched the container out of all cores that were execut- the LLC. Containers are isolated in the other caches by ing its code, effectively performing a very coarse-grained flushing on context switches. TLB shootdown. Like XOM, Aegis, and Bastion, Sanctum also consid- SGX’s microcode ensures the confidentiality, authen- ers the hypervisor, OS, and the application software to ticity, and freshness of each container’s evicted pages, conceptually belong to a separate container. Containers like Bastion’s hypervisor. However, SGX relies on a are protected from the untrusted outside software by the version-based Merkle tree, inspired by Aegis [174], and same measures that isolate containers from each other. 
adds an innovative twist that allows the operating system Sanctum relies on a trusted security monitor, which to dynamically shape the Merkle tree. SGX also shares is the first piece of firmware executed by the processor, Bastion’s and Aegis’ vulnerability to memory access pat- and has the same security properties as those of Aegis’ tern leaks, namely a malicious OS can directly learn a security kernel. The monitor is measured by bootstrap container’s memory accesses at page granularity, and any code in the processor’s ROM, and its cryptographic hash piece of software can perform cache timing attacks. is included in the software attestation measurement. The SGX’s software attestation is implemented using monitor verifies the operating system’s resource alloca- Intel’s Enhanced Privacy ID (EPID) group signature tion decisions. For example, it ensures that no DRAM scheme [26], which is too complex for a microcode region is ever accessible to two different containers. implementation. Therefore, SGX relies on an assort- Each Sanctum container manages its own page tables ment of privileged containers that receive direct access mapping its DRAM regions, and handles its own page to the SGX processor’s hardware keys. The privileged faults. It follows that a malicious OS cannot learn the containers are signed using an Intel private key whose virtual addresses that would cause a page fault in the corresponding public key is hard-coded into the SGX container. Sanctum’s hardware modifications work in

57 conjunction with the security monitor to make sure that 5.1 SGX Physical Memory Organization a container’s page tables only reference memory inside The enclaves’ code and data is stored in Processor Re- the container’s DRAM regions. served Memory (PRM), which is a subset of DRAM that The Sanctum design focuses completely on software cannot be directly accessed by other software, including attacks, and does not offer protection from any physical system software and SMM code. The CPU’s integrated attack. The authors expect Sanctum’s hardware modifica- memory controllers (§ 2.9.3) also reject DMA transfers tions to be combined with the physical attack protections targeting the PRM, thus protecting it from access by in Aegis or Ascend. other peripherals. 4.10 Ascend and Phantom The PRM is a continuous range of memory whose bounds are configured using a base and a mask regis- The Ascend [52] and Phantom [132] secure processors ter with the same semantics as a variable memory type introduced practical implementations of Oblivious RAM range (§ 2.11.4). Therefore, the PRM’s size must be [65] techniques in the CPU’s memory controller. These an integer power of two, and its start address must be processors are resilient to attackers who can probe the aligned to the same power of two. Due to these restric- DRAM address bus and attempt to learn a container’s tions, checking if an address belongs to the PRM can be private information from its DRAM memory access pat- done very cheaply in hardware, using the circuit outlined tern. in § 2.11.4. Implementing an ORAM scheme in a memory con- The SDM does not describe the PRM and the PRM troller is largely orthogonal to the other secure archi- range registers (PRMRR). These concepts are docu- tectures described above. It follows, for example, that mented in the SGX manuals [95, 99] and in one of Ascend’s ORAM implementation can be combined with the SGX papers [139]. Therefore, the PRM is a micro- Aegis’ memory encryption and authentication, and with architectural detail that might change in future implemen- Sanctum’s hardware extensions and security monitor, tations of SGX. Our security analysis of SGX relies on yielding a secure processor that can withstand both soft- implementation details surrounding the PRM, and will ware attacks and physical DRAM attacks. have to be re-evaluated for SGX future implementations.

5 SGX PROGRAMMING MODEL 5.1.1 The Enclave Page Cache (EPC) The central concept of SGX is the enclave, a protected The contents of enclaves and the associated data struc- environment that contains the code and data pertaining tures are stored in the Enclave Page Cache (EPC), which to a security-sensitive computation. is a subset of the PRM, as shown in Figure 60. SGX-enabled processors provide trusted computing by DRAM PRM EPC isolating each enclave’s environment from the untrusted EPCM software outside the enclave, and by implementing a soft- 4kb page Entry 4kb page Entry PRM EPC ware attestation scheme that allows a remote party to au- 4kb page Entry thenticate the software running inside an enclave. SGX’s ⋮ ⋮ isolation mechanisms are intended to protect the confi- 4kb page Entry 4kb page Entry dentiality and integrity of the computation performed inside an enclave from attacks coming from malicious Figure 60: Enclave data is stored into the EPC, which is a subset of software executing on the same computer, as well as the PRM. The PRM is a contiguous range of DRAM that cannot be from a limited set of physical attacks. accessed by system software or peripherals. This section summarizes the SGX concepts that make The SGX design supports having multiple enclaves up a mental model which is sufficient for programmers on a system at the same time, which is a necessity in to author SGX enclaves and to add SGX support to ex- multi-process environments. This is achieved by having isting system software. Unless stated otherwise, the the EPC split into 4 KB pages that can be assigned to information in this section is backed up by Intel’s Soft- different enclaves. The EPC uses the same page size as ware Developer Manual (SDM). The following section the architecture’s address translation feature (§ 2.5). This builds on the concepts introduced here to fill in some of is not a coincidence, as future sections will reveal that the the missing pieces in the manual, and analyzes some of SGX implementation is tightly coupled with the address SGX’s security properties. translation implementation.

58 The EPC is managed by the same system software Field Bits Description that manages the rest of the computer’s physical mem- VALID 1 0 for un-allocated EPC ory. The system software, which can be a hypervisor or pages an OS kernel, uses SGX instructions to allocate unused PT 8 page type pages to enclaves, and to free previously allocated EPC ENCLAVESECS identifies the enclave own- pages. The system software is expected to expose en- ing the page clave creation and management services to application Table 13: The fields in an EPCM entry that track the ownership of software. pages. Non-enclave software cannot directly access the EPC, as it is contained in the PRM. This restriction plays a key The SGX instructions that allocate an EPC page set role in SGX’s enclave isolation guarantees, but creates an the VALID bit of the corresponding EPCM entry to 1, obstacle when the system software needs to load the ini- and refuse to operate on EPC pages whose VALID bit is tial code and data into a newly created enclave. The SGX already set. design solves this problem by having the instructions The instruction used to allocate an EPC page also that allocate an EPC page to an enclave also initialize the determines the page’s intended usage, which is recorded page. Most EPC pages are initialized by copying data in the page type (PT) field of the corresponding EPCM from a non-PRM memory page. entry. The pages that store an enclave’s code and data are considered to have a regular type (PT REG in the 5.1.2 The Enclave Page Cache Map (EPCM) SDM). The pages dedicated to the storage of SGX’s The SGX design expects the system software to allocate supporting data structures are tagged with special types. the EPC pages to enclaves. However, as the system soft- For example, the PT SECS type identifies pages that ware is not trusted, SGX processors check the correctness hold SGX Enclave Control Structures, which will be of the system software’s allocation decisions, and refuse described in the following section. The other EPC page to perform any action that would compromise SGX’s types will be described in future sections. security guarantees. For example, if the system software Last, a page’s EPCM entry also identifies the enclave attempts to allocate the same EPC page to two enclaves, that owns the EPC page. This information is used by the SGX instruction used to perform the allocation will the mechanisms that enforce SGX’s isolation guarantees fail. to prevent an enclave from accessing another enclave’s In order to perform its security checks, SGX records private information. As the EPCM identifies a single some information about the system software’s allocation owning enclave for each EPC page, it is impossible for decisions for each EPC page in the Enclave Page Cache enclaves to communicate via using EPC Map (EPCM). The EPCM is an array with one entry pages. Fortunately, enclaves can share untrusted non- per EPC page, so computing the address of a page’s EPC memory, as will be discussed in § 5.2.3. EPCM entry only requires a bitwise shift operation and 5.1.3 The SGX Enclave Control Structure (SECS) an addition. The EPCM’s contents is only used by SGX’s security SGX stores per-enclave metadata in a SGX Enclave checks. Under normal operation, the EPCM does not Control Structure (SECS) associated with each enclave. generate any software-visible behavior, and enclave au- Each SECS is stored in a dedicated EPC page with the thors and system software developers can mostly ignore page type PT SECS. 
These pages are not intended to it. Therefore, the SDM only describes the EPCM at a be mapped into any enclave’s address space, and are very high level, listing the information contained within exclusively used by the CPU’s SGX implementation. and noting that the EPCM is “trusted memory”. The An enclave’s identity is almost synonymous to its SDM does not disclose the storage medium or memory SECS. The first step in bringing an enclave to life al- layout used by the EPCM. locates an EPC page to serve as the enclave’s SECS, and The EPCM uses the information in Table 13 to track the last step in destroying an enclave deallocates the page the ownership of each EPC page. We defer a full discus- holding its SECS. The EPCM entry field identifying the sion of the EPCM to a later section, because its contents enclave that owns an EPC page points to the enclave’s is intimately coupled with all of SGX’s features, which SECS. The system software uses the virtual address of will be described over the next few sections. an enclave’s SECS to identify the enclave when invoking

59 DRAM SGX instructions. Host Application Page Tables All SGX instructions take virtual addresses as their in- Virtual Memory Enclave Virtual managed by View system software puts. Given that SGX instructions use SECS addresses to Memory View identify enclaves, the system software must create entries

in its page tables pointing to the SECS of the enclaves it EPC manages. However, the system software cannot access Abort Page ELRANGE any SECS page, as these pages are stored in the PRM. SECS pages are not intended to be mapped inside their enclaves’ virtual address spaces, and SGX-enabled pro- cessors explicitly prevent enclave code from accessing SECS pages. This seemingly arbitrary limitation is in place so that Figure 61: An enclave’s EPC pages are accessed using a dedicated the SGX implementation can store sensitive information region in the enclave’s virtual address space, called ELRANGE. The in the SECS, and be able to assume that no potentially rest of the virtual address space is used to access the memory of the malicious software will access that information. For ex- host process. The memory mappings are established using the page tables managed by system software. ample, the SDM states that each enclave’s measurement is stored in its SECS. If software would be able to modify The SGX design guarantees that the enclave’s mem- an enclave’s measurement, SGX’s software attestation ory accesses inside ELRANGE obey the virtual memory scheme would provide no security assurances. abstraction (§ 2.5.1), while memory accesses outside EL- The SECS is strongly coupled with many of SGX’s RANGE receive no guarantees. Therefore, enclaves must features. Therefore, the pieces of information that make store all their code and private data inside ELRANGE, up the SECS will be gradually introduced as the different and must consider the memory outside ELRANGE to be aspects of SGX are described. an untrusted interface to the outside world. 5.2 The Memory Layout of an SGX Enclave The word “linear” in ELRANGE references the linear SGX was designed to minimize the effort required to addresses produced by the vestigial segmentation fea- convert application code to take advantage of enclaves. ture (§ 2.7) in the 64-bit Intel architecture. For most History suggests this is a wise decision, as a large factor purposes, “linear” can be treated as a synonym for “vir- in the continued dominance of the Intel architecture is tual”. its ability to maintain . To this ELRANGE is specified using a base (the BASEADDR end, SGX enclaves were designed to be conceptually field) and a size (the SIZE) in the enclave’s similar to the leading software modularization construct, SECS (§ 5.1.3). ELRANGE must meet the same con- dynamically loaded libraries, which are packaged as .so straints as a variable memory type range (§ 2.11.4) and as files on , and .dll files on Windows. the PRM range (§ 5.1), namely the size must be a power For simplicity, we describe the interaction between of 2, and the base must be aligned to the size. These enclaves and non-enclave software assuming that each restrictions are in place so that the SGX implementation enclave is used by exactly one application process, which can inexpensively check whether an address belongs to we shall refer to as the enclave’s host process. We do an enclave’s ELRANGE, in either hardware (§ 2.11.4) or note, however, that the SGX design does not explicitly software. prohibit multiple application processes from sharing an When an enclave represents a dynamic , it is enclave. natural to set ELRANGE to the memory range reserved for the library by the loader. 
The ability to access non- 5.2.1 The Enclave Linear Address Range (ELRANGE) enclave memory from enclave code makes it easy to Each enclave designates an area in its virtual address reuse existing library code that expects to work with space, called the enclave linear address range (EL- pointers to memory buffers managed by code in the host RANGE), which is used to map the code and the sensi- process. tive data stored in the enclave’s EPC pages. The virtual Non-enclave software cannot access PRM memory. A address space outside ELRANGE is mapped to access memory access that resolves inside the PRM results in non-EPC memory via the same virtual addresses as the an aborted transaction, which is undefined at an archi- enclave’s host process, as shown in Figure 61. tectural level, On current processors, aborted writes are

60 ignored, and aborted reads return a value whose bits are the new features. all set to 1. This comes into play in the scenario described The MODE64BIT flag is set to true for enclaves that above, where an enclave is loaded into a host application use the 64-bit Intel architecture. From a security stand- process as a dynamically loaded library. The system soft- point, this flag should not even exist, as supporting a ware maps the enclave’s code and data in ELRANGE secondary architecture adds unnecessary complexity to into EPC pages. If application software attempts to ac- the SGX implementation, and increases the probability cess memory inside ELRANGE, it will experience the that security vulnerabilities will creep in. It is very likely abort transaction semantics. The current semantics do that the 32-bit architecture support was included due to not cause the application to crash (e.g., due to a Page Intel’s strategy of offering extensive backwards compati- Fault), but also guarantee that the host application will bility, which has paid off quite well so far. not be able to tamper with the enclave or read its private In the interest of mental sanity, this work does information. not analyze the behavior of SGX for enclaves whose MODE64BIT flag is cleared. However, a security re- 5.2.2 SGX Enclave Attributes searcher who wishes to find vulnerabilities in SGX might The execution environment of an enclave is heavily in- study this area. fluenced by the value of the ATTRIBUTES field in the Last, the INIT flag is always false when the enclave’s enclave’s SECS (§ 5.1.3). The rest of this work will refer SECS is created. The flag is set to true at a certain point to the field’s sub-fields, shown in Table 14, as enclave in the enclave lifecycle, which will be summarized in attributes. § 5.3.

Field Bits Description 5.2.3 Address Translation for SGX Enclaves DEBUG 1 Opts into enclave debugging features. Under SGX, the operating system and hypervisor are XFRM 64 The value of XCR0 (§ 2.6) still in full control of the page tables and EPTs, and while this enclave’s code is each enclave’s code uses the same address translation executed. process and page tables (§ 2.5) as its host application. MODE64BIT 1 Set for 64-bit enclaves. This minimizes the amount of changes required to add SGX support to existing system software. At the same Table 14: An enclave’s attributes are the sub-fields in the AT- time, having the page tables managed by untrusted sys- TRIBUTES field of the enclave’s SECS. This table shows a subset of the attributes defined in the SGX documentation. tem software opens SGX up to the address translation attacks described in § 3.7. As future sections will reveal, The most important attribute, from a security perspec- a good amount of the complexity in SGX’s design can tive, is the DEBUG flag. When this flag is set, it enables be attributed to the need to prevent these attacks. the use of SGX’s debugging features for this enclave. SGX’s active memory mapping attacks defense mech- These debugging features include the ability to read and anisms revolve around ensuring that each EPC page modify most of the enclave’s memory. Therefore, DE- can only be mapped at a specific virtual address (§ 2.7). BUG should only be set in a development environment, When an EPC page is allocated, its intended virtual ad- as it causes the enclave to lose all the SGX security guar- dress is recorded in the EPCM entry for the page, in the antees. ADDRESS field. SGX guarantees that enclave code will always run When an address translation (§ 2.5) result is the physi- with the XCR0 register (§ 2.6) set to the value indicated cal address of an EPC page, the CPU ensures6 that the by extended features request mask (XFRM). Enclave au- virtual address given to the address translation process thors are expected to use XFRM to specify the set of matches the expected virtual address recorded in the architectural extensions enabled by the compiler used to page’s EPCM entry. produce the enclave’s code. Having XFRM be explicitly SGX also protects against some passive memory map- specified allows Intel to design new architectural exten- ping attacks and fault injection attacks by ensuring that sions that change the semantics of existing instructions, the access permissions of each EPC page always match such as Extensions (MPX), without the enclave author’s intentions. The access permissions having to worry about the security implications on en- clave code that was developed without an awareness of 6A mismatch triggers a general protection fault (#GP, § 2.8.2).

61 for each EPC page are specified when the page is allo- The SGX implementation uses a Thread Control Struc- cated, and recorded in the readable (R), writable (W), ture (TCS) for each logical processor that executes an and executable (X) fields in the page’s EPCM entry, enclave’s code. It follows that an enclave’s author must shown in Table 15. provision at least as many TCS instances as the maxi- mum number of concurrent threads that the enclave is Field Bits Description intended to support. ADDRESS 48 the virtual address used to ac- Each TCS is stored in a dedicated EPC page whose cess this page EPCM entry type is PT TCS. The SDM describes the R 1 allow reads by enclave code first few fields in the TCS. These fields are considered W 1 allow writes by enclave code to belong to the architectural part of the structure, and X 1 allow execution of code inside therefore are guaranteed to have the same semantics on the page, inside enclave all the processors that support SGX. The rest of the TCS Table 15: The fields in an EPCM entry that indicate the enclave’s is not documented. intended virtual memory layout. The contents of an EPC page that holds a TCS cannot When an address translation (§ 2.5) resolves into an be directly accessed, even by the code of the enclave that EPC page, the corresponding EPCM entry’s fields over- owns the TCS. This restriction is similar to the restric- ride the access permission attributes (§ 2.5.3) specified in tion on accessing EPC pages holding SECS instances. the page tables. For example, the W field in the EPCM However, the architectural fields in a TCS can be read by entry overrides the writable (W) attribute, and the X field enclave debugging instructions. overrides the disable execution (XD) attribute. The architectural fields in the TCS lay out the context It follows that an enclave author must include mem- switches (§ 2.6) performed by a logical processor when ory layout information along with the enclave, in such it transitions between executing non-enclave and enclave a way that the system software loading the enclave will code. know the expected virtual memory address and access For example, the OENTRY field specifies the value permissions for each enclave page. In return, the SGX loaded in the instruction pointer (RIP) when the TCS is design guarantees to the enclave authors that the sys- used to start executing enclave code, so the enclave au- tem software, which manages the page tables and EPT, thor has strict control over the entry points available to en- will not be able to set up an enclave’s virtual address clave’s host application. Furthermore, the OFSBASGX space in a manner that is inconsistent with the author’s and OFSBASGX fields specify the base addresses loaded expectations. in the FS and GS segment registers (§ 2.7), which typi- The .so and .dll file formats, which are SGX’s cally point to Thread Local Storage (TLS). intended enclave delivery vehicles, already have provi- 5.2.5 The State Save Area (SSA) sions for specifying the virtual addresses that a software module was designed to use, as well as the desired access When the processor encounters a hardware excep- permissions for each of the module’s memory areas. tion (§ 2.8.2), such as an interrupt (§ 2.12), while exe- Last, a SGX-enabled CPU will ensure that the virtual cuting the code inside an enclave, it performs a privilege memory inside ELRANGE (§ 5.2.1) is mapped to EPC level switch (§ 2.8.2) and invokes a hardware exception pages. 
This prevents the system software from carry- handler provided by the system software. Before ex- ing out an address translation attack where it maps the ecuting the exception handler, however, the processor enclave’s entire virtual address space to DRAM pages needs a secure area to store the enclave code’s execution outside the PRM, which do not trigger any of the checks context (§ 2.6), so that the information in the execution above, and can be directly accessed by the system soft- context is not revealed to the untrusted system software. ware. In the SGX design, the area used to store an enclave thread’s execution context while a hardware exception is 5.2.4 The Thread Control Structure (TCS) handled is called a State Save Area (SSA), illus- The SGX design fully embraces multi-core processors. trated in Figure 62. Each TCS references a contiguous se- It is possible for multiple logical processors (§ 2.9.3) to quence of SSAs. The offset of the SSA array (OSSA) field concurrently execute the same enclave’s code at the same specifies the location of the first SSA in the enclave’s time, via different threads. virtual address space. The number of SSAs (NSSA) field

62 indicates the number of available SSAs. tectural, and is completely documented in the SDM. This opens up possibilities for an enclave exception handler SECS Enclave virtual that is invoked by the host application after a hardware SIZE 40000 address space exception occurs, and acts upon the information in a BASEADDR C00000 ELF / PE Header SSA. SSAFRAMESIZE 3 TCS 1 5.3 The Life Cycle of an SGX Enclave EPCM entries OENTRY 01D038 An enclave’s life cycle is deeply intertwined with re- ADDRESS PT RWX OFSBASGX 008000 source management, specifically the allocation of EPC 0 PT_SECS OGSBASGX pages. Therefore, the instructions that transition between C00000 PT_REG R OSSA 001000 NSSA 2 C01000 PT_TCS different life cycle states can only be executed by the C02000 PT_REG RW SSA 1 Page 1 system software. The system software is expected to C03000 PT_REG RW SSA 1 Page 2 expose the SGX instructions described below as enclave C04000 PT_REG RW SSA 1 Page 3 loading and teardown services. C05000 PT_REG RW SSA 2 Page 1 The following subsections describe the major steps in C06000 PT_REG RW SSA 2 Page 2 an enclave’s lifecycle, which is illustrated by Figure 63. C07000 PT_REG RW SSA 2 Page 3 C08000 PT_REG RW Thread 1 TLS Non- EADD ECREATE Uninitialized C09000 PT_TCS TCS 2 existing EEXTEND ⋮ ⋮ ⋮ ⋮

C1C000 PT_REG RWX Code Pages EINIT EGETKEY EREMOVE C1D000 PT_REG RWX _main EREPORT ⋮ ⋮ ⋮ EENTER Initialized Initialized C3F000 PT_REG RW Data Pages ERESUME In use Not in use EEXIT Figure 62: A possible layout of an enclave’s virtual address space. EBLOCK AEX EBLOCK ETRACK ETRACK Each enclave has a SECS, and one TCS per supported concurrent ELDU, ELDB ELDU, ELDB thread. Each TCS points to a sequence of SSAs, and specifies initial EWB values for RIP and for the base addresses of FS and GS. Figure 63: The SGX enclave life cycle management instructions Each SSA starts at the beginning of an EPC page, and and state transition diagram uses up the number of EPC pages that is specified in the SSAFRAMESIZE field of the enclave’s SECS. These 5.3.1 Creation alignment and size restrictions most likely simplify the An enclave is born when the system software issues the SGX implementation by reducing the number of special ECREATE instruction, which turns a free EPC page into cases that it needs to handle. the SECS (§ 5.1.3) for the new enclave. An enclave thread’s execution context consists of ECREATE initializes the newly created SECS using the general-purpose registers (GPRs) and the result of the information in a non-EPC page owned by the system the XSAVE instruction (§ 2.6). Therefore, the size of software. This page specifies the values for all the SECS the execution context depends on the requested-feature fields defined in the SDM, such as BASEADDR and bitmap (RFBM) used by to XSAVE. All the code in an SIZE, using an architectural layout that is guaranteed to enclave uses the same RFBM, which is declared in the be preserved by future implementations. XFRM enclave attribute (§ 5.2.2). The number of EPC While is very likely that the actual SECS layout used pages reserved for each SSA, specified in SSAFRAME- by initial SGX implementations matches the architec- SIZE, must7 be large enough to fit the XSAVE output for tural layout quite closely, future implementations are the feature bitmap specified by XFRM. free to deviate from this layout, as long as they main- SSAs are stored in regular EPC pages, whose EPCM tain the ability to initialize the SECS using the archi- page type is PT REG. Therefore, the SSA contents is tectural layout. Software cannot access an EPC page accessible to enclave software. The SSA layout is archi- that holds a SECS, so it cannot become dependent on 7ECREATE (§ 5.3.1) fails if SSAFRAMESIZE is too small. an internal SECS layout. This is a stronger version of

63 the encapsulation used in the Virtual Machine Control Currently, the PAGEINFO structure contains the vir- Structure (VMCS, § 2.8.3). tual address of the EPC page that will be allocated ECREATE validates the information used to initialize (LINADDR), the virtual address of the non-EPC page the SECS, and results in a page fault (#PF, § 2.8.2) or whose contents will be copied into the newly allocated general protection fault (#GP, § 2.8.2) if the information EPC page (SRCPGE), a virtual address that resolves to is not valid. For example, if the SIZE field is not a the SECS of the enclave that will own the page (SECS), power of two, ECREATE results in #GP. This validation, and values for some of the fields of the EPCM entry asso- combined with the fact that the SECS is not accessible ciated with the newly allocated EPC page (SECINFO). by software, simplifies the implementation of the other The SECINFO field in the PAGEINFO structure is ac- SGX instructions, which can assume that the information tually a virtual memory address, and points to a Security inside the SECS is valid. Information (SECINFO) structure, some of which is also Last, ECREATE initializes the enclave’s INIT attribute illustrated in Figure 64. The SECINFO structure contains (sub-field of the ATTRIBUTES field in the enclave’s the newly allocated EPC page’s access permissions (R, SECS, § 5.2.2) to the false value. The enclave’s code W, X) and its EPCM page type (PT REG or PT TCS). cannot be executed until the INIT attribute is set to true, Like PAGEINFO, the SECINFO structure is solely used which happens in the initialization stage that will be to communicate data to the SGX implementation, so its described in § 5.3.3. contents are also entirely architectural. However, most of the structure’s 64 bytes are reserved for future use. 5.3.2 Loading Both the PAGEINFO and the SECINFO structures are prepared by the system software that invokes the ECREATE marks the newly created SECS as uninitial- EADD instruction, and therefore must be contained in ized. While an enclave’s SECS is in this state, the system non-EPC pages. Both structures must be aligned to their software can use EADD instructions to load the initial sizes – PAGEINFO is 32 bytes long, so each PAGEINFO code and data into the enclave. EADD is used to create instance must be 32-byte aligned, while SECINFO has 64 both TCS pages (§ 5.2.4) and regular pages. bytes, and therefore each SECINFO instance must be 64- EADD reads its input data from a Page Informa- byte aligned. The alignment requirements likely simplify tion (PAGEINFO) structure, illustrated in Figure 64. The the SGX implementation by reducing the number of structure’s contents are only used to communicate in- special cases that must be handled. formation to the SGX implementation, so it is entirely EADD validates its inputs before modifying the newly architectural and documented in the SDM. allocated EPC page or its EPCM entry. Most importantly, attempting to EADD a page to an enclave whose SECS is Enclave and Host Application Virtual Address Space in the initialized state will result in a #GP. Furthermore, attempting to EADD an EPC page that is already allocated SECS (the VALID field in its EPCM entry is 1) results in a #PF. BASEADDR PAGEINFO SIZE SECS EADD also ensures that the page’s virtual address falls LINADDR within the enclave’s ELRANGE, and that all the reserved

ELRANGE SRCPGE fields in SECINFO are set to zero. SECINFO While loading an enclave, the system software will New EPC Page also use the EEXTEND instruction, which updates the enclave’s measurement used in the software attestation

Initial Page Contents process. Software attestation is discussed in § 5.8.

SECINFO EPCM Entry 5.3.3 Initialization FLAGS ADDRESS After loading the initial code and data pages into the R, W, X R, W, X enclave, the system software must use a Launch En- PAGE_TYPE PT clave ENCLAVESECS (LE) to obtain an EINIT Token Structure, via an under-documented process that will be described in more detail in § 5.9.1. The token is then provided to the EINIT Figure 64: The PAGEINFO structure supplies input data to SGX instruction, which marks the enclave’s SECS as initial- instructions such as EADD.

64 ized. Each logical processor that executes enclave code uses The LE is a privileged enclave provided by Intel, and a Thread Control Structure (TCS, § 5.2.4). When a TCS is a prerequisite for the use of enclaves authored by is used by a logical processor, it is said to be busy, and it parties other than Intel. The LE is an SGX enclave, cannot be used by any other logical processor. Figure 65 so it must be created, loaded and initialized using the illustrates the instructions used by a host process to ex- processes described in this section. However, the LE is ecute enclave code and their interactions with the TCS cryptographically signed (§ 3.1.3) with a special Intel that they target. key that is hard-coded into the SGX implementation, and that causes EINIT to initialize the LE without checking Logical Processor in Enclave Mode for a valid EINIT Token Structure.

When EINIT completes successfully, it sets the en- TCS Available EEXIT TCS Busy clave’s INIT attribute to true. This opens the way for ring CSSA = 0 EENTER CSSA = 0 3 (§ 2.3) application software to execute the enclave’s ERESUME code, using the SGX instructions described in § 5.4. On AEX

the other hand, once INIT is set to true, EADD cannot be TCS Available EEXIT TCS Busy invoked on that enclave anymore, so the system software CSSA = 1 EENTER CSSA = 1 must load all the pages that make up the enclave’s initial ERESUME state before executing the EINIT instruction. AEX

5.3.4 Teardown TCS Available CSSA = 2 After the enclave has done the computation it was de- signed to perform, the system software executes the Figure 65: The stages of the life cycle of an SGX Thread Control EREMOVE instruction to deallocate the EPC pages used Structure (TCS) that has two State Save Areas (SSAs). by the enclave. Assuming that no hardware exception occurs, an en- EREMOVE marks an EPC page as available by setting clave’s host process uses the EENTER instruction, de- the VALID field of the page’s EPCM entry to 0 (zero). scribed in § 5.4.1, to execute enclave code. When the en- Before freeing up the page, EREMOVE makes sure that clave code finishes performing its task, it uses the EEXIT there is no logical processor executing code inside the instruction, covered in § 5.4.2, to return the execution enclave that owns the page to be removed. control to the host process that invoked the enclave. An enclave is completely destroyed when the EPC If a hardware exception occurs while a logical proces- EREMOVE page holding its SECS is freed. refuses to sor is in enclave mode, the processor is taken out of en- deallocate a SECS page if it is referenced by any other clave mode using an Asynchronous Enclave Exit (AEX), EPCM entry’s ENCLAVESECS field, so an enclave’s summarized in § 5.4.3, before the system software’s ex- SECS page can only be deallocated after all the enclave’s ception handler is invoked. After the system software’s pages have been deallocated. handler is invoked, the enclave’s host process can use 5.4 The Life Cycle of an SGX Thread the ERESUME instruction, described in § 5.4.4, to re- Between the time when an enclave is initialized (§ 5.3.3) enter the enclave and resume the computation that it was and the time when it is torn down (§ 5.3.4), the enclave’s performing. code can be executed by any application process that has 5.4.1 Synchronous Enclave Entry the enclave’s EPC pages mapped into its virtual address space. At a high level, EENTER performs a controlled jump into When executing the code inside an enclave, a logical enclave code, while performing the processor configura- processor is said to be in enclave mode, and the code tion that is needed by SGX’s security guarantees. Going that it executes can access the regular (PT REG, § 5.1.2) through all the configuration steps is a tedious exercise, EPC pages that belong to the currently executing en- but it a necessary prerequisite to understanding how all clave. When a logical process is outside enclave mode, data structures used by SGX work together. For this it bounces any memory accesses inside the Processor reason, EENTER and its siblings are described in much Reserved Memory range (PRM, § 5.1), which includes more detail than the other SGX instructions. the EPC. EENTER, illustrated in Figure 66 can only be exe-

65 cuted by unprivileged application software running at while enclave code is executed. ring 3 (§ 2.3), and results in an undefined instruction EENTER transitions the logical processor into enclave (#UD) fault if is executed by system software. mode, and sets the instruction pointer (RIP) to the value indicated by the entry point offset (OENTRY) field in x the TCS that it receives. EENTER is used by an un- TCS EPCM Entry + trusted caller to execute code in a protected environment, ENCLAVESECS SECS R, W, X, PT SSAFRAMESIZE and therefore has the same security considerations as PT BASEADDR SYSCALL (§ 2.8), which is used to call into system soft- XFRM ware. Setting RIP to the value indicated by OENTRY Input SSA guarantees to the enclave author that the enclave code RSP U_RSP RBP U_RBP will only be invoked at well defined points, and prevents RCX AEP a malicious host application from bypassing any security RIP XSAVE checks that the enclave author may perform. RBX GPRSGX EENTER also sets XCR0 (§ 2.6), the register that con- XCR0 trols which extended architectural features are in use, to FS Base Limit Type Selector Output Register File the value of the XFRM enclave attribute (§ 5.2.2). En- GS Base Limit Type Selector XCR0 suring that XCR0 is set according to the enclave author’s RCX intentions prevents a malicious operating system from TCS RIP + Reserved bypassing an enclave’s security by enabling architectural GS features that the enclave is not prepared to handle. CR_SAVE_GS Limit Base + Furthermore, EENTER loads the bases of the segment CR_SAVE_FS FS registers (§ 2.7) FS and GS using values specified in the Limit Base CR_SAVE_XCR0 TCS. The segments’ selectors and types are hard-coded GSLIMIT to safe values for ring 3 data segments. This aspect of FSLIMIT OFSBASGX + the SGX design makes it easy to implement per-thread OGSBASGX Thread Local Storage (TLS). For 64-bit enclaves, this is OENTRY a convenience feature rather than a security measure, as CSSA enclave code can securely load new bases into FS and Read Write OSSA GS using the WRFSBASE and WRGSBASE instructions. Figure 66: Data flow diagram for a subset of the logic in EENTER. The EENTER implementation backs up the old val- The figure omits the logic for disabling debugging features, such as ues of the registers that it modifies, so they can be re- hardware and performance monitoring events. stored when the enclave finishes its computation. Just EENTER switches the logical processor to en- like SYSCALL, EEENTER saves the address of the fol- clave mode, but does not perform a privilege level lowing instruction in the RCX register. switch (§ 2.8.2). Therefore, enclave code always exe- Interestingly, the SDM states that the old values of the cutes at ring 3, with the same privileges as the application XCR0, FS, and GS registers are saved in new registers code that calls it. This makes it possible for an infras- dedicated to the SGX implementation. However, given tructure owner to allow user-supplied software to create that they will only be used on an enclave exit, we expect and use enclaves, while having the assurance that the OS that the registers are saved in DRAM, in the reserved kernel and hypervisor can still protect the infrastructure area in the TCS. from buggy or malicious software. Like SYSCALL, EENTER does not modify the stack EENTER takes the virtual address of a TCS as its input, pointer register (RSP). 
To avoid any security exploits, and requires that the TCS is available (not busy), and that enclave code should set RSP to point to a stack area at least one State Save Area (SSA, § 5.2.5) is available that is entirely contained in EPC pages. Multi-threaded in the TCS. The latter check is implemented by making enclaves can easily implement per-thread stack areas by sure that the current SSA index (CSSA) field in the TCS setting up each thread’s TLS area to include a pointer is less than the number of SSAs (NSSA) field. The SSA to the thread’s stack, and by setting RSP to the value indicated by the CSSA, which shall be called the current obtained by reading the TLS area at which the FS or GS SSA, is used in the event that a hardware exception occurs segment points.

66 Last, when EENTER enters enclave mode, it suspends Application Code some of the processor’s debugging features, such as call() { prepare call arguments TCS AEX Path hardware breakpoints and Precise Event Based Sam- try { CSSA EENTER Enclave Code pling (PEBS). Conceptually, a debugger attached to the OENTRY void entry() { host process sees the enclave’s execution as one single RCX: AEP RBX: TCS RCX set by processor instruction. store call results EENTER } catch (AEX e) { 5.4.2 Synchronous Enclave Exit read ESP from FS:TLS Resumable Yes EEXIT can only be executed while the logical processor exception? PUSH RCX is in enclave mode, and results in a (#UD) if executed perform enclave No computation in any other circumstances. In a nutshell, the instruction POP RBX returns the processor to ring 3 outside enclave mode return ERROR; EEXIT ERESUME Synchronous and restores the registers saved by EENTER, which were Execution Path } RCX: AEP RBX: TCS described above. store call results SYSRET EEXIT Unlike , sets RIP to the value read } AEX from RBX, after exiting enclave mode. This is inconsis- return SUCCESS; } tent with EENTER, which saves the RIP value to RCX. Ring 0 SSA Unless this inconsistency stems from an error in the System Software Stack XSAVE SDM, enclave code must be sure to note the difference. Hardware Exception Handler GPRs GPRSGX void handler() { The SDM explicitly states that EEXIT does not mod- Code AEP save GPRs RIP U_RBP ify most registers, so enclave authors must make sure to handle exception CS U_RSP clear any secrets stored in the processor’s registers before restore GPRs RFLAGS Registers returning control to the host process. Furthermore, en- IRET RSP cleared clave software will most likely cause a fault in its caller } SS by AEX if it doesn’t restore the stack pointer RSP and the stack Figure 67: If a hardware exception occurs during enclave execution, frame base pointer RBP to the values that they had when the synchronous execution path is aborted, and an Asynchronous EENTER was called. Enclave Exit (AEX) occurs instead. It may seem unfortunate that enclave code can induce faults in its caller. For better or for worse, this perfectly The AEX saves the enclave code’s execution con- matches the case where an application calls into a dynam- text (§ 2.6), restores the state saved by EENTER, and ically loaded module. More specifically, the module’s sets up the processor registers so that the system soft- code is also responsible for preserving stack-related reg- ware’s hardware exception handler will return to an asyn- isters, and a buggy module might jump anywhere in the chronous exit handler in the enclave’s host process. The application code of the host process. exit handler is expected to use the ERESUME instruction This section describes the EENTER behavior for 64- to resume the enclave computation that was interrupted bit enclaves. The EENTER implementation for 32-bit by the hardware exception. enclaves is significantly more complex, due to the extra Asides from the behavior described in § 5.4.1, special cases introduced by the full-fledged segmentation EENTER also writes some information to the current model that is still present in the 32-bit Intel architecture. SSA, which is only used if an AEX occurs. As shown As stated in the introduction, we are not interested in in Figure 66, EENTER stores the stack pointer register such legacy aspects. RSP and the stack frame base pointer register RBP into the U RSP and U RBP fields in the current SSA. 
Last, 5.4.3 Asynchronous Enclave Exit (AEX) EENTER stores the value in RCX in the Asynchronous If a hardware exception, like a fault (§ 2.8.2) or an in- Exit handler Pointer (AEP) field in the current SSA. terrupt (§ 2.12), occurs while a logical processor is ex- When a hardware exception occurs in enclave mode, ecuting an enclave’s code, the processor performs an the SGX implementation performs a sequence of steps Asynchronous Enclave Exit (AEX) before invoking the that takes the logical processor out of enclave mode and system software’s exception handler, as shown in Fig- invokes the hardware exception handler in the system ure 67. software. Conceptually, the SGX implementation first

67 performs an AEX to take the logical processor out of en- 5.4.4 Recovering from an Asynchronous Exit clave mode, and then the hardware exception is handled When a hardware exception occurs inside enclave mode, using the standard Intel architecture’s behavior described the processor performs an AEX before invoking the ex- in § 2.8.2. Actual Intel processors may interleave the ception’s handler set up by the system software. The AEX implementation with the exception handling imple- AEX sets up the execution context in such a way that mentation. However, for simplicity, this work describes when the system software finishes processing the excep- AEX as a separate process that is performed before any tion, it returns into an asynchronous exit handler in the exception handling steps are taken. enclave’s host process. The asynchronous exception han- dler usually executes the ERESUME instruction, which In the Intel architecture, if a hardware exception oc- causes the logical processor to go back into enclave mode curs, the application code’s execution context can be read and continue the computation that was interrupted by the and modified by the system software’s exception handler hardware exception. (§ 2.8.2). This is acceptable when the system software ERESUME shares much of its functionality with is trusted by the application software. However, under EENTER. This is best illustrated by the similarity be- SGX’s threat model, the system software is not trusted tween Figures 68 and 67. by enclaves. Therefore, the AEX step erases any secrets that may exist in the execution state by resetting all its Application Code registers to predefined values. int call() { prepare call arguments TCS AEX Path try { Before the enclave’s execution state is reset, it is CSSA EENTER Enclave Code OENTRY backed up inside the current SSA. Specifically, an AEX RCX: AEP RBX: TCS void entry() { backs up the general purpose registers (GPRs, § 2.6) store call results RCX set by

in the GPRSGX area in the SSA, and then performs } catch (AEX e) { ERESUME an XSAVE (§ 2.6) using the requested-feature bitmap read ESP from FS:TLS Resumable (RFBM) specified in the XFRM field in the enclave’s Yes exception? PUSH RCX SECS. As each SSA is entirely stored in EPC pages al- perform enclave located to the enclave, the system software cannot read No computation or tamper with the backed up execution state. When an return ERROR; POP RBX SSA receives the enclave’s execution state, it is marked ERESUME EEXIT } as used by incrementing the CSSA field in the current RCX: AEP RBX: TCS store call results TCS. Synchronous } Execution Path AEX return SUCCESS; After clearing the execution context, the AEX process } sets RSP and RBP to the values saved by EENTER in Ring 0 SSA the current SSA, and sets RIP to the value in the current System Software Stack XSAVE Hardware Exception Handler GPRs GPRSGX SSA’s AEP field. This way, when the system software’s void handler() { Code AEP hardware exception handler completes, the processor save GPRs RIP U_RBP will execute the asynchronous exit handler code in the handle exception CS U_RSP restore GPRs RFLAGS enclave’s host process. The SGX design makes it easy Registers IRET to set up the asynchronous handler code as an exception RSP cleared } SS by AEX handler in the routine that contains the EENTER instruc- tion, because the RSP and RBP registers will have the Figure 68: If a hardware exception occurs during enclave execution, same values as they had when EENTER was executed. the synchronous execution path is aborted, and an Asynchronous Enclave Exit (AEX) occurs instead. Many of the actions taken by AEX to get the logical EENTER and ERESUME receive the same inputs, processor outside of enclave mode match EEXIT. The namely a pointer to a TCS, described in § 5.4.1, and segment registers FS and GS are restored to the values an AEP, described in § 5.4.3. The most common appli- saved by EENTER, and all the debugging facilities that cation design will pair each EENTER instance with an were suppressed by EENTER are restored to their previ- asynchronous exit handler that invokes ERESUME with ous states. exactly the same arguments.

68 The main difference between ERESUME and EENTER without a significant degradation in user experience. is that the former uses an SSA that was “filled out” by Unfortunately, the OS cannot be allowed to evict an an AEX (§ 5.4.3), whereas the latter uses an empty SSA. enclave’s EPC pages via the same methods that are used Therefore, ERESUME results in a #GP fault if the CSSA to implement page swapping for DRAM memory outside field in the provided TCS is 0 (zero), whereas EENTER the PRM range. In the SGX threat model, enclaves do fails if CSSA is greater than or equal to NSSA. not trust the system software, so the SGX design offers When successful, ERESUME decrements the CSSA an EPC page eviction method that can defend against field of the TCS, and restores the execution context a malicious OS that attempts any of the active address backed up in the SSA pointed to by the CSSA field translation attacks described in § 3.7. in the TCS. Specifically, the ERESUME implementation The price of the security afforded by SGX is that an restores the GPRs (§ 2.6) from the GPRSGX field in OS kernel that supports evicting EPC pages must use the SSA, and performs an XRSTOR (§ 2.6) to load the a modified page swapping implementation that inter- execution state associated with the extended architectural acts with the SGX mechanisms. Enclave authors can features used by the enclave. mostly ignore EPC evictions, similarly to how today’s ERESUME shares the following behavior with application developers can ignore the OS kernel’s paging EENTER (§ 5.4.1). Both instructions write the U RSP, implementation. U RBP, and AEP fields in the current SSA. Both instruc- As illustrated in Figure 69, SGX supports evicting tions follow the same process for backing up XCR0 and EPC pages to DRAM pages outside the PRM range. The the FS and GS segment registers, and set them to the system software is expected to use its existing page swap- same values, based on the current TCS and its enclave’s ping implementation to evict the contents of these pages SECS. Last, both instructions disable the same subset of out of DRAM and onto a disk. the logical processor’s debugging features. ERESUME Enclave Non-PRM An interesting edge case that handles cor- Memory Memory rectly is that it sets XCR0 to the XFRM enclave at- tribute before performing an XRSTOR. It follows that ERESUME EPC fails if the requested feature bitmap (RFBM) EWB classical in the SSA is not a subset of XFRM. This matters be- page HDD / SSD ELDU, swapping cause, while an AEX will always use the XFRM value ELDB as the RFBM, enclave code executing on another thread Disk is free to modify the SSA contents before ERESUME is DRAM DRAM called. The correct sequencing of actions in the ERESUME im- Figure 69: SGX offers a method for the OS to evict EPC pages into plementation prevents a malicious application from using non-PRM DRAM. The OS can then use its standard paging feature an enclave to modify registers associated with extended to evict the pages out of DRAM. architectural features that are not declared in XFRM. SGX’s eviction feature revolves around the EWB in- This would break the system software’s ability to provide struction, described in detail in § 5.5.4. Essentially, EWB thread-level execution context isolation. evicts an EPC page into a DRAM page outside the EPC and marks the EPC page as available, by zeroing the 5.5 EPC Page Eviction VALID field in the page’s EPCM entry. 
Modern OS kernels take advantage of address transla- The SGX design relies on symmetric key cryp- tion (§ 2.5) to implement page swapping, also referred tograpy 3.1.1 to guarantee the confidentiality and in- to as paging (§ 2.5). In a nutshell, paging allows the OS tegrity of the evicted EPC pages, and on nonces (§ 3.1.4) kernel to over-commit the computer’s DRAM by evicting to guarantee the freshness of the pages brought back rarely used memory pages to a slower storage medium into the EPC. These nonces are stored in Version Ar- called the disk. rays (VAs), covered in § 5.5.2, which are EPC pages Paging is a key contributor to utilizing a computer’s dedicated to nonce storage. resources effectively. For example, a desktop system Before an EPC page is evicted and freed up for use whose user runs multiple programs concurrently can by other enclaves, the SGX implementation must ensure evict memory pages allocated to inactive applications that no TLB has address translations associated with the

69 evicted page, in order to avoid the TLB-based address the page. Furthermore, performing IPIs and TLB flushes translation attack described in § 3.7.4. for each page eviction would add a significant overhead As explained in § 5.1.1, SGX leaves the system soft- to a paging implementation, so the SGX design allows ware in charge of managing the EPC. It naturally follows a batch of pages to be evicted using a single IPI / TLB that the SGX instructions described in this section, which flush sequence. are used to implement EPC paging, are only available to The TLB flush verification logic relies on a 1-bit system software, which runs at ring 0 § 2.3. EPCM entry field called BLOCKED. As shown in Fig- In today’s software stacks (§ 2.3), only the OS ker- ure 70, the VALID and BLOCKED fields three nel implements page swapping in order to support the possible EPC page states. A page is free when both bits over-committing of DRAM. The hypervisor is only used are zero, in use when VALID is one and BLOCKED is to partition the computer’s physical resources between zero, and blocked when both bits are one. operating systems. Therefore, this section is written with the expectation that the OS kernel will also take on the Free BLOCKED = 0 responsibility of EPC page swapping. For simplicity, VALID = 0 we often use the term “OS kernel” instead of “system ELDU ELDB software”. The reader should be aware that the SGX ECREATE, EADD, EPA design does not preclude a system where the hypervisor EREMOVE EWB implements its own EPC page swapping. Therefore, “OS EREMOVE kernel” should really be read as “the system software In Use Blocked that performs EPC paging”. BLOCKED = 0 EBLOCK BLOCKED = 1 VALID = 1 VALID = 1 5.5.1 Page Eviction and the TLBs Figure 70: The VALID and BLOCKED bits in an EPC page’s One of the least promoted accomplishments of SGX is EPCM entry can be in one of three states. EADD and its siblings that it does not add any security checks to the memory allocate new EPC pages. EREMOVE permanently deallocates an EPC execution units (§ 2.9.4, § 2.10). Instead, SGX’s access page. EBLOCK blocks an EPC page so it can be evicted using EWB. control checks occur after an address translation (§ 2.5) ELDB and ELDU load an evicted page back into the EPC. is performed, right before the translation result is written Blocked pages are not considered accessible to en- into the TLBs (§ 2.11.5). This aspect is generally down- claves. If an address translation results in a blocked EPC played throughout the SDM, but it becomes visible when page, the SGX implementation causes the translation to explaining SGX’s EPC page eviction mechanism. result in a Page Fault (#PF, § 2.8.2). This guarantees that A full discussion of SGX’s memory access protections once a page is blocked, the CPU will not create any new checks merits its own section, and is deferred to § 6.2. TLB entries pointing to it. The EPC page eviction mechanisms can be explained Furthermore, every SGX instruction makes sure that using only two requirements from SGX’s security model. the EPC pages on which it operates are not blocked. For First, when a logical processor exits an enclave, either example, EENTER ensures that the TCS it is given is not via EEXIT (§ 5.4.2) or via an AEX (§ 5.4.3), its TLBs blocked, that its enclave’s SECS is not blocked, and that are flushed. Second, when an EPC page is deallocated every page in the current SSA is not blocked. 
from an enclave, all logical processors executing that In order to evict a batch of EPC pages, the OS kernel enclave’s code must be directed to exit the enclave. This must first issue EBLOCK instructions targeting them. The is sufficient to guarantee the removal of any TLB entry OS is also expected to remove the EPC page’s mapping targeting the deallocated EPC. from page tables, but is not trusted to do so. System software can cause a logical processor to exit After all the desired pages have been blocked, the OS an enclave by sending it an Inter-Processor Interrupt kernel must execute an ETRACK instruction, which di- (IPI, § 2.12), which will trigger an AEX when received. rects the SGX implementation to keep track of which log- Essentially, this is a very coarse-grained TLB shootdown. ical processors have had their TLBs flushed. ETRACK re- SGX does not trust system software. Therefore, be- quires the virtual address of an enclave’s SECS (§ 5.1.3). fore marking an EPC page’s EPCM entry as free, the If the OS wishes to evict a batch of EPC pages belonging SGX implementation must ensure that the OS kernel has to multiple enclaves, it must issue an ETRACK for each flushed all the TLBs that might contain translations for enclave.

70 Following the ETRACK instructions, the OS kernel DRESS fields in their EPCM entries set to zero, and must induce enclave exits on all the logical processors cannot be accessed directly by any software, including that are executing code inside the enclaves that have been enclaves. ETRACKed. The SGX design expects that the OS will Unlike the other page types discussed so far, VA pages use IPIs to cause AEXs in the logical processors whose are not associated with any enclave. This means they TLBs must be flushed. can be deallocated via EREMOVE without any restriction. The EPC page eviction process is completed when the However, freeing up a VA page whose slots are in use ef- OS executes an EWB instruction for each EPC page to be fectively discards the nonces in those slots, which results evicted. This instruction, which will be fully described in losing the ability to load the corresponding evicted in § 5.5.4, writes an encrypted version of the EPC page pages back into the EPC. Therefore, it is unlikely that a to be evicted into DRAM, and then frees the page by correct OS implementation will ever call EREMOVE on a clearing the VALID and BLOCKED bits in its EPCM VA with non-free slots. entry. Before carrying out its tasks, EWB ensures that the According to the pseudo-code for EPA and EWB in the EPC page that it targets has been blocked, and checks the SDM, SGX uses the zero value to represent the free slots state set up by ETRACK to make sure that all the relevant in a VA, implying that all the generated nonces have to TLBs have been flushed. be non-zero. This also means that EPA initializes a VA An evicted page can be loaded back into the EPC via simply by zeroing the underlying EPC page. However, the ELDU and ELDB instructions. Both instructions start since software cannot access a VA’s contents, neither the up with a free EPC page and a DRAM page that has the use of a special value, nor the value itself is architectural. evicted contents of an EPC page, decrypt the DRAM page’s contents into the EPC page, and restore the cor- 5.5.3 Enclave IDs responding EPCM entry. The only difference between The EWB and ELDU / ELDB instructions use an en- ELDU and ELDB is that the latter sets the BLOCKED bit clave ID (EID) to identify the enclave that owns an in the page’s EPCM entry, whereas the former leaves it evicted page. The EID has the same purpose as the EN- cleared. CLAVESECS (§ 5.1.2) field in an EPCM entry, which is ELDU and ELDB resemble ECREATE and EADD, in also used to identify the enclave that owns an EPC page. the sense that they populate a free EPC page. Since This section explains the need for having two values rep- the page that they operate on was free, the SGX secu- resent the same concept by comparing the two values rity model predicates that no TLB entries can possibly and their uses. target it. Therefore, these instructions do not require a The SDM states that ENCLAVESECS field in an mechanism similar to EBLOCK or ETRACK. EPCM entry is used to identify the SECS of the enclave owning the associated EPC page, but stops short of de- 5.5.2 The Version Array (VA) scribing its format. In theory, the ENCLAVESECS field When EWB evicts the contents of an EPC, it creates an can change its representation between SGX implemen- 8-byte nonce (§ 3.1.4) that Intel’s documentation calls a tations since SGX instructions never expose its value to page version. SGX’s freshness guarantees are built on the software. 
assumption that nonces are stored securely, so EWB stores However, we will later argue that the most plausible the nonce that it creates inside a Version Array (VA). representation of the ENCLAVESECS field is the phys- Version Arrays are EPC pages that are dedicated to ical address of the enclave’s SECS. Therefore, the EN- storing nonces generated by EWB. Each VA is divided CLAVESECS value associated with a given enclave will into slots, and each slot is exactly large enough to store change if the enclave’s SECS is evicted from the EPC one nonce. Given that the size of an EPC page is 4KB, and loaded back at a different location. It follows that the and each nonce occupies 8 bytes, it follows that each VA ENCLAVESECS value is only suitable for identifying has 512 slots. an enclave while its SECS remains in the EPC. VA pages are allocated using the EPA instruction, According to the SDM, the EID field is a 64-bit field which takes in the virtual address of a free EPC page, and stored in an enclave’s SECS. ECREATE’s pseudocode turns it into a Version Array with empty slots. VA pages in the SDM reveals that an enclave’s ID is generated are identified by the PT VA type in their EPCM entries. when the SECS is allocated, by atomically incrementing Like SECS pages, VA pages have the ENCLAVEAD- a global counter. Assuming that the counter does not roll

71 over8, this process guarantees that every enclave created code (MAC, § 3.1.3) tag. With the exception of the during a power cycle has a unique EID. nonce, EWB writes its output in DRAM outside the PRM Although the SDM does not specifically guarantee area, so the system software can choose to further evict this, the EID field in an enclave’s SECS does not appear it to disk. to be modified by any instruction. This makes the EID’s The EPC page contents is encrypted, to protect the value suitable for identifying an enclave throughout its confidentiality of the enclave’s data while the page is lifetime, even across evictions of its SECS page from the stored in the untrusted DRAM outside the PRM range. EPC. Without the use of encryption, the system software could learn the contents of an EPC page by evicting it from the 5.5.4 Evicting an EPC Page EPC. The system software evicts an EPC page using the EWB The page metadata is stored in a Page Informa- instruction, which produces all the data needed to restore tion (PAGEINFO) structure, illustrated in Figure 72. This the evicted page at a later time via the ELDU instruction, structure is similar to the PAGEINFO structure described as shown in Figure 71. in § 5.3.2 and depicted in Figure 64, except that the SECINFO field has been replaced by a PCMD field, EPCM EPC which contains the virtual address of a Page Crypto Meta-

⋮ ⋮ data (PCMD) structure. ELDB target metadata ELDB target page Enclave and Host Application ⋮ ⋮ Virtual Address Space VA page metadata VA page ⋮ ⋮ SECS BASEADDR EWB source metadata EWB source page ⋮ ⋮ SIZE EID PAGEINFO SECS ELRANGE LINADDR VA page SRCPGE EWB EPC Page PCMD ⋮ Untrusted DRAM nonce ⋮ = Encrypted EPC Page

PCMD Encrypted Page MAC EPC Page Metadata Tag SECINFO EPCM Entry FLAGS ADDRESS R, W, X R, W, X PAGE_TYPE ELDU / PT ENCLAVESECS ELDB ENCLAVEID MAC

Figure 71: The EWB instruction outputs the encrypted contents of the evicted EPC page, a subset of the fields in the page’s EPCM entry, a MAC tag, and a nonce. All this information is used by the ELDB or Figure 72: The PAGEINFO structure used by the EWB and ELDU / ELDU instruction to load the evicted page back into the EPC, with ELDB instructions confidentiality, integrity and freshness guarantees. The LINADDR field in the PAGEINFO structure is EWB’s output consists of an encrypted version of the used to store the ADDRESS field in the EPCM entry, evicted EPC page’s contents, a subset of the fields in which indicates the virtual address intended for accessing the EPCM entry corresponding to the page, the nonce the page. The PCMD structure embeds the Security Infor- discussed in § 5.5.2, and a message authentication mation (SECINFO) described in § 5.3.2, which is used 8A 64-bit counter incremented at 4Ghz would roll over in slightly to store the page type (PT) and the access permission more than 136 years flags (R, W, X) in the EPCM entry. The PCMD structure

72 also stores the enclave’s ID (EID, § 5.5.3). These fields EPC Page Address PAGEINFO (Input) are later used by ELDU or ELDB to populate the EPCM (Input/Output) LINADDR entry for the EPC page that is reloaded. SRCPGE EPCM entry PCMD PCMD (Output) The metadata described above is stored unencrypted, BLOCKED SECS SECINFO so the OS has the option of using the information inside LINADDR as-is for its own bookkeeping. This has no negative im- ENCLAVESECS FLAGS R, W, X R, W, X pact on security, because the metadata is not confidential. PT PAGE_TYPE In fact, with the exception of the enclave ID, all the meta- VALID reserved fields data fields are specified by the system software when zero ECREATE is called. The enclave ID is only useful for ENCLAVEID MAC_HDR reserved fields (Temporary) identifying the enclave that the EPC page belongs to, and SECS MAC the system software already has this information as well. EID EID SECINFO TRACKING

Asides from the metadata described above, the PCMD FLAGS structure also stores the MAC tag generated by EWB. PAGE_TYPE R, W, X The MAC tag covers the authenticity of the EPC page MAC data contents, the metadata, and the nonce. The MAC tag is reserved fields checked by ELDU and ELDB, which will only load an LINADDR MAC evicted page back into the EPC if the MAC verification confirms the authenticity of the page data, metadata, and EPC Page plaintext AES-GCM nonce. This security check protects against the page counter ciphertext

swapping attacks described in § 3.7.3. non-EPC VA page Page Similarly to EREMOVE, EWB will only evict the EPC Page Version (Generated) ⋮ page holding an enclave’s SECS if there is no other VA slot address target VA slot (Input) EPCM entry whose ENCLAVESECS field references ⋮ the SECS. At the same time, as an optimization, the points to SGX implementation does not perform ETRACK-related copied to checks when evicting a SECS. This is safe because a Figure 73: The data flow of the EWB instruction that evicts an EPC SECS is only evicted if the EPC has no pages belonging page. The page’s content is encrypted in a non-EPC RAM page. A to the SECS’ enclave, which implies that there isn’t any nonce is created and saved in an empty slot inside a VA page. The TCS belonging to the enclave in the EPC, so no processor page’s EPCM metadata and a MAC are saved in a separate area in non-EPC memory. can be executing enclave code.

The pages holding Version Arrays can be evicted, just 5.5.5 Loading an Evicted Page Back into EPC like any other EPC page. VA pages are never accessible After an EPC page belonging to an enclave is evicted, any by software, so they can’t have any TLB entries point- attempt to access the page from enclave code will result ing to them. Therefore, EWB evicts VA pages without in a Page Fault (#PF, § 2.8.2). The #PF will cause the performing any ETRACK-related checks. The ability to logical processor to exit enclave mode via AEX (§ 5.4.3), evict VA pages has profound implications that will be and then invoke the OS kernel’s page fault handler. discussed in § 5.5.6. Page faults receive special handling from the AEX EWB’s data flow, shown in detail in Figure 73, has process. While leaving the enclave, the AEX logic specif- an aspect that can be confusing to OS developers. The ically checks if the hardware exception that triggered the instruction reads the virtual address of the EPC page to AEX was #PF. If that is the case, the AEX implementa- be evicted from a register (RBX) and writes it to the tion clears the least significant 12 bits of the CR2 register, LINADDR field of the PAGEINFO structure that it is which stores the virtual address whose translation caused provided. The separate input (RBX) could have been a page fault. removed by providing the EPC page’s address in the In general, the OS kernel’s page handler needs to be LINADDR field. able to extract the virtual page number (VPN, § 2.5.1)

73 from CR2, so that it knows which memory page needs VA Page to be loaded back into DRAM. The OS kernel may also ⋮

be able to use the 12 least significant address bits, which ⋮ are not part of the VPN, to better predict the application software’s memory access patterns. However, unlike the bits that make up the VPN, the bottom 12 bits are not absolutely necessary for the fault handler to carry out its Encrypted VA job. Therefore, SGX’s AEX implementation clears these Page 12 bits, in order to limit the amount of information that ⋮

is learned by the page fault handler. ⋮ When the OS page fault handler examines the address ⋮ in the CR2 register and determines that the faulting ad- dress is inside the EPC, it is generally expected to use the Page MAC ELDU or ELDB instruction to load the evicted page back Metadata Tag into the EPC. If the outputs of EWB have been evicted from DRAM to a slower storage medium, the OS kernel will have to read the outputs back into DRAM before Encrypted VA Encrypted invoking ELDU / ELDB. Page EPC Page ELDU ELDB and verify the MAC tag produced by Page MAC ⋮ EWB, described in § 5.5.4. This prevents the OS kernel Metadata Tag ⋮ from performing the page swapping-based active address translation attack described in § 3.7.3. Page MAC Metadata Tag 5.5.6 Eviction Trees The SGX design allows VA pages to be evicted from the EPC, just like enclave pages. When a VA page is evicted from EPC, all the nonces stored by the VA slots Encrypted Encrypted become inaccessible to the processor. Therefore, the EPC Page EPC Page evicted pages associated with these nonces cannot be Page MAC Page MAC Metadata Tag Metadata Tag restored by ELDB until the OS loads the VA page back into the EPC. Figure 74: A version tree formed by evicted VA pages and enclave In other words, an evicted page depends on the VA EPC pages. The enclave pages are leaves, and the VA pages are page storing its nonce, and cannot be loaded back into inner nodes. The OS controls the tree’s shape, which impacts the the EPC until the VA page is reloaded as well. The de- performance of evictions, but not their correctness. pendency graph created by this relationship is a forest of the shape of the eviction trees. This has no negative of eviction trees. An eviction tree, shown in Fig- impact on security, as the tree shape only impacts the ure 74, has enclave EPC pages as leaves, and VA pages performance of the eviction scheme, and not its correct- as inner nodes. A page’s parent is the VA page that holds ness. its nonce. Since EWB always outputs a nonce in a VA page, the root node of each eviction tree is always a VA 5.6 SGX Enclave Measurement page in the EPC. SGX implements a software attestation scheme that fol- A straightforward inductive argument shows that when lows the general principles outlined in § 3.3. For the an OS wishes to load an evicted enclave page back into purposes of this section, the most relevant principle is the EPC, it needs to load all the VA pages on the path that a remote party authenticates an enclave based on from the eviction tree’s root to the leaf corresponding to its measurement, which is intended to identify the soft- the enclave page. Therefore, the number of page loads ware that is executing inside the enclave. The remote required to satisfy a page fault inside the EPC depends party compares the enclave measurement reported by on the shape of the eviction tree that contains the page. the trusted hardware with an expected measurement, and The SGX design leaves the OS in complete control only proceeds if the two values match.

74 § 5.3 explains that an SGX enclave is built us- Offset Size Description ing the ECREATE (§ 5.3.1), EADD (§ 5.3.2) and 0 8 ”ECREATE\0” EEXTEND instructions. After the enclave is initialized 8 8 SECS.SSAFRAMESIZE (§ 5.2.5) via EINIT (§ 5.3.3), the instructions mentioned above 16 8 SECS.SIZE (§ 5.2.1) cannot be used anymore. As the SGX measurement 32 8 32 zero (0) bytes scheme follows the principles outlined in § 3.3.2, the Table 16: 64-byte block extended into MRENCLAVE by ECREATE measurement of an SGX enclave is obtained by com- puting a secure hash (§ 3.1.3) over the inputs to the enclave’s measurement. This feature can be combined ECREATE, EADD and EEXTEND instructions used to with a compiler that generates position-independent en- create the enclave and load the initial code and data into clave code to obtain relocatable enclaves. its memory. EINIT finalizes the hash that represents the The enclave’s measurement includes the enclave’s measurement. SSAFRAMESIZE field, which guarantees that Along with the enclave’s contents, the enclave author the SSAs (§ 5.2.5) created by AEX and used by is expected to specify the sequence of instructions that EENTER (§ 5.4.1) and ERESUME (§ 5.4.4) have the should be used in order to create an enclave whose mea- size that is expected by the enclave’s author. Leaving surement will match the expected value used by the re- this field out of an enclave’s measurement would mote party in the software attestation process. The .so allow a malicious enclave loader to attempt to attack and .dll dynamically loaded library file formats, which the enclave’s security checks by specifying a bigger are SGX’s intended enclave delivery methods, already SSAFRAMESIZE than the enclave’s author intended, include informal specifications for loading algorithms. which could cause the SSA contents written by an AEX We expect the informal loading specifications to serve to overwrite the enclave’s code or data. as the starting points for specifications that prescribe the exact sequences of SGX instructions that should be used 5.6.2 Measuring Enclave Attributes to create enclaves from .so and .dll files. The enclave’s measurement does not include the en- As argued in § 3.3.2, an enclave’s measurement is clave attributes (§ 5.2.2), which are specified in the AT- computed using a secure hashing algorithm, so the sys- TRIBUTES field in the SECS. Instead, it is included tem software can only build an enclave that matches an directly in the information that is covered by the attesta- expected measurement by following the exact sequence tion signature, which will be discussed in § 5.8.1. of instructions specified by the enclave’s author. The SGX software attestation definitely needs to cover The SGX design uses the 256-bit SHA-2 [21] secure the enclave attributes. For example, if XFRM (§ 5.2.2, hash function to compute its measurements. SHA-2 is § 5.2.5) would not be covered, a malicious enclave loader a block hash function (§ 3.1.3) that operates on 64-byte could attempt to subvert an enclave’s security checks blocks, uses a 32-byte internal state, and produces a 32- by setting XFRM to a value that enables architectural byte output. Each enclave’s measurement is stored in extensions that change the semantics of instructions used the MRENCLAVE field of the enclave’s SECS. The 32- by the enclave, but still produces an XSAVE output that byte field stores the internal state and final output of the fits in SSAFRAMESIZE. 256-bit SHA-2 secure hash function. 
The special treatment applied to the ATTRIBUTES SECS field seems questionable from a security stand- 5.6.1 Measuring ECREATE point, as it adds extra complexity to the software attesta- The ECREATE instruction, overviewed in § 5.3.1, first tion verifier, which translates into more opportunities for initializes the MRENCLAVE field in the newly created exploitable bugs. This decision also adds complexity to SECS using the 256-bit SHA-2 initialization algorithm, the SGX software attestation design, which is described and then extends the hash with the 64-byte block depicted in § 5.8. in Table 16. The most likely reason why the SGX design decided to The enclave’s measurement does not include the go this route, despite the concerns described above, is the BASEADDR field. The omission is intentional, as it wish to be able to use a single measurement to represent allows the system software to load an enclave at any an enclave that can take advantage of some architectural virtual address inside a host process that satisfies the extensions, but can also perform its task without them. ELRANGE restrictions (§ 5.2.1), without changing the Consider, for example, an enclave that performs image

75 processing using a library such as OpenCV, which has by the system software loading the enclave matches the routines optimized for SSE and AVX, but also includes specifications of the enclave author. generic fallbacks for processors that do not have these The EPCM field values mentioned above take up less features. The enclave’s author will likely wish to allow than one byte in the SECINFO structure, and the rest of an enclave loader to set bits 1 (SSE) and 2 (AVX) to the bytes are reserved and expected to be initialized to either true or false. If ATTRIBUTES (and, by extension, zero. This leaves plenty of expansion room for future XFRM) was a part of the enclave’s measurement, the SGX features. enclave author would have to specify that the enclave has The most notable omission from Table 17 is the data 4 valid measurements. In general, allowing n architec- used to initialize the newly created EPC page. Therefore, tural extensions to be used independently will result in the measurement data contributed by EADD guarantees 2n valid measurements. that the enclave’s memory layout will have pages allo- 5.6.3 Measuring EADD cated with prescribed access permissions at the desired virtual addresses. However, the measurements don’t The EADD instruction, described in § 5.3.2, extends the cover the code or data loaded in these pages. SHA-2 hash in MRENCLAVE with the 64-byte block For example, EADD’s measurement data guarantees shown in Table 17. that an enclave’s memory layout consists of three exe- Offset Size Description cutable pages followed by five writable data pages, but it 0 8 ”EADD\0\0\0\0” does not guarantee that any of the code pages contains 8 8 ENCLAVEOFFSET the code supplied by the enclave’s author. 16 48 SECINFO (first 48 bytes) 5.6.4 Measuring EEXTEND Table 17: 64-byte block extended into MRENCLAVE by EADD. The The EEXTEND instruction exists solely for the reason of ENCLAVEOFFSET is computed by subtracting the BASEADDR measuring data loaded inside the enclave’s EPC pages. in the enclave’s SECS from the LINADDR field in the PAGEINFO structure. The instruction reads in a virtual address, and extends the enclave’s measurement hash with the five 64-byte blocks The address included in the measurement is the ad- in Table 18, which effectively guarantee the contents of dress where the EADDed page is expected to be mapped a 256-byte chunk of data in the enclave’s memory. in the enclave’s virtual address space. This ensures that the system software sets up the enclave’s virtual memory Offset Size Description layout according to the enclave author’s specifications. 0 8 ”EEXTEND\0” If a malicious enclave loader attempts to set up the en- 8 8 ENCLAVEOFFSET clave’s layout incorrectly, perhaps in order to mount an 16 48 48 zero (0) bytes active address translation attack (§ 3.7.2), the loaded en- 64 64 bytes 0 - 64 in the chunk clave’s measurement will differ from the measurement expected by the enclave’s author. 128 64 bytes 64 - 128 in the chunk The virtual address of the newly created page is mea- 192 64 bytes 128 - 192 in the chunk sured relatively to the start of the enclave’s ELRANGE. 256 64 bytes 192 - 256 in the chunk In other words, the value included in the measurement Table 18: 64-byte blocks extended into MRENCLAVE by is LINADDR - BASEADDR. This makes the enclave’s EEXTEND. The ENCLAVEOFFSET is computed by subtracting the measurement invariant to BASEADDR changes, which BASEADDR in the enclave’s SECS from the LINADDR field in the is desirable for relocatable enclaves. 
Measuring the rel- PAGEINFO structure. ative addresses still preserves all the information about Before examining the details of EEXTEND, we note the memory layout inside ELRANGE, and therefore has that SGX’s security guarantees only hold when the con- no negative security impact. tents of the enclave’s key pages is measured. For ex- EADD also measures the first 48 bytes of the SECINFO ample, EENTER (§ 5.4.1) is only guaranteed to perform structure (§ 5.3.2) provided to EADD, which contain the controlled jumps inside an enclave’s code if the contents page type (PT) and access permissions (R, W, X) field of all the Thread Control Structure (TCS, § 5.2.4) pages values used to initialize the page’s EPCM entry. By the are measured. Otherwise, a malicious enclave loader same argument as above, including these values in the can change the OENTRY field (§ 5.2.4, § 5.4.1) in a measurement guarantees that the memory layout built TCS while building the enclave, and then a malicious

76 OS can use the TCS to perform an arbitrary jump inside § 3.7.2 and illustrated in Figure 54. enclave code. By the same argument, all the enclave’s More specifically, the malicious loader would EADD code should be measured by EEXTEND. Any code frag- the errorOut page contents at the virtual address in- ment that is not measured can be replaced by a malicious tended for disclose, EADD the disclose page con- enclave loader. tents at the virtual address intended for errorOut, Given these pitfalls, it is surprising that the SGX de- and then EEXTEND the pages in the wrong order. If sign opted to decouple the virtual address space layout EEXTEND would not include the address of the data measurements done by EADD from the memory content chunk that is measured, the steps above would yield the measurements done by EEXTEND. same measurement as the correctly constructed enclave. At a first pass, it appears that the decoupling only has The last aspect of EEXTEND worth analyzing is its one benefit, which is the ability to load un-measured user support for relocating enclaves. Similarly to EADD, input into an enclave while it is being built. However, this the virtual address measured by EEXTEND is relative benefit only translates into a small performance improve- to the enclave’s BASEADDR. Furthermore, the only ment, because enclaves can alternatively be designed to SGX structure whose content is expected to be mea- copy the user input from untrusted DRAM after being sured by EEXTEND is the TCS. The SGX design has initialized. At the same time, the decoupling opens up carefully used relative addresses for all the TCS fields the possibility of relying on an enclave that provides no that represent enclave addresses, which are OENTRY, meaningful security guarantees, due to not measuring all OFSBASGX and OGSBASGX. the important data via EEXTEND calls. 5.6.5 Measuring EINIT However, the real reason behind the EADD / EEXTEND EINIT separation is hinted at by the EINIT pseudo-code in the The instruction (§ 5.3.3) concludes the enclave EINIT SDM, which states that the instruction opens an inter- building process. After is successfully invoked rupt (§ 2.12) window while it performs a computationally on an enclave, the enclave’s contents are “sealed”, mean- EADD intensive RSA signature check. If an interrupt occurs ing that the system software cannot use the instruc- during the check, EINIT fails with an error code, and tion to load code and data into the enclave, and cannot EEXTEND the interrupt is serviced. This very unusual approach for use the instruction to update the enclave’s a processor instruction suggests that the SGX implemen- measurement. EINIT tation was constrained in respect to how much latency its uses the SHA-2 finalization algorithm (§ 3.1.3) instructions were allowed to add to the interrupt handling on the MRENCLAVE field of the enclave’s SECS. Af- EINIT process. ter , the field no longer stores the intermediate state of the SHA-2 algorithm, and instead stores the final In light of the concerns above, it is reasonable to con- output of the secure hash function. This value remains clude that EEXTEND was introduced because measur- constant after EINIT completes, and is included in the ing an entire page using 256-bit SHA-2 is quite time- attestation signature produced by the SGX software at- consuming, and doing it in EADD would have caused the testation process. instruction to exceed SGX’s latency budget. 
The need to hit a certain latency goal is a reasonable explanation for 5.7 SGX Enclave Versioning Support the seemingly arbitrary 256-byte chunk size. The software attestation model (§ 3.3) introduced by The EADD / EEXTEND separation will not cause secu- the Trusted Platform Module (§ 4.4) relies on a mea- rity issues if enclaves are authored using the same tools surement (§ 5.6), which is essentially a content hash, to that build today’s dynamically loaded modules, which identify the software inside a container. The downside appears to be the workflow targeted by the SGX design. of using content hashes for identity is that there is no In this workflow, the tools that build enclaves can easily relation between the identities of containers that hold identify the enclave data that needs to be measured. different versions of the same software. It is correct and meaningful, from a security perspec- In practice, it is highly desirable for systems based tive, to have the message blocks provided by EEXTEND on secure containers to handle software updates without to the hash function include the address of the 256-byte having access to the remote party in the initial software chunk, in addition to the contents of the data. If the attestation process. This entails having the ability to address were not included, a malicious enclave loader migrate secrets between the container that has the old could mount the memory mapping attack described in version of the software and the container that has the

77 updated version. This requirement translates into a need secrets with the key, and hands off the encrypted secrets for a separate identity system that can recognize the to the untrusted system software. The receiving enclave relationship between two versions of the same software. passes the sending enclave’s identity to EGETKEY, ob- SGX supports the migration of secrets between en- tains the same symmetric key as above, and uses the key claves that represent different versions of the same soft- to decrypt the secrets received from system software. ware, as shown in Figure 75. The symmetric key obtained from EGETKEY can be used in conjunction with cryptographic primitives that SIGSTRUCT A SIGSTRUCT B protect the confidentiality (§ 3.1.2) and integrity (§ 3.1.3) of an enclave’s secrets while they are migrated to another SGX EINIT SGX EINIT enclave by the untrusted system software. However, sym- Enclave A Enclave B metric keys alone cannot be used to provide freshness SECS SECS guarantees (§ 3.1), so secret migration is subject to re- Certificate-Based Identity Certificate-Based Identity play attacks. This is acceptable when the secrets being migrated are immutable, such as when the secrets are Enclave A Identity encryption keys obtained via software attestation

Secret Secret 5.7.1 Enclave Certificates

SGX SGX The SGX design requires each enclave to have a certifi- EGETKEY EGETKEY cate issued by its author. This requirement is enforced by EINIT (§ 5.3.3), which refuses to operate on enclaves Symmetric Authenticated Authenticated Secret Key Encryption Decryption Key without valid certificates. The SGX implementation consumes certificates for- matted as Signature Structures (SIGSTRUCT), which are

Non-volatile memory intended to be generated by an enclave building toolchain, Encrypted as shown in Figure 76. Secret A SIGSTRUCT certificate consists of metadata fields, the most interesting of which are presented in Table 19, Figure 75: SGX has a certificate-based enclave identity scheme, and an RSA signature that guarantees the authenticity which can be used to migrate secrets between enclaves that contain of the metadata, formatted as shown in Table 20. The different versions of the same software module. Here, enclave A’s secrets are migrated to enclave B. semantics of the fields will be revealed in the following sections. The secret migration feature relies on a one-level cer- tificate hierarchy ( § 3.2.1), where each enclave author Field Bytes Description is a Certificate Authority, and each enclave receives a ENCLAVEHASH 32 Must equal the certificate from its author. These certificates must be for- enclave’s measure- matted as Signature Structures (SIGSTRUCT), which are ment (§ 5.6). described in § 5.7.1. The information in these certificates ISVPRODID 32 Differentiates mod- is the basis for an enclave identity scheme, presented in ules signed by the § 5.7.2, which can recognize the relationship between same private key. different versions of the same software. ISVSVN 32 Differentiates ver- The EINIT instruction (§ 5.3.3) examines the target sions of the same enclave’s certificate and uses the information in it to pop- module. ulate the SECS (§ 5.1.3) fields that describe the enclave’s VENDOR 4 Differentiates Intel certificate-based identity. This process is summarized in enclaves. § 5.7.4. ATTRIBUTES 16 Constrains the en- Last, the actual secret migration process is based on clave’s attributes. the key derivation service implemented by the EGETKEY ATTRIBUTEMASK 16 Constrains the en- instruction, which is described in § 5.7.5. The sending clave’s attributes. enclave uses the EGETKEY instruction to obtain a sym- metric key (§ 3.1.1) based on its identity, encrypts its Table 19: A subset of the metadata fields in a SIGSTRUCT enclave certificate

78 Enclave Contents Field Bytes Description

SECS MODULUS 384 RSA key modulus ATTRIBUTES EXPONENT 4 RSA key public exponent BASEADDR SIGNATURE 384 RSA signature (See § 6.5) SIZE Q1 384 Simplifies RSA signature SSAFRAMESIZE verification. (See § 6.5)

Other EPC Q2 384 Simplifies RSA signature Pages verification. (See § 6.5)

Table 20: The format of the RSA signature used in a SIGSTRUCT SGX enclave certificate Measurement SIGSTRUCT Simulation Signed Fields to sign the certificate (MODULUS), the enclave’s prod- ENCLAVEHASH uct ID (ISVPRODID) and the security version number VENDOR zero (not Intel) (ISVSVN). AND ATTRIBUTES The public RSA key used to issue a certificate iden- ATTRIBUTEMASK RFC tifies the enclave’s author. All RSA keys used to issue ISVPRODID 3447 Build Toolchain ISVSVN enclave certificates must have the public exponent set to 256-bit SHA-2 Configuration DATE 3, so they are only differentiated by their moduli. SGX does not use the entire modulus of a key, but rather a RSA Signature PKCS #1 v1.5 Padding 256-bit SHA-2 hash of the modulus. This is called a Enclave Author’s EXPONENT (3) signer measurement Public RSA Key MODULUS (MRSIGNER), to parallel the name RSA SIGNATURE of enclave measurement (MRENCLAVE) for the SHA-2 Exponentiation Q1 hash that identifies an enclave’s contents. Q2 The SGX implementation relies on a hard-coded MR- Enclave Author’s Private RSA Key SIGNER value to recognize certificates issued by Intel. Enclaves that have an Intel-issued certificate can receive Figure 76: An enclave’s Signature Structure (SIGSTRUCT) is additional privileges, which are discussed in § 5.8. intended to be generated by an enclave building toolchain that has An enclave author can use the same RSA key to issue access to the enclave author’s private RSA key. certificates for enclaves that represent different software modules. Each module is identified by a unique Product The enclave certificates must be signed by RSA signa- ID (ISVPRODID) value. Conversely, all the enclaves tures (§ 3.1.3) that follow the method described in RFC whose certificates have the same ISVPRODID and are 3447 [111], using 256-bit SHA-2 [21] as the hash func- issued by the same RSA key (and therefore have the tion that reduces the input size, and the padding method same MRENCLAVE) are assumed to represent different described in PKCS #1 v1.5 [112], which is illustrated in versions of the same software module. Enclaves whose Figure 45. certificates are signed by different keys are always as- The SGX implementation only supports 3072-bit RSA sumed to contain different software modules. keys whose public exponent is 3. The key size is Enclaves that represent different versions of a module likely chosen to meet FIPS’ recommendation [20], which can have different security version numbers (SVN). The makes SGX eligible for use in U.S. government applica- SGX design disallows the migration of secrets from an tions. The public exponent 3 affords a simplified signa- enclave with a higher SVN to an enclave with a lower ture verification algorithm, which is discussed in § 6.5. SVN. This restriction is intended to assist with the distri- The simplified algorithm also requires the fields Q1 and bution of security patches, as follows. Q2 in the RSA signature, which are also described in If a security vulnerability is discovered in an enclave, § 6.5. the author can release a fixed version with a higher SVN. As users upgrade, SGX will facilitate the migration of 5.7.2 Certificate-Based Enclave Identity secrets from the vulnerable version of the enclave to the An enclave’s identity is determined by three fields in its fixed version. Once a user’s secrets have migrated, the certificate (§ 5.7.1): the modulus of the RSA key used SVN restrictions in SGX will deflect any attack based on

79 building the vulnerable enclave version and using it to SIGSTRUCT Enclave Contents read the migrated secrets. Signed Fields SECS Software upgrades that add functionality should not be ENCLAVEHASH Must be equal MRENCLAVE accompanied by an SVN increase, as SGX allows secrets ATTRIBUTES BASEADDR to be migrated freely between enclaves with matching VENDOR Must be equal SIZE DATE SSAFRAMESIZE SVN values. As explained above, a software module’s ATTRIBUTEMASK AND ATTRIBUTES SVN should only be incremented when a security vulner- ISVPRODID ISVPRODID ability is found. SIGSTRUCT only allocates 2 bytes to ISVSVN ISVSVN the ISVSVN field, which translates to 65,536 possible 256-bit SHA-2 MRSIGNER RSA Signature SVN values. This space can be exhausted if a large team PADDING MODULUS (incorrectly) sets up a continuous build system to allocate EXPONENT (3) Other EPC a new SVN for every software build that it produces, and SIGNATURE Pages each code change triggers a build. Q1 RSA Signature Q2 Verification Intel’s 5.7.3 CPU Security Version Numbers MRSIGNER

The SGX implementation itself has a security version Privileged attribute check Equality check number (CPUSVN), which is used in the key derivation process implemented [138] by EGETKEY, in addition to Figure 77: EINIT verifies the RSA signature in the enclave’s the enclave’s identity information. CPUSVN is a 128-bit certificate. If the certificate is valid, the information in it is used to value that, according to the SDM, reflects the processor’s populate the SECS fields that make up the enclave’s certificate-based identity. microcode update version. The SDM does not describe the structure of CPUSVN, exponent 3, facilitate a simplified verification algorithm, but it states that comparing CPUSVN values using inte- which is discussed in § 6.5. ger comparison is not meaningful, and that only some CPUSVN values are valid. Furthermore, CPUSVNs If the SIGSTRUCT certificate is found to be properly admit an ordering relationship that has the same seman- signed, EINIT follows the steps discussed in the fol- tics as the ordering relationship between enclave SVNs. lowing few paragraphs to ensure that the certificate was Specifically, an SGX implementation will consider all issued to the enclave that is being initialized. Once the SGX implementations with lower SVNs to be compro- checks have completed, EINIT computes MRSIGNER, mised due to security vulnerabilities, and will not trust the 256-bit SHA-2 hash of the MODULUS field in the them. SIGSTRUCT, and writes it into the enclave’s SECS. An SGX patent [138] discloses that CPUSVN is a con- EINIT also copies the ISVPRODID and ISVSVN fields catenation of small integers representing the SVNs of the from SIGSTRUCT into the enclave’s SECS. As ex- various components that make up SGX’s implementation. plained in § 5.7.2, these fields make up the enclave’s This structure is consistent with all the statements made certificate-based identity. in the SDM. After verifying the RSA signature in SIGSTRUCT, EINIT copies the signature’s padding into the 5.7.4 Establishing an Enclave’s Identity PADDING field in the enclave’s SECS. The PKCS #1 When the EINIT (§ 5.3.3) instruction prepares an en- v1.5 padding scheme, outlined in Figure 45, does not clave for code execution, it also sets the SECS (§ 5.1.3) involve randomness, so PADDING should have the same fields that make up the enclave’s certificate-based iden- value for all enclaves. tity, as shown in Figure 77. EINIT performs a few checks to make sure that the EINIT requires the virtual address of the enclave undergoing initialization was indeed authorized SIGSTRUCT certificate issued to the enclave, by the provided SIGSTRUCT certificate. The most obvi- and uses the information in the certificate to initial- ous check involves making sure that the MRENCLAVE ize the certificate-based identity information in the value in SIGSTRUCT equals the enclave’s measurement, enclave’s SECS. Before using the information in the which is stored in the MRENCLAVE field in the en- certificate, EINIT first verifies its RSA signature. The clave’s SECS. SIGSTRUCT fields Q1 and Q2, along with the RSA However, MRENCLAVE does not cover the enclave’s

80 attributes, which are stored in the ATTRIBUTES field value of 0x8086, and everyone else’s enclaves, which of the SECS. As discussed in § 5.6.2, omitting AT- should use a VENDOR value of zero. However, the TRIBUTES from MRENCLAVE facilitates writing en- EINIT pseudocode seems to imply that the SGX imple- claves that have optimized implementations that can use mentation only checks that VENDOR is either zero or architectural extensions when present, and also have fall- 0x8086. back implementations that work on CPUs without the ex- 5.7.5 Enclave Key Derivation tensions. Such enclaves can execute correctly when built with a variety of values in the XFRM (§ 5.2.2, § 5.2.5) SGX’s secret migration mechanism is based on the sym- attribute. At the same time, allowing system software metric key derivation service that is offered to enclaves to use arbitrary values in the ATTRIBUTES field would by the EGETKEY instruction, illustrated in Figure 78. compromise SGX’s security guarantees.

When an enclave uses software attestation (§ 3.3) to AND SECS KEYREQUEST gain access to secrets, the ATTRIBUTES value used ATTRIBUTES ATTRIBUTEMASK to build it is included in the SGX attestation signa- BASEADDR KEYNAME Current ture (§ 5.8). This gives the remote party in the attestation SIZE Must be >= CPUSVN CPUSVN process the opportunity to reject an enclave built with ISVPRODID KEYID an undesirable ATTRIBUTES value. However, when se- ISVSVN Must be >= ISVSVN MRENCLAVE KEYPOLICY crets are obtained using the migration process facilitated zero by certificate-based identities, there is no remote party MRSIGNER MRENCLAVE SSAFRAME 1 0 MRSIGNER that can check the enclave’s attributes. SIZE zero PADDING The SGX design solves this problem by having en- 1 0 clave authors convey the set of acceptable attribute values for an enclave in the ATTRIBUTES and AT- TRIBUTEMASK fields of the SIGSTRUCT certificate PADDING MRSIGNER MRENCLAVE ISVSVN KEYID issued for the enclave. EINIT will refuse to initialize ISVPRODID MASKEDATTRIBUTES KEYNAME CPUSVN an enclave using a SIGSTRUCT if the bitwise AND be- OWNEPOCH tween the ATTRIBUTES field in the enclave’s SECS Key Derivation Material SEAL_FUSES and the ATTRIBUTESMASK field in the SIGSTRUCT does not equal the SIGSTRUCT’s ATTRIBUTES field. OWNEREPOCH SEAL_FUSES This check prevents enclaves with undesirable attributes SGX Register from obtaining and potentially leaking secrets using the SGX Master AES-CMAC 128-bit migration process. Derivation Key Key Derivation symmetric key Any enclave author can use SIGSTRUCT to request any of the bits in an enclave’s ATTRIBUTES field to Figure 78: EGETKEY implements a key derivation service that is be zero. However, certain bits can only be set to one primarily used by SGX’s secret migration feature. The key derivation material is drawn from the SECS of the calling enclave, the informa- for enclaves that are signed by Intel. EINIT has a tion in a Key Request structure, and secure storage inside the CPU’s mask of restricted ATTRIBUTES bits, discussed in § 5.8. hardware. The EINIT implementation contains a hard-coded MR- The keys produced by EGETKEY are derived based on SIGNER value that is used to identify Intel’s privileged the identity information in the current enclave’s SECS enclaves, and only allows privileged enclaves to be built and on two secrets stored in secure hardware inside the with an ATTRIBUTES value that matches any of the SGX-enabled processor. One of the secrets is the input bits in the restricted mask. This check is essential to the to a largely undocumented series of transformations that security of the SGX software attestation process, which yields the symmetric key for the cryptographic primitive is described in § 5.8. underlying the key derivation process. The other secret, Last, EINIT also inspects the VENDOR field in referred to as the CR SEAL FUSES in the SDM, is one SIGSTRUCT. The SDM description of the VENDOR of the pieces of information used in the key derivation field in the section dedicated to SIGSTRUCT suggests material. that the field is essentially used to distinguish between The SDM does not specify the key derivation algo- special enclaves signed by Intel, which use a VENDOR rithm, but the SGX patents [110, 138] disclose that the

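To make the derivation concrete, the sketch below assembles the inputs shown in Figure 78 and feeds them to a CMAC-based PRF in the style of FIPS SP 800-108. This is only an illustration of the publicly described scheme: the field layout, the aes_cmac helper, and the sgx_master_derivation_key symbol are assumptions made for the example, not Intel's implementation.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the EGETKEY derivation material from Figure 78.
 * Field names follow the SDM; sizes, ordering, and packing are assumptions. */
typedef struct {
    uint16_t keyname;            /* requested key type (e.g., Seal key)            */
    uint64_t masked_attributes;  /* SECS.ATTRIBUTES AND-ed with the request mask   */
    uint16_t isvprodid;          /* always taken from the current enclave's SECS   */
    uint16_t isvsvn;             /* from KEYREQUEST; must not exceed SECS.ISVSVN   */
    uint8_t  mrenclave[32];      /* zeroed unless KEYPOLICY.MRENCLAVE is set       */
    uint8_t  mrsigner[32];       /* zeroed unless KEYPOLICY.MRSIGNER is set        */
    uint8_t  keyid[32];          /* caller-supplied value, typically random        */
    uint8_t  cpusvn[16];         /* from KEYREQUEST; must not exceed current SVN   */
    uint8_t  ownerepoch[16];     /* OWNEREPOCH configuration register              */
    uint8_t  seal_fuses[16];     /* CR_SEAL_FUSES secret from the e-fuses          */
} key_derivation_material_t;

/* Assumed primitive: AES-CMAC keyed with the SGX master derivation key,
 * used as the PRF that produces the 128-bit derived key.                   */
void aes_cmac(const uint8_t key[16], const uint8_t *msg, size_t len,
              uint8_t mac[16]);

extern const uint8_t sgx_master_derivation_key[16];

void derive_key(const key_derivation_material_t *material, uint8_t key_out[16]) {
    /* A real implementation would serialize the fields canonically; treating
     * the struct as a byte string is good enough for this sketch.            */
    aes_cmac(sgx_master_derivation_key, (const uint8_t *)material,
             sizeof(*material), key_out);
}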
The same patents state that the secrets used for key derivation are stored in the CPU's e-fuses, which is confirmed by the ISCA 2015 SGX tutorial [103].

This additional information implies that all EGETKEY invocations that use the same key derivation material will result in the same key, even across CPU power cycles. Furthermore, it is impossible for an adversary to obtain the key produced from a specific key derivation material without access to the secret stored in the CPU's e-fuses. SGX's key hierarchy is further described in § 5.8.2.

The following paragraphs discuss the pieces of data used in the key derivation material, which are selected by the Key Request (KEYREQUEST) structure shown in Table 21.

Field           Bytes  Description
KEYNAME         2      The desired key type; secret migration uses Seal keys
KEYPOLICY       2      The identity information (MRENCLAVE and/or MRSIGNER)
ISVSVN          2      The enclave SVN used in derivation
CPUSVN          16     SGX implementation SVN used in derivation
ATTRIBUTEMASK   16     Selects enclave attributes
KEYID           32     Random bytes

Table 21: A subset of the fields in the KEYREQUEST structure.

The KEYNAME field in KEYREQUEST always participates in the key generation material. It indicates the type of the key to be generated. While the SGX design defines a few key types, the secret migration feature always uses Seal keys. The other key types are used by the SGX software attestation process, which will be outlined in § 5.8.

The KEYPOLICY field in KEYREQUEST has two flags that indicate if the MRENCLAVE and MRSIGNER fields in the enclave's SECS will be used for key derivation. Although the field admits 4 values, only two seem to make sense, as argued below.

Setting the MRENCLAVE flag in KEYPOLICY ties the derived key to the current enclave's measurement, which reflects its contents. No other enclave will be able to obtain the same key. This is useful when the derived key is used to encrypt enclave secrets so they can be stored by system software in non-volatile memory, and thus survive power cycles.

If the MRSIGNER flag in KEYPOLICY is set, the derived key is tied to the public RSA key that issued the enclave's certificate. Therefore, other enclaves issued by the same author may be able to obtain the same key, subject to the restrictions below. This is the only KEYPOLICY value that allows for secret migration.

It makes little sense to have no flag set in KEYPOLICY. In this case, the derived key has no useful security property, as it can be obtained by other enclaves that are completely unrelated to the enclave invoking EGETKEY. Conversely, setting both flags is redundant, as setting the MRENCLAVE flag alone will cause the derived key to be tied to the current enclave, which is the strictest possible policy.

The KEYREQUEST structure specifies the enclave SVN (ISVSVN, § 5.7.2) and SGX implementation SVN (CPUSVN, § 5.7.3) that will be used in the key derivation process. However, EGETKEY will reject the derivation request and produce an error code if the desired enclave SVN is greater than the current enclave's SVN, or if the desired SGX implementation's SVN is greater than the current implementation's SVN.

The SVN restrictions prevent the migration of secrets from enclaves with higher SVNs to enclaves with lower SVNs, or from SGX implementations with higher SVNs to implementations with lower SVNs. § 5.7.2 argues that the SVN restrictions can reduce the impact of security vulnerabilities in enclaves and in SGX's implementation.

EGETKEY always uses the ISVPRODID value from the current enclave's SECS for key derivation. It follows that secrets can never flow between enclaves whose SIGSTRUCT certificates assign them different Product IDs.

Similarly, the key derivation material always includes the value of a 128-bit Owner Epoch (OWNEREPOCH) SGX configuration register. This register is intended to be set by the computer's firmware to a secret generated once and stored in non-volatile memory. Before the computer changes ownership, the old owner can clear the OWNEREPOCH from non-volatile memory, making it impossible for the new owner to decrypt any enclave secrets that may be left on the computer.

Due to the cryptographic properties of the key derivation process, outside observers cannot correlate keys derived using different OWNEREPOCH values. This makes it impossible for software developers to use the EGETKEY-derived keys described in this section to track a processor as it changes owners.

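As an illustration of how the fields in Table 21 are typically filled in, the sketch below requests a Seal key bound to the enclave author's identity (the MRSIGNER policy), which is the setting that permits secret migration. The structure, the constants, and the egetkey wrapper are declared here purely for the example; they are not taken from any particular SDK header.

#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the KEYREQUEST fields listed in Table 21. */
#define KEYNAME_SEAL        4
#define KEYPOLICY_MRENCLAVE 0x1
#define KEYPOLICY_MRSIGNER  0x2

typedef struct {
    uint16_t keyname;
    uint16_t keypolicy;
    uint16_t isvsvn;
    uint8_t  cpusvn[16];
    uint64_t attributemask;
    uint8_t  keyid[32];
} keyrequest_t;

/* Assumed wrapper around the EGETKEY leaf; returns 0 on success. */
int egetkey(const keyrequest_t *request, uint8_t key_out[16]);

int derive_migratable_seal_key(uint16_t own_isvsvn, const uint8_t own_cpusvn[16],
                               const uint8_t salt[32], uint8_t key_out[16]) {
    keyrequest_t request;
    memset(&request, 0, sizeof(request));
    request.keyname   = KEYNAME_SEAL;
    request.keypolicy = KEYPOLICY_MRSIGNER; /* bind to the signing key, not MRENCLAVE */
    request.isvsvn    = own_isvsvn;         /* must not exceed the enclave's own SVN  */
    memcpy(request.cpusvn, own_cpusvn, 16); /* must not exceed the current CPUSVN     */
    request.attributemask = ~0ULL;          /* keep all attributes in the material    */
    memcpy(request.keyid, salt, 32);        /* per-key salt, helps prevent wear-out   */
    return egetkey(&request, key_out);
}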
The EGETKEY derivation material also includes a 256-bit value supplied by the enclave, in the KEYID field. This makes it possible for an enclave to generate a collection of keys from EGETKEY, instead of a single key. The SDM states that KEYID should be populated with a random number, and is intended to help prevent key wear-out.

Last, the key derivation material includes the bitwise AND of the ATTRIBUTES (§ 5.2.2) field in the enclave's SECS and the ATTRIBUTEMASK field in the KEYREQUEST structure. The mask has the effect of removing some of the ATTRIBUTES bits from the key derivation material, making it possible to migrate secrets between enclaves with different attributes. § 5.6.2 and § 5.7.4 explain the need for this feature, as well as its security implications.

Before adding the masked attributes value to the key generation material, the EGETKEY implementation forces the mask bits corresponding to the INIT and DEBUG attributes (§ 5.2.2) to be set. From a practical standpoint, this means that secrets will never be migrated between debugging enclaves and production enclaves.

Without this restriction, it would be unsafe for an enclave author to use the same RSA key to issue certificates to both debugging and production enclaves. Debugging enclaves receive no integrity guarantees from SGX, so it is possible for an attacker to modify the code inside a debugging enclave in a way that causes it to disclose any secrets that it has access to.

5.8 SGX Software Attestation

The software attestation scheme implemented by SGX follows the principles outlined in § 3.3. An SGX-enabled processor computes a measurement of the code and data that is loaded in each enclave, which is similar to the measurement computed by the TPM (§ 4.4). The software inside an enclave can start a process that results in an SGX attestation signature, which includes the enclave's measurement and an enclave message.

The cryptographic primitive used in SGX's attestation signature is too complex to be implemented in hardware, so the signing process is performed by a privileged Quoting Enclave, which is issued by Intel, and can access the SGX attestation key. This enclave is discussed in § 5.8.2.

[Figure 79: diagram omitted. It shows the enclave author's build toolchain producing the enclave contents and a SIGSTRUCT signed with the author's private key; the system software loading the enclave with ECREATE, EADD, and EEXTEND, obtaining an EINITTOKEN from the SGX Launch Enclave according to a launch (licensing) policy, and initializing the enclave with EINIT, which finalizes MRENCLAVE and MRSIGNER; and the attestation path, in which the enclave answers a challenge by producing a REPORT via EREPORT, and the SGX Quoting Enclave turns the report into an Attestation Signature.]

Figure 79: Setting up an SGX enclave and undergoing the software attestation process involves the SGX instructions EINIT and EREPORT, and two special enclaves authored by Intel, the SGX Launch Enclave and the SGX Quoting Enclave.

Pushing the signing functionality into the Quoting Enclave creates the need for a secure communication path between an enclave undergoing software attestation and the Quoting Enclave. The SGX design solves this problem with a local attestation mechanism that can be used by an enclave to prove its identity to any other enclave hosted by the same SGX-enabled CPU. This scheme, described in § 5.8.1, is implemented by the EREPORT instruction.

The SGX attestation key used by the Quoting Enclave does not exist at the time SGX-enabled processors leave the factory. The attestation key is provisioned later, using a process that involves a Provisioning Enclave issued by Intel, and two special EGETKEY (§ 5.7.5) key types. The publicly available details of this process are summarized in § 5.8.2.

The SGX Launch Enclave and EINITTOKEN structure will be discussed in § 5.9.

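Returning briefly to the attribute masking rule of § 5.7.5, the forced inclusion of the DEBUG and INIT attributes amounts to one adjustment of the caller's mask before the masked attributes enter the derivation material. The bit positions below are placeholders chosen for this sketch.

#include <stdint.h>

/* Placeholder encodings for the two attributes that can never be masked out
 * of the key derivation material.                                            */
#define ATTR_INIT  (1ULL << 0)
#define ATTR_DEBUG (1ULL << 1)

/* EGETKEY-style masking: the caller's ATTRIBUTEMASK is honored for every
 * attribute except DEBUG and INIT, which always reach the material. As a
 * result, debugging and production enclaves always derive different keys.    */
static uint64_t masked_attributes(uint64_t secs_attributes, uint64_t request_mask) {
    return secs_attributes & (request_mask | ATTR_DEBUG | ATTR_INIT);
}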
5.8.1 Local Attestation

An enclave proves its identity to another target enclave via the EREPORT instruction shown in Figure 80. The SGX instruction produces an attestation Report (REPORT) that cryptographically binds a message supplied by the enclave with the enclave's measurement-based (§ 5.6) and certificate-based (§ 5.7.2) identities. The cryptographic binding is accomplished by a MAC tag (§ 3.1.3) computed using a symmetric key that is only shared between the target enclave and the SGX implementation.

[Figure 80: diagram omitted. It shows EREPORT copying the current enclave's SECS fields (ATTRIBUTES, MRENCLAVE, MRSIGNER, ISVPRODID, ISVSVN), the current CPUSVN, the enclave-supplied REPORTDATA, and the CR_EREPORT_KEYID register into the REPORT's MACed fields, while the target enclave's MEASUREMENT and ATTRIBUTES from the TARGETINFO structure enter the key derivation material used to produce the Report key that MACs the report.]

Figure 80: EREPORT data flow.

The EREPORT instruction reads the current enclave's identity information from the enclave's SECS (§ 5.1.3), and uses it to populate the REPORT structure. Specifically, EREPORT copies the SECS fields indicating the enclave's measurement (MRENCLAVE), certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), and attributes (ATTRIBUTES). The attestation report also includes the SVN of the SGX implementation (CPUSVN) and a 64-byte (512-bit) message supplied by the enclave.

The target enclave that receives the attestation report can convince itself of the report's authenticity as shown in Figure 81. The report's authenticity proof is its MAC tag. The key required to verify the MAC can only be obtained by the target enclave, by asking EGETKEY (§ 5.7.5) to derive a Report key. The SDM states that the MAC tag is computed using a block cipher-based MAC (CMAC, [46]), but stops short of specifying the underlying cipher. One of the SGX papers [14] states that the CMAC is based on 128-bit AES.

The Report key returned by EGETKEY is derived from a secret embedded in the processor (§ 5.7.5), and the key material includes the target enclave's measurement. The target enclave can be assured that the MAC tag in the report was produced by the SGX implementation, for the following reasons. The cryptographic properties of the underlying key derivation and MAC algorithms ensure that only the SGX implementation can produce the MAC tag, as it is the only entity that can access the processor's secret, and it would be impossible for an attacker to derive the Report key without knowing the processor's secret. The SGX design guarantees that the key produced by EGETKEY depends on the calling enclave's measurement, so only the target enclave can obtain the key used to produce the MAC tag in the report.

EREPORT uses the same key derivation process as EGETKEY does when invoked with KEYNAME set to the value associated with Report keys. For this reason, EREPORT requires the virtual address of a Report Target Info (TARGETINFO) structure that contains the measurement-based identity and attributes of the target enclave.

When deriving a Report key, EGETKEY behaves slightly differently than it does in the case of seal keys, as shown in Figure 81. The key generation material never includes the fields corresponding to the enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), and the KEYPOLICY field in the KEYREQUEST structure is ignored. It follows that the report can only be verified by the target enclave.

Furthermore, the SGX implementation's SVN (CPUSVN) value used for key generation is determined by the current CPUSVN, instead of being read from the Key Request structure. Therefore, SGX implementation upgrades that increase the CPUSVN invalidate all outstanding reports. Given that CPUSVN increases are associated with security fixes, the argument in § 5.7.2 suggests that this restriction may reduce the impact of vulnerabilities in the SGX implementation.

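In practice, enclave code does not issue EREPORT or derive Report keys by hand. Assuming the Intel SGX SDK's sgx_utils.h interface (sgx_create_report and sgx_verify_report), the local attestation exchange reduces to the sketch below; the SDK calls wrap the EREPORT and EGETKEY behavior described above.

#include <string.h>
#include "sgx_utils.h"   /* sgx_create_report, sgx_verify_report; runs inside an enclave */

/* Reporting side: bind a 64-byte message to this enclave's identity.
 * target_info names the enclave that will be able to verify the report.        */
sgx_status_t make_local_report(const sgx_target_info_t *target_info,
                               const uint8_t message[64], sgx_report_t *report) {
    sgx_report_data_t data;               /* 64-byte REPORTDATA covered by the MAC */
    memcpy(data.d, message, sizeof(data.d));
    return sgx_create_report(target_info, &data, report);   /* wraps EREPORT */
}

/* Verifying side: runs inside the target enclave named in target_info.
 * sgx_verify_report re-derives the Report key via EGETKEY and checks the MAC.  */
int report_is_authentic(const sgx_report_t *report) {
    return sgx_verify_report(report) == SGX_SUCCESS;
}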
[Figure 81: diagram omitted. It shows the target enclave using EGETKEY with a Report-key KEYREQUEST to re-derive the Report key, recomputing the AES-CMAC over the REPORT's MACed fields (ATTRIBUTES, MRENCLAVE, MRSIGNER, ISVPRODID, ISVSVN, REPORTDATA, CPUSVN, KEYID), and trusting the report only if the computed tag equals the report's MAC.]

Figure 81: The authenticity of the REPORT structure created by EREPORT can and should be verified by the report's target enclave. The target's code uses EGETKEY to obtain the key used for the MAC tag embedded in the REPORT structure, and then verifies the tag.

Last, EREPORT sets the KEYID field in the key generation material to the contents of an SGX configuration register (CR_REPORT_KEYID) that is initialized with a random value when SGX is initialized. The KEYID value is also saved in the attestation report, but it is not covered by the MAC tag.

5.8.2 Remote Attestation

The SDM paints a complete picture of the local attestation mechanism that was described in § 5.8.1. The remote attestation process, which includes the Quoting Enclave and the underlying keys, is covered at a high level in an Intel publication [109]. This section's contents are based on the SDM, on one [14] of the SGX papers, and on the ISCA 2015 SGX tutorial [103].

SGX's software attestation scheme, which is illustrated in Figure 82, relies on a key generation facility and on a provisioning service, both operated by Intel.

[Figure 82: diagram omitted. It shows Intel's Key Generation Facility burning the Provisioning Secret and the Seal Secret into the CPU's e-fuses; the Provisioning Enclave using EGETKEY to obtain a Provisioning key, proving ownership of it to the Intel Provisioning Service, receiving an Attestation Key, and handing that key, encrypted under a Provisioning Seal key, to system software for storage; and the attested enclave exchanging a challenge and key agreement messages with a remote party, producing a REPORT via EREPORT, and relying on the Quoting Enclave to convert the report into an Attestation Signature that the remote party verifies.]

Figure 82: SGX's software attestation is based on two secrets stored in e-fuses inside the processor's die, and on a key received from Intel's provisioning service.

During the manufacturing process, an SGX-enabled processor communicates with Intel's key generation facility, and has two secrets burned into e-fuses, which are a one-time programmable storage medium that can be economically included on a high-performance chip's die. We shall refer to the secrets stored in e-fuses as the Provisioning Secret and the Seal Secret.

The Provisioning Secret is the main input to a largely undocumented process that outputs the SGX master derivation key used by EGETKEY, which was referenced in Figures 78, 79, 80, and 81.

The Seal Secret is not exposed to software by any of the architectural mechanisms documented in the SDM. The secret is only accessed when it is included in the material used by the key derivation process implemented by EGETKEY (§ 5.7.5). The pseudocode in the SDM uses the CR_SEAL_FUSES register name to refer to the Seal Secret.

The names "Seal Secret" and "Provisioning Secret" deviate from Intel's official documents, which confusingly use the "Seal Key" and "Provisioning Key" names to refer to both secrets stored in e-fuses and keys derived by EGETKEY.

The SDM briefly describes the keys produced by EGETKEY, but no official documentation explicitly describes the secrets in e-fuses. The description below is the only interpretation of all the public information sources that is consistent with all the SDM's statements about key derivation.

The Provisioning Secret is generated at the key generation facility, where it is burned into the processor's e-fuses and stored in the database used by Intel's provisioning service. The Seal Secret is generated inside the processor chip, and therefore is not known to Intel. This approach has the benefit that an attacker who compromises Intel's facilities cannot derive most keys produced by EGETKEY, even if the attacker also compromises a victim's firmware and obtains the OWNEREPOCH (§ 5.7.5) value. These keys include the Seal keys (§ 5.7.5) and Report keys (§ 5.8.1) introduced in previous sections.

The only documented exception to the reasoning above is the Provisioning key, which is effectively a shared secret between the SGX-enabled processor and Intel's provisioning service. Intel has to be able to derive this key, so the derivation material does not include the Seal Secret or the OWNEREPOCH value, as shown in Figure 83.

[Figure 83: diagram omitted. It shows the EGETKEY data flow for a Provisioning key: the PROVISIONKEY attribute must be true, and the derivation material includes MRSIGNER, ISVPRODID, ISVSVN, the masked attributes, KEYNAME, and CPUSVN, while MRENCLAVE, KEYID, OWNEREPOCH, and SEAL_FUSES are replaced with zero.]

Figure 83: When EGETKEY is asked to derive a Provisioning key, it does not use the Seal Secret or OWNEREPOCH. The Provisioning key does, however, depend on MRSIGNER and on the SVN of the SGX implementation.

EGETKEY derives the Provisioning key using the current enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN) and the SGX implementation's SVN (CPUSVN). This approach has a few desirable security properties. First, Intel's provisioning service can be assured that it is authenticating a Provisioning Enclave signed by Intel. Second, the provisioning service can use the CPUSVN value to reject SGX implementations with known security vulnerabilities. Third, this design admits multiple mutually distrusting provisioning services.

EGETKEY only derives Provisioning keys for enclaves whose PROVISIONKEY attribute is set to true. § 5.9.3 argues that this mechanism is sufficient to protect the computer owner from a malicious software provider that attempts to use Provisioning keys to track a CPU chip across OWNEREPOCH changes.

After the Provisioning Enclave obtains a Provisioning key, it uses the key to authenticate itself to Intel's provisioning service. Once the provisioning service is convinced that it is communicating with a trusted Provisioning Enclave in the secure environment provided by an SGX-enabled processor, the service generates an Attestation Key and sends it to the Provisioning Enclave. The enclave then encrypts the Attestation Key using a Provisioning Seal key, and hands off the encrypted key to the system software for storage.

Provisioning Seal keys are the last publicly documented type of special keys derived by EGETKEY, using the process illustrated in Figure 84. As their name suggests, Provisioning Seal keys are conceptually similar to the Seal keys (§ 5.7.5) used to migrate secrets between enclaves.

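A minimal sketch of the gating and input selection described in this section (and, for Provisioning Seal keys, in Figure 84 below): provisioning-class keys can only be derived when the PROVISIONKEY attribute is set, Provisioning keys omit both the OWNEREPOCH and the Seal Secret, and Provisioning Seal keys omit only the OWNEREPOCH. Every name below is a hypothetical stand-in for behavior the SDM attributes to EGETKEY.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ATTR_PROVISIONKEY (1ULL << 4)   /* assumed bit position */

enum keyname { KEY_SEAL, KEY_REPORT, KEY_PROVISION, KEY_PROVISION_SEAL, KEY_LAUNCH };

typedef struct {
    uint8_t ownerepoch[16];
    uint8_t seal_fuses[16];
    /* identity fields (MRSIGNER, ISVPRODID, ISVSVN, ...) elided for brevity */
} derivation_material_t;

/* Returns false (error) if a provisioning-class key is requested by an
 * enclave whose PROVISIONKEY attribute is not set.                         */
bool select_material(enum keyname name, uint64_t secs_attributes,
                     const uint8_t ownerepoch[16], const uint8_t seal_fuses[16],
                     derivation_material_t *m) {
    bool provisioning = (name == KEY_PROVISION || name == KEY_PROVISION_SEAL);
    if (provisioning && !(secs_attributes & ATTR_PROVISIONKEY))
        return false;                        /* mirrors EGETKEY's attribute gate */

    memset(m, 0, sizeof(*m));
    if (!provisioning)                       /* Seal and Report keys are owner-bound */
        memcpy(m->ownerepoch, ownerepoch, 16);
    if (name != KEY_PROVISION)               /* only Provisioning keys omit the      */
        memcpy(m->seal_fuses, seal_fuses, 16);  /* Seal Secret entirely              */
    return true;
}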
[Figure 84: diagram omitted. It shows the EGETKEY data flow for a Provisioning Seal key: the derivation material includes MRSIGNER, ISVPRODID, ISVSVN, the masked attributes, KEYNAME, CPUSVN, and the SEAL_FUSES secret, but not MRENCLAVE, KEYID, or OWNEREPOCH.]

Figure 84: The derivation material used to produce Provisioning Seal keys does not include the OWNEREPOCH value, so the keys survive computer ownership changes.

The defining feature of Provisioning Seal keys is that they are not based on the OWNEREPOCH value, so they survive computer ownership changes. Since Provisioning Seal keys can be used to track a CPU chip, their use is gated on the PROVISIONKEY attribute, which has the same semantics as for Provisioning keys.

Like Provisioning keys, Provisioning Seal keys are based on the current enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), so the Attestation Key encrypted by Intel's Provisioning Enclave can only be decrypted by another enclave signed with the same Intel RSA key. However, unlike Provisioning keys, Provisioning Seal keys are based on the Seal Secret in the processor's e-fuses, so they cannot be derived by Intel.

When considered independently from the rest of the SGX design, Provisioning Seal keys have desirable security properties. The main benefit of these keys is that when a computer with an SGX-enabled processor exchanges owners, it does not need to undergo the provisioning process again, so Intel does not need to be aware of the ownership change. The confidentiality issue that stems from not using OWNEREPOCH was already introduced by Provisioning keys, and is mitigated using the access control scheme based on the PROVISIONKEY attribute that will be discussed in § 5.9.3.

Similarly to the Seal key derivation process, both the Provisioning and Provisioning Seal keys depend on the bitwise AND of the ATTRIBUTES (§ 5.2.2) field in the enclave's SECS and the ATTRIBUTEMASK field in the KEYREQUEST structure. While most attributes can be masked away, the DEBUG and INIT attributes are always used for key derivation.

This dependency makes it safe for Intel to use its production RSA key to issue certificates for Provisioning or Quoting Enclaves with debugging features enabled. Without the forced dependency on the DEBUG attribute, using the production Intel signing key on a single debug Provisioning or Quoting Enclave could invalidate SGX's security guarantees on all the CPU chips whose attestation-related enclaves are signed by the same key. Concretely, if the issued SIGSTRUCT were leaked, any attacker could build a debugging Provisioning or Quoting enclave, use the SGX debugging features to modify the code inside it, and extract the 128-bit Provisioning key used to authenticate the CPU to Intel's provisioning service.

After the provisioning steps above have been completed, the Quoting Enclave can be invoked to perform SGX's software attestation. This enclave receives local attestation reports (§ 5.8.1) and verifies them using the Report keys generated by EGETKEY. The Quoting Enclave then obtains the Provisioning Seal Key from EGETKEY and uses it to decrypt the Attestation Key, which is received from system software. Last, the enclave replaces the MAC in the local attestation report with an Attestation Signature produced with the Attestation Key.

The SGX patents state that the name "Quoting Enclave" was chosen as a reference to the TPM (§ 4.4)'s quoting feature, which is used to perform software attestation on a TPM-based system.

The Attestation Key uses Intel's Enhanced Privacy ID (EPID) cryptosystem [26], which is a group signature scheme that is intended to preserve the anonymity of the signers. Intel's key provisioning service is the issuer in the EPID scheme, so it publishes the Group Public Key, while securely storing the Master Issuing Key. After a Provisioning Enclave authenticates itself to the provisioning service, it generates an EPID Member Private Key, which serves as the Attestation Key, and executes the

87 EPID Join protocol to join the group. Later, the Quoting We speculate that Intel has not been forthcoming about Enclave uses the EPID Member Private Key to produce the LE because of its role in enforcing software licens- Attestation Signatures. ing, which will be discussed in § 5.9.2. This section The Provisioning Secret stored in the e-fuses of each abstracts away the licensing aspect and assumes that the SGX-enabled processor can be used by Intel to trace LE enforces a black-box Launch Control Policy. individual chips when a Provisioning Enclave authen- The LE approves an enclave by issuing an EINIT ticates itself to the provisioning service. However, if Token (EINITTOKEN), using the process illustrated the EPID Join protocol is blinded, Intel’s provisioning in Figure 85. The EINITTOKEN structure contains service cannot trace an Attestation Signature to a spe- the approved enclave’s measurement-based (§ 5.6) and cific Attestation Key, so Intel cannot trace Attestation certificate-based (§ 5.7.2) identities, just like a local at- Signatures to individual chips. testation REPORT (§ 5.8.1). This token is inspected by Of course, the security properties of the description EINIT (§ 5.3.3), which refuses to initialize enclaves above hinge on the correctness of the proofs behind the with incorrect tokens. EPID scheme. Analyzing the correctness of such cryp- While an EINIT token is handled by untrusted system tographic schemes is beyond the scope of this work, so software, its integrity is protected by a MAC tag (§ 3.1.3) we defer the analysis of EPID to the crypto research that is computed using a Launch Key obtained from community. EGETKEY. The EINIT implementation follows the 5.9 SGX Enclave Launch Control same key derivation process as EGETKEY to convince itself that the EINITTOKEN provided to it was indeed The SGX design includes a launch control process, generated by an LE that had access to the Launch Key. which introduces an unnecessary approval step that is required before running most enclaves on a computer. The SDM does not document the MAC algorithm The approval decision is made by the Launch Enclave used to confer integrity guarantees to the EINITTOKEN EINIT (LE), which is an enclave issued by Intel that gets to structure. However, the pseudocode verifies the approve every other enclave before it is initialized by token’s MAC tag using the same function that the ERE- EINIT (§ 5.3.3). The officially documented information PORT pseudocode uses to create the REPORT structure’s about this approval process is discussed in § 5.9.1. MAC tag. It follows that the reasoning in § 5.8.1 can The SGX patents [110, 138] disclose in no uncertain be reused to conclude that EINITTOKEN structures are terms that the Launch Enclave was introduced to ensure MACed using AES-CMAC with 128-bit keys. that each enclave’s author has a business relationship The EGETKEY instruction only derives the Launch with Intel, and implements a software licensing system. Key for enclaves that have the LAUNCHKEY attribute § 5.9.2 briefly discusses the implications, should this turn set to true. The Launch Key is derived using the same out to be true. process as the Seal Key (§ 5.7.5). The derivation mate- The remainder of the section argues that the Launch rial includes the current enclave’s versioning information Enclave should be removed from the SGX design. 
§ 5.9.3 (ISVPRODID and ISVSVN) but it does not include the explains that the LE is not required to enforce the com- main fields that convey an enclave’s identity, which are puter owner’s launch control policy, and concludes that MRSIGNER and MRENCLAVE. The rest of the deriva- the LE is only meaningful if it enforces a policy that is tion material follows the same rules as the material used detrimental to the computer owner. § 5.9.4 debunks the for Seal Keys. myth that an enclave can host malware, which is likely to The EINITTTOKEN structure contains the identi- be used to justify the LE. § 5.9.5 argues that Anti-Virus ties of the approved enclave (MRENCLAVE and MR- (AV) software is not fundamentally incompatible with SIGNER) and the approved enclave attributes (AT- enclaves, further disproving the theory that Intel needs TRIBUTES). The token also includes the information to actively police the software that runs inside enclaves. used for the Launch Key derivation, which includes the LE’s Product ID (ISVPRODIDLE), SVN (ISVSVNLE), 5.9.1 Enclave Attributes Access Control and the bitwise AND between the LE’s ATTRIBUTES The SGX design requires that all enclaves be vetted by a and the ATTRIBUTEMASK used in the KEYREQUEST Launch Enclave (LE), which is only briefly mentioned (MASKEDATTRIBUTESLE). in Intel’s official documentation. Neither its behavior The EINITTOKEN information used to derive the nor its interface with the system software is specified. Launch Key can also be used by EINIT for damage

88 Vetted Enclave Desired ATTRIBUTES (§ 5.7.5). SIGSTRUCT Launch The check described above make it safe for Intel to Signed Fields AES-CMAC Control ATTRIBUTES supply SGX enclave developers with a debugging LE that Policy ATTRIBUTEMASK Checks EINITTOKEN has its DEBUG attribute set, and performs minimal or VENDOR MAC no security checks before issuing an EINITTOKEN. The DATE MACed Fields DEBUG attribute disables SGX’s integrity protection, ATTRIBUTES ENCLAVEHASH so the only purpose of the security checks performed in ISVPRODID 1 VALID ISVSVN MRENCLAVE the debug LE would be to help enclave development by MRSIGNER mimicking its production counterpart. The debugging LE RSA Signature 256-bit ISVSVNLE can only be used to launch any enclave with the DEBUG MODULUS SHA-2 KEYID EXPONENT (3) attribute set, so it does not undermining Intel’s ability to CPUSVNLE SIGNATURE enforce a Launch Control Policy on production enclaves. ISVPRODIDLE Q1 The enclave attributes access control system described MASKED Q2 ATTRIBUTESLE above relies on the LE to reject initialization requests AND that set privileged attributes such as PROVISIONKEY Signed by Enclave KEYREQUEST Author’s RSA Key on unauthorized enclaves. However, the LE cannot vet ATTRIBUTEMASK KEYNAME itself, as there will be no LE available when the LE itself Launch Enclave SECS CPUSVN needs to be initialized. Therefore, the Launch Key access MRENCLAVE KEYID restrictions are implemented in hardware. MRSIGNER ISVSVN EINIT accepts an EINITTOKEN whose VALID bit is KEYPOLICY ISVPRODID set to zero, if the enclave’s MRSIGNER (§ 5.7.1) equals MRENCLAVE PADDING a hard-coded value that corresponds to an Intel public ISVSVN MRSIGNER ATTRIBUTES key. For all other enclave authors, an invalid EINIT token BASEADDR RDRAND causes EINIT to reject the enclave and produce an error SIZE Launch Key code. SSAFRAMESIZE This exemption to the token verification policy pro- EGETKEY vides a way to bootstrap the enclave attributes access AND Must be >= control system, namely using a zeroed out EINITTO- PADDING zero zero ISVSVN KEYID KEN to initialize the Launch Enclave. At the same time, ISVPRODID MASKEDATTRIBUTES KEYNAME CPUSVN the cryptographic primitives behind the MRSIGNER OWNEPOCH check guarantee that only Intel-provided enclaves will Key Derivation Material SEAL_FUSES be able to bypass the attribute checks. This does not change SGX’s security properties because Intel is already OWNEREPOCH SEAL_FUSES Current a trusted party, as it is responsible for generating the Pro- SGX Register Must be >= CPUSVN visioning Keys and Attestation Keys used by software attestation (§ 5.8.2). SGX Master AES-CMAC 128-bit Curiously, the EINIT pseudocode in the SDM states Derivation Key Key Derivation Launch Key that the instruction enforces an additional restriction, Figure 85: The SGX Launch Enclave computes the EINITTOKEN. which is that all enclaves with the LAUNCHKEY at- tribute must have its certificate issued by the same Intel control, e.g. to reject tokens issued by Launch Enclaves public key that is used to bypass the EINITTTOKEN with known security vulnerabilities. The reference pseu- checks. This restriction appears to be redundant, as the docode supplied in the SDM states that EINIT checks same restriction could be enforced in the Launch En- the DEBUG bit in the MASKEDATTRIBUTESLE field, clave. and will not initialize a production enclave using a to- 5.9.2 Licensing ken issued by a debugging LE. 
It is worth noting that MASKEDATTRIBUTESLE is guaranteed to include The SGX patents [110, 138] disclose that EINIT To- the LE’s DEBUG attribute, because EGETKEY forces kens and the Launch Enclave (§ 5.9.1) were introduced the DEBUG attribute’s bit in the attributes mask to 1 to verify that the SIGSTRUCT certificates associated

89 with production enclaves are issued by enclave authors means that the system software can perform its own who have a business relationship with Intel. In other policy checks before allowing application software to words, the Launch Enclave is intended to be an enclave initialize an enclave. So, the system software can enforce licensing mechanism that allows Intel to force itself a Launch Control Policy set by the computer’s owner. as an intermediary in the distribution of all enclave For example, an IaaS cloud service provider may use its software. hypervisor to implement a Launch Control Policy that The SGX patents are likely to represent an early ver- limits what enclaves its customers are allowed to execute. sion of the SGX design, due to the lengthy timelines Given that the system software has access to a superset associated with patent application approval. In light of of the information that the Launch Enclave might use, this consideration, we cannot make any claims about In- it is easy to see that the set of policies that can be en- tel’s current plans. However, given that we know for sure forced by system software is a superset of the policies that Intel considered enclave licensing at some point, we that can be supported by an LE. Therefore, the only ra- briefly discuss the implications of implementing such a tional explanation for the existence of the LE is that it licensing plan. was designed to implement a Launch Control Policy that Intel has a near-monopoly on desktop and server-class is not beneficial to the computer owner. processors, and being able to decide which software ven- As an illustration of this argument, we consider the dors are allowed to use SGX can effectively put Intel in case of restricting access to EGETKEY’s Provisioning a position to decide winners and losers in many software keys (§ 5.8.2). The derivation material for Provisioning markets. keys does not include OWNEREPOCH, so malicious Assuming SGX reaches widespread adoption, this is- enclaves can potentially use these keys to track a CPU sue is the software security equivalent to the Net Neutral- chip package as it exchanges owners. For this reason, the ity debates that have pitted the software industry against SGX design includes a simple access control mechanism telecommunication giants. Given that virtually all com- that can be used by system software to limiting enclave petent software development companies have argued that access to Provisioning keys. EGETKEY refuses to derive losing Net Neutrality will stifle innovation, it is fairly Provisioning keys for enclaves whose PROVISIONKEY safe to assume that Intel’s ability to regulate access to attribute is not set to true. SGX will also stifle innovation. It follows that a reasonable Launch Control Policy Furthermore, from a historical perspective, the enclave would only allow the PROVISIONKEY attribute to be licensing scheme described in the SGX patents is very set for the enclaves that implement software attestation, similar to Verified Boot, which was briefly discussed such as Intel’s Provisioning Enclave and Quoting En- in § 4.4. Verified Boot has mostly received negative clave. This policy can easily be implemented by system reactions from software developers, so it is likely that software, given its exclusive access to the EINIT instruc- an enclave licensing scheme would meet the same fate, tion. should the developer community become aware of it. 
The only concern with the approach outlined above is that a malicious system software might abuse the PRO- 5.9.3 System Software Can Enforce a Launch Policy VISIONKEY attribute to generate a unique identifier for § 5.3 explains that the SGX instructions used to load and the hardware that it runs on, similar to the much ma- initialize enclaves (ECREATE, EADD, EINIT) can only ligned Intel Processor Serial Number [86]. We dismiss be issued by privileged system software, because they this concern by pointing out that system software has manage the EPC, which is a . access to many unique identifiers, such as the Media A consequence on the restriction that only privileged Access Control (MAC) address of the Ethernet adapter software can issue ECREATE and EADD instructions is integrated into the motherboard’s chipset (§ 2.9.1). that the system software is able to track all the public 5.9.4 Enclaves Cannot Damage the Host Computer contents that is loaded into each enclave. The privilege requirements of EINIT mean that the system software SGX enclaves execute at the lowest privilege level (user can also examine each enclave’s SIGSTRUCT. It follows mode / ring 3), so they are subject to the same security that the system software has access to a superset of the checks as their host application. For example, modern information that the Launch Enclave might use. operating systems set up the I/O maps (§ 2.7) to pre- Furtheremore, EINIT’s privileged instruction status vent application software from directly accessing the I/O

90 address space (§ 2.4), and use the supervisor (S) page 5.9.5 Interaction with Anti-Virus Software table attribute (§ 2.5.3) to deny application software di- rect access to memory-mapped devices (§ 2.4) and to Today’s anti-virus (AV) systems are glorified pattern the DRAM that stores the system software. Enclave matchers. AV software simply scans all the executable software is subject to I/O privilege checks and address files on the system and the memory of running processes, translation checks, so a malicious enclave cannot directly looking for bit patterns that are thought to only occur interact with the computer’s devices, and cannot tamper in malicious software. These patterns are somewhat the system software. pompously called “virus signatures”. SGX (and TXT, to some extent) provides a method for It follows that software running in an enclave has the executing code in an isolated container that we refer to same means to compromise the system software as its as an enclave. Enclaves are isolated from all the other host application, which come down to exploiting a secu- software on the computer, including any AV software rity vulnerability. The same solutions used to mitigate that might be installed. vulnerabilities exploited by application software (e.g., The isolation afforded by SGX opens up the possibility /bpf [118]) apply to enclaves. for bad actors to structure their attacks as a generic loader The only remaining concern is that an enclave can per- that would end up executing a malicious payload without form a denial of service (DoS) attack against the system tripping the AV’s pattern matcher. More specifically, the software. The rest of this section addresses the concern. attack would create an enclave and initialize it with a The SGX design provides system software the tools generic loader that looks innocent to an AV. The loader it needs to protect itself from enclaves that engage in inside the enclave would obtain an encrypted malicious CPU hogging and DRAM hogging. As enclaves cannot payload, and would undergo software attestation with perform I/O directly, these are the only two classes of an Internet server to obtain the payload’s encryption key. DoS attacks available to them. The loader would then decrypt the malicious payload and execute it inside the enclave. An enclave that attempts to hog an LP assigned to it In the scheme suggested here, the malicious payload can be preempted by the system software via an Inter- only exists in a decrypted form inside an enclave’s mem- Processor Interrupt (IPI, § 2.12) issued from another ory, which cannot be accessed by the AV. Therefore, the processor. This method is available as long as the sys- AV’s pattern matcher will not trip. tem software reserves at least one LP for non-enclave This issue does not have a solution that maintains the computation. status-quo for the AV vendors. The attack described Furthermore, most OS kernels use tick schedulers, above would be called a protection scheme if the payload which use a real-time clock (RTC) configured to issue pe- would be a proprietary image processing algorithm, or a riodical interrupts (ticks) to all cores. The RTC interrupt DRM scheme. handler invokes the kernel’s scheduler, which chooses On a brighter note, enclaves do not bring the com- the thread that will get to use the logical processor until plete extinction of AV, they merely require a change in the next RTC interrupt is received. Therefore, kernels approach. 
Enclave code always executes at the lowest that use tick schedulers always have the opportunity to privilege mode (ring 3 / user mode), so it cannot perform de-schedule enclave threads, and don’t need to rely on any I/O without invoking the services of system software. the ability to send IPIs. For all intents and purposes, this effectively means that In SGX, the system software can always evict an en- enclave software cannot perform any malicious action clave’s EPC pages to non-EPC memory, and then to disk. without the complicity of system software. Therefore, The system software can also outright deallocate an en- enclaves can be policed effectively by intelligent AV clave’s EPC pages, though this will probably cause the software that records and filters the I/O performed by enclave code to encounter page faults that cannot be re- software, and detects malicious software according to solved. The only catch is that the EPC pages that hold the actions that it performs, rather than according to bit metadata for running enclave threads cannot be evicted patterns in its code. or removed. However, this can easily be resolved, as Furthermore, SGX’s enclave loading model allows the system software can always preempt enclave threads, the possibility of performing static analysis on the en- using one of the methods described above. clave’s software. For simplicity, assume the existence

91 of a standardized static analysis framework. The initial The PRM range is configured by the PRM Range Reg- enclave contents is not encrypted, so the system software isters (§ 5.1), which have exactly the same semantics as can easily perform static analysis on it. Dynamically the Memory Type Range Registers (MTRRs, § 2.11.4) loaded code or Just-In-Time code generation (JIT) can used to configure a variable memory range. The page be handled by requiring that all enclaves that use these walker FSM in the PMH is already configured to issue a techniques embed the static analysis framework and use microcode assist when the page tables are in uncacheable it to analyze any dynamically loaded code before it is memory (§ 2.11.4). Therefore, the PRMRR can be repre- executed. The system software can use static verification sented as an extra MTRR pair. to ensure that enclaves follow these rules, and refuse to initialize any enclaves that fail verification. 6.1.2 Uncore Modifications In conclusion, enclaves in and of themselves don’t The SDM states that DMA transactions (§ 2.9.1) that introduce new attack vectors for malware. However, the target the PRM range are aborted by the processor. The enclave isolation mechanism is fundamentally incompati- SGX patents disclose that the PRMRR protection against ble with the approach employed by today’s AV solutions. unauthorized DMA is implemented by having the SGX Fortunately, it is possible (though non-trivial) to develop microcode set up entries in the Source Address De- more intelligent AV software for enclave software. coder (SAD) in the uncore CBoxes and in the Target Address Decoder (TAD) in the integrated Memory Con- 6 SGX ANALYSIS troller (MC). 6.1 SGX Implementation Overview § 2.11.3 mentions that Intel’s Trusted Execution Tech- An under-documented and overlooked feat achieved by nology (TXT) [70] already takes advantage of the inte- the SGX design is that implementing it on an Intel pro- grated MC to protect a DRAM range from DMA. It is cessor has a very low impact on the chip’s hardware highly likely that the SGX implementation reuses the design. SGX’s modifications to the processor’s execu- mechanisms brought by TXT, and only requires the ex- tion cores (§ 2.9.4) are either very small or completely tension of the SADs and TADs by one entry. inexistent. The CPU’s uncore (§ 2.9.3, § 2.11.3) receives SGX’s major hardware modification is the Memory a new module, the Memory Encryption Engine, which Encryption Engine (MEE) that is added to the processor’s appears to be fairly self-contained. uncore (§ 2.9.3, § 2.11.3) to protect SGX’s Enclave Page The bulk of the SGX implementation is relegated to Cache (EPC, § 5.1.1) against physical attacks. the processor’s microcode (§ 2.14), which supports a The MEE was first briefly described in the ISCA 2015 much higher development speed than the chip’s electrical SGX tutorial [103]. According to the information pre- circuitry. sented there, the MEE roughly follows the approach intro- duced by Aegis [174][176], which relies on a variation 6.1.1 Execution Core Modifications of Merkle trees to provide the EPC with confidentiality, At a minimum, the SGX design requires a very small integrity, and freshness guarantees (§ 3.1). Unlike Aegis, modification to the processor’s execution cores (§ 2.9.4), the MEE uses non-standard cryptographic primitives that in the Page Miss Handler (PMH, § 2.11.5). 
include a slightly modified AES operating mode (§ 3.1.2) The PMH resolves TLB misses, and consists of a fast and a Carter-Wegman [30, 187] MAC (§ 3.1.3) construc- path that relies on an FSM page walker, and a microcode tion. The MEE was further described in [74]. assist fallback that handles the edge cases (§ 2.14.3). Both the ISCA SGX tutorial and the patents state that The bulk of SGX’s memory access checks, which are the MEE is connected to to the Memory Controller (MC) discussed in § 6.2, can be implemented in the microcode integrated in the CPU’s uncore. However, all sources are assist. completely silent on further implementation details. The The only modification to the PMH hardware that is MEE overview slide states that “the Memory Controller absolutely necessary to implement SGX is developing an detects [the] address belongs to the MEE region, and ability to trigger the microcode assist for all address trans- routes transaction to MEE”, which suggests that the MEE lations when a logical processor (§ 2.9.4) is in enclave is fairly self-contained and has a narrow interface to the mode (§ 5.4), or when the physical address produced by rest of the MC. the page walker FSM matches the Processor Reserved Intel’s SGX patents use the name Crypto Memory Memory (PRM, § 5.1) range. Aperture (CMA) to refer to the MEE. The CMA descrip-

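As a concrete reading of the preceding paragraph, the required PMH change amounts to one extra trigger condition in front of the TLB fill path. The helper below is a hypothetical sketch of that condition, with the PRM range modeled as a base/mask register pair; it is not a description of Intel's circuitry.

#include <stdbool.h>
#include <stdint.h>

/* Assumed PRM range registers, with base/mask semantics like an MTRR pair. */
extern uint64_t prmrr_base, prmrr_mask;

static bool in_prm(uint64_t physical_address) {
    return (physical_address & prmrr_mask) == (prmrr_base & prmrr_mask);
}

/* The page walker FSM falls back to a microcode assist whenever the logical
 * processor is in enclave mode, or the translated address falls in the PRM.  */
bool needs_sgx_microcode_assist(bool enclave_mode, uint64_t physical_address) {
    return enclave_mode || in_prm(physical_address);
}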
92 tion matches the MEE and PRM concepts, as follows. the SGX patents reveal that most of these registers are According to the patents, the CMA is used to securely actually stored in DRAM. store the EPC, relies on crypto controllers in the MC, For example, the patents state that each TCS (§ 5.2.4) and loses its keys during deep sleep. These details align has two fields that receive the values of the DR7 and perfectly with the SDM’s statements regarding the MEE IA32 DEBUGCTL registers when the processor enters and PRM. enclave mode (§ 5.4.1), and are used to restore the The Intel patents also disclose that the EPCM (§ 5.1.2) original register values during enclave exit (§ 5.4.2). and other structures used by the SGX implementation The SDM documents these fields as “internal CREGs” are also stored in the PRM. This rules out the possibility (CR SAVE DR7 and CR SAVE DEBUGCTL), which that the EPCM requires on-chip memory resembling the are stated to be “hardware specific registers”. last-level cache (§ 2.11, § 2.11.3). The SGX patents document a small subset of the Last, the SGX patents shine a bit of light on an area CREGs described in the SDM, summarized in Table 22, that the official Intel documentation is completely silent as microcode registers. While in general we trust offi- about, namely the implementation concerns brought by cial documentation over patents, in this case we use the computer systems with multiple processor chips. The CREG descriptions provided by the patents, because they patents state that the MEE also protects the Quick-Path appear to be more suitable for implementation purposes. Interconnect (QPI, § 2.9.1) traffic using link-layer en- From a cost-performance standpoint, the cost of regis- cryption. ter memory only seems to be justified for the state used by the PMH to implement SGX’s memory access checks, 6.1.3 Microcode Modifications which will be discussed in § 6.2). The other pieces of According to the SGX patents, all the SGX instructions state listed as CREGs are accessed so infrequently that are implemented in microcode. This can also be de- storing them in dedicated SRAM would make very little duced by reading the SDM’s pseuodocode for all the sense. instructions, and realizing that it is highly unlikely that The SGX patents state that SGX requires very few any SGX instruction can be implemented in 4 or fewer hardware changes, and most of the implementation is in micro-ops (§ 2.10), which is the most that can be handled microcode, as a positive fact. We therefore suspect that by the simple decoders used in the hardware fast paths minimizing hardware changes was a high priority in the (S 2.14.1). SGX design, and that any SGX modification proposals The Asynchronous Enclave Exit (AEX, § 5.4.3) behav- need to be aware of this priority. ior is also implemented in microcode. § 2.14.2 draws on an assortment of Intel patents to conclude that hardware 6.2 SGX Memory Access Protection exceptions (§ 2.8.2), including both faults and interrupts, SGX guarantees that the software inside an enclave is trigger microcode events (§ 2.14.2). It follows that the isolated from all the software outside the enclave, includ- SGX implementation can implement AEX by modifying ing the software running in other enclaves. This isolation the hardware exception handlers in the microcode. guarantee is at the core of SGX’s security model. The SGX initialization sequence is also implemented It is tempting to assume that the main protection in microcode. SGX is initialized in two phases. 
First, it is mechanism in SGX is the Memory Encryption Engine very likely that the boot sequence in microcode (§ 2.14.4) (MEE) described in § 6.1.2, as it encrypts and MACs was modified to initialize the registers associated with the DRAM’s contents. However, the MEE sits in the the SGX microcode. The ISCA SGX tutorial states that processor’s memory controller, which is at the edge of the MEE’ keys are initialized during the boot process. the on-chip memory hierarchy, below the caches (§ 2.11). Second, SGX instructions are enabled by setting a bit Therefore, the MEE cannot protect an enclave’s memory in a Model-Specific Register (MSR, § 2.4). This second from software attacks. phase involves enabling the MEE and configuring the The root of SGX’s protections against software attacks SAD and TAD to protect the PRM range. Both tasks are is a series of memory access checks which prevents the amenable to a microcode implementation. currently running software from accessing memory that The SGX description in the SDM implies that the SGX does not belong to it. Specifically, non-enclave software implementation uses a significant number of new regis- is only allowed to access memory outside the PRM range, ters, which are only exposed to microcode. However, while the code inside an enclave is allowed to access non-

SDM Name            Bits  Scope              Description
CSR_SGX_OWNEREPOCH  128   CPU Chip Package   Used by EGETKEY (§ 5.7.5)
CR_ENCLAVE_MODE     1     Logical Processor  1 when executing code inside an enclave
CR_ACTIVE_SECS      16    Logical Processor  The index of the EPC page storing the current enclave's SECS
CR_TCS_LA           64    Logical Processor  The virtual address of the TCS (§ 5.2.4) used to enter (§ 5.4.1) the current enclave
CR_TCS_PH           16    Logical Processor  The index of the EPC page storing the TCS used to enter the current enclave
CR_XSAVE_PAGE_0     16    Logical Processor  The index of the EPC page storing the first page of the current SSA (§ 5.2.5)

Table 22: The subset of the SGX implementation's CREGs that the SGX patents document as microcode registers.
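For orientation, the state in Table 22 can be pictured as a small record owned by the SGX microcode, with one part shared by the CPU package and one part replicated per logical processor. The grouping below is purely illustrative; it mirrors the table's fields and is not an Intel-defined structure.

#include <stdint.h>

typedef struct {
    uint8_t  csr_sgx_ownerepoch[16];  /* 128-bit OWNEREPOCH, shared by the CPU package   */
} sgx_package_state_t;

typedef struct {
    uint8_t  cr_enclave_mode;         /* 1 when executing code inside an enclave          */
    uint16_t cr_active_secs;          /* EPC page index of the current enclave's SECS     */
    uint64_t cr_tcs_la;               /* virtual address of the TCS used to enter         */
    uint16_t cr_tcs_ph;               /* EPC page index of that TCS                       */
    uint16_t cr_xsave_page_0;         /* EPC page index of the current SSA's first page   */
} sgx_lp_state_t;                     /* one instance per logical processor               */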
PRM memory, and the EPC pages owned by the enclave.

Although it is believed [50] that SGX's access checks are performed on every memory access, Intel's patents disclose that the checks are performed in the Page Miss Handler (PMH, § 2.11.5), which only handles TLB misses.

6.2.1 Functional Description

The intuition behind SGX's memory access protections can be built by considering what it would take to implement the same protections in a trusted operating system or hypervisor, solely by using the page tables that direct the CPU's address translation feature (§ 2.5).

The hypothetical trusted software proposed above can implement enclave entry (§ 5.4.1) as a system call (§ 2.8.1) that creates page table entries mapping the enclave's memory. Enclave exit (§ 5.4.2) can be a symmetric system call that removes the page table entries created during enclave entry. When modifying the page tables, the system software has to consider TLB coherence issues (§ 2.11.5) and perform TLB shootdowns when appropriate.

SGX leaves page table management under the system software's control, but it cannot trust the software to set up the page tables in any particular way. Therefore, the hypothetical design described above cannot be used by SGX as-is. Instead, at a conceptual level, the SGX implementation approximates the effect of having the page tables set up correctly by inspecting every address translation that comes out of the Page Miss Handler (PMH, § 2.11.5). The address translations that do not obey SGX's access control restrictions are rejected before they reach the TLBs.

SGX's approach relies on the fact that software always references memory using virtual addresses, so all the micro-ops (§ 2.10) that reach the memory execution units (§ 2.10.1) use virtual addresses that must be resolved using the TLBs before the actual memory accesses are carried out. By contrast, the processor's microcode (§ 2.14) has the ability to issue physical memory accesses, which bypass the TLBs. Conveniently, SGX instructions are implemented in microcode (§ 6.1.3), so they can bypass the TLBs and access memory that is off limits to software, such as the EPC page holding an enclave's SECS (§ 5.1.3).

The SGX address translation checks use the information in the Enclave Page Cache Map (EPCM, § 5.1.2), which is effectively an inverted page table that covers the entire EPC. This means that each EPC page is accounted for by an EPCM entry, using the structure summarized in Table 23. The EPCM fields were described in detail in § 5.1.2, § 5.2.3, § 5.2.4, § 5.5.1, and § 5.5.2.

Field        Bits  Description
VALID        1     0 for un-allocated EPC pages
BLOCKED      1     page is being evicted
R            1     enclave code can read
W            1     enclave code can write
X            1     enclave code can execute
PT           8     page type (Table 24)
ADDRESS      48    the virtual address used to access this page
ENCLAVESECS        the EPC slot number for the SECS of the enclave owning the page

Table 23: The fields in an EPCM entry.

Type     Allocated by  Contents
PT_REG   EADD          enclave code and data
PT_SECS  ECREATE       SECS (§ 5.1.3)
PT_TCS   EADD          TCS (§ 5.2.4)
PT_VA    EPA           VA (§ 5.5.2)

Table 24: Values of the PT (page type) field in an EPCM entry.

Conceptually, SGX adds the access control logic illustrated in Figure 86 to the PMH. SGX's security checks are performed after the page table attributes-based checks (§ 2.5.3) defined by the Intel architecture. It follows that SGX's access control logic has access to the physical address produced by the page walker FSM.

[Figure 86: flowchart omitted. The address translation produced by the page walker FSM is checked as follows: outside enclave mode, translations targeting the PRM have their address replaced with an abort page; inside enclave mode, the PMH checks whether the virtual address falls in ELRANGE, whether the physical address falls in the EPC, and whether the EPCM entry is not blocked, has type PT_REG, belongs to the current enclave (EID check), and records the same virtual address (ADDRESS check); failing translations cause page faults, and passing translations are inserted into the TLB with their flags adjusted according to the EPCM entry.]

Figure 86: SGX adds a few security checks to the PMH. The checks ensure that all the TLB entries created by the address translation unit meet SGX's memory access restrictions.

SGX's security checks depend on whether the logical processor (§ 2.9.4) is in enclave mode (§ 5.4) or not. While the processor is outside enclave mode, the PMH allows any address translation that does not target the PRM range (§ 5.1). When the processor is inside enclave mode, the PMH performs the checks described below, which provide the security guarantees described in § 5.2.3.

First, virtual addresses inside the enclave's virtual memory range (ELRANGE, § 5.2.1) must always translate into physical addresses inside the EPC. This way, an enclave is assured that all the code and data stored in ELRANGE receives SGX's confidentiality, integrity, and freshness guarantees. Since the memory outside ELRANGE does not enjoy these guarantees, the SGX design disallows having enclave code outside ELRANGE. This is most likely accomplished by setting the disable execution (XD, § 2.5.3) attribute on the TLB entry.

Second, an EPC page must only be accessed by the code of the enclave that owns the page. For the purpose of this check, each enclave is identified by the index of the EPC page that stores the enclave's SECS (§ 5.1.3). The current enclave's identifier is stored in the CR_ACTIVE_SECS microcode register during enclave entry. This register is compared against the enclave identifier stored in the EPCM entry corresponding to the EPC page targeted by the address translation.

Third, some EPC pages cannot be accessed by software. Pages that hold SGX internal structures, such as a SECS, a TCS (§ 5.2.4), or a VA (§ 5.5.2), must only be accessed by SGX's microcode, which uses physical addresses and bypasses the address translation unit, including the PMH. Therefore, the PMH rejects address translations targeting these pages.

Blocked (§ 5.5.1) EPC pages are in the process of being evicted (§ 5.5), so the PMH must not create new TLB entries targeting them.

Next, an enclave's EPC pages must always be accessed using the virtual addresses associated with them when they were allocated to the enclave. Regular EPC pages, which can be accessed by software, are allocated to enclaves using the EADD (§ 5.3.2) instruction, which reads in the page's address in the enclave's virtual address space. This address is stored in the LINADDR field in the corresponding EPCM entry. Therefore, all the PMH has to do is to ensure that LINADDR in the address translation's target EPCM entry equals the virtual address that caused the TLB miss which invoked the PMH.

At this point, the PMH's security checks have completed, and the address translation result will definitely be added to the TLB. Before that happens, however, the SGX extensions to the PMH apply the access restrictions in the EPCM entry for the page to the address translation result. While the public SGX documentation we found did not describe this process, there is a straightforward implementation that fulfills SGX's security requirements. Specifically, the TLB entry bits P, W, and XD can be AND-ed with the EPCM entry bits R, W, and X.

6.2.2 EPCM Entry Representation

Most EPCM entry fields have obvious representations. The exception is the LINADDR and ENCLAVESECS fields, described below. These representations explain SGX's seemingly arbitrary limit on the size of an enclave's virtual address range (ELRANGE).

The SGX patents disclose that the LINADDR field in an EPCM entry stores the virtual page number (VPN, § 2.5.1) of the corresponding EPC page's expected virtual address, relative to the ELRANGE base of the enclave that owns the page.

The representation described above reduces the number of bits needed to store LINADDR, assuming that the maximum ELRANGE size is significantly smaller than the virtual address size supported by the CPU. This desire to save EPCM entry bits is the most likely motivation for specifying a processor model-specific ELRANGE size, which is reported by the CPUID instruction (2^36 to 2^40 bytes, depending on the processor).

6.2.3 PMH Hardware Modifications

The SDM describes the memory access checks performed after SGX is enabled, but does not provide any insight into their implementation. Intel's patents hint at three possible implementations that make different cost-performance tradeoffs. This section summarizes the three approaches and argues in favor of the implementation that requires the fewest hardware modifications to the PMH.

All implementations of SGX's security checks entail adding a pair of memory type range registers (MTRRs, § 2.11.4) to the PMH. These registers are named the Secure Enclave Range Registers (SERR) in Intel's patents. Enabling SGX on a logical processor initializes the SERR to the values of the Protected Memory Range Registers (PMRR, § 5.1).

Furthermore, all implementations have the same behavior when a logical processor is outside enclave mode. The memory type range described by the SERR is enabled, causing a microcode assist to trigger for every address translation that resolves inside the PRM. SGX's implementation uses the microcode assist to replace the address translation result with an address that causes memory access transactions to be aborted.

The three implementations differ in their behavior when the processor enters enclave mode (§ 5.4) and starts executing enclave code.

The alternative that requires the least amount of hardware changes sets up the PMH to trigger a microcode assist for every address translation. This can be done by setting the SERR to cover all the physical memory (e.g., by setting both the base and the mask to zero). In this approach, the microcode assist implements all the
enclave mode security checks illustrated in Figure 86. The SDM states that the ENCLAVESECS field of an A speedier alternative adds a pair of registers to the ECPM entry corresponding to an EPC page indicates PMH that represents the current enclave’s ELRANGE the SECS of belonging to the enclave that owns the and modifies the PMH so that, in addition to checking page. Intel’s patents reveal that the SECS address in physical addresses against the SERR, it also checks the ENCLAVESECS is represented as a physical page num- virtual addresses going into address translations against ber (PPN, § 2.5.1) relative to the start of the EPC. Effec- ELRANGE. When either check is true, the PMH in- tively, this relative PPN is the 0-based EPC page index. vokes the microcode assist used by SGX to implement The EPC page index representation saves bits in the its memory access checks. Assuming the ELRANGE reg- ECPM entry, assuming that the EPCM size is signifi- isters use the same base / mask representation as variable cantly smaller than the physical address space supported MTRRs, enclave exists can clear ELRANGE by zeroing by the CPU. The ISCA 2015 SGX tutorial slides men- both the base and the mask. This approach uses the same tion an EPC size of 96MB, which is significantly smaller microcode assist implementation, minus the ELRANGE than the physical addressable space on today’s typical check that moves into the PMH hardware.

96 The second alternative described above has the ben- ing of SGX. efit that the microcode assist is not invoked for enclave 6.3.1 Top-Level Invariant Breakdown mode accesses outside ELRANGE. However, § 5.2.1 argues that an enclave should treat all the virtual mem- We first break down the above invariant into specific ory addresses outside ELRANGE as untrusted storage, cases based on whether a logical processor (LP) is ex- and only use that memory to communicate with soft- ecuting enclave code or not, and on whether the TLB ware outside the enclave. Taking this into considera- entries translate virtual addresses in the current enclave’s tion, well-designed enclaves would spend relatively little ELRANGE (§ 5.2.1). When the processor is outside en- time performing memory accesses outside ELRANGE. clave mode, ELRANGE can be considered to be empty. Therefore, this second alternative is unlikely to obtain This reasoning yields the three cases outlined below. performance gains that are worth its cost. The last and most performant alternative would entail 1. At all times when an LP is outside enclave mode, its implementing all the access checks shown in Figure 86 in TLB may only contain physical addresses belonging hardware. Similarly to the address translation FSM, the to DRAM pages outside the PRM. hardware would only invoke a microcode assist when a 2. At all times when an LP is inside enclave mode, security check fails and a Page Fault needs to be handled. the TLB entries for virtual addresses outside the The high-performance implementation described current enclave’s ELRANGE must contain physical above avoids the cost of microcode assists for all addresses belonging to DRAM pages outside the TLB misses, assuming well-behaved system software. PRM. In this association, a microcode assist results in a Page Fault, which triggers an Asynchronous Enclave 3. At all times when an LP is in enclave mode, the Exit (AEX, § 5.4.3). The cost of the AEX dominates the TLB entries for virtual addresses inside the current performance overhead of the microcode assist. enclave’s ELRANGE must match the virtual mem- While this last implementation looks attractive, one ory layout specified by the enclave author. needs to realize that TLB misses occur quite infrequently, so a large improvement in the TLB miss speed trans- The first two invariant cases can be easily proven in- lates into a much less impressive improvement in overall dependently for each LP, by induction over the sequence enclave code execution performance. Taking this into of instructions executed by the LP. For simplicity, the consideration, it seems unwise to commit to extensive reader can assume that instructions are executed in pro- hardware modifications in the PMH before SGX gains gram mode. While the assumption is not true on proces- adoption. sors with out-of-order execution (§ 2.10), the arguments presented here also hold when the executed instruction 6.3 SGX Security Check Correctness sequence is considered in retirement order, for reasons In § 6.2.1, we argued that SGX’s security guarantees that will be described below. can be obtained by modifying the Page Miss Han- An LP will only transition between enclave mode and dler (PMH, § 2.11.5) to block undesirable address trans- non-enclave mode at a few well-defined points, which are lations from reaching the TLB. 
This section builds on the EENTER (§ 5.4.1), ERESUME (§ 5.4.4), EEXIT (§ 5.4.2), result above and outlines a correctness proof for SGX’s and Asynchronous Enclave Exits (AEX, § 5.4.3). Ac- memory access protection. cording to the SDM, all the transition points flush the Specifically, we outline a proof for the following in- TLBs and the out-of-order execution pipeline. In other variant. At all times, all the TLB entries in every log- words, the TLBs are guaranteed to be empty after every ical processor will be consistent with SGX’s security transition between enclave mode and non-enclave mode, guarantees. By the argument in § 6.2.1, the invariant so we can consider all these transitions to be trivial base translates into an assurance that all the memory accesses cases for our induction proofs. performed by software obey SGX’s security model. The While SGX initialization is not thoroughly discussed, high-level proof structure is presented because it helps the SDM mentions that loading some Model-Specific understand how the SGX security checks come together. Registers (MSRs, § 2.4) triggers TLB flushes, and that By contrast, a detailed proof would be incredibly tedious, system software should flush TLBs when modifying and would do very little to the reader’s understand- Memory Type Range Registers (MTRRs, § 2.11.4).

97 Given that all the possible SGX implementations de- 3. An EPCM entry is only modified when there is no scribed in § 6.2.3 entail adding a MTRR, it is safe to mapping for it in any LP’s TLB. assume that enabling SGX mode also results in a TLB The second and third invariant combined prove that flush and out-of-order pipeline flush, and can be used by all the TLBs in an SGX-enabled computer always reflect our induction proof as well. the contents of the EPCM, as the third invariant essen- All the base cases in the induction proofs are serializa- tially covers the gaps in the second invariant. This result, tion points for out-of-order execution, as the pipeline is in combination with the first invariant, shows that the flushed during both enclave mode transitions and SGX EPCM is a bridge between the memory layout specifi- initialization. This makes the proofs below hold when cations of the enclave authors and the TLB entries that the program order instruction sequence is replaced with regulate what memory can be accessed by software ex- the retirement order sequence. ecuting on the LPs. When further combined with the The first invariant case holds because while the LP is reasoning in § 6.2.1, the whole proof outlined here re- outside enclave mode, the SGX security checks added to sults in an end-to-end argument for the correctness of the PMH (§ 6.2.1, Figure 86) reject any address transla- SGX’s memory protection scheme. tion that would point into the PRM before it reaches the TLBs. A key observation for proving the induction step 6.3.2 EPCM Entries Reflect Enclave Author Design of this invariant case is that the PRM never changes after This sub-section outlines the proof for the following in- SGX is enabled on an LP. variant. At all times, each EPCM entry for a page that The second invariant case can be proved using a simi- is allocated to an enclave matches the virtual mem- lar argument. While an LP is executing an enclave’s code, ory layout desired by the enclave’s author. the SGX memory access checks added to the PMH reject A key observation, backed by the SDM pseudocode for any address translation that resolves to a physical address SGX instructions, is that all the instructions that modify inside the PRM, if the translated virtual address falls out- the EPCM pages allocated to an enclave are synchro- side the current enclave’s ELRANGE. The induction step nized using a lock in the enclave’s SECS. This entails the for this invariant case can be proven by observing that a existence of a time ordering of the EPCM modifications change in an LP’s current ELRANGE is always accom- associated with an enclave. We prove the invariant stated panied by a TLB flush, which results in an empty TLB above using a proof by induction over this sequence of that trivially satisfies the invariant. This follows from the EPCM modifications. constraint that an enclave’s ELRANGE never changes EPCM entries allocated to an enclave are created after it is established, and from the observation that the by instructions that can only be issued before the en- LP’s current enclave can only be changed by an enclave clave is initialized, specifically ECREATE (§ 5.3.1) and entry, which must be preceded by an enclave exit, which EADD (§ 5.3.2). The contents of the EPCM entries cre- triggers a TLB flush. 
ated by these instructions contributes to the enclave’s The third invariant case is best handled by recognizing measurement (§ 5.6), together with the initial data loaded that the Enclave Page Cache Map (EPCM, § 5.1.2) is into the corresponding EPC pages. an intermediate representation for the virtual memory § 3.3.2 argues that we can assume that enclaves with layout specified by the enclave authors. This suggests incorrect measurements do not exist, as they will be re- breaking down the case into smaller sub-invariants cen- jected by software attestation. Therefore, we can assume tered around the EPCM, which will be proven in the that the attributes used to initialize EPCM pages match sub-sections below. the enclave authors’ memory layout specifications. 1. At all times, each EPCM entry for a page that is EPCM entries can be evicted to untrusted DRAM, allocated to an enclave matches the virtual memory together with their corresponding EPC pages, by the EWB ELDU ELDB layout desired by the enclave’s author. (§ 5.5.4) instruction. The / (§ 5.5) in- structions re-load evicted page contents and metadata 2. Assuming that the EPCM contents is constant, at back into the EPC and EPCM. By induction, we can all times when an LP is in enclave mode, the TLB assume that an EPCM entry matches the enclave au- entries for virtual addresses inside the current en- thor’s specification when it is evicted. Therefore, if we clave’s ELRANGE must match EPCM entries that can prove that the EPCM entry that is reloaded from belong to the enclave. DRAM is equivalent to the entry that was evicted, we

98 can conclude that the reloaded entry matches the author’s victim enclave, and loading it into a malicious enclave specification. that would copy the page’s contents to untrusted DRAM. A detailed analysis of the cryptographic primitives The virtual address (LINADDR) field is covered by used by the SGX design to protect the evicted EPC the MAC tag, so the OS cannot modify the virtual mem- page contents and its associated metadata is outside the ory layout of an enclave by evicting an EPC page and scope of this work. Summarizing the description in § 5.5, specifying a different LINADDR when loading it back. the contents of evicted pages is encrypted using AES- If LINADDR was not covered by authenticity guarantees, GMAC (§ 3.1.3), which is an authenticated encryption a malicious OS could perform the exact attack shown in mechanism. The MAC tag produced by AES-GMAC Figure 55 and described in § 3.7.3. covers the EPCM metadata as well as the page data, and The page access permission flags (R, W, X) are also includes a 64-bit version that is stored in a version tree covered by the MAC tag. This prevents the OS from whose nodes are Version Array (VA, (§ 5.5.2) pages. changing the access permission bits in a page’s EPCM Assuming no cryptographic weaknesses, SGX’s entry by evicting the page and loading it back in. If scheme does appear to guarantee the confidentiality, in- the permission flags were not covered by authenticity tegrity, and freshness of the EPC page contents and asso- guarantees, the OS could use the ability to change EPCM ciated metadata while it is evicted in untrusted memory. access permissions to facilitate exploiting vulnerabilities It follows that EWB will only reload an EPCM entry if in enclave code. For example, exploiting a stack overflow the contents is equivalent to the contents of an evicted vulnerability is generally easier if OS can make the stack entry. pages executable. The equivalence notion invoked here is slightly dif- The nonce stored in the VA slot is also covered by ferent from perfect equality, in order to account for the the MAC. This prevents the OS from mounting a replay allowable operation of evicting an EPC page and its asso- attack that reverts the contents of an EPC page to an ciated EPCM entry, and then reloading the page contents older version. If the nonce would not be covered by to a different EPC page and a different EPCM entry, as integrity guarantees, the OS could evict the target EPC illustrated in Figure 69. Loading the contents of an EPC page at different times t1 and t2 in the enclave’s life, and page at a different physical address than it had before then provide the EWB outputs at t1 to the ELDU / ELDB does not break the virtual memory abstraction, as long instruction. Without the MAC verification, this attack as the contents is mapped at the same virtual address would successfully revert the contents of the EPC page (the LINEARADDRESS EPCM field), and has the same to its version at t1. access control attributes (R, W, X, PT EPCM fields) as it While replay attacks look relatively benign, they can had when it was evicted. be quite devastating when used to facilitate double spend- The rest of this section enumerates the address trans- ing. lation attacks prevented by the MAC verification that occurs in ELDU / ELDB. 
This is intended to help the 6.3.3 TLB Entries for ELRANGE Reflect EPCM Con- tents reader develop some intuition for the reasoning behind using the page data and all the EPCM fields to compute This sub-section sketches a proof for the following invari- and verify the MAC tag. ant. At all times when an LP is in enclave mode, the The most obvious attack is prevented by having the TLB entries for virtual addresses inside the current MAC cover the contents of the evicted EPC page, so the enclave’s ELRANGE must match EPCM entries that untrusted OS cannot modify the data in the page while it belong to the enclave. The argument makes the assump- is stored in untrusted DRAM. The MAC also covers the tion that the EPCM contents is constant, which will be metadata that makes up the EPCM entry, which prevents justified in the following sub-section. the more subtle attacks described below. The invariant can be proven by induction over the The enclave ID (EID) field is covered by the MAC tag, sequence of TLB insertions that occur in the LP. This so the OS cannot evict an EPC page belonging to one sequence is well-defined because an LP has a single enclave, and assign the page to a different enclave when PMH, so the address translation requests triggered by it is loaded back into the EPC. If EID was not covered by TLB misses must be serialized to be processed by the authenticity guarantees, a malicious OS could read any PMH. enclave’s data by evicting an EPC page belonging to the The proof’s induction step depends on the fact that the

99 TLB on hyper-threaded cores (§ 2.9.4) is dynamically start out with valid entries. These instructions are partitioned between the two LPs that share the core, and EREMOVE (§ 5.3.4) and EWB (§ 5.5.4). no TLB entry is shared between the LPs. This allows The EPCM entries associated with EPC pages that our proof to consider the TLB insertions associated with store Version Arrays (VA, § 5.5.2) represent a special one LP independently from the other LP’s insertions, case for both instructions mentioned above, as these which means we don’t have to worry about the state (e.g., pages are not associated with any enclave. As these enclave mode) of the other LP on the core. pages can only be accessed by the microcode used to im- The proof is further simplified by observing that when plement SGX, they never have TLB entries representing an LP exits enclave mode, both its TLB and its out-of- them. Therefore, both EREMOVE and EWB can invalidate order instruction pipeline are flushed. Therefore, the EPCM entries for VA pages without additional checks. enclave mode and current enclave register values used by EREMOVE only invalidates an EPCM entry associated address translations are guaranteed to match the values with an enclave when there is no LP executing in enclave obtained by performing the translations in program order. mode using a TCS associated with the same enclave. An Having eliminated all the complexities associated with EPCM entry can only result in TLB translations when an hyper-threaded (§ 2.9.4) out-of-order (§ 2.10) execution LP is executing code from the entry’s enclave, and the cores, it is easy to see that the security checks outlined in TLB translations are flushed when the LP exits enclave Figure 86 and § 6.2.1 ensure that TLB entries that target mode. Therefore, when EREMOVE invalidates an EPCM EPC pages are guaranteed to reflect the constraints in the entry, any associated TLB entry is guaranteed to have corresponding EPCM entries. been flushed. Last, the SGX access checks implemented in the PMH EWB’s correctness argument is more complex, as it reject any address translation for a virtual address in relies on the EBLOCK / ETRACK sequence described in ELRANGE that does not resolve to an EPC page. It § 5.5.1 to ensure that any TLB entry that might have been follows that memory addresses inside ELRANGE can created for an EPCM entry is flushed before the EPCM only map to EPC pages which, by the argument above, entry is invalidated. must follow the constraints of the corresponding EPCM Unfortunately, the SDM pseudocode for the instruc- entries. tions mentioned above leaves out the algorithm used to verify that the relevant TLB entries have been flushed. 6.3.4 EPCM Entries are Not In TLBs When Modified Thus, we must base our proof on the assumption that In this sub-section, we outline a proof that an EPCM the SGX implementation produced by Intel’s engineers entry is only modified when there is no mapping for matches the claims in the SDM. In § 6.4, we propose a it in any LP’s TLB.. This proof analyzes each of the method for ensuring that EWB will only succeed when instructions that modify EPCM entries. all the LPs executing an enclave’s code at the time when For the purposes of this proof, we consider that setting ETRACK is called have exited enclave mode at least once the BLOCKED attribute does not count as a modification between the ETRACK call and the EWB call. 
Having to an EPCM entry, as it does not change the EPC page proven the existence of a correct algorithm by construc- that the entry is associated with, or the memory layout tion, we can only hope that the SGX implementation uses specification associated with the page. our algorithm, or a better algorithm that is still correct. The instructions that modify EPCM entries in such a way that the resulting EPCM entries have the VALID 6.4 Tracking TLB Flushes field set to true require that the EPCM entries were in- This section proposes a straightforward method that the valid before they were modified. These instructions are SGX implementation can use to verify that the system ECREATE (§ 5.3.1), EADD (§ 5.3.2), EPA (§ 5.5.2), and software plays its part correctly in the EPC page evic- ELDU / ELDB (§ 5.5). The EPCM entry targeted by any tion (§ 5.5) process. Our method meets the SDM’s spec- these instructions must have had its VALID field set to ification for EBLOCK (§ 5.5.1), ETRACK (§ 5.5.1) and false, so the invariant proved in the previous sub-section EWB (§ 5.5.4). implies that the EPCM entry had no TLB entry associ- The motivation behind this section is that, at least at ated with it. the time of this writing, there is no official SGX doc- Conversely, the instructions that modify EPCM en- umentation that contains a description of the mecha- tries and result in entries whose VALID field is false nism used by EWB to ensure that all the Logical Pro-

100 cessors (LPs, § 2.9.4) running an enclave’s code exit instruction was issued. Therefore, the ETRACK imple- enclave mode (§ 5.4) between an ETRACK invocation mentation atomically zeroes lp-mask. The full ETRACK and a EWB invocation. Knowing that there exists a cor- algorithm is listed in Figure 88. rect mechanism that has the same interface as the SGX instructions described in the SDM gives us a reason to ETRACK(SECS) hope that the SGX implementation is also correct. Our method relies on the fact that an enclave’s £ Abort if tracking is already active. SECS (§ 5.1.3) is not accessible by software, and is 1 if SECS . tracking = TRUE already used to store information used by the SGX mi- 2 then return SGX-PREV-TRK-INCMPL crocode implementation (§ 6.1.3). We store the follow- £ Activate TLB flush tracking. ing fields in the SECS. tracking and done-tracking are 3 SECS . tracking ← TRUE Boolean variables. tracked-threads and active-threads 4 SECS . done-tracking ← FALSE are non-negative integers that start at zero and must 5 SECS . tracked-threads ← store numbers up to the number of LPs in the computer. ATOMIC-READ(SECS . active-threads) lp-mask is an array of Boolean flags that has one mem- 6 for i ← 0 to MAX-LP-ID ber per LP in the computer. The fields are initialized as 7 do ATOMIC-CLEAR(SECS . lp-mask[i]) shown in Figure 87. Figure 88: The algorithm used by ETRACK to activate TLB flush tracking. ECREATE(SECS) When an LP exits an enclave that has TLB flush £ Initialize the SECS state used for tracking. tracking activated, we atomically test and set the cur- 1 SECS . tracking ← FALSE rent LP’s flag in lp-mask. If the flag was not previ- 2 SECS . done-tracking ← FALSE ously set, it means that an LP that was executing the 3 SECS . active-threads ← 0 enclave’s code when ETRACK was invoked just exited 4 SECS . tracked-threads ← 0 enclave mode for the first time, and we atomically decre- 5 SECS . lp-mask ← 0 ment tracked-threads to reflect this fact. In other words, lp-mask prevents us from double-counting an LP when Figure 87: The algorithm used to initialize the SECS fields used by it exits the same enclave while TLB flush tracking is the TLB flush tracking method presented in this section. active. The active-threads SECS field tracks the number of Once active-threads reaches zero, we are assured that LPs that are currently executing the code of the enclave all the LPs running the enclave’s code when ETRACK who owns the SECS. The field is atomically incremented was issued have exited enclave mode at least once, and by EENTER (§ 5.4.1) and ERESUME (§ 5.4.4) and is can set the done-tracking flag. Figure 89 enumerates all atomically decremented by EEXIT (§ 5.4.2) and Asyn- the steps taken on enclave exit. chronous Enclave Exits (AEXs, § 5.4.3). Asides from helping track TLB flushes, this field can also be used by ENCLAVE-EXIT(SECS) EREMOVE (§ 5.3.4) to decide when it is safe to free an EPC page that belongs to an enclave. £ Track an enclave exit. As specified in the SDM, ETRACK activates TLB flush 1 ATOMIC-DECREMENT(SECS . active-threads) tracking for an enclave. In our method, this is accom- 2 if ATOMIC-TEST-AND-SET( plished by setting the tracking field to TRUE and the SECS . lp-mask[LP-ID]) done-tracking field to FALSE. 3 then ATOMIC-DECREMENT( When tracking is enabled, tracked-threads is the num- SECS . tracked-threads) ber of LPs that were executing the enclave’s code when 4 if SECS . tracked-threads = 0 the ETRACK instruction was issued, and have not yet ex- 5 then SECS . 
done-tracking ← TRUE ited enclave mode. Therefore, executing ETRACK atom- ically reads active-threads and writes the result into Figure 89: The algorithm that updates the TLB flush tracking state tracked-threads. Also, lp-mask keeps track of the LPs when an LP exits an enclave via EEXIT or AEX. that have exited the current enclave after the ETRACK Without any compensating measure, the method above

101 will incorrectly decrement tracked-threads, if the LP ex- iting the enclave had entered it after ETRACK was issued. EBLOCK(virtual-addr) We compensate for this with the following trick. When 1 physical-addr ← TRANSLATE(virtual-addr) an LP starts executing code inside an enclave that has 2 epcm-slot ← EPCM-SLOT(physical-addr) TLB flush tracking activated, we set its corresponding 3 if EPCM [slot]. BLOCKED = TRUE flag in lp-mask. This is sufficient to avoid counting the 4 then return SGX-BLKSTATE LP when it exits the enclave. Figure 90 lists the steps 5 if SECS . tracking = TRUE required by our method when an LP enters an enclave. 6 then if SECS . done-tracking = FALSE 7 then return SGX-ENTRYEPOCH-LOCKED 8 SECS . tracking ← FALSE ENCLAVE-ENTER(SECS) 9 EPCM [slot]. BLOCKED ← TRUE £ Track an enclave entry. 1 ATOMIC-INCREMENT(SECS . active-threads) Figure 92: The algorithm that marks the end of a TLB flushing 2 ATOMIC-SET(SECS . lp-mask[LP-ID]) cycle when EBLOCK is executed.

Figure 90: The algorithm that updates the TLB flush tracking state its intended value throughout enclave entries and exits. when an LP enters an enclave via EENTER or ERESUME. 6.5 Enclave Signature Verification With these algorithms in place, EWB can simply verify that both tracking and done-tracking are TRUE. This Let m be the public modulus in the enclave author’s ensures that the system software has triggered enclave RSA key, and s be the enclave signature. Since the exits on all the LPs that were running the enclave’s code SGX design fixes the value of the public exponent e to when ETRACK was executed. Figure 91 lists the algo- 3, verifying the RSA signature amounts to computing 3 rithm used by the EWB tracking verification step. the signed message M = s mod m, checking that the value meets the PKCS v1.5 padding requirements, and comparing the 256-bit SHA-2 hash inside the message EWB-VERIFY(virtual-addr) with the value obtained by hashing the relevant fields in 1 physical-addr ← TRANSLATE(virtual-addr) the SIGSTRUCT supplied with the enclave. 2 epcm-slot ← EPCM-SLOT(physical-addr) This section describes an algorithm for computing the 3 if EPCM [slot]. BLOCKED = FALSE signed message while only using subtraction and multi- 4 then return SGX-NOT-BLOCKED plication on large non-negative integers. The algorithm 5 SECS ← EPCM-ADDR( admits a significantly simpler implementation than the EPCM [slot]. ENCLAVESECS) typical RSA signature verification algorithm, by avoiding £ Verify that the EPC page can be evicted. the use of long division and negative numbers. The de- 6 if SECS . tracking = FALSE scription here is essentially the idea in [73], specialized 7 then return SGX-NOT-TRACKED for e = 3. 8 if SECS . done-tracking = FALSE The algorithm provided here requires the signer to 9 then return SGX-NOT-TRACKED compute the q1 and q2 values shown below. The values can be computed from the public information in the sig- nature, so they do not leak any additional information Figure 91: The algorithm that ensures that all LPs running an enclave’s code when ETRACK was executed have exited enclave about the private signing key. Furthermore, the algorithm mode at least once. verifies the correctness of the values, so it does not open Last, EBLOCK marks the end of a TLB flush tracking up the possibility for an attack that relies on supplying cycle by clearing the tracking flag. This ensures that sys- incorrect values for q1 and q2. tem software must go through another cycle of ETRACK and enclave exits before being able to use EWB on the s2  page whose BLOCKED EPCM field was just set to TRUE q = 1 m by EBLOCK. Figure 92 shows the details.  3  Our method’s correctness can be easily proven by ar- s − q1 × s × m q2 = guing that each SECS field introduced in this section has m

102 Due to the desirable properties mentioned above, it is passes step 4, it must be the case that the value supplied very likely that the algorithm described here is used by for q1 is correct. 2 the SGX implementation to verify the RSA signature in We can also plug s , q1 and m into the integer division an enclave’s SIGSTRUCT (§ 5.7.1). remainder definition to obtain the identity s2 mod m = 2 The algorithm in Figure 93 computes the signed mes- s − q1 × m. However, according to the computations 3 2 sage M = s mod m, while also verifying that the given performed in steps 1 and 3, w = s − q1 × m. Therefore, 2 values of q1 and q2 are correct. The latter is necessary we can conclude that w = s mod m. because the SGX implementation of signature verifica- 6.5.2 Analysis of Steps 5 - 8 tion must handle the case where an attacker attempts to exploit the signature verification implementation by Similarly, steps 5 − 8 in the algorithm check the correct- supplying invalid values for q1 and q2. ness of q2 and use it to compute w × s mod m. The key observation here is that q2 is the quotient of the integer 1. Compute u ← s × s and ← q1 × m division (w × s)/m. We can convince ourselves of the truth of this obser- 2. If u < v, abort. q must be incorrect. 1 vation by using the fact that w = s2 mod m, which was 3. Compute w ← u − v proven above, by plugging in the definition of the re- mainder in integer division, and by taking advantage of 4. If w ≥ m, abort. q1 must be incorrect. the distributivity of integer multiplication with respect to addition. 5. Compute x ← w × s and y ← q2 × m

6. If x < y, abort. q2 must be incorrect. w × s (s2 mod m) × s = 7. Compute z ← x − y. m m $ 2 s2 % (s − b m × m) × s 8. If z ≥ m, abort. q2 must be incorrect. = m 9. Output z. $ 2 % s3 − b s c × m × s = m m Figure 93: An RSA signature verification algorithm specialized for  3  the case where the public exponent is 3. s is the RSA signature and s − q1 × m × s m is the RSA key modulus. The algorithm uses two additional inputs, = m q1 and q2. s3 − q × s × m The rest of this section proves the correctness of the = 1 algorithm in Figure 93. m = q2 6.5.1 Analysis of Steps 1 - 4

Steps 1 − 4 in the algorithm check the correctness of q1 By the same argument used to analyze steps 1 − 4, 2 and use it to compute s mod m. The key observation we use elementary division properties to prove that q2 is to understanding these steps is recognizing that q1 is the correct if and only if the equation below is correct. quotient of the integer division s2/m. Having made this observation, we can use elementary 0 ≤ w × s − q2 × m < m division properties to prove that the supplied value for q 1 The equation’s first comparison, 0 ≤ w × s − q × m, is correct if and only if the following property holds. 2 is equivalent to q2 × m ≤ w × s, which corresponds to 2 the check performed by step 6. The second comparison, 0 ≤ s − q1 × m < m w × s − q2 × m < m, matches the condition verified by 2 We observe that the first comparison, 0 ≤ s −q1 ×m, step 8. It follows that, if the algorithm passes step 8, it 2 is equivalent to q1 × m ≤ s , which is precisely the must be the case that the value supplied for q2 is correct. check performed by step 2. We can also see that the By plugging w × s, q2 and m into the integer division 2 second comparison, s −q1 ×m < m corresponds to the remainder definition, we obtain the identity w × s mod condition verified by step 4. Therefore, if the algorithm m = w ×s−q2 ×m. Trivial substitution reveals that the

103 computations in steps 5 and 7 result in z = w×s−q2×m, secrets in plain text. Once initialized, an enclave is ex- which allows us to conclude that z = w × s mod m. pected to participate in a software attestation process, In the analysis for steps 1 − 4, we have proven that where it authenticates itself to a remote server. Upon suc- w = s2 mod m. By substituting this into the above cessful authentication, the remote server is expected to identity, we obtain the proof that the algorithm’s output disclose some secrets to an enclave over a secure commu- is indeed the desired signed message. nication channel. The SGX design attempts to guarantee that the measurement presented during software attesta- tion accurately represents the contents loaded into the z = w × s mod m enclave. = (s2 mod m) × s mod m SGX also offers a certificate-based identity system that = s2 × s mod m can be used to migrate secrets between enclaves that have certificates issued by the same authority. The migration 3 = s mod m process involves securing the secrets via authenticated encryption before handing them off to the untrusted sys- 6.5.3 Implementation Requirements tem software, which passes them to another enclave that The main advantage of the algorithm in Figure 93 is that can decrypt them. it relies on the implementation of very few arithmetic The same mechanism used for secret migration can operations on large integers. The maximum integer size also be used to cache the secrets obtained via software that needs to be handled is twice the size of the modulus attestation in an untrusted storage medium managed by in the RSA key used to generate the signature. system software. This caching can reduce the number Steps 1 and 5 use large integer multiplication. Steps of times that the software attestation process needs to 3 and 7 use integer subtraction. Steps 2, 4, 6, and 8 use be performed in a distributed system. In fact, SGX’s large integer comparison. The checks in steps 2 and 6 software attestation process is implemented by enclaves guarantee that the results of the subtractions performed with special privileges that use the certificate-based iden- in steps 3 and 7 will be non-negative. It follows that the tity system to securely store the CPU’s attestation key in algorithm will never encounter negative numbers. untrusted memory. 6.6 SGX Security Properties 6.6.2 Physical Attacks We have summarized SGX’s programming model and the implementation details that are publicly documented We begin by discussing SGX’s resilience to the physical in Intel’s official documentation and published patents. attacks described in § 3.4. Unfortunately, this section We are now ready to bring this the information together is set to disappoint readers expecting definitive state- in an analysis of SGX’s security properties. We start ments. The lack of publicly available details around the the analysis by restating SGX’s security guarantees, and hardware implementation aspects of SGX precludes any spend the bulk of this section discussing how SGX fares rigorous analysis. However, we do know enough about when pitted against the attacks described in § 3. We SGX’s implementation to point out a few avenues for conclude the analysis with some troubling implications future exploration. of SGX’s lack of resistance to software side-channel Due to insufficient documentation, one can only hope attacks. that the SGX security model is not trivially circum- vented by a port attack (§ 3.4.1). 
We are particularly 6.6.1 Overview concerned about the Generic Debug eXternal Connec- Intel’s Software Guard Extensions (SGX) is Intel’s latest tion (GDXC) [126, 199], which collects and filters the iteration of a trusted hardware solution to the secure re- data transferred by the uncore’s ring bus (§ 2.11.3), and mote computation problem. The SGX design is centered reports it to an external debugger. around the ability to create an isolated container whose The SGX memory protection measures are imple- contents receives special hardware protections that are mented at the core level, in the Page Miss Han- intended to translate into confidentiality, integrity, and dler (PMH, § 2.11.5) (§ 6.2) and at the chip die level, freshness guarantees. in the memory controller (§ 6.1.2). Therefore, the code An enclave’s initial contents is loaded by the system and data inside enclaves is stored in plaintext in on-chip software on the computer, and therefore cannot contain caches (§ 2.11), which entails that the enclave contents

104 travels without any cryptographic protection on the un- For example, the original SGX patents [110, 138] dis- core’s ring bus (§ 2.11.3). close that the Fused Seal Key and the Provisioning Key, Fortunately, a recent Intel patent [167] indicates that which are stored in e-fuses (§ 5.8.2), are encrypted with Intel engineers are tackling at least some classes of at- a global wrapping logic key (GWK). The GWK is a tacks targeting debugging ports. 128-bit AES key that is hard-coded in the processor’s The SDM and SGX papers discuss the most obvi- circuitry, and serves to increase the cost of extracting the ous class of bus tapping attacks (§ 3.4.2), which is the keys from an SGX-enabled processor. DRAM bus tapping attack. SGX’s threat model con- As explained in § 3.4.3, e-fuses have a large feature siders DRAM and the bus connecting it to the CPU size, which makes them relatively easy to “read” using a chip to be untrusted. Therefore, SGX’s Memory En- high-resolution microscope. In comparison, the circuitry cryption Engine (MEE, § 6.1.2) provides confidentiality, on the latest Intel processors has a significantly smaller integrity and freshness guarantees to the Enclave Page feature size, and is more difficult to reverse engineer. Cache (EPC, § 5.1.1) data while it is stored in DRAM. Unfortunately, the GWK is shared among all the chip dies However, both the SGX papers and the ISCA 2015 created from the same mask, so it has all the drawbacks tutorial on SGX admit that the MEE does not protect the of global secrets explained in § 3.4.3. addresses of the DRAM locations accessed when cache Newer Intel patents [67, 68] describe SGX-enabled lines holding EPC data are evicted or loaded. This pro- processors that employ a Physical Unclonable Func- vides an opportunity for a malicious computer owner to tion (PUF), e.g., [175], [133], which generates a symmet- observe an enclave’s memory access patterns by combin- ric key that is used during the provisioning process. ing a DRAM address line bus tap with carefully crafted Specifically, at an early provisioning stage, the PUF system software that creates artificial pressure on the last- key is encrypted with the GWK and transmitted to the level cache (LLC ,§ 2.11) lines that hold the enclave’s key generation server. At a later stage, the key generation EPC pages. server encrypts the key material that will be burned into On a brighter note, as mentioned in § 3.4.2, we are not the processor chip’s e-fuses with the PUF key, and trans- aware of any successful DRAM address line bus tapping mits the encrypted material to the chip. The PUF key attack. Furthermore, SGX is vulnerable to cache timing increases the cost of obtaining a chip’s fuse key material, attacks that can be carried out completely in software, so as an attacker must compromise both provisioning stages malicious computer owners do not need to bother setting in order to be able to decrypt the fuse key material. up a physical attack to obtain an enclave’s memory access As mentioned in previous sections, patents reveal de- patterns. sign possibilities considered by the SGX engineers. How- While the SGX documentation addresses DRAM bus ever, due to the length of timelines involved in patent ap- tapping attacks, it makes no mention of the System Man- plications, patents necessarily describe earlier versions of agement bus (SMBus, § 2.9.2) that connects the Intel the SGX implementation plans, which might not match Management Engine (ME, § 2.9.2) to various compo- the shipping implementation. 
We expect this might be nents on the computer’s motherboard. the case with the PUF provisioning patents, as it makes In § 6.6.5, we will explain that the ME needs to be little sense to include a PUF in a chip die and rely on taken into account when evaluating SGX’s memory pro- e-fuses and a GWK to store SGX’s root keys. Deriving tection guarantees. This makes us concerned about the the root keys from the PUF would be more resilient to possibility of an attack that taps the SMBus to reach into chip imaging attacks. the Intel ME. The SMBus is much more accessible than SGX’s threat model excludes power analysis at- the DRAM bus, as it has fewer wires that operate at a tacks (§ 3.4.4) and other side-channel attacks. This is significantly lower speed. Unfortunately, without more understandable, as power attacks cannot be addressed at information about the role that the Intel ME plays in a the architectural level. Defending against power attacks computer, we cannot move beyond speculation on this requires expensive countermeasures at the lowest levels topic. of hardware implementation, which can only be designed The threat model stated by the SGX design excludes by engineers who have deep expertise in both system se- physical attacks targeting the CPU chip (§ 3.4.3). Fortu- curity and Intel’s manufacturing process. It follows that nately, Intel’s patents disclose an array of countermea- defending against power analysis attacks has a very high sures aimed at increasing the cost of chip attacks. cost-to-benefit ratio.

105 6.6.3 Privileged Software Attacks not access any enclave secrets that may be stored in the execution state. The SGX threat model considers system software to be The protections described above apply to the all the untrusted. This is a prerequisite for SGX to qualify as levels of privileged software. SGX’s transitions between a solution to the secure remote computation problem an enclave’s code and non-enclave code place SMM encountered by software developers who wish to take ad- software on the same footing as the system software vantage of Infrastructure-as-a-Service (IaaS) cloud com- at lower privilege levels. System Management Inter- puting. rupts (SMI, § 2.12, § 3.5), which cause the processor to SGX’s approach is also an acknowledgement of the execute SMM code, are handled using the same Asyn- realities of today’s software landscape, where the sys- chronous Enclave Exit (AEX, § 5.4.3) process as all other tem software that runs at high privilege levels (§ 2.3) hardware exceptions. is so complex that security researchers constantly find Reasoning about the security properties of SGX’s tran- vulnerabilities in it (§ 3.5). sitions between enclave mode and non-enclave mode is The SGX design prevents malicious software from very difficult. A correctness proof would have to take directly reading or from modifying the EPC pages that into account all the CPU’s features that expose registers. store an enclave’s code and data. This security property Difficulty aside, such a proof would be very short-lived, relies on two pillars in the SGX design. because every generation of Intel CPUs tends to intro- First, the SGX implementation (§ 6.1) runs in the pro- duce new architectural features. The paragraph below cessor’s microcode (§ 2.14), which is effectively a higher gives a taste of what such a proof would look like. privilege level that system software does not have access EENTER (§ 5.4.1) stores the RSP and RBP register to. Along the same lines, SGX’s security checks (§ 6.2) values in the SSA used to enter the enclave, but stores are the last step performed by the PMH, so they cannot XCR0 (§ 2.6), FS and GS (§ 2.7) in the non-architectural be bypassed by any other architectural feature. area of the TCS (§ 6.1.3). At first glance, it may seem This implementation detail is only briefly mentioned elegant to remove this inconsistency and have EENTER in SGX’s official documentation, but has a large impact store the contents of the XCR0, FS, and GS registers on security. For context, Intel’s Trusted Execution Tech- in the current SSA, along with RSP and RBP. However, nology (TXT, [70]), which is the predecessor of SGX, this approach would break the Intel architecture’s guar- relied on Intel’s Virtual Machine Extensions (VMX) for antees that only system software can modify XCR0, and isolation. The approach was unsound, because software application software can only load segment registers us- running in System Management Mode (SMM, § 2.3) ing selectors that index into the GDT or LDT set up by could bypass the restrictions used by VMX to provide system software. Specifically, a malicious application isolation. could modify these privileged registers by creating an The security properties of SGX’s memory protection enclave that writes the desired values to the current SSA mechanisms are discussed in detail in § 6.6.4. locations backing up the registers, and then executes Second, SGX’s microcode is always involved when a EEXIT (§ 5.4.2). 
CPU transitions between enclave code and non-enclave Unfortunately, the following sections will reveal that code (§ 5.4), and therefore regulates all interactions be- while SGX offers rather thorough guarantees against tween system software and an enclave’s environment. straightforward attacks on enclaves, its guarantees are On enclave entry (§ 5.4.1), the SGX implementation almost non-existent when it comes to more sophisticated sets up the registers (§ 2.2) that make up the execution attacks, such as side-channel attacks. This section con- state (§ 2.6) of the logical processor (LP § 2.9.4), so cludes by describing what might be the most egregious a malicious OS or hypervisor cannot induce faults in side-channel vulnerability in SGX. the enclave’s software by tampering with its execution Most modern Intel processors feature hyper-threading. environment. On these CPUs, the execution units (§ 2.10) and When an LP transitions away from an enclave’s code caches (§ 2.11) on a core (§ 2.9.4) are shared by two due to a hardware exception (§ 2.8.2), the SGX imple- LPs, each of which has its own execution state. SGX mentation stashes the LP’s execution state into a State does not prevent hyper-threading, so malicious system Save Area (SSA, § 5.2.5) area inside the enclave and software can schedule a thread executing the code of a scrubs it, so the system software’s exception handler can- victim enclave on an LP that shares the core with an LP

106 executing a snooping thread. This snooping thread can tographic primitives that offer confidentiality, integrity use the processor’s high-resolution performance counter and freshness guarantees. This protects against the active [152], in conjunction with microarchitectural knowledge attacks using page swapping described in § 3.7.3. of the CPU’s execution units and out-of-order scheduler, When system software wishes to evict EPC pages, to learn the instructions executed by the victim enclave, it must follow the process described in § 5.5.1, which as well as its memory access patterns. guarantees to the SGX implementation that all the LPs This vulnerability can be fixed using two approaches. have invalidated any TLB entry associated with pages The straightforward solution is to require cloud comput- that will be evicted. This defeats the active attacks based ing providers to disable hyper-threading when offering on stale TLB entries described in § 3.7.4. SGX. The SGX enclave measurement would have to § 6.3 outlines a correctness proof for the memory pro- be extended to include the computer’s hyper-threading tection measures described above. configuration, so the remote parties in the software at- Unfortunately, SGX does not protect against passive testation process can be assured that their enclaves are address translation attacks (§ 3.7.1), which can be used hosted by a secure environment. to learn an enclave’s memory access pattern at page gran- A more complex approach to fixing the hyper- ularity. While this appears benign, recent work [195] threading vulnerability would entail having the SGX demonstrates the use of these passive attacks in a few implementation guarantee that when an LP is executing practical settings, which are immediately concerning for an enclave’s code, the other LP sharing its core is either image processing applications. inactive, or is executing the same enclave’s code. While The rest of this section describes the theory behind this approach is possible, its design would likely be quite planning a passive attack against an SGX enclave. The cumbersome. reader is directed to [195] for a fully working system. Passive address translation attacks rely on the fact that 6.6.4 Memory Mapping Attacks memory accesses issued by SGX enclaves go through § 5.4 explained that the code running inside an enclave the Intel architecture’s address translation process (§ 2.5), uses the same address translation process (§ 2.5) and including delivering page faults (§ 2.8.2) and setting the page tables as its host application. While this design accessed (A) and dirty (D) attributes (§ 2.5.3) on page approach makes it easy to retrofit SGX support into ex- table entries. isting codebases, it also enables the address translation A malicious OS kernel or hypervisor can obtain the attacks described in § 3.7. page-level trace of an application executing inside an The SGX design protects the code inside enclaves enclave by setting the present (P) attribute to 0 on all against the active attacks described in § 3.7. These pro- the enclave’s pages before starting enclave execution. tections have been extensively discussed in prior sections, While an enclave executes, the malicious system software so we limit ourselves to pointing out SGX’s answer to maintains exactly one instruction page and one data page each active attack. We also explain the lack of protec- present in the enclave’s address space. 
tions against passive attacks, which can be used to learn When a page fault is generated, CR2 contains the an enclave’s memory access pattern at 4KB page granu- virtual address of a page accessed by enclave, and the larity. error code indicates whether the memory access was a SGX uses the Enclave Page Cache read or a write (bit 1) and whether the memory access is Map (EPCM, § 5.1.2) to store each EPC page’s a data access or an instruction fetch access (bit 4). On a position in its enclave’s virtual address space. The data access, the kernel tracing the enclave code’s memory EPCM is consulted by SGX’s extensions to the Page access pattern would set the P flag of the desired page to Miss Handler (PMH, § 6.2.1), which prevent straight- 1, and set the P flag of the previously accessed data page forward active address translation attacks (§ 3.7.2) by to 0. Instruction accesses can be handled in a similar rejecting undesirable address translations before they manner. reach the TLB (§ 2.11.5). For a slightly more detailed trace, the kernel can set SGX allows system software to evict (§ 5.5) EPC a desired page’s writable (W) attribute to 0 if the page pages into untrusted DRAM, so that the EPC can be fault’s error code indicates a read access, and only set over-subscribed. The contents of the evicted pages and it to 1 for write accesses. Also, applications that use the associated EPCM metadata are protected by cryp- a page as both code and data (self-modifying code and

107 just-in-time compiling VMs) can be handled by setting a Based Sampling (PEBS) for the LP, as well as any hard- page’s disable execution (XD) flag to 0 for a data access, ware breakpoints placed inside the enclave’s virtual ad- and by carefully accounting for the case where the last dress range (ELRANGE, § 5.2.1). This addresses some accessed data page is the same as the last accessed code of the attacks described in § 3.6.3, which take advantage page. of performance monitoring features to get information Leaving an enclave via an Asynchronous Enclave that typically requires access to hardware probes. Exit (AEX, § 5.4.3) and re-entering the enclave via At the same time, the SDM does not mention any- ERESUME (§ 5.4.4) causes the CPU to flush TLB en- thing about uncore PEBS counters, which can be used tries that contain enclave addresses, so a tracing kernel to learn about an enclave’s LLC activity. Furthermore, would not need to worry about flushing the TLB. The the ISCA 2015 tutorial slides mention that SGX does tracing kernel does not need to flush the caches either, not protect against software side-channel attacks that because the CPU needs to perform address translation rely on performance counters. even for cached data. This limitation in SGX’s threat model leaves security- A straightforward way to reduce this attack’s power conscious enclave authors in a rather terrible situation. is to increase the page size, so the trace contains less These authors know that SGX does not protect their information. However, the attack cannot be completely enclaves against a class of software attacks. At the same prevented without removing the kernel’s ability to over- time, they cannot even contemplate attempting to defeat subscribe the EPC, which is a major benefit of paging. these attacks on their own, due to lack of information. Specifically, the documentation that is publicly available 6.6.5 Software Attacks on Peripherals from Intel does not provide enough information to model Since the SGX design does not trust the system software, the information leakage due to performance counters. it must be prepared to withstand the attacks described in For example, Intel does not document the mapping § 3.6, which can be carried out by the system software implemented in CBoxes (§ 2.11.3) between physical thanks to its ability to control peripheral devices on the DRAM addresses and the LLC slices used to cache the computer’s motherboard (§ 2.9.1). This section summa- addresses. This mapping impacts several uncore per- rizes the security properties of SGX when faced with formance counters, and the impact is strong enough to these attacks, based on publicly available information. allow security researches to reverse-engineer the map- When SGX is enabled on an LP, it configures the mem- ping [85, 135, 197]. Therefore, it is safe to assume that ory controller (MC, § 2.11.3) integrated on the CPU chip a malicious computer owner who knows the CBox map- die to reject any DMA transfer that falls within the Pro- ping can use the uncore performance counters to learn cessor Reserved Memory (PRM, § 5.1) range. The PRM about an enclave’s memory access patterns. includes the EPC, so the enclaves’ contents is protected The SGX papers mention that SGX’s threat model from the PCI Express attacks described in § 3.6.1. 
6.6.5 Software Attacks on Peripherals

Since the SGX design does not trust the system software, it must be prepared to withstand the attacks described in § 3.6, which can be carried out by the system software thanks to its ability to control peripheral devices on the computer's motherboard (§ 2.9.1). This section summarizes the security properties of SGX when faced with these attacks, based on publicly available information.

When SGX is enabled on an LP, it configures the memory controller (MC, § 2.11.3) integrated on the CPU chip die to reject any DMA transfer that falls within the Processor Reserved Memory (PRM, § 5.1) range. The PRM includes the EPC, so the enclaves' contents are protected from the PCI Express attacks described in § 3.6.1. This protection guarantee relies on the fact that the MC is integrated on the processor's chip die, so the MC configuration commands issued by SGX's microcode implementation (§ 6.1.3) are transmitted over a communication path that never leaves the CPU die, and therefore can be trusted.

SGX regards DRAM as an untrusted storage medium, and uses cryptographic primitives implemented in the MEE to guarantee the confidentiality, integrity and freshness of the EPC contents that is stored into DRAM. This protects against software attacks on DRAM's integrity, like the rowhammer attack described in § 3.6.2.

The SDM describes an array of measures that SGX takes to disable processor features intended for debugging when an LP starts executing an enclave's code. For example, enclave entry (§ 5.4.1) disables Precise Event Based Sampling (PEBS) for the LP, as well as any hardware breakpoints placed inside the enclave's virtual address range (ELRANGE, § 5.2.1). This addresses some of the attacks described in § 3.6.3, which take advantage of performance monitoring features to get information that typically requires access to hardware probes.

At the same time, the SDM does not mention anything about uncore PEBS counters, which can be used to learn about an enclave's LLC activity. Furthermore, the ISCA 2015 tutorial slides mention that SGX does not protect against software side-channel attacks that rely on performance counters.

This limitation in SGX's threat model leaves security-conscious enclave authors in a rather terrible situation. These authors know that SGX does not protect their enclaves against a class of software attacks. At the same time, they cannot even contemplate attempting to defeat these attacks on their own, due to lack of information. Specifically, the documentation that is publicly available from Intel does not provide enough information to model the information leakage due to performance counters.

For example, Intel does not document the mapping implemented in CBoxes (§ 2.11.3) between physical DRAM addresses and the LLC slices used to cache the addresses. This mapping impacts several uncore performance counters, and the impact is strong enough to allow security researchers to reverse-engineer the mapping [85, 135, 197]. Therefore, it is safe to assume that a malicious computer owner who knows the CBox mapping can use the uncore performance counters to learn about an enclave's memory access patterns.

The SGX papers mention that SGX's threat model includes attacks that overwrite the flash memory chip that stores the computer's firmware, which result in malicious code running in SMM. However, all the official SGX documentation is silent about the implications of an attack that compromises the firmware executed by the Intel ME.

§ 3.6.4 states that the ME's firmware is stored in the same flash memory as the boot firmware, and enumerates some of the ME's special privileges that enable it to help system administrators remotely diagnose and fix hardware and software issues. Given that the SGX design is concerned about the possibility of malicious computer firmware, it is reasonable to be concerned about malicious ME firmware.

§ 3.6.4 argues that an attacker who compromises the ME can carry out actions that are usually classified as physical attacks. An optimistic security researcher can
observe that the most scary attack vector afforded by an ME takeover appears to be direct DRAM access, and SGX already assumes that the DRAM is untrusted. Therefore, an ME compromise would be equivalent to the DRAM attacks analyzed in § 6.6.2.

However, we are troubled by the lack of documentation on the ME's implementation, as certain details are critical to SGX's security analysis. For example, the ME is involved in the computer's boot process (§ 2.13, § 2.14.4), so it is unclear if it plays any part in the SGX initialization sequence. Furthermore, during the security boot stage (SEC, § 2.13.2), the bootstrap LP (BSP) is placed in Cache-As-Ram (CAR) mode so that the PEI firmware can be stored securely while it is measured. This suggests that it would be convenient for the ME to receive direct access to the CPU's caches, so that the ME's TPM implementation can measure the firmware directly. At the same time, a special access path from the ME to the CPU's caches might sidestep the MEE, allowing an attacker who has achieved ME code execution to directly read the EPC's contents.

6.6.6 Cache Timing Attacks

The SGX threat model excludes the cache timing attacks described in § 3.8. The SGX documentation bundles these attacks together with other side-channel attacks and summarily dismisses them as complex physical attacks. However, cache timing attacks can be mounted entirely by unprivileged software running at ring 3. This section describes the implications of SGX's environment and threat model on cache timing attacks.

The main difference between SGX and a standard architecture is that SGX's threat model considers the system software to be untrusted. As explained earlier, this accurately captures the situation in remote computation scenarios, such as cloud computing. SGX's threat model implies that the system software can be carrying out a cache timing attack on the software inside an enclave.

A malicious system software translates into significantly more powerful cache timing attacks, compared to those described in § 3.8. The system software is in charge of scheduling threads on LPs, and also in charge of setting up the page tables used by address translation (§ 2.5), which control cache placement (§ 2.11.5).

For example, the malicious kernel set out to trace an enclave's memory access patterns, as described in § 6.6.4, can improve the accuracy of a cache timing attack by using page coloring [117] principles to partition [129] the cache targeted by the attack. In a nutshell, the kernel divides the cache's sets (§ 2.11.2) into two regions, as shown in Figure 94.

[Figure 94 diagram: DRAM pages belonging to the OS and to the enclave map to disjoint groups of cache sets; the drawing labels "OS", "Enclave", "RAM", "Cache", "Cache Line", and "Page".]

Figure 94: A malicious OS can partition a cache between the software running inside an enclave and its own malicious code. Both the OS and the enclave software have cache sets dedicated to them. When allocating DRAM to itself and to the enclave software, the malicious OS is careful to only use DRAM regions that map to the appropriate cache sets. On a system with an Intel CPU, the OS can partition the L2 cache by manipulating the page tables in a way that is completely oblivious to the enclave's software.

The system software stores all the victim enclave's code and data in DRAM addresses that map to the cache sets in one of the regions, and stores its own code and data in DRAM addresses that map to the other region's cache sets. The snooping thread's code is assumed to be a part of the OS. For example, in a typical 256 KB (per-core) L2 cache organized as 512 8-way sets of 64-byte lines, the tracing kernel could allocate sets 0-63 for the enclave's code page, sets 64-127 for the enclave's data page, and use sets 128-511 for its own pages.
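The cache geometry in the example above makes the partitioning arithmetic easy to check. The sketch below computes, for the assumed 256 KB, 8-way, 64-byte-line L2 cache, which group of 64 sets (the page "color") a physical page frame maps to; the allocation policy in main is a hypothetical illustration of how a kernel could keep enclave pages and its own pages in disjoint set regions purely by choosing physical frames.

#include <stdint.h>
#include <stdio.h>

/* Geometry from the example above: 256 KB, 8-way, 64-byte lines => 512 sets. */
#define LINE_SIZE   64u
#define NUM_WAYS    8u
#define CACHE_SIZE  (256u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 512 */
#define PAGE_SIZE   4096u

/* Set index of a physical address in a physically indexed cache. */
static unsigned set_index(uint64_t paddr) {
    return (unsigned)((paddr / LINE_SIZE) % NUM_SETS);
}

/* A 4 KB page covers PAGE_SIZE / LINE_SIZE = 64 lines, which land in 64
 * consecutive sets.  The "color" of a physical page frame is the group of
 * 64 sets it maps to; with 512 sets there are 512 / 64 = 8 colors. */
static unsigned page_color(uint64_t paddr) {
    return set_index(paddr) / (PAGE_SIZE / LINE_SIZE);
}

int main(void) {
    /* Hypothetical allocation policy: frames with color 0 back the enclave's
     * code page (sets 0-63), color 1 its data page (sets 64-127), and every
     * other color stays with the kernel (sets 128-511). */
    for (uint64_t frame = 0; frame < 16; frame++) {
        uint64_t paddr = frame * PAGE_SIZE;
        unsigned color = page_color(paddr);
        const char *owner = (color == 0) ? "enclave code"
                          : (color == 1) ? "enclave data"
                          : "kernel";
        printf("frame %2llu -> sets %3u-%3u (color %u, %s)\n",
               (unsigned long long)frame,
               color * 64u, color * 64u + 63u, color, owner);
    }
    return 0;
}

Because the color is determined purely by physical address bits, the kernel can enforce the partition through frame allocation alone, which is why the caption describes the page table manipulation as completely oblivious to the enclave's software.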

To the best of our knowledge, there is no minor modification to SGX that would provably defend against cache timing attacks. However, the SGX design could take a few steps to increase the cost of cache timing attacks. For example, SGX's enclave entry implementation could flush the core's private caches, which would prevent cache timing attacks from targeting them. This measure would defeat the cache timing attacks described below,
and would only be vulnerable to more sophisticated attacks that target the shared LLC, such as [131, 196]. The description above assumes that hyper-threading has been disabled, for the reasons explained in § 6.6.3.

Barring the additional protection measures described above, a tracing kernel can extend the attack described in § 6.6.4 with the steps outlined below to take advantage of cache timing and narrow down the addresses in an application's memory access trace to cache line granularity.

Right before entering an enclave via EENTER or ERESUME, the kernel would issue CLFLUSH instructions to flush the enclave's code page and data page from the cache. The enclave could have accessed a single code page and a single data page, so flushing the cache should be reasonably efficient. The tracing kernel then uses 16 bogus pages (8 for the enclave's code page, and 8 for the enclave's data page) to load all the 8 ways in the 128 cache sets allocated to enclave pages. After an AEX gives control back to the tracing kernel, it can read the 16 bogus pages, and exploit the time difference between an L2 cache hit and a miss to see which cache lines were evicted and replaced by the enclave's memory accesses.

An extreme approach that can provably defeat cache timing attacks is disabling caching for the PRM range, which contains the EPC. The SDM is almost completely silent about the PRM, but the SGX manuals that it is based on state that the allowable caching behaviors (§ 2.11.4) for the PRM range are uncacheable (UC) and write-back (WB). This could become useful if the SGX implementation would make sure that the PRM's caching behavior cannot be changed while SGX is enabled, and if the selected behavior would be captured by the enclave's measurement (§ 5.6).
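The measurement step of the attack outlined above can be illustrated with user-mode timing code. The sketch below is a stand-alone model under stated assumptions: the bogus-line buffer, the 150-cycle threshold, and the use of the __rdtscp and _mm_mfence intrinsics are illustrative choices, and in the real attack the equivalent logic would run in the kernel around EENTER/ERESUME and the subsequent AEX, against physical frames chosen to alias the enclave's pages.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_mfence, __rdtscp (a real attack also uses _mm_clflush) */

#define LINE_SIZE 64
#define NUM_SETS_PROBED 128        /* the 128 sets assigned to enclave pages */
#define NUM_WAYS 8

/* Stand-in for the 16 "bogus pages": 8 ways' worth of lines for each of the
 * 128 monitored sets.  A real attacker picks physical frames so that these
 * lines map exactly onto the cache sets used by the enclave's pages. */
static uint8_t bogus[NUM_WAYS][NUM_SETS_PROBED * LINE_SIZE];

/* Time one load, in cycles, with fences so the load is actually measured. */
static uint64_t time_load(volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    uint8_t v = *p;
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    (void)v;
    return end - start;
}

int main(void) {
    const uint64_t miss_threshold = 150;   /* illustrative; calibrate per CPU */

    /* 1. Flush the enclave's code and data pages (placeholder: a kernel-mode
     *    attacker would _mm_clflush every line of those two pages here). */

    /* 2. Prime: load all 8 ways of every monitored set with bogus lines. */
    for (int way = 0; way < NUM_WAYS; way++)
        for (int set = 0; set < NUM_SETS_PROBED; set++)
            (void)*(volatile uint8_t *)&bogus[way][set * LINE_SIZE];

    /* 3. Enter the enclave (EENTER/ERESUME) and wait for the next AEX.
     *    Nothing runs here in this standalone sketch, so step 4 will report
     *    every set as intact; with a victim, its accesses evict bogus lines. */

    /* 4. Probe: a slow reload means a bogus line was evicted, i.e. the
     *    enclave touched an address mapping to that cache set. */
    for (int set = 0; set < NUM_SETS_PROBED; set++) {
        uint64_t worst = 0;
        for (int way = 0; way < NUM_WAYS; way++) {
            uint64_t t = time_load(&bogus[way][set * LINE_SIZE]);
            if (t > worst)
                worst = t;
        }
        printf("set %3d: %s (%llu cycles)\n", set,
               worst > miss_threshold ? "evicted" : "intact",
               (unsigned long long)worst);
    }
    return 0;
}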
6.6.7 Software Side-Channel Attacks and SGX

The SGX design reuses a few terms from the Trusted Platform Module (TPM, § 4.4) design. This helps software developers familiar with TPM understand SGX faster. At the same time, the term reuse invites the assumption that SGX's software attestation is implemented in tamper-resistant hardware, similarly to the TPM design.

§ 5.8 explains that, in fact, the SGX design delegates the creation of attestation signatures to software that runs inside a Quoting Enclave with special privileges that allow it to access the processor's attestation key. Restated, SGX includes an enclave whose software reads the attestation key and produces attestation signatures.

Creating the Quoting Enclave is a very elegant way of reducing the complexity of the hardware implementation of SGX, assuming that the isolation guarantees provided by SGX are sufficient to protect the attestation key. However, the security analysis in § 6.6 reveals that enclaves are vulnerable to a vast array of software side-channel attacks, which have been demonstrated effective in extracting a variety of secrets from isolated environments.

The gaps in the security guarantees provided to enclaves place a large amount of pressure on Intel's software developers, as they must attempt to implement the EPID signing scheme used by software attestation without leaking any information. Intel's ISCA 2015 SGX tutorial slides suggest that the SGX designers will advise developers to write their code in a way that avoids data-dependent memory accesses, as suggested in § 3.8.4, and perhaps provide analysis tools that detect code that performs data-dependent memory accesses.

The main drawback of the approach described above is that it is extremely cumbersome. § 3.8.4 describes that, while it may be possible to write simple pieces of software in such a way that they do not require data-dependent memory accesses, there is no known process that can scale this to large software systems. For example, each virtual method call in an object-oriented language results in data-dependent code fetches.

The ISCA 2015 SGX tutorial slides also suggest that the efforts of removing data-dependent memory accesses should focus on cryptographic algorithm implementations, in order to protect the keys that they handle. This is a terribly misguided suggestion, because cryptographic key material has no intrinsic value. Attackers derive benefits from obtaining the data that is protected by the keys, such as medical and financial records.

Some security researchers focus on protecting cryptographic keys because they are the target of today's attacks. Unfortunately, it is easy to lose track of the fact that keys are being attacked simply because they are the lowest hanging fruit. A system that can only protect the keys will have a very small positive impact, as the attackers will simply shift their focus to the algorithms that process the valuable information, and use the same software side-channel attacks to obtain that information directly.

The second drawback of the approach described towards the beginning of this section is that while eliminating data-dependent memory accesses should thwart the attacks described in § 6.6.4 and § 6.6.6, the measure may not be sufficient to prevent the hyper-threading attacks described in § 6.6.3. The level of sharing between the two logical processors (LP, § 2.9.4) on the same CPU
core is so high that it is possible that a snooping LP can learn more than the memory access pattern from the other LP on the same core.

For example, if the number of cycles taken by an integer ALU to execute a multiplication or division micro-op (§ 2.10) depends on its inputs, the snooping LP could learn some information about the numbers multiplied or divided by the other LP. While this may be a simple example, it is safe to assume that the Quoting Enclave will be studied by many motivated attackers, and that any information leak will be exploited.
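Although the advice discussed above is aimed at Intel's own enclave code, the coding style it alludes to can be illustrated in a few lines. The sketch below contrasts a secret-indexed table lookup with a data-oblivious variant that touches every entry and selects the result with a branch-free mask; the table contents and the "secret index" are made up for illustration, and this is one possible rendering of the style rather than Intel's recommended implementation. It only addresses address-based leaks like those in § 6.6.4 and § 6.6.6, and, as noted above, does nothing about hyper-threading or performance-counter channels.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Data-dependent lookup: the cache line and page touched (and hence the
 * information exposed through the attacks in 6.6.4 and 6.6.6) depend on
 * the secret index. */
static uint32_t lookup_leaky(const uint32_t *table, size_t n, size_t idx) {
    (void)n;
    return table[idx];
}

/* Data-oblivious lookup: every entry is read on every call, and the wanted
 * value is selected with a branch-free mask, so the memory access pattern
 * and the branch pattern are independent of the secret index. */
static uint32_t lookup_oblivious(const uint32_t *table, size_t n, size_t idx) {
    uint32_t result = 0;
    for (size_t i = 0; i < n; i++) {
        /* mask is all-ones when i == idx, all-zeroes otherwise. */
        uint32_t mask = (uint32_t)0 - (uint32_t)(i == idx);
        result |= table[i] & mask;
    }
    return result;
}

int main(void) {
    uint32_t sbox[16] = { 7, 12, 1, 9, 0, 5, 14, 3, 11, 2, 8, 15, 6, 13, 4, 10 };
    size_t secret_index = 11;   /* stands in for key- or data-derived state */

    printf("leaky:     %u\n", lookup_leaky(sbox, 16, secret_index));
    printf("oblivious: %u\n", lookup_oblivious(sbox, 16, secret_index));
    return 0;
}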
7 CONCLUSION

Shortly after we learned about Intel's Software Guard Extensions (SGX) initiative, we set out to study it in the hope of finding a practical solution to its vulnerability to cache timing attacks. After reading the official SGX manuals, we were left with more questions than when we started. The SGX patents filled some of the gaps in the official documentation, but also revealed Intel's enclave licensing scheme, which has troubling implications.

After learning about the SGX implementation and inferring its design constraints, we discarded our draft proposals for defending enclave software against cache timing attacks. We concluded that it would be impossible to claim to provide this kind of guarantee given the design constraints and all the unknowns surrounding the SGX implementation. Instead, we applied the knowledge that we gained to design Sanctum [38], which is briefly described in § 4.9.

This paper describes our findings while studying SGX. We hope that it will help fellow researchers understand the breadth of issues that need to be considered before accepting a trusted hardware design as secure. We also hope that our work will prompt the research community to expect more openness from the vendors who ask us to trust their hardware.

8 ACKNOWLEDGEMENTS

Funding for this research was partially provided by the National Science Foundation under contract number CNS-1413920.

REFERENCES

[1] FIPS 140-2 Consolidated Validation Certificate No. 0003. 2011.
[2] IBM 4765 Cryptographic Coprocessor Security Module - Security Policy. Dec 2012.
[3] Sha1 deprecation policy. http://blogs.technet.com/b/pki/archive/2013/11/12/sha1-deprecation-policy.aspx, 2013. [Online; accessed 4-May-2015].
[4] 7-zip lzma benchmark: Intel haswell. http://www.7-cpu.com/cpu/Haswell.html, 2014. [Online; accessed 10-February-2015].
[5] Bios freedom status. https://puri.sm/posts/bios-freedom-status/, Nov 2014. [Online; accessed 2-Dec-2015].
[6] Gradually sunsetting sha-1. http://googleonlinesecurity.blogspot.com/2014/09/gradually-sunsetting-sha-1.html, 2014. [Online; accessed 4-May-2015].
[7] Ipc2 hardware specification. http://fit-pc.com/download/intense-pc2/documents/ipc2-hw-specification.pdf, Sep 2014. [Online; accessed 2-Dec-2015].
[8] Linux kernel: Cve security vulnerabilities, versions and detailed reports. http://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33, 2014. [Online; accessed 27-April-2015].
[9] Nist's policy on hash functions. http://csrc.nist.gov/groups/ST/hash/policy.html, 2014. [Online; accessed 4-May-2015].
[10] Xen: Cve security vulnerabilities, versions and detailed reports. http://www.cvedetails.com/product/23463/XEN-XEN.html?vendor_id=6276, 2014. [Online; accessed 27-April-2015].
[11] Xen project software overview. http://wiki.xen.org/wiki/Xen_Project_Software_Overview, 2015. [Online; accessed 27-April-2015].
[12] Seth Abraham. Time to revisit rep;movs - comment. https://software.intel.com/en-us/forums/topic/275765, Aug 2006. [Online; accessed 23-January-2015].
[13] Tiago Alves and Don Felton. Trustzone: Integrated hardware and software security. Information Quarterly, 3(4):18–24, 2004.
[14] Ittai Anati, Shay Gueron, Simon P Johnson, and Vincent R Scarlata. Innovative technology for cpu based attestation and sealing. In Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, HASP, volume 13, 2013.
[15] Ross Anderson. Security engineering: A guide to building dependable distributed systems. Wiley, 2001.
[16] Sebastian Anthony. Who actually develops linux? the answer might surprise you. http://www.extremetech.com/computing/175919-who-actually-develops-linux, 2014. [Online; accessed 27-April-2015].
[17] ARM Limited. AMBA® AXI Protocol, Mar 2004. Reference no. IHI 0022B, IHI 0024B, AR500-DA-10004.

[18] ARM Limited. ARM Security Technology - Building a Secure System using TrustZone® Technology, Apr 2009. Reference no. PRD29-GENC-009492C.
[19] Sebastian Banescu. Cache timing attacks. 2011. [Online; accessed 26-January-2014].
[20] Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. Recommendation for key management part 1: General (revision 3). Federal Information Processing Standards (FIPS) Special Publications (SP), 800-57, Jul 2012.
[21] Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. Secure hash standard (shs). Federal Information Processing Standards (FIPS) Publications (PUBS), 180-4, Aug 2015.
[22] Friedrich Beck. Integrated Circuit Failure Analysis: a Guide to Preparation Techniques. John Wiley & Sons, 1998.
[23] Daniel Bleichenbacher. Chosen ciphertext attacks against protocols based on the rsa encryption standard pkcs# 1. In Advances in Cryptology CRYPTO'98, pages 1–12. Springer, 1998.
[24] D.D. Boggs and S.D. Rodgers. Microprocessor with novel instruction for signaling event occurrence and for providing event handling information in response thereto, 1997. US Patent 5,625,788.
[25] Joseph Bonneau and Ilya Mironov. Cache-collision timing attacks against aes. In Cryptographic Hardware and Embedded Systems-CHES 2006, pages 201–215. Springer, 2006.
[26] Ernie Brickell and Jiangtao Li. Enhanced privacy id from bilinear pairing. IACR Cryptology ePrint Archive, 2009.
[27] Billy Bob Brumley and Nicola Tuveri. Remote timing attacks are still practical. In Computer Security – ESORICS 2011, pages 355–371. Springer, 2011.
[28] David Brumley and Dan Boneh. Remote timing attacks are practical. Computer Networks, 48(5):701–716, 2005.
[29] John Butterworth, Corey Kallenberg, Xeno Kovah, and Amy Herzog. Bios chronomancy: Fixing the core root of trust for measurement. In Proceedings of the 2013 ACM SIGSAC conference on Computer & Communications Security, pages 25–36. ACM, 2013.
[30] J Lawrence Carter and Mark N Wegman. Universal classes of hash functions. In Proceedings of the 9th annual ACM Symposium on Theory of Computing, pages 106–112. ACM, 1977.
[31] David Champagne and Ruby B Lee. Scalable architectural support for trusted software. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1–12. IEEE, 2010.
[32] Daming D Chen and Gail-Joon Ahn. Security analysis of x86 processor microcode. 2014. [Online; accessed 7-January-2015].
[33] Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, Nickolai Zeldovich, and M Frans Kaashoek. Linux kernel vulnerabilities: State-of-the-art defenses and open problems. In Proceedings of the Second Asia-Pacific Workshop on Systems, page 5. ACM, 2011.
[34] Lily Chen. Recommendation for key derivation using pseudorandom functions. Federal Information Processing Standards (FIPS) Special Publications (SP), 800-108, Oct 2009.
[35] coreboot. Developer manual, Sep 2014. [Online; accessed 4-March-2015].
[36] M.P. Cornaby and B. Chaffin. Microinstruction pointer stack including speculative pointers for out-of-order execution, 2007. US Patent 7,231,511.
[37] Intel Corporation. Intel® Xeon® Processor E5 v3 Family Uncore Performance Monitoring Reference Manual, Sep 2014. Reference no. 331051-001.
[38] Victor Costan, Ilia Lebedev, and Srinivas Devadas. Sanctum: Minimal hardware extensions for strong software isolation. Cryptology ePrint Archive, Report 2015/564, 2015.
[39] J. Daemen and V. Rijmen. Aes proposal: Rijndael, aes algorithm submission, Sep 1999.
[40] S.M. Datta and M.J. Kumar. Technique for providing secure firmware, 2013. US Patent 8,429,418.
[41] S.M. Datta, V.J. Zimmer, and M.A. Rothman. System and method for trusted early boot flow, 2010. US Patent 7,752,428.
[42] Pete Dice. Booting an intel architecture system, part i: Early initialization. Dr. Dobb's, Dec 2011. [Online; accessed 2-Dec-2015].
[43] Whitfield Diffie and Martin E Hellman. New directions in cryptography. Information Theory, IEEE Transactions on, 22(6):644–654, 1976.
[44] Loïc Duflot, Daniel Etiemble, and Olivier Grumelard. Using cpu system management mode to circumvent operating system security functions. CanSecWest/core06, 2006.
[45] Morris Dworkin. Recommendation for block cipher modes of operation: Methods and techniques. Federal Information Processing Standards (FIPS) Special Publications (SP), 800-38A, Dec 2001.
[46] Morris Dworkin. Recommendation for block cipher modes of operation: The cmac mode for authentication. Federal Information Processing Standards (FIPS) Special Publications (SP), 800-38B, May 2005.
[47] Morris Dworkin. Recommendation for block cipher modes of operation: Galois/counter mode (gcm) and gmac. Federal Information Processing Standards (FIPS) Special Publications (SP), 800-38D, Nov 2007.
[48] D. Eastlake and P. Jones. RFC 3174: US Secure Hash Algorithm 1 (SHA1). Internet RFCs, 2001.
[49] Shawn Embleton, Sherri Sparks, and Cliff C Zou. Smm rootkit: a new breed of os independent malware. Security and Communication Networks, 2010.

[50] Dmitry Evtyushkin, Jesse Elwell, Meltem Ozsoy, Dmitry Ponomarev, Nael Abu Ghazaleh, and Ryan Riley. Iso-x: A flexible architecture for hardware-managed isolated execution. In Microarchitecture (MICRO), 2014 47th annual IEEE/ACM International Symposium on, pages 190–202. IEEE, 2014.
[51] Niels Ferguson, Bruce Schneier, and Tadayoshi Kohno. Cryptography Engineering: Design Principles and Practical Applications. John Wiley & Sons, 2011.
[52] Christopher W Fletcher, Marten van Dijk, and Srinivas Devadas. A secure processor architecture for encrypted computation on untrusted programs. In Proceedings of the Seventh ACM Workshop on Scalable Trusted Computing, pages 3–8. ACM, 2012.
[53] Agner Fog. Instruction tables - lists of instruction latencies, throughputs and micro-operation breakdowns for intel, amd and via cpus. Dec 2014. [Online; accessed 23-January-2015].
[54] Andrew Furtak, Yuriy Bulygin, Oleksandr Bazhaniuk, John Loucaides, Alexander Matrosov, and Mikhail Gorobets. Bios and secure boot attacks uncovered. The 10th ekoparty Security Conference, 2014. [Online; accessed 22-October-2015].
[55] William Futral and James Greene. Intel® Trusted Execution Technology for Server Platforms. Apress Open, 2013.
[56] Blaise Gassend, Dwaine Clarke, Marten Van Dijk, and Srinivas Devadas. Silicon physical random functions. In Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 148–160. ACM, 2002.
[57] Blaise Gassend, G Edward Suh, Dwaine Clarke, Marten Van Dijk, and Srinivas Devadas. Caches and hash trees for efficient memory integrity verification. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, pages 295–306. IEEE, 2003.
[58] Daniel Genkin, Lev Pachmanov, Itamar Pipman, and Eran Tromer. Stealing keys from pcs using a radio: Cheap electromagnetic attacks on windowed exponentiation. Cryptology ePrint Archive, Report 2015/170, 2015.
[59] Daniel Genkin, Itamar Pipman, and Eran Tromer. Get your hands off my laptop: Physical side-channel key-extraction attacks on pcs. Cryptology ePrint Archive, Report 2014/626, 2014.
[60] Daniel Genkin, Adi Shamir, and Eran Tromer. Rsa key extraction via low-bandwidth acoustic cryptanalysis. Cryptology ePrint Archive, Report 2013/857, 2013.
[61] Craig Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.
[62] R.T. George, J.W. Brandt, K.S. Venkatraman, and S.P. Kim. Dynamically partitioning pipeline resources, 2009. US Patent 7,552,255.
[63] A. Glew, G. Hinton, and H. Akkary. Method and apparatus for performing page table walks in a microprocessor capable of processing speculative instructions, 1997. US Patent 5,680,565.
[64] A.F. Glew, H. Akkary, R.P. Colwell, G.J. Hinton, D.B. Papworth, and M.A. Fetterman. Method and apparatus for implementing a non-blocking translation lookaside buffer, 1996. US Patent 5,564,111.
[65] Oded Goldreich. Towards a theory of software protection and simulation by oblivious rams. In Proceedings of the 19th annual ACM symposium on Theory of Computing, pages 182–194. ACM, 1987.
[66] J.R. Goodman and H.H.J. Hum. Mesif: A two-hop cache coherency protocol for point-to-point interconnects. 2009.
[67] K.C. Gotze, G.M. Iovino, and J. Li. Secure provisioning of secret keys during integrated circuit manufacturing, 2014. US Patent App. 13/631,512.
[68] K.C. Gotze, J. Li, and G.M. Iovino. Fuse attestation to secure the provisioning of secret keys during integrated circuit manufacturing, 2014. US Patent 8,885,819.
[69] Joe Grand. Advanced hardware hacking techniques, Jul 2004.
[70] David Grawrock. Dynamics of a Trusted Platform: A building block approach. Intel Press, 2009.
[71] Trusted Computing Group. Tpm main specification. http://www.trustedcomputinggroup.org/resources/tpm_main_specification, 2003.
[72] Daniel Gruss, Clémentine Maurice, and Stefan Mangard. Rowhammer.js: A remote software-induced fault attack in javascript. CoRR, abs/1507.06955, 2015.
[73] Shay Gueron. Quick verification of rsa signatures. In 8th International Conference on Information Technology: New Generations (ITNG), pages 382–386. IEEE, 2011.
[74] Shay Gueron. A memory encryption engine suitable for general purpose processors. Cryptology ePrint Archive, Report 2016/204, 2016.
[75] Ben Hawkes. Security analysis of x86 processor microcode. 2012. [Online; accessed 7-January-2015].
[76] John L Hennessy and David A Patterson. Computer Architecture - a Quantitative Approach (5 ed.). Morgan Kaufmann, 2012.
[77] Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An aes smart card implementation resistant to power analysis attacks. In Applied cryptography and Network security, pages 239–252. Springer, 2006.
[78] G. Hildesheim, I. Anati, H. Shafi, S. Raikin, G. Gerzon, U.R. Savagaonkar, C.V. Rozas, F.X. McKeen, M.A. Goldsmith, and D. Prashant. Apparatus and method for page walk extension for enhanced security checks, 2014. US Patent App. 13/730,563.

[79] Matthew Hoekstra, Reshma Lal, Pradeep Pappachan, Vinay Phegade, and Juan Del Cuvillo. Using innovative instructions to create trustworthy software solutions. In Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, HASP, volume 13, 2013.
[80] Gael Hofemeier. Intel manageability firmware recovery agent. Mar 2013. [Online; accessed 2-Dec-2015].
[81] George Hotz. Ps3 glitch hack. 2010. [Online; accessed 7-January-2015].
[82] Andrew Huang. Hacking the Xbox: an Introduction to Reverse Engineering. No Starch Press, 2003.
[83] C.J. Hughes, Y.K. Chen, M. Bomb, J.W. Brandt, M.J. Buxton, M.J. Charney, S. Chennupaty, J. Corbal, M.G. Dixon, M.B. Girkar, et al. Gathering and scattering multiple data elements, 2013. US Patent 8,447,962.
[84] IEEE Computer Society. IEEE Standard for Ethernet, Dec 2012. IEEE Std. 802.3-2012.
[85] Mehmet Sinan Inci, Berk Gulmezoglu, Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. Seriously, get off my cloud! cross-vm rsa key recovery in a public cloud. Cryptology ePrint Archive, Report 2015/898, 2015.
[86] Intel Corporation. Intel® Processor Serial Number, Mar 1999. Order no. 245125-001.
[87] Intel Corporation. Intel® architecture Platform Basics, Sep 2010. Reference no. 324377.
[88] Intel Corporation. Intel® Core 2 Duo and Intel® Core 2 Solo Processor for Intel® Centrino® Duo Processor Technology Intel® Celeron® Processor 500 Series - Specification Update, Dec 2010. Reference no. 314079-026.
[89] Intel Corporation. Intel® Trusted Execution Technology (Intel® TXT) LAB Handout, 2010. [Online; accessed 2-July-2015].
[90] Intel Corporation. Intel® Xeon® Processor 7500 Series Uncore Programming Guide, Mar 2010. Reference no. 323535-001.
[91] Intel Corporation. An Introduction to the Intel® QuickPath Interconnect, Mar 2010. Reference no. 323535-001.
[92] Intel Corporation. Minimal Intel® Architecture Boot Loader - Bare Bones Functionality Required for Booting an Intel® Architecture Platform, Jan 2010. Reference no. 323246.
[93] Intel Corporation. Intel® 7 Series Family - Intel® Management Engine Firmware 8.1 - 1.5MB Firmware Bring Up Guide, Jul 2012. Revision 8.1.0.1248 - PV Release.
[94] Intel Corporation. Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide, Mar 2012. Reference no. 327043-001.
[95] Intel Corporation. Software Guard Extensions Programming Reference, 2013. Reference no. 329298-001US.
[96] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual, Sep 2014. Reference no. 248966-030.
[97] Intel Corporation. Intel® Xeon® Processor 7500 Series Datasheet - Volume Two, Mar 2014. Reference no. 329595-002.
[98] Intel Corporation. Intel® Xeon® Processor E7 v2 2800/4800/8800 Product Family Datasheet - Volume Two, Mar 2014. Reference no. 329595-002.
[99] Intel Corporation. Software Guard Extensions Programming Reference, 2014. Reference no. 329298-002US.
[100] Intel Corporation. Intel® 100 Series Chipset Family Platform Controller Hub (PCH) Datasheet - Volume One, Aug 2015. Reference no. 332690-001EN.
[101] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Sep 2015. Reference no. 325462-056US.
[102] Intel Corporation. Intel® C610 Series Chipset and Intel® X99 Chipset Platform Controller Hub (PCH) Datasheet, Oct 2015. Reference no. 330788-003.
[103] Intel Corporation. Intel® Software Guard Extensions (Intel® SGX), Jun 2015. Reference no. 332680-002.
[104] Intel Corporation. Intel® Xeon® Processor 5500 Series - Specification Update, Feb 2015. Reference no. 321324-018US.
[105] Intel Corporation. Intel® Xeon® Processor E5-1600, E5-2400, and E5-2600 v3 Product Family Datasheet - Volume Two, Jan 2015. Reference no. 330784-002.
[106] Intel Corporation. Intel® Xeon® Processor E5 Product Family - Specification Update, Jan 2015. Reference no. 326150-018.
[107] Intel Corporation. Mobile 4th Generation Intel® Core® Processor Family I/O Datasheet, Feb 2015. Reference no. 329003-003.
[108] Bruce Jacob and Trevor Mudge. Virtual memory: Issues of implementation. Computer, 31(6):33–43, 1998.
[109] Simon Johnson, Vinnie Scarlata, Carlos Rozas, Ernie Brickell, and Frank Mckeen. Intel® software guard extensions: Epid provisioning and attestation services. https://software.intel.com/en-us/blogs/2016/03/09/intel-sgx-epid-provisioning-and-attestation-services, Mar 2016. [Online; accessed 21-Mar-2016].
[110] Simon P Johnson, Uday R Savagaonkar, Vincent R Scarlata, Francis X McKeen, and Carlos V Rozas. Technique for supporting multiple secure enclaves, Dec 2010. US Patent 8,972,746.
[111] Jakob Jonsson and Burt Kaliski. RFC 3447: Public-Key Cryptography Standards (PKCS) #1: RSA Cryptography Specifications Version 2.1. Internet RFCs, Feb 2003.
[112] Burt Kaliski. RFC 2313: PKCS #1: RSA Encryption Version 1.5. Internet RFCs, Mar 1998.
[113] Burt Kaliski and Jessica Staddon. RFC 2437: PKCS #1: RSA Encryption Version 2.0. Internet RFCs, Oct 1998.

[114] Corey Kallenberg, Xeno Kovah, John Butterworth, and Sam Cornwell. Extreme privilege escalation on windows 8/uefi systems, 2014.
[115] Emilia Käsper and Peter Schwabe. Faster and timing-attack resistant aes-gcm. In Cryptographic Hardware and Embedded Systems-CHES 2009, pages 1–17. Springer, 2009.
[116] Jonathan Katz and Yehuda Lindell. Introduction to modern cryptography. CRC Press, 2014.
[117] Richard E Kessler and Mark D Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems (TOCS), 10(4):338–359, 1992.
[118] Taesoo Kim and Nickolai Zeldovich. Practical and effective sandboxing for non-root users. In USENIX Annual Technical Conference, pages 139–144, 2013.
[119] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping bits in memory without accessing them: An experimental study of dram disturbance errors. In Proceedings of the 41st annual International Symposium on Computer Architecture, pages 361–372. IEEE Press, 2014.
[120] L.A. Knauth and P.J. Irelan. Apparatus and method for providing eventing ip and source data address in a statistical sampling infrastructure, 2014. US Patent App. 13/976,613.
[121] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, 1987.
[122] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In Advances in Cryptology (CRYPTO), pages 388–397. Springer, 1999.
[123] Paul C Kocher. Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems. In Advances in Cryptology – CRYPTO '96, pages 104–113. Springer, 1996.
[124] Hugo Krawczyk, Ran Canetti, and Mihir Bellare. Hmac: Keyed-hashing for message authentication. 1997.
[125] Markus G Kuhn. Electromagnetic eavesdropping risks of flat-panel displays. In Privacy Enhancing Technologies, pages 88–107. Springer, 2005.
[126] Tsvika Kurts, Guillermo Savransky, Jason Ratner, Eilon Hazan, Daniel Skaba, Sharon Elmosnino, and Geeyarpuram N Santhanakrishnan. Generic debug external connection (gdxc) for high integration integrated circuits, 2011. US Patent 8,074,131.
[127] David Levinthal. Performance analysis guide for intel® core i7 processor and intel® xeon 5500 processors. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, 2010. [Online; accessed 26-January-2015].
[128] David Lie, Chandramohan Thekkath, Mark Mitchell, Patrick Lincoln, Dan Boneh, John Mitchell, and Mark Horowitz. Architectural support for copy and tamper resistant software. ACM SIGPLAN Notices, 35(11):168–177, 2000.
[129] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In 14th International IEEE Symposium on High Performance Computer Architecture (HPCA), pages 367–378. IEEE, 2008.
[130] Barbara Liskov and Stephen Zilles. Programming with abstract data types. In ACM Sigplan Notices, volume 9, pages 50–59. ACM, 1974.
[131] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B Lee. Last-level cache side-channel attacks are practical. In Security and Privacy (SP), 2015 IEEE Symposium on, pages 143–158. IEEE, 2015.
[132] Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari, Elaine Shi, Krste Asanovic, John Kubiatowicz, and Dawn Song. Phantom: Practical oblivious computation in a secure processor. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 311–324. ACM, 2013.
[133] R. Maes, P. Tuyls, and I. Verbauwhede. Low-Overhead Implementation of a Soft Decision Helper Data Algorithm for SRAM PUFs. In Cryptographic Hardware and Embedded Systems (CHES), pages 332–347, 2009.
[134] James Manger. A chosen ciphertext attack on rsa optimal asymmetric encryption padding (oaep) as standardized in pkcs# 1 v2.0. In Advances in Cryptology CRYPTO 2001, pages 230–238. Springer, 2001.
[135] Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. Reverse engineering intel last-level cache complex addressing using performance counters. In Proceedings of the 18th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2015.
[136] Jonathan M McCune, Yanlin Li, Ning Qu, Zongwei Zhou, Anupam Datta, Virgil Gligor, and Adrian Perrig. Trustvisor: Efficient tcb reduction and attestation. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 143–158. IEEE, 2010.
[137] David McGrew and John Viega. The galois/counter mode of operation (gcm). 2004. [Online; accessed 28-December-2015].
[138] Francis X McKeen, Carlos V Rozas, Uday R Savagaonkar, Simon P Johnson, Vincent Scarlata, Michael A Goldsmith, Ernie Brickell, Jiang Tao Li, Howard C Herbert, Prashant Dewan, et al. Method and apparatus to provide secure application execution, Dec 2009. US Patent 9,087,200.
[139] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos V Rozas, Hisham Shafi, Vedvyas Shanbhogue, and Uday R Savagaonkar. Innovative instructions and software model for isolated execution. HASP, 13:10, 2013.

[140] Michael Naehrig, Kristin Lauter, and Vinod Vaikuntanathan. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM workshop on Cloud computing security workshop, pages 113–124. ACM, 2011.
[141] National Institute of Standards and Technology (NIST). The advanced encryption standard (aes). Federal Information Processing Standards (FIPS) Publications (PUBS), 197, Nov 2001.
[142] National Institute of Standards and Technology (NIST). The digital signature standard (dss). Federal Information Processing Standards (FIPS) Processing Standards Publications (PUBS), 186-4, Jul 2013.
[143] National Security Agency (NSA) Central Security Service (CSS). Cryptography today on suite b phase-out. https://www.nsa.gov/ia/programs/suiteb_cryptography/, Aug 2015. [Online; accessed 28-December-2015].
[144] M.S. Natu, S. Datta, J. Wiedemeier, J.R. Vash, S. Kottapalli, S.P. Bobholz, and A. Baum. Supporting advanced ras features in a secured computing system, 2012. US Patent 8,301,907.
[145] Yossef Oren, Vasileios P Kemerlis, Simha Sethumadhavan, and Angelos D Keromytis. The spy in the sandbox – practical cache attacks in javascript. arXiv preprint arXiv:1502.07373, 2015.
[146] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: the case of aes. In Topics in Cryptology–CT-RSA 2006, pages 1–20. Springer, 2006.
[147] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory model: x86-tso (extended version). University of Cambridge, Computer Laboratory, Technical Report, (UCAM-CL-TR-745), 2009.
[148] Emmanuel Owusu, Jun Han, Sauvik Das, Adrian Perrig, and Joy Zhang. Accessory: password inference using accelerometers on smartphones. In Proceedings of the Twelfth Workshop on Mobile Computing Systems & Applications, page 9. ACM, 2012.
[149] D.B. Papworth, G.J. Hinton, M.A. Fetterman, R.P. Colwell, and A.F. Glew. Exception handling in a processor that performs speculative out-of-order instruction execution, 1999. US Patent 5,987,600.
[150] David A Patterson and John L Hennessy. Computer Organization and Design: the hardware/software interface. Morgan Kaufmann, 2013.
[151] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Mangard. Reverse engineering intel dram addressing and exploitation. ArXiv e-prints, Nov 2015.
[152] Stefan M Petters and Georg Farber. Making worst case execution time analysis for hard real-time tasks on state of the art processors feasible. In Sixth International Conference on Real-Time Computing Systems and Applications, pages 442–449. IEEE, 1999.
[153] S.A. Qureshi and M.O. Nicholes. System and method for using a firmware interface table to dynamically load an acpi ssdt, 2006. US Patent 6,990,576.
[154] S. Raikin, O. Hamama, R.S. Chappell, C.B. Rust, H.S. Luu, L.A. Ong, and G. Hildesheim. Apparatus and method for a multiple page size translation lookaside buffer (tlb), 2014. US Patent App. 13/730,411.
[155] S. Raikin and R. Valentine. Gather cache architecture, 2014. US Patent 8,688,962.
[156] Stefan Reinauer. x86 intel: Add firmware interface table support. http://review.coreboot.org/#/c/2642/, 2013. [Online; accessed 2-July-2015].
[157] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds. In Proceedings of the 16th ACM Conference on Computer and Communications Security, pages 199–212. ACM, 2009.
[158] RL Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120–126, 1978.
[159] S.D. Rodgers, K.K. Tiruvallur, M.W. Rhodehamel, K.G. Konigsfeld, A.F. Glew, H. Akkary, M.A. Karnik, and J.A. Brayton. Method and apparatus for performing operations based upon the addresses of microinstructions, 1997. US Patent 5,636,374.
[160] S.D. Rodgers, R. Vidwans, J. Huang, M.A. Fetterman, and K. Huck. Method and apparatus for generating event handler vectors based on both operating mode and event type, 1999. US Patent 5,889,982.
[161] M. Rosenblum and T. Garfinkel. Virtual machine monitors: current technology and future trends. Computer, 38(5):39–47, May 2005.
[162] Xiaoyu Ruan. Platform Embedded Security Technology Revealed. Apress, 2014.
[163] Joanna Rutkowska. Intel x86 considered harmful. Oct 2015. [Online; accessed 2-Nov-2015].
[164] Joanna Rutkowska and Rafał Wojtczuk. Preventing and detecting xen hypervisor subversions. Blackhat Briefings USA, 2008.
[165] Jerome H Saltzer and M Frans Kaashoek. Principles of Computer System Design: An Introduction. Morgan Kaufmann, 2009.
[166] Mark Seaborn and Thomas Dullien. Exploiting the dram rowhammer bug to gain kernel privileges. http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html, Mar 2015. [Online; accessed 9-March-2015].

[167] V. Shanbhogue, J.W. Brandt, and J. Wiedemeier. Protecting information processing system secrets from debug attacks, 2015. US Patent 8,955,144.
[168] V. Shanbhogue and S.J. Robinson. Enabling virtualization of a processor resource, 2014. US Patent 8,806,104.
[169] Stephen Shankland. Itanium: A cautionary tale. Dec 2005. [Online; accessed 11-February-2015].
[170] Alan Jay Smith. Cache memories. ACM Computing Surveys (CSUR), 14(3):473–530, 1982.
[171] Sean W Smith, Ron Perez, Steve Weingart, and Vernon Austel. Validating a high-performance, programmable secure coprocessor. In 22nd National Information Systems Security Conference. IBM Thomas J. Watson Research Division, 1999.
[172] Sean W Smith and Steve Weingart. Building a high-performance, programmable secure coprocessor. Computer Networks, 31(8):831–860, 1999.
[173] Marc Stevens, Pierre Karpman, and Thomas Peyrin. Free-start collision on full sha-1. Cryptology ePrint Archive, Report 2015/967, 2015.
[174] G Edward Suh, Dwaine Clarke, Blaise Gassend, Marten Van Dijk, and Srinivas Devadas. Aegis: architecture for tamper-evident and tamper-resistant processing. In Proceedings of the 17th annual international conference on Supercomputing, pages 160–171. ACM, 2003.
[175] G Edward Suh and Srinivas Devadas. Physical unclonable functions for device authentication and secret key generation. In Proceedings of the 44th annual Design Automation Conference, pages 9–14. ACM, 2007.
[176] G. Edward Suh, Charles W. O'Donnell, Ishan Sachdev, and Srinivas Devadas. Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions. In Proceedings of the 32nd ISCA'05. ACM, June 2005.
[177] George Taylor, Peter Davies, and Michael Farmwald. The tlb slice - a low-cost high-speed address translation mechanism. SIGARCH Computer Architecture News, 18(2SI):355–363, 1990.
[178] Alexander Tereshkin and Rafal Wojtczuk. Introducing ring-3 rootkits. Master's thesis, 2009.
[179] Kris Tiri, Moonmoon Akmal, and Ingrid Verbauwhede. A dynamic and differential cmos logic with signal independent power consumption to withstand differential power analysis on smart cards. In Proceedings of the 28th European Solid-State Circuits Conference (ESSCIRC), pages 403–406. IEEE, 2002.
[180] UEFI Forum. Unified Extensible Firmware Interface Specification, Version 2.5, 2015. [Online; accessed 1-Jul-2015].
[181] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L Santoni, Fernando CM Martins, Andrew V Anderson, Steven M Bennett, Alain Kagi, Felix H Leung, and Larry Smith. Intel virtualization technology. Computer, 38(5):48–56, 2005.
[182] Wim Van Eck. Electromagnetic radiation from video display units: an eavesdropping risk? Computers & Security, 4(4):269–286, 1985.
[183] Amit Vasudevan, Jonathan M McCune, Ning Qu, Leendert Van Doorn, and Adrian Perrig. Requirements for an integrity-protected hypervisor on the x86 hardware virtualized architecture. In Trust and Trustworthy Computing, pages 141–165. Springer, 2010.
[184] Sathish Venkataramani. Advanced Board Bring Up - Power Sequencing Guide for Embedded Intel Architecture. Intel Corporation, Apr 2011. Reference no. 325268.
[185] Vassilios Ververis. Security evaluation of intel's active management technology. 2010.
[186] Filip Wecherowski. A real smm rootkit: Reversing and hooking bios smi handlers. Phrack Magazine, 13(66), 2009.
[187] Mark N Wegman and J Lawrence Carter. New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265–279, 1981.
[188] Rafal Wojtczuk and Joanna Rutkowska. Attacking intel trusted execution technology. Black Hat DC, 2009.
[189] Rafal Wojtczuk and Joanna Rutkowska. Attacking smm memory via intel cpu cache poisoning. Invisible Things Lab, 2009.
[190] Rafal Wojtczuk and Joanna Rutkowska. Attacking intel txt via sinit code execution hijacking, 2011.
[191] Rafal Wojtczuk, Joanna Rutkowska, and Alexander Tereshkin. Another way to circumvent intel® trusted execution technology. Invisible Things Lab, 2009.
[192] Rafal Wojtczuk and Alexander Tereshkin. Attacking intel® bios. Invisible Things Lab, 2010.
[193] Y. Wu and M. Breternitz. Genetic algorithm for microcode compression, 2008. US Patent 7,451,121.
[194] Y. Wu, S. Kim, M. Breternitz, and H. Hum. Compressing and accessing a microcode rom, 2012. US Patent 8,099,587.
[195] Yuanzhong Xu, Weidong Cui, and Marcus Peinado. Controlled-channel attacks: Deterministic side channels for untrusted operating systems. In Proceedings of the 36th IEEE Symposium on Security and Privacy (Oakland). IEEE Institute of Electrical and Electronics Engineers, May 2015.
[196] Yuval Yarom and Katrina E Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. IACR Cryptology ePrint Archive, 2013:448, 2013.
[197] Yuval Yarom, Qian Ge, Fangfei Liu, Ruby B. Lee, and Gernot Heiser. Mapping the intel last-level cache. Cryptology ePrint Archive, Report 2015/905, 2015.
[198] Bennet Yee. Using secure coprocessors. PhD thesis, Carnegie Mellon University, 1994.

[199] Marcelo Yuffe, Ernest Knoll, Moty Mehalel, Joseph Shor, and Tsvika Kurts. A fully integrated multi-cpu, gpu and memory controller 32nm processor. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International, pages 264–266. IEEE, 2011.
[200] Xiantao Zhang and Yaozu Dong. Optimizing xen vmm based on intel® virtualization technology. In Internet Computing in Science and Engineering, 2008. ICICSE'08. International Conference on, pages 367–374. IEEE, 2008.
[201] Li Zhuang, Feng Zhou, and J Doug Tygar. Keyboard acoustic emanations revisited. ACM Transactions on Information and System Security (TISSEC), 13(1):3, 2009.
[202] V.J. Zimmer and S.H. Robinson. Methods and systems for microcode patching, 2012. US Patent 8,296,528.
[203] V.J. Zimmer and J. Yao. Method and apparatus for sequential hypervisor invocation, 2012. US Patent 8,321,931.
