Meltdown and Spectre, explained – mattklein123 – Medium, 3/30/18, 7:10 AM
mattklein123 · Engineer @lyft · Jan 14 · 19 min read
Meltdown and Spectre, explained

Although these days I'm mostly known for application level networking and distributed systems, I spent the early part of my career working on systems software. The vulnerabilities are astounding; I would argue they are one of the most important discoveries in computer science in the last 10–20 years. The mitigations are also difficult to understand, and accurate information about them is hard to find. Although a lot has been written about Meltdown and Spectre since their announcement, I have not seen a good mid-level introduction to the vulnerabilities and mitigations. In this post I'm going to attempt to correct that by providing a gentle introduction to the hardware and software background required to understand the vulnerabilities, a discussion of the vulnerabilities themselves, and a discussion of the current mitigations.

Important note: Because I have not worked directly on the mitigations, and do not work at Intel, Microsoft, Google, Amazon, Red Hat, etc., some of the details that I am going to provide may not be entirely accurate. I have pieced together this post based on my knowledge of how these systems work, publicly available documentation, and patches/discussion posted to LKML and xen-devel. I would love to be corrected if any of this post is inaccurate, though I doubt that will happen any time soon given how much of this subject is still covered by NDA.

Background

In this section I will provide some background required to understand the vulnerabilities. The section glosses over a large amount of detail and is aimed at readers with a limited understanding of computer hardware and systems software.

Virtual memory

Virtual memory is a technique used by all operating systems since the 1970s.
It provides a layer of abstraction between the memory address layout that most software sees and the physical devices backing that memory (RAM, disks, etc.). At a high level, it allows applications to utilize more memory than the machine actually has; this provides a powerful abstraction that makes many programming tasks easier.

Figure 1: Virtual memory

Figure 1 shows a simplistic computer with 400 bytes of memory laid out in "pages" of 100 bytes (real computers use powers of two, typically 4096). The computer has two processes, each with 200 bytes of memory across 2 pages. The processes might even be running the same code.

Translating virtual to physical addresses is such a common operation in modern computers that if the OS had to be involved in all cases the computer would be incredibly slow. Modern CPU hardware provides a device called a Translation Lookaside Buffer (TLB) that caches recently used mappings. This allows CPUs to perform address translation directly in hardware the majority of the time.

Figure 2: Virtual memory translation

Figure 2 shows the address translation flow:

1. A program fetches a virtual address.
2. The CPU attempts to translate it using the TLB. If the address is found, the translation is used.
3. If the address is not found, the CPU consults a set of "page tables" to determine the mapping. Page tables are a set of physical memory pages, provided by the operating system in a location the hardware can access.
4. If the page table contains a mapping it is returned, cached in the TLB, and used for the lookup. If the page table does not contain a mapping, a "page fault" is raised to the OS. A page fault is a special
kind of interrupt that allows the OS to take control and determine what to do when there is a missing or invalid mapping. For example, the OS might terminate the program. It might also allocate some physical memory and map it into the process. If a page fault handler continues execution, the new mapping will be used by the TLB.

Figure 3: User/kernel virtual memory mappings

Figure 3 shows a slightly more realistic view of what virtual memory looks like in a modern computer (pre-Meltdown; more on this below). In this setup we have the following features:

• Kernel memory is shown in red. It is contained in physical address range 0–99. Kernel memory is special memory that only the operating system should be able to access. User programs should not be able to access it.
• User memory is shown in gray.
• Unallocated physical memory is shown in blue.

In this example, we start seeing some of the useful features of virtual memory. Primarily:

• User memory in each process is in the virtual range 0–99, but backed by different physical memory.
• Kernel memory in each process is in the virtual range 100–199, but backed by the same physical memory.

As I briefly mentioned in the previous section, each page has associated permission bits. Even though kernel memory is mapped into each user process, when the process is running in user mode it cannot access the kernel memory. If a process attempts to do so, it will trigger a page fault, at which point the operating system will terminate it. However, when the process is running in kernel mode (for example, during a system call), the processor will allow the access.
At this point I will note that this type of dual mapping (each process having the kernel mapped into it directly) has been standard practice in operating system design for over thirty years, for performance reasons (system calls are very common and it would take a long time to remap the kernel or user space on every transition).

CPU cache topology

Figure 4: CPU thread, core, package, and cache topology.

The next piece of background information required to understand the vulnerabilities is the CPU and cache topology of modern processors. Figure 4 shows a generic topology that is common to most modern CPUs. It is composed of the following components:

• The basic unit of execution is the "CPU thread," "hardware thread," or "hyper-thread." Each CPU thread contains a set of registers and the ability to execute a stream of machine code, much like a software thread.
• CPU threads are contained within a "CPU core." Most modern CPUs contain two threads per core.
• Modern CPUs generally contain multiple levels of cache memory. The cache levels closer to the CPU thread are smaller, faster, and more expensive. The further from the CPU and closer to main memory a cache is, the larger, slower, and cheaper it is.
• Typical modern CPU design uses an L1/L2 cache per core. This means that each CPU thread on the core makes use of the same caches.
• Multiple CPU cores are contained in a "CPU package." Modern CPUs might contain upwards of 30 cores (60 threads) or more per package.
• All of the CPU cores in the package typically share an L3 cache.
• A machine may contain multiple CPU packages.

Speculative execution

Figure 5: Modern CPU execution engine (Source: Google images)

The primary takeaway is that modern CPUs are incredibly complicated and do not simply execute machine instructions in order. Each CPU thread has a complicated pipelining engine that is capable of executing instructions out of order. The reason for this has to do with caching. As I discussed in the previous section, each CPU makes use of multiple levels of caching. Each cache miss adds a substantial amount of delay to program execution. In order to mitigate this, processors are capable of executing ahead and out of order while waiting for memory loads. This is known as speculative execution. The following code snippet demonstrates this:

```cpp
if (x < array1_size) {
  y = array2[array1[x] * 256];
}
```

In the previous snippet, imagine that array1_size is not available in cache, but the address of array1 is. The CPU might guess (speculate) that x is less than array1_size and go ahead and perform the calculations inside the if statement. Once array1_size is read from memory, the CPU can determine whether it guessed correctly. If it did, it can continue, having saved a bunch of time. If it didn't, it can throw away the speculative calculations and start over. This is no worse than if it had waited in the first place.

Another type of speculative execution is known as indirect branch prediction. This is extremely common in modern programs due to virtual dispatch.
```cpp
class Base {
 public:
  virtual void Foo() = 0;
};

class Derived : public Base {
 public:
  void Foo() override { ... }
};

Base* obj = new Derived;
obj->Foo();
```

(The source of the previous snippet is this post.)

The way the previous snippet is implemented in machine code is to load the "v-table" or "virtual dispatch table" from the memory location that obj points to and then call it. Because this operation is so common, modern CPUs have various internal caches and will often guess (speculate) where the indirect branch will go and continue execution at that point. Again, if the CPU guesses correctly it can continue, having saved a bunch of time. If it didn't, it can throw away the speculative calculations and start over.

Meltdown vulnerability

Having now covered all of the background information, we can dive into the vulnerabilities.

Rogue data cache load

The attack can be summarized by the following steps:

```cpp
1. uint8_t* probe_array = new uint8_t[256 * 4096];
2. // ... Make sure probe_array is not cached
3. uint8_t kernel_memory = *(uint8_t*)(kernel_address);
4. uint64_t final_kernel_memory = kernel_memory * 4096;
5. uint8_t dummy = probe_array[final_kernel_memory];
6. // ... catch page fault
7. // ... determine which of 256 slots in probe_array is cached
```

Let's take each step above, describe what it does, and how it leads to being able to read the memory of the entire computer from a user program.

1. In the first step, the attacker allocates a "probe" array of 256 * 4096 bytes: one 4096-byte page for each of the 256 possible values of a byte.
2. Following the allocation, the attacker makes sure that none of the memory in the probe array is cached. There are various ways of accomplishing this, the simplest of which involves CPU-specific cache-flush instructions.
3. The attacker then proceeds to read a byte from the kernel's address space. Remember from our previous discussion about virtual memory and page tables that all modern kernels typically map the entire kernel virtual address space into the user process.
Operating systems rely on the fact that each page table entry has permission settings, and that user mode programs are not allowed to access kernel memory. Any such access will result in a page fault. That is indeed what will eventually happen at step 3.

4. However, modern processors also perform speculative execution and will execute ahead of the faulting instruction. Thus, steps 3–5 may execute in the CPU's pipeline before the fault is raised. In this step, the byte of kernel memory (which ranges from 0–255) is multiplied by the page size of the system, which is typically 4096.
5. In this step, the multiplied byte of kernel memory is used to read from the probe array into a dummy value. The multiplication of the byte by 4096 prevents a CPU feature called the "prefetcher" from reading more data than we want into the cache.
6. By this step, the CPU has realized its mistake and rolled back to step 3. However, the results of the speculated instructions are still visible in the cache. The attacker uses operating system functionality to trap the faulting instruction and continue execution (e.g., handling SIGSEGV).
7. In step 7, the attacker iterates through the probe array and measures how long it takes to read each of the 256 slots that could have been indexed by the kernel byte. The CPU will have loaded one of the locations into cache, and this location will load substantially faster than all the other locations (which need to be read from main memory). The index of this location is the value of the byte in kernel memory.

Using the above technique, and the fact that it is standard practice for modern operating systems to map all of physical memory into the kernel virtual address space, an attacker can read the computer's entire physical memory.
Now, you might be wondering: "You said that page tables have permission bits. How can it be that user mode code was able to speculatively access kernel memory?" The answer is that this is a bug in Intel processors. In my opinion, there is no good reason, performance or otherwise, for this to be possible. Recall that all virtual memory access must occur through the TLB. It is easily possible during speculative execution to check that a cached mapping has permissions compatible with the current running privilege level. Intel hardware simply does not do this. Other processor vendors do perform a permission check and block speculative execution. Thus, as far as we know, Meltdown is an Intel-only vulnerability.

Edit: It appears that at least one ARM processor is also susceptible to Meltdown, as indicated here and here.

Meltdown mitigations

Meltdown is easy to understand, trivial to exploit, and fortunately also has a relatively straightforward mitigation (at least conceptually; kernel developers might not agree that it is straightforward to implement).

Kernel page table isolation (KPTI)

Recall that in the section on virtual memory I described that all modern operating systems use a technique in which kernel memory is mapped into every user mode process's virtual memory address space. This is for both performance and simplicity reasons. It means that when a program makes a system call, the kernel is ready to be used without any further work.

Figure 6: Kernel page table isolation

Figure 6 shows a technique called Kernel Page Table Isolation (KPTI). This basically boils down to not mapping kernel memory into a program when it is running in user space.
If there is no mapping present, speculative execution is no longer possible and will immediately fault. In addition to making the operating system's virtual memory manager (VMM) more complicated, without hardware assistance this technique will also considerably slow down workloads that make a large number of user mode to kernel mode transitions, due to the fact that the page tables have to be modified on each transition. Newer x86 CPUs have a feature known as ASID (address space ID) or PCID (process context ID) that can be used to make this task substantially cheaper (ARM and other microarchitectures have had this feature for years). PCID allows an ID to be associated with a TLB entry, and then only TLB entries with that ID to be flushed. The use of PCID makes KPTI cheaper, but still not free.

In summary, Meltdown is an extremely serious and easy to exploit vulnerability. Fortunately, it has a relatively straightforward mitigation that has already been deployed by all major OS vendors, the caveat being that certain workloads will run slower until future hardware is explicitly designed for the address space separation described.

Spectre vulnerability

Spectre shares some properties of Meltdown and is composed of two variants. Unlike Meltdown, Spectre is substantially harder to exploit, but affects almost all modern processors produced in the last twenty years. Essentially, Spectre is an attack against modern CPU and operating system design, rather than a specific security vulnerability.

Bounds check bypass (Spectre variant 1)