UPTEC IT 20036 Degree project 30 credits September 2020

Dynamic Eviction Set Algorithms and Their Applicability to Cache Characterisation

Maria Lindqvist

Department of Information Technology

Abstract

Dynamic Eviction Set Algorithms and Their Applicability to Cache Characterisation

Maria Lindqvist

Eviction sets are groups of memory addresses that map to the same cache set. They can be used to perform efficient information-leaking attacks against the cache memory, so-called cache side channel attacks. In this project, two different algorithms that find such sets are implemented and compared. The second algorithm improves on the first by using a concept called group testing. It is also evaluated whether these algorithms can be used to analyse or reverse engineer the cache characteristics, which is a new area of application for this type of algorithm. The results show that the optimised algorithm performs significantly better than the previous state-of-the-art algorithm. This means that countermeasures developed against this type of attack need to be designed with the possibility of faster attacks in mind. The results also show, as a proof of concept, that it is possible to use these algorithms to create a tool for cache analysis.

Supervisor: Christos Sakalis
Subject reader: Stefanos Kaxiras
Examiner: Lars-Åke Nordén
UPTEC IT 20036
Printed by: Reprocentralen ITC

Acknowledgements

I would like to thank my supervisor Christos Sakalis for all the guidance, discussions and feedback during this thesis. I also want to thank Stefanos Kaxiras for the idea for the subject and for making the project possible. Finally, I would like to thank everyone close to me for all the moral support and patience.

Contents

1 Introduction
  1.1 Motivation and Purpose
  1.2 Limitations

2 Background
  2.1 Cache Memories
    2.1.1 Cache Hierarchy
    2.1.2 Cache Organisation
    2.1.3 Cache Hits and Cache Misses
    2.1.4 Eviction and Replacement Policies
  2.2 Memory Addresses and Virtual Memory
    2.2.1 Addresses and Cache Indexing
    2.2.2 Virtual Memory
  2.3 Cache Attacks and Eviction Sets
    2.3.1 Cache Attacks
    2.3.2 Prime+Probe, Evict+Time and Flush+Reload
    2.3.3 Eviction Sets
  2.4 Algorithms for Finding Eviction Sets
    2.4.1 Baseline Algorithm
    2.4.2 Group Testing
    2.4.3 Group Testing Algorithm
    2.4.4 Targeting Last Level Cache - Purpose and Challenges
    2.4.5 Implementation Complications

3 Related Work
  3.1 Characterising the Cache - Microbenchmarks

4 Method
  4.1 Pin
  4.2 Implementation Decisions
  4.3 Method to Determine Cache Presence
  4.4 Method of Measuring Number of Memory Accesses
  4.5 Comparing the Algorithms
    4.5.1 Parameters
    4.5.2 Testing Correctness
    4.5.3 Testing Performance and Scaling
  4.6 Characterising the Cache
    4.6.1 Using Baseline Algorithm
    4.6.2 Using Group Testing Algorithm
    4.6.3 Evaluation

5 Results
  5.1 Comparing Algorithms
    5.1.1 Correctness
    5.1.2 Scaling
    5.1.3 Performance
  5.2 Characterising the Cache
    5.2.1 Correctness
    5.2.2 Scaling

6 Discussion
  6.1 Comparing Algorithms
  6.2 Characterising the Cache
  6.3 Problems, Pitfalls and Lessons Learned
  6.4 Future Work

7 Conclusion

1 Introduction

In most computers today, cache memories are used as the bridge between the fast processor and the slower main memory. While they bring great performance advantages, caches can also be exploited in so-called side channel attacks. In a computer system, a side channel leaks information through measurable side effects on the system. By performing a side channel attack against the cache memory, it is possible to indirectly retrieve encryption keys by observing memory access patterns [1, 3, 16, 15]. More recent research has also shown that many modern processors are vulnerable to so-called speculative attacks [9, 8]. In that case, traces left in the cache, combined with misuse of performance mechanisms, make it possible to read another process's private memory contents. Cache side channel attacks can also break the separation between virtual machines (VMs) in a cloud computing setting, if multiple VMs are hosted on the same physical hardware [6]. Such separation is necessary to preserve the integrity of the users of the cloud environment.

To perform such cache side channel attacks, one needs to be able to control and examine the content of the cache. That is, to be able to remove data associated with a specific memory address from the cache, and some time later examine whether it has been brought back into the cache by some other process. If so, it can give away information about the actions of the other process. To know if the data is present in the cache, one typically measures the time it takes to access it: a long access time indicates that it is not in the cache memory. The removal of data from the cache memory is referred to as an eviction.

The most straightforward way to evict a target victim address is to use a flush-based approach [27], where an instruction that directly evicts the targeted address is used. While convenient, this type of attack can easily be mitigated and the instruction is not available in some environments. The other type of method is called conflict-based [10]. The aim is then to construct a so-called eviction set, which is a set of memory addresses. Accessing the addresses in the set brings the corresponding data from main memory into the smaller cache memory. Because of the organisation of the cache, the victim data will then be evicted once all the addresses have been accessed. To be able to use this approach in a practical attack, one wants the set to be as small as possible. Another requirement is that the set can be computed in an efficient way.

Eviction sets can be derived in a static way. This means forming the set manually, using an already known scheme that maps memory addresses to their corresponding locations in the cache. However, some countermeasures have been presented that prevent this. One is to randomise this mapping at some time interval [17], making it harder to use the static approach, since it needs to be redone after each randomisation. The alternative is instead to use a dynamic method, which requires less information about the system. The main idea is to randomly find a large set of addresses that evicts the target, and then reduce the set to its minimal core. Dynamic methods are what I examine in this thesis.

1.1 Motivation and Purpose

The previous state-of-the-art algorithm [10] dynamically finds eviction sets in quadratic time. In 2019, a new version of the algorithm was presented [24] that finds minimal eviction sets in linear time, which is a significant improvement. When an eviction set can be found in a shorter amount of time, some of the countermeasures against conflict-based attacks become less powerful. Thus, to be able to improve protections against these attacks [18], it is necessary to understand and analyse methods for quickly finding eviction sets.

The purpose of this thesis project is twofold. The first part is to examine these two algorithms for finding minimal eviction sets. This is done by implementing the previous state-of-the-art algorithm as well as the optimised algorithm, and then comparing their performance and correctness. The second part is to explore whether the algorithms can be used for a different purpose, namely characterising a cache memory. That is, to find out the different parameters or settings of the cache, without knowing these in advance. Specifically, these are the questions to be answered:

1. Does the increase in performance conform with the theoretical calculations?

2. What is the impact of different parameters and settings on the effectiveness of the two algorithms?

3. What are the challenges when implementing eviction set algorithms and finding the minimal eviction sets?

4. Can dynamic methods for finding eviction sets be used to characterise or analyse the cache parameters?

1.2 Limitations

This project focuses on algorithms for cache-based attacks. It does not cover other types of side-channel attacks, or attacks that target microarchitectural structures other than the cache memories. The attacks that use the discovered eviction sets will not be implemented. New eviction set algorithms will not be developed. Countermeasures against the methods of finding eviction sets will not be evaluated. This project uses dynamic methods for finding eviction sets; static methods (reverse engineering the mapping from addresses to cache sets) will not be explored.

2 Background

This section provides the background necessary to understand the purpose and meaning of eviction sets, as well as the mechanism behind the algorithms. It will first cover the basics of cache memories, memory addresses and conflict-based cache attacks. Then the two algorithms are presented. Some general challenges for the algorithms are also briefly discussed.

2.1 Cache Memories

Memory accesses from the CPU to the main memory introduce a long latency. This is both because of the memory technology used in the main memory, and because of its placement far away from the processor. To overcome this, smaller but faster cache memories are used. These caches store copies of recently used data, so that the processor can access it more quickly. Some basic characteristics of caches will now be covered.

2.1.1 Cache Hierarchy

Caches today are organised in a hierarchy, with the largest and slowest cache closest to the main memory. This is called the last level cache (LLC, typically the same as the L3 cache). Closest to the CPU reside the smallest caches, called L1 caches. Typically, there are two L1 caches, one for data (L1d) and one for instructions (L1i). On a multi-core system, the L1 and (usually) L2 caches are private to each core, while the LLC is shared between all of the cores. This is illustrated in Figure 1.

On modern systems, the LLC is divided into so-called slices, typically as many slices as there are cores. Which slice is used by which memory address could be determined by a simple function, but on recent Intel processors it is determined by an undocumented, so-called complex addressing scheme. This previously complicated some types of attacks against caches, but the addressing scheme has since been reverse engineered [12].


Figure 1: Example of a cache hierarchy. L1, L2 and L3/LLC are cache memories, where L3 is closest to the main memory and L1 is closest to the processor core. Here there are 2 cores, each with its own private L1 and L2 caches.

2.1.2 Cache Organisation

The cache memory is divided into entries called cache lines. Each cache line consists, at the very least, of a valid bit, an address tag and the data to be stored. The data corresponds to a contiguous chunk of data in the main memory. This allows the cache to benefit from spatial locality: memory locations that are physically close are likely to be accessed close together in time. When such a nearby memory location is then accessed, it may already be in the cache, brought in as part of the same cache line by an earlier access.

Caches are today most commonly divided into sets, where each set consists of multiple cache lines. The number of lines in each set determines the associativity, or number of ways, of the cache. For example, if each set consists of four cache lines, the cache is said to be 4-way associative. As a result, multiple cache lines can map to the same set and still all reside in the cache at the same time. A cache organised in this way is called a set-associative cache.

There are other possible cache organisations as well. In a fully-associative cache, each cache line can be placed anywhere in the cache. In a direct-mapped cache, each address maps to exactly one cache location. In this thesis, the caches that are targeted are set-associative.
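As a concrete illustration of this organisation, the following C sketch computes how many sets a cache has and which set a given address maps to. The parameters and names are illustrative, not the configuration used in this project, and a real sliced LLC would additionally hash the address across slices:

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: a 32 kB, 8-way set-associative cache with
 * 64-byte lines. */
#define CACHE_SIZE (32 * 1024)
#define LINE_SIZE  64
#define WAYS       8
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS)) /* = 64 sets */

/* Drop the line-offset bits, then keep log2(NUM_SETS) bits as the set
 * index. Works because NUM_SETS is a power of two. */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void) {
    uintptr_t a = 0x12345940u;
    printf("address %#lx maps to set %u of %d\n",
           (unsigned long)a, set_index(a), NUM_SETS);
    return 0;
}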


Figure 2: Different cache organisations. The cache on the left is a two-way associative cache; each index can hold two cache lines. The cache in the middle is direct-mapped; each index can hold exactly one cache line. The cache on the right is fully-associative; all cache lines map to the same index.

2.1.3 Cache Hits and Cache Misses

When accessing memory, the request first goes to the smallest cache in the hierarchy, the L1 cache. If the cache does not currently hold the requested data, it is called a cache miss and the request goes to the next level in the hierarchy. If the cache does hold the data, it is called a cache hit. A cache miss introduces some latency, since the larger and slower caches, or the main memory, need to be accessed.

Cache misses can occur for a number of different reasons, and one usually distinguishes three main types. Compulsory misses occur when accessing a cache line for the first time, such that it has to be loaded from memory. Capacity misses occur because the cache is too small: it cannot contain more cache lines and has to remove one of the lines already in the cache when a new one is accessed. The next time data from the removed line is accessed, it will be a miss, since it is no longer in the cache. Conflict misses occur when there is enough space in total, but the cache line maps to an already occupied index; for example, in a set-associative cache, when all the ways in a cache set are occupied. One of the cache lines will then be removed, resulting in a cache miss the next time data from that line is accessed.

For example, a fully-associative cache has no conflict misses (but it is less efficient to search). A direct-mapped cache has many conflicts, since each index only holds one line (but it is easier to search). Set-associative caches work as a trade-off between these two alternatives.

2.1.4 Eviction and Replacement Policies

If all ways in the set are occupied by cache lines, and the cache brings in a line that maps to the same set, one of the old lines needs to be removed. This is called eviction. The next time someone tries to access data from the evicted line in that cache, it will result in a cache miss. It will take a longer time to access that data, which, if measured, could indicate a miss.

Which one of the lines gets evicted is determined by the replacement policy. Examples of policies are least-recently-used (LRU), first-in-first-out (FIFO) and random replacement, with some variant of LRU or pseudo-LRU being the most common. In that case, the cache line that was least recently used is, as the name implies, evicted from the cache. However, some of the more recent architectures use policies that are undocumented and that are adaptive or randomised in their behaviour [24]. This can affect the performance of commonly used cache attacks and algorithms. In this project, the caches are assumed to use an LRU replacement policy.


Figure 3: In this example, all the slots at index 1 are filled. A new cache line maps to the same index, which means one of the previously stored cache lines will be evicted. Under LRU replacement, if the cache line in the second slot was least recently used, it will be the one to be evicted.

2.2 Memory Addresses and Virtual Memory

Virtual memory is a memory management technique that makes the physical memory easier to interact with. Depending on the cache hierarchy, cache memories can be indexed using either virtual memory addresses or physical memory addresses. This section explains how addresses in general are used to interact with the cache, as well as how the translation from a virtual to a physical address works.

2.2.1 Addresses and Cache Indexing

Memory addresses are used to store data into, and load data from, the cache. Different bits in the address have different uses. The least significant bits (LSBs) of the address are used to find a specific byte in a word. If we have a word size of 4 bytes, we need 2 bits to choose the correct byte. The next bits (counting from the LSB) are used to find the right word in the cache line. If we have a 64 B cache line, giving 16 words of 4 bytes each, we need 4 bits of the address for this (since 2^4 = 16). Choosing the correct word can be done using a multiplexer, which is a type of selector.

[Figure: a 32-bit example address, 01001110101011000101100100100101, in a system with a 64 kB cache, 64 B cache lines and 4 B words, divided into a tag (16 bits), an index (10 bits), a word-in-cache-line field (4 bits) and a byte-in-word field (2 bits).]

Figure 4: Example of the purpose of different bits in a memory address.

The next bits are used for indexing into the cache. How many bits are needed depends on the size of the cache, the cache line size and the associativity. If we, as an example, have a 64 kB direct-mapped cache with 64 B cache lines, we have 1024 entries. We then need 10 bits for indexing, since 2^10 = 1024, and we would not be able to index all entries with fewer bits.

The last bits (the most significant bits) are used as the tag. When retrieving data from the cache based on an address, we first find the right index using the index bits. The tag stored within the cache line at that index is then compared to the tag in the address. If they match, and the valid bit in the cache line is set, the access is a hit. In a set-associative cache, we need to compare the tags of all lines in the cache set to determine whether there is a hit. Another consequence of set-associativity is that the cache has fewer indices, and therefore needs fewer indexing bits.
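To make the bit manipulation concrete, here is a small C sketch that decomposes the 32-bit example address from Figure 4 into its fields. The masks and shifts follow directly from the field widths derived above; the macro names are invented for this illustration:

#include <stdint.h>
#include <stdio.h>

/* Field widths for the example in the text: 64 kB direct-mapped cache,
 * 64 B cache lines, 4 B words, 32-bit addresses. */
#define BYTE_BITS  2   /* 4-byte words  -> 2 bits for byte-in-word */
#define WORD_BITS  4   /* 16 words/line -> 4 bits for word-in-line */
#define INDEX_BITS 10  /* 1024 lines    -> 10 bits for the index   */

int main(void) {
    uint32_t addr = 0x4EAC5925; /* 01001110101011000101100100100101 */

    uint32_t byte_in_word = addr & ((1u << BYTE_BITS) - 1);
    uint32_t word_in_line = (addr >> BYTE_BITS) & ((1u << WORD_BITS) - 1);
    uint32_t index = (addr >> (BYTE_BITS + WORD_BITS))
                     & ((1u << INDEX_BITS) - 1);
    uint32_t tag = addr >> (BYTE_BITS + WORD_BITS + INDEX_BITS);

    printf("tag=%#x index=%u word=%u byte=%u\n",
           tag, index, word_in_line, byte_in_word);
    return 0;
}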


Figure 5: Example of how the address is used in a 2-way associative cache

2.2.2 Virtual Memory

When mentioning memory addresses, there are two types of addresses to discuss. The first is physical memory addresses, which refer to the actual location of data in the main memory. On the other hand, we have virtual (or logical) memory, which is an abstraction of the physical memory. It can be seen as a memory management technique that gives a program, and the programmer, the illusion of more memory being available than there actually is. It also frees the programmer from requiring detailed knowledge of the main memory, and from managing it manually. This is particularly useful since the program might not be written and compiled on the same machine that it will later run on. It also makes it easier to run multiple programs on the same machine.

The virtual address space is typically divided into so-called pages, which map to corresponding frames in the physical memory. A page table, which resides in main memory, is used to translate virtual addresses to physical addresses. The translation is managed by the Memory Management Unit (MMU). Recently made address translations are stored in a type of cache called the Translation Lookaside Buffer (TLB), to avoid unnecessary memory accesses. Since the TLB is a type of cache memory, it can suffer from misses as well, if the translation is not in the TLB. If a miss occurs, the translation is performed using the page table, which is a much slower process.

The memory translation is done in two steps. The first is to find the corresponding physical frame using the TLB or, failing that, the page table. Second, the actual address is formed using the least significant bits of the virtual address, the page offset.

Some architectures also support so-called large pages or huge pages, which can for example hold 2 MB of data, in contrast to the standard (on Linux) 4 kB pages. (There are also 1 GB huge pages, but they need to be allocated manually.) The main purpose is to increase performance, since fewer pages are needed to represent the translations. This also reduces the number of misses in the TLB: a page lookup brings the frame translation for one virtual page into one entry of the TLB, and if the page is large, many addresses can share that entry, using the offset to separate them. This way, the TLB can cover many more translations without increasing in size.

The first level cache (L1) is typically indexed using virtual addresses, while the other levels are indexed using physical addresses. This means that we might not directly have access to the indexing of the larger caches, which can be a problem when using the algorithms presented later. However, using large pages gives more information about physical addresses, since more bits are needed for the page offset, and these offset bits translate directly to the physical address. If all of these bits happen to overlap with the indexing bits, we can essentially use the virtual addresses as if they were physical. Even with this information, the slice index might still be unknown for a sliced cache. Large pages are also not supported on all systems.
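As a hedged, Linux-specific sketch of how such a huge page can be requested (assuming 2 MB huge pages are configured on the system; the constant name is ours):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024) /* 2 MB */

int main(void) {
    /* Request one anonymous 2 MB huge page; this returns MAP_FAILED if
     * no huge pages are configured on the system. */
    void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* The low 21 bits of any address in buf equal its physical page
     * offset, which typically covers the cache index bits discussed
     * above. */
    printf("huge page mapped at %p\n", buf);
    munmap(buf, HUGE_PAGE_SIZE);
    return 0;
}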

2.3 Cache Attacks and Eviction Sets

Side channel attacks are a category of attacks where information is leaked through some microarchitectural structure (such as caches or other types of buffers), rather than through a weakness in the software itself. This includes measuring energy consumption, computation time, electromagnetic radiation, or cache timing, all of which can give us information about the system. The leaked information could for example be encryption keys or passwords. We will focus on the cache memories as the side channel. There are, however, other microarchitectural structures that can be exploited in a similar way, such as small specialised caches or internal buffers (e.g. the TLB or branch prediction buffers). Exploring attacks against these is out of scope for this project.

2.3.1 Cache Attacks

In the case of cache side channel attacks, monitoring the content of the caches can reveal which memory addresses the victim is accessing. While the actual data content is protected from being read by other processes, the access pattern is not. As an example, if the other process is performing an encryption, it is possible to draw conclusions about the encryption keys by observing the pattern.

Cache attacks are possible on all cache levels, though attacks against the L1 and L2 caches assume that the victim and the attacker run on the same core. This can be achieved using simultaneous multithreading (also called hyperthreading) [16], where multiple threads execute simultaneously to improve performance. L1 and L2 attacks can also be performed when processes simply share the same core, but this requires being able to interrupt the other process [27]. Since the LLC (e.g. the L3 cache) is shared between different cores, attacks against the LLC do not assume that the victim and the attacker are on the same core. However, some difficulties [10, 27] are introduced when using the LLC in contrast to, for example, the L1 cache. This is further discussed in Section 2.4.4.

The fact that the LLC can be used in these kinds of attacks is also relevant for virtualised environments, e.g. cloud computing services, since these often rely on multiple virtual machines (VMs) that run on the same hardware platform but on different cores. The VMs then share the LLC, which as discussed is susceptible to cache attacks. Cache related vulnerabilities in cloud systems have been researched many times before [6, 13, 19, 26, 28], and some of them have been mitigated.

2.3.2 Prime+Probe, Evict+Time and Flush+Reload

Cache attacks that are built upon eviction can be divided into two categories [18]: conflict-based attacks and flush-based attacks. Prime+Probe [15] is one of the most commonly used conflict-based attacks. The idea is to observe the effect on the cache made by, for example, an encryption. The attack consists of three steps.

1. Prime. The attacker fills the cache with its own cache lines.

2. Wait. The attacker waits for some time period.

3. Probe. The attacker accesses all cache lines again while measuring the time. If there is a long latency for some of the cache lines, the victim has accessed the cache set that those lines belong to, evicting some of the attacker's cache lines.

As a consequence, the last step (Probe) also serves as the first step (Prime) of the next round of measurement, since the cache is once again in an attacker-controlled state.


Figure 6: The three steps of Prime+Probe. In this example, the second cache line has been evicted by the victim process during the wait step. Because of this, that cache line will have a longer access time in step 3. This can give the attacker information about the victim's execution.

In the same work as Prime+Probe, Evict+Time [15] was presented, a second type of conflict-based attack. In this case the attacker manipulates the cache content and observes the effect on the execution time of the encryption. However, the authors point out some weaknesses in this type of attack: the timing is sensitive to variations in the encryption operations, and triggering an encryption can introduce noise. The Prime+Probe variant gives more control over the timing and is less sensitive to these kinds of variations. The following are the steps of Evict+Time:

1. Trigger an encryption.

2. Evict. The attacker evicts some cache lines.

3. Time. Trigger an encryption again and measure the time difference.

One example of a flush-based attack is Flush+Reload [27]. In this case, cache content is removed using an x86 instruction called clflush, and the content is then reloaded while measuring the time. The access time reveals whether the cache line has been brought back into the cache or not. clflush flushes a specific cache line from all levels in the cache hierarchy. Because of that, this type of attack has the advantage of a higher granularity than the previously described attacks, which target a whole set instead of a single line. The three steps of the attack are:

1. Flush. Use clflush to evict a cache line.

2. Wait. The attacker waits for some time period.

3. Reload. Access the cache line and measure the access time.


Figure 7: The three steps of Flush+Reload. In the first step, the clflush instruction is used to force eviction of a cache line. After the wait step, the attacker measures the time it takes to access the same cache line. If it has been brought back by the victim, the access time will be short; if it has to be fetched from main memory, the access time will be longer.

Flush+Reload can for example be used in exploits of the Rowhammer [20] bug, where bits in the main memory can be flipped by repeatedly doing Flush+Reload on surrounding memory rows. However, Flush+Reload relies on the clflush instruction, which is a highly specific instruction in the x86 instruction set architecture. It also depends on memory sharing when attacking across VMs, which is disabled on many cloud platforms [12]. Memory sharing is an optimisation technique used by the operating system. In short, it makes VMs share a write-protected copy of a virtual memory page, instead of having identical separate copies. This is good for performance but, unfortunately, opens up for attacks such as Flush+Reload. Both Prime+Probe and Evict+Time require the existence of an eviction set, which is defined next.

2.3.3 Eviction Sets

To be able to perform Prime+Probe and Evict+Time attacks, one needs to calculate so-called eviction sets. For a W-way associative cache, an eviction set is a collection of at least W memory addresses that map to the same cache set. This means that if all addresses in the eviction set are accessed, all the previous content of that cache set will be evicted. Apart from their use in timing attacks against the LLC, eviction sets are also used in Rowhammer attacks [4] (in case we cannot or do not want to use Flush+Reload), as well as in so-called speculative attacks [24]. One way of defining an eviction set is as a collection of addresses that are congruent [24] with each other, that is, addresses that map to both the same cache set and the same cache slice, if we have a sliced LLC. To be able to perform efficient attacks, we want the eviction set to be minimal, that is, as small as possible while still being an eviction set. An overly large eviction set would also introduce too much noise, apart from being inefficient.

Methods to find eviction sets can be divided into dynamic and static [4]. The static approach uses information about the mapping from virtual to physical addresses and the hash function for mapping into LLC slices. While the mapping to LLC slices can be trivial in some architectures, more recent (Intel) architectures use so-called complex addressing, making it harder to use a static approach if the addressing hash has not been reverse engineered. Some mitigation approaches are based on randomising the mapping from physical address into the cache, as presented in CEASER [17]. The remapping is then done periodically, meaning that the reversing of the mapping needs to be redone within a very short period. In this case, a dynamic method is better to use. Dynamic methods do not require any information about the indexing or slice mapping. However, the eviction sets have to be recalculated for each execution, making dynamic methods slower than a static approach [12], where the mapping is already known. The two algorithms presented in this thesis are examples of dynamic methods for finding an eviction set.

2.4 Algorithms for Finding Eviction Sets

In this section, the two different algorithms for finding eviction sets are presented. They will here be called the Baseline algorithm and the Group testing algorithm, respectively. Both start with a large set of addresses that is iteratively pruned down to a minimal core. Some difficulties of targeting the LLC, as well as challenges for the algorithms introduced by hardware features, are also discussed.

2.4.1 Baseline Algorithm

Vila et al. [24] describe two variants of an algorithm for finding minimal eviction sets. The first one is referred to as the "Baseline algorithm". It was introduced by Liu et al. [10] to perform Prime+Probe against the LLC, and has also been used, for example, by Oren et al. [14] to perform a web-based attack.

The algorithm starts with a large set of addresses (S) and a victim address (x) that we want to be able to evict. S is an eviction set for x, but not a minimal one. That is, if we access all addresses in S, x is evicted from the cache memory, but since most of the addresses do not contribute to the eviction, the set is too inefficient to be used in an attack. The output of the algorithm is a minimal eviction set for x, called R.

The main idea is to pick one address c from S at a time and check if S without c, combined with the previously found members of the minimal eviction set, still evicts x from the cache. If not, it means that c was necessary for S to be an eviction set, and c is added to R. The tested address c is removed from S and the procedure starts over, until R contains as many elements as the cache has ways.

This algorithm takes O(n^2) accesses to form the minimal eviction set, where n is the size of the initial set S. That is, the number of memory accesses grows quadratically with the starting set size n.

Algorithm 1: Baseline algorithm from Vila et al. [24]
input : S = starting set, x = victim address
output: R = minimal eviction set

1  R = {}
2  while |R| < W do
3      c = pick one address from S
4      if R ∪ (S \ {c}) does not evict x then
5          R = R ∪ {c}
6      end
7      S = S \ {c}
8  end
9  return R
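To make the reduction loop concrete, the following self-contained C sketch simulates the baseline algorithm on a toy model: "addresses" are plain integers, and the timing-based eviction test is replaced by a direct congruence check. The helpers evicts() and same_set() are invented for this illustration; the thesis implementation instead uses linked lists of real addresses (see Section 4.2):

#include <stdio.h>
#include <stdlib.h>

#define NUM_SETS 64
#define WAYS 4      /* associativity W */
#define START_N 256

/* Toy model: an address is congruent with the victim if it maps to the
 * same simulated cache set. */
static int same_set(unsigned a, unsigned victim) {
    return a % NUM_SETS == victim % NUM_SETS;
}

/* Stand-in for the timing test: a set evicts the victim iff it contains
 * at least W congruent addresses. */
static int evicts(const unsigned *s, int n, unsigned victim) {
    int congruent = 0;
    for (int i = 0; i < n; i++)
        congruent += same_set(s[i], victim);
    return congruent >= WAYS;
}

int main(void) {
    unsigned victim = 7;
    unsigned S[START_N];          /* starting set S */
    int n = START_N;
    for (int i = 0; i < n; i++)
        S[i] = (unsigned)rand();  /* assumed to make S an eviction set */

    unsigned R[WAYS];             /* minimal eviction set being built */
    int r = 0;

    while (r < WAYS && n > 0) {
        unsigned c = S[n - 1];    /* line 3: pick one address from S */
        unsigned probe[START_N + WAYS];
        int m = 0;
        for (int i = 0; i < n - 1; i++) probe[m++] = S[i]; /* S minus c */
        for (int i = 0; i < r; i++)     probe[m++] = R[i]; /* union R   */
        if (!evicts(probe, m, victim))  /* line 4: c was necessary */
            R[r++] = c;                 /* line 5 */
        n--;                            /* line 7: S = S minus c   */
    }
    printf("found %d of %d congruent addresses\n", r, WAYS);
    return 0;
}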

In Figure 8, we can see a small example run of the baseline algorithm. In this case the associativity is 2, so we want to find 2 green elements (marked with "!"), which will form a minimal eviction set. The following explains the different subfigures and how they connect to the pseudocode of Algorithm 1.

(a) First, we take out one element and test if the remaining set is an eviction set, that is, if it evicts the target address from the cache (lines 3-4 in the pseudocode). Here it is true, since we have more than 2 green elements left. We remove the element, since it was not needed (line 7).

(b) Next, we take out another element, which this time happens to be green. However, it is not necessary in the minimal set, since we only need two green elements and we still have two in the remaining set. We remove this tested element as well.

(c) When we find an element whose removal makes the remaining elements no longer form an eviction set, we save it to R (line 5). R is here illustrated with the dashed box. The next time we test for eviction, we combine the remaining elements in the set with the already found and saved elements in the dashed box (as on line 4 in the pseudocode).

(d) Here, the eviction test will be true, since we have two green elements combined, and the tested element can be removed.

(e) In this case, the test will be false, since only one green element remains if we take out one of the green elements. We save the tested element to R.

(f) When the number of found elements equals the associativity (in this case 2), we stop (line 2 in the algorithm).


Figure 8: Example run of the baseline algorithm with associativity 2. The green elements (marked with "!") represent addresses that map to the same target cache set and evict some victim address in that cache set. The red elements map to other cache sets and will not be part of the found eviction set.

To test whether a set evicts x, the following test is made. In short, the victim address is first accessed, to make sure that it is in the cache to begin with. Then each element S_i of the tested set is accessed. Lastly, the victim is accessed again while the time is measured. If the time taken is larger than some threshold (a previously calculated average time for a cache hit), return 1; this means that the tested set does evict the victim x. (Note that this is the general method, but since I use a cache simulator in this project, I will not use timing, see Section 4.3.)

Algorithm 2: Testing for eviction
input : S = tested set, x = victim address
output: 1 if x was evicted, otherwise 0

1  access(x)
2  for i = 1 to |S| do
3      access(S_i)
4  end
5  access_time = time(x)
6  return access_time > threshold
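On real x86 hardware, this test is typically implemented with serialised time stamp counter reads and a calibrated threshold. The following C sketch shows one common way to do this using compiler intrinsics; the threshold value is an assumption and must be calibrated per machine. (In this project the test is instead a direct lookup in the simulated cache, see Section 4.3.)

#include <stdint.h>
#include <x86intrin.h>

#define THRESHOLD 120 /* cycles; assumed value, must be calibrated */

/* Serialised timing of a single load, using the time stamp counter and
 * fences to limit re-ordering around the measurement. */
static inline uint64_t timed_access(volatile char *p) {
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    (void)*p;                  /* the access being timed */
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_mfence();
    return t1 - t0;
}

/* Algorithm 2 on hardware: access the victim, access every element of
 * the tested set, then time the victim reload. Returns 1 if the set
 * evicted the victim. */
static int test_eviction(volatile char **set, int n, volatile char *victim) {
    (void)*victim;             /* line 1: bring the victim into the cache */
    for (int i = 0; i < n; i++)
        (void)*set[i];         /* lines 2-4: access the tested set */
    return timed_access(victim) > THRESHOLD; /* lines 5-6 */
}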

2.4.2 Group Testing

The second algorithm is based on group testing [2] and reduces the number of memory accesses needed to find minimal eviction sets to O(n), where n again is the size of the initial set S. Group testing can be applied to many different areas, for example blood testing, where one wants to reduce the number of tests that need to be done. It can essentially be seen as the problem of finding the positive elements in a set of mixed positive and negative elements in as few steps as possible. If a test shows that a group contains no positive elements, the whole group can be discarded (for example, a negative blood test for a whole group). That way, one does not need to test each element separately.

2.4.3 Group Testing Algorithm

The problem of testing whether a set S evicts x can be seen as a group test [24]. It can be shown that if we partition S into W + 1 subsets (where W is the associativity), we can always find at least one subset whose removal leaves an eviction set. This is because we only need W elements that map to the same set (and possibly slice) as the victim in our final set, and we have W + 1 groups. Even in the worst case, where W of the groups contain one such element each, the remaining group can be removed, regardless of whether it also contains elements with the desired property.

The group testing algorithm leverages this principle. It is given the same input as before, that is, the initial set S and the victim address x. The algorithm splits the set into W + 1 subsets. It iterates over the subsets and tests whether S without the subset still evicts x. If so, the subset is removed from S. Otherwise, it keeps testing the other subsets until one that fulfils the requirement is found. Then S is once again split into W + 1 groups, and the same process starts all over. This is repeated until there are W elements left in S, that is, until S is a minimal eviction set. This way, we can discard a whole group of elements per iteration of the algorithm, instead of only one element as before.

Algorithm 3: Group testing algorithm from Vila et al. [24]
input : S = starting set, x = victim address
output: S = minimal eviction set

1  while |S| > W do
2      {T_1 ... T_(W+1)} = split S into W + 1 groups
3      i = 1
4      while S \ T_i does not evict x do
5          i = i + 1
6      end
7      S = S \ T_i
8  end
9  return S
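The following self-contained C sketch simulates this reduction on the same toy integer-address model as the baseline sketch above. Splitting into W + 1 groups is done by simple index striping, and evicts() again stands in for the real timing test; none of these names come from the thesis implementation:

#include <stdio.h>
#include <stdlib.h>

#define NUM_SETS 64
#define WAYS 4
#define START_N 512

/* Same stand-in test as in the baseline sketch. */
static int evicts(const unsigned *s, int n, unsigned victim) {
    int congruent = 0;
    for (int i = 0; i < n; i++)
        congruent += (s[i] % NUM_SETS == victim % NUM_SETS);
    return congruent >= WAYS;
}

int main(void) {
    unsigned victim = 7, S[START_N];
    int n = START_N;
    for (int i = 0; i < n; i++)
        S[i] = (unsigned)rand();

    while (n > WAYS) {                   /* line 1: while |S| > W */
        const int groups = WAYS + 1;     /* line 2: split into W+1 */
        unsigned rest[START_N];
        int removed = 0;
        for (int g = 0; g < groups && !removed; g++) {
            int m = 0;                   /* build S without group g; */
            for (int i = 0; i < n; i++)  /* element i belongs to group */
                if (i % groups != g)     /* i mod (W+1) */
                    rest[m++] = S[i];
            if (evicts(rest, m, victim)) {  /* line 4: g is removable */
                for (int i = 0; i < m; i++)
                    S[i] = rest[i];         /* line 7: S = S \ T_g */
                n = m;
                removed = 1;
            }
        }
        if (!removed)  /* only happens if S was not an eviction set */
            break;
    }
    printf("reduced to %d addresses\n", n);
    return 0;
}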

In Figure 9 we can see a small example of the group testing algorithm, with associativity W = 2. The green elements (marked "!") represent addresses that map to the same set as the victim address. The following is an explanation of the subfigures presented.

(a) The initial set is first divided into W + 1 = 3 groups (line 2 in the algorithm).

(b) When holding out the first group (line 4), the remaining elements still form an eviction set. The first group can be removed (line 7).

(c) The groups are combined and split into 3 new groups.

(d) The first group can be removed, since the remaining set still evicts the victim (we have enough green elements).

(e) The process is repeated one last time, by splitting the remaining elements into 3 groups. Holding out either of the first two groups leaves a set that does not evict the victim, since it no longer forms an eviction set. However, when holding out the third group (containing the red element), the test shows that the remaining set does evict the victim.

(f) After removing that group, we are left with only 2 elements, and the algorithm ends. The two elements, or the addresses they represent, constitute the minimal eviction set.


Figure 9: Example run of the group testing algorithm with associativity 2. The green elements (marked with "!") represent addresses that map to the same target cache set and evict some victim address in that cache set. The red elements map to other cache sets and will not be part of the found eviction set.

2.4.4 Targeting Last Level Cache - Purpose and Challenges

The described Prime+Probe attack and the eviction set algorithms can be used on any cache level in the hierarchy. However, attacks against the LLC benefit the most, since the LLC is too large to repeatedly evict and probe in its entirety, which can be feasible for a smaller cache. The LLC also opens up for cross-core attacks, for example between different virtual machines in a cloud service environment, since different cores share this cache memory, as mentioned in Section 2.1.1.

However, since the LLC is larger than, for example, the L1 and L2 caches (in a 3-level cache hierarchy), it is slower to attack using Prime+Probe [10]. This is the reason a minimal eviction set is needed: to be able to target specific cache sets instead of the whole cache. Performing Prime+Probe on a single cache set can also be slower on the LLC than on, for example, the L1 cache. One reason is that the larger cache typically also has higher associativity, so it takes more memory accesses to fill a set or evict a cache line. Another reason is the longer latency of accessing the LLC, especially on a cache miss (which leads to fetching the data from main memory).

Another issue is that the LLC is physically indexed. For a virtually indexed cache, it is possible to directly calculate addresses that form an eviction set. The physical addresses are not directly accessible, and we might not know the mapping from virtual to physical addresses. Even if we do, we might not know the hash function for the LLC slices. To overcome this, it is possible to calculate the set by reversing the mappings (static method), or by using the previously discussed algorithms (dynamic method). Using large pages can also help with this problem (see Section 2.2.2).

2.4.5 Implementation Complications

Modern hardware has many optimisation features that make programs run more efficiently. However, these features might affect the results of the previously described algorithms, unless some adaptations are made. This section covers what one needs to consider when implementing the algorithms on real hardware. Note that when evaluating the algorithms in this project, a software-based cache simulator is used, avoiding the effect of most of these complications.

Cache prefetching means that data or instructions are loaded into the cache before they are needed. That way they will already be available in the cache and we will not suffer a cache miss. Prefetching can be implemented either in hardware or in software. For example, when making accesses in a loop with predictable offsets, it can be easy to predict the upcoming accesses. When running the eviction set algorithms, we want to avoid prefetching, since it can interfere with the algorithms by bringing unintended content into the cache and giving false results. One way to avoid this is to access the elements in a random order [10].

Instruction re-ordering and out-of-order execution are other features that optimise performance. In short, when a program is compiled it is broken down into separate instructions that the processor can execute. Instead of executing them in exact program order, the processor can re-order instructions that are not dependent on each other, as long as the final result is the same. This way, the processor can use time slots that would otherwise be empty. However, in some parts of the algorithm we do not want re-ordering, for example when performing the time measurements used to check for eviction. Unwanted re-ordering can be avoided, for example, by using memory fences [10].

When running the algorithms, many memory accesses are made, which also results in more accesses to the TLB. Some of these will be misses, resulting in a page walk (see Section 2.2.2), which gives higher latency. This can introduce false positives/negatives (a timing result could be interpreted as the element not being in the cache, while the latency in reality comes from the TLB miss). One way to mitigate this is to use huge pages, since each page then covers more translations and fewer misses occur.

To be able to test whether an eviction has occurred, it is essential to have access to a good clock or timing method, whose availability depends on the platform. Since the timing can vary and be affected by a number of things, it is also important to average over a number of measurements to get a representative result.

Lastly, there can be interference or noise introduced by other processes that run on the same system and share the same cache and other microarchitectural structures.

3 Related Work

There have been many demonstrations of cache attacks, in different settings and on different platforms. The most common use case is to leak encryption keys. Some of the following examples use Prime+Probe as the method of attack, which needs an eviction set to work.

Liu et al. [10] present a method to find eviction sets without having to know the virtual-to-physical address mapping of the victim (the first version of what in this project is called the Baseline algorithm). This method is then used to launch Prime+Probe attacks against the last level cache, across VM boundaries, without relying on weaknesses in the VM itself. Qureshi [18] proposed a new version of this algorithm that reduced the time complexity of creating such eviction sets from O(N^2) to O(N). This would make it harder to perform randomisation-based mitigation, in this case by using cache encryption. Qureshi also presents a method for creating eviction sets that leverages the replacement policy. In parallel, Vila et al. [24] also presented an algorithm to find minimal eviction sets, based on threshold group testing. This is similar to the method used by Qureshi, and also reduced the time complexity to linear. The difference is that Vila et al. focus more on mathematically proving and evaluating the eviction set algorithm, while Qureshi focuses more on improving the mitigation technique so that it still holds when the time taken to find eviction sets is reduced. In this thesis, the algorithms are based on the work of Vila et al., but the idea is similar to the one by Qureshi.

Maurice et al. [12] built an automatic method for reverse engineering the complex addressing (to LLC slices) in Intel processors, using performance counters. Knowing the addressing, one can compute an eviction set and finally perform a Prime+Probe attack. This is a static approach to finding eviction sets. The results from the reverse engineered complex addressing scheme were then used in another work, by Gruss et al. [4]. In this case, the Rowhammer bug was exploited, but from Javascript. Most previous works on the Rowhammer attack, in contrast, use clflush, which is not available from Javascript. The paper also defines the concept of eviction strategies, and tries to find a strategy that works well for undocumented/adaptive replacement policies as well as variants of pseudo-LRU.

Osvik et al. [15] were the first to present Prime+Probe as well as Evict+Time, and used them to perform attacks against AES encryption. Numerous countermeasures are also discussed. Yarom and Falkner [27] presented the Flush+Reload attack as an alternative to Prime+Probe, with higher resolution, since it can target specific cache lines instead of cache sets. This attack uses the clflush instruction to perform the eviction and depends on memory page sharing to work.

Oren et al. [14] build on the work by Liu et al. and use Prime+Probe to build an attack against the LLC. The difference is that it does not assume the victim to be in close proximity to the attacker (e.g. on the same machine); it can be executed remotely, e.g. over the web using Javascript. It also does not require support for large pages, as in Liu et al. The result is that it is possible to track user behaviour. The paper discusses some difficulties introduced when using Javascript, for example the lack of access to physical or even virtual addresses, as Javascript has no notion of pointers. It is also harder to use memory barriers and to get high-resolution timing. The paper discusses the effect of the mapping strategy and replacement policies on cache behaviour, as well as different types of noise that affect side channel attacks and how to mitigate them. It also discusses countermeasures on the Javascript level as well as the hardware level.

Percival [16] uses a method similar to Prime+Probe to perform cache side channel attacks in the presence of "Hyper-threading", or simultaneous multithreading, which allows instructions from different threads to execute concurrently to better utilise resources. In that case, the threads also share access to caches, which makes the attacks possible. Countermeasures on the hardware, OS and library levels are proposed.

Attacks on portable code were examined by Genkin et al. [3]. Instead of Javascript, which is a lot less efficient than native code, the attacks are built around PNaCl and WebAssembly, which are somewhat more efficient. The algorithm for finding eviction sets is also based on Liu et al. [10], but slightly adjusted for portable settings. This is used to extract cryptographic keys from various libraries. The paper also discusses how to handle TLB noise.

Lu Shiting [11] wrote a master's thesis on data cache based timing attacks (CBTA). Both methods and countermeasures are described, and a hardware-based protection method is evaluated. Another master's thesis, by Sai Prashanth Josyula [7], explores the possibility of using cache side channel attacks against the Elliptic Curve Digital Signature Algorithm (ECDSA), using Flush+Reload.

3.1 Characterising the Cache - Microbenchmarks

A memory microbenchmark program can be used to analyse a memory architecture. For example, it is possible to create a graph of the access times for different memory accesses by stepping through an array and accessing one element of the array at a time [25]. Both the array size and the stride (step size) are varied. The measurements are repeated many times, to get an average access time.

From the output graph it is possible to draw conclusions about the cache hierarchy, such as the associativity, size, line size and access time of the L1, L2 and LLC caches of the system. One can also get information about the TLB entries, associativity and lookup time, as well as the page size and main memory access time.

As an example, if we use a vector that is smaller than the cache size, we will have a 100 % hit rate regardless of the stride (except for the first time we go through the vector), since the whole vector fits in the cache. If we instead use a vector that is larger than the cache, the hit ratio can tell us about the cache line size. If we have a 50 % hit rate, we know that we access each cache line twice, where the first access is a miss and the second one is a hit. That is because the second element is brought in through spatial locality, since it is on the same cache line. If we increase the stride and get to a 0 % hit rate, we know that we are accessing each cache line only once, which means that we have found the cache line size. In a similar way, we can also draw conclusions about the cache sizes.
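A minimal version of such a microbenchmark could look like the following C sketch. The timing is x86-specific (__rdtsc()), and the array size, stride ranges and repetition count are all illustrative:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define REPEAT 10 /* repetitions; averaging amortises compulsory misses */

static volatile char array[16 * 1024 * 1024]; /* larger than a typical LLC */

static double cycles_per_access(size_t size, size_t stride) {
    uint64_t start = __rdtsc();
    uint64_t accesses = 0;
    for (int r = 0; r < REPEAT; r++)
        for (size_t i = 0; i < size; i += stride) {
            (void)array[i]; /* volatile read: cannot be optimised away */
            accesses++;
        }
    return (double)(__rdtsc() - start) / (double)accesses;
}

int main(void) {
    /* Sweep array size and stride; knees in the resulting curves reveal
     * the cache sizes, and the stride behaviour reveals the line size. */
    for (size_t size = 4096; size <= sizeof(array); size *= 2)
        for (size_t stride = 4; stride <= 512; stride *= 2)
            printf("size=%zu stride=%zu cycles/access=%.2f\n",
                   size, stride, cycles_per_access(size, stride));
    return 0;
}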

4 Method

In this section, the tools and methods used for the implementation and evaluation are presented. First a general description of Pin and the cache simulator is given, followed by an explanation of the decisions made when implementing the algorithms. Sections 4.3 and 4.4 describe the changes made to be able to determine whether an access is a hit or a miss, and to be able to measure the number of memory accesses. Finally, the tests run to compare the algorithms and evaluate the characterisation tool are described.

In previous sections, the advantages and issues of attacking the LLC have been discussed. However, when evaluating the algorithms using Pin, an L1 data cache is used for simplicity. This also means that we do not check slice indexing, since the L1 cache does not use slices.

4.1 Pin

Pin is a free-to-use tool for instrumentation and analysis of binaries, developed by Intel [5]. Instrumentation here means allowing the user to insert arbitrary code dynamically while running a program. By doing this, one can gain information about the execution of the program, such as the number of memory accesses made or other types of data. The user program will here be referred to as an application.

Pin can be seen as a just-in-time compiler which takes an executable application as its input. The original code is used as a reference as Pin generates a new sequence of code. This new code is what actually gets executed, and into it the user can inject their own instructions through instrumentation. Instrumentation consists of two parts: the instrumentation part tells Pin where to insert code (such as calls to other functions), while the analysis part refers to the actual functions that will be called or inserted. A Pintool contains both of these parts. It is possible to use the Pintools that are provided, or to write your own. The instrumentation code is only called once, while the analysis code is called each time the instrumented instruction is executed. Instrumentation can be done at different granularity levels; for example, for each routine (function) of a program, or for each instruction. It is also possible to replace an existing function in an application with a function that exists in the Pintool, using a built-in replacement mechanism.

Pin is in this case used to simulate a cache, to be able to analyse in detail what the algorithms do to the cache. By utilising this, we can compare the algorithms in a less noisy setting, avoiding some of the more complicated optimisation features of modern processors. For this purpose, a built-in cache simulator Pintool is used, customised for this project.

The cache simulator works in the following way. Each instruction in the application is analysed to see if it is a memory access, and whether it is a read (from memory) or a write (to memory). Depending on this, a call to a special function is inserted. The function stores the tag and the index part of the address that the access relates to. Since the program still gets its data from the "real" hardware cache, the actual data does not need to be stored in the simulated cache. When determining whether an access is a hit or a miss, or which data to replace, the simulated cache only needs to look at the stored address tags.

4.2 Implementation Decisions

The algorithms are implemented in C. The reason for using C is that it is suitable for low-level memory management. It is also compatible with the cache simulator tool that is used, which is written in C++. When needed, inline assembly is used in combination with C. The reason for this is to be able to control the exact instructions used and to prevent the compiler from removing important parts of the code. This is for example needed when doing memory accesses "on purpose", whose only role is to bring content into the cache: if the values we access are never used, the compiler could choose not to include these accesses, since they do not seem to be needed.

Several previous works [24, 14, 10, 22] have implemented eviction sets as linked lists, where each address is associated with a pointer to the next one in the set. The order of the addresses is also randomised. This is done to reduce the effect of hardware prefetching, and to make sure that all memory loads are executed in order [24]. In this project, the eviction sets are also implemented as linked lists. Even though prefetching has no effect on a simulated cache, this representation still lets us access all the elements in the set in an easy way, by simply traversing the linked list. Since the addresses in the set are the same as the addresses of the nodes in the linked list, no "extra" memory accesses are needed to reach the addresses in the set, as opposed to storing them in, for example, an array. With a linked list it is also convenient to split the set into groups, which is needed when implementing the group testing version of the algorithm. Some utility functions for the linked list are borrowed from the published source code of [24], which can be found on GitHub [23]. Each node in the linked list is padded to the size of a cache line, so that each cache line contains only one element.
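A sketch of what such a padded linked-list node can look like in C (the type and field names are ours, not taken from the thesis code):

#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* bytes; assumed to match the simulated cache */

/* One element of the eviction set. The node's own address is the address
 * that participates in the set; padding makes each node fill exactly one
 * cache line, so no two elements share a line. */
typedef struct node {
    struct node *next;
    char pad[CACHE_LINE_SIZE - sizeof(struct node *)];
} node_t;

_Static_assert(sizeof(node_t) == CACHE_LINE_SIZE,
               "node must be exactly one cache line");

/* Accessing every address in the set is one pointer-chasing traversal. */
static void access_all(const node_t *head) {
    for (const node_t *n = head; n != NULL; n = n->next)
        ; /* the load of n->next is the memory access we want */
}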

4.3 Method to Determine Cache Presence

When running dynamic methods for finding eviction sets on a real hardware cache, timing measurements are necessary to determine if an access is hitting or missing in the cache. That is, if the access takes longer time than some threshold, we know it is not in the cache and it is a miss, otherwise it is a hit. This can be done using for example a processor time stamp counter [21, 16], or

some other type of clock. Reducing the accuracy of the available clock has been proposed as a countermeasure against timing attacks [22]. This can however be hard to achieve, since the reduction in accuracy can affect other programs that need the clock, and an attacker can overcome it by averaging over many samples.

On the simulated cache it is not possible to use the same timing-based techniques as on a hardware cache. Since no “real” memory accesses are made when using the Pin cache, measuring the time would only tell us how long the work of Pin takes, which is not necessarily connected to a hit or a miss in the cache. Instead, it is possible to directly check whether an address is in the cache, rather than inferring it from timing. This is possible because we have control over the content of the simulated cache, which we do not have on a real system. In this case it is done by adding a function inside the Pintool that checks for the requested address tag in the data structure that stores the tags. An empty function called inCache() is then added to the application, and in the instrumentation this function is replaced with the Pintool function. By doing this, one could in future work extend the program to use one type of inCache function when run on a real system (using timing), and another when run in the simulated environment (using the Pintool function).

The inCache() that is sent to Pin will however always give the correct information about the content of the simulated cache, while a function that uses timing on the hardware cache might be affected by noise.
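A sketch of how such a function replacement can be set up with Pin's routine-replacement facility follows; the Pintool-side lookup, SimulatedCacheContains, is an assumed placeholder for the tag-store check:

#include "pin.H"

// Assumed Pintool-side check against the simulated cache's tag store.
extern bool SimulatedCacheContains(void *addr);

// Replacement for the application's empty inCache() stub.
int PinInCache(void *addr) {
    return SimulatedCacheContains(addr) ? 1 : 0;
}

// Image-level instrumentation: when the application image is loaded,
// find its inCache() routine and replace it with the Pintool version.
VOID Image(IMG img, VOID *v) {
    RTN rtn = RTN_FindByName(img, "inCache");
    if (RTN_Valid(rtn))
        RTN_Replace(rtn, (AFUNPTR)PinInCache);
}

// Registered in main() with IMG_AddInstrumentFunction(Image, 0);
// PIN_InitSymbols() must be called first for the name lookup to work.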

4.4 Method of Measuring Number of Memory Accesses

The performance of the eviction set algorithms is defined in terms of the number of memory accesses, in relation to the size of the starting set. In order to confirm this experimentally, a method to measure the number of memory accesses is needed. This functionality is added to the Pintool, in the form of a counter that is incremented each time a read or write is made to the simulated cache. The result is written to an output file at the end, for further processing.

Which part of the program gets measured is controlled by toggling a variable. In this case, the interesting part is the function that tests whether a set is an eviction set, since that is where all “intentional” memory accesses are made, that is, the accesses made to move data into the cache in order to potentially evict the victim. When running the whole program, many other accesses will be made, but most of them are not relevant to the performance of the algorithm, and therefore not relevant when comparing the algorithms.
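A minimal sketch of the counting mechanism, extending the CacheAccess sketch from Section 4.1 (identifiers are illustrative):

// Incremented from the analysis routine that models the cache access;
// 'counting' is the toggle that selects which part of the application
// is measured.
static UINT64 accessCount = 0;
static volatile BOOL counting = FALSE;

VOID CacheAccess(VOID *ea, BOOL isWrite) {
    if (counting)
        accessCount++;
    // ... normal simulated-cache lookup/update for 'ea' ...
}

// The toggle is flipped around the eviction-testing function, and the
// final value of accessCount is written to the output file at the end.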

4.5 Comparing the Algorithms

One goal of the project was to compare the two algorithms against each other. To do so, the two implementations are run separately, while varying different

parameters, and either counting the number of successful runs or the number of memory accesses.

4.5.1 Parameters

It is possible to vary some parameters when running the cache simulator. The following parameters can be changed with command line arguments, given as flags to the Pintool (a sketch of how such flags can be declared follows the list):

• -a Associativity (1 gives a direct-mapped cache)
• -b Cache block size (line size)
• -c Cache size (in kB, must be a power of 2)
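Such flags could, for example, be declared with Pin's KNOB mechanism, roughly as follows (the default values here are illustrative assumptions):

#include "pin.H"

// Command-line flags for the cache simulator Pintool.
KNOB<UINT32> KnobAssoc(KNOB_MODE_WRITEONCE, "pintool",
                       "a", "8", "cache associativity (1 = direct mapped)");
KNOB<UINT32> KnobLineSize(KNOB_MODE_WRITEONCE, "pintool",
                          "b", "64", "cache block size in bytes");
KNOB<UINT32> KnobCacheSize(KNOB_MODE_WRITEONCE, "pintool",
                           "c", "32", "cache size in kB (power of 2)");

// The values are read after PIN_Init(), e.g. KnobCacheSize.Value().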

In the algorithms, it is also possible to vary the following parameters:

• Size of the initial set S

• Which method of reduction is used (group testing or regular)

4.5.2 Testing Correctness

Tests were made to measure how often the algorithms return the correct result. This was done for each of the two algorithms. The test first calls the function that finds a minimal eviction set for a given victim address. It is then checked whether the found set does evict the victim; if it does, the test program returns true.

The cache parameters (size and associativity) were varied in different combinations. The cache size was set to the values in [2, 4, 8, 16, 32] kB, while the associativity was set to the values in [2, 4, 8, 16, 32]. Each combination was tested 10 times, and the number of tests that returned successfully, that is, found a correct eviction set, was recorded. The results are presented in Section 5.1.1. Note that it is normal for the algorithms to sometimes fail to find a correct minimal eviction set, due to memory accesses other than those intended by the algorithms being made, or to general noise. These factors would be even worse on a real hardware cache.
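One such trial can be sketched as follows, reusing node_t and traverse() from the sketch in Section 4.2 (the other helpers stand in for the project's actual routines and are assumptions):

/* Assumed helpers, standing in for the actual implementation. */
extern node_t *find_minimal_eviction_set(node_t *S, void *victim);
extern void    access_victim(void *victim);  /* load victim into cache */
extern int     inCache(void *addr);          /* Pintool-backed check   */

/* One trial: returns 1 if the found set really evicts the victim. */
int run_trial(node_t *S, void *victim) {
    node_t *es = find_minimal_eviction_set(S, victim);
    access_victim(victim);    /* bring the victim line into the cache  */
    traverse(es);             /* access every address in the found set */
    return !inCache(victim);  /* evicted => the eviction set works     */
}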

4.5.3 Testing Performance and Scaling

To test and compare the performance of the two algorithms, the method described in Section 4.4 is used to count the number of memory accesses. Only the part containing “intentional” memory accesses is measured (the function that accesses the victim and all elements in the set). This is because it can be assumed

that the number of other memory accesses would be approximately the same for both of the algorithms.

To examine the performance under different cache parameters, cache size and associativity were both varied over the values [2, 4, 8, 16, 32], where the cache size is in kB. Each combination was run 100 times, from which the mean value and standard deviation were calculated. The results are presented in Section 5.1.3.

The parameter that could affect the performance and scaling the most is the size of the starting set S. To test a larger cache size, we would need a larger number of elements in the initial set, to make sure that the initial set is indeed an eviction set, although a non-minimal one. A larger starting set will take longer to prune down to a minimal core. Because of this, the performance under different starting sizes was compared. The size of S was varied over the values [50, 100, 500, 1000, 2000, 4000, 6000, 8000, 10000, 20000]. The other parameters were kept the same, with the associativity set to 2 and the cache size set to 1 kB, which were the smallest values possible while still having a set-associative cache. The results are presented in Section 5.1.2.

4.6 Characterising the Cache

The second part of this project is about finding a method for reverse-engineering the cache parameters, using the two eviction set algorithms. This section discusses the different methods tested to accomplish this. Both the baseline algorithm and the group testing algorithm are used when testing these methods. The parameters that we want to find are:

1. The associativity of the cache
2. The size of the cache

Here we assume that the cache line size is 64 B. This is the most common cache line size, but the program can easily be altered to use a different cache line size as well.

4.6.1 Using Baseline Algorithm

Associativity: If the baseline algorithm is slightly modified to look at all elements in the set, instead of stopping when it has found the right number of elements, it can work without knowing the associativity beforehand. The number of elements found by the algorithm will then be equal to the associativity of the cache. This test needs to be run multiple times, in case some of the runs are unsuccessful. This is even more important on a real system, where more noise and interference are present.

To implement this, a linked list is used as before, where each node represents one address in the set. Each node is tested as described in Section 2.4.1

to see if it contributes to the eviction set. If it does, it is appended to the end of the set; otherwise, it is removed. This process is repeated until all elements in the original set have been tested (and either removed or appended to the end). This way, the final set will consist of only the elements of the minimal eviction set. The size of this set can then be measured, which, as mentioned, should equal the associativity. Not stopping once the right number of elements has been found might lead to a somewhat longer average run-time; the maximum run-time should however be the same.

Size: The number of entries in the cache (the number of possible indexes) is defined as ENTRIES = SIZE / LINESIZE / ASSOC, where ASSOC stands for associativity. The offset from one address to the next address that maps to the same index/entry in the cache is then OFFSET = ENTRIES * LINESIZE = SIZE / ASSOC. The size can thus be derived as SIZE = ASSOC * OFFSET.

Assume that the addresses are chosen non-randomised from the same memory area and tested sequentially for eviction, as above. Then the found addresses in the final eviction set will be at a constant offset from each other. Using the relations above and the method for finding the associativity in the previous section, we can find the size of the cache. For example, if running the algorithm finds 4 addresses with an offset of 2048 from each other (e.g. addr1 − addr2 = 2048, etc.), we can draw the conclusion that the cache size is 2048 * 4 = 8 kB. Figure 10 shows an example of the method to find the associativity using the baseline algorithm, with associativity 2.

Algorithm 4: Finding size and associativity with the baseline algorithm
input : S = starting set, x = victim address
output: 1 if successful, otherwise 0

1  baseline_algorithm(S, x)          // reduces S to a minimal eviction set
2  guessed_assoc = len(S)
3  offset = addr(S) − addr(S→next)   // second found address
4  guessed_size = offset * guessed_assoc
5  return guessed_assoc == real_assoc && guessed_size == real_size   // only for verification
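A minimal C sketch of this derivation, reusing the node_t type from Section 4.2 (assuming the reduced set is handed over after the modified baseline run, with at least two nodes at a constant stride):

#include <stdlib.h>   /* labs */

/* Derive associativity and size from a reduced, minimal eviction set. */
void guess_parameters(node_t *S, size_t *assoc, size_t *size) {
    size_t n = 0;
    for (node_t *p = S; p != NULL; p = p->next)
        n++;                               /* set length = associativity */
    long offset = labs((long)((char *)S->next - (char *)S));
    *assoc = n;
    *size  = (size_t)offset * n;           /* SIZE = ASSOC * OFFSET */
}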

Figure 10: Example of the implemented method for finding associativity using the baseline algorithm. Each circle represents one node in a linked list. The green elements, marked with “!”, are the elements that map to the desired cache set.

4.6.2 Using Group Testing Algorithm

Associativity: In the group testing algorithm, the associativity is used to define how many groups the set is split into. Since we now do not know the associativity, the algorithm needs to be adjusted. One solution is to run the algorithm multiple times, trying different values of the associativity. This was implemented by starting with an associativity guess of 1 and testing whether the found set is an eviction set. If not, the tested value is doubled for each iteration, until the test succeeds. The tested value is then the associativity of the cache.

Size: The size can be found in a similar way as with the baseline method. However, it again needs to be noted that this only works if the addresses are chosen in sequential order and a standard indexing function is used.

Algorithm 5: Finding size and associativity with the group testing algorithm
input : S = starting set, x = victim address
output: 1 if successful, otherwise 0

1   i = 1
2   while i < max_assoc do
3       reset(S)
4       grouptesting_algorithm(S, x, i)   // reduces S
5       if S evicts x then break
6       i = i * 2
7   end
8   guessed_assoc = i
9   offset = addr(S) − addr(S→next)       // second found address
10  guessed_size = offset * guessed_assoc
11  return guessed_assoc == real_assoc && guessed_size == real_size   // only for verification
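The search loop can be sketched in C as follows (reset, grouptesting_algorithm and evicts stand in for the project's actual routines and are assumptions):

/* Assumed helpers from the actual implementation. */
extern void reset(node_t *S);
extern void grouptesting_algorithm(node_t *S, void *x, size_t assoc);
extern int  evicts(node_t *S, void *x);

/* Double the associativity guess until the reduced set evicts x. */
size_t find_associativity(node_t *S, void *x, size_t max_assoc) {
    size_t i = 1;
    while (i < max_assoc) {
        reset(S);                        /* restore the full starting set */
        grouptesting_algorithm(S, x, i); /* reduce S assuming assoc = i   */
        if (evicts(S, x))
            break;                       /* guess confirmed               */
        i *= 2;
    }
    return i;                            /* the cache's associativity     */
}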

4.6.3 Evaluation

To verify whether a run is successful, the found size and associativity are compared to the actual size and associativity of the simulated cache used in that run. These values are not known to the algorithm, but can be used for verification. To evaluate the tool, its correctness and scaling were tested in the same way as described in Section 4.5.2 and Section 4.5.3. Since the correctness of the found eviction set affects the evaluation of the tool, tests were also made where the eviction set was first verified.

5 Results

In this section, the results of the performed tests are presented. The motivation for and description of the tests are given in Section 4. The results are then discussed in Section 6.

5.1 Comparing Algorithms

5.1.1 Correctness

The following is the result of 250 runs with different combinations of size and associativity, as described in Section 4.5.2. The first two rows describe the results when verifying that the found set is an eviction set, which does not need to be minimal. The last two rows show the results when also checking that the found set is minimal, meaning that we want to find as many addresses as the associativity of the cache.

                                            Baseline algorithm   Group testing algorithm
Number of successful runs                   209/250              202/250
Percentage successful runs                  83.6 %               80.8 %
Number of successful runs, minimal set      199/250              201/250
Percentage successful runs, minimal set     79.6 %               80.4 %

Table 1: Correctness of the two algorithms.

We can see that there is a minor difference between the results when checking whether the found set is an eviction set, and when also checking whether it is minimal. This indicates that there are some cases where we include elements/addresses in the eviction set that are not actually part of the minimal eviction set.

5.1.2 Scaling

N is here the size of the starting set. First we can see the scaling of the two algorithms separately, where Figure 11 shows the baseline algorithm and Figure 12 shows the group testing algorithm. We can see that the baseline algorithm shows a quadratic behaviour while the group testing algorithm shows a linear one. We can also see that the group testing algorithm in general requires far fewer memory accesses.

Figure 11: Scaling of the baseline algorithm (memory accesses vs. starting set size N).

Figure 12: Scaling of the group testing algorithm (memory accesses vs. starting set size N).

In Figure 13 we can see both of the algorithms together. Here both the x-axis and the y-axis are logarithmic, to compensate for the big differences in scale. We can see that the baseline algorithm both starts with more memory accesses and scales much more steeply.

Figure 13: Logarithmic scaling of the algorithms (memory accesses vs. N, both axes logarithmic).

5.1.3 Performance

Figures 14 and 15 show the number of (intentional) memory accesses made by the algorithms while varying the cache parameters, that is, the memory accesses made inside the algorithm when testing for eviction. The size of the starting set is here kept the same, while the associativity and cache size are varied.

The baseline algorithm, shown in Figure 14, is not much affected by variation in either cache size or associativity. This is because it mainly depends on the size of the starting set, and does not use the other parameters in the algorithm. In reality, the size of the starting set would be chosen with respect to the cache size, to make sure that we have enough addresses to form an eviction set. This means that the cache size and the size of the starting set are in fact connected, but we test them separately here, as the results would otherwise be less clear.

The group testing algorithm, shown in Figure 15, shows more variation. It is mainly the associativity of the cache that affects the performance. With a smaller associativity, the algorithm can use and remove larger groups at a time. It also has to find fewer addresses to obtain the final eviction set.

Figure 14: The number of memory accesses when varying the associativity and cache size, when running the baseline algorithm. The size is in kB.

Figure 15: The number of memory accesses when varying the associativity and cache size, when running the group testing algorithm. The size is in kB.

5.2 Characterising the Cache

5.2.1 Correctness

The following is the result of 250 runs with different combinations of size and associativity, as described in Section 4.5.2. We can see that the success rate is similar to that of the algorithms themselves, which is expected: if the found minimal eviction set is not correct, the guesses for size and associativity will not be correct either. However, if we only look at the runs where the final eviction set is confirmed to actually be minimal and to evict the victim, the success rate is 100 % for the baseline algorithm and around 95 % for the group testing algorithm. The incorrect runs are in that case probably due to missing the right associativity in one iteration. (The correctness of the eviction set was in this test only verified at the end of the algorithm. A “false” eviction set could then lead to the correct associativity being missed.)

                                           Baseline   Group testing
Number of successful runs                  209/250    202/250
Percentage, successful runs                83.6 %     80.8 %
Successful runs, eviction set confirmed    182/182    192/201
Percentage, eviction set confirmed         100 %      95.5 %

Table 2: Success rate for the cache characterisation tool

5.2.2 Scaling

The scaling of the method for finding size and associativity, while using the two different algorithms, can be seen in Figure 16. N is here the size of the starting set. We can see that even though the group testing algorithm needs to be run multiple times in this method (once for each associativity value tested), it is still significantly better than the baseline algorithm. The total number of accesses is here the sum over all the repeated runs of the group testing algorithm.

Figure 16: Logarithmic scaling of the tool when using different start sizes N.

6 Discussion

In this section, I will discuss the results found when comparing the two algorithms, as well as the results of implementing a method for characterising the cache settings. Some problems encountered along the way will be mentioned as a reference and help for similar projects in the area. Finally, I will give some suggestions for future work and extensions to this project.

6.1 Comparing Algorithms

A first observation is that it was beneficial to use the Pin cache simulator to compare the performance of the two algorithms, since it made it possible to tune different settings. This would not have been possible when running directly on hardware without switching platform entirely, which would change more than just one or two parameters. It also reduces the amount of noise present during the comparison, as well as during the evaluation of the cache characterisation.

The algorithms have a high success rate when run on the simulated cache. It is not clear what causes the unsuccessful runs, but it could be corner cases or unintended memory accesses that interfere with the runs of the algorithms. It is nevertheless a good success rate compared to how a similar program would behave on a real system, which has more noise and interference (see Section 2.4.5). In such a case, the algorithms would need to be re-run many times to ensure good results.

We can see that the variation in cache size and associativity did not affect the performance much, especially not for the baseline algorithm. Since it looks at one element at a time, it is mostly the starting set size that affects its performance. In the group testing case, the associativity has some effect on the performance, since it decides the group sizes, and thus how many elements can be removed in each iteration.

Looking at the scaling, we can see that the group testing algorithm scales significantly better than the baseline algorithm. This is expected, since the baseline algorithm has to hold out one element at a time, and for each test access all the elements in the remaining set. The group testing algorithm can do this much faster, by testing a whole group of elements at once.

There are some important consequences of this. The first one is the possibility to attack the LLC in a faster way. Before the baseline algorithm was presented, it was considered too inefficient to use Prime+Probe to attack the LLC, since its size made it hard to prime and probe the whole cache. The baseline algorithm made such attacks possible, but the optimised group testing algorithm will make them more powerful. Attacks against the LLC are more dangerous than attacks on the smaller caches since, as discussed previously, the victim and the attacker do not need to be in proximity. For example, in a cloud environment, they can be on separate guest VMs that run on the same host, the LLC being the only shared cache. This opens up for information

leakage in the system that might not be easy to prevent. Since attacks using Prime+Probe from JavaScript have previously been shown to be possible, web-based attacks could also be enhanced using the optimised algorithm. This could have a negative impact on user privacy. Attacks that use speculative execution, such as Meltdown and Spectre, could also be performed more efficiently in the cases where we do not have access to the clflush instruction.

Secondly, an optimised eviction set algorithm creates new requirements for the design of countermeasures against these types of attacks. Protections that were previously seen as “good enough” might need to be reconsidered and redesigned with faster attacks in mind. This was for example realised by Qureshi [18], which led to a new version of the mitigation technique CEASER.

6.2 Characterising the Cache

The results show that it is possible to use the eviction set algorithms to characterise the cache memory. The success rate is not 100 %, but this is expected, since the algorithms themselves are not always correct, for the reasons previously discussed. That is, when an algorithm fails to find an eviction set, this method will also not work correctly. One possible solution is to first confirm that the found set is indeed an eviction set, and re-run if not successful. However, if the set is an eviction set but not a minimal one, this test will return true, yet the tool will still fail, because it assumes that a minimal set has been found when determining the associativity. The tool does however seem to work as intended in the cases where a correct minimal eviction set is found.

Another way of overcoming the problem with “false eviction sets” could be to run the tool multiple times on the same cache settings. Since the wrong results seem to appear randomly and the algorithms in general have a high success rate, it is probable that we will get the correct results a majority of the times we run the tool. The deviant results could then be discarded.

How would this tool perform in reality, that is, when characterising a real hardware cache? Assuming that the algorithms are modified to work well enough on a real system, according to the challenges discussed in Section 2.4.5, I believe that they would be able to determine the associativity with the method presented. To determine the cache size, the method might need to be slightly modified. This is because we want to access the addresses in a random order, to avoid prefetching, and then it is not certain that we end up with a set where the elements lie at regular offsets from which conclusions can be drawn. If the cache has a high associativity, we will however have more addresses to work with in the final set, which could give more information to the tool. Again, re-running the algorithm many times on the same machine/cache would also give more information to work with.

Compared to the microbenchmark method, this method has some of the same issues. The microbenchmark method described earlier also does not work in the presence of prefetching. One benefit of the eviction set algorithm method is that it does not require a manual review of the produced graph. (It is of course possible

that there are variations of microbenchmarking that also do not require this.) By having a more automatic analysis of the settings, the tool can act as a first step in a larger program, where the results are used as parameters in that program. It is also beneficial that the tool makes few assumptions. Even if there are other ways of retrieving the cache information from a system, these may differ depending on the system. This method might be more portable, and works as long as it is possible to calculate the minimal eviction set.

6.3 Problems, Pitfalls and Lessons Learned

One of the main lessons I learned from this project is that working with the cache is not very deterministic. Even though I used a simulated cache and thus avoided a lot of interference, it was sometimes difficult to know what output to expect during development. This is because there can be unintended memory accesses, which go to the cache like any other access and cause evictions. In Pin it is possible to log every memory access, or the accesses made from a certain function, but since there can still be a large amount of accesses, this is not a very simple way of debugging.

This is also important to remember when choosing the data structure to represent the eviction set. As a first trial step, for example, I stored the addresses in the initial set in an array. However, this leads to a “double” memory access, since we first need to access the array, and then the address stored in it. This could affect the cache in unintended ways. For this reason, the linked list worked better, as described earlier; the addresses can then be accessed by simply traversing the list.

A third thing to think about is what the compiler might do. An access to a value that is never used afterwards can be removed by the compiler for optimisation reasons, even if the access itself was what we wanted. This is why inline assembly is used in places, so that this does not happen. When doing similar projects, it can be a good idea to inspect the disassembled version of the program, to see that it looks as expected (checking that the accesses have not been compiled away).
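A minimal sketch of forcing such an access, assuming GCC/Clang-style inline assembly (the helper name is illustrative):

/* Perform a load from addr that the compiler may not remove: the empty
   asm statement consumes the loaded value as an input operand, so the
   load has to be carried out, and the "memory" clobber keeps it from
   being reordered or optimised away. */
static inline void force_access(void *addr) {
    asm volatile("" : : "r"(*(volatile char *)addr) : "memory");
}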

6.4 Future Work

To further test the applicability and performance of the algorithms and the tool, it would of course be interesting to evaluate them on a real cache as well, instead of the simulated one. To do so, some modifications would be necessary, as described in Section 2.4.5. The adaptations mainly aim to reduce the effects of reordering, prefetching, TLB contamination, interference from other processes, and other types of noise that occur when working on real hardware. Some newer processors use undocumented, adaptive replacement policies. It would be interesting to see how the algorithms perform under these policies, and how they can be adapted to work as well as possible. It would also

be interesting to see how the algorithms would work on more unusual cache designs, like skewed-associative caches.

Song and Liu [21] optimised the algorithm even further, by using other group sizes in the group testing algorithm and a different method for going through the list of elements. These optimisations could be incorporated into the tool created in this project, and compared for performance.

7 Conclusion

Methods for dynamically finding eviction sets are mainly used when performing efficient Prime+Probe attacks against the cache. They are especially useful in cases where another type of attack, Flush+Reload, cannot be used for various reasons, and for attacking the last level cache (LLC). In this work I have compared an optimised algorithm for finding eviction sets against the previous state-of-the-art algorithm. The results confirm the theory: the optimised algorithm has linear scaling, instead of the quadratic scaling of the previous algorithm. This improvement implies that more efficient attacks using Prime+Probe are possible, and that this more efficient approach needs to be considered when designing countermeasures against these types of attacks.

I have also explored a new area of application for the algorithms. There is potential for these algorithms to be used for characterising the cache parameters, since I was, to a great extent, able to determine the size and associativity of caches with many different combinations of settings. However, to know exactly how applicable this is in reality, the method needs to be further evaluated.

References

[1] D. J. Bernstein, “Cache-timing attacks on AES,” 2005.
[2] P. Damaschke, “Threshold group testing,” in General Theory of Information Transfer and Combinatorics. Springer, 2006, pp. 707–718.
[3] D. Genkin, L. Pachmanov, E. Tromer, and Y. Yarom, “Drive-by key-extraction cache attacks from portable code,” in International Conference on Applied Cryptography and Network Security. Springer, 2018, pp. 83–102.
[4] D. Gruss, C. Maurice, and S. Mangard, “Rowhammer.js: A remote software-induced fault attack in JavaScript,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2016, pp. 300–321.
[5] Intel, “Pin - a dynamic binary instrumentation tool,” https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.

[6] G. Irazoqui, T. Eisenbarth, and B. Sunar, “S$A: A shared cache attack that works across cores and defies VM sandboxing – and its application to AES,” in 2015 IEEE Symposium on Security and Privacy. IEEE, 2015, pp. 591–604.
[7] S. P. Josyula, “On the applicability of a cache side-channel attack on ECDSA signatures: The Flush+Reload attack on the point multiplication in ECDSA signature generation process,” 2015.
[8] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher et al., “Spectre attacks: Exploiting speculative execution,” in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 1–19.
[9] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” arXiv preprint arXiv:1801.01207, 2018.

[10] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are practical,” in 2015 IEEE Symposium on Security and Privacy. IEEE, 2015, pp. 605–622.
[11] S. Lu, “Micro-architectural attacks and countermeasures,” 2011.
[12] C. Maurice, N. Le Scouarnec, C. Neumann, O. Heen, and A. Francillon, “Reverse engineering Intel last-level cache complex addressing using performance counters,” in International Symposium on Recent Advances in Intrusion Detection. Springer, 2015, pp. 48–65.
[13] C. Maurice, M. Weber, M. Schwarz, L. Giner, D. Gruss, C. A. Boano, S. Mangard, and K. Römer, “Hello from the other side: SSH over robust cache covert channels in the cloud,” in NDSS, vol. 17, 2017, pp. 8–11.

[14] Y. Oren, V. P. Kemerlis, S. Sethumadhavan, and A. D. Keromytis, “The spy in the sandbox: Practical cache attacks in JavaScript and their implications,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1406–1418.
[15] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: the case of AES,” in Cryptographers’ Track at the RSA Conference. Springer, 2006, pp. 1–20.

[16] C. Percival, “Cache missing for fun and profit,” 2005.
[17] M. K. Qureshi, “CEASER: Mitigating conflict-based cache attacks via encrypted-address and remapping,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 775–787.

[18] ——, “New attacks and defense for encrypted-address cache,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 360–371.
[19] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds,” in Proceedings of the 16th ACM Conference on Computer and Communications Security, 2009, pp. 199–212.
[20] M. Seaborn and T. Dullien, “Exploiting the DRAM rowhammer bug to gain kernel privileges,” Black Hat, vol. 15, p. 71, 2015.

[21] W. Song and P. Liu, “Dynamically finding minimal eviction sets can be quicker than you think for side-channel attacks against the LLC,” in 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), 2019, pp. 427–442.
[22] E. Tromer, D. A. Osvik, and A. Shamir, “Efficient cache attacks on AES, and countermeasures,” Journal of Cryptology, vol. 23, no. 1, pp. 37–71, 2010.

[23] P. Vila, “Tool for testing and finding minimal eviction sets,” https://github.com/cgvwzq/evsets, 2019.
[24] P. Vila, B. Köpf, and J. F. Morales, “Theory and practice of finding eviction sets,” in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 39–54.
[25] D. Wallin and E. Berg, “Microbenchmark,” http://user.it.uu.se/~danw/darkII/stridelab.html, accessed: 2020-05-12.
[26] Z. Wu, Z. Xu, and H. Wang, “Whispers in the hyper-space: high-bandwidth and reliable covert channel attacks inside the cloud,” IEEE/ACM Transactions on Networking, vol. 23, no. 2, pp. 603–615, 2014.
[27] Y. Yarom and K. Falkner, “Flush+Reload: a high resolution, low noise, L3 cache side-channel attack,” in 23rd USENIX Security Symposium (USENIX Security 14), 2014, pp. 719–732.

[28] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-VM side channels and their use to extract private keys,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security, 2012, pp. 305–316.
