IT 13 048 Examensarbete 30 hp Juli 2013

Stealing the shared cache for fun and profit

Moncef Mechri

Department of Information Technology

Abstract Stealing the shared cache for fun and profit

Moncef Mechri

Cache pirating is a low-overhead method created by the Uppsala Architecture Research Team (UART) to analyze the effect of sharing a CPU cache among several cores. The cache pirate is a program that actively and carefully steals a part of the shared cache by keeping its working set in it. The target application can then be benchmarked to see its dependency on the available shared cache capacity.

The topic of this Master Thesis project is to implement a cache pirate and use it on Ericsson's systems.

Supervisor: Erik Berg. Reviewer: David Black-Schaffer. Examiner: Ivan Christoff. Sponsor: Ericsson. IT 13 048. Printed by: Reprocentralen ITC.

Contents

Acronyms 2

1 Introduction 3

2 Background information 5
   2.1 A dive into modern processors ...... 5
       2.1.1 Memory hierarchy ...... 5
       2.1.2 Virtual memory ...... 6
       2.1.3 CPU caches ...... 8
       2.1.4 Benchmarking the memory hierarchy ...... 13

3 The Cache Pirate 17
   3.1 Monitoring the Pirate ...... 18
       3.1.1 The original approach ...... 19
       3.1.2 Defeating prefetching ...... 19
       3.1.3 Timing ...... 20
   3.2 Stealing evenly from every set ...... 21
   3.3 How to steal more cache? ...... 22

4 Results 23
   4.1 Results from the micro-benchmark ...... 23
   4.2 SPEC CPU2006 results ...... 25

5 Conclusion, related and future work 27
   5.1 Related work ...... 27

Acknowledgement 29

List of Figures 30

Bibliography 31

Acronyms

LLC Last Level Cache.

LRU Least Recently Used.

OS Operating System.

PIPT Physically-Indexed, Physically-Tagged.

PIVT Physically-Indexed, Virtually-Tagged.

RAM Random Access Memory.

THP Transparent Hugepage.

TLB Translation Lookaside Buffer.

TSC Time-Stamp Counter.

UART Uppsala Architecture Research Team.

VIPT Virtually-Indexed, Physically-Tagged.

VIVT Virtually-Indexed, Virtually-Tagged.

Chapter 1

Introduction

In the past, architects and manufacturers have used several approaches to increase the performance of their CPUs: increases in clock frequency, out-of-order execution, branch prediction, complex pipelines, and other techniques have all been used extensively to improve the performance of our processors.

However, in the mid-2000s, a wall was hit: architects started to run out of room for improvement using these traditional approaches [1]. In order to keep satisfying Moore's law, the industry converged on another approach: multicore processors. If we cannot noticeably improve the performance of one CPU core, why not have several cores in one CPU?

A multicore processor embeds several processing units in one physical package, thus allowing different execution flows to run in parallel. Helped by the constant shrinkage of semiconductor fabrication processes, this has led to a transition to multicore processors, where several cores are embedded on a single CPU. As of today this transition is complete, with multicore processors everywhere, from servers to mainstream personal computers and even mobile phones [2].

This revolution has allowed processor makers to keep producing significantly faster chips. However, this does not come for free: it imposes a greater burden on programmers in order to efficiently exploit this new computational power, and new problems arose (or were at least greatly amplified). Most of them should be familiar to the reader: false/true sharing, race conditions, costly communication, scalability issues, etc.

Although multicore processors have, by definition, duplicated execution units, they still tend to share some resources, like the Last Level Cache (LLC) and the memory bandwidth. The contention for these shared resources represents another class of bottlenecks and performance issues that are much less well known.

In order to study them, UART has developed two new methods: Cache Pirating [3] and the Bandwidth Bandit [4].

The Cache Pirate is a program that actively and carefully steals a part of the shared cache. In order to know how sensitive a target application is to the amount of shared cache available, it is co-run with the Pirate. By stealing different amounts of shared cache, the performance of the target application can be expressed as a function of the available shared cache.

We will first describe how most multicore processors are organized and how caches work, and study the Freescale P4080 processor. We will then introduce the micro-benchmark, a cache and memory benchmark developed during this thesis. After that, the Pirate will be described thoroughly. Finally, experiments and results will be presented.

This work is part of a partnership between UART and Ericsson.

Chapter 2

Background information

2.1 A dive into modern processors

2.1.1 Memory hierarchy

Memory systems are organized as a memory hierarchy. Figure 2.1 shows what a memory hierarchy typically looks like. CPU registers sit at the top of this hierarchy. They are the fastest memory available, but they are also very limited in number and size.

CPU caches come next. They are used by the CPU to reduce accesses to the slower levels of the memory hierarchy (mainly the main memory). Caches will be explored more thoroughly later.

The main memory, commonly called Random Access Memory (RAM), is the next level in the hierarchy. It is off-chip memory connected to the CPU by an interconnect (AMD HyperTransport, Intel QuickPath Interconnect, etc.). Over the years, RAM has become cheaper and bigger, and mainstream devices often have several gigabytes of RAM.

Lower levels of the hierarchy, like the hard drive, are several orders of magnitude slower than even the main memory and will not be explored here.

Figure 2.1: The memory hierarchy (wikipedia.org)

2.1.2 Virtual memory

Virtual memory is a memory management technique, used by most systems nowadays, which allows every process to see a flat and contiguous address space independent of the underlying physical memory. The Operating System (OS) maintains per-process translation tables of virtual-to-physical addresses in main memory. The CPU has a special unit called the Memory Management Unit (MMU) which, in cooperation with the OS, performs the translation from virtual to physical addresses.

Since the OS's translation tables live in RAM, accessing them is slow. In order to speed up the translation, the CPU maintains a very fast translation cache called the Translation Lookaside Buffer (TLB) that caches recently-resolved addresses. The TLB can be seen as a kind of hash table where the virtual addresses are the keys and the physical addresses are the values. When a translation is required, the TLB is looked up. If a matching entry is found, we have a TLB hit and the entry is returned. If the entry is not found, we have a TLB miss. The in-memory translation table is then searched, and when a valid entry is found, the TLB is updated and the translation restarted.

Paged virtual memory

Most implementations of virtual memory use pages as the basic unit, the most common page size being 4KB. The memory is divided into page-sized chunks, and the OS translation tables are called page tables. In such a setup, the TLB does not hold translations for individual addresses. Instead, it keeps per-page translations, and once the right page is found, the lower bits of the virtual address are used as an offset into this page.
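As a small illustration (generic C, not tied to any particular architecture), splitting a virtual address into a page number and a page offset for 4KB pages is just a shift and a mask; the page number is what the TLB and page tables translate, while the offset is carried over unchanged:

    #include <stdint.h>

    #define PAGE_SHIFT 12u                       /* 4KB pages: 2^12 bytes */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    /* The page number is what the TLB/page tables translate... */
    static uint64_t page_number(uint64_t vaddr) { return vaddr >> PAGE_SHIFT; }

    /* ...while the offset within the page is used as-is. */
    static uint64_t page_offset(uint64_t vaddr) { return vaddr & (PAGE_SIZE - 1); }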

TLBs are unfortunately quite small: most TLBs have a few dozen entries, 64 being a common number, which means that with 4KB pages only 256KB of memory can be quickly addressed thanks to the TLB. Programs which require more memory will thus likely cause a lot of TLB misses, forcing the processor to waste time traversing the page tables (this process is called a page walk).

Large pages

To reduce this problem, processors can also support other page sizes. For example, modern processors support 2MB, 4MB and 1GB pages. While this can waste quite a lot of memory (because even if only one byte is needed, a whole page is allocated), it can also significantly improve performance by increasing the range of memory covered by the TLB. There is no standard way to allocate large pages: we must rely on OS-specific functionality. Linux provides two ways of allocating large pages:

• hugetlbfs: hugetlbfs is a mechanism that allows the system administrator to create a pool of large pages that are then accessible through a filesystem mount. libhugetlbfs provides both an API and tools to make allocating large pages easier [5],

• Transparent Hugepage (THP): THP is a feature that attempts to automatically back memory areas with large pages when this could be useful for performance. THP also allows the developer to request that a memory area be backed by large pages (but with no guarantee of success) through the madvise() system call, as sketched after this list. THP is much easier to use than hugetlbfs [6].
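As an illustration of the second option, the sketch below (Linux-specific, assuming 2MB transparent huge pages are available; the helper name is ours) allocates a buffer aligned on a huge page boundary and asks the kernel to back it with huge pages:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2u * 1024 * 1024)   /* assumed THP size: 2MB */

    /* Allocate 'size' bytes, hinting the kernel to use transparent huge pages. */
    static void *alloc_hinting_huge_pages(size_t size)
    {
        void *buf = NULL;

        /* Align on a huge page boundary so the kernel can promote the region. */
        if (posix_memalign(&buf, HUGE_PAGE_SIZE, size) != 0)
            return NULL;

        /* madvise(MADV_HUGEPAGE) is only a hint: there is no guarantee of
           success, exactly as described above. */
        madvise(buf, size, MADV_HUGEPAGE);
        return buf;
    }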

Although not originally designed for this purpose, large page (sometimes called huge page) support is fundamental for the Cache Pirate. We will explain why later.

2.1.3 CPU caches

Throughout the years, processors have kept getting exponentially faster, but the growth in main memory performance has not been quite as steady. To fill the gap and keep our processors busy, CPU architects have been using caches for a long time. CPU caches are a kind of memory that is small and expensive but very fast (one or two orders of magnitude faster than the main memory). Nowadays, without caches, a CPU would spend most of its time waiting for data to arrive from RAM.

Before studying some common cache organizations, it is important to note that caches don't store individual bytes. Instead, they store data blocks, called cache lines. The most common cache line size is 64B.

Direct-mapped caches

Direct-mapped caches are the simplest kind of cache. Figure 2.2 shows how a direct-mapped cache with the following characteristics is organized: size: 32KB; cache line size: 64B.

The cache is divided into sets. In a direct-mapped cache, each set can hold one cache line and a tag. This cache has 512 sets (32KB / 64B). When the processor wants to read or write main memory, the cache is searched first.

The requested address is used to locate the data in the cache. This process works as follows: bits 6-14 (9 bits) are used as an index to locate the right set. Once the set is located, the tag it contains is compared to bits 15-31 (17 bits) of the address. If they match, the requested data block is present in the cache and can be retrieved immediately. This case is called a cache hit. The 6 lower bits of the address are then used to select the desired byte within the cache block. If they don't match, the data needs to be fetched from main memory and put in the cache, evicting in the process the cache line that was cached in this set. This situation is a cache miss.

Figure 2.2: A direct-mapped cache. Size: 32KB; Cache line size: 64B

While being simple to implement and understand, direct-mapped caches have a fundamental weakness: each set only holds one data block at a time. In our example, it means that up to 2^17 data blocks can potentially be competing for the same set, leading to a lot of expensive evictions. This kind of cache miss is called a conflict miss and can be triggered even if the program's working set is small enough to fit entirely in the cache.
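To make the bit arithmetic concrete, the following sketch (illustrative code, not from the thesis) decomposes an address into its offset, set index, and tag for a cache with 64B lines and a given number of sets; with 512 sets it reproduces the direct-mapped example above, and with 128 sets it matches the 4-way cache discussed next.

    #include <stdint.h>
    #include <stdio.h>

    /* Decompose 'addr' for a cache with 64B lines and a power-of-two number
       of sets (512 for the direct-mapped example: index = bits 6-14,
       tag = bits 15 and up). */
    static void decompose(uint64_t addr, unsigned num_sets)
    {
        unsigned offset_bits = 6;                 /* log2(64B cache line) */
        unsigned index_bits = 0;
        while ((1u << index_bits) < num_sets)     /* log2(num_sets)       */
            index_bits++;

        uint64_t offset = addr & ((1ull << offset_bits) - 1);
        uint64_t index  = (addr >> offset_bits) & ((1ull << index_bits) - 1);
        uint64_t tag    = addr >> (offset_bits + index_bits);

        printf("0x%llx -> tag 0x%llx, set %llu, byte %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)index, (unsigned long long)offset);
    }

    int main(void)
    {
        decompose(0x0001ABCDull, 512);  /* direct-mapped cache: 512 sets */
        decompose(0x0001ABCDull, 128);  /* 4-way cache:         128 sets */
        return 0;
    }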

Set-associative caches

A practical way to mitigate conflict misses is to make each set hold several blocks. A cache is said to be N-way set-associative if each set can hold N cache lines (and their corresponding tags). These entries in the set are also called ways. Figure 2.3 shows what the cache from Figure 2.2 would look like if it were made 4-way set-associative. Since each set now holds 4 cache lines instead of 1, there are 128 sets (32KB / 64B / 4). On a request, bits 6-12 (7 bits) are used to index the cache (find the right set). Once the set is located, the 4 tags it contains are compared to bits 13-31 (19 bits).

Figure 2.3: A 4-way set-associative cache. Size: 32KB; Cache line size: 64B

If one of them matches, it's a cache hit, and the desired byte is extracted from the corresponding cache line using bits 0-5. If no tag matches, it's a cache miss, and the data needs to be fetched from main memory. The freshly-fetched block is then cached. To do so, it is necessary to evict one block from the set (more on that later).

Set-associativity is an effective way to considerably decrease the number of conflict misses. However, it does not come for free: costly hardware (in terms of size, price, and power consumption) must be added to select the correct block within a set as fast as possible.

Fully-associative caches

Fully-associative caches are a special case of set-associative caches with only one set. Data can end up anywhere in this set, thus completely eliminating conflict misses. Fully-associative caches are very expensive because, in order to be efficient, they must be able to compare all the tags contained in the set in parallel, which is costly. For this reason, only very small caches, like L1 TLBs, are made fully associative.

Replacement policy

Each time a new block is brought into the cache, an old one needs to be evicted. But which one? Several strategies [7] (or replacement policies) exist to choose the cache line to evict, called the victim. The most common strategy is called Least Recently Used (LRU). This policy assumes that data used recently will likely be reused soon (temporal locality) and therefore, as implied by its name, chooses the least recently used cache line as the victim.

To implement this strategy, each set maintains age bits for each of its cache lines. Each time a cache line is accessed, it becomes the youngest cache line in the set and the other cache lines age. Keeping track of the exact age of each cache line becomes quite expensive as the associativity grows. In practice, instead of a strict LRU policy, processors tend to use an approximation of LRU.
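The following software model (a simplification for illustration, not how the hardware is actually wired) shows the age-counter idea for a single 4-way set: on every access the touched way becomes age 0, the ways that were younger than it age by one, and the victim is the way with the highest age.

    #define WAYS 4

    /* ages[w] == 0 means way w was touched most recently. */
    static unsigned ages[WAYS] = {0, 1, 2, 3};

    /* Record an access to one way: it becomes the youngest, the ways that
       were younger than it age by one. */
    static void touch(unsigned way)
    {
        for (unsigned w = 0; w < WAYS; w++)
            if (ages[w] < ages[way])
                ages[w]++;
        ages[way] = 0;
    }

    /* The victim is the least recently used way, i.e. the oldest one. */
    static unsigned victim(void)
    {
        unsigned oldest = 0;
        for (unsigned w = 1; w < WAYS; w++)
            if (ages[w] > ages[oldest])
                oldest = w;
        return oldest;
    }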

Hardware prefetching

To hide memory latency even further, processor manufacturers use another technique: prefetching. The idea behind prefetching is to try to guess which data will be used soon and prefetch it so that it is already cached when actually needed. To do so, the hardware prefetcher, which is part of the CPU, constantly looks for common memory access patterns. Here is a common pattern that every prefetcher should be able to detect:

    int array[1000];
    for (int i = 0; i < 1000; ++i) {
        array[i] = 10;
    }

This code simply traverses an array linearly. When executing it, the prefetcher will detect the linear memory accesses and will fetch the data ahead of time, before they are needed. A prefetcher should also be able to detect strides between memory accesses. For example, the prefetcher would also successfully prefetch the data accessed by this code:

    int array[1000];
    for (int i = 0; i < 1000; i += 2) {
        array[i] = 10;
    }

Virtual or physical addresses?

This section discusses which address (virtual or physical) is used to address a cache and the consequences of this choice. This is important because the Pirate needs to control where its data end up in the cache.

We said earlier that some bits of the address are used to index the cache and some other bits are used as the tag to identify cache lines with the same set index. But are we talking about the virtual or the physical address? The answer is: it depends. CPU manufacturers are free to choose which combination of virtual and physical addresses they use for the set index and the tag. Depending on this choice, a cache can be:

• Virtually-Indexed, Virtually-Tagged (VIVT): The virtual address is used for both the index and the tag. This has the advantage of a low latency: since the physical address is not required at all, there is no need to request it from the MMU (which can take time, especially in the event of a TLB miss). However, this does not come without problems, since different virtual addresses can refer to the same physical location (aliasing), and identical virtual addresses can refer to different physical locations (homonyms).

• Physically-Indexed, Physically-Tagged (PIPT): The physical address is used for both the index and tag. This scheme avoids aliasing and homonyms problems altogether but is also slow since nothing can be done until the physical address has been served by the MMU.

• Virtually-Indexed, Physically-Tagged (VIPT): The virtual address is used to index the cache and the physical address is used for the tag. This scheme has a lower latency than PIPT because indexing the cache and retrieving the physical address can be done in parallel, since the virtual address already gives us the set index.

The last possible combination, Physically-Indexed, Virtually-Tagged (PIVT), is of little use and won't be discussed.

Generally, smaller caches such as L1 caches use VIPT because of its lower latency, while larger caches such as LLCs use PIPT to avoid dealing with aliasing.

2.1.4 Benchmarking the memory hierarchy

In order to understand how exactly a memory hierarchy works, a micro-benchmark has been developed during this thesis work. The benchmark works as follows: its main data structure is an array of struct elem whose size is specified as a command-line parameter. The C structure elem is defined like this:

    struct elem {
        struct elem *next;
        long int pad[NPAD];
    };

The next pointer points to the next element to visit in the array (note that a pointer is 4 bytes on 32-bit systems but 8 bytes on 64-bit systems). The pad member is used to pad the structure in order to control the size of an element. Varying the padding can greatly change the benchmark's behavior and is done by choosing an appropriate value for NPAD at compile time. An interesting choice is to adjust the padding so that the size of the elem structure matches the cache line size: assuming a 64B cache line, and considering that the long int type is 4 bytes on 32-bit Linux and 8 bytes on 64-bit Linux, choosing NPAD=15 on a 32-bit system or NPAD=7 on a 64-bit system makes each element 64 bytes large.

The elements are linked together using the next pointer. There are 2 ways to link the elements:

• Sequentially: The elements are linked sequentially and circularly, so the last element must point to the first one,

• Randomly: The elements are randomly linked but we must make sure that each element will be visited exactly once to avoid looping only on a subset of the array.

The benchmark then times how long it takes to traverse the array (using only the next pointers) a fixed number of times. Finally, dividing the total time by the number of iterations and the number of elements in the array gives the average access time per element. The array size and average access time are then stored in a results file. In its current form, each execution of the benchmark works with only one array size. To try different array sizes, it is recommended to use a shell "for" loop. A script that parses the results files and plots them using gnuplot is also provided.
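The core of the benchmark can be sketched as follows (a condensed illustration, not the exact thesis code; NPAD is fixed for a 64-bit system and the default sizes are arbitrary). The random linking builds one cycle covering every element, so a pointer chase visits each element exactly once per lap:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NPAD 7                 /* one 64B cache line per element (64-bit) */

    struct elem {
        struct elem *next;
        long int pad[NPAD];
    };

    /* Shuffle a visiting order and link consecutive elements circularly:
       this always yields a single cycle over the whole array. */
    static void link_randomly(struct elem *a, size_t n)
    {
        size_t *order = malloc(n * sizeof *order);
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < n; i++)
            a[order[i]].next = &a[order[(i + 1) % n]];
        free(order);
    }

    int main(int argc, char **argv)
    {
        size_t n = (argc > 1) ? strtoul(argv[1], NULL, 0) : (1u << 15);
        size_t iters = 100;
        struct elem *a = malloc(n * sizeof *a);

        link_randomly(a, n);

        struct timespec t0, t1;
        volatile struct elem *p = &a[0];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t it = 0; it < iters; it++)
            for (size_t i = 0; i < n; i++)
                p = p->next;                  /* the dependent loads we time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu elements: %.2f ns/access\n", n, sec * 1e9 / (double)(iters * n));

        free(a);
        return 0;
    }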

The Freescale P4080

Figure 2.4: Freescale P4080 Block Diagram

The Freescale P4080 is an eight-core PowerPC processor. The development systems provided by Freescale run Linux, while the production systems use a commercial real-time OS.

Each core runs at 1.5GHz and has a 32KB L1 data cache and a 32KB L1 instruction cache, as well as a 128KB unified L2 cache. The eight cores share a 2MB L3 cache (split into two 1MB chunks) as well as two DDR3 memory controllers.

Figure 2.5 displays the micro-benchmark results when configured to occupy one full cache line per element.

Figure 2.5: micro-benchmark run on the Freescale P4080

We see that up to a working set of 32KB, the access time is very low. Beyond this point, the benchmark is slowed down because the working set no longer fits in the L1 cache. A second performance drop occurs when the working set grows beyond 128KB, which matches the size of the L2 cache. Above this point, the working set becomes too big for the L2 cache and we start to hit in the L3 cache, which has significantly higher latencies. Another performance penalty occurs when the working set size exceeds 2MB, the size of the L3 cache. Beyond this point, the working set does not fit in the caches anymore and every element needs to be fetched from main memory.

The characteristics of the memory hierarchy, such as the size of each cache level, are thus clearly exhibited. Many more details about the memory hierarchy can be found in Drepper [13].

Time measurement

In order to get valid results, it is crucial to be able to measure time accurately.

POSIX systems such as Linux which implement the POSIX Realtime Extensions provide the clock_gettime() function, which has a high enough resolution for our purposes [12]. Since the commercial OS used by Ericsson does not provide a high-resolution timing function, a timing function that reads the Time-Stamp Counter (TSC) has been written for it.
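On Linux, a small helper along these lines is sufficient (a sketch; the function name is ours and CLOCK_MONOTONIC is chosen to stay unaffected by clock adjustments):

    #include <time.h>

    /* Monotonic timestamp in seconds, with nanosecond resolution. */
    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    }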

Finally, it is strongly advised to disable any frequency scaling feature when running the benchmark since it could affect the results.

Chapter 3

The Cache Pirate

As we showed previously, modern multicore processors tend to have one or several private caches per core as well as one big cache shared by all the cores. Sharing a cache between cores has several advantages. Some of them are:

• Each core can, at any time, use as much shared cache space as available. This is in contrast with private caches where any unused space cannot be reused by other cores,

• Faster/simpler data sharing and cache-coherency operations by using the shared cache

However, sharing a cache between cores also means that there can be contention for this precious resource, with several cores competing for space in the cache and evicting each other's data. As we saw in the previous section, the LLC is the last rampart before having to hit the RAM, which has significantly longer access times. It is therefore important to understand how sharing the LLC affects a specific application. In order to study this, the UART team has created an original approach: the Cache Pirate. The Pirate is an application that runs on one core and actively steals a part of the shared cache to make it look smaller than it actually is. We then co-run a target application with the Pirate and measure its performance with whatever metric we are interested in (execution time, frames per second, requests per second, and so on). If we repeat this operation for different cache sizes, we can express the performance of our application as a function of the shared cache size.

The Pirate works by looping through a working set of the desired size in order to prevent it from being evicted from the cache. The size of this working set represents the amount of cache we are trying to steal. The Pirate must:

1. steal the desired amount of cache (no more, no less),

2. steal evenly from every set in the cache,

3. not use any other resource, such as the memory bandwidth.

Property 1 is self-explanatory: We want the pirate to steal the amount of cache we have asked for.

Property 2 is important because we want the cache to look smaller by reducing its associativity. In order to do that, the Pirate needs to steal the same number of ways in every set. Not doing so could affect applications that have hotspots in the cache. Note that this is the approach processor manufacturers tend to use when they create a lower-end processor with a smaller cache than their higher-end parts (for example, the L3 cache on Intel processors is made of 2MB 16-way set-associative chunks; to create a processor with a 3MB L3 cache, they would use two such chunks but with 4 ways disabled in each, reducing the size from 4MB to 3MB).

Property 3 is also important because the Pirate must only use the shared cache. Other resources, like the memory bandwidth, should be left untouched or the results will be biased.

Ensuring that the Pirate actually has these properties is the main challenge of this work.

3.1 Monitoring the Pirate

We have already said that in order to steal cache, the Pirate loops through its working set to keep it in the shared cache. However, the Pirate will be contending with the rest of the system for space in this cache, which means that its data could get evicted if they become marked as least recently used. This would lead to two problems:

• The Pirate would not be stealing the desired amount of cache (thus violating the first property), and

• It would need to fetch the working set (or part of it) from the next level of the memory hierarchy, which is the RAM. This implies using memory bandwidth, and since this resource is also shared with the other cores, it reduces the memory bandwidth available to them, also violating property 3.

A key point is that this issue gets amplified when we try to steal more cache, because increasing the working set size reduces the access rate, i.e. the rate at which each element gets touched. If the access rate becomes too low, the data could become the least recently used and therefore get evicted from the cache. We thus need to check whether the Pirate is able to keep its working set cached or not.

3.1.1 The original approach

In their original paper, the UART team created a Pirate that loops sequentially through its working set. To check that the Pirate is not using memory bandwidth, a naive approach would be to check the LLC's miss ratio (the ratio of memory accesses that cause a cache miss). Missing in the last-level cache implies fetching data from the RAM, so this approach looks reasonable at first glance.

However, we saw that processors implement prefetching in order to hide memory latency. In this case, a prefetcher could easily detect the sequential accesses the Pirate is doing and prefetch data from main memory before they are needed. This would drastically reduce the miss ratio and thus hide the fact that the Pirate is using memory bandwidth.

To get around that, the UART team uses another metric called the fetch ratio, which is the ratio of memory accesses that cause a fetch from main memory [8].

3.1.2 Defeating prefetching

However, the fetch ratio is a metric that is much less well known than the miss ratio. Some systems might also not provide the performance events necessary for measuring it, or those events might only be collectible system-wide, which is the case on modern Intel processors where the L3 cache and the memory controller are located in an "uncore" part that provides no easy way to collect data on a per-core basis [10].

To avoid having to measure the fetch ratio, a different approach has been used in this work: the Pirate written during this thesis iterates through its working set randomly. This change is fundamental: since the prefetchers are unable to recognize a typical access pattern, they cannot guess which data will be needed soon and therefore will not prefetch any data. Now that the prefetchers no longer get in the way, the Pirate's miss ratio can be used to monitor it: the miss ratio should stay as close to 0 as possible, otherwise the Pirate is fetching data from the RAM.

Monitoring the miss ratio must be done with a profiler able to read the performance counters, such as Linux’s perf tool [9].

Unfortunately, compared to the original approach, defeating the prefetchers could hurt the Pirate's performance, since its data can no longer be prefetched from the LLC, leading to a lower access rate.

3.1.3 Timing

On some systems, L3 performance events are also not easily collectible for each core. To work around this issue, another way of monitoring the Pirate has been devised: we first start the Pirate while the system is idle. The Pirate then benchmarks the time it takes to iterate over its working set. Once this is done, the target application can be started. The Pirate keeps timing itself and compares the result to the reference time. If the difference between the reference time and the new time exceeds a threshold (the current default is 10%, but it can be changed via an optional command-line argument), it means that the Pirate's data got evicted from the cache and had to be read back from the RAM; in other words, the Pirate was not able to keep its working set cached. This works because, as we saw in the previous chapter, the RAM has significantly higher latencies than the caches, so missing in the last-level cache leads to a significantly higher running time for the Pirate, which we can use to detect when the Pirate starts to use memory bandwidth.
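The monitoring logic can be sketched as follows (the helper names chase_working_set() and now_seconds() are placeholders for this example, not the actual Pirate code; the 10% threshold mirrors the default mentioned above):

    #include <stdio.h>

    /* Placeholders: one pass over the Pirate's working set, and a monotonic
       timer in seconds (clock_gettime() on Linux, the TSC elsewhere). */
    extern void   chase_working_set(void);
    extern double now_seconds(void);

    void monitor_pirate(void)
    {
        /* Reference measured once, while the system is otherwise idle. */
        double t0 = now_seconds();
        chase_working_set();
        double reference = now_seconds() - t0;

        /* Stealing phase: the target application is now running. */
        for (;;) {
            t0 = now_seconds();
            chase_working_set();
            double elapsed = now_seconds() - t0;

            /* A pass noticeably slower than the reference means part of the
               working set was evicted and had to be re-fetched from RAM. */
            if (elapsed > reference * 1.10)
                fprintf(stderr, "pirate: working set no longer cached "
                                "(%.4f s vs %.4f s)\n", elapsed, reference);
        }
    }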

As before, timing is done using the clock_gettime() function on Linux, or by manually reading the CPU's TSC on the OS used in production. This method has several advantages:

• It is very easy to implement

• It is completely cross-platform for systems supporting the clock_gettime() function (and easily portable to others)

• It does not use any CPU-dependent feature (apart from the TSC if we read it manually) such as performance counters

• It can be used on systems that do not provide an easy way to get per-core LLC events

3.2 Stealing evenly from every set

We saw previously that caches are split into sets, which are themselves split into ways. Some applications might have hotspots in the cache, meaning that their working set is concentrated in a small part of the cache. Failing to steal evenly from every set could then cause the target application either to be unaffected by the Pirate (if the Pirate steals nothing, or too little, from the hotspots) or to be affected too much (if the Pirate steals too much from them). Ideally, we would like to steal the same number of ways in every set. As said previously, to determine where each datum is (or will end up) in the cache, a part of its address is used as an index. From this, we can derive that a way to control in which set a datum will end up is to choose its address manually, or at least the part that will be used as the index. This can easily be done for virtually-indexed caches by using an aligned allocation routine such as posix_memalign() to craft a suitable virtual address [11]. The Pirate's working set is organized as follows: as with the micro-benchmark, the working set is a densely-packed linked list, and each element occupies a full cache line. The working set is placed in memory such that the first element ends up in the first set of the cache, the second in the second set, and so on until we have stolen one way in every set, at which point the addresses of the following elements (if we want to steal more than one way) wrap around.

Example: suppose we have the same cache as in the previous section (size: 32KB; cache line size: 64B; associativity: 4; number of sets: 128; 32-bit addresses). We saw that bits 0-5 of an address are used to select a byte within a cache line and that bits 6-12 are used to index the cache. Thus, to use a way in the first set, the first element of the working set should have an address with bits 0-12 (13 bits) unset. To achieve this, the working set needs to be aligned on an 8KB boundary (2^13 bytes).

However, LLCs are physically indexed, which means, as we said before, that the physical address is used to index the cache. Thus, if we want to control where data end up in the cache, we need to control the physical addresses. This is a problem because, since the memory is paged, there is no guarantee that the pages will be contiguous in physical memory, which prevents us from controlling the addresses of the elements in the working set, which in turn prevents us from placing elements in the cache as we wish. A way to work around this issue is to use large memory pages, in order to be sure that the working set is allocated as a contiguous block of physical memory. Coupled with the use of an aligned allocation routine as explained above, this lets us know where each datum will end up in the cache.
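For the example cache above, the allocation could look like the following sketch (illustrative constants and names, not the thesis implementation; it assumes 2MB transparent huge pages). Aligning on a huge page boundary also aligns on the 8KB way stride, and if the kernel does back the buffer with huge pages, the low 21 address bits are identical in the virtual and physical addresses, so the set placement also holds for the physically-indexed LLC:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LINE_SIZE   64
    #define NUM_SETS    128                     /* example cache: 128 sets   */
    #define WAY_STRIDE  (NUM_SETS * LINE_SIZE)  /* 8KB: one way in every set */
    #define HUGE_PAGE   (2u * 1024 * 1024)      /* assumed THP size          */

    /* One element per cache line, as in the micro-benchmark. */
    struct line {
        struct line *next;
        char pad[LINE_SIZE - sizeof(struct line *)];
    };

    /* Allocate a working set covering 'ways' ways in every set: element 0
       maps to set 0, element 1 to set 1, ..., wrapping around every
       NUM_SETS elements. */
    static struct line *alloc_working_set(unsigned ways)
    {
        size_t size = (size_t)ways * WAY_STRIDE;
        void *buf = NULL;

        /* Round up so the region is large enough for the kernel to use at
           least one huge page. */
        size = (size + HUGE_PAGE - 1) & ~(size_t)(HUGE_PAGE - 1);

        if (posix_memalign(&buf, HUGE_PAGE, size) != 0)
            return NULL;
        madvise(buf, size, MADV_HUGEPAGE);      /* hint only, may be ignored */

        return (struct line *)buf;
    }

The elements would then be linked randomly, as in section 3.1.2, and chased continuously to keep them resident.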

3.3 How to steal more cache?

We said before that when the Pirate's working set size increases, the access rate decreases, down to a point where the rest of the system fights back hard enough to start evicting the Pirate's data. This puts a limit on how much cache the Pirate can steal (which depends heavily on how hard the target application and the rest of the system fight back). To go beyond this limit, we can run several Pirates. By splitting the working set between the instances, each of them has a smaller working set to iterate on, and thus a higher access rate. However, when contending for space in the shared cache, the different instances make no distinction between each other and the rest of the system, which means that they will also fight each other. Furthermore, since each Pirate should run on its own core, this further reduces the number of cores available to the target application.

Chapter 4

Results

4.1 Results from the micro-benchmark

The micro-benchmark developed during this thesis work was created to get a better understanding of how the memory hierarchy works, but also to validate the Pirate. The idea is that the micro-benchmark can expose the size and latency of each level of the memory system. Thus, if we steal a part of the LLC with the Pirate and co-run it with the micro-benchmark, the benchmark should show that the LLC indeed looks smaller. This is what has been done, and Figure 4.1 shows the result.

Figure 4.1: micro-benchmark co-run with Pirates stealing different cache sizes

The tests were run on the Freescale P4080 (8 cores, 32KB L1, 128KB L2, 2MB L3). The benchmark uses a random access pattern to defeat prefetching and large memory pages to hide TLB effects. The red line shows the benchmark results with no Pirate running. Each level of the memory hierarchy is visible. In particular, we can notice that the benchmark gets slower when the working set size becomes larger than 2MB (2^21 bytes), meaning that we are hitting the main memory. The green line shows what happens when we run the same benchmark with a Pirate stealing 1MB of cache. We can see that the benchmark gets slowed down as soon as its working set becomes bigger than 1MB (2^20 bytes). This is expected (and desired!), since the Pirate is already stealing 1MB, thus reducing the cache capacity available to the benchmark to 1MB. The blue line represents the benchmark results when a Pirate is trying to steal 1.5MB. We would expect the slowdown to appear above 512KB (2^19 bytes), but the line actually looks very similar to the green line. This is because the Pirate is not able to steal that much cache. Finally, the purple line shows what happens when we run two Pirates, stealing 0.75MB each (for a total of 1.5MB). Now the knee does appear when the working set grows above 512KB, which means that the benchmark only has access to 512KB of LLC space.

4.2 SPEC CPU2006 results

Several benchmarks from the SPEC CPU2006 benchmark suite were co-run with a Pirate stealing different cache sizes. Table 4.1 shows the execution times for the 401.bzip2 benchmark when contending with the Pirate.

Amount of cache available    Execution time (seconds)
2MB                          33.5
1.75MB                       34.9
1.5MB                        35.9
1MB                          37.4
512KB                        40.2

Table 4.1: Results for 401.bzip2 co-run with the Pirate stealing different cache sizes

We can see that although the Pirate and the benchmark obviously do not share any data, the benchmark is slowed down when the Pirate steals cache. Up to 1.5MB out of the 2MB LLC was stolen; at that point the performance had dropped by 20%.

Amount of cache available    Execution time (seconds)
2MB                          58.8
1.75MB                       61.7
1.5MB                        64.1
1MB                          68.4
512KB                        74.4

Table 4.2: Results for 433.milc co-run with the Pirate stealing different cache sizes

Table 4.2 shows the results for 433.milc. Here, the performance drop went as far as 26.6%.

The large slowdown caused by reducing the shared cache capacity tells us that these applications rely heavily on the LLC. When the shared cache capacity is reduced, they start to fetch data from RAM instead, which has been shown to be significantly slower.

Nevertheless, other benchmarks, such as 470.lbm, were much less sensitive to the reduction in shared cache capacity.

Chapter 5

Conclusion, related and future work

The goal of this work was to implement the Cache Pirating method on Ericsson's systems in order to study the performance effects of sharing a cache between cores. During this thesis, a Pirate has been written that can run both on Linux and on the commercial OS used by Ericsson, and it has been successfully used to test several applications, both from the SPEC CPU2006 benchmark suite and from Ericsson's own programs. Compared to the original work done by the UART team, some different choices were made: the Pirate written as part of this work uses a random access pattern in order to defeat prefetching, which makes monitoring easier. Monitoring itself is done by continuously timing the Pirate and looking for sudden slowdowns.

5.1 Related work

Multicore processors have become the norm and the need for performance keeps rising, so studying the effects of sharing resources in a multicore processor is of primary importance. In the same vein as the Cache Pirate, the UART team created the Bandwidth Bandit, which focuses on stealing only the memory bandwidth [4].

U. Drepper [13] provides comprehensive and detailed information about the memory hierarchy, performance analysis, operating systems, and what a programmer can do to optimize a program with regard to the memory hierarchy. Wulf and McKee [14] have shown that the increasing gap between CPU speed and RAM speed would become a major problem, up to the point where memory speed becomes the major bottleneck.

Others have studied the effect of sharing a cache between cores. StatCC [15] is a statistical cache contention model, also created by the UART team, that takes as input a reuse distance distribution collected beforehand and leverages StatStack [16], a probabilistic cache model, to estimate the cache miss ratio of co-scheduled applications and their CPIs. The Mälardalen Real-Time Research Center (MRTC) has been working on feedback-based generation of hardware characteristics, where a model is derived from running a production system. This model is then used to generate a similar load on different parts of the hardware, such as the caches, without running the real production system [17].

Acknowledgments

I would like to thank David Black-Schaffer and Erik Berg for giving me the opportunity to work on this interesting project, as well as for their supervision during this work. I would also like to thank all the members of the Quantum team at Ericsson for their precious help. Finally, I would like to thank my friends and my family for their unconditional support and faith.

List of Figures

2.1 The memory hierarchy ...... 6
2.2 A direct-mapped cache ...... 9
2.3 A 4-way set-associative cache ...... 10
2.4 Freescale P4080 Block Diagram ...... 14
2.5 micro-benchmark run on the P4080 ...... 15

4.1 micro-benchmark co-run with Pirates stealing different cache sizes ...... 24

Bibliography

[1] Herb Sutter, The free lunch is over, 2005. Date accessed: 2013-07-24. http://www.gotw.ca/publications/concurrency-ddj.htm

[2] Herb Sutter, Welcome to the hardware jungle, 2011. Date accessed: 2013-07-24. http://herbsutter.com/welcome-to-the-jungle/

[3] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Cache Pirating: Measuring the curse of the shared cache, in 2011 International Conference on Parallel Processing (ICPP), pages 165-175.

[4] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Bandwidth Bandit: Understanding memory contention, in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2012, pages 116-117.

[5] LWN.net, Huge pages: Interfaces, Date accessed: 2013-07-24. https://lwn.net/Articles/375096/

[6] Linux documentation, Transparent Hugepage support, Date accessed: 2013-07-24. http://lwn.net/Articles/423592/

[7] Wikipedia, Cache algorithms, Date accessed: 2013-07-24. http://en.wikipedia.org/wiki/Cache_algorithms

[8] Rogue Wave, Fetch ratio, Date accessed: 2013-07-24. http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch03s08.html

[9] Kernel.org, Linux perf tool, Date accessed: 2013-07-24. https://perf.wiki.kernel.org/index.php/Main_Page

[10] Intel, Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors, p. 13, section "Uncore performance monitoring (PMU)".

[11] Linux Manpages, posix_memalign() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/posix_memalign

[12] Linux Manpages, clock_gettime() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/clock_gettime

[13] Ulrich Drepper, What every programmer should know about memory, Date accessed: 2013-07-24. www.akkadia.org/drepper/cpumemory.pdf

[14] Wm. A. Wulf, Sally A. McKee, Hitting the memory wall: implications of the obvious, in ACM SIGARCH Computer Architecture News, Volume 23, Issue 1, March 1995, pages 20-24.

[15] D. Eklov, D. Black-Schaffer, E. Hagersten, StatCC: A statistical cache contention model, in PACT ’10 Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 551-552

[16] D. Eklov, E. Hagersten, StatStack: Efficient modeling of LRU caches, in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS) 2010, pages 55-65

[17] M. Jägemar, S. Eldh, A. Ermedahl, B. Lisper, Towards Feedback-Based Generation of Hardware Characteristics, in 7th International Workshop on Feedback Computing, 2012.
