IT 13 048 Examensarbete 30 hp Juli 2013

Stealing the shared cache for fun and profit

Moncef Mechri

Department of Information Technology

Abstract Stealing the shared cache for fun and profit

Moncef Mechri

Cache pirating is a low-overhead method created by the Uppsala Architecture Research Team (UART) to analyze the effect of sharing a CPU cache among several cores. The cache pirate is a program that actively and carefully steals a part of the shared cache by keeping its working set in it. The target application can then be benchmarked to see its dependency on the available shared cache capacity.

The topic of this Master Thesis project is to implement a cache pirate and use it on Ericsson's systems.

Supervisor: Erik Berg. Reviewer: David Black-Schaffer. Examiner: Ivan Christoff. Sponsor: Ericsson. IT 13 048. Printed by: Reprocentralen ITC.

Contents

Acronyms 2

1 Introduction 3

2 Background information 5
   2.1 A dive into modern processors ...... 5
       2.1.1 Memory hierarchy ...... 5
       2.1.2 Virtual memory ...... 6
       2.1.3 CPU caches ...... 8
       2.1.4 Benchmarking the memory hierarchy ...... 13

3 The Cache Pirate 17
   3.1 Monitoring the Pirate ...... 18
       3.1.1 The original approach ...... 19
       3.1.2 Defeating prefetching ...... 19
       3.1.3 Timing ...... 20
   3.2 Stealing evenly from every set ...... 21
   3.3 How to steal more cache? ...... 22

4 Results 23
   4.1 Results from the micro-benchmark ...... 23
   4.2 SPEC CPU2006 results ...... 25

5 Conclusion, related and future work 27
   5.1 Related work ...... 27

Acknowledgement 29

List of Figures 30

Bibliography 31

Acronyms

LLC Last Level Cache.

LRU Least Recently Used.

OS Operating System.

PIPT Physically-Indexed, Physically-Tagged.

PIVT Physically-Indexed, Virtually-Tagged.

RAM Random Access Memory.

THP Transparent Hugepage.

TLB Translation Lookaside Buffer.

TSC Time-Stamp Counter.

UART Uppsala Architecture Research Team.

VIPT Virtually-Indexed, Physically-Tagged.

VIVT Virtually-Indexed, Virtually-Tagged.

Chapter 1

Introduction

In the past, architects and manufacturers have used several approaches to increase the performance of their CPUs: increases in clock frequency, out-of-order execution, branch prediction, complex pipelines, and other techniques have all been used extensively to improve the performance of our processors.

However, in the mid-2000s, a wall was hit: architects started to run out of room for improvement using these traditional approaches [1]. In order to keep satisfying Moore's law, the industry converged on another approach: multicore processors. If we cannot noticeably improve the performance of one CPU core, why not have several cores in one CPU?

A multicore processor embeds several processing units in one physical package, thus allowing different execution flows to run in parallel. Helped by the constant shrinkage of semiconductor fabrication processes, this has led to a transition to multicore processors, where several cores are embedded on a single CPU. As of today this transition is complete, with multicore processors everywhere, from servers to mainstream personal computers and even mobile phones [2].

This revolution has allowed processor makers to keep producing significantly faster chips. However, this does not come for free: it imposes a greater burden on programmers in order to efficiently exploit this new computational power, and new problems arose (or were at least greatly amplified). Most of them should be familiar to the reader: false/true sharing, race conditions, costly communication, scalability issues, etc.

Although multicore processors have, by definition, duplicated execution units, they still tend to share some resources, like the Last Level Cache (LLC) and the memory bandwidth. The contention for these shared resources represents another class of bottlenecks and performance issues that are much less well known.

In order to study them, UART has developed two new methods: Cache Pirating [3] and the Bandwidth Bandit [4].

The Cache Pirate is a program that actively and carefully steals a part of the shared cache. In order to know how sensitive a target application is to the amount of shared cache available, it is co-run with the Pirate. By stealing different amounts of shared cache, the performance of the target application can be expressed as a function of the available shared cache.

We will first describe how most multicore processors are organized and how caches work, and study the Freescale P4080 processor. We will then introduce the micro-benchmark, a cache and memory benchmark developed during this thesis. After that, the Pirate will be described thoroughly. Finally, experiments and results will be presented.

This work is part of a partnership between UART and Ericsson.

Chapter 2

Background information

2.1 A dive into modern processors

2.1.1 Memory hierarchy

Memory systems are organized as a memory hierarchy. Figure 2.1 shows what a memory hierarchy typically looks like. CPU registers sit at the top of this hierarchy. They are the fastest memory available, but they are also very limited in number and size.

CPU caches come next. They are used by the CPU to reduce accesses to the slower levels of the memory hierarchy (mainly the main memory). Caches will be explored more thoroughly later.

The main memory, commonly called Random Access Memory (RAM), is the next level in the hierarchy. It is off-chip memory connected to the CPU by an interconnect (AMD HyperTransport, Intel QuickPath Interconnect, etc.). Over the years, RAM has become cheaper and bigger, and mainstream devices often have several gigabytes of RAM.

Lower levels of the hierarchy, like the hard drive, are several orders of magnitude slower than even the main memory and will not be explored here.

Figure 2.1: The memory hierarchy (wikipedia.org)

2.1.2 Virtual memory

Virtual memory is a memory management technique, used by most systems nowadays, which allows every process to see a flat and contiguous address space independent of the underlying physical memory. The Operating System (OS) maintains per-process translation tables of virtual-to-physical addresses in main memory. The CPU has a special unit called the Memory Management Unit (MMU) which, in cooperation with the OS, performs the translation from virtual to physical addresses.

Since the OS's translation tables live in RAM, accessing them is slow. In order to speed up the translation, the CPU maintains a very fast translation cache called the Translation Lookaside Buffer (TLB) that caches recently-resolved addresses. The TLB can be seen as a kind of hash table where the virtual addresses are the keys and the physical addresses are the values. When a translation is required, the TLB is looked up. If a matching entry is found, we have a TLB hit and the entry is returned. If the entry is not found, we have a TLB miss. The in-memory translation table is then searched, and when a valid entry is found, the TLB is updated and the translation restarted.

Paged virtual memory

Most implementations of virtual memory use pages as the basic unit, the most common page size being 4KB. The memory is divided into page-sized chunks, and the OS translation tables are called page tables. In such a setup, the TLB does not hold translations for individual addresses. Instead, it keeps per-page translations, and once the right page is found, the lower bits of the virtual address are used as an offset into this page.
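As a small illustration (generic C, not tied to any particular architecture), splitting a virtual address into a page number and a page offset for 4KB pages is just a shift and a mask; the page number is what the TLB and page tables translate, while the offset is carried over unchanged:

    #include <stdint.h>

    #define PAGE_SHIFT 12u                       /* 4KB pages: 2^12 bytes */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    /* The page number is what the TLB/page tables translate... */
    static uint64_t page_number(uint64_t vaddr) { return vaddr >> PAGE_SHIFT; }

    /* ...while the offset within the page is used as-is. */
    static uint64_t page_offset(uint64_t vaddr) { return vaddr & (PAGE_SIZE - 1); }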

TLBs are unfortunately quite small: most TLBs have a few dozen entries, 64 being a common number, which means that with 4KB pages only 256KB of memory can be quickly addressed thanks to the TLB. Programs which require more memory will thus likely cause a lot of TLB misses, forcing the processor to waste time traversing the page tables (this process is called a page walk).

Large pages

To reduce this problem, processors can also support other page sizes. For example, modern processors support 2MB, 4MB and 1GB pages. While this can waste quite a lot of memory (because even if only one byte is needed, a whole page is allocated), it can also significantly improve performance by increasing the range of memory covered by the TLB. There is no standard way to allocate large pages: we must rely on OS-specific functionality. Linux provides two ways of allocating large pages:

• hugetlbfs: hugetlbfs is a mechanism that allows the system administrator to create a pool of large pages that are then accessible through a filesystem mount. libhugetlbfs provides both an API and tools to make allocating large pages easier [5],

• Transparent Hugepage (THP): THP is a feature that attempts to automatically back memory areas with large pages when this could be useful for performance. THP also allows the developer to request that a memory area be backed by large pages (but with no guarantee of success) through the madvise() system call, as sketched after this list. THP is much easier to use than hugetlbfs [6].
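As an illustration of the second option, the sketch below (Linux-specific, assuming 2MB transparent huge pages are available; the helper name is ours) allocates a buffer aligned on a huge page boundary and asks the kernel to back it with huge pages:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2u * 1024 * 1024)   /* assumed THP size: 2MB */

    /* Allocate 'size' bytes, hinting the kernel to use transparent huge pages. */
    static void *alloc_hinting_huge_pages(size_t size)
    {
        void *buf = NULL;

        /* Align on a huge page boundary so the kernel can promote the region. */
        if (posix_memalign(&buf, HUGE_PAGE_SIZE, size) != 0)
            return NULL;

        /* madvise(MADV_HUGEPAGE) is only a hint: there is no guarantee of
           success, exactly as described above. */
        madvise(buf, size, MADV_HUGEPAGE);
        return buf;
    }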

Although not originally designed for this purpose, large page (sometimes called huge page) support is fundamental for the Cache Pirate. We will explain why later.

2.1.3 CPU caches

Throughout the years, processors have kept getting exponentially faster, but the growth in main memory performance has not been quite as steady. To fill the gap and keep our processors busy, CPU architects have been using caches for a long time. CPU caches are a kind of memory that is small and expensive but very fast (one or two orders of magnitude faster than the main memory). Nowadays, without caches, a CPU would spend most of its time waiting for data to arrive from RAM.

Before studying some common cache organizations, it is important to note that caches don't store individual bytes. Instead, they store data blocks, called cache lines. The most common cache line size is 64B.

Direct-mapped caches

Direct-mapped caches are the simplest kind of cache. Figure 2.2 shows how a direct-mapped cache with the following characteristics is organized: size: 32KB; cache line size: 64B.

The cache is divided into sets. In a direct-mapped cache, each set can hold one cache line and a tag. This cache has 512 sets (32KB / 64B). When the processor wants to read or write main memory, the cache is searched first.

The requested address is used to locate the data in the cache. This process works as follows: bits 6-14 (9 bits) are used as an index to locate the right set. Once the set is located, the tag it contains is compared to bits 15-31 (17 bits) of the address. If they match, the requested data block is present in the cache and can be retrieved immediately. This case is called a cache hit. The 6 lower bits of the address are then used to select the desired byte within the cache block. If they don't match, the data needs to be fetched from main memory and put in the cache, evicting in the process the cache line that was cached in this set. This situation is a cache miss.

Figure 2.2: A direct-mapped cache. Size: 32KB; Cache line size: 64B

While being simple to implement and understand, direct-mapped caches have a fundamental weakness: each set only holds one data block at a time. In our example, it means that up to 2^17 data blocks can potentially be competing for the same set, leading to a lot of expensive evictions. This kind of cache miss is called a conflict miss and can be triggered even if the program's working set is small enough to fit entirely in the cache.
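To make the bit arithmetic concrete, the following sketch (illustrative code, not from the thesis) decomposes an address into its offset, set index, and tag for a cache with 64B lines and a given number of sets; with 512 sets it reproduces the direct-mapped example above, and with 128 sets it matches the 4-way cache discussed next.

    #include <stdint.h>
    #include <stdio.h>

    /* Decompose 'addr' for a cache with 64B lines and a power-of-two number
       of sets (512 for the direct-mapped example: index = bits 6-14,
       tag = bits 15 and up). */
    static void decompose(uint64_t addr, unsigned num_sets)
    {
        unsigned offset_bits = 6;                 /* log2(64B cache line) */
        unsigned index_bits = 0;
        while ((1u << index_bits) < num_sets)     /* log2(num_sets)       */
            index_bits++;

        uint64_t offset = addr & ((1ull << offset_bits) - 1);
        uint64_t index  = (addr >> offset_bits) & ((1ull << index_bits) - 1);
        uint64_t tag    = addr >> (offset_bits + index_bits);

        printf("0x%llx -> tag 0x%llx, set %llu, byte %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)index, (unsigned long long)offset);
    }

    int main(void)
    {
        decompose(0x0001ABCDull, 512);  /* direct-mapped cache: 512 sets */
        decompose(0x0001ABCDull, 128);  /* 4-way cache:         128 sets */
        return 0;
    }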

Set-associative caches

A practical way to mitigate conflict misses is to make each set hold several blocks. A cache is said to be N-way set-associative if each set can hold N cache lines (and their corresponding tags). These entries in the set are also called ways. Figure 2.3 shows what the cache from Figure 2.2 would look like if it were made 4-way set-associative. Since each set now holds 4 cache lines instead of 1, there are 128 sets (32KB / 64B / 4). On a request, bits 6-12 (7 bits) are used to index the cache (find the right set). Once the set is located, the 4 tags it contains are compared to bits 13-31 (19 bits).

Figure 2.3: A 4-way set-associative cache. Size: 32KB; Cache line size: 64B

If one of them matches, it's a cache hit, and the desired byte is extracted from the corresponding cache line using bits 0-5. If no tag matches, it's a cache miss, and the data needs to be fetched from main memory. The freshly-fetched block is then cached. To do so, it is necessary to evict one block from the set (more on that later).

Set-associativity is an effective way to considerably decrease the number of conflict misses. However, it does not come for free: costly hardware (in terms of size, price, and power consumption) must be added to select the correct block within a set as fast as possible.

Fully-associative caches

Fully-associative caches are a special case of set-associative caches with only one set. Data can end up anywhere in this set, thus completely eliminating conflict misses. Fully-associative caches are very expensive because, in order to be efficient, they must be able to compare all the tags contained in the set in parallel, which is costly. For this reason, only very small caches, like L1 TLBs, are made fully associative.

Replacement policy

Each time a new block is brought into the cache, an old one needs to be evicted. But which one? Several strategies [7] (or replacement policies) exist to choose the cache line to evict, called the victim. The most common strategy is called Least Recently Used (LRU). This policy assumes that data used recently will likely be reused soon (temporal locality) and therefore, as implied by its name, chooses the least recently used cache line as the victim.

To implement this strategy, each set maintains age bits for each of its cache lines. Each time a cache line is accessed, it becomes the youngest cache line in the set and the other cache lines age. Keeping track of the exact age of each cache line becomes quite expensive as the associativity grows. In practice, instead of a strict LRU policy, processors tend to use an approximation of LRU.
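The following software model (a simplification for illustration, not how the hardware is actually wired) shows the age-counter idea for a single 4-way set: on every access the touched way becomes age 0, the ways that were younger than it age by one, and the victim is the way with the highest age.

    #define WAYS 4

    /* ages[w] == 0 means way w was touched most recently. */
    static unsigned ages[WAYS] = {0, 1, 2, 3};

    /* Record an access to one way: it becomes the youngest, the ways that
       were younger than it age by one. */
    static void touch(unsigned way)
    {
        for (unsigned w = 0; w < WAYS; w++)
            if (ages[w] < ages[way])
                ages[w]++;
        ages[way] = 0;
    }

    /* The victim is the least recently used way, i.e. the oldest one. */
    static unsigned victim(void)
    {
        unsigned oldest = 0;
        for (unsigned w = 1; w < WAYS; w++)
            if (ages[w] > ages[oldest])
                oldest = w;
        return oldest;
    }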

Hardware prefetching

To hide memory latency even further, processor manufacturers use another technique: prefetching. The idea behind prefetching is to try to guess which data will be used soon and prefetch it so that it is already cached when actually needed. To do so, the hardware prefetcher, which is part of the CPU, constantly looks for common memory access patterns. Here is a common pattern that every prefetcher should be able to detect:

    int array[1000];
    for (int i = 0; i < 1000; ++i) {
        array[i] = 10;
    }

This code simply traverses an array linearly. When executing it, the prefetcher will detect the linear memory accesses and will fetch the data ahead of time, before they are needed. A prefetcher should also be able to detect strides between memory accesses. For example, the prefetcher would also successfully prefetch the data accessed by this code:

    int array[1000];
    for (int i = 0; i < 1000; i += 2) {
        array[i] = 10;
    }

Virtual or physical addresses?

This section discusses which address (virtual or physical) is used to address a cache and the consequences of this choice. This is important because the Pirate needs to control where its data end up in the cache.

We said earlier that some bits of the address are used to index the cache and some other bits are used as the tag to identify cache lines with the same set index. But are we talking about the virtual or the physical address? The answer is: it depends. CPU manufacturers are free to choose which combination of virtual and physical addresses they use for the set index and the tag. Depending on this choice, a cache can be:

• Virtually-Indexed, Virtually-Tagged (VIVT): The virtual address is used for both the index and the tag. This has the advantage of a low latency: since the physical address is not required at all, there is no need to request it from the MMU (which can take time, especially in the event of a TLB miss). However, this does not come without problems, since different virtual addresses can refer to the same physical location (aliasing), and identical virtual addresses can refer to different physical locations (homonyms).

• Physically-Indexed, Physically-Tagged (PIPT): The physical address is used for both the index and tag. This scheme avoids aliasing and homonyms problems altogether but is also slow since nothing can be done until the physical address has been served by the MMU.

• Virtually-Indexed, Physically-Tagged (VIPT): The virtual address is used to index the cache and the physical address is used for the tag. This scheme has a lower latency than PIPT because indexing the cache and retrieving the physical address can be done in parallel, since the virtual address already gives us the set index.

The last possible combination, Physically-Indexed, Virtually-Tagged (PIVT), is of little use and won't be discussed.

Generally, smaller caches such as L1 caches use VIPT because of its lower latency, while larger caches such as LLCs use PIPT to avoid dealing with aliasing.

2.1.4 Benchmarking the memory hierarchy

In order to understand how exactly a memory hierarchy works, a micro-benchmark has been developed during this thesis work. The benchmark works as follows: its main data structure is an array of struct elem whose size is specified as a command-line parameter. The C structure elem is defined like this:

    struct elem {
        struct elem *next;
        long int pad[NPAD];
    };

The next pointer points to the next element to visit in the array (note that a pointer is 4 bytes on 32-bit systems but 8 bytes on 64-bit systems). The pad member is used to pad the structure in order to control the size of an element. Varying the padding can greatly change the benchmark's behavior and is done by choosing an appropriate value for NPAD at compile time. An interesting choice is to adjust the padding so that the size of the elem structure matches the cache line size: assuming a 64B cache line, and considering that the long int type is 4 bytes on 32-bit Linux and 8 bytes on 64-bit Linux, choosing NPAD=15 on a 32-bit system or NPAD=7 on a 64-bit system makes each element 64 bytes large.

The elements are linked together using the next pointer. There are 2 ways to link the elements:

• Sequentially: The elements are linked sequentially and circularly, so the last element must point to the first one,

• Randomly: The elements are randomly linked but we must make sure that each element will be visited exactly once to avoid looping only on a subset of the array.

The benchmark then times how long it takes to traverse the array (using only the next pointers) a fixed number of times. Finally, dividing the total time by the number of iterations and the number of elements in the array gives the average access time per element. The array size and average access time are then stored in a results file. In its current form, each execution of the benchmark works with only one array size. To try different array sizes, it is recommended to use a shell "for" loop. A script that parses the results files and plots them using gnuplot is also provided.
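The core of the benchmark can be sketched as follows (a condensed illustration, not the exact thesis code; NPAD is fixed for a 64-bit system and the default sizes are arbitrary). The random linking builds one cycle covering every element, so a pointer chase visits each element exactly once per lap:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NPAD 7                 /* one 64B cache line per element (64-bit) */

    struct elem {
        struct elem *next;
        long int pad[NPAD];
    };

    /* Shuffle a visiting order and link consecutive elements circularly:
       this always yields a single cycle over the whole array. */
    static void link_randomly(struct elem *a, size_t n)
    {
        size_t *order = malloc(n * sizeof *order);
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < n; i++)
            a[order[i]].next = &a[order[(i + 1) % n]];
        free(order);
    }

    int main(int argc, char **argv)
    {
        size_t n = (argc > 1) ? strtoul(argv[1], NULL, 0) : (1u << 15);
        size_t iters = 100;
        struct elem *a = malloc(n * sizeof *a);

        link_randomly(a, n);

        struct timespec t0, t1;
        volatile struct elem *p = &a[0];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t it = 0; it < iters; it++)
            for (size_t i = 0; i < n; i++)
                p = p->next;                  /* the dependent loads we time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu elements: %.2f ns/access\n", n, sec * 1e9 / (double)(iters * n));

        free(a);
        return 0;
    }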

The Freescale P4080

Figure 2.4: Freescale P4080 Block Diagram

The Freescale P4080 is an eight-core PowerPC processor. The development systems provided by Freescale run Linux, while the production systems use a commercial real-time OS.

Each core runs at 1.5GHz and has a 32KB L1 data cache and a 32KB L1 instruction cache, as well as a 128KB unified L2 cache. The eight cores share a 2MB L3 cache (split into two 1MB chunks) as well as two DDR3 memory controllers.

Figure 2.5 displays the micro-benchmark results when configured to occupy one full cache line per element.

Figure 2.5: micro-benchmark run on the Freescale P4080

We see that up to a working set of 32KB, the access time is very low. Beyond this point, the benchmark is slowed down because the working set no longer fits in the L1 cache. A second performance drop occurs when the working set grows beyond 128KB, which matches the size of the L2 cache. Above this point, the working set becomes too big for the L2 cache and we start to hit in the L3 cache, which has significantly higher latencies. Another performance penalty occurs when the working set size exceeds 2MB, the size of the L3 cache. Beyond this point, the working set does not fit in the caches anymore and every element needs to be fetched from main memory.

The characteristics of the memory hierarchy, such as the size of each cache level, are thus clearly exhibited. Many more details about the memory hierarchy can be found in Drepper [13].

Time measurement

In order to get valid results, it is crucial to be able to measure time accurately.

POSIX systems such as Linux which implement the POSIX Realtime Extensions provide the clock_gettime() function, which has a high enough resolution for our purposes [12]. Since the commercial OS used by Ericsson does not provide a high-resolution timing function, a timing function that reads the Time-Stamp Counter (TSC) has been written for it.
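On Linux, a small helper along these lines is sufficient (a sketch; the function name is ours and CLOCK_MONOTONIC is chosen to stay unaffected by clock adjustments):

    #include <time.h>

    /* Monotonic timestamp in seconds, with nanosecond resolution. */
    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    }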

Finally, it is strongly advised to disable any frequency scaling feature when running the benchmark since it could affect the results.

Chapter 3

The Cache Pirate

As we showed previously, modern multicore processors tend to have one or several private caches per core as well as one big cache shared by all the cores. Sharing a cache between cores has several advantages. Some of them are:

• Each core can, at any time, use as much shared cache space as available. This is in contrast with private caches where any unused space cannot be reused by other cores,

• Faster/simpler data sharing and cache-coherency operations by using the shared cache

However, sharing a cache between cores also means that there can be contention for this precious resource, with several cores competing for space in the cache and evicting each other's data. As we saw in the previous section, the LLC is the last rampart before having to hit the RAM, which has significantly longer access times. It is therefore important to understand how sharing the LLC affects a specific application. In order to study this, the UART team has created an original approach: the Cache Pirate. The Pirate is an application that runs on one core and actively steals a part of the shared cache to make it look smaller than it actually is. We then co-run a target application with the Pirate and measure its performance with whatever metric we are interested in (execution time, frames per second, requests per second, and so on). If we repeat this operation for different cache sizes, we can express the performance of our application as a function of the shared cache size.

The Pirate works by looping through a working set of the desired size in order to prevent it from being evicted from the cache. The size of this working set represents the amount of cache we are trying to steal. The Pirate must:

1. steal the desired amount of cache (no more, no less),

2. steal evenly from every set in the cache,

3. not use any other resource, such as the memory bandwidth.

Property 1 is self-explanatory: We want the pirate to steal the amount of cache we have asked for.

Property 2 is important because we want the cache to look smaller by reducing its associativity. In order to do that, the Pirate needs to steal the same number of ways in every set. Not doing so could affect applications that have hotspots in the cache. Note that this is the approach processor manufacturers tend to use when they create a lower-end processor with a smaller cache than their higher-end parts (for example, the L3 cache on Intel processors is made of 2MB 16-way set-associative chunks; to create a processor with a 3MB L3 cache, they would use two such chunks but with 4 ways disabled in each, reducing the size from 4MB to 3MB).

Property 3 is also important because the Pirate must only use the shared cache. Other resources, like the memory bandwidth, should be left untouched or the results will be biased.

Ensuring that the Pirate actually has these properties is the main challenge of this work.

3.1 Monitoring the Pirate

We have already said that in order to steal cache, the Pirate loops through its working set to keep it in the shared cache. However, the Pirate will be contending with the rest of the system for space in this cache, which means that its data could get evicted if they become marked as least recently used. This would lead to two problems:

• The Pirate would not be stealing the desired amount of cache (thus violating the first property), and

• It would need to fetch the working set (or part of it) from the next level of the memory hierarchy, which is the RAM. This implies using memory bandwidth, and since this resource is also shared with the other cores, it reduces the memory bandwidth available to them, also violating property 3.

A key point is that this issue gets amplified when we try to steal more cache, because increasing the working set size reduces the access rate, i.e. the rate at which each element gets touched. If the access rate becomes too low, the data could become the least recently used and therefore get evicted from the cache. We thus need to check whether the Pirate is able to keep its working set cached or not.

3.1.1 The original approach

In their original paper, the UART team created a Pirate that loops sequentially through its working set. To check that the Pirate is not using memory bandwidth, a naive approach would be to check the LLC's miss ratio (the ratio of memory accesses that cause a cache miss). Missing in the last-level cache implies fetching data from the RAM, so this approach looks reasonable at first glance.

However, we saw that processors implement prefetching in order to hide memory latency. In this case, a prefetcher could easily detect the sequential accesses the Pirate is doing and prefetch data from main memory before they are needed. This would drastically reduce the miss ratio and thus hide the fact that the Pirate is using memory bandwidth.

To get around that, the UART team uses another metric called the fetch ratio, which is the ratio of memory accesses that cause a fetch from main memory [8].

3.1.2 Defeating prefetching

However, the fetch ratio is a metric that is much less well known than the miss ratio. Some systems might also not provide the performance events necessary for measuring it, or those events might only be collectible system-wide, which is the case on modern Intel processors where the L3 cache and the memory controller are located in an "uncore" part that provides no easy way to collect data on a per-core basis [10].

To avoid having to measure the fetch ratio, a different approach has been used in this work: the Pirate written during this thesis iterates through its working set randomly. This change is fundamental: since the prefetchers are unable to recognize a typical access pattern, they cannot guess which data will be needed soon and therefore will not prefetch any data. Now that the prefetchers no longer get in the way, the Pirate's miss ratio can be used to monitor it: the miss ratio should stay as close to 0 as possible, otherwise the Pirate is fetching data from the RAM.

Monitoring the miss ratio must be done with a profiler able to read the performance counters, such as Linux’s perf tool [9].

Unfortunately, compared to the original approach, defeating the prefetchers could hurt the Pirate's performance, since its data can no longer be prefetched from the LLC, leading to a lower access rate.

3.1.3 Timing

On some systems, L3 performance events are also not easily collectible for each core. To work around this issue, another way of monitoring the Pirate has been devised: we first start the Pirate while the system is idle. The Pirate then benchmarks the time it takes to iterate over its working set. Once this is done, the target application can be started. The Pirate keeps timing itself and compares the result to the reference time. If the difference between the reference time and the new time exceeds a threshold (the current default is 10%, but it can be changed via an optional command-line argument), it means that the Pirate's data got evicted from the cache and had to be read back from the RAM; in other words, the Pirate was not able to keep its working set cached. This works because, as we saw in the previous chapter, the RAM has significantly higher latencies than the caches, so missing in the last-level cache leads to a significantly higher running time for the Pirate, which we can use to detect when the Pirate starts to use memory bandwidth.
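The monitoring logic can be sketched as follows (the helper names chase_working_set() and now_seconds() are placeholders for this example, not the actual Pirate code; the 10% threshold mirrors the default mentioned above):

    #include <stdio.h>

    /* Placeholders: one pass over the Pirate's working set, and a monotonic
       timer in seconds (clock_gettime() on Linux, the TSC elsewhere). */
    extern void   chase_working_set(void);
    extern double now_seconds(void);

    void monitor_pirate(void)
    {
        /* Reference measured once, while the system is otherwise idle. */
        double t0 = now_seconds();
        chase_working_set();
        double reference = now_seconds() - t0;

        /* Stealing phase: the target application is now running. */
        for (;;) {
            t0 = now_seconds();
            chase_working_set();
            double elapsed = now_seconds() - t0;

            /* A pass noticeably slower than the reference means part of the
               working set was evicted and had to be re-fetched from RAM. */
            if (elapsed > reference * 1.10)
                fprintf(stderr, "pirate: working set no longer cached "
                                "(%.4f s vs %.4f s)\n", elapsed, reference);
        }
    }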

As before, timing is done using the clock_gettime() function on Linux, or by manually reading the CPU's TSC on the OS used in production. This method has several advantages:

• It is very easy to implement

• It is completely cross-platform for systems supporting the clock_gettime() function (and easily portable to others)

• It does not use any CPU-dependent feature (apart from the TSC if we read it manually) such as performance counters

• It can be used on systems that do not provide an easy way to get per-core LLC events

3.2 Stealing evenly from every set

We saw previously that caches are split into sets, which are themselves split into ways. Some applications might have hotspots in the cache, meaning that their working set is concentrated in a small part of the cache. Failing to steal evenly from every set could then cause the target application either to be unaffected by the Pirate (if the Pirate steals nothing, or too little, from the hotspots) or to be affected too much (if the Pirate steals too much from them). Ideally, we would like to steal the same number of ways in every set. As said previously, to determine where each datum is (or will end up) in the cache, a part of its address is used as an index. From this, we can derive that a way to control in which set a datum will end up is to choose its address manually, or at least the part that will be used as the index. This can easily be done for virtually-indexed caches by using an aligned allocation routine such as posix_memalign() to craft a suitable virtual address [11]. The Pirate's working set is organized as follows: as with the micro-benchmark, the working set is a densely-packed linked list, and each element occupies a full cache line. The working set is placed in memory such that the first element ends up in the first set of the cache, the second in the second set, and so on until we have stolen one way in every set, at which point the addresses of the following elements (if we want to steal more than one way) wrap around.

Example: suppose we have the same cache as in the previous section (size: 32KB; cache line size: 64B; associativity: 4; number of sets: 128; 32-bit addresses). We saw that bits 0-5 of an address are used to select a byte within a cache line and that bits 6-12 are used to index the cache. Thus, to use a way in the first set, the first element of the working set should have an address with bits 0-12 (13 bits) unset. To achieve this, the working set needs to be aligned on an 8KB boundary (2^13 bytes).

However, LLCs are physically indexed, which means, as we said before, that the physical address is used to index the cache. Thus, if we want to control where data end up in the cache, we need to control the physical addresses. This is a problem because, since the memory is paged, there is no guarantee that the pages will be contiguous in physical memory, which prevents us from controlling the addresses of the elements in the working set, which in turn prevents us from placing elements in the cache as we wish. A way to work around this issue is to use large memory pages, in order to be sure that the working set is allocated as a contiguous block of physical memory. Coupled with the use of an aligned allocation routine as explained above, this lets us know where each datum will end up in the cache.
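For the example cache above, the allocation could look like the following sketch (illustrative constants and names, not the thesis implementation; it assumes 2MB transparent huge pages). Aligning on a huge page boundary also aligns on the 8KB way stride, and if the kernel does back the buffer with huge pages, the low 21 address bits are identical in the virtual and physical addresses, so the set placement also holds for the physically-indexed LLC:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LINE_SIZE   64
    #define NUM_SETS    128                     /* example cache: 128 sets   */
    #define WAY_STRIDE  (NUM_SETS * LINE_SIZE)  /* 8KB: one way in every set */
    #define HUGE_PAGE   (2u * 1024 * 1024)      /* assumed THP size          */

    /* One element per cache line, as in the micro-benchmark. */
    struct line {
        struct line *next;
        char pad[LINE_SIZE - sizeof(struct line *)];
    };

    /* Allocate a working set covering 'ways' ways in every set: element 0
       maps to set 0, element 1 to set 1, ..., wrapping around every
       NUM_SETS elements. */
    static struct line *alloc_working_set(unsigned ways)
    {
        size_t size = (size_t)ways * WAY_STRIDE;
        void *buf = NULL;

        /* Round up so the region is large enough for the kernel to use at
           least one huge page. */
        size = (size + HUGE_PAGE - 1) & ~(size_t)(HUGE_PAGE - 1);

        if (posix_memalign(&buf, HUGE_PAGE, size) != 0)
            return NULL;
        madvise(buf, size, MADV_HUGEPAGE);      /* hint only, may be ignored */

        return (struct line *)buf;
    }

The elements would then be linked randomly, as in section 3.1.2, and chased continuously to keep them resident.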

3.3 How to steal more cache?

We said before that when the Pirate's working set size increases, the access rate decreases, down to a point where the rest of the system fights back hard enough to start evicting the Pirate's data. This puts a limit on how much cache the Pirate can steal (which depends heavily on how hard the target application and the rest of the system fight back). To go beyond this limit, we can run several Pirates. By splitting the working set between the instances, each of them has a smaller working set to iterate on, and thus a higher access rate. However, when contending for space in the shared cache, the different instances make no distinction between each other and the rest of the system, which means that they will also fight each other. Furthermore, since each Pirate should run on its own core, this further reduces the number of cores available to the target application.

Chapter 4

Results

4.1 Results from the micro-benchmark

The micro-benchmark developed during this thesis work was created to get a better understanding of how the memory hierarchy works, but also to validate the Pirate. The idea is that the micro-benchmark can expose the size and latency of each level of the memory system. Thus, if we steal a part of the LLC with the Pirate and co-run it with the micro-benchmark, the benchmark should show that the LLC indeed looks smaller. This is what has been done, and Figure 4.1 shows the result.

Figure 4.1: micro-benchmark co-run with Pirates stealing different cache sizes

The tests were run on the Freescale P4080 (8 cores, 32KB L1, 128KB L2, 2MB L3). The benchmark uses a random access pattern to defeat prefetching and large memory pages to hide TLB effects. The red line shows the benchmark results with no Pirate running. Each level of the memory hierarchy is visible. In particular, we can notice that the benchmark gets slower when the working set size becomes larger than 2MB (2^21 bytes), meaning that we are hitting the main memory. The green line shows what happens when we run the same benchmark with a Pirate stealing 1MB of cache. We can see that the benchmark gets slowed down as soon as its working set becomes bigger than 1MB (2^20 bytes). This is expected (and desired!), since the Pirate is already stealing 1MB, thus reducing the cache capacity available to the benchmark to 1MB. The blue line represents the benchmark results when a Pirate is trying to steal 1.5MB. We would expect the slowdown to appear above 512KB (2^19 bytes), but the line actually looks very similar to the green line. This is because the Pirate is not able to steal that much cache. Finally, the purple line shows what happens when we run two Pirates, stealing 0.75MB each (for a total of 1.5MB). Now the knee does appear when the working set grows above 512KB, which means that the benchmark only has access to 512KB of LLC space.

4.2 SPEC CPU2006 results

Several benchmarks from the SPEC CPU2006 benchmark suite were co-run with a Pirate stealing different cache sizes. Table 4.1 shows the execution times for the 401.bzip2 benchmark when contending with the Pirate.

Amount of cache available    Execution time (seconds)
2MB                          33.5
1.75MB                       34.9
1.5MB                        35.9
1MB                          37.4
512KB                        40.2

Table 4.1: Results for 401.bzip2 co-run with the Pirate stealing different cache sizes

We can see that although the Pirate and the benchmark obviously do not share any data, the benchmark is slowed down when the Pirate steals cache. Up to 1.5MB out of the 2MB LLC was stolen; at that point the performance had dropped by 20%.

Amount of cache available    Execution time (seconds)
2MB                          58.8
1.75MB                       61.7
1.5MB                        64.1
1MB                          68.4
512KB                        74.4

Table 4.2: Results for 433.milc co-run with the Pirate stealing different cache sizes

Table 4.2 shows the results for 433.milc. Here, the performance drop went as far as 26.6%.

The large slowdown caused by reducing the shared cache capacity tells us that these applications rely heavily on the LLC. When the shared cache capacity is reduced, they start to fetch data from RAM instead, which has been shown to be significantly slower.

Nevertheless, other benchmarks, such as 470.lbm, were much less sensitive to the reduction in shared cache capacity.

Chapter 5

Conclusion, related and future work

The goal of this work was to implement the Cache Pirating method on Ericsson's systems in order to study the performance effects of sharing a cache between cores. During this thesis, a Pirate has been written that can run both on Linux and on the commercial OS used by Ericsson, and it has been successfully used to test several applications, both from the SPEC CPU2006 benchmark suite and from Ericsson's own programs. Compared to the original work done by the UART team, some different choices were made: the Pirate written as part of this work uses a random access pattern in order to defeat prefetching, which makes monitoring easier. Monitoring itself is done by continuously timing the Pirate and looking for sudden slowdowns.

5.1 Related work

Multicore processors have become the norm and the need for performance keeps rising, so studying the effects of sharing resources in a multicore processor is of primary importance. In the same vein as the Cache Pirate, the UART team created the Bandwidth Bandit, which focuses on stealing only the memory bandwidth [4].

U. Drepper [13] provides comprehensive and detailed information about the memory hierarchy, performance analysis, operating systems, and what a programmer can do to optimize a program with regard to the memory hierarchy. Wulf and McKee [14] have shown that the increasing gap between CPU speed and RAM speed would become a major problem, up to the point where memory speed becomes the major bottleneck.

Others have studied the effect of sharing a cache between cores. StatCC [15] is a statistical cache contention model, also created by the UART team, that takes as input a reuse distance distribution collected beforehand and leverages StatStack [16], a probabilistic cache model, to estimate the cache miss ratio of co-scheduled applications and their CPIs. The Mälardalen Real-Time Research Center (MRTC) has been working on feedback-based generation of hardware characteristics, where a model is derived from running a production system. This model is then used to generate a similar load on different parts of the hardware, such as the caches, without running the real production system [17].

Acknowledgments

I would like to thank David Black-Schaffer and Erik Berg for giving me the opportunity to work on this interesting project, as well as for their supervision during this work. I would also like to thank all the members of the Quantum team at Ericsson for their precious help. Finally, I would like to thank my friends and my family for their unconditional support and faith.

List of Figures

2.1 The memory hierarchy ...... 6
2.2 A direct-mapped cache ...... 9
2.3 A 4-way set-associative cache ...... 10
2.4 Freescale P4080 Block Diagram ...... 14
2.5 micro-benchmark run on the P4080 ...... 15

4.1 micro-benchmark co-run with Pirates stealing different cache sizes ...... 24

Bibliography

[1] Herb Sutter, The free lunch is over, 2005. Date accessed: 2013-07-24. http://www.gotw.ca/publications/concurrency-ddj.htm

[2] Herb Sutter, Welcome to the hardware jungle, 2011. Date accessed: 2013-07-24. http://herbsutter.com/welcome-to-the-jungle/

[3] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Cache Pirating: Measuring the curse of the shared cache, in 2011 International Conference on Parallel Processing (ICPP), pages 165-175.

[4] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Bandwidth Bandit: Understanding memory contention, in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2012, pages 116-117.

[5] LWN.net, Huge pages: Interfaces, Date accessed: 2013-07-24. https://lwn.net/Articles/375096/

[6] Linux documentation, Transparent Hugepage support, Date accessed: 2013-07-24. http://lwn.net/Articles/423592/

[7] Wikipedia, Cache algorithms, Date accessed: 2013-07-24. http://en.wikipedia.org/wiki/Cache_algorithms

[8] Rogue Wave, Fetch ratio, Date accessed: 2013-07-24. http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch03s08.html

[9] Kernel.org, Linux perf tool, Date accessed: 2013-07-24. https://perf.wiki.kernel.org/index.php/Main_Page

[10] Intel, Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors, p. 13, section "Uncore performance monitoring (PMU)".

[11] Linux Manpages, posix_memalign() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/posix_memalign

[12] Linux Manpages, clock_gettime() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/clock_gettime

[13] Ulrich Drepper, What every programmer should know about memory, Date accessed: 2013-07-24. www.akkadia.org/drepper/cpumemory.pdf

[14] Wm. A. Wulf, Sally A. McKee, Hitting the memory wall: implications of the obvious, in ACM SIGARCH Computer Architecture News, Volume 23, Issue 1, March 1995, pages 20-24.

[15] D. Eklov, D. Black-Schaffer, E. Hagersten, StatCC: A statistical cache contention model, in PACT ’10 Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 551-552

[16] D. Eklov, E. Hagersten, StatStack: Efficient modeling of LRU caches, in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS) 2010, pages 55-65

[17] M. Jägemar, S. Eldh, A. Ermedahl, B. Lisper, Towards Feedback-Based Generation of Hardware Characteristics, in 7th International Workshop on Feedback Computing, 2012.
