EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

A Dissertation Presented by

Akshay Lahiry

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University Boston, Massachusetts

August 2018

Contents

List of Figures

List of Tables

Abstract of the Dissertation

1 Introduction
  1.1 Compressed Cache Architecture
  1.2 Compression Algorithms

2 Background
  2.1 GPU Memory Hierarchy
  2.2 Compression Algorithms
  2.3 Compressed Cache Architecture
      2.3.1 Smart Caches
  2.4 DRAM Efficiency

3 Framework and Metrics

4 Results
  4.1 Dual Dictionary Compressor
      4.1.1 Cache Architecture
      4.1.2 Dictionary Structure
      4.1.3 Dictionary Index Swap
      4.1.4 Dictionary Replacement
      4.1.5 DDC Performance
      4.1.6 Summary
  4.2 Compression Aware Victim Cache
      4.2.1 Design Challenges
      4.2.2 Results
      4.2.3 Summary
  4.3 Smart Cache Controller
      4.3.1 Smart Compression
      4.3.2 Smart Decompression
      4.3.3 Smart Prefetch
      4.3.4 Summary

5 Compression on Graphics Hardware
  5.1 Geometry Compression
  5.2 Texture Compression
  5.3 Depth and Color Compression
  5.4 Compute Workloads
  5.5 Key Observations

6 Conclusion

Bibliography

List of Figures

1.1 Total number of pixels for common display resolutions
1.2 Total bytes per frame for common display resolutions
1.3 Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011
2.1 Block diagram showing the ratio of ALU to cache hardware
2.2 Block diagram showing cache hierarchy in a CPU and GPU
2.3 Compression example using Frequent Value Compression
2.4 Compression example using CPack
2.5 Fragmentation example for a fixed compaction scheme
2.6 Fragmentation with less restriction on compressed block placement
2.7 Compaction with subblocks in the data array
2.8 Compaction with decoupled subblocks and super tags
2.9 Bytemasked writes with uncompressed data
2.10 Bytemasked writes with compressed data
2.11 Feature table for a hashed perceptron prediction
2.12 Example DRAM address mapping
2.13 Efficient DRAM access pattern
2.14 Inefficient DRAM access pattern
3.1 Block diagram of the simulation framework
3.2 Sample simulation output for the IceStorm benchmark
4.1 A cache block
4.2 Block diagram of the LLC with our dual dictionary
4.3 Output from the Dictionary Index Swap Unit
4.4 The write BW savings
4.5 The read BW savings
4.6 Block diagram of the LLC with victim cache
4.7 Super tag structure for the victim cache
4.8 Byte-masked write with uncompressed data
4.9 Byte-masked write in a compressed cache
4.10 Burst efficiency with EBU on and off
4.11 Row buffer hit rate with Victim EBU on and off
4.12 Conventional compressed cache
4.13 Smart compressed cache
4.14 Results for smart data compression. For each workload, we show the oracle performance as our baseline. We also plot the wasteful compressions that were identified using our model as a percentage of the oracle, as well as the percent of false positives as a percentage of total predictions
4.15 Results for smart data decompression. For each workload, we show the oracle performance as our baseline. We also plot the repeated decompressions that were eliminated using our smart decompressor as a percentage of the oracle
4.16 Results for smart prefetching. For each workload we show: the efficiency with compression turned off (the baseline), the memory efficiency drop with compression turned on, and the efficiency improvements when using smart prefetching
4.17 Area impact of feature tables
5.1 A simplified Direct3D pipeline
5.2 Texture map example by Elise Tarsa [66]
5.3 Bandwidth savings - AES benchmark
5.4 Bandwidth savings - Compute workloads
5.5 Bus utilization - Compute workloads

List of Tables

2.1 CPack code table
2.2 Frequent Pattern Encoding table as seen in [25]
2.3 Comparison of popular hardware compression algorithms
3.1 Graphics workloads
3.2 Compute workloads
4.1 Updated Pattern Table
4.2 The compressed word distribution

Abstract of the Dissertation

EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

by Akshay Lahiry

Doctor of Philosophy in Computer Engineering
Northeastern University, August 2018
Dr. David Kaeli, Advisor

As game developers push the limits of graphics processors (GPUs) in their quest to achieve photorealism, modern games are becoming increasingly memory bound. At the same time, display vendors are pushing the boundaries of display technology with ultra-high resolution displays and high dynamic range (HDR). This means the GPU not only needs to render more pixels, but also needs to process significantly more data per pixel for high-quality rendering. The advent of mainstream virtual reality (VR) also raises the minimum frame rate for these displays, which puts a lot of pressure on the GPU memory system. GPUs have also evolved to be used as accelerators in high performance computing systems. Given their data-parallel throughput, many compute-intensive applications have benefited from GPU acceleration. For both classes of workloads, increasing the cache size helps alleviate some of the memory pressure by caching frequently used data on-chip. However, die area on modern chips comes at a premium. Data compression is one approach to manage the data footprint problem. Data compression in the last level cache (LLC) can help achieve the performance of a much larger cache while utilizing significantly less die area. In this thesis, we address the challenges of using on-chip data compression and explore novel methods to arrive at performant solutions for both graphics and compute. We also highlight some unique compression requirements for graphics workloads and how they contrast with prior cache compression algorithms.

Chapter 1

Introduction

Over the past few years, the number of pixels in a video display has increased exponentially. The graphics display industry has quickly moved from resolutions as low as 480p to stunning 8K displays. Figure 1.1 shows some common display resolutions, ranging from half a million pixels for older SVGA displays to thirty-three million pixels for current state-of-the-art 8K displays.

Figure 1.1: Total number of pixels for common display resolutions.


Rendering a growing number of pixels significantly increases the amount of data associated with each frame. Figure 1.2 shows how the total bytes per frame scales, from a few kilobytes for low resolution displays to thirty megabytes for a modern high resolution display. With the introduction of High Dynamic Range (HDR) displays, the amount of data associated with each pixel has also increased, as more bits are used to represent each channel of the pixel [1, 2]. As these ultra-high resolution and high dynamic range displays become ubiquitous, the data footprint of modern games will increase rapidly.

Figure 1.2: Total bytes per frame for common display resolutions.

Figure 1.3 shows two frames from the game Halo: Combat Evolved. The frame on the left is from the original 2001 release, and the one on the right is the remastered frame from the game's 2011 anniversary update. The images show a significant improvement in the level of detail per frame. Game developers push modern Graphics Processors (GPUs) to their performance limits in their quest for photorealism. As virtual reality (VR) headsets become more mainstream, the quest for realism goes even further, to enable the illusion of reality in the virtual world. High end VR headsets have two high resolution screens that independently render immersive content. This puts a lot of pressure on the GPU to render high resolution images in real time. A single dropped frame due to memory latency can ruin the immersive experience.


Figure 1.3: Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011.

The trends described above, and the exponential increase in the number of bytes per frame, have significantly increased the pressure on the GPU's cache hierarchy and memory. As access to memory is extremely slow, GPUs have been increasing the size of on-chip caches to avoid costly cache misses and improve performance. This results in increased power consumption and on-chip area, which is not ideal. Modern manufacturing nodes have significant yield issues with larger chips, which become increasingly expensive to manufacture. As the area on the chip comes at a premium, over-provisioning caches will have an adverse impact. Every millimeter of chip area is valuable, and graphics chip companies make every effort to improve performance per mm². Having more data in flight also increases bandwidth contention on a GPU, which can increase the cost of a cache miss. The on-chip area must be used more efficiently to handle the memory footprint issue. Game developers acknowledge this issue and expect this trend to continue [3] in high quality games. Data compression is an effective way to improve cache capacity and reduce bandwidth contention.

Compression in the Last-Level Cache (LLC) can increase the logical cache size without significantly increasing the hardware area. This can lead to better cache performance due to a smaller data footprint, as well as a power reduction due to higher cache hit rates and fewer main memory transactions. We focus on compression in the last level cache because the miss penalty grows dramatically beyond this level of the memory hierarchy. The downside of compression and decompression is higher access latency, which must be offset by higher hit rates. If the compressed block and the compression metadata can be stored off-chip efficiently, then traffic on the memory bus is also significantly reduced, improving bandwidth utilization. Data compression in the memory hierarchy can therefore help cache performance in modern memory-bound applications.

While the benefits of compression in the LLC are significant, new design challenges arise in managing compressed data in hardware. Compression algorithms frequently produce variable sized data. Storing variable sized data efficiently in a conventional cache is difficult due to internal fragmentation of the data lines, so efficient data compaction techniques are needed to utilize all of the on-chip area. Compression/decompression operations, as well as the compaction algorithm, can add latency to the critical path and negate the performance gains provided by the logical increase in cache capacity. Partial writes to a compressed cache block can change the size of the compressed data and trigger an expensive recompaction operation on the data stored in the cache. Another issue with variable sized compressed data in the cache is that it changes the data access pattern from the LLC to the DRAM page. When accessing compressed data, each cache line no longer contains the same amount of data, so data evicted from the LLC will have variable burst sizes to the DRAM pages. This results in inefficiencies in terms of performance and power consumption.

The choice of compression algorithm also impacts the hardware design. Most compression algorithms aim to exploit redundancy in data streams. The granularity at which compression is performed affects the decompression latency. Decompression latency is usually on the critical path of a memory access, so high latency here nullifies any gains from compression. If the compression ratio is not high enough, then data compression might not provide any significant hit rate improvement or memory bandwidth reduction.


1.1 Compressed Cache Architecture

Traditional caches store uncompressed data in each cache block. Since the data stored per cache block is fixed, the data lines can be utilized completely, and the effective cache capacity is typically close to the physical size of the cache. Compressed caches store compressed data, which is subject to variable compression ratios. This means the data size can change from block to block and may not fully utilize a cache block. If the data could be compacted efficiently in the cache, the effective cache capacity would match the compression ratio and significantly increase the effective cache size. In practice this rarely happens, due to internal fragmentation that reduces data line utilization. Internal fragmentation occurs because compressed data is variable in size; variable sized data cannot be packed efficiently in the data lines, leaving fragmented free space in the cache blocks. A compaction algorithm is typically used to pack different compressed blocks into the data lines. This algorithm needs to balance maximizing data line utilization against access latency. A complex compaction algorithm can also increase power consumption and area due to increased hardware complexity.

Writes to compressed blocks may change the size of the compressed data, which would require relocation of compacted data in the cache block. The compressed data blocks must be rearranged to make room for the updated compressed block. This process is known as recompaction and is extremely costly in terms of latency. Real-world applications issue bytemasked writes that only partially overwrite a cache block. These partial writes can be handled easily in a conventional cache by merging the dirty bytes based on the bytemask. However, for a partial write to a compressed block, we would need to decompress the block before merging the data and then recompress the new data block. This turns a simple merge operation into a read-decompress-modify-compress-write operation, which can add significant latency to the critical path of a cache write, especially if a read operation immediately follows a bytemasked write.

The replacement policy also has a large impact on system performance for compressed caches. Eviction of variable sized data lines can result in inefficient bursts to the DRAM. This results in poor data bus utilization, further increasing the cost of a cache miss. Smarter replacement policies are required to achieve a better balance between logical expansion, access latency, and efficient bursting.


1.2 Compression Algorithms

Compression algorithms range from software-based schemes, such as the Lempel-Ziv algorithm [4], to schemes designed for hardware implementation, such as the base-delta-immediate scheme [5]. At a fundamental level, compression schemes exploit various redundancies in a data stream to reduce the memory footprint of an application. For example, data structures are frequently initialized with zeroes in many applications. This creates data streams that can be compressed easily by handling zero data as a special case. Pattern matching is another approach, used in dictionary-based compression schemes. The compression dictionary stores a list of frequently used words, and the data stream is encoded based on the entries in the dictionary. The encoding can be done based on complete or partial matches with the dictionary entries. Dictionary entries can be static, meaning they are populated before an application starts running, or dynamic, where the entries are updated at runtime to improve compression. Narrow bit-width data is another candidate for compression. Software typically over-provisions the bit-width for the worst case maximum size of a variable. Counters are a good example, where the data width always matches the maximum range of the counter. In most cases, these variables only use a fraction of their available bit-width. Since it is extremely cumbersome to optimize bit-widths in software, these narrow cases can be handled in hardware to reduce the size of the data stream. Repeated values can also be compressed by storing them as a single value, along with a count that captures the number of repetitions.

As we can see, there are many ways to compress data. However, hardware-based compression schemes have to meet certain conditions to be feasible. On-chip compression needs to have low decompression latency, as the decompress operation usually happens on the critical path. The compression ratio needs to be high enough that the benefit of the increased cache capacity, plus the bandwidth savings, can compensate for the increase in latency. Hardware complexity is another concern, as the area occupied by the compressor and decompressor needs to be small. The storage of compression metadata is another issue with on-chip compression techniques. If a compressed line is evicted from the cache, the metadata should be small enough to store in memory without increasing the memory bandwidth. If the metadata is shared, as in the case of dictionary-based compression, then the data would have to be decompressed before the block is evicted from the LLC. This prevents us from realizing the full benefits of compression and increases power consumption.
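To ground the simplest of these ideas, the short sketch below shows how a run of repeated (or zero) values can be collapsed into (value, count) pairs. It is purely illustrative and does not correspond to any of the hardware schemes discussed later in this thesis.

```python
def rle_encode(words):
    """Collapse runs of identical words into (value, run_length) pairs."""
    runs = []
    for w in words:
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([w, 1])       # start a new run
    return runs


stream = [0, 0, 0, 0, 7, 7, 42]       # e.g. zero-initialized data plus a few values
print(rle_encode(stream))             # [[0, 4], [7, 2], [42, 1]]
```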

The rest of this thesis is organized as follows. In Chapter 2, we discuss the challenges posed by compressed data in the memory hierarchy, and cover related work on compressed caches, DRAM efficiency, on-chip compression algorithms, and smart cache controllers. In Chapter 3, we discuss the metrics used to evaluate our proposals, and review the simulation framework and workloads used in our experiments. Chapter 4 describes our solutions to some of the key challenges in compressed cache design and evaluates them using our simulation framework. In Chapter 5, we discuss the unique properties of compression in the graphics pipeline and highlight the key differences when compared to cache compression algorithms. Finally, in Chapter 6, we summarize the contributions made in this thesis and chart directions for future research in the area of compressed cache design for GPUs.

Chapter 2

Background

In this chapter, we provide background on how caches can be used to significantly improve processor performance. We then discuss the role of the cache in the memory hierarchy of CPUs and GPUs, and shed some light on the key differences between the CPU and GPU memory streams. In this discussion, we focus primarily on graphics workloads for the GPU. We then explain how compression can further improve the performance of a computer system, and provide a thorough discussion of state-of-the-art compression algorithms and compressed cache designs. Finally, we discuss the intricacies of DRAM timing and describe how compressed data can potentially hurt performance if the cache-memory interaction is not handled efficiently.

2.1 GPU Memory Hierarchy

A cache is a fast on-chip memory that is used to store frequently accessed data. Most caches take advantage of locality in the address stream to improve performance. There are two main types of locality: spatial and temporal. Many data structures possess spatial locality in their access streams, commonly seen in vector and matrix operations. Temporal locality means that the same address is likely to be accessed again in the near future. The instruction stream typically exhibits both: instructions are fetched sequentially (spatial locality), and instructions in the body of a loop are accessed repeatedly (temporal locality) [6].

As on-chip area is limited, caches are typically small. Caches tend to be expensive to implement, since they use SRAM technology instead of cheaper DRAM technology. Caches help reduce the number of accesses to slow, off-chip memory, and can improve performance significantly since they are much closer to the processor and can return data much faster.


A processor typically has multiple levels of cache, including a large shared LLC that holds both instructions and data. The exact organization of these caches depends on the type of processor and the available on-chip area. We will see how cache organization differs between a CPU and a GPU. The GPU memory hierarchy is considerably more complex than that of a CPU. The fundamental difference between a CPU and a GPU is that CPUs are designed for low latency, while GPUs are designed for high throughput through parallel execution. GPUs have fixed function hardware along with programmable support, whereas a CPU focuses more on general purpose algorithms.

Figure 2.1: Block diagram showing the ratio of ALU to cache hardware.

As we can see in Figure 2.1, a majority of the die area on a CPU is dedicated to cache memory, as CPUs work best when most of the program data is resident in the cache. GPUs have a complex memory system that consists of registers, shared memory, constant memory, texture memory, local memory, and global memory. The shared, constant, and texture memories are unique to the GPU and do not exist in a CPU. A GPU has significantly higher compute performance spread over massively parallel cores, but has smaller caches relative to the number of Arithmetic Logic Units (ALUs) on-chip. It is difficult and expensive to provision a cache system large enough to store all the data needed by the GPU. This is why GPUs designed for graphics processing have their own DRAM connected to the LLC. These DRAM chips are typically referred to as Video RAM (VRAM) and have significantly wider memory buses than a CPU, as shown in Figure 2.2. These wide buses help move data quickly between memory and the LLC.


Figure 2.2: Block diagram showing cache hierarchy in a CPU and GPU


Since it is extremely difficult to store an entire application's data set in the GPU caches, the throughput of the memory system becomes important in determining the overall performance of the graphics processor. This is why GPU manufacturers such as AMD have taken new approaches, such as hardware data compression [7], to improve bandwidth.

2.2 Compression Algorithms

In this section, we review previous work on data compression and discuss the state of the art in hardware compression schemes. Compression can broadly be classified into lossy and lossless compression. Lossy compression is useful when a small loss of information makes no material difference to the consumer of the data stream. Many audio and video compression schemes are lossy, as the quality remains good enough despite the loss of some unessential bits. JPEG and MPEG compression are examples of lossy compression. This reduction in data fidelity is typically irreversible. For data compression on GPUs, our focus is on lossless compression, because a loss in fidelity is not tolerable for general purpose computing or graphics workloads.

Lossless data compression algorithms exploit redundancy in the data stream to reduce the memory footprint. As documented by Mittal et al. [8] in their survey of compression methods, redundancy can occur for many reasons, such as zero initialization of data structures, copy operations, and constants in a program. Narrow-width values present another opportunity for compression, as they are typically stored in over-provisioned data types. Data with low dynamic range can be encoded as a base value plus an array of differences to reduce its size. While each of the above features can be exploited for data compression, many trade-offs need to be evaluated to find the optimal method for a specific design.

For compression in an LLC, latency and hardware complexity are critical factors that limit the compression ratios that can be achieved. The decompression operation is more important than compression, as it is frequently performed on the critical path of a cache access. This makes decompression latency a critical factor in the selection of an algorithm for on-chip data compression. Complex compression hardware can increase power consumption and occupy a larger percentage of the on-chip area, which can offset any memory and cache performance gains due to compression.

Data granularity can have an impact on compression ratios as well. Larger blocks of data tend to compress better, as more redundancy can be exploited. However, any access to a small part of the data block requires the whole block to be decompressed. Compressing smaller blocks

generates more metadata and does not compress as well. A balance must be achieved for efficient compression. Dictionary-based compression schemes work at a word granularity. This prevents parallel decompression, as each word needs to be decompressed to find the starting location of the next compressed word. Certain features of the data stream can also provide a good indicator of compressibility. For example, encrypted data is random by nature and has low redundancy. In such a case, it makes sense to avoid wasting resources on compression, since the return is negligible. Next, we discuss some of the state-of-the-art compression schemes proposed in the literature.

Ekman and Stenstrom [9] observed that real world applications contain a significant number of null data blocks. This is because, for most applications, data structures are initialized to zero and temporary variables are mostly cleared with zeroes. Dusser et al. [10] note that storing such null blocks is a waste of cache resources, and a better approach can be used to handle these cases efficiently. They propose compressing these null blocks by creating a separate zero cache to augment the main cache. The zero cache stores the address and valid bits for each null block. It does not contain any data blocks, as the data is known to be zero. This simple scheme has very low decompression latency and fast lookup times, as the main cache and the zero cache can be searched in parallel for any address. One possible drawback is that, for workloads with a low percentage of null data, the parallel lookup increases power consumption for no real benefit. The authors propose a solution: monitor the compression ratio of the blocks and turn off the zero cache when the ratio drops below a specific threshold. Another key observation by the authors is that null data exhibits good spatial locality. Hence, the null blocks come in sizes that are much larger than a single cache line. This is exploited by using super tags in the zero cache, a feature that addresses a lot more memory for a very low area cost. The authors take this idea further by combining the zero cache with null compression in memory [11]. They propose organizing main memory as a decoupled sectored cache [12] with cache-line-sized subblocks. The memory controller has a cache to translate compressed addresses into uncompressed sectors in main memory. Any null block is represented by a null bit in the cache, similar to the zero cache. Any read or write to a null block can then be handled by the cache, instead of requiring a costly access to main memory. This reduces bandwidth utilization and the average memory access time.

Villa et al. [13] propose another novel method to compress zero data. They propose adding a zero indicator bit (ZIB) for every byte in the cache line. By setting a single bit for every zero

byte, they compress the zeroes and save a significant amount of energy. The energy savings are realized by disabling the bitline discharge in the cache for every byte that has the ZIB set. They observe significant energy savings of 26% for data caches and 18% for instruction caches using this method. The limitation of zero compression schemes is that they do not work with any other data patterns; if a data stream has little null data, then there is little benefit from this compression scheme. This is why compression schemes that handle more data patterns, while treating zero compression as a special case, can be a better choice for hardware data compression.
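A minimal sketch of the per-byte zero-indicator idea (illustrative only, not the circuit-level mechanism described in [13]): a cache line is stored as a one-bit-per-byte zero mask plus only the non-zero bytes.

```python
def zib_compress(line: bytes):
    """Collapse the zero bytes of a cache line into a per-byte indicator mask.

    Returns (zero_mask, nonzero_bytes): zero_mask has one bit per byte
    (1 = byte is zero), and nonzero_bytes holds only the non-zero bytes,
    in order.
    """
    zero_mask = 0
    nonzero = bytearray()
    for i, b in enumerate(line):
        if b == 0:
            zero_mask |= 1 << i          # mark this byte as zero
        else:
            nonzero.append(b)            # keep only the non-zero payload
    return zero_mask, bytes(nonzero)


def zib_decompress(zero_mask: int, nonzero: bytes, line_size: int = 64) -> bytes:
    """Rebuild the original line from the mask and the non-zero payload."""
    out = bytearray()
    it = iter(nonzero)
    for i in range(line_size):
        out.append(0 if (zero_mask >> i) & 1 else next(it))
    return bytes(out)


if __name__ == "__main__":
    line = bytes([0] * 48 + list(range(1, 17)))   # a mostly-zero 64-byte line
    mask, payload = zib_compress(line)
    assert zib_decompress(mask, payload) == line
    print(f"64-byte line stored as 8-byte mask + {len(payload)} payload bytes")
```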

Figure 2.3: Compression example using Frequent Value Compression.

Yang et al. [14] observed that applications tend to exhibit frequent value locality. This means that a few values appear much more frequently than others during execution and can occupy almost 50% of the cache area. They propose encoding these frequent values to compress the data stream. They use this approach to create a small frequent value victim cache to improve performance [14], and a compressed cache that increases the effective cache size [15] and reduces memory bandwidth. Figure 2.3 shows an example of data compression using the frequent value compression (FVC) scheme. Four 32-bit frequent values are stored in a table and assigned 3-bit codes to represent them. The 3-bit value '111' is used to mark an infrequent value. The 32-bit values in the uncompressed stream are replaced with the 3-bit codes from the frequent value table, and any unmatched word is stored as-is after its code. This reduces the 128-bit uncompressed stream to just 20 bits. The drawback of this scheme is that the application needs to be profiled to create the list of frequent values. FVC is essentially a static dictionary, where data is encoded using a table of frequent values.
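As a rough illustration of this static-dictionary encoding (a sketch only; the table contents and stream below are made up, not taken from [14]):

```python
# Hypothetical 4-entry frequent value table; code 0b111 marks an unmatched word.
FREQUENT_VALUES = {0x00000000: 0b000, 0xFFFFFFFF: 0b001,
                   0x00000001: 0b010, 0xDEADBEEF: 0b011}
UNMATCHED = 0b111

def fvc_encode(words):
    """Encode 32-bit words as (3-bit code, optional 32-bit literal) pairs."""
    out = []
    for w in words:
        code = FREQUENT_VALUES.get(w, UNMATCHED)
        out.append((code, None if code != UNMATCHED else w))
    return out

def fvc_size_bits(encoded):
    """Compressed size: 3 bits per code plus 32 bits per unmatched literal."""
    return sum(3 + (32 if literal is not None else 0) for _, literal in encoded)

stream = [0x00000000, 0xDEADBEEF, 0x12345678, 0xFFFFFFFF]
enc = fvc_encode(stream)
print(fvc_size_bits(enc), "bits instead of", 32 * len(stream))   # 44 vs 128
```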

Dictionary-based compression schemes take advantage of repeated patterns and encode the data with a dictionary index that matches a specific pattern. Previous research in the field of dictionary-based compression algorithms has primarily focused on three independent goals: i) cache expansion [16, 17, 18, 13], ii) memory expansion [19, 20], and iii) bus compression [21].

Memory expansion techniques, such as XMatch and X-RL [22], use dictionary-based pattern matching to expand the logical memory size. The dictionary is created during compression using a move-to-front strategy, as described by Bentley et al. in their locally adaptive compression scheme [23]. In this method, data is compressed one tuple at a time, where a tuple is the chosen data granularity for compression. The dictionary starts empty for each compression operation, and both the compressor and decompressor use the same move-to-front algorithm to build the dictionary on the fly. This method takes advantage of the spatial locality within the data set to update dictionary entries. Each tuple is compared with the current dictionary entries for a complete or partial match, and the matched characters, match type, and locations are recorded. The output always starts with one bit, with '1' indicating no compression and '0' indicating a match. If no match occurs, the tuple is inserted into the dictionary and the last entry is discarded. Decompression follows the same steps, decompressing one word at a time and recreating the exact same dictionary during the decompression operation. The main issue with this method is the variable length of the match location and match type fields. This forces the decompressor to decompress the first compressed word before knowing the location of the next, which adds significant latency to the critical path of a memory access. The X-RL algorithm [19] follows the same general principle as XMatch, but handles a sequence of zeroes as a special case by using run-length encoding.

The selective compressed memory system (SCMS) proposed by Lee et al. [20] uses the XMatch-based dictionary compression scheme to compress data throughout the memory hierarchy. A key difference is the selective nature of the compressor. A compression threshold is selected for the cache, and a block is stored in compressed form only if its compression ratio is below the threshold. This way, the compressed block can share a cache line with other compressed blocks. The work also uses a fixed output size for the compressor to make it easier to store the compressed blocks in the cache and in memory. The disadvantage of assuming a fixed output size is that the compression ratio can suffer.

Cache expansion techniques, such as CPack [16], have focused on methods that increase the logical size of the cache without significantly increasing its area. For CPack, a two-level code table is used, as shown in Table 2.1. In the table, a 'z' represents a zero byte, an 'm' represents a byte that matches a dictionary entry, and an 'x' represents an unknown byte. The 'bbbb' pattern denotes the 4-bit dictionary index, and 'B' represents an unmatched byte in the output. As shown in Figure 2.4, each word in the data stream is compared with the dictionary entries to find a complete or partial match.


Code   Pattern   Output          Length (bits)
00     zzzz      (00)            2
01     xxxx      (01)BBBB        34
10     mmmm      (10)bbbb        6
1100   mmxx      (1100)bbbbBB    24
1101   zzzx      (1101)B         12
1110   mmmx      (1110)bbbbB     16

Table 2.1: CPack Code table

The code is then combined with the unmatched bytes to form the compressed word. If no match occurs, an unmatched code is added to the output. For simplicity, we show a 16-entry dictionary in the figure, with 4 bits representing the dictionary index. As we can see, the dictionary entries represented by 'bbbb' might change during a dictionary update, which can cause a data consistency problem. The authors, however, do not address this data consistency problem, which can occur due to a dictionary replacement. We define data consistency as the property that data values and the corresponding compression dictionary entries come as a pair, yet must be updated individually from time to time. Updating a compression dictionary entry can potentially impact multiple data values, resulting in a challenging consistency problem. For example, with a dynamic dictionary, all the data that references a replaced dictionary entry would have to be updated. This could potentially touch all the cache lines where the compressed data is stored, an extremely expensive operation that is not feasible in a latency-sensitive system.
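The following sketch captures the flavor of this per-word encoding using a subset of the codes in Table 2.1; it is a simplified illustration (full-word zero, full match, and 3-byte prefix match only), not the complete algorithm from [16].

```python
def cpack_encode_word(word: bytes, dictionary: list):
    """Encode one 4-byte word against a small dictionary.

    Simplified subset of the CPack code table in Table 2.1: 'zzzz' (all
    zero), 'mmmm' (full dictionary match), 'mmmx' (3-byte prefix match),
    otherwise 'xxxx' (uncompressed). Returns (code, payload, output_bits).
    The 4-bit dictionary index is carried as a whole byte here for clarity.
    """
    if word == b"\x00\x00\x00\x00":
        return "00", b"", 2                                        # zzzz
    for idx, entry in enumerate(dictionary):
        if word == entry:
            return "10", bytes([idx]), 6                           # mmmm
    for idx, entry in enumerate(dictionary):
        if word[:3] == entry[:3]:
            return "1110", bytes([idx]) + word[3:], 16             # mmmx
    dictionary.append(word)                                        # learn the new word
    return "01", word, 34                                          # xxxx


dictionary = []
words = [b"\x00\x00\x00\x00", b"\xca\xfe\xba\xbe",
         b"\xca\xfe\xba\xbe", b"\xca\xfe\xba\x01"]
total = sum(cpack_encode_word(w, dictionary)[2] for w in words)
print(f"{total} bits instead of {32 * len(words)}")                # 58 vs 128
```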

Figure 2.4: Compression example using CPack.


The CPack cache is divided into a compressed and an uncompressed partition. The dictionary entries are updated for every new word encountered. The algorithm uses a FIFO replacement scheme, which results in updates to compressed data every time a dictionary entry is replaced. Using this algorithm, all cache lines referencing a particular dictionary entry must be tracked, decompressed, and moved to the uncompressed side for every dictionary replacement. This is expensive to perform in hardware. The CPack scheme improves performance due to an increased hit rate, since the cache size is logically expanded. While the increased hit rate yields some bandwidth reduction, the full potential of the improved bandwidth is not realized because the data is decompressed on its way out of the cache. Decompression is necessary to maintain data consistency with the dynamic dictionaries, as the entries being referenced might change due to the replacement scheme.

Bus compression schemes proposed by Yang et al. [21] use dynamic dictionaries because the dictionary entries only need to be consistent during bus transfer operations. While this certainly helps reduce bandwidth demands, it does not result in any cache or memory expansion, so there are no additional bandwidth savings from improved hit rates.

The adaptive multi-dictionary model (AMDM) proposed by Yu and Wu [24] considers an n-level dictionary with 9 tuples to model the dictionaries. A word can be compressed using any of the n levels. They considered the trade-offs of using either LRU or FIFO replacement policies. These policies require that each dictionary entry keep track of every location of the compressed data within the memory hierarchy. Managing this information in hardware for a multi-level dictionary would be costly in terms of chip area. Our work differs from this hierarchical approach significantly. The n dictionaries created in this scheme are independent and do not communicate with each other, so any opportunity to maintain data consistency by swapping dictionary references is lost. Another problem is the potentially high access latency incurred: having to access multiple dictionaries to decompress data, which is on the critical timing path, may be too costly to be feasible.

Alameldeen and Wood [25] observed that most dictionary-based compression schemes are effective only for large data block sizes. In a cache, the data block size is typically much smaller, so they proposed using a significance-based compression scheme instead. Significance-based schemes try to compress data by exploiting redundancy in the most significant parts of the data stream. Farrens et al. [26] observe that the addresses transferred to memory show redundancy in the higher order bits. They proposed storing these higher order bits in dynamically allocated registers and transferring only the narrow indices to memory. This compresses the data on the bus,

saving significant bus bandwidth. Citron et al. [27] use the same approach, but store the higher order bits in a table. The frequent pattern compression (FPC) scheme proposed by Alameldeen and Wood [25] takes advantage of the fact that small values have few significant bits in a 32/64-bit word. They propose encoding these values based on the number of significant bits to reduce the data size. Table 2.2 shows how the data stream can be encoded using frequent pattern compression. A 3-bit prefix is used to encode each 32-bit word. The prefix denotes the type of encoding, followed by the narrow width data for most entries in the table. As expected, zero compression is handled as a special case, where a run of up to eight 32-bit zero words can be compressed using a 3-bit prefix and a 3-bit count of the zeroes.

Prefix   Pattern                                    Encoded Data Size
000      Zero run                                   3 bits (for runs up to 8 zeros)
001      4-bit sign-extended                        4 bits
010      One byte sign-extended                     8 bits
011      Halfword sign-extended                     16 bits
100      Halfword padded with a zero halfword       The nonzero halfword (16 bits)
101      Two halfwords, each a byte sign-extended   The two bytes (16 bits)
110      Word consisting of repeated bytes          8 bits
111      Uncompressed word                          Original word (32 bits)

Table 2.2: Frequent Pattern Encoding table as seen in [25]
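A compact sketch of the prefix selection in Table 2.2 (pattern detection for a single word only; the zero-run and two-halfword cases are omitted, and this is an illustration rather than the encoder from [25]):

```python
def is_sign_extended(word: int, bits: int) -> bool:
    """True if the 32-bit word is just a sign-extended `bits`-bit value."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def fpc_prefix(word: int):
    """Pick a prefix from Table 2.2 for one 32-bit word.

    Returns (prefix, data_bits): the 3-bit prefix and how many data bits
    follow it in the compressed stream.
    """
    if is_sign_extended(word, 4):
        return "001", 4
    if is_sign_extended(word, 8):
        return "010", 8
    if is_sign_extended(word, 16):
        return "011", 16
    if word & 0xFFFF == 0 or (word >> 16) == 0:
        return "100", 16                          # one halfword is zero
    if len({(word >> s) & 0xFF for s in (0, 8, 16, 24)}) == 1:
        return "110", 8                           # word of repeated bytes
    return "111", 32                              # uncompressed


for w in (0x00000003, 0xFFFFFF85, 0x00120000, 0x7C7C7C7C, 0x12345678):
    print(hex(w), fpc_prefix(w))
```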

Pekhimenko et al. proposed the base-delta-immediate (BDI) scheme [5] to address the shortcomings of dictionary-based compression algorithms. The authors observed that data in a cache block often has low dynamic range, which means the deltas between values in the block are small. The data can thus be stored in a more compact format, as a base value and an array of deltas. Some values are too small to fit the base-plus-delta paradigm; these can be stored as a delta from zero (an immediate). Pattern matching algorithms work at a word granularity [15, 25, 16], which prevents them from taking advantage of the correlation between different words in the cache line. BDI works at a cache-line granularity, which is much larger, improving the compression ratio. Another advantage of working at a cache-line granularity is that decompression need not be serial. Dictionary-based schemes are forced to decompress serially, as the position of each word in the compressed stream depends on the decompression of the previous word.
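A minimal sketch of the base-plus-delta idea (one base and one delta width only; the published BDI design [5] tries several base sizes and delta widths and adds the delta-from-zero immediate case):

```python
import struct

def bdi_compress(line: bytes, delta_bytes: int = 1):
    """Try to store a 64-byte line as one 8-byte base plus narrow deltas.

    Returns (base, deltas) on success, or None if any delta does not fit
    in `delta_bytes` bytes (the line would then be kept uncompressed).
    """
    values = struct.unpack("<8Q", line)          # eight 8-byte values
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = []
    for v in values:
        delta = v - base
        if not -limit <= delta < limit:          # delta too wide: give up
            return None
        deltas.append(delta)
    return base, deltas                          # 8 + 8*delta_bytes payload bytes


line = struct.pack("<8Q", *(0x1000_0000_0000 + i * 3 for i in range(8)))
result = bdi_compress(line)
if result:
    base, deltas = result
    print(f"64-byte line -> {8 + len(deltas)} bytes (base + {len(deltas)} one-byte deltas)")
else:
    print("line kept uncompressed")
```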


Compression Algorithm   Type                     Granularity   Target
ZCA                     Zero compression         Cache block   Cache
ZMA                     Zero compression         Cache block   Cache and Memory
DZC                     Zero compression         Byte          Cache
FVC                     Static dictionary        Word          Cache
FPC                     Frequent pattern table   Word          Cache
XMATCH                  Dynamic dictionary       Word          Cache and Memory
XRL                     Dynamic dictionary       Word          Cache and Memory
CPACK                   Dynamic dictionary       Word          Cache
BDI                     Base-delta               Cache block   Cache and Memory
Base-Register Caching   Base-delta               Address       Data bus

Table 2.3: Comparison of popular hardware compression algorithms

2.3 Compressed Cache Architecture

Previous research on compressed cache design has predominantly focused on two areas: i) logical cache expansion [17, 28, 18, 29, 15, 25, 16] and ii) energy reduction [13, 30, 31, 32]. Logical expansion increases the effective cache capacity so the cache can perform similarly to a much larger cache while occupying a smaller silicon footprint. This efficient use of on-chip area results in reduced power consumption, which makes the cache more energy efficient. The last level cache (LLC) is typically chosen for compression because higher level caches are designed for performance and have a low tolerance for latency. In contrast, the LLC is designed for capacity and can tolerate additional latency in exchange for increased effective size. A miss at the LLC is far more expensive, as the request goes off-chip to DRAM, so the small overhead of compression-decompression operations seems reasonable at this level.

A good compressed cache needs to have a high rate of compaction, low hardware complexity, and fast lookup times, but achieving these goals can be difficult due to issues such as internal fragmentation, compression latency, and recompaction. Compression algorithms generate compressed data of different sizes based on the compressibility of the data. Internal fragmentation occurs when these compressed blocks of variable sizes cannot be stored in the data lines efficiently. This results in low data line utilization and a smaller effective cache size. Compaction is the process of packing variable sized data to maximize data line utilization, effectively increasing the logical cache size. Some of the previous work [16, 30] in this area tries to solve this problem with fairly simple approaches.


Figure 2.5: Fragmentation example for a fixed compaction scheme.

For example, the architecture of the low energy data cache design proposed by Kim et al. [30] only stores multiple compressed blocks in the same line if both blocks compress to half the data line size. As shown in Figure 2.5, this design uses a fixed scheme that maps two tags to the same data block when each block compresses to less than half of the data line size. This approach uses twice the number of tags as data lines, and can potentially increase the effective cache capacity to 2X the physical size of the cache. The cache design proposed by Chen et al. [16] follows a similar approach for compaction, using a dictionary-based compression algorithm. The theoretical 2X expansion is not realized in practice, as compressed blocks do not always fit the compaction paradigm described above. Lee et al. [20] propose compressing larger data blocks, twice the size of a data line, down to a size that is less than a single line. This approach may improve compaction, as larger data blocks tend to compress better, but the internal fragmentation problem remains unsolved: blocks with higher compression ratios still leave the data lines underutilized. Another approach is to store two compressed data lines if both can fit in one cache line, irrespective of the compression ratio of each [29]. As we can see in Figure 2.6, this approach can potentially reduce fragmentation due to more flexibility in the compaction criteria.

Figure 2.6: Fragmentation with less restriction on compressed block placement

This will improve compaction, but still will not solve the internal fragmentation issue. The methods discussed so far can, in theory, effectively double the size of the cache. However, due to

lack of flexibility in the compaction schemes, internal fragmentation occurs, so the effective cache capacities are not able to reach that potential. Seznec et al. [12] proposed using super tags to address a larger set of data lines with fewer tags, decoupling the tag and data lines for each set. This concept of decoupling the tag and data lines is exploited in various compressed cache designs [17, 29, 33, 28], which map multiple compressed blocks belonging to the same super tag to smaller data blocks across the cache. In Figure 2.8, decoupled super tags share the same data line. This increases data line utilization and almost completely eliminates fragmentation in the cache.

Figure 2.7: Compaction with subblocks in the data array

Alameldeen et al. implement a decoupled variable-segment cache [29] that divides the data lines into 8-byte segments. The smaller segments allow more flexibility in data placement, making it easier to compact the data. The compressed size of each block is stored as metadata along with the address tag. A saturating counter is used by the cache controller to determine whether an incoming block should be compressed. The controller looks at the LRU stack and the compressed size of each block to determine if a miss could have been avoided with compression. If not, the counter is updated to reflect a bias against compression. The disadvantage of an adaptive scheme is that every cache block must be compressed to generate the metadata required to update the counter.
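A rough sketch of this adaptive decision (the counter width, threshold, and update rule here are illustrative choices, not the exact parameters from [29]):

```python
class AdaptiveCompressionPolicy:
    """Saturating counter that biases toward or against compressing new blocks."""

    def __init__(self, bits: int = 6, threshold: int = 32):
        self.counter = threshold          # start unbiased
        self.max = (1 << bits) - 1
        self.threshold = threshold

    def should_compress(self) -> bool:
        return self.counter >= self.threshold

    def on_avoidable_miss(self):
        # A miss that compression would have avoided: reward compression.
        self.counter = min(self.max, self.counter + 1)

    def on_unhelpful_compression(self):
        # The block would have hit anyway, so compression did not help: penalize it.
        self.counter = max(0, self.counter - 1)


policy = AdaptiveCompressionPolicy()
for _ in range(8):
    policy.on_unhelpful_compression()     # workload where compression rarely helps
print(policy.should_compress())           # False once the counter drops below threshold
```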

Sardashti et al. propose a decoupled compressed cache [17] that uses decoupled super tags and subblocks. Each super tag can address up to four conventional cache blocks, leading to a 4x logical expansion. The super tag consumes a variable number of subblocks based on the compressibility of the data. Internal fragmentation is reduced by allowing non-contiguous subblock allocation for each compressed block, which helps utilize the data lines more efficiently. Using super tags also reduces area overhead when compared to using 4x the number of tags to achieve logical expansion. Decoupling the tag and data block also has some potential disadvantages. Decoupling results in increased lookup latency, as the tag and data cannot be searched in parallel. The decoupled compressed cache [17] uses back pointers for each subblock to keep track of which tag it belongs to. This means the tag match operation must complete before the search for the subblocks can begin, which increases latency.

Sardashti et al. [28] tried to solve the indirect pointer problem with a skewed compressed cache design. Instead of using back pointers to map the data blocks to tags, the cache is divided into four-way groups. The compressed data is distributed among the way groups based on an index generated by hashing the compression ratio with certain address bits. This allows parallel lookups of the tag and data, which is much faster than using back pointers. The effective associativity is lowered, because each compression ratio is tied to a specific cache way. The authors counter this by using a skewed associative design: by using a different hash function for each cache way, the associativity is nearly doubled [34]. This also ensures high data line utilization due to a more efficient distribution of data blocks with different compression ratios. While skewing helps improve associativity, it cannot completely eliminate the lower effective associativity of the cache. Using different hashing functions also increases the area cost and complexity of the design. Skewing also makes it hard to use traditional cache replacement policies, as the block distribution does not follow conventional schemes. Sardashti et al. propose an alternative design [18] that attempts to balance the advantages of a decoupled compressed cache and a skewed compressed cache, with lower design complexity. They use a super tag similar to the decoupled cache, but limit the compressed blocks to a single data entry instead of distributing the subblocks across the set. This brings the design closer to conventional caches, which significantly lowers hardware complexity.

Hallnor et al. [33] propose a fully-associative Indirect Indexed Cache (IIC) with compression. The cache decouples the tag and data arrays, with the data blocks divided into smaller subblocks. Each tag contains metadata about the number of subblocks used and forward pointers to those subblocks. If the number of subblocks is zero, then the data is stored entirely in the forward pointers, without consuming any subblocks.

Another way to provide logical cache expansion is by augmenting the cache with a smaller

cache for compressed data. Zhang et al. [15] propose a frequent value cache that stores the most frequently used words in a separate structure. The frequent values are then encoded in the main cache using a much smaller representation. The effectiveness of the frequent value structure goes down as the block size increases, which is why this approach suits the level-1 (L1) caches better than the LLC. The compaction scheme is still restrictive, as it allows two compressed blocks in the same data line only when each compresses to half the line size. Dusser et al. [10] propose a Zero-Content Augmented (ZCA) cache. They observe that a large fraction of data accesses in select applications are to null data that exhibits good spatial locality. This data can be compressed easily by storing a single bit instead of a run of zero bytes. They propose using a zero cache that stores only the address tags and a valid bit for each zero block. Decompression is very fast due to the low complexity of the hardware, and the LLC gains free space to store non-zero values.

Figure 2.8: Compaction with decoupled subblocks and super tags

Writes to a compressed block can make the data placement problem even harder. If a write request completely overwrites a compressed block, there are two cases. If the new compressed block is smaller than or equal to the old compressed block, we can simply store the data in the old position. If the written data is larger, then we have to deal with recompaction. Recompaction can be challenging if the compressed block was sharing a data line with other compressed blocks. A larger compressed block would mean moving the neighboring compressed blocks to different locations in the cache, or allocating a new block for the larger compressed block. Designs such as [16, 30, 28] share data lines among multiple tags and will incur a performance hit due to recompaction operations. Designs that decouple the tag and data lines [17, 29] can handle these cases

better. This is because the data blocks are divided into subblocks that need not be contiguous; in this type of cache, one can simply allocate more subblocks to fit the larger compressed data. A more challenging case for compressed caches is a partial write to a compressed block. In Figure 2.9 we see that in a conventional cache, a partial write simply results in a partial update of the data lines. With compressed caches, partial writes can cause a huge performance degradation. A compressed block would have to be decompressed, merged with the partial write, and then recompressed. This changes a simple read-modify-write operation into a read-decompress-modify-compress-write operation. This results in high latency, and back-to-back partial writes significantly increase latency in the application. Partial writes are more common on GPUs due to the work being distributed in parallel: different compute units can work on different parts of the same cache line and commit the data as a sequence of partial writes. Figure 2.10 shows a flowchart for handling byte-masked writes with compressed caches.

Figure 2.9: Bytemasked writes with uncompressed data.
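The following sketch contrasts the two write paths described above. The helper names are hypothetical, and compress/decompress stand in for whatever algorithm the cache uses.

```python
def merge_bytemasked(line: bytearray, data: bytes, mask: list) -> bytearray:
    """Merge only the bytes enabled by the mask (conventional-cache path)."""
    for i, enabled in enumerate(mask):
        if enabled:
            line[i] = data[i]
    return line


def bytemasked_write_compressed(stored: bytes, data: bytes, mask: list,
                                compress, decompress) -> bytes:
    """Partial write to a compressed block: read-decompress-modify-compress-write."""
    line = bytearray(decompress(stored))   # 1. decompress the stored block
    merge_bytemasked(line, data, mask)     # 2. merge the dirty bytes
    return compress(bytes(line))           # 3. recompress (size may change!)


# With identity functions standing in for a real compressor:
new_block = bytemasked_write_compressed(bytes(64), b"\xAA" * 64,
                                        [i < 8 for i in range(64)],
                                        compress=lambda b: b,
                                        decompress=lambda b: b)
```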

The compressed cache design can also depend on the compression algorithm chosen.


Figure 2.10: Bytemasked writes with compressed data


For example, a compressed cache that uses a dictionary-based compression algorithm, such as CPack [16], will need to decompress or evict blocks when a dictionary entry is replaced. This can result in the eviction of multiple compressed blocks that depend on that dictionary entry. Another option is to perform a decompression operation on all affected blocks, and then store the uncompressed data in the cache using an efficient recompaction scheme. Compressed caches also need to store more tags than an uncompressed cache design, which results in area overhead. For example, the cache designs proposed previously [16, 30] need to support storing two tags for each data line. Using super tags can reduce this area overhead, as in the case of DCC [17] and ACC [29]. However, super tags introduce a higher number of tag conflicts due to the reduction in the number of unique tags stored. A better balance is needed to achieve the desired performance at an acceptable area cost.

2.3.1 Smart Caches

As graphics processors grow more complex, cache controllers need to rely less on fixed cache management policies and become more dynamic in nature. Perceptron learning algorithms have been proposed to add intelligence to CPUs when making dynamic performance-related decisions [35, 36, 37, 38]. Perceptron models [39] can be used for making binary decisions by identifying various patterns in instruction and data streams. A perceptron is modeled by a vector of signed weights that are adjusted based on a selected training criterion. The dot product of an input vector with the associated signed weights for each input stream is compared against a threshold to drive performance-related decisions in the design. Jimenez et al. [35] use a perceptron-based learning method to improve branch prediction accuracy. Branch prediction is more latency sensitive than any LLC access, which reinforces our confidence in using the perceptron learning method for smart cache management of compressed data. While the original perceptron model requires the computation of a dot product between a weight vector and a set of binary input features, subsequent research [35, 40] uses hashed perceptron predictors, as shown in Figure 2.11. For hashed perceptron predictors, an independent weight table is used for each feature in the model, and each table is indexed using a hash of the relevant feature. The outputs of multiple feature tables are combined by adding the results from each individual table, and this final value is compared to a threshold for decision making. This approach retains the perceptron learning method while minimizing hardware complexity in the design implementation.


Figure 2.11: Feature table for a hashed perceptron prediction.

In addition to branch prediction, the perceptron learning algorithm has also been used to improve cache performance. Teran et al. [36] use the perceptron learning method to improve the accuracy of reuse prediction. They use features such as program counter values, bits of memory addresses, and a trace of past memory instructions to predict whether a block of data will be reused. This information is used to make decisions such as block placement and bypass management. Wang et al. [37] use a learning algorithm to improve the performance of prefetching. Ghosh [38] uses the same algorithm to provide coherence predictors in the cache for shared data. The hardware complexity of such a learning model is low, as evidenced by the hashed perceptron based branch predictor deployed in the AMD Zen CPU core [41]. Teran et al. [36] also observe that a perceptron-based model can find independent correlations between the different features used. This type of intelligent controller can potentially be used to manage compressed data in a cache efficiently. We explore this further in Section 4.3.
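A compact sketch of the hashed perceptron prediction and update loop described above (the table sizes, features, threshold, and update rule are illustrative placeholders, not the parameters of any published predictor):

```python
class HashedPerceptron:
    """Sum of per-feature weight tables, compared against a threshold."""

    def __init__(self, num_features: int, table_size: int = 256, threshold: int = 4):
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold

    def _indices(self, features):
        # Each feature value is hashed into its own weight table.
        return [hash(f) % self.table_size for f in features]

    def predict(self, features) -> bool:
        total = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return total >= self.threshold

    def train(self, features, outcome: bool):
        # Increment weights when the outcome was positive, decrement otherwise,
        # saturating the counters to keep them small in hardware.
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(-32, min(31, t[i] + (1 if outcome else -1)))


# Example: predict whether a block will be reused from (PC, address bits) features.
predictor = HashedPerceptron(num_features=2)
features = (0x400123, 0x7F00 >> 6)
for _ in range(5):
    predictor.train(features, outcome=True)
print(predictor.predict(features))   # True once the summed weights cross the threshold
```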

2.4 DRAM Efficiency

Another important component of the memory hierarchy is DRAM, which is used for main memory. Accessing DRAM is a long latency operation, and inefficient access patterns can be very costly in terms of performance and power. DRAM power in modern server systems can account for almost a third of the total system power budget [42]. The main memory in a GPU is typically organized into multiple channels; the AMD HD 7970 GPU has 12 such memory channels [43]. Having multiple channels increases the data bandwidth of the device, as the channels can service memory requests concurrently. Each channel has an

A memory controller connected to each memory channel is responsible for scheduling the memory requests. The controller tries to achieve both high throughput and fairness while arbitrating requests. A memory channel is further divided into a number of banks that operate in parallel. These banks share the command bus, address bus, and data bus assigned to that channel. Figure 2.12 shows one specific way of mapping the address bits to a bank within a memory channel. Each bank is organized as a 2D memory array consisting of rows and columns. Address bits are used to determine which row and column to fetch the data from. To realize the full benefit of multiple channels, memory accesses should have minimal channel conflicts. A channel conflict occurs if multiple requests hit the same channel and get stuck in a queue while other channels have empty request queues available to service requests. The address mapping is usually chosen such that consecutive addresses are spread across multiple channels and banks to minimize conflicts.

Figure 2.12: Example DRAM address mapping.
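As a concrete illustration of this kind of address decomposition, the sketch below splits a physical address into column, channel, bank, and row fields. The specific field widths and their order are assumptions for illustration, not the mapping used by any particular GPU.

```python
# Illustrative DRAM address decomposition (field widths and order are assumed).
COLUMN_BITS = 10    # e.g., 1 KB of column addressing (assumed)
CHANNEL_BITS = 4    # e.g., up to 16 channels (assumed)
BANK_BITS = 4       # e.g., 16 banks per channel (assumed)

def decode(addr):
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    channel = addr & ((1 << CHANNEL_BITS) - 1)
    addr >>= CHANNEL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr                      # remaining high-order bits select the row
    return {"channel": channel, "bank": bank, "row": row, "column": column}

# Consecutive addresses beyond the column field land in different channels and
# banks, which is what spreads sequential traffic and minimizes channel conflicts.
print(decode(0x12345678))
```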

Performance of a workload is highly sensitive to memory efficiency. To understand memory performance better, we can highlight some of the important DRAM timing parameters, such as row precharge time (tRP), row activation time (tRAS), and column activation time (tCAS). Each row in a bank needs to be precharged before it can be accessed. tRP determines the minimum amount of time it takes to precharge a specific row. Once a row is precharged, the row activation time is the minimum time that must elapse before a different row can be accessed within the same bank. Accesses to DRAM can be classified into three categories, as suggested by Zhang et al. [44]. If the memory access hits the same bank and row, then the row is already precharged and activated.

This means the latency only depends on the column activation time (tCAS), which is the best-case scenario. If the access is to a different row in a different bank, then it can be pipelined, as banks can be accessed in parallel. The worst-case scenario is when a row buffer conflict occurs. This happens when a different row in the same bank is requested. In this case, the latency is tRP + tRAS + tCAS, as we would have to wait for the current row to be closed and a new row to be precharged and activated. Row buffer conflicts prevent pipelining of data from parallel banks, and reduce memory throughput due to underutilization of the data bus.


Figure 2.13 shows an ideal access pattern within one memory channel. The address bus receives four address requests, A0-A3, each with a unique bank ID, resulting in zero bank conflicts. The distributed requests keep the data bus fully utilized with constant data throughput.

In an ideal case, the memory controller will only return to a specific bank once tRP + tRAS + tCAS has passed for the first request to that bank. The other banks in that channel hide this latency by keeping the data bus fully utilized.
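The three access categories translate directly into different latencies. The sketch below classifies accesses accordingly; the timing values and the cost assigned to a first access to an idle bank are placeholder assumptions, not datasheet numbers.

```python
# Illustrative classification of DRAM accesses into the three categories
# discussed above; timing values are placeholders, not datasheet numbers.
T_RP, T_RAS, T_CAS = 14, 34, 14    # precharge, activate, column access (assumed)

open_rows = {}   # bank id -> currently open row

def classify(bank, row):
    if open_rows.get(bank) == row:
        return "row-hit", T_CAS                      # best case: only tCAS
    if bank in open_rows:
        open_rows[bank] = row
        return "row-conflict", T_RP + T_RAS + T_CAS  # close, precharge, activate, read
    open_rows[bank] = row
    return "bank-idle", T_RAS + T_CAS                # first access to the bank (assumed cost)

print(classify(bank=2, row=7))   # bank-idle
print(classify(bank=2, row=7))   # row-hit
print(classify(bank=2, row=9))   # row-conflict
```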

Figure 2.13: Efficient DRAM access pattern.

Figure 2.14 shows a case where consecutive requests are made to different rows of the same bank. As we can see from the figure, there is a penalty, as the row conflict must be resolved before the second request can be returned. This adds latency to the access and reduces data throughput for that memory channel. Access to a row is a destructive process, and typically happens on an LLC miss. When a row is accessed, the data is read from the entire row and stored in a row buffer. Typically, each bank has a dedicated row buffer. The row size is usually much larger than the LLC cache line size. To exploit spatial locality for better performance, multiple cache lines must be mapped to the same DRAM page. This way, once a row is open, consecutive requests will hit in the row buffer. This also helps hide the latency of a row access across multiple banks. Increasing the row buffer hit rate and optimizing throughput across banks are conflicting goals. For parallel access, it is ideal that consecutive cache lines are mapped to different banks, whereas for higher row buffer hit rates, they should be mapped to the same page. This is why the conventional method is to interleave the addresses across a page boundary. This way a row buffer can be utilized efficiently, while spreading accesses across banks at strides of a page size.


Figure 2.14: Inefficient DRAM access pattern.

As most compressed cache designs are focused on logical expansion and energy-related goals, memory efficiency is often neglected. The compressed cache designs discussed in the previous section store data in the cache in compressed form, but evict data uncompressed. This means memory-bound workloads will not see large bandwidth improvements with these designs; any bandwidth improvement is limited to gains from improved hit rates in the caches. Data must be transferred to memory in compressed form to fully realize the potential of a compressed cache design. An LLC does not just need to store data efficiently, it also needs to evict it efficiently. This is because inefficient data bursts from the LLC to off-chip DRAM can result in increased latency, as well as power consumption. Traditional caches evict data in multiples of the cache line size. This makes it easy for the memory controller to schedule accesses to improve data bus utilization and row buffer utilization. Transferring compressed data on the memory bus is inefficient, because compressed data is variable in size and can result in poor and unpredictable DRAM access patterns. This can hurt data bus efficiency and the row buffer hit rate. Smarter replacement algorithms are needed to improve the data pattern of cache evictions and improve memory efficiency. This adds another variable to the design of compressed caches. Memory-efficiency-aware compressed cache designs can deliver large improvements in performance and power for a last level cache, while expending very little additional area.

Chapter 3

Framework and Metrics

An architecture simulator is used for modeling processors in software and helps facilitate rapid design exploration of microarchitectural features. Simulators provide a fast and accurate representation of real hardware and can be built much faster and cheaper than hardware prototypes. In this chapter we discuss the types of architecture simulators available to evaluate cache and memory systems in processors, and provide an overview of the custom framework we built for design exploration of compressed caches. An architecture simulator can be classified based on the level of detail in the model. For example, a cycle-accurate model attempts to emulate the performance of a processor on a per-cycle basis with accurate clocks guiding the execution, whereas a functional model emulates the function of the modeled hardware without any notion of a clock or specific timing-related features. Functional models typically run much faster due to the lower level of detail and help with faster design exploration of non-timing-sensitive features. In addition to architecture simulators, there are full system simulators, which model not just the processor, but also the operating system and other necessary components of an entire computer system. These are capable of running programs directly instead of pre-captured traces. One of the most popular cache simulators is the Dinero cache simulator [45]. This is a trace-based simulator that supports modeling of multiple levels of the cache hierarchy. It takes memory addresses as input and generates hit and miss statistics for each level of the memory system. The system is modeled to work on memory reference streams, which means data does not actually move in and out of the cache. This makes it difficult to use this model for compressed cache evaluation: since data can exist in different forms at different levels of the cache hierarchy, actual data values are needed to produce accurate statistics.


Since our focus is on graphics hardware, we also evaluated GPU simulators, including the gem5 simulator [46] and Multi2Sim [47]. gem5 is a modular simulator that contains both system-level and processor-level models for design exploration. It has support for detailed cache simulation, along with memory models with accurate DRAM controllers. However, gem5 does not provide a model of a compressed cache, does not support running non-compute traces on its GPU model, and cannot replay traces captured on a real GPU. Multi2Sim supports both functional and timing models for various CPU and GPU architectures, including AMD's Southern Islands and NVIDIA's Fermi architectures. However, just like Dinero, Multi2Sim does not model the movement of data through the cache hierarchy, which makes it difficult to use for compressed cache evaluation. Multi2Sim also lacks support for graphics workloads, as it only models the compute pipeline of a GPU.

As most of the industry-standard simulators do not meet our requirements for compressed cache evaluation, we decided to create our own custom framework for this purpose. We use this framework to capture data to help evaluate various compressed cache designs, compression algorithms, and memory design tradeoffs. Our framework consists of a configurable last level cache model with support for various tag-data mapping schemes, a compression engine that supports various state-of-the-art compression algorithms, and a simple main memory model with metadata storage for compressed data. The entire infrastructure is modular, which lets us effortlessly plug in custom components and configure simulations with varying parameters.

Our simulator is designed to run traces captured at the last level cache interface. For our evaluation, we use real-world graphics/gaming workloads for GPUs, including Call of Duty:BlackOps [48], Assassins Creed 3 [49], and Civilization 5 [50], along with industry-standard benchmarks such as Firestrike, Cloudgate, and Icestorm from the 3DMark suite [51]. The full list of graphics workloads is presented in Table 3.1. Two sets of input traces were captured at the last level cache interface: the first set on an AMD Tahiti GPU [43] and the second set on the newer AMD Vega64 GPU [2]. Along with the input trace, we also capture the return stream from the last level cache to verify the correctness of our simulations. Each trace contains between one million and ten million cache requests. The captured trace represents the cache traffic for a single frame and includes several draw calls. The trace includes the data address, the data values, and the metadata identifying which blocks within the GPU issued the read/write request. This includes data from the pixel, vertex, and compute shaders for each workload. Since these traces are captured on a live system, they also contain byte-masked write accesses.


Figure 3.1: Block Diagram of simulation Framework.

Workload                DirectX version   GPU
3D Mark Vantage         DX10              Tahiti
Assassins Creed 3       DX11              Tahiti
BioShock Infinite       DX11              Tahiti
Formula 1               DX11              Tahiti
Civilization 5          DX11              Tahiti
3D Mark Cloudgate       DX11              Tahiti
CoD BlackOps            DX11              Tahiti
3D Mark IceStorm        DX11              Tahiti, Vega
3D Mark GT2             DX11              Tahiti
3D Mark Firestrike      DX11              Tahiti
Far Cry Primal          DX11              Vega
Fallout 4               DX11              Vega
Alien vs Predator       DX11              Vega
Witcher 3               DX11              Vega
Total War - Warhammer   DX11              Vega

Table 3.1: Graphics workloads

Byte-masked writes cannot simply be merged, as our cache model stores data in compressed form. We have added support for partial writes to compressed data in our model to handle these byte-masked writes. We have also evaluated six compute workloads from the AMD OpenCL SDK [52] and the Hetero-Mark [53] benchmark suite. The list of workloads is shown in Table 3.2. These workloads help us compare compression metrics across graphics and compute.

Workload                         Benchmark
Simple convolution               OpenCL SDK
Matrix transpose                 OpenCL SDK
Bitonic sort                     OpenCL SDK
AES encryption-decryption        Hetero-Mark
Finite Impulse Response Filter   Hetero-Mark
KMeans                           Hetero-Mark

Table 3.2: Compute workloads

Traces for these workloads were captured using MultiGPUSim (MGSim), an execution-driven GPU simulator. The simulator was configured to model the R9 Nano GPU. The last level cache is split into four 512-kilobyte caches, each 16-way associative, with a 64-byte data line size. Figure 3.1 shows a block diagram for the various components in our simulator. The preprocessor block is used to convert the raw input trace into a cache request stream, producing aligned addresses and bytemasks. The preprocessor also analyzes the captured stream to initialize the memory model to the state at which the original trace was captured. This allows our simulation to handle the first read from the cache request stream correctly. The preprocessor can also be configured to combine write addresses for large cache block sizes, giving us the ability to run different configurations with various cache block sizes. The compression engine can be configured to compress data at different data granularities, and supports various state-of-the-art compression algorithms, such as base-delta-immediate [5], CPack [16], and others. The LLC model supports different cache designs, with support for conventional cache parameters such as block size, associativity, replacement policy, and write policy. The LLC model also takes as input parameters specific to a compressed cache, such as compaction policy, write expansion support, and indirect tag-data mapping. The simulator can be configured to use a 1:1 tag-data mapping, as seen in conventional caches, or a decoupled tag-data mapping, as proposed in the work done by Wood et al. [17] and Seznec et al. [12]. We also support skewed mappings based on compression ratios, as proposed by Sardashti et al. [28]. To support various compaction schemes, we added support for data block allocations at different subblock granularities.

This lets us allocate at subblock granularity to support indirect tag-data mapping schemes and improve data line utilization by compacting multiple compressed blocks in the same data line. For write policies, we support the conventional write-back and write-through policies. We also support custom policies required to handle cases specific to compressed data, policies not encountered when using traditional caches. For example, handling a partial write on top of a compressed block requires the block to be decompressed before we can write on top of it. Another case that needs to be handled correctly is when a write changes the size of a compressed block. The compaction scheme packs multiple compressed blocks into the same data line, and any change in compressed size requires the compacted blocks to be relocated. For replacement policies, we support the least recently used (LRU) scheme as well as priority-based schemes for shared data blocks. These priority-based schemes attempt to find a balance between burst efficiency and optimal hit rate to compensate for the limitations of the LRU scheme when it comes to compressed data. The memory model is fairly simple, but captures advanced statistics such as bank conflicts, page conflicts, and row buffer hit rate. At the end of the simulation, the output from our custom simulator is compared to the original captured trace to verify correctness. The simulator captures many important metrics for analysis of the various components of the compressed cache design. The metrics captured for the last level cache help us understand the impact of various tag-data mapping schemes, compaction schemes, and write policies unique to compressed data. The logical cache size helps us understand the logical expansion due to our compaction scheme and tag-data mapping. Data line utilization lets us evaluate whether our compaction scheme is using all available space effectively. The tag utilization provides insight into how large the super tags need to be, and the eviction statistics can help set the right granularity for the data blocks. We also capture the best-case hit rate for an infinite cache to compare against the various designs being evaluated.

• Cache Read/Write Hit-Rate

• Best Case Hit-Rate

• Logical Cache Size

• Number of Block/Super Block Evictions

• Data Line utilization

• Tag utilization


Other metrics are captured to provide insight on the cache-memory interaction with compressed data. We collect statistics on bandwidth gains due to compression, compressibility of the data stream, as well as effects on DRAM efficiency due to the changes in the memory access pattern as a result of compression. Compressed data can alter the read/write burst size between LLC and main memory, which can affect the row buffer hit-rate. For any access to the memory system, we keep track of the burst size to each bank and row of the DRAM, and record any conflicts to identify potential inefficiencies due to the compressed nature of our data. The various DRAM parameters, including the row size, number of banks, and the channel interleave factor, are configurable.

• Read/Write Bandwidth

• Logical Read/Write Bandwidth

• Average Compression Ratio

• Read/Write Compression Ratio

• Read/Write Burst Efficiency to DRAM

• Row Buffer Hit-Rate

• Bank/Page Conflicts

We also capture average compression ratios to measure the compressibility of the workload for various compression algorithms and record compression stats for each segment of the access stream to identify if selected program phases are more compressible. These time sliced traces can provide good insight into the nature of the data stream and how it affects compression. The metrics are verified using synthetic workloads that consist of input streams with specific access patterns. These specific patterns generate predictable numbers for all the supported metrics and help confirm the correctness of our simulator.
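To give a flavor of how a run over these parameters might be configured, the following is a hypothetical configuration sketch; the parameter names and values are illustrative assumptions, not the framework's actual interface.

```python
# Hypothetical configuration sketch for a compressed-LLC simulation run.
# All names and values are illustrative, not the framework's actual interface.
llc_config = {
    "num_slices": 4,
    "slice_size_kb": 512,
    "associativity": 16,
    "line_size_bytes": 64,
    "tag_data_mapping": "decoupled",      # "1:1", "decoupled", or "skewed"
    "compaction_granularity_bytes": 16,   # subblock size used for compaction
    "replacement_policy": "lru",          # or a priority-based policy
    "write_policy": "write-back",
    "compression_algorithm": "cpack",     # or "bdi", etc.
}

dram_config = {
    "row_size_bytes": 2048,
    "banks_per_channel": 16,
    "channel_interleave_bytes": 256,
}

print(llc_config["tag_data_mapping"], dram_config["banks_per_channel"])
```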


Figure 3.2: Sample simulation output for IceStorm benchmark.

Chapter 4

Results

In this chapter we discuss our proposed solutions to the challenges involved in designing a compressed cache and evaluate them using a wide set of real-world gaming workloads. In Section 2.2, we discussed the benefits and drawbacks of various compression algorithms. We discussed how dictionary-based compression algorithms work and the key limitations in current state-of-the-art compressors. We also described the two main types of dictionary-based compressors, static and dynamic, and how they differ from each other. Dynamic dictionaries in general have better compression ratios, as the entries adapt to each workload dynamically. This eliminates the need for static profiling to populate the dictionary, so the algorithm performs well on generic workloads rather than only on a pre-determined set. Dynamic dictionary-based compression techniques are used to increase the logical size of the LLC, which improves performance. As discussed in Section 2.2, prior dictionary-based compressors decompress the data in a cache block during eviction. This is done to maintain data consistency, as the compressed data and dictionary entries form a pair and each of these may change independently over time. This implies that any bandwidth reduction we observe is solely due to an improved hit rate, as the data on the bus is always transferred in uncompressed form. We propose a solution to extend dictionary-based compression methods to further improve bandwidth. We solve the data consistency problem with a novel dual dictionary system. We discuss this design in detail in the next section.


4.1 Dual Dictionary Compressor

We propose a compression algorithm that can further improve bandwidth by transferring the data in compressed form instead of decompressing on eviction. We introduce a second dictionary in memory to help with data consistency. We intelligently determine if it is valuable to transfer dictionary entries to memory and swap the dictionary index to maintain data consistency. If the dictionary entry is not significant enough to be transferred, then we attempt to preserve the static-zero and static-one compression in the data stream and uncompress the other words. We define ’static’ compression as a compression operation that does not use dynamic dictionary entries and compresses using a static value (such as ’0000’ or ’ffff’). These values are then pinned in the dictionary. We describe our architecture in more detail in the following sections.

4.1.1 Cache Architecture

Our dual dictionary compression scheme is agnostic to the underlying cache architecture. We use a decay-based [54] dictionary replacement scheme, as described in Section 4.1.4. This allows us to pick any underlying compressed cache architecture and add support for decay-based block eviction. For our experiments, we use the compressed cache architecture described in Figure 4.1. This model uses super tags to store up to two compressed blocks per cache line, with independent status bits for each compressed half. This theoretically limits us to a 2X logical cache expansion, but we can replace our cache model with any of the other compressed cache architectures that have been proposed, including DCC [17], SCC [28], and YACC [18]. These models can give us up to a 4X logical cache expansion, based on the maximum line compression of our algorithm.

Figure 4.1: A Cache Block
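As a rough data-structure sketch of the compressed cache line just described, the following shows one way a line holding up to two compressed blocks with independent status bits might be represented; the field names and sizes are assumptions for illustration, not the exact hardware layout.

```python
# Rough sketch of a cache line that can hold up to two compressed blocks.
# Field names and widths are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CompressedHalf:
    valid: bool = False
    dirty: bool = False
    tag: Optional[int] = None        # tag of the block stored in this half
    compressed_bytes: bytes = b""    # up to 32 bytes if the line is 64 bytes

@dataclass
class CacheLine:
    halves: List[CompressedHalf] = field(
        default_factory=lambda: [CompressedHalf(), CompressedHalf()])

    def can_insert(self, size_bytes: int) -> bool:
        # A new compressed block fits if a half is free and the block
        # compresses to half a line or less (the 2X expansion limit).
        return size_bytes <= 32 and any(not h.valid for h in self.halves)

print(CacheLine().can_insert(24))   # True: an empty line can hold a 24-byte block
```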

Figure 4.2 presents a block diagram of our cache model. We provide a dictionary swapping unit between the last level cache and main memory. The on-chip dictionary holds 64 entries and the larger off-chip dictionary holds 128 entries. The dictionaries can communicate with each other directly.

On an eviction, the data and metadata from the cache line are evaluated to determine the best way to transfer the data on the memory bus.

Figure 4.2: Block diagram of the LLC with our dual dictionary

4.1.2 Dictionary Structure

Our dictionary code structure is similar to the one described in Table 2.1, with some modifications that improve performance for graphics workloads. Based on our analysis, we update the code table initially described by Chen et al. [16] as follows:

Code   Pattern   Output         Length (b)
00     zzzz      (00)           2
01     xxxx      (01)BBBB       34
10     mmmm      (10)bbbb       6
1100   ffxx      (1100)bbbbBB   20
1101   zzzx      (1101)B        12
1110   mmmx      (1110)bbbbB    16
1111   ffff      (1111)         4

Table 4.1: Updated Pattern Table

First, we introduce a new code ’ffff’ and replace ’mmxx’ with ’ffxx’. This is needed

because our analysis found that graphics workloads contain a lot of black and white colors, which are represented by zeros and ones in the data stream [55]. These are easy to identify and compress in a stream, so adding new compression codes to handle complete and partial matches with 0xffff makes sense here. While the CPack work achieved good compression ratios for its selected compute workloads on CPUs, it achieved negligible improvements in terms of bandwidth [16]. We extend that work here to improve bandwidth usage by storing the data in its compressed form, taking advantage of a secondary dictionary in memory. We can then use multiple criteria to determine if it is worth transferring the data in its compressed form. If we meet the criteria, we perform a dictionary index swap operation to replace the on-chip dictionary indices with the indices in memory. If the dictionary swapping criteria are not satisfied, we keep the compressed words that are based on static compression and decompress the dictionary-based words. This is done by uncompressing the dictionary codes and indices, and leaving the zero- and one-compressed codes untouched. If we can generate a modified compressed stream that is smaller than the uncompressed data, this still has a positive impact on bandwidth.
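To make the code table concrete, the sketch below classifies a 32-bit word against the static zero/one patterns and a small dictionary, in the spirit of Table 4.1. The dictionary size and the exact matching rules are simplified assumptions, not the full encoder.

```python
# Simplified word classifier in the spirit of Table 4.1 (static patterns plus
# dictionary matches). The match rules and dictionary size are assumptions.
def classify_word(word, dictionary):
    if word == 0x00000000:
        return "zzzz"                     # all zeros, 2-bit code
    if word == 0xFFFFFFFF:
        return "ffff"                     # all ones, 4-bit code
    if (word >> 8) == 0x000000:
        return "zzzx"                     # zeros except one unmatched low byte
    if (word >> 16) == 0xFFFF:
        return "ffxx"                     # high half all ones (simplified rule)
    for index, entry in enumerate(dictionary):
        if word == entry:
            return ("mmmm", index)        # full dictionary match
        if (word >> 8) == (entry >> 8):
            return ("mmmx", index)        # match except the low byte
    return "xxxx"                         # uncompressible word, stored literally

# Example: classify a word against a tiny dictionary.
print(classify_word(0x0000001F, [0xDEADBEEF]))   # -> "zzzx"
```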

Figure 4.3: Output from Dictionary Index Swap Unit


4.1.3 Dictionary Index Swap

When a compressed line is evicted from the LLC, the Dictionary Index Swap Unit evaluates the most efficient method to transfer the data to memory. As shown in Figure 4.3, the Dictionary Index Swap Unit takes a compressed line and metadata as input, and produces one of three possible outputs:

• The compressed line is transferred to memory with the dictionary indices swapped with the indices of the second dictionary in memory. This creates a new dynamically compressed block.

• The compressed line is partially decompressed to replace dictionary entries with uncompressed words, while preserving the ’static one’ and ’static zero’ compression. This creates a new static compressed block.

The compressed line is completely decompressed and transferred to memory as an uncompressed block.

The dictionary index swap operation must only be done when specific conditions are met during block eviction. First, we check to see if the compressed line references a valid dictionary entry. If there is no valid entry, then a swap is unnecessary, since the compressed line is statically compressed and can be transferred uncompressed. If the compressed line references a valid entry in the dictionary, we check the metadata to see if the entry already exists in the off-chip dictionary. If the entry is present, we can perform the swap and transfer the updated compressed line to memory. In the case where a dictionary entry is not present in the off-chip dictionary, we need to evaluate if it will be beneficial to transfer the dictionary entry to the off-chip dictionary. We first check the number of dictionary entries used in the compressed line, as well as the reference frequency for each dictionary entry at that index. The reference frequency is the total number of times the dictionary entry is used across the entire cache. We then check if:

\sum_{i=0}^{n} F_i > T_f

where F_i is the reference frequency of each dictionary entry used by the compressed line, and T_f is a frequency threshold that determines if the dictionary entries add enough value to be transferred to memory. We also check, for each entry, if:

F_i > T_n


where T_n is the threshold value per dictionary entry. The threshold value per entry, as well as the frequency threshold, can vary by application. Typically, when running graphics workloads, the driver knows which application is running and can adjust the threshold values to provide the best performance. By being selective about whether to transfer dictionary entries, we are able to use a secondary dictionary in memory that consists of frequently used words from the on-chip dynamic dictionary. This approach is less dynamic in nature, because we do not eliminate the entries in the off-chip dictionary until the data in memory is overwritten. We can then update the metadata in the on-chip dictionary to mark the dictionary entry as present in memory. Then, when a subsequent cache line is evicted, we can save bandwidth by not transferring the dictionary entry again. The two dictionaries communicate with each other in order to sync up metadata related to the dictionary entries. The memory controller is responsible for communication between the two dictionaries. A metadata sync only occurs when dictionary entries are modified in the off-chip dictionary. This communication helps avoid bandwidth pollution due to unnecessary transfers of dictionary entries, and helps us evaluate the dictionary swap operation.
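The eviction-time decision described above can be summarized in code form. The sketch below is a minimal illustration under assumed thresholds and data representations, not the exact hardware logic.

```python
# Minimal sketch of the eviction-time decision for a compressed line.
# Thresholds and data representations are illustrative assumptions.
T_F = 64    # aggregate reference-frequency threshold (assumed)
T_N = 8     # per-entry reference-frequency threshold (assumed)

def eviction_action(entries_used, ref_counts, off_chip_entries):
    """entries_used: dictionary indices referenced by the evicted line.
    ref_counts: per-entry reference counts across the whole cache.
    off_chip_entries: entries already present in the off-chip dictionary."""
    if not entries_used:
        return "transfer-static"                 # only zero/one compression, no swap needed
    if all(e in off_chip_entries for e in entries_used):
        return "swap-and-transfer-compressed"    # swap on-chip indices for off-chip ones
    freqs = [ref_counts[e] for e in entries_used]
    if sum(freqs) > T_F and all(f > T_N for f in freqs):
        return "install-entries-then-swap"       # entries are worth sending off-chip
    return "partial-decompress-and-transfer"     # decompress dictionary words, keep static codes

# Example decision for a line that references two moderately used entries.
print(eviction_action([3, 7], {3: 40, 7: 30}, off_chip_entries={7}))
```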

4.1.4 Dictionary Replacement

Most dictionary replacement schemes for dynamic dictionaries update the dictionary entries every time a new word is seen by the compressor. These methods are highly inefficient, as they need to keep track of every cache line that is currently using a specific dictionary entry. Simple schemes, such as FIFO and LRU, perform poorly because they replace all the cache lines that reference the evicted dictionary entry. These schemes may also end up replacing dictionary entries that are frequently accessed and referenced by multiple cache lines. This makes the entire process inefficient, which defeats the purpose of using compression to improve cache efficiency. We can consider a frequency-aware replacement scheme that would address the multi-line eviction issue of the FIFO/LRU schemes, but this would still require us to keep track of all cache lines that use a specific dictionary entry. Decay-based replacement schemes, as discussed in prior work [54], do not require us to keep track of the lines using a specific entry. In our scheme, we assign a decay factor to the cache lines and the dictionary entries. If a dictionary entry decays, it means all lines using that specific dictionary entry have also aged out. This is because every time a cache line is accessed, the decay counter is reset. The reset action helps to make room for new dictionary entries that are referenced more often, and further improves our compression efficiency. The hardware required to implement decay counters is also much cheaper than the hardware required for FIFO/LRU-based schemes, as we need to keep track of significantly less information.

The decay factor can be updated dynamically based on the cache performance. A highly aggressive decay factor will end up evicting cache lines that are likely to be reused. This will reduce the hit rate, and end up increasing bandwidth contention.
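The following sketch illustrates the decay idea with simple counters; the counter width, decay interval, and data structures are assumptions, not the hardware design.

```python
# Illustrative decay-counter scheme for dictionary entries and cache lines.
# Counter width and decay interval are assumed values.
DECAY_START = 7   # counter value assigned on access (assumed 3-bit counter)

line_counters = {}    # line id -> decay counter
entry_counters = {}   # dictionary index -> decay counter

def on_line_access(line_id, dict_indices_used):
    # Reset the decay counters for the line and every entry it references.
    line_counters[line_id] = DECAY_START
    for idx in dict_indices_used:
        entry_counters[idx] = DECAY_START

def on_decay_tick():
    # Periodically decrement all counters; anything reaching zero has aged out
    # and is considered replaceable.
    for table in (line_counters, entry_counters):
        for key in list(table):
            table[key] -= 1
            if table[key] <= 0:
                del table[key]

# Example: one access followed by one decay tick.
on_line_access("lineA", [2, 5])
on_decay_tick()
print(entry_counters)   # {2: 6, 5: 6}
```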

4.1.5 DDC Performance

In this section, we review the results of our experiments when running a range of graphics workloads. In Table 4.2, we show the distribution of the compressed words across the three compression categories: i) dictionary compress, ii) zero compress, and iii) one compress. The dictionary compress column shows the percentage of words that were compressed using the dynamic dictionary entries. The zero compress column shows the percentage of words that were partially or completely compressed based on a zero doubleword. The one compress column shows the same for a doubleword that is all ones. We can observe from the table that for Civilization 5, up to 90% of the compression occurs due to zeroes. If we look at Table 2.1, we can see that for this game, the percentage of unknown words is much lower (i.e., 61.69%). This means a very high percentage of this game is compressible and the reliance on dictionary swapping is limited. This can be observed in Figures 4.4 and 4.5, where we plot the bandwidth as a percentage of a single dictionary compressor. The results show high bandwidth savings using our compression scheme. On the other end of the spectrum, we can observe from Table 4.2 that 3DMark Vantage, a DirectX 10 benchmark, also has 96.3% of its compressible data in the form of zeroes or partial zeroes. However, inspecting Table 2.1, we can see that the benchmark consists of 85.7% unique words. This is why we do not see much bandwidth savings in Figure 4.4 and Figure 4.5. Over the eight workloads, we see an average bandwidth savings of 18.55% for reads and 11.01% for writes. From Table 4.2, Formula 1 contains the highest percentage of dictionary-based compressed words. Our dual dictionary based scheme still manages bandwidth savings of 12.75% and 7.51% for reads and writes, respectively. Our cache architecture can theoretically achieve 2X logical expansion, but in practice with real-world workloads we see closer to 1.2X. This shows that our dual dictionary based scheme is able to save bandwidth without adding bandwidth pollution by transferring the dictionary entries on the memory bus.

4.1.6 Summary

As we can see from our experiments, our proposed dual dictionary scheme can help improve bandwidth by transferring data from the cache in compressed form. We also analyze the graphics workloads and conclude that dictionary-based compression schemes are less suitable for such workloads.


Workloads                Dictionary Compress   Zero Compress   One Compress
3D Mark Vantage DX10     3.36%                 96.31%          0.32%
Assassins Creed 3 DX11   38.76%                50.41%          10.81%
BioShock Infinite DX11   4.51%                 79.34%          16.14%
Formula 1 DX11           52.89%                41.70%          5.40%
Civilization 5 DX11      9.34%                 89.97%          0.67%
3D Mark Cloudgate DX11   5.46%                 88.63%          5.90%
CoD BlackOps DX11        9.79%                 74.66%          15.53%
3D Mark IceStorm DX11    24.31%                36.83%          38.85%

Table 4.2: The compressed word distribution

Figure 4.4: The Write BW Savings


Figure 4.5: The Read BW Savings

A base-delta scheme, as described in [5], could provide better compression ratios due to its ability to select multiple bases per cache line. In Section 2.4, we considered another key issue with compressed cache designs: memory efficiency. The variable size of compressed blocks in the cache can result in inefficient data access patterns in the DRAM. Low DRAM access efficiency can significantly increase latency and power consumption, resulting in poor system performance that can neutralize any gains realized due to compression. We propose a solution to tackle this issue so that memory efficiency can be significantly improved and the benefits of compression can be realized to their full potential.

4.2 Compression Aware Victim Cache

In this section we describe our approach of using a victim cache as an efficient burst unit to improve DRAM efficiency. Our victim cache is based on the decoupled compressed cache implementation proposed by Sardashti et al. [17]. We use a fully associative cache with decoupled supertags and data lines. Each supertag can address multiple compressed data blocks from the LLC. Figure 4.6 shows a cache block eviction from the LLC. When a cache block is evicted from the LLC, the cache controller looks at the metadata to determine if the block holds compressed data.


Figure 4.6: Block diagram of LLC with Victim Cache

The compressed blocks are then stored in the victim cache, while the uncompressed blocks go straight to DRAM. As the supertag in the victim cache can address a lot more data, multiple compressed blocks from the LLC get mapped to a single supertag. This allows the supertag to have a much larger data burst size during eviction. The supertag size can be configured to target a specific average burst size based on the DRAM parameters. Using the Least Recently Used (LRU) replacement policy might not always result in efficient burst sizes, so we use a burst-efficiency-aware replacement policy.

Figure 4.7: Super Tag structure for the Victim Cache

Figure 4.7 shows the structure of the victim supertag. The supertag has status bits for each LLC block stored, as well as a valid bit and a count of the subblocks used. During block replacement, the last four entries of the LRU stack are searched to find a supertag that has exceeded the minimum burst size required for efficient bursting.

This allows us to increase the number of efficient bursts to DRAM and gives partially occupied supertags more time to accumulate data for an efficient burst. The victim cache can also be used to solve other challenges with compressed cache design, as described in the following sections.
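A minimal sketch of this burst-aware victim selection is shown below. The LRU search depth follows the description above, while the minimum-burst threshold, data structure, and helper names are illustrative assumptions.

```python
# Sketch of burst-aware victim selection in the victim cache.
# The data structure and threshold value are illustrative assumptions.
from collections import namedtuple

Supertag = namedtuple("Supertag", ["tag", "accumulated_bytes"])

LRU_SEARCH_DEPTH = 4      # search the last four entries of the LRU stack
MIN_BURST_BYTES = 256     # minimum size for an efficient DRAM burst (assumed)

def select_victim(lru_stack):
    """lru_stack is ordered from most- to least-recently used supertags."""
    candidates = lru_stack[-LRU_SEARCH_DEPTH:]
    # Prefer a supertag near the LRU end that already holds an efficient burst.
    for st in reversed(candidates):
        if st.accumulated_bytes >= MIN_BURST_BYTES:
            return st
    # Otherwise fall back to plain LRU, giving partially filled supertags
    # more time to accumulate data for an efficient burst.
    return lru_stack[-1]

stack = [Supertag("A", 320), Supertag("B", 64), Supertag("C", 512), Supertag("D", 128)]
print(select_victim(stack))   # picks C: within the search window and large enough
```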

4.2.1 Design Challenges

In this section we discuss challenges associated with compressed caches and how a victim cache can be used to address these challenges. Real-world workloads frequently issue byte-masked writes, which are a trivial operation with uncompressed data. As seen in Figure 4.8, a cache hit results in the byte-masked data being stored on top of the uncompressed block, while a cache miss results in a write allocate to store the dirty bytes, which are later transferred to memory.

Figure 4.8: Byte-masked write with uncompressed data

For a compressed cache, a byte-masked write can turn into a high-latency operation. This is because we would need to decompress the data before we can merge the byte-masked bytes. If we receive multiple byte-masked writes, the latency impact becomes much worse. We can use the victim cache as an elegant solution to this problem. By adding a new status bit (DV), as shown in Figure 4.7, we can mark compressed lines with pending byte-masked writes.


Instead of merging the dirty bytes, we can store them in the victim cache, as shown in Figure 4.9. This way we can merge multiple byte-masks in the victim cache and, when the mask becomes full, compress the data and completely overwrite the previous block. This significantly reduces the read-modify-write latency associated with servicing every byte-masked write. If a read request arrives before the byte-mask is full, there is no performance hit: the compressed data needs to be decompressed anyway, and during the decompression operation the dirty bytes are fetched in parallel from the victim cache. The uncompressed data is then merged with the dirty bytes and returned to the client that requested it.
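The sketch below illustrates this deferred-merge idea; the mask representation, buffer structure, and function names are simplified assumptions rather than the actual victim cache implementation.

```python
# Sketch of deferring byte-masked writes for a compressed block.
# Mask representation and buffer structure are simplified assumptions.
LINE_SIZE = 64

pending = {}   # block address -> (bytearray of dirty bytes, bytearray mask)

def byte_masked_write(addr, offset, data):
    buf, mask = pending.setdefault(addr, (bytearray(LINE_SIZE), bytearray(LINE_SIZE)))
    buf[offset:offset + len(data)] = data
    for i in range(offset, offset + len(data)):
        mask[i] = 1
    if all(mask):
        # The whole line is now dirty: compress and overwrite the old block outright.
        return "recompress-and-overwrite"
    return "deferred"                       # dirty bytes held in the victim cache

def read(addr, decompress_block):
    # Decompress the stored block and merge any pending dirty bytes.
    line = bytearray(decompress_block(addr))
    if addr in pending:
        buf, mask = pending[addr]
        for i in range(LINE_SIZE):
            if mask[i]:
                line[i] = buf[i]
    return bytes(line)

# Example with a stand-in decompressor that returns an all-zero line.
byte_masked_write(0x100, 4, b"\xAA\xBB")
print(read(0x100, lambda addr: bytes(LINE_SIZE)).hex())
```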

Figure 4.9: Byte-masked write in compressed cache


4.2.2 Results

In this section, we analyze the results from our experiments with the efficient burst unit (EBU). From Figure 4.10, we can see significant improvements to burst efficiency across the entire sweep of workloads. The larger tag size of the victim cache, coupled with the efficiency-aware replacement scheme, significantly increases the average burst size for our cache. Figure 4.11 shows significant improvements as well, as the larger bursts access the same row buffer, causing the hit rate to increase.

Figure 4.10: Burst Efficiency with EBU on and off

4.2.3 Summary

We use a novel approach of accumulating compressed data in a victim cache to improve memory bursting. While victim caches are traditionally used to improve hit rates, we show that using a decoupled victim cache with super tags can help increase the burst size for compressed data, which improves memory efficiency and row buffer hit rates in DRAM. In Section 2.3.1, we discussed prior work in the field of smart controllers that learn from various features present in the workload to make dynamic decisions that help improve performance. We propose a smart cache controller that dynamically adjusts cache policies to better handle various challenges associated with compressed data management.


Figure 4.11: Row Buffer Hit rate with Victim EBU on and off

4.3 Smart Cache Controller

In Chapter 2 we discussed how the perceptron learning algorithm works and how it can be implemented efficiently in hardware to make dynamic performance decisions. We present a learning algorithm that can improve compressed data management in a cache by predicting when to compress data and deciding when to cache recently decompressed data, reducing latency on the critical path. Alameldeen and Wood [29] have proposed using a saturating counter to predict if an incoming block of data should be compressed in the cache. They store the compressed size of each block, along with the metadata, to predict if a cache miss could have been avoided with data compression. If the prediction is negative, the counter is updated to reflect a bias against compression. The goal of this prediction mechanism is to reduce access latency by only compressing blocks when the hit rate can be improved. This approach only minimizes compaction complexity, as it mitigates the need to manage compressed data when it does not improve performance. However, this method requires every block to be compressed in order to generate the metadata required for decision making, which is an additional cost, separate from the question of whether the data being stored should be compressed or not. Our compression prediction differs from this approach, as our work is focused on GPU LLCs, where efficient use of memory bandwidth is far more important than hit rate. This is because GPUs are massively parallel machines that value high throughput more than low latency.


We propose compressing an incoming block of data based on compressibility trends that our learning model accumulates. If the prediction says the data is not compressible, we save energy and avoid adding the latency associated with the data compression operation. Alameldeen and Wood [56] also propose an adaptive prefetch mechanism that is used to eliminate harmful prefetches by controlling the stride prefetches based on performance history. However, they do not address the memory efficiency issue with transferring variable-sized compressed data on the memory bus. Our prediction mechanism uses a perceptron learning method to prefetch more compressed data from an open DRAM page in order to increase bus utilization significantly. We next describe the features of our smart cache controller and explain how we use perceptron-based learning to make predictions that help improve performance in a compressed cache.

Figure 4.12: Conventional Compressed Cache

From Figure 4.12 we can see that a conventional compressed cache attempts to compress every write it processes and decompresses data on every read request to a compressed block. As mentioned in Section 2.3, not all writes can be compressed, due to the nature of the data pattern associated with the write. The first feature of our smart cache controller is the ability to predict when the data is compressible. When a compressor fails to compress data, it ends up adding unnecessary latency and wasting power.

While compression latency may not always be on the critical path, the block placement algorithm depends on the output of the compressor to decide how to compact the data in the compressed cache. Unnecessarily delaying this operation may end up stalling block placement for other blocks that are accessing the cache. Our controller utilizes feature tables for tracking each feature. For each LLC access, we use request-related information, such as the LLC client making the request, the data type (float, int), the dimensionality of the surface (1D, 2D, or 3D), and the address associated with the request, to populate our feature tables. An LLC client is a block of hardware within the GPU that shares data in the LLC. Some examples of LLC clients on an AMD Vega GPU [2] are the compute engines, the pixel engines, and the geometry engines. Our learning model has individual feature tables for each field in the metadata. The model is trained by adjusting the weights for each feature, based on the outcome of each compression attempt in the sample channel. Successful compressions have a positive correlation with the weights, while failed compression attempts impact the weights negatively. Adjustments are made using the perceptron learning method, resulting in a correlation between independent features and data compressibility. During prediction, the weights for the independent feature tables are added and compared against a threshold to make a decision. If the smart compressor predicts poor compression, then the controller chooses not to compress the data associated with that access, resulting in power and performance improvements. The second feature that our smart cache controller targets is decompression latency. As decompression latency is on the critical path of a cache read request, the controller must avoid it whenever possible. On a GPU, multiple clients may read the same data block, which could result in multiple decompression requests for the same compressed block. We use the learning algorithm to predict when such repeat accesses occur and store the result of the first decompression in the decompression buffer, avoiding any decompression latency on subsequent accesses. Figure 4.13 shows how each access is evaluated by our smart controller to enable the two features discussed so far. The third feature of our smart cache controller targets memory efficiency. Accessing DRAM is a long-latency operation, and inefficient access patterns, such as short read or write bursts to DRAM banks, can cause conflicts (as discussed in Section 2.3). This can make the latency grow much longer, even if bandwidth is reduced. Power is another major concern with inefficient access patterns, as DRAM power can consume as much as a third of the total system power [42]. Transferring compressed data on the memory bus can be inefficient due to the variable-sized nature of compressed data.

A cache managing uncompressed data will access memory in cache-line-sized bursts. As the cache line size is constant, it is easy for the memory controller to schedule accesses in a way that maximizes DRAM data bus utilization. The variable nature of compressed data makes this problem much more challenging for the memory controller. We use the prediction mechanism from our compression feature to enable prefetching in our cache. As our cache block compacts multiple compressed blocks into a single cache line, we can use the compression ratio to prefetch more data and improve data bus utilization for the compressed data. Increasing the burst size from a single DRAM bank reduces latency while improving bus utilization. If the data is uncompressed, we can avoid prefetching and let the memory controller optimize the scheduling of that access.
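As an illustration of how the compression ratio might drive the prefetch amount, the sketch below scales the burst size by the predicted ratio. The scaling rule, the maximum prefetch depth, and the predictor interface are assumptions for illustration, not the exact controller policy.

```python
# Sketch of compression-ratio-driven prefetching (scaling rule is assumed).
LINE_SIZE = 64           # bytes per uncompressed cache line
MAX_PREFETCH_LINES = 4   # cap on how far we prefetch into the open row (assumed)

def prefetch_lines(predicted_compressible, compression_ratio):
    if not predicted_compressible:
        # Uncompressed data: fetch just the requested line and let the
        # memory controller schedule the access normally.
        return 1
    # Compressible data: fetch enough neighboring lines so the burst carries
    # roughly one line's worth of compressed data per bank access.
    extra = int(compression_ratio)        # e.g., a 2:1 ratio -> 2 lines
    return min(max(extra, 1), MAX_PREFETCH_LINES)

# Example: with a predicted 2:1 ratio, fetch two 64-byte lines in one burst.
print(prefetch_lines(True, 2.0) * LINE_SIZE, "bytes per burst")
```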

Figure 4.13: Smart Compressed Cache

We cover the metrics used to evaluate our smart cache controller, and present the results from our experiments.


4.3.1 Smart Compression

We start with our smart compressor, which predicts whether compression is likely to be beneficial and then directs the cache whether to compress the data. Figure 4.14 shows how the smart controller is able to avoid initiating a large portion of the wasteful compression attempts. To accurately gauge the performance of our prediction scheme, we create an oracle predictor for compression. The oracle prediction mechanism is a perfect predictor that eliminates all data compression attempts that fail to reduce the size of the input data. Our smart predictor identifies 79% (on average) of the beneficial compression operations. This shows that, for graphics workloads, there is a strong correlation between the compressibility of a data block and our feature set (i.e., LLC client ID, address, and metadata), as discussed in Section 2.3.1. We also consider the impact of false positives, cases where we missed out on the benefits of cache compression due to a wrong prediction by our model. We find that their performance impact is small, with the false positives (i.e., missed opportunities) making up 3.8% of the total predictions (on average). We acknowledge that false positives can increase bandwidth demands, but with low misprediction rates, the impact on bandwidth is low for our workloads.

4.3.2 Smart Decompression

Next, we evaluate the benefits of our smart decompressor. The goal of the decompressor is to detect when a compressed data block is likely to be reused soon. Repeated accesses to a compressed data block can result in multiple wasteful decompression operations. Our smart decompressor identifies such candidate blocks and stores the output of the first decompression in the decompress buffer. Subsequent hits access the decompress buffer rather than the compressed data in the cache, resulting in lower latency for latency-sensitive operations. To illustrate the benefits of our design, we create an oracle that perfectly identifies candidate blocks and compare the performance of our smart decompressor to the oracle. Figure 4.15 compares the performance of the smart decompressor against this oracle predictor. We see that the smart decompressor reduces latency for 58% of the repeated compressed reads, on average, taking decompression off the critical path. We can ignore the small number of false positives in this case, as there is little penalty associated with storing data in the decompress buffer. Any unused data in the decompress buffer simply gets replaced when new decompressed data is inserted. While a 58% reduction in repeated decompressions is significant, Teran et al. [36] reported that perceptron-based reuse prediction is much more effective when the program counter of the memory instruction is used as a feature in the prediction.


Figure 4.14: Results for smart data compression. For each workload, we show the oracle performance as our baseline. We also plot the wasteful compressions that were identified using our model as a percentage of the oracle, as well as the percent of false positives, as a percentage of total predictions.

While our cache traces did not have this information available, based on their work, we are confident that adding the instruction ID to our existing features (client ID and surface metadata) will help improve performance even further. We leave this study for future work.

4.3.3 Smart Prefetch

In Section 2.3.1, we showed how data compression can reduce memory efficiency by creating conflicting memory access patterns. In this case, memory performance is impacted due to a growing number of bank conflicts, leading to poor data bus utilization. While there have been many algorithms proposed for scheduling memory accesses [44, 57, 58, 59], they require additional hardware (e.g., reorder buffers and dynamic page management) to improve the performance of the memory access stream. This adds to memory latency and increases power consumption.


Figure 4.15: Results for smart data decompression. For each workload, we show the oracle perfor- mance as our baseline. We also plot the repeated decompressions that were eliminated using our smart decompressor as a percentage of the oracle.

While scheduling requests, the DRAM controller must keep a balance between the conflicting goals of fairness across accesses and achieving high throughput for the entire memory system. A DRAM client (i.e., the LLC) can reduce this work significantly by creating a more efficient memory access pattern, reducing the amount of management needed. This results in fewer variables for the DRAM controller to consider while scheduling accesses. To evaluate our smart prefetch feature, we measure the memory efficiency at the LLC-to-memory interface. For each workload, we consider three different configurations: 1) compression off, 2) compression on, and 3) compression on with smart prefetch. Memory efficiency is measured as the effective bandwidth of the cache-to-memory interface, which includes penalties due to row conflicts for each memory bank. Figure 4.16 shows how compression can reduce memory efficiency by increasing the number of bank conflicts. We can see that, across all workloads, turning data compression on drops memory efficiency by 7.5% on average.

Workloads with high compressibility, such as Witcher 3 and Far Cry Primal, are more heavily impacted, with memory efficiency reduced by 10.1% and 12.8%, respectively. Our smart prefetching algorithm increases memory efficiency by prefetching compressed data to increase the amount of data associated with each bank access. Given that the compressed data is associated with a surface, prefetching neighboring compressed data should result in additional useful data being brought into the cache. A mechanism that always prefetches data can pollute the cache. Our results show that our smart prefetch scheme achieves almost the same efficiency as the uncompressed configuration; our smart cache compensates for the loss in memory efficiency due to data compression.

Figure 4.16: Results for smart prefetching. For each workload we show: the efficiency with compression turned off (the baseline), the memory efficiency drop with compression turned on, and the efficiency improvements when using smart prefetching.

From our results, the Alien vs Predator (AVP) game appears to be a clear outlier: it receives the least benefit from all our smart prediction schemes. There are two possible reasons for this outcome.


Figure 4.17: Area impact of feature tables

First, from Figure 4.16, it is clear that AVP does not experience a large drop in memory efficiency when using compression. This is because the data stream in AVP exhibits limited compressibility compared to the other workloads. The second reason is that this particular simulated frame has the lowest hit rate in our cache model. Considering the combined impact of these two observations, it is clear that learning from changing access patterns yields fewer benefits. We believe learning over multiple frames instead of a single frame may help smooth out the noise in the learning algorithm, resulting in improved performance. We will evaluate this trade-off in our future work. In addition to the performance metrics, we also used the CACTI [60] tool to evaluate how scaling the number of features and the depth of each feature table impacts the area of the cache. CACTI allows us to model special structures in the cache by adding additional bits to the tag array.


While the control logic is not modeled, CACTI can be used to generate approximate area numbers for the data storage dedicated to special structures in the cache. Figure 4.17 shows the results from our experiments. It is clear from the chart that, for a specific area target, a trade-off must be made between the number of features and the number of entries dedicated to each feature. For example, Configs 2C and 3B have identical area impact, with the former supporting eight features and the latter supporting four features with twice the entries in each feature table. Config 3C shows that a large number of features with deep feature tables has a significant area impact (37%). In addition to the area impact, performance might also suffer with a larger number of features, as it becomes more challenging to find a strong correlation with performance. We should strive to use the minimum number of features that have a strong correlation to performance in order to get the best balance between performance and area overhead.

4.3.4 Summary

We propose a smart cache controller that uses perceptron learning to address the competing challenges associated with managing compressed graphics data in a cache. We show how data compression can lead to poor memory efficiency, increased latency, and greater power consumption. Utilizing a trace-driven model, we explored how a smart controller can help alleviate these issues. We show that the metadata associated with graphics workloads can be used as an indicator of data compressibility. Leveraging this information, we are able to make accurate predictions to improve the performance of a compressed cache. We analyze our proposal using real-world graphics workloads, including the latest high-end games and GPU benchmarks. We show that our smart controller can successfully avoid most unnecessary compressions and decompressions, and can improve the memory efficiency of a compressed cache by intelligently prefetching compressed data to reduce row conflicts.

Chapter 5

Compression on Graphics Hardware

The motivating reason for using compression is to encode redundant information such that the encoded data is smaller in size than the original input. In Section 2.2 we looked at various cache compression algorithms that exploit redundancy to compress data. However, most cache compression algorithms do not consider the unique requirements associated with compression when implemented in a graphics pipeline. Figure 5.1 shows a simplified software pipeline for the popular DirectX programming interface. The pipeline stages perform various vertex- and pixel-related operations that require a significant amount of memory bandwidth. The resources required by a graphics application are preloaded in memory before the application begins. Each stage performs either a read or a read/write operation on those resources. Let us take a closer look at the resources used in various stages of the 3D pipeline. The initial stages in the pipeline perform matrix transforms, clipping, and lighting operations on the 3D geometry present in the scene. Once the vertices are transformed, rasterization is used to convert the vertices related to 3D objects into pixels in screen space. Pixel processing involves several memory-bandwidth-intensive operations, such as texture read/write, color read/write, and depth buffer accesses. For each pixel associated with the 3D geometry, the processor must determine if it is visible in screen space, what texture to map to it, and the final color value of the pixel. Next, we consider how compression plays a key role in improving performance in the graphics pipeline. Three different classes of compression are used in the graphics pipeline:

• Geometry Compression

• Texture Compression

• Depth and Color Compression

Figure 5.1: A simplified Direct3D pipeline.

We consider each class in more detail next.

5.1 Geometry Compression

Figure 5.1 shows the first two stages in the 3D pipeline, which process vertex information from the 3D models present in a scene. The models are stored as mesh structures containing encoded connectivity information for all the vertices. Game developers try to optimize the representation of 3D objects by encoding them with the fewest vertices possible. Modern GPUs have special vertex processing engines [2, 61] that are optimized for working with such mesh structures.

Deering [62] proposes a lossy compression algorithm to compress the data associated with each vertex. A vertex is described by position, color and normal information. The X, Y, Z position coordinates are represented as floating point numbers. Deering proposed reducing the precision of these floating point numbers to 16-bit values for each component, and observed that the deltas between neighboring vertices in the mesh can be represented with even fewer bits, which significantly reduces the data size. This method is lossy in nature due to the reduced precision. This contrasts with cache compression algorithms, which need to be lossless, since a compressed cache must return the same raw data that was initially stored in the cache. The RGBA color information encodes the red, green, blue and alpha channel values, and is treated similarly to the position information: a reduced-precision version can be compressed using the same delta compression scheme as the position parameters. In the method proposed by Deering [62], the compression ratio versus quality trade-off is made by changing the precision used for each parameter.

Chow [63] argues that this method requires manual experimentation to achieve an acceptable degree of image quality, which can be extremely time consuming for large databases. He proposes an improved algorithm that creates more efficient mesh structures, and includes a method to automatically pick the precision needed to preserve image quality before compressing the geometry data. Since the proposed compression operation would be run offline, before the geometry information is read by the GPU, the encoding operation need not be fast.

Hoppe [64] observes that offline compression schemes can only handle static geometry, and thus are not usable for animated geometry that changes every frame. Hoppe instead proposes a transparent vertex cache that stores data in uncompressed form and supports random access, handling dynamic as well as static geometry. Reuse of transformed vertices from the on-chip vertex cache can significantly reduce memory bandwidth requirements, thus minimizing the need to compress the vertex information. Graphics card vendors, such as AMD, have released tools [65] that help developers optimize their triangle mesh structures for better performance with a vertex cache.
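
As a rough illustration of the quantize-then-delta idea described above, the C++ sketch below reduces a coordinate to 16-bit precision and then stores deltas between neighboring vertices; the value range, traversal order, and delta widths are simplifying assumptions, not the exact encoding of [62].

    #include <cstdint>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Lossy step: quantize a float coordinate in [-1, 1] to a signed 16-bit value.
    static int16_t quantize16(float v) {
        return static_cast<int16_t>(std::lround(v * 32767.0f));
    }

    int main() {
        // Toy vertex strip: x coordinates of neighboring vertices in a mesh.
        std::vector<float> x = {0.250f, 0.251f, 0.253f, 0.252f};

        // Encode the first vertex absolutely, then store deltas between neighbors.
        std::vector<int32_t> encoded;
        int16_t prev = quantize16(x[0]);
        encoded.push_back(prev);                 // absolute 16-bit value
        for (size_t i = 1; i < x.size(); ++i) {
            int16_t q = quantize16(x[i]);
            encoded.push_back(q - prev);         // narrow delta needs few bits
            prev = q;
        }
        for (int32_t e : encoded) std::printf("%d ", e);
        std::printf("\n");                       // deltas stay small for nearby vertices
        return 0;
    }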

5.2 Texture Compression

Before we discuss texture compression, we first need to understand why textures are used on a GPU. Textures are image patterns used to add detail to a 3D surface. Figure 5.2 shows how the visual quality of a simple cube is improved by mapping a texture to its surface. Texture mapping is used because it is easier to add detail by mapping a texture pixel (texel) onto a pixel of a visible 3D surface than to compute the value of every screen pixel procedurally. It is a cheap way to improve scene quality and significantly improves the performance of a GPU.

Figure 5.2: Texture Map example by Elise Tarsa [66].

A game typically has several textures associated with it, and processing each pixel may result in multiple accesses to texture memory. This consumes a significant amount of memory space and bus bandwidth. The file sizes associated with modern games have been increasing at an exponential rate due to higher resolution textures [3], so compression is necessary to manage the rapidly growing texture sizes. Graphics processors have dedicated hardware to make texture accesses and texture mapping more efficient, and game development APIs and graphics hardware have also added support for compressing textures on the GPU [67, 68, 69].

Texture compression and cache compression have requirements that do not completely overlap. Beers et al. [70] specify some unique properties associated with texture compression. We need to consider these requirements and understand how they contrast with cache compression objectives. The major difference between cache compression and texture compression is that texture compression is, in most cases, lossy; all the dedicated texture hardware on a GPU assumes lossy algorithms. Compression implemented in the last-level cache, in contrast, cannot be lossy, as the cache handles data from multiple clients and needs to return the exact same data on all read requests. Another unique property of textures is that they can be accessed randomly, as any part of the texture can map to a pixel on the 3D geometry, depending on the complexity of the geometry. This means compressed textures need to support fast decompression for random accesses to compressed blocks, given that read requests will most likely be on the critical path. Compressed caches typically compress data at a cache-line granularity, so any read request will typically fall within a single compressed block. Finally, texture compression can be done offline or during the initial loading of the gaming workload, so the compression operation can be slow. While compression in a cache is also not on the critical path, it does delay block placement for a compressed block, which may cause performance issues for subsequent reads that are on the critical path.
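
To illustrate why random access favors fixed-rate block formats, the C++ sketch below locates the compressed block that holds an arbitrary texel with simple arithmetic; the 4x4 block size and 8-byte compressed block footprint are assumptions in the spirit of BC1-style formats, not a specific GPU format.

    #include <cstddef>
    #include <cstdio>

    // With a fixed-rate format, every 4x4 texel block compresses to the same
    // number of bytes, so the block holding texel (x, y) is found with a
    // closed-form computation -- no sequential decode is needed.
    static size_t block_offset(int x, int y, int tex_width,
                               int block_dim = 4, size_t block_bytes = 8) {
        int blocks_per_row = tex_width / block_dim;
        int bx = x / block_dim;
        int by = y / block_dim;
        return (static_cast<size_t>(by) * blocks_per_row + bx) * block_bytes;
    }

    int main() {
        // Texel (137, 42) in a 1024-texel-wide texture.
        std::printf("compressed offset: %zu bytes\n", block_offset(137, 42, 1024));
        return 0;
    }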

5.3 Depth and Color Compression

Depth compression [71], also known as Z compression, is another essential operation on a modern GPU [72] that helps improve memory bandwidth. A depth buffer stores the depth values of the 3D objects in a scene and is used to determine whether an opaque object closer to the viewer is obscuring an object behind it. An occluded object need not be rendered, which improves rendering performance, and the depth buffer is an essential part of the modern graphics pipeline. While depth buffers improve performance, they also stress the memory system, since depth buffer read/write operations can consume between 40% and 100% of the memory bandwidth for each pixel processing operation [72]. Unlike texture compression, depth compression cannot be lossy, as data loss may result in visible artifacts in the rendered scene.

Typical depth compression algorithms [73, 74] store the depth values for a fixed number of pixels forming a pixel block. The algorithms compress data by calculating the deltas between the various values, similar to the base-delta [5] algorithm. For example, Morein et al. [74] take advantage of the fact that depth buffers typically track the maximum and minimum depth values, known as Z-max and Z-min. They calculate deltas for the depth values from either Z-max or Z-min, based on whichever selection provides the narrowest delta range. The selected Z value is then used to encode the deltas, which are stored in the Z buffer in compressed form. Color values associated with a pixel can also be compressed using a similar delta compression algorithm; hardware vendors, such as AMD [7] and NVIDIA [61], use this approach to store narrow deltas and improve memory bandwidth.
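
The C++ sketch below captures the spirit of this Z-min/Z-max scheme: each depth value in a tile is encoded as a narrow delta from whichever of the two extremes is closer, and the tile falls back to uncompressed storage if the widest delta exceeds a fixed budget. The tile size, the 12-bit delta budget, and the per-pixel base-selection bit are illustrative assumptions, not the exact encoding of [74].

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Check whether a tile of depth values can be encoded as narrow deltas from
    // either the tile's Z-min or Z-max (whichever is closer for each pixel).
    // A real encoder would also store one bit per pixel recording which base was used.
    static bool compress_tile(const std::vector<uint32_t>& z, int delta_bits = 12) {
        uint32_t zmin = *std::min_element(z.begin(), z.end());
        uint32_t zmax = *std::max_element(z.begin(), z.end());
        uint32_t worst = 0;                      // widest delta we would have to encode
        for (uint32_t v : z)
            worst = std::max(worst, std::min(v - zmin, zmax - v));
        bool fits = worst < (1u << delta_bits);
        std::printf("widest delta = %u -> %s\n", (unsigned)worst,
                    fits ? "compressed" : "stored uncompressed");
        return fits;
    }

    int main() {
        // Depth values covering a nearly flat surface are tightly clustered.
        compress_tile({0x00FF1200, 0x00FF1210, 0x00FF1235, 0x00FF1208,
                       0x00FF1222, 0x00FF1219, 0x00FF1241, 0x00FF1204});
        return 0;
    }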

5.4 Compute Workloads

So far we have looked at several approaches for compressing data associated with graphics workloads, and focused on how graphics metadata plays a big role at each pipeline stage. Next, we switch our focus to compute workloads to see if we can draw any comparisons between the two types of workloads.

We first attempt to compress the AES encryption-decryption workload from the Hetero-Mark benchmark suite [53] using the base-delta-immediate algorithm [5]. Figure 5.3 shows that we can obtain bandwidth savings of around 12% with base-delta compression enabled in the LLC. The rather low bandwidth savings are expected, since encrypted data has little redundancy to exploit. In such situations the obvious choice would be to turn off compression, since a large number of compression attempts will fail. However, our LLC compression is transparent to the application running on the GPU. It is difficult for the LLC to know what type of compute workload is accessing memory, so the compressor in the LLC will attempt to compress every block of data. In Section 4.3 we proposed using graphics metadata to predict compressibility, but compute workloads do not have this metadata available; they are treated as linear buffers holding one-dimensional data. We therefore need to identify features in the compute workloads that can generate metadata hints about the nature of the workload. These features can then be used to train the smart controller to make compression predictions. Figure 5.4 shows the bandwidth savings for all of our compute workloads, and Figure 5.5 shows the impact on bus utilization.


Figure 5.3: Bandwidth savings - AES Benchmark

Figure 5.4: Bandwidth savings - Compute Workloads

It is important to note that we do not model the memory controller in our simulator, so our memory analysis is done from the point of view of the client accessing the memory. In this case, the LLC is responsible for making sure that good bank rotation occurs before a different row on the same bank is accessed. With compressed data, this may not always be possible, resulting in a high number of row conflicts and lower bus utilization.
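
As a toy illustration of this effect, the C++ sketch below decodes addresses into (bank, row) pairs and counts row-buffer conflicts for two access streams; the address mapping, the four-bank layout, and the example streams are assumptions for illustration and do not reflect our simulator's DRAM model.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy DRAM address decode: 64-byte column region, 2 bank bits, rest row.
    // These bit positions are illustrative assumptions only.
    struct Decoded { uint64_t bank; uint64_t row; };

    static Decoded decode(uint64_t addr) {
        return { (addr >> 6) & 0x3, addr >> 8 };
    }

    // Count accesses that hit a bank whose open row differs from the one requested.
    static int count_row_conflicts(const std::vector<uint64_t>& stream) {
        const uint64_t kNoRow = ~0ull;
        uint64_t open_row[4] = { kNoRow, kNoRow, kNoRow, kNoRow };
        int conflicts = 0;
        for (uint64_t a : stream) {
            Decoded d = decode(a);
            if (open_row[d.bank] != kNoRow && open_row[d.bank] != d.row) ++conflicts;
            open_row[d.bank] = d.row;
        }
        return conflicts;
    }

    int main() {
        // Sequential lines rotate through the banks before reusing one...
        std::vector<uint64_t> rotated   = { 0x000, 0x040, 0x080, 0x0c0 };
        // ...while a stream that keeps returning to one bank with new rows does not.
        std::vector<uint64_t> clustered = { 0x000, 0x100, 0x200, 0x300 };
        std::printf("rotated: %d conflicts, clustered: %d conflicts\n",
                    count_row_conflicts(rotated), count_row_conflicts(clustered));
        return 0;
    }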

Figure 5.5: Bus Utilization - Compute Workloads

Another interesting difference between graphics and compute workloads is the way the base-delta algorithm is applied. Recall from Chapter 2 that the base-delta algorithm searches for an efficient base value, as well as the best granularity for the element size, to perform delta comparisons and encode narrow values. In the previous sections, we observed that the information present in the graphics metadata, including the data format, the bytes-per-pixel value, and similar information, can be used to select an efficient element size (8/16/32 bytes) for the delta comparisons in a data block. In the case of the depth buffer, the Z-max and Z-min values provide an efficient base value to use for the compression operation. Compute workloads have no such hints, since compute metadata does not carry this type of information. This means the compressor needs to perform an additional search for an efficient base value, and must determine the granularity for compression on the fly, which adds latency to the compression operation.
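
The C++ sketch below illustrates this extra search: lacking format hints, a compressor tries several candidate element sizes before it finds one whose deltas are narrow. The 32-byte block, the candidate granularities, the first-element base, and the little-endian interpretation are illustrative assumptions, not the exact search used by base-delta-immediate [5].

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Returns true if every element in the block, read at the given granularity,
    // stays within max_delta of the first element (used here as the base).
    // Elements are interpreted little-endian for simplicity.
    static bool deltas_fit(const uint8_t* block, size_t bytes,
                           size_t elem_size, uint64_t max_delta) {
        uint64_t base = 0;
        std::memcpy(&base, block, elem_size);
        for (size_t off = elem_size; off + elem_size <= bytes; off += elem_size) {
            uint64_t v = 0;
            std::memcpy(&v, block + off, elem_size);
            uint64_t delta = (v > base) ? v - base : base - v;
            if (delta > max_delta) return false;
        }
        return true;
    }

    int main() {
        // A toy 32-byte block that is really an array of 32-bit counters.
        std::vector<uint32_t> counters = { 1000, 1003, 1007, 1001, 1010, 1004, 1002, 1008 };
        std::vector<uint8_t> block(counters.size() * sizeof(uint32_t));
        std::memcpy(block.data(), counters.data(), block.size());

        // Without format metadata, the compressor must try granularities until one works.
        const size_t candidates[] = { 8, 4, 2 };
        for (size_t elem : candidates) {
            if (deltas_fit(block.data(), block.size(), elem, 0xFF)) {
                std::printf("compressible with %zu-byte elements and 1-byte deltas\n", elem);
                return 0;
            }
        }
        std::printf("block left uncompressed\n");
        return 0;
    }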


5.5 Key Observations

Based on our analysis of the compression algorithms used in the GPU pipeline, it is clear that the majority of the algorithms work by encoding deltas between neighboring values, similar to the delta-based cache compression algorithms discussed earlier [5, 73]. However, the granularity chosen for the delta operation depends heavily on the memory layout of the data, on the format type being encoded, and on whether the compression operation is lossy.

Graphics-specific data formats are represented in memory in a unique way, and knowledge of this memory layout needs to be exploited for cache compression to be effective. For example, a data format such as R8G8B8A8 indicates that 32 bits of data is split into four channels, with 8 bits each for the red, green, blue and alpha channels. Let us take two 32-bit values, 0x1f0f0f0f and 0x1e0e0e0e, in this data format and attempt to compress them. A dictionary-based compression scheme, as discussed in Section 2.2, cannot compress the pair, since the high-order bytes of the two values differ. A format-unaware base-delta approach will store the delta between the two values as 0x01010101, which still requires at least 25 bits to store. Splitting along the channel boundaries instead yields four independent deltas, each of which fits in a single bit. The independent base values can be amortized over the entire data block being compressed, resulting in a much higher compression ratio. Similar conclusions can be drawn from the position and depth buffer values.

Texture swizzling and tiling [75, 76] is another technique used on modern GPUs to improve the cache performance associated with storing data structures for 2D/3D images. As 2D/3D structures are not accessed linearly in memory, the cache performance for such buffers may be sub-optimal, so GPUs perform address swizzling to improve cache behavior when such structures are encountered. When this type of swizzled or tiled data is stored in the cache, the cache compression algorithm must be aware of the data layout in order to achieve a high compression ratio.

Reflecting on the previous discussion of compression approaches, we can see that current GPU compression techniques use both lossy and lossless compression, along with specialized hardware that handles this data. GPUs are increasingly being used for general-purpose computing, so a well-designed compressed cache will need a compression algorithm that can handle the unique characteristics of graphics workloads along with general-purpose compute data. One way to approach this problem is with an automatic format detector that sits ahead of the compression engine and classifies data formats based on certain features in the data set. This would help predict the organization of the data in memory, which could be exploited to compress the data using one of the efficient compression algorithms discussed in Section 2.2. Based on the unique requirements of graphics workloads, the LLC may also need to support a precision modifier that decouples the 'lossy' step from the compression operation, with metadata indicating the loss in precision. Employing such a decoupling mechanism, the cache can compact the data based on the metadata while remaining completely unaware of whether the underlying algorithm is lossy or lossless. The other approach is for clients to provide format information, so the cache compressor can adjust the granularity used for compression and store metadata indicating the granularity and the compression output. This approach increases accuracy by removing the need to predict texture data formats, but potentially increases the internal bandwidth on the chip.
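
The worked example above is easy to verify in code: the C++ sketch below compares the format-unaware 32-bit delta of the two pixel values with their per-channel deltas.

    #include <cstdint>
    #include <cstdio>

    // Two R8G8B8A8 pixels from the example above; every 8-bit channel differs by 1.
    int main() {
        uint32_t a = 0x1f0f0f0f;
        uint32_t b = 0x1e0e0e0e;

        // Format-unaware delta over the whole 32-bit word: still a wide value.
        std::printf("32-bit delta: 0x%08x\n", a - b);            // 0x01010101

        // Channel-aware deltas: four tiny values that compress trivially.
        for (int c = 0; c < 4; ++c) {
            int da = (a >> (8 * c)) & 0xff;
            int db = (b >> (8 * c)) & 0xff;
            std::printf("channel %d delta: %d\n", c, da - db);   // 1 for every channel
        }
        return 0;
    }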

Chapter 6

Conclusion

In Chapter 2 we provided an overview of the state of the art in the field of on-chip data compression. We described in detail a number of hardware compression algorithms, as well as compressed cache designs. These algorithms and cache designs take advantage of compressed data to improve the effective cache size and the resulting performance. We also described the various challenges involved with compression, including data compaction, relocation, and partial-write handling of compressed blocks. In contrast to previous work, which focused on the benefits and challenges at a specific level of the memory hierarchy, we looked at the entire memory hierarchy and explored the potential impact of compressed data in the cache in terms of DRAM access efficiency. We also looked at state-of-the-art smart cache controllers that use intelligent learning algorithms based on perceptron learning [39]. Such perceptron-based predictors have been used to improve branch prediction [77, 35], coherence prediction [38], as well as block reuse in the cache [36]. We discussed possible applications that can exploit smart controllers as part of a compressed cache design.

In Chapter 4 we provided an overview of our dual-dictionary algorithm, which extends shared dictionary-based compressors, such as CPACK [16], to improve memory bandwidth in addition to the hit rate of the cache. We also showed results for our efficiency-aware compressed cache design, which uses a victim cache not only to increase the logical size of the cache, but also to evict data in a way that improves DRAM efficiency. In Section 4.3 we proposed a smart cache controller that addresses three key challenges with compressed caches on GPUs. We showed how our smart controller can improve memory efficiency through intelligent prefetching of compressed data, and reduce latency by caching recently decompressed data when reuse is predicted. We also demonstrated how a smart controller can take advantage of graphics-specific metadata to predict whether a data block is compressible, thus eliminating non-beneficial compressions in order to reduce the latency and power consumption of the compressor.

Our results were generated on a custom cache simulator that models novel compressed cache designs and compression algorithms. The simulator provides us with traditional cache performance metrics, such as hit rate, as well as DRAM access efficiency, which helps paint a complete picture of the impact of each compression approach on the efficiency of the memory system.

The state-of-the-art compressed cache research discussed in Chapter 2 was primarily carried out using compute workloads, and targeted CPU caches. In our work, we expanded this research to consider graphics workloads running on GPU caches. Our work highlighted some key challenges that were not considered in prior work, and in Chapter 4 we proposed solutions to help address some real-world challenges with data compression in GPU caches. We also approached data compression from a novel perspective in Chapter 5. We looked at prior work focused on compression in the graphics pipeline and observed key differences between the requirements of a cache compression algorithm and those of compression schemes targeting the graphics pipeline.

Our thesis lays a solid foundation for compressed cache designs in future GPUs, and identifies some key areas for further work. We suggest that future work will continue down two paths: 1) new cache compression algorithms, and 2) new cache compaction algorithms. As discussed in Chapter 5, the key difference between a generic compression algorithm and a graphics-specific algorithm is that graphics has knowledge of the data representation as stored in memory. A generic algorithm (e.g., the base-delta method [5]) tests different granularities within a data block to arrive at a good compression ratio. In contrast, graphics-specific metadata, such as the “RGBA” format representing four color channels, or the swizzle mode, provides a precise data layout that can be exploited for optimized data compression. An intelligent resource format predictor could use specific features in a generic data block to predict the resource metadata, saving a significant amount of on-chip bandwidth on the GPU. In addition to saving on-chip bandwidth, it could also improve compression ratios for compute data by avoiding an exhaustive search, or the use of fixed-size granularities, in the data compression algorithms.

Another important observation about the GPU compression pipeline, discussed in Chapter 5, is that a GPU might need to use both lossy and lossless compression algorithms to achieve improved performance, and it may not always be possible to unify graphics data compression under a single algorithm. A last-level cache cannot perform lossy compression by itself, as the cache clients expect their data to be returned unmodified from the cache. A solution to this problem is to use a combination of cache compression and client compression in the compressed cache design.

If we can unify the compression metadata for the different compression algorithms used by the cache and the clients, then each data block can be compressed by the client in a lossy way, or by the cache in a lossless way. The unified metadata can be used by the compressed cache to compact the data efficiently, irrespective of where the original compression occurred. This can also help improve internal bandwidth on the GPU for clients that consume significant bandwidth due to a large number of reads or writes.

Another promising path for future research is to expand the scope of the smart cache controller. In Section 4.3 we proposed using the perceptron learning method to handle compressed data in the cache efficiently. This controller could also be used for compression-aware cache request scheduling in the GPU. One fundamental way a GPU differs from a CPU is that the GPU has a number of fixed-function hardware blocks. As each fixed-function block has its own internal data path, with specific throughput rates and data bus widths, it may not always make sense to return cache requests in order. With client-side data compression, the compressed data transferred to a client may expand into more data than is needed to keep that client's data path fully utilized. Instead, we can reorder the requests issued by the various clients, with the goal of increasing the utilization of each client. If we can improve the compression ratio of the data returned, then we can improve overall GPU utilization and performance. The same principle can also be applied to the memory controller when it schedules cache misses from memory; the memory controller will need to keep memory efficiency in mind, as discussed in Section 2.4.

In summary, we have thoroughly evaluated past work pertaining to compressed caches. We have also explored the many challenges associated with designing compressed caches on GPUs. We have discussed new challenges not addressed in prior work, and evaluated various solutions to help address some of these challenges. We have also proposed a path forward for future research that can help unify graphics and compute compression for modern GPUs, leading to designs that can effectively support both types of workloads.

Bibliography

[1] J. Munkberg, P. Clarberg, J. Hasselgren, and T. Akenine-Möller, “High dynamic range texture compression for graphics hardware,” in ACM Transactions on Graphics (TOG), vol. 25, no. 3. ACM, 2006, pp. 698–706.

[2] “AMD Vega whitepaper,” https://radeon.com/downloads/vega-whitepaper-11.6.17.pdf, accessed: 2018-5-22.

[3] “Texture usage in modern games,” https://www.pcgamer.com/how-game-sizes-got-so-huge-and-why-theyll-get-even-bigger/, accessed: 2018-07-24.

[4] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.

[5] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: practical data compression for on-chip caches,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM, 2012, pp. 377–388.

[6] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011.

[7] “Delta color compression,” https://gpuopen.com/dcc-overview/, accessed: 2017-10-24.

[8] S. Mittal and J. S. Vetter, “A survey of architectural approaches for data compression in cache and main memory systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1524–1536, 2016.

[9] M. Ekman and P. Stenstrom, “A robust main-memory compression scheme,” in ACM SIGARCH Computer Architecture News, vol. 33, no. 2. IEEE Computer Society, 2005, pp. 74–85.


[10] J. Dusser, T. Piquet, and A. Seznec, “Zero-content augmented caches,” in Proceedings of the 23rd international conference on Supercomputing. ACM, 2009, pp. 46–55.

[11] J. Dusser and A. Seznec, “Decoupled zero-compressed memory,” in Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM, 2011, pp. 77–86.

[12] A. Seznec, “Decoupled sectored caches: conciliating low tag implementation cost,” in ACM SIGARCH Computer Architecture News, vol. 22, no. 2. IEEE Computer Society Press, 1994, pp. 384–393.

[13] L. Villa, M. Zhang, and K. Asanović, “Dynamic zero compression for cache energy reduction,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 214–220.

[14] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data cache design,” in ACM SIGOPS Operating Systems Review, vol. 34, no. 5. ACM, 2000, pp. 150–159.

[15] J. Yang, Y. Zhang, and R. Gupta, “Frequent value compression in data caches,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 258–265.

[16] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, “C-pack: A high-performance microprocessor cache compression algorithm,” IEEE transactions on very large scale integration (VLSI) systems, vol. 18, no. 8, pp. 1196–1208, 2010.

[17] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46, 2013, pp. 62–73.

[18] S. Sardashti, A. Seznec, and D. A. Wood, “Yet another compressed cache: A low-cost yet effective compressed cache,” ACM Trans. Archit. Code Optim., vol. 13, no. 3, pp. 27:1–27:25, Sep. 2016.

[19] M. Kjelso, M. Gooch, and S. Jones, “Design and performance of a main memory hardware data compressor,” in EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies., Proceedings of the 22nd EUROMICRO Conference. IEEE, 1996, pp. 423–430.


[20] J.-S. Lee, W.-K. Hong, and S.-D. Kim, “An on-chip cache compression technique to reduce decompression overhead and design complexity,” Journal of systems Architecture, vol. 46, no. 15, pp. 1365–1382, 2000.

[21] J. Yang, R. Gupta, and C. Zhang, “Frequent value encoding for low power data buses,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 9, no. 3, pp. 354–384, 2004.

[22] J. L. Núñez and S. Jones, “Gbit/s lossless data compression hardware,” IEEE Transactions on very large scale integration (VLSI) systems, vol. 11, no. 3, pp. 499–510, 2003.

[23] J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei, “A locally adaptive data compression scheme,” Communications of the ACM, vol. 29, no. 4, pp. 320–330, 1986.

[24] C.-L. Yu and J.-L. Wu, “Hierarchical dictionary model and dictionary management policies for data compression,” Signal processing, vol. 69, no. 2, pp. 149–155, 1999.

[25] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression: A significance-based compression scheme for l2 caches,” Dept. Comp. Scie., Univ. Wisconsin-Madison, Tech. Rep, vol. 1500, 2004.

[26] M. Farrens and A. Park, “Dynamic base register caching: A technique for reducing address bus width,” in ACM SIGARCH Computer Architecture News, vol. 19, no. 3. ACM, 1991, pp. 128–137.

[27] D. Citron and L. Rudolph, “Creating a wider bus using caching techniques,” in High-Performance Computer Architecture, 1995. Proceedings., First IEEE Symposium on. IEEE, 1995, pp. 90–99.

[28] S. Sardashti, A. Seznec, and D. A. Wood, “Skewed compressed caches,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014, pp. 331–342.

[29] A. R. Alameldeen and D. A. Wood, “Adaptive cache compression for high-performance processors,” in Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on. IEEE, 2004, pp. 212–223.

[30] N. Kim, T. Austin, and T. Mudge, “Low-energy data cache using sign compression and cache line bisection,” in Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI02). Citeseer, 2002.


[31] S. Kim, J. Lee, J. Kim, and S. Hong, “Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2011, pp. 420–429.

[32] J. Yang and R. Gupta, “Energy efficient frequent value data cache design,” in Microarchitecture, 2002.(MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on. IEEE, 2002, pp. 197–207.

[33] E. G. Hallnor and S. K. Reinhardt, “A compressed memory hierarchy using an indirect index cache,” in Proceedings of the 3rd Workshop on Memory Performance Issues: in conjunction with the 31st International Symposium on Computer Architecture. ACM, 2004, pp. 9–15.

[34] A. Seznec and F. Bodin, “Skewed-associative caches,” in PARLE’93 Parallel Architectures and Languages Europe. Springer, 1993, pp. 305–316.

[35] D. A. Jiménez and C. Lin, “Dynamic branch prediction with perceptrons,” in High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on. IEEE, 2001, pp. 197–206.

[36] E. Teran, Z. Wang, and D. A. Jiménez, “Perceptron learning for reuse prediction,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016, pp. 1–12.

[37] H. Wang and Z. Luo, “Data cache prefetching with perceptron learning,” arXiv preprint arXiv:1712.00905, 2017.

[38] D. Ghosh, J. B. Carter, and H. Daumé III, “Perceptron-based coherence predictors,” 2008.

[39] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958.

[40] A. Seznec, “The o-gehl branch predictor,” The 1st JILP Championship Branch Prediction Competition (CBP-1), 2004.

[41] M. Clark, “A new x86 core architecture for the next generation of computing,” in Hot Chips 28 Symposium (HCS), 2016 IEEE. IEEE, 2016, pp. 1–19.


[42] L. A. Barroso, J. Clidaras, and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” Synthesis lectures on computer architecture, vol. 8, no. 3, pp. 1–154, 2013.

[43] M. Mantor, “AMD Radeon HD 7970 with graphics core next (GCN) architecture,” in Hot Chips 24 Symposium (HCS), 2012 IEEE. IEEE, 2012, pp. 1–35.

[44] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 32–41.

[45] J. Edler, “Dinero IV trace-driven uniprocessor cache simulator,” http://www.cs.wisc.edu/~markhill/DineroIV/, 1998.

[46] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[47] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2sim: a simulation framework for cpu-gpu computing,” in Parallel Architectures and Compilation Techniques (PACT), 2012 21st International Conference on. IEEE, 2012, pp. 335–344.

[48] “Call of Duty: Black Ops,” https://www.callofduty.com/blackops, accessed: 2017-12-9.

[49] “Assassin's Creed 3,” https://www.ubisoft.com/en-us/game/assassins-creed-3/, accessed: 2017-12-9.

[50] “Civilization 5,” https://civilization.com/civilization-5, accessed: 2017-12-9.

[51] “3DMark suite,” https://www.futuremark.com/benchmarks/3dmark, accessed: 2017-12-9.

[52] “AMD SDK,” https://developer.amd.com/tools-and-sdks, accessed: 2018-08-07.

[53] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. McCardwell, A. Villegas, and D. Kaeli, “Hetero-mark, a benchmark suite for cpu-gpu collaborative computing,” in 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.

[54] G. Keramidas, K. Aisopos, and S. Kaxiras, “Dynamic dictionary-based data compression for level-1 caches,” in International Conference on Architecture of Computing Systems. Springer, 2006, pp. 114–129.


[55] A. C. Frery and T. Perciano, “Image data formats and color representation,” in Introduction to Image Processing Using R. Springer, 2013, pp. 21–29.

[56] A. R. Alameldeen and D. A. Wood, “Interactions between compression and prefetching in chip multiprocessors,” in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 228–239.

[57] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared dram systems,” in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 63–74.

[58] J. F. Martinez and E. Ipek, “Dynamic multicore resource management: A machine learning approach,” IEEE micro, vol. 29, no. 5, 2009.

[59] W.-D. Weber, “Method and apparatus for scheduling of requests to dynamic random access memory device,” Nov. 1, 2005, US Patent 6,961,834.

[60] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A tool to model large caches,” HP laboratories, pp. 22–31, 2009.

[61] NVIDIA, “GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made,” White paper, NVIDIA Corporation, 2014.

[62] M. Deering, “Geometry compression,” in Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. ACM, 1995, pp. 13–20.

[63] M. M. Chow, “Optimized geometry compression for real-time rendering,” in Visualization’97., Proceedings. IEEE, 1997, pp. 347–354.

[64] H. Hoppe, “Optimization of mesh locality for transparent vertex caching,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1999, pp. 269–276.

[65] D. Nehab, J. Barczak, and P. V. Sander, “Triangle order optimization for graphics hardware computation culling,” in Proceedings of the 2006 symposium on Interactive 3D graphics and games. ACM, 2006, pp. 207–211.


[66] E. Tarsa, “Texture map example,” https://sites.google.com/site/dlckcadtest2/tutorials/3d-programs/texture-and-bump-maps?tmpl=%2Fsystem%2Fapp%2Ftemplates%2Fprint%2F&showPrintDialog=1, accessed: 2018-07-24.

[67] J. Ström and M. Pettersson, “ETC2: texture compression using invalid combinations,” in Graphics Hardware, 2007, pp. 49–54.

[68] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson, “Adaptive scalable texture compression,” in Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics. Eurographics Association, 2012, pp. 105–114.

[69] K. I. Iourcha, K. S. Nayak, and Z. Hong, “System and method for fixed-rate block-based image compression with inferred pixel values,” Sep. 21, 1999, US Patent 5,956,431.

[70] A. C. Beers, M. Agrawala, and N. Chaddha, “Rendering from compressed textures,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 373–378.

[71] S. Morein and M. T. Wright, “Method and apparatus for controlling compressed z information in a video graphics system that supports anti-aliasing,” Jun. 18, 2002, US Patent 6,407,741.

[72] S. Morein et al., “Ati radeon technology,” in Graphics Hardware, 2000.

[73] J. E. DeRoo, S. Morein, B. Favela, and M. T. Wright, “Method and apparatus for compressing parameter values for pixels in a display frame,” Nov. 5, 2002, US Patent 6,476,811.

[74] S. L. Morein and M. A. Natale, “System, method, and apparatus for compression of video data using offset values,” Jul. 13, 2004, US Patent 6,762,758.

[75] M. Mantor, J. A. Carey, R. C. Taylor, T. A. Piazza, J. D. Potter, and A. E. Socarras, “3-d rendering texture caching scheme,” May 23, 2006, US Patent 7,050,063.

[76] D. P. Wilde, “Apparatus for dynamic xy tiled texture caching,” Oct. 27, 1998, US Patent 5,828,382.

[77] J. Karlin, D. Stefanovic, and S. Forrest, “The triton branch predictor,” 2004.
