EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

A Dissertation Presented by

Akshay Lahiry

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University Boston, Massachusetts

August 2018

Contents

List of Figures

List of Tables

Abstract of the Dissertation

1 Introduction
  1.1 Compressed Cache Architecture
  1.2 Compression Algorithms

2 Background
  2.1 GPU Memory Hierarchy
  2.2 Compression Algorithms
  2.3 Compressed Cache Architecture
      2.3.1 Smart Caches
  2.4 DRAM Efficiency

3 Framework and Metrics

4 Results
  4.1 Dual Dictionary Compressor
      4.1.1 Cache Architecture
      4.1.2 Dictionary Structure
      4.1.3 Dictionary Index Swap
      4.1.4 Dictionary Replacement
      4.1.5 DDC Performance
      4.1.6 Summary
  4.2 Compression Aware Victim Cache
      4.2.1 Design Challenges
      4.2.2 Results
      4.2.3 Summary
  4.3 Smart Cache Controller
      4.3.1 Smart Compression
      4.3.2 Smart Decompression
      4.3.3 Smart Prefetch
      4.3.4 Summary

5 Compression on Graphics Hardware
  5.1 Geometry Compression
  5.2 Texture Compression
  5.3 Depth and Color Compression
  5.4 Compute Workloads
  5.5 Key Observations

6 Conclusion

Bibliography

List of Figures

1.1 Total number of pixels for common display resolutions
1.2 Total bytes per frame for common display resolutions
1.3 Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011
2.1 Block diagram showing the ratio of ALU to cache hardware
2.2 Block diagram showing cache hierarchy in a CPU and GPU
2.3 Compression example using Frequent Value Compression
2.4 Compression example using CPack
2.5 Fragmentation example for a fixed compaction scheme
2.6 Fragmentation with less restriction on compressed block placement
2.7 Compaction with subblocks in the data array
2.8 Compaction with decoupled subblocks and super tags
2.9 Bytemasked writes with uncompressed data
2.10 Bytemasked writes with compressed data
2.11 Feature table for a hashed perceptron prediction
2.12 Example DRAM address mapping
2.13 Efficient DRAM access pattern
2.14 Inefficient DRAM access pattern
3.1 Block diagram of the simulation framework
3.2 Sample simulation output for the IceStorm benchmark
4.1 A cache block
4.2 Block diagram of the LLC with our dual dictionary
4.3 Output from the Dictionary Index Swap Unit
4.4 The write BW savings
4.5 The read BW savings
4.6 Block diagram of the LLC with victim cache
4.7 Super tag structure for the victim cache
4.8 Byte-masked write with uncompressed data
4.9 Byte-masked write in a compressed cache
4.10 Burst efficiency with EBU on and off
4.11 Row buffer hit rate with Victim EBU on and off
4.12 Conventional compressed cache
4.13 Smart compressed cache
4.14 Results for smart data compression. For each workload, we show the oracle performance as our baseline. We also plot the wasteful compressions that were identified using our model as a percentage of the oracle, as well as the percent of false positives as a percentage of total predictions
4.15 Results for smart data decompression. For each workload, we show the oracle performance as our baseline. We also plot the repeated decompressions that were eliminated using our smart decompressor as a percentage of the oracle
4.16 Results for smart prefetching. For each workload we show: the efficiency with compression turned off (the baseline), the memory efficiency drop with compression turned on, and the efficiency improvements when using smart prefetching
4.17 Area impact of feature tables
5.1 A simplified Direct3D pipeline
5.2 Texture map example by Elise Tarsa [66]
5.3 Bandwidth savings - AES benchmark
5.4 Bandwidth savings - Compute workloads
5.5 Bus utilization - Compute workloads

List of Tables

2.1 CPack code table
2.2 Frequent Pattern Encoding table as seen in [25]
2.3 Comparison of popular hardware compression algorithms
3.1 Graphics workloads
3.2 Compute workloads
4.1 Updated Pattern Table
4.2 The compressed word distribution

Abstract of the Dissertation

EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

by Akshay Lahiry

Doctor of Philosophy in Computer Engineering
Northeastern University, August 2018
Dr. David Kaeli, Advisor

As game developers push the limits of graphics processors (GPUs) in their quest to achieve photorealism, modern games are becoming increasingly memory bound. At the same time, display vendors are pushing the boundaries of display technology with ultra-high resolution displays and high dynamic range (HDR). This means the GPU not only needs to render more pixels, but also needs to process significantly more data per pixel for high-quality rendering. The advent of mainstream virtual reality (VR) also raises the minimum frame rate for these displays, which puts a lot of pressure on the GPU memory system. GPUs have also evolved to be used as accelerators in high performance computing systems. Given their data-parallel throughput, many compute-intensive applications have benefited from GPU acceleration. For both classes of workloads, increasing the cache size helps alleviate some of the memory pressure by caching frequently used data on-chip. However, die area on modern chips comes at a premium. Data compression is one approach to manage the data footprint problem. Data compression in the last level cache (LLC) can help achieve the performance of a much larger cache while utilizing significantly less die area. In this thesis, we address the challenges of using on-chip data compression and explore novel methods to arrive at performant solutions for both graphics and compute. We also highlight some unique compression requirements for graphics workloads and how they contrast with prior cache compression algorithms.

Chapter 1

Introduction

Over the past few years, the number of pixels in a video display has increased exponentially. The graphics display industry has quickly moved from resolutions as low as 480p to stunning 8K displays. Figure 1.1 shows some common display resolutions, ranging from half a million pixels for older SVGA displays to thirty-three million pixels for current state-of-the-art 8K displays.

Figure 1.1: Total number of pixels for common display resolutions.


Rendering a growing number of pixels significantly increases the amount of data associated with each frame. Figure 1.2 shows how the total bytes per frame scales, from a few kilobytes for low resolution displays to thirty megabytes for a modern high resolution display. With the introduction of High Dynamic Range (HDR) displays, the amount of data associated with each pixel has also increased, as more bits are used to represent each channel of the pixel [1, 2]. As these ultra-high resolution and high dynamic range displays become ubiquitous, the data footprint of modern games will increase rapidly.

Figure 1.2: Total bytes per frame for common display resolutions.

Figure 1.3 shows two frames from the game Halo: Combat Evolved. The frame on the left is from the original 2001 release, and the one on the right is the remastered frame from the game's 2011 anniversary update. The images show a significant improvement in the level of detail per frame. Game developers push modern Graphics Processors (GPUs) to their performance limits in their quest for photorealism. As virtual reality (VR) headsets become more mainstream, the quest for realism goes even further, to enable the illusion of reality in the virtual world. High end VR headsets have two high resolution screens that independently render immersive content. This puts a lot of pressure on the GPU to render high resolution images in real time. A single dropped frame due to memory latency can ruin the immersive experience.


Figure 1.3: Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011.

The trends described above, and the exponential increase in the number of bytes per frame, have significantly increased the pressure on the GPU's cache hierarchy and memory. As access to memory is extremely slow, GPUs have been increasing the size of on-chip caches to avoid costly cache misses and improve performance. This results in increased power consumption and on-chip area, which is not ideal. Modern manufacturing nodes have significant yield issues with larger chips, which become increasingly expensive to manufacture. As the area on the chip comes at a premium, over-provisioning caches will have an adverse impact. Every millimeter of chip area is valuable, and graphics chip companies make every effort to improve performance per mm². Having more data in flight also increases bandwidth contention on a GPU, which can increase the cost of a cache miss. The on-chip area must be used more efficiently to handle the memory footprint issue. Game developers acknowledge this issue and expect this trend to continue [3] in high quality games. Data compression is an effective way to improve cache capacity and reduce bandwidth contention.

Compression in the Last-Level Cache (LLC) can increase the logical cache size without significantly increasing the hardware area. This can lead to better cache performance due to a smaller data footprint, as well as a power reduction due to higher cache hit rates and fewer main memory transactions. We focus on compression in the last level cache because the miss penalty grows dramatically beyond this level of the memory hierarchy. The downside of compression and decompression is higher access latency, which must be offset by higher hit rates. If the compressed block and the compression metadata can be stored off-chip efficiently, then traffic on the memory bus is also significantly reduced, improving bandwidth utilization. Data compression in the memory hierarchy can therefore help cache performance in modern memory-bound applications.

While the benefits of compression in the LLC are significant, new design challenges arise in managing compressed data in hardware. Compression algorithms frequently produce variable sized data. Storing variable sized data efficiently in a conventional cache is difficult due to internal fragmentation of the data lines, so efficient data compaction techniques are needed to utilize all of the on-chip area. Compression/decompression operations, as well as the compaction algorithm, can add latency to the critical path and negate the performance gains provided by the logical increase in cache capacity. Partial writes to a compressed cache block can change the size of the compressed data and trigger an expensive recompaction operation on the data stored in the cache. Another issue with variable sized compressed data in the cache is that it changes the data access pattern from the LLC to the DRAM page. When accessing compressed data, each cache line no longer contains the same amount of data, so data evicted from the LLC will have variable burst sizes to the DRAM pages. This results in inefficiencies in terms of performance and power consumption.

The choice of compression algorithm also impacts the hardware design. Most compression algorithms aim to exploit redundancy in data streams. The granularity at which compression is performed affects the decompression latency. Decompression latency is usually on the critical path of a memory access, so high latency here nullifies any gains from compression. If the compression ratio is not high enough, then data compression might not provide any significant hit rate improvement or memory bandwidth reduction.


1.1 Compressed Cache Architecture

Traditional caches store uncompressed data in each cache block. Since the data stored per cache block is fixed, the data lines can be utilized completely, and the effective cache capacity is typically close to the physical size of the cache. Compressed caches store compressed data, which is subject to variable compression ratios. This means the data size can change from block to block and may not fully utilize a cache block. If the data could be compacted efficiently in the cache, the effective cache capacity would match the compression ratio and significantly increase the effective cache size. In practice this rarely happens, due to internal fragmentation that reduces data line utilization. Internal fragmentation occurs because compressed data is variable in size; variable sized data cannot be packed efficiently in the data lines, leaving fragmented free space in the cache blocks. A compaction algorithm is typically used to pack different compressed blocks into the data lines. This algorithm needs to balance maximizing data line utilization against access latency. A complex compaction algorithm can also increase power consumption and area due to increased hardware complexity.

Writes to compressed blocks may change the size of the compressed data, which would require relocation of compacted data in the cache block. The compressed data blocks must be rearranged to make room for the updated compressed block. This process is known as recompaction and is extremely costly in terms of latency. Real-world applications issue bytemasked writes that only partially overwrite a cache block. These partial writes can be handled easily in a conventional cache by merging the dirty bytes based on the bytemask. However, for a partial write to a compressed block, we would need to decompress the block before merging the data and then recompress the new data block. This turns a simple merge operation into a read-decompress-modify-compress-write operation, which can add significant latency to the critical path of a cache write, especially if a read operation immediately follows a bytemasked write.

The replacement policy also has a large impact on system performance for compressed caches. Eviction of variable sized data lines can result in inefficient bursts to the DRAM. This results in poor data bus utilization, further increasing the cost of a cache miss. Smarter replacement policies are required to achieve a better balance between logical expansion, access latency, and efficient bursting.


1.2 Compression Algorithms

Compression algorithms range from software-based schemes, such as the Lempel-Ziv algorithm [4], to schemes designed for hardware implementation, such as the base-delta-immediate scheme [5]. At a fundamental level, compression schemes exploit various redundancies in a data stream to reduce the memory footprint of an application. For example, data structures are frequently initialized with zeroes in many applications. This creates data streams that can be compressed easily by handling zero data as a special case. Pattern matching is another approach, used in dictionary-based compression schemes. The compression dictionary stores a list of frequently used words, and the data stream is encoded based on the entries in the dictionary. The encoding can be done based on complete or partial matches with the dictionary entries. Dictionary entries can be static, meaning they are populated before an application starts running, or dynamic, where the entries are updated at runtime to improve compression. Narrow bit-width data is another candidate for compression. Software typically over-provisions the bit-width for the worst case maximum size of a variable. Counters are a good example, where the data width always matches the maximum range of the counter. In most cases, these variables only use a fraction of their available bit-width. Since it is extremely cumbersome to optimize bit-widths in software, these narrow cases can be handled in hardware to reduce the size of the data stream. Repeated values can also be compressed by storing them as a single value, along with a count that captures the number of repetitions.

As we can see, there are many ways to compress data. However, hardware-based compression schemes have to meet certain conditions to be feasible. On-chip compression needs to have low decompression latency, as the decompress operation usually happens on the critical path. The compression ratio needs to be high enough that the benefit of the increased cache capacity, plus the bandwidth savings, can compensate for the increase in latency. Hardware complexity is another concern, as the area occupied by the compressor and decompressor needs to be small. The storage of compression metadata is another issue with on-chip compression techniques. If a compressed line is evicted from the cache, the metadata should be small enough to store in memory without increasing the memory bandwidth. If the metadata is shared, as in the case of dictionary-based compression, then the data would have to be decompressed before the block is evicted from the LLC. This prevents us from realizing the full benefits of compression and increases power consumption.
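To ground the simplest of these ideas, the short sketch below shows how a run of repeated (or zero) values can be collapsed into (value, count) pairs. It is purely illustrative and does not correspond to any of the hardware schemes discussed later in this thesis.

```python
def rle_encode(words):
    """Collapse runs of identical words into (value, run_length) pairs."""
    runs = []
    for w in words:
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([w, 1])       # start a new run
    return runs


stream = [0, 0, 0, 0, 7, 7, 42]       # e.g. zero-initialized data plus a few values
print(rle_encode(stream))             # [[0, 4], [7, 2], [42, 1]]
```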

The rest of this thesis is organized as follows. In Chapter 2, we discuss the challenges posed by compressed data in the memory hierarchy, and cover related work on compressed caches, DRAM efficiency, on-chip compression algorithms, and smart cache controllers. In Chapter 3, we discuss the metrics used to evaluate our proposals, and review the simulation framework and workloads used in our experiments. Chapter 4 describes our solutions to some of the key challenges in compressed cache design and evaluates them using our simulation framework. In Chapter 5, we discuss the unique properties of compression in the graphics pipeline and highlight the key differences when compared to cache compression algorithms. Finally, in Chapter 6, we summarize the contributions made in this thesis and chart directions for future research in the area of compressed cache design for GPUs.

Chapter 2

Background

In this chapter, we provide background on how caches can be used to significantly improve processor performance. We then discuss the role of the cache in the memory hierarchy of CPUs and GPUs, and shed some light on the key differences between the CPU and GPU memory streams. In this discussion, we focus primarily on graphics workloads for the GPU. We then explain how compression can further improve the performance of a computer system, and provide a thorough discussion of state-of-the-art compression algorithms and compressed cache designs. Finally, we discuss the intricacies of DRAM timing and describe how compressed data can potentially hurt performance if the cache-memory interaction is not handled efficiently.

2.1 GPU Memory Hierarchy

A cache is a fast on-chip memory that is used to store frequently accessed data. Most caches take advantage of locality in the address stream to improve performance. There are two main types of locality: spatial and temporal. Many data structures possess spatial locality in their access streams, commonly seen in vector and matrix operations. Temporal locality means that the same address is likely to be accessed again in the near future. The instruction stream typically exhibits both: instructions are fetched sequentially (spatial locality), and instructions in the body of a loop are accessed repeatedly (temporal locality) [6].

As on-chip area is limited, caches are typically small. Caches tend to be expensive to implement, since they use SRAM technology instead of cheaper DRAM technology. Caches help reduce the number of accesses to slow, off-chip memory, and can improve performance significantly since they are much closer to the processor and can return data much faster.


A processor typically has multiple levels of cache, including a large shared LLC that holds both instructions and data. The exact organization of these caches depends on the type of processor and the available on-chip area. We will see how cache organization differs between a CPU and a GPU. The GPU memory hierarchy is considerably more complex than that of a CPU. The fundamental difference between a CPU and a GPU is that CPUs are designed for low latency, while GPUs are designed for high throughput through parallel execution. GPUs have fixed function hardware along with programmable support, whereas a CPU focuses more on general purpose algorithms.

Figure 2.1: Block diagram showing the ratio of ALU to cache hardware.

As we can see in Figure 2.1, a majority of the die area on a CPU is dedicated to cache memory, as CPUs work best when most of the program data is resident in the cache. GPUs have a complex memory system that consists of registers, shared memory, constant memory, texture memory, local memory, and global memory. The shared, constant, and texture memories are unique to the GPU and do not exist in a CPU. A GPU has significantly higher compute performance spread over massively parallel cores, but has smaller caches relative to the number of Arithmetic Logic Units (ALUs) on-chip. It is difficult and expensive to provision a cache system large enough to store all the data needed by the GPU. This is why GPUs designed for graphics processing have their own DRAM connected to the LLC. These DRAM chips are typically referred to as Video RAM (VRAM) and have significantly wider memory buses than a CPU, as shown in Figure 2.2. These wide buses help move data quickly between memory and the LLC.


Figure 2.2: Block diagram showing cache hierarchy in a CPU and GPU


Since it is extremely difficult to store an entire application's data set in the GPU caches, the throughput of the memory system becomes important in determining the overall performance of the graphics processor. This is why GPU manufacturers such as AMD have taken new approaches, such as hardware data compression [7], to improve bandwidth.

2.2 Compression Algorithms

In this section, we review previous work on data compression and discuss the state of the art in hardware compression schemes. Compression can broadly be classified into lossy and lossless compression. Lossy compression is useful when a small loss of information makes no material difference to the consumer of the data stream. Many audio and video compression schemes are lossy, as the quality remains good enough despite the loss of some unessential bits. JPEG and MPEG compression are examples of lossy compression. This reduction in data fidelity is typically irreversible. For data compression on GPUs, our focus is on lossless compression, because a loss in fidelity is not tolerable for general purpose computing or graphics workloads.

Lossless data compression algorithms exploit redundancy in the data stream to reduce the memory footprint. As documented by Mittal et al. [8] in their survey of compression methods, redundancy can occur for many reasons, such as zero initialization of data structures, copy operations, and constants in a program. Narrow-width values present another opportunity for compression, as they are typically stored in over-provisioned data types. Data with low dynamic range can be encoded as a base value plus an array of differences to reduce its size. While each of the above features can be exploited for data compression, many trade-offs need to be evaluated to find the optimal method for a specific design.

For compression in an LLC, latency and hardware complexity are critical factors that limit the compression ratios that can be achieved. The decompression operation is more important than compression, as it is frequently performed on the critical path of a cache access. This makes decompression latency a critical factor in the selection of an algorithm for on-chip data compression. Complex compression hardware can increase power consumption and occupy a larger percentage of the on-chip area, which can offset any memory and cache performance gains due to compression.

Data granularity can have an impact on compression ratios as well. Larger blocks of data tend to compress better, as more redundancy can be exploited. However, any access to a small part of the data block requires the whole block to be decompressed. Compressing smaller blocks

generates more metadata and does not compress as well. A balance must be achieved for efficient compression. Dictionary-based compression schemes work at a word granularity. This prevents parallel decompression, as each word needs to be decompressed to find the starting location of the next compressed word. Certain features of the data stream can also provide a good indicator of compressibility. For example, encrypted data is random by nature and has low redundancy. In such a case, it makes sense to avoid wasting resources on compression, since the return is negligible. Next, we discuss some of the state-of-the-art compression schemes proposed in the literature.

Ekman and Stenstrom [9] observed that real world applications contain a significant number of null data blocks. This is because, for most applications, data structures are initialized to zero and temporary variables are mostly cleared with zeroes. Dusser et al. [10] note that storing such null blocks is a waste of cache resources, and a better approach can be used to handle these cases efficiently. They propose compressing these null blocks by creating a separate zero cache to augment the main cache. The zero cache stores the address and valid bits for each null block. It does not contain any data blocks, as the data is known to be zero. This simple scheme has very low decompression latency and fast lookup times, as the main cache and the zero cache can be searched in parallel for any address. One possible drawback is that, for workloads with a low percentage of null data, the parallel lookup increases power consumption for no real benefit. The authors propose a solution: monitor the compression ratio of the blocks and turn off the zero cache when the ratio drops below a specific threshold. Another key observation by the authors is that null data exhibits good spatial locality. Hence, the null blocks come in sizes that are much larger than a single cache line. This is exploited by using super tags in the zero cache, a feature that addresses a lot more memory for a very low area cost. The authors take this idea further by combining the zero cache with null compression in memory [11]. They propose organizing main memory as a decoupled sectored cache [12] with cache-line-sized subblocks. The memory controller has a cache to translate compressed addresses into uncompressed sectors in main memory. Any null block is represented by a null bit in the cache, similar to the zero cache. Any read or write to a null block can then be handled by the cache, instead of requiring a costly access to main memory. This reduces bandwidth utilization and the average memory access time.

Villa et al. [13] propose another novel method to compress zero data. They propose adding a zero indicator bit (ZIB) for every byte in the cache line. By setting a single bit for every zero

byte, they compress the zeroes and save a significant amount of energy. The energy savings are realized by disabling the bitline discharge in the cache for every byte that has the ZIB set. They observe significant energy savings of 26% for data caches and 18% for instruction caches using this method. The limitation of zero compression schemes is that they do not work with any other data patterns; if a data stream has little null data, then there is little benefit from this compression scheme. This is why compression schemes that handle more data patterns, while treating zero compression as a special case, can be a better choice for hardware data compression.
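A minimal sketch of the per-byte zero-indicator idea (illustrative only, not the circuit-level mechanism described in [13]): a cache line is stored as a one-bit-per-byte zero mask plus only the non-zero bytes.

```python
def zib_compress(line: bytes):
    """Collapse the zero bytes of a cache line into a per-byte indicator mask.

    Returns (zero_mask, nonzero_bytes): zero_mask has one bit per byte
    (1 = byte is zero), and nonzero_bytes holds only the non-zero bytes,
    in order.
    """
    zero_mask = 0
    nonzero = bytearray()
    for i, b in enumerate(line):
        if b == 0:
            zero_mask |= 1 << i          # mark this byte as zero
        else:
            nonzero.append(b)            # keep only the non-zero payload
    return zero_mask, bytes(nonzero)


def zib_decompress(zero_mask: int, nonzero: bytes, line_size: int = 64) -> bytes:
    """Rebuild the original line from the mask and the non-zero payload."""
    out = bytearray()
    it = iter(nonzero)
    for i in range(line_size):
        out.append(0 if (zero_mask >> i) & 1 else next(it))
    return bytes(out)


if __name__ == "__main__":
    line = bytes([0] * 48 + list(range(1, 17)))   # a mostly-zero 64-byte line
    mask, payload = zib_compress(line)
    assert zib_decompress(mask, payload) == line
    print(f"64-byte line stored as 8-byte mask + {len(payload)} payload bytes")
```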

Figure 2.3: Compression example using Frequent Value Compression.

Yang et al. [14] observed that applications tend to exhibit frequent value locality. This means that a few values appear much more frequently than others during execution and can occupy almost 50% of the cache area. They propose encoding these frequent values to compress the data stream. They use this approach to create a small frequent value victim cache to improve performance [14], and a compressed cache that increases the effective cache size [15] and reduces memory bandwidth. Figure 2.3 shows an example of data compression using the frequent value compression (FVC) scheme. Four 32-bit frequent values are stored in a table and assigned 3-bit codes to represent them. The 3-bit value '111' is used to mark an infrequent value. The 32-bit values in the uncompressed stream are replaced with the 3-bit codes from the frequent value table, and any unmatched word is stored as-is after its code. This reduces the 128-bit uncompressed stream to just 20 bits. The drawback of this scheme is that the application needs to be profiled to create the list of frequent values. FVC is essentially a static dictionary, where data is encoded using a table of frequent values.
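As a rough illustration of this static-dictionary encoding (a sketch only; the table contents and stream below are made up, not taken from [14]):

```python
# Hypothetical 4-entry frequent value table; code 0b111 marks an unmatched word.
FREQUENT_VALUES = {0x00000000: 0b000, 0xFFFFFFFF: 0b001,
                   0x00000001: 0b010, 0xDEADBEEF: 0b011}
UNMATCHED = 0b111

def fvc_encode(words):
    """Encode 32-bit words as (3-bit code, optional 32-bit literal) pairs."""
    out = []
    for w in words:
        code = FREQUENT_VALUES.get(w, UNMATCHED)
        out.append((code, None if code != UNMATCHED else w))
    return out

def fvc_size_bits(encoded):
    """Compressed size: 3 bits per code plus 32 bits per unmatched literal."""
    return sum(3 + (32 if literal is not None else 0) for _, literal in encoded)

stream = [0x00000000, 0xDEADBEEF, 0x12345678, 0xFFFFFFFF]
enc = fvc_encode(stream)
print(fvc_size_bits(enc), "bits instead of", 32 * len(stream))   # 44 vs 128
```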

Dictionary-based compression schemes take advantage of repeated patterns and encode the data with a dictionary index that matches a specific pattern. Previous research in the field of dictionary-based compression algorithms has primarily focused on three independent goals: i) cache expansion [16, 17, 18, 13], ii) memory expansion [19, 20], and iii) bus compression [21].

Memory expansion techniques, such as XMatch and X-RL [22], use dictionary-based pattern matching to expand the logical memory size. The dictionary is created during compression using a move-to-front strategy, as described by Bentley et al. in their locally adaptive compression scheme [23]. In this method, data is compressed one tuple at a time, where a tuple is the chosen data granularity for compression. The dictionary starts empty for each compression operation, and both the compressor and decompressor use the same move-to-front algorithm to build the dictionary on the fly. This method takes advantage of the spatial locality within the data set to update dictionary entries. Each tuple is compared with the current dictionary entries for a complete or partial match, and the matched characters, match type, and locations are recorded. The output always starts with one bit, with '1' indicating no compression and '0' indicating a match. If no match occurs, the tuple is inserted into the dictionary and the last entry is discarded. Decompression follows the same steps, decompressing one word at a time and recreating the exact same dictionary during the decompression operation. The main issue with this method is the variable length of the match location and match type fields. This forces the decompressor to decompress the first compressed word before knowing the location of the next, which adds significant latency to the critical path of a memory access. The X-RL algorithm [19] follows the same general principle as XMatch, but handles a sequence of zeroes as a special case by using run-length encoding.

The selective compressed memory system (SCMS) proposed by Lee et al. [20] uses the XMatch-based dictionary compression scheme to compress data throughout the memory hierarchy. A key difference is the selective nature of the compressor. A compression threshold is selected for the cache, and a block is stored in compressed form only if its compression ratio is below the threshold. This way, the compressed block can share a cache line with other compressed blocks. The work also uses a fixed output size for the compressor to make it easier to store the compressed blocks in the cache and in memory. The disadvantage of assuming a fixed output size is that the compression ratio can suffer.

Cache expansion techniques, such as CPack [16], have focused on methods that increase the logical size of the cache without significantly increasing its area. For CPack, a two-level code table is used, as shown in Table 2.1. In the table, a 'z' represents a zero byte, an 'm' represents a byte that matches a dictionary entry, and an 'x' represents an unknown byte. The 'bbbb' pattern denotes the 4-bit dictionary index, and 'B' represents an unmatched byte in the output. As shown in Figure 2.4, each word in the data stream is compared with the dictionary entries to find a complete or partial match.


Code   Pattern   Output          Length (bits)
00     zzzz      (00)            2
01     xxxx      (01)BBBB        34
10     mmmm      (10)bbbb        6
1100   mmxx      (1100)bbbbBB    24
1101   zzzx      (1101)B         12
1110   mmmx      (1110)bbbbB     16

Table 2.1: CPack Code table

The code is then combined with the unmatched bytes to form the compressed word. If no match occurs, an unmatched code is added to the output. For simplicity, we show a 16-entry dictionary in the figure, with 4 bits representing the dictionary index. As we can see, the dictionary entries represented by 'bbbb' might change during a dictionary update, which can cause a data consistency problem. The authors, however, do not address this data consistency problem, which can occur due to a dictionary replacement. We define data consistency as the property that data values and the corresponding compression dictionary entries come as a pair, yet must be updated individually from time to time. Updating a compression dictionary entry can potentially impact multiple data values, resulting in a challenging consistency problem. For example, with a dynamic dictionary, all the data that references a replaced dictionary entry would have to be updated. This could potentially touch all the cache lines where the compressed data is stored, an extremely expensive operation that is not feasible in a latency-sensitive system.
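The following sketch captures the flavor of this per-word encoding using a subset of the codes in Table 2.1; it is a simplified illustration (full-word zero, full match, and 3-byte prefix match only), not the complete algorithm from [16].

```python
def cpack_encode_word(word: bytes, dictionary: list):
    """Encode one 4-byte word against a small dictionary.

    Simplified subset of the CPack code table in Table 2.1: 'zzzz' (all
    zero), 'mmmm' (full dictionary match), 'mmmx' (3-byte prefix match),
    otherwise 'xxxx' (uncompressed). Returns (code, payload, output_bits).
    The 4-bit dictionary index is carried as a whole byte here for clarity.
    """
    if word == b"\x00\x00\x00\x00":
        return "00", b"", 2                                        # zzzz
    for idx, entry in enumerate(dictionary):
        if word == entry:
            return "10", bytes([idx]), 6                           # mmmm
    for idx, entry in enumerate(dictionary):
        if word[:3] == entry[:3]:
            return "1110", bytes([idx]) + word[3:], 16             # mmmx
    dictionary.append(word)                                        # learn the new word
    return "01", word, 34                                          # xxxx


dictionary = []
words = [b"\x00\x00\x00\x00", b"\xca\xfe\xba\xbe",
         b"\xca\xfe\xba\xbe", b"\xca\xfe\xba\x01"]
total = sum(cpack_encode_word(w, dictionary)[2] for w in words)
print(f"{total} bits instead of {32 * len(words)}")                # 58 vs 128
```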

Figure 2.4: Compression example using CPack.


The CPack cache is divided into a compressed and an uncompressed partition. The dictionary entries are updated for every new word encountered. The algorithm uses a FIFO replacement scheme, which results in updates to compressed data every time a dictionary entry is replaced. Using this algorithm, all cache lines referencing a particular dictionary entry must be tracked, decompressed, and moved to the uncompressed side for every dictionary replacement. This is expensive to perform in hardware. The CPack scheme improves performance due to an increased hit rate, since the cache size is logically expanded. While the increased hit rate yields some bandwidth reduction, the full potential of the improved bandwidth is not realized because the data is decompressed on its way out of the cache. Decompression is necessary to maintain data consistency with the dynamic dictionaries, as the entries being referenced might change due to the replacement scheme.

Bus compression schemes proposed by Yang et al. [21] use dynamic dictionaries because the dictionary entries only need to be consistent during bus transfer operations. While this certainly helps reduce bandwidth demands, it does not result in any cache or memory expansion, so there are no additional bandwidth savings from improved hit rates.

The adaptive multi-dictionary model (AMDM) proposed by Yu and Wu [24] considers an n-level dictionary with 9 tuples to model the dictionaries. A word can be compressed using any of the n levels. They considered the trade-offs of using either LRU or FIFO replacement policies. These policies require that each dictionary entry keep track of every location of the compressed data within the memory hierarchy. Managing this information in hardware for a multi-level dictionary would be costly in terms of chip area. Our work differs from this hierarchical approach significantly. The n dictionaries created in this scheme are independent and do not communicate with each other, so any opportunity to maintain data consistency by swapping dictionary references is lost. Another problem is the potentially high access latency incurred: having to access multiple dictionaries to decompress data, which is on the critical timing path, may be too costly to be feasible.

Alameldeen and Wood [25] observed that most dictionary-based compression schemes are effective only for large data block sizes. In a cache, the data block size is typically much smaller, so they proposed using a significance-based compression scheme instead. Significance-based schemes try to compress data by exploiting redundancy in the most significant parts of the data stream. Farrens et al. [26] observe that the addresses transferred to memory show redundancy in the higher order bits. They proposed storing these higher order bits in dynamically allocated registers and transferring only the narrow indices to memory. This compresses the data on the bus,

saving significant bus bandwidth. Citron et al. [27] use the same approach, but store the higher order bits in a table. The frequent pattern compression (FPC) scheme proposed by Alameldeen and Wood [25] takes advantage of the fact that small values have few significant bits in a 32/64-bit word. They propose encoding these values based on the number of significant bits to reduce the data size. Table 2.2 shows how the data stream can be encoded using frequent pattern compression. A 3-bit prefix is used to encode each 32-bit word. The prefix denotes the type of encoding, followed by the narrow width data for most entries in the table. As expected, zero compression is handled as a special case, where a run of up to eight 32-bit zero words can be compressed using a 3-bit prefix and a 3-bit count of the zeroes.

Prefix   Pattern                                    Encoded Data Size
000      Zero run                                   3 bits (for runs up to 8 zeros)
001      4-bit sign-extended                        4 bits
010      One byte sign-extended                     8 bits
011      Halfword sign-extended                     16 bits
100      Halfword padded with a zero halfword       The nonzero halfword (16 bits)
101      Two halfwords, each a byte sign-extended   The two bytes (16 bits)
110      Word consisting of repeated bytes          8 bits
111      Uncompressed word                          Original word (32 bits)

Table 2.2: Frequent Pattern Encoding table as seen in [25]
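A compact sketch of the prefix selection in Table 2.2 (pattern detection for a single word only; the zero-run and two-halfword cases are omitted, and this is an illustration rather than the encoder from [25]):

```python
def is_sign_extended(word: int, bits: int) -> bool:
    """True if the 32-bit word is just a sign-extended `bits`-bit value."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return -(1 << (bits - 1)) <= signed < (1 << (bits - 1))


def fpc_prefix(word: int):
    """Pick a prefix from Table 2.2 for one 32-bit word.

    Returns (prefix, data_bits): the 3-bit prefix and how many data bits
    follow it in the compressed stream.
    """
    if is_sign_extended(word, 4):
        return "001", 4
    if is_sign_extended(word, 8):
        return "010", 8
    if is_sign_extended(word, 16):
        return "011", 16
    if word & 0xFFFF == 0 or (word >> 16) == 0:
        return "100", 16                          # one halfword is zero
    if len({(word >> s) & 0xFF for s in (0, 8, 16, 24)}) == 1:
        return "110", 8                           # word of repeated bytes
    return "111", 32                              # uncompressed


for w in (0x00000003, 0xFFFFFF85, 0x00120000, 0x7C7C7C7C, 0x12345678):
    print(hex(w), fpc_prefix(w))
```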

Pekhimenko et al. proposed the base-delta-immediate (BDI) scheme [5] to address the shortcomings of dictionary-based compression algorithms. The authors observed that data in a cache block often has low dynamic range, which means the deltas between values in the block are small. The data can thus be stored in a more compact format, as a base value and an array of deltas. Some values are too small to fit the base-plus-delta paradigm; these can be stored as a delta from zero (an immediate). Pattern matching algorithms work at a word granularity [15, 25, 16], which prevents them from taking advantage of the correlation between different words in the cache line. BDI works at a cache-line granularity, which is much larger, improving the compression ratio. Another advantage of working at a cache-line granularity is that decompression need not be serial. Dictionary-based schemes are forced to decompress serially, as the position of each word in the compressed stream depends on the decompression of the previous word.
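A minimal sketch of the base-plus-delta idea (one base and one delta width only; the published BDI design [5] tries several base sizes and delta widths and adds the delta-from-zero immediate case):

```python
import struct

def bdi_compress(line: bytes, delta_bytes: int = 1):
    """Try to store a 64-byte line as one 8-byte base plus narrow deltas.

    Returns (base, deltas) on success, or None if any delta does not fit
    in `delta_bytes` bytes (the line would then be kept uncompressed).
    """
    values = struct.unpack("<8Q", line)          # eight 8-byte values
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = []
    for v in values:
        delta = v - base
        if not -limit <= delta < limit:          # delta too wide: give up
            return None
        deltas.append(delta)
    return base, deltas                          # 8 + 8*delta_bytes payload bytes


line = struct.pack("<8Q", *(0x1000_0000_0000 + i * 3 for i in range(8)))
result = bdi_compress(line)
if result:
    base, deltas = result
    print(f"64-byte line -> {8 + len(deltas)} bytes (base + {len(deltas)} one-byte deltas)")
else:
    print("line kept uncompressed")
```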


Compression Algorithm   Type                     Granularity   Target
ZCA                     Zero compression         Cache block   Cache
ZMA                     Zero compression         Cache block   Cache and Memory
DZC                     Zero compression         Byte          Cache
FVC                     Static dictionary        Word          Cache
FPC                     Frequent pattern table   Word          Cache
XMATCH                  Dynamic dictionary       Word          Cache and Memory
XRL                     Dynamic dictionary       Word          Cache and Memory
CPACK                   Dynamic dictionary       Word          Cache
BDI                     Base-delta               Cache block   Cache and Memory
Base-Register Caching   Base-delta               Address       Data bus

Table 2.3: Comparison of popular hardware compression algorithms

2.3 Compressed Cache Architecture

Previous research on compressed cache design has predominantly focused on two areas: i) logical cache expansion [17, 28, 18, 29, 15, 25, 16] and ii) energy reduction [13, 30, 31, 32]. Logical expansion increases the effective cache capacity so the cache can perform similarly to a much larger cache while occupying a smaller silicon footprint. This efficient use of on-chip area results in reduced power consumption, which makes the cache more energy efficient. The last level cache (LLC) is typically chosen for compression because higher level caches are designed for performance and have a low tolerance for latency. In contrast, the LLC is designed for capacity and can tolerate additional latency in exchange for increased effective size. A miss at the LLC is far more expensive, as the request goes off-chip to DRAM, so the small overhead of compression-decompression operations seems reasonable at this level.

A good compressed cache needs to have a high rate of compaction, low hardware complexity, and fast lookup times, but achieving these goals can be difficult due to issues such as internal fragmentation, compression latency, and recompaction. Compression algorithms generate compressed data of different sizes based on the compressibility of the data. Internal fragmentation occurs when these compressed blocks of variable sizes cannot be stored in the data lines efficiently. This results in low data line utilization and a smaller effective cache size. Compaction is the process of packing variable sized data to maximize data line utilization, effectively increasing the logical cache size. Some of the previous work [16, 30] in this area tries to solve this problem with fairly simple approaches.


Figure 2.5: Fragmentation example for a fixed compaction scheme.

For example, the architecture of the low energy data cache design proposed by Kim et al. [30] only stores multiple compressed blocks in the same line if both blocks compress to half the data line size. As shown in Figure 2.5, this design uses a fixed scheme that maps two tags to the same data block when each block compresses to less than half of the data line size. This approach uses twice the number of tags as data lines, and can potentially increase the effective cache capacity to 2X the physical size of the cache. The cache design proposed by Chen et al. [16] follows a similar approach for compaction, using a dictionary-based compression algorithm. The theoretical 2X expansion is not realized in practice, as compressed blocks do not always fit the compaction paradigm described above. Lee et al. [20] propose compressing larger data blocks, twice the size of a data line, down to a size that is less than a single line. This approach may improve compaction, as larger data blocks tend to compress better, but the internal fragmentation problem remains unsolved: blocks with higher compression ratios still leave the data lines underutilized. Another approach is to store two compressed data lines if both can fit in one cache line, irrespective of the compression ratio of each [29]. As we can see in Figure 2.6, this approach can potentially reduce fragmentation due to more flexibility in the compaction criteria.

Figure 2.6: Fragmentation with less restriction on compressed block placement

This will improve compaction, but still will not solve the internal fragmentation issue. The methods discussed so far can, in theory, effectively double the size of the cache. However, due to

lack of flexibility in the compaction schemes, internal fragmentation occurs, so the effective cache capacities are not able to reach that potential. Seznec et al. [12] proposed using super tags to address a larger set of data lines with fewer tags, decoupling the tag and data lines for each set. This concept of decoupling the tag and data lines is exploited in various compressed cache designs [17, 29, 33, 28], which map multiple compressed blocks belonging to the same super tag to smaller data blocks across the cache. In Figure 2.8, decoupled super tags share the same data line. This increases data line utilization and almost completely eliminates fragmentation in the cache.

Figure 2.7: Compaction with subblocks in the data array

Alameldeen et al. implement a decoupled variable-segment cache [29] that divides the data lines into 8-byte segments. The smaller segments allow more flexibility in data placement, making it easier to compact the data. The compressed size of each block is stored as metadata along with the address tag. A saturating counter is used by the cache controller to determine whether an incoming block should be compressed. The controller looks at the LRU stack and the compressed size of each block to determine if a miss could have been avoided with compression. If not, the counter is updated to reflect a bias against compression. The disadvantage of an adaptive scheme is that every cache block must be compressed to generate the metadata required to update the counter.
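A rough sketch of this adaptive decision (the counter width, threshold, and update rule here are illustrative choices, not the exact parameters from [29]):

```python
class AdaptiveCompressionPolicy:
    """Saturating counter that biases toward or against compressing new blocks."""

    def __init__(self, bits: int = 6, threshold: int = 32):
        self.counter = threshold          # start unbiased
        self.max = (1 << bits) - 1
        self.threshold = threshold

    def should_compress(self) -> bool:
        return self.counter >= self.threshold

    def on_avoidable_miss(self):
        # A miss that compression would have avoided: reward compression.
        self.counter = min(self.max, self.counter + 1)

    def on_unhelpful_compression(self):
        # The block would have hit anyway, so compression did not help: penalize it.
        self.counter = max(0, self.counter - 1)


policy = AdaptiveCompressionPolicy()
for _ in range(8):
    policy.on_unhelpful_compression()     # workload where compression rarely helps
print(policy.should_compress())           # False once the counter drops below threshold
```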

Sardashti et al. propose a decoupled compressed cache [17] that uses decoupled super tags and subblocks. Each super tag can address up to four conventional cache blocks, leading to a 4x logical expansion. The super tag consumes a variable number of subblocks based on the compressibility of the data. Internal fragmentation is reduced by allowing non-contiguous subblock allocation for each compressed block, which helps utilize the data lines more efficiently. Using super tags also reduces area overhead when compared to using 4x the number of tags to achieve logical expansion. Decoupling the tag and data block also has some potential disadvantages. Decoupling results in increased lookup latency, as the tag and data cannot be searched in parallel. The decoupled compressed cache [17] uses back pointers for each subblock to keep track of which tag it belongs to. This means the tag match operation must complete before the search for the subblocks can begin, which increases latency.

Sardashti et al. [28] tried to solve the indirect pointer problem with a skewed compressed cache design. Instead of using back pointers to map the data blocks to tags, the cache is divided into four-way groups. The compressed data is distributed among the way groups based on an index generated by hashing the compression ratio with certain address bits. This allows parallel lookups of the tag and data, which is much faster than using back pointers. The effective associativity is lowered, because each compression ratio is tied to a specific cache way. The authors counter this by using a skewed associative design: by using a different hash function for each cache way, the associativity is nearly doubled [34]. This also ensures high data line utilization due to a more efficient distribution of data blocks with different compression ratios. While skewing helps improve associativity, it cannot completely eliminate the lower effective associativity of the cache. Using different hashing functions also increases the area cost and complexity of the design. Skewing also makes it hard to use traditional cache replacement policies, as the block distribution does not follow conventional schemes. Sardashti et al. propose an alternative design [18] that attempts to balance the advantages of a decoupled compressed cache and a skewed compressed cache, with lower design complexity. They use a super tag similar to the decoupled cache, but limit the compressed blocks to a single data entry instead of distributing the subblocks across the set. This brings the design closer to conventional caches, which significantly lowers hardware complexity.

Hallnor et al. [33] propose a fully-associative Indirect Indexed Cache (IIC) with compression. The cache decouples the tag and data arrays, with the data blocks divided into smaller subblocks. Each tag contains metadata about the number of subblocks used and forward pointers to those subblocks. If the number of subblocks is zero, then the data is stored entirely in the forward pointers, without consuming any subblocks.

Another way to provide logical cache expansion is by augmenting the cache with a smaller

cache for compressed data. Zhang et al. [15] propose a frequent value cache that stores the most frequently used words in a separate structure. The frequent values are then encoded in the main cache using a much smaller representation. The effectiveness of the frequent value structure goes down as the block size increases, which is why this approach suits the level-1 (L1) caches better than the LLC. The compaction scheme is still restrictive, as it allows two compressed blocks in the same data line only when each compresses to half the line size. Dusser et al. [10] propose a Zero-Content Augmented (ZCA) cache. They observe that a large fraction of data accesses in select applications are to null data that exhibits good spatial locality. This data can be compressed easily by storing a single bit instead of a run of zero bytes. They propose using a zero cache that stores only the address tags and a valid bit for each zero block. Decompression is very fast due to the low complexity of the hardware, and the LLC gains free space to store non-zero values.

Figure 2.8: Compaction with decoupled subblocks and super tags

Writes to a compressed block can make the data placement problem even harder. If a write request completely overwrites a compressed block, there are two cases. If the new compressed block is smaller than or equal to the old compressed block, we can simply store the data in the old position. If the written data is larger, then we have to deal with recompaction. Recompaction can be challenging if the compressed block was sharing a data line with other compressed blocks. A larger compressed block would mean moving the neighboring compressed blocks to different locations in the cache, or allocating a new block for the larger compressed block. Designs such as [16, 30, 28] share data lines among multiple tags and will incur a performance hit due to recompaction operations. Designs that decouple the tag and data lines [17, 29] can handle these cases

better. This is because the data blocks are divided into subblocks that need not be contiguous; in this type of cache, one can simply allocate more subblocks to fit the larger compressed data. A more challenging case for compressed caches is a partial write to a compressed block. In Figure 2.9 we see that in a conventional cache, a partial write simply results in a partial update of the data lines. With compressed caches, partial writes can cause a huge performance degradation. A compressed block would have to be decompressed, merged with the partial write, and then recompressed. This changes a simple read-modify-write operation into a read-decompress-modify-compress-write operation. This results in high latency, and back-to-back partial writes significantly increase latency in the application. Partial writes are more common on GPUs due to the work being distributed in parallel: different compute units can work on different parts of the same cache line and commit the data as a sequence of partial writes. Figure 2.10 shows a flowchart for handling byte-masked writes with compressed caches.

Figure 2.9: Bytemasked writes with uncompressed data.
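The following sketch contrasts the two write paths described above. The helper names are hypothetical, and compress/decompress stand in for whatever algorithm the cache uses.

```python
def merge_bytemasked(line: bytearray, data: bytes, mask: list) -> bytearray:
    """Merge only the bytes enabled by the mask (conventional-cache path)."""
    for i, enabled in enumerate(mask):
        if enabled:
            line[i] = data[i]
    return line


def bytemasked_write_compressed(stored: bytes, data: bytes, mask: list,
                                compress, decompress) -> bytes:
    """Partial write to a compressed block: read-decompress-modify-compress-write."""
    line = bytearray(decompress(stored))   # 1. decompress the stored block
    merge_bytemasked(line, data, mask)     # 2. merge the dirty bytes
    return compress(bytes(line))           # 3. recompress (size may change!)


# With identity functions standing in for a real compressor:
new_block = bytemasked_write_compressed(bytes(64), b"\xAA" * 64,
                                        [i < 8 for i in range(64)],
                                        compress=lambda b: b,
                                        decompress=lambda b: b)
```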

The compressed cache design can also depend on the compression algorithm chosen.


Figure 2.10: Bytemasked writes with compressed data


For example, a compressed cache that uses a dictionary-based compression algorithm, such as CPack [16], will need to decompress or evict blocks when a dictionary entry is replaced. This can result in the eviction of multiple compressed blocks that depend on that dictionary entry. Another option is to perform a decompression operation on all affected blocks, and then store the uncompressed data in the cache using an efficient recompaction scheme. Compressed caches also need to store more tags than an uncompressed cache design, which results in area overhead. For example, the cache designs proposed previously [16, 30] need to support storing two tags for each data line. Using super tags can reduce this area overhead, as in the case of DCC [17] and ACC [29]. However, super tags introduce a higher number of tag conflicts due to the reduction in the number of unique tags stored. A better balance is needed to achieve the desired performance at an acceptable area cost.

2.3.1 Smart Caches

As graphics processors grow more complex, cache controllers need to rely less on fixed cache management policies and become more dynamic in nature. Perceptron learning algorithms have been proposed to add intelligence to CPUs when making dynamic performance-related decisions [35, 36, 37, 38]. Perceptron models [39] can be used for making binary decisions by identifying various patterns in instruction and data streams. A perceptron is modeled by a vector of signed weights that are adjusted based on a selected training criterion. The dot product of an input vector with the associated signed weights for each input stream is compared against a threshold to drive performance-related decisions in the design. Jimenez et al. [35] use a perceptron-based learning method to improve branch prediction accuracy. Branch prediction is more latency sensitive than any LLC access, which reinforces our confidence in using the perceptron learning method for smart cache management of compressed data. While the original perceptron model requires the computation of a dot product between a weight vector and a set of binary input features, subsequent research [35, 40] uses hashed perceptron predictors, as shown in Figure 2.11. For hashed perceptron predictors, an independent weight table is used for each feature in the model, and each table is indexed using a hash of the relevant feature. The outputs of multiple feature tables are combined by adding the results from each individual table, and this final value is compared to a threshold for decision making. This approach retains the perceptron learning method while minimizing hardware complexity in the design implementation.


Figure 2.11: Feature table for a hashed perceptron prediction.

In addition to branch prediction, the perceptron learning algorithm has also been used to improve cache performance. Teran et al. [36] use the perceptron learning method to improve the accuracy of reuse prediction. They use features such as program counter values, bits of memory addresses, and a trace of past memory instructions to predict whether a block of data will be reused. This information is used to make decisions such as block placement and bypass management. Wang et al. [37] use a learning algorithm to improve the performance of prefetching. Ghosh [38] uses the same algorithm to provide coherence predictors in the cache for shared data. The hardware complexity of such a learning model is low, as evidenced by the hashed perceptron based branch predictor deployed in the AMD Zen CPU core [41]. Teran et al. [36] also observe that a perceptron-based model can find independent correlations between the different features used. This type of intelligent controller can potentially be used to manage compressed data in a cache efficiently. We explore this further in Section 4.3.
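A compact sketch of the hashed perceptron prediction and update loop described above (the table sizes, features, threshold, and update rule are illustrative placeholders, not the parameters of any published predictor):

```python
class HashedPerceptron:
    """Sum of per-feature weight tables, compared against a threshold."""

    def __init__(self, num_features: int, table_size: int = 256, threshold: int = 4):
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold

    def _indices(self, features):
        # Each feature value is hashed into its own weight table.
        return [hash(f) % self.table_size for f in features]

    def predict(self, features) -> bool:
        total = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return total >= self.threshold

    def train(self, features, outcome: bool):
        # Increment weights when the outcome was positive, decrement otherwise,
        # saturating the counters to keep them small in hardware.
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(-32, min(31, t[i] + (1 if outcome else -1)))


# Example: predict whether a block will be reused from (PC, address bits) features.
predictor = HashedPerceptron(num_features=2)
features = (0x400123, 0x7F00 >> 6)
for _ in range(5):
    predictor.train(features, outcome=True)
print(predictor.predict(features))   # True once the summed weights cross the threshold
```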

2.4 DRAM Efficiency

Another important component of the memory hierarchy is DRAM, which is used for main memory. Accessing DRAM is a long latency operation, and inefficient access patterns can be very costly in terms of performance and power. DRAM power in modern server systems can account for almost a third of the total system power budget [42]. The main memory in a GPU is typically organized into multiple channels; the AMD HD 7970 GPU has 12 such memory channels [43]. Having multiple channels increases the data bandwidth of the device, as the channels can service memory requests concurrently. Each channel has an

A memory controller connected to each memory channel is responsible for scheduling the memory requests. The controller tries to achieve both high throughput and fairness while arbitrating requests. A memory channel is further divided into a number of banks that operate in parallel. These banks share the command bus, address bus, and data bus assigned to that channel. Figure 2.12 shows one specific way of mapping the address bits to a bank within a memory channel. Each bank is organized as a 2D memory array consisting of rows and columns. Address bits are used to determine which row and column to fetch the data from. To realize the full benefit of multiple channels, memory accesses should have minimal channel conflicts. A channel conflict occurs if multiple requests hit the same channel and get stuck in a queue while other channels have empty request queues available to service requests. The address mapping is usually chosen such that consecutive addresses are spread across multiple channels and banks to minimize conflicts.

Figure 2.12: Example DRAM address mapping.
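As a concrete illustration of this kind of address decomposition, the sketch below splits a physical address into column, channel, bank, and row fields. The specific field widths and their order are assumptions for illustration, not the mapping used by any particular GPU.

```python
# Illustrative DRAM address decomposition (field widths and order are assumed).
COLUMN_BITS = 10    # e.g., 1 KB of column addressing (assumed)
CHANNEL_BITS = 4    # e.g., up to 16 channels (assumed)
BANK_BITS = 4       # e.g., 16 banks per channel (assumed)

def decode(addr):
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    channel = addr & ((1 << CHANNEL_BITS) - 1)
    addr >>= CHANNEL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr                      # remaining high-order bits select the row
    return {"channel": channel, "bank": bank, "row": row, "column": column}

# Consecutive addresses beyond the column field land in different channels and
# banks, which is what spreads sequential traffic and minimizes channel conflicts.
print(decode(0x12345678))
```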

Performance of a workload is highly sensitive to memory efficiency. To understand memory performance better, we can highlight some of the important DRAM timing parameters, such as row precharge time (tRP), row activation time (tRAS), and column activation time (tCAS). Each row in a bank needs to be precharged before it can be accessed. tRP determines the minimum amount of time it takes to precharge a specific row. Once a row is precharged, the row activation time is the minimum time that must elapse before a different row can be accessed within the same bank. Accesses to DRAM can be classified into three categories, as suggested by Zhang et al. [44]. If the memory access hits the same bank and row, then the row is already precharged and activated.

This means the latency only depends on the column activation time (tCAS), which is the best-case scenario. If the access is to a different row in a different bank, then it can be pipelined, as banks can be accessed in parallel. The worst-case scenario is when a row buffer conflict occurs. This happens when a different row in the same bank is requested. In this case, the latency is tRP + tRAS + tCAS, as we would have to wait for the current row to be closed and a new row to be precharged and activated. Row buffer conflicts prevent pipelining of data from parallel banks, and reduce memory throughput due to underutilization of the data bus.


Figure 2.13 shows an ideal access pattern within one memory channel. The address bus receives four address requests, A0-A3, each with a unique bank ID, resulting in zero bank conflicts. The distributed requests keep the data bus fully utilized with constant data throughput.

In an ideal case, the memory controller will only return to a specific bank once tRP + tRAS + tCAS has passed for the first request to that bank. The other banks in that channel hide this latency by keeping the data bus fully utilized.
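The three access categories translate directly into different latencies. The sketch below classifies accesses accordingly; the timing values and the cost assigned to a first access to an idle bank are placeholder assumptions, not datasheet numbers.

```python
# Illustrative classification of DRAM accesses into the three categories
# discussed above; timing values are placeholders, not datasheet numbers.
T_RP, T_RAS, T_CAS = 14, 34, 14    # precharge, activate, column access (assumed)

open_rows = {}   # bank id -> currently open row

def classify(bank, row):
    if open_rows.get(bank) == row:
        return "row-hit", T_CAS                      # best case: only tCAS
    if bank in open_rows:
        open_rows[bank] = row
        return "row-conflict", T_RP + T_RAS + T_CAS  # close, precharge, activate, read
    open_rows[bank] = row
    return "bank-idle", T_RAS + T_CAS                # first access to the bank (assumed cost)

print(classify(bank=2, row=7))   # bank-idle
print(classify(bank=2, row=7))   # row-hit
print(classify(bank=2, row=9))   # row-conflict
```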

Figure 2.13: Efficient DRAM access pattern.

Figure 2.14 shows a case where consecutive requests are made to different rows of the same bank. As we can see from the figure, there is a penalty, as the row conflict must be resolved before the second request can be returned. This adds latency to the access and reduces data throughput for that memory channel. Access to a row is a destructive process, and typically happens on an LLC miss. When a row is accessed, the data is read from the entire row and stored in a row buffer. Typically, each bank has a dedicated row buffer. The row size is usually much larger than the LLC cache line size. To exploit spatial locality for better performance, multiple cache lines must be mapped to the same DRAM page. This way, once a row is open, consecutive requests will hit in the row buffer. This also helps hide the latency of a row access across multiple banks. Increasing the row buffer hit rate and optimizing throughput across banks are conflicting goals. For parallel access, it is ideal that consecutive cache lines are mapped to different banks, whereas for higher row buffer hit rates, they should be mapped to the same page. This is why the conventional method is to interleave the addresses across a page boundary. This way a row buffer can be utilized efficiently, while spreading accesses across banks at strides of a page size.


Figure 2.14: Inefficient DRAM access pattern.

As most compressed cache designs are focused on logical expansion and energy-related goals, memory efficiency is often neglected. The compressed cache designs discussed in the previous section store data in the cache in compressed form, but evict data uncompressed. This means memory-bound workloads will not see large bandwidth improvements with these designs; any bandwidth improvement is limited to gains from improved hit rates in the caches. Data must be transferred to memory in compressed form to fully realize the potential of a compressed cache design. An LLC does not just need to store data efficiently, it also needs to evict it efficiently. This is because inefficient data bursts from the LLC to off-chip DRAM can result in increased latency, as well as power consumption. Traditional caches evict data in multiples of the cache line size. This makes it easy for the memory controller to schedule accesses to improve data bus utilization and row buffer utilization. Transferring compressed data on the memory bus is inefficient, because compressed data is variable in size and can result in poor and unpredictable DRAM access patterns. This can hurt data bus efficiency and the row buffer hit rate. Smarter replacement algorithms are needed to improve the data pattern of cache evictions and improve memory efficiency. This adds another variable to the design of compressed caches. Memory-efficiency-aware compressed cache designs can deliver large improvements in performance and power for a last level cache, while expending very little additional area.

Chapter 3

Framework and Metrics

An architecture simulator is used for modeling processors in software and helps facilitate rapid design exploration of microarchitectural features. Simulators provide a fast and accurate representation of real hardware and can be built much faster and cheaper than hardware prototypes. In this chapter we discuss the types of architecture simulators available to evaluate cache and memory systems in processors, and provide an overview of the custom framework we built for design exploration of compressed caches. An architecture simulator can be classified based on the level of detail in the model. For example, a cycle-accurate model attempts to emulate the performance of a processor on a per-cycle basis with accurate clocks guiding the execution, whereas a functional model emulates the function of the modeled hardware without any notion of a clock or specific timing-related features. Functional models typically run much faster due to the lower level of detail and help with faster design exploration of non-timing-sensitive features. In addition to architecture simulators, there are full system simulators, which model not just the processor, but also the operating system and other necessary components of an entire computer system. These are capable of running programs directly instead of pre-captured traces. One of the most popular cache simulators is the Dinero cache simulator [45]. This is a trace-based simulator that supports modeling of multiple levels of the cache hierarchy. It takes memory addresses as input and generates hit and miss statistics for each level of the memory system. The system is modeled to work on memory reference streams, which means data does not actually move in and out of the cache. This makes it difficult to use this model for compressed cache evaluation: since data can exist in different forms at different levels of the cache hierarchy, actual data values are needed to produce accurate statistics.


Since our focus is on graphics hardware, we also evaluated GPU simulators, including the gem5 simulator [46] and Multi2Sim [47]. gem5 is a modular simulator that contains both system-level and processor-level models for design exploration. It has support for detailed cache simulation, along with memory models with accurate DRAM controllers. However, gem5 does not provide a model of a compressed cache, does not support running non-compute traces on its GPU model, and cannot replay traces captured on a real GPU. Multi2Sim supports both functional and timing models for various CPU and GPU architectures, including AMD's Southern Islands and NVIDIA's Fermi architectures. However, just like Dinero, Multi2Sim does not model the movement of data through the cache hierarchy, which makes it difficult to use for compressed cache evaluation. Multi2Sim also lacks support for graphics workloads, as it only models the compute pipeline of a GPU.

As most of the industry-standard simulators do not meet our requirements for compressed cache evaluation, we decided to create our own custom framework for this purpose. We use this framework to capture data to help evaluate various compressed cache designs, compression algorithms, and memory design tradeoffs. Our framework consists of a configurable last level cache model with support for various tag-data mapping schemes, a compression engine that supports various state-of-the-art compression algorithms, and a simple main memory model with metadata storage for compressed data. The entire infrastructure is modular, which lets us effortlessly plug in custom components and configure simulations with varying parameters.

Our simulator is designed to run traces captured at the last level cache interface. For our evaluation, we use real-world graphics/gaming workloads for GPUs, including Call of Duty:BlackOps [48], Assassins Creed 3 [49], and Civilization 5 [50], along with industry-standard benchmarks such as Firestrike, Cloudgate, and Icestorm from the 3DMark suite [51]. The full list of graphics workloads is presented in Table 3.1. Two sets of input traces were captured at the last level cache interface: the first set on an AMD Tahiti GPU [43] and the second set on the newer AMD Vega64 GPU [2]. Along with the input trace, we also capture the return stream from the last level cache to verify the correctness of our simulations. Each trace contains between one million and ten million cache requests. The captured trace represents the cache traffic for a single frame and includes several draw calls. The trace includes the data address, the data values, and the metadata identifying which blocks within the GPU issued the read/write request. This includes data from the pixel, vertex, and compute shaders for each workload. Since these traces are captured on a live system, they also contain byte-masked write accesses.


Figure 3.1: Block Diagram of simulation Framework.

Workload                DirectX version   GPU
3D Mark Vantage         DX10              Tahiti
Assassins Creed 3       DX11              Tahiti
BioShock Infinite       DX11              Tahiti
Formula 1               DX11              Tahiti
Civilization 5          DX11              Tahiti
3D Mark Cloudgate       DX11              Tahiti
CoD BlackOps            DX11              Tahiti
3D Mark IceStorm        DX11              Tahiti, Vega
3D Mark GT2             DX11              Tahiti
3D Mark Firestrike      DX11              Tahiti
Far Cry Primal          DX11              Vega
Fallout 4               DX11              Vega
Alien vs Predator       DX11              Vega
Witcher 3               DX11              Vega
Total War - Warhammer   DX11              Vega

Table 3.1: Graphics workloads

Byte-masked writes cannot simply be merged, as our cache model stores data in compressed form. We have added support for partial writes to compressed data in our model to handle these byte-masked writes. We have also evaluated six compute workloads from the AMD OpenCL SDK [52] and the Hetero-Mark [53] benchmark suite. The list of workloads is shown in Table 3.2. These workloads help us compare compression metrics across graphics and compute.

Workload                         Benchmark
Simple convolution               OpenCL SDK
Matrix transpose                 OpenCL SDK
Bitonic sort                     OpenCL SDK
AES encryption-decryption        Hetero-Mark
Finite Impulse Response Filter   Hetero-Mark
KMeans                           Hetero-Mark

Table 3.2: Compute workloads

Traces for these workloads were captured using MultiGPUSim (MGSim), an execution-driven GPU simulator. The simulator was configured to model the R9 Nano GPU. The last level cache is split into four 512-kilobyte caches, each 16-way associative, with a 64-byte data line size. Figure 3.1 shows a block diagram for the various components in our simulator. The preprocessor block is used to convert the raw input trace into a cache request stream, producing aligned addresses and bytemasks. The preprocessor also analyzes the captured stream to initialize the memory model to the state at which the original trace was captured. This allows our simulation to handle the first read from the cache request stream correctly. The preprocessor can also be configured to combine write addresses for large cache block sizes, giving us the ability to run different configurations with various cache block sizes. The compression engine can be configured to compress data at different data granularities, and supports various state-of-the-art compression algorithms, such as base-delta-immediate [5], CPack [16], and others. The LLC model supports different cache designs, with support for conventional cache parameters such as block size, associativity, replacement policy, and write policy. The LLC model also takes as input parameters specific to a compressed cache, such as compaction policy, write expansion support, and indirect tag-data mapping. The simulator can be configured to use a 1:1 tag-data mapping, as seen in conventional caches, or a decoupled tag-data mapping, as proposed in the work done by Wood et al. [17] and Seznec et al. [12]. We also support skewed mappings based on compression ratios, as proposed by Sardashti et al. [28]. To support various compaction schemes, we added support for data block allocations at different subblock granularities.

This lets us allocate at subblock granularity to support indirect tag-data mapping schemes and improve data line utilization by compacting multiple compressed blocks in the same data line. For write policies, we support the conventional write-back and write-through policies. We also support custom policies required to handle cases specific to compressed data, policies not encountered when using traditional caches. For example, handling a partial write on top of a compressed block requires the block to be decompressed before we can write on top of it. Another case that needs to be handled correctly is when a write changes the size of a compressed block. The compaction scheme packs multiple compressed blocks into the same data line, and any change in compressed size requires the compacted blocks to be relocated. For replacement policies, we support the least recently used (LRU) scheme as well as priority-based schemes for shared data blocks. These priority-based schemes attempt to find a balance between burst efficiency and optimal hit rate to compensate for the limitations of the LRU scheme when it comes to compressed data. The memory model is fairly simple, but captures advanced statistics such as bank conflicts, page conflicts, and row buffer hit rate. At the end of the simulation, the output from our custom simulator is compared to the original captured trace to verify correctness. The simulator captures many important metrics for analysis of the various components of the compressed cache design. The metrics captured for the last level cache help us understand the impact of various tag-data mapping schemes, compaction schemes, and write policies unique to compressed data. The logical cache size helps us understand the logical expansion due to our compaction scheme and tag-data mapping. Data line utilization lets us evaluate whether our compaction scheme is using all available space effectively. The tag utilization provides insight into how large the super tags need to be, and the eviction statistics can help set the right granularity for the data blocks. We also capture the best-case hit rate for an infinite cache to compare against the various designs being evaluated.

• Cache Read/Write Hit-Rate

• Best Case Hit-Rate

• Logical Cache Size

• Number of Block/Super Block Evictions

• Data Line utilization

• Tag utilization


Other metrics are captured to provide insight on the cache-memory interaction with compressed data. We collect statistics on bandwidth gains due to compression, compressibility of the data stream, as well as effects on DRAM efficiency due to the changes in the memory access pattern as a result of compression. Compressed data can alter the read/write burst size between LLC and main memory, which can affect the row buffer hit-rate. For any access to the memory system, we keep track of the burst size to each bank and row of the DRAM, and record any conflicts to identify potential inefficiencies due to the compressed nature of our data. The various DRAM parameters, including the row size, number of banks, and the channel interleave factor, are configurable.

• Read/Write Bandwidth

• Logical Read/Write Bandwidth

• Average Compression Ratio

• Read/Write Compression Ratio

• Read/Write Burst Efficiency to DRAM

• Row Buffer Hit-Rate

• Bank/Page Conflicts

We also capture average compression ratios to measure the compressibility of the workload for various compression algorithms and record compression stats for each segment of the access stream to identify if selected program phases are more compressible. These time sliced traces can provide good insight into the nature of the data stream and how it affects compression. The metrics are verified using synthetic workloads that consist of input streams with specific access patterns. These specific patterns generate predictable numbers for all the supported metrics and help confirm the correctness of our simulator.
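To give a flavor of how a run over these parameters might be configured, the following is a hypothetical configuration sketch; the parameter names and values are illustrative assumptions, not the framework's actual interface.

```python
# Hypothetical configuration sketch for a compressed-LLC simulation run.
# All names and values are illustrative, not the framework's actual interface.
llc_config = {
    "num_slices": 4,
    "slice_size_kb": 512,
    "associativity": 16,
    "line_size_bytes": 64,
    "tag_data_mapping": "decoupled",      # "1:1", "decoupled", or "skewed"
    "compaction_granularity_bytes": 16,   # subblock size used for compaction
    "replacement_policy": "lru",          # or a priority-based policy
    "write_policy": "write-back",
    "compression_algorithm": "cpack",     # or "bdi", etc.
}

dram_config = {
    "row_size_bytes": 2048,
    "banks_per_channel": 16,
    "channel_interleave_bytes": 256,
}

print(llc_config["tag_data_mapping"], dram_config["banks_per_channel"])
```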


Figure 3.2: Sample simulation output for IceStorm benchmark.

Chapter 4

Results

In this chapter we discuss our proposed solutions to the challenges involved in designing a compressed cache and evaluate them using a wide set of real-world gaming workloads. In Section 2.2, we discussed the benefits and drawbacks of various compression algorithms. We discussed how dictionary-based compression algorithms work and the key limitations in current state-of-the-art compressors. We also described the two main types of dictionary-based compressors, static and dynamic, and how they differ from each other. Dynamic dictionaries in general have better compression ratios, as the entries adapt to each workload dynamically. This eliminates the need for static profiling to populate the dictionary, so the algorithm performs well on generic workloads rather than only on a pre-determined set. Dynamic dictionary-based compression techniques are used to increase the logical size of the LLC, which improves performance. As discussed in Section 2.2, prior dictionary-based compressors decompress the data in a cache block during eviction. This is done to maintain data consistency, as the compressed data and dictionary entries form a pair and each of these may change independently over time. This implies that any bandwidth reduction we observe is solely due to an improved hit rate, as the data on the bus is always transferred in uncompressed form. We propose a solution to extend dictionary-based compression methods to further improve bandwidth. We solve the data consistency problem with a novel dual dictionary system. We discuss this design in detail in the next section.


4.1 Dual Dictionary Compressor

We propose a compression algorithm that can further improve bandwidth by transferring the data in compressed form instead of decompressing on eviction. We introduce a second dictionary in memory to help with data consistency. We intelligently determine if it is valuable to transfer dictionary entries to memory and swap the dictionary index to maintain data consistency. If the dictionary entry is not significant enough to be transferred, then we attempt to preserve the static-zero and static-one compression in the data stream and uncompress the other words. We define ’static’ compression as a compression operation that does not use dynamic dictionary entries and compresses using a static value (such as ’0000’ or ’ffff’). These values are then pinned in the dictionary. We describe our architecture in more detail in the following sections.

4.1.1 Cache Architecture

Our dual dictionary compression scheme is agnostic to the underlying cache architecture. We use a decay-based [54] dictionary replacement scheme, as described in Section 4.1.4. This allows us to pick any underlying compressed cache architecture and add support for decay-based block eviction. For our experiments, we use the compressed cache architecture described in Figure 4.1. This model uses super tags to store up to two compressed blocks per cache line, with independent status bits for each compressed half. This theoretically limits us to a 2X logical cache expansion, but we can replace our cache model with any of the other compressed cache architectures that have been proposed, including DCC [17], SCC [28], and YACC [18]. These models can give us up to a 4X logical cache expansion, based on the maximum line compression of our algorithm.

Figure 4.1: A Cache Block
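As a rough data-structure sketch of the compressed cache line just described, the following shows one way a line holding up to two compressed blocks with independent status bits might be represented; the field names and sizes are assumptions for illustration, not the exact hardware layout.

```python
# Rough sketch of a cache line that can hold up to two compressed blocks.
# Field names and widths are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CompressedHalf:
    valid: bool = False
    dirty: bool = False
    tag: Optional[int] = None        # tag of the block stored in this half
    compressed_bytes: bytes = b""    # up to 32 bytes if the line is 64 bytes

@dataclass
class CacheLine:
    halves: List[CompressedHalf] = field(
        default_factory=lambda: [CompressedHalf(), CompressedHalf()])

    def can_insert(self, size_bytes: int) -> bool:
        # A new compressed block fits if a half is free and the block
        # compresses to half a line or less (the 2X expansion limit).
        return size_bytes <= 32 and any(not h.valid for h in self.halves)

print(CacheLine().can_insert(24))   # True: an empty line can hold a 24-byte block
```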

Figure 4.2 presents a block diagram of our cache model. We provide a dictionary swapping unit between the last level cache and main memory. The on-chip dictionary holds 64 entries and the larger off-chip dictionary holds 128 entries. The dictionaries can communicate with each other directly.

On an eviction, the data and metadata from the cache line are evaluated to determine the best way to transfer the data on the memory bus.

Figure 4.2: Block diagram of the LLC with our dual dictionary

4.1.2 Dictionary Structure

Our dictionary code structure is similar to the one described in Table 2.1, with some modifications that improve performance for graphics workloads. Based on our analysis, we update the code table initially described by Chen et al. [16] as follows:

Code   Pattern   Output         Length (b)
00     zzzz      (00)           2
01     xxxx      (01)BBBB       34
10     mmmm      (10)bbbb       6
1100   ffxx      (1100)bbbbBB   20
1101   zzzx      (1101)B        12
1110   mmmx      (1110)bbbbB    16
1111   ffff      (1111)         4

Table 4.1: Updated Pattern Table

First, we introduce a new code ’ffff’ and replace ’mmxx’ with ’ffxx’. This is needed

because our analysis found that graphics workloads contain a lot of black and white colors, which are represented by zeros and ones in the data stream [55]. These are easy to identify and compress in a stream, so adding new compression codes to handle complete and partial matches with 0xffff makes sense here. While the CPack work achieved good compression ratios for its selected compute workloads on CPUs, it achieved negligible improvements in terms of bandwidth [16]. We extend that work here to improve bandwidth usage by storing the data in its compressed form, taking advantage of a secondary dictionary in memory. We can then use multiple criteria to determine if it is worth transferring the data in its compressed form. If we meet the criteria, we perform a dictionary index swap operation to replace the on-chip dictionary indices with the indices in memory. If the dictionary swapping criteria are not satisfied, we keep the compressed words that are based on static compression and decompress the dictionary-based words. This is done by uncompressing the dictionary codes and indices, and leaving the zero- and one-compressed codes untouched. If we can generate a modified compressed stream that is smaller than the uncompressed data, this still has a positive impact on bandwidth.
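To make the code table concrete, the sketch below classifies a 32-bit word against the static zero/one patterns and a small dictionary, in the spirit of Table 4.1. The dictionary size and the exact matching rules are simplified assumptions, not the full encoder.

```python
# Simplified word classifier in the spirit of Table 4.1 (static patterns plus
# dictionary matches). The match rules and dictionary size are assumptions.
def classify_word(word, dictionary):
    if word == 0x00000000:
        return "zzzz"                     # all zeros, 2-bit code
    if word == 0xFFFFFFFF:
        return "ffff"                     # all ones, 4-bit code
    if (word >> 8) == 0x000000:
        return "zzzx"                     # zeros except one unmatched low byte
    if (word >> 16) == 0xFFFF:
        return "ffxx"                     # high half all ones (simplified rule)
    for index, entry in enumerate(dictionary):
        if word == entry:
            return ("mmmm", index)        # full dictionary match
        if (word >> 8) == (entry >> 8):
            return ("mmmx", index)        # match except the low byte
    return "xxxx"                         # uncompressible word, stored literally

# Example: classify a word against a tiny dictionary.
print(classify_word(0x0000001F, [0xDEADBEEF]))   # -> "zzzx"
```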

Figure 4.3: Output from Dictionary Index Swap Unit


4.1.3 Dictionary Index Swap

When a compressed line is evicted from the LLC, the Dictionary Index Swap Unit evaluates the most efficient method to transfer the data to memory. As shown in Figure 4.3, the Dictionary Index Swap Unit takes a compressed line and metadata as input, and produces one of three possible outputs:

• The compressed line is transferred to memory with the dictionary indices swapped with the indices of the second dictionary in memory. This creates a new dynamically compressed block.

• The compressed line is partially decompressed to replace dictionary entries with uncompressed words, while preserving the ’static one’ and ’static zero’ compression. This creates a new static compressed block.

The compressed line is completely decompressed and transferred to memory as an uncompressed block.

The dictionary index swap operation must only be done when specific conditions are met during block eviction. First, we check to see if the compressed line references a valid dictionary entry. If there is no valid entry, then a swap is unnecessary, since the compressed line is statically compressed and can be transferred uncompressed. If the compressed line references a valid entry in the dictionary, we check the metadata to see if the entry already exists in the off-chip dictionary. If the entry is present, we can perform the swap and transfer the updated compressed line to memory. In the case where a dictionary entry is not present in the off-chip dictionary, we need to evaluate if it will be beneficial to transfer the dictionary entry to the off-chip dictionary. We first check the number of dictionary entries used in the compressed line, as well as the reference frequency for each dictionary entry at that index. The reference frequency is the total number of times the dictionary entry is used across the entire cache. We then check if:

\sum_{i=0}^{n} F_i > T_f

where F_i is the reference frequency of each dictionary entry used by the compressed line, and T_f is a frequency threshold that determines if the dictionary entries add enough value to be transferred to memory. We also check, for each entry, if:

F_i > T_n


where T_n is the threshold value per dictionary entry. The threshold value per entry, as well as the frequency threshold, can vary by application. Typically, when running graphics workloads, the driver knows which application is running and can adjust the threshold values to provide the best performance. By being selective about whether to transfer dictionary entries, we are able to use a secondary dictionary in memory that consists of frequently used words from the on-chip dynamic dictionary. This approach is less dynamic in nature, because we do not eliminate the entries in the off-chip dictionary until the data in memory is overwritten. We can then update the metadata in the on-chip dictionary to mark the dictionary entry as present in memory. Then, when a subsequent cache line is evicted, we can save bandwidth by not transferring the dictionary entry again. The two dictionaries communicate with each other in order to sync up metadata related to the dictionary entries. The memory controller is responsible for communication between the two dictionaries. A metadata sync only occurs when dictionary entries are modified in the off-chip dictionary. This communication helps avoid bandwidth pollution due to unnecessary transfers of dictionary entries, and helps us evaluate the dictionary swap operation.
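The eviction-time decision described above can be summarized in code form. The sketch below is a minimal illustration under assumed thresholds and data representations, not the exact hardware logic.

```python
# Minimal sketch of the eviction-time decision for a compressed line.
# Thresholds and data representations are illustrative assumptions.
T_F = 64    # aggregate reference-frequency threshold (assumed)
T_N = 8     # per-entry reference-frequency threshold (assumed)

def eviction_action(entries_used, ref_counts, off_chip_entries):
    """entries_used: dictionary indices referenced by the evicted line.
    ref_counts: per-entry reference counts across the whole cache.
    off_chip_entries: entries already present in the off-chip dictionary."""
    if not entries_used:
        return "transfer-static"                 # only zero/one compression, no swap needed
    if all(e in off_chip_entries for e in entries_used):
        return "swap-and-transfer-compressed"    # swap on-chip indices for off-chip ones
    freqs = [ref_counts[e] for e in entries_used]
    if sum(freqs) > T_F and all(f > T_N for f in freqs):
        return "install-entries-then-swap"       # entries are worth sending off-chip
    return "partial-decompress-and-transfer"     # decompress dictionary words, keep static codes

# Example decision for a line that references two moderately used entries.
print(eviction_action([3, 7], {3: 40, 7: 30}, off_chip_entries={7}))
```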

4.1.4 Dictionary Replacement

Most dictionary replacement schemes for dynamic dictionaries update the dictionary entries every time a new word is seen by the compressor. These methods are highly inefficient, as they need to keep track of every cache line that is currently using a specific dictionary entry. Simple schemes, such as FIFO and LRU, perform poorly because they replace all the cache lines that reference the evicted dictionary entry. These schemes may also end up replacing dictionary entries that are frequently accessed and referenced by multiple cache lines. This makes the entire process inefficient, which defeats the purpose of using compression to improve cache efficiency. We can consider a frequency-aware replacement scheme that would address the multi-line eviction issue of the FIFO/LRU schemes, but this would still require us to keep track of all cache lines that use a specific dictionary entry. Decay-based replacement schemes, as discussed in prior work [54], do not require us to keep track of the lines using a specific entry. In our scheme, we assign a decay factor to the cache lines and the dictionary entries. If a dictionary entry decays, it means all lines using that specific dictionary entry have also aged out. This is because every time a cache line is accessed, the decay counter is reset. The reset action helps to make room for new dictionary entries that are referenced more often, and further improves our compression efficiency. The hardware required to implement decay counters is also much cheaper than the hardware required for FIFO/LRU-based schemes, as we need to keep track of significantly less information.

The decay factor can be updated dynamically based on the cache performance. A highly aggressive decay factor will end up evicting cache lines that are likely to be reused. This will reduce the hit rate, and end up increasing bandwidth contention.
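The following sketch illustrates the decay idea with simple counters; the counter width, decay interval, and data structures are assumptions, not the hardware design.

```python
# Illustrative decay-counter scheme for dictionary entries and cache lines.
# Counter width and decay interval are assumed values.
DECAY_START = 7   # counter value assigned on access (assumed 3-bit counter)

line_counters = {}    # line id -> decay counter
entry_counters = {}   # dictionary index -> decay counter

def on_line_access(line_id, dict_indices_used):
    # Reset the decay counters for the line and every entry it references.
    line_counters[line_id] = DECAY_START
    for idx in dict_indices_used:
        entry_counters[idx] = DECAY_START

def on_decay_tick():
    # Periodically decrement all counters; anything reaching zero has aged out
    # and is considered replaceable.
    for table in (line_counters, entry_counters):
        for key in list(table):
            table[key] -= 1
            if table[key] <= 0:
                del table[key]

# Example: one access followed by one decay tick.
on_line_access("lineA", [2, 5])
on_decay_tick()
print(entry_counters)   # {2: 6, 5: 6}
```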

4.1.5 DDC Performance

In this section, we review the results of our experiments when running a range of graphics workloads. In Table 4.2, we show the distribution of the compressed words across the three compression categories: i) dictionary compress, ii) zero compress, and iii) one compress. The dictionary compress column shows the percentage of words that were compressed using the dynamic dictionary entries. The zero compress column shows the percentage of words that were partially or completely compressed based on a zero doubleword. The one compress column shows the same for a doubleword that is all ones. We can observe from the table that for Civilization 5, up to 90% of the compression occurs due to zeroes. If we look at Table 2.1, we can see that for this game, the percentage of unknown words is much lower (i.e., 61.69%). This means a very high percentage of this game is compressible and the reliance on dictionary swapping is limited. This can be observed in Figures 4.4 and 4.5, where we plot the bandwidth as a percentage of a single dictionary compressor. The results show high bandwidth savings using our compression scheme. On the other end of the spectrum, we can observe from Table 4.2 that 3DMark Vantage, a DirectX 10 benchmark, also has 96.3% of its compressible data in the form of zeroes or partial zeroes. However, inspecting Table 2.1, we can see that the benchmark consists of 85.7% unique words. This is why we do not see much bandwidth savings in Figure 4.4 and Figure 4.5. Over the eight workloads, we see an average bandwidth savings of 18.55% for reads and 11.01% for writes. From Table 4.2, Formula 1 contains the highest percentage of dictionary-based compressed words. Our dual dictionary based scheme still manages bandwidth savings of 12.75% and 7.51% for reads and writes, respectively. Our cache architecture can theoretically achieve 2X logical expansion, but in practice with real-world workloads we see closer to 1.2X. This shows that our dual dictionary based scheme is able to save bandwidth without adding bandwidth pollution by transferring the dictionary entries on the memory bus.

4.1.6 Summary

As we can see from our experiments, our proposed dual dictionary scheme can help improve bandwidth by transferring data from the cache in compressed form. We also analyze the graphics workloads and conclude that dictionary-based compression schemes are less suitable for such workloads.


Workloads                Dictionary Compress   Zero Compress   One Compress
3D Mark Vantage DX10     3.36%                 96.31%          0.32%
Assassins Creed 3 DX11   38.76%                50.41%          10.81%
BioShock Infinite DX11   4.51%                 79.34%          16.14%
Formula 1 DX11           52.89%                41.70%          5.40%
Civilization 5 DX11      9.34%                 89.97%          0.67%
3D Mark Cloudgate DX11   5.46%                 88.63%          5.90%
CoD BlackOps DX11        9.79%                 74.66%          15.53%
3D Mark IceStorm DX11    24.31%                36.83%          38.85%

Table 4.2: The compressed word distribution

Figure 4.4: The Write BW Savings


Figure 4.5: The Read BW Savings

A base-delta scheme, as described in [5], could provide better compression ratios due to its ability to select multiple bases per cache line. In Section 2.4, we considered another key issue with compressed cache designs: memory efficiency. The variable size of compressed blocks in the cache can result in inefficient data access patterns in the DRAM. Low DRAM access efficiency can significantly increase latency and power consumption, resulting in poor system performance that can neutralize any gains realized due to compression. We propose a solution to tackle this issue so that memory efficiency can be significantly improved and the benefits of compression can be realized to their full potential.

4.2 Compression Aware Victim Cache

In this section we describe our approach of using a victim cache as an efficient burst unit to improve DRAM efficiency. Our victim cache is based on the decoupled compressed cache implementation proposed by Sardashti et al. [17]. We use a fully associative cache with decoupled supertags and data lines. Each supertag can address multiple compressed data blocks from the LLC. Figure 4.6 shows a cache block eviction from the LLC. When a cache block is evicted from the LLC, the cache controller looks at the metadata to determine if the block holds compressed data.


Figure 4.6: Block diagram of LLC with Victim Cache

The compressed blocks are then stored in the victim cache, while the uncompressed blocks go straight to DRAM. As the supertag in the victim cache can address a lot more data, multiple compressed blocks from the LLC get mapped to a single supertag. This allows the supertag to have a much larger data burst size during eviction. The supertag size can be configured to target a specific average burst size based on the DRAM parameters. Using the Least Recently Used (LRU) replacement policy might not always result in efficient burst sizes, so we use a burst-efficiency-aware replacement policy.

Figure 4.7: Super Tag structure for the Victim Cache

Figure 4.7 shows the structure of the victim supertag. The supertag has status bits for each LLC block stored, as well as a valid bit and a count of the subblocks used. During block replacement, the last four entries of the LRU stack are searched to find a supertag that has exceeded the minimum burst size required for efficient bursting.

This allows us to increase the number of efficient bursts to DRAM and gives partially occupied supertags more time to accumulate data for an efficient burst. The victim cache can also be used to solve other challenges with compressed cache design, as described in the following sections.
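A minimal sketch of this burst-aware victim selection is shown below. The LRU search depth follows the description above, while the minimum-burst threshold, data structure, and helper names are illustrative assumptions.

```python
# Sketch of burst-aware victim selection in the victim cache.
# The data structure and threshold value are illustrative assumptions.
from collections import namedtuple

Supertag = namedtuple("Supertag", ["tag", "accumulated_bytes"])

LRU_SEARCH_DEPTH = 4      # search the last four entries of the LRU stack
MIN_BURST_BYTES = 256     # minimum size for an efficient DRAM burst (assumed)

def select_victim(lru_stack):
    """lru_stack is ordered from most- to least-recently used supertags."""
    candidates = lru_stack[-LRU_SEARCH_DEPTH:]
    # Prefer a supertag near the LRU end that already holds an efficient burst.
    for st in reversed(candidates):
        if st.accumulated_bytes >= MIN_BURST_BYTES:
            return st
    # Otherwise fall back to plain LRU, giving partially filled supertags
    # more time to accumulate data for an efficient burst.
    return lru_stack[-1]

stack = [Supertag("A", 320), Supertag("B", 64), Supertag("C", 512), Supertag("D", 128)]
print(select_victim(stack))   # picks C: within the search window and large enough
```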

4.2.1 Design Challenges

In this section we discuss challenges associated with compressed caches and how a victim cache can be used to address these challenges. Real-world workloads frequently issue byte-masked writes, which are a trivial operation with uncompressed data. As seen in Figure 4.8, a cache hit results in the byte-masked data being stored on top of the uncompressed block, while a cache miss results in a write allocate to store the dirty bytes, which are later transferred to memory.

Figure 4.8: Byte-masked write with uncompressed data

For a compressed cache, a byte-masked write can turn into a high-latency operation. This is because we would need to decompress the data before we can merge the byte-masked bytes. If we receive multiple byte-masked writes, the latency impact becomes much worse. We can use the victim cache as an elegant solution to this problem. By adding a new status bit (DV), as shown in Figure 4.7, we can mark compressed lines with pending byte-masked writes.


Instead of merging the dirty bytes, we can store them in the victim cache, as shown in Figure 4.9. This way we can merge multiple byte-masks in the victim cache and, when the mask becomes full, compress the data and completely overwrite the previous block. This significantly reduces the read-modify-write latency associated with servicing every byte-masked write. If a read request arrives before the byte-mask is full, there is no performance hit: the compressed data needs to be decompressed anyway, and during the decompression operation the dirty bytes are fetched in parallel from the victim cache. The uncompressed data is then merged with the dirty bytes and returned to the client that requested it.
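The sketch below illustrates this deferred-merge idea; the mask representation, buffer structure, and function names are simplified assumptions rather than the actual victim cache implementation.

```python
# Sketch of deferring byte-masked writes for a compressed block.
# Mask representation and buffer structure are simplified assumptions.
LINE_SIZE = 64

pending = {}   # block address -> (bytearray of dirty bytes, bytearray mask)

def byte_masked_write(addr, offset, data):
    buf, mask = pending.setdefault(addr, (bytearray(LINE_SIZE), bytearray(LINE_SIZE)))
    buf[offset:offset + len(data)] = data
    for i in range(offset, offset + len(data)):
        mask[i] = 1
    if all(mask):
        # The whole line is now dirty: compress and overwrite the old block outright.
        return "recompress-and-overwrite"
    return "deferred"                       # dirty bytes held in the victim cache

def read(addr, decompress_block):
    # Decompress the stored block and merge any pending dirty bytes.
    line = bytearray(decompress_block(addr))
    if addr in pending:
        buf, mask = pending[addr]
        for i in range(LINE_SIZE):
            if mask[i]:
                line[i] = buf[i]
    return bytes(line)

# Example with a stand-in decompressor that returns an all-zero line.
byte_masked_write(0x100, 4, b"\xAA\xBB")
print(read(0x100, lambda addr: bytes(LINE_SIZE)).hex())
```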

Figure 4.9: Byte-masked write in compressed cache


4.2.2 Results

In this section, we analyze the results from our experiments with the efficient burst unit (EBU). From Figure 4.10, we can see significant improvements to burst efficiency across the entire sweep of workloads. The larger tag size of the victim cache, coupled with the efficiency-aware replacement scheme, significantly increases the average burst size for our cache. Figure 4.11 shows significant improvements as well, as the larger bursts access the same row buffer, causing the hit rate to increase.

Figure 4.10: Burst Efficiency with EBU on and off

4.2.3 Summary

We use a novel approach of accumulating compressed data in a victim cache to improve memory bursting. While victim caches are traditionally used to improve hit rates, we show that using a decoupled victim cache with super tags can help increase the burst size for compressed data, which improves memory efficiency and row buffer hit rates in DRAM. In Section 2.3.1, we discussed prior work in the field of smart controllers that learn from various features present in the workload to make dynamic decisions that help improve performance. We propose a smart cache controller that dynamically adjusts cache policies to better handle various challenges associated with compressed data management.


Figure 4.11: Row Buffer Hit rate with Victim EBU on and off

4.3 Smart Cache Controller

In Chapter 2 we discussed how the perceptron learning algorithm works and how it can be implemented efficiently in hardware to make dynamic performance decisions. We present a learning algorithm that can improve compressed data management in a cache by predicting when to compress data and deciding when to cache recently decompressed data, reducing latency on the critical path. Alameldeen and Wood [29] have proposed using a saturating counter to predict if an incoming block of data should be compressed in the cache. They store the compressed size of each block, along with the metadata, to predict if a cache miss could have been avoided with data compression. If the prediction is negative, the counter is updated to reflect a bias against compression. The goal of this prediction mechanism is to reduce access latency by only compressing blocks when the hit rate can be improved. This approach only minimizes compaction complexity, as it mitigates the need to manage compressed data when it does not improve performance. However, this method requires every block to be compressed in order to generate the metadata required for decision making, which is an additional cost, separate from the question of whether the data being stored should be compressed or not. Our compression prediction differs from this approach, as our work is focused on GPU LLCs, where efficient use of memory bandwidth is far more important than hit rate. This is because GPUs are massively parallel machines that value high throughput more than low latency.


We propose compressing an incoming block of data based on compressibility trends that our learning model accumulates. If the prediction says the data is not compressible, we save energy and avoid adding the latency associated with the data compression operation. Alameldeen and Wood [56] also propose an adaptive prefetch mechanism that is used to eliminate harmful prefetches by controlling the stride prefetches based on performance history. However, they do not address the memory efficiency issue with transferring variable-sized compressed data on the memory bus. Our prediction mechanism uses a perceptron learning method to prefetch more compressed data from an open DRAM page in order to increase bus utilization significantly. We next describe the features of our smart cache controller and explain how we use perceptron-based learning to make predictions that help improve performance in a compressed cache.

Figure 4.12: Conventional Compressed Cache

From Figure 4.12 we can see that a conventional compressed cache attempts to compress every write it processes and decompresses data on every read request to a compressed block. As mentioned in Section 2.3, not all writes can be compressed, due to the nature of the data pattern associated with the write. The first feature of our smart cache controller is the ability to predict when the data is compressible. When a compressor fails to compress data, it ends up adding unnecessary latency and wasting power.

While compression latency may not always be on the critical path, the block placement algorithm depends on the output of the compressor to decide how to compact the data in the compressed cache. Unnecessarily delaying this operation may end up stalling block placement for other blocks that are accessing the cache. Our controller utilizes feature tables for tracking each feature. For each LLC access, we use request-related information, such as the LLC client making the request, the data type (float, int), the dimensionality of the surface (1D, 2D, or 3D), and the address associated with the request, to populate our feature tables. An LLC client is a block of hardware within the GPU that shares data in the LLC. Some examples of LLC clients on an AMD Vega GPU [2] are the compute engines, the pixel engines, and the geometry engines. Our learning model has individual feature tables for each field in the metadata. The model is trained by adjusting the weights for each feature, based on the outcome of each compression attempt in the sample channel. Successful compressions have a positive correlation with the weights, while failed compression attempts impact the weights negatively. Adjustments are made using the perceptron learning method, resulting in a correlation between independent features and data compressibility. During prediction, the weights for the independent feature tables are added and compared against a threshold to make a decision. If the smart compressor predicts poor compression, then the controller chooses not to compress the data associated with that access, resulting in power and performance improvements. The second feature that our smart cache controller targets is decompression latency. As decompression latency is on the critical path of a cache read request, the controller must avoid it whenever possible. On a GPU, multiple clients may read the same data block, which could result in multiple decompression requests for the same compressed block. We use the learning algorithm to predict when such repeat accesses occur and store the result of the first decompression in the decompression buffer, avoiding any decompression latency on subsequent accesses. Figure 4.13 shows how each access is evaluated by our smart controller to enable the two features discussed so far. The third feature of our smart cache controller targets memory efficiency. Accessing DRAM is a long-latency operation, and inefficient access patterns, such as short read or write bursts to DRAM banks, can cause conflicts (as discussed in Section 2.3). This can make the latency grow much longer, even if bandwidth is reduced. Power is another major concern with inefficient access patterns, as DRAM power can consume as much as a third of the total system power [42]. Transferring compressed data on the memory bus can be inefficient due to the variable-sized nature of compressed data.

A cache managing uncompressed data will access memory in cache-line-sized bursts. As the cache line size is constant, it is easy for the memory controller to schedule accesses in a way that maximizes DRAM data bus utilization. The variable nature of compressed data makes this problem much more challenging for the memory controller. We use the prediction mechanism from our compression feature to enable prefetching in our cache. As our cache block compacts multiple compressed blocks into a single cache line, we can use the compression ratio to prefetch more data and improve data bus utilization for the compressed data. Increasing the burst size from a single DRAM bank reduces latency while improving bus utilization. If the data is uncompressed, we can avoid prefetching and let the memory controller optimize the scheduling of that access.
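As an illustration of how the compression ratio might drive the prefetch amount, the sketch below scales the burst size by the predicted ratio. The scaling rule, the maximum prefetch depth, and the predictor interface are assumptions for illustration, not the exact controller policy.

```python
# Sketch of compression-ratio-driven prefetching (scaling rule is assumed).
LINE_SIZE = 64           # bytes per uncompressed cache line
MAX_PREFETCH_LINES = 4   # cap on how far we prefetch into the open row (assumed)

def prefetch_lines(predicted_compressible, compression_ratio):
    if not predicted_compressible:
        # Uncompressed data: fetch just the requested line and let the
        # memory controller schedule the access normally.
        return 1
    # Compressible data: fetch enough neighboring lines so the burst carries
    # roughly one line's worth of compressed data per bank access.
    extra = int(compression_ratio)        # e.g., a 2:1 ratio -> 2 lines
    return min(max(extra, 1), MAX_PREFETCH_LINES)

# Example: with a predicted 2:1 ratio, fetch two 64-byte lines in one burst.
print(prefetch_lines(True, 2.0) * LINE_SIZE, "bytes per burst")
```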

Figure 4.13: Smart Compressed Cache

We cover the metrics used to evaluate our smart cache controller, and present the results from our experiments.


4.3.1 Smart Compression

We start with our smart compressor, which predicts whether compression is likely to be beneficial and then directs the cache whether to compress the data. Figure 4.14 shows how the smart controller is able to avoid initiating a large portion of the wasteful compression attempts. To accurately gauge the performance of our prediction scheme, we create an oracle predictor for compression. The oracle prediction mechanism is a perfect predictor that eliminates all data compression attempts that fail to reduce the size of the input data. Our smart predictor identifies 79% (on average) of the beneficial compression operations. This shows that, for graphics workloads, there is a strong correlation between the compressibility of a data block and our feature set (i.e., LLC client ID, address, and metadata), as discussed in Section 2.3.1. We also consider the impact of false positives, cases where we missed out on the benefits of cache compression due to a wrong prediction by our model. We find that their performance impact is small, with the false positives (i.e., missed opportunities) making up 3.8% of the total predictions (on average). We acknowledge that false positives can increase bandwidth demands, but with low misprediction rates, the impact on bandwidth is low for our workloads.

4.3.2 Smart Decompression

Next, we evaluate the benefits of our smart decompressor. The goal of the decompressor is to detect when a compressed data block is likely to be reused soon. Repeated accesses to a compressed data block can result in multiple wasteful decompression operations. Our smart decompressor identifies such candidate blocks and stores the output of the first decompression in the decompress buffer. Subsequent hits access the decompress buffer rather than the compressed data in the cache, resulting in lower latency for latency-sensitive operations. To illustrate the benefits of our design, we create an oracle that perfectly identifies candidate blocks and compare the performance of our smart decompressor to the oracle. Figure 4.15 compares the performance of the smart decompressor against this oracle predictor. We see that the smart decompressor reduces latency for 58% of the repeated compressed reads, on average, taking decompression off the critical path. We can ignore the small number of false positives in this case, as there is little penalty associated with storing data in the decompress buffer. Any unused data in the decompress buffer simply gets replaced when new decompressed data is inserted. While a 58% reduction in repeated decompressions is significant, Teran et al. [36] reported that perceptron-based reuse prediction is much more effective when the program counter of the memory instruction is used as a feature in the prediction.


Figure 4.14: Results for smart data compression. For each workload, we show the oracle performance as our baseline. We also plot the wasteful compressions that were identified using our model as a percentage of the oracle, as well as the percent of false positives, as a percentage of total predictions.

While our cache traces did not have this information available, based on their work, we are confident that adding the instruction ID to our existing features (client ID and surface metadata) will help improve performance even further. We leave this study for future work.

4.3.3 Smart Prefetch

In Section 2.3.1, we showed how data compression can reduce memory efficiency by creating conflicting memory access patterns. In this case, memory performance is impacted due to a growing number of bank conflicts, leading to poor data bus utilization. While there have been many algorithms proposed for scheduling memory accesses [44, 57, 58, 59], they require additional hardware (e.g., reorder buffers and dynamic page management) to improve the performance of the memory access stream. This adds to memory latency and increases power consumption.


Figure 4.15: Results for smart data decompression. For each workload, we show the oracle perfor- mance as our baseline. We also plot the repeated decompressions that were eliminated using our smart decompressor as a percentage of the oracle.

While scheduling requests, the DRAM controller must keep a balance between the conflicting goals of fairness across accesses and achieving high throughput for the entire memory system. A DRAM client (i.e., the LLC) can reduce this work significantly by creating a more efficient memory access pattern, reducing the amount of management needed. This results in fewer variables for the DRAM controller to consider while scheduling accesses. To evaluate our smart prefetch feature, we measure the memory efficiency at the LLC-to-memory interface. For each workload, we consider three different configurations: 1) compression off, 2) compression on, and 3) compression on with smart prefetch. Memory efficiency is measured as the effective bandwidth of the cache-to-memory interface, which includes penalties due to row conflicts for each memory bank. Figure 4.16 shows how compression can reduce memory efficiency by increasing the number of bank conflicts. We can see that, across all workloads, turning data compression on drops memory efficiency by 7.5% on average.

Workloads with high compressibility, such as Witcher 3 and Far Cry Primal, are more heavily impacted, with memory efficiency reduced by 10.1% and 12.8%, respectively. Our smart prefetching algorithm increases memory efficiency by prefetching compressed data to increase the amount of data associated with each bank access. Given that the compressed data is associated with a surface, prefetching neighboring compressed data should result in additional useful data being brought into the cache. A mechanism that always prefetches data can pollute the cache. Our results show that our smart prefetch scheme achieves almost the same efficiency as the uncompressed configuration; our smart cache compensates for the loss in memory efficiency due to data compression.

Figure 4.16: Results for smart prefetching. For each workload we show: the efficiency with compression turned off (the baseline), the memory efficiency drop with compression turned on, and the efficiency improvements when using smart prefetching.

From our results, the Alien vs Predator (AVP) game appears to be a clear outlier: it receives the least benefit from all our smart prediction schemes. There are two possible reasons for this outcome.


Figure 4.17: Area impact of feature tables

First, from Figure 4.16, it is clear that AVP does not experience a large drop in memory efficiency when using compression. This is because the data stream in AVP exhibits limited compressibility compared to the other workloads. The second reason is that this particular simulated frame has the lowest hit rate in our cache model. Considering the combined impact of these two observations, it is clear that learning from changing access patterns yields fewer benefits. We believe learning over multiple frames instead of a single frame may help smooth out the noise in the learning algorithm, resulting in improved performance. We will evaluate this trade-off in our future work. In addition to the performance metrics, we also used the CACTI [60] tool to evaluate how scaling the number of features and the depth of each feature table impacts the area of the cache. CACTI allows us to model special structures in the cache by adding additional bits to the tag array.


While the control logic is not modeled, CACTI can be used to generate approximate area numbers for the data storage dedicated to special structures in the cache. Figure 4.17 shows the results from our experiments. It is clear from the chart that, for a specific area target, a trade-off must be made between the number of features and the number of entries dedicated to each feature. For example, Configs 2C and 3B have identical area impact, with the former supporting eight features and the latter supporting four features with twice the entries in each feature table. Config 3C shows that a large number of features with deep feature tables has a significant area impact (37%). In addition to the area impact, performance might also suffer with a larger number of features, as it becomes more challenging to find a strong correlation with performance. We should strive to use the minimum number of features that have a strong correlation to performance in order to get the best balance between performance and area overhead.

4.3.4 Summary

We propose a smart cache controller that uses perceptron learning to address the competing challenges associated with managing compressed graphics data in a cache. We show how data compression can lead to poor memory efficiency, increased latency, and greater power consumption. Utilizing a trace-driven model, we explored how a smart controller can help alleviate these issues. We show that the metadata associated with graphics workloads can be used as an indicator of data compressibility. Leveraging this information, we are able to make accurate predictions to improve the performance of a compressed cache. We analyze our proposal using real-world graphics workloads, including the latest high-end games and GPU benchmarks. We show that our smart controller can successfully avoid most unnecessary compressions and decompressions, and can improve the memory efficiency of a compressed cache by intelligently prefetching compressed data to reduce row conflicts.

Chapter 5

Compression on Graphics Hardware

The motivating reason for using compression is to encode redundant information such that the encoded data is smaller in size than the original input. In Section 2.2 we looked at various cache compression algorithms that exploit redundancy to compress data. However, most cache compression algorithms do not consider the unique requirements associated with compression when implemented in a graphics pipeline. Figure 5.1 shows a simplified software pipeline for the popular DirectX programming interface. The pipeline stages perform various vertex- and pixel-related operations that require a significant amount of memory bandwidth. The resources required by a graphics application are preloaded in memory before the application begins. Each stage performs either a read or a read/write operation on those resources. Let us take a closer look at the resources used in various stages of the 3D pipeline. The initial stages in the pipeline perform matrix transforms, clipping, and lighting operations on the 3D geometry present in the scene. Once the vertices are transformed, rasterization is used to convert the vertices related to 3D objects into pixels in screen space. Pixel processing involves several memory-bandwidth-intensive operations, such as texture read/write, color read/write, and depth buffer accesses. For each pixel associated with the 3D geometry, the processor must determine if it is visible in screen space, what texture to map to it, and the final color value of the pixel. Next, we consider how compression plays a key role in improving performance in the graphics pipeline. Three different classes of compression are used in the graphics pipeline:

• Geometry Compression

• Texture Compression

• Depth and Color Compression

Figure 5.1: A simplified Direct3D pipeline.

We consider each class in more detail next.

5.1 Geometry Compression

Figure 5.1 shows the first two stages in the 3D pipeline, which process vertex information from the 3D models present in a scene. The models are stored as mesh structures containing encoded connectivity information for all the vertices. Game developers try to optimize the representation of 3D objects by encoding them with the fewest vertices possible. Modern GPUs have special vertex processing engines [2, 61] that are optimized for working with such mesh structures.

Deering [62] proposes a lossy compression algorithm to compress the data associated with each vertex. A vertex is described by position, color and normal information. The X, Y, Z position coordinates are represented as floating point numbers. Deering proposed reducing the precision of these floating point numbers to 16-bit values for each component, and observed that the deltas between neighboring vertices in the mesh can be represented with even fewer bits, which significantly reduces the data size. This method is lossy in nature due to the reduced precision. This contrasts with cache compression algorithms, which need to be lossless, since a compressed cache must return the same raw data that was initially stored in the cache. The RGBA color information encodes the red, green, blue and alpha channel values, and is treated similarly to the position information: a reduced-precision version can be compressed using the same delta compression scheme as the position parameters. In the method proposed by Deering [62], the compression ratio versus quality trade-off is made by changing the precision used for each parameter.

Chow [63] argues that this method requires manual experimentation to achieve an acceptable degree of image quality, which can be extremely time consuming for large databases. He proposes an improved algorithm that creates more efficient mesh structures, and includes a method to automatically pick the precision needed to preserve image quality before compressing the geometry data. Since the proposed compression operation would be run offline, before the geometry information is read by the GPU, the encoding operation need not be fast.

Hoppe [64] observes that offline compression schemes can only handle static geometry, and thus are not usable for animated geometry that changes every frame. Hoppe instead proposes a transparent vertex cache that stores data in uncompressed form and supports random access, handling dynamic as well as static geometry. Reuse of transformed vertices from the on-chip vertex cache can significantly reduce memory bandwidth requirements, thus minimizing the need to compress the vertex information. Graphics card vendors, such as AMD, have released tools [65] that help developers optimize their triangle mesh structures for better performance with a vertex cache.
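
As a rough illustration of the quantize-then-delta idea described above, the C++ sketch below reduces a coordinate to 16-bit precision and then stores deltas between neighboring vertices; the value range, traversal order, and delta widths are simplifying assumptions, not the exact encoding of [62].

    #include <cstdint>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Lossy step: quantize a float coordinate in [-1, 1] to a signed 16-bit value.
    static int16_t quantize16(float v) {
        return static_cast<int16_t>(std::lround(v * 32767.0f));
    }

    int main() {
        // Toy vertex strip: x coordinates of neighboring vertices in a mesh.
        std::vector<float> x = {0.250f, 0.251f, 0.253f, 0.252f};

        // Encode the first vertex absolutely, then store deltas between neighbors.
        std::vector<int32_t> encoded;
        int16_t prev = quantize16(x[0]);
        encoded.push_back(prev);                 // absolute 16-bit value
        for (size_t i = 1; i < x.size(); ++i) {
            int16_t q = quantize16(x[i]);
            encoded.push_back(q - prev);         // narrow delta needs few bits
            prev = q;
        }
        for (int32_t e : encoded) std::printf("%d ", e);
        std::printf("\n");                       // deltas stay small for nearby vertices
        return 0;
    }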

5.2 Texture Compression

Before we discuss texture compression, we first need to understand why textures are used on a GPU. Textures are image patterns used to add detail to a 3D surface. Figure 5.2 shows how the visual quality of a simple cube is improved by mapping a texture to its surface. Texture mapping is used because it is easier to add detail by mapping a texture pixel (texel) onto a pixel of a visible 3D surface than to compute the value of every screen pixel procedurally. It is a cheap way to improve scene quality and significantly improves the performance of a GPU.

Figure 5.2: Texture Map example by Elise Tarsa [66].

A game typically has several textures associated with it, and processing each pixel may result in multiple accesses to texture memory. This consumes a significant amount of memory space and bus bandwidth. The file sizes associated with modern games have been increasing at an exponential rate due to higher resolution textures [3], so compression is necessary to manage the rapidly growing texture sizes. Graphics processors have dedicated hardware to make texture accesses and texture mapping more efficient, and game development APIs and graphics hardware have also added support for compressing textures on the GPU [67, 68, 69].

Texture compression and cache compression have requirements that do not completely overlap. Beers et al. [70] specify some unique properties associated with texture compression. We need to consider these requirements and understand how they contrast with cache compression objectives. The major difference between cache compression and texture compression is that texture compression is, in most cases, lossy; all the dedicated texture hardware on a GPU assumes lossy algorithms. Compression implemented in the last-level cache, in contrast, cannot be lossy, as the cache handles data from multiple clients and needs to return the exact same data on all read requests. Another unique property of textures is that they can be accessed randomly, as any part of the texture can map to a pixel on the 3D geometry, depending on the complexity of the geometry. This means compressed textures need to support fast decompression for random accesses to compressed blocks, given that read requests will most likely be on the critical path. Compressed caches typically compress data at a cache-line granularity, so any read request will typically fall within a single compressed block. Finally, texture compression can be done offline or during the initial loading of the gaming workload, so the compression operation can be slow. While compression in a cache is also not on the critical path, it does delay block placement for a compressed block, which may cause performance issues for subsequent reads that are on the critical path.
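
To illustrate why random access favors fixed-rate block formats, the C++ sketch below locates the compressed block that holds an arbitrary texel with simple arithmetic; the 4x4 block size and 8-byte compressed block footprint are assumptions in the spirit of BC1-style formats, not a specific GPU format.

    #include <cstddef>
    #include <cstdio>

    // With a fixed-rate format, every 4x4 texel block compresses to the same
    // number of bytes, so the block holding texel (x, y) is found with a
    // closed-form computation -- no sequential decode is needed.
    static size_t block_offset(int x, int y, int tex_width,
                               int block_dim = 4, size_t block_bytes = 8) {
        int blocks_per_row = tex_width / block_dim;
        int bx = x / block_dim;
        int by = y / block_dim;
        return (static_cast<size_t>(by) * blocks_per_row + bx) * block_bytes;
    }

    int main() {
        // Texel (137, 42) in a 1024-texel-wide texture.
        std::printf("compressed offset: %zu bytes\n", block_offset(137, 42, 1024));
        return 0;
    }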

5.3 Depth and Color Compression

Depth compression [71], also known as Z compression, is another essential operation on a modern GPU [72] that helps improve memory bandwidth. A depth buffer stores the depth values of the 3D objects in a scene and is used to determine whether an opaque object closer to the viewer is obscuring an object behind it. An occluded object need not be rendered, which improves rendering performance, and the depth buffer is an essential part of the modern graphics pipeline. While depth buffers improve performance, they also stress the memory system, since depth buffer read/write operations can consume between 40% and 100% of the memory bandwidth for each pixel processing operation [72]. Unlike texture compression, depth compression cannot be lossy, as data loss may result in visible artifacts in the rendered scene.

Typical depth compression algorithms [73, 74] store the depth values for a fixed number of pixels forming a pixel block. The algorithms compress data by calculating the deltas between the various values, similar to the base-delta [5] algorithm. For example, Morein et al. [74] take advantage of the fact that depth buffers typically track the maximum and minimum depth values, known as Z-max and Z-min. They calculate deltas for the depth values from either Z-max or Z-min, based on whichever selection provides the narrowest delta range. The selected Z value is then used to encode the deltas, which are stored in the Z buffer in compressed form. Color values associated with a pixel can also be compressed using a similar delta compression algorithm; hardware vendors, such as AMD [7] and NVIDIA [61], use this approach to store narrow deltas and improve memory bandwidth.
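
The C++ sketch below captures the spirit of this Z-min/Z-max scheme: each depth value in a tile is encoded as a narrow delta from whichever of the two extremes is closer, and the tile falls back to uncompressed storage if the widest delta exceeds a fixed budget. The tile size, the 12-bit delta budget, and the per-pixel base-selection bit are illustrative assumptions, not the exact encoding of [74].

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Check whether a tile of depth values can be encoded as narrow deltas from
    // either the tile's Z-min or Z-max (whichever is closer for each pixel).
    // A real encoder would also store one bit per pixel recording which base was used.
    static bool compress_tile(const std::vector<uint32_t>& z, int delta_bits = 12) {
        uint32_t zmin = *std::min_element(z.begin(), z.end());
        uint32_t zmax = *std::max_element(z.begin(), z.end());
        uint32_t worst = 0;                      // widest delta we would have to encode
        for (uint32_t v : z)
            worst = std::max(worst, std::min(v - zmin, zmax - v));
        bool fits = worst < (1u << delta_bits);
        std::printf("widest delta = %u -> %s\n", (unsigned)worst,
                    fits ? "compressed" : "stored uncompressed");
        return fits;
    }

    int main() {
        // Depth values covering a nearly flat surface are tightly clustered.
        compress_tile({0x00FF1200, 0x00FF1210, 0x00FF1235, 0x00FF1208,
                       0x00FF1222, 0x00FF1219, 0x00FF1241, 0x00FF1204});
        return 0;
    }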

5.4 Compute Workloads

So far we have looked at several approaches for compressing data associated with graphics workloads, and focused on how graphics metadata plays a big role at each pipeline stage. Next, we switch our focus to compute workloads to see if we can draw any comparisons between the two types of workloads.

We first attempt to compress the AES encryption-decryption workload from the Hetero-Mark benchmark suite [53] using the base-delta-immediate algorithm [5]. Figure 5.3 shows that we can obtain bandwidth savings of around 12% with base-delta compression enabled in the LLC. The rather low bandwidth savings are expected, since encrypted data has little redundancy to exploit. In such situations the obvious choice would be to turn off compression, since a large number of compression attempts will fail. However, our LLC compression is transparent to the application running on the GPU. It is difficult for the LLC to know what type of compute workload is accessing memory, so the compressor in the LLC will attempt to compress every block of data. In Section 4.3 we proposed using graphics metadata to predict compressibility, but compute workloads do not have this metadata available; they are treated as linear buffers holding one-dimensional data. We therefore need to identify features in the compute workloads that can generate metadata hints about the nature of the workload. These features can then be used to train the smart controller to make compression predictions. Figure 5.4 shows the bandwidth savings for all of our compute workloads, and Figure 5.5 shows the impact on bus utilization.


Figure 5.3: Bandwidth savings - AES Benchmark

Figure 5.4: Bandwidth savings - Compute Workloads

It is important to note that we do not model the memory controller in our simulator, so our memory analysis is done from the point of view of the client accessing the memory. In this case, the LLC is responsible for making sure that good bank rotation occurs before a different row on the same bank is accessed. With compressed data, this may not always be possible, resulting in a high number of row conflicts and lower bus utilization.
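
As a toy illustration of this effect, the C++ sketch below decodes addresses into (bank, row) pairs and counts row-buffer conflicts for two access streams; the address mapping, the four-bank layout, and the example streams are assumptions for illustration and do not reflect our simulator's DRAM model.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy DRAM address decode: 64-byte column region, 2 bank bits, rest row.
    // These bit positions are illustrative assumptions only.
    struct Decoded { uint64_t bank; uint64_t row; };

    static Decoded decode(uint64_t addr) {
        return { (addr >> 6) & 0x3, addr >> 8 };
    }

    // Count accesses that hit a bank whose open row differs from the one requested.
    static int count_row_conflicts(const std::vector<uint64_t>& stream) {
        const uint64_t kNoRow = ~0ull;
        uint64_t open_row[4] = { kNoRow, kNoRow, kNoRow, kNoRow };
        int conflicts = 0;
        for (uint64_t a : stream) {
            Decoded d = decode(a);
            if (open_row[d.bank] != kNoRow && open_row[d.bank] != d.row) ++conflicts;
            open_row[d.bank] = d.row;
        }
        return conflicts;
    }

    int main() {
        // Sequential lines rotate through the banks before reusing one...
        std::vector<uint64_t> rotated   = { 0x000, 0x040, 0x080, 0x0c0 };
        // ...while a stream that keeps returning to one bank with new rows does not.
        std::vector<uint64_t> clustered = { 0x000, 0x100, 0x200, 0x300 };
        std::printf("rotated: %d conflicts, clustered: %d conflicts\n",
                    count_row_conflicts(rotated), count_row_conflicts(clustered));
        return 0;
    }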

Figure 5.5: Bus Utilization - Compute Workloads

Another interesting difference between graphics and compute workloads is the way the base-delta algorithm is applied. Recall from Chapter 2 that the base-delta algorithm searches for an efficient base value, as well as the best granularity for the element size, to perform delta comparisons and encode narrow values. In the previous sections, we observed that the information present in the graphics metadata, including the data format, the bytes-per-pixel value, and similar information, can be used to select an efficient element size (8/16/32 bytes) for the delta comparisons in a data block. In the case of the depth buffer, the Z-max and Z-min values provide an efficient base value to use for the compression operation. Compute workloads have no such hints, since compute metadata does not carry this type of information. This means the compressor needs to perform an additional search for an efficient base value, and must determine the granularity for compression on the fly, which adds latency to the compression operation.
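
The C++ sketch below illustrates this extra search: lacking format hints, a compressor tries several candidate element sizes before it finds one whose deltas are narrow. The 32-byte block, the candidate granularities, the first-element base, and the little-endian interpretation are illustrative assumptions, not the exact search used by base-delta-immediate [5].

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Returns true if every element in the block, read at the given granularity,
    // stays within max_delta of the first element (used here as the base).
    // Elements are interpreted little-endian for simplicity.
    static bool deltas_fit(const uint8_t* block, size_t bytes,
                           size_t elem_size, uint64_t max_delta) {
        uint64_t base = 0;
        std::memcpy(&base, block, elem_size);
        for (size_t off = elem_size; off + elem_size <= bytes; off += elem_size) {
            uint64_t v = 0;
            std::memcpy(&v, block + off, elem_size);
            uint64_t delta = (v > base) ? v - base : base - v;
            if (delta > max_delta) return false;
        }
        return true;
    }

    int main() {
        // A toy 32-byte block that is really an array of 32-bit counters.
        std::vector<uint32_t> counters = { 1000, 1003, 1007, 1001, 1010, 1004, 1002, 1008 };
        std::vector<uint8_t> block(counters.size() * sizeof(uint32_t));
        std::memcpy(block.data(), counters.data(), block.size());

        // Without format metadata, the compressor must try granularities until one works.
        const size_t candidates[] = { 8, 4, 2 };
        for (size_t elem : candidates) {
            if (deltas_fit(block.data(), block.size(), elem, 0xFF)) {
                std::printf("compressible with %zu-byte elements and 1-byte deltas\n", elem);
                return 0;
            }
        }
        std::printf("block left uncompressed\n");
        return 0;
    }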


5.5 Key Observations

Based on our analysis of the compression algorithms used in the GPU pipeline, it is clear that the majority of the algorithms work by encoding deltas between neighboring values, similar to the delta-based cache compression algorithms discussed earlier [5, 73]. However, the granularity chosen for the delta operation depends heavily on the memory layout of the data, on the format type being encoded, and on whether the compression operation is lossy.

Graphics-specific data formats are represented in memory in a unique way, and knowledge of this memory layout needs to be exploited for cache compression to be effective. For example, a data format such as R8G8B8A8 indicates that 32 bits of data is split into four channels, with 8 bits each for the red, green, blue and alpha channels. Let us take two 32-bit values, 0x1f0f0f0f and 0x1e0e0e0e, in this data format and attempt to compress them. A dictionary-based compression scheme, as discussed in Section 2.2, cannot compress the pair, since the high-order bytes of the two values differ. A format-unaware base-delta approach will store the delta between the two values as 0x01010101, which still requires at least 25 bits to store. Splitting along the channel boundaries instead yields four independent deltas, each of which fits in a single bit. The independent base values can be amortized over the entire data block being compressed, resulting in a much higher compression ratio. Similar conclusions can be drawn from the position and depth buffer values.

Texture swizzling and tiling [75, 76] is another technique used on modern GPUs to improve the cache performance associated with storing data structures for 2D/3D images. As 2D/3D structures are not accessed linearly in memory, the cache performance for such buffers may be sub-optimal, so GPUs perform address swizzling to improve cache behavior when such structures are encountered. When this type of swizzled or tiled data is stored in the cache, the cache compression algorithm must be aware of the data layout in order to achieve a high compression ratio.

Reflecting on the previous discussion of compression approaches, we can see that current GPU compression techniques use both lossy and lossless compression, along with specialized hardware that handles this data. GPUs are increasingly being used for general-purpose computing, so a well-designed compressed cache will need a compression algorithm that can handle the unique characteristics of graphics workloads along with general-purpose compute data. One way to approach this problem is with an automatic format detector that sits ahead of the compression engine and classifies data formats based on certain features in the data set. This would help predict the organization of the data in memory, which could be exploited to compress the data using one of the efficient compression algorithms discussed in Section 2.2. Based on the unique requirements of graphics workloads, the LLC may also need to support a precision modifier that decouples the 'lossy' step from the compression operation, with metadata indicating the loss in precision. Employing such a decoupling mechanism, the cache can compact the data based on the metadata while remaining completely unaware of whether the underlying algorithm is lossy or lossless. The other approach is for clients to provide format information, so the cache compressor can adjust the granularity used for compression and store metadata indicating the granularity and the compression output. This approach increases accuracy by removing the need to predict texture data formats, but potentially increases the internal bandwidth on the chip.
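
The worked example above is easy to verify in code: the C++ sketch below compares the format-unaware 32-bit delta of the two pixel values with their per-channel deltas.

    #include <cstdint>
    #include <cstdio>

    // Two R8G8B8A8 pixels from the example above; every 8-bit channel differs by 1.
    int main() {
        uint32_t a = 0x1f0f0f0f;
        uint32_t b = 0x1e0e0e0e;

        // Format-unaware delta over the whole 32-bit word: still a wide value.
        std::printf("32-bit delta: 0x%08x\n", a - b);            // 0x01010101

        // Channel-aware deltas: four tiny values that compress trivially.
        for (int c = 0; c < 4; ++c) {
            int da = (a >> (8 * c)) & 0xff;
            int db = (b >> (8 * c)) & 0xff;
            std::printf("channel %d delta: %d\n", c, da - db);   // 1 for every channel
        }
        return 0;
    }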

Chapter 6

Conclusion

In Chapter 2 we provided an overview of the state of the art in the field of on-chip data compression. We described in detail a number of hardware compression algorithms, as well as compressed cache designs. These algorithms and cache designs take advantage of compressed data to improve the effective cache size and the resulting performance. We also described the various challenges involved with compression, including data compaction, relocation, and partial-write handling of compressed blocks. In contrast to previous work, which focused on the benefits and challenges at a specific level of the memory hierarchy, we looked at the entire memory hierarchy and explored the potential impact of compressed data in the cache in terms of DRAM access efficiency. We also looked at state-of-the-art smart cache controllers that use intelligent learning algorithms based on perceptron learning [39]. Such perceptron-based predictors have been used to improve branch prediction [77, 35], coherence prediction [38], as well as block reuse in the cache [36]. We discussed possible applications that can exploit smart controllers as part of a compressed cache design.

In Chapter 4 we provided an overview of our dual-dictionary algorithm, which extends shared dictionary-based compressors, such as CPACK [16], to improve memory bandwidth in addition to the hit rate of the cache. We also showed results for our efficiency-aware compressed cache design, which uses a victim cache not only to increase the logical size of the cache, but also to evict data in a way that improves DRAM efficiency. In Section 4.3 we proposed a smart cache controller that addresses three key challenges with compressed caches on GPUs. We showed how our smart controller can improve memory efficiency through intelligent prefetching of compressed data, and reduce latency by caching recently decompressed data when reuse is predicted. We also demonstrated how a smart controller can take advantage of graphics-specific metadata to predict whether a data block is compressible, thus eliminating non-beneficial compressions in order to reduce the latency and power consumption of the compressor.

Our results were generated on a custom cache simulator that models novel compressed cache designs and compression algorithms. The simulator provides us with traditional cache performance metrics, such as hit rate, as well as DRAM access efficiency, which helps paint a complete picture of the impact of each compression approach on the efficiency of the memory system.

The state-of-the-art compressed cache research discussed in Chapter 2 was primarily carried out using compute workloads, and targeted CPU caches. In our work, we expanded this research to consider graphics workloads running on GPU caches. Our work highlighted some key challenges that were not considered in prior work, and in Chapter 4 we proposed solutions to help address some real-world challenges with data compression in GPU caches. We also approached data compression from a novel perspective in Chapter 5. We looked at prior work focused on compression in the graphics pipeline and observed key differences between the requirements of a cache compression algorithm and those of compression schemes targeting the graphics pipeline.

Our thesis lays a solid foundation for compressed cache designs in future GPUs, and identifies some key areas for further work. We suggest that future work will continue down two paths: 1) new cache compression algorithms, and 2) new cache compaction algorithms. As discussed in Chapter 5, the key difference between a generic compression algorithm and a graphics-specific algorithm is that graphics has knowledge of the data representation as stored in memory. A generic algorithm (e.g., the base-delta method [5]) tests different granularities within a data block to arrive at a good compression ratio. In contrast, graphics-specific metadata, such as the “RGBA” format representing four color channels, or the swizzle mode, provides a precise data layout that can be exploited for optimized data compression. An intelligent resource format predictor could use specific features in a generic data block to predict the resource metadata, saving a significant amount of on-chip bandwidth on the GPU. In addition to saving on-chip bandwidth, it could also improve compression ratios for compute data by avoiding an exhaustive search, or the use of fixed-size granularities, in the data compression algorithms.

Another important observation about the GPU compression pipeline, discussed in Chapter 5, is that a GPU might need to use both lossy and lossless compression algorithms to achieve improved performance, and it may not always be possible to unify graphics data compression under a single algorithm. A last-level cache cannot perform lossy compression by itself, as the cache clients expect their data to be returned unmodified from the cache. A solution to this problem is to use a combination of cache compression and client compression in the compressed cache design.

If we can unify the compression metadata for the different compression algorithms used by the cache and the clients, then each data block can be compressed by the client in a lossy way, or by the cache in a lossless way. The unified metadata can be used by the compressed cache to compact the data efficiently, irrespective of where the original compression occurred. This can also help improve internal bandwidth on the GPU for clients that consume significant bandwidth due to a large number of reads or writes.

Another promising path for future research is to expand the scope of the smart cache controller. In Section 4.3 we proposed using the perceptron learning method to handle compressed data in the cache efficiently. This controller could also be used for compression-aware cache request scheduling in the GPU. One fundamental way a GPU differs from a CPU is that the GPU has a number of fixed-function hardware blocks. As each fixed-function block has its own internal data path, with specific throughput rates and data bus widths, it may not always make sense to return cache requests in order. With client-side data compression, the compressed data transferred to a client may expand into more data than is needed to keep that client's data path fully utilized. Instead, we can reorder the requests issued by the various clients, with the goal of increasing the utilization of each client. If we can improve the compression ratio of the data returned, then we can improve overall GPU utilization and performance. The same principle can also be applied to the memory controller when it schedules cache misses from memory; the memory controller will need to keep memory efficiency in mind, as discussed in Section 2.4.

In summary, we have thoroughly evaluated past work pertaining to compressed caches. We have also explored the many challenges associated with designing compressed caches on GPUs. We have discussed new challenges not addressed in prior work, and evaluated various solutions to help address some of these challenges. We have also proposed a path forward for future research that can help unify graphics and compute compression for modern GPUs, leading to designs that can effectively support both types of workloads.

Bibliography

[1] J. Munkberg, P. Clarberg, J. Hasselgren, and T. Akenine-Möller, “High dynamic range texture compression for graphics hardware,” in ACM Transactions on Graphics (TOG), vol. 25, no. 3. ACM, 2006, pp. 698–706.

[2] “AMD Vega whitepaper,” https://radeon.com/downloads/vega-whitepaper-11.6.17.pdf, accessed: 2018-5-22.

[3] “Texture usage in modern games,” https://www.pcgamer.com/how-game-sizes-got-so-huge-and-why-theyll-get-even-bigger/, accessed: 2018-07-24.

[4] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.

[5] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: practical data compression for on-chip caches,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM, 2012, pp. 377–388.

[6] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011.

[7] “Delta color compression,” https://gpuopen.com/dcc-overview/, accessed: 2017-10-24.

[8] S. Mittal and J. S. Vetter, “A survey of architectural approaches for data compression in cache and main memory systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1524–1536, 2016.

[9] M. Ekman and P. Stenstrom, “A robust main-memory compression scheme,” in ACM SIGARCH Computer Architecture News, vol. 33, no. 2. IEEE Computer Society, 2005, pp. 74–85.


[10] J. Dusser, T. Piquet, and A. Seznec, “Zero-content augmented caches,” in Proceedings of the 23rd international conference on Supercomputing. ACM, 2009, pp. 46–55.

[11] J. Dusser and A. Seznec, “Decoupled zero-compressed memory,” in Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM, 2011, pp. 77–86.

[12] A. Seznec, “Decoupled sectored caches: conciliating low tag implementation cost,” in ACM SIGARCH Computer Architecture News, vol. 22, no. 2. IEEE Computer Society Press, 1994, pp. 384–393.

[13] L. Villa, M. Zhang, and K. Asanović, “Dynamic zero compression for cache energy reduction,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 214–220.

[14] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data cache design,” in ACM SIGOPS Operating Systems Review, vol. 34, no. 5. ACM, 2000, pp. 150–159.

[15] J. Yang, Y. Zhang, and R. Gupta, “Frequent value compression in data caches,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 258–265.

[16] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, “C-pack: A high-performance microprocessor cache compression algorithm,” IEEE transactions on very large scale integration (VLSI) systems, vol. 18, no. 8, pp. 1196–1208, 2010.

[17] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46, 2013, pp. 62–73.

[18] S. Sardashti, A. Seznec, and D. A. Wood, “Yet another compressed cache: A low-cost yet effective compressed cache,” ACM Trans. Archit. Code Optim., vol. 13, no. 3, pp. 27:1–27:25, Sep. 2016.

[19] M. Kjelso, M. Gooch, and S. Jones, “Design and performance of a main memory hardware data compressor,” in EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies., Proceedings of the 22nd EUROMICRO Conference. IEEE, 1996, pp. 423–430.


[20] J.-S. Lee, W.-K. Hong, and S.-D. Kim, “An on-chip cache compression technique to reduce decompression overhead and design complexity,” Journal of systems Architecture, vol. 46, no. 15, pp. 1365–1382, 2000.

[21] J. Yang, R. Gupta, and C. Zhang, “Frequent value encoding for low power data buses,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 9, no. 3, pp. 354–384, 2004.

[22] J. L. Núñez and S. Jones, “Gbit/s lossless data compression hardware,” IEEE Transactions on very large scale integration (VLSI) systems, vol. 11, no. 3, pp. 499–510, 2003.

[23] J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei, “A locally adaptive data compression scheme,” Communications of the ACM, vol. 29, no. 4, pp. 320–330, 1986.

[24] C.-L. Yu and J.-L. Wu, “Hierarchical dictionary model and dictionary management policies for data compression,” Signal processing, vol. 69, no. 2, pp. 149–155, 1999.

[25] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression: A significance-based compression scheme for l2 caches,” Dept. Comp. Scie., Univ. Wisconsin-Madison, Tech. Rep, vol. 1500, 2004.

[26] M. Farrens and A. Park, “Dynamic base register caching: A technique for reducing address bus width,” in ACM SIGARCH Computer Architecture News, vol. 19, no. 3. ACM, 1991, pp. 128–137.

[27] D. Citron and L. Rudolph, “Creating a wider bus using caching techniques,” in High-Performance Computer Architecture, 1995. Proceedings., First IEEE Symposium on. IEEE, 1995, pp. 90–99.

[28] S. Sardashti, A. Seznec, and D. A. Wood, “Skewed compressed caches,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014, pp. 331–342.

[29] A. R. Alameldeen and D. A. Wood, “Adaptive cache compression for high-performance processors,” in Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on. IEEE, 2004, pp. 212–223.

[30] N. Kim, T. Austin, and T. Mudge, “Low-energy data cache using sign compression and cache line bisection,” in Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI02). Citeseer, 2002.


[31] S. Kim, J. Lee, J. Kim, and S. Hong, “Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2011, pp. 420–429.

[32] J. Yang and R. Gupta, “Energy efficient frequent value data cache design,” in Microarchitecture, 2002.(MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on. IEEE, 2002, pp. 197–207.

[33] E. G. Hallnor and S. K. Reinhardt, “A compressed memory hierarchy using an indirect index cache,” in Proceedings of the 3rd Workshop on Memory Performance Issues: in conjunction with the 31st International Symposium on Computer Architecture. ACM, 2004, pp. 9–15.

[34] A. Seznec and F. Bodin, “Skewed-associative caches,” in PARLE’93 Parallel Architectures and Languages Europe. Springer, 1993, pp. 305–316.

[35] D. A. Jiménez and C. Lin, “Dynamic branch prediction with perceptrons,” in High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on. IEEE, 2001, pp. 197–206.

[36] E. Teran, Z. Wang, and D. A. Jiménez, “Perceptron learning for reuse prediction,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016, pp. 1–12.

[37] H. Wang and Z. Luo, “Data cache prefetching with perceptron learning,” arXiv preprint arXiv:1712.00905, 2017.

[38] D. Ghosh, J. B. Carter, and H. Daumé III, “Perceptron-based coherence predictors,” 2008.

[39] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958.

[40] A. Seznec, “The o-gehl branch predictor,” The 1st JILP Championship Branch Prediction Competition (CBP-1), 2004.

[41] M. Clark, “A new x86 core architecture for the next generation of computing,” in Hot Chips 28 Symposium (HCS), 2016 IEEE. IEEE, 2016, pp. 1–19.


[42] L. A. Barroso, J. Clidaras, and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” Synthesis lectures on computer architecture, vol. 8, no. 3, pp. 1–154, 2013.

[43] M. Mantor, “AMD Radeon HD 7970 with graphics core next (GCN) architecture,” in Hot Chips 24 Symposium (HCS), 2012 IEEE. IEEE, 2012, pp. 1–35.

[44] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality,” in Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. ACM, 2000, pp. 32–41.

[45] J. Edler, “Dinero IV trace-driven uniprocessor cache simulator,” http://www.cs.wisc.edu/~markhill/DineroIV/, 1998.

[46] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[47] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2sim: a simulation framework for cpu-gpu computing,” in Parallel Architectures and Compilation Techniques (PACT), 2012 21st International Conference on. IEEE, 2012, pp. 335–344.

[48] “Call of Duty: Black Ops,” https://www.callofduty.com/blackops, accessed: 2017-12-9.

[49] “Assassin's Creed 3,” https://www.ubisoft.com/en-us/game/assassins-creed-3/, accessed: 2017-12-9.

[50] “Civilization 5,” https://civilization.com/civilization-5, accessed: 2017-12-9.

[51] “3DMark suite,” https://www.futuremark.com/benchmarks/3dmark, accessed: 2017-12-9.

[52] “AMD SDK,” https://developer.amd.com/tools-and-sdks, accessed: 2018-08-07.

[53] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. McCardwell, A. Villegas, and D. Kaeli, “Hetero-mark, a benchmark suite for cpu-gpu collaborative computing,” in 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.

[54] G. Keramidas, K. Aisopos, and S. Kaxiras, “Dynamic dictionary-based data compression for level-1 caches,” in International Conference on Architecture of Computing Systems. Springer, 2006, pp. 114–129.


[55] A. C. Frery and T. Perciano, “Image data formats and color representation,” in Introduction to Image Processing Using R. Springer, 2013, pp. 21–29.

[56] A. R. Alameldeen and D. A. Wood, “Interactions between compression and prefetching in chip multiprocessors,” in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 228–239.

[57] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared dram systems,” in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 63–74.

[58] J. F. Martinez and E. Ipek, “Dynamic multicore resource management: A machine learning approach,” IEEE micro, vol. 29, no. 5, 2009.

[59] W.-D. Weber, “Method and apparatus for scheduling of requests to dynamic random access memory device,” Nov. 1, 2005, US Patent 6,961,834.

[60] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A tool to model large caches,” HP laboratories, pp. 22–31, 2009.

[61] NVIDIA, “GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made,” White paper, NVIDIA Corporation, 2014.

[62] M. Deering, “Geometry compression,” in Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. ACM, 1995, pp. 13–20.

[63] M. M. Chow, “Optimized geometry compression for real-time rendering,” in Visualization’97., Proceedings. IEEE, 1997, pp. 347–354.

[64] H. Hoppe, “Optimization of mesh locality for transparent vertex caching,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1999, pp. 269–276.

[65] D. Nehab, J. Barczak, and P. V. Sander, “Triangle order optimization for graphics hardware computation culling,” in Proceedings of the 2006 symposium on Interactive 3D graphics and games. ACM, 2006, pp. 207–211.


[66] E. Tarsa, “Texture map example,” https://sites.google.com/site/dlckcadtest2/tutorials/3d-programs/texture-and-bump-maps?tmpl=%2Fsystem%2Fapp%2Ftemplates%2Fprint%2F&showPrintDialog=1, accessed: 2018-07-24.

[67] J. Ström and M. Pettersson, “ETC2: texture compression using invalid combinations,” in Graphics Hardware, 2007, pp. 49–54.

[68] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson, “Adaptive scalable texture compression,” in Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics. Eurographics Association, 2012, pp. 105–114.

[69] K. I. Iourcha, K. S. Nayak, and Z. Hong, “System and method for fixed-rate block-based image compression with inferred pixel values,” Sep. 21, 1999, US Patent 5,956,431.

[70] A. C. Beers, M. Agrawala, and N. Chaddha, “Rendering from compressed textures,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 373–378.

[71] S. Morein and M. T. Wright, “Method and apparatus for controlling compressed z information in a video graphics system that supports anti-aliasing,” Jun. 18, 2002, US Patent 6,407,741.

[72] S. Morein et al., “Ati radeon technology,” in Graphics Hardware, 2000.

[73] J. E. DeRoo, S. Morein, B. Favela, and M. T. Wright, “Method and apparatus for compressing parameter values for pixels in a display frame,” Nov. 5, 2002, US Patent 6,476,811.

[74] S. L. Morein and M. A. Natale, “System, method, and apparatus for compression of video data using offset values,” Jul. 13, 2004, US Patent 6,762,758.

[75] M. Mantor, J. A. Carey, R. C. Taylor, T. A. Piazza, J. D. Potter, and A. E. Socarras, “3-d rendering texture caching scheme,” May 23, 2006, US Patent 7,050,063.

[76] D. P. Wilde, “Apparatus for dynamic xy tiled texture caching,” Oct. 27, 1998, US Patent 5,828,382.

[77] J. Karlin, D. Stefanovic, and S. Forrest, “The triton branch predictor,” 2004.
