A Case for Toggle-Aware Compression for GPU Systems

A Case for Toggle-Aware Compression for GPU Systems Gennady Pekhimenko†, Evgeny Bolotin?, Nandita Vijaykumar†, Onur Mutlu†, Todd C. Mowry†, Stephen W. Keckler?# †Carnegie Mellon University ?NVIDIA #University of Texas at Austin ABSTRACT bandwidth utilization (e.g., of on-chip and o-chip interconnects [15, 5, 64, 58, 51, 60, 69]). Several recent works focus on Data compression can be an eective method to achieve higher bandwidth compression to decrease memory trac by trans- system performance and energy eciency in modern data- mitting data in a compressed form in both CPUs [51, 64, 5] intensive applications by exploiting redundancy and data simi- and GPUs [58, 51, 69], which results in better system perfor- larity. Prior works have studied a variety of data compression mance and energy consumption. Bandwidth compression techniques to improve both capacity (e.g., of caches and main proves to be particularly eective in GPUs because they are memory) and bandwidth utilization (e.g., of the on-chip and often bottlenecked by memory bandwidth [47, 32, 31, 72, 69]. o-chip interconnects). In this paper, we make a new observa- GPU applications also exhibit high degrees of data redun- tion about the energy-eciency of communication when com- dancy [58, 51, 69], leading to good compression ratios. pression is applied. While compression reduces the amount of While data compression can dramatically reduce the num- transferred data, it leads to a substantial increase in the number ber of bit symbols that must be transmitted across a link, of bit toggles (i.e., communication channel switchings from 0 compression also carries two well-known overheads: (1) la- to 1 or from 1 to 0). The increased toggle count increases the tency, energy, and area overhead of the compression/decom- dynamic energy consumed by on-chip and o-chip buses due pression hardware [4, 52]; and (2) complexity and cost to to more frequent charging and discharging of the wires. Our support variable data sizes [22, 57, 51, 60]. Prior work has results show that the total bit toggle count can increase from addressed solutions to both of these problems. For exam- 20% to 2.2× when compression is applied for some compression ple, Base-Delta-Immediate compression [52] provides a low- algorithms, averaged across dierent application suites. We latency, low-energy hardware-based compression algorithm. characterize and demonstrate this new problem across 242 GPU Decoupled and Skewed Compressed Caches [57, 56] provide applications and six dierent compression algorithms. To miti- mechanisms to eciently manage data recompaction and gate the problem, we propose two new toggle-aware compres- fragmentation in compressed caches. sion techniques: Energy Control and Metadata Consolidation. These techniques greatly reduce the bit toggle count impact of 1.1. Compression & Communication Energy the data compression algorithms we examine, while keeping In this paper, we make a new observation that yet another im- most of their bandwidth reduction benets. portant problem with data compression must be addressed to implement energy-ecient communication: transferring data 1. Introduction in compressed form (as opposed to uncompressed form) leads Modern data-intensive computing forces system designers to to a signicant increase in the number of bit toggles, i.e., the deliver good system performance under multiple constraints: number of wires that switch from 0 to 1 or 1 to 0. An increase shrinking power and energy envelopes (power wall), increas- in bit toggle count causes higher switching activity [65, 9, 10] ing memory latency (memory latency wall), and scarce and ex- for wires, causing higher dynamic energy to be consumed pensive bandwidth resources (bandwidth wall). While many by on-chip or o-chip interconnects. The bit toggle count dierent techniques have been proposed to address these is- of compressed data transfer increases for two reasons. First, sues, these techniques often oer a trade-o that improves the compressed data has a higher per-bit entropy because the one constraint at the cost of another. Ideally, system architects same amount of information is now stored in fewer bits (the would like to improve one or more of these system parameters, “randomness” of a single bit grows). Second, the variable-size e.g., on-chip and o-chip1 bandwidth consumption, while nature of compressed data can negatively aect the word/it simultaneously avoiding negative eects on other key param- data alignment in hardware. Thus, in contrast to the common eters, such as overall system cost, energy, and latency charac- wisdom that bandwidth compression saves energy (when it teristics. One potential way to address multiple constraints is is eective), our key observation reveals a new trade-o: en- to employ dedicated hardware-based data compression mech- ergy savings obtained by reducing bandwidth versus energy anisms (e.g., [71, 4, 14, 52, 6]) across dierent data links in loss due to higher switching energy during compressed data the system. Compression exploits the high data redundancy transfers. This observation and the corresponding trade-o observed in many modern applications [52, 57, 6, 69] and are the key contributions of this work. can be used to improve both capacity (e.g., of caches, DRAM, To understand (1) how applicable general-purpose data non-volatile memories [71, 4, 14, 52, 6, 51, 60, 50, 69, 74]) and compression is for real GPU applications, and (2) the severity of the problem, we use six compression algorithms [4, 52, 14, 1Communication channel between the last-level cache and main memory. 51, 76, 53] to analyze 221 discrete and mobile graphics appli- 978-1-4673-9211-2/16/$31.00 ©2016 IEEE 188 cation traces from a major GPU vendor and 21 open-source, on-chip interconnect energy consumption with C-Pack com- general-purpose GPU applications. Our analysis shows that pression algorithm to a much more acceptable 1.1× increase. although o-chip bandwidth compression achieves a signicant compression ratio (e.g., more than 47% average eective 2. Background bandwidth increase with C-Pack [14] across mobile GPU applications), it also greatly increases the bit toggle count (e.g., a Data compression is a powerful mechanism that exploits corresponding 2.2× average increase). This eect can signi- the existing redundancy in the applications’ data to relax cantly increase the energy dissipated in the on-chip/o-chip capacity and bandwidth requirements for many modern sys- interconnects, which constitute a signicant portion of the tems. Hardware-based data compression was explored in memory subsystem energy. the context of on-chip caches [71, 4, 14, 52, 57, 6] and main memory [2, 64, 18, 51, 60], but mostly for CPU-oriented ap- 1.2. Toggle-Aware Compression plications. Several prior works [64, 51, 58, 60, 69] examined In this work, we develop two new techniques that make band- the specics of memory bandwidth compression, where it is width compression for on-chip/o-chip buses more energy- critical to decide where and when to perform compression ecient by limiting the overall increase in compression- and decompression. related bit toggles. Energy Control (EC) decides whether to While these works evaluated the energy/power benets of send data in compressed or uncompressed form, based on a bandwidth compression, the overhead of compression was model that accounts for the compression ratio, the increase limited to the examined overheads of 1) the compression/de- in bit toggles, and current bandwidth utilization. The key compression logic and 2) the newly-proposed mechanisms/de- insight is that this decision can be made in a ne-grained signs. To our knowledge, this is the rst work that examines manner (e.g., for every cache line), using a simple model the energy implications of compression on the data trans- to approximate the commonly-used Energy × Delay and ferred over the on-chip/o-chip buses. Depending on the Energy × Delay2 metrics. In this model, Energy is directly type of the communication channel, the transferred data bits proportional to the bit toggle count; Delay is inversely pro- have dierent eect on the energy spent on communication. portional to the compression ratio and directly proportional We provide a brief background on this eect for three major to the bandwidth utilization. Our second technique, Metadata communication channel types. Consolidation (MC), reduces the negative eects of scattering On-chip Interconnect. For the full-swing on-chip inter- the metadata across a compressed cache line, which happens connects, one of the dominant factors that denes the energy with many existing compression algorithms [4, 14]. Instead, cost of a single data transfer (commonly called a it) is the MC consolidates compression-related metadata in a contigu- activity factor—the number of bit toggles on the wires (com- ous fashion. munication channel switchings from 0 to 1 or from 1 to 0). Our toggle-aware compression mechanisms are generic The bit toggle count for a particular it depends on both the and applicable to dierent compression algorithms (e.g., it’s data and the data that was previously sent over the same Frequent Pattern Compression (FPC) [4] and Base-Delta- wires. Several prior works [63, 10, 73, 66, 9] examined more Immediate (BDI) compression [52]), dierent communication energy-ecient data communication in the context of on- channels (on-chip and o-chip buses), and dierent archi- chip interconnects [10], reducing the number of bit toggles. tectures (e.g., GPUs, CPUs, and hardware accelerators). We The key dierence between our work and these prior works demonstrate that our proposed mechanisms are mostly or- is that we aim to address the eect of increase (sometimes thogonal to dierent data encoding schemes also used to a dramatic increase, see Section 3) in bit toggle count specif- minimize the bit toggle count (e.g., Data Bus Inversion [63]), ically due to data compression.

A Case for Toggle-Aware Compression for GPU Systems

Gs-35F-4677G

Nvidia Forceware Graphics Drivers for XP, Manual and Notes

Release 85 Notes

Manycore GPU Architectures and Programming, Part 1

CPU-GPU Benchmark Description

Trabajo #1 De Computación Gráfica

Time Complexity Parallel Local Binary Pattern Feature Extractor on a Graphical Processing Unit

HP Z800 Workstation Overview

Opengl Performance Tools

INTEL's TOP CPU the Pentium 4 3.4Ghz Extreme Edition Intel

Linux Hwmon Documentation

HP Z400 Workstation Overview