Exploring Software Inefficiency With

ZeroSpy: Exploring Software Inefficiency with Redundant Zeros Xin You∗, Hailong Yang∗y, Zhongzhi Luan∗, Depei Qian∗ and Xu Liuz Beihang University, China∗, North Carolina State University, USAz State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Chinay fyouxin2015,hailong.yang,07680,[email protected]∗, [email protected] Abstract—Redundant zeros cause inefficiencies in which the 1 for(int i=0;i<1000;++i) { 2 A[i] = 0; B[i] = i; zero values are loaded and computed repeatedly, resulting in 3 } 4 ... unnecessary memory traffic and identity computation that waste 5 for(int i=0;i<1000;++i) memory bandwidth and CPU resources. Optimizing compilers 6 I C[i] = A[i]-B[i]; is difficult in eliminating these zero-related inefficiencies due to Listing 1. An example code working on redundant zeros, where all variables are 32-bit integers. limitations in static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not enough to fulfill the computation, which yields better cache widely adopted in commodity processors. In this paper, we 1 propose ZeroSpy - a fine-grained profiler to identify redundant usage and vectorization potentials . zeros caused by both inappropriate use of data structures and Actually, a larger number of real-world applications (e.g., useless computation. ZeroSpy also provides intuitive optimization deep neural network and high-efficiency video coding) have guidance by revealing the locations where the redundant zeros already been reported to contain a significant amount of happen in source lines and calling contexts. The experimental redundant zeros and achieved significant speedup from the results demonstrate ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. corresponding optimization. For instance, in the area of deep Based on the optimization guidance revealed by ZeroSpy, we can neural network, researchers have proposed both hardware [11] achieve significant speedups after eliminating redundant zeros. and software [7] approaches to automatically detect the spar- Index Terms—Redundant Zero, Software Inefficiency, Perfor- sity of the network in order to apply specific optimizations for mance Profiling and Optimization achieving better performance. For high-efficiency video coding applications, better detection of all-zero blocks can improve I. INTRODUCTION the performance by skipping computations on these blocks [6]. Nowadays, production software in both industry and Modern optimizing compilers usually employ a set of redun- academia is becoming increasingly complex as they consist of dancy elimination techniques, such as value numbering [12], a large number of library dependencies as well as sophisticated common subexpression elimination [13], and constant propa- control and data flows. Such complexities can easily lead gation [14]. However, they suffer from the limitation of myopic to unexpected inefficiency, which prevents the software from optimization scopes and imprecise static analysis on pointers achieving optimal performance. Often, software inefficiency and aliases. Link-time optimization [15], [16] can enlarge the involves redundant operations, such as repeatedly loading the visibility of compiler optimization, but the performance im- same values from memory [1], writing values never used [2], provement after optimization remains constrained. Combining overwriting values in the same location with intermediate val- these static optimizing techniques, profile-guided optimization ues never referenced [3], and repeatedly computing the same (PGO) [17] was proposed to optimize the generated code value [4], [5]. In addition, a large number of applications [6], with profiled performance data. However, these techniques [7] take sparse data as the inputs, which can result in a do not identify redundant zeros involved in the memory and significant waste of resources if dense data structures are used. computation operations for further code optimization. One of the fundamental reasons for the above inefficiencies Performance analysis tools such as Perf [18], HPC- is that there are instructions and data structures frequently Toolkit [19], VTune [20], gprof [21], CrayPat [22] and OPro- dealing with zeros. file [23] can monitor program execution to report performance Listing 1 shows a simple example. The code first initializes metrics including CPU cycles, cache misses, and arithmetic two integer arrays of A and B (line 1-3) and then computes intensity to guide optimization. Other fine-grained profiling array C by subtracting array A and B (line 5-6). The redundant tools such as RedSpy [4] and LoadSpy [1] can identify zeros appear in the computation where memory operations redundant memory stores and loads. Although these tools can frequently read zeros from the array A. Such memory loads report program hotspots and resource under-utilization, they can be avoided by compressing A with a sparse data structure. do not identify inefficiencies involved with redundant zeros Furthermore, values of array B range from 0 to 1000, so a few and provide corresponding optimization guidance. significant bytes of array B’s values are always 0, which are redundant zeros as well, leading to a fraction of memory loads 1Unlike existing work [8]–[10] that reduces floating-point precisions, we wasted. Instead of using a 32-bit integer, a 16-bit integer is do not suffer from any accuracy loss by only removing unnecessary zeros. SC20, November 9-19, 2020, Is Everywhere We Are 978-1-7281-9998-6/20/$31.00 ©2020 IEEE Hardware approaches have also been proposed to detect II. RELATED WORK and optimize redundant zeros. Dusser et al. [24], [25] reported frequent zero values existing in cache and memory A. Hardware Solutions to Remove Redundancies and proposed Zero-Content Augmented cache (ZCA cache) as well as Decoupled Zero-Compressed memory (DZC memory) There are many research approaches to profile and optimize for redundancy elimination. Redundant zeros are exploited data sparsity from the hardware aspect. Dusser et al. [24], [25] in eDRAM [26] to save energy by avoiding refreshing zero reported the data sparsity in cache and memory that there are chunks. A zero-aware cache algorithm (Zero-Chunk) [27] was zero values loaded from and written to memory frequently and proposed to deal with redundant zeros. However, the major these zero values wasted memory bandwidth. Correspondingly, drawback of these hardware approaches is that they require they proposed Zero-Content Augmented cache (ZCA cache) special hardware extensions and are not directly applicable to and Decoupled Zero-Compressed memory (DZC memory). In commodity servers running real-world applications. their approach, each level of cache consists of a conventional cache augmented with zero-content cache, which utilizes an Given the existing software and hardware approaches, it is address tag and a validate tag for null data. They also attached still difficult to identify the inefficiencies caused by redundant a zero detection unit both in the path of load and write to zeros that are buried deep due to layers of software abstrac- detect zero values as well as gave control signals to maintain tions. Thus, we propose ZeroSpy, a practical tool to profile the zero-content cache. A similar idea is applied to compress redundant zeros stored in memory and reused for computation. physical memory in their proposed DZC memory. ZeroSpy works with fully optimized binary executables and provides useful metrics (e.g., redmap, redundancy fraction) to Manohar and Kapoor [26] developed a new refreshing guide performance optimization. ZeroSpy reports redundant strategy of eDRAM by detecting the zero chunks and skipping zeros with their calling contexts and associated source lines the corresponding refreshment, which results in notable energy as well as the data access patterns. For optimization, ZeroSpy saving. Gao et al. [27] proposed an efficient cache algorithm presents the top candidate instructions and data structures with named Zero-Chunk, which detects fingerprints with all-zero redundant zeros, and derives a metric redmap to represent the remainders and aggregates them into corresponding containers data access pattern. With the detailed profiles provided by to improve the cache hit ratio as well as save the memory ZeroSpy, developers can obtain useful insights to guide the space. Miguel et al. [28], [29] proposed a new cache design optimization of the code regions and data structures involving to store multiple similar blocks with a single data array entry frequent redundant zeros for better performance. in order to reduce the data redundancy. Rollback-Free Value Prediction (RFVP) [30] exploited approximate computation by In summary, this paper makes the following contributions: predicting the requested values for certain safe-to-approximate • We observe the fact that operating redundant zeros per- load operations missed in the cache. vasively exists in modern software packages, which usu- In some specific domains such as machine learning, Peng ally indicates performance inefficiencies. Furthermore, we et al. [7] proposed to exploit tensor sparsity in the deep categorize the provenance of redundant zeros to motivate neural network at runtime. They profiled several mini-batches different code optimization guidances produced by ZeroSpy. during training to identify the sparsity of certain neurons in • We develop ZeroSpy, a practical tool with reasonable pro- the network and divided them into sparse neurons to

Exploring Software Inefficiency With

Go Forth with TTL !

6502 Introduction

Download a Sample

The Ultimate C64 Overview Michael Steil, 25Th Chaos Communication Congress 2008

W65C02S Microprocessor DATA SHEET

W65C02S 8–Bit Microprocessor

Programming the 65816

Memory Management in Linux Is a Complex System ○ Supports a Variety of Systems from MMU-Less Microcontrollers to Supercomputers

6500 Microprocessors

Virtual Memory 2

W65C134S 8-Bit Microcomputer

Linux Boot Process