AxBench: A Benchmark Suite for Approximate Computing Across the System Stack

Amir Yazdanbakhsh    Divya Mahajan    Pejman Lotfi-Kamran†    Hadi Esmaeilzadeh
Alternative Computing Technologies (ACT) Lab
School of Computer Science, Georgia Institute of Technology
†School of Computer Science, Institute for Research in Fundamental Sciences (IPM)
{a.yazdanbakhsh, divya.mahajan}@gatech.edu    [email protected]    [email protected]

ABSTRACT

As the end of Dennard scaling looms, both the semiconductor industry and the research community are exploring innovative solutions that allow energy efficiency and performance to continue to scale. Approximate computing has become one of the viable techniques to perpetuate the historical improvements in the computing landscape. As approximate computing attracts more attention in the community, having a general, diverse, and representative set of benchmarks to evaluate different approximation techniques becomes necessary.

In this paper, we develop and introduce AxBench, a general, diverse, and representative multi-framework set of benchmarks for CPUs, GPUs, and hardware design, with a total of 29 benchmarks. We judiciously select and develop each benchmark to cover a diverse set of domains such as machine learning, scientific computation, signal processing, image processing, robotics, and compression. AxBench comes with the necessary annotations to mark the approximable regions of code and with application-specific quality metrics to assess the output quality of each application. With this set of annotations, AxBench facilitates the evaluation of different approximation techniques.

To demonstrate its effectiveness, we evaluate three previously proposed approximation techniques using the AxBench benchmarks: loop perforation [1] and neural processing units (NPUs) [2-4] on CPUs and GPUs, and Axilog [5] on dedicated hardware. We find that (1) NPUs offer higher performance and energy efficiency as compared to loop perforation on both CPUs and GPUs, (2) while NPUs provide considerable efficiency gains on CPUs, there still remains significant opportunity to be explored by other approximation techniques, (3) unlike on CPUs, NPUs offer the full benefits of approximate computation on GPUs, and (4) considerable opportunity remains to be explored by innovative approximate computation techniques at the hardware level after applying Axilog.

1 Introduction

As the end of Dennard scaling and Moore's law looms, the computing landscape confronts diminishing performance and energy improvements from the traditional scaling paradigm [6-8]. This evolution drives both the industry and the research community to explore viable solutions and techniques to maintain the traditional scaling of performance and energy efficiency. Approximate computing is one of the promising approaches for achieving significant efficiency gains at the cost of some output quality degradation for applications that can tolerate inexactness in their output.

A growing swath of studies have proposed different approximation languages and techniques, including the programming languages EnerJ [9] and Rely [10], the hardware description language Axilog [5], circuit-level techniques [11-26], microarchitectural techniques [27, 28], algorithmic techniques [29, 30], and approximate accelerators [2-4, 31]. As approximate computing gains popularity, it becomes important to have a diverse and representative set of benchmarks for a fair evaluation of approximation techniques. While a bad set of benchmarks makes progress problematic, a good set of benchmarks can help us as a community to rapidly advance the field [32].

A benchmark suite for approximate computing has to have several features. As various applications in different domains like finance, machine learning, image processing, vision, medical imaging, robotics, 3D gaming, and numerical analysis are amenable to approximate computation, a good benchmark suite for approximate computation should be diverse enough to be representative of all these different applications.

Moreover, approximate computing can be applied at various levels of the computing stack and through different techniques. Approximate computing is applicable to both hardware and software, and at both levels various techniques may be used for approximation. At the hardware level, dedicated approximate hardware may perform the operations [2-4, 31, 33-35] or an imprecise processor may run the program [12, 34, 36], among other possibilities [5, 13-26, 37, 38]. Likewise, there are many possibilities at the software level [1, 29, 30, 39-41].
A good benchmark suite for approximate computation should be useful for evaluating all of these possibilities. Being able to evaluate vastly different approximation techniques using a common set of benchmarks enables head-to-head comparison of different approximation techniques.

Finally, approximation not only applies to information processing, but can also be applied to information communication and information retrieval. While many approximation techniques target processing units [2-4, 12, 34, 36], the communication and storage media are also amenable to approximation [27, 42-46]. This means that a good benchmark suite for approximate computing should be rich enough to be useful for the evaluation of approximate communication and storage.

This paper introduces AxBench, a diverse and representative multi-framework set of benchmarks for evaluating approximate computing research on CPUs, GPUs, and hardware design. We discuss why the AxBench benchmarks have all the necessary features of a good benchmark suite for approximate computing. AxBench covers diverse application domains such as machine learning, robotics, arithmetic computation, multimedia, and signal processing. AxBench enables researchers in approximate computing to study, evaluate, and compare state-of-the-art approximation techniques on a diverse set of benchmarks in a straightforward manner.

We perform a detailed characterization of the AxBench benchmarks on CPUs, GPUs, and dedicated hardware. The results show that the approximable regions of the benchmarks, on average, constitute 74.9% of the runtime and 81.8% of the energy usage of the whole applications when they run on a CPU. On a GPU, the approximable regions constitute 53.4% of the runtime and 56.0% of the energy usage of the applications. We use an approximation synthesis technique [5] to gain insights about the potential benefits of using approximation in hardware design. The results demonstrate that, on average, the approximable parts constitute 92.4% of the runtime, 69.4% of the energy usage, and 70.1% of the area of the whole dedicated hardware. These results clearly demonstrate that these benchmarks, which are taken from various domains, are amenable to approximation.
We also evaluate three previously proposed approximate computation techniques using the AxBench benchmarks. We apply Loop Perforation [1] and Neural Processing Units (NPUs) [2-4] to CPUs and GPUs, and Axilog [5] to dedicated hardware. We find that loop perforation results in large output quality degradation and, consequently, NPUs offer higher efficiency on both CPUs and GPUs. Moreover, we observe that, on CPU+NPU, significant opportunity remains to be explored by other approximation techniques, mainly because NPUs do nothing for data misses. On GPUs, however, NPUs leverage all of the potential and leave very little opportunity for other approximation techniques, except on workloads that saturate the off-chip bandwidth. Data misses are not a performance bottleneck for GPU+NPU, as a massively parallel GPU can effectively hide data miss latencies. Finally, we find that Axilog is effective at improving the efficiency of dedicated hardware, but significant opportunity still exists for other approximation techniques to be explored.

The contributions of this work are as follows:

1. We provide a diverse and representative set of multi-framework benchmarks for approximate computation (AxBench). AxBench comes in three categories tailored for studying approximate computing on CPUs, GPUs, and dedicated hardware. It includes 9 benchmarks for CPUs, 11 for GPUs, and 9 for hardware design.

2. All the benchmarks come with initial annotations, which mark the approximable regions of the code. The annotations only mark where approximation is beneficial, without specifying how to approximate the region. The annotations ease the evaluation of different approximation techniques with minor changes in the code.
3. We introduce application-specific quality metrics for each of the benchmarks. The introduced quality metric for each benchmark helps to understand how the exploited approximation technique affects the output quality. The user can easily alter the degree of approximation or the applied approximation technique and compare their effects on the output quality of the application.

4. We evaluate previously proposed approximation techniques using AxBench and find that (1) while NPUs provide considerable efficiency gains on CPUs, there still remains significant opportunity to be explored by other approximation techniques, mainly because NPUs do nothing for data accesses, and (2) unlike on CPUs, NPUs offer the full benefits of approximate computation on GPUs when the off-chip bandwidth is not saturated.

2 Benchmarks

One of the goals of AxBench is to provide a diverse set of applications to further facilitate research and development in approximate computing. It consists of 29 applications from different domains including machine learning, computer vision, big data analytics, 3D gaming, and robotics. We first introduce the benchmarks and some of their characteristics, and then discuss the benchmark-specific quality metrics.

2.1 Common Benchmarks

AxBench provides a set of C/C++ benchmarks for execution on CPUs, a set of CUDA benchmarks for execution on GPUs, and a set of Verilog benchmarks for hardware design. Some algorithms are amenable to execution on all platforms. For these algorithms, AxBench provides the C/C++ implementation for execution on CPUs, the CUDA implementation for execution on GPUs, and the Verilog implementation for hardware design. In this part, we briefly introduce these benchmarks.

Inversek2j is used in robotics and animation applications. It uses the kinematic equations to compute the angles of a 2-joint robotic arm. The input dataset is the position of the end-effector of a 2-joint robotic arm and the output is the two angles of the arm.

Sobel is widely used in image processing and computer vision applications, particularly within edge detection algorithms. The Sobel application takes an RGB image as the input and produces a grayscale image in which the edges are emphasized.
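To give a feel for the scale of these kernels, the following is a minimal C++ sketch of the Sobel gradient computation (a simplified rendition, not the exact AxBench source; the benchmark itself also performs the RGB-to-grayscale conversion):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal Sobel sketch: 3x3 gradient magnitude over a grayscale buffer.
    std::vector<uint8_t> sobel(const std::vector<uint8_t>& in, int w, int h) {
      static const int gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
      static const int gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
      std::vector<uint8_t> out(in.size(), 0);
      for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
          int sx = 0, sy = 0;
          for (int ky = -1; ky <= 1; ++ky)       // accumulate the two
            for (int kx = -1; kx <= 1; ++kx) {   // directional gradients
              int p = in[(y + ky) * w + (x + kx)];
              sx += gx[ky + 1][kx + 1] * p;
              sy += gy[ky + 1][kx + 1] * p;
            }
          int mag = static_cast<int>(std::sqrt(double(sx * sx + sy * sy)));
          out[y * w + x] = static_cast<uint8_t>(mag > 255 ? 255 : mag);
        }
      }
      return out;
    }

The kernel is a tight loop of multiply-adds over pixel neighborhoods, which is exactly the kind of error-tolerant hot spot the suite targets.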
2.2 Common CPU and GPU Benchmarks

For some algorithms, AxBench provides both the C/C++ implementation for execution on CPUs and the CUDA implementation for execution on GPUs. We briefly introduce these benchmarks here.

Black-Scholes is a financial analysis workload. It solves partial differential equations to estimate the price for a portfolio of European options. Each option consists of different floating-point values and the output is the estimated price of the option.

Jmeint is a 3D gaming workload. The input is a pair of triangles' coordinates in three-dimensional space and the output is a Boolean value which indicates whether the two triangles intersect or not.

2.3 Common CPU and Hardware-Design Benchmarks

For some algorithms, AxBench provides both the C/C++ implementation for execution on CPUs and the Verilog implementation for hardware design.

Forwardk2j is used in robotics and animation applications. It uses kinematic equations to compute the position of a robotic arm in a two-dimensional space (see the sketch at the end of this subsection). The input dataset consists of a set of 2-tuple angles of a 2-joint robotic arm and the output dataset is the computed (x, y) position of the end-effector of the arm.

K-means is widely used in machine learning and data mining applications. It aims to partition a number of n-dimensional input points into k different clusters. Each point belongs to the cluster with the nearest mean. To evaluate this benchmark, we use an RGB image as the input. The output is an image that is clustered in different color regions.
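The kinematics behind forwardk2j (and its counterpart inversek2j, introduced in Section 2.1) reduce to a few trigonometric identities. The following C++ sketch uses the standard textbook equations for a 2-joint arm; the link lengths l1 and l2 stand in for the benchmark's actual parameters:

    #include <cmath>

    // Forward kinematics: joint angles (t1, t2) to end-effector position.
    void forward2j(double t1, double t2, double l1, double l2,
                   double& x, double& y) {
      x = l1 * std::cos(t1) + l2 * std::cos(t1 + t2);
      y = l1 * std::sin(t1) + l2 * std::sin(t1 + t2);
    }

    // Closed-form inverse kinematics for the same arm (elbow-down solution);
    // assumes the target (x, y) is reachable, i.e. |c2| <= 1.
    void inverse2j(double x, double y, double l1, double l2,
                   double& t1, double& t2) {
      double c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2);
      t2 = std::acos(c2);
      t1 = std::atan2(y, x) -
           std::atan2(l2 * std::sin(t2), l1 + l2 * std::cos(t2));
    }

The heavy use of sin, cos, acos, and atan2 explains why these two benchmarks are dominated by floating-point work and respond so strongly to approximation.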

2.4 CPU Specific Benchmarks

Table 1 shows the complete list of all CPU benchmarks. In addition to the six benchmarks that are already introduced, there are three dedicated CPU benchmarks, as listed in the table. The table also indicates the domain from which these benchmarks are taken and some characteristics of the benchmarks. In what follows, we briefly introduce the three CPU-specific benchmarks.

Table 1: The evaluated CPU benchmarks, characterization of each approximable region, and the quality metric.
Name | Domain | # of Function Calls | # of Loops | # of ifs/elses | # of x86-64 Insts. | Quality Metric
blackscholes | Financial Analysis | 5 | 0 | 5 | 309 | Avg. Relative Error
canneal | Optimization | 6 | 2 | 6 | 378 | Avg. Relative Error
fft | Signal Processing | 2 | 0 | 0 | 34 | Avg. Relative Error
forwardk2j | Robotics | 2 | 0 | 0 | 65 | Avg. Relative Error
inversek2j | Robotics | 4 | 0 | 0 | 100 | Avg. Relative Error
jmeint | 3D Gaming | 32 | 0 | 23 | 1,079 | Miss Rate
jpeg | Compression | 3 | 4 | 0 | 1,257 | Image Diff
kmeans | Machine Learning | 1 | 0 | 0 | 26 | Image Diff
sobel | Image Processing | 3 | 2 | 1 | 88 | Image Diff

Canneal is an optimization algorithm which is used to minimize the routing cost of a chip design. Canneal employs the simulated annealing (SA) [47] technique to find the optimum design point. At each iteration of the algorithm, Canneal pseudo-randomly swaps netlist elements and re-evaluates the routing cost of the new placement. If the cost is reduced, the new placement is accepted. In order to escape from local minima, the algorithm also randomly accepts a placement with a higher routing cost. This process continues until the number of possible swaps is below a certain threshold. The input to this benchmark is a set of netlist elements and the output is the routing cost.

FFT is an algorithm that is used in many signal processing applications. FFT computes the discrete Fourier transform of a sequence, or its inverse. The input is a sequence of signals in the time domain and the output is the signal values in the frequency domain. We use a set of sequences of signals as the input dataset. The output is the representation of the corresponding signals in the frequency domain.

JPEG is a lossy compression technique for color images. The input is an uncompressed image (RGB). The JPEG algorithm performs a lossy compression and produces a similar image with reduced file size.

Figure 1 shows the instruction breakdown for all CPU benchmarks. On average, 51% of the instructions are integer, 16% of the instructions are floating-point, and the rest are other types of instructions (e.g., load, store, conditional, etc.). The figure shows that the CPU benchmarks have diversity in their instruction breakdown.

Figure 1: The instruction classification for CPU platform.

2.5 GPU Specific Benchmarks

Table 2 shows the complete list of all GPU benchmarks, the domain that the benchmarks are taken from, and some characteristics of the applications. In addition to the four GPU benchmarks that are already introduced, there are seven GPU-specific benchmarks in the table. In this part, we introduce the GPU-specific benchmarks.

Table 2: The evaluated GPU benchmarks, characterization of each approximable region, and the quality metric.
Name | Domain | # of Function Calls | # of Loops | # of ifs/elses | # of PTX Insts. | Quality Metric
binarization | Image Processing | 1 | 0 | 1 | 27 | Image Diff
blackscholes | Financial Analysis | 2 | 0 | 0 | 96 | Avg. Relative Error
convolution | Machine Learning | 0 | 2 | 2 | 886 | Avg. Relative Error
fastwalsh | Signal Processing | 0 | 0 | 0 | 50 | Avg. Relative Error
inversek2j | Robotics | 0 | 3 | 5 | 132 | Avg. Relative Error
jmeint | 3D Gaming | 4 | 0 | 37 | 2,250 | Miss Rate
laplacian | Image Processing | 0 | 2 | 1 | 51 | Image Diff
meanfilter | Machine Vision | 0 | 2 | 1 | 35 | Image Diff
newton-raph | Numerical Analysis | 2 | 2 | 1 | 44 | Avg. Relative Error
sobel | Image Processing | 0 | 2 | 1 | 86 | Image Diff
srad | Medical Imaging | 0 | 0 | 0 | 110 | Image Diff

Binarization is an image processing workload, which is frequently used as a pre-processing pass in optical character recognition (OCR) algorithms. It converts a 256-level grayscale image to a black and white image. The image binarization algorithm uses a pre-defined threshold to decide whether a pixel should be converted to black or white.

Convolution can be used in a variety of domains such as machine learning and image processing. One of the applications of the convolution operator is to extract the feature map of an image in deep neural networks. In the image processing domain, it is used for image smoothing and edge detection. Convolution takes an image as the input. The output of the convolution is the transformed form of the input image.

FastWalsh is an image processing workload and is widely used in a variety of domains including signal processing and video compression. It is an efficient algorithm to compute the Walsh-Hadamard transform. Similar to other image processing benchmarks, we use an image as the input. The output is the transformed form of the input image.

Laplacian is an image processing application. It is mainly used in image processing for edge detection. The output image is a grayscale image in which all the edges are emphasized.

Meanfilter is an image processing application. It is used as a filter for smoothing (and removing the noise from) an image. The mean filter replaces all the image pixels with the average value of their 3x3 window of neighbors. Meanfilter takes a noisy image as input. The output is the same image with reduced noise.

Newton-Raphson is an iterative numerical analysis method. This method is widely used in scientific applications to find an approximation to the roots of a real-valued function. The Newton-Raphson method starts with an initial guess of the root value, and then finds a better approximation of the root, x_{n+1} = x_n - f(x_n)/f'(x_n), after each iteration.

SRAD is a method that is widely used in the medical image processing domain and is based on partial differential equations. SRAD is used to remove the correlated noise from an image without destroying the important image features. We evaluate this benchmark with a grayscale, noisy image of a heart. The output is the same image with reduced noise.

Figure 2 shows the instruction breakdown for all GPU benchmarks. On average, 16% of the instructions are integer, 55% of the instructions are floating-point, and the rest are other types of instructions (e.g., load, store, conditional, etc.). As compared to the CPU benchmarks, the GPU benchmarks have more floating-point instructions. Even for the benchmarks common to the CPU and GPU suites, the GPU versions have more floating-point instructions. The reason is that we implemented many floating-point operations like sin and cos using lookup tables for the CPU benchmarks, as processors usually do not have hardware support for such operations. However, due to hardware support in GPUs, we used the native floating-point operations for the GPU benchmarks.

Figure 2: The instruction classification for GPU platform.

2.6 Hardware-Design Specific Benchmarks

Table 3 shows the complete list of all hardware-design benchmarks. In addition to the four benchmarks that are already introduced, there are five dedicated hardware-design benchmarks, as listed in Table 3. The table also indicates the domain from which these benchmarks are taken and some characteristics of the benchmarks. In what follows, we briefly introduce the five hardware-design specific benchmarks.

Table 3: The evaluated ASIC benchmarks, characterization of each approximable region, and the quality metric.
Name | Domain | # of Lines | Quality Metric
brent-kung (32-bit) | Arithmetic Computation | 352 | Avg. Relative Error
fir (8-bit) | Signal Processing | 113 | Avg. Relative Error
forwardk2j | Robotics | 18,282 | Avg. Relative Error
inversek2j | Robotics | 22,407 | Avg. Relative Error
kmeans | Machine Learning | 10,985 | Image Diff
kogge-stone (32-bit) | Arithmetic Computation | 353 | Avg. Relative Error
wallace-tree (32-bit) | Arithmetic Computation | 13,928 | Avg. Relative Error
neural network | Machine Learning | 21,053 | Image Diff
sobel | Image Processing | 143 | Image Diff

Brent-Kung is one of the parallel prefix forms of carry look-ahead adder. Brent-Kung is an efficient adder design in terms of area. The input dataset for this benchmark is a set of two random 32-bit integer numbers and the output is a 32-bit integer sum.

FIR filter is widely used in the signal processing domain. One of the applications of a FIR filter is to select the desired frequency of a finite-length digital input. To evaluate this benchmark, we use a set of random values as the input dataset. (A sketch of the FIR computation appears at the end of this subsection.)

Kogge-Stone is one of the parallel prefix forms of carry look-ahead adder. The Kogge-Stone adder is one of the fastest adder designs and is widely used for high-performance computing in industry. We use a set of random pairs of 32-bit integer values as input. The output is the sum of the corresponding values.

Wallace-Tree is an efficient design for multiplying two integer values. Similar to the other arithmetic computation benchmarks, we use random 32-bit integer values as input. The output is the product of the corresponding numbers.

Neural-Network is an implementation of a feedforward neural network that approximates the Sobel filter. Such artificial neural networks are used in a wide variety of applications including pattern recognition and function approximation. The benchmark takes an RGB image as input and the output is a grayscale image whose edges are emphasized.
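As a concrete reference for the arithmetic these designs implement, here is the FIR computation, y[n] = sum_k b[k] * x[n-k], as a C++ sketch; the shipped benchmark is an 8-bit Verilog design, so this only illustrates the algorithm, not the hardware:

    #include <vector>

    // Minimal FIR sketch: convolve input x with filter coefficients b.
    std::vector<double> fir(const std::vector<double>& x,
                            const std::vector<double>& b) {
      std::vector<double> y(x.size(), 0.0);
      for (size_t n = 0; n < x.size(); ++n)
        for (size_t k = 0; k < b.size() && k <= n; ++k)
          y[n] += b[k] * x[n - k];   // one multiply-add per tap
      return y;
    }

Each output sample is a short chain of multiply-adds, which is why relaxing timing on the arithmetic paths (as Axilog does) translates directly into area and energy savings.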
2.7 Application Specific Quality Metric

Each approximation technique may introduce different levels of quality loss in the applications' output. Therefore, a good benchmark suite for approximation has to include a quality metric to evaluate the applications' output quality loss as they undergo approximation. However, since the nature of applications' output varies from one application to another, it is necessary to have a quality metric that takes into account the type of the application's output. For example, the output of sobel is an image, but the output of jmeint is a Boolean. Therefore, it is necessary to have an application-specific quality metric to effectively assess the quality loss of each application's output.

To make AxBench useful, we augment each application with an application-specific quality metric. In total, we introduce three different quality metrics: (1) average relative error, (2) miss rate, and (3) image difference. We use image difference for all applications that produce image output, including binarization, jpeg, kmeans, laplacian, meanfilter, neural network, sobel, and srad. Image difference calculates the average root-mean-square of the pixel differences between the precise and approximate images. The jmeint algorithm checks whether two 3D triangles intersect and produces a Boolean value (true if the triangles intersect, and false otherwise). Therefore, we use miss rate to measure the fraction of misclassified triangles. For all the other applications (e.g., blackscholes, canneal, fastwalsh, fft, inversek2j) that produce numeric outputs, we use average relative error to measure the discrepancy between the original and approximate outputs.
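The two numeric metrics are straightforward to express in code. The following C++ sketch implements them as described above; the exact normalization used by the released AxBench scripts may differ:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Image difference: root-mean-square of per-pixel differences between
    // the precise and approximate outputs.
    double image_diff(const std::vector<uint8_t>& precise,
                      const std::vector<uint8_t>& approx) {
      double sum = 0.0;
      for (size_t i = 0; i < precise.size(); ++i) {
        double d = double(precise[i]) - double(approx[i]);
        sum += d * d;
      }
      return std::sqrt(sum / precise.size());
    }

    // Average relative error for benchmarks with numeric outputs
    // (assumes nonzero precise outputs).
    double avg_relative_error(const std::vector<double>& precise,
                              const std::vector<double>& approx) {
      double sum = 0.0;
      for (size_t i = 0; i < precise.size(); ++i)
        sum += std::fabs((precise[i] - approx[i]) / precise[i]);
      return sum / precise.size();
    }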

3 Methodology

3.1 Approximable Region Identification

AxBench comes with the initial annotations, which mark the approximable regions of code. The annotations only provide high-level guidance about where the approximable regions are, and not how to approximate those regions. We introduce two criteria to identify the approximable regions in AxBench. An approximable region of code in AxBench must satisfy both criteria: (1) it must be a hot spot; and (2) it must tolerate imprecision in its operations and data.

Hot spot. The intention of approximation techniques is to trade off accuracy for higher gains in performance and energy. Therefore, the obvious target for approximation is the region of code which either takes the most execution time or consumes the highest energy of an application. We call this region the hot spot. The hot spot of an application contains the highest potential for approximation. Note that this region may contain function calls, loops, complex control flows, and memory operations.

Tolerance to imprecision. The identified approximable region will undergo approximation during the execution. Therefore, the AxBench benchmarks must have some application-level tolerance to imprecision. Recent approximate programming work [1-4, 9] has shown that identifying these regions of code with tolerance to imprecision is not difficult. For example, in jpeg, any imprecision in the region of code that stores the meta-data in the output image totally corrupts the output image, whereas imprecision in the region of code that compresses the image (i.e., quantization) is tolerable and may only lead to some degree of quality loss in the output image. In AxBench, we perform a similar study for each benchmark to identify the region of code which has tolerance to imprecision. The identified regions are commensurate with prior work on approximate computing [1, 2, 9, 48].

3.2 Safety

Recent work on approximate programming languages [9, 10, 48] introduces practical techniques to provide statistical safety guarantees for approximation. However, as mentioned in the previous section, one of the objectives in AxBench is to only provide an abstraction above the approximation techniques. This abstraction only provides guidelines about where the potential for approximation lies, and not about how to apply approximation to these regions. Therefore, we do not provide any guarantees about the safety of the AxBench applications when they undergo approximation. It is still the responsibility of the users of AxBench to provide safety guarantees for their approximation technique. For this reason, in Section 4.2, in which we evaluate AxBench with prior approximation techniques, we use safety mechanisms similar to those the prior techniques proposed to provide safety in AxBench's approximable regions.
3.3 Experimental Setup for CPU

Cycle-accurate simulation and energy modeling. We use the MARSSx86 cycle-accurate x86-64 simulator [50] to characterize AxBench benchmarks on a CPU platform. Table 4 summarizes the major microarchitectural parameters of the core, caches, and the memory subsystem. The core is modeled after the Nehalem microarchitecture and operates at 3.4 GHz. We made several minor changes in the simulator to enable the simulation and characterization of the approximable and non-approximable regions. We use C assembly inlining in the source code to identify the beginning and the end of the approximable region. We compile all the benchmarks using GCC/G++ version 4.8.4 with the -O3 flag to enable aggressive optimizations. We simulate all the benchmarks to completion to capture the distinct phases of each workload.

Table 4: Major CPU microarchitectural parameters for the core, caches, and memory.
Core: Clock Frequency: 3.4 GHz; Architecture: x86-64; Fetch/Issue Width: 4/6; INT ALUs/FPUs: 3/2; Load/Store FUs: 2/2; ROB Entries: 96; Issue Queue Entries: 32; INT/FP Physical Registers: 256/256; Branch Predictor: Tournament, 48 KB; BTB Sets/Ways: 1024/4; Load/Store Queue Entries: 48/48; Dependence Predictor: 4096-entry Bloom Filter.
Caches and Memory: L1 Size: 32 KB instruction, 32 KB data; L1 Line Width: 64 bytes; L1 Associativity: 8; L1 Hit Latency: 3 cycles; ITLB/DTLB Entries: 128/256; L2 Cache Size: 2 MB; L2 Line Width: 64 bytes; L2 Associativity: 8; L2 Hit Latency: 12 cycles; Memory Latency: 104 cycles (30 ns); On-Chip Bandwidth: 33.77 GB/sec; Off-Chip Bandwidth: 8 GB/sec.

We use McPAT [51] to measure the energy consumption of the benchmarks. MARSSx86 generates an event log during the cycle-accurate simulation of a program. We pass this event log to a modified version of McPAT. At the end, McPAT provides the energy consumption of the approximable and non-approximable regions.
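As a rough illustration of this marking scheme, the following sketch uses hypothetical inline-assembly macros (the actual AxBench marker names and encoding may differ) to delimit an approximable region so the simulator can attribute cycles and energy to it:

    // Hypothetical region markers in the spirit of the approach described
    // above: inline-assembly pseudo-ops that the modified simulator can
    // recognize without perturbing the generated code.
    #define APPROX_BEGIN() asm volatile("# APPROX_REGION_BEGIN" ::: "memory")
    #define APPROX_END()   asm volatile("# APPROX_REGION_END" ::: "memory")

    void kernel(float* out, const float* in, int n) {
      APPROX_BEGIN();              // simulator starts charging this region
      for (int i = 0; i < n; ++i)
        out[i] = in[i] * in[i];    // error-tolerant computation
      APPROX_END();                // back to the non-approximable region
    }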

3.4 Experimental Setup for GPU

Cycle-accurate simulation and energy modeling. We use version 3.2.2 of the GPGPU-Sim cycle-accurate simulator [52] to characterize AxBench workloads on a GPU platform. We simulate all the benchmarks with the default GPGPU-Sim configuration that closely models an Nvidia GTX 480 chipset. Table 5 summarizes the major microarchitectural parameters of the simulated system. We made several changes in the simulator to simulate and characterize the approximable and non-approximable regions. We use PTX (Parallel Thread Execution, a pseudo-assembly language in Nvidia's CUDA programming model) inlining in the source code to identify the beginning and the end of the approximable region. We compile all the benchmarks using NVCC version 4.2 with the -O3 flag to enable aggressive compiler optimizations. Furthermore, we accordingly optimize the number of thread blocks and the number of threads per block of each kernel for our simulated hardware. We simulate all the benchmarks to completion to capture the distinct phases of each workload.

We measure the energy consumption of GPU workloads by using GPGPU-Wattch [53], which is integrated with GPGPU-Sim.

Table 5: Major GPU microarchitectural parameters for the streaming multiprocessors (SMs), shader processors (SPs), caches, and memory.
System Overview: 15 SMs, 32 threads/warp, 6 memory channels
Shader Core Config: 1.4 GHz, GTO scheduler [49], 2 schedulers per SM
Resources per SM: 48 warps/SM, 32768 registers
L1 Instruction Cache: 2 KB, 128B line, 4-way, LRU
L1 Data Cache: 16 KB, 128B line, 4-way, LRU
L1 Constant Cache: 8 KB, 64B line, 2-way, LRU
L1 Texture Cache: 12 KB, 128B line, 24-way, LRU
Shared Memory: 48 KB, 32 banks
Interconnect: 1 crossbar/direction (15 SMs, 6 MCs), 1.4 GHz
L2 Shared Cache: 768 KB, 128B line, 16-way, LRU
Memory: 924 MHz, 6 GDDR5 memory controllers, FR-FCFS scheduling, 16 banks/SM
On-Chip Bandwidth: 443.5 GB/sec
Off-Chip Bandwidth: 177.4 GB/sec

3.5 Experimental Setup for ASIC

Gate-level simulation and energy modeling. We use Synopsys Design Compiler version G-2012.06SP5 to synthesize and measure the energy consumption of the Verilog benchmarks. We use TSMC 45-nm multi-Vt standard cell libraries for the synthesis. We report the results for the slowest PVT corner (SS, 0.81 V, 0°C). We use Cadence NC-Verilog version 11.10-s062 for timing simulation with SDF back annotations extracted from the output of the synthesis.

3.6 Prior Approximation Techniques

To show the advantage of AxBench, we evaluate some of the prior approximation techniques with the introduced benchmarks. For CPU and GPU benchmarks, we study the benefits of using neural processing units (NPUs) [2-4] and loop perforation [1]. For dedicated hardware benchmarks, we use the approximation synthesis technique proposed in Axilog [5].

Neural processing units (NPUs). Esmaeilzadeh et al. [2] introduced an automated workflow for neural transformation and designed a tightly coupled neural accelerator for general-purpose approximate programs. The automated framework first observes the annotated region of code and transforms it to a neural representation that consists only of simple arithmetic operations such as addition, multiplication, and sigmoid. Then, the compiler annotates the source code to allow the CPU to offload the annotated region(s) to an efficient implementation of feedforward neural networks.
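To illustrate the kind of computation that replaces an approximable region after neural transformation, the following C++ sketch evaluates one layer of such a feedforward network; the topology and weights are placeholders that the NPU workflow would learn from input/output samples of the original region:

    #include <cmath>
    #include <vector>

    static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    // One fully connected layer: each output neuron is a weighted sum of
    // the inputs plus a bias, passed through a sigmoid. A full network
    // chains a few of these layers.
    std::vector<double> mlp_layer(const std::vector<double>& in,
                                  const std::vector<std::vector<double>>& w,
                                  const std::vector<double>& bias) {
      std::vector<double> out(w.size());
      for (size_t j = 0; j < w.size(); ++j) {
        double acc = bias[j];
        for (size_t i = 0; i < in.size(); ++i)
          acc += w[j][i] * in[i];      // multiply-add per connection
        out[j] = sigmoid(acc);         // nonlinearity per neuron
      }
      return out;
    }

Because the substituted computation is a fixed pattern of multiply-adds and sigmoids, it maps onto a small dedicated accelerator regardless of the original region's control flow.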

Loop perforation. This software-level technique provides higher efficiency by reducing the amount of computation required to produce the output. Loop perforation first identifies the loops whose perforation still leads to acceptable output. Then, it transforms these loops to execute only a subset of their iterations.
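A minimal sketch of the idea, assuming a simple reduction loop and a hypothetical perforation stride, looks as follows (the actual technique selects which loops and which rates to perforate based on the output quality budget):

    // Execute every `stride`-th iteration of a reduction and extrapolate
    // the partial result to account for the skipped iterations.
    // Assumes n > 0 and stride >= 1.
    double perforated_sum(const double* data, int n, int stride) {
      double sum = 0.0;
      int taken = 0;
      for (int i = 0; i < n; i += stride) {  // skip (stride - 1) of every
        sum += data[i];                      // stride iterations
        ++taken;
      }
      return sum * (double(n) / taken);      // rescale to full loop count
    }

Increasing the stride trades more output error for proportionally less work, which is exactly the knob we tune in Section 4.2 to stay within the 10% quality degradation budget.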
Axilog. Yazdanbakhsh et al. [5] introduce a set of language annotations that enable hardware designers to incorporate approximation in hardware design. They also propose an approximate synthesis workflow that relaxes the timing constraints on some paths. This allows the synthesis tools to use slower (less leaky and smaller) gates on these paths, which leads to a reduction in both area and energy consumption.

4 Experimental Results

We first characterize AxBench's benchmarks on CPU, GPU, and dedicated hardware to demonstrate their usefulness for approximate computing. We then use the benchmarks to evaluate three previously proposed approximation techniques, namely loop perforation [1], neural processing units (NPUs) [2-4], and Axilog [5].

4.1 Benchmark Characterization

4.1.1 CPU Platform

Runtime and energy breakdown. Figure 3 shows the runtime and energy breakdown of approximable and non-approximable parts of applications when they execute on a CPU. Some applications such as blackscholes, canneal, and jmeint are completely approximable. Among the rest, while many applications spend more than 70% of their runtime and energy in approximable regions (e.g., inversek2j), the number goes down to less than 40% in some applications (e.g., jpeg). On average, applications spend 74.9% of their runtime and 81.8% of their energy in approximable regions. While the time-energy breakdown differs from one application to another, this experiment clearly shows that there is a strong opportunity for approximate computing across a large number of applications from scientific computing, signal processing, image processing, etc.

Figure 3: Total application runtime and energy breakdown between non-approximable and approximable regions in CPU platform.

L1-I and L1-D misses per kilo instructions (MPKI). Figures 4 and 5 present the number of L1-I and L1-D misses per kilo instructions of the approximable regions of our benchmarks, respectively. For some applications (e.g., blackscholes) the instruction footprint is relatively small and fits in the L1-I of the processor. For these benchmarks, the number of L1-I misses per kilo instructions is almost zero. The rest of the benchmarks have instruction footprints that are larger than the L1-I cache and, consequently, suffer from L1-I misses (e.g., jmeint). On the other hand, the data working sets of the AxBench benchmarks are large, exceeding the capacity of L1-D caches, and as a result, the number of L1-D misses per kilo instructions of many applications is relatively large. Across all benchmarks, the L1-D MPKI ranges from 0.5 in blackscholes to 27.6 in canneal. These graphs clearly show that instruction and data delivery is a bottleneck across many applications. Any approximation technique that can unblock this bottleneck has the potential to significantly improve performance and energy efficiency.

On-chip bandwidth utilization. Figure 6 shows the on-chip bandwidth utilization across all the benchmarks. The maximum bandwidth utilization is registered for fft and is less than 10%. The rest of the benchmarks have even lower bandwidth utilization. While many applications suffer from large numbers of instruction and data misses, the on-chip bandwidth is almost underutilized. This is due to (1) low performance per core because of the high number of misses and (2) the small number of cores (in this case: one) per processor. The results suggest that approximate computation techniques that increase the performance of the benchmarks are unlikely to hit the on-chip bandwidth bottleneck.

Figure 4: Approximable region L1 instruction cache misses per kilo instructions in CPU platform. L1 instruction cache size is 32 KBs.

Figure 5: Approximable region L1 data cache misses per kilo instructions in CPU platform. L1 data cache size is 32 KBs.

Figure 6: On-chip bandwidth utilization in CPU platform. The baseline on-chip bandwidth is 33.77 GB/sec.

Figure 7: Approximable region LLC cache misses per kilo instructions in CPU platform. LLC cache size is 2 MBs.

Figure 8: Off-chip bandwidth utilization in CPU platform. The baseline off-chip bandwidth is 8 GB/sec.

Figure 9: The approximable region instruction per cycle (IPC) in CPU platform.

LLC misses per kilo instructions (MPKI). Figure 7 shows the number of LLC misses per kilo instructions of the approximable regions of AxBench's benchmarks. We find that many benchmarks in our suite have data working sets larger than the LLC capacity (i.e., 2 MBs) and consequently observe many LLC misses (the instruction footprint of the benchmarks fits in the LLC, so the LLC misses are mainly due to data accesses). The LLC MPKI ranges from almost 0 in blackscholes to 15 in canneal. A large number of LLC misses per kilo instructions results in significant performance and energy loss. Consequently, the results suggest the usefulness of any approximate computing technique that reduces LLC pressure.

Off-chip bandwidth utilization. Figure 8 shows the off-chip bandwidth utilization of our CPU benchmarks. Just like the on-chip bandwidth, the off-chip bandwidth of almost all benchmarks is underutilized. This is due to (1) low performance per core because of the large number of misses and (2) the small number of cores in the processor. The only exception is canneal, which experiences a large number of LLC misses (almost every access to the LLC is a miss). The results show that an approximate computation technique that improves the performance of almost all of these benchmarks is unlikely to hit the off-chip bandwidth bottleneck. For canneal, which suffers from a large number of LLC misses, a successful approximate computation technique should reduce the number of off-chip accesses.

Instruction per cycle (IPC). Figure 9 shows the average number of instructions committed per cycle when the approximable parts of the applications execute on the CPU. While the CPU is capable of committing up to 3 instructions per cycle, due to inefficiencies in instruction and data delivery, the average number of instructions committed per cycle for the majority of the benchmarks does not reach 1.5. In many applications, the average IPC is less than 1 (e.g., canneal and fft). Any approximate computing technique that unblocks the bottleneck on instruction and data delivery has a significant potential to improve performance.

4.1.2 GPU Platform

Runtime and energy breakdown. Figure 10 shows the time and energy breakdown of approximable and non-approximable parts of applications when they run on a GPU. On average, applications spend 53.4% of their runtime and 56.0% of their energy usage in approximable regions. Some applications such as inversek2j and newton-raph spend more than 90% of their runtime and energy in approximable regions. This experiment clearly shows that approximate computing has a strong opportunity to improve the execution time and energy efficiency of GPUs across a large number of domains such as signal processing, scientific computing, and multimedia.

Figure 10: Total application runtime and energy breakdown between non-approximable and approximable regions in GPU platform.

Figure 11: Approximable region L1 instruction cache misses per kilo instructions in GPU platform. L1 instruction cache size is 2 KBs.

Figure 12: Approximable region L1 data cache misses per kilo instructions in GPU platform. L1 data cache size is 16 KBs.

Figure 13: On-chip bandwidth utilization in GPU platform. The baseline on-chip bandwidth is 443.5 GB/s.

Figure 14: Total application L2 data cache misses per kilo instructions in GPU platform. L2 data cache size is 768 KBs.

L1-I and L1-D misses per kilo instructions (MPKI). Figures 11 and 12 present the number of L1-I and L1-D misses per kilo instructions of the approximable regions of our benchmarks. We find that the instruction footprints of GPU benchmarks tend to be smaller than those of CPU benchmarks. Consequently, the number of L1-I misses per kilo instructions of GPU benchmarks is smaller than that of CPU benchmarks. On the other hand, the number of L1-D misses per kilo instructions for GPU benchmarks is significant. Across all benchmarks, the number of L1-D misses per kilo instructions ranges from less than 0.25 in srad to 5 in jmeint.

On-chip bandwidth utilization. Figure 13 shows the on-chip bandwidth utilization of the GPU across all benchmarks. Similar to the results for the CPU, on-chip bandwidth is underutilized for all the benchmarks; the utilization is under 25% for all of them. As compared to a CPU, a GPU benefits from thread-level parallelism and executes many threads in parallel (our CPU executes one thread, while each SM can execute 64 threads in parallel and there are 15 SMs in the GPU). The underutilization of on-chip bandwidth is due to the facts that (1) GPU cores (SIMD lanes) are simpler and less capable than CPU cores and (2) the GPU has higher on-chip bandwidth as compared to the CPU: our CPU has 33.77 GB/sec of on-chip bandwidth while the GPU benefits from 443.5 GB/sec (a factor of 13x higher bandwidth). These results suggest that an approximation technique that improves the performance of GPU workloads is unlikely to hit the on-chip bandwidth bottleneck.

LLC misses per kilo instructions (MPKI). Figure 14 shows the number of LLC misses per kilo instructions of the approximable regions of AxBench's benchmarks. We find that many benchmarks in our suite have data working sets larger than the LLC capacity (i.e., 768 KBs), and consequently, observe a large number of LLC misses. The number of LLC misses per kilo instructions ranges from 0.05 in meanfilter to 4.9 in jmeint.

Off-chip bandwidth utilization. Figure 15 shows the off-chip bandwidth utilization of the GPU benchmarks. Some of the GPU benchmarks have high off-chip bandwidth utilization (more than 40%). The high utilization is because GPUs benefit from executing many threads in parallel. Across the benchmarks, the off-chip bandwidth utilization ranges from less than 5% in meanfilter and newton-raph to about 60% in blackscholes and srad. The results suggest that off-chip bandwidth is a bottleneck for some GPU benchmarks and a successful approximation technique should reduce the off-chip traffic to be useful for these applications.

Figure 15: Off-chip bandwidth utilization in GPU platform. The baseline off-chip bandwidth is 177.4 GB/s.

Figure 16: Approximable region instruction per cycle (IPC) for each thread in GPU platform. The maximum IPC for each thread is one.

Figure 17: Total application runtime, energy, and area breakdown between non-approximable and approximable regions in dedicated hardware design.

Instruction per cycle (IPC). Figure 16 shows the average number of instructions committed per cycle when the approximable parts of the applications execute on the GPU. Each GPU thread can execute one instruction per cycle. Due to inefficiencies in instruction and data delivery, the average number of instructions committed per cycle is less than 1. The IPC ranges from less than 0.2 in jmeint to near 0.9 in fastwalsh. While the IPC of each thread is less than 1, as a GPU's SM executes 64 threads in parallel, GPUs execute a significantly higher number of instructions per cycle as compared to CPUs. Nevertheless, there are opportunities, as manifested by Figure 10, for approximate computing techniques to improve the performance and energy efficiency of GPU applications.

4.1.3 Dedicated Hardware

AxBench provides a set of Verilog benchmarks for approximate hardware design. In this section, we characterize these benchmarks to show their effectiveness for approximate computing.

Runtime, energy, and area breakdown. Figure 17 shows the time, energy, and area breakdown of approximable and non-approximable parts of the benchmarks when we synthesized them using Synopsys Design Compiler (DC). On average and across all benchmarks, 92.4% of the time, 69.4% of the energy, and 70.1% of the area go to approximable parts. Some applications such as forwardk2j, inversek2j, kmeans, and neural network spend more than 90% of their runtime, energy, and area in approximable parts. All benchmarks spend at least 90% of their runtime on approximable parts. For energy and area, some benchmarks spend less than 50% on approximable parts. However, the majority of the benchmarks spend more than 50% of the area and energy on approximable parts. This experiment clearly shows that there is a strong opportunity for approximate computing to improve runtime, energy, and area of dedicated hardware across a large number of domains such as machine learning, robotics, and signal processing.

4.2 Evaluation of Prior Approximation Techniques

In this part, we use the AxBench benchmarks to evaluate some of the previously proposed approximation techniques. For the CPU and GPU platforms, we evaluate loop perforation [1] and neural processing units (NPUs) [2-4]. For dedicated hardware, we evaluate Axilog [5].

4.2.1 CPU Platform

Figure 18 compares loop perforation and neural processing unit (NPU) accelerators for improving speedup and energy efficiency of the CPU benchmarks. The maximum quality degradation is set to 10%. We restrict the degree of loop perforation and NPU invocations to limit the quality degradation to 10%.

Across all benchmarks except kmeans and canneal, CPU+NPU offers higher speedup and energy reduction as compared to CPU+Loop Perforation. The approximable region in canneal and kmeans consists of few arithmetic operations. Therefore, the communication overhead kills the potential benefits of NPU acceleration. The maximum speedup and energy reduction is registered for inversek2j: loop perforation offers 8.4x speedup and 4.7x energy reduction, and NPU offers 11.1x speedup and 13.1x energy reduction. The average speedup (energy reduction) for loop perforation and NPU is 2.0x and 2.7x (1.6x and 2.4x), respectively.

While NPU is superior to loop perforation for improving the efficiency of CPUs, it cannot reach the peak potential because it does not reduce the number of L1-D misses (it does not approximate loads). Unlike instruction accesses, which get eliminated by the neural transformation, data accesses remain the same. Figure 3 indicates that about 80% of the time and energy of AxBench's benchmarks are spent on approximable parts, which suggests that the maximum speedup and energy improvement of an approximate computation technique is 5x. NPUs only realize half of this opportunity, mainly because they do nothing for data misses.
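The 5x figure follows from an Amdahl-style bound if one assumes (our reading, not spelled out above) that an ideal technique drives the cost of the approximable fraction f of execution to zero:

    \[
      \text{speedup}_{\max} \;=\; \frac{1}{(1-f) + f/s}
      \;\le\; \frac{1}{1-f} \;=\; \frac{1}{1-0.8} \;=\; 5\times,
    \]

where s is the speedup of the approximable part alone. NPUs averaging roughly 2.7x therefore capture about half of this bound.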

Figure 18: The comparison between NPU acceleration [2] and loop perforation [1] approximation techniques in CPU platform with 10% quality degradation: (a) speedup and (b) energy reduction.

Figure 19: The comparison between NPU acceleration [3, 4] and loop perforation [1] approximation techniques in GPU platform with 10% quality degradation: (a) speedup and (b) energy reduction.

4.2.2 GPU Platform

Figure 19 compares loop perforation and neural processing unit (NPU) accelerators for improving speedup and energy efficiency of the GPU benchmarks. The maximum quality degradation is set to 10%. We restrict the degree of loop perforation and NPU invocations to limit the quality degradation to 10%.

Across all benchmarks, GPU+NPU offers higher speedup and energy reduction as compared to GPU+Loop Perforation. The maximum speedup and energy reduction is registered for newton-raph (14.3x) and inversek2j (18.9x), respectively. The average speedup (energy reduction) for loop perforation and NPU is 1.1x and 2.3x (1.3x and 2.6x), respectively.

Unlike CPU+NPU, which only realizes half of the opportunity, GPU+NPU realizes all of the opportunity of approximate computation. The numbers in Figure 10 suggest the maximum speedup and energy reduction of an approximation technique to be 3.1x and 3.0x, respectively. As Figure 19 shows, GPU+NPU realizes 72.5% (83.4%) of the speedup (energy reduction) opportunity. While NPU does nothing for data misses, a GPU executes many threads in parallel to hide data misses. Consequently, massively parallel GPUs augmented with neural accelerators achieve the peak potential of approximate computation. The only exceptions are blackscholes and srad, which have high off-chip bandwidth utilization (the off-chip bandwidth is saturated). As the off-chip bandwidth is a bottleneck, NPU accelerators cannot offer speedup, though they improve the energy efficiency of the GPU.

4.2.3 Dedicated Hardware

We evaluate the Axilog hardware approximation technique [5] using the AxBench benchmarks. We set the maximum output quality degradation to 10%. We apply Axilog to each benchmark to the extent in which the 10% output quality degradation is preserved. Figure 20 shows the energy and area reduction of applying Axilog to the benchmarks. We do not include a graph for speedup as Axilog does not affect the performance of the benchmarks.

Figure 20: Reduction in (a) energy and (b) area for the AxBench ASIC benchmarks by using the Axilog [5] hardware approximation technique. The quality degradation is set to 10%.

Axilog is quite effective at reducing the energy and area needed by the benchmarks. The energy reduction across all benchmarks ranges from 1.1x in fir to 1.9x in inversek2j with a geometric mean of 1.5x. The area reduction ranges from 1.1x in fir to 2.3x in brent-kung with a geometric mean of 1.9x. Figure 17 shows that roughly 70% of the energy and area of the benchmarks, on average, is due to approximable parts, which translates to a maximum area and energy reduction of 3.3x. While Axilog is effective at reducing the energy and area usage of dedicated hardware, there is still a significant opportunity for innovative approximate computation techniques at the hardware level.

5 Related Work

Approximate computing. Recent work has explored various approximation techniques across the system stack and for different frameworks such as CPUs, GPUs, and hardware design, including: (a) programming languages [9, 10, 48], (b) software [1, 29, 30, 40, 54, 55], (c) memory systems [44, 46], (d) circuit level [11, 12, 34, 36, 56], (e) approximate circuit synthesis [5, 14, 15, 57, 58], (f) memoization [29, 59, 60], (g) limited fault recovery [61-67], and (h) neural acceleration [2-4, 31, 68-73]. However, prior work does not provide benchmarks for approximate computing. In contrast, this work is an effort to address the needed demand for benchmarking and workload characterization in approximate computing. Distinctively, we introduce AxBench, a diverse set of benchmarks for CPU, GPU, and hardware design frameworks. AxBench may be used by different approximation techniques to study the limits, challenges, and benefits of the techniques.

Benchmarking and workload characterization. There is a growing body of work on benchmarking and workload characterization, which includes: (a) machine learning [74-76], (b) big data analytics [77-79], (c) heterogeneous computing [80], (d) scientific computing [81], (e) bioinformatics [82], (f) multi-threaded programming [83], (g) data mining [84], (h) embedded computing [85], (i) computer vision [86, 87], and (j) general-purpose computing [88, 89]. However, our work contrasts with all the previous work on benchmarking, as we introduce a set of benchmarks that falls into a different category. We introduce AxBench, a set of diverse and multi-framework benchmarks for approximate computing. To the best of our knowledge, AxBench is the first effort towards providing benchmarks for approximate computing. AxBench accelerates the evaluation of new approximation techniques and provides further support for the needed development in this domain.

6 Conclusion

As approximate computing gains popularity on different computing platforms such as CPUs and GPUs, and across the system stack, it is important to have a diverse, representative, and multi-platform set of benchmarks. A benchmark suite with these features facilitates fair evaluation of approximation techniques and speeds up progress in the approximate computing domain. This work expounds AxBench, a multi-framework benchmark suite for approximate computing across the system stack. AxBench includes 9 benchmarks for CPUs, 11 benchmarks for GPUs, and 9 benchmarks for hardware design. We extensively characterize all the benchmarks and provide guidelines for the potential approximation techniques that can be applied across the system stack.

Furthermore, we evaluate loop perforation [1] and neural processing units (NPUs) [2-4] using the CPU and GPU benchmarks, and Axilog [5] using the hardware design benchmarks, to show the effectiveness of AxBench. We find that previously proposed techniques are effective at improving performance or energy efficiency. However, there remains significant opportunity to be explored by future approximation techniques. While all the AxBench benchmarks are ready, we have not released them publicly to preserve anonymity.

7 Acknowledgments

This work was supported by a Qualcomm Innovation Fellowship, NSF award CCF#1553192, Semiconductor Research Corporation contract #2014-EP-2577, and a gift from Google.

References

Figure 20: Reduction in (a) energy and (b) area for the AxBench ASIC benchmarks using the Axilog [5] hardware approximation technique; the quality degradation is set to 10%. [Panel (b), "Area Reduction," reports the area reduction (×) for fir, sobel, kmeans, forwardk2j, inversek2j, brent-kung, kogge-stone, wallace-tree, neural network, and the geometric mean.]

big data analytics [77–79], (c) heterogeneous computing [80], (d) scientific computing [81], (e) bioinformatics [82], (f) multithreaded programming [83], (g) data mining [84], (h) embedded computing [85], (i) computer vision [86, 87], and (j) general-purpose computing [88, 89]. However, our work contrasts with all prior work on benchmarking, as we introduce a set of benchmarks that falls into a different category: AxBench, a diverse, multi-framework set of benchmarks for approximate computing. To the best of our knowledge, AxBench is the first effort toward providing benchmarks for approximate computing. AxBench accelerates the evaluation of new approximation techniques and provides the support needed for further development in this domain.

6 Conclusion

As approximate computing gains popularity on computing platforms such as CPUs and GPUs, and across the system stack, it is important to have a diverse, representative, and multi-platform set of benchmarks. A benchmark suite with these features facilitates fair evaluation of approximation techniques and speeds up progress in the approximate computing domain. This work presents AxBench, a multi-framework benchmark suite for approximate computing across the system stack. AxBench includes 9 benchmarks for CPUs, 11 benchmarks for GPUs, and 9 benchmarks for hardware design. We extensively characterize all the benchmarks and provide guidelines for the potential approximation techniques that can be applied across the system stack.
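As a concrete illustration of the kind of software-level technique such benchmarks are meant to evaluate, the sketch below applies loop perforation [1] to a simple reduction kernel in C. The kernel, the test data, and the perforation stride are hypothetical and chosen only for exposition; this is a minimal sketch of the technique, not code from the AxBench sources.

    #include <stdio.h>

    /* Loop perforation [1]: skip a fraction of the loop iterations
     * (here, all but every stride-th one) and rescale the result,
     * trading output quality for reduced work. */
    static double perforated_mean(const double *x, int n, int stride)
    {
        double sum = 0.0;
        int sampled = 0;
        for (int i = 0; i < n; i += stride) { /* stride > 1 perforates the loop */
            sum += x[i];
            sampled++;
        }
        return sampled > 0 ? sum / sampled : 0.0; /* mean over the sampled subset */
    }

    int main(void)
    {
        double x[1000];
        for (int i = 0; i < 1000; i++)
            x[i] = (double)(i % 17); /* arbitrary test data */
        printf("exact mean:      %f\n", perforated_mean(x, 1000, 1));
        printf("perforated mean: %f\n", perforated_mean(x, 1000, 4)); /* ~4x fewer iterations */
        return 0;
    }

An application-specific quality metric, such as the relative error of the perforated output against the exact one, then determines whether a given perforation rate keeps the degradation within the target quality budget.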

References

[1] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, "Managing performance vs. accuracy trade-offs with loop perforation," in FSE, 2011.
[2] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[3] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, "Neural acceleration for GPU throughput processors," in MICRO, 2015.
[4] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, "Neural acceleration for GPU throughput processors," Tech. Rep. GT-CS-15-05, 2015.
[5] A. Yazdanbakhsh, D. Mahajan, B. Thwaites, J. Park, A. Nagendrakumar, S. Sethuraman, K. Ramkrishnan, N. Ravindran, R. Jariwala, A. Rahimi, H. Esmaeilzadeh, and K. Bazargan, "Axilog: Language support for approximate hardware design," in DATE, Mar. 2015.
[6] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, "Conservation cores: Reducing the energy of mature computations," in ASPLOS, 2010.
[7] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in ISCA, 2011.
[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Toward dark silicon in servers," IEEE Micro, vol. 31, pp. 6–15, July–Aug. 2011.
[9] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, "EnerJ: Approximate data types for safe and general low-power computation," in PLDI, 2011.
[10] M. Carbin, S. Misailovic, and M. C. Rinard, "Verifying quantitative reliability for programs that execute on unreliable hardware," in OOPSLA, 2013.
[11] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, "Ultra-efficient (embedded) SOC architectures based on probabilistic CMOS (PCMOS) technology," in DATE, 2006.
[12] L. Leem, H. Cho, J. Bau, Q. Jacobson, and S. Mitra, "ERSA: Error resilient system architecture for probabilistic applications," in DATE, 2010.
[13] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, "IMPACT: Imprecise adders for low-power approximate computing," in ISLPED, 2011.
[14] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan, "SALSA: Systematic logic synthesis of approximate circuits," in DAC, 2012.
[15] K. Nepal, Y. Li, R. I. Bahar, and S. Reda, "ABACUS: A technique for automated behavioral synthesis of approximate computing circuits," in DATE, 2014.
[16] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in VLSI, 2011.
[17] D. Shin and S. K. Gupta, "Approximate logic synthesis for error tolerant applications," in DATE, 2010.
[18] S. Ramasubramanian, S. Venkataramani, A. Parandhaman, and A. Raghunathan, "Relax-and-retime: A methodology for energy-efficient recovery based design," in DAC, 2013.
[19] A. Kahng and S. Kang, "Accuracy-configurable adder for approximate arithmetic designs," in DAC, 2012.
[20] D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy, "Design of voltage-scalable meta-functions for approximate computing," in DATE, 2011.
[21] M. Kamal, A. Ghasemazar, A. Afzali-Kusha, and M. Pedram, "Improving efficiency of extensible processors by using approximate custom instructions," in DATE, 2014.
[22] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, "On reconfiguration-oriented approximate adder design and its application," in ICCAD, 2013.
[23] Y. Liu, R. Ye, F. Yuan, R. Kumar, and Q. Xu, "On logic synthesis for timing speculation," in ICCAD, 2012.
[24] A. Lingamneni, K. K. Muntimadugu, C. Enz, R. M. Karp, K. V. Palem, and C. Piguet, "Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling," in CF, 2012.
[25] J. Miao, A. Gerstlauer, and M. Orshansky, "Approximate logic synthesis under general error magnitude and frequency constraints," in ICCAD, 2013.
[26] S.-L. Lu, "Speeding up processing with approximation circuits," Computer, 2004.
[27] J. S. Miguel, M. Badr, and N. E. Jerger, "Load value approximation," in MICRO, 2014.
[28] B. Thwaites, G. Pekhimenko, H. Esmaeilzadeh, A. Yazdanbakhsh, O. Mutlu, J. Park, G. Mururu, and T. Mowry, "Rollback-free value prediction with approximate loads," in PACT, 2014.
[29] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, "Paraprox: Pattern-based approximation for data parallel applications," in ASPLOS, 2014.
[30] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, "SAGE: Self-tuning approximation for graphics engines," in MICRO, 2013.
[31] R. S. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger, "General-purpose code acceleration with limited-precision analog computation," in ISCA, 2014.
[32] D. Patterson, "For better or worse, benchmarks shape a field: Technical perspective," Communications of the ACM, vol. 55, p. 104, July 2012.
[33] P. D. Düben, J. Joven, A. Lingamneni, H. McNamara, G. De Micheli, K. V. Palem, and T. Palmer, "On the use of inexact, pruned hardware in atmospheric modelling," Philosophical Transactions of the Royal Society of London A, vol. 372, no. 2018, p. 20130276, 2014.
[34] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Architecture support for disciplined approximate programming," in ASPLOS, 2012.
[35] H. Kaul, M. Anders, S. Mathew, S. Hsu, A. Agarwal, F. Sheikh, R. Krishnamurthy, and S. Borkar, "A 1.45 GHz 52-to-162 GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS," in ISSCC, 2012.
[36] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, "Scalable stochastic processors," in DATE, 2010.
[37] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Quality programmable vector processors for approximate computing," in MICRO, 2013.
[38] I. Akturk, N. S. Kim, and U. R. Karpuzcu, "Decoupled control and data processing for approximate near-threshold voltage computing," IEEE Micro, vol. 35, no. 4, pp. 70–78, 2015.
[39] M. C. Rinard, "Using early phase termination to eliminate load imbalances at barrier synchronization points," in OOPSLA, 2007.

[40] W. Baek and T. M. Chilimbi, "Green: A framework for supporting energy-conscious programming using controlled approximation," in PLDI, 2010.
[41] Z. A. Zhu, S. Misailovic, J. A. Kelner, and M. Rinard, "Randomized accuracy-aware program transformations for efficient approximate computations," ACM SIGPLAN Notices, vol. 47, pp. 441–454, 2012.
[42] B. Thwaites, G. Pekhimenko, H. Esmaeilzadeh, A. Yazdanbakhsh, O. Mutlu, J. Park, G. Mururu, and T. Mowry, "Rollback-free value prediction with approximate loads," in PACT, 2014.
[43] P. Stanley-Marbell and M. Rinard, "Approximating outside the processor," 2015.
[44] A. Sampson, J. Nelson, K. Strauss, and L. Ceze, "Approximate storage in solid-state memories," in MICRO, 2013.
[45] M. Shoushtari, A. Banaiyan, and N. Dutt, "Relaxing manufacturing guard-bands in memories for energy savings," Center for Embedded Computer Systems, University of California, Irvine, Tech. Rep. CECS-TR-14-04, 2014.
[46] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, "Flikker: Saving refresh-power in mobile devices through critical data partitioning," in ASPLOS, 2011.
[47] S. Kirkpatrick, C. Gelatt Jr., M. Vecchi, and A. McCoy, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–679, 1983.
[48] J. Park, H. Esmaeilzadeh, X. Zhang, M. Naik, and W. Harris, "FlexJava: Language support for safe and modular approximate programming," in FSE, 2015.
[49] T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in MICRO, 2012.
[50] A. Patel, F. Afram, S. Chen, and K. Ghose, "MARSSx86: A full system simulator for x86 CPUs," in DAC, 2011.
[51] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in MICRO, 2009.
[52] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in ISPASS, 2009.
[53] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in ISCA, 2013.
[54] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, "Patterns and statistical analysis for understanding reduced resource computing," in Onward!, 2010.
[55] J. Sartori and R. Kumar, "Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications," IEEE Transactions on Multimedia, vol. 15, no. 2, 2013.
[56] R. Hegde and N. R. Shanbhag, "Energy-efficient signal processing via algorithmic noise-tolerance," in ISLPED, 1999.
[57] A. Ranjan, A. Raha, S. Venkataramani, K. Roy, and A. Raghunathan, "ASLAN: Synthesis of approximate sequential circuits," in DATE, 2014.
[58] A. Lingamneni, K. K. Muntimadugu, C. Enz, R. M. Karp, K. V. Palem, and C. Piguet, "Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling," in CF, 2012.
[59] C. Alvarez, J. Corbal, and M. Valero, "Fuzzy memoization for floating-point multimedia applications," IEEE Trans. Comput., vol. 54, no. 7, 2005.
[60] J.-M. Arnau, J.-M. Parcerisa, and P. Xekalakis, "Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization," in ISCA, 2014.
[61] M. de Kruijf, S. Nomura, and K. Sankaralingam, "Relax: An architectural framework for software recovery of hardware faults," in ISCA, 2010.
[62] X. Li and D. Yeung, "Application-level correctness and its impact on fault tolerance," in HPCA, 2007.
[63] X. Li and D. Yeung, "Exploiting application-level correctness for low-cost fault tolerance," Journal of Instruction-Level Parallelism, 2008.
[64] M. de Kruijf and K. Sankaralingam, "Exploring the synergy of emerging workloads and silicon reliability trends," in SELSE, 2009.
[65] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, "Quality of service profiling," in ICSE, 2010.
[66] Y. Fang, H. Li, and X. Li, "A fault criticality evaluation framework of digital systems for error tolerant video applications," in ATS, 2011.
[67] V. Wong and M. Horowitz, "Soft error resilience of probabilistic inference applications," in SELSE, 2006.
[68] Z. Du, A. Lingamneni, Y. Chen, K. Palem, O. Temam, and C. Wu, "Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators," in ASP-DAC, 2014.
[69] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, "Continuous real-world inputs can open up alternative accelerator designs," in ISCA, 2013.
[70] B. Grigorian and G. Reinman, "Accelerating divergent applications on SIMD architectures using neural networks," in ICCD, 2014.

[71] B. Grigorian, N. Farahpour, and G. Reinman, "BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing," in HPCA, 2015.
[72] T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Esmaeilzadeh, L. Ceze, and M. Oskin, "SNNAP: Approximate computing on programmable SoCs via neural acceleration," in HPCA, 2015.
[73] L. McAfee and K. Olukotun, "EMEURO: A framework for generating multi-purpose accelerators via deep learning," in CGO, 2015.
[74] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, "LETOR: Benchmark dataset for research on learning to rank for information retrieval," in SIGIR Workshop on Learning to Rank for Information Retrieval, 2007.
[75] N. Tax, S. Bockting, and D. Hiemstra, "A cross-benchmark comparison of 87 learning to rank methods," Information Processing & Management, vol. 51, no. 6, pp. 757–772, 2015.
[76] O. D. Alcântara, Á. R. Pereira Junior, H. M. d. Almeida, M. A. Gonçalves, C. Middleton, and R. B. Yates, "WCL2R: A benchmark collection for learning to rank research with clickthrough data," 2010.
[77] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in HPCA, 2014.
[78] A. Yasin, Y. Ben-Asher, and A. Mendelson, "Deep-dive analysis of the data analytics workload in CloudSuite," in IISWC, 2014.
[79] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in ICDEW, 2010.
[80] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IISWC, 2009.
[81] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu, "Parboil: A revised benchmark suite for scientific and commercial throughput computing," Center for Reliable and High-Performance Computing, Tech. Rep., 2012.
[82] D. Bader, Y. Li, T. Li, and V. Sachdeva, "BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications," in IISWC, 2005.
[83] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in PACT, 2008.
[84] R. Narayanan, B. Özisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary, "MineBench: A benchmark suite for data mining workloads," in IISWC, 2006.
[85] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in WWC-4, 2001.
[86] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor, "SD-VBS: The San Diego vision benchmark suite," in IISWC, 2009.
[87] J. Clemons, H. Zhu, S. Savarese, and T. Austin, "MEVBench: A mobile computer vision benchmarking suite," in IISWC, 2011.
[88] SPEC, "SPEC CPU2000 and CPU2006 benchmarks."
[89] A. KleinOsowski and D. Lilja, "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research," IEEE Computer Architecture Letters, vol. 1, pp. 7–7, January 2002.
