
ApproxTuner: A Compiler and Runtime System for Adaptive Approximations

Hashim Sharif (University of Illinois at Urbana-Champaign, USA), Yifan Zhao (University of Illinois at Urbana-Champaign, USA), Maria Kotsifakou (Runtime Verification, Inc., USA), Akash Kothari (University of Illinois at Urbana-Champaign, USA), Ben Schreiber (University of Illinois at Urbana-Champaign, USA), Elizabeth Wang (University of Illinois at Urbana-Champaign, USA), Yasmin Sarita (Cornell University, USA), Nathan Zhao (University of Illinois at Urbana-Champaign, USA), Keyur Joshi (University of Illinois at Urbana-Champaign, USA), Vikram S. Adve (University of Illinois at Urbana-Champaign, USA), Sasa Misailovic (University of Illinois at Urbana-Champaign, USA), Sarita Adve (University of Illinois at Urbana-Champaign, USA)

Abstract
Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware optimization of tensor-based applications while requiring only high-level end-to-end quality specifications. ApproxTuner implements and manages approximations in algorithms, system software, and hardware.

The key contribution in ApproxTuner is a novel three-phase approach to approximation-tuning that consists of development-time, install-time, and run-time phases. Our approach decouples tuning of hardware-independent and hardware-specific approximations, thus providing retargetability across devices. To enable efficient autotuning of approximation choices, we present a novel accuracy-aware tuning technique called predictive approximation-tuning, which significantly speeds up autotuning by analytically predicting the accuracy impacts of approximations.

We evaluate ApproxTuner across 10 convolutional neural networks (CNNs) and a combined CNN and image processing benchmark. For the evaluated CNNs, using only hardware-independent approximation choices we achieve a mean speedup of 2.1x (max 2.7x) on a GPU, and a 1.3x mean speedup (max 1.9x) on the CPU, while staying within 1 percentage point of inference accuracy loss. For two different accuracy-prediction models, ApproxTuner speeds up tuning by 12.8x and 20.4x compared to conventional empirical tuning while achieving comparable benefits.

CCS Concepts: • Software and its engineering → Compilers.

Keywords: Approximate Computing, Compilers, Heterogeneous Systems, Deep Neural Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PPoPP '21, February 27-March 3, 2021, Virtual Event, Republic of Korea
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8294-6/21/02...$15.00
https://doi.org/10.1145/3437801.3446108

1 Introduction
With the ubiquitous deployment of highly compute-intensive machine-learning and big data processing workloads, optimizing these workloads is becoming increasingly important. A wide range of applications using these workloads are being deployed on both the cloud and the edge, including image processing, object classification, speech recognition, and face recognition [8, 43, 46]. Many of these workloads are very computationally demanding, which makes it challenging to achieve the desired performance levels on resource-constrained systems.

Many modern machine learning and image-processing devices. Fourth, the best accuracy-performance-energy trade- applications are inherently “approximate” in the sense that offs may vary significantly at run time, depending onsystem the input data are often collected from noisy sensors (e.g., load, battery level, or time-varying application requirements. image and audio streams) and output results are usually prob- 1.1 ApproxTuner System abilistic (e.g., for object classification or facial recognition). We present ApproxTuner, an automatic framework for Such applications can trade off small amounts of result quality accuracy-aware optimization of tensor-based applications (or accuracy) for improved performance and efficiency [60], (an important subset of edge computing applications). It finds where result quality is an application-specific property such application configurations that maximize speedup and/or en- as inference accuracy in a neural network or peak signal-to- ergy savings of an application, subject to end-to-end quality noise ratio (PSNR) in an image-processing pipeline. Previ- specifications. It addresses all of the challenges above, and is ous research has presented many individual domain-specific the first and only system to handle all ofthem, as we discuss and system-level techniques for trading accuracy for per- in Section9. formance. For instance, reduced precision models are wide- Tensor computations are widely used in machine learning spread in deep learning [3, 7, 14, 34, 45]. Recent specialized frameworks, and many important domains such as image accelerators incorporate hardware-specific approximations processing, scientific computing, and others [3, 31–33]. Ap- that provide significant improvements in performance and proxTuner focuses on tensor-based computations for two energy in exchange for relaxed accuracy [1, 2, 11, 29, 30, 48, reasons. First, limiting the system to these computations 59]. Such techniques can provide a crucial strategy to achieve enables algorithmic approximations specific to tensor opera- the necessary performance and/or energy for emerging ap- tions (in addition to generic software and hardware approxi- plications in edge computing. mations). Adding support for other classes of computations In practice, a realistic application (e.g., a neural network or can be done in a similar fashion, but is outside the scope of a combination of an image processing pipeline and an image this work. Second, many current and emerging hardware classification network) can make use of multiple approxima- accelerators focus on tensor computations [1, 2, 29, 30, 48], tion techniques for different computations in the code, each enabling novel hardware-specific approximations for these with its own parameters that must be tuned, to achieve the computations. ApproxTuner includes support for a novel ex- best results. For example, our experiments show that for the perimental, analog-domain accelerator [59] that exemplifies ResNet-18 network, which contains 22 tensor operations, the such approximation opportunities and associated challenges. best combination is to use three different approximations ApproxTuner tackles the last two challenges above — and with different parameter settings in different operations. 
A enables hardware-specific, yet portable, tuning and run-time major open challenge is how to automatically select, configure, adaptation – by decomposing the optimization process into and tune the parameters for combinations of approximation three stages: development-time, install-time and run-time. techniques, while meeting end-to-end requirements on energy, At development time, ApproxTuner selects hardware-inde- performance, and accuracy. pendent approximations and creates a tradeoff curve that To address this broad challenge, we need to solve sev- contains the approximations with highest quality and per- eral specific challenges. First, the variety of software and formance ApproxTuner found during search. At install time, hardware approximation choices, and the tuning knobs for the system refines this curve using hardware-specific opti- each of them, induce a large search space for accuracy-aware 91 mizations and performance measurements. The final tradeoff optimization – up to 7 configurations in our benchmarks. curve is included with the program binary. The program can Second, efficiently searching such large spaces is made even use the refined curve for the best static choices of approxima- more difficult because individual “candidate configurations” tions before the run, and it can further adapt these choices (sample points in the search space) can be expensive to mea- using the same curves based on run-time conditions. Us- sure on edge systems for estimating accuracy, performance ing the final tradeoff curve at run-time as well keeps the and energy. For example, measurement-based (aka, “empiri- overheads of run-time adaptation negligible. cal”) tuning for the ResNet50 neural network took 11 days on ApproxTuner tackles the first two challenges – and en- a server-class machine. Third, the diversity of hardware com- ables efficient navigation of the large tradeoff space andeffi- pute units often used in edge-computing devices [63] yields cient estimation of performance and quality by introducing diverse hardware-specific approximation options with vary- novel predictive approximation-tuning. Predictive approxi- ing accuracy-performance tradeoffs. Moreover, optimiza- mation-tuning uses one-time error profiles of individual tions in specialized hardware accelerators often create impor- approximations, together with error composition models tant opportunities for orders-of-magnitude improvements. for tensor-based applications, to predict end-to-end appli- These hardware-specific tuning opportunities are critical, cation accuracy. ApproxTuner also facilitates distributed but difficult to exploit without sacrificing software portabil- approximation-tuning since the error profile collection can ity, which is crucial in some important domains, like mobile happen at multiple client devices, while a centralized server ApproxTuner PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea can perform time-intensive autotuning. It makes install- time tuning (with hardware-specific approximations) fea- sible which would otherwise be prohibitively expensive on a single resource-constrained edge device. 1.2 Contributions In summary, our contributions are: • A system that combines a wide range of existing hard- ware, software and algorithmic approximations, supports diverse heterogeneous systems, and provides an easy-to- use programming interface for accuracy-aware tuning. 
Our results show that different kinds of approximations and approximation knobs are best suited for different ap- plications and also across sub-computations in the same application. • A novel three-phase accuracy-aware tuning technique that provides performance portability, retargetability to com- Fig. 1. ApproxTuner workflow. pute units with hardware-specific approximation knobs, systems [35, 57], which are briefly described below. Approx- and dynamic tuning. It splits tuning into: 1) construction Tuner takes as input programs written in Keras or PyTorch of tradeoff curves for hardware-independent approxima- for convolutional neural networks, or CNNs, or an extension tions at development-time, 2) mapping to hardware-specific of that can be compiled to HPVM [35] (for other tensor- approximations at install-time, and 3) a fast approximation based programs), and compiles them to the ApproxHPVM selection at runtime. internal representation (IR) [57]. ApproxHPVM manages the • A novel predictive approximation-tuning technique that compilation to various compute units (Section 2.1). Approx- uses compositional models for accuracy prediction to speed Tuner optimizes the computations in this IR using three up both development-time and install-time tuning. For phases: 1) development-time, 2) install-time, and 3) run-time two different accuracy-prediction models, our predictive (Section 2.2). ApproxTuner’s goal is to select combinations tuning strategy speeds up tuning by 12.8x and 20.4x com- of software and hardware approximations (detailed in Sec- pared to conventional empirical tuning while achieving tion 2.3) that meet performance, energy, and accuracy targets. comparable benefits. 2.1 Preliminaries and Terminology • Experimental evaluation for 11 benchmarks – 10 CNNs and HPVM and ApproxHPVM. HPVM IR is a dataflow graph- 1 combined CNN + image processing benchmark – shows: based parallel program representation that captures coarse- Hardware-independent Approximations. When using and fine-grain data and task parallelism [35]. HPVM provides only hardware-independent approximations, ApproxTuner a portable IR and a retargetable compiler framework that can achieves geomean speedup of 2.1x and energy reduction of target diverse compute units, e.g., CPUs, GPUs and FPGAs. 2x on GPU, geomean speedup of 1.3x and energy reduction ApproxHPVM extends HPVM with support for tensor of 1.3x on CPU, with 1 percentage point accuracy loss. operations and limited support for accuracy-aware tuning Hardware Approximations. At install time, ApproxTuner (see Section9)[ 57]. The tensor operations represent data- maps tensor operations to an energy-efficient analog com- parallel computations on tensors, commonly used in CNNs pute accelerator, PROMISE, and provides a geomean en- and image processing applications. ApproxHPVM also adds ergy reduction of 3.25x (across benchmarks) with 1 per- back-ends to the HPVM system for a few custom machine centage point drop in inference accuracy. ApproxTuner learning accelerators/ libraries, including one for cuDNN exploits such hardware-specific approximations without on NVIDIA GPUs and one for a programmable analog in- sacrificing object-code portability. memory compute accelerator called PROMISE [59]. Runtime Adaptation for Approximations. 
To counter- This work focuses on approximations for a set of prede- act performance slowdowns imposed by runtime condi- fined tensor operations included in ApproxHPVM, such as tions such as low-power modes, ApproxTuner can dy- convolutions, matrix multiplication, ReLU, map, and reduce. namically tune approximation knobs with extremely low We refer to Sharif et al.[57, Table 1] for the definition of these run-time overheads, by using tradeoff curves shipped with operations in ApproxHPVM. These operations are the units the application binary. of and approximation in ApproxTuner, where a 2 ApproxTuner Overview schedule is a mapping of tensor operations to compute units Figure1 shows the high-level workflow for ApproxTuner. Ap- in the target system. Hardware-independent approximations proxTuner builds on the HPVM and ApproxHPVM compiler for an operation are those whose impact on the outputs (i.e., PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea H. Sharif et al. the semantics) of a program is fixed, regardless of the hard- Section5. We also define a relaxed version of the curve, ware used to execute the operation. Examples include filter 푃푆휀 , which includes tradeoff points that are within 휀 distance sampling, perforated convolutions, and reduced precision, from points in the Pareto set: described below (Section 2.3). Some approximations, like ∗ ∗ 푃푆 (푆) = { 푠 | 푠 ∈ 푆 ∧ ∃푠 ∈ 푃푆 (푆) . dist(푠, 푠 ) ≤ 휀 } (2) the use of (IEEE) FP16 instead of FP32 may have hardware- 휀 independent semantics, and yet may be implemented in hard- Here, dist is the usual Euclidean distance in the tradeoff space. ware for efficiency (in fact, they may be too inefficient to 2.2 Overview of Three Stages of ApproxTuner use without hardware support, possibly making them non- Our goal is to automatically select approximation knobs portable). Other approximations are called hardware-specific; that minimize energy and/or maximize performance for a the output quality impact of these approximations is spe- given program or its component, while satisfying some user- cific to the choice of hardware used. The impact onenergy specified end-to-end QoS constraint. We refer to this taskas and performance will usually be hardware-dependent for all approximation-tuning and do it in three stages: kinds of approximations. The development-time stage (Section3) computes a Quality of Service. A quality-of-service (QoS) metric is a tradeoff curve using only hardware-independent approxima- (usually domain-specific) metric over the outputs of some tions. ApproxTuner takes as input a program and a QoS con- computation, such as inference accuracy or PSNR. The QoS straint, and generates a set of possible configurations, 푆0, that 푄표푆 푇 metric function tor a tensor-based program, : out × optimize a hardware-agnostic performance metric (e.g., oper- 푇 푇 gold → takes the output tensor out and the exact output ation count) and produce QoS values within the constraint. In 푇 gold, and produces a scalar value. practice, the QoS and performance calculated at development A QoS constraint is a constraint over the QoS level of an time can be imprecise because they use hardware-agnostic application. As higher is conventionally better for QoS, we estimates of performance (e.g., operation count) and option- assume a QoS constraint is a lower bound; the opposite case ally of accuracy (with predictive tuning). To make it more can be treated similarly. 
likely we will capture points with good measured QoS and Knobs, Configurations, and Tradeoff Curves. An ap- performance, we create the relaxed tradeoff curve, 푃푆휀 (푆0), proximation knob is a discrete-valued parameter of an ap- which is shipped with the application. proximation method (represented using integers in Approx- The install-time stage (Section4) uses the development- Tuner) that can be modified to control the quality, energy, time tradeoff curves and measures the actual performance and running time. A zero value denotes no approximation. of each configuration in those tradeoff curves on the target Each tensor operation may be assigned an approximation hardware. Next, we create a refined tradeoff curve, 푃푆 (푆∗), 표푝 → ∗ choice or none. A configuration is a map Config : Int where 푆 is the 푃푆휀 curve from development-time updated that assigns an approximation knob value to every tensor with real performance measurements. This stage can be run operation in the program. The search space is the set of all any time that the target hardware is available. possible configurations. If hardware-specific approximations (not known at devel- A tradeoff point is a triple, (QoS, Perf, config), which speci- opment time) are available, which may also change the QoS fies the quality-of-service and the performance of the con- metrics, the install-time stage performs a new autotuning figuration on specified inputs. The set of all tradeoff points, step that includes the hardware approximations and gener- denoted S, represents the tradeoff space. To compare trade- ates a new tradeoff curve. off points, we define a dominance relation (≼) in the usual The run-time stage (Section5) takes the program’s final 푠 , , way [18]: a point 1 = (QoS1 Perf1 config1) is dominated by tradeoff curve from the install-time phase and uses itfor 푠 , , a point 2 = (QoS2 Perf2 config2) iff it has both lower QoS dynamic approximation tuning. The system can track various 푠 푠 and worse performance. Formally, 1 ≼ 2 iff QoS1 ≤ QoS2 metrics (e.g., load, power, and frequency variations) and 푠 푠 and Perf1 ≤ Perf2. Strict dominance, 1 ≺ 2, is defined as provide feedback to the dynamic control, which computes a dominance when the two points have unequal QoS or Perf. target speedup (and configuration) to maintain the required A search algorithm explores a subset of the tradeoff space level of performance. 푆 ⊆ S to find desirable tradeoff points. One way to describe 2.3 Approximation Methods desirable points is through Pareto sets. A Pareto set of a set 푆 ApproxTuner is extensible to a wide range of software and is its subset that consists of non-dominated points: hardware approximations. This work evaluates five approxi- ′ ′ 푃푆 (푆) = { 푠 | 푠 ∈ 푆 ∧ ∀푠 ∈ 푆 . 푠 ⊀ 푠 } (1) mations below – the first three are software (hardware-inde- This set defines a tradeoff curve, which contains linear seg- pendent) techniques implemented for CPUs and GPUs; the ments between the points in the Pareto set. Intuitively, these fourth is a previously proposed experimental hardware ac- points have the best tradeoffs that the search algorithm was celerator representing a hardware-specific approximation; able to find. We describe how to select the configurations and the fifth (reduced floating point precision, IEEE FP16) or combinations of configurations from a tradeoff curve in has hardware-independent semantics. ApproxTuner PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea
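To make the dominance and Pareto-set definitions of Section 2.2 (Equations 1 and 2) concrete, the following is a minimal Python sketch of selecting a tradeoff curve from a set of (QoS, performance, configuration) points. It is illustrative only; the function and variable names are ours and do not correspond to ApproxTuner's implementation.

```python
import math

def pareto_set(points):
    """Keep the non-dominated (qos, perf, config) triples (Equation 1)."""
    def dominated(p, q):
        # q strictly dominates p if it is at least as good in both QoS and
        # performance, and strictly better in at least one of them.
        return q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
    return [p for p in points if not any(dominated(p, q) for q in points)]

def relaxed_pareto_set(points, eps):
    """PS_eps: points within Euclidean distance eps of some Pareto point (Equation 2)."""
    frontier = pareto_set(points)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    return [p for p in points if any(dist(p, q) <= eps for q in frontier)]

# Example: each candidate is (QoS, performance, config label).
candidates = [(89.0, 1.0, "c0"), (88.5, 1.6, "c1"), (88.4, 1.5, "c2"), (87.9, 2.1, "c3")]
curve = relaxed_pareto_set(candidates, eps=0.2)   # keeps c0, c1, c3 and the nearby c2
```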

Filter sampling for convolutions. Li et al. [42] proposed filter sampling, 18 knobs for perforation, and for each per- an approximation technique that compresses convolution foration/sampling knob, both FP32 and FP16 are supported. filters by removing feature maps (i.e., channels) that have Convolution operations can also be mapped to an FP32-only relatively low L1-norms (sum of absolute values). Based on (considered most-accurate) or FP16-only variant - adding 2 this, we implement our own variant of filter sampling that more knobs. PROMISE hardware offers 7 knobs. Currently, supports dynamic knobs for approximation. Our implemen- we do not combine perforation, sampling, and PROMISE at tation prunes an equal fraction of filter elements across all the same operation. In total, there are 63 = (9 * 2 + 18 * 2 + 2 feature maps with a fixed stride. We vary the initial offset at + 7) knobs for each convolution operation, 8 = 3 * 2 + 2 knobs which elements are skipped; this has a noticeable impact on for each reduction, and 2 choices for other tensor operations. overall accuracy, as different offsets align with more or less 3 Development-time Tuning important filter elements. This approximation has 9 knob At development time, we tune the application for hardware- settings in all: i) three possible sampling rates: 50% (skip 1 independent approximations. We propose predictive approxi- out of 2), 33% (skip 1 out of 3), and 25% (skip 1 out of 4) and mation tuning, which decomposes tuning into a profile col- ii) offset values of 0...푘 − 1 for a skip rate of 1-out-of-푘. lection phase and an autotuning phase. In profile collection Perforated convolutions. Figurnov et al. [17] proposed phase, one operation at a time is approximated and its impact perforated convolutions; an algorithmic approximation that on QoS and performance is measured. The autotuning phase computes a subset of the convolution output tensor, and uses these profiles and compositional models to predict the interpolates the missing values using nearest neighbor av- end-to-end QoS and performance impact of multiple approx- eraging of computed tensor elements. Our implementation imations applied simultaneously, and uses these to guide skips rows and columns of the tensor operation outputs at a the autotuning search. This approach does not invoke the regular stride (again, with varying initial offsets), then inter- program binary for every autotuning iteration. In contrast, polates the missing output elements. It has 18 knob settings conventional empirical autotuning evaluates a configuration in all: 1) skipping rows or columns, 2) skip rate (1-out-of-푘): by actually running the program binary (e.g., CNN inference) 50%, 33%, and 25%, and 3) 푘 choices of initial offset. which can be expensive, especially when many autotuning Reduction sampling. Zhu et al. [67] proposed reduction iterations are required to explore large search spaces. sampling; an algorithmic approximation that computes a reduction operation using a subset of inputs. Our implemen- 3.1 Overview of Predictive Tuning tation supports 3 knob values for sampling ratio: 50%, 40%, Algorithm1 describes our predictive approximation tuning and 25%. For reductions like average, sum, or multiply, we in the function PredictiveTuner. It consists of five steps: scale the result by an appropriate constant. • Profile collection: Lines 12-15 in Algorithm1 collect a PROMISE, an approximate analog accelerator. [59]. 
Our QoS profile for the given application per operation, per system considers PROMISE for tensor convolutions and ma- knob setting (§ 3.2). trix multiplications. PROMISE is an analog chip and its volt- • QoS predictor refinement: Lines 18-20 refine a param- age swings introduce statistical (normally distributed) errors eter 훼 of the QoS prediction model (§ 3.3) used in the fol- in the output values. The knob values are 7 different voltage lowing autotuning, so that the predictor fits better to the levels (P1-P7), in increasing order of voltage (energy) and de- program to be tuned. creasing error. No mode in PROMISE produces exact results; • Autotuning: Lines 23-30 invoke an off-the-shelf auto- all voltage levels introduce some errors. Srivastava et al. [59] tuner to heuristically explore the configuration space. The show that PROMISE consumes 3.4-5.5x less energy and has QoS prediction model (§ 3.3) and performance prediction 1.4-3.4x higher throughput compared even to fully-custom model (§ 3.4) direct this search towards better configura- non-programmable digital accelerators. tions. We use the OpenTuner tool [5] which supports mul- IEEE FP16. Our implementation has FP16 versions of all tiple algorithms for exploring large configuration spaces. tensor operations, including FP16 variants for each of the It uses the estimates of QoS and performance from our knobs supported by filter sampling, perforated convolutions, models to calculate the fitness of the current configuration and reduction sampling. FP16 can be used or not for an and select the next one. operation; there is no additional knob value, though it can • Tradeoff curve construction: Line 33 selects autotuning combine with other approximation knobs. For instance, both configurations that are in or close to the Pareto set(§ 3.5). FP16 and 50% filter sampling can be applied to a convolution operation. Although FP16 requires hardware support, its • QoS validation: Lines 36-40 empirically measure the QoS semantics (impact on QoS) are hardware-independent and of configurations in the previous step, and select configu- hence can be accounted for in the development stage. rations with measured QoS greater than threshold. The above set offers many choices for approximation. For Our presentation focuses on time savings. Our autotuning each convolution operation, ApproxTuner offers 9 knobs for can be directly adapted for tuning other goals such as energy savings by providing a corresponding prediction model. PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea H. Sharif et al.
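To illustrate the knob semantics of filter sampling and perforated convolutions (skip rate 1-out-of-k plus an initial offset), here is a simplified NumPy sketch. It only mimics the knob behavior and is not the optimized CUDA kernel used by ApproxTuner; in particular, the rescaling of sampled filters and the dense-convolution stand-in are assumptions made for clarity.

```python
import numpy as np

def sample_filter(weights, skip_every=2, offset=0):
    # Filter sampling knob: prune 1-out-of-skip_every filter elements,
    # starting at `offset`, with a fixed stride across all feature maps.
    w = weights.astype(np.float32).reshape(-1)
    skipped = np.zeros(w.size, dtype=bool)
    skipped[offset::skip_every] = True
    w[skipped] = 0.0
    # Rescale the remaining elements (an assumption of this sketch,
    # to roughly preserve output magnitude).
    w[~skipped] *= w.size / float((~skipped).sum())
    return w.reshape(weights.shape)

def perforate_rows(dense_conv, x, w, skip_every=2, offset=0):
    # Perforation knob: compute only a subset of output rows and fill the
    # skipped rows from their nearest computed neighbor. A real kernel would
    # avoid computing the skipped rows; here a caller-supplied dense
    # convolution (`dense_conv`) keeps the sketch short.
    full = dense_conv(x, w)
    rows = np.arange(full.shape[0])
    skipped = (rows - offset) % skip_every == 0
    computed = rows[~skipped]
    out = np.empty_like(full)
    out[computed] = full[computed]
    for r in rows[skipped]:
        nearest = computed[np.argmin(np.abs(computed - r))]
        out[r] = out[nearest]
    return out
```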

Algorithm 1: Predictive Approximation Tuning

 1  Inputs:
 2    • P: target program
 3    • C: calibration inputs for profiling
 4    • K: knobs that apply to each operator in P
 5    • QoSmin: minimal acceptable QoS
 6    • nCalibrate: number of calibration runs used to tune α
 7    • nIters: number of autotuning iterations
 8    • ε1, ε2: maximum distances of a configuration to the Pareto set
 9  Output: A tradeoff curve for P
10  Function PredictiveTuner(P, C, K, QoSmin, nCalibrate, nIters, ε1, ε2)
11    // Step 1: Collect QoS Profile
12    map qosProfiles;
13    foreach (op, knob) ∈ K do
14      (ΔQ, ΔT) = gatherQoSProfile(P, C, op, knob);
15      qosProfiles[(op, knob)] = (ΔQ, ΔT);
16
17    // Step 2: Initialize and tune predictor to find coefficient α
18    autotuner = AutoTuner(P, K, nIters, QoSmin);
19    predictor = Predictor(qosProfiles);
20    α = predictor.calibrate(autotuner, nCalibrate);
21
22    // Step 3: Autotune with QoS and perf. prediction models
23    set candidateConfigs;
24    while autotuner.continueTuning() do
25      config = autotuner.nextConfig();
26      predQoS = predictor.calculateQoS(config, α);
27      predPerf = predictor.calculatePerf(config);
28      autotuner.setConfigFitness(config, predQoS, predPerf);
29      if predQoS > QoSmin then
30        candidateConfigs ∪= (predQoS, predPerf, config);
31
32    // Step 4: Take configs within ε distance of the tradeoff curve
33    paretoConfigs = PS_ε1(candidateConfigs);
34
35    // Step 5: Filtering invalid configurations at the end of tuning
36    set filteredConfigs;
37    foreach (predQoS, predPerf, config) ∈ paretoConfigs do
38      realQoS = measureRealQoS(P, C, config);
39      if realQoS > QoSmin then
40        filteredConfigs ∪= (realQoS, predPerf, config);
41    return PS_ε2(filteredConfigs);

3.2 Gathering QoS Profiles
QoS profiles are gathered for each unique pair of tensor operation and approximation knob. Algorithm 1 infers (Line 13) a list of such (op, knob) pairs from the input K, which is a mapping from each tensor operation in program P to the set of knobs applicable to it. The profiles are collected by running the entire program (with calibration inputs) but approximating a single operator at a time.

The QoS profile consists of: 1) an end-to-end QoS metric, e.g., classification accuracy in CNNs or mean square error (see Section 6), and 2) the final raw tensor output of the application, e.g., for CNNs, the output of the softmax operation. The profiles are stored as two tables, Q and T. Q maps (op, knob) to the corresponding end-to-end QoS. T maps (op, knob) to the raw tensor output T_out. We also measure and store the QoS and raw tensor output of the baseline version, which has no approximations, denoted QoS_base and T_base, respectively.

3.3 Models for QoS Prediction
We propose and evaluate two error composition models, Π1 and Π2, for a program P and configuration config ∈ Config:

    Π1(config) = QoS( T_base + α · Σ_{op ∈ P} ΔT(op, knob), T_gold )
    Π2(config) = QoS_base + α · Σ_{op ∈ P} ΔQ(op, knob)

where knob = config(op), α is the coefficient to be refined, and

    ΔT(op, knob) = T(op, knob) − T_base,
    ΔQ(op, knob) = Q(op, knob) − QoS_base.

The model Π1 computes the QoS of a configuration by 1) summing the errors in the end-to-end raw tensor outputs for each (op, knob) pair in config, 2) adding this sum to the baseline raw tensor output (T_base), and then 3) computing the QoS relative to the exact output (T_gold). The model captures how approximations affect the errors in the raw output (for a single approximation) and sums the effects of all, before applying the QoS function. Model Π1 works well with classification accuracy (as QoS metric) in scenarios where the relative probabilities of the predicted class(es) remain mostly unchanged despite changes in the absolute output values.

The model Π2 is a coarser-grained model that does not examine individual tensor outputs. Instead, it uses the QoS profile table Q to compute the QoS loss of a configuration by summing the end-to-end QoS loss for each (op, knob) pair in config. Π2 (which only sums scalar losses) is computationally less expensive than Π1 (which sums raw tensors), but is also relatively less precise, as shown in our evaluation. The scope of these models is discussed in Section 8.

Predictor Calibration using Regression. Both Π1 and Π2 can be viewed as linear regression models with a single coefficient α that scales the errors of each error profile. This coefficient allows the predictor to adapt to the specific error propagation within the application. On Line 20, predictor.calibrate evaluates the real QoS of a small number of configurations (e.g., 50 are sufficient in our experiments), and updates α so that the predicted QoS fits the observed QoS better.

3.4 Models for Performance Prediction
To guide the tuner, we use a simple hardware-agnostic performance prediction model. As a proxy for execution time, we use the count of compute and memory operations, computed analytically for each tensor op with closed-form expressions using the input tensor sizes, weight tensor sizes, strides, padding, etc. This calculation has negligible cost.

The total predicted execution cost of a configuration is the sum of the cost of each operator with the knob the tuner selected:

    Cost_Total(config) = Σ_{(op, knob) ∈ config} Cost(op, knob).
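As an illustration of the two predictors and the calibration step (Sections 3.2-3.4), here is a small Python sketch. It assumes the profile tables are available as dictionaries keyed by (op, knob) and that qos() is the application's QoS function (e.g., top-1 accuracy); the least-squares fit for α is one simple way to realize predictor.calibrate and is our assumption, not necessarily the exact procedure used in ApproxTuner.

```python
import numpy as np

def predict_qos_pi1(config, T_base, T_gold, delta_T, qos, alpha=1.0):
    # Pi_1: add the per-(op, knob) changes in the raw output tensor to the
    # baseline output, then evaluate the QoS function against the golden output.
    predicted = T_base.copy()
    for op, knob in config.items():
        predicted += alpha * delta_T[(op, knob)]
    return qos(predicted, T_gold)

def predict_qos_pi2(config, qos_base, delta_Q, alpha=1.0):
    # Pi_2: add the per-(op, knob) end-to-end QoS deltas to the baseline QoS.
    return qos_base + alpha * sum(delta_Q[(op, knob)] for op, knob in config.items())

def predict_cost(config, op_cost):
    # Cost_Total: sum the estimated cost of each operator under its chosen knob.
    return sum(op_cost[(op, knob)] for op, knob in config.items())

def calibrate_alpha(calib_configs, measured_qos, qos_base, delta_Q):
    # Fit the single coefficient alpha by least squares so that Pi_2
    # predictions track the QoS measured on a few calibration configurations.
    x = np.array([sum(delta_Q[(op, k)] for op, k in c.items()) for c in calib_configs])
    y = np.array(measured_qos, dtype=float) - qos_base
    return float(x @ y) / float(x @ x)
```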

We assume the compute and memory operation count is tuning. This step refines the shipped tradeoff curves with reduced by a factor that is proportional to the approxima- real performance (or other properties, such as energy us- tion level (e.g., 50% vs 25% perforation). Thus, we estimate age) measurements and creates a new tradeoff curve. When execution time of running operator op with approximation hardware-specific approximations exist on the target plat- knob knob by: form, distributed predictive tuning is invoked to further op- timize the program by exploiting those approximations, as 푁푚 (op) 푁푐 (op) 퐶표푠푡 (op, knob) = + , (3) described below. 푅 (knob) 푅 (knob) 푚 푐 Software-only knobs. In this case, all steps are done on the 푁 푁 where 푐 and 푚 are the analytically-computed number edge-device. It runs the configurations from the input trade- of compute and memory operations, respectively, for the off curve 푃푆 on the inputs from 퐶, and measures both the 표푝 푅 푅 휀 baseline (non-approximate) version of . 푚 and 푐 are real QoS and performance (development-time stage collected the corresponding reduction factors and are specific to the only QoS, but lacks access to real edge hardware to measure selected approximation knob. For example, for FP16 50% filter real performance), and filters the configurations that do not 푅 sampling, 푚 = 4 since the operator loads 2× fewer bytes satisfy the thresholds. Finally, for the resulting filtered set 푆∗ due to FP16 and performs 2× fewer loads due to sampling, it constructs the final tradeoff curve 푃푆 (푆∗), which has only 푅 and has 푐 = 2 since it skips half the computations. the points with best measured QoS-performance tradeoffs. The number of compute and memory operations does not Hardware-specific knobs. We distribute predictive tuning perfectly reflect actual speedup, as other factors change with across the server and edge-devices in three main steps, in or- the size of computation, such as cache friendliness. However, der to reduce the profiling and validation burden per device: for the same operator, an approximation that reduces more • This phase is distributed across edge-devices. Each de- compute and memory operations is likely faster than one that vice gathers profiles for |퐶|/푛 calibration inputs. For reduces fewer such operations. Therefore, this performance edge hardware-specific approximations in 퐾, the devices collect predictor ranks configurations correctly by their speedup, the QoS profiles as in Lines 12-15 from Algorithm1. which suffices for autotuning purposes. • The edge-devices send the QoS profiles to a centralized 3.5 Constructing Tradeoff Curve server. It merges the profiles – taking the mean of Δ푄 Autotuning (Lines 23-30 of Algorithm1) often discovers (change of QoS) in the profile, while concatenating the Δ푇 many candidateConfigs. On Line 33, points in the Pareto (change of tensor output) together. It then runs the predic- set or with distance to the Pareto set less than 휀1 are kept tive tuning as in Lines 18-30 from Algorithm1. Because the (by Equation2). This step serves the purpose of reducing approximation choices cannot be decoupled (as we found the overhead of QoS validation (Line 36- 40), by filtering in initial prototype experiments), we cannot simply reuse away configurations that are not in or close to the Pareto set. the curve from development-time, but instead perform a 휀2 controls the configuration selection after QoS validation fresh autotuning step that combines software and hard- (Line 41). 
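The profile-merging step of the distributed install-time flow (Section 4) — averaging the scalar ΔQ across devices while concatenating the raw-output ΔT collected on each device's shard of the calibration inputs — might look like this server-side sketch. The data layout and names are assumptions for illustration, not ApproxTuner's implementation.

```python
import numpy as np

def merge_device_profiles(device_profiles):
    # device_profiles: list of per-device dicts mapping (op, knob) -> (dQ, dT),
    # where dQ is the scalar QoS change and dT is the raw-output change
    # collected on that device's |C|/n calibration inputs (batch axis 0).
    merged = {}
    for key in device_profiles[0]:
        dq = float(np.mean([p[key][0] for p in device_profiles]))
        dt = np.concatenate([p[key][1] for p in device_profiles], axis=0)
        merged[key] = (dq, dt)
    return merged
```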
Using 휀1 to filter configurations reduces the number ware approximations and constructs a new tradeoff curve. of configurations that are empirically evaluated to measure • The server sends the configurations to the edge-devices. the real QoS. However, if many configurations are filtered Each edge device validates an equal fraction of the total (Lines 36- 40), this can potentially result in a very small set configurations and filters the configurations by measuring of valid configurations. Setting 휀2 to be higher (or equal) to both the real QoS and performance (similar to Lines 36-40 휀1 adds flexibility for including more points in the shipped in Algorithm1). tradeoff curve. Thus 휀1 and 휀2 give the developer the ability • The server receives the filtered configurations from each to control the quality and the size of tradeoff curve and the of the participating edge devices and computes the final time of our three-stage tuning. 푃푆 푆∗ 푆∗ ... 푆∗ 푆∗ tradeoff curve, ( 1 ∪ 2 ∪ ∪ 푛), where 푖 are Pareto Since FP16 availability is not guaranteed on each hard- sets returned by edge device 푖; these are configurations ware platform, we allow users to tune the program with and with measured QoS higher than the user-specified QoS without FP16 support, creating two separate curves - one threshold. This is the final curve that the server sends each for FP32 and FP16. Users can simply ship a single curve back to the devices. if FP16 hardware availability is known ahead of time. 5 Runtime Approximation Tuning 4 Install-time Tuning A key capability of ApproxTuner is the ability to adapt ap- The install-time tuning phase takes the tradeoff curve from proximation settings at run-time to meet application goals the development-time tuning (푃푆 휀 ), together with the same such as throughput or responsiveness, in the face of changing calibration inputs for profiling퐶 ( ), hardware-specific knobs system conditions such as load variation, frequency scaling (퐾) for each operator on the edge device, the number of edge or voltage scaling, or changing application demands. Our devices 푛edge that participate in the distributed tuning pro- runtime control assumes that the program is running in iso- cess, and the other parameters from the development-time lation on the target hardware. Significant prior work has PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea H. Sharif et al. addressed the problem of multi-tenancy (e.g., [66] and can randomly selected classes) from its test set of 50K images. We be incorporated in our approach. divide the 10K images into calibration set (for autotuning) ApproxTuner allows users to specify a desired target for and test set (for evaluation), with 5K images each. performance, and then uses the tradeoff curve 푃푆 (built at Table 1. CNN benchmarks, their datasets, layer count, classifi- install-time) to select configurations that allow for meeting cation accuracy with FP32 baseline, and size of auto-tuning search these goals. Runtime conditions (e.g., lowering processor fre- space. quency) can impose system slowdowns which may cause the application performance to fall below the desired target. In Network Dataset Layers Accuracy Search Space these scenarios, the dynamic tuner switches configurations AlexNet [37] CIFAR-10 6 79.16% 5e+8 AlexNet [37] ImageNet 8 55.86% 5e+8 to choose a different point from the performance-accuracy AlexNet2 CIFAR-10 7 85.09% 2e+10 tradeoff space. 
ResNet-18 [24] CIFAR-10 22 89.44% 3e+22 A key property of all our approximation techniques is that ResNet-50 [24] ImageNet 54 74.16% 7e+91 the different approximation knob settings are simply numer- VGG-16 [58] CIFAR-10 15 89.41% 3e+22 ical parameters to the tensor operations (e.g., perforation VGG-16 [58] ImageNet 15 72.88% 3e+22 MobileNet [28] CIFAR-10 28 83.69% 1e+26 and sampling rates). Therefore, the runtime tuner can switch LeNet [39] MNIST 4 98.70% 3e+3 between configurations with negligible overhead, simply by using different parameter values each time. 6.1 Quality Metrics Performance is measured for each invocation, which is For CNNs, we measure accuracy degradation with respect one execution of the target code, e.g., the entire CNN or to the baseline, denoted Δ푄표푆푥% for a degradation of 푥%. entire image-processing pipeline (for one batch of images). For the image processing benchmark, we use average PSNR, A system monitor measures the execution time over a (con- between the output images 푥 and ground truth images 푥0: figurable) sliding window of 푁 most recent batch executions Õ 푃푆푁푅(푥, 푥 ) = − (푥 [푖] − 푥 [푖])2 (푘 − 푁 ,..., 푘 − 2, 푘 − 1). If the average performance of the 0 : 10 log10 0 sliding window executions falls below the desired target, the 푖 푃푆푁푅 푦 dynamic tuner is invoked. In this case, ApproxTuner chooses ( 푦 denotes a PSNR of .) A higher PSNR implies better a configuration from the tradeoff curve that provides the image quality. The predictive models use the mean square speedup necessary to achieve the target performance level. error (exponential of PSNR) as the QoS metric. One challenge is that there may not be an exact point Baseline: For our baseline, we map all computations to FP32 matching the desired speedup in the tradeoff curve, so our with no approximations. system allows the user to select between two policies for 6.2 Implementation achieving the target performance, PerfT: Our tensor , targeted by our compiler back ends, uses 1. Enforce Required Speedup in each Invocation. Picks cuDNN for most tensor operators, but cannot use it for con- volutions because it is proprietary and we cannot modify it a configuration with performance no smaller than PerfT. This is a 푂 (log(|푃푆|) operation, using binary search. to implement custom algorithms for perforation and sam- 2. Achieve Average Target Performance over Time. Pro- pling. Instead, we developed a hand-optimized convolution babilistically selects between two configurations, with per- operator using CUDA, and optimized using cuBLAS, mem- ory coalescing, and tuning hardware utilization, di- formance the closest below and above PerfT over a time interval (on average). The selection probabilities, 푝1 and vergence, and scratchpad usage. Our CPU implementations 푝 푝 푝 vectorize tensor processing loops using OpenMP. 2, are such that 1 ·Perf1 + 2 ·Perf2 = PerfT, as in [67]. For instance, if PerfT = 1.3x and the closest points provide 1.2x 6.3 Hardware Setup and 1.5x speedup, these two configurations are randomly Client Device Setup. Our client device (Table2) is the selected with respective probabilities 2/3 and 1/3. NVIDIA Jetson Tegra TX2 board [49], commonly used in Policy 1 is less flexible and is better suited for hard orsoft robotics and small autonomous vehicles [47, 62]. real-time systems, where deadlines are important. Policy 2 We model an SoC that adds to the TX2 a simulated PROMISE is a better choice when application throughput is a goal. accelerator for machine learning [59]. 
GPU, CPU, and PROMISE 6 Evaluation Methodology communicate through global shared memory. We use a split approach for profiling. We measure performance and power Benchmarks. We use several CNNs (Table1) and an image via direct execution on the GPU and CPU. Our profiler con- processing benchmark that combines a CNN (AlexNet2) with tinuously reads GPU, CPU and DRAM power from Jetson’s the Canny [10] edge detection pipeline. voltage rails via an I2C interface [50] at 1 KHz (1 ms period). Datasets. We use MNIST [40], CIFAR-10 [36] and the Ima- Energy is calculated by integrating the power readings using geNet dataset ILSVRC 2012 [53], each with 10K images. For 1 ms timesteps. To model PROMISE, we use the functional ImageNet, we use 10K randomly sampled images (from 200 simulator and a validated timing and energy model [59]. ApproxTuner PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea
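The energy computation in the profiling setup (Section 6.3) reduces to integrating the 1 kHz power samples over their 1 ms sampling period; a trivial sketch, with rail names chosen for illustration:

```python
def energy_joules(rail_samples_watts, period_s=1e-3):
    # Sum power readings (one reading per 1 ms timestep) for each rail
    # (e.g., GPU, CPU, DRAM) and convert to joules.
    return sum(sum(samples) * period_s for samples in rail_samples_watts.values())

# Example: total energy over three rails sampled for the same run.
total = energy_joules({"GPU": [5.1, 5.3], "CPU": [1.2, 1.1], "DDR": [0.9, 0.9]})
```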

Table 2. System parameters for the Edge Device. NVIDIA strategy 2 from Section5, with a sliding window size of 1 Tegra TX2 board including simulated PROMISE accelerator on chip. batch execution (500 images). When frequency is reduced at the end of batch 푛, we measure the imposed slowdown at Tegra TX2 PROMISE end of batch 푛 + 1, compute the required speedup to meet CPU Cores 6 Memory Banks 256 × 16 KB GPU SMs 2 Frequency 1 GHz the target execution time, and ApproxTuner picks a new CUDA Cores 256 configuration for batch 푛 + 2. The configuration switching GPU Frequency 1.12 GHz overhead is negligible (Section5). DRAM Size 8 GB Server Setup. We use a server-class machine for development- 7 Evaluation time tuning and for coordinating install-time distributed pre- We experimentally evaluate the benefits of ApproxTuner. dictive tuning. It includes two NVIDIA GeForce 1080Ti GPUs We analyze the three stages in Sections 7.1-7.3 (development- each with 3584 CUDA cores and 11GB of global memory, 20 time), Section 7.4 (install-time), and Section 7.5 (runtime). Intel Xeon cores (2.40GHz) and 65GB RAM. We characterize tuned approximations in Section 7.2. We 6.4 Autotuning Setup evaluate predictive tuning in Section 7.3. We demonstrate tuning a composite CNN + image processing benchmark For autotuning search, we use the OpenTuner [5] library. with multiple QoS metrics in Section 7.6. For both empirical and predictive tuning, we use the default OpenTuner setting that uses an ensemble of search tech- 7.1 Performance and Energy Improvements niques including Torczon hillclimbers, variants of Nelder- Improvements for GPU. Figures 2a and 2b show the per- Mead search, a number of evolutionary mutation techniques, formance and energy benefits achieved on the Tegra’s GPU, and random search. For each QoS threshold, we run the tuner for accuracy reductions of 1%, 2% and 3%. The X-axes list the for a maximum of 30K iterations. We declare convergence benchmarks and the Y-axes show improvements over the if tuning result does not improve over 1K consecutive itera- FP32 baseline. These results only use hardware-independent tions. The number of iterations required per QoS threshold approximations: FP16, perforation, sampling. The results are varies across benchmarks, from 1K (LeNet) to 28K (ResNet- reported after trying both predictors, Π1 and Π2, and choos- 50). Predictive and Empirical tuning convergence rates are ing the best result (we compare the predictors in Section 7.3). 푄표푆 푄표푆 푄표푆 similar across all benchmarks with an average 8.7K iterations, For Δ 1%, Δ 2%, and Δ 3%, the geomean speedups and 8.2K iterations, respectively. are 2.14x, 2.23x, and 2.28x. The maximum speedup achieved 푄표푆 Selecting Configurations for Shipping. Before QoS vali- is 2.75x for VGG-16-ImageNet at Δ 3%. On average, FP16 dation, we select configurations that lie within an 휀1 distance alone provides 1.63x speedup, and moreover, has little ef- to the Pareto set (Line 33 of Algorithm1); after autotuning, fect on accuracy. Sampling and perforation together give an we select and ship configurations within 휀2 distance to the additional 1.4x geomean speedup, on top of FP16. Pareto set (Line 41). These distance thresholds, 휀1 and 휀2, are Figures 2a and 2b show that increasing loss threshold from computed per benchmark to limit the maximum number of 1% to 2% to 3% provides higher improvements in six out of configurations validated and shipped. 
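To show how the run-time monitor used in these experiments (Sections 5 and 6.4) can be realized, here is a simplified control-loop sketch: it averages batch times over a sliding window, estimates the speedup needed to return to the target time, and picks the least-aggressive curve point that meets it (a Policy-1-style choice; the probabilistic Policy 2 could be substituted). Class and field names are illustrative, not ApproxTuner's API.

```python
from collections import deque

class RuntimeMonitor:
    """Sliding-window controller that switches configurations when the
    measured batch time drifts above the target."""

    def __init__(self, tradeoff_curve, target_time_s, window=1):
        # tradeoff_curve: list of (qos, speedup_vs_baseline, config), sorted by speedup.
        self.curve = tradeoff_curve
        self.target = target_time_s
        self.recent = deque(maxlen=window)
        self.speedup = 1.0                     # speedup of the currently active config

    def after_batch(self, batch_time_s):
        self.recent.append(batch_time_s)
        avg = sum(self.recent) / len(self.recent)
        if avg <= self.target:
            return None                        # fast enough: keep current knobs
        # Estimate the baseline time under current conditions, then the
        # speedup required to hit the target again.
        baseline_now = avg * self.speedup
        needed = baseline_now / self.target
        qos, speedup, config = min((pt for pt in self.curve if pt[1] >= needed),
                                   key=lambda pt: pt[1],
                                   default=self.curve[-1])
        self.speedup = speedup
        return config                          # switching is just new knob values
```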
We chose 휀1 and 휀2 so ten benchmarks, since it allows the tuner to gradually use that at most 50 configurations are retained. more aggressive approximations. Four networks (VGG16- Distributed Predictive Tuning Setup. We emulate a set- 100, ResNet50, MobileNet, LeNet), do not show any improve- ting with 100 edge devices and a single-server coordinator. ment, because more aggressive sampling and perforation of Lacking 100 TX2 boards, we measure performance on 1 (out most layers immediately degrades quality by over 3%. of 100) distributed invocations on the actual TX2 hardware, Energy reductions (Fig. 2b) are positively correlated with 푄표푆 푄표푆 푄표푆 for 1/100 of the total calibration inputs. Remaining 99 dis- speedups, as expected. For Δ 1%, Δ 2%, and Δ 3%, tributed invocations are emulated on the server. the mean energy reductions are 1.99x, 2.06x and 2.11x. Runtime Approximation Tuning. On the Tegra TX2, we Improvements for CPU. The mean speedups for the CPU 푄표푆 푄표푆 푄표푆 vary GPU frequency to mimic low-power execution modes, for Δ 1%, Δ 2% and Δ 3% are 1.31x, 1.38x and 1.42x using 12 different frequencies from 1.3Ghz to 319Mhz. The (figure omitted for space). The maximum speedup is 1.89x for performance goal is to maintain the same level of perfor- VGG16-CIFAR10. The energy benefits are quite similar. The mance (execution time) offered at the highest frequency benefits on the CPU are significantly lower than on theGPU mode (1.3Ghz). The frequency is reduced after a batch of (though still valuable) since the ARM CPUs on the Jetson inputs has been processed and before the next batch starts, TX2 board do not support FP16, and so the performance and and applied instantaneously. energy benefits are due only to sampling and perforation. For reliable measurements, we run 200 batches of 500 This particularly affects MobileNet and ResNet-50, which images each: we divide the 5K test set into 10 batches and are less amenable to sampling or perforation. cycle through them 20 times. We average the processing time and accuracy across batches. The experiments use Control PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea H. Sharif et al.

Fig. 2. (a) Speedups and (b) Energy reductions achieved on GPU using hardware-independent approximations for ΔQoS 1%, ΔQoS 2%, ΔQoS 3%. [Bar charts across the benchmarks; y-axes: Speedup and Energy Reduction.]

Fig. 3. Comparing predictive (Predictive-Π1, Predictive-Π2) vs. empirical tuning for ΔQoS 3%. [Bar chart across the benchmarks; y-axis: Speedup.]

7.2 Characteristics of Approximated Benchmarks Predictive-Π1 Predictive-Π2 Empirical

6 12.6 9.8 10.8 12.2 8.6 Table3 shows the best performing GPU configurations (num- 12.6 12.8 15.9 16.7 10.5 10.6 ber of operations mapped to approximation knobs) for Δ푄표푆3%. 5 Table 3. Approximation knobs for top performing GPU configu- 4.7 4.8 ration (maximum speedup) for Δ푄표푆3%. 4 Benchmark Occurrences of Approximation Knobs 3.3 3 LeNet-5 samp-50%:1 perf-50%:1 FP16:2

AlexNet-CIFAR10 FP16:2 samp-50%:3 samp-25%:1 Energy Reduction 2 AlexNet2 FP16:3 perf-50%:1 samp-50%:2 perf-33%:1

VGG-16-10 FP16:4 perf-50%:3 perf-33%:2 samp-50%:6 1 VGG-16-100 FP16:4 perf-50%:2 samp-50%:8 perf-33%:1 Lenet Alexnet Alexnet2Resnet18Resnet50Vgg16_10 Mobilenet ResNet-18 FP16:13 perf-50%:6 perf-33%:2 samp-25%:1 Vgg16_100 Geo-mean

MobileNet FP16:20 perf-50%:3 perf-33%:3 perf-25%:2 Vgg16_imagenet Alexnet_imagenet AlexNet-ImageNet FP16:2 perf-50%:1 perf-25%:3 Fig. 4. Energy reductions on GPU + PROMISE with install- VGG-16-ImageNet FP16:8 perf-50%:1 samp-50%:7 time distributed predictive tuning (Π1, Π2) and empirical tuning ResNet50-ImageNet FP16:42 perf-50%:2 perf-33%:3 perf-25%:6 for Δ푄표푆3%. General Trends. We find that the first few layers inthe MobileNet: For the best configuration, only 8 layers out CNNs are relatively less amenable to approximations. For of 28 could be mapped to approximations without signifi- 4 benchmarks, the first layer is not mapped to perforation cant accuracy loss, which is why MobileNet has the lowest or sampling (only FP16). For ResNet-18 and MobileNet, the performance improvement (1.50× speedup). first 3 layers are not mapped to perforation or sampling. ResNet50-ImageNet: At most 11 convolutions (from 53) We summarize interesting insights for several represen- are mapped to any approximation other than FP16, with tative CNNs. The other CNNs show similar behaviors (e.g., 6 convolution operators mapped to 25% perforation. As a AlexNet2-CIFAR10 behaves similarly to AlexNet-CIFAR10). result, ResNet50-ImageNet achieves much of its speedup AlexNet-CIFAR10: None of the layers in AlexNet map to from FP16, although the overall speedup of 1.95× is quite the perforated convolutions approximation, while all layers significant for just 1% accuracy reduction. (in most configurations) are amenable to filter sampling. Overall, these insights show the importance of combining dif- ResNet18 : Across all configurations in the Pareto set, 7of ferent approximations and the importance of tuning the choices the 21 convolution layers are not mapped to any approxi- of combinations to balance accuracy vs. performance gains. mation. Interestingly, 4 layers map only to 33% perforation and all use different start offsets, showing the importance of 7.3 Predictive vs Empirical Tuning combining varying start offsets. Comparing Speedups. We compare our predictive approx- VGG16-100: We find that 3 layers in VGG16-100 can only imation tuning with empirical tuning, both using Open- be mapped to column-based perforation while row-based Tuner [5]. Figure3 illustrates results for the maximum bound perforation leads to low accuracy. of 3%. It shows that predictors Π1 and Π2 provide geomean speedups of 2.27x and 1.97x, compared with 2.25x for empir- ical tuning. For most benchmarks, Π1 effectiveness is sim- ilar to empirical tuning, but Π2 provides lower speedups because it systematically underestimates accuracy loss for ApproxTuner PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea

predictors Π1 and Π2 provide 4.7x and 3.3x energy reduc- tions, compared to 4.8x reduction for empirical tuning. Π2 is lower than Π1 due to higher prediction error, which leads the search to explore less effective configurations. These results show that the predictive install-time tuning in Ap- proxTuner can exploit hardware-specific approximations by intelligently selecting operations to offload to accelerators, and achieves significant energy improvements. Fig. 5. GPU, DDR, and overall system power at different GPU Tuning Times. We separately measure the times on the frequency levels for ResNet18. As GPU frequency reduces, GPU edge device for error profile collection and the autotuning and overall power consumption reduce. time on the server. A single distributed error profile collec- tion phase on the edge device ranges from 1 minute (LeNet) Table 4. Predictive Tuning times compared to Empirical to 6 hours (ResNet50), with a geometric mean of 25 min- (in minutes). “Π1-red” and “Π2-red” are speedups compared to Empirical. utes across all benchmarks. To put these results into context, with a back-of-the-envelope calculation, we estimate that Benchmark Empirical Pred-Π1 Pred-Π2 Π1-red Π2-red LeNet 11.4 1.0 1.1 11.21x 10.61x empirical tuning for ResNet50 on a single Tegra Tx2 would AlexNet-CIFAR10 418.7 9.9 11.7 42.27x 35.80x take more than 4 months. The autotuning time on the server AlexNet2 133.9 11.1 12.4 12.09x 10.78x VGG-16-10 1015.3 36.2 25.8 28.06x 39.35x ranges from 24 minutes (LeNet) to 10.22 hours (ResNet-50), VGG-16-100 494.6 30.8 25.2 16.03x 19.64x with a geometric mean of 1.9 hours across benchmarks. ResNet18 373.1 27.3 30.5 13.68x 12.25x MobileNet 772.7 58.7 38.4 13.17x 20.10x 7.5 Runtime Approximation Tuning AlexNet-ImageNet 661.1 240.1 25.5 2.75x 25.93x VGG-16-ImageNet 1937.1 334.9 143.5 5.78x 13.50x We vary GPU frequency to mimic lower power modes. As ResNet50-ImageNet 11112.8 716.4 246.3 15.51x 45.12x Figure5 shows, both the GPU power and total system (SYS) Geomean 12.76x 20.37x power decrease substantially with decreasing frequency, by some benchmarks, and therefore chooses configurations that ~7x˙ for the GPU and ~1.9x˙ for the system when frequency are later removed during accuracy validation. decreases from 1300MHz to 318MHz. DDR power decreases Autotuning Times. Table4 shows the autotuning time for only slightly because DDR frequency is kept constant. These predictive tuning compared to empirical tuning using Open- results are averaged over 10 runs of ResNet18-CIFAR10. Tuner. Predictive-Π1 and Predictive-Π2 are on average 12.76x We show the effectiveness of run-time adaptation for three and 20.37x faster than empirical tuning. Π1 calculations are evaluated CNNs in Figure6 (other CNNs exhibit similar be- significantly slower than Π2’s on large tensors, e.g., 3.8x and havior). The X-axis presents the different frequencies, which 6.7x slower on VGG16-ImageNet and AlexNet-ImageNet, we vary over time. The left Y-axis shows the normalized respectively. This shows the importance of having both pre- execution time (averaged over 200 batches), relative to time dictors: for large models and data sets, Π2 is more likely to be taken at the highest frequency level (1.3Ghz). The right Y- feasible and gives good speedups with reasonably good ac- axis presents the model accuracy (in %). curacy, while Π1 is more precise but requires more memory. As we reduce the frequency, the baseline configuration Size of Tradeoff Curves. 
7.4 Install-time tuning

We evaluate the efficacy of our install-time tuning strategy in exploiting the low-voltage knobs in the PROMISE accelerator that trade accuracy for energy savings (Section 2.3).

Energy Savings. Figure 4 shows the energy reductions achieved with install-time predictive distributed tuning compared to empirical tuning for ΔQoS = 3%. The energy reductions are achieved by mapping tensor operations to the PROMISE accelerator and lowering the analog read swing voltages to further lower energy use at the cost of increased error. With the exception of ResNet50, all benchmarks have some tensor operations mapped to PROMISE. For LeNet, AlexNet, AlexNet2, and VGG16-CIFAR10, more than 50% of convolution operations can be mapped to PROMISE (for ΔQoS = 3%). On average, predictors Π1 and Π2 provide 4.7x and 3.3x energy reductions, compared to a 4.8x reduction for empirical tuning. Π2 is lower than Π1 due to higher prediction error, which leads the search to explore less effective configurations. These results show that the predictive install-time tuning in ApproxTuner can exploit hardware-specific approximations by intelligently selecting operations to offload to accelerators, and achieves significant energy improvements.

Tuning Times. We separately measure the times on the edge device for error-profile collection and the autotuning time on the server. A single distributed error-profile collection phase on the edge device ranges from 1 minute (LeNet) to 6 hours (ResNet50), with a geometric mean of 25 minutes across all benchmarks. To put these results into context, with a back-of-the-envelope calculation, we estimate that empirical tuning for ResNet50 on a single Tegra TX2 would take more than 4 months. The autotuning time on the server ranges from 24 minutes (LeNet) to 10.22 hours (ResNet-50), with a geometric mean of 1.9 hours across benchmarks.

Fig. 5. GPU, DDR, and overall system power at different GPU frequency levels for ResNet18. As GPU frequency reduces, GPU and overall power consumption reduce.

7.5 Runtime Approximation Tuning

We vary GPU frequency to mimic lower power modes. As Figure 5 shows, both the GPU power and total system (SYS) power decrease substantially with decreasing frequency, by ~7x for the GPU and ~1.9x for the system when frequency decreases from 1300 MHz to 318 MHz. DDR power decreases only slightly because DDR frequency is kept constant. These results are averaged over 10 runs of ResNet18-CIFAR10.

We show the effectiveness of run-time adaptation for three evaluated CNNs in Figure 6 (other CNNs exhibit similar behavior). The X-axis presents the different frequencies, which we vary over time. The left Y-axis shows the normalized execution time (averaged over 200 batches), relative to the time taken at the highest frequency level (1.3 GHz). The right Y-axis presents the model accuracy (in %).

As we reduce the frequency, the baseline configuration slows down substantially (solid blue line), while accuracy remains unaffected. ApproxTuner's dynamic tuning counteracts the slowdown and maintains the original average batch processing time through most of the frequency range (dashed orange lines), while gracefully degrading the inference accuracy (yellow lines, right Y-axis). To counteract higher slowdowns, relatively higher QoS degradation has to be imposed. For instance, for ResNet18, a potential slowdown of 1.45x in the baseline case (at 675 MHz) can be counteracted with an accuracy drop of 0.33 percentage points, but as much as 1.75x (at 497 MHz) can be compensated with an accuracy drop of 1.25 points. At 497 MHz, there is a 1.72x reduction in average power consumed (Figure 5). Similarly, for AlexNet-ImageNet and AlexNet2-CIFAR10, frequency can be reduced to 586 MHz (with 1.7 and 1.9 points of accuracy loss, respectively) while maintaining performance. This reduces average power consumption by 1.66x and 1.62x, respectively (power-frequency graphs are not shown for AlexNet and AlexNet2).
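As a concrete illustration of this adaptation loop, here is a minimal sketch of a runtime controller under assumed interfaces (the CurvePoint record and the batch-time callback are hypothetical, not ApproxTuner's actual runtime API): it watches the measured batch time and moves along the installed tradeoff curve, giving up the least accuracy that restores the target latency.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CurvePoint:
    speedup: float          # relative to the unapproximated configuration
    qos_loss: float         # accuracy loss in percentage points
    knobs: Dict[str, str]   # per-operator approximation settings to install

class RuntimeTuner:
    def __init__(self, curve: List[CurvePoint], target_batch_time_s: float):
        # Installed tradeoff curve, ordered by increasing QoS loss; on a Pareto
        # curve this is also increasing speedup.
        self.curve = sorted(curve, key=lambda p: p.qos_loss)
        self.target = target_batch_time_s
        self.active = self.curve[0]               # start from the most accurate point

    def on_batch_done(self, measured_batch_time_s: float) -> CurvePoint:
        slowdown = measured_batch_time_s / self.target
        if slowdown <= 1.0:
            return self.active                    # target met; keep current accuracy
        needed = self.active.speedup * slowdown   # speedup required to recover
        for point in self.curve:
            if point.speedup >= needed:
                self.active = point               # smallest accuracy loss that suffices
                return self.active
        self.active = self.curve[-1]              # otherwise use the fastest point
        return self.active
```

For the ResNet18 case above, a 1.45x slowdown at 675 MHz would make such a controller step to a point offering roughly 1.45x additional speedup, which the measured curve provides for about 0.33 percentage points of accuracy.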

Fig. 6. Figures (a), (b), and (c) show that runtime approximation tuning trades off accuracy to maintain the same level of responsiveness when frequency levels are reduced. The times on the y-axis are normalized with respect to the performance achievable at the highest frequency (1.3 GHz). Without dynamic approximations (the blue line), applications slow down.

These results demonstrate that ApproxTuner's dynamic approximation-tuning capabilities can help maintain a target performance in the face of lower power/frequency modes.

7.6 Combining CNNs and Image Processing

In our final experiment, we demonstrate that ApproxTuner can optimize components from multiple tensor-based domains, each with its own QoS metric, using a benchmark that combines CNN and image processing. The application consists of a CNN (AlexNet2 on CIFAR-10) that classifies images into one of 10 classes, and images from five of the classes are forwarded for Canny edge detection [10].

Fig. 7. Speedups achieved on GPU for a grid of accuracy (horizontal) and PSNR (vertical) thresholds.

For this benchmark, the QoS we consider is a pair (PSNR, accuracy), using PSNR for the image quality of edge detection and accuracy for the CNN classifier. Figure 7 presents the best configurations achieved on the GPU for each QoS pair. Nine QoS pairs are evaluated, with three different values each of PSNR and accuracy. The bar labels show speedups compared to the FP32 baseline. As either threshold (PSNR or accuracy) is relaxed, speedup increases, since the tuner finds more opportunities for approximation.

Predictive tuning again enables dramatically faster tuning (graphs omitted) while preserving optimization gains. For all (PSNR, accuracy) pairs, predictive and empirical tuning find configurations with comparable speedups, while predictive tuning is 20.1× faster on average. In five of the nine pairs, predictive tuning gives slightly higher speedups. Our inspection shows that this is the effect of the non-deterministic nature of OpenTuner (i.e., randomness in the search). For this benchmark, we only apply model Π2, since the size of the output tensor is not fixed (CNN classification results affect output tensor size) and model Π1 requires each approximation knob error profile to be equal-sized tensors (Section 8).
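The two-part QoS can be checked per configuration with a small helper like the following sketch; the function names and the choice to gate on the worst-case PSNR across forwarded images are assumptions, while the PSNR formula itself is the standard definition.

```python
import numpy as np

def psnr(reference: np.ndarray, approximate: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) of an approximate image vs. its reference."""
    mse = np.mean((reference.astype(np.float64) - approximate.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)

def meets_qos(labels, predictions, ref_edges, approx_edges,
              min_accuracy: float, min_psnr: float) -> bool:
    """A configuration passes only if both QoS components pass: classification
    accuracy for the CNN, and PSNR for the Canny edge-detection outputs."""
    accuracy = 100.0 * float(np.mean(np.asarray(labels) == np.asarray(predictions)))
    worst_psnr = min(psnr(r, a) for r, a in zip(ref_edges, approx_edges))
    return accuracy >= min_accuracy and worst_psnr >= min_psnr
```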
8 Discussion and Future Work

Scope of the Predictive Models. We have applied our predictive models to fixed DAGs of tensor operations in the context of CNNs and to one image-processing benchmark, but they can be applied to other tensor domains. Our models are better at estimating QoS when these conditions are satisfied: 1) the control flow is deterministic and input-independent; 2) the operators have no side effects; and 3) for predictor Π1, the shapes of the raw tensor outputs must match, since these outputs are summed up in the model. As part of future work, we hope to incorporate models for other kinds of computations beyond tensor operations. There is also potential for evaluating and extending these models to work with input-dependent control flow and operations with side effects.
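Condition 3 can be illustrated with a rough sketch of this style of composition (the helper names and data layout are hypothetical, and this is not the actual Π1 implementation): each knob's error profile, recorded as the deviation of the final raw output when that knob is applied alone, is added onto the baseline output before the QoS metric is evaluated, which is only well defined when all of these tensors share one shape.

```python
import numpy as np
from typing import Callable, Dict, List

def predict_qos_pi1_style(baseline_output: np.ndarray,
                          error_profiles: Dict[str, np.ndarray],
                          chosen_knobs: List[str],
                          qos_of: Callable[[np.ndarray], float]) -> float:
    """Sum the per-knob error profiles onto the baseline final output and score
    the result. Every profile must have the same shape as `baseline_output`,
    which is the equal-sized-tensor requirement discussed above."""
    predicted = baseline_output.copy()
    for knob in chosen_knobs:
        predicted += error_profiles[knob]
    return qos_of(predicted)
```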

Interaction with Model Compression Techniques. We believe that the approximation-tuning capabilities in ApproxTuner open avenues for new research opportunities. We perform preliminary experiments to investigate the potential for incorporating compression techniques such as pruning [22, 41, 52] as part of our tuning framework. We create pruned models (using the approach in [52]) for MobileNet, VGG16, and ResNet18 on CIFAR-10. We find that, starting with the pruned models, applying perforated convolutions using empirical tuning in ApproxTuner reduces MAC (multiply-accumulate) operations by 1.3x for MobileNet and VGG-16, and by 1.2x for ResNet18, while reducing inference accuracy by less than 1 percentage point compared with the pruned model. The key takeaway is that approximation techniques can be applied to pruned models with little loss of accuracy, potentially yielding valuable additional speedups. In future work, we aim to incorporate model compression techniques, including pruning, into ApproxTuner and conduct a systematic evaluation of the accuracy-performance tradeoffs.

9 Related Work

Table 5 compares ApproxTuner to select related systems across four broad kinds of capabilities: a) support for approximations, b) programmability, c) tuning strategies, and d) support for DNN model optimizations (pruning, quantization) and model retraining.

Table 5. Comparing capabilities of ApproxTuner versus most closely related state-of-the-art systems.

[Table 5 summarizes, for ApproxTuner, ApproxHPVM [57], TVM [12] with AutoTVM [13], ACCEPT [56], and PetaBricks [4], which of these capabilities each system provides: algorithmic approximations, accelerator-specific approximations, and multi-domain approximations (Approximations); precision tuning, no code changes, and retargetable object code (Programmability); portable development- plus install-time tuning, joint approximation tuning, runtime approximation tuning, and predictive tuning (Tuning Strategies); and model approximations and support for retraining (Model Optimizations).]

Compilers for Accuracy Tuning: ApproxHPVM [57] is a compiler we developed for accuracy-aware mapping of applications to heterogeneous systems. ApproxHPVM only supports hardware approximations (FP16 and PROMISE) and uses standard empirical autotuning at development time. It has no install-time or run-time tuning. Its reliance on L-norms is insufficient for algorithmic approximations with heavy-tailed error distributions, as we discovered for both perforated convolutions and feature sampling.

TVM [12] is a deep-learning compiler that generates optimized tensor-operator code for CPUs, GPUs, and accelerators. It supports reduced precision (FP16 and INT8) and DNN binarization [19] (reducing operator precision to 1 or 2 bits), but applies each of them to the entire network and does not tune individual operations with different choices or knobs. It does not support approximation tuning or arbitrary software/hardware approximations, and it does not dynamically adjust approximation knobs.

ACCEPT [56] uses programmer-specified type declarations to approximate programs on CPUs or FPGAs. Unlike ApproxTuner, it requires extensive source annotations, does not support tuning strategies for combinations of approximations, and does not support run-time adaptation.

PetaBricks uses heuristic search to select among multiple user-provided versions of an algorithm with varying accuracy [4, 6, 16]. It requires developers to rewrite programs in the PetaBricks language to provide alternate algorithm implementations, does not offer built-in, generic approximation options such as reduced precision, and does not support run-time approximation choices.

Dynamic Approximation Tuning: Many systems have supported approximation changes at run-time on CPUs [9, 26, 27] and GPUs [23, 54, 55]. These systems 1) do not target heterogeneous systems beyond GPUs, and 2) do not support efficient tuning for combining multiple kinds of approximations.

Model Optimizations for DNNs. Deep Compression [21] uses pruning, quantization, and compression to reduce model size (by as much as 49x for VGG-16). However, pruning introduces sparsity in the computation, which limits performance gains and can even lead to slowdowns on GPUs [44, 61, 65]. To make sparse tensor computation efficient, researchers have proposed software and architectural techniques [20, 25, 38, 44, 51, 64]. ApproxTuner does not reduce model size, but optimizes various tensor-operator approximations, yielding significant speedups. Our preliminary study (Section 8) shows that there is potential for combining perforation and sampling with pruned models to obtain both model-size and performance improvements. Our work is also complementary to systems that automatically generate implementations for low-precision quantized tensor computations [15, 19].

10 Conclusion

We proposed ApproxTuner, a compiler and runtime system that uses a three-phase tuning approach comprising development-time, install-time, and runtime tuning. ApproxTuner uses performance and accuracy prediction heuristics to tune the program at development time and generate a tradeoff curve; it refines this tradeoff curve with performance measurements and hardware-specific approximations at install time; and it uses the curve at runtime to switch configurations efficiently in response to changing runtime conditions. Across 11 benchmarks, ApproxTuner delivers a geometric mean performance improvement of 2.1x on the GPU and 1.3x on the CPU, with only a 1 percentage point drop in accuracy. Dynamic tuning capabilities allow ApproxTuner to adapt application performance to changing run-time conditions. Overall, ApproxTuner provides a generic approximation-tuning framework that is extensible to a wide range of software and hardware approximations, for important application domains such as neural networks and image processing. Our future work includes extending ApproxTuner to other domains and applying it with an even broader range of algorithmic optimizations.
Acknowledgements

This work is supported in part by DARPA through the Domain-Specific System on Chip (DSSoC) program, the National Science Foundation under Grants CCF 17-03637, CCF 18-46354, and CCF 19-56374, a Google Faculty Research award, a grant from the Amazon AWS Machine Learning Research Awards Program, and by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.

References

[1] 2020. Coral. https://coral.ai/.
[2] 2020. Qualcomm Redefines Premium with the Flagship Snapdragon 888 5G Mobile Platform. https://www.qualcomm.com/news/releases/2020/12/02/qualcomm-redefines-premium-flagship-snapdragon-888-5g-mobile-platform.

[3] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, [16] Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una- Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, May O’Reilly, and Saman Amarasinghe. 2015. Autotuning Algorith- Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry mic Choice for Input Sensitivity. In Proceedings of the 36th ACM Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, SIGPLAN Conference on Programming Language Design and Imple- Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. mentation (PLDI ’15). ACM, New York, NY, USA, 379–390. https: TensorFlow: A system for large-scale machine learning. In 12th USENIX //doi.org/10.1145/2737924.2737969 Symposium on Operating Systems Design and Implementation (OSDI [17] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet 16). 265–283. https://www.usenix.org/system/files/conference/osdi16/ Kohli. 2016. PerforatedCNNs: Acceleration through Elimination of osdi16-abadi.pdf Redundant Convolutions. In Advances in Neural Information Processing [4] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Systems. 947–955. Alan Edelman, and Saman Amarasinghe. 2009. PetaBricks: a language [18] Carlos M Fonseca, Joshua D Knowles, Lothar Thiele, and Eckart Zitzler. and compiler for algorithmic choice. Vol. 44. ACM. https://doi.org/10. 2005. A tutorial on the performance assessment of stochastic multiob- 1145/1543135.1542481 jective optimizers. In Third International Conference on Evolutionary [5] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan- Multi-Criterion Optimization (EMO 2005), Vol. 216. 240. Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. [19] Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwe- 2014. OpenTuner: An Extensible Framework for Program Autotuning. tak Patel. 2020. RIPTIDE: Fast End-to-end Binarized Neural Net- In Proceedings of the 23rd international conference on Parallel architec- works. Proceedings of Machine Learning and Systems (MLSys) (2020). tures and compilation. ACM, 303–316. https://doi.org/10.1145/2628071. http://ubicomplab.cs.washington.edu/pdfs/riptide.pdf 2628092 [20] Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. Sparse [6] Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, GPU Kernels for Deep Learning. arXiv preprint arXiv:2006.10901 (2020). and Saman Amarasinghe. 2011. Language and compiler support for [21] Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: auto-tuning variable-accuracy algorithms. In Code Generation and Op- Compressing Deep Neural Networks with Pruning, Trained Quantiza- timization (CGO), 2011 9th Annual IEEE/ACM International Symposium tion and Huffman Coding. arXiv:cs.CV/1510.00149 on. IEEE, 85–96. https://doi.org/10.1109/CGO.2011.5764677 [22] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both [7] ARM. 2019. Half-precision floating-point number format. weights and connections for efficient neural network. In Advances in https://developer.arm.com/docs/dui0774/e/other-compiler-specific- neural information processing systems. 1135–1143. features/half-precision-floating-point-number-format. Accessed: [23] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agar- 2019-11-21. wal, Alec Wolman, and Arvind Krishnamurthy. 2016. MCDNN: An [8] D. Azariadi, V. Tsoutsouras, S. Xydis, and D. Soudris. 2016. 
ECG signal Approximation-based Execution Framework for Deep Stream Process- analysis and arrhythmia detection on IoT wearable medical devices. ing Under Resource Constraints. In Proceedings of the 14th Annual In 2016 5th International Conference on Modern Circuits and Systems International Conference on Mobile Systems, Applications, and Services. Technologies (MOCAST). 1–4. https://doi.org/10.1109/MOCAST.2016. ACM, 123–136. https://doi.org/10.1145/2906388.2906396 7495143 [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep [9] Woongki Baek and Trishul M. Chilimbi. 2010. Green: A Frame- residual learning for image recognition. In Proceedings of the IEEE work for Supporting Energy-conscious Programming Using Con- conference on computer vision and pattern recognition. 770–778. trolled Approximation. SIGPLAN Not. 45, 6 (June 2010), 198–209. [25] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal https://doi.org/10.1145/1809028.1806620 Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W [10] John Canny. 1986. A computational approach to edge detection. IEEE Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Transactions on pattern analysis and machine intelligence 6 (1986), 679– Proceedings of the 52nd Annual IEEE/ACM International Symposium on 698. Microarchitecture. 319–333. https://doi.org/10.1145/3352460.3358275 [11] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, [26] Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant Agarwal, Yunji Chen, Olivier Temam, Tianshi Chen, Zidong Du, Ninghui Sun, and Martin Rinard. 2009. Using code perforation to improve perfor- Jia Wang, Chengyong Wu, Yunji Chen, Olivier Temam, Tianshi Chen, mance, reduce energy consumption, and respond to failures. (2009). Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and [27] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, Olivier Temam. 2014. DianNao: a small-footprint high-throughput and M. Rinard. 2011. Dynamic Knobs for Responsive Power-Aware accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices Computing (ASPLOS). https://doi.org/10.1145/1961295.1950390 49, 4 (2014), 269–284. https://doi.org/10.1145/2644865.2541967 [28] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, [12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis 2017. Mobilenets: Efficient convolutional neural networks for mobile Ceze, et al. 2018. TVM: An automated end-to-end vision applications. arXiv preprint arXiv:1704.04861 (2017). for deep learning. In 13th USENIX Symposium on Operating Systems [29] Intel. 2018. Intel Movidius Vision Processing Units (VPUs). Design and Implementation (OSDI 18). 578–594. https://www.usenix. https://www.intel.com/content/dam/www/public/us/en/documents/ org/system/files/osdi18-chen.pdf product-briefs/myriad-x-product-brief.pdf. [13] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, [30] Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learn- Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan ing to optimize tensor programs. In Advances in Neural Information Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Processing Systems. 3389–3400. 
Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben [14] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Co- Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gul- hen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: land, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexan- arXiv:1410.0759 http://arxiv.org/abs/1410.0759 der Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen [15] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Luis Ceze. 2020. Automatic generation of high-performance quantized Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adri- machine learning kernels. In Proceedings of the 18th ACM/IEEE Inter- ana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi national Symposium on Code Generation and Optimization. 305–316. ApproxTuner PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea

Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omer- leveraging background knowledge in CNNs. In 2018 IEEE International nick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Conference on Robotics and Automation (ICRA). IEEE, 2229–2235. Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew [48] NVIDIA. 2018. NVDLA. http://nvdla.org/. Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gre- [49] NVIDIA. 2018. NVIDIA Jetson TX2 Developer Kit. https: gory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, //www.nvidia.com/en-us/autonomous-machines/embedded- Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. systems-dev-kits-modules. In-Datacenter Performance Analysis of a Tensor Processing Unit. In [50] NVIDIA Developer Forums . 2018. Power Monitoring on Jetson Proceedings of the 44th Annual International Symposium on Computer TX2. (2018)). https://forums.developer.nvidia.com/t/jetson-tx2-ina226- Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. ACM, power-monitor-with-i2c-interface/48754. 1–12. https://doi.org/10.1145/3079856.3080246 [51] Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amar- [31] Boris N Khoromskij. 2012. Tensors-structured numerical methods in nath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David Blaauw, scientific computing: Survey on recent advances. Chemometrics and Trevor Mudge, and Ronald Dreslinski. 2018. OuterSPACE: An Outer Intelligent Laboratory Systems 110, 1 (2012), 1–19. Product Based Sparse Matrix Multiplication Accelerator. In 2018 IEEE [32] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and International Symposium on High Performance Computer Architecture Saman Amarasinghe. 2017. The Tensor Algebra Compiler. Proceedings (HPCA). IEEE, 724–736. https://doi.org/10.1109/HPCA.2018.00067 of the ACM on Programming Languages 1, OOPSLA (2017), 1–29. https: [52] Alex Renda, Jonathan Frankle, and Michael Carbin. 2019. Comparing //doi.org/10.1145/3133901 Rewinding and Fine-tuning in Neural Network Pruning. In Interna- [33] Joseph C Kolecki. 2002. An introduction to tensors for students of tional Conference on Learning Representations. physics and engineering. (2002). [53] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev [34] Patrick Konsor. 2011. Performance Benefits of Half Preci- Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, sion Floats. https://software.intel.com/en-us/articles/performance- Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet benefits-of-half-precision-floats. Accessed: 2019-11-21. Large Scale Visual Recognition Challenge. International Journal of [35] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Ko- Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/ muravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous s11263-015-0816-y Parallel . In Proceedings of the 23rd ACM SIGPLAN Sym- [54] M. Samadi, D. Jamshidi, J. Lee, and S. Mahlke. 2014. Paraprox: Pattern- posium on Principles and Practice of Parallel Programming (PPoPP ’18). based approximation for data parallel applications (ASPLOS). https: ACM, New York, NY, USA, 68–80. https://doi.org/10.1145/3178487. //doi.org/10.1145/2541940.2541948 3178493 [55] Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hor- [36] Alex Krizhevsky. 2012. Learning Multiple Layers of Features from mati, and Scott Mahlke. 2013. SAGE: Self-tuning Approximation Tiny Images. University of Toronto (05 2012). 
for Graphics Engines. In Proceedings of the 46th Annual IEEE/ACM [37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet International Symposium on Microarchitecture (MICRO-46). 13–24. Classification with Deep Convolutional Neural Networks. In Advances https://doi.org/10.1145/2540708.2540711 in neural information processing systems. 1097–1105. https://doi.org/ [56] Adrian Sampson, Andre Baixo, Benjamin Ransford, Thierry Moreau, 10.1145/3065386 Joshua Yip, Luis Ceze, and Mark Oskin. 2015. ACCEPT: A Programmer- [38] Vadim Lebedev and Victor Lempitsky. 2016. Fast convnets using group- Guided Compiler Framework for Practical Approximate Computing. wise brain damage. In Proceedings of the IEEE Conference on Computer In U. Washington, Tech. Rep. UW-CSE- 15-01-01. Vision and Pattern Recognition. 2554–2564. [57] Hashim Sharif, Prakalp Srivastava, Muhammad Huzaifa, Maria Kot- [39] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. sifakou, Keyur Joshi, Yasmin Sarita, Nathan Zhao, Vikram S. Adve, Gradient-based Learning Applied to Document Recognition. Proc. Sasa Misailovic, and Sarita V. Adve. 2019. ApproxHPVM: a portable IEEE 86, 11 (1998), 2278–2324. compiler IR for accuracy-aware optimizations. PACMPL 3, OOPSLA [40] Yann LeCun, Corinna Cortes, and Christopher JC Burges. (2019), 186:1–186:30. https://doi.org/10.1145/3360612 1998. The MNIST database of handwritten digits. [58] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolu- http://yann.lecun.com/exdb/mnist. tional Networks for Large-scale Image Recognition. arXiv preprint [41] Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain arXiv:1409.1556 (2014). damage. In Advances in neural information processing systems. 598–605. [59] Prakalp Srivastava, Mingu Kang, Sujan K Gonugondla, Sungmin [42] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Lim, Jungwook Choi, Vikram Adve, Nam Sung Kim, and Naresh Graf. 2016. Pruning Filters for Efficient Convnets. arXiv preprint Shanbhag. 2018. PROMISE: An End-to-End Design of a Programmable arXiv:1608.08710 (2016). Mixed-Signal Accelerator for Machine-Learning Algorithms. In 2018 [43] H. Li, K. Ota, and M. Dong. 2018. Learning IoT in Edge: Deep Learning ACM/IEEE 45th Annual International Symposium on Computer Archi- for the Internet of Things with Edge Computing. IEEE Network 32, 1 tecture (ISCA). IEEE. https://doi.org/10.1109/ISCA.2018.00015 (Jan 2018), 96–101. https://doi.org/10.1109/MNET.2018.1700202 [60] Phillip Stanley-Marbell, Armin Alaghi, Michael Carbin, Eva Darulova, [44] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Mari- Lara Dolecek, Andreas Gerstlauer, Ghayoor Gillani, Djordje Jevdjic, anna Pensky. 2015. Sparse convolutional neural networks. In Proceed- Thierry Moreau, Mattia Cacciotti, Alexandros Daglis, Natalie D. En- ings of the IEEE conference on computer vision and pattern recognition. right Jerger, Babak Falsafi, Sasa Misailovic, Adrian Sampson, and 806–814. Damien Zufferey. 2018. Exploiting Errors for Efficiency: A Survey from [45] Mark Harris, NVIDIA. 2016. Mixed-Precision Programming with Circuits to Algorithms. CoRR abs/1809.05859 (2018). arXiv:1809.05859 CUDA 8. https://devblogs.nvidia.com/mixed-precision-programming- http://arxiv.org/abs/1809.05859 cuda-8/. [61] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. [46] M. Mehrabani, S. Bangalore, and B. Stern. 2015. Personalized speech Learning structured sparsity in deep neural networks. 
In Advances in recognition for Internet of Things. In 2015 IEEE 2nd World Forum on neural information processing systems. 2074–2082. Internet of Things (WF-IoT). 369–374. https://doi.org/10.1109/WF- [62] Chris Wiltz. 2018. Magic Leap One Teardown: A Leap For- IoT.2015.7389082 ward for AR/VR? (2018). https://www.designnews.com/design- [47] Andres Milioto, Philipp Lottes, and Cyrill Stachniss. 2018. Real-time se- hardware-software/magic-leap-one-teardown-leap-forward- mantic segmentation of crop and weed for precision agriculture robots arvr/204060129459400 PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea H. Sharif et al.

[63] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choud- to the Underlying Hardware Parallelism. ACM SIGARCH Computer Ar- hury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, chitecture News 45, 2 (2017), 548–560. https://doi.org/10.1145/3140659. Bill Jia, et al. 2019. Machine learning at Facebook: Understand- 3080215 ing Inference at the Edge. In 2019 IEEE International Symposium [66] Wanghong Yuan, Klara Nahrstedt, Sarita Adve, Douglas L Jones, and on High Performance Computer Architecture (HPCA). IEEE, 331–344. Robin H Kravets. 2003. Design and evaluation of a cross-layer adapta- https://doi.org/10.1109/HPCA.2019.00048 tion framework for mobile multimedia systems. In Multimedia Com- [64] Leonid Yavits, Amir Morad, and Ran Ginosar. 2014. Sparse matrix mul- puting and Networking 2003, Vol. 5019. International Society for Optics tiplication on an associative processor. IEEE Transactions on Parallel and Photonics, 1–13. https://doi.org/10.1117/12.484069 and Distributed Systems 26, 11 (2014), 3175–3183. [67] Zeyuan Allen Zhu, Sasa Misailovic, Jonathan A Kelner, and Martin [65] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetu- Rinard. 2012. Randomized accuracy-aware program transformations parna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning for efficient approximate computations. In ACM SIGPLAN Notices, Vol. 47. ACM, 441–454. https://doi.org/10.1145/2103621.2103710