
Automatic Generation of High-Performance Quantized Machine Learning Kernels

Meghan Cowan, University of Washington, Seattle, WA, USA ([email protected])
Thierry Moreau, University of Washington, Seattle, WA, USA ([email protected])
Tianqi Chen, University of Washington, Seattle, WA, USA ([email protected])
James Bornholt, University of Texas at Austin, Austin, TX, USA ([email protected])
Luis Ceze, University of Washington, Seattle, WA, USA ([email protected])

Abstract
Quantization optimizes machine learning inference for resource-constrained environments by reducing the precision of its computation. In the extreme, even single-bit computations can produce acceptable results at dramatically lower cost. But this ultra-low-precision quantization is difficult to exploit because extracting optimal performance requires hand-tuning both high-level scheduling decisions and low-level implementations. As a result, practitioners settle for a few predefined quantized kernels, sacrificing optimality and restricting their ability to adapt to new hardware.

This paper presents a new automated approach to implementing quantized inference for machine learning models. We integrate the choice of how to lay out quantized values into the scheduling phase of a machine learning compiler, allowing it to be optimized in concert with tiling and parallelization decisions. After scheduling, we use program synthesis to automatically generate efficient low-level operator implementations for the desired precision and data layout. We scale up synthesis using a novel reduction sketch that exploits the structure of matrix multiplication. On a ResNet18 model, our generated code outperforms an optimized floating-point baseline by up to 3.9×, and a state-of-the-art quantized implementation by up to 16.6×.

CCS Concepts • Software and its engineering → Compilers.

Keywords quantization, machine learning, synthesis

ACM Reference Format:
Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. 2020. Automatic Generation of High-Performance Quantized Machine Learning Kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’20), February 22–26, 2020, San Diego, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3368826.3377912

1 Introduction
Machine learning has seen successes in a wide variety of application areas. These successes come at the cost of dramatic increases in the memory and compute demands of new models such as convolutional neural networks. For example, performing a single prediction (an inference) with state-of-the-art image recognition models uses tens of millions of parameters and billions of floating-point operations [22, 26]. Quantization techniques reduce the scale of these models to make them practical in resource-constrained environments such as smartphones and edge devices [17]. Quantization reduces the bit-width of both the data (weights) and the computations (additions and multiplications) used to execute the model. In the extreme, even single-bit weights and computations can still yield an effective model (with under 10% accuracy loss) while being orders of magnitude more efficient [12, 30]. This trade-off is appropriate for applications such as facial recognition, which may use a lower-accuracy, on-device model as a low-pass filter to decide whether to offload to the cloud for more expensive, more accurate prediction [24].

Quantized inference evokes a classic accuracy–performance trade-off across a large space of potential configurations. At training time, users must select quantized bit-widths to meet their performance and accuracy targets (we do not consider training in this paper). They must then efficiently implement inference at that chosen quantization level for the target platform. Commodity hardware rarely supports ultra-low-bitwidth operations, and so naive quantized implementations can easily be slower than full-precision versions. These challenges are exacerbated by pressures across the stack.

Rapid progress in new machine learning models demands new specialized quantized operations (e.g., new matrix sizes or new combinations of precisions). Meanwhile, a broad spectrum of hardware backends each require different implementations to best realize performance gains from quantization.

In practice, users deploying quantized inference are restricted to a small class of pre-set configurations with corresponding hand-tuned inference implementations. Developing these hand-tuned implementations is a tedious and labor-intensive process that requires intimate knowledge of the target architecture. One developer estimated that implementing and tuning a single quantization configuration on a single smartphone platform [42] required four person-months of work [40]. Similarly, the authors of another library [39] spent about a month implementing a single configuration [38]. These hand-tuned implementations are often brittle and difficult to extend with new optimizations; both authors reported difficulty with multi-threading and new vector extensions. With the proliferation of new models and hardware backends, investing this level of effort for every configuration is not a scalable approach.

This paper presents a new automated approach to generating efficient code for quantized machine learning inference. Our approach comprises two parts. First, we integrate the choice of how to perform bit-slicing (i.e., how to arrange quantized values in memory) into the scheduling phase of a machine learning compiler [4]. Our insight is that quantization is best viewed as a scheduling decision, in the same way that tensor libraries treat tiling and parallelism as part of the schedule space [29]. The bit-slicing choice is critical to generating efficient code: different layouts influence both spatial locality and the ability to apply different low-level operators for efficient computation (e.g., without scatter or gather operations on vector lanes). Making quantization a scheduling decision also allows us to extract better performance using holistic schedules that combine quantization with other transformations such as loop tiling and parallelization. Finally, integrating quantization into scheduling allows us to apply existing automated scheduling techniques [5] to optimize a model for a given quantization configuration and hardware architecture. This automation supports flexible experimentation and retargeting to new hardware without reengineering operator implementations by hand each time.

Second, to provide quantized low-level kernels for the machine learning compiler to target, we apply program synthesis [35] to automatically generate efficient implementations of each desired quantized operator. Our synthesis approach builds on the new notion of a reduction sketch, which captures domain knowledge about the structure of an operator's computation and allows the synthesis to scale to real-world implementations. The synthesis process takes as input a specification of the hardware's instruction set, the desired operator output, the layout of input data, and a cost function, and automatically synthesizes a sequence of straight-line assembly code that produces the desired output while minimizing the cost function. We implement reduction sketches and our synthesis engine in the Rosette solver-aided language [37], and integrate the generated code into the TVM machine learning compiler [4]. Using program synthesis further supports our goal of retargetability: porting the operators to a new hardware architecture (e.g., new vector extensions) requires only a specification of that hardware's instruction set, from which the necessary implementations can be automatically synthesized.

We evaluate our approach by using it to implement quantized ResNet-18 [15] image classification models. The resulting implementation on a low-power ARM CPU outperforms a floating-point version by 3.9× and is comparable to a state-of-the-art hand-written quantized implementation [39] when configured in the same way. However, our automated approach allows us to apply optimizations such as parallelization (whereas the hand-written implementation is single-threaded) that yield up to a 16.6× speedup over the hand-written quantized implementation. We show that our quantized implementations shift the Pareto frontier of the accuracy–performance trade-off, offering 2.3× higher throughput than existing implementations at the same 22% accuracy loss in edge applications such as video processing [24]. Automation also allows us to explore the space of possible quantization configurations, performing a limit study of the potential speedups available from quantization.

In summary, this paper makes the following contributions:
• Quantization-aware scheduling for compilation of machine learning models
• An automated scheduling approach to tuning quantized implementations for new models and hardware
• A new program synthesis approach to generating quantized low-level tensor kernels
• Results showing we can automatically generate code that executes quantized models up to 12.4× faster than state-of-the-art hand-written code.

2 Quantized Models
Quantized machine learning models promise to reduce the compute and memory demands of inference while maintaining acceptable accuracy. In this paper, we focus on both fully connected and convolutional neural networks. In these networks, computing the output of a layer involves a matrix multiplication followed by an activation function. For example, in a fully connected network, we can compute the activations for layer k+1 as the matrix-vector product of the weights for layer k and the activations for layer k, as shown in Figure 1. A convolutional network is similar but with higher-dimension tensors for the weights and activations (i.e., more dot products required).

In a regular implementation of such a model, the weights w_ij and activations a_i would each be represented as a floating-point number.


Figure 1. In a fully connected neural network, computing the activations for layer k+1 involves a matrix-vector multiplication, which reduces to a series of vector dot products.

Figure 2. Slicing the values of weights and activations into bitplanes enables vector dot products to be computed with bitwise operations. (Panels: original data, bitplanes, and data bitpacked into uint4, for 3-bit weights and 2-bit activations.)

This representation allows the machine learning compiler (e.g., TensorFlow [1]) to leverage decades of work on optimized floating point linear algebra operators in libraries such as Intel's MKL [11] and NVIDIA's cuDNN [8]. However, recent results have shown that the full dynamic range of a floating point representation is rarely necessary to achieve good inference accuracy. It is now common for inference to use 8- or 16-bit fixed point representations [18]. Taking this trend even further, reducing the bitwidths of the weights and activations to between two and four bits achieves competitive accuracy on image recognition problems [17, 45]. In the extreme, even models with single-bit weights and activations can achieve near state-of-the-art results on some problems [12, 30].

Single-bit quantization. The advantage of quantized models is their lower memory and compute requirement, making them suitable for resource-constrained environments. In principle, their memory footprint is 32× smaller (going from single-precision floats to 1-bit values). They require less compute because they can use inexpensive bitwise operations rather than floating point. In Figure 1, computing each output activation a_i^{k+1} requires a vector dot product, which reduces to 4 floating-point multiplications and 3 floating-point additions. But if the activations and weights are quantized to one bit, we can replace the multiplications with logical ANDs and the additions with a single population count. If we pack the bits for each weight w_ij^k in row i into a single four-bit bit-vector ŵ_i^k, and likewise pack the activations a_i^k into a four-bit bit-vector â^k, then we can compute:

    \hat{w}_i^k \cdot \hat{a}^k = \mathrm{popcount}(\hat{w}_i^k \,\&\, \hat{a}^k)

If implemented efficiently for the target architecture, this quantized version uses only simple bitwise operations and can be much faster than a floating-point version. Because it packs multiple weights and activations into a single vector, it can also exploit data parallelism for higher performance.

Multi-bit quantization. Quantizing to single-bit weights and activations may yield unacceptable accuracy loss for some models. We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors. Figure 2 shows an example in which we quantize weights to 3 bits and activations to 2 bits. The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes; for example, the first bitplane vector w_0 holds the least-significant bit of each weight in the vector w. We can then pack each bitplane into larger elements: in this case, a single uint4 value, but other configurations such as two uint2s or four uint1s are possible. Given these bitpacked values ŵ_i and â_i, we can compute

    \hat{w} \cdot \hat{a} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} 2^{n+m} \, \mathrm{popcount}(\hat{w}_n \,\&\, \hat{a}_m)

where N and M are the bitwidths for weights and activations, respectively (N = 3 and M = 2 in the example). This approach maintains the benefits of the single-bit version (bitwise operations and data parallelism), but the number of popcount operations scales with the product O(NM), and so it is only practical for ultra-low-precision quantization; in this paper, we consider quantizations to three bits or fewer.

The above equations assume a unipolar encoding of ŵ and â values, in which bits represent {0, 1} values. These values can instead be bipolar encoded, mapping {0, 1} bits to {−1, +1} values. Recent quantization work [3, 9, 45] uses unipolar encoding for activations and bipolar encoding for weights, making the dot product more complex:

    \hat{w} \cdot \hat{a} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} 2^{n+m} \left[ \mathrm{popcount}(\hat{w}_n \,\&\, \hat{a}_m) - \mathrm{popcount}(\lnot\hat{w}_n \,\&\, \hat{a}_m) \right]

This variation improves the accuracy of quantized models, at the cost of extra popcount operations. In this paper we focus on generating implementations for this hybrid encoding, although our approach can be extended to other encodings.
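To make the bitserial arithmetic above concrete, the following Python sketch (our illustration, not code from the paper) evaluates both the unipolar and the hybrid unipolar–bipolar dot products on bit-packed operands, using the 3-bit weights [5, 0, 4, 1] and 2-bit activations [1, 3, 2, 0] from Figure 2. The packing convention (element i occupies bit i of each plane word) and the helper names are assumptions made for this sketch.

# Illustrative model of the bitserial dot products above. Each bitplane is
# packed into a Python int, with element i stored at bit position i.

def popcount(x):
    return bin(x).count("1")

def dot_unipolar(w_planes, a_planes):
    # sum_{n,m} 2^(n+m) * popcount(w_n & a_m)
    return sum((1 << (n + m)) * popcount(wn & am)
               for n, wn in enumerate(w_planes)
               for m, am in enumerate(a_planes))

def dot_hybrid(w_planes, a_planes, width):
    # Bipolar weights ({0,1} bits stand for {-1,+1}), unipolar activations:
    # sum_{n,m} 2^(n+m) * (popcount(w_n & a_m) - popcount(~w_n & a_m))
    mask = (1 << width) - 1          # limit ~w_n to the packed vector width
    return sum((1 << (n + m)) *
               (popcount(wn & am) - popcount(~wn & mask & am))
               for n, wn in enumerate(w_planes)
               for m, am in enumerate(a_planes))

# 3-bit weights w = [5, 0, 4, 1] and 2-bit activations a = [1, 3, 2, 0],
# sliced into bitplanes (least-significant plane first) and packed.
w_planes = [0b1001, 0b0000, 0b0101]
a_planes = [0b0011, 0b0110]

assert dot_unipolar(w_planes, a_planes) == 13          # = 5*1 + 0*3 + 4*2 + 1*0
assert dot_hybrid(w_planes, a_planes, width=4) == -16  # weight bits read as {-1,+1}

A vectorized kernel replaces the Python ints with SIMD registers and the popcounts with hardware population-count instructions, which is what the synthesized microkernels of Section 4 do.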

3 Quantization Schedules
Quantization promises to improve the performance of inference by using simpler bitwise operations on smaller values. However, a naive implementation of a quantized model is unlikely to be more efficient than a full-precision version. Mapping quantized computations onto commodity hardware requires careful data layout to make effective use of hardware resources and cannot leverage optimized floating point linear algebra libraries. This section introduces our approach to scheduling quantized computation on commodity hardware, and presents an automated workflow for identifying the optimal schedule for a given model on a chosen platform.

3.1 Bit-Slicing Schedules
A machine learning compiler such as TVM [4] reduces a model to a graph of operator invocations. An operator is a linear algebra primitive such as matrix multiplication or convolution, and is the fundamental building block of a model. In a strategy introduced by Halide [29], each operator has both a declarative specification of its output and a schedule describing how to lower it to implementation code. The schedule captures transformations such as tiling and vectorization that preserve semantic equivalence but change the implementation to execute efficiently on a specific hardware target.

We observe that quantization, and in particular the choice of how to lay out data by slicing values into bitplanes, is best viewed as a scheduling decision. Bitserial algorithms such as the dot products in Section 2 operate on each bit of a tensor element independently. They extract performance via data parallelism, packing the sliced bits of many elements together into a single element and computing on them together. But enabling this data parallelism requires choosing an axis along which to slice values into bits, in much the same way as loop tiling requires choosing which axes to tile. This choice critically influences performance, as it must balance spatial locality with the new optimization potential exposed by separating the bits.

To put this observation into action, we introduce a parameterizable bit-packing layout transformation to the scheduling process. It takes as input a d-dimensional tensor and returns a (d+1)-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values. Bit-packing is parameterized by the reduction axis, bit axis, and datatype of the packed values (e.g., uint4 in Figure 2). The reduction axis corresponds to the axis to slice and pack into bitplanes, and the bit axis is the location of the new slices. Bit-packing enables other data-layout transformations to operate at single-bit granularity by exposing the bit axis for use in scheduling. Just as with tiling and other scheduling decisions, the application of the bit-packing transformation does not affect the semantics of the operator it is applied to, only how it is computed. The bit-packing transformation does not determine the desired quantization bitwidths of the values in the model. That choice is made at training time and is a known constant that cannot be changed during scheduling (as we would expect, since changing the bitwidth changes the output of the operator).

Figure 3. The bit-packing transformation transforms a d-dimensional tensor into a (d+1)-dimensional one by slicing the bits of elements into a new bit axis. Here the two-dimensional tensor M × K is transformed in different ways (B × M × K′, M × B × K′, and M × K′ × B) depending on the chosen location of the new bit axis B.

Figure 3 shows an example of applying the bit-packing transformation to a two-dimensional M × K tensor, with K as the reduction axis. Each resulting tensor is three-dimensional, with a new bit axis B that indexes the bits of each original element. The K′ axis after transformation has K′ ≤ K, as bits from multiple contiguous values in the K dimension may be packed together into a single element in the transformed tensor according to the schedule's chosen datatype for packed values (e.g., in Figure 2, four contiguous values were packed into a single uint4 element). The different results of the transformation reflect different choices of the location of the bit axis within the tensor. For example, the M × K′ × B layout indexes the bits as the innermost dimension, placing bits from the same original element contiguously in the tensor.

Operators supporting bit-packing scheduling. We have implemented a library of operators that support the bit-packing transformation as part of their schedules. The library targets common neural network operators such as 2D convolutions and dense matrix multiplication (GEMM). The operators are implemented in TVM's tensor expression language [4], which is similar to those of Halide [29] and TACO [21]. For convolutions, our library contains variants that accept different high-level data layouts such as NCHW and NHWC, two common data layouts used in machine learning compilers. All operators are also parameterized by their quantization precision. We show in Section 4 how to automatically synthesize implementations for each such precision.
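As a concrete, simplified illustration of the bit-packing transformation, the NumPy sketch below slices an M × K tensor along K into bitplanes and packs groups of contiguous values into uint8 words, producing the M × B × K′ layout. This is our own sketch, not the paper's implementation; the function name, the uint8 packed type, and the requirement that K divide evenly into words are assumptions.

import numpy as np

def bit_pack(x, bits, pack_dtype=np.uint8):
    """Slice an (M, K) tensor of `bits`-bit values along K into bitplanes and
    pack them, returning an (M, B, K') tensor with B = bits and
    K' = K / lanes, where `lanes` is the number of bits per packed word."""
    lanes = 8 * np.dtype(pack_dtype).itemsize      # e.g., 8 values per uint8
    M, K = x.shape
    assert K % lanes == 0
    out = np.zeros((M, bits, K // lanes), dtype=pack_dtype)
    for b in range(bits):                          # new bit axis B
        plane = ((x >> b) & 1).astype(pack_dtype)  # bitplane b of every element
        for j in range(lanes):                     # pack contiguous K values
            out[:, b, :] |= (plane[:, j::lanes] << j).astype(pack_dtype)
    return out

# Example: a 2 x 8 tensor of 2-bit values packed along K into uint8 words.
x = np.array([[1, 3, 2, 0, 1, 1, 2, 3],
              [0, 2, 3, 1, 0, 3, 1, 2]], dtype=np.uint8)
packed = bit_pack(x, bits=2)    # shape (2, 2, 1): the M x B x K' layout
# Moving the bit axis (e.g., to M x K' x B) is just a transpose of `packed`.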


3.2 End-to-End Scheduling
Making quantization a scheduling decision allows us to integrate it with other scheduling choices and make holistic optimizations that consider the entire computation. In contrast, existing approaches to implementing quantized models intertwine the application of optimizations such as vectorization with the semantics of the quantized operator [39]. This ad hoc approach to scheduling tightly couples the implementation to a specific model and platform; new models or platforms might benefit from different optimizations, but exploring them would require new hand-tuned implementations.

We reuse the scheduling primitives of TVM [4], which itself uses the scheduling primitives of Halide [29]. In particular, a number of common scheduling primitives are useful in conjunction with bit-packing (a sketch of their use follows the list):
• Loop tiling splits input tensors into tiles that each fit in the cache, improving temporal locality and reducing memory traffic.
• Loop unrolling replicates loop bodies to reduce the overhead of maintaining induction variables and checking exit conditions.
• Vectorization takes advantage of hardware SIMD instructions to operate on multiple elements at once.
• Parallelization takes advantage of hardware MIMD facilities; in our low-power use cases, this means multiple cores on a single multiprocessor.
Each of these primitives is parameterized by the tensor axis along which to transform the implementation.
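For illustration only, the following sketch applies these primitives to a small matrix-multiply operator using TVM's te scheduling API of that era. The shapes, dtypes, split factors, and axis ordering are placeholder choices (in our workflow AutoTVM, discussed below, searches over such parameters), and this is not the paper's actual schedule.

import tvm
from tvm import te

# A small GEMM: C[i, j] = sum_k A[i, k] * B[j, k]
M, N, K = 128, 128, 256
A = te.placeholder((M, K), name="A", dtype="float32")
B = te.placeholder((N, K), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis

# Tile for locality, vectorize the innermost spatial axis, unroll a small
# chunk of the reduction, and parallelize the outermost loop.
io, jo, ii, ji = s[C].tile(i, j, x_factor=8, y_factor=8)
ko, ki = s[C].split(k, factor=4)
s[C].reorder(io, jo, ko, ii, ki, ji)
s[C].unroll(ki)
s[C].vectorize(ji)
s[C].parallel(io)

func = tvm.build(s, [A, B, C], target="llvm")   # lower via LLVM for a CPU

A bit-packed quantized operator is scheduled with the same primitives; the bit-packing transformation simply exposes an additional axis that these primitives can manipulate.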
Figure 4. The cumulative effects of adding scheduling primitives, starting from an un-optimized quantized convolution. AxWy denotes an x-bit activation, y-bit weight quantized convolution. Speedup is relative to 32-bit floating point on ResNet-18 layer C6.

Figure 4 shows the importance of these optimizations to the performance of a quantized model (Section 5 will detail the methodology). Without any optimizations, quantized models perform much worse than an optimized floating-point baseline. By integrating quantization into scheduling, we can compose these standard optimizations without rewriting the quantized implementation. Almost all of these optimizations are necessary for the quantized versions to outperform floating point. Finally, adding a synthesized microkernel (the final column of Figure 4) improves performance by another 1.5–2× over the optimized quantized model; Section 4 presents our approach to generating this kernel.

Automated scheduling. Bringing quantization into the scheduling domain also allows us to use automated scheduling techniques (e.g., autotuning) to tune a given quantized model for a particular hardware architecture. Each scheduling primitive has a number of parameters (e.g., which axis to transform) that have critical influence on the performance of the resulting code. We use the AutoTVM automatic scheduling framework to choose the values of these parameters [5]. AutoTVM searches the space of possible schedules (tiling, vectorization, etc.) and learns a cost model that maps schedules to predicted performance based on trials executed on the target hardware. Using AutoTVM allows practitioners to experiment with quantization on new hardware and new models without manual effort to hand-tune the schedule each time.

We do not use AutoTVM to choose the bit axis for the bit-packed schedule primitive because AutoTVM requires output tensor shapes to be static. Our schedules all use fixed bit axes we found to work well. AutoTVM was used during this process to quickly optimize over different iterations, reducing much of the work in selecting the axis.

4 Microkernel Synthesis
After scheduling a computation, the machine learning compiler must map it down to the target hardware architecture. This process requires lowering the computation onto available low-level microkernels that implement primitives such as matrix multiplication. For a quantized model, these primitives must operate at ultra low precision (e.g., multiplying matrices with 2-bit values). Off-the-shelf linear algebra libraries do not offer kernels with such low precisions, and writing them by hand is tedious since each additional bit of precision requires a different implementation to extract maximal performance.

This section introduces a new automated approach to implementing low-level primitives for the compiler to target. The key idea is to reduce the problem to program synthesis, the task of automatically generating a program that implements a desired specification [14, 35]. We show how to decompose the matrix multiplication primitive into orthogonal parts that can be synthesized separately, and layer a cost function over the synthesis process that drives it towards efficient kernels. Together with automated scheduling, using program synthesis for automated low-level primitives enables an end-to-end automated pipeline for compiling quantized models on any desired hardware platform.

4.1 Specifications
A program synthesizer takes as input a specification of the desired program behavior, and outputs a program that implements that behavior in a chosen language. We focus on synthesizing implementations of matrix multiplication, since it is the common low-level primitive for the convolutional and fully connected neural network models we consider in this paper. Our goal is to synthesize vectorized kernels that efficiently implement the matrix multiplication primitive with the desired shapes and precisions.


Kernel specification. To define the desired behavior of the synthesized microkernel, our synthesis engine takes as input an unoptimized bitserial matrix multiplication implementation written in the target assembly language. This reference implementation fully specifies the desired behavior, including the shape and precision of each input and output matrix as determined by the schedule. For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations). We can easily generate a reference implementation to use as the specification for this kernel by translating the declarative tensor expression generated by the compiler. To avoid reasoning about memory layout during synthesis, we use the reference implementation to pre-define the registers that hold the input matrix and the registers that should hold the output matrix.
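To give a sense of what such a reference specification pins down, here is a Python analogue (not the actual assembly reference) of the 8 × 16 by 16 × 1 example above, with 2-bit bipolar weights and 1-bit unipolar activations in the hybrid encoding of Section 2; the bit-packed argument layout is an assumption of this sketch.

def popcount(x):
    return bin(x).count("1")

def ref_matvec_a1w2(w_planes, a_plane, k=16):
    """Unoptimized reference for an 8x16 (2-bit, bipolar) weight matrix times
    a 16x1 (1-bit, unipolar) activation vector. w_planes[r][n] packs bit n of
    the 16 weights in row r; a_plane packs the 16 activation bits."""
    mask = (1 << k) - 1
    out = []
    for row in w_planes:                       # 8 output values
        acc = 0
        for n, wn in enumerate(row):           # 2 weight bitplanes
            acc += (1 << n) * (popcount(wn & a_plane)
                               - popcount(~wn & mask & a_plane))
        out.append(acc)
    return out

The synthesized kernel must agree with this behavior on all inputs, but is free to reorder, vectorize, and share work however it likes.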
Architecture specification. Our synthesis engine outputs straight-line assembly code that implements the desired microkernel, and so it requires a specification of the target architecture. We follow the Rosette solver-aided language approach [36, 37], in which we specify the target architecture by writing an interpreter for concrete assembly syntax in Racket. Rosette uses symbolic execution to automatically lift this concrete interpreter for use in program verification and synthesis. The interpreter also serves a second role, giving a semantics to the reference implementation that we use as the kernel specification.

While writing an interpreter for an instruction set may seem more expensive than writing a microkernel, ISA designers are increasingly publishing reference semantics for their instruction sets that can be used as interpreters [31], and in practice only a small subset of the instruction set is required. For example, to synthesize code for the ARM NEON vectorized instruction set, we write an interpreter in Racket for NEON instructions:

(define (interpret prgm state)
  (for ([insn prgm])
    (match insn
      [(vand dst r1 r2)
       (state-set! state dst
                   (bvand (state-ref state r1)
                          (state-ref state r2)))]
      [(vor dst r1 r2)
       ...]
      ...)))

The interpreter iterates over the instructions in the input program prgm. For each instruction, it uses pattern matching to dispatch to an implementation that manipulates the machine state state. The implementation of the vand instruction updates the destination register dst to hold the bitwise-and of the two source registers r1 and r2 (in fact, vand is a vector operation and so should update every element of the vector register dst, but we elide the details for simplicity).

Cost function. We want the synthesizer to find the most efficient implementation of the desired kernel on the target architecture (i.e., we are essentially superoptimizing the reference implementation [23, 27]). The synthesis engine thus takes as input a cost function, mapping each program to an integer cost, and finds the program with minimum cost [2]. Statically estimating program performance is difficult, so our cost function is simply program length. More realistic cost functions might guide the synthesizer to more efficient implementations, but this simple cost function still generates high-quality code in practice.

4.2 Compute and Reduction Sketches
Given the inputs above, our synthesis engine searches for a program (a straight-line assembly sequence) that implements the desired specification. To define the search space for the synthesis tool to explore, we write a sketch [35], a syntactic program template containing holes that the synthesizer will try to fill in with program expressions.

To make the synthesis problem tractable, we decompose it into two separate phases and write a sketch for each. The compute sketch implements the initial computation for the matrix multiplication, while the reduction sketch implements the reduction that collects the computed results into the right locations in registers. This separation echoes the two phases of matrix multiplication: first compute the dot products for each necessary pair of vectors, and then arrange them in the right places in memory.

Compute sketch. The compute sketch performs the necessary vector dot products on quantized values, as defined in Section 2. The sketch is a straight-line sequence of assembly code, where each instruction can be either a bitwise operation (and, or, not, addition, etc.) or a special population count intrinsic instruction. In both cases, the synthesizer is free to choose any live registers as the inputs to the instruction, and any register as the output. To break symmetries in the search, output registers must be written to in increasing order. In Rosette, we represent this sketch with functions that evaluate nondeterministically using the built-in choose* form:

(define (??insn) ; choose an arbitrary instruction
  (choose* vadd vand vcnt ...))
(define (??reg n) ; choose a register in range [0,n)
  (reg (apply choose* (range n))))
; sketch of length k with n inputs
(define (sketch k n)
  (define (r i) (??reg (+ i n))) ; input or live reg
  (list
   ((??insn) (r 1) (r 1) (r 1))
   ((??insn) (r 2) (r 2) (r 2))
   ((??insn) (r 3) (r 3) (r 3))
   ...)) ; k lines

The compute sketch has considerable freedom to find novel efficient implementations. For example, it can manipulate packed quantized values at the bit level, decide when and how to reuse intermediate values, etc.


Figure 5. A reduction sketch is a tree that reduces input registers down to a single register. Each level applies an instruction to each pair of live registers. Here the output types have already been synthesized; each level is free to decide the types (vector lane widths) of its outputs.

Figure 6. Microkernels for 8 × 8 by 8 × 1 matrix multiply with 1-bit values, generated by (a) our synthesis tool (24 instructions) and (b) TVM's tensorization (49 instructions). The synthesized version is half the length ("8×" code is unrolled 8 times) and twice as fast.

Reduction sketch. The reduction sketch takes as input the dot products from the compute sketch and sums them along the reduction axis of the matrix multiplication. This reduction phase is difficult because we are targeting vectorized code. To avoid overflowing hardware vector lanes, additions must be performed in the correct order, and values promoted to different vector lane widths as necessary. This datatype polymorphism is challenging for synthesis.

To scale up this reasoning, the reduction sketch exploits the tree-like structure of the reduction computation, similar to the common reduction-tree parallel programming pattern. The reduction sketch, illustrated in Figure 5, is a tree that takes as input the live registers, the datatype of each live register, and the depth of the tree to be synthesized. At each level of the tree, the reduction sketch contains a single hole defining the instruction to apply at that level. The sketch applies the same instruction, in order, to every register pair that remains live at that level. The selected instruction dictates the output datatypes and the number of output registers (which varies depending on whether the instruction promotes to a wider bit-width); in Figure 5, the output types are already synthesized, to show their effects on the output registers. The final output of the reduction sketch is a single vector register that holds the entire (packed) result of the matrix multiplication.
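The following Python sketch (ours, not the Rosette sketch itself) models that tree structure: each level applies one chosen operation pairwise to the live values, optionally widening their lane type, until a single result remains. The (value, width) representation and the widening flag are simplifications standing in for NEON's widening pairwise-add instructions.

def reduce_tree(regs, levels):
    """Model of a reduction sketch. `regs` is a list of (value, width) pairs
    for the live registers; `levels` gives one (op, widen) choice per tree
    level, applied in order to every adjacent pair still live."""
    for op, widen in levels:                  # one instruction hole per level
        nxt = []
        for lo, hi in zip(regs[0::2], regs[1::2]):
            val = op(lo[0], hi[0])
            width = max(lo[1], hi[1]) * (2 if widen else 1)
            nxt.append((val, width))          # promote lane width if widening
        regs = nxt
    assert len(regs) == 1                     # a single packed result remains
    return regs[0]

# Example: fold eight 8-bit partial popcounts into one value, widening once
# partway down so the 8-bit lanes cannot overflow.
partials = [(7, 8), (3, 8), (5, 8), (8, 8), (2, 8), (6, 8), (4, 8), (1, 8)]
levels = [(lambda a, b: a + b, False),
          (lambda a, b: a + b, True),
          (lambda a, b: a + b, False)]
total, width = reduce_tree(partials, levels)  # total == 36, width == 16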
4.3 Implementation
We implement our synthesis engine in the Rosette solver-aided language [37], which extends Racket with support for verification and synthesis. After synthesizing a desired microkernel, we integrate it into TVM's microkernel library for use when compiling models. Rather than directly integrating assembly code, we instead emit the corresponding SIMD intrinsics inside a C function that TVM compiles using LLVM. This abstraction allows the compiler to perform register allocation and interprocedural optimizations such as inlining.

ARM NEON. To target low-power ARM processors, we synthesize code in a subset of the ARM NEON vectorized instruction set. NEON machine state consists of 16 128-bit vector registers known as quad registers. Each quad register also has two aliases from 64-bit double registers that point to its upper and lower halves. Most NEON instructions can reference either register type; our synthesized kernels have the freedom to use the two interchangeably, and mixing the two modes leads to shorter, more efficient code.

Figure 6 shows an example of our synthesized microkernel for multiplying two matrices of shape 8 × 8 and 8 × 1, each with 1-bit quantized values. It also shows an equivalent microkernel generated by TVM's tensorization schedule primitive, which lowers tensor operations to LLVM IR in a template-driven fashion. Our synthesized kernel is half the length of TVM's version, and is more efficient for two main reasons. First, the TVM-generated version operates only on quad-precision registers because it eagerly promotes all values to 16 bits, giving up some of the benefit of quantization. Second, TVM cannot vectorize across the reduction axis, and so needs to perform broadcast loads into the vector lanes for at least one of the inputs. The last column of Figure 4 shows that the synthesized kernel is 1.5–2.5× faster on an ARM CPU.

Other platforms. We have also implemented a synthesis backend for x86's AVX2 vector instruction set. The implementation took only a few days of work for one author, and only required developing a specification of the relevant AVX2 instructions. While we can successfully synthesize microkernels for this architecture, AVX2 lacks a vectorized popcount instruction, and so the kernels are not competitive in performance with floating-point implementations. However, our experience with x86 suggests that synthesis enables rapid porting to new architectures, including potentially to programmable accelerators (as we discuss more in Section 6).


Table 1. Configurations of 2D-convolution operators in ResNet18 [15]. H and W are height and width, IC and OC are input and output channels, K is kernel size, and S is stride size. Layer 1 is omitted as its input channel depth is too small to allow efficient packing.

Name  Operator  H, W    IC, OC    K, S
C2    conv2d    56, 56  64, 64    3, 1
C3    conv2d    56, 56  64, 64    1, 1
C4    conv2d    56, 56  64, 128   3, 2
C5    conv2d    56, 56  64, 128   1, 2
C6    conv2d    28, 28  128, 128  3, 1
C7    conv2d    28, 28  128, 256  3, 2
C8    conv2d    28, 28  128, 256  1, 2
C9    conv2d    14, 14  256, 256  3, 1
C10   conv2d    14, 14  256, 512  3, 2
C11   conv2d    14, 14  256, 512  1, 2
C12   conv2d    7, 7    512, 512  3, 1

Figure 7. End-to-end inference times (ms) for ResNet18 on an ARM CPU, with and without synthesized microkernels (FP32, A2W2, A2W1, and A1W1 configurations).

Figure 8. Accuracy versus performance for quantized ResNet18 models, with and without synthesized microkernels. Higher is better on both axes.

5 Evaluation
To evaluate the effectiveness of our automated approach to implementing quantized models, we address three research questions:
1. Do our automatically generated implementations outperform non-ultra-low-precision versions?
2. How do our implementations compare to hand-written quantized kernels?
3. Does our automation help to explore new quantization configurations efficiently?

Methodology. We test our quantized implementations on a low-power Raspberry Pi 3B with an ARM Cortex-A53 processor. The processor has four cores running at 1.2 GHz and supports the NEON SIMD extensions. All experiments report the average of 10 runs with 95% confidence intervals.

We focus our evaluation on quantized versions of the ResNet18 model [15], because it is small enough to deploy in resource-constrained environments. ResNet18 is a neural network model for image classification comprising 18 layers. Its compute time is dominated by the convolutional layers described in Table 1.

We refer to quantized kernels generated by our approach as AxWy, where x is the bitwidth of activations and y the bitwidth of weights [42]. Each kernel we generate, including floating-point baselines, is optimized independently using AutoTVM [5]. For each convolutional layer, we run AutoTVM for a total of 100 trials in parallel across 10 hosts. Our performance measurements include the cost of bitpacking the activation values, but not the weights, as they can be packed offline before deployment.

Ultra-quantized model architectures differ slightly from floating point models (e.g., by adding quantization layers), but maintain the number and sizes of convolutions, which dominate the compute cost. Quantized models are trained from scratch, and each desired quantization level requires retraining to maintain accuracy (so an A2W1 model cannot be quantized down from an A2W2 model). However, the shape of the model stays the same, and so we can compare run-time performance across quantization levels.

5.1 Quantization versus Floating Point
To demonstrate the performance benefits of quantization, Figure 7 shows end-to-end inference times on the ARM platform for both non-quantized and quantized models. The quantized results use our synthesized kernels and bit-packed schedules, while the 32-bit floating point result uses a pre-existing schedule in TVM. We find that the quantized model outperforms the floating-point version by up to 3.9×, confirming that quantization yields speedups in practice and a significant memory footprint reduction.

Figure 7 also shows the importance of our synthesized microkernels for extracting performance from quantized models. Without the synthesized microkernels, TVM follows its default code generation strategy (lowering tensor expressions to LLVM IR). This strategy yields inefficient implementations that barely outperform floating point at A2W2. End-to-end inference is an average of 1.9× faster using our synthesized microkernels.


The improved performance of our synthesized microkernels shifts the Pareto frontier of the accuracy–performance trade-off for quantization. Figure 8 shows the image classification accuracy (measured as top-1 accuracy, i.e., the fraction of images correctly classified) of the quantized ResNet18 models against their inference performance. Without synthesized microkernels, the accuracy loss of quantization offers little performance benefit. Our synthesized implementations shift the Pareto frontier outwards, offering a choice of points in the design space for quantized models. For example, an 8% accuracy loss from FP32 yields 1.9× higher inference throughput (A2W2), and an additional 14% loss yields a further 2.1× throughput (A1W1). Newer training techniques can further reduce the accuracy loss, as we survey in Section 6.

Limitations. While our generated implementations offer significant speedups over the floating-point baseline, they are lower than the theoretical performance gain we would expect, due to a number of inefficiencies. First, quantized models still execute some layers in floating point, including the initial convolution (which we exclude from Table 1), that limit potential speedups. For example, in the A1W1 model, the initial convolution is performed in floating point and consumes 32% of total running time, versus only 8% in the floating-point model. Similarly, the operations between convolutions are often performed in floating point, requiring conversions from integer to floating point and back.

Second, our implementations must spend instructions repacking bits into the appropriate data layout, whereas ARM has native support for single-precision floating-point sized data. This bitpacking consumes 2–3% of the end-to-end run time, and 13–21% of an individual convolution's run time.

Finally, in the floating-point implementation, the model compiler can take advantage of hardware fused multiply-add instructions and of alternative floating-point convolution algorithms. We could recoup some of these inefficiencies with more work on higher-level optimizations.

Figure 9. Layer-by-layer speedups for A2W1 convolutions normalized to PyTorch's hand-optimized A2W1 kernel for ARM [39]. The PyTorch kernels are single-threaded, so we compare against both single- and multi-threaded versions generated by our approach.

Figure 10. Quantized convolution speedups for various quantization configurations (A1W1, A2W1, A2W2), normalized to 32-bit floating point.

5.2 Comparison to Hand-Written Code
Our automatically generated quantized kernels outperform hand-written quantized implementations developed by experts. Figure 9 compares our implementation to the hand-written ultra-low-precision convolution library in PyTorch [39]. The library was written specifically for the ARM architecture, and makes extensive use of NEON SIMD intrinsics and loop tiling for performance; it reaches about 70% of peak theoretical single-thread performance. The library focuses exclusively on A2W1 quantized convolutions and does not provide an end-to-end implementation of ResNet, so we compare performance on the individual convolutions rather than end-to-end inference time.

When we restrict our approach to single-threaded implementations, the performance of our synthesized kernels is generally comparable to the expert-written code, which is single-threaded. Our version does outperform PyTorch on some layers by up to 7× (C5, C8, C11); these layers are "down-sampling" convolutions that the PyTorch library does not optimize for. In contrast, our scheduling abstraction allows us to independently optimize each convolution.

Because our approach integrates scheduling and code generation, we can easily generate an optimized multi-threaded implementation without manual code changes. In Figure 9, our multi-threaded A2W1 convolution outperforms PyTorch by an average of 5.3× and up to 16.6×. This result demonstrates the flexibility of our automated approach to adapt to new hardware resources and new optimizations that were missing from hand-written kernels.

5.3 Exploring Quantization Configurations
Our automation allows us to rapidly explore the potential benefits of different quantization configurations without hand-tuning for each new configuration and platform. We exploited this automation to perform a limit study of the possible speedups available at reduced precisions.


Figure 11. Relative speedup over 32-bit floating point, and kernel synthesis time, for convolutional layer C2 at different quantization configurations (activation bits versus weight bits).

Table 2. Microkernel synthesis and scheduling times. Synthesis time is broken down into time spent solving the compute and reduction sketches and verifying the solution. Scheduling time is the total across all convolutional layers.

Config  Compute (s)  Reduction (s)  Verify (s)  Total synthesis (s)  Scheduling (s)
A1W1    2.4          9.0            5.8         17.2                 344
A2W1    21.0         13.9           10.6        45.5                 328
A1W2    62.2         12.7           15.2        90.1                 352
A3W1    62.2         18.9           15.2        158.3                435
A1W3    63.8         32.1           15.4        109.7                460
A2W2    90.4         52.4           15.5        158.3                334

Figure 10 shows the performance of different configurations generated with our approach compared to a 32-bit floating-point baseline. The results support the intuition that smaller quantizations (e.g., A1W1) perform better. However, the performance benefit of quantization varies between layers; for example, A1W1 speedups vary from 3.6× to 14.7×. The speedup of a layer is correlated with the number of input channels in the convolution (Table 1). In our generated schedules, tensors are both bit-packed and vectorized along the input channel axis. Increasing the input channels therefore improves utilization of the available hardware resources. The larger convolutions also expand the working set beyond the ARM CPU's cache size in the floating-point version.

To further explore the limits of quantization, Figure 11 shows the performance of all configurations up to A3W3 on layer C2, which is representative of the average performance profile in Figure 10. The missing configurations require combining 8-bit and 16-bit arithmetic during the compute phase, which our compute sketch does not support. The potential performance benefit of quantization degrades rapidly as the number of bits increases, due to the O(NM) scaling shown in Section 2. But even at A2W2, quantization offers a considerable speedup over floating point; coupled with the reduction in memory footprint, this result confirms the value of quantization in resource-constrained environments.

Figure 11 also shows the performance of our synthesis engine for each configuration, with more detail in Table 2. Synthesis performance worsens as the number of bits increases, from 17 seconds at A1W1 (two bits total) to 158 seconds at A2W2 (four bits total). This poor scaling is because we allow the synthesizer to reason about individual bits, which gives it maximum freedom to find optimizations, but makes reasoning expensive. Increases in the number of weight bits cause more dramatic performance degradation than increases in activation bits (e.g., A1W2 is slower to synthesize than A2W1), because weight matrices are larger than activation matrices and so more total bits are necessary.

Table 2 also shows the time required for AutoTVM scheduling of each configuration. Each time is the total scheduling time across all convolutional layers. Scheduling is largely independent of the number of bits, and so does not degrade as dramatically as the number of bits increases. However, scheduling time dominates synthesis time in all cases, and so we could narrow the scope of AutoTVM's search if faster end-to-end synthesis were necessary.

6 Related Work
Quantized Neural Networks. Whereas this paper focuses on efficiently executing quantized networks, most prior work focuses on the orthogonal problem of training such models to minimize accuracy loss due to quantization. BinaryNet [12] presents a training algorithm for binary neural networks with weights and activations quantized to 1 bit. The resulting networks are competitive in accuracy with floating point networks on simple image recognition tasks such as handwritten digit recognition. XNORNet [30] improved the accuracy of binary neural networks through architectural changes, but had much worse accuracy than floating point on complex recognition tasks, such as classification on ImageNet [33].

More recent quantized neural networks focus on reducing the accuracy gap on ImageNet by increasing the precision. DoReFa-Net [45] and HWGQ [3] quantize activations down to 2 to 4 bits while keeping 1-bit weights, and trained models with accuracy within 5% to 9% of floating point on the same model architecture. By quantizing both weights and activations to 2 bits, Choi et al. [9] further reduced the accuracy gap to 3.4% below floating point.

Quantized Machine Learning Kernels. Quantized neural networks are trained in floating point so that iterative small adjustments to the model can be made; even the forward pass during training only simulates quantization. There has been comparably little work on efficient inference for quantized neural networks, or on implementing the quantized operators they depend on.

Quantized Machine Learning Kernels. Quantized neural networks are trained in floating point so that iterative small adjustments to the model can be made; even the forward pass during training only simulates quantization. There has been comparably little work on efficient inference for quantized neural networks, or on implementing the quantized operators they depend on. Umuroglu and Jahre [42] and Tulloch and Jia [39] present implementations of quantized machine learning kernels for ARM CPUs with hand-optimized code. Both implementations use ARM NEON intrinsics and rely on careful tiling to match the register and cache sizes of their target devices. As a result, their peak performance is closely tied to their target device.

BitFlow [16] is a hierarchical framework for implementing binary neural networks on CPUs. It provides a code generator that efficiently maps 1-bit quantized operators to the appropriate SIMD compute kernels based on the operator dimensions, and divides work among the available cores to exploit MIMD parallelism. In contrast, we take a more holistic approach: our scheduling phase encompasses a variety of optimizations, including choices about how to bit-pack quantized data, that can be applied in any (valid) combination. Our approach also extends easily to quantizations beyond 1 bit.

Earlier versions of TVM [4] supported ultra-low-precision quantization using hand-written microkernels for ARM's NEON vector extensions [13]. These implementations are no longer supported because of the difficulty of maintaining them; recent TVM versions instead use the LLVM code generation approach to quantization that we outlined in Section 5. Performance of the old implementations is not directly comparable to ours, because they used unipolar encodings for quantization, whereas we use the more modern hybrid unipolar–bipolar encoding (which requires twice the popcounts). Nonetheless, our synthesized microkernels are on average 10%, and up to 2.2×, faster than TVM's previous hand-written kernels.
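To see where the extra popcount comes from, the following pure-Python sketch computes a packed 1-bit dot product under both encodings; the 64-bit packing and the helper names are assumptions made for this illustration, not code from either implementation.

    # Unipolar: activations and weights are both in {0, 1}, one bit per channel.
    # Hybrid unipolar-bipolar: activations are in {0, 1}, weights are in {-1, +1}
    # (a weight bit of 1 encodes +1, a bit of 0 encodes -1).

    def popcount(x):
        return bin(x).count("1")

    def dot_unipolar(a_bits, w_bits):
        # sum_i a_i * w_i with a_i, w_i in {0, 1}: a single AND and popcount.
        return popcount(a_bits & w_bits)

    def dot_hybrid(a_bits, w_sign_bits, width=64):
        # sum_i a_i * w_i with a_i in {0, 1} and w_i in {-1, +1}:
        # channels with a_i = 1, w_i = +1 contribute +1, and channels with
        # a_i = 1, w_i = -1 contribute -1, so two popcounts are needed per word.
        mask = (1 << width) - 1
        return popcount(a_bits & w_sign_bits) - popcount(a_bits & ~w_sign_bits & mask)

    # Check against an unpacked reference on one 64-channel word.
    import random
    a = [random.randint(0, 1) for _ in range(64)]
    w = [random.choice([-1, 1]) for _ in range(64)]
    a_bits = sum(bit << i for i, bit in enumerate(a))
    w_sign = sum(1 << i for i, wi in enumerate(w) if wi == 1)
    assert dot_hybrid(a_bits, w_sign) == sum(ai * wi for ai, wi in zip(a, w))

In the standard bit-serial formulation, higher precisions repeat this word-level kernel over every pair of activation and weight bit-planes, weighting each partial popcount by the planes' combined significance before accumulation.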
Automatic Generation of Kernels. Automatic generation of efficient floating-point tensor kernels is a long-standing problem. Tensor compilers such as TACO [21] and Halide [29] automatically generate kernels for sparse linear algebra and image processing, respectively. Machine learning compilers such as Tensor Comprehensions [44], GLOW [32], and TVM [4] generate efficient code specifically for machine learning models using domain-specific optimizations at both the graph and operator level. Our work extends this approach by focusing on quantized models, allowing us to exploit domain-specific knowledge (e.g., the bit-packing axis) to generate efficient implementations.

Specialized Hardware Backends. The emergence of machine learning accelerators in ASICs [6, 7, 19] and FPGAs [10, 25, 41] has led to increased specialization of both hardware intrinsics (e.g., single-instruction matrix multiplication) and data type selection. Accelerators that expose a quantized programming interface [20, 34, 43] offer new opportunities for our synthesis approach to extract performance. For example, an accelerator could offer specialized operations for AnWm quantizations, and our synthesis engine could compose these operations to reach the desired configuration for a given model. Since the optimal scheduling of a workload on an accelerator is a function of compute, memory bandwidth, and on-chip storage, our automated scheduling approach would benefit accelerators by delivering the best performance at all quantization settings.

Program Synthesis. Our approach uses program synthesis to generate quantized tensor kernels, following the lead of many existing tools. Spiral [28] is a tool for generating high-performance implementations of fast Fourier transforms and other DSP primitives. Chlorophyll [27] is a superoptimizer for a low-power spatial architecture. Synapse [2] is a program synthesis framework for compiling low-level programs optimally with respect to a cost function. Our work further demonstrates the potential for program synthesis as a tool for generating efficient implementations of small performance-critical kernels that a regular compiler is unable to find.

7 Conclusion

Our automated approach to compiling quantized models combines the strengths of both machine learning (for scheduling [5]) and program synthesis (for code generation [37]). The result is a workflow that generates kernels that outperform both optimized floating-point and state-of-the-art quantized implementations. Automation also helps machine learning practitioners experiment with new quantization regimes, model designs, and hardware architectures, all without having to re-engineer low-level kernels with each change. Our work makes quantized models practical to deploy in resource-constrained settings, bringing the successes of machine learning to a plethora of new environments.

Acknowledgements

We thank Andrew Tulloch and Yaman Umuroglu for sharing their experiences implementing quantized inference; Josh Fromm for his help in training models; and our shepherd, Christophe Dubach, and the anonymous reviewers for their helpful feedback. This work was supported in part by NSF under grant CCF-1518703; by CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; by gifts from Xilinx, Intel (under the CAPA program), Oracle, Amazon, Qualcomm, and other anonymous sources; and by a Facebook PhD Fellowship.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.
[2] James Bornholt, Emina Torlak, Dan Grossman, and Luis Ceze. 2016. Optimizing synthesis with metasketches. In POPL.
[3] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. 2017. Deep learning with low precision by half-wave Gaussian quantization. In CVPR.


[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI.
[5] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In NeurIPS.
[6] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In MICRO.
[7] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC (2017).
[8] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[9] Jungwook Choi, Swagath Venkataramani, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Zhuo Wang, and Pierce Chuang. 2019. Accurate and efficient 2-bit quantized neural networks. In SysML Conference.
[10] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. 2018. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro (2018).
[11] Intel Corporation. 2009. Intel Math Kernel Library reference manual.
[12] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).
[13] Meghan Cowan, Thierry Moreau, Tianqi Chen, and Luis Ceze. 2018. Towards automating generation of low precision deep learning operators. In MLPCD. arXiv preprint arXiv:1810.11066.
[14] Sumit Gulwani, Susmit Jha, Ashish Tiwari, and Ramarathnam Venkatesan. 2011. Synthesis of loop-free programs. In PLDI.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
[16] Yuwei Hu, Jidong Zhai, Dinghua Li, Yifan Gong, Yuhao Zhu, Wei Liu, Lei Su, and Jiangming Jin. 2018. BitFlow: Exploiting vector parallelism for binary neural networks on CPU. In IPDPS.
[17] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR 18 (2017).
[18] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
[19] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In ISCA.
[20] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, and Andreas Moshovos. 2016. Stripes: Bit-serial deep neural network computing. In MICRO.
[21] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman P. Amarasinghe. 2017. The tensor algebra compiler. In OOPSLA.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS.
[23] Henry Massalin. 1987. Superoptimizer: A look at the smallest program. In ASPLOS.
[24] Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan, Armin Alaghi, Luis Ceze, Mark Oskin, and Visvesh Sathe. 2017. Exploring computation-communication tradeoffs in camera systems. In IISWC.
[25] Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, and Carlos Guestrin. 2019. A hardware–software blueprint for flexible deep learning specialization. IEEE Micro (2019).
[26] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. 2015. Deep face recognition. In BMVC.
[27] Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. 2014. Chlorophyll: Synthesis-aided compiler for low-power spatial architectures. In PLDI.
[28] M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, Jianxin Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. 2005. SPIRAL: Code generation for DSP transforms. Proc. IEEE (2005).
[29] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In PLDI.
[30] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV.
[31] Alastair Reid. 2017. Who guards the guards? Formal validation of the Arm v8-M architecture specification. In OOPSLA.
[32] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, et al. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018).
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115 (2015).
[34] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. 2018. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA.
[35] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Vijay Saraswat, and Sanjit Seshia. 2006. Combinatorial sketching for finite programs. In ASPLOS.
[36] Emina Torlak and Rastislav Bodik. 2013. Growing solver-aided languages with Rosette. In Onward!
[37] Emina Torlak and Rastislav Bodik. 2014. A lightweight symbolic virtual machine for solver-aided host languages. In PLDI.
[38] Andrew Tulloch. 2019. Private communication.
[39] Andrew Tulloch and Yangqing Jia. 2017. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).
[40] Yaman Umuroglu. 2018. Private communication.
[41] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A framework for fast, scalable binarized neural network inference. In FPGA.
[42] Yaman Umuroglu and Magnus Jahre. 2017. Towards efficient quantized neural network inference on mobile devices. In CASES.
[43] Yaman Umuroglu, Lahiru Rasnayake, and Magnus Själander. 2018. BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing. In FPL.
[44] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).
[45] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).
