A Survey of Quantization Methods for Efficient Neural Network Inference
Amir Gholami∗, Sehoon Kim∗, Zhen Dong∗, Zhewei Yao∗, Michael W. Mahoney, Kurt Keutzer
University of California, Berkeley
{amirgh, sehoonkim, zhendong, zheweiy, mahoneymw, keutzer}@berkeley.edu
∗Equal contribution.
arXiv:2103.13630v3 [cs.CV] 21 Jun 2021

Abstract—As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.

I. INTRODUCTION

Over the past decade, we have observed significant improvements in the accuracy of Neural Networks (NNs) for a wide range of problems, often achieved by highly over-parameterized models. While the accuracy of these over-parameterized (and thus very large) NN models has significantly increased, the sheer size of these models means that it is not possible to deploy them for many resource-constrained applications. This creates a problem for realizing pervasive deep learning, which requires real-time inference, with low energy consumption and high accuracy, in resource-constrained environments. This pervasive deep learning is expected to have a significant impact on a wide range of applications such as real-time intelligent healthcare monitoring, autonomous driving, audio analytics, and speech recognition.

Achieving efficient, real-time NNs with optimal accuracy requires rethinking the design, training, and deployment of NN models [71]. There is a large body of literature that has focused on addressing these issues by making NN models more efficient (in terms of latency, memory footprint, energy consumption, etc.), while still providing optimal accuracy/generalization trade-offs. These efforts can be broadly categorized as follows.

a) Designing efficient NN model architectures: One line of work has focused on optimizing the NN model architecture in terms of its micro-architecture [101, 111, 127, 167, 168, 212, 253, 280] (e.g., kernel types such as depth-wise convolution or low-rank factorization) as well as its macro-architecture [100, 101, 104, 110, 214, 233] (e.g., module types such as residual or inception). The classical techniques here mostly found new architecture modules using manual search, which is not scalable. As such, a new line of work is to design Automated Machine Learning (AutoML) and Neural Architecture Search (NAS) methods. These aim to find the right NN architecture in an automated way, under given constraints on model size, depth, and/or width [161, 194, 232, 245, 252, 291]. We refer the interested reader to [54] for a recent survey of NAS methods.

b) Co-designing NN architecture and hardware together: Another recent line of work has been to adapt (and co-design) the NN architecture for a particular target hardware platform. This is important because the overhead of a NN component (in terms of latency and energy) is hardware-dependent. For example, hardware with a dedicated cache hierarchy can execute bandwidth-bound operations much more efficiently than hardware without such a cache hierarchy. Similar to NN architecture design, initial approaches at architecture-hardware co-design were manual, where an expert would adapt/change the NN architecture [70], followed by using automated AutoML and/or NAS techniques [22, 23, 100, 252].

c) Pruning: Another approach to reducing the memory footprint and computational cost of NNs is to apply pruning. In pruning, neurons with small saliency (sensitivity) are removed, resulting in a sparse computational graph. Here, neurons with small saliency are those whose removal minimally affects the model output/loss function. Pruning methods can be broadly categorized into unstructured pruning [49, 86, 139, 143, 191, 257] and structured pruning [91, 106, 156, 166, 274, 275, 279]. With unstructured pruning, one removes neurons with small saliency, wherever they occur. With this approach, one can perform aggressive pruning, removing most of the NN parameters, with very little impact on the generalization performance of the model. However, this approach leads to sparse matrix operations, which are known to be hard to accelerate, and which are typically memory-bound [21, 66]. On the other hand, with structured pruning, a group of parameters (e.g., entire convolutional filters) is removed. This has the effect of changing the input and output shapes of layers and weight matrices, thus still permitting dense matrix operations. However, aggressive structured pruning often leads to significant accuracy degradation. Training and inference with high levels of pruning/sparsity, while maintaining state-of-the-art performance, has remained an open problem [16]. We refer the interested reader to [66, 96, 134] for a thorough survey of related work in pruning/sparsity.
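To make the difference between these two pruning granularities concrete, below is a minimal NumPy sketch contrasting unstructured (element-wise) pruning with structured (filter-level) pruning of a convolutional weight tensor. It uses weight magnitude as a simple stand-in for saliency; the tensor shape, pruning ratios, and criterion are illustrative assumptions, not the specific methods cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv-layer weights: (out_channels, in_channels, kH, kW).
W = rng.normal(size=(8, 4, 3, 3)).astype(np.float32)

# Unstructured pruning: zero out the individual weights with the smallest
# magnitude (here, the smallest 70%). The tensor keeps its shape, but the
# resulting sparse matrix operations are hard to accelerate on dense hardware.
threshold = np.quantile(np.abs(W), 0.7)
mask = np.abs(W) >= threshold
W_unstructured = W * mask

# Structured pruning: drop entire output filters with the smallest L2 norm
# (here, the smaller half). This changes the layer's output shape, so the
# remaining computation stays dense.
filter_norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
keep = np.argsort(filter_norms)[W.shape[0] // 2:]
W_structured = W[keep]  # shape: (4, 4, 3, 3)

print("unstructured sparsity:", 1.0 - mask.mean())
print("structured pruned shape:", W_structured.shape)
```

In a full model, removing output filters would also require trimming the corresponding input channels of the next layer; the sketch omits that bookkeeping.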
d) Knowledge distillation: Model distillation [3, 95, 150, 177, 195, 207, 269, 270] involves training a large model and then using it as a teacher to train a more compact model. Instead of using "hard" class labels during the training of the student model, the key idea of model distillation is to leverage the "soft" probabilities produced by the teacher, as these probabilities can contain more information about the input. Despite the large body of work on distillation, a major challenge here is to achieve a high compression ratio with distillation alone. Compared to quantization and pruning, which can maintain the performance with ≥4× compression (with INT8 and lower precision), knowledge distillation methods tend to have non-negligible accuracy degradation, even with aggressive compression. However, combining distillation with prior methods (i.e., quantization and pruning) has shown great success [195].

e) Quantization: Finally, quantization is an approach that has shown great and consistent success in both training and inference of NN models. While the problems of numerical representation and quantization are as old as digital computing, Neural Nets offer unique opportunities for improvement. While this survey on quantization is mostly focused on inference, we should emphasize that an important success of quantization has been in NN training [10, 35, 57, 130, 247]. In particular, the breakthroughs of half-precision and mixed-precision training [41, 72, 79, 175] have been the main drivers that have enabled an order of magnitude higher throughput in AI accelerators. However, it has proven very difficult to go below half-precision without significant tuning, and most of the recent quantization research has focused on inference. This quantization for inference is the focus of this article.
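As a concrete illustration of inference-time quantization, here is a minimal NumPy sketch of symmetric uniform quantization of an FP32 tensor to INT8, one of the simplest schemes in this family. The per-tensor scale, the random weights, and the function names are assumptions made for illustration, not a particular method from the works cited above.

```python
import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    """Map FP32 values to INT8 using a single per-tensor scale factor."""
    # Choose the scale so that the largest-magnitude value maps to 127.
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate FP32 tensor from its INT8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

q, s = quantize_symmetric_int8(w)
w_hat = dequantize(q, s)

# Storage drops 4x (8 bits vs. 32 bits per value) at the cost of a small,
# bounded rounding error of at most scale / 2 per element.
print("max abs error:", float(np.max(np.abs(w - w_hat))))
print("FP32 bytes:", w.nbytes, "INT8 bytes:", q.nbytes)
```

Asymmetric variants add a zero-point offset, and per-channel scales are common for weights; this sketch keeps to the simplest per-tensor symmetric case.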
f) Quantization and Neuroscience: Loosely related to (and for some a motivation for) NN quantization is work in neuroscience that suggests that the human brain stores information in a discrete/quantized form, rather than in a continuous form [171, 236, 240]. A popular rationale for this idea is that information stored in continuous form will inevitably get corrupted by noise (which is always present in the physical environment, including our brains, and which can be induced by thermal, sensory, external, synaptic noise, etc.) [27, 58]. However, discrete signal representations can be more robust to such low-level noise. Other reasons, including the higher generalization power of discrete representations [128, 138, 242] and their higher efficiency under limited resources [241], have also been proposed. We refer the reader to [228] for a thorough review of related work in the neuroscience literature.

The goal of this work is to introduce current methods and concepts used in quantization and to discuss the current challenges and opportunities in this line of research. In doing so, we have tried to discuss most relevant work. It is not possible to discuss every work in a field as large as NN quantization in the page limit of a short survey; and there is no doubt that we have missed some relevant papers. We apologize in advance both to the readers and the authors of papers that we may have neglected.

In terms of the structure of this survey, we will first provide