<<

Pruning and Quantization for Deep Neural Network Acceleration: A Survey

Tailin Liang a,b, John Glossner a,b,c, Lei Wang a, Shaobo Shi a,b and Xiaotong Zhang a,∗

a School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
b Hua Xia General Processor Technologies, Beijing 100080, China
c General Processor Technologies, Tarrytown, NY 10591, United States

∗ Corresponding author. Emails: [email protected] (T. Liang); [email protected] (J. Glossner); [email protected] (L. Wang); [email protected] (S. Shi); [email protected] (X. Zhang). ORCID(s): 0000-0002-7643-912X (T. Liang).

ARTICLE INFO

Keywords: convolutional neural network; neural network acceleration; neural network quantization; neural network pruning; low-bit mathematics

ABSTRACT

Deep neural networks have been applied in many applications, exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve. This paper provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations may be quantized, typically to 8-bit integers, although lower bit width implementations are also discussed, including binary neural networks. Both pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.

1. Introduction

Deep Neural Networks (DNNs) have shown extraordinary abilities in complicated applications such as image classification, object detection, voice synthesis, and semantic segmentation [138]. Recent neural network designs with billions of parameters have demonstrated human-level capabilities but at the cost of significant computational complexity. DNNs with many parameters are also time-consuming to train [26]. These large networks are also difficult to deploy in embedded environments. Bandwidth becomes a limiting factor when moving weights and data between Compute Units (CUs) and memory. Over-parameterization is the property of a neural network where redundant neurons do not improve the accuracy of results. This redundancy can often be removed with little or no accuracy loss [225].

Figure 1 shows three design considerations that may contribute to over-parameterization: 1) network structure, 2) network optimization, and 3) hardware accelerator design. These design considerations are specific to Convolutional Neural Networks (CNNs) but also generally relevant to DNNs.

Network structure encompasses three parts: 1) novel components, 2) network architecture search, and 3) knowledge distillation. Novel components are efficient building blocks such as separable convolutions, inception blocks, and residual blocks. They are discussed in Section 2.4. Network components also encompass the types of connections within layers. Fully connected deep neural networks require N² connections between neurons. Feed forward layers reduce connections by considering only connections in the forward path. This reduces the number of connections to N. Other types of components such as dropout layers can reduce the number of connections even further.

Network Architecture Search (NAS) [63], also known as network auto search, programmatically searches for a highly efficient network structure from a large predefined search space. An estimator is applied to each produced architecture. While time-consuming to compute, the final architecture often outperforms manually designed networks.

Knowledge Distillation (KD) [80, 206] evolved from knowledge transfer [27]. The goal is to generate a simpler compressed model that functions as well as a larger model. KD trains a student network to imitate a teacher network. The student network is usually but not always smaller and shallower than the teacher. The trained student model should be less computationally complex than the teacher.
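To make the soft-target idea behind KD concrete, the sketch below shows one common form of a distillation objective in NumPy: temperature-scaled teacher probabilities mixed with the ordinary hard-label loss. It is an illustrative sketch, not the exact formulation of any work cited above; the temperature T and mixing weight alpha are assumed hyperparameters.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hypothetical KD objective: alpha * soft-target cross-entropy (teacher vs.
    student at temperature T) plus (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)               # soft targets
    p_student = softmax(student_logits, T)
    soft_ce = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    hard_probs = softmax(student_logits)[np.arange(len(labels)), labels]
    hard_ce = -np.mean(np.log(hard_probs + 1e-12))
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce

# Toy usage: a batch of 2 samples with 5 classes.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 5))
student = rng.normal(size=(2, 5))
print(distillation_loss(student, teacher, labels=np.array([1, 3])))
```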


[Figure 1 spans here. It depicts a taxonomy of CNN acceleration approaches [40, 39, 142, 137, 194, 263, 182]: Network Structure (Novel Components; Network Architecture Search [63]; Knowledge Distillation [80, 206]), Network Optimization (Convolution Optimization; Factorization; Pruning [201, 24, 12, 250]; Quantization [131, 87]), and Hardware Accelerator [151, 202] (platforms: CPU, GPU, ASIC, FPGA [86, 3, 234, 152]; platform optimizations: Lookup Table, Computation Reuse, Memory Optimization).]

Figure 1: CNN Acceleration Approaches: Following the path from design to implementation, CNN acceleration falls into three categories: structure design (or generation), further optimization, and specialized hardware.

Network optimization [137] includes: 1) computational convolution optimization, 2) parameter factorization, 3) network pruning, and 4) network quantization. Convolution operations are more efficient than fully connected computations because they keep high dimensional information as a 3D tensor rather than flattening the tensors into vectors, which removes the original spatial information. This feature helps CNNs to fit the underlying structure of image data in particular. Convolution layers also require significantly fewer coefficients compared to Fully Connected Layers (FCLs). Computational convolution optimizations include Fast Fourier Transform (FFT) based convolution [168], Winograd convolution [135], and the popular image to column (im2col) [34] approach. We discuss im2col in detail in Section 2.3 since it is directly related to general pruning techniques.

Parameter factorization is a technique that decomposes higher-rank tensors into lower-rank tensors, simplifying memory access and compressing model size. It works by breaking large layers into many smaller ones, thereby reducing the number of computations. It can be applied to both convolutional and fully connected layers. This technique can also be applied with pruning and quantization.

Network pruning [201, 24, 12, 250] involves removing parameters that do not impact network accuracy. Pruning can be performed in many ways and is described extensively in Section 3.

Network quantization [131, 87] involves replacing datatypes with reduced width datatypes, for example replacing 32-bit Floating Point (FP32) with 8-bit Integer (INT8). The values can often be encoded to preserve more information than simple conversion. Quantization is described extensively in Section 4.

Hardware accelerators [151, 202] are designed primarily for network acceleration. At a high level they encompass entire processor platforms and often include hardware optimized for neural networks. Processor platforms include specialized Central Processing Unit (CPU) instructions, Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), and Field Programmable Gate Arrays (FPGAs).

CPUs have been optimized with specialized Artificial Intelligence (AI) instructions, usually within specialized Single Instruction Multiple Data (SIMD) units [49, 11]. While CPUs can be used for training, they have primarily been used for inference in systems that do not have specialized inference accelerators.

GPUs have been used for both training and inference. nVidia has specialized tensor units incorporated into their GPUs that are optimized for neural network acceleration [186]. AMD [7], ARM [10], and Imagination [117] also have GPUs with instructions for neural network acceleration.

Specialized ASICs have also been designed for neural network acceleration. They typically target inference at the edge, in security cameras, or on mobile devices. Examples include General Processor Technologies (GPT) [179], ARM, and 60+ others [202], all of which have processors targeting this space. ASICs may also target both training and inference in datacenters. Examples include Tensor Processing Units (TPU) from Google [125], Habana from Intel [169], Kunlun from Baidu [191], Hanguang from Alibaba [124], and the Intelligence Processing Unit (IPU) from Graphcore [121].

Programmable reconfigurable FPGAs have been used for neural network acceleration [86, 3, 234, 152]. FPGAs are widely used by researchers due to long ASIC design cycles. Neural network libraries are available from Xilinx [128] and Intel [69]. Specific neural network accelerators are also being integrated into FPGA fabrics [248, 4, 203]. Because FPGAs operate at the gate level, they are often used in low-bit width and binary neural networks [178, 267, 197].

Neural network specific optimizations are typically incorporated into custom ASIC hardware. Lookup tables can be used to accelerate trigonometric activation functions [46] or directly generate results for low bit-width arithmetic [65], partial products can be stored in special registers and reused [38], and memory access ordering with specialized addressing hardware can all reduce the number of cycles to compute a neural network output [126]. Hardware accelerators are not the primary focus of this paper. However, we do note hardware implementations that incorporate specific acceleration techniques. Further background information on efficient processing and hardware implementations of DNNs can be found in [225].

We summarize our main contributions as follows:

• We provide a review of two network compression techniques: pruning and quantization. We discuss methods of compression, mathematical formulations, and compare current State-Of-The-Art (SOTA) compression methods.

• We classify pruning techniques into static and dynamic methods, depending on whether they are done offline or at run-time, respectively.

• We analyze and quantitatively compare quantization techniques and frameworks.

• We provide practical guidance on quantization and pruning.


This paper focuses primarily on network optimization for convolutional neural networks. It is organized as follows: In Section 2 we give an introduction to neural networks and specifically convolutional neural networks. We also describe some of the network optimizations of convolutions. In Section 3 we describe both static and dynamic pruning techniques. In Section 4 we discuss quantization and its effect on accuracy. We also compare quantization libraries and frameworks. We then present quantized accuracy results for a number of common networks. We present conclusions and provide guidance on appropriate application use in Section 5. Finally, we present concluding comments in Section 7.

2. Convolutional Neural Network

Convolutional neural networks are a class of feed-forward DNNs that use convolution operations to extract features from a data source. CNNs have been most successfully applied to visual-related tasks; however, they have also found use in natural language processing [95], speech recognition [2], recommendation systems [214], malware detection [223], and industrial sensor time series prediction [261]. To provide a better understanding of optimization techniques, in this section we introduce the two phases of CNN deployment - training and inference, discuss types of convolution operations, describe Batch Normalization (BN) as an acceleration technique for training, describe pooling as a technique to reduce complexity, and describe the exponential growth in parameters deployed in modern network structures.

2.1. Definitions

This section summarizes terms and definitions used to describe neural networks as well as the acronyms collected in Table 1.

• Coefficient - A constant by which an algebraic term is multiplied. Typically, a coefficient is multiplied by the data in a CNN filter.

• Parameter - All the factors of a layer, including coefficients and biases.

• Hyperparameter - A parameter predefined before network training or fine-tuning (re-training).

• Activation (A ∈ ℝ^(h×w×c)) - The activated (e.g., ReLU, Leaky, Tanh, etc.) output of one layer in a multi-layer network architecture, typically with height h, width w, and channel c. The h × w matrix is sometimes called an activation map. We also denote the activation as the output (O) when the activation function does not matter.

• Feature (F ∈ ℝ^(h×w×c)) - The input data of one layer, to distinguish it from the output A. Generally the feature for the current layer is the activation of the previous layer.

• Kernel (k ∈ ℝ^(k1×k2)) - Convolutional coefficients for a channel, excluding biases. Typically they are square (e.g. k1 = k2) and sized 1, 3, or 7.

• Filter (w ∈ ℝ^(k1×k2×c×n)) - Comprises all of the kernels corresponding to the c channels of input features. The number of filters, n, results in different output channels.

• Weights - Two common uses: 1) kernel coefficients when describing part of a network, and 2) all the trained parameters in a neural network model when discussing the entire network.

2.2. Training and Inference

CNNs are deployed as a two-step process: 1) training and 2) inference. Training is performed first with the result being either a continuous numerical value (regression) or a discrete class label (classification). Classification training involves applying a given annotated dataset as an input to the CNN, propagating it through the network, and comparing the output classification to the ground-truth label. The network weights are then updated, typically using a backpropagation strategy such as Stochastic Gradient Descent (SGD), to reduce classification errors. This performs a search for the best weight values. Backpropagation is performed iteratively until a minimum acceptable error is reached or no further reduction in error is achieved. Backpropagation is compute intensive and traditionally performed in data centers that take advantage of dedicated GPUs or specialized training accelerators such as TPUs.

Fine-tuning is defined as retraining a previously trained model. It is easier to recover the accuracy of a quantized or pruned model with fine-tuning than with training from scratch.

CNN inference classification takes a previously trained classification model and predicts the class of input data not in the training dataset. Inference is not as computationally intensive as training and can be executed on edge, mobile, and embedded devices. The size of the inference network executing on mobile devices may be limited due to memory, bandwidth, or processing constraints [79]. Pruning, discussed in Section 3, and quantization, discussed in Section 4, are two techniques that can alleviate these constraints.

In this paper, we focus on the acceleration of CNN inference classification. We compare techniques using standard benchmarks such as ImageNet [122], CIFAR [132], and MNIST [139]. The compression techniques are general and the choice of application domain does not restrict their use in object detection, natural language processing, etc.

2.3. Convolution Operations

The top of Figure 2 shows a 3-channel image (e.g., RGB) as input to a convolutional layer. Because the input image has 3 channels, the convolution kernel must also have 3 channels. In this figure four 2 × 2 × 3 convolution filters are shown, each consisting of three 2 × 2 kernels. Data is received from all 3 channels simultaneously. 12 image values are multiplied with the kernel weights producing a single output. The kernel is moved across the 3-channel image sharing the 12 weights.
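A minimal naive convolution in NumPy using the same shapes as the example above (a 12 × 12 × 3 input, four filters of three 2 × 2 kernels, stride 1, no padding); it is a reference sketch of the sliding-window computation, not an optimized implementation.

```python
import numpy as np

def conv2d_direct(image, filters, bias):
    """Naive convolution: image (H, W, C), filters (n, kh, kw, C), bias (n,).
    Returns activations of shape (H - kh + 1, W - kw + 1, n)."""
    H, W, C = image.shape
    n, kh, kw, _ = filters.shape
    out = np.zeros((H - kh + 1, W - kw + 1, n))
    for f in range(n):                        # each filter produces one output channel
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                patch = image[i:i + kh, j:j + kw, :]     # 2x2x3 window
                out[i, j, f] = np.sum(patch * filters[f]) + bias[f]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12, 3))          # 3-channel input
filters = rng.normal(size=(4, 2, 2, 3))       # four 2x2x3 filters (12 weights each)
out = conv2d_direct(image, filters, bias=np.zeros(4))
print(out.shape)                              # (11, 11, 4): four 11x11 feature maps
```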


Table 1
Acronyms and Abbreviations

2D: Two Dimensional
3D: Three Dimensional
FP16: 16-Bit Floating-Point
FP32: 32-Bit Floating-Point
INT16: 16-Bit Integer
INT8: 8-Bit Integer
IR: Intermediate Representation
OFA: One-For-All
RGB: Red, Green, and Blue
SOTA: State of the Art
AI: Artificial Intelligence
BN: Batch Normalization
CBN: Conditional Batch Normalization
CNN: Convolutional Neural Network
DNN: Deep Neural Network
EBP: Expectation Back Propagation
FCL: Fully Connected Layer
FCN: Fully Connected Networks
FLOP: Floating-Point Operation
GAP: Global Average Pooling
GEMM: General Matrix Multiply
GFLOP: Giga Floating-Point Operations
ILSVRC: ImageNet Large Scale Visual Recognition Challenge
Im2col: Image to Column
KD: Knowledge Distillation
LRN: Local Response Normalization
LSTM: Long Short Term Memory
MAC: Multiply Accumulate
NAS: Network Architecture Search
NN: Neural Network
PTQ: Post Training Quantization
QAT: Quantization Aware Training
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
SGD: Stochastic Gradient Descent
STE: Straight-Through Estimator
ASIC: Application Specific Integrated Circuit
AVX-512: Advanced Vector Extensions 512
CPU: Central Processing Unit
CU: Compute Unit
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
HSA: Heterogeneous System Architecture
ISA: Instruction Set Architecture
PE: Processing Element
SIMD: Single Instruction Multiple Data
SoC: System on Chip
DPP: Determinantal Point Process
FFT: Fast Fourier Transform
FMA: Fused Multiply-Add
KL-divergence: Kullback-Leibler Divergence
LASSO: Least Absolute Shrinkage and Selection Operator
MDP: Markov Decision Process
OLS: Ordinary Least Squares

[Figure 2 spans here: a standard convolution (top) compared with a separable convolution (bottom) decomposed into a depth-wise convolution followed by a point-wise convolution.]

Figure 2: Separable Convolution: A standard convolution is decomposed into depth-wise convolution and point-wise convolution to reduce both the model size and computations.

If the input image is 12 × 12 × 3, the resulting output will be 11 × 11 × 1 (using a stride of 1 and no padding). The filters work by extracting multiple smaller bit maps known as feature maps. If more filters are desired to learn different features they can be easily added. In this case 4 filters are shown, resulting in 4 feature maps.

The standard convolution operation can be computed in parallel using a GEneral Matrix Multiply (GEMM) approach [60]. Figure 3 shows a parallel column approach. The 3D tensors are first flattened into 2D matrices. The resulting matrices are multiplied by the convolutional kernel, which takes each input neuron (features), multiplies it, and generates output neurons (activations) for the next layer [138].

[Figure 3 spans here: a worked numeric example mapping a traditional (dot-product) convolution onto a flattened matrix multiplication.]

Figure 3: Convolution Performance Optimization: From traditional convolution (dot product) to the image to column (im2col) - GEMM approach, adopted from [34]. The red and green boxes indicate filter-wise and shape-wise elements, respectively.

F_n^{l+1} = A_n^l = \mathrm{activate}\left( \sum_{m=1}^{M} W_{mn}^{l} * F_m^{l} + b_n^{l} \right)   (1)

Equation 1 shows the layer-wise mathematical representation of the convolution layer, where W represents the weights (filters) of the tensor with m input channels and n output channels, b_n^l represents the bias vector, and F represents the input feature tensor (typically the activation of the previous layer, A^{l-1}). A^l is the activated convolutional output. The goal of compression is to reduce the size of W and F (or A) without affecting accuracy.
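The following sketch, under the same assumptions as above (stride 1, no padding), flattens input windows into rows in the style of Figure 3 so that the whole layer of Equation 1 reduces to a single matrix multiply. It is a simplified illustration of the im2col/GEMM idea rather than the exact memory layout used by any particular library.

```python
import numpy as np

def im2col(image, kh, kw):
    """Flatten every kh x kw x C window of image (H, W, C) into a row."""
    H, W, C = image.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(image[i:i + kh, j:j + kw, :].reshape(-1))
    return np.stack(rows)                     # ((H-kh+1)*(W-kw+1), kh*kw*C)

def conv2d_gemm(image, filters, bias):
    """Convolution expressed as one GEMM (the im2col approach of Figure 3)."""
    n, kh, kw, C = filters.shape
    cols = im2col(image, kh, kw)              # patches as rows
    W_mat = filters.reshape(n, -1).T          # (kh*kw*C, n)
    out = cols @ W_mat + bias                 # single matrix multiply
    H, Wd, _ = image.shape
    return out.reshape(H - kh + 1, Wd - kw + 1, n)

# Matches the direct sliding-window implementation shown earlier.
rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12, 3))
filters = rng.normal(size=(4, 2, 2, 3))
print(conv2d_gemm(image, filters, np.zeros(4)).shape)   # (11, 11, 4)
```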


[Figure 4 spans here: a fully connected (dense) layer drawn as a bipartite graph.]

Figure 4: Fully Connected Layer: Each node in a layer connects to all the nodes in the next layer, and every line corresponds to a weight value.

Figure 4 shows a FCL - also called a dense layer or dense connect. Every neuron is connected to each other neuron in a crossbar configuration requiring many weights. As an example, if the input and output channels are 1024 and 1000, respectively, the number of parameters in the filter will be over a million (1024 × 1000). As the image size grows or the number of features increases, the number of weights grows rapidly.

2.4. Efficient Structure

The bottom of Figure 2 shows separable convolution as implemented in MobileNet [105]. Separable convolution assembles a depth-wise convolution followed by a point-wise convolution. A depth-wise convolution groups the input feature by channel, and treats each channel as a single input tensor, generating activations with the same number of channels. Point-wise convolution is a standard convolution with 1 × 1 kernels. It extracts mutual information across the channels with minimum computation overhead. For the 12 × 12 × 3 image previously discussed, a standard convolution needs 2 × 2 × 3 × 4 multiplies to generate 1 × 1 outputs. Separable convolution needs only 2 × 2 × 3 for depth-wise convolution and 1 × 1 × 3 × 4 for point-wise convolution. This reduces computations by half, from 48 to 24. The number of weights is also reduced from 48 to 24.
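The 48-to-24 reduction quoted above can be reproduced by simply counting multiplies per output position; the snippet below does that for the stated shapes (2 × 2 kernels, 3 input channels, 4 output channels).

```python
# Multiply counts per output position for the example above
# (2x2 kernels, c_in = 3 input channels, c_out = 4 output channels).
k, c_in, c_out = 2, 3, 4

standard = k * k * c_in * c_out      # one 2x2x3 kernel per output channel
depthwise = k * k * c_in             # one 2x2 kernel per input channel
pointwise = 1 * 1 * c_in * c_out     # 1x1 convolution mixing channels
separable = depthwise + pointwise

print(standard, separable)           # 48 24 -> a 2x reduction
```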

[Figure 5 spans here: an inception block with parallel 1 × 1, 3 × 3, and 5 × 5 convolution branches and a 3 × 3 pooling branch whose outputs are concatenated.]

Figure 5: Inception Block: The inception block computes multiple convolutions with one input tensor in parallel, which extends the receptive field by mixing the size of kernels. The yellow - brown coloured cubes are convolutional kernels sized 1, 3, and 5. The blue cube corresponds to a 3 × 3 pooling operation.

The receptive field is the size of a feature map used in a convolutional kernel. To extract data with a large receptive field and high precision, cascaded layers should be applied as in the top of Figure 5. However, the number of computations can be reduced by expanding the network width with four types of filters as shown in Figure 5. The concatenated result performs better than one convolutional layer with the same computation workload [226].

[Figure 6 spans here: a conventional network block, a residual block with a shortcut connection, and a densely connected block.]

Figure 6: Conventional Network Block (top), Residual Network Block (middle), and Densely Connected Network Block (bottom).

A residual network architecture block [98] is a feed forward layer with a short circuit between layers, as shown in the middle of Figure 6. The short circuit keeps information from the previous block to increase accuracy and avoid vanishing gradients during training. Residual networks help deep networks grow in depth by directly transferring information between deeper and shallower layers.

The bottom of Figure 6 shows the densely connected convolutional block from DenseNets [109]. This block extends both the network depth and the receptive field by delivering the features of former layers to all the later layers in a dense block using concatenation. ResNets transfer outputs from a single previous layer. DenseNets build connections across layers to fully utilize previous features. This provides weight efficiencies.

2.5. Batch Normalization

BN was introduced in 2015 to speed up the training phase and to improve neural network performance [119]. Most SOTA neural networks apply BN after a convolutional layer. BN addresses internal covariate shift (an altering of the network activation distribution caused by modifications to parameters during training) by normalizing layer inputs. This has been shown to reduce training time by up to 14×. Santurkar [210] argues that the efficiency of BN comes from its ability to smooth values during optimization.

y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta   (2)

Equation 2 gives the formula for computing inference BN, where x and y are the input feature and the output of BN, γ and β are learned parameters, μ and σ are the mean value and standard deviation calculated from the training set, and ε is an additional small value (e.g., 1e-6) to prevent the denominator from being 0. The variables of Equation 2 are determined in the training pass and integrated into the trained weights.

If the features in one channel share the same parameters, then BN turns into a linear transform on each output channel. Channel-wise BN parameters potentially help channel-wise pruning. BN can also raise the performance of cluster-based quantization techniques by reducing parameter dependency [48].

Since the parameters of the BN operation are not modified in the inference phase, they may be combined with the trained weights and biases. This is called BN folding or BN merging. Equation 3 shows an example of BN folding. The new weight W′ and bias b′ are calculated using the pretrained weights W and the BN parameters from Equation 2. Since the new weight is computed after training and prior to inference, the number of multiplies is reduced and therefore BN folding decreases inference latency and computational complexity.

W' = \gamma \cdot \frac{W}{\sqrt{\sigma^2 + \epsilon}}, \qquad b' = \gamma \cdot \frac{b - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta   (3)
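A minimal NumPy sketch of the BN folding in Equation 3, assuming per-output-channel BN parameters for a convolutional layer; the check at the end verifies that the folded convolution matches convolution followed by inference BN.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-6):
    """Fold inference-time BN into the preceding convolution (Equation 3).
    W: conv weights (n_out, c_in, kh, kw); b: conv bias (n_out,);
    gamma, beta, mean, var: per-output-channel BN parameters (n_out,)."""
    scale = gamma / np.sqrt(var + eps)                 # per-channel multiplier
    W_folded = W * scale[:, None, None, None]          # scale each output filter
    b_folded = (b - mean) * scale + beta               # new bias
    return W_folded, b_folded

# Check: conv followed by BN equals the folded conv on a dummy input patch.
rng = np.random.default_rng(0)
n_out, c_in = 4, 3
W = rng.normal(size=(n_out, c_in, 3, 3))
b = rng.normal(size=n_out)
gamma, beta = rng.normal(size=n_out), rng.normal(size=n_out)
mean, var = rng.normal(size=n_out), rng.uniform(0.5, 2.0, size=n_out)
Wf, bf = fold_batchnorm(W, b, gamma, beta, mean, var)

x = rng.normal(size=(c_in, 3, 3))                      # one 3x3 input patch
y_conv = np.tensordot(W, x, axes=3) + b                # raw conv output per channel
y_bn = gamma * (y_conv - mean) / np.sqrt(var + 1e-6) + beta
y_folded = np.tensordot(Wf, x, axes=3) + bf
print(np.allclose(y_bn, y_folded))                     # True
```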

2.6. Pooling

Pooling was first published in the 1980s with the neocognitron [71]. The technique takes a group of values and reduces them to a single value. The selection of the single replacement value can be computed as an average of the values (average pooling) or simply by selecting the maximum value (max pooling).

Pooling destroys spatial information as it is a form of down-sampling. The window size defines the area of values to be pooled. For image processing it is usually a square window with typical sizes being 2 × 2, 3 × 3 or 4 × 4. Small windows allow enough information to be propagated to successive layers while reducing the total number of computations [224].

Global pooling is a technique where, instead of reducing a neighborhood of values, an entire feature map is reduced to a single value [154]. Global Average Pooling (GAP) extracts information from multi-channel features and can be used with dynamic pruning [153, 42].

Capsule structures have been proposed as an alternative to pooling. Capsule networks replace the scalar neuron with vectors. The vectors represent a specific entity with more detailed information, such as the position and size of an object. Capsule networks avoid the loss of spatial information by capturing it in the vector representation. Rather than reducing a neighborhood of values to a single value, capsule networks perform a dynamic routing algorithm to remove connections [209].

2.7. Parameters

Figure 7 shows top-1 accuracy percentage versus the number of operations needed for a number of popular neural networks [23]. The number of parameters in each network is represented by the size of the circle. A trend (not shown in the figure) is a yearly increase in parameter complexity. In 2012, AlexNet [133] was published with 60 million parameters. In 2013, VGG [217] was introduced with 133 million parameters and achieved 71.1% top-1 accuracy. These were part of the ImageNet large scale visual recognition challenge (ILSVRC) [207]. The competition's metric was top-1 absolute accuracy. Execution time was not a factor. This incentivized neural network designs with significant redundancy. As of 2020, models with more than 175 billion parameters have been published [26].

[Figure 7 spans here: a scatter plot of popular CNN models.]

Figure 7: Popular CNN Models: Top-1 accuracy vs GFLOPs and model size, adopted from [23].

Networks that execute in data centers can accommodate models with a large number of parameters. In resource constrained environments such as edge and mobile deployments, reduced parameter models have been designed. For example, GoogLeNet [226] achieves a top-1 accuracy of 69.78%, similar to VGG-16, but with only 7 million parameters. MobileNet [105] has 70% top-1 accuracy with only 4.2 million parameters and only 1.14 Giga FLoating-point OPerations (GFLOPs). A more detailed network comparison can be found in [5].

3. Pruning

Network pruning is an important technique for both memory size and bandwidth reduction. In the early 1990s, pruning techniques were developed to reduce a trained large network into a smaller network without requiring retraining [201]. This allowed neural networks to be deployed in constrained environments such as embedded systems. Pruning removes redundant parameters or neurons that do not significantly contribute to the accuracy of results. This condition may arise when the weight coefficients are zero, close to zero, or are replicated. Pruning consequently reduces the computational complexity. If pruned networks are retrained, this provides the possibility of escaping a previous local minimum [43] and further improving accuracy.

Research on network pruning can roughly be categorized into sensitivity calculation and penalty-term methods [201]. Significant recent research interest has continued, showing improvements for both network pruning categories or a further combination of them.


[Figure 8 spans here: the static pruning flow (network model, target locating, training/tuning, pruning) contrasted with the dynamic pruning flow (network model, runtime pruning strategy, pruning decision components).]

Figure 8: Pruning Categories: Static pruning is performed offline prior to inference while Dynamic pruning is performed at runtime.

Recently, new network pruning techniques have been created. Modern pruning techniques may be classified by various aspects including: 1) structured and unstructured pruning, depending on whether the pruned network is symmetric or not, 2) neuron and connection pruning, depending on the pruned element type, or 3) static and dynamic pruning. Figure 8 shows the processing differences between static and dynamic pruning. Static pruning has all pruning steps performed offline prior to inference while dynamic pruning is performed during runtime. While there is overlap between the categories, in this paper we will use static pruning and dynamic pruning for the classification of network pruning techniques.

Figure 9 shows the granularity of pruning opportunities. The four rectangles on the right side correspond to the four brown filters in the top of Figure 2. Pruning can occur on an element-by-element, row-by-row, column-by-column, filter-by-filter, or layer-by-layer basis. Typically element-by-element pruning has the smallest sparsity impact and results in an unstructured model. Sparsity decreases from left-to-right in Figure 9.

[Figure 9 spans here: pruned structures at increasing granularity: element-wise, channel-wise, shape-wise, filter-wise, and layer-wise.]

Figure 9: Pruning Opportunities: Different network sparsity results from the granularity of pruned structures. Shape-wise pruning was proposed by Wen [241].

Independent of categorization, pruning can be described mathematically as in Equation 4. N represents the entire neural network, which contains a series of layers (e.g., convolutional layer, pooling layer, etc.) with x as input. L_p represents the performance loss of the pruned network N_p compared to the unpruned network. Network performance is typically defined as accuracy in classification. The pruning function, P(⋅), results in a different network configuration N_p along with the pruned weights W_p. The following sections are primarily concerned with the influence of P(⋅) on N_p. We also consider how to obtain W_p.

\arg\min_{p} L_p = N(x; W) - N_p(x; W_p), \quad \text{where } N_p(x; W_p) = P(N(x; W))   (4)

3.1. Static Pruning

Static pruning is a network optimization technique that removes neurons offline from the network after training and before inference. During inference, no additional pruning of the network is performed. Static pruning commonly has three parts: 1) selection of parameters to prune, 2) the method of pruning the neurons, and 3) optionally fine-tuning or re-training [92]. Retraining may improve the performance of the pruned network to achieve comparable accuracy to the unpruned network, but may require significant offline computation time and energy.

3.1.1. Pruning Criteria

As a result of network redundancy, neurons or connections can often be removed without significant loss of accuracy. As shown in Equation 1, the core operation of a network is a convolution operation. It involves three parts: 1) input features as produced by the previous layer, 2) weights produced from the training phase, and 3) bias values produced from the training phase. The output of the convolution operation may result in either zero valued weights or features that lead to a zero output. Another possibility is that similar weights or features may be produced. These may be merged for distributive convolutions.

An early method to prune networks is brute-force pruning. In this method the entire network is traversed element-wise and weights that do not affect accuracy are removed. A disadvantage of this approach is the large solution space to traverse. A typical metric to determine which values to prune is given by the l_p-norm, where p is a natural number or ∞.

The l_p-norm of a vector x, which consists of n elements, is mathematically described by Equation 5.

1 pruning methods is to prune all zero-valued weights or all H n I p É p weights within an absolute value threshold. x = óx ó ‖ ‖p ó ió (5) LeCun as far back as 1990 proposed Optimal Brain Dam- i=1 age (OBD) to prune single non-essential weights [140]. By using the second derivative (Hessian matrix) of the loss func- Among the widely applied measurements, the l1-norm tion, this static pruning technique reduced network param- is also known as the Manhattan norm and the l2-norm is eters by a quarter. For a simplified derivative computation, also known as the Euclidean norm. The corresponding l1 OBD functions under three assumptions: 1) quadratic - the and l2 regularization have the names LASSO (least absolute cost function is near-quadratic, 2) extremal - the pruning is shrinkage and selection operator) and Ridge, respectively done after the network converged, and 3) diagonal - sums [230]. The difference between the l2-norm pruned tensor up the error of individual weights by pruning the result of and an unpruned tensor is called the l2-distance. Sometimes the error caused by their co-consequence. This research also researchers also use the term l0-norm defined as the total suggested that the sparsity of DNNs could provide opportuni- number of nonzero elements in a vector. ties to accelerate network performance. Later Optimal Brain Surgeon (OBS) [97] extended OBD with a similar second- diagonal ⎧ N H p I2⎫ order method but removed the assumption in OBD. ⎪É É ⎪ OBS considers the Hessian matrix is usually non-diagonal arg min yi − − jxij , ⎨ ⎬ ⎪i=1 j=1 ⎪ for most applications. OBS improved the neuron removal ⎩ ⎭ (6) precision with up to a 90% reduction in weights for XOR p networks. É ó ó subject to ó jó ⩽ t ó ó These early methods reduced the number of connections j based on the second derivative of the loss function. The training procedure did not consider future pruning but still re- Equation Equation 6 mathematically describes l2 LASSO sulted in networks that were amenable to pruning. They also regularization. Consider a sample consisting of N cases, each suggested that methods based on Hessian pruning would ex- of which consists of p covariates and a single outcome yi. i T hibit higher accuracy than those pruned with only magnitude- Let x = (xi1, ..., xip) be the standardized covariate vec- based algorithms [97]. More recent DNNs exhibit larger tor for the i-th case (input feature in DNNs), so we have ∑ ∑ 2 weight values when compared to early DNNs. Early DNNs i xij∕N = 0, i xij∕N = 1. represents the coefficients T were also much shallower with orders of magnitude less neu- = ( 1, ..., p) (weights) and t is a predefined tunning pa- rons. GPT-3 [26], for example, contains 175-billion param- rameter that determines the sparsity. The LASSO estimate eters while VGG-16 [217] contains just 133-million param- yi t is 0 when the average of is 0 becausep for all , the solution eters. Calculating the Hessian matrix during training for = y ∑ 2 ⩽ t for is . If the constraint is j j then the Equa- networks with the complexity of GPT-3 is not currently fea- 2 tion 6 becomes Ridge regression. Removing the constraint sible as it has the complexity of O(W ). Because of this will results in the Ordinary Least Squares (OLS) solution. simpler magnitude-based algorithms have been developed [177, 141]. $ 1 2 % Filter-wise pruning [147] uses the l1-norm to remove arg min ‖y − X ‖ +  ‖ ‖1 ∈ℝ N 2 (7) filters that do not affect the accuracy of the classification. 
Pruning entire filters and their related feature maps resulted Equation 6 can be simplified into the so-called Lagrangian in a reduced inference cost of 34% for VGG-16 and 38% for form shown in Equation 7. The Lagrangian multiplier trans- ResNet-110 on the CIFAR-10 dataset with improved accuracy f(x) g(x) = 0 lates the objective function and constraint into 0.75% and 0.02%, respectively. (x, ) = f(x) − g(x) ⋅ the format of  , Where the ‖ ‖p is the Most network pruning methods choose to measure weights l X standard p-norm, the is the covariate matrix that contains rather than activations when rating the effectiveness of prun- x  t ij, and is the data dependent parameter related to from ing [88]. However, activations may also be an indicator to Equation 6. prune corresponding weights. Average Percentage Of Zeros Both magnitude-based pruning and penalty based pruning (APoZ) [106] was introduced to judge if one output activa- may generate zero values or near-zero values for the weights. tion map is contributing to the result. Certain activation In this section we discuss both methods and their impact. functions, particularly rectification such as Rectified Linear Magnitude-based pruning: Unit (ReLU), may result in a high percentage of zeros in It has been proposed and is activations and thus be amenable to pruning. Equation 8 widely accepted that trained weights with large values are (i) shows the definition of APoZc of the c-th neuron in the i-th more important than trained weights with smaller values (i) layer, where Oc denotes the activation, N is the number of [143]. This observation is the key to magnitude-based meth- M ods. Magnitude-based pruning methods seek to identify un- calibration (validation) images, and is the dimension of needed weights or features to remove them from runtime eval- uation. Unneeded values may be pruned either in the kernel
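As a concrete illustration of this kind of magnitude-based criterion, the sketch below ranks filters by their l_1-norm and removes the weakest ones; the pruning fraction is an illustrative choice, not a value prescribed by [147].

```python
import numpy as np

def smallest_l1_filters(W, prune_fraction=0.3):
    """Rank convolution filters by l1-norm and return indices to prune.
    W: layer weights of shape (n_filters, c_in, kh, kw)."""
    l1 = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)   # per-filter l1-norm
    n_prune = int(prune_fraction * W.shape[0])
    return np.argsort(l1)[:n_prune]                      # weakest filters first

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))
drop = smallest_l1_filters(W, prune_fraction=0.25)
W_pruned = np.delete(W, drop, axis=0)                    # remove whole filters
print(W.shape, "->", W_pruned.shape)                     # (64, 32, 3, 3) -> (48, 32, 3, 3)
```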


Most network pruning methods choose to measure weights rather than activations when rating the effectiveness of pruning [88]. However, activations may also be an indicator to prune corresponding weights. Average Percentage Of Zeros (APoZ) [106] was introduced to judge if one output activation map is contributing to the result. Certain activation functions, particularly rectification such as the Rectified Linear Unit (ReLU), may result in a high percentage of zeros in activations and thus be amenable to pruning. Equation 8 shows the definition of APoZ_c^{(i)} of the c-th neuron in the i-th layer, where O_c^{(i)} denotes the activation, N is the number of calibration (validation) images, M is the dimension of the activation map, f(true) = 1, and f(false) = 0.

\mathrm{APoZ}_c^{(i)} = \mathrm{APoZ}\left(O_c^{(i)}\right) = \frac{\sum_{k=0}^{N} \sum_{j=0}^{M} f\left(O_{c,j}^{(i)}(k) = 0\right)}{N \times M}   (8)
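A small NumPy sketch of the APoZ criterion of Equation 8, computed over a batch of post-ReLU activations; the 90% threshold used here to select channels is illustrative only.

```python
import numpy as np

def apoz(activations):
    """Average Percentage of Zeros (Equation 8) per output channel.
    activations: post-ReLU outputs of shape (N_images, M_positions, C_channels)."""
    zeros = (activations == 0).astype(np.float64)
    return zeros.mean(axis=(0, 1))             # fraction of zeros for each channel

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(100, 14 * 14, 64)), 0.0)   # ReLU activations
scores = apoz(acts)
prune_channels = np.where(scores > 0.9)[0]     # e.g., prune channels that are >90% zero
print(scores.shape, prune_channels.shape)
```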

Similarly, inbound pruning [195], also an activation technique, considers channels that do not contribute to the result. If the top activation channels in the standard convolution of Figure 2 are determined to be less contributing, the corresponding channels of the filters in the bottom of the figure will be removed. After pruning, this technique achieved about 1.5× compression.

Filter-wise pruning using a threshold on the sum of the filters' absolute values can directly take advantage of the structure in the network. In this way, the ratio of pruned to unpruned neurons (i.e. the pruning ratio) is positively correlated with the percentage of kernel weights with zero values, which can be further improved by penalty-based methods.

Penalty-based pruning: In penalty-based pruning, the goal is to modify an error function or add other constraints, known as bias terms, in the training process. A penalty value is used to update some weights to zero or near zero values. These values are then pruned.

Hanson [96] explored hyperbolic and exponential bias terms for pruning in the late 80s. This method uses weight decay in backpropagation to determine if a neuron should be pruned. Low-valued weights are replaced by zeros. Residual zero valued weights after training are then used to prune unneeded neurons.

Feature selection [55] is a technique that selects a subset of relevant features that contribute to the result. It is also known as attribute selection or variable selection. Feature selection helps algorithms avoid over-fitting and accelerates both training and inference by removing features and/or connections that do not contribute to the results. Feature selection also aids model understanding by simplifying models to their most important features. Pruning in DNNs can be considered to be a form of feature selection [123].

LASSO was previously introduced as a penalty term. LASSO shrinks the weights corresponding to the least absolute valued features. This increases weight sparsity. This operation is also referred to as LASSO feature selection and has been shown to perform better than traditional procedures such as OLS by selecting the most significantly contributing variables instead of using all the variables. This leads to approximately 60% more sparsity than OLS [181].

Element-wise pruning may result in an unstructured network organization. This leads to sparse weight matrices that are not efficiently executed on instruction set processors. In practice they are usually hard to compress or accelerate without specialized hardware support [91]. Group LASSO [260] mitigates these inefficiencies by using a structured pruning method that removes entire groups of neurons while maintaining structure in the network organization [17]. Group LASSO is designed to ensure that all the variables sorted into one group are either included or excluded as a whole. Equation 9 gives the pruning constraint, where X and β in Equation 7 are replaced by the higher dimensional X_j and β_j for the J groups.

\arg\min_{\beta \in \mathbb{R}^p} \left\{ \left\| y - \sum_{j=1}^{J} X_j \beta_j \right\|_2^2 + \lambda \sum_{j=1}^{J} \| \beta_j \|_{K_j} \right\}   (9)

[Figure 10 spans here: sparsity geometries for a convolutional weight tensor: channel-wise, shape-wise, filter-wise, and depth-wise groups, plus a shortcut group.]

Figure 10: Types of Sparsity Geometry, adopted from [241].

Figure 10 shows Group LASSO with the group shapes used in Structured Sparsity Learning (SSL) [241]. Weights are split into multiple groups. Unneeded groups of weights are removed using LASSO feature selection. Groups may be determined based on geometry, computational complexity, group sparsity, etc. SSL describes an example where group sparsity in row and column directions may be used to reduce the execution time of GEMM. SSL has shown improved inference times on AlexNet with both CPUs and GPUs, by 5.1× and 3.1×, respectively.

Group-wise brain damage [136] also introduced the group LASSO constraint but applied it to filters. This simulates brain damage and introduces sparsity. It achieved a 2× speedup with 0.7% ILSVRC-2012 accuracy loss on the VGG network.

Sparse Convolutional Neural Networks (SCNN) [17] take advantage of a two-stage tensor decomposition. By decomposing the input feature map and convolutional kernels, the tensors are transformed into two tensor multiplications. Group LASSO is then applied. SCNN also proposed a hardware friendly algorithm to further accelerate sparse matrix computations. They achieved 2.47× to 6.88× speed-up on various types of convolution.

Network slimming [158] applies LASSO to the scaling factors of BN. BN normalizes the activations using statistical parameters which are obtained during the training phase. Network slimming has the effect of introducing forward invisible additional parameters without additional overhead. Specifically, by setting the BN scaler parameter to zero, channel-wise pruning is enabled. They achieved an 82.5% size reduction with VGG and 30.4% computation compression without loss of accuracy on ILSVRC-2012.

Sparse structure selection [111] is a generalized network slimming method. It prunes by applying LASSO to sparse scaling factors in neurons, groups, or residual blocks. Using an improved gradient method, Accelerated Proximal Gradient (APG), the proposed method shows better performance without fine-tuning, achieving 4× speed-up on VGG-16 with 3.93% ILSVRC-2012 top-1 accuracy loss.
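The penalty term of Equation 9 can be sketched for convolutional weights as a sum of per-group l_2-norms. The grouping choices below (per filter or per input channel) mirror two of the geometries in Figure 10; the snippet is an illustration, not SSL's [241] exact training recipe, and lam is an assumed regularization weight.

```python
import numpy as np

def group_lasso_penalty(W, grouping="filter"):
    """Group LASSO regularizer for structured sparsity (cf. Equation 9 and SSL [241]).
    W: conv weights (n_filters, c_in, kh, kw). Returns the sum of per-group l2-norms."""
    if grouping == "filter":                   # one group per output filter
        groups = W.reshape(W.shape[0], -1)
    elif grouping == "channel":                # one group per input channel
        groups = W.transpose(1, 0, 2, 3).reshape(W.shape[1], -1)
    else:
        raise ValueError(grouping)
    return np.sqrt((groups ** 2).sum(axis=1)).sum()

# During training the penalty would be added to the task loss, e.g.
#   total_loss = task_loss + lam * group_lasso_penalty(W, "filter")
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))
print(group_lasso_penalty(W, "filter"), group_lasso_penalty(W, "channel"))
```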


Dropout: While not specifically a technique to prune networks, dropout does reduce the number of parameters [222]. It was originally designed as a stochastic regularizer to avoid over-fitting of data [103]. The technique randomly omits a percentage of neurons, typically up to 50%. This dropout operation breaks off part of the connections between neurons to avoid co-adaptations. Dropout can also be regarded as an operation that separately trains many sub-networks and takes the average of them during the inference phase. Dropout increases training overhead but it does not affect the inference time.

Sparse variational dropout [176] added a dropout hyperparameter called the dropout rate to reduce the weights of VGG-like networks by 68×. During training the dropout rate can be used to identify single weights to prune. This can also be applied with other compression approaches for further reduction in weights.

Redundancies: The goal of norm-based pruning algorithms is to remove zeros. This implies that the distribution of values should be wide enough to retain some values but contain enough values close to zero such that a smaller network organization is still accurate. This does not hold in some circumstances. For example, filters that have small norm deviations or a large minimum norm have small search spaces, making it difficult to prune based on a threshold [100]. Even when parameter values are wide enough, in some networks smaller values may still play an important role in producing results. One example of this is when large valued parameters saturate [64]. In these cases magnitude-based pruning of zero values may decrease result accuracy.

Similarly, penalty-based pruning may cause network accuracy loss. In this case the filters identified as unneeded due to similar coefficient values in other filters may actually be required. Removing them may significantly decrease network accuracy [88]. Section 3.1.2 describes techniques to undo pruning by tuning the weights to minimize network loss, while this section describes redundancy based pruning.

Using BN parameters, feature map channel distances can be computed by layer [266]. Using a clustering approach for distance, nearby features can be tuned. An advantage of clustering is that redundancy is not measured with an absolute distance but a relative value. With about 60 epochs of training they were able to prune the network, resulting in a 50% reduction in FLOPs (including non-convolutional operations) with a reduction in accuracy of only 1% for both top-1 and top-5 on the ImageNet dataset.

Filter pruning via geometric median (FPGM) [100] identifies filters to prune by measuring the l_2-distance using the geometric median. FPGM found a 42% FLOPs reduction with a 0.05% top-1 accuracy drop on ILSVRC-2012 with ResNet-101.

The reduce and reuse (also described as outbound) method [195] prunes entire filters by computing the statistical variance of each filter's output using a calibration set. Filters with low variance are pruned. The outbound method obtained 2.37× acceleration with 1.52% accuracy loss on the Labeled Faces in the Wild (LFW) dataset [110] in the field of face recognition.

A method that iteratively removes redundant neurons from FCLs without requiring special validation data is proposed in [221]. This approach measures the similarity of weight groups after a normalization. It removes redundant weights and merges the weights into a single value. This led to a 34.89% reduction of FCL weights on AlexNet with 2.24% top-1 accuracy loss on ILSVRC-2012.

Compared with the similarity based approach above, DIVersity NETworks (DIVNET) [167] considers the calculation redundancy based on the activations. DIVNET introduces the Determinantal Point Process (DPP) [166] as a pruning tool. DPP sorts neurons into categories including dropped and retained. Instead of forcing the removal of elements with low contribution factors, they fuse the neurons by a process named re-weighting. Re-weighting works by minimizing the impact of neuron removal. This minimizes pruning influence and mitigates network information loss. They found a 3% loss on the CIFAR-10 dataset when compressing the network to half of its weights.

ThiNet [164] adopts statistical information from the next layer to determine the importance of filters. It uses a greedy search to prune the channel that has the smallest reconstruction cost in the next layer. ThiNet prunes layer-by-layer instead of globally to minimize large errors in classification accuracy. It also prunes less during each training epoch to allow for coefficient stability. The pruning ratio is a predefined hyper-parameter and the runtime complexity is directly related to the pruning ratio. ThiNet compressed ResNet-50 FLOPs to 44.17% with a top-1 accuracy reduction of 1.87%.

He [101] adopts LASSO regression instead of a greedy algorithm to estimate the channels. Specifically, in one iteration, the first step is to evaluate the most important channel using the l_1-norm. The next step is to prune the corresponding channel that has the smallest Mean Square Error (MSE). Compared to an unpruned network, this approach obtained 2× acceleration of ResNet-50 on ILSVRC-2012 with about 1.4% accuracy loss on top-5, and a 4× reduction in execution time with a top-5 accuracy loss of 1.0% for VGG-16. The authors categorize their approach as dynamic inference-time channel pruning. However, it requires 5000 images for calibration with 10 samples per image and, more importantly, results in a statically pruned network. Thus we have placed it under static pruning.


3.1.2. Pruning combined with Tuning or Retraining

Pruning removes network redundancies and has the benefit of reducing the number of computations without significant impact on accuracy for some network architectures. However, as the estimation criterion is not always accurate, some important elements may be eliminated, resulting in a decrease in accuracy. Because of the loss of accuracy, time-consuming fine-tuning or re-training may be employed to increase accuracy [258].

Deep compression [92], for example, describes a static method to prune connections that do not contribute to classification accuracy. In addition to feature map pruning, they also remove weights with small values. After pruning they re-train the network to improve accuracy. This process is performed iteratively three times resulting in a 9× to 13× reduction in total parameters with no loss of accuracy. Most of the removed parameters were from FCLs.

Recoverable Pruning: Pruned elements usually cannot be recovered. This may result in reduced network capability. Recovering lost network capability requires significant retraining. Deep compression required millions of iterations to retrain the network [92]. To avoid this shortcoming, many approaches adopt recoverable pruning algorithms. The pruned elements may also be involved in the subsequent training process and adjust themselves to fit the pruned network.

Guo [88] describes a recoverable pruning method using binary mask matrices to indicate whether a single weight value is pruned or not. The l_1-norm pruned weights can be stochastically spliced back into the network. Using this approach AlexNet was able to be reduced by a factor of 17.7× with no accuracy loss. Re-training iterations were significantly reduced, to 14.58% of those of Deep compression [92]. However, this type of pruning still results in an asymmetric network, complicating hardware implementation.

Soft Filter Pruning (SFP) [99] further extended recoverable pruning using the filter dimension. SFP obtained structured compression results with the additional benefit of reduced inference time. Furthermore, SFP can be used on difficult to compress networks, achieving a 29.8% speed-up on ResNet-50 with 1.54% ILSVRC-2012 top-1 accuracy loss. Compared with Guo's recoverable weight [88] technique, SFP achieves inference speed-ups closer to theoretical results on general purpose hardware by taking advantage of the structure of the filter.

Increasing Sparsity: Another motivation to apply fine-tuning is to increase network sparsity. Sparse constraints [270] applied low rank tensor constraints [157] and group sparsity [57], achieving a 70% reduction of neurons with a 0.57% drop of AlexNet ILSVRC-2012 top-1 accuracy.

Adaptive Sparsity: No matter what kind of pruning criterion is applied, a layer-wise pruning ratio usually requires a human decision. Too high a ratio resulting in very high sparsity may cause the network to diverge, requiring heavy re-tuning.

Network slimming [158], previously discussed, addresses this problem by automatically computing layer-wise sparsity. This achieved a 20× model size compression, a 5× computing reduction, and less than 0.1% accuracy loss on the VGG network.

Pruning can also be performed using a min-max optimization module [218] that maintains network accuracy during tuning by keeping a pruning ratio. This technique compressed the VGG network by a factor of 17.5× and resulted in a theoretical execution time (FLOPs) of 15.56% of the unpruned network. A similar approach was proposed with an estimation of weight sets [33]. By avoiding the use of a greedy search to keep the best pruning ratio, they achieved the same ResNet classification accuracy with only 5% to 10% of the size of the original weights.

AutoPruner [163] integrated the pruning and fine-tuning of a three-stage pipeline as an independent training-friendly layer. The layer helped gradually prune during training, eventually resulting in a less complex network. AutoPruner pruned 73.59% of compute operations on VGG-16 with 2.39% ILSVRC-2012 top-1 loss. ResNet-50 resulted in 65.80% of compute operations with 3.10% loss of accuracy.

Training from Scratch: Observation shows that network training efficiency and accuracy are inversely proportional to structure sparsity. The more dense the network, the less training time [94, 147, 70]. This is one reason that current pruning techniques tend to follow a train-prune-tune pipeline rather than training a pruned structure from scratch.

However, the lottery ticket hypothesis [70] shows that it is not of primary importance to preserve the original weights but the initialization. Experiments show that dense, randomly-initialized pruned sub-networks can be trained effectively and reach comparable accuracy to the original network with the same number of training iterations. Furthermore, standard pruning techniques can uncover the aforementioned sub-networks from a large oversized network - the Winning Tickets. In contrast with current static pruning techniques, the lottery ticket hypothesis after a period of time drops all well-trained weights and resets them to an initial random state. This technique found that ResNet-18 could maintain comparable performance with a pruning ratio of up to 88.2% on the CIFAR-10 dataset.

Towards Better Accuracy: By reducing the number of network parameters, pruning techniques can also help to reduce over-fitting. Dense-Sparse-Dense (DSD) training [93] helps various networks improve classification accuracy by 1.1% to 4.3%. DSD uses a three stage pipeline: 1) dense training to identify important connections, 2) pruning insignificant weights and sparse training with a sparsity constraint to reduce the number of parameters, and 3) re-densifying the structure to recover the original symmetric structure, which also increases the model capacity. The DSD approach has also shown impressive performance on other types of deep networks such as Recurrent Neural Networks (RNNs) and Long Short Term Memory networks (LSTMs).

3.2. Dynamic Pruning

Except for recoverable techniques, static pruning permanently destroys the original network structure, which may lead to a decrease in model capability. Techniques have been researched to recover lost network capabilities, but once pruned and re-trained, the static pruning approach cannot recover destroyed information. Additionally, observations show that the importance of neuron binding is input-dependent [73].

Dynamic pruning determines at runtime which layers, channels, or neurons will not participate in further activity. Dynamic pruning can overcome limitations of static pruning by taking advantage of changing input data, potentially reducing computation, bandwidth, and power dissipation.
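As a toy illustration of such a runtime decision, the sketch below thresholds a per-channel activation statistic to decide, per input, which channels of the next layer to compute. The statistic (mean absolute activation) and the keep ratio are assumptions for illustration and are not taken from a specific cited method.

```python
import numpy as np

def runtime_channel_mask(activations, keep_ratio=0.5):
    """Decide at inference time which channels participate in the next layer.
    activations: (C, H, W) feature map for one input. Channels with the largest
    mean |activation| are kept; the rest are skipped for this input only."""
    scores = np.abs(activations).mean(axis=(1, 2))        # per-channel saliency
    n_keep = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[::-1][:n_keep]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask                                           # True = compute, False = skip

rng = np.random.default_rng(0)
feat = np.maximum(rng.normal(size=(64, 14, 14)), 0.0)     # one input's feature map
mask = runtime_channel_mask(feat, keep_ratio=0.25)
print(mask.sum(), "of", mask.size, "channels computed for this input")
```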

T Liang et al.: Preprint submitted to Elsevier Page 11 of 41

Survey on pruning and quantization

Network Info Info Network Network Decision Data Data Decision Decision Data Network Network Data Network Decision Components 1.Additional connections or side networks? Side 2.Layer-wise pruning or channel-wise? Network 2 5 3.One-shot information input or layer-wise? Pruning 4.How to calculate the score? 1 4 7 Decision Additional Connections 5.Predefined thresholds or dynamical? attached to Network A 6.Continue, skip or exit computing? 7.How to train the decision components? 3

...... Input 6 Image(s) 6 Network A Network B Network C

Cascade Network

Exit Exit Exit Exit Exit

Figure 11: Dynamic Pruning System Considerations

namic pruning typically doesn’t perform runtime fine-tuning ending the computation and outputing the predicting or re-training. In Figure 11, we show an overview of dynamic results [68, 145, 148]. In this case the remaining layers pruning systems. The most important consideration is the de- are considered to be pruned. cision system that decides what to prune. The related issues 7. Training the decision component: a) attached con- are: nections can be trained along with the original net- 1. The type of the decision components: a) additional work [145, 148, 73], b) side networks are typically connections attached to the original network used dur- trained using reinforcement learning (RL) algorithms ing the inference phase and/or the training phase, b) [19, 153, 189, 246]. characteristics of the connections that can be learned For instruction set processors, feature maps or the number by standard backpropagation algorithms [73], and c) a of filters used to identify objects is a large portion of band- side decision network which tends to perform well but width usage [225] - especially for depth-wise or point-wise is often difficult to train [153]. convolutions where features consume a larger portion of the 2. The pruning level (shape): a) channel-wise [153, 73, bandwidth [47]. Dynamic tuning may also be applied to stat- 42], b) layer-wise [145], c) block-wise [246], or d) ically pruned networks potentially further reducing compute network-wise [25]. The pruning level chosen influ- and bandwidth requirements. ences hardware design. A drawback of dynamic pruning is that the criteria to 3. Input data: a) one-shot information feeding [246] feeds determine which elements to prune must be computed at run- the entire input to the decision system, and b) layer- time. This adds overhead to the system requiring additional wise information feeding [25, 68] where a window of compute, bandwidth, and power. A trade-off between dy- data is iteratively fed to the decision system along with namic pruning overhead, reduced network computation, and the forwarding. accuracy loss, should be considered. One method to miti- 4. Computing a decision score: lp-norm [73], or b) other gate power consumption inhibits computations from 0-valued approaches [108]. parameters within a Processing Element (PE) [153]. 5. Score comparison: a) human experience/experiment 3.2.1. Conditional Computing results [145] or b) automatic threshold or dynamic Conditional computing involves activating an optimal mechanisms [108]. part of a network without activating the entire network. Non- 6. Stopping criteria: a) in the case of layer-wise and activated neurons are considered to be pruned. They do network-wise pruning, some pruning algorithms skip not participate in the result thereby reducing the number of the pruned layer/network [19, 246], b) some algorithms computations required. Conditional computing applies to dynamically choose the data path [189, 259], and c)

T Liang et al.: Preprint submitted to Elsevier Page 12 of 41 Survey on pruning and quantization training and inference [20, 56]. RL. The MDP reward function in the state-action-reward Conditional computing has a similarity with RL in that sequence is computation efficiency. Rather than removing they both learn a pattern to achieve a reward. Bengio [19] layers, a side network of RNP predicts which feature maps are split the network into several blocks and formulates the block not needed. They found 2.3× to 5.9× reduction in execution chosen policies as an RL problem. This approach consists time with top-5 accuracy loss from 2.32% to 4.89% for VGG- of only fully connected neural networks and achieved a 5.3× 16. speed-up on CIFAR-10 dataset without loss of accuracy. 3.2.3. Differentiable Adaptive Networks 3.2.2. Reinforcement Learning Adaptive Networks Most of the aforementioned decision components are non- Adaptive networks aim to accelerating network inference differential, thus computationally expensive RL is adopted by conditionally determining early exits. A trade-off be- for training. A number of techniques have been developed to tween network accuracy and computation can be applied reduce training complexity by using differentiable methods. using thresholds. Adaptive networks have multiple interme- Dynamic channel pruning [73] proposes a method to dy- diate classifiers to provide the ability of an early exit. A namically select which channel to skip or to process using cascade network is a type of adaptive network. Cascade net- Feature Boosting and Suppression (FBS). FBS is a side net- works are the combinations of serial networks which all have work that guides channel amplification and omission. FBS is output layers rather than per-layer outputs. Cascade networks trained along with convolutional networks using SGD with have a natural advantage of an early exit by not requiring LASSO constraints. The selecting indicator can be merged all output layers to be computed. If the early accuracy of a into BN parameters. FBS achieved 5× acceleration on VGG- cascade network is not sufficient, inference could potentially 16 with 0.59% ILSVRC-2012 top-5 accuracy loss, and 2× be dispatched to a cloud device [145, 25]. A disadvantage of acceleration on ResNet-18 with 2.54% top-1, 1.46% top-5 adaptive networks is that they usually need hyper-parameters accuracy loss. optimized manually (e.g., confidence score [145]). This intro- Another approach, Dynamic Channel Pruning (DCP) duces automation challenges as well as classification accuracy [42] dynamically prunes channels using a channel thresh- loss. They found 28.75% test error on CIFAR-10 when set- old weighting (T-Weighting) decision. Specifically, this mod- ting the threshold to 0.5. A threshold of 0.99 lowered the ule prunes the channels whose score is lower than a given error to 15.74% at a cost of 3x to inference time. threshold. The score is calculated by a T-sigmoid activation A cascading network [189] is an adaptive network with function, which is mathematically described in Equation 10, −x an RL trained Composer that can determine a reasonable where (x) = 1∕(1 + e ) is the sigmoid function. The input computation graph for each input. An adaptive controller to the T-sigmoid activation function is down sampled by a Policy Preferences is used to intelligently enhance the Com- FCL from the feature maps. 
The threshold is found using poser allowing an adjustment of the network computation iterative training which can be a computationally expensive graph from sub-graphs. The Composer performs much better process. DCP increased VGG-16 top-5 error by 4.77% on in terms of accuracy than the baseline network with the same ILSVRC-2012 for 5× computation speed-up. By comparison, number of computation-involved parameters on a modified RNP increased VGG-16 top-5 error by 4.89% [153]. dataset, namely Wide-MNIST. For example, when invoking T (x), if x > T 1k parameters, the baseline achieves 72% accuracy while the ℎ(x) = (10) Composer obtained 85%. 0, otherwise BlockDrop [246] introduced a policy network that trained using RL to make an image-specific determination whether The cascading neural network by Leroux [145] reduced a residual network block should participate in the follow- the average inference time of overfeat network [211] by 40% ing computation. While the other approaches compute an with a 2% ILSVRC-2012 top-1 accuracy loss. Their criteria exit confidence score per layer, the policy network runs only for early exit is based on the confidence score generated by an once when an image is loaded. It generates a boolean vec- output layer. The auxiliary layers were trained with general tor that indicates which residual blocks are activate or in- backpropagation. The adjustable score threshold provides a active. BlockDrop adds more flexibility to the early exit trade-off between accuracy and efficiency. mechanism by allowing a decision to be made on any block Bolukbasi [25] reports a system that contains a com- and not just early blocks in Spatially Adaptive Computation bination of other SOTA networks (e.g., AlexNet, ResNet, Time (SACT) [68]. This is discussed further in Section 3.2.3. GoogLeNet, etc.). A policy adaptively chooses a point to BlockDrop achieves an average speed-up of 20% on ResNet- exit early. This policy can be trained by minimizing its cost 101 for ILSVRC-2012 without accuracy loss. Experiments function. They format the system as a directed acyclic graph using the CIFAR dataset showed better performance than with various pre-trained networks as basic components. They other SOTA counterparts at that time [68, 82, 147]. evaluate this graph to determine leaf nodes for early exit. Runtime Neural Pruning (RNP) [153] is a framework The cascade of acyclic graphs with a combination of various that prunes neural networks dynamically. RNP formulates networks reduces computations while maintaining predic- the feature selection problem as a Markov Decision Process tion accuracy. ILSVRC-2012 experiments show ResNet-50 2.8× 1.9× (MDP) and then trains an RNN-based decision network by acceleration of with 1% top-5 accuracy loss and speed-up with no accuracy loss.

T Liang et al.: Preprint submitted to Elsevier Page 13 of 41 Survey on pruning and quantization

Considering the similarity of RNNs and residual networks This implies the pruned architecture itself is crucial to suc- [83], Spatially Adaptive Computation Time (SACT) [68] cess. By this observation, the pruning algorithms could be explored an early stop mechanism of residual networks in seen as a type of NAS. Liu concluded that because the weight the spatial domain. SACT can be applied to various tasks values can be re-trained, by themselves they are not effica- including image classification, object detection, and image cious. However, the lottery ticket hypothesis [70] achieved segmentation. SACT achieved about 20% acceleration with comparable accuracy only when the weight initialization no accuracy loss for ResNet-101 on ILSVRC-2012. was exactly the same as the unpruned model. Glae [72] To meet the computation constraints, Multi-Scale Dense resolved the discrepancy by showing that what really matters Networks (MSDNets) [108] designed an adaptive network is the pruning form. Specifically, unstructured pruning can using two techniques: 1) an anytime-prediction to generate only be fine-tuned to restore accuracy but structured pruning prediction results at many nodes to facilitate the network’s can be trained from scratch. In addition, they explored the early exit and 2) batch computational budget to enforce a performance of dropout and l0 regularization. The results simpler exit criteria such as a computation limit. MSDNets showed that simple magnitude based pruning can perform combine multi-scale feature maps [265] and dense connec- better. They developed a magnitude based pruning algorithm tivity [109] to enable accurate early exit while maintaining and showed the pruned ResNet-50 obtained higher accuracy higher accuracy. The classifiers are differentiable so that than SOTA at the same computational complexity. MSDNets can be trained using stochastic gradient descent. 2.2× MSDNets achieve speed-up at the same accuracy for 4. Quantization ResNet-50 on ILSVRC-2012 dataset. To address the training complexity of adaptive networks, Quantization is known as the process of approximating Li [148] proposed two methods. The first method is gradient a continuous signal by a set of discrete symbols or integer equilibrium (GE). This technique helps backbone networks values. Clustering and parameter sharing also fall within converge by using multiple intermediate classifiers across this definition [92]. Partial quantization uses clustering al- multiple different network layers. This improves the gradi- gorithms such as k-means to quantize weight states and then ent imbalance issue found in MSDNets [108]. The second store the parameters in a compressed file. The weights can be method is an Inline Subnetwork Collaboration (ISC) and a decompressed using either a lookup table or a linear transfor- One-For-All knowledge distillation (OFA). Instead of inde- mation. This is typically performed during runtime inference. pendently training different exits, ISC takes early predictions This scheme only reduces the storage cost of a model. This into later predictors to enhance their input information. OFA is discussed in Section 4.2.4. In this section we focus on supervises all the intermediate exits using a final classifier. At numerical low-bit quantization. a same ILSVRC-2012 top-1 accuracy of 73.1%, their network Compressing CNNs by reducing precision values has takes only one-third the computational budget of ResNet. been previously proposed. 
Converting floating-point parame- Slimmable Neural Networks (SNN) [259] are a type of ters into low numerical precision datatypes for quantizing neu- networks that can be executed at different widths. Also known ral networks was proposed as far back as the 1990s [67, 14]. as switchable networks, the network enables dynamically Renewed interest in quantization began in the 2010s when 8- selecting network architectures (width) without much compu- bit weight values were shown to accelerate inference without tation overhead. Switchable networks are designed to adap- a significant drop in accuracy [233]. tively and efficiently make trade-offs between accuracy and Historically most networks are trained using FP32 num- on-device inference latency across different hardware plat- bers [225]. For many networks an FP32 representation has forms. SNN found that the difference of feature mean and greater precision than needed. Converting FP32 parameters variance may lead to training faults. SNN solves this issue to lower bit representations can significantly reduce band- with a novel switchable BN technique and then trains a wide width, energy, and on-chip area. enough network. Unlike cascade networks which primar- Figure 12 shows the evolution of quantization techniques. ily benefit from specific blocks, SNN can be applied with Initially, only weights were quantized. By quantizing, cluster- many more types of operations. As BN already has two pa- ing, and sharing, weight storage requirements can be reduced rameters as mentioned in Section 2, the network switch that by nearly 4×. Han [92] combined these techniques to reduce controls the network width comes with little additional cost. weight storage requirements from 27MB to 6.9MB. Post train- SNN increased top-1 error by 1.4% on ILSVRC-2012 while ing quantization involves taking a trained model, quantizing achieving about 2× speed-up. the weights, and then re-optimizing the model to generate a quantized model with scales [16]. Quantization-aware train- 3.3. Comparisons ing involves fine-tuning a stable full precision model or re- Pruning techniques are diverse and difficult to compare. training the quantized model. During this process real-valued Shrinkbench [24] is a unified benchmark framework aiming weights are often down-scaled to integer values - typically to provide pruning performance comparisons. 8- [120]. Saturated quantization can be used to generate There exist ambiguities about the value of the pre-trained feature scales using a calibratation algorithm with a calibra- weights. Liu [160] argues that the pruned model could be tion set. Quantized activations show similar distributions trained from scratch using a random weight initialization. with previous real-valued data [173]. Kullback-Leibler di-

T Liang et al.: Preprint submitted to Elsevier Page 14 of 41 Survey on pruning and quantization

cluster/ post train quantize-aware quantize-aware sharing quantize training training weights activation calibrated floating point floating point non-saturated saturated

Figure 12: Quantization Evolution: The development of quantization techniques, from left to right. Purple rectangles indicated quantized data while blue rectangles represent full precision 32-bit floating point format.

vergence (KL-divergence, also known as relative entropy or There are many methods to quantize a given network. Gener- information divergence) calibrated quantization is typically ally, they are formulated as Equation 12 where s is a scalar applied and can accelerate the network without accuracy loss that can be calculated using various methods. g(⋅) is the for many well known models [173]. Fine-tuning can also be clamp function applied to floating-point values Xr perform- applied with this approach. ing the quantization. z is the zero-point to adjust the true KL-divergence is a measure to show the relative entropy zero in some asymmetrical quantization approaches. f(⋅) is of probability distributions between two sets. Equation 11 the rounding function. This section introduces quantization gives the equation for KL-divergence. P and Q are defined using the mathematical framework of Equation 12. as discrete probability distributions on the same probability space. Specifically, P is the original data (floating-point) clamp(x, , ) = max(min(x, ), ) distribution that falls in several bins. Q is the quantized data (13) histogram. Equation 13 defines a clamp function. The min-max N 0 1 method is given by Equation 14 where [m, M] are the bounds É P (xi) D (P Q) = P (x ) log for the minimum and maximum values of the parameters, re- KL ‖ i Q(x ) (11) i=0 i spectively. n is the maximum representable number derived 8 from the bit-width (e.g., 256 = 2 in case of 8-bit), and z, s Depending upon the processor and execution environ- are the same as in Equation 12. z is typically non-zero in the ment, quantized parameters can often accelerate neural net- min-max method [120]. work inference. Quantization research can be categorized into two focus g(x) = clamp(x, m, M) areas: 1) quantization aware training (QAT) and 2) post train- n − 1 m × (1 − n) s = , z = ing quantization (PTQ). The difference depends on whether M − m M − m (14) training progress is is taken into account during training. Al- where m = min{Xi},M = max{Xi} ternatively, we could also categorize quantization by where data is grouped for quantization: 1) layer-wise and 2) channel- The max-abs method uses a symmetry bound shown in wise. Further, while evaluating parameter widths, we could Equation 15. The quantization scale s is calculated from further classify by length: N-bit quantization. the largest one R among the data to be quantized. Since the Reduced precision techniques do not always achieve the bound is symmetrical, the zero point z will be zero. In such expected speedup. For example, INT8 inference doesn’t a situation, the overhead of computing an offset-involved achieve exactly 4× speedup over 32-bit floating point due convolution will be reduced but the dynamic range is reduced to the additional operations of quantization and dequanti- since the valid range is narrower. This is especially noticeable zation. For instance, Google’s TensorFlow-Lite [227] and for ReLU activated data where all of which values fall on the nVidia’s Tensor RT [173] INT8 inference speedup is about positive axis. 2 3× - . Batch size is the capability to process more than one g(x) = clamp(x, −M,M) image in the forward pass. Using larger batch sizes, Tensor 3 4× n − 1 RT does achieve - acceleration with INT8 [173]. s = , z = 0 (15) Section 8 summarizes current quantization techniques R used on the ILSVRC-2012 dataset along with their bit-widths where R = max{abs{Xi}} for weights and activation. Quantization can be applied on input features F, weights 4.1. 
Quantization Algebra W, and biases b. Taking feature F and weights W as an example (ignoring the biases) and using the min-max method gives Equation 16. The subscripts r and q denote the real- Xq = f(s × g(Xr) + z) (12) valued and quantized data, respectively. The max suffix is

T Liang et al.: Preprint submitted to Elsevier Page 15 of 41 Survey on pruning and quantization from R in Equation 15, while sf = (n − 1)∕Fmax, sw = 4.2. Quantization Methodology (n − 1)∕Wmax. We describe PTQ and QAT quantization approaches based n − 1 n − 1 on back-propagation use. We can also categorize them based F = × F , W = × W q F r q W r (16) on bit-width. In the following subsections, we introduce com- max max mon quantization methods. In Section 4.2.1 low bit-width Integer quantized convolution is shown in Equation 17 quantization is discussed. In Section 4.2.2 and Section 4.2.3 and follows the same form as convolution with real values. In special cases of low bit-width quantization is discussed. In Equation 17, the ∗ denotes the convolution operation, F the Section 4.2.5 difficulties with training quantized networks feature, W the weights, and Oq, the quantized convolution are discussed. Finally, in Section 4.2.4, alternate approached result. Numerous third party libraries support this type of in- to quantization are discussed. teger quantized convolution acceleration. They are discussed 4.2.1. Lower Numerical Precision in Section 4.3.2. Half precision floating point (16-bit floating-point, FP16) Oq = Fq ∗ Wq s.t. F, W ∈ ℤ (17) has been widely used in nVidia GPUs and ASIC accelerators O with minimal accuracy loss [54]. Mixed precision training De-quantizing converts the quantized value q back to with weights, activations, and gradients using FP16 while floating-point Or using the feature scales sf and weights s z = 0 the accumulated error for updating weights remains in FP32 scales w. A symmetric example with is shown in Equa- has shown SOTA performance - sometimes even improved tion 18. This is useful for layers that process floating-point performance [172]. tensors. Quantization libraries are discussed in Section 4.3.2. Researchers [165, 98, 233] have shown that FP32 parame- ters produced during training can be reduced to 8-bit integers Oq Fmax Wmax for inference without significant loss of accuracy. Jacob [120] Or = = Oq × × (18) sf × sw (n − 1) (n − 1) applied 8-bit integers for both training and inference, with an accuracy loss of 1.5% on ResNet-50. Xilinx [212] showed In most circumstances, consecutive layers can compute that 8-bit numerical precision could also achieve lossless per- with quantized parameters. This allows dequantization to formance with only one batch inference to adjust quantization Fl+1 be merged in one operation as in Equation 19. q is the parameters and without retraining. l+1 quantized feature for next layer and sf is the feature scale Quantization can be considered an exhaustive search op- for next layer. timizing the scale found to reduce an error term. Given a l+1 floating-point network, the quantizer will take an initial scale, Oq × s l l+1 f typically calculated by minimizing the 2-error, and use it Fq = (19) sf × sw to quantize the first layer weights. Then the quantizer will adjust the scale to find the lowest output error. It performans The activation function can be placed following either this operation on every layer. the quantized output Oq, the de-quantized output Or, or after Integer Arithmetic-only Inference (IAI) [120] proposed l+1 a re-quantized output Fq . The different locations may lead a practical quantization scheme able to be adopted by indus- to different numerical outcomes since they typically have try using standard datatypes. IAI trades off accuracy and different precision. 
inference latency by compressing compact networks into in- Similar to convolutional layers, FCLs can also be quan- tegers. Previous techniques only compressed the weights of tized. K-means clustering can be used to aid in the compres- redundant networks resulting in better storage efficiency. IAI sion of weights. In 2014 Gong [76] used k-means clustering quantizes z ≠ 0 in Equation 12 requiring additional zero- on FCLs and achieved a compression ratio of more than 20× point handling but resulting in higher efficiency by making with 1% top-5 accuracy loss. use of unsigned 8-bit integers. The data-flow is described in Bias terms in neural networks introduce intercepts in Figure 13. TensorFlow-Lite [120, 131] deployed IAI with linear equations. They are typically regarded as constants an accuracy loss of 2.1% using ResNet-150 on the ImageNet that help the network to train and best fit given data. Bias dataset. This is described in more detail in Section 4.3.2. quantization is not widely mentioned in the literature. [120] Datatypes other than INT8 have been used to quantize maintained 32-bit biases while quantizing weights to 8-bit. parameters. Fixed point, where the radix point is not at the Since biases account for minimal memory usage (e.g. 12 right-most binary digit, is one format that has been found to be values for a 10-in/12-out FCL vs 120 weight values) it is useful. It provides little loss or even higher accuracy but with recommended to leave biases in full precision. If bias quan- a lower computation budget. Dynamic scaled fixed-point tization is performed it can be a multiplication by both the representation [233] obtained a 4× acceleration on CPUs. feature scale and weight scale [120], as shown in Equation 20. However, it requires specialized hardware including 16-bit However, in some circumstances they may have their own fixed-point [89], 16-bit flex point [130], and 12-bit opera- scale factor. For example, when the bit-lengths are limited to tions using dynamic fixed-point format (DFXP) [51]. The be shorter than the multiplication results. specialized hardware is mentioned in Section 4.3.3.

sb = sw × sf , bq = br × sb (20)

T Liang et al.: Preprint submitted to Elsevier Page 16 of 41 Survey on pruning and quantization

idx (n) itℎ ntℎ biases tion 21. i is the index for the weights in the weights uint32 code-book. Each coded weight wi can be indexed by the uint8 activation NB-bit expression. conv + ReLU6 uint8 feature N uint8 É   wi = Cn idxi(n) n=1 Figure 13: Integer Arithmetic-only Inference: The convolution $ % (21) −n+1 −n −n−1 −n−⌊M∕2⌋+2 operation takes unsigned int8 weights and inputs, accumulates Cn = 0, ±2 , ±2 , ±2 , … , ±2 them to unsigned int32, and then performs a 32-bit addi- B tion with biases. The ReLU6 operation outputs 8-bit integers. where M = 2 − 1 Adopted from [120] Note that the number of code-books Cn can be greater than one. This means the encoded weight might be a combination of multiple shift operations. This property allows ShiftCNN 4.2.2. Logarithmic Quantization to expand to a relatively large-scale quantization or to shrink Bit-shift operations are inexpensive to implement in hard- to binarized or ternary weights. We discuss ternary weights in ware compared to multiplication operations. FPGA imple- Section 4.2.3. ShiftCNN was deployed on an FPGA platform mentations [6] specifically benefit by converting floating- and achieved comparable accuracy on the ImageNet dataset point multiplication into bit shifts. Network inference can with 75% power saving and up to 1090× clock cycle speed-up. be further optimized if weights are also constrained to be ShiftCNN achieves this impressive result without requiring re- power-of-two with variable-length encoding. Logarithmic training. With N = 2 and B = 4 encoding, SqueezeNet [115] quantization takes advantage of this by being able to express has only 1.01% top-1 accuracy loss. The loss for GoogLeNet, a larger dynamic range compared to linear quantization. ResNet-18, and ResNet-50 is 0.39%, 0.54%, and 0.67%, re- Inspired by binarized networks [52], introduced in Sec- spectively, While compressing the weights into 7/32 of the tion 4.2.3, Lin [156] forced the neuron output into a power- original size. This implies that the weights have significant of-two value. This converts multiplications into bit-shift redundancy. operations by quantizing the representations at each layer of Based on LogNN, Cai [30] proposed improvements by the binarized network. Both training and inference time are disabling activation quantization to reduce overhead during thus reduced by eliminating multiplications. inference. This also reduced the clamp bound hyperparameter Incremental Network Quantization (INQ) [269] replaces tuning during training. These changes resulted in many low- weights with power-of-two values. This reduces computa- valued weights that are rounded to the nearest value during n tion time by converting multiplies into shifts. INQ weight encoding. As 2 s.t. n ∈ N increases quantized weights quantization is performed iteratively. In one iteration, weight sparsity as n increases. In this research, n is allowed to be pruning-inspired weight partitioning is performed using group- real-valued numbers as n ∈ R to quantize the weights. This wise quantization. These weights are then fine-tuned by using makes weight quantization more complex. However, a code- a pruning-like measurement [92, 88]. Group-wise retraining book helps to reduce the complexity. fine-tunes a subset of weights in full precision to preserve In 2019, Huawei proposed DeepShift, a method of sav- ensemble accuracy. The other weights are converted into ing computing power by shift convolution [62]. DeepShift power-of-two format. 
After multiple iterations most of the removed all floating-point multiply operations and replaced full precision weights are converted to power-of-two. The them with bit reverse and bit shift. The quantized weight final networks have weights from 2 (ternary) to 5 bits with Wq transformation is shown mathematically in Equation 22, values near zero set to zero. Results of group-wise iterative where S is a sign matrix, P is a shift matrix, and Z is the set quantization show lower error rates than a random power-of- of integers. 71× two strategy. Specifically, INQ obtained compression P with 0.52% top-1 accuracy loss on the ILSVRC-2012 with Wq = S × 2 , s.t. P ∈ ℤ,S ∈ {−1, 0, +1} (22) AlexNet. Logarithmic Neural Networks (LogNN) [175] quantize Results indicate that DeepShift networks cannot be easily weights and features into a log-based representation. Loga- trained from scratch. They also show that shift-format net- rithmic backpropagation during training is performed using works do not directly learn for lager datasets such as Im- log log√ agenet. Similar to INQ, they show that fine-tuning a pre- shift operations. Bases other than 2 can be used. 2 trained network can improve performance. For example, based arithmetic is described as a trade-off between dynamic with the same configuration of 32-bit activations and 6-bit range and representation precision. log2 showed 7× compres- shift-format weights, the top-1 ILSVRC-2012 accuracy loss sion with 6.2% top-5 accuracy loss on AlexNet, while log√ 2 on ResNet-18 for trained from scratch and tuned from a pre- showed 1.7% top-5 accuracy loss. trained model are 4.48% and 1.09%, respectively. Shift convolutional neural networks (ShiftCNN) [84] im- DeepShift proposes models with differential backpropa- prove efficiency by quantizing and decomposing the real- gation for generating shift coefficients during the retraining N B valued weights matrix into an times ranged bit-shift, process. DeepShift-Q [62] is trained with floating-point pa- C and encoding them with code-books as shown in Equa- rameters in backpropagation with values rounded to a suitable

T Liang et al.: Preprint submitted to Elsevier Page 17 of 41 Survey on pruning and quantization

b format during inference. DeepShift-PS directly adopts the x with a hard sigmoid probability (x). Both the activations shift P and sign S parameters as trainable parameters. and the gradients use 32-bit single precision floating point. Since logarithmic encoding has larger dynamic range, The trained BC network shows 1.18% classification error redundant networks particularly benefit. However, less redun- on the small MNIST dataset but 8.27% classification on the dant networks show significant accuracy loss. For example, larger CIFAR-10 dataset. VGG-16 which is a redundant network shows 1.31% accuracy T +1, p = (x) loss on top-1 while DenseNet-121 shows 4.02% loss. xb = with probability −1, with probability 1 − p 4.2.3. Plus-minus Quantization (25) x + 1  (x) = clamp , 0, 1 Plus-minus quantization was in 1990 [208]. This tech- where 2 nique reduces all weights to 1-bit representations. Similar to logarithmic quantization, expensive multiplications are Courbariaux extended BC networks by binarizing the removed. In this section, we provide an overview of signifi- activations. He named them BinaryNets [53], which is recog- cant binarized network results. Simons [216] and Qin [198] nized as the first BNN. They also report a customized binary provide an in-depth review of BNNs. matrix multiplication GPU kernel that accelerates the calcu- 7× Binarized neural networks (BNN) have only 1-bit weights lation by . BNN is considered the first binarized neural and often 1-bit activations. 0 and 1 are encoded to represent network where both weights and activations are quantized -1 and +1, respectively. Convolutions can be separated into to binary values [216]. Considering the hardware cost of multiplies and . In binary arithmetic, single bit stochastic binarization, they made a trade-off to apply deter- operations can be performed using and, xnor, and bit-count. ministic binarization in most circumstances. BNN reported We follow the introduction from [273] to explain bit-wise 0.86% error on MNIST, 2.53% error on SVHN, and 10.15% operation. Single bit fixed point dot products are calculated error on CIFAR-10. The ILSVRC-2012 dataset accuracy as in Equation 23, where and is a bit-wise AND operation results for binarized AlexNet and GoogleNet are 36.1% top-1 and bitcount counts the number of 1’s in the bit string. and 47.1%, respectively while the FP32 original networks achieve 57% and 68%, respectively [112]. x ⋅ y = bitcount(and(x, y)), s.t. ∀i, xi, yi ∈ {0, 1} (23) Rastegari [200] explored binary weight networks (BWN) on the ILSVRC dataset with AlexNet and achieved the same This can be extended into multi-bit computations as in Equa- classification accuracy as the single precision version. The + tion 24[ 53]. x and y are M-bit and K-bit fixed point inte- key is a scaling factor ∈ ℝ applied to an entire layer of ∑M−1 m ∑K−1 k gers, subject to x = m=0 cm(x)2 and y = k=0 ck(y)2 binarized weights B. This results in similar weights values c x M−1 c y K−1 W ≈ B , where ( m( ))m=0 and ( k( ))k=0 are bit vectors. as if they were computed using FP32 . They also applied weight binarization on ResNet-18 and GoogLeNet, M−1 K−1 resulting in 9.5% and 5.8% top-1 accuracy loss compared É É m+k   x ⋅ y = 2 bitcount and cm(x), ck(y) , to the FP32 version, respectively. They also extended bina- (24) m=0 k=0 rization to activations called XNOR-Net and evaluated it on s.t. cm(x)i, ck(y)i ∈ {0, 1}∀i, m, k. the large ILSVRC-2012 dataset. 
Compared to BNN, XNOR- Net also applied a scaling factor on the input feature and a By removing complicated floating-point multiplications, rearrangement of the network structure (swapping the con- networks are dramatically simplified with simple accumula- volution, activation, and BN). Finally, XNOR-Net achieved tion hardware. Binarization not only reduces the network size 44.2% top-1 classification accuracy on ILSVRC-2012 with × by up-to 32 , but also drastically reduces memory usage re- AlexNet, while accelerating execution time 58× on CPUs. sulting in significantly lower energy consumption [174, 112]. The attached scaling factor extended the binarized value ex- However, reducing 32-bit parameters into a single bit results pression, which reduced the network distortion and lead to in a significant loss of information, which decreases predic- better ImageNet accuracy. tion accuracy. Most quantized binary networks significantly DoReFa-Net [272] also adopts plus-minus arithmetic for under-perform compared to 32-bit competitors. quantized network. DoReFa additionally quantizes gradients There are two primary methods to reduce floating-point to low-bit widths within 8-bit expressions during the back- values into a single bit: 1) stochastic and 2) deterministic [52]. ward pass. The gradients are quantized stochastically in back Stochastic methods consider global statistics or the value of propagation. For example, it takes 1 bit to represent weights input data to determine the probability of some parameter to layer-wise, 2-bit activations, and 6-bits for gradients. We be -1 or +1. Deterministic binarization directly computes describe training details in Section 4.2.5. They found 9.8% the bit value based on a threshold, usually 0, resulting in a top-1 accuracy loss on AlexNet with ILSVRC-2012 using sign function. Deterministic binarization is much simpler to the 1-2-6 combination. The result for the 1-4-32 combination implement in hardware. is 2.9%. Binary Connect (BC), proposed by Courbariaux [52], Li [146] and Leng [144] showed that for ternary weights is an early stochastic approach to binarize neural networks. (−1, 0, and + 1), in Ternary Weight Networks (TWN), only They binarized the weights both in forward and backward a slight accuracy loss was realized. Compared to BNN, TWN propagation. Equation 25 shows the stochastic binarization has an additional value to reduce information loss while still

T Liang et al.: Preprint submitted to Elsevier Page 18 of 41 Survey on pruning and quantization keeping computational complexity similar to BNN’s. Ternary measure problem logic may be implemented very efficiently in hardware, as d d c É É Éin the additional value (zero) do not actually participate in com- Y (m, n, t) = S(X(m+i, n+j, k), F(i, j, k, t)) l (26) putations [50]. TWN adopts the 2-distance to find the scale i=0 j=0 k=0 and formats the weights into −1, 0, and + 1 with a threshold d×d×c ×c generated by an assumption that the weighs are uniformly where F ∈ ℝ in out is a filter, d is the kernel size, cin is distributed such as in [−a, a]. This resulted in up to 16× c X ∈ ℝℎ×w×cin an input channel and out is an output channel. model compression with 3.6% ResNet-18 top-1 accuracy loss stands for the input feature height ℎ and width w. With this on ILSVRC-2012. formulation, the output Y is calculated with the similarity Trained Ternary Quantization (TTQ) [274] extended TWN S(⋅, ⋅), i.e., S(x, y) = x × y for conventional convolution by introducing two dynamic constraints to adjust the quantiza- where the similarity measure is calculated by cross correla- tion threshold. TTQ outperformed the full precision AlexNet tion. Equation 27 mathematically describes AdderNet, which on the ILSVRC-2012 top-1 classification accuracy by 0.3%. replaces the multiply with subtraction. The l1-distance is It also outperformed TWN by 3%. applied to calculate the distance between the filter and the Ternary Neural Networks (TNN) [6] extend TWN by input feature. By replacing multiplications with subtractions, quantizing the activations into ternary values. A teacher net- AdderNet speeds up inference by transforming 3.9 billion work is trained with full precision and then using transfer multiplications into subtractions with a loss in ResNet-50 learning the same structure is used but replacing the full accuracy of 1.3%. precision values with a ternarized student in a layer-wise d d c greedy method. A small difference between the real-valued É É Éin teacher network and the ternarized student network is that Y (m, n, t) = − ðX(m+i, n+j, k)−F(i, j, k, t)ð they activate the output with a ternary output activation func- i=0 j=0 k=0 tion to simulate the real TNN output. TNN achieves 1.67% (27) MNIST classification error and 12.11% classification error on CIFAR10. TNN has slightly lower accuracy compared to NAS can be applied to BNN construction. Shen [213] TWN (an additional 1.02% MNIST error). adopted evolutionary algorithms to find compact but accurate Intel proposed Fine-Grained Quantization (FGQ) [170] models achieving 69.65% top-1 accuracy on ResNet-18 with 2.8× to generalize ternary weights by splitting them into several ImageNet at speed-up. This is better performance than groups and with independent ternary values. The FGQ quan- the 32-bit single precision baseline ResNet-18 accuracy of tized ResNet-101 network achieved 73.85% top-1 accuracy on 69.6%. However, the search approach is time consuming the ImageNet dataset (compared with 77.5% for the baseline) taking 1440 hours on an nVidia V100 GPU to search 50k using four groups weights and without re-training. FGQ also ImageNet images to process an initial network. showed improvements in (re)training demonstrating a top-1 4.2.4. Other Approaches to Quantization accuracy improvement from 48% on non-trained to 71.1% top-1 on ResNet-50. ResNet-50’s baseline accuracy is 75%. 
Weight sharing by vector quantization can also be consid- Four groups FGQ with ternary weights and low bit-width ered a type of quantization. In order to compress parameters activations achieves about 9× acceleration. to reduce memory space usage, parameters can be clustered MeliusNet [21] is a binary neural network that consist and shared. K-means is a widely used clustering algorithm of two types of binary blocks. To mitigate drawbacks of and has been successfully applied to DNNs with minimal low bit width networks, reduced information quality, and loss of accuracy [76, 243, 143] achieving 16-24 times com- reduced network capacity, MeliusNet used a combination pression with 1% accuracy loss on the ILSVRC-2012 dataset of dense block [22] which increases network channels by [76, 243]. concatenating derived channels from the input to improve HashNet [37] uses a hash to cluster weights. Each hash capacity and improvement block [161] which improves the group is replaced with a single floating-point weight value. This was applied to FCLs and shallow CNN models. They quality of features by adding additional convolutional acti- 64× vations onto existing extra channels from dense block. They found a compression factor of outperforms equivalent- achieved accuracy results comparable to MobileNet on the sized networks on MNIST and seven other datasets they eval- ImageNet dataset with MeliusNet-59 reporting 70.7% top- uated. 1 accuracy while requiring only 0.532 BFLOPs. A similar In 2016 Han applied Huffman coding with Deep Com- pression [92]. The combination of weight sharing, pruning, sized 17MB MobileNet required 0.569 BFLOPs achieving 49× 70.6% accuracy. and huffman coding achieved compression on VGG-16 AdderNet [35] is another technique that replaces multiply with no loss of accuracy on ILSVRC-2012, which was SOTA arithmetic but allows larger than 1-bit parameters. It replaces at the time. all convolutions with addition. Equation 26 shows that for a The Hessian method was applied to measure the impor- standard convolution, AdderNet formulates it as a similarity tance of network parameters and therefore improve weight quantization [45]. They minimized the average Hessian weighted quantization errors to cluster parameters. They found compression ratios of 40.65 on AlexNet with 0.94%

T Liang et al.: Preprint submitted to Elsevier Page 19 of 41 Survey on pruning and quantization accuracy loss on ILSVRC-2012. Weight regularization can plied for propagating gradients by using discretization [112]. slightly improve the accuracy of quantized networks by pe- Equation 28 show the STE for sign binarization, where c nalizing weights with large magnitudes [215]. Experiments denotes the cost function, wr is the real-valued weights, and showed that l2 regularization improved 8-bit quantized Mo- wb is the binarized weight produced by the sign function. bileNet top-1 accuracy by 0.23% on ILSVRC-2012. STE bypasses the binarization function to directly calculate BN has proved to have many advantages including ad- real-valued gradients. The floating-point weights are then up- dressing the internal covariate shift issue [119]. It can also dated using methods like SGD. To avoid real-valued weights be considered a type of quantization. However, quantization approaching infinity, BNNs typically clamp floating-point performed with BN may have numerical instabilities. The weights to the desired range of ±1 [112]. BN layer has nonlinear square and square root operations.  w = sign w Low bit representations may be problematic when using non- Forward : b r l )c )c linear operations. To solve this, 1-norm BN [245] has only = 1 (28) Backward : wr ≤1 linear operations in both forward and backward training. It )wr )wb ð ð provided 1.5× speedup at half the power on FPGA platforms and can be used with both training and inference. Unlike the forward phase where weights and activations are produced with deterministic quantization, in the gradient 4.2.5. Quantization-aware Training phase, the low bit gradients should be generated by stochas- Most quantization methods use a global (layer-wise) quan- tic quantization [89, 271]. DoReFa [272] first successfully tization to reduce the full precision model into a reduced bit trained a network with gradient bit-widths less than eight and model. Thus can result in non-negligible accuracy loss. A sig- achieved a comparable result with k-bit quantization arith- nificant drawback of quantization is information loss caused metic. This low bit-width gradient scheme could accelerate by the irreversible precision reducing transform. Accuracy training in edge devices with little impact to network accu- loss is particularly visible in binary networks and shallow net- racy but minimal inference acceleration compared to BNNs. works. Applying binary weights and activations to ResNet-34 DoReFa quantizes the weights, features, and gradients into or GoogLeNet resulted in 29.10% and 24.20% accuracy loss, many levels obtaining a larger dynamic range than BNNs. respectively [53]. It has been shown that backward propaga- They trained AlexNet on ImageNet from scratch with 1-bit tion fine-tunes (retrains) a quantized network and can recover weights, 2-bit activations, and 6-bit gradients. They obtained losses in accuracy caused by the quantization process [171]. 46.1% top-1 accuracy (9.8% loss comparing with the full The retraining is even resilient to binarization information precision counterpart). Equation 29 shows the weight quan- distortions. Thus training algorithms play a crucial role when tizing approach. w is the weights (the same as in Equation 28), using quantization. In this section, we introduce (re)training limit is a limit function applied to the weights keeping them of quantized networks. in the range of [0, 1], and quantizek quantizes the weights into k-levels. 
Feature quantization is performed using the BNN Training: k For a binarized network that has binary val- f = quantizek function. ued weights it is not effective to update the weights using k  gradient decent methods due to typically small derivatives. fw = 2 quantizek limit(wr) − 1 Early quantized networks were trained with a variation of 1   quantize (w ) = round 2k − 1 w , Bayesian inference named Expectation Back Propagation where k r 2k − 1 r (29) (EBP) [220, 41]. This method assigns limited parameter pre- tanh(x) 1 x cision (e.g., binarized) weights and activations. EBP infers and limit( ) = + 2 max(ð tanh(x)ð) 2 networks with quantized weights by updating the posterior distributions over the weights. The posterior distributions are In DoReFa, gradient quantization is shown in Equation 30, updated by differentiating the parameters of the backpropa- where dr = )c∕)r is the backprogagated gradient of the cost gation. function c to output r. BinaryConnect [52] adopted the probabilistic idea of 4 0 1 5 EBP but instead of optimizing the weights posterior distri- ̃k dr 1 1 f = 2 max0(ðdrð) quantizek + − bution, BC preserved floating-point weights for updates and 2 max0(ðdrð) 2 2 then quantized them into binary values. The real-valued (30) weights update using the back propagated error by simply ignoring the binarization in the update. As in deep feed forward networks, the exploding gradi- ±1 A binarized Network has only 1-bit parameters - quan- ent problem can cause BNN’s not to train. To address this tized from a sign function. Single bit parameters are non- issue, Hou [104] formulated the binarization effect on the net- differentiable and therefore it is not possible to calculate gra- work loss as an optimization problem which was solved by a dients needed for parameter updating [208]. SGD algorithms proximal Newton’s algorithm with diagonal Hessian approx- have been shown to need 6 to 8 bits to be effective [180]. To imation that directly minimizes the loss with respect to the work around these limitations the Straight-Through Estima- binary weights. This optimization found 0.09% improvement tor (STE), previously introduced by Hinton [102], was ap- on MNIST dataset compared with BNN.

T Liang et al.: Preprint submitted to Elsevier Page 20 of 41 Survey on pruning and quantization

Alpha-Blending (AB) [162] was proposed as a replace- input layer ment for STE. Since STE directly sets the quantization func- tion gradients to 1, a hypothesis was made that STE tuned weights feature FP32 networks could suffer accuracy losses. Figure 14 shows that FP32 AB introduces an additional scale coefficient . Real-valued F32 weights and quantized weights are both kept. During training optimizer quantizer quantizer is gradually raised to 1 until a fully quantized network is loss scaling float2half float2half realized. F16

weights weights feature weights FP16 FP16 float float STE AB quantizer quantizer 1-α weights feature weights feature binary float binary float conv α

+ F16 F16 conv conv activation FP16

(a) Straight-through Estimator (b) Alpha-Blending Approach Figure 15: Mixed Precision Training [172]: FP16 is applied Figure 14: STE and AB: STE directly bypasses the quan- in the forward and backward pass, while FP32 weights are tizer while AB calculates gradients for real-valued weights by maintained for the update. introducing additional coefficients [162]

In Section 4.3.2, we discuss libraries and frame- Low Numerical Precision Training: Training with low works. We introduce their specification in Table 2 and then numerical precision involves taking the low precision values compare their performance in Table 3. We also discuss hard- into both forward and backward propagation while maintain- ware implementations of DNNs in Section 4.3.3. Dedicated ing the full precision accumulated results. Mixed Precision hardware is designed or programmed to support efficient pro- [172, 54] training uses FP16 or 16-bit integer (INT16) for cessing of quantized networks. Specialized CPU and GPU weight precision. This has been shown to be inaccurate for operations are discussed. Finally, in Section 4.3.4 we discuss gradient values. As shown in Figure 15, full precision weights DNN compilers. are maintained for gradient updating, while other operands 4.3.1. Deployment Introduction use half-float. A loss scaling technique is applied to keep very small magnitude gradients from affecting the computation With significant resource capability, large organizations −24 since any value less than 2 becomes zero in half-precision and institutions usually have their own proprietary solutions [172]. Specifically, a scaler is introduced to the loss value for applications and heterogeneous platforms. Their support before backpropagation. Typically, the scaler is a bit-shift to the quantization is either inference only or as well as train- n optimal value 2 obtained empirically or by statistical infor- ing. The frameworks don’t always follow the same idea of mation. quantization. Therefore there are differences between them, In TensorFlow-Lite [120], training proceeds with real so performs. values while quantization effects are simulated in the for- With DNNs being applied in many application areas, the ward pass. Real-valued parameters are quantized to lower issue of efficient use of hardware has received considerable precision before convolutional layers. BN layers are folded attention. Multicore processors and accelerators have been into convolution layers. More details are described in Sec- developed to accelerate DNN processing. Many types of tion 4.3.2. accelerators have been deployed, including CPUs with in- As in binarized networks, STE can also be applied to struction enhancements, GPUs, FPGAs, and specialized AI reduced precision training such as 8-bit integers [131]. accelerators. Often accelerators are incorporated as part of a heterogeneous system. A Heterogeneous System Architec- 4.3. Quantization Deployment ture (HSA) allows the different processors to integrate into In this section, we describe implementations of quanti- a system to simultaneously access . For ex- zation deployed in popular frameworks and hardware. In ample, CPUs and GPUs using coherent shared virtual Section 4.3.1 we give an introduction to deployment issues. memory on the same System of Chip (SoC) or connected by

T Liang et al.: Preprint submitted to Elsevier Page 21 of 41 Survey on pruning and quantization

Table 2 Low Precision Libraries Using Quantization: QAT is quantization-aware training, PTQ is post-training quantization, and offset indicates the zero point z in Equation 12.

Name Institution Core Lib Precision Method Platform Open-sourced ARM CMSIS NN [129] Arm CMSIS 8-bit deploy only Arm Cortex-M Processor No MACE [247] XiaoMi - 8-bit QAT and PTQ Mobile - CPU, Hexagon Chips, MTK APU Yes MKL-DNN [204] Intel - 8-bit PTQ, mixed offset, and QAT Intel AVX Core Yes NCNN [229] Tencent - 8-bit PTQ w/o offset Mobile Platform Yes Paddle [13] Baidu - 8-bit QAT and PTQ w/o offset Mobile Platform Yes QNNPACK [61] Fackbook - 8-bit PTQ w/ offset Mobile Platform Yes Ristretto [90] LEPS gemm 3 method QAT Desktop Platform Yes SNPE [228] Qualcomm - 16/8-bit PTQ w/ offset, max-min Snapdragon CPU, GPU, DSP No Tensor-RT [173] nVidia - 8-bit PTQ w/o offset nVidia GPU Yes TF-Lite [1] Google gemmlowp 8-bit PTQ w/ offset Mobile Platform Yes

PCIe with platform atomics can share the same memory address space [74]. Floating-point arithmetic units consume more energy and take longer to compute compared to integer arithmetic units. Consequently, low-bitwidth architectures are designed to accelerate computation [179]. Specialized algorithms and efficient hardware can accelerate neural network processing during both training and inference [202].

4.3.2. Efficient Kernels

Typically, low precision inference is only executed on convolutional layers. Intermediate values passed between layers use 32-bit floating-point. This makes many of the frameworks amenable to modifications.

Table 2 gives a list of major low precision acceleration frameworks and libraries. Most of them use INT8 precision. We next describe some popular and open-source libraries in more detail.

Tensor RT [232, 242] is an nVidia developed C++ library that facilitates high-performance inference on NVIDIA GPUs. It is a low precision inference library that eliminates the bias term in convolutional layers. It requires a calibration set to adjust the quantization thresholds for each layer or channel. Afterwards the quantized parameters are represented by a 32-bit floating-point scalar and INT8 weights.

Tensor RT takes a pre-trained floating-point model and generates a reusable optimized 8-bit integer or 16-bit half float model. The optimizer performs network profiling, layer fusion, and operation concurrency. Equation 31 shows the convolution-dequantization dataflow in Tensor RT for 8-bit integers. The intermediate results of the convolution of the INT8 input features $F_{i8}$ and weights $W_{i8}$ are accumulated into an INT32 tensor $O_{i32}$. They are dequantized by dividing by the feature and weight scales $s_f$ and $s_w$:

$$O_{i32} = F_{i8} \ast W_{i8}, \qquad O_{f32} = \frac{O_{i32}}{s_f \times s_w} \tag{31}$$

Tensor RT applies a variant of max-abs quantization to reduce the storage requirements and calculation time of the zero point term $z$ in Equation 15 by finding a proper threshold instead of the absolute maximum of the floating-point tensor. KL-divergence is introduced to make a trade-off between numerical dynamic range and precision of the INT8 representation [173]. KL calibration can significantly help to avoid accuracy loss. The method traverses a predefined range of possible scales and calculates the KL-divergence for each point. It then selects the scale which minimizes the KL-divergence. KL-divergence is widely used in many post training acceleration frameworks. nVidia found that a model calibrated with 125 images showed only 0.36% top-1 accuracy loss for GoogLeNet on the ImageNet dataset.

Intel MKL-DNN [204] is an optimized computing library for Intel processors with the Intel AVX-512, AVX-2, and SSE4.2 Instruction Set Architectures (ISA). The library uses FP32 for training and inference. Inference can also be performed using 8-bits in convolutional layers, ReLU activations, and pooling layers. It also uses Winograd convolutions. MKL-DNN uses the max-abs quantization shown in Equation 15, where the features adopt unsigned 8-bit integers ($n_f = 256$) and the weights signed 8-bit integers ($n_w = 128$). The rounding function $f(\cdot)$ in Equation 12 uses nearest integer rounding. Equation 32 shows the quantization applied to a given tensor or to each channel in a tensor. The maxima of the weights $R_w$ and features $R_f$ are calculated from the maximum of the absolute value of the tensors $T_w$ and $T_f$. The feature scale $s_f$ and weight scale $s_w$ are generated using $R_f$ and $R_w$. Then the quantized 8-bit signed integer weights $W_{s8}$, 8-bit unsigned integer features $F_{u8}$, and 32-bit signed integer biases $B_{s32}$ are generated using the scales and a nearest integer rounding function $\lVert \cdot \rVert$:

$$\begin{aligned}
R_{\{f,w\}} &= \max(\operatorname{abs}(T_{\{f,w\}})), \qquad s_f = \frac{255}{R_f}, \quad s_w = \frac{127}{R_w} \\
W_{s8} &= \lVert s_w \times W_{f32} \rVert \in [-127, 127] \\
F_{u8} &= \lVert s_f \times F_{f32} \rVert \in [0, 255] \\
B_{s32} &= \lVert s_f \times s_w \times B_{f32} \rVert \in [-2^{31}, 2^{31}-1]
\end{aligned} \tag{32}$$
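To make the two calibration styles above concrete, the following sketch implements per-tensor max-abs scaling in the style of Equation 32 and a KL-divergence threshold search in the spirit of Tensor RT's calibration. It is an illustrative approximation, not vendor code; the bin count, search stride, and smoothing choices are assumptions.

```python
import numpy as np

def maxabs_quantize(weights, features, bias):
    """Per-tensor max-abs quantization following Equation 32 (a sketch)."""
    s_w = 127.0 / np.max(np.abs(weights))      # signed INT8 weights
    s_f = 255.0 / np.max(np.abs(features))     # unsigned UINT8 features
    w_s8 = np.clip(np.rint(s_w * weights), -127, 127).astype(np.int8)
    f_u8 = np.clip(np.rint(s_f * features), 0, 255).astype(np.uint8)
    b_s32 = np.rint(s_f * s_w * bias).astype(np.int32)
    return w_s8, f_u8, b_s32, s_f, s_w

def dequantize(o_i32, s_f, s_w):
    """Equation 31: recover FP32 outputs from the INT32 accumulator."""
    return o_i32.astype(np.float32) / (s_f * s_w)

def kl_calibrate(activations, num_bins=2048, num_levels=128, stride=16):
    """Search a clipping threshold minimizing the KL-divergence between the
    FP32 activation histogram and its quantized approximation."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1, stride):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                 # fold the clipped tail into the last bin
        q = np.zeros_like(p)
        step = i / num_levels
        for j in range(num_levels):             # merge to num_levels bins, then expand back
            lo = int(np.floor(j * step))
            hi = max(int(np.ceil((j + 1) * step)), lo + 1)
            chunk = p[lo:hi]
            nz = chunk > 0
            if nz.any():
                q[lo:hi][nz] = chunk.sum() / nz.sum()
        p_n, q_n = p / p.sum(), q / max(q.sum(), 1e-12)
        mask = p_n > 0
        kl = np.sum(p_n[mask] * np.log(p_n[mask] / np.maximum(q_n[mask], 1e-12)))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return 127.0 / best_t                       # calibrated scale for signed INT8
```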


An affine transformation using 8-bit multipliers and 32-bit accumulators results in Equation 33, with the same scale factors as defined in Equation 32 and $\ast$ denoting convolution:

$$O_{s32} = W_{s8} \ast F_{u8} + b_{s32} \approx s_f s_w \left( W_{f32} \ast F_{f32} + b_{f32} \right) \tag{33}$$

It is an approximation since rounding is ignored. Equation 34 is the affine transformation in FP32 format, where $D$ is the dequantization factor:

$$O_{f32} = W_{f32} \ast F_{f32} + b_{f32} \approx \frac{1}{s_f s_w} O_{s32} = D \times O_{s32}, \qquad D = \frac{1}{s_f s_w} \tag{34}$$

Weight quantization is done prior to inference. Activation quantization factors are prepared by sampling the validation dataset to find a suitable range (similar to Tensor RT). The quantization factors can be either FP32 on supported devices, or rounded to the nearest power-of-two format to enable bit-shifts. Rounding reduces accuracy by about 1%. MKL-DNN assumes activations are non-negative (ReLU activated). Local Response Normalization (LRN), a function to pick the local maximum in a local distribution, is used to avoid over-fitting. BN, FCL, and soft-max using 8-bit inference are not currently supported.

TensorFlow-Lite (TF-Lite) [1] is an open source framework by Google for performing inference on mobile or embedded devices. It consists of two sets of tools for converting and interpreting quantized networks. Both PTQ and QAT are available in TF-Lite.

GEMM low-precision (Gemmlowp) [78] is a Google open source gemm library for low precision calculations on mobile and embedded devices. It is used in TF-Lite. Gemmlowp uses asymmetric quantization as shown in Equation 35, where $F$, $W$, and $O$ denote features, weights, and outputs, respectively. $s_f$ and $s_w$ are the scales for features and weights, respectively. $F_{f32}$ is the feature value in 32-bit floating-point; similarly, $W_{f32}$ is the weight value in 32-bit floating-point. $F_q$ and $W_q$ are the quantized features and weights, respectively. Asymmetric quantization introduces the zero points ($z_f$ and $z_w$), which produces a more accurate numerical encoding:

$$O_{f32} = F_{f32} \ast W_{f32} = s_f (F_q + z_f) \ast s_w (W_q + z_w) = s_f \times s_w \times (F_q + z_f) \ast (W_q + z_w) \tag{35}$$

The term $(F_q + z_f) \ast (W_q + z_w)$ in Equation 35 is the most computationally intensive. In addition to the convolution, the zero points also require calculation. Gemmlowp reduces many multiply-add operations by multiplying with all-ones matrices as the bias matrices $P$ and $Q$ in Equation 36. This allows four multiplies to be dispatched in a three stage pipeline [131] to produce the quantized output $O_q$. $F$, $W$, and $z$ are the same as in Equation 35:

$$\begin{aligned}
O_q &= (F_q + z_f \times P) \ast (W_q + z_w \times Q) \\
    &= F_q \ast W_q + z_f \times P \ast W_q + z_w \times Q \ast F_q + z_f \times z_w \times P \ast Q
\end{aligned} \tag{36}$$
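The sketch below illustrates the zero-point expansion of Equation 36 with an integer matrix product (the same algebra applies to im2col-based convolution). The conceptual all-ones matrices P and Q collapse into row and column sums, which is how gemmlowp-style kernels keep the offset terms cheap. It is a sketch of the idea under that assumption, not the gemmlowp kernel itself.

```python
import numpy as np

def affine_matmul(Fq, Wq, zf, zw):
    """Integer product with zero points, expanded as in Equation 36.
    The all-ones matrices P and Q reduce to column/row sums, so the offset
    terms cost only O(MK + KN) extra integer additions."""
    Fq = Fq.astype(np.int64)
    Wq = Wq.astype(np.int64)
    K = Fq.shape[1]
    acc = (Fq @ Wq                                  # Fq * Wq
           + zf * Wq.sum(axis=0, keepdims=True)     # zf * P * Wq  (column sums of Wq)
           + zw * Fq.sum(axis=1, keepdims=True)     # zw * Q * Fq  (row sums of Fq)
           + zf * zw * K)                           # zf * zw * P * Q
    return acc  # multiply by sf * sw to recover Of32 as in Equation 35

# Sanity check against the direct expansion (Fq + zf) * (Wq + zw):
rng = np.random.default_rng(0)
Fq = rng.integers(0, 256, size=(2, 4))
Wq = rng.integers(-127, 128, size=(4, 3))
assert np.array_equal((Fq + 3) @ (Wq - 2), affine_matmul(Fq, Wq, 3, -2))
```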

Ristretto [90] is a tool for Caffe quantization. It uses re-training to adjust the quantized parameters. Ristretto uses a three-part quantization strategy: 1) a modified fixed-point format, Dynamic Fixed Point (DFP), which permits the limited bit-width precision to dynamically carry data, 2) bit-width reduced floating-point numbers called minifloat which follow the IEEE-754 standard [219], and 3) integer power-of-2 weights that force parameters into power of 2 values to replace multiplies with bit shift operations.

DFP is shown in Equation 37, where $s$ takes one sign bit, $FL$ denotes the fractional length, and $x$ is the mantissa. The total bit-width is $B$. This quantization can encode data from various ranges into a proper format by adjusting the fractional length:

$$(-1)^{s} \cdot 2^{-FL} \sum_{i=0}^{B-2} 2^{i} \cdot x_i \tag{37}$$

A bit shift convolution conversion is shown in Equation 38. The convolution of inputs $F_j$ and weights $W_j$ with bias $b_i$ is transformed into shift arithmetic by rounding the weights to the nearest power of 2 values. Power of 2 weights provide inference acceleration while dynamic fixed point provides better accuracy:

$$O_i = \sum_j F_j \cdot W_j + b_i \approx \sum_j \left( F_j \ll \operatorname{round}\left(\log_2 W_j\right) \right) + b_i \tag{38}$$

NCNN [229] is a standalone framework from Tencent for efficient inference on mobile devices. Inspired by Ristretto and Tensor-RT, it works with multiple operating systems and supports low precision inference [28]. It performs channel-wise quantization with KL calibration. The quantization results in 0.04% top-1 accuracy loss on ILSVRC-2012. NCNN has implementations optimized for ARM NEON. NCNN also replaces 3 × 3 convolutions with simpler Winograd convolutions [135].

Mobile AI Compute Engine (MACE) [247] from Xiaomi supports both post-training quantization and quantization-aware training. Quantization-aware training is recommended as it exhibits lower accuracy loss. Post-training quantization requires statistical information from activations collected while performing inference. This is typically performed with batch calibration of input data. MACE also supports processor implementations optimized for ARM NEON and Qualcomm's Hexagon DSP. OpenCL acceleration is also supported. Winograd convolutions can be applied for further acceleration as discussed in Section 4.2.2.

Quantized Neural Network PACKage (QNNPACK) [61] is a Facebook produced open-source library optimized for edge computing, especially for mobile low precision neural network inference. It has the same method of quantization as TF-Lite, including the use of a zero point. The library has been integrated into PyTorch [193] to provide users a high-level API. In addition to Winograd and FFT convolution operations, the library has a gemm optimized for cache indexing and feature packing. QNNPACK has a fully compiled solution for many mobile devices and has been deployed on millions of devices with Facebook applications.

Panel Dot product (PDOT) is a key feature of QNNPACK's highly efficient gemm library. It assumes computing efficiency is limited by memory, cache, and bandwidth instead of Multiply and Accumulate (MAC) performance. PDOT computes multiple dot products in parallel as shown in Figure 16. Rather than loading just two operands per MAC operation, PDOT loads multiple columns and rows. This improves convolution performance by about a 1.41× to 2.23× speedup for MobileNet on mobile devices [61].

[Figure 16: PDOT: computing dot product for several points in parallel.]

Paddle [13] applies both QAT and PTQ quantization using zero-points. The dequantization operation can be performed prior to convolution as shown in Equation 39. Paddle uses this feature to do floating-point gemm-based convolutions with quantize-dequantized weights and features within the framework data-path. It introduces quantization error while maintaining the data in floating-point format. This quantize-dequantize-convolution pipeline is called simu-quantize and its results are approximately equal to a FP32->INT8->Convolution->FP32 (quantize - convolution - dequantize) three stage model. Simu-quantize maintains the data at each phase in 32-bit floating-point, facilitating backward propagation. In the Paddle framework, during backpropagation, gradients are added to the original 32-bit floating-point weights rather than the quantized or the quantize-dequantized weights.

$$O_{f32} = \left( \frac{F_q}{n-1} \times F_{max} \right) \ast \left( \frac{W_q}{n-1} \times W_{max} \right) \tag{39}$$

Paddle uses max-abs in three ways to quantize parameters: 1) the average of the max absolute value in a calculation window, 2) the max absolute value during a calculation window, and 3) a sliding average of the max absolute value of the window. The third method is described in Equation 40, where $V$ is the max absolute value in the current batch, $V_t$ is the average value of the sliding window, and $k$ is a coefficient chosen by default as 0.9:

$$V_t = (1-k) \times V + k \times V_{t-1} \tag{40}$$

The Paddle framework uses a specialized toolset, PaddleSlim, which supports quantization, pruning, network architecture search, and knowledge distillation. They found an 86.47% size reduction of ResNet-50 with 1.71% ILSVRC-2012 top-1 accuracy loss.
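As a small illustration of the range tracking and simulated quantization described above, the sketch below implements the sliding max-abs average of Equation 40 and a quantize-dequantize ("simu-quantize") pass that keeps tensors in FP32 while injecting INT8 rounding error. It is a minimal sketch of the idea, not the PaddleSlim implementation; the class and function names are ours.

```python
import numpy as np

class SlidingAbsMax:
    """Sliding average of the per-batch max absolute value (Equation 40)."""
    def __init__(self, k=0.9):
        self.k = k              # smoothing coefficient; 0.9 is the stated default
        self.v_t = None

    def update(self, batch):
        v = float(np.max(np.abs(batch)))
        self.v_t = v if self.v_t is None else (1 - self.k) * v + self.k * self.v_t
        return self.v_t

def simu_quantize(x, v_max, num_bits=8):
    """Quantize-dequantize a tensor against the tracked range. The output stays
    FP32 but carries INT8 rounding error, mimicking the FP32->INT8->FP32
    three stage model described above."""
    n = 2 ** (num_bits - 1) - 1              # 127 for signed 8-bit
    scale = v_max / n
    q = np.clip(np.rint(x / scale), -n, n)   # fake-quantized integer grid
    return q * scale                         # back to FP32 for the convolution

# Example: track the activation range over calibration batches, then fake-quantize.
tracker = SlidingAbsMax()
for _ in range(10):
    batch = np.random.randn(8, 16).astype(np.float32)
    v_max = tracker.update(batch)
batch_fq = simu_quantize(batch, v_max)
```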

4.3.3. Hardware Platforms

Figure 17 shows AI chips, cards, and systems plotted by peak operations versus power in log scale, originally published in [202]. Three normalizing lines are shown at 100 GOPS/Watt, 1 TOP/Watt, and 10 TOPs/Watt. Hardware platforms are classified along several dimensions including: 1) training or inference, 2) chip, card, or system form factors, 3) datacenter or mobile, and 4) numerical precision. We focus on low precision general and specialized hardware in this section.

Programmable Hardware: Quantized networks with less than 8-bits of precision are typically implemented in FPGAs but may also be executed on general purpose processors. BNNs have been implemented on a Xilinx Zynq heterogeneous FPGA platform [267]. They have also been implemented on Intel Xeon CPUs and Intel Arria 10 FPGA heterogeneous platforms by dispatching bit operations to FPGAs and other operations to CPUs [178]. The heterogeneous system shares the same memory address space. Training is typically mapped to CPUs. FINN [231] is a specialized framework for BNN inference on FPGAs. It contains binarized fully connected, convolutional, and pooling layers. When deployed on a Zynq-7000 SoC, FINN has achieved 12.36 million images per second on the MNIST dataset with 4.17% accuracy loss.

Binarized weights with 3-bit features have been implemented on Xilinx Zynq FPGAs and Arm NEON processors [196]. The first and last layers of the network use 8-bit quantities but all other layers use binary weights and 3-bit activation values. On an embedded platform, the Zynq XCZU3EG, they performed 16 images per second for inference. To accelerate Tiny-YOLO inference, significant efforts were taken including: 1) replacing max-pool with stride 2 convolution, 2) replacing leaky ReLU with ReLU, and 3) revising the hidden layer output channels. These changes improved throughput on the FPGA from 2.5 to 5 frames per second with 1.3% accuracy loss.


Table 3: Low Precision Libraries versus Accuracy for Common Networks in Multiple Frameworks.

Name | Framework | Method | Float Top-1 | Float Top-5 | Quant Top-1 | Quant Top-5 | Diff Top-1 | Diff Top-5
AlexNet | TensorRT [173] | PTQ, w/o offset | 57.08% | 80.06% | 57.05% | 80.06% | -0.03% | 0.00%
AlexNet | Ristretto [90] | Dynamic FP | 56.90% | 80.09% | 56.14% | 79.50% | -0.76% | -0.59%
AlexNet | Ristretto [90] | Minifloat | 56.90% | 80.09% | 52.26% | 78.23% | -4.64% | -1.86%
AlexNet | Ristretto [90] | Pow-of-two | 56.90% | 80.09% | 53.57% | 78.25% | -3.33% | -1.84%
GoogleNet | NCNN [28] | PTQ, w/o offset | 68.50% | 88.84% | 68.62% | 88.68% | 0.12% | -0.16%
GoogleNet | TensorRT [173] | PTQ, w/o offset | 68.57% | 88.83% | 68.12% | 88.64% | -0.45% | -0.19%
GoogleNet | Ristretto [90] | Dynamic FP | 68.93% | 89.16% | 68.37% | 88.63% | -0.56% | -0.53%
GoogleNet | Ristretto [90] | Minifloat | 68.93% | 89.16% | 64.02% | 87.69% | -4.91% | -1.47%
GoogleNet | Ristretto [90] | Pow-of-two | 68.93% | 89.16% | 57.63% | 81.38% | -11.30% | -7.78%
Inception v3 | TF-Lite [77] | PTQ | 78.00% | 93.80% | 77.20% | - | -0.80% | -
Inception v3 | TF-Lite [77] | QAT | 78.00% | 93.80% | 77.50% | 93.70% | -0.50% | -0.10%
MobileNet v1 | NCNN [28] | PTQ, w/o offset | 67.26% | 87.92% | 66.74% | 87.43% | -0.52% | -0.49%
MobileNet v1 | Paddle [13] | QAT+Pruning | 70.91% | - | 69.20% | - | -1.71% | -
MobileNet v1 | TF-Lite [77] | PTQ | 70.90% | - | 65.70% | - | -5.20% | -
MobileNet v1 | TF-Lite [77] | QAT | 70.90% | - | 70.00% | - | -0.90% | -
MobileNet v2 | QNNPACK [61] | PTQ, w/ offset | 71.90% | - | 72.14% | - | 0.24% | -
MobileNet v2 | TF-Lite [77] | PTQ | 71.90% | - | 63.70% | - | -8.20% | -
MobileNet v2 | TF-Lite [77] | QAT | 71.90% | - | 70.90% | - | -1.00% | -
ResNet-101 | TensorRT [173] | PTQ, w/o offset | 74.39% | 91.78% | 74.40% | 91.73% | 0.01% | -0.05%
ResNet-101 | TF-Lite [77] | PTQ | 77.00% | - | 76.80% | - | -0.20% | -
ResNet-152 | TensorRT [173] | PTQ, w/o offset | 74.78% | 91.82% | 74.70% | 91.78% | -0.08% | -0.04%
ResNet-18 | NCNN [28] | PTQ, w/o offset | 65.49% | 86.56% | 65.30% | 86.52% | -0.19% | -0.04%
ResNet-50 | NCNN [28] | PTQ, w/o offset | 71.80% | 89.90% | 71.76% | 90.06% | -0.04% | 0.16%
ResNet-50 | TensorRT [173] | PTQ, w/o offset | 73.23% | 91.18% | 73.10% | 91.06% | -0.13% | -0.12%
SqueezeNet | NCNN [28] | PTQ, w/o offset | 57.78% | 79.88% | 57.82% | 79.84% | 0.04% | -0.04%
SqueezeNet | Ristretto [90] | Dynamic FP | 57.68% | 80.37% | 57.21% | 79.99% | -0.47% | -0.38%
SqueezeNet | Ristretto [90] | Minifloat | 57.68% | 80.37% | 54.80% | 78.28% | -2.88% | -2.09%
SqueezeNet | Ristretto [90] | Pow-of-two | 57.68% | 80.37% | 41.60% | 67.37% | -16.08% | -13.00%
VGG-19 | TensorRT [173] | PTQ, w/o offset | 68.41% | 88.78% | 68.38% | 88.70% | -0.03% | -0.08%

TNN [6] is deployed on an FPGA with specialized computation units optimized for ternary value multiplication. A specific FPGA structure (dimensions) is determined during synthesis to improve hardware efficiency. On the Sakura-X FPGA board they achieved 255k MNIST image classifications per second with an accuracy of 98.14%. A scalable design implemented on a Xilinx Virtex-7 VC709 board dramatically reduced hardware resources and power consumption but at a significantly reduced throughput of 27k CIFAR-10 images per second [197]. Power consumption for CIFAR-10 was 6.8 Watts.

Reducing hardware costs is a key objective of logarithmic hardware. Xu [249] adopted a √2-based logarithmic quantization with 5-bits of resolution. This showed 50.8% top-1 accuracy and dissipated a quarter of the power while using half the chip area. Half precision inference has a top-1 accuracy of 53.8%.

General Hardware: In addition to specialized hardware, INT8 quantization has been widely adopted in many general purpose processor architectures. In this section we provide a high-level overview. A detailed survey on hardware efficiency for processing DNNs can be found in [202].

CNN acceleration on ARM CPUs was originally implemented with the ARM advanced SIMD extensions known as NEON. The ARM 8.2 ISA extension added NEON support for 8-bit integer matrix operations [8]. These were implemented in the Cortex-A75 and A55 CPU IP cores [9] as well as the Mali-G76 GPU IP core [10]. These cores have been integrated into the Kirin SoC by Huawei, the Qualcomm Snapdragon SoC, the MediaTek Helio SoC, and the Samsung Exynos [116]. For example, on the Exynos 9825 Octa, an 8-bit integer quantized MobileNet v2 can process an image in 19ms (52 images per second) using the Mali-G76 [116].

Intel improved integer performance by about 33% with the Intel Advanced Vector Extension 512 (AVX-512) ISA [204]. This 512-bit SIMD ISA extension included a Fused Multiply-Add (FMA) instruction.


[Figure 17: Hardware platforms for efficient neural network deployment, plotted as peak performance (GOps/s to TeraOps/s) versus peak power (W), with legend dimensions for computation precision, form factor (chip, card, system), and computation type (inference or training); adapted from [202]. Slide courtesy of Albert Reuther, MIT Lincoln Laboratory Supercomputing Center.]

Low precision computation on nVidia GPUs has been enabled since the Pascal series of GPUs [184]. The Turing GPU architecture [188] introduced specialized units to process INT4 and INT8. This provides real-time integer performance for AI algorithms used in games. For embedded platforms, nVidia developed the Jetson platforms [187]. They use CUDA Maxwell cores [183] that can process half-precision types. For the data center, nVidia developed the extremely high performance DGX system [185]. It contains multiple high-end GPUs interconnected using nVidia's proprietary nVLINK. A DGX system can perform 4-bit integer to 32-bit floating point operations.

4.3.4. DNN Compilers

Heterogeneous neural network hardware accelerators are accelerating deep learning algorithm deployment [202]. Often exchange formats can be used to import/export models. Further, compilers have been developed to optimize models and generate code for specific processors. However, several challenges remain:

• Network Parsing: Developers design neural network models on different platforms using various frameworks and programming languages. However, they have common parts, such as convolution, activation, pooling, etc. Parsing tools analyze the model compositions and transfer them into a unified representation.

• Structure Optimization: The model may contain operations used in training that aren't required for inference. Tool-kits and compilers should optimize these structures (e.g. BN folding as discussed in Section 2.5).

• Intermediate Representation (IR): An optimized model should be properly stored for further deployment. Since the inference engine is uncertain, the stored IR should include the model architecture and the trained weights. A compiler can then read the model and optimize it for a specific inference engine.

• Compression: Compilers and optimizers should optionally be able to automatically compress arbitrary network structures using pruning and quantization.

• Deployment: The final optimized model should be mapped to the target engine(s) which may be heterogeneous.

Open Neural Network Exchange (ONNX) [190] is an open-source tool to parse AI models written for a variety of frameworks. It imports and exports models using an open-source format, facilitating the translation of neural network models between frameworks. It is thus capable of network parsing provided low-level operations are defined in all target frameworks.

TVM [36], Glow [205], OpenVINO [118], and MLIR [134] are deep learning compilers. They differ from frameworks such as Caffe in that they store intermediate representations and optimize those to map models onto specific hardware engines. They typically integrate both quantization-aware training and calibration-based post-training quantization. We summarize key features below. They perform all the operations noted in our list. A detailed survey can be found in [149].

TVM [36] leverages the efficiency of quantization by enabling deployment of quantized models from PyTorch and TF-Lite. As a compiler, TVM has the ability to map the model onto general hardware such as Intel's AVX and nVidia's CUDA.
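As a concrete example of the parsing and IR steps above, the snippet below exports a PyTorch model to ONNX so that a compiler such as TVM, Glow, or OpenVINO can ingest and optimize it. The model choice, opset version, and tensor names are illustrative assumptions, and a reasonably recent PyTorch/torchvision is assumed.

```python
import torch
import torchvision

# Build a FP32 model (any trained network would do; MobileNet v2 is illustrative).
model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input that fixes the graph shape

# Export a framework-neutral intermediate representation to disk.
torch.onnx.export(
    model, dummy_input, "mobilenet_v2.onnx",
    opset_version=13,
    input_names=["input"], output_names=["logits"],
)
# A downstream compiler can now parse "mobilenet_v2.onnx", fold BN, quantize,
# and generate code for a specific inference engine.
```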


Glow [205] enables quantization with zero points and converts the data into 8-bit signed integers using a calibration-based method. Neither Glow nor TVM currently supports quantization-aware training, although both have announced future support for it [205].

MLIR [134] and OpenVINO [118] have sophisticated quantization support including quantization-aware training. OpenVINO integrates it in TensorFlow and PyTorch while MLIR natively supports quantization-aware training. This allows users to fine-tune an optimized model when it doesn't satisfy accuracy criteria.

4.4. Quantization Reduces Over-fitting

In addition to accelerating neural networks, quantization has also been found in some cases to result in higher accuracy. As examples: 1) 3-bit weight VGG-16 outperforms its full precision counterpart by 1.1% top-1 [144], 2) AlexNet reduces the top-1 error of the reference by 1.0% with 2-bit weights and 8-bit activations [66], 3) ResNet-34 with 4-bit weights and activations obtained 74.52% top-1 accuracy while the 32-bit version is 73.59% [174], 4) Zhou showed a quantized model reduced the classification error by 0.15%, 2.28%, 0.13%, 0.71%, and 1.59% on AlexNet, VGG-16, GoogLeNet, ResNet-18 and ResNet-50, respectively [269], and 5) Xu showed reduced bit quantized networks help to reduce over-fitting on Fully Connected Networks (FCNs). By taking advantage of strict constraints in biomedical image segmentation they improved segmentation accuracy by 1% combined with a 6.4× memory usage reduction [251].

5. Summary

In this section we summarize the results of Pruning and Quantization.

5.1. Pruning

Section 3 shows pruning is an important technique for compressing neural networks. In this paper, we discussed pruning techniques categorized as 1) static pruning and 2) dynamic pruning. Previously, static pruning was the dominant area of research. Recently, dynamic pruning has become a focus because it can further improve performance even if static pruning has first been performed.

Pruning can be performed in multiple ways. Element-wise pruning improves weight compression and storage. Channel-wise and shape-wise pruning can be accelerated with specialized hardware and computation libraries. Filter-wise and layer-wise pruning can dramatically reduce computational complexity.

Though pruning sometimes introduces an incremental improvement in accuracy by escaping a local minimum [12], accuracy improvements are better realized by switching to a better network architecture [24]. For example, a separable block may provide better accuracy with reduced computational complexity [105]. Considering the evolution of network structures, performance may also be bottlenecked by the structure itself. From this point of view, Network Architecture Search and Knowledge Distillation can be options for further compression. Network pruning can be viewed as a subset of NAS but with a smaller search space. This is especially true when the pruned architecture no longer needs to use weights from the unpruned network (see Section 3.3). In addition, some NAS techniques can also be applied to the pruning approach, including borrowing trained coefficients and reinforcement learning search.

Typically, compression is evaluated on large data-sets such as the ILSVRC-2012 dataset with one thousand object categories. In practice, resource constraints in embedded devices don't allow a large capacity of optimized networks. Compressing a model to best fit a constrained environment should consider, but not be limited to, the deployment environment, target device, speed/compression trade-offs, and accuracy requirements [29].

Based on the reviewed pruning techniques, we recommend the following for effective pruning (a sketch of the basic element-wise and channel-wise magnitude criteria follows the list):

• Uniform pruning introduces accuracy loss, therefore setting the pruning ratio to vary by layer is better [159].

• Dynamic pruning may result in higher accuracy and maintain higher network capacity [246].

• Structurally pruning a network may benefit from maturing libraries, especially when pruning at a high level [241].

• Training a pruned model from scratch is sometimes, but not always (see Section 3.3), more efficient than tuning from the unpruned weights [160].

• Penalty-based pruning typically reduces accuracy loss compared with magnitude-based pruning [255]. However, recent efforts are narrowing the gap [72].
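The sketch below illustrates the two basic criteria referred to above: magnitude-based element-wise pruning and l1-norm channel-wise pruning. It is a didactic sketch of the criteria only (function names are ours); practical pruning additionally schedules the sparsity ratio and fine-tunes the remaining weights.

```python
import numpy as np

def elementwise_mask(w, sparsity):
    """Magnitude-based element-wise pruning: zero the smallest |w| entries."""
    k = min(int(sparsity * w.size), w.size - 1)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return (np.abs(w) >= threshold).astype(w.dtype)     # multiply into w to prune

def channel_mask(w, sparsity):
    """Channel-wise pruning by the l1-norm of each output filter.
    w has shape (out_channels, in_channels, kh, kw)."""
    norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    k = min(int(sparsity * w.shape[0]), w.shape[0] - 1)
    threshold = np.sort(norms)[k]
    return norms >= threshold                           # boolean mask over filters

# Example: prune 50% of the weights of a random 4x3x3x3 convolution kernel.
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
w_sparse = w * elementwise_mask(w, 0.5)
kept_filters = channel_mask(w, 0.5)
```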

5.2. Quantization

Section 4 discusses quantization techniques. It describes binarized quantized neural networks and reduced precision networks, along with their training methods. We described low-bit dataset validation techniques and results. We also list the accuracy of popular quantization frameworks and describe hardware implementations in Section 4.3.

Quantization usually results in a loss of accuracy due to information lost during the quantization process. This is particularly evident on compact networks. Most of the early low bit quantization approaches only compare performance on small datasets (e.g., MNIST and CIFAR-10) [58, 94, 156, 200, 235, 269]. However, observations showed that some quantized networks could outperform the original network (see Section 4.4). Additionally, non-uniformly distributed data may lead to further deterioration in quantization performance [275]. Sometimes this can be ameliorated by normalization in fine-tuning [172] or by non-linear quantization (e.g., log representation) [175].

Advanced quantization techniques have improved accuracy. Asymmetric quantization [120] maintains a higher dynamic range by using a zero point in addition to a regular scale parameter. Overheads introduced by the zero point were minimized by pipelining the processing unit. Calibration-based quantization [173] removed zero points and replaced them with precise scales obtained from a calibrating dataset. Quantization-aware training was shown to further improve quantization accuracy.

8-bit quantization is widely applied in practice as a good trade-off between accuracy and compression. It can easily be deployed on current processors and custom hardware. Minimal accuracy loss is experienced, especially when quantization-aware training is enabled. Binarized networks have also achieved reasonable accuracy with specialized hardware designs.

Though BN has advantages that help training and pruning, an issue with BN is that it may require a large dynamic range across a single layer kernel or between different channels. This may make layer-wise quantization more difficult. Because of this, per-channel quantization is recommended [131].

To achieve better accuracy following quantization, we recommend the following (a sketch contrasting per-layer and per-channel scales follows the list):

• Use asymmetrical quantization. It preserves flexibility over the quantization range even though it has computational overheads [120].

• Quantize the weights rather than the activations. Activations are more sensitive to numerical precision [75].

• Do not quantize biases. They do not require significant storage. High precision biases in all layers [114], and in the first/last layers [200, 272], maintain higher network accuracy.

• Quantize kernels channel-wise instead of layer-wise to significantly improve accuracy [131].

• Fine-tune the quantized model. It reduces the accuracy gap between the quantized model and the real-valued model [244].

• Initially train using a 32-bit floating point model. Low-bit quantized models can be difficult to train from scratch, especially compact models on large-scale data-sets [272].

• The sensitivity to quantization is ordered as gradients, then activations, and then weights [272].

• Stochastic quantization of gradients is necessary when training quantized models [89, 272].
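To illustrate the per-channel recommendation, the sketch below computes max-abs scales per-layer and per-output-channel; channels with a small dynamic range keep more of the 8-bit resolution under the per-channel scheme. A minimal sketch, with function names of our choosing.

```python
import numpy as np

def per_layer_scale(w, num_bits=8):
    """One max-abs scale shared by the whole layer."""
    return np.max(np.abs(w)) / (2 ** (num_bits - 1) - 1)

def per_channel_scales(w, num_bits=8):
    """One max-abs scale per output channel; w has shape (out_channels, ...)."""
    flat = np.abs(w).reshape(w.shape[0], -1)
    return flat.max(axis=1) / (2 ** (num_bits - 1) - 1)

# A kernel whose channels have very different ranges: per-layer quantization
# wastes most of the integer grid on the small-range channel.
w = np.stack([0.01 * np.random.randn(3, 3, 3), 10.0 * np.random.randn(3, 3, 3)])
print(per_layer_scale(w))       # single coarse step size driven by the large channel
print(per_channel_scales(w))    # fine step for channel 0, coarse step for channel 1
```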

6. Future Work

Although pruning and quantization algorithms help reduce the computation cost and bandwidth burden, there are still areas for improvement. In this section we highlight future work to further improve quantization and pruning.

Automatic Compression. Low bit width quantization can cause significant accuracy loss, especially when the quantized bit-width is very narrow and the dataset is large [272, 155]. Automatic quantization is a technique to automatically search quantization encodings to evaluate accuracy loss versus compression ratio. Similarly, automatic pruning is a technique to automatically search different pruning approaches to evaluate the sparsity ratio versus accuracy. Similar to hyperparameter tuning [257], this can be performed without human intervention using any number of search techniques (e.g. random search, genetic search, etc.).

Compression on Other Types of Neural Networks. Current compression research is primarily focused on CNNs. More specifically, research is primarily directed towards CNN classification tasks. Future work should also consider other types of applications such as object detection, speech recognition, language translation, etc. Network compression versus accuracy for different applications is an interesting area of research.

Hardware Adaptation. Hardware implementations may limit the effectiveness of pruning algorithms. For example, element-wise pruning only slightly reduces computations or bandwidth when using im2col-gemm on a GPU [264]. Similarly, shape-wise pruning is not typically able to be implemented on dedicated CNN accelerators. Hardware co-design of compression techniques for hardware accelerators should be considered to achieve the best system efficiency.

Global Methods. Network optimizations are typically applied separately, without information from one optimization informing any other optimization. Recently, approaches that consider optimization effectiveness at multiple layers have been proposed. [150] discusses pruning combined with tensor factorization that results in better overall compression. Similar techniques can be considered using different types and levels of compression and factorization.

7. Conclusions

Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve.

Pruning can be categorized as static (Section 3.1) if it is performed offline or dynamic (Section 3.2) if it is performed at run-time. The criterion applied to remove redundant computations is often just the magnitude of weights, with values near zero being pruned. More complicated methods include checking the lp-norm. Techniques such as LASSO and Ridge are built around the l1 and l2 norms. Pruning can be performed element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise. Each has trade-offs in compression, accuracy, and speedup.

Quantization reduces computations by reducing the precision of the datatype. Most networks are trained using 32-bit floating point. Weights, biases, and activations may then be quantized, typically to 8-bit integers. Lower bit width quantizations have been performed, with a single bit being termed a binary neural network. It is difficult to (re)train very low bit width neural networks. A single bit is not differentiable, thereby prohibiting back propagation. Lower bit widths cause difficulties for computing gradients. The advantage of quantization is significantly improved performance (usually 2-3x) and dramatically reduced storage requirements. In addition to describing how quantization is performed, we also included an overview of popular libraries and frameworks that support quantization. We further provided a comparison of accuracy for a number of networks using different frameworks in Table 3.

In this paper, we summarized pruning and quantization techniques. Pruning removes redundant computations that have limited contribution to a result. Quantization reduces computations by reducing the precision of the datatype. Both can be used independently or in combination to reduce storage requirements and accelerate inference.


8. Quantization Performance Results Model Deployment W A Top-1 Top-5 Ref. ShiftCNN 1 4 11.26% 7.36% [84] Table 4: Quantization Network Performance on LogQuant 32 3 13.50% 8.93% [30] ILSVRC2012 for various bit-widths of the weights W LogQuant 3 3 18.07% 12.85% [30] LogQuant 4 4 18.57% 13.21% [30] and activation A (aka. feature) LogQuant 32 4 18.57% 13.21% [30] BNN 1 1 24.20% 20.90% [53] AngleEye 6 6 52.10% 57.35% [85] Bit-width Acc. Drop Model Deployment Ref. W A Top-1 Top-5 MobileNet HAQ-Cloud 6 6 -0.38% -0.23% [236] V1 HAQ-Edge 6 6 -0.38% -0.34% [236] AlexNet QuantNet 1 32 -1.70% -1.50% [253] MelinusNet59 1 1 -0.10% - [21] BWNH 1 32 -1.40% -0.70% [107] HAQ-Edge 5 5 0.24% 0.08% [236] SYQ 2 8 -1.00% -0.60% [66] PACT 6 6 0.36% 0.26% [44] TSQ 2 2 -0.90% -0.30% [239] PACT 6 6 0.36% 0.26% [44] INQ 5 32 -0.87% -1.39% [269] HAQ-Cloud 5 5 0.85% 0.48% [236] PACT 4 3 -0.60% -1.00% [44] HAQ-Edge 4 4 3.42% 1.95% [236] QIL 4 4 -0.20% - [127] PACT 5 5 3.82% 2.20% [44] Mixed-Precision 16 16 -0.16% - [172] PACT 5 5 3.82% 2.20% [44] PACT 32 5 -0.10% -0.20% [44] HAQ-Cloud 4 4 5.49% 3.25% [236] QIL 5 5 -0.10% - [127] PACT 4 4 8.38% 5.66% [44] QuantNet 3(±4) 32 -0.10% -0.10% [253] MobileNet HAQ-Edge 6 6 -0.08% -0.11% [236] ELNN 3(±4) 32 0.00% 0.20% [144] V2 HAQ-Cloud 6 6 -0.04% 0.01% [236] DoReFa-Net 32 3 0.00% -0.90% [272] Unified INT8 8 8 0.00% - [275] TensorRT 8 8 0.03% 0.00% [173] PACT 6 6 0.56% 0.25% [44] PACT 2 2 0.10% -0.70% [44] HAQ-Edge 5 5 0.91% 0.34% [236] PACT 32 2 0.20% -0.20% [44] HAQ-Cloud 5 5 2.36% 1.31% [236] DoReFa-Net 32 5 0.20% -0.50% [272] PACT 5 5 2.97% 1.67% [44] QuantNet 3(±2) 32 0.30% 0.00% [253] HAQ-Cloud 4 4 4.80% 2.79% [236] DoReFa-Net 32 4 0.30% -0.50% [272] HAQ-Edge 4 4 4.82% 2.92% [236] WRPN 2 32 0.40% - [174] PACT 4 4 10.42% 6.53% [44] DFP16 16 16 0.49% 0.59% [54] PACT 3 2 0.50% -0.10% [44] ResNet-18 RangeBN 8 8 -0.60% - [15] PACT 4 2 0.50% -0.10% [44] LBM 8 8 -0.60% - [268] SYQ 1 8 0.50% 0.80% [66] QuantNet 5 32 -0.30% -0.10% [253] QIL 3 3 0.50% - [127] QIL 5 5 -0.20% - [127] FP8 8 8 0.50% - [237] QuantNet 3(±4) 32 -0.10% -0.10% [253] BalancedQ 32 2 0.60% -2.00% [273] ShiftCNN 3 4 0.03% 0.12% [84] ELNN 3(±2) 32 0.80% 0.60% [144] LQ-NETs 4 32 0.20% 0.50% [262] SYQ 1 4 0.90% 0.80% [66] QIL 3 32 0.30% 0.30% [127] QuantNet 2 32 0.90% 0.30% [253] LPBN 32 5 0.30% 0.40% [31] FFN 2 32 1.00% 0.30% [238] QuantNet 3(±2) 32 0.40% 0.20% [253] DoReFa-Net 32 2 1.00% 0.10% [272] PACT 32 4 0.40% 0.30% [44] Unified INT8 8 8 1.00% - [275] SeerNet 4 1 0.42% 0.18% [32] DeepShift-PS 6 32 1.19% 0.67% [62] ShiftCNN 2 4 0.54% 0.34% [84] WEQ 4 4 1.20% 1.00% [192] PACT 5 5 0.60% 0.30% [44] LQ-NETs 2 32 1.30% 0.80% [262] INQ 4 32 0.62% 0.10% [269] SYQ 2 2 1.30% 1.00% [66] Unified INT8 8 8 0.63% - [275] LQ-NETs 1 2 1.40% 1.40% [262] QIL 5 5 0.80% - [127] BalancedQ 2 2 1.40% -1.00% [273] LQ-NETs 3(±4) 32 0.90% 0.80% [262] WRPN-2x 8 8 1.50% - [174] QIL 3 3 1.00% - [127] DoReFa-Net 1 4 1.50% - [272] DeepShift-Q 6 32 1.09% 0.47% [62] DeepShift-Q 6 32 1.55% 0.81% [62] ELNN 3(±2) 32 1.10% 0.70% [144] WRPN-2x 32 8 1.60% - [174] PACT 32 3 1.20% 0.70% [44] WEQ 3 4 1.60% 1.10% [192] PACT 4 4 1.20% 0.60% [44] WRPN-2x 8 4 1.70% - [174] QuantNet 2 32 1.20% 0.60% [253] WRPN-2x 4 8 1.70% - [174] ELNN 3(±4) 32 1.30% 0.60% [144] SYQ 1 2 1.70% 1.60% [66] DeepShift-PS 6 32 1.44% 0.67% [62] ELNN 2 32 1.80% 1.80% [144] ABC-Net 5 32 1.46% 1.18% [155] WRPN-2x 4 4 1.90% - [174] ELNN 3(±2) 32 1.60% 1.10% [144] WRPN-2x 32 4 1.90% - [174] DoReFa-Net 32 5 1.70% 1.00% [272] SYQ 2 8 1.90% 1.40% [66] GoogLeNet Mixed-Precision 16 16 
-0.10% - [172] DoReFa-Net 32 4 1.90% 1.10% [272] DeepShift-PS 6 32 -0.09% -0.09% [62] LQ-NETs 3 3 2.00% 1.60% [262] DFP16 16 16 -0.08% 0.00% [54] DoReFa-Net 5 5 2.00% 1.30% [272] AngleEye 16 16 0.05% 0.45% [85] ELNN 2 32 2.10% 1.50% [144] AngleEye 16 16 0.05% 0.45% [85] QIL 2 32 2.10% 1.30% [127] ShiftCNN 3 4 0.05% 0.09% [84] DoReFa-Net 32 3 2.10% 1.40% [272] DeepShift-Q 6 32 0.27% 0.29% [62] QIL 4 4 2.20% - [127] LogQuant 32 6 0.36% 0.28% [30] LQ-NETs 2 32 2.20% 1.60% [262] ShiftCNN 2 4 0.39% 0.29% [84] GroupNet-8 1 1 2.20% 1.40% [276] TensorRT 8 8 0.45% 0.19% [173] PACT 3 3 2.30% 1.40% [44] LogQuant 6 32 0.64% 0.67% [30] DoReFa-Net 4 4 2.30% 1.50% [272] INQ 5 32 0.76% 0.25% [269] TTN 2 32 2.50% 1.80% [274] ELNN 3(±4) 32 2.40% 1.40% [144] TTQ 2 32 2.70% 2.00% [277] ELNN 3(±2) 32 2.80% 1.60% [144] AddNN 32 32 2.80% 1.50% [35] LogQuant 6 6 3.43% 0.78% [30] ELNN 2 32 2.80% 1.50% [144] QNN 4 4 5.10% 7.80% [113] LPBN 32 4 2.90% 1.70% [31] QNN 6 6 5.20% 8.10% [113] PACT 32 2 2.90% 2.00% [44] ELNN 2 32 5.60% 3.50% [144] DoReFa-Net 3 3 2.90% 2.00% [272] BWN 1 32 5.80% 4.80% [200] QuantNet 1 32 3.10% 1.90% [253] AngleEye 8 8 6.00% 3.20% [85] INQ 2 32 3.10% 1.90% [269] TWN 2 32 7.50% 4.80% [146] ELNN 1 32 8.40% 5.70% [144] ResNet-34 WRPN-2x 4 4 -0.93% - [174] BWN 2 32 9.70% 6.50% [200] WRPN-2x 4 8 -0.89% - [174]


Model Deployment W A Top-1 Top-5 Ref. Model Deployment W A Top-1 Top-5 Ref. QIL 4 4 0.00% - [127] SYQ 1 8 5.40% 3.40% [66] QIL 5 5 0.00% - [127] DoReFa-Net 4 4 5.50% 3.30% [272] WRPN-2x 4 2 0.01% - [174] DoReFa-Net 5 5 5.50% -0.20% [272] WRPN-2x 2 4 0.09% - [174] FGQ 2 8 5.60% - [170] WRPN-2x 2 2 0.27% - [174] ABC-Net 5 5 6.30% 3.50% [155] SeerNet 4 1 0.35% 0.17% [32] FGQ-TWN 2 4 6.67% - [170] Unified INT8 8 8 0.39% - [275] HWGQ 1 2 6.90% 4.60% [31] LCCL 0.43% 0.17% [59] ResNet-100 IAO 8 8 1.40% - [120] QIL 3 3 0.60% - [127] ResNet-101 TensorRT 8 8 -0.01% 0.05% [173] WRPN-3x 1 1 0.90% - [174] FGQ-TWN 2 8 3.65% - [170] WRPN-3x 1 1 1.21% - [174] FGQ-TWN 2 4 6.81% - [170] GroupNet-8 1 1 1.40% 1.00% [276] ResNet-150 IAO 8 8 2.10% - [120] dLAC 2 16 1.67% 0.89% [235] ResNet-152 TensorRT 8 8 0.08% 0.04% [173] LQ-NETs 3 3 1.90% 1.20% [262] dLAC 2 16 1.20% 0.64% [235] GroupNet**-5 1 1 2.70% 2.10% [276] IR-Net 1 32 2.90% 1.80% [199] SqueezeNet AngleEye 16 16 0.00% 0.01% [85] QIL 2 2 3.10% - [127] ShiftCNN 3 4 0.01% 0.01% [84] WRPN-2x 1 1 3.40% - [174] ShiftCNN 2 4 1.01% 0.71% [84] WRPN-2x 1 1 3.74% - [174] AngleEye 8 8 1.42% 1.05% [85] LQ-NETs 2 2 4.00% 2.30% [262] AngleEye 6 6 28.13% 27.43% [85] GroupNet-5 1 1 4.70% 3.40% [276] ShiftCNN 1 4 35.39% 35.09% [84] ABC-Net 5 5 4.90% 3.10% [155] VGG-16 ELNN 3(±4) 32 -1.10% -1.00% [144] HWGQ 1 32 5.10% 3.40% [31] ELNN 3(±2) 32 -0.60% -0.80% [144] WAGEUBN 8 8 5.18% - [254] AngleEye 16 16 0.09% -0.05% [85] ABC-Net 3 3 6.60% 3.90% [155] DFP16 16 16 0.11% 0.29% [54] LQ-NETs 1 2 6.70% 4.40% [262] AngleEye 8 8 0.21% 0.08% [85] LQ-NETs 4 4 6.70% 4.40% [262] SeerNet 4 1 0.28% 0.10% [32] BCGD 1 4 7.60% 4.70% [256] DeepShift-Q 6 32 0.29% 0.11% [62] HWGQ 1 2 9.00% 5.60% [31] FFN 2 32 0.30% -0.20% [238] IR-Net 1 1 9.50% 6.20% [199] DeepShift-PS 6 32 0.47% 0.30% [62] CI-BCNN (add) 1 1 11.07% 6.39% [240] DeepShift-Q 6 32 0.72% 0.29% [62] Bi-Real 1 1 11.10% 7.40% [252] INQ 5 32 0.77% 0.08% [62] WRPN-1x 1 1 12.80% - [174] TWN 2 32 1.10% 0.30% [146] WRPN 1 1 13.05% - [174] ELNN 2 32 2.00% 0.90% [144] CI-BCNN 1 1 13.59% 8.65% [240] TSQ 2 2 2.00% 0.70% [239] DoReFa-Net 1 4 14.60% - [272] AngleEye 16 16 2.15% 1.49% [85] DoReFa-Net 1 2 20.40% - [272] BWN 2 32 2.20% 1.20% [200] ABC-Net 1 1 20.90% 14.80% [155] AngleEye 8 8 2.35% 1.76% [85] BNN 1 1 29.10% 24.20% [272] ELNN 1 32 3.30% 1.80% [144] ResNet-50 Mixed-Precision 16 16 -0.12% - [172] AngleEye 6 6 9.07% 6.58% [85] DFP16 16 16 -0.07% -0.06% [54] AngleEye 6 6 22.38% 17.75% [85] QuantNet 5 32 0.00% 0.00% [253] LogQuant 3 3 - 0.99% [30] LQ-NETs 4 32 0.00% 0.10% [262] LogQuant 4 4 - 0.51% [30] FGQ 32 32 0.00% - [170] LogQuant 6 6 - 0.83% [30] TensorRT 8 8 0.13% 0.12% [173] LogQuant 32 3 - 0.82% [30] PACT 5 5 0.20% -0.20% [44] LogQuant 32 4 - 0.36% [30] QuantNet 3(±4) 32 0.20% 0.00% [253] LogQuant 32 6 - 0.31% [30] Unified INT8 8 8 0.26% - [275] LogQuant 6 32 - 0.76% [30] ShiftCNN 3 4 0.29% 0.15% [84] LDR 5 4 - 0.90% [175] ShiftCNN 3 4 0.31% 0.16% [84] LogNN 5 4 - 1.38% [175] PACT 4 4 0.40% -0.10% [44] LPBN 32 5 0.40% 0.40% [81] ShiftCNN 2 4 0.67% 0.41% [84] DeepShift-Q 6 32 0.81% 0.21% [62] DeepShift-PS 6 32 0.84% 0.31% [62] PACT 5 32 0.90% 0.20% [44] QuantNet 3(±2) 32 0.90% 0.40% [253] PACT 4 32 1.00% 0.20% [44] dLAC 2 16 1.20% - [235] QuantNet 2 32 1.20% 0.60% [253] AddNN 32 32 1.30% 1.20% [35] LQ-NETs 4 4 1.30% 0.80% [262] LQ-NETs 2 32 1.30% 0.90% [262] INQ 5 32 1.32% 0.41% [269] PACT 3 32 1.40% 0.50% [44] IAO 8 8 1.50% - [120] PACT 3 3 1.60% 0.50% [44] HAQ 2MP 4MP 1.91% - [236] HAQ MP MP 2.09% - [236] LQ-NETs 3 3 2.20% 
1.60% [262] LPBN 32 4 2.20% 1.20% [81] Deep Comp. 3 MP 2.29% - [92] PACT 4 2 2.40% 1.20% [44] ShiftCNN 2 4 2.49% 1.64% [84] FFN 2 32 2.50% 1.30% [238] UNIQ 4 8 2.60% - [18] QuantNet 1 32 3.20% 1.70% [253] SYQ 2 8 3.70% 2.10% [66] FGQ-TWN 2 8 4.29% - [170] PACT 2 2 4.70% 2.60% [44] LQ-NETs 2 2 4.90% 2.90% [262]


References jection for Non-Uniform Quantization of Neural Networks. arXiv preprint arXiv:1804.10969 URL: http://arxiv.org/abs/1804.10969. [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, [19] Bengio, E., Bacon, P.L., Pineau, J., Precup, D., 2015. Conditional C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Computation in Neural Networks for faster models. ArXiv preprint Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, URL: http://arxiv.org/abs/1511.06297. R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., [20] Bengio, Y., 2013. Estimating or Propagating Gradients Through Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, Stochastic Neurons. ArXiv preprint URL: http://arxiv.org/abs/ B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, 1305.2982. V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., [21] Bethge, J., Bartz, C., Yang, H., Chen, Y., Meinel, C., 2020. MeliusNet: Yu, Y., Zheng, X., 2016. TensorFlow: Large-Scale Machine Learn- Can Binary Neural Networks Achieve MobileNet-level Accuracy? ing on Heterogeneous Distributed Systems. arXiv preprint arXiv: ArXiv preprint URL: http://arxiv.org/abs/2001.05936. https://arxiv.org/abs/1603.04467 1603.04467 URL: . [22] Bethge, J., Yang, H., Bornstein, M., Meinel, C., 2019. Binary- [2] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., DenseNet: Developing an architecture for binary neural networks. Yu, D., 2014. Convolutional Neural Networks for Speech Recogni- Proceedings - 2019 International Conference on Computer Vision tion. IEEE/ACM Transactions on Audio, Speech, and Language Pro- Workshop, ICCVW 2019 , 1951–1960doi:10.1109/ICCVW.2019.00244. http://ieeexplore.ieee.org/document/ cessing 22, 1533–1545. URL: [23] Bianco, S., Cadene, R., Celona, L., Napoletano, P., 2018. Benchmark 6857341/ 10.1109/TASLP.2014.2339736 , doi: . analysis of representative deep neural network architectures. IEEE [3] Abdelouahab, K., Pelcat, M., Serot, J., Berry, F., 2018. Accelerating Access 6, 64270–64277. doi:10.1109/ACCESS.2018.2877890. http: CNN inference on FPGAs: A Survey. ArXiv preprint URL: [24] Blalock, D., Ortiz, J.J.G., Frankle, J., Guttag, J., 2020. What is //arxiv.org/abs/1806.01683 . the State of Neural Network Pruning? ArXiv preprint URL: http: [4] Achronix Semiconductor Corporation, 2020. FPGAs Enable the Next //arxiv.org/abs/2003.03033. Generation of Communication and Networking Solutions. White [25] Bolukbasi, T., Wang, J., Dekel, O., Saligrama, V., 2017. Adaptive Paper WP021, 1–15. Neural Networks for Efficient Inference. Thirty-fourth International https://github.com/albanie/ [5] Albanie, 2020. convnet-burden. URL: Conference on Machine Learning URL: https://arxiv.org/abs/1702. convnet-burden . 07811http://arxiv.org/abs/1702.07811. [6] Alemdar, H., Leroy, V., Prost-Boucle, A., Petrot, F., 2017. Ternary [26] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, neural networks for resource-efficient AI applications, in: 2017 Inter- P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., national Joint Conference on Neural Networks (IJCNN), IEEE. pp. Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., https://ieeexplore.ieee.org/abstract/document/ 2547–2554. URL: Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., 7966166/ 10.1109/IJCNN.2017.7966166 , doi: . Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, https://www. [7] AMD, . 
Instinct™ MI25 Accelerator. URL: S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models amd.com/en/products/professional-graphics/instinct-mi25 . are Few-Shot Learners. ArXiv preprint URL: http://arxiv.org/abs/ [8] Arm, 2015. ARM Architecture Reference Man- 2005.14165. ual ARMv8, for ARMv8-A architecture profile. [27] Bucilua,ˇ C., Caruana, R., Niculescu-Mizil, A., 2006. Model compres- https://developer.arm.com/documentation/ddi0487/latest. URL: sion, in: Proceedings of the 12th ACM SIGKDD international con- https://developer.arm.com/documentation/ddi0487/latest . ference on Knowledge discovery and data mining - KDD ’06, ACM [9] Arm, 2020. Arm Cortex-M Processor Comparison Table. URL: Press, New York, New York, USA. p. 535. URL: https://dl.acm. https://developer.arm.com/ip-products/processors/cortex-a . org/doi/abs/10.1145/1150402.1150464, doi:10.1145/1150402.1150464. [10] Arm, Graphics, C., 2020. MALI-G76 High-Performance [28] BUG1989, 2019. BUG1989/caffe-int8-convert-tools: Generate a GPU for Complex Graphics Features and Bene ts High Perfor- quantization parameter file for ncnn framework int8 inference. URL: https://www.arm.com/products/ mance for Mixed Realities. URL: https://github.com/BUG1989/caffe-INT8-convert-tools. silicon-ip-multimedia/gpu/-g76 . [29] Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S., 2019. Once-for-All: [11] ARM, Reddy, V.G., 2008. Neon technology introduction. ARM Cor- Train One Network and Specialize it for Efficient Deployment. ArXiv http://caxapa.ru/thumbs/301908/AT_-_NEON_ poration , 1–34URL: preprint , 1–15URL: http://arxiv.org/abs/1908.09791. for_Multimedia_Applications.pdf . [30] Cai, J., Takemoto, M., Nakajo, H., 2018. A Deep Look into Loga- [12] Augasta, M.G., Kathirvalavakumar, T., 2013. Pruning algorithms of rithmic Quantization of Model Parameters in Neural Networks, in: neural networks - A comparative study. Open Computer Science 3, Proceedings of the 10th International Conference on Advances in 10.2478/s13537-013-0109-x 105–115. doi: . Information Technology - IAIT 2018, ACM Press, New York, New [13] Baidu, 2019. PArallel Distributed Deep LEarning: Machine Learn- York, USA. pp. 1–8. URL: http://dl.acm.org/citation.cfm?doid= https://github.com/ ing Framework from Industrial Practice. URL: 3291280.3291800, doi:10.1145/3291280.3291800. PaddlePaddle/Paddle . [31] Cai, Z., He, X., Sun, J., Vasconcelos, N., 2017. Deep Learning with [14] Balzer, W., Takahashi, M., Ohta, J., Kyuma, K., 1991. Weight Low Precision by Half-Wave Gaussian Quantization, in: 2017 IEEE quantization in Boltzmann machines. Neural Networks 4, 405–409. Conference on Computer Vision and Pattern Recognition (CVPR), 10.1016/0893-6080(91)90077-I doi: . IEEE. pp. 5406–5414. URL: http://ieeexplore.ieee.org/document/ [15] Banner, R., Hubara, I., Hoffer, E., Soudry, D., 2018. 8100057/, doi:10.1109/CVPR.2017.574. Scalable methods for 8-bit training of neural networks, [32] Cao, S., Ma, L., Xiao, W., Zhang, C., Liu, Y., Zhang, L., in: Advances in Neural Information Processing Systems Nie, L., Yang, Z., 2019. SeerNet : Predicting Convolu- http://papers.nips.cc/paper/ (NIPS), pp. 5145–5153. URL: tional Neural Network Feature-Map Sparsity through Low- 7761-scalable-methods-for-8-bit-training-of-neural-networks . Bit Quantization. Proceedings of the IEEE/CVF Conference [16] Banner, R., Nahshan, Y., Soudry, D., 2019. 
Post training 4-bit quanti- on Computer Vision and Pattern Recognition (CVPR) URL: zation of convolutional networks for rapid-deployment, in: Advances http://openaccess.thecvf.com/content_CVPR_2019/papers/Cao_ in Neural Information Processing Systems (NIPS), pp. 7950–7958. SeerNet_Predicting_Convolutional_Neural_Network_Feature-Map_ [17] Baoyuan Liu, Min Wang, Foroosh, H., Tappen, M., Penksy, M., 2015. Sparsity_Through_Low-Bit_Quantization_CVPR_2019_paper.pdf. Sparse Convolutional Neural Networks, in: 2015 IEEE Conference on [33] Carreira-Perpinan, M.A., Idelbayev, Y., 2018. "Learning- Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 806– Compression" Algorithms for Neural Net Pruning, in: IEEE/CVF http://ieeexplore.ieee.org/document/7298681/ 10. 814. URL: , doi: Conference on Computer Vision and Pattern Recognition (CVPR), 1109/CVPR.2015.7298681 . IEEE. pp. 8532–8541. URL: https://ieeexplore.ieee.org/document/ [18] Baskin, C., Schwartz, E., Zheltonozhskii, E., Liss, N., Giryes, R., 8578988/, doi:10.1109/CVPR.2018.00890. Bronstein, A.M., Mendelson, A., 2018. UNIQ: Uniform Noise In-


[34] Chellapilla, K., Puri, S., Simard, P., 2006. High Performance Con- Artificial Intelligence Review 53, 5113–5155. URL: https://doi. volutional Neural Networks for Document Processing, in: Tenth org/10.1007/s10462-020-09816-7, doi:10.1007/s10462-020-09816-7. International Workshop on Frontiers in Handwriting Recognition. [49] Cornea, M., 2015. Intel ® AVX-512 Instructions and Their Use in URL: https://hal.inria.fr/inria-00112631/, doi:10.1.1.137.482. the Implementation of Math Functions. Intel Corporation . [35] Chen, H., Wang, Y., Xu, C., Shi, B., Xu, C., Tian, Q., Xu, C., 2020. [50] Cotofana, S., Vassiliadis, S., Logic, T., Addition, B., Addition, S., AdderNet: Do We Really Need Multiplications in Deep Learning?, in: 1997. Low Weight and Fan-In Neural Networks for Basic Arithmetic Proceedings of the IEEE/CVF Conference on Computer Vision and Operations, in: 15th IMACS World Congress, pp. 227–232. doi:10. Pattern Recognition (CVPR), pp. 1468–1477. URL: http://arxiv. 1.1.50.4450. org/abs/1912.13200. [51] Courbariaux, M., Bengio, Y., David, J.P., 2014. Training deep neu- [36] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., ral networks with low precision multiplications, in: International Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., Krishnamurthy, Conference on Learning Representations(ICLR), pp. 1–10. URL: A., 2018. TVM: An automated end-to-end optimizing compiler for http://arxiv.org/abs/1412.7024, doi:arXiv:1412.7024. deep learning, in: Proceedings of the 13th USENIX Symposium [52] Courbariaux, M., Bengio, Y., David, J.P., 2015. BinaryConnect: on Operating Systems Design and Implementation, OSDI 2018, pp. Training Deep Neural Networks with binary weights during propa- 579–594. URL: http://arxiv.org/abs/1802.04799. gations, in: Advances in Neural Information Processing Systems [37] Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y., 2015. (NIPS), pp. 1–9. URL: http://arxiv.org/abs/1511.00363, doi:10. Compressing neural networks with the hashing trick., in: In Inter- 5555/2969442.2969588. national Conference on Machine Learning, pp. 2285–2294. URL: [53] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, http://arxiv.org/abs/1504.04788. Y., 2016. Binarized Neural Networks: Training Deep Neural [38] Chen, Y., Chen, T., Xu, Z., Sun, N., Temam, O., 2016. DianNao Networks with Weights and Activations Constrained to +1 or -1. family: Energy-Efficient Hardware Accelerators for Machine ArXiv preprint URL: https://github.com/MatthieuCourbariaux/http: Learning. Communications of the ACM 59, 105–112. URL: //arxiv.org/abs/1602.02830. 10.1145/2594446%5Cnhttps://ejwl.idm.oclc.org/login?url=http: [54] Das, D., Mellempudi, N., Mudigere, D., Kalamkar, D., Avancha, S., //search.ebscohost.com/login.aspx?direct=true&db=bth&AN= Banerjee, K., Sridharan, S., Vaidyanathan, K., Kaul, B., Georganas, 95797996&site=ehost-livehttp://dl.acm.org/citation.cfm?doid= E., Heinecke, A., Dubey, P., Corbal, J., Shustrov, N., Dubtsov, 3013530.2996864, doi:10.1145/2996864. R., Fomenko, E., Pirogov, V., 2018. Mixed Precision Training of [39] Cheng, J., Wang, P.s., Li, G., Hu, Q.h., Lu, H.q., 2018. Recent ad- Convolutional Neural Networks using Integer Operations, in: In- vances in efficient computation of deep convolutional neural networks. ternational Conference on Learning Representations(ICLR), Frontiers of Information Technology & Electronic Engineering pp. 1–11. URL: https://www.anandtech.com/show/11741/ 19, 64–77. 
URL: http://link.springer.com/10.1631/FITEE.1700789, hot-chips-intel-knights-mill-live-blog-445pm-pt-1145pm-utchttp: doi:10.1631/FITEE.1700789. //arxiv.org/abs/1802.00930. [40] Cheng, Y., Wang, D., Zhou, P., Zhang, T., 2017. A Survey of Model [55] Dash, M., Liu, H., 1997. Feature selection for classification. Intelli- Compression and Acceleration for Deep Neural Networks. ArXiv gent Data Analysis 1, 131–156. doi:10.3233/IDA-1997-1302. preprint URL: http://arxiv.org/abs/1710.09282. [56] Davis, A., Arel, I., 2013. Low-Rank Approximations for Conditional [41] Cheng, Z., Soudry, D., Mao, Z., Lan, Z., 2015. Training Binary Mul- Feedforward Computation in Deep Neural Networks, in: International tilayer Neural Networks for Image Classification using Expectation Conference on Learning Representations Workshops (ICLRW), pp. Backpropagation. ArXiv preprint URL: http://cn.arxiv.org/pdf/ 1–10. URL: http://arxiv.org/abs/1312.4461. 1503.03562.pdfhttp://arxiv.org/abs/1503.03562. [57] Deng, W., Yin, W., Zhang, Y., 2013. Group sparse optimiza- [42] Chiliang, Z., Tao, H., Yingda, G., Zuochang, Y., 2019. Accelerating tion by alternating direction method, in: Van De Ville, D., Convolutional Neural Networks with Dynamic Channel Pruning, in: Goyal, V.K., Papadakis, M. (Eds.), Wavelets and Sparsity XV, 2019 Data Compression Conference (DCC), IEEE. pp. 563–563. p. 88580R. URL: http://proceedings.spiedigitallibrary.org/ URL: https://ieeexplore.ieee.org/document/8712710/, doi:10.1109/ proceeding.aspx?doi=10.1117/12.2024410, doi:10.1117/12.2024410. DCC.2019.00075. [58] Dettmers, T., 2015. 8-Bit Approximations for Parallelism [43] Choi, B., Lee, J.H., Kim, D.H., 2008. Solving local minima in Deep Learning, in: International Conference on Learn- problem with large number of hidden nodes on two-layered feed- ing Representations(ICLR). URL: https://github.com/soumith/ forward artificial neural networks. Neurocomputing 71, 3640–3643. convnet-benchmarkshttp://arxiv.org/abs/1511.04561. doi:10.1016/j.neucom.2008.04.004. [59] Dong, X., Huang, J., Yang, Y., Yan, S., 2017. More is less: A more [44] Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.j., Srinivasan, V., complicated network with less inference complexity. Proceedings - Gopalakrishnan, K., 2018. PACT: Parameterized Clipping Activation 30th IEEE Conference on Computer Vision and Pattern Recognition, for Quantized Neural Networks. ArXiv preprint , 1–15URL: http: CVPR 2017 2017-Janua, 1895–1903. URL: http://arxiv.org/abs/ //arxiv.org/abs/1805.06085. 1703.08651, doi:10.1109/CVPR.2017.205. [45] Choi, Y., El-Khamy, M., Lee, J., 2017a. Towards the Limit of Network [60] Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S., 1990. A set Quantization, in: International Conference on Learning Represen- of level 3 basic linear algebra subprograms. ACM Transactions on tations(ICLR), IEEE. URL: https://arxiv.org/abs/1612.01543http: Mathematical Software (TOMS) 16, 1–17. doi:10.1145/77626.79170. //arxiv.org/abs/1612.01543. [61] Dukhan, M., Yiming, W., Hao, L., Lu, H., 2019. QNNPACK: [46] Choi, Y., Member, S.S., Bae, D., Sim, J., Member, S.S., Choi, S., Open source library for optimized mobile deep learning - Facebook Kim, M., Member, S.S., Kim, L.s.S., Member, S.S., 2017b. Energy- Engineering. URL: https://engineering.fb.com/ml-applications/ Efficient Design of Processing Element for Convolutional Neural qnnpack/. Network. IEEE Transactions on Circuits and Systems II: Express [62] Elhoushi, M., Chen, Z., Shafiq, F., Tian, Y.H., Li, J.Y., 2019. Briefs 64, 1332–1336. 
URL: http://ieeexplore.ieee.org/document/ DeepShift: Towards Multiplication-Less Neural Networks. ArXiv 7893765/, doi:10.1109/TCSII.2017.2691771. preprint URL: http://arxiv.org/abs/1905.13298. [47] Chollet, F., Google, C., 2017. Xception : Deep Learning with [63] Elsken, T., Metzen, J.H., Hutter, F., 2019. Neural Architec- Depthwise Separable Convolutions, in: The IEEE Conference on ture Search. Journal of Machine Learning Research 20, 63– Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 1251– 77. URL: http://link.springer.com/10.1007/978-3-030-05318-5_3, 1258. URL: http://ieeexplore.ieee.org/document/8099678/, doi:10. doi:10.1007/978-3-030-05318-5{\_}3. 1109/CVPR.2017.195. [64] Engelbrecht, A.P., 2001. A new pruning heuristic based on variance [48] Choudhary, T., Mishra, V., Goswami, A., Sarangapani, J., 2020. analysis of sensitivity information. IEEE Transactions on Neural A comprehensive survey on model compression and acceleration. Networks 12, 1386–1389. doi:10.1109/72.963775.


[65] Esser, S.K., Merolla, P.A., Arthur, J.V., Cassidy, A.S., Appuswamy, 08983. R., Andreopoulos, A., Berg, D.J., McKinstry, J.L., Melano, T., Barch, [83] Greff, K., Srivastava, R.K., Schmidhuber, J., 2016. Highway and D.R., di Nolfo, C., Datta, P., Amir, A., Taba, B., Flickner, M.D., Residual Networks learn Unrolled Iterative Estimation, in: Inter- Modha, D.S., 2016. Convolutional networks for fast, energy-efficient national Conference on Learning Representations(ICLR), pp. 1–14. neuromorphic computing. Proceedings of the National Academy URL: http://arxiv.org/abs/1612.07771. of Sciences 113, 11441–11446. URL: http://www.pnas.org/lookup/ [84] Gudovskiy, D.A., Rigazio, L., 2017. ShiftCNN: Generalized Low- doi/10.1073/pnas.1604850113, doi:10.1073/pnas.1604850113. Precision Architecture for Inference of Convolutional Neural Net- [66] Faraone, J., Fraser, N., Blott, M., Leong, P.H., 2018. SYQ: Learning works. ArXiv preprint URL: http://arxiv.org/abs/1706.02393. Symmetric Quantization for Efficient Deep Neural Networks, in: [85] Guo, K., Sui, L., Qiu, J., Yu, J., Wang, J., Yao, S., Han, S., Wang, Y., Proceedings of the IEEE/CVF Conference on Computer Vision and Yang, H., 2018. Angel-Eye: A complete design flow for mapping Pattern Recognition (CVPR). CNN onto embedded FPGA. IEEE Transactions on Computer-Aided [67] Fiesler, E., Choudry, A., Caulfield, H.J., 1990. Weight discretization Design of Integrated Circuits and Systems 37, 35–47. URL: https:// paradigm for optical neural networks. Optical Interconnections and ieeexplore.ieee.org/abstract/document/7930521/, doi:10.1109/TCAD. Networks 1281, 164. doi:10.1117/12.20700. 2017.2705069. [68] Figurnov, M., Collins, M.D., Zhu, Y., Zhang, L., Huang, J., Vetrov, [86] Guo, K., Zeng, S., Yu, J., Wang, Y., Yang, H., 2017. A Survey of D., Salakhutdinov, R., 2017. Spatially Adaptive Computation Time FPGA-Based Neural Network Accelerator. ACM Transactions on for Residual Networks, in: IEEE/CVF Conference on Computer Reconfigurable Technology and Systems 9. URL: http://arxiv.org/ Vision and Pattern Recognition (CVPR), IEEE. pp. 1790–1799. abs/1712.08934https://arxiv.org/abs/1712.08934. URL: http://ieeexplore.ieee.org/document/8099677/, doi:10.1109/ [87] Guo, Y., 2018. A Survey on Methods and Theories of Quantized CVPR.2017.194. Neural Networks. ArXiv preprint URL: http://arxiv.org/abs/1808. [69] FPGA, I., . Intel® FPGA Development Tools - Intel 04752. FPGA. URL: https://www.intel.com/content/www/us/en/software/ [88] Guo, Y., Yao, A., Chen, Y., 2016. Dynamic Network Surgery for programmable/overview.html. Efficient DNNs, in: Advances in Neural Information Processing Sys- [70] Frankle, J., Carbin, M., 2019. The lottery ticket hypothesis: Finding tems (NIPS), pp. 1379–1387. URL: http://papers.nips.cc/paper/ sparse, trainable neural networks, in: International Conference on 6165-dynamic-network-surgery-for-efficient-dnns. Learning Representations(ICLR). URL: http://arxiv.org/abs/1803. [89] Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P., 2015. 03635. Deep learning with limited numerical precision, in: International [71] Fukushima, K., 1988. Neocognitron: A hierarchical neural network Conference on Machine Learning (ICML), pp. 1737–1746. capable of visual pattern recognition. Neural Networks 1, 119–130. [90] Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S., 2018. Ristretto: A doi:10.1016/0893-6080(88)90014-7. Framework for Empirical Study of Resource-Efficient Inference in [72] Gale, T., Elsen, E., Hooker, S., 2019. 
The State of Sparsity in Deep Convolutional Neural Networks. IEEE Transactions on Neural Net- Neural Networks. ArXiv preprint URL: http://arxiv.org/abs/1902. works and Learning Systems 29, 1–6. URL: https://ieeexplore.ieee. 09574. org/abstract/document/8318896/, doi:10.1109/TNNLS.2018.2808319. [73] Gao, X., Zhao, Y., Dudziak, L., Mullins, R., Xu, C.Z., Dudziak, L., [91] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Mullins, R., Xu, C.Z., 2019. Dynamic Channel Pruning: Feature Dally, W.J., 2016a. EIE: Efficient Inference Engine on Compressed Boosting and Suppression, in: International Conference on Learning Deep Neural Network, in: 2016 ACM/IEEE 43rd Annual Inter- Representations (ICLR), pp. 1–14. URL: http://arxiv.org/abs/1810. national Symposium on (ISCA), IEEE. pp. 05331. 243–254. URL: http://ieeexplore.ieee.org/document/7551397/http: [74] Glossner, J., Blinzer, P., Takala, J., 2016. HSA-enabled DSPs and //arxiv.org/abs/1602.01528, doi:10.1109/ISCA.2016.30. accelerators. 2015 IEEE Global Conference on Signal and Informa- [92] Han, S., Mao, H., Dally, W.J., 2016b. Deep compression: Compress- tion Processing, GlobalSIP 2015 , 1407–1411doi:10.1109/GlobalSIP. ing deep neural networks with pruning, trained quantization and Huff- 2015.7418430. man coding, in: International Conference on Learning Representa- [75] Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F., Yan, J., tions(ICLR), pp. 199–203. URL: http://arxiv.org/abs/1510.00149. 2019. Differentiable soft quantization: Bridging full-precision and [93] Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, low-bit neural networks, in: Proceedings of the IEEE International E., Vajda, P., Paluri, M., Tran, J., Catanzaro, B., Dally, W.J., 2016c. Conference on Computer Vision (ICCV), pp. 4851–4860. doi:10. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, in: 1109/ICCV.2019.00495. International Conference on Learning Representations(ICLR). URL: [76] Gong, Y., Liu, L., Yang, M., Bourdev, L., 2014. Compressing Deep http://arxiv.org/abs/1607.04381. Convolutional Networks using Vector Quantization, in: International [94] Han, S., Pool, J., Tran, J., Dally, W.J., 2015. Learning both Weights Conference on Learning Representations(ICLR). URL: http://arxiv. and Connections for Efficient Neural Networks, in: Advances in org/abs/1412.6115. Neural Information Processing Systems (NIPS), pp. 1135–1143. [77] Google, . Hosted models | TensorFlow Lite. URL: https://www. URL: http://arxiv.org/abs/1506.02626, doi:10.1016/S0140-6736(95) .org/lite/guide/hosted_models. 92525-2. [78] Google, 2018. google/gemmlowp: Low-precision matrix mul- [95] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., tiplication. https://github.com/google/gemmlowp. URL: https: Prenger, R., Satheesh, S., Sengupta, S., Coates, A., Ng, A.Y., 2014. //github.com/google/gemmlowp. Deep Speech: Scaling up end-to-end speech recognition. ArXiv [79] Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, preprint , 1–12URL: http://arxiv.org/abs/1412.5567. E., 2018. MorphNet: Fast & Simple Resource-Constrained Struc- [96] HANSON, S., 1989. Comparing biases for minimal network con- ture Learning of Deep Networks, in: Proceedings of the IEEE/CVF struction with back-propagation, in: Advances in Neural Information Conference on Computer Vision and Pattern Recognition (CVPR), Processing Systems (NIPS), pp. 177–185. IEEE. pp. 1586–1595. 
URL: https://ieeexplore.ieee.org/document/ [97] Hassibi, B., Stork, D.G., Wolff, G.J., 1993. Optimal brain surgeon 8578269/, doi:10.1109/CVPR.2018.00171. and general network pruning. doi:10.1109/icnn.1993.298572. [80] Gou, J., Yu, B., Maybank, S.J., Tao, D., 2020. Knowledge Distillation: [98] He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep Residual Learn- A Survey. ArXiv preprint URL: http://arxiv.org/abs/2006.05525. ing for Image Recognition, in: IEEE/CVF Conference on Com- [81] Graham, B., 2017. Low-Precision Batch-Normalized Activations. puter Vision and Pattern Recognition (CVPR), IEEE. pp. 171–180. ArXiv preprint , 1–16URL: http://arxiv.org/abs/1702.08231. URL: http://arxiv.org/abs/1512.03385http://ieeexplore.ieee.org/ [82] Graves, A., 2016. Adaptive Computation Time for Recurrent Neural document/7780459/, doi:10.3389/fpsyg.2013.00124. Networks. ArXiv preprint , 1–19URL: http://arxiv.org/abs/1603. [99] He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y., 2018. Soft Filter

T Liang et al.: Preprint submitted to Elsevier Page 34 of 41 Survey on pruning and quantization

Pruning for Accelerating Deep Convolutional Neural Networks, in: URL: https://arxiv.org/abs/1602.07360http://arxiv.org/abs/1602. Proceedings of the Twenty-Seventh International Joint Conference on 07360, doi:10.1007/978-3-319-24553-9. Artificial Intelligence (IJCAI-18), International Joint Conferences on [116] Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Artificial Intelligence Organization, California. pp. 2234–2240. URL: Wu, M., Xu, L., Van Gool, L., 2019. AI benchmark: All about http://arxiv.org/abs/1808.06866, doi:10.24963/ijcai.2018/309. deep learning on smartphones in 2019. Proceedings - 2019 In- [100] He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y., 2019. Filter Pruning ternational Conference on Computer Vision Workshop, ICCVW via Geometric Median for Deep Convolutional Neural Networks 2019 , 3617–3635URL: https://developer.arm.com/documentation/ Acceleration. IEEE/CVF Conference on Computer Vision and Pattern ddi0487/latest, doi:10.1109/ICCVW.2019.00447. Recognition (CVPR) URL: http://arxiv.org/abs/1811.00250. [117] Imagination, . PowerVR - embedded graphics processors [101] He, Y., Zhang, X., Sun, J., 2017. Channel Pruning for Accel- powering iconic products. URL: https://www.imgtec.com/ erating Very Deep Neural Networks, in: IEEE International graphics-processors/. Conference on Computer Vision (ICCV), IEEE. pp. 1398–1406. [118] Intel, . OpenVINO™ Toolkit. URL: https://docs.openvinotoolkit. URL: http://openaccess.thecvf.com/content_ICCV_2017/papers/He_ org/latest/index.html. Channel_Pruning_for_ICCV_2017_paper.pdfhttp://ieeexplore.ieee. [119] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep org/document/8237417/, doi:10.1109/ICCV.2017.155. network training by reducing internal covariate shift, in: International [102] Hinton, G., 2012. Neural networks for machine learning. Technical Conference on Machine Learning (ICML), pp. 448–456. URL: http: Report. Coursera. //arxiv.org/abs/1502.03167. [103] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhut- [120] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., dinov, R.R., 2012. Improving neural networks by preventing co- Adam, H., Kalenichenko, D., 2018. Quantization and Training of adaptation of feature detectors. ArXiv preprint , 1–18URL: http: Neural Networks for Efficient Integer-Arithmetic-Only Inference, in: //arxiv.org/abs/1207.0580. IEEE/CVF Conference on Computer Vision and Pattern Recognition [104] Hou, L., Yao, Q., Kwok, J.T., 2017. Loss-aware Binarization of (CVPR), IEEE. pp. 2704–2713. URL: https://ieeexplore.ieee.org/ Deep Networks, in: International Conference on Learning Represen- document/8578384/, doi:10.1109/CVPR.2018.00286. tations(ICLR). URL: http://arxiv.org/abs/1611.01600. [121] Jia, Z., Tillman, B., Maggioni, M., Scarpazza, D.P., 2019. Dissect- [105] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., ing the graphcore IPU architecture via microbenchmarking. ArXiv Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Effi- preprint . cient Convolutional Neural Networks for Mobile Vision Applications. [122] Jia Deng, Wei Dong, Socher, R., Li-Jia Li, Kai Li, Li Fei-Fei, 2009. ArXiv preprint URL: http://arxiv.org/abs/1704.04861. ImageNet: A large-scale hierarchical image database. IEEE/CVF [106] Hu, H., Peng, R., Tai, Y.W., Tang, C.K., 2016. Network Trimming: Conference on Computer Vision and Pattern Recognition (CVPR) , A Data-Driven Neuron Pruning Approach towards Efficient Deep Ar- 248–255doi:10.1109/cvprw.2009.5206848. chitectures. 
ArXiv preprint URL: http://arxiv.org/abs/1607.03250. [123] Jianchang Mao, Mohiuddin, K., Jain, A., 1994. Parsimonious network [107] Hu, Q., Wang, P., Cheng, J., 2018. From hashing to CNNs: Train- design and feature selection through node pruning, in: Proceedings ing binary weight networks via hashing, in: AAAI Conference on of the 12th IAPR International Conference on Pattern Recognition, Artificial Intelligence, pp. 3247–3254. Vol. 3 - Conference C: Signal Processing (Cat. No.94CH3440-5), [108] Huang, G., Chen, D., Li, T., Wu, F., Van Der Maaten, L., Weinberger, IEEE Comput. Soc. Press. pp. 622–624. URL: http://ieeexplore. K., 2018. Multi-scale dense networks for resource efficient image ieee.org/document/577060/, doi:10.1109/icpr.1994.577060. classification, in: International Conference on Learning Representa- [124] Jiao, Y., Han, L., Long, X., 2020. Hanguang 800 NPU – The Ultimate tions(ICLR). URL: http://image-net.org/challenges/talks/. AI Inference Solution for Data Centers, in: 2020 IEEE Hot Chips 32 [109] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Symposium (HCS), IEEE. pp. 1–29. URL: https://ieeexplore.ieee. Densely Connected Convolutional Networks, in: IEEE/CVF Confer- org/document/9220619/, doi:10.1109/HCS49909.2020.9220619. ence on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. [125] Jouppi, N.P., Borchers, A., Boyle, R., Cantin, P.l., Chao, C., Clark, 2261–2269. URL: https://ieeexplore.ieee.org/document/8099726/, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Young, C., doi:10.1109/CVPR.2017.243. Ghaemmaghami, T.V., Gottipati, R., Gulland, W., Hagmann, R., Ho, [110] Huang, G.B., Learned-miller, E., 2014. Labeled faces in the wild: C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Patil, N., Updates and new reporting procedures. Dept. Comput. Sci., Univ. Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, Massachusetts Amherst, Amherst, MA, USA, Tech. Rep 14, 1–5. A., Kumar, N., Lacy, S., Laudon, J., Law, J., Patterson, D., Le, D., [111] Huang, Z., Wang, N., 2018. Data-Driven Sparse Structure Selec- Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., tion for Deep Neural Networks, in: Lecture Notes in Computer Sci- Mahony, M., Miller, K., Nagarajan, R., Agrawal, G., Narayanaswami, ence (including subseries Lecture Notes in Artificial Intelligence and R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, Lecture Notes in Bioinformatics). volume 11220 LNCS, pp. 317– A., Ross, J., Ross, M., Salek, A., Bajwa, R., Samadiani, E., Severn, C., 334. URL: http://link.springer.com/10.1007/978-3-030-01270-0_ Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, 19, doi:10.1007/978-3-030-01270-0{\_}19. M., Thorson, G., Tian, B., Bates, S., Toma, H., Tuttle, E., Vasudevan, [112] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, V., Walter, R., Wang, W., Wilcox, E., Yoon, D.H., Bhatia, S., Boden, Y., 2016a. Binarized Neural Networks, in: Advances in Neural N., 2017. In-Datacenter Performance Analysis of a Tensor Processing Information Processing Systems (NIPS), pp. 4114–4122. URL: Unit. ACM SIGARCH Computer Architecture News 45, 1–12. URL: http://papers.nips.cc/paper/6573-binarized-neural-networks. http://dl.acm.org/citation.cfm?doid=3140659.3080246, doi:10.1145/ [113] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y., 3140659.3080246. 2016b. 
Quantized Neural Networks: Training Neural Networks with [126] Judd, P., Delmas, A., Sharify, S., Moshovos, A., 2017. Cnvlutin2: Low Precision Weights and Activations. Journal of Machine Learning Ineffectual-Activation-and-Weight-Free Deep Neural Network Com- Research 18 18, 187–1. URL: http://arxiv.org/abs/1609.07061. puting. ArXiv preprint , 1–6URL: https://arxiv.org/abs/1705. [114] Hwang, K., Sung, W., 2014. Fixed-point feedforward deep neural 00125. network design using weights +1, 0, and -1, in: 2014 IEEE Workshop [127] Jung, S., Son, C., Lee, S., Son, J., Kwak, Y., Han, J.J., Hwang, S.J., on Signal Processing Systems (SiPS), IEEE. pp. 1–6. URL: https:// Choi, C., 2018. Learning to Quantize Deep Networks by Optimizing ieeexplore.ieee.org/abstract/document/6986082/, doi:10.1109/SiPS. Quantization Intervals with Task Loss. Revue Internationale de la 2014.6986082. Croix-Rouge et Bulletin international des Sociétés de la Croix-Rouge [115] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., URL: http://arxiv.org/abs/1808.05779, doi:arXiv:1808.05779v2. Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x [128] Kathail, V., 2020. Xilinx Vitis Unified Software Platform, in: Pro- fewer parameters and <0.5MB model size, in: ArXiv e-prints. ceedings of the 2020 ACM/SIGDA International Symposium on

T Liang et al.: Preprint submitted to Elsevier Page 35 of 41 Survey on pruning and quantization

Field-Programmable Gate Arrays, ACM, New York, NY, USA. pp. [144] Leng, C., Li, H., Zhu, S., Jin, R., 2018. Extremely Low Bit Neural 173–174. URL: https://dl.acm.org/doi/10.1145/3373087.3375887, Network: Squeeze the Last Bit Out with ADMM. The Thirty-Second doi:10.1145/3373087.3375887. AAAI Conference on Artificial Intelligence (AAAI-18) URL: http: [129] Keil, 2018. CMSIS NN Software Library. URL: https:// //arxiv.org/abs/1707.09870. arm-software.github.io/CMSIS_5/NN/html/index.html. [145] Leroux, S., Bohez, S., De Coninck, E., Verbelen, T., Vankeirsbilck, [130] Köster, U., Webb, T.J., Wang, X., Nassar, M., Bansal, A.K., Consta- B., Simoens, P., Dhoedt, B., 2017. The cascading neural network: ble, W.H., Elibol, O.H., Gray, S., Hall, S., Hornof, L., Khosrowshahi, building the Internet of Smart Things. Knowledge and Informa- A., Kloss, C., Pai, R.J., Rao, N., 2017. Flexpoint: An Adaptive tion Systems 52, 791–814. URL: http://link.springer.com/10.1007/ Numerical Format for Efficient Training of Deep Neural Networks. s10115-017-1029-1, doi:10.1007/s10115-017-1029-1. ArXiv preprint URL: http://arxiv.org/abs/1711.02213. [146] Li, F., Zhang, B., Liu, B., 2016. Ternary Weight Networks, in: [131] Krishnamoorthi, R., 2018. Quantizing deep convolutional net- Advances in Neural Information Processing Systems (NIPS). URL: works for efficient inference: A whitepaper. ArXiv preprint 8, 667– http://arxiv.org/abs/1605.04711. 668. URL: http://cn.arxiv.org/pdf/1806.08342.pdfhttp://arxiv. [147] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P., 2017a. org/abs/1806.08342, doi:arXiv:1806.08342v1. Pruning Filters for Efficient ConvNets, in: International Conference [132] Krizhevsky, A., 2009. Learning Multiple Layers of Features from on Learning Representations (ICLR). URL: http://arxiv.org/abs/ Tiny Images. Science Department, University of Toronto, Tech. 1608.08710, doi:10.1029/2009GL038531. doi:10.1.1.222.9220. [148] Li, H., Zhang, H., Qi, X., Ruigang, Y., Huang, G., 2019. Im- [133] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Clas- proved Techniques for Training Adaptive Deep Networks, in: 2019 sification with Deep Convolutional Neural Networks, in: Advances IEEE/CVF International Conference on Computer Vision (ICCV), in Neural Information Processing Systems (NIPS), pp. 1–9. URL: IEEE. pp. 1891–1900. URL: https://ieeexplore.ieee.org/document/ http://code.google.com/p/cuda-convnet/, doi:http://dx.doi.org/10. 9010043/, doi:10.1109/ICCV.2019.00198. 1016/j.protcy.2014.09.007. [149] Li, M., Liu, Y.I., Liu, X., Sun, Q., You, X.I.N., Yang, H., Luan, Z., [134] Lattner, C., Amini, M., Bondhugula, U., Cohen, A., Davis, A., Pien- Gan, L., Yang, G., Qian, D., 2020a. The Deep Learning Compiler: aar, J., Riddle, R., Shpeisman, T., Vasilache, N., Zinenko, O., 2020. A Comprehensive Survey. ArXiv preprint 1, 1–36. URL: http: MLIR: A Compiler Infrastructure for the End of Moore’s Law. ArXiv //arxiv.org/abs/2002.03794. preprint URL: http://arxiv.org/abs/2002.11054. [150] Li, Y., Gu, S., Mayer, C., Van Gool, L., Timofte, R., 2020b. [135] Lavin, A., Gray, S., 2016. Fast Algorithms for Convolu- Group Sparsity: The Hinge Between Filter Pruning and Decom- tional Neural Networks, in: IEEE/CVF Conference on Com- position for Network Compression, in: 2020 IEEE/CVF Conference puter Vision and Pattern Recognition (CVPR), IEEE. pp. 4013– on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 4021. URL: http://ieeexplore.ieee.org/document/7780804/http:// 8015–8024. 
URL: https://ieeexplore.ieee.org/document/9157445/, arxiv.org/abs/1312.5851, doi:10.1109/CVPR.2016.435. doi:10.1109/CVPR42600.2020.00804. [136] Lebedev, V., Lempitsky, V., 2016. Fast ConvNets Using Group-Wise [151] Li, Z., Wang, Y., Zhi, T., Chen, T., 2017b. A survey of neural Brain Damage, in: IEEE/CVF Conference on Computer Vision network accelerators. Frontiers of Computer Science 11, 746–761. and Pattern Recognition (CVPR), IEEE. pp. 2554–2564. URL: URL: http://link.springer.com/10.1007/s11704-016-6159-1, doi:10. http://openaccess.thecvf.com/content_cvpr_2016/html/Lebedev_ 1007/s11704-016-6159-1. Fast_ConvNets_Using_CVPR_2016_paper.htmlhttp://ieeexplore.ieee. [152] Li, Z., Zhang, Y., Wang, J., Lai, J., 2020c. A survey of FPGA design org/document/7780649/, doi:10.1109/CVPR.2016.280. for AI era. Journal of Semiconductors 41. doi:10.1088/1674-4926/ [137] Lebedev, V., Lempitsky, V., 2018. Speeding-up convolutional 41/2/021402. neural networks: A survey. BULLETIN OF THE POLISH [153] Lin, J., Rao, Y., Lu, J., Zhou, J., 2017a. Runtime Neural ACADEMY OF SCIENCES TECHNICAL SCIENCES 66, Pruning, in: Advances in Neural Information Processing Sys- 2018. URL: http://www.czasopisma.pan.pl/Content/109869/PDF/ tems (NIPS), pp. 2178–2188. URL: https://papers.nips.cc/paper/ 05_799-810_00925_Bpast.No.66-6_31.12.18_K2.pdf?handler=pdfhttp: 6813-runtime-neural-pruning.pdf. //www.czasopisma.pan.pl/Content/109869/PDF/05_799-810_00925_ [154] Lin, M., Chen, Q., Yan, S., 2014. Network in network, in: Interna- Bpast.No.66-6_31.12.18_K2.pdf, doi:10.24425/bpas.2018.125927. tional Conference on Learning Representations(ICLR), pp. 1–10. [138] Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, [155] Lin, X., Zhao, C., Pan, W., 2017b. Towards accurate binary convolu- 436–444. doi:10.1038/nature14539. tional neural network, in: Advances in Neural Information Processing [139] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient- Systems (NIPS), pp. 345–353. based learning applied to document recognition. Proceedings of the [156] Lin, Z., Courbariaux, M., Memisevic, R., Bengio, Y., 2016. IEEE 86, 2278–2323. URL: http://ieeexplore.ieee.org/document/ Neural Networks with Few Multiplications, in: Interna- 726791/, doi:10.1109/5.726791. tional Conference on Learning Representations(ICLR). URL: [140] LeCun, Y., Denker, J.S., Solla, S.A., 1990. Optimal Brain Damage, https://github.com/hantek/http://arxiv.org/abs/1510.03009https: in: Advances in Neural Information Processing Systems (NIPS), p. //arxiv.org/abs/1510.03009. 598–605. doi:10.5555/109230.109298. [157] Liu, J., Musialski, P., Wonka, P., Ye, J., 2013. Tensor Completion for [141] Lee, N., Ajanthan, T., Torr, P.H., 2019. SnIP: Single-shot network Estimating Missing Values in Visual Data. IEEE Transactions on Pat- pruning based on connection sensitivity, in: International Conference tern Analysis and Machine Intelligence 35, 208–220. URL: http:// on Learning Representations(ICLR). ieeexplore.ieee.org/document/6138863/, doi:10.1109/TPAMI.2012.39. [142] Lei, J., Gao, X., Song, J., Wang, X.L., Song, M.L., 2018. Survey [158] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C., 2017. Learn- of Deep Neural Network Model Compression. Ruan Jian Xue ing Efficient Convolutional Networks through Network Slimming, Bao/Journal of Software 29, 251–266. URL: https://www.scopus.com/ in: IEEE International Conference on Computer Vision (ICCV), inward/record.uri?eid=2-s2.0-85049464636&doi=10.13328%2Fj.cnki. IEEE. pp. 2755–2763. 
URL: http://ieeexplore.ieee.org/document/ jos.005428&partnerID=40&md5=5a79dfdff4a05f188c5d553fb3b3123a, 8237560/, doi:10.1109/ICCV.2017.298. doi:10.13328/j.cnki.jos.005428. [159] Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, T.K.T., Sun, J., [143] Lei, W., Chen, H., Wu, Y., 2017. Compressing Deep Convolutional 2019a. MetaPruning: Meta Learning for Automatic Neural Network Networks Using K-means Based on Weights Distribution, in: Pro- Channel Pruning, in: IEEE International Conference on Computer ceedings of the 2nd International Conference on Intelligent Informa- Vision. URL: http://arxiv.org/abs/1903.10258. tion Processing - IIP’17, ACM Press, New York, New York, USA. pp. [160] Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T., 2019b. Re- 1–6. URL: http://dl.acm.org/citation.cfm?doid=3144789.3144803, thinking the Value of Network Pruning, in: International Confer- doi:10.1145/3144789.3144803. ence on Learning Representations (ICLR), pp. 1–11. URL: http:

T Liang et al.: Preprint submitted to Elsevier Page 36 of 41 Survey on pruning and quantization

//arxiv.org/abs/1810.05270. 1–17. URL: http://arxiv.org/abs/1611.06440. [161] Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., Cheng, K.T., 2018. Bi- [178] Moss, D.J.M., Nurvitadhi, E., Sim, J., Mishra, A., Marr, D., Sub- Real Net: Enhancing the performance of 1-bit CNNs with improved haschandra, S., Leong, P.H.W., 2017. High performance binary representational capability and advanced training algorithm. Lecture neural networks on the Xeon+FPGA™ platform, in: 2017 27th Inter- Notes in Computer Science (including subseries Lecture Notes in national Conference on Field Programmable Logic and Applications Artificial Intelligence and Lecture Notes in Bioinformatics) 11219 (FPL), IEEE. pp. 1–4. URL: https://ieeexplore.ieee.org/abstract/ LNCS, 747–763. doi:10.1007/978-3-030-01267-0{\_}44. document/8056823/, doi:10.23919/FPL.2017.8056823. [162] Liu, Z.G., Mattina, M., 2019. Learning low-precision neural net- [179] Moudgill, M., Glossner, J., Huang, W., Tian, C., Xu, C., Yang, N., works without Straight-Through Estimator (STE), in: IJCAI Inter- Wang, L., Liang, T., Shi, S., Zhang, X., Iancu, D., Nacer, G., Li, K., national Joint Conference on Artificial Intelligence, International 2020. Heterogeneous Edge CNN Hardware Accelerator, in: The 12th Joint Conferences on Artificial Intelligence Organization, California. International Conference on Wireless Communications and Signal pp. 3066–3072. URL: https://www.ijcai.org/proceedings/2019/425, Processing, pp. 6–11. doi:10.24963/ijcai.2019/425. [180] Muller, L.K., Indiveri, G., 2015. Rounding Methods for Neural [163] Luo, J.H., Wu, J., 2020. AutoPruner: An end-to-end trainable filter Networks with Low Resolution Synaptic Weights. ArXiv preprint pruning method for efficient deep model inference. Pattern Recogni- URL: http://arxiv.org/abs/1504.05767. tion 107, 107461. URL: https://linkinghub.elsevier.com/retrieve/ [181] Muthukrishnan, R., Rohini, R., 2016. LASSO: A feature selection pii/S0031320320302648, doi:10.1016/j.patcog.2020.107461. technique in predictive modeling for machine learning, in: 2016 [164] Luo, J.H.H., Wu, J., Lin, W., 2017. ThiNet: A Filter Level Pruning IEEE International Conference on Advances in Computer Applica- Method for Deep Neural Network Compression. Proceedings of the tions (ICACA), IEEE. pp. 18–20. URL: http://ieeexplore.ieee.org/ IEEE International Conference on Computer Vision (ICCV) 2017- document/7887916/, doi:10.1109/ICACA.2016.7887916. Octob, 5068–5076. URL: http://ieeexplore.ieee.org/document/ [182] Neill, J.O., 2020. An Overview of Neural Network Compression. 8237803/, doi:10.1109/ICCV.2017.541. ArXiv preprint , 1–73URL: http://arxiv.org/abs/2006.03669. [165] Ma, Y., Suda, N., Cao, Y., Seo, J.S., Vrudhula, S., 2016. Scalable [183] NVIDIA Corporation, 2014. NVIDIA GeForce GTX 980 Featur- and modularized RTL compilation of Convolutional Neural Net- ing Maxwell, The Most Advanced GPU Ever Made. White Paper , works onto FPGA. FPL 2016 - 26th International Conference on 1–32URL: http://international.download.nvidia.com/geforce-com/ Field-Programmable Logic and Applications doi:10.1109/FPL.2016. international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF. 7577356. [184] NVIDIA Corporation, 2015. P100. White Paper URL: [166] Macchi, O., 1975. Coincidence Approach To Stochastic Point Pro- https://www.nvidia.com/en-us/data-center/tesla-p100/. cess. Advances in Applied Probability 7, 83–122. doi:10.1017/ [185] NVIDIA Corporation, 2017a. NVIDIA DGX-1 With Tesla V100 s0001867800040313. System Architecture. 
White Paper URL: http://images.nvidia.com/ [167] Mariet, Z., Sra, S., 2016. Diversity Networks: Neural Network content/pdf/dgx1-v100-system-architecture-whitepaper.pdf. Compression Using Determinantal Point Processes, in: International [186] NVIDIA Corporation, 2017b. NVIDIA Tesla V100 Conference on Learning Representations(ICLR), pp. 1–13. URL: GPU Volta Architecture. White Paper , 53URL: http://arxiv.org/abs/1511.05077. http://images.nvidia.com/content/volta-architecture/pdf/ [168] Mathieu, M., Henaff, M., LeCun, Y., 2013. Fast Training of Con- volta-architecture-whitepaper.pdf%0Ahttp://www.nvidia.com/ volutional Networks through FFTs. ArXiv preprint URL: http: content/gated-pdfs/Volta-Architecture-Whitepaper-v1.1.pdf. //arxiv.org/abs/1312.5851. [187] NVIDIA Corporation, 2018a. NVIDIA A100 Tensor Core GPU. [169] Medina, E., 2019. Habana Labs presentation. 2019 IEEE Hot Chips White Paper , 20–21. 31 Symposium, HCS 2019 doi:10.1109/HOTCHIPS.2019.8875670. [188] NVIDIA Corporation, 2018b. NVIDIA Turing GPU Architecture. [170] Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, B., Dubey, White Paper URL: https://gpltech.com/wp-content/uploads/2018/ P., 2017. Ternary Neural Networks with Fine-Grained Quantization. 11/NVIDIA-Turing-Architecture-Whitepaper.pdf. ArXiv preprint URL: http://arxiv.org/abs/1705.01462. [189] Odena, A., Lawson, D., Olah, C., 2017. Changing Model Behav- [171] Merolla, P., Appuswamy, R., Arthur, J., Esser, S.K., Modha, D., 2016. ior at Test-Time Using Reinforcement Learning, in: International Deep neural networks are robust to weight binarization and other non- Conference on Learning Representations Workshops (ICLRW), In- linear distortions. ArXiv preprint URL: https://arxiv.org/abs/1606. ternational Conference on Learning Representations, ICLR. URL: 01981http://arxiv.org/abs/1606.01981. http://arxiv.org/abs/1702.07780. [172] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, [190] ONNX, . onnx/onnx: Open standard for machine learning interoper- D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., Wu, ability. URL: https://github.com/onnx/onnx. H., 2017. Mixed Precision Training, in: International Conference on [191] Ouyang, J., Noh, M., Wang, Y., Qi, W., Ma, Y., Gu, C., Kim, S., Learning Representations(ICLR). URL: http://arxiv.org/abs/1710. Hong, K.i., Bae, W.K., Zhao, Z., Wang, J., Wu, P., Gong, X., Shi, J., 03740. Zhu, H., Du, X., 2020. Baidu Kunlun An AI processor for diversified [173] Migacz, S., 2017. 8-bit inference with TensorRT. GPU Technol- workloads, in: 2020 IEEE Hot Chips 32 Symposium (HCS), IEEE. pp. ogy Conference 2, 7. URL: https://on-demand.gputechconf.com/gtc/ 1–18. URL: https://ieeexplore.ieee.org/document/9220641/, doi:10. 2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf. 1109/HCS49909.2020.9220641. [174] Mishra, A., Nurvitadhi, E., Cook, J.J., Marr, D., 2018. WRPN: [192] Park, E., Ahn, J., Yoo, S., 2017. Weighted-Entropy-Based Quan- Wide reduced-precision networks, in: International Conference on tization for Deep Neural Networks, in: IEEE/CVF Conference Learning Representations(ICLR), pp. 1–11. on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. [175] Miyashita, D., Lee, E.H., Murmann, B., 2016. Convolu- 7197–7205. URL: http://ieeexplore.ieee.org/document/8100244/, tional Neural Networks using Logarithmic Data Representation. doi:10.1109/CVPR.2017.761. 
ArXiv preprint URL: http://cn.arxiv.org/pdf/1603.01025.pdfhttp: [193] Paszke, A., Gross, S., Bradbury, J., Lin, Z., Devito, Z., Massa, F., //arxiv.org/abs/1603.01025. Steiner, B., Killeen, T., Yang, E., 2019. PyTorch : An Imperative [176] Molchanov, D., Ashukha, A., Vetrov, D., 2017. Variational dropout Style , High-Performance Deep Learning Library. ArXiv preprint . sparsifies deep neural networks, in: International Conference on [194] Pilipović, R., Bulić, P., Risojević, V., 2018. Compression of convolu- Machine Learning (ICML), pp. 3854–3863. URL: https://dl.acm. tional neural networks: A short survey, in: 2018 17th International org/citation.cfm?id=3305939. Symposium on INFOTEH-JAHORINA, INFOTEH 2018 - Proceed- [177] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J., 2016. Pruning ings, IEEE. pp. 1–6. URL: https://ieeexplore.ieee.org/document/ Convolutional Neural Networks for Resource Efficient Inference, in: 8345545/, doi:10.1109/INFOTEH.2018.8345545. International Conference on Learning Representations (ICLR), pp. [195] Polyak, A., Wolf, L., 2015. Channel-level acceleration of deep

T Liang et al.: Preprint submitted to Elsevier Page 37 of 41 Survey on pruning and quantization

face representations. IEEE Access 3, 2163–2175. URL: http: batch normalization help optimization?, in: Advances in Neural In- //ieeexplore.ieee.org/document/7303876/, doi:10.1109/ACCESS.2015. formation Processing Systems (NIPS), pp. 2483–2493. 2494536. [211] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, [196] Preuser, T.B., Gambardella, G., Fraser, N., Blott, M., 2018. Inference Y., 2013. OverFeat: Integrated Recognition, Localization and Detec- of quantized neural networks on heterogeneous all-programmable tion using Convolutional Networks, in: International Conference on devices, in: 2018 Design, Automation & Test in Europe Conference Learning Representations(ICLR). URL: http://arxiv.org/abs/1312. & Exhibition (DATE), IEEE. pp. 833–838. URL: http://ieeexplore. 6229. ieee.org/document/8342121/, doi:10.23919/DATE.2018.8342121. [212] Settle, S.O., Bollavaram, M., D’Alberto, P., Delaye, E., Fernandez, [197] Prost-Boucle, A., Bourge, A., Petrot, F., Alemdar, H., Caldwell, N., O., Fraser, N., Ng, A., Sirasao, A., Wu, M., 2018. Quantizing Convo- Leroy, V., 2017. Scalable high-performance architecture for convo- lutional Neural Networks for Low-Power High-Throughput Inference lutional ternary neural networks on FPGA, in: 2017 27th Interna- Engines. ArXiv preprint URL: http://arxiv.org/abs/1805.07941. tional Conference on Field Programmable Logic and Applications [213] Shen, M., Han, K., Xu, C., Wang, Y., 2019. Searching for accu- (FPL), IEEE. pp. 1–7. URL: https://hal.archives-ouvertes.fr/ rate binary neural architectures. Proceedings - 2019 International hal-01563763http://ieeexplore.ieee.org/document/8056850/, doi:10. Conference on Computer Vision Workshop, ICCVW 2019 , 2041– 23919/FPL.2017.8056850. 2044doi:10.1109/ICCVW.2019.00256. [198] Qin, H., Gong, R., Liu, X., Bai, X., Song, J., Sebe, N., [214] Shen, X., Yi, B., Zhang, Z., Shu, J., Liu, H., 2016. Automatic 2020a. Binary neural networks: A survey. Pattern Recognition Recommendation Technology for Learning Resources with Con- 105, 107281. URL: https://linkinghub.elsevier.com/retrieve/pii/ volutional Neural Network, in: Proceedings - 2016 International S0031320320300856, doi:10.1016/j.patcog.2020.107281. Symposium on Educational Technology, ISET 2016, pp. 30–34. [199] Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., Song, J., doi:10.1109/ISET.2016.12. 2020b. Forward and Backward Information Retention for Accu- [215] Sheng, T., Feng, C., Zhuo, S., Zhang, X., Shen, L., Aleksic, M., rate Binary Neural Networks, in: IEEE/CVF Conference on Com- 2018. A Quantization-Friendly Separable Convolution for Mo- puter Vision and Pattern Recognition (CVPR), IEEE. pp. 2247–2256. bileNets. 2018 1st Workshop on Energy Efficient Machine Learn- URL: https://ieeexplore.ieee.org/document/9157443/, doi:10.1109/ ing and Cognitive Computing for Embedded Applications (EMC2) , CVPR42600.2020.00232. 14–18URL: https://ieeexplore.ieee.org/document/8524017/, doi:10. [200] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A., 2016. 1109/EMC2.2018.00011. XNOR-Net: ImageNet Classification Using Binary Convolutional [216] Simons, T., Lee, D.J., 2019. A review of binarized neural networks. Neural Networks, in: European Conference on Computer Vi- Electronics (Switzerland) 8. doi:10.3390/electronics8060661. sion, Springer. pp. 525–542. URL: http://arxiv.org/abs/1603. [217] Simonyan, K., Zisserman, A., 2014. 
Very Deep Convolutional 05279http://link.springer.com/10.1007/978-3-319-46493-0_32, Networks for Large-Scale Image Recognition, in: International doi:10.1007/978-3-319-46493-0{\_}32. Conference on Learning Representations(ICLR), pp. 1–14. URL: [201] Reed, R., 1993. Pruning Algorithms - A Survey. IEEE Transactions http://arxiv.org/abs/1409.1556. on Neural Networks 4, 740–747. URL: http://ieeexplore.ieee.org/ [218] Singh, P., Kumar Verma, V., Rai, P., Namboodiri, V.P., 2019. Play document/248452/, doi:10.1109/72.248452. and Prune: Adaptive Filter Pruning for Deep Model Compression, in: [202] Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi, S., Kep- Proceedings of the Twenty-Eighth International Joint Conference on ner, J., 2019. Survey and Benchmarking of Machine Learning Ac- Artificial Intelligence, International Joint Conferences on Artificial celerators, in: 2019 IEEE High Performance Extreme Computing Intelligence Organization, California. pp. 3460–3466. URL: https:// Conference (HPEC), IEEE. pp. 1–9. URL: https://ieeexplore.ieee. www.ijcai.org/proceedings/2019/480, doi:10.24963/ijcai.2019/480. org/document/8916327/, doi:10.1109/HPEC.2019.8916327. [219] Society, I.C., Committee, M.S., 2008. IEEE Standard for Floating- [203] Richard Chuang, Oliyide, O., Garrett, B., 2020. Introducing the Point Arithmetic. IEEE Std 754-2008 2008, 1–70. doi:10.1109/ Intel® Vision Accelerator Design with Intel® Arria® 10 FPGA. IEEESTD.2008.4610935. White Paper . [220] Soudry, D., Hubara, I., Meir, R., 2014. Expectation backpropagation: [204] Rodriguez, A., Segal, E., Meiri, E., Fomenko, E., Kim, Parameter-free training of multilayer neural networks with continuous Y.J., Shen, H., 2018. Lower Numerical Precision Deep or discrete weights, in: Advances in Neural Information Processing Learning Inference and Training. Intel White Paper , 1– Systems (NIPS), pp. 963–971. URL: https://dl.acm.org/doi/abs/ 19URL: https://software.intel.com/sites/default/files/managed/ 10.5555/2968826.2968934. db/92/Lower-Numerical-Precision-Deep-Learning-Jan2018.pdf. [221] Srinivas, S., Babu, R.V., 2015. Data-free parameter pruning for [205] Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, Deep Neural Networks, in: Procedings of the British Machine Vi- R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., Montgomery, sion Conference 2015, British Machine Vision Association. pp. J., Maher, B., Nadathur, S., Olesen, J., Park, J., Rakhov, A., Smelyan- 1–31. URL: http://www.bmva.org/bmvc/2015/papers/paper031/index. skiy, M., Wang, M., 2018. Glow: Graph lowering compiler techniques htmlhttp://arxiv.org/abs/1507.06149, doi:10.5244/C.29.31. for neural networks. ArXiv preprint . [222] Srivastava, N., Hinton, G., ..., A.K.T.j.o.m., 2014, U., 2014. [206] Ruffy, F., Chahal, K., 2019. The State of Knowledge Distillation Dropout: a simple way to prevent neural networks from overfitting. for Classification. ArXiv preprint URL: http://arxiv.org/abs/1912. The journal of machine learning research 15, 1929–1958. URL: 10850. http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a. [207] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, pdf?utm_content=buffer79b43&utm_medium=social&utm_source= S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, twitter.com&utm_campaign=buffer, doi:10.5555/2627435.2670313. A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recogni- [223] Sun, J., Luo, X., Gao, H., Wang, W., Gao, Y., Yang, X., 2020. Cate- tion Challenge. 
International Journal of Computer Vision 115, 211– gorizing Malware via A Word2Vec-based Temporal Convolutional 252. URL: http://link.springer.com/10.1007/s11263-015-0816-y, Network Scheme. Journal of Cloud Computing 9. doi:10.1186/ doi:10.1007/s11263-015-0816-y. s13677-020-00200-y. [208] Saad, D., Marom, E., 1990. Training Feed Forward Nets with Binary [224] Sun, M., Song, Z., Jiang, X., Pan, J., Pang, Y., 2017. Learning Weights Via a Modified CHIR Algorithm. Complex Systems 4, 573– Pooling for Convolutional Neural Network. Neurocomputing 224, 96– 586. URL: https://www.complex-systems.com/pdf/04-5-5.pdf. 104. URL: http://dx.doi.org/10.1016/j.neucom.2016.10.049, doi:10. [209] Sabour, S., Frosst, N., Hinton, G.E., 2017. Dynamic routing between 1016/j.neucom.2016.10.049. capsules, in: Advances in Neural Information Processing Systems [225] Sze, V., Chen, Y.H.H., Yang, T.J.J., Emer, J.S., 2017. Efficient (NIPS), pp. 3857–3867. Processing of Deep Neural Networks: A Tutorial and Survey. Pro- [210] Santurkar, S., Tsipras, D., Ilyas, A., Madry, A., 2018. How does ceedings of the IEEE 105, 2295–2329. URL: http://ieeexplore.

T Liang et al.: Preprint submitted to Elsevier Page 38 of 41 Survey on pruning and quantization

ieee.org/document/8114708/, doi:10.1109/JPROC.2017.2761740. [243] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J., 2016. Quantized Con- [226] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., volutional Neural Networks for Mobile Devices, in: IEEE/CVF Con- Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper ference on Computer Vision and Pattern Recognition (CVPR), IEEE. with convolutions, in: Proceedings of the IEEE Computer Society pp. 4820–4828. URL: http://arxiv.org/abs/1512.06473http:// Conference on Computer Vision and Pattern Recognition, IEEE. pp. ieeexplore.ieee.org/document/7780890/, doi:10.1109/CVPR.2016.521. 1–9. URL: http://ieeexplore.ieee.org/document/7298594/, doi:10. [244] Wu, S., Li, G., Chen, F., Shi, L., 2018a. Training and Inference with 1109/CVPR.2015.7298594. Integers in Deep Neural Networks, in: International Conference on [227] TansorFlow, . Fixed Point Quantization. URL: https://www. Learning Representations (ICLR). URL: http://arxiv.org/abs/1802. tensorflow.org/lite/guide. 04680. [228] Technologies, Q., 2019. Snapdragon Neural Processing Engine SDK. [245] Wu, S., Li, G., Deng, L., Liu, L., Wu, D., Xie, Y., Shi, L., 2019. URL: https://developer.qualcomm.com/docs/snpe/index.html. L1-Norm Batch Normalization for Efficient Training of Deep Neural [229] Tencent, 2019. NCNN is a high-performance neural network infer- Networks. IEEE Transactions on Neural Networks and Learning Sys- ence framework optimized for the mobile platform. URL: https: tems 30, 2043–2051. URL: https://ieeexplore.ieee.org/abstract/ //github.com/Tencent/ncnn. document/8528524/https://ieeexplore.ieee.org/document/8528524/, [230] Tishbirani, R., 1996. Regression shrinkage and selection via the doi:10.1109/TNNLS.2018.2876179. Lasso. URL: https://statweb.stanford.edu/~tibs/lasso/lasso.pdf. [246] Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L.S., Grau- [231] Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., man, K., Feris, R., 2018b. BlockDrop: Dynamic Inference Paths Jahre, M., Vissers, K., 2016. FINN: A Framework for Fast, Scal- in Residual Networks, in: IEEE/CVF Conference on Computer able Binarized Neural Network Inference. Proceedings of the 2017 Vision and Pattern Recognition (CVPR), IEEE. pp. 8817–8826. ACM/SIGDA International Symposium on Field-Programmable Gate URL: https://ieeexplore.ieee.org/document/8579017/, doi:10.1109/ Arrays - FPGA ’17 , 65–74URL: http://dl.acm.org/citation.cfm? CVPR.2018.00919. doid=3020078.3021744, doi:10.1145/3020078.3021744. [247] Xiaomi, 2019. MACE is a deep learning inference framework [232] Vanholder, H., 2016. Efficient Inference with TensorRT. Technical optimized for mobile heterogeneous computing platforms. URL: Report. https://github.com/XiaoMi/mace/. [233] Vanhoucke, V., Senior, A., Mao, M.Z., 2011. Improving the speed [248] Xilinx, Inc, 2018. Accelerating DNNs with Xilinx Alveo Accelerator of neural networks on CPUs URL: https://research.google/pubs/ Cards (WP504). White Paper 504, 1–11. URL: www.xilinx.com1. pub37631/. [249] Xu, J., Huan, Y., Zheng, L.R., Zou, Z., 2019. A Low-Power Arith- [234] Venieris, S.I., Kouris, A., Bouganis, C.S., 2018. Toolflows for Map- metic Element for Multi-Base Logarithmic Computation on Deep ping Convolutional Neural Networks on FPGAs. ACM Comput- Neural Networks, in: International System on Chip Conference, IEEE. ing Surveys 51, 1–39. URL: http://dl.acm.org/citation.cfm?doid= pp. 260–265. URL: https://ieeexplore.ieee.org/document/8618560/, 3212709.3186332, doi:10.1145/3186332. 
doi:10.1109/SOCC.2018.8618560. [235] Venkatesh, G., Nurvitadhi, E., Marr, D., 2017. Accelerating [250] Xu, S., Huang, A., Chen, L., Zhang, B., 2020. Convolutional Neural Deep Convolutional Networks using low-precision and sparsity, Network Pruning: A Survey, in: 2020 39th Chinese Control Confer- in: 2017 IEEE International Conference on Acoustics, Speech ence (CCC), IEEE. pp. 7458–7463. URL: https://ieeexplore.ieee. and Signal Processing (ICASSP), IEEE. pp. 2861–2865. URL: org/document/9189610/, doi:10.23919/CCC50068.2020.9189610. https://arxiv.org/pdf/1610.00324.pdfhttp://ieeexplore.ieee.org/ [251] Xu, X., Lu, Q., Yang, L., Hu, S., Chen, D., Hu, Y., Shi, Y., 2018a. document/7952679/, doi:10.1109/ICASSP.2017.7952679. Quantization of Fully Convolutional Networks for Accurate Biomed- [236] Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S., 2019a. HAQ: Hardware- ical Image Segmentation, in: Proceedings of the IEEE/CVF Con- Aware Automated Quantization With Mixed Precision, in: 2019 ference on Computer Vision and Pattern Recognition (CVPR), pp. IEEE/CVF Conference on Computer Vision and Pattern Recognition 8300–8308. doi:10.1109/CVPR.2018.00866. (CVPR), IEEE. pp. 8604–8612. URL: http://arxiv.org/abs/1811. [252] Xu, Z., Hsu, Y.C., Huang, J., 2018b. Training shallow and thin net- 08886https://ieeexplore.ieee.org/document/8954415/, doi:10.1109/ works for acceleration via knowledge distillation with conditional CVPR.2019.00881. adversarial networks, in: International Conference on Learning Rep- [237] Wang, N., Choi, J., Brand, D., Chen, C.Y., Gopalakrishnan, K., 2018a. resentations (ICLR) - Workshop. Training deep neural networks with 8-bit floating point numbers, in: [253] Yang, J., Shen, X., Xing, J., Tian, X., Li, H., Deng, B., Huang, J., Hua, Advances in Neural Information Processing Systems (NIPS), pp. X.s., 2019. Quantization Networks, in: 2019 IEEE/CVF Conference 7675–7684. on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. [238] Wang, P., Cheng, J., 2017. Fixed-Point Factorized Networks, in: 7300–7308. URL: https://ieeexplore.ieee.org/document/8953531/, IEEE/CVF Conference on Computer Vision and Pattern Recognition doi:10.1109/CVPR.2019.00748. (CVPR), IEEE. pp. 3966–3974. URL: http://ieeexplore.ieee.org/ [254] Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., Li, G., 2020. Training document/8099905/, doi:10.1109/CVPR.2017.422. high-performance and large-scale deep neural networks with full 8-bit [239] Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., Cheng, J., 2018b. integers. Neural Networks 125, 70–82. doi:10.1016/j.neunet.2019. Two-Step Quantization for Low-bit Neural Networks, in: Proceed- 12.027. ings of the IEEE/CVF Conference on Computer Vision and Pattern [255] Ye, J., Lu, X., Lin, Z., Wang, J.Z., 2018. Rethinking the Smaller- Recognition (CVPR), pp. 4376–4384. doi:10.1109/CVPR.2018.00460. Norm-Less-Informative Assumption in Channel Pruning of Convolu- [240] Wang, Z., Lu, J., Tao, C., Zhou, J., Tian, Q., 2019b. Learning tion Layers. ArXiv preprint URL: http://arxiv.org/abs/1802.00124. channel-wise interactions for binary convolutional neural networks, [256] Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J., 2019. in: Proceedings of the IEEE/CVF Conference on Computer Vision Blended coarse gradient descent for full quantization of deep neu- and Pattern Recognition (CVPR), pp. 568–577. doi:10.1109/CVPR. ral networks. Research in Mathematical Sciences 6. doi:10.1007/ 2019.00066. s40687-018-0177-6. [241] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H., 2016. 
Learning [257] Yogatama, D., Mann, G., 2014. Efficient Transfer Learning Method Structured Sparsity in Deep Neural Networks, in: Advances in for Automatic Hyperparameter Tuning, in: Kaski, S., Corander, J. Neural Information Processing Systems (NIPS), IEEE. pp. 2074– (Eds.), Proceedings of the Seventeenth International Conference on 2082. URL: https://dl.acm.org/doi/abs/10.5555/3157096.3157329, Artificial Intelligence and Statistics, PMLR, Reykjavik, Iceland. pp. doi:10.1016/j.ccr.2008.06.009. 1077–1085. URL: http://proceedings.mlr.press/v33/yogatama14. [242] Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P., 2020. Integer html. quantization for deep learning inference: Principles and empirical [258] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S., evaluation. ArXiv preprint , 1–20. 2017. Scalpel: Customizing DNN pruning to the underlying hardware

T Liang et al.: Preprint submitted to Elsevier Page 39 of 41 Survey on pruning and quantization

parallelism. ACM SIGARCH Computer Architecture News 45, 548– Low Bitwidth Gradients. ArXiv preprint abs/1606.0, 1–13. URL: 560. URL: http://dl.acm.org/citation.cfm?doid=3140659.3080215, https://arxiv.org/abs/1606.06160. doi:10.1145/3140659.3080215. [273] Zhou, S.C., Wang, Y.Z., Wen, H., He, Q.Y., Zou, Y.H., 2017b. Bal- [259] Yu, J., Yang, L., Xu, N., Yang, J., Huang, T., 2018. Slimmable Neu- anced Quantization: An Effective and Efficient Approach to Quan- ral Networks, in: International Conference on Learning Representa- tized Neural Networks. Journal of Computer Science and Technology tions(ICLR), International Conference on Learning Representations, 32, 667–682. doi:10.1007/s11390-017-1750-y. ICLR. pp. 1–12. URL: http://arxiv.org/abs/1812.08928. [274] Zhu, C., Han, S., Mao, H., Dally, W.J., 2017. Trained Ternary Quan- [260] Yuan, M., Lin, Y., 2006. Model selection and estimation tization, in: International Conference on Learning Representations in regression with grouped variables. Journal of the Royal (ICLR), pp. 1–10. URL: http://arxiv.org/abs/1612.01064. Statistical Society: Series B (Statistical Methodology) 68, 49– [275] Zhu, F., Gong, R., Yu, F., Liu, X., Wang, Y., Li, Z., Yang, X., Yan, J., 67. URL: http://doi.wiley.com/10.1111/j.1467-9868.2005.00532.x, . Towards Unified INT8 Training for Convolutional Neural Network, doi:10.1111/j.1467-9868.2005.00532.x. in: Proceedings of the IEEE/CVF Conference on Computer Vision [261] Yuan, Z., Hu, J., Wu, D., Ban, X., 2020. A dual-attention recurrent and Pattern Recognition (CVPR). URL: http://arxiv.org/abs/1912. neural network method for deep cone thickener underflow concen- 12607. tration prediction. Sensors (Switzerland) 20, 1–18. doi:10.3390/ [276] Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I., 2019. Structured s20051260. binary neural networks for accurate image classification and semantic [262] Zhang, D., Yang, J., Ye, D., Hua, G., 2018. LQ-Nets: Learned segmentation. Proceedings of the IEEE Computer Society Conference quantization for highly accurate and compact deep neural networks, on Computer Vision and Pattern Recognition 2019-June, 413–422. in: Lecture Notes in Computer Science (including subseries Lecture doi:10.1109/CVPR.2019.00050. Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), [277] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2017. Learning pp. 373–390. doi:10.1007/978-3-030-01237-3{\_}23. Transferable Architectures for Scalable Image Recognition. Pro- [263] Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B., 2019a. ceedings of the IEEE/CVF Conference on Computer Vision and Recent Advances in Convolutional Neural Network Acceleration. Pattern Recognition (CVPR) , 8697–8710URL: https://ieeexplore. Neurocomputing 323, 37–51. URL: https://linkinghub.elsevier. ieee.org/abstract/document/8579005/. com/retrieve/pii/S0925231218311007, doi:10.1016/j.neucom.2018.09. 038. [264] Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T., Chen, Y., 2016a. Cambricon-X: An accelerator for sparse neural networks, in: 2016 49th Annual IEEE/ACM Inter- national Symposium on (MICRO), IEEE. pp. 1–12. URL: http://ieeexplore.ieee.org/document/7783723/, doi:10.1109/ MICRO.2016.7783723. [265] Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R., Bengio, Y., 2016b. Architectural complexity measures of recurrent neural networks, in: Advances in Neural Information Processing Systems (NIPS), pp. 1830–1838. [266] Zhang, Y., Zhao, C., Ni, B., Zhang, J., Deng, H., 2019b. 
Exploiting Channel Similarity for Accelerating Deep Convolutional Neural Net- works. ArXiv preprint , 1–14URL: http://arxiv.org/abs/1908.02620. [267] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M., Gupta, R., Zhang, Z., 2017. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, in: Proceed- ings of the 2017 ACM/SIGDA International Symposium on Field- Programmable Gate Arrays - FPGA ’17, ACM Press, New York, New York, USA. pp. 15–24. URL: http://dl.acm.org/citation.cfm?doid= 3020078.3021741, doi:10.1145/3020078.3021741. [268] Zhong, K., Zhao, T., Ning, X., Zeng, S., Guo, K., Wang, Y., Yang, H., 2020. Towards Lower Bit Multiplication for Convolutional Neural Network Training. ArXiv preprint URL: http://arxiv.org/abs/2006. 02804. [269] Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y., 2017a. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights, in: International Conference on Learning Representations(ICLR). URL: https://github.com/Zhouaojun/Incremental-http://arxiv.org/ abs/1702.03044http://cn.arxiv.org/pdf/1702.03044.pdf. [270] Zhou, H., Alvarez, J.M., Porikli, F., 2016a. Less Is More: To- wards Compact CNNs, in: European Conference on Computer Vision, pp. 662–677. URL: https://link.springer.com/chapter/ 10.1007/978-3-319-46493-0_40http://link.springer.com/10.1007/ 978-3-319-46493-0_40, doi:10.1007/978-3-319-46493-0{\_}40. [271] Zhou, S., Kannan, R., Prasanna, V.K., 2018. Accelerating low rank matrix completion on FPGA, in: 2017 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2017, IEEE. pp. 1–7. URL: http://ieeexplore.ieee.org/document/8279771/, doi:10. 1109/RECONFIG.2017.8279771. [272] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y., 2016b. DoReFa- Net: Training Low Bitwidth Convolutional Neural Networks with

T Liang et al.: Preprint submitted to Elsevier Page 40 of 41 Survey on pruning and quantization

Tailin Liang received the B.E. degree in Computer Science and the B.B.A. degree from the University of Science and Technology Beijing in 2017. He is currently working toward a Ph.D. degree in Computer Science at the School of Computer and Communication Engineering, University of Science and Technology Beijing. His current research interests include deep learning domain-specific processors and co-designed optimization algorithms.

John Glossner received the Ph.D. degree in Electrical Engineering from TU Delft in 2001. He is the Director of the Computer Architecture, Heterogeneous Computing, and AI Lab at the University of Science and Technology Beijing. He is also the CEO of Optimum Semiconductor Technologies and President of both the Heterogeneous System Architecture Foundation and the Wireless Innovation Forum. John's research interests include the design of heterogeneous computing systems, computer architecture, embedded systems, digital signal processors, software defined radios, artificial intelligence algorithms, and machine learning systems.

Lei Wang received the B.E. and Ph.D. degrees in 2006 and 2012 from the University of Science and Technology Beijing. He then served as an assistant researcher at the Institute of Automation of the Chinese Academy of Sciences during 2012-2015. He was a joint Ph.D. student in Electronic Engineering at The University of Texas at Dallas during 2009-2011. Currently, he is an adjunct professor at the School of Computer and Communication Engineering, University of Science and Technology Beijing.

Shaobo Shi received the B.E. and Ph.D. degrees in 2008 and 2014 from the University of Science and Technology Beijing. He then served as an assistant researcher at the Institute of Automation of the Chinese Academy of Sciences during 2014-2017. Currently, he is a deep learning domain-specific processor engineer at Huaxia General Processor Technology. He also serves as an adjunct professor at the School of Computer and Communication Engineering, University of Science and Technology Beijing.

Xiaotong Zhang received the M.E. and Ph.D. degrees from the University of Science and Technology Beijing in 1997 and 2000, respectively, where he was a professor of Computer Science and Technology. His research interests include the quality of wireless channels and networks, wireless sensor networks, network management, cross-layer design and resource allocation of broadband and wireless networks, and the signal processing of communication and computer architecture.
