Filter Sketch for Network Pruning

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 1 Filter Sketch for Network Pruning Mingbao Lin, Liujuan Cao, Shaojie Li, Qixiang Ye, Senior Member, IEEE, Yonghong Tian, Senior Member, IEEE, Jianzhuang Liu, Senior Member, IEEE, Qi Tian Fellow, IEEE, Rongrong Ji, Senior Member, IEEE, Abstract—We propose a novel network pruning approach by Existing structured pruning approaches can be classified information preserving of pre-trained network weights (filters). into three categories: (1) Regularization-based pruning, which Network pruning with the information preserving is formulated introduces sparse constraint [12], [13], [10] and mask scheme as a matrix sketch problem, which is efficiently solved by the off-the-shelf Frequent Direction method. Our approach, referred [11] in the training process. Despite its simplicity, this kind to as FilterSketch, encodes the second-order information of pre- of approaches usually requires to train from scratch and trained weights, which enables the representation capacity of therefore is computationally expensive. (2) Property-based pruned networks to be recovered with a simple fine-tuning pruning, which picks up a specific property of a pre-trained procedure. FilterSketch requires neither training from scratch network, e.g., ` -norm [14] and/or ratio of activations [15], and nor data-driven iterative optimization, leading to a several-orders- 1 of-magnitude reduction of time cost in the optimization of prun- simply removes filters with less importance. However, many ing. Experiments on CIFAR-10 show that FilterSketch reduces of these approaches require to recursively prune the filters of 63.3% of FLOPs and prunes 59.9% of network parameters with each single pre-trained layer and fine-tune, which is very costly. negligible accuracy cost for ResNet-110. On ILSVRC-2012, it (3) Reconstruction-based pruning [16], [17], which imposes reduces 45.5% of FLOPs and removes 43.0% of parameters with relaxed constraints to the optimization procedure. Nevertheless, only 0.69% accuracy drop for ResNet-50. Our code and pruned models can be found at https://github.com/lmbxmu/FilterSketch. the optimization procedure in each layer is typically data-driven and/or iterative [16], [17], which brings a heavy optimization burden. Index Terms—network pruning, sketch, filter pruning, structured pruning, network compression & acceleration, information In this paper, we propose the FilterSketch approach, which, preserving by encoding the second-order information of pre-trained network weights, provides a new perspective for deep CNN compression. FilterSketch is inspired by the fact that preserving I. INTRODUCTION the second-order covariance of a matrix is equal to maximizing the correlation of multi-variate data [18]. The representation EEP convolutional neural networks (CNNs) typically of the second-order information has been demonstrated to be result in significant memory requirement and computa- D effective in many other tasks [19], [20], [21], and for the first tional cost, hindering their deployment on front-end systems of time, we apply it to network pruning in this paper. limited storage and computational power. Consequently, there Instead of simply discarding the unimportant filters, FilterS- is a growing need for reduction of model size by parameter ketch preserves the second-order information of the pre-trained quantization [1], [2], low-rank decomposition [3], [4], and model in the pruned model as shown in Fig. 1. For each layer network pruning [5], [6]. Early pruning works [7], [8] use in the pre-trained model, FilterSketch learns a set of new unstructured methods to obtain irregular sparsity of filters. parameters for the pruned model which maintains the second- Recent works pay more attention to structured pruning [9], order covariance of the pre-trained model. The new group of [10], [11], which pursues simultaneously reducing model size sketched parameters then serves as a warm-up for fine-tuning and improving computational efficiency, facilitating model the pruned network. The warm-up provides an excellent ability deployment on general-purpose hardware and/or usage of basic to recover the model performance. arXiv:2001.08514v4 [cs.CV] 25 May 2021 linear algebra subprograms (BLAS) libraries. We show that preserving the second-order information can be approximated as a matrix sketch problem, which can then M. Lin, S. Li and R. Ji are with the Media Analytics and Computing be efficiently solved by the off-the-shelf Frequent Direction Laboratory, Department of Artificial Intelligence, School of Informatics, method [22], leading to a several-orders-of-magnitude reduction Xiamen University, Xiamen 361005, China. of optimization time in pruning. FilterSketch thus involves L. Cao (Corresponding Author) is with the Fujian Key Laboratory of Sensing and Computing for Smart City, Computer Science Department, neither complex regularization to restart retraining nor data- School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: driven iterative optimization to approximate the covariance [email protected]) information of the pre-trained model. R. Ji is also with Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China. Q. Ye is with School of Electronic, Electrical and Communication Engi- neering, University of Chinese Academy of Sciences, Beijing 101408, China. II. RELATED WORK Y. Tian is with the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. Unstructured pruning and structured pruning are two major J. Liu is with the Noah’s Ark Lab, Huawei Technologies Co. Ltd., Shenzhen lines of methods for network model compression. In a broader 518129, China. view, e.g., parameter quantization and low-rank decomposition Q. Tian is with Cloud BU, Huawei Technologies Co. Ltd., Shenzhen 518129, China. can be integrated with network pruning to achieve higher Manuscript received April 19, 2005; revised August 26, 2015. compression and speedup. We give a brief discussion over IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 2 �#×�# �#×�# �#×�#'(×ℎ#×�# �#̃ ×�#'(×ℎ#×�# … … … Reconstruction Pre-trained Model Error Pruned Model Minimize Reshape Reshape �# = �#'(×ℎ#×�# �# = �#'(×ℎ#×�# Data Stream �̃ Matrix Sketch # 2 …… �̃ ×� �̃ ×� …… # # �#̃ ×�# # # �#×�# Fig. 1. Framework of FilterSketch. The upper part displays the second-order covariance approximation between the pre-trained model and the pruned model at the i-th layer. The lower part shows the approximation is achieved effectively and efficiently by the stream data based matrix sketch [22]. some related works in the following and refer readers to the [12], [10] impose a sparse property on the scaling factor of survey paper [23] for a more detailed overview. the batch normalization layer with the deterministic `1-norm Unstructured Pruning. Pruning a neural network to a or dynamical distribution of channel saliency. After re-training, reasonable smaller size and for a better generalization, has long the channels below a threshold are discarded correspondingly. been investigated. As a pioneer work, the second-order Taylor Huang et al.[13] proposed a data-driven sparse structure expansion [7] is utilized to select less important parameters selection by introducing scaling factors to scale the outputs for deletion. Sum et al. [24] introduced an extended Kalman of the pruned structure and added the sparsity constraint on filter (EKF) training to measure the importance of a weight these scaling factors. Lin et al.[11] proposed to minimize an in a network. Given the recursive Bayesian property of EKF, objective function with l1-regularization on a soft mask via they further considerd the sensitivity of a posterior probability a generative adversarial learning and adopted the knowledge as a measure of weight importance, which is then applied to distillation for optimization. prune in a nonstationary environment [25] or recurrent neural Property-based pruning tries to figure out a discriminative networks [26]. property of pre-trained CNN models and discards filters of Han et al. [8] introduced an iterative weight pruning method less importance. Hu et al. [15] utilized the abundant zero by fine-tuning with a strong l2 regularization and discarding activations in a large network and iteratively prunes filters with the small weights with values below a threshold. Group a higher percentage of zero outputs in a layer-wise fashion. Li et sparsity based regularization of network parameters [27] is al. [14] used the sum of absolute values of filters as a metric leveraged to penalize unimportant parameters. Further, [28] to measure the importance of filters, and assumes filters with prunes parameters based on the second-order derivatives of smaller values are less informative and thus should be pruned a layer-wise error function, while [29] implements CNNs in first. In [31], the importance scores of the final responses are the frequency domain and apply 2-D DCT transformation propagated to every filter in the network and the CNN is pruned to sparsify the coefficients for spatial redundancy removal. by removing the filter with the least importance. Polyak et The lottery ticket hypothesis [30] sets the weights below a al.[32] considered the contribution variance of each channel threshold to zero, rewinds the rest of the weights to their and removed filters that have low-variance outputs. It shows a initial configuration, and then retrains the network from this good ability to accelerate the deep face

Load more