IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 1

Filter Sketch for Network Pruning Mingbao Lin, Liujuan Cao, Shaojie Li, Qixiang Ye, Senior Member, IEEE, Yonghong Tian, Senior Member, IEEE, Jianzhuang Liu, Senior Member, IEEE, Qi Tian Fellow, IEEE, Rongrong Ji, Senior Member, IEEE,

Abstract—We propose a novel network pruning approach by Existing structured pruning approaches can be classified information preserving of pre-trained network weights (filters). into three categories: (1) Regularization-based pruning, which Network pruning with the information preserving is formulated introduces sparse constraint [12], [13], [10] and mask scheme as a matrix sketch problem, which is efficiently solved by the off-the-shelf Frequent Direction method. Our approach, referred [11] in the training process. Despite its simplicity, this kind to as FilterSketch, encodes the second-order information of pre- of approaches usually requires to train from scratch and trained weights, which enables the representation capacity of therefore is computationally expensive. (2) Property-based pruned networks to be recovered with a simple fine-tuning pruning, which picks up a specific property of a pre-trained procedure. FilterSketch requires neither training from scratch network, e.g., ` -norm [14] and/or ratio of activations [15], and nor data-driven iterative optimization, leading to a several-orders- 1 of-magnitude reduction of time cost in the optimization of prun- simply removes filters with less importance. However, many ing. Experiments on CIFAR-10 show that FilterSketch reduces of these approaches require to recursively prune the filters of 63.3% of FLOPs and prunes 59.9% of network parameters with each single pre-trained layer and fine-tune, which is very costly. negligible accuracy cost for ResNet-110. On ILSVRC-2012, it (3) Reconstruction-based pruning [16], [17], which imposes reduces 45.5% of FLOPs and removes 43.0% of parameters with relaxed constraints to the optimization procedure. Nevertheless, only 0.69% accuracy drop for ResNet-50. Our code and pruned models can be found at https://github.com/lmbxmu/FilterSketch. the optimization procedure in each layer is typically data-driven and/or iterative [16], [17], which brings a heavy optimization burden. Index Terms—network pruning, sketch, filter pruning, struc- tured pruning, network compression & acceleration, information In this paper, we propose the FilterSketch approach, which, preserving by encoding the second-order information of pre-trained network weights, provides a new perspective for deep CNN compression. FilterSketch is inspired by the fact that preserving I.INTRODUCTION the second-order covariance of a matrix is equal to maximizing the correlation of multi-variate data [18]. The representation EEP convolutional neural networks (CNNs) typically of the second-order information has been demonstrated to be result in significant memory requirement and computa- D effective in many other tasks [19], [20], [21], and for the first tional cost, hindering their deployment on front-end systems of time, we apply it to network pruning in this paper. limited storage and computational power. Consequently, there Instead of simply discarding the unimportant filters, FilterS- is a growing need for reduction of model size by parameter ketch preserves the second-order information of the pre-trained quantization [1], [2], low-rank decomposition [3], [4], and model in the pruned model as shown in Fig. 1. For each layer network pruning [5], [6]. Early pruning works [7], [8] use in the pre-trained model, FilterSketch learns a set of new unstructured methods to obtain irregular sparsity of filters. parameters for the pruned model which maintains the second- Recent works pay more attention to structured pruning [9], order covariance of the pre-trained model. The new group of [10], [11], which pursues simultaneously reducing model size sketched parameters then serves as a warm-up for fine-tuning and improving computational efficiency, facilitating model the pruned network. The warm-up provides an excellent ability deployment on general-purpose hardware and/or usage of basic to recover the model performance. arXiv:2001.08514v4 [cs.CV] 25 May 2021 linear algebra subprograms (BLAS) libraries. We show that preserving the second-order information can be approximated as a matrix sketch problem, which can then M. Lin, S. Li and R. Ji are with the Media Analytics and Computing be efficiently solved by the off-the-shelf Frequent Direction Laboratory, Department of Artificial Intelligence, School of Informatics, method [22], leading to a several-orders-of-magnitude reduction Xiamen University, Xiamen 361005, China. of optimization time in pruning. FilterSketch thus involves L. Cao (Corresponding Author) is with the Fujian Key Laboratory of Sensing and Computing for Smart City, Computer Science Department, neither complex regularization to restart retraining nor data- School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: driven iterative optimization to approximate the covariance [email protected]) information of the pre-trained model. R. Ji is also with Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China. Q. Ye is with School of Electronic, Electrical and Communication Engi- neering, University of Chinese Academy of Sciences, Beijing 101408, China. II.RELATED WORK Y. Tian is with the School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. Unstructured pruning and structured pruning are two major J. Liu is with the Noah’s Ark Lab, Huawei Technologies Co. Ltd., Shenzhen lines of methods for network model compression. In a broader 518129, China. view, e.g., parameter quantization and low-rank decomposition Q. Tian is with Cloud BU, Huawei Technologies Co. Ltd., Shenzhen 518129, China. can be integrated with network pruning to achieve higher Manuscript received April 19, 2005; revised August 26, 2015. compression and speedup. We give a brief discussion over IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 2

�×� �×� �×�×ℎ×� �̃ ×�×ℎ×�

… … … Reconstruction Pre-trained Model Error Pruned Model Minimize

Reshape Reshape � = �×ℎ×� � = �×ℎ×� Data Stream �̃ Matrix Sketch 2 ……

�̃ ×� �̃ ×� …… �̃ ×�

��

Fig. 1. Framework of FilterSketch. The upper part displays the second-order covariance approximation between the pre-trained model and the pruned model at the i-th layer. The lower part shows the approximation is achieved effectively and efficiently by the stream data based matrix sketch [22]. some related works in the following and refer readers to the [12], [10] impose a sparse property on the scaling factor of survey paper [23] for a more detailed overview. the batch normalization layer with the deterministic `1-norm Unstructured Pruning. Pruning a neural network to a or dynamical distribution of channel saliency. After re-training, reasonable smaller size and for a better generalization, has long the channels below a threshold are discarded correspondingly. been investigated. As a pioneer work, the second-order Taylor Huang et al.[13] proposed a data-driven sparse structure expansion [7] is utilized to select less important parameters selection by introducing scaling factors to scale the outputs for deletion. Sum et al. [24] introduced an extended Kalman of the pruned structure and added the sparsity constraint on filter (EKF) training to measure the importance of a weight these scaling factors. Lin et al.[11] proposed to minimize an in a network. Given the recursive Bayesian property of EKF, objective function with l1-regularization on a soft mask via they further considerd the sensitivity of a posterior probability a generative adversarial learning and adopted the knowledge as a measure of weight importance, which is then applied to distillation for optimization. prune in a nonstationary environment [25] or recurrent neural Property-based pruning tries to figure out a discriminative networks [26]. property of pre-trained CNN models and discards filters of Han et al. [8] introduced an iterative weight pruning method less importance. Hu et al. [15] utilized the abundant zero by fine-tuning with a strong l2 regularization and discarding activations in a large network and iteratively prunes filters with the small weights with values below a threshold. Group a higher percentage of zero outputs in a layer-wise fashion. Li et sparsity based regularization of network parameters [27] is al. [14] used the sum of absolute values of filters as a metric leveraged to penalize unimportant parameters. Further, [28] to measure the importance of filters, and assumes filters with prunes parameters based on the second-order derivatives of smaller values are less informative and thus should be pruned a layer-wise error function, while [29] implements CNNs in first. In [31], the importance scores of the final responses are the frequency domain and apply 2-D DCT transformation propagated to every filter in the network and the CNN is pruned to sparsify the coefficients for spatial redundancy removal. by removing the filter with the least importance. Polyak et The lottery ticket hypothesis [30] sets the weights below a al.[32] considered the contribution variance of each channel threshold to zero, rewinds the rest of the weights to their and removed filters that have low-variance outputs. It shows a initial configuration, and then retrains the network from this good ability to accelerate the deep face networks. Centripetal configuration. SGD [33] constrains filters within the same cluster to move Though progress has been made, unstructured pruning towards a center, and thus removes the identical filters without requires specialized hardware or software supports to speed the necessity of fine-tuning. EigenDamage [34] decorrelates up inference. It has limited applications on general-purpose filter weight on top of the Kronecker-factored eigenbasis to hardware or BLAS libraries in practice, due to the irregular enable weights to be approximately independent, meanwhile it sparsity in weight tensors. allows a global ranking of filter importance in Hessian-based Structured Pruning. Compared to unstructured pruning, pruning. He et al.[35] proposed to calculate the geometric structured pruning does not have limitations on specialized median in each layer, and the filters closest to this are pruned. hardware or software since the entire filters are removed, and Optimization-based pruning leverages layer-wise optimiza- thereby it is more favorable in accelerating CNNs. tion to minimize the reconstruction error between the full model To this end, regularization-based pruning techniques require and the pruned model. He et al. [16] presented a LASSO-based a joint-retraining from scratch to derive the values of filters filter selection strategy to identify representative filters and a such that they can be made sufficiently small. To that effect, least square reconstruction error to reconstruct the outputs. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 3

3 3 4 × 4 4 ×1 0 ×1 0 ×1 0 1 0 ×1 0 5 . 0 5 . 0 2 . 8 3 . 0 8 . 0 4 . 0 4 . 0 2 . 1 2 . 0 6 . 0 3 . 0 3 . 0

1 . 4 4 . 0 2 . 0 2 . 0 1 . 0 0 . 7 1 . 0 1 . 0 2 . 0

0 . 0 0 . 0 0 . 0 0 . 0 0 . 0 - 0 . 1 0 - 0 . 0 5 0 . 0 0 0 . 0 5 0 . 1 0 - 0 . 1 0 - 0 . 0 5 0 . 0 0 0 . 0 5 0 . 1 0 - 0 . 0 8 - 0 . 0 4 0 . 0 0 0 . 0 4 0 . 0 8 - 0 . 0 8 - 0 . 0 4 0 . 0 0 0 . 0 4 0 . 0 8 - 0 . 0 5 - 0 . 0 3 0 . 0 0 0 . 0 3 0 . 0 5 C o n v 2 0 _ 1 C o n v 2 3 _ 2 C o n v 3 4 _ 3 C o n v 3 3 _ 2 C o n v 4 2 _ 1

Fig. 2. Histograms of weights at different layers of ResNet-50, which have approximate zero means. Conv20_1 denotes the 1st filter in the 20th convolutional layer and the same with others. Note that similar statistical result can be observed in other convolutional networks as well.

These two steps are iteratively executed until convergence. In The goal is to search for the pruned model F, a set of i L i i c˜i di×c˜i [17], Luo et al. reconstructed the information from transformed filters Ω = {Ω }i=1 with Ω = {Ωj}j=1 ∈ R the next layer to guide the importance evaluation of filters from and c˜i = bpi ·cie, where pi is the pruning rate for the i-th layer the current layer. A set of training samples is used to deduce and 0 < pi ≤ 1. b·e rounds the input to its nearest integer. a closed-form solution. To learn Ωi for each layer, predominant approaches are The proposed FilterSketch can be grouped into optimization- often divided into three streams: (1) Retraining CNNs from based pruning but differs from [16], [17] in two aspects: First, it scratch by imposing human-designed regularizations into the preserves the second-order information of pre-trained weights, training loss [13], [11]. (2) Measuring the importance of filters leading to quick accuracy recovery without the requirement of via an intrinsic property of CNNs [14], [31]. (3) Minimizing training from scratch or layer-wise fine-tuning. Second, it can be the reconstruction error [16], [17] for pruning optimization. formulated as the matrix sketch problem and solved by the off- Nevertheless, these methods solely consider the first-order the-shelf Frequent Direction (FD) method, leading to a several- statistics, while missing the covariance information. orders-of-magnitude reduction of time consumption without introducing data-driven and/or iterative optimization procedure. B. Information Preserving We note that [36] conducts a complex tensor sketch for network In this study, we devise a novel second-order covariance approximation. Differently, our FilterSketch uses the efficient preserving scheme, which provides a good warm-up for fine- FD algorithm for the goal of information preserving. The tuning the pruned network. Different from existing works [19], network pruning and network approximation can be combined [20], [21] where the covariance statistics of feature maps are to further reduce the network size, which will be our future calculated, we aim to preserve the covariance information of work. filters. Note that, [7], [37], [28], [38] also exploit the second order For each W i ∈ Rdi×ci , our second-order preserving scheme information for the network pruning. Nevertheless, our second- aims to find a filter matrix Ωi ∈ Rdi×c˜i , which contains only order information fundamentally differs from [7], [37], [28], c˜i ≤ ci columns but preserves sufficient covariance information [38]. First, the second-order information of [7], [37], [28], of W i, as: [38] lies in the second-order derivatives while our FilterSketch Σ i ≈ Σ i , (1) lies in the second-order covariance. Second, the second-order W Ω derivatives in [7], [37], [28], [38] are used as a measure to where ΣW i and ΣΩi respectively denote the covariance identify unimportant weights/channels while our FilterSketch matrices of W i and Ωi and are defined as: aims to preserve the covariance information in the preserved i ¯ i i ¯ i T ΣW i = (W − W )(W − W ) , (2) weights. Third, [7], [37], [28], [38] involve heavy computation i i i i T Σ i = (Ω − Ω¯ )(Ω − Ω¯ ) , (3) for constructing Hessian Matrix while our FilterSketch is Ω implemented in a computationally cheap manner via the where W¯ i = 1 Pci W i and Ω¯ i = 1 Pc˜i Ωi are the mean ci j=1 j c˜i j=1 j Frequent Direction. values of the filters in the i-th layer for the full model and pruned model, respectively. III.THE PROPOSED APPROACH The covariance ΣW i can effectively measure the pairwise interactions between the pre-trained filters. A key ingredient A. Notations of FilterSketch is that it can well preserve the correlation We start with notation definitions. Given a pre-trained CNN information of W i in Ωi. Through this, it yields a more model F , which contains L convolutional layers, and a set expressive and informative Ωi for fine-tuning, as validated i L i i ci di×ci of filters W = {W }i=1 with W = {Wj }j=1 ∈ R and in Sec. IV. di = ci−1 × hi × wi, where ci, hi and wi respectively denote To preserve the covariance information in Eq. 1, we formulate the channel number, filter height and width of the i-th layer. the following objective function: i i W is the filter set for the i-th layer and Wj is the j-th filter in the i-th layer. arg min kΣW i − ΣΩi kF , (4) Ωi IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 4

where k · kF denotes the Frobenius norm. Based on Eq. 2 and Algorithm 1 Frequent Direction [22]. i di×ci Eq. 3, Eq. 4 is expanded as: Require: Matrix W ∈ R , sketched size c˜i. Ensure: Sketched matrix Ωi ∈ di×c˜i . arg min (W i − W¯ i)(W i − W¯ i)T − (Ωi − Ω¯ i)(Ωi − Ω¯ i)T . R F i di×c˜i Ωi 1: Ω ← all zeros matrix ∈ R . i i (5) 2: for each column Wj in W do 3: Insert W i into a zero valued column of Ωi. We statistically observe that W¯ i ≈ 0. As illustrated in j 4: if Ωi has no zero valued columns then Fig. 2, the pre-trained weights intend to follow a zero-mean 5: [U, S, V ] = SVD(Ωi) Gaussian-like weight distribution, which is also discussed in 2 c˜ ( i ) S 6: δ = s c˜i . # The squared 2 -th entry of . many previous works [39], [40], [41], [42], [43]. The potential 2 ˆ p 2 c˜i×c˜i reason behind might be that the network is often trained with 7: S = max(S − Ic˜i δ, 0). # Identity matrix Ic˜i ∈ R . i zero-mean Gaussian distribution as the initialization. During 8: Ω = USˆ. # At least half columns of Ωi are zeros. training, the regularization effect of `2-norm weight penalty 9: end if confines weights to the bell-shape histogram, and thus prevents 10: end for a drastic change of the initial Gaussian distribution. As such, the pre-trained weights still have the approximately zero-mean Gaussian-like histogram. Similarly, it is intuitive that a good that half of its singular values are zeros. Accordingly, the right pruned weight Ωi satisfies that W¯ i ≈ 0. Thus, Eq. 5 can be half of the columns in USˆ (see line 7 of Alg. 1 for Sˆ) will be re-written as: zeros. The details of the method can be referred to [22]. It can be seen that the optimization of Eq. 6 is similar to arg min W i(W i)T − Ωi(Ωi)T . (6) F the matrix sketch problem of Eq. 7, though the existence of Ωi i 2 the upper bound term εkW k2 does not necessarily result in Similar to [16], [17], one can develop a series of optimization optimal Ωi for Eq. 6. steps to minimize the reconstruction error of Eq. 6. However, Nevertheless, in what follows, we show that Alg. 1 can the optimization procedure is typically based on data-driven provide a tight convergence bound to solve the sketch problem and/or iterative methods [16], [17], which inevitably introduces of Eq. 7 while the learned Ωi can serve as a good warm-up heavy computation cost. for fine-tuning the pruned model as demonstrated in Sec. IV. Corollary 1. If Ωi is the sketch result of matrix W i with C. Tractability the sketched size c˜i by Alg. 1, then it holds: In this section, we show that Eq. 6 can be effectively and 0 Ωi(Ωi)T W i(W i)T , and efficiently solved by the off-the-shelf matrix sketch method 4 4 2 (8) [22], which does not involve data-driven iterative optimization i i T i i T i 2 W (W ) − Ω (Ω ) F ≤ kW kF , while maintaining the property of interest of W i in Ωi. c˜i Specifically, a sketch of a matrix W i is a transformed matrix i.e., ε = 2 . c˜i Ωi, which is smaller than W i but tracks an ε-approximation The proof of Corollary 1 can be referred to [22]. Accordingly, i 1 to the norm of W , as: the convergence bound of FD is proportional to . Smaller c˜i c˜i i i T i i T causes more error, which is intuitive since smaller c˜ means Ω (Ω ) 4 W (W ) , and i i i T i i T i 2 (7) more pruned filters. Besides, the sketch time is up-bounded W (W ) − Ω (Ω ) ≤ εkW k . F F by O(dicic˜i) [22]. Sec. IV-E shows that the sketch process Several papers have been devoted to solving Eq. 7, including requires less than 2 seconds with CPU, which verifies the CUR decomposition [44], random projection [45], and column efficiency of FilterSketch. sampling methods [46], which however still rely on iterative In our experiments, we find that a small portion of the optimization. elements in Ωi are unordinary larger especially for a small The streaming-based Frequent Direction (FD) method by value of c˜i, which damages the accuracy performance as [22] provides a promising direction to solve this problem, where demonstrated in Tab. III. This is understandable since the error each sample is passed forward only once without iterations, bound does exist in Corollary 1. We can conduct a second-round W i sketch for i to solve this problem of unstable numerical which is extremely efficient. We summarize it in Alg. 1. A kΩ kF i di ×ci data matrix W and the sketched size c˜i are fed into FD. values (see below for analysis). However, in what follows, we i i Each column Wj of matrix W represents a sample. Columns show that the second-round sketch can be eliminated by simply from W i will replace all zero-valued columns in Ωi, and half normalizing Ωi. of the columns in the sketch will be emptied with two steps Theorem 1. If Ωi is the sketch result by applying Alg. 1 to i i once Ω is fully fed with non-zero valued columns: In the matrix W with the sketched size c˜i, then for any constant β, first step, the sketch is rotated (from right) with the SVD βΩi is the result by applying Alg. 1 to matrix βW i. decomposition of Ωi so that its columns are orthogonal and Proof. We start with Line 5 of Alg. 1, which can be modified in descending magnitude order. In the SVD decomposition, as: T i T T T i USV = Ω , U U = V V = VV = Ic˜i , where Ic˜i is the Line 5: [U, βS, V ] = SVD(βΩ ). c˜i × c˜i identity matrix, S is a non-negative Correspondingly, Line 6 to Line 8 can be modified as: 2 2 c˜ and S11 ≥ ... ≥ Sc˜ic˜i ≥ 0. In the second step, S is shrunk so Line 6: β δ = (βs i ) ; 2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 5

Algorithm 2 FilterSketch Algorithm. learning rate is initially set to 0.1, and divided by 10 every 30 i L Require: Pre-trained model F with filter set W = {W }i=1. epochs. i L Ensure: Pruned model F with filter set Ω = {Ω }i=1 and For all methods, we apply the standard data augmentation L per-layer pruning rate set {pi}i=1. provided by the official Pytorch including random crop and 1: for i = 1 → L do horizontal flip. To stress, other techniques for image augmen- 2: Compute the sketched size by c˜i = bpi · cie. tation, such as lightening and color jitter, can be applied to 3: Obtain Ωi via sketching W i by Alg. 1. further improve the performance as done in the implementations i 4: Normalize Ω via `2-norm. of [52], [53], [54], or even the cosine learning rate [55], can 5: end for also be applied to further improve the accuracy performance. 6: Initialize F with the sketched filter set Ω. We do not consider these since we aim to show the performance 7: for t = 1 → T do of pruning algorithms themselves. We provide our codes in 8: Fine-tune the pruned model F. the supplementary material. 9: end for Performance Metric. Parameter amount and FLOPs 10: Return F with the fine-tuned filter set Ω. (floating-point operations) are used as the metrics, which respectively denote the storage and computation cost. We also q report the pruning rate (PR) of parameters and FLOPs. For ˆ 2 2 2  Line 7: βS = max β S − Ic˜i β δ, 0 ; CIFAR-10, top-1 accuracy are provided. For ILSVRC-2012, Line 8: βΩi = UβSˆ. both top-1 and top-5 accuracies are reported. Thus, the sketch of βW i results in βΩi, which completes the proof.  B. Results on CIFAR-10 1 By setting β = i , we can see that the sketch result of kΩ kF We evaluate the performance of FilterSketch on CIFAR-10 W i Ωi 1 i is equal to i . Note that β = i < 1 is always kΩ kF kΩ kF kΩ kF with popular networks, including GoogLeNet, ResNet-56 and satisfied from our extensive experiments. Thus, the unordinary ResNet-110. For GoogLeNet, we make the final output class i larger elements in Ω can be re-scaled to smaller ones. Finally, number the same as the number of categories on CIFAR-10. Ωi 1 i is fed to the slimmed network for fine-tuning . kΩ kF GoogLeNet. As can be seen from Tab. I, FilterSketch We outline FilterSketch in Alg. 2. It can be seen that, outperforms the SOTA methods in both accuracy retaining compared with existing methods, FilterSketch stands out in and model complexity reductions. Specifically, 61.1% of the that it is deterministic, simple to implement, and also very fast FLOPs are reduced and 57.6% of the parameters are removed, (see Tab. IV later). achieving a significantly higher compression rate than GAL-0.6 and HRank. Besides, FilterSketch also maintains a comparable IV. EXPERIMENTS top-1 accuracy, even better than L1, which obtains a much less complexity reduction. To show the effectiveness and efficiency of FilterSketch, ResNet-56. Results for ResNet-56 are presented in Tab. I, we have conducted extensive experiments on image classifi- where FilterSketch removes around 41.5% of the FLOPs cation. Representative compact-designed networks, including and parameters while keeping the top-1 accuracy at 93.19%. GoogLeNet [47] and ResNet-50/56/110 [48], are chosen to Compared to 93.26% by the pre-trained model, the accuracy compress. We report the performance of FilterSketch on CIFAR- drop is negligible. Compared with L1, FilterSketch shows 10 [49] and ILSVRC-2012 [50], and compare it to state-of- an overwhelming superiority. Though NISP obtains 1% more the-arts (SOTAs) including regularization-based pruning [13], parameters reduction than FilterSketch, it takes more computa- [11], property-based pruning [14], [31], [51], and optimization- tion in the convolutional layers with a lower top-1 accuracy. based pruning [16], [17]. Besides, we also conduct sub- Moreover, in comparison with HRank under similar reductions sampling of the pre-trained weights for fine-tuning (denoted of FLOPs and parameters, our FilterSketch results in a higher as Random) to show the advantage of considering the second- top-1 accuracy, well demonstrating the effectivness of the order information. sketched weights to recover the accuracy performance. ResNet-110. Tab. I also displays the pruning results for A. Implementation Details ResNet-110. FilterSketch reduces the FLOPs of ResNet-110 Training Strategy. We use the Stochastic Gradient Descent by an impressive factor of 63.3%, and the parameters by (SGD) for fine-tuning with the Nesterov momentum 0.9 and 59.9%, while maintaining an accuracy of 93.44%. FilterSketch the batch size is set to 256. For CIFAR-10, the weight decay is significantly outperforms these SOTAs, showing that it can set to 5e-3 and we fine-tune the network for 150 epochs with greatly facilitate the ResNet model, a popular backbone for an initial learning rate of 0.01, which is then divided by 10 object detection and semantic segmentation, to be deployed on every 50 epochs. For ILSVRC-2012, the weight decay is set mobile devices. to 5e-4 and 90 epochs are given to fine-tune the network. The In Tab. I, we also display the performance of randomly sub- sampling of filter weights (Random) given the same pruning 1 W i Ωi To sketch i might be an alternative, the result of which is i . rates as with FilterSketch. As seen, Random suffers great kW kF kW kF i However, we observe that kW kF is much larger than kΩkF and thus most accuracy degradation in comparison with FilterSketch. In Ωi elements in i are very close to zero, which is infeasible. contrast, FilterSketch considers all the information in the pre- kW kF IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 6

TABLE I RESULTS OF GOOGLENETAND RESNET-56/110 ON CIFAR-10.

Model Top1-acc ↑↓ FLOPs (Pruning Rate) Parameters (Pruning Rate) GoogLeNet (Base) 95.05% 0.00% 1.52B (0.0%) 6.15M (0.0%) L1 [14] 94.54% 0.51%↓ 1.02B (32.9%) 3.51M (42.9%) GAL-0.05 [11] 93.93% 1.12%↓ 0.94B (38.2%) 3.12M (49.3%) HRank [51] 94.53% 0.52%↓ 0.69B (54.9%) 2.74M (55.4%) Random 93.23% 1.82%↓ 0.59B (61.1%) 2.61M (57.6%) FilterSketch 94.88% 0.17%↓ 0.59B (61.1%) 2.61M (57.6%) ResNet-56 (Base) 93.26% 0.00% 125.49M (0.0%) 0.85M (0.0%) L1 [14] 93.06% 0.20%↓ 90.90M (27.6%) 0.73M (14.1%) He et al. [16] 90.80% 2.46%↓ 62.00M (50.6%) - HRank [51] 93.52% 0.26%↑ 88.72M (29.3%) 0.71M (16.8%) FilterSketch 93.65% 0.39%↑ 88.05M (30.43%) 0.68M (20.6%) NISP [31] 93.01% 0.25%↓ 81.00M (35.5%) 0.49M (42.4%) GAL-0.6 [11] 92.98% 0.28%↓ 78.30M (37.6%) 0.75M (11.8%) Random 90.55% 2.71%↓ 73.36M (41.5%) 0.50M (41.2%) FilterSketch 93.19% 0.07%↓ 73.36M (41.5%) 0.50M (41.2%) HRank [51] 90.72% 2.54%↓ 32.52M (74.1%) 0.27M (68.1%) FilterSketch 91.20% 2.06%↓ 32.47M (74.4%) 0.24M (71.8%) ResNet-110 (Base) 93.50% 0.00% 252.89M (0.0%) 1.72M (0.0%) L1 [14] 93.30% 0.20%↓ 155.00M (38.7%) 1.16M (32.6%) GAL-0.5 [11] 92.55% 0.95%↓ 130.20M (48.5%) 0.95M (44.8%) HRank [51] 93.36% 0.14%↓ 105.70M (58.2%) 0.70M (59.2%) Random 89.88% 3.62%↓ 92.84M (63.3%) 0.69M (59.9%) FilterSketch 93.44% 0.06%↓ 92.84M (63.3%) 0.69M (59.9%)

TABLE II RESULTS OF RESNET-50 ON ILSVRC-2012.

Model Top1-acc% ↑↓ Top5-acc% ↑↓ FLOPs (Pruning Rate) Parameters (Pruning Rate) ResNet-50 (Base) [17] 76.13% 0.00% 92.86% 0.00% 4.09B (0.0%) 25.50M (0.0%) SSS-32 [13] 74.18% 1.95%↓ 91.91% 0.95%↓ 2.82B (31.1%) 18.60M (27.1%) He et al. [16] 72.30% 3.83%↓ 90.80% 2.06%↓ 2.73B (33.3%) - FilterSketch-0.7 75.22% 0.91%↓ 92.41% 0.45%↓ 2.64B (35.5%) 16.95M (33.5%) GAL-0.5 [11] 71.95% 4.18%↓ 90.94% 1.92%↓ 2.33B (43.0%) 21.20M (16.9%) SSS-26 [13] 71.82% 4.31%↓ 90.79% 2.07%↓ 2.33B (43.0%) 15.60M (38.8%) FilterSketch-0.6 74.68% 1.45%↓ 92.17% 0.69%↓ 2.23B (45.5%) 14.53M (43.0%) HRank [51] 71.98% 4.15%↓ 91.01% 1.85%↓ 1.55B (62.1%) 13.77M (46.0%) GAL-0.5-joint [11] 71.80% 4.33%↓ 90.82% 2.04%↓ 1.84B (55.0%) 19.31M (24.3%) ThiNet-50 [17] 71.01% 5.12%↓ 90.02% 2.84%↓ 1.71B (58.2%) 12.28M (51.8%) GAL-1 [11] 69.88% 6.25%↓ 89.75% 3.11%↓ 1.58B (61.4%) 14.67M (42.5%) FilterSketch-0.4 73.04% 3.09%↓ 91.18% 1.68%↓ 1.51B (63.1%) 10.40M (59.2%) ThiNet-50 [17] 68.42% 6.82%↓ 88.30% 4.56%↓ 1.10B (73.1%) 8.66M (66.0%) GAL-1-joint [11] 69.31% 6.82%↓ 89.12% 3.74%↓ 1.11B (72.9%) 10.21M (60.0%) HRank [51] 68.10% 8.03%↓ 89.58% 3.28%↓ 0.98B (76.0%) 8.27M (67.6%) FilterSketch-0.2 69.43% 6.70%↓ 89.23% 3.63%↓ 0.93B (77.3%) 7.18M (71.8%) trained weights, which provides a more informative warm-up top-5 accuracies. For convenience, we use FilterSketch-α to c˜ for fine-tuning the pruned model. denote the sketch rate α (i.e., α = c ) for FilterSketch. Smaller In Fig. 3, we further compare the Top-1 accuracies of the α leads to a higher compression rate. compressed models by GAL [11], L1 [14], Random, and our As shown in Tab. II, with similar or better reductions of FilterSketch under different compression rates using ResNet-56. FLOPs and parameters, FilterSketch demonstrates its great As shown in the figure, our method outperforms the compared advantages in retaining the accuracy in comparisons with the methods easily. Especially, for large pruning rates (> 60%), SOTAs. For example, FilterSketch-0.6 obtains 74.68% top-1 both L1 and GAL suffer an extreme accuracy drop while and 92.17% top-5 accuracies, significantly better than GAL- FilterSketch maintains a relatively stable performance, which 0.5 and SSS-26. Another observation is that, with similar stresses the importance of information preserving in network or more FLOPs reduction, FilterSketch also removes more pruning again. parameters. Hence, FilterSketch is especially suitable for network compression.

C. Results on ILSVRC-2012 D. Practical Speedup In Tab. II, we show the results for ResNet-50 on ILSVRC- The practical speedup for pruned CNNs depends on many 2012 and compare FilterSketch to many SOTAs. We display factors, e.g., FLOPs reduction percentage, the number of different pruning rates for FilterSketch and compare top-1 and CPU/GPU cores available and I/O delay of data swap, etc. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 7

TABLE III ERFORMANCECOMPARISONSBETWEENSKETCHESWITHANDWITHOUT 93 P THE NORMALIZATION.

92 Sketch accuracy Sketch+norm accuracy GoogLeNet 94.54% 94.88% 91 ResNet-56 92.81% 93.19% ResNet-110 93.01% 93.44% 90 FilterSketch-FLOPs ResNet-50 72.84% / 91.01% 73.04% / 91.18% GAL-FLOPs

Top-1 Accuracy(%) 89 L1-FLOPs Random-FLOPs TABLE IV FilterSketch-Parameters OPTIMIZATION EFFICIENCY (CPU) AMONG THINET [17], CP [16] AND 88 GAL-Parameters FILTERSKETCH. L1-Parameters 87 Random-Parameters ThiNet CP FilterSketch 20 30 40 50 60 70 80 GoogLeNet 183585.41s 2008.72s 1.85s Pruning Rate(%) ResNet-56 24422.73s 536.55s 0.09s ResNet-110 63695.89s 961.71s 1.06s ResNet-50 4130102.59s 205117.20s 1.41s Fig. 3. FLOPs and parameter comparison among GAL [11], L1 [14], Random, and our FilterSketch under different compression rates. ResNet- 56 is compressed and Top-1 accuracy is reported. which are several orders of magnitude faster than the other methods that cost many hours, or even days. CPU 2.75 2.72x GPU THEORY 2.58x 2.50 V. CONCLUSION 2.25 We have proposed a novel approach, termed FilterSketch, 2.00 1.89x for structured network pruning. Instead of simply discarding

1.75 1.71x unimportant filters, FilterSketch preserves the second-order Speed-up 1.57x 1.50 information of the pre-trained model, through which the 1.25x 1.28x accuracy is well maintained. We have further proposed to obtain 1.25 1.11x 1.18x the information preserving constraint by utilizing the off-the- 1.00 shelf matrix sketch method, based on which the requirement ResNet56 ResNet110 GoogLeNet of training from scratch or iterative optimization can be Architecture eliminated, and the pruning complexity is significantly reduced. Extensive experiments have demonstrated the superiorities of Fig. 4. Speedups corresponding to CPU (Intel(R) Xeon(R) CPU E5-2620 v4 FilterSketch over the state-of-the-arts. As the first attempt on @2.10GHz) and GPU (GTX-1080TI) over the different CNNs with a batch size 256 on CIFAR-10. weight information preserving, FilterSketch provides a fresh insight for the network pruning problem. Nevertheless, our FilterSketch is built on the fact that the filter weights have We test the speedup of pruned models by FilterSketch in approximate zero mean in each layer of the convolutional Tab. I with CPU and GPU, and present the results in Fig. 4. neural network. Such a requirement might not be satisfied in Compared with the theoretical speedups of 1.71×, 2.72× and other networks, e.g., multi-layer . More effort will 2.58× for ResNet-56, ResNet-110 and GoogLeNet, respectively. be made to solve this issue in our future work. FilterSketch gains 1.11×, 1.57× and 1.18× practical speedups × × × with GPU while 1.25 , 1.89 and 1.28 with CPU are REFERENCES obtained. [1] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, “Quantized cnn: A unified approach to accelerate and compress convolutional networks,” E. Normalization Influence and Optimization Efficiency IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 29, no. 10, pp. 4730–4743, 2017. To measure the effectiveness of the sketch with the Frobenius [2] P. Wang, X. He, Q. Chen, A. Cheng, Q. Liu, and J. Cheng, “Unsupervised normalization, we compare our FilterSketch models given in network quantization via fixed-point factorization,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2020. Tab. I and Tab. II (FilterSketch-0.4) with corresponding models [3] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression but excluding the Frobenius normalization. As shown in Tab. III, of deep convolutional neural networks for fast and low power mobile the former (the third column) obtains a better accuracy than applications,” arXiv preprint arXiv:1511.06530, 2015. [4] K. Hayashi, T. Yamaguchi, Y. Sugawara, and S.-i. Maeda, “Exploring the latter (the second column), which verifies the analysis in unexplored tensor network decompositions for convolutional neural Sec. III-C that the sketch with the Frobenius normalization can networks,” in Proceedings of the Advances in Neural Information effectively solve the problem of unstable numerical values after Processing Systems (NeurIPS), 2019, pp. 5553–5563. [5] Z. Chen, T.-B. Xu, C. Du, C.-L. Liu, and H. He, “Dynamical channel sketch. As for the sketch efficiency, we again compare these pruning by conditional accuracy change for deep neural networks,” IEEE four models in Tab. III with two optimization-based methods Transactions on Neural Networks and Learning Systems (TNNLS), 2020. [17], [16]. The results in Tab. IV show that the time cost in [6] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, “Toward compact convnets via structure-sparsity regularized filter pruning,” IEEE Transactions on the sketch process is little. Even with wider GoogLeNet and Neural Networks and Learning Systems (TNNLS), vol. 31, no. 2, pp. deeper ResNet-110, the sketches consume less than 2 seconds, 574–588, 2019. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 8

[7] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in [30] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, Proceedings of the Advances in Neural Information Processing Systems trainable neural networks,” in Proceedings of the International Conference (NeurIPS), 1990, pp. 598–605. on Learning Representations (ICLR), 2019. [8] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and [31] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. connections for efficient neural network,” in Proceedings of the Advances Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance in Neural Information Processing Systems (NeurIPS), 2015, pp. 1135– score propagation,” in Proceedings of the IEEE Conference on Computer 1143. Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203. [9] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri, “Play and prune: [32] A. Polyak and L. Wolf, “Channel-level acceleration of deep face Adaptive filter pruning for deep model compression,” in Proceedings representations,” IEEE Access, no. 3, pp. 2163–2175, 2015. of the International Joint Conference on Artificial Intelligence (IJCAI), [33] X. Ding, G. Ding, Y. Guo, and J. Han, “Centripetal sgd for pruning very 2019, pp. 3460–3466. deep convolutional networks with complicated structure,” in Proceedings [10] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, “Variational of the IEEE Conference on Computer Vision and Pattern Recognition convolutional neural network pruning,” in Proceedings of the IEEE (CVPR), 2019, pp. 4943–4953. Conference on Computer Vision and Pattern Recognition (CVPR), 2019, [34] C. Wang, R. Grosse, S. Fidler, and G. Zhang, “Eigendamage: Structured pp. 2780–2789. pruning in the kronecker-factored eigenbasis,” in Proceedings of the [11] S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and International Conference on (ICML), 2019, pp. 6566– D. Doermann, “Towards optimal structured cnn pruning via generative 6575. adversarial learning,” in Proceedings of the IEEE Conference on [35] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2790–2799. geometric median for deep convolutional neural networks acceleration,” [12] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient in Proceedings of the IEEE Conference on Computer Vision and Pattern convolutional networks through network slimming,” in Proceedings of Recognition (CVPR), 2019, pp. 4340–4349. the IEEE International Conference on Computer Vision (ICCV), 2017, [36] S. P. Kasiviswanathan, N. Narodytska, and H. Jin, “Network approxima- pp. 2736–2744. tion using tensor sketching.” in Proceedings of the International Joint [13] Z. Huang and N. Wang, “Data-driven sparse structure selection for Conference on Artificial Intelligence (IJCAI), 2018, pp. 2319–2325. deep neural networks,” in Proceedings of the European Conference on [37] B. Hassibi and D. G. Stork, “Second order derivatives for network Computer Vision (ECCV), 2018, pp. 304–320. pruning: optimal brain surgeon,” in Proceedings of the Advances in [14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters Neural Information Processing Systems (NeurIPS), 1992, pp. 164–171. for efficient convnets,” in Proceedings of the International Conference [38] H. Peng, J. Wu, S. Chen, and J. Huang, “Collaborative channel pruning on Learning Representations (ICLR), 2017. for deep networks,” in Proceedings of the International Conference on [15] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data- Machine Learning (ICML), 2019, pp. 5113–5122. driven neuron pruning approach towards efficient deep architectures,” [39] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, arXiv preprint arXiv:1607.03250, 2016. “A systematic dnn weight pruning framework using alternating direction [16] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep method of multipliers,” in Proceedings of the European Conference on neural networks,” in Proceedings of the IEEE International Conference Computer Vision (ECCV), 2018, pp. 184–199. on Computer Vision (ICCV), 2017, pp. 1389–1397. [40] Z. He and D. Fan, “Simultaneously optimizing weight and quantizer [17] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method of ternary neural network using truncated gaussian approximation,” in for deep neural network compression,” in Proceedings of the IEEE Proceedings of the IEEE Conference on Computer Vision and Pattern International Conference on Computer Vision (ICCV), 2017, pp. 5058– Recognition (CVPR), 2019, pp. 11 438–11 446. 5066. [41] G. Franchi, A. Bursuc, E. Aldea, S. Dubuisson, and I. Bloch, “Tradi: [18] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” Tracking deep neural network weight distributions,” in Proceedings of the in Proceedings of the IEEE International Conference on Computer Vision European Conference on Computer Vision (ECCV), 2020, pp. 105–121. (ICCV), 2017, pp. 3800–3808. [42] X. Glorot and Y. Bengio, “Understanding the difficulty of training [19] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in deep feedforward neural networks,” in Proceedings of the International Proceedings of the Advances in Neural Information Processing Systems Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. (NeurIPS), 2018, pp. 1564–1574. 249–256. [20] J. Zhang, Z. Feng, Y. Su, and M. Xing, “Discriminative saliency-pose- [43] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: attention covariance for action recognition,” in Proceedings of the IEEE Surpassing human-level performance on imagenet classification,” in International Conference on Acoustics, Speech and Signal Processing Proceedings of the IEEE Conference on Computer Vision and Pattern (ICASSP). IEEE, 2019, pp. 2132–2136. Recognition (CVPR), 2015, pp. 1026–1034. [21] C.-X. Ren, J. Feng, D.-Q. Dai, and S. Yan, “Heterogeneous domain adap- [44] P. Drineas and R. Kannan, “Pass efficient algorithms for approximating tation via covariance structured feature translators,” IEEE Transactions large matrices.” in Proceedings of the ACM-SIAM Symposium On Discrete on Cybernetics (T-Cybernetics), 2019. Algorithms (SODA), 2003, pp. 223–232. [22] E. Liberty, “Simple and deterministic matrix sketching,” in Proceedings [45] T. Sarlos, “Improved approximation algorithms for large matrices of the ACM SIGKDD International Conference on Knowledge Discovery via random projections,” in Proceedings of the IEEE Symposium on and (ACM SIGKDD), 2013, pp. 581–588. Foundations of Computer Science (FOCS), 2006, pp. 143–152. [23] S. Vadera and S. Ameen, “Methods for pruning deep neural networks,” [46] A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms for arXiv preprint arXiv:2011.00241, 2020. finding low-rank approximations,” Journal of the ACM (JACM), vol. 51, [24] J. Sum, C.-s. Leung, G. H. Young, and W.-k. Kan, “On the kalman filter- no. 6, pp. 1025–1041, 2004. ing method in neural network training and pruning,” IEEE Transactions [47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, on Neural Networks and Learning Systems (TNNLS), vol. 10, no. 1, pp. V. Vanhoucke, and A. Rabinovich, “Going deeper with ,” in 161–166, 1999. Proceedings of the IEEE Conference on Computer Vision and Pattern [25] J. Sum, C.-s. Leung, G. H. Young, L.-w. Chan, and W.-k. Kan, “An Recognition (CVPR), 2015, pp. 1–9. adaptive bayesian pruning for neural networks in a non-stationary [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image environment,” Neural Computation, vol. 11, no. 4, pp. 965–976, 1999. recognition,” in Proceedings of the IEEE Conference on Computer Vision [26] J. Sum, L.-w. Chan, C.-s. Leung, and G. H. Young, “Extended kalman and Pattern Recognition (CVPR), 2016, pp. 770–778. filter–based pruning method for recurrent neural networks,” Neural [49] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features Computation, vol. 10, no. 6, pp. 1481–1505, 1998. from tiny images,” University of Toronto, Toronto, ON, Canada, Tech. [27] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in Rep., 01 2009. deep networks,” in Proceedings of the Advances in Neural Information [50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, Processing Systems (NeurIPS), 2016, pp. 2270–2278. A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale [28] X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks visual recognition challenge,” Proceedings of the International Journal via layer-wise optimal brain surgeon,” in Proceedings of the Advances in of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015. Neural Information Processing Systems (NeurIPS), 2017, pp. 4857–4867. [51] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and S. Ling, “Hrank: [29] Z. Liu, J. Xu, X. Peng, and R. Xiong, “Frequency-domain dynamic Filter pruning using high-rank feature map,” in Proceedings of the IEEE pruning for convolutional neural networks,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Advances in Neural Information Processing Systems (NeurIPS), 2018, [52] J. Yu and T. Huang, “Autoslim: Towards one-shot architecture search pp. 1043–1053. for channel numbers,” arXiv preprint arXiv:1903.11728, 2019. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 9

[53] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K.-T. Cheng, and J. Sun, Yonghong Tian (Senior Member, IEEE) is “Metapruning: Meta learning for automatic neural network channel currently a Boya Distinguished Professor pruning,” in Proceedings of the IEEE Conference on Computer Vision with the Department of Computer Science and Pattern Recognition (CVPR), 2019, pp. 3296–3305. and Technology, Peking University, China, [54] X. Ding, G. Ding, Y. Guo, J. Han, and C. Yan, “Approximated oracle and is also the deputy director of Artificial filter pruning for destructive cnn width optimization,” in Proceedings of Intelligence Research Center, PengCheng the International Conference on Machine Learning (ICML), 2019, pp. Laboratory, Shenzhen, China. His research interests 1607–1616. include neuromorphic vision, brain-inspired [55] X. Dong and Y. Yang, “Network pruning via transformable architecture computation and multimedia big data. He is search,” in Proceedings of the Advances in Neural Information Processing the author or coauthor of over 200 technical Systems (NeurIPS), 2019, pp. 759–770. articles in refereed journals such as IEEE TPAMI/TNNLS/TIP/TMM/TCSVT/TKDE/TPDS, ACM CSUR/TOIS/TOMM and conferences such as NeurIPS/CVPR/ICCV/AAAI/ACMMM/WWW. Prof. Tian was/is an Associate Editor of IEEE TCSVT (2018.1-), IEEE TMM (2014.8-2018.8), IEEE Multimedia Mag. (2018.1-), and IEEE Access (2017.1-). He co-initiated IEEE Int’l Conf. on Multimedia Big Data (BigMM) Mingbao Lin is currently pursuing the Ph.D degree and served as the TPC Co-chair of BigMM 2015, and aslo served as the with Xiamen University, China. He has published Technical Program Co-chair of IEEE ICME 2015, IEEE ISM 2015 and over ten papers as the first author in international jour- IEEE MIPR 2018/2019, and General Co-chair of IEEE MIPR 2020 and nals and conferences, including IEEE TPAMI, IJCV, ICME2021. He is the steering member of IEEE ICME (2018-) and IEEE IEEE TIP, IEEE TNNLS, IEEE CVPR, NeuriPS, BigMM (2015-), and is a TPC Member of more than ten conferences such AAAI, IJCAI, ACM MM and so on. His current as CVPR, ICCV, ACM KDD, AAAI, ACM MM and ECCV. He was the research interest includes network compression & recipient of the Chinese National Science Foundation for Distinguished Young acceleration, and information retrieval. Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for Journal on Image and Video Processing, and the best paper award of IEEE BigMM 2018. He is a senior member of IEEE, CIE and CCF, a member of ACM.

Liujuan Cao received the B.S., M.S., and Ph.D degrees from the School of Computer Science and Technology, Harbin Engineering University. She is currently an associate professor at Xiamen University. Her research interest mainly focuses on computer vision and pattern recognition. She has authored over 40 papers in top and major tired journals and conferences, including CVPR, TIP, etc. She is the Financial Chair of the IEEE MMSP 2015, the Workshop Chair of the ACM ICIMCS 2016, and the Local Chair of the Visual and Learning Seminar 2017.

Shaojie Li studied for his B.S. degrees in FuZhou University, China, in 2019. He is currently trying to pursue a M.S. degree in Xiamen University, China. His research interests include model compression and computer vision.

Qixiang Ye (Senior Member, IEEE) received the B.S. and M.S. degrees in mechanical and electrical Jianzhuang Liu (Senior Member, IEEE) received the engineering from Harbin Institute of Technology, Ph.D. degree in computer vision from The Chinese China, in 1999 and 2001, respectively, and the Ph.D. University of Hong Kong, Hong Kong, in 1997. degree from the Institute of Computing Technology, From 1998 to 2000, he was a Research Fellow Chinese Academy of Sciences in 2006. He has been with Nanyang Technological University, Singapore. a professor with the University of Chinese Academy From 2000 to 2012, he was a Postdoctoral Fellow, of Sciences since 2009, and was a visiting assistant an Assistant Professor, and an Adjunct Associate professor with the Institute of Advanced Computer Professor with The Chinese University of Hong Kong. Studies (UMIACS), University of Maryland, College In 2011, he joined the Shenzhen Institute of Ad- Park until 2013. His research interests include image vanced Technology, University of Chinese Academy processing, object detection and machine learning. He has published more of Sciences, Shenzhen, China, as a Professor. He is than 100 papers in refereed conferences and journals including IEEE CVPR, currently a Principal Researcher with Huawei Technologies Company Limited, ICCV, ECCV, NeurIPS, TNNLS, and PAMI. Shenzhen, China. He has authored more than 150 papers. His research interests include computer vision, image processing, , and graphics. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (ACCEPTED) 10

Qi Tian (Fellow, IEEE) is currently a Chief Scientist in Artificial Intelligence at Cloud BU, Huawei. From 2018-2020, he was the Chief Scientist in Computer Vision at Huawei Noah’s Ark Lab. Before that he was a Full Professor in the Department of Computer Science, the University of Texas at San Antonio (UTSA) from 2002 to 2019. During 2008-2009, he took one-year Faculty Leave at Microsoft Research Asia (MSRA). Dr. Tian received his Ph.D. in ECE from University of Illinois at Urbana-Champaign (UIUC) and received his B.E. in Electronic Engineering from Tsinghua University and M.S. in ECE from Drexel University, respectively. Dr. Tian’s research interests include computer vision, multimedia information retrieval and machine learning and published 610+ refereed journal and conference papers. His Google citation is over 27900+ with H-index 81. He was the co-author of best papers including IEEE ICME 2019, ACM CIKM 2018, ACM ICMR 2015, PCM 2013, MMM 2013, ACM ICIMCS 2012, a Top 10% Paper Award in MMSP 2011, a Student Contest Paper in ICASSP 2006, and co-author of a Best Paper/Student Paper Candidate in ACM Multimedia 2019, ICME 2015 and PCM 2007. Dr. Tian received 2017 UTSA President’s Distinguished Award for Research Achievement, 2016 UTSA Innovation Award, 2014 Research Achievement Awards from College of Science, UTSA, 2010 Google Faculty Award, and 2010 ACM Service Award. He is the associate editor of IEEE TMM, IEEE TCSVT, ACM TOMM, MMSJ, and in the Editorial Board of Journal of Multimedia (JMM) and Journal of MVA. Dr. Tian is the Guest Editor of IEEE TMM, Journal of CVIU, etc. Dr. Tian is a Fellow of IEEE.

Rongrong Ji (Senior Member, IEEE) is a Nanqiang Distinguished Professor at Xiamen University, the Deputy Director of the Office of Science and Technol- ogy at Xiamen University, and the Director of Media Analytics and Computing Lab. He was awarded as the National Science Foundation for Excellent Young Scholars (2014), the National Ten Thousand Plan for Young Top Talents (2017), and the National Science Foundation for Distinguished Young Scholars (2020). His research falls in the field of computer vision, multimedia analysis, and machine learning. He has published 50+ papers in ACM/IEEE Transactions, including TPAMI and IJCV, and 100+ full papers on top-tier conferences, such as CVPR and NeurIPS. His publications have got over 10K citations in Google Scholar. He was the recipient of the Best Paper Award of ACM Multimedia 2011. He has served as Area Chairs in top-tier conferences such as CVPR and ACM Multimedia. He is also an Advisory Member for Artificial Intelligence Construction in the Electronic Information Education Committee of the National Ministry of Education.