Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) Progressive Blockwise Knowledge Distillation for Neural Network Acceleration Hui Wang1;∗, Hanbin Zhao1;∗, Xi Li1;y, Xu Tan2 1 Zhejiang University, Hangzhou, China 2 2012 Lab, Huawei Technologies, Hangzhou, China fwanghui 17, zhaohanbin, xilizju, [email protected], [email protected] Abstract original network model into a low-complexity network model in terms of the teacher-student learning strategy. Without As an important and challenging problem in ma- sacrificing too much accuracy, the low-complexity network chine learning and computer vision, neural network model naturally possesses the properties of high computation- acceleration essentially aims to enhance the com- al efficiency and low memory usage. However, the effective- putational efficiency without sacrificing the mod- ness of model distillation is often challenged in the aspects el accuracy too much. In this paper, we propose a of teacher-student network optimization and student network progressive blockwise learning scheme for teacher- structure design. Therefore, we mainly focus on construct- student model distillation at the subnetwork block ing an effective network learning strategy with a structure- level. The proposed scheme is able to distill the preserving criterion for model distillation in this paper. knowledge of the entire teacher network by local- More specifically, the process of model distillation usually ly extracting the knowledge of each block in terms involves two components: a teacher network with a compli- of progressive blockwise function approximation. cated network structure as well as a simple student network. Furthermore, we propose a structure design crite- In essence, the process seeks for a feasible student network rion for the student subnetwork block, which is to mimic the output of the teacher network. Usually, con- able to effectively preserve the original receptive ventional approaches (e.g., Knowledge Distillation [Hinton field from the teacher network. Experimental re- et al., 2014]) rely on a non-convex joint network optimiza- sults demonstrate the effectiveness of the proposed tion strategy that converts the teacher network into the de- scheme against the state-of-the-art approaches. sired student network in a one-pass fashion. Implemented in a huge search space of the student network function with a 1 Introduction wide variety of network configurations, the aforementioned non-convex joint optimization process is usually intractable Recent years have witnessed a great development of deep and unstable in practice. Following the work [Hinton et al., convolutional neural networks (DCNNs) and their various ap- 2014], Nowak and Corso [Nowak and Corso, 2018] make an plications [Krizhevsky et al., 2012; Simonyan and Zisser- attempt to compress a subnetwork block of the teacher net- man, 2015; He et al., 2016; Szegedy et al., 2017]. Due to work into a student subnetwork block and then design differ- the resource limit of real world devices, DCNN compres- ent methods for initialization and training. Moreover, their sion and acceleration have emerged as a crucial and chal- design criterion for the student subnetwork block is to sim- lenging problem in practice. Typically, the problem is re- ply retain a convolution layer and directly remove the other solved from the following four perspectives: 1) quantization two convolution layers from the teacher subnetwork block. and binarization [Hubara et al., 2016; Rastegari et al., 2016; Such a blockwise distillation strategy is simple and easy to Wu et al., 2016; Zhou et al., 2017]; 2) parameter prun- optimize, but incapable of effectively modeling the sequential ing and sharing [Courbariaux et al., 2015; Hu et al., 2016; dependency relationships between layer-specific subnetwork Li et al., 2017a; Luo et al., 2017; Molchanov et al., 2017; blocks. In addition, the student subnetwork block design cri- Li et al., 2017c]; 3) matrix factorization [Tai et al., 2016; terion is also incapable of well preserving the receptive field Lin et al., 2016; Jaderberg et al., 2014; Sainath et al., information on feature extraction. 2013]; and 4) model distillation [Bucila et al., 2006; Ba and Caruana, 2014; Hinton et al., 2014; Romero et al., 2015; Motivated by the observations above, we propose a block- Li et al., 2017b]. In principle, the first three points focus on wise learning scheme for progressive model distillation. how to carry out an efficient network inference process using Specifically, the proposed learning scheme converts a se- a variety of computational acceleration techniques with low quence of teacher subnetwork blocks into a sequence of s- memory usage. In contrast, the last one aims at distilling the tudent subnetwork blocks after progressive blockwise optimization. Additionally, we propose a structure-preserving ∗Indicates equal contribution criterion for student subnetwork design. This criterion allows yCorresponding author us to transform the teacher subnetwork blocks into the student 2769 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) Notation Definition subnetwork blocks without changing the receptive field. T The function that represents the initial teacher network As a result, the proposed progressive learning scheme aim- S The function that represents the final student network s at distilling the knowledge of the entire teacher network by Ak The auxiliary function that represents the intermediate network locally extracting the knowledge of each block in a progres- si The mapping function of the i-th block in S ti The mapping function of the i-th block in T sive learning manner with the structure-preserving criterion c The mapping function of the classifier QN of feature extraction. Therefore, the proposed scheme creates i=1 ◦ A symbol to simplify the network representation k W =W =W k a novel network acceleration strategy in terms of progres- T S A The parameters of T=S=A W =W =W The parameters of t =s =c sive blockwise learning, and has the following advantages: ti si c i i 1) structure-preserving for each subnetwork block; 2) easy Table 1: Main notations and symbols used throughout the paper. to implement for progressive blockwise optimization; 3) fast stagewise convergence; 4) flexible compatability with exist- ing learning modules; and 5) good balance with high accuracy the corresponding optimal parameters. The student network and competitive FLOPs reduction. composed of N student subnetwork blocks can be written as: QN As a result, the main contributions of this work are two- S = c ◦ ◦si (5) fold: i=1 where si denotes a student subnetwork block. The corre- • We propose a progressive blockwise learning scheme for sponding optimal parameters of the student network are de- model distillation, which is innovative in the area of neu- noted as: ral network acceleration. The proposed learning scheme W = fW ;W ;W ;:::;W g (6) naturally converts the problem of network acceleration S c sN sN−1 s1 into that of progressive blockwise function approxima- In essence, the problem is to design N student subnetwork tion. blocks sequence S and optimize the corresponding parameters WS using the prior knowledge of N teacher subnetwork • We propose a structure design criterion for the student blocks sequence T : subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. QN optimize WS QN c ◦ i=1 ◦ti(x; WT ) −−−−−−−! c ◦ i=1 ◦si(x; WS) design S 2 Our Approach (7) The main challenges to solve this problem are: 1) The joint 2.1 Problem Definition optimization of the student network function with a wide va- To better understand our representations, we provide the de- riety of parameters is usually intractable and unstable in prac- tailed explanations of the main notations and symbols used tice; 2) Designing a feasible student network structure from throughout this paper as shown in Tab. 1. scratch is difficult. In Sectiom 2.2 and Section 2.3, we pro- A neural network is mainly comprised of the convolution pose our solutions to these two challenges. layers, the pooling layers, and the fully connected layers. The 2.2 Progressive Blockwise Learning subnetwork between two adjacent pooling layers is defined as a subnetwork block. To reduce the optimization difficulty described in Eq. (7), we propose a progressive blockwise learning scheme. As shown Let a complicated network T be the teacher network, which in Fig. 1, our blockwise learning scheme learns the sequence is composed of N subnetwork blocks: of student subnetwork blocks by N block learning stages, and T = c ◦ tN ◦ tN−1 ◦ · · · ◦ t1 (1) only optimizes one block at each learning stage while keeping the other blocks fixed. where ti (i 2 f1; 2;:::;Ng) is the mapping function of the To better introduce our blockwise scheme, we use the aux- i-th block in the sequence and c is the mapping function of iliary function Ak (k 2 f0; 1;:::;Ng) to represent our inter- the classifier. To simplify the representation of the network, mediate network at the k-th block learning stage: we shorten it as: N k Y Y N Ak = c ◦ ( ◦t ) ◦ ( ◦s ) (8) Y i j ◦ti = tN ◦ tN−1 ◦ · · · ◦ t1 (2) i=k+1 j=1 i=1 where sj is the optimized student network block and ti is the Therefore, T is rewritten as: teacher network block. The parameters of Ak are denoted as below: QN T = c ◦ i=1 ◦ti (3) k WA = Wc;WtN ;:::;Wtk+1 ;Wsk ;Wsk−1 ;:::;Ws1 The parameters of the teacher network are denoted as: (9) As can be noted from the description of Ak, A0 is the N WT = Wc;WtN ;WtN−1 ;:::;Wt1 (4) teacher network T and A is the optimized network S. Hence, the problem defined in Eq. (7) can be solved as be- where W and W (i 2 f1; 2;:::;Ng) are the parameters of c ti low: c and ti.

Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

Layer-Level Knowledge Distillation for Deep Neural Network Learning

Similarity Transfer for Knowledge Distillation Haoran Zhao, Kun Gong, Xin Sun, Member, IEEE, Junyu Dong, Member, IEEE and Hui Yu, Senior Member, IEEE

Understanding and Improving Knowledge Distillation

Optimising Hardware Accelerated Neural Networks with Quantisation and a Knowledge Distillation Evolutionary Algorithm

Memory-Replay Knowledge Distillation

Knowledge Distillation: a Survey 3

Towards Effective Utilization of Pre-Trained Language Models

Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Knowledge Distillation from Internal Representations

Sequence-Level Knowledge Distillation

Improved Knowledge Distillation Via Teacher Assistant