Online Knowledge Distillation via Collaborative Learning

Qiushan Guo1, Xinjiang Wang2, Yichao Wu2, Zhipeng Yu2, Ding Liang2, Xiaolin Hu3, Ping Luo4
1Beijing University of Posts and Telecommunications  2SenseTime Group Limited  3Tsinghua University  4The University of Hong Kong
[email protected]  [email protected]  [email protected]  {wangxinjiang,wuyichao,yuzhipeng,liangding}@sensetime.com

Abstract

This work presents an efficient yet effective online knowledge distillation method via collaborative learning, termed KDCL, which consistently improves the generalization ability of deep neural networks (DNNs) that have different learning capacities. Unlike existing two-stage knowledge distillation approaches that pre-train a DNN with large capacity as the "teacher" and then transfer the teacher's knowledge to another "student" DNN unidirectionally (i.e. one-way), KDCL treats all DNNs as "students" and collaboratively trains them in a single stage (knowledge is transferred among arbitrary students during collaborative training), enabling parallel computing, fast computation, and appealing generalization ability. Specifically, we carefully design multiple methods to generate soft targets as supervision by effectively ensembling the predictions of the students and distorting the input images. Extensive experiments show that KDCL consistently improves all the "students" on different datasets, including CIFAR-100 and ImageNet. For example, when trained together with KDCL, ResNet-50 and MobileNetV2 achieve 78.2% and 74.0% top-1 accuracy on ImageNet, outperforming the original results by 1.4% and 2.0% respectively. We also verify that models pre-trained with KDCL transfer well to object detection and semantic segmentation on the MS COCO dataset. For instance, the FPN detector is improved by 0.9% mAP.

Figure 1: (a) [10] transfers knowledge from the static pre-trained teacher to the student model. (b) Students can learn from each other in [32]. (c) [15] establishes the teacher using a multiple-branch design; the gate ensembles all the branches. (d) KDCL consistently gains extra information from ensembling the soft targets produced by all students, outperforming existing approaches. The input of each model is randomly distorted separately to increase its generalization ability. When a model pair is trained with KDCL on ImageNet, ResNet-18 is improved by 1.9% and ResNet-50 gains 1.0% improvement due to the knowledge from ResNet-18.

1. Introduction

Knowledge distillation [10] is typically formulated as a "teacher-student" learning setting. It is able to improve the performance of a compact "student" deep neural network because the representation of a "teacher" network can be used as structured knowledge to guide the training of the student. The predictions (e.g. soft targets) produced by the teacher can be easily learned by a student and encourage it to generalize better than training from scratch. However, in the traditional offline knowledge distillation framework, the teacher is pre-trained first and then fixed, meaning that knowledge can only be transferred from the teacher to the student (i.e. one-way), as shown in Fig. 1a.

Online distillation methods [32, 15] are more attractive because the training process is simplified to a single stage and all the networks are treated as students. These approaches merge the training processes of all student networks, enabling them to gain extra knowledge from each other. Students directly learn from the predictions of other students in Deep Mutual Learning (DML) [32], as illustrated in Fig. 1b. However, the outputs of students can be diverse, conflicting with each other and even with the ground truth. When the performances differ significantly among models, this method harms the model with high performance.

An alternative method proposed by [15] (ONE) is to train a multi-branch network while establishing the teacher on the fly, as shown in Fig. 1c. Nonetheless, this method is inflexible because the network is compelled to share lower layers, and knowledge transfer occurs only at the upper layers within a single model rather than across models, limiting the extra knowledge and the performance. The gate module is not a guarantee of a high-quality soft target.

Self-distillation [6] shows that distilling a converged teacher model into a student model of identical network architecture can further improve the generalization ability compared to the teacher. The efficacy of self-distillation and online distillation leads us to the following question: could we use a small network to improve a model with larger capacity in a one-stage distillation framework?

In this work, we propose a novel online knowledge distillation method via collaborative learning. In KDCL, student networks with different capacities learn collaboratively to generate high-quality soft target supervision, which distills the additional knowledge to each student, as illustrated in Fig. 1d. The high-quality soft target supervision aims at instructing students with significant performance gaps to consistently converge with higher generalization ability and less variance to input perturbations in the data domain.

The major challenge is to generate soft target supervision that can boost, with high confidence, the performance of all students, which have different learning capacities or significant performance gaps. Ensembling tends to yield better results when diversity is present among the outputs of the models [14]. Therefore, we propose to generate high-quality soft target supervision by carefully ensembling the outputs of the students with the information of the ground truth in an online manner. Furthermore, we propose to estimate the generalization error by measuring each model on the validation set, so that the soft target is generated for stronger generalization ability on the validation set.

To improve invariance against perturbations in the input data domain, the soft target should encourage the students to produce similar outputs for similarly distorted input images. Therefore, the students are fed with images that are individually perturbed from identical inputs, and the soft target is generated by combining the outputs and fusing the information of the data augmentation. In this case, the benefits of model ensembling are further exploited.

To evaluate the effect of KDCL, we conduct extensive experiments on benchmarks for image classification, CIFAR-100 [13] and ImageNet-2012 [4]. We demonstrate that, with KDCL, ResNet-50 [8] and ResNet-18 trained as a pair achieve 77.8% and 73.1% validation accuracy. ResNet-18 outperforms the baseline by 1.9% and ResNet-50 gains 1.0% improvement as a benefit of the extra knowledge from ResNet-18. We also verify that models pre-trained with KDCL transfer well to object detection and semantic segmentation on the COCO dataset [17].

Our contributions are listed as follows.
• A new pipeline of knowledge distillation based on collaborative learning is designed. Models of various learning capacities can benefit from collaborative training.
• A series of model ensembling methods are designed to dynamically generate high-quality soft targets in a one-stage online knowledge distillation framework.
• Invariance against perturbations in the input domain is enhanced by transferring knowledge and fusing the outputs of images with different distortions.

2. Related work

Knowledge transfer for neural networks is advocated by [2, 10] to distill knowledge from a teacher to a student. An obvious way is to let the student imitate the output of the teacher model. [2] proposes to improve shallow networks by penalizing the difference of logits between the student and the teacher. [10] realizes knowledge distillation by minimizing the Kullback-Leibler (KL) divergence loss of their output categorical probabilities.

Structure knowledge. Based on the pioneering work, many methods have been proposed to excavate more information from the teacher. [20] introduces more supervision by further exploiting the features of intermediate hidden layers. [31] defines additional attention information combined with distillation. [18] mines mutual relations of data examples by distance-wise and angle-wise losses. [23] establishes an equivalence between Jacobian matching and distillation. [9] transfers more accurate information via the route to the decision boundary. A few recent papers about self-distillation [29, 3, 6, 28] have shown that a converged teacher model supervising a student model of identical architecture can improve the generalization ability over the teacher. In contrast to mimicking complex models, KDCL involves all networks in learning and provides hints by fusing the information of the students. Without any additional loss for intermediate layers, KDCL reduces the difficulty of optimizing the model.

Collaborative learning. In the online distillation framework, students imitate the teacher during the training process. DML [32] suggests that peer students learn from each other through the cross-entropy loss between each pair of students. Co-distillation [1] is similar to DML, whereas it forces student networks to maintain their diversity longer by adding the distillation loss only after enough update steps. Inspired by self-distillation, training a multiple-branch variant of the target network has been proposed to establish a strong teacher on the fly. ONE [15] constructs multiple branch classifiers and trains a gate controller to align the teacher's prediction. CLNN [22] promotes the diversity of each branch by a hierarchical multiple-branch design and proposes to scale the gradient accordingly. Different from generating a soft target by averaging the logits, which the above methods adopt, KDCL generates a soft target dynamically to improve all students, even with significant performance gaps. The soft target is also designed to improve invariance in the input data domain.

Ensemble method. Our work is also related to network ensemble approaches. Majority Voting, Stacked Generalization [27], GASEN [33], and Super Learner [12] explicitly select reliable predictions. Dropout [24], DropConnect [26] and Stochastic Depth [11] generally create an exponential number of networks with shared weights during training and then average them at test time. Our method focuses on fusing the information of students and guaranteeing the quality of the soft target to improve the generalization ability of the students.

[Figure 2 appears here: each input image is randomly distorted separately (h(x, ε) with independent random seeds) and fed to Network 1, ..., Network m; the resulting Logits 1, ..., Logits m are ensembled into the teacher logits that define the loss for every student.]
Figure 2: Overview of knowledge distillation via collaborative learning (KDCL). We input images distorted separately for each network to increase invariance against perturbations in the data domain. KDCL dynamically ensembles the soft targets produced by all students to improve the students consistently. h(x, ε) denotes random distortion and ε is the random seed.

3. Collaborative Learning for Knowledge Distillation

3.1. Background

Knowledge distillation optimizes the student network under the supervision of the teacher network. More precisely, the loss is the KL divergence between the softened outputs of the teacher network and the student network, as [10] defines

    L_{KD} = \frac{1}{n} \sum_{i=1}^{n} T^{2} \, \mathrm{KL}(p_i, q_i),    (1)

where n is the batch size, T is the temperature parameter, and p and q represent the softened probability distributions produced by the teacher network and the student network. We denote the logits of the student and the teacher as z_s and z_t. Then q = softmax(z_s / T) and the soft target p = softmax(z_t / T).

A high-quality teacher is important for optimizing a good student. If the teacher is not well optimized and provides noisy supervision, the risk that the soft target and the ground truth conflict with each other becomes high. We evaluate the impact of teacher quality using ResNet-18 [8] as the student model on the ImageNet dataset. All the teacher models and student models are trained for 100 epochs. In Tab. 1, the performance of the same student network is compared under the supervision of different teacher models. When the teacher's size is not too large for the student, a teacher with higher performance provides better supervision for the student by being a better predictor.

Teacher Model    Teacher Top-1    Student Top-1
ResNet-34        73.5             70.8
ResNet-50        76.5             71.2
ResNet-101       77.9             71.4

Table 1: Top-1 accuracy on the ImageNet-2012 validation set. The second column is the pre-trained teacher model's performance, and the third one is the student model accuracy trained with the KD loss. The student gets 70.1% accuracy when supervised by the hard target.

3.2. Our method

Overview. We propose KDCL to automatically generate a soft target in an online manner, as illustrated in Fig. 2. The framework can be viewed as a super network composed of multiple individual sub-networks. The raw images are augmented separately with different random seeds, and the soft target is generated to supervise all networks. We propose a series of methods to generate the soft target, which ensures that students with different capacities benefit from collaborative learning and enhances the invariance of the networks against input perturbations. Note that all the models can predict independently, so the improvement does not incur additional test-time computational cost.

To improve the generalization performance, we distill the knowledge of the soft target to each sub-network via the KD loss. All the sub-networks are trained from scratch. Together with the standard cross-entropy loss, all the networks are trained end-to-end with a multi-task loss function:

    L = \sum_{i=1}^{m} \left( L_{CE}^{i} + \lambda L_{KD}^{i} \right),    (2)

where L_{KD} is the KL divergence between the output of the students and the soft target, and λ is the trade-off weight.
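As a concrete illustration of Eqs. (1) and (2), the PyTorch-style sketch below computes the per-batch KDCL objective for a list of student logits and an already ensembled teacher logit. This is only our reading of the loss; the function names (kd_loss, kdcl_loss) and the choice to detach the teacher logit are ours, not taken from any released code.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logit, teacher_logit, T=2.0):
    # Eq. (1): temperature-softened KL divergence, scaled by T^2 so that
    # gradient magnitudes stay comparable across temperatures.
    p = F.softmax(teacher_logit / T, dim=1)          # soft target
    log_q = F.log_softmax(student_logit / T, dim=1)  # student distribution
    return F.kl_div(log_q, p, reduction="batchmean") * (T * T)

def kdcl_loss(student_logits, teacher_logit, labels, T=2.0, lam=1.0):
    # Eq. (2): cross-entropy plus weighted KD loss, summed over the m students.
    # The ensembled teacher logit is detached here (our assumption) so no
    # gradient flows back through the ensemble itself.
    total = 0.0
    for z in student_logits:                         # list of [batch, classes] tensors
        ce = F.cross_entropy(z, labels)
        kd = kd_loss(z, teacher_logit.detach(), T)
        total = total + ce + lam * kd
    return total
```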
KDCL-Naive. In our framework, all the models are student models, and the supervision is generated by combining the outputs of the models. Assuming that there are m sub-networks, the logit of the k-th sub-network is denoted z_k. The teacher logit z_t is expressed as

    z_t = h(z_1, z_2, \ldots, z_m),    (3)

where h is a function that produces a higher-quality logit compared with the logits of the students. Assuming that training samples and test samples follow the same distribution, a model predicting with less loss on the training set encourages the students to converge faster.

A naive combination method is to choose the logit with the smallest cross-entropy loss among all students, which can be defined as

    z_t = z_k, \quad k = \arg\min_{i} L_{CE}(z_i, y),    (4)

where y is the one-hot label and L_{CE} is the standard cross-entropy loss.
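A minimal sketch of the KDCL-Naive rule in Eq. (4): the student logit with the smallest cross-entropy against the label is kept as the teacher logit. Applying the selection per example (rather than once per batch) and the helper name naive_teacher_logit are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def naive_teacher_logit(student_logits, labels):
    # student_logits: list of m tensors, each [batch, classes]; labels: [batch]
    # Per-example cross-entropy of every student, shape [m, batch].
    ce = torch.stack([
        F.cross_entropy(z, labels, reduction="none") for z in student_logits
    ])
    best = ce.argmin(dim=0)                     # index of the best student per example
    logits = torch.stack(student_logits)        # [m, batch, classes]
    batch_idx = torch.arange(labels.size(0), device=labels.device)
    return logits[best, batch_idx]              # [batch, classes] teacher logit
```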
KDCL-Linear. The naive combination is easy to implement, but the resulting teacher logit is not of sufficient quality. KDCL-Linear defines the teacher logit as the best linear combination of the sub-network logits, which is a simple but useful form of information fusion. Finding the best linear combination can be treated as an optimization problem. Let the matrix Z = (z_1^T; z_2^T; \ldots; z_m^T), whose i-th row is the logit of the i-th student. The problem can be formulated as

    \min_{\alpha \in \mathbb{R}^m} L_{CE}(\alpha^{T} Z, y), \quad \text{subject to} \; \sum_{i=1}^{m} \alpha_i = 1, \; \alpha_i \ge 0.    (5)

Eq. (5) is a convex optimization problem and is easy to solve.
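Eq. (5) is a small constrained problem over the probability simplex. One possible way to solve it, sketched below under our own assumptions, is to parameterize the weights with a softmax and take a few gradient steps on the batch; the function name linear_teacher_logit and the hyper-parameters steps and lr are illustrative, and the paper does not prescribe this particular solver.

```python
import torch
import torch.nn.functional as F

def linear_teacher_logit(student_logits, labels, steps=20, lr=0.1):
    # student_logits: list of m tensors, each [batch, classes]; detached so the
    # small inner solver does not backpropagate into the students.
    Z = torch.stack([z.detach() for z in student_logits])   # [m, batch, classes]
    a = torch.zeros(Z.size(0), device=Z.device, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        alpha = F.softmax(a, dim=0)                   # keeps the weights on the simplex
        mixed = torch.einsum("m,mbc->bc", alpha, Z)   # alpha^T Z from Eq. (5)
        loss = F.cross_entropy(mixed, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    alpha = F.softmax(a.detach(), dim=0)
    return torch.einsum("m,mbc->bc", alpha, Z)
```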
KDCL-MinLogit. KDCL-Linear introduces another optimization problem during training, whereas we want the network ensemble to be efficient. As an alternative, we propose the KDCL-MinLogit method to generate the soft target. The differences between the values in a logit determine the probability distribution produced by the softmax function, so the output probability can be expressed as

    p = \mathrm{softmax}(z) = \mathrm{softmax}(z - z_c),    (6)

where z_c is the element corresponding to the target class c in the logit. Define z^c = z - z_c; then the c-th element of z^c is 0 for all sub-networks. When the other elements of the logit become smaller, the cross-entropy loss with the one-hot label decreases. A neat way to generate the teacher logit is therefore to select the minimum element of each row of the matrix Z^c, which is defined as Z^c = (z_1^c, z_2^c, \ldots, z_m^c). More precisely, the teacher logit can be expressed as

    z_{t,j} = \min\{ Z^{c}_{j,i} \mid i = 1, 2, \ldots, m \},    (7)

where z_{t,j} is the j-th element of the soft target z_t and Z^{c}_{j,i} is the element in the j-th row and i-th column of Z^c. This method is compatible with mainstream frameworks.
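Eqs. (6)-(7) amount to shifting each student logit so that its target-class entry is zero and then taking an element-wise minimum across students. A short sketch follows; the function name min_logit_teacher is ours.

```python
import torch

def min_logit_teacher(student_logits, labels):
    # student_logits: list of m tensors, each [batch, classes]; labels: [batch]
    shifted = []
    for z in student_logits:
        zc = z.gather(1, labels.unsqueeze(1))   # target-class value, [batch, 1]
        shifted.append(z - zc)                  # Eq. (6): target-class entry becomes 0
    stacked = torch.stack(shifted)              # [m, batch, classes]
    return stacked.min(dim=0).values            # Eq. (7): element-wise minimum over students
```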
KDCL-General. A teacher with more generalization ability usually instructs the students to converge better. Performance on a validation dataset can be viewed as a measure of generalization ability. Therefore, we propose to find an optimal ensemble of the m component networks to approximate such a general teacher. We randomly pick N examples from the training set to construct a validation set D_v, and the predictions of the component networks are combined through a weighted average. The weights should satisfy w_i ∈ [0, 1] (i = 1, 2, \ldots, m) and \sum_{i=1}^{m} w_i = 1. In this setup, we focus on measuring the generalization ability directly, so we work with probabilities rather than logits. The generalization error on an input x is defined as

    E(x) = (f(x) - t)^2,    (8)

where f(x) is the predicted probability of the target class and t is the ground truth. The generalization error of the ensemble network can be expressed as

    E = \int \Big( \sum_{i=1}^{m} w_i f_i(x) - t \Big)^{2} p(x)\, dx = \sum_{i=1}^{m} \sum_{j=1}^{m} w_i w_j C_{ij},    (9)

where p(x) is the data distribution and C_{ij} is expressed as

    C_{ij} = \int (f_i(x) - t)(f_j(x) - t)\, p(x)\, dx \approx \frac{1}{N} \sum_{k=1}^{N} (f_i(x_k) - t)(f_j(x_k) - t).    (10)

In general, the distribution of the data is unknown or intractable, hence the empirical distribution is adopted as an approximation. According to Eq. (9) and the constraint on the weights, the optimal weight w can be solved with the Lagrange multiplier method as follows:

    w_k = \frac{\sum_{j=1}^{m} C^{-1}_{kj}}{\sum_{i=1}^{m} \sum_{j=1}^{m} C^{-1}_{ij}},    (11)

where w_k is the k-th element of the optimal weight w and C^{-1}_{ij} is the value in the i-th row and j-th column of the inverse matrix of C.

Measuring the generalization error incurs little computational cost, and updating the parameters of the neural networks for a few steps does not change their outputs drastically due to the small learning rate. Consequently, we update the optimal weight vector w after each training epoch. Moreover, without any prior knowledge, we initialize all the component networks with equal weight 1/m, and the soft target is the weighted average probability distribution.
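Given each student's predicted probability of the target class on the held-out examples, Eqs. (10)-(11) reduce to a few lines of linear algebra. The sketch below uses our own names (general_weights, probs) and adds a small ridge term to keep the estimated C invertible, which is our addition rather than part of the paper.

```python
import torch

def general_weights(probs, targets, eps=1e-6):
    # probs:   [m, N] each student's predicted probability of the target class
    #          on the N held-out examples.
    # targets: [N] ground-truth value of that probability (i.e. 1.0).
    err = probs - targets.unsqueeze(0)                 # f_i(x_k) - t
    C = err @ err.t() / probs.size(1)                  # Eq. (10), an [m, m] matrix
    C_inv = torch.inverse(C + eps * torch.eye(C.size(0), device=C.device))
    w = C_inv.sum(dim=1) / C_inv.sum()                 # Eq. (11)
    return w
```

The resulting w is recomputed after every epoch and used to average the students' softened probabilities into the soft target, as described above.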
Invariant collaborative learning. In order to improve invariance against perturbations in the data domain, we generate an identical soft target for all students fed with similarly distorted images. Therefore, we randomly sample images with the same data augmentation policy for each sub-network and fuse the knowledge with the above ensembling methods. This effectively increases the amount and diversity of the training data. With the knowledge of the additional training data, the soft target encourages the sub-networks to have a lower generalization error.
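A rough sketch of the data flow described above, assuming a distort callable that applies a fresh random distortion to a batch and a list of student networks; both names are placeholders rather than the authors' implementation.

```python
def student_logits_for_batch(raw_batch, students, distort):
    # Each student receives its own randomly distorted view of the same raw
    # batch, i.e. h(x, eps_i) with independent random seeds per student.
    logits = []
    for net in students:
        view = distort(raw_batch)      # fresh random distortion for this student
        logits.append(net(view))
    # Any of the ensembling rules above (naive, linear, min-logit, or the
    # weighted average of KDCL-General) turns these logits into the soft target.
    return logits
```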
4. Experiments

In this section, we perform a series of experiments to evaluate our training mechanism on image classification benchmarks and conduct transfer experiments on the COCO dataset [17], a widely used benchmark in the field of object detection and segmentation.

4.1. Results on ImageNet

In the experiments on ImageNet, we analyze the effectiveness of our methods of generating the soft target based on the student pair ResNet-50 and ResNet-18, and evaluate a series of network architectures on ImageNet.

Dataset and training details. The ImageNet dataset contains 1000 object classes with about 1.28 million images for training and 50,000 images for validation. We separate 20,000 images from the training set, 20 samples per class, as the validation set to measure the generalization ability of the sub-networks for KDCL-General, so the original validation set is treated as the test set.

We follow the ResNet training procedure. The learning rate starts at 0.1 and warms up to 0.8 linearly over the first 5 epochs. We set the weight decay to 0.0001, the batch size to 2048, and the momentum to 0.9. All the ResNet models are trained for 200 epochs and the learning rate drops by 0.1 at epochs 60, 120, and 180. By default, the MobileNetV2 [21] models are optimized for 300 epochs by stochastic gradient descent (SGD) with a warm-up of the learning rate to 0.8 and a decay by 0.1 at epochs 90, 180, and 270. We apply scale and aspect-ratio augmentation and the photometric distortions [25]. During test time, the images are scaled to 256x256 followed by a 224x224 center crop.

Method           ResNet-50   ResNet-18   Gain
Vanilla          76.8        71.2        0
KD [10]          76.8        72.1        0.9
DML [32]         75.8        71.7        -0.5
ONE [15]         -           72.2        -
CLNN [22]        -           72.4        -
KDCL-Naive       77.5        72.9        2.4
KDCL-Linear      77.8        73.1        2.9
KDCL-MinLogit    77.8        73.1        2.9
KDCL-General     77.1        72.0        1.1

Table 2: Top-1 accuracy rate (%) on ImageNet. All the models are reimplemented with our training procedure for a fair comparison. Gain indicates the sum of the improvements of the component student networks. ONE and CLNN are incompatible with different network structures, therefore only the accuracy of ResNet-18 is compared.

Method                   Top-1   Top-5   Params
Vanilla                  71.2    90.0    11.7M
ONE [15]                 72.2    90.6    29.5M
KDCL MobileNetV2x1.2     72.9    90.8    16.4M
CLNN [22]                72.4    90.7    40.5M
KDCL ResNet-50           73.1    91.2    37.2M

Table 3: Top-1 and Top-5 accuracy rate (%) on ImageNet. The backbone is ResNet-18. ONE is trained with 3 branches (Res4 block) and CLNN has a hierarchical design with 4 heads. For KDCL, ResNet-18 is trained with a peer network.

Quantitative comparison. Tab. 2 shows the top-1 accuracy rate on ImageNet. KD indicates Knowledge Distillation [10] and DML represents Deep Mutual Learning [32]. ONE [15] and CLNN [22] are self-distillation methods that are incompatible with different network structures, therefore only the accuracy of ResNet-18 is compared.

From the results in Tab. 2, we can make the following key observations. DML can generate an appropriate soft target for the compact model but harms the complex model when there is a significant performance gap, as the prediction of the compact model conflicts with the complex model and the ground truth. KDCL-Linear outperforms KDCL-Naive as a result of the higher-quality soft target. KDCL-MinLogit is more efficient while its performance is equal to that of KDCL-Linear. The result of KDCL-General is not good enough because of the imprecise estimates: the weights for ensembling the predictions are updated each epoch rather than each iteration to save computational cost, so the generated soft target is not as good as those of KDCL-Linear and KDCL-MinLogit.

Following the setting of ONE [15] and CLNN [22], the low-level layers are shared to save parameters and the ensembled branches are equivalent to several identical networks. For a fair comparison, we choose a single model as the peer network with fewer parameters than the multiple-branch architecture. Our method surpasses both the complex ONE, whose gate controller predicts learnable ensemble logits, and the carefully designed CLNN with its hierarchical architecture, as shown in Tab. 3. The result indicates that ONE and CLNN gain limited extra knowledge due to the multi-branch design.

Our proposed method applies to various architectures. Therefore, we conduct experiments on complex-compact and compact-compact pairs. Tab. 4 shows that the more compact model MobileNetV2x0.5 can provide hints for MobileNetV2, ResNet-18 and even ResNet-50, because the compact model can beat the complex model on some samples, as illustrated in Fig. 3. MobileNetV2x0.5 with 1.9M parameters helps to improve ResNet-50 with 25.6M. This proves that our approach is suitable for situations in which there is a significant performance gap between models. Long training runs can further improve accuracy by providing more varied soft targets through randomly distorting the training images; the top-1 accuracy of ResNet-50 and ResNet-18 is further increased by 0.8% and 1.0% with another 100 training epochs.

Model 1      Top-1   Model 2      Top-1   Method
MBV2         72.0    MBV2x0.5     64.8    Vanilla
MBV2         73.1    MBV2x0.5     66.2    Linear
MBV2         73.1    MBV2x0.5     66.3    MinLogit
ResNet-18    71.2    MBV2x0.5     64.8    Vanilla
ResNet-18    71.8    MBV2x0.5*    65.6    Linear
ResNet-18    71.9    MBV2x0.5*    65.6    MinLogit
ResNet-18    71.2    MBV2         72.0    Vanilla
ResNet-18    72.1    MBV2*        72.8    Linear
ResNet-18    72.2    MBV2*        72.8    MinLogit
ResNet-50    76.8    MBV2x0.5     64.8    Vanilla
ResNet-50    77.5    MBV2x0.5*    67.1    Linear
ResNet-50    77.7    MBV2x0.5*    66.8    MinLogit
ResNet-50*   76.5    ResNet-18*   71.2    Vanilla
ResNet-50*   76.8    ResNet-18*   72.0    Linear
ResNet-50*   77.0    ResNet-18*   72.1    MinLogit

Table 4: The comparative results of different sub-networks on the ImageNet validation set. MBV2 is the abbreviation of MobileNetV2, and MBV2x0.5 represents a width multiplier of 0.5. ResNet-50* and ResNet-18* are trained for 100 epochs; MBV2* and MBV2x0.5* are trained for 200 epochs.

[Figure 3 appears here: bar chart of per-category accuracy (0.0-0.8) of ResNet-50 and ResNet-18 on categories such as tiger cat, mashed potato, guinea pig, table lamp, barometer, cornet, Band Aid, pillow, hatchet and green lizard.]
Figure 3: The comparison of ResNet-50 and ResNet-18 on part of the categories of the ImageNet validation set.

Ensemble more models. The main experiments show good results using two sub-networks. We also show that ensembling more models generally gives better accuracy. Tab. 5 shows that ensembling two sub-networks outperforms the baseline model by a remarkable margin. KDCL scales well with more sub-networks, but the gain decreases as the number of networks increases. We conjecture that the mutual information between the strong ensemble network and an additional network increases as the ensemble size grows. Experiments are further conducted with neural networks of different capacities on ImageNet. Tab. 6 shows that ResNet-50 achieves 78.2% top-1 accuracy with knowledge from three compact models.

Networks     1      2      3      4      5
Top-1 (%)    70.1   71.3   71.61  71.75  71.87

Table 5: KDCL benefits from ensembling more sub-networks. All the networks are ResNet-18 to prevent the impact of network performance differences.

Model     Res-50   Res-18   MBV2   MBV2x0.5   Gain
Vanilla   76.8     71.2     72.0   64.8       0
KDCL      78.2     73.5     74.0   66.9       7.8

Table 6: Top-1 accuracy rate (%) on ImageNet. ResNet-50 is significantly improved with the knowledge from three compact models.

4.2. Results on CIFAR

Dataset and training details. CIFAR-100 consists of 32x32 color images from 100 classes. The dataset is split into a training set with 50,000 images and a test set with 10,000 images. For KDCL-General, we separate 5,000 images from the training set, 50 samples per class, as the validation set to measure the generalization ability of the students. All the models are trained for 200 epochs with the learning rate starting at 0.1 and dropping by 0.1 at epochs 100 and 150. We set the weight decay to 0.0005, the batch size to 128, and the momentum to 0.9. All the training images are padded with 4 pixels, and a 32x32 crop is randomly sampled from the padded image or its horizontal flip. The temperature T and λ are 2 and 1 respectively. Accuracy is computed as the median of 5 runs.

Method             ICL   ResNet-32 Acc %   WRN-16-2 Acc %   Gain
Vanilla                  69.9              72.2             0
2 distill 1 [10]         73.3              72.2             3.4
1 distill 2 [10]         69.9              74.5             2.3
DML [32]                 73.3              74.8             6.0
ONE [15]                 73.6              -                -
CLNN [22]                73.4              -                -
KDCL-Naive               73.7              74.8             6.4
KDCL-Naive         √     73.8              74.9             6.6
KDCL-Linear              73.4              74.6             5.9
KDCL-Linear        √     73.6              74.9             6.4
KDCL-MinLogit            73.0              74.1             5.0
KDCL-MinLogit      √     73.5              74.6             6.0
KDCL-General             74.0              75.2             7.1
KDCL-General       √     74.3              75.5             7.7

Table 7: The comparative and ablative results of our soft-target generation methods on the CIFAR-100 dataset. ICL is invariant collaborative learning. We only report the accuracy of ResNet-32, as ONE and CLNN are incompatible with WRN-16-2.

Knowledge distillation relieves over-fitting. The second and third rows of Tab. 7 show that the student model becomes more general through knowledge distillation and even surpasses the teacher model. For distillation (2nd row in Tab. 7) from Wide-ResNet-16 [30] with a widening factor of 2 (WRN-16-2) to ResNet-32, we observe that the accuracy of the student network ResNet-32 is 93.37% on the training set, behind the 99.39% of the teacher network WRN-16-2, while its test error is lower than that of WRN-16-2. This phenomenon demonstrates that knowledge distillation can relieve over-fitting.

Quantitative comparison. Most of our proposed methods outperform DML as a result of the effective learning mechanism and the end-to-end training manner, compared with multi-stage parameter updating. An interesting phenomenon is that KDCL-MinLogit and KDCL-Linear do worse than KDCL-Naive, which conflicts with the results on ImageNet. We conjecture that a soft target with less cross-entropy loss on the CIFAR-100 training set leads to over-fitting, like the one-hot label. KDCL-General significantly improves the performance through the more general teacher model obtained by the optimal weighted average on the validation set. This result shows that our approaches can further improve the ability of knowledge distillation to alleviate over-fitting. It also shows that there is a trade-off between fitting ability and generalization ability when the amount of data is limited.

The ablation study in Tab. 7 shows that invariant collaborative learning is promising. The improvement comes from fusing the information of differently distorted images, and the shared soft target also encourages the sub-networks to produce similar outputs for similar inputs.
4.3. Transfer Learning

Dataset and training details. We follow the commonly used practice [19, 7] of dividing the 40k validation set into a 35k and a 5k subset. The training set containing 80k images and the 35k subset are used for training. The 5k subset, denoted the minival set, is used to validate our results. All the models are trained for 14 epochs on 8 GPUs with 4 images on each GPU. The learning rate starts at 0.04 and is decayed by 0.1 at epochs 9 and 12. The weight decay is 0.0001 and the momentum is 0.9. To fully utilize the capacity of the model, all the batch normalization layers are in sync mode and no weights are frozen. We replace ROI-Pooling with ROI-Align [7] for better results by default.

Backbone                Type     box mAP   mask mAP
ResNet-18 (Baseline)    Faster   32.2      -
ResNet-18 (Ours)        Faster   33.1      -
ResNet-18 (Baseline)    Mask     33.4      30.7
ResNet-18 (Ours)        Mask     34.0      31.3

Table 8: Average precision (AP) on the COCO 2017 validation set with pre-trained ResNet-18. All models are used as backbones for Faster-RCNN [19] and Mask-RCNN [7] based on FPN [16].

Results. Tab. 8 reports the validation-set performance of object detection and instance segmentation on the standard AP metric (the average AP for IoU from 0.5 to 0.95 with a step size of 0.05). Based on ResNet-18 trained with KDCL, the detection head outperforms the baseline by 0.9%. Our proposed learning mechanism also brings an improvement of 0.6% on instance segmentation. The improvements come from the more powerful generalization. In summary, this set of experiments demonstrates that the improvements induced by our learning mechanism can be realized across a broad range of tasks and datasets.
References

[1] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
[2] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654-2662, 2014.
[3] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving ImageNet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[5] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[6] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[9] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge distillation with adversarial samples supporting decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3771-3778, 2019.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646-661. Springer, 2016.
[12] Cheng Ju, Aurélien Bibaut, and Mark van der Laan. The relative performance of ensemble methods with deep convolutional neural networks for image classification. Journal of Applied Statistics, 45(15):2800-2818, 2018.
[13] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[14] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181-207, 2003.
[15] Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7528-7538. Curran Associates Inc., 2018.
[16] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.
[18] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[19] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[20] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510-4520, 2018.
[22] Guocong Song and Wei Chai. Collaborative learning for deep neural networks. In Advances in Neural Information Processing Systems, pages 1832-1841, 2018.
[23] Suraj Srinivas and Francois Fleuret. Knowledge transfer with Jacobian matching. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4723-4731. PMLR, 2018.
[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-9, 2015.
[26] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058-1066, 2013.
[27] David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241-259, 1992.
[28] Chenglin Yang, Lingxi Xie, Chi Su, and Alan L. Yuille. Snapshot distillation: Teacher-student optimization in one generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[29] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133-4141, 2017.
[30] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[31] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[32] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep mutual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4320-4328, 2018.
[33] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1-2):239-263, 2002.