Annealing Knowledge Distillation

1,2Aref Jafari, 2Mehdi Rezagholizadeh, 1Pranav Sharma, 1,3Ali Ghodsi

1 David R. Cheriton School of Computer Science, University of Waterloo
2 Huawei Noah’s Ark Lab
3 Department of Statistics and Actuarial Science, University of Waterloo
{aref.jafari, p68sharma, ali.ghodsi}@uwaterloo.ca, [email protected]

Abstract

Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as "dark knowledge") besides the given regular hard labels in a training set. However, it is shown in the literature that the larger the gap between the teacher and the student networks, the more difficult their training using knowledge distillation becomes. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information provided by the teacher's soft-targets incrementally and more efficiently. Our Annealing-KD technique is based on a gradual transition over annealed soft-targets generated by the teacher at different temperatures in an iterative process, and therefore the student is trained to follow the annealed teacher output in a step-by-step manner. This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method. We performed a comprehensive set of experiments on different tasks such as image classification (CIFAR-10 and 100) and NLP language inference with BERT-based models on the GLUE benchmark and consistently obtained superior results.

1 Introduction

Despite the great success of deep neural networks in many challenging tasks such as natural language processing (Vaswani et al., 2017; Liu et al., 2019), computer vision (Wong et al., 2019; Howard et al., 2017), and speech processing (Chan et al., 2016; He et al., 2019), these state-of-the-art networks are usually too heavy to be deployed on edge devices with limited computational power (Bie et al., 2019; Lioutas et al., 2019). A case in point is the BERT model (Devlin et al., 2018), which can comprise more than a hundred million parameters.

The problem of network over-parameterization and the expensive computational complexity of deep networks can be addressed by neural model compression. There are abundant neural model compression techniques in the literature (Prato et al., 2019; Tjandra et al., 2018; Jacob et al., 2018), among which knowledge distillation (KD) is one of the most prominent (Hinton et al., 2015). KD has been tailored extensively to serve different applications and different network architectures (Furlanello et al., 2018; Gou et al., 2020). For instance, patient KD (Sun et al., 2019), TinyBERT (Jiao et al., 2019), and MobileBERT (Sun et al., 2020) are designed particularly for distilling the knowledge of BERT-based teachers to a smaller student.

The success of KD is mainly attributed to its training objective function, which exploits the soft-target information (also known as "dark knowledge") besides the given regular hard labels in the training set (Hinton, 2012). Previous studies in the literature (Lopez-Paz et al., 2015; Mirzadeh et al., 2019) show that when the gap between the student and teacher models increases, training models with KD becomes more difficult. We refer to this problem as KD's capacity gap problem in this paper. For example, Mirzadeh et al. (2019) show that if we gradually increase the capacity of the teacher, the performance of the student model first improves for a while, but after a certain point it starts to drop. Therefore, although increasing the capacity of a teacher network usually boosts its performance, it does not necessarily lead to a better teacher for the student network in KD. In other words, it becomes more difficult for KD to transfer the knowledge of this enhanced teacher to the student. A similar scenario happens when the gap between the teacher and student networks is originally large.


Mirzadeh et al. (2019) proposed their TAKD solution to this problem, which makes the KD process smoother by filling the gap between the teacher and student networks with an intermediate auxiliary network (referred to as the "teacher assistant", TA). The size of this TA network is between the sizes of the student and the teacher, and it is trained by the teacher first. Then, the student is trained using KD with the TA network playing the role of its teacher. This way, the training gap (between the teacher and the student) is less significant compared to the original KD. However, TAKD suffers from a high computational complexity demand, since it requires training the TA network separately. Moreover, the training error of the TA network can be propagated to the student during the KD training process.

In this paper, we want to solve the KD capacity gap problem from a different perspective. We propose our Annealing-KD technique to bridge the gap between the student and teacher models by introducing a new KD loss with a dynamic temperature term. This way, Annealing-KD is able to transfer the knowledge of the teacher smoothly to the student model via a gradual transition over soft-labels generated by the teacher at different temperatures. We can summarize the contributions of this paper as follows:

1. We propose our novel Annealing-KD solution to the KD capacity gap problem, based on modifying the KD loss and introducing a dynamic temperature function to make the student training gradual and smooth.

2. We provide a theoretical and empirical justification for our Annealing-KD approach.

3. We apply our technique to ResNet-8 and plain CNN models on both the CIFAR-10 and CIFAR-100 image classification tasks, and to the natural language inference task with different BERT-based models such as DistilRoBERTa and BERT-Small on the GLUE benchmark, and achieve state-of-the-art results.

4. Our technique is simple, architecture agnostic, and can be applied on top of different variants of KD.

2 Related Work

2.1 Knowledge Distillation

In the original knowledge distillation method by Hinton et al. (2015), which is referred to as KD in this paper, the student network is trained based on two guiding signals: first, the training dataset, or hard labels, and second, the teacher network predictions, known as soft labels. Therefore, KD is trained based on a linear combination of two loss functions: the regular cross entropy loss between the student outputs and the hard labels, and the KD loss, which minimizes the distance between the output predictions of the teacher and student networks at a particular temperature T on the training samples:

    L = (1 - \lambda) L_{CE} + \lambda L_{KD}
    L_{CE} = H_{CE}\big(y, \sigma(z_s(x))\big)
    L_{KD} = T^2 \, KL\big(\sigma(z_t(x)/T), \, \sigma(z_s(x)/T)\big)    (1)

where H_CE(.) and KL(.) denote the cross entropy and KL divergence respectively, z_s(x) and z_t(x) are the output logits of the student and teacher networks, T is the temperature parameter, σ(.) is the softmax function, and λ is a coefficient in [0, 1] that controls the contribution of the two loss functions. The above loss function minimizes the distance between the student model and both the underlying function and the teacher model, assuming the teacher is a good approximation of the underlying function of the data.

A particular problem with KD that we would like to address in this paper is that the larger the gap between the teacher and the student networks, the more difficult their training using knowledge distillation becomes (Lopez-Paz et al., 2015; Mirzadeh et al., 2019).
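For concreteness, the following is a minimal PyTorch sketch of the loss in Equation 1. It is our own illustration rather than code from the paper; the default values T = 4 and λ = 0.5 are placeholders (the appendix reports λ = 0.5 for the Vanilla KD baseline on GLUE, but T is task-dependent).

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    """Standard KD objective of Eq. (1):
    (1 - lam) * CE(student, hard labels) + lam * T^2 * KL(teacher_T || student_T)."""
    # cross entropy between student predictions and the hard labels
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between the temperature-softened distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - lam) * ce + lam * kl
```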

2.2 Teacher Assistant Knowledge Distillation (TAKD)

To address the capacity gap problem between the student and teacher networks in knowledge distillation, TAKD (Mirzadeh et al., 2019) proposes to train the student (of small capacity) with a pre-trained intermediate network (of moderate capacity) called the teacher assistant (TA). In this regard, we first train the TA with the guidance of the teacher network using the KD method. Then, we use the learned TA network to train the student network. Here, since the capacity of the TA network lies between the capacities of the teacher and the student networks, it can fill the gap between the teacher and the student and transfer the teacher's knowledge to the student network.

As mentioned in (Mirzadeh et al., 2019), a better idea could be to use TAKD in a hierarchical way: we could have several TAs with different levels of capacity, from large capacities close to the teacher model to small capacities close to the student model, and train these TAs consecutively from large to small in order to have a smoother transfer of the teacher's knowledge to the student model. However, this is difficult in practice. First, since we need to train a new model each time, it is computationally expensive. Second, we accumulate additive error in each step: each TA has an approximation error after training, and these errors accumulate and transfer to the next TA. In the next section, we propose a simple method that realizes this idea while avoiding the mentioned problems.

2.3 Annealing in Knowledge Distillation

Clark et al. (2019) proposed an annealing idea in their Born-Again Multi-task (BAM) paper to train a multitask student network using distillation from several single-task teachers. They introduce a so-called teacher annealing scheme to distill from a dynamic weighted mixture of the teacher prediction and the ground-truth label. In this regard, the weight of the teacher's prediction is gradually reduced compared to the weight of the ground-truth labels during training. Therefore, early in training, the student model mostly learns from the teacher and, later on, it learns mostly from the target labels. However, our Annealing-KD differs from Clark et al. (2019) in several aspects. First, the annealing term introduced in BAM is conceptually different from our annealing: while in BAM teacher annealing controls the contribution of the teacher's dark knowledge compared to the ground-truth labels during training, our annealing is only applied to the teacher output in the KD loss to solve the capacity gap problem between the teacher and student networks. Second, the way we do annealing in our technique is through the temperature parameter and not by controlling the contribution of the teacher and ground-truth labels. Third, BAM falls into another category of knowledge distillation, which focuses on improving the performance of the student model rather than compressing it. Our method is described in the next section.

3 Method: Annealing Knowledge Distillation

In this section, we describe our Annealing-KD technique and show the rationale behind it. First, we formulate the problem and visualize our technique using an example for a better presentation. Then, we use VC-dimension theory to understand why our technique improves knowledge distillation. We wrap up this section by visualizing the loss landscape of Annealing-KD for a ResNet network in order to investigate the impact of our method on the KD loss function.

KD defines a two-objective loss function (i.e., the L_KD and L_CE terms in Equation 1) to minimize the distance between student predictions and the soft labels and hard labels simultaneously. Without adding to the computational needs of the KD algorithm, our Annealing-KD model breaks the KD training into two stages: Stage I, gradually training the student to mimic the teacher using our Annealing-KD loss L_{KD}^{Annealing}; Stage II, fine-tuning the student with hard labels using L_{CE}. We define the loss function of our method as follows:

    L = \begin{cases} L_{KD}^{Annealing}(i), & \text{Stage I: } 1 \le T_i \le \tau_{max} \\ L_{CE}, & \text{Stage II: } T_n = 1 \end{cases}    (2)

In the above equation, i indicates the epoch index in the training process, with a maximum epoch number of n for Stage I, T_i represents the temperature value at the i-th epoch, L_CE is unchanged from Equation 1, and at each epoch i, L_{KD}^{Annealing}(i) is defined as follows:

    L_{KD}^{Annealing}(i) = \| z_s(x) - z_t(x) \times \Phi(T_i) \|_2^2
    \Phi(T) = 1 - \frac{T-1}{\tau_{max}}, \quad 1 \le T \le \tau_{max}, \; T \in \mathbb{N}    (3)

In Equation 3, L_{KD}^{Annealing} is defined as an MSE loss between the logits of the student, z_s(x), and an annealed version of the teacher logits, z_t(x), obtained by multiplying the logits by the annealing function Φ(T). The annealing function Φ(T) can be replaced with any monotonically decreasing function Φ : {1, ..., τmax} → [0, 1]. In Stage I of our training, we initially set T_1 = τmax (which leads to the most softened version of the teacher outputs, because Φ(T_1) = 1/τmax) and decrease the temperature during training as the epoch number grows (that is, T → 1 while i → n). Training in Stage I continues until i = n and T = 1, for which Φ(T_n) = 1 and we get the sharpest version of z_t without any softening. The intuition behind using the MSE loss in Stage I is that matching the logits of the teacher and student models is a regression task, and MSE is one of the best loss functions for this task. We also did an ablation study comparing the performance of the MSE and KL-divergence loss functions in Stage I, and the results support our intuition. For more details, please refer to Table 10 in the appendices.

Therefore, our Annealing-KD bridges the gap between the student and teacher models by introducing the dynamic temperature term (that is, the annealing function Φ(T)) in Stage I of training. This way, our Annealing-KD method is able to smoothly transfer the teacher's knowledge to the student model via a gradual transition over soft-labels generated by the teacher at different temperatures.

To summarize, our Annealing-KD technique differs from KD in the following aspects:

• Annealing-KD does not need any λ hyper-parameter to weigh the contribution of the soft and hard label losses, because it trains on each loss in a different stage.

• Our Annealing-KD loss L_{KD}^{Annealing} uses the ||.||_2^2 loss instead of the KL divergence.

• Moreover, our technique uses a dynamic temperature, by defining the annealing function Φ(T) in the Annealing-KD loss, instead of the fixed temperature used in KD.

• Our empirical experiments showed that it is best to use the network logits instead of the softmax outputs in L_{KD}^{Annealing}. Furthermore, in contrast to KD, we do not apply the temperature term to the student output.

Algorithm 1 explains the proposed method in more detail.

Algorithm 1: ANNEALING-KD(S, T, X, k, τmax, n)
    # Stage I
    for T = τmax down to 1 do
        Φ ← 1 − (T − 1)/τmax
        for i = 1 to k do
            TRAIN-ANNEALING(S, T, X, Φ)
            SAVE-BEST-CHECKPOINT(S)
        end for
    end for
    S ← LOAD-BEST-CHECKPOINT    # Stage II
    for i = 1 to n do
        TRAIN-FINE-TUNE(S, X)
        SAVE-BEST-CHECKPOINT(S)
    end for
    S ← LOAD-BEST-CHECKPOINT
    return S

In this section, we proposed an approach to alleviate the gap between the teacher and student models as well as to reduce the sharpness of the KD loss function. In our model, instead of pushing the student network to learn a complex teacher function from scratch, we start training the student from a softened version of the teacher and gradually move toward the original teacher outputs through our annealing process.
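To make the two-stage procedure concrete, the following PyTorch sketch mirrors Equations 2-3 and Algorithm 1. It is a minimal illustration under our own assumptions (SGD optimizer, a plain data loader, no best-checkpoint selection), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_annealing_kd(student, teacher, loader, tau_max=10,
                       epochs_per_temp=1, finetune_epochs=10, lr=0.1):
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()

    # Stage I: regress onto annealed teacher logits, sharpening them over time
    for T in range(tau_max, 0, -1):
        phi = 1.0 - (T - 1) / tau_max            # Eq. (3): Phi(tau_max) = 1/tau_max, Phi(1) = 1
        for _ in range(epochs_per_temp):
            for x, _ in loader:
                with torch.no_grad():
                    target = teacher(x) * phi    # annealed teacher logits
                loss = F.mse_loss(student(x), target)   # Annealing-KD loss, Eq. (3)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # Algorithm 1 would save the best checkpoint at each temperature here

    # Stage II: fine-tune on hard labels with plain cross entropy (L_CE)
    for _ in range(finetune_epochs):
        for x, y in loader:
            loss = F.cross_entropy(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

A real run would also reload the best Stage-I checkpoint before Stage II, as in Algorithm 1.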

Figure 1: Illustrating Stage I of the Annealing-KD technique. Given a pre-trained teacher network, we can derive the annealed output of the teacher at different temperatures using the annealing function Φ(T). We start training the student from T = τmax and go to T = 1.

3.1 Example

For a better illustration of our proposed method, we designed a simple example to visualize different parts of our Annealing-KD algorithm. In this regard, we defined a simple regression task using a simple 2D function. This function is a linear combination of three sinusoidal functions with different frequencies, f(x) = sin(3πx) + sin(6πx) + sin(9πx). We randomly sample some points from this function to form our dataset (Figure 2-(a)). Next, we fit a simple fully connected neural network with only one hidden layer and the sigmoid activation function to the underlying function of the defined dataset. The teacher model is composed of 100 hidden neurons and is trained with the given dataset. After training, the teacher is able to get very close to the training data (see the green curve in Figure 2-(a)). We plot the annealed output of the teacher function at 10 different temperatures in Figure 2-(b). Then, a student model with 10 hidden neurons is trained once with regular KD (Figure 2-(f)) and once with our Annealing-KD (Figures 2-(c, d, e) depict the student output at temperatures 10, 5, and 1 during the Annealing-KD training). As shown in these figures, Annealing-KD guides the student network gradually until it reaches a good approximation of the underlying function, and it can match the teacher output better than regular KD.

3.2 Rationale Behind Annealing-KD

Inspired by (Mirzadeh et al., 2020), we can leverage VC-dimension theory and visualize the loss landscape to justify why Annealing-KD works better than the original KD.

3.2.1 Theoretical Justification

In VC-dimension theory (Vapnik, 1998), the error of classification can be decomposed as:

    R(f_s) - R(f) \le O\!\left(\frac{|F_s|_c}{N^{\alpha_s}}\right) + \varepsilon_s    (4)

where R(.) is the expected error, f_s ∈ F_s is the learner belonging to the function class F_s, f is the underlying function, |.|_c is some function class capacity measure, O(.) is the estimation error of training the learner, and ε_s is the approximation error of the best estimator function belonging to the class F_s (Mirzadeh et al., 2019). Moreover, N is the number of training samples, and 1/2 ≤ α ≤ 1 is a parameter related to the difficulty of the problem: α is close to 1/2 for more difficult problems (slow learners) and close to 1 for easier problems or fast learners (Lopez-Paz et al., 2015).

In knowledge distillation, we have three main factors: the student (our learner), the teacher, and the underlying function. Based on (Lopez-Paz et al., 2015; Mirzadeh et al., 2019), we can rewrite Equation 4 for knowledge distillation as follows:

    R(f_s) - R(f_t) \le O\!\left(\frac{|F_s|_c}{n^{\alpha_{st}}}\right) + \varepsilon_{st}    (5)

where the student function f_s follows f_t. To define a similar inequality for our Annealing-KD technique, we first need to consider the effect of the temperature parameter on the three main functions in KD. For this purpose, we define f_s^T, f_t^T, and f^T as the annealed versions of the student, teacher, and underlying functions. Furthermore, let R_T(.) be the expected error function with respect to the annealed underlying function at temperature T. Hence, for Annealing-KD we have

    R_T(f_s^T) - R_T(f_t^T) \le O\!\left(\frac{|F_s|_c}{n^{\alpha_{st}^T}}\right) + \varepsilon_{st}^T.    (6)

Note that at T = 1 we have f_t^1 = f_t, f_s^1 = f_s, f^1 = f, and R_1(.) = R(.). Therefore, we can rewrite Equation 6 at T = 1 as:

    R_1(f_s^1) - R_1(f_t^1) \le O\!\left(\frac{|F_s|_c}{n^{\alpha_{st}^1}}\right) + \varepsilon_{st}^1.    (7)

That being said, to justify that our Annealing-KD works better than the original KD, we can compare Equations 7 and 5 and show that the following inequality holds:

    O\!\left(\frac{|F_s|_c}{n^{\alpha_{st}^1}}\right) + \varepsilon_{st}^1 \le O\!\left(\frac{|F_s|_c}{n^{\alpha_{st}}}\right) + \varepsilon_{st}    (8)

Since in Annealing-KD the student network at each temperature is initialized with the student network trained at f_s^{T-1}, the student is much closer to the teacher compared with the original KD method, where the student starts from a random initialization. In other words, in Annealing-KD the student network can learn the annealed teacher at temperature T faster than in the case where it starts from a random initial point.


Figure 3: Visualization of the Annealing-KD loss function in Stage I for the ResNet-8 student during training on the CIFAR-10 dataset at different temperatures.

Therefore, we can conclude that α_{st}^T ≤ α_{st}. This property also holds for the last step of Annealing-KD, where T = 1; it means we have α_{st}^1 ≤ α_{st}. Furthermore, bear in mind that since the approximation error depends on the capacity of the learner, and in Annealing-KD we do not change the structure of the student, we expect to have ε_{st}^T = ε_{st}. Therefore, based on these two pieces of evidence (α_{st}^T ≤ α_{st} and ε_{st}^T = ε_{st}), we can conclude that Equation 8 holds.

3.2.2 Empirical Justification

Because of the non-linear nature of neural networks, the loss functions of these models are non-convex. This property might prevent a learner from generalizing well. There is a common belief in the community that this phenomenon is harsher for sharp loss functions than for flat loss functions (Chaudhari et al., 2019; Hochreiter and Schmidhuber, 1997). Although there are some arguments around this belief (Li et al., 2018), for the case of knowledge distillation it seems that flatter loss functions are related to higher accuracy (Mirzadeh et al., 2019; Zhang et al., 2018; Hinton et al., 2015). One of the advantages of annealing the teacher function during training is reducing the sharpness of the annealing loss function in the early steps of Stage I. In other words, the sharpness of the loss function in Annealing-KD changes dynamically: in the early steps of annealing, when the temperature is high, the loss function is flatter, which helps the student learn the teacher network's behaviour faster and more easily.

Figure 2: (a) Data samples and trained teacher. (b) Annealed teacher at different temperatures. (c) Student after matching the annealed teacher at T = 10. (d) Student after matching the annealed teacher at T = 5. (e) Student after matching at T = 1. (f) Student trained without KD.

In order to compare the effect of different temperatures, the loss landscape visualization method of (Li et al., 2018) is used to plot the loss behaviour of the CIFAR-10 experiment with the ResNet-8 student in Figure 3. As shown there, by decreasing the temperature during training, the sharpness of the loss function increases. So the student network can avoid many bad local minima in the early stages of the algorithm, when the temperature is high. Then, in the final stages of the algorithm, when the loss function is sharper, the network starts from a much better initialization.
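As a rough illustration of this kind of probe, the sketch below evaluates the Stage-I loss along a single random, norm-matched direction in the student's weight space. It is a simplified, one-dimensional variant of the filter-normalized visualization of Li et al. (2018); the function name, normalization, and defaults are our assumptions, not the paper's exact plotting code.

```python
import copy
import torch
import torch.nn.functional as F

def loss_slice(student, teacher, x, phi, steps=21, radius=1.0):
    """Annealed Stage-I loss evaluated at student_weights + a * d for a
    random direction d, with a in [-radius, radius]."""
    teacher.eval()
    student.eval()
    base = copy.deepcopy(student)                      # frozen reference weights
    directions = []
    for p in base.parameters():
        d = torch.randn_like(p)
        d *= p.norm() / (d.norm() + 1e-12)             # match the scale of each weight tensor
        directions.append(d)

    with torch.no_grad():
        target = teacher(x) * phi                      # annealed teacher logits for this batch

    alphas = torch.linspace(-radius, radius, steps)
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, p0, d in zip(student.parameters(), base.parameters(), directions):
                p.copy_(p0 + a * d)
            losses.append(F.mse_loss(student(x), target).item())
        for p, p0 in zip(student.parameters(), base.parameters()):
            p.copy_(p0)                                # restore the original weights
    return alphas, torch.tensor(losses)
```

Plotting the returned curve for a small phi (high temperature) against phi = 1 (T = 1) gives a crude picture of how the sharpness of the Stage-I objective changes, in the spirit of Figure 3.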

4 Experiments

In this section, we describe the experimental evaluation of our proposed Annealing KD method. We evaluate our technique on both image classification and natural language inference tasks. In all of our experiments, we compare the Annealing KD results with TAKD, standard KD, and training the student without KD.

4.1 Datasets

For image classification, we assess Annealing-KD on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009), which are image datasets containing 32 × 32 color images with 10 and 100 classes respectively. For the natural language inference task, we employ the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), which is a collection of nine different tasks for training, evaluating, and analyzing natural language understanding models. GLUE consists of Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2017), Quora Question Pairs (QQP) (Chen et al., 2018), Question Natural Language Inference (QNLI) (Rajpurkar et al., 2016), Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017), Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009), and Winograd NLI (WNLI) (Levesque et al., 2012).

4.2 Experimental Setup for Image Classification Tasks

For the image classification experiments, we used the CIFAR-10 and CIFAR-100 datasets with the same experimental setup as the TAKD method (Mirzadeh et al., 2020). In these experiments, we used ResNet and plain CNN networks as the teacher, student, and also the teacher assistant for the TAKD baseline. For the ResNet experiments, we used ResNet-110 as the teacher and ResNet-8 as the student. For the plain CNN experiments, we used a CNN with 10 layers as the teacher and 2 layers as the student, following TAKD. Also, for the TAKD baseline, we used ResNet-20 and a CNN with 4 layers as the teacher assistant. Tables 1 and 2 compare the Annealing KD performance with the other baselines on the CIFAR-10 and CIFAR-100 datasets respectively. For the ResNet experiments in both Tables 1 and 2, the teacher ResNet-110 is trained from scratch and a ResNet-20 TA is trained by the teacher using KD. We then train a ResNet-8 student using the different techniques and compare their performance against our Annealing KD method. In this regard, we evaluate the performance of training the student from scratch, training with the large ResNet-110 teacher using KD, training with the TA as the teacher, and using our Annealing-KD approach. The results of this experiment with ResNet show that our Annealing-KD outperforms all other baselines, and TAKD is the second-best performing student, without significant distinction compared to KD. More details about the training hyper-parameters are given in Appendix A.

Table 1: Comparing the test accuracy of Annealing KD, TAKD, regular KD, and the student without teacher on the CIFAR-10 dataset with both ResNet and CNN models

Network | Model Type | Training method | Accuracy
ResNet | Teacher(110) | from scratch | 93.8
ResNet | TA(20) | KD | 92.39
ResNet | Student(8) | from scratch | 88.44
ResNet | Student(8) | KD | 88.45
ResNet | Student(8) | TAKD | 88.47
ResNet | Student(8) | Annealing KD (ours) | 89.44
CNN | Teacher(10) | from scratch | 90.1
CNN | TA(4) | KD | 82.39
CNN | Student(2) | from scratch | 72.75
CNN | Student(2) | KD | 72.43
CNN | Student(2) | TAKD | 72.62
CNN | Student(2) | Annealing KD (ours) | 73.17

Table 2: Comparing the test accuracy of Annealing KD, TAKD, regular KD, and the student without teacher on the CIFAR-100 dataset with both ResNet and CNN models

Network | Model Type | Training method | Accuracy
ResNet | Teacher(110) | from scratch | 71.92
ResNet | TA(20) | KD | 67.6
ResNet | Student(8) | from scratch | 61.37
ResNet | Student(8) | KD | 61.41
ResNet | Student(8) | TAKD | 61.82
ResNet | Student(8) | Annealing KD (ours) | 63.1
CNN | Teacher(10) | from scratch | 64.89
CNN | TA(4) | KD | 60.73
CNN | Student(2) | from scratch | 51.35
CNN | Student(2) | KD | 51.62
CNN | Student(2) | TAKD | 51.85
CNN | Student(2) | Annealing KD (ours) | 53.35

4.3 Experimental Setup for GLUE Tasks

For this set of experiments, we use the GLUE benchmark, which consists of 9 natural language understanding tasks. In the first experiment (Table 3), we use RoBERTa-large (24 layers) as the teacher, DistilRoBERTa (6 layers) as the student, and RoBERTa-base (12 layers) as the teacher assistant for the TAKD baseline. For Annealing KD, we use a maximum temperature of 7, a learning rate of 2e-5, and train for 14 epochs in phase 1 and 6 epochs in phase 2. Table 3 compares Annealing KD and the other baselines on the dev set of the GLUE tasks. We also compare the performance of these methods on the test set based on the GLUE benchmark leaderboard results in Table 4.
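For reference, the Stage-I schedule implied by this setup (τmax = 7 with two epochs per temperature, i.e., 14 epochs) can be written out explicitly. This small helper is our own illustration of the Φ values from Equation 3 per epoch, not code from the paper.

```python
def stage1_schedule(tau_max=7, epochs_per_temp=2):
    """(temperature, Phi) for each Stage-I epoch, with Phi from Eq. (3)."""
    schedule = []
    for T in range(tau_max, 0, -1):
        phi = 1.0 - (T - 1) / tau_max
        schedule += [(T, round(phi, 3))] * epochs_per_temp
    return schedule

# stage1_schedule() -> [(7, 0.143), (7, 0.143), (6, 0.286), ..., (1, 1.0), (1, 1.0)]
```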

Table 3: DistilRoBERTa results for Annealing KD on the dev set. F1 scores are reported for MRPC, Pearson correlations for STS-B, and accuracy scores for all other tasks.

KD Method | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI (m/mm) | WNLI | Score
Teacher | 68.1 | 86.3 | 91.9 | 92.3 | 96.4 | 94.6 | 91.5 | 90.22/89.87 | 56.33 | 85.29
From scratch | 59.3 | 67.9 | 88.6 | 88.5 | 92.5 | 90.8 | 90.9 | 84/84 | 52.1 | 79.3
Vanilla KD | 60.97 | 71.11 | 90.2 | 88.86 | 92.54 | 91.37 | 91.64 | 84.18/84.11 | 56.33 | 80.8
TAKD | 61.15 | 71.84 | 89.91 | 88.94 | 92.54 | 91.32 | 91.7 | 83.89/84.18 | 56.33 | 80.85
Annealing KD | 61.67 | 73.64 | 90.6 | 89.01 | 93.11 | 91.64 | 91.5 | 85.34/84.6 | 56.33 | 81.42

Table 4: Performance of DistilRoBERTa trained by Annealing KD on the GLUE leaderboard compared with Vanilla KD and TAKD. We applied the standard tricks to all 3 methods and fine-tuned RTE, MRPC, and STS-B from the trained MNLI student model.

KD Method | CoLA | MRPC | STS-B | SST-2 | MNLI-m | MNLI-mm | QNLI | QQP | RTE | WNLI | Score
Vanilla KD | 54.3 | 86/80.8 | 85.7/84.9 | 93.1 | 83.6 | 82.9 | 90.8 | 71.9/89.5 | 74.1 | 65.1 | 78.9
TAKD | 53.2 | 86.7/82.7 | 85.6/84.4 | 93.2 | 83.8 | 83.2 | 91 | 72/89.4 | 74.2 | 65.1 | 79
Annealing KD | 54 | 88.0/83.9 | 87.0/86.6 | 93.6 | 83.8 | 83.9 | 90.8 | 72.6/89.7 | 73.7 | 65.1 | 79.5

Table 5: BERT-Small results for Annealing KD on the dev set. F1 scores are reported for MRPC, Pearson correlations for STS-B, and accuracy scores for all other tasks.

KD Method | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI (m/mm) | WNLI | Score
Teacher | 65.8 | 71.48 | 89.38 | 89.2 | 92.77 | 92.82 | 91.45 | 86.3/86.4 | 60.56 | 82.19
Vanilla KD | 33.5 | 57 | 86 | 72.3 | 88.76 | 83.15 | 87 | 72.62/73.19 | 54.92 | 70.58
TAKD | 34.24 | 59.56 | 85.23 | 71.1 | 89.1 | 82.62 | 87 | 72.32/72.45 | 54.92 | 70.76
Annealing KD | 35.98 | 61 | 86.2 | 74.54 | 89.44 | 83.14 | 86.5 | 73.85/74.84 | 54.92 | 71.68

In the second experiment (Table 5), we use BERT-large (24 layers) as the teacher, BERT-Small (4 layers) as the student, and BERT-base (12 layers) as the teacher assistant of TAKD. We use a maximum temperature of 7 for MRPC, SST-2, QNLI, and WNLI, and 14 for all other tasks. The number of epochs in phase 1 is twice the maximum temperature, and 6 in phase 2. We use a learning rate of 2e-5 for all tasks except RTE and MRPC, which use 4e-5. Table 5 compares the performance of Annealing KD and the other baselines on the dev set for the BERT-Small experiments. For more details regarding the other hyper-parameters, refer to the appendix. We also perform ablations on the choice of loss function in phase 1 and on different maximum temperature values, both of which can be found in the appendix.

4.4 GLUE Results

We present our results in Tables 3, 4, and 5. We see that Annealing KD consistently outperforms the other techniques both on the dev set as well as on the GLUE leaderboard. Furthermore, in Table 5, when we reduce the size of the student to a 4-layer model (BERT-Small), we notice almost twice as big a gap in the average score over Vanilla KD when compared with DistilRoBERTa (Table 3). We can also observe TAKD improving slightly over Vanilla KD, with the improvement being more significant in the case of the smaller student (BERT-Small).

5 Discussion

In the image classification experiments, the improvement gap between the Annealing KD results and the other baselines is larger in the CIFAR-100 experiments than in the CIFAR-10 ones. We can observe similar conditions for the NLP experiments between the BERT-Small and DistilRoBERTa students (the performance gap of BERT-Small is larger). In both of these cases, the problem for the student was more difficult: the CIFAR-100 dataset is more complex than the CIFAR-10 dataset, so the teacher has learned a more complex function that should be transferred to the student, while in the NLP experiments the tasks are the same but the BERT-Small student has a smaller capacity compared with DistilRoBERTa, so the problem is more difficult for BERT-Small. From this observation, we can conclude that whenever the gap between the teacher and the student is larger, Annealing KD performs better than the other baselines at leveraging the knowledge acquired by the teacher to train the student.

6 Conclusion and Future Work

In this work, we discussed that the difference between the capacity of the teacher and student models in knowledge distillation may hamper its performance. On the other hand, in most cases, larger neural networks can be trained better and get more accurate results. If we consider that better teachers can train better students, then larger teachers with better accuracy would be more favourable for knowledge distillation training. In this paper, we proposed an improved knowledge distillation method called Annealing-KD to alleviate this problem and leverage the knowledge acquired by more complex teachers to guide the small student models better during their training. This is achieved by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently. Our Annealing-KD technique was based on a gradual transition over annealed soft-targets generated by the teacher at different temperatures in an iterative process; therefore, the student was trained to follow the annealed teacher output in a step-by-step manner.

References

Bentivogli, L., Clark, P., Dagan, I., and Giampiccolo, D. (2009). The fifth PASCAL recognizing textual entailment challenge. In TAC.

Bie, A., Venkitesh, B., Monteiro, J., Haidar, M., Rezagholizadeh, M., et al. (2019). Fully quantizing a simplified transformer for end-to-end speech recognition. arXiv preprint arXiv:1911.03604.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018.

Chen, Z., Zhang, H., Zhang, X., and Zhao, L. (2018). Quora question pairs.

Clark, K., Luong, M.-T., Khandelwal, U., Manning, C. D., and Le, Q. V. (2019). BAM! Born-again multi-task networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5931–5937, Florence, Italy. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dolan, W. B. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018). Born again neural networks. arXiv preprint arXiv:1805.04770.

Gou, J., Yu, B., Maybank, S. J., and Tao, D. (2020). Knowledge distillation: A survey. arXiv preprint arXiv:2006.05525.

He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al. (2019). Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE.

Hinton, G. (2012). Neural networks for machine learning, Coursera. URL: http://coursera.org/course/neuralnets.

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713.

Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. (2019). TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Levesque, H., Davis, E., and Morgenstern, L. (2012). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399.

Lioutas, V., Rashid, A., Kumar, K., Haidar, M. A., and Rezagholizadeh, M. (2019). Distilled embedding: non-linear embedding factorization using knowledge distillation. arXiv preprint arXiv:1910.06720.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lopez-Paz, D., Bottou, L., Schölkopf, B., and Vapnik, V. (2015). Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643.

Mirzadeh, S., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., and Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. AAAI 2020, abs/1902.03393.

Mirzadeh, S.-I., Farajtabar, M., Li, A., and Ghasemzadeh, H. (2019). Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393.

Prato, G., Charlaix, E., and Rezagholizadeh, M. (2019). Fully quantized transformer for improved translation. arXiv preprint arXiv:1910.10485.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Sun, S., Cheng, Y., Gan, Z., and Liu, J. (2019). Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.

Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., and Zhou, D. (2020). MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984.

Tjandra, A., Sakti, S., and Nakamura, S. (2018). Tensor decomposition for compressing recurrent neural network. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Vapnik, V. (1998). Statistical learning theory. John Wiley and Sons.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Warstadt, A., Singh, A., and Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Williams, A., Nangia, N., and Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Wong, A., Famuori, M., Shafiee, M. J., Li, F., Chwyl, B., and Chung, J. (2019). YOLO Nano: a highly compact You Only Look Once convolutional neural network for object detection. arXiv preprint arXiv:1910.01271.

Zhang, Y., Xiang, T., Hospedales, T. M., and Lu, H. (2018). Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328.

Appendices

A Experimental parameters of the image classification tasks

In this section, we include more detail of our experimental settings of Section 4.2 in the paper. For the baseline experiments, we used the same experimental setup as (Mirzadeh et al., 2019). We performed two series of experiments based on ResNet and plain CNN neural networks on the CIFAR-10 and CIFAR-100 datasets. Table 6 lists the hyper-parameters used in these experiments (BS = batch size, EP1 = number of epochs in phase 1 (for the baselines, this is the number of training epochs), EP2 = number of epochs in phase 2, LR = learning rate, MO = momentum, WD = weight decay, τmax = maximum temperature).

Table 6: Hyper-parameters of the CIFAR-10 and CIFAR-100 experiments

Network | Model Type | Training method | BS | EP1 | EP2 | LR | MO | WD | τmax
ResNet | Teacher(110) | from scratch | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
ResNet | TA(20) | KD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
ResNet | Student(8) | from scratch | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
ResNet | Student(8) | KD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | 1
ResNet | Student(8) | TAKD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | 1
ResNet | Student(8) | Annealing KD (ours) | 128 | 160 | 160 | 0.1 | 0.9 | 10^-4 | 10
CNN | Teacher(10) | from scratch | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
CNN | TA(4) | KD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
CNN | Student(2) | from scratch | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | N/A
CNN | Student(2) | KD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | 1
CNN | Student(2) | TAKD | 128 | 160 | N/A | 0.1 | 0.9 | 10^-4 | 1
CNN | Student(2) | Annealing KD (ours) | 128 | 160 | 160 | 0.1 | 0.9 | 10^-4 | 10

B BERT Experiments

In these experiments, RoBERTa-large (24 layers) and DistilRoBERTa (6 layers) are used as the teacher and student models respectively. Also, RoBERTa-base (12 layers) is used as the teacher assistant for the TAKD baseline. For Annealing KD, we use a maximum temperature of 7 and a learning rate of 2e-5 for all the tasks. We trained the student model for 14 epochs in phase 1 and 6 epochs in phase 2. Table 8 lists the details of the hyper-parameters of these experiments, and Table 11 lists the hyper-parameter values of the BERT-Small experiments in detail. We also did two ablation studies. In the first one, we fine-tuned the maximum temperature in Annealing KD and checked the performance improvement compared with using the general value of 7. As illustrated in Table 9, we can get more improvement by selecting the maximum temperature parameter more carefully. The second ablation compares the effect of the mean square error and KL-divergence loss functions on the final results when they are used as the loss function of the first phase. Table 10 shows the results of this ablation.

Table 7: Common hyper-parameters for the DistilRoBERTa and BERT-Small models on GLUE tasks

Hyper-parameter | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI | WNLI
Batch Size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32
Max Seq. Length | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128
Vanilla KD Alpha | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5
Gradient Clipping | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Table 8: Model specific hyper-parameters for DistilRoBERTa on GLUE tasks

Hyper-parameter | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI | WNLI
Learning Rate | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5
Phase 1 epochs | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14
Phase 2 epochs | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4
τmax | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7
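A sketch of how such a teacher-student pair can be instantiated with the Hugging Face transformers library is shown below. The public roberta-large and distilroberta-base checkpoints are our assumption for the starting points; the paper's actual training scripts are not reproduced here.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

num_labels = 2  # e.g. a binary GLUE task such as SST-2
tokenizer = AutoTokenizer.from_pretrained("roberta-large")  # RoBERTa models share the same vocabulary

teacher = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=num_labels)        # 24-layer teacher, fine-tuned on the task first
student = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=num_labels)   # 6-layer student

# Stage I then regresses student(**batch).logits onto Phi(T) * teacher(**batch).logits,
# as in Eq. (3), before the Stage-II cross-entropy fine-tuning.
```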

Table 9: Ablation on DistilRoBERTa Annealing KD with temperature tuning

KD Method | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI (m/mm) | WNLI | Avg
Annealing KD | 61.67 | 73.64 | 90.6 | 89.01 | 93.11 | 91.64 | 91.5 | 85.34/84.6 | 56.33 | 81.42
+ temp tuning | 61.67 | 73.64 | 91.99 | 89.26 | 93.34 | 92 | 91.72 | 85.14/85.22 | 56.33 | 81.67
(max temperature) | 7 | 7 | 8 | 14 | 14 | 11 | 14 | 14 | 7 | -

Table 10: Ablation on DistilRoBERTa Annealing KD with different loss functions

KD Method and Loss | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI (m/mm) | WNLI | Avg
Annealing KD, MSE | 61.67 | 73.64 | 90.6 | 89.01 | 93.11 | 91.64 | 91.5 | 85.34/84.6 | 56.33 | 81.42
Annealing KD, KL-div | 62.56 | 70.75 | 90.84 | 89.01 | 93 | 91.32 | 91.42 | 85/84.75 | 56.33 | 81.13

Table 11: Model specific Hyper-parameters for BERT-Small on GLUE tasks

Hyper-parameter | CoLA | RTE | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI | WNLI
Learning Rate | 2e-5 | 4e-5 | 4e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 2e-5
Phase 1 epochs | 28 | 28 | 14 | 28 | 14 | 14 | 28 | 28 | 14
Phase 2 epochs | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
τmax | 14 | 14 | 7 | 14 | 7 | 7 | 14 | 14 | 7
