Arxiv:1902.03393V2 [Cs.LG] 17 Dec 2019 Teacher, and the Student Is Then Only Distilled from the Tas

Improved Knowledge Distillation via Teacher Assistant Seyed Iman Mirzadeh∗ 1, Mehrdad Farajtabar∗2, Ang Li 2, Nir Levine 2, Akihiro Matsukaway 3, Hassan Ghasemzadeh 1 1 Washington State University, WA, USA 2 DeepMind, CA, USA 3 D.E. Shaw, NY, USA 1 fseyediman.mirzadeh,[email protected] 2 ffarajtabar,anglili,[email protected] 3 [email protected] Abstract Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a Figure 1: TA fills the gap between student & teacher fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To al- by adding a term to the usual classification loss that encour- leviate this shortcoming, we introduce multi-step knowledge ages the student to mimic the teacher’s behavior. distillation, which employs an intermediate-sized network However, we argue that knowledge distillation is not al- (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant ways effective, especially when the gap (in size) between size and extend the framework to multi-step distillation. The- teacher and student is large. To illustrate, we ran experi- oretical analysis and extensive experiments on CIFAR-10,100 ments that show surprisingly a student model distilled from a and ImageNet datasets and on CNN and ResNet architectures teacher with more parameters(and better accuracy) performs substantiate the effectiveness of our proposed approach. worse than the same one distilled from a smaller teacher with a smaller capacity. Such scenarios seem to impact the efficacy of knowledge distillation where one is given a small Introduction student network and a pre-trained large one as a teacher, both Deep neural networks have achieved state of the art results in fixed and (wrongly) presumed to form a perfect transfer pair. a variety of applications such as computer vision (Huang et Inspired by this observation, we propose a new distillation al. 2017; Hu, Shen, and Sun 2018), speech recognition (Han framework called Teacher Assistant Knowledge Distillation et al. 2017) and natural language processing (Devlin et al. (TAKD), which introduces intermediate models as teacher 2018). Although it is established that introducing more lay- assistants (TAs) between the teacher and the student to fill ers and more parameters often improves the accuracy of a in their gap (Figure 1). TA models are distilled from the model, big models are computationally too expensive to be arXiv:1902.03393v2 [cs.LG] 17 Dec 2019 teacher, and the student is then only distilled from the TAs. deployed on devices with limited computation power such Our contributions are: (1) We show that the size (capac- as mobile phones and embedded sensors. Model compres- ity) gap between teacher and student is important. To the sion techniques have emerged to address such issues, e.g., best of our knowledge, we are the first to study this gap and parameter pruning and sharing (Han, Mao, and Dally 2016), verify that the distillation performance is not at its top with low-rank factorization (Tai et al. 2015) and knowledge distil- the largest teacher; (2) We propose a teacher assistant based lation (Bucila, Caruana, and Niculescu-Mizil 2006; Hinton, knowledge distillation approach to improve the accuracy of Vinyals, and Dean 2015). Among these approaches, knowl- student network in the case of extreme compression; (3) We edge distillation has proven a promising way to obtain a extend this framework to include a chain of multiple TAs small model that retains the accuracy of a large one. It works from teacher to student to further improve the knowledge ∗Equal Contribution transfer and provided some insights to find the best one; (4) yWork Done at DeepMind Through extensive empirical evaluations and a theoretical Copyright c 2020, Association for the Advancement of Artificial justification, we show that introducing intermediary TA net- Intelligence (www.aaai.org). All rights reserved. works improves the distillation performance. Related Work lation framework called co-distillation and argue that distil- We discuss in this section related literature in knowledge dis- lation can even work when the teacher and student are made tillation and neural network compression. by the same network architecture. The idea is to train multi- Model Compression. Since our goal is to train a small, ple models in parallel and use distillation loss when they are yet accurate network, this work is related to model com- not converged, in which case the model training is faster and pression. There has been an interesting line of research that the model quality is also improved. compresses a large network by reducing the connections However, the effectiveness of distilling a large model to a based on weight magnitudes (Han, Mao, and Dally 2016; small model has not yet been well studied. Our work differs Li et al. 2016) or importance scores (Yu et al. 2018). The re- from existing approaches in that we study how to improve duced network is fine-tuned on the same dataset to retain its the student performance given fixed student and teacher accuracy. Another line of research focuses on distilling the network sizes, and introduces intermediate networks with original (large) network to a smaller network (Polino, Pas- a moderate capacity to improve distillation performance. canu, and Alistarh 2018; Wang et al. 2018a), in which case Moreover, our work can be seen as a complement that can the smaller network is more flexible in its architecture design be combined with them and improve their performance. does not have to be a sub-graph of the original network. Distillation Theory. Despite its huge popularity, there Knowledge Distillation. Originally proposed by Bu- are few systematic and theoretical studies on how and why cila, Caruana, and Niculescu-Mizil (2006) and popularized knowledge distillation improves neural network training. by Hinton, Vinyals, and Dean (2015) knowledge distillation The so-called dark knowledge transferred in the process compress the knowledge of a large and computational ex- helps the student learn the finer structure of teacher network. pensive model (often an ensemble of neural networks) to Hinton, Vinyals, and Dean (2015) argues that the success a single computational efficient neural network. The idea of knowledge distillation is attributed to the logit distribution of knowledge distillation is to train the small model, the of the incorrect outputs, which provides information on the student, on a transfer set with soft targets provided by similarity between output categories. Furlanello et al. (2018) the large model, the teacher. Since then, knowledge dis- investigated the success of knowledge distillation via gra- tillation has been widely adopted in a variety of learning dients of the loss where the soft-target part acts as an im- tasks (Yim et al. 2017; Yu et al. 2017; Schmitt et al. 2018; portance sampling weight based on the teachers confidence Chen et al. 2017). Adversarial methods also have been in its maximum value. Zhang et al. (2017) analyzed knowl- utilized for modeling knowledge transfer between teacher edge distillation from the posterior entropy viewpoint claim- and student (Heo et al. 2018; Xu, Hsu, and Huang 2018; ing that soft-targets bring robustness by regularizing a much Wang et al. 2018b; Wang et al. 2018c). more informed choice of alternatives than blind entropy reg- There have been works studying variants of model distil- ularization. Last but not least, Lopez-Paz et al. (2015) stud- lation that involve multiple networks learning at the same ied the effectiveness of knowledge distillation from the per- time. Romero et al. (2014) proposed to transfer the knowl- spective of learning theory (Vapnik 1998) by studying the edge using not only the logit layer but earlier ones too. estimation error in empirical risk minimization framework. To cope with the difference in width, they suggested a re- In this paper, we take this last approach to support our gressor to connect teacher and student’s intermediate layers. claim on the effectiveness of introducing an intermediate Unfortunately, there is not a principled way to do this. To network between student and teacher. Moreover, we empiri- solve this issue, Yim et al.; Yu et al. (2017; 2017) used a cally analyze it via visualizing the loss function. shared representation of layers, however, it’s not straight- forward to choose the appropriate layer to be matched. Assistant based Knowledge Distillation Czarnecki et al. (2017) minimized the difference between Background and Notations teacher and student derivatives of the loss combined with the divergence from teacher predictions while Tarvainen and The idea behind knowledge distillation is to have the student Valpola (2017) uses averaging model weights instead of tar- network (S) be trained not only via the information provided get predictions. Urban et al. (2017) trained a network con- by true labels but also by observing how the teacher network sisting of an ensemble of 16 convolutional neural networks (T) represents and works with the data. The teacher network and compresses the learned function into shallow multilayer is sometimes deeper and wider (Hinton, Vinyals, and Dean perceptrons. To improve the student performance, Sau and 2015), of similar size (Anil et al. 2018; Zhang et al. 2017), Balasubramanian (2016) injected noise into teacher logits or shallower but wider (Romero et al.

Arxiv:1902.03393V2 [Cs.LG] 17 Dec 2019 Teacher, and the Student Is Then Only Distilled from the Tas

Layer-Level Knowledge Distillation for Deep Neural Network Learning

Similarity Transfer for Knowledge Distillation Haoran Zhao, Kun Gong, Xin Sun, Member, IEEE, Junyu Dong, Member, IEEE and Hui Yu, Senior Member, IEEE

Understanding and Improving Knowledge Distillation

Optimising Hardware Accelerated Neural Networks with Quantisation and a Knowledge Distillation Evolutionary Algorithm

Memory-Replay Knowledge Distillation

Knowledge Distillation: a Survey 3

Towards Effective Utilization of Pre-Trained Language Models

Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Knowledge Distillation from Internal Representations

Sequence-Level Knowledge Distillation

Improved Knowledge Distillation Via Teacher Assistant