Knowledge Distillation: A Survey

Total Pages: 16

File Type: PDF, Size: 1020 KB

Knowledge Distillation: A Survey

Jianping Gou 1 · Baosheng Yu 1 · Stephen J. Maybank 2 · Dacheng Tao 1

arXiv:2006.05525v7 [cs.LG] 20 May 2021

Jianping Gou, E-mail: [email protected]
Baosheng Yu, E-mail: [email protected]
Stephen J. Maybank, E-mail: [email protected]
Dacheng Tao, E-mail: [email protected]
1 UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.
2 Department of Computer Science and Information Systems, Birkbeck College, University of London, UK.

Abstract. In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also because of the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model, and it has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.

Keywords: Deep neural networks · Model compression · Knowledge distillation · Knowledge transfer · Teacher-student architecture

1 Introduction

During the last few years, deep learning has been the basis of many successes in artificial intelligence, including a variety of applications in computer vision (Krizhevsky et al., 2012), reinforcement learning (Silver et al., 2016; Ashok et al., 2018; Lai et al., 2020), and natural language processing (Devlin et al., 2019). With the help of many recent techniques, including residual connections (He et al., 2016, 2020b) and batch normalization (Ioffe and Szegedy, 2015), it is easy to train very deep models with thousands of layers on powerful GPU or TPU clusters. For example, it takes less than ten minutes to train a ResNet model on a popular image recognition benchmark with millions of images (Deng et al., 2009; Sun et al., 2019), and no more than one and a half hours to train a powerful BERT model for language understanding (Devlin et al., 2019; You et al., 2019). Large-scale deep models have achieved overwhelming success; however, their huge computational complexity and massive storage requirements make it a great challenge to deploy them in real-time applications, especially on devices with limited resources, such as video surveillance and autonomous driving cars.

To develop efficient deep models, recent works usually focus on 1) efficient building blocks for deep models, including depthwise separable convolution, as in MobileNets (Howard et al., 2017; Sandler et al., 2018) and ShuffleNets (Zhang et al., 2018a; Ma et al., 2018); and 2) model compression and acceleration techniques, in the following categories (Cheng et al., 2018).
• Parameter pruning and sharing: These methods focus on removing inessential parameters from deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016), model binarization (Courbariaux et al., 2015), structural matrices (Sindhwani et al., 2015) and parameter sharing (Han et al., 2015; Wang et al., 2019f).
• Low-rank factorization: These methods identify redundant parameters of deep neural networks by employing matrix and tensor decomposition (Yu et al., 2017; Denton et al., 2014).
• Transferred compact convolutional filters: These methods remove inessential parameters by transferring or compressing the convolutional filters (Zhai et al., 2016).
• Knowledge distillation (KD): These methods distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015).

A comprehensive review of model compression and acceleration is outside the scope of this paper. The focus of this paper is knowledge distillation, which has received increasing attention from the research community in recent years. Large deep neural networks have achieved remarkable success, especially in real-world scenarios with large-scale data, because over-parameterization improves the generalization performance when new data is considered (Zhang et al., 2018; Brutzkus and Globerson, 2019; Allen-Zhu et al., 2019; Arora et al., 2018; Tu et al., 2020). However, deploying deep models on mobile devices and embedded systems is a great challenge, due to the limited computational capacity and memory of these devices. To address this issue, Bucilua et al. (2006) first proposed model compression to transfer the information from a large model, or an ensemble of models, into a small model without a significant drop in accuracy. Knowledge transfer between a fully supervised teacher model and a student model using unlabeled data was also introduced for semi-supervised learning (Urner et al., 2011). The learning of a small model from a large model was later formally popularized as knowledge distillation (Hinton et al., 2015). In knowledge distillation, a small student model is generally supervised by a large teacher model (Bucilua et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015; Urban et al., 2017). The main idea is that the student model mimics the teacher model in order to obtain competitive or even superior performance. The key problem is how to transfer the knowledge from a large teacher model to a small student model. Basically, a knowledge distillation system is composed of three key components: knowledge, the distillation algorithm, and the teacher-student architecture. A general teacher-student framework for knowledge distillation is shown in Fig. 1.

[Fig. 1: The generic teacher-student framework for knowledge distillation: knowledge is transferred (distilled) from the teacher model to the student model over the training data.]
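The response-based distillation this framework describes (Hinton et al., 2015) can be written down compactly: the student is trained on a blend of the usual cross-entropy and a KL term toward the teacher's temperature-softened outputs. The sketch below is illustrative rather than code from the survey; the temperature T = 4 and weight alpha = 0.9 are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target KD (Hinton et al., 2015): KL divergence between the
    temperature-softened teacher and student distributions, plus hard-label CE."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps the soft-target gradients on the same scale as the CE term.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits: a batch of 8 examples and 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

A temperature above 1 flattens the teacher's distribution, so the relative probabilities assigned to wrong classes (the "dark knowledge") become visible to the student.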
Despite its great success in practice, there are not many works on either the theoretical or the empirical understanding of knowledge distillation (Urner et al., 2011; Cheng et al., 2020; Phuong and Lampert, 2019a; Cho and Hariharan, 2019). Specifically, Urner et al. (2011) proved that the knowledge transfer from a teacher model to a student model using unlabeled data is PAC learnable. To understand the working mechanisms of knowledge distillation, Phuong and Lampert obtained a theoretical justification in the form of a generalization bound with fast convergence for distilled student networks in the scenario of deep linear classifiers (Phuong and Lampert, 2019a). This justification answers what the student learns and how fast it learns, and reveals the factors determining the success of distillation: successful distillation relies on the data geometry, the optimization bias of the distillation objective, and the strong monotonicity of the student classifier. Cheng et al. quantified the extraction of visual concepts from the intermediate layers of a deep neural network to explain knowledge distillation (Cheng et al., 2020). Ji and Zhu theoretically explained knowledge distillation on a wide neural network from the perspectives of risk bound, data efficiency and imperfect teacher (Ji and Zhu, 2020). Cho and Hariharan empirically analyzed the efficacy of knowledge distillation in detail (Cho and Hariharan, 2019); their results show that a larger model may not be a better teacher because of the model capacity gap (Mirzadeh et al., 2020).

[Fig. 2: The schematic structure of knowledge distillation and the relationship between adjacent sections: knowledge types (Sec. 2: response-based, feature-based, relation-based), distillation schemes (Sec. 3: offline, online, self-distillation), teacher-student architecture (Sec. 4), distillation algorithms (Sec. 5: adversarial, multi-teacher, cross-modal, graph-based, attention-based, data-free, quantized, lifelong, NAS-based), performance comparison (Sec. 6), applications (Sec. 7: visual recognition, NLP, speech recognition, others), and challenges and future directions (Sec. 8). Note that 'Section' is abbreviated as 'Sec.' in the figure.]

Motivated by this, recent knowledge distillation methods have been extended to teacher-student learning (Hinton et al., 2015), mutual learning (Zhang et al., 2018b), assistant teaching (Mirzadeh et al., 2020), lifelong learning (Zhai et al., 2019), and self-learning (Yuan et al., 2020). Most extensions of knowledge distillation concentrate on compressing deep neural networks; the resulting lightweight student networks can be easily deployed in applications such as visual recognition, speech recognition, and natural language processing (NLP). Furthermore, the knowledge transfer from one model to another in knowledge distillation can be extended to other tasks, such as adversarial attacks (Papernot et al., 2016), data augmentation (Lee et al., 2019a; Gordon and Duh, 2019), and data privacy and security (Wang et al., 2019a). Motivated by knowledge distillation for model compression, the idea of knowledge transfer has been further applied in
Recommended publications
  • Layer-Level Knowledge Distillation for Deep Neural Network Learning
    Layer-Level Knowledge Distillation for Deep Neural Network Learning. Hao-Ting Li, Shih-Chieh Lin, Cheng-Yeh Chen and Chen-Kuo Chiang *. Advanced Institute of Manufacturing with High-tech Innovations, Center for Innovative Research on Aging Society (CIRAS) and Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 62102, Taiwan; [email protected] (H.-T.L.); [email protected] (S.-C.L.); [email protected] (C.-Y.C.). * Correspondence: [email protected]. Received: 1 April 2019; Accepted: 9 May 2019; Published: 14 May 2019.
    Abstract: Motivated by the recently developed distillation approaches that aim to obtain small and fast-to-execute models, in this paper a novel Layer Selectivity Learning (LSL) framework is proposed for learning deep models. We firstly use an asymmetric dual-model learning framework, called Auxiliary Structure Learning (ASL), to train a small model with the help of a larger and well-trained model. Then, the intermediate layer selection scheme, called the Layer Selectivity Procedure (LSP), is exploited to determine the corresponding intermediate layers of source and target models. The LSP is achieved by two novel matrices, the layered inter-class Gram matrix and the inter-layered Gram matrix, to evaluate the diversity and discrimination of feature maps. The experimental results, demonstrated using three publicly available datasets, present the superior performance of model training using the LSL deep model learning framework.
    Keywords: deep learning; knowledge distillation
    1. Introduction. Convolutional neural networks (CNN) of deep learning have proved to be very successful in many applications in the field of computer vision, such as image classification [1], object detection [2–4], semantic segmentation [5], and others.
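The layer-selection step above rests on Gram-matrix statistics of feature maps. As a point of reference only, the snippet below computes a plain channel-wise Gram matrix for a convolutional feature map; it is not the paper's layered inter-class or inter-layered matrices, which additionally fold in class structure.

```python
import torch

def feature_gram(feature_map):
    """Channel-wise Gram matrix of a convolutional feature map.
    feature_map: (batch, channels, height, width)."""
    b, c, h, w = feature_map.shape
    flat = feature_map.reshape(b, c, h * w)
    # (b, c, c): pairwise channel correlations, normalized by spatial size.
    return torch.bmm(flat, flat.transpose(1, 2)) / (h * w)

# Toy usage on a random feature map from a hypothetical intermediate layer.
fmap = torch.randn(4, 64, 28, 28)
print(feature_gram(fmap).shape)  # torch.Size([4, 64, 64])
```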
  • Similarity Transfer for Knowledge Distillation Haoran Zhao, Kun Gong, Xin Sun, Member, IEEE, Junyu Dong, Member, IEEE and Hui Yu, Senior Member, IEEE
    Abstract—Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one. Most existing approaches enhance the student model by utilizing the similarity information between the categories of instance level provided by the teacher model. However, these works ignore the similarity correlation between different instances that plays an important role in confidence prediction. To tackle this issue, we propose a novel method in this paper, called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples. Furthermore, we propose to better capture the similarity correlation between different instances by the mixup technique, which creates virtual samples by a weighted linear interpolation. Note that, our distillation loss can
    [Introduction excerpt] ...following perspectives: network pruning [6][7][8], network decomposition [9][10][11], network quantization and knowledge distillation [12][13]. Among these methods, the seminal work of knowledge distillation has attracted a lot of attention due to its ability of exploiting dark knowledge from the pre-trained large network. Knowledge distillation (KD) is proposed by Hinton et al. [12] for supervising the training of a compact yet efficient student model by capturing and transferring the knowledge of a large teacher model to a compact one. Its success is attributed to the knowledge contained in class distributions provided by the teacher via soften softmax.
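The mixup step mentioned in the abstract is straightforward to sketch. Below is a minimal version under assumed settings (Beta(0.2, 0.2) mixing, interpolation of one-hot targets); STKD's actual similarity-transfer loss over these virtual samples is not reproduced.

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Create virtual samples by weighted linear interpolation of pairs
    drawn from the same batch (the mixup technique)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy usage on a batch of 4 "images" with 10 classes.
x = torch.randn(4, 3, 32, 32)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (4,)), 10).float()
x_mix, y_mix = mixup(x, y)
print(x_mix.shape, y_mix.shape)
```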
  • Understanding and Improving Knowledge Distillation
    Understanding and Improving Knowledge Distillation. Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, Sagar Jain. Google, Inc. {jiaxit,rakeshshivanna,zhezhao,dongl,animasingh,edchi,sagarj}@google.com
    Abstract: Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is used to train a more compact student model with better inference efficiency. Through distillation, one hopes to benefit from student's compactness, without sacrificing too much on model quality. Despite the large success of knowledge distillation, better understanding of how it benefits student model's training dynamics remains under-explored. In this paper, we categorize teacher's knowledge into three hierarchical levels and study its effects on knowledge distillation: (1) knowledge of the 'universe', where KD brings a regularization effect through label smoothing; (2) domain knowledge, where teacher injects class relationships prior to student's logit layer geometry; and (3) instance specific knowledge, where teacher rescales student model's per-instance gradients based on its measurement on the event difficulty. Using systematic analyses and extensive empirical studies on both synthetic and real-world datasets, we confirm that the aforementioned three factors play a major role in knowledge distillation. Furthermore, based on our findings, we diagnose some of the failure cases of applying KD from recent studies.
    1 Introduction. Recent advances in artificial intelligence have largely been driven by learning deep neural networks, and thus, current state-of-the-art models typically require a high inference cost in computation and memory.
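The first level, "knowledge of the universe", can be made concrete by comparing uniform label smoothing with a teacher's temperature-softened predictions: both spread probability mass off the true class, but the teacher does so in a class-specific way. A rough sketch under arbitrary smoothing and temperature values, not the paper's code:

```python
import torch
import torch.nn.functional as F

def label_smoothing_targets(labels, num_classes, eps=0.1):
    """Uniform label smoothing: spread eps of the probability mass evenly."""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

def soft_ce(student_logits, target_probs):
    """Cross-entropy against an arbitrary soft target distribution."""
    return -(target_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

# Toy comparison: smoothed one-hot targets vs. a teacher's softened predictions.
logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
teacher_probs = F.softmax(torch.randn(8, 10) / 4.0, dim=-1)  # temperature 4
print(soft_ce(logits, label_smoothing_targets(labels, 10)))
print(soft_ce(logits, teacher_probs))
```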
  • Optimising Hardware Accelerated Neural Networks with Quantisation and a Knowledge Distillation Evolutionary Algorithm
    Optimising Hardware Accelerated Neural Networks with Quantisation and a Knowledge Distillation Evolutionary Algorithm. Robert Stewart 1,*, Andrew Nowlan 1, Pascal Bacchus 2, Quentin Ducasse 3 and Ekaterina Komendantskaya 1. 1 Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK; [email protected] (A.N.); [email protected] (E.K.). 2 Inria Rennes-Bretagne Atlantique Research Centre, 35042 Rennes, France; [email protected]. 3 Lab-STICC, École Nationale Supérieure de Techniques Avancées, 29200 Brest, France; [email protected]. * Correspondence: [email protected]
    Abstract: This paper compares the latency, accuracy, training time and hardware costs of neural networks compressed with our new multi-objective evolutionary algorithm called NEMOKD, and with quantisation. We evaluate NEMOKD on Intel's Movidius Myriad X VPU processor, and quantisation on Xilinx's programmable Z7020 FPGA hardware. Evolving models with NEMOKD increases inference accuracy by up to 82% at the cost of 38% increased latency, with throughput performance of 100–590 image frames-per-second (FPS). Quantisation identifies a sweet spot of 3 bit precision in the trade-off between latency, hardware requirements, training time and accuracy. Parallelising FPGA implementations of 2 and 3 bit quantised neural networks increases throughput from 6 k FPS to 373 k FPS, a 62× speedup.
    Keywords: quantisation; evolutionary algorithm; neural network; FPGA; Movidius VPU
    Citation: Stewart, R.; Nowlan, A.; Bacchus, P.; Ducasse, Q.; Komendantskaya, E. Optimising Hardware Accelerated Neural Networks with Quantization and a Knowledge Distillation Evolutionary Algorithm. Electronics 2021, 10, 396. https://doi.org/10.3390/
    1. Introduction. Neural networks have proved successful for many domains including image recognition, autonomous systems and language processing.
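The 3-bit "sweet spot" concerns uniform low-precision weights. The sketch below is a generic fake-quantization routine, not the FPGA toolflow evaluated in the paper; it only illustrates what n-bit uniform quantisation does to a weight tensor.

```python
import torch

def uniform_quantize(w, bits=3):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 3 bits -> integer range [-3, 3]
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                          # de-quantized ("fake-quantized") weights

w = torch.randn(256, 128)
w3 = uniform_quantize(w, bits=3)
print((w - w3).abs().mean())                  # average quantization error
```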
  • Memory-Replay Knowledge Distillation
    Memory-Replay Knowledge Distillation. Jiyue Wang 1,*, Pei Zhang 2 and Yanxiong Li 1. 1 School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China; [email protected]. 2 School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China; [email protected]. * Correspondence: [email protected]
    Abstract: Knowledge Distillation (KD), which transfers the knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, while self-KD distills its own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or the identical sample with different augmentation. However, the self-KD methods mentioned above have their limitations for widespread use: the former needs to redesign the DNN for different tasks, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), that uses the historical models as teachers. Firstly, we propose a novel self-KD training method that penalizes the KD loss between the current model's output distributions and its backup outputs on the training trajectory. This strategy can regularize the model with its historical output distribution space to stabilize the learning. Secondly, a simple Fully Connected Network (FCN) is applied to ensemble the historical teacher's output for a better guidance. Finally, to ensure the teacher outputs offer the right class as ground truth, we correct the teacher logit output by the Knowledge Adjustment (KA) method.
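The core of MrKD, distilling a model toward its own earlier checkpoints, reduces to a KL penalty against a frozen snapshot. The sketch below is a simplified single-snapshot version with assumed hyperparameters; the paper additionally ensembles several historical teachers with a small FCN and corrects their logits via Knowledge Adjustment.

```python
import copy
import torch
import torch.nn.functional as F

def memory_replay_kd_loss(model, snapshot, x, labels, T=2.0, beta=0.3):
    """Self-distillation against a frozen historical snapshot of the same model:
    CE on the labels plus a KL term pulling current outputs toward the snapshot's."""
    logits = model(x)
    with torch.no_grad():
        old_logits = snapshot(x)
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                  F.softmax(old_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return F.cross_entropy(logits, labels) + beta * kd

# Toy usage with a hypothetical linear classifier and a snapshot taken earlier in training.
model = torch.nn.Linear(20, 5)
snapshot = copy.deepcopy(model).eval()
x, labels = torch.randn(16, 20), torch.randint(0, 5, (16,))
print(memory_replay_kd_loss(model, snapshot, x, labels))
```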
  • Towards Effective Utilization of Pre-Trained Language Models
    Towards Effective Utilization of Pretrained Language Models | Knowledge Distillation from BERT. By Linqing Liu. A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science. Waterloo, Ontario, Canada, 2020. © Linqing Liu 2020.
    Author's Declaration: I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.
    Abstract: In the natural language processing (NLP) literature, neural networks are becoming increasingly deeper and more complex. Recent advancements in neural NLP are large pretrained language models (e.g. BERT), which lead to significant performance gains in various downstream tasks. Such models, however, require intensive computational resources to train and are difficult to deploy in practice due to poor inference-time efficiency. In this thesis, we try to solve this problem through knowledge distillation (KD), where a large pretrained model serves as teacher and transfers its knowledge to a small student model. We also want to demonstrate the competitiveness of small, shallow neural networks. We propose a simple yet effective approach that transfers the knowledge of a large pretrained network (namely, BERT) to a shallow neural architecture (namely, a bidirectional long short-term memory network). To facilitate this process, we propose heuristic data augmentation methods, so that the teacher model can better express its knowledge on the augmented corpus. Experimental results on various natural language understanding tasks show that our distilled model achieves high performance comparable to the ELMo model (a LSTM based pretrained model) in both single-sentence and sentence-pair tasks, while using roughly 60–100 times fewer parameters and 8–15 times less inference time.
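A common recipe for this BERT-to-BiLSTM setting is to regress the student's logits onto the teacher's with an MSE term alongside cross-entropy. The sketch below follows that recipe with illustrative sizes; it is not the thesis code, and the data-augmentation heuristics it depends on are omitted.

```python
import torch
import torch.nn.functional as F

class TinyBiLSTMClassifier(torch.nn.Module):
    """A small bidirectional LSTM student, standing in for the shallow
    architecture distilled from BERT (all sizes here are illustrative)."""
    def __init__(self, vocab=10000, emb=128, hidden=128, classes=2):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, emb)
        self.lstm = torch.nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.fc = torch.nn.Linear(2 * hidden, classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.fc(h.mean(dim=1))           # mean-pool over time, then classify

def logit_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend of hard-label CE and MSE toward the teacher's logits."""
    return alpha * F.cross_entropy(student_logits, labels) + \
           (1 - alpha) * F.mse_loss(student_logits, teacher_logits)

student = TinyBiLSTMClassifier()
token_ids = torch.randint(0, 10000, (4, 32))
teacher_logits = torch.randn(4, 2)               # would come from a fine-tuned BERT
labels = torch.randint(0, 2, (4,))
print(logit_distillation_loss(student(token_ids), teacher_logits, labels))
```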
  • Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model
    Zi Wang.
    Abstract: Knowledge distillation (KD) is a successful approach for deep neural network acceleration, with which a compact network (student) is trained by mimicking the softmax output of a pre-trained high-capacity network (teacher). In tradition, KD usually relies on access to the training samples and the parameters of the white-box teacher to acquire the transferred knowledge. However, these prerequisites are not always realistic due to storage costs or privacy issues in real-world applications. Here we propose the concept of decision-based black-box (DB3) knowledge distillation, with which the student is trained by distilling the knowledge from a black-box teacher (parameters are not accessible) that only returns classes rather than
    [Introduction excerpt] ...resource-limited devices such as mobile phones and drones (Moskalenko et al., 2018). Recently, a large number of approaches have been proposed for training lightweight DNNs with the help of a cumbersome, over-parameterized model, such as network pruning (Li et al., 2016; He et al., 2019; Wang et al., 2021), quantization (Han et al., 2015), factorization (Jaderberg et al., 2014), and knowledge distillation (KD) (Hinton et al., 2015; Phuong & Lampert, 2019; Jin et al., 2020; Yun et al., 2020; Passalis et al., 2020; Wang, 2021). Among all these approaches, knowledge distillation is a popular scheme with which a compact student network is trained by mimicking the softmax output (class probabilities) of a pre-trained deeper and wider teacher model (Hinton et al., 2015). By doing so, the rich information learned by the powerful teacher can be imitated by the student, which often exhibits better performance than solely training the student with a cross-entropy loss.
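For orientation only: the naive decision-based baseline queries the black-box for top-1 labels and trains on them with cross-entropy, as sketched below. The DB3 method proposed in the paper goes beyond this by building richer supervision out of the returned decisions, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def hard_label_distill_step(student, black_box_predict, x, optimizer):
    """Naive baseline for a decision-based black-box teacher: query top-1 classes
    and fit the student to them with cross-entropy."""
    with torch.no_grad():
        pseudo_labels = black_box_predict(x)     # the teacher API returns class indices only
    loss = F.cross_entropy(student(x), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a random "API" standing in for the inaccessible teacher.
student = torch.nn.Linear(32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
api = lambda x: torch.randint(0, 10, (x.size(0),))
print(hard_label_distill_step(student, api, torch.randn(16, 32), opt))
```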
  • MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models
    Linqing Liu,1 Huan Wang,2 Jimmy Lin,1 Richard Socher,2 and Caiming Xiong2. 1 David R. Cheriton School of Computer Science, University of Waterloo. 2 Salesforce Research.
    [Figure 1: The left figure represents task-specific KD; the distillation process needs to be performed for each different task. The right figure represents our proposed multi-task KD; the student model consists of shared layers and task-specific layers.]
    Abstract: Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources needed to train such models remain an issue. Knowledge distillation alleviates this problem by learning a light-weight student model. So far the distillation approaches are all task-specific. In this paper, we explore knowledge distillation under the multi-task learning setting. The student is jointly distilled across different tasks. It acquires more general representation capacity through multi-tasking distillation and can be further fine-tuned to improve the model in the target domain. Unlike other BERT distillation methods which are specifically designed for Transformer-based architectures, we provide a general learning framework. Our approach is model agnostic and can be easily applied on different future teacher model architectures.
    [Introduction excerpt] ...leading to resource-intensive inference. The consensus is that we need to cut down the model size and reduce the computational cost while maintaining comparable quality. One approach to address this problem is knowledge
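The shared-plus-task-specific student and the jointly distilled objective can be sketched as follows. The two-task setup, layer sizes and temperature are invented for illustration; the actual approach also fine-tunes the distilled student on the target domain afterwards.

```python
import torch
import torch.nn.functional as F

class MultiTaskStudent(torch.nn.Module):
    """Shared encoder with task-specific heads, mirroring the multi-task KD setup."""
    def __init__(self, in_dim=64, hidden=32, task_classes=(3, 5)):
        super().__init__()
        self.shared = torch.nn.Sequential(torch.nn.Linear(in_dim, hidden), torch.nn.ReLU())
        self.heads = torch.nn.ModuleList([torch.nn.Linear(hidden, c) for c in task_classes])

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

def multi_task_kd_loss(student, teacher_logits_per_task, batches, T=2.0):
    """Sum of per-task KL terms, each against that task's teacher."""
    total = 0.0
    for task_id, (x, t_logits) in enumerate(zip(batches, teacher_logits_per_task)):
        s = student(x, task_id)
        total = total + F.kl_div(F.log_softmax(s / T, dim=-1),
                                 F.softmax(t_logits / T, dim=-1),
                                 reduction="batchmean") * T * T
    return total

student = MultiTaskStudent()
batches = [torch.randn(8, 64), torch.randn(8, 64)]
teacher_logits = [torch.randn(8, 3), torch.randn(8, 5)]   # from two task-specific teachers
print(multi_task_kd_loss(student, teacher_logits, batches))
```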
  • Knowledge Distillation from Internal Representations
    The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Knowledge Distillation from Internal Representations. Gustavo Aguilar,1 Yuan Ling,2 Yu Zhang,2 Benjamin Yao,2 Xing Fan,2 Chenlei Guo2. 1 Department of Computer Science, University of Houston, Houston, USA. 2 Alexa AI, Amazon, Seattle, USA. [email protected], {yualing, yzzhan, banjamy, fanxing, guochenl}@amazon.com
    Abstract: Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student.
    [Introduction excerpt] ...(Hinton, Vinyals, and Dean 2015), where a cumbersome already-optimized model (i.e., the teacher) produces output probabilities that are used to train a simplified model (i.e., the student). Unlike training with one-hot labels where the classes are mutually exclusive, using a probability distribution provides more information about the similarities of the samples, which is the key part of the teacher-student distillation. Even though the student requires fewer parameters while still performing similar to the teacher, recent work shows the difficulty of distilling information from a huge model. Mirzadeh et al. (2019) state that, when the gap in between
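One minimal way to supervise internal representations is to penalize the distance between (projected) student hidden states and the teacher's, as sketched below. This is a generic formulation rather than the paper's specific objective; the 128- and 768-dimensional sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hidden, teacher_hidden, proj):
    """Match a student's internal representation to the teacher's with MSE,
    after a learned projection (hidden sizes usually differ)."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

# Toy usage: a 128-dim student layer distilled toward a 768-dim teacher layer.
proj = torch.nn.Linear(128, 768)
student_hidden = torch.randn(4, 16, 128)   # (batch, seq_len, student_dim)
teacher_hidden = torch.randn(4, 16, 768)   # (batch, seq_len, teacher_dim)
print(hidden_state_distillation_loss(student_hidden, teacher_hidden, proj))
```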
  • Sequence-Level Knowledge Distillation
    Sequence-Level Knowledge Distillation. Yoon Kim, Alexander M. Rush. School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
    Abstract: Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model).
    [Introduction excerpt] ...NMT systems directly model the probability of the next word in the target sentence simply by conditioning a recurrent neural network on the source sentence and previously generated target words. While both simple and surprisingly accurate, NMT systems typically need to have very high capacity in order to perform well: Sutskever et al. (2014) used a 4-layer LSTM with 1000 hidden units per layer (herein 4×1000) and Zhou et al. (2016) obtained state-of-the-art results on English→French with a 16-layer LSTM with 512 units per layer. The sheer size of the models requires cutting-edge hardware for training and makes using the models on standard setups very challenging. This issue of excessively large networks has been
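Sequence-level KD swaps the ground-truth target for the teacher's own decoded output. The sketch below assumes a hypothetical teacher_generate function (e.g., wrapping beam search) and a toy student model; it shows only the shape of the training signal, not a full NMT pipeline.

```python
import torch
import torch.nn.functional as F

def sequence_level_kd_step(student, teacher_generate, src, optimizer, pad_id=0):
    """Sequence-level KD: decode a pseudo-target with the teacher, then train the
    student on it with ordinary token-level cross-entropy."""
    with torch.no_grad():
        pseudo_tgt = teacher_generate(src)                 # (batch, tgt_len) token ids
    logits = student(src, pseudo_tgt)                      # (batch, tgt_len, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           pseudo_tgt.reshape(-1), ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class ToyStudent(torch.nn.Module):
    """Stand-in for an encoder-decoder: scores each target token from its embedding."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.out = torch.nn.Linear(dim, vocab)
    def forward(self, src, tgt):
        return self.out(self.emb(tgt))

student = ToyStudent()
opt = torch.optim.SGD(student.parameters(), lr=0.1)
fake_teacher_generate = lambda src: torch.randint(1, 100, (src.size(0), 12))
print(sequence_level_kd_step(student, fake_teacher_generate, torch.randint(1, 100, (4, 10)), opt))
```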
  • Improved Knowledge Distillation Via Teacher Assistant
    The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Improved Knowledge Distillation via Teacher Assistant. Seyed Iman Mirzadeh,1 Mehrdad Farajtabar,2 Ang Li,2 Nir Levine,2 Akihiro Matsukawa,3 Hassan Ghasemzadeh1. 1 Washington State University, WA, USA. 2 DeepMind, CA, USA. 3 D.E. Shaw, NY, USA. 1{seyediman.mirzadeh, hassan.ghasemzadeh}@wsu.edu, 2{farajtabar, anglili, nirlevine}@google.com, [email protected]
    Abstract: Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation.
    [Figure 1: TA fills the gap between student & teacher]
    [Introduction excerpt] ...by adding a term to the usual classification loss that encourages the student to mimic the teacher's behavior. However, we argue that knowledge distillation is not always effective, especially when the gap (in size) between
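Multi-step (teacher assistant) distillation is ordinary distillation applied twice, with one helper reused for teacher-to-TA and then TA-to-student. In the sketch below, the widths 256/64/16 and all hyperparameters are placeholders rather than the paper's configurations.

```python
import torch
import torch.nn.functional as F

def distill(teacher, student, data, epochs=1, T=4.0, lr=0.1):
    """One distillation stage: train `student` to match `teacher`'s softened outputs."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data:
            with torch.no_grad():
                t = teacher(x)
            s = student(x)
            loss = F.kl_div(F.log_softmax(s / T, dim=-1), F.softmax(t / T, dim=-1),
                            reduction="batchmean") * T * T + F.cross_entropy(s, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Multi-step (TA) distillation: teacher -> teacher assistant -> student.
data = [(torch.randn(32, 20), torch.randint(0, 5, (32,))) for _ in range(10)]
teacher = torch.nn.Sequential(torch.nn.Linear(20, 256), torch.nn.ReLU(), torch.nn.Linear(256, 5))
assistant = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
student = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU(), torch.nn.Linear(16, 5))
assistant = distill(teacher, assistant, data)
student = distill(assistant, student, data)
```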
  • Reinforced Multi-Teacher Selection for Knowledge Distillation
    The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21). Reinforced Multi-Teacher Selection for Knowledge Distillation. Fei Yuan,1 Linjun Shou,2 Jian Pei,3 Wutao Lin,2 Ming Gong,2 Yan Fu,1 Daxin Jiang2. 1 University of Electronic Science and Technology of China. 2 Microsoft STCA NLP Group. 3 School of Computing Science, Simon Fraser University.
    Abstract: In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge from one or multiple large (teacher) models to a small (student) model. When multiple teacher models are available in distillation, the state-of-the-art methods assign a fixed weight to a teacher model in the whole distillation. Furthermore, most of the existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of student models distilled.
    Table 1: RoBERTa-Base performs better than BERT-Base. However, the student model distilled from BERT-Base, the weaker model, performs better than the same student model distilled from RoBERTa-Base, the stronger teacher model.
    Teacher       | Student | MRPC Acc | MNLI-mm Acc
    BERT-Base     | N/A     | 83.3     | 83.8
    RoBERTa-Base  | N/A     | 88.5     | 86.8
    BERT-Base     | BERT3   | 74.0     | 75.3
    RoBERTa-Base  | BERT3   | 73.0     | 71.9
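The fixed-weight multi-teacher baseline that the paper improves on is a weighted sum of per-teacher KL terms, as sketched below with made-up weights; the reinforcement-learning teacher selection itself is not shown.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """Weighted sum of KL terms against several teachers. Fixed or equal weights
    reproduce the baseline; a per-example selection policy would replace `weights`."""
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        loss = loss + w * F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                                   F.softmax(t_logits / T, dim=-1),
                                   reduction="batchmean") * T * T
    return loss

# Toy usage: two teachers with equal weights.
s = torch.randn(8, 3)
teachers = [torch.randn(8, 3), torch.randn(8, 3)]
print(multi_teacher_kd_loss(s, teachers, weights=[0.5, 0.5]))
```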