The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Knowledge Distillation from Internal Representations

Gustavo Aguilar,1 Yuan Ling,2 Yu Zhang,2 Benjamin Yao,2 Xing Fan,2 Chenlei Guo2
1Department of Computer Science, University of Houston, Houston, USA
2Alexa AI, Amazon, Seattle, USA
[email protected], {yualing, yzzhan, banjamy, fanxing, guochenl}@amazon.com

Abstract

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.

Introduction

Transformer-based models have significantly advanced the field of natural language processing by establishing new state-of-the-art results in a large variety of tasks. Specifically, BERT (Devlin et al. 2018), GPT (Radford et al. 2018), GPT-2 (Radford et al. 2019), XLM (Lample and Conneau 2019), XLNet (Yang et al. 2019), and RoBERTa (Liu et al. 2019c) lead tasks such as text classification, sentiment analysis, semantic role labeling, and question answering, among others. However, most of these models have hundreds of millions of parameters, which significantly slows down the training process and inference time. In addition, the large number of parameters demands a large amount of memory, making such models hard to adopt in production environments where computational resources are strictly limited.

Due to these limitations, many approaches have been proposed to reduce the size of the models while still providing similar performance. One of the most effective techniques is knowledge distillation (KD) in a teacher-student setting (Hinton, Vinyals, and Dean 2015), where a cumbersome already-optimized model (i.e., the teacher) produces output probabilities that are used to train a simplified model (i.e., the student). Unlike training with one-hot labels, where the classes are mutually exclusive, using a probability distribution provides more information about the similarities among samples, which is the key part of teacher-student distillation.

Even though the student requires fewer parameters while still performing similarly to the teacher, recent work shows the difficulty of distilling information from a huge model. Mirzadeh et al. (2019) state that, when the gap between the teacher and the student is large (e.g., shallow vs. deep neural networks), the student struggles to approximate the teacher. They propose to use an intermediate teaching assistant (TA) model to distill the information from the teacher and then use the TA model to distill information towards the student. However, we argue that the abstraction captured by a large teacher is only exposed through the output probabilities, which makes the internal knowledge of the teacher (or the TA model) hard to infer by the student. This can potentially take the student to very different internal representations, undermining the generalization capabilities initially intended to be transferred from the teacher.

In this paper, we propose to apply KD to internal representations. Our approach allows the student to internally behave as the teacher does by effectively transferring its linguistic properties. We perform the distillation at different internal points across the teacher, which allows the student to systematically learn and compress the abstraction in the hidden layers of the large model. By including internal representations, we show that our student outperforms its homologous models trained on ground-truth labels, soft-labels, or both.

Related Work

Knowledge distillation has become one of the most effective and simple techniques to compress huge models into simpler and faster models. The versatility of this framework has allowed the extension of KD to scenarios where a set of expert models in different tasks distill their knowledge into a unified multi-task learning network (Clark et al. 2019b), as well as the opposite scenario where an ensemble of multi-task models is distilled into a task-specific network (Liu et al. 2019a; 2019b). We extend the knowledge distillation framework with a different formulation by applying the same principle to internal representations.

Using internal representations to guide the training of a student model was initially explored by Romero et al. (2014). They proposed FITNET, a convolutional student network that is thinner and deeper than the teacher while using significantly fewer parameters. In their work, they establish a middle point in both the teacher and the student models to compare internal representations. Since the dimensionality between the teacher and the student differs, they use a convolutional regressor model to map such vectors into the same space, which adds a significant number of parameters to learn. Additionally, they mainly focus on providing a deeper student network than the teacher, exploiting the particular benefits of depth in convolutional networks. Our work differs from theirs in several aspects: 1) using a single point-wise loss on the middle layers has mainly a regularization effect, but it does not guarantee the transfer of the internal knowledge of the teacher; 2) our distillation method is applied across all the student layers, which effectively compresses groups of layers from the teacher into a single layer of the student; 3) we use the internal representations as-is instead of relying on additional parameters to perform the distillation; 4) we do not focus on deeper models than the teacher, as this can slow down the inference time, and it is not necessarily an advantage in transformer-based models.

Concurrent to this work, similar transformer-based distillation techniques have been studied. Sanh et al. (2019) propose DistilBERT, which compresses BERT during pre-training to provide a smaller general-purpose model. They pre-train their model using a masked language modeling loss, a cosine embedding loss at the hidden states, and the teacher-student distillation loss. Conversely, Sun et al. (2019) distill their model during task-specific fine-tuning using an MSE loss at the hidden states and cross-entropy losses from soft- and hard-labels. While our work is similar to theirs, the most relevant differences are 1) the use of a KL-divergence loss at the self-attention probabilities, which have been shown to capture linguistic knowledge (Clark et al. 2019a), and 2) the introduction of new algorithms to distill the internal knowledge from the teacher (i.e., progressive and stacked knowledge distillation).

Curriculum learning (CL) (Bengio 2009) is another line of research that focuses on teaching complex tasks by building upon simple concepts. Although the goal is similar to ours, CL is conducted by stages, focusing on simple tasks first and progressively moving to more complicated tasks. However, this method requires annotations for the preliminary tasks, and they have to be carefully picked so that the order and relation among the build-up tasks are helpful for the model. Unlike CL, we focus on teaching the internal representations of an optimized complex model, which are assumed to have the preliminary build-up knowledge for the task of interest.

Other model compression techniques include quantization (Hubara et al. 2017; He et al. 2016; Courbariaux et al. 2016) and weight pruning (Han, Mao, and Dally 2015). The first one focuses on approximating a large model with a smaller one by reducing the precision of each of the parameters. The second one focuses on removing weights in the network that do not have a substantial impact on model performance. These techniques are complementary to the method we propose in this paper, which can potentially lead to a more effective overall compression approach.

Methodology

In this section, we detail the process of distilling knowledge from internal representations. First, we describe the standard KD framework (Hinton, Vinyals, and Dean 2015), which is an essential part of our method. Then, we formalize the objective functions to distill the internal knowledge of transformer-based models. Lastly, we propose various algorithms to conduct the internal distillation process.

Knowledge Distillation

Hinton, Vinyals, and Dean (2015) proposed knowledge distillation (KD) as a framework to compress a large model into a simplified model that achieves similar results. The framework uses a teacher-student setting where the student learns from both the ground-truth labels (if available) and the soft-labels provided by the teacher. The probability mass associated with each class in the soft-labels allows the student to learn more information about the label similarities for a given sample. The formulation of KD considering both soft and hard labels is given as follows:

    L_KD = −(1/N) Σ_{i=1}^{N} p(y_i | x_i, θ_T) log(ŷ_i) − λ (1/N) Σ_{i=1}^{N} y_i log(ŷ_i)    (1)

where θ_T represents the parameters of the teacher, and p(y_i | x_i, θ_T) are its soft-labels; ŷ_i is the student prediction given by p(y_i | x_i, θ_S), where θ_S denotes its parameters, and λ is a small scalar that weights down the hard-label loss. Since the soft-labels often present high entropy, their gradient tends to be smaller than the one from the hard-labels. Thus, λ balances the terms by reducing the impact of the hard loss.
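
To make Equation 1 concrete, the following is a minimal PyTorch sketch of the combined soft- and hard-label loss. It is our own illustration rather than the authors' code; the function name, the absence of a softmax temperature, and the mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels, lam=0.1):
    """Equation 1: soft-label cross-entropy plus a down-weighted hard-label term.

    student_logits, teacher_logits: (batch, num_classes)
    hard_labels: (batch,) integer class ids
    lam: the lambda scalar that weights down the hard-label loss
    """
    # Teacher soft-labels p(y|x, theta_T) and student log-probabilities log(y_hat)
    soft_targets = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)

    # Soft-label term: -1/N sum_i p(y_i|x_i, theta_T) * log(y_hat_i)
    soft_loss = -(soft_targets * student_log_probs).sum(dim=-1).mean()

    # Hard-label term: standard cross-entropy with one-hot labels, scaled by lambda
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return soft_loss + lam * hard_loss
```

Setting lam=0 recovers the setting in which only the teacher's soft-labels are distilled.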

Matching Internal Representations

In order to make the student model behave as the teacher model, the student is optimized with the soft-labels from the teacher's output. In addition, the student also acquires the abstraction hidden in the teacher by matching its internal representations. That is, we want to teach the student how to internally behave by compressing the knowledge of multiple layers from the teacher into a single layer of the student. Figure 1 shows a teacher with twice the number of layers of the student, where the colored boxes denote the layers where the student is taught the internal representation of the teacher. In this case, the student compresses two layers into one while preserving the linguistic behavior across the teacher layers.

Figure 1: Knowledge distillation from internal representations. We show the internal layers that the teacher (left) distills into the student (right).

We study the internal KD of transformer-based models, specifically the case of BERT and simplified versions of it (i.e., fewer transformer layers). We define the internal KD using two terms in the training objective. Given a pair of transformer layers to match (see Figure 1), we calculate (1) the Kullback-Leibler (KL) divergence loss across the self-attention probabilities of all the transformer heads, and (2) the cosine similarity loss between the [CLS] activation vectors of the given layers.

KL-divergence loss. Consider A as the self-attention matrix that contains row-wise probability distributions per token in a sequence, given by A = softmax(d_a^{-0.5} Q K^T) (Vaswani et al. 2017). For each matched pair of layers, we compute the KL divergence between the teacher's and the student's row-wise attention distributions across all heads (procedure HEADLOSS in Algorithm 1).

Cosine similarity loss. This term operates on the [CLS] activation vectors that each matched layer passes to the upper layers, not on the self-attention probabilities. Even if we force the self-attention probabilities to be similar, there is no guarantee that the final activation passed to the upper layers is similar. Thus, using this extra term, we can regularize the context representation of the sample to be similar to the one from the teacher.

Algorithm 1 Stacked Internal Distillation (SID)
 1: procedure HEADLOSS(TS, batch, layer_t, layer_s)
 2:   for sample ← batch; init L ← 0 do
 3:     P ← CONCATHEADS(TS.teacher, sample, layer_t)
 4:     Q ← CONCATHEADS(TS.student, sample, layer_s)
 5:     L ← L + mean(sum(P · log(P/Q), axis=2))
 6:   return L / size(batch)
 7: procedure STACKINTDISTILL(TS, batch)
 8:   L_cos, L_kl ← 0, 0
 9:   for layer_t, layer_s ← MATCH(0 ... TS.nextLocked) do
10:     CLS_t ← GETCLS(TS.teacher, batch, layer_t)
11:     CLS_s ← GETCLS(TS.student, batch, layer_s)
12:     L_cos ← L_cos + mean(1 − cos(CLS_t, CLS_s))
13:     L_kl ← L_kl + HEADLOSS(TS, batch, layer_t, layer_s)
14:   return L_cos, L_kl
15: TS ← INITIALIZETSMODEL()
16: repeat:
17: for e ← 0 ... epochs do
18:   if TS.nextLocked < TS.student.nLayers then
19:     ▷ Perform internal distillation
20:     for batch ← data; init τ ← 0 do
21:       L_cos, L_kl ← STACKINTDISTILL(TS, batch)
22:       backprop(TS, L_cos + L_kl)
23:       τ ← τ + L_cos   ▷ accumulate for the cosine-loss threshold
24:     if τ < cosine-loss threshold or the per-layer epoch limit is reached then TS.nextLocked ← TS.nextLocked + 1
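
The two internal terms and the per-head KL computation of Algorithm 1 can be sketched as follows. This is an illustrative PyTorch version, not the authors' implementation; it assumes attention tensors of shape (batch, heads, seq, seq) and hidden states of shape (batch, seq, dim), e.g., as returned by a HuggingFace BertModel with output_attentions=True and output_hidden_states=True, and that the [CLS] token sits at position 0.

```python
import torch
import torch.nn.functional as F

def attention_kl_loss(teacher_attn, student_attn, eps=1e-12):
    """KL divergence between teacher and student self-attention probabilities.

    teacher_attn, student_attn: (batch, heads, seq_len, seq_len) row-wise
    probability distributions for one matched pair of transformer layers
    (HEADLOSS in Algorithm 1: average of sum P * log(P / Q) over rows/heads).
    """
    p = teacher_attn.clamp_min(eps)
    q = student_attn.clamp_min(eps)
    kl = (p * (p.log() - q.log())).sum(dim=-1)  # sum over attended positions
    return kl.mean()                            # mean over batch, heads, and rows

def cls_cosine_loss(teacher_hidden, student_hidden):
    """1 - cosine similarity between the [CLS] vectors of a matched layer pair.

    teacher_hidden, student_hidden: (batch, seq_len, dim); the [CLS] token is
    assumed to be at position 0.
    """
    cls_t = teacher_hidden[:, 0]
    cls_s = student_hidden[:, 0]
    return (1.0 - F.cosine_similarity(cls_t, cls_s, dim=-1)).mean()

def internal_distillation_loss(teacher_attns, teacher_hiddens,
                               student_attns, student_hiddens, layer_map):
    """Accumulate both terms over matched (teacher_layer, student_layer) pairs,
    mirroring STACKINTDISTILL in Algorithm 1."""
    loss_kl, loss_cos = 0.0, 0.0
    for layer_t, layer_s in layer_map:
        loss_kl += attention_kl_loss(teacher_attns[layer_t], student_attns[layer_s])
        loss_cos += cls_cosine_loss(teacher_hiddens[layer_t], student_hiddens[layer_s])
    return loss_kl, loss_cos
```

For a 12-layer teacher and a 6-layer student, layer_map could be [(1, 0), (3, 1), (5, 2), (7, 3), (9, 4), (11, 5)] (0-indexed), matching every second teacher layer as in Figure 1.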

We consider three scenarios to conduct the internal distillation. In the first one, the internal losses are applied to all the matched layers during the entire training. The other two scenarios distill the layers in order:

2. Progressive internal distillation (PID). We distill the knowledge from the lower layers first and progressively move to upper layers until the model focuses only on the classification distillation. Only one layer is optimized at a time. In Figure 1, the loss will be given by the transition 1 → 2 → 3 → 4.

3. Stacked internal distillation (SID). We distill the knowledge from lower layers first, but instead of moving from one layer to another exclusively, we keep the loss produced by previous layers, stacking them as we move to the top. Once at the top, we only perform classification distillation (see Algorithm 1). In Figure 1, the loss is determined by the transition 1 → 1+2 → 1+2+3 → 4.

For the last two scenarios, to move to upper layers, the student either reaches a limited number of epochs per layer or a cosine loss threshold, whichever happens first (see line 24 in Algorithm 1). Additionally, these two scenarios can be combined with the classification loss at all times, rather than only after the model reaches the top layer.

Experiments and Results

Datasets

We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018), which we describe briefly:

1. CoLA. The Corpus of Linguistic Acceptability (Warstadt, Singh, and Bowman 2018) is part of the single-sentence tasks, and it requires determining whether an English text is grammatically correct. It uses the Matthews Correlation Coefficient (MCC) to measure performance.

2. QQP. The Quora Question Pairs dataset (data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. It uses accuracy and F1 as metrics.

3. MRPC. The Microsoft Research Paraphrase Corpus (Dolan and Brockett 2005) contains pairs of sentences whose annotations describe whether the sentences are semantically equivalent or not. Similar to QQP, it uses accuracy and F1 as metrics.

4. RTE. The Recognizing Textual Entailment dataset (Wang et al. 2018) has a collection of sentence pairs whose annotations describe entailment or not entailment between the sentences (formerly annotated with the labels entailment, contradiction, or neutral). It uses accuracy as a metric.

For the MRPC and QQP datasets, the metrics are accuracy and F1, but we optimize the models on F1 only.

Parameter Initialization

We experiment with BERTbase (Devlin et al. 2018) and simplified versions of it. In the case of BERT with 6 transformer layers, we initialize the parameters using different layers of the original BERTbase model, which has 12 transformer layers. Since our goal is to compress the behavior of a subset of layers into one layer, we initialize a layer of the simplified BERT model with the upper layer of the subset. For example, Figure 1 shows the compression of groups of two layers into one layer; hence, the first layer of the student model is initialized with the parameters of the second layer of the BERTbase model. Note that the initialization does not take the parameters of the fine-tuned teacher; instead, we use the parameters of the general-purpose BERTbase model.
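
As an illustration of this initialization, the sketch below copies every second transformer layer of the general-purpose 12-layer BERTbase checkpoint into a 6-layer student. It assumes the HuggingFace transformers API; the function name and the exact copying loop are our own.

```python
from transformers import BertConfig, BertModel

def build_student(teacher_name="bert-base-uncased", num_student_layers=6):
    """Initialize a 6-layer student from the general-purpose BERT-base checkpoint.

    Student layer i takes the parameters of teacher layer 2*i + 1, i.e., the
    upper layer of each two-layer group (so student layer 0 receives teacher
    layer 1, the second transformer layer of BERT-base).
    """
    teacher = BertModel.from_pretrained(teacher_name)
    config = BertConfig.from_pretrained(teacher_name, num_hidden_layers=num_student_layers)
    student = BertModel(config)

    # Reuse the teacher's embeddings as-is.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())

    # Copy the upper layer of each group of two teacher layers.
    for i in range(num_student_layers):
        teacher_layer = teacher.encoder.layer[2 * i + 1]
        student.encoder.layer[i].load_state_dict(teacher_layer.state_dict())
    return student
```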

Experimental Setup

Table 1 shows the results on the development set across four datasets. We define the experiments as follows:

• Exp1.0: BERTbase. This is the standard BERTbase model that is fine-tuned on task-specific data without any KD technique. Once optimized, we use this model as a teacher for the KD experiments.

• Exp1.1: BERT6. This is a simplified version of BERTbase, where we use 6 transformer layers instead of 12. The layer selection for initialization is described in the previous section. We do not use any KD for this experiment. The KD experiments described below use this architecture as the student model.

• Exp2.0: BERT6 soft. The model is trained with soft-labels produced by the fine-tuned BERTbase teacher from experiment 1.0. This scenario corresponds to Equation 1 with λ = 0 to ignore the one-hot loss.

• Exp3.0: BERT6 soft + kl. The model uses both the soft-label loss (Equation 1) and the KL-divergence loss on the self-attention probabilities. The KL-divergence loss is averaged across all the self-attention matrices from the student (i.e., 12 attention heads per transformer layer per 12 transformer layers).

• Exp3.1: BERT6 soft + cos. The model uses both the soft-label loss (Equation 1) and the cosine similarity loss. The cosine similarity loss is computed from the [CLS] vector of all matching layers.

• Exp3.2: BERT6 soft + kl + cos. The model uses all the losses from all layers every epoch. This experiment combines experiments 3.0 and 3.1.

• Exp3.3: BERT6 [PID] kl + cos → soft. The model only uses progressive internal distillation until it reaches the classification layer. Once there, only soft-labels are used.

• Exp3.4: BERT6 [SID] kl + cos → soft. The model uses stacked internal distillation until it reaches the classification layer. Once there, only soft-labels are used.

• Exp3.5: BERT6 [SID] kl + cos + soft. The model uses stacked internal distillation and soft-label distillation all the time during training.

• Exp3.6: BERT6 [SID] kl + cos + soft + hard. Same as Exp3.5, but it includes the hard-labels in Equation 1 with λ = 0.1.

We optimize our models using Adam with an initial learning rate of 2e-5 and a learning-rate scheduler as described by Devlin et al. (2018). We fine-tune BERTbase for 10 epochs, and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. We evaluate the statistical significance of our models using t-tests as described by Dror et al. (2018). All the internal KD results have shown statistical significance with a p-value less than 1e-3 with respect to the standard KD method across the datasets.
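
A minimal sketch of this optimization setup is shown below. The paper only states Adam with an initial learning rate of 2e-5 and a scheduler as in Devlin et al. (2018); the use of AdamW, the linear warm-up/decay schedule, and the warm-up ratio are our assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, train_steps, lr=2e-5, warmup_ratio=0.1):
    """Adam with a linear warm-up/decay schedule (BERT-style fine-tuning).

    train_steps: batches per epoch * number of epochs
                 (e.g., with batch size 32 and 50 epochs for the student).
    warmup_ratio: assumed value; not specified in the paper.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * train_steps),
        num_training_steps=train_steps,
    )
    return optimizer, scheduler
```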

Experiment   Description                              CoLA [8.5k]   QQP [364k]       MRPC [3.7k]      RTE [2.5k]
                                                      MCC           Accuracy / F1    Accuracy / F1    Accuracy

Fine-tuning BERTbase and BERT6 without KD
Exp1.0       BERTbase                                 60.16         91.44 / 91.45    83.09 / 82.96    67.51
Exp1.1       BERT6                                    44.56         90.58 / 90.62    76.23 / 73.72    59.93

Fine-tuning BERT6 with different KD techniques using BERTbase (Exp1.0) as teacher
Exp2.0       BERT6 soft                               41.72         90.61 / 90.65    77.21 / 75.74    62.46
Exp3.0       BERT6 soft + kl                          43.70         91.32 / 91.32    83.58 / 82.46    67.15
Exp3.1       BERT6 soft + cos                         42.64         91.08 / 91.10    79.66 / 78.35    57.04
Exp3.2       BERT6 soft + kl + cos                    42.07         91.37 / 91.38    83.09 / 81.39    66.43
Exp3.3       BERT6 [PID] kl + cos → soft              45.54         91.22 / 91.24    81.62 / 80.12    64.98
Exp3.4       BERT6 [SID] kl + cos → soft              46.09         91.25 / 91.27    82.35 / 81.39    64.62
Exp3.5       BERT6 [SID] kl + cos + soft              43.93         91.21 / 91.22    81.37 / 79.16    66.43
Exp3.6       BERT6 [SID] kl + cos + soft + hard       42.55         91.20 / 91.21    70.10 / 69.68    67.51

Table 1: The development results across four datasets. Experiments 1.0 and 1.1 are trained without any distillation method, whereas experiments 2.0 and 3.X use a different combination of algorithms to distill information. Experiment 2.0 only uses standard knowledge distillation, and it can be considered as a baseline.

             CoLA    QQP            MRPC           RTE
Exp.         MCC     Acc. / F1      Acc. / F1      Acc.
Exp1.0       51.4    71.3 / 89.2    84.9 / 79.9    66.4
Exp2.0       38.3    69.1 / 88.0    81.6 / 73.9    59.7
Exp3.X       41.4    70.9 / 89.1    83.8 / 77.1    62.2

Table 2: The test results from the best models according to the development set. We add Exp1.0 (BERTbase) for reference. Exp2.0 uses BERT6 with standard distillation (soft-labels only), and Exp3.X uses the best internal KD technique with BERT6 as student according to the development set.

Development and Evaluation Results

As shown in Table 1, we perform extensive experiments with BERT6 as a student, where we evaluate different training techniques with and without knowledge distillation. In general, the first thing to notice is that the distillation techniques outperform BERT6 trained without distillation (Exp1.1). While this is not always the case for standard distillation (Exp1.1 vs. Exp2.0 for CoLA), the internal distillation method proposed in this work consistently outperforms both Exp1.1 and Exp2.0 across all datasets. Nevertheless, the gap between the results substantially depends on the size of the data. Intuitively, this is expected behavior since the more data we provide to the teacher, the more knowledge is exposed, and hence, the student reaches a more accurate approximation of the teacher.

Additionally, our internal distillation results are consistently better than the standard soft-label distillation on the test set, as described in Table 2.

Figure 2: Performance vs. parameters trade-off. The points along the lines denote the number of layers used in BERT, which is reflected by the number of parameters on the x-axis.

Analysis

This section provides more insights into our algorithm based on parameter reduction, data size impact, model convergence, self-attention behavior, and error analysis.

Performance vs. Parameters

We analyze the parameter reduction capabilities of our method. Figure 2 shows that BERT6 can easily achieve results similar to those of the original BERTbase model with 12 transformer layers. Note that BERTbase has around 109.4M parameters, which can be broken down into 23.8M parameters related to embeddings and around 85.6M parameters related to transformer layers. The BERT6 student, however, has 43.1M parameters in the transformer layers, which means that the parameter reduction is about 50%, while still performing very similarly to the teacher (91.38 F1 vs. 91.45 F1 for QQP, see Table 1). Also, note that the 0.73% F1 difference between the student trained only on soft-labels and the student trained with our method is statistically significant.
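
The parameter breakdown above can be checked approximately with a short count over the module groups. This is an illustrative snippet assuming the HuggingFace BertModel layout; exact totals may differ slightly (e.g., depending on whether the pooler is counted).

```python
from transformers import BertConfig, BertModel

def count(module):
    # Total number of trainable and non-trainable parameters in a module.
    return sum(p.numel() for p in module.parameters())

teacher = BertModel(BertConfig())                      # 12-layer BERT-base shape
student = BertModel(BertConfig(num_hidden_layers=6))   # BERT6 student shape

print("embeddings:", count(teacher.embeddings))        # roughly 23.8M
print("teacher transformer:", count(teacher.encoder))  # roughly 85.6M (12 layers)
print("student transformer:", count(student.encoder))  # roughly 43.1M (6 layers), ~50% reduction
```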

Moreover, if we keep reducing the number of layers, the performance decays for both student models (see Figure 2). However, the internal distillation method is more resilient at keeping a higher performance. Eventually, with one transformer layer to distill internally, the compression rate is too high for the model to account for an additional boost when we compare BERT1 students trained with standard and internal distillation methods.

Figure 3: The impact of training size for standard vs. internal KD. We experiment with sizes between 1K and +350K.

The Impact of Data Size

We also evaluate the impact of the data size. For this analysis, we fix the student architecture to BERT6, and we only modify the size of the training data. We compare the standard and the internal distillation techniques on the QQP dataset, as shown in Figure 3. Consistently, the internal distillation outperforms the soft-label KD method. However, the gap between the two methods is small when the data size is large, but it tends to increase in favor of the internal KD method when the data size decreases.

Figure 4: Comparing algorithm convergences across epochs. The annotations along the lines denote the layers that have been completely optimized. After the L6 point, only the classification layer is trained.

Student Convergence

We analyze the convergence behavior during training by comparing the performance of the internal distillation algorithms across epochs. We conduct the experiments on the QQP dataset as described in Figure 4. We control for the student architecture, which is BERT6, and exclusively experiment with different internal KD algorithms. The figure shows three experiments: progressive internal distillation (Exp3.3), stacked internal distillation (Exp3.4), and stacked internal distillation using soft-labels all the time (Exp3.5). Importantly, note that Exp3.3 and Exp3.4 do not update the classification layer until around epoch 40, when all the transformer layers have been optimized. Nevertheless, the internal distillation by itself eventually allows the students to reach higher performance across epochs. In fact, Exp3.3 reaches its highest value when the 6th transformer layer is being optimized while the classification layer remains as it was initialized (see epoch 38 in Figure 4). This serves as strong evidence that the internal knowledge of the model can be taught and compressed without even considering the classification layer.

                          Teacher Right (36,967)        Teacher Wrong (3,463)
Method                    Right       Wrong             Right       Wrong
Standard KD (Exp2.0)      35,401      1,566             1,232       2,231
Internal KD (Exp3.2)      36,191      776               750         2,713

Table 3: Right and wrong predictions on the QQP development dataset. Based on the teacher results, we show the number of right and wrong predictions by the students from standard KD (Exp2.0) and internal KD (Exp3.2).

Inspecting the Attention Behavior

We inspect the internal representations learned by the students from standard and internal KD and compare their behaviors against the ones from the teacher. The goal of this experiment is to get a sense of how much the student can compress from the teacher, and how different such representations are from a student trained on soft-labels in a standard KD setting. For this experiment, we use the QQP dataset and BERT6 as a student. The internally-distilled student corresponds to experiment 3.2, and the soft-label student comes from experiment 2.0 (see Table 1). Figure 5 shows the compression effectiveness of the internally-distilled student with respect to the teacher. Even though the model is skipping one layer for every two layers of the teacher, the student is still able to replicate the behavior taught by the teacher. While the internal representations of the student with standard KD mainly serve a general purpose (i.e., attending to the separation token while spotting connections with the word college), the representations are not the ones intended to be transferred from the teacher. This means that the original goal of compressing a model does not hold entirely, since its internal behavior is quite different from that of the teacher (see Figure 5 for the KL divergence of each student).
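
To inspect attention behavior as in this analysis, one can request the attention tensors from both models and compare matched heads. The snippet below is our own sketch assuming the HuggingFace API; it is not the code used to produce Figure 5, and the layer/head indices in the usage note are only an example of a matched pair.

```python
import torch
from transformers import BertTokenizer, BertModel

def head_kl(teacher, student, tokenizer, text, layer_t, layer_s, head, eps=1e-12):
    """KL divergence between one teacher attention head and the matched student head.

    Assumes both models are BertModel-like and return attention tensors of
    shape (batch, heads, seq, seq) when called with output_attentions=True.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        p = teacher(**inputs, output_attentions=True).attentions[layer_t][0, head]
        q = student(**inputs, output_attentions=True).attentions[layer_s][0, head]
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    # Sum over the attended positions, then average over the query positions.
    return (p * (p.log() - q.log())).sum(dim=-1).mean()
```

For example, head_kl(teacher, student, tokenizer, "how do youtube channels make money?", layer_t=9, layer_s=4, head=7) compares one teacher head with its matched student head; the indices are illustrative only.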

Figure 5: Attention comparison for head 8 in layer 5, each student with its corresponding head KL-divergence loss. The KL-divergence loss for the given example across all matching layers between the students and the teacher is 2.229 and 0.085 for the standard KD and internal KD students, respectively.

No. 1 | Class: 1 | Teacher: 1 (0.9999) | Std KD: 1 (0.9999) | Int KD: 0 (0.4221)
    Q1: if donald trump loses the general election, will he attempt to seize power by force claiming the election was fraudulent?
    Q2: how will donald trump react if and when he loses the election?

No. 2 | Class: 0 | Teacher: 0 (0.0429) | Std KD: 0 (1.2e-4) | Int KD: 1 (0.9987)
    Q1: can depression lead to insanity?
    Q2: does stress or depression lead to mental illness?

No. 3 | Class: 1 | Teacher: 1 (0.9998) | Std KD: 0 (0.0017) | Int KD: 1 (0.8868)
    Q1: can i make money by uploading videos on youtube (if i have subscribers)?
    Q2: how do youtube channels make money?

No. 4 | Class: 0 | Teacher: 0 (0.0203) | Std KD: 1 (0.9999) | Int KD: 0 (0.2158)
    Q1: what are narendra modi's educational qualifications?
    Q2: why is pmo hiding narendra modi's educational qualifications?

Table 4: QQP development samples where the teacher predictions are right and only one of the students is wrong. We show the predicted label along with the probability for such prediction in parentheses. We also provide the ground-truth label in the class column.

Error Analysis

In our internal KD method, the generalization capabilities of the teacher are replicated in the student model. This also implies that the student will potentially make the mistakes of the teacher. In fact, when we compare a student only trained on soft-labels (Exp2.0) against a student trained with our method (Exp3.2), we can see in Table 3 that the numbers of the latter align better with the teacher numbers for both wrong and right predictions. For instance, when the teacher is right (36,967 samples), our method is right on 97.9% of the same samples (36,191), whereas standard distillation provides a rate of 95.7% (35,401), with more than twice as many mistakes as our method (1,566 vs. 776). On the other hand, when the teacher is wrong (3,463 samples), the student in our method makes more mistakes and provides fewer correct predictions than the student from standard KD. Nevertheless, the overall score of the student in our method significantly exceeds the score of the student trained in a standard KD setting.

We also inspect the samples where the teacher and only one of the students are right. The QQP samples 1 and 2 in Table 4 show wrong predictions by the internally-distilled student (Exp3.2) that are not consistent with the teacher. For sample 1, although the prediction is 0, the probability output (0.4221) is very close to the threshold (0.5). Our intuition is that the internal distillation method had a regularization effect on the student such that, considering that question 2 is much more specific than question 1, it does not allow the student to confidently tell whether the questions are similar or not. Also, it is worth noting that the standard KD student is extremely confident about the prediction (0.9999), which may not be ideal since this can be a sign of over-fitting or memorization. For sample 2, although the internally-distilled student is wrong (according to the ground-truth annotation and the teacher), the questions are actually related, which suggests that the student model is capable of disagreeing with the teacher while still generalizing well. Samples 3 and 4 show successful cases for the internally-distilled student, while the standard KD student fails.
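
The counts in Table 3 amount to bucketing student predictions by whether the teacher was right on the same samples. A small sketch of that bookkeeping follows; the function and variable names are hypothetical.

```python
from collections import Counter

def agreement_buckets(gold, teacher_pred, student_pred):
    """Count student right/wrong predictions separately for samples the teacher
    got right and samples the teacher got wrong (the layout of Table 3)."""
    buckets = Counter()
    for y, t, s in zip(gold, teacher_pred, student_pred):
        teacher_key = "teacher_right" if t == y else "teacher_wrong"
        student_key = "student_right" if s == y else "student_wrong"
        buckets[(teacher_key, student_key)] += 1
    return buckets
```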

Conclusions

We propose a new extension of the KD method that effectively compresses a large model into a smaller one, while still preserving a performance similar to that of the original model. Unlike the standard KD method, where a student only learns from the output probabilities of the teacher, we teach our smaller models by also revealing the internal representations of the teacher. Besides preserving a similar performance, our method effectively compresses the internal behavior of the teacher into the student. This is not guaranteed in the standard KD method, which can potentially affect the generalization capabilities initially intended to be transferred from the teacher. Finally, we validate the effectiveness of our method by consistently outperforming the standard KD technique on four datasets of the GLUE benchmark.

References

Bengio, Y. 2009. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.

Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019a. What Does BERT Look At? An Analysis of BERT's Attention. CoRR abs/1906.04341.

Clark, K.; Luong, M.; Khandelwal, U.; Manning, C. D.; and Le, Q. V. 2019b. BAM! Born-Again Multi-Task Networks for Natural Language Understanding. CoRR abs/1907.04829.

Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Dolan, W. B., and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dror, R.; Baumer, G.; Shlomov, S.; and Reichart, R. 2018. The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1383–1392. Association for Computational Linguistics.

Han, S.; Mao, H.; and Dally, W. J. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149.

He, Q.; Wen, H.; Zhou, S.; Wu, Y.; Yao, C.; Zhou, X.; and Zou, Y. 2016. Effective Quantization Methods for Recurrent Neural Networks. CoRR abs/1611.10176.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.

Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2017. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. The Journal of Machine Learning Research 18(1):6869–6898.

Lample, G., and Conneau, A. 2019. Cross-lingual Language Model Pretraining. CoRR abs/1901.07291.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019a. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019b. Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4487–4496. Florence, Italy: Association for Computational Linguistics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019c. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692.

Mirzadeh, S.-I.; Farajtabar, M.; Li, A.; and Ghasemzadeh, H. 2019. Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher. arXiv preprint arXiv:1902.03393.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving Language Understanding by Generative Pre-Training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1(8).

Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2014. FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550.

Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108.

Sun, S.; Cheng, Y.; Gan, Z.; and Liu, J. 2019. Patient Knowledge Distillation for BERT Model Compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4314–4323. Hong Kong, China: Association for Computational Linguistics.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. CoRR abs/1706.03762.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. CoRR abs/1804.07461.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2018. Neural Network Acceptability Judgments. CoRR abs/1805.12471.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
