Self-Knowledge Distillation in Natural Language Processing

Sangchul Hahn, Heeyoul Choi
Handong Global University, Pohang, South Korea
[email protected], [email protected]

Abstract

Since deep learning became a key player in natural language processing (NLP), many deep learning models have been showing remarkable performances in a variety of NLP tasks, and in some cases they even outperform humans. Such high performance can be explained by the efficient knowledge representation of deep learning models. While many methods have been proposed to learn more efficient representations, knowledge distillation from pretrained deep networks suggests that we can use more information from the soft target probabilities to train other neural networks. In this paper, we propose a new knowledge distillation method, self-knowledge distillation, based on the soft target probabilities of the training model itself, where multimode information is distilled from the word embedding space right below the softmax layer. Due to the time complexity, our method approximates the soft target probabilities. In experiments, we applied the proposed method to two different and fundamental NLP tasks: language modeling and neural machine translation. The experiment results show that our proposed method improves performance on these tasks.

1 Introduction

Deep learning has achieved the state-of-the-art performance on many machine learning tasks, such as image classification, object recognition, and neural machine translation (He et al., 2016; Redmon and Farhadi, 2017; Vaswani et al., 2017), and has outperformed humans on some tasks. In deep learning, one of the critical points for success is to learn better representations of data with many layers (Bengio et al., 2013) than other algorithms. In other words, if we make a model learn a better representation of the data, the model can show better performance.

In natural language processing (NLP) tasks like language modeling (LM) (Bengio et al., 2003; Mikolov et al., 2013) and neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015), the models are trained to generate many words per sentence, which is a sequence of classification steps, in each of which they choose a target word among all the words in the dictionary. That is why LM and NMT are usually trained with the sum of cross-entropies over the target sentence. Thus, although language related tasks are more about generation than classification, the models estimate target probabilities with the softmax operation on the previous neural network layers, and the target distributions are provided as one-hot representations. As data representation in NLP models, word symbols should also be represented as vectors.

In this paper, we focus on the word embedding and the estimation of the target distribution. In NLP, word embedding is a step that translates word symbols (indices in the vocabulary) to vectors in a continuous vector space, and it is considered a standard approach to handle symbols in neural networks. When two words have semantically or syntactically similar meanings, they are represented closely to each other in the word embedding space. Thus, even when a prediction is not exactly correct, the predicted word might not be so bad if the estimated word is very close to the target word in the embedding space, like 'programming' and 'coding'. That is, the word embedding can be used to check how wrong a prediction is. There are several methods to obtain word embedding matrices (Mikolov et al., 2013; Pennington et al., 2014), in addition to neural language models (Bengio et al., 2003; Mikolov et al., 2010). Recently, several approaches have been proposed to make more efficient word embedding matrices, usually based on contextual information (Søgaard et al., 2017; Choi et al., 2017).

On the other hand, knowledge distillation was proposed by (Hinton et al., 2015) to train new, usually shallower, networks using the hidden knowledge in the probabilities produced by pretrained networks. It shows that there is knowledge not only in the target probability corresponding to the target class but also in the other class probabilities estimated by the trained model. In other words, the other class probabilities can contain additional information describing the input data samples differently even when the samples are in the same class. Also, samples from different classes could produce similar distributions to each other.

In this paper, we propose a new knowledge distillation method, self-knowledge distillation (SKD), based on the word embedding of the training model itself. That is, self-knowledge is distilled from the predicted probabilities produced by the training model, expecting the model to have more information as it is trained more. In conventional knowledge distillation, the knowledge is distilled from the estimated probabilities of pretrained (or teacher) models. In contrast, in the proposed SKD, knowledge is distilled from the current model during the training process, and the knowledge is hidden in the word embedding. During the training process, the word embedding reflects the relationship between words in the vector space. A word close to the target word in the vector space is expected to have a similar distribution after softmax, and such information can be used to approximate the soft target probability as in knowledge distillation. We apply our proposed method to two popular NLP tasks: LM and NMT. The experiment results show that our proposed method improves the performance of both tasks. Moreover, SKD reduces overfitting problems, which we believe is because SKD uses more information.

The paper is organized as follows. Background is reviewed in Section 2. In Section 3, we describe our proposed method, SKD. Experiment results are presented and analyzed in Section 4, followed by Section 5 with the conclusion.
2 Background

In this section, we briefly review the cross-entropy and knowledge distillation. Also, since our proposed method is based on the word embedding, the layer right before the softmax operation, the word embedding process is summarized.

2.1 Cross Entropy

For classification with C classes, neural networks produce class probabilities p_i, i ∈ {1, ..., C}, by using a softmax output layer which calculates the class probabilities from the logit z_i, considering the other logits, as follows:

    p_i = exp(z_i) / Σ_k exp(z_k).    (1)

In most classification problems, the objective function for a single sample is defined by the cross-entropy as follows:

    J(θ) = − Σ_k y_k log p_k,    (2)

where y_k and p_k are the target and predicted probabilities. The cross-entropy can be simply calculated by

    J(θ) = − log p_t,    (3)

when the target probability y is a one-hot vector defined as

    y_k = 1 if k = t (the target class), and y_k = 0 otherwise.    (4)

Note that the cross-entropy objective function says only how likely input samples belong to the corresponding target class, and it does not provide any other information about the input samples.
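As a concrete illustration (our own minimal sketch, not code from the paper), the following PyTorch snippet computes Eq. (1) and checks that Eq. (2) with the one-hot target of Eq. (4) reduces to Eq. (3); the logit values and the number of classes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Illustrative logits z_i for C = 4 classes and target class t = 2.
z = torch.tensor([1.5, -0.3, 2.1, 0.0])
t = 2

p = F.softmax(z, dim=-1)                                # Eq. (1)
y = F.one_hot(torch.tensor(t), num_classes=4).float()   # Eq. (4): one-hot target

loss_eq2 = -(y * torch.log(p)).sum()                    # Eq. (2)
loss_eq3 = -torch.log(p[t])                             # Eq. (3)
assert torch.allclose(loss_eq2, loss_eq3)
```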
2.2 Knowledge Distillation

A well trained deep network model contains meaningful information (or knowledge) extracted from the training datasets for a specific task. Once a deep model is trained for a task, the trained model can be used to train new, smaller (shallower or thinner) networks, as shown in (Hinton et al., 2015; Romero et al., 2014). This approach is referred to as knowledge distillation.

Basically, knowledge distillation provides more information to new models for training and improves the new models' performance. Thus, when a new model, which is usually smaller, is trained with the distilled knowledge from the trained deep model, it can achieve a similar (or sometimes even better) performance compared to the pretrained deep model.

In the pretrained model, knowledge lies in the class probabilities produced by the softmax of the model as in Eq. (1). All probability values, including the target class probability, describe relevant information about the input data. Thus, instead of the one-hot representation of the target label, where only the target class is considered in the cross-entropy, all probabilities over the whole classes from the pretrained model can provide more information about the input data in the cross-entropy, and can teach new models more efficiently. All probabilities from the pretrained model are considered as soft target probabilities.

In a photo tagging task, depending on the other class probabilities, we understand the input image better than with just the target class. When the class 'mouse' has the highest probability, if 'mascot' also has a relatively high probability, then the image is probably 'mickey mouse'. If 'button' or 'pad' has a high probability, the image is probably a mouse as a computer device. The other class probabilities carry some extra information, and such knowledge in the pretrained model can be transferred to a new model by using a soft target distribution of the training set. When the target labels are available, the objective function is a weighted sum of the conventional cross-entropy with the correct labels and the cross-entropy with the soft target distribution, given by

    J(θ) = −(1 − λ) log p_t − λ Σ_k q_k log p_k,    (5)

where p_k is the probability for class k produced by the current model with parameters θ, and q_k is the soft target probability from the pretrained model. λ controls the amount of knowledge from the trained model. Note that conventional knowledge distillation extracts knowledge from a pretrained model; in this paper, we propose to extract knowledge from the current model itself without any pretrained model.

Furthermore, (Furlanello et al., 2018) recently showed that knowledge distillation can be useful to train a new model which has the same size and the same architecture as the pretrained model. They train a teacher model first, and then train a student model with the distilled knowledge from the teacher model. Their experiment results show that the student models outperform the teacher model. Also, even when the teacher model has a less powerful architecture, the knowledge from the trained teacher model can boost student models which have more powerful (or bigger) architectures. It means that even when knowledge is distilled from a relatively weak model, it can be useful to train a bigger model.
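The objective in Eq. (5) can be sketched as below. This is a minimal single-sample illustration under the assumption that the soft targets q_k come from a pretrained teacher's softmax; the logits and the value of λ are arbitrary.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, t, lam=0.5):
    """Eq. (5): weighted sum of the hard-label cross-entropy and the
    cross-entropy against the teacher's soft target distribution."""
    log_p = F.log_softmax(student_logits, dim=-1)     # log p_k of the new model
    q = F.softmax(teacher_logits, dim=-1).detach()    # soft targets q_k from the pretrained model
    hard = -log_p[t]                                  # -log p_t
    soft = -(q * log_p).sum()                         # -sum_k q_k log p_k
    return (1.0 - lam) * hard + lam * soft

# Arbitrary logits for a single sample with C = 5 classes.
print(kd_loss(torch.randn(5), torch.randn(5), t=1, lam=0.3))
```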
2.3 Word Embedding

Word embedding is to convert the symbolic representation of words into a vector representation with semantic and syntactic meanings, which reflects the relations between words. Various word embedding methods have been proposed to learn a word embedding matrix, including CBOW, Skip-gram (Mikolov et al., 2013), and GloVe (Pennington et al., 2014). The trained embedding matrix can be transferred to other models like LM or NMT (Ahn et al., 2016).

CBOW predicts a word given its neighbor words, and Skip-gram predicts the neighbor words given a word. They use feedforward layers, and the last layer of CBOW includes the word embedding matrix W, as follows:

    z = W h + b,    (6)

where b is a bias, h is the hidden layer, and z is the logits for the softmax operation.

Words in the embedding space have semantic and syntactic similarities, such that two similar words are close in the space. Thus, when the classification is not correct, the error can be interpreted differently depending on the similarity between the predicted word and the target word. For example, when the target word is 'learning', if the predicted word is 'training', then the prediction is less wrong than other words like 'flower' or 'internet'. In this paper, we utilize such hidden information (or knowledge) in the word embedding space while training. Fig. 1 shows where the word embedding is located in LM and NMT, respectively.

Figure 1: Network architectures of LM ((a) Language Model) and NMT ((b) NMT Model). Word embedding is presented as gray boxes in the models.
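Below is a small sketch of the output projection in Eq. (6) and of the distance comparison just described. The embedding matrix is randomly initialized here, and the word indices standing in for 'learning', 'training', and 'flower' are hypothetical; in a trained embedding the first distance would be expected to be the smaller one.

```python
import torch

V, d = 10000, 300                          # illustrative vocabulary size and embedding dimension
W = torch.randn(V, d)                      # word embedding matrix
b = torch.zeros(V)
h = torch.randn(d)                         # hidden state for the current context

z = W @ h + b                              # Eq. (6): logits fed to the softmax operation

# Interpreting an error by distance in the embedding space.
t, n_close, n_far = 42, 43, 9000           # hypothetical indices for 'learning', 'training', 'flower'
d_related = torch.norm(W[t] - W[n_close])  # small for a semantically related word (once trained)
d_unrelated = torch.norm(W[t] - W[n_far])  # large for an unrelated word
```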
3 Self-Knowledge Distillation

We propose a new learning method, self-knowledge distillation (SKD), which distills knowledge from the currently training model itself, following the conventional knowledge distillation. In this section, we describe an algorithm for SKD and its application to language modeling and neural machine translation.

3.1 SKD Equations

In order to apply knowledge distillation on a currently training model, we need to obtain soft target probabilities q_k as in Eq. (5) for all classes, but they are not available explicitly. However, when the model is trained enough, the word embedding has such information implicitly. If a word w_i is close to w_j in the embedding space, the probability p_i would be close to p_j for a given input sample.

When t is the target class, we calculate the soft target probabilities q_k based on the word embedding. First, we assume that q_t should be high, and that if w_k is close to w_t in the embedding space, q_k should also be high. That is, the Euclidean distance between words is used to estimate the soft target probability. The other class probabilities (or soft target probabilities) q_k can be obtained by

    q_k = (1/Z) exp{−σ ||w_t − w_k||_2},    (7)

where ||·||_2 is the l2-norm, and Z is a normalization term. σ is a scale parameter whose value depends on the average distance to the corresponding nearest neighbors in the word embedding space. However, due to the expensive computational cost, we do not calculate q_k for all classes; we choose just one of the other classes, namely the predicted class of the current model.

Assuming that the model predicts a class n for a given input sample, only q_t and q_n are used as distilled knowledge. We clip the q_n value at 0.5, meaning that the class n cannot be more correct than the real target t, so Eq. (7) becomes

    q_n = min{exp{−σ ||w_t − w_n||_2}, 0.5},
    q_t = 1 − q_n,    (8)

where q_n + q_t = 1. That is, we consider only two soft target probabilities, as shown in Fig. 2. Note that we use the Euclidean distance between w_t and w_n to calculate q_n, but other approaches like the inner product would be possible.

Now, the objective function of SKD becomes similar to Eq. (5), and is defined by

    J(θ) = −(1 − λ) log p_t − λ (q_t log p_t + q_n log p_n),    (9)

where the second term of Eq. (5) is approximated by λ(q_t log p_t + q_n log p_n), ignoring the other class probabilities. Eq. (9) can be rewritten simply as follows:

    J(θ) = −(1 − λ q_n) log p_t − λ q_n log p_n.    (10)

Eqs. (9) and (10) can be understood in three cases. First, if the prediction is correct (n = t), then Eq. (9) is the same as the conventional cross-entropy objective. Second, if w_n is far from w_t in the word embedding space, then q_n is close to zero and Eq. (9) becomes close to the conventional cross-entropy objective. Finally, if w_n is close to w_t (e.g., q_n = 0.4), Eq. (9) approximates the soft target probability with only the two classes t and n, and the model is trained to produce probabilities for classes t and n close to q_t and q_n. This approach trains the model with different targets for different input samples.

Fig. 2 presents how SKD obtains the simplified soft target distribution based on the distance between the target and estimated vectors in the word embedding space.

Figure 2: Given a target class t, soft target probabilities are obtained based on the distance in the word embedding space. However, only the target class and the predicted class have soft target probabilities in SKD.
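A single-sample sketch of Eqs. (8)-(10) follows. The embedding matrix, logits, σ, and λ are illustrative placeholders, and the soft target q_n is treated as a constant (detached) here, which is an assumption of this sketch rather than a detail stated above.

```python
import torch
import torch.nn.functional as F

def skd_loss(logits, emb, t, lam=1.0, sigma=0.1):
    """Approximate SKD objective of Eqs. (8)-(10) for one sample.
    logits: (V,) output logits; emb: (V, d) word embedding matrix; t: target index."""
    log_p = F.log_softmax(logits, dim=-1)
    n = int(logits.argmax())                   # predicted class of the current model
    if n == t:                                 # correct prediction: plain cross-entropy
        return -log_p[t]
    with torch.no_grad():                      # soft target treated as a constant
        dist = torch.norm(emb[t] - emb[n])     # Euclidean distance between w_t and w_n
        q_n = torch.clamp(torch.exp(-sigma * dist), max=0.5)       # Eq. (8), clipped at 0.5
    return -(1.0 - lam * q_n) * log_p[t] - lam * q_n * log_p[n]    # Eq. (10)

V, d = 1000, 64                                # illustrative sizes
print(skd_loss(torch.randn(V), torch.randn(V, d), t=7))
```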
3.2 SKD Algorithm

Since SKD distills knowledge from the current training model, at the beginning of the training process the model does not contain relevant information. That is, we cannot extract any knowledge from the training model at the beginning. Thus, we start the training process without knowledge distillation at first and gradually increase the amount of knowledge distillation as the training iterations go on. So, our algorithm starts with the conventional cross-entropy objective function in Eq. (3), and after training the model for a while, it gradually transits to Eq. (10). To implement the transition, another parameter α is introduced to Eq. (10), leading to the final objective function as follows:

    J(θ) = −(1 − α λ q_n) log p_t − α λ q_n log p_n.    (11)

α starts from 0, with which Eq. (11) becomes the conventional cross-entropy. After K iterations, α increases by η per iteration and eventually goes up to 1, with which Eq. (11) becomes the same as Eq. (9). In our experiments, we used a simplified equation as in Eq. (12), without λ, so that the objective function relies gradually more on the soft target probabilities as training goes:

    J(θ) = −(1 − α q_n) log p_t − α q_n log p_n.    (12)

Table 1 summarizes the proposed SKD algorithm.

Table 1: Self-Knowledge Distillation Algorithm

Algorithm 1: SKD Algorithm
  Initialize the model parameters θ
  Initialize α = 0 and σ
  (See the experiments for σ values.)
  Repeat K times:
    Train the network based on the cross-entropy in Eq. (3)
  Repeat until convergence:
    Train the network based on the SKD objective function in Eq. (12)
    Update α with α + η
  (See the experiments for η values.)
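The warm-up and α schedule of Table 1 and Eq. (12) can be sketched as a toy training loop. The model, the random data, and the values of K, η, and σ below are placeholders for illustration, not the settings used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d, K, eta, sigma = 50, 16, 100, 0.01, 0.1      # illustrative sizes and schedule constants
emb = nn.Embedding(V, d)
out = nn.Linear(d, V)
opt = torch.optim.SGD(list(emb.parameters()) + list(out.parameters()), lr=0.1)

alpha = 0.0
for step in range(300):
    x = torch.randint(V, (8,))                    # toy "context" tokens
    t = torch.randint(V, (8,))                    # toy "target" tokens
    log_p = F.log_softmax(out(emb(x)), dim=-1)    # (8, V) log-probabilities
    rows = torch.arange(8)

    if step < K:                                  # warm-up: cross-entropy only, Eq. (3)
        loss = -log_p[rows, t].mean()
    else:                                         # SKD phase, Eq. (12)
        n = log_p.argmax(dim=-1)                  # predicted classes
        with torch.no_grad():                     # soft targets treated as constants
            dist = torch.norm(emb.weight[t] - emb.weight[n], dim=-1)
            q_n = torch.clamp(torch.exp(-sigma * dist), max=0.5)
        loss = (-(1 - alpha * q_n) * log_p[rows, t]
                - alpha * q_n * log_p[rows, n]).mean()
        alpha = min(1.0, alpha + eta)             # update alpha with alpha + eta, capped at 1

    opt.zero_grad()
    loss.backward()
    opt.step()
```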
3.3 NLP Tasks

SKD is applied to two different NLP tasks: language modeling (LM) and neural machine translation (NMT). Although LM and NMT are actually sentence generation rather than classification, they have classification steps to generate the words of the target sentence. Also, the sum of cross-entropies over the words in the sentence is adopted as their objective function.

In addition, to check whether SKD is robust against errors in the word embedding space, we also evaluate SKD when we add Gaussian noise in the word embedding space for the target words in the decoder.
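This noise variant can be sketched as below, assuming it simply perturbs the embedded target words before they enter the decoder; the standard deviation is our own illustrative value, not one reported in the paper.

```python
import torch

def noisy_target_embeddings(emb_vectors, std=0.01):
    """'+Noise' baseline sketch: add Gaussian noise to the embedded target
    words fed to the decoder; std is an illustrative assumption."""
    return emb_vectors + std * torch.randn_like(emb_vectors)
```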

4 Experiments

To evaluate self-knowledge distillation, we compare it to baseline models for language modeling and neural machine translation.

4.1 Dataset

For language modeling, we use two different datasets: Penn TreeBank (PTB) and WiKi-2. PTB was made by (Marcus et al., 1993), and we use the pre-processed version by (Mikolov et al., 2010). In the PTB dataset, the train, valid, and test sets have about 887K, 70K, and 78K tokens, respectively, and the vocabulary size is 10K. The WiKi-2 dataset introduced by (Merity et al., 2016) consists of sentences extracted from Wikipedia. It has about 2M, 217K, and 245K tokens for the train, valid, and test sets. Its vocabulary size is about 33K. We did not apply additional pre-processing to the PTB dataset. The WiKi-2 dataset is pre-tokenized, so we only added an end-of-sentence token to every sentence.

For machine translation, we evaluated models on three different translation tasks (En-Fi, Fi-En, and En-De) with the available corpora from WMT'15 (http://www.statmt.org/wmt15/translation-task.html). The dictionary size is 10K for the En-Fi and Fi-En translation tasks, and 30K for the En-De translation task.

4.2 Language Modeling

Language modeling (LM) has been used in many different NLP tasks, like automatic speech recognition (ASR) and machine translation (MT), to capture the syntactic and semantic structure of a natural language. Neural network-based language models (NNLM) and recurrent neural network language models (RNNLM) catch the syntactic and semantic regularities of an input language (Bengio et al., 2003; Mikolov et al., 2013). RNNLM is our baseline, which consists of a single LSTM layer and a single feed-forward layer with ReLU (Le et al., 2015).

We evaluate four models: Baseline, Noise (with Gaussian noise on the word embedding), SKD, and Noise+SKD. To show that the information provided by SKD is more knowledgeable than random noise, we test a noise-injected model which injects only Gaussian noise into the word embedding space. The word dimension is set to 500 and the number of hidden nodes to 400 for all models. We set σ in the SKD algorithm in Table 1 to 0.1 for both PTB and WiKi-2, and η to 0.0002 for PTB and 0.00011 for WiKi-2. We apply the SKD objective function after 500 batches for PTB and 900 batches for WiKi-2; note that the WiKi-2 data is larger than PTB.

The evaluation metric is the negative log-likelihood (NLL) of each sentence (the lower, the better). Table 2 presents NLLs on the test data of the two datasets with different models. It shows that our proposed methods (both noise injection and self-knowledge distillation) improve the results on the LM task. Note that SKD provides more knowledgeable information than Gaussian noise.

Table 2: NLLs for LM with different models on PTB and Wiki-2.

Model         PTB      Wiki-2
Baseline      101.40   119.49
+Noise        101.28   118.70
+SKD           99.38   116.85
+Noise+SKD     97.41   116.60
4.3 Neural Machine Translation

NMT has been widely used in machine translation research because of its powerful performance and end-to-end training (Sutskever et al., 2014; Bahdanau et al., 2015; Johnson et al., 2017). Attention-based NMT models consist of an encoder, a decoder, and the attention mechanism (Bahdanau et al., 2015), which is our baseline in this paper, except for replacing GRU with LSTM and using BPE (Sennrich et al., 2016). The encoder takes the sequence of source words in word embedding form. The decoder works in a similar way to LM, except for the attention mechanism. See (Bahdanau et al., 2015) for NMT and the attention mechanism in detail.

In the experiments, we check how much SKD can improve the model's performance using this simple baseline architecture. Since SKD modifies only the objective function, we believe that the improvement by SKD is regardless of model architectures.

Table 3 shows that our proposed method improves NMT performance by around 1 BLEU score. For qualitative comparison, some translation results are presented below. The overall quality of translation of the SKD model looks better than the baseline model's. In other words, when the BLEU scores are similar, the sentences translated by the SKD model look better.

Table 3: BLEU scores on the test sets for En-Fi, Fi-En and En-De with two different beam widths. The scores on the development sets are in the parentheses.

Model          Beam width 1     Beam width 12
En-Fi
  Baseline      7.29 (8.28)      9.01 (9.85)
  +Noise        7.68 (8.50)      9.35 (9.53)
  +SKD          8.36 (9.43)      9.87 (10.30)
  +Noise+SKD    8.81 (8.95)     10.13 (10.47)
Fi-En
  Baseline     10.42 (11.39)    11.89 (12.78)
  +Noise       10.74 (11.80)    12.39 (13.35)
  +SKD         10.70 (12.52)    12.43 (13.82)
  +Noise+SKD   11.87 (12.92)    13.16 (14.13)
En-De
  Baseline     19.72 (19.28)    22.25 (20.91)
  +Noise       20.69 (19.68)    22.40 (20.92)
  +SKD         20.29 (20.41)    22.59 (21.75)
  +Noise+SKD   21.16 (20.34)    23.07 (21.64)

• (src) Hallituslähteen mukaan tämä on yksi monista ehdotuksista, joita harkitaan.
  (trg) A governmental source says this is one of the many proposals to be considered.
  (baseline) According to government revenue, this is one of the many proposals that are being considered to be considered.
  (SKD) According to the government, this is one of the many proposals that are being considered.

• (src) Meillä on hyvä tunne tuotantoketjun vahvuudesta.
  (trg) We feel very good about supply chain capability.
  (baseline) We have good knowledge of the strength of the production chain.
  (SKD) We have a good sense of the strength of the production chain.

• (src) En ole oikein tajunnut, että olen näin vanha.
  (trg) I haven't really realized that I'm this old.
  (baseline) I have not been right to realise that I am so old.
  (SKD) I am not quite aware that I am so old.

• (src) Ne vaikuttavat vasta tulevaisuudessa.
  (trg) They'll have an impact in the future only.
  (baseline) They will only be affected in the future.
  (SKD) They will only affect in the future.
Fig. 3 shows the trajectory of the q_n value and the scheduling of the α value in Eq. (12) during training of the En-Fi NMT model. As expected, the (unclipped) q_n value becomes larger than 0.5, which means that w_n (the predicted word vector) is close enough to w_t (the target word vector). Fig. 3(b) shows the scheduled value of α in Eq. (12). The α value starts from 0 and increases up to 1 while training. The model is trained with only the cross-entropy for K iterations, and then, when the model has captured enough knowledge to be distilled, α increases to utilize knowledge from the model.

Figure 3: (a) Change of the q_n value during NMT model training for the En-Fi translation task, and (b) scheduling of the α value in Eq. (12) of NMT training for the En-Fi translation task. (a) shows that as the model is trained more, the q_n value becomes closer to the target.
Also, as shown in Fig. 4, the SKD models are not (or are more slowly) overfitted to the training data. We believe this is because SKD provides more information, distilled by the training model itself, which prevents overfitting. Note that there is no significant difference between the improvements by SKD and Noise alone, but Noise+SKD improves further. It implies that SKD provides a different kind of information from noise, while the synergy effect between SKD and noise needs more research.

Figure 4: BLEU scores of the validation data while training on the En-Fi corpus with four different models: Baseline, +Noise, +SKD, and +Noise+SKD. The vertical axis indicates the BLEU score and the horizontal axis the number of training iterations.
5 Conclusion

We proposed a new knowledge distillation method, self-knowledge distillation, which distills knowledge from the probabilities of the currently training model itself. The method uses only two soft target probabilities that are obtained based on the word embedding space. The experiment results with language modeling and neural machine translation show that our method improves performance. This method can be straightforwardly applied to other tasks where the cross-entropy is used.

As future work, we want to apply SKD to other applications with different model architectures, to show that SKD depends neither on the task nor on the model architecture. For image classification tasks, if we abuse the term 'word embedding' to refer to the layer right before the softmax operation, it may be possible to apply SKD in a similar way, although it is not guaranteed that comparable image classes are closely located in that embedding space for image-related tasks. Also, we can develop an automatic way to set parameters like α in Eq. (12), and generalize the equation for q_n in Eq. (8).

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033341), and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00749, Development of virtual network management technology based on artificial intelligence).

References

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. CoRR abs/1608.00318:1–10.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. Int'l Conf. on Learning Representations (ICLR).

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2003. A neural probabilistic language model. The Journal of Machine Learning Research 3:1137–1155.

Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2017. Context-dependent word representation for neural machine translation. Computer Speech and Language 45:149–160.

Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. pages 1602–1611.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pages 770–778.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.

Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR abs/1504.00941.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR abs/1609.07843.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. Int'l Conf. on Learning Representations (ICLR).

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. pages 1045–1048.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pages 6517–6525.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. CoRR abs/1412.6550.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics. pages 1715–1725.

Anders Søgaard, Yoav Goldberg, and Omer Levy. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers. pages 765–774.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. pages 6000–6010.
