
Natural Language Generation for Effective Knowledge Distillation

Raphael Tang, Yao Lu, and Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo

Abstract

Knowledge distillation can effectively transfer knowledge from BERT, a deep language representation model, to traditional, shallow word embedding-based neural networks, helping them approach or exceed the quality of other heavyweight language representation models. As shown in previous work, critical to this distillation procedure is the construction of an unlabeled transfer dataset, which enables effective knowledge transfer. To create transfer set examples, we propose to sample from pretrained language models fine-tuned on task-specific text. Unlike previous techniques, this directly captures the purpose of the transfer set. We hypothesize that this principled, general approach outperforms rule-based techniques. On four datasets in sentiment classification, sentence similarity, and linguistic acceptability, we show that our approach improves upon previous methods. We outperform OpenAI GPT, a deep pretrained transformer, on three of the datasets, while using a single-layer bidirectional LSTM that runs at least ten times faster.

1 Introduction

That bigger neural networks plus more data equals higher quality is a tried-and-true formula. In the natural language processing (NLP) literature, the recent darling of this mantra is the deep, pretrained language representation model. After pretraining hundreds of millions of parameters on vast amounts of text, models such as BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018) achieve remarkable state-of-the-art results in question answering, sentiment analysis, and sentence similarity tasks, to list a few.

Does this progress mean, then, that classic, shallow word embedding-based neural networks are noncompetitive? Not quite. Recently, Tang et al. (2019) demonstrate that knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015) can transfer knowledge from BERT to small, traditional neural networks, helping them approach or exceed the quality of much larger pretrained long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) language models, such as ELMo (Embeddings from Language Models; Peters et al., 2018).

As shown in Tang et al. (2019), crucial to knowledge distillation is constructing a transfer dataset of unlabeled examples. In this paper, we explore how to construct such an effective transfer set. Previous approaches comprise manual data curation, a meticulous method where the end user manually selects a corpus similar enough to the present task, and rule-based techniques, where a transfer set is fabricated from the training set using a set of rules. However, these rules only indirectly model the purpose of the transfer set, which is to provide more input drawn from the task-specific data distribution. Hence, we instead propose to construct the transfer set by generating text with pretrained language models fine-tuned on task-specific text. We validate our approach on four small- to mid-sized datasets in sentiment classification, sentence similarity, and linguistic acceptability.

We claim two contributions: first, we elucidate a novel approach for constructing the transfer set in knowledge distillation. Second, we are the first to outperform OpenAI GPT (Radford et al., 2018) in sentiment classification and sentence similarity with a single-layer bidirectional LSTM (BiLSTM) that runs more than ten times faster, without pretraining or domain-specific data curation. We make our datasets and codebase public in a GitHub repository.1

1 https://github.com/castorini/d-bert

2 Background and Related Work

Ba and Caruana (2014) propose knowledge distillation, a method for improving the quality of a smaller student model by encouraging it to match the outputs of a larger, higher-quality teacher network. Concretely, suppose h_S(·) and h_T(·) respectively denote the untrained student and trained teacher models, and we are given a training set of inputs S = {x_1, ..., x_N}. On classification tasks, the model outputs are log probabilities; on regression tasks, the outputs are as-is. Then, the distillation objective L_KD is

\mathcal{L}_{\mathrm{KD}} = \frac{1}{N} \sum_{i=1}^{N} \lVert h_S(x_i) - h_T(x_i) \rVert_2^2    (1)

Hinton et al. (2015) alternatively use Kullback–Leibler divergence for classification, along with additional hyperparameters. For simplicity and generality, we stick with the original mean-squared error (MSE) formulation. We minimize L_KD end-to-end, updating the student's parameters and fixing the teacher's. L_KD can optionally be combined with the original, supervised cross-entropy or MSE loss; following Tang et al. (2019) and Shi et al. (2019), we optimize only L_KD for training the student.
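To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of one distillation step; it is not the released implementation, and the student, teacher, optimizer, and batch objects are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, batch):
    """One distillation update: match student outputs to fixed teacher outputs (Eq. 1)."""
    teacher.eval()
    with torch.no_grad():                        # the teacher's parameters stay fixed
        teacher_out = teacher(batch)             # logits (classification) or scores (regression)
    student_out = student(batch)                 # same shape as the teacher outputs
    loss = F.mse_loss(student_out, teacher_out)  # mean-squared error, averaged over the batch
    optimizer.zero_grad()
    loss.backward()                              # gradients flow into the student only
    optimizer.step()
    return loss.item()
```

The optimizer here would wrap only the student's parameters, so the teacher is queried but never updated.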
Using only the given training set for S, however, is often insufficient. Thus, Ba and Caruana (2014) augment S with a transfer set comprising unlabeled input, providing the student with more examples to distill from the teacher. Techniques for constructing this transfer set consist of either manual data curation or unprincipled data synthesis rules. Ba and Caruana (2014) choose images from the 80 million tiny images dataset, which is a superset of their dataset. In the NLP domain, Tang et al. (2019) propose text perturbation rules for creating a transfer set from the training set, achieving results comparable to ELMo using a BiLSTM with 100 times fewer parameters.

We wish to avoid these previous approaches. Manual data curation requires the researcher to select an unlabeled set similar enough to the target dataset, a difficult-to-impossible task for many datasets in, for example, linguistic acceptability and sentence similarity. Rule-based techniques, while general, unfortunately deviate from the true purpose of modeling the input distribution; hence, we hypothesize that they are less effective than a principled approach, which we detail below.

3 Our Approach

In knowledge distillation, the student perceives the oracular teacher to be the true p(Y|X), where X and Y respectively denote the input sentence and label. This is reasonable, since the student treats the teacher output y as ground truth, given some sentence x comprising words {w_1, ..., w_n}. The purpose of the transfer set is, then, to provide additional input sentences for querying the teacher. To construct such a set, we propose the following: first, we parameterize p(X) directly as a language model p(w_1, ..., w_n) = ∏_{i=1}^{n} p(w_i | w_1, ..., w_{i-1}) trained on the given sentences {x_1, ..., x_N}. Then, to generate unlabeled examples, we sample from the language model, i.e., the ith word of a sentence is drawn from p(w_i | w_1, ..., w_{i-1}). We stop upon generating the special end-of-sentence token [EOS], which we append to each sentence while fine-tuning the language model (LM).

Unlike previous methods, our approach directly parameterizes p(X) to provide unlabeled examples. We hypothesize that this approach outperforms ad hoc rule-based methods, which only indirectly model the input distribution p(X).

Sentence-pair modeling. To language model sentence pairs, we follow Devlin et al. (2018) and join both sentences with a special separator token [SEP] between, treating the resulting sequence as a single contiguous sentence.
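As an illustration of this generation procedure (not the paper's exact code), the sketch below samples unlabeled sentences from a fine-tuned LM until the end-of-sentence token is produced. It assumes a HuggingFace-style GPT-2 checkpoint that has already been fine-tuned on the task text; the checkpoint name is a placeholder, and for sentence-pair tasks the same sampling would apply to [SEP]-joined sequences.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder path: a GPT-2 checkpoint fine-tuned on the task's training sentences,
# each terminated with an end-of-sentence token.
model = GPT2LMHeadModel.from_pretrained("gpt2-finetuned-sst2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-finetuned-sst2")
model.eval()

def sample_transfer_sentence(max_len=64):
    """Draw one unlabeled sentence w_i ~ p(w_i | w_1..w_{i-1}), stopping at the EOS token."""
    input_ids = torch.tensor([[tokenizer.bos_token_id]])
    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,                       # ancestral sampling, not greedy decoding
            max_length=max_len,
            eos_token_id=tokenizer.eos_token_id,  # generation halts at end-of-sentence
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

transfer_set = set()                              # de-duplicate generated sentences
while len(transfer_set) < 1000:                   # the paper uses 800K sentences per transfer set
    transfer_set.add(sample_transfer_sentence())
```

The sampled sentences are then labeled by querying the fine-tuned teacher, exactly as with the original training inputs.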

3.1 Model Architecture

For simplicity and efficient inference, our student models use the same single-layer BiLSTM models from Tang et al. (2019)—see Figures 1 and 2. First, we map an input sequence of words to their corresponding word2vec embeddings, trained on Google News. Next, for single-sentence tasks, these embeddings are fed into a single-layer BiLSTM encoder to yield concatenated forward and backward states h = [h_f; h_b]. For sentence-pair tasks, we encode each sentence separately using a BiLSTM to yield h_1 and h_2. To produce a single vector h, following Wang et al. (2018), we compute h = [h_1; h_2; δ(h_1, h_2); h_1 · h_2], where · denotes elementwise multiplication and δ denotes elementwise absolute difference. Finally, for both single- and paired-sentence tasks, h is passed through a multilayer perceptron (MLP) with one hidden layer that uses a rectified linear unit (ReLU) activation. For classification, the final output is interpreted as the logits of each class; for real-valued sentence similarity, the final output is a single score.
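The sentence-pair combination and MLP head described above can be written compactly; the sketch below is illustrative rather than the released code, and the dimensions correspond to one of the configurations reported in Section 4.2 (150 BiLSTM units per direction, a 400-unit hidden layer).

```python
import torch
import torch.nn as nn

class PairClassifierHead(nn.Module):
    """Combine two BiLSTM sentence encodings and predict logits or a similarity score."""

    def __init__(self, enc_dim=300, hidden_dim=400, out_dim=2):
        # enc_dim is the concatenated forward/backward state size (2 x 150 here);
        # out_dim=1 would produce the single real-valued score used for STS-B.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, hidden_dim),   # h = [h1; h2; |h1 - h2|; h1 * h2]
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),       # logits (classification) or one score (regression)
        )

    def forward(self, h1, h2):
        h = torch.cat([h1, h2, torch.abs(h1 - h2), h1 * h2], dim=-1)
        return self.mlp(h)
```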

Figure 1: Illustration of the single-sentence BiLSTM, copied from Tang et al. (2019). The labels are as follows: (a) word embeddings (b) BiLSTM layer (c) final forward hidden state (d) final backward hidden state (e) nonlinear layer (f) the final representation (g) fully-connected layer (h) logits or similarity score (i) softmax activation for classification tasks; identity for regression (j) final probabilities or score.

Figure 2: Illustration of the sentence-pair BiLSTM, copied from Tang et al. (2019). The labels are as follows: (a) BiLSTM layer (b) final forward hidden state (c) final backward hidden state (d) comparison unit, as detailed in the text (e) nonlinear layer (f) the final representation (g) fully-connected layer (h) logits or similarity score (i) softmax activation for classification tasks; identity for regression (j) final probabilities or score.

Our teacher model is the large variant of BERT, a deep pretrained language representation model that achieves close to state of the art (SOTA) on our tasks. Extremely recent, improved pretrained models like XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) likely offer greater benefits to the student model, but BERT is widely used and sufficient for the point of this paper. We follow the same experimental procedure in Devlin et al. (2018) and fine-tune BERT end-to-end for each task, varying only the final classifier layer for the desired number of classes.
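For concreteness, the teacher setup amounts to BERT-large with a task-specific output layer. A minimal HuggingFace-style sketch is shown below; the example sentence and the omitted fine-tuning loop are illustrative, and a regression task such as STS-B would use num_labels=1.

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# BERT-large teacher with a task-specific head (num_labels=2 for binary classification).
teacher = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")

# Single-sentence input; passing a second text via text_pair joins the pair with [SEP].
batch = tokenizer(
    ["an illustrative movie-review sentence ."],
    padding=True, truncation=True, return_tensors="pt",
)
logits = teacher(**batch).logits  # fine-tuned end-to-end on the task before distillation
```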
Language modeling. For creating the transfer set, we apply two public, state-of-the-art language models: the word-level Transformer-XL (TXL; Dai et al., 2019) pretrained on WikiText-103 (Merity et al., 2017), which is derived from Wikipedia, and the subword-level GPT-2 (345M version; Radford et al., 2019) pretrained on WebText, which represents a large web corpus that excludes Wikipedia. Other models exist, but we choose these two since they represent the state of the art. We name the GPT-2 and TXL-constructed transfer sets TS_GPT-2 and TS_TXL, respectively.

4 Experimental Setup

We validate our approach on four datasets in sentiment classification, linguistic acceptability, sentence similarity, and paraphrasing: Stanford Sentiment Treebank-2 (SST-2; Socher et al., 2013), the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018), Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005). SST-2 is a binary polarity dataset of single-sentence movie reviews. CoLA is a single-sentence grammaticality task, with expertly annotated binary judgements. STS-B comprises sentence pairs labeled with real-valued similarity between 1 and 5. Lastly, MRPC has sentence pairs with binary labels denoting semantic equivalence. We pick these four tasks from the General Language Understanding Evaluation (GLUE; Wang et al., 2018) benchmark, and submit results to their public evaluation server.2

2 http://gluebenchmark.com

4.1 Baselines

As a sanity check, we attempt knowledge distillation without a transfer set, as well as training our BiLSTM from scratch on the original labels. We compare to the best official GLUE test results reported for single- and multi-task ELMo models, OpenAI GPT, single- and multi-task single-layer BiLSTMs, and the SOTA before GPT. ELMo and GPT are pretrained language representation models with around a hundred million parameters. We name our distilled model BiLSTM_KD.

Transfer set construction baselines. For our rule-based baseline, we use the masking and part of speech (POS)-guided word swapping rules as originally suggested by Tang et al. (2019), which consist of the following: iterating through a dataset's sentences, we replace 10% of the words with the masking token [MASK]. We swap another mutually exclusive 10% of the words with others of the same POS tag from the vocabulary, randomly sampling by unigram probability. For sentence-pair tasks, we apply the rules to the first sentence only, then the second only, and, finally, both. Discarding any duplicates, we repeat this entire process until meeting the target number of transfer set sentences. Tang et al. (2019) also suggest to sample n-grams; however, we omit this rule, since our preliminary experiments find that it hurts accuracy. We call this method TS_MP.

For our unlabeled dataset baseline, we choose the document-level IMDb movie reviews dataset (Diao et al., 2014) as our transfer set for SST-2. To match the single-sentence SST-2, we break paragraphs into individual linguistic sentences and, hence, multiple transfer set examples. To confirm that this is domain sensitive, we also apply it to the out-of-domain CoLA task in linguistic acceptability. We are unable to find a suitable unlabeled set for our other tasks—by construction, most sentence-pair datasets require manual balancing to prevent an overabundance of a single class, e.g., dissimilar examples in sentence similarity. We call this method TS_IMDb.
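A rough sketch of the TS_MP perturbation rules above, assuming NLTK's POS tagger and a task vocabulary grouped by POS tag; unigram-weighted sampling, the sentence-pair variants, and duplicate filtering are simplified here.

```python
import random
import nltk  # assumes the 'averaged_perceptron_tagger' resource has been downloaded

def perturb(sentence, vocab_by_pos, p_mask=0.1, p_swap=0.1):
    """Create one synthetic example: mask ~10% of words, POS-swap another ~10%."""
    words = sentence.split()
    tags = [tag for _, tag in nltk.pos_tag(words)]
    out = []
    for word, tag in zip(words, tags):
        r = random.random()
        if r < p_mask:
            out.append("[MASK]")
        elif r < p_mask + p_swap and vocab_by_pos.get(tag):
            # uniform choice here; the paper samples by unigram probability
            out.append(random.choice(vocab_by_pos[tag]))
        else:
            out.append(word)
    return " ".join(out)

# vocab_by_pos maps POS tags to training-set words, e.g. {"NN": ["movie", "plot", ...], ...}
```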
4.2 Training and Hyperparameters

We fine-tune our pretrained language models using largely the same procedure from Devlin et al. (2018). For fair comparison, we use 800K sentences for all transfer sets, including TS_IMDb. For our BiLSTM student models, we follow Tang et al. (2019) and use ADADELTA (Zeiler, 2012) with its default learning rate (LR) of 1.0 and ρ = 0.95. We train our models for 30 epochs, choosing the best performing on the standard development set. As is standard, for classification tasks, we minimize the negative log-likelihood; for regression, the mean-squared error. Depending on the loss on the development set, we choose either 150 or 300 LSTM units, and 200 or 400 hidden MLP units. This results in a model size between 1–3 million parameters. We use the 300-dimensional word2vec vectors trained on Google News, initializing out-of-vocabulary (OOV) vectors from UNIFORM[−0.25, 0.25], following Kim (2014), along with multichannel embeddings.

To fine-tune our pretrained language models, we use Adam (Kingma and Ba, 2014) with an LR linear warmup proportion of 0.1, linearly decaying the LR afterwards. We choose a batch size of eight and one fine-tuning epoch, which is sufficient for convergence. We tune the LR from {1, 5} × 10^−5 based on word-level perplexity on the development set.
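For reference, the warmup-then-linear-decay schedule described above corresponds roughly to the following setup with HuggingFace utilities; the model name, step count, and omitted training loop are illustrative.

```python
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

lm = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # 345M GPT-2, as used in the paper
num_training_steps = 8_000                            # illustrative: one epoch at batch size 8
optimizer = AdamW(lm.parameters(), lr=5e-5)           # LR tuned over {1e-5, 5e-5} on dev perplexity
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # linear warmup proportion of 0.1
    num_training_steps=num_training_steps,            # then linear decay to zero
)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```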

5 Results and Discussion

We present our results in Table 1. As an initial sanity check, we confirm that our BiLSTM (row 11) is acceptably similar to the previous best reported BiLSTM (row 5). We also verify that a transfer set is necessary—see rows 10 and 11, where using only the training dataset for distillation is insufficient. We further confirm that TS_IMDb works poorly for the out-of-domain CoLA dataset (row 8). Note that the absolute best result on SST-2 before BERT is 93.2, from Radford et al. (2017), but that approach demands copious amounts of domain-specific data from the practitioner.

5.1 Quality and Efficiency

Of the transfer set construction approaches, our principled generation methods consistently achieve the highest results (see Table 1, rows 6 and 7), followed by the rule-based TS_MP and the manually curated TS_IMDb (rows 8 and 9). TS_GPT-2 is especially effective for CoLA, yielding a respective 12.5- and 30-point increase in Matthews Correlation Coefficient (MCC) over TS_MP and training from scratch.

Interestingly, on SST-2, the synthetic GPT-2 samples outperform handwritten movie reviews from IMDb. Unlike the rule-based TS_MP, our LM-driven approaches outperform ELMo on all four tasks. TS_GPT-2, our best method, reaches GPT parity on all but CoLA, establishing domain-agnostic, pre-BERT SOTA on SST-2 and STS-B.

Our models use between one and three million parameters, which is at least 30 and 40 times smaller than ELMo and GPT, respectively. This represents an improvement over the previous SOTA—see the official GLUE leaderboard and Devlin et al. (2018) for specifics.

It should be emphasized that using fewer model parameters does not necessarily reduce the total disk usage. All traditional, word embedding-based models require storing the word vectors, which obviously precludes many on-device applications. Instead, the main benefit is that these shallow BiLSTMs perform inference an order of magnitude faster than GPT, which is mostly important for server-based, in-production NLP systems.

 #  Model                                    SST-2   CoLA   STS-B       MRPC
                                             Acc.    MCC    r/ρ         F1/Acc.
 1  BERT_large (Devlin et al., 2018)         94.9    60.5   86.5/87.6   89.3/85.4
 2  OpenAI GPT (Radford et al., 2018)        91.3    45.4   82.0/80.0   82.3/75.7
 3  Pre-OpenAI SOTA (Devlin et al., 2018)    90.2†   35.0   81.0/–      86.0/80.4
 4  ELMo BiLSTM (Wang et al., 2018)          90.4    36.0   74.2/72.3   84.9/78.0
 5  BiLSTM scratch (GLUE leaderboard)        85.9    15.7   70.3/67.8   81.8/74.3
 6  BiLSTM_KD + TS_GPT-2                     92.7    40.0   82.1/80.7   85.5/80.2
 7  BiLSTM_KD + TS_TXL                       91.9    36.5   82.0/80.4   85.1/79.3
 8  BiLSTM_KD + TS_IMDb                      92.0    18.8   –           –
 9  BiLSTM_KD + TS_MP                        90.7    27.5   81.1/79.3   82.4/76.1
10  BiLSTM_KD (no TS)                        88.4    0.0    68.2/65.8   78.0/69.7
11  BiLSTM scratch (ours)                    87.6    9.5    66.9/64.3   80.9/69.4

Table 1: GLUE test results for our models, along with previous comparison points. Bolded are the best scores from rows 2–11. †For fair comparison, this result is copied from Looks et al. (2017), which represents the best domain-agnostic approach; the rest in row 3 is from Devlin et al. (2018) and the GLUE website.

 #  Dataset     SST-2          CoLA           STS-B   MRPC
                U3%    p/n     U3%    p/n     U3%     U3%    p/n
 1  TS_GPT-2    77%    1.14    88%    2.71    83%     82%    0.41
 2  TS_TXL      76%    1.29    87%    1.51    80%     82%    0.25
 3  TS_IMDb     65%    1.65    65%    8.35    –       –      –
 4  TS_MP       44%    1.23    69%    1.10    62%     60%    1.38
 5  Training    20%    1.26    64%    2.38    66%     64%    2.07

Table 2: Diversity and generation statistics.

Model   SST-2              CoLA               STS-B              MRPC
        OOV    ppl   bpc   OOV    ppl   bpc   OOV    ppl   bpc   OOV    ppl   bpc
GPT-2   0%     67    1.3   0%     60    1.1   0%     35    1.2   0%     19    1.3
TXL     2.9%   77    1.8   0.1%   32    1.2   1.4%   32    1.9   1.0%   17    2.5

Table 3: Language modeling statistics.

5.2 Language Generation Analysis

To characterize the transfer sets, we present diversity statistics in Table 2. U3% denotes the average percentage of unique trigrams (Fedus et al., 2018) across sequential dataset chunks of size M, where M matches the original dataset size for fairness. Specifically, it represents the following:

\mathrm{U3\%} = \frac{1}{K} \sum_{i=1}^{K} \frac{\#\,\text{unique trigrams in } x_{((i-1)M+1):iM}}{\#\,\text{total trigrams in } x_{((i-1)M+1):iM}}    (2)

where K = ⌊N/M⌋ and {x_1, ..., x_N} the dataset. We find that TS_GPT-2 and TS_TXL (rows 1 and 2) contain more unique trigrams than TS_MP, the original training set, and, surprisingly, handwritten movie reviews from IMDb (see rows 3–5).

To examine whether the class distribution of the transfer sets matches the original, we compute p/n, the positive-to-negative label ratio. Based on the statistics, we conclude that p/n varies wildly among the methods and datasets, with our LM-generated transfer sets differing substantially on MRPC, e.g., TS_GPT-2's 0.41 versus the original's 2.07. This suggests that similar examples are more difficult to generate than dissimilar ones.

Finally, to characterize the LMs, we report GPT-2's and TXL's word-level perplexity (PPL) and bits per character (BPC) on the development sets, as well as the percentage of OOV tokens on the dataset—see Table 3, where lower scores are better. GPT-2 has practically no OOV for English, due to its byte-pair encoding scheme. In spite of using half as many parameters, GPT-2 is better at character-level language modeling than TXL is on all datasets, and its word-level PPL is similar, except on CoLA. As a rough analysis, BPC is a stronger predictor of improved quality than PPL is. Across the datasets, distillation quality strictly increases with decreasing BPC, unlike PPL, suggesting that character-level modeling is more important for constructing an effective transfer set.
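Eq. (2) can be computed directly from tokenized sentences; the helper below is a simple reference implementation of the U3% statistic, with naive whitespace tokenization standing in for whatever tokenization the authors actually used.

```python
def unique_trigram_stat(sentences, chunk_size):
    """U3%: average fraction of unique trigrams over consecutive chunks of M sentences (Eq. 2)."""
    ratios = []
    num_chunks = len(sentences) // chunk_size            # K = floor(N / M)
    for i in range(num_chunks):
        chunk = sentences[i * chunk_size:(i + 1) * chunk_size]
        trigrams = []
        for sent in chunk:
            toks = sent.split()
            trigrams.extend(zip(toks, toks[1:], toks[2:]))
        if trigrams:
            ratios.append(len(set(trigrams)) / len(trigrams))
    return 100 * sum(ratios) / max(len(ratios), 1)       # reported as a percentage

# Example usage: U3% of a transfer set, with M equal to the original training set size.
# u3 = unique_trigram_stat(transfer_sentences, chunk_size=len(train_sentences))
```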

Set        Example
TS_GPT-2   cansfield 's further oeuvre encompasses it somehow , and the surreal feels natural . [EOS]
TS_TXL     ethereal and plot of irony and irony and , most importantly , subtle suspense and spirit game - of - humor .[EOS]
TS_MP      what should have been a cutting hollywood satire is [MASK] about as fresh as last week 's issue of variety . [EOS]
TS_IMDb    but it the end, the film is a big steaming pile of...y'know.[EOS]
Training   the cinematography to the outstanding soundtrack and unconventional narrative [EOS]

Table 4: Generation examples on SST-2.

Generation examples. We present a random example from each transfer set in Table 4 for SST-2. The generated samples ostensibly consist of movie reviews and contain acceptable linguistic structure, despite only one epoch of fine-tuning. Due to space limitations, we show only SST-2; however, the other transfer sets are public for examination in our GitHub repository.

6 Conclusions and Future Work

We propose using text generation for constructing the transfer set in knowledge distillation. We validate our hypothesis that generating text using pretrained LMs outperforms manual data curation and rule-based techniques: the former in generality, and the latter in efficacy. Across multiple datasets, we achieve OpenAI GPT-level quality using a single-layer BiLSTM.

The presented techniques can be readily extended to sequence-to-sequence-level knowledge distillation for applications in neural machine translation and logical form induction. Another line of future work involves applying the techniques to knowledge distillation for traditional, in-production NLP systems.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. This research was enabled in part by resources provided by Compute Ontario and Compute Canada.

References

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In International Conference on Learning Representations.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep learning with dynamic computation graphs. In International Conference on Learning Representations.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv:1704.01444.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Yangyang Shi, Mei-Yuh Hwang, Xin Lei, and Haoyu Sheng. 2019. Knowledge distillation for recurrent neural network language modeling with trust regularization. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv:1903.12136.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv:1804.07461.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv:1805.12471.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
