Self-Knowledge Distillation in Natural Language Processing

Sangchul Hahn, Heeyoul Choi
Handong Global University, Pohang, South Korea
[email protected], [email protected]

Abstract

Since deep learning became a key player in natural language processing (NLP), many deep learning models have been showing remarkable performances in a variety of NLP tasks, and in some cases they even outperform humans. Such high performance can be explained by the efficient knowledge representation of deep learning models. While many methods have been proposed to learn more efficient representations, knowledge distillation from pretrained deep networks suggests that we can use more information from the soft target probabilities to train other neural networks. In this paper, we propose a new knowledge distillation method, self-knowledge distillation, based on the soft target probabilities of the training model itself, where multimode information is distilled from the word embedding space right below the softmax layer. Due to the time complexity, our method approximates the soft target probabilities. In experiments, we applied the proposed method to two different and fundamental NLP tasks: language modeling and neural machine translation. The experiment results show that our proposed method improves performance on these tasks.

1 Introduction

Deep learning has achieved the state-of-the-art performance on many machine learning tasks, such as image classification, object recognition, and neural machine translation (He et al., 2016; Redmon and Farhadi, 2017; Vaswani et al., 2017), and has outperformed humans on some tasks. In deep learning, one of the critical points for success is to learn better representations of data with many layers (Bengio et al., 2013) than other algorithms. In other words, if we make a model learn a better representation of the data, the model can show better performance.

In natural language processing (NLP) tasks like language modeling (LM) (Bengio et al., 2003; Mikolov et al., 2013) and neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015), the models are trained to generate many words per sentence, which is a sequence of classification steps, in each of which they choose a target word among all the words in the dictionary. That is why LM and NMT are usually trained with the sum of cross-entropies over the target sentence. Thus, although language related tasks are more about generation than classification, the models estimate target probabilities with the softmax operation on the previous neural network layers, and the target distributions are provided as one-hot representations. As data representation in NLP models, word symbols should also be represented as vectors.

In this paper, we focus on the word embedding and the estimation of the target distribution. In NLP, word embedding is a step that translates word symbols (indices in the vocabulary) to vectors in a continuous vector space, and it is considered a standard approach to handle symbols in neural networks. When two words have semantically or syntactically similar meanings, they are represented closely to each other in the word embedding space. Thus, even when a prediction is not exactly correct, the predicted word might not be so bad if the estimated word is very close to the target word in the embedding space, like 'programming' and 'coding'. That is, the word embedding can be used to check how wrong a prediction is. There are several methods to obtain word embedding matrices (Mikolov et al., 2013; Pennington et al., 2014), in addition to neural language models (Bengio et al., 2003; Mikolov et al., 2010). Recently, several approaches have been proposed to make more efficient word embedding matrices, usually based on contextual information (Søgaard et al., 2017; Choi et al., 2017).

On the other hand, knowledge distillation was proposed by (Hinton et al., 2015) to train new, usually shallower, networks using the hidden knowledge in the probabilities produced by pretrained networks. It shows that there is knowledge not only in the target probability corresponding to the target class but also in the other class probabilities estimated by the trained model. In other words, the other class probabilities can contain additional information describing the input data samples differently even when the samples are in the same class. Also, samples from different classes could produce similar distributions to each other.

In this paper, we propose a new knowledge distillation method, self-knowledge distillation (SKD), based on the word embedding of the training model itself. That is, self-knowledge is distilled from the predicted probabilities produced by the training model, expecting the model to have more information as it is trained more. In conventional knowledge distillation, the knowledge is distilled from the estimated probabilities of pretrained (or teacher) models. In contrast, in the proposed SKD, knowledge is distilled from the current model during the training process, and the knowledge is hidden in the word embedding. During the training process, the word embedding reflects the relationship between words in the vector space. A word close to the target word in the vector space is expected to have a similar distribution after softmax, and such information can be used to approximate the soft target probability as in knowledge distillation. We apply our proposed method to two popular NLP tasks: LM and NMT. The experiment results show that our proposed method improves the performance of both tasks. Moreover, SKD reduces overfitting problems, which we believe is because SKD uses more information.

The paper is organized as follows. Background is reviewed in Section 2. In Section 3, we describe our proposed method, SKD. Experiment results are presented and analyzed in Section 4, followed by Section 5 with the conclusion.
2 Background

In this section, we briefly review the cross-entropy and knowledge distillation. Also, since our proposed method is based on the word embedding, the layer right before the softmax operation, the word embedding process is summarized.

2.1 Cross Entropy

For classification with C classes, neural networks produce class probabilities p_i, i ∈ {1, ..., C}, by using a softmax output layer which calculates the class probabilities from the logit z_i, considering the other logits, as follows:

    p_i = exp(z_i) / Σ_k exp(z_k).    (1)

In most classification problems, the objective function for a single sample is defined by the cross-entropy as follows:

    J(θ) = − Σ_k y_k log p_k,    (2)

where y_k and p_k are the target and predicted probabilities. The cross-entropy can be simply calculated by

    J(θ) = − log p_t,    (3)

when the target probability y is a one-hot vector defined as

    y_k = 1 if k = t (the target class), and y_k = 0 otherwise.    (4)

Note that the cross-entropy objective function says only how likely input samples belong to the corresponding target class, and it does not provide any other information about the input samples.
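As a concrete illustration (our own minimal sketch, not code from the paper), the following PyTorch snippet computes Eq. (1) and checks that Eq. (2) with the one-hot target of Eq. (4) reduces to Eq. (3); the logit values and the number of classes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Illustrative logits z_i for C = 4 classes and target class t = 2.
z = torch.tensor([1.5, -0.3, 2.1, 0.0])
t = 2

p = F.softmax(z, dim=-1)                                # Eq. (1)
y = F.one_hot(torch.tensor(t), num_classes=4).float()   # Eq. (4): one-hot target

loss_eq2 = -(y * torch.log(p)).sum()                    # Eq. (2)
loss_eq3 = -torch.log(p[t])                             # Eq. (3)
assert torch.allclose(loss_eq2, loss_eq3)
```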
2.2 Knowledge Distillation

A well trained deep network model contains meaningful information (or knowledge) extracted from the training datasets for a specific task. Once a deep model is trained for a task, the trained model can be used to train new, smaller (shallower or thinner) networks, as shown in (Hinton et al., 2015; Romero et al., 2014). This approach is referred to as knowledge distillation.

Basically, knowledge distillation provides more information to new models for training and improves the new models' performance. Thus, when a new model, which is usually smaller, is trained with the distilled knowledge from the trained deep model, it can achieve a similar (or sometimes even better) performance compared to the pretrained deep model.

In the pretrained model, knowledge lies in the class probabilities produced by the softmax of the model as in Eq. (1). All probability values, including the target class probability, describe relevant information about the input data. Thus, instead of the one-hot representation of the target label, where only the target class is considered in the cross-entropy, all probabilities over the whole classes from the pretrained model can provide more information about the input data in the cross-entropy, and can teach new models more efficiently. All probabilities from the pretrained model are considered as soft target probabilities.

In a photo tagging task, depending on the other class probabilities, we understand the input image better than with just the target class. When the class 'mouse' has the highest probability, if 'mascot' also has a relatively high probability, then the image is probably 'mickey mouse'. If 'button' or 'pad' has a high probability, the image is probably a mouse as a computer device. The other class probabilities carry some extra information, and such knowledge in the pretrained model can be transferred to a new model by using a soft target distribution of the training set. When the target labels are available, the objective function is a weighted sum of the conventional cross-entropy with the correct labels and the cross-entropy with the soft target distribution, given by

    J(θ) = −(1 − λ) log p_t − λ Σ_k q_k log p_k,    (5)

where p_k is the probability for class k produced by the current model with parameters θ, and q_k is the soft target probability from the pretrained model. λ controls the amount of knowledge from the trained model. Note that conventional knowledge distillation extracts knowledge from a pretrained model; in this paper, we propose to extract knowledge from the current model itself without any pretrained model.

Furthermore, (Furlanello et al., 2018) recently showed that knowledge distillation can be useful to train a new model which has the same size and the same architecture as the pretrained model. They train a teacher model first, and then train a student model with the distilled knowledge from the teacher model. Their experiment results show that the student models outperform the teacher model. Also, even when the teacher model has a less powerful architecture, the knowledge from the trained teacher model can boost student models which have more powerful (or bigger) architectures. It means that even when knowledge is distilled from a relatively weak model, it can be useful to train a bigger model.
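The objective in Eq. (5) can be sketched as below. This is a minimal single-sample illustration under the assumption that the soft targets q_k come from a pretrained teacher's softmax; the logits and the value of λ are arbitrary.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, t, lam=0.5):
    """Eq. (5): weighted sum of the hard-label cross-entropy and the
    cross-entropy against the teacher's soft target distribution."""
    log_p = F.log_softmax(student_logits, dim=-1)     # log p_k of the new model
    q = F.softmax(teacher_logits, dim=-1).detach()    # soft targets q_k from the pretrained model
    hard = -log_p[t]                                  # -log p_t
    soft = -(q * log_p).sum()                         # -sum_k q_k log p_k
    return (1.0 - lam) * hard + lam * soft

# Arbitrary logits for a single sample with C = 5 classes.
print(kd_loss(torch.randn(5), torch.randn(5), t=1, lam=0.3))
```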
2.3 Word Embedding

Word embedding is to convert the symbolic representation of words into a vector representation with semantic and syntactic meanings, which reflects the relations between words. Various word embedding methods have been proposed to learn a word embedding matrix, including CBOW, Skip-gram (Mikolov et al., 2013), and GloVe (Pennington et al., 2014). The trained embedding matrix can be transferred to other models like LM or NMT (Ahn et al., 2016).

CBOW predicts a word given its neighbor words, and Skip-gram predicts the neighbor words given a word. They use feedforward layers, and the last layer of CBOW includes the word embedding matrix W, as follows:

    z = W h + b,    (6)

where b is a bias, h is the hidden layer, and z is the logits for the softmax operation.

Words in the embedding space have semantic and syntactic similarities, such that two similar words are close in the space. Thus, when the classification is not correct, the error can be interpreted differently depending on the similarity between the predicted word and the target word. For example, when the target word is 'learning', if the predicted word is 'training', then the prediction is less wrong than other words like 'flower' or 'internet'. In this paper, we utilize such hidden information (or knowledge) in the word embedding space while training. Fig. 1 shows where the word embedding is located in LM and NMT, respectively.

Figure 1: Network architectures of LM ((a) Language Model) and NMT ((b) NMT Model). Word embedding is presented as gray boxes in the models.
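Below is a small sketch of the output projection in Eq. (6) and of the distance comparison just described. The embedding matrix is randomly initialized here, and the word indices standing in for 'learning', 'training', and 'flower' are hypothetical; in a trained embedding the first distance would be expected to be the smaller one.

```python
import torch

V, d = 10000, 300                          # illustrative vocabulary size and embedding dimension
W = torch.randn(V, d)                      # word embedding matrix
b = torch.zeros(V)
h = torch.randn(d)                         # hidden state for the current context

z = W @ h + b                              # Eq. (6): logits fed to the softmax operation

# Interpreting an error by distance in the embedding space.
t, n_close, n_far = 42, 43, 9000           # hypothetical indices for 'learning', 'training', 'flower'
d_related = torch.norm(W[t] - W[n_close])  # small for a semantically related word (once trained)
d_unrelated = torch.norm(W[t] - W[n_far])  # large for an unrelated word
```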
3 Self-Knowledge Distillation

We propose a new learning method, self-knowledge distillation (SKD), which distills knowledge from the currently training model itself, following the conventional knowledge distillation. In this section, we describe an algorithm for SKD and its application to language modeling and neural machine translation.

3.1 SKD Equations

In order to apply knowledge distillation on a currently training model, we need to obtain soft target probabilities q_k as in Eq. (5) for all classes, but they are not available explicitly. However, when the model is trained enough, the word embedding has such information implicitly. If a word w_i is close to w_j in the embedding space, the probability p_i would be close to p_j for a given input sample.

When t is the target class, we calculate the soft target probabilities q_k based on the word embedding. First, we assume that q_t should be high, and that if w_k is close to w_t in the embedding space, q_k should also be high. That is, the Euclidean distance between words is used to estimate the soft target probability. The other class probabilities (or soft target probabilities) q_k can be obtained by

    q_k = (1/Z) exp{−σ ||w_t − w_k||_2},    (7)

where ||·||_2 is the l2-norm, and Z is a normalization term. σ is a scale parameter whose value depends on the average distance to the corresponding nearest neighbors in the word embedding space. However, due to the expensive computational cost, we do not calculate q_k for all classes; we choose just one of the other classes, namely the predicted class of the current model.

Assuming that the model predicts a class n for a given input sample, only q_t and q_n are used as distilled knowledge. We clip the q_n value at 0.5, meaning that the class n cannot be more correct than the real target t, so Eq. (7) becomes

    q_n = min{exp{−σ ||w_t − w_n||_2}, 0.5},
    q_t = 1 − q_n,    (8)

where q_n + q_t = 1. That is, we consider only two soft target probabilities, as shown in Fig. 2. Note that we use the Euclidean distance between w_t and w_n to calculate q_n, but other approaches like the inner product would be possible.

Now, the objective function of SKD becomes similar to Eq. (5), and is defined by

    J(θ) = −(1 − λ) log p_t − λ (q_t log p_t + q_n log p_n),    (9)

where the second term of Eq. (5) is approximated by λ(q_t log p_t + q_n log p_n), ignoring the other class probabilities. Eq. (9) can be rewritten simply as follows:

    J(θ) = −(1 − λ q_n) log p_t − λ q_n log p_n.    (10)

Eqs. (9) and (10) can be understood in three cases. First, if the prediction is correct (n = t), then Eq. (9) is the same as the conventional cross-entropy objective. Second, if w_n is far from w_t in the word embedding space, then q_n is close to zero and Eq. (9) becomes close to the conventional cross-entropy objective. Finally, if w_n is close to w_t (e.g., q_n = 0.4), Eq. (9) approximates the soft target probability with only the two classes t and n, and the model is trained to produce probabilities for classes t and n close to q_t and q_n. This approach trains the model with different targets for different input samples.

Fig. 2 presents how SKD obtains the simplified soft target distribution based on the distance between the target and estimated vectors in the word embedding space.

Figure 2: Given a target class t, soft target probabilities are obtained based on the distance in the word embedding space. However, only the target class and the predicted class have soft target probabilities in SKD.
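A single-sample sketch of Eqs. (8)-(10) follows. The embedding matrix, logits, σ, and λ are illustrative placeholders, and the soft target q_n is treated as a constant (detached) here, which is an assumption of this sketch rather than a detail stated above.

```python
import torch
import torch.nn.functional as F

def skd_loss(logits, emb, t, lam=1.0, sigma=0.1):
    """Approximate SKD objective of Eqs. (8)-(10) for one sample.
    logits: (V,) output logits; emb: (V, d) word embedding matrix; t: target index."""
    log_p = F.log_softmax(logits, dim=-1)
    n = int(logits.argmax())                   # predicted class of the current model
    if n == t:                                 # correct prediction: plain cross-entropy
        return -log_p[t]
    with torch.no_grad():                      # soft target treated as a constant
        dist = torch.norm(emb[t] - emb[n])     # Euclidean distance between w_t and w_n
        q_n = torch.clamp(torch.exp(-sigma * dist), max=0.5)       # Eq. (8), clipped at 0.5
    return -(1.0 - lam * q_n) * log_p[t] - lam * q_n * log_p[n]    # Eq. (10)

V, d = 1000, 64                                # illustrative sizes
print(skd_loss(torch.randn(V), torch.randn(V, d), t=7))
```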
3.2 SKD Algorithm

Since SKD distills knowledge from the current training model, at the beginning of the training process the model does not contain relevant information. That is, we cannot extract any knowledge from the training model at the beginning. Thus, we start the training process without knowledge distillation at first and gradually increase the amount of knowledge distillation as the training iterations go on. So, our algorithm starts with the conventional cross-entropy objective function in Eq. (3), and after training the model for a while, it gradually transits to Eq. (10). To implement the transition, another parameter α is introduced to Eq. (10), leading to the final objective function as follows:

    J(θ) = −(1 − α λ q_n) log p_t − α λ q_n log p_n.    (11)

α starts from 0, with which Eq. (11) becomes the conventional cross-entropy. After K iterations, α increases by η per iteration and eventually goes up to 1, with which Eq. (11) becomes the same as Eq. (9). In our experiments, we used a simplified equation as in Eq. (12), without λ, so that the objective function relies gradually more on the soft target probabilities as training goes:

    J(θ) = −(1 − α q_n) log p_t − α q_n log p_n.    (12)

Table 1 summarizes the proposed SKD algorithm.

Table 1: Self-Knowledge Distillation Algorithm

Algorithm 1: SKD Algorithm
  Initialize the model parameters θ
  Initialize α = 0 and σ
  (See the experiments for σ values.)
  Repeat K times:
    Train the network based on the cross-entropy in Eq. (3)
  Repeat until convergence:
    Train the network based on the SKD objective function in Eq. (12)
    Update α with α + η
  (See the experiments for η values.)
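The warm-up and α schedule of Table 1 and Eq. (12) can be sketched as a toy training loop. The model, the random data, and the values of K, η, and σ below are placeholders for illustration, not the settings used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d, K, eta, sigma = 50, 16, 100, 0.01, 0.1      # illustrative sizes and schedule constants
emb = nn.Embedding(V, d)
out = nn.Linear(d, V)
opt = torch.optim.SGD(list(emb.parameters()) + list(out.parameters()), lr=0.1)

alpha = 0.0
for step in range(300):
    x = torch.randint(V, (8,))                    # toy "context" tokens
    t = torch.randint(V, (8,))                    # toy "target" tokens
    log_p = F.log_softmax(out(emb(x)), dim=-1)    # (8, V) log-probabilities
    rows = torch.arange(8)

    if step < K:                                  # warm-up: cross-entropy only, Eq. (3)
        loss = -log_p[rows, t].mean()
    else:                                         # SKD phase, Eq. (12)
        n = log_p.argmax(dim=-1)                  # predicted classes
        with torch.no_grad():                     # soft targets treated as constants
            dist = torch.norm(emb.weight[t] - emb.weight[n], dim=-1)
            q_n = torch.clamp(torch.exp(-sigma * dist), max=0.5)
        loss = (-(1 - alpha * q_n) * log_p[rows, t]
                - alpha * q_n * log_p[rows, n]).mean()
        alpha = min(1.0, alpha + eta)             # update alpha with alpha + eta, capped at 1

    opt.zero_grad()
    loss.backward()
    opt.step()
```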
3.3 NLP Tasks

SKD is applied to two different NLP tasks: language modeling (LM) and neural machine translation (NMT). Although LM and NMT are actually sentence generation rather than classification, they have classification steps to generate the words of the target sentence. Also, the sum of cross-entropies over the words in the sentence is adopted as their objective function.

In addition, to check whether SKD is robust against errors in the word embedding space, we also evaluate SKD when we add Gaussian noise in the word embedding space for the target words in the decoder.
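This noise variant can be sketched as below, assuming it simply perturbs the embedded target words before they enter the decoder; the standard deviation is our own illustrative value, not one reported in the paper.

```python
import torch

def noisy_target_embeddings(emb_vectors, std=0.01):
    """'+Noise' baseline sketch: add Gaussian noise to the embedded target
    words fed to the decoder; std is an illustrative assumption."""
    return emb_vectors + std * torch.randn_like(emb_vectors)
```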

4 Experiments

To evaluate self-knowledge distillation, we compare it to baseline models for language modeling and neural machine translation.

4.1 Dataset

For language modeling, we use two different datasets: Penn TreeBank (PTB) and WiKi-2. PTB was made by (Marcus et al., 1993), and we use the pre-processed version by (Mikolov et al., 2010). In the PTB dataset, the train, valid, and test sets have about 887K, 70K, and 78K tokens, respectively, and the vocabulary size is 10K. The WiKi-2 dataset introduced by (Merity et al., 2016) consists of sentences extracted from Wikipedia. It has about 2M, 217K, and 245K tokens for the train, valid, and test sets. Its vocabulary size is about 33K. We did not apply additional pre-processing to the PTB dataset. The WiKi-2 dataset is pre-tokenized, so we only added an end-of-sentence token to every sentence.

For machine translation, we evaluated models on three different translation tasks (En-Fi, Fi-En, and En-De) with the available corpora from WMT'15 (http://www.statmt.org/wmt15/translation-task.html). The dictionary size is 10K for the En-Fi and Fi-En translation tasks, and 30K for the En-De translation task.

4.2 Language Modeling

Language modeling (LM) has been used in many different NLP tasks, like automatic speech recognition (ASR) and machine translation (MT), to capture the syntactic and semantic structure of a natural language. Neural network-based language models (NNLM) and recurrent neural network language models (RNNLM) catch the syntactic and semantic regularities of an input language (Bengio et al., 2003; Mikolov et al., 2013). RNNLM is our baseline, which consists of a single LSTM layer and a single feed-forward layer with ReLU (Le et al., 2015).

We evaluate four models: Baseline, Noise (with Gaussian noise on the word embedding), SKD, and Noise+SKD. To show that the information provided by SKD is more knowledgeable than random noise, we test a noise-injected model which injects only Gaussian noise into the word embedding space. The word dimension is set to 500 and the number of hidden nodes to 400 for all models. We set σ in the SKD algorithm in Table 1 to 0.1 for both PTB and WiKi-2, and η to 0.0002 for PTB and 0.00011 for WiKi-2. We apply the SKD objective function after 500 batches for PTB and 900 batches for WiKi-2; note that the WiKi-2 data is larger than PTB.

The evaluation metric is the negative log-likelihood (NLL) of each sentence (the lower, the better). Table 2 presents NLLs on the test data of the two datasets with different models. It shows that our proposed methods (both noise injection and self-knowledge distillation) improve the results on the LM task. Note that SKD provides more knowledgeable information than Gaussian noise.

Table 2: NLLs for LM with different models on PTB and Wiki-2.

Model         PTB      Wiki-2
Baseline      101.40   119.49
+Noise        101.28   118.70
+SKD           99.38   116.85
+Noise+SKD     97.41   116.60
4.3 Neural Machine Translation

NMT has been widely used in machine translation research because of its powerful performance and end-to-end training (Sutskever et al., 2014; Bahdanau et al., 2015; Johnson et al., 2017). Attention-based NMT models consist of an encoder, a decoder, and the attention mechanism (Bahdanau et al., 2015), which is our baseline in this paper, except for replacing GRU with LSTM and using BPE (Sennrich et al., 2016). The encoder takes the sequence of source words in word embedding form. The decoder works in a similar way to LM, except for the attention mechanism. See (Bahdanau et al., 2015) for NMT and the attention mechanism in detail.

In the experiments, we check how much SKD can improve the model's performance using this simple baseline architecture. Since SKD modifies only the objective function, we believe that the improvement by SKD is regardless of model architectures.

Table 3 shows that our proposed method improves NMT performance by around 1 BLEU score. For qualitative comparison, some translation results are presented below. The overall quality of translation of the SKD model looks better than the baseline model's. In other words, when the BLEU scores are similar, the sentences translated by the SKD model look better.

Table 3: BLEU scores on the test sets for En-Fi, Fi-En and En-De with two different beam widths. The scores on the development sets are in the parentheses.

Model          Beam width 1     Beam width 12
En-Fi
  Baseline      7.29 (8.28)      9.01 (9.85)
  +Noise        7.68 (8.50)      9.35 (9.53)
  +SKD          8.36 (9.43)      9.87 (10.30)
  +Noise+SKD    8.81 (8.95)     10.13 (10.47)
Fi-En
  Baseline     10.42 (11.39)    11.89 (12.78)
  +Noise       10.74 (11.80)    12.39 (13.35)
  +SKD         10.70 (12.52)    12.43 (13.82)
  +Noise+SKD   11.87 (12.92)    13.16 (14.13)
En-De
  Baseline     19.72 (19.28)    22.25 (20.91)
  +Noise       20.69 (19.68)    22.40 (20.92)
  +SKD         20.29 (20.41)    22.59 (21.75)
  +Noise+SKD   21.16 (20.34)    23.07 (21.64)

• (src) Hallituslähteen mukaan tämä on yksi monista ehdotuksista, joita harkitaan.
  (trg) A governmental source says this is one of the many proposals to be considered.
  (baseline) According to government revenue, this is one of the many proposals that are being considered to be considered.
  (SKD) According to the government, this is one of the many proposals that are being considered.

• (src) Meillä on hyvä tunne tuotantoketjun vahvuudesta.
  (trg) We feel very good about supply chain capability.
  (baseline) We have good knowledge of the strength of the production chain.
  (SKD) We have a good sense of the strength of the production chain.

• (src) En ole oikein tajunnut, että olen näin vanha.
  (trg) I haven't really realized that I'm this old.
  (baseline) I have not been right to realise that I am so old.
  (SKD) I am not quite aware that I am so old.

• (src) Ne vaikuttavat vasta tulevaisuudessa.
  (trg) They'll have an impact in the future only.
  (baseline) They will only be affected in the future.
  (SKD) They will only affect in the future.
Fig. 3 shows the trajectory of the q_n value and the scheduling of the α value in Eq. (12) during training of the En-Fi NMT model. As expected, the (unclipped) q_n value becomes larger than 0.5, which means that w_n (the predicted word vector) is close enough to w_t (the target word vector). Fig. 3(b) shows the scheduled value of α in Eq. (12). The α value starts from 0 and increases up to 1 while training. The model is trained with only the cross-entropy for K iterations, and then, when the model has captured enough knowledge to be distilled, α increases to utilize knowledge from the model.

Figure 3: (a) Change of the q_n value during NMT model training for the En-Fi translation task, and (b) scheduling of the α value in Eq. (12) of NMT training for the En-Fi translation task. (a) shows that as the model is trained more, the q_n value becomes closer to the target.
Also, as shown in Fig. 4, the SKD models are not (or are more slowly) overfitted to the training data. We believe this is because SKD provides more information, distilled by the training model itself, which prevents overfitting. Note that there is no significant difference between the improvements by SKD and Noise alone, but Noise+SKD improves further. It implies that SKD provides a different kind of information from noise, while the synergy effect between SKD and noise needs more research.

Figure 4: BLEU scores of the validation data while training on the En-Fi corpus with four different models: Baseline, +Noise, +SKD, and +Noise+SKD. The vertical axis indicates the BLEU score and the horizontal axis the number of training iterations.
5 Conclusion

We proposed a new knowledge distillation method, self-knowledge distillation, which distills knowledge from the probabilities of the currently training model itself. The method uses only two soft target probabilities that are obtained based on the word embedding space. The experiment results with language modeling and neural machine translation show that our method improves performance. This method can be straightforwardly applied to other tasks where the cross-entropy is used.

As future work, we want to apply SKD to other applications with different model architectures, to show that SKD depends neither on the task nor on the model architecture. For image classification tasks, if we abuse the term 'word embedding' to refer to the layer right before the softmax operation, it may be possible to apply SKD in a similar way, although it is not guaranteed that comparable image classes are closely located in that embedding space for image-related tasks. Also, we can develop an automatic way to set parameters like α in Eq. (12), and generalize the equation for q_n in Eq. (8).

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033341), and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00749, Development of virtual network management technology based on artificial intelligence).

References

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. CoRR abs/1608.00318:1–10.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. Int'l Conf. on Learning Representations (ICLR).

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2003. A neural probabilistic language model. The Journal of Machine Learning Research 3:1137–1155.

Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2017. Context-dependent word representation for neural machine translation. Computer Speech and Language 45:149–160.

Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. pages 1602–1611.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pages 770–778.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.

Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR abs/1504.00941.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR abs/1609.07843.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. Int'l Conf. on Learning Representations (ICLR).

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. pages 1045–1048.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pages 6517–6525.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. CoRR abs/1412.6550.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics. pages 1715–1725.

Anders Søgaard, Yoav Goldberg, and Omer Levy. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers. pages 765–774.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. pages 6000–6010.
