
Explicit Semantic Decomposition for Definition Generation

Jiahuan Li∗  Yu Bao∗  Shujian Huang†  Xinyu Dai  Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University, China
{lijh,baoy}@smail.nju.edu.cn, {huangsj,daixinyu,chenjj}@nju.edu.cn

Abstract

Definition generation, which aims to automatically generate definitions for words, has recently been proposed to assist the construction of dictionaries and help people understand unfamiliar texts. However, previous works hardly consider explicitly modeling the "components" of definitions, leading to under-specific generation results. In this paper, we propose ESD, namely Explicit Semantic Decomposition for definition generation, which explicitly decomposes the meaning of words into semantic components and models them with discrete latent variables. Experimental results show that ESD achieves substantial improvements over strong previous baselines on the WordNet and Oxford benchmarks.

1 Introduction

Dictionary definitions, which provide explanatory sentences for word senses, play an important role in natural language understanding for humans. It is common practice to consult a dictionary when encountering unfamiliar words (Fraser, 1999). However, it is often the case that we cannot find satisfying definitions for words that are rarely used or newly created. To assist dictionary compilation and help human readers understand unfamiliar texts, generating definitions automatically is of practical significance.

Noraset et al. (2017) first propose definition modeling, the task of generating the dictionary definition for a given word from its embedding. Gadetsky et al. (2018) extend the work by incorporating disambiguation to generate context-aware word definitions. Both methods adopt a variant of the encoder-decoder architecture, where the word to be defined is mapped to a low-dimensional semantic vector by an encoder, and the decoder is responsible for generating the definition given the semantic vector.

Word        captain
Reference   the person in charge of a ship
Generated   the person who is a member of a ship

Table 1: An example of the definitions of the word "captain". Reference is from the Oxford dictionary and Generated is from the method of Ishiwatari et al. (2019).

Although the existing encoder-decoder architectures (Gadetsky et al., 2018; Ishiwatari et al., 2019; Washio et al., 2019) yield reasonable generation results, they rely heavily on the decoder to extract thorough semantic components of the word, leading to under-specific definition generation results, i.e. missing some semantic components. As illustrated in Table 1, to generate a precise definition of the word "captain", one needs to know that "captain" refers to a person, that "captain" is related to a ship, and that "captain" manages or is in charge of the ship, where person, ship, and manage are three semantic components of the word "captain". However, due to the lack of explicit modeling of these semantic components, the model misses the semantic component "manage" for the word "captain".

Linguists and lexicographers define a word by decomposing its meaning into its semantic components and expressing them in natural language sentences (Wierzbicka, 1996). Inspired by this, Yang et al. (2019) incorporate sememes (Bloomfield, 1949; Dong and Dong, 2003), i.e. minimum units of semantic meaning in human languages, into the task of generating definitions in Chinese. However, it is just as, if not more, time-consuming and expensive to label the components of words than to write definitions manually.

∗ Equal contribution  † Corresponding author

708 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 708–717, July 5–10, 2020. © 2020 Association for Computational Linguistics

In this paper, we propose to explicitly decompose the meaning of words into semantic components for definition generation. We introduce a group of discrete latent variables to model the underlying semantic components, extending the established training techniques for discrete latent variables used in representation learning (Roy et al., 2018) and machine translation tasks (van den Oord et al.,

river and swam to the bank.", then the appropriate definition would be "the side of a river". They extend Eqn. 1 to make use of the given context as follows:

p(D|w∗, C) = ∏_{t=1}^{T} p(d_t | d_{<t}, w∗, C)
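The chain-rule factorization above (each definition token conditioned on the previous tokens, the word, and the context) can be illustrated with a toy conditional model. The bigram probability table below is invented for the example and merely stands in for the neural decoder:

```python
import math

# Hypothetical toy bigram table standing in for the neural decoder's
# conditionals p(d_t | d_{<t}, w*, C); all numbers are invented.
toy_model = {
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "side"): 0.5, ("the", "bank"): 0.5,
    ("side", "of"): 0.9, ("side", "</s>"): 0.1,
    ("of", "a"): 0.8, ("of", "the"): 0.2,
    ("a", "river"): 0.7, ("a", "bank"): 0.3,
    ("river", "</s>"): 1.0,
}

def definition_log_prob(tokens):
    """Chain rule: log p(D) = sum_t log p(d_t | d_{t-1}) under the toy model."""
    lp, prev = 0.0, "<s>"
    for tok in tokens + ["</s>"]:
        lp += math.log(toy_model.get((prev, tok), 1e-9))  # tiny prob for unseen pairs
        prev = tok
    return lp

print(definition_log_prob("the side of a river".split()))
```

Under such a factorization, a definition that fits the context ("the side of a river") receives a much higher log-probability than one built from pairs the model has never seen.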

J_ELBO = E_{qφ(z|w∗, C, D)}[log pθ(D|z, w∗, C)] − KL(qφ(z|w∗, C, D) || pθ(z|w∗, C)) ≤ log pθ(D|w∗, C)   (4)

At the training phase, both the posterior distribution qφ(z|w∗, C, D) and the prior distribution pθ(z|w∗, C) are computed, and z is sampled from the posterior distribution. At the testing phase, due to the lack of D, we only compute the prior distribution pθ(z|w∗, C) and obtain z by applying arg max to it.

Note that for simplicity of notation, we denote qφ(zi|w∗, C, D) and pθ(zi|w∗, C) as qi and pi in the following sections, respectively.

3.2 Model Architecture

As shown in Figure 1, ESD is composed of three modules: an encoder stack, a decoder, and a semantic components predictor. Before detailing each component of ESD, we first give a brief overview of the architecture.

Following the common practice of context-aware definition models (Gadetsky et al., 2018; Ishiwatari et al., 2019), we first encode the source word w∗ to get the word representation r∗.

Context Encoder  We adopt a standard bidirectional LSTM network (Sundermeyer et al., 2012) to encode the context; it takes the word embedding sequence of the context C = c_{1:|C|} and outputs a hidden state sequence H = h_{1:|C|}.

3.2.2 Semantic Components Predictor

For the proposed ESD, we need to model both the semantic components posterior qφ(z|w∗, C, D) and the prior pθ(z|w∗, C).

Semantic Components Posterior Approximator  Exactly modeling the true posterior qφ(z|w∗, C, D) is usually intractable. Therefore, we adopt an approximation method to simplify posterior inference (Zhang et al., 2016). Following the spirit of VAE (Bowman et al., 2016), we use neural networks for better approximation in this paper. Specifically, we first compute the representation H_D = h′_{1:T} of the definition D = d_{1:T} with a bidirectional LSTM network. We then obtain the representations of the definition D and the context C with max-pooling operations:

h_D = max-pooling(h′_{1:T})   (5)
h_C = max-pooling(h_{1:|C|})  (6)

With these representations, as well as the word representation r∗, we compute the posterior approximation qi of zi as follows:

qi = softmax(W_i^q [r∗; h_C; h_D] + b_i^q)

where W_i^q and b_i^q are the parameters of the semantic components posterior approximator.

Semantic Components Prior Model  Similar to the posterior, we model the prior pi of zi with a neural network, using the representation h_C (computed by Eqn. 6) and r∗ as follows:

pi = softmax(W_i^p [r∗; h_C] + b_i^p)

where W_i^p and b_i^p are the parameters of the semantic components prior model.

3.2.3 Definition Decoder

Given the word w∗, the context C, and the semantic component latent variables z, our decoder adopts an LSTM to model the probability of generating the definition D:

p(D|w∗, C, z) = ∏_{t=1}^{T} p(d_t | d_{<t}, w∗, C, z)

Finally, we adopt a GRU-like (Cho et al., 2014) gate mechanism to allow the decoder to dynamically fuse information from the word representation r∗, the context vector c_t, and the semantic context vector o_t, which can be calculated as follows:

f_t = [r∗; c_t; o_t]
u_t = σ(W_u [f_t; s_t] + b_u)
v_t = σ(W_r [f_t; s_t] + b_r)
ŝ_t = tanh(W_s [v_t ⊙ f_t; s_t] + b_s)
s′_t = (1 − u_t) ⊙ s_t + u_t ⊙ ŝ_t

where W∗ and b∗ are weight matrices and bias terms, respectively.

3.3 Learning

The loss function in Eqn. 4 serves as our primary training objective. Besides, since the latent variables are designed to model the semantic components, we propose two auxiliary losses to ensure that these latent variables learn informative codes and capture the decomposed semantics.

Semantic Completeness Objective  In order to generate accurate definitions, the introduced latent variables must capture all perspectives of the word semantics. For example, it is impossible to pre-
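As a concrete illustration of the gate mechanism above, the following pure-Python sketch applies the four gate equations element-wise. The toy dimensions and random weights stand in for the learned parameters W∗ and b∗; this is an illustrative re-implementation, not the authors' code:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # W: list of rows, x: vector
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fuse(f, s, Wu, bu, Wr, br, Ws, bs):
    """GRU-like fusion of feature vector f = [r*; c_t; o_t] into decoder state s."""
    fs = f + s                                                  # concatenation [f; s]
    u = [sigmoid(x + b) for x, b in zip(matvec(Wu, fs), bu)]    # update gate u_t
    v = [sigmoid(x + b) for x, b in zip(matvec(Wr, fs), br)]    # gate v_t over features
    vf = [vi * fi for vi, fi in zip(v, f)]                      # v_t ⊙ f_t
    s_hat = [math.tanh(x + b)
             for x, b in zip(matvec(Ws, vf + s), bs)]           # candidate state ŝ_t
    return [(1 - ui) * si + ui * shi                            # (1 − u_t) ⊙ s_t + u_t ⊙ ŝ_t
            for ui, si, shi in zip(u, s, s_hat)]

random.seed(0)
df, ds = 6, 4                                                   # toy sizes (assumed)
rand_mat = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
Wu, bu = rand_mat(ds, df + ds), [0.0] * ds
Wr, br = rand_mat(df, df + ds), [0.0] * df
Ws, bs = rand_mat(ds, df + ds), [0.0] * ds

f = [random.uniform(-1, 1) for _ in range(df)]                  # stand-in for [r*; c_t; o_t]
s = [random.uniform(-1, 1) for _ in range(ds)]
s_next = fuse(f, s, Wu, bu, Wr, br, Ws, bs)
print(len(s_next))  # the fused state keeps the decoder-state size (4 here)
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, each fused component stays in a bounded range, which is the point of the update gate.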

L_base = −J_ELBO

The first variant of ESD (denoted ESD-def) adds the optimization of semantic completeness and semantic diversity, and is optimized with:

L_ESD-def = L_base + α L_com^(def) + β L_div

Grounded on the annotated sememes, the second variant of ESD (denoted ESD-sem) is optimized with:

L_ESD-sem = L_base + α L_com^(sem) + β L_div

4 Experiments

4.1 Experimental Setting

Datasets  To demonstrate the effectiveness of our method, we conduct experiments on two datasets used in previous work (Ishiwatari et al., 2019): WordNet¹ and Oxford². Each entry in the datasets is a triple of a word, a piece of its usage example, and its corresponding dictionary definition.

¹https://wordnet.princeton.edu/
²https://en.oxforddictionaries.com/

Baselines

1. I-Attention (Gadetsky et al., 2018) uses the context to disambiguate the word embedding and cannot utilize context information at decoding time.

2. LOG-CaD (Ishiwatari et al., 2019) is similar to our architecture, but without modeling the semantic components.

3. Pip-sem is an intuitive pipeline that consists of a sememe predictor and a definition generator. The sememe predictor is trained on HowNet and is responsible for annotating words in the definition generation datasets. The definition generator then generates definitions given the word, the context, and the pseudo sememe annotations.

Metrics  We adopt two automatic metrics that are often used in generation tasks: BLEU (Papineni et al., 2002) and Meteor (Denkowski and Lavie, 2014). BLEU considers the exact match between generation results and references and is the most common metric used to evaluate generation systems. Following previous work, we compute

Model                                   WordNet            Oxford
                                        BLEU    METEOR     BLEU    METEOR
I-Attention (Gadetsky et al., 2018)     23.77   /          17.25   /
LOG-CaD (Ishiwatari et al., 2019)       24.79   /          18.53   /
*LOG-CaD                                24.70   8.66       18.24   8.43
†Pip-sem                                25.52   11.33      19.89   11.10
ESD-def                                 25.75   11.52      19.98   10.79
†ESD-sem                                26.48   12.45      20.86   11.86

Table 2: BLEU and Meteor scores on the WordNet and Oxford datasets. '†' indicates models that incorporate external sememe annotations during training. '*' denotes our reimplementation of the previous model.
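For reference, the flavor of the sentence-level BLEU computation used for the scores above can be sketched as follows. This is a simplified version with add-one smoothing; actual evaluations use standard toolkits with their own smoothing schemes:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision with
    add-one smoothing, geometric mean over orders, and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())  # clipped counts
        total = sum(c_ngrams.values())
        # add-one smoothing keeps the geometric mean finite when overlap is 0
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)

ref = "the person in charge of a ship"
print(sentence_bleu(ref, ref))  # -> 1.0 (identical strings score highest)
print(sentence_bleu("a person who is a member of a ship", ref))
```

An exact copy of the reference scores 1.0, while the under-specific hypothesis is penalized for every higher-order n-gram it fails to match.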

the sentence-level BLEU score. We also consider Meteor (Denkowski and Lavie, 2014), a metric that takes synonyms, stemming, and paraphrases into consideration while calculating the score. The Meteor score is said to favor word choice over word order, and recall over precision (Denkowski and Lavie, 2014). We use the recommended hyperparameters to compute Meteor scores.

4.2 Automatic Evaluation

The results, as measured by the automatic evaluation metrics BLEU and Meteor, are presented in Table 2.

ESD significantly improves the quality of definition generation by a large margin. On all benchmark datasets, our ESD variant that incorporates sememes achieves the best generation performance, in both BLEU and Meteor scores. It is worth noting that the improvement in Meteor is larger than the improvement in BLEU, i.e. 3.79 vs. 1.78 on WordNet, and 3.43 vs. 2.62 on Oxford, indicating that our model is better at recalling semantically correct words. This is consistent with our motivation of addressing the under-specificity problem.

Decomposing semantics is indeed helpful for definition modeling. The models that generate definitions from explicitly decomposed semantics (Pip-sem, ESD-def, and ESD-sem) achieve remarkable improvements over the competitors without decomposed component modeling (I-Attention and LOG-CaD). The comparison between ESD-def, I-Attention, and LOG-CaD is fair because none of them has access to external sememe annotations during training and testing. Notably, ESD-sem also improves over Pip-sem by a large margin. This shows that the way our method leverages sememe annotations, i.e. using them as external signals of word semantics, is more effective than a simple annotate-then-generate pipeline.

4.3 Human Evaluation

To further compare the proposed method with the strongest previous method (the LOG-CaD model), we performed a human evaluation of the generated definitions. We randomly selected 100 samples from the test set of the Oxford dataset, and invited four people with at least CET-6 level English skills to rate the output definitions in terms of fluency and semantic completeness on a scale from 1 to 5. The averaged scores are presented in Table 3. Definitions generated by our method are rated higher in terms of semantic completeness while achieving comparable fluency.

Model      Fluency   Semantic Completeness
LOG-CaD    3.53      3.01
ESD-def    3.55      3.45

Table 3: Human annotated scores on the Oxford dataset.

4.4 Ablation Study

We also perform an ablation study to quantify the effect of the different model components.

     L_base   L_div   L_com^(def)   L_com^(sem)   Meteor
1    ✓                                            8.99
2    ✓        ✓                                   9.15
3    ✓                ✓                           11.09
4    ✓                              ✓             11.88
5    ✓        ✓       ✓                           11.56
6    ✓        ✓                     ✓             12.43
7    ✓        ✓       ✓             ✓             12.87

Table 4: Ablation study on the development set of the Oxford dataset.

Semantic completeness objective  We can see that the semantic completeness objective L_com^(∗) leads to a substantial improvement in Meteor score (Lines 3 and 4 vs. Line 1), which indicates that the gain obtained by our model does not come from trivially adopting the conditional VAE framework for the definition generation task.

Semantic diversity objective  The experimental results show that although using the semantic diversity objective on its own yields no gains (Line 2 vs. Line 1), regularizing the model to learn diverse latent codes in combination with the semantic completeness objective does improve generation performance (Line 5 vs. Line 3 and Line 6 vs. Line 4).

5 Analysis

To gain more insight into the improvement provided by the proposed method, we perform several analyses in this section.

5.1 Influence of the number of components

To validate that explicit decomposition of word semantics is beneficial for definition generation, we compare the performance of models with different numbers of latent variables, and plot the results in Figure 2.

Figure 2: The Meteor scores of ESD on the Oxford test set with different M and K, where M is the number of discrete latent variables used in ESD, and K is the number of categories.

Overall, using multiple latent variables with the same number of categories achieves noticeable improvements over M=1, i.e. an encoder-decoder model with a word prediction mechanism. However, it is not the case that we should adopt as many latent variables as possible. The reason is that a word generally has a limited number of semantic components (3-10 in HowNet), and having too many components in the latent model damages performance. It is interesting to see that when we set the number of components M to 8, the optimal number of categories K is 256. As the total number of semantic units we are modeling is M × K, this approximately equals the number of sememes in HowNet.

5.2 Improvements on different word types

The goal of the definition generation task is to accelerate dictionary compilation or to help humans with unfamiliar text. In both application scenarios, it is more important to generate content words that describe the semantics of the given word than function words or phrases such as "refer to" and "of or relating to". To understand which kinds of words our model achieves the largest improvements on, we evaluate the Meteor scores of the baseline model and our model under different values of δ, where δ is a hyperparameter used by Meteor that controls how much we prefer content words over function words.

Figure 3: Comparison between LOG-CaD and ESD-def with different parameter δ. δ controls how much we prefer content words over function words; a larger δ implies a stronger preference for content words.

Figure 3 shows the results. We can see that as our preference for content words increases, the performance of both the baseline model and our model decreases, indicating that it is more difficult for current definition generation models to generate useful content words than function words. However, the gap between the baseline model and our model grows as δ increases, which shows that the gains of our model come mainly from content words rather than function words.
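The effect of a content-versus-function weight like δ can be illustrated with a toy weighted matching score. This is an illustrative sketch, not the actual Meteor algorithm; the function-word list and the scoring below are invented for the example:

```python
# Toy sketch of a content/function-word weighted match score. Illustrative
# only, NOT the real Meteor computation; the function-word list is invented.
FUNCTION_WORDS = {"a", "an", "the", "of", "or", "to", "who", "is", "in", "and"}

def weight(token, delta):
    # content-word matches count with weight delta, function words with 1 - delta
    return (1.0 - delta) if token in FUNCTION_WORDS else delta

def weighted_match_score(candidate, reference, delta):
    cand, ref = candidate.split(), reference.split()
    ref_set = set(ref)
    matched = [t for t in cand if t in ref_set]
    w_match = sum(weight(t, delta) for t in matched)
    precision = w_match / sum(weight(t, delta) for t in cand)
    recall = w_match / sum(weight(t, delta) for t in ref)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # harmonic mean

ref = "the person in charge of a ship"
hyp = "a person who is a member of ship"
print(weighted_match_score(hyp, ref, 0.2))  # function-word matches dominate
print(weighted_match_score(hyp, ref, 0.8))  # content-word matches dominate
```

Raising delta shifts credit from easy function-word matches ("a", "of", "the") toward the content words ("person", "ship") that actually carry the definition's meaning, which is the axis Figure 3 varies.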

Word       militia
Context    The militia repelled attacks from without and denied the executive the means to oppress from within.
Reference  a group of people who are not professional soldiers but who have had military training and can act as an army
LOG-CaD    a group of people engaged in a military force
ESD-def    a group of people engaged in a military force and not very skillful

Word       captain
Context    The captain gave the order to abandon ship
Reference  the person in charge of a ship
LOG-CaD    a person who is a member of ship
ESD-def    a person who is the leader of a ship

Table 5: Examples from LOG-CaD and ESD-def. We highlight the parts that differ between the two models in red.
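The per-example differences highlighted in Table 5 can also be located mechanically. A small token-level sketch using Python's standard difflib, applied to the two systems' outputs for "captain":

```python
import difflib

# Token-level diff between the two systems' outputs for "captain" (Table 5).
log_cad = "a person who is a member of ship".split()
esd_def = "a person who is the leader of a ship".split()

sm = difflib.SequenceMatcher(a=log_cad, b=esd_def)
for op, a0, a1, b0, b1 in sm.get_opcodes():
    if op != "equal":
        print(op, log_cad[a0:a1], "->", esd_def[b0:b1])
```

The non-equal opcodes isolate exactly the span where the models diverge ("member" versus "leader"), which is the kind of difference the red highlighting in Table 5 marks by hand.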

word      z1  z2  z3  z4  z5  z6  z7  z8
red       54  7B  9C  60  A1  A7  F5  C7
yellow    54  92  7F  22  A1  A7  F5  55
blue      6A  E5  7F  22  A1  A7  F5  C7
cat       7A  E3  C4  22  A1  A7  F5  3B
dog       7A  43  C4  60  A1  A7  F5  3B
penguin   7A  C3  C4  60  A1  BE  F5  3B

Table 6: Examples of the learned latent codes. Each line shows a word with the hexadecimal identifiers of its corresponding latent codes. Color words like "red", "yellow", and "blue" share most parts of their latent codes with each other, while words from different groups like "red" and "cat" share fewer parts.

5.3 Case Studies

Examples of learned latent codes  In Table 6, we show some examples of latent codes learned on the WordNet dataset. We can see that our model does learn informative codes: words with similar meanings are assigned similar latent codes, while codes of words with different meanings tend to differ.

Examples of generated definitions  We also list several generation samples in Table 5. The definitions generated by our method are more semantically complete than those of previous works, and they indeed capture fine-grained semantic components that the baseline model ignores. For example, it is necessary to know that a militia has unprofessional military skills, which distinguishes the meaning of militia from that of army. The definition generated by the baseline model ignores this perspective, while our model does describe the unprofessional nature of militia by generating "not very skillful", thanks to its ability to model fine-grained semantic components.

6 Related Work

Definition Generation  Definition modeling was first proposed by Noraset et al. (2017). They take a word embedding as input and generate a definition of the word. An obvious drawback is that their model cannot handle polysemous words. Recently, several works (Ni and Wang, 2017; Gadetsky et al., 2018; Ishiwatari et al., 2019) consider the context-aware definition generation task, where a context is introduced to disambiguate the senses of words. They all adopt an encoder-decoder architecture and rely heavily on the decoder to extract the semantic components of the word, thus leading to under-specific definitions. In contrast, we introduce a group of discrete latent variables to model these semantic components explicitly.

Semantic Decomposition and Decomposed Semantics  It is recognized by linguists that human beings understand complex meanings by decomposing them into components that are latent in the meaning. Wierzbicka (1996) proposes that different languages share a set of atomic concepts that cannot be further decomposed, i.e. semantic primitives, and that all complex concepts can be semantically composed from these primitives. Dong and Dong (2003) introduce a similar idea. They call the atomic concepts sememes, and present a knowledge base, HowNet, in which the senses of words are annotated with sememes. HowNet has been shown to be helpful for many NLP tasks, such as word representation learning (Niu et al., 2017), relation extraction (Li et al., 2019), and aspect extraction (Luo et al., 2019). Previously, Yang et al. (2019) proposed to use sememe annotations as a direct input when generating definitions, which can suffer from the data sparsity problem. In this paper, we instead leverage HowNet as an external supervising signal for the latent variables during training, and try to learn the knowledge into the model itself.

7 Conclusion

We proposed ESD, a context-aware definition generation model that explicitly models the decomposed semantics of words. Specifically, we model the decomposed semantics as discrete latent variables, and train with auxiliary losses to ensure that the model learns informative latent codes for definition modeling. As a result, ESD leads to significant improvements over previous strong baselines on two established definition datasets. Quantitative and qualitative analyses showed that our model can generate more meaningful, specific, and accurate definitions.

In future work, we plan to seek better ways to guide the learning of the latent variables, such as using the dynamic routing (Sabour et al., 2017) method to align the latent variables and sememes, and to learn more explainable latent codes.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments. Shujian Huang is the corresponding author. This work is supported by the National Science Foundation of China (No. U1836221, 61772261) and the Jiangsu Provincial Research Foundation for Basic Research (No. BK20170074).

References

Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. In ACL, pages 6008–6019.

Leonard Bloomfield. 1949. A set of postulates for the science of language. IJAL, 15(4):195–202.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL, pages 10–21.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Zhendong Dong and Qiang Dong. 2003. HowNet - a hybrid language and knowledge resource. In NLPKE, pages 820–824.

Carol Fraser. 1999. The role of consulting a dictionary in reading and vocabulary learning. Canadian Journal of Applied Linguistics, 2:73–89.

Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional generators of words definitions. In ACL, pages 266–271.

Cliff Goddard and Anna Wierzbicka. 1994. Semantic and Lexical Universals: Theory and Empirical Findings, volume 25. John Benjamins Publishing.

Shonosuke Ishiwatari, Hiroaki Hayashi, Naoki Yoshinaga, Graham Neubig, Shoetsu Sato, Masashi Toyoda, and Masaru Kitsuregawa. 2019. Learning to describe unknown phrases with local and global contexts. In NAACL, pages 3467–3476.

Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2019. Disentangled representation learning for text style transfer. In ACL.

Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. In ICML, pages 2395–2404.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105.

Ziran Li, Ning Ding, Zhiyuan Liu, Haitao Zheng, and Ying Shen. 2019. Chinese relation extraction with multi-grained information and external linguistic knowledge. In ACL, pages 4377–4386.

Ling Luo, Xiang Ao, Yan Song, Jinyao Li, Xiaopeng Yang, Qing He, and Dong Yu. 2019. Unsupervised neural aspect extraction with sememes. In IJCAI, pages 5123–5129.

Ke Ni and William Yang Wang. 2017. Learning to explain non-standard English words and phrases. CoRR, abs/1709.09254.

Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Improved word representation learning with sememes. In ACL, pages 2049–2058.

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In AAAI.

Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Aurko Roy, Ashish Vaswani, Niki Parmar, and Arvind Neelakantan. 2018. Towards a better understanding of vector quantized autoencoders. arXiv.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. CoRR, abs/1710.09829.

Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2019. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In AAAI.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In InterSpeech.

Koki Washio, Satoshi Sekine, and Tsuneaki Kato. 2019. Bridging the defined and the defining: Exploiting implicit lexical semantic relations in definition modeling. In EMNLP-IJCNLP, pages 3519–3525.

Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen. 2017. Neural machine translation with word predictions. In EMNLP, pages 136–145.

Anna Wierzbicka. 1996. Semantics: Primes and Universals. Oxford University Press.

Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. 2017. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4200–4206. AAAI Press.

Liner Yang, Cunliang Kong, Yun Chen, Yang Liu, Qinan Fan, and Erhong Yang. 2019. Incorporating sememes into Chinese definition modeling. arXiv preprint arXiv:1905.06512.

Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In EMNLP.
