
Auto-Encoding Dictionary Definitions into Consistent Word Embeddings

Tom Bosc
Mila, Université de Montréal
[email protected]

Pascal Vincent
Mila, Université de Montréal, CIFAR, Facebook AI Research
[email protected]

Abstract

Monolingual dictionaries are widespread and semantically rich resources. This paper presents a simple model that learns to compute word embeddings by processing dictionary definitions and trying to reconstruct them. It exploits the inherent recursivity of dictionaries by encouraging consistency between the representations it uses as inputs and the representations it produces as outputs. The resulting embeddings are shown to capture semantic similarity better than regular distributional methods and other dictionary-based methods. In addition, the method shows strong performance when trained exclusively on dictionary data and generalizes in one shot.

1 Introduction

Dense, low-dimensional, real-valued vector representations of words known as word embeddings have proven very useful for NLP tasks (Turian et al., 2010). They can be learned as a by-product of solving a particular task (Collobert et al., 2011). Alternatively, one can pretrain generic embeddings based on co-occurrence counts or using an unsupervised criterion such as predicting nearby words (Bengio et al., 2003; Mikolov et al., 2013). These methods implicitly rely on the distributional hypothesis (Harris, 1954; Sahlgren, 2008), which states that words that occur in similar contexts tend to have similar meanings.

It is common to study the relationships captured by word representations in terms of either similarity or relatedness (Hill et al., 2016). "Coffee" is related to "cup" as coffee is a beverage often drunk in a cup, but "coffee" is not similar to "cup" in that coffee is a beverage and cup is a container. Methods relying on the distributional hypothesis often capture relatedness very well, reaching human performance, but fare worse in capturing similarity and especially in distinguishing it from relatedness (Hill et al., 2016).

It is useful to specialize word embeddings to focus on either relation in order to improve performance on specific downstream tasks. For instance, Kiela et al. (2015) report that improvements on relatedness benchmarks also yield improvements on document classification. In the other direction, embeddings learned by neural machine translation models capture similarity better than distributional unsupervised objectives (Hill et al., 2014).

There is a wealth of methods that postprocess embeddings to improve or specialize them, such as retrofitting (Faruqui et al., 2014). On similarity benchmarks, they are able to reach correlation coefficients close to inter-annotator agreements. But these methods rely on additional resources such as paraphrase databases (Wieting et al., 2016) or graphs of lexical relations such as synonymy, hypernymy, and their converse (Mrkšić et al., 2017).

Rather than relying on such curated lexical resources, which are not readily available for the majority of languages, we propose a method capable of improving embeddings by leveraging the more common resource of monolingual dictionaries.1 Lexical databases such as WordNet (Fellbaum, 1998) are often built from dictionary definitions, as was proposed earlier by Amsler (1980). We propose to bypass the process of explicitly building a lexical database – during which information is structured but information is also lost – and instead directly use its detailed source: dictionary definitions. The goal is to obtain better representations for more languages with less effort. The ability to process new definitions is also desirable for future natural language understanding systems.

1 See Appendix A for a list of online monolingual dictionaries.
In a dialogue, a human might want to explain a new term in his or her own words, and the system should understand it. Similarly, question-answering systems should also be able to grasp definitions of technical terms that often occur in scientific writing.

We expect the embedding of a word to represent its meaning compactly. For interpretability purposes, it would be desirable to be able to generate a definition from that embedding, as a simple way to verify what information it captured. Case in point: to analyse word embeddings, Noraset et al. (2017) used RNNs to produce definitions from pretrained embeddings, manually annotated the errors in the generated definitions, and found that more than half of the wrong definitions fit either the antonyms of the defined words, their hypernyms, or related but different words. This points in the same direction as the results of intrinsic evaluations of word embeddings: lexical relationships such as lexical entailment, similarity and relatedness are conflated in these embeddings. It also suggests a new criterion for evaluating word representations, or even learning them: they should contain the necessary information to reproduce their definition (to some degree).

In this work, we propose a simple model that exploits this criterion. The model consists of a definition autoencoder: an LSTM encoder processes the definition of a word to yield its corresponding word embedding. Given this embedding, the decoder attempts to reconstruct the bag-of-words representation of the definition. Importantly, to address and leverage the recursive nature of dictionaries – the fact that words that are used inside a definition have their own associated definition – we train this model with a consistency penalty that ensures proximity between the embeddings produced by the LSTM and those that are used by the LSTM.

[Figure 1: Overview of the CPAE model. The LSTM encoder turns a definition (e.g. "prove: provide evidence for") into a definition embedding, which is trained both to minimize the reconstruction error of the definition and to minimize a consistency penalty with respect to the input embeddings of the language model.]

Our approach is self-contained: it yields good representations when trained on nothing but dictionary data. Alternatively, it can also leverage existing word embeddings and is then especially apt at specializing them for the similarity relation. Finally, it is also extremely data-efficient, as it permits creating representations of new words in one shot from a short definition.

2 Model

2.1 Setting and motivation

We suppose that we have access to a dictionary that maps words to one or several definitions. Definitions themselves are sequences of words. Our training criterion is built on the following principle: we want the model to be able to recover the definition from which the representation was built. This objective should produce similar embeddings for words which have similar definitions. Our hypothesis is that this will help capture semantic similarity, as opposed to relatedness. Reusing the previous example, "coffee" and "cup" should get different representations in virtue of having very different definitions, while "coffee" and "tea" should get similar representations as they are both defined as beverages and plants.

We chose to compute a single embedding per word in order to avoid having to disambiguate word senses. Indeed, word sense disambiguation remains a challenging open problem with mixed success on downstream task applications (Navigli, 2009). Also, recent papers have shown that a single word vector can capture polysemy and that having several vectors per word is not strictly necessary (Li and Jurafsky, 2015; Yaghoobzadeh and Schütze, 2016). Thus, when a word has several definitions, we concatenate them to produce a single embedding.
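To make the setting concrete, here is a minimal Python sketch (illustrative only, not the released code; the toy dictionary and function names are ours) of the data interface assumed above: a mapping from defined words to lists of tokenized definitions, and the concatenation of a word's definitions into the single token sequence fed to the encoder.

    # Hypothetical toy dictionary: each defined word maps to a list of definitions,
    # and each definition is a list of word tokens.
    dictionary = {
        "coffee": [["a", "beverage", "made", "from", "roasted", "seeds"],
                   ["a", "seed", "of", "the", "coffee", "tree"]],
        "cup": [["a", "small", "open", "container", "used", "for", "drinking"]],
    }

    def definition_sequence(word):
        """Concatenate all definitions of a word into one token sequence."""
        tokens = []
        for definition in dictionary[word]:
            tokens.extend(definition)
        return tokens

    print(definition_sequence("coffee"))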
2.2 Autoencoder model

Let V^D be the set of all words that are used in definitions and V^K the set of all words that are defined. We let w ∈ V^K be a word and D_w = (D_{w,1}, ..., D_{w,T}) be its definition, where D_{w,t} is the index of a word in vocabulary V^D. We encode such a definition D_w by processing it with an LSTM (Hochreiter and Schmidhuber, 1997). The LSTM is parameterized by Ω and a matrix E of size |V^D| × m, whose i-th row E_i contains an m-dimensional input embedding for the i-th word of V^D. These input embeddings can either be learned by the model or be fixed to a pretrained embedding. The last hidden state computed by this LSTM is then transformed linearly to yield an m-dimensional definition embedding h. Thus the encoder, whose parameters are θ = {E, Ω, W, b}, computes this embedding h as

    h = f_θ(D_w) = W \, LSTM_{E,Ω}(D_w) + b.

The subsequent decoder can be seen as a conditional language model trained by maximum likelihood to regenerate definition D_w given definition embedding h = f_θ(D_w). We use a simple conditional unigram model with a linear parametrization θ' = {E', b'}, where E' is a |V^D| × m matrix and b' is a bias vector.2 We maximize the log-probability of definition D_w under that model:

    log p_{θ'}(D_w | h) = Σ_t log p_{θ'}(D_{w,t} | h)
                        = Σ_t log [ exp(⟨E'_{D_{w,t}}, h⟩ + b'_{D_{w,t}}) / Σ_k exp(⟨E'_k, h⟩ + b'_k) ]     (1)

where ⟨·,·⟩ denotes an ordinary dot product. We call E' the output embedding matrix. The basic autoencoder training objective to minimize over the dictionary can then be formulated as

    J_r(θ', θ) = − Σ_{w ∈ V^K} log p_{θ'}(D_w | f_θ(D_w)).

2 We have tried using an LSTM decoder but it didn't yield good representations. It might overfit because the set of dictionary definitions is small. Also, using teacher forcing, we condition on ground-truth words, making it easier to predict the next words. More work is needed to address these issues.

2.3 Consistency penalty

We introduced three different embeddings: a) definition embeddings h, produced by the definition encoder, are the embeddings we are ultimately interested in computing; b) input embeddings E are used by the encoder as inputs; c) output embeddings E' are compared to definition embeddings to yield a probability distribution over the words in the definition. We propose a soft weight-tying scheme that brings the input embeddings closer to the definition embeddings. We call this term a consistency penalty because its goal is to ensure that the embeddings used by the encoder (input embeddings) and the embeddings produced by the encoder (definition embeddings) are consistent with each other. It is implemented as

    J_p(θ) = Σ_{w ∈ V^D ∩ V^K} d(E_w, f_θ(D_w))

where d is a distance. In our experiments, we choose d to be the Euclidean distance. The penalty is only applied to some words because V^D ≠ V^K. Indeed, some words are defined but are not used in definitions, and some words are used in definitions but not defined. In particular, inflected words are not defined. To balance the two terms, we introduce two hyperparameters λ, α ≥ 0 and the complete objective is

    J(θ', θ) = α J_r(θ', θ) + λ J_p(θ).

We call the model CPAE, for Consistency Penalized AutoEncoder, when α > 0 and λ > 0 (see Figure 1).3

The consistency penalty is a cheap proxy for dealing with the circularity found in dictionary definitions. We want the embeddings of the words in definitions to be compositionally built from their definition as well. The recursive process of fetching definitions of words in definitions does not terminate, because all words are defined using other words. To counter that, our model uses input embeddings that are brought closer to definition embeddings and vice versa in an asynchronous manner.

Moreover, if λ is chosen large enough, then E_w ≈ f_θ(D_w) after optimisation. This means that the definition embedding for w is close enough to the corresponding input embedding to be used by the encoder for producing definition embeddings for other words. In that case, the model could enrich its vocabulary by computing embeddings for new words and consistently reusing them as inputs for defining other words.

Finally, the consistency penalty can be used to leverage pretrained embeddings and bootstrap the learning process. For that purpose, the encoder's input embeddings E can be fixed to pretrained embeddings. These provide targets to the encoder, but they also help the encoder produce better definition embeddings in virtue of using input embeddings that already contain meaningful information.

To summarize, the consistency penalty has several motivations. Firstly, it deals with the fact that the recursive process of building representations of words out of definitions does not terminate. Secondly, it is a way to enrich the vocabulary with new words dynamically. Finally, it is a way to integrate prior knowledge in the form of pretrained embeddings.

In order to study the two terms of the objective in isolation, we introduce two special cases. When λ = 0 and α > 0, the model reduces to AE, for Autoencoder. When α = 0 and λ > 0, we retrieve Hill's model, as presented by Hill et al. (2015).4 Hill's model is simply a recurrent encoder that uses pretrained embeddings as targets, so it only makes sense in the case where we use fixed pretrained embeddings.

3 Our implementation is available at https://github.com/tombosc/cpae

4 It is not exactly their model as we use Euclidean distance instead of the cosine distance or the ranking loss. They also explore several variants where the input embeddings are learned, which we didn't find to produce any improvement. We haven't experimented with the ranking loss, but the cosine distance does not seem to improve over Euclidean. Finally, they also use a simple encoder that averages word vectors, which we found to be inferior.
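The objective can be summarised in a few lines of code. The following PyTorch sketch is illustrative only (the released implementation is Theano-based and differs in many details; padding, batching over variable-length definitions, and the option of freezing E are all ignored here). It shows the LSTM encoder f_θ, the conditional unigram decoder of Equation (1), and the combined objective α J_r + λ J_p with a Euclidean consistency penalty.

    # A minimal PyTorch sketch of the CPAE objective (illustrative, not the released code).
    import torch
    import torch.nn as nn

    class CPAE(nn.Module):
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.input_emb = nn.Embedding(vocab_size, dim)    # E
            self.lstm = nn.LSTM(dim, dim, batch_first=True)   # LSTM_{E, Omega}
            self.proj = nn.Linear(dim, dim)                   # W, b
            self.output_emb = nn.Linear(dim, vocab_size)      # E', b'

        def encode(self, definition):
            # definition: (batch, T) word indices; returns h = f_theta(D_w)
            states, _ = self.lstm(self.input_emb(definition))
            return self.proj(states[:, -1])

        def loss(self, words, definitions, alpha=1.0, lam=8.0):
            h = self.encode(definitions)                            # definition embeddings
            log_probs = torch.log_softmax(self.output_emb(h), -1)   # unigram decoder
            # J_r: negative log-likelihood of every token of the definition (bag of words)
            recon = -log_probs.gather(1, definitions).sum(1).mean()
            # J_p: Euclidean distance between input and definition embeddings
            consistency = (self.input_emb(words) - h).pow(2).sum(1).sqrt().mean()
            return alpha * recon + lam * consistency

    model = CPAE(vocab_size=1000, dim=300)
    words = torch.randint(0, 1000, (4,))           # indices of the defined words
    definitions = torch.randint(0, 1000, (4, 12))  # toy padded definitions, 12 tokens each
    print(model.loss(words, definitions))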
3 Related work

3.1 Extracting lexical knowledge from dictionaries

There is a long history of attempts to extract and structure the knowledge contained in dictionaries. Amsler (1980) studies the possibility of automatically building taxonomies out of dictionaries, relying on the syntactic and lexical regularities that definitions display. One relation is particularly straightforward to identify: the "is a" relation, which translates to hypernymy. Dictionary definitions often contain a genus, which is the hypernym of the defined word, as well as a differentia, which differentiates the hypernym from the defined word. For example, the word "hostage" is defined as "a prisoner who is held by one party to insure that another party will meet specified terms", where "prisoner" is the genus and the rest is the differentia.

To extract such relations, early works by Chodorow et al. (1985) and Calzolari (1984) use string matching heuristics. Binot and Jensen (1987) operate at the syntactic parse level to detect these relations. Whether based on the string representation or the parse tree of a definition, these rule-based systems have helped to create large lexical databases. We aim to reduce the manual labor involved in designing the rules and to obtain representations directly from raw definitions.

3.2 Improving word embeddings using lexical resources

Postprocessing methods for word embeddings use lexical resources to improve already trained word embeddings, irrespective of how they were obtained. When it is used with fixed pretrained embeddings, our method can be seen as a postprocessing method.

Postprocessing methods typically have two terms for trading off the conservation of distributional information brought by the original vectors against the new information from lexical resources. There are two main ways to preserve distributional information: Attract-Repel (Vulić and Mrkšić, 2017), retrofitting (Mrkšić et al., 2017) and our method control the distance between the original vector and the postprocessed vector so that the new vector does not drift too far away from the original vector. Counter-Fitting (Mrkšić et al., 2016) and dict2vec (Tissier et al., 2017) ensure that the neighbourhood of a vector in the original space is roughly the same as the neighbourhood in the new space.

Finally, methods differ by the nature of the lexical resources they use. To our knowledge, dict2vec is the only technique that uses dictionaries. Other postprocessing methods use various data from WordNet: sets of synonyms and sometimes antonyms, hypernyms, and hyponyms. For instance, Lexical Entailment Attract-Repel (LEAR) uses all of these (Vulić and Mrkšić, 2017). Other methods rely on paraphrase databases (Wieting et al., 2016).
3.3 Dictionaries and word embeddings

We now turn to the most relevant works that involve dictionaries and word embeddings.

Dict2vec (Tissier et al., 2017) combines the skip-gram objective (predicting all the words that appear in the context of a target word) with a cost for predicting related words. These related words form either strong pairs or weak pairs with the target word. Strong pairs have a greater influence in the cost. They are pairs of words that are in the neighbourhood of the target word in the original embedding, as well as pairs of words whose definitions make reference to each other. Weak pairs are pairs of words where only one word appears in the definition of the other. Unlike dict2vec, our method can be used as either a standalone or a postprocessing method (when used with pretrained embeddings). It also focuses on handling and leveraging the recursivity found in dictionary definitions with the consistency penalty, whereas dict2vec ignores this aspect of the structure of dictionaries.
Besides dict2vec, Hill et al. (2015) train neural language models to predict a pretrained word embedding given a definition. Their goal was to learn a general-purpose sentence encoder useful for downstream tasks. Noraset et al. (2017) propose the task of generating definitions based on word embeddings for interpretability purposes. Our model unifies these two approaches into an autoencoder. However, we have a different goal: that of creating or improving word representations. Their methods assume that pretrained embeddings are available to provide either targets or inputs, whereas our model is unsupervised, and the use of pretrained embeddings is optional.

Bahdanau et al. (2017) present a related model that produces embeddings from definitions such that they improve performance on a downstream task. By contrast, our approach is used either stand-alone or as a postprocessing step, to produce general-purpose embeddings at a lesser computational cost. The core novelty is the way we leverage the recursive structure of dictionaries.

Finally, Herbelot and Baroni (2017) also aim at learning word embeddings in a few shots. Their method consists of fine-tuning word2vec hyperparameters and can learn in one or several passes, but it is not specifically designed to handle dictionary definitions.

4 Experiments

4.1 Setup

We experiment on English to benefit from the many evaluation benchmarks available. The dictionary we use is that of WordNet (Fellbaum, 1998). WordNet contains graphs of linguistic relations such as synonymy, antonymy, hyponymy, etc., but also definitions. We emphasize that our method trains exclusively on the definitions and is thus applicable to any electronic dictionary. However, in order to evaluate the quality of embeddings on unseen definitions, WordNet relations come in handy: we use the sets of synonyms to split the dictionary into a train set and a test set, as explained in Section 7. Moreover, WordNet has a wide coverage and high quality, so we do not need to aggregate several dictionaries as done by Tissier et al. (2017). Finally, WordNet is explicitly made available for research purposes, therefore we avoid the technical and legal difficulties associated with crawling proprietary online dictionaries.

We do not include the part-of-speech tags that go with definitions. WordNet does not contain function words but contains homonyms of function words. We filter these out.

4.2 Similarity and relatedness benchmarks

Evaluating the learned representations is a complex issue (Faruqui et al., 2016). Indeed, different evaluation methods yield different rankings of embeddings: there is no single embedding that outperforms others on all tasks (Schnabel et al., 2015) and thus no single best evaluation method.

We focus on intrinsic evaluation methods. In particular, we study how different models trade off similarity and relatedness. We use benchmarks which consist of pairs of words scored according to some criteria. They vary in terms of annotation guidelines, number of annotators, selection of the words, etc. To evaluate our embeddings, we score each pair by computing the cosine similarity between the corresponding word vectors. Then the predicted scores and the ground truth are ranked and the correlation between the ranks is measured by Spearman's ρ. We leave aside analogy prediction benchmarks as they suffer from many problems (Linzen, 2016; Rogers et al., 2017).

We adopt one of the methods proposed by Faruqui et al. (2016) and use separate datasets for model selection. We choose the development set to be the development sets of SimVerb3500 (Gerz et al., 2016) and MEN (Bruni et al., 2014), the only benchmarks with a standard train/test split.

We justified our emphasis on the similarity relation in Section 1: capturing this relation remains a challenge, and we hypothesize that dictionary data should improve representations in that respect. The model selection procedure reflects that we want embeddings specialized in similarity. To do that, we set the validation loss to be a weighted mean which weights SimVerb twice as much as MEN.
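As a concrete illustration of this evaluation protocol (a sketch with toy data, not the benchmark code itself), each pair is scored by cosine similarity and compared to the human ratings with Spearman's ρ:

    # Sketch of the intrinsic evaluation: cosine similarity per pair, Spearman's rho overall.
    import numpy as np
    from scipy.stats import spearmanr

    def evaluate(embeddings, pairs):
        """embeddings: dict word -> vector; pairs: list of (w1, w2, human_score)."""
        predicted, gold = [], []
        for w1, w2, score in pairs:
            v1, v2 = embeddings.get(w1), embeddings.get(w2)
            if v1 is None or v2 is None:   # missing words get a null vector,
                predicted.append(0.0)      # treated here as zero similarity
            else:
                predicted.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
        return spearmanr(predicted, gold).correlation

    toy = {"coffee": np.array([1.0, 0.0]), "tea": np.array([0.9, 0.1]), "cup": np.array([0.0, 1.0])}
    print(evaluate(toy, [("coffee", "tea", 9.0), ("coffee", "cup", 3.0)]))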
4.3 Baselines

The objective function presented in Section 2 gives us three different models: CPAE, AE, and Hill's model. The objective of CPAE comprises the sum of the objective of Hill's model and that of AE. We compare the CPAE model to both of these to evaluate the individual contribution of the two terms to the performance. In addition, when we use external corpora to pretrain embeddings, we compare these models to dict2vec and retrofitting. The hyperparameter search is described in Appendix C.

The test benchmarks for the similarity relation include SimLex999 (Hill et al., 2016) and more particularly SimLex333, a challenging subset of SimLex999 which contains only highly related pairs but in which similarity scores vary a lot. For relatedness, we use MEN (Bruni et al., 2014), RG (Rubenstein and Goodenough, 1965), WS353 (Finkelstein et al., 2001), SCWS (Huang et al., 2012), and MTurk (Radinsky et al., 2011; Halawi et al., 2012). The evaluation is carried out with a modified version of the Word Embeddings Benchmarks project.5 Conveniently, all these benchmarks contain mostly lemmas, so we do not suffer too much from the problem of missing words.6

5 Original project available at https://github.com/kudkudak/word-embeddings-benchmarks; the modified version is distributed with our code.

6 Missing words are not removed from the dataset, but they are assigned a null vector.

5 Results in the dictionary-only setting

In the first evaluation round, we train models using only a single monolingual dictionary. This allows us to check our hypothesis that dictionaries contain information for capturing the similarity relation between words.

Our baselines are regular distributional models: GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013). They are trained on the concatenation of defined words with their definitions. Such a formatting introduces spurious co-occurrences that do not otherwise appear in free text. But these baselines are not designed for dictionaries and cannot deal with their particular structure.
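For reference, a hedged sketch of this baseline with gensim (assuming the gensim 4.x API, whose parameter names differ from older releases; the two-line corpus is a toy stand-in for the real concatenated definitions):

    # Train skip-gram word2vec on lines of the form "<defined word> <definition tokens...>".
    from gensim.models import Word2Vec

    sentences = [
        ["coffee", "a", "beverage", "made", "from", "roasted", "seeds"],
        ["cup", "a", "small", "open", "container", "used", "for", "drinking"],
    ]
    model = Word2Vec(sentences, vector_size=300, window=5, sg=1,
                     min_count=1, sample=1e-3, epochs=5)
    print(model.wv["coffee"][:5])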
We compare these models to the autoencoder model without (AE) and with (CPAE) the consistency penalty. In this setting, we cannot use Hill's model as it requires pretrained embeddings as targets. We also trained an additional CPAE model with pretrained word2vec embeddings trained on the concatenated definitions. The results are presented in Table 1.

GloVe is outperformed by word2vec by a large margin, so we ignore this model in later experiments. Word2vec captures more relatedness than CPAE (+10.7 on MEN-t, +13.5 on MT, +13.2 on WS353) but less similarity than CPAE. The difference in the nature of the relations captured is exemplified by the scores on SimLex333. This subset of SimLex999 focuses on pairs of words that are very related but that can be either similar or dissimilar. On this subset, CPAE fares better than word2vec (+13.1).

The consistency penalty improves performance on every dataset. This penalty provides targets to the encoder, but these targets are themselves learned and change during the learning process. The exact dynamics of the system are unknown. It can be seen as a regularizer because it puts strong weight-sharing constraints on both types of embeddings. It also resembles bootstrapping in reinforcement learning, which consists of building estimates of value functions on top of other estimates (Sutton and Barto, 1998).

The last model is the CPAE model that uses the word2vec embeddings pretrained on the dictionary data. This combination not only equals other models on some benchmarks but outperforms them, sometimes by a large margin (+6.3 on SimLex999 and +7.5 on SimVerb3500 compared to CPAE, +6.1 on SCWS, +5.4 on MT compared to word2vec). Thus, the two kinds of algorithms are complementary through the different relationships that they capture best. The pretraining helps in two different ways, by providing quality input embeddings and targets to the encoder. The pretrained word2vec targets are already remarkably good. That is why the selected consistency penalty coefficient is very high (λ = 64). The model can pay a small cost and deviate from the targets in order to encode information about the definitions.

To sum up, dictionary data contains a lot of information relevant to modeling the similarity relationship. Autoencoder-based models learn different relationships than regular distributional methods. The consistency penalty is a very helpful prior and regularizer for dictionary data, as it always helps, regardless of what relationship we focus on. Finally, our model can drastically improve embeddings that were trained on the same data but with a different algorithm.

                 Development     Similarity              Relatedness
                 SV-d   MEN-d    SL999  SL333  SV-t      RG    SCWS  MEN-t  MT    353
GloVe            12.0   54.8     19.8   -9.1   7.8       57.5  46.8  57.0   49.4  44.4
word2vec         35.2   62.3     34.5   16.0   36.4      65.7  54.5  59.9   56.1  61.9
AE               34.9   42.7     35.6   26.8   32.5      64.8  50.2  42.2   38.6  41.4
CPAE (λ = 8)     42.8   48.5     39.5   29.1   34.8      67.1  54.3  49.2   42.6  48.7
CPAE-P (λ = 64)  44.1   65.1     45.8   30.9   42.3      72.0  60.4  63.8   61.5  61.3

Table 1: Positive effect of the consistency penalty and word2vec pretraining. Spearman's correlation coefficient ρ × 100 on benchmarks. Without pretraining, the autoencoders (AE and CPAE) improve on similarity benchmarks while capturing less relatedness than distributional methods. The consistency penalty (CPAE) helps even without pretrained targets. Our method, combined with embeddings pretrained on the same dictionary data (CPAE-P), significantly improves on every benchmark.

6 Improving pretrained embeddings

We have seen that CPAE with pretraining is very efficient. But does this result generalize to other kinds of pretraining data? To answer this question, we experiment using embeddings pretrained on the first 50 million tokens of a Wikipedia dump, as well as on the entire Wikipedia dump. We compare our method to existing postprocessing methods such as dict2vec and retrofitting, which also aim at improving embeddings with external lexical resources.

Retrofitting, which operates on graphs, is not tailored for dictionary data, which consists of pairs of words along with their definitions. We build a graph where nodes are words and edges between nodes correspond to the presence of one of the words in the definition of another. Obviously, we lose word order in the process.

The results for the small corpus are presented in Table 2. By comparing Table 2 with Table 1, we see that word2vec does worse on similarity than when trained on dictionary data, but better on relatedness. Both dict2vec and retrofitting improve over word2vec on similarity benchmarks and seem roughly on par. However, dict2vec fails to improve on relatedness benchmarks, whereas retrofitting sometimes improves (as on RG, MEN, and MT), sometimes equals (SCWS) and sometimes does worse (WS353).

We do an ablation study by comparing Hill's model and AE with CPAE. Recall that Hill's model lacks the reconstruction cost while AE lacks the consistency penalty. Firstly, CPAE always improves over AE. Thus, we confirm the results of the previous section on the importance of the consistency penalty. In this setting, it is more obvious why the penalty helps, as it now provides pretrained targets to the encoder. Secondly, CPAE improves over Hill's model on all similarity benchmarks by a large margin (+12.2 on SL999, +13.7 on SL333, +16.1 on SV3500). It is sometimes slightly worse on relatedness benchmarks (−3.3 on MEN-t, −5.6 on MT), other times better or equal. We conclude that both terms of the CPAE objective matter.

We see identical trends when using the full Wikipedia dump. As expected, CPAE can still improve over even higher quality embeddings by roughly the same margins. The results are presented in Appendix D.

Remarkably, the best model among all our experiments is CPAE-P in Table 1, which uses only the dictionary data. This supports our hypothesis that dictionaries contain similarity-specific information.
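The following sketch (illustrative; the function and variable names are ours, not from the retrofitting code) shows how the graph described in the previous section can be built from a toy dictionary, with one undirected edge per (defined word, definition word) pair:

    # Build the word graph handed to retrofitting: nodes are words, and an edge links
    # a defined word to every word appearing in one of its definitions (word order is lost).
    from collections import defaultdict

    def dictionary_to_graph(dictionary):
        """dictionary: word -> list of definitions (each a list of tokens)."""
        edges = defaultdict(set)
        for word, definitions in dictionary.items():
            for definition in definitions:
                for token in definition:
                    if token != word:
                        edges[word].add(token)
                        edges[token].add(word)   # undirected edge
        return edges

    toy = {"coffee": [["a", "beverage", "made", "from", "roasted", "seeds"]]}
    print(sorted(dictionary_to_graph(toy)["coffee"]))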
7 Generalisation on unseen definitions

A model that uses definitions to produce word representations is appealing because it could be extremely data-efficient. Unlike regular distributional methods, which iteratively refine their representations as occurrences accumulate, such a model could output a representation in one shot. We now evaluate CPAE in a setting where some definitions are not seen during training.

The dictionary is split into train, validation (for early stopping) and test splits. The algorithm for splitting the dictionary puts words into batches. It ensures two things: firstly, that words which share at least one definition are in the same batch, and secondly, that each word in a batch is associated with all its definitions. We can then group batches to build the training and the test sets such that the test set does not contain synonyms of words from the other sets. We sort the batches by the number of distinct definitions they contain. We use the largest batch returned by the algorithm as the training set: it contains mostly frequent and polysemous words. The validation and the test sets, on the contrary, contain many multiword expressions, proper nouns, and rarer words. More details are given in Appendix B.1.

                 Development     Similarity              Relatedness
                 SV-d   MEN-d    SL999  SL333  SV-t      RG    SCWS  MEN-t  MT    353
word2vec         21.7   71.1     33.2   6.9    21.2      68.5  65.8  71.5   61.1  65.3
retrofitting     28.5   71.6     36.9   14.1   26.1      78.9  65.7  74.4   62.8  60.7
dict2vec         26.3   63.5     36.2   15.5   22.0      69.2  63.8  63.9   54.9  60.8
Hill             26.9   63.3     27.7   12.9   21.7      72.9  58.4  64.1   54.0  52.3
AE               33.5   47.0     33.1   20.4   32.5      66.0  52.0  46.4   40.2  43.7
CPAE (λ = 4)     39.5   60.8     39.9   26.6   37.8      69.7  59.2  60.8   48.4  55.6

Table 2: Improving pretrained embeddings computed on a small corpus. Spearman's correlation coefficient ρ × 100 on benchmarks. All methods use pretrained embeddings. All methods (except perhaps Hill's model) manage to improve the embeddings. Retrofitting outperforms dict2vec and efficiently specializes for relatedness. CPAE outperforms AE and allows relatedness to be traded off for similarity.

        Similarity                 Relatedness
        SL999   SL333   SV-t       353     MEN     SCWS    MT
All     36.5    27.7    36.9       43.7    48.8    53.2    41.5
Train   40.0    28.5    36.7       46.9    53.4    57.1    42.9
Test    27.5    25.4    38.5       42.5    44.1    40.3    39.1

Table 3: One-pass generalisation. Spearman's correlation coefficient ρ × 100 on benchmarks. The model is CPAE (without pretrained embeddings). All: all pairs in the benchmarks. Train: pairs for which both words are in the training or validation set. Test: pairs which contain at least one word in the test set. Correlation is lower for test pairs but remains strong (ρ > 0.3): the model has good generalisation abilities.

We train CPAE only on the train split of the dictionary, with randomly initialized input embeddings. Table 3 presents the same correlation coefficients as in the previous tables but also distinguishes between two subsets of the pairs: the pairs for which all the definitions were seen during training (train) and the pairs for which at least one word was defined in the test set (test). Unfortunately, there are not enough pairs of words which both appear in the test set to be able to compute significant correlations. On small-sized benchmarks, correlation coefficients are sometimes not significant, so we do not report them (when the p-value > 0.01).

The scores of CPAE on the test pairs are quite correlated with the ground truth: except on SimLex999 and SCWS, there is no drop in correlation coefficients between the two sets. The scores of Hill's model follow similar trends, but are lower on every benchmark, so we do not report them. This shows that recurrent encoders are able to generalize and produce coherent embeddings as a function of other embeddings in one pass.

8 Conclusion and future work

We have focused on capturing the similarity relation. It is a challenging task which we have proposed to solve using dictionaries, as definitions seem to encode the relevant kind of information.

We have presented an alternative for learning word embeddings that uses dictionary definitions. As a definition autoencoder, our approach is self-contained, but it can alternatively be used to improve pretrained embeddings, and it includes Hill's model (Hill et al., 2015) as a special case.

In addition, our model leverages the inherent recursivity of dictionaries via a consistency penalty, which yields significant improvements over the vanilla autoencoder.

Our method outperforms dict2vec and retrofitting on similarity benchmarks by a quite large margin. Unlike dict2vec, our method can be used as a postprocessing method which does not require going through the original pretraining corpus, it has fewer hyperparameters, and it generalises to new words.

We see several directions for future work. Firstly, more work is needed to evaluate the representations on other languages and tasks.

Secondly, solving downstream tasks requires representations for inflected words as well. We have set aside this issue by focusing on benchmarks involving lemmas. To address it in future work, we might want to split word representations into a lexical and a morphological part. With such a split representation, we could postprocess only the lexical component, and all the words, whether inflected or not, would benefit from this.
This seems desirable for postprocessing methods in general and would make them more suitable for synthetic languages.

Thirdly, a dictionary defines every sense of a word, so we could produce one embedding per sense (Chen et al., 2014; Iacobacci et al., 2015). This requires potentially complicated modifications to our model, as we would need to disambiguate senses inside each definition. However, some classes of words might benefit a lot from such representations, for example words that can be used as different parts of speech.

Lastly, a more speculative direction could be to study iterative constructions of the set of embeddings. As our algorithm can generalize in one shot, we could start the training with a small set of words and their definitions and iteratively broaden the vocabulary and refine the representations without retraining the model. This could be useful in discovering a set of semantic primes from which one can define all the other words (Wierzbicka, 1996).

Acknowledgements

We thank NSERC and Facebook for financial support, Calcul Canada for computational resources, Dzmitry Bahdanau and Stanisław Jastrzębski for contributing to the code on which the implementation is based, the developers of Theano (Theano Development Team, 2016), Blocks and Fuel (van Merriënboer et al., 2015), as well as Laura Ball for proofreading.

References

Robert Alfred Amsler. 1980. The structure of the Merriam-Webster pocket dictionary.

Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. CoRR, abs/1706.00286.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Jean-Louis Binot and Karen Jensen. 1987. A semantic expert using an online standard dictionary. In Proceedings of the 10th International Joint Conference on Artificial Intelligence - Volume 2, pages 709–714. Morgan Kaufmann Publishers Inc.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47.

Nicoletta Calzolari. 1984. Detecting patterns in a lexical data base. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting on Association for Computational Linguistics, pages 170–173. Association for Computational Linguistics.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035.

Martin S. Chodorow, Roy J. Byrd, and George E. Heidorn. 1985. Extracting semantic hierarchies from a large on-line dictionary. In Proceedings of the 23rd Annual Meeting on Association for Computational Linguistics, pages 299–304. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414. ACM.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1406–1414. ACM.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. arXiv preprint arXiv:1707.06556.

Felix Hill, Kyunghyun Cho, Sébastien Jean, Coline Devin, and Yoshua Bengio. 2014. Embedding word similarity with neural machine translation. CoRR, abs/1412.6448.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2015. Learning to understand phrases by embedding the dictionary. arXiv preprint arXiv:1504.00548.

Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 95–105.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2044–2048.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? EMNLP 2015.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. arXiv preprint arXiv:1606.07736.

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. arXiv preprint arXiv:1706.00374.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148, San Diego, California. Association for Computational Linguistics.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):10.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In AAAI, pages 3259–3266.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Olivier Picard, Alexandre Blondin-Massé, Stevan Harnad, Odile Marcotte, Guillaume Chicoisne, and Yassine Gargouri. 2009. Hierarchies in dictionary definition space. arXiv preprint arXiv:0911.5703.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pages 337–346. ACM.

Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 135–148.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54.

Tobias Schnabel, Igor Labutov, David M. Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings.

Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 254–263.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Ivan Vulić and Nikola Mrkšić. 2017. Specialising word vectors for lexical entailment. arXiv preprint arXiv:1710.06371.

Anna Wierzbicka. 1996. Semantics: Primes and Universals. Oxford University Press, UK.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. arXiv preprint arXiv:1607.02789.

Yadollah Yaghoobzadeh and Hinrich Schütze. 2016. Intrinsic subspace evaluation of word embedding representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 236–246, Berlin, Germany. Association for Computational Linguistics.

A Lists of electronic dictionaries

Electronic monolingual dictionaries are rather widespread compared to machine-readable lexical databases. Wiktionary7 is a collaborative online dictionary, where 79 languages have more than 10,000 entries and 42 languages have more than 100,000 entries. BabelNet (Navigli and Ponzetto, 2012) aligns Wikipedia and WordNet to automatically obtain definitions ("glosses" in BabelNet terminology).

Alternatively, Wikipedia provides a list of monolingual dictionaries.8 One can also translate "dictionary" into the desired language and look up this term. For example, there are dictionaries for Friulian and Frisian, which are minority languages spoken by less than a million speakers.9

7 https://www.wiktionary.org/
8 https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words
9 https://taalweb.frl/wurdboekportaal and http://www.arlef.it/grant-dizionari-talian-furlan/htdocs/gdbtf.pl
B Data

The dump of Wikipedia that we have mentioned in Section 6 is the dump from the 14th of June 2014. Although this dump is no longer available online, results should be replicable with newer dumps.

B.1 Split dictionary setting

We briefly describe the algorithm used to split the dictionary. If we had used a regular dictionary instead of WordNet, we could have randomly distributed words into train, validation and test splits. However, by doing so, we would have put synonyms in two different splits, which could be a weak form of a test-set leak. On the other hand, WordNet has sets of synonyms, called synsets, where two words in a synset necessarily share at least one definition.

We create batches of words whose definitions do not overlap across batches with the following algorithm. First, we create maps from words to their definitions and their converse. We will create batches indexed by i, and we now describe how to build batch i. We create a set of definitions that is initialized as a singleton containing a random definition, C_i = {d}. We instantiate another set S_i = ∅ that will contain all the words whose definitions overlap with at least one other word of the set. We iterate over the set C_i and pop a definition. Then, we go through all the words that possess that definition and add them to S_i, while we also add all the other definitions of these words to C_i. We keep iterating over C_i until it is empty. Then we start the process again on a new batch with a new C_{i+1} = {d'} where d' has never been added to any C_i before.

Then, we order the batches S_i by the number of words they contain. We choose the largest batch to be the training set. It can be seen as a large subset of the vocabulary that is "central", in a way. By construction, it contains highly polysemous words and frequent words, as shown in Table 5.

Although our construction method is very different, it is slightly similar in spirit to the grounding kernel presented by Picard et al. (2009). They report that words in the grounding kernel are more frequent, as we do for the training set. It is also related to the concept of semantic primes, a subset of the lexicon from which all other words can be defined (Wierzbicka, 1996).
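A compact reimplementation sketch of this batching procedure (illustrative only; the released code may differ) treats definitions as hashable token tuples and grows each batch until no definition is shared with the outside:

    # Grow batches of words whose definitions overlap, so that no definition is
    # shared across batches; the largest batch becomes the training set.
    def split_into_batches(word_to_defs):
        def_to_words = {}
        for word, defs in word_to_defs.items():
            for d in defs:
                def_to_words.setdefault(d, set()).add(word)

        batches, seen_defs = [], set()
        for start in def_to_words:
            if start in seen_defs:
                continue
            frontier, words = {start}, set()
            while frontier:
                d = frontier.pop()
                if d in seen_defs:
                    continue
                seen_defs.add(d)
                for w in def_to_words[d]:
                    words.add(w)
                    frontier.update(word_to_defs[w])  # pull in all other definitions of w
            batches.append(words)
        return sorted(batches, key=len, reverse=True)  # largest batch first

    toy = {"car": [("a", "motor", "vehicle")], "automobile": [("a", "motor", "vehicle")],
           "cat": [("a", "small", "feline")]}
    print(split_into_batches(toy))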

C Hyperparameter search

All embeddings are of size 300 for the small Wikipedia dump experiments and 400 for the large Wikipedia dump experiments.

C.1 GloVe

GloVe is run only on the definition corpus because we found word2vec to be superior. We have fixed the number of epochs to 50. The window size varies in {5, 7, 9, 11, 15, 20} and x_max in {0, 5, 20, 100}, where the selected hyperparameters are bolded.

C.2 Word2vec

We have used the gensim implementation.10 On definitions, we do a hyperparameter search on the window size in {5, 15} and on the downsampling threshold in {0, 0.1, 0.01, 0.001, 0.0001}, and we also choose between the skip-gram and CBOW variants, where the skip-gram variant is selected.

On the small training corpus, we only use the skip-gram model and choose the number of iterations in {5, 30}, the window size in {3, 5, 7, 10}, and the downsampling threshold in {0, 0.1, 0.01, 0.001, 0.0001}, where the selected hyperparameters are bolded.

On the full training corpus, we have used the same hyperparameters except that the number of iterations is reduced to 5, and we have filtered out words that appear less than 50 times.

10 https://radimrehurek.com/gensim/

                 Development     Similarity              Relatedness
                 SV-d   MEN-d    SL999  SL333  SV-t      RG    SCWS  MEN-t  MT    353
word2vec         28.2   73.5     36.5   14.2   23.0      76.4  64.2  74.6   64.0  68.5
retrofitting     32.8   74.4     40.2   20.7   28.2      84.0  65.3  77.7   65.6  62.9
dict2vec         36.5   67.9     41.0   24.2   32.0      74.1  61.7  68.1   57.9  64.8
Hill             29.4   62.4     31.9   17.2   20.0      67.1  53.6  62.9   52.6  50.7
AE               33.0   45.6     34.7   24.1   30.3      71.5  49.3  45.5   38.4  42.5
CPAE-P (λ = 16)  39.3   63.1     45.2   31.5   37.4      69.6  59.3  60.7   50.2  58.5

Table 4: Improving pretrained embeddings computed on a large corpus. Spearman's correlation coefficient ρ × 100 on benchmarks. Same as Table 2 but word2vec is trained on the entire Wikipedia dump. Dict2vec fails to improve. Retrofitting especially improves relatedness, while CPAE improves similarity.

        # defs    # defs / # words    Avg. counts
Train   36,722    2.87                11.40
Valid   2,109     1.81                4.07
Test    128,223   1.13                1.92

Table 5: Statistics of the split dictionary. By construction, the train set contains much more frequent and polysemous words than the test set. The average counts are geometric averages of the smoothed counts of words, computed on the first 50M tokens of the Wikipedia dump.

C.3 AE, CPAE, Hill's model

We train the AE and CPAE models with Adam (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.999, a learning rate of 3·10^−4, and a batch size of 32. In the settings where we use part of the dictionary, we use early stopping: every 2,000 batches, we compute the mean cost on the validation set and stop after 20,000 batches without improvement. On the full dictionary, where we do not have a validation set, we train for 50 epochs.

CPAE and AE models always use a reconstruction cost, so α = 1. The λ parameter that weights the cost of the consistency penalty varies in {1, 2, 4, 8, 16, 32, 64}, and the value chosen by the model selection is indicated in the tables.

C.4 Retrofitting

We have used the original implementation.11 We have tuned the hyperparameter α that controls the proximity of the retrofitted vector to the original vector by grid search: α ∈ {0.25, 0.5, 1, 2, 4}. The selected value is α = 0.5 for both the small and the large corpora.

C.5 Dict2vec

We have used the original implementation.12 Dict2vec has many hyperparameters. We fixed K = 5, where the K closest neighbours of each word in the original embeddings are promoted to form strong pairs, as well as ns = 4 and nw = 5, the numbers of pairs to sample. We run a grid search over the coefficients that weight the importance of strong and weak pairs in the auxiliary cost: βs ∈ {0.4, 0.6, 0.8, 1.0, 1.2, 1.4} and βw ∈ {0.0, 0.2, 0.4, 0.6}. In the smaller data regime (small dump), the model selection procedure picks βs = 1.2 and βw = 0, and in the larger data regime (the entire Wikipedia dump), it picks βs = 0.8 and βw = 0.2. When focusing on the similarity relation, it seems better not to use weak pairs, or at least to weight them very low.

D Improving pretrained embeddings on the full Wikipedia dump

The scores of word2vec are not really improved by using the much larger full Wikipedia dump, so we increase the size of the embeddings from 300 to 400. The results are given in Table 4.

The trends are similar to what we have observed on the first 50 million tokens of Wikipedia. However, dict2vec seems a bit better and improves over retrofitting in similarity.

11 https://github.com/mfaruqui/retrofitting
12 https://github.com/tca19/dict2vec