Putting words in context: LSTM language models and lexical ambiguity

Laura Aina, Kristina Gulordava, Gemma Boleda
Universitat Pompeu Fabra, Barcelona, Spain
{firstname.lastname}@upf.edu

Abstract

In neural network models of language, words are commonly represented using context-invariant representations (word embeddings) which are then put in context in the hidden layers. Since words are often ambiguous, representing the contextually relevant information is not trivial. We investigate how an LSTM language model deals with lexical ambiguity in English, designing a method to probe its hidden representations for lexical and contextual information about words. We find that both types of information are represented to a large extent, but also that there is room for improvement for contextual information.

1 Introduction

In language, a word can contribute a very different meaning to an utterance depending on the context, a phenomenon known as lexical ambiguity (Cruse, 1986; Small et al., 2013). This variation is pervasive and involves both morphosyntactic and semantic aspects. For instance, in the examples in Table 1, show is used as a verb in Ex. (1), and as a noun in Ex. (2-3), in a paradigmatic case of morphosyntactic ambiguity in English. Instead, the difference between Ex. (2) and (3) is semantic in nature, with show denoting a TV program and an exhibition, respectively. Semantic ambiguity covers a broad spectrum of phenomena, ranging from quite distinct word senses (e.g., mouse as animal or computer device) to more subtle lexical modulation (e.g., visit a city / an aunt / a doctor; Cruse, 1986). This paper investigates how models of language, and in particular Long Short-Term Memory Networks (LSTMs) trained on Language Modeling, deal with lexical ambiguity.¹

¹ Code at: https://github.com/amore-upf/LSTM_ambiguity

In neural network models of language, words in a sentence are commonly represented through word-level representations that do not change across contexts, that is, "static" word embeddings. These are then passed to further processing layers, such as the hidden layers in a recurrent neural network (RNN). Akin to classic distributional semantics (Erk, 2012), word embeddings are formed as an abstraction over the various uses of words in the training data. For this reason, they are apt to represent context-invariant information about a word (its lexical information), but not the contribution of a word in a particular context (its contextual information; Erk, 2010). Indeed, word embeddings subsume information relative to various senses of a word (e.g., mouse is close to words from both the animal and computer domain; Camacho-Collados and Pilehvar, 2018).

Classic distributional semantics attempted to do composition to account for contextual effects, but it was in general unable to go beyond short phrases (Baroni, 2013); newer-generation neural network models have brought a big step forward, as they can natively do composition (Westera and Boleda, 2019). In particular, the hidden layer activations in an RNN can be seen as putting words in context, as they combine the word embedding with information coming from the context (the adjacent hidden states). The empirical success of RNN models, and in particular LSTM architectures, at fundamental tasks like Language Modeling (Jozefowicz et al., 2015) suggests that they are indeed capturing relevant contextual properties. Moreover, contextualized representations derived from such models have been shown to be very informative as input for lexical disambiguation tasks (e.g., Melamud et al., 2016; Peters et al., 2018).

We here present a method to probe the extent to which the hidden layers of an LSTM language model trained on English data represent lexical and contextual information about words, in order to investigate how the model copes with lexical ambiguity.

(1) Example: . . . I clapped her shoulder to show I was not laughing at her . . .
    LexSub: demonstrate, display, indicate, prove, clarify
    w NN: demonstrate, exhibit, indicate, offer, reveal
    s NN: indicate, demonstrate, suggest, prove, ensure
    w&s NN: indicate, demonstrate, prove, suggest

(2) Example: . . . The show [. . . ] revolutionized the way America cooks and eats . . .
    LexSub: program, series, broadcast, presentation
    w NN: demonstrate, exhibit, indicate, offer, reveal
    s NN: series, program, production, miniseries, trilogy
    w&s NN: series, program, production, broadcast

(3) Example: . . . The inauguration of Dubai Internet City coincides with the opening of an annual IT show in Dubai . . .
    LexSub: exhibition, conference, convention, demonstration
    w NN: demonstrate, exhibit, indicate, offer, reveal
    s NN: conference, event, convention, symposium, exhibition
    w&s NN: conference, event, exhibition, symposium, convention

Table 1: Examples from the LexSub dataset (Kremer et al., 2014) and nearest neighbors for target representations.

Our work follows a recent strand of research that purports to identify what linguistic properties deep learning models are able to capture (Linzen et al., 2016; Adi et al., 2017; Gulordava et al., 2018; Conneau et al., 2018; Hupkes et al., 2018, a.o.). We train diagnostic models on the tasks of retrieving the embedding of a word and a representation of its contextual meaning, respectively, the latter obtained from a Lexical Substitution dataset (Kremer et al., 2014). Our results suggest that LSTM language models heavily rely on the lexical information in the word embeddings, at the expense of contextually relevant information. Although further analysis is necessary, this suggests that there is still much room for improvement to account for contextual meanings. Finally, we show that the hidden states used to predict a word (as opposed to those that receive it as input) display a bias towards contextual information.

2 Method

Language model. As our base model, we employ a word-level bidirectional LSTM (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997) language model (henceforth, LM) with three hidden layers. Each input word at timestep t is represented through its word embedding w_t; this is fed to both a forward and a backward stacked LSTM, which process the sequence left-to-right and right-to-left, respectively (Eqs. (1)-(2) describe the forward LSTM). To predict the word at t, we obtain output weights by summing the activations of the last hidden layers of the forward and backward LSTMs at timesteps t − 1 and t + 1, respectively, and applying a linear transformation followed by softmax (Eq. (3), where L is the number of hidden layers). Thus, a word is predicted using both its left and right context jointly, akin to the context2vec architecture (Melamud et al., 2016) but differently from, e.g., the BiLSTM architecture used for ELMo (Peters et al., 2018).

h^1_t = \mathrm{LSTM}^1(w_t, h^1_{t-1})    (1)
h^i_t = \mathrm{LSTM}^i(h^{i-1}_t, h^i_{t-1})    (2)
o_t = \mathrm{softmax}(f(\overrightarrow{h}^L_{t-1} + \overleftarrow{h}^L_{t+1}))    (3)

We train the LM on the concatenation of English text data from a Wikipedia dump,² the British National Corpus (Leech, 1992), and the UkWaC corpus (Ferraresi et al., 2008).³ More details about the training setup are specified in Appendix A.1. The model achieves satisfying performances on test data (perplexity: 18.06).

² From 2018/01/03, https://dumps.wikimedia.org/enwiki/
³ 50M tokens from each corpus, in total 150M (train/valid/test: 80/10/10%); vocabulary size: 50K.

For our analyses, we deploy the trained LM on a text sequence and extract the following activations of each hidden layer (Eq. (4) and Fig. 1).

\{\overrightarrow{h}^i_t \mid i \le L\} \cup \{\overleftarrow{h}^i_t \mid i \le L\}    (4)
h^i_t = [\overrightarrow{h}^i_t; \overleftarrow{h}^i_t]    (5)
h^i_{t\pm 1} = [\overrightarrow{h}^i_{t-1}; \overleftarrow{h}^i_{t+1}]    (6)

Figure 1: Language model and extracted representations. The different shades across layers reflect the different performances in the probe tasks (darker = higher).

At timestep t, for each layer, we concatenate the forward and backward hidden states (Eq. (5)). We refer to these vectors as current hidden states. As they are obtained by processing the word at t as input and combining it with information from the context, we can expect them to capture the relevant contribution of that word (e.g., in Fig. 1, the mouse-as-animal sense). As a comparison, we also extract activations obtained by processing the text sequence up to t − 1 and t + 1 in the forward and backward LSTM, respectively, hence excluding the word at t. We concatenate the forward and backward states of each layer (Eq. (6)). While these activations do not receive the word at t as input, they are relevant because they are used to predict that word as output. We refer to them as predictive hidden states. These may capture some aspects of the word (e.g., in Fig. 1, that it is a noun and denotes something animate), but are likely to be less accurate than the current states, since they do not observe the actual word.
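To make the architecture concrete, the following is a minimal PyTorch sketch of the bidirectional LM in Eqs. (1)-(3) and of the extraction of current (Eq. (5)) and predictive (Eq. (6)) hidden states. It is our own illustration, not the released code: layer sizes follow Appendix A.1, and all names (BiLSTMLM, current_state, predictive_state) are ours.

```python
import torch
import torch.nn as nn

class BiLSTMLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, layer_dims=(600, 600, 300)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        dims = (emb_dim,) + tuple(layer_dims)
        # one single-layer LSTM per depth and direction, so that the hidden
        # states of every layer can be read out (Eq. 4)
        self.fwd = nn.ModuleList([nn.LSTM(dims[i], dims[i + 1], batch_first=True)
                                  for i in range(len(layer_dims))])
        self.bwd = nn.ModuleList([nn.LSTM(dims[i], dims[i + 1], batch_first=True)
                                  for i in range(len(layer_dims))])
        self.decoder = nn.Linear(layer_dims[-1], vocab_size)  # the linear map f in Eq. (3)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        emb = self.embedding(tokens)
        x_f, x_b = emb, torch.flip(emb, dims=[1])    # backward LSTM reads the sequence reversed
        fwd_states, bwd_states = [], []              # per-layer hidden states, Eqs. (1)-(2)
        for lstm_f, lstm_b in zip(self.fwd, self.bwd):
            x_f, _ = lstm_f(x_f)
            x_b, _ = lstm_b(x_b)
            fwd_states.append(x_f)
            bwd_states.append(torch.flip(x_b, dims=[1]))  # re-aligned to left-to-right positions
        # Eq. (3): the word at t is predicted from the top forward state at t-1 and
        # the top backward state at t+1 (softmax is applied inside the training loss)
        logits = self.decoder(fwd_states[-1][:, :-2] + bwd_states[-1][:, 2:])
        return logits, fwd_states, bwd_states

def current_state(fwd_states, bwd_states, layer, t):
    # Eq. (5): forward and backward states computed on the word at t itself
    return torch.cat([fwd_states[layer][:, t], bwd_states[layer][:, t]], dim=-1)

def predictive_state(fwd_states, bwd_states, layer, t):
    # Eq. (6): states computed just before / after t, i.e. without seeing the word at t
    return torch.cat([fwd_states[layer][:, t - 1], bwd_states[layer][:, t + 1]], dim=-1)
```

Running a tokenized sentence through BiLSTMLM and calling current_state and predictive_state for layers 1-3 yields the six input types that are probed below.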

Probe tasks. We aim to assess to what extent the hidden states in the LM carry over the lexical and context-invariant information in the input word embedding, and how much they instead represent the contextual meaning of the word. To this end, we rely on vector representations of lexical and contextual word information. As for the former, we can directly use the word embeddings of the LM (w); it is instead more challenging to find a representation of the contextual meaning.

Our solution is to use Lexical Substitution data (McCarthy and Navigli, 2009) and, in particular, the large dataset by Kremer et al. (2014) (henceforth, LexSub; see Table 1). In this dataset, words in context (up to 3 sentences) are annotated with a set of paraphrases given by human subjects. Since contextual substitutes reflect differences among uses of a word (for instance, demonstrate paraphrases show in a context like Ex. (1), but not in Ex. (2)), this type of data is often used as an evaluation benchmark for contextual representations of words (e.g., Erk and Padó, 2008; Melamud et al., 2016; Garí Soler et al., 2019). We leverage LexSub to build proxies for ground-truth representations of the contextual meaning of words. We define two types of representations, inspired by previous work that proposed simple vector operations to combine word representations (Mitchell and Lapata, 2010; Thater et al., 2011, a.o.): the average embedding of the substitute words (henceforth, s), and the average embedding of the union of the substitute words and the target word (w&s). As Table 1 qualitatively shows, the resulting representations tend to be close to the substitute words and reflect the contextual nuance conveyed by the word; in the case of w&s, they also retain a strong similarity to the embedding of the target word.⁴

⁴ These vectors are close to the related word embedding (0.45 and 0.66 mean cosine, see Table 2, row w_t), but also different from it: on average, s and w&s share 17 and 25% of the top-10 neighbors with w, respectively (statistics from training data, excluding the word itself from neighbors).

We frame our analyses as supervised probe tasks: a diagnostic model learns to "retrieve" word representations out of the hidden states; the rate of success of the model is taken to measure the amount of information relevant to the task that its input contains. Given current or predictive states as inputs, we define three diagnostic tasks:

- WORD: predict w
- SUB: predict s
- WORD&SUB: predict w&s

The WORD task is related to the probe tasks introduced in Adi et al. (2017) and Conneau et al. (2018), which, given a hidden state, require predicting the words that a sentence encoder has processed as input. Note that, while these authors predict words by their discrete index, we are predicting the complete multi-dimensional embedding of the word. Our test quantifies not only whether the model is tracking the identity of the input word, but also how much of its information it retains.

We train distinct probe models for each task and type of input (i; e.g., current hidden state at layer 1). A model consists of a non-linear transformation from an input vector i (extracted from the LM) to a vector with the dimensionality of the word embeddings (Eq. (7), where r̂ is one of ŵ, ŝ, ŵ&s for the WORD, SUB, and WORD&SUB tasks, respectively). The models are trained through a max-margin loss, optimizing the cosine similarity between r̂ and the target representation against the similarities between r̂ and 5 negative samples (details in Appendix A.2).

\hat{r} = \tanh(W i + b)    (7)
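As an illustration of Eq. (7) and of the target construction, here is a minimal sketch; it is our own code, not the authors' implementation, and assumes 300-dimensional embeddings as in Appendix A.1 (the names target_representations and Probe are ours).

```python
import torch
import torch.nn as nn

def target_representations(word_emb, substitute_embs):
    """Targets for one LexSub datapoint: w, s and w&s (simple averaging)."""
    s = substitute_embs.mean(dim=0)                   # average of the substitutes' embeddings
    w_and_s = torch.cat([substitute_embs,
                         word_emb.unsqueeze(0)], dim=0).mean(dim=0)
    return word_emb, s, w_and_s

class Probe(nn.Module):
    """One diagnostic model per task (WORD / SUB / WORD&SUB) and input type."""
    def __init__(self, input_dim, emb_dim=300):
        super().__init__()
        self.linear = nn.Linear(input_dim, emb_dim)   # W i + b

    def forward(self, hidden_state):
        return torch.tanh(self.linear(hidden_state))  # r_hat = tanh(W i + b), Eq. (7)
```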

We adapt the LexSub data to our setup as follows. Since substitutes are provided in their lemmatized form, we only consider datapoints where the word form is identical to the lemma, so as to exclude effects due to morphosyntax (e.g., asking the models to recover play when they observe played).⁵ We require that at least 5 substitutes per datapoint are in the LM vocabulary to ensure quality in the target representations. LexSub data come with a validation/test split; since we need training data, we create a new random partitioning into train/valid/test (70/10/20%, with no overlapping contexts among splits). The final data consist of 4.7K/0.7K/1.3K datapoints for train/valid/test.

⁵ We also exclude substitutes that are multi-word expressions and the datapoints involving words that are part of a compound (e.g., fast in fast-growing).

3 Results

The results of the probe tasks on test data are presented in Table 2. We report the mean and standard deviation of the cosine similarity between the output representations (ŵ, ŝ, ŵ&s) and the target ones (w, s, w&s). This evaluates the degree to which the word representations can be retrieved from the hidden states. For comparison, we also report the cosine scores between the targets and two baseline representations: the word embedding itself and the average of the word embeddings in a 10-word window around the target word (avgctxt).⁶ Overall, the models do better than these unsupervised baselines, with exceptions.⁷

⁶ We exclude out-of-vocabulary words and punctuation.
⁷ The first cell is 1 as it involves the same representation.

input        WORD          SUB           WORD&SUB
w_t          1             .45 (±.14)    .66 (±.09)
avgctxt      .35 (±.10)    .16 (±.11)    .24 (±.12)
h^1_t        .84 (±.2)     .61 (±.14)    .71 (±.11)
h^2_t        .74 (±.12)    .60 (±.13)    .69 (±.11)
h^3_t        .64 (±.12)    .58 (±.13)    .65 (±.11)
h^1_{t±1}    .25 (±.16)    .36 (±.16)    .38 (±.16)
h^2_{t±1}    .27 (±.16)    .39 (±.16)    .41 (±.16)
h^3_{t±1}    .29 (±.15)    .41 (±.16)    .43 (±.16)

Table 2: Results of probe tasks for current (h^i_t) and predictive (h^i_{t±1}) hidden states.

Figure 2: Similarity of lexical and contextual vector (w - s) vs. similarity of target and prediction in SUB for h^1_t.

Current hidden states. Both lexical and contextual representations can be retrieved from the current hidden states (h^i_t) to a large extent (cosines .58-.84), but retrieving the former is much easier than the latter (.64-.84 vs. .58-.71). This suggests that the information in the word embedding is better represented in the hidden states than the contextually relevant one. In all three tasks, performance degrades closer to the output layer (from h^1_t to h^3_t), but the effect is more pronounced for the WORD task (.84/.74/.64). Word embeddings are part of the input to the hidden state, and the transformation learned for this task can be seen as a decoder in an auto-encoder, reconstructing the original input; the further the hidden layer is from the input, the more complex the function is to reverse-engineer. Crucially, the high performance at reconstructing the word embedding suggests that lexical information is retained in the hidden layers, possibly including also contextually irrelevant information (e.g., in Ex. (4) in Table 3, ŵ is close to verbs, even if share is here a noun).

Contextual information (s and w&s) seems to be more stable across processing layers, although overall less present (cf. lower results). Table 3 reports one example where the learned model displays relevant contextual aspects (Ex. (4), share) and one where it does not (Ex. (5), studio). Qualitative analysis shows that morphosyntactic ambiguity (e.g., share as a noun vs. verb) is more easily discriminated, while semantic distinctions pose more challenges (e.g., studio as a room vs. company). This is not surprising, since the former tends to correlate with clearer contextual cues.

Furthermore, we find that the more the contextual representation is aligned to the lexical one, the easier it is to retrieve the former from the hidden states (e.g., correlation between cos(w, s) and cos(ŝ, s), for h^1_t: Pearson's ρ = .62***; Fig. 2): that is, it is harder to resolve lexical ambiguity when the contextual meaning is less represented in the word embedding (e.g., less frequent uses). This suggests that the LM heavily relies on the information in the word embedding, making it challenging to diverge from it when contextually relevant (see Ex. (6) in Table 3).
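The numbers in Table 2 and the correlation behind Fig. 2 can be reproduced with routines along these lines; this is our own sketch (function names are ours), using scipy for the significance-tested Pearson correlation.

```python
import torch.nn.functional as F
from scipy.stats import pearsonr

def cosine_report(predicted, targets):
    # mean and standard deviation of cos(r_hat, r) over the test datapoints (Table 2)
    cos = F.cosine_similarity(predicted, targets, dim=-1)
    return cos.mean().item(), cos.std().item()

def alignment_correlation(word_embs, s_targets, s_predicted):
    # correlation between cos(w, s) and cos(s_hat, s): is s easier to retrieve
    # when it is already close to the word embedding? (Fig. 2)
    lex_vs_ctx = F.cosine_similarity(word_embs, s_targets, dim=-1)
    retrieval = F.cosine_similarity(s_predicted, s_targets, dim=-1)
    return pearsonr(lex_vs_ctx.detach().numpy(), retrieval.detach().numpy())
```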

(4) Context: . . . The financial-services company will pay 0.82 share for each Williams share . . .
    LexSub: stock, dividend, interest, stake, unit
    WORD (ŵ NN): stake, owe, discuss, coincide, reside
    SUB (ŝ NN): portion, amount, percentage, fraction
    WORD&SUB (ŵ&s NN): stake, percentage, portion, spend, proportion

(5) Context: . . . Sony's effort to hire producers Jon Peters and Peter Guber to run the studio . . .
    LexSub: business, company, facility, film, lot
    WORD (ŵ NN): lab, troupe, classroom, apartment, house
    SUB (ŝ NN): room, gallery, troupe, journal, booth
    WORD&SUB (ŵ&s NN): room, troupe, lab, audience, department

(6) Context: . . . I had [. . . ] told her that we needed other company than our own . . .
    LexSub: friend, acquaintance, visitor, accompaniment, associate
    WORD (ŵ NN): retailer, trader, firm, maker, supplier
    SUB (ŝ NN): firm, corporation, organisation, conglomerate, retailer
    WORD&SUB (ŵ&s NN): corporation, firm, conglomerate, retailer, organisation

Table 3: Examples with nearest neighbours of the representations predicted in the first current hidden layer.
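The neighbour lists in Tables 1 and 3 (and footnote 4) come from a cosine nearest-neighbour search over the LM's input embeddings; the sketch below is our own and the function name is hypothetical.

```python
import torch.nn.functional as F

def nearest_neighbours(vector, embedding_matrix, id2word, k=5, exclude=()):
    # embedding_matrix: (vocab_size, emb_dim); id2word: list mapping row index to word
    sims = F.cosine_similarity(vector.unsqueeze(0), embedding_matrix, dim=-1)
    neighbours = []
    for idx in sims.argsort(descending=True).tolist():
        word = id2word[idx]
        if word not in exclude:           # e.g. skip the target word itself (footnote 4)
            neighbours.append(word)
        if len(neighbours) == k:
            break
    return neighbours
```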

Current vs. predictive hidden states. The predictive hidden states are obtained without observing the target word; hence, recovering word information is considerably harder than for current states. Indeed, we observe worse results in this condition (e.g., below avgctxt in the WORD task); we also observe two patterns that are opposite to those observed for current states, which shed light on how LSTM LMs track word information.

For predictive states, results improve closer to the output (from layer 1 to 3; they instead degrade for current states). We link this to the double objective that a LM has when it comes to word information: to integrate a word passed as input, and to predict one as output. Our results suggest that the hidden states keep track of information for both words, but lower layers focus more on the processing of the input and higher ones on the predictive aspect (see Fig. 1). This is in line with previous work showing that activations close to the output tend to be task-specific (Liu et al., 2019).

Moreover, from predictive states, it is easier to retrieve contextual than lexical representations (.41/.43 vs. .29; the opposite was true for current states). Our hypothesis is that this is due to a combination of two factors. On the one hand, predictive states are based solely on contextual information, which highlights only certain aspects of a word; for instance, the context of Ex. (2) in Table 1 clearly signals that a noun is expected, and the predictive states in a LM should be sensitive to this kind of cue, as it affects the distribution over words. On the other hand, lexical representations are underspecified; for instance, the word embedding for show abstracts over both verbal and nominal uses of the word. Thus, it makes sense that the predictive state does not capture contextually irrelevant aspects of the word embedding, unlike the current state (note however that, as stated above, the overall performance of the current state is better, because it has access to the word actually produced).

4 Future work

We introduced a method to study how deep learning models of language deal with lexical ambiguity. Though we focused on LSTM LMs for English, this method can be applied to other architectures, objective tasks, and languages; possibilities to explore in future work. We also plan to carry out further analyses aimed at individuating factors that challenge the resolution of lexical ambiguity (e.g., morphosyntactic vs. semantic ambiguity, frequency of a word or sense, figurative uses), as well as clarifying the interaction between prediction and processing of words within neural LMs.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 715154), and from the Ramón y Cajal programme (grant RYC-2015-18907). We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research, and the computer resources at CTE-POWER and the technical support provided by Barcelona Supercomputing Center (RES-FI-2018-3-0034). This paper reflects the authors' view only, and the EU is not responsible for any use that may be made of the information it contains.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of the 5th International Conference on Learning Representations (ICLR).

Marco Baroni. 2013. Composition in distributional semantics. Language and Linguistics Compass, 7(10):511–522.

Jose Camacho-Collados and Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63(1):743–788.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2126–2136.

Alan Cruse. 1986. Lexical semantics. Cambridge University Press.

Katrin Erk. 2010. What is word meaning, really? (And how can distributional models help us describe it?). In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics, pages 17–26.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 897–906.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google, pages 47–54.

Aina Garí Soler, Anne Cocos, Marianna Apidianaki, and Chris Callison-Burch. 2019. A comparison of context-sensitive models for lexical substitution. In Proceedings of the 13th International Conference on Computational Semantics (IWCS).

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1195–1205.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Gerhard Kremer, Katrin Erk, Sebastian Padó, and Stefan Thater. 2014. What substitutes tell us - analysis of an "all-words" lexical substitution corpus. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 540–549.

Geoffrey Neil Leech. 1992. 100 million words of English: the British National Corpus (BNC).

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Diana McCarthy and Roberto Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation, 43(2):139–159.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2227–2237.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Steven L. Small, Garrison W. Cottrell, and Michael K. Tanenhaus. 2013. Lexical Ambiguity Resolution: Perspective from Psycholinguistics, Neuropsychology and Artificial Intelligence. Elsevier.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word meaning in context: A simple and effective vector model. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1134–1143.

Matthijs Westera and Gemma Boleda. 2019. Don't blame distributional semantics if it can't do entailment. In Proceedings of the 13th International Conference on Computational Semantics (IWCS), pages 120–133.

A Appendix

A.1 Language model

The hidden layers are of sizes 600/600/300, respectively, while the word embeddings are of size 300. The language model was trained optimizing the log-likelihood of a target word given its surrounding context, with stochastic gradient descent for 20 epochs with decaying learning rate, using the Adam optimiser (Kingma and Ba, 2014). The initial learning rate was 0.0005 for a batch size of 32. Dropout was set to 0.2 and applied to the input embedding and the outputs of the LSTM layers. At training time, the text data is fed to the model in sequences of 100 tokens.

A.2 Diagnostic models

We train separate models for each combination of task and input type. Each model consists of a linear transformation and a tanh non-linearity, trained using Cosine Embedding Loss (PyTorch 0.4, Paszke et al., 2017) and the Adam optimiser, with early stopping based on validation loss. We carried out hyperparameter search based on validation loss for each of the model types in order to set batch size and initial learning rate. We report the final settings for each combination of input and task in Table 4.

input        WORD            SUB             WORD&SUB
h^1_t        16, 5 × 10^-5   32, 1 × 10^-4   32, 5 × 10^-5
h^2_t        16, 5 × 10^-5   64, 5 × 10^-4   64, 5 × 10^-4
h^3_t        16, 5 × 10^-5   128, 5 × 10^-4  16, 5 × 10^-5
h^1_{t±1}    128, 1 × 10^-3  128, 1 × 10^-3  128, 5 × 10^-4
h^2_{t±1}    16, 1 × 10^-4   64, 5 × 10^-4   16, 5 × 10^-4
h^3_{t±1}    128, 1 × 10^-3  16, 1 × 10^-4   128, 5 × 10^-4

Table 4: Hyperparameter settings in the diagnostic models (batch size, initial learning rate).

At training time, for each positive target word, we obtain 5 negative targets by sampling words from the frequency quartile of the positive target (frequency is computed on the training corpus of the language model). We always exclude the target word, as well as the substitute words in the SUB and WORD&SUB conditions, from the negative samples. Given the input vector, we maximize the margin of the resulting output vector r̂ to the embeddings of the negative samples (i = −1), and minimize the distance of the output vector to the target representation of the positive instance (i = 1; Eq. (8)).

L(\hat{r}, r, i) =
  \begin{cases}
    1 - \cos(\hat{r}, r)                         & \text{if } i = 1 \\
    \max(0, \cos(\hat{r}, r) - \text{margin})    & \text{if } i = -1
  \end{cases}    (8)

At each training epoch, new negative instances are sampled, and the data is shuffled.
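For concreteness, Eq. (8) coincides with PyTorch's CosineEmbeddingLoss, and the sampling scheme can be sketched as follows. This is our own illustration: the margin value is not reported in the paper and is assumed here, and the helper names are hypothetical.

```python
import random
import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.5)    # Eq. (8); the margin value is an assumption

def probe_loss(r_hat, target, negative_targets):
    # one positive instance (i = 1) plus the sampled negatives (i = -1)
    pos = loss_fn(r_hat.unsqueeze(0), target.unsqueeze(0), torch.ones(1))
    neg = loss_fn(r_hat.unsqueeze(0).expand_as(negative_targets), negative_targets,
                  -torch.ones(negative_targets.size(0)))
    return pos + neg

def sample_negatives(positive_word, quartile_of, words_by_quartile, exclude, n=5):
    # draw n words from the same frequency quartile as the positive target,
    # excluding the target itself and its substitutes
    candidates = [w for w in words_by_quartile[quartile_of[positive_word]]
                  if w != positive_word and w not in exclude]
    return random.sample(candidates, n)
```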
