Query Expansion with Locally-Trained Word Embeddings
Fernando Diaz, Bhaskar Mitra, Nick Craswell
Microsoft

Abstract

Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.

1 Introduction

Continuous space embeddings such as word2vec (Mikolov et al., 2013b) or GloVe (Pennington et al., 2014a) project terms in a vocabulary to a dense, lower dimensional space. Recent results in the natural language processing community demonstrate the effectiveness of these methods for analogy and word similarity tasks. In general, these approaches provide global representations of words; each word has a fixed representation, regardless of any discourse context. While a global representation provides some advantages, language use can vary dramatically by topic. For example, ambiguous terms can easily be disambiguated given local information in immediately surrounding words (Harris, 1954; Yarowsky, 1993). The window-based training of word2vec-style algorithms exploits this distributional property.

A global word embedding, even when trained using local windows, risks capturing only coarse representations of those topics dominant in the corpus. While a particular embedding may be appropriate for a specific word within a sentence-length context globally, it may be entirely inappropriate within a specific topic. Gale et al. refer to this as the 'one sense per discourse' property (Gale et al., 1992). Previous work by Yarowsky demonstrates that this property can be successfully combined with information from nearby terms for word sense disambiguation (Yarowsky, 1995). Our work extends this approach to word2vec-style training in the context of word similarity.

For many tasks that require topic-specific linguistic analysis, we argue that topic-specific representations should outperform global representations. Indeed, it is difficult to imagine a natural language processing task that would not benefit from an understanding of the local topical structure. Our work focuses on query expansion, an information retrieval task where we can study different lexical similarity methods with an extrinsic evaluation metric (i.e. retrieval metrics). Recent work has demonstrated that similarity based on global word embeddings can be used to outperform classic pseudo-relevance feedback techniques (Sordoni et al., 2014; al Masri et al., 2016).

We propose that embeddings be learned on topically-constrained corpora, instead of large topically-unconstrained corpora. In a retrieval scenario, this amounts to retraining an embedding on documents related to the topic of the query. We present local embeddings which capture the nuances of topic-specific language better than global embeddings. There is substantial evidence that global methods underperform local methods for information retrieval tasks such as query expansion (Xu and Croft, 1996), latent semantic analysis (Hull, 1994; Schütze et al., 1995; Singhal et al., 1997), cluster-based retrieval (Tombros and van Rijsbergen, 2001; Tombros et al., 2002; Willett, 1985), and term clustering (Attar and Fraenkel, 1977). We demonstrate that the same holds true when using word embeddings for text retrieval.
2 Motivation

For the purpose of motivating our approach, we will restrict ourselves to word2vec, although other methods behave similarly (Levy and Goldberg, 2014). These algorithms involve discriminatively training a neural network to predict a word given a small set of context words. More formally, given a target word $w$ and observed context $c$, the instance loss is defined as

\[ \ell(w, c) = \log \sigma(\phi(w) \cdot \psi(c)) + \eta \cdot \mathbb{E}_{\bar{w} \sim \theta_C}\left[\log \sigma(-\phi(w) \cdot \psi(\bar{w}))\right] \]

where $\phi : \mathcal{V} \to \mathbb{R}^k$ projects a term into a $k$-dimensional embedding space, $\psi : \mathcal{V}^m \to \mathbb{R}^k$ projects a set of $m$ terms into a $k$-dimensional embedding space, and $\bar{w}$ is a randomly sampled 'negative' context. The parameter $\eta$ controls the sampling of random negative terms. These matrices are estimated over a set of contexts sampled from a large corpus and minimize the expected loss,

\[ \mathcal{L}_c = \mathbb{E}_{w,c \sim p_c}\left[\ell(w, c)\right] \tag{1} \]

where $p_c$ is the distribution of word-context pairs in the training corpus and can be estimated from corpus statistics.

While using corpus statistics may make sense absent any other information, oftentimes we know that our analysis will be topically constrained. For example, we might be analyzing the 'sports' documents in a collection. The language in this domain is more specialized and the distribution over word-context pairs is unlikely to be similar to $p_c(w, c)$. In fact, prior work in information retrieval suggests that documents on subtopics in a collection have very different unigram distributions compared to the whole corpus (Cronen-Townsend et al., 2002). Let $p_t(w, c)$ be the probability of observing a word-context pair conditioned on the topic $t$. The expected loss under this distribution is (Shimodaira, 2000),

\[ \mathcal{L}_t = \mathbb{E}_{w,c \sim p_c}\left[\frac{p_t(w, c)}{p_c(w, c)}\,\ell(w, c)\right] \tag{2} \]

In general, if our corpus consists of sufficiently diverse data (e.g. Wikipedia), the support of $p_t(w, c)$ is much smaller than and contained in that of $p_c(w, c)$. The loss, $\ell$, of a context that occurs more frequently in the topic will be amplified by the importance weight $\omega = \frac{p_t(w, c)}{p_c(w, c)}$. Because topics require specialized language, this is likely to occur; at the same time, these contexts are likely to be underemphasized in training a model according to Equation 1.

In order to quantify this, we took a topic from a TREC ad hoc retrieval collection (see Section 5 for details) and computed the importance weight for each term occurring in the set of on-topic documents. The histogram of weights $\omega$ is presented in Figure 1. While larger probabilities are expected since the size of a topic-constrained vocabulary is smaller, there is a non-trivial number of terms with much larger importance weights. If the loss, $\ell(w)$, of a word2vec embedding is worse for these words with low $p_c(w)$, then we expect these errors to be exacerbated for the topic.

Figure 1: Importance weights for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in Gigaword. (Histogram; x-axis: log(weight).)

Of course, these highly weighted terms may have a low value for $p_t(w)$ but a very high value relative to the corpus. We can adjust the weights by considering the pointwise Kullback-Leibler divergence for each word $w$,

\[ D_w(p_t \,\|\, p_c) = p_t(w) \log \frac{p_t(w)}{p_c(w)} \tag{3} \]

Words which have a much higher value of $p_t(w)$ than $p_c(w)$ and have a high absolute value of $p_t(w)$ will have high pointwise KL divergence. Figure 2 shows the divergences for the top 100 most frequent terms in $p_t(w)$. The higher ranked terms (i.e. good query expansion candidates) tend to have much higher probabilities than found in $p_c(w)$. If the loss on those words is large, this may result in poor embeddings for the most important words for the topic.

Figure 2: Pointwise Kullback-Leibler divergence for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in Gigaword. (x-axis: rank; y-axis: KL.)

A dramatic change in distribution between the corpus and the topic has implications for performance precisely because of the objective used by word2vec (i.e. Equation 1). The training emphasizes word-context pairs occurring with high frequency in the corpus. We will demonstrate that, even with heuristic downsampling of frequent terms in word2vec, these techniques result in inferior performance for specific topics.

Thus far, we have sketched out why using the corpus distribution for a specific topic may result in undesirable outcomes. However, it is not even clear that $p_t(w \mid c) = p_c(w \mid c)$. In fact, we suspect that $p_t(w \mid c) \neq p_c(w \mid c)$ because of the 'one sense per discourse' claim (Gale et al., 1992). To illustrate the difference, we trained two word2vec models: the first on the large, generic Gigaword corpus and the second on a topically-constrained subset of Gigaword. We present the most similar terms to 'cut' using both a global embedding and a topic-specific embedding in Figure 3. In this case, the topic is 'gasoline tax'. As we can see, the 'tax cut' sense of 'cut' is emphasized in the topic-specific embedding.

    global       local
    cutting      tax
    squeeze      deficit
    reduce       vote
    slash        budget
    reduction    reduction
    spend        house
    lower        bill
    halve        plan
    soften       spend
    freeze       billion

Figure 3: Terms similar to 'cut' for a word2vec model trained on a general news corpus and another trained only on documents related to 'gasoline tax'.

3 Local Word Embeddings

The previous section described several reasons why a global embedding may result in overgeneral word embeddings. In order to perform topic-specific training, we need a set of topic-specific documents. In information retrieval scenarios, users rarely provide the system with examples of topic-specific documents, instead providing a small set of keywords.

Fortunately, we can use information retrieval techniques to generate a query-specific set of topical documents. Specifically, we adopt a language modeling approach to do so (Croft and Lafferty, 2003). In this retrieval model, each document is represented as a maximum likelihood language model estimated from document term frequencies. Query language models are estimated similarly, using term frequency in the query.