Query Expansion with Locally-Trained Word Embeddings

Fernando Diaz, Bhaskar Mitra, Nick Craswell
Microsoft
[email protected] [email protected] [email protected]

Abstract

Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.

1 Introduction

Continuous space embeddings such as word2vec (Mikolov et al., 2013b) or GloVe (Pennington et al., 2014a) project terms in a vocabulary to a dense, lower dimensional space. Recent results in the natural language processing community demonstrate the effectiveness of these methods for analogy and word similarity tasks. In general, these approaches provide global representations of words: each word has a fixed representation, regardless of any discourse context. While a global representation provides some advantages, language use can vary dramatically by topic. For example, ambiguous terms can easily be disambiguated given local information in immediately surrounding words (Harris, 1954; Yarowsky, 1993). The window-based training of word2vec-style algorithms exploits this distributional property.

A global embedding, even when trained using local windows, risks capturing only coarse representations of those topics dominant in the corpus. While a particular embedding may be appropriate for a specific word within a sentence-length context globally, it may be entirely inappropriate within a specific topic. Gale et al. refer to this as the 'one sense per discourse' property (Gale et al., 1992). Previous work by Yarowsky demonstrates that this property can be successfully combined with information from nearby terms for word sense disambiguation (Yarowsky, 1995). Our work extends this approach to word2vec-style training in the context of word similarity.

For many tasks that require topic-specific linguistic analysis, we argue that topic-specific representations should outperform global representations. Indeed, it is difficult to imagine a natural language processing task that would not benefit from an understanding of the local topical structure. Our work focuses on query expansion, a task where we can study different lexical similarity methods with an extrinsic evaluation metric (i.e. retrieval metrics). Recent work has demonstrated that similarity based on global word embeddings can be used to outperform classic pseudo-relevance feedback techniques (Sordoni et al., 2014; al Masri et al., 2016).

We propose that embeddings be learned on topically-constrained corpora, instead of large topically-unconstrained corpora. In a retrieval scenario, this amounts to retraining an embedding on documents related to the topic of the query. We present local embeddings which capture the nuances of topic-specific language better than global embeddings. There is substantial evidence that global methods underperform local methods for information retrieval tasks such as query expansion (Xu and Croft, 1996), latent semantic analysis (Hull, 1994; Schütze et al., 1995; Singhal et al., 1997), cluster-based retrieval (Tombros and van Rijsbergen, 2001; Tombros et al., 2002; Willett, 1985), and term clustering (Attar and Fraenkel, 1977).
We demonstrate that the same holds true when using word embeddings for text retrieval.

[Figure 1: Importance weights for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in gigaword; x-axis: log(weight).]

2 Motivation

For the purpose of motivating our approach, we will restrict ourselves to word2vec, although other methods behave similarly (Levy and Goldberg, 2014). These algorithms involve discriminatively training a neural network to predict a word given a small set of context words. More formally, given a target word w and an observed context c, the instance loss is defined as,

    ℓ(w, c) = log σ(φ(w) · ψ(c)) + η · E_{w̄∼θ_C}[log σ(−φ(w̄) · ψ(c))]   (1)

where φ : V → ℝᵏ and ψ : V → ℝᵏ embed words and contexts into a k-dimensional space, σ is the logistic function, and θ_C is the corpus-wide word distribution from which negative examples w̄ are sampled. Because training pairs are drawn from the corpus distribution p_c(w, c), the summed loss emphasizes word-context pairs that are frequent in the corpus as a whole. Let p_t(w, c) denote the probability of observing a word-context pair conditioned on the topic t. The expected loss under this distribution is (Shimodaira, 2000),

    L_t = E_{w,c∼p_c}[ (p_t(w, c) / p_c(w, c)) · ℓ(w, c) ]   (2)

The ratio p_t(w, c)/p_c(w, c) acts as an importance weight on each training instance: words common in the topic but rare in the corpus as a whole carry large weights (Figure 1).
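To make Equation 1 concrete, the following is a minimal sketch of the instance loss with sampled negatives. It is our illustration, not the authors' implementation: the embedding matrices are plain numpy arrays, and we reuse η as the number of Monte Carlo samples for the expectation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_loss(w, c, phi, psi, theta_c, eta=5, rng=None):
    """Sketch of Equation 1 for a single (word, context) pair.

    phi, psi : |V| x k word and context embedding matrices.
    theta_c  : corpus-wide unigram distribution used for negative sampling.
    eta      : weight on the negative term; also used here as the sample count.
    """
    rng = rng or np.random.default_rng()
    positive = np.log(sigmoid(phi[w] @ psi[c]))
    # Monte Carlo estimate of E_{w~theta_C}[log sigma(-phi(w) . psi(c))].
    negatives = rng.choice(len(theta_c), size=eta, p=theta_c)
    negative = np.mean([np.log(sigmoid(-phi[n] @ psi[c])) for n in negatives])
    return positive + eta * negative
```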

[Figure 2: Pointwise Kullback-Leibler divergence for terms occurring in documents related to 'argentina pegging dollar' relative to frequency in gigaword.]

[Figure 3: Terms similar to 'cut' for a word2vec model trained on a general news corpus and another trained only on documents related to 'gasoline tax'. Neighbors shown include 'reduction', 'spend', 'lower', 'halve', 'soften', 'freeze', 'bill', 'plan', 'house', 'billion', and 'rank'.]

We can develop some intuition about the magnitude of these importance weights by considering the pointwise Kullback-Leibler divergence for each word w,

    D_w(p_t ‖ p_c) = p_t(w) log( p_t(w) / p_c(w) )   (3)

Words which have a much higher value of p_t(w) than p_c(w), and which have a high absolute value of p_t(w), will have high pointwise KL divergence. Figure 2 shows the divergences for the 100 most frequent terms in p_t(w). The higher ranked terms (i.e. good query expansion candidates) tend to have much higher probabilities than found in p_c(w). If the loss on those words is large, this may result in poor embeddings for the most important words for the topic.

A dramatic change in distribution between the corpus and the topic has implications for performance precisely because of the objective used by word2vec (i.e. Equation 1). The training emphasizes word-context pairs occurring with high frequency in the corpus. We will demonstrate that, even with heuristic down-sampling of frequent terms in word2vec, these techniques result in inferior performance for specific topics.

Thus far, we have sketched out why using the corpus distribution for a specific topic may result in undesirable outcomes. However, it is not even clear that p_t(w|c) = p_c(w|c). In fact, we suspect that p_t(w|c) ≠ p_c(w|c) because of the 'one sense per discourse' claim (Gale et al., 1992). We can qualitatively observe the difference between p_c(w|c) and p_t(w|c) by training two word2vec models: the first on the large, generic Gigaword corpus and the second on a topically-constrained subset of the Gigaword. Figure 3 presents the most similar terms to 'cut' under both the global embedding and the topic-specific embedding; in this case, the topic is 'gasoline tax'. As we can see, the 'tax cut' sense of 'cut' is emphasized in the topic-specific embedding.

3 Local Word Embeddings

The previous section described several reasons why a global embedding may result in over-general word embeddings. In order to perform topic-specific training, we need a set of topic-specific documents. In information retrieval scenarios, however, users rarely provide the system with examples of topic-specific documents, instead providing a small set of keywords.

Fortunately, we can use information retrieval techniques to generate a query-specific set of topical documents. Specifically, we adopt a language modeling approach to do so (Croft and Lafferty, 2003). In this retrieval model, each document is represented as a maximum likelihood language model estimated from document term frequencies. Query language models are estimated similarly, using term frequencies in the query. A document score, then, is the Kullback-Leibler divergence between the query and document language models,

    D(p_q ‖ p_d) = Σ_{w∈V} p_q(w) log( p_q(w) / p_d(w) )   (4)

Documents whose language models are more similar to the query language model will have a lower KL divergence score. For consistency with prior work, we will refer to this as the query likelihood score of a document.
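Both Equation 3 and Equation 4 are simple functions of unigram language models. A sketch under our own assumptions (probabilities stored in dictionaries, a small epsilon standing in for whatever smoothing was actually used):

```python
import math

def pointwise_kl(p_t, p_c, w, eps=1e-12):
    # Equation 3: the contribution of word w to KL(p_t || p_c).
    return p_t[w] * math.log(p_t[w] / (p_c.get(w, 0.0) + eps))

def kl_score(p_q, p_d, eps=1e-12):
    # Equation 4: score a document by KL(p_q || p_d); lower is better.
    return sum(pw * math.log(pw / (p_d.get(w, 0.0) + eps))
               for w, pw in p_q.items() if pw > 0)
```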
The scores in Equation 4 can be passed through a softmax function to derive a multinomial over the entire corpus (Lavrenko and Croft, 2001),

    p(d) = exp(−D(p_q ‖ p_d)) / Σ_{d′} exp(−D(p_q ‖ p_{d′}))   (5)

Recall from Section 2 that training a word2vec model weights word-context pairs according to their corpus frequency. Our query-based multinomial, p(d), provides a weighting function that captures the documents relevant to this topic. Although an estimate of the topic-specific documents from a query will be imprecise (i.e. some nonrelevant documents will be scored highly), the language use tends to be consistent with that found in the known relevant documents.

We can train a local word embedding using an arbitrary optimization method by sampling documents from p(d) instead of uniformly from the corpus. In this work we use word2vec, although any method that operates on a sample of documents can be used.

4 Query Expansion with Word Embeddings

When using language models for retrieval, query expansion involves estimating an alternative to p_q. Specifically, when each expansion term is associated with a weight, we normalize these weights to derive the expansion language model, p_{q+}. This language model is then interpolated with the original query model,

    p_q¹(w) = λ p_q(w) + (1 − λ) p_{q+}(w)   (6)

This interpolated language model can then be used with Equation 4 to rank documents (Abdul-Jaleel et al., 2004). We will refer to this as the expanded query score of a document.

We now turn to using word embeddings for query expansion. Let U be an |V| × k term embedding matrix. If q is a |V| × 1 column term vector for a query, then the expansion term weights are UUᵀq. We then take the top k terms, normalize their weights, and compute p_{q+}(w).

We consider the following alternatives for U. The first approach is to use a global model trained by sampling documents uniformly from the corpus. The second approach, which we propose in this paper, is to use a local model trained by sampling documents from p(d).
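A sketch of this expansion pipeline, covering Equation 5, the UUᵀq step, and Equation 6. This is our reading of the text, not the authors' code: clipping negative expansion weights before normalization is our assumption, as are the default parameter values.

```python
import numpy as np

def doc_multinomial(kl_scores):
    # Equation 5: softmax of negated KL scores over retrieved documents.
    logits = -np.asarray(kl_scores)
    logits -= logits.max()                    # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def expansion_model(U, q, top_k=25):
    # Expansion term weights are U U^T q; keep and normalize the top terms.
    weights = U @ (U.T @ q)
    weights = np.clip(weights, 0.0, None)     # assumption: drop negative weights
    top = np.argsort(-weights)[:top_k]
    p_plus = np.zeros_like(weights)
    p_plus[top] = weights[top] / weights[top].sum()
    return p_plus

def interpolate(p_q, p_plus, lam=0.5):
    # Equation 6: mix the original query model with the expansion model.
    return lam * p_q + (1.0 - lam) * p_plus
```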
5 Methods

5.1 Data

To evaluate the different retrieval strategies described in Section 3, we use the following datasets. Two newswire datasets, trec12 and robust, consist of the newswire documents and associated queries from TREC ad hoc retrieval evaluations. The trec12 corpus consists of Tipster disks 1 and 2, and the robust corpus consists of Tipster disks 4 and 5. Our third dataset, web, consists of the ClueWeb 2009 Category B web corpus. For the web corpus, we only retain documents with a Waterloo spam rank above 70 (https://plg.uwaterloo.ca/~gvcormac/clueweb09spam/). We present corpus statistics in Table 1.

Table 1: Corpora used for retrieval and local embedding training.

            docs         words       queries
    trec12  469,949      438,338     150
    robust  528,155      665,128     250
    web     50,220,423   90,411,624  200
    news    9,875,524    2,645,367   -
    wiki    3,225,743    4,726,862   -

We consider several publicly available global embeddings. We use four GloVe embeddings of different dimensionality trained on the union of Wikipedia and Gigaword documents (http://nlp.stanford.edu/data/glove.6B.zip). We use one publicly available word2vec embedding trained on Google News documents (https://code.google.com/archive/p/word2vec/). We also trained a global embedding for trec12 and robust using the entire corpus. Instead of training a global embedding on the large web collection, we use a GloVe embedding trained on Common Crawl data (http://nlp.stanford.edu/data/glove.840B.300d.zip).

We train local embeddings with word2vec using one of three retrieval sources. First, we consider documents retrieved from the target corpus of the query (i.e. trec12, robust, or web). We also consider training a local embedding by performing a retrieval on large auxiliary corpora. We use the Gigaword corpus as a large auxiliary news corpus; we hypothesize that retrieving from a larger news corpus will provide substantially more local training data than a target retrieval. We also use a Wikipedia snapshot from December 2014; we hypothesize that retrieving from a large, high-fidelity corpus will provide cleaner language than that found in lower-fidelity target domains such as the web. Table 1 shows the relative magnitude of these auxiliary corpora compared to the target corpora.

All corpora in Table 1 were stopped using the SMART stopword list (http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop) and stemmed using the Krovetz algorithm (Krovetz, 1993). We used the Indri implementation for indexing and retrieval (http://www.lemurproject.org/indri/).

5.2 Evaluation

We consider several standard retrieval evaluation metrics, including NDCG@10 and interpolated precision at standard recall points (Järvelin and Kekäläinen, 2002; van Rijsbergen, 1979). NDCG@10 provides insight into performance at the higher ranks. An interpolated precision-recall graph describes system performance throughout the entire ranked list.

5.3 Training

All retrieval experiments were conducted by performing 10-fold cross-validation across queries. Specifically, we cross-validate the number of expansion terms, k ∈ {5, 10, 25, 50, 100, 250, 500}, and the interpolation weight, λ ∈ [0, 1]. For local word2vec training, we also cross-validate the learning rate, α ∈ {10⁻¹, 10⁻², 10⁻³}.

All word2vec training used the publicly available word2vec cbow implementation (https://code.google.com/p/word2vec/). When training the local models, we sampled 1000 documents from p(d) with replacement. To compensate for the much smaller corpus size, we ran word2vec training for 80 iterations. Local word2vec models use a fixed embedding dimension of 400, although other choices did not significantly affect our results. Unless otherwise noted, default parameter settings were used.

In our experiments, expanded queries rescore the top 1000 documents from an initial query likelihood retrieval. Previous results have demonstrated that this approach results in performance nearly identical to an expanded retrieval at a much lower cost (Diaz, 2015). Because publicly available embeddings may have tokenization inconsistent with our target corpora, we restricted the vocabulary of candidate expansion terms to those occurring in the initial retrieval. If a candidate term was not found in the vocabulary of the embedding matrix, we searched for the candidate in a stemmed version of the embedding vocabulary. If the candidate term was still not found after this process, we removed it from consideration.
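Putting the sampling and training settings of this section together, local training might look like the sketch below. We use gensim's Word2Vec (version 4 or later) as a stand-in for the original word2vec cbow tool; the tokenized documents and the p(d) multinomial are assumed to come from the retrieval step, and min_count=1 is our choice for the small resampled corpus.

```python
import numpy as np
from gensim.models import Word2Vec

def train_local_embedding(docs, p_d, n_samples=1000, seed=0):
    """Train a query-local CBOW model on documents sampled from p(d).

    docs : list of documents, each a list of tokenized sentences.
    p_d  : multinomial over docs (Equation 5).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(docs), size=n_samples, replace=True, p=p_d)
    sentences = [sentence for i in idx for sentence in docs[i]]
    # 400-dimensional CBOW trained for 80 epochs, as in Section 5.3.
    return Word2Vec(sentences, vector_size=400, sg=0, epochs=80, min_count=1)
```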
6 Results

We present results for the retrieval experiments in Table 2. We find that embedding-based query expansion outperforms the query likelihood baseline across all conditions. When using a global embedding, the news corpora benefit from the various embeddings in different situations. Interestingly, for trec12, an embedding trained on the target corpus significantly outperforms all other global embeddings, despite using substantially less data to estimate the model. While this performance may be due to the embedding having a tokenization consistent with the target corpus, it may also come from the fact that the corpus is more representative of the target documents than the other embeddings, which rely on online news or are mixed with non-news content. To some extent this supports our desire to move training closer to the target distribution.

Table 2: Retrieval results comparing query expansion based on various global and local embeddings. Bolded numbers indicate the best expansion in that class of embeddings. A Wilcoxon signed rank test between the bolded numbers indicates statistically significant improvements (p < 0.05) for all collections.

                   global                                       local
            QL     wiki+giga                   gnews   target   target  giga    wiki
                   50     100    200    300    300     400      400     400     400
    trec12  0.514  0.518  0.518  0.530  0.531  0.530   0.545    0.535   0.563*  0.523
    robust  0.467  0.470  0.463  0.469  0.468  0.472   0.465    0.475   0.517*  0.476
    web     0.216  0.227  0.229  0.230  0.232  0.218   0.216    0.234   0.236   0.258*

[Figure 4 panels for trec12, robust, and web: interpolated precision (y-axis) versus recall (x-axis) for the QL, global, and local runs.]

Figure 4: Interpolated precision-recall curves for query likelihood, the best global embedding, and the best local embedding from Table 2.

Across all conditions, local embeddings significantly outperform global embeddings for query expansion. For our two news collections, estimating the local model using a retrieval from the larger Gigaword corpus led to substantial improvements. This effect is almost certainly due to the Gigaword corpus being similar in writing style to the target corpora while, at the same time, providing significantly more relevant content (Diaz and Metzler, 2006). As a result, the local embedding is trained on a larger variety of topical material than if it were to use a retrieval from the smaller target corpus. An embedding trained with a retrieval from Wikipedia tended to perform worse, most likely because the language is dissimilar from news content. Our web collection, on the other hand, benefitted more from embeddings trained using retrievals from the general Wikipedia corpus. The Gigaword corpus was less useful here because news-style language is almost certainly not representative of general web documents.

Figure 4 presents interpolated precision-recall curves comparing the baseline, the best global query expansion method, and the best local query expansion method. Interestingly, although global methods achieve strong performance for NDCG@10, these improvements over the baseline are not reflected in the precision-recall curves. Local methods, on the other hand, almost always strictly dominate both the baseline and global expansion across all recall levels.

These results support the hypothesis that local embeddings provide better similarity measures than global embeddings for query expansion. In order to understand why, we first compare the performance differences between local and global embeddings. Figure 2 suggests that we should adopt a local embedding when the local unigram language model deviates from the corpus language model. To test this, we computed the KL divergence between the local unigram distribution, Σ_d p(w|d)p(d), and the corpus unigram language model (Cronen-Townsend et al., 2002). We hypothesize that, when this value is high, the topic language differs from the corpus language and the global embedding will be inferior to the local embedding. We tested the rank correlation between this KL divergence and the relative performance of the local embedding with respect to the global embedding. These correlations are presented in Table 3.

Table 3: Kendall's τ and Spearman's ρ between the improvement in NDCG@10 and the local KL divergence from the corpus language model. The improvement is measured for the best local embedding over the best global embedding.

            τ        ρ
    trec12  0.0585   0.0798
    robust  0.0545   0.0792
    web     0.0204   0.0283
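For concreteness, the correlation analysis behind Table 3 might be computed as follows. This is our sketch, not the authors' code: the dictionary representations and smoothing epsilon are assumptions, and scipy's kendalltau and spearmanr stand in for whatever statistics package was actually used.

```python
import math
from scipy.stats import kendalltau, spearmanr

def local_kl(p_w_given_d, p_d, p_corpus, eps=1e-12):
    # KL between the local unigram model sum_d p(w|d) p(d)
    # and the corpus unigram language model.
    local = {}
    for d, pd in p_d.items():
        for w, pwd in p_w_given_d[d].items():
            local[w] = local.get(w, 0.0) + pwd * pd
    return sum(pw * math.log(pw / (p_corpus.get(w, 0.0) + eps))
               for w, pw in local.items() if pw > 0)

def rank_correlations(kl_per_query, ndcg_gain_per_query):
    # Correlate per-query local KL with the local-minus-global NDCG@10 gain.
    tau, _ = kendalltau(kl_per_query, ndcg_gain_per_query)
    rho, _ = spearmanr(kl_per_query, ndcg_gain_per_query)
    return tau, rho
```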

Unfortunately, we find that the correlation is low, although it is positive across all collections.

We can also qualitatively analyze differences in the behavior of the embeddings. If we have access to the set of documents labeled relevant to a query, then we can compute the frequency of terms in this set and consider those terms with high frequency (after stopping and stemming) to be good query expansion candidates. We can then visualize where these terms lie in the global and local embeddings. In Figure 5, we present a two-dimensional projection (van der Maaten and Hinton, 2008) of terms for the query 'ocean remote sensing', with the good candidates highlighted. Our projection includes the top 50 candidates by frequency and a sample of terms occurring in the query likelihood retrieval. We notice that, in the global embedding, the good candidates are spread out amongst poorer candidates. By contrast, the local embedding clusters the candidates together and also situates them closely around the query. As a result, we suspect that the similar terms extracted from the local embedding are more likely to include these good candidates.

Figure 5: Global versus local embedding of highly relevant terms. Each point represents a candidate expansion term. Red points have high frequency in the relevant set of documents. White points have low or no frequency in the relevant set of documents. The blue point represents the query. Contours indicate distance from the query.
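The cited projection method (van der Maaten and Hinton, 2008) is t-SNE. A minimal scikit-learn sketch, with a perplexity value of our own choosing:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_terms(embedding, terms):
    # Project candidate expansion terms to 2-D for plotting; `embedding`
    # maps a term to its vector. Perplexity must be below the term count.
    X = np.stack([embedding[t] for t in terms])
    return TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```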
7 Discussion

The success of local embeddings on this task should alarm natural language processing researchers using global embeddings as a representational tool. For one, the approach of learning from vast amounts of data is only effective if the data is appropriate for the task at hand; much smaller, high-quality data can provide much better performance. Beyond this, our results suggest that the approach of estimating global representations, while computationally convenient, may overlook insights available at query time, or at evaluation time in general. A similar local embedding approach can be adopted for any natural language processing task where topical locality is expected and can be estimated. Although we used a query to re-weight the corpus in our experiments, we could just as easily use alternative contextual information (e.g. a sentence, paragraph, or document) in other tasks.

Despite these strong results, we believe that there are still some open questions in this work. First, although local embeddings provide effectiveness gains, they can be quite inefficient compared to global embeddings. We believe that there is opportunity to improve this efficiency by computing local embeddings offline at a level coarser than individual queries but more specialized than the whole corpus. If the retrieval algorithm is able to select the appropriate embedding at query time, we can avoid training the local embedding altogether. Second, although our supporting experiments (Table 3, Figure 5) add some insight into our intuition, the results are not strong enough to provide a solid explanation. Further theoretical and empirical analysis is necessary.

8 Related Work

Topical adaptation of models. The shortcomings of learning a single global vector representation, especially for polysemous words, have been pointed out before (Reisinger and Mooney, 2010b). The problem can be addressed by training a global model with multiple vector embeddings per word (Reisinger and Mooney, 2010a; Huang et al., 2012) or topic-specific embeddings (Liu et al., 2015). The number of senses for each word may be fixed (Neelakantan et al., 2015) or determined using class labels (Trask et al., 2015). However, to the best of our knowledge, this is the first time that training topic-specific word embeddings has been explored.

Several methods exist in the language modeling community for topic-dependent adaptation of language models (Bellegarda, 2004). These can lead to performance improvements in tasks such as machine translation (Zhao et al., 2004) and speech recognition (Nanjo and Kawahara, 2004). Topic-specific data may be gathered in advance by identifying a corpus of topic-specific documents. It may also be gathered during the discourse, using multiple hypotheses from N-best lists as a source of topic-specific language. A topic-specific language model is then trained (or the global model is adapted) online using the topic-specific training data. A topic-dependent model may be combined with the global model using linear interpolation (Iyer and Ostendorf, 1999) or other more sophisticated approaches (Federico, 1996; Kuhn and De Mori, 1990). Similarly to this adaptation work, we use topic-specific documents to train a topic-specific model. In our case, the documents come from a first round of retrieval for the user's current query, and the word embedding model is trained on sentences from the topic-specific document set. Unlike the past work, we do not focus on interpolating the local and global models, although this is a promising area for future work. In the current study we focus on a direct comparison between the local-only and global-only approaches for improving retrieval performance.
Word embeddings for IR. Information retrieval has a long history of learning representations of words as low-dimensional dense vectors. These approaches can be broadly classified into two families based on whether they are learned from a term-document matrix or from term co-occurrence data. Using the term-document matrix for embedding leads to several well-studied approaches such as LSA (Deerwester et al., 1990), PLSA (Hofmann, 1999), and LDA (Blei et al., 2003; Wei and Croft, 2006). The performance of these models varies depending on the task; for example, they are known to perform poorly for retrieval tasks unless combined with lexical features (Atreya and Elkan, 2011a). Term co-occurrence based embeddings, such as word2vec (Mikolov et al., 2013b; Mikolov et al., 2013a) and GloVe (Pennington et al., 2014b), have recently been remarkably popular for many natural language processing and logical reasoning tasks. However, there are relatively few known successful applications of these models in IR. Ganguly et al. (2015) used word similarity in the word2vec embedding space to estimate term transformation probabilities in a language modelling setting for retrieval. More recently, Nalisnick et al. (2016) proposed to model document about-ness by computing the similarity between all pairs of query and document terms using dual embedding spaces. Both of these approaches estimate the semantic relatedness between two terms as the cosine distance between them in the embedding space(s). We adopt a similar notion of term relatedness, but focus on demonstrating improved retrieval performance using locally trained embeddings.

Local latent semantic analysis. Despite the mathematical appeal of latent semantic analysis, several experiments suggest that its empirical performance may be no better than that of ranking using standard term vectors (Deerwester et al., 1990; Dumais, 1995; Atreya and Elkan, 2011b). In order to address the coarseness of corpus-level latent semantic analysis, Hull proposed restricting the analysis to the documents relevant to a query (Hull, 1994). This approach significantly improved over corpus-level analysis for routing tasks, a result that has been reproduced in subsequent research (Schütze et al., 1995; Singhal et al., 1997). Our work can be seen as an extension of these results to more recent techniques such as word2vec.

9 Conclusion

We have demonstrated a simple and effective method for performing query expansion with word embeddings. Importantly, our results highlight the value of locally training word embeddings in a query-specific manner. The strength of these results suggests that other research adopting global embedding vectors should consider local embeddings as a potentially superior representation. Instead of using a "Sriracha sauce of deep learning," as embedding techniques like word2vec have been called, we contend that the situation sometimes requires, say, that we make a béchamel or a mole verde or a sambal, or otherwise learn to cook.

References

Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Donald Metzler, Mark D. Smucker, Trevor Strohman, Howard Turtle, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. In Online Proceedings of the 2004 Text REtrieval Conference.

Mohannad al Masri, Catherine Berrut, and Jean-Pierre Chevallet. 2016. A comparison of deep learning based query expansion with pseudo-relevance feedback and mutual information. In Proceedings of the 38th European Conference on IR Research (ECIR 2016), pages 709-715, Cham. Springer International Publishing.

Avinash Atreya and Charles Elkan. 2011a. Latent semantic indexing (LSI) fails for TREC collections. ACM SIGKDD Explorations Newsletter, 12(2):5-10.

Avinash Atreya and Charles Elkan. 2011b. Latent semantic indexing (LSI) fails for TREC collections. SIGKDD Explor. Newsl., 12(2):5-10, March.

R. Attar and A. S. Fraenkel. 1977. Local feedback in full-text retrieval systems. J. ACM, 24(3):397-417, July.

Jerome R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93-108.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022.

W. Bruce Croft and John Lafferty. 2003. Language Modeling for Information Retrieval. Kluwer Academic Publishing.

Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. 2002. Predicting query performance. In SIGIR '02, pages 299-306, New York, NY, USA. ACM Press.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407.

Fernando Diaz and Donald Metzler. 2006. Improving the estimation of relevance models using large external corpora. In SIGIR '06, pages 154-161, New York, NY, USA. ACM Press.

Fernando Diaz. 2015. Condensed list relevance models. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, ICTIR '15, pages 313-316, New York, NY, USA, May. ACM.
Susan T. Dumais. 1995. Latent semantic indexing (LSI): TREC-3 report. In Overview of the Third Text REtrieval Conference (TREC-3), pages 219-230.

Marcello Federico. 1996. Bayesian estimation methods for n-gram language model adaptation. In Proceedings of ICSLP 96, volume 1, pages 240-243. IEEE.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, HLT '91, pages 233-237, Stroudsburg, PA, USA. Association for Computational Linguistics.

Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth J. F. Jones. 2015. Word embedding based generalized language model for information retrieval. In SIGIR '15, pages 795-798, New York, NY, USA. ACM.

Zellig S. Harris. 1954. Distributional structure. WORD, 10(2-3):146-162.

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In SIGIR '99, pages 50-57, New York, NY, USA. ACM Press.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pages 873-882. Association for Computational Linguistics.

David Hull. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In SIGIR '94, pages 282-291, New York, NY, USA. Springer-Verlag New York, Inc.

R. M. Iyer and M. Ostendorf. 1999. Modeling long distance dependence in language: topic mixtures versus dynamic cache models. IEEE Transactions on Speech and Audio Processing, 7(1):30-39, January.

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. TOIS, 20(4):422-446.

Robert Krovetz. 1993. Viewing morphology as an inference process. In SIGIR '93, pages 191-202, New York, NY, USA. ACM Press.

Roland Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570-583.

Victor Lavrenko and W. Bruce Croft. 2001. Relevance based language models. In SIGIR '01, pages 120-127. ACM Press.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177-2185. Curran Associates, Inc.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI '15, pages 2418-2424. AAAI Press.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proc. WWW. International World Wide Web Conferences Steering Committee.

Hiroaki Nanjo and Tatsuya Kawahara. 2004. Language model and speaking rate adaptation for spontaneous presentation speech recognition. IEEE Transactions on Speech and Audio Processing, 12(4):391-400.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014a. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014b. GloVe: Global vectors for word representation. Proc. EMNLP, 12:1532-1543.

Joseph Reisinger and Raymond Mooney. 2010a. A mixture model with sharing for lexical semantics. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1173-1182. Association for Computational Linguistics.

Joseph Reisinger and Raymond J. Mooney. 2010b. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109-117. Association for Computational Linguistics.

Hinrich Schütze, David A. Hull, and Jan O. Pedersen. 1995. A comparison of classifiers and document representations for the routing problem. In SIGIR '95, pages 229-237, New York, NY, USA. ACM.

Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244.
Amit Singhal, Mandar Mitra, and Chris Buckley. 1997. Learning routing queries in a query zone. SIGIR Forum, 31(SI):25-32, July.

Alessandro Sordoni, Yoshua Bengio, and Jian-Yun Nie. 2014. Learning concept embeddings for query expansion by quantum entropy minimization. In AAAI '14, pages 1586-1592. AAAI Press.

Anastasios Tombros and C. J. van Rijsbergen. 2001. Query-sensitive similarity measures for the calculation of interdocument relationships. In CIKM '01, pages 17-24, New York, NY, USA. ACM Press.

Anastasios Tombros, Robert Villa, and C. J. van Rijsbergen. 2002. The effectiveness of query-specific hierarchic clustering in information retrieval. Inf. Process. Manage., 38(4):559-582, July.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec: a fast and accurate method for word sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388.

Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths.

Xing Wei and W. Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In SIGIR '06, pages 178-185, New York, NY, USA. ACM Press.

Peter Willett. 1985. Query-specific automatic document classification. In International Forum on Information and Documentation, volume 10, pages 28-32.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In SIGIR '96, pages 4-11, New York, NY, USA. ACM.

David Yarowsky. 1993. One sense per collocation. In Proceedings of the Workshop on Human Language Technology, HLT '93, pages 266-271, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL '95, pages 189-196, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics.