Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-Relevance Feedback
By Hussein Hazimeh

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2016.

Urbana, Illinois

Adviser: Professor ChengXiang Zhai

Abstract

Pseudo-Relevance Feedback (PRF) is an important general technique for improving retrieval effectiveness without requiring any user effort. Several state-of-the-art PRF models are based on the language modeling approach, where a query language model is learned from feedback documents. In all of these models, feedback documents are represented with unigram language models smoothed with a collection language model. While collection language model-based smoothing has proven both effective and necessary in using language models for retrieval, we use axiomatic analysis to show that this smoothing scheme inherently causes the feedback model to favor frequent terms and thus violates the IDF constraint needed to ensure the selection of discriminative feedback terms. To address this problem, we propose replacing collection language model-based smoothing in the feedback stage with additive smoothing, which is analytically shown to select more discriminative terms. Empirical evaluation further confirms that additive smoothing indeed significantly outperforms collection-based smoothing methods in multiple language model-based PRF models.

To my parents, for their love and support.

Acknowledgments

I owe my gratitude to my adviser ChengXiang Zhai, whose dedication and advice made this work possible.

Table of Contents

Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Representative PRF Models
  3.1 Divergence Minimization Model
  3.2 Relevance Model
  3.3 Geometric Relevance Model
Chapter 4 Axiomatic Analysis of PRF Models
  4.1 Divergence Minimization Model
  4.2 Relevance Model
  4.3 Geometric Relevance Model
Chapter 5 Additive Smoothing for PRF Models
  5.1 Divergence Minimization Model
  5.2 Relevance Model
  5.3 Geometric Relevance Model
Chapter 6 Measuring PRF Method Discrimination
Chapter 7 Empirical Evaluation
  7.1 Datasets and Parameter Setting
  7.2 Experimental Results
  7.3 Summary of Main Findings
  7.4 Tables and Figures
Chapter 8 Conclusion and Future Work
References

Chapter 1 Introduction

Feedback is an essential component of every modern retrieval system, as it allows user preferences to be incorporated. The most reliable type of feedback is relevance feedback, in which users label the top documents returned by a retrieval system as relevant or non-relevant. Because relevance feedback is a tedious task that requires labeling large numbers of documents in the collection, pseudo-relevance feedback (PRF) is commonly used as an alternative. In PRF, the top documents returned by the retrieval system are assumed to be relevant and are used to expand the query. Although PRF is not as reliable as relevance feedback, empirical studies have shown that it is an effective general technique for improving retrieval accuracy [2, 3, 4, 5, 6]. Since it requires no user effort, the technique can be applied in any retrieval system. (This chapter and the subsequent ones are joint work with ChengXiang Zhai, published in ACM ICTIR '15; see [1] for bibliographic information.)

Several types of PRF models exist in the literature (see Chapter 2 for a detailed review), among which language model (LM) based PRF methods are both theoretically well-grounded and empirically effective [4, 7]. Many PRF models have been proposed, including the divergence minimization model [4], the mixture model [4], the relevance model [3], and the geometric relevance model [8], in addition to many other improved variants [5, 6]. While these models differ in how they are derived, they generally take a set of top-ranked documents from an initial retrieval result as input and attempt to estimate a unigram language model, referred to as the feedback language model, based on these documents to capture their topic. The feedback language model θ_F can then be linearly interpolated with the original query language model θ_Q to form an "expanded" query language model:

    θ′_Q = (1 − α) θ_Q + α θ_F

where α ∈ [0, 1] is the interpolation coefficient that determines the weight assigned to the feedback language model. This is similar to how the Rocchio feedback algorithm works in the vector space model [9]. The effectiveness of a PRF method is thus directly determined by the quality of the estimated feedback language model θ_F.
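To make the interpolation concrete, the following is a minimal sketch in Python (illustrative code, not from the thesis; the dictionary representation and the function name are assumptions) of forming θ′_Q from θ_Q and θ_F:

```python
# Minimal sketch (hypothetical): query-model interpolation
# theta'_Q = (1 - alpha) * theta_Q + alpha * theta_F,
# with unigram LMs represented as dicts mapping word -> probability.

def interpolate_query_model(theta_q, theta_f, alpha):
    """Convex combination of the original query LM and the feedback LM."""
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

theta_q = {"jaguar": 0.5, "speed": 0.5}               # original query LM
theta_f = {"jaguar": 0.3, "car": 0.4, "engine": 0.3}  # estimated feedback LM
print(interpolate_query_model(theta_q, theta_f, alpha=0.4))
```

Since both inputs are probability distributions and the combination is convex, the interpolated model θ′_Q remains a valid distribution for any α ∈ [0, 1].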
In all current LM-based PRF models, the estimated feedback LM is based on an aggregation of the language models of all the feedback documents. The aggregation is achieved using an averaging function (such as the arithmetic or geometric mean), and it may also involve weighting each feedback document based on how well it matches the query (i.e., the retrieval score of each feedback document in the initial retrieval result), though uniform weighting is also often used. The intuition captured in such an aggregation-based estimate is to favor words that have high probabilities according to all the individual LMs of the feedback documents (i.e., words that occur frequently in all the feedback documents).

Such an aggregation function alone, however, does not consider the occurrences of these terms in the global collection, and indeed it tends to give high probabilities to many non-informative words that are popular in the collection and intuitively not useful for expanding a query. Thus, some of these models further rely on a collection language model to penalize terms that are too common in the collection. As a result, the terms that have high probabilities in the final estimated feedback LM are those that occur frequently in all the feedback documents but not very frequently in the entire collection; this is the same effect the Rocchio algorithm achieves in the vector space retrieval model when TF-IDF weighting is used.

Formally, let the feedback set be F = {D_1, D_2, ..., D_n}, where D_i is the i-th feedback document. In a language model-based approach to PRF, we consider a unigram language model θ_i estimated based on each feedback document D_i, in addition to the collection LM θ_C, which is estimated based on the entire collection of documents. The goal of a PRF method is to estimate a feedback LM, denoted by θ_F, based on all the document LMs θ_1, ..., θ_n and the collection LM θ_C.
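As a concrete illustration of this setup, here is a minimal sketch (hypothetical code, not from the thesis) of estimating the document LMs θ_1, ..., θ_n and the collection LM θ_C by maximum likelihood from token counts:

```python
# Minimal sketch (hypothetical): maximum likelihood estimation of the
# unigram document LMs theta_i and the collection LM theta_C.
from collections import Counter

def mle_unigram_lm(tokens):
    """P(w | theta) = count(w) / total tokens (maximum likelihood estimate)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

feedback_docs = [
    ["jaguar", "car", "engine", "car"],
    ["jaguar", "speed", "car"],
]
doc_lms = [mle_unigram_lm(d) for d in feedback_docs]  # theta_1, ..., theta_n
# theta_C; built here from the feedback documents only for brevity --
# in practice it is estimated from the entire collection.
collection_lm = mle_unigram_lm([w for d in feedback_docs for w in d])
```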
Let w be any feedback word that appears in at least one of the documents in F. A state-of-the-art PRF method would, in general, assign w a probability P(w|θ_F) defined as follows:

    P(w | θ_F) ∝ f( A( P(w | θ_1), ..., P(w | θ_n) ), P(w | θ_C) )        (1.1)

where A : ℝⁿ → ℝ is an aggregation function that combines the probabilities of a word in each of the feedback documents; for example, A can be the arithmetic or geometric mean. f : ℝ² → ℝ is an arbitrary function that is increasing in its first argument (so that words occurring frequently in the feedback documents tend to have a higher value of f) and decreasing in its second (so that words occurring frequently in the collection tend to have a smaller value of f). This generalized view of LM-based PRF models is intuitive, as we want the final probability of a word to be high for relevant, discriminative words, i.e., those that appear in most of the feedback documents and have a low to moderate probability of occurrence under the collection LM.

As in all cases of language model estimation, the feedback documents' LMs (the θ_i's) have to be smoothed to account for the fact that a document is generally a very small sample for estimating a word distribution over the entire vocabulary. Smoothing is also necessary to avoid assigning zero probabilities to the (many) words unseen in a document, an inevitable consequence of the commonly used maximum likelihood estimator. The general motivation for smoothing the feedback documents' language models is that a word may not occur in some of the feedback documents, and leaving its probability unsmoothed might overpenalize it. For example, if the PRF model uses the geometric mean as the function A and a feedback word occurs in all the feedback documents except one, then the probability assigned to that word will be zero if smoothing is not used.

So far, in virtually all work on LM-based PRF models, smoothing of P(w|θ_i) has been based on some form of interpolation of the maximum likelihood estimate with the probability of the word according to the collection language model, P(w|θ_C), which is proportional to the number of occurrences of the word in the whole collection. Dirichlet prior and Jelinek-Mercer smoothing are especially popular choices of collection-based smoothing because of their good performance when used in a retrieval function such as query likelihood or Kullback-Leibler (KL) divergence [7]. Indeed, smoothing with a collection language model has not only proven effective for retrieval, but is also essential for achieving the effect of IDF weighting in a retrieval function [7]. However, in this thesis we show that collection LM-based smoothing, while it effectively avoids the assignment of zero probabilities, is not suitable for LM-based PRF models and can lead to the selection of very frequent, non-informative feedback terms, making the PRF model unable to select discriminative feedback terms.
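To illustrate equation (1.1) and the role of smoothing, the sketch below (hypothetical; the choice of the geometric mean for A and of f(a, c) = a / c is an assumption made for illustration, not one of the thesis's specific models) computes an unnormalized feedback weight with Jelinek-Mercer smoothing of each document LM:

```python
# Illustrative sketch of equation (1.1), not a specific model from the thesis.
# Assumptions: A = geometric mean, f(a, c) = a / c (increasing in a, decreasing
# in c), and Jelinek-Mercer smoothing:
#   P(w | theta_i) = (1 - lam) * P_ml(w | D_i) + lam * P(w | theta_C).
import math

def jelinek_mercer(p_ml, p_coll, lam=0.1):
    """Smooth a document's maximum likelihood estimate with the collection LM."""
    return (1 - lam) * p_ml + lam * p_coll

def feedback_weight(word, doc_lms, collection_lm, lam=0.1):
    """Unnormalized P(w | theta_F) following the shape of equation (1.1)."""
    smoothed = [jelinek_mercer(lm.get(word, 0.0), collection_lm[word], lam)
                for lm in doc_lms]
    agg = math.prod(smoothed) ** (1 / len(smoothed))  # A: geometric mean
    return agg / collection_lm[word]                  # f: penalize collection-frequent words

doc_lms = [{"jaguar": 0.5, "car": 0.5},
           {"jaguar": 0.34, "speed": 0.33, "car": 0.33}]
collection_lm = {"jaguar": 0.01, "car": 0.20, "speed": 0.05, "the": 0.74}
print(feedback_weight("jaguar", doc_lms, collection_lm))  # rare in collection -> large weight
print(feedback_weight("car", doc_lms, collection_lm))     # common in collection -> smaller weight
```

With lam = 0 (no smoothing), a word missing from even one feedback document receives a geometric mean of zero, which is exactly the overpenalization described above; with collection-based smoothing, every smoothed estimate inherits mass from P(w|θ_C), which is the behavior the thesis analyzes.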