Term Association Modelling in Information Retrieval
TERM ASSOCIATION MODELLING IN INFORMATION RETRIEVAL

JIASHU ZHAO

A DISSERTATION SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN COMPUTER SCIENCE AND ENGINEERING
YORK UNIVERSITY
TORONTO, ONTARIO
MARCH 2015

© Jiashu Zhao 2015

Abstract

Many traditional Information Retrieval (IR) models assume that query terms are independent of each other. For those models, a document is normally represented as a bag of words/terms and their frequencies. Although traditional retrieval models can achieve reasonably good performance in many applications, the corresponding independence assumption has limitations. Some recent studies investigate how to model term associations/dependencies by proximity measures. However, the theoretical modeling of term associations under the probabilistic retrieval framework is still largely unexplored.

In this thesis, I propose a new concept named Cross Term to model term proximity, with the aim of boosting retrieval performance. With Cross Terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text, and this impact gradually weakens with increasing distance from the place of occurrence. Shape functions are used to characterize such impacts. Based on this assumption, I first propose a bigram CRoss TErm Retrieval (CRTER_2) model for probabilistic IR and a language model based variant, CRTER_2^LM. Specifically, a bigram Cross Term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. Second, I propose a generalized n-gram CRoss TErm Retrieval (CRTER_n) model, defined recursively for n query terms where n > 2. For n-gram Cross Terms, I develop several distance metrics with different properties and employ them in the proposed models for ranking. Third, an enhanced context-sensitive proximity model is proposed to boost the CRTER models, where the contextual relevance of term proximity is studied. The models are validated on several large standard data sets and show improved performance over other state-of-the-art approaches. I also discuss the practical impact of the proposed models. The approaches in this thesis can also benefit term association modeling in other domains.
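To make the shape-function idea concrete, the following is a minimal sketch in Python, not taken from the dissertation: it assumes a Gaussian kernel and a pairwise accumulation over occurrence positions, so the function names, the sigma value, and the summation scheme are illustrative assumptions. The intersection-height idea follows the abstract: two identical symmetric kernels centered at two term occurrences intersect at their midpoint, where both equal the kernel evaluated at half the distance between the occurrences.

```python
import math

# Sketch of the Cross Term shape-function idea, assuming a Gaussian kernel;
# Chapter 3 of the dissertation studies several kernel choices.

def gaussian_kernel(distance, sigma=25.0):
    """Impact of a query term occurrence, decaying with distance in tokens."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))

def bigram_cross_term_frequency(positions_a, positions_b, sigma=25.0):
    """Accumulate bigram Cross Term weight for one document.

    Two identical symmetric kernels centered at positions pa and pb
    intersect at their midpoint, where both equal kernel(|pa - pb| / 2);
    each pair of occurrences contributes that intersection height, so
    nearby pairs contribute close to 1 and distant pairs close to 0.
    """
    total = 0.0
    for pa in positions_a:
        for pb in positions_b:
            total += gaussian_kernel(abs(pa - pb) / 2.0, sigma)
    return total

# Hypothetical example: token positions of "fort" and "mcmurray" in a document.
if __name__ == "__main__":
    print(bigram_cross_term_frequency([3, 40], [5, 120]))
```

The adjacent occurrences at positions 3 and 5 contribute a near-unit weight, while far-apart pairs contribute little, which is how a Cross Term can then be scored like an ordinary unigram term by frequency-based models such as BM25.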
Acknowledgements

First of all, I would like to express my sincere gratitude to my supervisor, Prof. Jimmy Xiangji Huang, for his guidance, motivation, enthusiasm, and rigorous attitude. During my Ph.D. study with Prof. Jimmy Xiangji Huang, I was introduced to Information Retrieval and became a researcher in this area. With his patience, I was able to change my major smoothly from a Master's degree in Mathematics to a Ph.D. in Computer Science. Prof. Jimmy Xiangji Huang also provided me with many opportunities to attend conferences and introduced me to many renowned professors in the field. I have never been so close to the cutting edge of science. I am grateful to have Prof. Jimmy Xiangji Huang as my Ph.D. supervisor.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. Nick Cercone and Prof. Aijun An, for their advice, knowledge, and insightful comments. I would also like to thank my dissertation Examining Committee: Prof. Walter P. Tholen, Prof. Norbert Fuhr, and Prof. Mahmudul Anam. They provided many insightful and detailed suggestions for improving my thesis.

I offer my regards and blessings to all my lab mates: Mariam Daoud, Ben He, Vivian Hu, Jun Miao, Justin Wu, Zheng Ye, Xiaoshi Yin, Xiaofeng Zhou, and others. They are my best friends and study partners.

Last but not least, I would like to thank my family. My beloved father, Shuyuan Zhao, and mother, Fuxia Liu, have always taken care of me and supported me throughout my life. My husband, Shicheng Wu, always encourages my study and research. Because of them, I can sit down and write my papers and this thesis with a peaceful mind.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction and Motivation
  1.1 Motivation
  1.2 Main Contributions
  1.3 Outline

2 Related Work
  2.1 Ranking Models in Information Retrieval
    2.1.1 Boolean Model
    2.1.2 Vector Space Model
    2.1.3 Probabilistic Model
    2.1.4 Language Model
  2.2 Term Association Models
    2.2.1 Phrase-based Term Association Approaches
    2.2.2 Similarity Coefficients
    2.2.3 Latent Topic Models for Term Associations
    2.2.4 BM25-based and LM-based Term Association Approaches
    2.2.5 Bigram and N-gram Term Association Approaches
  2.3 Summary

3 Bigram Cross Term Retrieval Model
  3.1 A New Pseudo Term: Bigram Cross Term
  3.2 Estimations for Bigram Cross Term Variants
    3.2.1 Within-Document Frequency Estimation of Bigram Cross Term
    3.2.2 Document Frequency Estimation of Bigram Cross Term
    3.2.3 Within-Query Frequency Estimation of Bigram Cross Term
  3.3 Kernel Functions
  3.4 Bigram CRoss-TErm Retrieval (CRTER_2) Model
    3.4.1 Unigram Model: BM25
    3.4.2 BM25-based CRTER_2 Model
    3.4.3 Language Model based CRTER_2^LM Model
  3.5 Experimental Settings
    3.5.1 Data Sets and Evaluation Measures
    3.5.2 Evaluation Metrics
    3.5.3 The Okapi System
    3.5.4 Major Probabilistic and Language Models for Comparison
  3.6 Experiments
    3.6.1 Experimental Results of CRTER_2
    3.6.2 Parameter Sensitivity
    3.6.3 Robustness of CRTER_2
    3.6.4 Comparison with Major Probabilistic Proximity Models
    3.6.5 Comparison with Major Proximity Language Models
  3.7 Further Analysis and Discussions
    3.7.1 A Case Study on the Manually Judged Documents
    3.7.2 An Analysis of the Bigram Cross Term Example
    3.7.3 The Stability of Using Different Topic Sets

4 N-gram Cross Term Retrieval Model
  4.1 A New Pseudo Term: N-gram Cross Term
  4.2 Estimations for N-gram Cross Term Variants
    4.2.1 Lp-Norm Distances for N Terms
    4.2.2 Pairwise Distance
    4.2.3 Altitude and Hypotenuse Based Distance
  4.3 The Recursive N-gram CRoss-TErm Retrieval (CRTER_n) Model
  4.4 Algorithm and Time Analysis
  4.5 Experiments
    4.5.1 Performance of CRTER_3
    4.5.2 Runtime Analysis
    4.5.3 The Usefulness of N-gram Cross Terms

5 An Enhanced Context-sensitive Proximity Model
  5.1 Motivation
  5.2 Contextual Relevance of Term Proximity
  5.3 A Context-Sensitive Proximity Model
  5.4 Experiments

6 Summaries
  6.1 Practical Impact of the Proposed Approaches
  6.2 Summary of Using Cross Terms

7 Conclusions and Future Work

Bibliography

Appendix A Proof for Theorem 4.2.2

Appendix B Topic Sets in Experiments
  B.1 Appendix: Topics in TREC8
  B.2 Appendix: Topics in Robust
  B.3 Appendix: Topics in AP88-89
  B.4 Appendix: Topics in WT2G
  B.5 Appendix: Topics in WT10G
  B.6 Appendix: Topics in GOV2
  B.7 Appendix: Topics in Blog06

List of Tables

3.1 Overview of the TREC collections used
3.2 Comparison between the BM25 baseline and CRTER_2 with different kernel functions: the BM25 parameter b is initialized to 0.35. All results are evaluated by MAP, P@5, and P@20. CRTER_2 outperforms BM25 on all collections. "*" means the improvements over BM25 are statistically significant (p < 0.05 with the Wilcoxon Matched-pairs Signed-rank test).
3.3 Comparison among three BM25-based proximity models: PPM, BM25TP, and CRTER_2. The BM25 parameter b is initialized to 0.35. All results are evaluated in terms of MAP, P@5, and P@20. "*" means the improvements over BM25 are statistically significant; "†" means the improvement over PPM is significant; and "‡" means the improvement over BM25TP is significant (p < 0.05).
3.4 Parameter µ for the Dirichlet LM
3.5 Comparison between two language model based proximity models: CRTER_2^LM and MRF. All results are evaluated in terms of MAP, P@5, and P@20. "*" means the improvements over the Dirichlet LM are statistically significant (p < 0.05 with the Wilcoxon Matched-pairs Signed-rank test).
3.6 Direct MAP comparison with PLM
3.7 A case study: distribution of terms in relevant and non-relevant documents (Topic 931 "Fort McMurray" on the Blog06 collection). In the first row, PPM uses position-dependent frequency with proximity in BM25 instead of the raw term frequency. For CRTER_2, the bigram Cross Term frequency is extracted to compare with PPM. In the second and third rows, we extract the additional proximity weight of each model for comparison.
3.8 Comparisons on the judged documents over the Blog06 collection
3.9 The values corresponding to the non-relevant and the relevant documents in the case study
3.10 Experimental results on the robust track