Effective Focused Retrieval by Exploiting Query Context and Document Structure
Total Page:16
File Type:pdf, Size:1020Kb
Effective Focused Retrieval by Exploiting Query Context and Document Structure ILLC Dissertation Series DS-2011-06 For further information about ILLC-publications, please contact Institute for Logic, Language and Computation Universiteit van Amsterdam Plantage Muidergracht 24 1018 TV Amsterdam phone: +31-20-525 6051 fax: +31-20-525 5206 e-mail: [email protected] homepage: http://www.illc.uva.nl/ Effective Focused Retrieval by Exploiting Query Context and Document Structure Academisch Proefschrift ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof. dr. D.C. van den Boom ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op vrijdag 7 oktober 2011, te 14.00 uur door Anna Maria Kaptein geboren te Heerhugowaard Promotiecommissie Promotor: Prof. dr. J.S. Mackenzie Owen Co-promotor: Dr. ir. J. Kamps Overige Leden: Dr. ir. D. Hiemstra Prof. dr. F.M.G. de Jong Dr. M.J. Marx Prof. dr. M. de Rijke Prof. dr. ir. A.P. de Vries Faculteit der Geesteswetenschappen Universiteit van Amsterdam The investigations were supported by the Nether- lands Organization for Scientific Research (NWO) in the EfFoRT (Effective Focused Retrieval Techniques) project, grant # 612.066.513. SIKS Dissertation Series No. 2011-28 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. Copyright c 2011 Rianne Kaptein Cover design by Roel Verhagen-Kaptein Printed and bound by Off Page Published by IR Publications, Amsterdam ISBN: 978-90-814485-7-4 Contents Acknowledgments ix 1 Introduction 1 1.1 Research Objective . .1 1.2 Research plan . .7 1.2.1 Adding Query Context . .7 1.2.2 Exploiting Structured Resources . .9 1.2.3 Summarising Search Results . 10 1.3 Methodology . 10 1.3.1 Test Collections . 10 1.3.2 Evaluation Measures . 13 1.4 Thesis Outline . 14 I Adding Query Context 17 2 Topical Context 21 2.1 Introduction . 21 2.2 Related Work . 24 2.3 Data . 29 2.4 Models . 31 2.4.1 Language Modelling . 31 2.4.2 Parsimonious Language Model . 34 2.4.3 Query Categorisation . 34 2.4.4 Retrieval . 36 2.5 Categorising Queries . 39 2.5.1 User Study Set-Up . 39 2.5.2 User Study Results . 40 2.5.3 Discussion . 45 v 2.6 Retrieval using Topical Feedback . 46 2.6.1 Experimental Set-Up . 47 2.6.2 Experimental Results . 47 2.7 Conclusion . 52 II Exploiting Structured Resources 55 3 Exploiting the Structure of Wikipedia 59 3.1 Introduction . 59 3.2 Related Work . 63 3.3 Data . 65 3.4 Entity Ranking vs. Ad Hoc Retrieval . 69 3.4.1 Relevance Assessments . 70 3.5 Retrieval Model . 72 3.5.1 Exploiting Category Information . 72 3.5.2 Exploiting Link Information . 73 3.5.3 Combining information . 74 3.5.4 Target Category Assignment . 75 3.6 Experiments . 76 3.6.1 Experimental Set-up . 76 3.6.2 Entity Ranking Results . 77 3.6.3 Ad Hoc Retrieval Results . 80 3.6.4 Manual vs. Automatic Category Assignment . 82 3.6.5 Comparison to Other Approaches . 85 3.7 Conclusion . 87 4 Wikipedia as a Pivot for Entity Ranking 89 4.1 Introduction . 89 4.2 Related Work . 92 4.3 Using Wikipedia as a Pivot . 95 4.3.1 From Web to Wikipedia . 95 4.3.2 From Wikipedia to Web . 96 4.4 Entity Ranking on the Web . 99 4.4.1 Approach . 99 4.4.2 Experimental Setup . 100 4.4.3 Experimental Results . 101 4.5 Finding Entity Homepages . 106 4.5.1 Task and Test Collection . 107 4.5.2 Link Detection Approaches . 107 4.5.3 Link Detection Results . 108 4.6 Conclusion . 110 vi III Summarising Search Results 113 5 Language Models and Word Clouds 117 5.1 Introduction . 117 5.2 Related Work . 119 5.3 Models and Experiments . 123 5.3.1 Experimental Set-Up . 123 5.3.2 Baseline . 125 5.3.3 Clouds from Pseudo Relevant and Relevant Results . 125 5.3.4 Non-Stemmed and Conflated Stemmed Clouds . 127 5.3.5 Bigrams . 128 5.3.6 Term Weighting . 129 5.4 Word Clouds from Structured Data . 131 5.4.1 Data . 132 5.4.2 Word Cloud Generation . 132 5.4.3 Experiments . 136 5.5 Conclusion . 141 6 Word Clouds of Multiple Search Results 143 6.1 Introduction . 143 6.2 Related Work . 145 6.3 Word Cloud Generation . 147 6.3.1 Full-Text Clouds . 147 6.3.2 Query Biased Clouds . 148 6.3.3 Anchor Text Clouds . 150 6.4 Experiments . 151 6.4.1 Experimental Set-Up . 151 6.4.2 Experimental Results . 153 6.5 Conclusion . 157 7 Conclusion 159 7.1 Summary . 159 7.2 Main Findings and Future Work . 164 Bibliography 170 Samenvatting 189 Abstract 191 vii Acknowledgments First of all I would like to thank the person who, after me, contributed most to the work described in this thesis, my advisor Jaap Kamps. Jaap has given me the freedom to work on my own, but was always there when I needed advice. I thank my promotor John Mackenzie Owen, and all the members of my thesis committee: Djoerd Hiemstra, Franciska de Jong, Maarten Marx, Maarten de Rijke, and Arjen de Vries. I would like to thank Djoerd in particular, for giving me many detailed comments on my thesis, and for the positive feedback on my work during the four years of my PhD. I thank Gabriella Kazai, Bodo von Billerbeck and Filip Radlinski, I had a really good time working with you during my internship at Microsoft Research Cambridge. Also thanks to Vinay and Milad for the entertainment during and outside working hours. Thanks to my roommates at the office, Nisa, Marijn, Junte, Avi, Nir and Frans, for the sometimes much needed distraction. Thanks to all the IR colleagues I met at conferences and workshops, probably the favorite part of my PhD, for many good times. Having a deadline for a conference in some exotic place was always good motivation. Finally, thanks to my family and friends for being impressed sometimes without even knowing what I do exactly, and for always being there. ix Chapter 1 Introduction 1.1 Research Objective Information retrieval (IR) deals with the representation, storage, organisation of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, and multimedia objects (Baeza-Yates and Ribeiro-Neto, 2011). Many universities and public libraries use IR systems to provide access to books, journals and other documents, but Web search engines are by far the most popular and heavily used IR applications. Let's try to find a particular piece of information using a Web search engine. The search process, depicted in Figure 1.1, starts with a user looking to fulfil an information need, which can vary in complexity. In the simplest case the user wants to go to a particular site that he has in mind, either because he visited it in the past or because he assumes that such a site exists (Broder, 2002). An example of such a navigational information need is: I want to find the homepage of the Simpsons. In more complex cases the user will be looking for some information assumed to be present on one or more Web pages, for example: A friend of mine told me that there are a lot of cultural references in the `Simpsons' cartoon, whereas I was thinking that it was `just' a cartoon like every other cartoon. I'd thus like to know what kind of references can be found in Simpsons episodes (references to movies, tv shows, literature, music, etc.)1 The next step in the search process is to translate the information need into a query, which can be easily processed by the search engine. In its most common form, this translation yields a set of keywords which summarises the information 1This is INEX ad hoc topic 464 (Fuhr et al., 2008), see Section 1.3.1. 1 2 CHAPTER 1 INTRODUCTION Figure 1.1: Main components of the search process, adaptation of the classic IR model of Broder (2002). need. For our first simple information need formulating a query is also simple, i.e., the keyword query `the simpsons' is a good translation of the information need. For our second, more complex information need also formulating the keyword query becomes a more complex task for the user. A possible keyword query is `simpsons references'. Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the information need of the user. For our first simple information need, there is only one relevant result: the homepage of the Simpsons, that is http://www.thesimpsons.com. When the keyword query `the simpsons' is entered into Web search engines Google2 and Bing3, both these search engines will return the homepage of the Simpsons as their first result, thereby satisfying the user information need. Continuing with our more complex information need, entering the keyword query `simpsons references' into Google and Bing, leads to the results as shown 2http://www.google.com/ 3http://www.bing.com/ 1.1 RESEARCH OBJECTIVE 3 in Figure 1.2. The results of the two searches look similar. The search engines return ranked list of results. Each result consists of the title of the Web page, a short snippet of text extracted from the page, and the URL.