Zlatic Et Al., 2006)
Total Page:16
File Type:pdf, Size:1020Kb
The Meaning of Structure the Value of Link Evidence for Information Retrieval ILLC Dissertation Series DS-2011-03 For further information about ILLC-publications, please contact Institute for Logic, Language and Computation Universiteit van Amsterdam Science Park 904 1098 XH Amsterdam phone: +31-20-525 6051 fax: +31-20-525 5206 e-mail: [email protected] homepage: http://www.illc.uva.nl/ The Meaning of Structure the Value of Link Evidence for Information Retrieval Academisch Proefschrift ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof.dr. D.C. van den Boom ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op vrijdag 15 april 2011, te 14.00 uur door Marinus Henricus Aloysius Koolen geboren te Hoorn. Promotiecommissie: Promotor: Prof.dr. J.S. MacKenzie-Owen Co-promotor: Dr.ir. J. Kamps Overige commissieleden: Dr. N.E. Craswell Prof.dr. M. Lalmas Dr. M.J. Marx Prof.dr. M. de Rijke Prof.dr.ir. A.P. de Vries Faculteit der Geestewetenschappen Universiteit van Amsterdam The work in this thesis has been funded by the Netherlands Organization for Scientific Research (NWO) in the project Multipe- collection Searching Using Metadata (Mu- SeUM), grant number 640.001.501, which is part of the Continuous Access To Cultural Heritage (CATCH) research programme. siks Dissertation Series 2011-15 The research reported in this thesis has been carried out under the auspices of siks, the Dutch Research School for Information and Knowledge Systems. Copyright © 2011 by Marijn Koolen Cover design by Donald Roos. Printed and bound by Ipskamp Drukkers B.V. Published by IR Publications, Amsterdam ISBN: 978-90-814485-5-0 ACKNOWLEDGMENTS I start this book with thanking people who helped me in the past five years in starting and finishing this thesis. First and foremost, thank you Jaap, for all your advice, help, inspir- ation and enthusiasm through the years. It is incredible how often I managed to lose track of your great advisory one-liners and get stuck in trivial matters. It is equally incredible how you kept your calm every time I did, and helped my get back to the more interesting avenues. I’m also very grateful you offered me a Ph.D. position and not the position for scientific programmer that you brought to my attention several times. Given my programming skills, I think you made the right choice. I thank John MacKenzie-Owens and my thesis committee for reading and approving my thesis and giving encouraging and helpful comments and suggestions. I’m grateful to the inex community for providing a friendly surroundings in which I could safely make my first scientific blunders, and for creating the resources on which a large part of this research is based. Thanks to Avi for inspiring discussions about theory and modelling in IR, pointing out many flaws in my mathematical work and providing the right material to mend it. Thanks to Tikitu for explaining the importance of typography and making me care about it, and helping me make this thesis look much better than my initial, uninformed and messy attempt. Thanks to Donald for designing the cover of this thesis and for further explaining the importance and beauty of typography. I am sure there are still plenty of typographic horrors to be found in here, for which I take full responsibility. I look forward to further education. Frans, I’ve thoroughly enjoyed the countless useful discussions on ir, speech segmentation of infants, writing papers, writing a thesis, doing science, and I even vaguely recall some discussions on music. Also, thanks for proofreading the Dutch summary and giving tips, which greatly improved it. Anna, thank you for very many things, including proofreading this thesis in exchange for taking you out for dinner a few times. I owe you plenty more. v ABSTRACT How can search engines use the hyperlinks between documents to determine which documents are the most relevant for a search query? Some search engines use links to determine popularity, where the underlying idea is that the number of links pointing to a document (Web page) is a measure of its popularity. Search results are ordered or ranked based on their popularity and on similarity between the content of the document and the content of the search query. Another aspect of links is they provide a signal that two documents have related content. After all, a link is a reference. If a document A is relevant for a search query, then documents linked to A are possibly also relevant. Link information could possibly contain evidence for the topical relevance of a document. This thesis describes an investigation of the value of those two aspects of link information—popularity and topical relevance—for ranking search results. This question has been addressed before in the field of information retrieval. Starting in the late nineties, researchers con- ducted large scale experiments to see if link information could help search engines to find as many relevant documents on a search topic as possible, and rank these documents as well as possible. The results were disappointing. No consistent improvements by incorporating link information in the search process could be reported. Representatives of search engine companies argued that users of Web search engines are not looking for as many relevant documents on a topic as possible, but for the home pages of specific Web sites and pages that provide a good starting point to further explore Web pages on a certain topic. The researchers changed their attention to these Web-centric search tasks, with immediate success. The home pages and other important pages that Web searchers are looking for tend to be popular pages, which can be more easily identified by using link information. The value of link information for information retrieval seemed clear: link information is useful for measuring popularity, but not for measuring the topical relevance of documents. The question of why links are not useful for measuring topical rel- evance was never answered. The goal of this thesis is to give a more precise and complete account of the value of link evidence for inform- ation retrieval. This is first investigated using the English Wikipedia, vii because it is obtainable in its entirety, including all the links between the Wikipedia articles. Because it is an encyclopedia, Wikipedia is a natural source for users to search for articles relevant to a certain topic, which makes it an appropriate starting point to measure link evidence for topical relevance. The findings are then validated on a much larger corpus of Web pages, to find out how they generalise to searching on the Web. Evidence for popularity can best be measured by using all the links on the whole Web, and counting how many point to each Web page. We call this global link information, derived from the global link structure. For popularity, the direction of the link is important; a link from A to B makes B popular, but not A. The page with the most incoming links is the most popular. The order in which pages are ranked is determined partly by their popularity and partly by how well their content matches that of the search query. Evidence for topical relevance can be derived from link information by first finding a list of Web pages in which the search terms frequently occur, and then using only the links between those pages. This way, links are selected based on a topic: the topic of the search query. This list of pages is called the set of local pages, and the links between those pages are called local links. For topical relevance, the direction of the link is not important; if page A discusses the same topic(s) as page B, then page B discusses the same topic(s) as page A. Search terms that occur in page A form text evidence that A is topically relevant to the search query. If the search terms occur frequently in page A as well as in page B, then a link from A to B is a signal that the text evidence for page A is also evidence for the topical relevance of B. Vice versa, the text evidence for B is also evidence for the topical relevance of A. The number of local links between A and other local pages represents the amount of evidence for A. More links between A and other local pages means more evidence for the topical relevance of A. The page with the most local links is considered the most relevant. Although links are considered to be a signal that two linked pages have topically related content, this relation is not the same for all pairs of linked pages. To measure how strong the topical relation between two pages is, we use the category information in Wikipedia. In Wikipedia, the value of links for measuring topical relevance is dependent on the relation between the pages they link. A link between pages about the same topic is more effective evidence for topical relevance of those pages than a link between pages about unrelated topics. This finding viii confirms that, in Wikipedia, link information can be used as evidence for the topical relevance of a page. From these findings, we draw a number of conclusions. Global link information is independent of the search query and provides evidence for popularity, but not for topical relevance. Local link information is dependent on the search query and can provide evidence for topical relevance. For global link information, the direction determines its meaning.