EVALUATION OF LINKAGE-BASED WEB DISCOVERY SYSTEMS

by Cathal G. Gurrin

A dissertation submitted in partial fulfillment of the requirements for the Ph.D. degree

Supervisor: Prof. Alan F. Smeaton

School of Computer Applications
Dublin City University
September 2002

Declaration

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy in Computer Applications, is entirely my own work and has not been taken from the work of others save where referenced.

Date:

Acknowledgements

Firstly, I’d like to thank my supervisor, Alan Smeaton, for his assistance and guidance over the course of my four years working on this dissertation. I was fortunate to have worked with Alan and had him as my supervisor and would likely not have got this far without his guidance and unfailing enthusiasm.

Secondly, I’d like to thank friends in the postgrad lab. Although people come and go, there are a number that I must thank: Jer, Paul, Jiamin, Hyowon, Derek, Aidan, but especially Tom and Kieran for discussion and advice. I would also like to thank the other postgrads, some of whom are here no longer, who, if nothing else, helped me to keep a grasp on reality: Jer, Chriostai, Stephen, Qamir, Saba, and the girls: Aoife, Mairead, Mary, Michelle and Nano.

Of course, I must thank both Martin and Amit, whom I had the pleasure of working with at AT&T. My time spent there instilled me with confidence for the rest of my research. I do appreciate the opportunity I was given. Thanks guys!

I would also like to express my deepest appreciation to my family, especially my mother, who was always there when I needed her, especially when things were going wrong. Mother, at last you can rest assured that I have finished college!

Finally, I must acknowledge the significant input of Jean, my girlfriend.
I would like to thank her for her patience and understanding when I spent all my time either in the university or reflecting about my work at home. She is the only person that can truly appreciate the effort required to complete this dissertation.

Of course, I could never have completed this work without funding and I would like, in particular, to acknowledge the following:

• Enterprise Ireland Informatics Directorate for part-funding my studentship.
• School of Computer Applications for part-funding my studentship and for employment for the last year.
• Hewlett-Packard Europe for support through a gratefully received equipment donation. In retrospect, the equipment was essential for the completion of the work in this dissertation.
• BCS for twice funding my attendance at IRSG and CEPIS-IR for funding my attendance at ESSIR 2000.

Abstract

In recent years, the widespread use of the WWW has brought information retrieval systems into the homes of many millions of people. Today, we have access to many billions of documents (web pages) and have (free-of-charge) access to powerful, fast and highly efficient search facilities over these documents provided by search engines such as Google. The "first generation" of web search engines addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents, but they did not innovate much in the approaches taken to searching. Recently, however, linkage analysis has been incorporated into search engine ranking strategies. Anecdotally, linkage analysis appears to have improved the retrieval effectiveness of web search, yet, surprisingly, there is little scientific evidence in support of the claims for better quality retrieval.
Participants in the three most recent TREC conferences (1999, 2000 and 2001) have been invited to perform benchmarking of information retrieval systems on web data and have had the option of using linkage information as part of their retrieval strategies. The general consensus from the experiments of these participants is that linkage information has not yet been successfully incorporated into conventional retrieval strategies.

In this thesis, we present our research into the field of linkage-based retrieval of web documents. We illustrate that (moderate) improvements in retrieval performance are possible if the underlying test collection contains a higher link density than the test collections used in the three most recent TREC conferences. We examine the linkage structure of live data from the WWW and, coupled with our findings from crawling sections of the WWW, we present a list of five requirements for a test collection which is to faithfully support experiments into linkage-based retrieval of documents from the WWW. We also present some of our own, new, variants on linkage-based web retrieval and evaluate their performance in comparison to the approaches of others.

TABLE OF CONTENTS

1.1 Introduction to Information Retrieval
1.2 What is Information Retrieval?
1.3 Components of IR Systems
    1.3.1 Steps in Performing Information Retrieval
        1.3.1.1 Document Gathering
        1.3.1.2 Document Indexing
        1.3.1.3 Searching
        1.3.1.4 Document Management
1.4 Approaches to Automatic IR
1.5 Index Term Weighting techniques
    1.5.1 The Boolean Model of IR
    1.5.2 The Vector Space Model
    1.5.3 The Probabilistic Model
1.6 Other Issues in Information Retrieval
    1.6.1 Relevance Feedback
    1.6.2 Query Expansion
1.7 The Architecture of a simple Information Retrieval System
    1.7.1 The Process
    1.7.2 The Inverted index
1.8 Searching the Web
    1.8.1 Web Directories
    1.8.2 Search Engines
1.9 Database Management Systems (DBMSs)
1.10 Objectives of the Research being Undertaken
1.11 Summary
2.1 Information Retrieval on the Web
    2.1.1 HyperText Markup Language
    2.1.2 The Challenges of WWW Search
    2.1.3 The Benefits of working with Web Data
    2.1.4 Architecture of a Basic Web Search Engine
2.2 Linkage Analysis as an augmentation to Web Search
    2.2.1 HyperLinks (Differentiating the WWW from a Conventional Document Collection)
        2.2.1.1 URLs
    2.2.2 Essential Terminology
    2.2.3 Extracting Meaning from Links to aid Linkage Analysis
        2.2.3.1 Impact Factors
        2.2.3.2 Not All Links are created Equal
2.3 Basic Connectivity Analysis Techniques
2.4 Building on the Basics of Linkage Analysis