An Intelligent Metasearch Engine for the World Wide Web
AN INTELLIGENT METASEARCH ENGINE FOR THE WORLD WIDE WEB

Andrew Agno

A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Computer Science
University of Toronto

Copyright © 2000 by Andrew Agno

National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats. The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

Machine learning and information retrieval techniques are applied to metasearch on the World Wide Web as a means of providing user-specific relevant documents in response to user queries. A metasearch agent works in conjunction with a user to provide daily sets of relevant documents. Users provide relevance feedback, which is incorporated into future results by a choice of machine learning algorithms. Using a fixed ranking method, the algorithms incorporating relevance feedback perform much better than those that do not. Furthermore, using heterogeneous information sources on the World Wide Web is shown to be effective in short and long term usage.

Acknowledgements

I would be much less proud of my work if it were not for the help of a number of people. I would like to first thank Grigoris Karakoulas and John Mylopoulos, my supervisors, for their guidance and support of my work. Without them, I would still be fishing for a perfect topic. Thank you also for making me look at questions instead of answers.

I would also like to thank the [illegible] group for their questions to my presentation of this thesis, which helped me focus on what questions other people would be interested in, upon seeing my work.

Some of the work in the implementation of my project used other people's software. In particular, I would like to thank Steven Brandt for the package com.stevesoft.pat, Doug Lea for util.concurrent, and Brian Chambers for his previous work in word stemming and document vectorization.

Last, but certainly not least, I would like to thank my [illegible], for coming to Toronto and staying with me these last two years.

Contents

3 Architecture
  3.1 Overall Architecture
  3.2 Global Data Structures
    3.2.1 Zipf's Law
    3.2.2 Stopword List
    3.2.3 Stemming
  3.3 Scalability
    3.3.1 Caching
  3.4 Term Weighting
  3.5 Topics
4 Experimental Results and Evaluation
  4.1 Description of Data Gathering Procedure
  4.2 Evaluation Framework
  4.3 TREC Measures
    4.3.1 The F3 Measure
    4.3.2 The T9U Measure
    4.3.3 The T9P Measure
  4.4 Precision of Learning Algorithms
    4.4.1 Continuous Learning vs Train/Test
  4.5 Daily Recall
  4.6 Spikes
    4.6.1 Data Gathering Gaps
    4.6.2 Flushing
  4.7 Individual Search Engine Recall
5 Conclusions and Future Directions
  5.1 Conclusions and Discussion
  5.2 Future Directions
    5.2.1 Implicit Ranking
    5.2.2 Ontologies
    5.2.3 Collaboration
    5.2.4 Alternate Document or Feature Space
    5.2.5 Thresholds
    5.2.6 Alternative Methods of Learning
    5.2.7 Miscellaneous Improvements and Directions
Bibliography

List of Figures

3.1 Architecture Diagram
4.1 Daily Document Counts
4.2 Precision for Plain Algorithm
4.3 Precision for Random Algorithm
4.4 F3 Measure
4.5 F3 Measure, Top 5
4.6 F3 Measure, Top 10
4.7 T9U Measure
4.8 T9U Measure, Top 5
4.9 T9U Measure, Top 10
4.10 T9P Measure
4.11 T9P Measure
4.12 T9P Measure
4.13 Running Average of Precision, All Topics
4.14 Precision Running Average, Various Topics, Rocchio & Grigoris
4.15 Running Average Precision, Student, Grigoris
4.16 Running Average Precision, All Topics, Continuous Training vs Train/Test
4.17 Running Average Precision, Various Topics, Continuous Training vs Train/Test
4.18 Daily Recall for Rocchio Algorithm
4.19 Daily Recall for Grigoris Algorithm
4.20 Daily Precision, All Topics, Top 10
4.21 Daily Precision, Various Topics, Top 10
4.22 Daily Precision on Students Topic, Top 30
4.23 Search Engine Recall, All Topics, Running Average
4.24 Search Engine Recall, All Topics, Running Average
4.25 Running Average of Recall of Yahoo on Palm Pilot
4.26 Running Average of Recall of Lycos on Student
4.27 Running Average Recall, MS/DOJ

Chapter 1

Introduction

1.1 Problem and Motivation

Finding information on the World Wide Web (WWW) can be difficult without some form of assistance. As estimated by Lawrence and Giles [LG99] in 1999, there were 800 million pages, an increase of 250% from their previous estimate in their 1998 study [LG98c]. Cyveillance [MM00] claims there are 2.1 billion unique and publicly available pages on the Internet. Given the size and the growth of the WWW, one can see that we need tools to help us find information. One would typically turn to a search engine, like Yahoo! [Yah] or Google [Goo]. Unfortunately, the most frequently used search engines [Sta00, Sul00a] do not always do an adequate job, due to their lack of coverage [LG99, LG98c] and their lack of ability to find the relevant documents in those that are covered.

One potential remedy is to enable searches with more "intelligence". Given a search engine, it may be imbued with "intelligence" in at least two ways: through the use of specialized, larger, or simply different information sources; or through the implementation of various machine learning or information filtering algorithms.
The purpose of this work is to create an intelligent search engine by combining both of these approaches into a single search engine, drawing from machine learning and information retrieval techniques.

The remainder of this chapter will deal with an overview of a particular technique for searching through heterogeneous information sources, called metasearch. The chapter also includes an explanation of various information retrieval and machine learning techniques that have been used in other work, the contributions of this work, as well as a layout of the remainder of this thesis.

1.2 Metasearch on the World Wide Web

One commonly used method for adding intelligence, even among search engines not normally thought of as such [Sul00c], is to use metasearch. For the purposes of this research, metasearch will refer to metasearch on the WWW.

1.2.1 A simple anatomy of a search engine

In the following discussion, it helps to have an idea of how search engines typically work. The portion of a search engine that the user sees is only one aspect of the entire system that makes up an engine. All the results given by a search engine come from some form of a database underlying the engine, which may list documents and information about those documents. This database is populated by certain software agents that visit WWW servers and index the documents that those servers contain. In the Harvest system [BDW95], these agents are called Gatherers, whereas in Google [BP98b], they are called crawlers. The purpose of these crawlers is to allow the collection of information about documents, also known as the indexing of documents, to proceed independently of any search that may be using the information. These crawlers must revisit