AN INTELLIGENT METASEARCH ENGINE FOR THE WORLD WIDE WEB

Andrew Agno

A thesis submitted in conformity with the requirements for the degree of Master of Science, Graduate Department of Computer Science, University of Toronto

Copyright © 2000 by Andrew Agno

Abstract

An Intelligent Metasearch Engine for the World Wide Web

Andrew Agno

Master of Science

Graduate Department of Computer Science

University of Toronto

2000

Machine learning and information retrieval techniques are applied to metasearch on the World Wide Web as a means of providing user-specific relevant documents in response to user queries. A metasearch agent works in conjunction with a user to provide daily sets of relevant documents. Users provide relevance feedback which is incorporated into future results by a choice of machine learning algorithms.

Using a fixed ranking method, the algorithms incorporating relevance feedback perform much better than those that do not. Furthermore, using heterogeneous information sources on the World Wide Web is shown to be effective in short and long term usage.

Acknowledgements

I would be much less proud of my work if it were not for the help of a number of people.

I would like to first thank Grigoris Karakoulas and John Mylopoulos, my supervisors, for their guidance and support of my work. Without them, I would still be fishing for a perfect topic. Thank you also for making me look at questions instead of answers.

I would also like to thank the … group for their questions at my presentation of this thesis, which helped me focus on what questions other people would be interested in, upon seeing my work.

Some of the work in the implementation of my project used other people's software. In particular, I would like to thank Steven Brandt for the package com.stevesoft.pat, Doug Lea for util.concurrent, and Brian Chambers for his previous work in word stemming and document vectorization.

Last, but certainly not least, I would like to thank my dear Jobie, for coming to Toronto and staying with me these last two years.

Contents

3 Architecture
3.1 Overall Architecture
3.2 Global Data Structures
3.2.1 Zipf's Law
3.2.2 Stopword List
3.2.3 Stemming
3.3 Scalability
3.3.1 Caching
3.4 Term Weighting
3.5 Topics

4 Experimental Results and Evaluation
4.1 Description of Data Gathering Procedure
4.2 Evaluation Framework
4.3 TREC Measures
4.3.1 The F3 Measure
4.3.2 The T9U Measure
4.3.3 The T9P Measure
4.4 Precision of Learning Algorithms
4.4.1 Continuous Learning vs Train/Test
4.5 Daily Recall
4.6 Spikes
4.6.1 Data Gathering Gaps
4.6.2 Flushing
4.7 Individual Recall

5 Conclusions and Future Directions
5.1 Conclusions and Discussion
5.2 Future Directions
5.2.1 Implicit Ranking
5.2.2 Ontologies
5.2.3 Collaboration
5.2.4 Alternate Document or Feature Space
5.2.5 Thresholds
5.2.6 Alternative Methods of Learning
5.2.7 Miscellaneous Improvements and Directions

Bibliography

List of Figures

3.1 Architecture Diagram

4.1 Daily Document Counts
4.2 Precision for Plain Algorithm
4.3 Precision for Random Algorithm
4.4 F3 Measure
4.5 F3 Measure, Top 5
4.6 F3 Measure, Top 10
4.7 T9U Measure
4.8 T9U Measure, Top 5
4.9 T9U Measure, Top 10
4.10 T9P Measure
4.11 T9P Measure
4.12 T9P Measure
4.13 Running Average of Precision, All Topics
4.14 Precision Running Average, Various Topics, Rocchio & Grigoris
4.15 Running Average Precision, Student, Grigoris
4.16 Running Average Precision, All Topics, Continuous Training vs Train/Test
4.17 Running Average Precision, Various Topics, Continuous Training vs Train/Test
4.18 Daily Recall for Rocchio Algorithm
4.19 Daily Recall for Grigoris Algorithm
4.20 Daily Precision, All Topics, Top 10
4.21 Daily Precision, Various Topics, Top 10
4.22 Daily Precision on Students Topic, Top 30
4.23 Search Engine Recall, All Topics, Running Average
4.24 Search Engine Recall, All Topics, Running Average
4.25 Running Average of Recall of Yahoo on Palm Pilot
4.26 Running Average of Recall of … on Student
4.27 Running Average Recall, MS/DOJ

Chapter 1

Introduction

1.1 Problem and Motivation

Finding information on the World Wide Web (WWW) can be difficult without some form of assistance. As estimated by Lawrence and Giles [LG99] in 1999, there were 800 million pages, an increase of 250% from their previous estimate in their 1998 study [LG98c]. Cyveillance [MM00] claims there are 2.1 billion unique and publicly available pages on the Internet. Given the size and the growth of the WWW, one can see that we need tools to help us find information. One would typically turn to a search engine, like Yahoo! [Yah] or Google [Goo]. Unfortunately, the most frequently used search engines [Sta00, Sul00a] do not always do an adequate job, due to their lack of coverage [LG99, LG98c] and their lack of ability to find the relevant documents in those that are covered.

One potential remedy is to enable searches with more "intelligence". Given a search engine, it may be imbued with "intelligence" in at least two ways: through the use of specialized, larger, or simply different information sources; or through the implementation of various machine learning or information filtering algorithms. The purpose of this work is to create an intelligent search engine by combining both of these approaches into a single search engine, drawing from machine learning and information retrieval techniques. The remainder of this chapter will deal with an overview of a particular technique for searching through heterogeneous information sources, called metasearch. The chapter also includes an explanation of various information retrieval and machine learning techniques that have been used in other work, the contributions of this work, as well as a layout of the remainder of this thesis.

1.2 Metasearch on the World Wide Web

One commonly used method for adding intelligence, even among search engines not normally thought of as such [Sul00c], is to use metasearch. For the purposes of this research, metasearch will refer to metasearch on the WWW.

1.2.1 A simple anatomy of a search engine

In the following discussion, it helps to have an idea of how search engines typically work.

The portion of a search engine that the user sees is only one aspect of the entire system that makes up an engine. All the results given by a search engine come from some form of a database underlying the engine, which may list documents and information about those documents. This database is populated by certain software agents that visit WWW servers and index the documents that those servers contain. In the Harvest system [BDW95], these agents are called Gatherers, whereas in [BP98b], they are called crawlers. The purpose of these crawlers is to allow the collection of information about documents, also known as the indexing of documents, to proceed independently of any search that may be using the information. These crawlers must revisit documents periodically, to reindex them. This is because documents on the WWW have a tendency to change or even disappear over time. Reindexing must happen at the same time that new documents are being indexed. When a user gives a query to a search engine, the engine uses its database of rankings and documents to generate a new document that consists of a list of pointers, known as Universal Resource Locators (URLs), to documents that have been ranked as most likely to fulfill the user's information need. The exact ranking depends on the search engine, and the ranking scheme is typically proprietary. However, search engines developed in academia do publish their ranking methods. For instance, Google uses something called a PageRank, which is defined by Brin and Page [BP98b]. This description of a search engine may also be applied to metasearch engines, with some changes.

1.2.2 Metasearch and how it helps

The idea behind metasearch is to use multiple "helper" search engines to do the search, then to combine the results from these engines. Engines that use metasearch include Metacrawler, SavvySearch, MSN Search and Altavista, among others [Sul00c]. These helper engines form the metasearch engine's database. This approach differs from the individual search engines in that a metasearch engine does not need to crawl the WWW, although it may do so. A metasearch engine for the WWW may just verify that the documents returned by the search engines still exist. The problem of combining the results can be solved simply or by using a little cleverness. The simplistic solution has been used in search engines from Metacrawler to Dogpile. Metacrawler has changed in recent years and no longer appears to employ this simple method, but Dogpile continues to use it. The solution is to put results from separate engines under separate headings. This solution can still be seen in some engines that employ both a manually-maintained directory (in the style of Yahoo!) and a more traditional crawler-generated index of WWW documents (as in …). It is also seen in some of the less well known metasearch engines, such as SherlockHound [She] and All4one [All]. This approach has a number of problems:

• The individual search engines are treated equally, despite the fact that their results might not be equally relevant to the search at hand. This implies two things.

  - The only indication of relative importance of the results is within each search engine, not across search engines. This means that the user cannot judge which documents will be most relevant of all the results returned.

  - The number of results is typically restricted, with each search engine given a quota of documents that it can fill. This quota is the same for each search engine, so if search engine A returns documents that are more relevant than the results from search engine B, then the m documents from search engine A that are better than the ones from B will not be shown to the user. Instead, all the results from search engine B will be shown.

• If a user searches once, then attempts the same search again, the results do not change, even though the user may have given some implicit information about documents that might be relevant. The user may have selected some documents to be viewed, and by doing so, may have given some information about potential relevance.

The other, more sophisticated, solution is to rerank the documents that the helper search engines returned, either by downloading the documents and analyzing them, or by using some existing preprocessed version of the document, such as the summary that occasionally accompanies the results from search engines. This approach is used in SavvySearch [HD97, DH96, Sav], in search engines from the NEC Research Institute [LG98a, LG98b, GLG+99], and, presumably, in the new version of Metacrawler. This alleviates the problem of treating the search engines equally, but the other problem remains.

As can be seen from the listing of alliances at Search Engine Watch [Sul00c], metasearch, even in its simple form, is a popular method of searching. The reasons for this include the user's desire to try multiple search engines, the user's lack of knowledge about existing search engines [GLG+99], and the user's unwillingness to visit multiple search engines. One other reason to use metasearch is that while most WWW users will try another search engine when they cannot find what they are looking for, 20% of them give up [Sul00d, Sul00b]. Using metasearch allows the user to view results from many search engines simultaneously, which may allow at least one search engine to give back relevant results before the user gives up.

Fortunately for those engines that do employ some form of metasearch, there is evidence for its efficacy. The reason that metasearch can be effective is that there are problems with plain search:

1. Individual search engines have poor indexing and reindexing times [LG99]. Reindexing time refers to the time between successive visits to a document by a search engine's crawler. This is a problem, because if a search engine's crawler does not reindex a frequently changing document, it may not be able to order URLs according to their actual data. Poor reindexing times also lead to the prevalence of dead links [LG99] in the results given by a search engine. These are pointers to documents that no longer exist, are no longer published, or are no longer accessible.

2. Ranking schemes may be poor or inconsistent [Kee92, LG98a]. In a talk by Christian Collberg [Col00], given at the University of Toronto, he claimed that one of the methods that Altavista used to rank documents was by age of the document. This was meant to reduce the propensity of artificially relevant documents (also known as "spam"). This leads to old documents being listed first, which is not usually the best heuristic for ranking relevant documents. The general problem of poor ranking schemes is exacerbated by users' tendencies to use short, one or two word queries [Sul00b, LG98a]. A vague query combined with a poor ranking scheme by a search engine can lead to frustration. In fact, the ranking scheme need not be poor overall, just poor on the topic that the user has in mind.

3. Individual search engines may have poor coverage due to specialization [DH96, HD97, BBC98, Leb97] or may just not have the resources to have wide coverage of the entire WWW. This means that no one search engine has a full index of the WWW. Also, since the WWW is growing exponentially [LG99], so do the computing and storage needs of search engines that maintain their own index.

4. Related to the previous point, the document crawlers that are listed for Inktomi-style search engines may not be able to search the entire WWW because of its connectedness properties: as many as half the documents are not reachable from the "core" web [But00], which is presumably where most crawlers spend their time.

Metasearch can alleviate some of these problems:

1. Using multiple search engines should give overlaps in terms of the index times, translating into a shorter mean time between reindexing [LG99].

2. A metasearch engine can use its own ranking scheme, independent of the rankings of the individual search engines [DH96, HD97, LG98a]. It can also use ranking schemes that analyze the individual pages, just as individual search engine crawlers do. The metasearch engine can do this faster, since it has a smaller set of documents to analyze. Combined with document caching [BCF+98, ZFJ97], this can result in a fast metasearch using a custom, document based ranking that analyzes documents to a greater degree than the individual search engines used. Note that a problem may still exist since a metasearch engine must do the analysis in real time, instead of offline, as the crawlers do. Lawrence and Giles [LG98a] do show that it is possible to do this analysis in real time, without creating long waiting times for the user.

3. Coverage of a metasearch engine is greater because it uses multiple sources of information [LG98c, LG99], and a metasearch engine can use specialty search engines without having them adversely affect the results due to few results, or due to a slow network connection to the specialty engine. If the number of documents on the WWW continues to grow exponentially, then coverage may eventually be a problem even when using numerous search engines.

4. The last item cannot be fixed by using metasearch engines. Instead, individual search engines need to do random searching among IP addresses and ports. While the metasearch engine could do this itself, it removes one of its advantages from the perspective of resource usage.

Recent studies by Lawrence and Giles [LG99, LG98c] give more credence to the potential of metasearch. The papers show that individual search engine coverage is poor, covering no more than 16% of the WWW and as little as 2.2%. However, the overlap between search engines is also low, with a range of […] in 1998, to a range of [2.2%, 38.3%] in 1999. This bodes well for metasearch, despite the fact that the combined coverage appears to be decreasing over time. An estimate of the combined coverage in 1999, using 11 search engines, was only approximately 40% of the estimated size of the WWW being covered. In 1998, with only 6 search engines, the estimated coverage was 59%. Given this, metasearch ought to be effective, as long as the individual engines have access to relevant documents and can return them to the metasearch engine.

Metasearch is not without problems, however. For instance, metasearch can create increased network traffic, both on the global Internet and on the local network, especially for engines that perform their own ranking by downloading the actual documents that are found via the individual search engines [DH96, HD97]. The structure of the WWW [But00] is also a problem, because not all documents can be indexed by the engines used by a metasearch engine. The first problem can be alleviated by using a network bandwidth sensitive ranking scheme [DH96, HD97, MLY+99]. The second problem needs to be solved by having the WWW crawlers probe random IP addresses at port 80 (HTTP) and possibly other common or uncommon ports.

1.3 Intelligent Agents

Besides metasearch, a search engine can also turn to other algorithms for inspiration. The other source of intelligence can be obtained by looking at intelligent agents. These are software programs that undertake tasks on behalf of a user or users. Agents have been used to aid in sorting email [Mla99, Boo98], and for searching and "surfing" the WWW [Mla99, CS98, PB97, PB99, Jan97, BLG98]. The latter, the Web agents, can be categorized into ones that perform automatically without user assistance, and those that are designed to work in conjunction with the user in order to enhance the user. Those that work without feedback from the user, such as CiteSeer [BLG98] and WebMate [CS98], among others [Jan97], can be used as a search agent, where the program may learn user habits and preferences merely through observation, and thus alter search queries, or offer documents. Alternatively, as in CiteSeer, the program may still rely on user generated queries, but the user is assumed to want a specific type of information and the agent uses a specially built database to answer queries.

The other agents, those that work to enhance a user, or otherwise require user intervention or feedback, are of particular interest to this thesis. There have been a number of these systems made for viewing websites [PB99, PB97, SH99, Pol99, BSY95, BS95]. They typically require that the user give some feedback about the WWW documents that have been recommended. The agents adjust their future responses to take this new information into account. This data is stored in a profile of the user. This profile typically consists of a set of features that have some weight associated with them. Features include some form of a list of keywords [Mla99], although other possibilities exist. For example, agents may form part of a collaborative system, where a profile may simply consist of a list of documents that the user has ranked and those rankings. As Glover et al. [GLG+99] suggest, other features of a document may also be used to construct a profile. This could include preferences for length of documents, reading level, languages, age, images and URL:text ratio. The exact features used in the profile can be determined experimentally. Whatever the exact nature of the profile, it is used to give improved results to the user. However, none of the aforementioned agents has applied this learning from user feedback to the problem of metasearch.

The individual systems mentioned above vary in their learning algorithms, term weightings, feature space and intended use. For instance, Pazzani's Syskill and Webert [PB97] uses a Bayesian classifier to predict the probability of a user liking a document, whereas Balabanovic [BSY95, BS95] and Somlo [SH99] use similarity measures with respect to a profile. One interesting variation on searching the WWW is Pazzani's web site agent [PB99]. There, the agent learns only about people visiting a site, as well as the patterns of site navigation of all users and the patterns of the hypertext linkage structure to some degree. The linkage structure is merely the way in which documents refer to each other through hyperlinks. The narrow scope of that agent's responsibilities is not the purpose of this work. However, it does provide an interesting contrast in terms of usage. The system of [Pol99] demonstrates a collaborative recommender system in which user profiles are simply the relevance rankings of various documents. These profiles are compared between users, and for any given user, the system recommends documents that a similar user found relevant. None of the agents give the user access to a large number of information sources from which to gather relevant documents, using only a single search engine, or a single WWW site in each case.

1.3.1 Relevance Feedback

Relevance feedback is a term used in information retrieval to describe a special type of feedback loop. This feedback loop requires that the user of a system make judgements about the relevance of documents returned by the system. The system, in this case, is one that is designed to return documents to a user based on queries that the user gives to the system. The feedback obtained in this manner is used to determine the profile used in the next iteration of document retrieval. The exact nature of this feedback varies with different systems [Roc71, BBC98, CS98, BSY95, BS95, DH96, SB90, BSA94, Joa97] and may even be implicit [BBC98, …]. Rankings may be boolean, marking a document as relevant or not, or may take a range of values. Specifically, if some document, D, appears multiple times, it may be given rankings that actually depend on the relevance value of the documents that were listed alongside D at the time of ranking [BSY95]. If this is the case, the range of values may not give any more information than the boolean ranking system, and may in fact hamper learning due to the inconsistency of user rankings.

With implicit feedback, feedback is assigned by the system based on whether a document was viewed by the user and possibly based on how long a user viewed a document. The advantage of this implicit feedback is that, from the point of view of the user, the interface is transparent. The system can learn a user's preference without the user having to intervene. The problem with using implicit feedback is that the feedback is somewhat more difficult to interpret, as a visited document is not necessarily a relevant one. This causes potential inconsistencies in the rankings, which is why it is not implemented in the system described here.

One added complexity that may be introduced with relevance feedback is incremental feedback. In most of the previous work, feedback processing was done in a batch fashion. That is, the entire corpus, and its relevance rankings, were known before testing began, and the entire corpus (or some training subset) could be given to the feedback algorithm; the algorithm could have complete knowledge and could process the feedback in one pass. This is the context in which the original Rocchio algorithm for relevance feedback was created [Sal71]. Incremental feedback, on the other hand, requires only that a portion of the corpus be judged before applying changes to the profile. Additional portions may be judged and the profile changed as time progresses, with little or no knowledge of previous rankings. Fortunately, Rocchio and Ide [Ide71] style algorithms for relevance feedback (herein termed "traditional relevance feedback algorithms") may be applied in an incremental fashion [All96, Cal98, IJA92]. In fact, Allan found that "keeping a small number of terms can actually improve performance over full feedback ... almost any number of terms works well." Full feedback refers to feedback in which, at any time t, the system is given all ranked documents seen until time t. Harman [Har92] suggests the use of 20 terms in a full feedback environment. Allan shows that traditional relevance feedback style algorithms work well as long as some context, possibly a dynamically changing context, is maintained from previous iterations. This provides a method to perform online learning, instead of batch learning. This result is necessary in order to validate the use of relevance feedback as applied to metasearch.

Query Expansion

Typically, relevance feedback results in an altered query. This query is identical to the profile mentioned above (earlier in section 1.3). The purpose of the expanded query is to give a more precise query, based on previous user rankings, for the next iteration. This query would also be used to order the pages in terms of relevance, for the user to view. This approach works well in certain experiments. However, in the context of metasearch, this approach does not work at all. For instance, some search engines limit queries to only 10 terms, and some seem to ignore long queries. Those that do accept long queries do not return many results. Expanded queries typically use stemmed versions of words. These are words that have had their suffixes stripped from them. Any stemming in the query assumes that the search engines will understand the stemmed term or terms, which is not necessarily true. In fact, search engines like Google do not do stemming, and others allow stemming only as an advanced option; even with stemming as an option, however, search engines cannot be relied upon to understand stemmed terms given as part of the query. Even if the stemmed terms were expanded into the words that were originally seen to create the stemmed terms, the queries would be increased in length, contributing to the previous problem. This is different from most of the literature on searching a corpus of documents, but its omission here is supported by the profile generating intelligent agents (earlier in section 1.3).

1.3.2 Browsing and Searching

A user surfs the WWW. There are generally two different methods of using a search engine to do this surfing: browsing and searching. The difference between the two is identified by the user's interest. A user who wants detailed information on a specific topic is searching, whereas a user who wants an overview of a topic, or even multiple topics, is browsing. One may also distinguish the two by the user's conceptual model of a topic. If the user is looking for information on, for example, Palm Pilots¹, then this is probably browsing. A user interested in a specific subtopic, such as free productivity software for the Palm Pilot, is probably doing a search.

According to Search Engine Watch [Sul00d], 70% of people know specifically what they are looking for when they use a search engine. However, this does not mean that they can articulate this knowledge in the form of an appropriate query. Users typically have a specific meaning in mind when using a term in a query. This meaning is not necessarily the same one that search engines assign the word. For instance, the term "palm" might be found in a document on handheld computer organizers as well as a document on vacations. Furthermore, Butler [But00] says that most users only have a general query in mind. These two views, of a user that knows specifically what they are looking for and one who does not, are not easily reconcilable. The evidence suggests, however, that most users only enter a general query. This is supported by the fact that 30% of searches are done using only a single word in the query [Sul00b], according to Search Engine Watch. In another study done by Lawrence and Giles [LG98a], they found that almost half of user queries contained one term, and almost 80% of queries were one or two terms. While the query entered into search engines may be general, this author is inclined to believe that the user has a fairly clear idea of what they are looking for. They may not always be able to express this in the form of a query, but they can certainly identify documents that are and are not relevant. Any other view of a user means that they are entering random queries to the search engine.

¹All product names and company names mentioned in this document are the trademarks of their respective holders.

The work in this thesis assumes that the user does have a fairly good idea what she is looking for, and is able to identify these documents. This work also assumes that the user will only enter a general query. From there, the system will learn a more specific query that corresponds to the user's internalized and unasked query. This corresponds to the specific search aspect of surfing the WWW. It should be possible to use the system described here to perform the browsing aspect of surfing, but this was not one of the items investigated.

1.3.3 Our Approach

The system presented in this thesis is meant to address some of the shortcomings of other WWW search engines, and in particular, other metasearch engines for the WWW.

• We will use the more sophisticated of the metasearch techniques by using a unified ranking scheme across all search engines.

• We will allow the search engine to adapt to a user's preferences over time.

• The feedback from users will be evaluated on a daily basis, in an incremental fashion. Thus, there will be no strict tuning period to generate a correct profile, as in Somlo's engine [SH99].

• The system may be used in either a server based mode, or may be used as a single client on the user's machine.

• The search engine will generate a specific profile from a user formulated general query.

• User profiles will be generated which represent a prototypical document. The user profile will be used to determine similarity with documents that are retrieved by the search engine.

Usage Scenarios

The scenario in which the search engine described in this thesis was designed to be used is the following. A user has a query that they wish to make over a period of days or even weeks. The user may be looking for information on a topic she knows little about, or she may be looking for a certain type of information, but does not know how to specify the information as a search query. As an example of the former, suppose the user is looking for student life at the University of Toronto. She is not sure what this might entail, but is certain of what it does not, and hopes that the search engine will help to determine the scope of the topic. The other type of search is where a user knows what to look for, but enters only a vague query. For example, a user might be interested in looking for free productivity software for the Palm Pilot™, and might do so for three to five days; this user might only enter "palm pilot" as a query. As another example, a user might want to retrieve information about the most recent antitrust trial in the USA, from its beginning to the current events. In this case, the user might enter a query such as "microsoft doj." In both these queries, the topic is general enough to include documents pertaining to other, nondesirable events or topics. The metasearch engine should be able to detect user preferences and dynamically alter the manner in which it orders pages for user viewing.

1.3.4 Questions

There are several questions that testing of the engine will answer:

• Can learning a user profile increase the relevance of the results as returned by the metasearch engine?

• Are certain search engines better to use than others on certain topics?

• Is metasearch more effective than ordinary search, in the context of relevance feedback?

1.3.5 Contribution

The work presented here will provide evidence for the efficacy of an adaptive learner for the problem of metasearch on the WWW. It will provide evidence that search engines can be used for more than merely one time searches, but can also be used for the purposes of tracking a topic over time. We will confirm that metasearch is actually effective, and show that even combining search engines that use similar techniques is worthwhile. Furthermore, some reasons that metasearch is effective in the temporal context of this work will be shown.

1.4 The Rest of the Story

The remainder of this thesis will discuss the user's interaction with the system, delving into representations of the user's ideas and methods by which one can alter these representations. Chapter 3 will examine the architecture and implementation details of the engine, including remarks about the data structures used and the scalability of the system as a whole. Certain shortcomings of the system will also be shown here, as well as some possible ways to overcome them. The results of a variety of experiments will be shown in Chapter 4, in an attempt to answer the questions posed in section 1.3.4. Chapter 4 also presents some analysis of the data. Chapter 5 summarizes the findings, answering the questions given above, answering possible exceptions to the work done, and comments on extending this work.

Chapter 2

User Interaction and Query Processing

This chapter describes the user's interaction with the system, and the results of those interactions. It also describes representations of the user that the system maintains, and the methods through which these representations are maintained.

2.1 Profile and Document Representation

As in the other systems mentioned earlier in section 1.3, the system uses a profile to keep track of user preferences. In this thesis, a profile can be viewed as a prototype document, or perhaps a union of prototype documents. A document is represented by a set of weighted words and phrases (collectively called terms) coming from a document space [SM83]. This document space is a multidimensional space, having one axis per word or phrase that is accepted by the system. This space may either be known beforehand, or may grow over time, as new terms are encountered. Terms that are in the document space at time $t$ are called the dictionary at time $t$; this dictionary is static if the document space is already known. Given a document space, $D$, consisting of terms, $D = (t_1, t_2, \ldots, t_{|D|})$, a document is a vector, $V$, in this space with non-negative weights $W = (w_1, w_2, \ldots, w_{|D|})$, $w_i \geq 0$; $V = W \cdot D$. It is useful to have the sparse representation of the vector. That is, take $W' = (w'_1, w'_2, \ldots, w'_n)$, $w'_i > 0$, where $n$ is the number of positive weights, and $D' = (t'_1, t'_2, \ldots, t'_n)$, the terms with positive weights, and $V' = W' \cdot D'$.

For a document, $d$, $V_d$ indicates the presence or absence of terms in the document, and may indicate the significance of those terms in the document. The manner in which weights are assigned can vary between implementations of any system employing such a representation [Mla99, HC93], and may range from a simple boolean, to a real number representing the weight of the term. The exact method used in this thesis is described in Chapter 3.
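To make the sparse representation concrete, the following is a minimal sketch in Java, the language the engine is written in; the class name, its map-based storage, and its methods are illustrative assumptions for exposition, not the thesis's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** A sparse document vector: only terms with positive weight are stored. */
public class SparseVector {
    private final Map<String, Double> weights = new HashMap<>();

    /** Adds to the weight of a term, creating the entry if needed. */
    public void add(String term, double w) {
        weights.merge(term, w, Double::sum);
    }

    /** Weight of a term; absent terms implicitly have weight zero. */
    public double get(String term) {
        return weights.getOrDefault(term, 0.0);
    }

    /** Scales all weights so they sum to one, as the engine requires. */
    public void normalizeToUnitSum() {
        double sum = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        if (sum > 0) {
            weights.replaceAll((t, w) -> w / sum);
        }
    }

    public Map<String, Double> entries() {
        return weights;
    }
}
```

Terms absent from the map are treated as having weight zero, which keeps the representation proportional to the number of distinct terms in the document rather than the size of the whole dictionary.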

2.2 The User Query

The query sent to the individual search engines is always the original query that the user gives. This is partly due to the fact that query expansion does not work in the metasearch context, as explained in section 1.3.1. Also, given the assumption of a general query, this allows the individual search engines to retrieve many documents. Having many documents for the search engines to retrieve is important because of the repetitive nature of the query: the user will input the same query over a number of days, but only wants to see new or changed documents. A large pool of documents ensures that there will be many new documents, even if only a few of them change over time. Also, the purpose of this metasearch engine is not to learn the optimal ranking for a finite set of documents for a specific user. Rather, it should continually adjust to the user's preferences, and learn the best ranking for future documents that have not been seen. This can only be tested if the metasearch engine can examine only a portion of a large pool of documents at one time.

The user's query is espected to fit into some onrology. This i?; siniilar ro Glol-er et al": -information needs- [GLGi 991. and DiialX-KI'S -caregories" [XIHT99j. The user rnanitally selects an approprinre topic in u-hich to place the qtiery Tliese topics conir from some esis t ing ontology Current ly. t his onto1o;y esists only for r hose qiieries rhat

bas^ the categorization of qtieries. The use of this outology also allon-s rhe sysreni to jeneralize or specialize to a new profile. giwn the old onrs. Sonie adclicional tliotiglir would be required to determine esactly how r his ~otildhr done. as fiirttier detailecl in

Ckaprer 5. whiïli tlescrihes possible future work.

2.3 Ranking

Two different entities rank documents: the user, and the metasearch engine. The metasearch engine ranks documents it receives from the individual search engines. It does this so that it may present the user with a list of ranked document URLs. The ranking is done by making comparisons of documents to the current profile. Since the profile is merely a vector in the document space, the profile, $P$, and the document, $D$, may be compared by measuring the cosine of the angle between the two vectors:

$$\mathrm{sim}(P, D) = \frac{\sum_t P_t D_t}{\sqrt{\sum_t P_t^2}\,\sqrt{\sum_t D_t^2}}$$

If the document vectors are normalized, relative document rankings for a specific profile are preserved with:

$$\mathrm{sim}(P, D) = \sum_t P_t D_t$$

Other transformations may be made prior to the normalization. For instance, in the engine described here, all vector weights are first made to sum to one by dividing each vector by the sum of the weights. Thus, the actual similarity measure used is:

$$\mathrm{sim}(P, D) = \sum_t \frac{P_t}{\sum_u P_u} \cdot \frac{D_t}{\sum_u D_u}$$

where $P_t$ and $D_t$ are the weights for the term $t$ in $P$ and $D$, respectively.
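A minimal sketch of this similarity computation, building on the hypothetical SparseVector above; both vectors are assumed to have already been normalized to unit weight sum, so a plain dot product suffices.

```java
import java.util.Map;

public final class Similarity {
    /** Dot product of two sparse vectors whose weights each sum to one. */
    public static double similarity(SparseVector profile, SparseVector doc) {
        double sim = 0.0;
        // Terms absent from the document contribute zero, so iterating
        // over only the profile's terms is sufficient.
        for (Map.Entry<String, Double> e : profile.entries().entrySet()) {
            sim += e.getValue() * doc.get(e.getKey());
        }
        return sim;
    }
}
```

Because both vectors are sparse, the cost is proportional to the number of terms in the profile (at most 24 terms, per section 2.4) rather than the size of the dictionary.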

The highest ranked documents are then shown to the user as a list of URLs; the exact number returned depends on the algorithm used (see section 2.4). Once this is done, the user may give rankings. The user looks at the documents given by the metasearch engine and ranks them as either relevant or nonrelevant. The user's ranking is based on a set of criteria which depends on the topic. One important global rule was that documents that had been seen before must have changed in a relevant manner in order for the document to be marked relevant again. This criterion is important, because otherwise, a fixed set of documents would always be returned to the user. These documents would be those that were accessed from a database, or otherwise dynamically generated, documents that had their date changed on a daily basis but no other changes, and documents on servers that gave incorrect dates for the document's datestamp. This is because the page fetchers, as described in Chapter 3, fetch those documents whose datestamp or checksum has changed, if the document has been seen before. The exact number of documents ranked varies according to the learner that is in use and the threshold the learner uses for determining when to stop giving documents to the user. This number is a maximum of 30 documents.

2.4 Profile Adjustment

2.4.1 Relevance Ranking

The way in which profiles are altered depends on the system in use, but the methods are generally variations on Rocchio's algorithm [Roc71, SB90]: the profile at time $t+1$ can be obtained through a function $f$: $Q_{t+1} = f(Q_t, R_t, S_t)$, where $R_t$ and $S_t$ are the sets of relevant and nonrelevant documents, respectively. The actual algorithm used in Rocchio's original formulation [Roc71] was:

$$Q_{t+1} = Q_t + \frac{1}{|R_t|} \sum_{r \in R_t} r - \frac{1}{|S_t|} \sum_{s \in S_t} s$$

Generally, variations of Rocchio's algorithm are variations on this formula, using different weights for $Q_t$ and the two sums; for instance, Allan [All96] uses fixed constants (2 and …) instead of the inverse cardinalities of the sets as the weights for the two sums. Aalbersberg [IJA92] cites Salton and Buckley [SB90] and Ide [Ide71] for the general form:

$$Q_{t+1} = \alpha Q_t + \beta \sum_{r \in R_t} r - \gamma \sum_{s \in S_t} s$$

where $\alpha, \beta, \gamma$ are variables.
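As an illustration of this family of updates, here is a minimal Java sketch of the general form, again using the hypothetical SparseVector class; the parameterization and the final filtering step mirror the descriptions in this section, but this is not the thesis's exact code.

```java
import java.util.List;

public final class ProfileUpdater {
    /** One incremental Rocchio-style update of the profile vector. */
    public static SparseVector update(SparseVector profile,
                                      List<SparseVector> relevant,
                                      List<SparseVector> nonrelevant,
                                      double alpha, double beta, double gamma) {
        SparseVector next = new SparseVector();
        // alpha * Q_t: carry over the old profile, scaled.
        profile.entries().forEach((t, w) -> next.add(t, alpha * w));
        // + beta * sum over relevant document vectors.
        for (SparseVector r : relevant) {
            r.entries().forEach((t, w) -> next.add(t, beta * w));
        }
        // - gamma * sum over nonrelevant document vectors.
        for (SparseVector s : nonrelevant) {
            s.entries().forEach((t, w) -> next.add(t, -gamma * w));
        }
        // Keep only positively weighted terms, then renormalize,
        // as both learners described below do.
        next.entries().values().removeIf(w -> w <= 0);
        next.normalizeToUnitSum();
        return next;
    }
}
```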

For each separate query that the user enters, a profile is created. Initially, the terms in the profile are just the terms in the user query. The weights in the terms of the profile always sum to one. The profile is limited to a maximum of 20 single word terms and 4 two word terms (i.e., phrases). Documents are represented in a similar fashion, but are limited to a maximum of 40 single word terms and 10 phrases. Depending on the learner, the profile gets updated in different ways. Two different learners were used:

Rocchio variant: Here, the profile at time $t+1$, $P_{t+1}$, may be obtained from $P_t$ and the sets of relevant and nonrelevant documents as follows:

$$P_{t+1} = P_t + \frac{1}{|R^t|} \sum_i R_i^t - \frac{1}{|S^t|} \sum_i S_i^t$$

where $R^t$ and $S^t$ are, respectively, the sets of relevant and nonrelevant documents as judged by the user at time $t$. $R_i^t$ is the VSM for relevant document number $i$. Similarly, $S_i^t$ is the VSM for nonrelevant document number $i$. This is similar to Rocchio's original variant [Roc71], modified in the weight assigned to the previous query and in the fact that the algorithm is applied in an incremental fashion. When using this algorithm, up to 30 documents are returned to the user for relevancy ranking. Documents whose system ranking is less than or equal to 0 are not shown to the user.

The algorithm is further modified as suggested in Rocchio's original formulation [Roc71], by accepting a term for inclusion in $P_{t+1}$ if and only if its weight is greater than 0, and the term was either in $P_t$, or was in more relevant vectors than nonrelevant vectors. Both these measures have the same effect: they eliminate terms that have little discriminatory power. The former technique also eliminates those terms that are only able to identify irrelevant documents.

Ide variant: This method is a variation due to Karakoulas [Kar…] of other variations [All96, BSA94, Ide71]. Here, the new profile $P_{t+1}$ is obtained as follows:

$$P_{t+1} = \alpha P_t + \beta \sum_i R_i^t - \gamma \sum_i S_i^t$$

where $R^t$ and $S^t$ are as before and $\alpha, \beta, \gamma, \delta$ are predetermined constants, and where $\beta = \delta$ for all terms $w_i \in R^t$ such that $w_i$ is not in the query at time $t$. The constants are determined through experimentation. In another variation, $\alpha, \beta, \gamma, \delta$ may actually be variables that are determined as time progresses. However, for the purposes of these experiments, they are constant. As before, when using this algorithm, up to 30 documents must be ranked by the user, using the same criteria as in the Rocchio style algorithm. Only positive, non-zero weighted terms are accepted in $P_{t+1}$. This method will be referred to by the name "Grigoris".

Chapter 3

Architecture

This chapter explains the architecture of the system implemented, comments on the scalability of the system, explains some algorithms used, and lists the topics used for the next chapter.

3.1 Overall Architecture

Adaptive information filtering and metasearch are both asynchronous tasks. The architecture reflects this. Figure 3.1 shows the architecture of the metasearch engine. An explanation of the symbols used follows.

Circle: An object in the system, operating in the same machine as all other circles; overlapping circles represent multiple objects working concurrently in separate threads of execution.

Square: An object in the system, operating on a specific machine. Overlapping squares mean that several object instances are working concurrently, possibly on different machines. If on the same machine, instances work in different threads in the same process.

Rectangle: Rectangles that are not also squares represent queues or user interfaces. Rectangles with a horizontal orientation are queues, while a vertical orientation indicates the user interface.

Oval: The oval represents the main client; this is the client that exists on the same machine as the user interface.

As can be seen from the diagram, one client communicates with multiple processes on multiple machines in order to collect documents for user ranking. Each one of those processes creates a series of queues and queue managers to handle various data transformations. The queue managers operate in parallel with each other, processing data as it becomes available in the queues. Each queue manager actually passes the data in the queue to a multithreaded pool of workers. One worker in the pool will perform transformations on the data before putting it into the next queue. What follows is an explanation of figure 3.1.
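Before that walkthrough, the queue-manager-plus-worker-pool pattern just described can be sketched in Java. The thesis used Doug Lea's util.concurrent package; this sketch substitutes the equivalent standard java.util.concurrent classes, and all names here are illustrative assumptions rather than the engine's actual code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Moves items from an input queue, transforms them in a worker pool,
 *  and puts the results on the next queue in the pipeline. */
public class QueueManager<I, O> implements Runnable {
    private final BlockingQueue<I> in;
    private final BlockingQueue<O> out;
    private final Function<I, O> transform;
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public QueueManager(BlockingQueue<I> in, BlockingQueue<O> out,
                        Function<I, O> transform) {
        this.in = in;
        this.out = out;
        this.transform = transform;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                I item = in.take();                     // blocks until data arrives
                workers.submit(() -> {
                    try {
                        out.put(transform.apply(item)); // hand off to next stage
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();         // shut down cleanly
        }
        workers.shutdown();
    }
}
```

Chaining several such managers, one per transformation, reproduces the pipeline of steps 5 through 10 below, with each stage running concurrently.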

1. The user interface is responsible for accepting the initial query of the user.

2. The query is passed to the client program.

3. The client program passes the query to the search engine selector, which selects a set of search engines to use. This allows for dynamic selection of search engines, possibly in conjunction with a learning algorithm.

4. The set of search engines is passed to multiple page retrievers, possibly on different machines. The machines are selected in a random order, and given a random number of search engines to query. The maximum number of search engines on a single machine is an adjustable parameter. Each page retriever handles only one engine. Each of the following steps until step 10 occurs in each page retriever, and each page retriever has its own copy of the various objects and queues, except for global data structures, which are outlined in section 3.2.

Figure 3.1: Overall architecture; explanation in section 3.1.

5. The search engine extractor works in tandem with the page fetchers. It requests URLs corresponding to the user's query from a search engine, and continues to do so until it finds 50 new or changed documents, or no more documents are available from the individual search engines. Changed documents are those that have been fetched before, but have changed since last fetched. This is determined using a combination of a datestamp and a checksum for the document. The page fetchers tell the search engine extractor the number of new or changed documents that have been found. The search engine extractor puts URLs into the URL queue.

6. The URL manager removes URLs from the URL queue and passes them to a worker in the page fetcher pool. Java's threading model requires that an asynchronous page fetcher use a helper object, so that page downloading may be halted (a sketch of this pattern appears after this list). The page fetcher asks a helper to download the document given the URL, and passes information about new or changed documents to the search engine extractor. The document information is then entered into the page information queue.

7. The page information manager removes documents from the page information queue and passes them to one of the workers in the page analyzer pool. The analyzers extract and rearrange some of the data from each document, transforming the document into an SGML document that the backend instances can use. The newly formatted document is entered into the backend queue. At this stage, documents with non-English characters are stripped to include only English characters, if any exist. Furthermore, the document is split into a series of URLs, the URL text, the title, and the rest of the document.

8. The backend manager extracts the document data from the backend queue and passes it to one of the workers in the backend instances pool. Here, a backend instance transforms the document into a vector representation of the document, along with various tags assigned to the document by the page analyzers. The vectors are then entered into the analyzed VSM queue. The backend was originally created by Brian Chambers [Cha98], and modified to allow for multithreaded use and WWW documents (i.e., documents in HTML).

10. The page retriever collects the data in the analyzed VSM queue and sends that data to the client, based on a client request for the data.

11. The client then formats the data and orders it according to the learning algorithm in use, and presents it to the user for ranking.
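As referenced in step 6, here is a minimal sketch of the helper-object pattern for halting a download, under the assumption of a simple timeout policy; the class name and details are illustrative, not the thesis's implementation.

```java
import java.io.InputStream;
import java.net.URL;

/** Downloads a page in a helper thread so the caller can give up after a deadline. */
public class PageFetcher {
    public String fetch(String url, long timeoutMillis) throws Exception {
        final StringBuilder body = new StringBuilder();
        // Helper object: the blocking I/O happens in its own thread.
        Thread helper = new Thread(() -> {
            try (InputStream in = new URL(url).openStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    synchronized (body) {
                        body.append(new String(buf, 0, n));
                    }
                }
            } catch (Exception e) {
                // Treat any failure as an unavailable document.
            }
        });
        helper.start();
        helper.join(timeoutMillis);   // wait up to the deadline
        if (helper.isAlive()) {
            helper.interrupt();       // best effort; blocking I/O may not stop at once
            return null;              // caller treats this as a failed fetch
        }
        synchronized (body) {
            return body.toString();
        }
    }
}
```

The caveat in the comment foreshadows the scalability discussion in section 3.3: a Java thread stuck in a blocking call may keep consuming resources even after it is interrupted.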

All communication between machines is done via a message passing system built on top of Java's Remote Method Invocation (RMI). This message passing system allows synchronous and asynchronous communication, with mechanisms in place to allow agent communication languages such as KQML [LF97, FLM97, FFMM94]. However, such languages are not required for use, and a much simpler protocol was used here. The global data structures are shared through Java's RMI subsystem.
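To make the message passing concrete, a minimal sketch of an RMI-based endpoint follows; the interface, method names, registry name, and host are illustrative assumptions, since the engine's actual protocol is not specified here beyond being simpler than KQML.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

/** Remote interface through which machines exchange pipeline messages. */
interface MessageEndpoint extends Remote {
    /** Synchronous send: blocks until the peer has handled the message. */
    String send(String message) throws RemoteException;

    /** Asynchronous send: queues the message and returns immediately. */
    void post(String message) throws RemoteException;
}

/** Client side: look up a peer's endpoint and send it messages. */
class MessageClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("retriever-host", 1099);
        MessageEndpoint peer = (MessageEndpoint) registry.lookup("endpoint");
        peer.post("NEW_QUERY palm pilot");   // fire-and-forget notification
        String reply = peer.send("STATUS");  // blocking request/response
        System.out.println(reply);
    }
}
```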

3.2 Global Data Structures

There are two global data structures that are used for all queries and across all machines. The first of these is the document frequency table and the second is the stopword list. The former is a list of all the terms that have been seen until the current time, $t_c$. It also has a mapping of terms to the frequency of the term in the documents seen until time $t_c$. This structure forms the dictionary that is used by the system, and is initially empty. If this dictionary represented the entire document space, and thus, all words and two word phrases in the English language, the dimensionality of the document space would inhibit useful learning in addition to causing scalability problems. This problem is handled first by the dynamic generation of the document frequency table, which ensures that only those words that have actually been encountered enter the dictionary. Several further mechanisms keep the dictionary small.

3.2.1 Zipf's Law

The first of these is an application of Zipf's law. This law, as cited by Sahami [Sah98], states that words that occur infrequently in a corpus of documents have little discriminating power between documents. Such words may help to identify individual documents, but do little else. Since the purpose of the document VSMs is to order documents relative to each other, these words may be discarded, as they provide no useful information in this context. As an example of this, Sahami estimates that words occurring only once in a corpus will provide half of the unique terms in the corpus. This is only an approximation, but aids in keeping the dictionary small. This application of Zipf's law may be done at intervals, when a large number of documents, such as 10000, have been seen. This ensures that such terms really do fall into the category of unique but nondiscriminatory. Terms that are in this category are placed into a dynamically generated stopword list.
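A sketch of the document frequency table with this interval-based pruning, in Java; the thread-safe map, the threshold of one occurrence, and the 10000 document interval follow the description above, while the class and method names are assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Global table mapping each term to the number of documents containing it. */
public class DocumentFrequencyTable {
    private final Map<String, Integer> docFreq = new ConcurrentHashMap<>();
    private int documentsSeen = 0;

    /** Records one document's unique terms. */
    public synchronized void addDocument(Set<String> uniqueTerms) {
        documentsSeen++;
        for (String t : uniqueTerms) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    /** Zipf-based pruning: terms seen in only one document are moved into the
     *  dynamic stopword list once enough documents have been seen. */
    public synchronized void pruneInto(Set<String> dynamicStopwords) {
        if (documentsSeen < 10000) {
            return;
        }
        docFreq.entrySet().removeIf(e -> {
            if (e.getValue() <= 1) {
                dynamicStopwords.add(e.getKey());
                return true;
            }
            return false;
        });
    }
}
```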

3.2.2 Stopword List

The stopword list is used when creating VSMs, and is used by the system to remove words from a document's vector representation that appear in the stopword list. Complementing the dynamically generated portion of the stopword list, described in section 3.2.1, is a static stopword list. This list is created before the dictionary is built, and consists of a list of words to exclude from the dictionary and thus the document space. This list is meant to consist of commonly used words that are known to provide little or no discriminatory power. It includes articles like "the", "a" and common conjuncts like "and" and "or".

3.2.3 Stemming

The last mechanism to prevent excess dimensionality is stemming. Here, words with similar roots are identified in the document space. For instance, the words "accessory", "accessories", and "accessorized" are identified to the term "accessori" during the transformation of a document to a vector in the document space. Thus, the document space treats all such words as a single term. Terms that are two word phrases have stemming applied to each individual word in the phrase. This is accomplished through a suffix-stripping routine [Por80], where words are systematically reduced to a "root" word, which may or may not correspond to an actual word in the English language. The fact that terms in the profile consist of stemmed terms adds to the inability to use an expanded query in successive searches with search engines (see section 1.3.1).
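To illustrate stemming as applied here, including phrase terms stemmed word by word, consider the following Java sketch; stem() is a toy stand-in with a few rules, not the Porter-style routine [Por80] actually used.

```java
public final class Stemming {
    /** Stems a term; two word phrases are stemmed word by word. */
    public static String stemTerm(String term) {
        StringBuilder out = new StringBuilder();
        for (String word : term.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stem(word));
        }
        return out.toString();
    }

    /** Toy stand-in: a real Porter stemmer applies many more rules. */
    private static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "i";
        if (w.endsWith("y")) return w.substring(0, w.length() - 1) + "i";
        if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        return w;
    }
}
```

With these rules, both "accessory" and "accessories" reduce to "accessori", matching the example above.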

3.3 Scalability

The architecture is scalable, allowing multiple machines to cooperate in analyzing and downloading documents. In fact, working on multiple machines was necessary, as the initial system reported memory errors with just one machine with 128MB of RAM in use. The use of Java's RMI subsystem allows multiple machines to coordinate, and work in an asynchronous manner. Unfortunately, this is not as completely scalable as it could be or appears to be. Since the engine was written in Java, it is subject to some of Java's faults. In this case, threads in Java cannot be interrupted immediately. This means that even though all the transformation performing objects (section 3.1) are given a limited amount of time in which to complete a data transformation, those that do not finish and are told to interrupt themselves will not stop consuming resources until the task has been completed, or the object checks for interruption of the thread in which it is operating. If the object happens to be in the midst of a blocking call, such as when downloading a document, it will not stop until the blocking call returns. […]


Dnnny Sullikm. Survey reveals search habits. June 2000. ht tp: / /searchenginematch.com/sereport /00/06-realnames. htd. mn-that is. for seps 1 to 11 in figure 3.1-a page retriever will nor have a complete document frequency table. Only t hose changes that are made locally are alailable. .At

the end of the run. changes ro the local table are rransferrecl to the master table. dong

ivith changes from all orher page retrievers. This redtices mtich of the nerwork rraffic. escept for the initial trausfer of the table and stopword list . As a consequence of caching.

the document frecluency r able is an est imare of the current kno~declgeabout the frequenc?. of terms in the documents seen. Thus. it id1 inirially be inaccurare. ancl the esrimate

will get ber ter as rime goes on. until the caching effect becornes irrelevant . This is nor a

prohlem. as the table is inaccurate initially. whether or not caching is tised. Furthermore.

if no caching ivere used. and the initial tablc were chauging rapidly. the kliowledge would

only alioiv bet ter vector modelling of clocuments t hat were viewecl later in the initial rtins.

Thus. documents tvould not be treated eclually in the rankings because they tvoiilcl he

ranked basect on different arnounts of knotdedge about the terrns in the document space.

The problem of a poor initial table ma' be reduced by using a bootstrap document

frequency tablc. if there is knowledge about the dictionary that will Iikely be created.

S tich knowledge ma? corne from esisting analyses of the English language or from other

information retried studies. for example. For this study. bootstrap-t ables were used from

Brian Chamber's work [Cha99]. The document corpus and topics used in Chamber's work were different from the ones used here. However. this means that shased words

would most likely be fairly common words in rhe English language that still had some

discrimina tory power.
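The local/master arrangement described above might be sketched as follows; the class and method names are invented for illustration, and the real engine's data structures are certainly more involved.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Sketch of a cached document frequency (DF) table. Reads combine
    // the master snapshot fetched at the start of a run with local
    // changes; only the local deltas are shipped back at the end of
    // the run, which is what keeps network traffic down.
    public class CachedDfTable {
        private final Map<String, Integer> masterSnapshot;
        private final Map<String, Integer> localDeltas = new HashMap<String, Integer>();

        public CachedDfTable(Map<String, Integer> masterSnapshot) {
            this.masterSnapshot = masterSnapshot;
        }

        // Record one document: each unique term's DF grows by one.
        public void addDocument(Set<String> uniqueTerms) {
            for (String term : uniqueTerms) {
                Integer d = localDeltas.get(term);
                localDeltas.put(term, d == null ? 1 : d + 1);
            }
        }

        // Best current estimate of a term's DF; approximate during a
        // run, since other retrievers' changes are not yet visible.
        public int documentFrequency(String term) {
            Integer base = masterSnapshot.get(term);
            Integer d = localDeltas.get(term);
            return (base == null ? 0 : base) + (d == null ? 0 : d);
        }

        // At the end of the run, transfer only the deltas to the master.
        public Map<String, Integer> drainDeltas() {
            Map<String, Integer> out = new HashMap<String, Integer>(localDeltas);
            localDeltas.clear();
            return out;
        }
    }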

In spite of the caching done, true scalability can only be achieved by using some distributed database or other distributed data structure, which includes the use of a more intelligent caching scheme. This database or caching scheme would apply to both the document frequency table and the stopword list.

3.4 Term Weighting

In the creation of VSMs in step 8 in figure 3.1, a number of different approaches may be used. In this engine, a term t in document d is given a weight w as follows:

    w = f_t \cdot \log_2\!\left(\frac{N}{f_d(t)}\right) \cdot \frac{\#\text{ unique words in } d}{\text{avg } \#\text{ unique words per document}}

where f_t is the term frequency, log is the logarithm to the base 2, N is the total number of documents seen at the time that d is analyzed, and f_d(t) is the number of documents in which term t occurs at least once. Other term weighting systems may also be used, as in [SM83, Sal71, LG98a, Mla99]. These are not the weights that are used during learning, however. When actually used, vectors are transformed so that the sum of their weights is 1. This includes the vectors that represent the profiles.
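Read directly off the definitions above (including the reconstructed uniqueness ratio), the weighting and the normalization to unit sum could look like the following sketch; the names are illustrative, not the engine's code.

    // Sketch of the term weighting described above, followed by the
    // normalization that makes a vector's weights sum to 1. The
    // uniqueness ratio reflects the formula as reconstructed in the
    // text; treat the whole sketch as illustrative.
    public class TermWeighting {

        // w = f_t * log2(N / f_d(t)) * (#unique words in d / avg #unique words per doc)
        static double weight(int termFreq, long totalDocsSeen, int docFreq,
                             int uniqueWordsInDoc, double avgUniqueWordsPerDoc) {
            double idf = Math.log((double) totalDocsSeen / docFreq) / Math.log(2.0);
            return termFreq * idf * (uniqueWordsInDoc / avgUniqueWordsPerDoc);
        }

        // Normalize so the weights sum to 1; applied to document and
        // profile vectors alike before learning.
        static double[] normalize(double[] weights) {
            double sum = 0.0;
            for (double w : weights) sum += w;
            if (sum == 0.0) return weights;
            double[] out = new double[weights.length];
            for (int i = 0; i < weights.length; i++) out[i] = weights[i] / sum;
            return out;
        }
    }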

3.5 Topics

Four topics were chosen for queries:

Palm Pilot  The palm pilot topic was meant to obtain documents pertaining to accessories and free productivity software for the Palm handheld computer line by Palm, Inc. All documents had to be about either accessories or free productivity software to be deemed relevant (no demos or shareware, and no lists of URLs to other sites), among other criteria.

Robots  This topic concerned research into autonomous robots, particularly with respect to courses, but any research would do. Documents about robot competitions (unless these were also courses), remote controlled robots, and toys were excluded from this topic.

Microsoft DOJ  While Microsoft has had a number of cases with the Department of Justice (DOJ), documents were relevant to this topic only if they were applicable to the case which took place in the years 1998-2000 and resulted in the judge determining that Microsoft should be split into two companies. This excludes the case regarding the consent decree, circa 1997, and all other antitrust cases, such as the one regarding the purchase of Intuit.

Students  The actual query used for this topic was "students university toronto." It was meant to obtain documents relating to student life at the University of Toronto: things such as clubs, organizations, student activities and student guides. It was meant to mimic a query by a potential undergraduate student to the University, who was interested in seeing what students did at the University, socially.

Chapter 4

Experimental Results and Evaluation

This chapter presents results of experiments run to answer the questions posed in the Introduction. The results are also evaluated for statistical significance, and evaluated with respect to the questions posed.

4.1 Description of Data Gathering Procedure

Data were gathered on as close to a daily basis as possible. Some days, data could not be obtained because few documents were returned. This lack of data on a given day was due to changing conditions on the local and global Internet, and because of this, the data gathering procedures were either redone for those days or not done at all. As will be shown, a large gap between data gathering days did not appear to have an effect on the results. On those days when data were gathered, the queries were given in the order in which they were presented in section 3.5. For each topic, at least 50 documents were obtained on each day, with a mean of 180 documents gathered per topic, per day. The exact number of documents found each day may be seen in figure 4.1. An explanation of some of the larger spikes may be found in section 4.6.

Figure 4.1: Daily document counts. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students.

4.2 Evaluation Framework

The evaluation of information retrieval systems typically involves some measure of precision and recall. In this domain, the former is a measure of the proportion of retrieved documents that are labelled as relevant, and the latter is the percentage of relevant documents that were retrieved, out of the whole population of relevant documents available. The total number of relevant documents for any given topic or query is unknown. Furthermore, a measure of recall should also take into account properties of the Internet. At any given moment, large portions of the Internet may be inaccessible or difficult to access due to failures of individual machines in the Internet, or unresolved congestion at some point in the Internet. Also, since users tend not to navigate beyond the first set of documents that a search engine displays [Nie00, Nie99b, Nie97, Sch00, Tog98], and since some search engines can give hundreds of thousands of documents, it is especially important to have more relevant documents in the top few URLs listed. Finally, as the WWW grows, recall becomes far less important than precision [Nie99a]. Many documents will be relevant, but the most relevant should be placed at the top of search lists.

Some might argue that the need for better precision over recall already exists. However, another study [Su94] indicates that users may be more interested in absolute recall than precision; it is unclear whether this would still be true when using today's search engines on the vastness of the WWW, as Nielsen argues. Whatever the case may be, any measure of recall or precision that goes beyond the first 10 to 20 documents that the metasearch engine displays is relatively useless, because the user will only rarely see documents listed beyond those first 10 or 20. However, a measure of relative recall is useful when comparing the individual search engines that make up part of the metasearch engine. Here, a measure of the number of relevant documents an individual search engine obtained, relative to the number of relevant documents the metasearch engine obtained, can be used to compare individual search engine performance over time. Statistics of this nature can be found in section 4.7. It is possible to estimate the number of relevant documents found each day. These results are presented in section 4.5. For most of the other discussion, a measure of precision is used. This measure is given with respect to a certain number of documents: for example, the precision in the top 10 documents returned by the metasearch engine, or the precision in the top 1 document. These measures have been used before in analyzing search engines [GLG+99, CS98]. The TREC-7 filtering task also suggests a measure to use [Hul99], the F3 measure:

    F3 = R^+ - N^+

where R^+ is the number of relevant documents retrieved and N^+ is the number of nonrelevant documents retrieved. Results from this measure are presented in section 4.3.1. The TREC-9 filtering task [RH] suggests the use of the T9U and T9P measures, presented in sections 4.3.2 and 4.3.3, respectively:

    T9U = \begin{cases} 2R^+ - N^+ & \text{if } 2R^+ - N^+ \ge MinU \\ MinU & \text{otherwise} \end{cases}

    T9P = \frac{R^+}{\max(MinD,\ R^+ + N^+)}

where MinU = -400 for 4 years, or pro-rata, and MinD = 50 for 4 years, or pro-rata.
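Given the reconstructed definitions above, the three measures reduce to a few lines. In this sketch (illustrative, not the thesis's code), MinU and MinD are assumed to have been pro-rated already.

    // The filtering measures as reconstructed above, from counts of
    // relevant (rPlus) and nonrelevant (nPlus) retrieved documents.
    public class FilteringMeasures {

        static int f3(int rPlus, int nPlus) {
            return rPlus - nPlus;
        }

        static int t9u(int rPlus, int nPlus, int minU) {
            return Math.max(2 * rPlus - nPlus, minU);
        }

        static double t9p(int rPlus, int nPlus, int minD) {
            return (double) rPlus / Math.max(minD, rPlus + nPlus);
        }
    }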

Establishing a performance baseline would also be useful. One baseline that could be used is having an unchanging profile order the documents. In other words, the profile and the query stay the same through time. In the following discussion, this profile, query and associated learning algorithm will be referred to as the plain profile, query, or learning algorithm. This baseline turns out to have properties very similar to choosing a random set of documents and presenting them to the user. One would expect that choosing a

random set of documents would result in the running average of the precision in the top N being approximately the same throughout time, for any N. The running average precision in the top N is simply the precision in the top N over some number of days. This can be seen in figure 4.2. The differences in precision end up being no more than 4%. Also, the order in which the precisions appear on the graph seems random, with the precision in the top 30 being the best, while the precision in the top 3 is the worst. The difference between the precision in the top 30 and the next best precision suggests that the ordering may be poorer than random, and figure 4.3 confirms this. A Wilcoxon signed rank test [Gus97], performed due to the non-normality of the data, reveals the difference is significant with p < 0.001. This still presents a good baseline, however, based on the merits of the profile. All the other learning algorithms use this plain profile to start with. Any improvement over the plain algorithm is thus a result of learning. With this baseline measurement, 30 documents were ranked each day.

Figure 4.2: Running average of precision across all topics for the Plain algorithm (top 1, 3, 5, 10 and 30).

Another baseline was also used. This will be referred to by the random name. The random algorithm chose 30 random documents each day, from the documents retrieved on each day. Here, all the running averages converge to approximately 10%, which is slightly higher than the precision in the top 30 for the plain profile. This convergence is expected from a ranking of a random set of documents. The data from the random algorithm will be used to estimate the proportion of relevant documents per day and, thus, the number of relevant documents per day. As with the plain algorithm, 30 documents were ranked on each day.

Figure 4.3: Running average of precision across all topics for the Random algorithm (top 1, 3 and 5).

In the following discussion, it is often instructive to examine only a portion of the results returned by the algorithms used. For example, without any other restrictions, every algorithm could return up to 30 documents. Thus, the measures given without other restrictions could be called the measures in the top (up to) 30 documents. We can restrict the number of documents we allow in the measure, and call it the measure in the top n documents. These measures would include only the top n system-ranked documents, or fewer than n if fewer documents were ranked by the system. For ease of notation, these measures will be referred to as the measure in the top n. Keep in mind that fewer than n documents may be included in the measure.

4.3 TREC Measures

4.3.1 The F3 Measure

The running average of the F3 measure is shown in figure 4.4, with various topics. Figure 4.4(b) shows an example of the daily F3 statistics. The line labelled as "Test" in this figure uses the same profile as the line "Grigoris" until day 62. After day 62, the "Test" profile is frozen, while the "Grigoris" profile is allowed to change through continued learning. This was done to examine any differences that might arise, and results of this are presented in section 4.4.1.

The difference between the Grigoris algorithm and the Rocchio algorithm is statistically significant when taking into account all topics, with p < 0.0001, using the Wilcoxon signed rank test with a 0.5 continuity correction. The difference here is also clearly significant in terms of real world performance.

The graphs seem to indicate that, with the exception of the plain algorithm, all algorithms on all topics perform well in the first ten to twenty days, after which performance seems to plateau or decline. The exception to this is the performance on the Palm Pilot topic, where the Rocchio and Grigoris algorithms increase their performance continually.

Almost without exception, this measure indicates an increase in performance at or around day 70. This will be explained in section 4.6. The almost monotonically decreasing line of the plain profile indicates that the number of relevant documents is almost always decreasing. This is to be expected, as one would think that the individual search engines are fairly good at ranking documents. Thus, as time goes on, the individual search engines give back worse and worse documents, which are less and less relevant to the general query given, much less the implicit topic that the user has in mind. In light of this, any line segment in the cumulative plots which has greater slope than the corresponding line segment on the plain profile line is indicative of better than plain performance. Likewise with respect to the line for the random algorithm.

Figure 4.4: F3 measure on various topics, running average. (a) Palm Pilot; (b) Palm Pilot, daily; (c) Robots; (d) MS/DOJ; (e) Students; (f) All topics.

Figure 4.5: F3 measure on various topics, running average, based on top 5 documents returned only. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

Figure 4.6: F3 measure on various topics, running average, based on top 10 documents returned only. (d) Students; (e) All topics.

The two traditional relevance feedback algorithms perform well on all but the students and robots topics. On the latter topic, the Rocchio algorithm generates approximately four times as many nonrelevant documents as relevant ones, whereas the Grigoris algorithm manages to get 50 more relevant documents than nonrelevant ones. Neither of the algorithms do well in the students topic, possibly because few documents were relevant at all in that topic, as indicated by the line for the random measure; note that the minimum value of this measure is -1300 and the random algorithm received -1147, which translates to approximately 5.5% of the retrieved documents being relevant. This may indicate that the data or the generated profiles were noisy. In fact, the Rocchio algorithm had a recall of 24% (see section 4.5 below, on how recall is estimated in this work) but performed more poorly than the Grigoris algorithm because of the large number of irrelevant documents obtained. In comparison, the Grigoris algorithm had a 23% recall. It is also possible that the implicit topic behind the students query led to noisy results by virtue of the topic itself. For instance, suppose that the term "Association" were a good indicator of relevance, but only in conjunction with the terms "Toronto" and "Student." Furthermore, suppose that those words were poor indicators of relevance when found alone, or not near the other words. The generated profiles do not take this into account, and thus cannot model these relationships accurately.

The F3 measure based only on the top 10 and top 5 documents ranked gives an interesting picture of the system's performance. These may be seen in figures 4.6 and 4.5. Judging from the shapes of the graphs¹, and in particular the shape of the graph showing the F3 measure across all topics, it would appear that the performance improves by taking into account only those documents. This lends credence to the idea that the learning algorithms are working, because this type of improvement suggests that more relevant documents are concentrated in the top ranked documents. More discussion on this is presented in section 4.4.

¹Unless we normalize the F3 measure, we cannot compare slopes or absolute numbers. This can be done, but this type of comparison is more clearly seen in section 4.4.

4.3.2 The T9U Measure

The T9U measure is meant to reduce the effect of selecting too many documents, particularly when those documents are irrelevant. It does this by introducing a lower bound on the measure, similar to the F3 measure. The lower bound is called MinU. Results shown with this measure, in figure 4.7, should be a "smoother" version of the F3 measure. This is borne out by examining the graphs. In fact, except for the palm pilot topic, the graphs for the learners exhibit a high degree of parallelism with the graphs for the plain and random algorithms. This means that the learning algorithms aid in the process of finding relevant documents only at specific points. This must be true, given the parallelism and the fact that the lines representing the learning algorithms are higher than those for the plain and random algorithms.

This behaviour is made more obvious by examining the graphs in figures 4.7(e), 4.8(e) and 4.9(e). It is easy to see that the "bumpiness" of the graphs increases when taking into account fewer documents. The "bumpiness" occurs where the learning algorithms actually perform better than the random and plain algorithms for a particular day. This behaviour cannot entirely be due to the performance on the palm pilot topic, as the graphs of the palm pilot are fairly consistent in terms of when performance is better than the plain and random algorithms. The consistency of the measure on the palm pilot topic, and the fact that the measure was always rising, indicate that, in fact, the learners on this topic did not select many more than 5 documents each day. While the other topics are also somewhat consistent, this same interpretation may not be given. The consistency there is a result of the performance being similar to the baseline measures for most of the run. While the performance on the other topics appears to be poor, the overall results still indicate that the learners were correctly ranking relevant documents higher than nonrelevant ones. If this were not the case, the graphs for the top 5 and 10 (figures 4.8 and 4.9) would look more like the graph in which no restrictions were made (figure 4.7). Examining the overall results on this measure one final time, it indicates that either there were not too many relevant documents, or the learning algorithms work best only for producing relevant documents in the top 5 or fewer documents returned. If this were not the case, one would expect the performance to be similar on all graphs for a particular topic.

The data seem to indicate that the Grigoris algorithm is better than the Rocchio one using this measure. However, a Wilcoxon signed rank test on the daily differences reveals the difference to be insignificant, with 0.1977 < p < 0.2005. Examining figure 4.7(e), the difference seems almost entirely due to the difference occurring at days 2, 3 and 4. While this difference, taken on a daily basis, might not be statistically significant, the end results are certainly significant in the real world. This difference ends up causing a difference of 4.19% in terms of the precision in the top 30. Perhaps more importantly, however, the difference is almost entirely due to a six document difference in the first four days, with a five document difference in the second day. This is important because, while the usage scenario of this system occurs over some time, it could be the case that a time period of only a few days was important. Thus, any improvement in those few days would be critical. Also, since the behaviour of the individual search engines being put to use is such that relevant links are highly likely to be present in the first few pages of hits that the search engines return, and these first few pages are retrieved in the first few days, the importance of achieving a good profile in a short time increases.

Figure 4.7: T9U measure on various topics, running averages. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

Figure 4.8: T9U measure, top 5 only. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

Figure 4.9: T9U measure, top 10 only. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

4.3.3 The T9P Measure

The T9P measure is meant to "stress precision" according to the Filtering Track Guidelines [RH]. It uses a lower bound on the minimum number of documents that must be selected. It thus penalizes systems which retrieve fewer than this minimum number of documents. These results are presented in figures 4.10, 4.11 and 4.12. The results are clearly in favour of the Grigoris algorithm. In fact, the differences between it and the Rocchio algorithm are statistically significant on the palm pilot and robots topics, as well as in the overall results. Again, the Wilcoxon signed rank test with the 0.5 continuity correction was used to compare daily versions of the measure. The p values were p < 0.0044, p < 0.0207, 0.1151 < p < 0.1170, 0.1003 < p < 0.1020, and p < 0.0013 for the graphs of figure 4.10, in order from left to right, top to bottom. In other words, the three graphs in which it looks like there may have been a statistically significant difference do in fact have one. This difference decreases as we include only those documents in the top 5 or 10, although the overall results maintain their statistical significance. This means that the algorithms tend to give the user the same number of relevant documents in the top few documents.

Figure 4.10: T9P measure on various topics. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

Figure 4.11: T9P measure on various topics, including a maximum of 5 documents.

Figure 4.12: T9P measure on various topics, including up to a maximum of 10 documents.

4.4 Precision of Learning Algorithms

The T9P measure in section 4.3.3 gives the precision in the top 30 for the plain and random algorithms, since those algorithms are always forced to return 30 documents. That measure does not completely represent the data, or accurately represent those individual algorithms. Figure 4.13 shows the running average of the precision across all topics, with precisions in the top 1, 5, 10 and 30 documents returned by the system. The figure shows that the Grigoris algorithm is better at discriminating between relevant and nonrelevant documents, since the line representing the results from the top n always ends up higher with the Grigoris algorithm than the Rocchio one.

The staggered positions of the lines corresponding to the different numbers of documents indicate that both traditional style algorithms are pushing relevant documents higher in the list of documents presented to the user. It may also imply that there are not enough relevant documents to have a high precision in the top 30 or the top 10. Given that the algorithms are working, which they appear to be, then if there were many relevant documents, one would expect that the lines for the different numbers of documents would be closer to each other, instead of having an eight percent difference between the top 1 and top 30 results. It is likely that the staggering is also a result of the algorithms' inability to both have relevant documents ranked highly and to obtain all the relevant rankings from the set of documents available. However, it is difficult to estimate the recall of the algorithms, since no data exist about how many relevant articles exist in the entire WWW. Data about recall on a daily basis, with respect to the documents returned by the individual search engines, can be estimated by looking at the data from the random algorithm; this is presented in section 4.5.

Figure 4.13: Running average precision across all topics (Rocchio and Grigoris algorithms; top 1, 5, 10 and 30; days from start on the x axis).

Figure 4.14: Running average precision on various topics, showing precisions with various numbers of documents included, Rocchio and Grigoris algorithms. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students.

Figure 4.14 shows the precision numbers for the individual topics. All the graphs exhibit stratification. That is, given a fixed number, m, of documents included in the measure, then including n documents, n < m, results in increased performance. The stratification is much less apparent on the robots and MS/DOJ topics than on the palm pilot topic. This is probably attributable to the profiles, and their ability to distinguish between documents. For instance, with the students topic, the term "student life" is important, but only in conjunction with the term "toronto". However, the term "toronto" by itself is actually a very poor indication of relevance. Similarly, the term "student life" without the term "toronto" is a poor indication of relevance. Thus, at certain times in the evaluations, the term "student life" may be important in the profile, but it lacks the term "toronto," or vice versa. This may be due to a fault of the feature space, which consists only of the terms in the VSMs and does not add the correlations between terms.

The large gap in the Grigoris algorithm on the students topic appears unusual. The stratification seems to be extreme when looking from the top 5 to the top 1 data. This is probably due to the high precision at the beginning of the run.

Figure 4.15: Running average precision of the Grigoris algorithm on the students topic.


Figure 4.16: Running average of the precision for continuous training versus the test portion of train/test, across all topics, Grigoris algorithm used. Shown starting from the test cycle at day 70.

Figure 4.17: Running average precision on various topics using the Grigoris algorithm, comparing continuous training with the test portion of a train/test cycle. Testing cycle begins at day 70. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students.

Testing cycle begins at da>- 70. CHXPTER4. ESPERI~IESTALRESL-LTS XSD E\-ALUXTIOS

4.4.1 Continuous Learning vs Train/Test

One other test that was performed was to examine the effects of continuous training versus having a training and testing period for the profile. This was done with the Grigoris algorithm, and the results may be seen in figures 4.16 and 4.17. With the exception of the results in the top 3, the differences in the daily data are statistically significant, with p ∈ [0.0104, 0.0418]. The differences in the top 3 result in a p of 0.0643. These results are only with 14 days worth of data. Furthermore, these results may be due to the fact that the metasearch engine frequently returned pages that had been seen before, and which had only changed slightly since the last time they were seen (in the date, for example). Servers may also have returned incorrect dates for the date check, or pages may have been dynamically generated. All these factors result in a basically unchanged page being given to the user. Pages that had not changed in a relevant manner were marked irrelevant (see section 2.3). This puts any profile that has been frozen at some particular point in time at a disadvantage, because it cannot adjust to this method of ranking. At the same time, this method is necessary in order for the system to give new pages to the user, and for the system rankings not to be dominated by dynamically generated pages.

One other possible explanation of these results is that overtraining might have occurred. If this were the case, the test profile would only work well with the documents the system had already seen. This is not the case here. The spike in the performance at day 70 shows this. As explained in section 4.6, this spike is due to the system seeing relevant documents that had been seen before, but when the profile was still relatively poor.

Figure 4.18: Daily recall for the Rocchio algorithm. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

Figure 4.19: Daily recall for the Grigoris algorithm. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students; (e) All topics.

4.5 Daily Recall

The random algorithm allows estimation of the proportion of relevant documents available on each day. Note that this does not allow estimation of the proportion of relevant documents in the population of documents available on the WWW, but only in the population consisting of those documents retrieved on each day. This still provides useful data. For instance, the 95% confidence interval of the number of relevant documents each day can be estimated, and from there, a range of recall values for each algorithm may be obtained. This confidence interval is obtained using a 0.5 continuity correction to approximate a binomial distribution with a normal one [Gus97]. Using these intervals, it is possible to compute a range for the number of relevant documents that are expected to be present on each day. Given this, it is easy to compute the estimated recall per day. These data are presented in figure 4.18 for the Rocchio algorithm, and figure 4.19 for the Grigoris algorithm. Note that the estimated proportions were altered to be nonnegative, that the recall estimates shown were altered to be no more than 100%, and that a recall of 0/0 was given 100% recall. The "mean recall" refers to the recall obtained when using the middle of the confidence interval as the estimate of the number of documents found.
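The estimation procedure described above amounts to a binomial confidence interval scaled to the day's retrieved population. The following sketch (illustrative, not the thesis's code) applies the normal approximation with the 0.5 continuity correction and the clamping rules just described.

    // Sketch of the daily recall estimation: a 95% confidence interval
    // for the proportion of relevant documents among the n randomly
    // ranked documents, scaled up to the day's retrieved population.
    public class RecallEstimate {

        // sampleRelevant: relevant docs among the n randomly ranked docs.
        // population: total docs retrieved that day. Returns {min, mean,
        // max} estimates of the number of relevant documents available.
        static double[] relevantDocsInterval(int sampleRelevant, int n, int population) {
            double p = (double) sampleRelevant / n;
            // normal approximation with 0.5 continuity correction
            double halfWidth = 1.96 * Math.sqrt(p * (1 - p) / n) + 0.5 / n;
            double lo = Math.max(0.0, p - halfWidth); // proportions forced nonnegative
            double hi = Math.min(1.0, p + halfWidth);
            return new double[] { lo * population, p * population, hi * population };
        }

        // Recall against one interval endpoint: 0/0 is treated as 100%
        // recall, and values are capped at 100%, as in the text.
        static double recall(int foundRelevant, double estimatedRelevant) {
            if (estimatedRelevant <= 0) return 1.0;
            return Math.min(1.0, foundRelevant / estimatedRelevant);
        }
    }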

The daily data can be used to estimate the overall recall. The ranges are given in table 4.1. These ranges were obtained by summing the found number of relevant documents each day, and dividing by the sum of the estimated number of relevant documents found each day. This sum varied, depending on exactly which number in the 95% confidence interval was used in the summation: either the maximum, the minimum or the mean of the number of estimated documents in the confidence interval. The expected number of relevant documents is the sum of the means of the number of relevant documents, and gives an indication of the weight each topic is given in the "All topics" topic.

                  Rocchio                    Grigoris                  Expected number of
    topic         min     mean    max        min     mean    max       relevant documents
    Palm Pilot    0.2672  0.3496  0.5039     0.2338  0.3059  0.4409    891
    All topics    0.1984  0.2645  0.3851     0.1935  0.2580  0.3757

Table 4.1: Overall estimated recall.

    topic         number of retrieved documents   expected % relevant
    Robots        6453                            32.9
    Students      5365                            5.93
    All topics    31664                           10.2

Table 4.2: Total number of documents retrieved for each topic. This is not the number retrieved by the learning algorithms to be presented to the user, but the documents that the metasearch obtains from the individual search engines.

4.6 Spikes

There are two spikes, appearing at the beginning of the graphs and on or around day 70 in the graphs, that are particularly striking, as they appear in all graphs, from the document count graphs of figure 4.1, to the TREC measures in figures 4.4, 4.7 and 4.10, to the precision measurements in figure 4.13. The initial spike may be explained by the fact that the individual search engines that were being used had their best results at the beginning of their document lists, which were viewed by the metasearch engine first. In the case of the document counts, there were more documents because the system did not have to go far in the lists of documents returned by the individual search engines. Thus, the documents tended to be fairly popular, which tends to imply good network connections. It could also have resulted in more popular pages, particularly with DirectHit, Hotbot and derivatives. One would expect more popular documents to be reachable across the Internet and, conversely, that less popular documents would be less reachable. Since unreachable documents cause resource consumption, the system might not be able to collect as many documents in the later days as in the early days.

The other spike occurred after a gap in data gathering of eight days, but also after the table which keeps track of the seen documents had been flushed. This table kept track of which documents the metasearch engine had seen and ranked, not which documents the user had seen and ranked. Flushing the first 10 days of data from it resulted in many completely new documents being presented to the user, and some previously seen ones. Ignoring the previously seen ones unless they had changed, the metasearch engine still gave a large number of relevant documents, enough to show a difference even after 70 days of data gathering. This could either have been due to the gap in data gathering, or to the flushing.

Figure 4.20: Daily precision across all topics, top 10.

4.6.1 Data Gathering Gaps

Gaps occur in other places, such as the seven day gap starting at day 17, the 6 day gap starting at day 30, and the 15 day gap starting at day 73. Examining figure 4.1, it is not difficult to see where these gaps are. The largest gap, the 15 day one, results in a general decline across topics, while the other two gaps show a mix of shallow declines and ascensions, as can be seen on the various performance graphs, such as the F3, T9U and T9P measures. The document counts also do not appear to increase or decrease significantly, meaning that the number of documents that were found to be changed or new did not increase or decrease. The estimated number of relevant documents also does not appear to have any correlation with the data gathering gaps; the results of figure 4.18(e) show a mix of ascensions and declines associated with the gaps.

4.6.2 Flushing

Thus, the spikes were probably due to flushing. However, one would not expect that this flushing would work more than once, because the first flushing would allow the user to rank most of the pages that had been missed earlier due to a poor profile. If it did, one would expect that the performance would not increase even to the extent that it did on day 70. An increase at least as large would mean that the individual search engines were giving many good results over time, but the profile was not adapting quickly enough.

Flushing was done at days 70, 94, and 104; that is, 37 days, 13 days and 3 days before the last data point. Figures 4.20, 4.21, and 4.18 confirm the hypothesis that flushing would not work more than once: in one instance after the day 70 flushing, flushing results in higher precision, and in the other it results in lower precision. Also, the lower precision due to flushing occurs earlier in time than the higher precision due to flushing (if flushing was the cause of this at all, which it probably was not). Similar patterns are found in the daily recall statistics.

This data also has some bearing on the issue of continuous learning and, therefore, continuous adjustment of the profile. It shows that despite the necessity for continuous learning, as shown in section 4.4.1, the learned profile still performs well when presented with a large number of relevant documents, even 60 days after the previous peak performance (and thus, peak learning with the traditional style algorithms). The mainly negative reinforcement that was received after the last spike in performance does not seem to have a deleterious effect later on.

One peculiarity with the daily precision numbers is that the biggest spike in the students topic occurs at day 50 with the Rocchio algorithm. This doesn't correspond to any particularly special day. Only 200 documents were retrieved on that day, when the average for the students topic was 190. To put that in perspective, 427 documents were retrieved on day 70. The high performance seems to be due to the Snap and MSN search engines, each obtaining four relevant results in the top 10.

Figure 4.21: Daily precision on various topics, top 10. (a) Palm Pilot; (b) Robots; (c) Microsoft/DOJ; (d) Students.

Figure 4.22: Daily precision in top 30 on the students topic for the Rocchio algorithm.

4.7 Individual Search Engine Recall

The individual search engines do perform differently on certain topics than on others. Figure 4.23 shows graphs of the running average of the search engine recall, for all topics with the Rocchio algorithm. Figure 4.24 shows a similar graph for the Grigoris algorithm. The recall numbers are given with respect to the relevant documents found by the metasearch engine (ie: all search engines). Computing a line of regression for the above graphs, and using a percentage for the recall, produces the slopes given in table 4.3. This shows that the search engine recall is about the same over time across all topics. The graphs show that no individual search engine is able to obtain more than 25% of the relevant documents, over time. Examining results on the separate topics reveals that no individual search engine obtains more than 40% of the relevant documents found by the metasearch engine on any individual topic, over time.

Search engine recall does change on a per topic basis. Examining figure 4.25, there is a clear upward trend. In fact, the slope of the line of regression is 0.1545%/day. Similar cases may also be found in the other topics, as with Infoseek on the robots topic, with a slope of the regression line of 0.1081%/day (see figure 4.26 and figure 4.27). The different search engines perform differently on the various topics. Table 4.4 shows the engine with the highest slope on the line of regression on the various topics for the two traditional style algorithms, and table 4.6 shows the lowest slopes. In each topic, a single search engine had the best or worst slope, independently of the learning algorithm used. The most and least improving engines do not necessarily match the best and worst performing search engines at the end of the tests. These are given in tables 4.5 and 4.7. The best engines for this job do not reflect the coverage each individual search engine has according to Lawrence and Giles [LG98]. It is also interesting to note that those search engines that share common structures, such as the use of DirectHit's search engine in the Hotbot and MSN search engines, have varying results, possibly due to different versions of the search engine, or different supplementary searches.

Figure 4.23: Running average of individual search engine recall, across all topics for the Rocchio algorithm. Recall is measured relative to the total number of relevant documents found by the metasearch engine using the Rocchio algorithm for learning.

Figure 4.24: Running average of individual search engine recall, across all topics for the Grigoris algorithm. Recall is measured relative to the total number of relevant documents found by the metasearch engine using the Grigoris algorithm for learning.

    Search Engine: AltaVista, DirectHit, Hotbot, Infoseek, Lycos, MSN,
                   NationalDirectory, Snap, Thunderstone, Yahoo

Table 4.3: Slopes of regression lines where the y axis is given as a percentage recall; thus the units are % recall/day from start of run. Results are across all topics.

Figure 4.25: Running average of recall for the Yahoo search engine on the palm pilot topic using the Rocchio learning algorithm.

Figure 4.26: Running average of recall for the Lycos search engine on the students topic using the Rocchio learning algorithm.

These graphs show a wuiety of patterns in the recd. for mxious search

engines. ropic Rocchio Grigoris Engine S lope Engine

- - Palm Pilot !lm- 11SX

Robots Infoseek Infoseek

!lIicrosoft/DOJ Alt aVist a DirectHit

5t ticients NSS MSS

Table 4.4: Best improving individual engine recall per ropic. for the two traditional style

algorit hms. Slopes are given as <7; recalllday

Table 4.5: Best individual engine recall per topic, for the two traditional style algorithms.

    topic      Rocchio      Grigoris
    Robots     DirectHit    Snap
    Students   DirectHit    DirectHit

Table 4.6: Worst improvement in individual engine recall per topic, for the two traditional style algorithms. Slopes are given as % recall/day.

style algorithnis. Slopes are gi~enas %recd/day topic Rocchio Grigoris 1 Engine Engine I pzTSationalDirecton 1 Robots Thunderstone Thunderstone Slicrosoft /DOJ l'ah00

1 Students SationalDirectoy

Table 4.7: Llbrsr indiridual engine recall per topic. for the two traditionai style algo- rit hms . Chapter 5

Conclusions and Future Directions

5.1 Conclusions and Discussion

To answer the questions posed in the Introduction, the overall results of the previous chapter indicate that metasearch does seem to be effective, and learning a user profile also appears to increase the relevance of returned documents. The difference in precision between the baseline algorithms and the algorithms that use a changing user profile ranges from 15% to 30%. The T9P measure resulted in differences of 9 to 10 points, the T9U measure resulted in a difference of between 90 and 115, and the F3 measure resulted in differences between 3000 and 4000 points. Metasearch in a relevance feedback context, and with the implemented method for ranking, is much better than merely using any individual search engine, because certain search engines perform better than others on certain topics, and because of the increased coverage one obtains with metasearch. Using a user profile appears to help relevance, according to the various TREC measures and comparisons with the plain algorithm, with some caveats presented below. The Grigoris algorithm appears to perform at least as well as the Rocchio algorithm on the individual topics, and performs significantly better than the Rocchio algorithm on the T9U, T9P, and F3 measures on all the topics taken together.

This study also obtained answers to questions that were unposed. For instance, the importance of achieving a fairly good profile in the first two to five runs was shown in the parameter tuning stage. There, a set of parameters that led a learner to obtain few results in the first few days led to poor learned profiles. This caused the learner to never see relevant results at all, because no relevant results were returned to the user. Thus, the user would be forced to indicate that all the documents were nonrelevant. Naturally, this led to the learner having no way to predict which future documents would be relevant, except through random selection of documents.

There are some objections that may be raised about the accuracy of the conclusions presented above. In particular, the use of the plain algorithm and plain profile as a baseline may be questionable. Similarly with the use of the random algorithm. Neither algorithm gives particularly good rankings. Furthermore, some existing search engines have their own method of "one-step" relevance feedback. For instance, Google allows a searcher to find "similar pages" to any that they find relevant. Other search engines have the potential to use implicit feedback, although this would probably occur in a batch fashion, unlike the work presented here. Search engines that fall into this category are those that present a list of links to other pages, but in which those links are actually links to a server operated by the same entity as the search engine itself. This other server can collect "visited" statistics, and then redirect the user to the document they wish to view. It is difficult to make any comparisons with individual search engines simply because they are not designed to be used in the same scenario presented here.

As to the baseline measure, it would be difficult to come up with another benchmark. Other metasearch engines cannot be used, for various reasons:

• They may not return enough results to make a repetitive query feasible, even with a general query such as "palm pilot."

• They may not combine the results of all the search engines, instead interleaving results in some manner.

• They may not allow the user to specify exactly which search engines to use, or if they do, they may not allow certain search engines that were used in this study.

The metasearch engine closest to being feasible for benchmarking purposes would be SavvySearch, which still fails for the first two reasons given. The main reason, however, is the first: no metasearch engine returns enough results to make a repetitive query worthwhile. There would be no new documents to view, and few, if any, changed ones. Thus, they would tend to return few or no documents for user ranking. Using the ranking schemes that the individual search engines use would form a good baseline, but these are not available, for the obvious reason that they are proprietary and are the basis on which people use a search engine. It would be possible to use SavvySearch (or some other metasearch engine) as the only "helper" search engine used by the metasearch engine described here, but that would not be fair to SavvySearch's ranking algorithm.

Barring further objections, the results show that metasearch works well for a repetitive query, using forced relevance feedback to adjust the system rankings, and that metasearch works better than any individual search engine, in this context. A number of things may be done in the future to improve the relevancy of the results presented here and to explore other aspects of searching.

5.2 Future Directions

5.2.1 Implicit Ranking

One of the less savory aspects of using this system, from the user's point of view, is that the user must provide relevance feedback, and in the case of the two traditional style algorithms, had to do so for 30 documents. This does not fit well with the scanning method that people use to view WWW documents [Nie97]. A better method would be implicit ranking: ranking that is done without the user having to press a button. For example, one could use the time between visits to the page of ranked documents; that is, the time from when a user follows a link to a ranked document to the time that the user next follows a link to another ranked document. Obviously, some maximum time would have to be instituted. The assumption here is that users visit relevant pages for longer periods of time than nonrelevant ones. Konstan et al. [KMM+97] show, with newsgroup articles, that there is a high correlation between the time spent reading and the explicit rating given to an article. This could even be used in addition to explicit measures of relevance. SavvySearch [DH96] used a visited/not visited measure of performance. This could easily be made into a boolean relevance ranking, and indicates the relevance of words presented in the text of the page that displays the ranked documents. This latter information could also be used as implicit relevance feedback.
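As a rough sketch of the dwell-time idea, the following records the gap between successive clicks on ranked documents and converts it to a boolean judgement. The 30-second threshold and 5-minute cap are invented for illustration; the thesis does not specify values.

    // Sketch of implicit relevance feedback from dwell time: the gap
    // between successive clicks on ranked documents is the implicit
    // signal. Thresholds below are assumptions, not the thesis's.
    public class ImplicitFeedback {
        private static final long MAX_DWELL_MS = 5 * 60 * 1000; // cap on useful dwell time
        private static final long RELEVANT_MS  = 30 * 1000;     // assumed "read it" threshold

        private long lastClickMs = -1;
        private String lastDocument = null;

        // Called each time the user follows a link to a ranked document.
        // Returns an implicit judgement for the *previous* document, or
        // null if there is nothing to judge yet.
        public Boolean onClick(String documentUrl, long nowMs) {
            Boolean judgement = null;
            if (lastDocument != null) {
                long dwell = Math.min(nowMs - lastClickMs, MAX_DWELL_MS);
                judgement = dwell >= RELEVANT_MS; // long visits taken as relevant
            }
            lastDocument = documentUrl;
            lastClickMs = nowMs;
            return judgement;
        }
    }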

5.2.2 Ontologies

As mentioned in Chapter 2, the user's query is expected to fit into some ontology. It might prove interesting to obtain some ontology, such as that from the Library of Congress or from an existing directory based search engine such as Open Directory or Yahoo!. One could also use a narrower field, such as Computer Science, using some existing ontology (such as one from ResearchIndex, formerly CiteSeer [BLG98]). Using this ontology, one could create a more general profile, P_g, for a topic g, by combining existing profiles which correspond to topics below g in the ontology. Similarly, one could create more specific profiles by using existing general profiles. This would have to be done at the user's discretion, since specific profiles would not necessarily generalize, nor vice versa. The user could even specify the exact combination of profiles to use. The use of an ontology would be even more powerful through collaboration.
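One simple reading of "combining existing profiles" is a uniform average of the child profiles' term weights, sketched below. Profiles are taken to be term-to-weight maps normalized to sum 1, as elsewhere in this work; the averaging scheme itself is only one possible choice.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: build a more general profile P_g for topic g by averaging
    // the profiles of the topics below g in the ontology. Names are
    // illustrative; the thesis does not prescribe this combination rule.
    public class ProfileCombiner {
        static Map<String, Double> generalProfile(List<Map<String, Double>> childProfiles) {
            Map<String, Double> combined = new HashMap<String, Double>();
            for (Map<String, Double> child : childProfiles) {
                for (Map.Entry<String, Double> e : child.entrySet()) {
                    Double w = combined.get(e.getKey());
                    combined.put(e.getKey(),
                            (w == null ? 0.0 : w) + e.getValue() / childProfiles.size());
                }
            }
            // If each child sums to 1, the combined profile also sums to 1.
            return combined;
        }
    }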

5.2.3 Collaboration

Collaborative learning and recommendation has been used in a number of different systems [CGM+99, BP98a, KSS97, BS97, KMM+97], and has shown good results. With ontologies, collaboration with respect to profiles might produce good results as well. The collaboration would involve having a common ontology used by all people using the metasearch engine. Different profiles from different people, created under a topic in the ontology, could be combined to produce a prototype profile, which could be of general use, particularly for new users, or for those users who do not wish to have their own, separate profile. People with similar profiles could also recommend other profiles to each other. Clustering techniques could be used on the profiles to generate a dynamic ontology to use, or merely be used to create a dynamic set of bootstrap profiles for new users.

5.2.4 Alternate Document or Feature Space

The current metasearch engine uses only the document space as the set of features to use when comparing documents to profiles. Other features could be used in the profiles and document representations, such as the linkage structure of retrieved documents. This is used in search engines such as Google [BP98b] and Clever [CDK+99]. One could also use features such as the grade level of a document, the number of words in the document, the number of links and images, the recency of the document, and indications of whether the paper is a research paper, among others [GLG+99, Mla99]. These would provide additional information, and could provide additional insight into a user's criteria for relevance, which likely include things other than the text of a document.

These mould provide additional information and codd provide additional insight into a user's criteria for relelance n-hich likely include things other than the tes of a document 5.2.5 Thresholds

The system described here uses a static threshold to determine when to stop giving documents to the user for ranking. Documents with a system ranking below this threshold are never shown to the user, even if fewer than 30 documents had been collected to be shown. This threshold could be dynamically generated. This could be done by monitoring current performance, such as one of the F3, T9P or T9U measures, and altering the threshold based on the value of, or the changes in, those measures.
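A minimal sketch of such a dynamically generated threshold follows: the cutoff tightens when the monitored measure falls, and relaxes when it rises. The 5% step is an invented parameter, and the monitored measure could be any of F3, T9P or T9U.

    // Sketch of a dynamic cutoff driven by changes in a performance
    // measure. The adjustment rule and step size are assumptions.
    public class DynamicThreshold {
        private double threshold;

        public DynamicThreshold(double initial) {
            this.threshold = initial;
        }

        // Call once per run with the latest and previous measure values.
        public double update(double measureNow, double measureBefore) {
            if (measureNow < measureBefore) {
                threshold *= 1.05;  // performance fell: be choosier
            } else {
                threshold *= 0.95;  // performance rose: let more documents through
            }
            return threshold;
        }

        // Only documents ranked at or above the cutoff are shown.
        public boolean show(double systemRanking) {
            return systemRanking >= threshold;
        }
    }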

5.2.6 Alternative Methods of Learning

Other learners might prove to be more effective on this task. For instance, the palm pilot topic might be more easily learned by a system that used several learning agents, each of which would learn a specialized profile. One could be good at retrieving results on hardware accessories, while another could be good at retrieving results on free productivity software. Each agent would learn a local version of the more general profile. This could lead to a better ability to discriminate between relevant and irrelevant documents, because each agent would have a full sized profile representing a local version of the general one. This would also provide a kind of symmetry: the metasearch uses metasearch as the learning component. One candidate for this type of learning is SIGMA.

5.2.7 Miscellaneous Improvements and Directions

In the course of using the metasearch engine, and in analyzing the results, several points of potential improvement have come to light.

1. To increase the precision at the beginning, it might be useful to implement a system in which, having reached a peak (as detected by the subsequent decline in precision), the system would rerank those documents that had been seen before but were unranked by the user. This accounts for the second spike, as detailed in section 4.6.

2. An alternative to the above would be to have a system that always reranked those documents that had been seen before but were unranked by the user.

3. There needs to be a better mechanism to detect changed documents, such that a document would be regarded as unchanged if it were changed in a trivial manner, such as a date change, or a single number or word change. This would prevent some nonrelevant documents from affecting the precision measures. One such mechanism might be to use only a sample of the data in the checksum, such as the 30 bytes of data surrounding the most common terms of the document (see the sketch after this list). Alternatively, the similarity measure between the VSM versions of a potentially changed document could also be used.

4. Aalbersberg [IJA92] obtained favorable results with incremental relevance feedback, where the user only gave a relevance ranking for a single document at a time. This might alleviate some of the strain mentioned in section 5.2.1. The context of the problem presented in that paper was slightly different, however, so it might not readily apply to the situation outlined here.

5. It is possible that using the random algorithm at the beginning of a run would always produce better results than using the plain algorithm. Certainly, the graphs in figures 4.3 and 4.2 suggest that using the random algorithm for at least the first day would be better than using the initial query as the profile (ie: using the plain profile).

6. It should be possible for learners to escape from any local minima they encounter when using a poor profile. This means that every learner needs to have the ability to revert to old profiles, or use old profiles in some way, in order to explore the space of possible profiles as a means of escaping the local minima.

7. Examination of the feature space used might prove fruitful. For instance, instead of merely using two word phrases, one could determine the average distance, in the document, between words that are in the profile. A low average distance could indicate increased relevance. Of course, as mentioned in the introduction, other systems have used other features as well.

8. To better use the resources available, the interruption mechanism mentioned in section 3.3 could be implemented for interrupting calls that blocked on socket communications (ie: network communications).
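The sampled-checksum mechanism of item 3 might be sketched as follows: only the bytes around (the first occurrence of) each of the document's most common terms contribute to the checksum, so a trivial edit elsewhere, such as a date change, leaves it untouched. The use of CRC32 and the exact window size are illustrative choices.

    import java.util.zip.CRC32;

    // Sketch of change detection via a sampled checksum. Only roughly
    // 30 bytes around each common term are hashed, per item 3 above;
    // everything else about this class is an illustrative assumption.
    public class SampledChecksum {
        // terms: the most common terms of the document, found elsewhere.
        static long checksum(String document, String[] terms) {
            CRC32 crc = new CRC32();
            for (String term : terms) {
                int at = document.indexOf(term); // first occurrence only
                if (at < 0) continue;
                int from = Math.max(0, at - 15);
                int to = Math.min(document.length(), at + term.length() + 15);
                // only the bytes around each common term contribute
                crc.update(document.substring(from, to).getBytes());
            }
            return crc.getValue();
        }
    }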

[.41196] J. Allan. Incremenral relennce feedback for inforniatiou filrering. In .4Ck.I

SIGIR Conf.* August 1996. Zurich. Switzerland.

[Bar941 Carol L. Barry. Ilser-defined rele~ancecriteria: .in exploratory stucly. Joar-

na1 of the American Society for Infonatio~i Scierm. 45( 3 ):l-lg-l39. 1994.

[BBCgS] Ana B. Benitez. Mandis Beigi. and Shih-Fu Chang. hing relevame feedback in content-based image metasearch. IEEE Internet Computing. '7(4):59-69.

.July/ August 199s.

[BCF+9S] Lee Breslau. Pei Cao. Li Fan. Graham Phillips. and Scott Shenker. Web caching and zipf-like dis tribut ions: Evidence and implications. Technical

report. University of IVisconsin-1,Iadison. April199S. Technical Report 1371.

Computer Sciences Dept .

[BDW951 Jlic Botvman. Peter B. Danzip. Gdi !danber. 4Iichael F. Schwartz. Darren R. Hardy and Duane P. Wessels. Harvest: .A scalable, customizable discover-

and access system. Technical report. University of Colorado-Boulder. 1995.

[BLGSS] Kurt D. Bollacker. Steve Lan~ence.and C. Lee Giles. CiteSeer: .in au-

tonornous web agent for automatic retrieval and identification of interes ting

publications. In Autonomow Agenb 98. hCX 199s. [BooSS] Gary Boone. Concept features in ReAgenr. an inrelligenr email agent. In

Proceeditigs of the Second International conference on -4utonon~o ILS Agents.

pages 141-143. 199s.

[BPSSa] D. Billsus and SI. Pazzani. Learning collaborari\-e information filrers. In

Proceedinp of the Fifieenth Int entatiotial Conference on Machine Learniiig.

pages 46-34. llorgan iiauiman. 199S.

[BPSShj Serge' Brin ancl Lawrence Page. The anatomy of a largescale hypertestiial IIèb searhc engine. In Seventh International World Wide Web Cotrference.

Brisbane. Australia. 199s.

[BS95] M. Balabanović and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, March 1995.

[BS97] M. Balabanović and Y. Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3):66-72, March 1997.

[BSA94] C. Buckley, G. Salton, and J. Allan. The effect of adding relevance information in a relevance feedback environment. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, 1994.

[BSY95] M. Balabanović, Y. Shoham, and Y. Yun. An adaptive agent for automated web browsing. Journal of Visual Communication and Image Representation, 6(4), 1995. http://www.diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1995-0023.

[But00] Declan Butler. Souped-up search engines. Nature, 405:112-115, May 2000.

[Cal98] J. Callan. Learning while filtering documents. In Proceedings of the ACM SIGIR Conference, 1998.

[CDK+99] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg. Mining the Web's link structure. IEEE Computer, 32(8):60-67, August 1999.

[CGM+99] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes, and Matthew Sartin. Combining content-based and collaborative filters in an online newspaper. In ACM SIGIR Workshop on Recommender Systems, August 1999, Berkeley, CA.

[Cha99] Brian D. Chambers. Adaptive Bayesian information filtering. Master's thesis, University of Toronto, 1999.

[Col00] Christian Collberg. Colloquium at University of Toronto, 2000.

[CS98] Liren Chen and Katia Sycara. WebMate: A personal agent for browsing and searching. In Autonomous Agents '98, pages 132-139, ACM, 1998.

[DH96] Daniel Dreilinger and Adele E. Howe. An information gathering agent for querying web search engines. Technical Report CS-96-111, Computer Science Department, Colorado State University, 1996.

[Dir] Direct Hit. http://www.directhit.com.

[FFM92] Tim Finin, Rich Fritzson, and Don McKay. A language and protocol to support intelligent agent interoperability. In Proceedings of the CE & CALS Washington '92 Conference, June 1992.

[FLM97] Tim Finin, Yannis Labrou, and James Mayfield. KQML as an agent communication language. In Software Agents, MIT Press, Cambridge, 1997.

[GLG+99] Eric J. Glover, Steve Lawrence, Michael D. Gordon, William P. Birmingham, and C. Lee Giles. Recommending Web documents based on user preferences. In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation, 1999.

[Goo] Google. http://www.google.com.

[Gus97] Paul Gustafsen. Lecture notes from Statistics 303, UBC, November 1997.

[Har92] Donna Harman. Relevance feedback revisited. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1992.

[HC93] D. Haines and W. B. Croft. Relevance feedback and inference networks. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2-11, 1993.

[HD97] A. Howe and D. Dreilinger. SavvySearch: A metasearch engine that learns which search engines to query. AI Magazine, 18(2), 1997.

[Hul99] David A. Hull. The TREC-7 filtering track: Description and analysis. In E. M. Voorhees and D. Harman, editors, The Seventh Text REtrieval Conference (TREC-7), pages 33-56, Department of Commerce, National Institute of Standards and Technology, 1999.

[Ide71] E. Ide. New experiments in relevance feedback. In Salton [Sal71], pages 337-354.

[IJA92] IJsbrand Jan Aalbersberg. Incremental relevance feedback. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1992.

[Jan97] James Jansen. Using an intelligent agent to enhance search engine performance. First Monday, 2(3), March 1997. http://www.firstmonday.dk/issues/issue2_3/jansen/index.html.

[Joa97] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML 97), pages 143-151, 1997.

[Kee92] E. Keen. Term position ranking: Some new test results. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 66-76, 1992. Available at http://www.acm.org/pubs/contents/proceedings/ir/133160/.

[KF96] Grigoris J. Karakoulas and Innes A. Ferguson. A computational market model for multi-agent learning. In AAAI 96 Fall Symposium on Learning Complex Behaviors in Adaptive Intelligent Systems, AAAI Press, 1996.

[KF98] Grigoris J. Karakoulas and Innes A. Ferguson. Applying SIGMA to the TREC-7 filtering track. Unpublished paper obtained from Grigoris Karakoulas, 1998.

[KKR+99] J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models and methods. In Proceedings of the International Conference on Combinatorics and Computing, 1999.

[KMM+97] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, March 1997.

S. R. I

[Leb97] Alexander Lebedev. Best search engines for finding scientific information in the web. Web authored, May 1997. http://www.chem.msu.su/eng/comparison.html.

[LF97] Yannis Labrou and Tim Finin. A proposal for a new KQML specification. Technical Report TR CS-97-03, Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County, Baltimore, MD 21250, February 1997.

[LG98a] Steve Lawrence and C. Lee Giles. Context and page analysis for improved Web search. IEEE Internet Computing, 2(4), 1998.

[LG98b] Steve Lawrence and C. Lee Giles. Inquirus, the NECI meta search engine. In Seventh International World Wide Web Conference, pages 95-105, Brisbane, Australia, 1998. Elsevier Science.

[LG98c] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98, 1998.

[LG99] Steve Lawrence and C. Lee Giles. Accessibility of information on the web. Nature, 400(6740):107-109, 1999.

[LGB99] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999. Working system available at http://citeseer.nj.nec.com/cs.

[Mla99] Dunja Mladenic. Text-learning and related intelligent agents: A survey. IEEE Intelligent Systems, pages 44-54, July 1999.

[MLY+99] W. Meng, K. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the usefulness of search engines. In 15th International Conference on Data Engineering (ICDE'99), Sydney, Australia, March 1999.

[MM00] Alvin Moore and Brian H. Murray. Sizing the internet. July 2000. http://www.cyveillance.com/newsroom/pressr/000710.asp.

[Nie97] Jakob Nielsen. How users read on the web. October 1997. http://www.useit.com/alertbox/9710a.html.

[Nie98] Jakob Nielsen. Why Yahoo is good (but may get worse). November 1998. http://www.useit.com/alertbox/981101.html.

[Nie99a] Jakob Nielsen. July 1999. http://www.useit.com/hotlist/spotlight1999q234.html.

[Nie99b] Jakob Nielsen. 'Top ten mistakes' revisited three years later. May 1999. http://www.useit.com/alertbox/990502.html.

[Nie00] Jakob Nielsen. Is navigation useful? January 2000. http://www.useit.com/alertbox/20000109.html.

[NIH+99] Yoshiki Niwa, Makoto Iwayama, Toru Hisamitsu, Shingo Nishioka, Akihiko Takano, Hirofumi Sakurai, and Osamu Imaichi. Interactive document search with DualNAVI. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, pages 123-130, August 1999, Tokyo, Japan.

[Ope] Open Directory. http://www.dmoz.org.

[Par94] Taemin Kim Park. Toward a theory of user-based relevance: A call for a new paradigm of inquiry. Journal of the American Society for Information Science, 45(3):135-141, 1994.

[PB97] M. Pazzani and D. Billsus. Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27:313-331, 1997.

[PB99] Michael J. Pazzani and Daniel Billsus. Evaluating adaptive web site agents. In Workshop on Recommender Systems: Algorithms and Evaluation, 22nd International Conference on Research and Development in Information Retrieval, 1999.

[Pol99] Gabriela Polticova. Recommending HTML documents using feature guided automated collaborative filtering. In Johann Eder, Ivan Rozman, and Tatjana Welzer, editors, ADBIS Short Papers, pages 81-87, Institute of Informatics, Faculty of Electrical Engineering and Computer Science, Smetanova 17, SI-2000 Maribor, Slovenia, 1999.

[Por80] M. Porter. An algorithm for suffix stripping. Program: Automated Library and Information Systems, 14(3):130-137, 1980.

[RH] Stephen Robertson and David A. Hull. Guidelines for the TREC-9 filtering track. http://www.soi.city.ac.uk/~ser/filterguide.htm.

[Roc71] Joseph J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323, Prentice-Hall, Englewood Cliffs, NJ, 1971.

[Sah98] Mehran Sahami. Using Machine Learning to Improve Information Access. PhD thesis, Stanford University, December 1998.

[Sal71] Gerard Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971.

[SB90] Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[Sch00] Mathew Schwartz. Sharper staples. Computerworld, June 2000.

[SH99] Gabriel L. Somlo and Adele E. Howe. Agent-assisted internet browsing. In Proceedings of the Workshop on Intelligent Information Systems at the 16th National Conference on Artificial Intelligence (AAAI '99), 1999.

[SM83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.

[Sta00] StatMarket. StatMarket search engine ratings. June 2000. http://www.searchenginewatch.com/reports/statmarket.html.

[Su94] Louise T. Su. The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45(3):207-217, 1994.

[Sul00a] Danny Sullivan. Media Metrix search engine ratings. March 2000.

[Sul00b] Danny Sullivan. NPD search & navigation study. June 2000.

[Sul00c] Danny Sullivan. Search engine alliances chart. June 2000. http://searchenginewatch.com/reports/

[Sul00d] Danny Sullivan. Survey reveals search habits. June 2000. http://searchenginewatch.com/sereport/00/06-r.html.

[Tog98] Bruce Tognazzini. Scaling information access. August 1998. http://www.asktog.com/columns/008scaleinfo.html.

[ZFJ97] L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching. In NLANR Web Cache Workshop, June 1997. http://www-nrg.ee.lbl.gov/floyd.