US006999959B1

(12) United States Patent (10) Patent No.: US 6,999,959 B1 Lawrence et al. (45) Date of Patent: Feb. 14, 2006

(54)
(75) Inventors: Stephen R. Lawrence, New York, NY (US); C. Lee Giles, Lawrenceville, NJ (US)
(73) Assignee: NEC Laboratories America, Inc., Princeton, NJ (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 781 days.
(21) Appl. No.: 09/113,751
(22) Filed: Jul. 10, 1998

Related U.S. Application Data
(60) Provisional application No. 60/062,958, filed on Oct. 10, 1997.

(51) Int. Cl. G06F 17/30 (2006.01)
(52) U.S. Cl. ...... 707/5; 707/3; 707/4; 707/10
(58) Field of Classification Search ...... 707/3, 707/5, 2, 4, 6, 7, 10; 345/866, 968; 709/218, 709/224, 227, 245
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
5,913,215 A * 6/1999 Rubinstein et al. ...... 707/10
5,987,446 A * 11/1999 Corey et al. ...... 707/3
6,044,385 A 3/2000 Gross et al. ...... 707/526
6,078,914 A 6/2000 Redfern ...... 707/3
6,092,074 A * 7/2000 Rodkin et al. ...... 707/102
6,094,649 A 7/2000 Bowen et al. ...... 707/3
6,101,491 A 8/2000 Woods ...... 707/3
6,151,624 A * 11/2000 Teare et al. ...... 707/5

OTHER PUBLICATIONS
U.S. Department of Commerce U.S. Patent and Trademark Office, Text Search and Retrieval Examiner Training Manual for the Automated Patent System (APS), Apr. 1996-Feb. 1997, pp. 5-2, 8-6, and 8-9.*
"Finally, a search site for the 20th century", Highway 61, pp. 1-2.
"Highway 61 Features & Options", Highway 61, pp. 1-2.
"Welcome to The Virtual Mirror!", The Virtual Mirror, pp. 1-3.
"New Features: All and Any buttons for more search options! Boolean query support for all search engines", ProFusion, p. 1.
"How does InferenceFind work?", InferenceFind, the intelligent and fast parallel web search, pp. 1-2.
"MAMMA FAQ", Mamma, The Mother of All Search Engines, pp. 1-2.
"Take surveys for us!", chubba, p. 1.
"WebCompass", PC Magazine Online Utility Guide, pp. 1-2.
"WebSeeker", PC Magazine Online Utility Guide, pp. 1-2.
"Internet FastFind", PC Magazine Online Utility Guide, pp. 1-2.
"Frequently Asked Questions", Savvy Search, pp. 1-3.
"The MetaCrawler FAQ", MetaCrawler, pp. 1-3.
"Metacrawlers and Metasearch Engines", Watch, pp. 1-3.
(Continued)

Primary Examiner-Vincent Millin
Assistant Examiner-Ella Colbert

(57) ABSTRACT
A computer implemented meta search engine and search method. In accordance with this method, a query is forwarded to one or more third party search engines, and the responses from the third party search engine or engines are parsed in order to extract information regarding the documents matching the query. The full text of the documents matching the query are downloaded, and the query terms in the documents are located. The text surrounding the query terms are extracted, and that text is displayed.

4 Claims, 34 Drawing Sheets
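The last two steps of the abstract (locating the query terms in a downloaded document and extracting the surrounding text) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function name `extract_contexts`, the `width` parameter, the case-insensitive substring match, and the `...` truncation marker are all assumptions made for illustration.

```python
import re

def extract_contexts(text, query_terms, width=40):
    """Return (term, snippet) pairs for every occurrence of each query term.

    Each snippet is the matched text plus up to `width` characters on
    either side. Matching is a case-insensitive substring search, which
    is an assumption of this sketch, not something the abstract states.
    """
    snippets = []
    for term in query_terms:
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            prefix = "..." if start > 0 else ""        # mark a cut on the left
            suffix = "..." if end < len(text) else ""  # mark a cut on the right
            snippets.append((term, prefix + text[start:end] + suffix))
    return snippets

if __name__ == "__main__":
    page = ("We describe a digital watermarking method for use in "
            "audio, image, video and multimedia data.")
    for term, snippet in extract_contexts(page, ["watermark"], width=10):
        print(term, "->", snippet)
```

Snippets produced this way are what a results page would show in place of the search engine's own summary.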


OTHER PUBLICATIONS
Kihara, E. et al., Multi-agent-based information search systems, Research report from the Information Processing Society, Japan, Incorporated body Information Processing Society, May 17, 1996, vol. 96, Edition 40 (96-DPS-76), pp. 25-30, abstract only.

* cited by examiner

U.S. Patent Feb. 14, 2006 Sheet 1 of 34 US 6,999,959 B1

NECI meta search (14093 queries)
Web Usenet Web Usenet Press Images Citations Journals Tech All

(Hide) Options:

Tomorrow at NECI
CS Talk: "Visual Homing: Surfing on the Epipoles". Ilan Shimshoni, The Technion. Multipurpose Room 2FO, 2d Floor

This is joint work with Ronen Basri and Ehud Rivlin. We introduce a novel method for visual homing. Using this method a robot can be sent to desired positions and orientations in 3-D space specified by single images taken from these positions. Our method is based on recovering the epipolar geometry relating the current image taken by the robot and the target image. Using the epipolar geometry, most of the parameters which specify the differences in position and orientation of the camera between the two images are recovered. However, since not all of the parameters can be recovered from two images, we have developed ...

Lexical Semantics and Information Retrieval Discussion Group Meeting. Coordinator: Robert Krovetz. Board Room 4009, 4th Floor

CUT: Aeppli, Altshuler, Anupindi, Ebbesen, Gottlieb, Omohundro, de Ruyter

[Bar chart: COVERAGE WRT ESTIMATED INDEXABLE WEB SIZE. The percentages 100%, 70% and 50% are legible; the labels are ESTIMATED TOTAL, HOTBOT, EXCITE, NORTHERN LIGHT, ALTA VISTA, INFOSEEK and LYCOS.]

U.S. Patent Feb. 14, 2006 Sheet 2 of 34 US 6,999,959 B1

Individual page timeout: 30
Filter pages to highlight hits when viewing full page:
Filter images from the pages when viewed: Keep
Image classification:
Add URL to track: (Update)

Queries being tracked:
'search engine' Press (Stop tracking query)
signafy Press (Stop tracking query)

URLs being tracked:
Page http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html (Stop tracking URL)
Page http:lly, li?tuxhq. Coalkpatch2.html (Stop tracking URL)
Page ftp:llfp.cygnus.Coalpublegcs/releases (Stop tracking URL)
Page http:llinn. meci.nj.mec. CoalhonepageSigiles (Stop tracking URL)
Page ftp://ftp.kernel.org/publlinux kernellv2.il (Stop tracking URL)

Description of the options on the main page:
Hits: Maximum number of hits to display excluding duplicates
Context: Number of context characters to show either side of the query terms
Cluster: Cluster documents after retrieval
Tracking: Start tracking this query and tell me when new documents appear which match the query
Locality: Only show documents in this domain
Age limit: Filter out documents older than the specified age
Depth: Only show documents with a given subdirectory depth
Images: Filter images and only show photos or graphics

U.S. Patent Feb. 14, 2006 Sheet 3 of 34 US 6,999,959 B1
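The Locality and Depth options described on the options sheet lend themselves to a short Python sketch. The option descriptions do not say how depth is counted or how a domain is matched, so the conventions here are assumptions: depth is the number of directory components in the URL path (the final file name is not counted), the limit is treated as a maximum, and a host matches a domain when it equals it or ends with a dot followed by it. The names `subdirectory_depth` and `passes_constraints` are hypothetical.

```python
from urllib.parse import urlparse

def subdirectory_depth(url):
    """Count directory components in the URL path.

    Assumption of this sketch: a path that does not end in "/" ends in
    a file name, and that file name is not counted.
    """
    path = urlparse(url).path
    parts = [p for p in path.split("/") if p]
    if parts and not path.endswith("/"):
        parts = parts[:-1]  # drop the trailing file name
    return len(parts)

def passes_constraints(url, domain=None, max_depth=None):
    """Apply the (assumed) Locality and Depth rules to one URL."""
    host = urlparse(url).hostname or ""
    if domain is not None and not (host == domain or host.endswith("." + domain)):
        return False  # Locality: outside the requested domain
    if max_depth is not None and subdirectory_depth(url) > max_depth:
        return False  # Depth: nested deeper than allowed
    return True

if __name__ == "__main__":
    url = "http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html"
    print(subdirectory_depth(url))
    print(passes_constraints(url, domain="nasa.gov", max_depth=2))
```

Pass `domain` in lower case, since `urlparse` lower-cases the host name it returns.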

NECI meta search (1816 queries)
Find: Web Usenet Web Usenet Press Images Citations Journals Tech All
(Show) Constraints: (Hide) Options: Hits: Context:
Searching for: nec "digital watermark"
Using: HotBot InfoSeek AltaVista Northern Light Yahoo WebCrawler

Tip: The query term links in the 'searching for' line lead to the Webster dictionary definitions.

[Result list. Each hit shows the page title, size and URL, followed by excerpts of the text surrounding the query terms. Legible hits include the home page of Ingemar J. Cox, Sr. Research Scientist, Computer Vision, NEC Research Institute; NECI Technical Report 95-10 (Ingemar J. Cox, Joe Kilian, Tom Leighton, and Talal Shamoon, December 4, 1995), which describes a digital watermarking method for use in audio, image, video and multimedia data; a course syllabus citing that report; and a news item, "Mark to foil Net pirates" (week of February 20, 1996).]

(... section deleted ...)

U.S. Patent Feb. 14, 2006 Sheet 4 of 34 US 6,999,959 B1

FIG. 4

Ranked pages (first 20):

[Ranked result list. Each entry shows the page title, size and URL, followed by excerpts of the text surrounding the query terms. Legible entries include a weekly technology news digest mentioning NEC "Digital Watermark" technology; a page on the NEC TigerMark DataBlade Module for Informix; and a news page carrying the Reuters item "NEC DEVELOPS DIGITAL WATERMARK TECHNOLOGY" (Princeton, N.J.).]

(... section deleted ...)

U.S. Patent Feb. 14, 2006 Sheet 5 of 34 US 6,999,959 B1

Only 1 search term was found in these documents:

[Result list with context excerpts. Legible entries include the ARIS Technologies home page ("ARIS Technologies is an industry leader in digital watermarking"); a reference list citing Cox, Kilian, Leighton and Shamoon; a news summary on Digimarc's digital watermark technology; and a news item, "NEC Develops World's Smallest Transistor" (Tokyo, Japan, 1997 Sep 11).]

U.S. Patent Feb. 14, 2006 Sheet 6 of 34 US 6,999,959 B1

FIG. 5

No search terms were found in these documents:

[Result list. Legible entries include a "Userdir rule failure" server error page; J. G. Campbell's bookmarks page (permanently relocated, updated 27 August 1997); a notice that the CIOS/Conserve web server address has changed; Arizona Off-Road; a French-language page of Swiss cantonal voting results; two TIME articles ("We Know How the Parisians Felt", Dec. 27, 1971, and "The U.S.: A Policy in Shambles"); and "ClariNet Tearsheet: Government, Business, and General News".]

U.S. Patent Feb. 14, 2006 Sheet 7 of 34 US 6,999,959 B1

FIG. 7

Pages with duplicate context strings to a page above:

[Result list. Legible entries include two copies of a bulletin headed "NEC announces digital watermark" (Reuters reports that NEC claims to have a digital watermark system that could protect digital files, such as still images, video and audio); a DVD FAQ noting that the music industry is requesting a "digital watermark" copy protection feature; a news item, "NEC Develops Digital Watermarking Technique"; and a Dutch-language page, "NEC Ontwikkelt Digital Watermark Technology".]

U.S. Patent Feb. 14, 2006 Sheet 8 of 34 US 6,999,959 B1

FIG. 8

[List of pages that returned errors, for example "Error 404 Not found" for "Labeling Techniques for Multimedia Data" and "Error 404 File Not Found" for "Artisoft Inc. - Industry Awards and Recognition".]

[Links to further search engine result pages, for example Excite Page 2.]

Query expansion (adding these words to the query may help): digitally (16), digitized (15), digit (9), digitization (5), digits (3), digitize (3), watermarking, watermarks

[Summary table with one row per search engine (AltaVista, InfoSeek, Lycos, Northern Light, WebCrawler and Yahoo are legible) and a Total row. One column heading reads "Duplicates"; the other headings and most values are illegible.]

More documents were found but the maximum number of hits was reached.

U.S. Patent Feb. 14, 2006 Sheet 9 of 34 US 6,999,959 B1

FIG. 9

Jump to: nec (2) digital watermark (2)
(Track page)

NECI Technical Report 95-10
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.

Secure Spread Spectrum Watermarking for Multimedia
Ingemar J. Cox, Joe Kilian, Tom Leighton, and Talal Shamoon. December 4, 1995.

We describe a digital watermarking method for use in audio, image, video and multimedia data. We argue that a watermark must be placed in perceptually significant components of a signal if it is to be robust to common signal distortions and malicious attack. However, it is well known that modification of these components can lead to perceptual degradation of the signal. To avoid this, we propose to insert a watermark into the spectral components of the data using techniques analogous to spread spectrum communications, hiding a narrow band signal in a wideband channel that is the data. The watermark is difficult for an attacker to remove, even when several individuals conspire together with independently watermarked copies of the data. It is also robust to common signal and geometric distortions such as digital-to-analog and analog-to-digital conversion, resampling, and requantization, including dithering and recompression and rotation, translation, cropping and scaling. The same digital watermarking algorithm can be applied to all three media under consideration with only minor modifications, making it especially appropriate for multimedia products. Retrieval of the watermark unambiguously identifies the owner, and the watermark can be constructed to make counterfeiting almost impossible. Experimental results are presented to support ...

U.S. Patent Feb. 14, 2006 Sheet 10 of 34 US 6,999,959 B1

[Flowchart. Box and decision labels, in reading order:]
QUERY ENTERED BY USER
MODIFY QUERIES FOR EACH SEARCH ENGINE
SEND QUERIES
THE WEB: HOT BOT, INFOSEEK, EXCITE, ALTA VISTA, ETC.
PARALLEL PAGE RETRIEVAL ENGINE
MORE PAGES TO RETRIEVE AND MAXIMUM HITS NOT REACHED? (YES / NO)
WAIT FOR PAGE TO BE RETRIEVED
PAGE FROM SEARCH ENGINE? (YES / NO)
PARSE SEARCH ENGINE RESPONSE
SEND REQUESTS FOR WEB PAGES
MORE HITS? SEND REQUEST FOR NEXT PAGE OF HITS
ANALYZE DOCUMENT
DOCUMENT MEETS DISPLAY CRITERIA?
DISPLAY DOCUMENT WITH QUERY TERM CONTEXT IMMEDIATELY
DISPLAY RESULTS WITH DIFFERENT RANKING
DISPLAY RESULTS WHICH DID NOT MEET PREVIOUS DISPLAY CRITERIA
DISPLAY SUMMARY STATISTICS

U.S. Patent Feb. 14, 2006 Sheet 11 of 34 US 6,999,959 B1
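The retrieval loop of the flowchart on the preceding sheet (fetch hit pages in parallel, analyze each page as it arrives, show it at once if it meets the display criteria, otherwise hold it for a later section) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented method: the `fetch` callable is injected by the caller, the display criterion is assumed to be "every query term occurs in the page", and paging through further search engine result pages is left out. The names `meta_search`, `shown` and `deferred` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def meta_search(query_terms, hit_urls, fetch, max_hits=10, workers=4):
    """Fetch hit pages in parallel and split them by the display criterion.

    `fetch(url)` must return the page text or raise on failure. Returns
    (shown, deferred): pages containing every query term (capped at
    `max_hits`) and pages that were retrieved but missed the criterion.
    """
    shown, deferred = [], []
    unique_urls = list(dict.fromkeys(hit_urls))  # drop duplicate hits, keep order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in unique_urls}
        for future in as_completed(futures):      # handle pages as they arrive
            url = futures[future]
            try:
                text = future.result().lower()
            except Exception:
                continue                          # page could not be retrieved
            if all(term.lower() in text for term in query_terms):
                if len(shown) < max_hits:
                    shown.append(url)             # would be displayed immediately
            else:
                deferred.append(url)              # displayed in a later section
    return shown, deferred

if __name__ == "__main__":
    pages = {"a": "NEC digital watermark", "b": "digital only"}
    print(meta_search(["nec", "watermark"], ["a", "b", "a"], pages.__getitem__))
```

Because pages are handled in completion order, the order of `shown` can differ between runs; a real system would re-rank the results afterwards, as the flowchart's "different ranking" box suggests.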

FIG. 11

[Flowchart. Box and decision labels, in reading order:]
QUERY ENTERED BY USER
MODIFY QUERIES FOR EACH SEARCH ENGINE
SEND QUERIES
THE WEB: YAHOO, HOT BOT, LYCOS, ALTA VISTA, ETC.
PARALLEL PAGE RETRIEVAL ENGINE
MORE PAGES TO RETRIEVE AND MAXIMUM HITS NOT REACHED? (YES / NO)
WAIT FOR PAGE TO BE RETRIEVED
PAGE FROM SEARCH ENGINE? (YES / NO)
PARSE SEARCH ENGINE RESPONSE
SEND REQUESTS FOR WEB PAGES
MORE HITS? SEND REQUEST TO SEARCH ENGINES FOR NEXT PAGE OF HITS
PAGE IS AN IMAGE? (YES / NO)
ADD IMAGE TO DISPLAY QUEUE
DISPLAY QUEUE FULL? (YES / NO)
CREATE AND DISPLAY MONTAGE, CLEAR DISPLAY QUEUE
ANALYZE DOCUMENT FOR QUERY TERMS TO PREDICT IMAGES MATCHING QUERY
SEND REQUESTS FOR IMAGES PREDICTED TO MATCH QUERY
CREATE AND DISPLAY MONTAGE FOR ANY IMAGES IN DISPLAY QUEUE
DISPLAY SUMMARY STATISTICS

U.S. Patent Feb. 14, 2006 Sheet 12 of 34 US 6,999,959 B1
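The display-queue steps of the image search flowchart (add each retrieved image to a queue, emit a montage when the queue is full, then emit one last montage for whatever remains) amount to fixed-size batching. A minimal Python sketch follows; the generator name `montage_batches` and the `queue_size` parameter are hypothetical, and building the actual montage image is out of scope.

```python
def montage_batches(images, queue_size):
    """Yield lists of images, one list per montage.

    A montage is emitted whenever the display queue holds `queue_size`
    images; any leftover images form a final, smaller montage.
    """
    queue = []
    for image in images:
        queue.append(image)          # ADD IMAGE TO DISPLAY QUEUE
        if len(queue) == queue_size:  # DISPLAY QUEUE FULL?
            yield list(queue)         # create and display montage
            queue.clear()             # clear display queue
    if queue:
        yield list(queue)             # montage for any images still queued

if __name__ == "__main__":
    for montage in montage_batches(["a", "b", "c", "d", "e"], 2):
        print(montage)
```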

NECI meta search (1341 queries)
Web Usenet Web Usenet Press Images Citations Journals Tech All
(Hide) Constraints: Locality: Any  Age limit: None  Depth: Any  Images: Photos
Hits:  Context: 100  Cluster  Tracking
Searching for: koala
Using: WebSeer Corel Lycos Yahoo HotBot AltaVista

Tip: The bar to the left of the titles is longer when the query terms are closer together in the document.

More images were found but the maximum number of hits was reached

U.S. Patent Feb. 14, 2006 Sheet 13 of 34 US 6,999,959 B1

FIG. 13

[Links to further search engine result pages, for example HotBot Images Page 2, Page 3.]

AltaVista Images | Yes | 0 | 0 | 0 | 0
Corel Images | Yes | 7 | 7 | 7 | 0
HotBot Images | Yes | 511 | 125 | 99 |
Lycos Images | Yes | 222 | 80 | 85 |
WebSeer | Yes | 0 | 0 | 0 | 0
Yahoo Images | Yes | 4 | 4 | 4 | 1
Total | | 5744 | 215 | 195 | 3

More documents were found but the maximum number of hits was reached.

Filtered due to size: 12
Filtered due to type: 21

U.S. Patent Feb. 14, 2006 Sheet 14 of 34 US 6,999,959 B1

Web Usenet Web Usenet Press Images
(Hide) Constraints: Locality: Any  Age limit: None  Depth: Any  Images: Graphical
(Hide) Options: Hits:
Searching for: koala
Using: WebSeer Corel Lycos Yahoo HotBot AltaVista

Tip: You can search for links to a specific page, e.g. link: ny, neci ninec. Coahonepagesales. Self links are excluded.

More images were found but the maximum number of hits was reached

This search: koala
Search engine pages: AltaVista Images, Corel Images, HotBot Images, Lycos Images Page 2

AltaVista Images | Yes | 0 | 0 | 0 | 0
Corel Images | Yes | 7 | 7 | 7 | 0
HotBot Images | Yes | 0 | 0 | 0 | 0
WebSeer | Yes | 0 | 0 | 0 | 0
Yahoo Images | Yes | 4 | 4 | 4 | 1
Total | | 28 | 131 | 139 | 6

More documents were found but the maximum number of hits was reached.

Filtered due to size: 2
Filtered due to type: 61

U.S. Patent Feb. 14, 2006 Sheet 15 of 34 US 6,999,959 B1

FIG. 15

[List of phrases. Legible entries include: artificial neural networks; university of texas; classifiers; kagan tumer; department of electrical; combined neural classifiers; hybrid intelligent architecture; radial basis function; classifier; boundary; distributions; estimating the bayes boundaries in linearly; pattern recognition; ismail taha; international conference; abstract; paper; austin; utexas.]

U.S. Patent Feb. 14, 2006 Sheet 16 of 34 US 6,999,959 B1

FIG. 15

Cluster summaries:

University of texas

[Document excerpts showing the text surrounding the phrase. Legible fragments include journal and conference paper listings by Ismail Taha and Joydeep Ghosh (for example "Evaluation and Ordering of Rules Extracted from Feedforward Networks"); Tech. Rep. TR-97-01-06, The Computer and Vision Research Center, University of Texas, Austin, 1996; contact details for Joydeep Ghosh at The University of Texas at Austin; and a paper by Yoan Shin and Joydeep Ghosh, Department of Electrical and Computer Engineering, The University of Texas at Austin.]

artificial neural networks

[Document excerpts showing the text surrounding the phrase. Legible fragments include "A Habituation Based Mechanism for Encoding Temporal Information in Artificial Neural Networks", by Bryan W. Stiles and Joydeep Ghosh (invited paper, Proc. SPIE Conf. on Applications and Science of Artificial Neural Networks, Orlando, April 1995); and "When Networks Disagree: Ensemble Methods for Neural Networks", Chapter 10, Artificial Neural Networks for Speech and Vision, London, 1993.]

U.S. Patent Feb. 14, 2006 Sheet 17 of 34 US 6,999,959 B1

FIG. 17

HUSKY SEARCH
Query: (joydeep ghosh)
Documents: 102, Clusters: 14, Average Cluster Size: 11.21 documents

[Cluster table listing, for each cluster, its phrases and sample document titles. Legible phrases include "Artificial", "Hybrid intelligent", "domain knowledge", "Kagan Tumer", "neural classifiers", "pattern classifiers", "of Electrical and Computer", "University of Texas at Austin" and "Journal". Legible sample titles include "Kagan Tumer's Publications", "LANS Home Page", "Untitled" and "Refereed Archival Journal Publications: Full Regular Papers".]

U.S. Patent Feb. 14, 2006 Sheet 18 of 34 US 6,999,959 B1

[Sheet 18 of 34; figure label extracted as "F.G. 19". Continuation of a cluster listing, with entries for Clusters 9, 12 and 13 and an "Unclustered" group. Remaining screenshot text not recoverable.]

[Sheet 19 of 34; no figure label survives. AltaVista page grouping related terms with percentages, for example "26% Ghosh, neural, artificial, networks, sonar, austin, classification, texas, signals" and "23% classifiers, kagan, classifier, combiners, combining, recognition, pattern, proceedings, ensemble". Remaining screenshot text not recoverable.]

FIG. 20 (Sheet 20 of 34): list of phrases, including "artificial neural networks", "neural networks", "ieee international conference", "university of texas", "pacific northwest laboratory", "recurrent neural networks", "self organizing map", "artificial intelligence", "pattern recognition", "fuzzy logic", "genetic algorithms", "technical report" and "machine learning".

[Sheet 21 of 34; no figure label survives. Screenshot of results on repetitive stress injuries, carpal tunnel syndrome and the Typing Injury FAQ, with phrase suggestions combined by AND and OR. Remaining screenshot text not recoverable.]

[Sheet 22 of 34; no figure label survives. Meta search engine response. Searching for: "NASDAQ stands for" "NASDAQ is an abbreviation" "NASDAQ means". Using: HotBot, Infoseek, AltaVista, Excite, Lycos, Northern Light, Yahoo, WebCrawler. Each "Ref:" entry shows the text around a match, for example "NASDAQ is an abbreviation for the National Association of Securities Dealers Automated Quotation system." The page ends with a per-engine summary table that includes a "Duplicates" entry; its values are not recoverable.]
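The summary table in this screenshot has a "Duplicates" entry, and the description states that pages are considered duplicates if their relevant context strings are identical. A minimal Python sketch of that rule follows. The function name `drop_duplicates`, the (url, contexts) input shape and the whitespace and case normalization are assumptions made for illustration, not details taken from the patent.

```python
def drop_duplicates(hits):
    """Split hits into unique pages and duplicates.

    hits: list of (url, contexts) pairs, where contexts is the list of
    text snippets found around the query terms on that page.
    The first page seen with a given set of contexts is kept.
    """
    seen = set()
    unique, duplicates = [], []
    for url, contexts in hits:
        # Assumption: ignore case and runs of whitespace when comparing.
        key = tuple(" ".join(c.lower().split()) for c in contexts)
        if key in seen:
            duplicates.append(url)
        else:
            seen.add(key)
            unique.append(url)
    return unique, duplicates
```

Because only the context strings are compared, two copies of a page that differ in header or footer text still collapse into one hit.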

FIG. 23 (Sheet 23 of 34): Infoseek response to a question about what NASDAQ stands for. The page reports "Infoseek found 23,054,238 pages containing at least one of these words". The listed results are mostly stock and company listing pages. [Remaining screenshot text not recoverable.]
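The Infoseek screenshot above appears to receive the question as typed, while the meta search engine screenshot before it searches for quoted phrases such as "NASDAQ stands for". How the engine derives those phrases is not described in this part of the text, so the following Python sketch is only one plausible rule-based reading. The rule table `RULES`, the function `expressive_forms` and the exact patterns are hypothetical.

```python
import re

# Hypothetical rewrite rules: (question pattern, phrase templates).
RULES = [
    (re.compile(r"^what does (?P<x>.+?) stand for\??$", re.I),
     ["{x} stands for", "{x} is an abbreviation", "{x} means"]),
    (re.compile(r"^how is (?:a |an |the )?(?P<x>.+?) created\??$", re.I),
     ["{x} is created", "{x} is produced", "{x} is made", "makes a {x}"]),
    (re.compile(r"^what is (?:a |an |the )?(?P<x>.+?)\??$", re.I),
     ["{x} is", "{x} refers to", "{x} means"]),
]


def expressive_forms(question):
    """Turn a question into quoted phrases likely to appear in an answer.

    Falls back to the unchanged query when no rule matches.
    """
    q = " ".join(question.split())
    for pattern, templates in RULES:
        match = pattern.match(q)
        if match:
            x = match.group("x")
            return ['"' + t.format(x=x) + '"' for t in templates]
    return [q]
```

Each returned phrase would be sent to the underlying engines as an exact-phrase query, which is why the results in the meta search screenshots contain sentences that answer the question directly.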

[Sheet 24 of 34; no figure label survives. Meta search engine response. Searching for quoted phrases including "rainbow is created", "makes a rainbow", "rainbow is produced" and "rainbow is made". Using: HotBot, Infoseek, AltaVista, Excite, Lycos, Northern Light, Yahoo, WebCrawler. Each "Ref:" entry shows the text around a match, for example "A rainbow is created when rays of sunlight enter a raindrop, bounce around inside, and exit." Remaining screenshot text not recoverable.]
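The "Ref:" entries in this screenshot show text surrounding each matched phrase. The description says that the amount of context is given as a number of characters on either side of the query terms and that most non-alphanumeric characters are filtered out. A minimal Python sketch under those two statements follows; the function name `extract_contexts`, the default width and the exact set of characters kept are assumptions.

```python
import re


def extract_contexts(text, terms, width=40):
    """Return one cleaned snippet per occurrence of any query term.

    width is the number of characters kept on either side of the match.
    """
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        raw = text[start:end]
        # Assumption: keep letters, digits, whitespace and a little punctuation.
        cleaned = re.sub(r"[^A-Za-z0-9\s.,'\-]", " ", raw)
        snippets.append(" ".join(cleaned.split()))
    return snippets
```

A larger `width` gives the reader more surrounding text per hit at the cost of a longer results page.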

FIG. 25 (Sheet 25 of 34): Infoseek response to the query "how is a rainbow created?". The page reports "Infoseek found 20,534,341 pages containing at least one of these words". The listed results include pages on Photoshop rainbow effects, sports networks and pet loss. [Remaining screenshot text not recoverable.]

[Sheet 26 of 34; no figure label survives. Meta search engine response. Searching for quoted phrases about "mealy machine", including "mealy machine is" and "mealy machine refers to". Using: HotBot, Infoseek, AltaVista, Excite, Lycos, Northern Light, Yahoo, WebCrawler. Each "Ref:" entry shows the text around a match, for example "a Mealy machine is one in which the outputs are a function of both the inputs and the current state-variables". Remaining screenshot text not recoverable.]

[Sheet 27 of 34; no figure label survives. Tracking page of the meta search engine, with sections "Recently modified URLs" (with a "Stop tracking URL" control), "Recent documents matching: signafy" (with "Mark as seen" and "Stop tracking query" controls) and "Recent articles about NEC Research in the press". Tip shown on the page: "Clicking on the search engine links in the 'Searching for' line will show the search engine response to the current query." Remaining screenshot text not recoverable.]

[Sheet 28 of 34; figure label extracted as "FIG. 29". Page-tracking view of the Digital Equipment Corporation Systems Research Center "SRC Publications List". Under "New text:" it shows the added entry "1. Paul McJones and John DeTreville. Each to Each programmer's reference manual. Technical Note 1997-023, Digital Equipment Corporation Systems Research Center, Palo Alto, CA, October 1997." Remaining screenshot text not recoverable.]
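This screenshot reports the newly added text of a tracked page. As a rough illustration of reporting which lines have changed, the following Python sketch compares two saved copies of a page with the standard `difflib` module. The function name `new_lines` and the sample page text in the usage are hypothetical, and the patent does not say which comparison algorithm the engine uses.

```python
import difflib


def new_lines(old_text, new_text):
    """Return the lines of new_text that were added or changed."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    added = []
    for line in difflib.ndiff(old, new):
        # ndiff prefixes lines unique to the second sequence with "+ ".
        if line.startswith("+ "):
            added.append(line[2:])
    return added


if __name__ == "__main__":
    before = "SRC Publications List\n2. Paper B"
    after = "SRC Publications List\n1. Paper A\n2. Paper B"
    print(new_lines(before, after))
```

Storing the previous copy of each tracked page is enough for this approach; an empty result means the page text is unchanged.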

FIG. 29 (Sheet 29 of 34): bar chart titled "Coverage WRT 6 engine total", vertical axis 0% to 100%, with a legend entry extracted as "2S ENGINES" and bars for HotBot, Excite, Northern Light, AltaVista, Infoseek and Lycos. [Bar values not recoverable.]

FIG. 30 (Sheet 29 of 34): plot with horizontal axis "Number of engines". [Other labels and values not recoverable.]

FIG. 31 (Sheet 30 of 34): diagram labeled "Total documents in the indexable Web (excluding pages not considered by search engines)", "Fraction covered by the engines", "Documents returned by engine a", "Documents returned by engine b" and "Documents returned by both engines a and b".
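The diagram relates the documents returned by engine a, by engine b and by both to the size of the indexable Web. This part of the text does not give the formula, so the following Python sketch uses a standard overlap (capture-recapture) estimate as an assumption: if the two engines index pages independently, the share of b's results that a also returns approximates a's coverage. The function names are hypothetical.

```python
def coverage_estimate(n_b, n_both):
    """Estimate the fraction of the indexable Web covered by engine a.

    n_b:    number of documents returned by engine b
    n_both: number of documents returned by both engines a and b
    Assumes the two engines index pages independently of each other.
    """
    if n_b <= 0:
        raise ValueError("engine b returned no documents")
    return n_both / n_b


def web_size_estimate(size_a, n_b, n_both):
    """Estimate the indexable Web size from engine a's known index size."""
    coverage = coverage_estimate(n_b, n_both)
    if coverage == 0:
        raise ValueError("no overlap, size cannot be estimated")
    return size_a / coverage
```

For instance, if engine a also returns 50 of the 200 documents returned by engine b, its estimated coverage is 0.25, and an index of 100 million pages would suggest an indexable Web of about 400 million pages. The independence assumption rarely holds exactly, so such a figure is best read as a rough estimate.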

FIG. 32 (Sheet 30 of 34): bar chart titled "Coverage WRT estimated 'indexable Web' size", vertical axis up to 100%, with bars for the estimated total, HotBot, Excite, Northern Light, AltaVista, Infoseek and Lycos. [Bar values not recoverable.]

FIG. 33 (Sheet 31 of 34): panels titled "AltaVista response time" and "Excite response time". [Axis values not recoverable.]

FIG. 34 (Sheet 32 of 34): panels titled "Infoseek response time", "Lycos response time" and "Northern Light response time", with a horizontal axis labeled "Time", and a further panel whose title extracts as "Response time for the first 3C of GWEB search engines". [Axis values not recoverable.]

FIG. 35 (Sheet 33 of 34): bar chart titled "Median response time" for HotBot, Excite, Northern Light, AltaVista, Infoseek and Lycos. [Values not recoverable.]

[Second figure on Sheet 33 of 34; label extracted as "F.G. 35". Plot with horizontal axis "Number of search engines". Other labels and values not recoverable.]

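FIG. 37 (Sheet 34 of 34): panel titled "All pages response time", horizontal axis "Time". [Values not recoverable.]

FIG. 38: plot with horizontal axis "Number of pages". [Other labels and values not recoverable.]

FIG. 39: plot comparing "Response time for the first result from the meta engine" (time to first result) with "Time for Web search engine to respond", horizontal axis "Time". [Values not recoverable.]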
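FIG. 39 compares the time to the first result from the meta engine with the time for a Web search engine to respond. Returning each page as soon as it has been downloaded and analyzed can be sketched in Python with a thread pool. The `fetch` function only simulates a download with `time.sleep`, and the page names and delays are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, delay):
    """Stand-in for downloading and analyzing one page.

    delay simulates how long the remote site takes to respond.
    """
    time.sleep(delay)
    return url


def progressive_results(pages):
    """Yield each page as soon as its download finishes, fastest first.

    pages: non-empty list of (url, delay) pairs.
    """
    with ThreadPoolExecutor(max_workers=len(pages)) as pool:
        futures = [pool.submit(fetch, url, delay) for url, delay in pages]
        for future in as_completed(futures):
            yield future.result()


if __name__ == "__main__":
    demo = [("slow.example", 0.3), ("fast.example", 0.01), ("medium.example", 0.15)]
    for url in progressive_results(demo):
        print(url)
```

With this structure the first result is available after the fastest download rather than after the slowest one, which is the effect the figure measures.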

METASEARCH ENGINE

This application is a conversion of copending provisional application 60/062,958, filed Oct. 10, 1997.

BACKGROUND OF THE INVENTION

A number of useful and popular search engines attempt to maintain full text indexes of the World Wide Web. For example, search engines are available from AltaVista, Excite, HotBot, Infoseek, Lycos and Northern Light. However, searching the Web can still be a slow and tedious process. Limitations of the search services have led to the introduction of meta search engines. A meta search engine searches the Web by making requests to multiple search engines such as AltaVista or Infoseek. The primary advantages of current meta search engines are the ability to combine the results of multiple search engines and the ability to provide a consistent user interface for searching these engines. Experimental results show that the major search engines index a relatively small amount of the Web and that combining the results of multiple engines can therefore return many documents that would otherwise not be found.

A number of meta search engines are currently available. Some of the most popular ones are MetaCrawler, Inference Find, SavvySearch, Fusion, ProFusion, Highway 61, Mamma, Quarterdeck WebCompass, Symantec Internet FastFind, and ForeFront WebSeeker.

The principal motivation behind the basic text meta search capabilities of the meta search engine of this invention was the poor precision, limited coverage, limited availability, limited user interfaces, and out of date databases of the major Web search engines. More specifically, the diverse nature of the Web and the focus of the Web search engines on handling relatively simple queries very quickly lead to search results often having poor precision. Additionally, the practice of "search engine spamming" has become popular, whereby users add possibly unrelated keywords to their pages in order to alter the ranking of their pages. The relevance of a particular hit is often obvious only after waiting for the page to load and finding the query term(s) in the page.

Experience with using different search engines suggests that the coverage of the individual engines was relatively low, i.e. searching with a second engine would often return several documents which were not returned by the first engine. It has been suggested that AltaVista limits the number of pages indexed per domain, and that each search engine has a different strategy for selecting pages to index. Experimental results confirm that the coverage of any one search engine is very limited.

In addition, due to search engine and/or network difficulties, the engine which responds the quickest varies over time. It is possible to add a number of features which enhance usability of the search engines. Centralized search engine databases are always out of date. There is a time lag between the time when new information is made available and the time that it is indexed.

SUMMARY OF THE INVENTION

An object of this invention is to improve meta search engines.

Another object of the present invention is to provide a meta search engine that analyzes each document and displays local context around the query terms.

A further object of this invention is to provide a search method that improves on the efficiency of existing search methods.

A further object of this invention is to provide a meta search engine that is capable of displaying the context of the query terms, advanced duplicate detection, progressive display of results, highlighting query terms in the pages when viewed, insertion of quick jump links for finding the query terms in large pages, dramatically improved precision for certain queries by using specific expressive forms, improved relevancy ranking, improved clustering, and image search.

These and other objectives are attained with a computer implemented meta search engine and search method. In accordance with this method, a query is forwarded to a number of third party search engines, and the responses from the third party search engines are parsed in order to extract information regarding the documents matching the query. The full text of the documents matching the query is downloaded, and the query terms in the documents are located. The text surrounding the query terms is extracted, and that text is displayed.

The engine downloads the actual pages corresponding to the hits and searches them for the query terms. The engine then provides the context in which the query terms appear rather than a summary of the page (none of the available search engines or meta search services currently provide this option). This typically provides a much better indication of the relevance of a page than the summaries or abstracts used by other search engines, and it often helps to avoid looking at a page only to find that it does not contain the required information. The context can be particularly helpful whenever a search includes terms which may occur in a different context to that required. The amount of context is specified by the user in terms of the number of characters either side of the query terms. Most non-alphanumeric characters are filtered from the context in order to produce more readable and informative results.

Results are returned progressively after each individual page is downloaded and analyzed, rather than after all pages are downloaded. The first result is typically displayed faster than the average time for a search engine to respond. When multiple pages provide the information required, the architecture of the meta engine can be helpful because the fastest sites are the first ones to be analyzed and displayed.

When viewing the full pages corresponding to the hits, these pages are filtered to highlight the query terms, and links are inserted at the top of the page which jump to the first occurrence of each query term. Links at each occurrence of the query terms jump to the next occurrence of the respective term. Query term highlighting helps to identify the query terms and page relevance quickly. The links help to find the query terms quickly in large documents.

Pages which are no longer available can be identified. These pages are listed at the end of the response. Some other meta search services also provide "dead link" detection; however, the feature is usually turned off by default and no results are returned until all pages are checked. For the meta search engine of this invention, however, the feature is intrinsic to the architecture of the engine, which is able to produce results both incrementally and quickly.

Pages which no longer contain the search terms or that do not properly match the query can be identified. These pages are listed after pages which properly match the query. This can be very important: different engines use different relevance techniques, and if just one engine returns poor relevance results, this can lead to poor results from standard meta search techniques.

The tedious process of requesting additional hits can be avoided. The meta search engine understands how to extract the URL for requesting the next page of hits from the individual search engine responses. More advanced detection of duplicate pages is done. Pages are considered duplicates if the relevant context strings are identical. This allows the detection of a duplicate if the page has a different header or footer.

U.S. Pat. No. 5,659,732 (Kirsch) presents a technique for relevance ranking with meta search techniques wherein the underlying search engines are modified to return extra information such as the number of occurrences of each search term in the documents and the number of occurrences in the entire database. Such a technique is not required for the meta search engine of this invention because the actual pages are downloaded and analyzed. It is therefore possible to apply a uniform ranking measure to documents returned by different engines. Currently, the engine displays pages in descending order of the number of query terms present in the document (if none of the first few pages contain all of the query terms, then the engine initially displays results which contain the maximum number of query terms found in a page so far). After all pages have been downloaded, the engine then relists the pages according to a simple relevance measure.

This measure currently considers the number of query terms present in the document, the proximity between query terms, and term frequency (the usual inverse document frequency may also be useful; see Salton, G. (1989), Automatic text processing: the transformation, analysis and retrieval of information by computer, Addison-Wesley):

$$R = c_1 N_p + \frac{c_2 - \dfrac{\sum_{i=1}^{N_p-1} \sum_{j=i+1}^{N_p} \min\bigl(d(i,j),\, c_2\bigr)}{\sum_{k=1}^{N_p-1} (N_p - k)}}{c_2 / c_1} + \frac{N_t}{c_3}$$

where $N_p$ is the number of query terms that are present in the document (each term is counted only once), $N_t$ is the total number of query terms in the document, $d(i, j)$ is the minimum distance between the ith and the jth of the query terms which are present in the document (currently in terms of the number of characters), $c_1$ is a constant which controls the overall magnitude of the relevance measure $R$, $c_2$ is a constant specifying the maximum distance between query [...]

[...] queries, automatically informing users when new documents are found which match a given query. The engine is capable of tracking the text of a given page, automatically informing the user when the text changes and which lines have changed. The engine includes an advanced clustering technique which improves over the clustering done in existing search engines. A specific expressive forms search technique can dramatically improve precision for certain queries. A new query expansion technique can automatically perform intelligent query expansion.

Additional features which could easily be added to the meta search engine of this invention include: improved relevance measures; alternative ordering methods, e.g. by site; field searching, e.g. page title, Usenet message subject, hyperlink text; rules and/or learning methods for routing queries to specific search engines; word sense disambiguation; and relevance feedback.

Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the home page of the meta search engine of this invention.

FIG. 2 shows the options page of the meta search engine of this invention.

FIGS. 3-8 show, respectively, first through sixth portions of a sample response of the meta search engine of the present invention for the query nec and "digital watermark."

FIG. 9 shows a sample page view for the meta search engine of this invention.

FIG. 10 is a simplified control flow chart of the meta search engine of the present invention.

FIG. 11 is a simplified control flow chart for image meta search.

FIG. 12 shows a first portion of a sample response of the meta search engine of this invention for the query koala in the image databases, filtered for photos.

FIG. 13 shows a second portion of a sample response of the meta search engine of this invention for the query koala in the image databases, filtered for photos.

FIG. 14 shows a sample response of the meta search engine of this invention for the query koala in the image databases, filtered for graphics.

FIG.
15 shows clusters for the query “joydeep ghosh.” terms which is considered useful, and c is a constant FIG. 16 shows the first two cluster Summaries for the Specifying the importance of term frequency (currently 50 query “joy deep ghosh.” c=100, C=5000, and c=10C) This measure is used for FIG. 17 shows the first part of the clusters for the query pages containing more than one of the query terms, when “joydeep ghosh' from HuskySearch. only one query term is found the terms distance from the FIG. 18 shows the second part of the clusters for the query Start of the page is used. 55 “joydeep ghosh' from HuskySearch. This ranking criterion is particularly useful with Web FIG. 19 shows clusters for the query “joydeep ghosh' searches. A query for multiple terms on the Web often from AltaVista. returns documents which contain all terms, but the terms are FIG. 20 shows clusters produced by the meta search far apart in the document and may be in unrelated Sections engine of this invention for the query “neural network.” of the page, e.g. in Separate Usenet messages archived on a 60 FIG. 21 shows clusters produced by the meta search Single Web page, or in Separate bookmarks on a page engine of this invention for the query typing and injury along containing a list of bookmarkS. with the first cluster summary. The engine does not use the lowest common denominator FIG. 22 shows the response of the meta Search engine of in terms of the Search Syntax. The engine Supports all the present invention for the query What does NASDAQ common Search formats, including boolean Syntax. Queries 65 stand for? are dynamically modified in order to match each individual FIG. 23 shows the response of Infoseek for the query query Syntax. The engine is capable of tracking the results of What does NASDAQ Stand for? US 6,999,959 B1 S 6 FIG. 
24 shows the response of the meta search engine of this invention for the query How is a rainbow created?
FIG. 25 shows the response of Infoseek for the query How is a rainbow created?
FIG. 26 shows the response of the meta search engine of the present invention for the query What is a mealy machine?
FIG. 27 shows a sample home page showing new hits for a query and recently modified URLs.
FIG. 28 shows a sample page view showing the text which has been added to the page since the last time it was viewed.
FIG. 29 shows the coverage of each of six search engines with respect to the combined coverage of all six.
FIG. 30 shows the coverage as the number of search engines is increased.
FIG. 31 shows a comparison of the overlap between search engines to the number of documents returned from all six engines combined.
FIG. 32 shows the coverage of each search engine with respect to the estimated size of the indexable Web.
FIGS. 33 and 34 show histograms of the major search engine response times, and a histogram of the response time for the first response when queries are made to the six engines simultaneously.
FIG. 35 shows the median time for the Web search engines to respond.
FIG. 36 shows the median time for the first of n Web search engines to respond.
FIG. 37 shows the response time for arbitrary Web pages.
FIG. 38 shows the median time to download the first of n pages requested simultaneously.
FIG. 39 shows the time for the meta engine to display the first result.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

One of the fundamental features of the meta search engine of this invention is that it analyzes each document and displays local context around the query terms. The benefit of displaying the local context, rather than an abstract or summary of the document, is that the user may be able to more readily determine if the document answers his or her specific query. In essence, this technique admits that the computer may not be able to accurately determine the relevance of a particular document, and in lieu of this ability, formats the information in the best way for the user to quickly determine relevance. A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially in the case of Web search where the database is very large, diverse, and poorly organized.

The idea of querying and collating results from multiple databases is not new. Companies like PLS, Lexis-Nexis, and Verity have long since created systems which integrate the results of multiple heterogeneous databases. Many other Web meta search services exist such as the popular and useful MetaCrawler service. Services similar to MetaCrawler include SavvySearch, Inference Find, Fusion, ProFusion, Highway 61, Mamma, Quarterdeck WebCompass, Metabot, Symantec Internet FastFind, and WebSeeker.

FIG. 1 shows the home page of the meta search engine of this invention. The bar 12 at the top contains links for the options page, the help page, and the submission of suggestions and problems. Queries are entered into the "Find:" box 14. The selection of which search engines to use for the query is made by clicking on the appropriate selection on the following line. The options are currently:

1. Web - standard Web search engines: (a) AltaVista, (b) Excite, (c) Infoseek, (d) HotBot, (e) Lycos, (f) Northern Light, (g) WebCrawler, and (h) Yahoo.
2. Usenet Databases - indexes of Usenet newsgroups: (a) AltaVista, (b) DejaNews, (c) Reference.com.
3. Press - indexes of press articles and news wires: (a) Infoseek NewsWire, Industry, and Premier sources (c/o Infoseek: Reuters, PR NewsWire etc.), and (b) NewsTracker (c/o Excite: online newspapers and magazines).
4. Images - image indexes: (a) Corel (Corel image database), (b) HotBot (HotBot images), (c) Lycos (Lycos images), (d) WebSeer (WebSeer images), (e) Yahoo (Yahoo images), and (f) AltaVista (AltaVista images).
5. Journals - academic journals: (a) Science.
6. Tech - technical news: (a) TechWeb and (b) ZDNet.
7. All - all of the above.

The constraints menu 16 follows which contains options for constraining the results to specific domains, specific page ages, and specific image types. The main options menu 20 follows which contains options for selecting the maximum number of results, the amount of context to display around the query terms (in characters), and whether or not to activate clustering or tracking.

The options link on the top bar allows setting a number of other options, as shown at 22 in FIG. 2. These options are: 1. The timeout (per individual page download), 2. Whether or not to filter the pages when viewed, 3. Whether or not to filter images from the pages when viewed, 4. Whether each search displays results in a new window or not, and 5. Whether or not to perform image classification (for manual classification of images). Additionally, the options page shows at 24 and 26 which queries and URLs are being tracked for changes, and allows entering a new URL to track.

FIGS. 3 to 8 show a sample response of the meta search engine of this invention for the query nec and "digital watermark". FIG. 3 shows the top portion of the response from the search. The search form can be seen at the top, followed by a tip 30 which may be query sensitive. Results which contain all of the query terms are then displayed as they are retrieved and analyzed (as mentioned before, if none of the first few pages contain all of the query terms then the engine initially displays results which contain the maximum number of query terms found in a page so far). The bars 32 to the left of the document titles indicate how close the query terms are in the documents; longer bars indicate that the query terms are closer together. The engine which found the document, the age of the document, the size of the document, and the URL follow the document title.

After the pages have been retrieved, the engine then displays the top 20 pages ranked using term proximity information (FIG. 4). In descending order, and referring to FIGS. 5 to 8, the engine then displays those pages which contain fewer query terms, those pages which contain none of the query terms, those pages which contain duplicate context strings, and those pages which could not be downloaded. Links to the search engine pages which were used are then provided, followed by terms which may be useful for query expansion. With reference to FIG. 8, the engine then displays a summary box with information on the number of documents found from each individual engine, the number retrieved and processed, and the number of duplicates.

FIG. 9 shows a sample of how the individual pages are processed when viewed. The links 40 at the top jump to the first occurrence of the query terms in the document, and indicate the number of occurrences. The Track Page link activates tracking for this page; the user will be informed when and how the document changes.

The engine comprises two main logical parts: the meta search code and a parallel page retrieval daemon. Pseudocode for (a simplified version of) the search code is as follows:

Process the request to check syntax and create regular expressions which are used to match query terms
Send requests (modified appropriately) to all relevant search engines
Loop for each page retrieved until maximum number of results or all pages retrieved
    If page is from a search engine
        Parse search engine response extracting hits and any link for the next set of results
        Send requests for all of the hits
        Send requests for the next set of results if applicable
    Else
        Check page for query terms and create context strings
        Print page information and context strings if all query terms are found and duplicate context strings have not been encountered before
    Endif
End loop
Re-rank pages using proximity and term frequency information
Print page information and context strings for pages which contained some but not all query terms
Print page information for pages which contained no query terms
Print page information and context strings for pages which contain duplicate context strings
Print page information for pages which could not be downloaded
Print summary statistics.

FIG. 10 shows a simplified control flow diagram 50 of the meta search engine. The page retrieval engine is relatively simple but does incorporate features such as queuing requests and balancing the load from multiple search processes, and delaying requests to the same site to prevent overloading a site. The page retrieval engine comprises a dispatch daemon and a number of client retrieval processes. Pseudocode for (a simplified version of) the dispatch daemon is as follows:

Start clients
Loop
    Check for timeout of active clients
    Send any queued requests if possible, balancing load for requests from multiple search processes
    If there is a message from a client
        If message is "replace me" replace the client with a new process
        If message is "done" update client information
        If message is "status" return status
        If message is "get" then
            If all clients are busy or a request has been made to this site within the last X seconds then queue the request
            Otherwise send request to a client
        Endif
    Endif
End loop

The client processes simply retrieve the relevant pages, handling errors and timeouts, and return the pages directly to the appropriate search process.

The algorithm used for image meta search in the meta search engine of this invention is as follows:

Process the request to check syntax and create regular expressions which are used to match query terms
Send requests (modified appropriately) to all relevant image search engines
Loop for each page retrieved until maximum number of images or all pages retrieved
    If page is from a search engine
        Parse search engine response extracting hits and any link for the next set of results
        Send requests for all of the hits
        Send request for the next set of results if applicable
    Else if page is an image
        Add image to the display queue
    Else
        Analyze query term locations in the page and predict which (if any) of the images on the page corresponds to the query; send a request to download this image if found
    Endif
    If n images are in the display queue
        Create a single image montage of the images in the queue
        Display the montage as a clickable image where each portion of the image corresponding to the original individual images shows a detail page for the original image
    Endif
End loop
If any images are in the display queue
    Create a single image montage of the images in the queue
    Display the montage as a clickable image where each portion of the image corresponding to the original individual images shows a detail page for the original image
Endif
Print summary statistics

FIG. 11 shows a simplified control flow diagram 60 for the image meta search algorithm.

Image Classification

The Web image search engine WebSeer attempts to classify images as photographs or graphics. WebSeer extracts a number of features from the images and uses decision trees for classification. We have implemented a similar image classification system. However, we use a different feature set and use a neural network for classification. FIGS. 12 and 13 show the response of the meta search engine of this invention to the image query koala, with the images filtered for photos. FIG. 14 shows the response when filtering for graphics.

Document Clustering

Document clustering methods typically produce non-overlapping clusters. For example, Hierarchical Agglomerative Clustering (HAC) algorithms, which are the most commonly used algorithms for document clustering (Willet, P. (1988), "Recent trends in hierarchical document clustering: a critical review", Information Processing and Management 24, 577-597), start with each document in a cluster and iteratively merge clusters until a halting criterion is met. HAC algorithms employ similarity functions (between documents and between sets of documents).

A document clustering algorithm is disclosed herein which is based on the identification of co-occurring phrases and conjunctions of phrases. The algorithm is fundamentally different to commonly used methods in that the clusters may be overlapping, and are intended to identify common items or themes.

The World Wide Web (the Web) is large, contains a lot of redundancy, and a relatively low signal to noise ratio. These factors make finding information on the Web difficult. The clustering algorithm presented here is designed as an aid to information discovery, i.e. out of the many hits returned for a given query, what topics are covered? This allows a user to refine their query in order to investigate one of the subtopics.

The clustering algorithm is as follows:

Retrieve pages corresponding to the query
For each page
    For n=1 to MaximumPhraseLength
        For each set of successive n words
            If this combination of words has not already appeared in this document then add the set to a hash table for this document and a hash table for all documents
        End for
    End for
End for
For n=MaximumPhraseLength to 1
    Find the most common phrases of length n, to a maximum of MaxN phrases, which occurred more than MinN times
    Add these phrases to the set of clusters
End for
Find the most common combinations of two clusters from the previous step, to a maximum of MaxC combinations, for which the combination occurred in individual documents at least MinC times
Delete clusters which are identified by phrases which are subsets of a phrase identifying another cluster
Merge clusters which contain identical documents
Display each cluster along with context from a set of pages for both the query terms and the cluster terms.

FIG. 15 shows the clusters 70 produced by this algorithm for the query "joydeep ghosh", and FIG. 16 shows the first two cluster summaries 72 and 74 for these clusters. FIGS. 17 and 18 show the clusters 76 and 80 produced by HuskySearch for the same query. FIG. 19 shows the clusters 82 produced by AltaVista. FIGS. 20 and 21 show the clusters 84 and 86 produced by the meta search engine of this invention for another two queries: "neural network" and typing and injury.

Query Expansion

One method of performing query expansion is to augment the query with morphological variants of query terms. Word stemming (Porter, M. F. (1980), "An algorithm for suffix stripping", Program 14, 130-137) can be used in order to treat morphological variants of a word as identical words. Web search engines typically do not perform word stemming, despite the fact that it would reduce the resources required to index the Web. One reason for the lack of word stemming by Web search engines is that stemming can reduce precision. Stemming considers all morphological variants. Query expansion using all morphological variants often results in reduced precision for Web search because the morphological variants often refer to a different concept. Reduced precision using word stemming is typically more problematic on the Web as compared to traditional information retrieval test collections, because the Web database is larger and more diverse. A query expansion algorithm is disclosed herein which is based on the use of only a subset of morphological variants. Specifically, the algorithm uses the subset of morphological variants which occur on a certain percentage of the Web pages matching the original query. Currently, the query terms are stemmed with the Porter stemmer (Porter, M. F. (1980), "An algorithm for suffix stripping", Program 14, 130-137) and the retrieved pages can be searched for morphological variants of the query terms. Variants which occur on greater than 1% of the pages are displayed to the user for possible inclusion in a subsequent query. No quantitative evaluation of this technique has been performed, however observation indicates that useful terms are suggested. As an example, for the query nec and "digital watermark", the following terms are suggested for query expansion: digitally, watermarking, watermarks, watermarked.

Currently the technique does not automatically expand a query when first entered, because the query expansion terms are not known until the query is complete. However the technique can be made automatic by maintaining a database of expansion terms for each query term. The first query containing a term can add the co-occurring morphological variants to the database, and subsequent queries can use these terms, and update the database if required.

Specific Expressive Forms

Accurate information retrieval is difficult due to the possibility of information represented in many ways, requiring an optimal retrieval system to incorporate semantics and understand natural language. Research in information retrieval often considers techniques aimed at improving recall, e.g. word stemming and query expansion. As mentioned earlier, it is possible for these techniques to decrease precision, especially in a database as diverse as the Web.

The World Wide Web contains a lot of redundancy. Information is often contained multiple times and expressed in different forms across the Web. In the limit where all information is expressed in all possible ways, high precision information retrieval would be simple and would not require semantic knowledge: one would only need to search for one particular way of expressing the information. While such a goal will never be reached for all information, experiments indicate that the Web is already sufficient for an approach based on this idea to be effective for certain retrieval tasks.

The method of this invention is to transform queries in the form of a question, into specific forms for expressing the answer. For example, the query "What does NASDAQ stand for?" is transformed into the query "NASDAQ stands for" "NASDAQ is an abbreviation" "NASDAQ means". Clearly the information may be contained in a different form to these three possibilities, however if the information does exist in one of these forms, then there is a high likelihood that finding these phrases will provide the answer to the query. The technique thus trades recall for precision. The meta search engine of this invention currently uses the specific expressive forms (SEF) technique for the following queries (square brackets indicate alternatives and parentheses indicate optional terms or alternatives):

What [is|are] x?
What [causes|creates|produces] x?
What do you think [about|of|regarding] x?
What does x [stand for|mean]?
Where is x?
Who is x?
[Why|How] [is|are] (a|the) x y?
Why do x?
When is x?
When do x?
How [do|can] i x?
How (can) [a|the] x y?
How does [a|the] x y?

As an example of the transformations, "What does x [stand for|mean]?" is converted to "x stands for" "x is an abbreviation" "x means", and "What [causes|creates|produces] x?" is transformed to "x is caused" "x is created" "causes x" "produces x" "makes x" "creates x".

Different search engines use different stop words and relevance measures, and this tends to result in some engines returning many pages not containing the SEFs. The offending phrases are therefore filtered out from the queries for the relevant engines.

FIG. 22 shows at 90 the response of the meta search engine of this invention for the query "What does NASDAQ stand for?" The answer to the query is contained in the local context displayed for about 5 out of the first 6 pages. FIG. 23 shows at 92 the response of Infoseek to the same query. The answer to the query is not displayed in the page summaries, and which, if any, of the pages contains the answer is not clear. FIGS. 24 and 25 show, at 94 and 96, the meta search engine of this invention and Infoseek responses to the query "How is a rainbow created?" Again, the answer is contained in the local context shown by the meta search engine of this invention but it is not clear which, if any, of the pages listed by Infoseek contain the answer to the question. FIG. 26 shows at 100 a third example of the response from the meta search engine of the invention for the query "What is a mealy machine?"

It is reasonable to expect that the amount of easily accessible information will increase over time, and therefore that the viability of the specific expressive forms technique will improve over time. An extension of the above-discussed procedures is to define an order over the various SEFs, e.g. "x stands for" may be more likely to find the answer to "What does x stand for" than the phrase "x means". If none of the SEFs are found then the engine could fall back to a standard query.

Search tips may be provided by the meta engine. These tips may include, for example, the following:

Use quotes for phrases, e.g. "nec research".
You can hide the various options above to save screen space by clicking on the "hide" links.
Window option: clicking on a hit brings up the page in the same window for multiple searches or a new window for each new search.
Filter option: filters pages when viewed to highlight query terms. Faster due to local caching of the page.
The letter(s) after the page titles identify the search engine which provided the result (e.g. A==AltaVista).
The second field after the page titles is the time since the page was last updated (e.g. 5m=5 months, 1y=1 year).
The third field after the page titles is the size of the page.
The context option selects the number of characters to display either side of the query terms.
The timeout option is the maximum time to download each individual page.
Searching in "Press" is useful for higher precision with current news topics.
Image option: remove images from the pages when viewed (for faster viewing).
When viewing a filtered page, clicking on a query term jumps to the next occurrence of that term. Clicking on the last occurrence of a term jumps back to the first occurrence.
You can use "-term" to exclude a term.
You can search for links to a specific page, e.g. link:www.neci.nj.nec.com/homepages/giles. Self links are excluded.
When in doubt use lower case.
This meta engine makes more than three times as many documents available as a single search engine. Constraining your search can help, e.g. if you want to know what NASDAQ stands for, searching for "NASDAQ stands for" rather than "NASDAQ" can find your answer faster although the information may also be expressed in alternative ways.
Clicking on the search engine links in the "Searching for:" line will show the search engine response to the current query.
You can search for images by selecting the "images" button, e.g. "red rose".
The bar to the left of the titles is longer when the query terms are closer together in the document.
The query term links in the "Searching for" line lead to the Webster dictionary definitions.
If you select Tracking: Yes, then your query will be tracked and new hits will be displayed on your customized home page similar to the "recent articles about NEC Research".
Select Cluster: Yes to cluster the documents and identify common themes.
You can filter images using a neural network prediction of whether each image is a photo or a graphic using the Images: option.
A listing of pages ranked by term proximity is shown after all of the documents have been retrieved.

Tracking Queries and URLs

Services such as the Informant (The Informant, 1997) track the response of Web search engines to queries, and inform users when new documents are found. The meta search engine of this invention supports this function. Tracking is initiated for a query by selecting the Track option when performing the query. A daemon then repeats the query periodically, storing new documents along with the time they were found. New documents are presented to the user on the home page of the search engine, as shown at 102 in FIG. 27. The engine does not currently inform users if the documents matching queries have changed, although this could be added.

The meta search engine of this invention also supports tracking URLs. Tracking is initiated by clicking the Track page link when viewing one of the pages from the search engine results. Alternatively, tracking may be initiated for an arbitrary URL using the options page. A daemon identifies updates to the pages being tracked, and shows a list of modified pages to the user on the home page, as in FIG. 27. The Page link displays the page being tracked and inserts a header at the top showing which lines have been added or modified since the last time the user viewed the page (e.g. see FIG. 28).

Estimating the Coverage of Search Engines and the Size of the Web

As the World Wide Web continues to expand, it is becoming an increasingly important resource for scientists. Immediate access to all scientific literature has long been a dream of scientists, and the Web search engines have made a large and growing body of scientific literature and other information resources easily accessible. The major Web search engines are commonly believed to index a large proportion of the Web. Important questions which impact the choice of search methodology include: What fraction of the Web do the search engines index? Which search engine is the most comprehensive? How up to date are the search engine databases?

A number of search engine comparisons are available. Typically, these involve running a set of queries on a number of search engines and reporting the number of results returned by each engine. Results of these comparisons are of limited value because search engines can return documents which do not contain the query terms. This may be due to (a) the information retrieval technology used by the engine, e.g. Excite uses "concept-based clustering" and Infoseek uses morphology, so these engines can return documents with related words, (b) documents may no longer exist (an engine which never deletes invalid documents would be at an advantage), and (c) documents may still exist but may have changed and no longer contain the query terms.

Selberg and Etzioni (Selberg, E. and Etzioni, O. (1995), Multi-service search and comparison using the MetaCrawler, in "Proceedings of the 1995 World Wide Web Conference") have presented results based on the usage logs of the MetaCrawler meta search service (due to substantial changes in the search engine services and the Web, it is expected that their results would be significantly different if repeated now). These results considered the following engines: Lycos, WebCrawler, Infoseek, Galaxy, Open Text, and Yahoo. Selberg and Etzioni's results are informative but limited for several reasons.

First, they present the "market share" of each engine which is the percentage of documents that users follow that originated from each of the search engines. These results are limited for a number of reasons, including (a) relevance is difficult to determine without viewing the pages, and (b) presentation order affects user relevance judgements (Eisenberg, M. and Barry, C. (1986), Order effects: A preliminary study of the possible influence of presentation order on user judgements of document relevance, in "Proceedings of the 49th Annual Meeting of the American Society for Information Science", Vol. 23, pp. 80-86).

The results considered by Selberg and Etzioni are also limited because they present results on the percentage of unique references returned and the coverage of each engine. Their results suggest that each engine covers only a fraction of the Web, however their results are limited because (a) as above, engines can return documents which do not contain the query terms (engines which return documents with related words or invalid documents can result in significantly different results), and (b) search engines return docu

the search engine databases. Only the 6 current major full-text search engines are considered herein (in alphabetical order): AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light. A common perception is that these engines index roughly the same documents, and that they index a relatively large portion of the Web.

We first compare the number of documents returned when using different combinations of 1 to 6 search engines. Our overall methodology is to retrieve the list of matching documents from all engines and then retrieve all of the documents for analysis. Two important constraints were used.

The first constraint was that the entire list of documents matching the query must have been retrieved for all of the search engines in order for a query to be included in the study. This constraint is important because the order in which the engines rank documents varies between engines. Consider a query which resulted in greater than 1,000 documents from each engine. If we only compared the first 200 documents from each engine we may find many unique URLs. However, we would not be able to determine if the engines were indexing unique URLs, or if they were indexing the same URLs but returning a different subset as the first 200 documents.

The second constraint was that for all of the documents that each engine lists as matching the query, we attempted to download the full text of the corresponding URL. Only documents which could be downloaded and which actually contain the query terms are counted. This is important because (a) some engines can return documents which they believe are relevant but do not contain the query terms (e.g. Excite uses "concept-based clustering" and may consider related words, and Infoseek uses morphology), and (b) each search engine contains a number of invalid links, and the percentage of invalid links varies between the search engines (engines which do not delete invalid links would be at an advantage).

Other details important to the analysis are:

1. Duplicates are removed when considering the total number of documents returned by one engine or by a combination of engines, including detection of identical pages with different URLs. URLs are normalized by a) removing any "index.html" suffix or trailing "/", b) removing a port 80 designation (the default), c) removing the first segment of the domain name for URLs with a directory depth greater than 1 (in order to account for machine aliases), and d) unescaping any "escaped" characters (e.g. %7E in a URL is equivalent to the tilde character).

2. We consider only lowercase queries because different engines treat capitalized queries differently (e.g. AltaVista returns only capitalized results for capitalized queries).
ments in different orders, meaning that all documents need 3. We used an individual page timeout of 60 seconds. to be retrieved for a valid comparison, e.g. two Search PageS which timed out were not included in the analysis. engines may indeX exactly the same Set of documents yet 4. We use a fixed maximum of 700 documents per query return a different set as the first X. 55 (from all engines combined after the removal of duplicates) In addition, Selberg and Etzioni find that the percentage of -queries returning more documents were not included. The invalid links was 15%. They do not break this down by Search engines typically impose a maximum number of Search engine. Selberg and Etzioni do point out limitations documents which can be retrieved (current limits are in their study (which is just a Small part of a larger paper on AltaVista 200, Infoseek 500, HotBot 1,000, Excite 1,000, the very successful MetaCrawler service). 60 Lycos 1,000, and Northern Lightd 10,000) and we checked to AltaVista and Infoseek have recently confirmed that they ensure that these limits were not exceeded (using this do not provide comprehensive coverage on the Web (Brake, constraint no query returned more than the maximum from D. (1997), “Lost in cyberspace”, New Scientist 154(2088), each engine, notably no query returned more than 200 12-13.) Discussed below are estimates on how much they do documents from AltaVista). COVC. 65 5. We only counted documents which contained the exact We have produced Statistics on the coverage of the major query terms, i.e. the word “crystals' in a document would Web search engines, the size of the Web, and the recency of not match a query term of “crystal’-the non-plural form of US 6,999,959 B1 15 16 the word would have to exist in the document in order for the exclusion Standard, or by authentication requirements. document to be counted as matching the query. 
This is Therefore, we expect the true size of the Web to be much necessary because different engines use different morphol larger than estimated here. However Search engines are ogy rules. unlikely to Start indexing these documents, and it is therefore 6. Hot3ot and AltaVista can identify alternate pages with of interest to estimate the size of the Web that they do the same information on the Web. These alternate pages are consider indexing (hereafter referred to as the “indexable included in the statistics (as they are for the engines which Web”), and the relative comprehensiveness of the engines. do not identify alternate pages with the same data). The logarithmic extrapolation above is not accurate for 7. The “special collection” (premier documents not part of determining the size of the indexable Web because (a) the the publicly indexable Web) of Northern Light was not used. amount of the Web indexed by each engine varies signifi Over a period of time, we have collected 500 queries cantly between the engines, and (b) the Search engines do which Satisfy the constraints. For the results presented not sample the Web independently. All of the 6 search herein, we performed the 500 queries during the period Aug. engines offer a registration function for users to register their 23, 1997 to Aug. 24, 1997. We manually checked that all pages. It is reasonable to assume that many users will results were retrieved and parsed correctly from each engine 15 register their pages at Several of the engines. Therefore the before and after the tests because the engines periodically pages indexed by each engine will be partially dependent. 
A change their formats for listing documents and/or requesting Second Source of dependence between the Sampling per the next page of documents (we also use automatic methods formed by each engine comes from the fact that Search designed to detect temporary failures and changes in the engines are typically biased towards indexing pages which Search engine response formats). are linked to in other pages, i.e. more popular pages. FIG. 29 shows the fraction of the total number of docu Consider the overlap between engines a and b in FIG. 31. ments from the 6 engines which were retrieved by each ASSuming that each engine Samples the Web independently, individual engine. Table 1 below shows these results along the quantity n/n, where n is the number of documents with the 95% confidence interval. Hot3Ot is the most returned by both engines and n is the number of documents comprehensive in this comparison. These results are specific 25 returned by engine b, is an estimate of the fraction of the to the particular queries performed and the State of the indexable Web, P, covered by engine a. Using the coverage engine databases at the time they were performed. Also, the of 6 engines as a reference point we can write P=na/ne, results may be partly due to different indexing rather than where n is the number of documents returned by engine a different databases sizes-different engines may not indeX and no is the number of unique documents returned by the identical words for the same documents (for example, the combination of 6 engines. Thus, P is the coverage of engine engines typically impose a maximum file Size and effec a with respect to the coverage of the 6 engines, we can write tively truncate oversized documents). c=P/P=n,n/non. We use this equation to estimate the size of the Web in relation to the amount, of the Web covered by TABLE 1. the 6 engines considered here. 
Because the Size of the 35 engines varies significantly, we consider estimating the Search Hot- Ex- Northern Alta value of c using combinations of two engines, from the Engine Bot cite Light Vista Infoseek Lycos smallest two to the largest two. We limit this analysis to the Coverage 39.2%. 31.1%. 30.4% 29.2%. 17.9% 12.2% 245 queries returning 250 documents (to avoid difficulty WRT 6 when no=0). Table 2 shows the results. Values of c Smaller Engines 40 than 1 Suggest that the size of the indexable Web is smaller 95% +f +f +f +f +f +f confidence -1.4% -1.2% -1.3% -1.2% -1.1% -1.1% than the number of documents retrieved from all 6 engines. interval It is reasonable to expect that larger engines will have lower dependence because a) they can index more pages other than the pages which users register, and b) they can index more FIG. 30 shows the average fraction of documents 45 of the less popular pages on the Web. Indeed, there is a clear retrieved by 1 to 6 Search engines normalized by the number trend where the estimated value of c increases with the larger retrieved from all six engines. For 1 to 5 engines, the average engines. is over all combinations of the engines, which is averaged for each query and then averaged over queries. Using the TABLE 2 assumption that the coverage increases logarithmically with 50 the number of Search engines, and that, in the limit, an AltaVista Lycos & Northern infinite number of Search engines would cover the entire Search & Infoseek & Northern Light & Excite Web, f(x)=b(1-1/exp(ax)), where a and b are constants and Engines Infoseek AltaVista Light Excite & Hot3Ot X is the number of Search engines, was fit to the data (using 55 Engine Smallest -> Largest Levenberg-Marquardt minimization (Fletcher, R. (1987), Sizes Practical Methods of Optimization, Second Edition, John Estimated O6 O.9 O.9 1.9 2.2 Wiley & Sons) with the default parameters in the program C gnuplot) and plotted on FIG. 30. 
This is equivalent to the 95% +f-0.04 +f-0.06 +f-0.04 +f-0.12 +f-0.17 assumption that each engine covers a certain fixed percent confidence interval age of the Web, and each engine's sample of the Web is 60 drawn independently from all Web pages (c-c+c(1-c.), i=2... n where c is the coverage of i engines and c is the Using c=2.2, from the comparison with the largest two coverage of one engine). engines, we can estimate the fraction of the indexable Web There are a number of important biases which should be which the engines cover: Hot3ot 17.8%, Excite 14.1%, considered. Search engines typically do not consider index 65 Northern Light 13.8%, AltaVista 13.3%, Infoseek 8.1%, ing documents which are hidden behind Search forms, and Lycos 5.5%. These results are shown at 120 in FIG. 32. The documents where the engines are excluded by the robots percentage of the indexable Web indexed by the major US 6,999,959 B1 17 18 Search engines is much lower than is commonly believed. ence)(the underlying Search engines and/or the Web appear We note that (a) it is reasonable to expect that the true value to be significantly faster than they were when Selberg and of c is actually larger than 2.2 due to the dependence which Etzioni performed their experiment). remains between the two largest engines, and (b) different Therefore, on average we find that the parallel architec results may be found for queries from a different class of ture of the meta engine of this invention allows it to find, USCS. download and analyze the first page faster than the Standard Hot3ot reportedly contains 54 million pages, putting our Search engines can produce a result although the Standard estimate on a lower bound for the size of the indexable Web engines do not download and analyze the pages. Note that at approximately 300 million pages. Currently available the results in this Section are specific to the particular queries estimates of the size of the Web vary significantly. 
The performed (speed as a function of the query is different for Internet Archive uses an estimate of 80 million pages each engine) and the network conditions under which they (excluding images, Sounds, etc.) (Cunningham, M. (1997), were performed. These factors may bias the results towards Brewster's millions, http://www.irish-times.com/irish certain engines. The non-Stationarity of Web access times is times/paper/1997/0127/cmp1.html.) Forrester Research esti not considered here, e.g. the Speed of the engines varies mates that there are more than 75 million pages (Guglielmo, 15 Significantly over time (short term variations may be due to C. (1997), “Mr. Kurnit's neighborhood, Upside September.) network or machine problems and user load, long term AltaVista now estimates that there the Web contains 100 to variations may be due to modifications in the Search engine 150 million pages (Brake, D. (1997), “Lost in cyberspace, Software, the Search engine hardware resources, or relevant New Scientist 154(2088), 12–13). network connections). A simple analysis of page retrieval times leads to Some The meta Search engine of this invention demonstrates interesting conclusions. Table 3 below shows the median that real-time analysis of documents returned from Web time for each of the Six major Search engines to respond, Search engines is feasible. In fact, calling the Web Search along with the median time for the first of the Six engines to engines and downloading Web pages in parallel allows the respond when queries are made simultaneously to all meta Search engine of this invention to, on average, display engines (as happens in the meta engine). 25 the first result quicker than using a Standard Search engine. User feedback indicates that the display of real-time local TABLE 3 context around query terms, and the highlighting of query terms in the documents when viewed, Significantly improves Median Time for response the efficiency of searching the Web. 
Search Engine (seconds) Our experiments indicate that an upper bound on the AltaVista O.9 coverage of the major Search engines varies from 6% Infoseek 1.3 (Lycos) to 18% (Hot Bot) of the indexable Web. Combining Hot3Ot 2.6 Excite 5.2 the results of Six engines returns more than 3.5 times as Lycos 2.8 many documents when compared to using only one engine. Northern Light 7.5 35 By analyzing the overlap between Search engines, we esti All engines 2.7 mate that an approximate lower bound on the size of the First of 6 engines O.8 First result from the meta 1.3 indexable Web is 300 million pages. The percentage of search engine of this invalid links returned by the major engines varies from 3% invention to 7%. Our results provide an indication of the relative 40 coverage of the major Web Search engines, and confirm that, as indicated by Selberg and Etzioni, the coverage of any one Histograms of the response times for these engines and Search engine is significantly limited. the first of 6 engines are shown in FIGS. 33 and 34, and the While it is apparent that the invention herein disclosed is median times are shown in FIG. 35. FIG. 36 shows the well calculated to fulfill the objects previously stated, it will median time for the first of n engines to respond. These 45 be appreciated that numerous modifications and embodi results are from September 1997, and we note that the ments may be devised by those skilled in the art, and it is relative Speed of the Search engines varies over time. intended that the appended claims cover all Such modifica Looking now at the time to download arbitrary Web tions and embodiments as fall within the true Spirit and pages, FIG. 37 shows a histogram of the response time. FIG. Scope of the present invention. 38 shows the median time for the first of n engines to 50 What is claimed is: respond. We can estimate the time for the meta engine to 1. 
A computer-implemented meta Search engine method, display the first result, which we create by Sampling from the comprising the Steps of: distributions for the first of 6 search engines (the meta forwarding a query to a plurality of third party Search engine actually uses more than 6 Search engines but we engines, concentrate on the major Web engines here), and the first of 55 receiving and processing responses from the third party 10 Web pages (the actual number depends on the number Search engines, Said responses identifying documents returned by the first engine to respond), adding these in response to the query, Said processing including the together, and averaging over 10,000 trials. Steps of, FIG. 39 shows a histogram of this distribution. The (a) downloading the full text of the documents identified median of the distribution is 1.3 seconds (compared to 2.7 60 in response to the query, and Seconds for the median response time of a Search engine (b) locating query terms in the documents and extracting even without downloading any actual pages). For compari text Surrounding the query terms to form at least one Son, the average time MetaCrawler takes to return results is context String, 25.7 seconds (without page verification) or 139.3 seconds displaying information regarding the documents, and the (with page verification) (Selberg, E. and Etziono, O. 
(1995), 65 at least one context String Surrounding one or more of Multi-Service Search and comparison using the MetaCra the query terms for each processed document contain wler, in Proceedings of the 1995 World Wide Web Confer ing the query terms, and US 6,999,959 B1 19 20 clustering the documents based on analysis of the full text (a) downloading the full text of the documents identified of each document and identification of co-occurring in response to the query, and phrases and words, and conjunctions thereof and dis playing the information regarding the documents (b) locating query terms in the documents and extracting arranged by Said clusters. text Surrounding the query terms to form at least one 2. A computer-implemented meta Search engine method, context String, and comprising the Steps of: displaying information regarding the documents, and the forwarding a query to a plurality of third party Search at least one context String Surrounding one or more of engines, the query terms for each processed document contain receiving and processing responses from the third party ing the query terms. Search engines, Said responses identifying documents in response to the query, Said processing including the 4. 
A computer-implemented meta Search engine method, Steps of, comprising the Steps of: (a) downloading the full text of the documents identified forwarding a query to a plurality of third party Search in response to the query, and 15 engines, (b) locating query terms in the documents and extracting receiving and processing responses from the third party text Surrounding the query terms to form at least one Search engines, Said responses identifying documents context String, in response to the query, Said processing including the displaying information regarding the documents, and the Steps of, at least one context String Surrounding one or more of the query terms for each processed document contain (a) downloading the full text of the documents identified ing the query terms, and in response to the query, and displaying Suggested additional query terms for expand (b) locating query terms in the documents and extracting ing the query based on terms in the documents identi text Surrounding the query terms to form at least one fied in response to the query. 25 context String, 3. A computer-implemented meta Search engine method, displaying information regarding the documents, includ comprising the Steps of: ing the at least one context String Surrounding one or receiving a query and transforming the query from a form more of the query terms for each processed document of a question into a form of an answer; forwarding the transformed query to a plurality of third containing the query terms, and party Search engines, displaying an indication of how close the query terms are receiving and processing responses from the third party to each other in the documents. Search engines, Said responses identifying documents in response to the query, Said processing including the Steps of,
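For illustration only, step (b) recited in the claims above (locating query terms and extracting the surrounding text as a context string) and the proximity indication of claim 4 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `context_strings` and `min_term_distance`, the default window of 40 characters, the case-insensitive matching, and the use of word positions as the proximity measure are all hypothetical choices made for the example. Whole-word matching is used so that, as in the description, "crystals" does not match the query term "crystal".

```python
import re


def context_strings(text, terms, width=40):
    """Return one context string per exact occurrence of any query term.

    `width` is the number of characters kept on either side of the match
    (an assumed default; the description only says this is a user option).
    """
    contexts = []
    for term in terms:
        # Whole-word match, so "crystal" does not match inside "crystals".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            contexts.append(text[start:end])
    return contexts


def min_term_distance(text, term_a, term_b):
    """Smallest difference in word position between two single-word terms.

    Returns None when either term is absent. Using word positions as the
    proximity measure is an assumption of this sketch.
    """
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)


if __name__ == "__main__":
    page = ("Large crystals form slowly, but a single crystal "
            "of quartz can grow in a cave.")
    for snippet in context_strings(page, ["crystal"], width=10):
        print(repr(snippet))
    print(min_term_distance(page, "crystal", "quartz"))
```

A smaller distance could be displayed next to each hit as a rough signal that the query terms occur together in the page; how the indication is actually presented is not specified here.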