New Methods and Tools for the World Wide Web Search

Journal of Computing and Information Technology - CIT 8, 2000, 4, 267–276

Vlatko Ceric
Faculty of Economics, University of Zagreb, Croatia

Explosive growth of the World Wide Web, as well as its heterogeneity, calls for powerful and easy-to-use search tools capable of providing the user with a moderate number of relevant answers. This paper presents an analysis of key aspects of recently developed Web search methods and tools: visual representation of subject trees, interactive user interfaces, linguistic approaches, image search, ranking and grouping of search results, database search, and scientific information retrieval. Current trends in Web search include topics such as exploiting the Web hyperlinking structure, natural language processing, software agents, the influence of the XML markup language on search efficiency, and WAP search engines.

Keywords: Web search, search engines, subject trees, scientific information retrieval, database search, ranking, XML, WAP search engines

1. Introduction

The World Wide Web has had astonishing growth in the last few years. The number of Web servers increased from 1.7 million in December 1997, to 3.7 million in December 1998, to 9.6 million in December 1999, and then to 21.2 million in September 2000 (URL 1). The same source estimates that the number of Internet hosts increased from 29.7 million in January 1998, to 43.2 million in January 1999, to 72.4 million in January 2000, and then to 93 million in July 2000. The number of people online is assessed to be over 370 million as of September 2000 (URL 2).

One of the most authoritative studies of Web size (Lawrence and Giles, 1999) estimated that in February 1999 the publicly indexable Web contained approximately 800 million pages, a quantity of information equivalent to 6 terabytes of pure text, as well as some 180 million images equivalent to about 3 terabytes. As a comparison, the Library of Congress, the biggest library today, contains about 20 terabytes of text. No single search engine was indexing more than about 16% of Web sites, and all major search engines combined were covering only about 42% of the Web.

The latest study of the Web size, done by the Cyveillance company (URL 3), estimated that in July 2000 the Web contained about 2.1 billion unique publicly available Web pages. The study also found that the Web is growing by about 7.3 million Web pages a day. About 84% of all pages are based in the USA.

It is estimated that about 30% of Web pages are copied or mirrored, or are very similar (Shivakumar and Garcia-Molina, 1998). Analysis of 200 million Web pages (Broder et al., 1999) has shown that these pages use 1.5 billion hyperlinks and that the Web is made up of four components. A central core (about 56 million pages) is well connected, an 'in' cluster (44 million pages) contains pages that point to the central core but cannot be reached from it, an 'out' cluster (44 million pages) contains pages that can be reached from the core but do not link to it, 'tubes' (44 million pages) connect to either 'in' or 'out' clusters but not to the core, while some 17 million pages are completely unconnected. The Web graph structure is interesting in itself, but it can also be used for designing crawl strategies on the Web (Cho and Garcia-Molina, 2000).

Web sites have extremely variable information value since they are written by individuals with any conceivable background, education, culture, interest, and motivation, and in most cases
these materials are neither edited nor reviewed. Web page size may range from a dozen words to hundreds of screens, and pages often contain grammatical mistakes and typos. Presented material can be obsolete, false or unreliable. Web content itself is unstructured, and multiple formats, different languages, dialects and alphabets are used (Baeza-Yates and Ribeiro-Neto, 1999).

Despite all its weaknesses, including a lot of poor, obsolete and unreliable material, the Web includes a vast quantity of excellent material of all sorts, from practical information to professional and scientific material that cannot be found without appropriate tools. Web search is also an exciting activity in itself, since search is one of the fundamental human activities. The more ordered and structured the material is, the less exciting the search; the Web, with its heterogeneous and unstructured material, is therefore the perfect search area.

Without software tools for searching and cataloguing the huge and heterogeneous Web information space it would be absolutely impossible to find relevant information residing on the Web (Ceric, 1997; Baeza-Yates and Ribeiro-Neto, 1999). The need for retrieval of information on the Web is so large that the major search engines receive about 15-20 thousand queries per minute.

What users need are powerful, easy-to-use search tools capable of providing a moderate number of appropriately ranked relevant answers. In the last few years numerous advancements were made in various areas of World Wide Web search. This paper presents an analysis of major aspects of recently developed Web search methods and tools, as well as of the most important trends in Web search methods.

2. Subject trees

Both subject trees (directories) and search engines cope with the serious problem of how to follow the tremendous growth of the Web. Since Web links are incorporated into subject trees by human indexing teams, the size of the team determines the coverage of Web contents by the subject tree. The biggest subject trees have the following approximate sizes (as of mid 2000): Yahoo! has a staff of about 150 editors and includes about 1.5 million Web links, LookSmart has about 200 editors and consists of about 2 million Web links, while NBCi (Snap) has about 50 editors and contains approximately 1 million Web links. One of the most valuable mid-sized subject trees, Britannica, contains about 130,000 Web links.

Since the Web today contains roughly two billion Web pages, the biggest subject trees cover only on the order of one per mille (0.1%) of the Web. Although this is a very small coverage, the value of subject trees is primarily in the quality of the Web links selected by their editors.

Two interesting new subject trees are Open Directory and About.com. Open Directory (URL 4) is based on the concept in which the Web community, rather than professional staff, selects Web links for a subject tree. Every one of the approximately 28,000 editors that build and maintain the directory is an expert in the categories she takes care of. Open Directory contains about 2 million links categorized in more than 300,000 subcategories. Open Directory makes its data available to anyone, free of charge; some of the major search engines using Open Directory are HotBot, Lycos and Netscape. About.com (URL 5) is a mid-sized subject tree that maintains about 700 targeted environments (i.e. wide subcategories), each overseen by a professional guide. About.com guides are company-certified subject specialists living and working in over 20 countries and selected by a rigorous certification program. About.com environments have well-organized and rich content that includes new links, articles, newsletter, forums, chat, etc.

Britannica.com (URL 6) is one of the first subject trees that introduced other contents in addition to Web links. This directory provides the complete and updated Encyclopaedia Britannica, as well as selected articles from 70 of the world's top magazines (like The Economist, Discover and Newsweek) and a database of related books.

Subject trees have become so big that traditional user interfaces via hierarchic trees became inadequate for dealing with information, since they don't enable the view of the whole information structure, while navigation is done by a slow fold-and-unfold process. This situation led to the development of several user interfaces, e.g. Hyperbolic Tree, Mapuccino and ThemeScape, with
novel representation of the subject tree structure. Hyperbolic Tree (URL 7), developed by the Xerox Palo Alto Research Center, is a user interface designed for managing and exploring large hierarchies of data such as Web subject trees, product catalogs or document collections. It allows an overview of the whole data set, as well as drilling down and accessing the primary information (e.g. specific Web pages). Hyperbolic Tree is an intuitive and highly interactive tool that allows focusing on a part of the data represented in the context of the whole unfocused data set. The user can easily change the focus via animated moving of hyperbolic tree elements by means of click-and-drag with the mouse.

Mapuccino (URL 8) is an IBM product that also dynamically constructs visual maps of Web sites using their linking structure. ThemeScape (URL 9) organizes a topographical map of textual documents based on the similarity of their content. This software reads documents from the data set and examines their contents using sophisticated statistical and natural language filtering algorithms. Related topics found in the documents are associated, thus forming different concepts. These concepts are assigned to a spatial structure that is transformed into topological maps in such a way that the greater the similarity between two documents, the closer together they appear on the map. Therefore, peaks on the map denote a concentration of several documents about the same topic. The resulting topographical map clearly shows which topics are covered in the data set, how much emphasis is given to a specific topic, and how different topics relate to one another. Search across the documents results in highlighting of relevant documents right on the map, instead of showing a list of discovered documents.

3. Search engines

Fast growth of the Web led to expansion of search engine databases. Some of the major search engines have the following database sizes (as of June 2000): the Google database has 560 million Web pages indexed, WebTop.com has 500 million pages, AltaVista and FAST have about 350 million pages, while Northern Light has 265 million pages. The most recent figure for the FAST database size is 575 million pages as of October 2000.

FAST (URL 10) is a unique new search engine developed by a group of computer scientists from the Norwegian University of Science and Technology in Trondheim with long expertise in both search and image/video compression technologies. This search engine is based on a massively parallel architecture running on Dell PowerEdge x86 servers that give impressive speed and scalability to this search engine. Designers of the FAST Web search engine intend to build the world's largest search engine, one that indexes the entire Web contents by the end of 2000, a goal that will be almost impossible to fulfill because of the explosive growth of the Web. Its spider (crawler, robot) provides data for indexing by scanning the Internet with a throughput of up to 80 million documents per day, thus ensuring a fresh catalog. For the purpose of the FAST Web search the document index is broken into discrete segments of 5 million documents each, with each segment stored on an independent search node. Queries are received by multiple dispatch nodes which broadcast them to every search node simultaneously. Search results are assembled and ranked according to the relevancy score. Parallel processing ensures that a search through the whole index takes less than one second.

The Northern Light (URL 11) search engine introduced supplementary contents for searching. Besides Web links, this search engine offers the Special Collection, an online business library that consists of over 6,000 full-text journals, books, magazines, newswires and reference sources. The Special Collection currently has about 16 million documents, and adds approximately 250 new sources each month. Most of the Special Collection sources include articles back to January 1995. The price of Special Collection documents ranges from $1.00 to $4.00 per article, while the summary of the article is free.

One of the most important things for efficient use of Web search is proper ranking of the resulting Web links. Two new approaches in this field were developed by the Northern Light and Google search engines. Northern Light introduced automatic grouping of resulting Web links into meaningful categories (so-called Custom Search Folders) that contain only information relevant to that folder. These folders group
results by subject, type (e.g. press releases or product reviews), source (e.g. commercial sites or magazines) and language. Users can concentrate on links from the appropriate folder and thus narrow the results considerably and save time. Folders are not preset, but are rather dynamically created for each search. The Google (URL 12) search engine uses the PageRank algorithm, based on the importance (popularity) of Web pages. The importance of a Web page is discovered through analysis of its link structure, and doesn't depend on the specific search request. PageRank is a characteristic of the Web page itself: it is higher if more Web pages link to this page, as well as if these Web pages have high PageRank. Therefore, important Web pages help to make other Web pages important. One consequence of this approach is that the search result may include links to Web pages that were not found by the Google spider, but were linked by some Web page accessed by the spider.

Two innovative approaches to Web search are the visual approach for searching the Web, and search by meaning. Ditto.com (URL 13) is a service that provides a visual mechanism for searching the Web. Search begins in a traditional manner, by entering one or more keywords. Ditto.com responds by presenting pictures from discovered Web pages plus pictures from its own media collection. Presented pictures from Web pages act as a visual clue for determining the relevance of the Web page. A click on the presented picture from a particular Web page leads to that Web page.

Simpli.com and Oingo are search engines dealing with the problem of lexical ambiguity of the traditional search with one or more search terms put out of context. Since words can have many different meanings (polysemy), traditional search engines respond by selecting all Web pages that fit each possible meaning of the search terms. This leads to a large list of resulting links that includes many links not related to the desired meaning of the search. Because there are many words with the same or similar meaning (synonymy) it can happen that keyword-based search cannot find relevant pages because it uses one synonym while authors of Web pages may use some other synonym. Simpli.com (URL 14) uses principles of linguistic and cognitive science as well as interaction with users to place search terms in context. When the user enters the search term, Simpli.com activates its knowledge base and automatically generates a list of possible contexts of this search term. The user then interacts with the search engine and chooses the appropriate meaning (concept). After that the search engine consults its database to choose related words based on the search term and the chosen concept, and automatically expands the query with these words. Oingo (URL 15) takes a similar approach and enables using more than one keyword for search. Oingo offers its user a list of possible meanings for all of the terms used in a query. The user then chooses the most appropriate meaning for each query term, and these meanings are used for the search.

The AskJeeves (URL 16) search engine has two interesting features. One of them is that questions can be stated in natural English. AskJeeves uses natural language processing technology to analyze the meaning of the words in the question, and the meaning in the grammar of the question. This search engine also keeps a database of questions and answers to these questions and expands it all the time, hence becoming smarter over time. Currently it has more than 7 million answers that contain information about the most frequently asked questions and the answers to these questions discovered on the Web. When a user asks a new question AskJeeves analyses its meaning and provides the user with a list of specific questions. When the user clicks on such a specific question, the answer-processing engine retrieves the answers template that contains links to available answers. For example, the original question “Where can I buy the 56K external modem?” triggers the answers template for this question that contains answers to the following specific questions: “Where can I buy modems online?”, “Where can I buy and sell modems via online auction?”, “Where can I compare prices for computer peripherals?”, etc.

The AltaVista (URL 17) search engine includes two important features, search for images and translation between languages. AltaVista enables search for over 25 million images, audio and video files from both the Web and several private collections. Images include photographs, graphics, clip art, and galleries. Search is done by specifying appropriate keywords, and AltaVista responds by presenting pictures from Web pages and/or private collections. When a
user finds a picture that fits his or her interest, it is in some cases possible to activate the search for other pictures with similar features, and thus increase the proportion of pictures belonging to the field of his or her interest. Similarity is based on visual characteristics such as dominant colors, shapes and textures.

AltaVista also enables automatic translation of the contents of discovered Web pages between several languages. Automatic translation of any piece of text between a dozen pairs of languages can also be performed. This can even be done independently of the search, using the AltaVista translation service (URL 18) directly. This translation service enables search in foreign languages by (a) translation of an English query to some target language, (b) accomplishing search with the translated terms, and (c) translating the resulting Web pages back to English. Translation is in fact done by AltaVista's partner SYSTRAN (URL 19), a leading translation software institution. SYSTRAN's software is widely used by the US government and the European Union for translation of their internal communication. It was also used in the US-USSR Apollo-Soyuz space project during 1972-1975.

4. Database search

A growing number of databases from various areas are available by means of the Web. However, search engines often cannot distinguish between a simple Web page and the entrance to one or more huge databases. Information contained in databases is hidden and cannot be retrieved by search engines, hence the name “Invisible Web” for databases accessible from the Web. Databases typically include structured data about specific topics like companies, restaurants, or locations of ATM machines. Most database resources are free, while some are fee-based.

A recent study accomplished by the BrightPlanet company (URL 20) estimated that the inaccessible part of the Web is several hundred times larger than its visible part accessible by search engines, and contains about 550 billion individual documents. More than 100,000 invisible Web sites exist, while the 60 largest invisible Web sites contain about 10% of the total size of the invisible Web. More than half of the invisible Web content resides in topic-specific databases, and about 95% of the invisible Web is publicly accessible information not subject to fees or subscriptions.

BrightPlanet also developed the LexiBot search tool (URL 21). This tool looks at the search request, then selects the most relevant among 600 different invisible Web resources and forwards the query to them. The results are returned in a form similar to the one used by meta search engines. BrightPlanet plans to expand the query to about 20,000 invisible Web sites already collected, and finally to all 100,000 significant invisible Web sites.

Some of the best known Web directories containing addresses of Web sites used as entrances to different databases are InvisibleWeb, Lycos Searchable Databases and INFOMINE. InvisibleWeb (URL 22) is a directory with over 10,000 databases, archives, and search engines containing information that traditional search engines cannot access. This is a hierarchically structured directory that can also be searched by key words. An example search using the term “ATM” done in October 1999 found 8 Web sites like Visa ATM locator, 4Banking.com, and MasterCard ATM locator. Each of these Web sites enables entrance to a database of ATM locators that can be searched geographically, with the final result of finding addresses of ATM locations in a specific town. A comparative search with AltaVista on the phrase “ATM” gave about 525,000 links, while a search on “ATM” in the title gave some 22,000 links. The more precise search using the phrase “ATM locator” gave 3,400 links, while a search on “ATM locator” in the title gave some 350 links. Moreover, most of these links are not the entrances to databases. As can be seen, in search for specific information the advantage of a database directory over a traditional search engine is evident.

Lycos Searchable Databases (URL 23) is a directory similar to InvisibleWeb, while INFOMINE (URL 24) is a huge scholarly directory of more than 10,000 databases in biological, agricultural, medical and physical sciences, engineering, computing, mathematics, social sciences and humanities. INFOMINE also includes directories of electronic journals, electronic books, online library card catalogs and articles.

Terraserver (URL 25) is a unique huge database of high-resolution satellite imagery and aerial photography that started as a research project between Aerial Images, Inc., Microsoft, the USGS, and Compaq. The special interest of Microsoft was to test the ability of a database server built as a large scalable network using Windows NT Server and Microsoft SQL Server to store terabytes of data for thousands of simultaneous users, and to explore the usage of relational database technology as a general-purpose image repository. Terraserver currently has imagery from more than 60 countries with resolution down to 1 meter, and has a goal to completely cover the earth's surface. It uses imagery from various sources like the Russian Space Agency, the Indian Space Agency, as well as US companies using commercial satellites.

5. Scientific information retrieval

The Web is increasingly becoming a distributed collection of scientific literature where scientific papers can be rather easily retrieved and quickly accessed (Lawrence and Giles, 1999a). Numerous researchers make their publications available on their homepages, and many traditional journals offer access to the whole text of papers on the Web. Some publishers allow their papers to be placed on the author's Web site, while others permit prepublication of articles on the Web. The main problem for search engines is in locating articles prepared in PostScript or PDF format, the predominantly used formats for scientific publications.

An important step forward in enabling efficient and effective service for scientific literature on the Web was made by the NEC Research Institute team in developing ResearchIndex (URL 26). ResearchIndex is a search engine specifically designed for scientists, based on the combination of a digital library and a citation index. Citation indices index the citations in a paper so that the paper is linked with the cited articles. They enable scientists to find papers that cite a given paper, and also facilitate evaluation of articles and authors. The main problem with traditional citation indices is their high price related to the manual effort necessary for indexing.

In order to avoid this problem the NEC Research Institute team developed a digital library of scientific publications that creates a citation index autonomously using Autonomous Citation Indexing (ACI), a system that doesn't require any manual effort (Lawrence, Giles and Bollacker, 1999). ACI allows literature search using both citation links and the context of citations. It retrieves PDF and PostScript files and uses simple rules based on the formatting of a document to extract the title, abstract, author and references of any scientific paper it finds. ACI enables fast feedback by indexing items such as conference proceedings or technical reports. New papers can be automatically located and indexed as soon as they are published on the Web or announced on mailing lists.

The same team built a prototype digital library called CiteSeer. CiteSeer downloads papers from the Web and converts them to text, parses the papers to extract the citations and the context in which the citations were made, and then stores this information in a database. CiteSeer includes full-text articles and citation indexing and allows location of papers by keyword search or citation links. NEC Research has made the CiteSeer software available at no cost for non-commercial use. A demonstration version of CiteSeer (URL 27) indexes over 200,000 computer science papers, more than the largest online scientific archives.

Retrieval of scientific data is a particularly complex problem, since it has to deal with widely distributed, heterogeneous collections of data. Scientific data form huge and fast growing collections of data stored in specialized formats (e.g. astronomic or medical data), and their retrieval requires powerful search tools. For this purpose a consistent set of metadata semantics, as well as a standard information retrieval protocol supporting the metadata semantics, is required. One technology developed for scientific information retrieval is Emerge (URL 28), a system based on an XML-based translation engine that can perform metadata mapping and query transformation.

6. Trends

Several new approaches to search engine mechanisms were developed around the idea of exploiting the rich Web hyperlinking structure for clustering and ranking of Web documents. The Clever project (Members of the Clever team, 1999), run by researchers from IBM, Cornell University and the University of California at Berkeley, analyses Web hyperlinks and automatically locates two types of pages: “authorities” and “hubs”. “Authorities” are the best sources of information on a particular broad search topic, while “hubs” are collections of links to these authorities. A respected authority is a page that is referred to by many good hubs, while a useful hub is a page that points to many valuable authorities. For any query Clever first performs an ordinary text-based search using an engine like AltaVista, and takes a list of 200 resulting Web pages. This set of links is then expanded with Web pages linked to and from those 200 pages; this step is repeated to obtain a collection of about 3,000 pages. The Clever system then analyses the interconnections between these documents, giving higher authority scores to those pages that are frequently cited, and higher hub scores to pages that link to those authorities. This procedure is repeated several times with iterative adjusting of authority and hub scores: an authority that has many high-scoring hubs pointing to it earns a higher authority score, while a hub that points to many high-scoring authorities gets a higher hub score. The same page can be both an authority and a hub. A side-effect of this iterative processing is that the algorithm separates Web sites into clusters of similar sites. The Clever system is also used for automatic compilation of lists of Web resources similar to subject trees; these lists appear to be competitive with handcrafted ones.

One extension of this research is the Focused Crawling system (Chakrabarti et al., 1999), software for topical resource discovery. This system learns the specialization by examples and then explores the Web, guided by a relevance and popularity rating mechanism. The system selects Web pages at the data-acquisition level, so it can crawl to a greater depth (dozens of links away from the starting set of Web pages) when on an interesting trace. Focused Crawling will be used for automated building of high-quality collections of Web documents on specific topics.

An interesting approach to Web search, sometimes called a 'surfing engine', is developed by the Alexa service (URL 29). When an Alexa user navigates to a Web page, the Alexa service retrieves data about this page and presents them to the user. These data (or metadata) include information on who is behind the site the user is navigating to, how often this site is updated, how fast this site responds to requests, what its popularity ranking is, etc. All this information helps the user to decide whether the site is what he or she is looking for. The Alexa service also retrieves information from its servers to suggest to the user some other pages (Related Links) that might be of interest. To find Related Links the Alexa service uses the paths of the collective Alexa community, information about clusters of similar Web sites found by prior analysis, analysis of texts on individual Web pages, etc. To avoid the “Not Found” message the Alexa service checks its archive to see whether it can find an archived version of the page.

Visual information retrieval becomes increasingly important as the amount of image and video information in different computing environments continues to grow at an extremely fast rate. This kind of retrieval is very specific and complex, and even formulating the query is not simple since text queries are unable to adequately express the query requirement, making the query processing inefficient. Readers interested in this field are recommended to read two introductory papers (Gupta and Jain, 1997; Chang et al., 1997).

Traditional search in the batch mode leads to situations where users may miss recently added Web pages because of the slow update of search engine databases (crawlers often revisit the same Web site after one or two months). This led IBM to develop the Fetuccino (URL 30) software, an interesting combination of traditional search in batch mode and dynamic search in real time. In the first stage some traditional batch search engine is exploited. In the second stage results obtained in the first stage are refined via dynamic searching and crawling. This is done by feeding the search results from the first phase to Mapuccino's dynamic mapping system, which augments those results by dynamically crawling in directions where relevant information is found.

Natural language processing has an important role in developing advanced search systems because of the fundamental problem of presenting
accurate meaning of the search request to the search system. Natural language includes a lot of ambiguities on various levels (morphological, syntactic, semantic, discourse and pragmatic), and natural language processing can be used in all stages of information retrieval processing (Feldman, 1999). Some of the research directions in this field are entity extraction, cross-language retrieval, automatic information categorization, and summarization. Entity extraction may e.g. enable extraction of names of people, places, and chronological information from text, store them and enable retrieval of this information with natural language search requests. Summarization attempts to automatically reduce document text to its most relevant content based on the user requirements.

An example of research in the use of natural language processing in search engines is a system that performs query expansion using the online lexical database WordNet (Moldovan and Mihalcea, 2000). This system first disambiguates the sense of the query words using the WordNet database, so that each keyword in the query is mapped into its corresponding semantic form as defined in WordNet. After that, WordNet is used for finding synonyms of the query semantic concepts that are subsequently used in Internet search. Documents found in the search are subjected to a new search using an operator which extracts only the paragraphs that provide relevant information to the query. The goal of the system is not to retrieve entire documents but to provide the user with answers, so it returns to the user the paragraphs that may contain the answer.

Software agents (bots) are used for various search activities (URL 31) like searching remote databases, multiple virtual bookstores, or searching and indexing Internet resources. One of the research projects from the MIT Media Lab is the Expert Finder project (URL 32) that helps in finding someone who is an expert in some area. The project is based on the assumption that each user has an agent that knows about his or her areas and levels of expertise. When necessary, the user asks the agent to find another user that is an expert in some area. The agent then goes out on the Internet and exchanges information with other expert finder agents, getting their users' profiles. After that, the agent presents to the user a list of people it thinks may be able to help, with the best candidates first. The user can now exchange messages with the selected experts.

Intensive use of the XML markup language is expected to drastically improve information retrieval on the Web and make it more efficient. XML (eXtensible Markup Language) provides a standard system for browsers and other applications to recognize the type of data in documents. This enables search engines to search only certain fields in the documents, instead of the whole documents. This will lead to much faster and more precise search. However, in order to enable use of XML each industry will have to set up its own standards for document structures, Web site providers will have to tag the pages according to the standard structures, and search engine indexing applications will have to hold tag information as metadata.

Currently there are more than 400 million mobile phone users in the world, and it is estimated by Forrester Research that by 2004, 95% of all mobile users will be Internet enabled. Because of this, in the beginning of 2000 several search engines started offering search under the Wireless Application Protocol (WAP). The FAST search engine thus launched its WAP search engine (URL 33) in February 2000, offering search of an index with more than 100,000 WML (Wireless Markup Language) documents, growing at a rate of several hundred percent per month. Besides general search, FAST is preparing search for location information (e.g. location of nearest hospitals or hotels), real-time alerts (e.g. about congested traffic in the area) and image and video streaming. The Google search engine offers WAP search (URL 34) of both WML and HTML documents, since less than 1% of Web sites are available in WML. When a wireless user requests an HTML page, Google translates the requested HTML document on the fly into WML.

7. Conclusion

In the last few years significant improvements were made in a number of areas of World Wide Web search. This paper first presented the current state of the World Wide Web, the speed of its development, as well as its heterogeneity. After that, the paper discussed some of the most

New Methods and Tools for the World Wide Web Search 275

advanced recent approaches to the Web search, 9 S. LAWRENCE AND C. L. GILES Accessibility of in-

as well as various novel operational Web search formation on the web. Nature, 400 1999 , 107-109. elements or systems in the area of subject trees, 10 S. LAWRENCE AND C. L. GILES Searching the Web:

search engines and database search. general and scientific information access. IEEE Communications Magazine, 37 1999a , 116-122.

There is certainly room for further advancements of both technology and services. One 11 S. LAWRENCE,C.L.GILES AND K. BOLLACKER Dig-

type of service that would be extremely helpful ital libraries and autonomous citation indexing. IEEE Computer, 32 1999 , 67-71.

for professionals are specialty search services that would exclusively cover the domain of in- 12 MEMBERS OF THE CLEVER TEAM Hypersearching

the Web. Scientiﬁc American, June 1999. Online

terest of specific groups of professionals. Cov-

mcomissue

at httpwwwscia erage should be deep, and information should be raghavanhtml

fresh and almost without any dead links. Aca- 13 D. I. MOLDOVAN AND R. MIHALCEA Improving the

demic search engine is another highly important search on the Internet by using WordNet and lexical service that should cover academic Web sites. It operators. IEEE Internet Computing, 4 2000 , No.

should carry out a deep and very frequent crawl 1. of these Web sites. Covering a fresh state of re- 14 N. SHIVAKUMAR AND H. GARCIA-MOLINA Finding

search and research publishing could e. g. lead near-replicas on documents on the Web. In Work- to decrease duplication of research work. shop on Web databases 1998 , Valencia, Spain.

References URLs

org

1. Hobbes’ Internet Timeline, httpwwwisoc

zakonInternetHistoryHIThtml 1 R. BAEZA-YATES,B.RIBEIRO-NETO Modern Infor-

mation Retrieval. Addison-Wesley, Harlow, Eng-

surveys 2. Nua Ltd., httpwwwnuaie

land, 1999.

how many online

lance 2 A. BRODER ET AL. Graph structure in the web. 3. Cyveillance comp., httpwwwcyveil

essr

Proceedings of the 9 International World Wide comnewsroompr

Web Conference 2000 Amsterdam. Online at

4. Open Directory, httpdmozorg

httpwwworgwcdromhtml

5. About.com, httpaboutcom

3 S. CHAKRABARTI ET AL. Focused Crawling: a new

annicacom approach to topic-specific Web resource discov- 6. Britannica.com, httpwwwbrit

ery. Proceedings of the 8th International World t

7. Inxight: Hyperbolic Tree, httpwwwinxigh

Wide Web Conference 1999 Toronto. Online

treehtml comproductsdeveloperhyperbolic

wwworgwpapersasearch

at http

httpwwwibmcomjava

indexhtml

querycrawling 8. IBM: Mapuccino,

mapuccinoindexhtml

4 S. CHANG ET AL. Visual information retrieval from

acom

large distributed online repositories. Communica- 9. Cartia: ThemeScape, httpwwwcarti

productsindexhtml

tions of the ACM, 40 1997 , No. 12, 63-71.

com

10. FAST, httpalltheweb

5 J. CHO AND H. GARCIA-MOLINA Synchronizing a

rnlightcom database to improve freshness. Proceedings of the 11. Northern Light, httpwwwnorthe

2000 ACM SIGMOD International Conference on

lecom

12. Google, httpwwwgoog

Management of Data 2000 , Dallas, TX.

ocom

13. Ditto.com, httpwwwditt

6 V. C ERIC 1997 , World Wide Web search: tech-

com niques, tools and strategies. Proceedings of the 19 14. Simpli.com, httpwwwsimpli

International Conference on Information Technol-

ocom

15. Oingo, httpwwwoing

ogy Interfaces ITI’2000 2000 , Pula, Croatia. m

16. AskJeeves, httpwwwaskco

7 S. FELDMAN NLP meets the jabberwocky: natural

stacom

language processing in information retrieval. ON- 17. AltaVista, httpwwwaltavi

httpwww LINE, 23 1999 , No. 3. Online at

18. AltaVista translation service, httpworld

onlineinccomonlinemagOL

altavistacom

feldmanhtml

nsoftcom

19. SYSTRAN, httpwwwsystra

8 A. GUPTA AND R. JAIN Visual Information Retrieval.

httpwwwcompleteplanetcom

Communications of the ACM, 40 1997 , No. 5, 20. BrightPlanet, eb 71-79. TutorialsDeepW

21. LexiBot, http://www.lexibot.com
22. InvisibleWeb, http://www.invisibleweb.com
23. Lycos Searchable Databases, http://dir.lycos.com/Reference/Searchable_Databases
24. INFOMINE, http://infomine.ucr.edu/search.phtml
25. Terraserver, http://terraserver.com
26. Emerge, http://emerge.ncsa.uiuc.edu
27. CiteSeer demonstration version, http://csindex.com
28. ResearchIndex, http://www.researchindex.com
29. Alexa, http://www.alexa.com
30. IBM: Fetuccino, http://www.ibm.com/java/fetuccino/index.html
31. Search agents, httpwwwbotspotcomssearchhtm
32. Expert Finder project, httpwwwmediamiteduadrianaprojectsEF
33. FAST WAP search, http://wap.fast.no
34. Google WAP search, http://www.google.com/palm
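As a supplementary illustration of the point made in the discussion of XML, that tagged documents let a search engine search only certain fields instead of whole documents, the following is a minimal sketch in Python. It is not taken from any of the systems cited in this paper. The element names (doc, title, author, body), the sample documents and the function names are all assumptions made for the example; a real deployment would rely on the document structures agreed upon within each industry.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged documents. The element names (title, author, body)
# are assumptions made for this sketch, not an industry standard.
DOCUMENTS = [
    "<doc id='1'><title>Hotel guide</title><author>Smith</author>"
    "<body>Restaurants near the harbour.</body></doc>",
    "<doc id='2'><title>Harbour restaurants</title><author>Jones</author>"
    "<body>The hotel list is given at the end.</body></doc>",
]


def build_field_index(xml_documents):
    """Map (field, word) pairs to the set of document ids containing them.

    Keeping the tag name next to each word is one simple way of holding
    tag information as metadata in the index.
    """
    index = {}
    for source in xml_documents:
        root = ET.fromstring(source)
        doc_id = root.get("id")
        for field in root:
            for word in (field.text or "").lower().split():
                token = word.strip(".,;:!?")
                if token:
                    index.setdefault((field.tag, token), set()).add(doc_id)
    return index


def search_field(index, field, word):
    """Return sorted ids of documents whose given field contains the word."""
    return sorted(index.get((field, word.lower()), set()))


def search_anywhere(index, word):
    """Whole-document search: match the word in any field."""
    hits = set()
    for (_, token), doc_ids in index.items():
        if token == word.lower():
            hits |= doc_ids
    return sorted(hits)


INDEX = build_field_index(DOCUMENTS)
print(search_field(INDEX, "title", "hotel"))   # restricted to the title field
print(search_anywhere(INDEX, "hotel"))         # any field
```

In this toy collection the title-only query matches a single document, while the unrestricted query also returns the document that mentions the word only in passing in its body. This is the gain in precision that field-level search is expected to bring.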

Received: October, 2000 Accepted: November, 2000

Contact address: Vlatko Ceric Faculty of Economics University of Zagreb Kennedyjev trg 6 10000 Zagreb Croatia

Phone: 385 1 2383280

Fax: 385 1 2335633

e-mail: vceric@efzg.hr

DR. VLATKO CERIC is a professor at the Faculty of Economics, University of Zagreb. His main research interests are simulation modelling, decision support systems, information retrieval, electronic commerce and operations management. He has published over 80 papers and a few books, and has led several research and application-oriented projects.