Faculdade De Engenharia Da Universidade Do Porto Mestrado Gestão De Informação

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

MESTRADO GESTÃO DE INFORMAÇÃO

Information Retrieval Techniques in Commercial Systems

Professor Mark Sanderson

Fernando Luis Poças da Silva

Porto, June 2002

Index

1. Introduction

2. Analysing – Google

Conclusion

3. Analysing – Altavista

Conclusion

4. References

1. Introduction

This document is about the search engine technology involved when we search for specific information. In the beginning, information specialists existed to deliver to people the information they needed, but computer-based systems have revolutionised the way this is done. Understanding how search engines work helps us position our web site better when we want it to rank well, and helps us choose the search engine that best serves our goals. Today, as far as information needs are concerned, the Web is to us what the Encyclopaedia Britannica was to our grandparents. The growth of the public Internet and of enterprise intranets has given more people access to more information, on demand, than ever before in history. Consequently, information retrieval has become a leading challenge for information systems managers and developers. The problem is that the material available on the Web is poorly organized and of variable quality and stability; it is difficult to conceptualise, browse, search, filter, or reference.¹

Interactions with information include much more than traditional information retrieval: locating and selecting among relevant sources, retrieving information from them, interpreting what was retrieved, managing the filtered-out information locally, and sharing results with others.² Technology is gaining ground through the use of artificial intelligence, more advanced algorithms, and increasing computational power, and speed and efficiency play the role of the joker. Even so, technology cannot replicate human skill in cataloguing information, in managing it and sharing it with the final user, and in responding precisely to the queries that users need answered. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.

Every time a query is made, the engine goes to its database and displays the results, and each engine produces different results. The differences can be substantial because each engine has its own way of organizing and retrieving information.

Unlike libraries, which have rules accepted by a huge community of librarians, search engines have no common rules for operating or cataloguing, which means you have to understand how a particular engine works in order to obtain the best results. The problem is that using such powerful technology can produce an avalanche of results, for example: “34.567 documents found…”. This happens because web engines initially put their effort into finding the largest number of documents, from the largest possible number of URLs, in the shortest time. Such aims could not be accomplished without sacrificing the precision of the search. The aim must take into consideration not only recall, but also the ability to deliver precise results.
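The trade-off between recall and precision can be made concrete with a small calculation. The following is a minimal sketch in Python; the document identifiers and the relevance judgements are invented for illustration and do not come from any real engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the engine
    relevant:  document ids a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgements: only four documents are actually relevant.
relevant_docs = {"d1", "d2", "d3", "d4"}

# An "avalanche" answer of 100 documents versus a short, focused answer.
broad_result = ["d%d" % i for i in range(1, 101)]
narrow_result = ["d1", "d2", "d9"]

broad = precision_recall(broad_result, relevant_docs)
narrow = precision_recall(narrow_result, relevant_docs)
print("broad: ", broad)
print("narrow:", narrow)
```

In this toy case the broad answer finds every relevant document but buries it among irrelevant ones, while the narrow answer is mostly relevant but misses half of what exists.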

¹ Levy, David M. – Cataloging in the Digital Order. Xerox Palo Alto Research Center. http://www.csdl.tamu.edu/DL95/papers/levy/levy.html , on 2002-05-19

² Paepcke, Andreas – Digital Libraries: Searching is not enough. D-Lib Magazine, 1996, http://www.dlib.org/dlib/may96/stanford/05paepcke.html , on 2002-05-19

When analysing a search engine, we must keep in mind that the indexing software uses agents called robots or spiders that constantly crawl the Web. In the next pages I intend to analyse, based on experiments and the available bibliography, two web search engines. Both supply web pages in response to a query based on four factors:

1. Index size: a search engine supplies a web page if and only if that page has been indexed by the engine in question.
2. Index quality.
3. Query language: each search engine has its own way of letting us search.
4. Index ranking: the engine sorts the pages and gives us the 10 to 30 highest ranked first.

Page ranking is a very complicated thing. In short, search engines rank depending on the ‘title’, ‘description’, ‘keywords’ and the text in the ‘body’ of the page. The author of the page decides the content of these elements, but it is the search engine that decides how much weight, often very different, each of them receives.
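How an engine might combine the title, description, keywords and body of a page can be sketched as a weighted count of query-term matches per field. The field weights and the sample pages below are assumptions chosen for illustration; real engines do not publish their weights.

```python
# Hypothetical field weights; each engine chooses (and hides) its own.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "keywords": 1.5, "body": 1.0}


def field_score(page, query):
    """Count query-term occurrences in each field, scaled by the field weight."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score


def rank(pages, query, top=10):
    """Return the URLs of the `top` highest-scoring pages, best first."""
    ordered = sorted(pages, key=lambda p: field_score(p, query), reverse=True)
    return [p["url"] for p in ordered[:top]]


pages = [
    {"url": "b.example", "title": "Cooking", "body": "search for recipes"},
    {"url": "a.example", "title": "Search engines",
     "body": "how a search engine ranks pages"},
]
print(rank(pages, "search engine"))
```

Changing the numbers in `FIELD_WEIGHTS` changes the order of the results for the same pages, which is one reason two engines can rank the same page very differently.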

However, developments in indexing and searching technologies aimed at improving search relevance cover such issues as:

1. Search results clustering, which is already used by many search engines.
2. Page/source ranking using number of hits or query caching (e.g., the Google search engine).
3. Topic "distillation" and finding authoritative sources for categorised information/topics (e.g., as used in IBM's CLEVER project).
4. Learning classification algorithms and knowledge network technologies, which already have first implementations in librarian subject gateway systems (for example the DESIRE2 Project and the Cora search engine for computer science research papers).
5. Directory-enabled "real names" searching (e.g., Altavista's Real Name or Netscape's keywords.netscape.com).
6. Multilinguality and cross-language searching (for example the TUSTEP³ system developed at Tuebingen University, providing fuzzy searching in all European languages).

So, in order to advise a company on adopting a search engine for its intranet, I first have to evaluate the different engines, with their characteristics and strengths. Given these trends, I chose to rely on the following criteria to evaluate the choice:

1. Accuracy – Does the system deliver the highest possible precision and recall? Inefficiency at small scales will only become more wasteful as the size and scope of the system increase. Productivity, cost-effectiveness, and return on investment will always be important.
2. Scalability – Is the system scalable? Can it handle gigabyte, and even terabyte, sized databases? Will it handle thousands of users? Growth is inevitable. Many systems that perform adequately with test-sized databases at pilot and workgroup levels may fail completely when they are rolled out to the entire enterprise and implemented on massive databases.
3. Extensibility – Will the system meet the entire range of users' needs? Can it handle the full spectrum of information in the collection? Can it be seamlessly extended to data types other than text?

³ http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html , on 2002-05-19

With this objective I have collected some material from the Web, summarized what seem to me to be the two best search engines, and compared their features. I think there is no need to explain what PageRank is, or any other feature of the search engines: given the classes we took and what is available on the Web, I would just be repeating known concepts. However, there is an excellent paper about Google PageRank by Jill Whalen⁴ that summarizes what PageRank is all about.

The next table (table 1)⁵ gives very useful information about the sources each search site uses and the sites it feeds:

Search Engine | Partnership
AltaVista | Feeds LookSmart and uses LookSmart's directory & ODP.
AOL Netfind | Uses Inktomi for US & Canada and uses Lycos for Europe & ODP.
Ask Jeeves | Building own index; search results are then manipulated by Direct Hit with meta data from Alta, Excite, WebCrawler, and Goto.com.
Direct Hit | (Now owned by AskJeeves.) Using own database with supplemental results from ODP and AskJeeves. Influences searches on HotBot, AskJeeves, iWon, ZDNET, and MSN.
Fast (alltheweb.com) | Feeds Lycos.
Go2Net | Formerly called Metacrawler. Uses a variety of search engines including Alta, DirectHit, Excite, Google, Goto, Infoseek, Lycos, Looksmart and WebCrawler, and is the only meta search to use Thunderstone.
Google | Feeds Netscape Netcenter, gets from ODP, and feeds non-directory results for Yahoo.
HotBot | Uses ODP. Primary results from Inktomi, ODP Directory, with results manipulated by Direct Hit.
InfoSeek | Feeds Search.com and is now the major engine behind GO.com. Has own InfoSeek directory.
Iwon | Gets from Inktomi with directory listings by Looksmart, and the whole thing manipulated by Direct Hit.
Inktomi | Feeds Hotbot, iWon, AOL search, Snap, GOO.jp, Looksmart, MSN Search and Goto. Also has dozens of customers in Europe and the Far East. A full list.
LookSmart | Gets search results from Inktomi and serves AltaVista directory, Excite, Iwon, and Search.msn. Also powers 2000 to 3000 ISP homepages.
Lycos | Now uses ODP Directory and Fast (AllTheWeb) search results. Some Direct Hit data influenced.
Magellan | Uses WebCrawler.
Netscape Search | Uses Google, ODP, About.com/Mining CO.
Open Directory Project (GnuHoo/NewHoo) | Feeds Netscape Search, Alta, Hotbot, Lycos, Direct Hit, and thousands of smaller sites.
Search MSN | Directory results from LookSmart. Results from Inktomi and Direct Hit.
Yahoo | Uses Google for non-directory matches.

Table 1 – Search engine vs enterprises they serve

Analysing this table, we can see that Google supports a major site like Yahoo, and that Inktomi is very popular in the United States, reaching 82% of U.S. internet users.

In the next table (table 2)⁶ we can see some of the search engine math and Boolean commands supported by the different search engines:

⁴ http://www.searchengineguide.com/rankwrite/2001/1123_rwl.html , on 2002-05-20

⁵ http://www.ontheavenues.com/FYI/understanding_the_search_engine.htm , on 2002-05-20

Command | How | Supported By
Include Term | + | All but LookSmart (does work for LookSmart's Inktomi results)
Exclude Term | - | All but LookSmart (does work for LookSmart's Inktomi results; also, will not work for preprogrammed results to popular queries at MSN Search)
Phrase | " " | All but Direct Hit, LookSmart, MSN Search (does work for LookSmart's Inktomi results; at MSN Search, unpredictable about when it works)
Match Any Term | Auto | AltaVista, Direct Hit, Excite, LookSmart. Not yet updated, but may be still correct: Netscape, Yahoo
Match Any Term | adv. search page | AllTheWeb, AOL Search, Google, HotBot, Lycos, MSN Search
Match Any Term | Other | Northern Light (use OR)
Match All Terms | Auto | AllTheWeb, AOL Search, Google, HotBot, Lycos, MSN
Match All Terms | Other | Can usually be done with + symbol or adv. search page
Or | OR | AltaVista, AOL Search, Excite, Google, Inktomi (HotBot, MSN), Lycos, Northern Light
Or | None | AllTheWeb, Direct Hit, LookSmart. Not yet updated, but may be still correct: Yahoo
And | AND | AltaVista, AOL Search, Excite, Inktomi (HotBot, MSN), Lycos, Northern Light
And | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Not | NOT | AOL Search, Excite, Inktomi (HotBot), Lycos, Northern Light
Not | AND NOT | AltaVista, Inktomi (MSN). Not yet updated, but may be still correct: Netscape
Not | None | AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Nesting | ( ) | AltaVista, AOL Search, Excite, Inktomi (MSN), Northern Light
Nesting | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot), LookSmart, Lycos. Not yet updated, but may be still correct: Yahoo
Near | NEAR | AltaVista (10 words), AOL Search (specify number), Lycos (25 words)
Near | None | AllTheWeb, Direct Hit, Google, Inktomi (HotBot, MSN), LookSmart

Table 2 – Search commands and Boolean search
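The include (+), exclude (-) and phrase (" ") commands of Table 2 can be illustrated with a small matcher. This is a simplified sketch of the semantics only, assuming "match any term" for unmarked words; it is not any engine's actual parser, and the function names are hypothetical.

```python
import re

# An optional +/- sign followed by either a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')


def parse_query(query):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TOKEN.findall(query.lower()):
        term = phrase or word
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional


def matches(text, query):
    """True if the text satisfies the +, - and phrase conditions of the query."""
    text = " ".join(text.lower().split())
    required, excluded, optional = parse_query(query)

    def contains(term):
        # Whole-word (or whole-phrase) match.
        return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

    if any(contains(t) for t in excluded):
        return False
    if not all(contains(t) for t in required):
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or not optional or any(contains(t) for t in optional)


doc = "Google and Alta Vista are search engines"
print(matches(doc, '+google "search engines"'))   # include plus phrase
print(matches(doc, '+google -"alta vista"'))      # excluded phrase is present
```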

⁶ More details can be found at http://www.searchenginewatch.com/facts/powersearch.html and http://www.searchenginewatch.com/facts/boolean.html , both on 2002-05-20

Regarding search commands, these are the features provided by each of the two search engines under evaluation:

Functionality | Google | Altavista
Title Search | allintitle: intitle: | title:
Site Search | site: | host:
URL Search | allinurl: inurl: | url:
Link Search | link: | link:
WildCard Search | none | *
Related Searches | N | Y
Clustering | Y | Y
Find Similar | Y | Y
Search within | Y | Y
Spidered version (cached) | Y | N
Search by language | Y | Y
Page translation | Y | Y
Porn Filter | Y | Y
Porn Warning | N | N
Ability to increase number of listings | Y | Y
See 50 results | Y | Y
See 100 results | Y | N
Sort by date | N | N
Date Range | Y | Y
Date displayed? | N | Y
Display titles only? | N | Y
Customize options | Y | Y
Translate search sentence | Y | Y
Case sensitive | N | Y (if every word is capitalized)
Accent sensitive | Y | Y
“Did you mean:” hint | Y | Y

Table 3 – main features of Google and Altavista

Having summarized the main features of each search engine analysed, it is time to point out the properties of each. As for the algorithms they use, information is very difficult to find, but there are some clues about how they work.

2. Analysing – Google

Google presents itself as searching more than 2 billion pages, and is distinguished by a ranking algorithm based on how many other pages link to each page, along with factors like the proximity of the search keywords or phrases in the documents. As the next two images demonstrate, altering the order of the words in the search sentence changes the rank position of the sites. Google is the biggest⁷ search engine database in the world. The name Google has its origins in a common spelling of googol, or 10^100, and fits well with the authors' goal of building very large-scale search engines.

Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems. The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank. Second, Google uses the text of links to improve search results. Google alters its search algorithm frequently, so link farms, deep cross-linking and a host of other tricks have been banned. PageRank is only one factor in the algorithm and, according to some, not the most important. Peter Norvig, Google's director of machine learning, told NewsFactor⁸ that they look at the links but also at a lot of different things. So if you search for a specific sentence, Google will find all the sites that refer to that sentence, then take the ones most often pointed to by other sites and treat those as the most relevant, so the user receives results that include the most popular web site matching the search sentence. This is a good idea because a link built by a person who knows the subject, and cares enough to create it, is more relevant than someone's mere opinion about the subject.
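The link-based quality ranking can be sketched as a power iteration over a small link graph. This follows the general idea described by Brin and Page (see the references) in its commonly used normalised form; the damping factor of 0.85, the toy graph and the function name are assumptions for illustration, and the sketch says nothing about the other factors Google combines with PageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy web: three pages point to C, and C points back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In the toy graph, C ranks first because it is the page most often pointed to, and A ranks above B and D because its single incoming link comes from the highly ranked C.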

⁷ http://www.searchengineshowdown.com/stats/ , on 2002-06-05

⁸ http://www.newsfactor.com/perl/story/16768.html , on 2002-06-05

As we can observe from these two images, altering the order of the words in the search sentence alters the rank order, although the number of search results remains the same: 534.

Another feature is that Google searches the full text of almost every PDF file, as well as Microsoft Office documents, Corel WordPerfect and other file formats; no other engine offers this feature. It is very customizable, offering a wide variety of things to search, such as news, images, groups, the directory and UncleSam search, and a toolbar that is very useful for customizing your browser.

[Images: Google toolbar; document format options (pdf, ps, etc.)]

A very useful feature is that we can consult the page as cached by the crawlers when the web site is offline. This relates to the dead-link problem, which can be minimized by serving the cached page.

Google is a good engine for:

1. Fast search, large index – uses “group judgment” to rank results. Cached pages for dead links.
2. Searching US federal, state and local government web pages, by using Google's Uncle Sam.
3. Finding information in presentations, spreadsheets and other formats – pdf, rtf, doc, ps.
4. Finding images and sounds, media types or file extensions – with over 150 million digital images, Google offers a huge media library.
5. Advice and opinions from other users – using Google Groups to search in newsgroups.
6. It can search most stop words.

Conclusion

For the Google people, what you say about a page becomes just as important as the actual content of the page. The page must be what other people say it is. That Google adheres to this rule and is by far the most effective search engine raises many interesting issues. Google is smart, though: simply having many copies of the same link with the same phrase on a single page will do nothing. It takes a multitude of pages carrying that link with the specific link text. But this power can be harnessed with a concentrated group effort.
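The idea that what other pages say about a page counts, and that repeating a link on a single page adds nothing, can be sketched by counting distinct source pages per link phrase. This is only an illustration of the principle; the data, the names and the counting rule are hypothetical and are not Google's method.

```python
from collections import defaultdict


def anchor_votes(links):
    """Count, per target page and link phrase, the distinct pages using it.

    links: iterable of (source_page, target_page, link_text) tuples.
    Returns {target_page: {phrase: number_of_distinct_source_pages}}.
    """
    sources = defaultdict(set)
    for source, target, text in links:
        # A set ignores repeats of the same link on the same source page.
        sources[(target, text.lower().strip())].add(source)
    votes = defaultdict(dict)
    for (target, phrase), pages in sources.items():
        votes[target][phrase] = len(pages)
    return dict(votes)


links = [
    ("p1", "site.example", "best search engine"),
    ("p1", "site.example", "best search engine"),   # repeated on one page
    ("p1", "site.example", "best search engine"),
    ("p2", "site.example", "best search engine"),
    ("p3", "site.example", "Best Search Engine"),
    ("p4", "other.example", "best search engine"),
]
votes = anchor_votes(links)
print(votes)
```

Here the three copies of the link on p1 count once, so site.example gets three votes for the phrase only because three different pages use it.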

3. Analysing – Altavista

Apart from having one of the biggest indexes, AltaVista is a search engine that has seen better days. Through constant innovation, AltaVista has been awarded the most search-related patents of any company. According to some articles, its index engine treats every page on the Web and every article of Usenet news as a sequence of words. One nice thing is that we can always count the occurrences of a term in the index. If a keyword occurs too many times in the index, it will be ignored in the search itself, but the number of times the word appears in the index will still be counted; such words are called “stop words”. As the database grows, more and more words will have a number of occurrences above this stop-word threshold, so what is a valid term today can be a stop word tomorrow. AltaVista is a full-text index, so terms in the index will be roughly distributed according to Zipf's law. It had a very good feature, SORT, that provided searching inside the results, but it seems to have been abandoned. According to some reading, it now provides ranking by payment.
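The stop-word behaviour can be sketched as a frequency threshold over the index: a term whose occurrence count exceeds the threshold is still counted but is left out of the search. The threshold value and the tiny corpus below are assumptions for illustration; AltaVista's actual cut-off is not given in the sources.

```python
from collections import Counter


def term_counts(documents):
    """Count every occurrence of every word across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def stop_words(counts, threshold):
    """Terms occurring more often than the threshold are treated as stop words."""
    return {term for term, count in counts.items() if count > threshold}


def searchable_terms(query, counts, threshold):
    """Drop stop words from a query, keeping the original word order."""
    stops = stop_words(counts, threshold)
    return [term for term in query.lower().split() if term not in stops]


docs = ["the web is large", "the index of the web", "the engine crawls the web"]
counts = term_counts(docs)
print(counts["the"], counts["web"])
print(searchable_terms("the web index", counts, threshold=4))

# As the index grows, "web" crosses the same threshold and becomes a stop word.
bigger = term_counts(docs + ["web search on the web"])
print(sorted(stop_words(bigger, threshold=4)))
```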

It is possible to customize our search by using the settings:

And, as we can see in the next images, the order of the words in a search sentence does not influence the results.

Alta Vista ranks the results of a search on criteria that, according to a help file that comes with the Personal Extensions program, include these:

· Whether the words or phrases are found in the first few lines of the document (for example, in the title of a web page).
· The frequency of occurrence of a query word or phrase. Rare words in a query are weighted more heavily than common words (rarity is determined by the number of occurrences of the word in the index).
· Whether all of the specified words or phrases appear in a document. A document containing all three words specified in a three-word query would rank higher than a document containing only two or one of the words.
· Whether multiple query words or phrases are found close to each other in a document.
· Content that is useful and unique.
· Placement of content on the page. Use "newspaper" style placement. Text at the top of the page is generally assigned greater weight than text at the bottom of the page.
· Title tags. The page title is critical.
· Keyword and description META tags.
· Link popularity, with links coming from valuable sites. Artificial links are considered spam. Anchor text should accurately describe the page's content.
· Participating in their inclusion programs doesn't have a special effect on rankings either.

I think the general rule for ranking documents is obvious and logical: the more occurrences a word has in the index, the smaller its weight becomes (the greater the probability that a word will occur, the less information it carries). The main problem to overcome is the Zipf distribution. What is a good measure for computing index weights? The enormous differences between the numbers of times given terms appear in the index should not, and cannot, give rise to differences of the same order in the weights, because that would allow one term to dominate the others in a way that would make any serious information retrieval impossible. With advanced searches we may also specify keywords for Alta Vista to use in ranking our results. This is a very powerful feature that lets us control which items are ranked at the top of the hit list.
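One common textbook answer to the question of how to compute index weights under a Zipf distribution is to take a logarithm, as in inverse document frequency (IDF) weighting, so that a difference of several orders of magnitude in counts becomes a modest difference in weight. Whether Alta Vista uses this formula is not known; the sketch below uses the standard form with invented counts.

```python
import math


def idf(doc_freq, total_docs):
    """Inverse document frequency: rarer terms get larger weights."""
    return math.log(total_docs / doc_freq)


def score(doc_terms, query_terms, doc_freqs, total_docs):
    """Sum the IDF weights of the query terms that occur in the document."""
    present = set(doc_terms)
    return sum(
        idf(doc_freqs[term], total_docs)
        for term in query_terms
        if term in present and term in doc_freqs
    )


# Invented figures: number of documents containing each term.
total = 1_000_000
doc_freqs = {"the": 900_000, "search": 50_000, "zipf": 100}
weights = {term: idf(df, total) for term, df in doc_freqs.items()}
for term, weight in weights.items():
    print(term, round(weight, 2))

query = ["the", "search", "zipf"]
full = score(["the", "search", "zipf", "law"], query, doc_freqs, total)
partial = score(["the", "search"], query, doc_freqs, total)
print(full > partial)
```

With these figures the counts of "the" and "zipf" differ by a factor of 9000, while their weights differ by a factor of less than 100, so the rare term counts for more without drowning out the others.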

Altavista is a good engine for:

1. Focusing my search – before searching, choose options like Boolean or phrase, domain, timeframe or date, and use the AltaVista search assistant.
2. Improving my search – from the results, choose thesaurus words or phrases to refine the search; it shows related phrases that others have used in similar searches.
3. Searching on often-ignored words or phrases – it includes little words in the search.
4. Hard-to-find or late-breaking news – AltaVista News offers a fast response time, organizes news in categories, and uses the Moreover database.
5. Pinpoint searches using a unique phrase or word – the best engine for unique word or phrase searches.

Conclusion

Besides some as yet unknown distance metric, there seem to be only four criteria for ranking scores: two positional ones, and two related to occurrences in the text. The positional ones concern the words contained in the HTML TITLE tags, and the criteria concerning the occurrences (or frequency) of keywords relate to the body text of the document. However, the Alta Vista index engine still relies on brute computing power. As for relevance, its organization of results by relevance seems poor to fair. It can do very specific searches, it retrieves a large number of results (not always relevant), and it is a powerful search engine. Use it when you want to search the entire Web, when you want a large number of results, or when doing a specific search for obscure sites, documents or information.

What to remember when using AltaVista: use +, - and "" for simple searches on the regular search page; use the advanced search page when you wish to use Boolean searching options; use AND, OR and ( ) for advanced searches.

4. References

Remoaldo, Pedro Leonel – The Skill to Search: End-users versus Information Specialists. 1997. http://riff.fe.up.pt/~mgi97018/ari/ariessay.html , on 2002-05-02

Lyman, Jay – Google Search Engine Unfazed by ‘Googlewhackers’. NewsFactor. http://www.osopinion.com/perl/story/16768.html , on 2002-05-19

Barker, Joe – Alta Vista Advanced Search. UC Berkeley Library. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/AltaVista.html , on 2002-05-19

Baeza-Yates, Ricardo and Ribeiro-Neto, Berthier – Modern Information Retrieval. Addison Wesley, 1999.

Kobayashi, Mei and Takeda, Koichi – Information Retrieval on the Web. ACM Computing Surveys, vol. 32, nº 2, p. 144-168. June 2000.

Ballard Spahr Andrews & Ingersoll, LLP – The Virtual Chase. http://www.virtualchase.com/howto/engine.html , on 2002-05-19

Gray, Terry – How to search the Web, A guide to Search Tools. http://daphne.palomar.edu/TGSEARCH , on 2002-06-01

Silva, Miguel and Costa, Artur – Um Sistema de Pesquisa Aberto, Eficiente e Escalável. http://www.uc.pt/crc98/comfin24/comfin24.html , on 2002-06-01

Brin, Sergey and Page, Lawrence – The Anatomy of a Large-Scale Hypertextual Web Search Engine. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm , on 2002-06-01

van Eylen, Dirk – Alta Vista ranking of query results. http://www.ping.be/~ping0658/avrank.html , on 2000-06-01