CS 101 - Intro to Computer Science
Web Crawlers - Mining Web Information
Dr. Stephen P. Carl
What is a Search Engine?

A search engine is a computer program that allows a user to submit a query and retrieve information from a database. Such software has capabilities such as:
- organizing an index to the information in a database,
- enabling the formulation of a query by a user, and
- searching the index in response to a query.

Types of web search engines

A search engine for the WWW uses a program called a robot or spider to index the information on Web pages.

Topical search engines organize their catalogs of sites by topic or subject. Examples are Yahoo! and Lycos a2z.

Spiders work by following all the links on a page according to a specific search strategy. The simplest strategy is to collect all links from a page and then follow them, one by one, collecting new links as new pages are visited. The content of each page is added to a database. The database is indexed, and this index is searched upon receipt of a query from the user; the search engine then presents a sorted list of matching results to the user. (A minimal sketch of this strategy appears after the robots.txt discussion below.)

Keyword or key phrase search engines let the user specify a set of keywords or phrases related to the desired content. The engine returns a list of documents containing the keywords or phrases. Most engines allow the use of booleans in queries; some support more advanced searching such as inexact matching. Examples of keyword or key phrase search engines are AltaVista, InfoSeek, Lycos, and WebCrawler.

Metasearch engines send queries to several other search engines and consolidate the results. Some, such as ProFusion, filter the results to remove duplicates and check the validity of the links. An example of a metasearch engine is MetaCrawler.

Web Robots AKA Spiders

Web robots are programs that automatically visit web pages in order to perform specific tasks, such as collecting information from the pages for indexing.

Common uses:
- retrieving web pages whose contents will be indexed by the search engine
- administrative tasks, such as collecting statistics, checking links, and verifying HTML
- distributing information

Advantages of using robots:
- Tasks can be performed automatically.
- Tasks can be executed in parallel by creating several robots to work concurrently.
- Users can easily access up-to-date information.

Disadvantages of using robots:
- Robots often visit irrelevant URLs and sometimes invoke programs (like CGI) with side effects.
- Malicious or poorly written robots can severely degrade the quality of service for legitimate users.

Types of Robots

Robots generally fall into two categories: server-invoked or client-invoked. Server robots, such as WebCrawler, are created and controlled by a server process responding to a request, while client robots are invoked directly by the user.

Client robots can cause network traffic problems. Should several thousand users search the web using personal search engines at the same time, the traffic generated by these clients will have a negative effect on network performance.

Managing Robot Access to Sites

The Robot Exclusion Standard at http://info.webcrawler.com/mak/projects/robots/norobots.html outlines a method to reduce robot problems. Robots that adhere to the standard can be excluded from a particular web page. The settings for the standard are stored in a file named robots.txt. By using the User-agent field in that file, a site can specify which robots are to be excluded. Most search engine services publish the names of their robots.
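To make the crawl strategy and the Robot Exclusion Standard concrete, here is a minimal sketch in Python (not part of the original slides). It follows the simplest strategy described above: start from a seed page, collect its links, visit them one by one, and add each page's words to a small inverted index, skipping any URL that the site's robots.txt disallows. The robot name "cs101-spider" and the seed URL are made-up placeholders, and real content extraction and politeness rules (such as crawl delays) are left out.

```python
# Minimal breadth-first spider with a tiny inverted index (illustrative sketch).
# The robot name "cs101-spider" and the seed URL are placeholders.
from collections import deque
from html.parser import HTMLParser
from urllib import parse, request, robotparser

ROBOT_NAME = "cs101-spider"

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url):
    """Consult the site's robots.txt (Robot Exclusion Standard) before fetching."""
    parts = parse.urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True                      # no robots.txt reachable: assume allowed
    return rp.can_fetch(ROBOT_NAME, url)

def crawl(seed, max_pages=10):
    """Simplest strategy: follow links one by one, collecting new links as pages are visited."""
    index = {}                           # word -> set of URLs containing it
    frontier, seen = deque([seed]), {seed}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not allowed_by_robots(url):
            continue                     # this robot is excluded from the page
        try:
            with request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                     # unreachable page or non-fetchable link
        fetched += 1
        collector = LinkCollector()
        collector.feed(html)
        for word in html.lower().split():        # crude stand-in for real content extraction
            index.setdefault(word, set()).add(url)
        for link in collector.links:
            absolute = parse.urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index

if __name__ == "__main__":
    idx = crawl("https://example.com/")          # placeholder seed page
    print(idx.get("example", set()))             # simple keyword lookup against the index
```

The last line shows the keyword-search idea in miniature: once pages are indexed, answering a single-word query is just a lookup, and boolean queries combine such lookups with set operations.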
Displaying Results

After receiving a query and working through the indexed database to find matches, how should the search engine display the results to the user?
- In alphabetical order, by title for example?
- In the order the results were found?
- Using some criterion specified by the user?

Let's see what Google does...

Displaying Results, Google Style

Google uses an algorithm called PageRank to determine not only how well a page matches a query, but also how important and relevant it might be.

PageRank was developed at Stanford University by the founders of Google, Larry Page and Sergey Brin, while they were graduate students there. Curiously, Stanford, not Google, holds the PageRank patent.

The criterion used is simple on the surface: the more links to a page, the more relevant it is considered; and the more important the pages linking to this page are, the higher the page's rank in the results. (A small worked sketch of this idea follows.)
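To illustrate how "links from important pages count more," here is a small, self-contained sketch of the PageRank-style iteration (not from the original slides). The four-page link graph, the damping factor, and the iteration count are made-up illustration values, not Google's actual data or parameters.

```python
# Tiny PageRank sketch on a hand-made link graph (illustrative values only).
# Each page repeatedly splits its rank evenly among the pages it links to,
# so a page linked from important pages ends up with a higher rank.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}       # start with equal ranks
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                          # dangling page: share rank with everyone
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A four-page web: every other page links to A, so A ranks highest,
# and D benefits from being linked to by the important page A.
web = {
    "A": ["D"],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["A"],
}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In a real engine this importance score is combined with how well each page matches the query; the sketch only shows how importance propagates through links.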