DATA SEARCH ENGINE Sketch the Idea of Building a Database Search Engine

D DATA SEARCH ENGINE sketch the idea of building a database search engine. Finally, we introduce the key components of metasearch INTRODUCTION engines, including both document metasearch engines and database metasearch engines, and the techniques for The World Wide Web was first developed by Tim Berners- building them. Lee and his colleagues in 1990. In just over a decade, it has become the largest information source in human history. DOCUMENT SEARCH ENGINE The total number of documents and database records that are accessible via the Web is estimated to be in the hun- Architecture dreds of billions (1). By the end of 2005, there were already over 1 billion Internet users worldwide. Finding informa- Although the architectures of different Web search engines tion on the Web has become an important part of our daily may vary, a typical document search engine generally lives. Indeed, searching is the second most popular activity consists of the following four main components as shown on the Web, behind e-mail, and about 550 million Web in Fig. 1: Web crawler, Indexer, Index database, and Query searches are performed every day. engine. A Web crawler, also known as a Web spider or a Web The Web consists of the Surface Web and the Deep Web robot, traverses the Web to fetch Web pages by following the (Hidden Web or Invisible Web). Each page in the Surface URLs of Web pages. The Indexer is responsible for parsing Web has a logical address called Uniform Resource Locator the text of each Web page into word tokens and then (URL). The URL of a page allows the page to be fetched creating the Index database using all the fetched Web directly. In contrast, the Deep Web contains pages that pages. When a user query is received, the Query engine cannot be directly fetched and database records stored in searches the Index database to find the matching Web database systems. It is estimated that the size of the Deep pages for the query. Web is over 100 times larger than that of the Surface Web (1). Crawling the Web The tools that we use to find information on the Web are A Web crawler is a computer program that fetches Web called search engines. Today, over 1 million search engines pages from remote Web servers. The URL of each Web page are believed to be operational on the Web (2). Search identifies the location of the page on the Web. Given its engines may be classified based on the type of data that URL, a Web page can be downloaded from a Web server are searched. Search engines that search text documents using the HTTP (HyperText Transfer Protocol). Starting are called document search engines, whereas those that from some initial URLs, a Web crawler repeatedly fetches search structured data stored in database systems are Web pages based on their URLs and extracts new URLs called database search engines. Many popular search from the downloaded pages so that more pages can be engines such as Google and Yahoo are document search downloaded. This process ends when some termination engines, whereas many e-commerce search engines such as conditions are satisfied. Some possible termination condi- Amazon.com are considered to be database search engines. tions include (1) no new URL remains and (2) a preset Document search engines usually have a simple interface number of pages have been downloaded. As a Web crawler with a textbox for users to enter a query, which typically may interact with numerous autonomous Web servers, it is contains some key words that reflect the user’s information important to design scalable and efficient crawlers. needs. Database search engines, on the other hand, usually To crawl the Web quickly, multiple crawlers can be have more complex interfaces to allow users to enter more applied. These crawlers may operate in two different man- specific and complex queries. ners (i.e., centralized and distributed). Centralized craw- Most search engines cover only a small portion of the lers are located at the same location running on different Web. To increase the coverage of the Web by a single search machines in parallel. Distributed crawlers are distributed system, multiple search engines can be combined. A search at different locations of the Internet, and controlled by a system that uses other search engines to perform the search central coordinator; each crawler just crawls the Web sites and combines their search results is called a metasearch that are geographically close to the location of the crawler. engine. Mamma.com and dogpile.com are metasearch The most significant benefit of distributed crawlers is the engines that combine multiple document search engines reduction in communication cost incurred by crawling whereas addall.com is a metasearch engine that combines activity. Centralized crawlers, however, are easier to multiple database search engines for books. From a user’s implement and control than distributed crawlers. perspective, there is little difference between using a search As the Web grows and changes constantly, it is neces- engine and using a metasearch engine. sary to have the crawlers regularly re-crawl the Web and This article provides an overview of some of the main make the contents of the index database up to date. Fre- methods that are used to create search engines and meta- quent re-crawling of the Web will waste significant search engines. In the next section, we describe the basic resources and make the network and Web servers over- techniques for creating a document search engine. Then we 1 Wiley Encyclopedia of Computer Science and Engineering, edited by Benjamin Wah. Copyright # 2008 John Wiley & Sons, Inc. 2 DATA SEARCH ENGINE term should be a monotonically decreasing function of its Web Crawler Web pages Indexer document frequency. Currently, most Web pages are for- matted in HTML, which contains a set of tags such as title and header. The tag information can be used to influence the weights of the terms for representing Web pages. For Query Query engine Index example, terms in the title of a page or emphasized using World Wide Web database Query reponse bold and italic fonts are likely to be more important in User Query query resuts representing a page than terms in the main body of the page with normal font. To allow efficient search of Web pages for any given User Interface query, the representations of the fetched Web pages are Figure 1. The general architecture of a document search engine. organized into an inverted file structure. For each term t,an inverted list of the format [(p1, w1), ...,(pk, wk)] is generated and stored, where each pj is the identifier of a page contain- loaded. Therefore, some incremental crawling strategies ing t and wj is the weight of t in pj,1jk. Only entries with should be employed. One strategy is to re-crawl just the positive weights are kept. changed or newly added Web pages since the last crawling. The other strategy is to employ topic-specific crawlers to Ranking Pages for User Queries crawl the Web pages relevant to a pre-defined set of topics. A typical query submitted to a document search engine Topic-specific crawling can also be used to build specialized consists of some keywords. Such a query can also be repre- search engines that are only interested in Web pages in sented as a set of terms with weights. The degree of match some specific topics. between a page and a query, often call the similarity, can be Conventional Web crawlers are capable of crawling only measured by the terms they share. A simple approach is to Web pages in the Surface Web. Deep Web crawlers are add up the products of the weights corresponding to the designed to crawl information in the Deep Web (3). As matching terms between the query and the page. This information in the Deep Web is often hidden behind the approach yields larger similarities for pages that share search interfaces of Deep Web data sources, Deep Web more important terms with a query. However, it tends to crawlers usually gather data by submitting queries to these favor longer pages over shorter ones. This problem is often search interfaces and collecting the returned results. addressed by dividing the above similarity by the product of the lengths of the query and the page. The function that Indexing Web Pages computes such type of similarities is called the Cosine After Web pages are gathered to the site of a search engine, function (4). The length of each page can be computed they are pre-processed into a format that is suitable for beforehand and stored at the search engine site. effective and efficient retrieval by search engines. The Many methods exist for ranking Web pages for user contents of a page may be represented by the words it queries, and different search engines likely employ differ- has. Non-content words such as ‘‘the’’ and ‘‘is’’ are usually ent ranking techniques. For example, some ranking meth- not used for page representation. Often, words are con- ods also consider the proximity of the query terms within a verted to their stems using a stemming program to facil- page. As another example, a search engine may keep track itate the match of the different variations of the same word. of the number of times each page has been accessed by users For example, ‘‘comput’’ is the common stem of ‘‘compute’’ and use such information to help rank pages. Google and ‘‘computing’’. After non-content word removal and (www.google.com) is one of the most popular search engines stemming are performed on a page, the remaining words on the Web.

Load more