Semantic Search Engine, Information Retrieval, Web Mining, Fuzzy Logic

Total Pages: 16

File Type: pdf, Size: 1020 KB

International Journal of Web Engineering 2013, 3(1): 1-10
DOI: 10.5923/j.web.20130301.01

Enhancing Semantic Search Engine by Using Fuzzy Logic in Web Mining

Salah Sleibi Al-Rawi1,*, Rabah N. Farhan2, Wesam I. Hajim2
1 Information Systems Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq
2 Computer Science Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq
* Corresponding author: [email protected] (Salah Sleibi Al-Rawi)
Published online at http://journal.sapub.org/web
Copyright © 2013 Scientific & Academic Publishing. All Rights Reserved

Abstract: The present work describes the system architecture of a collaborative approach for semantic search engine mining. The goal is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. It involves the design and implementation of a web crawler for automating the search process, and examines how these techniques can aid in improving the efficiency of existing Information Retrieval (IR) technologies. The work is based on term frequency-inverse document frequency (tf*idf), which depends on the Vector Space Model (VSM). The average time the system consumed to respond in retrieving between 20 and 120 pages for a number of keywords was calculated to be 4.417 s. The increase percentage (updating rate) of the databases, measured over five weeks of the experiment and tens of queries, was 5.8 pages/week. The results are more accurate, with recall and precision measurements reaching 95%-100% for some queries within acceptable retrieval time, aided by a spellcheck technique.

Keywords: Semantic Search Engine, Information Retrieval, Web Mining, Fuzzy Logic

1. Introduction

Due to the complexity of human language, the computer cannot understand and interpret users' queries well, and it is difficult to determine the information on a specific website effectively and efficiently because of the large amount of information[1]. There are two main problems in this area. First, when people use natural language to search, the computer cannot understand and interpret the query correctly and precisely. Second, the large amount of information makes it difficult to search effectively and efficiently[2].

Berners-Lee et al.[3] indicated that the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Al-Rawi Salah et al.[4] suggested building a Semantic Guided Internet Search Engine to present an efficient search engine - crawl, index and rank the web pages - by applying two approaches. The first was implementing semantic principles through the searching stage, which depended on the morphology concept - applying stemming - and a synonyms dictionary. The second was implementing the guided concept during the query-input stage, which assisted the user in finding suitable, corrected words. The advantage of the guided concept is to reduce the probability of inserting wrong words in a query. As for a feature-rich semantic search engine, searches are divided into a classic search and an information search. If you search for a term that has more than one meaning, it will give you the chance to choose what you were originally looking for, with its disambiguation results.

Leelanupab, T.[5] explored three aspects of diversity-based document retrieval: 1) recommender systems, 2) retrieval algorithms, and 3) evaluation measures, and provided an understanding of the need for diversity in search results from the users' perspective. He developed an interactive recommender system for the purpose of a user study. Designed to facilitate users engaged in exploratory search, the system features content-based browsing, aspectual interfaces, and diverse recommendations. While the diverse recommendations allow users to discover more and different aspects of a search topic, the aspectual interfaces allow users to manage and structure their own search process and results regarding aspects found during browsing. The recommendation feature mines implicit relevance feedback information extracted from a user's browsing trails and diversifies recommended results with respect to document contents.

One way to improve the efficiency of a semantic search engine is Web Mining (WM). WM is the data mining technique that automatically discovers or extracts information from web documents. It consists of the following tasks[6]: resource finding; information selection and pre-processing; generalization; and analysis.

The present work describes the system architecture of a collaborative approach for semantic search engine mining. The aim of this study is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. All main parameters in this proposal are based on the term frequency-inverse document frequency (tf*idf), which depends on the VSM.

2. Theory and Methods

2.1 Vector Space Model (VSM)

The VSM suggests a framework in which the matching between query and document can be represented as a real number. The framework not only enables partial matching but also provides an ordering or ranking of retrieved documents. The model views documents as a "bag of words" and uses a weight-vector representation of the documents in a collection. The model provides the notions "term frequency" (tf) and "inverse document frequency" (idf), which are used to compute the weights of index terms with regard to a document. The notion "tf" is computed as the number of occurrences of the term, normally weighted by the largest number of occurrences. Mathematically, this may be treated as follows. Let N be the number of documents in the system and n_i be the number of documents in which the index term t_i appears. Let freq_{i,j} be the raw frequency of term k_i in the document d_j. Then the normalized term frequency is given by:[7]

    f_{i,j} = freq_{i,j} / max_l(freq_{l,j})    (1)

where max_l(freq_{l,j}) is the maximum number of occurrences of any term t_l in document d_j. The idf, inverse document frequency, is given by:

    idf_i = log(N / n_i)    (2)

The classical term weighting scheme is based on the following equation:

    w_{i,j} = f_{i,j} × idf_i    (3)

There are also some variations of equation (3), such as the one proposed by Salton et al.[7]:

    w_{i,j} = (0.5 + 0.5 × freq_{i,j} / max_l(freq_{l,j})) × log(N / n_i)    (4)

The retrieval process is based on the computation of a similarity function, also known as the cosine function, which measures the similarity of the query and document vectors:[8]

    sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
                = Σ_{i=1..t} w_{i,j} × w_{i,q} / ( sqrt(Σ_{i=1..t} w_{i,j}²) × sqrt(Σ_{i=1..t} w_{i,q}²) )    (5)

where d_j and q are the vector representations of the document j and the query q.

The tf*idf algorithm is based on the well-known VSM, which typically uses the cosine of the angle between the document and the query vectors in a multi-dimensional space as the similarity measure. Vector length normalization can be applied when computing the relevance score R_{i,q} of page P_i with respect to query q:[9]

    R_{i,q} = Σ_{Q_j ∈ q} (0.5 + 0.5 × tf(P_i, Q_j) / tf_max(P_i)) × idf(Q_j)
              / sqrt( Σ_{Q_k ∈ P_i} [ (0.5 + 0.5 × tf(P_i, Q_k) / tf_max(P_i)) × idf(Q_k) ]² )    (6)

where
    tf(P_i, Q_j): the term frequency of Q_j in P_i
    tf_max(P_i): the maximum term frequency of a keyword in P_i
    idf(Q_j): the inverse document frequency of Q_j, given by:

    idf(Q_j) = log( N / Σ_{i : tf(P_i, Q_j) ≥ 1} 1 )    (7)

The full VSM is very expensive to implement, because the normalization factor is very expensive to compute. In the tf*idf algorithm, the normalization factor is not used; that is, the relevance score is computed by:

    R_{i,q} = Σ_{Q_j ∈ q} (0.5 + 0.5 × tf(P_i, Q_j) / tf_max(P_i)) × idf(Q_j)    (8)

The tf*idf weight (term frequency-inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in IR and text mining. The tf*idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others[10]. One of the simplest ranking functions is computed by summing the tf*idf for each query term; many more sophisticated ranking functions are variants of this simple model. It can be shown that tf*idf may be derived as:

    tf*idf(t, d, D) = tf(t, d) × idf(t, D)    (9)

where D is the corpus of documents. A high tf*idf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the log function is always greater than 1, the value of idf (and tf*idf) is greater than 0. As a term appears in more documents, the ratio inside the log approaches 1, making idf and tf*idf approach 0. If a 1 is added to the denominator, a term that appears in all documents will have negative idf, and a term that occurs in all but one document will have an idf equal to zero[11].

The above concepts were exploited to build an algorithm that can be used to reduce the frequency in the page. The proposed ranking algorithm is implemented as given in Algorithm (1). In the suggested system, all or most of the priority parameters were taken into consideration, as listed and defined in Table (1). Each is given a value according to its importance; these values are estimated in a reasonable way.

Table 1. Parameters importance values
(n: number of parameters of the style tag)

    Tag       Value   Description
    <head>    0.10    Defines information about the document
    <title>   0.10    Defines the title of a document
    <base>    0.05    Defines a default address or a default target for all links on a page
    <link>    0.05    Defines the relationship between a document and an external resource
    <meta>    0.05    Defines metadata about an HTML document
    <style>   0.10    Defines style information for a document

The style value is extracted from a collection of elements as illustrated in Table (2).

Table 2. Style values

    Style       Value
    Bold        0.20
    Italic      0.20
    Underline   0.20
    Font Size   0.10
    Font Type   0.10
    Color       0.05
    Blue        0.05
    Red         0.05
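The tf*idf weighting, cosine similarity and unnormalized relevance score described above (Eqs. 1-5 and 8) can be sketched as follows. This is a minimal illustration over a toy in-memory corpus, not the paper's implementation; all function and variable names are ours, and the logarithm base is left as the natural log (the base only rescales scores and does not change the ranking).

```python
import math
from collections import Counter

def term_freqs(doc_tokens):
    """Raw term counts for one document (freq_{i,j})."""
    return Counter(doc_tokens)

def normalized_tf(freqs):
    """Eq. (1): divide each raw count by the largest count in the document."""
    max_f = max(freqs.values())
    return {t: f / max_f for t, f in freqs.items()}

def idf(term, corpus_freqs):
    """Eqs. (2)/(7): log(N / n_i), n_i = number of documents containing the term."""
    n_i = sum(1 for freqs in corpus_freqs if term in freqs)
    return math.log(len(corpus_freqs) / n_i) if n_i else 0.0

def weights(freqs, corpus_freqs):
    """Eq. (3): w_{i,j} = f_{i,j} * idf_i."""
    tf = normalized_tf(freqs)
    return {t: tf[t] * idf(t, corpus_freqs) for t in freqs}

def cosine(w_doc, w_query):
    """Eq. (5): cosine of the angle between two weight vectors."""
    dot = sum(w_doc.get(t, 0.0) * w for t, w in w_query.items())
    norm_d = math.sqrt(sum(w * w for w in w_doc.values()))
    norm_q = math.sqrt(sum(w * w for w in w_query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

def relevance(freqs, query_terms, corpus_freqs):
    """Eq. (8): unnormalized tf*idf relevance score of a page for a query."""
    max_f = max(freqs.values())
    return sum((0.5 + 0.5 * freqs.get(t, 0) / max_f) * idf(t, corpus_freqs)
               for t in query_terms)

# Toy corpus of three "pages".
docs = [["semantic", "search", "engine", "search"],
        ["fuzzy", "logic", "web", "mining"],
        ["web", "search", "mining"]]
corpus = [term_freqs(d) for d in docs]
scores = [relevance(f, ["web", "mining"], corpus) for f in corpus]
# Pages containing both query terms outrank the page containing neither.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
w_docs = [weights(f, corpus) for f in corpus]
print(ranked, round(cosine(w_docs[1], w_docs[2]), 3))
```

Note that Eq. (8) still awards the 0.5 baseline for query terms absent from a page, which is why the non-matching page gets a small nonzero score; only the relative ordering matters.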
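Tables (1) and (2) assign importance values to HTML tags and text styles. The paper's Algorithm (1) is not reproduced on this page, so the following is only a hypothetical sketch of how such values might boost a keyword occurrence; the combination rule (adding tag and style boosts to a base weight of 1) is our assumption, not the authors' method.

```python
# Importance values copied from Tables (1) and (2).
TAG_VALUES = {"head": 0.10, "title": 0.10, "base": 0.05,
              "link": 0.05, "meta": 0.05, "style": 0.10}
STYLE_VALUES = {"bold": 0.20, "italic": 0.20, "underline": 0.20,
                "font_size": 0.10, "font_type": 0.10,
                "color": 0.05, "blue": 0.05, "red": 0.05}

def occurrence_weight(tag=None, styles=()):
    """Weight of one keyword occurrence: base 1.0 plus any tag/style boosts
    (assumed additive combination; illustrative only)."""
    boost = TAG_VALUES.get(tag, 0.0) + sum(STYLE_VALUES.get(s, 0.0) for s in styles)
    return 1.0 + boost

# A keyword in the <title>, rendered bold, counts more than plain body text.
print(round(occurrence_weight("title", ["bold"]), 2))
print(occurrence_weight())
```

Under this scheme a bold keyword in the title weighs 1.3 versus 1.0 for plain body text, so structurally emphasized occurrences dominate the page's priority score.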
Recommended publications
  • Towards the Ontology Web Search Engine
    TOWARDS THE ONTOLOGY WEB SEARCH ENGINE. Olegs Verhodubs, [email protected]. Abstract: The project of the Ontology Web Search Engine is presented in this paper. The main purpose of this paper is to develop such a project that can be easily implemented. The Ontology Web Search Engine is software to look for and index ontologies on the Web. OWL (Web Ontology Language) ontologies are meant, and they are necessary for the functioning of the SWES (Semantic Web Expert System). SWES is an expert system that will use ontologies found on the Web, generating rules from them, and will supplement its knowledge base with these generated rules. It is expected that the SWES will serve as a universal expert system for the average user. Keywords: Ontology Web Search Engine, Search Engine, Crawler, Indexer, Semantic Web. I. INTRODUCTION: The technological development of the Web during the last few decades has provided us with more information than we can comprehend or manage effectively [1]. Typical uses of the Web involve seeking and making use of information, searching for and getting in touch with other people, reviewing catalogs of online stores and ordering products by filling out forms, and viewing adult material. Keyword-based search engines such as YAHOO, GOOGLE and others are the main tools for using the Web, and they provide links to relevant pages on the Web. Despite improvements in search engine technology, the difficulties remain essentially the same [2]. Firstly, relevant pages retrieved by search engines are useless if they are distributed among a large number of mildly relevant or irrelevant pages.
  • The Commodification of Search
    San Jose State University, SJSU ScholarWorks, Master's Theses and Graduate Research, 2008. The commodification of search. Hsiao-Yin Chen, San Jose State University. Follow this and additional works at: https://scholarworks.sjsu.edu/etd_theses Recommended Citation: Chen, Hsiao-Yin, "The commodification of search" (2008). Master's Theses. 3593. DOI: https://doi.org/10.31979/etd.wnaq-h6sz https://scholarworks.sjsu.edu/etd_theses/3593 This Thesis is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Theses by an authorized administrator of SJSU ScholarWorks. For more information, please contact [email protected]. THE COMMODIFICATION OF SEARCH. A Thesis Presented to The School of Journalism and Mass Communications, San Jose State University, In Partial Fulfillment of the Requirement for the Degree Master of Science, by Hsiao-Yin Chen, December 2008.
  • Introduction to Web Search Engines
    Journal of Novel Applied Sciences. Available online at www.jnasci.org. ©2014 JNAS Journal-2014-3-7/724-728. ISSN 2322-5149. INTRODUCTION TO WEB SEARCH ENGINES. Mohammad Reza Pourmir, Computer Engineering Department, Faculty of Engineering, University of Zabol, Zabol, Iran. Corresponding author: Mohammad Reza Pourmir. ABSTRACT: A web search engine is a program designed to help find information stored on the World Wide Web. Search engine indexes are similar to, but vastly more complex than, back-of-the-book indexes. The quality of the indexes, and how the engines use the information they contain, is what makes or breaks the quality of search results. The vast majority of users navigate the Web via search engines. Yet searching can be the most frustrating activity using a browser. Type in a keyword or phrase, and we're likely to get thousands of responses, only a handful of which are close to what we're looking for. And those are located on secondary search pages, only after a set of sponsored links or paid-for-position advertisements. Still, search engines have come a long way in the past few years. Although most of us will never want to become experts on web indexing, knowing even a little bit about how they're built and used can vastly improve our searching skills. This paper gives a brief introduction to web search engines: their architecture, the work process, challenges faced by search engines, various searching strategies, and recent technologies in the web mining field. Keywords: Introduction, Web Search, Engines. INTRODUCTION: A web search engine is a program designed to help find information stored on the World Wide Web.
  • 5.5 the Smart Search Engine
    Durham E-Theses. Smart Search Engine For Information Retrieval. SUN, LEILEI. How to cite: SUN, LEILEI (2009) Smart Search Engine For Information Retrieval, Durham theses, Durham University. Available at Durham E-Theses Online: http://etheses.dur.ac.uk/72/ Use policy: The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for personal research or study, educational, or not-for-profit purposes provided that: • a full bibliographic reference is made to the original source • a link is made to the metadata record in Durham E-Theses • the full-text is not changed in any way. The full-text must not be sold in any format or medium without the formal permission of the copyright holders. Please consult the full Durham E-Theses policy for further details. Academic Support Office, Durham University, University Office, Old Elvet, Durham DH1 3HP. e-mail: [email protected] Tel: +44 0191 334 6107 http://etheses.dur.ac.uk Smart Search Engine For Information Retrieval. Leilei Sun. A thesis in part fulfilment of the requirements of the University of Durham for the degree of Master of Science by Research. Word count: 17353. Submitted: 24 June 2009. Abstract: This project addresses the main research problem in information retrieval and semantic search. It proposes smart search theory as a new theory based on the hypothesis that the semantic meaning of a document can be described by a set of keywords. With two experiments designed and carried out in this project, the results demonstrate positive evidence that supports the smart search theory.
  • Politics of Search Engines Matters
    Draft: Forthcoming in The Information Society. Shaping the Web: Why the politics of search engines matters. Lucas D. Introna, London School of Economics, [email protected]; Helen Nissenbaum, University Center for Human Values, Princeton University, [email protected]. ABSTRACT: This article argues that search engines raise not merely technical issues but also political ones. Our study of search engines suggests that they systematically exclude (in some cases by design and in some accidentally) certain sites, and certain types of sites, in favor of others, and systematically give prominence to some at the expense of others. We argue that such biases, which would lead to a narrowing of the Web’s functioning in society, run counter to the basic architecture of the Web as well as the values and ideals that have fueled widespread support for its growth and development. We consider ways of addressing the politics of search engines, raising doubts whether, in particular, the market mechanism could serve as an acceptable corrective. INTRODUCTION: The Internet, no longer merely an e-mail and file-sharing system, has emerged as a dominant interactive medium. Enhanced by the technology of the World Wide Web, it has become an integral part of the ever-expanding global media system, moving into center stage of media politics alongside traditional broadcast media – television and radio. Enthusiasts of the “new medium” have heralded it as a democratizing force that will give voice to diverse social, economic and cultural groups, to members of society not frequently heard in the public sphere. It will empower the traditionally disempowered, giving them access both to typically unreachable nodes of power and previously inaccessible troves of information.
  • The Invisible Web: Uncovering Sources Search Engines Can't
    The Invisible Web: Uncovering Sources Search Engines Can’t See. CHRIS SHERMAN AND GARY PRICE. ABSTRACT: THE PARADOX OF THE INVISIBLE WEB is that it’s easy to understand why it exists, but it’s very hard to actually define in concrete, specific terms. In a nutshell, the Invisible Web consists of content that’s been excluded from general-purpose search engines and Web directories such as Lycos and LookSmart - and yes, even Google. There’s nothing inherently “invisible” about this content. But since this content is not easily located with the information-seeking tools used by most Web users, it’s effectively invisible because it’s so difficult to find unless you know exactly where to look. In this paper, we define the Invisible Web and delve into the reasons search engines can’t “see” its content. We also discuss the four different “types” of invisibility, ranging from the “opaque” Web, which is relatively accessible to the searcher, to the truly invisible Web, which requires specialized finding aids to access effectively. The visible Web is easy to define. It’s made up of HTML Web pages that the search engines have chosen to include in their indices. It’s no more complicated than that. The Invisible Web is much harder to define and classify for several reasons. First, many Invisible Web sites are made up of straightforward Web pages that search engines could easily crawl and add to their indices but do not, simply because the engines have decided against including them. This is a crucial point - much of the Invisible Web is hidden because search engines Chris Sherman, President, Searchwise, 898 Rockway Place, Boulder, CO 80303; Gary Price, Librarian, Gary Price Library Research and Internet Consulting, 107 Kinsman View Circle, Silver Spring, MD 20901.
  • Appendix E: Recruitment Letter, Previous Participant
    PUBLIC LIBRARIES AND SEARCH ENGINES by Zoe Dickinson. Submitted in partial fulfilment of the requirements for the degree of Master of Library and Information Studies at Dalhousie University, Halifax, Nova Scotia, April 2016. © Copyright by Zoe Dickinson, 2016. TABLE OF CONTENTS: List of Tables; List of Figures; Abstract; List of Abbreviations Used; Glossary; Acknowledgements; Chapter 1 – Introduction; 1.1 Purpose of the Research; 1.2 An Overview of Relevant Technology; 1.2.1 The Mechanics of Library Catalogue Technology; 1.2.2 The Mechanics of Search Engines ...
  • Strategic Use of Information Technology - Google
    UC Irvine, I.T. in Business. Title: Strategic Use of Information Technology - Google. Permalink: https://escholarship.org/uc/item/0nj460xn Authors: Chen, Rex; Lam, Oisze; Kraemer, Kenneth. Publication Date: 2007-02-01. eScholarship.org, Powered by the California Digital Library, University of California. INTRODUCTION: Arguably the most popular search engine available today, Google is widely known for its unparalleled search engine technology, embodied in the web page ranking algorithm PageRank and running on an efficient distributed computer system. In fact, the verb “to Google” has ingrained itself in the vernacular as a synonym of “[performing] a web search.”1 The key to Google’s success has been its strategic use of both software and hardware information technologies. The IT infrastructure behind the search engine includes huge storage databases and numerous server farms to produce significant computational processing power. These critical IT components are distributed across multiple independent computers that provide parallel computing resources. This architecture has allowed Google’s business to reach a market capitalization of over $100 billion and become one of the most respected and admired companies in the world. MARKET ENVIRONMENTS: Search Engine. Internet search engines were first developed in the early 1990s to facilitate the sharing of information among researchers. The original effort to develop a tool for information search occurred simultaneously across multiple universities, as shown in Table 1. Although the functionalities of these systems were very limited, they provided the foundation for future web-based search engines.

    TABLE 1. Early Search Engines

        Search Engine Name   University                              Year   Inventor
        Archie               McGill University                       1990   Alan Emtage
        Veronica             University of Nevada                    1993   Many students
        WWW Wanderer         Massachusetts Institute of Technology   1993   Matthew Gray

    Source: Battelle, 2005.
  • Search Engine Technology
    CS236620 - Search Engine Technology. Ronny Lempel, Yahoo! Labs. Winter 2009/10. The course consists of 14 2-hour meetings, divided into 4 main parts. It aims to cover both engineering and theoretical aspects of search engine technology. Part 1, 2 meetings - introduction to search engines and Information Retrieval: 1. Course outline, popular introduction to search engines, technical overview of search engine components [14, 4]. 2. Introduction to Information Retrieval: Boolean model, vector space model, TF/IDF scoring, probabilistic models [45, 7]. Part 2, 4 meetings - inverted indices: 1. Basics: what is an inverted index, how is one constructed efficiently, what operations does it support, what extra payload is usually stored in search engines, the accompanying lexicon [4, 7]. 2. Query evaluation schemes: term-at-a-time vs. doc-at-a-time, result heaps, early termination/pruning [15]. 3. Distributed architectures: global/local schemes [4, 41, 19], combinatorial issues stemming from the distribution of data [37], the Google cluster architecture [9]. 4. Coping with scale of data: index compression [46, 3], identification of near-duplicate pages [17]. Part 3, 4 meetings - Web graph analysis: 1. Basics: Google’s PageRank [14], Kleinberg’s HITS [32], with some quick overview of Perron-Frobenius theory and ergodicity [26]. 2. Advanced PageRank: topic-sensitive flavors [28], accelerations [27, 30, 31, 18]. Advanced HITS: flavors and variations [11, 21], SALSA [36]. 3. Stability and similarity of link-based schemes [44, 13, 22, 12, 39], the TKC Effect [36]. 4. Web graph structure: power laws [42], Bow-tie structure [16], self-similarity [24], evolutionary models [33, 34].
  • Web Search Engine Based on DNS1
    Web search engine based on DNS. Wang Liang1, Guo Yi-Ping2, Fang Ming3. 1, 3 (Department of Control Science and Control Engineering, Huazhong University of Science and Technology, WuHan, 430074 P.R.China) 2 (Library of Huazhong University of Science and Technology, WuHan 430074 P.R.China). Abstract: At present, no web search engine can cover more than 60 percent of all the pages on the Internet, and the update interval of most page databases is almost one month. This condition hasn't changed for many years. Coverage and recency problems have become the bottleneck of current web search engines. To solve these problems, a new system, a search engine based on DNS, is proposed in this paper. This system adopts a hierarchical distributed architecture like DNS, which is different from any current commercial search engine. In theory, this system can cover all the web pages on the Internet, and its update interval could even be one day. The original idea, detailed content and implementation of this system are all introduced in this paper. Keywords: search engine, domain, information retrieval, distributed system, Web-based service. 1. Introduction: Because the WWW is a large distributed dynamic system, current search engines can't continue to index close to the entire Web as it grows and changes. According to statistical data from 1998 [1], the update interval of most page databases is almost one month and no search engine covers more than 50 percent of the pages on the Internet. These figures still hold. Coverage and recency problems have become the bottleneck of current web search systems.
  • Web Search Engines: Practice and Experience
    Web Search Engines: Practice and Experience. Tao Yang, University of California at Santa Barbara; Apostolos Gerasoulis, Rutgers University. 1 Introduction: The existence of an abundance of dynamic and heterogeneous information on the Web has offered many new opportunities for users to advance their knowledge discovery. As the amount of information on the Web has increased substantially in the past decade, it is difficult for users to find information through a simple sequential inspection of web pages or to recall previously accessed URLs. Consequently, the service of a search engine becomes indispensable for users to navigate the Web in an effective manner. General web search engines provide the most popular way to identify relevant information on the web, taking a horizontal and exhaustive view of the world. Vertical search engines focus on specific segments of content and offer great benefits for finding information on selected topics. Search engine technology incorporates interdisciplinary concepts and techniques from multiple fields in computer science, including computer systems, information retrieval, machine learning, and computational linguistics. Understanding the computational basis of these technologies is important, as search engine technologies have changed the way our society seeks out and interacts with information. This chapter gives an overview of search techniques to find information on the Web relevant to a user query. It describes how billions of web pages are crawled, processed, and ranked, and our experiences in building Ask.com's search engine with over 100 million North American users. Section 2 is an overview of search engine components and a historical perspective of search engine development.
  • Chapter 31 Web Search Technologies for Text Documents
    CHAPTER 31. WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS. Weiyi Meng, SUNY Binghamton; Clement Yu, University of Illinois, Chicago. Contents: Introduction; Text Retrieval (System Architecture, Document Representation, Document-Query Matching, Efficient Query Evaluation, Effectiveness Measures); Search Engine Technology (Web Crawler, Web Search Queries, Use of Tag Information, Use of Linkage Information, Use of User Profiles, Result Organization, Sponsored Search); Metasearch Engine Technology (Software Component Architecture, Database Selection, Result Record Extraction, Result Merging); Conclusion; Glossary; References. Abstract: The World Wide Web has become the largest information source in recent years, and search engines are indispensable tools for finding needed information from the Web. While modern search engine technology has its roots in text/information retrieval techniques, it also consists of solutions to unique problems arising from the Web, such as web page crawling and utilizing linkage information to improve retrieval effectiveness. The coverage of the Web by a single search engine may be limited. One effective way to increase the search coverage of the Web is to combine the coverage of multiple search engines. Systems that do such combination are called metasearch engines. In this chapter, we describe the inner workings of text retrieval systems, search engines, and metasearch engines. Key Words: search engine, Web, Surface Web, Deep Web, Web crawler, text retrieval, vector space model, PageRank, hub and authority, metasearch engine, database selection, result merging. INTRODUCTION: The World Wide Web has emerged as the largest information source in recent years. People all over the world use the Web to find needed information on a regular basis. Students use the Web as a library to find references, and customers use the Web to purchase various products and services.