Scalability and Efficiency Challenges in Large-Scale Web Search


Scalability and Efficiency Challenges in Large-Scale Web Search Engines
Ricardo Baeza-Yates and B. Barla Cambazoglu
Yahoo Labs, Barcelona, Spain
Tutorial at SIGIR 2014, Gold Coast, Australia

Disclaimer
• This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo Inc. or any other entity.
• Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo or any other company.
• Some non-technical material (e.g., images) in this presentation was taken from the Web.

Yahoo Labs Barcelona
• Research topics
  – web retrieval
  – web data mining
  – distributed web retrieval
  – semantic web
  – scalability and efficiency
  – social media
  – opinion/sentiment retrieval
  – personalization

Outline of the Tutorial
• Background (35 minutes)
• Main sections
  – web crawling (75 minutes + 5 minutes Q/A)
  – indexing (75 minutes + 5 minutes Q/A)
  – query processing (90 minutes + 5 minutes Q/A)
  – caching (40 minutes + 5 minutes Q/A)
• Concluding remarks (10 minutes)
• Questions and open discussion (15 minutes)

Structure of Main Sections
• Definitions
• Metrics
• Issues and techniques
  – single computer
  – cluster of computers
  – multiple search sites
• Research problems

Background

Brief History of Search Engines
• Past
  – before browsers: Gopher
  – before the bubble: Altavista, Lycos, Infoseek, Excite, HotBot
  – after the bubble: Yahoo, Google, Microsoft
• Current
  – global: Google, Bing
  – regional: Yandex, Baidu
• Future
  – Facebook?
  – …

Anatomy of a Search Engine Result Page
• Web search results (the main focus of this tutorial)
• Direct displays (vertical search results)
  – image
  – video
  – local
  – shopping
  – related entities
• Query suggestions
• Advertisements

Anatomy of a Search Engine Result Page
[Annotated example page: movie-related entity direct display, user query suggestions, algorithmic search results, video direct display, knowledge graph display, ads]

Actors in Web Search
• User’s perspective: accessing information
  – relevance
  – speed
• Search engine’s perspective: monetization
  – increase the ad revenue
  – attract more users
  – reduce the operational costs
• Advertiser’s perspective: publicity
  – attract more users
  – pay little

Search Engine Usage
[Figure: search engine usage statistics]
What Makes Web Search Difficult?
• Size
• Diversity
• Dynamicity
• All three of these features can be observed in
  – the Web
  – web users

What Makes Web Search Difficult?
• The Web
  – more than 180 million web servers and 950 million host names
  – compare with almost 1 billion computers directly connected to the Internet
  – the largest data repository (estimated at 100 billion pages)
  – constantly changing
  – diverse in terms of content and data formats
• Users
  – too many! (over 2.5 billion at the end of 2012)
  – diverse in terms of their culture, education, and demographics
  – very short queries (hard to understand the intent)
  – changing information needs
  – little patience (few queries posed and few answers seen)

Expectations from a Search Engine
• Crawl and index a large fraction of the Web
• Maintain the most recent copies of the content on the Web
• Scale to serve hundreds of millions of queries every day
• Evaluate a typical query in under several hundred milliseconds
• Serve the most relevant results for a user query

Internet Growth
• Super-linear growth was observed in the last decade
• The growth of the Internet accelerated with web search engines

Web Growth
[Figure: growth in the number of web pages]

Web Page Size Growth
[Figure: growth in average web page size]

Web User Growth
[Figure: growth in the number of web users]

Search Data Centers
• Quality and performance requirements imply large amounts of compute resources, i.e., very large data centers
• High variation in data center sizes
  – from a few computers
  – to hundreds of thousands of computers

Cost of Data Centers
• Data center facilities are heavy consumers of energy, accounting for between 1.1% and 1.5% of the world’s total energy use in 2010

Financial Costs
• depreciation: old hardware needs to be replaced
• maintenance: failures need to be handled
• operational: energy spending needs to be reduced

Impact of Resources on Revenue
[Diagram: resources affect energy consumption, result relevance, and response time; relevance and response time influence clicks and income, which in turn affect revenue]

Major Components in a Web Search Engine
• Web crawling
• Indexing
• Query processing
[Diagram: the crawler fetches pages from the Web into a document collection, the indexer builds the index from that collection, and the query processor evaluates user queries against the index and returns results to the user; a toy code sketch of this pipeline follows the Q&A break below]

Q&A
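The crawler, indexer, and query processor above form a pipeline from raw pages to answered queries. As a rough illustration only (not the architecture of any production engine, and with all names invented for this sketch), the following Python toy builds an in-memory inverted index over a handful of documents standing in for the crawler's output and answers conjunctive (AND) queries against it:

```python
from collections import defaultdict

# Toy "document collection" standing in for the crawler's output.
documents = {
    1: "web search engines crawl and index the web",
    2: "query processing evaluates user queries against the index",
    3: "web crawling fetches pages from the web",
}

def build_index(docs):
    """Indexer: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def process_query(index, query):
    """Query processor: return documents containing all query terms (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(documents)
print(process_query(index, "web index"))   # {1}
```

A real engine partitions both the collection and the index across many machines and ranks the matches; the point here is only the flow from documents to index to query results.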
Web Crawling

Web Crawling
• Web crawling is the process of locating, fetching, and storing the pages available on the Web
• Computer programs that perform this task are referred to as
  – crawlers
  – spiders
  – harvesters

Web Graph
• Web crawlers exploit the hyperlink structure of the Web
[Figure: example web graph]

Web Crawling Process
• A typical web crawler
  – starts from a set of seed pages,
  – locates new pages by parsing the downloaded seed pages,
  – extracts the hyperlinks within,
  – stores the extracted links in a fetch queue for retrieval,
  – continues downloading until the fetch queue becomes empty or a satisfactory number of pages have been downloaded.

Web Crawling Process
[Figure: crawling process]

Web Crawling
• Crawling divides the Web into three sets
  – downloaded
  – discovered (frontier)
  – undiscovered
[Figure: seed page, downloaded, discovered (frontier), and undiscovered regions of the Web]

URL Prioritization
• A state-of-the-art web crawler maintains two separate queues for prioritizing the download of URLs
  – discovery queue
    • downloads pages pointed to by already discovered links
    • tries to increase the coverage of the crawler
  – refreshing queue
    • re-downloads already downloaded pages
    • tries to increase the freshness of the repository

URL Prioritization (Discovery)
• Random (A, B, C, D)
• Breadth-first (A)
• In-degree (C)
• PageRank (B)
(more intense red color indicates higher PageRank)

URL Prioritization (Refreshing)
• Random (A, B, C, D)
• Age (C)
• PageRank (B)
• User feedback (D)
• Longevity (A)
(more intense blue color indicates larger user interest)
• download time order (by the crawler): C, D, A, B
• last update time order: B, A, C, D (by
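To tie the crawling slides together, here is a minimal sketch of a crawl loop driven by the two-queue prioritization described above: a discovery queue ordered by an arbitrary priority score (which could come from breadth-first depth, in-degree, or a PageRank estimate) and a refreshing queue ordered by the time a page is next due for re-download. The class and function names are invented for this example, and the fetch step is a stub rather than a real HTTP client:

```python
import heapq
import time

class Frontier:
    """Toy two-queue frontier: discovery (coverage) and refreshing (freshness)."""

    def __init__(self):
        self.discovery = []    # (-priority, url) for newly discovered URLs
        self.refreshing = []   # (next_refresh_time, url) for already downloaded URLs
        self.seen = set()

    def add_discovered(self, url, priority=1.0):
        # Priority could be breadth-first depth, in-degree, or a PageRank estimate.
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.discovery, (-priority, url))

    def schedule_refresh(self, url, delay_seconds=3600):
        # Re-download the page after a delay; shorter delays mean fresher copies.
        heapq.heappush(self.refreshing, (time.time() + delay_seconds, url))

    def next_url(self):
        # Prefer a refresh that is due, otherwise take the highest-priority new URL.
        if self.refreshing and self.refreshing[0][0] <= time.time():
            return heapq.heappop(self.refreshing)[1]
        if self.discovery:
            return heapq.heappop(self.discovery)[1]
        return None

def fetch(url):
    """Stub: a real crawler would download the page and parse out its hyperlinks."""
    return [], b""

def crawl(seeds, max_pages=100):
    frontier = Frontier()
    for url in seeds:
        frontier.add_discovered(url)
    downloaded = 0
    while downloaded < max_pages:
        url = frontier.next_url()
        if url is None:
            break                       # fetch queue is empty
        links, content = fetch(url)     # download the page and extract links
        downloaded += 1
        for link in links:
            frontier.add_discovered(link)
        frontier.schedule_refresh(url)  # keep the repository fresh
    return downloaded
```

Usage would be `crawl(["http://example.com/"])`. A production crawler adds per-host politeness delays, robots.txt handling, and distribution across many machines, none of which is shown here.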
Recommended publications
  • Study of Web Crawler and Its Different Types
    IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 1, Ver. VI (Feb. 2014), PP 01-05, www.iosrjournals.org. Study of Web Crawler and its Different Types. Trupti V. Udapure (M.E. (Wireless Communication and Computing) student, CSE Department, G.H. Raisoni Institute of Engineering and Technology for Women, Nagpur, India), Ravindra D. Kale (Asst. Prof., CSE Department, G.H. Raisoni Institute of Engineering and Technology for Women, Nagpur, India), Rajesh C. Dharmik (Assoc. Prof. & Head, IT Department, Yeshwantrao Chavan College of Engineering, Nagpur, India). Abstract: Due to the current size of the Web and its dynamic nature, building an efficient search mechanism is very important. A vast number of web pages are added every day, and information is constantly changing. Search engines are used to extract valuable information from the Internet. The web crawler, a principal component of a search engine, is a computer program that browses the World Wide Web in a methodical, automated manner. It is an essential means of collecting data on, and keeping up with, the rapidly growing Internet. This paper briefly reviews the concept of the web crawler, its architecture, and its various types. Keywords: crawling techniques, web crawler, search engine, WWW. I. Introduction: In modern life, use of the Internet is growing rapidly. The World Wide Web provides a vast source of information of almost every type. Nowadays people use search engines frequently; large volumes of data can be explored easily through search engines to extract valuable information from the Web.
  • Building a Scalable Index and a Web Search Engine for Music on the Internet Using Open Source Software
    Department of Information Science and Technology Building a Scalable Index and a Web Search Engine for Music on the Internet using Open Source software André Parreira Ricardo Thesis submitted in partial fulfillment of the requirements for the degree of Master in Computer Science and Business Management Advisor: Professor Carlos Serrão, Assistant Professor, ISCTE-IUL September, 2010 Acknowledgments I should say that I feel grateful for doing a thesis linked to music, an art which I love and esteem so much. Therefore, I would like to take a moment to thank all the persons who made my accomplishment possible and hence this is also part of their deed too. To my family, first for having instigated in me the curiosity to read, to know, to think and go further. And secondly for allowing me to continue my studies, providing the environment and the financial means to make it possible. To my classmate André Guerreiro, I would like to thank the invaluable brainstorming, the patience and the help through our college years. To my friend Isabel Silva, who gave me a precious help in the final revision of this document. Everyone in ADETTI-IUL for the time and the attention they gave me. Especially the people over Caixa Mágica, because I truly value the expertise transmitted, which was useful to my thesis and I am sure will also help me during my professional course. To my teacher and MSc. advisor, Professor Carlos Serrão, for embracing my will to master in this area and for being always available to help me when I needed some advice.
  • A Study on Vertical and Broad-Based Search Engines
    International Journal of Latest Trends in Engineering and Technology (IJLTET), Special Issue ICRACSC-2016, pp. 087-093, e-ISSN: 2278-621X. A Study on Vertical and Broad-Based Search Engines. M. Swathi and M. Swetha. Abstract: Vertical search engines, or domain-specific search engines [1][2], are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, broad-based (Web-wide) search engines. This paper surveys domain-specific search engines, which are becoming more popular than Web-wide search engines because the latter are difficult to maintain, are time consuming to use, and make it difficult to provide appropriate documents that represent the target data. We also list various vertical search engines and broad-based search engines. Index terms: domain-specific search, vertical search engines, broad-based search engines. I. Introduction: The Web has become a very rich source of information for almost any field, ranging from music to history, from sports to movies, from science to culture, and many more. However, it has become increasingly difficult to search for desired information on the Web. Users face the problem of information overload, in which a search on a general-purpose search engine such as Google (www.google.com) results in thousands of hits. Because a user cannot specify a search domain (e.g. medicine, music), a search query may bring up web pages both within and outside the desired domain. Example 1: A user searching for "cancer" may get web pages related to the disease as well as those related to the Zodiac sign.
  • Open Search Environments: the Free Alternative to Commercial Search Services
    Open Search Environments: The Free Alternative to Commercial Search Services. Adrian O’Riordan ABSTRACT Open search systems present a free and less restricted alternative to commercial search services. This paper explores the space of open search technology, looking in particular at lightweight search protocols and the issue of interoperability. A description of current protocols and formats for engineering open search applications is presented. The suitability of these technologies and issues around their adoption and operation are discussed. This open search approach is especially useful in applications involving the harvesting of resources and information integration. Principal among the technological solutions are OpenSearch, SRU, and OAI-PMH. OpenSearch and SRU realize a federated model to enable content providers and search clients communicate. Applications that use OpenSearch and SRU are presented. Connections are made with other pertinent technologies such as open-source search software and linking and syndication protocols. The deployment of these freely licensed open standards in web and digital library applications is now a genuine alternative to commercial and proprietary systems. INTRODUCTION Web search has become a prominent part of the Internet experience for millions of users. Companies such as Google and Microsoft offer comprehensive search services to users free with advertisements and sponsored links, the only reminder that these are commercial enterprises. Businesses and developers on the other hand are restricted in how they can use these search services to add search capabilities to their own websites or for developing applications with a search feature. The closed nature of the leading web search technology places barriers in the way of developers who want to incorporate search functionality into applications.
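One concrete piece of the OpenSearch model mentioned above is the URL template that a search engine publishes in its description document, with placeholders such as {searchTerms} and {startPage?} that a client fills in before issuing the request. The sketch below shows only that substitution step; the template URL is hypothetical and the handling of optional parameters is deliberately simplified:

```python
from urllib.parse import quote

def fill_opensearch_template(template, search_terms, start_page=1, count=10):
    """Substitute common OpenSearch 1.1 template parameters into a URL template.

    A full client would also strip any optional {name?} parameters it does not
    supply; this toy simply fills the three most common ones.
    """
    return (template
            .replace("{searchTerms}", quote(search_terms))
            .replace("{startPage?}", str(start_page))
            .replace("{count?}", str(count)))

# Hypothetical template as it might appear in an OpenSearch description document.
template = "https://example.org/search?q={searchTerms}&page={startPage?}&n={count?}"
print(fill_opensearch_template(template, "open search protocols"))
```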
  • Efficient Focused Web Crawling Approach for Search Engine
    Ayar Pranav et al, International Journal of Computer Science and Mobile Computing, Vol. 4, Issue 5, May 2015, pg. 545-551. Available online at www.ijcsmc.com. ISSN 2320-088X. RESEARCH ARTICLE: Efficient Focused Web Crawling Approach for Search Engine. Ayar Pranav (1), Sandip Chauhan (2). Computer & Science Engineering, Kalol Institute of Technology and Research Center, Kalol, Gujarat, India. 1 [email protected]; 2 [email protected]. Abstract: A focused crawler traverses the Web, selecting pages relevant to a predefined topic and neglecting those outside it. Collecting domain-specific documents using focused crawlers has been considered one of the most important strategies for finding relevant information. While crawling the Web, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. Moreover, most focused crawlers use a local search algorithm to traverse the web space, so they can easily be trapped within a limited subgraph of the Web that surrounds the starting URLs, and relevant pages can be missed when no links to them exist from the starting URLs. To address this problem we design a focused crawler that calculates the frequency of the topic keyword and also of the keyword's synonyms and sub-synonyms. A weight table is constructed according to the user query and used to check the similarity of web pages with respect to the topic keywords and to calculate the priority of each extracted link.
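A minimal sketch of the kind of scoring this abstract describes: a weight table built from the topic keyword and its synonyms, used both to score a fetched page and to prioritize its extracted links. The specific weights, the example terms, and the scoring formulas are illustrative assumptions, not the exact method of the paper:

```python
import re

def build_weight_table(keyword, synonyms, keyword_weight=1.0, synonym_weight=0.5):
    """Give the topic keyword a higher weight than its synonyms."""
    table = {keyword.lower(): keyword_weight}
    for s in synonyms:
        table[s.lower()] = synonym_weight
    return table

def page_relevance(text, weights):
    """Score text by the weighted frequency of topic terms, normalized by length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)

def link_priority(anchor_text, surrounding_text, weights):
    """Priority of an extracted link: anchor text counts more than its context."""
    return 2 * page_relevance(anchor_text, weights) + page_relevance(surrounding_text, weights)

weights = build_weight_table("crawler", ["spider", "harvester", "bot"])
print(page_relevance("a focused crawler is a spider that selects relevant pages", weights))
```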
  • Distributed Indexing/Searching Workshop Agenda, Attendee List, and Position Papers
    Distributed Indexing/Searching Workshop: Agenda, Attendee List, and Position Papers. Held May 28-29, 1996 in Cambridge, Massachusetts. Sponsored by the World Wide Web Consortium. Workshop co-chairs: Michael Schwartz, @Home Network; Mic Bowman, Transarc Corp. This workshop brings together a cross-section of people concerned with distributed indexing and searching, to explore areas of common concern where standards might be defined. The Call For Participation [1] suggested particular focus on repository interfaces that support efficient and powerful distributed indexing and searching. There was a great deal of interest in this workshop. Because of our desire to limit attendance to a workable size group while maximizing breadth of attendee backgrounds, we limited attendance to one person per position paper, and furthermore we limited attendance to one person per institution. In some cases, attendees submitted multiple position papers, with the intention of discussing each of their projects or ideas that were relevant to the workshop. We had not anticipated this interpretation of the Call For Participation; our intention was to use the position papers to select participants, not to provide a forum for enumerating ideas and projects. As a compromise, we decided to choose among the submitted papers and allow multiple position papers per person and per institution, but to restrict attendance as noted above. Hence, in the paper list below there are some cases where one author or institution has multiple position papers. [1] http://www.w3.org/pub/WWW/Search/960528/cfp.html. Agenda: The Distributed Indexing/Searching Workshop will span two days. The first day's goal is to identify areas for potential standardization through several directed discussion sessions.
  • Natural Language Processing Technique for Information Extraction and Analysis
    International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 2, Issue 8, August 2015, PP 32-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Natural Language Processing Technique for Information Extraction and Analysis T. Sri Sravya1, T. Sudha2, M. Soumya Harika3 1 M.Tech (C.S.E) Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 2 Head (I/C) of C.S.E & IT Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 3 M. Tech C.S.E, Assistant Professor, Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] Abstract: In the current internet era, there are a large number of systems and sensors which generate data continuously and inform users about their status and the status of devices and surroundings they monitor. Examples include web cameras at traffic intersections, key government installations etc., seismic activity measurement sensors, tsunami early warning systems and many others. A Natural Language Processing based activity, the current project is aimed at extracting entities from data collected from various sources such as social media, internet news articles and other websites and integrating this data into contextual information, providing visualization of this data on a map and further performing co-reference analysis to establish linkage amongst the entities. Keywords: Apache Nutch, Solr, crawling, indexing 1. INTRODUCTION In today’s harsh global business arena, the pace of events has increased rapidly, with technological innovations occurring at ever-increasing speed and considerably shorter life cycles.
  • Distributed Web Crawling Using Network Coordinates
    Distributed Web Crawling Using Network Coordinates. Barnaby Malet, Department of Computing, Imperial College London, [email protected]. Supervisor: Peter Pietzuch. Second Marker: Emil Lupu. June 16, 2009. Abstract: In this report we will outline the relevant background research, the design, the implementation and the evaluation of a distributed web crawler. Our system is innovative in that it assigns Euclidean coordinates to crawlers and web servers such that the distances in the space give an accurate prediction of download times. We will demonstrate that our method gives the crawler the ability to adapt and compensate for changes in the underlying network topology, and in doing so can achieve significant decreases in download times when compared with other approaches. Acknowledgements: Firstly, I would like to thank Peter Pietzuch for the help that he has given me throughout the course of the project as well as showing me support when things did not go to plan. Secondly, I would like to thank Johnathan Ledlie for helping me with some aspects of the implementation involving the Pyxida library. I would also like to thank the PlanetLab support team for giving me extensive help in dealing with complaints from web masters. Finally, I would like to thank Emil Lupu for providing me with feedback about my Outsourcing Report. Contents: 1 Introduction; 2 Background; 2.1 Web Crawling; 2.1.1 Web Crawler Architecture; 2.1.2 Issues in Web Crawling; 2.1.3 Discussion; 2.2 Crawler Assignment Strategies; 2.2.1 Hash Based …
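The core idea of the report, assigning URLs to crawlers based on distances in a network coordinate space, can be sketched as a nearest-coordinate lookup. The coordinates and names below are made up; a real system (the report uses the Pyxida library) would learn and update the coordinates continuously from measured latencies rather than fixing them by hand:

```python
import math

def distance(a, b):
    """Euclidean distance between two network coordinates (here, 2-D points)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_url(server_coord, crawler_coords):
    """Assign a URL to the crawler whose coordinate predicts the shortest download time."""
    return min(crawler_coords, key=lambda name: distance(crawler_coords[name], server_coord))

# Hypothetical coordinates produced by a network coordinate system.
crawlers = {"crawler-eu": (0.0, 1.0), "crawler-us": (5.0, 0.0), "crawler-asia": (9.0, 4.0)}
print(assign_url((4.5, 0.5), crawlers))   # crawler-us
```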
  • Panacea D4.1
    SEVENTH FRAMEWORK PROGRAMME, THEME 3: Information and Communication Technologies. PANACEA Project, Grant Agreement no. 248064: Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies. D4.1 Technologies and tools for corpus creation, normalization and annotation. Dissemination Level: Public. Delivery Date: July 16, 2010. Status – Version: Final. Author(s) and Affiliation: Prokopis Prokopidis, Vassilis Papavassiliou (ILSP), Pavel Pecina (DCU), Laura Rimel, Thierry Poibeau (UCAM), Roberto Bartolini, Tommaso Caselli, Francesca Frontini (ILC-CNR), Vera Aleksic, Gregor Thurmair (Linguatec), Marc Poch Riera, Núria Bel (UPF), Olivier Hamon (ELDA). Table of contents: 1 Introduction; 2 Terminology; 3 Corpus Acquisition Component; 3.1 Task description; 3.2 State of the art; 3.3 Existing tools …
  • Meta Search Engine with an Intelligent Interface for Information Retrieval on Multiple Domains
    International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol. 1, No. 4, October 2011. META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON MULTIPLE DOMAINS. D. Minnie (1), S. Srinivasan (2). 1 Department of Computer Science, Madras Christian College, Chennai, India, [email protected]. 2 Department of Computer Science and Engineering, Anna University of Technology Madurai, Madurai, India, [email protected]. ABSTRACT: This paper analyses the features of Web Search Engines, Vertical Search Engines, and Meta Search Engines, and proposes a Meta Search Engine for searching and retrieving documents on multiple domains in the World Wide Web (WWW). A web search engine searches for information in the WWW. A vertical search engine provides the user with results for queries in its domain. Meta search engines send the user's search queries to various search engines and combine the search results. This paper introduces intelligent user interfaces for selecting the domain, category and search engines for the proposed Multi-Domain Meta Search Engine. An intelligent user interface is also designed to get the user query and to send it to appropriate search engines. A few algorithms are designed to combine results from various search engines and to display the results. KEYWORDS: Information Retrieval, Web Search Engine, Vertical Search Engine, Meta Search Engine. 1. INTRODUCTION AND RELATED WORK: The WWW is a huge repository of information. The complexity of accessing web data has increased tremendously over the years. There is a need for efficient searching techniques to extract appropriate information from the Web, as users require correct and complex information from the Web. A web search engine is a search engine designed to search the WWW for information about a given search query and return links to the documents in which the query's keywords are found.
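The result-combination step that the abstract mentions can be illustrated with reciprocal-rank fusion, one common way of merging ranked lists from several engines; this is a generic stand-in, not the specific algorithms proposed in the paper, and the engine names and URLs are placeholders:

```python
from collections import defaultdict

def fuse_results(result_lists, k=60):
    """Merge ranked lists from several engines with reciprocal-rank fusion.

    Each input is a list of URLs ordered best-first; the output is a single
    list ordered by the summed reciprocal-rank score.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["url1", "url2", "url3"]
engine_b = ["url2", "url4", "url1"]
print(fuse_results([engine_a, engine_b]))   # url2 and url1 rank highest
```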
  • Report for Portal Specific
    Report for Portal specific
    Time range: 2013/01/01 00:00:07 - 2013/03/31 23:59:59. Generated on Fri Apr 12, 2013 - 18:27:59.
    General Statistics Summary
    Hits: Total Hits 1,416,097; Average Hits per Day 15,734; Average Hits per Visitor 6.05; Cached Requests 8,154; Failed Requests 70,823.
    Page Views: Total Page Views 1,217,537; Average Page Views per Day 13,528; Average Page Views per Visitor 5.21.
    Visitors: Total Visitors 233,895; Average Visitors per Day 2,598; Total Unique IPs 49,753.
    Bandwidth: Total Bandwidth 77.49 GB; Average Bandwidth per Day 881.68 MB; Average Bandwidth per Hit 57.38 KB; Average Bandwidth per Visitor 347.40 KB.
    Activity Statistics
    [Charts: Daily Visitors, Daily Hits, and Daily Bandwidth (KB) by day over the reporting period]
    Daily Activity
    Date            Hits    Page Views  Visitors  Average Visit Length  Bandwidth (KB)
    Sun 2013/02/10  11,783  10,245      2,280     13:01                 648,207
    Mon 2013/02/11  16,454  14,146      2,484     10:05                 906,702
    Tue 2013/02/12  19,572  17,089      3,062     07:47                 926,190
    Wed 2013/02/13  14,554  12,402      2,824     06:09                 958,951
    Thu 2013/02/14  12,577  10,666      2,690     05:03                 821,129
    Fri 2013/02/15  15,806  12,697      2,868     07:02                 1,208,095
    Sat 2013/02/16  16,811  14,939
  • DEEP WEB IOTA Report for Tax Administrations
    DEEP WEB: IOTA Report for Tax Administrations. Intra-European Organisation of Tax Administrations (IOTA), Budapest, 2012. PREFACE: This report on deep Web investigation is the second report from the IOTA "E-Commerce" Task Team of the "Prevention and Detection of VAT Fraud" Area Group. The team started operations in January 2011 in Wroclaw, Poland, initially focusing its activities on problems associated with the audit of cloud computing, the report on which was published earlier in 2012. During the Task Team's second phase of work the focus has been on deep Web investigation. What can a tax administration possibly gain from the deep Web? For many, the deep Web is something of a mystery, something for computer specialists, something they have heard about but do not understand. However, the depth of the Web should not represent a threat, as the deep Web offers a very important source of valuable information to tax administrations. If you want to understand, to master and to fully exploit the deep Web, you need to see the opportunities that arise from using information buried deep within the Web, how to work within that environment and what other tax administrations have already achieved. This report is all about understanding, mastering and using the deep Web as the key to virtually all audits, not just those involving e-commerce. It shows what a tax administration can achieve by understanding the deep Web and how to use it to its advantage in every audit.