Scalability and Efficiency Challenges in Large-Scale Web Search

7/9/14 Scalability and Efficiency Challenges in Large-Scale Web Search Engines " Ricardo Baeza-Yates" B. Barla Cambazoglu! Yahoo Labs" Barcelona, Spain" Tutorial at SIGIR 2014, Gold Coast, Australia Disclaimer Dis •# This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo Inc. or any other entity." •# Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo or any other company." •# Some non-technical material (e.g., images) provided in this presentation were taken from the Web." Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 2 - Tutorial at SIGIR 2014, Gold Coast, Australia 1 7/9/14 Yahoo Labs Barcelona •# Research topics" •# Web retrieval" –# web data mining" –# distributed web retrieval" –# semantic web" –# scalability and efficiency" –# social media" –# opinion/sentiment retrieval" –# web retrieval" –# personalization" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 3 - Tutorial at SIGIR 2014, Gold Coast, Australia Outline of the Tutorial •# Background (35 minutes)" •# Main sections" –# web crawling (75 minutes + 5 minutes Q/A)" –# indexing (75 minutes + 5 minutes Q/A)" –# query processing (90 minutes + 5 minutes Q/A)" –# caching (40 minutes + 5 minutes Q/A)" •# Concluding remarks (10 minutes)" •# Questions and open discussion (15 minutes)" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 4 - Tutorial at SIGIR 2014, Gold Coast, Australia 2 7/9/14 Structure of Main Sections •# Definitions" •# Metrics" •# Issues and techniques" –# single computer" –# cluster of computers" –# multiple search sites" •# Research problems" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 5 - Tutorial at SIGIR 2014, Gold Coast, Australia Background Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 6 - Tutorial at SIGIR 2014, Gold Coast, Australia 3 7/9/14 Brief History of Search Engines •# Past" -# Before browsers" -# Gopher" -# Before the bubble" -# Altavista" -# Lycos" -# Infoseek" -# Excite" -# HotBot" •# Current" •# Future" -# After the bubble" •# Global" •# Facebook ?" -# Yahoo" •# Google, Bing" •# …" -# Google" •# Regional" -# Microsoft" •# Yandex, Baidu" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 7 - Tutorial at SIGIR 2014, Gold Coast, Australia Anatomy of a Search Engine Result Page •# Web search results" Main focus of this tutorial" •# Direct displays (vertical search results)" –# image" –# video" –# local" –# shopping" –# related entities" •# Query suggestions" •# Advertisements" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 8 - Tutorial at SIGIR 2014, Gold Coast, Australia 4 7/9/14 Anatomy of a Search Engine Result Page Movie Related User entity query direct display suggestions Algorithmic Video Ads search Knowledge direct results graph display Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 9 - Tutorial at SIGIR 2014, Gold Coast, Australia Actors in Web Search •# User’s perspective: accessing information" –# relevance" –# speed" " •# Search engine’s perspective: monetization" –# increase the ad revenue" –# attract more users" –# reduce the operational costs" •# Advertiser’s perspective: publicity" –# attract more users" –# pay little" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 10 - Tutorial at SIGIR 2014, Gold Coast, Australia 5 7/9/14 Search Engine Usage Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 11 - Tutorial at SIGIR 2014, Gold Coast, Australia What Makes Web Search Difficult? •# Size" •# Diversity" •# Dynamicity" •# All of these three features can be observed in" –# the Web" –# web users" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 12 - Tutorial at SIGIR 2014, Gold Coast, Australia 6 7/9/14 What Makes Web Search Difficult? •# The Web" –# more than 180 million Web servers and 950 million host names" –# compare with almost 1 billion computers directly connect to Internet" –# the largest data repository (estimated as 100 billion pages)" –# constantly changing" –# diverse in terms of content and data formats" •# Users" –# too many! (over 2.5 billion at the end of 2012)" –# diverse in terms of their culture, education, and demographics" –# very short queries (hard to understand the intent)" –# changing information needs" –# little patience (few queries posed & few answers seen)" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 13 - Tutorial at SIGIR 2014, Gold Coast, Australia Expectations from a Search Engine •# Crawl and index a large fraction of the Web" •# Maintain most recent copies of the content in the Web" •# Scale to serve hundreds of millions of queries every day" •# Evaluate a typical query under several hundred milliseconds" •# Serve most relevant results for a user query" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 14 - Tutorial at SIGIR 2014, Gold Coast, Australia 7 7/9/14 Internet Growth •# We observed a super-linear growth in the last decade" •# The growth of Internet accelerated with web search engines" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 15 - Tutorial at SIGIR 2014, Gold Coast, Australia Web Growth Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 16 - Tutorial at SIGIR 2014, Gold Coast, Australia 8 7/9/14 Web Page Size Growth Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 17 - Tutorial at SIGIR 2014, Gold Coast, Australia Web User Growth Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 18 - Tutorial at SIGIR 2014, Gold Coast, Australia 9 7/9/14 Search Data Centers •# Quality and performance requirements imply large amounts of compute resources, i.e., very large data centers" •# High variation in data center sizes" -# hundreds of thousands of computers" -# a few computers" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 19 - Tutorial at SIGIR 2014, Gold Coast, Australia Cost of Data Centers •# Data center facilities are heavy consumers of energy, accounting for between 1.1% and 1.5% of the world’s total energy use in 2010." Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 20 - Tutorial at SIGIR 2014, Gold Coast, Australia 10 7/9/14 Financial Costs •# Costs" –# depreciation: old hardware need to be replaced" –# maintenance: failures need to be handled" –# operational: energy spending need to be reduced" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 21 - Tutorial at SIGIR 2014, Gold Coast, Australia Impact of Resources on Revenue Energy Resources Consumption Result Response Revenue Relevance time ? Clicks Income Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 22 - Tutorial at SIGIR 2014, Gold Coast, Australia 11 7/9/14 Major Components in a Web Search Engine •# Web crawling" •# Indexing" •# Query processing" document collection index Web query query crawler indexer processor results user Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 23 - Tutorial at SIGIR 2014, Gold Coast, Australia Q&A Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 24 - Tutorial at SIGIR 2014, Gold Coast, Australia 12 7/9/14 Web Crawling Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 25 - Tutorial at SIGIR 2014, Gold Coast, Australia Web Crawling •# Web crawling is the process of locating, fetching, and storing the pages available in the Web" •# Computer programs that perform this task are referred to as" -# crawlers" -# spider" -# harvesters" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 26 - Tutorial at SIGIR 2014, Gold Coast, Australia 13 7/9/14 Web Graph •# Web crawlers exploit the hyperlink structure of the Web" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 27 - Tutorial at SIGIR 2014, Gold Coast, Australia Web Crawling Process •# A typical Web crawler" –# starts from a set of seed pages," –# locates new pages by parsing the downloaded seed pages," –# extracts the hyperlinks within," –# stores the extracted links in a fetch queue for retrieval, " –# continues downloading until the fetch queue gets empty or a satisfactory number of pages are downloaded." Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 28 - Tutorial at SIGIR 2014, Gold Coast, Australia 14 7/9/14 Web Crawling Process Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 29 - Tutorial at SIGIR 2014, Gold Coast, Australia Web Crawling seed page downloaded •# Crawling divides discovered the Web into (frontier) three sets" –# downloaded" –# discovered" –# undiscovered" undiscovered Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 30 - Tutorial at SIGIR 2014, Gold Coast, Australia 15 7/9/14 URL Prioritization •# A state-of-the-art web crawler maintains two separate queues for prioritizing the download of URLs" –# discovery queue" –# downloads pages pointed by already discovered links" –# tries to increase the coverage of the crawler" –# refreshing queue" –# re-downloads already downloaded pages" –# tries to increase the freshness of the repository" Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 31 - Tutorial at SIGIR 2014, Gold Coast, Australia URL Prioritization (Discovery) seed page downloaded discovered •# Random (A, B, C, D)" A (frontier) •# Breadth-first (A)" •# In-degree (C)" C •# PageRank (B)" B D undiscovered (more intense red color indicates higher PageRank) Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 32 - Tutorial at SIGIR 2014, Gold Coast, Australia 16 7/9/14 URL Prioritization (Refreshing) •# Random (A, B, C, D)" •# Age (C)" •# PageRank (B)" •# User feedback (D)" •# Longevity (A)" (more intense blue color indicates larger user interest) download time order C D A B (by the crawler) last update time order B A C D (by

Scalability and Efficiency Challenges in Large-Scale Web Search

Study of Web Crawler and Its Different Types

Building a Scalable Index and a Web Search Engine for Music on the Internet Using Open Source Software

A Study on Vertical and Broad-Based Search Engines

Open Search Environments: the Free Alternative to Commercial Search Services

Efficient Focused Web Crawling Approach for Search Engine

Distributed Indexing/Searching Workshop Agenda, Attendee List, and Position Papers

Natural Language Processing Technique for Information Extraction and Analysis

Distributed Web Crawling Using Network Coordinates

Panacea D4.1

Meta Search Engine with an Intelligent Interface for Information Retrieval on Multiple Domains

Report for Portal Specific

DEEP WEB IOTA Report for Tax Administrations