Scalability and Efficiency Challenges in Large-Scale Web Search
7/9/14

Scalability and Efficiency Challenges in Large-Scale Web Search Engines
Ricardo Baeza-Yates, B. Barla Cambazoglu
Yahoo Labs, Barcelona, Spain
Tutorial at SIGIR 2014, Gold Coast, Australia

Disclaimer
• This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo Inc. or any other entity.
• Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo or any other company.
• Some non-technical material (e.g., images) provided in this presentation was taken from the Web.

Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs - 2 - Tutorial at SIGIR 2014, Gold Coast, Australia

Yahoo Labs Barcelona
• Research topics
  – web data mining
  – semantic web
  – social media
  – web retrieval
• Web retrieval
  – distributed web retrieval
  – scalability and efficiency
  – opinion/sentiment retrieval
  – personalization

Outline of the Tutorial
• Background (35 minutes)
• Main sections
  – web crawling (75 minutes + 5 minutes Q/A)
  – indexing (75 minutes + 5 minutes Q/A)
  – query processing (90 minutes + 5 minutes Q/A)
  – caching (40 minutes + 5 minutes Q/A)
• Concluding remarks (10 minutes)
• Questions and open discussion (15 minutes)

Structure of Main Sections
• Definitions
• Metrics
• Issues and techniques
  – single computer
  – cluster of computers
  – multiple search sites
• Research problems

Background

Brief History of Search Engines
• Past
  - Before browsers: Gopher
  - Before the bubble: Altavista, Lycos, Infoseek, Excite, HotBot
  - After the bubble: Yahoo, Google, Microsoft
• Current
  - Global: Google, Bing
  - Regional: Yandex, Baidu
• Future
  - Facebook ?
  - …

Anatomy of a Search Engine Result Page
• Web search results (main focus of this tutorial)
• Direct displays (vertical search results)
  – image
  – video
  – local
  – shopping
  – related entities
• Query suggestions
• Advertisements

Anatomy of a Search Engine Result Page
[Figure: annotated result page. Labels: user query, movie direct display, related entity suggestions, algorithmic search results, video direct display, ads, knowledge graph]

Actors in Web Search
• User’s perspective: accessing information
  – relevance
  – speed
• Search engine’s perspective: monetization
  – increase the ad revenue
  – attract more users
  – reduce the operational costs
• Advertiser’s perspective: publicity
  – attract more users
  – pay little

Search Engine Usage
[Figure]

What Makes Web Search Difficult?
• Size
• Diversity
• Dynamicity
• All three of these features can be observed in
  – the Web
  – web users

What Makes Web Search Difficult?
• The Web
  – more than 180 million Web servers and 950 million host names
  – compare with almost 1 billion computers directly connected to the Internet
  – the largest data repository (estimated at 100 billion pages)
  – constantly changing
  – diverse in terms of content and data formats
• Users
  – too many! (over 2.5 billion at the end of 2012)
  – diverse in terms of their culture, education, and demographics
  – very short queries (hard to understand the intent)
  – changing information needs
  – little patience (few queries posed & few answers seen)

Expectations from a Search Engine
• Crawl and index a large fraction of the Web
• Maintain the most recent copies of the content on the Web
• Scale to serve hundreds of millions of queries every day
• Evaluate a typical query in under several hundred milliseconds
• Serve the most relevant results for a user query

Internet Growth
• We observed super-linear growth in the last decade
• The growth of the Internet accelerated with web search engines

Web Growth
[Figure]

Web Page Size Growth
[Figure]

Web User Growth
[Figure]

Search Data Centers
• Quality and performance requirements imply large amounts of compute resources, i.e., very large data centers
• High variation in data center sizes
  - hundreds of thousands of computers
  - a few computers

Cost of Data Centers
• Data center facilities are heavy consumers of energy, accounting for between 1.1% and 1.5% of the world’s total energy use in 2010.

Financial Costs
• Costs
  – depreciation: old hardware needs to be replaced
  – maintenance: failures need to be handled
  – operational: energy spending needs to be reduced

Impact of Resources on Revenue
[Figure: diagram relating resources, energy consumption, result relevance, response time, clicks, income, and revenue (marked with "?")]

Major Components in a Web Search Engine
• Web crawling
• Indexing
• Query processing
[Figure: Web → crawler → document collection → indexer → index → query processor; the user sends a query and receives results]

Q&A

Web Crawling

Web Crawling
• Web crawling is the process of locating, fetching, and storing the pages available on the Web
• Computer programs that perform this task are referred to as
  - crawlers
  - spiders
  - harvesters

Web Graph
• Web crawlers exploit the hyperlink structure of the Web

Web Crawling Process
• A typical Web crawler
  – starts from a set of seed pages,
  – locates new pages by parsing the downloaded seed pages,
  – extracts the hyperlinks within,
  – stores the extracted links in a fetch queue for retrieval,
  – continues downloading until the fetch queue is empty or a satisfactory number of pages has been downloaded.

Web Crawling Process
[Figure]

Web Crawling
• Crawling divides the Web into three sets
  – downloaded
  – discovered (frontier)
  – undiscovered
[Figure labels: seed page, downloaded, discovered (frontier), undiscovered]

URL Prioritization
• A state-of-the-art web crawler maintains two separate queues for prioritizing the download of URLs
  – discovery queue
    – downloads pages pointed to by already discovered links
    – tries to increase the coverage of the crawler
  – refreshing queue
    – re-downloads already downloaded pages
    – tries to increase the freshness of the repository

URL Prioritization (Discovery)
• Random (A, B, C, D)
• Breadth-first (A)
• In-degree (C)
• PageRank (B)
[Figure: web graph with a seed page, downloaded pages, discovered (frontier) pages A, B, C, D, and undiscovered pages; more intense red color indicates higher PageRank]
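
The crawling loop and two of the discovery policies named above (breadth-first and in-degree) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the crawler described in the tutorial: the toy link graph `TOY_WEB`, the page names `p1` to `p4`, and the functions `toy_fetch` and `crawl` are all hypothetical, and a dictionary lookup stands in for real HTTP fetching and HTML parsing. The in-degree policy here counts only the links seen so far from already downloaded pages.

```python
from collections import deque

# Hypothetical toy web: page -> outgoing links (replaces real fetching/parsing).
TOY_WEB = {
    "seed": ["p1", "p2", "p3"],
    "p1": ["p3", "p4"],
    "p2": ["p3"],
    "p3": ["p4"],
    "p4": [],
}


def toy_fetch(url):
    """Pretend to download `url` and return the hyperlinks found in it."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages, policy="breadth-first"):
    """Return the list of pages in the order they were downloaded."""
    downloaded = []            # the "downloaded" set, kept in download order
    seen = set(seeds)          # downloaded plus discovered URLs
    frontier = deque(seeds)    # discovered but not yet downloaded, oldest first
    in_degree = {}             # number of links observed so far per URL

    # Stop when the fetch queue is empty or enough pages were downloaded.
    while frontier and len(downloaded) < max_pages:
        if policy == "in-degree":
            # Pick the frontier URL with the most observed in-links;
            # max() keeps the earliest-discovered URL on ties.
            url = max(frontier, key=lambda u: in_degree.get(u, 0))
            frontier.remove(url)
        else:
            # Breadth-first: plain FIFO order of discovery.
            url = frontier.popleft()

        downloaded.append(url)
        for link in fetch_links(url):
            in_degree[link] = in_degree.get(link, 0) + 1
            if link not in seen:       # newly discovered page joins the frontier
                seen.add(link)
                frontier.append(link)
    return downloaded


if __name__ == "__main__":
    print(crawl(["seed"], toy_fetch, 10))
    print(crawl(["seed"], toy_fetch, 10, policy="in-degree"))
```

On this toy graph the two policies diverge after the first two downloads: breadth-first takes `p2` next because it was discovered earlier, while the in-degree policy prefers `p3`, which two downloaded pages already link to. A production crawler would keep the frontier in a priority queue rather than scanning it on every step.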

URL Prioritization (Refreshing)
• Random (A, B, C, D)
• Age (C)
• PageRank (B)
• User feedback (D)
• Longevity (A)
(more intense blue color indicates larger user interest)
download time order (by the crawler): C D A B
last update time order: B A C D (by