Crawling Frontier Controls
Nutch: a web-scale search engine toolkit – today and tomorrow
Andrzej Białecki, [email protected]
ApacheCon US '09

Agenda
• About the project
• Web crawling in general
• Nutch architecture overview
• Nutch workflow: setup, crawling, searching
• Challenges (and some solutions)
• Nutch present and future
• Questions and answers

Apache Nutch project
• Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
• Apache project since 2004 (sub-project of Lucene)
• Spin-offs:
− Hadoop ← map-reduce processing and distributed FS
− Tika ← content type detection and parsing
• Many installations in operation, mostly vertical search
• Collections typically 1 mln – 200 mln documents

Web as a directed graph
• Nodes (vertices): URL-s as unique identifiers
• Edges (links): hyperlinks like <a href="targetUrl"/>
• Edge labels: anchor text, as in <a href="...">anchor text</a>
• Often represented as adjacency (neighbor) lists, e.g. for a small nine-node graph:
    1 → 2, 3, 4, 5, 6
    5 → 6, 9
    7 → 3, 4, 8, 9
• Traversal: breadth-first, depth-first, random

What's in a search engine?
… a few things that may surprise you!

Search engine building blocks
[diagram: a Crawler driven by a Scheduler under crawling frontier controls; an Injector and an Updater maintain the web graph (page info, in/out links); fetched pages land in a content repository and flow through a Parser and an Indexer to the Searcher]

Nutch features at a glance
• Plugin-based, highly modular: multi-protocol, multi-threaded, distributed crawler
• Plugin-based content processing (parsing, filtering)
• Robust crawling frontier controls
• Page database and link database (web graph)
• Scalable data processing framework: map-reduce
• Full-text indexer & search engine, using Lucene or Solr
− Support for distributed search
• Most behavior can be changed via plugins
• Robust API and integration options

Hadoop foundation
• File system abstraction
− Local FS, or
− Distributed FS – also Amazon S3, Kosmos and other FS implementations
• Map-reduce processing
− Currently central to Nutch algorithms
− Processing tasks are executed as one or more map-reduce jobs
− Data maintained as Hadoop MapFile-s / SequenceFile-s
− Massive updates very efficient, small updates costly
− Hadoop data is immutable once created – updates are really merge & replace

Nutch building blocks
[diagram: Injector → CrawlDB → Generator → Fetcher → shards (segments) → Parser; an Updater folds results back into CrawlDB and a Link inverter builds LinkDB; the Indexer and Searcher consume the shards; URL filters & normalizers, parsing/indexing filters and scoring plugins hook into every stage]

Nutch data: CrawlDB
Maintains info on all known URL-s:
● Fetch schedule
● Fetch status
● Page signature
● Metadata
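Since CrawlDB is stored as Hadoop map files, it is inspected with the stock readdb tool rather than read directly. A couple of hedged examples – the crawl/crawldb path and the URL are placeholders:

    # global statistics: URL counts per fetch status, score min/max
    bin/nutch readdb crawl/crawldb -stats
    # dump the record kept for one known URL
    bin/nutch readdb crawl/crawldb -url http://www.example.com/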
Nutch data: LinkDB
For each target URL, keeps info on incoming links: the list of source URL-s and their associated anchor text.

Nutch data: shards
Shards ("segments") keep:
● Raw page content
● Parsed content + discovered metadata + outlinks
● Plain text for indexing and snippets
● Optional per-shard indexes

Nutch shards (a.k.a. "segments")
• Unit of work (batch) – easier to process massive datasets
• Convenience placeholder, using predefined directory names
• Unit of deployment to the search infrastructure
• May include per-shard Lucene indexes
• Once completed they are basically unmodifiable
− No in-place updates of content, or replacing of obsolete content
• Periodically phased out

A shard directory (e.g. 2009043012345/) is filled in stages:
    crawl_generate/   fetchlist, written by the Generator
    crawl_fetch/      fetch status, written by the Fetcher
    content/          raw content – the "cached" view
    crawl_parse/      parse status, written by the Parser
    parse_data/       parsed metadata and outlinks
    parse_text/       plain text for indexing and snippets
    indexes/          optional per-shard indexes, written by the Indexer

URL life-cycle and shards
[diagram: a time-lapse view of reality – real page changes vs. observed changes over time, with a URL cycling through injected/discovered → scheduled for fetching → fetched/up-to-date across successive shards S1, S2, S3]
• Goals:
− Fresh index → sync with the real rate of changes
− Minimize re-fetching → don't fetch unmodified pages
• Each page may need its own crawl schedule
• Shard management:
− A page may be present in many shards, but only the most recent record is valid
− Inconvenient to update in place – just mark as deleted
− Phase out old shards and force a re-fetch of the remaining pages

Crawling frontier
• No authoritative catalog of web pages
• Search engines discover their own view of the web universe:
− Start from a "seed list"
− Follow (walk) some (useful? interesting?) outlinks
• Many dangers of simply wandering around:
− Explosion or collapse of the frontier
− Collecting unwanted content (spam, junk, offensive)
• … some guidance I could use!

Controlling the crawling frontier
[diagram: the frontier growing outward from the seed in waves i = 1, 2, 3; filters select or block URLs at each step]
• URL filter plugins
− White-list, black-list, regex
− May use external resources (DB-s, services, …)
• URL normalizer plugins
− Resolving relative path elements
− "Equivalent" URLs
• Additional controls using scoring plugins
− Priority, metadata select/block
• Crawler traps – difficult!
− Domain / host / path-level stats
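As a hedged illustration of white/black-listing by regex: stock Nutch 1.0 keeps these rules in conf/regex-urlfilter.txt, where rules are tried in order, the first match wins, + accepts and - rejects. The example.com domain below is a placeholder for a focused crawl:

    # reject non-web protocols outright
    -^(file|ftp|mailto):
    # break one class of crawler trap: a slash-delimited segment repeating 3+ times
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    # white-list a single domain and everything under it
    +^http://([a-z0-9]*\.)*example.com/
    # black-list everything else
    -.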
Wide vs. focused crawling
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
• Wide crawling:
− (Almost) unlimited crawling frontier
− High risk of spam and junk content
− "Politeness" a very important limiting factor
− Bandwidth & DNS considerations
• Focused (vertical or enterprise) crawling:
− Limited crawling frontier
− Bandwidth or politeness is often not an issue
− Low risk of spam and junk content

Vertical & enterprise search
• Vertical search:
− Range of selected "reference" sites
− Well-defined and limited crawling frontier
− Robust control of the crawling frontier
− Little danger of spam
• Enterprise search:
− Variety of data sources and data formats
− Integration with in-house data sources
− Extensive content post-processing
• In both, ranking is a business-driven decision – PageRank-like scoring usually works poorly

Face to face with Nutch

Simple set-up and running
• You have Java 5+ already, right?
• Get a 1.0 release or a nightly build (pretty stable)
• Windows users: get Cygwin
• Command-line bash script: bin/nutch
• Simple setup uses Tomcat for search
• Search from the command line or via the search web app
• Early version of a web-based UI console
• Hadoop web-based monitoring

Configuration: files
• Edit configuration in conf/nutch-site.xml
− Check nutch-default.xml for defaults and docs
− You MUST at least fill in the name of your agent (see the nutch-site.xml sketch at the end of this section)
• Active plugins
• Per-plugin properties
• External configuration files:
− regex-urlfilter.txt
− regex-normalize.xml
− parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for (a filter plugin sketch follows at the end of this section):
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing fields & filters (index metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)
• Crawl scheduling
• URL filtering and normalization

Main Nutch workflow
Command-line: bin/nutch (an end-to-end command sketch follows at the end of this section)
• Inject: initial creation of CrawlDB – inject
− Insert seed URLs
− Initial LinkDB is empty
• Generate a new shard's fetchlist – generate
• Fetch raw content – fetch
• Parse content (discovers outlinks) – parse
• Update CrawlDB from shards – updatedb
• Update LinkDB from shards – invertlinks
• Index shards – index / solrindex
• (repeat)

Injecting new URL-s
[diagram: the Injector feeds seed URL-s through the URL filter and normalizer plugins into CrawlDB]

Generating fetchlists
[diagram: the Generator selects due entries from CrawlDB and writes a fetchlist into a new shard]

Workflow: generating fetchlists
• What to fetch next?
− Breadth-first – important due to "politeness" limits
− Expired (longer than fetchTime + fetchInterval)
− Highly ranking (PageRank)
− Newly added
• Fetchlist generation:
− "topN" – select the best candidates
− Priority based on many factors, pluggable
• Adaptive fetch schedule
− Detect rate of changes & time of change
− Detect unmodified pages
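As referenced in the configuration slide, a minimal conf/nutch-site.xml sketch that fills in the mandatory agent name; the value shown is a placeholder for your own crawler's name:

    <?xml version="1.0"?>
    <configuration>
      <!-- Identifies your crawler to the sites it fetches from;
           Nutch refuses to run without it. "MyTestCrawler" is a
           placeholder value. -->
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>
      </property>
    </configuration>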
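The URL filtering extension point listed in the plugins slide reduces to a single method: return the URL (possibly rewritten) to accept it, or null to veto it. A minimal sketch, assuming the Nutch 1.0 plugin API; the class name and the single-host test are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Keeps the crawling frontier on a single host; every other URL
    // is dropped before it can reach CrawlDB or a fetchlist.
    public class ExampleHostFilter implements URLFilter {
      private Configuration conf;

      // Return the URL to accept it, or null to reject it.
      public String filter(String urlString) {
        return urlString.startsWith("http://www.example.com/")
            ? urlString : null;
      }

      // Boilerplate: URLFilter extends Hadoop's Configurable.
      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

Like any Nutch plugin, it would also need a plugin.xml descriptor registering it at the org.apache.nutch.net.URLFilter extension point.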
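And the main workflow above as one pass over the command line – a hedged sketch only: the directory layout, the -topN value and the single-shard assumption are all placeholders, not the only way to run Nutch:

    # seed CrawlDB from a directory of flat files, one URL per line
    bin/nutch inject crawl/crawldb urls/
    # pick the 1000 best due candidates into a new shard's fetchlist
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`   # newest shard directory
    bin/nutch fetch $s                      # fetch raw content
    bin/nutch parse $s                      # parse, discover outlinks
    bin/nutch updatedb crawl/crawldb $s     # merge results back into CrawlDB
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
    # repeat generate -> fetch -> parse -> updatedb for the next cycle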