Crawling Frontier Controls

Nutch – ApacheCon US '09

Nutch: web-scale search engine toolkit
Search, today and tomorrow
Andrzej Białecki
[email protected]

Agenda
• About the project
• Web crawling in general
• Nutch architecture overview
• Nutch workflow: setup, crawling, searching
• Challenges (and some solutions)
• Nutch present and future
• Questions and answers

Apache Nutch project
• Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
• Apache project since 2004 (sub-project of Lucene)
• Spin-offs:
  − Hadoop → Map-Reduce and distributed FS
  − Tika → content type detection and parsing
• Many installations in operation, mostly vertical search
• Collections typically 1 mln - 200 mln documents

Web as a directed graph
• Nodes (vertices): URLs as unique identifiers
• Edges (links): hyperlinks like <a href="targetUrl"/>
• Edge labels: anchor text, as in <a href="...">anchor text</a>
• Often represented as adjacency (neighbor) lists, e.g.:
  1 → 2, 3, 4, 5, 6
  5 → 6, 9
  7 → 3, 4, 8, 9
• Traversal: breadth-first, depth-first, random; follow the edges

What's in a search engine?
… a few things that may surprise you!
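The adjacency-list representation and breadth-first traversal from the web-graph slide can be sketched as follows. This is a minimal illustration, not Nutch code; the class and method names are made up, and integer node IDs stand in for URLs.

```java
import java.util.*;

// Toy model of the slide's web graph: nodes are URL identifiers,
// edges are hyperlinks, stored as adjacency (neighbor) lists.
public class WebGraph {
    private final Map<Integer, List<Integer>> adj = new HashMap<>();

    public void addEdges(int from, int... targets) {
        List<Integer> out = adj.computeIfAbsent(from, k -> new ArrayList<>());
        for (int t : targets) out.add(t);
    }

    // Breadth-first traversal from a seed node, following out-edges;
    // returns nodes in the order they are visited.
    public List<Integer> bfs(int seed) {
        List<Integer> visited = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            visited.add(node);
            for (int next : adj.getOrDefault(node, List.of())) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return visited;
    }
}
```

Loading the slide's three adjacency lists and running `bfs(1)` visits 1, its five neighbors, then 9 (reached via 5); node 7's edges are never followed because nothing reachable from 1 links to 7.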
Search engine building blocks
• Crawling frontier controls
• Injector
• Scheduler
• Crawler
• Updater
• Web graph: page info, links (in/out)
• Content repository
• Parser
• Indexer
• Searcher

Nutch features at a glance
• Plugin-based, highly modular: most behavior can be changed via plugins
• Multi-protocol, multi-threaded, distributed crawler
• Plugin-based content processing (parsing, filtering)
• Robust crawling frontier controls
• Page database and link database (web graph)
• Scalable data processing framework (Map-Reduce)
• Full-text indexer & search engine
  − Using Lucene or Solr
  − Support for distributed search
• Robust API and integration options

Hadoop foundation
• File system abstraction
  − Local FS, or
  − Distributed FS
  − also Amazon S3, Kosmos and other FS implementations
• Map-Reduce processing
  − Currently central to Nutch algorithms
  − Processing tasks are executed as one or more map-reduce jobs
• Data maintained as Hadoop MapFiles / SequenceFiles
  − Massive updates very efficient
  − Small updates costly
  − Hadoop data is immutable, once created
  − Updates are really merge & replace

Nutch building blocks
• Injector
• CrawlDB and LinkDB
• Generator
• Fetcher
• Parser
• Updater
• Link inverter
• Shards (segments)
• Indexer
• Searcher
• URL filters & normalizers, parsing/indexing filters, scoring plugins

Nutch data: CrawlDB
Maintains info on all known URLs:
• Fetch schedule
• Fetch status
• Page signature
• Metadata
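The "updates are really merge & replace" point can be illustrated with a toy sketch. This is an assumption-level illustration using an in-memory sorted map; the real thing is a map-reduce job over sorted, immutable MapFile/SequenceFile records that writes an entirely new file rather than modifying the old one.

```java
import java.util.*;

// Sketch of the merge & replace update pattern: since the stored data is
// immutable, an "update" reads the old sorted records plus a batch of new
// ones and produces a fresh sorted result, letting the newer record for
// each key win. The original data is never touched.
public class MergeReplace {
    public static SortedMap<String, String> update(
            SortedMap<String, String> existing, Map<String, String> updates) {
        SortedMap<String, String> merged = new TreeMap<>(existing); // old "file" stays intact
        merged.putAll(updates);                                     // newer records replace older
        return merged;                                              // the new "file"
    }
}
```

This also shows why a massive batch of updates is cheap per record (one sequential merge pass) while a single small update costs a full rewrite.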
Nutch data: LinkDB
• For each target URL, keeps info on incoming links, i.e. the list of source URLs and their associated anchor text

Nutch data: shards
Shards ("segments") keep:
• Raw page content
• Parsed content + discovered metadata + outlinks
• Plain text for indexing and snippets
• Optional per-shard indexes

Nutch shards (a.k.a. "segments")
• Unit of work (batch) – easier to process massive datasets
• Convenience placeholder, using predefined directory names
• Unit of deployment to the search infrastructure
• May include per-shard Lucene indexes
• Once completed they are basically unmodifiable
  − No in-place updates of content, or replacing of obsolete content
• Periodically phased out

Shard directory layout (e.g. 200904301234/):
  crawl_generate/  – from the Generator
  crawl_fetch/     – from the Fetcher
  content/         – "cached" view
  crawl_parse/     – from the Parser
  parse_data/
  parse_text/      – text for snippets
  indexes/         – from the Indexer

URL life-cycle and shards
• A time-lapse view of reality: observed changes lag behind real page changes
• Goals:
  − Fresh index → sync with the real rate of changes
  − Minimize re-fetching → don't fetch unmodified pages
• Each page may need its own crawl schedule
• Life-cycle: injected / discovered → scheduled for fetching → fetched / up-to-date → scheduled again (shards S1, S2, S3, …)
• Shard management:
  − A page may be present in many shards, but only the most recent record is valid
  − Inconvenient to update in place; just mark as deleted
  − Phase out old shards and force re-fetch of the remaining pages

Crawling frontier: many dangers of simply wandering around
• Search engines discover their view of the web universe
• No authoritative catalog of web pages
  − Start from a "seed list"
  − Follow (walk) some (useful? interesting?) outlinks
  − Risk: explosion or collapse of the frontier
  − Risk: collecting unwanted content (spam, junk, offensive)
• Some guidance I could use…

Controlling the crawling frontier
• URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DBs, services, …)
• URL normalizer plugins
  − Resolving relative path elements
  − "Equivalent" URLs
• Additional controls using scoring plugins
  − priority, metadata select/block
• Crawler traps – difficult!
  − Domain / host / path-level stats

Wide vs. focused crawling
• Differences:
  − Little technical difference in configuration
  − Big difference in operations, maintenance and quality
• Wide crawling:
  − (Almost) unlimited crawling frontier
  − High risk of spamming and junk content
  − "Politeness" a very important limiting factor
  − Bandwidth & DNS considerations
• Focused (vertical or enterprise) crawling:
  − Limited crawling frontier
  − Bandwidth or politeness is often not an issue
  − Low risk of spamming and junk content

Vertical & enterprise search
• Well-defined and limited crawling frontier
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Little danger of spam
• PageRank-like scoring usually works poorly
• Variety of data sources and data formats
• Integration with in-house data sources
• Business-driven decisions about ranking
• Extensive content post-processing
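The white-list/black-list/regex idea behind the URL filter plugins can be sketched like this. It is a simplified stand-in, not Nutch's actual plugin interface; the first-match-wins, accept/reject rule semantics are in the spirit of Nutch's regex URL filter rules, and all names here are made up.

```java
import java.util.*;
import java.util.regex.Pattern;

// Toy URL filter: ordered regex rules, first match wins; an "accept" rule
// keeps the URL in the frontier, a "reject" rule drops it. Anything no
// rule matches is rejected by default.
public class RegexUrlFilter {
    private final List<Map.Entry<Boolean, Pattern>> rules = new ArrayList<>();

    public void addRule(boolean accept, String regex) {
        rules.add(Map.entry(accept, Pattern.compile(regex)));
    }

    // Returns the URL if accepted, or null to drop it from the frontier.
    public String filter(String url) {
        for (Map.Entry<Boolean, Pattern> rule : rules) {
            if (rule.getValue().matcher(url).find()) {
                return rule.getKey() ? url : null;
            }
        }
        return null; // default: reject
    }
}
```

A black-list rule for binary file suffixes followed by a white-list rule for the target domain already gives a workable focused-crawl frontier boundary.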
Face to face with Nutch

Simple set-up and running
• You have Java 5+ already, right?
• Get a 1.0 release or a nightly build
• Command-line bash script: bin/nutch
  − Windows users: get Cygwin
• Simple setup for search uses the Tomcat search web app (pretty stable)
• Hadoop web-based monitoring console
• Early version of a web-based UI

Configuration: files
• Edit configuration in conf/nutch-site.xml
  − Check nutch-default.xml for defaults and docs
  − You MUST at least fill in the name of your agent
• Active plugins
• Per-plugin properties
• External configuration files:
  − regex-urlfilter.xml
  − regex-normalize.xml
  − parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for:
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing fields & filters (metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)
• Crawl scheduling
• URL filtering and normalization

Main Nutch workflow
Command-line: bin/nutch
• inject – initial creation of CrawlDB
  − Insert seed URLs
  − Initial LinkDB is empty
• generate – create a new shard's fetchlist
• fetch – fetch raw content
• parse – parse content (discovers outlinks)
• updatedb – update CrawlDB from shards
• updatelinkdb – update LinkDB from shards
• index / solrindex – index shards
• (repeat)

Injecting new URLs

Generating fetchlists
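The one mandatory edit from the configuration slide, setting your agent name, would go into conf/nutch-site.xml roughly as follows. The property name is the one commonly used by Nutch's HTTP protocol plugins; verify it (and related agent properties) against your nutch-default.xml, and replace the placeholder value with your own crawler's name.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Required: identify your crawler to web servers.
       "MyNutchCrawler" is a placeholder value. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

Overrides placed here win over nutch-default.xml, which itself should be left untouched so upgrades do not clobber your settings.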
Workflow: generating fetchlists
What to fetch next?
• Breadth-first – important due to "politeness" limits
• Expired (older than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation:
• "topN" – select the best candidates
  − Priority based on many factors, pluggable
• Adaptive fetch schedule
  − Detect rate of changes & time of change
  − Detect unmodified
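The expiry rule and the adaptive-schedule idea above can be sketched as follows. This is an illustrative model, not Nutch's pluggable fetch-schedule API; the halving/doubling adaptation step is an assumed simplification of "detect rate of changes".

```java
// Toy fetch scheduler. All times are in the same unit (e.g. seconds).
public class FetchSchedule {
    // The slide's expiry rule: a page is due for re-fetch once the current
    // time has passed fetchTime + fetchInterval.
    public static boolean isDue(long now, long fetchTime, long fetchInterval) {
        return now > fetchTime + fetchInterval;
    }

    // Crude adaptive step: a page that changed since the last fetch gets
    // re-fetched sooner; an unmodified page has its interval backed off.
    // The result is clamped to [min, max].
    public static long adapt(long interval, boolean modified, long min, long max) {
        long next = modified ? interval / 2 : interval * 2;
        return Math.max(min, Math.min(max, next));
    }
}
```

Per-page intervals like this are what lets each page follow its own crawl schedule, as the URL life-cycle slide calls for.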
