Nutch – Web-scale search engine toolkit: Today and tomorrow
Andrzej Białecki <[email protected]>
ApacheCon US '09

Agenda
• About the project
• Web crawling in general
• Nutch architecture overview
• Nutch workflow:
  − Setup
  − Crawling
  − Searching
• Challenges (and some solutions)
• Nutch present and future
• Questions and answers
Apache Nutch project
• Founded in 2003 by Doug Cutting, the creator of Lucene, and Mike Cafarella
• Apache project since 2004 (a sub-project of Lucene)
• Spin-offs:
  − Map-Reduce and distributed FS → Hadoop
  − Content type detection and parsing → Tika
• Many installations in operation, mostly vertical search
• Collections typically 1 mln – 200 mln documents
Web as a directed graph
• Nodes (vertices): URL-s as unique identifiers
• Edges (links): hyperlinks, e.g. <a href="...">
• Edge labels: anchor text
• Often represented as adjacency (neighbor) lists
• Traversal: follow the edges – breadth-first, depth-first, random

Adjacency lists for the example graph:
1 → 2, 3, 4, 5, 6
5 → 6, 9
7 → 3, 4, 8, 9
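As a sketch, the adjacency lists above can be walked breadth-first, the way a crawler expands its frontier one "hop" at a time (a toy illustration, not Nutch code):

```python
from collections import deque

# The web graph from the slide, as adjacency (neighbor) lists.
graph = {
    1: [2, 3, 4, 5, 6],
    5: [6, 9],
    7: [3, 4, 8, 9],
}

def bfs(graph, seed):
    """Breadth-first traversal: the order in which a crawler starting
    from `seed` would discover pages, one hop at a time."""
    seen = {seed}
    order = []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        order.append(url)
        for out in graph.get(url, []):
            if out not in seen:
                seen.add(out)
                queue.append(out)
    return order

print(bfs(graph, 1))  # [1, 2, 3, 4, 5, 6, 9]
```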
What's in a search engine?
… a few things that may surprise you!
Search engine building blocks
• Injector
• Scheduler
• Crawler
• Parser
• Updater
• Web graph: per-page info, in/out links
• Content repository
• Indexer
• Searcher
• Crawling frontier controls
Nutch features at a glance
• Page database and link database (web graph)
• Plugin-based, highly modular:
  − Most behavior can be changed via plugins
• Multi-protocol, multi-threaded, distributed crawler
• Plugin-based content processing (parsing, filtering)
• Robust crawling frontier controls
• Scalable data processing framework
  − Map-reduce processing
• Full-text indexer & search engine
  − Using Lucene or Solr
  − Support for distributed search
• Robust API and integration options
Hadoop foundation
• File system abstraction
  − Local FS, or
  − Distributed FS (also Amazon S3, Kosmos and other FS implementations)
• Map-reduce processing
  − Currently central to Nutch algorithms
  − Processing tasks are executed as one or more map-reduce jobs
  − Data maintained as Hadoop MapFile-s / SequenceFile-s
  − Massive updates very efficient
  − Small updates costly
  − Hadoop data is immutable once created
  − Updates are really merge & replace
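Because the data files are immutable, an "update" really means merging the old sorted store with a batch of changes into a brand-new store. A toy sketch of that merge & replace pattern (plain Python lists standing in for MapFile/SequenceFile records; not the actual Hadoop machinery):

```python
import heapq

def merge_replace(old, updates):
    """Merge an immutable sorted key/value store with a batch of
    updates, producing a new sorted store. On a key clash the update
    wins, mimicking Nutch's merge-&-replace of CrawlDB records."""
    # Tag each record with a priority so that for equal keys the
    # update record (priority 1) sorts after the old record (priority 0)
    # and therefore overwrites it in the dict below.
    tagged = heapq.merge(
        ((key, 0, value) for key, value in old),
        ((key, 1, value) for key, value in updates),
    )
    merged = {}
    for key, _prio, value in tagged:
        merged[key] = value  # later record (the update) wins
    return sorted(merged.items())
```

The point of streaming both inputs through a sorted merge, instead of loading one side into memory, is that it scales to stores far larger than RAM; this is essentially what a reduce phase does.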
Nutch building blocks
The generic blocks map onto Nutch components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher, operating on the CrawlDB, the LinkDB and the shards (segments).
Throughout the pipeline: URL filters & normalizers, parsing/indexing filters, scoring plugins.
Nutch data: CrawlDB
Maintains info on all known URL-s:
• Fetch schedule
• Fetch status
• Page signature
• Metadata
Nutch data: LinkDB
For each target URL, keeps info on incoming links: the list of source URL-s and their associated anchor text.
Nutch data: shards
Shards (“segments”) keep:
• Raw page content
• Parsed content + discovered metadata + outlinks
• Plain text for indexing and snippets
• Optional per-shard indexes
Nutch shards (a.k.a. “segments”)
• Unit of work (batch) – easier to process massive datasets
• Convenience placeholder, using predefined directory names
• Unit of deployment to the search infrastructure
• May include per-shard Lucene indexes
• Once completed, basically unmodifiable
  − No in-place updates of content, or replacing of obsolete content
• Periodically phased out

Shard layout (the directory name is a timestamp):
2009043012345/
  crawl_generate/   ← Generator
  crawl_fetch/      ← Fetcher
  content/          ← Fetcher (“cached” view)
  crawl_parse/      ← Parser
  parse_data/       ← Parser
  parse_text/       ← Parser (snippets)
  indexes/          ← Indexer

URL life-cycle and shards
A time-lapse view of reality: observed changes vs. real page changes. A URL is injected/discovered, scheduled for fetching, fetched/up-to-date, then scheduled again – successive fetches land in shards S1, S2, S3 …
Goals:
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
• Each page may need its own crawl schedule
Shard management:
• A page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in-place – just mark as deleted
• Phase out old shards and force re-fetch of the remaining pages

Crawling frontier
• No authoritative catalog of web pages
• Search engines discover their own view of the web universe
  − Start from a “seed list”
  − Follow (walk) some (useful? interesting?) outlinks
• Many dangers of simply wandering around
  − Explosion or collapse of the frontier
  − Collecting unwanted content (spam, junk, offensive)
“I could use some guidance …”
Controlling the crawling frontier
• URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DB-s, services, …)
• URL normalizer plugins
  − Resolving relative path elements
  − “Equivalent” URLs
• Additional controls using scoring plugins
  − Priority, metadata select/block
• Crawler traps – difficult!
  − Domain / host / path-level stats
(diagram: the frontier expanding from the seed over iterations i = 1, 2, 3)

Wide vs. focused crawling
Differences:
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
Wide crawling:
  − (Almost) unlimited crawling frontier
  − High risk of spamming and junk content
  − “Politeness” is a very important limiting factor
  − Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
  − Limited crawling frontier
  − Bandwidth or politeness is often not an issue
  − Low risk of spamming and junk content
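The URL normalizer plugins mentioned under frontier controls resolve relative path elements and collapse "equivalent" URLs. A toy normalizer in that spirit (the specific rules here are illustrative assumptions; Nutch's real normalizers are pluggable and largely regex-driven):

```python
from urllib.parse import urlsplit, urlunsplit

def _resolve_dots(path):
    """Resolve '.' and '..' path elements: /a/./b/../c -> /a/c."""
    out = []
    for seg in path.split("/"):
        if seg == "..":
            if out:
                out.pop()
        elif seg not in (".", ""):
            out.append(seg)
    return "/" + "/".join(out)

def normalize(url):
    """Toy normalizer: lower-case host, resolve dot segments,
    drop default ports and fragments so equivalent URLs collapse
    to one canonical form."""
    s = urlsplit(url)
    host = (s.hostname or "").lower()
    # keep only explicit non-default ports
    if s.port and s.port != {"http": 80, "https": 443}.get(s.scheme):
        host = "%s:%d" % (host, s.port)
    # empty fragment: http://a/x#top and http://a/x are "equivalent"
    return urlunsplit((s.scheme, host, _resolve_dots(s.path), s.query, ""))

print(normalize("HTTP://Example.COM:80/a/./b/../c#frag"))
# -> http://example.com/a/c
```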
Vertical & enterprise search
Vertical search:
• Range of selected “reference” sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search:
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly
Face to face with Nutch
Simple set-up and running
• You already have Java 5+, right?
• Get a 1.0 release or a nightly build (pretty stable)
• Simple search setup uses Tomcat for the search web app
• Command-line bash script: bin/nutch
  − Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring
Configuration: files
Edit configuration in conf/nutch-site.xml:
• Check nutch-default.xml for defaults and docs
• You MUST at least fill in the name of your agent
• Active plugins configuration
• Per-plugin properties
External configuration files:
• regex-urlfilter.xml
• regex-normalize.xml
• parse-plugins.xml: mapping of MIME type to plugin
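The mandatory agent name can be set with a minimal conf/nutch-site.xml; the property name is http.agent.name, and the value below is just a placeholder:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- The one property you MUST set: identify your crawler -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value> <!-- placeholder agent name -->
  </property>
</configuration>
```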
Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)
Main Nutch workflow
Command-line: bin/nutch
• inject – initial creation of CrawlDB
  − Insert seed URLs
  − Initial LinkDB is empty
• generate – generate a new shard's fetchlist
• fetch – fetch raw content
• parse – parse content (discovers outlinks)
• updatedb – update CrawlDB from shards
• updatelinkdb – update LinkDB from shards
• index / solrindex – index shards
• (repeat)

Injecting new URL-s (building-blocks diagram)
Generating fetchlists (building-blocks diagram)
Workflow: generating fetchlists
What to fetch next?
• Breadth-first – important due to “politeness” limits
• Expired (longer than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation:
• “topN” – select the best candidates
  − Priority based on many factors, pluggable
• Adaptive fetch schedule
  − Detect rate of changes & time of change
  − Detect unmodified content
  − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
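A sketch of the "topN" selection: take the URLs whose fetch time has expired and keep the best-scoring N. The record field names and the score-only priority are simplifying assumptions; Nutch's actual priority function is pluggable and mixes many factors:

```python
def generate_fetchlist(crawldb, now, top_n):
    """Select up to top_n URLs that are due for (re-)fetching,
    highest priority first.

    crawldb: dict mapping url -> {"fetch_time", "fetch_interval", "score"}
    (a stand-in for CrawlDB records)."""
    due = [
        (meta["score"], url)
        for url, meta in crawldb.items()
        # expired: longer than fetchTime + fetchInterval ago
        if meta["fetch_time"] + meta["fetch_interval"] <= now
    ]
    due.sort(reverse=True)  # best candidates first
    return [url for _score, url in due[:top_n]]
```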
Fetching content (building-blocks diagram)
Workflow: fetching
• Multi-protocol: HTTP(S), FTP, file://, etc.
• Coordinates multiple threads accessing the same host
  − “Politeness” issues vs. crawling efficiency
  − Host: an IP or a DNS name?
  − Redirections: should we follow immediately? What kind?
• Other netiquette issues:
  − robots.txt: disallowed paths, Crawl-Delay
• Preparation for parsing:
  − Content type detection issues
  − Parsing usually executed as a separate step (resource hungry and sometimes unstable)
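The robots.txt side of netiquette can be illustrated with Python's standard-library parser (the agent name "nutch-test" and the rules below are made up for the example; Nutch has its own robots.txt handling):

```python
import urllib.robotparser

# robots.txt rules as the crawler might have fetched them
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallowed paths must be skipped...
print(rp.can_fetch("nutch-test", "http://example.com/index.html"))  # True
print(rp.can_fetch("nutch-test", "http://example.com/private/x"))   # False
# ...and Crawl-delay caps how often we hit the host.
print(rp.crawl_delay("nutch-test"))                                 # 5
```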
Content processing (building-blocks diagram)
Workflow: content processing
• Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
• Parse plugins
  − Content is parsed by MIME-type specific parsers
  − Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
• Nutch supports many popular file formats
Link inversion (building-blocks diagram)
Workflow: link inversion
• Pages have outgoing links (outlinks) … I know where I'm pointing to
• Question: who points to me?
  − I don't know – there is no catalog of pages
  − NOBODY knows for sure either!
• In-degree indicates the importance of the page
• Anchor text provides important semantic info
• Partial answer: invert the outlinks that I know about, and group by target
(diagram: outlinks from sources src 1..5 regrouped by targets tgt 1..5)
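The inversion itself is a natural map-reduce job: the map step emits (target, (source, anchor)) for every outlink, the reduce step groups by target. A minimal in-memory sketch of that shape (not the actual LinkDB job):

```python
from collections import defaultdict

def invert_links(outlinks):
    """Invert outlinks to inlinks, map-reduce style.

    outlinks: dict source_url -> list of (target_url, anchor_text)
    returns:  dict target_url -> list of (source_url, anchor_text),
    roughly what LinkDB stores per target."""
    # "map" phase: emit one (target, (source, anchor)) pair per outlink
    emitted = [
        (target, (source, anchor))
        for source, links in outlinks.items()
        for target, anchor in links
    ]
    # "reduce" phase: group the emitted pairs by target
    inlinks = defaultdict(list)
    for target, src_anchor in sorted(emitted):
        inlinks[target].append(src_anchor)
    return dict(inlinks)
```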
Page importance - scoring (building-blocks diagram)
Workflow: page scoring
• Query-independent importance factor
  − Affects search ranking
  − Affects crawl prioritization
• May include arbitrary decisions of the operator
• Currently two systems (plugins + tools):
• OPIC scoring
  − Doesn't require explicit steps – “online”
  − Difficult to stabilize
• PageRank scoring with loop detection
  − Periodically run to update CrawlDB scores
  − Computationally intensive, esp. loop detection
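For intuition, here is plain power-iteration PageRank over adjacency lists. This is only the textbook core of the query-independent score; Nutch's tool additionally does loop detection, which is omitted here:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.
    graph: dict node -> list of outlink targets."""
    nodes = set(graph) | {t for outs in graph.values() for t in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for src, outs in graph.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for target in outs:
                    new[target] += share
        # mass from dangling nodes (no outlinks) is spread evenly
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        for node in nodes:
            new[node] += damping * dangling / n
        rank = new
    return rank
```

Total rank mass stays at 1.0 each iteration, which makes the scores comparable across runs.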
Indexing (building-blocks diagram)
Workflow: indexing
• Indexing plugins
  − Create a full-text search index from segment data
  − Apply adjustments to the score per document
  − May further post-process the parsed content (e.g. language identification) to facilitate advanced search
• Indexers
  − Lucene indexer – builds indexes to be served by Nutch search servers
  − Solr indexer – submits documents to a Solr instance
De-duplication (building-blocks diagram)
Workflow: de-duplication
• The same page may be present in many shards
  − Obsolete versions in older shards
  − Mirrored or equivalent pages (a.com → www.a.com)
• Many other pages are almost identical
  − Template-related differences (banners, current date)
  − Font / layout changes, minor re-wording
• Hmm … what is a significant change?
  − Tricky issue! Hint: what is the page content?
• Near-duplicate detection and removal
  − Nutch uses approximate page signatures (fingerprints)
  − Duplicates are only marked as deleted
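One way to build an approximate page signature is to hash word shingles and keep only the smallest hashes, so small edits often leave the signature unchanged. This is an illustrative stand-in, not Nutch's actual signature implementation, and the parameters are arbitrary:

```python
import hashlib

def signature(text, shingle_len=4, keep=8):
    """Approximate page signature: hash all word shingles, keep the
    `keep` smallest hashes, and hash those into one fingerprint.
    Identical text -> identical signature; near-identical text often
    maps to the same signature, since a few changed shingles usually
    miss the `keep` smallest hashes."""
    words = text.lower().split()
    shingles = {
        " ".join(words[i:i + shingle_len])
        for i in range(max(1, len(words) - shingle_len + 1))
    }
    smallest = sorted(
        hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles
    )[:keep]
    return hashlib.md5("".join(smallest).encode("utf-8")).hexdigest()
```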
Cycles may overlap (building-blocks diagram: several crawl cycles may be in different stages at once)
Nutch distributed search
• Search front-end(s): a web app using NutchBean, talking to a search server API (Nutch or Solr)
• Multiple access methods: API, OpenSearch XML, JSON
• Multiple search servers, each serving its indexed shards
• The search front-end dispatches queries and collects search results
• Page servers build snippets
• Fault-tolerant (* with degraded quality)

Search configuration
• Nutch query syntax is limited – on purpose!
  − Some queries are costly, e.g. leading wildcards, very long queries
  − Some queries may need implicit (hidden) expansion
• Query plugins
  − From user query to Lucene/Solr query
Example – user query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
+(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
url:"web search"~10^4.0 anchor:"web search"~4^2.0
content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0

Search server configuration:
• Single search vs. distributed
• Using the Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation

Deployment: single-server
• The easiest but the most limited deployment
• Centralized storage, centralized processing
• Hadoop LocalFS and LocalJobTracker
• Drop nutch.war into Tomcat/webapps and point it at the shards
(diagram: search front-end, fetcher, indexer and DB management on one machine over LocalFS storage)

Deployment: distributed search
• Local storage on search servers preferred (performance)
(diagram: crawling, indexing and DB management on one machine over LocalFS; the search front-end dispatches queries to multiple search servers)

Deployment: map-reduce processing & distributed search
• Fully distributed storage and processing
• Local storage on search servers preferred (performance)
(diagram: a Hadoop cluster – Job Tracker, DFS Name Node, Task Trackers, DFS Data Nodes – runs fetching, indexing and DB management; separate search servers serve the front-end)

Nutch on a Hadoop cluster
• Assumes an up & running Hadoop cluster … which is covered elsewhere
• Build the nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job
• Note: Nutch configuration is inside nutch.job
  − When you change it, you need to rebuild the job jar
• Searching is not a map-reduce job – it often runs on a separate group of machines
(screenshot slides: “Nutch index”, “… and press Search”)
Conclusions
(This overview is just the tip of the iceberg)
Nutch:
• Implements all core search engine components
• Scales well
• Is extremely configurable and modular
• Is a complete solution – and a toolkit
Future of Nutch
Avoid code duplication:
• Parsing → Tika
  − Almost total overlap
  − Still some missing functionality in Tika → contribute
• Plugins → OSGi
  − The home-grown plugin system has some deficiencies
  − Initial port available
• Indexing & search → Solr, ZooKeeper
  − Distributed and replicated search is difficult
  − Initial integration needs significant improvement
  − Shard management – Katta?
• Web graph & page repository → HBase
  − Combine CrawlDB, LinkDB and shard storage
  − Avoid tedious shard management
  − Initial port available

Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: an à la carte toolkit, scalable from 1 to 1000s of nodes

Summary
Q&A
Further information:
• http://lucene.apache.org/nutch