Nutch – ApacheCon US '09
Web-scale search engine toolkit: today and tomorrow
Andrzej Białecki
[email protected]

Agenda
• About the project
• Web crawling in general
• Nutch architecture overview
• Nutch workflow: setup, crawling, searching
• Challenges (and some solutions)
• Nutch present and future
• Questions and answers

Apache Nutch project
• Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
• Apache project since 2004 (sub-project of Lucene)
• Spin-offs:
  − Map-Reduce and distributed FS → Hadoop
  − Content type detection and parsing → Tika
• Many installations in operation, mostly vertical search
• Collections typically 1 mln – 200 mln documents


Web as a directed graph
• Nodes (vertices): URLs as unique identifiers
• Edges (links): hyperlinks
• Edge labels: anchor text
• Often represented as adjacency (neighbor) lists, e.g.:
  1 → 2, 3, 4, 5, 6
  5 → 6, 9
  7 → 3, 4, 8, 9
• Traversal: follow the edges breadth-first, depth-first, or at random
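The adjacency-list representation and breadth-first traversal above can be sketched in a few lines of Python. This is an illustrative stand-in, not Nutch code; the `graph` literal encodes the slide's example lists and `bfs` is a hypothetical helper:

```python
from collections import deque

# Adjacency lists from the slide's example:
# node 1 links to 2..6, node 5 to 6 and 9, node 7 to 3, 4, 8, 9.
graph = {
    1: [2, 3, 4, 5, 6],
    5: [6, 9],
    7: [3, 4, 8, 9],
}

def bfs(graph, seed):
    """Breadth-first traversal: the order in which a crawler
    following links from a seed would discover pages."""
    seen = {seed}
    order = []
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(bfs(graph, 1))  # → [1, 2, 3, 4, 5, 6, 9]
```

Note that node 9 is reached only transitively (1 → 5 → 9), which is exactly how a crawl frontier grows beyond the seed list.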

What's in a … ?
… a few things that may surprise you!

Search engine building blocks
[Diagram] Components: Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher, arranged around a web graph (per-page info, in/out links) and a content repository, with crawling frontier controls around the whole pipeline.

Nutch features at a glance
• Page database and link database (web graph)
• Plugin-based, highly modular:
  − Most behavior can be changed via plugins
• Multi-protocol, multi-threaded, distributed crawler
• Plugin-based content processing (parsing, filtering)
• Robust crawling frontier controls
• Scalable data processing framework
  − Map-reduce processing
• Full-text indexer & search engine
  − Using Lucene or Solr
  − Support for distributed search
• Robust API and integration options

Hadoop foundation
• File system abstraction
  − Local FS, or
  − Distributed FS (also Amazon S3, Kosmos and other FS implementations)
• Map-reduce processing
  − Currently central to Nutch algorithms
  − Processing tasks are executed as one or more map-reduce jobs
• Data maintained as Hadoop MapFile-s / SequenceFile-s
  − Massive updates very efficient
  − Small updates costly: Hadoop data is immutable once created, so updates are really merge & replace
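The "merge & replace" update model can be sketched as follows. This is a minimal illustration, not Hadoop code: sorted lists of `(key, value)` pairs stand in for immutable MapFile records, and even a small batch of updates forces a full rewrite of the output:

```python
def merge_replace(current, updates):
    """Merge two sorted (key, value) lists into a new sorted list.
    Records from 'updates' win on key collisions; 'current' is never
    modified in place — mirroring how an update of immutable Hadoop
    data really means writing a whole new file."""
    out = []
    i, j = 0, 0
    while i < len(current) and j < len(updates):
        if current[i][0] < updates[j][0]:
            out.append(current[i]); i += 1
        elif current[i][0] > updates[j][0]:
            out.append(updates[j]); j += 1
        else:                        # same key: the update replaces
            out.append(updates[j]); i += 1; j += 1
    out.extend(current[i:])
    out.extend(updates[j:])
    return out

old = [("a.com", 1), ("b.com", 2), ("c.com", 3)]
new = merge_replace(old, [("b.com", 20), ("d.com", 4)])
# 'old' is untouched; 'new' is the fully rewritten replacement
```

Copying every record makes large batched updates cheap per record, but makes tiny updates disproportionately expensive — the trade-off stated above.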

Search engine building blocks (recap)
[Diagram as before: Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher, web graph, content repository, crawling frontier controls]

Nutch building blocks
[Diagram] The Injector and Updater maintain the CrawlDB; the Generator creates fetchlists in shards (segments); the Fetcher and Parser fill the shards; the Link inverter builds the LinkDB; the Indexer produces indexes served to the Searcher.
URL filters & normalizers, parsing/indexing filters, and scoring plugins cut across all of these steps.

Nutch data: CrawlDB
Maintains info on all known URLs:
● Fetch schedule
● Fetch status
● Page signature
● Metadata
[Diagram: Nutch building blocks, CrawlDB highlighted]

Nutch data: LinkDB
For each target URL, keeps info on incoming links: the list of source URLs and their associated anchor text.
[Diagram: Nutch building blocks, LinkDB highlighted]

Nutch data: shards
Shards ("segments") keep:
● Raw page content
● Parsed content + discovered metadata + outlinks
● Plain text for indexing and snippets
● Optional per-shard indexes
[Diagram: Nutch building blocks, shards highlighted]

Nutch shards (a.k.a. "segments")
• Unit of work (batch) – easier to process massive datasets
• Convenience placeholder, using predefined directory names
• Unit of deployment to the search infrastructure
• May include per-shard Lucene indexes
• Once completed they are basically unmodifiable
  − No in-place updates of content, or replacing of obsolete content
• Periodically phased out

Shard layout (one directory per shard, named by timestamp, e.g. 2009043012345/):
  crawl_generate/  ← Generator
  crawl_fetch/     ← Fetcher
  content/         ← "cached" view
  crawl_parse/     ← Parser
  parse_data/
  parse_text/      ← text for snippets
  indexes/         ← Indexer

URL life-cycle and shards
A time-lapse view of reality: observed changes lag behind real page changes. A URL moves from injected/discovered, to scheduled for fetching, to fetched/up-to-date, then back to scheduled as it ages – across successive shards S1, S2, S3.
Goals:
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule.
Shard management:
• A page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in place – just mark obsolete records as deleted
• Phase out old shards and force re-fetch of the remaining pages

Crawling frontier
• No authoritative catalog of web pages
• Search engines discover their own view of the web universe
  − Start from a "seed list"
  − Follow (walk) some (useful? interesting?) outlinks
• Many dangers of simply wandering around
  − Explosion or collapse of the frontier
  − Collecting unwanted content (spam, junk, offensive)
"I could use some guidance…"

Controlling the crawling frontier
• URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DBs, services, …)
• URL normalizer plugins
  − Resolving relative path elements
  − "Equivalent" URLs
• Additional controls using scoring plugins
  − Priority, metadata select/block
• Crawler traps – difficult!
  − Domain / host / path-level stats
[Diagram: frontier expanding outward from the seed at i = 1, 2, 3]

Wide vs. focused crawling
Differences:
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
Wide crawling:
− (Almost) unlimited crawling frontier
− High risk of spamming and junk content
− "Politeness" a very important limiting factor
− Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
− Limited crawling frontier
− Bandwidth or politeness is often not an issue
− Low risk of spamming and junk content
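The URL filter and normalizer plugins described above can be approximated in a short Python sketch. The patterns and rules here are invented for the example (Nutch's real plugins are configured via files such as regex-urlfilter and regex-normalize):

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical rules: reject binary media, accept only one site.
BLACKLIST = [re.compile(r"\.(jpe?g|gif|zip)$", re.I)]
WHITELIST = [re.compile(r"^https?://([a-z0-9.-]+\.)?example\.org/")]

def url_filter(url):
    """Black-list first, then white-list: return the URL if it
    survives both, else None (meaning: drop it from the frontier)."""
    if any(p.search(url) for p in BLACKLIST):
        return None
    if not any(p.search(url) for p in WHITELIST):
        return None
    return url

def url_normalize(base, href):
    """Resolve relative path elements and collapse trivially
    'equivalent' URLs (case of the host, default port, fragment)."""
    url = urljoin(base, href)               # resolves ../ and relative refs
    parts = urlparse(url)
    netloc = parts.netloc.lower().removesuffix(":80")
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{netloc}{path}{query}"
```

In a crawler, every discovered outlink would pass through normalization first, then filtering, before it is ever added to the frontier.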

Vertical & enterprise search
Vertical search:
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search:
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly

Face to face with Nutch

Simple set-up and running
• You already have Java 5+, right?
• Get a 1.0 release or a nightly build (pretty stable)
• Simple search setup uses Tomcat for the search web app
• Command-line bash script: bin/nutch
  − Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring

Configuration: files
Edit configuration in conf/nutch-site.xml
• Check nutch-default.xml for defaults and docs
• You MUST at least fill in the name of your agent
• Active plugins configuration
• Per-plugin properties
External configuration files:
• regex-urlfilter.xml
• regex-normalize.xml
• parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)

Main Nutch workflow
Command-line: bin/nutch
• inject – initial creation of the CrawlDB: insert seed URLs (initial LinkDB is empty)
• generate – generate a new shard's fetchlist
• fetch – fetch raw content
• parse – parse content (discovers outlinks)
• updatedb – update CrawlDB from shards
• updatelinkdb – update LinkDB from shards
• index / solrindex – index shards
(repeat)
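One iteration of this workflow can be written down as data. A minimal sketch — command names as on this slide; real invocations also take CrawlDB and segment paths as arguments:

```python
def crawl_plan(cycles=2, indexer="index"):
    """Return the bin/nutch command sequence for a crawl: one inject,
    then `cycles` rounds of generate/fetch/parse/update steps, then
    indexing ("index" for Lucene, "solrindex" for Solr)."""
    plan = ["inject"]
    for _ in range(cycles):
        plan += ["generate", "fetch", "parse", "updatedb", "updatelinkdb"]
    plan.append(indexer)
    return plan

print(crawl_plan(1))
# → ['inject', 'generate', 'fetch', 'parse', 'updatedb', 'updatelinkdb', 'index']
```

Each extra cycle expands the frontier by one hop from the pages fetched so far, which is why whole-web crawls repeat the middle block many times.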

Injecting new URLs
[Diagram: Nutch building blocks, Injector highlighted]

Generating fetchlists
[Diagram: Nutch building blocks, Generator highlighted]

Workflow: generating fetchlists
What to fetch next?
• Breadth-first – important due to "politeness" limits
• Expired (past fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation:
• "topN" – select the best candidates
  − Priority based on many factors, pluggable
• Adaptive fetch schedule
  − Detect rate of changes & time of change
  − Detect unmodified content
    − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
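The two ideas above — pick the best top-N among expired pages, and adapt each page's re-fetch interval to its rate of change — can be sketched like this. This is an illustration, not Nutch's pluggable priority logic; the scoring mix and the shrink/grow factors are made up:

```python
import heapq

def generate_fetchlist(crawldb, now, top_n):
    """Select the top-N URLs whose fetch time has expired, best score
    first. 'crawldb' maps URL → (fetch_time, interval, score)."""
    due = [(-score, url)
           for url, (fetch_time, interval, score) in crawldb.items()
           if now >= fetch_time + interval]
    # nsmallest on negated scores == highest-scored candidates first
    return [url for _, url in heapq.nsmallest(top_n, due)]

def adapt_interval(interval, modified, shrink=0.8, grow=1.25):
    """Adaptive schedule: re-fetch changing pages sooner, stable
    pages later (the 0.8/1.25 factors are invented for the example)."""
    return interval * (shrink if modified else grow)
```

Over many cycles a page that never changes drifts toward long intervals, so the generator stops wasting fetch slots on it — the "don't fetch unmodified pages" goal from the previous slide.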

Fetching content
[Diagram: Nutch building blocks, Fetcher highlighted]

Workflow: fetching
• Multi-protocol: HTTP(S), FTP, file://, etc.
• Coordinates multiple threads accessing the same host
  − "Politeness" issues vs. crawling efficiency
  − Host: an IP or a DNS name?
  − Redirections: should we follow immediately? What kind?
• Other netiquette issues:
  − robots.txt: disallowed paths, Crawl-Delay
• Preparation for parsing:
  − Content type detection issues
  − Parsing usually executed as a separate step (resource-hungry and sometimes unstable)
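The per-host coordination above can be illustrated with a tiny scheduler. This is a hypothetical sketch of the politeness idea, not Nutch's fetcher: each host gets an earliest-next-contact time, so many threads can run without hammering any single server:

```python
from urllib.parse import urlparse

class PoliteScheduler:
    """Track, per host, the earliest time it may be contacted again,
    so concurrent fetcher threads stay 'polite' to each server."""

    def __init__(self, delay=1.0):
        # seconds between requests to one host; a robots.txt
        # Crawl-Delay could override this per host
        self.delay = delay
        self.next_ok = {}          # host → earliest allowed fetch time

    def acquire(self, url, now):
        """Return the time at which this URL may be fetched, and
        reserve the host's next slot."""
        host = urlparse(url).netloc
        start = max(now, self.next_ok.get(host, 0.0))
        self.next_ok[host] = start + self.delay
        return start
```

Note the trade-off from the slide: keying by DNS name is polite to virtual hosts sharing one IP, while keying by IP is polite to the physical machine — the sketch uses the host name.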

Content processing
[Diagram: Nutch building blocks, Parser highlighted]

Workflow: content processing
• Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
• Parse plugins
  − Content is parsed by MIME-type-specific parsers
  − Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
• Nutch supports many popular file formats

Link inversion
[Diagram: Nutch building blocks, Link inverter highlighted]

Workflow: link inversion
• Pages have outgoing links (outlinks) … I know where I'm pointing to
• Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either!
• In-degree indicates the importance of the page
• Anchor text provides important semantic info
• Partial answer: invert the outlinks that I know about, and group by target
[Diagram: outlinks (src → tgt) regrouped as inlinks per target]
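The inversion itself is a classic map-and-group step. A minimal in-memory sketch (in Nutch this runs as a map-reduce job over all shards, with the grouped result stored in the LinkDB):

```python
def invert_links(outlinks):
    """Invert an outlink map. 'outlinks' maps each source URL to a
    list of (target, anchor_text) pairs; the result maps each target
    URL to its incoming (source, anchor_text) pairs — i.e. the
    answer to "who points to me, and with what anchor text?"."""
    inlinks = {}
    for src, links in outlinks.items():
        for tgt, anchor in links:          # "map": emit key=tgt
            inlinks.setdefault(tgt, []).append((src, anchor))
    return inlinks                         # "reduce": grouped by tgt

out = {"src1": [("tgt1", "home"), ("tgt2", "docs")],
       "src2": [("tgt1", "start page")]}
# invert_links(out)["tgt1"] holds both incoming links with anchors
```

The per-target anchor texts collected here are what later make a page findable by words it never uses itself.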

Page importance – scoring
[Diagram: Nutch building blocks, scoring plugins highlighted]

Workflow: page scoring
• Query-independent importance factor
  − Affects search ranking
  − Affects crawl prioritization
  − May include arbitrary decisions of the operator
• Currently two systems (plugins + tools):
  − OPIC scoring: doesn't require explicit steps ("online"), but difficult to stabilize
  − PageRank scoring with loop detection: periodically run to update CrawlDB scores; computationally intensive, esp. the loop detection
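For intuition, here is the textbook PageRank iteration that the second system is built on. This sketch is not Nutch's implementation — which adds loop detection and runs as map-reduce jobs — and it ignores dangling nodes for brevity:

```python
def pagerank(graph, iterations=20, d=0.85):
    """Power-iteration PageRank over an adjacency-list graph
    (dict: node → list of targets). Every node starts with equal
    rank; each round, nodes share d of their rank over outlinks."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for src, targets in graph.items():
            if targets:
                share = d * rank[src] / len(targets)
                for t in targets:
                    nxt[t] += share
        rank = nxt
    return rank
```

On a toy graph where two pages both link to a third, the third ends up with the highest rank — in-degree translating into importance, as the link-inversion slide suggested.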

Indexing
[Diagram: Nutch building blocks, Indexer highlighted]

Workflow: indexing
• Indexing plugins
  − Create a full-text search index from segment data
  − Apply per-document score adjustments
  − May further post-process the parsed content (e.g. language identification) to facilitate advanced search
• Indexers
  − Lucene indexer – builds indexes to be served by Nutch search servers
  − Solr indexer – submits documents to a Solr instance

De-duplication
[Diagram: Nutch building blocks]

Workflow: de-duplication
• The same page may be present in many shards
  − Obsolete versions in older shards
  − Mirrored or equivalent pages (a.com → www.a.com)
• Many other pages are almost identical
  − Template-related differences (banners, current date)
  − Font / layout changes, minor re-wording
• Hmm … what is a significant change???
  − Tricky issue! Hint: what is the page content?
• Near-duplicate detection and removal
  − Nutch uses approximate page signatures (fingerprints)
  − Duplicates are only marked as deleted
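An approximate page signature can be sketched as follows. This is in the spirit of a text-profile signature, with the details invented for the example: count terms, round the counts down so small wording changes vanish, and hash the resulting profile:

```python
import hashlib
import re

def text_profile_signature(text, quant=2):
    """Approximate fingerprint of page text: tokenize, count terms,
    quantize counts down to multiples of 'quant' (dropping rare
    terms), and hash the sorted profile. Pages differing only in
    rare terms or small count shifts share a signature."""
    counts = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    profile = sorted((t, c // quant)
                     for t, c in counts.items() if c // quant > 0)
    return hashlib.md5(repr(profile).encode()).hexdigest()
```

Because the signature depends only on the dominant term profile, a banner swap or a changed date leaves it untouched, while genuinely different content produces a different fingerprint — exactly the "significant change" question above, answered approximately.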

Cycles may overlap
[Diagram: Nutch building blocks – several crawl cycles in flight at once]

Nutch distributed search
[Diagram: search front-end(s) / web app → search server API (Nutch or Solr): NutchBean, XML, JSON → search servers holding indexed shards]
• Multiple search servers
• Search front-end dispatches queries and collects search results
  − Multiple access methods (API, OpenSearch XML, JSON)
• Page servers build snippets
• Fault-tolerant* (* with degraded quality)

Search configuration
Nutch query syntax is limited – on purpose!
• Some queries are costly, e.g. leading wildcards or very long queries
• Some queries may need implicit (hidden) expansion
Query plugins translate the user query into a Lucene/Solr query.
User query: web search
Expanded query:
  +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
  +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
  url:"web search"~10^4.0 anchor:"web search"~4^2.0
  content:"web search"~10 title:"web search"~10^1.5
  host:"web search"~10^2.0
Search server configuration:
• Single search vs. distributed
• Using the Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation

Deployment: single-server
• The easiest but the most limited deployment
• Centralized storage, centralized processing
• Hadoop LocalFS and LocalJobTracker
• Drop nutch.war into Tomcat/webapps and point it to the shards

[Diagram: one machine running the search front-end, fetcher, indexer and DB management over LocalFS storage]

Deployment: distributed search
• Local storage on the search servers preferred (performance)
[Diagram: front-end host with fetcher, indexer, DB management and LocalFS storage, plus a group of search servers]

Deployment: map-reduce processing & distributed search
• Fully distributed storage and processing
• Local storage on the search servers preferred (performance)
[Diagram: Job Tracker / DFS Name Node host running the front-end, fetcher, indexer and DB management; each search server also runs a DFS Data Node and Task Tracker]
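The front-end's scatter/gather over search servers, including the "fault-tolerant with degraded quality" behavior, can be sketched like this. Here each "server" is just a function returning `(score, doc_id)` pairs; in a real deployment the calls go over RPC:

```python
import heapq

def distributed_search(query, servers, top_k=10):
    """Send the query to every search server, merge their hits by
    score, and return the global top-k. A server that fails is
    simply skipped: the search still succeeds, with degraded
    result quality."""
    hits = []
    for server in servers:
        try:
            hits.extend(server(query))
        except Exception:
            continue        # degraded results, not a failed search
    return heapq.nlargest(top_k, hits)
```

One consequence noted on the search-configuration slide: because each server scores hits against only its own shards, there is no global IDF, so scores from different servers are not perfectly comparable when merged.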

Nutch on a Hadoop cluster
• Assumes an up & running Hadoop cluster … which is covered elsewhere
• Build the nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job
• Note: the Nutch configuration is inside nutch.job
  − When you change it, you need to rebuild the job jar
• Searching is not a map-reduce job – it often runs on a separate group of machines


Nutch index
[Screenshot]

… and press Search
[Screenshot]

Conclusions
(This overview is just the tip of the iceberg)
Nutch:
• Implements all core search engine components
• Scales well
• Is extremely configurable and modular
• It's a complete solution – and a toolkit

Future of Nutch
Avoid code duplication:
• Parsing → Tika
  − Almost total overlap
  − Still some missing functionality in Tika → contribute
• Plugins → OSGi
  − Home-grown plugin system has some deficiencies
  − Initial port available
• Indexing & search → Solr, ZooKeeper
  − Distributed and replicated search is difficult
  − Initial integration needs significant improvement
  − Shard management – Katta?
• Web graph & page repository → HBase
  − Combine CrawlDB, LinkDB and shard storage
  − Avoid tedious shard management
  − Initial port available

Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: an à la carte toolkit, scalable from one to thousands of nodes

Summary

Q&A

Further information:
• http://lucene.apache.org/nutch