EPL 660: Lab 6 Introduction to Nutch
University of Cyprus, Department of Computer Science
Andreas Kamilaris

Overview
• Complete Web search engine:
– Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins (e.g. parsing) + MapReduce & Distributed FS (Hadoop)
• Java-based
• Open source

Reasons to run your own search engine
• Transparency: Nutch is open source, so anyone can see how its ranking algorithms work.
– Google allows rankings to be influenced by payments.
– Nutch can therefore be used by academic and governmental organizations, where fairness of rankings may be very important.
• Understanding: see how a large-scale search engine works.
– Google's source code is not available.
• Extensibility: Nutch can be customized and incorporated into your own application.

Nutch in Practice
• Nutch installations typically operate at one of three scales:
– Local filesystem → reliable (no network errors; caching is unnecessary).
– Intranet.
– Whole Web → whole-Web crawling is difficult.
• Building a complete Web search engine raises many crawling-oriented challenges:
– Which pages do we start with?
– How do we partition the work between a set of crawlers?
– How often do we re-crawl?
– How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content?

Nutch vs. Lucene
• Nutch is built on top of Lucene.
• "Should I use Lucene or Nutch?"
– Use Lucene if you don't need a web crawler, e.g. if you want to make a database searchable.
– Nutch is a better fit for sites where you don't have direct access to the underlying data, or where the data comes from disparate sources.

Nutch Architecture
• Nutch → crawler + searcher.
• Crawler: fetches pages and builds an inverted index.
• Searcher: uses the inverted index to answer queries.
• The crawler and the searcher are highly decoupled, enabling independent scaling on separate hardware platforms.
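The crawler/searcher split can be illustrated with a toy inverted index. This is a minimal Python sketch, not Nutch code: the "crawler" side only builds the index, and the "searcher" side answers queries using nothing but that index. All names and the two sample pages are invented for illustration.

```python
# Toy illustration of the crawler/searcher split (not Nutch code).
# The "crawler" builds an inverted index: term -> set of page URLs.
# The "searcher" answers queries using only that index.

def build_index(pages):
    """Crawler side: invert fetched page text into term -> {urls}."""
    index = {}
    for url, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index

def search(index, query):
    """Searcher side: intersect the posting sets of all query terms."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Two pretend pages, named after the tinysite example used later.
pages = {
    "http://keaton/tinysite/A.html": "nutch is a web crawler",
    "http://keaton/tinysite/B.html": "lucene is an indexing library",
}
index = build_index(pages)
print(search(index, "nutch crawler"))  # pages containing both terms
```

Because the searcher touches only `index`, the two halves could run on separate machines, which is exactly the decoupling the architecture slide describes.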
Nutch Crawler
• The crawler consists of four main components:
– WebDB
– Segments
– Index
– Crawl tool

Web Database (WebDB)
• A persistent data structure that mirrors the structure and properties of the Web graph being crawled.
• Used only by the crawler (not used during searching).
• The WebDB stores two types of entities:
– Pages: pages on the Web.
– Links: the set of links from one page to other pages.
• In the WebDB's Web graph, the nodes are pages and the edges are links.

Segments
• A segment is a collection of pages fetched and indexed by the crawler in a single run.
– Segments have a limited lifespan and are named by the date and time they were created.
• A segment's fetchlist is the list of URLs for the crawler to fetch.

Index
• Nutch uses Lucene for indexing.
• The index is an inverted index of all of the pages the system has retrieved.
– Each segment has its own index.
• A (global) inverted index is created by merging all the individual segment indexes.

Crawl tool
• Crawling is a cyclical process:
1. The crawler generates a set of fetchlists from the WebDB.
2. A set of fetchers downloads the content from the Web.
3. The crawler updates the WebDB with new links that were found.
4. The crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle).
5. The cycle repeats.

Steps in a Crawl+Index cycle
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from the URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).

Nutch as a Crawler
[Figure: crawler data flow. The Injector seeds the WebDB with the initial URLs; under the Crawl tool, the Generator generates fetchlists into a segment, the Fetcher downloads webpages/files from the Web and reads/writes the segment, and the Parser parses the fetched content; the results update the WebDB.]

Nutch as a complete Web Search Engine
[Figure: the Segments, WebDB, and LinkDB feed the Lucene Indexer, which builds the Index; the Lucene Searcher answers queries over the index through a Tomcat-based GUI.]

Running a Crawl
• [Figure: the structure of the small site we are going to crawl.]
• echo 'http://keaton/tinysite/A.html' > urls
– The file urls contains the root URL from which to populate the initial fetchlist (page A).
• The Crawl tool uses a filter to decide which URLs go into the WebDB.
– Here we restrict the domain to the server on the intranet (keaton).
• bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log
– The Crawl tool uses the root URLs in the urls file to start the crawl.
– The results go to the directory crawl-tinysite.
– The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage.

Examine Results (File System)
• Directories and files created after running the Crawl tool:
[Figure: the crawl directory holds the WebDB (pages A, B, C, and C-dup; the links to Wikipedia are not in the WebDB because a filter was used), the Lucene index, and the segments (pages).]
• The crawl created three segments in timestamped subdirectories.
• Each segment has its own index.

Examine Results (Pages & Links)
• Note: the arguments shown here changed into -stats in release 1.2.

Examine Results (Segments)
• The Crawl tool created three segments in timestamped subdirectories.
– Note: the command changed into readseg in release 1.2.
• PARSED column: useful when running fetchers with parsing turned off, so that parsing can be run later as a separate process.
• STARTED and FINISHED columns: indicate the times when fetching started and finished.
– Invaluable for bigger crawls, when tracking down why crawling is taking a long time.
• COUNT column: shows the number of fetched pages in the segment.
– E.g. the last segment has two entries, corresponding to pages C and C-duplicate.

Examine Results (Index & Search)
• Command-line searching through NutchBean:
bin/nutch org.apache.nutch.searcher.NutchBean <keyword>
where <keyword> is the search term.
• GUI-based searching with Luke, the Lucene Index Toolbox.
– Luke accesses existing Lucene indexes and allows you to display and modify their contents.
– You can browse by document number, view documents, execute searches, analyze search results, retrieve ranked lists, etc.
– Download from: http://code.google.com/p/luke/

Nutch Distributed File System (NDFS)
• NDFS stores the crawl data and the indexes.
• Data is divided into blocks.
• Blocks can be copied and replicated.
• Namenode vs. datanodes:
– Datanodes hold and serve blocks.
– The namenode holds the metadata:
  Filename → block list
  Block → datanode location
• Datanodes report to the namenode every few seconds.

Nutch & Hadoop
• Hadoop is used in Nutch to manage the data obtained from the crawling process.
• MapReduce is used for indexing, parsing, WebDB construction, and even fetching.

Plugins
• Plugins provide extensions to extension points.
• Each extension point defines an interface that must be implemented by the extension.
• Some core extension points:
– IndexingFilter: add metadata to indexed fields.
– Parser: parse a new type of document.
– NutchAnalyzer: language-specific analyzers.

Get Started with Nutch
1. Download the latest Apache Nutch release (release 1.2) from: http://www.apache.org/dyn/closer.cgi/nutch/
2. Set NUTCH_JAVA_HOME to the root of your JVM installation. (You also need to set JAVA_HOME for this to work.)
3. Open the conf/nutch-default.xml file, search for http.agent.name, and give it the value "MYNAME Spider".
4. Create a urls file containing a list of root URLs.
5. You can filter the crawling by editing the file conf/crawl-urlfilter.txt, replacing MY.DOMAIN.NAME with the name of the domain you wish to crawl. (In fact, if you don't do this, the crawl will not work!)

Installing in Tomcat
1. You need to put the Nutch war file into your servlet container.
2. Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch war file may be installed with the commands:
mkdir ~/local/tomcat/webapps/nutch
cp nutch*.war ~/local/tomcat/webapps/nutch/
jar xvf ~/local/tomcat/webapps/nutch/nutch.war
rm nutch-1.1.war
3. The webapp finds its indexes in ./crawl, relative to where you start Tomcat. Start Tomcat using a command like:
~/local/tomcat/bin/catalina.sh start
4. Then visit: http://localhost:8080/nutch/

Crawl Command vs. Whole-Web Crawling
• The crawl command is more appropriate when you intend to crawl up to around one million pages on a handful of Web servers.
• Whole-Web crawling is designed to handle very large crawls, which may take weeks to complete running on multiple machines, and offers:
– more control over the crawl process;
– incremental crawling.

References
• Nutch Web site: http://nutch.apache.org/
• Nutch Docs: http://lucene.apache.org/nutch/
• Nutch Wiki: http://wiki.apache.org/nutch/ (support, mailing lists, tutorials, presentations)
• Prasad Pingali, CLIA consortium, Nutch Workshop, 2007.
• Tom White, "Introduction to Nutch", java.net: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
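As a recap, the generate/fetch/updatedb cycle from the "Steps in a Crawl+Index cycle" slide can be summarized with a toy in-memory simulation. This is an illustrative Python sketch, not Nutch code: the miniature link graph loosely mirrors the tinysite example, and all names are invented.

```python
# Toy, in-memory simulation of Nutch's generate/fetch/updatedb cycle.
# Not Nutch code: the link graph and all names are invented for illustration.

WEB = {  # pretend Web: URL -> outlinks (loosely mirrors the tinysite example)
    "A.html": ["B.html", "C.html"],
    "B.html": [],
    "C.html": ["A.html"],
}

def crawl(root_urls, depth):
    webdb = set(root_urls)  # inject: seed the WebDB with the root URLs
    fetched = set()
    for _ in range(depth):  # one generate/fetch/update cycle per depth level
        fetchlist = webdb - fetched           # generate: known but unfetched URLs
        if not fetchlist:
            break
        for url in fetchlist:                 # fetch: download each listed page
            fetched.add(url)
            webdb.update(WEB.get(url, []))    # updatedb: record newly found links
    return fetched

print(sorted(crawl(["A.html"], depth=3)))
```

The `depth` parameter plays the same role as the -depth flag of the crawl command: with depth 1 only the root page is fetched, while depth 3 is enough to reach every page in this tiny graph.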