An Approach of Web Crawling and Indexing of Nutch

N. Kowsalya
Assistant Professor, Department of Computer Applications
Vivekanandha College of Arts and Sciences for Women (Autonomous)
Tiruchengode-05, Tamil Nadu

Abstract - Nutch has a highly modular architecture. Crawling is the operation that navigates and retrieves the information in web pages, populating the set of documents that will be searched. Typically, at the centre of any IR system is the inverted index: for each term, it contains a posting list, which lists the documents, represented as integer document-IDs (doc-IDs), that contain the term. Nutch's dedup command eliminates duplicate documents from a set of Lucene indices for Nutch segments, and Nutch adds support for HTML extraction on top of Lucene. Nutch includes a link analysis algorithm similar to PageRank. The WebDB contains the web graph of pages and links; the fetchlists generated from it contain every URL we are interested in downloading. The query engine consists of one or more front-ends and one or more back-ends, each back-end being associated with a segment of the complete data set. The driver represents external users, and it is the point at which the performance of the query engine is measured, in queries per second (qps).

Index Terms - Crawling; Retrieving; Indexing; Link Analysis; PageRank; WebDB; Web Graph

1 Introduction:

Apache Nutch is an open source web crawler that is used for crawling websites. It is extensible and scalable, and it provides parsing, indexing, and scoring filters for custom implementations. Nutch is an effort to build an open source search engine based on Lucene Java for its search and index component. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying, and clustering.

Nutch is a complete open-source Web search engine package that aims to index the World Wide Web as effectively as commercial search services. As a research platform it is also promising at smaller scales, since its flexible architecture enables communities to customize it; it can even scale down to a personal computer. Its founding goal was to increase the transparency of the Web search process as searching becomes an everyday task. The nonprofit Nutch Organization supports the open-source development effort as it addresses the significant technical challenges of operating at the scale of the entire public Web. Nutch server installations have already indexed 100M-page collections while providing state-of-the-art search result quality. At the same time, smaller organizations have adopted Nutch for intranet and campus networks; at this scale, all of its components can run on a single server.

Nutch is an open-source project hosted by the Apache Software Foundation. It provides a complete, high-quality Web search system, as well as a flexible, scalable platform for the development of novel Web search engines. Nutch includes:
• a web crawler;
• parsers for web content;
• a link-graph builder;
• schemas for indexing and search;
• distributed operation, for high scalability;
• an extensible, plugin-based architecture.

Nutch is implemented in Java and its code is open source; it therefore runs on many operating systems and a wide variety of hardware.

2 Nutch Architecture:

Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML analysis, data retrieval protocols, and queries. The core has four major components:

Searcher: Given a query, it must quickly find a small relevant subset of the corpus of documents and present them. Finding that relevant subset is normally done with an inverted index of the corpus; ranking within the subset produces the most relevant documents, which must then be summarized for display.

Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene to store the indexes.

Database: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.

Fetcher: Requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.

Figure-1 outlines the relationships between these elements: components that refer to each other are placed in the same box, and the components they depend on appear in a lower layer. For example, protocol does not depend on net, because protocol is only an interface point for the plugins that actually provide much of Nutch's functionality.

[Figure-1: Nutch Architecture]
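To make the division of labour among the four components concrete, the following sketch phrases them as plain Java interfaces. It is a minimal illustration only: the type and method names (Fetcher, Indexer, Searcher, Database, FetchedPage, Hit) are invented for this example and do not correspond to Nutch's actual API.

import java.util.List;

// Illustrative component boundaries only; Nutch's real classes differ.
interface Fetcher {
    FetchedPage fetch(String url);          // request and parse one page
}

interface Indexer {
    void index(FetchedPage page);           // add a page to the inverted index
}

interface Searcher {
    List<Hit> search(String query, int k);  // top-k documents for a query
}

interface Database {
    void store(FetchedPage page, long fetchTimeMillis); // content + fetch time
}

record FetchedPage(String url, String text, List<String> outlinks) {}
record Hit(int docId, float score, String summary) {}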
2.1 Crawling:

Crawling is the operation that navigates and retrieves the information in web pages, populating the set of documents that will be searched. This set of documents is called the corpus, in search terminology. Crawling can be performed on internal networks (intranets) as well as external networks (the Internet). Crawling, particularly on the Internet, is a complex operation: either intentionally or unintentionally, many web sites are difficult to crawl, and sophisticated technology is required to develop a good crawler.

The performance of crawling is usually limited by the bandwidth of the network between the system doing the crawling and the outside world. In the Commercial Scale Out project, the system setup did not have a high-bandwidth connection to the outside world, and it would have taken a long time to populate the system with enough documents to create an interesting corpus.

An intranet or niche search engine might take only a few hours to crawl on a single machine, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetchlist from the WebDB, fetching those pages, parsing them for links, and then updating the WebDB (see the sketch at the end of this section). In standard crawling terminology, Nutch's crawler supports both crawl-and-stop and crawl-and-stop-with-threshold (the latter requires feedback from scoring and a specified floor). It also uses a uniform refresh policy: all pages are refetched at the same interval (30 days, by default) regardless of how frequently they change (individual recrawl deadlines can, however, be set on every page). The fetching process must also respect bandwidth and other limitations of the target website. Any polite solution requires coordination before fetching; Nutch uses the most straightforward localization of references possible, namely making all fetches from a particular host run on one machine.
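The crawl cycle just described can be pictured as a simple generate-fetch-update loop. The sketch below is a minimal, single-threaded illustration under invented names (WebDb, Fetcher, Page); it is not Nutch's actual code, which batches, parallelizes, and persists each of these steps.

import java.util.List;

// Minimal sketch of one Nutch-style crawl cycle; all names are illustrative.
public class CrawlCycle {

    interface WebDb {
        List<String> generateFetchlist(int topN);                 // step 1
        void update(String url, List<String> outlinks,
                    long fetchTimeMillis);                        // step 4
    }

    interface Fetcher {
        Page fetch(String url);                                   // steps 2-3
    }

    record Page(String url, List<String> outlinks) {}

    static void runOneCycle(WebDb webDb, Fetcher fetcher, int topN) {
        // 1. Generate a fetchlist of URLs that are due for (re)fetching.
        for (String url : webDb.generateFetchlist(topN)) {
            // 2. Fetch and parse the page. A polite crawler also honours
            //    robots.txt and keeps all fetches for one host on one machine.
            Page page = fetcher.fetch(url);
            if (page == null) continue;          // fetch failed; retry later
            // 3-4. Fold the extracted links back into the WebDB.
            webDb.update(page.url(), page.outlinks(),
                         System.currentTimeMillis());
        }
    }
}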
2.2 Indexing:

To allow efficient retrieval of documents from a corpus, suitable data structures must be created; these are collectively known as an index. A corpus usually covers many documents, so the index is held on a large storage device, commonly one or more hard disks. Typically, at the centre of any IR system is the inverted index. For each term, the inverted index contains a posting list, which lists the documents, represented as integer document-IDs (doc-IDs), that contain the term. Each posting in the posting list also stores sufficient statistical information to score each document, such as the frequency of the term's occurrences and, possibly, positional information (the position of the term within each document, which facilitates phrase or proximity search) or field information (the occurrence of the term in various semi-structured areas of the document, such as the title, enabling these occurrences to be weighted more highly during retrieval). The inverted index does not store the textual terms themselves; instead, an additional structure known as a lexicon stores them, along with pointers to the corresponding posting lists within the inverted index. A document index may also be created, storing meta-information about each document within the inverted index, such as an external name for the document (e.g. its URL) and the length of the document. The process of generating these structures is known as indexing.

2.2.1 Indexing Text:

Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of Lucene's multi-field, case-folding keyword and phrase search in URLs, anchor text, and document text. The typical definition of Web search does not require certain index types that Lucene does not address, such as regular-expression matching (using, say, suffix trees) or document-similarity clustering (using, say, term vectors).

2.2.2 Indexing Hypertext:

Lucene provides an inverted-file full-text index, which suffices for indexing the text but not for the additional tasks required by a web search engine. In addition, Nutch implements a link database to provide efficient access to the Web's link graph, and a page database that stores crawled pages for indexing, summarizing, and serving to users, as well as for supporting other functions such as crawling and link analysis. In the terminology of one web search engine survey, Nutch combines the text and utility databases into its page database. Nutch also adds support for HTML extraction on top of Lucene. When a page is fetched, any embedded links are considered for addition to the fetchlist, and each link's anchor text is also stored.

Link analysis does not use Nutch's homegrown IPC service; like fetching, the work is coordinated through the appearance of files in a shared directory. There are better techniques for distributing this work (such as MapReduce) and for accelerating link analysis. Nutch includes a parallel indexing operation written using the MapReduce programming model. MapReduce provides a convenient way of addressing an important class of real-life commercial applications by hiding the parallelization and the fault tolerance from programmers, letting them focus on the problem domain. MapReduce was published by Google in 2004 and quickly became a popular approach for parallelizing commercial applications.
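To make the structures of Section 2.2 concrete, the toy index below keeps a lexicon mapping each term to a posting list, where every posting records a doc-ID and the term's positions (from which the term frequency follows). This is a minimal in-memory sketch, not Nutch's or Lucene's actual implementation, and the tokenizer is deliberately crude.

import java.util.*;

// Toy inverted index illustrating Section 2.2: lexicon -> posting lists.
public class TinyInvertedIndex {

    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>(); // enables phrase search
        Posting(int docId) { this.docId = docId; }
        int termFrequency() { return positions.size(); }   // usable for scoring
    }

    // The "lexicon": term -> posting list, with doc-IDs in ascending order
    // as long as documents are added in doc-ID order.
    private final Map<String, List<Posting>> lexicon = new HashMap<>();

    public void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+"); // crude tokenizer
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            List<Posting> postings =
                lexicon.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            Posting last = postings.isEmpty()
                ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId);
                postings.add(last);
            }
            last.positions.add(pos); // record where the term occurred
        }
    }

    // Return the doc-IDs whose posting lists contain the term.
    public List<Integer> lookup(String term) {
        List<Integer> docIds = new ArrayList<>();
        for (Posting p : lexicon.getOrDefault(term.toLowerCase(), List.of()))
            docIds.add(p.docId);
        return docIds;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Nutch crawls the web");
        index.addDocument(2, "Lucene indexes the crawled web pages");
        System.out.println(index.lookup("web")); // prints [1, 2]
    }
}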