Distributed Web Crawlers Using Hadoop
International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 12, Number 24 (2017), pp. 15187-15195
© Research India Publications. http://www.ripublication.com

Pratiba D, Assistant Professor, Department of Computer Science and Engineering, R V College of Engineering, R V Vidyanikethan Post, Mysuru Road, Bengaluru, Karnataka, India. Orcid Id: 0000-0001-9123-8687

Shobha G, Professor, Department of Computer Science and Engineering, R V College of Engineering, R V Vidyanikethan Post, Mysuru Road, Bengaluru, Karnataka, India.

LalithKumar H, Student, Department of Computer Science and Engineering, R V College of Engineering, R V Vidyanikethan Post, Mysuru Road, Bengaluru, Karnataka, India.

Samrudh J, Student, Department of Information Science and Engineering, R V College of Engineering, R V Vidyanikethan Post, Mysuru Road, Bengaluru, Karnataka, India.

Abstract

A web crawler is a piece of software that crawls the WWW to build the database for a search engine. In recent years, web crawling has come to face several challenges. First, web pages are highly unstructured, which makes it difficult to maintain a generic schema for storage. Second, the WWW is too large to be indexed in its entirety. Finally, the most difficult challenge is crawling the deep web. Here we propose a novel web crawler that uses Neo4J and HBase as data stores and applies Natural Language Processing (NLP) and machine learning techniques to resolve the above-mentioned problems.

Keywords: NoSQL, Neo4J, HBase, Natural Language Processing, Reinforcement learning, Deep web

INTRODUCTION

A web crawler is an application that feeds a search engine with web content; the search engine then searches over the crawler's index of the WWW. The data store used for storage, querying and retrieval of the indexed content therefore plays an important role in determining the crawler's performance. With the ever-increasing structural complexity of web pages, there is a need for schema-less storage and simpler query techniques. The database needed here must be schema-less, horizontally scalable, and must offer high performance for queries over large amounts of data. It should also be able to represent the relationships between web pages so that the most appropriate results can be computed quickly. These needs can be met by using HBase as the data store for the indexed pages and a graph database such as Neo4J for storing page and file relationships and hierarchy.
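To make this storage split concrete, the following sketch shows how a crawled page and one of its outgoing links might be written to the two stores, using the standard HBase Java client and the Neo4j Java driver. The table name webpage, the column family content, the Page node label and the LINKS_TO relationship type are illustrative assumptions, not the schema used in this work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

/**
 * Minimal sketch of the two-store layout described above:
 * summarized page text goes into HBase (row key = URL), while the
 * page-to-page link structure goes into Neo4j as a graph.
 * Table, column-family, label and relationship names are
 * illustrative, not the authors' actual schema.
 */
public class PageStore implements AutoCloseable {
    private final Connection hbase;
    private final Table pages;
    private final Driver neo4j;

    public PageStore(String neo4jUri, String user, String password) throws Exception {
        Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml from the classpath
        hbase = ConnectionFactory.createConnection(conf);
        pages = hbase.getTable(TableName.valueOf("webpage"));  // assumed table with family "content"
        neo4j = GraphDatabase.driver(neo4jUri, AuthTokens.basic(user, password));
    }

    /** Store the summarized text of a crawled page in HBase. */
    public void savePage(String url, String summarizedText) throws Exception {
        Put put = new Put(Bytes.toBytes(url));
        put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("text"),
                      Bytes.toBytes(summarizedText));
        pages.put(put);
    }

    /** Record a hyperlink between two pages as a LINKS_TO edge in Neo4j. */
    public void saveLink(String fromUrl, String toUrl) {
        try (Session session = neo4j.session()) {
            session.run(
                "MERGE (a:Page {url: $from}) " +
                "MERGE (b:Page {url: $to}) " +
                "MERGE (a)-[:LINKS_TO]->(b)",
                Values.parameters("from", fromUrl, "to", toUrl));
        }
    }

    @Override
    public void close() throws Exception {
        pages.close();
        hbase.close();
        neo4j.close();
    }
}
```

Keeping the bulky page text in HBase and only the link structure in Neo4j lets the text scale horizontally with the Hadoop cluster while the graph stays small enough for fast relationship queries.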
With the increasing size of web content, it is very difficult to index the entire WWW, so effort must be made to limit the amount of data indexed while maximizing coverage. To minimize the amount of data indexed, Natural Language Processing techniques are applied to filter and summarize each web page and to store only its essential details. A search can then yield the most accurate and appropriate results based on the prominence of words rather than their mere presence. The deep web refers to the sections of the WWW that are difficult to index: dynamic pages that can be generated only through a form, and pages tailored to the access context and navigation sequence, are the main obstacles encountered while crawling it. To maximize coverage of the WWW, a web crawler can never neglect the deep web. By applying Natural Language Processing and reinforcement learning to gain domain-specific knowledge of a website, the crawler can build intelligent queries against forms and navigate through the site. In this paper, we deliberate upon three methods to improve the efficiency of a web crawler.

A web crawler is an integral part of any search engine. As the number of pages increases massively, a sophisticated search engine is needed to track and traverse this huge volume of data and provide users with better results for their queries. Hence, a web crawler keeps track of all the web pages indexed by the search engine by traversing the internet, downloading as many web pages as possible and indexing the relevant ones. These indexes are kept up to date along with the paths to the relevant web pages.

A web crawler is thus an automated script that periodically extracts data from the WWW and feeds a search engine with processed web content, which makes the ranking process easier. Search engines use crawling applications to periodically refresh their own indexed web content or the indexes of other sites. Crawlers consume resources on the servers they visit, often without the site's approval, so a crawler needs to take issues of scheduling, load and politeness delay into consideration when large collections of pages are indexed. Public sites also have mechanisms to inform crawlers that their content is not allowed to be crawled.

A web crawler has a distinctive life cycle that starts with a set of seed URLs, obtained from the user or from a web page, from which it crawls deeper into the sub-URLs found in each page. The procedure begins by downloading the corresponding web pages and extracting all the sub-URLs found in them, which are then used to populate the frontier, a queue data structure. Downloading is done over the Hypertext Transfer Protocol (HTTP), and the downloaded pages are indexed. The crawler stops when no more URLs remain in the frontier.
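This life cycle (seed URLs, a frontier queue, HTTP download, sub-URL extraction, and termination when the frontier is empty) can be illustrated with a minimal single-threaded sketch. The version below uses the jsoup library for fetching and link extraction and a simple page limit and visited set in place of the scheduling and politeness mechanisms discussed above; it illustrates the life cycle only and is not the distributed crawler proposed in this paper.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Minimal breadth-first crawler illustrating the life cycle above:
 * seed URLs populate a frontier (queue), each page is downloaded
 * over HTTP, its sub-URLs are extracted and enqueued, and the
 * crawl stops when the frontier is empty (or a page limit is hit).
 */
public class SimpleCrawler {

    public static void crawl(Iterable<String> seeds, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        seeds.forEach(frontier::add);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                                   // already processed this URL
            }
            try {
                Document page = Jsoup.connect(url).get();   // download over HTTP
                index(url, page.text());                    // hand the text to the indexer
                for (Element link : page.select("a[href]")) {
                    String next = link.attr("abs:href");    // resolve to an absolute URL
                    if (!next.isEmpty() && !visited.contains(next)) {
                        frontier.add(next);                 // populate the frontier
                    }
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }

    /** Placeholder for the indexing step (e.g. the HBase store sketched earlier). */
    private static void index(String url, String text) {
        System.out.println("Indexed " + url + " (" + text.length() + " chars)");
    }

    public static void main(String[] args) {
        crawl(java.util.List.of("https://example.com/"), 50);
    }
}
```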
Types of Web Crawlers

Web crawlers are classified [1] based on the methodology used to crawl web pages. Some of the types are listed below:

Incremental Web Crawlers: Traditional crawlers replace old documents with new ones to refresh their collection. An incremental crawler, in contrast, updates an existing set of indexed pages instead of re-indexing them on subsequent crawls invoked at fixed intervals. It refreshes the existing set of pages by revisiting each page at a frequency based on how often that page changes, and it exchanges less important pages for more important ones. The advantage of an incremental crawler is that only valued data is provided to the user, which saves network bandwidth and improves the quality of the collection.

Hidden Web Crawlers: Crawlers used to crawl the deep web are called hidden web crawlers. The hidden web refers to the abundant data stored inside databases; these data are hidden behind query interfaces and are not directly accessible to the search engine.

Focused Crawlers: A focused crawler, also called a topic crawler, minimizes downloads by fetching only the appropriate and significant pages that are related to each other. It avoids crawling unimportant paths by limiting the area of interest of a search to a particular topic and to particular network resources, which reduces network traffic and downloads.

Parallel Crawlers: Web pages are crawled in parallel by multiple threads to speed up indexing and to counter the problem of an indefinite wait on a single page. This crawler consists of several crawling processes, known as C-procs, which run on network terminals. This type of crawling system plays a vital role in downloading documents in a reasonable amount of time.

Distributed Web Crawlers: Sometimes even a process with multiple threads is not enough to index a large number of web pages. In such a situation, a cluster of many workstations is used to crawl the web. A central server manages the interaction and coordination of the nodes, which are geographically dispersed. This type of crawling is called distributed web crawling [2]. It mainly uses the PageRank algorithm for better efficacy and search quality, and its advantage is that it protects against system crashes.

Breadth First Crawlers: This method crawls in breadth-first fashion [3]: all web pages at a lower depth are crawled before any web page at a higher depth.

Traditional Web Crawlers

In 1993, four web crawlers were introduced: the World Wide Web Wanderer, Jump Station, the World Wide Web Worm, and the RBSE spider. They collected data, information and statistics about the web starting from a set of seed URLs, iteratively downloading URLs and updating their repository of URLs from the downloaded pages.

In 1994, two new web crawlers, WebCrawler and the MOMspider crawler [4], were introduced. These crawlers continued collecting data and statistics about the web and added the concepts of politeness and black-lists to traditional web crawling. WebCrawler could download 15 links simultaneously and is therefore considered the first parallel web crawler. The number of indexed pages grew from 110,000 (World Wide Web Worm) to 2 million (WebCrawler). These crawlers, however, raised scalability issues.

Brin et al. [5] addressed the issue of scalability by introducing a large-scale web crawler called Google, which tackled the problem in several ways. First, it is optimized to reduce disk access time through techniques such as compression and indexing. Second, the probability of a user visiting a page is calculated using the PageRank algorithm, which computes this probability from the number of links that point to the page along with the style of those links. This approach optimizes the resources available to the web crawler by reducing the rate of
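The PageRank computation referred to in the previous paragraph can be illustrated with the standard iterative formulation: a page's rank estimates the probability that a random surfer visits it, accumulated from its incoming links and scaled by a damping factor (commonly 0.85). The sketch below is the textbook algorithm; it omits the link-style weighting mentioned in the text and is not Google's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Textbook iterative PageRank: the rank of a page approximates the
 * probability that a random surfer visits it, computed from the
 * number and rank of the pages that link to it.
 */
public class PageRank {

    /** graph maps each page to the list of pages it links to. */
    public static Map<String, Double> compute(Map<String, List<String>> graph,
                                              double damping, int iterations) {
        int n = graph.size();
        Map<String, Double> rank = new HashMap<>();
        graph.keySet().forEach(p -> rank.put(p, 1.0 / n));        // uniform starting rank

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            graph.keySet().forEach(p -> next.put(p, (1.0 - damping) / n));
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                List<String> outLinks = e.getValue();
                if (outLinks.isEmpty()) {
                    continue;                                     // dangling pages skipped for brevity
                }
                double share = damping * rank.get(e.getKey()) / outLinks.size();
                for (String target : outLinks) {
                    next.merge(target, share, Double::sum);       // add this page's contribution
                }
            }
            rank.clear();
            rank.putAll(next);
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        compute(graph, 0.85, 20).forEach((page, r) ->
            System.out.printf("%s -> %.4f%n", page, r));
    }
}
```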