Implementation of Efficient Distributed Crawler Through Stepwise Crawling Node Allocation
Journal of JAITC, Vol. 10, No. 2, pp. 15-31, Dec. 31, 2020
http://dx.doi.org/10.14801/JAITC.2020.10.2.15

Hyuntae Kim1, Junhyung Byun2, Yoseph Na3, and Yuchul Jung4*
1,4 Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology, Gumi, Korea
2,3 Undergraduate student, Department of Computer Engineering, Kumoh National Institute of Technology, Gumi, Korea
[email protected], orcid: https://orcid.org/0000-0002-9803-8642
[email protected], orcid: https://orcid.org/0000-0002-6543-805X
[email protected], orcid: https://orcid.org/0000-0002-5360-7418
[email protected], orcid: https://orcid.org/0000-0002-8871-1979 (*Corresponding Author)

Abstract

Various websites have been created owing to the increased use of the Internet, and the number of documents distributed through these websites has increased proportionally. However, it is not easy to collect newly updated documents rapidly. Web crawling methods have been used to continuously collect and manage new documents, but existing crawling systems based on a single node show limited performance, and crawlers that apply distribution methods face the problem of managing their crawling nodes effectively. This study proposes an efficient distributed crawler through stepwise crawling node allocation, which identifies the properties of websites and establishes crawling policies based on those properties to collect a large number of documents from multiple websites. The proposed crawler can calculate the number of documents included in a website, compare the data collection time and the amount of data collected for different numbers of nodes allocated to a specific website by repeatedly visiting the website, and automatically allocate the optimal number of nodes to each website for crawling. An experiment was conducted in which the proposed and single-node methods were applied to 12 different websites; the results indicate that the proposed crawler's data collection time decreased significantly compared with that of a single-node crawler because the proposed crawler applied data collection policies tailored to each website. In addition, it is confirmed that the work rate of the proposed model increased.

Keywords: web crawling, docker swarm, virtual nodes, documents, scrapy, efficiency

Received: Dec. 04, 2020  Revised: Dec. 25, 2020  Accepted: Dec. 28, 2020

1. Introduction

The development of the Internet has provided an environment in which an increasing number of websites can provide information, generate various types of data, and then distribute those data. In particular, documents with large-volume data have been collected, analyzed, and processed, as shown by open-source frameworks developed for document crawling such as Crawler4j [1], Scrapy [2], and Apache Nutch [3]. Simple web crawlers [4] can be operated on a single virtual node, machine, or process; however, relying on only a single node limits how quickly these crawlers can collect data. Owing to the limited performance and speed of such single-node crawlers, distributed crawlers have recently been used for rapid web document collection [5].
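As a concrete reference point for the single-node case discussed above, the following minimal sketch shows a single-node crawler written with the Scrapy framework. It is illustrative only and is not part of the proposed system; the spider name, target URL, and CSS selectors are hypothetical placeholders.

    # Minimal single-node Scrapy spider (hypothetical site and selectors).
    import scrapy

    class DocumentSpider(scrapy.Spider):
        name = "document_spider"                    # hypothetical spider name
        start_urls = ["https://example.com/board"]  # placeholder target website

        def parse(self, response):
            # Follow each document link on the listing page (selector is an assumption).
            for href in response.css("a.document::attr(href)").getall():
                yield response.follow(href, callback=self.parse_document)
            # Follow pagination, if present, to reach further listing pages.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_document(self, response):
            # Emit one item per collected document page.
            yield {"url": response.url, "title": response.css("title::text").get()}

Such a spider can be executed with the scrapy runspider command, assuming Scrapy is installed. Because it runs as a single process, its collection speed is bounded by the bandwidth, CPU, and memory of that one node, which is the limitation that the distributed approaches below address.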
Scrapinghub [6], based on the Scrapy framework, is a main example of a distributed crawler. Although it performs distributed crawling using the Scrapy framework and virtual nodes, it has limited memory capacity per virtual node, and its allocation of nodes to websites is restricted. Moreover, several crawling systems do not consider bandwidth allocation for web crawler systems with time-related restrictions, as mentioned in a previous study [7].

This study proposes an efficient distributed crawler through stepwise crawling node allocation (EDC-SCNA), which can efficiently operate multiple crawling nodes. The proposed crawler's main function is to determine the number of crawling nodes required for data collection by controlling the number of Docker containers (i.e., Docker-based virtual nodes) while considering the number of documents stored on the target websites and the website access environments.

To verify the efficiency of the implemented EDC-SCNA system, 12 websites were selected and categorized into three types based on the number of documents stored. An experiment was then conducted to compare the proposed system with Crawler4j [1], a Java-based open-source web crawler, in terms of data collection time and the amount of data collected. The experimental results indicate that the proposed model reduced data collection time by approximately 10% or more and collected approximately 12% more documents than Crawler4j. This result was obtained because the proposed model performed more efficient crawling by calculating the optimal number of virtual nodes for crawling and allocating additional resources.

This paper is organized as follows. In Chapter 2, the trend of current research regarding web crawlers is analyzed. In Chapter 3, the structure and elements of the proposed distributed crawler system are described. In Chapter 4, the experimental evaluation of the proposed model is presented. In Chapter 5, the strengths, weaknesses, and limitations of the proposed model are discussed. In Chapter 6, conclusions are presented.

2. Related works

Existing distributed crawlers include Apache Nutch [2], Scrapy [3], and Crawler4j [1]. Distributed crawling systems have recently been implemented based on Docker, which is used for container virtualization. In addition, frameworks for efficient Docker management (e.g., Docker Swarm) and for efficient data management (e.g., Scrapy combined with a remote dictionary server (Redis)) have been developed.

2.1 Distributed crawlers

Apache Nutch [2]: Apache Hadoop was developed to support the distributed processing performed by Apache Nutch. This framework facilitates multi-machine processing and large-volume data processing through the implementation of MapReduce and a distributed file system. A previous study [5] that compared the performance of a Hadoop-based distributed crawler with four nodes against that of a single-node Heritrix crawler reported that both crawlers exhibited similar crawling performance in the initial stage; however, the performance difference between them more than doubled after tens of minutes. Independent network crawlers are inappropriate for large-volume data collection because of network bandwidth, central processing unit (CPU) resources, memory capacity, and other factors.
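To make the multi-node configurations compared in these studies concrete, the following minimal sketch shows one way a fixed number of crawling nodes could be started as Docker containers using the Docker SDK for Python. It is an illustrative sketch only, not the implementation proposed in this paper; the image name, container command, and target URL are hypothetical placeholders.

    # Start a fixed number of crawler containers with the Docker SDK for Python.
    import docker

    def allocate_crawler_nodes(target_url, node_count):
        client = docker.from_env()
        containers = []
        for i in range(node_count):
            # Each container acts as one crawling node for the given website.
            containers.append(
                client.containers.run(
                    "crawler-worker:latest",                     # hypothetical crawler image
                    command=["python", "crawl.py", target_url],  # hypothetical entry script
                    name=f"crawler-node-{i}",
                    detach=True,
                )
            )
        return containers

    # Example: assign three crawling nodes to one target website.
    nodes = allocate_crawler_nodes("https://example.com/board", 3)

Varying node_count in such a setup corresponds to adjusting the number of crawling nodes discussed above; it is also the quantity that the proposed stepwise allocation determines automatically for each website.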
In a previous study [8], a distributed crawler's performance was improved by adjusting the number of nodes in an Apache Hadoop-based setup. Furthermore, a previous study [9] showed that the operating time of a web crawler decreased as the scale of the distributed environment increased and that a distributed system was appropriate for large-volume data collection.

Scrapy [3]: The Scrapy framework is used to download both the contents and the hypertext markup language of webpages and to collect website data according to specified static URLs (uniform resource locators). Furthermore, it can effectively extract hidden data [10] and compose crawlers [11]. Moreover, it can be widely applied in various fields, such as data mining and information processing, through website crawling and structured data extraction. Recently, crawling technologies based on Scrapy have been used primarily instead of Hadoop-based Apache Nutch. A previous study [12] that compared a single-node crawler with a Scrapy-based distributed crawler indicated that the distributed crawler's performance was approximately 90% higher than that of the single-node crawler. Additionally, an example of using the Scrapy framework to extract data from websites is demonstrated in [13]. In another previous study [14], a distributed crawling function was added to Scrapy by applying a remote dictionary server (Redis).

Crawler4j [1]: Crawler4j is a Java-based open-source web crawler that provides a simple web exploration interface. It evaluates the URLs it receives against the robots.txt file of the corresponding hosts and reduces the possibility of users becoming trapped on the web by allowing them to set a search depth. Its capacity can be expanded by adding threads; that is, it can be operated as a multithreaded web crawler whose performance and efficiency increase as threads are added.

In addition to the above open-source crawlers, various distributed crawling techniques have been studied. A distributed template-customized vertical crawler [15] was suggested for crawling Internet forums. The authors of [16] presented a distributed focused crawler by implementing a crawling scheduler, site ordering to determine the URL queue, and a naïve Bayes-based focused crawler. SIMHAR [17] employed Redis [27] to implement a distributed web crawler for the hidden web using a hybrid technique based on Simhash and a Redis server. In addition, [18] introduced a Docker [19]-based distributed web crawling system. Most of the proposed distributed crawlers showed