Journal of JAITC, Vol. 10, No. 2, pp. 15-31, Dec. 31, 2020, http://dx.doi.org/10.14801/JAITC.2020.10.2.15

Implementation of Efficient Distributed Crawler through Stepwise Crawling Node Allocation

Hyuntae Kim1, Junhyung Byun2, Yoseph Na3, and Yuchul Jung4*

1,4 Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology, Gumi, Korea
2,3 Undergraduate Student, Department of Computer Engineering, Kumoh National Institute of Technology, Gumi, Korea
[email protected], ORCID: https://orcid.org/0000-0002-9803-8642
[email protected], ORCID: https://orcid.org/0000-0002-6543-805X
[email protected], ORCID: https://orcid.org/0000-0002-5360-7418
[email protected], ORCID: https://orcid.org/0000-0002-8871-1979 (*Corresponding Author)

Abstract: The increased use of the Internet has led to the creation of numerous websites, and the number of documents distributed through them has grown proportionally. However, it is not easy to collect newly updated documents rapidly. Web crawling methods have been used to continuously collect and manage new documents, but existing crawling systems based on a single node offer limited performance, and crawlers that use distribution methods still face the problem of managing crawling nodes effectively. This study proposes an efficient distributed crawler through stepwise crawling node allocation, which identifies each website's properties and establishes crawling policies based on them to collect a large number of documents from multiple websites. The proposed crawler estimates the number of documents on a website, compares the data collection time and the amount of data collected for different numbers of nodes allocated to that website over repeated visits, and automatically allocates the optimal number of nodes to each website for crawling. An experiment was conducted in which the proposed method and a single-node method were applied to 12 different websites. The results indicate that the proposed crawler's data collection time decreased significantly compared with that of a single-node crawler because the proposed crawler applied data collection policies tailored to each website. In addition, the work rate of the proposed model was confirmed to increase.

Keywords: web crawling, docker swarm, virtual nodes, documents, scrapy, efficiency

Received: Dec. 04, 2020 Revised: Dec. 25, 2020 Accepted: Dec. 28, 2020 pISSN 2234-1072/eISSN 2234-0963 Copyright ⓒ KIIT


1. Introduction

The development of the Internet has provided an environment in which an increasing number of websites can provide information, generate various types of data, and distribute those data. In particular, large volumes of documents have been collected, analyzed, and processed, as shown by open-source frameworks developed for document crawling such as Crawler4j [1], Apache Nutch [2], and Scrapy [3]. Simple web crawlers [4] can be operated on a single virtual node, machine, or process; however, relying on a single node prevents swift data collection. Owing to the limited performance and speed of single-node crawlers, distributed crawlers have recently been used for rapid web document collection [5]. Scrapinghub [6], based on the Scrapy framework, is a representative distributed crawler. Although it performs distributed crawling using the Scrapy framework and virtual nodes, its memory capacity per virtual node is limited, and node allocation per website is restricted. Moreover, several crawling systems do not consider bandwidth allocation for systems with time-related restrictions, as mentioned in a previous study [7].

This study proposes an efficient distributed crawler through stepwise crawling node allocation (EDC-SCNA), which can efficiently operate multiple crawling nodes. The proposed crawler's main function is to determine the number of crawling nodes required for data collection by controlling the number of Docker containers (Docker-based virtual nodes) while considering the number of documents stored on target websites and the website access environments. To verify the efficiency of the implemented EDC-SCNA system, 12 websites were selected and categorized into three classes based on the number of documents stored. An experiment was then conducted to compare the proposed system with Crawler4j [1], a Java-based open-source web crawler, in terms of data collection time and the amount of data collected. The experimental results indicate that the proposed model reduced data collection time by approximately 10% or more and collected approximately 12% more documents than Crawler4j. This result was obtained because the proposed model performed more efficient crawling by calculating the optimal number of virtual nodes and allocating additional resources accordingly.

This paper is organized as follows. Chapter 2 analyzes current research trends regarding web crawlers. Chapter 3 describes the structure and elements of the proposed distributed crawler system. Chapter 4 presents the experimental evaluation of the proposed model. Chapter 5 discusses the strengths, weaknesses, and limitations of the proposed model. Chapter 6 presents the conclusions.

2. Related works

Existing distributed crawlers include Apache Nutch [2], Scrapy [3], and Crawler4j [1]. Distributed crawling systems have recently been implemented based on Docker, which is used for container virtualization. In addition, frameworks for efficient Docker management (e.g., Docker Swarm) and for efficient data management (e.g., Scrapy-Redis, where Redis is a remote dictionary server) have been developed.

2.1 Distributed crawlers

Apache Nutch [2]: Apache Hadoop was developed to support the distributed processing performed by Apache Nutch. This framework facilitates multimachine processing and large-volume data processing through MapReduce and a distributed file system. A previous study [5] that compared a Hadoop-based distributed crawler using four nodes with a crawler using a single node reported that both crawlers exhibited similar crawling performance in the initial stage; however, the performance difference between them more than doubled after tens of minutes. Independent network crawlers are inappropriate for large-volume data collection owing to network bandwidth, central processing unit (CPU) resources, memory capacity, and other factors. In another study [8], a distributed crawler's performance was improved by applying Apache Hadoop and adjusting the number of nodes. Furthermore, a previous study [9] showed that the operating time of a web crawler decreased as the scale of the distributed environment increased and that a distributed system was appropriate for large-volume data collection.

Scrapy [3]: The Scrapy framework is used to download both the contents and the hypertext markup language of webpages and to collect website data according to specified static URLs (uniform resource locators). It can effectively extract hidden data [10] and compose crawlers [11]. Moreover, it is widely applied in fields such as data mining and information processing through website crawling and structured data extraction. Recently, crawling technologies based on Scrapy, rather than on Hadoop-based Apache Nutch, have been used primarily. A previous study [12] that compared a single-node crawler with a Scrapy-based distributed crawler indicated that the distributed crawler's performance was approximately 90% higher. An example of using the Scrapy framework to extract data from websites is demonstrated in [13]. In another study [14], a distributed crawling function was added to Scrapy by applying a remote dictionary server (Redis).

Crawler4j [1]: Crawler4j is a Java-based open-source web crawler that provides a simple web exploration interface. It evaluates delivered URLs based on the robots.txt file of the corresponding hosts and minimizes the possibility of users being locked on the web by allowing them to set a search depth. Its capacity can be expanded by adding threads; specifically, it can be operated as a multithreaded web crawler to increase its performance and efficiency.

In addition to the above open-source crawlers, various distributed crawling techniques have been studied. A distributed template-customized vertical crawler [15] was suggested for crawling Internet forums. The authors of [16] presented a distributed focused crawler by implementing a crawling scheduler, site ordering to determine the URL queue, and a naïve Bayes-based focused crawler. SIMHAR [17] employed Redis [27] to implement a distributed web crawler for the hidden web using a hybrid technique based on Simhash and a Redis server. In addition, [18] introduced a Docker [19]-based system. Most of the proposed distributed crawlers showed better efficiency than a single crawler.
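For reference, the following is a minimal sketch of a Scrapy spider of the kind such frameworks execute on a crawling node; the start URL, CSS selectors, and class name are illustrative assumptions rather than the crawler used in this study.

```python
# Minimal Scrapy spider sketch (not the paper's code): it starts from a static
# URL, yields one item per listed document, and follows pagination links.
import scrapy


class DocumentSpider(scrapy.Spider):
    name = "document_spider"
    start_urls = ["https://example.org/publications"]  # placeholder target site

    def parse(self, response):
        # Extract title/URL pairs from each listed document (selectors are hypothetical).
        for row in response.css("div.publication"):
            yield {
                "title": row.css("a::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get() or ""),
            }
        # Follow the "next page" link, if any, to keep crawling.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```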



2.2 Docker technology for virtualization

Virtualization is required to establish a distributed crawler system, and Docker [19] has been actively applied for this purpose. Docker is an open-source virtualization solution that operates applications in isolation based on Linux container (LXC) technology, without a kernel-based virtual machine (KVM). A plain LXC provides only an isolated space and lacks the additional functions required for development and server operation; hence, the Docker Engine is applied on top of the LXC. Docker, based on the Linux kernel, provides functions for container creation and management, among others. A previous study [20] reported that a Docker-based container outperforms a general virtual machine in terms of lightness and operating speed. Meanwhile, the performance of KVM-based virtualization decreases significantly as the number of virtual machines increases, whereas Docker shows more efficient virtualization performance under the same conditions. For example, a previous experiment [21] that compared the performance of a hypervisor-based KVM applying full virtualization with that of Docker reported that Docker was five to nine times more efficient than the KVM in terms of CPU and memory usage rates.
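As an illustration of the container-based isolation described above, the following sketch starts a short-lived container through the Docker SDK for Python; the image and command are placeholders and are not part of the proposed system.

```python
# Sketch using the Docker SDK for Python (docker-py) to run an isolated,
# lightweight container on the local Docker Engine.
import docker

client = docker.from_env()  # connects to the local Docker Engine

# Run a throwaway container and capture its stdout; remove it when finished.
output = client.containers.run(
    "python:3.9-slim",                                  # placeholder image
    ["python", "-c", "print('hello from a container')"],
    remove=True,
)
print(output.decode().strip())
```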

2.3 Docker container management based on Docker Swarm

Docker Swarm [22]: Docker Swarm is a container orchestration tool that combines several Docker hosts into a single virtual Docker host. The previously separate Docker Swarm engine has been integrated into the Docker Engine. Docker Swarm nodes are primarily classified into manager nodes, used for task scheduling, and worker nodes, used to execute the allocated crawling tasks. New containers are allocated by the manager nodes' schedulers to the worker nodes that are performing fewer crawling tasks. A previous study [23] that analyzed the efficiency of Docker applications indicated that the use of Docker Swarm reduced the request failure rate more effectively and that identifying the optimal number of nodes was critical for optimizing crawling performance.
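The following sketch illustrates how a replicated service of crawler containers could be created and rescaled from a Swarm manager using the Docker SDK for Python; the image name, service name, and replica counts are assumptions for illustration only.

```python
# Sketch: scaling crawler containers as a replicated Docker Swarm service
# from a manager node, using the Docker SDK for Python.
import docker

client = docker.from_env()  # must be run against a Swarm manager node

service = client.services.create(
    "crawler-image:latest",                              # hypothetical crawler image
    name="distributed-crawler",                          # hypothetical service name
    mode=docker.types.ServiceMode("replicated", replicas=8),
)

# The number of crawler containers can later be adjusted by rescaling the service.
service.scale(16)
```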

2.4 Databases for distributed crawlers

Databases or search engines are required to store and manage crawled data. Because most data on the web are unstructured, NoSQL databases have been widely used for their more convenient functions for storing and managing such data. A previous study [24] compared and analyzed five NoSQL databases (i.e., MongoDB, Couchbase, HBase, Cassandra, and Redis) in distributed environments and discovered that Redis, which operates in memory, achieved the fastest data collection speed. However, Redis occupied a significant amount of memory space and failed to fulfill load-related requests as the size of its records increased, owing to its in-memory database characteristics.


Experiments have been conducted previously [24, 25] on a distributed crawler applying Scrapy-Redis technology and verified that this crawler offered advantages such as an increased amount of data collected, reduced data collection time, and increased storage system efficiency. Additionally, a method of using Redis as a cache has been proposed.
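For reference, the settings below show how the scrapy-redis project [28] documents turning a Scrapy project into a Redis-backed distributed crawler; the Redis URL is a placeholder, and these are not necessarily the settings used in the studies cited above.

```python
# settings.py (sketch): Redis-backed scheduling and duplicate filtering
# as documented by the scrapy-redis project.

# Use the Redis-backed scheduler and duplicate filter instead of Scrapy's defaults.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the shared request queue in Redis between runs (optional).
SCHEDULER_PERSIST = True

# Location of the shared Redis server (placeholder).
REDIS_URL = "redis://localhost:6379"
```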

3. The Proposed Scheme

Existing web crawlers applying a single virtual node exhibit limited performance in terms of time, and administrators must periodically manage the crawling tasks of multi-node crawlers to achieve regular crawling. To efficiently solve these problems of existing crawlers, a virtualization-based EDC-SCNA is proposed herein.

3.1 System structure

Distributed web crawlers can collect a large number of documents more rapidly than single-node web crawlers. Moreover, distributed environments can be established more efficiently by applying LXC-based technology rather than full virtualization technology. Hence, this study proposes a Scrapy-based distributed crawler architecture that virtualizes Scrapy nodes in Docker. The proposed model includes logic for determining the optimal number of nodes for data collection on each target website. Figure 1 shows the system structure of the EDC-SCNA model.

Figure 1. System architecture of EDC-SCNA

As shown in Figure 1, Docker Swarm [23] is used to form a cluster and to efficiently monitor, log, and schedule virtual crawling nodes in the form of Docker containers. Under the assumption of 10 operating machines, the cluster includes one master node, one Reachable (Redis) node, and eight worker nodes; the proposed model can increase the number of worker nodes when required.
1) Master node: It manages the Reachable (Redis) node and k-2 worker nodes, where k refers to the number of machines. It also provides a visualizer function.
2) Reachable (Redis) node: It performs the functions of the master node when the master node is terminated or cannot be used.



Reachable refers to the machine that executes the Redis database. Data transfer is used to store the filtered data as MySQL data for each host URL in the worker node where MySQL is operated. Redis [27], a NoSQL open-source database based on memory instead of disk, is a highly optimized key-value data store for simple search and addition tasks. It handles millions of requests per second for real-time applications and provides various data structures as values, instead of simple objects, in its key-value store.
3) Worker nodes: Crawler containers scheduled by the master node are executed for task processing on p-1 worker nodes, excluding the worker node running MySQL; here, p refers to the number of physical machines allocated as worker nodes. The proposed model can continually increase the number of crawler containers by increasing the number of worker machines. Because the necessary amount of resources is allocated to each container, the number of usable containers on a machine can be calculated from the CPU and memory usage rates of the target tasks. This experiment confirmed that more than 100 containers were executable on each machine with 64 GB of memory. Redis [27] is used to identify duplicate requests and items through caching and serves as a temporary storage space for requests [28].
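The following sketch illustrates the duplicate check described above using a shared Redis set; the host name, key name, and fingerprinting scheme are illustrative assumptions rather than the exact implementation of the proposed system.

```python
# Sketch of duplicate detection against the shared Redis on the Reachable node:
# each worker adds a fingerprint of a request/item to a Redis set, and SADD
# returns 0 when the fingerprint already exists, i.e., the document is a duplicate.
import hashlib

import redis

r = redis.Redis(host="reachable-redis", port=6379)  # placeholder host name


def is_duplicate(url: str) -> bool:
    """Return True if this URL was already crawled by any worker node."""
    fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # SADD returns 1 if the member is new, 0 if it already existed.
    return r.sadd("crawler:seen_urls", fingerprint) == 0
```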

3.2 Crawling considering work rate

The crawling method proposed herein comprises two processes: crawling and calculating the optimal number of virtual nodes for crawling. In this study, the work rate is defined as in (1). In an ideal environment, machine resources could be added without limit; however, because resources are limited in a practical environment, all worker nodes should perform data collection as efficiently as possible. Accordingly, the work rate per node, rather than the work rate based only on the number of nodes, is calculated as follows.

.       = (1)   ∗. 

3.2.1 Crawling process

Figure 2 shows a flow chart of the crawling process under the assumption that 10 physical machines are applied. When a site URL is input, Machine 1 creates a Docker service for crawling. After the Docker service is created, Machine 2 caches the information included in the databases and transfers it to Redis. The worker nodes on Machines 3 to 10 collect data through crawling. Machine 2 performs duplication verification on the collected data and then stores the data in the database of Machine 3. Finally, Machine 1 calculates the work rate based on the statistics of the data collected and the time required for collection; subsequently, it determines the number of nodes to be used in the subsequent crawling process. Figure 3 shows these processes in detail.


Figure 2. Crawling process

Figure 3. Node rate architecture



3.2.2 Architecture for calculating optimal number of nodes

Figure 3 shows the process of deriving the optimal number of nodes by calculating the work rate. The number of nodes is increased in multiples of two, and the resulting work rates are compared to identify the optimal number of nodes. Machine 1 calculates the work rate from the databases obtained through crawling, the time required for data collection on the target website, and the amount of data collected. It analyzes the difference between the previous and current work rates and increases or decreases the number of nodes for the corresponding website based on this difference. Finally, Machine 2 stores the determined number of nodes and the calculated work rate in its database.
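The following sketch outlines one plausible reading of this stepwise allocation logic: the node count is doubled, the work rate is measured for each count, and the count preceding the largest work-rate gain is selected (consistent with the Adlittle example in Section 4.3). The measure_work_rate callback is a placeholder for an actual crawl run.

```python
# Sketch of stepwise crawling node allocation (illustrative, not the authors' code):
# double the node count, measure the work rate per Eq. (1) at each count, and
# select the node count just before the largest jump in work rate.
from typing import Callable, List


def choose_optimal_nodes(measure_work_rate: Callable[[int], float],
                         max_nodes: int = 64) -> int:
    counts: List[int] = []
    rates: List[float] = []
    n = 1
    while n <= max_nodes:
        counts.append(n)
        rates.append(measure_work_rate(n))  # run a crawl with n nodes, compute Eq. (1)
        n *= 2

    # Differences between consecutive node counts, as in Table 4 (e.g., Node(8-4)).
    diffs = [rates[i + 1] - rates[i] for i in range(len(rates) - 1)]

    # Pick the smaller node count of the pair with the largest work-rate gain
    # (e.g., Adlittle: the 8-4 difference is largest, so 4 nodes are chosen).
    best = max(range(len(diffs)), key=lambda i: diffs[i])
    return counts[best]
```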

4. Experimental Results

4.1 Site List

Table 1 shows the list of websites selected as experimental targets in this study. Each website holds 3,000 to 150,000 policy-related documents. The websites are operated in various countries and regions, including Germany, Japan, Europe, France, Canada, and the United States.

Table 1. Crawling sites

Site            Documents   Country                    Topic
bicc            4,420       Germany                    Social issue policy
mri             4,980       Japan                      Industrial solutions, information and communication
glocom          5,740       Japan                      Academic policy
martenscentre   7,800       Europe                     Social phenomenon
cigionline      39,100      Canada                     Policy discussion
jil             42,200      Japan                      Policy research
undp            42,900      UN                         Sub-organization of the UN General Assembly
adlittle        43,230      France                     Oil and gas policy
rand            92,250      United States of America   Military issue research
mercatus        103,160     United States of America   Research focused on the US non-profit free market
bearingpoint    105,137     Netherlands                Technical consulting field
mpg             112,230     Germany                    Association of independent non-profit research institutions in Germany


The twelve websites were classified into three classes based on the number of documents possessed, and the experiments were performed per class. Websites in the first class possessed 10,000 documents or fewer; those in the second class possessed more than 10,000 and fewer than 50,000; and those in the third class possessed more than 50,000 and fewer than 150,000.
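A small helper reflecting this classification rule is sketched below (boundary handling at exactly 50,000 documents is an assumption, as the text does not specify it); the example document counts are taken from Table 1.

```python
# Sketch of the three-class assignment by document count.
def site_class(num_documents: int) -> int:
    if num_documents <= 10_000:
        return 1   # Class 1: 10,000 documents or fewer
    if num_documents <= 50_000:
        return 2   # Class 2: more than 10,000 and up to 50,000
    return 3       # Class 3: more than 50,000 (up to 150,000 in this experiment)


print(site_class(4_420))    # bicc         -> 1
print(site_class(43_230))   # adlittle     -> 2
print(site_class(105_137))  # bearingpoint -> 3
```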

4.2 Performance comparison

Existing distributed crawling frameworks such as Crawler4j, Scrapy, and Apache Nutch were considered for the experiments performed in this study. Apache Nutch was excluded owing to difficulties in installation and configuration as well as several execution errors. Because Crawler4j provided more convenient installation and configuration functions in addition to excellent accessibility, it was selected as the comparison target and compared with the proposed EDC-SCNA through various experiments. Table 2 summarizes the hardware and software specifications used in these experiments, including version information. Ten machines were prepared, and four 16 GB memory modules were installed in each machine for a 64 GB memory capacity. In addition, Ubuntu Linux version 18.04, Redis version 6.0, Crawler4j version 4.4.0, and Scrapy version 2.3.0 were used.

Table 2. Hardware and software specifications

HW    Specification                                Num.
CPU   Intel 9th Generation i7-9700K CoffeeLake-R   10
MEM   Samsung Electronics DDR4 PC4-21300 16GB      40

SW        Software Title   Ver.
OS        Ubuntu LTS       18.04
Database  Redis            6.0
Database  MySQL            14.14
Crawler   Crawler4j        4.4.0
Crawler   Scrapy           2.3.0

A representative website was selected from each class, and experiments were conducted on the selected websites to compare the performance of the EDC-SCNA with that of Crawler4j. Figure 4 shows the experimental results: Figures 4-(A), 4-(B), and 4-(C) present the results for Class 1 (bicc), Class 2 (adlittle), and Class 3 (bearingpoint), respectively. In Figure 4, the left Y-axis refers to the amount of data collected, whereas the right Y-axis refers to the time required for data collection; the X-axis refers to the number of virtual nodes used for crawling. A bar graph depicting the number of documents collected and a line graph depicting the time required for document collection are


presented. The blue bar graph and black line graph represent the results of Crawler4j. The orange bar graph and red line graph represent the results of the proposed EDC-SCNA.

(A) Class 1 performance comparison

(B) Class 2 performance comparison

(C) Class 3 performance comparison

Figure 4. Each class performance comparison

The graphs show that the crawling time of both models decreased as the number of nodes increased; however, the proposed model exhibited better performance in terms of both the number of documents collected and the time required to collect them. Moreover, the difference in data collection time between the proposed and existing models was analyzed per class, and the difference was greatest for Class 3. In other words, the EDC-SCNA outperformed Crawler4j in data collection performance and time most clearly on the websites that possessed the largest number of documents.

4.3 Performance change based on optimal number of nodes

The work rate of the proposed crawler increased with the number of nodes; however, the increase in the work rate was not uniform. Hence, the optimal number of nodes was selected in the section in which the work rate increased significantly with the number of nodes and then increased only slightly. The experimental results obtained per class (based on the number of documents possessed by the websites) were as follows. (1) For Class 1, the optimal number of nodes was primarily two, and the crawling speed increased by approximately 40%. (2) For Class 2, the optimal number of nodes was between four and eight, and the crawling speed increased by 60% to 70%. (3) For Class 3, the optimal number of nodes varied, and the crawling speed increased by 50% to 60%, except for Mercatus, where it increased by 90% or more. Table 3 compares the crawling time reduction rates obtained by applying the optimal number of nodes measured via the EDC-SCNA with the crawling time of a single node from the existing model. On the Mercatus website, the optimal number of 32 nodes was applied for crawling; compared with the crawling time with a single node, the reduction rate was 95%, which was the highest.

Table 3. Performance comparison based on work rate per node

Site            Optimal Node Work Rate   Optimal Node Num.   Crawling Time (N=1)   Crawling Time (Optimal Node Num.)   Time Reduction Rate
bicc            0.755                    2                   1:06:04               0:42:35                             35%
mri             1.277                    2                   1:04:31               0:32:23                             49%
glocom          1.103                    2                   1:07:26               0:43:19                             36%
martenscentre   1.697                    4                   0:53:42               0:18:58                             64%
cigionline      1.282                    4                   6:15:48               2:07:04                             66%
jil             1.034                    8                   7:00:49               1:26:52                             79%
undp            0.890                    8                   7:05:41               1:40:21                             76%
adlittle        2.688                    4                   3:30:52               1:06:50                             68%
rand            0.148                    4                   99:38:03              43:12:54                            56%
mercatus        0.206                    32                  98:21:37              4:20:11                             95%
bearingpoint    1.514                    4                   13:22:39              3:49:11                             63%
mpg             0.310                    1                   100:21:41             100:21:41                           0%


Table 4 indicates the differences between the work rates obtained with consecutive numbers of input nodes, which are used to determine the optimal number of nodes for each website. For example, for Adlittle, the difference between the work rate with eight virtual nodes and that with four virtual nodes was 0.689, the greatest difference observed; hence, the optimal number of nodes for Adlittle was four.

Table 4. Performance comparison by varying the number of crawling nodes

Site            Node(2-1)   Node(4-2)   Node(8-4)   Node(16-8)   Node(32-16)   Node(64-32)
bicc            0.218       0.244       0.152       0.126        0.108         0.047
mri             0.004       0.355       0.268       0.258        0.153         0.117
glocom          0.314       0.476       0.168       0.129        0.112         0.095
martenscentre   0.130       0.569       0.666       0.482        0.232         0.112
cigionline      0.090       0.361       0.396       0.362        0.247         0.121
jil             0.164       0.240       0.268       0.428        0.283         0.143
undp            0.197       0.288       0.302       0.354        0.191         0.172
adlittle        0.076       0.643       0.689       0.558        0.325         0.313
rand            0.044       0.064       0.036       0.029        0.015         0.008
mercatus        0.008       0.012       0.029       0.014        0.019         0.039
bearingpoint    0.314       0.354       0.356       0.375        0.257         0.209
mpg             0.099       0.062       0.019       0.011        0.028         0.009
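As a worked example of reading Table 4, the snippet below reproduces the Adlittle case described above: the consecutive work-rate differences peak at Node(8-4), so four nodes are selected, matching Table 3. It mirrors the interpretation stated in the text and is illustrative rather than the authors' exact selection code.

```python
# Worked example: the consecutive work-rate differences for Adlittle (from Table 4).
adlittle_diffs = {
    "2-1": 0.076, "4-2": 0.643, "8-4": 0.689,
    "16-8": 0.558, "32-16": 0.325, "64-32": 0.313,
}

best_pair = max(adlittle_diffs, key=adlittle_diffs.get)   # pair with the largest gain
optimal_nodes = int(best_pair.split("-")[1])              # smaller node count of that pair
print(best_pair, optimal_nodes)  # "8-4", 4
```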

The graphs in Figure 5 show the minimum, maximum, and mean values of differences in work rates based on the number of nodes input to each class. These graphs reflect the characteristics of the classes.

Figure 5. Node rate min, max, average graph

As shown in these graphs, the range of differences between adjacent node counts was narrow for Class 3, which included the greatest amount of data. This result was obtained because, when the amount of data to be crawled was largest, data collection through crawling proceeded more rapidly than duplication verification; owing to the faster data collection, the number of documents to be verified for duplication increased, and consequently the work rate increased only insignificantly with the number of nodes. The time required for duplication verification was included in the collection time; however, the effect of the duplication verification time was not considered separately in this experiment.

5. Discussion

5.1 Restrictions on experimental conditions

The number of nodes in the experiment cannot be increased without limit because a host might be overloaded when numerous nodes are allocated. In this experiment, target websites were classified according to the number of documents stored in them; however, factors such as the corresponding host's response time and website size should also be considered.

5.2 Discussions regarding experimental results

It was found that the average crawling time decreased as the number of Docker-based virtual nodes used for crawling increased. However, allocating excessive virtual crawling nodes to a certain website did not always decrease the crawling time or improve the crawling performance. This emphasizes the importance of identifying the appropriate number of nodes, because server capacity differs between hosts and the resources provided for those servers cannot always be the same. In particular, allocating too many nodes to a website can overload the website's server and cause the website to block the nodes. Hence, the proposed crawler should allocate an appropriate number of nodes to each website and comply with the crawling policies stated in the robots.txt file, for example, through settings such as those sketched below. The proposed model must also perform crawling again from the initial stage to obtain documents that were not collected in the previous crawling process, which increases the number of crawls performed for parameter identification.
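As a sketch of such politeness constraints in a Scrapy-based node, the settings below honor robots.txt and limit the per-host request load; the specific values are illustrative assumptions, not those used in the experiments.

```python
# settings.py (sketch): politeness settings for a Scrapy-based crawling node.
ROBOTSTXT_OBEY = True               # honor the site's robots.txt crawling policy
DOWNLOAD_DELAY = 1.0                # wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap parallel requests per target host
AUTOTHROTTLE_ENABLED = True         # back off automatically when the server slows down
```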

6. Conclusions

A parallel crawling system that improves on the limited crawling performance of existing single-node and distributed systems is proposed herein. The experimental results indicate that a typical single-node system requires a significant amount of time for crawling, including data collection and storage, whereas the proposed parallel crawling system using the optimal number of nodes demonstrated crawling performance better by up to 57% compared with the existing crawling system using a single node.



The proposed model can easily manage nodes in the cluster environment established with Docker Swarm, and its data collection time decreases as nodes are added. However, its performance does not increase in proportion to the number of crawling nodes added, because it is affected by the target host server's performance and the Internet speed. Moreover, the proposed system requires a high memory usage rate because it applies Redis, an in-memory database, for duplicate data removal; consequently, it becomes unstable when a significant amount of data is collected. Further studies on automatic crawling-cycle identification and advanced content-repetition confirmation should be conducted to facilitate a more stable operation of the EDC-SCNA. The proposed crawler could select the optimal crawling cycle by automatically identifying the inflow of new content on the data collection target websites; this optimal crawling-cycle selection is crucial for maximizing the use of crawling resources. Furthermore, in-memory databases such as the currently used Redis are advantageous in terms of speed despite requiring significant memory resources. Hence, further studies should be performed to improve the proposed model's efficiency by replacing the in-memory database with a distributed indexing process similar to those of search engines.

Acknowledgments

This research was supported by Kumoh National Institute of Technology (202001890001).

References

[1] Crawler4j Project. [Online]. Available: https://github.com/yasserg/crawler4j [Accessed: Dec. 28, 2020]
[2] Apache Nutch Project. [Online]. Available: https://cwiki.apache.org/confluence/display/nutch/#Nutch_2.x [Accessed: Dec. 28, 2020]
[3] Scrapy Project, "Scrapy 1.5 documentation". [Online]. Available: https://docs.scrapy.org/en/latest/ [Accessed: Dec. 28, 2020]
[4] Yu, Linxuan, Yeli Li, Qingtao Zeng, Yanxiong Sun, Yuning Bian, and Wei He. "Summary of Web Crawler Technology Research", Journal of Physics: Conference Series, Vol. 1449, No. 1, pp. 22-24, Feb. 2020. [Online]. Available: https://iopscience.iop.org/article/10.1088/1742-6596/1449/1/012036/meta
[5] Shi, Yuliang, and Ti Zhang. "Design and Implementation of a Scalable Distributed Web Crawler Based on Hadoop", In Proceedings of the IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 537-541, Oct. 2017. [Online]. Available: https://ieeexplore.ieee.org/document/8078691
[6] ScrapingHub Project. [Online]. Available: https://www.scrapinghub.com/ [Accessed: Dec. 28, 2020]
[7] Zhu, Weiping, Yaodong Li, Shu Li, Yi Xu, and Xiaohui Cui. "Optimal Bandwidth Allocation for Web Crawler Systems with Time Constraints", Journal of Ambient Intelligence and Humanized Computing, pp. 1146-1153, Apr. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9060415
[8] Su, Linping, and Fengxiao Wang. "Web Crawler Model of Fetching Data Speedily Based on Hadoop Distributed System", In Proceedings of the IEEE International Conference on Software Engineering and Service Sciences (ICSESS), pp. 927-931, Mar. 2017. [Online]. Available: https://ieeexplore.ieee.org/document/7883217


[9] Manike, Chiranjeevi, Ashok Kumar Nanda, and Tejashwini Gajulagudem. "Hadoop Scalability and Performance Testing in Homogeneous Clusters", Lecture Notes in Electrical Engineering, Vol. 605, pp. 907-917, Sept. 2019. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-030-30577-2_81
[10] Thomas, David Mathew, and Sandeep Mathur. "Data Analysis by Web Scraping Using Python", In Proceedings of the 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 450-454, Sept. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8822022
[11] Farooq, Bassam, Mohd Shahid Husain, and Mohammad Suaib. "Crawling of Japanese Real-Estate Websites Using Scrapy", International Journal of Advanced Research in Computer Science, Vol. 9, pp. 64-67, Apr. 2018. [Online]. Available: http://www.ijarcs.info/index.php/Ijarcs/article/view/6139
[12] Yin, Fulian, Xiating He, and Zhixin Liu. "Research on Scrapy-Based Distributed Crawler System for Crawling Semi-Structure Information at High Speed", In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 1356-1359, Aug. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8781062
[13] Nisafani, Amna Shifia, Rully Agus Hendrawan, and Arif Wibisono. "Eliciting Data from Website Using Scrapy: An Example", In Seminar Nasional Teknologi Informasi Dan Multimedia (SEMNASTEKNOMEDIA), pp. 1-8, Feb. 2017.
[14] Wang, Jiancai, and Jianting Shi. "The Crawl and Analysis of Recruitment Data Based on the Distributed Crawler", Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, Vol. 333, pp. 162-168, Nov. 2020. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-030-62483-5_18
[15] Zhou, Bing, Bo Xiao, Zhiqing Lin, and Chuang Zhang. "A Distributed Vertical Crawler Using Crawling-Period Based Strategy", In Proceedings of the 2010 2nd International Conference on Future Computer and Communication (ICFCC), Vol. 1, pp. 306-311, Jun. 2010. [Online]. Available: https://ieeexplore.ieee.org/document/5497780
[16] Gunawan, Dani, Amalia Amalia, and Atras Najwan. "Improving Data Collection on Article Clustering by Using Distributed Focused Crawler", Data Science: Journal of Computing and Applied Informatics, Vol. 1, pp. 1-12, Jul. 2010. [Online]. Available: https://talenta.usu.ac.id/index.php/JoCAI/article/view/82
[17] Kaur, Sawroop, and G. Geetha. "SIMHAR - Smart Distributed Web Crawler for the Hidden Web Using SIM+Hash and Redis Server", IEEE Access, Vol. 8, pp. 117582-117592, Jun. 2020. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9123854
[18] Ye, Feng, Zongfei Jing, Qian Huang, and Yong Chen. "The Research of a Lightweight Distributed Crawling System", In Proceedings of the 2018 IEEE/ACIS 16th International Conference on Software Engineering Research, Management and Application (SERA), pp. 200-204, Jun. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8477212
[19] Docker, "Docker documentation". [Online]. Available: https://docs.docker.com/ [Accessed: Dec. 28, 2020]
[20] Sharma, Vivek, Harsh Kumar Saxena, and Akhilesh Kumar Singh. "Docker for Multi-Containers Web Application", In Proceedings of the 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 589-592, Apr. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9074925
[21] Hwang, Jung-Yeon, and Ho-Yong Ryu. "Performance Comparison and Forecast Analysis between KVM and Docker", The Journal of Korean Institute of Information Technology, Vol. 13, No. 11, pp. 127-136, Nov. 2015. [Online]. Available: https://www.kci.go.kr/kciportal/ci/sereArticleSearch/artiPreView.kci?sereArticleSearchBean.artiId=ART002048947



[22] Sumologic, "Docker Swarm". [Online]. Available: https://www.sumologic.com/glossary/docker-swarm/ [Accessed: Dec. 28, 2020]
[23] Juwita, Oktalia, and Diksy Firmansyah. "Cloud Computing Implementation with Docker Engine Swarm Mode for Data Availability Infrastructure of Rice Plants", International Journal of Information System & Technology (IJISTECH), Vol. 1, No. 2, pp. 1-24, 2018. [Online]. Available: http://ijistech.org/ijistech/index.php/ijistech/article/view/10
[24] Matallah, Houcine, Ghalem Belalem, and Karim Bouamrane. "Evaluation of NoSQL Databases", International Journal of Software Science and Computational Intelligence, Vol. 12, No. 4, pp. 71-91, Dec. 2020. [Online]. Available: https://www.igi-global.com/gateway/article/262589
[25] Ying, Zhe Yu, Feng Li Zhang, and Qing Yu Fan. "Consistent Hashing Algorithm Based on Slice in Improving Scrapy-Redis Distributed Crawler Efficiency", In Proceedings of the 2018 IEEE International Conference on Computer and Communication Engineering Technology (CCET), pp. 334-340, Nov. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8542217
[26] Han, Xiaowei, and Likun Zheng. "Design and Implementation of Firmware Data Acquisition System Based on Scrapy Framework", In Proceedings of the 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), pp. 168-174, Sept. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9202251
[27] Redis, "Redis documentation". [Online]. Available: https://redis.io/documentation [Accessed: Dec. 28, 2020]
[28] Rolando Max Espinoza, "Scrapy-redis documentation". [Online]. Available: https://scrapyredis.readthedocs.org [Accessed: Dec. 28, 2020]

Authors

Hyuntae Kim

He is currently an undergraduate student of the Department of Computer Engineering at Kumoh National Institute of Technology. His research interests include Big Data, Document Object Detection, and Deep Learning.

Junhyung Byun

He is currently an undergraduate student of the Department of Computer Engineering at Kumoh National Institute of Technology. His research interests include Open-source software, Big-Data, and Deep Learning.


Yoseph Na

He is currently an undergraduate student of the Department of Computer Engineering at Kumoh National Institute of Technology. His research interests include the broader areas of Distributed and Parallel Computing, with an emphasis on storage systems and data services.

Yuchul Jung

He received his B.S. degree in Computer Science from Ajou University, South Korea, in 2001, and his M.S. in Information & Communication Engineering and Ph.D. in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 2005 and 2011, respectively. He joined the Department of Computer Engineering at Kumoh National Institute of Technology (KIT), Gumi, as an assistant professor in 2017. Before joining KIT, he worked as a senior researcher at the Korea Institute of Science and Technology Information (KISTI) ('13~'17) and the Electronics and Telecommunications Research Institute (ETRI) ('09~'13), Daejeon, South Korea. His research interests include AI, NLP, Speech Recognition, Data Science, and Medicine 2.0.
