High Performance Distributed Web-Scraper
Denis Eyzenakh, Anton Rameykov, Igor Nikiforov
Institute of Computer Science and Technology, Peter the Great St. Petersburg Polytechnic University, Saint Petersburg, Russian Federation

Abstract—Over the past decade, the Internet has become the largest and richest source of data. The data is used for the extraction of knowledge by performing machine learning analysis. In order to perform data mining of web information, the data should be extracted from the source and placed in analytical storage. This is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or HTML source code parsing. The article is devoted to an approach for high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are: load balancing, two levels of data storage, and separating the process of downloading files from the process of scraping. The approach is implemented in a solution with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.

Keywords—web-scraping, web-crawling, distributed data collection, distributed data analysis

I. INTRODUCTION

Due to the rapid development of the network, the World Wide Web has become a carrier of a large amount of information. The extraction and use of this information has become a huge challenge nowadays. Traditional access to the information through browsers like Chrome, Firefox, etc. can provide a comfortable user experience with web pages. Web sites hold a lot of information but sometimes offer no instruments to access it over an API and preserve it in analytical storage. The manual collection of data for further analysis can take a lot of time, and in the case of semi-structured or unstructured data types the collection and analysis of the data can become even more difficult and time-consuming. A person who collects data manually can make mistakes (duplication, typos in the text, etc.), since the process is error-prone.

Web-scraping is the technique which is focused on solving the issue of the manual data processing approach [1]. Web scraping is part of the ETL process and is broadly used in web indexing, web mining, web data integration and data mining. However, many existing solutions do not support parallel computing on multiple machines. This significantly reduces performance, limiting the system's ability to collect large amounts of data. A distributed approach makes it possible to create a horizontally scalable system whose performance can be increased according to the user's needs.

The article proposes an approach to organize distributed, horizontally scalable scraping and distributed data storage. Using an orchestration system greatly simplifies interaction with the system, and the support of automatic load balancing avoids overloading individual nodes.

II. EXISTING WEB SCRAPING TECHNIQUES

Typically, web scraping applications imitate a regular web user. They follow the links and search for the information they need. The classic web scraper can be classified into two types: web crawlers and data extractors (Fig. 1).

Fig. 1 Web-scraper structure

A web crawler (also called a spider or spiderbot) is the first type of web scraping. The crawler is a web robot, also known as an Internet bot, that scans the World Wide Web, typically operated by search engines for the purpose of Web indexing [2]. The crawling procedure starts with a list of seed URLs. The program identifies all the links that exist on the seed pages and stores them. After that, the list of all links is recursively visited. This process continues until all URLs have been visited. There are several types of web crawlers, but all of them can be divided into common crawlers and focused crawlers.
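For illustration, the crawling procedure described above can be sketched in a few lines of Python. The snippet below is not part of the proposed solution; it is a minimal example, assuming the third-party requests library is available, that starts from seed URLs, identifies the links on each page with a deliberately naive regular expression, and visits them recursively in breadth-first order.

import re
from collections import deque
from urllib.parse import urljoin

import requests

# Naive link pattern, for illustration only; a real crawler would use an HTML parser.
HREF_RE = re.compile(r'href="([^"#]+)"')

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit seed URLs, collect links, visit them recursively."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        visited.add(url)
        # Identify all links that exist on the page and store the unvisited ones.
        for href in HREF_RE.findall(response.text):
            link = urljoin(url, href)
            if link.startswith("http") and link not in visited:
                queue.append(link)
    return visited

if __name__ == "__main__":
    pages = crawl(["https://example.com"])
    print(f"Visited {len(pages)} pages")

A focused crawler would add one step to this loop: before queueing a link, the page text is classified against the target topic, and off-topic links are discarded.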
A focused crawler searches for the most suitable pages according to the topic that is defined by the user. This goal is achieved by using algorithms of intelligent text analysis, which ensure that web pages are crawled only for information related to the specific topic. From the server's perspective, there are single-machine crawlers and distributed crawlers. Crawling can be divided across several cooperating nodes, which improves the efficiency and performance of the crawler.

The second type of web scraper is the data extractor [3]. A website contains a large amount of information, and an analyst cannot spend a lot of time manually collecting and converting this data into the desired format. Besides that, a web page can contain a lot of unstructured data, which means it can contain noise or redundant data. Data extractors can extract large volumes of unstructured data and convert them into a comprehensive and structured format. The extraction process starts with indexing or crawling. In the crawling process, the crawler finds the list of relevant URLs that the data extractor will process. In these web pages a lot of junk and useful data is mixed. The data extractor extracts the needed information from the web pages. Data extractors employ a variety of techniques [4] for extracting data from HTML pages.

III. COMPARISON ANALYSIS OF SYSTEMS

This section gives an overview and comparison of web scraping frameworks for fast scanning of any kind of data, distributed scraping systems for increasing performance, and orchestration systems.

A. Scraping tools

There are various tools for working with web scrapers. They can be divided into three categories: libraries, frameworks, and desktop or web-based environments.

1) Libraries

Modern web resources may contain various kinds of information. Due to this circumstance, certain flexibility is required for configuring and implementing web scraping tools. Libraries guarantee access to the web resource. Most of the libraries implement the client side of the HTTP protocol; the resulting web page is then parsed and the data is retrieved using string functions such as regular expressions, splitting, trimming, etc. [5]. Also, third-party libraries can help with implementing more complex analysis, for example, building an HTML tree and XPath mappings.

One of the most popular site access libraries is "libcurl". It supports the major features of the HTTP protocol, including SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form-based upload, proxies, cookies and HTTP authentication. Moreover, it can work with many programming languages. In Java, the Apache HttpClient package emulates the main HTTP features, i.e., all request methods, cookies, SSL and HTTP authentication, and can be combined with HTML parsing libraries. Java also supports XPath and provides several HTML cleaning libraries, such as "jsoup". Programs like "curl" (libcurl) and "wget" implement the HTTP client layer, while utilities such as "grep", "awk", "sed", "cut" and "paste" can be used to parse and transform contents conveniently.
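As an illustration of this library-based approach, the following minimal Python sketch (not taken from the article) combines an HTTP client with XPath extraction. It assumes the requests and lxml packages; the URL and the XPath expression are placeholders that must be adapted to the markup of a concrete page.

import requests
from lxml import html

def extract_titles(url):
    """Library-style scraping: an HTTP client fetches the page,
    an HTML tree is built, and data is pulled out with XPath."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    tree = html.fromstring(response.text)
    # Hypothetical selector: adjust the XPath to the structure of the target page.
    titles = tree.xpath("//h2[@class='title']/text()")
    return [t.strip() for t in titles]

if __name__ == "__main__":
    for title in extract_titles("https://example.com/catalog"):
        print(title)

The example also shows the limitation discussed below: one package is responsible for transport and another for parsing, and the developer has to assemble and maintain this combination manually.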
2) Desktop or web application

Desktop applications are implementations of web scrapers that are designed for non-coding professionals. This kind of web scraper provides a graphical shell that makes it easier to create and support web robots. Typically, these applications include an embedded web browser, where the user can navigate to a target web resource and interactively select page elements to extract, avoiding any kind of "regex" or "XPath" queries, or other technical details. In addition, such modules are capable of generating several kinds of outputs, such as CSV, Excel and XML files, and queries that are inserted into databases. The main disadvantages of desktop solutions are commercial distribution and limited API access, which make it difficult to embed these web scrapers into other programs.

TABLE I. COMPARISON OF SCRAPING FRAMEWORKS

Feature/Framework                     | Scrapy         | PySpider           | NodeCrawler | Apify SDK            | Selenium
Built-in Data Storage Supports        | CSV, JSON, XML | CSV, JSON          | Customizable| CSV, JSON, XML, HTML | Customizable
Suitable for Broad Crawling           | Yes            | No                 | Yes         | Yes                  | No
Built-in Scaling Support              | Yes            | Yes                | No          | Yes                  | No
AJAX Support                          | No             | Yes                | No          | Yes                  | Yes
Available Selectors                   | CSS, XPath     | CSS, XPath         | CSS, XPath  | CSS                  | CSS, XPath
Built-in Interface for Periodic Jobs  | No             | Yes                | No          | Yes                  | No
Speed (Fast, Medium, Slow)            | Fast           | Medium             | Medium      | Medium               | Very Slow
CPU Usage (High, Medium, Low)         | Medium         | Medium             | Medium      | Medium               | High
Memory Usage (High, Medium, Low)      | Medium         | Medium             | Medium      | High                 | High
GitHub Forks                          | 9000           | 3600               | 852         | 182                  | 6200
GitHub Stars                          | 39600          | 14800              | 5800        | 2700                 | 19800
License                               | BSD License    | Apache License 2.0 | MIT         | Apache License 2.0   | Apache License 2.0

3) Frameworks

Programming libraries have their limitations. For example, you need to use one library for accessing a web page and another for analyzing and extracting data from HTML pages. Designing the architecture and checking the compatibility of the libraries can take a significant amount of time. Frameworks are a complete solution for developing web scrapers. Comparison results for popular frameworks for implementing web scrapers are presented in the article as well (Tab. 1). The comparison is made according to the following criteria.

Built-in Data Storage Supports - the supported types of files or other storage.

Suitable for Broad Crawling - this type of crawler covers …

… work with containers. Thus, our web scraper will be delivered as a container running Docker.

C. Distributed scraping system review

1) Research

Scrapy does not provide any built-in facility for running spiders in a distributed (multi-server) manner. However, there are several ways to organize the work. Some of the popular solutions are Frontera, Scrapy Redis, Scrapy Cluster, and Scrapyd.

Frontera is a distributed crawling system [9][10]. Based on the description of the project, we can say that the system is a separate distributed web crawler.
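As an illustration of how a Scrapy spider is usually distributed with one of the listed solutions (Scrapy Redis), the sketch below is a minimal example rather than the authors' implementation: all workers consume seed URLs from a shared Redis queue, and a Redis-backed scheduler and duplicate filter let several identical nodes share one crawl frontier. The spider name, Redis key, selectors and host are hypothetical; the settings names follow the scrapy-redis documentation.

from scrapy_redis.spiders import RedisSpider

class QuoteSpider(RedisSpider):
    name = "quotes"
    # Workers pop seed URLs from this shared Redis list instead of a local start_urls.
    redis_key = "quotes:start_urls"

    def parse(self, response):
        # Hypothetical selectors; adapt them to the target site's markup.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

# settings.py fragment (names taken from the scrapy-redis documentation):
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# SCHEDULER_PERSIST = True              # keep the frontier between runs
# REDIS_URL = "redis://redis-host:6379" # shared frontier for all worker nodes

With such a setup, scaling out amounts to starting more identical spider processes (for example, more containers), since every worker pulls requests from the same Redis-backed queue.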