High Performance Distributed Web-Scraper
Trudy ISP RAN/Proc. ISP RAS, vol. 33, issue 3, 2021, pp. 87-100. DOI: 10.15514/ISPRAS-2021-33(3)-7

D.S. Eyzenakh, ORCID: 0000-0003-1584-1745 <[email protected]>
A.S. Rameykov, ORCID: 0000-0001-7989-6732 <[email protected]>
I.V. Nikiforov, ORCID: 0000-0003-0198-1886 <[email protected]>
Peter the Great St.Petersburg Polytechnic University
29, Polytechnicheskaya, St.Petersburg, 195251, Russia

Abstract. Over the past decade, the Internet has become a gigantic and extremely rich source of data. The data is used for the extraction of knowledge by performing machine learning analysis. In order to perform data mining of web information, the data should be extracted from its source and placed in analytical storage; this is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. The article is devoted to an approach for high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built on the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.

Keywords: web-scraping; web-crawling; distributed data collection; distributed data analysis

For citation: Eyzenakh D.S., Rameykov A.S., Nikiforov I.V. High Performance Distributed Web-Scraper. Trudy ISP RAN/Proc. ISP RAS, vol. 33, issue 3, 2021, pp. 87-100. DOI: 10.15514/ISPRAS-2021-33(3)-7

1. Introduction

Due to the rapid development of the network, the World Wide Web has become a carrier of a large amount of information. Extracting and using this information has become a huge challenge nowadays. Traditional access to information through browsers such as Chrome or Firefox provides a comfortable user experience with web pages. However, many web sites hold a lot of information yet offer no instrument for accessing it over an API and preserving it in analytical storage. Manual collection of data for further analysis can take a lot of time, and in the case of semi-structured or unstructured data the collection and analysis can become even more difficult and time-consuming. The process is also error-prone: a person who collects data manually can make mistakes (duplication, typos in the text, etc.).

Web scraping is the technique that addresses the shortcomings of manual data processing [1]. Web scraping is part of the ETL process and is broadly used in web indexing, web mining, web data integration, and data mining. However, many existing solutions do not support parallel computing on multiple machines. This significantly reduces performance and limits a system's ability to collect large amounts of data. A distributed approach allows building a horizontally scalable system whose performance can be increased according to the user's needs.

The article proposes an approach to organizing distributed, horizontally scalable scraping and distributed data storage. Using an orchestration system greatly simplifies interaction with the system, and support for automatic load balancing avoids overloading individual nodes.

2. Existing Web Scraping Techniques

Typically, web scraping applications imitate a regular web user: they follow links and search for the information they need. The classic web scraper can be classified into two types: web crawlers and data extractors (fig. 1).

Fig. 1. Web-scraper structure

A web crawler (also called a spider or spiderbot) is the first type of web-scraping component. A crawler is a web robot, also known as an Internet bot, that scans the World Wide Web, typically operated by search engines for the purpose of web indexing [2]. The crawling procedure starts with a list of seed URLs. The program identifies all the links that exist on the seed pages and stores them. After that, the list of collected links is visited recursively. This process continues until all URLs have been visited. There are several types of web crawlers, but all of them can be divided into common crawlers and focused crawlers.
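The seed-URL procedure described above (collect the links on each page, then visit them recursively until no new URLs remain) can be sketched as a breadth-first traversal. The sketch below is illustrative rather than the paper's implementation: it is plain Python in which an in-memory `PAGES` dictionary stands in for real HTTP fetching, and the names `LinkCollector` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny in-memory "web site" standing in for real HTTP fetching (assumption:
# in a real crawler, each lookup would be an HTTP GET request).
PAGES = {
    "http://example.org/": '<a href="http://example.org/a">A</a> <a href="http://example.org/b">B</a>',
    "http://example.org/a": '<a href="http://example.org/b">B</a>',
    "http://example.org/b": '<a href="http://example.org/">home</a>',
}

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(seed_urls):
    """Breadth-first crawl: visit seeds, extract links, recurse until no new URLs."""
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(PAGES.get(url, ""))   # "fetch" and parse the page
        for link in collector.links:
            if link not in seen:             # deduplicate before enqueueing
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["http://example.org/"]))  # each of the three pages appears exactly once
```

In a production system the frontier, the seen-set, and the fetching itself would be shared between several nodes, which is exactly the direction the article develops.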
A focused crawler searches for the most suitable pages according to a topic defined by the user. This goal is achieved by using intelligent text-analysis algorithms, which ensure that web pages are crawled only for information related to the specific topic. From the server's perspective, there are single-machine crawlers and distributed crawlers. Crawling can be divided across several cooperating nodes, which improves the efficiency and performance of the crawler.

The second type of web scraper is a data extractor [3]. A website contains a large amount of information, and an analyst cannot spend a lot of time manually collecting and converting this data into the desired format. Besides that, a web page can contain a lot of unstructured data, which means it can contain noise or redundant data. Data extractors can easily extract large amounts of unstructured data and convert it into a comprehensive, structured format. The extraction process starts with indexing or crawling. In the crawling process, the crawler finds a list of relevant URLs that the data extractor will process. In these web pages, junk and useful data are mixed together. The data extractor extracts the needed information from the web pages. A data extractor implements many techniques [4] for extracting data from HTML pages.
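The data extractor's job, turning a page where junk and useful data are mixed into structured records, can be illustrated with a minimal sketch. This is not the paper's extractor: it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` on a well-formed sample fragment (real extractors typically rely on tolerant HTML parsers and full CSS/XPath selector engines), and the product markup and field names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed page fragment; the "noise" div is the kind of
# redundant data the extractor must skip.
HTML = """
<html><body>
  <div class="noise">advertisement</div>
  <ul id="products">
    <li><span class="name">Laptop</span><span class="price">999</span></li>
    <li><span class="name">Mouse</span><span class="price">19</span></li>
  </ul>
</body></html>
"""

def extract_products(page):
    """Turn the unstructured page into structured records, skipping the junk."""
    root = ET.fromstring(page)
    records = []
    # XPath-like query: every <li> under the element with id="products"
    for item in root.findall('.//ul[@id="products"]/li'):
        name = item.find('span[@class="name"]').text
        price = int(item.find('span[@class="price"]').text)
        records.append({"name": name, "price": price})
    return records

print(extract_products(HTML))
# [{'name': 'Laptop', 'price': 999}, {'name': 'Mouse', 'price': 19}]
```

The output is what the surrounding text calls a comprehensive, structured format: uniform records that can go straight into analytical storage.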
3. Comparison Analysis of Systems

This section gives an overview and comparison of web scraping frameworks for fast scanning of any kind of data, of distributed scraping systems for increasing performance, and of orchestration systems.

3.1 Scraping tools

There are various tools for building web scrapers. They can be divided into three categories: libraries, frameworks, and desktop or web-based environments.

3.1.1 Libraries

Modern web resources may contain various kinds of information, so a certain flexibility is required when configuring and implementing web scraping tools. Libraries provide access to the web resource. Most libraries implement the client side of the HTTP protocol; the resulting web page is then parsed and the data is retrieved using string functions such as regular expressions, splitting, trimming, etc. [5]. Third-party libraries can also help with more complex analysis, for example building an HTML tree and XPath mappings.

One of the most popular site-access libraries is libcurl. It supports the major features of the HTTP protocol, including SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form-based upload, proxies, cookies, and HTTP authentication. Moreover, it can be used from many programming languages. In Java, the Apache HttpClient package emulates the main features of HTTP, i.e., all request methods.

3.1.3 Frameworks

Programming libraries have their limitations. For example, you need to use one library for accessing a web page and another for analyzing and extracting data from HTML pages. Designing the architecture and checking the compatibility of the libraries can take a significant amount of time.

Table 1. Comparison of scraping frameworks

Criterion                            Scrapy      PySpider    NodeCrawler  Apify SDK   Selenium
Suitable for broad crawling          Yes         No          Yes          Yes         No
Built-in scaling                     Yes         Yes         No           Yes         No
AJAX support                         No          Yes         No           Yes         Yes
Available selectors                  CSS, XPath  CSS, XPath  CSS, XPath   CSS         CSS, XPath
Built-in interface for periodic jobs No          Yes         No           Yes         No
Speed                                Fast        Medium      Medium       Medium      Very slow
CPU usage                            Medium      Medium      Medium       Medium      High
Memory usage                         Medium      Medium      Medium       High        Medium
GitHub forks                         9000        3600        852          182         6200
GitHub stars                         39600       14800       5800         2700        19800
License                              BSD         Apache 2.0  MIT          Apache 2.0  Apache 2.0
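The division of labor that Section 3.1.3 criticizes, one component to fetch the page and separate string functions to pull data out of it, can be made concrete with a small sketch. It is illustrative only: the `fetch` function is a stub returning a canned page (a real client such as libcurl, Apache HttpClient, or Python's urllib would issue the actual request), and the page content and field names are invented.

```python
import re

# Canned page standing in for an HTTP response (assumption: a real HTTP
# client library would fetch this over the network).
RAW_PAGE = "<html><body><h1> Quarterly Report </h1><p>Total: 1,234 items</p></body></html>"

def fetch(url):
    """Stand-in for an HTTP client call; real code would hit the network."""
    return RAW_PAGE

def extract_title_and_total(url):
    page = fetch(url)
    # Regular expressions plus trimming: the "string functions" style of
    # extraction described in Section 3.1.1.
    title = re.search(r"<h1>(.*?)</h1>", page).group(1).strip()
    total = int(re.search(r"Total:\s*([\d,]+)", page).group(1).replace(",", ""))
    return title, total

print(extract_title_and_total("http://example.org/report"))
# ('Quarterly Report', 1234)
```

A framework such as Scrapy bundles both stages, downloading, scheduling, and selector-based extraction, behind one API, which is precisely the gap in the library approach that this section points out.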