Scrapy Cluster Documentation Release 1.3
Total Page:16
File Type:pdf, Size:1020Kb
Scrapy Cluster Documentation Release 1.3 IST Research Apr 07, 2021 Contents 1 Introduction 3 1.1 Overview.................................................3 1.2 Quick Start................................................5 2 Kafka Monitor 13 2.1 Design.................................................. 13 2.2 Quick Start................................................ 14 2.3 API.................................................... 16 2.4 Plugins.................................................. 33 2.5 Settings.................................................. 36 3 Crawler 41 3.1 Design.................................................. 41 3.2 Quick Start................................................ 46 3.3 Controlling................................................ 48 3.4 Extension................................................. 54 3.5 Settings.................................................. 59 4 Redis Monitor 65 4.1 Design.................................................. 65 4.2 Quick Start................................................ 66 4.3 Plugins.................................................. 68 4.4 Settings.................................................. 71 5 Rest 77 5.1 Design.................................................. 77 5.2 Quick Start................................................ 77 5.3 API.................................................... 81 5.4 Settings.................................................. 85 6 Utilites 89 6.1 Argparse Helper............................................. 89 6.2 Log Factory............................................... 91 6.3 Method Timer.............................................. 97 6.4 Redis Queue............................................... 98 6.5 Redis Throttled Queue.......................................... 102 6.6 Settings Wrapper............................................. 107 i 6.7 Stats Collector.............................................. 110 6.8 Zookeeper Watcher............................................ 118 7 Advanced Topics 125 7.1 Upgrade Scrapy Cluster......................................... 125 7.2 Integration with ELK........................................... 127 7.3 Docker.................................................. 132 7.4 Crawling Responsibly.......................................... 137 7.5 Production Setup............................................. 137 7.6 DNS Cache................................................ 139 7.7 Response Time.............................................. 140 7.8 Kafka Topics............................................... 140 7.9 Redis Keys................................................ 141 7.10 Other Distributed Scrapy Projects.................................... 143 8 Frequently Asked Questions 145 8.1 General.................................................. 145 8.2 Kafka Monitor.............................................. 146 8.3 Crawler.................................................. 146 8.4 Redis Monitor.............................................. 147 8.5 Rest.................................................... 147 8.6 Utilities.................................................. 147 9 Troubleshooting 149 9.1 General Debugging............................................ 149 9.2 Data Stack................................................ 152 10 Contributing 155 10.1 Raising Issues.............................................. 155 10.2 Submitting Pull Requests........................................ 156 10.3 Testing and Quality Assurance...................................... 157 10.4 Working on Scrapy Cluster Core..................................... 157 11 Change Log 159 11.1 Scrapy Cluster 1.3............................................ 159 11.2 Scrapy Cluster 1.2............................................ 160 11.3 Scrapy Cluster 1.1............................................ 161 11.4 Scrapy Cluster 1.0............................................ 162 12 License 163 13 Introduction 165 14 Architectural Components 167 14.1 Kafka Monitor.............................................. 167 14.2 Crawler.................................................. 167 14.3 Redis Monitor.............................................. 167 14.4 Rest.................................................... 168 14.5 Utilities.................................................. 168 15 Advanced Topics 169 16 Miscellaneous 171 Index 173 ii Scrapy Cluster Documentation, Release 1.3 This documentation provides everything you need to know about the Scrapy based distributed crawling project, Scrapy Cluster. Contents 1 Scrapy Cluster Documentation, Release 1.3 2 Contents CHAPTER 1 Introduction This set of documentation serves to give an introduction to Scrapy Cluster, including goals, architecture designs, and how to get started using the cluster. 1.1 Overview This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster. The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests. 1.1.1 Dependencies Please see the requirements.txt within each sub project for Pip package dependencies. Other important components required to run the cluster • Python 2.7 or 3.6: https://www.python.org/downloads/ • Redis: http://redis.io • Zookeeper: https://zookeeper.apache.org • Kafka: http://kafka.apache.org 3 Scrapy Cluster Documentation, Release 1.3 1.1.2 Core Concepts This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include: • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster • Scale Scrapy instances across a single machine or multiple machines • Coordinate and prioritize their scraping effort for desired sites • Persist data across scraping jobs • Execute multiple scraping jobs concurrently • Allows for in depth access into the information about your scraping job, what is upcoming, and how the sites are ranked • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results) • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address 1.1.3 Architecture 4 Chapter 1. Introduction Scrapy Cluster Documentation, Release 1.3 At the highest level, Scrapy Cluster operates on a single input Kafka topic, and two separate output Kafka topics. All incoming requests to the cluster go through the demo.incoming kafka topic, and depending on what the request is will generate output from either the demo.outbound_firehose topic for action requests or demo. crawled_firehose topics for html crawl requests. Each of the three core pieces are extendable in order add or enhance their functionality. Both the Kafka Monitor and Redis Monitor use ‘Plugins’ in order to enhance their abilities, whereas Scrapy uses ‘Middlewares’, ‘Pipelines’, and ‘Spiders’ to allow you to customize your crawling. Together these three components and the Rest service allow for scaled and distributed crawling across many machines. 1.2 Quick Start This guide does not go into detail as to how everything works but it will hopefully get you scraping quickly. For more information about how each process works please see the rest of the documentation. 1.2.1 Setup There are several options available to set up Scrapy Cluster. You can choose to provision with the Docker Quickstart, or manually configure it via the Cluster Quickstart yourself. Docker Quickstart The Docker Quickstart will help you spin up a complete standalone cluster thanks to Docker and Docker Compose. All individual components will run in standard docker containers, and be controlled through the docker-compose command line interface. 1) Ensure you have Docker Engine and Docker Compose installed on your machine. For more installation infor- mation please refer to Docker’s official documentation. 2) Download and unzip the latest release here. Let’s assume our project is now in ~/scrapy-cluster 3) Run docker compose $ docker-compose up -d This will pull the latest stable images from Docker hub and build your scraping cluster. At the time of writing, there is no Docker container to interface and run all of the tests within your compose-based cluster. Instead, if you wish to run the unit and integration tests please see the following steps. Note: If you want to switch to python3, just modify docker-compose.yml to change kafka_monitor, re- dis_monitor, crawler and rest image to python3’s tag like kafka-monitor-dev-py3. You can find all available tag in DockerHub Tags 4) To run the integration tests, get into the bash shell on any of the containers. Kafka monitor $ docker exec -it scrapy-cluster_kafka_monitor_1 bash Redis monitor 1.2. Quick Start 5 Scrapy Cluster Documentation, Release 1.3 $ docker exec -it scrapy-cluster_redis_monitor_1