Scrapy Cluster Documentation, Release 1.2

IST Research

November 29, 2016

Contents

1 Introduction
    1.1 Overview
    1.2 Quick Start
2 Kafka Monitor
    2.1 Design
    2.2 Quick Start
    2.3 API
    2.4 Plugins
    2.5 Settings
3 Crawler
    3.1 Design
    3.2 Quick Start
    3.3 Controlling
    3.4 Extension
    3.5 Settings
4 Redis Monitor
    4.1 Design
    4.2 Quick Start
    4.3 Plugins
    4.4 Settings
5 Rest
    5.1 Design
    5.2 Quick Start
    5.3 API
    5.4 Settings
6 Utilities
    6.1 Argparse Helper
    6.2 Log Factory
    6.3 Method Timer
    6.4 Redis Queue
    6.5 Redis Throttled Queue
    6.6 Settings Wrapper
    6.7 Stats Collector
    6.8 Zookeeper Watcher
7 Advanced Topics
    7.1 Upgrade Scrapy Cluster
    7.2 Integration with ELK
    7.3 Docker
    7.4 Crawling Responsibly
    7.5 Production Setup
    7.6 DNS Cache
    7.7 Response Time
    7.8 Kafka Topics
    7.9 Redis Keys
    7.10 Other Distributed Scrapy Projects
8 Frequently Asked Questions
    8.1 General
    8.2 Kafka Monitor
    8.3 Crawler
    8.4 Redis Monitor
    8.5 Rest
    8.6 Utilities
9 Troubleshooting
    9.1 General Debugging
    9.2 Data Stack
10 Contributing
    10.1 Raising Issues
    10.2 Submitting Pull Requests
    10.3 Testing and Quality Assurance
    10.4 Working on Scrapy Cluster Core
11 Change Log
    11.1 Scrapy Cluster 1.2
    11.2 Scrapy Cluster 1.1
    11.3 Scrapy Cluster 1.0
12 License
13 Introduction
14 Architectural Components
    14.1 Kafka Monitor
    14.2 Crawler
    14.3 Redis Monitor
    14.4 Rest
    14.5 Utilities
15 Advanced Topics
16 Miscellaneous

This documentation provides everything you need to know about the Scrapy-based distributed crawling project, Scrapy Cluster.

Warning: This document may not be up to date with Scrapy Cluster 1.2 capabilities.

CHAPTER 1

Introduction

This set of documentation serves to give an introduction to Scrapy Cluster, including goals, architecture designs, and how to get started using the cluster.

1.1 Overview

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster. The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those requests trigger, whether through frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
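For example, once a cluster is up, a crawl request is just a JSON object fed onto the input topic. As a sketch of what that looks like (the feed helper and field names follow the Kafka Monitor chapter of this documentation; the url, appid, and crawlid values here are placeholders):

$ python kafka_monitor.py feed '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc1234"}'

The crawl result then comes back out on one of the output topics, as described in the Architecture section below.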
1.1.1 Dependencies

Please see requirements.txt for Pip package dependencies across the different sub projects.

Other important components required to run the cluster:

• Python 2.7: https://www.python.org/downloads/
• Redis: http://redis.io
• Zookeeper: https://zookeeper.apache.org
• Kafka: http://kafka.apache.org

1.1.2 Core Concepts

This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include:

• The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
• Scale Scrapy instances across a single machine or multiple machines
• Coordinate and prioritize their scraping effort for desired sites
• Persist data across scraping jobs
• Execute multiple scraping jobs concurrently
• Allows for in-depth access into the information about your scraping job, what is upcoming, and how the sites are ranked
• Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
• Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
• Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address

1.1.3 Architecture

At the highest level, Scrapy Cluster operates on a single input Kafka topic and two separate output Kafka topics. All incoming requests to the cluster go through the demo.incoming Kafka topic; depending on the type of request, output is generated on either the demo.outbound_firehose topic, for action requests, or the demo.crawled_firehose topic, for HTML crawl requests.

Each of the three core pieces is extendable in order to add or enhance its functionality. Both the Kafka Monitor and Redis Monitor use 'Plugins' in order to enhance their abilities, whereas Scrapy uses 'Middlewares', 'Pipelines', and 'Spiders' to allow you to customize your crawling. Together these three components and the Rest service allow for scaled and distributed crawling across many machines.
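To make the two request types concrete: a crawl request like the one shown in the Overview produces output on demo.crawled_firehose, while an action request, such as an info query about an ongoing crawl, produces output on demo.outbound_firehose. A sketch of such an action request follows (field names follow the Kafka Monitor API chapter; the uuid, appid, and crawlid values are placeholders):

$ python kafka_monitor.py feed '{"action": "info", "appid": "testapp", "uuid": "someuuid", "crawlid": "abc1234", "spiderid": "link"}'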
1.2 Quick Start

This guide does not go into detail as to how everything works, but hopefully will get you scraping quickly. For more information about how each process works, please see the rest of the documentation.

1.2.1 Setup

There are a number of different options available to set up Scrapy Cluster. You can choose to provision with the Vagrant Quickstart, use the Docker Quickstart, or manually configure the cluster yourself via the Cluster Quickstart.

Vagrant Quickstart

The Vagrant Quickstart provides a simple Vagrant virtual machine in order to try out Scrapy Cluster. It is not meant to be a production crawling machine, and should be treated more like a test ground for development.

1. Ensure you have Vagrant and VirtualBox installed on your machine. For instructions on setting those up please refer to the official documentation.

2. Download and unzip the latest release here. Let's assume our project is now in ~/scrapy-cluster.

3. Stand up the Scrapy Cluster Vagrant machine.

$ cd ~/scrapy-cluster
$ vagrant up
# This will set up your machine, provision it with Ansible, and boot the required services

Note: The initial setup may take some time due to the setup of Zookeeper, Kafka, and Redis. This is only a one-time setup and is skipped when running it in the future.

4. SSH onto the machine.

$ vagrant ssh

5. Ensure everything under supervision is running.

vagrant@scdev:~$ sudo supervisorctl status
kafka RUNNING pid 1812, uptime 0:00:07
redis RUNNING pid 1810, uptime 0:00:07
zookeeper RUNNING pid 1809, uptime 0:00:07

Note: If you receive the message unix:///var/run/supervisor.sock no such file, issue the following command to start Supervisord: sudo service supervisord start, then check the status like above.

6. Create a virtual environment for Scrapy Cluster, and activate it:
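A typical way to perform this step, assuming virtualenv is available on the VM (the environment name sc is illustrative, not necessarily the one used in the full guide):

vagrant@scdev:~$ virtualenv sc
vagrant@scdev:~$ source sc/bin/activate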