Comparison of Current Web Crawling Tools (Vergleich aktueller Web-Crawling-Werkzeuge)

Hochschule Wismar, University of Applied Sciences: Technology, Business and Design
Fakultät für Ingenieurwissenschaften, Bereich EuI

Bachelor's thesis: Comparison of Current Web Crawling Tools (Vergleich aktueller Web-Crawling-Werkzeuge)
Printed on: 30 April 2021
Submitted on:
By: Christoph Werner
Supervising professor: Prof. Dr.-Ing. Antje Raab-Düsterhöft
Second supervisor: Prof. Dr.-Ing. Matthias Kreuseler

Task (Aufgabenstellung)

The goal of this bachelor's thesis is a comparison of current web crawling tools. The crawlers are to be compared and evaluated with regard to their mode of operation and their crawling results. In addition, a tool is to be designed that makes these crawling tools available for use in laboratory courses.

Kurzreferat (German abstract)

Web crawling combined with web scraping is one of the fundamental technologies for collecting data from the Internet in a targeted and automated way. This thesis first covers the basics of crawling and scraping, paying particular attention to the architecture and operation of a crawler, the Robots Exclusion Protocol (REP), security aspects to be considered, and anti-crawling/scraping measures. Building on this, a wide range of crawling frameworks are evaluated and compared on the basis of their documentation. Finally, a tool with a graphical user interface (GUI) for comparing different crawling frameworks is developed.

Abstract

Web crawling in combination with web scraping is the key technology for the targeted and automated collection of data from the World Wide Web. This thesis first deals with the basics of crawling and scraping. Special attention is paid to the architecture and functionality of a crawler, the Robots Exclusion Protocol (REP), security aspects to be considered, as well as anti-crawling/scraping measures. Based on this, various crawling frameworks are evaluated and compared on the basis of their documentation. Afterwards, a tool with a graphical user interface (GUI) for the comparison of different crawling frameworks is developed.
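To illustrate the Robots Exclusion Protocol mentioned in the abstract: a website publishes its crawling rules in a robots.txt file (or, per page, as a META tag), and a well-behaved crawler checks every URL against these rules before downloading it. The following minimal sketch, whose rules and URLs are purely illustrative and not taken from the thesis, shows such a check with Python's standard urllib.robotparser module:

    # Minimal sketch: checking made-up robots.txt rules before fetching a URL.
    # Requires Python 3.6+ for crawl_delay support; rules and URLs are invented.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",          # rules for every crawler ...
        "Disallow: /private/",    # ... do not fetch anything under /private/
        "Crawl-delay: 5",         # ... and wait 5 seconds between requests
        "",
        "User-agent: BadBot",     # a specific crawler that is banned entirely
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # A polite crawler asks before every request whether fetching is allowed.
    print(parser.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
    print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))         # True
    print(parser.crawl_delay("MyCrawler"))                                          # 5

The per-page equivalent is a tag such as <meta name="robots" content="noindex, nofollow"> in the HTML head, which asks crawlers not to index the page or follow its links.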
Contents

1  Motivation
2  Fundamentals of web crawlers
   2.1  Architecture of a modern crawler
   2.2  Downloader and parser process
   2.3  Robots Exclusion Protocol
        2.3.1  Implementation as robots.txt
        2.3.2  Implementation as a META tag
        2.3.3  Security aspects
   2.4  Crawler identification
   2.5  Crawling policies
        2.5.1  Selection policy
        2.5.2  Re-visit policy
        2.5.3  Politeness policy
        2.5.4  Parallelization policy
   2.6  Countermeasures against crawlers
        2.6.1  Checking HTTP request headers
        2.6.2  Blocking IP addresses
        2.6.3  Designing error messages without content
        2.6.4  Honeypot traps
        2.6.5  Regularly changing the HTML code
        2.6.6  Hiding content
   2.7  Uses of crawlers
3  Properties of the optimal crawler
   3.1  Robust extraction of links
   3.2  Database and data export
   3.3  Executing JavaScript
   3.4  Applying the REP
   3.5  Suitable request headers
   3.6  Policy application
   3.7  Scalability
   3.8  Avoiding bans by administrators
4  Evaluation scheme
   4.1  Data extraction mechanisms
   4.2  Database connectivity
   4.3  JavaScript support
   4.4  Data export
   4.5  Application of the Robots Exclusion Protocol
   4.6  Request header customization
   4.7  Built-in crawling policies
   4.8  Scalability
   4.9  Proxy server support
5  Evaluation of the preselected tools
   5.1  Limitations
   5.2  Unweighted evaluation
   5.3  Weighting of the evaluation criteria
   5.4  Weighted evaluation
   5.5  Explanation of the grading of Scrapy, Apache Nutch and StormCrawler
        5.5.1  Scrapy
        5.5.2  Apache Nutch
        5.5.3  StormCrawler
   5.6  Analysis of the results
        5.6.1  Test for normal distribution of the grades
        5.6.2  Comparison of the tools
        5.6.3  Special cases in the selection
6  How the three best-rated tools work
   6.1  Scrapy
   6.2  Apache Nutch
   6.3  StormCrawler
7  Tool test
   7.1  Requirements for the crawlers in the test
   7.2  Test preparations
        7.2.1  Proxy pool
        7.2.2  User-agent pool
        7.2.3  Seed URLs
   7.3  Testing Scrapy
        7.3.1  Configurations of the Scrapy crawler
        7.3.2  Evaluation of Scrapy
   7.4  Testing Apache Nutch
        7.4.1  Configuration files for the Apache Nutch crawler
        7.4.2  Evaluation of Apache Nutch
   7.5  Testing StormCrawler
        7.5.1  Programs required for StormCrawler
        7.5.2  Configuration file for StormCrawler
        7.5.3  Evaluation of StormCrawler
   7.6  Comparison of the results
8  Tool for the laboratory course
   8.1  Requirements for the tool
   8.2  Graphical user interface of the tool
   8.3  Processing the form input
        8.3.1  Selecting the crawler
        8.3.2  Generating scripts for Scrapy
        8.3.3  Generating files for Apache Nutch
        8.3.4  Generating StormCrawler files
   8.4  Points to consider when generating configuration files
9  Conclusion
Bibliography
List of figures
List of tables
List of code listings
Abbreviations
Declaration of authorship
Theses
A  List of manually collected crawling tools
B  Code appendix: Scrapy
C  Code appendix: Apache Nutch
D  Code appendix: StormCrawler
Data CD

1 Motivation

Because of the steadily growing size of the Internet and the resulting increase in its information content, it becomes necessary to index this content, or at least a large part of it. As in a library, a directory is created that records basic information about each item. However, since the Internet is not only large but also highly dynamic, with websites and content constantly being updated, removed or added, maintaining such a directory manually is impossible.

This task is taken over by search engines with their bot programs, so-called crawlers, which use web crawling in combination with web scraping to collect and index data from the Internet in a targeted and automated way. This thesis first deals with the fundamentals and the operation of these crawlers.

The use of web crawlers is, however, not reserved for the large search engine providers; it is also open to ordinary users, who can build and operate their own web crawlers for information retrieval, data collection or indexing with the help of a wide range of frameworks and function libraries.

From Chapter 4 onwards, this thesis first presents an evaluation scheme for grading a selection of crawling tools, then compares them and outlines the internal workings of the three best-rated ones. In Chapter 7, these three crawling tools are tested with regard to their results under different configurations, in order to determine which of them works most effectively and reliably. In addition, a program is to be written with whose help students at Hochschule Wismar in the course "Informationsrecherche" will in future be able to build simple crawlers themselves.
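To make this concrete, the following minimal sketch shows the kind of simple crawler such a framework enables. It uses Scrapy, one of the three best-rated tools in the thesis's comparison; the spider name, seed URL and extracted fields are illustrative assumptions rather than code from the thesis:

    # Illustrative Scrapy spider: crawls outgoing links from one seed URL
    # and scrapes each page's title (assumed example, not from the thesis).
    import scrapy


    class SimpleSpider(scrapy.Spider):
        name = "simple"
        start_urls = ["https://example.com"]

        # Stay polite: honour robots.txt and pause between requests.
        custom_settings = {
            "ROBOTSTXT_OBEY": True,
            "DOWNLOAD_DELAY": 1.0,
        }

        def parse(self, response):
            # Scraping step: extract data from the downloaded page.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            # Crawling step: follow every link found on the page.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Saved as simple_spider.py, a spider like this could be run with a command such as scrapy runspider simple_spider.py -o results.json, which writes the scraped items to a JSON file.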
Recommended publications
  • Java Linksammlung
  • Cluster Setup Guide
  • Customer Xperience – Using Social Media Data to Generate Insights for Retail
  • Frontera Documentation Release 0.4.0
  • High Performance Distributed Web-Scraper
  • Scrapy Cluster Documentation Release 1.2
  • Frontera-Open Source Large Scale Web Crawling Framework
  • Distributed Training
  • HPC and AI Middleware for Exascale Systems and Clouds
  • Parsl Documentation Release 1.1.0
  • Frontera Documentation Release 0.8.0
  • Frontera Documentation 0.6.0