Scrapy Cluster Documentation Release 1.2

Total Page:16

File Type:pdf, Size:1020Kb

Scrapy Cluster Documentation Release 1.2 Scrapy Cluster Documentation Release 1.2 IST Research November 29, 2016 Contents 1 Introduction 3 1.1 Overview.................................................3 1.2 Quick Start................................................5 2 Kafka Monitor 13 2.1 Design.................................................. 13 2.2 Quick Start................................................ 14 2.3 API.................................................... 16 2.4 Plugins.................................................. 31 2.5 Settings.................................................. 34 3 Crawler 37 3.1 Design.................................................. 37 3.2 Quick Start................................................ 42 3.3 Controlling................................................ 44 3.4 Extension................................................. 49 3.5 Settings.................................................. 53 4 Redis Monitor 59 4.1 Design.................................................. 59 4.2 Quick Start................................................ 60 4.3 Plugins.................................................. 61 4.4 Settings.................................................. 64 5 Rest 69 5.1 Design.................................................. 69 5.2 Quick Start................................................ 69 5.3 API.................................................... 73 5.4 Settings.................................................. 77 6 Utilites 81 6.1 Argparse Helper............................................. 81 6.2 Log Factory............................................... 83 6.3 Method Timer.............................................. 87 6.4 Redis Queue............................................... 88 6.5 Redis Throttled Queue.......................................... 92 6.6 Settings Wrapper............................................. 96 6.7 Stats Collector.............................................. 99 6.8 Zookeeper Watcher............................................ 106 i 7 Advanced Topics 113 7.1 Upgrade Scrapy Cluster......................................... 113 7.2 Integration with ELK........................................... 114 7.3 Docker.................................................. 119 7.4 Crawling Responsibly.......................................... 123 7.5 Production Setup............................................. 124 7.6 DNS Cache................................................ 126 7.7 Response Time.............................................. 126 7.8 Kafka Topics............................................... 127 7.9 Redis Keys................................................ 127 7.10 Other Distributed Scrapy Projects.................................... 129 8 Frequently Asked Questions 131 8.1 General.................................................. 131 8.2 Kafka Monitor.............................................. 132 8.3 Crawler.................................................. 132 8.4 Redis Monitor.............................................. 133 8.5 Rest.................................................... 133 8.6 Utilities.................................................. 133 9 Troubleshooting 135 9.1 General Debugging............................................ 135 9.2 Data Stack................................................ 138 10 Contributing 141 10.1 Raising Issues.............................................. 141 10.2 Submitting Pull Requests........................................ 142 10.3 Testing and Quality Assurance...................................... 143 10.4 Working on Scrapy Cluster Core..................................... 143 11 Change Log 145 11.1 Scrapy Cluster 1.2............................................ 145 11.2 Scrapy Cluster 1.1............................................ 146 11.3 Scrapy Cluster 1.0............................................ 147 12 License 149 13 Introduction 151 14 Architectural Components 153 14.1 Kafka Monitor.............................................. 153 14.2 Crawler.................................................. 153 14.3 Redis Monitor.............................................. 153 14.4 Rest.................................................... 153 14.5 Utilities.................................................. 154 15 Advanced Topics 155 16 Miscellaneous 157 ii Scrapy Cluster Documentation, Release 1.2 This documentation provides everything you need to know about the Scrapy based distributed crawling project, Scrapy Cluster. Warning: This document may not be up to date with Scrapy Cluster 1.2 capabilities. Contents 1 Scrapy Cluster Documentation, Release 1.2 2 Contents CHAPTER 1 Introduction This set of documentation serves to give an introduction to Scrapy Cluster, including goals, architecture designs, and how to get started using the cluster. 1.1 Overview This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster. The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests. 1.1.1 Dependencies Please see requirements.txt for Pip package dependencies across the different sub projects. Other important components required to run the cluster • Python 2.7: https://www.python.org/downloads/ • Redis: http://redis.io • Zookeeper: https://zookeeper.apache.org • Kafka: http://kafka.apache.org 1.1.2 Core Concepts This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include: • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster • Scale Scrapy instances across a single machine or multiple machines • Coordinate and prioritize their scraping effort for desired sites 3 Scrapy Cluster Documentation, Release 1.2 • Persist data across scraping jobs • Execute multiple scraping jobs concurrently • Allows for in depth access into the information about your scraping job, what is upcoming, and how the sites are ranked • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results) • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address 1.1.3 Architecture At the highest level, Scrapy Cluster operates on a single input Kafka topic, and two separate output Kafka top- ics. All incoming requests to the cluster go through the demo.incoming kafka topic, and depending on what the request is will generate output from either the demo.outbound_firehose topic for action requests or demo.crawled_firehose topics for html crawl requests. Each of the three core pieces are extendable in order add or enhance their functionality. Both the Kafka Monitor and Redis Monitor use ‘Plugins’ in order to enhance their abilities, whereas Scrapy uses ‘Middlewares’, ‘Pipelines’, and ‘Spiders’ to allow you to customize your crawling. Together these three components and the Rest service allow for scaled and distributed crawling across many machines. 4 Chapter 1. Introduction Scrapy Cluster Documentation, Release 1.2 1.2 Quick Start This guide does not go into detail as to how everything works, but hopefully will get you scraping quickly. For more information about each process works please see the rest of the documentation. 1.2.1 Setup There are a number of different options available to set up Scrapy Cluster. You can chose to provision with Vagrant Quickstart, use the Docker Quickstart, or manually configure via the Cluster Quickstart yourself. Vagrant Quickstart The Vagrant Quickstart provides you a simple Vagrant Virtual Machine in order to try out Scrapy Cluster. It is not meant to be a production crawling machine, and should be treated more like a test ground for development. 1. Ensure you have Vagrant and VirtualBox installed on your machine. For instructions on setting those up please refer to the official documentation. 2. Download and unzip the latest release here. Lets assume our project is now in ~/scrapy-cluster 3. Stand up the Scrapy Cluster Vagrant machine. $ cd ~/scrapy-cluster $ vagrant up # This will set up your machine, provision it with Ansible, and boot the required services Note: The initial setup may take some time due to the setup of Zookeeper, Kafka, and Redis. This is only a one time setup and is skipped when running it in the future. 4. SSH onto the machine $ vagrant ssh 5. Ensure everything under supervision is running. vagrant@scdev:~$ sudo supervisorctl status kafka RUNNING pid 1812, uptime 0:00:07 redis RUNNING pid 1810, uptime 0:00:07 zookeeper RUNNING pid 1809, uptime 0:00:07 Note: If you receive the message unix:///var/run/supervisor.sock no such file, issue the fol- lowing command to start Supervisord: sudo service supervisord start, then check the status like above. 6. Create a Virtual Environment for Scrapy Cluster, and activate
Recommended publications
  • Cluster Setup Guide
    Frontera Documentation Release 0.8.0.1 ScrapingHub Jul 30, 2018 Contents 1 Introduction 3 1.1 Frontera at a glance...........................................3 1.2 Run modes................................................5 1.3 Quick start single process........................................6 1.4 Quick start distributed mode.......................................8 1.5 Cluster setup guide............................................ 10 2 Using Frontera 15 2.1 Installation Guide............................................ 15 2.2 Crawling strategies............................................ 16 2.3 Frontier objects.............................................. 16 2.4 Middlewares............................................... 19 2.5 Canonical URL Solver.......................................... 22 2.6 Backends................................................. 23 2.7 Message bus............................................... 27 2.8 Writing custom crawling strategy.................................... 30 2.9 Using the Frontier with Scrapy...................................... 35 2.10 Settings.................................................. 37 3 Advanced usage 51 3.1 What is a Crawl Frontier?........................................ 51 3.2 Graph Manager.............................................. 52 3.3 Recording a Scrapy crawl........................................ 58 3.4 Fine tuning of Frontera cluster...................................... 59 3.5 DNS Service............................................... 60 4 Developer documentation 63 4.1
    [Show full text]
  • Frontera Documentation Release 0.4.0
    Frontera Documentation Release 0.4.0 ScrapingHub December 30, 2015 Contents 1 Introduction 3 1.1 Frontera at a glance...........................................3 1.2 Run modes................................................5 1.3 Quick start single process........................................6 1.4 Quick start distributed mode.......................................8 2 Using Frontera 11 2.1 Installation Guide............................................ 11 2.2 Frontier objects.............................................. 11 2.3 Middlewares............................................... 12 2.4 Canonical URL Solver.......................................... 13 2.5 Backends................................................. 14 2.6 Message bus............................................... 16 2.7 Crawling strategy............................................. 17 2.8 Using the Frontier with Scrapy...................................... 17 2.9 Settings.................................................. 20 3 Advanced usage 31 3.1 What is a Crawl Frontier?........................................ 31 3.2 Graph Manager.............................................. 32 3.3 Recording a Scrapy crawl........................................ 38 3.4 Production broad crawling........................................ 40 4 Developer documentation 45 4.1 Architecture overview.......................................... 45 4.2 Frontera API............................................... 47 4.3 Using the Frontier with Requests.................................... 49 4.4
    [Show full text]
  • High Performance Distributed Web-Scraper
    High performance distributed web-scraper Denis Eyzenakh Anton Rameykov Igor Nikiforov Institute of Computer Science and Institute of Computer Science and Institute of Computer Science and Technology Technology Technology Peter the Great St.Petersburg Peter the Great St.Petersburg Peter the Great St.Petersburg Polytechnic University Polytechnic University Polytechnic University Saint – Petersburg, Russian Federation Saint – Petersburg, Russian Federation Saint – Petersburg, Russian Federation [email protected] [email protected] [email protected] Abstract—Over the past decade, the Internet has become the gigantic and richest source of data. The data is used for the II. EXISTING WEB SCRAPING TECHNIQUES extraction of knowledge by performing machine leaning analysis. Typically, web scraping applications imitate a regular web In order to perform data mining of the web-information, the data user. They follow the links and search for the information should be extracted from the source and placed on analytical storage. This is the ETL-process. Different web-sources have they need. The classic web scraper can be classified into two different ways to access their data: either API over HTTP protocol types: web-crawlers and data extractors “Fig. 1”. or HTML source code parsing. The article is devoted to the approach of high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are: load balancing, two levels of data storage, and separating the process of downloading files from the process of scraping. The approach is implemented in the solution with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and СephFS.
    [Show full text]
  • Frontera-Open Source Large Scale Web Crawling Framework
    Frontera: Large-Scale Open Source Web Crawling Framework Alexander Sibiryakov, 20 July 2015 [email protected] Hola los participantes! • Born in Yekaterinburg, RU • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts. 2 «A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.». –Wikipedia: Web Crawler article, July 2015 3 Motivation • Client needed to crawl 1B+ pages/week, and identify frequently changing HUB pages. • Scrapy wasn’t suitable for broad crawling and had no crawl frontier capabilities* • People were tend to favor Apache Nutch instead of Scrapy. Hyperlink-Induced Topic Search, Jon Kleinberg, 1999 4 tophdart.com Frontera: single-threaded and distributed • Frontera is all about knowing what to crawl next and when to stop. • Single-Threaded mode can be used for up to 100 websites (parallel downloading), • for performance broad crawls there is a distributed mode. 6 Main features • Online operation: scheduling of new batch, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy, HBase is included). • Canonical URLs resolution abstraction: each document has many URLs, which to use? • Scrapy ecosystem: good documentation, big community, ease of customization. 7 Single-threaded use cases • Need of URL metadata and content storage, • Need of isolation of URL ordering/queueing logic from the spider • Advanced URL ordering logic (big websites, or revisiting) 8 Single-threaded architecture 9 Frontera and Scrapy • Frontera is implemented as a set of custom scheduler and spider middleware for Scrapy.
    [Show full text]
  • Distributed Training
    Scalable and Distributed DNN Training on Modern HPC Systems: Challenges and Solutions Keynote Talk at SDAS ‘19 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Understanding the Deep Learning Resurgence • Deep Learning is a sub-set of Machine Learning – But, it is perhaps the most radical and revolutionary subset – Automatic feature extraction vs. hand-crafted features • Deep Learning – A renewed interest and a lot of hype! – Key success: Deep Neural Networks (DNNs) – Everything was there since the late 80s except the “computability of DNNs” Courtesy: http://www.deeplearningbook.org/contents/intro.html Network Based Computing Laboratory SDAS (June ‘19) 2 Deep Learning Use Cases and Growth Trends Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/ Network Based Computing Laboratory SDAS (June ‘19) 3 Increasing Usage of HPC, Big Data and Deep Learning Big Data HPC (Hadoop, Spark, (MPI, RDMA, HBase, Lustre, etc.) Memcached, etc.) Convergence of HPC, Big Deep Learning Data, and Deep Learning! (Caffe, TensorFlow, BigDL, etc.) Increasing Need to Run these applications on the Cloud!! Network Based Computing Laboratory SDAS (June ‘19) 4 Newer Workflows - Deep Learning over Big Data (DLoBD) • Deep Learning over Big Data (DLoBD) is one of the most efficient analyzing paradigms • More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) start running over big data stacks, such as Apache Hadoop and Spark
    [Show full text]
  • HPC and AI Middleware for Exascale Systems and Clouds
    HPC and AI Middleware for Exascale Systems and Clouds Talk at HPC-AI Advisory Council UK Conference (October ‘20) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] Follow us on http://www.cse.ohio-state.edu/~panda https://twitter.com/mvapich High-End Computing (HEC): PetaFlop to ExaFlop 100 PetaFlops in 415 Peta 2017 Flops in 2020 (Fugaku in Japan with 7.3M cores 1 ExaFlops Expected to have an ExaFlop system in 2021! Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 2 Increasing Usage of HPC, Big Data and Deep/Machine Learning Big Data (Hadoop, Spark, HPC HBase, (MPI, PGAS, etc.) Memcached, etc.) Convergence of HPC, Big Deep/Machine Data, and Deep/Machine Learning Learning! (TensorFlow, PyTorch, BigDL, cuML, etc.) Increasing Need to Run these applications on the Cloud!! Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 3 Converged Middleware for HPC, Big Data and Deep/Machine Learning? Physical Compute Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 4 Converged Middleware for HPC, Big Data and Deep/Machine Learning? Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 5 Converged Middleware for HPC, Big Data and Deep/Machine Learning? Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 6 Converged Middleware for HPC, Big Data and Deep/Machine Learning? Hadoop Job Deep/Machine Learning Job Spark Job Network Based Computing Laboratory HPC-AI-UK (Oct ‘20) 7 Presentation Overview • MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness • HiBD Project – High-Performance
    [Show full text]
  • Vergleich Aktueller Web-Crawling-Werkzeuge
    Hochschule Wismar University of Applied Sciences Technology, Business and Design Fakult¨at fur¨ Ingenieurwissenschaften, Bereich EuI Bachelor-Thesis Vergleich aktueller Web-Crawling-Werkzeuge Gedruckt am: 30. April 2021 Eingereicht am: von: Christoph Werner Betreuende Professorin: Prof. Dr.-Ing. Antje Raab-Dusterh¨ ¨oft Zweitbetreuer: Prof. Dr.-Ing. Matthias Kreuseler Aufgabenstellung Das Ziel der Bachelor Thesis ist ein Vergleich aktueller Web-Crawling-Werkzeuge. Die Crawler sind hinsichtlich ihrer Funktionsweise und Crawling-Ergebnisse zu ver- gleichen und zu bewerten. Des Weiteren ist ein Tool zu konzeptionieren, welches diese Werkzeuge fur¨ die Nutzung in Laborpraktika verfugbar¨ macht. 3 Kurzreferat Webcrawling im Verbund mit Webscraping ist eine der grundlegenden Technologien, um gezielt und automatisiert Daten aus dem Internet zu sammeln. Zun¨achst wird in dieser Thesis auf die Grundlagen des Crawlings und Scrapings eingegangen. Hierbei soll besonderes Augenmerk auf der Architektur und Funkti- onsweise eines Crawlers, dem Robots Exclusion Protocol (REP), zu bedenkende Sicherheitsaspekte, sowie Anti-Crawling/Scraping-Maßnahmen liegen. Darauf aufbauend werden verschiedenste Crawling Frameworks auf Grundlage ihrer Dokumentation bewertet und gegenubergestellt.¨ Abschließend wird ein Tool mit einer grafischen Benutzeroberfl¨ache (GUI) zum Ver- gleich von verschiedenen Crawling Frameworks entwickelt. 4 Abstract Web crawling in combination with web scraping is the key technology for the targeted and automated collection of data from the World Wide Web. First of all, this thesis deals with the basics of crawling and scraping. Special at- tention will be paid to the architecture and functionality of a crawler, the Ro- bots Exclusion Protocol (REP), security aspects to be considered as well as anti- crawling/scraping measures. Based on this, various crawling frameworks are evaluated and compared on the basis of their documentation.
    [Show full text]
  • Parsl Documentation Release 1.1.0
    Parsl Documentation Release 1.1.0 The Parsl Team Sep 24, 2021 CONTENTS 1 Quickstart 3 1.1 Installation................................................3 1.2 Getting started..............................................4 1.3 Tutorial..................................................4 1.4 Usage Tracking..............................................4 1.5 For Developers..............................................5 2 Parsl tutorial 7 2.1 Configuring Parsl.............................................7 2.2 Python Apps...............................................8 2.3 Bash Apps................................................8 2.4 Passing data...............................................9 2.5 AppFutures................................................9 2.6 DataFutures................................................ 10 2.7 Files................................................... 11 2.8 Remote Files............................................... 11 2.9 Sequential workflow........................................... 12 2.10 Parallel workflow............................................. 12 2.11 Parallel dataflow............................................. 13 2.12 Monte Carlo workflow.......................................... 14 2.13 Local execution with threads....................................... 15 2.14 Local execution with pilot jobs..................................... 15 3 User guide 17 3.1 Overview................................................. 17 3.2 Apps................................................... 23 3.3 Futures.................................................
    [Show full text]
  • Frontera Documentation Release 0.8.0
    Frontera Documentation Release 0.8.0 ScrapingHub Jul 30, 2018 Contents 1 Introduction 3 1.1 Frontera at a glance...........................................3 1.2 Run modes................................................5 1.3 Quick start single process........................................6 1.4 Quick start distributed mode.......................................8 1.5 Cluster setup guide............................................ 10 2 Using Frontera 15 2.1 Installation Guide............................................ 15 2.2 Crawling strategies............................................ 16 2.3 Frontier objects.............................................. 16 2.4 Middlewares............................................... 19 2.5 Canonical URL Solver.......................................... 22 2.6 Backends................................................. 23 2.7 Message bus............................................... 27 2.8 Writing custom crawling strategy.................................... 30 2.9 Using the Frontier with Scrapy...................................... 35 2.10 Settings.................................................. 37 3 Advanced usage 51 3.1 What is a Crawl Frontier?........................................ 51 3.2 Graph Manager.............................................. 52 3.3 Recording a Scrapy crawl........................................ 58 3.4 Fine tuning of Frontera cluster...................................... 59 3.5 DNS Service............................................... 60 4 Developer documentation 63 4.1
    [Show full text]
  • Frontera Documentation 0.6.0
    Frontera Documentation 0.6.0 ScrapingHub 2017 02 13 Contents 1 3 1.1 Frontera at a glance...........................................3 1.2 Run modes................................................5 1.3 Quick start single process........................................7 1.4 Quick start distributed mode.......................................8 1.5 Cluster setup guide............................................9 2 Frontera 13 2.1 Installation Guide............................................ 13 2.2 Frontier objects.............................................. 13 2.3 Middlewares............................................... 14 2.4 Canonical URL Solver.......................................... 15 2.5 Backends................................................. 16 2.6 Message bus............................................... 18 2.7 Crawling strategy............................................. 19 2.8 Using the Frontier with Scrapy...................................... 20 2.9 Settings.................................................. 23 3 33 3.1 What is a Crawl Frontier?........................................ 33 3.2 Graph Manager.............................................. 34 3.3 Recording a Scrapy crawl........................................ 40 3.4 Fine tuning of Frontera cluster...................................... 42 3.5 DNS Service............................................... 43 4 45 4.1 Architecture overview.......................................... 45 4.2 Frontera API............................................... 47 4.3 Using the
    [Show full text]
  • High Performance Distributed Web-Scraper
    Труды ИСП РАН, том 33, вып. 3, 2021 г. // Trudy ISP RAN/Proc. ISP RAS, vol. 33, issue 3, 2021 Eyzenakh D.S., Rameykov A.S., Nikiforov I.V. High performance distributed web-scraper. Trudy ISP RAN/Proc. ISP RAS, vol. 33, issue 3, 2021, pp. 87-100 Для цитирования: Эйзенах Д.С., Рамейков А.С., Никифоров И.В. Высокопроизводительный распределенный веб-скрапер. Труды ИСП РАН, том 33, вып. 3, 2021 г., стр. 87-100 (на английском языке). DOI: 10.15514/ISPRAS–2021–33(3)–7. DOI: 10.15514/ISPRAS-2021-33(3)-7 1. Introduction Due to the rapid development of the network, the World Wide Web has become a carrier of a large High Performance Distributed Web-Scraper amount of information. The data extraction and use of information has become a huge challenge nowadays. Traditional access to the information through browsers like Chrome, Firefox, etc. can provide a comfortable user experience with web pages. Web sites have a lot of information and sometimes haven’t got any instruments to access over the API and preserve it in analytical storage. D.S. Eyzenakh, ORCID: 0000-0003-1584-1745 <[email protected]> The manual collection of data for further analysis can take a lot of time and in the case of semi- A.S. Rameykov, ORCID: 0000-0001-7989-6732 <[email protected]> structured or unstructured data types the collection and analyzing of data can become even more I.V. Nikiforov, ORCID: 0000-0003-0198-1886 <[email protected]> difficult and time-consuming. The person who manually collects data can make mistakes Peter the Great St.Petersburg Polytechnic University (duplication, typos in the text, etc.) as far as the process is error-prone.
    [Show full text]
  • Diapositivo 1
    Instituto Politécnico de Tomar Introduction to Information Retrieval Data Acquisition Ricardo Campos Lic ITM – Técnicas Avançadas de Programação Abrantes, Portugal, 2019 This presentation was developed by Ricardo Campos, Professor of ICT of the Polytechnic Institute of Tomar and researcher of LIAAD - INESC TEC. Part of the slides used in this presentation were adapted from presentations found in internet and from reference bibliography: Web crawling and system administration What is Information Retrieval? Please refer to the following when using this presentation: Campos, Ricardo. (2019). A .ppt version of this presentation can be provided upon request by sending an email to [[email protected]] What is Information Retrieval? AGENDA What is this talk about? Data Acquisition APIs Web Scraping 1 2 3 Web Crawling Web Dynamics Web Archives 4 5 6 What is Information Retrieval? Example scenario: Your small business’s website has a form used to sign clients up for appointments. You want to give your clients the ability to automatically create a Google calendar event with the details for that appointment. An application program interface (API) is a set of routines, protocols, API Use: The idea is to have your website’s server send an API and tools for building software request directly to Google’s. Your server would then receive applications. Basically, an API Google’s response, process it, and send back relevant specifies how software information to the browser, such as a confirmation message to components should interact the user. What is the difference between an API and You’ve probably heard of any other remote server? companies packaging APIs as To render the whole web page, your browser expects a response products.
    [Show full text]