Slide 1

Instituto Politécnico de Tomar
Introduction to Information Retrieval
Data Acquisition
Ricardo Campos
Lic ITM – Técnicas Avançadas de Programação
Abrantes, Portugal, 2019

This presentation was developed by Ricardo Campos, Professor of ICT at the Polytechnic Institute of Tomar and researcher at LIAAD - INESC TEC. Part of the slides used in this presentation were adapted from presentations found on the internet and from the reference bibliography.

What is Information Retrieval?

Please refer to the following when using this presentation: Campos, Ricardo. (2019). A .ppt version of this presentation can be provided upon request by sending an email to [email protected]

AGENDA
What is this talk about?
1 Data Acquisition
2 APIs
3 Web Scraping
4 Web Crawling
5 Web Dynamics
6 Web Archives

APIs

An application program interface (API) is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact.

You've probably heard of companies packaging APIs as products. For example, Weather Underground sells access to its weather data API.

Example scenario: Your small business's website has a form used to sign clients up for appointments. You want to give your clients the ability to automatically create a Google Calendar event with the details for that appointment.

API use: The idea is to have your website's server send an API request directly to Google's server. Your server would then receive Google's response, process it, and send back relevant information to the browser, such as a confirmation message to the user.

What is the difference between an API and any other remote server?
To render a whole web page, your browser expects a response in HTML, which contains presentational code, while a Google Calendar API call would just return the data, likely in a format like JSON. If your website's server is making the API request, then your website's server is the client (similar to your browser being the client when you use it to navigate to a website).

To summarize, when a company offers an API to its customers, it just means that it has built a set of dedicated URLs that return pure data responses, meaning the responses won't contain the kind of presentational overhead that you would expect in a graphical user interface like a website.

Can you make these requests with your browser?
Often, yes. Since the actual HTTP transmission happens in text, your browser will always do the best it can to display the response. For example, you can access GitHub's API directly with your browser without even needing an access token. Visiting a GitHub user's API route in your browser (https://api.github.com/users/rcampos) returns a JSON response.

https://youtu.be/s7wmiS2mSXY

Data acquisition can be done by means of a browser, a crawler or a web scraper.

The following slides (entitled Arquivo.pt) were kindly provided by Fernando Melo (Research & Development of Arquivo.pt - the Portuguese web-archive) and by Daniel Bicho (Web crawling and system administration of Arquivo.pt - the Portuguese web-archive).

arquivo.pt/api
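Before the individual Arquivo.pt endpoints, the general pattern described earlier (a client requests a dedicated URL and receives pure data, typically JSON, instead of HTML) can be sketched in Python. This is only a minimal sketch using the standard library, not any provider's official client: the helper names `build_api_request`, `fetch_json` and `summarize` are hypothetical, and the `User-Agent` value is an arbitrary placeholder.

```python
import json
from urllib.request import Request, urlopen


def build_api_request(url):
    """Prepare a GET request that asks the server for JSON rather than HTML."""
    headers = {
        "Accept": "application/json",
        "User-Agent": "ir-course-example",  # placeholder value (assumption)
    }
    return Request(url, headers=headers)


def fetch_json(url, timeout=10):
    """Send the request and decode the JSON body (needs network access)."""
    with urlopen(build_api_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(payload, keys):
    """Keep only the requested keys of a decoded JSON object."""
    return {key: payload.get(key) for key in keys}


# Usage, with network access:
# user = fetch_json("https://api.github.com/users/rcampos")
# print(summarize(user, ["login", "public_repos"]))
```

Here the script plays the role of the client, just as the website's server does in the appointment scenario.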
URL search (arquivo.pt/api)

List all preserved versions of the URL http://www.ipt.pt:
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt
Default parameters:
• offset = 0 (first result)
• maxItems = 50 (number of results)

https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&offset=100
Parameters:
• offset = 100 (start at result 100)
• maxItems = 50 (number of results)

https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&maxItems=1000
Parameters:
• maxItems = 1000 (number of results)

https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&maxItems=1000&from=2010&to=2016
Parameters:
• maxItems = 1000 (number of results)
• from and to: dates between 2010 and 2016

https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&from=20010224174130&to=20150424181416
Parameters:
• from and to: dates between 24/02/2001 and 24/04/2015

http://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&fields=originalURL,tstamp
Parameters:
• fields: filter the response to show only the preserved page URL (originalURL) and the date on which it was preserved (tstamp)

Extract text from a specific URL version:
http://arquivo.pt/textextracted?m=http://www.ipt.pt/20161111160133
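The version-history URLs above differ only in their query parameters, so they can be assembled programmatically. The following is a minimal sketch, assuming the parameter names shown on the slides (`versionHistory`, `offset`, `maxItems`, `from`, `to`, `fields`); the function name `version_history_url` is hypothetical, and leaving `:`, `/` and `,` unescaped simply mirrors how the slides write the URLs.

```python
from urllib.parse import urlencode

BASE_URL = "https://arquivo.pt/textsearch"


def version_history_url(url, offset=None, max_items=None,
                        date_from=None, date_to=None, fields=None):
    """Build a version-history query for `url`, adding only the given options."""
    params = {"versionHistory": url}
    if offset is not None:
        params["offset"] = offset
    if max_items is not None:
        params["maxItems"] = max_items
    if date_from is not None:
        params["from"] = date_from      # e.g. "2010" or "20010224174130"
    if date_to is not None:
        params["to"] = date_to
    if fields is not None:
        params["fields"] = ",".join(fields)
    # Keep ':', '/' and ',' readable, as in the URLs shown on the slides.
    return BASE_URL + "?" + urlencode(params, safe=":/,")


print(version_history_url("http://www.ipt.pt", max_items=1000,
                          date_from="2010", date_to="2016"))
print(version_history_url("http://www.ipt.pt",
                          fields=["originalURL", "tstamp"]))
```

Omitting an option leaves the corresponding default (for example offset 0 and 50 results) to the service.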
Text Search (arquivo.pt/api)

http://arquivo.pt/textsearch?q=instituto politécnico de tomar
Default parameters:
• offset = 0 (first result)
• itemsPerPage = 50 (number of results)

Response item: snippet
Response item: link to extracted text

Iterate over the results 50 at a time by changing the offset parameter (up to a maximum of 1950):
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=0&maxItems=50
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=50&maxItems=50
...
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=1950&maxItems=50

http://arquivo.pt/textsearch?q=instituto politécnico de tomar&maxItems=2000
Parameters:
• maxItems = 2000 (maximum number of results, although only 200 is recommended)

http://arquivo.pt/textsearch?q="instituto politécnico de tomar"
Parameters:
• "": search for an exact expression

http://arquivo.pt/textsearch?q=instituto politécnico de tomar&type=pdf
Parameters:
• type: file format (pdf)

Images Search (arquivo.pt/api)

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&type=png
Parameters:
• type: file format (png)

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&size=sm
Parameters (size):
• sm (small image size)
• md (medium image size)
• lg (large image size)
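The text search pagination described earlier (pages of 50 results, with the offset going up to 1950) can be sketched as a generator of request URLs. This is a sketch under stated assumptions: the function name `text_search_urls` is hypothetical, the `q`, `offset` and `maxItems` parameters are taken from the slides, and percent-encoding the query is an addition here (the slides show raw spaces, which a browser would encode automatically).

```python
from urllib.parse import quote, urlencode

TEXT_SEARCH_URL = "https://arquivo.pt/textsearch"


def text_search_urls(query, page_size=50, last_offset=1950):
    """Yield one text search URL per page of results."""
    for offset in range(0, last_offset + 1, page_size):
        params = {"q": query, "offset": offset, "maxItems": page_size}
        # quote_via=quote encodes spaces as %20 instead of '+'.
        yield TEXT_SEARCH_URL + "?" + urlencode(params, quote_via=quote)


urls = list(text_search_urls("instituto politécnico de tomar"))
print(len(urls))   # number of pages to request
print(urls[0])
print(urls[-1])
```

Each URL would then be fetched in turn, stopping early once a page comes back with no results.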
Images Search (arquivo.pt/api)

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&safeSearch=off
Parameters:
• safeSearch = off: show all images, including those that were automatically classified as Not Safe for Work

https://www.youtube.com/watch?v=SibIAGU7sE0
Complete video: https://www.youtube.com/watch?v=fNpUllm0dvc

https://www.youtube.com/watch?v=rbCqcSpRjEc
Complete video: https://www.youtube.com/watch?v=FASxveLn6is

Prémios Arquivo.pt 2018

Contamehistorias.pt

https://www.youtube.com/watch?v=b4_ZBr2ijVw&feature=youtu.be

Web Scraping

When performing data science tasks, it's common to want to use data found on the internet. You'll usually be able to access this data in CSV format, or via an Application Programming Interface (API). Some websites make your life easier by offering an API: an interface that you can use to download data. For instance, Twitter provides an API to access its data.

However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you'll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.

"Web scraping is a computer software technique of extracting information from websites"

It focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.

• When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types.

Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.
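To illustrate the idea of turning HTML tags into structured data, here is a minimal sketch that reads the rows of an HTML table into a list of dictionaries using only Python's standard `html.parser` module. The class name `TableParser`, the function `scrape_table` and the sample HTML are made up for illustration; a real page would first be downloaded with a GET request, and dedicated scraping libraries offer more robust parsing.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Illustrative HTML only; not taken from any real website.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>year</th></tr>
  <tr><td>Tomar</td><td>2019</td></tr>
  <tr><td>Abrantes</td><td>2019</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
print(records)
```

The resulting records could then be written to a spreadsheet or database, which is the "structured data" step described above.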
