Instituto Politécnico de Tomar

Introduction to Information Retrieval

Data Acquisition
Ricardo Campos

Lic ITM – Técnicas Avançadas de Programação
Abrantes, Portugal, 2019

This presentation was developed by Ricardo Campos, Professor of ICT at the Polytechnic Institute of Tomar and researcher at LIAAD - INESC TEC. Part of the slides used in this presentation were adapted from presentations found on the internet and from the reference bibliography:

Web crawling and system administration

What is Information Retrieval? Please refer to the following when using this presentation: Campos, Ricardo. (2019).

A .ppt version of this presentation can be provided upon request by sending an email to [[email protected]]

AGENDA: What is this talk about?

1. Data Acquisition
2. APIs
3. Web Scraping
4. Web Crawling
5. Web Dynamics
6. Web Archives

Example scenario: Your small business's website has a form used to sign clients up for appointments. You want to give your clients the ability to automatically create a Google Calendar event with the details for that appointment.

An application program interface (API) is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact.

You've probably heard of companies packaging APIs as products. For example, Weather Underground sells access to its weather data API.

API use: The idea is to have your website's server send an API request directly to Google's server. Your server would then receive Google's response, process it, and send back relevant information to the browser, such as a confirmation message to the user.

What is the difference between an API and any other remote server?
To render the whole web page, your browser expects a response in HTML, which contains presentational code, while a Google Calendar API call would just return the data, likely in a format like JSON. If your website's server is making the API request, then your website's server is the client (similar to your browser being the client when you use it to navigate to a website). To summarize, when a company offers an API to its customers, it just means that they've built a set of dedicated URLs that return pure data responses, meaning the responses won't contain the kind of presentational overhead that you would expect in a graphical user interface like a website.

Can you make these requests with your browser?
Often, yes. Since the actual HTTP transmission happens in text, your browser will always do the best it can to display the response. For example, you can access GitHub's API directly with your browser without even needing an access token: visiting a GitHub user's API route (https://api.github.com/users/rcampos) returns a JSON response.
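The same request can also be made from a program instead of a browser. The sketch below is only an illustration: the helper names (build_user_url, parse_user, fetch_user) are hypothetical, and it keeps just three fields (login, name, public_repos) from the JSON body.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"  # public GitHub REST API root


def build_user_url(username: str) -> str:
    """Return the API route for a GitHub user profile."""
    return f"{API_ROOT}/users/{username}"


def parse_user(json_text: str) -> dict:
    """Keep a few fields from the JSON body returned by the user route."""
    data = json.loads(json_text)
    return {key: data.get(key) for key in ("login", "name", "public_repos")}


def fetch_user(username: str) -> dict:
    """Send the GET request (needs network access) and parse the response."""
    request = Request(build_user_url(username), headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return parse_user(response.read().decode("utf-8"))


# Example call (not executed here because it needs network access):
# print(fetch_user("rcampos"))
```

Here the program plays the role of the client, just as the browser does when it visits the same URL.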

https://youtu.be/s7wmiS2mSXY

Data acquisition can be done by means of a browser, a crawler or a web scraper.

The following slides (entitled Arquivo.pt) were kindly provided by Fernando Melo (Research & Development of Arquivo.pt - the Portuguese web-archive) and by Daniel Bicho (Web crawling and system administration of Arquivo.pt - the Portuguese web-archive)

arquivo.pt/api

URL

List all preserved versions of URL: http://www.ipt.pt
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt

Default Parameters:
• offset = 0 (first result)
• maxItems = 50 (number of results)

List all preserved versions of URL: http://www.ipt.pt
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&offset=100

Parameters:
• offset = 100 (start at result 100)
• maxItems = 50 (number of results)

List all preserved versions of URL: http://www.ipt.pt
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&maxItems=1000

Parameters:
• maxItems = 1000 (number of results)

List all preserved versions of URL: http://www.ipt.pt
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&maxItems=1000&from=2010&to=2016

Parameters:
• maxItems = 1000 (number of results)
• from and to: date between 2010 and 2016

List all preserved versions of URL: http://www.ipt.pt
https://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&from=20010224174130&to=20150424181416

Parameters:
• from and to: date between 24/02/2001 and 24/04/2015

List all preserved versions of URL: http://www.ipt.pt
http://arquivo.pt/textsearch?versionHistory=http://www.ipt.pt&fields=originalURL,tstamp

Parameters:
• fields: filter the response to show only the preserved page URL (originalURL) and the date (tstamp) on which it was preserved

Extract text from a specific URL version
http://arquivo.pt/textextracted?m=http://www.ipt.pt/20161111160133
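The version-history requests above differ only in their query parameters, so they can be assembled programmatically. This is a minimal sketch: the parameter names (versionHistory, offset, maxItems, from, to, fields) come from the slides, while the function names and the choice to pass dates as datetime objects are assumptions made for illustration.

```python
from datetime import datetime
from urllib.parse import urlencode

TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def to_timestamp(moment: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDHHMMSS value seen in from/to."""
    return moment.strftime("%Y%m%d%H%M%S")


def version_history_url(url, offset=0, max_items=50,
                        date_from=None, date_to=None, fields=None):
    """Build a request URL listing the preserved versions of `url`."""
    params = {"versionHistory": url, "offset": offset, "maxItems": max_items}
    if date_from is not None:
        params["from"] = to_timestamp(date_from)
    if date_to is not None:
        params["to"] = to_timestamp(date_to)
    if fields:
        params["fields"] = ",".join(fields)
    # urlencode percent-encodes reserved characters such as ':' and '/'
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


example = version_history_url(
    "http://www.ipt.pt",
    max_items=1000,
    date_from=datetime(2001, 2, 24, 17, 41, 30),
    date_to=datetime(2015, 4, 24, 18, 14, 16),
    fields=["originalURL", "tstamp"],
)
print(example)
```

Percent-encoding the nested URL is the safer choice when building requests in code; the slides show the unencoded form because a browser performs this step automatically.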

Text Search

http://arquivo.pt/textsearch?q=instituto politécnico de tomar

Default Parameters:
• offset = 0 (first result)
• itemsPerPage = 50 (number of results)

Response Item – Snippet

Response Item – Link to Extracted Text

Text Search

Iterate every 50 results by changing the offset parameter (up to a maximum of 1950):
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=0&maxItems=50
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=50&maxItems=50
...
https://arquivo.pt/textsearch?q=instituto politécnico de tomar&offset=1950&maxItems=50

http://arquivo.pt/textsearch?q=instituto politécnico de tomar&maxItems=2000

Parameters:
• maxItems = 2000 (maximum number of results, although only 200 is recommended)
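Iterating over the result pages amounts to stepping the offset parameter in increments of the page size. A minimal sketch follows, assuming the limits quoted above (pages of 50, last offset 1950); the generator name paged_search_urls is hypothetical, and the query is URL-encoded because it contains spaces and an accented character.

```python
from urllib.parse import urlencode


def paged_search_urls(query, page_size=50, last_offset=1950):
    """Yield one text-search URL per page: offset 0, 50, ..., last_offset."""
    for offset in range(0, last_offset + 1, page_size):
        params = {"q": query, "offset": offset, "maxItems": page_size}
        yield "https://arquivo.pt/textsearch?" + urlencode(params)


urls = list(paged_search_urls("instituto politécnico de tomar"))
print(len(urls))   # number of pages to request
print(urls[0])
print(urls[-1])
```

Each URL would then be fetched in turn, ideally with a pause between requests.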

Text Search

http://arquivo.pt/textsearch?q=“instituto politécnico de tomar”

Parameters:
• “”: search for an exact expression

http://arquivo.pt/textsearch?q=instituto politécnico de tomar&type=pdf

Parameters:
• type: file format (pdf)

Images Search

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&type=png

Parameters:
• type: file format (png)

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&size=sm

Parameters (size):
• sm (small image size)
• md (medium image size)
• lg (large image size)

https://arquivo.pt/imagesearch?q=instituto politécnico de tomar&safeSearch=off

Parameters:
• safeSearch = off: show all images, including those that were automatically classified as Not Safe for Work

https://www.youtube.com/watch?v=SibIAGU7sE0
Complete Video: https://www.youtube.com/watch?v=fNpUllm0dvc

https://www.youtube.com/watch?v=rbCqcSpRjEc
Complete Video: https://www.youtube.com/watch?v=FASxveLn6is

Prémios Arquivo.pt 2018 (Arquivo.pt Awards 2018)

Contamehistorias.pt

https://www.youtube.com/watch?v=b4_ZBr2ijVw&feature=youtu.be

When performing data science tasks, it's common to want to use data found on the internet. Some websites make this easier by offering an API, an interface that you can use to download data. For instance, Twitter provides an API to access its data.

However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you'll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.

“Web scraping is a computer software technique of extracting information from websites”

• When performing data science tasks, you'll usually be able to access the data you need in CSV format, or via an Application Programming Interface (API).
• When the data can only be accessed as part of a web page, web scraping gets it from the page into a format you can work with in your analysis.

It focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.
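As an illustration of turning HTML into structured data, the sketch below extracts the rows of an HTML table and writes them as CSV, using only the Python standard library. The sample page and its values are made up for the example, and the class and function names are hypothetical; a real scraper would more likely rely on a dedicated library such as BeautifulSoup.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up page fragment used only for this example
SAMPLE_PAGE = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>pen</td><td>1.50</td></tr>
  <tr><td>book</td><td>12.00</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html: str) -> str:
    """Return the table rows found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(html_table_to_csv(SAMPLE_PAGE))
```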

• When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:
  • HTML: contains the main content of the page.
  • CSS: adds styling to make the page look nicer.
  • JS: JavaScript files add interactivity to web pages.
  • Images: image formats, such as JPG and PNG, allow web pages to show pictures.
• After our browser receives all the files, it renders the page and displays it to us. There's a lot that happens behind the scenes to render a page nicely, but we don't need to worry about most of it when we're web scraping. When we perform web scraping, we're interested in the main content of the web page, so we look at the HTML.

Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn't a programming language like Python; instead, it's a markup language that tells a browser how to lay out content.

• HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on.
• The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML.
• Right inside an html tag, we put two other tags, the head tag and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn't useful in web scraping.
• We'll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph.

Tags have commonly used names that depend on their position in relation to other tags:

• Child: a child is a tag inside another tag. For example, two p tags placed inside body are both children of the body tag.
• Parent: a parent is the tag another tag is inside. The body tag is the parent of the two p tags.
• Sibling: a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they're both inside html. Both p tags are siblings, since they're both inside body.
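These relationships can be made concrete by parsing a small page and recording which tag each tag is nested in. The page below is a made-up example with a head, a body and two paragraphs, and the TreeBuilder class is a hypothetical helper built on the standard html.parser module.

```python
from html.parser import HTMLParser

# Made-up page used only for this example
PAGE = """<html>
<head></head>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""


class TreeBuilder(HTMLParser):
    """Record, for every parent tag, the names of its direct children."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags that are currently open
        self.children = {}   # parent tag name -> list of child tag names

    def handle_starttag(self, tag, attrs):
        if self.stack:
            # the innermost open tag is the parent of the new tag
            self.children.setdefault(self.stack[-1], []).append(tag)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


builder = TreeBuilder()
builder.feed(PAGE)
print(builder.children)
```

Tags listed under the same key are siblings: head and body under html, and the two p tags under body.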

We can also add properties to HTML tags that change their behavior. For example, a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

There are also special properties like class and id which give HTML elements names, and make them easier to interact with when we're scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.
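A scraper typically uses exactly these properties to locate content. The sketch below collects every href, id and class found in a made-up page; the page content, class names and the AttributeCollector helper are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Made-up page used only for this example
PAGE = """<html><body>
<p class="bold-paragraph extra-large">Intro text</p>
<p id="first" class="bold-paragraph">More text</p>
<a href="https://www.ipt.pt">IPT</a>
<a href="https://arquivo.pt">Arquivo.pt</a>
</body></html>"""


class AttributeCollector(HTMLParser):
    """Index a page by link target, id and class."""

    def __init__(self):
        super().__init__()
        self.links = []     # href values of a tags, in page order
        self.ids = {}       # id -> tag name (an id is unique on a page)
        self.classes = {}   # class name -> tag names using that class

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        if attributes.get("id"):
            self.ids[attributes["id"]] = tag
        # one element can carry several space-separated classes
        for name in (attributes.get("class") or "").split():
            self.classes.setdefault(name, []).append(tag)


collector = AttributeCollector()
collector.feed(PAGE)
print(collector.links)
print(collector.ids)
print(collector.classes)
```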

a and p are extremely common HTML tags. Here are a few others:
• div: indicates a division, or area, of the page;
• b: bolds any text inside;
• i: italicizes any text inside;
• table: creates a table;
• form: creates an input form.

For a full list of tags, look here.

I encourage you to inspect a web page and view its source code to understand more about HTML.

https://zapier.com/blog/inspect-element-tutorial/

A Web crawler is a software application which systematically browses the Web for the purpose of Web indexing.

“I have written a Perl script that wanders the WWW collecting URLs, keeping tracking of where it’s been and new hosts that it finds. Eventually, after hacking up the code to return some slightly more useful information (currently it just return URLs), I will produce a searchable index of this.”

Matthew Gray, WWW-talk mailing list, June 1993

Crawling Steps

1. Start: initialize the frontier with the seed pages.
2. If the stop criterion is met, end.
3. Pick the first unvisited URL from the frontier.
4. Fetch the page (request to the Web: Internet / Intranet) and parse it.
5. Save the page content in the repository.
6. Add unseen URLs to the frontier and go back to step 2.

The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.

Crawlers start from a set of seed pages (the initial frontier) and then gradually expand it.
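The crawling loop can be sketched in a few lines. This is an illustration under stated assumptions, not a production crawler: fetching and parsing are replaced by a lookup in a made-up in-memory "web" (TOY_WEB), the repository is a plain list, and the stop criterion is simply a page limit.

```python
from collections import deque

# Made-up link graph standing in for real pages: URL -> links found on it
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["b.example/x", "a.example/1"],
    "b.example/x": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Crawl from `seeds`; `fetch_links(url)` returns the links on a page."""
    frontier = deque(seeds)   # URLs seen but not yet crawled
    seen = set(seeds)         # every URL ever added to the frontier
    repository = []           # crawled pages, in crawl order
    while frontier and len(repository) < max_pages:   # stop criterion
        url = frontier.popleft()      # pick the first unvisited URL
        links = fetch_links(url)      # fetch and parse the page
        repository.append(url)        # save the page
        for link in links:
            if link not in seen:      # add only unseen URLs
                seen.add(link)
                frontier.append(link)
    return repository


print(crawl(["a.example/"], lambda url: TOY_WEB.get(url, [])))
```

In a real crawler, fetch_links would download the page over HTTP and extract its links, and the repository would store the page content.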

Crawling - Types

• Archive Crawlers
• Vertical Search Engines
• General Mirroring Systems
• Feed Crawlers
• ShopBot

Breadth First Search: Implemented with QUEUE (FIFO)

[Figure: crawl graph expanding from the seed. Legend: Seed, Visited Page, Unvisited Page, New URL, Duplicated URL]

Depth First Search: Implemented with STACK (LIFO)
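The only difference between the two strategies is which end of the frontier the next URL is taken from. The sketch below shows this on a made-up link graph (LINKS); the function name visit_order is hypothetical.

```python
from collections import deque

# Made-up link graph: page -> pages it links to
LINKS = {
    "seed": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [],
    "D": [],
    "E": [],
}


def visit_order(graph, seed, strategy="bfs"):
    """Return the order in which pages are visited from `seed`."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # queue (FIFO) for breadth first, stack (LIFO) for depth first
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:      # duplicated URLs are skipped
                seen.add(link)
                frontier.append(link)
    return order


print(visit_order(LINKS, "seed", "bfs"))
print(visit_order(LINKS, "seed", "dfs"))
```

Breadth first visits all pages one link away from the seed before going deeper, while depth first follows one branch to its end before backtracking.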

To fetch 20,000,000,000 pages in one month, we need to fetch almost 8000 pages per second!
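The figure follows from dividing the number of pages by the number of seconds in a month. A quick check, assuming a 30-day month:

```python
PAGES = 20_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assuming a 30-day month

pages_per_second = PAGES / SECONDS_PER_MONTH
print(SECONDS_PER_MONTH)         # 2592000
print(round(pages_per_second))   # 7716
```

That is roughly 7,700 pages per second, sustained around the clock.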

Be polite

• Don’t hit a site too often: do not fetch more than one page at a time from a particular web server;
• Wait at least a few seconds, or even minutes, between requests to the same web server;
• The robots.txt file can be used to control crawlers.

Crawlers can cause trouble, even unwillingly, if not properly designed to be “polite” and “ethical”. For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack! Server administrators and users will be upset, and the crawler may be blacklisted.

http://www.robotstxt.org/
https://www.seomarketing.com.br/robots.txt.php

To illustrate:

• Google (googlebot)
• Yahoo! Search (Slurp!)

robots.txt
User-agent: *
Disallow: /data/private
Disallow: /cgi-bin
Allow: /cgi-bin/help/*
Crawl-delay: 5
User-agent: Googlebot
Disallow: /data/users
Sitemap: /lst/sitemap-index.xml
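A crawler can check rules like the robots.txt example above before each request. The sketch below uses Python's standard urllib.robotparser; the host www.example.org, the agent name MyCrawler and the helper allowed are assumptions for illustration. Parsers differ in how they resolve overlapping Allow and Disallow lines and wildcards, so the example only queries unambiguous paths.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /data/private
Disallow: /cgi-bin
Allow: /cgi-bin/help/*
Crawl-delay: 5

User-agent: Googlebot
Disallow: /data/users

Sitemap: /lst/sitemap-index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse the text directly, no download


def allowed(agent: str, path: str) -> bool:
    """Tell whether `agent` may fetch `path` on the (hypothetical) host."""
    return parser.can_fetch(agent, "http://www.example.org" + path)


print(allowed("MyCrawler", "/index.html"))
print(allowed("MyCrawler", "/data/private/report.html"))
print(allowed("Googlebot", "/data/users/42"))
print(parser.crawl_delay("MyCrawler"))   # seconds to wait between requests
```

A polite crawler would skip any URL for which allowed returns False and sleep for the crawl delay between requests to the same server.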

Web pages are constantly being added, deleted, and modified. A web crawler must continually revisit pages it has already crawled to see if they have changed, in order to maintain the freshness of the document collection.

• The HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
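A HEAD request returns only the response headers, so a crawler can compare them with the values stored at the previous visit. The sketch below builds such a request with urllib and compares the ETag or Last-Modified validators; the helper names and the fallback of treating a page as changed when no validator is available are assumptions.

```python
from urllib.request import Request


def head_request(url: str) -> Request:
    """Build (but do not send) a HEAD request for `url`."""
    return Request(url, method="HEAD")


def has_changed(previous: dict, current: dict) -> bool:
    """Compare the headers of two HEAD responses for the same page."""
    for header in ("ETag", "Last-Modified"):
        if header in previous and header in current:
            return previous[header] != current[header]
    return True   # no comparable validator: assume the page may have changed


request = head_request("http://www.ipt.pt")
print(request.get_method())
# Sending it needs network access, for example:
# with urllib.request.urlopen(request) as response:
#     headers = dict(response.headers)
```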

• It is not possible to constantly check all pages. A crawler must check important pages and pages that change frequently

• Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency. This gives the crawler a hint about when to check a page for changes
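Reading those hints from a sitemap takes little code. The sketch below parses a made-up sitemap with the standard xml.etree.ElementTree module; the sample URLs and dates are invented, and read_sitemap is a hypothetical helper. The element names (loc, lastmod, changefreq) and the namespace are those of the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap used only for this example
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/</loc>
    <lastmod>2019-03-01</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>http://www.example.org/about.html</loc>
    <changefreq>yearly</changefreq>
  </url>
</urlset>"""


def read_sitemap(xml_text: str) -> list:
    """Return one dict per URL with its location and change hints."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=SITEMAP_NS),
        })
    return entries


for entry in read_sitemap(SAMPLE_SITEMAP):
    print(entry)
```

Missing hints come back as None, so the crawler has to fall back on its own revisit policy for those pages.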

Do we need to crawl the entire web?

There is an abundance of pages on the Web, but some are useless. Thus we should focus on:
• General search engines: pages with high prestige;
• News portals: pages that change often;
• Vertical portals: pages on some topic.

Some sites are difficult for a crawler to find:
• Private sites (intentionally private, e.g., require a login);
• Form results (sites that can be reached only after entering data into a form);
• Scripted pages (pages that use JavaScript, Flash, etc.).

https://rationalwiki.org/wiki/Deep_web

Web Crawler - Schedule

Selection Policy
Limit the crawler in:
● URL path;
● Type of page (html, pdf, etc);
● Conditions of the meta tag;
● Maximum depth;
● Maximum number of hosts or pages or bytes;
● Online selections;
● And others...

Re-visit Policy
● Batch crawling: the entire crawling process is periodically halted and restarted.
● Uniform policy: revisit all pages in the collection with the same frequency.
● Proportional policy: revisit more often the pages that change more frequently. The visit frequency is estimated.

Politeness Policy
A Web Crawler MUST...
● … identify itself as such.
● … keep a low bandwidth usage.
● … obey the robots exclusion protocol.

Crawling - Summary

General Web Crawler

• Efficient: the increasing number of URLs crawled increases the difficulty of avoiding the retrieval of duplicate content.
• Scalable: crawlers must achieve extremely high throughput, which poses difficult engineering problems.
• Freshness: the degree to which the acquired page snapshots remain up-to-date, relative to the current “live” web copies.
• Accurate / Quality: some crawlers are interested in having broad coverage at different quality levels.
• Coverage / Volume: the fraction of desired pages that the crawler acquires successfully.

Crawling - Toolkits

• Apache Nutch
• Sphinx
• BeautifulSoup
• Crawler4j
• Chilkat
• StormCrawler
• DataparkSearch
• Frontera
• JSpider
• pyspider
• JTidy
• ...

For more open source web crawlers, click here!

Every day new pages are added to the web.

Around 40% of the world population has an internet connection today.

The content of the web changes over time with the update of existing web pages.

In such a dynamic environment, web archives are increasingly important for preserving documents and preventing loss of information.

The following slides (entitled Web Archives) were kindly provided by Daniel Gomes (Head of Arquivo.pt - the Portuguese web-archive)

"Gazeta de Lisboa" was the first printed Portuguese newspaper (begun in 1715)

It was suspended in 1762 ... but 300 years later its publications remain accessible.

"Diário Digital" was the first Portuguese newspaper exclusively online (started in 1999)

But it disappeared in 2017

After only 17 years how do we access the publications of the “Diário Digital”?

Blogs, Photos, EBooks, Newspapers

However, this valuable information disappears quickly:
• psd2011.com, 2011
• psd2015.com, 2015
• psd2011.com, 2019

Archive.org: the first archive of the Web

Brewster Kahle founded the Internet Archive in 1996, a non-profit organization with a mission to achieve "universal access to all knowledge."

86 initiatives around the world (2018)

https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives

Free preservation service provided to authors of webpages

5 000 000 000 preserved web files …since 1996

The first Portuguese page (90's)

Results of the 1996 presidential election

International Events
Egypt's Revolution of 2011: web-based revolution, archived web revolution

Printed publications are also preserved by web archives

How to search for archived information?
2001 “Vasco Matos Trigo” Before 1999
2001: past program about the future :-)

Informational: Seek general information on a broad topic, such as leukemia. There is typically not a single web page that contains all the information sought; indeed, users with informational queries typically try to assimilate information from multiple web pages.

Navigational: Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines. In such cases, the user's expectation is that the very first search result should be the home page of Lufthansa. The user is not interested in a plethora of documents containing the term Lufthansa; for such a user, the best measure of user satisfaction is precision at 1.

Transactional: Purchasing a product, downloading a file or making a reservation. In such cases, the search engine should return results listing services that provide form interfaces for such transactions.

Informational – 14% to 38%: Collecting information about a topic written in the past

Navigational – 53% to 81%: Seeing a web page in the past or how it evolved

Transactional – 5% to 16%: Downloading an old file or recovering a site from the past.

Miguel Costa (2014). Information Search in Web Archives. http://sobre.arquivo.pt/wp-content/uploads/presentation-phd-thesis-information-search-in-web.pdf

Information is automatically collected from the Web to be preserved

1. Set of initial addresses (e.g. home pages)
2. Automatic and iterative collection of hyperlinked files

www.site.pt/index.html

logo.gif sobre.html

contactos.html morada.html

Search analogous to search engines but with reproduction of preserved pages

1. Collection (from the live web into ARC files)
2. Indexing (indexes)
3. Search
4. Reproduction and Access

arquivo.pt/sugerir

Types of Capture

Quarterly
• Websites .PT
• Suggested sites in any domain

Daily
• Selection of 300 online publications
• Newspapers, Magazines

Specials
• Elections, Olympic Games
• Research & Development project sites

Training: arquivo.pt/forma

https://www.youtube.com/watch?v=cB_ejnln51A

Complete Video: https://www.youtube.com/watch?v=Mi_OKuBgUWk

Crawling involves fetching web pages;

There are many questions around crawling such as performance issues, freshness, updates;

It is important to have a well-defined crawling policy such that administrators of web servers don’t get angry;

A crawler must carefully choose, at each step, which pages to visit next.

Duplication: Don’t want to fetch the same page twice! Keep a lookup table (hash) of visited pages

Prioritized search: The frontier grows very fast! For large crawls, need to define an exploration policy with priorities, rather than depth first or breadth first

Availability: Fetcher must be robust! Don’t crash if a download fails
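The three points above can be combined in a small frontier structure. This is a sketch under stated assumptions: the class PriorityFrontier, the helper safe_fetch and the numeric priorities are hypothetical, the hash lookup table is a Python set, and priorities are handled with the standard heapq module.

```python
import heapq


class PriorityFrontier:
    """Frontier that skips duplicates and returns the best URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()    # hash-based lookup table of known URLs
        self._counter = 0     # tie-breaker: keeps insertion order

    def add(self, url: str, priority: float) -> bool:
        """Add `url` unless it was already seen; report whether it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to pop the highest first
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self) -> str:
        """Remove and return the URL with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def safe_fetch(fetch, url):
    """Call `fetch(url)` but return None instead of crashing on I/O errors."""
    try:
        return fetch(url)
    except OSError:   # includes urllib.error.URLError and timeouts
        return None


frontier = PriorityFrontier()
frontier.add("news.example/today", priority=0.9)
frontier.add("blog.example/old-post", priority=0.2)
frontier.add("news.example/today", priority=0.5)   # duplicate, ignored
print(frontier.pop())
```

How the priority is computed (prestige, change frequency, topic) depends on the kind of crawler; the structure stays the same.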

The content of web pages changes very frequently, thus keeping this information in web archives is of fundamental importance to prevent loss of information
