CBS Robot Framework (More in the Afternoon) • CBS Robottool, a Tool for Detecting Changes on Websites (Robot-Assisted Price Collection)

CBS Robot Framework (More in the Afternoon) • CBS Robottool, a Tool for Detecting Changes on Websites (Robot-Assisted Price Collection)

Web scraping tools, an introduction ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1 Olav ten Bosch, Statistics Netherlands THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat Outline • Introduction • Scraping tools • Some scraping knowledge • A simple scraping exercise 2 Eurostat Introduction (1) • There are many different tools for scraping available, which differ in their functionality and use. • Tools and frameworks come and go, choose the one that fits the job. • Scraping is not like typical (statistical) IT. The life cycle (design, develop, test, maintain) is much shorter, it might not even be a cycle (one time use). 3 Eurostat Introduction (2) • In this course we mention some tools we came across in the past years which show different functionality without claiming this list is exhaustive or these are the best available tools. • Any tool is useless without some basic knowledge of web technology and internet experience, so we provide you some. • At the end of this session we will do a simple scraping exercise with a freely available web- based scraping tool. 4 Eurostat Introduction (3) • We make a rough distinction between: • Scraping: the actual extraction of data / information from a web page • Crawling: following hyperlinks on the internet to traverse multiple pages and / or sites • Search: using (third party) search engines (such as google) automatically to find information on the web • Many tools offer a mix of these 5 Eurostat Scraping tools (1) • Imacros (imacros.net): • Available for quite some years • Point and click (record and replay) as well as coding via API (application programming interface) • Browser add-in as well as standalone program • Type of functionality: scrape • Free and commercial version • Used by some NSI’s for scraping prices • Easy to start with 6 Eurostat Screendump Imacros(2) 7 Eurostat Imacros: generated code 8 Eurostat Scraping tools (2) • Scrapy(scrapy.org): • Python-based scraping and crawling framework • More IT oriented: coding skills required • Open source • Large user community • Used by some NSI’s for various scraping tasks 9 Eurostat Scrapy example 10 Eurostat Scraping tools (3) • Import.io: • Point and click and coding • Fully web-based and hosted scraping • Type of functionality: scrape • Free and commercial licenses • Used by some NSI’s 11 Eurostat Screendump ImportIO 12 Eurostat Screendump ImportIO NEW! Reduced price finally… Dinner in your own garden! 13 Eurostat Screendump ImportIO 14 Eurostat Scraping tools (5) • There are many more, such as: • Nutch for crawling (Apache, java) • An extensive list is available on: https://github.com/lorien/awesome-web-scraping • Scraping tools by Statistics Netherlands: • CBS Robot Framework (more in the afternoon) • CBS Robottool, a tool for detecting changes on websites (robot-assisted price collection) 15 Eurostat Some scraping knowledge (1) • HTTP: the communication protocol • HTML: the language in which web pages are defined • JS: javascript (code executing in the browser) • CSS: style sheets, how web pages are styled. Important, but does not contain data. • JPG, PNG, BMP: images, usually not interesting • CSV / TXT / JSON / XML: data, interesting !!! 16 Eurostat Some scraping knowledge (2) • Analyse a website: • For example using firebug in Firefox or Web developer extensions in Chrome • Live demo • Keep an eye on the format of a hyperlink: • Fictitious example: http://www.example.com/getdata?subject=books&display=label,price • The parameters may be useful for scraping • Live demo 17 Eurostat A simple scraping exercise • The exercise will be presented during the course 18 Eurostat.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    18 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us