
Webscraping at Statistics Netherlands
Olav ten Bosch
23 March 2016, ESSnet big data WP2, Rome

Content

– Internet as a datasource (IAD): motivation
– Some IAD projects over past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register

The why

Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
– …

Internet sources
– …
– Surveys

Fuel prices (2009)

‐ Daily fuel prices from the website of unmanned petrol stations (tinq.nl)
‐ Regional prices (per station) every day

Now (2016):
‐ A direct data feed from travelcard company, weekly
‐ Fuel prices per day and all transactions of that week
‐ Publication on website: prices per month

Airline tickets (2010)

– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by SN
– Robot prices agree with manual collection
– Quite expensive; negative business case
– 2016: still manual price collection of airline tickets

[Figure: Ticket price Amsterdam - Milano, robot vs. manual collection, 11 Feb to 10 Aug; vertical axis 0 to 250]

Housing market

– Housing market (from 2011):
  ‐ Discussions with external company for > 1 year (iWoz)
  ‐ We scraped 5 sites, about 250,000 observations / week, for 2 years

From 2013:
  ‐ Direct feed from one of the sites (Jaap.nl)
  ‐ StatLine: Bestaande woningen in verkoop (existing dwellings for sale)
  ‐ “based on 80-90 percent of the market”

Bulk price collection for CPI (1)

– Bulk price collection for CPI (from 2012):
  ‐ Mainly clothing
  ‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)

2016:
  ‐ About 500,000 price observations daily from 10 sites
  ‐ Data from 3 sites used in production of Dutch CPI
  ‐ Price collection process embedded in organisation
  ‐ Plans to extend to > 20 sites; other domains
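To illustrate what scraping "all prices and product data" can look like, here is a minimal sketch using only the Python standard library. The HTML layout, the class names (`product`, `name`, `colour`, `price`) and the `data-id` / `data-category` attributes are assumptions made for this example; real shop pages differ per site, and this is not the software used at Statistics Netherlands.

```python
from html.parser import HTMLParser

# Hypothetical product listing; invented for this sketch.
SAMPLE_HTML = """
<ul>
  <li class="product" data-id="A100" data-category="jumpers">
    <span class="name">Fine-knit jumper</span>
    <span class="colour">Dark blue</span>
    <span class="price">29.95</span>
  </li>
  <li class="product" data-id="A101" data-category="jumpers">
    <span class="name">Striped jumper</span>
    <span class="colour">Grey</span>
    <span class="price">34.50</span>
  </li>
</ul>
"""

# Span classes that are copied into the product record.
FIELDS = {"name", "colour", "price"}


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # product record being filled
        self._field = None    # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._current = {"id": attrs.get("data-id"),
                             "category": attrs.get("data-category")}
        elif tag == "span" and self._current is not None:
            if attrs.get("class") in FIELDS:
                self._field = attrs["class"]

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            if "price" in self._current:
                self._current["price"] = float(self._current["price"])
            self.products.append(self._current)
            self._current = None


def parse_products(html):
    """Return a list of product dicts found in the given HTML text."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products
```

Calling `parse_products(SAMPLE_HTML)` yields one record per product with id, category, name, colour and a numeric price, which is the kind of structured observation a daily price collection run would store.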

Bulk price collection for CPI (2)

[Figure: Processing bulk data from the Internet. Boxes: Data collection & feature extraction; Structured data; Big Data index methods; Index based on internet data. Example features: Fine-knit Jumper Dark blue Striped Cotton edges]

Robot-assisted price collection

– Robot tool for detecting price changes on (parts of) web pages
– Traffic light indicates status:
  ‐ Green: nothing changed, price is saved
  ‐ Red: some change, needs attention of statistician
  ‐ Two clicks to hold old price or store a new one
  ‐ In production from 2014

Collect data on enterprises for EGR (2013)

– Pilot: find data about EGR enterprises on the web
  ‐ We scraped semi-structured data from Wikipedia
  ‐ Multiple Wikipedia languages (NL, EN, DE, FR)
  ‐ 2016: something similar in ESSnet BD WP2?
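As an illustration of what "semi-structured data from Wikipedia" can mean, the sketch below reads the `| key = value` lines of a company infobox in wikitext. The sample infobox, its field names and the company are invented for this example; the pilot's actual extraction method is not described here, and field names differ between Wikipedia language editions.

```python
import re

# Hypothetical infobox wikitext; the company and all values are invented.
SAMPLE_WIKITEXT = """{{Infobox company
| name          = Example Holding N.V.
| industry      = Retail
| founded       = 1950
| hq_location   = Amsterdam
| num_employees = 12,000
| website       = {{URL|www.example.com}}
}}"""

# One "| key = value" pair per line; only spaces and tabs are skipped,
# so an empty value never swallows the next line.
FIELD_RE = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

# Simple {{URL|...}} template, unwrapped to its first argument.
URL_TEMPLATE_RE = re.compile(r"\{\{URL\|([^}|]+)\}\}")


def parse_infobox(wikitext):
    """Return a dict of infobox fields found in the wikitext."""
    fields = {}
    for key, value in FIELD_RE.findall(wikitext):
        match = URL_TEMPLATE_RE.fullmatch(value)
        fields[key] = match.group(1) if match else value
    return fields
```

For the sample above, `parse_infobox(SAMPLE_WIKITEXT)["website"]` gives `www.example.com`. Handling several language editions would, under this approach, need a mapping of field names per language.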


Search product descriptions for classifying business activities

– Search product descriptions on web (from 2014)
  ‐ First time we used automated search with search API for statistics
  ‐ Pilot, no production
  ‐ Some doubts on Google results

Twitter-LinkedIn (1)

– LinkedIn-Twitter for profiling (2015)
  ‐ Automated search on LinkedIn based on a sample of Twitter users
  ‐ Very specific and experimental
  ‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

Scraping websites of enterprises

– Identify family businesses (search and / or crawling) (2016)
– Identify businesses with Corporate Social Responsibility (CSR) (search and / or crawling) (2016)
– Research program:
  ‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP2!

Crawling for Statistics

[Figure: Crawling for statistics. Inputs: incomplete statistical data, URL base, search terms, navigation terms, item identifier terms (“year report, family business”). Components: Internet, focused crawler, Search & Match, data store (ElasticSearch). Output: more complete statistical data]
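The idea of a focused crawler driven by navigation terms and item identifier terms can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler used in the project: the in-memory `PAGES` dictionary stands in for the Internet so the example runs offline, the navigation terms are invented, and the scoring rule (count of navigation terms in anchor text and URL) is one simple choice among many.

```python
import heapq

# Hypothetical mini-site: url -> (page text, [(anchor text, target url), ...])
PAGES = {
    "http://example.com/": ("Welcome", [
        ("About us", "http://example.com/about"),
        ("Products", "http://example.com/products"),
        ("Investors", "http://example.com/investors"),
    ]),
    "http://example.com/about": ("We are a family business since 1950", []),
    "http://example.com/products": ("Our catalogue", []),
    "http://example.com/investors": ("Download the year report 2015", []),
}

NAVIGATION_TERMS = ("about", "investor")          # assumed; steer the crawl
ITEM_TERMS = ("year report", "family business")   # terms that mark a hit


def fetch(url):
    """Stand-in for an HTTP request: look the page up in PAGES."""
    return PAGES.get(url, ("", []))


def link_score(anchor, url):
    """Number of navigation terms found in the anchor text or the URL."""
    text = (anchor + " " + url).lower()
    return sum(term in text for term in NAVIGATION_TERMS)


def focused_crawl(start_url, max_pages=10):
    """Visit the most promising links first; return visited urls and hits."""
    frontier = [(0, start_url)]  # min-heap of (negative score, url)
    seen = {start_url}
    visited, hits = [], {}
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        text, links = fetch(url)
        found = [term for term in ITEM_TERMS if term in text.lower()]
        if found:
            hits[url] = found
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-link_score(anchor, target), target))
    return visited, hits
```

With a budget of three pages, `focused_crawl("http://example.com/", max_pages=3)` visits the start page, the "about" page and the "investors" page, and never spends a request on the product catalogue. The hits could then be matched against the incomplete statistical data and stored, for example in a search index.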

Technologies used

– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– Node.js (JavaScript on server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (Node.js package, 2015-2016)
– Nutch: tested, not used
– Generic framework (robot framework) for bulk scraping of prices

Summary / trends

Columns: Production, Scrape, Search, Crawl, External company

Tinq: x (x) Travelcard
Airlines: x 2 robots
Housing: x (x) Jaap.nl
BulkCPI: x x
Robottool: x x (x)
EGR: x x
RGS: x
Twitter/LinkedIn: x x
Enterprises: x x Dataprovider?

Observations / thoughts …

‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!
‐ It’s powerful to combine something we know with something we observe!

‐ External companies can help, but be careful …


Legal

– Dutch Statistics Law:
  ‐ Enterprises have to provide data to Statistics Netherlands on request
  ‐ Scraping information from websites reduces response burden
  ‐ Statistics Netherlands uses data for official statistics only
– Dutch database legislation:
  ‐ Commercial re-use of intellectual property is forbidden
  ‐ This may also apply to internet sources
– Privacy:
  ‐ Dutch (statistical) legislation on protection of personal information
  ‐ Statistics Netherlands only scrapes public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) data
– Netiquette:
  ‐ respect robots.txt
  ‐ identify yourself (user-agent)
  ‐ do not overload servers, use some idle time between requests
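The three netiquette rules can be sketched with the Python standard library as below. The user-agent string, the two-second delay and the sample robots.txt are assumptions chosen for illustration; in practice the robots.txt would be downloaded from the site being scraped, and the delay tuned per site.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identification string and delay; adjust to your organisation.
USER_AGENT = "ExampleStatsBot/0.1 (+mailto:webmaster@example.org)"
DELAY_SECONDS = 2.0

# Inline sample so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robot_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def http_fetch(url, user_agent):
    """Fetch one page, identifying the robot through the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch(urls, rules, fetch=http_fetch,
                 delay=DELAY_SECONDS, sleep=time.sleep):
    """Fetch the allowed URLs one by one with idle time between requests."""
    results = {}
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # respect robots.txt: skip disallowed paths
        if not first:
            sleep(delay)  # do not overload the server
        first = False
        results[url] = fetch(url, USER_AGENT)
    return results
```

The `fetch` and `sleep` parameters exist only so the function can be exercised without network access or real waiting; with the defaults it performs real HTTP requests.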

Dutch Business Register (simplified)

- From administrative units to statistical units:

[Figure: unit types in the business register: Legal units, Cluster of control relationships, Enterprise groups, Enterprises, Local units]

Sources:
- Trade Register
- Tax Register
- Social security register (employees)
- Profilers

- About 1.5 million administrative entities
- About 0.5 million have a URL
- Quality of URL field not known, but seems usable
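One way to get a first impression of the quality of a URL field is to normalise each value and count how many remain usable. The sketch below is an assumption-laden illustration, not the procedure used for the Dutch Business Register: the rules (default to `http` when the scheme is missing, reject e-mail addresses and hosts without a dot) and the sample values are invented for this example.

```python
from urllib.parse import urlsplit


def normalise_url(raw):
    """Return 'scheme://host' for a usable url field value, else None."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if not value or "@" in value:   # empty field or an e-mail address
        return None
    if "://" not in value:
        value = "http://" + value   # assumption: default to http
    try:
        parts = urlsplit(value)
    except ValueError:              # e.g. malformed IPv6 brackets
        return None
    host = parts.hostname
    if parts.scheme not in ("http", "https"):
        return None
    if not host or "." not in host or " " in host:
        return None
    return f"{parts.scheme}://{host}"


def usable_share(raw_urls):
    """Fraction of url field values that survive normalisation."""
    cleaned = [normalise_url(value) for value in raw_urls]
    if not cleaned:
        return 0.0
    return sum(url is not None for url in cleaned) / len(cleaned)
```

For example, `usable_share(["www.Example.nl", " HTTPS://example.com/contact ", "n/a", "", "info@example.nl"])` counts two usable values out of five. A syntactic check like this says nothing about whether the site still exists or belongs to the enterprise; that would need an actual request.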