Webscraping at Statistics Netherlands
Olav ten Bosch
23 March 2016, ESSnet big data WP2, Rome

Content
– Internet as a data source (IAD): motivation
– Some IAD projects over the past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register

The why
Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
– …
Internet sources
– …
– Surveys

Fuel prices (2009)
‐ Daily fuel prices from the website of unmanned petrol stations (tinq.nl)
‐ Regional prices (per station) every day

Now (2016):
‐ A direct data feed from a travelcard company, weekly
‐ Fuel prices per day and all transactions of that week
‐ Publication on the website: prices per month

Airline tickets (2010)
– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by Statistics Netherlands (SN)
– Robot prices consistent with manual collection
– Quite expensive; negative business case
– 2016: still manual price collection of airline tickets
[Figure: Ticket price Amsterdam - Milano, robot versus manual collection, 11 Feb to 10 Aug; price axis from 0 to 250]

Housing market
– Housing market (from 2011):
‐ Discussions with an external company for more than 1 year (iWoz)
‐ We scraped 5 sites, about 250,000 observations per week, for 2 years

From 2013:
‐ Direct feed from one of the sites (Jaap.nl)
‐ Statline tables: Bestaande woningen in verkoop (existing dwellings for sale)
‐ “based on 80-90 percent of the market”

Bulk price collection for CPI (1)
– Bulk price collection for CPI (from 2012):
‐ Mainly clothing
‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)

2016:
‐ About 500,000 price observations daily from 10 sites
‐ Data from 3 sites used in production of the Dutch CPI
‐ Price collection process embedded in the organisation
‐ Plans to extend to more than 20 sites and to other domains
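
As an illustration of scraping prices and product data, here is a minimal Python sketch using only the standard library. The HTML fragment, the class names and the `parse_products` helper are hypothetical; real shop pages differ per site, and this is not the software used in production at Statistics Netherlands.

```python
from html.parser import HTMLParser

# Hypothetical product-page fragment; real shop markup differs per site.
SAMPLE_HTML = """
<div class="product" data-id="12345">
  <span class="name">Fine-knit jumper</span>
  <span class="colour">Dark blue</span>
  <span class="size">M</span>
  <span class="price">29.99</span>
</div>
"""

# Assumed CSS class names of the fields we want to keep.
FIELDS = {"name", "colour", "size", "price"}


class ProductParser(HTMLParser):
    """Collects one dict per <div class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # product dict currently being filled
        self._field = None    # field whose text we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class", "")
        if tag == "div" and css == "product":
            self._current = {"id": attrs.get("data-id")}
            self.products.append(self._current)
        elif tag == "span" and css in FIELDS and self._current is not None:
            self._field = css

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div":
            self._current = None


def parse_products(html):
    """Return a list of product dicts with the price converted to float."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    for product in parser.products:
        if "price" in product:
            product["price"] = float(product["price"])
    return parser.products


products = parse_products(SAMPLE_HTML)
```

In practice each site needs its own field mapping, which is one reason a generic framework with site-specific configuration is attractive.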

Bulk price collection for CPI (2)
[Diagram: Processing bulk data from the Internet. Stages: data collection & feature extraction; structured data; big data index methods; index based on internet data. Example features: fine-knit, jumper, dark blue, striped, cotton, edges]

Robot-assisted price collection
– Robot tool for detecting price changes on (parts of) websites
– Traffic light indicates status:
‐ Green: nothing changed, price is saved in the database
‐ Red: something changed, needs the attention of a statistician
‐ Two clicks to keep the old price or store a new one
‐ In production since 2014

Collect data on enterprises for EGR (2013)
– Pilot: find data about EGR enterprises on the web
‐ We scraped semi-structured data from Wikipedia
‐ Multiple Wikipedia languages (NL, EN, DE, FR)
‐ 2016: something similar in ESSnet BD WP2?

Search product descriptions for classifying business activities
– Search product descriptions on web (from 2014)
‐ First time we used automated search with Google search API for statistics
‐ Pilot, no production
‐ Some doubts on Google results

Twitter-LinkedIn (1)
– LinkedIn-Twitter for profiling (2015)
‐ Automated search on LinkedIn based on a sample of Twitter users
‐ Very specific and experimental
‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

Scraping websites of enterprises
– Identify family businesses (search and / or crawling) (2016)
– Identify businesses with Corporate Social Responsibility (CSR) (search and / or crawling) (2016)
– Research program:
‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP2!

Crawling for Statistics
[Diagram: Inputs: incomplete statistical data, URL base, search terms, navigation terms, item identifier terms (e.g. “year report, family business”). Components: Internet, focused crawler (Roboto), data store (ElasticSearch), search & match. Output: more complete statistical data]
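
The focused-crawling idea in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Roboto API (Roboto is a Node.js package): the in-memory `SITE`, the term lists and the helper names are all hypothetical. Links whose anchor text matches navigation terms are followed first, and pages whose text matches item identifier terms are recorded as hits.

```python
import heapq

# Hypothetical in-memory "site": url -> (page text, [(anchor text, target url), ...]).
SITE = {
    "http://example.test/": ("Welcome", [
        ("Products", "http://example.test/products"),
        ("About us", "http://example.test/about"),
    ]),
    "http://example.test/products": ("Our catalogue", []),
    "http://example.test/about": ("We are a family business since 1950", [
        ("Year report 2015", "http://example.test/report"),
    ]),
    "http://example.test/report": ("Year report of the family business", []),
}

NAVIGATION_TERMS = ("about", "report")            # assumed: links worth following first
ITEM_TERMS = ("year report", "family business")   # item identifier terms from the slide


def score(text, terms):
    """Number of terms that occur in the text (case-insensitive)."""
    text = text.lower()
    return sum(term in text for term in terms)


def fetch_from_site(url):
    """Stand-in for an HTTP fetch; reads from the in-memory SITE."""
    return SITE[url]


def focused_crawl(start_url, fetch, max_pages=10):
    """Visit pages best-first; return {url: matched item identifier terms}."""
    queue = [(0, start_url)]  # (negated priority, url); heapq pops the smallest
    seen = {start_url}
    hits = {}
    visited = 0
    while queue and visited < max_pages:
        _, url = heapq.heappop(queue)
        visited += 1
        text, links = fetch(url)
        matched = [term for term in ITEM_TERMS if term in text.lower()]
        if matched:
            hits[url] = matched
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(queue, (-score(anchor, NAVIGATION_TERMS), target))
    return hits


hits = focused_crawl("http://example.test/", fetch_from_site)
```

The page budget (`max_pages`) is what makes the prioritisation matter: with a small budget, promising links are visited before the rest of the site.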

Technologies used
– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– NodeJS (JavaScript on server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (nodejs package, 2015-2016)
– Nutch: tested, not used
– Generic Framework (robot framework) for bulk scraping of prices

Summary / trends
Columns: Production, Scrape, Search, Crawl, External company
Tinq: x, (x); external company: Travelcard
Airlines: x; external company: 2 robots
Housing: x, (x); external company: Jaap.nl
BulkCPI: x, x
Robottool: x, x, (x)
EGR: x, x
RGS: x
Twitter/LinkedIn: x, x
Enterprises: x, x; external company: Dataprovider?

Observations / thoughts …
‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!
‐ It’s powerful to combine something we know with something we observe!
‐ External companies can help, but be careful …

Legal
– Dutch Statistics Law:
‐ Enterprises have to provide data to Statistics Netherlands on request
‐ Scraping information from websites reduces response burden
‐ Statistics Netherlands uses the data for official statistics only
– Dutch database legislation:
‐ Commercial re-use of intellectual property is forbidden
‐ This may also apply to internet sources
– Privacy:
‐ Dutch (statistical) legislation on protection of personal information
‐ Statistics Netherlands scrapes only public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) data internally
– Netiquette:
‐ respect robots.txt
‐ identify yourself (user-agent)
‐ do not overload servers, use some idle time between requests
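
The three netiquette rules above can be sketched in Python with the standard library's `urllib.robotparser`. This is a minimal sketch under stated assumptions: the user-agent string, the 2-second pause and the `PoliteFetcher` wrapper are hypothetical, and the actual HTTP call is passed in as a function so the example stays self-contained.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-stats-bot/0.1"  # hypothetical; use a name that identifies your organisation
DELAY_SECONDS = 2.0                   # assumed idle time between requests


class PoliteFetcher:
    """Wraps a fetch function with the three netiquette rules."""

    def __init__(self, robots_txt, fetch, sleep=time.sleep):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.fetch = fetch   # callable(url, headers) that performs the request
        self.sleep = sleep   # injectable so the pause can be observed in tests
        self.first_request = True

    def get(self, url):
        # Rule 1: respect robots.txt.
        if not self.rules.can_fetch(USER_AGENT, url):
            return None
        # Rule 3: idle between consecutive requests (fixed pause, for simplicity).
        if not self.first_request:
            self.sleep(DELAY_SECONDS)
        self.first_request = False
        # Rule 2: identify yourself through the User-Agent header.
        return self.fetch(url, {"User-Agent": USER_AGENT})


# Usage with a stand-in fetch function instead of a real HTTP request.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
calls = []
pauses = []


def fake_fetch(url, headers):
    calls.append((url, headers))
    return "ok"


fetcher = PoliteFetcher(ROBOTS_TXT, fake_fetch, sleep=pauses.append)
first = fetcher.get("http://example.test/shop")
blocked = fetcher.get("http://example.test/private/x")
second = fetcher.get("http://example.test/shop2")
```

A fixed pause is the simplest choice; a production scraper might instead honour a site's crawl-delay or spread requests over the day.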

Dutch Business Register (simplified)
- From administrative units to statistical units:
[Diagram elements: Legal units; Cluster of control relationships; Enterprise groups; Enterprises; Local units]
Sources:
- Trade Register
- Tax Register
- Social security register (employees)
- Profilers
- About 1.5 million administrative entities
- About 0.5 million have a URL
- Quality of URL field not known, but seems usable
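
A first, purely syntactic check of the URL field could look like the Python sketch below. The rules in `normalise_url` (add a missing scheme, require a dot in the host, drop port, query and fragment) and the sample values are assumptions for illustration; whether a URL really belongs to the enterprise can only be established by visiting it.

```python
from urllib.parse import urlsplit


def normalise_url(raw):
    """Return a cleaned 'scheme://host/path' string, or None if unusable.

    Port, query string and fragment are deliberately dropped.
    """
    if raw is None:
        return None
    text = raw.strip()
    if not text or " " in text:
        return None
    if "://" not in text:
        text = "http://" + text  # assumption: a bare host means plain http
    try:
        parts = urlsplit(text)
    except ValueError:
        return None
    host = parts.hostname or ""  # hostname is already lower-cased by urlsplit
    if parts.scheme not in ("http", "https") or "." not in host:
        return None
    return f"{parts.scheme}://{host}{parts.path or '/'}"


def usable_share(raw_urls):
    """Fraction of register entries whose URL field passes the syntactic check."""
    if not raw_urls:
        return 0.0
    usable = sum(normalise_url(raw) is not None for raw in raw_urls)
    return usable / len(raw_urls)


# Hypothetical register values, not real data.
sample = [" WWW.Example.NL ", "n/a", "", "https://example.nl/contact"]
share = usable_share(sample)
```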