WPJ - Innovative Tourism Statistics Quality
Katarzyna Kapica on behalf of the WPJ coordinator Marek Cierpiał-Wolan, PhD, Statistical Office in Rzeszów, Poland
stat.gov.pl 1 Statistical representation Data description, Concepts and Definitions
• new tourism accommodation establishments
• geolocation of all establishments
• types of accommodation establishments and their number of bed-places
Statistical representation Statistical unit and population
Statistical unit: domestic accommodation establishment.
Statistical population: domestic accommodation establishments, with complementary information on their types, regions and other information allowing for their identification.
Statistical representation Time coverage and reference area
Time coverage
• Data from Hotels.com have been collected from (February) July 2019 onwards.
• Data from Booking.com were collected in July and August 2019.
Reference area
• The Hotels.com web scraper has its proxy “destination” set differently in each country: some countries use just the country name, some use provinces and cities, and some follow the NUTS classification.
Unit of measure
Day, Month and Year in which an accommodation is detected
Date configuration used to gather accommodation prices
Geographical area chosen by a visitor using the search engine of the platform
Platform id assigned to a specific visitor’s search for accommodation (demand side)
Platform_id (client number) assigned to a specific accommodation (supply side)
Name of the accommodation in the platform database
Type of accommodation, as registered by the person responsible for the accommodation to classify the business on the platform
City where an accommodation is located or established
Combination of numbers and letters in an address that denotes the neighborhood and street, used to locate dwellings
Cumulative number of guest reviews
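As an illustration, the variables described above could be held in one record per detected listing. This is a minimal sketch only: the source gives the variable descriptions but not their names, so every field name below (including `postal_code` for the address code) is hypothetical, and the example values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ScrapedListing:
    """One detected accommodation listing (all field names are hypothetical)."""
    detection_date: date                # day, month and year of detection
    price_date_config: str              # date configuration used to gather prices
    search_area: str                    # area chosen in the platform's search engine
    search_id: Optional[str]            # platform id of the visitor's search (demand side)
    accommodation_id: str               # platform id of the accommodation (supply side)
    name: Optional[str]                 # name in the platform database
    accommodation_type: Optional[str]   # type as registered on the platform
    city: Optional[str]                 # city where the accommodation is located
    postal_code: Optional[str]          # address code for neighborhood and street
    review_count: Optional[int]         # cumulative number of guest reviews


# Invented example values, for illustration only.
example = ScrapedListing(
    detection_date=date(2019, 7, 15),
    price_date_config="1 night, 2 adults",
    search_area="Rzeszów",
    search_id=None,
    accommodation_id="A-0001",
    name="Example Hotel",
    accommodation_type="hotel",
    city="Rzeszów",
    postal_code=None,
    review_count=0,
)
```

Optional fields reflect that a platform may leave a value empty for a given listing.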
Institutional mandate Legal acts and other agreements
There is no legal mandate regulating web scraping of websites across all countries of the European Union.
WPJ tool (crawler):
• is not invasive,
• does not increase the load on servers,
• does not affect the day-to-day operation and functioning of the portals,
• runs during off-peak hours,
• collects only the data needed for statistical purposes.
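Two of the properties above (running during off-peak hours and not adding load to the servers) might be enforced with simple checks such as the ones sketched here. This is not the WPJ crawler's actual code: the off-peak window, the minimum delay between requests and the function names are all assumptions made for illustration.

```python
from datetime import datetime, time

# Assumed values: the source does not specify the window or the delay.
OFF_PEAK_START = time(1, 0)
OFF_PEAK_END = time(5, 0)
MIN_DELAY_SECONDS = 10.0


def is_off_peak(moment: datetime) -> bool:
    """True if the moment falls inside the assumed off-peak window."""
    return OFF_PEAK_START <= moment.time() < OFF_PEAK_END


def seconds_to_wait(last_request: datetime, now: datetime) -> float:
    """Remaining pause before the next request, to keep server load low."""
    elapsed = (now - last_request).total_seconds()
    return max(0.0, MIN_DELAY_SECONDS - elapsed)
```

A crawler loop would call `is_off_peak` before starting a run and sleep for `seconds_to_wait(...)` between consecutive requests.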
Institutional mandate Data access and transmission
Access to raw data is based on its public online availability.
Confidentiality Data treatment
Data gathered by web scraping are generally not subject to confidentiality, as they are publicly available and scraped from websites accessible worldwide. Exception: highly privacy-sensitive data on very small accommodations belonging to natural persons (not enterprises) who provide personal details.
Each country is recommended to implement its own confidentiality rules.
Quality management Quality assurance
A specific quality assurance protocol for web scraping has not been put forward yet.
Statistics Netherlands has been certified according to ISO 9001 since 2018. Quality management of methodology and process development for official statistics has been adopted, audited and assessed. This certification confirms that Statistics Netherlands focuses on:
• the quality procedures for internal and external reports, recommendations and briefs,
• the quality assurance of statistical development projects in which methodologists and business analysts participate,
• the quality assurance of methodological courses taught to statisticians,
• the internal management of the department.
Relevance Added value through a new data source
• improve (expand) the survey population of tourist accommodation establishments,
• provide more complete information about establishments operating in the country.
The use of web scraping techniques, and in particular of platforms such as Hotels.com and Booking.com, makes it possible to speed up the inventory process of tourism accommodations. The chosen platforms provide up-to-date inside information on the population dynamics of hospitality businesses and households.
Accuracy and reliability Overall accuracy
NACE   Survey data   Web scraped data
55.1   yes           yes
55.2   yes           yes
55.3   yes           not at the moment
Accuracy and reliability Non-sampling error
Economic classification used in tourism statistics:
• 55.1 (hotels and similar accommodation),
• 55.2 (holiday and other short-stay accommodation),
• 55.3 (camping grounds, recreational vehicle parks and trailer parks).
Accuracy and reliability Coverage error and overcoverage rate
Undercoverage error will occur, as only accommodation available on booking portals will be gathered. Overcoverage error may occur if the same accommodation establishment is listed on two or more booking portals. This will be treated by the use of geolocation data. These rates differ for every country in the project and will need to be calculated by the end of it.
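One way the geolocation data might be used to treat overcoverage is to flag listings from two portals that lie within a short distance of each other as candidate duplicates. The sketch below is an assumption, not the project's method: the 50 m threshold, the dictionary keys and the sample coordinates are invented, and the great-circle (haversine) distance stands in for whatever matching rule is actually applied.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def flag_duplicates(listings_a, listings_b, max_distance_m=50.0):
    """Return (id_a, id_b) pairs located within max_distance_m of each other."""
    pairs = []
    for a in listings_a:
        for b in listings_b:
            if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m:
                pairs.append((a["id"], b["id"]))
    return pairs


# Invented sample points, for illustration only.
hotels_com = [{"id": "H1", "lat": 50.0412, "lon": 21.9991}]
booking_com = [
    {"id": "B1", "lat": 50.0413, "lon": 21.9992},  # a few metres from H1
    {"id": "B2", "lat": 50.0600, "lon": 22.0200},  # about two kilometres away
]
candidate_pairs = flag_duplicates(hotels_com, booking_com)
```

Flagged pairs are only candidates: two distinct establishments can share a building, so a check on name or address would normally follow.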
Accuracy and reliability Measurement error
• Scraper features: changes in the platforms cause the scraper to fail and stop.
• Target population: the proximity filter of the platform (Hotels.com) causes accommodations outside the target population to be included in the scraping files.
• Chained-brand hotels and resorts: rapid changes in ownership in the leisure and recreation industry require more flexible, adaptive approaches to detect and correct for mergers, acquisitions and franchises connected to multinationals such as Holiday Inn (UK), Best Western or Hilton (US), NH Hotel Group (ES), Jinjiang International (CN), Scandic Hotels (SE) and Accor (FR).
Accuracy and reliability Nonresponse error
Listings in which the name and address of the accommodation are missing, sometimes even on purpose. Examples from Booking.com of promotional text used as the name of the accommodation:
• Studio on a houseboat, near city center!
• Spacious, modern family home on the canal with parking!
Strategy: “quarantine” such listings and wait for new data to be collected, analyzed and assessed.
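The quarantine strategy could be sketched as a simple split of the scraped records. This is an assumption about how it might be implemented, with hypothetical field names and invented records; it only catches empty names or addresses and does not try to recognise promotional text used as a name, which would need a separate rule.

```python
def split_quarantine(listings):
    """Separate listings with both a name and an address from those to quarantine."""
    usable, quarantined = [], []
    for record in listings:
        name = (record.get("name") or "").strip()
        address = (record.get("address") or "").strip()
        if name and address:
            usable.append(record)
        else:
            quarantined.append(record)  # held back until new data can be assessed
    return usable, quarantined


# Invented records, for illustration only.
sample = [
    {"id": "B1", "name": "Example Hotel", "address": "Example Street 1"},
    {"id": "B2", "name": "", "address": "Example Street 2"},
    {"id": "B3", "name": "Example Hostel", "address": None},
]
usable, quarantined = split_quarantine(sample)
```

Quarantined records are kept rather than deleted, so that they can be re-assessed once a later scrape provides the missing fields.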
Accuracy and reliability Unit non-response rate
Ideally, the following table should be filled in at the end of the project.
Columns (each reported as Number and %):
• Survey on TAE
• Administrative data
• Big data source: Hotels.com
• Big data source: Booking.com
Rows:
Number of units
1. Response rate
1.1 Used for calculations
1.2 Not used for calculations
1.2.1 Out of scope
1.2.2 Other reasons (too many empty fields, inconsistent data, unable to link)
2. Non-response (not filled in by respondent or by web scraper)
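For one data source, the rows of this table could be derived from four counts, as in the sketch below. It assumes that every percentage is taken relative to the total number of units, which the source does not state; the function name, the row keys and the example counts are invented.

```python
def unit_rates(total_units, used, out_of_scope, other_unused):
    """Return {row: (number, percent of total units)} for one data source."""
    not_used = out_of_scope + other_unused
    responded = used + not_used
    non_response = total_units - responded
    if total_units <= 0 or non_response < 0:
        raise ValueError("counts are inconsistent with the total number of units")

    def pct(n):
        return round(100.0 * n / total_units, 1)

    return {
        "1 Response": (responded, pct(responded)),
        "1.1 Used for calculations": (used, pct(used)),
        "1.2 Not used for calculations": (not_used, pct(not_used)),
        "1.2.1 Out of scope": (out_of_scope, pct(out_of_scope)),
        "1.2.2 Other reasons": (other_unused, pct(other_unused)),
        "2 Non-response": (non_response, pct(non_response)),
    }


# Invented counts, for illustration only.
rates = unit_rates(total_units=200, used=150, out_of_scope=20, other_unused=10)
```

Running the same function once per column (survey, administrative data, Hotels.com, Booking.com) would fill the whole table.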
Timeliness and punctuality
Web scraping makes it possible to add new units on a regular basis. Flash estimates of accommodation establishments and their occupancy will be calculated at t+1 (ideally).
Coherence and comparability
Comparability - geographical
The completeness and comparability of accommodations close to country borders should be considered separately.
Comparability - over time
Provided that no legal problems occur and the scraped websites do not disappear, there will be no problem with comparability over time.
Cost and burden
Direct costs: greater efficiency and less manpower needed to collect data from accommodations.
Indirect costs: updating all the tools and editing processes.
Short-term: developing web crawlers.
Medium-term: testing crawlers, deploying them on a dedicated server, training.
Long-term: maintenance, updating and operation of the servers and crawlers, and keeping operators trained.
Statistical processing
Source data
Combination of two sources of data:
1. survey on tourism accommodation establishments,
2. web scraping of accommodation portals.
They will be combined using geolocation and address data.
Frequency of data collection
Data on accommodation establishments are web scraped with daily or monthly frequency.
Data collection
Web scraped data: scraping is done daily, but monthly (and accumulated) updates should be sufficient; the day of the week seems more important, i.e. Saturdays and/or Sundays.
Thank you for your attention!
For more information Coordinator of the WPJ implementation: [email protected]