WPJ - Innovative Statistics Quality

Katarzyna Kapica on behalf of the WPJ coordinator Marek Cierpiał-Wolan, PhD Statistical Office in Rzeszów, Poland

stat.gov.pl 1 Statistical representation Data description, Concepts and Definitions

• new tourism accommodation establishments
• geolocation of all establishments
• types of accommodation establishments and their number of bed-places

Statistical representation: Statistical unit and population

Statistical unit: domestic accommodation establishment.

Statistical population: domestic accommodation establishments, with complementary information on their types, regions and other details allowing for their identification.

Statistical representation: Time coverage and reference area

Time Coverage
• Data from .com have been collected from (February) July 2019 onwards.
• Data from Booking.com were collected in July and August 2019.

Reference Area
• The Hotels.com web scraper has its “destination” proxy set differently in each country. Some countries use just the country name, some use provinces and cities, and some follow the NUTS classification.

Unit of measure

Day, Month and Year in which an accommodation is detected

Date configuration used to gather accommodation prices

Geographical area chosen by a visitor using the search engine of the platform

Platform id assigned to a specific visitor’s search for accommodation (demand side)

Platform_id (client number) assigned to a specific accommodation (supply side)

Name of the accommodation in the platform database

Type of accommodation, as registered on the platform by the person responsible for the accommodation to classify the business

City where an accommodation is located or established

Combination of numbers and letters in addresses that denote the neighborhood and street to locate dwellings

Cumulative number of guest reviews
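The variables listed on this slide could be held in a single record type. The following is a minimal sketch only: the class name and every field name are hypothetical labels chosen for illustration (including reading the address code as a postal code), not the names used by the WPJ scraper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScrapedAccommodation:
    """One scraped listing; all field names are illustrative assumptions."""

    detection_date: date               # day, month and year of detection
    price_date_config: str             # date configuration used to gather prices
    search_area: str                   # area chosen in the platform search engine
    search_id: str                     # platform id of the search (demand side)
    platform_id: str                   # platform id of the accommodation (supply side)
    name: Optional[str] = None         # name in the platform database
    accommodation_type: Optional[str] = None
    city: Optional[str] = None
    postal_code: Optional[str] = None  # assumed reading of the address code
    review_count: int = 0              # cumulative number of guest reviews

    def __post_init__(self) -> None:
        # A cumulative count can never be negative.
        if self.review_count < 0:
            raise ValueError("review_count cannot be negative")
```

Optional fields default to None because, as noted under nonresponse error, names and addresses are sometimes missing from listings.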

Institutional mandate: Legal acts and other agreements

There is no legal mandate regulating web scraping of websites across all countries of the European Union

The WPJ tool (crawler):
• is not invasive,
• does not increase the load on servers,
• does not affect the day-to-day operation and functioning of the portals,
• runs during off-peak hours,
• collects only the data needed for statistical purposes.
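The crawler properties listed on this slide can be illustrated with a small sketch. It is not the WPJ crawler: the off-peak window, the delay between requests, and the names `is_off_peak` and `crawl_politely` are all assumptions made for the example, and the `fetch` function is supplied by the caller.

```python
import time
from datetime import time as clock
from typing import Callable, Iterable, List

# Hypothetical settings; the source only says that scraping runs off-peak
# and must not add load to the portals.
OFF_PEAK_START = clock(1, 0)
OFF_PEAK_END = clock(5, 0)
MIN_DELAY_SECONDS = 5.0


def is_off_peak(now: clock) -> bool:
    """True when `now` falls inside the assumed off-peak window."""
    return OFF_PEAK_START <= now < OFF_PEAK_END


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    now: Callable[[], clock],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch pages one at a time, pausing between requests, off-peak only."""
    pages: List[str] = []
    for i, url in enumerate(urls):
        if not is_off_peak(now()):
            break  # stop rather than load the portal during busy hours
        if i > 0:
            sleep(MIN_DELAY_SECONDS)  # spread requests out over time
        pages.append(fetch(url))
    return pages
```

Passing `now` and `sleep` as parameters keeps the sketch testable without real waiting or network access.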

Institutional mandate: Data access and transmission

Access to raw data is based on its public online availability.

Confidentiality: Data treatment

Data gathered by web scraping are generally not subject to confidentiality, as they are publicly available and scraped from websites accessible worldwide. Exception: highly privacy-sensitive data on very small accommodations belonging to natural persons (not enterprises) who provide personal details.

Each country is recommended to implement its own confidentiality rules.

Quality management: Quality assurance

A specific protocol of quality assurance for web scraping has not been put forward yet.

Statistics Netherlands has been certified according to ISO 9001 since 2018. Quality management of methodology and process development for official statistics has been adopted, audited and assessed. This certification confirms that Statistics Netherlands focuses on:
• the quality procedures for internal and external reports, recommendations and briefs,
• the quality assurance of statistical development projects in which methodologists and business analysts participate,
• the quality assurance of methodological courses taught to statisticians,
• the internal management of the department.

Relevance: Added value through a new data source

• improve (expand) the survey population of tourist accommodation establishments,
• provide more complete information about establishments operating in the country.

The use of web scraping techniques, in particular on platforms such as Hotels.com and Booking.com, speeds up the process of taking inventory of tourism accommodations. The chosen platforms provide up-to-date insight into the population dynamics of hospitality businesses and households.

Accuracy and reliability: Overall accuracy

NACE    Survey data    Web scraped data
55.1    yes            yes
55.2    yes            yes
55.3    yes            Not at the moment

Accuracy and reliability: Non-sampling error

Economic classification (NACE) used in tourism statistics:
• 55.1 (hotels and similar accommodation),
• 55.2 (holiday and other short-stay accommodation),
• 55.3 (camping grounds, recreational vehicle parks and trailer parks).

Accuracy and reliability: Coverage error and overcoverage rate

Undercoverage error will occur, as only accommodations available on booking portals will be gathered. Overcoverage error may occur if the same accommodation establishment is listed on two or more booking portals; this will be treated using geolocation data. These rates differ for every country in the project and will need to be calculated by the end of it.
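One way geolocation data could be used to find establishments listed on two portals is a simple distance check. This is a sketch under stated assumptions, not the project's method: the 30 m threshold, the function names, and the definition of the rate (duplicates as a share of all scraped listings) are illustrative choices.

```python
from math import asin, cos, radians, sin, sqrt
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def cross_portal_duplicates(
    portal_a: List[Point], portal_b: List[Point], max_distance_m: float = 30.0
) -> List[int]:
    """Indices of portal_b listings within max_distance_m of a portal_a listing."""
    return [
        j
        for j, q in enumerate(portal_b)
        if any(haversine_m(p, q) <= max_distance_m for p in portal_a)
    ]


def overcoverage_rate(
    portal_a: List[Point], portal_b: List[Point], max_distance_m: float = 30.0
) -> float:
    """Share of all scraped listings that duplicate a listing on the other portal."""
    total = len(portal_a) + len(portal_b)
    if total == 0:
        return 0.0
    return len(cross_portal_duplicates(portal_a, portal_b, max_distance_m)) / total
```

A distance check alone would merge distinct establishments that share a building, so in practice it would likely be combined with name or address comparison.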

Accuracy and reliability: Measurement error

• Scraper features: Changes in the platforms cause the scraper to fail and stop.
• Target population: The proximity filter of the platform (Hotels.com) causes accommodations outside the target population to be included in the scraped files.

• Chained-brand Hotels & : Rapid changes in ownership in the leisure and recreation industry require more flexible, adaptive approaches to detect and correct for mergers, acquisitions and franchises connected to multinationals such as Holiday (UK), Hilton (US), NH Group (ES), (CN), (SE) and (FR).

Accuracy and reliability: Nonresponse error

Listings where the name and address of the accommodation are missing, sometimes even on purpose.

Examples from Booking.com of a description used as the name of the accommodation:
• Studio on a houseboat, near city center!
• Spacious, modern family home on the canal with parking!

Strategy: “quarantine” such listings and wait for new data to be collected, analyzed and assessed.
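The “quarantine” strategy could be sketched as a split of the scraped listings into accepted and held-back records. The dictionary keys `name` and `address`, the function names, and the optional `extra_check` hook are assumptions for illustration; the source does not describe how descriptive titles are detected, so that rule is left to the caller.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Listing = Dict[str, Optional[str]]


def needs_quarantine(
    listing: Listing,
    extra_check: Optional[Callable[[Listing], bool]] = None,
) -> bool:
    """Hold a listing back when its name or address is missing or blank."""
    name = (listing.get("name") or "").strip()
    address = (listing.get("address") or "").strip()
    if not name or not address:
        return True
    # Optional caller-supplied rule, e.g. to flag descriptions used as names.
    return extra_check is not None and extra_check(listing)


def split_quarantine(
    listings: Iterable[Listing],
    extra_check: Optional[Callable[[Listing], bool]] = None,
) -> Tuple[List[Listing], List[Listing]]:
    """Return (accepted, quarantined) listings, preserving input order."""
    accepted: List[Listing] = []
    quarantined: List[Listing] = []
    for listing in listings:
        target = quarantined if needs_quarantine(listing, extra_check) else accepted
        target.append(listing)
    return accepted, quarantined
```

Quarantined listings are kept rather than dropped, so they can be reassessed once a later scrape supplies the missing details.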

Accuracy and reliability: Unit non-response rate

Ideally, the following table should be filled in at the end of the project.

                                        Survey on TAE    Administrative data    Big Data Source
                                                                                Hotels.com       Booking.com
                                        Number    %      Number    %            Number    %      Number    %
Number of units
1. Response rate
  1.1 Used for calculations
  1.2 Not used for calculations
    1.2.1 Out of scope
    1.2.2 Other reasons (too many empty fields, inconsistent data, unable to link)
2. Non-response (not filled in by respondent or by web scraper)
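The percentage column of the table could be derived from the counts for one data source as in the sketch below. It assumes that every percentage is taken relative to the total number of units and that the rows nest as their numbering suggests; the function name and parameters are hypothetical.

```python
from typing import Dict


def unit_response_percentages(
    used: int, out_of_scope: int, other_reasons: int, non_response: int
) -> Dict[str, float]:
    """Percentages for one data source, relative to the number of units."""
    not_used = out_of_scope + other_reasons  # row 1.2 = 1.2.1 + 1.2.2
    responded = used + not_used              # row 1 = 1.1 + 1.2
    units = responded + non_response         # all units = row 1 + row 2
    if units == 0:
        raise ValueError("no units to report on")

    def pct(count: int) -> float:
        return 100.0 * count / units

    return {
        "1. Response rate": pct(responded),
        "1.1 Used for calculations": pct(used),
        "1.2 Not used for calculations": pct(not_used),
        "1.2.1 Out of scope": pct(out_of_scope),
        "1.2.2 Other reasons": pct(other_reasons),
        "2. Non-response": pct(non_response),
    }
```

Calling the function once per source (survey, administrative data, each portal) would fill one pair of Number and % columns at a time.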

Timeliness and punctuality

Web scraping allows new units to be added on a regular basis. Flash estimates of accommodation establishments and their occupancy will be calculated at t+1 (ideally).

Coherence and comparability

Comparability – geographical: The completeness and comparability of accommodations close to country borders should be considered separately.

Comparability - over time: Provided that no legal problems occur and the web-scraped websites do not disappear, there will be no problem with comparability over time.

Cost and burden

Direct costs: greater efficiency and less labour needed to collect data from accommodations.
Indirect costs: updating all the tools and editing processes.

Short-term: developing web crawlers.
Medium-term: testing the crawlers, deploying them on a dedicated server, training.
Long-term: maintenance, updating and operation of the servers, the crawlers and the trained operators.

Statistical processing

Source data: a combination of two sources of data:
1. the survey on tourism accommodation establishments,
2. web scraping of accommodation portals.
They will be combined using geolocation and address data.

Frequency of data collection: data on accommodation establishments are web scraped with daily or monthly frequency.

Data collection (web scraped data): scraping is done daily, but monthly (and accumulated) updates should be sufficient. The day of the week seems more important, i.e. Saturdays and/or Sundays.
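Combining the two sources on address data could start from a normalised address key, which also shows which scraped units are new to the survey population. This is a minimal sketch covering the address part only; the record keys `address` and `city` and both function names are assumptions, and a geolocation distance check would complement it.

```python
import re
from typing import Dict, Iterable, List

Record = Dict[str, str]


def address_key(address: str, city: str) -> str:
    """Case- and punctuation-insensitive key built from address and city."""
    text = f"{address} {city}".casefold()
    # Collapse every run of non-word characters into a single space.
    return re.sub(r"[^\w]+", " ", text).strip()


def new_units(survey: Iterable[Record], scraped: Iterable[Record]) -> List[Record]:
    """Scraped records whose address key does not occur in the survey."""
    known = {address_key(r["address"], r["city"]) for r in survey}
    return [r for r in scraped if address_key(r["address"], r["city"]) not in known]
```

Exact key matching misses spelling variants and abbreviations, so unmatched records are candidates for review rather than confirmed new establishments.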

Thank you for your attention!

For more information Coordinator of the WPJ implementation: [email protected]
