Wp7 Use Case of Big Data Sources

WPJ - Innovative Tourism Statistics
Quality

Katarzyna Kapica, on behalf of the WPJ coordinator Marek Cierpiał-Wolan, PhD
Statistical Office in Rzeszów, Poland
stat.gov.pl

Statistical representation
Data description, concepts and definitions
  • new tourism accommodation establishments
  • geolocation of all establishments
  • types of accommodation establishments and their number of bed-places

Statistical representation
Statistical unit and population
Statistical unit: domestic accommodation establishment
Statistical population: domestic accommodation establishments, with complementary information on their types, regions and other details allowing for their identification.

Statistical representation
Time coverage and reference area
Time coverage
  • Data from Hotels.com have been collected from (February) July 2019 onwards.
  • Data from Booking.com were collected in July and August 2019.
Reference area
  • The Hotels.com web scraper has its "destination" proxy set differently in each country: some countries use just the country name, some use provinces and cities, and some follow the NUTS classification.

Unit of measure
  • Day, month and year in which an accommodation is detected
  • Date configuration used to gather accommodation prices
  • Geographical area chosen by a visitor using the search engine of the platform
  • Platform id assigned to a specific visitor's search for accommodation (demand side)
  • Platform_id (client number) assigned to a specific accommodation (supply side)
  • Name of the accommodation in the platform database
  • Type of accommodation, as registered by the accommodation's operator to classify the business on the platform
  • City where an accommodation is located or established
  • Combination of numbers and letters in addresses that denotes the neighborhood and street, used to locate dwellings
  • Cumulative number of guest reviews

Institutional mandate
Legal acts and other agreements
There is no legal mandate regulating web scraping of websites across all countries of the European Union.
The WPJ tool (crawler):
  • is not invasive,
  • does not increase the load on servers,
  • does not affect the day-to-day operation and functioning of the portals,
  • runs during off-peak hours,
  • collects only data that are needed for statistical purposes.

Institutional mandate
Data access and transmission
Access to raw data is based on its public online availability.

Confidentiality
Data treatment
Data gathered by web scraping are generally not subject to confidentiality, as they are publicly available and scraped from websites accessible worldwide.
Exception: highly privacy-sensitive data on very small accommodations belonging to natural persons (not enterprises) that provide personal details.
Each country is advised to implement its own confidentiality rules.

Quality management
Quality assurance
A specific quality assurance protocol for web scraping has not been put forward yet.
Statistics Netherlands has been certified according to ISO 9001 since 2018.
Quality management of methodology and process development for official statistics has been adopted, audited and assessed. This certification confirms that Statistics Netherlands focuses on:
  o the quality procedures for internal and external reports, recommendations and briefs,
  o the quality assurance of statistical development projects in which methodologists and business analysts participate,
  o the quality assurance of methodological courses taught to statisticians,
  o the internal management of the department.

Relevance
Added value through a new data source
  • improve (expand) the survey population of tourist accommodation establishments,
  • provide more complete information about establishments operating in the country.
Web scraping, in particular of platforms such as Hotels.com and Booking.com, speeds up the process of taking inventory of tourism accommodations. The chosen platforms provide up-to-date insight into the population dynamics of hospitality businesses and households.

Accuracy and reliability
Overall accuracy

  NACE    Survey data    Web scraped data
  55.1    yes            yes
  55.2    yes            yes
  55.3    yes            not at the moment

Accuracy and reliability
Non-sampling error
Economic classification used in tourism statistics:
  • 55.1 (hotels and similar accommodation),
  • 55.2 (holiday and other short-stay accommodation),
  • 55.3 (camping grounds, recreational vehicle parks and trailer parks).

Accuracy and reliability
Coverage error and overcoverage rate
Undercoverage will occur, because only accommodation available on booking portals is gathered.
Overcoverage may occur if the same accommodation establishment is listed on two or more booking portals. This will be treated by the use of geolocation data.
These numbers differ for every country in the project and will need to be calculated by the end of it.
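
The slides do not say how geolocation data will be used to treat overcoverage. One possible reading, shown here only as a minimal sketch, is to flag a listing as a duplicate when it lies within a small distance of a listing already kept from a different portal. The record fields (`portal`, `name`, `lat`, `lon`), the sample coordinates and the 50 m threshold are all hypothetical and not taken from the WPJ project.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def deduplicate(listings, max_distance_m=50.0):
    """Split listings into kept records and (duplicate, match) pairs.

    A listing counts as a duplicate if it lies within max_distance_m
    of an already kept listing that comes from a different portal.
    The threshold is an assumption and would need tuning per country.
    """
    kept, duplicates = [], []
    for rec in listings:
        match = next(
            (
                k for k in kept
                if k["portal"] != rec["portal"]
                and haversine_m(k["lat"], k["lon"], rec["lat"], rec["lon"])
                <= max_distance_m
            ),
            None,
        )
        if match is None:
            kept.append(rec)
        else:
            duplicates.append((rec, match))
    return kept, duplicates


# Hypothetical sample records, for illustration only.
listings = [
    {"portal": "Hotels.com", "name": "Hotel A", "lat": 50.0412, "lon": 21.9991},
    {"portal": "Booking.com", "name": "Hotel A Rzeszow", "lat": 50.0413, "lon": 21.9992},
    {"portal": "Booking.com", "name": "Guesthouse B", "lat": 50.0600, "lon": 22.0100},
]

kept, duplicates = deduplicate(listings)
duplicate_share = len(duplicates) / len(listings)
print(len(kept), len(duplicates), round(duplicate_share, 3))
```

Distance alone can merge distinct establishments that share a building, so in practice the match would probably also compare names or addresses before a record is dropped.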

Accuracy and reliability
Measurement error
  • Scraper features: changes in the platforms cause the scraper to fail and stop.
  • Target population: the platform's (Hotels.com) proximity filter causes accommodations outside the target population to be included in the scraping files.
  • Chained-brand hotels and resorts: rapid changes of ownership in the leisure and recreation industry require more flexible, adaptive approaches to detect and correct for mergers, acquisitions and franchises connected to multinationals such as Holiday Inn (UK), Best Western or Hilton (US), NH Hotel Group (ES), Jinjiang International (CN), Scandic Hotels (SE) and Accor (FR).

Accuracy and reliability
Nonresponse error
Some listings lack the name and address of the accommodation, sometimes even on purpose.
Examples from Booking.com, where a description is used as the name of the accommodation:
  • Studio on a houseboat, near city center!
  • Spacious, modern family home on the canal with parking!
Strategy: "quarantine" such listings and wait for new data to be collected, analyzed and assessed.

Accuracy and reliability
Unit non-response rate
Ideally, the following table should be filled in at the end of the project.

  Columns (each reported as Number and %):
    Survey on TAE | Administrative data | Big Data Source: Hotels.com | Big Data Source: Booking.com

  Rows:
    Number of units
    1. Response rate
    1.1 Used for calculations
    1.2 Not used for calculations
    1.2.1 Out of scope
    1.2.2 Other reasons (too many empty fields, inconsistent data, unable to link)
    2. Non-response (not filled in by respondent or by web scraper)

Timeliness and punctuality
Web scraping allows new units to be added on a regular basis.
Flash estimates of accommodation establishments and their occupancy will be calculated at t+1 (ideally).

Coherence and comparability
Comparability - geographical
The completeness and comparability of accommodations close to country borders should be considered separately.
Comparability - over time
Provided that no legal problems occur and the scraped websites do not disappear, there will be no problem with comparability over time.

Cost and burden
Direct costs: greater efficiency and less manpower needed to collect data from accommodations.
Indirect costs: updating all the tools and editing processes.
  • Short-term: developing web crawlers.
  • Medium-term: testing crawlers, deploying them on a dedicated server, training.
  • Long-term: maintenance, updating and operation of the servers, crawlers and trained operators.

Statistical processing
Source data
Combination of two sources of data:
  1. survey on tourism accommodation establishments,
  2. web scraping of accommodation portals.
They will be combined using geolocation and address data.
Frequency of data collection
Data on accommodation establishments are web scraped with daily or monthly frequency.
Data collection
Web scraped data: scraping is done daily, but monthly (and accumulated) updates should be sufficient. The day of the week seems more important, i.e. Saturdays and/or Sundays.

Thank you for your attention!
For more information
Coordinator of the WPJ implementation: [email protected]
