ESSnet Big Data II

Grant Agreement Number: 847375 -2018-NL-BIGDATA https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata

https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en

Work package K Methodology and quality

Deliverable K5: First draft of the methodological report

Final version, 17.06.2020

Prepared by: Piet Daas (CBS, NL)

Jacek Maślankowski (GUS, PL)
David Salgado (INE, ES)
Sónia Quaresma (INE, PT)
Tiziana Tuoto, Loredana Di Consiglio, Giovanna Brancato, Paolo Righi (ISTAT, IT)
Magdalena Six, Alexander Kowarik (STAT, AT)

ESSnet co-ordinator:

Work package leader:

Alexander Kowarik (STAT, AT)

[email protected]

telephone: +43 1 71128 7513

Contents

0. Prologue/Starting point
1. Overview of Big Data based official statistics
1.1 Using Big Data
1.2 Lessons learned from Google Flu Trends
1.3 Large n, large p and large t
1.4 Relation with Quality and Typology reports
Input Phase
2. Introductory overview of methods for text and image mining
2.1 Text mining and NLP
2.1.1 Text
2.1.2 Text mining tasks
2.1.3 Examples of text mining applications
2.2 Image mining
2.2.1 Bag of visual words
2.2.2 Deep Learning
2.2.3 Examples of image mining applications
2.3 Discussion
Throughput Phase 1
3. Big Data exploration
3.1 Visualization methods for large datasets
3.1.1 Univariate data
3.1.2 Bivariate data
3.1.3 Other relevant visualization methods
3.2 Examples of visualization methods used
3.3 Discussion
4. Combining sources
4.1 Linking units
4.1.1 Reasons for linking
4.1.2 Linking units and linking variables
4.1.3 Linking methods
4.1.4 Suggestions for improvements on methods
4.2 Examples of combining sources
4.3 Discussion
5. Dealing with errors
5.1 Error dimension – Coverage
5.2 Error dimension – Comparability over time
5.3 Error dimension – Measurement errors
5.4 Model errors (raw data to statistical data)
5.5 Discussion (Link with Quality deliverable and relation with user needs)
Throughput Phase 2
6. Inference
6.1 General considerations
6.2 Big Data as the main source of input
6.3 Conceptual differences
6.4 Representativity (selectivity)
6.5
6.6 Accuracy assessment (Bias and variance)
6.7 Additional considerations
6.8 Discussion
7. Conclusions
References
Annexes
Annex A: BREAL (Big data Reference Architecture and Layers)

0. Prologue/Starting point

This report provides an overview of the current state of the art of Big Data methodology. As we have found that the range of what one considers Big Data methodology depends tremendously on the experience and area of expertise of a particular statistical researcher, we decided to apply a very strict (narrow) focus on what is included in this report. The essential starting point of any method described in this report is that it:

i) has been used by researchers involved in the ESSnet Big Data I and II projects,
ii) is included in one of the Big Data based official statistics currently being published,
iii) is used in one of the Big Data based statistics that is in the process of becoming officially published.

Any other method, however interesting it may seem, is not included in this report. For the reader it is important to realize that the methods included in this report are divided into various (sub)sections that relate to the phases of the statistical process in which they are used. The same phases, input, throughput (1 & 2) and output, were used in the quality guidelines report (WPK Del. K3) published earlier.

1. Overview of Big Data based official statistics

At the moment of writing, 13 examples are known of statistics published by National Statistical Institutes (NSI's) that make use of Big Data. These are listed in table 1. Unfortunately, not all processes have completely matured yet and only two are in production. These numbers will undoubtedly increase in the near future. This report provides an overview of the current state of the art of the methods that have been applied in the ESSnets Big Data I and II and the methods that are currently used for Big Data in official statistics, either in production or as published experimental statistics. The goal of this report is to lay the foundation for statistical Big Data methodology. All methods discussed have actually been used in one or more of the statistics listed in table 1.

Table 1. Overview of Big Data based official statistics

1. Consumer Price Index (in production, multiple countries)
2. Traffic intensities (in production, NL)
3. Online job vacancies (towards implementation, ESSnet)
4. Enterprise characteristics (towards implementation, ESSnet)
5. Electricity/energy consumption (towards implementation, ESSnet)
6. Maritime and Inland waterway statistics (towards implementation, ESSnet)
7. Financial transaction based statistics (exploratory, ESSnet)
8. Earth observation derived statistics (towards implementation, ESSnet)
9. Mobile network derived statistics (towards implementation, ESSnet)
10. Innovative tourism statistics (exploratory, ESSnet)
11. Innovative company websites (towards implementation, NL)
12. Social mood on economy index (published experimental, IT)
13. Mobile phone derived outbound tourism (experimental, AT/FI/Estonia)

The first two statistics are in production. For the Consumer Price Index (CPI) either scanner data, web scraped prices or the combination of both sources is used. This statistic makes use of multiple sources, including surveys, that provide the price of products and (when available) the number of items sold. Usually, monthly statistics are produced (Eurostat 2017, Griffioen and ten Bosch 2016). Scanner data is being used by the statistical offices in Austria, Belgium, Denmark, France, Italy, Luxembourg, the Netherlands, New Zealand, Norway, the United States, Sweden and Switzerland. In a considerable number of other countries, its potential is being investigated or access to scanner data is being negotiated. Examples of such countries are Denmark, the United Kingdom and Portugal. Prices scraped from the web are part of the CPI in a considerable number of countries, for example Italy, the Netherlands and the United Kingdom. Many other countries are considering using them, such as Brazil and Belgium (Del K1 2019).

The second Big Data based statistic in production is the Traffic intensity statistics of the Netherlands. The national statistical office of the Netherlands was the first to produce an official statistic completely based on Big Data (Statistics Netherlands 2015). For this statistic, vehicle counts produced on a minute-by-minute basis by the 20,000 road sensors on the Dutch highway network are used (Puts et al. 2019). The subsequent 8 statistical applications in table 1 are or have been investigated in the ESSnet Big Data. The majority of them focus on implementation in the near future by a number of NSI's. For these statistics the typification matrix (Del. K7) has been filled in. This enabled us to get an overview of the most important methodological issues encountered in the ESSnet projects. The most important issue reported by the respondents was that, depending on the level of maturity of the process, it was difficult to answer all questions completely. As a result, it became clear that not all methods had fully matured yet for these examples.

An additional assessment revealed that there are a number of Big Data based (experimental) statistics conducted at various NSI's that provide important methodological input. These are: innovative company statistics based on text analysis of the webpages of both large and small companies in the Netherlands (Daas and van der Doef 2020), the social mood on economy index in Italy, a daily index based on public Twitter messages (ISTAT 2020), and mobile phone derived outbound tourism statistics in Austria (to be published with documentation in July 2020), Finland (Nurmi and Piela 2019) and Estonia (Ahas et al. 2011).

1.1 Using Big Data

It is important to indicate right from the start that when using Big Data the data in the source is given. The researcher working with the source has no influence on the way by which the data is obtained or on the units from which it is obtained. Obviously, the only clear exception to this is when the data is obtained by scraping. But in all other cases "the data (in a Big Data source) is what the data is".

In essence, a Big Data source can provide input for statistics in two fundamentally different ways. The first is the direct use of the data in a source. Here, the data provided by a Big Data source is directly related to the phenomenon studied. For instance, the locations of a vessel or the number of vehicles counted by a road sensor are used to determine trips of vessels to a country (for maritime statistics) or to calculate the traffic intensity of a road (for traffic intensity statistics), respectively. Because the data provided directly covers the needs of the NSI, one may expect that, as long as the data source remains available, its use for statistics does not pose huge problems. The large amounts of data available, however, can seriously affect the subsequent analysis part. Methods are needed to deal with these issues (see section 1.3).

The use of Big Data becomes more challenging when the data provided by a Big Data source is indirectly related to the phenomenon of interest to the researcher. This is identified as a derived use in this paper. For example, one can use vessel trips and vehicle counts to create an indicator for the economy of a country, related to the Gross Domestic Product for instance. Here, obviously, the data in the source is input for a model that eventually produces the economic indicator. Another example is the use of the text on websites to determine, for example, whether a company is innovative or not. Here, again, a model is used to derive that information. In all cases, this use is the result of a two-stage process in which the first step is the generation of the data and the second step uses a model to obtain the derived result. Clearly, anything that affects the relation between the data generated and the (derived) phenomenon studied influences the quality of the statistics based on such a source. The importance of this becomes apparent when we take a closer look at the use of Google search terms to predict the occurrence of the flu (Ginsberg et al. 2009, Lazer et al. 2014).

1.2 Lessons learned from Google Flu Trends

In 2009 a publication by Ginsberg et al. (2009) described that influenza could be detected by analysing the terms people typed into the Google search engine. By selecting an appropriate set of search terms, the researchers were able to accurately estimate the level of influenza on a weekly basis, around 1-2 weeks ahead of the official number for each region in the USA. However, the exact search terms used were not published. Clearly the search terms used for the analysis correlated with the actual flu cases registered; the correlation observed was around 0.9. The authors stated that the search queries used may not all be submitted by users who are experiencing influenza-like symptoms and that the correlations observed are only meaningful across large populations. Over time this relation remained fairly stable, until it became clear that in 2012 the Google flu prediction started to grossly overestimate the occurrence of the flu (Lazer et al. 2014). Obviously the relation between the search terms used in the model and the occurrence of the flu was no longer identical to the one observed in the work underlying the 2009 publication. There are a number of explanations for this, which are discussed in the paper of Lazer et al. (2014). The most important conclusions are i) searching for flu is not the same as having the flu, ii) both the search behaviour of people and the Google search engine may change over time and iii) one should be transparent about the exact approach and terms used.

For us, users of Big Data, it is important to realize that whenever a big data source is used to detect a derived phenomenon, there is a risk that the relation observed in the initial study may change over time. Hence, this relation should be carefully checked on a regular basis. If anything changes, the model used to derive the phenomenon of interest should be adjusted to keep the relation as constant as possible. Inspiration for this work can be obtained from the ‘concept drift’ research currently being performed in the analysis of streaming data (Lu et al. 2018).
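As an illustration of such a regular check, the sketch below monitors the rolling correlation between a simulated big data signal and a reference series and flags the weeks in which the relation weakens. The series, the window length and the alert threshold are purely illustrative and would have to be tailored to the source at hand.

# Hedged sketch of a drift check; all series, the window and the threshold are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = pd.date_range("2015-01-05", periods=208, freq="W")               # four years of weekly data
reference = pd.Series(np.sin(np.arange(208) / 8) + rng.normal(0, 0.1, 208), index=weeks)
signal = reference + rng.normal(0, 0.1, 208)                             # big data proxy tracking the reference
signal.iloc[156:] = rng.normal(0, 1, 52)                                 # simulate a break in the final year

rolling_corr = signal.rolling(window=26).corr(reference)                 # half-year rolling correlation
alerts = rolling_corr[rolling_corr < 0.7]                                # weeks where the relation has weakened
print(alerts.tail())                                                     # the flagged weeks fall in the final year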

1.3 Large n, large p and large t

In contrast to the traditional sources used for official statistics, big data contains extremely large numbers of data points. These may be the result of the inclusion of data for many units (large n), the inclusion of data for many variables (large p) or the combination of both (large n and large p). Statistical analysis of big data may thus suffer from the increase in the dimensionality of the data (Fan et al. 2014). For the majority of the work performed in the ESSnet Big Data, dealing with large n was clearly the most common situation. Data sources with large numbers of variables certainly became an issue during the text and image studies (see Chap. 2); here, selecting the most relevant features becomes challenging. In some cases the number of data points per variable over time was also very high, for instance in the studies of AIS, MNO and smart meter data. We will identify these cases as examples of dealing with a large t.

Because of these properties, using Big Data comes with a number of methodological challenges. According to the paper of Fan et al. (2014), three unique challenges are raised. These are related to the large n, large p and large t topics introduced above. The first challenge is mostly large n related, i.e. analysis of data for many units creates issues such as heavy computational cost and algorithmic instability. Many of the traditional statistical methods used (for small data) do not scale well to massive amounts of data. Analysing all data may take considerable time, which will very likely result in heavy computational costs. The other issue is the way the outcome of an algorithm is affected by small changes to its inputs. This is known as algorithmic instability. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly. Thus, the algorithms applied need to be fast, should be scalable to large amounts of data (see WP8 Del 8.3 section 3.9 for more details on this) and their outcomes should be stable. This forges cross-fertilizations among different fields including computational science, informatics, statistics, optimization and applied mathematics. When for each unit many parameters are derived, for instance during the analysis of texts, a large n issue also becomes a large p challenge.

The second challenge is predominantly large p related. Because of the large number of variables, i.e. the high dimensional properties, included in some big data sources, noise will accumulate, spurious correlations will occur and incidental homogeneity may arise. Methods are needed to better deal with these issues. Data on more variables leads to more noise, and more noise leads to a higher chance of accidental correlations between variables. In small data sets, dimension reduction techniques and variable selection methods are used to deal with these issues. However, in big data these approaches do not work satisfactorily (Fan et al. 2014; Bolón-Canedo et al. 2015). For example, classification of big data with conventional classification rules using all features may perform no better than random guessing. Another example relates to text classification: increasing the number of training examples during model development will often increase the number of features included in the final model without seriously improving the overall accuracy of the model. Furthermore, the spurious correlations introduced by noise accumulation may lead to wrong statistical inference and false scientific conclusions.
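The effect of spurious correlations can be made concrete with a small simulation (sample size and dimensions are purely illustrative): even when a target variable is independent of every candidate variable, the largest observed correlation grows as the number of variables increases.

# Small, illustrative simulation of spurious correlation under a growing number of variables.
import numpy as np

rng = np.random.default_rng(3)
n = 50                                                   # few observations
for p in [10, 100, 1_000, 10_000]:                       # increasing number of noise variables
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                               # target independent of every column of X
    Xc = X - X.mean(axis=0)                              # centre the data
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    print(p, round(np.abs(corr).max(), 2))               # the maximum spurious correlation increases with p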

The last challenge is mostly large t related, i.e. the massive amounts of data in big data sources are typically collected at different points in time and may even be obtained from multiple sources collected by different technologies. The latter creates issues of heterogeneity, experimental variations and statistical biases, and requires the development of more adaptive and robust procedures. The former leads to the same issues as mentioned under the second challenge. All will affect the inference part of the statistical process and reduce the quality of the outcomes.

1.4 Relation with Quality and Typology reports

This report focusses on methodology, i.e. the methods used to produce official statistics while using Big Data. Two other deliverables of workpackage K (WPK) provide overviews of the work on Quality (WPK Del K.4) and Typology (WPK Del K.7). The first is obviously related to the methods used, as these methods aim to reduce and compensate the negative effect of quality issues in the data (and metadata) as much as possible. The chapter on "Dealing with errors" in Big Data in this report (Chap. 5) specifically discusses this relation. The link with the Typology work is perhaps less obvious. In the Typology report a Typification Matrix is described that makes it possible to assess the maturity level of a big data source to assist the future evaluation of big data sources/projects. The information is provided by the project leader of the research/implementation process. Having gathered all this information, it should be possible not only to achieve an assessment of the Big Data project but also to trace the roadmap for the business case of the exploration of the data source. In addition, the business functions of an architecture for big data (BREAL; WPF Del F.1) are mapped and connected with the Quality and Methodology reports through the columns Challenges and Treatments. For the reader's convenience, Annex A provides an overview of the various Business functions, Life cycle and actors discerned in this architecture. To clearly indicate this relation, the Methodology chapters in this report are linked to the corresponding business functions in the BREAL architecture.

Input Phase

2. Introductory overview of methods for text and image mining

The increased focus on Big Data has greatly stimulated the use of other types of data sources in official statistics production. In particular, texts and images are increasingly being studied. Because of this, the methods applied to extract information from these sources have become part of the methodology being used for official statistics production. For this reason this document starts with an introductory overview of the methodologies commonly applied for text and image mining. Since Natural Language Processing (NLP) is an approach often mentioned for text analysis, it is also included in the text mining section.

2.1 Text mining and NLP

Text mining can be defined as the process of extracting information from textual data (Hotho et al. 2005, Jo 2019). Examples of typical text mining tasks are categorization, clustering, entity extraction, sentiment analysis and document summarization. All tasks, with the exception of the latter, have been applied in the ESSnet Big Data. In general, text mining is a special type of data mining. Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering and artificial intelligence concerned with the interactions between computers and human (natural) languages. NLP studies how to program computers to process and analyse large amounts of natural language ‘data’. Challenges in natural language processing frequently involve speech recognition, natural language understanding and natural language generation. Clearly, text mining can be considered a subfield of NLP.

2.1.1 Text

Usually text is defined as a group of sentences or paragraphs written in a natural language. Sometimes, even what is written in an artificial language, such as source code, may be included in that view, but this is ignored here. Individual words are the basic units of a text. These are combined into a sentence (grammatically) and sentences are organized into a paragraph (logically). A complete text usually consists of paragraphs of variable lengths, and information such as a date, author, title, and so on may be attached to it. However, this may not be the case for all texts. For example, social media messages usually lack a paragraph structure. Words, sentences and paragraphs will be briefly discussed below. The whole text is often described as a document in many text mining tasks.

Words: A word is considered as the (meaningful) basic text unit. A single word may consist of several characters, but the characters have no meaning on their own; a word has. A group of grammatical words known as stop words are only used for grammatical functions, such as “a” or “the”, and do not contribute (much) to the meaning of a text. For this reason, they are usually excluded in many text mining tasks. Words are combined into a sentence by rules known as grammars. In many languages, the words are separated by white spaces; the process that splits a sentence into words is called tokenization. However, this approach cannot be used for all languages; exceptions are, for example, Chinese and Korean. In these languages white spaces cannot be used to split characters without affecting the meaning of a sentence.

Sentence: Each sentence starts with a capital letter and ends with a punctuation mark, such as a period, question mark or exclamation mark. A sentence may have a single clause, which basically consists of a subject and its verb, or several clauses, depending on its type. A sentence may be decomposed into clauses.

Paragraph: A paragraph is defined as an ordered set of sentences, maintaining consistency within a particular subtopic. A paragraph is decomposed into sentences by punctuation marks. A paragraph usually starts with an indentation and ends with a carriage return, at least in many European texts. Each paragraph has a variable length. Note that sentences are combined into a paragraph by logical order or semantic rules, rather than by grammar.

Document: The overall text, such as a news article, a patent, a letter, the content of a webpage or an abstract, is usually referred to as a document. A document can consist of a single sentence, a single paragraph, a group of paragraphs, or several groups. In general, a document is considered the smallest unit for a typical text mining task. Many text mining scholars distinguish between the processing of large, medium and small sized documents. For larger documents fairly standardized (pre)processing steps and sequences have emerged to fulfil particular text mining tasks (described below). For small documents such standardization is often not possible. Here, it is required to determine the information loss caused by each individual step in the process.

2.1.2 Text mining tasks

As mentioned above, typical examples of text mining tasks are text categorization, clustering, entity extraction, sentiment analysis and document summarization. Since only the first four tasks have been performed in the ESSnet Big Data, the latter is not discussed here. For the categorization and clustering tasks preprocessing of the text is usually performed.

2.1.2.1 Preprocessing steps

Certainly for large documents, texts are usually processed in a standardized way. First, the relevant part of the text is extracted from the document. This is especially important for documents that contain a lot of layout-related code, such as webpages and pdf-files. Next, the language of the remaining (raw) text is determined. There are a number of libraries available for that, such as cldr and textcat in R and langdetect in Python. Depending on the methods implemented, results may vary. Subsequently, the text is converted to lower case and all punctuation marks, numbers and single character words are removed. Next, depending on the language detected, the stop words are removed. In R the tm package contains this functionality, while Python has NLTK for that. Next, the different morphological variants of the remaining words are mapped to their base form. This can be done via stemming or by lemmatization. Both stemming and lemmatization generate the root form of the words processed, but a stemmed word might not be an actual word whereas the lemma certainly is. To enable the latter, lemmatization uses a large text corpus to find the correct lemmas, which makes this step considerably slower compared to stemming. Stemming can be done in R with the tm package and for Python the NLTK or SnowballStemmer library can be used. One needs to check if a stemmer for the specific language of the text is available. For lemmatization, obviously, language specific libraries are needed. Here, the choice is much more limited. In R, the koRpus package provides this functionality, while in Python the WordNetLemmatizer in NLTK can be used. The words remaining after all preprocessing steps are applied form the starting point for a number of text mining tasks.
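A minimal sketch of such a pipeline in Python is shown below. It assumes that the langdetect and NLTK packages are installed and that the NLTK stopwords corpus has been downloaded; the language mapping and the example sentence are purely illustrative.

# Illustrative preprocessing sketch: language detection, lower casing, removal of
# punctuation/numbers/single characters and stop words, followed by stemming.
import re
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

LANG_NAMES = {"en": "english", "nl": "dutch", "de": "german"}   # extend as needed

def preprocess(raw_text):
    lang = LANG_NAMES.get(detect(raw_text), "english")          # detected language, default English
    text = re.sub(r"[^a-zà-ÿ\s]", " ", raw_text.lower())        # keep letters only
    tokens = [t for t in text.split() if len(t) > 1]            # drop single-character words
    stop_words = set(stopwords.words(lang))
    tokens = [t for t in tokens if t not in stop_words]
    stemmer = SnowballStemmer(lang)                             # check language support first
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The prices of these products were scraped from 12 webshops."))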

Although a similar approach can be applied to the text in small documents, it is essential to keep an eye on the loss of information during each step. This is best illustrated by an example. When studying Twitter messages to detect those in which persons indicated that they would like to move to another house, it was found that the removal of stop words and stemming had a deteriorating effect on the information content of these messages. The accuracy of prediction dropped from 87% to 60% when these steps were included. It was found that the effect of stop word removal was mainly caused by the removal of the words ‘I’ and ‘me’; these are included in the stop word lists of many languages. Since these words indicate that the content of the message referred to the person writing the message, it is clear that they contain important and relevant information. As a consequence, a tailor-made (short) stop word list was prepared for this task and stemming was not performed. All in all, this illustrates the need for a tailor-made approach when studying smaller documents.

2.1.2.2 Text categorization and clustering

When documents are classified or clustered they are usually preprocessed prior to analysis. After that, an approach commonly referred to as a Bag of Words approach is often applied. Here, how often a word occurs in a document is counted and the frequency of each word compared to the other words in the document is determined. Because the order of the words is irrelevant in this approach, it is referred to as a Bag of Words approach. Usually a slightly more advanced frequency measure, i.e. term frequency-inverse document frequency (tf-idf), is used. Tf-idf is generally considered the best way to identify words that characterize the topics in a text, as it reduces the effect of words that occur in many of the documents studied (Gentzkow et al. 2019). Next, usually a document-term matrix is created in which the rows correspond to the individual documents and the columns to the words selected from the text collection. To reduce the size of the matrix, often only words that occur in a minimum number of documents (e.g. at least 100) are included. For each word in the preprocessed text, the corresponding cell contains its tf-idf value; many of the values will be zero. The scikit-learn library (v 0.21.2) in Python can be used for this. To this matrix other variables can be added as binary features. For text classification, the document-term matrix is usually the starting point for a machine learning algorithm that tries to find the optimum between the tf-idf values of the features and the correct classification of the examples included in the training set. Apart from single words, combinations of two subsequent words, so-called bigrams, can also be included as features in the document-term matrix. For clustering a similar approach can be applied, except that no predefined class information has to be added and that a somewhat different machine learning algorithm may be applied.
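A minimal sketch of building such a document-term matrix with scikit-learn is given below; the tiny corpus and the min_df threshold are illustrative (in practice a threshold such as 100 documents would be used).

# Illustrative tf-idf document-term matrix with single words and bigrams as features.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "innovative software company developing new products",
    "family bakery selling bread and cakes",
    "startup building innovative machine learning products",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2),   # unigrams and bigrams
                             min_df=1)             # in practice e.g. min_df=100
dtm = vectorizer.fit_transform(docs)               # sparse matrix: rows = documents, columns = features

print(dtm.shape)
# Recent scikit-learn versions; older releases (such as 0.21) use get_feature_names() instead.
print(vectorizer.get_feature_names_out()[:10])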

As a very interesting extension to this, word embeddings can be used as features for classification or clustering. Word embeddings are based on word co-occurrences, and they have been found to often enable an improved extraction of topic information by encoding the semantic and syntactic information of words (Allen and Hospedales 2019, Li and Yang 2018). This may improve classification considerably, possibly in combination with the Bag of Words approach described above. In Python, word embeddings are implemented in the gensim library (v 3.4.0).

2.1.2.3 Entity extraction and more

When entity extraction is performed the texts in documents are usually studied as is, i.e. no preprocessing is performed. Here, the semantic structure of a sentence is used to detect possible entities, such as persons, companies and places. This is an NLP task that is (obviously) language specific and hence requires language specific entity recognition software. For English and a number of other widely spoken languages such software is usually available on the web. A good starting point in Python is the spaCy library, which supports a considerable number of languages. In R, OpenNLP is a starting point.
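As a brief illustration, the sketch below applies spaCy's pretrained English pipeline to a single sentence. It assumes that the small model en_core_web_sm has been downloaded; the sentence is illustrative.

# Illustrative named-entity extraction with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")                 # assumes: python -m spacy download en_core_web_sm
doc = nlp("Statistics Netherlands published new traffic figures for Amsterdam in 2019.")

for ent in doc.ents:
    print(ent.text, ent.label_)                    # entity types such as ORG, GPE, DATE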

For some simple extraction tasks, regular expressions (RegEx) can be used very efficiently. A regular expression defines a search pattern as a specific sequence of characters. RegEx are part of nearly every programming language. An example of a RegEx that is able to identify two subsequent words, separated by a space, that each start with a single upper case letter is:

“[A-Z]{1}[a-z]+\s[A-Z]{1}[a-z]+”

When applied to a text, this will enable the identification of persons, companies and probably also some street names. A similar approach can be applied to, for instance, detect phone numbers, bank account numbers, URLs and more.
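Applied in Python, the pattern above could be used as follows. The sentence is illustrative; such simple patterns inevitably also match non-entities, for example two capitalized words at the start of a sentence.

# Illustrative use of the regular expression shown above.
import re

pattern = r"[A-Z]{1}[a-z]+\s[A-Z]{1}[a-z]+"        # two capitalized words separated by a space
text = "The report by Piet Daas was presented at Statistics Netherlands in The Hague."

print(re.findall(pattern, text))                   # ['Piet Daas', 'Statistics Netherlands', 'The Hague']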

2.1.2.4 Sentiment analysis

Determining the sentiment of a given text is another often applied NLP task. This is sometimes also referred to as opinion mining. Here, the ‘polarity’ of a text is determined, which is usually indicated as positive, negative or neutral. More advanced sentiments, such as angry, sad and happy, can also be determined. After some preprocessing steps, a researcher usually studies the occurrences of individual words in the text for which the polarity is known, after which the overall sentiment of the text (or sentence) is determined. For this a list is needed that contains words classified as either positive or negative or classified on a range scale. By combining the overall score of the words, according to some predefined rules, the net sentiment for the whole text is derived. In general, this provides useful results, but it is challenging for cryptic and sarcastic sentences. In R the tidytext package can be used and in Python the NLTK library can be used for this task.
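A minimal sketch of lexicon-based sentiment scoring in Python is given below, using NLTK's VADER lexicon (English only; it assumes the vader_lexicon resource has been downloaded with nltk.download). The example sentences are illustrative.

# Illustrative lexicon-based sentiment scoring with NLTK's VADER analyser.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for text in ["The new statistics portal works great!",
             "The figures were published far too late."]:
    scores = sia.polarity_scores(text)             # neg/neu/pos shares plus a compound score in [-1, 1]
    print(text, "->", scores["compound"])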

2.1.3 Examples of text mining applications

Texts have been studied in a number of workpackages of the ESSnet Big Data.

In WP1/B the online vacancy texts were studied to link them to company names in the Business registers (Del 1.3, p. 81), to compare the vacancy texts in multiple sources to find duplicates (WP1, Del 2.2, p. 11 & section 4.3.2) and to extract education, qualifications and skills (Del. B1, p.7).

WP2/C focussed on enterprise web sites with the following subtasks: creating an inventory of the URLs of enterprise web sites, detecting E-commerce, locating job vacancy ads, detecting social media presence and extracting relevant information from the web site. Each of these tasks requires the analysis of texts, for instance to compare company names and URLs, to detect words specific for E-commerce, to detect vacancies and to compare social media usernames and profiles with company names. An overview of this work can be found in Del 2.2 of WP2 and in Del C2, p. 37. In WP7 social media were studied and user profiles and messages were analysed, amongst others to determine their sentiment (Del 7.7, chap. 3). For Italy such an indicator related to the economy, based on publicly available Twitter messages, has been published (ISTAT, 2020).

In WPG the free text fields in bank transaction data were studied. It was found that these descriptions are not always very informative (see Del G1, p.72).

Another typical example of a text-based Big Data statistic is the detection of innovative companies derived from the text on their web sites. Because of this approach all companies with a web site, including (non-surveyed) companies with only a limited number of employees such as startups, could be studied and classified (Daas and van der Doef, 2020).

2.2 Image mining

Image mining, also referred to as computer vision, aims to automatically extract patterns from the raw data of images. Raw images or image sequences with a low level pixel representation are processed to enable the extraction of high level objects and their relationships. In general, two approaches are usually applied: i) the Bag of Visual Words approach (Szeliski, 2010) and ii) Deep Learning (Rosebrock, 2019). Both will be discussed below.

2.2.1 Bag of visual words

The general idea of the ‘bag of visual words’ (BOVW) is to represent an image as a set of features. This is very similar to the bag of words approach used for texts (described in section 2.1.2.2). For images, however, the features do not consist of words but of keypoints and descriptors. Keypoints are the “stand out” points in an image, and usually information on their position and coverage area is provided. These are general characteristics and do not enable one to determine how different or similar one keypoint is to another. For example, suppose you have an image of a duck and another image of exactly the same duck but twice the size. In both images, the extracted keypoints will be the same (the same parts of the duck) but their locations differ because of the difference in size of the images. Hence, one is unable to compare these images based on the information provided by the keypoints alone. Here, descriptors come into play. A descriptor is the description of a keypoint; it summarizes, in vector format (of constant length), some characteristics of the keypoint. For example, it could be the intensity in the direction of the keypoint's most pronounced orientation. As such, it adds a numerical description to the area of the image the keypoint refers to. What is special about descriptors is that they are independent of the keypoint's position, are robust against image transformations and scale independently. Because of these properties, the information provided by descriptors enables one to conclude that the two duck images mentioned above actually show the same object, though at a different size. Detecting keypoints and extracting descriptors in an image can be done by using feature extractor algorithms; examples of the latter are SIFT and KAZE.
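A minimal sketch of extracting keypoints and descriptors with OpenCV is shown below. It assumes that opencv-python is installed and that duck.jpg is a local image file; SIFT could be replaced by KAZE or another feature extractor.

# Illustrative keypoint and descriptor extraction with OpenCV's SIFT implementation.
import cv2

img = cv2.imread("duck.jpg", cv2.IMREAD_GRAYSCALE)     # assumes the image file exists
sift = cv2.SIFT_create()                               # alternatively: cv2.KAZE_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints))                                  # number of detected keypoints
print(descriptors.shape)                               # (n_keypoints, 128): fixed-length SIFT descriptors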

By creating a visual dictionary of the keypoints and descriptors of images, one is able to perform image mining tasks such as image comparison, image classification, image clustering and detecting objects in images. Recent developments in machine learning, i.e. the rise of Deep Learning, have greatly decreased the interest in BOVW-based approaches.

2.2.2 Deep Learning

Deep learning is a subset of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Similarly to how humans learn from experience, a deep learning algorithm performs a task repeatedly, each time tweaking it a little with the aim of improving the outcome (LeCun et al. 2015). It is called ‘deep learning’ because the convolutional neural networks used for this task have various (deep) layers that enable learning. To learn, large amounts of data and a considerable amount of computing power are needed.

The recent popularity of deep learning started in 2012 when the seminal AlexNet (Krizhevsky et al., 2012) outperformed all other classifiers on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin of about 11 percentage points over the second-best entry. In the ILSVRC, computer vision algorithms are presented with a classification challenge in which they need to distinguish 1,000 different categories. In subsequent years, newly developed convolutional neural networks were used to outperform the benchmark set by AlexNet and achieve almost superhuman performance (Geirhos et al., 2017). AlexNet and its successors showed that deep convolutional networks trained with stochastic gradient descent (or variations of it) make excellent end-to-end classifiers. These end-to-end classifiers can be trained with only the raw image data as input and the corresponding classification label as output. The features needed to perform the classification task are learned from the data, bypassing the need for specifying hand-crafted features. The trade-off of using convolutional neural networks is that they need a much larger amount of data compared to traditional computer vision with hand-crafted features. However, the performance of convolutional neural networks often drastically outperforms the traditional computer vision approaches.

One of the first successful applications of convolutional neural networks was a document recognition system developed by LeCun et al. (1998). In contrast to fully connected networks, convolutional neural networks contain specialized spatial layers: (1) convolutional layers that apply filter kernels to the image by sliding them over the image and calculating results for each position in the image, and (2) pooling layers, which summarize the results coming from the convolutional layers by retrieving the maximum or average value in a certain receptive field. By looking at local patches of the image and commonalities in between them, convolutional neural networks can learn filter weights that look at edges, textures and colours, and can group them in hierarchies that look for specific details in an image. Because filter weights are shared across the whole image, far fewer weights are needed than in a fully connected scenario where each input node of the network would correspond to one pixel in the input image. Initially, convolutional neural networks, like AlexNet, consisted of subsequent convolutional layers organized in a linear fashion. Soon after, however, convolutional networks like Inception/GoogleNet, ResNet and Xception were developed that employed specialized modules and non-linear paths. In recent years, convolutional neural networks have brought about breakthroughs in the processing of images, video, text, speech and audio. Because of these developments, deep learning is currently successfully applied in automatic translation, speech and facial recognition, object detection, the creation of fake news videos, and in many other domains such as drug discovery and genomics. In Python, deep learning studies are performed using TensorFlow and PyTorch.
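The sketch below shows a minimal convolutional neural network in PyTorch, purely to illustrate the convolutional and pooling layers described above; the layer sizes, input resolution and number of classes are illustrative and not taken from any of the ESSnet applications.

# Minimal, illustrative convolutional network with convolutional and pooling layers.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # filter kernels slide over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling summarizes local patches
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)

    def forward(self, x):                                # x: (batch, 3, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

model = SmallConvNet()
dummy = torch.randn(4, 3, 32, 32)                        # a batch of 4 RGB 32x32 images
print(model(dummy).shape)                                # torch.Size([4, 10])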

2.2.3 Examples of image mining applications

Images have been predominantly studied in WP7 and WPH of the ESSnet Big Data. In both WPs, satellite images and aerial pictures were studied to detect crops, seasonal changes, soil properties, map tillage activities, detect urban sprawl and the quality of housing, determine land cover, and detect air pollution. In some cases, the findings have been enriched by combining them with administrative data. Results are described in Del. WP7.3, WP 7.7 (Chap. 5) and WPH1.

In some WP’s the potential use of images is mentioned. Examples of this are using image data to detect traffic flows in Italy (WPJ, Del J2., p.9) and the use of images, such as logo’s, on company web pages for Enterprise statistics (WP2, Del 2.2, p. 61).

2.3 Discussion

The rise of Big Data has greatly stimulated the study of alternative data sources by NSI's. This has led to an increased interest in methods capable of extracting information from texts and images. At the moment of writing, web pages and social media messages are the predominant text-based data sources studied. The work on images predominantly focusses on satellite and aerial pictures. This work has, amongst others, led to an increase in the application of machine learning and deep learning methods. Since these methods provide interesting results but the way these findings are obtained is not (always) very transparent, it is expected that future work by NSI's will increasingly focus on the application of Artificial Intelligence and include studies on algorithmic fairness and transparency (Helwegen and Braaksma, 2020).

Throughput Phase 1

3. Big Data exploration

To enable the application of Big Data, large amounts of data need to be efficiently checked. This is the case for both exploratory studies and for production purposes. The technique known as Exploratory Data Analysis (EDA) focusses on this task and touches upon pattern recognition. EDA includes one or more of the following tasks:

D Describe data distribution: how often does a value or a combination of values occur? What is the dependence on other variables?
G Find groups: is the distribution a mixture of distributions of different ‘natural’ groups in the data? If so, the data can be split.
O Identify outliers: find improbable extreme values. Are they errors? Determine if they should be excluded or corrected in the analysis.
M Missing values: how many values are invalid or missing? Determine what the impact is on the analysis.

Visualization has proved to be a very useful tool to explore and summarize data. The most famous example to illustrate this is Anscombe's quartet. This quartet comprises four data sets that have (nearly) identical simple descriptive statistics, yet reveal very different distributions when plotted (Anscombe 1973, Figure 1).

Figure 1. Plots of Anscombe’s quartet revealing the clearly different composition of the four datasets.
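The quartet's identical summary statistics can be verified directly; the sketch below uses the copy of Anscombe's quartet bundled with seaborn (assumed to be installed) to show that the four datasets share nearly identical means, variances and correlations.

# Summary statistics of Anscombe's quartet: nearly identical despite very different shapes.
import seaborn as sns

df = sns.load_dataset("anscombe")                      # columns: dataset, x, y
print(df.groupby("dataset").agg(x_mean=("x", "mean"), x_var=("x", "var"),
                                y_mean=("y", "mean"), y_var=("y", "var")))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))   # ~0.816 for all four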

However, the traditional exploration methods have been developed on relatively small data sets and not on very large ones. So while it is very easy to load a relatively small dataset into a computer's memory and completely explore it, this becomes challenging when the dataset becomes very large (Tennekes and de Jonge, 2019). Hence the need for scalable methods for the exploration of large data sets. Here, a large dataset refers to data with a large number of records (n) and/or a large number of variables (p). An easy solution in the beginning is to start the exploration process with a random sample, but to get a good overview increasing amounts of data need to be visualized.

3.1 Visualization methods for large datasets

Some visualization methods have been found to be particularly well suited for the exploration of large datasets. The tasks for which they are suited are indicated between brackets.

3.1.1 Univariate data

For univariate visualizations the most informative ones are Empirical Cumulative Density Functions (O), Histogram/frequency polygon plots (DG), Kernel Density Estimators (G), Frequency plots, Pie charts or stacked bar charts (DM), Treemaps (DM) and Calendar plots (DO). The majority of the visualizations shown make use of the diamonds dataset (included in the ggplot2 R-package). To this dataset, sales dates and manufacturing dates were added to illustrate various datetime plots. For the Treemap, artificially generated business data is used. The plots are shown in Figure 2 and the findings are described below.

Histogram/frequency polygon plots (Figure 2A) show the distribution of a numerical, ordinal or datetime variable using bars. Numerical values are discretized by cutting the variable range into intervals (bins) and counting the frequencies. It is advised to try several different bin sizes to extract features from the data set. A histogram is well suited for describing a data distribution since it makes no assumptions. Figure 2A shows a histogram of the carat variable. A datetime variable is typically shown in a frequency polygon, which is the line version of the histogram.
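As a small illustration of the advice to try several bin sizes, the sketch below draws the same (simulated) variable with three different bin counts; the simulated data merely stands in for a variable such as carat, and matplotlib is used here instead of the ggplot2 code behind the report's figures.

# Illustrative histograms of one numeric variable with increasing numbers of bins.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
carat = rng.lognormal(mean=-0.5, sigma=0.5, size=50_000)   # simulated stand-in for 'carat'

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [10, 50, 200]):                  # try several bin sizes
    ax.hist(carat, bins=bins)
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()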

Kernel Density Estimator (KDE; Figure 2B) is a non-parametric technique to estimate the probability density of a numerical continuous variable. Similar to the bin width of the histogram, a KDE has a band width. The method assumes that the variable is continuous, which makes the technique less suited for exploratory purpose: it may distort data by smoothing out relevant (asymmetric) discontinuities and outliers. Figure 2B shows the KDE of the carats variable (compare with Fig. 2A).

Empirical Cumulative Density Function (ECDF) plots (Figure 2C) each value of a numerical variable against its order (on a percentage scale). The ECDF is close to the original EDA techniques that use quantiles to describe data distribution. It is well suited for finding outliers and describing where the ‘mass’ of the data is. It can therefore be used to truncate the range of a variable, or to determine an exclusion threshold for outliers based on quantiles. It is less suited for viewing the data distribution, although modes can be seen. Figure 2C shows the ECDF plot of the carats variable.

Frequency plot, pie chart or stacked bar chart (Figure 2D, E, F) show the frequency of low cardinality categorical data. Although infamous, the pie chart can be useful to show part-whole relations, but is not well suited for comparing frequencies of categories. Figure 2D shows a frequency plot, figure 2E a pie chart and figure 2F a stacked bar chart of the cut variable.

Treemap (Figure 2G) shows the frequency of a hierarchical categorical variable. This is an effective technique to summarize the frequency of a variable with high cardinality. The treemap may be coloured using a hierarchical colour scheme. Treemaps are also very useful for bivariate distributions (see next section). In figure 2G a treemap of business data is shown.

Figure 2. Overview of univariate visualizations: A) Histogram (of carats), B) Kernel Density Estimator (of carats), C) Empirical Cumulative Density plot (of carats), D) Frequency plot (of cut), E) Pie chart (of cut), F) Stacked bar plot (of cut), G) Treemap (of Business data), H) Calendar plot (of sales).

Calendar plot (Figure 2H) shows the frequency per day of a datetime variable. For variables with a range between one month and a couple of years, a calendar plot is very useful. It can be seen as a one dimensional heatmap in calendar format. High counts are considered outliers. Figure 2H shows the (artificial) sales data added to the diamonds dataset to illustrate the amount of sales during 2013.

3.1.2 Bivariate data

A scatterplot is an example of a visualization method for bivariate distributions that scales poorly and is hence not recommended for large datasets. Good visualization methods for bivariate distributions are Heatmaps (DG), Two-dimensional kernel density plots (DG), Surface plots (DG), Mosaic plots (DM), Tableplots (DOM), Treemaps (D) and Small multiples (G). An overview of these plots is provided in Figure 3 and the findings are described below.

Heatmap (Figure 3A, B) is a very powerful workhorse that copes with the shortcomings of scatter plots. Moreover, it can be applied to numerical and high cardinality ordinal, categorical and datetime variables. Numerical variables are discretized in the same way as for histograms. Counting the number of occurrences for each combination of (discretized) values results in a frequency matrix. This frequency matrix can be displayed as a heatmap. It makes no assumptions on the data. Alternatively, numeric data can also be binned and visualized in hexagons, which prevents rectangular binning artefacts. Figures 3A and 3B show the heatmaps of carats versus price and sales versus manufacturing dates of the diamonds set.

Two-dimensional kernel density plot (Figure 3C) is another visualization method that is scalable to a large number of observations, in which contour lines are drawn based on the estimated kernel densities. It is especially suitable for numerical, but also for datetime variables. It assumes continuity in both dimensions. Figure 3C shows a 2D kernel density plot of carat versus price.
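A hexagonally binned heatmap can be produced in a few lines; the sketch below uses simulated data standing in for the carat and price variables, with matplotlib rather than the R code behind the report's figures.

# Illustrative hexagonal binning as a scalable alternative to a scatterplot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
carat = rng.lognormal(-0.5, 0.5, size=200_000)             # simulated stand-ins
price = 4000 * carat ** 1.7 * rng.lognormal(0, 0.2, size=carat.size)

plt.hexbin(carat, price, gridsize=60, bins="log")          # log colour scale for the counts
plt.xlabel("carat")
plt.ylabel("price")
plt.colorbar(label="log10(count)")
plt.show()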

Surface plot (Figure 3D) is a three-dimensional plot, in which the densities are expressed by height. It is a useful tool to visualize bivariate relationships. Like the two-dimensional kernel density plot, it is useful for numerical, and can also be used for high cardinality ordinal and datetime variables. In figure 3D a surface plot of carats versus price is shown.

Mosaic plot (Figure 3E). Whereas heatmaps are especially useful for high cardinality categorical and ordinal variables, the mosaic plot may be more suitable for low cardinality variables, since the frequency distributions per category are encoded by the width or height of the rectangles, which may be easier to compare than colours. Figure 3E shows a mosaic plot of cut versus clarity. A stacked bar chart (Figure 3F) is similar to a mosaic plot, but does not show the univariate frequency distribution of the column variable. In figure 3F clarity versus cut is shown.

Tableplot (Figure 3G) is a plot that actually is a combination of bivariate plots. Data from two or more variables, that can be numerical, ordinal, categorical, and datetime, are binned according to the quantiles of a numerical variable. For each variable that is either numerical or ordinal with high cardinality, a bar chart with mean values per bin is plotted. For each low cardinality categorical variable or ordinal variable, a stacked bar with the frequencies of the categories is plotted. Figure 3G shows the tableplot of price, carats, cut, color, clarity, depth and table. Price is used as the sorting variable.

Small multiples (Figure 3H). A very powerful technique to explore bi- and trivariate relationships, is small multiples. Data are split according to the distinct values of one or two ordinal or categorical variables. Per (crossed) category, a small univariate or bivariate plot is shown. Figure 3H shows price versus count for each cut variable in the diamond set.

Treemaps (Figure 3I) are also very useful to visualize the relationship between a hierarchical categorical and a numerical variable. The sizes of the rectangles correspond to aggregations of the numerical variable based on the hierarchy of the categorical variable. In addition, colour can be used to encode a third, numerical, variable, typically with a diverging colour scheme. Figure 3I shows a treemap for business data growth for various business categories.

3.1.3 Other relevant visualization methods

For Big Data containing geospatial data, maps of an area or country are a logical choice. They can be combined with the visualization methods shown in Figures 2 and 3. Examples of such maps can be found in a number of ESSnet Big Data deliverables (e.g. Del 3.1, Fig. 8-11, Del. 4.2, Fig. 2-3, Del 4.3, Fig. 11-15 and Del H1, Fig. 4.6 and 4.8). A bubbleplot is also a very interesting method to display geolocated information (Daas and van der Doef, 2020, Fig. 2). For network data a number of visualizations are available, such as an alluvial diagram or a node-link diagram. In Deliverable 2 of WPJ, on pages 16-18, a number of simple network visualizations are shown.

Figure 3. Overview of bivariate visualizations: A) Heatmap (carat vs. price), B) Heatmap (sales date vs. manufacturing date), C) 2D Kernel Density plot (carat vs. price), D) Surface plot (carat vs. price with height), E) Mosaic plot (cut vs. clarity), F) Stacked Bar chart (clarity vs. cut), G) Tableplot (of price, carat, cut, color, clarity, depth and table), H) Small multiples (price vs. count per cut), I) Treemap (of business data growth).

3.2 Examples of visualization methods used

In the ESSnet a number of Big Data visualization methods have been applied. Since a lot of this work is of an exploratory nature, not all of them are included in the deliverables of each WP. For example, creating small multiples of the number of vehicles detected per minute per road sensor is very insightful to get an idea of the quality and distribution of the data, but no such plot has been included in any of the deliverables of WP6 of the ESSnet Big Data 1. However, a number of insightful plots have been included in the deliverables, specifically in WP 1/A, WP 3/D, WP 4/E and WP H. These are discussed briefly below. Workpackage 1 uses pie charts in a number of deliverables, such as Deliverable 1.3 (Figures 3, 4 & 6 on pages 41 and 43), and geolocation maps in Deliverable 1.1 (Figure 5 on page 25) and Deliverable 2.2 (Figure 6 on page 35).

In the research on smart energy (WP3/D) a heatmap of Canadian hourly energy consumption per day was shown in Figure 4 (Del 3.1, page 10). In Figure 7 (Del 3.1, page 19) the weekly patterns of energy consumption are shown to illustrate the differences between household and business consumption patterns in Estonia. Deliverable 3.1 also contains a number of geographical maps showing various energy consumption patterns (Figures 8-11).

WP4/E focusses on AIS data. Here a number of maps are used to illustrate the trips ships make (Del 4.3, Figure 6 & 8, page 16 and 17), the location of various harbours (Del 4.3, Figure 7, page 16) and the movement of ships therein (Del 4.2., Figure 3, page 11). A plot illustrating the many geolocation errors in AIS data has not been included in any of the deliverables but has been shown in a number of presentations on this topic during ESSnet BD meetings.

Figure 4. Plot of the raw geolocation data of ships in the European AIS data studied in WP4 of the ESSnet Big Data. The plot reveals that sometimes ships are located on land, which is clearly an error.

In WP H Earth observation studies are described in which satellite images and admin data are combined. It should come as no surprise that Del H1 contains a number of maps illustrating the content of the satellite images (Figure 4.6 and 4.8 on page 15 and 16), maps showing the combination of satellite and admin data (Figure 4.9 and 4.10 on page 17 and 18) and some delineating city maps (Figure 5.1 and 5.2 on page 29 and 30).

3.3 Discussion

This chapter is all about the importance of visualizing data to extract information and to obtain insights. Humans need this because a visual summary makes it easier to identify patterns and trends in data, especially when the amount of data is large; it fits the way the human brain works. The examples discussed above also indicate that visualization methods can be applied to discover errors in the data.

4. Combining sources

The statistical offices are increasingly using multiple alternative sources jointly for the production of statistical indicators. Not only does this approach make it possible to increase the information otherwise available from a single source, but it also increases the quality of each single source, improving both the quality of its measurements and of its coverage in terms of statistical target units. Indeed, the combination of sources allows the NSIs to increase the amount of possible analyses of joint variables that are never observed in the same source, without increasing the statistical burden on respondents. It also allows NSIs to evaluate the quality of measurements of the same variable by comparing the different sources. In fact, when the same variable is measured in multiple data sources on the same units, the multiple observations can be used to identify and correct errors and to harmonize variable definitions. An approach that is often used in practice at NSIs is micro-integration (Bakker, 2011); alternatively, it may be possible to model the measurement errors, e.g. Scholtus, Bakker and Van Delden (2015), Guarnera and Varriale (2016). At the same time, when a single source is not complete in terms of the recorded units of the target population, combination of sources is a standard strategy to address the issue. See for example Wolter (1986) and Ding and Fienberg (1994) for population size assessment, and the Guidelines on the use of registers and administrative data for population and housing prepared by the Task Force on register-based and combined censuses. Combination of sources can also be useful to adjust for the selectivity often present in Big Data sources (see Di Consiglio and Tuoto, Chapter 6 in Deliverable 2.2 of the Makswell project, 2020).

The combination of sources can be at unit or aggregate level. When the combination is through record linkage we have different scenarios: when unique unit identifiers are available, such as unique personal identification numbers, deterministic linkage can be used (Herzog, Scheuren and Winkler, 2007); otherwise probabilistic linkage might be used. The classical paper on probabilistic linkage is by Fellegi and Sunter (1969). However, when a combination of sources is carried out by linking units, for example to adjust for coverage or selectivity of a single source, the challenge when dealing with Big data is particularly complex. First, big data records usually differ from the target statistical units and, moreover, contain very little identifying information. An analysis of common difficulties faced when linking big data with survey and administrative data is given in Tuoto et al. (2018). In fact, a very relevant issue when treating big data sources is first the identification of the target statistical units from the records of the big data source, since sometimes events are recorded instead of the target units observed in survey sources. The effect of errors in this process, such as duplication of units and/or wrong association of records to the same statistical unit, can make the application of statistical methods for analysing variables in different sources, to improve the coverage and adjust the selection bias in big data, very challenging. A discussion of these challenges can be found, for example, in deliverable 2.3 of the MAKSWELL project (Makswell, Del 2.3). The effect of linkage errors on the estimation of population size is reviewed e.g. in Di Consiglio, Tuoto and Zhang (2019), whereas a sensitivity analysis of the effect of linkage errors on linear and logistic regression is given in Di Consiglio and Tuoto (2018).

4.1 Linking units

4.1.1 Reasons for linking All the WPs recognize that alternative data sources need to be integrated into the production of official statistics for these data to gain solid ground; they therefore adopt and recommend linking activities for the objectives identified in the previous paragraph. Unit deduplication is explicitly considered for the online job advertisements in WPB; this operation is included in phase 1 of the throughput phase, i.e. in the transformation of raw data into statistical data, while the other linking activities (combination with other sources) are executed in phase 2 of the throughput phase, once the raw data have already been transformed and are usable as statistical data.

In WPB, linkage is proposed between online job advertisements and Job Vacancy Survey microdata and/or Statistical Business Register data, in order to understand coverage issues of the scraped OJA data, since for many reasons OJAs cannot be considered exhaustive of the total number of job vacancies that exist at a specific reference date. In WPC - web scraping of enterprise characteristics - the importance of identifying websites and companies in the business registers is underlined. For the scraping, the authors often start right from the list of companies in the register, so the identification activities consist mainly of checking whether the link obtained from the web is the correct one or not. In WPD, smart meters for electricity consumption, the linkage between smart meter data and the business register is motivated by the need to capture the NACE (Nomenclature of Economic Activities) code for local units, to produce consumption statistics by economic activity. In WPE, AIS data, the purpose of the linkage between AIS data and administrative data on vessels provided by Maritime Transport Statistics is to identify the type and size of the vessels present in the AIS data, to produce statistics according to the disaggregation required by Eurostat regulations. WPG, financial transactions, recognizes the increased value of financial transaction data when combined with national account data and business statistics data. WPH, earth observation data, supports the added value of combining several primary and administrative sources with earth observation data. The combination of sources is recognized as crucial; generally it is proposed at area level or via geocoding/georeferencing. These activities are often work in progress or still under study in many NSIs. WPJ, tourism statistics combining several sources, is explicitly focused on developing methods for combining several sources to produce tourism statistics. The most widely tested combination is that of data obtained from web scraping with the survey frame of tourist accommodation establishments, with the aim of using the web scraped data to improve the frame of the survey of tourist accommodation establishments.

4.1.2 Linking units and linking variables A common aspect of the linkage procedures in the different WPs is that they involve businesses/enterprises as linking units. This makes the linkage quite difficult and problematic, due to the intrinsic complexity of real-world units such as enterprises and the different ways in which they can be described in different sources. In fact, especially for large enterprises with many establishments and local units, it is possible to distinguish and identify legal units, administrative units, physical units and statistical units. Moreover, some enterprises are subject to a high degree of change, with closures and reopenings, mergers and acquisitions, and splits. In addition, when dealing with job portals for vacancies, as in WPB, one should be aware that jobs are often advertised through private employment agencies and the employer is usually not identified in the advertisement, or only identified indirectly. These factors jeopardize the success of the linkage.

For business identification, most of the WPs use strong identifiers such as register number, name and address (including building, postal code, and longitude and latitude coordinates). The complexity and inconsistency of the name and address fields are the biggest obstacle. Therefore, geographical coordinates seem to be a promising solution, encouraging the use of technology that allows geolocation of address data.

Linkage results based on address as a linking variable are sometimes affected by uncertainty when several big data units (e.g. metering points in WPD) or several statistical units (e.g. companies in WPD, tourism accommodations in WPJ) share the same address.

Due to the intrinsic complexity of the units and variables involved in the linkage, it is largely recognized that, in order for the linking activities to be effective, additional data pre-processing and additional tools are required, in terms of standardization, coding and parsing of the information obtainable from the big data sources.

4.1.3 Linking methods The methodology used to execute deduplication and linking activities is mainly record linkage. The methods used are often based on deterministic decision rules and sometimes on machine learning classification techniques (logistic regression, classification trees).
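
As an illustration of the deterministic part of such a strategy, the minimal Python sketch below links a scraped data set to a business register extract on a standardized name/postcode key. The cleaning rules, column names and example records are hypothetical and are not taken from any WP implementation; in practice, records failing the deterministic rule would typically be passed on to a probabilistic or machine learning comparison step.

```python
import pandas as pd

def standardize(name: str) -> str:
    """Crudely standardize an enterprise name: lower case, drop punctuation
    and common legal-form tokens (hypothetical rules for illustration only)."""
    name = name.lower()
    for token in ["b.v.", "bv", "ltd.", "ltd", "gmbh", ",", ".", "-"]:
        name = name.replace(token, " ")
    return " ".join(name.split())

# Hypothetical example data: a scraped source and a business register extract.
scraped = pd.DataFrame({
    "company_name": ["ACME B.V.", "Foo-Bar Ltd."],
    "postcode": ["1234AB", "9876ZZ"],
})
register = pd.DataFrame({
    "legal_name": ["Acme BV", "Foo Bar Ltd"],
    "postcode": ["1234AB", "9876ZZ"],
    "register_id": ["NL001", "NL002"],
})

# Deterministic decision rule: exact agreement on standardized name + postcode.
scraped["key"] = scraped["company_name"].map(standardize) + "|" + scraped["postcode"]
register["key"] = register["legal_name"].map(standardize) + "|" + register["postcode"]
linked = scraped.merge(register[["key", "register_id"]], on="key", how="left")
print(linked[["company_name", "register_id"]])
```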

In WPG some countries identified a good solution via pseudonymization of the data, which allows register information held at the NSIs on the payer (person) and the enterprise (receiver) to be linked with the electronic transactions, using the bank account register from the tax authorities.

4.1.4 Suggestions for improvements on methods A suggestion common to the WPs is the need to involve specialists in record linkage and business registers at NSIs, given that linking activities and knowledge of enterprise structures play a crucial role in producing statistical outputs. The record linkage experts should be involved in the multidisciplinary team producing statistical output from big data sources.

A second important note regards the advantages of linking activities and the full exploitation of the linking efforts. It is well recognized that the linked sources can enrich each other from the variables perspective; however, it should be stressed that they can improve each other from the unit perspective as well. In fact, each source might be affected by under-coverage of the units. Therefore, after linking, they seem to meet the conditions for the application of proper methods to deal with under-coverage, e.g. capture-recapture methods, useful to assess the unknown target population size. Many WPs' frameworks seem particularly promising for this kind of improvement.
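
As a textbook illustration of the capture-recapture idea mentioned above (not a method prescribed by any specific WP), the simplest dual-system estimator of an unknown population size is the Lincoln-Petersen estimator:

```latex
% Lincoln-Petersen (dual-system) estimator of the unknown population size N:
% n_1 and n_2 are the numbers of units observed in the two linked sources,
% and m is the number of units found, via linkage, in both sources.
\[
  \hat{N} \;=\; \frac{n_1 \, n_2}{m}
\]
```

Its usual assumptions (closed population, independence of the two sources, perfect linkage) rarely hold exactly; in particular, linkage errors in m bias the estimate, which is precisely the issue studied in Di Consiglio, Tuoto and Zhang (2019).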

4.2 Examples of combining sources In a number of WPs of the ESSnet Big Data several data sources were combined. In WP1, various sources of online job advertisements were combined and checked for duplicates (WP1.3, p. 16). The conceptual model behind this is shown in Fig. 3 of Del. 1.2 (p. 9).

WP2/C studied enterprise characteristics. They ran into a number of methodological challenges when they combined web scraped, survey and administrative data. The need to obtain (and check) the correct URL for each company led to the development of a URL inventory approach (Appendix 7.1 in Del 2.2) in which methods were developed to check whether the correct URL (website) was obtained for each company.

In WP3/D smart meter data was studied. Here, linking smart meter data (the metering point) with administrative data proved challenging (Del. 3.2, p. 16). Both the business register and the dwelling register were used as administrative input. The goal of combining these sources was to link the observed unit (a metering point) to the statistical unit: a company, household or dwelling. Linking was especially complicated when address data was used, as there is a many-to-many connection with this key. Different approaches and strategies were developed and described (Del. 3.2).

In WP4/E a number of AIS data sources were combined that need to be enriched with administrative data on ships. The admin data contains more detailed ship data, required for official statistics. This work is currently under development (Del E1, p. 10).

In WP7, strategies for combining sources have been discussed in Del 7.1/7.2/7.3 for population statistics (p. 28), tourism and border crossings (p. 51) and agricultural statistics (p. 71). The latter is performed for satellite images and administrative data in WPH. Here, these sources are combined using geolocations (Del H1, p. 17). This resulted in an enriched crop and cadastral parcel data set to be used for classification purposes. Similar approaches were used for the other subtasks in WPH.

4.3 Discussion Combining data is important as it not only increases the range of applications of a particular data source but also stimulates the extraction of new information. Especially for big data, finding ways to combine the data with other big data sources and with the sources traditionally used by NSIs, i.e. survey and administrative data, will increase its use, as it may provide ways to get a grip on the population composition of the source. It will also increase the ways in which the big data source can be used for official statistics. Involving multidisciplinary teams in this linking effort is a suggestion to increase the chance of successful linking.

5. Dealing with errors Not all data in a Big Data source is of high quality. Data may be missing or may simply contain an invalid or imprecise value. In the ESSnet Big Data I and II a number of cases are described in which these errors affect potential outcomes. To enable the creation of high quality official statistics, methods are needed to identify and deal with or correct the poor quality data. This must not only happen at the variable but also at the unit level.

This section is organized along the quality dimensions considered for the throughput phase (part 1, from raw data to statistical data) in the quality guidelines (Deliverable K3) and gives examples of errors that actually occurred and the methods or strategies used to tackle them.

5.1 Error dimension – Coverage One of the most crucial questions when using Big Data for Official Statistics is the question of coverage. The definition and study of coverage errors require the definition of the target population, which should be explicitly identified in terms of type, time and place.

One example of a coverage error occurs in the form of “ghost vacancies” when web scraping for online job advertisements (see Deliverable 1.2, p. 7 of WP1, ESSnet Big Data I). These listings do not correspond to a real job vacancy and therefore constitute overcoverage for this population. Some of these “ghost vacancies” follow certain semantic patterns; therefore, through text mining and NLP it should be possible to identify and remove them.
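
A minimal sketch of such a pattern-based filter is shown below. The regular expressions are purely illustrative; the actual semantic cues of ghost vacancies are not specified here and would have to be derived from the scraped corpus itself.

```python
import re

# Purely illustrative patterns; the real cues used to detect ghost vacancies
# would have to be learned from the scraped OJA corpus.
GHOST_PATTERNS = [
    re.compile(r"\bjoin our talent (pool|community)\b", re.IGNORECASE),
    re.compile(r"\bopen application\b", re.IGNORECASE),
    re.compile(r"\bfuture (opportunities|openings)\b", re.IGNORECASE),
]

def looks_like_ghost_vacancy(ad_text: str) -> bool:
    """Flag an online job advertisement as a probable ghost vacancy."""
    return any(p.search(ad_text) for p in GHOST_PATTERNS)

ads = [
    "Senior data engineer wanted, start date 1 March, apply before Friday.",
    "Join our talent pool and we will contact you when future openings arise.",
]
kept = [ad for ad in ads if not looks_like_ghost_vacancy(ad)]
print(len(kept))  # 1: the second advertisement is flagged as overcoverage
```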

A similar phenomenon can occur for AIS data: not all ships in the data set are relevant for a certain statistical product, e.g. transport statistics. One methodological solution to deal with this overcoverage issue is to use web scraping techniques to match the ships in the data set with external registers. The information from these external registers can then be used to verify the inclusion of a certain ship in the target population (WP4 Ghost ships/Reference frame, Del 4.2, p. 13).

5.2 Error dimension – Comparability over time Another main concern about introducing Big Data in Official Statistics is the process generating the data. This may not be stable over time, as technology or behaviour might change. One data source might become irrelevant whereas another becomes important for a certain phenomenon.

Missing data can lead to problems in comparing observations over time; this is for example the case in traffic loop data (WP6 Del 6.6, p. 10). A traditional method to deal with missing data is imputation. For a big data source with a very high level of detail, it might be useful to impute not on the lowest possible level but on a macro level, to avoid precision errors due to small-scale data. In this example this was done on a monthly level for missing road sensor data.
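
A minimal sketch of this macro-level strategy, under the assumption of hypothetical hourly sensor counts and a simple interpolation rule (this is not the WP6 implementation), could look as follows:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly road sensor counts with one month completely missing.
idx = pd.date_range("2019-01-01", "2019-06-30 23:00", freq="h")
rng = np.random.default_rng(0)
counts = pd.Series(rng.poisson(50, len(idx)).astype(float), index=idx)
counts.loc["2019-03"] = np.nan  # sensor outage: March is entirely missing

# Aggregate to the monthly (macro) level first, then impute on that level,
# here simply by linear interpolation between the adjacent months.
monthly = counts.resample("MS").sum(min_count=1)   # NaN for the all-missing month
monthly_imputed = monthly.interpolate(method="linear")
print(monthly_imputed)
```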

5.3 Error dimension - Measurement errors Measurement error is the difference between the true value of the measurement and the value obtained during the measurement process. For some Big Data sources, this might actually be a physical measurement or transmission error.

In AIS data a measurement error in the position of a ship can lead to a significant error in the estimate of, e.g., port visits. For some measurements the location was quite far off, causing ships to be outside of the port in one measurement and inside mere seconds later. This problem was solved by applying a median filter over the positions over time; in this example the time interval was set to 10 minutes (WP4 Del 4.2, p. 12).
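
The sketch below illustrates the idea of such a time-based median filter on a single, hypothetical ship track using a rolling 10-minute window; it is an illustration of the principle, not the WP4 implementation.

```python
import pandas as pd

# Hypothetical AIS track of one ship: a position report every minute,
# with one clearly erroneous geolocation at 10:05.
times = pd.date_range("2020-01-01 10:00", periods=11, freq="min")
track = pd.DataFrame(
    {"lat": [51.90, 51.90, 51.91, 51.91, 51.92, 53.50,
             51.92, 51.93, 51.93, 51.94, 51.94],
     "lon": [4.48] * 11},
    index=times,
)

# Median filter over a 10-minute time window, applied per coordinate;
# the outlier at 10:05 no longer dominates the filtered position.
smoothed = track.rolling("10min").median()
print(smoothed.loc["2020-01-01 10:05"])
```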

Another example of a measurement error is noise in satellite pictures (WPH Del H1, p. 15). Similar to the previous example, a median filter is applied as a method to solve/reduce the problem. This time the filter is applied over the spatial dimensions, taking the median over a 3 x 3 window of pixels.
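
For illustration, the sketch below applies a 3 x 3 spatial median filter to a simulated single band with artificial speckle noise; the data and parameters are invented and the code is not taken from the WPH processing chain.

```python
import numpy as np
from scipy.ndimage import median_filter

# Simulated single band of a satellite image with salt-and-pepper (speckle) noise.
rng = np.random.default_rng(1)
band = rng.normal(100.0, 5.0, size=(256, 256))
band[rng.random(band.shape) < 0.01] = 255.0  # isolated noisy pixels

# 3 x 3 median filter over the spatial dimensions: each pixel is replaced by
# the median of its 3 x 3 neighbourhood, which removes isolated speckle.
denoised = median_filter(band, size=3)
print(band.max(), denoised.max())
```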

5.4 Model errors (raw data to statistical data) Big data based estimates are likely produced by models. The specifications of these models may be incorrect, which negatively affects the reliability of the estimates. More on this is discussed in section 6.6 of this deliverable.

5.5 Discussion (Link with Quality deliverable and relation with user needs) The link to the quality related deliverables will be discussed in the final version of the methodological report.

We are also considering adding a section on user needs. This is crucial when talking about quality: for one user it may very well be good enough to know that the unemployment rate is somewhere between 5 and 15 percent, but for another user that may not be good enough. The latter user may need a better precision, say unemployment being somewhere between 7 and 9 percent. Another example is the concept that we get from a big data source, e.g. online job vacancies, which may be good enough for some users while others may want the exact ILO definition (which may not be what they get anyway from a survey, due to measurement errors etc.). Striking this balance between different user needs is one of the main challenges for producers of official statistics.

Throughput Phase 2

6. Inference This important task is close to the end of the statistical process. The papers of De Broe et al. (2020) and Salgado and Oancea (2020) provide overviews of the ways in which Big Data can be used in official statistics production and of the accompanying challenges in the estimation process. Let it be clear that not all of the 13 examples listed in table 1 have already reached this stage. This makes it quite challenging to provide general recommendations based on just a few examples.

6.1 General considerations In general, it is clear that Big Data can be used in a number of ways to produce official statistics. The cases that can be discerned are:

i) Use Big Data as the most important input for the estimation process,
ii) Use Big Data as an additional source,
iii) Use Big Data as part of a combination of a number of sources that all contribute more or less equally to the final output.

A more detailed overview is possible, but the general division shown above suffices to illustrate a number of essential differences. When one looks at the 13 examples listed in table 1, it becomes clear that each of these can be assigned to a particular use case. Examples of using big data as the most important source (case i) are: traffic intensity statistics, innovative companies, AIS-based maritime statistics and electricity consumption statistics (persons and businesses). Examples of using big data as an additional source (case ii) are: AIS-supported inland waterway statistics (in the Netherlands), enterprise characteristics and online job vacancies as a proxy for vacancy statistics. Examples of using big data together with other equally important sources (case iii) are: CPI based on a combination of multiple sources (most often scanner and survey data), agricultural statistics based on satellite/aerial pictures and administrative data, and innovative tourism statistics based on multiple sources.

The methodology applied in the last two cases, for a large part, greatly resembles that traditionally used in official statistics. For case ii in particular, when survey data forms the basis, (standard) survey methodology can be applied. It stands to reason that this methodology need not be discussed in much detail here. Readers are referred to Särndal et al. (1992), Beręsewicz et al. (2018), and the Makswell project (Del 2.2 & 2.3) for more details on those views. However, the reader also needs to realize that in all cases mentioned, information needs to be extracted from Big Data to enable its use. This may require specific Big Data methodology, some of which has already been described in the previous chapters of this report. What remain are a number of important considerations when using Big Data as the main source of input, the use of models for Big Data, and a number of essential topics identified by the members of WPK. These are discussed in the remainder of this chapter.

6.2 Big Data as the main source of input When the data in a Big Data source provides the main input for the statistic produced, in essence two different ways of using the data can be discerned. The first is that one aims to include as much of the data generated as possible in the final output, as this enables very detailed statistics on small areas or subgroups. This approach is typical for smart meter data (Del 3.1), AIS-based transport statistics (Del 4.3), mobile phone data (Del 5.3 and WPI) and road sensor data (Del 6.1, Puts et al., 2019). All these cases, in the end, use a model to infer from the data. The model is used to compensate for missing data, data of low quality, and/or differences between the population composition of the Big Data source and the target population. This need becomes particularly obvious when very detailed, regionalized statistics are being produced. Here, because of low coverage in some areas, the original abundance of data reduces to a small amount (for some areas) and hence one needs to compensate for this.

When Big Data is used as the main source of information, extracting knowledge from such data does not mean that a totally new way of drawing inference needs to be developed. However, because of potential selectivity issues, special attention needs to be paid to causes of bias. Variance is considered less of an issue here, as large amounts of data are being used. Therefore, a researcher needs to pay attention to the effects of (low) data quality and of any decisions made during the processing of Big Data on the final outcome. In a perfect world, where Big Data contains perfect quality data including all units, simply adding all values up would suffice to obtain the total of the population. In reality, however, missing data (for particular areas or units) or the inclusion of data from units not belonging to the target population needs to be dealt with. Because of this, a seemingly precise Big Data based estimate may still be way off (Buelens et al., 2014; see also section 6.6).

The difficulty when huge amounts of data are used for statistical inference is that the data in the source needs to cover the entire target population to enable model-free inference. This is (likely) what the (in)famous paper on “The End of Theory” in Wired magazine (Anderson, 2008) wanted to indicate. However, in practice, model-free inference is hardly possible, as quality issues or simple data delivery issues may prevent the continuous inclusion of the entire population (Puts et al., 2017). Despite these considerations, there are examples (that could be verified) indicating that simply following the adage of including as much data as possible, and correcting for the bias to the best of one's knowledge, can provide estimates very close to those obtained by other, more traditional ways of inference. Estimating the number of large innovative companies is an example of that (Daas and van der Doef, 2020). In this study, the texts on the websites of all large Dutch companies were scraped and classified with a model to determine whether they were innovative or not. The final results obtained were nearly identical to the survey derived number; i.e. 19,276 ± 190 for the web based method vs. 19,916 ± 680 for the survey based method.

Applying a model enables one to deal with the real-world fluctuations that occur in a data-deluged world. From the above it is clear that Big Data models need to be developed that are able to:

i. compensate for the lack of insight into the inclusion probabilities of the units in Big Data (unknown design issue);
ii. identify the relevant units/events in huge data sets with a high precision and preferably high recall (imbalance issue);
iii. be implemented efficiently; inferring from huge data sets should not take up too much time (efficiency issue).

Current research focuses on these exciting areas. Studies on non-probability sampling touch on the unknown design issue (Buelens et al., 2018), as do capture-recapture studies (Del 5.3). Applying Bayesian inference methods is another suggested approach (Del 5.3). However, one should not assume that Big Data is simply a large sample (Doherty, 1994) and one should be careful when (mis)using the 'representativity' argument (see section 6.4). Another important consideration is that one needs to realize that “all models are wrong but some are useful” (Box, 1979).

6.3 Conceptual differences Not every data source measures exactly the same 'thing', i.e. measures the same concept. The concept included in a Big Data source may differ from what is measured by a survey or an administrative source. This difference is best illustrated with the job vacancy example from the ESSnet Big Data. In both the first and second ESSnet Big Data, online job vacancies were studied and compared to the vacancies measured in the official European job vacancy survey. Both sources can be used to produce statistics on labour demand, but the survey is set up to directly measure job vacancies whilst a web portal provides information on online job advertisements, i.e. job advertisements published (online) by a company in search of a new employee (WP1, Del 1.1, p. 4). Hence the two sources do not measure exactly the same concept. It is essential that a researcher using Big Data is aware of conceptual differences and checks if and where they occur, to assure that the proper comparisons are made at the start of a study. Let it be clear that this differs from the indirect measurement of a concept described and discussed in section 1.2.

During meta-analysis studies, i.e. statistical analyses in which the results of multiple scientific studies are combined, one also often needs to deal with concepts that differ. Matching those concepts is usually referred to as 'harmonization' in these kinds of studies (Griffith et al., 2015). In various data sources, identical concepts may be operationalized in different ways. For instance, age may be a continuous variable in study 1 and a categorical variable in study 2. Even categorical variables may be defined differently; e.g. gender 0 is male in study 1 and female in study 2. When the data from both studies need to be combined, the variables need to be harmonized. This can, for instance, be done by recoding the gender variable from study 1 into the coding used in study 2 and converting the continuous age variable from study 1 into the categories used in study 2. Harmonizing variables may, however, not always be that easy and it is challenging to completely harmonize away all possible effects. For Big Data, where sometimes the metadata is not very clearly described, this can be quite a challenge. Also, it may not always be very easy to derive how a concept is defined in a Big Data source. The large amounts of data available can be very useful to derive that.
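
A minimal sketch of the two harmonization steps just described (recoding the gender variable and binning the continuous age variable) is given below; the codes, category bounds and example records are hypothetical.

```python
import pandas as pd

# Hypothetical extracts of two studies measuring the same concepts differently.
study1 = pd.DataFrame({"gender": [0, 1, 0], "age": [23.4, 57.1, 40.0]})   # 0 = male
study2 = pd.DataFrame({"gender": [1, 0], "age_cat": ["18-39", "40-64"]})  # 0 = female

# Harmonize study 1 towards the coding of study 2:
# flip the gender code and bin continuous age into study 2's categories.
study1["gender"] = 1 - study1["gender"]
study1["age_cat"] = pd.cut(
    study1["age"], bins=[18, 40, 65, 120], right=False,
    labels=["18-39", "40-64", "65+"],
)
harmonized = pd.concat(
    [study1[["gender", "age_cat"]], study2[["gender", "age_cat"]]],
    ignore_index=True,
)
print(harmonized)
```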

In the above examples, concepts were compared at a single point in time. However, it is also important to compare the concept measured in a single Big Data source over time, to get an idea of its stability.

6.4 Representativity (selectivity) A central issue in designing the statistical methodology with new digital data sources is the so-called representativity of the data set(s). For example, with MNO data used to estimate the present population, the expert will immediately enquire how much of the population is really represented in the data, both with data from a single MNO and from several MNOs. For other Big Data sources the situation is similar. This issue indeed involves several subtleties and has been referred to in many disguises in the literature (representativity, selectivity, selection bias, coverage, extrapolation, . . . ).

In our view, the central issue revolves around the problem of statistical inference, understood as Cox (1958) wrote: A statistical inference will be defined [. . . ] to be a statement about statistical populations made from given observations with measured uncertainty. An inference in general is an uncertain conclusion. Two things mark out statistical inferences. First, the information on which they are based is statistical, i.e. consists of observations subject to random fluctuations. Secondly, we explicitly recognise that our conclusion is uncertain, and attempt to measure, as objectively as possible, the uncertainty involved. Fisher uses the expression 'the rigorous measurement of uncertainty'.

Clearly, when considering the issue of representativity we are enquiring about the connection between our observed data set(s) and the target population under analysis. In this line of thought, we prefer to avoid the term extrapolation, since in Mathematics this refers to the calculation of an estimate of the value of some function outside the range of known values (a very concrete problem), where no consideration about uncertainty, random fluctuations, or the statistical nature of populations is explicitly made. However, many other terms such as selectivity, selection bias, undercoverage, overcoverage, . . . are meaningful since they refer to different aspects that are highly entangled with and embedded within the statistical inference problem.

Also, we want to underline how uncertainty and the assessment of uncertainty are inextricably linked to the problem of statistical inference. Strictly speaking, population estimates without measures of uncertainty are not a solution to the statistical inference problem connecting the observed data with the target population.

Different understandings of and approaches to statistical inference (and thus to representativity) are rooted in the underlying interpretation of probability and, in general, in the use of probability theory beneath the inferential statements (Kyburg, 1974). In this line, for considerations regarding the production of official statistics it seems natural to us to start from the standard inferential paradigm in this industry: design-based inference and survey methodology in general.

Certainly, survey methodology is limited with new data sources, but it offers a template for a new, refurbished production framework to look at, namely a set of modular statistical solutions for a diversity of methodological needs along the statistical process in all statistical domains (sample selection, record linkage, editing, imputation, weight calibration, variance estimation, statistical disclosure control, . . . ). We shall concentrate in this section on the connection between collected data sets and target populations, which is firmly rooted on scientific grounds using design-based inference.

How is representativity dealt with in survey methodology? The short answer is design-based inference. As T.M.F. Smith (1976) already pointed out, the design-based inference seminally introduced by J. Neyman (1934) allows the statistician to make inferences about the population regardless of its structure. Also in our view, this is the essential trait of design-based methodology in Official Statistics over other alternatives, in particular over model-based inference. As M. Hansen (1987) already remarked, statistical models may provide more accurate estimates if the model is correct. Sampling designs, by contrast, avoid making prior hypotheses about the population, which are sometimes difficult to justify and to communicate.

In mathematical terms, in the practice of survey methodology this essential trait translates into the use of (asymptotically) design-unbiased linear estimators of the form $\hat{T} = \sum_{k \in s} \omega_{ks}(\mathbf{x})\, y_k$, where $s$ denotes the sample, $\omega_{ks}(\mathbf{x})$ are the so-called sampling weights (possibly dependent on the sample $s$ and on auxiliary variables $\mathbf{x}$) and $y$ stands for the target variable used to estimate the population total $Y = \sum_{k \in U} y_k$. A number of techniques exist to deal with diverse circumstances regarding both the imperfect data collection and the data processing procedures, so that non-sampling errors are duly dealt with (Lessler and Kalsbeek, 1992; Särndal and Lundström, 2005). These techniques lead us to the appropriate sampling weights $\omega_{ks}(\mathbf{x})$. Sampling weights are also present in the assessment of accuracy (in terms of variance, confidence intervals, etc.).
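
For concreteness, the simplest textbook instance of such weights (included here only as an illustration, not as a recipe tied to any WP) is the Horvitz-Thompson form, in which each weight is the inverse of the first-order inclusion probability implied by the sampling design:

```latex
% Horvitz-Thompson estimator of the population total Y:
% \pi_k is the first-order inclusion probability of unit k under the design p(.).
\[
  \hat{T}_{HT} \;=\; \sum_{k \in s} \frac{y_k}{\pi_k},
  \qquad
  \omega_{ks} = \frac{1}{\pi_k},
  \qquad
  \mathrm{E}_p\!\left[\hat{T}_{HT}\right] = \sum_{k \in U} y_k = Y .
\]
```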

The common understanding of sample representativity is rooted in sampling weights. The interpretation of a sampling weight $\omega_{ks}(\mathbf{x})$ is extensively accepted as providing the number of statistical units in the population $U$ represented by unit $k$ in the sample $s$. This combination of sampling designs, linear estimators, and interpretation of sampling weights provides a robust and solid comprehension of representativity in this realm. When sampling designs cannot be used (as with Big Data sources), we lose this combination, and hence also our traditional notion of representativity.

Currently, Big Data methodology for the production of official statistics is not mature enough to provide a single substitute replacing our traditional view of representativity for all Big Data sources. More business cases, examples, and Big-Data-based statistical products need to be produced, but this does not mean that rigorous statistical tools do not already exist to face this challenge. Indeed, we claim that the central goal in the production of official statistics in relation to statistical inference is to produce estimates as accurate as possible together with an assessment of the achieved accuracy. Design-based inference is indeed a concrete approach to solve this problem and not to build a notion of representativity, which should be considered a (dangerous) by-product of this approach.

Let us include a short digression trying to show how design-based inference actually produces accurate estimates and really avoids the issue of representativity. Firstly, the notion of representativity was already analysed by Kruskal and Mosteller (1979a,b,c, 1980), showing how imprecise and slippery this notion is. Indeed, a mathematical definition is not extensively found in classical and modern textbooks, with Bethlehem (2009) providing an exception in terms of a distance between the empirical distributions of a target variable in the sample and in the target population. This definition has two immediate shortcomings: (i) we never know the distribution of a variable in the target population (thus reducing its utility in practice); (ii) we may have a sample that is representative for a given variable but not for another.

Secondly, the construction of estimators based on the aforementioned interpretation of sampling weights is indeed mathematically flawed and extremely poor and misleading. This construction reasons as follows. If $\omega_{ks}(\mathbf{x})$ amounts to the number of population units represented by the sampled unit $k$, then $\omega_{ks}(\mathbf{x}) \cdot y_k$ is the part of the population aggregate accounted for by unit $k$ in the sample $s$. Thus, $\sum_{k \in s} \omega_{ks} \cdot y_k$ is the estimate of the total population aggregate. This argument is indeed behind the restriction upon sampling weights for them not to be less than 1 (a weight below 1 being interpreted as a unit not even representing itself) or for them to be positive in sampling weight calibration procedures (see e.g. Särndal, 2007). In our view, the interpretation of a unit $k$ in a sample as representing $\omega_{ks}$ units in the population can be impossible to justify even in such a simple example as a Bernoulli sampling design with inclusion probability $\pi = \tfrac{1}{2}$ in a finite population of size $N = 3$: if, e.g., $s = \{1, 2\}$, each unit carries weight $1/\pi = 2$, so how should we understand that these two units represent 4 population units?

Thirdly, the randomization approach does allow the statistician not to make prior hypotheses on the structure of the target population (Smith, 1976). But this does not necessarily entail that the estimator must be linear. Given a sample $s$ randomly selected according to a sampling design $p(\cdot)$ and the values $\mathbf{y}$ of the target variable, a general estimator is any function $T = T(s, \mathbf{y})$, linear estimators being a specific family thereof (Hedayat and Sinha, 1991). Thus, nothing prevents us from using more complex functions provided we search for a low mean square error. As a matter of fact, a linear estimator may be viewed as a homogeneous first-order approximation to an estimator $T(s, \mathbf{y})$, such as $T(s, \mathbf{y}) \approx \sum_{k \in s} \omega_{ks}\, y_k$, but why not a second-order approximation $T(s, \mathbf{y}) \approx \sum_{k \in s} \omega_{ks}\, y_k + \sum_{k, l \in s} \omega_{kls}\, y_k y_l$? Or even a complete series expansion $T(s, \mathbf{y}) \approx \sum_{p = 1}^{\infty} \sum_{k_1, \ldots, k_p \in s} \omega_{k_1 \ldots k_p s}\, y_{k_1} \cdots y_{k_p}$ (see e.g. Lehtonen and Veijanen (1998))?

Fourthly, the linearity of design-based estimators is indeed motivated by a non-statistical reason, which is basically communication. Multivariate estimates of target populations need to be produced, usually broken down according to diverse classification variables (sex, age group, economic activity, etc.). These classification variables are common among many surveys in a statistical office. Thus, given the public dimension of Official Statistics, usually disseminated in numerous tables, numerical consistency (not just statistical consistency) is strongly requested across all disseminated tables, even across different surveys. For example, if a table with R+D investments is disseminated broken down by economic activity (e.g. NACE code) and another table with the number of employees is also disseminated broken down by economic activity, the number of total business units by economic activity inferred from both tables must be exactly equal. Linear estimators can be made to satisfy this restriction straightforwardly by imposing the so-called multipurpose property of sampling weights (Särndal, 2007), which amounts to using the same sampling weight $\omega_{ks}(\mathbf{x})$ in the estimator of all population quantities in a given survey. For inter-survey consistency, calibration techniques for sampling weights are sometimes used, even dangerously (not all target variables $y$ hold the same correlation with the auxiliary variables $\mathbf{x}$). This circularly reinforces the above interpretation of sampling weights and the flawed reasoning about the construction of linear estimators.

As a final element of this digression, let us recall that sampling designs are commonly thought of as a life jacket against model misspecification. For example, even without a truly linear model between the target variable $y$ and the covariates $\mathbf{x}$, the ratio estimator is still asymptotically unbiased (see e.g. Särndal et al., 1992). But (asymptotic) design-unbiasedness does not guarantee a high-quality estimate. A well-known example can be found in Basu's elephants story (Basu, 1971). Leaving aside issues about the inferential paradigm, this story clearly shows how a poor sampling design drives us to a poor estimate, even when using an exactly design-unbiased estimator. A design-based estimate is good if the sampling design is correct, something which is not duly assessed in practice (what is the equivalent of a model goodness-of-fit test for a sampling design?).

To sum up, design-based methodology produces accurate estimates by resorting to asymptotically unbiased estimators with a variance as low as possible, using auxiliary information both in the construction of the sampling designs and in the estimators. Any consideration about sample representativity is indeed a by-product which must be rigorously understood in this theoretical context. We defend the idea that any estimation procedure with Big Data sources must also pursue the construction of accurate estimates with due measures of uncertainty.

To end, apart from a rigorous understanding of representativity, Big Data sources also bring an important novelty regarding the nature of the statistical outputs potentially produced with them. We envision at least two complementary directions widening the scope of official statistical products. On the one hand, the data deluge brings the opportunity to go beyond the traditional scope of estimation in a finite population, i.e. beyond population totals and functions thereof. Since we are to use statistical methods, they can also be used to understand, model, and report on random phenomena (e.g. analyses using social media and/or web scraped data). On the other hand, this abundance of data makes us envision the analysis of interactions between population units (also subject to randomness), a novel piece of information not covered by traditional sources. In all these regards, the representativity issue, i.e. the quest for accuracy, will also be present.

6.5 Sampling Many statisticians were raised and trained in the golden age of survey methodology, which makes them huge fans of studying samples. This 'paradigm' is not always the best to follow in the case of big data, as indicated by a quote from the paper of Franke et al. (2016).

“Sub-sampling from a large dataset is often presented as a solution to some of the problems raised in the analysis of big data; one example is the ‘bag of little bootstraps’ of Kleiner et al. (2014). While this strategy can be useful to analyse a dataset that is very massive, but otherwise regular, it cannot address many of the [Big data] challenges that were presented ….”

Simply put, sampling a Big Data set to reduce its size, with the aim of speeding up processing and analysis, only increases the variance of the estimates of the phenomena, thereby forfeiting the benefits Big Data has to offer (Daas and Puts, 2014). This becomes even more relevant when the phenomenon studied follows a power-law distribution, such as a Zipfian one (Powers, 1998). An example of the latter is the occurrence of words in text corpora (see Figure 6.1). Their rank-frequency distributions are very well approximated by a Zipfian distribution. Studying samples of data with such distributions will seriously underestimate or even miss the less frequently occurring words.

Figure 6.1 Frequency plot of the words in the book “The Hitchhikers Guide to the Galaxy” on a normal (A) and a log-log scale (B), demonstrating a typical Zipfian distribution. The most frequently occurring word is 'the' (2524), followed by 'of' (1257) and 'a' (1191); 3507 words are used only once.
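
The sketch below shows how such a rank-frequency table can be obtained from a text; the placeholder string stands in for a full corpus, for which the log-log rank-frequency plot would be close to a straight line.

```python
import re
from collections import Counter

# Placeholder text; in practice this would be the full text of a book or corpus.
text = "the cat sat on the mat and the dog sat on the rug"

words = re.findall(r"[a-z']+", text.lower())
freq = Counter(words)

# Rank-frequency table: under Zipf's law the frequency of a word is roughly
# proportional to 1 / rank, so a log-log plot of these pairs is near-linear
# for a sufficiently large corpus (not for this tiny placeholder).
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
```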

It is therefore advised, whenever possible, to study the data in a big data source as a whole, which is certainly doable in the current day and age considering the increased access to computing power and the parallelizing capabilities of modern computers. If a data source cannot be studied in a single run, consider splitting it up into parts that are analysed separately and combining the findings afterwards (a minimal sketch of this is given below). However, this does not mean that sampling is bad. It is still a great way to quickly gain insight into the content and quality of a Big Data source during an exploratory study, but it should merely be considered a first step in such a study.
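
As a minimal sketch of this split-analyse-combine idea, the snippet below computes an overall mean by processing a large CSV file in chunks; the file name and column name are placeholders.

```python
import pandas as pd

# Process a file that is too large for memory in parts: analyse each chunk
# separately and combine the partial findings afterwards.
total = 0.0
count = 0
for chunk in pd.read_csv("big_source.csv", usecols=["value"], chunksize=1_000_000):
    total += chunk["value"].sum()   # partial result for this chunk
    count += len(chunk)
print("overall mean:", total / count)
```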

6.6 Accuracy assessment (Bias and variance) The accuracy of a statistic is measured by its variance and bias. The variance is inversely related to the size of the data used. Because of the large volume of big data, it is to be expected that the variance is rather small. But one should be aware that it is also affected by the similarity between the concept a researcher wants to measure and the concept provided by the source (see section 6.3).

Bias is a much greater concern for big data sources. We distinguish between selection, measurement and model bias here. Selection and measurement bias are discussed first. The more the units in a big data source resemble (represent) the target population, the less of an issue selection bias becomes. Selection bias is expected to be zero when all units of the target population are included and the data of the units not belonging to it are successfully removed. Obviously, when limited information is available on the units included in big data, this is a challenging task. This is the main reason why many researchers try to link big data with datasets of known population composition (Chapter 4). The measurement bias in big data-based statistics predominantly depends on the extent to which the conceptual variables the researcher wants to measure are (correctly) provided by the data in the source. Problems may arise here when the variable(s) of interest is derived from the variables available and is, hence, indirectly measured.

In this respect, the mathematical formalization of the error of estimates derived from non-probability samples by Meng (2018) forms an interesting viewpoint for big data based estimates. Non-probability samples should be interpreted here as data sets composed of units with unknown inclusion probabilities which are (very likely) selective for the target population. According to Meng (2018), the error of such estimates is the product of three components: 1) a measure of the correlation between the target variable and the response/recording indicator, 2) a measure based on the fraction of the target population covered, and 3) a problem difficulty measure, which is the standard deviation of the target variable. This shows that selection bias becomes a serious issue when the correlation between the target variable and the recording indicator is high (in absolute value) and the fraction of (target population) units included in the source is low. From the unit point of view, the first issue is predominantly affected by the difference between the composition of the units in the source and that of the target population. Both can be reduced by maximizing the inclusion of (target population) units in the sources. It is interesting to note that this approach has been successfully applied in the webpage based detection of innovative companies in the Netherlands (Daas and van der Doef, 2020). Here, the big data (web) based and the survey based estimates were nearly identical. The third component identified, the problem difficulty measure, affects the variance of the estimate.
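
In formula form, and loosely following Meng's (2018) notation (this rendering is a sketch for illustration; the paper should be consulted for the precise definitions), the identity for the error of a big data based sample mean reads:

```latex
% Meng (2018): error of the mean of the recorded subset (size n) relative to the
% target population mean (size N), with recording indicator R and f = n/N.
\[
  \bar{Y}_n - \bar{Y}_N
  \;=\;
  \underbrace{\rho_{R,Y}}_{\text{data defect correlation}}
  \;\times\;
  \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}}
  \;\times\;
  \underbrace{\sigma_Y}_{\text{problem difficulty}}
\]
```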

Model bias is relevant when a model is used to obtain the final estimate. The effect is easiest to explain for a classification model. Suppose a researcher is using a model to classify whether a unit belongs to group A or B. The researcher, obviously, wants to use a model that is as accurate as possible for this task. In this case, a highly accurate model means that nearly all of the examples in the test set (a dataset with units of known composition) are correctly classified by the model; e.g. a true A-unit is classified as A and a true B-unit is classified as B. However, this will not be the case for all units, so some A's are classified as B's and the other way around. If these two misclassified groups are not of equal size, the model outcome will be biased towards one of the groups. Hence, the model overestimates either A or B; i.e. the model is biased. By using different metrics during model training one can try to reduce this effect as much as possible (Kuhn and Johnson, 2013) and/or one can determine the imbalance of the model and subsequently correct the estimate for it (Meertens et al., 2019).
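
The sketch below illustrates the simplest (frequentist) version of such a correction: confusion probabilities estimated on a test set are inverted to adjust the classified counts. The numbers are invented, and the Bayesian approach of Meertens et al. (2019) is more refined than this plain matrix inversion.

```python
import numpy as np

# Confusion probabilities estimated on a test set with known labels:
# P[i, j] = probability that a true class-j unit is classified as class i.
P = np.array([[0.95, 0.10],    # classified as A
              [0.05, 0.90]])   # classified as B

# Observed (classified) counts in the full big data set (invented numbers).
observed = np.array([6200.0, 3800.0])   # predicted A, predicted B

# If observed = P @ true holds in expectation, a simple correction inverts P.
corrected = np.linalg.solve(P, observed)
print(corrected)  # estimated true counts of A and B (approx. 6118 and 3882)
```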

6.7 Additional considerations Causality is an important topic for future research, certainly in the area of Big data. It is important because causation indicates a relationship between two variables where one affects the other. Since the data generation mechanism of Big data is usually unknown, causal studies may shed light on this and help to extract more, and more reliable, information from such sources. They will also extend the model based inferences in this area.

Another topic that needs to be mentioned here is the advantage of using a Bayesian statistical approach for Big data sources. Examples of this are the work of WPI and the model used to correct for missing data in the road sensor work (Puts et al., 2017). A Bayesian approach is also beneficial whenever a Big data source is used that indirectly measures the concept of interest.

6.8 Discussion Statistical inference is key for official statistics. In the case of Big data based statistics it becomes even more challenging, as nearly always a model is involved, the population can be quite dynamic, the data generation mechanism may be unknown and the data may be used indirectly. The most important issues have been listed above and their current state of the art has been discussed. In addition, two topics have been identified that will certainly require more attention in future work. All of this is needed to assure the successful application of Big data in official statistics.

7. Conclusions Round-up of the document (for the final version).

References

Ahas, R., Tiru, M., Saluveer, E., Demunter, C. (2011). Mobile telephones and mobile positioning data as source for statistics: Estonian experiences. Paper for the New Techniques and Technologies for Statistics conference 2011, Brussels, Belgium.

Allen, C., Hospedales, T. (2019). Analogies Explained: Towards Understanding Word Embeddings. Proceedings of the 36th International Conference on Machine Learning, June 10-15, 2019. Long Beach, USA. Available at: https://arxiv.org/pdf/1901.09813.pdf.

Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired magazine 16-07. Link: https://www.wired.com/2008/06/pb-theory/.

Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician 27(1), pp. 17-21. doi:10.1080/00031305.1973.10478966.

Bakker, B.F.M. (2011). Micro-integration. State of the art. In: ESSnet on Data Integration WP1 State of the Art on Statistical Methodologies for Data Integration (pp. 77-107). (ESSnet). Luxembourg: Eurostat.

Basu, D. (1971). An Essay on the Logical Foundations of Survey Sampling, Part One. In: DasGupta, A. (ed.), Selected Works of Debabrata Basu. Selected Works in Probability and Statistics. Springer, New York, NY.

Beręsewicz, M., Lehtonen, R., Reis, F., Di Consiglio, L., Karlberg, M. (2018). An overview of methods for treating selectivity in big data sources. Statistical Working Paper, Eurostat. Available at: https://ec.europa.eu/eurostat/documents/3888793/9053568/KS-TC-18-004-EN-N.pdf

Bethlehem, J. (2009). The rise of survey sampling. Statistics Netherlands Discussion Paper 09015. Located at: https://www.cbs.nl/-/media/imported/documents/2009/07/2009-15-x10-pub.pdf.

Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A. (2015). Feature Selection for High-Dimensional Data. Springer, New York.

Cox, D.R. (1958). Some Problems Connected with Statistical Inference. The Annals of Mathematical Statistics 29(2), pp. 357–372.

Box, G.E.P. (1979). Robustness in the strategy of scientific model building. In: Launer, R.L., Wilkinson, G.N. (eds.), Robustness in Statistics, Academic Press, pp. 201–236. doi:10.1016/B978-0-12-438150-6.50018-2.

Buelens, B., Burger, J., van de Brakel, J. (2018). Comparing Inference Methods for Non-probability Samples: Inference from Non-probability Samples. Int. Stat. Review. 86(2), pp. 322-343. doi:10.1111/insr.12253.

Buelens, B., Daas, P., Burger, J., Puts, M., van den Brakel, J. (2014). Selectivity of Big Data. Discussion paper 201411, Statistics Netherlands, The Hague/Heerlen, The Netherlands.

Daas, P.J.H., Puts, M.J.H. (2014). Big data as a Source of Statistical Information. The Survey Statistician 69, pp. 22-31.

Daas, P.J.H., van der Doef, S. (2020). Detecting Innovative Companies via their Website. Statistical Journal of the IAOS, accepted for publication.

Davis, J.C., Michael IV, L.G., Coghlan, C.A., Servant, F., Lee, D. (2019). Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions. In: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), August 26 - 30, 2019, Tallinn, Estonia. Doi: https://doi.org/10.1145/3338906.3338909

Di Consiglio, L., Tuoto, T., Zhang L.C. (2019). Capture-recapture methods in the presence of linkage errors. In Analysis of Integrated Data, eds. L.C. Zhang and R.-L. Chambers. CRC Press, Statistics in the social and Behavioral Science Series, pp.39-71, ISBN 9781498727983.

Di Consiglio, L., Tuoto, T. (2018). When adjusting for linkage errors: A sensitivity analysis, Statistical Journal of the IAOS 34, pp. 589-597.

Ding, Y., Fienberg, S.E. (1994). Dual system estimation of undercount in the presence of matching error. Survey Methodology 20, pp. 149-158.

Doherty, M. (1994). Probability versus Non-Probability Sampling in Sample Surveys, The New Zealand Statistics Review March 1994 issue, pp 21-28.

Eisenstein, J. (2019). Introduction to Natural Language Processing. MIT press, Cambridge, USA.

Eurostat (2017). Practical Guide for Processing Supermarket Scanner Data. Working paper, September. Located at: https://circabc.europa.eu/sd/a/8e1333df-ca16-40fc-bc6a-1ce1be37247c/Practical-Guide-Supermarket-Scanner-Data-September-2017.pdf

Fan, J., Han, F., Liu, H. (2014). Challenges of Big Data analysis. National Science Review 1, pp. 293-314. doi:10.1093/nsr/nwt032.

Fellegi, I.P., Sunter, A.B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association 64, pp. 1183-1210.

Franke, B., Plante, J-F., Roscher, R., En-shiun, A.L., Smyth, C., Hatefi, A., Chen, F., Gil, E., Schwing, A., Selvitella, A., Hoffman, M.M., Grosse, R., Hendricks, D., Reid, N. (2016). Statistical Inference, Learning and Models in Big Data. International Statistical Review 83(3), pp. 371-389. doi.org/10.1111/insr.12176

Geirhos, R. Janssen, D.H.J., Schütt, H.H., Rauber, J., Bethge, M., Wichmann, F.A. (2017). Comparing deep neural networks against humans: object recognition when the signal gets weaker. CoRR, vol. abs/1706.06969. Located at: http://arxiv.org/abs/1706.06969.

Gentzkow, M., Kelly, B., Taddy, M. (2019). Text as Data. Journal of Economic Literature 57(3), pp. 535-574. doi:10.1257/jel.20181020.

Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature 457, pp. 1012-1014.

Griffioen, R., ten Bosch, O. (2016). On the Use of Internet Data for the Dutch CPI. Paper for the Meeting of the Group of Experts on Consumer Price Indices, 2-4 May, Geneva, Switzerland. Located at: https://www.researchgate.net/publication/324896024_On_the_use_of_Internet_data_for_the_Dutch_CPI

Griffith, L.E., van den Heuvel, E., Fortier, I., Sohel, N., Hofer, S.M., Payette, H., Wolfson, C., Belleville, S., Kenny, M., Doiron, D., Raina, P. (2015). Statistical approaches to harmonize data on cognitive measures in systematic reviews are rarely reported. Journal of Clinical Epidemiology 68(2), pp. 154-162.

Guarnera, U., Varriale, R. (2016). Estimation from Contaminated Multi-Source Data Based on Latent Class Models. Statistical Journal of the IAOS 32(4), pp. 537-544.

Hansen, M. (1987). Some history and reminiscences on survey sampling. Statistical Science 2, pp. 180-190.

Hedayat, A., Sinha, B. (1991). Design and Inference in Finite Population Sampling. Wiley.

Helwegen, R., Braaksma, B. (2020). Fair Algorithms in context. CBDS Working paper 04-20. Statistics Netherlands, the Netherlands. Located at: https://www.cbs.nl/-/media/_pdf/2020/22/cbds_working_paper_fair_algorithms.pdf

Herzog, T.N., Scheuren, F.J., Winkler, W.E. (2007). Data Quality and Record Linkage Techniques. New York: Springer-Verlag.

Hotho, A., Nürnberger, A., Paaß, G. (2005). A brief survey of text mining. LDV Forum 20(1), pp. 19-62.

ISTAT (2020). Social Mood on Economy Index. Methodological note, dd. 06-02-2020. Located at: https://www.istat.it/it/files//2018/07/Methodological_Note.pdf

Jo, T. (2019). Text Mining: Concepts, Implementation, and Big Data Challenge. Kacprzyk, J. (ed.) In: Studies in Big Data Vol 45, Springer, New York, USA.

Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I. (2014). A scalable bootstrap for massive data. J. Roy. Stat. Soc. B, 76(4), pp. 795-816.

Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems Volume 1, ser. NIPS’12. USA: Curran Associates Inc., pp. 1097-1105. Located at: http://dl.acm.org/citation.cfm?id=2999134.2999257.

Kruskal, W., Mosteller, F. (1979a). Representative sampling, I: Non-scientific literature. International Statistical Review 47, pp. 13-24.

Kruskal, W., Mosteller, F. (1979b). Representative sampling, II: Scientific literature, excluding statistics. International Statistical Review 47, pp. 111-127.

Kruskal, W., Mosteller, F. (1979c). Representative sampling, III: The current statistical literature. International Statistical Review 47, pp. 245-265.

Kruskal, W., Mosteller, F. (1980). Representative sampling, IV: The history of the concept in statistics, 1895-1939. International Statistical Review 48, pp. 169-195.

Kuhn, M., Johnson, K. (2013). Applied Predictive Modeling. Springer-Verlag New York.

Kyburg, H.E. Jr. (1974). The logical foundations of statistical inference. Dordrecht (Holland): Reidel Publishing Company.

Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science 343(6176), pp. 1203-1205.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278-2324.

LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature 521, pp. 436-444. doi:10.1038/nature14539

Lessler, J., Kalsbeek, W. (1992). Nonsampling error in surveys. New York: Wiley.

Lehtonen, R., Veijanen, A. (1998). Logistic generalized regression estimators. Survey Methodology 24, pp. 51–55.

Li, Y., Yang, T. (2018). Word Embedding for Understanding Natural Language: A Survey. In Guide to Big Data Applications, edited by S. Srinivasan, 83-104. New York: Springer.

Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., Zhang, G. (2018). Learning under Concept Drift: A Review. IEEE Transactions on Knowledge and Data Engineering 31(12), pp. 2346-2363. doi:10.1109/TKDE.2018.2876857.

Makswell project, deliverable 2.2, (2020) Methodological aspects of using big-data. https://www.makswell.eu/projectoutput/deliverables.html.

Makswell project (2020). Deliverable 2.3: Report on identification of future research needs in terms of statistical methodologies and new data.

Meertens, Q.A., Diks, C.G.H., van den Herik, H.J., Takes, F.W. (2019). A Bayesian Approach for Accurate Classification-Based Aggregates. Proceedings of the 2019 SIAM International Conference on Data Mining, May 2-4, Calgary, Canada, pp. 306-314. doi:10.1137/1.9781611975673.35.

Meng, X.L. (2018). Statistical paradises and paradoxes in big data (I). The Annals of Applied Statistics, 12, pp. 685-726.

Nadeau, D., Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), pp. 3-26.

Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 97, pp. 558-625.

Nurmi, O., Piela, P. (2019). The Use of Mobile Phone Data in Tourism Statistics. Paper for the 62nd ISI World Statistics Congress 2019, STS-571 session, Kuala Lumpur, Malaysia. Located at: http://media.voog.com/0000/0042/3095/files/STS0571_Pasi.pdf

Powers, D.M.W. (1998). Applications and explanations of Zipf's law. In: Powers, D.M.W. (ed.), New Methods in Language Processing and Computational Natural Language Learning, ACL, pp. 151-160. Located at: https://www.aclweb.org/anthology/W98-1218.pdf.

Puts, M., Daas, P., de Waal, T. (2017). Finding Errors in Big Data. In: The Best Writing on Mathematics 2016, Princeton, USA. (Pitici, M., ed), pp. 291-299, Princeton University Press, USA.

Puts, M.J.H., Daas, P.J.H., Tennekes, M., de Blois, C. (2019). Using huge amounts of road sensor data for official statistics. AIMS Mathematics 4(1), pp. 12-25.

Rosebrock, A. (2019). Deep Learning for Computer Vision with Python, 2nd ed. PyImageSearch.com.

Särndal, C.-E. (2007). The calibration approach in survey theory and practice. Survey Methodology 33, pp. 99-119.

Särndal, C.-E., Lundström, S. (2005). Estimation in Surveys with Nonresponse. Chichester: Wiley.

Särndal, C.-E., Swensson, B., Wretman, J. (1992). Model assisted survey sampling. New York: Springer.

Scholtus, S., van Delden, A., Bakker, B.F.M. (2015). Modelling measurement error to estimate bias in administrative and survey variables. Paper presented at the New Techniques and Technologies for Statistics (NTTS) conference.

Serrano-Guerrero, J., Olivas, J.A., Romero, F.P., Herrera-Viedma, E. (2015). Sentiment analysis: A review and comparative analysis of web services. Information Sciences 311, pp. 18-38.

Smith, T.M.F. (1976). The foundations of survey sampling: a review. J. R. Stat. Soc. A 139, pp. 183-204.

Statistics Netherlands (2015). A13 busiest national motorway in the Netherlands. Statistics Netherlands report. Located at: https://www.cbs.nl/en-gb/background/2015/31/a13-busiest-national-motorway-in-the-netherlands.

Szeliski, R. (2010). Computer Vision: Algorithms and Applications, 1st ed. Berlin, Heidelberg: Springer-Verlag.

Tennekes, M., de Jonge, E. (2019). EBDA: Visual Exploratory Big Data Analysis for Tabular Data. CBS report, Statistics Netherlands.

Tuoto, T., Fusco, D., Di Consiglio, L. (2018). Exploring solutions for linking Big Data in Official Statistics. In: Perna, C. et al. (eds.), Studies in Theoretical and Applied Statistics, Springer Proceedings in Mathematics & Statistics 227, pp. 49-58. ISBN 978-3-319-73905-2.

Wolter, K.M. (1986). Some coverage error models for census data. Journal of the American Statistical Association 81, pp. 338-346.

Annexes

Annex A: BREAL (Big data Reference Architecture and Layers)