Essnet Big Data II
Total Page:16
File Type:pdf, Size:1020Kb
ESSnet Big Data II Grant Agreement Number: 847375 -2018-NL-BIGDATA https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en Work package K Methodology and quality Deliverable K5: First draft of the m ethodological report Final version, 17.06.2020 Prepared by: Piet Daas (CBS, NL) Jacek Maślankowski (GUS, PL) David Salgado (INE, ES) Sónia Quaresma (INE, PT) ESSnetTiziana co-ordinator: Tuoto, Loredana Di Consiglio, Giovanna Brancato, Paolo Righi (ISTAT, IT) Magdalena Six, Alexander Kowarik (STAT, AT) Work package leader: Alexander Kowarik (STAT, AT) [email protected] telephone : +43 1 71128 7513 Contents 0. Prologue/Starting point ...................................................................................................................... 4 1. Overview of Big Data based official statistics ................................................................................... 5 1.1 Using Big Data ............................................................................................................................. 6 1.2 Lessons learned from Google Flu Trends .................................................................................... 7 1.3 Large n, large p and large t .......................................................................................................... 7 1.4 Relation with Quality and Typology reports ............................................................................... 8 Input Phase ........................................................................................................................................... 10 2. Introductory overview of methods for text and image mining ........................................................ 10 2.1 Text mining and NLP................................................................................................................. 10 2.1.1 Text ..................................................................................................................................... 10 2.1.2 Text mining tasks ................................................................................................................ 11 2.1.3 Examples of text mining applications ................................................................................ 13 2.2 Image mining ............................................................................................................................. 14 2.2.1 Bag of visual words ............................................................................................................ 14 2.2.2 Deep Learning ..................................................................................................................... 15 2.2.3 Examples of image mining applications ............................................................................. 16 2.3 Discussion .................................................................................................................................. 16 Throughput Phase 1 ............................................................................................................................. 17 3. Big Data exploration ........................................................................................................................ 17 3.1 Visualization methods for large datasets ................................................................................... 18 3.1.1 Univariate data .................................................................................................................... 18 3.1.2 Bivariate data ...................................................................................................................... 20 2.1.3 Other relevant visualization methods .................................................................................. 21 2.2 Examples of visualization methods used ................................................................................... 23 3.3 Discussion .................................................................................................................................. 24 4. Combining sources........................................................................................................................... 25 4.1 Linking units .............................................................................................................................. 26 4.1.1 Reasons for linking ............................................................................................................. 26 4.1.2 Linking units and linking variables..................................................................................... 26 4.1.3 Linking methods ................................................................................................................. 27 4.1.4 Suggestions for improvements on methods ........................................................................ 27 4.2 Examples of combining sources ................................................................................................ 27 4.3 Discussion .................................................................................................................................. 28 5. Dealing with errors .......................................................................................................................... 29 5.1 Error dimension – Coverage ...................................................................................................... 29 5.2 Error dimension – Comparability over time .............................................................................. 29 5.3 Error dimension - Measurement errors ...................................................................................... 29 5.4 Model errors (raw data to statistical data).................................................................................. 30 5.5 Discussion (Link with Quality deliverable and relation with user needs) ................................. 30 Throughput Phase 2 ............................................................................................................................. 31 6. Inference .......................................................................................................................................... 31 6.1 General considerations ............................................................................................................... 31 6.2 Big Data as the main source of input ......................................................................................... 31 6.3 Conceptual differences ............................................................................................................... 33 6.4 Representativity (selectivity) ..................................................................................................... 33 6.5 Sampling .................................................................................................................................... 37 6.6 Accuracy assesment (Bias and variance) ................................................................................... 38 6.7 Additional considerations .......................................................................................................... 39 6.8 Discussion .................................................................................................................................. 39 7. Conclusions ...................................................................................................................................... 40 References ............................................................................................................................................ 41 Annexes................................................................................................................................................ 46 Annex A: BREAL (Big data Reference Architecture and Layers) .................................................. 46 0. Prologue/Starting point This report provides an overview of the current state of art of Big Data methodology. As we have found that the range of what one considers Big Data methodology differs tremendously on the experience and area of expertise of a particular statistical researcher, we decided to apply a very strict (narrow) focus on what is included in this report. Essential starting point of any method described in this report is that it: i) has been used by researchers involved in the ESSnet Big Data I and II projects, ii) is included in one of the Big Data based official statistics currently being published, iii) is used in one of the Big Data based statistics that is in the process of becoming officially published. Any other method, however interesting it may seem, is not included in this report. For the reader it is important to realize that the methods included in this report are divided in various (sub)sections that relate to the various phases of the statistical process in which they are used. The same phases, input, throughput (1 & 2) and output, were used in the quality guidelines report (WPK Del. K3) published earlier. 1. Overview of Big Data based official statistics At the moment of writing, 13 examples of statistics are known that make use of Big Data for statistics published by National Statistical Institutes (NSI’s). These are listed in table 1. Unfortunately, not all processes have already completely matured and only two are in production. These numbers will undoubtable increase