Should Big Data Change the Modelling Paradigm in Official Statistics?
Total Page:16
File Type:pdf, Size:1020Kb
Statistical Journal of the IAOS 31 (2015) 193–202 193 DOI 10.3233/SJI-150892 IOS Press “Re-make/Re-model”: Should big data change the modelling paradigm in official statistics?1 Barteld Braaksmaa,∗ and Kees Zeelenbergb aInnovation Program, Statistics Netherlands, 2490 HA Den Haag, The Netherlands bMethods and Statistical Policies, Statistics Netherlands, 2490 HA Den Haag, The Netherlands Abstract. Big data offers many opportunities for official statistics: for example increased resolution, better timeliness, and new statistical outputs. But there are also many challenges: uncontrolled changes in sources that threaten continuity, lack of identifiers that impedes linking to population frames, and data that refers only indirectly to phenomena of statistical interest. We discuss two approaches to deal with these challenges and opportunities. First, we may accept big data for what they are: an imperfect, yet timely, indicator of phenomena in society. These data exist and that’s why they are interesting. Secondly, we may extend this approach by explicit modelling. New methods like machine-learning techniques can be considered alongside more traditional methods like Bayesian techniques. National statistical institutes have always been reluctant to use models, apart from specific cases like small-area estimates. Based on the experience at Statistics Netherlands we argue that NSIs should not be afraid to use models, provided that their use is documented and made transparent to users. Moreover, the primary purpose of an NSI is to describe society; we should refrain from making forecasts. The models used should therefore rely on actually observed data and they should be validated extensively. Keywords: Big data, model-based statistics 1. Introduction There are various challenges with the use of big data in official statistics, such as legal, technological, fi- nancial, methodological, and privacy-related ones; see Big data come in high volume, high velocity and e.g. [19,21,22]. This paper focuses on methodological high variety; examples are web scraping, twitter mes- sages, mobile phone call detail records, traffic-loop challenges, in particular on the question how official data, and banking transactions. This leads to opportu- statistics may be made from big data, and not on the nities for new statistics or redesign of existing statis- other challenges. tics. Their high volume may lead to better accuracy and At the same time, big data may be highly volatile more details, their high velocity may lead to more fre- and selective: the coverage of the population to which quent and timelier statistical estimates, and their high they refer, may change from day to day, leading to in- variety may give rise to statistics in new areas. explicable jumps in time-series. And very often, the in- dividual observations in big-data sets lack linking vari- ables and so cannot be linked to other datasets or pop- 1Re-Make/Re-Model is a song written by Bryan Ferry and per- ulation frames. This severely limits the possibilities for formed by Roxy Music in 1972 (Wikipedia, 2015). correction of selectivity and volatility. ∗ Corresponding author: Barteld Braaksma, Manager Innovation The use of big data in official statistics therefore Program, Statistics Netherlands, PO Box 24500, 2490 HA Den Haag, Netherlands. Tel.: +31 70 3374430; E-mail: b.braaksma@ requires other approaches. We discuss two such ap- cbs.nl. proaches. 1874-7655/15/$35.00 c 2015 – IOS Press and the authors. All rights reserved This article is published online with Open Access and distributed under the terms of the Creative Commons Attribution Non-Commercial License. 194 B. Braaksma and K. Zeelenberg / Should big data change the modelling paradigm in official statistics? In the first place, we may accept big data just for than conducting a survey as they are already present. what they are: an imperfect, yet very timely, indicator In some countries, the access and use of secondary of developments in society. In a sense, this is what na- sources is regulated by law. tional statistical institutes (NSIs) often do: we collect Big Data sources offer even less control. They typi- data that have been assembled by the respondents and cally consist of ‘organic’ data [10] collected by others, the reason why, and even just the fact that, they have who have a non-statistical purpose for their data. For been assembled is very much the same reason why they example, a statistical organization might want to use are interesting for society and thus for an NSI to col- retail transaction data to provide prices for their Con- lect. For short, we might argue: these data exist and sumer Price Index statistics, while the data generator that’s why they are interesting. sees it as a way to track inventories and sales. Secondly, we may extend this approach in a more In this section we will look at some examples formal way by modelling these data explicitly. In re- from the research and innovation program at Statis- cent years, many new methods for dealing with big tics Netherlands: social media messages, traffic-loop data have been developed by mathematical and applied data, and mobile phone data; the text of these subsec- statisticians. tions is based on papers [4,5,13,17] from colleagues In Section 2 we briefly describe big data and the pos- at Statistics Netherlands. These examples fall in the sible uses as well as some actual examples. In Section 3 categories Social Networks and Internet of Things, we look at the first manner in which they may be used: as distinguished by the UN/ECE Task Team on Big as they are collected or assembled, i.e. as statistics in Data [21]; we omit examples from their third category, their own right. In Section 4 we discuss how models Traditional Business systems (process-mediated data), may be useful for creating information from big-data since on the one hand some of the methodological and sources, and under what conditions NSIs may be using statistical problems for this category resemble those models for creating official statistics. for administrative data and on the other hand there is not yet much experience with the more complicated types of data in this category, such as banking transac- 2. Big data tions. In particular, we discuss actual or possible uses in official statistics and some issues that arise when analysing these data sources from an official statistics 2.1. Source data for official statistics perspective. Other examples, which we will not dis- cuss here, include web scraping, scanner data, satellite Official statistics must be based on observations: of- images and banking transactions. ten raw data that needs further processing and is honed to produce accurate, reliable, robust and timely infor- 2.2. Traffic-loop data [5,17] mation. For many years, producers of official statistics have In the Netherlands, approximately 100 million traf- relied on their own data collection, using paper ques- fic detection loop records are generated a day. More tionnaires, face-to-face and telephone interviews, or specifically, for more than 12 thousand detection loops (somewhat less traditional) web surveys. This classical on Dutch roads, the number of passing cars is available approach originates from the era of data scarcity, when on a minute-by-minute basis. The data are collected official statistics institutes were among the few organi- and stored by the National Data Warehouse for Traffic sations that could gather data and disseminate informa- Information (NDW) (http://www.ndw.nu/en/), a gov- tion. A main advantage of the survey-based approach is ernment body which provides the data free of charge to that it gives full control over questions asked and pop- Statistics Netherlands. A considerable part of the loops ulations studied. A big disadvantage is that it is rather discern length classes, enabling the differentiation be- costly and burdensome, for the surveying organisation tween, e.g., cars and trucks. Their profiles clearly re- and the respondents, respectively. veal differences in driving behaviour. More recently, statistical institutes have started to Harvesting the vast amount of data is a major chal- use administrative (mostly government) registers as lenge for statistics; but it could result in speedier and additional sources. Using such secondary sources re- more robust traffic statistics, including more detailed duces control over the available data, and the adminis- information on regional levels and increased resolution trative population often does not exactly match the sta- in temporal patterns. This is also likely indicative of tistical one. However, these data are cheaper to obtain changes in economic activity in a broader sense. B. Braaksma and K. Zeelenberg / Should big data change the modelling paradigm in official statistics? 195 (a) (b) Fig. 1. Traffic distribution pattern for a single day (Thursday, 1 December 2011), aggregated over all traffic loops in five-minute blocks. Figure 1a presents raw data as recorded; Fig. 1b presents data after imputation for missing observations [5]. An issue is that this source suffers from under- company Coosto, which routinely collects all Dutch coverage and selectivity. The number of vehicles de- social media messages. In addition, they provide some tected is not available for every minute due to system extra information, like assigning a sentiment score to failures and not all (important) Dutch roads have de- individual messages or adding information about the tection loops. Fortunately, the first can be corrected by place of origin of a message. imputing the absent data with data that is reported by To find out whether social media is an interesting the same loop during a 5-minute interval before or after data source for statistics, Dutch social media messages that minute (see Fig. 1). Coverage is improving over were studied from two perspectives: content and senti- time. Gradually more and more roads have detection ment.