Enhanced Step-By-Step Approach for the Use of Big Data in Modelling for Official Statistics
Dario Buono (1), George Kapetanios (2), Massimiliano Marcellino (3), Gian Luigi Mazzi (4), Fotis Papailias (5)

Paper prepared for the 16th Conference of the International Association for Official Statistics (IAOS)
OECD Headquarters, Paris, France, 19-21 September 2018
Session 1.A., Day 1, 19/09, 11:00: Use of Big Data for compiling statistics

(1) Eurostat, European Commission. Email: [email protected]
(2) King's College London. Email: [email protected]
(3) Bocconi University. Email: [email protected]
(4) GOPA. Email: [email protected]
(5) King's College London. Email: [email protected]

DRAFT VERSION, DD/MM/2018. PLEASE DO NOT CITE.

Note: This Working Paper should not be reported as representing the views of the authors' organisations. The views expressed are those of the author(s).

ABSTRACT

In this paper we describe a step-by-step approach for the use of big data in nowcasting exercises and, possibly, in the construction of high frequency and new indicators. We also provide a set of recommendations associated with each step, based on the outcome of the other tasks of the project as well as on the outcome of the previous project on big data and macroeconomic nowcasting. Specifically, we consider issues related to data sources, access, preparation, cleaning/filtering, modelling, and evaluation.

Keywords: Big data; Macroeconomic nowcasting; Econometric models; Construction of new indicators; Outliers; Seasonality; Unstructured data.

1 INTRODUCTION

Recent advances in IT and the social use of Internet-related applications nowadays make a vast amount of information available in real time. The associated Big Data are potentially very useful for a variety of applications, ranging from marketing to tackling tax evasion. From the point of view of statistical institutes, the main question is whether, and to what extent, Big Data are valuable for expanding, checking and improving the data production process. In this paper we focus on three particular cases of using Big Data in official statistics. The first is macroeconomic nowcasting, which can potentially enhance the timely availability and precision of early estimates of key macroeconomic variables, and possibly provide new Big Data based coincident and leading indicators of economic activity. The second is the possibility of using big data in the construction of new high frequency versions of already existing macroeconomic indicators; this could, for example, concern the estimation of daily or weekly proxies of GDP or of inflation (an example can be found in Aprigliano et al. (2016)). The last concerns the possibility of using big data to construct new qualitative or quantitative indicators measuring phenomena for which official statistics has struggled to find appropriate information to measure or describe satisfactorily. This could be the case for pressure indicators (e.g. on prices or on consumers) and for wellbeing indicators.
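To make the first use case concrete, the following minimal sketch (not taken from this paper) shows a bridge-equation nowcast: a monthly big-data-based indicator is aggregated to quarterly frequency, related to quarterly GDP growth by OLS, and then used with the months observed so far in the current quarter. The file names and column names (gdp_growth.csv, bigdata_index) are hypothetical placeholders.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical inputs: a quarterly target and a monthly big-data-based
# indicator, both indexed by period-start dates.
gdp = pd.read_csv("gdp_growth.csv", index_col=0, parse_dates=True)["gdp_growth"]
idx = pd.read_csv("bigdata_index.csv", index_col=0, parse_dates=True)["bigdata_index"]

# Bridge step: aggregate the monthly indicator to quarterly frequency.
idx_q = idx.resample("QS").mean()

# Fit a simple bridge equation by OLS on the overlapping sample.
data = pd.concat([gdp, idx_q], axis=1).dropna()
X = sm.add_constant(data["bigdata_index"])
bridge = sm.OLS(data["gdp_growth"], X).fit()

# Nowcast the current quarter from the months observed so far
# (here proxied by the mean of the last three monthly observations).
partial_quarter = idx.iloc[-3:].mean()
nowcast = bridge.params["const"] + bridge.params["bigdata_index"] * partial_quarter
print(f"Bridge-equation GDP growth nowcast: {nowcast:.2f}")
```

In practice the regression would be re-estimated and the nowcast updated each time a new monthly (or higher frequency) observation of the indicator arrives.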
While potentially of great value in the contexts mentioned above, Big Data still raise a number of old and new issues, related to data availability and accessibility, data preparation, cleaning and filtering, data modelling, and validation and evaluation. All these aspects are highly relevant in the context of official statistics. Furthermore, the use of Big Data also requires proper statistical and econometric modelling as well as evaluation techniques. A wide range of methods is available, but they should be used carefully, as they have been developed either (in the econometric literature) to handle large but not really big datasets of time series, or (in the statistical and machine learning literatures) to model really big datasets of iid observations, whereas we would like to work with big sets of correlated time series. Broadly speaking, the available methods are based either on summarizing the many variables, on selecting them, or on combining many small models, after a proper data pre-treatment; the first two options are illustrated in the sketch below.

Obviously, the above considerations apply to numerical big data previously converted into a set of time series. This approach is especially relevant for nowcasting exercises or for the construction of high frequency indicators. Nevertheless, when constructing new indicators, especially qualitative ones, non-numerical big data (such as textual or image data) may have to be considered. In such a situation, since conversion into time series or even panel data is not feasible, we are obliged to work with unstructured data, and the problems are even more difficult to solve. In this case, we need to base the construction process on typically multivariate data-exploration tools such as data mining or data analytics (or their extensions, such as text mining). The characteristics of these methods call for even greater care in interpreting the outcome of such investigations.

Given this complex background, in this document we develop a step-by-step approach, accompanied by specific recommendations, for the use of big data for macroeconomic nowcasting and for the construction of high frequency indicators and of new indicators. The approach proposed here intends to generalise and extend the one already proposed by Mazzi (2016) and Kapetanios et al. (2017). Besides the generalisation to different kinds of use of big data, the extensions proposed here mainly concern:

• the analysis of the usefulness of various types of big data when constructing different indicators;
• the methods for converting unstructured numerical big data into structured time series;
• seasonal and calendar adjustment;
• the identification of advantages and drawbacks of statistical methods to be used when dealing with non-numerical big data.

Keeping unchanged the philosophy of the previous papers, the step-by-step approach proposed below is intended to guide researchers and statisticians from the identification and choice of Big Data, through its pre-treatment and econometric/statistical modelling, to the construction of indicators and the comparative evaluation of results. With this step-by-step approach we aim to provide researchers and statisticians with a flexible and complete tool for deciding if, when, how and for what purpose to use big data in official statistics.
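As a minimal sketch of the "summarize" and "select" strategies mentioned above, and under the assumption that the big data have already been converted into a numeric matrix of time series, the example below compresses many candidate series into a few principal-component factors and, alternatively, lets a penalized regression (LASSO) pick a sparse subset. The simulated data are a stand-in for a real structured dataset; this is illustrative, not the paper's own procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for an already-structured big dataset:
# 200 time periods and 500 candidate predictor series.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(200)
Xs = StandardScaler().fit_transform(X)

# "Summarize": extract a handful of principal-component factors
# and use them in a small regression.
factors = PCA(n_components=5).fit_transform(Xs)
factor_model = LinearRegression().fit(factors, y)

# "Select": keep a sparse subset of series via cross-validated LASSO;
# TimeSeriesSplit respects the time ordering of the observations.
lasso = LassoCV(cv=TimeSeriesSplit(n_splits=5)).fit(Xs, y)
print(f"LASSO kept {np.count_nonzero(lasso.coef_)} of {Xs.shape[1]} series")
```

The third strategy, combining many small models, would instead fit one small regression per (group of) predictor(s) and pool the resulting forecasts.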
More specifically:

• Step 1 considers the clear specification of the statistical problem (i.e. nowcasting versus a high frequency indicator versus a new indicator);
• Step 2 refers to the definition of an appropriate IT environment for collecting, storing, treating and analyzing big data;
• Step 3 considers the potential usefulness of Big Data within the statistical exercises defined in Step 1;
• Step 4 discusses issues related to the design of the exercise: possible big data sources, the definition of the type of variables/information to be extracted from big data, etc.;
• Step 5 assesses big data accessibility and the quality of envisaged big data sources;
• Step 6 focuses on big data specific features and on possible ways to move from unstructured to structured data;
• Step 7 deals with approaches to replace missing observations as well as to remove and correct outliers;
• Step 8 deals with ways of filtering out seasonal and other very short-term components from big data;
• Step 9 concerns the presence of bias in big data and ways to correct it;
• Step 10 analyzes the design of the econometric modelling strategy when numerical big data structured in time series form are available;
• Step 11 considers the statistical exploratory techniques to be used in the case of non-numerical, or numerical but unstructured, big data;
• Step 12 deals with the evaluation of the results obtained mainly in Steps 10 and 11 (choice of the benchmark model or nowcasts, evaluation metrics, criteria for assessing the significance of the accuracy and/or timeliness gains, the statistical and economic validity of the results, etc.);
• Step 13 considers the overall implementation of big data based results within the regular production and dissemination process.

All these steps are discussed in the various subsections of Section 2. Each subsection is structured into two parts: the first provides a description of the problem and related issues, and the second provides specific recommendations. Section 3 proposes a more compact presentation of the step-by-step approach and briefly concludes.

2 A STEP-BY-STEP APPROACH

In this section the 13 steps listed above are presented in detail, focusing on the main theoretical and empirical aspects characterizing each step and deriving recommendations that reflect the experience of the authors in dealing with big data, as well as some relevant literature contributions in this field.

2.1 Step 1: Identification of the Statistical Problem

2.1.1 Description

Before starting the search phase for identifying possibly useful big data, it is important to clarify as accurately as possible the problem we want to address. In particular, in the case of a nowcasting exercise, it is important to specify the expected gain in terms of timeliness and the frequency of nowcast updating (especially if the estimates already start during the reference period). In the case of the estimation of high frequency indicators, it is important to define the frequency at which the indicator is expected to be estimated, as well as some feasible targets in terms of timeliness. For new indicators, it is essential to define clearly what they are supposed to measure, whether they should be qualitative or quantitative, the expected frequency of computation and, if relevant, the timeliness of production. A minimal sketch of how these specification choices could be recorded is given below.
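The following sketch is one possible, purely illustrative way to make the Step 1 choices explicit before any data search begins; all field names and values are assumptions, not prescriptions from this paper.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ExerciseSpec:
    """Explicit record of the Step 1 choices, fixed before any data search."""
    exercise: Literal["nowcast", "high_frequency_indicator", "new_indicator"]
    target: str                  # what is to be measured or nowcast
    frequency: str               # estimation/updating frequency, e.g. "weekly"
    timeliness_gain_days: int    # expected gain over the current release
    quantitative: bool = True    # False for a qualitative new indicator

# Example: a weekly-updated nowcast of quarterly GDP growth,
# targeting a 30-day timeliness gain over the official early estimate.
spec = ExerciseSpec(
    exercise="nowcast",
    target="quarterly GDP growth",
    frequency="weekly",
    timeliness_gain_days=30,
)
print(spec)
```

Writing the specification down in this form forces the choices listed above (target, frequency, timeliness, qualitative versus quantitative) to be made explicitly, so that the subsequent steps can be assessed against them.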