Statistical Data Warehouse Design Manual
Authors:
CBS - Harold Kroeze
ISTAT - Antonio Laureti Palma
SF - Antti Santaharju
INE - Sónia Quaresma
ONS - Gary Brown
LS - Tauno Tamm
ES - Valerij Zavoronok

24th February 2017

Excellence Center: ON MICRO DATA LINKING AND DATA WAREHOUSING

i-General Introduction
Author: Antonio Laureti Palma

1-Implementation
1.1 Current state and pre-conditions
Author: Antti Santaharju
1.2 Design Phase roadmap
Authors: Antonio Laureti Palma, Antti Santaharju
1.3 Building blocks – The input datasets
Author: Antti Santaharju
1.4 Business processes of the layered S-DWH
Authors: Antonio Laureti Palma, Antti Santaharju, Sónia Quaresma

2-Governance
2.1 Governance of the metadata
Authors: Harold Kroeze, Sónia Quaresma
2.2 Management processes
Author: Antonio Laureti Palma
2.3 Type of analysts
Author: Sónia Quaresma

3-Architecture
3.1 Business architecture
Authors: Antonio Laureti Palma, Sónia Quaresma
3.2 Information systems architecture
Authors: Antonio Laureti Palma, Sónia Quaresma
3.3 Technology Architecture (docs in the Annex)
3.4 Data centric workflow
Author: Antonio Laureti Palma
3.5 Focus on SDMX in statistical data warehouse
Authors: Antonio Laureti Palma, Sónia Quaresma

4-Methodology
4.1 Data cleaning
Author: Gary Brown
4.2 Data linkage
Author: Gary Brown
4.3 Estimation
Author: Gary Brown
4.4 Revisions
Author: Gary Brown
4.5 Disclosure control
Author: Gary Brown

5-Metadata
5.1 Fundamental principles
Author: Tauno Tamm
5.2 Business Architecture: metadata
Author: Sónia Quaresma
5.3 Metadata System
Author: Tauno Tamm
5.4 Metadata and SDMX
Author: Tauno Tamm

A1-Annex: Technology Architecture
I.1 Technology Architecture
Author: Sónia Quaresma
I.2 Classification of SDMX Tools
Authors: Valerij Zavoronok, Sónia Quaresma

Preface
Author: Harold Kroeze

In order to modernise statistical production, ESS Member States are searching for ways to make optimal use of all available data sources, existing and new. This modernisation implies not only an important organisational impact but also higher and stricter demands on data and metadata management. Both activities are often decentralised and implemented in various ways, depending on the needs of specific statistical systems (stove-pipes), whereas maximum re-use of available statistical data demands just the opposite: a centralised and standardised set of (generic) systems with a flexible and transparent metadata catalogue that gives insight into, and easy access to, all available statistical data.

To reach these goals, building a Statistical Data Warehouse (S-DWH) is considered to be a crucial instrument. The S-DWH approach enables NSIs to identify the particular phases and data elements in the various statistical production processes that need to be common and reusable.

The CoE on DWH provides a document that helps and guides in the process of designing and developing an S-DWH: the S-DWH Design Manual.

This document answers the following questions:
- What is a Statistical Data Warehouse (S-DWH)?
- How does an S-DWH differ from a traditional ("commercial") DWH?
- Why should we build an S-DWH?
- Who are the envisaged users of an S-DWH?

It also gives a road map for designing, building and implementing the S-DWH:
- What are the prerequisites for implementing an S-DWH?
- What are the phases/steps to take?
- How to prepare for an implementation?

Acknowledgements

This work is based on reflections within the team of the Centre of Excellence on Datawarehousing as well as on discussions with a broader group of experts during the CoE's workshops. The CoE would like to thank all workshop attendees for their participation. Special thanks to Gertie van Doren-Beckers for administrative support.
i - General Introduction
Author: Antonio Laureti Palma

i Introduction

The statistical production system of an NSI is a cycle of organizational activity: the acquisition of data, the elaboration of information, and the custodianship and distribution of that information. This cycle involves a variety of stakeholders: for example, those who are responsible for assuring the quality, accessibility and program of acquired information, and those who are responsible for its safe storage and disposal. Information management embraces all the generic concepts of management, including planning, organizing, structuring, processing, controlling, evaluation and reporting of information activities, and is closely related to, and overlaps with, the management of data, systems, technology and statistical methodologies.

Due to the great evolution in the world of information, users’ expectations of, and need for, official statistics have increased in recent years. Users require wider, deeper, quicker and less burdensome statistics. This has led NSIs to explore new opportunities for improving statistical production by using several different sources of data in an integrated approach, in terms of both data and processes. Some practical examples follow.

In the last European census, administrative data was used by almost all the countries. Each country used either a full register-based census or a register combined with direct surveys. The census processes were quicker than in the past and generally gave better results. The 2011 German census, the first non-register-based census taken in that country since 1983, provides a useful reminder of the danger of using only a register-based approach.
In fact, the census results indicated that the administrative records on which Germany had based official population statistics for several decades overestimated the population, because they failed to adequately record foreign-born emigrants. This suggests that the mixed data source approach, which combines direct-survey data with administrative data, is the best method to obtain accurate results (Citro 2014), even if it is much more complex to organize in terms of methodologies and infrastructure.

At the European level, the SIMSTAT project, an important operational collaboration between all Member States, started a few years ago. This is an innovative approach for simplifying Intrastat, the European Union (EU) data collection system on intra-EU trade in goods. It aims to reduce the administrative burden while maintaining data quality by exchanging microdata on intra-EU trade between Member States and re-using them, and it covers both technical and statistical aspects. In this context, direct survey or administrative data are shared between Member States through a central data hub. However, SIMSTAT brings an increase in complexity, due to the need for a single coherent distributed environment where the 28 countries can work together.

Also in the context of Big Data, there are several statistical initiatives at the European level, for example “use of scanner data for consumer price index” (ISTAT) or “aggregated mobile phone data to identify commuting patterns” (ONS), both of which require an adjustment of the production infrastructure in order to manage these big data sets efficiently. In this case the main difficulty is to find a data model able to merge big data and direct surveys efficiently.

Recently, also in the context of regular structural or short-term statistics, NSIs have expressed the need for a more intensive use of administrative data in order to increase the quality of statistics and reduce the statistical burden.
In fact, one or more administrative data sources could be used to support one or more surveys on different topics (for example, the Italian Frame-SBS). Such a production approach creates more difficulties, because it increases the dependency between production processes: different surveys must be managed in a common, coherent environment. This difficulty has led NSIs to assess the adequacy of their operational production systems. One of the main drawbacks to emerge is that many NSIs are organized in single operational life cycles for managing information, the so-called “stove-pipe” model. This model is based on independent procedures, organizations, capabilities and standards that deal with statistical products as individual services. If an NSI whose production system is based mostly on the stove-pipe model wants to use administrative data efficiently, it has to change to a more integrated production system.

All the above cases indicate the need for a complex infrastructure in which the use of integrated data and procedures is maximized. This infrastructure would therefore have two basic requirements:
- the ability to manage large amounts of data,
- a common statistical frame in terms of IT infrastructure, methodologies, standards and organization, to reduce the risk of losing coherence or quality.

One such complex infrastructure that can meet these requirements is a corporate Statistical Data Warehouse (S-DWH), possibly metadata-driven, in which statisticians can manage micro and macro data in the different production phases. A metadata-driven system is one in which metadata create a logical, self-describing framework that allows the data to drive functionality. The S-DWH approach would then support a high level of modularity and standards that