3rd Meeting of the Microdata Access Network Group (MANG)
EUROPEAN COMMISSION
EUROSTAT
Directorate B: Methodology; Dissemination; Cooperation in the European Statistical System
Unit B-1: Methodology; Innovation in Official Statistics

ESTAT/B1/MANG(19)3
Available in EN only

3rd meeting of the Microdata Access Network Group (MANG)
Luxembourg, 13 June 2019
Venue: Luxembourg
Foyer Europeen
10, rue Heinrich Heine
L-1720 Luxembourg – Gare
9:30-16:00

Item 3: Metadata for microdata

Metadata standard for microdata

1. INTRODUCTION

Eurostat and the European Statistical System (ESS) have a long tradition of providing metadata to users. The Eurostat webpage on metadata is available at: https://ec.europa.eu/eurostat/data/metadata

To improve services for users of microdata, Eurostat is considering a metadata standard for its microdata, both for research purposes and to accompany its public use files. To inform this discussion, Eurostat has asked experts in several national statistical institutes to describe what metadata they provide with their microdata. The results are summarised in paragraph 3. In the discussion of these outcomes it was stressed that user requirements should be considered.

Under this agenda item the members of the MANG are invited to reflect on user requirements for metadata. Paragraph 2 offers a brief introduction from the Eurostat perspective, and paragraph 3 summarises the situation in some national statistical institutes. This may be used as a trigger for formulating requirements, highlighting both good and bad practices.

2. METADATA FOR EUROPEAN MICRODATA

The added value of European microdata lies in its standardisation across Member States, which allows research spanning several countries in the EU. Ideally, the metadata should allow several views:
- across countries, to assess comparability between countries;
- across time, to assess comparability over time;
- across different versions of the same data set: the full data set as used in the dissemination of official statistics, the partially anonymised scientific use files and the public use files; along this line one can follow the protection process.

Some metadata exist at the level of the survey as a whole (per country and per year), for instance sample size, sample design, response rate and confidentiality treatment. Other metadata are at the level of individual variables: definition, relation to other variables, format. This also requires a kind of demographic description of the variables: which variables are completely new, which are continuations of previous variables, which variables appear with a certain pattern (special topics/modules), and so on.

All this information is in principle available in the national statistical institutes and in Eurostat, but it is scattered over separate documents and usually stored along with the data according to the annual production rhythm. Another challenge is the long list of exceptions: countries that implemented a new version of a classification before it was required, countries that implemented it after it was required, countries that made important changes to their data collection or processing methods, countries that requested additional protection of the microdata because of the size of the country, and so on. Bringing all of this together would be a considerable task.

The MANG is invited to reflect on user requirements and on priorities.
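As a purely illustrative aid (not an agreed Eurostat standard), the sketch below shows one way the survey-level and variable-level metadata described above could be represented, with one record per country, year and file version. All type and field names are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class VariableMetadata:
    # Variable-level metadata: definition, relation to other variables, format,
    # plus the "demographic" description of the variable.
    name: str                              # variable code (hypothetical)
    definition: str
    var_format: str                        # type / format / length
    related_variables: list[str] = field(default_factory=list)
    replaces: str | None = None            # continuation of a previous variable, if any
    module: str | None = None              # special topic / module pattern, if applicable

@dataclass
class SurveyMetadata:
    # Survey-level metadata, per country and per year.
    country: str
    year: int
    file_version: str                      # "full", "scientific use file" or "public use file"
    sample_design: str
    sample_size: int
    response_rate: float
    confidentiality_treatment: str
    variables: list[VariableMetadata] = field(default_factory=list)
```

Under this kind of model, the views across countries, across time and across file versions would amount to comparing records that share the same variable name but differ in country, year or file version.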
3. METADATA IN SOME NATIONAL STATISTICAL INSTITUTES

Eurostat has asked experts in several national statistical institutes to describe what metadata they provide with their microdata. The results can be summarised as follows.

Q1. What types of metadata should be made available to researchers? For example: variable definitions, survey design, estimation method, protection approach.

The generally held view is that as much metadata as possible should be provided, to enable the user to analyse the data correctly and draw accurate and reliable conclusions. This should include general methodological information about the data source (be it administrative or survey data), its processing and its potential uses. Wherever possible and applicable, the following points should be captured:
- how and why the data have been collected, their scope and the timescale covered;
- details about the production/extraction process, including any imputation procedures; in the case of survey data, this should include the survey design, target population, sample size and response rate, whether the survey is cross-sectional or longitudinal, and any stratification or weighting criteria applied;
- time-series information, if changes have been made to the data collection practices over time;
- a list of variable codes complete with their classification, nomenclature, description/definition, type, format and length;
- a list of potential values for each variable with a translation of what each code means, including how missing data and "not applicable" cells are processed and annotated;
- details of any protection methods used;
- reliability thresholds, the quality of the variables and their potential uses or caveats, including advice on how variables could be used as proxies.

Q2. Is the metadata that goes with public use files different from the metadata for scientific use of confidential microdata? If yes, please describe the differences.

For some national experts this question was not directly applicable to their work, since their organisation does not publish PUFs. Generally it was felt that, since PUFs are a test/teaching aid whose users include researchers preparing syntax before gaining access to the SUFs, the metadata for PUFs and SUFs should not differ, except to reflect the differences in the level of detail provided by each file. However, PUF users also include the general public, so a requirement for more descriptive (less technical) metadata is likely. The PUF metadata should also include details about the SDC methods applied (including the methods of synthetisation, if applicable) and how these affect the research utility of the data.

Q3. What formats do you use for the data and the metadata?

Responses were mixed, ranging from standard MS Office packages to a range of specialist tools. Data is often provided in txt, csv, html, xls, xlsx, dbf or pdf format. Direct outputs from SAS, SPSS, Stata, R and Oracle Discoverer for OLAP (online analytical processing) are also available. [Question – does the user get to choose?] Metadata is generally provided in pdf, doc, docx, xls or xlsx format. Additionally, the following specialist tools/formats were cited by EG SDC members:
- Insee: Beyond 20/20, DDI [1] and RDF [2];
- ECB: SDMX [3], XBRL/DPM and SDD;
- BG NSI: SDMX, JSON-stat, RDF N-Triples and INFOSTAT;
- Statistics Finland: JSON-stat;
- Statbel: RDF, Turtle.
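To make the RDF and Turtle formats cited above more tangible, the following is a minimal sketch of how a single variable description could be expressed as linked data using the Python rdflib library. The namespace, variable code and property names are invented for illustration and do not reflect any existing Eurostat or national vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace and variable code, for illustration only.
EX = Namespace("http://example.org/microdata-metadata/")

g = Graph()
g.bind("ex", EX)

var = EX["VAR001"]  # invented variable code
g.add((var, RDF.type, EX.Variable))
g.add((var, RDFS.label, Literal("Example variable definition")))
g.add((var, EX.format, Literal("numeric, length 1")))
g.add((var, EX.classification, Literal("hypothetical code list, version 1")))

# Serialise the small graph as Turtle, one of the formats cited above.
print(g.serialize(format="turtle"))
```

Following the "linked data" idea described in footnote [2], such triples could be published alongside the data and linked to shared code lists, rather than kept in isolated documents.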
Q4. What software do you use to produce, control and disseminate the metadata?

A diverse range of responses was received, with only two national experts citing MS Office software. Instead, organisations have either procured specialist "commercial off the shelf" (COTS) packages or developed their own bespoke solutions. Since such diversity is difficult to summarise, the individual responses are set out below:

- Statistics Sweden – The MONA (Microdata Online Access) system provides secure access to microdata via the internet. Data is processed and analysed via a suite of applications [4] and aggregated results are e-mailed to the user (the microdata remain at Statistics Sweden, which supplies both the hardware and the software). The metadata can be accessed via MONA and via MetaPlus, the latter being a tool designed to centrally coordinate and harmonise Statistics Sweden's metadata repository. It is also presented on each survey's internet home page.

- Hungarian Central Statistical Office – Investigating software capable of handling the DDI format; currently testing the data publishing and online analysis tool NESSTAR.

- ECB – Manages a data inventory using the Informatica business glossary tool, which operates on the ISO (International Organization for Standardization) model to ensure global interoperability. A separate Single Data Dictionary (SDD) is also maintained, and work is underway to integrate the two systems in an Oracle database, to create a browser application that allows collaborative management of the metadata and has a sound approvals

Footnotes:
[1] The Data Documentation Initiative (DDI) is an international alliance aimed at creating and maintaining a technical documentation standard for describing and preserving statistical metadata, particularly surveys and questionnaires. Standardising this documentation involves modelling the various statistical concepts (questions, variables, code lists, etc.) and their relationships in XML documents.
[2] Recommended by the World Wide Web Consortium (W3C), the Resource Description Framework (RDF) aims to create a global information network by facilitating the dissemination of data and metadata according to "linked data" principles. This promotes the publication of common, structured and connected data on the internet rather than isolated sets of independent data and metadata.
[3] Statistical Data and Metadata eXchange (SDMX) is an international initiative designed to standardise data and metadata exchange, including data structure definitions (DSD), metadata structure definitions (MSD), concepts, code lists, data flows and IT architecture. It is sponsored by seven international institutions, namely the BIS, ECB, IMF, OECD, UN, the World Bank and Eurostat.
[4] FreeMat, GeoDa, QGIS, LibreOffice, MPlus, Management Studio, Python, R, RStudio, SAS, SPSS, Stata, StatTransfer, SuperCross and Tinn-R.