EUROPEAN COMMISSION EUROSTAT

Directorate B: Methodology; Dissemination; Cooperation in the European Statistical System
Unit B-1: Methodology; Innovation in Official Statistics

ESTAT/B1/MANG(19)3 Available in EN only

3rd meeting of the Microdata Access Network Group (MANG)

Luxembourg, 13 June 2019

Venue: Foyer Européen, 10 rue Heinrich Heine, L-1720 Luxembourg-Gare

9:30-16:00

Item 3: Metadata standard for microdata


Metadata standard for microdata

1. INTRODUCTION

Eurostat and the European Statistical System (ESS) have a long tradition of providing metadata to users. The Eurostat webpage on metadata is available at:

https://ec.europa.eu/eurostat/data/metadata

To improve services for users of microdata, Eurostat is considering a metadata standard for its microdata, both for research purposes and to accompany its public use files.

To inform the discussion on a metadata standard for microdata, Eurostat asked experts in national statistical institutes to describe what metadata they provide with their microdata. The results are summarised in Section 3. In the discussion of these outcomes it was stressed that user requirements should be considered.

Under this agenda item the members of the MANG are invited to reflect on user requirements for metadata. Section 2 offers a brief introduction from the Eurostat perspective, and Section 3 summarises the situation in some national statistical institutes. This may serve as a trigger for formulating requirements, highlighting both good and bad practices.

2. METADATA FOR EUROPEAN MICRODATA

The added value of European microdata lies in its standardisation across Member States, which allows research spanning several EU countries. Ideally, the metadata should allow several views:

 Across countries, to assess the comparability between countries;

 Across time, to assess the comparability over time;

 Across different versions of the same data set: the full data set as used in the dissemination of official statistics, the partially anonymised scientific use files and the public use files; along this line the protection process can be followed.

Some metadata exist at the level of the survey as a whole (per country and per year), for instance sample size, sample design, response rate and confidentiality treatment. Other metadata exist at the level of individual variables: definition, relation with other variables, format. This also requires a kind of demographic description of the variables: which variables are completely new, which are continuations of previous variables, which appear with a certain pattern (special topics/modules), etc.

In principle, all this information is available in the national statistical institutes and in Eurostat, but it is scattered over separate documents and usually stored along with the data according to the annual production rhythm. A further challenge is the long list of exceptions: countries that implemented a new version of a classification before it was required, countries that implemented it only after it was required, countries that made important changes to their data collection or processing methods, countries that requested additional protection of the microdata because of the size of the country, etc. Compiling all this would be a considerable task. The MANG is invited to reflect on user requirements and on priorities.
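To make the two metadata levels concrete, the sketch below shows how survey-level records (per country and year) and variable-level records, including the "demographic" history of a variable, might be represented. It is purely illustrative: the variable names, field names and figures are invented for the example, not taken from any actual ESS dataset.

```python
# Purely illustrative sketch of the two metadata levels described above.
# All survey figures, variables and field names are invented.

# Survey-level metadata: one record per (country, year).
survey_meta = {
    ("DE", 2017): {"sample_size": 25000, "response_rate": 0.78,
                   "confidentiality_treatment": "global recoding"},
    ("FR", 2017): {"sample_size": 21000, "response_rate": 0.81,
                   "confidentiality_treatment": "local suppression"},
}

# Variable-level metadata, including a simple "demographic" history:
# when the variable first appeared and which variable it continues, if any.
variable_meta = {
    "HHINC": {"definition": "Total disposable household income",
              "format": "numeric, EUR per year",
              "first_year": 2010, "continues": None},
    "ISCO08": {"definition": "Occupation (ISCO-08, 2-digit)",
               "format": "code, 2 characters",
               "first_year": 2011, "continues": "ISCO88"},
}

def comparable_since(var: str) -> int:
    """Earliest year for which a time series is comparable under the
    current variable definition."""
    return variable_meta[var]["first_year"]

print(comparable_since("ISCO08"))  # 2011
```

A real system would of course also need the country-by-country exceptions described above, for instance by attaching deviation notes to the (country, year) records.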


3. METADATA IN SOME NATIONAL STATISTICAL INSTITUTES

Eurostat has asked some experts in national statistical institutes to describe what metadata they provide with the microdata. The results can be summarised as follows:

Q1. What types of metadata should be made available to researchers? For example: variable definitions, survey design, estimation method, protection approach.

The generally held view is that as much metadata as possible should be provided to enable the user to correctly analyse the data and draw accurate and reliable conclusions. This should include general methodological information about the data source (be it administrative or survey data), processing and its potential uses. Wherever possible and applicable, the following points should be captured:

How and why the data has been collected, its scope and the timescale covered;

Details about the production / extraction process, including any imputation procedures. In the case of survey data, this should include the survey design, target population, sample size and response rate, whether the survey is cross-sectional or longitudinal, and any stratification or weighting criteria applied;

Time-series information, if changes have been made to the data collection practices over time;

A list of variable codes complete with their classification, nomenclature, description / definition, type, format and length;

A list of potential values for each variable with a translation of what each code means, including how missing data and “not applicable” cells are processed and annotated;

Details of any protection methods used;

Reliability thresholds, the quality of the variables and their potential uses or caveats. Advice on how variables could be applied as proxies.
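The codebook items listed above (classification, type and format, value labels, and the treatment of missing and "not applicable" cells) could be captured in machine-readable form. The sketch below is a hypothetical example: the variable name is borrowed from EU-SILC for flavour, but the codes and labels shown are simplified inventions.

```python
# Hypothetical machine-readable codebook entry covering the points above
# (classification, type / format, value labels, missing and
# "not applicable" codes). Codes and labels are simplified inventions.
codebook = {
    "PL031": {
        "description": "Self-defined current economic status",
        "classification": "EU-SILC target variable list",  # assumed source
        "type": "categorical",
        "length": 2,
        "values": {"1": "Employee working full-time",
                   "2": "Employee working part-time",
                   "3": "Self-employed"},
        "missing": {"-1": "Missing", "-2": "Not applicable"},
    }
}

def decode(variable: str, raw: str) -> str:
    """Translate a raw code into its label, flagging missing values."""
    entry = codebook[variable]
    if raw in entry["missing"]:
        return f"<{entry['missing'][raw]}>"
    return entry["values"].get(raw, f"<undocumented code {raw!r}>")

print(decode("PL031", "2"))   # Employee working part-time
print(decode("PL031", "-2"))  # <Not applicable>
```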

Q2. Is the metadata that goes with public use files different from the metadata for scientific use of confidential microdata? If yes, please describe the differences.

For some national experts this question was not directly applicable to their work experience, since their organisation does not publish PUFs.

Generally it was felt that, since PUFs are a test / teaching aid whose users include researchers preparing syntax prior to gaining access to the SUFs, the metadata for PUFs and SUFs should not differ, except to reflect the differences in the level of detail provided by each file. However, PUF users also include the general public, so a requirement for more descriptive (less technical) metadata is likely.

The PUF metadata should also include details of the SDC methods applied (including any methods used to generate synthetic data, if applicable) and how these affect the research utility of the data.

Q3. What formats do you use for the data and the metadata?

Responses were mixed, from standard MS packages to a range of specialist tools.


Data is often provided in txt, csv, xls, xlsx, dbf or pdf format. Direct outputs from SAS, SPSS, Stata, R and Oracle Discoverer for OLAP (on-line analytical processing) are also available. [Question – does the user get to choose?]

Metadata is generally provided in pdf, doc, docx, xls or xlsx format.

Additionally, the following specialist tools / formats were cited by EG SDC members:

Insee – Beyond 20/20, DDI¹ and RDF²;

ECB – SDMX³, XBRL / DPM and SDD;

BG NSI – SDMX, JSON-stat, RDF N-Triples and INFOSTAT;

Statistics Finland - JSON-stat;

Statbel – RDF.
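To give a flavour of one of the cited formats, the sketch below builds a minimal JSON-stat 2.0-style dataset. It is a simplified illustration rather than a conformant example: the real specification defines further fields and conventions, and the figures here are invented.

```python
import json

# A minimal JSON-stat 2.0-style dataset, simplified for illustration.
# The values array is ordered according to the category index.
dataset = {
    "version": "2.0",
    "class": "dataset",
    "label": "Example: population by sex (invented figures)",
    "id": ["sex"],
    "size": [2],
    "dimension": {
        "sex": {
            "label": "Sex",
            "category": {
                "index": {"M": 0, "F": 1},
                "label": {"M": "Male", "F": "Female"},
            },
        }
    },
    "value": [4100000, 4250000],
}

print(json.dumps(dataset, indent=2))
```

The appeal of such formats for metadata dissemination is that the value labels, classifications and structure travel with the data in one machine-readable document, instead of a separate pdf.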

Q4. What software do you use to produce, control and disseminate the metadata?

A diverse range of responses was received, with only two national experts citing MS Office software. The remaining organisations have either procured specialist "commercial off the shelf" (COTS) packages or developed their own bespoke solutions. Since such diversity is difficult to summarise, the individual responses are set out below:

Statistics Sweden – The MONA (Microdata Online Access) system provides secure access to microdata via the internet. Data is processed and analysed via a suite of applications⁴ and aggregated results are e-mailed to the user (the microdata remain at Statistics Sweden, which supplies both the hardware and software). The metadata can be accessed via MONA and MetaPlus, the latter being a tool designed to centrally coordinate and harmonise Statistics Sweden's metadata repository. The metadata is also presented on each survey's internet home page.

Hungarian Central Statistical Office – Investigating software capable of handling the DDI format; currently testing the data publishing and online analysis tool NESSTAR.

ECB – Manages a data inventory using the Informatica business glossary tool, which operates on the ISO (International Organization for Standardization) model to ensure global interoperability. A separate Single Data Dictionary (SDD) is also maintained, and work is underway to integrate the two systems in an Oracle database to create a browser application that allows collaborative management of the metadata, with a sound approvals process before publication is enabled.

1 Data Documentation Initiative (DDI) is an international alliance aimed at creating and maintaining a technical documentation standard for describing and preserving statistical metadata, particularly surveys and questionnaires. Standardising this documentation involves modelling the various statistical concepts (questions, variables, code lists, etc.) and their relationships in documents.

2 Recommended by the World Wide Web Consortium (W3C), the Resource Description Framework (RDF) aims to create a global information network by facilitating the dissemination of data and metadata according to linked-data principles. This promotes the publication of common structured and connected data on the internet rather than isolated sets of independent data and metadata.

3 Statistical Data and Metadata eXchange (SDMX) is an international initiative designed to standardise data and metadata exchange, including data structure definitions (DSD) and metadata structure definitions (MSD), concepts, code lists, data flows and IT architecture. It is sponsored by seven international institutions, namely the BIS, ECB, IMF, OECD, UN, World Bank and Eurostat.

4 FreeMat, GeoDa, QGIS, LibreOffice, MPlus, Management Studio, Python, R, RStudio, SAS, SPSS, Stata, StatTransfer, SuperCross and Tinn-R.

Insee – Uses a suite of Colectica software, storing DDI-compliant questionnaires and code books in a Colectica Repository, using Colectica Designer to manage the objects, and providing a Colectica Portal "front end" web tool to enable users to interrogate the metadata. Additional software facilitates the construction of DDI-compliant questionnaires. Other metadata (e.g. concepts, definitions and classifications) are stored in a DataLift repository and managed with internally developed applications.

Slovak Statistics – A bespoke internal system (IŠIS) is used to store and process the data and metadata.

BG NSI – Utilises the open source software Apache, MySQL, PHP and Drupal. BG NSI has developed a bespoke Drupal module (Statistical Data and Metadata) to enable national metadata to be accessed via a web application. For BS metadata presented to Eurostat, the European Statistical System Metadata Handler (ESS MH) is used: a web application developed by Eurostat to support the production, management, exchange and dissemination of European and national reference metadata.

Statistics Finland – Has developed its own system for compiling and managing metadata. Called Muuttuja Editori (translation: Variable Editor), it sends information directly to the internally produced metadata catalogue (Taika) for publication on the internet.

Statistics Netherlands – Uses an enterprise content management system based on Documentum to produce and control the metadata. Dissemination is achieved via a web interface.

Statistics Portugal – Has developed a bespoke metadata system (SMI) which controls, maintains and disseminates the component parts (e.g. concepts, classifications, variables, data collection tools and methodological documents) to ensure they are integrated and harmonised.

Statbel – Current dissemination is in pdf files, though Statbel expressed a desire to move to an open source solution such as Wikibase, which would enable users to query the microdata via the internet and create their own ad hoc reports.

Q5. What are the strengths and weaknesses of this software?

This is very much dependent on whether a COTS or a bespoke solution is selected, with COTS posing particular challenges. One member commented that an "upgrade" from an Access database to a COTS solution meant losing the ability to make swift developments in response to iterative user feedback; however, the COTS package has delivered enhanced functionality and integration.

Open source software is generally considered "easy" to modify, although some respondents highlighted that a level of expertise in the chosen software must be maintained to enable this.

Bespoke tools can deliver tailor-made benefits designed to meet specific business requirements, with organisations able to prioritise functionality and conduct beta testing. The result should be an easy-to-use system, available in a variety of languages, with the ability to easily find and download metadata in a format of the user's choosing. Access management, security and time-series functionality can also be built in.


The major benefit of both COTS and bespoke solutions is that they bring everything together, so all the data and metadata components are managed in (and queried from) the same system. This facilitates standardisation, interoperability and ease of use for both the statistical institution and the users. Systems which enable international standard-setting and data exchange enhance these benefits further.

Tools which allow users to "build their own tables" via an internet site mean that the microdata remain in the statistical institution and only aggregated results are received by the user. This provides an additional layer of SDC whilst enabling the user to interrogate the microdata at a time and location convenient to them.

The intended use of any system needs to be determined and documented at the outset. One weakness cited by a national expert was a system initially designed as an internal tool, and thus containing metadata suited to an internal audience, which now requires additional work to prepare appropriate metadata for external users.

Q6. Please provide any feedback you have received from the user community on the utility of the metadata and its format.

Any metadata is appreciated by users, especially if it is easy to access via a web interface, can be downloaded in a variety of formats (not just pdf) and in a variety of languages. Feedback is, therefore, generally positive with just a few shortfalls highlighted.

A lot of work has to be completed in order to achieve a comprehensive set of metadata, and thus fully satisfy the user community. Compiling metadata concerning longitudinal time-series changes is a considerable challenge, and therefore this detail often does not exist.

The same is true for comparisons across geographical boundaries. In order to compare variables across countries, it is important to know how apparently similar variables actually differ. A Nordic study suggested that one possible solution would be to use high-level common concepts and to group variables by country according to the concept to which they belong.

Additional utility features highlighted by users include a robust search facility and instructions for using data provided directly as an output from statistical software packages (e.g. R and SAS).

Users would like variable names (rather than codes) to be included as labels in datasets, or at least for it to be easy for them to add them automatically to the data provided.

Conclusion

The diversity of responses was illuminating, and the key message is that no common system is used by even a small majority of the institutes.

4. QUESTIONS TO THE MANG

 Do you have any requirements on the contents of the metadata?

 What would be your priorities?

 What would you like the metadata to offer in addition to the views listed in section 2?

 Do you have examples of or preferences on specific tools or formats?