SDMX GLOBAL CONFERENCE, PARIS, 19-21 January 2009

THE IMPLEMENTATION OF THE SDMX CONTENT-ORIENTED GUIDELINES WITH REGARD TO EXCHANGE IN THE EUROPEAN STATISTICAL SYSTEM

(August Götzfried and Marco Pellegrino, Eurostat)

This paper describes the implementation of a more advanced framework for metadata production, exchange and sharing that makes use of the new SDMX content-oriented guidelines published in January 2009. The combined use of technical standards and content guidelines can greatly support the standardisation, exchange, web dissemination and re-use of structural and reference metadata¹ within the European Statistical System (ESS). With regard to structural metadata, Eurostat is not only implementing the content-oriented guidelines, but is also undertaking further harmonisation efforts for additional sets of metadata, which will be offered for use at SDMX level in a second phase. With regard to reference metadata, the use of the Cross-domain Concepts (version 2009) is leading to the implementation of a Europe-wide standard for reference metadata called ESMS, the Euro-SDMX Metadata Structure. The ESMS, currently being implemented within Eurostat, will successively be used for reporting metadata to Eurostat and for exchanging metadata between international organisations, for instance between Eurostat, the , the IMF and the OECD. SDMX work progressed considerably in 2008, but more still needs to be done to take advantage of the full potential that the SDMX standards and content guidelines provide.

1. Background: Using SDMX standards for improving metadata exchange

The dissemination of reference metadata on Eurostat's web site has increased considerably since the implementation of the free dissemination policy in 2004. As more and more data were freely disseminated, the metadata files also needed to keep pace. These metadata files, based on the SDDS standard, explained data content, methodology, re-use precautions and overall quality, providing users with a standard template across all statistical domains. The visibility of these metadata has been high from the start (statistics on web usage show an average of more than 3000 consultations of reference metadata files per day on Eurostat's web site). A policy for central quality monitoring of the contents of the metadata files was put in place, together with training initiatives and specific actions for increasing the metadata coverage.

In spite of this success, users are still confronted with a series of problems when trying to retrieve comparable methodological information per statistical domain, for Eurostat on the one hand and for different countries on the other. The European Statistical System as a whole still does not provide users with a common set of standardised, comparable and re-usable reference metadata describing both European statistics (produced by Eurostat) and national statistics produced by EU Member States or other associated countries. Reference metadata collected for countries are in most cases non-standardised and follow different structures, normally determined by the managers of the respective statistical domain. Even the SDDS, implemented by the IMF, hardly covers 5% of the whole dissemination of a statistical institute.

¹ In SDMX, "structural metadata" are those metadata acting as identifiers and descriptors of the data, such as names of variables or dimensions of statistical cubes. Structural metadata must be associated with the data, otherwise it becomes impossible to identify, retrieve and browse the data. "Reference metadata" are metadata that describe the contents and the quality of the statistical data (the concepts used, metadata describing the methods used for the generation of the data, and metadata describing the different quality dimensions of the resulting statistics, e.g. timeliness, accuracy). While these metadata exist and may be exchanged independently of the data and its structural metadata, they are often linked ("referenced") to data.

What is more, according to our assessment, many SDDS-based reference metadata are quite weak with regard to information on national statistical processes and on data quality. For these reasons, the standardisation process needed to be accelerated, so that comparable data and metadata could be made available more easily, reducing redundancies, minimising reporting efforts and establishing a stronger coordination of metadata requirements. These tasks involved the development of more advanced metadata standards both in the IT area (system architecture and tools) and in metadata content.

This is where SDMX fits into the picture. While version 1.0 of the SDMX standards was mainly concerned with data sets and their structural definitions, version 2.0 introduced full metadata support, providing for the attachment of reference metadata to any part of the data tree, as well as for the reporting and exchange of metadata using XML formats. These functionalities can be very useful for supporting data quality initiatives, allowing for a better exchange of quality-related metadata. One of the most important features of the SDMX information model is the specification of formal rules for formatting data and metadata, so that these can be exchanged, read and processed without manual intervention. A web service, using information about the web locations of data and metadata, can navigate, find and automatically process the information for analytical and dissemination purposes, even querying metadata across various sites to retrieve a customised report in XML format.

Chart 1 (taken from the version 2 package) depicts the essential characteristics supported in the SDMX model for data and metadata reporting. The pivot of this diagram is the Data or Metadata Flow, maintained by the organisation that collects data or metadata. A Data Flow is linked to a "Data Structure Definition" (DSD), while a Metadata Flow is linked to a "Metadata Structure Definition" (MSD), which defines the structure of the metadata and identifies the data elements to which metadata can be attached. Data or metadata may be made available by many providers, and any provider may report or publish data or metadata for several data or metadata flows, according to a Provision Agreement. The Data or Metadata Flow may also be linked to one or more topics (Category) in a subject-matter scheme (Category Scheme). A category scheme provides a way of classifying data for collection, reporting or publication.
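The relationships described above (flows, structure definitions, provision agreements and categories) can be sketched as a small data model. This is only an illustrative sketch in Python, not the SDMX information model itself: the class names, field names and all identifiers such as `LABOUR_DATA` or `NSI_A` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StructureDefinition:
    """A DSD (for data) or an MSD (for reference metadata)."""
    id: str
    kind: str  # "DSD" or "MSD"


@dataclass(frozen=True)
class Category:
    """A topic inside a subject-matter category scheme."""
    scheme: str
    name: str


@dataclass
class Flow:
    """A data or metadata flow, maintained by the collecting organisation."""
    id: str
    maintainer: str
    structure: StructureDefinition
    categories: list = field(default_factory=list)


@dataclass(frozen=True)
class ProvisionAgreement:
    """Records that a provider reports or publishes for one flow."""
    provider: str
    flow_id: str


def flows_for_provider(provider, agreements):
    """One provider may serve several flows."""
    return sorted({a.flow_id for a in agreements if a.provider == provider})


def providers_for_flow(flow_id, agreements):
    """One flow may be served by many providers."""
    return sorted({a.provider for a in agreements if a.flow_id == flow_id})


# Hypothetical example objects (all identifiers are invented).
dsd = StructureDefinition(id="EXAMPLE_DSD", kind="DSD")
msd = StructureDefinition(id="EXAMPLE_MSD", kind="MSD")
topic = Category(scheme="EXAMPLE_SUBJECT_SCHEME", name="Labour market")
flows = [
    Flow("LABOUR_DATA", "COLLECTOR", dsd, [topic]),
    Flow("LABOUR_METADATA", "COLLECTOR", msd, [topic]),
]
agreements = [
    ProvisionAgreement("NSI_A", "LABOUR_DATA"),
    ProvisionAgreement("NSI_A", "LABOUR_METADATA"),
    ProvisionAgreement("NSI_B", "LABOUR_DATA"),
]

print(flows_for_provider("NSI_A", agreements))
print(providers_for_flow("LABOUR_DATA", agreements))
```

Keeping the provision agreement as a separate record, rather than a list inside the provider or the flow, mirrors the many-to-many relationship between providers and flows described in the text.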

Chart 1 - SDMX Data and Metadata Reporting

The core of the SDMX model for reference metadata is the concept of "Metadata Structure Definition", which defines:
• which metadata concepts are to be reported;
• the identity of the metadata concept (for example, a code which may simply be derived from the cross-domain concepts scheme);
• the format and representation (textual or coded);
• the role in its usage, e.g., mandatory or conditional.
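The four items listed above can be illustrated with a minimal sketch of a structure definition and a check of a metadata report against it. The sketch is in Python and rests on assumptions: the class names, the concept identifiers (`CONTACT`, `FREQ_DISS`, `TIMELINESS`) and the simplified code set are invented for illustration and are not taken from an actual MSD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSpec:
    concept_id: str              # identity of the metadata concept
    representation: str          # "text" or "coded"
    mandatory: bool              # role: mandatory (True) or conditional (False)
    codes: frozenset = frozenset()  # allowed codes when representation is "coded"


@dataclass(frozen=True)
class MetadataStructureDefinition:
    id: str
    concepts: tuple              # which concepts are to be reported


def validate_report(msd, report):
    """Return a list of problems; an empty list means the report conforms."""
    problems = []
    known = {spec.concept_id for spec in msd.concepts}
    for concept_id in report:
        if concept_id not in known:
            problems.append(f"unknown concept: {concept_id}")
    for spec in msd.concepts:
        value = report.get(spec.concept_id)
        if value is None:
            if spec.mandatory:
                problems.append(f"missing mandatory concept: {spec.concept_id}")
            continue
        if spec.representation == "coded" and value not in spec.codes:
            problems.append(f"invalid code for {spec.concept_id}: {value}")
    return problems


# Hypothetical structure definition with three concepts.
example_msd = MetadataStructureDefinition(
    id="EXAMPLE_MSD",
    concepts=(
        ConceptSpec("CONTACT", "text", mandatory=True),
        ConceptSpec("FREQ_DISS", "coded", mandatory=False,
                    codes=frozenset({"A", "Q", "M"})),
        ConceptSpec("TIMELINESS", "text", mandatory=False),
    ),
)

good = {"CONTACT": "Unit X, Example Office", "FREQ_DISS": "Q"}
bad = {"FREQ_DISS": "X", "EXTRA": "not in the structure"}

print(validate_report(example_msd, good))
print(validate_report(example_msd, bad))
```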

Reference metadata may be attached to different object types (for instance a data set, a time series, or an observation). Reference metadata files are, however, often attached at a high level (i.e. at data set or statistical domain level, or even at agency level) because the contents of a file often refer to several or even all of the data tables produced on the basis of the respective data set. A Metadata Structure Definition also needs to identify the object the metadata are attached to.

For using this model, it was necessary:
• to develop and manage standard Metadata Structure Definitions (MSD) compliant with SDMX version 2.0, using a standardised list of metadata concepts;
• to design and develop a set of IT tools allowing the creation and management of reference metadata, using as much as possible information already stored in existing metadata files;
• to design and develop a system architecture (based on a registry) for transferring reference metadata to external users and to the web site, independently of the respective IT platforms.

During the course of 2007 and 2008, Eurostat dedicated significant resources to the development of a system architecture and IT tools aimed at supporting data and metadata exchange within the European Statistical System. At the same time, intensive consultations were conducted within several ESS working groups and task forces, while the SPC (Statistical Programme Committee), the IT Directors' Group and the Directors' meeting expressed their favourable opinion of, and encouragement for, the SDMX implementation.

2. Content-oriented guidelines: a step forward towards statistical standardisation

The Content-Oriented Guidelines (COG) complement the SDMX Technical Standards with a set of recommended practices for creating interoperable data and metadata across statistical domains. The release of the new package is therefore a big achievement with regard to the international harmonisation of metadata messages. From now on, metadata reports can be structured using the SDMX list of standard concepts (such as "contact", "timeliness", "dissemination format", "classification system", or "comparability") with a common description and identification, so that IT systems which exchange data and metadata understand what the data or metadata refer to, without major difficulty in determining the semantic equivalence between concepts.

The SDMX COG package comprises the following main elements:
• SDMX Cross-Domain Concepts: a list of 66 metadata concepts plus a series of sub-concepts, relevant to several statistical domains and recommended for use in data and metadata structures to promote standardisation and interoperability. The composition of this list was guided by concepts already in use by SDMX sponsoring organisations or other statistical organisations.
• SDMX Cross-Domain Code Lists: 9 code lists recommended for use, describing cross-domain concepts such as observation status, confidentiality, frequency, sex, reference area and currency. These lists contain code values and descriptions which are recommended by SDMX or against which the codes in use by statistical organisations can be mapped. Statistical organisations may, however, have additional code values and code descriptions which are not included in the lists presented.
• SDMX Statistical Subject-Matter Domains: a standard structure of statistical domains along three main categories: demographic and social statistics, economic statistics and environment, and multi-domain statistics. This subject-matter domain structure provides orientation for organising the production and exchange of statistical data and metadata within SDMX.
• SDMX Metadata Common Vocabulary: a repository containing 397 concepts and related definitions which should be used as the main basis when dealing with metadata. This vocabulary also covers the cross-domain concepts list.

Compared to previous drafts of the SDMX COG, the 2009 release is broader in scope and its contents are of much better quality.
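The mapping of codes in use by a statistical organisation against a recommended cross-domain code list, as mentioned for the Cross-Domain Code Lists above, can be sketched as follows. This is an illustrative Python sketch only: the frequency codes shown are a small subset chosen for the example, and the local codes (`YEARLY`, `QUARTER`, `MONTH`) and the function name are hypothetical.

```python
# Illustrative subset of a frequency code list (code value -> description).
FREQUENCY_CODES = {
    "A": "Annual",
    "Q": "Quarterly",
    "M": "Monthly",
}

# Hypothetical codes used inside one organisation, mapped to the common codes.
LOCAL_TO_COMMON = {
    "YEARLY": "A",
    "QUARTER": "Q",
    "MONTH": "M",
}


def to_common_code(local_code):
    """Translate a local frequency code; fail loudly if no mapping exists."""
    key = local_code.strip().upper()
    common = LOCAL_TO_COMMON.get(key)
    if common is None or common not in FREQUENCY_CODES:
        raise KeyError(f"no mapping for local code {local_code!r}")
    return common


local_series = ["yearly", "Quarter", "MONTH"]
print([to_common_code(code) for code in local_series])
```

Raising an error for unmapped codes, instead of passing them through, keeps additional local code values visible so that they can be either mapped or proposed as extensions.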

3. The implementation of Content-oriented Guidelines within the European Statistical System with regard to metadata

The implementation of the SDMX COG within the European Statistical System has started in the following main areas:

3.1. Reference metadata: The Euro-SDMX metadata structure (ESMS)

Eurostat was confronted with the need to improve and extend its reference metadata standard for a series of reasons. These were in particular:
• The European Statistics Code of Practice and the upcoming European statistical framework legislation, which stress the need for more metadata on quality. The Code of Practice, adopted in 2005 as a Commission Recommendation, requests in its principle 15 (accessibility and clarity) that "metadata are to be documented according to a standardised metadata system". This Code of Practice also provides a detailed description of the principles with a link to indicators related to data quality.
• The need to offer a harmonised structure of reference metadata to countries in order to improve the compilation and exchange of national metadata.
• The need to further improve the contents and attachment of the existing reference metadata in order to improve their accessibility for users.

Therefore, Eurostat decided to build the Euro-SDMX Metadata Structure (ESMS) to replace the old SDDS format. The ESMS aims at documenting methodologies, quality and the statistical production processes in general. It uses 21 high-level concepts, with a limited breakdown of sub-items, strictly derived from the SDMX list of cross-domain concepts. Most of the reference metadata in the ESMS are currently inserted as free text, although in the near future some of them may follow a code list (e.g. frequency, or reference area). The ESMS covers the following statistical concepts (the full set of concepts and sub-items is shown in annex 1):

1. Contact
2. Metadata update
3. Statistical presentation
4. Unit of measure
5. Reference period
6. Institutional mandate
7. Confidentiality
8. Release policy
9. Frequency of dissemination
10. Dissemination format
11. Accessibility of documentation
12. Quality management
13. Relevance
14. Accuracy and reliability
15. Timeliness and punctuality
16. Comparability
17. Coherence
18. Cost and burden
19. Data revision
20. Statistical processing
21. Comment

Within Eurostat's metadata system, the ESMS replaced the SDDS on 1 December 2008. The transition from the old to the new standard was made easier by the availability of a metadata IT application (EMIS) used for treating, storing and extracting reference metadata, and of course by the conceptual similarity between the SDDS and ESMS templates. On top of the statistical concepts which were already part of the SDDS template, the ESMS contains a number of additional statistical concepts, mainly related to data quality, such as accuracy, comparability, coherence and relevance. Therefore, the ESMS better integrates the information which is part of the ESS standard quality report (including specific quantitative quality indicators, such as the non-response rate). The quality information contained in the ESMS will allow a much better comparison of the quality reached by each statistical survey. The attachment level of the ESMS still needs to be improved in many statistical domains, moving in general towards a higher attachment level, with a tentative reduction and further standardisation of metadata compared to the previously produced SDDS files.

The ESMS, which uses SDMX concepts and is supported by a Metadata Structure Definition and by a "generic" XML metadata format fully compliant with SDMX version 2.0, is addressed to the whole European Statistical System: Eurostat, Member States and associated countries. The ESMS will be recommended as a standard format through a Commission Recommendation to be discussed and hopefully adopted in 2009. The Commission Recommendation proposes that EU Member States and associated countries use the ESMS format when compiling and transmitting domain-specific reference metadata to Eurostat. While the ESMS is being implemented within Eurostat in a first step, Member States are going to make use of the standard format in a second phase. This successive implementation should improve metadata exchange and sharing, reducing redundancies and increasing comparability.
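To illustrate how free-text reference metadata organised by concept can be carried in an XML message and read back without manual intervention, the following Python sketch serialises a small report and parses it again. It is deliberately simplified: the element and attribute names (`MetadataReport`, `ReportedAttribute`, `structure`, `target`, `concept`) and all identifiers are assumptions made for this example and do not reproduce the SDMX-ML schema.

```python
import xml.etree.ElementTree as ET


def build_report(structure_id, target, values):
    """Serialise concept/value pairs into a simple XML string."""
    root = ET.Element("MetadataReport",
                      {"structure": structure_id, "target": target})
    for concept_id, text in values.items():
        item = ET.SubElement(root, "ReportedAttribute", {"concept": concept_id})
        item.text = text
    return ET.tostring(root, encoding="unicode")


def read_report(xml_text):
    """Parse the XML string back into a concept -> text dictionary."""
    root = ET.fromstring(xml_text)
    return {el.get("concept"): el.text
            for el in root.findall("ReportedAttribute")}


# Hypothetical report attached at data set level.
values = {
    "CONTACT": "Example statistical office, unit X",
    "TIMELINESS": "Data are released 60 days after the reference period.",
}
xml_text = build_report("EXAMPLE_MSD", "dataset:EXAMPLE_DATASET", values)
print(xml_text)
print(read_report(xml_text))
```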

3.2. Structural metadata: Cross-domain Code Lists and Data Structure Definitions

For structural metadata, e.g. concerning the use and implementation of the SDMX Cross-domain code lists, the situation is less straightforward. The code lists published within the COG (version 2009) contain some recommended code lists, plus the identification of additional code lists to be discussed and included within SDMX: further work needs to be done on those code lists in 2009/2010 in order to improve and complete them.

Beyond the SDMX process, Eurostat is undertaking efforts to harmonise the code lists used within the European Statistical System. This work started in 2008, when a number of harmonised ESS code lists were already released. When harmonising the ESS code lists, the greatest account will be taken of the further work on SDMX code lists in order to align both processes, achieving full consistency of results and as much synergy as possible. As an example, see the ESS code list on "Marital status" in annex 2 to this document.

The SDMX Cross-domain code lists, as well as the harmonised code lists produced within the European Statistical System and the cross-domain concepts recommended for Data Structure Definitions, also need to be used and implemented. These harmonised structural metadata should be used across the whole data life cycle, i.e. from data collection through to data production and dissemination. Therefore, Eurostat aims to incorporate SDMX cross-domain concepts and code lists, together with other ESS-harmonised code lists, whenever new data structure definitions are created in particular statistical domains. Where common data collections are concerned, DSDs are of course drawn up together with other international organisations. Furthermore, harmonised structural metadata will increasingly be used in domain-specific production databases at Eurostat, as well as in reference and dissemination databases. A big bang approach is, however, excluded in favour of a more gradual implementation across the components of the data life cycle.

4. How can national and international agencies take advantage of SDMX for metadata exchange?

Based on the substantial development of both technical standards and content guidelines, considerable progress has now been made in the implementation of advanced standards for the exchange and sharing of data and metadata, which was the original objective of SDMX. Nevertheless, with regard to metadata, the objectives of the SDMX initiative can be fully achieved only if, together with the set of standard concepts, an open architecture and common IT tools, shared arrangements supporting the exchange are also put in place. The key idea behind an "open metadata interchange" is that, through the use of a common set of statistical concepts linked to a standard terminology, a multilateral exchange of reference metadata among countries and international organisations now becomes possible.

The next working step is for international organisations to agree on a core set of reference metadata concepts, based on the ESMS, on the SDDS/DQAF and on other standards. On the basis of such an agreement, the exchange of SDMX-based reference metadata between national and international organisations can be organised in practice. Using SDMX formats and web services, each participating statistical organisation will be able to identify and retrieve, among the metadata made available, those which are relevant for its own framework. European countries, in this context, could provide metadata to several international organisations once and for all, for the same SDMX concept and for the same data set. This does not necessarily require the adoption of exactly the same statistical concepts by each agency for its own metadata system: each organisation just needs to map its own statistical concepts to the list of concepts identified in a common Metadata Structure Definition. This use of SDMX may help to achieve an immediate reduction of the burden at national level because, using common XML formats and standardised web tools, a statistical organisation would also be able to easily identify and retrieve those metadata which are relevant for its own metadata system.

The mapping between the metadata concepts used by Eurostat (ESMS), the IMF (DQAF) and the OECD (Metastore), which is also present in the COG package, supports the idea of such an international metadata exchange and sharing. The ESMS, for instance, will be used in 2009 for facilitating direct access to ESS metadata on the web ("pull" mechanism) instead of the current transmission by national agencies ("push" mechanism), thanks to the SDMX registry architecture put in place at Eurostat and to the construction of a specific web form aimed at assisting countries which are not yet in a position to generate SDMX-ML messages directly. The same principle could be used for the transfer of metadata files between Eurostat, the European Central Bank and the International Monetary Fund for the "Euro area" page of the Dissemination Standards Bulletin Board. In this framework, Eurostat and the European Central Bank could coordinate Euro area requirements and metadata flows interconnecting national metadata systems. EU countries would, in the end, provide metadata to several organisations at the same time, for the same metadata concepts, possibly using information extracted from their own metadata systems, reducing manual interventions, duplicated work and inconsistencies.
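The idea that each organisation keeps its own concepts and only maps them to the concepts of a common Metadata Structure Definition can be sketched in Python as below. Everything in the sketch is hypothetical: the agency names, the local concept names and the three common concept identifiers are invented for illustration and are not the published ESMS, DQAF or Metastore mappings.

```python
# Hypothetical identifiers of concepts in a common structure definition.
COMMON_CONCEPTS = {"CONTACT", "TIMELINESS", "COMPARABILITY"}

# Hypothetical per-agency mappings: local concept name -> common concept.
AGENCY_MAPPINGS = {
    "agency_a": {"contact_point": "CONTACT", "release_delay": "TIMELINESS"},
    "agency_b": {"contact": "CONTACT", "comparability_notes": "COMPARABILITY"},
}


def to_common(agency, local_report):
    """Re-key a local report by common concepts; list what could not be mapped."""
    mapping = AGENCY_MAPPINGS[agency]
    common, unmapped = {}, []
    for local_name, value in local_report.items():
        target = mapping.get(local_name)
        if target in COMMON_CONCEPTS:
            common[target] = value
        else:
            unmapped.append(local_name)
    return common, unmapped


report_a = {
    "contact_point": "Unit X",
    "release_delay": "60 days",
    "internal_note": "not for exchange",
}
common, unmapped = to_common("agency_a", report_a)
print(common)
print(unmapped)
```

Once reports from different agencies are keyed by the same common concepts, a receiving organisation can select the concepts relevant for its own framework without knowing each sender's internal structure.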

5. References

SDMX Content-Oriented Guidelines, 2009 (http://www.sdmx.org/index.php?page_id=11)

B.A. Lindblad, M. Pellegrino, F. Rizzo, "Registry facilities for supporting the exchange and sharing of statistical data and metadata", METIS 2008, WP.8, Luxembourg, April 2008 (http://www.unece.org/stats/documents/ece/ces/ge.40/2008/wp.8.e.)

A. Götzfried, M. Pellegrino, "Structural and reference metadata in the European Statistical System", METIS 2008, WP.10, Luxembourg, April 2008 (http://www.unece.org/stats/documents/ece/ces/ge.40/2008/wp.10.e.pdf)

Annex 1: Euro-SDMX Metadata Structure

No. | Concept / Sub-concept | Definition
1 | Contact | Individual or organisational contact points for the data or metadata, including information on how to reach the contact points.
1.1 | Contact organisation | The name of the organisation of the contact points for the data or metadata.
1.2 | Contact organisation unit | An addressable subdivision of an organization.
1.3 | Contact name | The name of the contact points for the data or metadata.
1.4 | Contact person function | The area of technical responsibility of the contact, such as "methodology", "database management" or "dissemination".
1.5 | Contact mail address | The postal address of the contact points for the data or metadata.
1.6 | Contact email address | E-mail address of the contact points for the data or metadata.
1.7 | Contact phone number | The telephone number of the contact points for the data or metadata.
1.8 | Contact fax number | Fax number of the contact points for the data or metadata.
2 | Metadata update | The date on which the metadata element was inserted or modified in the database.
2.1 | Metadata last certified | Date of the latest certification provided by the domain manager to confirm that the metadata posted are still up-to-date, even if the content has not been amended.
2.2 | Metadata last posted | Date of the latest dissemination of the metadata.
2.3 | Metadata last update | Date of last update of the content of the metadata.
3 | Statistical presentation |
3.1 | Data description | Main characteristics of the data set described in an easily understandable manner, referring to the data and indicators disseminated.
3.2 | Classification system | Arrangement or division of objects into groups based on characteristics which the objects have in common.
3.3 | Sector coverage | Main economic or other sectors covered by the statistics.
3.4 | Statistical concepts and definitions | Statistical characteristics of statistical observations.
3.5 | Statistical unit | Entity for which information is sought and for which statistics are ultimately compiled.
3.6 | Statistical population | The total membership or population or "universe" of a defined class of people, objects or events.
3.7 | Reference area | The country or geographic area to which the measured statistical phenomenon relates.
3.8 | Time coverage | The length of time for which data are available.
3.9 | Base period | The period of time used as the base of an index number, or to which a constant price series refers.
4 | Unit of measure | The unit in which the data values are measured.
5 | Reference period | The period of time or point in time to which the measured observation is intended to refer.
6 | Institutional mandate | Set of rules or other formal set of instructions assigning responsibility as well as the authority to an organisation for the collection, processing, and dissemination of statistics.
6.1 | Legal acts and other agreements | Legal acts or other formal or informal agreements that assign responsibility as well as the authority to an agency for the collection, processing, and dissemination of statistics.
6.2 | Data sharing | Arrangements or procedures for data sharing and coordination between data producing agencies.
7 | Confidentiality | A property of data indicating the extent to which their unauthorised disclosure could be prejudicial or harmful to the interest of the source or other relevant parties.
7.1 | Confidentiality - policy | Legislative measures and other formal procedures which prevent unauthorised disclosure of data that identify a person or economic entity either directly or indirectly.
7.2 | Confidentiality - data treatment | Rules applied for treating the data set to ensure statistical confidentiality and prevent unauthorised disclosure.
8 | Release policy | Rules for disseminating statistical data to interested parties.
8.1 | Release calendar | The schedule of statistical release dates.
8.2 | Release calendar access | Access to the release calendar information.
8.3 | User access | The policy for release of the data to users, the scope of dissemination (e.g. to the public, to selected users), how users are informed that the data are being released, and whether the policy determines the dissemination of statistical data to all users.
9 | Frequency of dissemination | The time interval at which the statistics are disseminated over a given time period.
10 | Dissemination format | Media by which statistical data and metadata are disseminated.
10.1 | News release | Regular or ad-hoc press releases linked to the data.
10.2 | Publications | Regular or ad-hoc publications in which the data are made available to the public.
10.3 | On-line database | Information about on-line databases in which the disseminated data can be accessed.
10.4 | Micro-data access | Information on whether micro-data are also disseminated.
10.5 | Other | References to the most important other data dissemination done.
11 | Accessibility of documentation |
11.1 | Documentation on methodology | Descriptive text and references to methodological documents available.
11.2 | Quality documentation | Documentation on procedures applied for quality management and quality assessment.
12 | Quality management | Systems and frameworks in place within an organisation to manage the quality of statistical products and processes.
12.1 | Quality assurance | Guidelines focusing on quality in general and dealing with quality of statistical programmes, including measures for ensuring the efficient use of resources.
12.2 | Quality assessment | Overall assessment of data quality, based on standard quality criteria.
13 | Relevance | The degree to which statistical information meets the real or perceived needs of clients.
13.1 | User needs | Description of users and their respective needs with respect to the statistical data.
13.2 | User satisfaction | Measures to determine user satisfaction.
13.3 | Completeness | The extent to which all statistics that are needed are available.
14 | Accuracy and reliability | Accuracy: closeness of computations or estimates to the exact or true values that the statistics were intended to measure. Reliability: closeness of the initial estimated value to the subsequent estimated value.
14.1 | Overall accuracy | Assessment of accuracy, linked to a certain data set or domain, which is summarising the various components into one single measure.
14.2 | Sampling error | That part of the difference between a population value and an estimate thereof, derived from a random sample, which is due to the fact that only a subset of the population is enumerated.
14.3 | Non-sampling error | Errors in sample estimates which cannot be attributed to sampling fluctuations.
15 | Timeliness and punctuality |
15.1 | Timeliness | Length of time between data availability and the event or phenomenon they describe.
15.2 | Punctuality | Time lag between the actual delivery of the data and the target date when it should have been delivered.
16 | Comparability | The extent to which differences between statistics can be attributed to differences between the true values of the statistical characteristics.
16.1 | Comparability - geographical | The extent to which statistics are comparable between geographical areas.
16.2 | Comparability over time | The extent to which statistics are comparable or reconcilable over time.
17 | Coherence | Adequacy of statistics to be combined in different ways and for various uses.
17.1 | Coherence - cross domain | The extent to which statistics are reconcilable with those obtained through other data sources or statistical domains.
17.2 | Coherence - internal | The extent to which statistics are consistent within a given data set.
18 | Cost and burden | Costs associated with the collection and production of a statistical product and burden on respondents.
19 | Data revision | Any change in a value of a statistic released to the public.
19.1 | Data revision - policy | Policy aimed at ensuring the transparency of disseminated data, whereby preliminary data are compiled that are later revised.
19.2 | Data revision - practice | Information on the data revision practice.
20 | Statistical processing |
20.1 | Source data | Characteristics and components of the raw statistical data used for compiling statistical aggregates.
20.2 | Frequency of data collection | Frequency with which the source data are collected.
20.3 | Data collection | Systematic process of gathering data for official statistics.
20.4 | Data validation | Process of monitoring the results of data compilation and ensuring the quality of the statistical results.
20.5 | Data compilation | Operations performed on data to derive new information according to a given set of rules.
20.6 | Adjustment | The set of procedures employed to modify statistical data to enable it to conform to national or international standards or to address data quality differences when compiling specific data sets.
21 | Comment | Supplementary descriptive text which can be attached to data or metadata.


Annex 2. Standard Code list (produced within the European Statistical System): Marital status
