SODI Implementation and support of standardised data

THE USE OF SDMX STANDARDS FOR SUPPORTING REFERENCE METADATA INTERCHANGE WITHIN THE SODI PROJECT

Last updated on 20 December 2006

SDMX Reference Metadata Interchange*

Contents

1. INTRODUCTION
2. PROJECT OBJECTIVES
3. BACKGROUND
4. A MODEL FOR METADATA EXCHANGE
5. CONTENT-ORIENTED GUIDELINES
6. REFERENCE METADATA REPORTING
7. SYSTEM ARCHITECTURE
8. STANDARDISATION OF METADATA CONCEPTS
9. SDMX METADATA INTERCHANGE

* Document prepared by Marco Pellegrino, Eurostat, Directorate B: Statistical Methods and Tools, Dissemination Unit B4, Reference Databases. Contact: [email protected]

The views expressed are those of the writer and may not in any circumstances be considered as stating an official position of the European Commission (Eurostat).

1. INTRODUCTION

This paper provides an overview of the project “SDMX-compliant Metadata for use within Eurostat's Web Site”, executed under the framework contract “Implementation and support for standardised data formats for statistical data”. The project is embedded within the SODI project1 (SDMX Open Data Interchange) sponsored by Eurostat and was discussed during the latest meeting of the SODI task force, held in Luxembourg on 13-14 November 2006. The main purpose of the project is to enable end users and metadata producers to access, analyse and reuse statistical metadata originating from multiple websites. To achieve this objective, we intend to: a) standardise the set of concepts used in collecting, processing and disseminating statistical metadata; b) use SDMX protocols to ensure the interoperability of messages, independently of the respective IT platforms, and to deliver timely information to users on the web. The data and metadata involved in this project refer to a comparatively small number of standardised data sets in the domain of Principal European Economic Indicators (PEEI, see Table 1), but the project tasks, dealing with formats and registry technologies, illustrate how a similar approach could be used to support the collection and management of standardised metadata covering more subject-matter domains.

2. PROJECT OBJECTIVES

The project intends to demonstrate how SDMX technical standards and content guidelines can support the exchange and web dissemination of an important set of reference metadata2. To realise this goal, it is necessary:
• to develop tools and standard formats needed for the creation and management of Metadata Structure Definitions (MSD) compliant with SDMX version 2.0 standards;
• to design and develop a set of tools allowing creation, transfer and management of reference metadata in SDMX-ML format, using as much as possible information already available in the existing metadata repositories;
• to design and develop a registry-based architecture for transferring reference metadata to external users and to the web-site.

3. BACKGROUND

In 2005, Eurostat launched "SDMX Open Data Interchange" (SODI) as a data sharing and exchange project within the European Statistical System. The project started with a pilot exercise involving the National Statistical Institutes of Germany, France, the Netherlands, Sweden and the United Kingdom. The statistical institutes of Denmark, Italy, Norway and Slovenia joined the pilot in 2006, while Finland and Ireland are due to join the exercise in 2007. The long-term view is to extend the results of the SODI pilots to cover any suitable statistical domain and to explore the feasibility of using SDMX as the preferred standard format for the harmonisation of statistical production systems along Eurostat's statistical life cycle. Eurostat and the EU Member States have been gradually increasing their use of standardised messages for the transmission of statistical data, and this work will continue over the next five years.

1 The SODI project focuses on the interoperability of statistics for collecting and disseminating short-term statistics, especially in the domains of the Principal European Economic Indicators (PEEI), with the overall objective of increasing timeliness and accessibility. SODI is an SDMX implementation project.

2 According to the SDMX Metadata Common Vocabulary, "reference" metadata are metadata describing the contents and the quality of the statistical data, normally including "conceptual" metadata, describing the concepts used and their practical implementation; "methodological" metadata, describing methods used for the generation of the data (e.g. sampling, collection methods, editing processes); and "quality" metadata, describing the different quality dimensions of the resulting statistics (e.g. timeliness, accuracy). These metadata are often stored in a separate metadata repository and they are referenced from the related data element.

Statistical institutes are now confronted with the challenge of providing, at the same time, clear, timely and accurate information on the data (metadata) that they disseminate through public channels. For this information to be consistent, comparable and reusable by third parties, further efforts are needed to harmonise its content and presentation, reducing redundancies and duplication of work.

4. A MODEL FOR METADATA EXCHANGE

The Open Metadata Interchange project is based on the SDMX information model and makes use of the SDMX 2.0 set of standards to facilitate the exchange of statistical information through the use of web services and mark-up languages. The SDMX information model encourages the specification of formal rules for formatting metadata, so that these can be exchanged, read and processed by computers without manual intervention. A web service, using information about the web locations of data and metadata, can then navigate, find and automatically process the information for analytical and dissemination purposes. In particular, version 2.0 of the technical standards represents a major advance over version 1, as it supports richer and more complex data/metadata structures and allows metadata to be queried across various sites in order to retrieve customised reports in a standard XML format.
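As an illustration of this query-based exchange, the sketch below shows how a client might send an SDMX-ML-style query to a registry web service and parse the XML reply. The endpoint URL, the payload structure and the dataflow identifier are illustrative assumptions, not the actual SDMX 2.0 QueryMessage schema or the project's real interface.

```python
# Minimal sketch: posting an XML query to a hypothetical metadata web service
# and reading the XML response. URL, namespaces and element names are assumed.
import urllib.request
import xml.etree.ElementTree as ET

QUERY_ENDPOINT = "https://example.org/sdmx/registry/query"  # hypothetical endpoint

query_message = """<?xml version="1.0" encoding="UTF-8"?>
<QueryMessage>
  <!-- Illustrative payload only; a real SDMX-ML 2.0 QueryMessage uses the
       official message namespaces and a richer structure. -->
  <MetadataQuery dataflow="STS_IND_PROD" provider="EUROSTAT"/>
</QueryMessage>
"""

def fetch_metadata(endpoint: str, query_xml: str) -> ET.Element:
    """Send the query and return the parsed XML root of the response."""
    request = urllib.request.Request(
        endpoint,
        data=query_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return ET.fromstring(response.read())

if __name__ == "__main__":
    root = fetch_metadata(QUERY_ENDPOINT, query_message)
    # Walk the reply and print element names and text, whatever the exact schema is.
    for element in root.iter():
        print(element.tag, (element.text or "").strip())
```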

Chart 1 - SDMX Data and Metadata Reporting3

Chart 1 depicts the essential characteristics supported in the SDMX model for data and metadata reporting. The pivot of this diagram is the Data or Metadata Flow, maintained by the organisation that collects data or metadata. A Data Flow is linked to a “Data Structure Definition” (DSD, also known as “key family” in Gesmes), while a Metadata Flow is linked to a “Metadata Structure Definition”. A "Metadata Structure Definition" (MSD) defines the allowable content of metadata and identifies the data structures to which metadata can be attached. Data or metadata may be provided by many Data Providers, and any Data Provider may report or publish data or metadata for many Data or Metadata Flows. The Provision Agreement is a way of applying constraints on the scope of the data or metadata that can be supplied: for instance, a Data Provider might supply data or metadata for a limited subset of values. The Data or Metadata Flow may also be linked to one or more topics (categories) in a subject-matter scheme (category scheme). A category scheme provides a way of classifying data for collection, reporting, or publication.
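To make the relationships in Chart 1 concrete, the following sketch models them as plain Python dataclasses. The class and field names are our own simplification of the SDMX information model, not its formal artefacts.

```python
# A rough dataclass sketch of the relationships described above (flows,
# structure definitions, providers, provision agreements, categories).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataStructureDefinition:      # "key family" in Gesmes terms
    id: str
    dimensions: List[str]
    attributes: List[str]

@dataclass
class MetadataStructureDefinition:  # allowable content of a metadata set
    id: str
    concepts: List[str]
    attachable_to: List[str]        # e.g. ["DataSet", "TimeSeries"]

@dataclass
class Dataflow:
    id: str
    structure: DataStructureDefinition
    categories: List[str] = field(default_factory=list)  # subject-matter scheme topics

@dataclass
class Metadataflow:
    id: str
    structure: MetadataStructureDefinition
    categories: List[str] = field(default_factory=list)

@dataclass
class ProvisionAgreement:
    provider: str                   # a Data Provider
    flow_id: str                    # a data flow or metadata flow
    constraints: List[str] = field(default_factory=list)  # allowed subset of values

# Example wiring: one provider reports reference metadata for one metadata flow.
msd = MetadataStructureDefinition("EUROSTAT_REF_META", ["CONTACT", "TIMELINESS"], ["DataSet"])
mdf = Metadataflow("PEEI_METADATA", msd, categories=["Short-term statistics"])
pa = ProvisionAgreement(provider="DE_NSI", flow_id=mdf.id)
print(pa)
```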

3 The chart is taken from the SDMX Version 2 package, namely from the "SDMX Implementors Guide, Version 2.0", November 2005, figure 30.

5. CONTENT-ORIENTED GUIDELINES

Technical standards are complemented with "content-oriented guidelines" aimed at establishing good practices in the use of a common terminology and in structuring data and metadata sets, so as to support exchange and encourage re-use across domains. Although content-oriented guidelines are not strictly required for conformance with the technical ISO standard, SDMX partners intend to promote the use of concepts that are common to as many statistical domains as possible. In March 2006, SDMX delivered a draft set of content-oriented guidelines4 consisting of:
• Cross-Domain Concepts: a list of metadata concepts relevant to several statistical domains, recommended for use in data and metadata exchange to promote re-usability of statistical information between organisations;
• Statistical Subject-Matter Domains: a standard scheme against which the similar domain lists of various organisations can be mapped to facilitate the exchange of data and metadata;
• Metadata Common Vocabulary (MCV5): a repository containing concepts and related definitions to which the metadata terminology used in international and national data-producing agencies may be mapped.
The SDMX group is currently reviewing the guidelines in the light of the comments received and completing the package, before proceeding to a further consultation of the respective constituencies.

Each data set or metadata set should use standard concepts and structure definitions, so that systems which exchange data and metadata can understand what the data or metadata mean. Chart 2 provides a schematic view of this multiple use of cross-domain concepts for assisting data and metadata exchange.

Chart 2 – The use of standardised cross-domain concepts for data and metadata6

4 See http://www.sdmx.org/news/document.aspx?id=146&nid=67.

5 The MCV, which includes definitions consistent with international standards and guidelines, is focused on a system of definitions for metadata concepts which can be used for any statistical domain and independently from any general model. The list of terms and associated definitions is a building block applicable across domains, playing an important role for the availability of exchangeable data descriptions.

In SDMX, cross-domain concepts can be used in three basic ways: a) as "dimensions" in the description of a data structure (e.g. reference area), with values typically taken from code lists; b) as "attributes" in the description of a data structure (e.g. "unit of measure"); c) as "attributes" in the description of a metadata structure, for example using concepts such as "contacts", "timeliness", "dissemination formats", "classification system", or "compilation practices".
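A toy sketch of these three roles follows, using plain dictionaries; the concept identifiers and code lists shown are assumptions chosen for illustration.

```python
# Toy illustration of the three roles of cross-domain concepts listed above.

# a) cross-domain concept as a coded dimension of a data structure
data_structure = {
    "dimensions": {
        "REF_AREA": ["DE", "FR", "NL", "SE", "UK"],   # reference area code list
        "FREQ": ["M", "Q"],
    },
    # b) cross-domain concept as an attribute of the data structure
    "attributes": {
        "UNIT_MEASURE": "Index, 2000=100",
    },
}

# c) cross-domain concepts as attributes of a metadata structure
metadata_structure = {
    "attributes": ["CONTACT", "TIMELINESS", "DISSEMINATION_FORMATS",
                   "CLASSIFICATION_SYSTEM", "COMPILATION_PRACTICES"],
}

print(sorted(data_structure["dimensions"]))
print(metadata_structure["attributes"])
```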

6. REFERENCE METADATA REPORTING

The core of the SDMX model of reference metadata is the concept of “Metadata Structure Definition”. The MSD defines the allowable content of a metadata set and identifies the data structures to which metadata can be attached. A Metadata Structure Definition defines:
• which metadata concepts are to be reported;
• the identity of each metadata concept (for example a code, which may simply be derived from the metadata concept scheme);
• its format and representation (textual or coded);
• its usage role (e.g. mandatory or conditional).
Reference metadata may be attached to different object types (for instance a data set, a time series, or an observation). For several reasons, this kind of metadata is often attached at a high level (data set, or even agency level), because it often refers to several or even all of the data sets. A Metadata Structure Definition identifies both the concepts for which metadata have to be reported and the object type the metadata are attached to. More than one metadata report may be attached to the objects identified in an MSD. For instance, a Release Calendar can be structured and posted separately from the main report on reference metadata, which includes all the usual elements of data content and data quality, together with a link to the external release calendar.
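The sketch below mirrors the list above in code: a simplified MSD records, for each concept, its identity, representation and usage role, and a small check flags mandatory concepts missing from a metadata report. Names and fields are assumptions, not the SDMX-ML schema.

```python
# Simplified sketch of what an MSD captures per reported concept, plus a
# minimal check of a metadata report against it (not the official schema).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ConceptSpec:
    concept_id: str        # identity, e.g. taken from the concept scheme
    representation: str    # "Text", "Date", "Code", ...
    usage: str             # "Mandatory" or "Conditional"

@dataclass
class Msd:
    msd_id: str
    attached_to: List[str]           # object types, e.g. ["DataSet"]
    concepts: List[ConceptSpec]

def missing_mandatory(msd: Msd, report: Dict[str, str]) -> List[str]:
    """Return the mandatory concepts absent from a metadata report."""
    return [c.concept_id for c in msd.concepts
            if c.usage == "Mandatory" and c.concept_id not in report]

msd = Msd(
    "PEEI_REF_META", ["DataSet"],
    [ConceptSpec("CONTACT", "Text", "Mandatory"),
     ConceptSpec("TIMELINESS", "Text", "Mandatory"),
     ConceptSpec("QUALITY_REPORTS", "Text", "Conditional")],
)
report = {"CONTACT": "Eurostat Unit B4", "QUALITY_REPORTS": "(link to quality report)"}
print(missing_mandatory(msd, report))   # -> ['TIMELINESS']
```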

7. SYSTEM ARCHITECTURE

The project "SDMX-compliant Metadata for use within Eurostat's Web Site" is based on the SODI standard architecture, which makes use of a metadata registry aimed at allowing the discovery of statistical data and metadata by interested third parties, so that the information can be interpreted and retrieved in the shortest possible time. The registry will comprise a repository of data structure definitions, metadata structure definitions and subject-matter domains, together with information on provision agreements, i.e. how data and metadata are made available by data providers. The project has been organised into four main tasks, as follows: • Task 1: Definition of the overall architecture analysis of the environment that will handle the SDMX-ML metadata; design, development and testing of the MSD tool; support for the creation of SDMX-ML schemas for selected domains. • Task 2: Analysis of metadata transfer from national and international sources and development of tools for conversion and transfer of reference metadata. • Task 3: Activities related to loading metadata into Eurostat's database and formulating queries for extracting the information. • Task 4: Design and development of a module for dispatching SDMX-compliant metadata to Eurostat’s website. The first task concerns the design, development and testing of a web service module based on the SDMX 2.0 standard format, which will be responsible for the definition and management of Metadata

6 The illustration is taken from "SDMX Content-oriented Guidelines: Cross-domain Concepts", draft March 2006, page 6.

The first task concerns the design, development and testing of a web service module, based on the SDMX 2.0 standard format, which will be responsible for the definition and management of Metadata Structure Definitions. An MSD defines the valid content of a metadata set in terms of the concepts comprising its structure, how those concepts are related through their roles in the metadata set, and the valid content of the concepts when used in a metadata set. A prototype of the MSD tool, aligned with the techniques and methods used for the development of the SDMX registry web service, will be ready for testing by the first quarter of 2007. This tool will be used to define and create XML schemas of MSDs for selected domains. The tool will be populated with the corresponding MSDs, and the corresponding XML schemas will be validated and exported using the tool’s functionalities. The web-based application able to store and manage MSDs will utilise information that resides in Eurostat's SDMX registry. This task also aims at presenting the overall architecture of the Eurostat environment that will handle SDMX reference metadata, and will serve as the primary vehicle for communication between all parties involved in the implementation of the project. An initial context diagram depicting the different software components to be implemented in the framework of the project is provided in Chart 3.

The development and testing of a web service module for editing/converting metadata into SDMX-compliant format will be the subject of Task 2. This includes a study of the possibility of transferring these metadata from national systems to Eurostat using the SDMX technical specifications. A web-based questionnaire could help countries which are not yet in an advanced state of readiness for generating SDMX-ML metadata sets.

Task 3 consists of loading SDMX-ML metadata, coming either from the SDMX metadata converter module or supplied directly in that format, into Eurostat's reference metadata base EMIS, using an SDMX-ML metadata loader web service. The tools developed under Tasks 2 and 3 will be mainly used by potential metadata providers such as National Statistical Institutes. In addition, a web service extracting metadata from EMIS via dynamic formulation of SDMX queries (Query Formulator and Extractor) will be implemented. This tool will be able to respond to well-structured SDMX-ML queries forwarded by either web services (metadata consumers) or human operators. In this way, it will also serve the exchange with other SDMX partners interested in re-using Eurostat's metadata.

Finally, an SDMX metadata publisher web service will be the interface with Eurostat’s website. This tool will be responsible for transforming the received SDMX-ML metadata sets into publishing formats (via XSLT). It will enable interaction with ordinary Internet users and will be implemented in the framework of Task 4.
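As a hint of what the publisher step (Task 4) involves, the sketch below applies an XSLT stylesheet to an SDMX-ML metadata set using the third-party lxml package. The function, file names and stylesheet are placeholders rather than the project's actual components.

```python
# Sketch of the final publishing step: transforming an SDMX-ML metadata set
# into an HTML page via XSLT. Requires the lxml package; paths are placeholders.
from lxml import etree

def publish(metadata_xml_path: str, stylesheet_path: str, output_path: str) -> None:
    """Apply an XSLT stylesheet to an SDMX-ML metadata set and save the result."""
    document = etree.parse(metadata_xml_path)
    transform = etree.XSLT(etree.parse(stylesheet_path))
    html = transform(document)
    with open(output_path, "wb") as out:
        out.write(etree.tostring(html, pretty_print=True))

# Example usage (illustrative file names):
# publish("peei_metadata_set.xml", "reference_metadata_to_html.xsl", "peei_metadata.html")
```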

Chart 3: Architectural Overview of the Software Components


8. STANDARDISATION OF METADATA CONCEPTS

The set of metadata concepts used for the main Reference Metadata Scheme, on which the project is based, includes both data content and quality assessment elements and is reported in Table 2. A distinct MSD may refer to the Advance Release Calendar and is based on the concepts of reference period, release date and time, date tolerance and release status. Table 2 makes use, as much as possible, of the set of cross-domain concepts released in the draft content-oriented guidelines published by SDMX in March 2006. In doing so, it provides a mapping of the SDMX cross-domain concepts to the reference metadata elements already produced and disseminated by Eurostat for all of its domains. In a few limited cases, it suggests additions or further detail. Although the specific concepts are presented as a flat list, some of them can be further broken down along a "presentation hierarchy". For instance, "contact" can be further specified according to the elements highlighted in the concept description (organisation name, contact name, mail address, electronic mail address, phone number, fax number). The same is true for most methodological and quality concepts. The general model of reference metadata presents a breakdown of sub-elements which is linked to the indicators retained by the European Statistics Code of Practice. This will be specified in the SDMX-ML representation.
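The sketch below illustrates the presentation-hierarchy idea for the "contact" concept: the nested sub-elements follow the list above, but the exact identifiers and the flattening convention are assumptions for illustration.

```python
# Illustration of a "presentation hierarchy": a flat concept such as CONTACT
# can carry structured sub-elements. Identifiers and values are assumed.
contact = {
    "CONTACT": {
        "ORGANISATION_NAME": "Eurostat, Unit B4",
        "CONTACT_NAME": "Reference metadata team",
        "MAIL_ADDRESS": "Luxembourg",
        "ELECTRONIC_MAIL_ADDRESS": "(e-mail address)",  # placeholder value
        "PHONE_NUMBER": None,
        "FAX_NUMBER": None,
    }
}

def flatten(tree: dict, prefix: str = "") -> dict:
    """Flatten the hierarchy back to dotted concept IDs for flat reporting."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

print(flatten(contact))
```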

9. SDMX METADATA INTERCHANGE

The objectives of the project will only be achieved by providing users and producers with a set of standard metadata concepts and definitions, together with an open architecture and common tools. The availability of technical specifications and tools, and of the SDMX registry with its repository of data and metadata structure definitions, will assist the creation of a consistent set of standard metadata structures and help to ensure close coordination and interaction with other SDMX projects and partners, and with the European Union's Member States.

The key idea behind the open metadata interchange is that, through alignment to a common set of concepts linked to a standard terminology, there is a concrete possibility of setting up, for the first time, an exchange of reference metadata among countries and international organisations which reduces redundancies and increases comparability. Alignment does not necessarily entail the direct adoption of precisely the same concepts by each agency in its internal workflow. Although such adoption would facilitate the exchange of metadata between agencies, it is sufficient for organisations to be able to map their own granular concepts (developed to meet their own needs) to the cross-domain concepts specified in the SDMX list (a schematic sketch of such a mapping is given at the end of this section). This would also facilitate direct access to metadata on the web, instead of the current transmission by national agencies of different metadata to different international organisations, and would help achieve a reduction of effort at the national level. Using a common XML format and standardised web tools, an organisation should be able to identify and retrieve those metadata which are relevant for its own framework, avoiding duplication.

The system can also be used for the transfer of metadata files between the platforms run by Eurostat and the International Monetary Fund for the new "Euro area" page of the Dissemination Standards Bulletin Board. Eurostat, as a central statistical organisation for the Euro area, in cooperation with the European Central Bank, can coordinate Euro area requirements and metadata flows, while interconnecting national metadata systems. EU countries, for their part, will be able to provide metadata to several organisations at the same time, for the same SDMX concepts, using as much as possible information extracted from their original metadata systems, thereby reducing manual interventions, double work and inconsistencies.
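The following sketch illustrates the mapping idea: an agency keeps its own granular concept names and re-keys its reports onto SDMX cross-domain concept identifiers before exchange. The national concept names on the left-hand side are invented for illustration.

```python
# Sketch of mapping an agency's own (invented) concept names onto SDMX
# cross-domain concept identifiers before exchanging a metadata report.
NATIONAL_TO_SDMX = {
    "press_contact":     "CONTACT",
    "publication_delay": "TIMELINESS",
    "revision_analysis": "ACCURACY",
    "output_channels":   "DISSEMINATION_FORMATS",
    "nace_version":      "CLASSIFICATION",
}

def to_sdmx(national_report: dict) -> dict:
    """Re-key a national metadata report with SDMX cross-domain concept IDs."""
    return {NATIONAL_TO_SDMX[k]: v for k, v in national_report.items()
            if k in NATIONAL_TO_SDMX}

print(to_sdmx({"publication_delay": "t+30 days", "nace_version": "NACE Rev. 1.1"}))
```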


Table 1 Principal European Economic Indicators List

Set 1: Price Indicators
1.1 Harmonised Consumer Price Index, MUICP flash estimate: release at the end of the reference month
1.2 Harmonised Consumer Price Index, actual indices: release 2.5 weeks after the reference month

Set 2: National Accounts Indicators
2.1 Quarterly National Accounts, flash GDP: release t+45
2.2 Quarterly National Accounts, first GDP release with breakdowns: release t+60
2.3 Quarterly National Accounts, Sector Accounts: release t+90
2.4 Quarterly Government Finance Statistics: release t+90

Set 3: Business Indicators
3.1 Industrial production index: release t+30
3.2 Industrial output price index for domestic markets: release t+35
3.3 Industrial new orders index: release t+50
3.4 Industrial import price index: release t+30
3.5 Production in construction: quarterly release t+45; monthly release t+30
3.6 Turnover index for retail trade and repair: release t+30
3.7 Turnover index for other services: release t+60
3.8 Corporate output price index for services: release t+60

Set 4: Labour Market Indicators
4.1 Unemployment rate: release t+30
4.2 Job vacancy rate: quarterly; monthly release t+30
4.3 Employment: monthly release t+30; quarterly release t+45
4.4 Labour cost index (US: Employment cost index): release t+60

Set 5: Foreign Trade Indicators
5.1 External trade balance, intra- and extra-MU, intra- and extra-EU: release t+46
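The "t+NN" notation above denotes a release target of NN days after the reference period. The small helper below computes such target dates, assuming "t" is the last day of the reference period.

```python
# Helper for the "t+NN" release targets in Table 1 (assumes "t" is the last
# day of the reference period; this interpretation is an assumption).
from datetime import date, timedelta

def release_target(period_end: date, offset_days: int) -> date:
    """Return the target release date for a given reference period end."""
    return period_end + timedelta(days=offset_days)

# Flash GDP for Q1 2006 (t+45) and industrial production for March 2006 (t+30):
print(release_target(date(2006, 3, 31), 45))   # 2006-05-15
print(release_target(date(2006, 3, 31), 30))   # 2006-04-30
```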


Table 2

Elements for a generic Metadata Structure Definition (Maintenance agency: Eurostat)

Each entry gives: Concept ID (Name of Concept) [Representation; Usage Status; Typical attachment], followed by the description derived from the Metadata Common Vocabulary.

1. CONTACT (Contact) [Text; Mandatory; Agency / Data set]
It describes contact points for the data or metadata, including how to reach the contact points: organisation name, contact name, mail address, electronic mail address, phone number, fax number.

2. METADATA_UPDATE (Metadata update *) [Date; Mandatory; Data set]
Date on which the metadata element was inserted or modified. It can be further detailed in: a) last update of content; b) last certified without update; c) last posted on web site.

3. STAT_PRESENTATION (Statistical presentation *) [Text; Mandatory; Data set]
Description of the table contents, with their data breakdowns. It should also include summary information on units of measurement, time span covered, adjustments to data (e.g. seasonal adjustments for time series) and availability of textual analysis of current-period development with the dissemination of the data.

4. FREQUENCY_PERIODICITY (Frequency and Periodicity) [Code, Text; Mandatory; Data set]
Frequency refers to the time interval between the observations of a time series. Periodicity refers to the frequency of compilation of the data (e.g. a time series could be available at annual frequency but the underlying data are compiled monthly, thus have a monthly periodicity).

5. REFERENCE_PERIOD (Reference period **) [Code, Text; Mandatory; Data set]
The time period to which a variable refers. Statistical variables refer to specific times, which may be limited to a reference time point (e.g. a specific day) or a period (e.g. a month, calendar year or fiscal year).

6. RELEASE_CALENDAR_POLICY (Release calendar policy *) [Text; Conditional; Data set]
Describes the policy regarding the release of statistics according to a preannounced schedule (if available). It may also contain a link to the release calendar information.

7. SIMULTANEOUS_RELEASE (Simultaneous release) [Text; Mandatory; Data set]
Describes the policy for release of the data to the public, how the public is informed that the data are being released, and whether the policy provides for the dissemination of statistical data to all interested parties at the same time. It also describes the policy for briefing the press in advance of the release of the data.

8. INST_FRAMEWORK (Institutional framework) [Text; Mandatory; Agency / Data set]
Refers to a law or other formal provision that assigns primary responsibility as well as the authority to an agency for the collection, processing, and dissemination of the statistics; it also includes arrangements or procedures to facilitate data sharing and coordination between data-producing agencies ("reporting requirements").

9. TRANSPARENCY (Transparency) [Text; Mandatory; Data set]
Information on the terms and conditions under which statistics are collected, processed, and disseminated. It also describes the policy of providing advance notice of major changes in methodology, the policy on internal governmental access to statistics prior to their release, and the policy on the identification of statistical products.

10. DISSEMINATION_FORMATS (Dissemination formats) [Text; Mandatory; Data set]
References to news releases, publications, on-line databases and other dissemination media.

11. QUALITY_REPORTS (Related quality reports **) [Text; Conditional; Data set]
References and link to available external quality reports for the data.


12. TIMELINESS (Timeliness and punctuality) [Text; Mandatory; Data set]
Timeliness refers to the speed of dissemination of the data, i.e. the lapse of time between the end of a reference period (or a reference date) and dissemination of the data. It reflects many factors, including some that are related to institutional arrangements, such as the preparation of accompanying commentary and printing. Punctuality refers to the possible time lag existing between the actual delivery date of data and the target date when it should have been delivered, for instance with reference to dates announced in some official release calendar or previously agreed among partners.

13. COMPARABILITY (Comparability and coherence) [Text; Mandatory; Data set]
The extent to which differences between statistics from different geographical areas, non-geographical domains, or over time, can be attributed to differences between the true values of the statistics. Comparability is closely associated with "coherence", which is the adequacy of statistics to be reliably combined in different ways and for various uses: the use of standard concepts, classifications and target populations promotes coherence, as does the use of common methodology across surveys.

14. ACCURACY (Accuracy and reliability *) [Text; Mandatory; Data set]
The accuracy of statistical information is the degree to which the information correctly describes the phenomena it was designed to measure. Reliability is the closeness of the initial estimated value(s) to the subsequent estimated value(s). Accuracy refers to the provision of either measures of accuracy or precision (numerical results of the methods/processes for assessing the accuracy or precision of data) or qualitative assessment indicators. It may also be described in terms of the major sources of error that potentially cause inaccuracy (e.g. coverage, sampling, non-response, response). It includes providing the results of the assessment of source data for coverage, sampling error, response error and non-sampling error.

15. STAT_CONCEPTS (Statistical concepts) [Text; Mandatory; Data set]
"Concepts and definitions" refer to the internationally accepted statistical standards, guidelines, or good practices on which the concepts and definitions used for compiling the statistics are based. It also refers to the description of deviations of the concepts and definitions from accepted statistical standards, guidelines, or good practices, when relevant. This should define the statistical concept under measurement and the organisation of the data, i.e. the type of variables included in the domain of study.

16. CLASSIFICATION (Classification systems) [Text; Mandatory; Data set]
Classification systems refer to a description of the classification systems being used and how they conform with internationally accepted standards, guidelines, or good practices. It also refers to the description of deviations of classification systems compared to accepted statistical standards, guidelines, or good practices, when relevant. The structure of a classification can be either hierarchical or flat. Hierarchical classifications range from the broadest level (e.g. division) to the detailed level (e.g. class). Flat classifications (e.g. sex classification) are not hierarchical.

17. SCOPE_COVERAGE (Scope and coverage) [Text; Mandatory; Data set]
Scope/coverage describes the coverage of the statistics and how consistent this is with internationally accepted standards, guidelines, or good practices. The scope/coverage includes a description of the target population, and geographic, sector, institutional, item, population, product, and other coverage.


18. ACCOUNTING_CONVENTIONS (Accounting conventions *) [Text; Mandatory; Data set]
The practical aspects and conventions used when compiling data from diverse sources under a common methodological framework. It may refer to descriptions of the types of prices used to value flows and stocks, or of other units of measurement used for recording the phenomena being observed; the time of recording of the flows and stocks, or the time of recording of other phenomena that are measured; and the grossing/netting procedures that are used.

19. SOURCE_DATA (Source data) [Text; Mandatory; Data set]
Description of the data collection programs and their adequacy for the production of statistics, including meeting the requirements for methodological frameworks, scope, classification systems, and basis for recording.

20. STAT_PROCESSING (Statistical processing) [Text; Mandatory; Data set]
The processes for manipulating or classifying statistical data into various categories with the object of producing statistics. It refers to a description of the data compilation and other statistical procedures to deal with intermediate data and statistical outputs (e.g. data adjustments and transformation, and statistical analysis). The items covered include, inter alia, weighting schemes, methods for imputing missing values or source data, statistical adjustments, balancing/cross-checking techniques and relevant characteristics of the specific approach applied.

21. DATA_VALIDATION (Data validation *) [Text; Mandatory; Data set]
Validation describes methods and processes for routinely assessing source data (including censuses, sample surveys, and administrative records) and how the results of the assessments are monitored and made available to guide statistical processes. It also describes how intermediate results are validated against other information where applicable, how statistical discrepancies in intermediate data are assessed and investigated, and how statistical discrepancies and other potential indicators of problems in statistical outputs are investigated. All the controls made on the quality of the data to be published or already published are included in the validation process. Validation also includes the results of studies and analyses of revisions and how they are used to inform the statistical processes. In this process, two dimensions can be distinguished: (i) validation before publication of the figures and (ii) validation after publication.

22. ANNOTATIONS (Annotations **) [Text; Conditional; Data set]
Special warnings and footnotes (e.g. temporary warnings or re-use precautions).

23. RELEVANCE (Relevance) [Text; Conditional; Agency]
Refers to the processes for monitoring the relevance and practical utility of existing statistics in meeting users’ needs and how these processes inform the development of statistical programs. Relevance is concerned with whether the available information sheds light on the issues that are important to users.

24. QUALITY_MANAGEMENT (Quality management) [Text; Conditional; Agency]
Refers to the processes in place to focus on quality, to monitor the quality of the statistical programs, and to deal with quality considerations in planning the statistical programs. It also includes how well the resources meet the requirements of the program, and measures to ensure the efficient use of resources (staffing, facilities, computing resources, and financing of statistical programs).

* The concept has been either renamed or further specified, if compared with the March 2006 draft of content-oriented guidelines.
** The concept is not included in the March 2006 draft of content-oriented guidelines.