The OECD DELTA Project – making OECD statistical data open

Trevor Fletcher OECD, e-mail: [email protected]

Abstract

The OECD is currently undertaking a project (‘DELTA’) with the aim of making its statistical data open, accessible and free. In the context of this project, ‘open’ means that data content is machine-readable, retrievable, indexable and re-usable. This paper provides background to the DELTA project and describes the steps being taken to implement an Application Programming Interface (API) that provides machine-to-machine access to the OECD statistical data warehouse “OECD.Stat” in a number of formats, along with the challenges involved in standardising the statistical content of the 800+ datasets. The paper also describes the steps being put in place to encourage re-use of OECD data, and re-use by the OECD of external innovation, through an open innovation process and community. Finally, the paper examines links to the related DELTA work streams that make data accessible and free, and links to the Knowledge and Information Management (KIM) Ontology Management and Semantic Infrastructure project via linked data.

Keywords: OECD.Stat, API, Web Services

1. Background to the OECD DELTA Project

Statistics are of strategic importance to the OECD, both as an input for internal analysis and as a product for dissemination to a wider audience in their own right. Taking into account the importance of statistics as a publication output, a review of the OECD Publishing Policy was carried out during 2011. Among other issues, the review covered the following points related to the dissemination of statistics:
• An assessment of OECD’s publishing policy benchmarked against that of other International Organisations
• Technologies used in disseminating publications and data
• An assessment of the quality of publications and the services provided
• The overall accessibility of OECD datasets

Following the report of the Publishing Policy review, a number of recommendations were proposed to make OECD statistics “open, accessible and free”. The OECD Council welcomed this proposal, and as a result the DELTA project was initiated to implement these aims.

The DELTA Project was given the objective of delivering on the Council decisions to “increase dissemination and usage and to maintain a viable, sustainable cost recovery model”. Subsequently, 3 project work streams were established: “Open”, “Accessible” and “Free”. These can be defined as follows:
• Accessible: Develop a new, more user-friendly statistical portal, “data.oecd.org”, as a single, central gateway to access all OECD statistical data.
• Open: Provide open data services for OECD statistical data.
• Free: Make all data freely available.

This paper describes the “Open” work stream.

2. DELTA Project – Open Data

Openness is one of the key values that guide the OECD vision for a stronger, cleaner and fairer economy. Making data open is an important part of this, and to this end a number of open benchmarks have been defined in the project as follows:
• Completeness – content should include data, metadata, sources and methods.
• Primacy – datasets should be primary and not aggregated, and should include details on how the data was collected.
• Timeliness – data should be automatically available in trusted third-party repositories upon publication.
• Ease of access – data is made available via a simple Application Programming Interface (API).
• Machine Readability – data and metadata are provided in a machine-readable standard, together with documentation.
• Non-discrimination – no special permissions are required to access data.
• Use of common standards – stored data can be accessed without a special software license.
• Licensing – Creative Commons CC-BY (Licensees may copy, distribute, display and perform the work and make derivative works based on it only if they give the author or licensor the credits in the manner specified by these).
• Permanence – information made available remains online, with archiving over time together with a notification mechanism.
• Usage costs – free.

2.1 Open Data Project goals

Today, data can be extracted only via downloads from OECD.Stat. The Open Data Web Services (ODWS) will make the data directly available to other web sites for creating custom data visualisations, live combinations with other data sources, etc.

The goals of the Open Data project are: to make OECD data machine-readable, retrievable, indexable and re-usable; to increase the dissemination and impact of OECD data via open data services for its statistical data; and to encourage re-use of OECD data by external innovation communities.

In terms of deadlines, the first version of the Open Data project providing a sub-set of OECD data should be available by Q1 2014. The full DELTA version of Open Data will be available by Q2 2015.

The DELTA Open Project has 3 main deliverables: i) a full set of “Open-ready” data and metadata; ii) a set of Open Data Web Services; and iii) an interface for managing the OECD Open Innovation Community.

2.2 “Open-Ready” Data and metadata

For data to be considered “Open-ready”, the existing data and metadata content of the OECD corporate data warehouse OECD.Stat will be required to meet certain criteria of structure and content necessary for machine-to-machine access. The criteria for Open-readiness include:
• OECD.Stat data content must include units of measure and scale which are structured and consistent across datasets.
• Variable labels must be sufficiently clear that they are self-explanatory when taken out of the context of the full dataset.
• All datasets must have structured statistical metadata to help users understand the data.
• Data should be stored at unit level (actual values).
• Common dimensions (Time, Country etc.) should be used.
• Common dimension names (e.g. “Subject”) should be used.

To achieve this, data owners will carry out a self-assessment of all OECD.Stat data content to gauge the state of open-readiness for each dataset. This will involve analysing the metadata content according to the criteria.
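The self-assessment described above could be partly automated. The following is a minimal sketch only, assuming a hypothetical `DatasetInfo` record whose field names are illustrative and not taken from the OECD.Stat schema. It also assumes that Time and Country are the mandatory common dimensions. The clarity of variable labels is left to human review, since it cannot be checked mechanically.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record describing one dataset; field names are illustrative
# and are not taken from the actual OECD.Stat schema.
@dataclass
class DatasetInfo:
    name: str
    dimensions: List[str]
    unit_of_measure: Optional[str] = None
    scale: Optional[str] = None
    has_structured_metadata: bool = False
    stored_at_unit_level: bool = False

# Assumption: Time and Country are treated as the mandatory common dimensions.
REQUIRED_DIMENSIONS = {"Time", "Country"}

def assess_open_readiness(ds: DatasetInfo) -> List[str]:
    """Return the criteria the dataset fails; an empty list means open-ready."""
    failures = []
    if not ds.unit_of_measure or not ds.scale:
        failures.append("missing unit of measure or scale")
    if not ds.has_structured_metadata:
        failures.append("no structured statistical metadata")
    if not ds.stored_at_unit_level:
        failures.append("data not stored at unit level")
    missing = sorted(REQUIRED_DIMENSIONS - set(ds.dimensions))
    if missing:
        failures.append("missing common dimension(s): " + ", ".join(missing))
    return failures

# Made-up example: scale is not set and the Time dimension is absent.
example = DatasetInfo(
    name="Example dataset",
    dimensions=["Country", "Subject"],
    unit_of_measure="Percentage",
    has_structured_metadata=True,
    stored_at_unit_level=True,
)
for problem in assess_open_readiness(example):
    print(problem)
```

Returning a list of failed criteria, rather than a single yes/no flag, gives data owners a concrete list of what remains to be fixed for each dataset.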

2.3 Open Data Web Services (ODWS)

In parallel to the data assessment exercise, the Open Data Web Services will be developed. This will involve building a set of Web Services to provide machine-to-machine access to OECD.Stat data via a number of formats.

In addition, an interface to manage user registration and access to the Web Services will be designed and developed as part of the data.oecd.org statistics portal.

Developing the Web Services will involve defining technical standards for machine-readable data that meet the needs of both expert and non-expert audiences. Application Programming Interfaces (APIs) will be developed to make the data and metadata in OECD.Stat available to systems outside the organisation via a number of formats. These services will be made available to the public via an Open Data Interface (ODI), together with the necessary technical and other documentation for developers.

2.4 Open Data formats

Data and metadata will be made available to external users in as many output formats as possible to maximise data access. The project will start with formats including: SDMX/JSON, RESTful API, OData, XLS and CSV. Additional formats will be added as needed over time. These formats have been chosen for the reasons described below.

a) Excel/CSV

Excel and CSV are already widely used exchange formats, so including them as output formats was a fairly obvious decision.

b) SDMX/JSON

JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange and has become one of the most popular open data formats used on web sites today. JSON has a number of advantages, including:
• Simplicity - JSON is a simple and ‘lightweight’ format with a smaller grammar, and it maps directly onto the data structures used in modern programming languages.
• Interoperability - JSON has the same interoperability potential as XML.
• Openness - JSON has the same open capabilities as XML.
• Readability - JSON is much easier for humans to read than XML. It is also easier to write, and easier for machines to read and write.
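The point about JSON mapping directly onto programming-language data structures can be illustrated with a short sketch. The observations below are made up for illustration and are not OECD figures. The field names (`country`, `time`, `value`) are assumptions, not the structure of any actual OECD output.

```python
import csv
import io
import json

# Made-up observations for illustration only; not real OECD figures.
observations = [
    {"country": "AAA", "time": "2012", "value": 1.5},
    {"country": "BBB", "time": "2012", "value": 2.0},
]

# JSON maps directly onto lists and dictionaries and keeps numeric types.
json_text = json.dumps(observations)
from_json = json.loads(json_text)

# CSV carries the same rows, but every field is read back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "time", "value"])
writer.writeheader()
writer.writerows(observations)
from_csv = list(csv.DictReader(io.StringIO(buffer.getvalue())))

print(type(from_json[0]["value"]).__name__)  # numeric type survives in JSON
print(type(from_csv[0]["value"]).__name__)   # CSV yields text to be re-parsed
```

This is one reason both formats are worth offering: CSV suits spreadsheet users, while JSON spares developers a type-conversion step.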

The Statistical Data and Metadata eXchange standard (SDMX) provides a standard model for statistical data and metadata exchange between national agencies and international agencies, within national statistical systems and within organisations. OECD is a member of the SDMX Sponsor Group (together with the Bank for International Settlements, European Central Bank, Eurostat, International Monetary Fund, United Nations Statistics Division and World Bank). SDMX data extracts from OECD.Stat are already provided via a web service; this will be adapted as an API using the SDMX compact version.

c) Open Data Protocol (OData)

OData is an open protocol for sharing data.
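As a rough illustration of what an OData request looks like, the sketch below builds a query URL using the standard OData system query options `$filter`, `$select` and `$top`. The service root, the entity set name `Observations` and the property names are hypothetical, since the paper does not specify the actual endpoint or data model.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root; the paper does not give the actual endpoint.
SERVICE_ROOT = "https://example.org/odata"

def build_odata_url(entity_set, filter_expr=None, select=None, top=None):
    """Build an OData query URL from standard system query options."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr
    if select:
        params["$select"] = ",".join(select)
    if top is not None:
        params["$top"] = str(top)
    url = SERVICE_ROOT + "/" + entity_set
    if params:
        # Keep "$", "," and "'" readable; encode spaces as %20.
        url += "?" + urlencode(params, quote_via=quote, safe="$,'")
    return url

# "Observations", "Country", "Time" and "Value" are illustrative names.
url = build_odata_url(
    "Observations",
    filter_expr="Country eq 'FRA'",
    select=["Country", "Time", "Value"],
    top=10,
)
print(url)
```

Because filtering and projection are expressed in the URL itself, a client can retrieve only the slice of a dataset it needs rather than downloading the whole file.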

Future formats could include Google Data (a REST-inspired technology), Google Dataset Publishing Language (DSPL) or Google KML, a geospatial file format.

An interface will be developed to allow users to access the Open Data Web Services and associated technical documentation and to register their intention to use the data and sign a Terms of Agreement.

3. Linked Data and the OECD KIM project

The OECD Knowledge and Information Management (KIM) project has been established to integrate information and centralise access to all OECD content (corporate content management, record management, authoring, etc.). KIM was launched in parallel to the DELTA project and is concerned with developing semantic enrichment and centralised taxonomy support for linked data.

A long-term goal of the project is to create linked data sources with the Resource Description Framework (RDF), using existing vocabularies to map data to related subjects and generating a collection of “triples” (each consisting of a subject, a predicate and an object) held in a “triple-store”. Each component of the triple has a Uniform Resource Identifier (URI), enabling data to be linked to related sources.
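The triple model can be sketched in a few lines of plain Python. This is an illustration of the subject-predicate-object idea only, not an RDF implementation. All URIs are hypothetical placeholders under example.org, and the vocabulary terms are invented for the example.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# All URIs below are hypothetical placeholders, not real OECD identifiers.
BASE = "http://example.org/"
triple_store = {
    Triple(BASE + "dataset/unemployment", BASE + "vocab/subject", BASE + "topic/labour"),
    Triple(BASE + "dataset/unemployment", BASE + "vocab/publisher", BASE + "org/oecd"),
    Triple(BASE + "dataset/gdp", BASE + "vocab/publisher", BASE + "org/oecd"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )

# Find every resource whose publisher is the (hypothetical) OECD URI.
published = match(triple_store, predicate=BASE + "vocab/publisher", obj=BASE + "org/oecd")
for t in published:
    print(t.subject)
```

Because every component is a URI, a triple in one store can point at a resource described in another, which is what allows data to be linked across sources.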

Creating a triple-store from the OECD.Stat data warehouse will be a huge task, and work investigating the possibilities has only recently started (at the time of writing the tools have not yet been selected). The long-term goal is to conform to the Tim Berners-Lee “5 star” level of open data, where the various levels of openness are described below:
• No star - data is not available under an open licence, even if it is available on-line.
• One star - data is accessible on the Web. It is readable by the human eye, but not by a software agent (PDF).
• Two stars - data is accessible on the Web in a structured, machine-readable format (Excel).
• Three stars - users do not need proprietary software. Re-users can manipulate the data in any way, without being confined to a particular software producer (CSV).
• Four stars - data is now in the Web as opposed to on the Web, through the use of a unique URI allowing for bookmarking and linking (RDF with URIs).
• Five stars - data is linked to other data, fully exploiting its network effects, and is interconnected so its value increases exponentially. Data is discoverable from other sources and is given a context, e.g. through links to Wikipedia (RDF with URIs with semantic properties).

The vision of the Semantic Web is to extend the principles of the Web from documents to data. Data should be accessed using the general Web architecture (e.g. URIs), and data should be related to one another just as documents (or portions of documents) already are. This also means creating a common framework that allows data to be shared and reused across application, enterprise and community boundaries, and to be processed automatically by tools as well as manually, including revealing possible new relationships among pieces of data.
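The star levels listed above can be read as a cumulative checklist, where each level requires all the previous ones. The following sketch encodes that reading. The `Publication` record and its field names are hypothetical, and the cumulative interpretation is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical description of how a dataset is published.
@dataclass
class Publication:
    open_licence: bool
    machine_readable: bool = False        # e.g. Excel rather than PDF
    non_proprietary_format: bool = False  # e.g. CSV rather than Excel
    uses_uris: bool = False               # e.g. RDF with URIs
    linked_to_other_data: bool = False    # e.g. links to external sources

def star_rating(p: Publication) -> int:
    """Return 0 to 5 stars, assuming each level requires the previous ones."""
    if not p.open_licence:
        return 0
    stars = 1
    for criterion in (
        p.machine_readable,
        p.non_proprietary_format,
        p.uses_uris,
        p.linked_to_other_data,
    ):
        if not criterion:
            break
        stars += 1
    return stars

pdf_release = Publication(open_licence=True)
csv_release = Publication(open_licence=True, machine_readable=True,
                          non_proprietary_format=True)
print(star_rating(pdf_release), star_rating(csv_release))
```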

4. The OECD Open Innovation Community

The Open Innovation Community (OIC) deliverable involves designing, building and maintaining an interface for managing OIC content. The interface will provide the following:
• Information describing the open platform
• Registration services
• Examples of products developed using the open platform
• Open Services available, with associated technical documentation
• OIC Blog
• FAQ

5. Conclusions

The OECD is currently engaged in a major project to improve access to its statistical data and metadata, with the overall goal of making them more findable, usable and open. This involves collaboration among statistical experts, database managers, IT, publications and communications staff. The outcome will represent the next generation of data dissemination and sharing, building on the work of the last 10 years to develop the OECD corporate data warehouse OECD.Stat.