ESFRI / E-IRG Report on Data Management, January 2010

e-IRG Report on Data Management Data Management Task Force November 2009 1 1 EXECUTIVE SUMMARY A fundamental paradigm shift known as Data Intensive Science is quietly changing the way science and research in most disciplines is being conducted. While the unprecedented capacities of new research instruments and the massive computing capacities needed to handle their outputs occupy the headlines, the growing importance and changing role of data is rarely noticed. Indeed it seems the only hints to this ever burgeoning issue are mentions of heights of hypothetical stacks of DVDs when illustrating massive amounts of ”raw, passive fuel” for science. However, a shift from a more traditional methodology to Data Intensive Science – also sometimes recognized as the 4th Research Paradigm – is happening in most scientific areas and making data an active component in the process. This shift is also subtly changing how most research is planned, conducted, communicated and evaluated. This new paradigm is based on access and analysis of large amounts of new and existing data. This data can be the result of work of multiple groups of researchers, working concurrently or independently without any partnership to the researchers that originally gathered the information. Use of data by unknown parties for purposes that were not initially anticipated creates a number of new challenges related to overall data management. Long-term storage, curation and certification of the data are just the tip of the iceberg. So called Digital Data Deluge, for example, caused by the ease with which large quantities of new data can be created, becomes much more difficult to deal with in this new environment. Large amounts of data are created not only by state-of-the-art scientific instruments and computers, but also by processing and collating existing archived data. These challenges have been discussed in length during the 2007-2008 e-IRG workshops, and particular focus has been put on data initiatives linked with the new research infrastructures identified by ESFRI. The e-IRG del- egates recognised the importance of data management for the future of research infrastructures and, as a result, established the e-IRG Data Management Task Force (DMTF) that received recognition and support also from ES- FRI. The main objectives of DMTF were defined as producing an analysis of issues regarding data management in a coherent and flexible way, and a set of recommendations for the present and future research infrastructures. DMTF, following the e-IRG instructions, put together a large group of experts in data management who have prepared the present report. The following report is divided into three parts: a survey of existing data management initiatives, metadata and quality, and interoperability issues in data management. The report ends with conclusions of the study and a set of proposed recommendations for further analysis and discussion by the e-IRG. The findings will also be presented to the e-IRG and to ESFRI to create a final set of recommendations endorsed by the two bodies. This work is part of a continuous analysis of the scientific data situation, started initially with various contributions from the European Commission (communication published in February 2007, OECD position about scientific data status, published in xx 2007 and ESFRI position paper about Digital Repositories in September 2007). At the same time the DMTF was elaborating the present contribution, the European Commission issued also a quite am- bitious strategic paper about the increasing role of e-Science. All together, these various contributions will help defining the needs and the terms of a Global Data Infrastructure, relevant to participate to the Lisbon strategy in building the European Research Area. The present report addresses the following points in further detail: 1) Existing data management initiatives includes an initial inventory of known data management initiatives from operational initiatives to future projects. This survey analyses the opportunities, synergies and gaps presented by these initiatives or clusters of similar initiatives and their potential impact. The survey is divided into three main fields of science: arts and humanities – social sciences; health sciences; and natural sciences and engineering. For each scientific field a large number of initiatives is analysed regarding e.g. data type, standards used or proposed, data curation, quality control, and best practices for preservation. Challenges and needs relevant to various user communities are also identified. The analysis of 18 social science, 12 health sciences and 33 natural sciences and engineering initiatives gives a global view of most of the data initiatives in Europe. Achieving this comprehensive coverage necessitated a trade-off in the form of leaving a detailed study of collaboration opportunities and synergies (both between e-IRG and the initiatives and between initiatives) to be accomplished by future initiatives. Furthermore, due to the very diverse and rapidly evolving nature of data initiatives and the links between them, the study indicates that e-IRG should carefully focus its efforts to the most promising cases when establish contacts or actively facilitating collaboration between the initiatives and communities. This is necessary in order to maximise 2 the impact that can be achieved with reasonable resource investments. 2) Metadata and quality covers the basic principles and requirements for metadata descriptions and the quality of the resources to be stored in accessible repositories. The principles and requirements specified are considered to be baselines for all research infrastructures, and as such are independent of the scientific research field. Key findings of this part of the report focus on metadata flexibility to allow for addition of new elements, for using different types of selections and for the possibility of using elements of different sets and re-using existing elements/sets. Metadata topics such as usage, scope, provenance, persistence, aggregation, standardisation, interoperability, quality, earliness and availability are discussed in detail. Quality of data resources is also covered in the context of sharing data and quality assurance, assessing the quality of research data, and data consumers. 3) Interoperability issues in data management are very important to ensure that scientific data is reachable and useful to other scientific fields, i.e. to enable cross-disciplinary Data Intensive Science. The same data is often relevant to researchers in different scientific fields, but it is not always obvious that researchers can access and use data outside their own domain with the same ease and efficiency as discipline-based solutions. Thus interoperability is fundamental for allowing flexible and coherent cross-disciplinary data access. At the moment interoperability-related activities are mainly contained within individual communities, but, with the advent of e- Science, data interoperability needs to be extended to groups of different communities. In addition to providing details of these opportunities and challenges, this part of the document presents several levels and types of interoperability: resource-level operability, general semantic operability and syntactic versus semantic interoperability. The different layers, ranging from device level to communications, middleware and deployment of resource interoperability are analysed in detail. Semantic interoperability is also discussed in terms of data integration, ontology support, simplicity, transcoding and metadata, representation information, conceptual modelling, and distributed systems. Some use cases in the medical field, linguistics, e-humanities ecosystem, earth sciences, astronomy and space science, and particle physics are presented and, in each instance, use cases solutions, tendencies and needs are identified and put forward. This report is obviously not intended as the final word in the area of data management, rather it aims to put together several starting points to encourage future efforts in this domain. The participating authors sincerely hope that interested parties will take on board the findings of this document and craft them into a group of concrete, well-aligned initiatives, which fulfil the promises of Data Intensive Research. The authors also wish that the e-IRG and ESFRI, when drafting recommendations based on these findings will, for their part, ensure that these initiatives have clear and efficient contact points with e-Infrastructure policy makers in the future! 2 SUMMARY OF RECOMMENDATIONS All recommendations and proposed guidelines are embedded in the full texts of the various annexes. However a short synthesis of them is summarized as follows: 2.1 METADATA R1: Usage Providing metadata describing any kind of research resources and services is an urgent requirement for service providers and resource repositories. R2: Scope There is an increasing pressure for disciplines to agree on a set of semantically specific enough elements that allows researchers to describe their services and resources. R3: Provenance Descriptive metadata should include or refer to provenance information to support long-term preservation and further processing. R4: Persistence Metadata descriptions need to be persistent, to be identified by persistent identifiers and also to refer to the resources and services they represent by using persistent identifiers. R5: Aggregations Descriptive metadata have an enormous potential to describe various

ESFRI / E-IRG Report on Data Management, January 2010

Proceedings of the Workshop on Multilingual Language Resources and Interoperability, Pages 1–8, Sydney, July 2006

Part 2: Data Category Selection (DCS) for Electronic Terminological Resources (ETR)

A Multi-View Hyperlexicon Resource for Speech and Language System Development

Iso 24613:2008(E)

Tomaž Erjavec: Platforms and Standards for Data Sharing

Accessibility of Multilingual Terminological Resources - Current Problems and Prospects for the Future

Terminology for Translators — an Implementation of ISO 12620 Robert Bononno

Corpus Query Lingua Franca Part II: Ontology

Janne Bondi Johannessen (Ed.)

Of Parallèles

ISO 12616 Translation-Oriented Terminography

Polnet 3.0 Zygmunt Vetulani, Grażyna Vetulani, Bartłomiej Kochanowski Adam Mickiewicz University in Poznań Ul