Scientific Data Application Profile Scoping Study Report
Total Page:16
File Type:pdf, Size:1020Kb
Scientific Data Application Profile Scoping Study Report Document details Author: Alexander Ball, UKOLN, University of Bath Date: 3rd June 2009 Version: 1.1 Document Name: sdapss.pdf Notes: Changes from version 1.0: Typographical corrections made. References added. Conclusions expanded. Acknowledgements Acknowledgement to contributors The author would like to thank the various people who contributed to the report by completing an interview or through e-mail correspondence, in particular: Linda Ault (BGS), Tim Booth (NEBC), Helen Campbell (BAS), Talat Chaudhri (UKOLN), Harry Dixon (National River Flow Archive), Steve Donegan (STFC), Monica Duke (UKOLN), Jane Greenberg (UNC Chapel Hill), Justin Hayes (MIMAS), Rachel Heaven (BGS), Stuart Jeffrey (ADS), Roy Lowry (BODC), Brian Matthews (STFC), Ken Miller (UKDA), Pam Miller (UKDA), Cameron Neylon (STFC), Manjula Patel (UKOLN), James Reid (EDINA), Robin Rice (EDINA), Belinda Robinson (BADC), Pauline Simpson (Central Caribbean Marine Institute), Humphrey Southall (University of Portsmouth), Mark Thorley (NERC), Matt Wood (Sanger Institute/Mekentosj Inc.), and Weimin Zhu (European Bioinformatics Institute). The author takes responsibility for interpreting the answers and for any change of emphasis that comes with collating the viewpoints of the various contributors. Acknowledgement to funders This work was funded by the JISC as part of the Repositories and Preservation Programme. UKOLN is funded by the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils and the Museums, Libraries and Archives Council (MLA), as well as by project funding from the JISC, the Research Councils and the European Union. UKOLN also receives support from the University of Bath where it is based. ii Contents 1 Introduction 1 1.1 Background . 1 1.2 Aims and Objectives . 1 1.3 Scope . 2 1.4 Methodology . 2 1.5 Outline............................................. 3 2 Repository Landscape 4 2.1 NERC funded data centres . 4 2.2 ESRC and AHRC funded data centres . 7 2.3 European Bioinformatics Institute . 7 2.4 STFC e-Science Centre . 8 2.5 Data repository projects . 8 2.6 Other projects . 10 3 Use Cases for a Scientific Data Application Profile 11 3.1 Broad enquiry . 11 3.2 Specific follow-up work . 12 3.3 Detailed enquiry . 12 3.4 Data mining . 13 3.5 Aggregator functions . 13 3.6 Discussion . 14 4 Domain Models for Scientific Data 17 4.1 DDI............................................... 17 4.2 MIDAS Heritage Standard . 18 4.3 MOLES Data Model . 19 4.4 CCLRC Scientific Metadata Model . 20 4.5 ISO 2146 . 20 4.6 FRBR . 22 4.7 Comparisons . 22 4.8 Alternative metadata models . 24 5 Existing metadata standards and profiles 26 5.1 Identification . 26 5.2 Responsibility . 27 5.3 Archiving . 28 5.4 Topical Coverage and Derivation . 28 5.5 Spatiotemporal Coverage . 29 6 Further Discussion 31 iii 6.1 Implementation . 31 6.2 Relationship with other application profiles . 32 7 Conclusions 33 7.1 Alternatives to constructing a Scientific Data Application Profile . 34 7.2 Development effort required . 35 7.3 Community uptake strategy . 35 A Presentation of metadata schemes 37 B Directory Interchange Format metadata standard 38 C Data Documentation Initiative metadata standard 43 D MIDAS Heritage metadata standard 47 E MOLES metadata profile 56 F National Geoscience Data Centre metadata profile 62 G UK GEMINI metadata profile 66 H SeaDataNet EDMED metadata profile 69 I Cruise Summary Report metadata standard 71 J NERC Environmental Bioinformatics Centre metadata profile 72 K UK Data Archive metadata profile 74 L CCLRC Scientific Metadata Model 76 M AGMAP metadata profile 80 N eBank UK metadata profile 84 O Edinburgh DataShare metadata profile 87 P Data Audit Framework metadata profile 88 Bibliography 90 iv Chapter 1 Introduction This study was undertaken by UKOLN on behalf of the Joint Information Systems Committee (JISC). 1.1 Background The background for this study as provided in the brief from the JISC is as follows. Application profiles are metadata schemata which consist of data elements drawn from one or more namespaces, optimized for a particular local application [HP00]. They offer a way for particular communities to base the interoperability specifications they create and use for their digital material on established open standards. This offers the potential for digital materials to be accessed, used and curated effectively both within and beyond the communities in which they were created. The JISC has recently recognized that there is a requirement to undertake a scoping study to investigate metadata application profile requirements for scientific data1 in relation to digital repositories, and specifically concerning descriptive metadata to support resource discovery and other functions such as preservation. This follows the development of the Scholarly Works Application Profile (SWAP) undertaken within the JISC Digital Repositories Programme and led by Andy Powell (Eduserv Foundation) and Julie Allinson (RRT UKOLN) on behalf of the JISC [AP08]. Application profiles for images (AHDS – Visual Arts), time based media (Manchester) and geospatial data (EDINA) are also being commissioned. Arguably, scientific data encompass a much wider range of resource types and are far more complex than other kinds of material, and so this scoping requires the consultants to explore whether harmonization around an application profile to improve resource discovery and reuse of scientific (research) data in the repository landscape can be achieved or is even desirable. The consultants will make recommendations for the repository landscape. 1.2 Aims and Objectives As stated in the brief from the JISC, the aims and objectives of the study are: 1. to assess whether a single metadata AP for research data, or a small number thereof, would improve resource discovery or discovery-to-delivery in any useful or significant way; 2. if so, then to: a) assess whether the development of such AP(s) is practical and if so, how much effort it would take; 1. By ‘scientific data’ is meant the evidence base on which academic researchers build their analytic or other work, where this evidence base is typically gathered, collated and structured according to declared and accepted protocols. 1 b) scope a community uptake strategy that is likely to be successful, identifying the main barriers and key stakeholders; 3. otherwise, to investigate how best to improve cross-discipline, cross-community discovery- to-delivery for research data, and make recommendations to the JISC and others as appro- priate. 1.3 Scope Scientific data as defined in this study covers a broad range of data types and contributes to the evidence base of many disciplines outside what is commonly termed Science.2 Engineering and Technology, Medicine, Veterinary Sciences, Social Sciences, Humanities and the Arts all contain disciplines and modes of study where analyses are performed on carefully collected evidence bases. The scope of this study, therefore, is not limited to a particular set of disciplines, but rather to data gathered, collated, structured and analysed using a recognizably scientific method, with a bias towards quantitative methods. The Scientific Metadata Model produced by CCLRC (now STFC) defines three types of scientific investigation: experiment, measurement and simulation [SM04]. Loosely speaking, an experiment measures the behaviour of a sample or system under controlled conditions, while a measurement measures the behaviour of a sample or system with as little as possible disturbance or intervention from the researcher. A simulation attempts to mimic or predict a measurement investigation using mathematical modelling. The common characteristics of experiments and measurements is that the researcher selects or collects a sample or system to study, applies an instrument to it, and generates a raw set of data. These data may then be collated, cleaned and otherwise processed to form a dataset more amenable to study. It will be seen that this paradigm applies equally well to census data, where the instrument is a census return, as it does to crystallography data. Derived datasets, where one or more existing datasets are manipulated to create a new dataset, are not specifically excluded from the study, but the focus of the study is on primary data. 1.4 Methodology The report Dealing with Data was used as a starting point for identifying UK data repositories and projects of relevance to this study [Lyo07]. Further repositories and projects were identified through desk research. Where possible, each repository and project was contacted and asked for details of the discovery-to-delivery metadata they use or have defined, alongside any insights they had from working with this metadata. A series of unstructured interviews were conducted, discussing use cases for a Scientific Data Application Profile, and how widely a single profile might be applied. On the latter point, matters of granularity, the experimental/measurement contrast, the quantitative/qualitative contrast, the raw/derived data contrast, and the homogeneous/heterogeneous data collection contrast were discussed. The interviewees were: • Cameron Neylon, STFC • James Reid, EDINA (Geospatial Application Profile, Go-Geo!) • Robin Rice, EDINA (DataShare) 2. The traditional conception of Science probably coincides with the disciplines grouped by the Joint Academic Coding System under the categories Biological Sciences, Physical Sciences, and Mathematical and Computer Sciences [HES06]. 2 • Brian Matthews, STFC • Weimin Zhu,