An RDF guide for the Darwin Core standard

Editor(s):
Solicited review(s):
Open review(s):

Steven J. Baskaufᵃ,*, John Wieczorekᵇ, John Deckᶜ, and Campbell O. Webbᵈ

ᵃ Department of Biological Sciences, PMB 351634, Vanderbilt University, Nashville, TN 37146, USA
E-mail: [email protected]
ᵇ Museum of Vertebrate Zoology, University of California at Berkeley, 3101 Valley Life Sciences Building, Berkeley, CA 94720, USA
E-mail: [email protected]
ᶜ Berkeley Natural History Museums, University of California at Berkeley, 1001 Valley Life Sciences Building, Berkeley, CA 94720, USA
E-mail: [email protected]
ᵈ Arnold Arboretum of Harvard University, 1300 Centre St., Boston 02131, USA
E-mail: [email protected]

* Corresponding author.

Abstract. The Darwin Core vocabulary is widely used to transmit biodiversity data in the form of simple text files. In order to support expression of biodiversity data in the Resource Description Framework (RDF), a guide was created as a non-normative addition to the Darwin Core standard. The guide resolves a number of issues that arise from adapting terms designed to have literal values for use with URI references. Although there are some problems that are beyond the scope of the guide, the guide is an important step towards enabling the biodiversity informatics community to participate in broader Linked Data and Semantic Web efforts.

Keywords: biodiversity, informatics, standards, semantics, interoperability

1. Introduction

Darwin Core (DwC) [1] is a technical standard¹ of Biodiversity Information Standards² (TDWG). Since its ratification in 2009, Darwin Core has been widely used to publish and transmit data. For example, as of 29 Jan 2014 the Global Biodiversity Information Facility (GBIF) network has aggregated over 428 million Darwin Core records.³

The Darwin Core standard contains a general-purpose vocabulary for describing biodiversity resources. It includes several guides that describe how the vocabulary terms should be used to transmit data in various formats such as simple text and XML.

Darwin Core roughly follows the Dublin Core Abstract Model,⁴ which was built on Resource Description Framework⁵ (RDF) concepts. A DwC term is designated as a URI, which can be dereferenced to access its normative RDF/XML metadata. Despite these connections to RDF, there has previously been no guide to describe how DwC vocabulary terms should be used to describe biodiversity resources as RDF.

The Darwin Core vocabulary was primarily designed to facilitate the sharing of data in simple text files containing a single table of rows and columns. Extensive discussion on the TDWG mailing list⁶ between 2009 and 2011 showed that there was significant interest in expressing Darwin Core data in RDF and identified several issues that impeded the effective use of Darwin Core terms in RDF:

- Text files often use name strings as values of properties, whereas it is often advantageous in RDF to have URI reference objects.
- Simple fielded text files (where a line in the file contains the data for a single record) are by their nature "flat", while RDF is often highly normalized. (We use the term "flat" to denote data which is expressed in a single table.)
- Darwin Core defines a number of "ID" terms, which are used in flat files to specify identifiers and record type, whereas RDF provides different standard methods for expressing identity and class membership.

In 2011, an RDF/OWL Task Group⁷ was chartered by TDWG, and in 2012 a team of writers began work on a Darwin Core RDF guide to address the issues listed above by providing a set of best practices and by creating some new Darwin Core terms intended specifically for use in RDF. The guide⁸ was completed in 2013 and reviewed by the Task Group, which recommended it for adoption. Upon adoption by TDWG, the RDF guide will become a non-normative part of the Darwin Core standard.

In this paper, URIs are sometimes abbreviated as QNames using standard namespace prefixes, e.g. dwc:recordedBy. The prefixes are defined in footnotes or legends. URIs, RDF serializations, and SPARQL queries are written in Courier font. In many cases, URIs identify real resources in the wild, although example triples containing those URIs are not necessarily asserted there.

Section 2 of this paper lays out how each major issue was addressed by the guide. Section 3 describes several challenges to the implementation of Darwin Core as RDF that are beyond the scope of the guide. Section 4 concludes the paper.

2. Issues and their resolution in the guide

2.1. Terms intended for use with non-literal objects

Because the Darwin Core vocabulary was designed primarily to facilitate the transfer of text-based records from relatively flat database tables, definitions and comments for terms in the general namespace dwc:⁹ suggest using text strings to refer to physical and conceptual entities, e.g., names to represent people, citations to represent articles, codes to represent institutions, etc. When a record has multiple values for a property, these term definitions specify that the multiple strings be concatenated and delineated in a single field to avoid forcing the creation of a more normalized data structure. These specifications are at odds with best practices for RDF, where it is preferable in triples to identify non-literal objects using URI references. Those URIs can then be associated with additional properties that describe the non-literal resource.

The conflicting demands of flat, string-based tables and normalized, graph-based RDF create a problem when terms that were originally designed for use with text strings are co-opted for use with non-literal objects in RDF. This is a longstanding problem¹⁰ that is not unique to Darwin Core. The Dublin Core RDF Guidelines¹¹ provide a mechanism to permit legacy string literal data to be associated with terms that were not intended for use with literal objects. This mechanism, which involves use of the rdf:value¹² property, has not been widely implemented. A more widely used dual-term alternative allows Dublin Core terms in the legacy dc:¹³ namespace to be used with literal values,¹⁴ while reserving terms in the dcterms:¹⁵ namespace that have declared non-literal ranges for use with URI reference or blank node objects.

2.1.1. New Darwin Core terms in the dwcuri: namespace

The Darwin Core RDF guide adopts the dual-term approach by creating a new Darwin Core namespace, dwcuri:,¹⁶ whose terms are intended for use only with non-literal objects. For example, the existing Darwin Core term dwc:recordedBy would continue to be used with a value that consisted of a name string for agents who recorded an occurrence, whereas the new term dwcuri:recordedBy would relate the subject to a non-literal object (URI reference or blank node) that denotes the actual agent.

Footnotes:
¹ http://www.tdwg.org/standards/450/; enter reference website at http://rs.tdwg.org/dwc/terms/
² http://www.tdwg.org/
³ http://www.gbif.org/
⁴ http://dublincore.org/documents/abstract-model/
⁵ http://www.w3.org/TR/rdf-concepts/
⁶ http://lists.tdwg.org/pipermail/tdwg-content/
⁷ http://code.google.com/p/tdwg-rdf/
⁸ The draft guide is at http://code.google.com/p/tdwg-rdf/wiki/DwcRdf with an eventual permanent URL of http://rs.tdwg.org/dwc/terms/guides/rdf/
⁹ http://rs.tdwg.org/dwc/terms/
¹⁰ http://wiki.foaf-project.org/w/UsingDublinCoreCreator
¹¹ http://dublincore.org/documents/dc-rdf/#sect-4
¹² rdf: = http://www.w3.org/1999/02/22-rdf-syntax-ns#
¹³ http://purl.org/dc/elements/1.1/
¹⁴ http://wiki.dublincore.org/index.php/User_Guide/Publishing_Metadata#Legacy_namespace
¹⁵ http://purl.org/dc/terms/
¹⁶ http://rs.tdwg.org/dwc/uri/

    <http://arctos.database.museum/guid/MVZ:Mamm:115956>
        dwc:recordedBy "Oliver P. Pearson; Anita K. Pearson";
        dwcuri:recordedBy <http://viaf.org/viaf/263074474>,
                          <http://museum-x.org/personnel/akp>.

Fig. 1. Recorders of an occurrence (serialized as RDF/Turtle)

    <http://bioimages.vanderbilt.edu/baskauf/00001#loc>
        a dcterms:Location;
        dwc:continent "North America";
        dwc:country "United States";
        dwc:stateProvince "Tennessee";
        dwc:county "Robertson".

Fig. 2. Darwin Core convenience terms describing the political subdivisions of a location (serialized as RDF/Turtle)

    <http://bioimages.vanderbilt.edu/baskauf/00001#loc>
        a dcterms:Location;
        dwcuri:inDescribedPlace <http://sws.geonames.org/4653638/>.

Fig. 3. Using a dwcuri: term to link a location to its lowest level political subdivision (serialized as RDF/Turtle)

The guide allows legacy string name data in the form of concatenated lists to continue to be exposed in RDF as a literal object of a dwc: namespace term. However, the guide specifies that if a record using a term from the general dwc: namespace is serialized as RDF using a dwcuri: namespace term, each non-literal resource in a concatenated list of names should be the object of a separate triple. This is illustrated in Fig. 1.

An advantage of the dual-term approach is that it allows large databases consisting of legacy string name data to be exposed immediately as RDF without imposing a requirement that the provider immediately implement the use of URI identifiers for non-literal resources.

dwc:continent allow a location to be placed in its geographical context. No single term value is sufficient to unambiguously place the location in its lowest level political subdivision because there may be several low level political subdivisions having the same name that are contained within different upper level political subdivisions. Thus each location record must provide the entire set. In the context of a flat database structure, it is convenient to expose the full set of property/value pairs for a location since that would allow a user to query for locations in the database by specifying the particular values of interest for certain properties in the set (hence the name "convenience terms"