Beyond the Pillars of Hercules: Linked Data and Cultural Heritage

Gianfranco Crupi
JLIS.it. Vol. 4, n. 1 (Gennaio/January 2013). DOI: 10.4403/jlis.it-8587

The term linked data refers to a «set of best practices for publishing and interlinking structured data on the Web. These best practices were introduced by Tim Berners-Lee in his Web architecture note Linked Data and have become known as the Linked Data principles» (Heath and Bizer).[1] The underlying paradigm is that of the traditional web, the web of hypertext or documents, which rests, as we know, on a small but effective set of standards: HTML, the markup language for page layout, formatting and visualization; HTTP, the universal protocol for transmitting hypertext information; and URI, the single, universal identification system. This “simple” logical architecture is the basis of the principles for publishing and sharing structured data on the web. The first is the use of URIs to identify not only web documents and digital contents but also objects in the real world and abstract concepts (partly because URIs work as a means of access to information that describes the entities identified). The second is the adoption of HTTP URIs, which allows a URI to be dereferenced through the HTTP protocol into a description of the identified object or abstract concept. The third is the use of a standard mechanism for specifying the existence and significance of the connections between the elements described in the data. This mechanism is provided by RDF, which expresses the relations between the “things” of the world (people, places or abstract concepts) as qualified links, and so offers a flexible way of describing them, of indicating the relationships they have with other “things”, and of explicitly stating the nature of these relationships.

[1] The principles formulated by Tim Berners-Lee are: 1. Use URIs as names for things; 2. Use HTTP URIs, so that people can look up those names; 3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL); 4. Include links to other URIs, so that they can discover more things.

Dereferencing means that clients can look up the URI using the HTTP protocol and thus retrieve a description of the resource (be it an HTML document, a real-world object or an abstract concept) that the URI identifies; descriptions of resources that are intended to be processed by machines are represented as RDF data. However, when URIs identify “things” in the real world, the normal procedure is to use different URIs for the “thing” and for the document that describes it. This avoids any risk of confusing the two and keeps statements about a “thing” coherently distinct from statements about its description. The technology of linked data is therefore tied to the RDF model, not only because that model provides for the unique identification of entities on a global scale, but also because it allows for the parallel use of different schemes for the representation of data.

However, at this point, we need to take a step back in order to give a theoretical and methodological context to the technology of linked data, in the light of the contributions that have been made to the Global Interoperability and Linked Data in Libraries seminar, the proceedings of which will be published here.

The language of the semantic web

In the context of the semantic web, the term semantic does not refer to the semantics of natural language but to the fact that the data can be processed by a computer, and that they contain information that allows the computer to process them correctly. Nevertheless, the semantic web has its own language, which is not a spoken language but a language invented to communicate and exchange data and information between human beings, and to be read, interpreted and processed by machines. It is a language with its own grammar, whose function is to express the relational nature of the data and their protean typology. This grammar, known as RDF, provides the logical structure for managing and expressing the relationships between pieces of information based on the principles of predicate logic, according to which information is expressed through statements built on a basic three-part (triple) syntagmatic model:

1. a subject, i.e. any resource, not necessarily accessible via the web, which identifies the “thing” described (documents, readable by humans, or objects, readable by machines);
2. a predicate, that is a specific property of the resource or relation used to describe it, identified by a name;
3. an object, known as a value.

Furthermore, according to the grammar of RDF, every sentence or statement describes the relationship between two entities, for example between a work and its author (Giuseppe Verdi composed La Traviata), or between an entity and the textual annotations that characterize it (e.g. the words La Traviata and the words that indicate the date and place of its first performance: March 6, 1853, Venice, Teatro La Fenice). Nevertheless, as already stated, except for textual annotations, each element in an RDF statement is represented, in its grammar, not by words from spoken language but by strings of characters, typically preceded by the prefix http://, which uniformly identify any resource (URI, Uniform Resource Identifier): from a web address to an e-mail address, from a document to a service, from a file to a program, etc. In the language of the semantic web, the URI also allows the identified object to be used in contexts other than the original one, regardless of its textual expression.[2]

Each RDF statement can be expressed by a graph consisting of nodes and arcs that represent the resources, their properties and their respective values. To be published, this graph model is encoded in serialization formats,[3] which allow the machine to process the model and understand the meaning of the descriptions of resources. More specifically, the identifiers used by RDF are URI references (URIref), that is, identifiers formed from a URI to which a suffix of Unicode characters is added, which makes it possible to express and define the relationships between any things. Although the objects, which represent the values associated with the predicates, can be expressed as strings of characters (known as literals), the use of URIrefs allows applications to distinguish properties that might otherwise be identified by the same literal name. These properties may in turn be treated as resources, so that additional information can be associated with them.

[2] «A URI can be classified as a URL or URN. A URL is a URI that, in addition to identifying a network-homed resource, specifies the means of acting upon or obtaining the representation: either through description of the primary access mechanism, or through network “location”. For example, the URL http://en.wikipedia.org/wiki/Main_Page identifies a resource, in this case English Wikipedia’s home page, whose representation, in the form of the home page’s current HTML and related code, as encoded characters, is obtainable via the HyperText Transfer Protocol from a network host whose domain name is www.wikipedia.org. A uniform resource name (URN) is a URI that identifies a resource by name, in a particular namespace. One can use a URN to talk about a resource without implying its location or how to access it. The resource does not need necessarily to be accessible over a network. For example, the URN urn:isbn:0-395-36341-1 is a URI that specifies the identifier system, i.e. international standard book number (ISBN), as well as the unique reference within that system and allows one to talk about a book, but the URI doesn’t suggest where and how to obtain an actual copy of it» (Uniform Resource Identifier, in Wikipedia. L’enciclopedia libera, http://it.wikipedia.org/wiki/Uniform_Resource_Identifier, 04-12-2003; last modified 04-08-2012).

[3] “Serialization” means the process of converting a data structure into a format that can be stored and then regenerated in the same or in another computing environment.

«A URI address - thanks to the way in which it is formed - contains in itself, at least implicitly, a quote. URI type addresses used for properties and classes lead the reader to definitions documented in an official manner. Thus it is the web itself that supplies the data language with its dictionary» (Baker). Tom Baker rightly insists on the linguistic nature that informs the entire system, a key to understanding the functioning of linked data and their many applications, especially in the context of cultural heritage and, in particular, libraries.
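
The triple model and the distinction between URI references and literals can be made concrete with a minimal sketch in Python. It is only an illustration under stated assumptions: the http://example.org/ namespace, the property names (composed, title, firstPerformanceDate) and the helper to_ntriples are hypothetical and belong to no published vocabulary or library. The output follows the general shape of the N-Triples serialization, with only a simplified escaping of literals.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class URIRef:
    """A resource identified by a URI reference."""
    value: str

    def n3(self) -> str:
        return f"<{self.value}>"


@dataclass(frozen=True)
class Literal:
    """A plain string value, i.e. a textual annotation."""
    value: str

    def n3(self) -> str:
        # Simplified escaping: backslash, double quote and newline only.
        escaped = (
            self.value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        return f'"{escaped}"'


Node = Union[URIRef, Literal]
Triple = Tuple[URIRef, URIRef, Node]

EX = "http://example.org/"  # hypothetical namespace, for illustration only

verdi = URIRef(EX + "person/GiuseppeVerdi")
traviata = URIRef(EX + "work/LaTraviata")
composed = URIRef(EX + "vocab/composed")
title = URIRef(EX + "vocab/title")
first_performance = URIRef(EX + "vocab/firstPerformanceDate")

# Subject and predicate are always URI references;
# the object is either another resource or a literal.
graph = {
    (verdi, composed, traviata),
    (traviata, title, Literal("La Traviata")),
    (traviata, first_performance, Literal("1853-03-06")),
}


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize triples one per line, sorted for a stable output."""
    lines = [f"{s.n3()} {p.n3()} {o.n3()} ." for s, p, o in triples]
    return "\n".join(sorted(lines))


print(to_ntriples(graph))
```

The first statement links two resources, while the other two attach textual annotations to one of them; because the predicates are themselves URI references, further statements could be made about them as well.
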

In fact, it is precisely this linguistic dimension that explains the construction of multiple phrases concerning the same subject, or of phrases that, in accordance with the principle of inference, generate new ones. The result is a network of assertions, and thus a set of relations (according to a model derived from the logic of relational databases), which extends the semantic network of the areas of origin of the data expressed in the individual statements. The assimilation of the principle of combinatoriality, according to which a limited number of smaller units can be combined to form an unlimited number of larger units, thus facilitates the production of messages that contain higher levels of relational complexity and, at the same time, of granularity relative to the domain to which the individual objects belong.
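
How statements can generate new statements by inference may be sketched as follows. The example applies a single rule, modelled on RDFS subproperty entailment (if s p o holds and p is a subproperty of q, then s q o also holds), and repeats it until nothing new is derived. It is a hedged illustration, not a full reasoner: the http://example.org/ URIs, the property names and the function infer_subproperties are assumptions introduced here; only the rdfs:subPropertyOf URI comes from the RDF Schema vocabulary.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

EX = "http://example.org/"  # hypothetical namespace, for illustration only
SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

VERDI = EX + "person/GiuseppeVerdi"
TRAVIATA = EX + "work/LaTraviata"
COMPOSED = EX + "vocab/composed"
CREATED = EX + "vocab/created"
CONTRIBUTED_TO = EX + "vocab/contributedTo"


def infer_subproperties(triples: Set[Triple]) -> Set[Triple]:
    """Return the input plus every triple derivable by the subproperty rule.

    The rule is re-applied until a fixpoint is reached, so chains of
    subproperties are followed without a separate transitivity rule.
    """
    closure = set(triples)
    while True:
        # Pairs (p, q) such that p is declared a subproperty of q.
        declared = {(s, o) for s, p, o in closure if p == SUBPROPERTY_OF}
        new = {
            (s, q, o)
            for s, p, o in closure
            for sub, q in declared
            if p == sub
        } - closure
        if not new:
            return closure
        closure |= new


facts: Set[Triple] = {
    (VERDI, COMPOSED, TRAVIATA),
    (COMPOSED, SUBPROPERTY_OF, CREATED),
    (CREATED, SUBPROPERTY_OF, CONTRIBUTED_TO),
}

derived = infer_subproperties(facts) - facts
for triple in sorted(derived):
    print(triple)
```

From one explicit statement about Verdi and two statements about the properties themselves, the sketch derives two further statements that no one asserted directly, which is the sense in which a small number of units combines into a larger network of assertions.
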
