A Semantically Enabled Architecture for Crowdsourced Linked Data Management
Total Page:16
File Type:pdf, Size:1020Kb
A semantically enabled architecture for crowdsourced Linked Data management Elena Simperl Maribel Acosta Barry Norton Institute AIFB Institute AIFB Ontotext AD Karlsruhe Institute of Karlsruhe Institute of Bulgaria Technology Technology [email protected] Germany Germany [email protected] [email protected] ABSTRACT a central building block of the Linked Data technology stack. It is Increasing amounts of structured data are exposed on the Web us- a graph-based data model based on the idea of making statements ing graph-based representation models and protocols such as RDF about (information and non-information) resources on the Web in subject predicate object 2 and SPARQL. Nevertheless, while the overall volume of such open, terms of triples of the form . or easily accessible, data sources reaches critical mass, the abil- The object of any RDF triple may be used in the subject posi- ity of potential consumers to use them in novel applications and tion in other triples, leading to a directed, labeled graph typically services is predicated on the availability of purposeful means to referred to as an ’RDF graph’. Both nodes and edges in such query and manage the data, while taking into account and master- graphs are identified via URIs; nodes represent Web resources, ing its essential features in terms of decentralization, heterogene- while edges stand for attributes of such resources or properties ity of schema, varying quality, and scale. Many aspects of these connecting them. Schema information can be expressed using lan- challenges are necessarily tackled through a combination of algo- guages such as RDFS and OWL, by which resources can be typed rithmic techniques and manual effort. In the literature on tradi- as classes described in terms of domain-specific attributes, proper- ties and constraints.3 RDF graphs can be natively queried using the tional data management the theoretical and technical groundwork 4 to realize and manage such combinations is being established. In query language SPARQL. A SPARQL query is composed of graph this paper we build upon these ideas and propose a semantically patterns and can be stored as RDF triples together with any RDF domain model using SPIN to facilitate the definition of constraints enabled architecture for crowdsourced data management systems 5 which uses formal representations of tasks and data to automati- and inference rules in ontologies. cally design and optimize the operation and outcomes of human Over the past years Linked Data has established itself as the de computation projects. The architecture is applied to the context of facto means for the publication of structured data over the Web, Linked Data management to address specific challenges of Linked enjoying amazing growth in terms of the number of organizations Data query processing such as identity resolution and ontological committing to use its core principles and technologies for exposing and interlinking data sets for seamless exchange, integration and classification. Starting from a motivational scenario we explain 6 how query-processing tasks can be decomposed and translated into reuse. More and more ICT ventures offer innovative data man- MTurk projects using our semantic approach, and roadmap the ex- agement services on top of Linked (Open) Data. A first notable tensions to graph-based data management technology that are re- example comes from the public administration sector, with a wide quired to achieve this vision. array of governmental institutions in many countries worldwide exposing their data according to open-access policies and Linked Data principles and technologies. Other equally promising applica- 1. INTRODUCTION tion domains include life sciences, cultural heritage, and media and Linked Data refers to a set of guidelines and best practices for broadcasting, with the BBC, the New York Times and the Guardian publishing and accessing structured data on the Web.1 It builds adopting the same approach to publish metadata about their content upon established Web technologies, in particular HTTP and URIs, in order to increase Web traffic, improve site navigation and facili- extended with Semantic Web representation formats and protocols tate the integration of self-owned and public data sets. such as RDF, RDFS, OWL and SPARQL, by which data from dif- Per design Linked Data management solutions aim to reach a ferent sources can be shared, interconnected and used beyond the high level of automation with respect to the processing of an open application scenarios for which it was originally created. RDF is and decentralized data space bringing together data sources pub- lished by different parties, of varying quality and using heteroge- 1http://www.w3.org/DesignIssues/LinkedData. html neous conceptual schemas and vocabularies. This is achieved by using a minimal set of established Web technologies known for their robustness and scalability, paired with a uniform data model 2http://www.w3.org/RDF/ 3http://www.w3.org/TR/rdf-schema/, http://www.w3.org/TR/owl-ref/ 4 Copyright c 2012 for the individual papers by the papers’ authors. Copy- http://www.w3.org/TR/rdf-sparql-query/ ing permitted for private and academic purposes. This volume is published 5http://spinrdf.org/ and copyrighted by its editors. 6See also recent statistics of the Linked Open Data Cloud at CrowdSearch 2012 workshop at WWW 2012, Lyon, France http://www4.wiwiss.fu-berlin.de/lodcloud/ Figure 1: Generic architecture for applications consuming Linked Data (according to [5]) Figure 2: Overview of our approach and links between datasets facilitating exploration and integration. management envisions the enhancement of core components of the Nevertheless, while the core technological building blocks of this architecture presented in Figure 1 with integrated crowdsourcing approach are maturing, and large amounts of useful data are con- features that deal with the packaging of specific tasks as MTurk tinuously made available on the Web, the actual experience in de- HITs, and with the integration of the crowdsourcing results with veloping Linked Data applications reveals that many aspects of existing semi-automatic functionality. Figure 2 schematically de- Linked Data management remain, for principled or technical rea- picts this idea for a query processing scenario, which is further sons, heavily reliant on human intervention. Figure 1 shows a discussed in Section 3. Dealing with MTurk includes user inter- generic architecture of applications consuming data exposed us- face management capabilities, in order to produce optimal human- ing Linked Data principles and technologies [5]. The two layers readable descriptions of specific tasks operating on specific data, to that lend itself to LD-specific tasks are the publication layer, and increase workers’ productivity, and to reduce unintended behavior the data access, integration and storage layer, respectively. Both such as spam, but also workflow design capabilities for handling contain various aspects amenable to crowdsourcing, as we will see complex tasks. As described in [2] we propose to use SPARQL later on: (i) publishing legacy data as RDF: identifying suitable graph patterns to describe tasks and data to drive the generation vocabularies and extending them, conceptual modeling, defining of HITs interfaces, as well as extensions of Linked Data manage- mapping rules from the legacy sources to Linked Data; (ii) Web ment languages and components, most prominently query process- data access: discovering or adding missing information about a re- ing, mapping and identity resolution, and Linked Data wrappers source; (iii) Vocabulary mapping and identity resolution: defining with crowdsourcing operators. The corresponding Linked Data correspondences between related resources (iv) Query processing components need to interact with the MTurk platform to post spe- over integrated data sets: aligning different styles or attributes of cific HITs, assess the quality of the outcomes, and exploit these in data integrated from heterogeneous data sources, detecting lack of a particular application. This interaction can occur offline, when knowledge. In these scenarios, human contributions can serve dif- crowdsourcing input is being used to gradually improve the quality ferent purposes, from undertaking the task itself, to generating data of computational methods, and online, which may require specific to train specific automatic techniques, and validating (intermedi- optimizations to predict the time-to-completion of crowdsourced ary) outcomes of such techniques. tasks. Recent approaches in the area of database management systems Each task that is subject to a crowdsourcing exercise is describes [1, 4, 9] are laying out the theoretical and technical foundations for using graph patterns from the SPARQL query language represent- designing and building crowdsourcing-enabled data management ing input to the human task and its desired output. The input pattern technology. In this paper we build upon these ideas and propose a expresses a query over the existing collection of Linked Data, nec- semantically enabled architecture for crowdsourced data manage- essary to provide the input to the MTurk task. The output pattern ment systems which uses formal representations of tasks and data represents a query that should succeed when the additional asser- to automatically design and optimize the operation and outcomes tions that result from the human task have been added to the knowl- of human computation projects. The architecture