Dynamic Linked Data: A SPARQL Event Processing Architecture

Luca Roffia 1,*, Paolo Azzoni 2, Cristiano Aguzzi 1, Fabio Viola 1, Francesco Antoniazzi 1 and Tullio Salmon Cinotti 1,3

1 Department of Computer Science and Engineering (DISI), University of Bologna, I-40126 Bologna, Italy; [email protected] (C.A.); [email protected] (F.V.); [email protected] (F.A.); [email protected] (T.S.C.)
2 Eurotech S.p.a., Via Fratelli Solari 3/a, 33020 Amaro (Udine), Italy; [email protected]
3 Advanced Research Center on Electronic Systems “Ercole De Castro” (ARCES), University of Bologna, I-40126 Bologna, Italy
* Correspondence: luca.roffi[email protected]; Tel.: +39-051-209-5423

Received: 12 February 2018; Accepted: 16 April 2018; Published: 20 April 2018

Abstract: This paper presents a decentralized Web-based architecture designed to support the development of distributed, dynamic, context-aware and interoperable services and applications. The architecture enables the detection and notification of changes over the Web of Data by means of a content-based publish-subscribe mechanism where the W3C SPARQL 1.1 Update and Query languages are fully supported and used respectively by publishers and subscribers. The architecture is built on top of the W3C SPARQL 1.1 Protocol and introduces the SPARQL 1.1 Secure Event protocol and the SPARQL 1.1 Subscribe Language as a means for conveying and expressing subscription requests and notifications. The reference implementation of the architecture offers developers a design pattern for modular, scalable and effective application development.

Keywords: dynamic Linked Data; publish-subscribe; Semantic Web; SPARQL; event processing; protocols; distributed Web applications; interoperability; security

Future Internet 2018, 10, 36; doi:10.3390/fi10040036 www.mdpi.com/journal/futureinternet

1. Introduction

In the last couple of decades, the World Wide Web, initially conceived to share human-friendly information, has been growing into a new form: the Semantic Web [1], intended as a global Web of Data which can be interpreted and processed directly by machines. In 2006, Tim Berners-Lee published a document on Linked Data claiming that “it is the unexpected re-use of information which is the value added by the Web” (Tim Berners-Lee, Linked Data, 2006, https://www.w3.org/DesignIssues/LinkedData.html) and providing a set of rules on how to publish data on the Web, based on technologies such as URI (https://tools.ietf.org/html/rfc3986), HTTP (https://tools.ietf.org/html/rfc2616), RDF (https://www.w3.org/TR/rdf11-concepts/) and SPARQL (https://www.w3.org/TR/sparql11-overview/). In 2009, the Linked Data concept and the related technologies, along with an assessment of the evolution of the Web of Data, were further described in [2]. Since then, research efforts on Linked Data have been made in several directions, including standards (e.g., SPARQL 1.1 Protocol (https://www.w3.org/TR/sparql11-protocol/), SPARQL 1.1 Query language (https://www.w3.org/TR/sparql11-query/), SPARQL 1.1 Update language (https://www.w3.org/TR/sparql11-update/), JSON-LD (https://json-ld.org/spec/latest/json-ld/), Linked Data Platform 1.0 (https://www.w3.org/TR/ldp/)), ontologies and vocabularies (e.g., OWL (https://www.w3.org/TR/owl2-primer/), CIDOC-CRM (http://www.cidoc-crm.org/Version/version-6.2.1), W3C OWL Time (https://www.w3.org/TR/owl-time/), Semantic Sensor Network Ontology (https://www.w3.org/TR/vocab-ssn/), schema.org (http://schema.org/), RDF Data Cube Vocabulary (https://www.w3.org/TR/vocab-data-cube/), just to mention a few), RDF stores and SPARQL endpoints (e.g., Jena (https://jena.apache.org/index.html), Fuseki (https://jena.apache.org/documentation/fuseki2/), Virtuoso (https://virtuoso.openlinksw.com/), Blazegraph (https://www.blazegraph.com/), Stardog (https://www.stardog.com/) and Amazon Neptune (https://aws.amazon.com/neptune/)), tools (e.g., for ontology editing, like Protégé (https://protege.stanford.edu/), or for RDF graph visualization, like LodLive (http://en.lodlive.it/) and Visual Data Web (http://www.visualdataweb.org/index.php)) and Semantic Web portals (e.g., DBpedia (http://wiki.dbpedia.org/), Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page)). The adoption of Semantic Web technologies follows two main trends: the first tries to exploit their potential to create autonomous systems, in the widest meaning of the term, while the second focuses on solutions that take advantage of the flexible RDF data model.
The use of Semantic Web technologies in autonomous systems enables machines to generate, publish and consume new information autonomously (e.g., from simple autonomous systems adopted in IoT applications, to complex systems for the autonomous analysis of amounts of data that would be impossible for a human operator to process). At the same time, the adoption of a flexible data model is intended to support applications where it is difficult to define the knowledge base and its model once and forever. In fact, the adoption of a static model is often not reasonable, because information is alive and evolves over time. While relational databases intrinsically do not provide this level of flexibility, the flexibility offered by RDF ensures that unanticipated and unexpected evolutions of the knowledge base can be integrated on the fly. These two trends appear to be interdependent, and in this paper, we propose an enabling technology that could allow their convergence.

Focusing on data, the heterogeneity of data produced and consumed on the Web is very high, and the amount of data globally processed is huge (i.e., Big Data): social data collected from social networks (e.g., Facebook, Twitter), medical data (e.g., Medical Subject Headings RDF (https://id.nlm.nih.gov/mesh/), W3C Semantic Web Health Care and Life Sciences Interest Group (https://www.w3.org/2001/sw/hcls/), HealthData.gov (https://www.healthdata.gov/)), scientific data from several fields of science (e.g., NASA Global Change Master Directory (https://gcmd.gsfc.nasa.gov/), Science Environment for Ecological Knowledge (http://seek.ecoinformatics.org/), the NeuroCommons project (http://neurocommons.org/page/Main_Page), NASA Semantic Web for Earth and Environmental Terminology (https://sweet.jpl.nasa.gov/), Bio2RDF (http://bio2rdf.org/)), e-commerce information (e.g., Best Buy, Sears, Kmart, O’Reilly publishing, Overstock.com), governmental data (e.g., Opening Up Government (http://data.gov.uk/), Tetherless World Constellation - Linking Open Government Data (https://logd.tw.rpi.edu/), Data-gov Wiki (https://data-gov.tw.rpi.edu/wiki)) and human knowledge (e.g., Wikipedia). Also from this perspective, the use of Semantic Web technologies for Big Data processing and analysis follows the previously-mentioned trends: on the one hand, software platforms search, analyze and harvest the Web with different levels of autonomy to provide enhanced and enriched results; on the other hand, specific applications are developed to consume, publish or combine information for the purposes of a vertical domain. The solution we present in this paper is information agnostic and able to manage data sources gathered from different and heterogeneous domains, providing support for standard Web protocols (i.e., HTTP and WebSocket). Furthermore, we extend the concept of Web of Data to applications, which become the actionable extension of the Semantic Web.

Nowadays, the Web of Data is becoming a Web of Dynamic Data, where detecting, communicating and tracking the evolution of data changes play crucial roles and open new research questions [3,4]. Moreover, detecting data changes is functional to enabling the development of distributed Web of Data applications where software agents may interact and synchronize through the knowledge base [5,6].
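The schema-free flexibility of RDF mentioned above can be illustrated with a minimal sketch. The following toy triple store (all class, predicate and prefix names are illustrative, not part of any cited system) shows how a fact with a previously unknown predicate can be absorbed on the fly, with no schema migration of the kind a relational database would require:

```python
# Minimal sketch of a schema-free, RDF-style triple store. Prefixes like
# "ex:" and "sosa:" are illustrative stand-ins for real vocabulary IRIs.

class TripleStore:
    def __init__(self):
        self.triples = set()  # each entry is a (subject, predicate, object) tuple

    def add(self, s, p, o):
        # No schema to alter: any new predicate is accepted as-is.
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        # None acts as a wildcard, loosely like a SPARQL variable.
        return {t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)}

store = TripleStore()
store.add("ex:sensor1", "rdf:type", "sosa:Sensor")
store.add("ex:sensor1", "sosa:observes", "ex:Temperature")
# An unanticipated property appears later; no ALTER TABLE is needed:
store.add("ex:sensor1", "ex:firmwareVersion", "2.1")

matches = store.query(s="ex:sensor1")  # all three facts about ex:sensor1
```

A relational schema would have required a new column (or table) for the firmware version before the third fact could be stored; the triple model integrates it immediately.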
The need for solutions for detecting and communicating data changes over the Web of Data has been emphasized in the past few years by research focused on enabling interoperability in the Internet of Things through the use of Semantic Web technologies (like the ones presented in [7–10], just to cite a few). Last but not least, an attempt made by the W3C is represented by the Web of Things (https://www.w3.org/WoT/) working group and by the Linked Data Notifications recommendation (https://www.w3.org/TR/ldn/), released in 2017, which enables notifications over Linked Data.

In this paper, we propose a decentralized Web-based software architecture, named SEPA (SPARQL Event Processing Architecture), built on top of the authors’ experience acquired in developing an open interoperability platform for smart space applications [11–21]. SEPA derives from and extends the architecture presented in [22] through the use of standard Linked Data technologies and protocols. It enables the detection and communication of changes over the Web of Data by means of a content-based publish-subscribe mechanism where the W3C SPARQL 1.1 Update and Query languages are fully supported by publishers and subscribers, respectively. SEPA is built on top of the SPARQL 1.1 Protocol and introduces the SPARQL 1.1 Secure Event protocol and the SPARQL 1.1 Subscribe Language as a means for conveying and expressing subscription requests and notifications. In particular, defining an event as “any change in an RDF store”, SEPA has been mainly designed to enable event detection and distribution. The core element of SEPA is its broker (see Figure 1): it implements a content-based publish-subscribe mechanism where publishers use SPARQL 1.1 Updates (i.e., to generate events) and subscribers use SPARQL 1.1 Queries (i.e., to subscribe to events). In particular, at subscription time, subscribers receive the SPARQL query results.
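The content-based publish-subscribe pattern just described can be sketched as follows. In this sketch a SPARQL 1.1 Query is stood in for by a Python predicate over triples, and a SPARQL 1.1 Update by explicit sets of added and removed triples; the class and method names are illustrative, not part of the SEPA API:

```python
# Sketch of a content-based publish-subscribe broker: subscribers register
# queries, publishers mutate the store, and every change triggers a
# re-evaluation of all registered queries. Names are illustrative only.

class Broker:
    def __init__(self):
        self.store = set()   # stand-in for the RDF store
        self.subs = []       # (query predicate, callback) pairs

    def subscribe(self, query, callback):
        # At subscription time, the subscriber receives the current results.
        self.subs.append((query, callback))
        callback({t for t in self.store if query(t)})

    def update(self, added=(), removed=()):
        # Publishers generate events by updating the store; the broker then
        # re-evaluates each registered query and notifies its subscriber.
        self.store |= set(added)
        self.store -= set(removed)
        for query, callback in self.subs:
            callback({t for t in self.store if query(t)})

# Usage: subscribe to all temperature observations, then publish one.
broker = Broker()
results = []
broker.subscribe(lambda t: t[1] == "ex:hasTemperature", results.append)
broker.update(added={("ex:room1", "ex:hasTemperature", "21.5")})
# 'results' now holds the initial (empty) result set and one notification
```

The real broker, of course, evaluates full SPARQL 1.1 Queries against an RDF store and delivers notifications over the SPARQL 1.1 Secure Event protocol rather than via in-process callbacks.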
Subsequent notifications about events (i.e., changes in the RDF knowledge base) are expressed in terms of the query results added and removed since the previous notification. With this approach, subscribers can easily track the evolution of the query results (i.e., the context) with the lowest impact on network bandwidth (i.e., the entire result set is not sent every time, but just the delta of the results).
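The delta encoding of notifications can be sketched with plain set operations. Variable names and the example bindings below are illustrative; they are not taken from the SPARQL 1.1 Subscribe Language:

```python
# Sketch of delta-based notifications: a notification carries only the
# bindings added to and removed from the query result set since the
# previous notification, never the entire result set.

def delta(previous, current):
    """Return (added, removed) results between two sets of query bindings."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current

# Bindings are modeled as hashable rows of (variable, value) pairs.
first_results  = {(("sensor", "ex:s1"), ("value", "21.0"))}
second_results = {(("sensor", "ex:s1"), ("value", "21.5")),
                  (("sensor", "ex:s2"), ("value", "19.0"))}

added, removed = delta(first_results, second_results)
# 'added' holds the two new rows and 'removed' the single stale one; after
# the first notification, only such deltas travel over the network.
```

Applying each delta in order to the initially received result set lets a subscriber reconstruct the current results at any time, which is exactly the context-tracking behavior described above.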