Online Index Extraction from Linked Open Data Sources

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia Online Index Extraction from Linked Open Data Sources Fabio Benedetti1,2, Sonia Bergamaschi2, Laura Po2 1 ICT PhD School - Università di Modena e Reggio Emilia - Italy [email protected] 2 Dipartimento di Ingegneria "Enzo Ferrari" - Università di Modena e Reggio Emilia - Italy fi[email protected] Abstract. The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing the quality and extracting synthetical information from a LOD source are all tasks that require a strong human effort. This paper proposes an approach for the automatic extraction of the more representative information from a LOD source and the creation of a set of indexes that enhance the description of the dataset. These indexes collect statistical information regarding the size and the complexity of the dataset (e.g. the number of instances), but also depict all the instantiated classes and the properties among them, supplying user with a synthetical view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity on the knowledge representation of RDF data. An evaluation on LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed and the effectiveness of the index extraction process has been presented. 1 Introduction The possibility to expose any sort of data on the Web by exploiting a consolidated group of technologies of the Semantic Web Stack [6] is one of the main strengths of Linked Open Data. There are several portals that catalog datasets that are available as LOD on the Web. One of the main globally available Open Data catalogues is The Data Hub (formerly CKAN)3. All of these portals allow users to perform keyword search over the metadata associated to their list of LOD sources, but they do not provide search techniques based on structural information of these sources. As mention in [14], there is still a “lack of conceptual description of datasets". The documentation of a LOD source is produced by who published the data and, in many cases, it is incomplete or absent. Therefore, usage of LOD datasets requires a human being to identify the domain of the datasets and discriminate if they are relevant for his/her needs (usually by performing SPARQL queries). This work has been accomplished in the framework of a PhD program organized by the Global Grant Spinner 2013 and funded by the European Social Fund and the Emilia Romagna Region. 3 http://datahub.io 9 To overcome the above problems, we devise a new method and a tool, called LODeX. LODeX is able to automatically extract a set of indexes containing the most representative information about the structure of a LOD dataset, enhancing the understanding and, therefore, the exploitation of LOD sources. These indexes collect statistical information regarding the intensional and extensional knowledge of a LOD source. The intensional knowledge contains the RDFS/OWL triples used to define a vocabulary or an ontology. The extensional knowledge is characterized by the instantiated classes, the properties between them and some graph patterns (ingoing and outgoing properties from instances of a specific class). Usually, the intensional knowledge is defined as the terminology used for characterizing the assertions, i.e. the extensional knowledge. As reported in [2,9], the structure of an RDF source is implicitly defined by its set of triples; information about the schema resides explicitly in the instantiation of the classes and implicitly in the use of the properties among these. The indexes can have different uses, but primarily they represent a good documentation of the dataset to which they refer. In fact, a knowledge engineer might be interested to describe a specific environment by reusing available vocabularies or ontologies if he easily understands their intensional contents. Otherwise, a data scientist can explore the lists of elements characterizing the extensional knowledge (e.g. occurrence of outgoing properties from a particular type of instances) of a specific dataset to easily build the SPARQL queries he/she needs to extract the data he/she is looking for. The indexes can also be used with other purposes: to support search engines (e.g DataHub) for dataset selection according to the ingoing or outgoing properties from instances of a class ranked according to the number of occurrences; to support tools able to generate queries. The indexes can be used for the schema generation and summarization of LOD sources as done in [5]. LODeX only deals with SPARQL endpoints, differently from other approaches that work with a dump of the RDF Database stored locally (e.g. [15], [3] and [8]). This choice has been made in order to provide a tool able to work as an online service. LODeX takes as input just the URL of a SPARQL endpoint and produces the necessary queries to extract the needed information and generate the statistical indexes. One of the main problems we encounter when dealing with SPARQL endpoints is the heterogeneity on the performance of the different implementations of endpoints. In fact, it happens that several SPARQL aggregation queries may trigger timeout errors. LODeX handles the problem of long running queries, that are usually going in timeout, by gen- erating an higher number of low-complexity queries able to return the same information into smaller chunks of data. We called this mechanisms pattern strategy, and it will be described in Section 5.1. The paper is structured as follows. Next Section outlines some relevant works con- nected to this topic or papers that have inspired the development of this tool. Section 3 defines intensional and extensional knowledge in the scenario of LOD sources. An overview of LODeX and its architecture is depicted in Section 4. Section 5 details the extraction of the set of indexes. In section 6 some tests are reported and finally, conclu- sions and some ideas for future work are described in Section 7. 10 2 Related Work In the literature, we can find several works in which a summary or a set of descriptors are extracted from a LOD source. In [8], authors divide these techniques in two groups, triples-level and instances-level summaries, according to the granularity with which the sources are scanned and indexed. The triples-level techniques inspect the content of the RDF dataset scanning each triple, and then they usually build an index containing statistical information regarding the type of these triples. SchemEx [15] is an example of work belonging to this group; here, dumps of RDF Graphs are indexed in order to support user query processing. Differently from LODeX, this approach does not consider the class instances, thus it is also not able to retrieve the properties among classes. RDFstats [17], instead, defines a vocabulary and an algorithm able to collects statis- tics about the triples belonging to an RDF dataset that can be used to build histograms and document the source. The information extracted by RDFstats has a low granularity and it can not be used as documentation of a RDF dataset, because it does not contain any sort of information about the structure of the RDF source. Another valuable example of these works is LODStats [3]. Here, the RDF graphs are scanned at triple level and in the post-processing phase, structural information such as the class and property hierarchies are discovered. In the instance-level approach the RDF Graph is inspected by taking into account RDF, RDFS and OWL primitives and their semantics, in order to detect structural information pervasive in the source. In this group we can find two important works [12, 20] in which the proposed instance-level summary can support efficient and federated query evaluation. Another valuable example of these group is [7] in which an approach that allows to enrich knowledge bases with OWL2 axioms is described. The information extracted by [7] overlaps the intensional knowledge that can be present within the triples of a dataset; LODeX is able to extract the triples belonging the intensional knowledge through an ad hoc algorithm reducing the time and the complexity of this task. All these techniques have been tested using a small number of datasets and taking as input the dump of an RDF graph; instead, LODeX has been designed to be used with a wide number of SPARQL endpoints in an online environment. The most important example of LOD indexes are Void descriptors [1]. The Void descriptors are a W3C standard used for expressing metadata about RDF datasets. In particular, they are primarily used to describe links among different datasets rather than the structure of the dataset itself. Despite they report valuable information, their def- inition is demanded to the producer of the LOD dataset, thus, not all the datasets are equipped with VOID descriptors. LODeX makes use of some of the VOID descriptors and adds further indexes to supply more detailed information about the structure of the dataset. 3 Intensional and Extensional Knowledge in LOD sources The LOD Cloud consists of an huge number of SPARQL endpoints, each aiming to describe a knowledge base of a specific domain. The language used to describe data is RDF, while RDFS and OWL are used to represent intensional knowledge [18][10]. 11 Fig. 1: Data structure of a generic LOD dataset We can think the entire set of RDF triples partitioned between Intensional Knowl- edge (I) and Extensional Knowledge (E).

Online Index Extraction from Linked Open Data Sources

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support