Specification and Implementation of Mapping Rule Visualization and Editing

Specification and Implementation of Mapping Rule Visualization and Editing: MapVOWL and the RMLEditor Pieter Heyvaerta,1,∗, Anastasia Dimoua,1, Ben De Meestera, Tom Seymoensb, Aron-Levi Herregodtsc, Ruben Verborgha, Dimitri Schuurmanc, Erik Mannensa aIDLab, Department of Electronics and Information Systems, Ghent University { imec, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium bimec { smit, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Etterbeek, Belgium cimec { Ghent University { MICT, Korte Meer 7, 9000 Gent, Belgium Abstract Visual tools are implemented to help users in defining how to generate Linked Data from raw data. This is possible thanks to mapping languages which enable detaching mapping rules from the implementation that executes them. However, no thorough research has been conducted so far on how to visualize such mapping rules, especially if they become large and require considering multiple heterogeneous raw data sources and transformed data values. In the past, we proposed the RMLEditor, a visual graph-based user interface, which allows users to easily create mapping rules for generating Linked Data from raw data. In this paper, we build on top of our existing work: we (i) specify a visual notation for graph visualizations used to represent mapping rules, (ii) introduce an approach for manipulating rules when large visualizations emerge, and (iii) propose an approach to uniformly visualize data fraction of raw data sources combined with an interactive interface for uniform data fraction transformations. We perform two additional comparative user studies. The first one compares the use of the visual notation to present mapping rules to the use of a mapping language directly, which reveals that the visual notation is preferred. The second one compares the use of the graph-based RMLEditor for creating mapping rules to the form-based RMLx Visual Editor, which reveals that graph-based visualizations are preferred to create mapping rules through the use of our proposed visual notation and uniform representation of heterogeneous data sources and data values. Keywords: graph, Linked Data, mapping rule, MapVOWL, RMLEditor 1. Introduction loaded in multiple databases; UniProt [12] (UniPro- tKB, Uniref and UniParc), with approximately 45 Nowadays Linked Data still stems from (semi- billion triples across 3 datasets derived from the )structured formats. A few of the most well-known UniProt Knowledgebase6; and Bio2 7 [2], with 2 rdf and larger Linked Data sets [23] are: DBpedia approximately 11 billion triples across 35 datasets. 3 dataset [6], with approximately 1.39 billion triples Overall, Linked Data generation includes the fol- 4 derived from Wikipedia where the data is origi- lowing: (i) multiple heterogeneous raw data sources nally represented in the wikitext syntax; Linked whose data values might need transformation, 5 Geo Data [58] with approximately 1.38 billion (ii) ontologies used to annotate the rdf terms [13] triples derived from Open Street Map planet files that are generated from the data fractions of the different data sources, (iii) actual mapping rules which ∗Corresponding author define how data fractions are semantically anno- Email address: [email protected] tated and used to generate rdf terms and triples, (Pieter Heyvaert) 1These authors contributed equally to this work. and (iv) generated Linked Data. 2http://stats.lod2.eu/rdfdocs?sort=triples 3http://dbpedia.org 4http://wikipedia.org 6http://www.uniprot.org/help/uniprotkb 5http://linkedgeodata.org 7http://bio2rdf.org/ Preprint submitted to Journal of Web Semantics February 6, 2018 Several datasets are generated by tools that in- which we presented in the past. The RMLEditor corporate directly in their implementation how offers a graph-based interface for specifying map- Linked Data is generated. This means when new or ping rules for raw data to generate Linked Data. updated semantic annotations are needed, knowl- Its target group of users have knowledge about both edge of Semantic Web technologies is required, as Linked Data and the domain of the data. The map- well as dedicated software development cycles for ping rules creation and editing is based on graph adjusting and extending the implementations. visualizations, without requiring knowledge of the To the contrary, mapping rules may also be de- underlying mapping language. Our novel contribu- fined, according to a specified mapping language tions include in particular: syntax, such as r2rml [16] or rml [20]. Mapping (i) a rich graph-based visual notation for map- languages define declaratively how terms are gener- ping rule visualization; ated from corresponding raw data and annotated (ii) an approach for manipulating rules when with ontology terms to form the desired Linked large visualizations emerge; Data. This way, mapping rules are detached from (iii) an approach to uniformly visualize data frac- the implementation that executes them. Neverthe- tions of data sources combined with an interactive less, knowledge of the underlying mapping language interface for uniform data fraction transformations; is required to define mapping rules, while manually (iv) an implementation of these three contribu- editing and curating them requires a substantial tions in the RMLEditor; and amount of human effort [31]. Therefore, the cre- (v) additional evaluations to compare the use of ation of mapping rules still remains complicated. { the visual notation to present mapping rules To this end, a significant number of mapping ed- to the use of a mapping language directly, which itors were implemented to facilitate mapping rules reveals that the visual notation is preferred; and creation and editing, such as Map-On [56], the { the graph-based RMLEditor to create mapping RMLEditor [31], and the RMLx Visual Editor8. rules to the form-based RMLx Visual Editor, which Only a few of them provide graph-based visualiza- reveals that the RMLEditor is preferred to create tions, although user evaluations suggest that such mapping rules through its use of the visual notation visualizations are suitable for supporting users to and uniform representation of heterogeneous data intuitively generate their desired Linked Data [31]. sources and data values. Nevertheless, how such user interfaces should be The remainder of the paper is structured as fol- designed is not thoroughly investigated so far: (i) a lows. In Section 2, we elaborate on Linked Data, visual notation specification for mapping rules does mapping rules, and rdf terms. In Section 3, we not exist. Such a specification would provide a for- discuss related work. In Section 4, we introduce mal description for mapping rules and allows mul- our research questions and hypotheses. We present tiple tools to implement it, improving the acces- in Section 5 our proposed visual notation for map- sibility for users across different tools; (ii) current ping rules, and in Section 6, the RMLEditor. In mapping editors do not uniformly present multi- particular, we elaborate in Section 6.5 on the ma- ple heterogeneous data sources. Therefore, creat- nipulation of large graphs in the RMLEditor, and in ing mapping rules that define relationships between Section 6.6 on how the RMLEditor deals with het- heterogeneous data sources is not always straight- erogeneous data sources. In Section 7, we present forward; (iii) data value transformations cannot be the evaluations and the results both for the visual defined in current visualization-based mapping ed- notation and the new version of the RMLEditor. In itors; (iv) scalability is not thoroughly addressed. Section 8, we summarize this work's conclusions. Even if graph-based visualizations are used, large graphs cause difficulties to users when editing the corresponding mapping rules. 2. Preliminary In this work, we present and extend our ongo- ing work towards a uniform graphical user interface Linked Data refers to data whose meaning is ex- (gui) to create and edit Linked Data mapping rules. plicitly defined, that is published on the Web in Such a gui is implemented in the RMLEditor, a machine-interpretable way, and that is linked to other external data sets [5]. Nowadays, RDF [13] is the prevalent framework to represent Linked Data. 8http://pebbie.org/mashup/rml Most of the time, Linked Data originally stems from 2 (semi-)structured formats (csv, xml, and so on). 3.1.1. Linked Data Visualizations Their rdf representation is obtained by repetitively As mapping rules resample how Linked Data will applying mapping rules according to an iteration eventually be generated [31], visualizations that are pattern which specifies the extract of data that is applicable for Linked Data would be expected to be considered during each iteration. applicable for mapping rules visualizations too. Mapping rules define correspondences between Efforts to improve Linked Data accessibility, re- data in different schemas [24]. In the case of Linked sulted in tools offering either (i) text-based pre- Data generation, mapping rules define how rdf sentations, or (ii) visualizations. Dadzie and terms, i.e., iris, literals or blank nodes [13], are gen- Rowe [15] conducted a survey on both approaches. erated from data fractions derived from one or more They concluded that text-based solutions, such as data sources which are annotated with ontologies. Sig.ma [60], (i) fail to provide an overview of the These rdf terms are used to form rdf triples. available information, and (ii) are only suitable for With the term data fractions, we do not only users with understanding of the underlying tech- mean raw data values as they are in the original nologies, whereas lay users need additional support. data source, but also transformed data values that Visual presentations lead to approaches relying result from a raw data value after applying a func- on network maps [35], diagrams [35], geographic tion to process the original data values. Data value maps [53], timelines [53], charts [3], and graphs [27, transformations are needed to support changes in 62, 19]. The latter was the default in the past ac- the structure, representation or content of data, cording to Dadzie and Pietriga [14], because (i) on- such as string transformations.

Load more