Specification and Implementation of Mapping Rule Visualization and Editing: MapVOWL and the RMLEditor

Pieter Heyvaerta,1,∗, Anastasia Dimoua,1, Ben De Meestera, Tom Seymoensb, Aron-Levi Herregodtsc, Ruben Verborgha, Dimitri Schuurmanc, Erik Mannensa

aIDLab, Department of Electronics and Information Systems, Ghent University – imec, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium bimec – smit, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Etterbeek, Belgium cimec – Ghent University – MICT, Korte Meer 7, 9000 Gent, Belgium

Abstract Visual tools are implemented to help users in defining how to generate from raw data. This is possible thanks to mapping languages which enable detaching mapping rules from the implementation that executes them. However, no thorough research has been conducted so far on how to visualize such mapping rules, especially if they become large and require considering multiple heterogeneous raw data sources and transformed data values. In the past, we proposed the RMLEditor, a visual graph-based user interface, which allows users to easily create mapping rules for generating Linked Data from raw data. In this paper, we build on top of our existing work: we (i) specify a visual notation for graph visualizations used to represent mapping rules, (ii) introduce an approach for manipulating rules when large visualizations emerge, and (iii) propose an approach to uniformly visualize data fraction of raw data sources combined with an interactive interface for uniform data fraction transformations. We perform two additional comparative user studies. The first one compares the use of the visual notation to present mapping rules to the use of a mapping language directly, which reveals that the visual notation is preferred. The second one compares the use of the graph-based RMLEditor for creating mapping rules to the form-based RMLx Visual Editor, which reveals that graph-based visualizations are preferred to create mapping rules through the use of our proposed visual notation and uniform representation of heterogeneous data sources and data values. Keywords: graph, Linked Data, mapping rule, MapVOWL, RMLEditor

1. Introduction loaded in multiple ; UniProt [12] (UniPro- tKB, Uniref and UniParc), with approximately 45 Nowadays Linked Data still stems from (semi- billion triples across 3 datasets derived from the )structured formats. A few of the most well-known UniProt Knowledgebase6; and Bio2 7 [2], with 2 rdf and larger Linked Data sets [23] are: DBpedia approximately 11 billion triples across 35 datasets. 3 dataset [6], with approximately 1.39 billion triples Overall, Linked Data generation includes the fol- 4 derived from where the data is origi- lowing: (i) multiple heterogeneous raw data sources nally represented in the wikitext syntax; Linked whose data values might need transformation, 5 Geo Data [58] with approximately 1.38 billion (ii) ontologies used to annotate the rdf terms [13] triples derived from Open Street Map planet files that are generated from the data fractions of the dif- ferent data sources, (iii) actual mapping rules which ∗Corresponding author define how data fractions are semantically anno- Email address: [email protected] tated and used to generate rdf terms and triples, (Pieter Heyvaert) 1These authors contributed equally to this work. and (iv) generated Linked Data. 2http://stats.lod2.eu/rdfdocs?sort=triples 3http://dbpedia.org 4http://wikipedia.org 6http://www.uniprot.org/help/uniprotkb 5http://linkedgeodata.org 7http://bio2rdf.org/

Preprint submitted to Journal of Web Semantics February 6, 2018 Several datasets are generated by tools that in- which we presented in the past. The RMLEditor corporate directly in their implementation how offers a graph-based interface for specifying map- Linked Data is generated. This means when new or ping rules for raw data to generate Linked Data. updated semantic annotations are needed, knowl- Its target group of users have knowledge about both edge of technologies is required, as Linked Data and the domain of the data. The map- well as dedicated software development cycles for ping rules creation and editing is based on graph adjusting and extending the implementations. visualizations, without requiring knowledge of the To the contrary, mapping rules may also be de- underlying mapping language. Our novel contribu- fined, according to a specified mapping language tions include in particular: syntax, such as r2rml [16] or rml [20]. Mapping (i) a rich graph-based visual notation for map- languages define declaratively how terms are gener- ping rule visualization; ated from corresponding raw data and annotated (ii) an approach for manipulating rules when with ontology terms to form the desired Linked large visualizations emerge; Data. This way, mapping rules are detached from (iii) an approach to uniformly visualize data frac- the implementation that executes them. Neverthe- tions of data sources combined with an interactive less, knowledge of the underlying mapping language interface for uniform data fraction transformations; is required to define mapping rules, while manually (iv) an implementation of these three contribu- editing and curating them requires a substantial tions in the RMLEditor; and amount of human effort [31]. Therefore, the cre- (v) additional evaluations to compare the use of ation of mapping rules still remains complicated. – the visual notation to present mapping rules To this end, a significant number of mapping ed- to the use of a mapping language directly, which itors were implemented to facilitate mapping rules reveals that the visual notation is preferred; and creation and editing, such as Map-On [56], the – the graph-based RMLEditor to create mapping RMLEditor [31], and the RMLx Visual Editor8. rules to the form-based RMLx Visual Editor, which Only a few of them provide graph-based visualiza- reveals that the RMLEditor is preferred to create tions, although user evaluations suggest that such mapping rules through its use of the visual notation visualizations are suitable for supporting users to and uniform representation of heterogeneous data intuitively generate their desired Linked Data [31]. sources and data values. Nevertheless, how such user interfaces should be The remainder of the paper is structured as fol- designed is not thoroughly investigated so far: (i) a lows. In Section 2, we elaborate on Linked Data, visual notation specification for mapping rules does mapping rules, and rdf terms. In Section 3, we not exist. Such a specification would provide a for- discuss related work. In Section 4, we introduce mal description for mapping rules and allows mul- our research questions and hypotheses. We present tiple tools to implement it, improving the acces- in Section 5 our proposed visual notation for map- sibility for users across different tools; (ii) current ping rules, and in Section 6, the RMLEditor. In mapping editors do not uniformly present multi- particular, we elaborate in Section 6.5 on the ma- ple heterogeneous data sources. Therefore, creat- nipulation of large graphs in the RMLEditor, and in ing mapping rules that define relationships between Section 6.6 on how the RMLEditor deals with het- heterogeneous data sources is not always straight- erogeneous data sources. In Section 7, we present forward; (iii) data value transformations cannot be the evaluations and the results both for the visual defined in current visualization-based mapping ed- notation and the new version of the RMLEditor. In itors; (iv) scalability is not thoroughly addressed. Section 8, we summarize this work’s conclusions. Even if graph-based visualizations are used, large graphs cause difficulties to users when editing the corresponding mapping rules. 2. Preliminary In this work, we present and extend our ongo- ing work towards a uniform graphical user interface Linked Data refers to data whose meaning is ex- (gui) to create and edit Linked Data mapping rules. plicitly defined, that is published on the Web in Such a gui is implemented in the RMLEditor, a machine-interpretable way, and that is linked to other external data sets [5]. Nowadays, RDF [13] is the prevalent framework to represent Linked Data. 8http://pebbie.org/mashup/rml Most of the time, Linked Data originally stems from 2 (semi-)structured formats (csv, , and so on). 3.1.1. Linked Data Visualizations Their rdf representation is obtained by repetitively As mapping rules resample how Linked Data will applying mapping rules according to an iteration eventually be generated [31], visualizations that are pattern which specifies the extract of data that is applicable for Linked Data would be expected to be considered during each iteration. applicable for mapping rules visualizations too. Mapping rules define correspondences between Efforts to improve Linked Data accessibility, re- data in different schemas [24]. In the case of Linked sulted in tools offering either (i) text-based pre- Data generation, mapping rules define how rdf sentations, or (ii) visualizations. Dadzie and terms, i.e., iris, literals or blank nodes [13], are gen- Rowe [15] conducted a survey on both approaches. erated from data fractions derived from one or more They concluded that text-based solutions, such as data sources which are annotated with ontologies. Sig.ma [60], (i) fail to provide an overview of the These rdf terms are used to form rdf triples. available information, and (ii) are only suitable for With the term data fractions, we do not only users with understanding of the underlying tech- mean raw data values as they are in the original nologies, whereas lay users need additional support. data source, but also transformed data values that Visual presentations lead to approaches relying result from a raw data value after applying a func- on network maps [35], diagrams [35], geographic tion to process the original data values. Data value maps [53], timelines [53], charts [3], and graphs [27, transformations are needed to support changes in 62, 19]. The latter was the default in the past ac- the structure, representation or content of data, cording to Dadzie and Pietriga [14], because (i) on- such as string transformations. For instance, when tologies are often hierarchically structured and used a data fraction contains a date in the format “DD- to annotate Linked Data; (ii) rdf’s data model MM-YYYY”, it might be needed to be transformed is a directed labeled graph [13]; and (iii) network to the format “YYYY-MM-DD”. analysis is one of the most common visualization- References to raw or transformed data fractions driven tasks carried out within the field, to ex- might be combined with constant values (template- plore, e.g., collaborations and other interrelation- valued [16]) or only constant data values (constant- ships between researchers within research data, and valued [16]) might be used to form the desired rdf social networks at large. However, they fail to pro- terms. As such, a complete mapping rule includes vide meaningful visualizations when graphs become (i) a reference to zero or more raw data fractions large, even when styled to better convey the re- and (ii) one or more ontology terms. Therefore, sources’ and properties’ semantics [49]. Further- when visualizing a mapping rule, its interrelation more, exposing the data’s graph structure is not with the raw data as well as with its semantic an- always needed, because the model may be of little notation with an ontology term should be visualized importance to users [14]. Therefore, current efforts when presented to and edited by users. are more focused on user interface components de- signed for different types of data, such as temporal and geographical data. Nonetheless, graph visual- 3. State of the Art izations can still be used, but they are no longer the main component to be represented. The creation and editing of mapping rules, as well Regardless of the used visualization, research as their visualization is closely related to Linked is being done to improve the support for large Data, ontologies, and query visualizations (Section datasets. For example, Bikakis et al. [4] introduce 3.1), and is performed using different types of map- a generic model for organizing, and analyzing nu- ping editors (Section 3.2). meric and temporal data in a multilevel fashion to deal with the challenges that come with the use of 3.1. Interfaces for Semantic Web Visualizations large datasets, such as information overload. The model is not tied to a specific visualization and, In this section, we elaborate on visualizations thus, it can be used with any of the aforementioned for different aspects of the Semantic Web: Linked tools to improve the support for such datasets. Data, ontologies, and queries which become rele- vant. We discuss the characteristics of each aspect 3.1.2. Ontology Visualizations and their visualizations that are implemented in ex- Ontology visualizations are also close to mapping isting tools or formalized as a specification. rule visualizations. Mapping rules define the mod- 3 eling, namely the conceptualization, of raw data as responding changes in the ontology. TurtleEdi- Linked Data, using ontologies. Thus, approaches tor [57], for instance, uses both methods: text ed- applicable for ontology representation which visu- itor and visualization to present and update the alize the schema could be adjusted for visualizing ontology, leaving the choice to the users. mapping rules too. To improve the ontology pre- sentation and editing, a number of tools were devel- Specifications for Visual Notations for Ontologies. oped. Such tools offer visualizations that represent In most cases, tools create a new stand-alone vi- ontology-specific elements. Only a limited number sualization to present the ontology [43, 34]. How- of efforts defined a specification for such visual no- ever, when users switch between tools they need tations that can be implemented by any other tool. to learn a new visualization to work with the new tool. Recent efforts result in the creation of specifi- Tools. Graphs offer a natural way to depict the cations for visual notations. They provide a formal structure of ontological elements and relation- description of the visual elements used to convey ships [40]. Therefore, graph-based visualizations the required information to users. These specifica- are present in a high number of visualization tools tions are independent of the tools that implement for ontologies, such as GrOWL [36], OWLViz [33] them, and, thus, can be applied by multiple tools. and KC-Viz [43]. In these cases, classes and The latter allows to improve the user’s ability to datatypes are represented as nodes and properties switch between these tools as they provide the same as edges. However, when similar graphical nota- predefined visualization [25]. The most significant tions are used for different ontological elements it specifications are Graffoo [25] and vowl [40]. is difficult for users to distinguish them. The Graphical Framework For OWL Ontologies Furthermore, a number of graph-based visualiza- (Graffoo) [25] defines a graph-based visual notation tions are focused on a single task, e.g., providing for ontologies. Classes are represented with yel- insights regarding the class hierarchy, and neglect low rectangles and datatypes with green parallelo- other aspects of ontologies, including the proper- grams. The properties are represented with lines ties and datatypes. This makes them less suit- connecting the rectangles or parallelograms, and able for other tasks [40], such as gaining insights have different colors depending on the property’s in the relationships between the classes via the dif- type, i.e., object and datatype property. Graffoo is 9 ferent properties. When visualizing large ontolo- implemented in yEd , a diagram editor. However, gies the graphs might become too large for users it does not propose an intuitive ontology visualiza- to cope with. Different approaches, applied by sev- tion that is immediately understandable to casual eral tools [43, 38, 34], were developed to tackle this users due to its diagrammatic approaches [40]. 10 challenge: (i) showing detailed information only on The Visual Notation for owl Ontologies demand, i.e., selecting a specific node results in dis- (vowl) defines a visual language for user-oriented playing more detailed information about that node; representation of ontologies and provides graphi- (ii) displaying the parts of the graphs that are se- cal depictions for elements of the Web Ontology lected by a given filter; (iii) zooming in and out on Language (owl) [40]. The initial specification of the graphs, enabling them to gain more information vowl [44] focused on the visualization of ontology about a specific region of interest of the graph. elements (i.e., classes, properties, and datatypes), So far, ontology visualizations were focused on together with the dataset’s instances, the so-called presenting the ontologies. Nevertheless, more and TBox and ABox in , respectively. more tools are built to facilitate editing of ontolo- However, based on their user study, they concluded gies [45, 11], as ontologies are subject to change. that even when visualizing only a few instances, This leads to the development of visualization plu- providing additional information already leads to gins that show the (updated) ontology inside the difficulties for understanding the visualization. tool [39]. However, such an approach disconnects Therefore, vowl 2 focuses on visualizing only the visualization of the ontology from its editing the TBox. It relies on directed graphs to visualize and requires users to understand both the visu- ontologies, and is implemented in WebVOWL [38], alization and the interface used to perform the a web application. Classes are represented by blue, updates. This is overcome when graph-based vi- sualizations allow users to update the nodes and 9http://www.yworks.com/products/yed edges [36, 57]. The updates will result in the cor- 10http://purl.org/vowl/spec/ 4 circular nodes, and datatypes by yellow, rectangu- Icons represent the update actions to the queries, lar nodes. Links are used to visualize how classes such as add and delete. are related to other classes or datatypes through As queries can become large, it is required to sup- properties. Graphical elements are added to links port large graphs. Similar to ontology visualiza- between classes to represent characteristics, such tions, different approaches were proposed to tackle as disjoint and union, while the use of colors helps the corresponding challenges. For example, details identifying the different elements. are presented on-demand when users click on a part of the query [26], specific details are shown by ap- plying filters on the query’s results [26], or users 3.1.3. Query Visualizations can zoom in on specific parts of the query through Query visualizations aim to hide the query lan- the corresponding buttons [52]. guage to end users, whereas mapping rule visualiza- Specific tools are also developed for specific cases, tions aim to hide the mapping language. Thus, ap- such as SPEX [53] for spatial and temporal data. proaches applicable for query representation could Their goal is to provide users with a gui to ex- be adjusted for visualizing mapping rules. plore the data and, subsequently, to help with query Query formulation is important for the re- construction. Scheider et al. [53] introduced de- trieval of information available as Linked Data. sign principles to achieve this goal, such as the use [50] is the standard for of the results’ feedback into the construction and Linked Data described using the rdf framework. the automatically handling of space and time data. However, besides users from the Semantic Web SPEX, which follows these principles, includes also community, lay users and domain experts from dif- a graph-based visualization, similar to the afore- ferent areas, i.e., users without knowledge about mentioned tools. However, these tools do not follow this language, need support to define valid queries the design principles making them less usable than that provide the desired results [26]. Tools, such as SPEX. NITELIGHT [52], RDF-GL [32], and FedViz [22], were developed to make it easier to work with a 3.2. Mapping Editors query language by removing the burden of deal- ing with the language’s syntax. They apply graph- Research efforts to facilitate the mapping rule based visualizations to represent the data’s under- creation and editing to generate Linked Data re- lying rdf structure. However, knowledge about sulted in the development of two types of graphical sparql is still required, which makes them less us- user interface (gui) tools: (i) step-by-step wizards able for lay users, with exception of FedViz. This and (ii) visualization tools. tool allows to browse Linked Data through a graph- In the past, step-by-step wizards prevailed as based visualization (see Section 3.1.1). Once the an easy-to-reach solution, such as the fluidOps ed- users have found the desired data, FedViz generate itor [54]. However, such applications restrict data the corresponding sparql query. publishers’ editing options, hamper altering param- Nonetheless, visually spotting the difference be- eters in previous steps, and detach mapping defini- tween different aspects of a query or Linked Data tions from the overall knowledge modeling, since is difficult as the same graphical notation is used to related information is separated in different steps. denote different aspects. queryvowl [26] addresses These tools circumvent dealing with the underly- this ambiguity. It is a specification for a visual no- ing mapping languages’ syntax, but users are still tation for queries that uses graph visualizations, as required to have knowledge of the language’s termi- it is based on vowl. queryvowl specifies a vi- nology, limiting thus the support for users who do sual query language, with the goal to be accessible not have this knowledge. for lay users while still preserving most of sparql’s More recent tools incorporate graph-based vi- expressiveness. Referents are represented as circu- sualizations to present mappings. A limited num- lar nodes and literal values as rectangular nodes. ber of them also apply these visualizations to edit A referent’s class is added inside the node as text. the mapping rules, such as Map-On [56] and the The same is done for the datatype of a literal value. RMLEditor [31]. We distinguish three prevalent Links represent the relationships between nodes, us- approaches for mapping rule visualizations: (i) vi- ing properties. Different colors are used to make sualizing mapping rules as Linked Data; (ii) sepa- distinction between classes, instances, and literals. rate graph visualizations for the raw data and on- 5 tology; (iii) single graph visualization for both raw rules editing. Moreover, so far, Semantic Web re- data and ontologies. lated visualizations present only one of the compo- The first approach is applied by the RMLx Vi- nents involved in Linked Data generation, namely sual Editor11. It visualizes mapping rules by using either Linked Data or ontologies, which do not re- graphs as if the rules are Linked Data, aiming to quire to interrelate multiple components. However, present them to the users rather than editing them. the mapping rule definition involves the interrela- This is possible when the mappings are defined in tion of the different involved components. rdf syntax, which is the case for r2rml and rml, A number of open issues remain regarding map- while editing is still performed relying on a step-by- ping rules visualization and editing: (i) the defi- step wizard. Consequently, users (i) need to under- nition of a visual notation specification for graph- stand the mapping language’s terminology to cor- based visualizations of mapping rules, (ii) scalabil- rectly interpret these graph visualizations; (ii) can ity issues when large graphs emerge, (iii) uniform only view and not edit the graphs, so mappings can representation for heterogeneous data sources, and only be edited using forms; (iii) need to ‘imagine’ (iv) data values transformation integration with the how the triples will look like as the visualization presentation of these data sources. More details are does not communicate how the mapping rules re- presented in Section 4.1, followed by our research sult in the corresponding rdf triples. questions and hypotheses in Section 4.2. The second approach is applied by Map-On. Two graphs are created: one that represents the raw 4.1. Open Issues data structure and one that represents the target ontology which is predefined and cannot be ad- Specification. The development of a visual notation justed after being loaded. Mapping rules are cre- specification increases the visualization’s adoption ated by aligning the two graphs, i.e., aligning the by other tools. Current research efforts and map- data fractions with the corresponding ontology ele- ping editors neither specify nor implement such a ments. The two graphs are distinguished from each specification. This impedes the users accessibility. other using different colors. However, by visualiz- ing both the raw data and ontology in the same Scalability. Dealing with large graphs is an impor- visualization, the graphs quickly become cluttered tant challenge for graph-based visualizations. The and which aggravates when both of them are large. same occurs when mapping rules are edited by us- The third approach is applied by the RMLEditor. ing graphs. However, none of the existing tools The visualization aims to represent the rdf triples tackle this challenge, besides barely applying zoom- that the mapping rules generate. The nodes rep- ing. The latter is not always sufficient as for users resent how rdf terms of subjects and objects are it is not always straightforward to know on which generated, while the edges represent the predicates’ part of the graph they need to zoom in. rdf terms. More details regarding the RMLEditor are available in Section 6. Heterogeneity. Linked Data might originally stem from multiple data sources with heterogeneous data formats. Therefore, mapping rules and their cor- 4. Problem Statement responding visualizations need to support these data sources. Most existing tools support nei- Based on the aforementioned, we notice that ther multiple nor heterogeneous data sources. The graph-based visualizations is the dominant approach RMLEditor and the RMLx Visual Editor are the when tools present Linked Data, ontologies, and only ones that support multiple heterogeneous data queries to users, while these visualizations also sources. The latter does not show the raw data, have revealed benefits for visualizing mapping rules. while the former does. Nevertheless, even only Linked Data and ontology visualizations mainly showing the raw data makes it difficult to derive aim to present to the users, whereas queries require the data model and especially its data fractions. the users to actively interact with and shape the visualized object, as it also occurs with mapping Data Transformations. Although, transformations are required when raw data values are desired to be altered to generate rdf terms, only the RMLx 11http://pebbie.org/mashup/rml Visual Editor provides this functionality. In other 6 cases, they are often implemented as custom solu- 5. Visual Notation for Mapping Rules tions, or addressed by dedicated applications, thus the range of possible transformations is limited, and In the past, we showed that graph-based visu- cannot be visualized uniformly to the users. alizations are applicable for presenting mapping rules [31]. It was also confirmed to be an intu- itive way to represent the structure of ontologies 4.2. Research Questions and Hypotheses in a comparative evaluation [40]. So, we rely on graph-based ontology visualizations and we investi- Given these open issues, we define the following gated how to apply their approaches on correspond- two research questions: ing graph-based mapping rule visualizations. There- fore, we based the specification for a mapping rule Q1 How can we design graph-based visualizations visualization on vowl, leading to mapvowl. that improve the cognitive effectiveness of mapping rules visual representations when gen- 5.1. Requirements erating Linked Data? The notation needs to support rules visualization Q2 How can we visualize the components of a that generate triples which, on their own turn, con- Linked Data generation process – that is based form with the rdf specification. Each triple con- on legacy non-rdf and (semi-)structured data sists of three elements: subject, predicate, and ob- sources – to improve its cognitive effectiveness? ject, which are rdf terms: iri, blank node, or lit- eral. A literal might have a datatype or language. This question has two sub-questions: These terms can be generated based on fractions - How to deal with large mapping rules graphs? from the raw data, constant values, or a combi- nation of the two. The data can be annotated by - How to uniformly visualize heterogeneous classes, properties, and datatypes from different on- data fractions and their transformations? tologies. Furthermore, a notation has to be cognitively ef- Note that cognitive effectiveness is defined as the fective. In the case of mapping rules this includes speed, ease, and accuracy with which a representa- the understanding of the different components of tion can be processed by the human mind [37]. the mapping process and their relationships: mul- To address Q1, we introduce a visual notation for tiple heterogeneous raw data sources, ontologies, mapping rules called mapvowl (see Section 5). To mapping rules, and the generated Linked Data. address Q2 and its sub-questions, we introduce an Therefore, a cognitive effective notation needs to approach to deal with heterogeneous data sources support rules that define how to generate and data values (see Section 6.6) and an approach to improve the understanding of large graph-based R1.1a rdf term: iri, blank node, and literal; visualizations (see Section 6.5). These approaches are implemented in the RMLEditor. Furthermore, R1.2 a triple’s subject (iri/blank node), predicate these questions lead to two corresponding hypothe- (iri), and object (iri/blank node/literal). ses which are validated through two comparative More, a notation should support references to user studies (see Section 7): R1.3 data values of data sources, H1 mapvowl improves the cognitive effectiveness of the mapping rules graph-based visual rep- R1.4 constant values, resentation to generate the Linked Data com- R1.5 a combination of data and constant values pared to using a mapping language directly. R1.6 ontological elements from various ontologies. H2 The cognitive effectiveness provided by the RMLEditor’s gui improves the user’s perfor- In Linked Data visualizations similar require- mance during the Linked Data generation pro- ments were fulfilled in order to depict the rdf’s cess – that is based on legacy non-rdf and data model, together with the used ontologies. (semi-)structured data sources – compared to However, they did not include the need to visualize the state of the art. the relationships with the raw data. 7 5.2. MapVOWL Table 1: Graphical Primitives We introduce mapvowl, a user-oriented visual Primitive Application notation for mapping rules which defines how Linked Data is generated from raw data. mapvowl referents builds on top of vowl’s graphical elements [40]. Relying on mapvowl’s unified graphical elements, blank nodes the mapping rules can be created and edited en- tirely using visual representations, while the map- ping rules in the underlying mapping language’s relationship labels and syntax are generated by the mapping editor with- literal values out user intervention. This way, mapvowl hides the underlying mapping language, as queryvowl relationships hides the query language. Overall, mapvowl aims to make mapping rules relationship direction creation and editing for generating Linked Data more accessible to Linked Data experts, even more textual information text for those who already know vowl or queryvowl. about mapping rules A comprehensive and uniform visual representation would be beneficial and would have great impact particularly on users with knowledge about the un- Referents and literals are depicted as nodes of the derlying mapping language, as it occurs in the case mapping rule graph visualization, whereas the re- of ontology visualizations [21]. lationships between them form the graph’s edges. The mapvowl specification can be found at This choice is made because the shape plays a spe- http://rml.io/mapvowl. Moreover, we applied cial role in discriminating between symbols as it mapvowl to the RMLEditor. You may access a represents the primary basis on which we identify demo instance of the RMLEditor with mapvowl objects in the real world; the shape is the primary implemented at http://rml.io/jws-demo. More- visual variable for distinguishing between different over, you may find a screencast explaining the constructs. graph elements of mapvowl as adopted in the RMLEditor at http://rml.io/jws-screencast. Shape. In the case of vowl, the shape is the In Section 5.2.1, we present the graphical primi- graphical primitive used to distinguish the classes tives. In Section 5.2.2, we discuss the color scheme. from datatypes which are the two fundamental con- In Section 5.2.3, we elaborate on the visual ele- structs of ontologies. Following the same princi- ments. In Section 5.2.4, we discuss how mapvowl ple, in the case of mapping visualizations, the type improves the cognitive effectiveness. of the rdf term which will be generated is the A user study between mapvowl and rml com- most fundamental construct. Thus, deviation in the pares the use of the visual notation to present map- node’s shape determines whether a referent (iri or ping rules to the use of a mapping language di- blank node) or a literal will be generated. rectly (see Section 7.1), while a user study between In the case of mapping visualizations, the con- the RMLEditor implementing mapvowl and the structs which can have a class assigned to them are RMLx Visual Editor compares the use of a graph- depicted as circles (R1.1), whereas the rest has a based gui to a form-based gui to create mapping rectangular shape (R1.1). This follows vowl where rules (see Section 7.2). classes are depicted as circles and datatypes as rect- angles. For the classes, there is no restriction on 5.2.1. Graphical Primitives which ontologies to use (R1.6). Constructs that are mapvowl provides a small set of unambiguous not meant to represent an entity, but are an at- graphical primitives based on vowl, as shown in tribute of an entity are depicted as rectangles, sim- Table 1. It is designed to be modular and, thus, it ilarly to datatypes in vowl and might be assigned distinguishes the primary elements that are crucial a datatype or have the default. for the Linked Data generation, such as the rdf terms to be generated, from elements which provide Edge. mapvowl considers directed edges whose la- supplementary details, such as their (data)types. bel is defined in a rectangle to associate referents 8 with each other or with literals. mapvowl uses Although, this number may be estimated by ana- edges as vowl does to represent the properties lyzing the data source for the specified data frac- which will be generated or (re)used to associate tion(s) from which the individuals are generated in the entities among each other or the entities with the end. If the analysis is not possible or desired, the attributes (R1.2). However, even though in all nodes should have the same predefined size. vowl the property’s label is detached from the edge, mapvowl incorporates the label in the edge 5.2.2. Color Scheme using a rectangle. This way, we make it explicit that Color. A color scheme is employed to interrelate a property is another rdf term as a referent or a data sources and mapping rules. In contrast to datatype valued node and it could be defined under vowl where the color indicates different types of the same conditions as the rest of them. Namely, we classes, in mapvowl the color depends on the data do not distinguish generation of a property from the source the term is (partially) derived from. If the other rdf terms. Last, a property’s direction is in- color of a node is grey, it is not associated to any dicated by an arrowhead, as it is for vowl too. This data source (yet). A color scheme is recommended states the property’s subject and object (R1.2). by mapvowl, but any other color scheme may be considered. Alternatively, a schema that relies on Text. While nodes and edges depict mapping rules, texture usage instead of different colors may be con- text is used to provide additional information about sidered to support color-blinded people if needed. these rules. Text is added to a circle to denote Brightness. In addition to color schema, the bright- (i) its class, and (ii) how the iris are formed, ei- ther template-valued or constant-valued (see Sec- ness level is employed to make the visualization tion 2; R1.3, R1.4, and R1.5), and to a rectangle more comprehensive. Changes on the brightness level occurs when a user interacts with the graph. to denote (i) the datatype or language tag (R1.1), and (ii) how the literal values are formed, either To be more precise, the brightness is higher when a template-valued or constant-valued (R1.3, R1.4, graph element is not selected and darker once the and R1.5). For the datatype there is no restric- element is selected by the user. tion on which ontologies to use (R1.6). When a language is specified, the language tag is preceded 5.2.3. Visual Elements with “@” and the datatype is automatically set to mapvowl defines visual elements for mapping rdf:langString [13] rules. These elements are based on the aforemen- tioned graphical primitives and color scheme. Fig- ure 1 shows examples of these elements. Border. As borders can convey meaning, they are Mapping rules can be created to generate refer- considered as a distinct graphical primitive used to ents, i.e., they will be represented by an iri or a represent a certain notion. A border is only present blank node which can be annotated with a class to when an entity is a blank node and then it is dashed, determine its type. For instance, in the examples, following vowl’s choice for using dashed borders referents are generated for employees and their ad- and properties’ lines for special classes and proper- dresses (see Figure 1a). For each employee a iri is ties. In the case of vowl, dashed circles are used generated, using a template, and for each address a when circles represent owl:Thing and, thus, they blank node, thus its node remains gray as its defi- do not carry relevant domain information. Simi- nition does not depend on any data source. Their larly, blank nodes might be generated and anno- classes are :Person and vivo:Address. tated; however, they differ from typical entities, as Mapping rules can be created to generate literals, the ones related to blank nodes do not receive iris. i.e., they are represented with the default datatype or a user-specified datatype. For instance, in our Size. Similarly to vowl, mapvowl also recom- example, literals are generated for the names of mends to vary the node size if possible and desired, projects. It has the default datatype xsd:string. without determining the scaling method though. The start date of a project has the user-specified vowl associates the node size with the number of datatype xsd:date (Figure 1b). When a literal individuals which are members of a class. However, value is a string, users can denote the value’s lan- this is not possible in the case of mapping rule visu- guage. For instance, in our example, the city of an alizations as the individuals are not generated yet. address is provided in English (Figure 1c). 9 Figure 1: Examples of visual elements

Mapping rules can be created to generate prop- notation should adhere to be effective. The princi- erties, which can be represented by an iri only. In ples are semiotic clarity, perceptual discriminabil- our example, properties are used to denote the rela- ity, semantic transparency, complexity manage- tionship between employees and their correspond- ment, visual expressiveness, dual coding, graphic ing literals, such as their phone number using the economy, coding fit, and cognitive integration. As property foaf:phone (see Figure 1d). a result the notation does not just only fulfill the Nodes have different size based on the available requirements, but does this in an cognitive effective data in the data source, unless the information is manner to improve the usability. not available (either deliberately or ignored). In our example, all employees have a name; however, not all employees provided their phone number. This is reflected in the size of the nodes that represent Semiotic Clarity. When specifying a visual nota- these literals (see Figure 1e). tion it is important to consider the principle of The overall visualization is combined to a graph semiotic clarity as it impacts the notation’s ef- that represents the envisaged model for the Linked fectiveness of addressing the problem it tries to Data. The position of the graph elements is im- solve [42]. Semiotic clarity is determined by the cor- posed by a force-directed graph layout algorithm. respondence between the semantic constructs and This means that the highly connected nodes are the graphical notations. In the case of mapvowl, placed more together, which puts more focus on the nodes and edges of the graphs form the graph- their relationships. ical notations, and the possible combinations of rdf terms, which the user can create to form the 5.2.4. Conform to Physics of Notations generated triples, are the semantic constructs (see The notation also needs to be designed to be cog- Figure 2). The semiotic clarity is determined by nitively effective. To achieve this we rely on the de- the presence of four anomalies: (i) symbol redun- sign theory “Physics of Notations” by Moody [42]. dancy, (ii) symbol overload, (iii) symbol excess, and The theory presents a set of principles to which a (iv) symbol deficit. 10 11

Figure 2: Semantic constructs (left) and graphical notations (right) Symbol Redundancy When a semantic con- Object. A (dashed) circle where an arrow arrives struct can be represented by multiple graphical no- aligns with an object, corresponding with either an tations, called symbol redundancy, it has two neg- iri or blank node; a rectangle without text at the ative effects on the user. On one hand, users have bottom with a literal without datatype or language; difficulties deciding which notation to use. On the a rectangle with text about the datatype with a lit- other hand, users need to remember multiple nota- eral with a datatype; a rectangle with text preceded tions for a single construct. with “@” with a literal with a language. In the case of mapvowl, a subject, which might be an iri or a blank node, is represented by a circle Class. A circle with text at the bottom aligns with from which an arrow starts. An iri is different from an iri as subject, rdf:type as predicate, and a class a blank node, because no border is used, while a as object. If this circle has a dashed border, then dashed border is used for a blank node, i.e., two it aligns with a blank node as subject instead of distinct notations for subject. A predicate, which iri. A circle with an arrow and rectangle to a circle is always an iri, is represented by an arrow with with text in the center aligns with an iri as subject, a rectangle. When an object is an iri or a blank rdf:type as predicate and a class as object. If this node, it is represented by a circle where an arrow notation has dashed border for the first circle, then ends, which distincts it from a subject. When an it aligns with a blank node as subject instead of iri. object is a literal, a rectangle is used. The use of a datatype or language is differently visualized: the Symbol Excess When a graphical notation is datatype is added at the bottom of the rectangle, used that does not represent a semantic construct, and the language is added at the bottom of the called symbol excess, it clutters the visualization rectangle preceded with “@”. and increase the complexity of the visualization. The class of an entity can be represented with Both have a negative impact on the understanding two different notations: (i) a circle with the class by users. In the case of mapvowl, no symbol ex- at the bottom, as datatype is for literals, and (ii) a cess is present as for each graphical notation there circle with an arrow and rectangle to another circle is a corresponding semantic construct, as explained with the class. The former occurs in most cases, for symbol overload (at least one arrow arrives for when the class is constant. To support rare cases each graphical notation one the right in Figure 2). where the class might depend on raw data values, the longer notation can be used, without excluding Symbol Deficit As opposed to the aforemen- the support for a constant value. This might be a tioned anomalies, symbol deficit is desired to limit small symbol redundancy, nevertheless, it leads to the diagrammatic complexity. Symbol deficit oc- a shorter notation in most cases. curs if not all semantic constructs are represented by a corresponding graphical notation. The deficit is desired when representing all semantic constructs Symbol Overload When a single graphical would reduce, instead of improve, the notation’s notation can represent multiple semantic con- cognitive effectiveness. In the case of mapvowl, structs, called symbol overload, it leads to ambigu- no symbol deficit was introduced as the graphical ity and potential misinterpretation. In the case of notations required to represent all constructs do not mapvowl, no symbol overload is observed. Each lead to a complex notation (Figure 2). graphical notation aligns with a single semantic construct, as can be observed in Figure 2 where Perceptual Discriminability. Perceptual discrim- only a single arrow arrives for each graphical nota- inability is the ease and accuracy with which tion on the right. In detail: graphical notations can be differentiated from each other [42]. For mapping rules, it is important to perceive the difference between (i) the components Subject. A (dashed) circle from which an arrow of a triple (subject, predicate, and object), and starts aligns with a subject, corresponding with ei- (ii) the rdf terms (iri, blank node, and literal). ther an iri or blank node. The former is done by using directed graphs, thus, an edge has an explicit direction, where the start is Predicate. A rectangle with an arrow through it the subject and the end is the object. The latter is aligns with a predicate. done by using a circle without a border for an iri, 12 Table 2: Visual variables information can be used: interval, ordinal, or nom- inal. The capacity denotes how many perceptible Variable Power Capacity steps are needed to understand the variable. horizontal position internal 10-15 mapvowl uses five out of the eight visual vari- vertical position interval 10-15 ables, as it uses size, brightness, color, texture, and size interval 20 shape (see Section 5.2.1). They are used in accor- brightness ordinal 6-7 dance to their power and capacity. Size denotes how colour nominal 7-10 often a data fraction is used in a data sources, rang- texture nominal 2-5 ing from 0% to 100% (interval), with steps of 10% shape nominal unlimited resulting in 10 perceptible steps (< 20 (capacity in orientation nominal 4 Table 2)). Brightness denotes which graph elements is selected, which is either true or false (ordinal, < 6). Color denotes the different data sources (nomi- and a circle with a dashed border for a blank node, nal), and in most cases the number of data sources regardless of whether they are subject or object. A is limited to less than 5 (< 7). The texture is an rectangle is used on the edge between two nodes alternative to denote a data sources (nominal, < to represent the iri, which is the only option for a 5). Shape denotes the limited graph elements, and predicate. A rectangle is also used for literals, but corresponding mapping rules (nominal, unlimited). it is distinct since it can only be at the end position Even though users are able to adjust the horizontal of the edge as it can only be the object of a triple. and vertical position, and orientation, which might Semantic Transparency. Semantic transparency is improve their understanding, it does not carry a defined as the extent to which the meaning of a specific meaning in mapvowl, because mapvowl notation can be inferred from its appearance [42]. does not have a symbol deficit that would benefit When Linked Data is presented using rdf, rdf from using these variables. graphs represent the model of the data. By us- ing graph-based visualizations for mapping rules, Dual Coding. According to dual coding theory [48], users already see the model that will be used for the using text and graphics together to convey infor- generated rdf triples, which form an rdf graph. mation is more effective than using either on their Nodes in the mapping graphs that are the start/end own. mapvowl does not specify the use of dual of an edge will result in subjects/objects of triples in coding. However, in the RMLEditor it is used. In- the rdf graph. The edges of the graphs will result formation about the used data source for an rdf in the predicates of the triples in the rdf graph. term is available through the color of the nodes and edges (graphics), and when clicking on the details Complexity Management. Complexity manage- icon (text). Other details about the mapping rules ment refers to the ability of a visual notation are only available as either text or graphics, be- to represent information without overloading the cause they do not have a corresponding graphical human mind [42]. Graph-based visualizations can notation or because that would introduce the use of be difficult to understand by users when they (mapping language) terminology while mapvowl’s become too large. Therefore, when the application goal is to avoid that (see Section 5.2). that implements the visual notation for mapping rules fails to address it, it might overload its users. Graphic Economy. Graphic complexity is defined by the number of graphical notations: the size of Visual Expressiveness. Visual expressiveness is de- its visual vocabulary [42]. The principle of graphic fined as the number of visual variables used in a economy states that this number should be cogni- notation [42]. The number is proportional to the tively manageable. In the case of mapvowl, the understanding of a notation, as it exploits multi- number is low. There are six different graphical ple visual communication channels and maximizes primitives (see Section 5.2.1). computational offloading. In total, there are eight visual variables: horizontal position, vertical posi- Cognitive Fit. Cognitive fit means a visual notation tion, size, brightness, color, texture, shape and ori- should include different representations to support entation (see Table 2). Each variable has a power users with skills ranging from novice to expert, and and capacity. The power denotes which type of the representational medium, such as whiteboards, 13 paper and computer screens [42], which is influ- enced by the perceptual discriminability, semantic transparency, and visual expressiveness. The creation of mapvowl has as main goal to support users who have knowledge about Linked Data and the data domain. Therefore, a different representation for others is not defined. mapvowl is designed to be displayed on a screen. However, the perceptual discriminability is high enough as mapvowl relies on differently shaped graphic nota- tion, i.e., circles, rectangles, and lines. The seman- tic transparency is high enough as mapvowl uses basic geometric shapes. The visual expressiveness might not be high enough though. For example, mapvowl only uses colors to denote the different data sources. However, on paper easy-to-draw sym- bols can be used to compensate this.

6. RMLEditor

The RMLEditor is a graph-based visualization Figure 3: Overview of the RMLEditor tool for mapping rules that define how Linked Data is generated from multiple heterogeneous data sources [31]. It allows users to load raw data R2.3: support heterogeneous data formats. The sources, create and edit mapping rules, and pre- original data from which Linked Data is derived view the resulting rdf triples. Its main goal is might stem from different data formats, such as to allow users who have domain and Linked Data csv, json, and xml. Thus, the gui should support knowledge, but no knowledge about the mapping the creation of rules on top of different formats. language, to create mapping rules (see Figure 3). Even more, it should be independent of the format. In this section, we discuss the gui requirements for the creation of Linked Data mappings (see R2.4: support multiple ontologies. Different ontolo- Section 6.1), the extended RMLEditor’s architec- gies, which model complementary or overlapping ture (see Section 6.2), the gui’s design and fea- aspects of domain-level knowledge, are available. tures (see Section 6.3), and how the mapping rules The gui should support the creation of a set of rules are executed (see Section 6.4). that uses different ontologies at the same time.

R2.5: support multiple alternative modeling ap- 6.1. Requirements proaches. When users create rules they can follow In previous work [29], we introduced a list of de- different modeling approaches [30]. The used ap- sired gui features for the creation of mappings. proach will be dependent on the user’s preference and use case at hand. The gui should enable using R2.1: independent of the underlying mapping lan- different approaches to support users and use cases. guage. The visualization of mapping rules should be independent of the underlying language to exe- R2.6: support non-linear workflows. During the cute the rules. This is to remove dependencies on creation of mapping rules different factors are in- the language and to support the editor to switch the volved, including data, ontologies, vocabularies, underlying mapping language for one that might mapping rules, and resulting Linked Data. When better suit the use case at hand. these factors are presented to the user in isolation, as done with non-linear workflows, the relationships R2.2: support multiple data sources. The gui betweens these factors are obscured. Thus, the gui should support multiple data sources, as a single should support non-linear workflows to enable users Linked Data set might be derived from several. to keep an overview of the different factors. 14 R2.7: independent of mapping execution. The graphs’ graphml representation is used to generate mapping rules definition and execution are different the corresponding rml statements (see Figure 3a). aspects. An editor and its gui should not restrict When users want to load mapping rules into the the execution to a specific library or tool. This im- RMLEditor, they can load (i) a graphml docu- proves the rules’ interoperability and reusability. ment, or (ii) an rml mapping document which is converted to graphml and loaded (Figure 3c). The 6.2. Architecture corresponding graphs are shown in the gui. The RMLEditor’s high-level architecture is based Data transformation descriptions are also en- on the multilayered architecture pattern [51]. This coded in the graphml representation and thus they allows to separate the presentation of the mapping are also present in the resulting rml document components, the logic of the mapping process and which is enriched in this case with descriptions the access to data and external , using the pre- based on the Function Ontology for the data trans- sentation, application, and data access layer, re- formations [18, 17]. When an rml document is spectively. The latter only communicates with the being processed by the extended RMLProcessor application layer. Communication between the pre- server-side, the raw data values are extracted by sentation and data access layer is not possible, as the RMLProcessor, and data transformations are the architecture prohibits communication between handled by passing the raw data values and the layers not directly under or above each other. function which is desired to be applied to the raw The presentation layer consists of several panels data values to the Function Processor. with which users interact. The application layer consists of the modules to process the interactions 6.3. Graphical User Interface triggered through the panels. The data access layer In this section, we discuss the RMLEditor’s gui handles the requests that require data or mapping and its interaction elements to manipulate the rules from outside the RMLEditor. rml [20] is the graph-based visualizations. underlying mapping language of the RMLEditor, because rml supports mapping rules with data 6.3.1. Panels from multiple heterogeneous data sources. The communication with the application layer is The gui consists of three views: Input Panel, facilitated by the command pattern [47] and the Modeling Panel and Results Panel (Figures 3, and flux pattern [7]. The flux pattern replaced the 4) [31]. They are aligned next to each other, but Model-View-Controller (mvc) pattern [47] that was when users want to focus on a specific panel, they used in the initial version of the RMLEditor. The can hide the Input or Results Panel. mvc pattern causes problems in both performance The Input Panel shows the data sources (the and development, because of the bidirectional com- left panel in Figure 4). Multiple data sources can munication, where one change can loop back and be loaded and they can be in different data formats, have cascading effects across the codebase [7]. The such as csv, json, and xml. Each data source flux pattern dictates an unidirectional flow. This is assigned with a unique color to be used in the flow is stricter than the flow in mvc, because it does graphs. In the previous version of the RMLEditor, not allow to start new flows when another flow is this panel only shows the raw data, thus, providing still being executed. For the presentation layer, the a different representation for each data format. Webix JavaScript library12 is used to build the gui The different representations for each format and the d3.js library [8] to build the graphs. were circumvented. Now, the Input Panel is divided The graph-based visualizations are encoded us- into two subpanels. The top panel displays the data ing an XML document, based on Graph Markup sources structure. This makes it independent of the Language (graphml) [9] with custom extensions, data format. The structure can be manipulated if to represent them independently of the underly- data values need to be transformed. The bottom ing mapping language. This allows users to ex- panel shows the raw data of the source. This allows port the graphs, besides the mapping rules, in an users to view both the structure and data values. application-independent format. Additionally, the The Modeling Panel shows the mapping rules using mapvowl (the middle panel in Figure 4). The color of each node and edge depends on the 12http://webix.com/ data source that is used in a specific mapping rule, 15 Figure 4: Overview of the RMLEditor if any. It offers the means to manipulate the nodes and edges of the graphs to update the mapping rules. Semantic annotations can be added using multiple ontologies, which can be either defined locally or online. Linked Open Vocabularies [61] Figure 5: Interaction elements on a node can be consulted via the gui to get suggestions on which classes, properties and datatypes to use. It is possible to use prefixes instead of using a full iri. These prefixes can be customized and defined for both local and online ontologies. Users can consult of the schemas are then assigned to the mapping http://prefix.cc to search for well-known pre- rules. When users start with the ontologies to fixes and their namespaces. generate the mapping rules, the schema-driven ap- In the previous version of the RMLEditor, scal- proach is followed. Next, data fractions from the ability was only addressed using geometric zoom- data sources can be associated to the mapping ing [28], but interactive filtering [55] was intro- rules. duced to address usability issues with large graphs The RMLEditor implements the aforementioned (more details are available at Section 6.5). Geomet- requirements for a gui for Linked Data mappings. ric zooming and interactive filtering are two estab- The visualization of the mapping rules is indepen- lished methods to address large graphs. dent of the underlying mapping language through The Results Panel shows the resulting rdf the use of mapvowl in the Modeling Panel (R2.1). triples when mapping rules created in the Modeling Multiple, heterogeneous data sources are supported Panel are executed on the data in the Input Panel via the Input Panel (R2.2 and R2.3). Multiple (right panel in Figure 4). For each rdf triple, it ontologies are supported via the Modeling Panel shows the subject, predicate, and object. (R2.4). The collaboration between the three panels Furthermore, the use of the three panels al- supports multiple alternative modeling approaches lows to follow different mapping generation ap- (R2.5) and non-linear workflows (R2.6). The in- proaches [30, 31]. The data-driven approach uses dependence of mapping execution is implemented the input data sources as the basis to construct the because the mapping rules can be exported as both mapping rules. Classes, properties, and datatypes graphml and mapping documents (R2.7). 16 Table 3: Interaction Elements Figure 3c). Another mapping language and cor- responding processor can be used, only leading to Element Application adjustments to the RMLEditor’s code that converts show/edit details the visualization to mapping language statements, create relationship while no adjustments to the gui are required.

delete node/edge 6.5. Manipulation of Large Graphs collapse graph When applying graph-based visualizations, large graphs are inevitable. Users have difficulties editing expand graph and keeping an overview of the mapping rules when large-scale graphs are not addressed. Geometric zooming [28] and interactive filtering [55] are two 6.3.2. Interaction established methods to address large graphs. The Besides visually understanding the mapping former provides enlargement of the graphs. The rules, users should also be able to interact with the zoom level can be set through, e.g., scrolling, but- graphs, which results in updating the correspond- tons, or keyboard shortcuts. The latter allows users ing rules. queryvowl provides this functionality to filter out the relevant elements of the graphs. for queries, but its approach can also be applied to vowl does not specify any methods on how mapping rules. Therefore, similar to queryvowl, to deal with large graphs. However, WebVOWL, the RMLEditor uses a set of interaction elements which implements vowl, applies both aforemen- (see Table 3): icons placed on the border of a node tioned methods. Geometric zooming is done or edge (see Figure 5). They correspond with five through scrolling. Interactive filtering is applied actions: edit details of a mapping rule, create re- through a filter feature which gives users access to lationship, delete node/edge, and collapse/expand four options that hide specific information or graph- graph. The different icons were chosen to be sim- ical notations and reduce the size of the graph: hide ple, but concrete and distinctive to increase the user datatype properties, solitary subclasses, informa- performance [41]. Every action has a distinctive tion about disjointness, and set operators. icon, with exception for collapse and expand graph, Similar to vowl, geometric zooming and inter- but they are never shown at the same time. active filtering is not specified by mapvowl. Geo- These icons are only shown when users hover over metric zooming is implemented in the RMLEditor the node or edge to reduce the visual overload. via scrolling, as in WebVOWL. Interactive filtering is implemented via the use of detail levels. Interac- 6.4. Mapping Execution tive filtering requires to define which filters can be Once the mapping rules are created, they can be applied on the graphs. Instead of letting the users executed by the RMLProcessor to generate Linked select the filters, as WebVOWL does, we opted to Data. This might happen also via the RMLEditor. group a set of appropriate filter together in a so- called detail level. The detail levels are arranged We developed a Web , using Node.js13, to sep- arate this functionality from the RMLEditor (see from highest to lowest, where each level applies one Figure 3). This makes it also easy to switch between or more additional filters to the higher level. different tools that execute mapping rules that sup- port multiple, heterogeneous data sources. Highest Level. The highest level shows all visual el- The api offers three functions: (i) executing a ements of mapvowl. This level is used when users mapping document on a set of data sources (see want to inspect and edit all details of the mapping Figures 3b and 3b’); (ii) converting a graphml- rules, and when the general overview of the map- based document to rml to execute the mapping ping rules is less important (see Figure 6). rules using the RMLProcessor (see Figure 3a), and (iii) converting rml statements to a graphml-based High Level. The high level hides the literals’ document to visualize the loaded mapping rules (see datatypes and languages (see Figure 7). This level focuses more on how different graph elements are related to each other, hiding the specifics, which 13https://nodejs.org might be the most redundant for the users. 17 Figure 9: Low detail level in the RMLEditor

Figure 6: Highest detail level in the RMLEditor

Figure 10: Lowest detail level in the RMLEditor

Moderate Level. The moderate level hides the prop- erties of the edges (see Figure 8). This puts more focus on which subjects and objects are related to each other, instead of how, i.e., only the relation- ship, without a specific predicate. Furthermore, the graphs become less dense, leading to a better Figure 7: High detail level in the RMLEditor overview of the mappings rules.

Low Level. The low level hides the literals (see Fig- ure 9). This puts more focus on how the different entities are related, and more specific how the dif- ferent data sources are connected. Furthermore, the graphs become less dense, leading to a better overview of the mappings rules.

Lowest Level. The lowest level hides the blank nodes (see Figure 10). This only shows entities that have an iri assigned, which allows the users to deal only with entities of the different data sources and how they are directly related, to improve their un- derstanding of the links between the data. More, when users are at a certain detail level, they might want to show or hide details of a specific subgraph, without the desire to change the overall Figure 8: Moderate detail level in the RMLEditor level. In the RMLEditor, this is accomplished using the collapse and expand actions (see Section 6.3.2). When a mapping document is loaded in the RMLEditor an appropriate detail level is selected 18 based on the graphs size, i.e., the amount of nodes Table 4: Blocks of the mapvowl vs. rml questionnaire is higher than a specific threshold, to provide users block description with the mapping rules overview. If the level is Questions about participants’ too high for the size of the graph, users lose the introduction socio-demographics overview of the mapping rules, which makes it dif- & Linked Data expertise ficult to navigate to the part of the graph which is mapvowl-specific questions desired to be edited. mapvowl vs. rml followed by four test cases to compare mapvowl & rml 6.6. Heterogeneous Data Values Manipulation Questions on preference Data sources might have heterogeneous data for- of mapvowl vs rml post-assessment mats. Thus, user interfaces for mapping rules need and the use of mapvowl to support multiple data formats. Moreover, pre- for editing mapping rules senting both the data structure and the data values allows users to gain insights in both the data model, and corresponding data. Transforming one or more at https://w3id.org/mapvowl/eval17/results, data fractions when generating Linked Data might and, in Section 7.1.3, the corresponding derived in- also be required. For example, if a data source has sights. a data fraction that contains a date in the format “DD/MM/YYYY”. A transformation is needed to 7.1.1. Method convert it to “YYYY-MM-DD” when the datatype Procedure. Participants with rml knowledge were xsd:date is used for the corresponding literal. directly contacted by the author. Those who agreed In the RMLEditor, the support for manipulat- to partake in the test had to read an introduction to ing heterogeneous data values and data sources is mapvowl14 and complete an online questionnaire. facilitated by the Input Panel (see Figure 4). It fea- The introduction explains the different elements of tures both the structure of the data, and the raw mapvowl through the use of an example which data. The structure includes data fractions that was provided to assure that participants have a ba- users may use to generate the desired Linked Data sic understanding of mapvowl, too. The partici- and can be updated to reflect also the required data pants completed an online questionnaire, which can transformations. This shows that the transformed be found at https://w3id.org/mapvowl/eval17/ data values can be used as any other data value. survey. The questionnaire consists of three main building blocks (see Table 4): (i) The first block questioned the participants’ 7. Evaluation main sociodemographic traits, such as year of We conducted two comparative studies to vali- birth, gender, level of education, and employment date our two hypotheses (see Section 4): one study status. Then, it measured the participants’ Linked compares the representation of mapping rules via Data expertise through a self-assessment and a mapvowl and via rml directly, and another study study of their familiarity with the topic and tools. compares the creation of mapping rules via the (ii) The second and essential block provides a RMLEditor and the RMLx Visual Editor. test with eleven questions about mapping rules that were presented as a mapvowl visualization to 7.1. MapVOWL vs. RML access the understanding of the mapvowl elements and ten questions about four test cases to compare With this study, we aim to validate H1. We a mapvowl representation with an rml represen- compared what knowledge users are able to ex- tation. Three of the use cases are real-world ex- tract from both a set of mapping rules that are amples and one is artificially created for the study. represented via mapvowl or rml directly. This Half of the participants answered the questions of knowledge aligns directly with the six requirements the first two use cases via the mapvowl represen- for a visual notation (see Section 5.1). Further- tation and the last two via rml. The other half of more, we evaluated which representation they pre- the participants answered the questions of the first fer. In Section 7.1.1, we discuss the method, namely the procedure and participants. In Section 7.1.2, we discuss the results, which can also be found 14https://w3id.org/mapvowl/eval17/intro 19 Table 5: › correct answers regarding mapvowl expert correct proficient questions answers (max. 9) developing ›

level literals per entity/relationships 9 emerging used ontology/language entities without a class 8 novice how iris are generated › literals and their datatypes 0 1 2 3 4 7 datatypes of literals › participants › distinct data sources 6 › entities w/ data source Figure 11: Level of Linked Data expertise of participants of › relationships ind. of data source 5 MapVOWL user study

Table 6: › correct answers when using mapvowl and rml two use cases via the rml representation and the correct answers questions last two via mapvowl. Participants were randomly RML MapVOWL assigned to one of the two aforementioned groups. types of entities (iii) The post-assessment consists of three › literals 51 61 questions and gathers information about the par- relationships b/t entities ticipants’ preference regarding the use of mapvowl › (unlinked) data sources 65 50 versus rml to answer the questions, and whether the language of literals they want to use ap to also edit mapping m vowl use of templates rules, besides only to visualize them. 24 24 use of datatypes Participants. The online questionnaire was sent out to potential participants in August 2017. Nine a data source. One question is answered correctly participants have eventually taken part in the ex- by 5 participants: determining the number of re- periment, their age range was 20 to 38. All were lationships that do not depend on a data source. highly educated: three of the contributors had a This is summarized in Table 5. PhD, four a master’s degree and two a bachelor’s Regarding the questions about the use cases, the degree. They were highly familiar with Linked Data number of correctly answered questions is slightly as well: 6 of the participants claimed they already higher for rml than mapvowl (140 correct answers generated Linked Data and only one declared to vs. 135), as summarized in Table 6. Questions re- have merely a basic understanding of Linked Data garding the types of entities, number of literals, and (see Figure 11). relationships between entities are answered more of- ten correctly via mapvowl than rml (61 correct 7.1.2. Results answers vs. 51). Questions regarding the number Mapping Rules. When answering the eleven ques- of (unlinked) data sources and the language of liter- tions about the elements of mapvowl, four ques- als are answered more often correctly via rml than tions are answered correctly by all participants: de- mapvowl (65 correct answers vs. 50). Questions termining the number of literals per entity and re- regarding the use of templates and datatypes are lationships, the used ontology and language. Two answered correctly as often via mapvowl as via questions are answered correctly by 8 participants: rml (24 correct answers). detecting entities without a class and determining how iris are generated. Two questions are an- Post-Assessment. 7 participants preferred the ap- swered correctly by 7 participants: detecting the plication of the mapvowl notation for viewing the number of literals and their datatypes. Two ques- presented cases. 1 participant preferred using rml tions are answered correctly by 6 participants: de- directly over mapvowl and 1 was neutral (see Fig- termining the number of distinct data sources and ure 12). Similarly, when they were asked which they the number of entities that are not associated with prefer to use for editing mapping rules, 7 partici- 20 Table 7: Steps of the RMLEditor vs. RMLx Visual Editor user study neutral step description RML Questions on participants’ introduction socio-demographics MapVOWL & Linked Data expertise

preferred language 0 2 4 6 2 use cases per participant use cases on mapping rules creation › participants Questions on heterogeneous additional Figure 12: Preferred language for viewing the rules data sources & data values, RMLEditor test & RMLEditor’s large graphs Questions on participants’ post-assessment neutral experience with the used tool

disagree In Section 7.2.1, we discuss the participants and

preference agree procedure. In Section 7.2.2, we discuss the results, 0 2 4 6 and in Section 7.2.3, the derived insights. › participants 7.2.1. Method Figure 13: Participants’ agreement to use mapvowl to edit Procedure. This second study is modeled according the rules to the following procedure1516. Potential partic- ipants from imec and Ghent University were ap- proached. Those agreeing to contribute to the pants agreed that they prefer mapvowl, 1 slightly user evaluations then had to read an introduc- disagreed and 1 remained neutral (see Figure 13). tion17 to the rml and the RMLx Visual Editor and work through the different steps of the study 7.1.3. Insights together with one of the authors. More details The answers to mapvowl-specific questions pro- about the experimental setting of the latter can be vide evidence that the visual notation is cog- found at https://w3id.org/rml/editor/eval17/ nitive effective. When comparing the answers of setting. The introduction explains the different questions that are both answered using mapvowl elements of both tools. For the RMLEditor, it also and rml, it provides evidence that both options includes the introduction to mapvowl, because it are possible candidates to represent mapping rules. implements mapvowl. The user study is divided However, the post-assessment shows that users into four steps (see Table 7): prefer MapVOWL over rml to visualize map- (i) The participants are presented with questions ping rules, even though rml results in a slightly about their socio-demographics and Linked Data higher number of correct answers. expertise, identical to ones asked to participants of the mapvowl vs. rml study (Section 7.1). (ii) Two use cases are presented to the users for 7.2. RMLEditor vs. RMLx Visual Editor which mapping rules need to be created. The first With this study, we aim to validate H2 by com- use case deals with data about employees and the paring the use of the RMLEditor and RMLx Vi- projects they are working on. Both data sources sual Editor to edit mapping rules. Specifically, the are provided as a csv file. This use case is data- visualization of the mapping process components driven: every data fraction should be represented with special attention to the rules is studied. The in the resulting Linked Data. The second use case RMLx Visual Editor was chosen, as it allows to deals with data about movies and their directors. annotate multiple, heterogeneous data sources and values. More, it also uses graph visualizations to 15https://w3id.org/rml/editor/eval17/survey1 represent the mapping rules; however a form-based 16https://w3id.org/rml/editor/eval17/survey2 approach is used to edit the rules. 17https://w3id.org/rml/editor/eval17/intro 21 The movie data source is provided as a csv file expert and the director data as a json file. This use case is schema-driven: based on a given set of classes, proficient properties, and datatypes Linked Data needs to be generated using the provided raw data. developing Half of the participants completed the first use level case with the RMLEditor and the second use case emerging with the RMLx Visual Editor. The other half com- pleted the first use case with the RMLx Visual Ed- novice itor and the second use case with the RMLEditor. 0 2 4 6 8 Participants were randomly assigned to one of the › participants two groups. The outcomes were assessed based on the generated Linked Data correctness and the ex- Figure 14: Level of Linked Data expertise of participants of perience with each tool was recorded. RMLEditor vs. RMLx Visual Editor user study (iii) An additional test was completed for the RMLEditor. Here, the participant got presented Table 8: Completeness of rules and SUS-score of RMLEditor with an online survey, containing specific questions and RMLx Visual Editor regarding the heterogeneous data sources and val- RMLx ues, and large graphs. During the second and third RMLEditor Visual Editor step of the user study, each participant was super- vised by one of the authors making use of the think- completeness of rules (%) aloud protocol. Widely used by usability profes- use case 1 91 83 sionals, think-aloud originated out of the incapa- use case 2 98 82 bility to know what users think when completing SUS-score tasks [46]. In the concurrent think-aloud process as 82.75 (good) 42 (poor) implemented in the current usability study, users are probed to specify their thoughts and actions as they occur. This allows the attending author to dis- pertise, required for a meaningful participation in cern when and why a participant faces difficulties. the user study. A total of 10 participants, age range 27 to 39, were recruited. Their level of education (iv) After each use case, a post-assessment was high: they all hold a master’s degree, have con- gathers information about the participants’ expe- ducted advanced graduate work, or hold a PhD. 8 rience with the used tool. Included are questions participants assessed themselves as having at least about the perceived difficulty of the tasks and con- a proficient level of Linked Data expertise, 1 con- fidence in a successful completion. A central part of sidered himself to be a novice and 1 observed his the post-assessment is assigned to the System Us- Linked Data expertise as emerging (see Figure 14). ability Scale (sus) [10], a procedure to quickly col- They all had heard of the RMLEditor before, 2 par- lect users’ usability ratings of a technology. Benefits ticipants had used it; whereas 6 participants had of applying sus are its conciseness, its transferabil- taken notice of the RMLx Visual Editor before. ity over several technologies, and applicability with small sample sizes [1, 59]. As proposed by Bangor et al. [1], an adjective rating using a 7-point Lik- 7.2.2. Results ert scale, which assesses the tools user-friendliness Mapping Rules. An analysis of the created map- from worst imaginable to best imaginable, was also ping rules by the participants, which can be added to the questionnaire. It should be sharply found at https://w3id.org/rml/editor/eval17/ noted though that the outcome of the adjective rat- results/mappings, revealed several aspects that ing and the system usability scale should be sup- can be circumvented by updating the gui of the ported by observations by the attending supervisor. tools. For the first use case 91% of the ex- pected mapping rules were present when using the Participants. The participants of our study were RMLEditor. In the case of the RMLx Visual Edi- people from imec and Ghent University, that were tor, it was only 83%. The same is observed for the contacted directly by the authors. This ensured second use case with 98% for the RMLEditor and that they had a sufficient level of Linked Data ex- 82% for the RMLx Visual Editor (see Table 8). 22 In more details, the iterator was forgotten by extremely difficult 3 participants using the RMLx Visual Editor, but moderately difficult this was not an issue with the RMLEditor. The use of constants, reference, and templates was incorrect slightly difficult for 2 participants using the RMLx Visual Editor, neither easy nor difficult but this was not an issue with the RMLEditor. slightly easy All participants added the data transformation difficulty with both tools, but 4 participants forgot to use moderately easy the output of the transformation as an object us- extremely easy ing the RMLx Visual Editor, while only 2 using RMLEditor 0 2 4 6 the RMLEditor. This lead to triples that did not RMLx Visual Editor › participants contained the transformed data value, but the orig- inal value. When interlinking datasets, all but one Figure 15: Difficulty of the RMLEditor vs. RMLx Visual participant created the required mapping rule using Editor the RMLx Visual Editor. Using the RMLEditor, all participants created the mapping rule, but 3 partic- ipants did not provide the required join condition neutral which interlinks entities that are not related. This was not an issue using the RMLx Visual Editor. RMLx Visual Editor Furthermore, we pay special attention to new fea- confidence RMLEditor tures of the RMLEditor introduced in this work. Users are able to distinguish the different data 0 2 4 6 sources, but when they have to count the total › participants number of data sources that are loaded 4 partic- ipants forget to count the currently selected data Figure 16: Confidence of using the RMLEditor vs. RMLx source. 9 participants were able to correctly deter- Visual Editor mine how many data transformations are applied when presented with a new set of mapping rules. They were also able and perceived it as easy to de- Visual Editor, all participants except 1 consistently termine which data transformation and data frac- rated executing the tasks more difficult than with tions are used. When users create and edit mapping the RMLEditor. 5 participants scored it as “slightly rules that link entities in large graph visualizations, difficult” and the other 3 as “moderately difficult”. 7 of them use the lower detail levels. When users The confidence for having correctly executed the create and edit mapping rules that generate literals tasks was positively rated for both tools. 6 partici- the preferred detail levels varies from the highest pants were more confident in performing the tasks (2 participants) to the lowest level (1 participant), with the RMLEditor, 1 rated both tools equal and 3 and 2 participants even stated no preference. were more confident with the RMLx Visual Editor (see Figure 16). Post-Assessment. The first part of the post- The second part of the post-assessment applied assessment questioned the participants with regard the sus and the accompanying user-friendliness to the difficulty of performing the use case with the rating proposed by Bangor et al. [1]. The user- tools and their confidence in a successful completion friendliness of the RMLEditor was rated on a 7- of the tasks. The results of the post-assessment point likert scale. 9 participants scored it as ex- can be found at https://w3id.org/rml/editor/ cellent, the other 1 rated it as good. In contrast, eval17/results/post. The difficulty of perform- the user-friendliness of the RMLx Visual Editor was ing the use cases was rated on a 7-point likert scale valuated as “good” by 1 respondent, “ok” by 1 re- from extremely difficult to extremely easy. spondent, “poor” by 6 respondents and “awful” by Regarding difficulty, 1 participant rated complet- the remaining 2 (see Figure 17). ing the tasks with the RMLEditor as “slightly dif- More, all except 1, who rated both as good, ficult”, all others assessed it as “neither easy nor assessed the user-friendliness of the RMLEditor difficult” or higher, with 3 participants valuating it higher than that of the RMLx Visual Editor. The at “extremely easy” (see Figure 15). For the RMLx average obtained mean sus-score of the RMLEditor 23 best imaginable formation. This happened for 3 participants using the RMLx Visual Editor and 6 participants using excellent the RMLEditor. The difference occurs because the good default parameter was already set in the RMLx Vi-

ok sual Editor, whereas no default parameter is set

poor for the RMLEditor. Nevertheless, still 3 partici-

likert scale pants have trouble understanding these parameters aweful as shown via the RMLx Visual Editor. worst imaginable 5 participants required additional information about the use of a baseURI and how it is defined RMLEditor 0 2 4 6 8 in the RMLEditor. Although it is perceived as use- RMLx Visual Editor › participants ful, the participants indicated that a clearer pre- sentation in the gui is required. The RMLx Visual Figure 17: User-friendliness of the RMLEditor vs. RMLx Visual Editor Editor does not offer this functionality. A common issue with both tools is the use of constants, references, and templates to generate is 82.75, whereas for the RMLx Visual Editor, it is subjects, predicates, and objects. 9 participants 42 (see Table 8). When translating these scores on for both tools required additional information to the spectrum created by Bangor et al. [1], the us- understand how they work and when they should ability of the RMLEditor is identified as good, but be used. Although the RMLEditor selects a default that of the RMLx Visual Editor as poor. depending on the element in the graph, additional Two remarks: (i) a single metric should not be information was still required for the users to fully used to make absolute statements about a systems’ understand these three options. usability [1]. The observations done by one of the 8 participants required additional information to authors and elaborated in the next paragraph give understand how to interlink entities originating an enhanced understanding of the process of utiliz- from different data sources using the RMLx Visual ing both tools; (ii) one of the authors was present Editor. “Parent mapping”, “parent value”, and during the user study, which could result in a mod- “child value” were terms difficult to be understood, erator acceptance bias. This happens when partic- and participants had trouble understanding how ipants want to gratify the present author. they would affect the resulting Linked Data. This was not an issue during the use of the RMLEditor, Observations. During the study, participants were because of the use of mapvowl. supervised by one of the authors. The results can be 2 participants required information to under- found at https://w3id.org/rml/editor/eval17/ stand that an iterator for a csv file is not required results/obs. Participants could ask questions and when using the RMLx Visual Editor. This was not receive feedback while using both tools. Although, an issue using the RMLEditor, because an iterator users were provided with an introduction to both cannot be set when using a csv file as data source, tools, which they could consult during the study, considering the iteration always happens per row. they still asked questions about how certain actions 3 participants required information about the use should be performed. of the “output var”-field in the RMLx Visual Edi- The participants referred to the introduction to tor. This field represents the output of a data trans- find the required information. This happened for 8 formation and the string can be used in other map- participants when using the RMLEditor and for 9 ping rules to use the output of a data transforma- participants when using the RMLx Visual Editor. tion. This was not an issue using the RMLEditor, The RMLx Visual Editor does not allow to visu- because data transformations are integrated in the alize the data sources. Therefore, every participant original data fractions in the input panel. used text editors and spreadsheet tools to view the data sources. In the case of the RMLEditor only 1 7.2.3. Insights participant first viewed the data sources externally The resulting mapping rules provide evidence before loading them in the RMLEditor. that the use of the RMLEditor leads to a higher Participants required additional information completeness of rules, and thus of Linked Data, about the use of parameters to create a data trans- compared to the RMLx Visual Editor. For both 24 tools updates to the gui could improve this com- of the mapping rules graph-based visual representa- pleteness, such as the inclusion of checks for the tion to generate the Linked Data compared to using use of transformed values and iterators. The a mapping language directly.” RMLEditor was rated as good and the RMLx Vi- The results of the second study (Section 7.2) sual Editor as poor by the participants. This is in show that the RMLEditor is preferred over the line with the completeness of the rules. RMLx Visual Editor for creating and editing map- The observations during the user study made ping rules. Participants were able to create a high clear that specific terminology, such as baseURI, percentage of the required mapping rules using both reference, and template, needs further clarifica- tools (high accuracy). The RMLEditor has a bet- tion. This could be done, for example, through ter usability compared to the RMLx Visual Edi- tool tips in the gui or by avoiding the use of the tor, as shown through the sus-score (82.75 vs 42), terminology. The observations provided evidence making it easier for users to create and edit rules. for the use of the RMLEditor over the RMLx Vi- This is due to the use of mapvowl, which is in- sual Editor when interlinking entities due to the dependent of the underlying mapping language, as use of the visual notation in the former and the use pointed out by the participants. However, through of forms and specific terminology in the latter, as the use of think-aloud, we conclude that updates explicitly mentioned by the participants. to both guis would improve the mapping process, with specific focus on the creation of the mapping 8. Conclusion rules. The RMLx Visual Editor’s form-based gui is strongly influenced by its underlying mapping lan- Visual tools are implemented to help users in guage rml, which leads to issues for users that are defining how to generate Linked Data from raw unfamiliar with rml. These issues can only be re- data. This is possible thanks to mapping lan- solved either by acquiring knowledge about rml or guages which enable detaching mapping rules from avoiding to rely on knowledge of rml, as done by the implementation that executes them. However, the RMLEditor. Therefore, this study provides ev- no thorough research has been conducted so far on idence towards the acceptance of H2 “The cogni- how to visualize such mapping rules, especially if tive effectiveness provided by the RMLEditor’s gui they become large and require considering multiple improves the user’s performance during the Linked heterogeneous data sources and transformed data Data generation process – that is based on legacy values. In this article we introduced mapvowl, a non-rdf and (semi-)structured data sources – com- graph-based visual notation for mapping rules; and pared to the state of the art.” an approach for manipulating rules when large vi- mapvowl fills the important gap between uni- sualizations emerge; an approach to uniformly vi- form mapping rule visualizations and the different sualize data fractions of raw data sources combined mapping editors. It allows multiple tools to incor- with an interactive interface for uniform data frac- porate the same visualization, which increases the tion transformations. mapvowl and the two ap- accessibility for users across these tools. The differ- proaches were implemented in the RMLEditor to ent semantic constructs of the mapping rules align provide users with a uniform gui to create and edit with the graphical notations of the graph-based vi- mapping rules that is cognitive effective. sualization. This leads to an effective representa- The results of the first study (Section 7.1) show tion for users. Furthermore, it combines the differ- that the use of mapvowl is preferred over rml for ent components of the mapping process, providing representing mapping rules. Answering questions users with insights on the interrelation of these com- about these rules leads to the same correctness (ac- ponents. Users are able to define and deduct which curacy in cognitive effectiveness) by either using data fractions of which data sources are used in mapvowl or rml for users who understand rml. which mapping rules, and deduct how the mapping Nevertheless, most participants indicated that they rules correspond with the generated Linked Data. preferred to use mapvowl as representation for Moreover, mapvowl and the RMLEditor over- mapping rules, and they would also use mapvowl come two issues of current mapping editors, namely, to create and edit rules. This shows that mapvowl the uniform presentation of heterogeneous data is easier to use than rml directly. Therefore, this sources and data transformations within Linked study provides evidence towards the acceptance of Data generation. Multiple data sources are sup- H1 “mapvowl improves the cognitive effectiveness ported and color is an appropriate method to de- 25 note them. Distinguishing the data fractions that [6] Christian Bizer, Jens Lehmann, Georgi Kobilarov, users rely on to generate Linked Data from the orig- S¨oren Auer, Christian Becker, Richard Cyganiak, inal raw data allows to create an abstraction layer. and Sebastian Hellmann. DBpedia - A crystal- lization point for the Web of Data. Web Se- Such an abstraction layer not only enables to easily mantics: Science, Services and Agents on the manipulate the data fractions but also incorporates , 7(3):154–165, 2009. ISSN data transformations uniformly for heterogeneous 1570-8268. doi: http://dx.doi.org/10.1016/j.websem. 2009.07.002. URL http://www.sciencedirect.com/ data sources. This allows users to handle trans- science/article/pii/S1570826809000225. formed data values just as raw values. [7] Adam Boduch. Flux Architecture. Packt Publishing, Large graphs might result in a reduction of the 2016. graph-based visualizations’ efficiency. However, the [8] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 Data-Driven Documents. IEEE Transactions on Vi- second study shows that detail levels are appropri- sualization and Computer Graphics, 17(12):2301–2309, ate to compensate this for mapping rules. It is effi- 2011. cient for the users to change the detail levels based [9] Ulrik Brandes, Markus Eiglsperger, Ivan Herman, on the task at hand, and personal preference. Michael Himsolt, and M Scott Marshall. GraphML Progress Report Structural Layer Proposal. In Interna- Overall, if users need to create and edit mapping tional Symposium on Graph Drawing, pages 501–512. rules, the RMLEditor is a valid choice through the Springer Berlin Heidelberg, 2001. use of the proposed mapvowl and the approaches [10] John Brooke. SUS – A quick and dirty usability scale. to deal with large visualizations and heterogeneous Usability Evaluation in Industry, pages 189–194, 1996. [11] Karlis Cerans, Julija Ovcinnikova, Renars Liepins, and data sources and values. Although, improvements Arturs Sprogis. Advanced OWL 2.0 Ontology Visu- to the gui are required, which will be addressed alization in OWLGrEd. In Databases and Informa- in future work, the gui’s core elements have been tion Systems VII, pages 41–54. IOS Press, 2013. doi: 10.3233/978-1-61499-161-8-41. proven to be effective via the two user studies. [12] The UniProt Consortium. Uniprot: a hub for protein in- formation. Nucleic Acids Research, 43(D1):D204, 2015. doi: 10.1093/nar/gku989. Acknowledgements [13] World Wide Web Consortium. RDF 1.1 Concepts and Abstract Syntax. Technical report, 2014. URL https: The described research activities were funded by //www.w3.org/TR/rdf11-concepts/. Ghent University, imec, Flanders Innovation & En- [14] Aba-Sah Dadzie and Emmanuel Pietriga. Visualisation of Linked Data – Reprise. Semantic Web, 8(1):1–21, trepreneurship (AIO), the Research Foundation – 2017. doi: 10.3233/SW-160249. Flanders (FWO), and the European Union. [15] Aba-Sah Dadzie and Matthew Rowe. Approaches to visualising linked data: A survey. Semantic Web, 2(2): 89–124, 2011. References [16] Souripriya Das, Seema Sundara, and Richard Cyganiak. R2RML: RDB to RDF Mapping Language. Working group recommendation, W3C, September 2012. URL [1] Aaron Bangor, Philip Kortum, and James Miller. De- termining what individual sus scores mean: Adding an http://www.w3.org/TR/r2rml/. adjective rating scale. Journal of usability studies, 4(3): [17] Ben De Meester and Anastasia Dimou. The Func- 114–123, 2009. tion Ontology. Unofficial Draft, October 2016. http: [2] Fran¸cois Belleau, Marc-Alexandre Nolin, Nicole //users.ugent.be/~bjdmeest/function/. Tourigny, Philippe Rigault, and Jean Morissette. [18] Ben De Meester, Anastasia Dimou, Ruben Verborgh, Bio2RDF: Towards a mashup to build bioinformatics Erik Mannens, and Rik Van de Walle. An ontology to knowledge systems. Biomedical Informatics, 41(5):706– semantically declare and describe functions. In Har- 716, 2008. ald Sack, Giuseppe Rizzo, Nadine Steinmetx, Dunja [3] Nikos Bikakis, Melina Skourla, and George Papaste- Mladeni´c,S¨orenAuer, and Christoph Lange, editors, fanatos. rdf:SynopsViz – A Framework for Hierarchical The Semantic Web; ESWC 2016 Satellite Events, vol- Linked Data Visual Exploration and Analysis. In The ume 9989 of Lecture Notes in Computer Science, pages Semantic Web: ESWC 2014 Satellite Events, pages 46–49. Springer International Publishing, October 2016. 292–297. Springer International Publishing, 2014. ISBN doi: 10.1007/978-3-319-47602-5 10. 978-3-319-11955-7. doi: 10.1007/978-3-319-11955-7 37. [19] Laurens De Vocht, Selver Softic, Ruben Verborgh, Erik [4] Nikos Bikakis, George Papastefanatos, Melina Skourla, Mannens, and Martin Ebner. ResXplorer: Revealing and Timos Sellis. A Hierarchical Aggregation Frame- relations between resources for researchers in the Web work for Efficient Multilevel Visual Exploration and of Data. Computer Science and Information Systems, Analysis. Semantic Web, 8(1):139–179, 2017. 14(1):25–50, 2017. [5] Christian Bizer, Tom Heath, and Tim Berners-Lee. [20] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Linked data: The story so far. Semantic Ser- Ruben Verborgh, Erik Mannens, and Rik Van de Walle. vices, Interoperability and Web Applications: Emerg- RML: A Generic Language for Integrated RDF Map- ing Concepts, pages 205–227, 2009. doi: 10.4018/ pings of Heterogeneous Data. In Workshop on Linked 978-1-60960-593-3.ch008. Data on the Web, 2014. 26 [21] Martin Dzbor, Enrico Motta, Carlos Buil, Jose Gomez, Masoon Alam. Scalable Visualization of Semantic Nets Olaf Goerlitz, and Holger Lewen. Developing ontologies using Power-Law Graphs. Applied Mathematics & In- in owl: An observational study. In OWL: Experiences formation Sciences, 8(1):355–367, 2014. and Directions 2006, November 2006. [35] Muhammad Javed, Sandy Payette, Jim Blake, and Tim [22] Syeda Sana e Zainab, Muhammad Saleem, Qaiser Worrall. VIZ–VIVO: Towards Visualizations-driven Mehmood, Durre Zehra, Stefan Decker, and Ali Has- Linked Data Navigation. In Proceedings of the Second nain. FedViz: A Visual Interface for SPARQL Queries International Workshop on Visualization and Interac- Formulation and Execution. In Proceedings of the In- tion for Ontologies and Linked Data, page 80, 2016. ternational Workshop on Visualizations and User In- [36] Sergey Krivov, Richard Williams, and Ferdinando Villa. terfaces for Ontologies and Linked Data, pages 49–60, GrOWL: A tool for visualization and editing of OWL 2015. ontologies. Web Semantics: Science, Services and [23] Ivan Ermilov, Jens Lehmann, Michael Martin, and Agents on the World Wide Web, 5(2):54–57, 2007. ISSN S¨orenAuer. LODStats: The Data Web Census Dataset. 1570-8268. doi: http://dx.doi.org/10.1016/j.websem. In Proceedings of 15th International Semantic Web 2007.03.005. Conference - Resources Track (ISWC’2016), 2016. [37] Jill H Larkin and Herbert A Simon. Why a diagram [24] J´erˆomeEuzenat and Pavel Shvaiko. Ontology Match- is (sometimes) worth ten thousand . Cognitive ing. Springer Berlin Heidelberg, 2013. doi: 10.1007/ science, 11(1):65–100, 1987. 978-3-642-38721-0. [38] Steffen Lohmann, Vincent Link, Eduard Marbach, and [25] Riccardo Falco, Aldo Gangemi, Silvio Peroni, David Stefan Negru. WebVOWL: Web-based Visualization of Shotton, and Fabio Vitali. Modelling OWL Ontolo- Ontologies. In Knowledge Engineering and Knowledge gies with Graffoo. In The Semantic Web: ESWC 2014 Management, pages 154–158. Springer, 2014. Satellite Events, pages 320–325. Springer International [39] Steffen Lohmann, Stefan Negru, and David Bold. The Publishing, 2014. doi: 10.1007/978-3-319-11955-7 42. Prot´eg´eVOWL Plugin: Ontology Visualization for Ev- [26] Florian Haag, Steffen Lohmann, Stephan Siek, and eryone. In The Semantic Web: ESWC 2014 Satellite Thomas Ertl. QueryVOWL: Visual Composition of Events, pages 395–400. Springer International Publish- SPARQL Queries. In The Semantic Web: ESWC 2015 ing, 2014. doi: 10.1007/978-3-319-11955-7 55. Satellite Events, pages 62–66. Springer International [40] Steffen Lohmann, Stefan Negru, Florian Haag, and Publishing, 2015. doi: 10.1007/978-3-319-25639-9 12. Thomas Ertl. Visualizing ontologies with VOWL. Se- [27] Philipp Heim, Sebastian Hellmann, Jens Lehmann, mantic Web, 7(4):399–419, 2016. Steffen Lohmann, and Timo Stegemann. RelFinder: [41] Sin´eJP McDougall, Oscar de Bruijn, and Martin B Revealing relationships in RDF knowledge bases. In Curry. Exploring the effects of icon characteristics on Semantic Multimedia, pages 182–187. Springer Berlin user performance: the role of icon concreteness, com- Heidelberg, 2009. doi: 10.1007/978-3-642-10543-2 21. plexity, and distinctiveness. Journal of Experimental [28] Ivan Herman, Guy Melan¸con,and M Scott Marshall. Psychology: Applied, 6(4):291, 2000. Graph Visualization and Navigation in Information Vi- [42] Daniel Moody. The “physics” of Notations: Toward sualization: A Survey. IEEE Transactions on Visual- a Scientific Basis for Constructing Visual Notations in ization and Computer Graphics, 6(1):24–43, 2000. doi: Software Engineering. IEEE Transactions on Software 10.1109/2945.841119. Engineering, 35(6):756–779, 2009. doi: 10.1109/TSE. [29] Pieter Heyvaert, Anastasia Dimou, Ruben Verborgh, 2009.67. Erik Mannens, and Rik Van de Walle. Towards a Uni- [43] Enrico Motta, Silvio Peroni, Jos´eManuel G´omez-P´erez, form User Interface for Editing Mapping Definitions. In Mathieu d’Aquin, and Ning Li. Visualizing and navi- Proceedings of the 4th International Workshop on In- gating ontologies with KC-Viz. In Ontology Engineering telligent Exploration of Semantic Data (IESD 2015). in a Networked World, pages 343–362. Springer Berlin CEUR-WS.org, 2015. Heidelberg, 2012. doi: 10.1007/978-3-642-24794-1 16. [30] Pieter Heyvaert, Anastasia Dimou, Ruben Verborgh, [44] Stefan Negru and Steffen Lohmann. A Visual Nota- Erik Mannens, and Rik Van de Walle. Towards Ap- tion for the Integrated Representation of OWL Ontolo- proaches for Generating rdf Mapping Definitions. In gies. In Proceedings of the 9th International Conference Proceedings of the ISWC 2015 Posters & Demonstra- on Web Information Systems and Technologies (WE- tions Track. CEUR Workshop Proceedings, 2015. BIST ’13), pages 308–315. SciTePress, 2013. [31] Pieter Heyvaert, Anastasia Dimou, Aron-Levi Herre- [45] Natalya F Noy, Michael Sintek, Stefan Decker, Mon- godts, Verborgh Ruben, Schuurman Dimitri, Mannens ica Crub´ezy, Ray W Fergerson, and Mark A Musen. Erik, and Van de Walle Rik. RMLEditor: A Graph- Creating Semantic Web Contents with Prot´eg´e-2000. Based Mapping Editor for Linked Data Mappings. In IEEE Intelligent Systems, 16(2):60–71, 2001. doi: The Semantic Web. Latest Advances and New Do- 10.1109/5254.920601. mains, pages 709–723. Springer International Publish- [46] Erica L Olmsted-Hawala, Elizabeth D Murphy, Sam ing, 2016. doi: 10.1007/978-3-319-34129-3 43. Hawala, and Kathleen T Ashenfelter. Think-aloud pro- [32] Frederik Hogenboom, Viorel Milea, Flavius Frasin- tocols: a comparison of three think-aloud protocols for car, and Uzay Kaymak. RDF-GL: A SPARQL-Based use in testing data-dissemination web sites for usabil- Graphical Query Language for RDF. In Emergent ity. In Proceedings of the SIGCHI conference on human Web Intelligence: Advanced Information Retrieval, factors in computing systems, pages 2381–2390. ACM, pages 87–116. Springer London, 2010. doi: 10.1007/ 2010. 978-1-84996-074-8 4. [47] Addy Osmani. Learning JavaScript Design Patterns. [33] Matthew Horridge. OWLViz, 2010. URL http:// O’Reilly Media, 2012. protegewiki.stanford.edu/wiki/OWLViz. [48] Allan Paivio. Mental Representations: A Dual Coding [34] Ajaz Hussain, Khalid Latif, A Rextin, Amir Hayat, and Approach. Oxford University Press, 1990.

27 [49] Emmanuel Pietriga. Semantic web data visualization with graph style sheets. In Proceedings of the 2006 ACM Symposium on Software Visualization, SoftVis ’06, pages 177–178, New York, NY, USA, 2006. ACM. ISBN 1-59593-464-2. doi: 10.1145/1148493.1148532. [50] Eric Prud’hommeaux and Andy Seaborne. SPARQL query language for RDF. Technical report. URL http: //www.w3.org/TR/rdf-sparql-query/. [51] Mark Richards. Software Architecture Patterns. O’Reilly Media, 2015. [52] Alistair Russell, Paul R. Smart, Dave Braines, and Nigel R. Shadbolt. NITELIGHT: A Graphical Tool for Construction. In Proceedings of the Fifth International Workshop on Semantic Web User Interaction (SWUI 2008), 2008. [53] Simon Scheider, Auriol Degbelo, Rob Lemmens, Corn´e van Elzakker, Peter Zimmerhof, Nemanja Kostic, Jim Jones, and Gautam Banhatti. Exploratory querying of SPARQL endpoints in space and time. Semantic Web, 8(1):65–86, 2017. [54] Kunal Sengupta, Peter Haase, Michael Schmidt, and Pascal Hitzler. Editing R2RML Mappings Made Easy. In Proceedings of the 2013th International Conference on Posters & Demonstrations Track, pages 101–104. CEUR-WS.org, 2013. [55] Ben Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Pro- ceedings 1996 IEEE Symposium on Visual Languages, pages 336–343. IEEE, 1996. doi: 10.1109/VL.1996. 545307. [56] Alvaro´ Sicilia, German Nemirovski, and Andreas Nolle. Map-On: A web-based editor for visual ontology map- ping. Semantic Web, (Preprint):1–12, 2016. [57] Alexandra Similea, Niklas Petersen, Christoph Lange, and Steffen Lohmann. TurtleEditor 2.0: A Synchro- nized Visual and Text Editor for RDF Graphs. In IEEE 11th International Conference on Semantic Comput- ing, 2017. [58] Claus Stadler, Jens Lehmann, Konrad H¨offner, and S¨orenAuer. LinkedGeoData: A Core for a Web of Spa- tial . Semantic Web Journal, 3(4):333–354, 2012. [59] Thomas S Tullis and Jacqueline N Stetson. A compari- son of questionnaires for assessing website usability. In Usability professional association conference, pages 1– 12, 2004. [60] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Renaud Delbru, and Ste- fan Decker. Sig.ma: Live views on the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 8(4):355–364, 2010. ISSN 1570-8268. doi: http://dx.doi.org/10.1016/j.websem.2010.08.003. [61] Pierre-Yves Vandenbussche, Ghislain A Atemezing, Mar´ıa Poveda-Villal´on,and Bernard Vatant. Linked Open Vocabularies (LOV): a gateway to reusable se- mantic vocabularies on the Web. Semantic Web, 8(3): 437–452, 2017. [62] Marc Weise, Steffen Lohmann, and Florian Haag. LD- VOWL: Extracting and Visualizing Schema Informa- tion for Linked Data. Proceedings of the Second In- ternational Workshop on Visualization and Interaction for Ontologies and Linked Data, pages 120–127, 2016.

28