Semantic Object Mapping Using SHACL Validated Resource Graphs

Semantic Web 0 (0) 1 1 IOS Press

1 1 2 2 3 3 4 Semantic Object Mapping using SHACL 4 5 5 6 Validated Resource Graphs 6 7 7 8 Sarah Bensberg a,* and Marius Politze a 8 9 a IT Center, RWTH Aachen University, Germany 9 10 E-mails: [email protected], [email protected] 10 11 11 12 12 13 13 14 14 15 Abstract. Metadata describing research data is a core instrument for compliance with the FAIR Guiding Principles. RDF is very 15 16 suitable for this purpose because its schema-lessness provides flexibility for changes and different disciplines. In order to follow 16 17 defined metadata schemas, RDF needs to be restricted by application profiles. SHACL allows validating whether an RDF based 17 18 metadata matches a certain shape and meets minimum quality requirements. Described research data hence becomes resources in 18 19 a validated knowledge graph. Searchability is achieved by SPARQL, which, however, requires considerable technical knowledge. 19 20 The presented semantic object mapping of validated resource graphs into an JSON object is an approach which is less dependent 20 on the users’ background. 21 21 The evaluation was performed based on precision, recall and response times using Elasticsearch as a search engine on the 22 22 mapped object in comparison to generated SPARQL queries. The results show that with the transformation of RDF based 23 (meta-)data into a search index using application profiles and inference rules, a solution was found that is equivalent in these terms. 23 24 Through the integration of the developed mapping in data management platform Coscine, a search of the research data is possible 24 25 which at the same time promotes the subsequent use. 25 26 26 27 Keywords: Resource Object Mapping, Knowlegde Graph Validation, Shape Graphs, Research Data Management, Metadata 27 28 28 29 29 30 30 31 1. Introduction Kindling and Schirmbacher [3] describe RDM as 31 32 the entire process that supports the allocation, genera- 32 33 As research data is often fundamental for scientific tion, processing, enrichment, archiving, and publication 33 34 discoveries, it should be stored on a long-term basis and of digital research data. Within this context RDM ser- 34 35 it should be reusable for validation or future research vices allow researchers to manage their research data 35 36 projects. Metadata, a data record describing the data, according to the Findable, Accessible, Interoperable 36 37 serves to make the data interpretable [1]. In the field of and Reusable (FAIR) Guiding Principles [4]. 37 38 Research Data Management (RDM), metadata can be Description with metadata is crucial to secure long 38 39 used to clarify questions that occur regarding research term interpretability within the research data manage- 39 40 data, e.g., the date of creation. As additional informa- ment process. When regarding research data objects 40 41 tion is stored by the metadata, it is possible to search as resources, Resource Description Framework (RDF) 41 42 research data or to filter it according to certain prop- triples are a common choice for storing such metadata. 42 43 erties. A standardized presentation of the data makes Due to the schema-lessness and semantics RDF pro- 43 44 the information machine-readable and thus a search vides flexibility for evolutionary changes and across 44 45 possible. Through the usage of metadata, information different disciplines. In order to follow defined meta- 45 46 can be linked across the entire World Wide Web [2], data schemas, however, RDF graphs need to be re- 46 47 which is why it is particularly useful for accessibility stricted by application profiles. Shapes Constraint Lan- 47 48 and subsequent use. guage (SHACL) allows validating whether an RDF 48 49 (meta-)data graph matches a certain shape and thus 49 50 meets minimum quality requirements. Furthermore, se- 50 51 *Corresponding author. E-mail: [email protected]. mantic resource object mappings could be generated 51

1 because of the known structure of the knowledge graph. pose new challenges for researchers [10]. More and 1 2 These models can be used in search engines to make more public funding programs by the German Research 2 3 the resources searchable and therefore to find research Foundation (GRF) or EU’s Horizon 2020 expect re- 3 4 data. search data to be made available and data management 4 5 plans to be developed. While the access to research data 5 6 1.1. Motivation and related services is usually limited to one project 6 7 or organization [11]. However, the intention here is to 7 8 The Council for Scientific Information Infrastruc- enable cross-disciplinary access. 8 9 tures established by the Joint Science Conference has An important aspect that is relevant for the imple- 9 10 recommended the establishment of a National Research mentation of appropriate solutions is the diversity and 10 11 Data Infrastructure (NFDI) for the coordinated further heterogeneity of the various scientific disciplines as 11 12 development of the information infrastructure construc- well as the decentralized nature of data. This results 12 13 tion [5] that is now funded by the German Federal Min- from the different research data infrastructures used by 13 14 istry of Education and Research. The background to researchers and consisting different services. However, 14 15 this is the establishment of a coordinated and sustain- it should be possible to maintain the discipline-specific 15 16 able data infrastructure for the German science system, equipment and IT systems individually for each insti- 16 17 which strengthens the competitive position of research tute despite uniform support concerning RDM. The dif- 17 18 in Germany through the systematization of data stocks, ficulty of reproducibility is thus increased even further. 18 19 good accessibility of research data, and continuous fur- Additionally, initial approaches already exist at some 19 20 ther development of services [6]. “The aim of the Na- research groups, which is why the different stages of 20 21 21 tional Research Data Infrastructure (NFDI) is to sys- development represent a further challenge [8]. 22 tematically manage scientific and research data, provide 22 It should be noted that for the researchers themselves, 23 long-term data storage, backup and accessibility, and 23 the added value of RDM is often indirect, so active 24 integrate the data both nationally and internationally. 24 integration into research processes is crucial [12]. Be- 25 The NFDI will bring multiple stakeholders together in a 25 sides, the role of a dedicated “Data Curator”, who is 26 coordinated network of consortia tasked with providing 26 familiar with certain techniques/standards and technolo- 27 science-driven data services to research communities.” 27 gies, is progressively disappearing within the context 28 [7] 28 of RDM; instead, every single researcher is under obli- 29 While RDM best practices need to find their way 29 gation. This trend indicates that many “IT skills” are 30 into daily workflows of each individual researcher from 30 becoming more and more “general education” in the 31 Ph.D. student to principal investigator, the goal set by 31 context of digitization. To support the researcher in 32 NFDI also requires the central university institutions, 32 all phases of the Research Data Life Cycle (RDLC), 33 such as libraries or computing centers to be involved 33 34 and provide supporting (IT) services [8]. For this rea- the goal of RWTH is to create a basic infrastructure 34 35 son RWTH Aachen University (RWTH), as a renowned of connected services [8]. To enable research groups 35 36 research institution, decided to invest in institutional to continue using individual components, technology- 36 37 RDM structures [9]. RWTH supports researchers to do independent and process-oriented interfaces will be cre- 37 38 RDM and to deal with the topic comprehensively to ated. These ensure that different systems are bundled 38 39 avoid problems and difficulties during research work. to generate added value for the researcher. 39 40 Another motivational aspect is the security of research Several components are used for this purpose: (i) 40 41 data. RDM should ensure that data is not lost and that the EPIC Persistent Identifier (PID) service based on 41 42 it is protected against misuse and theft. The visibility the Handle system for the permanent and unique ref- 42 43 of research data also plays a major role – research data erencing of research data [13], (ii) external metadata 43 44 should be documented in a comprehensible way and stored with the PID, which enables the use of discipline- 44 45 assigned to the researcher so that queries regarding the specific systems, (iii) a central storage service in which 45 46 data can be made. Another point is the reusability of researchers can store their research data, and (iv) the 46 47 research data. The goal is to be able to reuse research support of the University Library to publish the results 47 48 data for new research projects and to gain more efficient in a suitable repository. As the importance of descriptive 48 49 access to already existing research data. Besides, re- and explanatory metadata is growing, an application 49 50 search data is often already stored in different systems has been developed that allows the creation of metadata 50 51 and the combination and unification of this data records for discipline-specific application profiles [14]. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 3

1 In order to connect these existing infrastructures, – Data incompleteness, since the RDF triples con- 1 2 the IT Center of RWTH is currently developing the re- tain a lot of information, but most knowledge is 2 3 search data management integration platform Collab- represented as free text. 3 4 orative Scientific Integration Environment (Coscine) – Inflexible querying, because triple-pattern queries 4 5 [10], a software platform for allocation and manage- are highly expressive, but at the same time restric- 5 6 ment of IT resources in research projects. The main tive, because they have to follow a certain struc- 6 7 goal is to provide access to research data regardless of ture and queries can only be made according to 7 8 its storage location and to manage existing data FAIR. the underlying data structure and terms. A further 8 9 Coscine provides assistance in order to meet the re- restriction is that no query can be made regarding 9 10 quirements of the Good Research Practice (GRP)[15] missing resources. 10 11 – Missing result ranking 11 such as the guidelines on the use of standards and , since RDF graphs, or 12 queries in such a graph may yield many results 12 methods, documentation, allowing to make research 13 that can overwhelm a user. Therefore, a specific 13 results publicly accessible and archiving. The platform 14 ranking is necessary, which can be based on dif- 14 is currently in a pilot phase for researchers at RWTH. 15 ferent criteria. However, such a ranking is not pro- 15 While Coscine itself does not store research data, it sup- 16 vided by SPARQL Protocol And RDF Query Lan- 16 ports the structured storage of metadata and creation 17 guage (SPARQL), thus it has to be implemented 17 of persistent identifiers on research data as described 18 explicitly. 18 19 above. Metadata are stored as resource graphs using in – Result diversity, because it is important that the 19 20 the RDF model and are validated with SHACL shapes. topmost results give a broad overview of the re- 20 21 At the moment, the implemented search functionality sults of a query and do not all contain similar as- 21 22 is limited, which hinders the reusability of data. To pects. 22 provide a more sophisticated search interface for re- 23 Since the data is stored in a triple store, the most nat- 23 searchers, Coscine intends to use an Elasticsearch (ES) 24 ural way to access it is the SPARQL Endpoint that al- 24 [16] search index. 25 lows querying and filtering. However, a user must have 25 26 ES should be used as a search engine because it is 26 some knowledge to do this: (i) RDF and SPARQL, (ii) 27 scalable and with its document-based approach it fits the structure of the stored data, (iii) used vocabularies 27 28 very well with the mapping of metadata records. It also and terms as well as their corresponding Uniform Re- 28 29 provides a near-real-time search for different types of source Identifiers (URIs)1. Even if a user would have all 29 30 data. This meets the requirements of a structured search the information, the use of SPARQL is very expressive 30 31 as well as advanced search types, e.g., range queries. and therefore challenging and thus can only be used 31 32 The crucial question is how to transform the Coscine by experienced Data Curators in a satisfactory manner. 32 33 knowledge graph into a search index. This is especially However, the goal is that every researcher should be 33 34 important because it is to be benefited the from structure able to use RDM tools. 34 35 and inference rules, i.e., from the advantages of the The general user who ends up searching the data 35 36 RDF model. records is a researcher who does not necessarily come 36 37 from a technical environment. Besides, usually no RDF 37 38 38 1.2. Research Methodology or SPARQL knowledge is present and users are not 39 aware of the structure of the data or the database model. 39 40 Based on the state of research on search in RDF 40 41 A part of the search within Coscine is to find re- 41 search data based on stored metadata. It enables re- graphs, and Natural Language Processing (NLP) ori- 42 ented approaches from general Information Retrieval 42 trieval and thus the reproduction or further development 43 (IR), the work is oriented towards the following re- 43 of research projects. The background of the presented 44 search hypothesis, which is to be tested: 44 paper is the elaboration and evaluation of a resource 45 A mapping of RDF-based metadata records into a 45 object mapping to build such a search considering the 46 search index exists such that a search engine on that 46 47 data structure and requirements of Coscine (see section 47 48 2.6). 48 1or more specifically an Internationalized Resource Identifier (IRI). 49 Initially, the data is available as an RDF-based knowl- 49 Since IRI is a generalization of an URI and both are used very inter- 50 edge graph, which presents some general challenges changeably in general linguistic usage, only the term URI is used in 50 51 described by Arnaout and Elbassuoni [17]: the following, even if a IRI would be allowed. 51 4 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 index yields the same precision and recall when com- or users outside of their own project group or institution 1 2 pared to generated SPARQL queries. can interpret the data as well [19]. 2 3 Therefore, this paper presents a semantic resource 3 4 object mapping for a metadata record which is relies 2.2. Resource Description Framework 4 5 on SHACL based application profiles and SHACL val- 5 6 idated resources graphs. Therefore, the following re- RDF is one of the World Wide Web Consortium 6 7 search questions are addressed: (W3C) conceived data model for the representation of 7 8 information in the Web [20]. It is a data model that is 8 – How to exploit features of SHACL for resource 9 used in the area of the Semantic Web. Data is structured 9 object mapping? 10 in the form of triples (subject-predicate-object) and 10 – How does such a mapping influence the search 11 combined into a graph. For this reason, RDF offers a 11 results in terms of precision and recall? 12 reasonable way to map metadata as it is possible to 12 13 To test the research hypothesis and answer the represent metadata in the same form. A resource (subject) 13 14 search question, a method was first developed to con- is assigned a term or a value (object) via a category 14 15 vert the metadata graphs into JavaScript Object Nota- (predicate). 15 16 tion (JSON) objects. This is necessary to be able to use Example 1 shows how to use the Dublin Core [21] 16 17 a search engine like ES. The mapping applies inference vocabulary to map the metadata X has creator Y as RDF 17 18 rules to convert URIs into literals and to inject further triple. 18 19 human knowledge into the object. 19 20 To validate the mapping and its influence on search Example 1: Metadata as RDF triple 20 21 results, sample metadata records were created based on 21 ·dcterms:creator· 22 a real world application profile and different types of 22 23 search intentions from the RDM domain were consid- 23 24 ered. Using these, the results provided by ES queries 2.3. Metadata Schemas and Application Profiles 24 25 and SPARQL queries were evaluated in terms of pre- 25 26 cision and recall. Response times were measured on a The Joint Information Systems Committee in the 26 27 small and a large data record to assess scalability. Also, UK (JISC) (Organization for the Promotion of Digi- 27 28 necessary procedures for object mapping and commu- tal Technologies in Research and Education) [22] de- 28 29 nicating with ES were considered as additional effort. scribes a metadata schema as a collection of metadata 29 30 fields that are combined into a set. A metadata field is a 30 31 resource that is used as a property of a particular subject. 31 32 2. Background For example, the metadata fields “title”, “author”, and 32 33 “subject” can be combined in a metadata schema. There 33 34 This section provides some background about the are many official metadata standards, which have gone 34 35 used metadata model and the relevant Semantic Web through an agreement and validation process of the 35 36 technologies used by Coscine, which form the founda- metadata schema at some standardization body (such as 36 37 tion for the proposed resource object mapping. the Dublin Core Metadata Initiative (DCMI) or W3C). 37 38 Examples of such metadata schemas are Dublin Core 38 39 2.1. Metadata [23], Data Catalog Vocabulary (DCAT)[24], DataCite 39 40 [25], or RADAR [26]. 40 41 Various definitions of metadata exist in the literature. Controlled Vocabulary can be a kind of linear key- 41 42 Steffen Staab [18] describes metadata as data that de- word list, a hierarchically structured catalog or a taxon- 42 43 scribes selected aspects of other data. Usually, metadata omy, thus limiting an input. According to the JISC [27], 43 44 is structured data. Terms are linked to a resource via controlled vocabularies are used in metadata fields with 44 45 generic categories (e.g., “title”) and assigned to it. A the main goal of more efficient retrieval of resources 45 46 resource describes a specific object using a link. through searching. They improve data consistency and 46 47 Therefore, metadata is information that says some- reduce ambiguity in the language. 47 48 thing about the creation, content, or context of a re- An application profile [28] is a specification for de- 48 49 source. They can be used to link data over the World scribing metadata in an application that uses controlled 49 50 Wide Web [2]. In general, metadata is used to under- vocabularies and imposes further usage restrictions. The 50 51 stand data records in a broader context, so that people use of different metadata schemas is possible. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 5

1 2.4. Open and Closed World Assumption 1 Example 2 (continued) 2 2 3 Along with the Closed World Assumption (CWA) a namely if the statement Germany·is-not·France 3 4 system with complete access to all information about a would be added. The conclusion in OWA would 4 5 subject exists. If a search is made for a specific piece then look like this: A person can only live in one 5 6 of information within this system, the correct answer is country. Berlin is a city in Germany. Berlin is 6 7 found and only this answer exists. In this way, unam- a city in France. It follows that Germany and 7 8 biguous statements can be made. France are the same things. But since Germany 8 9 In the context of the Semantic Web, the term Open and France are two different things and therefore 9 10 World is used. In case of the Open World Assump- cannot be the same, something must be wrong: 10 11 tion (OWA), the information in a system is or can be in- 11 12 complete. Just because an answer to a question cannot ( ) ^ ( 12 13 be found in one system, does not imply that it cannot ) ^ ( 13 14 be found in another system, which might contain the ) ^ ( ‰ ) ÑK 14 15 necessary information. Thus, it cannot be concluded it 15 The Semantic Web is a system with incomplete infor- 16 is the only result or that there is none. Therefore, the 16 mation. Missing information in this context means that 17 correct and complete answer is unknown. 17 18 The main difference between the two assumptions the information has not yet been explicitly generated 18 19 is that the CWA includes the Unique World Assump- and that the Semantic Web follows the OWA, precisely 19 20 tion (UNA). The UNA [29] states that two things with because its main goal is to obtain new information [30]. 20 21 different names are in fact different unless explicitely Due to this fact, in Semantic Web applications rules 21 22 defined as the same thing. Example 2 illustrates this. restricting the range of a property cannot be applied 22 23 so intuitively especially when compared to relational 23 Example 2: Difference between CWA and 24 databases. This is mainly since databases are normally 24 OWA 25 based on CWA. 25 26 Metadata is intended to provide additional informa- 26 Assuming the scenario [30] that a person can 27 tion. Based on inference and the OWA, these statements 27 only live in one country and the following asser- 28 are not unambiguous (see Example 2). If requirements 28 tions are introduced: Sarah·lives-in·Berlin 29 need to be expressed in an application with metadata, 29 and Berlin·is-in·Germany. by adding 30 these must be restricted by application profiles follow- 30 Berlin·is-in·France, a CWA system finds 31 ing a CWA. Only this allows validating whether they 31 contradiction since persons can only live in one 32 follow a certain schema and use only certain controlled 32 country and according to UNA Germany and vocabularies. 33 France are not the same countries. The decisive 33 34 34 point is that by the name Berlin is the same city 2.5. Shapes Constraint Language 35 and not two different cities (which would have 35 36 different URIs) with the same name: 36 37 SHACL [31] is another W3C recommendation and 37 38 serves the validation of so-called data graphs against 38 ( ) ^ ( 39 shapes graphs. A shape graph is an RDF graph, which 39 ) ^ ( 40 describes conditions and restrictions that the data graph 40 ) ^ (Germany ‰ France) ÑK 41 needs to fulfill. For this reason, it is more suitable than 41 42 Resource Description Framework Schema (RDFS) to 42 In contrast, in an OWA system there would be no 43 model the constraints in terms of a metadata schema. 43 error, but the result would be the statement that 44 In addition to validation, the use of SHACL serves, 44 Germany and France are the same things: 45 among other things, to build user interfaces, generate 45 46 code or to ensure data integration [32]. SHACL can be 46 ( ) ^ ( 47 divided into the two languages SHACL Core language 47 ) ^ ( 48 SHACL-SPARQL 48 ) Ñ ( “ ) and . 49 SHACL shapes thus provide the means to describe 49 50 However, an inconsistency could be created here, an application profile with technologies of the Semantic 50 51 Web. 51 6 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 Example 3 shows a SHACL shape. dictions. Ontologies only allow to extract conclusions 1 2 about existing data. Example 4 demonstrates this. 2 3 Example 3: SHACL shape (RDF application 3 4 profile) Example 4: Ontology use 4 5 5 6 In this example, ex:ExampleShape is defined as The Organization Ontology (ORG) metadata 6 7 SHACL shape via the definition a·sh:NodeShape. standard [35] defines the property org:memberOf. 7 8 Using sh:property a property is defined which org:Agent is defined as definition range and an 8 9 in the example corresponds to ex:author. The org:Organization as value range: 9 10 values of ex:author must be instances of the 10 org:memberOf 11 class ex:author. ·rdfs:domain·org:Agent·; 11 12 ex:ExampleShape ·rdfs:range·org:Organization·. 12 13 ·a·sh:NodeShape·; 13 However, this does not ensure that each 14 ·sh:property·[ 14 triple using the property org:memberOf as 15 ··sh:path·ex:author·; 15 ··sh:class·ex:Author·; predicate has an org:Agent as subject and an 16 16 ]·. org:Organization as object. But in contrast to 17 17 this, a triple of the form ·org:memberOf· 18 18 is interpreted in such a way that x is an org:Agent 19 2.6. Technical Details Collaborative Scientific 19 and y is an org:Organization, although this 20 Integration Environment 20 might not be the case in reality: 21 21 22 Coscine is a research data management plattform 22 (org:memberOf rdfs:domain org:Agent) ^ 23 offered for researchers at RWTH. Internally Coscine 23 (org:memberOf rdfs:range org:Organization) ^ 24 uses a project-based system as an internal data model. 24 ( org:memberOf ) Ñ ( a org:Agent) 25 For each project, any number of data storages can be 25 ^ ( a org:Organization) 26 added. A data storage can further contain multiple data 26 27 entities which each consist of binaries and metadata 27 SHACL supports the validation of RDF graphs 28 records. Following the intention of FAIR digital objects 28 against different constraints. Thus, a sh:NodeShape al- 29 [33], each data storage is assigned a unique PID, which 29 lows to validate that the data meets the conditions in any 30 can be used to add further information and which pro- 30 case, otherwise, the data graph does not fulfill the shape 31 vides a resolution via the Handle system [34]. Since 31 graph. In principle, this turns the Open World into a 32 the PID is an URI, it is suitable as a node of a Linked 32 Closed World in which data follows certain principles 33 Data knowledge graph, to represent metadata. Here, 33 and rules. The result is a system of semi-structured data 34 the PID with an additional key fragment (since a data 34 35 storage can contain several data entities) is used as the due to the fact that on the one hand there are certain 35 36 object for corresponding metadata triples. The predi- (structured) fields and on the other hand there are dif- 36 37 cates are the respective metadata fields and the object is ferent SHACL application profiles with different fields, 37 38 the metadata value. Because RDF was chosen as data vocabularies, and structures (unstructured). 38 39 format metadata graphs are individually extendable and An application profile defines the structure for meta- 39 40 adaptable. RDF supports the approach of Linked Data data to be stored. Within the scope of Coscine, var- 40 41 and supports interoperability and reusability required ious application profiles are continously developed, 41 42 by the FAIR Guiding Principles. However, the flexibil- each of which is geared to the subject-specific require- 42 43 ity of the RDF model itself does not allow to express ments of different disciplines. To ensure uniform use, 43 44 specifications or requirements for the metadata. For this the SHACL profiles are based on existing metadata 44 45 reason, SHACL is used to provide shapes for the indi- schemas. Furthermore, according to the concept of 45 46 vidual resource graphs, therefore ensuring consistency Linked Data existing metadata standards and vocabular- 46 47 and reusability of the metadata. ies can be used. To ensure the use of existing schemas 47 48 SHACL enables clear structures and restrictions for and vocabularies, a knowledge engineer is required who 48 49 RDF data. In comparison, ontologies can also define has an overview of existing metadata standards in a 49 50 value ranges, but these cannot be validated or due to variety of domains and who monitors the development 50 51 the OWA allow inconsistencies and supposed contra- process of a new application profile [11]. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 7

1 user can select an instance without having to know the 1 2 URI behind it. Data types are also transformed into 2 3 corresponding form fields. 3 4 Within Coscine, a SHACL application profile with 4 5 the desired metadata fields and restrictions is created 5 6 as described above. These are adapted to the require- 6 7 ments of a research group, an organization or a project. 7 8 Vocabularies and Metadata Standards are reused and 8 9 created. In the future, this may be supported by an Ap- 9 10 plication Profile Generator application. The profile is 10 11 stored in the Application Profile Repository from where 11 12 it is retrieved by Coscine. A corresponding input mask 12 13 can be generated if the user wants to create metadata by 13 14 hand. Alternatively a REST API can be used to submit 14 Fig. 1. Generated form from the application profile in Appendix B 15 and validate the metadata using the shape. As a final 15 16 step, the resulting RDF graph is stored via the PID to a 16 17 Example 5 shows a shortened version of such an resource and the associated research data entity. 17 18 application pofile with only one metadata field. Example 6 shows an exemplary metadata record for 18 19 the previously presented extract of the application pro- 19 20 Example 5: Example Application Profile from file in Example B. 20 21 EngMeta 21 22 Example 6: Example metadata record of appli- 22 23 Shortened version of the application profile in cation profile in Appendix B 23 24 Appendix B with only one metadata field: 24 25 coscineengmeta: ·a··coscineengmeta:·; 25 26 ·a·sh:NodeShape·; ·dcterms:creator 26 27 ·sh:targetClass·coscineengmeta:·; ··dbpedia:Thomas_Pynchon·; 27 ·sh:property·[ 28 ·engmeta:worked·true·; 28 ··coscineengmeta:subject ·dcterms:title·"Design·Patterns"·; 29 29 ··sh:path·dcterms:subject·; ·dcterms:subject·dfgCS:A409-02; 30 ··sh:name ·dcterms:created 30 31 ···"Subject·Area"@en, ··"2020-01-01+02:00"^^xsd:date·; 31 32 ···"Sachgebiet"@de·; ·engmeta:version·3·. 32 ··sh:class·dfg:faecher·. 33 33 ·]. 34 34 35 The application profile in Appendix B contains 3. Resource Object Mapping 35 36 the full conversion of the EngMeta metadata 36 37 schema [36] to an application profile for the de- Based on the SHACL validated resource graphs it 37 38 scription of engineering research data from the is now possible to define a resource object mapping. 38 39 project DIPL-ING [37]. Two different types of inference rules were established, 39 40 which are used to create the resulting semantic mapping 40 41 Using the SHACL application profiles, a form is gen- of resources. 41 42 erated, which hides all technical details about the def- 42 43 inition and structure of the metadata. Restrictions are 3.1. Literal Rules 43 44 checked automatically so that, e.g., mandatory fields 44 45 are marked as such or a maximum of one element can The resources to be mapped are described by triples 45 46 be selected. Furthermore, the user gets easy access to in the knowledge graph. As previously described, each 46 47 the ontology, e.g. to the vocabularies. Figure 1 shows resource (sub-)graph has been validated against a 47 48 the input mask generated from the application profile SHACL shape. Within the resource graphs predicates 48 49 in Appendix B. are in the form of URIs and the objects are in the form 49 50 The illustrations shows that class constraints are au- of URIs or literals. It is assumed that a user generally 50 51 tomatically transformed into drop-down boxes and the is not interested in the URIs or does not even know it 51 8 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 but the corresponding label. A matching literal asso- Regarding Example 7, if not only the name but also 1 2 ciated by the user to the URI can then be found else- the first and last name individually or the corresponding 2 3 where in the graph depending on the instance or class. organizational unit and organization should be matched, 3 4 Generally it cannot be assumed that there always is a the literal rule in Example 8 is used. 4 5 suitable rdfs:label property available, as described in 5 6 the approach [38], which could be used to generate a Example 8: Advanced literal rule for the class 6 7 suitable name. However, all used classes and bounding foaf:Person 7 8 conditions are known and can therefore be used to find 8 9 literals with the help of respective SHACL shapes. Assuming the input graph from Example 7, the 9 10 Additionally, further literals for the description of a SHACL rule could be altered like this: 10 11 resource may exist in the knowledge graph. Compound foaf:Person 11 12 literals or the direct application of inferences to rep- ·sh:rule·[ 12 13 resent human knowledge as well as relations and to ··a·sh:SPARQLRule·; 13 14 improve the search would also be conceivable. If this ··sh:construct·""" 14 ···CONSTRUCT·{ 15 implicit information is stored additionally as SHACL 15 ····$this·rdfs:label·?label 16 rule [39] for each class and thus make it explicit, the 16 ···}·WHERE·{ 17 user’s search query can automatically be applied to the ····UNION·{ 17 18 matching literals. In the following, these rules are called ·····$this·foaf:givenName·?label 18 19 literal rules. Example 7, 8 and 9 serve as demonstra- ····}·UNION·{ 19 ·····$this·foaf:familyName·?label 20 tion of these. As can be seen from the three examples, 20 SHACL rules can be used to map literals and structures ····}·UNION·{ 21 ·····?mem·a·org:Membership·. 21 22 of any complexity. Since the mapping may need to dif- ·····?mem·org:member·$this·. 22 23 fer for different shape graphs, they can be added and ·····?mem·org:organization·?org·. 23 24 applied per shape. ·····?org·rdfs:label·?label 24 ····}·UNION·{ 25 25 Example 7: Literal rule for the class ·····?mem·a·org:Membership·. 26 org member 26 foaf:Person ·····?mem· : ·$this·. 27 ·····?mem·org:organization·?unit·. 27 28 ·····?org·org:hasUnit·?unit·. 28 29 Suppose there exists the following input graph ·····?org·rdfs:label·?label 29 ····} 30 which describes the person Thomas Pynchon: 30 ···} 31 dbpedia:Thomas_Pynchon 31 a foaf:Person ··"""·; 32 · · ·; 32 ·foaf:name·"Thomas·Pynchon"@en·. ·]·. 33 33 34 From this, the following graph is to be generated 34 Additionally, literal rules allow returning multiple 35 for the literals: 35 values for the returned properties to construct lists. As 36 dbpedia:Thomas_Pynchon 36 shown in Example 9 this can be used to resolve class or 37 ·rdfs:label·"Thomas·Pynchon"@en·. 37 instance hierarchies. 38 The following SHACL rule shows the generation 38 39 of the matching literal of a foaf:Person, where Example 9: Literal rule for the subject 39 40 \$this is the focus node, which is replaced by classification of the GRF 40 41 the respective instance of the class. 41 42 42 foaf:Person For instances of the GRF subject classification, 43 ·sh:rule·[ the literals of the superclass are also applied be- 43 44 ··a·sh:SPARQLRule·; cause an instance of the subclass is also automati- 44 45 ··sh:construct·""" 45 ···CONSTRUCT·{ cally an instance of the superclass. The following 46 ····$this·rdfs:label·?label graph is assumed: 46 47 ···}·WHERE·{ 47 48 ····$this·foaf:name·?label dfgCS:A409-02 48 ·a·dfgCS:A409-02·; 49 ···} 49 ··"""·; ·rdfs:label·"Software·Engineering·and· u 50 ·]·. Programming·Languages"@en·; 50 51 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 9

1 1 Example 9 (continued) Example 10 (continued) 2 2 3 ·rdfs:subClassOf·dfgCS:·. 3 ·engmeta:step·coscinestep:storage·. 4 4 dfgCS: 5 ·a·dfgCS:·; coscinestep:storage 5 6 ·rdfs:label·"Computer·Science"@en·. ·rdfs:label·"Data·storage"@en·; 6 7 ·a··. 7 From this, the following graph for the literals 8 8 should be generated: From this, the following graph should be gener- 9 ated: 9 10 dfgCS:A409-02 10 11 ·rdfs:label 11 ··"Software·Engineering·and· ·coscinesc:isLastStep·true·. 12 u 12 Programming·Languages"@en, 13 ··"Computer·Science"@en·. The following rule creates this additional triple. 13 14 The focus node \$this is replaced by the URI of 14 15 The following SHACL rule shows the generation the resource, in context of Coscine by the Identi- 15 16 of the matching literals: fier (ID) of the metadata record. 16 17 17 dfg: 18 ·sh:rule·[ ·sh:rule·[ 18 19 ··a·sh:SPARQLRule·; ··a·sh:SPARQLRule·; 19 ··sh:construct·""" 20 ··sh:construct·""" 20 ···CONSTRUCT·{ 21 ···CONSTRUCT·{ 21 ····$this·rdfs:label·?label ····$this·coscinec:isLastStep·?value 22 22 ···}·WHERE·{ ···}·WHERE·{ * 23 ····?class·rdfs:subClassOf ·?sup·. ····$this·engmeta:step·?stepA·. 23 24 ····?sup·rdfs:label·?label·. ····BIND·(not·exists·{ 24 25 ····$this·a·?class·. ·····?stepA·schema:predecessorOf·?stepB 25 ··} 26 ·····}·as·?value) 26 ···} 27 ··"""·; 27 ·]·. 28 ··"""·; 28 ·]·. 29 29 30 3.2. Additional Rules 30 In Example 10, only a triple is created for the given 31 31 Furthermore, it is possible to use additional rules, resource. However, the rules can be more complex and 32 32 which enable the automatic generation of further triples also influence other resources as shown in Example 11. 33 33 for specific resources. A human being is aware of fur- 34 34 ther knowledge or information due to the specified prop- Example 11: Additional rule for newest 35 35 erties of a resource, which are withheld from a ma- version 36 36 chine and thus from the mapping. The rules are used to 37 37 slightly close this gap. Example 10 contains a simple We assume a property engmeta:version contains 38 38 additional rule to create a triple indicating whether the the version number of a resource in the form of 39 39 referenced data is in the last step. an integer value. It is assumed that the associated 40 data entity is used as a kind of backup, which 40 41 Example 10: Additional rule for last step contains the same data in different versions. Thus, 41 42 it is possible for a human being to identify the 42 43 43 It is assumed that a metadata field engmeta:step resource with the most current version from a set 44 is filled by instances of the self-created vocab- of resources. However, this information has to 44 45 45 ulary steps [40]. The instances are processing be created explicitly for the mapping because 46 steps that follow a certain order. An additional a machine cannot automatically develop such 46 47 triple should be created, which uses a boolean connections. Suppose the following input graph 47 48 48 value and the predicate coscinesc:isLastStep exists: 49 49 to indicate if a metadata record is the last step. 50 The following graph is given: ·engmeta:version·1·; 50 51 51 10 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 amples are the handling of grammatical differences or 1 Example 11 (continued) 2 spelling and the use of auto-completions or search sug- 2 ·coscine:isFileOf·coscine:resource1·. 3 gestions. The possibilities that a search index provides 3 4 4 for the user and generally the optimization of the search 5 ·engmeta:version·2·; sound very promising. Evaluating this mapping would 5 6 ·coscine:isFileOf·coscine:resource1·. therefore support the core hypothesis (see section 1.2). 6 7 7 From this, the following graph should be gener- 8 8 ated: 3.3. Semantic Mapping of a Resource into a JSON 9 Object 9 10 10 11 ·ex:isNewestVersion·false·. 11 A JSON object consists of properties and associated 12 12 values. Furthermore, different data types like boolean, 13 ex:isNewestVersion 13 · ·true·. number and string are supported for the values2. For 14 14 all properties, a name and an appropriate value must 15 The rule for creating a field indicating the newest 15 be created from the resource graph triples to describe 16 version in a set of resources based on the property 16 17 engmeta:version and the data model of Coscine the corresponding field. In case of a literal it is taken 17 18 [41] is as follows. Furthermore, this rule does as a value. The literal rules are used for instances. Ac- 18 19 not only affect a single resource but must always cording to the same principle, additionally generated 19 20 be considered in the context of the given set of triples can be stored in the JSON object by using the 20 21 resources. additional rules. Thus, any arbitrarily complex graph 21 22 is transformed into a flat JSON object, which consists 22 23 ·sh:rule·[ only of field-value(s) pairs, as specified in the following 23 24 ··a·sh:SPARQLRule·; definition: 24 ··sh:construct·""" 25 25 ···CONSTRUCT·{ Definition 1: Semantic Mapping of a Resource 26 ····?s·ex:isNewestVersion·?value 26 27 ···}·WHERE·{ into a JSON Object 27 28 ····?s·engmeta:version·?ver·. 28 29 ····?s·coscine:isFileOf·?res·. Given is resource graph G from a knowledge 29 ····$this·coscine:isFileOf·?res·. 30 graph K such that G is a sub graph of K with 30 ····{ a root node S 0 identifying a resource to be 31 ·····SELECT·?newVer 31 32 ·····WHERE·{ mapped to a JSON object, the edges correspond 32 33 ······?file·engmeta:version·?newVer·. to the properties and the objects to the respective 33 34 ······?file·coscine:isFileOf·?res·. values. 34 ······$this·coscine:isFileOf·?res·. 35 For each triple xS, P, Oy P G, S P I, P P U and 35 ·····}·ORDER·BY·DESC(?newVer)·LIMIT·1 36 O P pU Y Lq applies, where I denotes resources, 36 ····}·. 1 37 ····BIND(·?ver·=·?newVer·AS·?value) U denotes URIs, and L denotes literals. Then G 37 38 ···} is the resulting mapped graph of G by applying 38 39 ··"""·; the SHACL rules on K. 39 ˆ 40 ·]·. The mapping to a JSON object then is a tree G 40 41 with exactly height one that includes triples from 41 1 1 1 1 1 1 1 1 42 Also, the additional rules do not have to generate G such that @i, jpxS i, Pi, Oiy, xS j, P j, O jyq P G : 42 1 1 1 1 1 1 1 43 boolean objects. With such rules, any additional infor- S i “ S j “ S 0 and @ixS i, Pi, Oiy P G : S P I, 43 1 1 n 44 mation can be generated automatically or additionally P P S, and O P L Y L is valid, where S are 44 n 45 to extend or improve the mapping. It is possible to de- strings, and L are lists of literals. 45 46 fine general or SHACL shape specific additional rules. 46 47 The approach considered is based on creating a To demonstrate the mapping Example 12 is consid- 47 48 JSON object from the RDF graphs to be able to ingest ered. 48 49 that into existing search engines. As advantages for the 49 50 search, it is possible to benefited from functionalities 2In the ES index date and number formats like integer and 50 51 and optimizations for user input and search features. Ex- float are added and text replaces string. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 11

1 1 Example 12: Mapping of an RDF metadata Example 12 (continued) 2 record into an ES document 2 3 ·"title":·"Design·Patterns", 3 ·"subject":·[ 4 4 Again, the example metadata record of Example ··"Software·Engineering·and· u 5 6 from section 2.6 is considered. Furthermore, Programming·Languages", 5 6 it is assumed that the following additional ··"Computer·Science", 6 7 information is stored in the knowledge graph: ··"Engineering·Sciences" 7 8 ·], 8 dfg:EngineeringSciences ·"created":·2020-01-01, 9 9 ·a·dfg:EngineeringSciences; ·"version":·3 10 ·rdfs:label·"Engineering·Sciences"@en·; } 10 11 ·rdfs:subClassOf·dfg:·. 11 12 It is important that all URIs are transformed into 12 dfgCS: 13 13 ·a·dfgCS:·; literals or lists of literals during this transformation. For 14 ·rdfs:label·"Computer·Science"@en·; this reason, the mapping of Definition 1 is not injective. 14 15 15 ·rdfs:subClassOf· u It may happen that two different RDF resources map to 16 dfg:EngineeringSciences·. the same JSON object as shown in Example 13. 16 17 17 18 dfgCS:A409-02 Example 13: Mapping into a JSON object is 18 ·a·dfgCS:A409-02; 19 not injective 19 ·rdfs:label·"Software·Engineering·and· u 20 20 Programming·Languages"@en·; 21 ·rdfs:subClassOf·dfgCS:·. Given are the two RDF triples: 21 22 22 rwth: ex:SWEP 23 rdfs:label 23 ·a·org:FormalOrganization·; · ·"Software·Engineering·and· u 24 24 ·rdfs:label·"RWTH·Aachen·University"·; Programming·Languages"·. 25 ·org:hasUnit·rwth:U022000·. 25 26 dfgCS:A409-02 26 ·rdfs:label·"Software·Engineering·and· 27 rwth:U22000 u 27 Programming·Languages"·. 28 ·a·org:OrganizationalUnit·; 28 ·rdfs:label·"IT·Center"·. 29 Assuming there are two metadata records which 29 30 dbpedia:Thomas_Pynchon differ only in one metadata value (the two URIs) 30 31 ·a·foaf:Person·; and no other rules are applied. The corresponding 31 32 ·foaf:givenName·"Thomas"@en·; mapping could not be distinguished because the 32 ·foaf:familyName·"Pynchon"@en·; 33 URIs are mapped to the same literal and thus the 33 ·foaf:name·"Thomas·Pynchon"@en·. 34 same ES document is created. 34 35 [] 35 36 ·a·org:Membership·; 3.4. Resolving Ambiguous Property Data Types 36 37 ·org:member·dbpedia:Thomas_Pynchon·; 37 ·org:organization·rwth:22000·. 38 Ideally, the same properties are defined in different 38 39 If it is assumed that only the presented rules from SHACL shapes with the same data type or are described 39 40 the Examples 7 and 9 from section 3.1 and no using instances of one class. In this case, they can re- 40 41 other additional rules are applied, the following ceive exactly this defined data type as type in ES (date, 41 42 ES document results from the mapping: text, boolean or integer) or instances of a class are 42 43 always mapped as text. If in a knowledge graph sev- 43 { 44 eral different resources were validated based on a shape 44 ·"creator":·[ 45 ··"Thomas", and the same properties in different shapes have dif- 45 46 ··"Pynchon", ferent data types, a generic data type should be found, 46 47 ··"Thomas·Pynchon", which is then generally valid. The property is mapped 47 48 ··"IT·Center", to the type text, because as mentioned before the same 48 ··"RWTH·Aachen·University" 49 property in ES may only have one unique type per in- 49 ·], 50 ·"worked":·true, dex. Table 1 shows all possible mappings of data types 50 51 between the existing SHACL shapes and ES. 51 12 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

Table 1 1 – Data records about non computability of the hu- 1 Data types mapping between SHACL shapes and ES where * means 2 man consciousness 2 that the same properties were described in different shapes with the 3 3 same data type. – Data records which are available since 03.07.2020 4 – Data records about political left 4 SHACL Shape JSON ES 5 – Data records which contain a variable with the 5 date* string date 6 value 2 meters 6 string* string text 7 – Data records about object-oriented software which 7 boolean* boolean boolean 8 are published before 2015 8 integer* number integer 9 – Data records which are published by Thomas Pyn- 9 class* string text 10 chon, created in 2016 and about object-oriented 10 different data types string text 11 software 11 12 12 13 This intention has to be formulated differently as a 13 3 14 4. Evaluation request, because a certain syntax has to be followed 14 15 when using ES [44]. Furthermore, the search query is 15 16 During the evaluation of the resource object map- formulated in such a way that it is at least well-fitted 16 17 ping, the context of Coscine was used and the mapping or close to optimal. This means that the intention “pub- 17 18 of metadata records in the context of searchability was lished on IT Center”, e.g., in the query is translated to 18 publisher:IT·Center 19 evaluated. For the considered tests described in this sec- and thus the correct property is 19 20 tion exemplary metadata records were generated. All used. Also, it is assumed that the user searches for the 20 21 of them are based on a modified form of the EngMeta correct words, i.e. those that are present in the text and 21 22 application profile [43]. Some classes and data types of not for variants or synonyms of them. 22 23 the profile have been modified to demonstrate the dif- 23 24 ferent possibilities that the SHACL definitions provide. 4.2. Test Setup 24 25 Furthermore, the data does not represent real metadata 25 26 of research data records but serves for illustration. .NET Framework 4.8 was used with the program- 26 27 The following results are based on the assumptions ming language C# and the library dotNetRDF [45] was 27 28 made in the context of the paper, the presented test included in the version 2.6. As RDF database Virtuoso 28 29 setup, implementation, and the generated sample data. in the open-source variant 7.20.3217 with the default 29 30 settings was applied. Only the parameter NumberOf- 30 4.1. Evaluated Search Intentions 31 Buffers was changed to 660000 and MaxDirtyBuffers 31 32 to 495000 in the virtuoso.ini file. For the implemen- 32 For testing the research hypothesis some search 33 tation ES version 7.6.2 was used. In addition, all de- 33 queries were considered. Behind each query is the in- 34 fault settings have been adopted, i.e., in particular only 34 tention to find one or more data records (which are 35 one node, one cluster, one shard, and one replica have 35 described by metadata). They were selected to cover 36 been created. In order to keep the conditions, especially 36 as many different search types as possible and to be 37 for the measurement of times and their comparison, 37 related to the RDM use case: 38 as similar as possible, the tests were carried out under 38 39 – Data records which were published at the IT Cen- the same conditions and the same hardware. A 64-bit 39 40 ter system with the operating system Microsoft Windows 40 41 – Data records with version number 10 Server 2016 Standard, 16GB RAM and two Intel(R) 41 42 – Data records which were created in 2007 Xeon(R) CPU E5-2695 v3 @ 2.29GHz processors was 42 43 – Data records about design patterns and with ver- used. 43 44 sion number less than 5 44 45 – Data records about computer science 4.3. Precision and Recall 45 46 – The newest data record of resource 2 46 47 – Data records which are experiments or simulations To evaluate precision and recall the presented search 47 48 and not analysis intentions were considered. First, the search inputs for 48 49 the ES syntax were created (see Appendix D) and then 49 50 3The evaluation is based on data record [40] and evaluation code executed on 35 selected metadata records. To see which 50 51 [42] discussed in detail in [41]. metadata records are relevant, the query was translated 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 13

1 Table 2 Table 3 1 2 Summed up confusion matrix Precision, recall, and F1 score (rounded to two digits) 2 3 Relevant Mapping 3 4 + - Precision 0,96 4 5 + 51 2 Recall 1 5 6 - 0 402 F1 Score 0,98 6

7 Received 7 8 as a reference into a pure SPARQL query considering when using SPARQL queries and is not a disadvantage 8 9 the existing structure and data (see Appendix C). A compared to them. 9 10 full-text search, i.e., if searching in strings and no cor- Table 3 summarizes the results by calculating pre- 10 11 responding URI exists, is only possible by using an cision, recall, and F1 score of the total confusion ma- 11 12 regular expression or the function bif:contains provided trix. The resource object mapping reaches the same 12 13 by Virtuoso. Based on the results of the query it was values as generated SPARQL queries and is therefore 13 14 decided which records should be found for the respec- not inferior. 14 15 tive search intention. Afterwards, confusion matrices 15 16 (see section E) were set up, which compare the relevant 4.4. Response Time 16 17 search results with the found ones and thus make a con- 17 18 crete analysis possible. Table 2 contains a summed up The response time was measured for all presented 18 19 matrix, which adds up the results of the different search search intentions on the search data record. Nielsen 19 20 intentions. [46] differs several types of response times: One limit is 20 21 The confusion matrix shows that all relevant records set to 0.1 seconds so that the user feels that the system 21 22 were found and only the two records were irrelevant. is responding directly to the interaction. Another limit 22 23 The advantages of this approach are that a search can is 1.0 second: the user notices the delay but remains 23 24 be done in concrete fields, any combination of fields is uninterrupted. This concludes that the search results 24 25 possible and specification of value ranges is allowed. (i.e., from entering the search query to displaying the 25 26 Furthermore, the year and boolean values (due to the found metadata) should be found within 0.1 seconds if 26 27 search in concrete fields) can be found without addi- possible and within one second at most. 27 28 tional created fields. Besides, a ranking of the found To make the comparison of the reponse times as fair 28 29 records is also available. Additional metadata records and realistic as possible, an executable file was created, 29 30 could be found for specific phrases containing stop which receives the query (see section 4.3) and performs 30 31 words. If there had been records containing the word the corresponding function with it. The function returns 31 32 center as a publisher, they would also have been found a list with the PIDs of the found metadata records. It 32 33 in the search intention “published at the IT center”. was omitted to run through and output of these results to 33 34 This is because the word “it” counts as a stop word and avoid that large result lists influence the runtime. With 34 35 thus only the word “center” is searched for. This is not the executables, an attempt is made to map a complete 35 36 shown in the example queries or data but would be a user request, which goes beyond simply executing the 36 37 disadvantage. request against a database. The search input of the user 37 38 The search intention “left (political)” finds two irrel- has to be sent to the backend to add the visibility re- 38 39 evant records. This is because a pure search for “left” strictions. However, in the future, a web service will be 39 40 leads to the fact that in addition to the political left, the running, and not every time a process will be started 40 41 left side or the past tense of the word “leave” can also be that reloads the used libraries. Therefore a basic process 41 42 meant. For a computer, this semantic difference is not was created for comparison, which loads the libraries 42 43 recognizable. However, the search could be extended for dotNetRDF.Data.Virtuoso needed in all approaches 43 44 by the word political or a wildcard variant politic* and queries a simple graph in Virtuoso. An execution 44 45 and would in this case only find the one relevant data time of 221 ms was measured. Also, this type of mea- 45 46 record. However, this requires that the word also occurs surement was chosen because it is not about comparing 46 47 in the same string. If this is not the case, it is not pos- different queries within an approach, but about the dif- 47 48 sible to distinguish between them. If the topic would ferences between the approaches and the time elapsed 48 49 be political science or similar (which is not the case for the user. To get a more accurate measurement, in 49 50 in the data), the search could also be further limited each approach the executable was executed for each 50 51 by this information. This semantic problem also exists query 100 times in a row and then the average value was 51 14 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 Table 4 Table 5 1 2 Response time in ms (rounded to a full number) of user request in Response time in ms (rounded to a full number) of user requests in 2 small search data record large search data record 3 3 4 Intention Guideline Mapping Intention Guideline Mapping 4 5 Published at IT Center 311 ˘ 21 572 ˘ 78 Published at IT Center 316 ˘ 28 557 ˘ 69 5 6 Version 10 301 ˘ 16 550 ˘ 21 Version 10 305 ˘ 17 549 ˘ 20 6 7 Created in 2007 294 ˘ 10 552 ˘ 23 Created in 2007 308 ˘ 9 551 ˘ 20 7 8 Design Patterns and Version < 5 295 ˘ 12 552 ˘ 24 Design Patterns and Version < 5 290 ˘ 11 548 ˘ 23 8 9 Computer Science 295 ˘ 16 548 ˘ 24 Computer Science 373 ˘ 11 549 ˘ 29 9 10 Newest Version Res. 2 311 ˘ 8 558 ˘ 29 Newest Version Res. 2 313 ˘ 9 550 ˘ 18 10 11 Experi./Simu. not Anal. 311 ˘ 10 555 ˘ 22 Experi./Simu. not Anal. 415 ˘ 16 554 ˘ 18 11 12 Non Comput. Human Consci. 302 ˘ 27 556 ˘ 26 Non Comput. Human Consci. 348 ˘ 129 556 ˘ 31 12 13 Av. s. 03.07.20 303 ˘ 25 556 ˘ 18 Av. s. 03.07.20 336 ˘ 12 556 ˘ 20 13 14 Left (politicial) 295 ˘ 19 553 ˘ 25 Left (politicial) 292 ˘ 7 558 ˘ 33 14 15 2 Meters 311 ˘ 30 555 ˘ 29 2 Meters 317 ˘ 30 549 ˘ 23 15 16 Pub. < 2015 Oo. Softw. 288 ˘ 17 551 ˘ 23 Pub. < 2015 Oo. Softw. 291 ˘ 18 549 ˘ 21 16 17 Pub. T. Pyn. Created 2016 Oo. Softw. 298 ˘ 24 556 ˘ 27 Pub. T. Pyn. Created 2016 Oo. Softw. 291 ˘ 18 550 ˘ 26 17 18 I 301 ˘ 18 555 ˘ 28 I 323 ˘ 24 552 ˘ 27 18 19 19 20 calculated. The results are shown in Table 4. Besides, Table 6 20 21 the standard deviation was calculated to illustrate the Factor (rounded to two digits) by which the response time has in- 21 creased compared to the small data record 22 inaccuracies of the measurements. 22 23 Guideline Mapping 23 24 4.5. Scalability 1.07 0.99 24 25 25 26 Coscine is a system that will grow steadily the more for the search index from section 3 is required. This 26 27 metadata about research data is stored and collected. results in completely redundant storage of the ES map- 27 28 That means that the system should be able to handle pings of all metadata records. Besides the additional 28 29 large amounts of data and enable searching in this memory consumption, this also leads to necessary syn- 29 30 amount of data. chronization steps. The already existing data must be in- 30 31 To measure the scalability, the search queries were dexed at the beginning. Besides, new metadata records 31 32 executed on the 10,000 metadata records large data must be created, updated, and deleted synchronously. If 32 33 record the same way described the section above. The application proﬁles change or new ones are added, in 33 34 results of the response times on the large data record the worst case a new index must be created. 34 35 are shown in Table 5. The times to perform these actions are shown in Ta- 35 36 For the assessment of the scalability, the factor by ble 7. As with the measurement of the response times, 36 37 which the execution times have increased are consid- executable ﬁles were created for this purpose, whose 37 38 ered since they indicate how the response times will be- execution time and standard deviation were measured. 38 39 have on even larger data records. Table 6 shows the fac- Since the initial indexing has to be done only once at 39 40 tor between the respective values of the average times the beginning and the reindexing differs only by delet- 40 41 from the tables 4 and 5. ing the old index and switching the alias, which are fast 41 42 queries, only the reindexing was measured. Due to its 42 43 4.6. Additional Effort duration, it was executed ten times independently with 43 44 the 10,000 test data and the average was given. The 44 45 This category describes the additional effort (storage creation, updating, and deletion of individual records 45 46 and calculations) required to implement the respective were performed 100 times in a row and then the aver- 46 47 approach which is necessary in addition to the previous age value was calculated. The additional rule for saving 47 48 storage of metadata records in the RDF database. In the latest version (see Example 11) was used once and 48 49 addition to the storage of the literal rules described in once without it, because it can affect other metadata 49 50 section 3.1, the implementation of the additional admin- records and thus increases the execution time. When 50 51 istrative triples and the mapping of the metadata records (re)-indexing, the additional rules are always executed, 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 15

Table 7 1 Related to this topic is the graph summarisation by 1 Times in ms (rounded to a full number) for the actions of the ad- 2 Campinas [50], where this process takes place in re- 2 ditional effort. Columns with * show values where additional rules 3 3 which influence other metadata records were used. verse order. From an existing knowledge graph, a sum- 4 mary is constructed that represents the structure. In our 4 (Re)-Indexing Add Update Delete 5 context, the structure already exists through SHACL 5 1646706 ˘ 53376 837 ˘ 143 839˘ 129 735 ˘ 121 6 validations with whose assistance JSON objects are to 6 7 be generated from individual resources. 7 Add* Update* Delete* 8 Delbru et al. [51] describe an Entity Retrieval Model 8 3055 ˘ 233 2996 ˘ 97 2901 ˘ 148 9 for Web Data to describe semi-structured information 9 10 from heterogeneous and distributed sources for semi- 10 11 but they do not affect the other documents every time, structured information. The model consists of three 11 12 since all knowledge is already available in the knowl- components: dataset, entity, view.A dataset is a col- 12 13 edge graph and thus can be directly queried correct lection of entity descriptions and can be uniquely refer- 13 14 and complete. To see the effects of the additional rules, enced by an URI. An entity is something that can be de- 14 15 the actions were executed on the large data record. For fined via an URI (such as documents, events or persons) 15 16 adding, 9900 records already existed and the remain- and for which descriptions exist in the form of proper- 16 17 ing 100 were added. When updating and deleting, all ties. A view is a piece of information accessible via an 17 18 10,000 records were already indexed. URI which is used to describe the dataset. The Entity 18 19 It is important that these delays are not noticeable Attribute-Value Model [51] is a model to describe the en- 19 20 for the user, because they can run parallel in the back- tities of a dataset. This is done with an ID describes the 20 21 ground without limiting the search possibilities. Fur- entity, a set of labeled properties, and a set of values for 21 22 thermore, it is clearly visible that creating, updating these properties. Furthermore, Delbru presents a search 22 23 and deleting is faster if the other metadata records are model for web data which, in contrast to conventional 23 24 not affected. In order to see the relation between the web search engines, does not represent entities as a bag 24 25 additional time required to add a metadata record and of words but describes them as a set of attribute-value 25 26 the time required to store and validate it in Virtuoso, pairs. For the search, three different search types can 26 27 the execution time for running an executable with this be distinguished: (i) full-text queries which represent a 27 28 function was measured. The average value for 100 search request with unknown structure, (ii) structural 28 29 consecutive executions is 3217 ms. This time was cal- queries which represent more complex queries in form 29 30 culated using a small metadata record that only fills of key-value-pairs if the structure is known and (iii) 30 31 the metadata fields dcterms:author, dcterms:title, semi-structural queries as the combination of both. In 31 32 dcterms:subject, and dcterms:created. It follows principle, the Entity Attribute-Value Model and the cor- 32 33 that the additional mapping in ES results in a further responding Search Model for Web Data are applied in 33 34 time expenditure of 26%. This number shows that the the presented approach, where the metadata records are 34 35 main effort lies in the necessary validation and storage the entities. Resources (in the context of Coscine only 35 36 in Virtuoso. the metadata records) are considered as documents, the 36 37 corresponding properties (metadata fields) as fields, and 37 38 all objects as values. However, the model serves only 38 39 5. Related Work as a structure. The work presents how such a mapping 39 40 is created smartly and semantically. 40 41 To enable a search in an RDF-based knowledge Linked swissbib [52] converts bibliographic data into 41 42 graph there are so-called SPARQL query builder, which a RDF-based data model and divides it into six different 42 43 allow a step-by-step construction of SPARQL queries concepts. By using the JSON-LD serialization of RDF, 43 44 using cleverly chosen user interfaces. There is a wide they are indexed in the search engine ES. Therefore, 44 45 variety of such query builders. Kuric et al. [47] have the approach precedes as a good example for mapping 45 46 compared the best-known query builders regarding their RDF data into a search index. However, it is outdated, 46 47 usability for laypeople. Since such tools all suffer from since ES no longer supports the use of different types 47 48 usability problems and have already been extensively in a search index [53]. Furthermore, the pure use of the 48 49 evaluated [47–49], this paper focuses on an alternative JSON-LD serialization in our context does not make 49 50 variant: mapping the graph data into a smart search sense, because otherwise URIs would be indexed for 50 51 index. which the users do not search. 51 16 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 For this reason, Open Semantic Search [54] addition- graphs into a search engine. It builds upon SHACL and 1 2 ally indexes the corresponding rdfs:label to a URI. the SHACL-SPARQL extension to allow the definition 2 3 Nevertheless, this is not sufficient since it is not always of inference rules that are used to generate object prop- 3 4 set or can be found in other properties depending on the erties. The mapping was further evaluated regarding its 4 5 data structure. In this approach the associated labels of performance with respect to precision and recall when 5 6 the RDF annotation to URIs for indexing data in Solr prepared to generated SPARQL queries. 6 7 are stored. As the results show, the resource object mapping 7 8 Semplore [55] translates Semantic Web data into doc- when using ES as search engine is inferior to the man- 8 9 uments, fields, and terms properly so that the IR en- ual generation of SPARQL queries in terms of speed, 9 10 gine can index them in their inverted index structure but it also clearly meets a maximum response time of 10 11 and provide retrieval functions via the data. The pre- one second that is common for web applications. This 11 12 sented approach for mapping objects into a search index applies to both the small and large data record ensuring 12 13 is oriented on Semplore. The differences are the con- scalability of the mapping. However, both approaches 13 14 crete generation of the mapping, which in the consid- result the same in terms of effectiveness: For the exam- 14 15 ered approach can contain, for example, inference rules, ple metadata graphs and search intentions, they achieve 15 16 and the possibility to use ES to benefit from advanced the same precision and recall values. 16 17 search types like value range queries. Two disadvantages of the mapping compared to the 17 18 Rocha et al. [56] presented a hybrid approach to 18 use of SPARQL queries are its additional resource con- 19 keyword-based search in the Semantic Web, where the 19 sumption regarding memory and computing time as 20 focus is on finding concepts. For this purpose, an in- 20 well as the additional storage of rules. Since both are stance graph is created for each concept based on the 21 not directly visible to the users and do not bring any 21 values of its properties, which is searched with tradi- 22 disadvantage for them, this point of criticism is not to 22 tional IR techniques. Afterwards, spread activation tech- 23 be taken very serious. SHACL rules only have to be 23 24 niques are applied to find related concepts. A knowl- 24 created once and can be reused. They also allow a very 25 edge engineer has to weight the paths in the graph 25 flexible and customizable extension, which has a great 26 to the related concepts since they are always domain- 26 influence on the search results. 27 dependent. In the context of Coscine, the domain is not 27 The definition of rules has the further advantage that 28 known in advance or can change by further applica- 28 a user no longer has to think about such structures and 29 tion profiles and application areas. Thus, such an ap- 29 connections of the RDF data and individual users do 30 proach is not suitable. Changes and adjustments would 30 no longer have to map, e.g., the transitive relationship 31 have to be made again and again, which is why it is 31 of the subclasses for hierarchies (see Example 7 and 9). 32 not pursued further. Consequently, the creation of the 32 The user might not even be aware of these relations, so 33 instance graphs is interesting, which store all terms 33 34 from associated properties for a concept to make them it is advantageous to include them from the beginning. 34 35 referenceable. The necessary knowledge about the inference rules and 35 36 The main contribution of the developed approach the internal structure of the graph is pushed to a knowl- 36 37 is the use of SHACL as the basis for the creation of edge engineer building the SHACL shapes. Naturally, 37 38 the inference based mapping. By using graph shapes this implies that only such relations can be found in the 38 39 a validated knowledge graph exists whose structures, search, which are mapped by rules for the search index. 39 40 vocabularies, and terms are known. Additionally, infer- For example, if the transitive rule for the subclasses of 40 41 ence rules can be created and applied to the graph. This the GRF subject classification is not used, no subclasses 41 42 makes it possible to store human knowledge and use it can be found in the search for a superclass. This means 42 43 for resource object mapping. Furthermore, it allows the that all knowledge or structures must be aware of and 43 44 transformation of unrecognized and unreadable URIs considered in advance. 44 45 into suitable human-readable literals, which can be used 45 46 to describe resources. 6.1. Answering Research Question 46 47 47 48 6. Conclusion and Future Work In the following, the answers to the research ques- 48 49 tions posed in section 1.2 are briefly discussed. 49 50 This paper introduced a semantic resource object How to exploit features of SHACL for resource object 50 51 mapping for the ingestion of SHACL validated resource mapping? 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 17

1 The main challenges when querying RDF knowl- 6.2. Discussion 1 2 edge graphs are the unknown structure and URIs. These 2 3 challenges have been eliminated due to the mapping In this work RDF data, SPARQL, and SHACL are 3 used. SPARQL queries map semantic knowledge in 4 and the application of the literal rules, as more com- 4 the form of SHACL rules. This knowledge includes, 5 plex graph structures and URIs are thus resolved. The 5 6 e.g., inference rules like subclasses or the mapping 6 SHACL application profiles form an essential compo- 7 of structures to search in the corresponding literals of 7 nent in the implementation of the resource object map- 8 URIs. This allows capturing existing semantics into the 8 9 ping approach. Based on the assumption that resource object mapping. 9 10 graphs are SHACL validated SHACL rules can be used The introduced mapping is limited to SHACL shapes 10 11 to build (specific) literal and additional rules (see sec- being available to specify corresponding fields and data 11 12 tion 3.1). The rules allow the extension and mapping of types. Only with this information an index can be cre- 12 13 a resource graph into an JSON object (see section 3). ated in ES in advance. Furthermore, when creating dif- 13 14 How does such a mapping influence the search re- ferent shapes, it is important to ensure that the same 14 15 sults in terms of precision and recall? data types are used for the same properties, to provide a 15 data type specific mapping in ES instead of a string rep- 16 The advantages of mapping to a search index are that 16 17 resentation. In turn, a property should be used consis- 17 every user can make complex search queries because tently for the same metadata fields throughout different 18 the provided search syntax of ES is very simple and 18 19 shapes, otherwise there will be ambiguous mappings 19 intuitive and no further knowledge is required. Besides, 20 for the same property (see Example 14). In any case, 20 the data structure, vocabularies, and schemas are hid- 21 reuse of existing shapes and metadata schemas reduces 21 22 den. The mapping delivers successful results and rea- the impact of this issue and is generally desirable. 22 23 sonable times due to the previous indexing of the data. 23 Example 14: Ambiguous definition of proper- 24 Besides, ES is specifically designed for scalable and 24 ties 25 fast searches [16]. Since SPARQL is specially designed 25 26 as a query language for RDF graphs ES cannot fully 26 In an application profile, the metadata field 27 keep up with the speed of executing SPARQL queries. 27 of the author is described via the property 28 28 Through ES different search types like keyword search, dcterms:creator, where the associated label is 29 29 value ranges, boolean queries, and full-text search are “creator”. 30 possible. For the handling of large amounts of data, 30 31 coscineengmeta:creator 31 it was considered how the approach works with large 32 ·sh:path·dcterms:creator·; 32 amounts of data (see section 4.5). ·sh:name·"Creator"@en·. 33 33 With both answers at hand and the discussed eval- 34 dcterms:creator 34 35 uation in Section 4, this allows assessing the initially ·rdfs:label·"creator"·. 35 36 posed research hypothesis: 36 In a second application profile, the property 37 A mapping of RDF-based metadata records into a 37 ex:author is defined with the associated label 38 search index exists such that a search engine on that 38 “author”. 39 index yields the same precision and recall when com- 39 40 pared to generated SPARQL queries. ex:creator 40 41 ·sh:path·example:author·; 41 The results of the evaluation largely confirms the ·sh:name·"Creator"@en·. 42 hypothesis in the considered context when using appli- 42 43 cation profiles. The mapping of resource graphs into a ex:author 43 44 ·rdfs:label·"author"·. 44 search index for use in a search engine yields the same 45 45 precision and recall as manually generated SPARQL This leads to the fact that in JSON object there 46 46 will be a field creator and a field author, al- 47 queries. For this reason the same results can be achieved 47 though semantically they may describe the same 48 with much less effort for the user. However, this is only 48 concept. In a third application profile, the property 49 possible by using previously carefully considered infer- 49 ex:author is defined with the associated label 50 ence rules for mapping literals and further relations or 50 “creator”. 51 cognitions. 51 18 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 knowledge graph. Consequently, the new institute is 1 Example 14 (continued) 2 stored instead of the previous specified at the time of 2 ex:creator 3 publication, which makes less sense. 3 ·sh:path·example:author·; 4 4 ·sh:name·"Creator"@en·. Another aspect worthy of discussion is the arbitrary 5 extensibility of the rules. The question arises, if any rule 5 6 ex:author can be constructed. Anything can be mapped that can 6 7 ·rdfs:label·"creator"·. be constructed using the SPARQL CONSTRUCTs based 7 8 Hence, the values would be merged by the pro- on the database, if the corresponding instance of a class 8 9 posed algorithm even though they may not seman- (literal rule) or the URI of a resource (additional rule) 9 10 tically describe the same concept. is used as focus node. In general, the rules are very 10 11 generic and can be used across all application profiles. 11 12 In this example it might be considered to check if It is crucial that a knowledge engineer decides on the 12 13 such a field already exists in the ES index and use it basis of the rules which information of the metadata 13 14 for both properties. This has to be considered when graph is mapped and thus is searchable and which in- 14 15 creating the data types. However, it would have to be formation is lost through the mapping according to ES 15 16 ensured that both properties are actually referring to and thus cannot be searched. Furthermore, in the evalu- 16 17 the same concept. This decision is not necessarily un- ation it was found that the use of additional rules that 17 18 18 ambiguous and trivial, and is an area of research of influence other metadata records essentially increases 19 the indexing time of a new metadata record. For this 19 its own. If two different metadata fields (which also 20 reason, they should only be used very rarely and with 20 have a different semantic meaning) are identified with 21 caution. 21 the same label, which can occur due to homonyms, a 22 22 workaround must be developed so that the fields can be 23 6.3. Future Research Suggestions 23 clearly distinguished and offered to the user as search 24 24 fields. Different solutions exists like prefixing or re- 25 The entire evaluation refers to the context of research 25 quiring the knowledge engineer creating an application 26 data and Coscine (especially the use of application pro- 26 profile to provide an alternative name. The resource 27 files which provide a specific structure). The sample 27 object mapping generally follows a CWA and as such 28 data was generated and the search intentions were de- 28 assumes that properties with different names are indeed 29 vised with exactly this focus. In addition, the search for 29 different. However, an extension to respect definitions 30 metadata records is also a specific issue. Other entities 30 like owl:sameAs could provide a reasonable extension. 31 like single persons should not be found. In addition, the 31 32 The whole example shows the problem of the semantic most optimal search queries were assumed, i.e., that the 32 33 gap [57], namely that the same data or information can user does not make any mistakes during the input and 33 34 be represented using different vocabularies and RDF that the structure and appropriate choice of words for 34 35 terms. An arbitrary number of different graphs exists, the approach were adhered to. To what extent such an 35 36 which represent the same semantic information. Within approach is transferable to other use cases and whether 36 37 the context of Coscine, this problem is countered by ES can be generally considered a suitable search engine 37 38 the uniform creation of application profiles, the reuse for arbitrary RDF knowledge graphs remains an open 38 39 of both metadata schema and vocabularies, as well as question for future research projects. 39 40 the controlled creation of literal and additional rules. Within Coscine, there are currently no connections 40 41 For the creation of the literal and additional rules between the individual metadata records. With the ex- 41 42 a knowledge engineer with SPARQL knowledge is re- pansion to resources, a further area of research could 42 43 quired. Although this saves the user a lot of cognitive be the linking of the resources to each other, which is 43 44 work, the creation of application profiles is more com- currently completely lost through mapping into the lit- 44 45 plex. When creating the rules, it is especially important erals. It is no longer clear whether a literal was created 45 46 to pay attention to their meaningfulness. The rule from by a resource or was specified as such. 46 47 the example 8, for example, also creates the correspond- Depending on the vocabularies and data used, it 47 48 ing institutional organization for a person as value. This might be interesting to enrich the knowledge graph 48 49 makes sense in the context of a publisher metadata field. with information from other data sources according to 49 50 However, it must also be taken into account that the the principle of Linked Data. In this way, further rules 50 51 associated institutional organization can change in the could be mapped. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 19

1 Up to now, a knowledge engineer is required to map [3] M. Kindling and P. Schirmbacher, "Die digitale 1 2 the literal and additional rules. An interesting question Forschungswelt" als Gegenstand der Forschung / Research on 2 3 is whether it is possible to have these rules created by Digital Research / Recherche dans la domaine de la recherche 3 numérique, Information - Wissenschaft & Praxis 64(2–3) 4 4 a user without the appropriate knowledge. For exam- (2013). doi:10.1515/iwp-2013-0017. 5 ple, would a user interface be conceivable in which [4] M.D. Wilkinson, M. Dumontier, I.J.J. Aalbersberg, G. Apple- 5 6 rules can be clicked together based on the vocabularies, ton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L.B. da 6 7 metadata standards, and ontologies used? Thinking the Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, 7 8 other way around, the question arises whether and how M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, 8 R. Finkers, A. Gonzalez-Beltran, A.J.G. Gray, P. Groth, 9 it is possible to generate such rules automatically when 9 C. Goble, J.S. Grethe, J. Heringa, P.A.C. ’t Hoen, R. Hooft, 10 10 creating graph summaries as proposed by Campinas T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, 11 [50]. A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van 11 12 Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, 12 13 6.4. Closing Remarks G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van 13 14 Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wol- 14 stencroft, J. Zhao and B. Mons, The FAIR Guiding Princi- 15 15 In conclusion, the main finding of this work is that ples for scientific data management and stewardship, Scien- 16 the use of search engines can also be suitable for search- tific data 3 (2016), 160018. doi:10.1038/sdata.2016.18. https: 16 17 ing in RDF-based knowledge graphs if a skillful map- //pubmed.ncbi.nlm.nih.gov/26978244/. 17 18 ping of the data and the knowledge contained in its [5] NFDI, Nationale Forschungsdateninfrastruktur NFDI. https: 18 . . 19 structure is applied. //www nfdi de/informationen. 19 20 [6] RfII, Themen - RfII. http://www.rfii.de/de/themen/. 20 Two more subtle but noticeable advantages of ES as [7] DFG, National Research Data Infrastructure. https: 21 21 a search engine were experienced but their effect was //www.dfg.de/en/research_funding/programmes/nfdi/ 22 not further investigated during this research: (i) ES is index.html. 22 23 additionally able to map synonyms and find results by [8] D. Schmitz and M. Politze, Forschungsdaten managen – 23 24 using stemming without having to map them literally in Bausteine für eine dezentrale, forschungsnahe Unterstützung: 24 25 76-91 Seiten / o-bib. Das offene Bibliotheksjournal / heraus- 25 the data. Just the possibility to enter singular or plural gegeben vom VDB, Bd. 5 Nr. 3 (2018) / o-bib. Das offene Bib- 26 26 words leads to an extreme increase in user satisfaction liotheksjournal / herausgegeben vom VDB, Bd. 5 Nr. 3 (2018), 27 and error tolerance. In this case, it is no longer neces- o-bib. Das offene Bibliotheksjournal / Herausgeber VDB 5(3) 27 28 sary to assume that the search is optimal in every case, (2018), 76–91. doi:10.5282/o-bib/2018H3S76-91. 28 29 29 so that, for example, a search can be made for 2·meters [9] RWTH Aachen University, Leitlinie zum Forschungsdaten- 30 management an der RWTH Aachen. https://www.rwth- 30 instead of 2·meter. (ii) ES brings along a ranking com- 31 aachen.de/cms/root/Forschung/Forschungsdatenmanagement/ 31 pared to SPARQL. This is especially helpful if a search ~ncfw/Leitlinie-zum-Forschungsdatenmanagement/. 32 32 query returns many results. This can be used to high- [10] M. Politze, F. Claus, B. Brenger, M.A. Yazdi, B. Heinrichs 33 light metadata records for which a certain search query and A. Schwarz, How to Manage IT Resources in Research 33 34 is more specific than others. Both features increased the Projects? Towards a Collaborative Scientific Integration Envi- 34 35 ronment, in: European Journal of Higher Education IT 2020-2, 35 experiences quality of search results significantly. 36 Y. Epelboin, M. Mennielli, A. Pacholak, P. Kähkipuro, G. Ferell, 36 C. Diaz, L.M. Riberio, J. Bergström, T. Koscielnieak, E. Car- 37 37 doso, R. Vogl, B. Cordewener, N. Wilson, C. Vilarrasa, N. Har- 38 38 Acknowledgements ris and O. Tasala, eds, 2020. 39 [11] M. Politze and B. Decker, Ontology Based Semantic Data Man- 39 40 agement for Pandisciplinary Research Projects, in: Proceed- 40 41 The work was partially funded with resources ings of the 2nd Data Management Workshop, C. Curdt and 41 42 granted by NFDI4Ing and the project AIMS funded by C. Wilmes, eds, Kölner Geographische Arbeiten, Cologne, Ger- 42 many, 2016. doi:10.5880/TR32DB.KGA96.10. 43 GRF under project number 432233186. 43 [12] M. Politze and T. Eifert, On the Decentralization of IT 44 Infrastructures for Research Data Management, in: Euro- 44 45 pean Journal of Higher Education IT 2019-1, Y. Epelboin, 45 46 References M. Mennielli, A. Pacholak, P. Kähkipuro, G. Ferell, C. Diaz, 46 47 L.M. Riberio, J. Bergström, T. Koscielnieak, E. Cardoso, 47 R. Vogl, B. Cordewener, N. Wilson, C. Vilarrasa, N. Harris and 48 [1] RWTH Aachen University, Forschungsdatenmanagement 48 von A - Z. https://www.rwth-aachen.de/cms/root/Forschung/ O. Tasala, eds, Paris, France, 2020. 49 49 Forschungsdatenmanagement/~svkj/A-bis-Z/. [13] T. Kálmán, D. Kurzawe and U. Schwardmann, European Per- 50 [2] Jisc, Describing metadata. https://www.jisc.ac.uk/guides/ sistent Identifier Consortium - PIDs für die Wissenschaft, in: 50 51 metadata/describing-metadata. Langzeitarchivierung von Forschungsdaten, R. Altenhöner and 51 20 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 C. Oellers, eds, Scivero Verl., Berlin, Germany, 2012, pp. 151– [33] K. de Smedt, D. Koureas and P. Wittenburg, FAIR 1 2 164. ISBN 978-3-944417-00-4. Digital Objects for Science: From Data Pieces to Ac- 2 3 [14] M. Politze, S. Bensberg and M. Müller, Managing Discipline- tionable Knowledge Units, Publications 8(2) (2020), 21. 3 Specific Metadata Within an Integrated Research Data Manage- doi:10.3390/publications8020021. 4 4 ment System, in: Proceedings of the 21st International Confer- [34] Corporation for National Research Initiatives, Handle.Net Reg- 5 5 ence on Enterprise Information Systems ICEIS 2019, Heraklion, istry. https://www.handle.net/. 6 Crete - Greece, May 3 - 5, 2019, J. Filipe, ed., ICEIS (Setúbal), [35] W3C, The Organization Ontology. https://www.w3.org/TR/ 6 7 SciTePress, [S. l.], 2019, pp. 253–260. ISBN 978-989-758-372- vocab-org/. 7 8 8. doi:10.5220/0007725002530260. [36] Universität Stuttgart, EngMeta - Beschreibung von Forschungs- 8 daten. https://www.izus.uni-stuttgart.de/fokus/engmeta/. 9 [15] Deutsche Forschungsgemeinschaft, Gute wis- 9 senschaftliche Praxis. https://www.dfg.de/foerderung/ [37] Universität Stuttgart, FDM-Projekt DIPL-ING. http: 10 10 grundlagen_rahmenbedingungen/gwp/. //www.ub.uni-stuttgart.de/dipling#. 11 11 [16] Elasticsearch B.V., What is Elasticsearch? | Elasticsearch Ref- [38] S. Shekarpour, S. Auer, A.-C.N. Ngomo, D. Gerber and C. Stadler, Keyword-Driven SPARQL Query Generation 12 erence [master]. https://www.elastic.co/guide/en/elasticsearch/ 12 Leveraging Background Knowledge, 2011, pp. 203–210. 13 reference/master/elasticsearch-intro.html. 13 doi:10.1109/WI-IAT.2011.70. 14 [17] H. Arnaout and S. Elbassuoni, Effective Searching of 14 [39] W3C, SHACL Advanced Features. https://www.w3.org/TR/ RDF Knowledge Graphs, SSRN Electronic Journal (2018). 15 shacl-af/#rules. 15 doi:10.2139/ssrn.3199315. 16 [40] S. Bensberg, Sample Dataset for Search Engine Evaluation for 16 [18] S. Staab, Wissensmanagement mit Ontologien und 17 Research Data in an RDF-based Knowledge Graph, RWTH 17 Metadaten, Informatik Spektrum 25(3) (2002), 194–209. 18 Aachen University. doi:10.18154/RWTH-2020-09886. 18 doi:10.1007/s002870200226. [41] S. Bensberg, An Efficient Semantic Search Engine for Research 19 19 [19] University of Edinburgh, Documentation, metadata, citation. Data in an RDF-based Knowledge Graph, RWTH Aachen Uni- 20 https://mantra.edina.ac.uk/documentation_metadata_citation/. versity. doi:10.18154/RWTH-2020-09883. 20 21 [20] W3C, RDF 1.1 Concepts and Abstract Syntax. https:// [42] S. Bensberg, Search Engine Evaluation for Research Data in 21 22 www.w3.org/TR/rdf11-concepts/. an RDF-based Knowledge Graph, RWTH Aachen University, 22 [21] Dublin Core Metadata Initiative, DCMI: DCMI Metadata 23 2020. doi:10.18154/RWTH-2020-09885. 23 Terms. https://www.dublincore.org/specifications/dublin-core/ [43] Universität Stuttgart, EngMeta - Beschreibung von Forschungs- 24 24 dcmi-terms. daten | Informations- und Kommunikationszentrum | Univer- 25 [22] Jisc, Metadata management. https://www.jisc.ac.uk/guides/ sität Stuttgart, 2020. https://www.izus.uni-stuttgart.de/fokus/ 25 26 metadata/management. engmeta. 26 27 [23] Dublin Core Metadata Initiative, DCMI: Dublin Core™. https: [44] Elasticsearch B.V., Query string query | Elastic- 27 28 //www.dublincore.org/specifications/dublin-core/. search Reference [master] | Elastic, 2020. https: 28 [24] W3C, Data Catalog Vocabulary (DCAT) - Version 2. https: //www.elastic.co/guide/en/elasticsearch/reference/master/ 29 29 //www.w3.org/TR/vocab-dcat-2/. query-dsl-query-string-query.html. 30 30 [25] DataCite - International Data Citation Initiative e.V., DataCite [45] dotNetRDF Project, dotNetRDF. https://www.dotnetrdf.org/. 31 Schema. https://schema.datacite.org/. [46] J. Nielsen, Usability engineering, Academic Press, Boston, 31 32 [26] A. Kraft, M. Razum, J. Potthoff, A. Porzel, T. Engel, F. Lange, 1993. ISBN 0125184050. 32 33 K. van den Broek and F. Furtado, The RADAR Project—A [47] E. Kuric, J.D. Fernández and O. Drozd, Knowledge Graph Ex- 33 ploration: A Usability Evaluation of Query Builders for Laypeo- 34 Service for Research Data Archival and Publication, In- 34 ple, in: Semantic Systems. The Power of AI and Knowledge 35 ternational Journal of Geo-Information 5(3) (2016), 28. 35 doi:10.3390/ijgi5030028. https://www.researchgate.net/ Graphs, M. Acosta, P. Cudré-Mauroux and M. Maleshkova, eds, 36 Information Systems and Applications, incl. Internet/Web, and 36 publication/297607606_The_RADAR_Project- HCI, Springer International Publishing, Cham, 2019, pp. 326– 37 A_Service_for_Research_Data_Archival_and_Publication. 37 342. ISBN 978-3-030-33220-4. 38 [27] Jisc, Controlled vocabulary. https://www.jisc.ac.uk/guides/ 38 [48] P. Grafkin, M. Mironov, M. Fellmann, B. Lantow, K. Sandkuhl 39 metadata/controlled-vocabulary. 39 and A.V. Smirnov, SPARQL Query Builders : Overview and [28] Dublin Core Metadate Initiative, DCMI: Application 40 Comparison, CEUR-WS, 2016. http://hj.diva-portal.org/smash/ 40 Profile. https://www.dublincore.org/resources/glossary/ 41 record.jsf?pid“diva2%3A1070495&dswid“6004. 41 application_profile/. 42 [49] A. Styperek, M. Ciesielczyk, A. Szwabe and P. Misiorek, Eval- 42 [29] N. Drummond and R. Shearer, The open world assumption, in: 43 uation of SPARQL-compliant semantic search user interfaces, 43 eSI Workshop: The Closed World of Databases meets the Open Vietnam Journal of Computer Science 2(3) (2015), 191–199. 44 44 World of the Semantic Web, Vol. 15, 2006. doi:10.1007/s40595-015-0044-y. 45 [30] J. Sequeda, Introduction to: Open World Assumption [50] S. Campinas, Graph summarisation of web data: 45 46 vs Closed World Assumption - DATAVERSITY, 2012. data-driven generation of structured representations 46 47 https://www.dataversity.net/introduction-to-open-world- (2016). https://www.semanticscholar.org/paper/Graph- 47 assumption-vs-closed-world-assumption/. 48 summarisation-of-web-data%3A-data-driven-of-Campinas/ 48 [31] H. Knublauch and D. Kontokostas, Shapes Constraint Language 6a6fec6cb64c9f9aa7e3e73fd301eb7c80565ef0. 49 49 (SHACL). https://www.w3.org/TR/shacl/. [51] R. Delbru, S. Campinas and G. Tummarello, Searching Web 50 [32] H. Knublauch, Form Generation using SHACL and DASH, Data: An Entity Retrieval and High-Performance Indexing 50 51 2020. http://datashapes.org/forms.html. Model, 2012. doi:10.2139/ssrn.3198931. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 21

1 [52] Linked swissbib - swissbib. http://www.swissbib.org/wiki/ 1 2 index.php?title“Linked_swissbib. 2 3 [53] Elasticsearch B.V., Removal of mapping types | Elastic- 3 search Reference [7.9] | Elastic. https://www.elastic.co/guide/ 4 4 en/elasticsearch/reference/current/removal-of-types.html. 5 [54] M. Mandalka, Meta data enrichment or annotator from Re- 5 6 source Description Framework (RDF) to Solr or Elasticsearch | 6 7 Open Semantic Search. https://www.opensemanticsearch.org/ 7 8 enhancer/rdf. 8 [55] L. Zhang, Q. Liu, J. Zhang, H. Wang, Y. Pan and Y. Yu, Sem- 9 9 plore: An IR Approach to Scalable Hybrid Query of Seman- 10 tic Web Data, in: The semantic web, K. Aberer, ed., Lecture 10 11 Notes in Computer Science, Springer, Berlin, 2007, pp. 652– 11 12 665. ISBN 978-3-540-76298-0. 12 13 [56] C. Rocha, D. Schwabe and M.P. Aragao, A hybrid approach for 13 searching in the semantic web, in: 13th International Confer- 14 14 ence on World Wide Web Conference, S. Feldman, M. Uretsky, 15 M. Najork and C. Wills, eds, ACM Press, New York, N.Y., 2005, 15 16 p. 374. ISBN 158113844X. doi:10.1145/988672.988723. 16 17 [57] A. Freitas, S. O’Riáin and E. Curry, Querying and Search- 17 18 ing Heterogeneous Knowledge Graphs in Real-time Linked 18 Dataspaces, in: Real-time linked dataspaces, E. Curry, ed., 19 19 SpringerOpen, Cham, Switzerland, 2020, pp. 105–124. ISBN 20 978-3-030-29664-3. doi:10.1007/978-3-030-29665-0_7. 20 21 [58] RWTH Aachen University, Coscine ApplicationProﬁles 21 22 - GitLab, 2020. https://git.rwth-aachen.de/coscine/graphs/ 22 23 applicationproﬁles. 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51 22 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 Appendix A. Commonly used Prefixes 1 2 2 3 For brevity definition of prefixes were omitted in the examples. The following is the comprehensive list of prefixes 3 4 used: 4 5 @prefix·coscine:··. 5 6 @prefix·coscineengmeta:··. 6 7 @prefix·coscinesc:··. 7 8 8 @prefix·dbpedia:··. 9 9 10 @prefix·dfg:··. 10 11 @prefix·dfgCS:··. 11 12 12 13 @prefix·dcterms:··. 13 14 14 @prefix·ex:··. 15 15 16 @prefix·engmeta:··. 16 17 17 18 @prefix·foaf:··. 18 19 19 @prefix·org:··. 20 20 21 @prefix·owl:··. 21 22 22 23 @prefix·rdfs:··. 23 24 24 @prefix·rwth:··. 25 25 26 @prefix·schema:··. 26 27 27 28 @prefix·sh:··. 28 29 29 30 30 31 Appendix B. EngMeta Metadata Schema as Application Profile 31 32 32 33 The following is an application profile based on an excerpt of the EngMeta metadata schema [36] created for the 33 34 evaluation. A more extensive definition is available in the Coscine source repository [58]. 34 35 35 36 ··a·sh:NodeShape·; 36 37 ··sh:targetClass··; 37 38 ··sh:property·coscineengmeta:creator·; 38 ··sh:property·coscineengmeta:worked·; 39 39 ··sh:property·coscineengmeta:title·; 40 ··sh:property·coscineengmeta:subject·; 40 41 ··sh:property·coscineengmeta:created·; 41 42 ··sh:property·coscineengmeta:version·. 42 43 43 coscineengmeta:creator 44 44 ··sh:path·dcterms:creator·; 45 ··sh:order·1·; 45 46 ··sh:minCount·1·; 46 47 ··sh:name·"Creator"@en,·"Autor"@de·; 47 48 ··sh:class·foaf:Agent·. 48 49 49 coscineengmeta:worked 50 ··sh:path·engmeta:worked·; 50 51 ··sh:order·2·; 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 23

1 ··sh:maxCount·1·; 1 2 ··sh:datatype·xsd:boolean·; 2 3 ··sh:name·"Worked"@en,·"Hat·funktioniert"@de·. 3 4 4 coscineengmeta:title 5 ··sh:path·dcterms:title·; 5 6 ··sh:order·3·; 6 7 ··sh:minCount·1·; 7 8 ··sh:minLength·1·; 8 ··sh:datatype·xsd:string·; 9 9 ··sh:name·"Title"@en,·"Titel"@de·. 10 10 11 coscineengmeta:subject 11 12 ··sh:path·dcterms:subject·; 12 13 ··sh:order·4·; 13 ··sh:maxCount·1·; 14 14 ··sh:name·"Subject·Area"@en,·"Sachgebiet"@de·; 15 ··sh:class·dfg:. 15 16 16 17 coscineengmeta:created 17 18 ··sh:path·dcterms:created·; 18 ··sh:order·5·; 19 19 ··sh:minCount·1·; 20 ··sh:maxCount·1·; 20 21 ··sh:datatype·xsd:date·; 21 22 ··sh:name·"Creation·Date"@en,·"Erstellungsdatum"@de·. 22 23 23 coscineengmeta:version 24 24 ··sh:path·engmeta:version·; 25 ··sh:order·6·; 25 26 ··sh:minCount·1·; 26 27 ··sh:maxCount·1·; 27 28 ··sh:datatype·xsd:integer·; 28 ··sh:name·"Version"@en,·"Version"@de·. 29 29 30 30 31 31 32 Appendix C. Generated Reference Queries 32 33 33 34 34 35 For validation of the search queries all search intentions were built as reference queries. Their results were used as 35 36 a reference for the evaluation of precision and recall of the mapping. 36 37 37 38 Data records which were published at the IT Center 38 39 39 40 SELECT·?s·WHERE·{ 40 41 ··{ 41 42 ····SELECT·?s·{ 42 43 ······?s·dcterms:publisher··. 43 ····} 44 44 ··}·UNION·{ 45 ····SELECT·?s·{ 45 46 ······?s·dcterms:publisher·?person·. 46 47 ······?membership·a·org:Membership. 47 48 ······?membership·org:member·?person. 48 ······?membership·org:organization··. 49 49 ····} 50 ··} 50 51 } 51 24 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 Data records with version number 10 1 2 2 3 SELECT·?s·WHERE·{ 3 4 ··?s·engmeta:version·10·. 4 } 5 5 6 6 Data records which were created in 2007 7 7 8 8 SELECT·?s·WHERE·{ 9 ··?s·dcterms:created·?date·. 9 10 ··FILTER·(?date·>=·xsd:date("2007-01-01")·&&·?date·<·xsd:date("2008-01-01")) 10 11 } 11 12 12 13 Data records about design patterns and with version number less than 5 13 14 14 15 SELECT·?s·WHERE·{ 15 ··{ 16 16 ····SELECT·?s·{ 17 ······?s·dcterms:title·?title·. 17 18 ······?title·bif:contains·"'design·patterns'"·. 18 19 ····} 19 20 ··}·UNION·{ 20 ····SELECT·?s·{ 21 21 ······?s·dcterms:abstract·?abstract·. 22 ······?abstract·bif:contains·"'design·patterns'"·. 22 23 ····} 23 24 ··} 24 25 ··?s·engmeta:version·?version·. 25 ··FILTER·(?version·<·5)·. 26 26 } 27 27 28 28 Data records about computer science 29 29 30 30 SELECT·?s·WHERE·{ 31 ··{ 31 32 ····SELECT·?s·{ 32 33 ······?s·dcterms:abstract·?abstract·. 33 34 ······?abstract·bif:contains·"'computer·science'"·. 34 ····} 35 35 ··}·UNION·{ 36 ····SELECT·?s·{ 36 37 ······?s·dcterms:subject·?subject·. 37 38 ······?class·rdfs:subClassOf*·?superclass·. 38 39 ······?subject·a·?class·. 39 ······FILTER·(STR(?superclass)·=·"http://www.dfg.de/.../liste/index.jsp?id=409")·. 40 40 ····} 41 ··} 41 42 } 42 43 43 44 The newest data record of resource 2 44 45 45 46 SELECT·?s·WHERE·{ 46 47 ··?s·engmeta:version·?version·. 47 ··?s·coscineprojectstructure:isFileOf· 48 u 48 ·. 49 49 ··{ 50 ····SELECT·?newestVersion·WHERE·{ 50 51 ······?file·engmeta:version·?newestVersion·. 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 25

1 ······?file·coscineprojectstructure:isFileOf· u 1 2 ·. 2 3 ····}·ORDER·BY·DESC(?newestVersion)·LIMIT·1 3 ··}·. 4 4 ··FILTER(·?version·=·?newestVersion)·. 5 } 5 6 6 7 7 Data records which are experiments or simulations and not analysis 8 8 9 9 SELECT·?s·WHERE·{ 10 10 ··?s·engmeta:mode·?mode·. 11 11 ··FILTER·((?mode·=··||·?mode·=· u 12 12 )·&&·?mode·!=· u 13 ·)·. 13 14 } 14 15 15 16 Data records about non computability of the human consciousness 16 17 17 18 SELECT·?s·WHERE·{ 18 19 ··?s·dcterms:abstract·?abstract·. 19 20 ··?abstract·bif:contains·"'human·consciousness'·AND·non-algorithmic"·. 20 } 21 21 22 22 23 Data records which are available since 03.07.2020 23 24 24 25 SELECT·?s·WHERE·{ 25 26 ··?s·dcterms:available·"2020-07-03+02:00"^^xsd:date·. 26 } 27 27 28 28 29 Data records about political left 29 30 30 31 SELECT·?s·WHERE·{ 31 ··?s·dcterms:abstract·?abstract·. 32 32 ··?abstract·bif:contains·"left"·. 33 } 33 34 34 35 35 Data records which contain a variable with the value 2 meters 36 36 37 37 SELECT·?s·WHERE·{ 38 38 ··{ 39 ····?s·engmeta:controlledVariable·. 39 40 ··}·UNION·{ 40 41 ····?s·engmeta:measuredVariable·. 41 42 ··} 42 } 43 43 44 44 45 Data records about object-oriented software which are published before 2015 45 46 46 47 SELECT·?s·WHERE·{ 47 48 ··?s·dcterms:issued·?date·. 48 ··?s·dcterms:abstract·?abstract·. 49 49 ··?abstract·bif:contains·"'object-oriented·software'"·. 50 ··FILTER·(?date·<=·xsd:date("2015-01-01"))·. 50 51 } 51 26 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs

1 Data records which are published by Thomas Pynchon, created in 2016 and about object-oriented software 1 2 2 3 SELECT·?s·WHERE·{ 3 ··?s·dcterms:publisher··. 4 4 ··?s·dcterms:created·?date·. 5 ··?s·dcterms:abstract·?abstract·. 5 6 ··?abstract·bif:contains·"'object-oriented·software'"·. 6 7 ··FILTER·(?date·>=·xsd:date("2016-01-01")·&&·?date·<·xsd:date("2017-01-01")). 7 8 } 8 9 9 10 10 11 Appendix D. Search Queries for the Mapped Objects in Elasticsearch 11 12 12 The search intentions were modeled as ES search queries using the Query string query syntax [44]. 13 13 14 14 Data records which were published at the IT Center 15 15 16 16 publisher:"IT·Center" 17 17 18 Data records with version number 10 18 19 19 20 version:10 20 21 21 22 Data records which were created in 2007 22 23 23 24 date_created:[2007-01-01·TO·2007-12-31] 24 25 25 26 Data records about design patterns and with version number less than 5 26 27 27 28 "design·patterns"·AND·version:<5 28 29 29 30 Data records about computer science 30 31 31 32 "computer·science" 32 33 33 34 The newest data record of resource 2 34 35 35 is_file_of:"resource·2"·AND·is_newest_version:true 36 36 37 37 Data records which are experiments or simulations and not analysis 38 38 39 39 mode:((experiment·OR·simulation)·AND·NOT·analysis) 40 40 41 Data records about non computability of the human consciousness 41 42 42

43 abstract:("human·consciousness"·AND·non-algorithmic) 43 44 44 45 Data records which are available since 03.07.2020 45 46 46 47 date_available:2020-07-03 47 48 48 49 Data records about political left 49 50 50 51 left 51 S. Bensberg and M. Politze / Semantic Object Mapping using SHACL Validated Resource Graphs 27

1 Data records which contain a variable with the value 2 meters 1 2 2 3 controlled\_variables:"2·meter"·measured_variables:"2·meter" 3 4 4 5 Data records about object-oriented software which are published before 2015 5 6 6 abstract:"object-oriented·software"·AND·date_issued: ·TO·2015-01-01 7 * 7 8 8 Data records which are published by Thomas Pynchon, created in 2016 and about object-oriented software 9 9

10 publisher:"Thomas·Pynchon"·AND·date_created_year:2016·AND 10 11 abstract:"object-oriented·software" 11 12 12 13 13 14 Appendix E. Confusion Matrices for Search Intentions 14 15 15 16 The following shows confusion matrices of the Query string query syntax of ES. 16 17 Relevant Relevant Relevant 17 18 + - + - + - 18 19 + 2 0 + 1 0 + 1 0 19 20 - 0 33 - 0 34 - 0 34 20 21 Published at Version 10 Created in 21 Received Received Received 22 IT Center 2007 22 23 23 24 Relevant Relevant Relevant 24 25 + - + - + - 25 26 + 1 0 + 4 0 + 1 0 26 27 - 0 34 - 0 31 - 0 34 27 28 Design Patterns Computer Newest Version 28 Received Received Received 29 Version < 5 Science Res. 2 29 30 30 31 Relevant Relevant Relevant 31 32 + - + - + - 32 33 + 22 0 + 2 0 + 10 0 33 34 - 0 13 - 0 33 - 0 25 34 35 Experi./Simu. Non Comput. Av. s. 35 Received Received Received 36 not Anal. Human Consci. 03.07.20 36 37 37 38 Relevant Relevant Relevant 38 39 + - + - + - 39 40 + 1 2 + 5 0 + 0 0 40 41 - 0 32 - 0 30 - 0 35 41 42 Left 2 Meters Pub. < 2015 42 Received Received Received 43 (political) Oo. Softw. 43 44 44 45 Relevant Relevant 45 46 + - + - 46 47 + 1 0 + 51 2 47 48 - 0 34 - 0 402 48 49 49 Pub. T. Pynt. In Total Received 50 Received 50 Created 2016 51 51 Oo. Softw.