A New Nested Graph Model for Data Integration

Alma Mater Studiorum – Università di Bologna DOTTORATO DI RICERCA IN Computer Science and Engineering Ciclo XXX Settore Concorsuale: 01/B1 Informatica Settore Scientifico Disciplinare: INF/01 Informatica A new Nested Graph Model for Data Integration Presentata da: Giacomo Bergami Coordinatore Dottorato Supervisore Prof. Paolo Ciacccia Prof. Danilo Montesi Esame finale anno 2018 A new Nested Graph Model for Data Integration Giacomo Bergami Dottorato di Ricerca in Computer Science and Engineeering Ciclo XXX Settore concorsuale di afferenza: 01/B1 - Informatica Settore scientifico disciplinare: INF/01 - Informatica Coordinatore: Supervisore: Paolo Ciaccia Danilo Montesi Alma Mater Studiorum - Università di Bologna Esame finale anno 2018 Contents 1 Introduction 9 1.1 Graph Data: Use Cases . 11 I Related Works 13 2 Data integration: a data representation-independent approach 17 2.1 Preliminaries: data representation dependent approach . 18 2.1.1 Structured Data integration: Integrating entities represented with different schemas . 20 2.1.2 Semistructured Data Integration: Integrating multiple relations into a common representation . 21 2.1.3 Structured and Semistructured data integration: schema alignment as a data cleaning step . 24 2.1.4 Integrating unstructured data via semistructured representation 28 2.1.5 Aligning (Nested) Graphs . 29 2.2 In-Database Integration . 34 2.2.1 Preliminaries: towards a uniform data representation . 34 2.2.2 Aggregations . 40 2.3 Multi-database integration . 47 2.3.1 Preliminaries: Description Logic and Ontologies . 47 2.3.2 Ontology Alignments and Data Integration . 49 2.3.3 Query Rewriting . 50 2.4 Conclusions . 53 3 Analysing the properties of Data Models and Query Languages 55 3.1 Structured data: the Relational Model . 56 3.1.1 Query Languages . 56 3.1.2 Representation Problems . 58 3.1.3 Representing graphs . 63 3.2 Nested Relational Model, Semistructured data and Streams . 64 3.2.1 Query languages . 68 3.2.2 Representation problems . 68 3.2.3 Representing graphs . 69 3.3 Unstructured Data: Full Text Documents . 70 3.3.1 Query Languages . 71 3.4 Graph (Data) Models . 73 3.5 Classifying Graph Query Languages . 76 3.5.1 Graph Traversal and Pattern Matching Languages . 77 3.5.2 Graph Grammars . 79 3.5.3 Graph Algebras . 80 3.5.4 (Proper) Graph Query Languages . 82 3.6 Conclusions . 84 3 4 II On Combining Graphs 85 4 On Joining Property Graphs 89 4.1 Graph Query Languages limitations’ on Graph Joins . 91 4.2 Graph Data Model . 95 4.3 Graph θ-Joins . 97 4.3.1 Graph Join properties . 98 4.4 Graph Conjunctive Equi-Joins . 100 4.4.1 Algorithm and Data Structure . 101 4.4.2 Experimental Evaluation . 105 4.5 Graph Less-Equal Join . 108 4.6 Left, right and full graph joins. 112 4.7 Conclusions . 115 5 General Semistructured Model and Nested Graphs 119 5.1 General Semistructured (Data) Model . 121 5.1.1 script, a MetaModel for GSM . 125 5.1.2 Characterizing object identifiers . 131 5.2 Nested Graph . 133 5.3 Data model translation functions . 136 5.4 Use Cases . 139 5.4.1 Representing part-of aggregations . 139 ϛ – 5.4.2 Graph ETL and Ϙ ( ) (α(Di)): the Transformation phase . 143 α Di ,H 5.5 Conclusions . .( . .) . 155 6 GSQL: a Generalized Semistructured Query Language 157 6.1 General Semistructured Query Language (GSQL) . 158 6.2 Derived GSQL operators over GSM . 162 6.2.1 (Attribute labelled) Set operations . 162 6.2.2 Relational and semistructured operations . 164 6.3 GSQL Use cases . 172 6.3.1 paNGRAm: Nested Graph Relational Algebra . 172 6.3.2 Implementing traversal query languages’ semantics (σ).... 177 6.3.3 Representing is-a aggregations . 182 6.3.4 Generalized Graph Grammars G for Nested Graphs. ϘG , (H)(ï) 183 6.4 Conclusons . .H . .T . 193 7 On Nesting Graphs 195 7.1 Graph Query Languages limitations’ on Graph Nesting . 199 7.1.1 Graph Joins’ limitations in providing the ν operator . 199 ≅ 7.1.2 Implementing Graph Nesting over (two) graph collections . 201 7.1.3 Query Languages’ and data models’ limitations . 203 7.2 Class of optimizable graph nesting queries . 206 7.3 Nested Graphs . 209 7.4 Graph Nesting . 211 7.4.1 Two HOp Separated Patterns Algorithm . 213 7.5 Experimental Evaluation . 217 7.6 Conclusions . 219 0. CONTENTS 5 III Conclusions 221 8 Conclusions 223 A Resolving Alignments and Morphisms: OCaml Source Code 225 B Dovetailing lemmas 233 C Expressing containment functions in script 235 6 “No plan can predict everything. Some people will raise their heads, others will mutiny. The time will not cease to bestow losses and fame to whose who will continue the fight. [. ] Do not pursue your actions according to a plan.” —Luther Blissett, Q,Epilogue Abstract Despite graph data gained increasing interest in several fields, no data model suitable for both querying and integrating differently structured graph and (semi)structured data has been currently conceived. The lack of operators allowing combinations of (multiple) graphs in current graph query languages (graph joins), and on graph data structure allowing neither data integration nor nested multidimensional representations (graph nesting) are a possible motivation. In order to make such data integration possible, this thesis proposes a novel model (General Semistructured data Model) allowing the representation of both graphs and arbitrarily nested contents (e.g., one node can be contained by more than just one parent node), thus allowing the definition of a nested graph model, where both vertices and edges may include (overlapping) graphs. We provide two graph joins algorithms (Graph Conjunctive Equijoin Algorithm and Graph Conjunctive Less-equal Algorithm) and one graph nesting algorithm (Two HOp Separated Patterns). Their evaluation on top of our secondary memory representation showed the inefficiency of existing query languages’ query plan on top of their respective data models (relational, graph and document-oriented). In all three algorithms, the enhancement was possible by using an adjacency list graph representation, thus reducing the cost of joining the vertices with their respective outgoing (or ingoing) edges, and by associating hash values to both vertices and edges. As a secondary outcome of this thesis, a general data integration scenario is provided where both graph data and other semistructured and structured data could be represented and integrated into the General Semistructured data Model. A new query language outlines the feasibility of this approach (General Semistructured Query Language) over the former data model, also allowing to express both graph joins and graph nestings. This language is also capable of representing both traversal and data manipulation operators. 7 1 Introduction Contents 1.1 Graph Data: Use Cases . 11 Despite that modern Graph Database Management Systems (GDBMSs) are a well-known subject of studies, three fundamental aspects are missing in literature: operators integrating and combining graphs, the adoption of such operators in current graph query languages, and the definition of a graph data structure providing both a multidimensional and nested representation. As regards of the aforementioned graph integration operators, both graph joins generalizing the graph products and graph nestings summarizing graph data with other extracted graph patterns are required. In particular, this thesis is going to focus on two graph operators allowing to combine graphs: graph joins and graph nestings. First, the graph join operator (Chapter 4 on page 89) is useful in practical scenarios, such as the intersection or the merge of different transport overlay networks, or for comparing and merging different ontology representations. The flexibility at the edge combination level provides adaptiveness to such operator (Figure 1.1a). Moreover, if we fix a same theta-join predicate and a specific edge semantics, we could further on extend the definition of the graph join as in the relational model, that is by allowing that even the non matching vertices from one (left and right join) or both operands (full join) must also appear (Figure 1.1b). Hereby, the graph join operator defines a whole class of graph operators (graph ⊗θ-products), that can be differently instantiated as required by the user, and hence can be also differently implemented. Graph nesting (Chapter 7 on page 195) permits the summarization of graph data within one graph data source: in particular, we could “fuse” or “summarize” all the vertices belonging to the same cluster, and then we could group the paths occurring between the fused vertices as edges containing such paths. As a consequence, such operator provides dimensionality reduction that is often required in multidimensional data analysis. Group Recommendation Systems characterize a use case scenario in which the members sharing similar interests and their suggested activities could be “fused”, thus providing a coarser representation of the sequence of similar activities. While the class of graph join operators can be defined on top of current graph representations, graph nesting requires an extension of the graph data model for fulfilling the aforementioned graph data integration purposes. Such extension must allow to represent graphs within both vertices and edges (nestings), thus representing a summarized view of the contained graph. Nested representations are shared among task representations and complex world knowledge representations, such as medical procedures for specific diseases such as glaucoma [KW82]. Graphs distinguish two different classes of data types, vertices and edges. Edges

A New Nested Graph Model for Data Integration

Empirical Study on the Usage of Graph Query Languages in Open Source Java Projects

Graphql Attack

Graphql-Tools Merge Schemas

Red Hat Managed Integration 1 Developing a Data Sync App

IBM Filenet Content Manager Technology Preview: Content Services Graphql API Developer Guide

Graphql at Enterprise Scale a Principled Approach to Consolidating a Data Graph

Easy Web API Development with SPARQL Transformer

Schemas and Types for JSON Data

Elixir Generate Graphql Schema from Database

Graphql API Backed by a Graph Database

Linked Data Querying with Graphql

Defining Schemas for Property Graphs by Using the Graphql Schema Definition Language