Digree: a Middleware for a Graph Databases Polystore

2016 IEEE International Conference on Big Data (Big Data) Digree: A Middleware for a Graph Databases Polystore Vasilis Spyropoulos, Christina Vasilakopoulou, Yannis Kotidis Department of Informatics, Athens University of Economics and Business, Athens, Greece [email protected], [email protected], [email protected] Abstract— In this paper we present Digree, an experimental to temporarily store the partial results and perform the middleware system that can execute graph pattern matching required operations so as to produce the final result set. The queries over databases hosting voluminous graph datasets. underlying endpoints storing the graph data remain fully First, we formally present the employed data model and the processes of re-writing a query into an equivalent set of functional and can, at the same time, continue to run as subqueries and subsequently combining the partial results into standalone systems. the final result set. Our framework guarantees the correctness In the evaluation of our prototype, we used native graph and completeness of the produced answers. Then, we present databases at the nodes and a PostgreSQL DBMS at the a prototype implementation of Digree, which is agnostic to the middleware. We are currently working to extend support to underlying data processing engines used at the endpoints. As the experimental results show, in many cases Digree outper- other types of endpoints, such as relational databases and forms a single node graph database deployment in execution big data systems like Spark [9]. The latter can also be used speed, up to 20 times depending on the query type. to implement the functionality of our middleware. Keywords-graph pattern matching; graph databases Our contribution can be summarized as follows: • We study the evaluation of a general graph pattern I. INTRODUCTION matching query in a polystore [10] integrating interlinked graph databases. We formalize the process of Graph data processing is a notable example of the “one decomposing a query into a set of smaller patterns, via size does not fit all” paradigm. This is due to both the a series of transformations. These subqueries can be inherent heterogeneity of the graph data and the diversity executed in parallel, and their results are used to form of the different computations that can be performed on the final result set. them. Consequently, existing approaches utilize relational • We present a prototype middleware system termed databases [1], [2], big data systems [3], [4], [5], RDF Digree that implements the aforementioned ideas in triple stores [6], [7], or a combination of the above. In this order to orchestrate the query decomposition and result fragmented environment, it is quite possible that applications set composition tasks. We describe a modular imple- will have to access graph data lying on different ecosystems, mentation that enables us to use different graph data using different underlying storage representations and offer- implementations at the distributed endpoints. ing different query languages to access this data. • We present a series of experiments running on a proto- In this work, we present a middleware approach that per- type implementation of Digree. We run a set of pattern mits execution of pattern matching queries over distributed matching queries against three different datasets and or interlinked big-graph datasets hosted by such a network of discuss the scalability and capabilities of Digree. independent data sources called endpoints. The middleware receives graph pattern matching queries, splits them in a II. OVERVIEW suitable manner that permits their efficient parallel execution Our work concerns the process of efficiently querying and then assembles all partial results in order to generate the a polystore of interlinked graph databases. The graph data final result set. We introduce a solid theoretical framework model that we employ is one of the most widely adopted that ensures the correctness of this process. This framework models, namely the labeled property graph model, which is generic, in the sense that it is not tied to a particular is made up of nodes, relationships, properties and labels. implementation or query language. Consequently, given a pattern query, i.e. a directed graph Our prototype system, termed Digree, adopts a flexible with vertices and edges possibly with labels and properties, architecture where requests for local data processing on the the fundamental task is to find subgraphs of the database that endpoints are made by implementing a very basic interface are isomorphic (structurally and semantically) to the pattern so as to be able to ask for computation of simple path query. This belongs to the (exact) pattern matching problem, expressions. Such interfaces can be implemented for native specifically in terms of subgraph isomorphism [11]. graph data management systems [8] but also for relational The middleware system that we propose takes as input or big data deployments as well. The proposed middleware a pattern query and essentially divides it into all required in our prototype is built around a DBMS that is used smaller parts that are executed in parallel over all graph 978-1-4673-9005-7/16/$31.00 ©2016 IEEE 2580 database partitions [12],[13]. The middleware then appropri- typically partitioned via some algorithm, and edges’ location ately combines the partial results to produce the global result is determined by their source vertex. For our purposes, we set. As it will be made clear from the discussion, Digree create a duplicate of the end vertex of the cross-partition can utilize any graph database system or combination of edges, labeled with REF ; the original vertex obtains a label those hosting the graph partitions. All that is required is the RF D, so that the graph database is in fact decomposed in existence or implementation of the respective basic API calls parts G1, ..., Gm that constitute an edge partition. for querying path expressions in the underlying systems. For example, consider a distributed graph database Bellow, we can see the operations that decompose the with two partitions, G1 ∪ G2 = G, and a subgraph input query and re-synthesize the partial results: H ⊆ G with VH = {k, l, m, n, o} and EH = {(k, l), (l, o), (n, m), (m, l)}. If the partition algorithm sends Query ResultSetO {k, n} to G1 and {m, l, o} to G2, then {(k, l), (n, m)}∈G1 and {(m, l), (l, o)}∈G2. A graphic representation is RF D G1 n m G2 Path queries Path resultsO REF REF RF D / k /l / m l o (2) Distributed queries DistributedO results Some basic properties of the above procedure are the following: • A source node is always in the same partition as its Fragment queries Fragment results OO edge, whereas its target inside the same partition may be a REF copy of the RF D vertex placed elsewhere. • REF vertices necessarily have no outgoing edges. parallel partition searches • The set of all REF nodes, as well as that of all RF D ) nodes, equals I; each RF D node may have REF PrimaryFragments P artitions results duplicates in more than one partition. Section III mathematically formalizes the above operations A path cover is a set of disjoint paths in G which together and illustrates the process via a running example. contain all vertices; we denote a k-path as P =(x1, ..., xk). If we don’t allow paths of length 0, namely single vertices, a III. QUERY REWRITE AND RESULTS COMBINATION path cover {Gi} forms an edge partition of G. The elements A. Preliminaries in the intersection I of the Vi’s are now called join vertices. A (finite) directed graph is a pair G =(V,E) where V is A weakly connected graph is a graph where there exists the finite set of vertices and E ⊆ V × V is the finite set of an undirected path between every pair of vertices. In what edges; the first component of an edge pair is the source and follows, our initial pattern query will always be weakly con- the second is the target. The basic graph theoretic definitions nected, since otherwise we could take its weakly connected given below can be found e.g. in [14], [15]. components and perform the transformations separately. An edge partition {Ei} of G is a set of non-empty disjoint Our basic task is to identify subgraphs of the graph subsets whose union gives E. If we define {Vi} to be the database which are graph-isomorphic to an input query. set of source and target nodes of edges in {Ei}, every Gi = Two graphs G, H are isomorphic when there is an edge- ∼ (Vi,Ei) is a graph on its own. The family {Gi} of edge- preserving bijection f : VG = VH , i.e. such that (x, y) ∈ ∼ disjoint subgraphs is a decomposition of G, EG ⇔ (f(x),f(y)) ∈ EH (hence also EG = EH ). As mentioned in the overview, our system is focused on G = G1 ∪ ... ∪ Gm =(∪iVi, ∪iEi). the labeled property graph data model, elsewhere called di- Notably, {Vi} are not disjoint in principle: if two ad- rected labeled typed/attributed graph. Vertices can be viewed jacent edges are located in different partitions, the vertex as tuples with a unique id, certain labels and properties in-between is ‘duplicated’ and exists in both respective (attributes), whereas edges have a source and target vertex, subgraphs. Those elements lie in the set labels and properties. However, in order to emphasize the i=j underlying general techniques and ideas, presently we em- I = (Vi ∩ Vj)=(V1∩V2)∪(V1∩V3)∪...∪(Vm-1∩Vm). ploy the abstract representation of a plain directed graph. 1≤i,j≤m (1) B. Graph query rewrites The structure of the distributed graph database is as fol- 1) Pattern query to path pattern queries: Consider an ar- lows.

Digree: a Middleware for a Graph Databases Polystore

1. Directed Graphs Or Quivers What Is Category Theory? • Graph Theory on Steroids • Comic Book Mathematics • Abstract Nonsense • the Secret Dictionary

Graph Theory

An Application of Graph Theory in Markov Chains Reliability Analysis

Graph Algorithms in Bioinformatics an Introduction to Bioinformatics Algorithms Outline

Graph Theory Approach to the Vulnerability of Transportation Networks

Graph Theory: Network Flow

Combinatorics and Graph Theory I (Math 688)

Characterizing Levels of Reasoning in Graph Theory

Combinatorics

Combinatorics: a First Encounter

Graph Theoretical Calculation of Systems Reliability with Semi-Markov Processes

On Sets Forming Boolean Algebras and Partite Hypergraphs