Quick viewing(Text Mode)

Digree: a Middleware for a Graph Databases Polystore

Digree: a Middleware for a Graph Databases Polystore

2016 IEEE International Conference on Big Data (Big Data)

Digree: A for a Graph Polystore

Vasilis Spyropoulos, Christina Vasilakopoulou, Yannis Kotidis Department of Informatics, Athens University of Economics and Business, Athens, Greece [email protected], [email protected], [email protected]

Abstract— In this paper we present Digree, an experimental to temporarily store the partial results and perform the middleware system that can execute graph pattern required operations so as to produce the final result . The queries over databases hosting voluminous graph datasets. underlying endpoints storing the graph data remain fully First, we formally present the employed data model and the processes of re-writing a query into an equivalent set of functional and can, at the same time, continue to run as subqueries and subsequently combining the partial results into standalone systems. the final result set. Our framework guarantees the correctness In the evaluation of our prototype, we used native graph and completeness of the produced answers. Then, we present databases at the nodes and a PostgreSQL DBMS at the a prototype implementation of Digree, which is agnostic to the middleware. We are currently working to extend support to underlying data processing engines used at the endpoints. As the experimental results show, in many cases Digree outper- other types of endpoints, such as relational databases and forms a single node graph deployment in execution big data systems like Spark [9]. The latter can also be used speed, up to 20 times depending on the query type. to implement the functionality of our middleware. Keywords-graph pattern matching; graph databases Our contribution can be summarized as follows: • We study the evaluation of a general graph pattern I. INTRODUCTION matching query in a polystore [10] integrating inter- linked graph databases. We formalize the process of Graph data processing is a notable example of the “one decomposing a query into a set of smaller patterns, via size does not fit all” paradigm. This is due to both the a series of transformations. These subqueries can be inherent heterogeneity of the graph data and the diversity executed in parallel, and their results are used to form of the different computations that can be performed on the final result set. them. Consequently, existing approaches utilize relational • We present a prototype middleware system termed databases [1], [2], big data systems [3], [4], [5], RDF Digree that implements the aforementioned ideas in triple stores [6], [7], or a combination of the above. In this order to orchestrate the query decomposition and result fragmented environment, it is quite possible that applications set composition tasks. We describe a modular imple- will have to access graph data lying on different ecosystems, mentation that enables us to use different graph data using different underlying storage representations and offer- implementations at the distributed endpoints. ing different query languages to access this data. • We present a series of experiments running on a proto- In this work, we present a middleware approach that per- type implementation of Digree. We run a set of pattern mits execution of pattern matching queries over distributed matching queries against three different datasets and or interlinked big-graph datasets hosted by such a network of discuss the and capabilities of Digree. independent data sources called endpoints. The middleware receives graph pattern matching queries, splits them in a II. OVERVIEW suitable manner that permits their efficient parallel execution Our work concerns the process of efficiently querying and then assembles all partial results in order to generate the a polystore of interlinked graph databases. The graph data final result set. We introduce a solid theoretical framework model that we employ is one of the most widely adopted that ensures the correctness of this process. This framework models, namely the labeled property graph model, which is generic, in the sense that it is not tied to a particular is made up of nodes, relationships, properties and labels. implementation or query language. Consequently, given a pattern query, i.e. a Our prototype system, termed Digree, adopts a flexible with vertices and edges possibly with labels and properties, architecture where requests for local data processing on the the fundamental task is to find subgraphs of the database that endpoints are made by implementing a very basic interface are isomorphic (structurally and semantically) to the pattern so as to be able to ask for computation of simple query. This belongs to the (exact) pattern matching problem, expressions. Such interfaces can be implemented for native specifically in terms of subgraph [11]. graph data management systems [8] but also for relational The middleware system that we propose takes as input or big data deployments as well. The proposed middleware a pattern query and essentially divides it into all required in our prototype is built around a DBMS that is used smaller parts that are executed in parallel over all graph

978-1-4673-9005-7/16/$31.00 ©2016 IEEE 2580 database partitions [12],[13]. The middleware then appropri- typically partitioned via some , and edges’ location ately combines the partial results to produce the global result is determined by their source . For our purposes, we set. As it will be made clear from the discussion, Digree create a duplicate of the end vertex of the cross-partition can utilize any system or combination of edges, labeled with REF ; the original vertex obtains a label those hosting the graph partitions. All that is required is the RF D, so that the graph database is in fact decomposed in existence or implementation of the respective basic API calls parts G1, ..., Gm that constitute an edge partition. for querying path expressions in the underlying systems. For example, consider a distributed graph database Bellow, we can see the operations that decompose the with two partitions, G1 ∪ G2 = G, and a subgraph input query and re-synthesize the partial results: H ⊆ G with VH = {k, , m, n, o} and EH = {(k, l), (l, o), (n, m), (m, l)}. If the partition algorithm sends Query ResultSetO {k, n} to G1 and {m, l, o} to G2, then {(k, l), (n, m)}∈G1 and {(m, l), (l, o)}∈G2. A graphic representation is

RF D G1 n m G2  Path queries Path resultsO   REF REF RF D / k /l / m l o  (2) Distributed queries DistributedO results Some basic properties of the above procedure are the following:  • A source node is always in the same partition as its Fragment queries Fragment results OO edge, whereas its target inside the same partition may be a REF copy of the RF D vertex placed elsewhere. • REF vertices necessarily have no outgoing edges. parallel partition searches • The set of all REF nodes, as well as that of all RF D  ) nodes, equals I; each RF D node may have REF PrimaryFragments P artitions results duplicates in more than one partition. Section III mathematically formalizes the above operations A path cover is a set of disjoint paths in G which together and illustrates the process via a running example. contain all vertices; we denote a k-path as P =(x1, ..., xk). If we don’t allow paths of length 0, namely single vertices, a III. QUERY REWRITE AND RESULTS COMBINATION path cover {Gi} forms an edge partition of G. The elements A. Preliminaries in the intersection I of the Vi’s are now called join vertices. A (finite) directed graph is a pair G =(V,E) where V is A weakly connected graph is a graph where there exists the finite set of vertices and E ⊆ V × V is the finite set of an undirected path between every pair of vertices. In what edges; the first of an edge pair is the source and follows, our initial pattern query will always be weakly con- the second is the target. The basic graph theoretic definitions nected, since otherwise we could take its weakly connected given below can be found e.g. in [14], [15]. components and perform the transformations separately. An edge partition {Ei} of G is a set of non-empty disjoint Our basic task is to identify subgraphs of the graph subsets whose union gives E. If we define {Vi} to be the database which are graph-isomorphic to an input query. set of source and target nodes of edges in {Ei}, every Gi = Two graphs G, H are isomorphic when there is an edge- ∼ (Vi,Ei) is a graph on its own. The family {Gi} of edge- preserving bijection f : VG = VH , i.e. such that (x, y) ∈ ∼ disjoint subgraphs is a decomposition of G, EG ⇔ (f(x),f(y)) ∈ EH (hence also EG = EH ). As mentioned in the overview, our system is focused on G = G1 ∪ ... ∪ Gm =(∪iVi, ∪iEi). the labeled property graph data model, elsewhere called di- Notably, {Vi} are not disjoint in principle: if two ad- rected labeled typed/attributed graph. Vertices can be viewed jacent edges are located in different partitions, the vertex as tuples with a unique id, certain labels and properties in-between is ‘duplicated’ and exists in both respective (attributes), whereas edges have a source and target vertex, subgraphs. Those elements lie in the set labels and properties. However, in order to emphasize the i=j underlying general techniques and ideas, presently we em- I = (Vi ∩ Vj)=(V1∩V2)∪(V1∩V3)∪...∪(Vm-1∩Vm). ploy the abstract representation of a plain directed graph. 1≤i,j≤m (1) B. Graph query rewrites The structure of the distributed graph database is as fol- 1) Pattern query to path pattern queries: Consider an ar- lows. Starting with the whole graph G =(V,E), vertices are bitrary pattern query Q =(V,E). The initial transformation

2581 decomposes the query into a list of edge-disjoint paths, i.e. consider {x1,xk} as breakpoints when splitting a path in all specifies a path cover for Q. Any path covering algorithm possible non-zero smaller paths. from the literature will do and the choice is orthogonal to Since a path can be up at multiple nodes simultane- our techniques. We chose to use an all-paths algorithm to ously, there are as many distributed queries, i.e. our initial discover all possible paths between outer vertices or join path query together with a specific choice of breakpoints, as candidate vertices and then select the largest possible paths the size of the powerset P(VP \{x1,xk}). This contains all k-2 that constitute a path cover of the query. The choice is in possible subsets of {x2, ..., xk-1}, and its size is 2 .For a sense independent of the rest of the method: changing it simplicity of notation, we denote only affects the intermediate steps and not the final results. { }⊆{ } We thus obtain specific simple paths (in a specific order) xi1 ,xi2 , ..., xiv x2, ..., xk−1 as   s(i i ..i ) ∈P(V \{x1,xk}). 1 2 n 1 2 v Q ,Q , ..., Q : Vi = V and Ei = E = ∅ 1≤i≤n 1≤i≤n Write s∅ , which corresponds to the whole path (no breakpoints). We can now define a function Therefore this well-defines a function k-2 G : PPQ /DPQ/ 2 F : PQ /PPQ/ n (3) // 1 n P ((P, s∅), (P, s(2)), .., (P, s(23)), .., (P, s(23...k 1))) Q /(Q/ , ..., Q ) - (5) where PQ is the set of all (weakly connected) pattern where DPQ is the set of all distributed queries. If we start queries and PPQis the set of path pattern queries. The n-th with F as in (3), we can compose it with G for each path 1 n PPQn = PPQ× ... × PPQ is as usually Q , ..., Q : defined as the set of n-tuples. Notice that this function, like G ×..×G k1-2 kn-2 G : n −−−−−−−1 →n 2 × × 2 all functions that follow, is ‘stable under isomorphism’: for PPQ DPQ ... DPQ (6) ∼ ( )=( 1 n) i ∼ i R = Q, F R R , ..., R with R = Q . with mapping As a demonstrating example, consider a pattern query Q ( 1 n) → with V = {a, b, c, d, e} and E = {(a, b), (b, c), (e, d), (d, b)} Q , ..., Q (( 1 ) ( 1 ) ( n ) ( n )) // // Q ,s∅ , .., Q ,s(2..k1-1) , .., Q ,s∅ , .., Q ,s(2..kn-1) . a bOO c (4) Explicitly, given an arbitrary pattern query, F first decom- dOO poses it into simple paths of lengths k1, ..., kn and then G1 × ... × Gn produces all acceptable combinations of e breakpoints of all path queries. As an example, we identify the distributed queries The path decomposition operation produces the simple path 1 1 2 for Q as in (4). For the path Q =(x1,x2,x3,x4), queries Q =(e, d, b, c), Q =(a, b) and the single join VQ1 \{x1,x4} = {x2,x3} so there exist k1-2=2 vertex is I = VQ1 ∩ VQ2 = {b}. For simplicity, we denote 2 1 possible breakpoints. Their 2 =4combinations are the 4-path (e, d, b, c) as Q =(x1,x2,x3,x4) and the 2-path {∅, {x2}, {x3}, {x2,x3}}, thus the function producing its (a, b) as (y1,y2). Hence the image of Q under the function distributed queries is F : PQ → PPQ× PPQ is // 1 2 G1 : PPQ DPQ × DPQ × DPQ × DPQ F (Q)=(Q ,Q )=(e → d → b → c, a → b)   1 // 1 1 1 1 Q (Q , ∅), (Q , {x2}), (Q , {x3}), (Q , {x2,x3}) =((x1,x2,x3,x4), (y1,y2) | x3 ≡ y2) 2 =( ) 2 \{ } = ∅ 2) Path to Distributed pattern queries: Suppose we have For the path Q y1,y2 ,wehaveVQ y1,y2 , 20 =1 : a path pattern query P =(x1, ..., xk). The next transforma- i.e. no possible breakpoints. Hence and G2 → ( 2)=( 2 ∅) tion determines all the ‘breakpoints’ of the path, in order PPQ DPQ with G2 Q Q , . In total, we have G◦ = → 2 → 4+1 to identify all its possible separated sub-parts inside the the composite function F PQ PPQ DPQ partitions of the distributed graph database. mapping Q to  1 1 1 1 Consider the set VP \{x1,xk} = {x2, ..., xk-1} with size G( ( )) = ( ∅) ( { }) ( { }) ( { }) F Q Q , , Q , x2 , Q , x3 , Q , x2,x3 , k-2. This contains precisely the candidate nodes where the 2 (Q , ∅) | x3 = y2 (7) path can be split – with minimal part a single edge – due to the structure of the distributed graph database: the start These are the two paths with chosen breakpoints: node is necessarily in the same partition as the first edge of ( → → → → x → → the path, whereas the end node appears inside the last edge’s x1 x2 x3 x4,x1 2 x3 x4, partition, even with a REF label. Thus we do not need to x1 → x2 → x3 → x4,x1 → x2 → x3 → x4,y1 → x3).

2582 3) Distributed to Fragment pattern queries: The trans- with mapping, for some pattern query Q, formation that follows employs the paths’ breakpoints in- (( 1 ) ( 1 ) ( n ) ( n )) formation to actually split them into smaller paths. Observe Q ,s∅ , .., Q ,s(2..k1-1) , ..,_ Q ,s∅ , .., Q ,s(2..kn-1) how each element of P(VP \{x1,xk}) uniquely corresponds  to a specific path cover of the simple path P , e.g. {xm}↔ ( 1 1 n n ) Q , .., Q(2..k1-1)k1-1, .., Q , .., Q(2..kn-1)kn-1 . {(x1, .., xm), (xm, .., xk)}. Hence any distributed query can         equivalently be written as (P1, .., Pr), where all Pi’s are =1 k-2 +2 k-2 +3 k-2 + +( 1) k-2 Notice that if T 0 1 2 ... k- k-2 , subpaths such that ∪Pi = P , and the end node of each Pi 2·(k 2) 3·(k 2) k 1 ∼ T concides with the start node of Pi+1. PPQ×PPQ - ×PPQ - ×...×PPQ - = PPQ Based on that, we can describe the fragment pattern H T1 × × Tn queries (or just fragments) in which every distributed query hence the image of is PPQ .. PPQ . divides into. Some general facts are the following. Back to the example query (4), having computed its # # +1 distributed queries in (7), we can now identify its fragment • fragment queries= breakpoints .   1 =4 2 =1 k-2 queries.  For Q ,wehavek1 so there are 0 , • # distributed queries with v +1 fragments= , e.g. 2 2   v =2, =1distributed queries with 0+1 = 1, k-2 =1 1 1 2 – 0 distributed query with fragment, for 1+1 = 2, 2+1 = 3 fragment queries respectively. The  4 2 2 3 choosing  0 breakpoints, i.e. the whole P ; function H1 : DPQ → PPQ× PPQ × PPQ × PPQ k-2 = 2 2 – 1 k- distributed queries with fragments, has as image the list of fragment queries for choosing all singletons as breakpoints; k-2 1 1 1 1 1 1 1 1 – =1distributed query with k-1 fragments, for (Q , (Q(2)1,Q(2)2), (Q(3)1,Q(3)2), (Q(23)1,Q(23)2,Q(23)3)) = k-2  choosing k-2 breakpoints, i.e. decomposing into all ( ) ( ) ( ) ( ) x1,x2,x3,x4 , x1,x2 , x2,x3,x4 , x1,x 2,x3 , its edges. ( ) ( ) ( ) ( ) k 2 x3,x4 , x1,x2 , x2,x3 , x3,x4 • # all distributed   queries is recovered to be 2 - = k-2 + k-2 + k-2 + + k-2 2 =2  : → 0 1 2 ... k-2 . For Q we have k2 so H2 DPQ PPQ with image 2 We can thus express distributed queries in terms of frag- the only fragment query Q =(y1,y2). The product of those ments two functions is composed with G ◦ F to give F G H r −→ 2 −→ 5 −→ (1+2+2+3)+1 H : DPQ /PPQ/ (8) PQ PPQ DPQ PPQ // (P, s(−)) (P(−)1, ..., P(−)r) with mapping the total list of 10 fragment queries  → → → → → → where P(−)i is determined by the chosen breakpoints s(−): x1 x2 x3 x4,x1 x2,x2 x3 x4, → → → → → x1 x2 x3,x3 x4,x1 x2,x2 x3, (P, s(i ..i )) ↔ (P(i ..i )1,P(i ..i )2, .., P(i ..i )v+1). 1 v 1 v 1 v 1 v x3 → x4,y1 → y2 | y2 ≡ x3 (10) Starting with the function G as in (5), we can compose it  4) Fragment queries to Primary Fragments: In the final with the cartesian product H1 × .. × H2k-2 = H for all list of all fragments (P, P(2)1, ..., P(2..k-1)k-1) for a path, different subsets (−), namely s some entries turn out to be identical, e.g. P(2)1 =(x1,x2)= (23)1 2k-2 P . In this section, the described transformation distin-   guishes all the unique elements from that list. k 2 k 2 2k-2 2( - ) 3( - ) k 1 DPQ →PPQ× PPQ 1 × PPQ 2 × .. × PPQ - We thus consider the primary fragments of a path, i.e. all distinct subpaths that are essential to build it up in all ways; with mapping for P =(x1, .., xk) they are the following:

· #(k-1) 2-paths (x1,x2), (x2,x3),...,(xk-1,xk); ((P,s ),(P,s ),..,(P,s ),..,(P,s ))→ ∅ (2) (23) (2..k-1) · #( 2) 3 ( ) ( ) ( ) k- -paths x1,x2,x3 ,..., xk-2,xk-1,xk ; ... (P,(P ,P ),..,(P ,P ,P ),..,(P ,..,P )). · # ( 2) = 2 [ 1] ( ) ( ) (2)1 (2)2 (23)1 (23)2 (23)3 (2..k-1)1 (2..k-1)k-1 k- k-  k- -paths x1, .., xk-1 , x2, .., xk ; · # k-(k-1) = 1 k-path (x1, .., xk). Combining the transformations (3),(5) and (8), we can   In total, we have the above (k-1) + (k-2) + ... +1 = compose G1 × ... × Gn as in (6) with H = H1 × ... × Hn (k-1)k 2 subpaths, which in combination make up all fragment

k -2 kn-2 queries. 2 1 × × 2 DPQ ... DPQ In order to correspond these to fragments of the previous H  general form P(−)i, so as to be able to discriminate the latter (PPQ× ... × PPQk1-1) × ... × (PPQ× ... × PPQkn-1) between primary and non-primary, we make the following (9) choices which focus on their source/target:

2583 - subpaths of the form (x1, .., xi) or (xi, .., xk) — in- pattern query Q, it produces the set of its primary fragments cluding the start or end path vertex — are P(i)1 or M(Q):=PFQ. Graphically, P(i)2; F / n G / 2k1-2 2kn-2 - subpaths of the form (xi, .., xj) where 1

2584 Define the following four classes of partition fragment FRSQ is the set containing all the above sets, with the result sets, corresponding to the four types of primary condition that y2 = x3. fragments for each path Qu as in (12): Remark 1: There may be cases when certain primary fragments of some pattern query Q are graph-isomorphic. ( u)={ ¯ ∈ | ¯ =∼ u} FRSGw Q Q Gw Q Q (16) It is then not required to search the partitions multiple u u ( )={ ¯ ∈ | ¯ ∼ ∧ ∈ } FRSGw Q(i)1 Q Gw Q = Q(i)1 eQ¯ Rw times; rather it seems optimal to first identify them and only u ¯ ¯ ∼ u G ( )={ ∈ w | = ∧ ¯ ∈ w} query once. For the above example, PFQ as in (15) already FRS w Q(i)2 Q G Q Q(i)2 sQ D 1 ∼ 1 ∼ u u includes some isomorphic fragments, e.g. Q(2)1 = Q(3)2 = ( )={ ¯ ∈ | ¯ =∼ ∧ ∈ ∧ ∈ } FRSGw Q(ij)2 Q Gw Q Q(ij)2 eQ¯ Rw sQ¯ Dw 1 ∼ 2 Q(23)2 = Q : these are all (isomorphic) 2-paths. Hence if we 2 1 For each primary fragment of each path of Q , we take the populate the set/table FRS , we can then obtain FRS(2)1, n ku(ku-1) 1 1 union of these sets to form the following u=1 2 FRS(3)2 and FRS(23)2 by selecting the ones with the extra fragment result sets label restrictions, rather than querying three more times. 2) Fragment Result Sets to Distributed Result Sets: After u ≡ ( u ):= ( u )∪ ∪ ( u ) FRS− FRS Q− FRSG1 Q− ... FRSGm Q− . querying the database to specify (populate, when viewed as (17) tables) all the separate FRS(Qu )’s in the previous step, u = { ¯ ∈ | ¯ =∼ u} − Specifically, we have FRS Q G Q Q , we can form sets of isomorphic instances of the distributed u = { ¯ ∈ | ¯ =∼ u ∧ ∈ } FRS(i)1 Q Gw Q Q(i)1 eQ¯ Rw for some w , queries for Q, which will be constructed ‘algebraically’ from u = { ¯ ∈ | ¯ =∼ u ∧ ∈ } FRS(i)2 Q G Q Q(i)2 sQ¯ DG the fragment result sets. u = { ¯ ∈ | ¯ ∼ u ∧ ∈ and FRS(ij)2 Q Gw Q = Q(ij)2 eQ¯ An important advantage of viewing result sets as relational Rw for some w}. These sets contain all isomorphic in- tables is that we can join them in the usual sense, i.e. employ stances of the primary fragments in the database, all together extension of natural join of binary relations, as described in large collections, where the partition from which each of in [16]. We can construct various joins of the fragment result them originates is forgotten. sets of some path P , e.g. u In general, it is convenient to think of these sets FRS− as = {( ) | relational tables, with attributes the ‘names’ of the vertices FRS(i)1 F R S (i)2 y1, ..., yk of Q they refer to, and tuples the graph-isomorphic paths (y1, ..., yi) ∈ FRS(i)1 ∧ (yi, ..., yk) ∈ FRS(i)2}. appearing in the partitions. The other cases needed for our purposes are FRS(i)1 We can now define the full set of fragment result sets for FRS(ij)2, FRS(ij)2 F R S (jk)2 and FRS(ij)2 a specific pattern query Q: FRS(j)2, for all appropriate i, j, k. 1 n u u u u For Q decomposed in paths {Q , ..., Q }, we define the FRSQ = {FRS ,FRS(i)1,FRS(i)2,FRS(ij)2} distributed result sets 1 ≤ ≤ FRS = with u n and appropriate ranges for i, j.If u ( u ∅)= u { | } DRS1 ≡DRS Q , FRS (19) FRSQ any Q is the set of all sets of fragment result u u u u ≡ ( )= sets for arbitrary pattern queries, consider the function DRS(i) DRS Q ,s(i) FRS(i)1 F R S (i)2 u u u u u ≡ ( )= F : PF /FRS/ DRS(ij) DRS Q ,s(ij) FRS(i)1 F R S (ij)2 F R S (j)2 (18) u u u u DRS ≡DRS(Q ,s(ijk))=FRS F R S = { u } // = { u } (ijk) (i)1 (ij)2 PFQ Q− FRSQ FRS− u u F R S (jk)2 F R S (k)2 etc.

ku(ku−1) Clearly, the size of FRSQ is also 2 . where each set is formed following the reconstruction rules For our example pattern query as in (4), the 7 fragment of fragment queries from primary fragments from (11). result sets are of the form Obviously, for any such Q with lengths of paths k1, .., kn, 1 the total number of the distributed result sets/tables is FRS1 := FRS ={[x1 → x2 → x3 → x4]} n k 2 2 u- . 1 u=1 FRS2 := FRS(2)1 ={[x1 → x2] | x2 is REF } We can now define the set of all distributed result sets 1 FRS3 := FRS(2)2 ={[x2 → x3 → x4] | x2 is RF D} u u u DRSQ = {DRS1 ,DRS(i),DRS(ij), ... | 1 ≤ u ≤ n} 1 FRS4 := FRS ={[x1 → x2 → x3] | x3 is REF } (3)1 as well as the set of all sets of distributed result sets for := 1 ={[ → ] | } FRS5 FRS(3)2 x3 x4 x3 is RF D arbitrary pattern queries, DRS = {DRSQ | any Q}.Wecan 1 FRS6 := FRS(23)2 ={[x2 → x3] | x2 is RF D ∧ x3 is REF } now define a function using the join construction formulas 2 // FRS7 := FRS ={[y1 → y2]} D : FRS DRS = { u } // = { u u } where [−] denotes all the graph-isomorphic instances of the FRSQ FRS− DRSQ DRS1 ,DRS(i),.. path arising in some partition of G. The mapping F(PFQ)= (20)

2585 In our example (4), we have the following 22 +20 =5 where the natural join is performed over the common (join) distributed result sets: vertices of the paths. Consequently, we can view the set BRSQ as a table, with attributes the names of the discrete DRS1={[x →x →x →x ]} 1 1 2 3 4 vertices of Q, and tuples all of its isomorphic subgraphs DRS1 ={x →x →x →x |x →x ∈FRS1 ∧x →x →x ∈FRS1 } (2) 1 2 3 4 1 2 (2)1 2 3 4 (2)2 inside G.IfBRS = {BRSQ |∀Q ∈ PQ}⊆P(PQ) is the 1 1 1 DRS(3)={x1→x2→x3→x4|x1→x2→x3∈FRS(3)1∧x3→x4∈FRS(3)2} set of all sets of base result sets for arbitrary Q’s, the above 1 1 1 join defines the transformation DRS(23)={x1→x2→x3→x4|x1→x2∈FRS(2)1∧x2→x3∈FRS(23)2 ∧x →x ∈FRS1 } 3 4 (4)2 B : PRS /BRS/ (24) DRS2={[y →y ]} 1 1 2 1 n // PRSQ = {PRS , ..., P RS } BRSQ. Notice that these sets contain multiple graph-isomorphic 1 2 paths to Q and Q obtained in different ways, and the The full process of assembling the results of primary ‘names’ xi are only used as attributes of the tables where the fragments in the partitions into isomorphic instances of the tuples of these sets are stored in. The image D(F(PFQ)) = initial query is thus given by the composite transformation DRSQ contains all these sets, with the condition y2 = x3. of (18,20,22,24) 3) Distributed Result Sets to Path Result Sets: Having specified all isomorphic distributed queries arising in the PF F /FRS D /DRS database — which are in fact the paths Q1, ..., Qn formed P by joining fragments in different intermediate nodes each PRS time — we now wish to assemble the whole result table for L B each path, ignoring the way they were constructed. - u BRS Define the path result set to be the union of those DRS(−) u ∈{ 1 n} for each path Q Q , ..., Q individually: For the example query Q as in (4), we end up with u u u u u PRS = DRS ∪ DRS ∪ DRS ∪ .. ∪ DRS 1 (2) (3) (23..k-1) B(P(D(F( )))) = 1 2 = (21) PFQ PRS P R S 1 n 1 2 Consider the set PRSQ = {PRS , ..., P RS } of all path {(x1,x2,x3,x4,y1)|(x1,x2,x3,x4)∈PRS ∧(y1,x3)∈PRS } result sets. If PRS = {PRSQ | any Q} contains all sets of path result sets for arbitrary pattern queries, we form the where the join is performed over the unique join vertex transformation in J = {x3}. Notice that this result set BRSQ of all P : DRS /PRS/ isomorphic subgraphs to Q would contain also H of the distributed setting example (2): u n // u n DRSQ = {DRS(−)}u=1 PRSQ = {PRS }u=1. ( REF ) ∈ 1 ∧ ( RF D ) ∈ 1 ⇒ (22) n, m FRS(2)1 m ,l,o FRS(2)2 1 1 For our example query (4), we obtain the following path (n, m, l, o) ∈ DRS(2) ⊆ PRS , 1 2 2 2 2 result sets for Q and Q : (k, l) ∈ FRS = DRS = PRS . PRS1=DRS1∪DRS1 ∪DRS1 ∪DRS1 ={x →x →x →x | 1 (2) (3)  (23) 1 2 3 4  x1→x2→x3→x4∈FRS1∨ x1→x2∈FRS2∧x2→x3→x4∈FRS3 ∨ We conclude with a result which ensures the validity of   our method (we omit the proof due to space restriction). x1→x2→x3∈FRS4∧x3→x4∈FRS5 ∨  Theorem 1: By performing the series of transformations x1→x2∈FRS2∧x2→x3∈FRS6∧x3→x4∈FRS5) L◦Mdepicted by the dotted line 2 2 PRS =DRS1 ={[y1→y2]}. T1 × × Tn /PF Hence the image of the composition of the transformations PPQ ..O PPQ 8 1 2 gives P(D(F(PFQ))) = PRSQ = {PRS ,PRS | x3 ≡  2k1-2 2kn-2 y2}. DPQ × .. × DPQ FRS O M L 4) Path Result Sets to Base Result Sets: Now that we  have gathered all paths that are isomorphic to Q1, ..., Qn n PPQOO DRS inside the distributed graph database, we recall that the join Õ  L◦M vertices of the weakly connected pattern query Q constitute PQ /BRSoo PRS common ‘attributes’ of the path queries results when stored in tables. Thus we can form the base result set we obtain all isomorphic instances of the initial query Q in 1 2 n BRSQ = PRS P R S . . . P R S (23) the distributed graph database.

2586 IV. EXPERIMENTAL EVALUATION We set a time limit of 1000 seconds for the queries to finish A. Implementation Details and Setup execution, in order to reduce the effect of strangling queries in the average computation for either system. In Figure 2(a) The Digree prototype is implemented as a middleware we provide the average execution time of the finished queries system that spans over a set of distributed graph databases. run in single-node and Digree, while in Figure 3(a) we It consists of two separate software units, the master and the present the percentage of the queries that returned their result slave. A basic Digree deployment is consisted of one master set within the time limit. We can see that for the smaller node, k slave nodes and an extra database node. Each slave datasets (amazon and youtube) the differences between the node is assigned a graph partition that is loaded in a graph two implementation are not significant. Nevertheless, Digree data management system (e.g. graph database, relational manages to produce reduced query times because of the database or other). While in our prototype implementation effect of parallel execution. For the larger dataset (twitter), we use Neo4j to manage each graph partition, this is not where the graph cannot fit in the main memory of the a restrictive choice. In order to use a different data store single-node instance, the results are more interesting. Not (or a combination of different data stores), all is required only execution time is more than halfway down, but Digree is the interfaces to the respective query languages or APIs. also manages to answer more queries than the single-node We deployed our system on a cluster consisting of 18 Linux (96% against 89%) within the time limit. For the rest of the virtual machines. Each machine had 4 cpus and 8 GB of experiments we present results only for the larger dataset memory. We used one machine to run the master process, (twitter) since Digree is aimed for big datasets that don’t one machine to host a PostgreSQL database server and the easily fit on main memory of a single node. rest 16 machines to host the partitions of the graph database 2) Mutated Pattern Queries: We continued by using the (one Neo4j database per node) and the slave processes. For pattern queries from the previous experiment and creating comparison purposes we used a virtual machine of the same a number of mutations for each of those. These mutations specifications to run the queries on a stand alone Neo4j have been created by randomly choosing a vertex from the instance loaded with the unpartitioned dataset (from now graph query and attaching a new vertex to it, randomly on refered to as single-node). incoming or outgoing. We refer to the number of vertices B. Datasets added as the mutation . For each pattern query we created mutations with degree from 1 to 5 and for each one We used three real world datasets for our experiments. we created 10 query instances, each with a different random The first two are the amazon product network1 [17] and the set of labels. We applied the same time limit as before and youtube video graph.2 The third dataset that we used is an the results are shown in Figures 2(b) and 3(b). For this more anonymized twitter users graph parsed by our team. The challenging set of queries Digree reduces execution time details of the datasets are shown in Table I (size refers to a to less than one third of the time required for single-node single partition Neo4j database size on disk). For partitioning execution. Moreover, it manages in answering more queries the smaller datasets (amazon and youtube) we used the within the time limit than the single-node. popular METIS [18] algorithm, while for the twitter dataset 3) Path Queries: Digree utilizes path queries as its basic we used the online partitioning algorithm LDG [19] since tool of decomposing a complex graph pattern. In the final ex- METIS could not handle that large an input. periment we demonstrate the effect of parallel processing of Table I path queries that result in the reduced query times observed DATASETS OVERVIEW in the previous experiments. We used path queries varying in length from 3 to 8 vertices. For each path length we created #nodes #edges #labels size 100 query instances, each with a different random set of amazon 548.552 1,788,725 11 99.6 MB labels and we used the same time limit as before. The results youtube 155,513 2,969,826 14 161.8 MB twitter 35,648,794 910,526,369 232 30.85 GB are presented in Figures 2(c) and 3(c). Digree manages to execute the queries in a small fraction of the time requirered by single-node, showing a small increase in execution time C. Experiments as the length of the path increases. Digree is between 9 and 20 times faster that the single-node deployment. 1) Pattern Queries: For each of the datasets we built a set of pattern matching queries. Specifically we used seven V. R ELATED WORK pattern queries used in [1] (depicted in Figure 1). For each There exists a number of distributed graph databases. of the datasets and for each pattern query we created ten Trinity [20] is an in-memory distributed graph database instances, each with a different random set of node labels. used for online query processing and offline analytics on 1https://snap.stanford.edu/data/ large graphs as well as for subgraph matching [21]. Titan 2http://netsg.cs.sfu.ca/youtubedata/ is an open source graph database layered upon a distributed

2587 A A C C A E A A C D B C B

B A D B E B A D D B D C B E E D F C C Q1 Q2 Q3 Q4 Q5 Q6 Q7

Figure 1. The pattern queries used in the experiments

(a) pattern queries (b) mutated pattern queries (c) path queries

Figure 2. Average execution time

(a) pattern queries (b) mutated pattern queries (c) path queries

Figure 3. Percentage of queries answered within time limit of 1000 seconds

NoSQL database like HBase. ThingSpan [22] is a commer- propose a distributed solution for graph mining. In [27] the cial graph analytics platform that handles very large graphs authors present GraphMat, a framework for writing high by utilizing a number of open source big data tools. Digree performance parallel graph taking advantage of can act as a middleware so as to utilize any such system or multicore architectures in a single-node deployment. In [28] combination of systems deployed at the graph partitions. the authors present PSgL, a distributed solution for subgraph listing, which is a special case of matching when all vertices A number of systems were developed for distributed graph have the same label. A large amount of related work exists processing such as Google Pregel [4], it’s open source in the context of semantic web. In [6] the authors propose an alternative Apache Giraph [23] and GraphX [3]. These approach where a SPARQL query is executed at all partitions systems are not specialized in graph pattern matching but and partial RDF matches of it are then assembled so as to are frameworks that provide a programming model for the build the cross-partition results, while in [29] the authors user to develop and deploy graph algorithms. focus on data partitioning and propose a mapreduce based A work that concentrates on distributed pattern matching join solution for inter-partition query answering. In [30] the can be found in [1] where the authors use relational data authors propose a partitioning of the RDF data that adopts storage and their work is closely related to traditional to workload changes in order to better support queries. database system query optimization. Digree can easily oper- The authors in [7] focus on graph partitioning and how it ate over relational graph management solutions, by affects their mapreduce based system performance. In [31] the required API calls. In [24] the authors propose algo- the authors examine how caching of SPARQL query results rithms that use the message passing model and apply their can favor execution of queries over large RDF graphs. ideas for graph simulation [25], while in [26] the authors

2588 VI. CONCLUSION [13] I. Filippidou and Y. Kotidis, “Online partitioning of multi- In this paper we presented Digree, a middleware to handle labeled graphs,” in Proceedings of the GRADES, 2015. graph pattern matching queries over a distributed graph [14] R. Diestel, Graph , ser. Graduate Texts in Mathematics. database or inter-linked graph databases. We developed a Springer, 1997, no. 173. solid theoretical base to rewrite a graph query so as to [15] J.-A. Bondy and U. S. R. Murty, Graph theory, ser. Graduate be executed in a distributed setting in parallel, but also to texts in mathematics. New York, London: Springer, 2007. combine back the partial results into the final result set. We presented a prototype implementation of Digree and [16] E. F. Codd, “A relational model of data for large shared data demonstrated its capacity in reducing execution times of banks,” Commun. ACM, vol. 13, no. 6, pp. 377–387, 1970. complex pattern matching queries, especially for datasets [17] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The that do not fit in main memory of a single node. As future dynamics of viral marketing,” ACM Trans. Web, 2007. work, we plan to compare with other systems such as GraphX [3] and to use a variety of systems to manage the [18] G. Karypis and V. Kumar, “Analysis of multilevel graph partitioning,” in Proceedings of IEEE/ACM SC Conf., 1995. graph partitions or the central processing engine of Digree. [19] I. Stanton and G. Kliot, “Streaming graph partitioning for REFERENCES large distributed graphs,” in Proc. of ACM SIGKDD, 2012. [1] J. Huang, K. Venkatraman, and D. J. Abadi, “Query optimiza- tion of distributed pattern matching,” in ICDE, 2014. [20] Y. L. Bin Shao, Haixun Wang, “Trinity: A distributed graph engine on a memory cloud,” in Proc. of SIGMOD, 2013. [2] D. Bleco and Y. Kotidis, “Graph analytics on massive collec- tions of small graphs,” in Proceedings of EDBT, 2014. [21] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion node graphs,” Proc. VLDB [3] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, Endow., 2012. “Graphx: A resilient distributed graph system on spark,” in [22] (2016, aug) Thingspan. [Online]. Available: http://www. First International Workshop on Graph Data Management objectivity.com/ Experiences and Systems, ser. GRADES ’13, 2013. [23] M. Han and K. Daudjee, “Giraph unchained: Barrierless asyn- [4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, chronous parallel execution in pregel-like graph processing N. Leiser, and G. Czajkowski, “Pregel: A system for large- systems,” Proc. VLDB Endow., vol. 8, no. 9, May 2015. scale graph processing,” in Proc. of the ACM SIGMOD, 2010. [24] S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed graph pattern [5] V. Spyropoulos and Y. Kotidis, “Dynamic partitioning of big matching,” in Proceedings of WWW, 2012, pp. 949–958. hierarchical graphs,” in Proceedings of the First International Workshop on Big Dynamic Distributed Data, 2013. [25] W. Fan, X. Wang, Y. Wu, and D. Deng, “Distributed graph simulation: Impossibility and possibility,” Proc. VLDB En- [6] P. Peng, L. Zou, M. T. Ozsu,¨ L. Chen, and D. Zhao, dow., vol. 7, no. 12, pp. 1083–1094, Aug. 2014. “Processing sparql queries over distributed rdf graphs,” The VLDB Journal, vol. 25, no. 2, pp. 243–268, Apr. 2016. [26] C. H. C. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga, “Arabesque: A system for [7] J. Huang, D. J. Abadi, and K. Ren, “Scalable sparql querying distributed graph mining,” in Proceedings of SOSP, 2015. of large rdf graphs,” PVLDB, vol. 4, no. 11, 2011. [27] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, [8] (2016, aug) Neo4j. [Online]. Available: http://neo4j.com/ M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey, “Graphmat: High performance graph analytics made produc- [9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and tive,” Proc. VLDB, vol. 8, no. 11, pp. 1214–1225, Jul. 2015. I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2Nd USENIX Conference on Hot Topics [28] Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu, “Parallel in Cloud Computing, 2010. subgraph listing in a large-scale graph,” in Proc. of the 2014 ACM SIGMOD Int. Conference on Management of Data. [10] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, [29] K. Lee and L. Liu, “Scaling queries over big rdf graphs T. Kraska, S. Madden, D. Maier, T. Mattson, S. Papadopou- with semantic hash partitioning,” Proc. VLDB Endow., vol. 6, los, J. Parkhurst, N. Tatbul, M. Vartak, and S. Zdonik, “A no. 14, pp. 1894–1905, Sep. 2013. demonstration of the bigdawg polystore system,” Proc. VLDB Endow., Aug. 2015. [30] R. Harbi, I. Abdelaziz, P. Kalnis, N. Mamoulis, Y. Ebrahim, and M. Sahli, “Accelerating sparql queries by exploiting hash- [11] B. Gallagher, “Matching structure and semantics: A survey based locality and adaptive partitioning,” The VLDB Journal, on graph-based pattern matching,” AAAI FS, vol. 6, 2006. 2016.

[12] I. Filippidou and Y. Kotidis, “Online and on-demand partition- [31] N. Papailiou, D. Tsoumakos, P. Karras, and N. Koziris, ing of streaming graphs,” in Proceedings of the 2015 IEEE “Graph-aware, workload-adaptive sparql query caching,” in International Conference on Big Data (Big Data), 2015. Proc. of ACM SIGMOD, 2015.

2589