Information Systems 40 (2014) 47–66

Efficient processing of label-constraint reachability queries in large graphs

Lei Zou a,*, Kun Xu a, Jeffrey Xu Yu b, Lei Chen c, Yanghua Xiao d, Dongyan Zhao a

a Peking University, No.5 Yiheyuan Road Haidian District, Beijing, China
b The Chinese University of Hong Kong, Shatin, NT, Hong Kong
c Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
d Fudan University, Shanghai, China

Article history: Received 22 November 2012. Received in revised form 13 June 2013. Accepted 1 October 2013. Recommended by: Xifeng Yan. Available online 18 October 2013.

Keywords: Graph database; Reachability query

Abstract

In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we check the existence of a directed path from u1 to u2, where edge labels along the path are a subset of S. We propose the path-label transitive closure method to answer LCR queries. Specifically, we transform an edge-labeled directed graph into an augmented DAG by replacing the maximal strongly connected components with bipartite graphs. We also propose a Dijkstra-like algorithm to compute path-label transitive closure by re-defining the "distance" of a path. Compared with the existing solutions, we prove that our method is optimal in terms of the search space. Furthermore, we propose a simple yet effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal graph partition to minimize query processing cost is an NP-hard problem. Therefore, we propose a sampling-based solution to find a sub-optimal partition. Moreover, we address the index maintenance issues to answer LCR queries over dynamic graphs. Extensive experiments confirm the superiority of our method. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The growing popularity of graph databases has generated many interesting data management problems. One important type of queries over graphs is reachability queries [8,10,13,14,20,21]. Specifically, given two vertices u1 and u2 in a directed graph G, we want to verify whether there exists a directed path¹ from u1 to u2. There are many applications of reachability queries, such as pathway finding in biological networks [16], inferring over RDF (resource description framework) graphs [17], and relationship discovery in social networks [23]. There are two extreme solutions to answer reachability queries. One approach is to materialize the transitive closure of a graph, enabling one to answer reachability queries efficiently. On the other extreme, we can perform DFS (depth-first search) or BFS (breadth-first search) over graph G on the fly to answer reachability queries. Obviously, these two methods cannot work in a large graph G, since the former needs $O(|V|^2)$ space to store the transitive closure (large index space cost), and the latter needs $O(|V|)$ time in answering reachability queries (slow query response time), where V is the set of vertices in G. The key issue in reachability queries is how to find a good trade-off between the two extreme solutions. Therefore, many algorithms have been proposed, such as 2-hop [8,7,4], GRIPP [20], path-cover [10], tree-cover [20,21], path tree [14] and 3-hop [13].

* Corresponding author at: Institute of Computer Science and Technology, Peking University, No.5 Yiheyuan Road Haidian District, Beijing 100871, China. Tel.: +86 10 82529643. E-mail addresses: [email protected], [email protected] (L. Zou).
¹ In this paper, all "paths" refer to "simple paths" unless otherwise specified.
0306-4379/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.is.2013.10.003

In many real applications, edge labels are utilized to denote different relationships between two vertices. For example, edge labels in RDF graphs denote different properties. We can also use edge labels to define different relationships in social networks. In this paper, we study a variant of reachability queries, called Label-Constraint Reachability (LCR) queries, which were originally proposed in [11]. Specifically, given two vertices u and v and a label set S, an LCR query checks whether there exists a directed path from u to v, where the edge labels along the path are a subset of S. Here, we give some motivating examples to demonstrate the usefulness of LCR queries.

We model a social network as a graph G, in which each vertex in G denotes an individual and an edge indicates the association between two users. Edge labels denote the relationship types, such as isFriendOf, isColleagueOf, isRelativeOf, isSchoolmateOf, isCoauthorOf, isAdvisorOf and so on. In some social network analysis tasks, we are only interested in finding some specified relationships between two individuals. For example, we want to see whether two suspects are remote relatives (in a terrorist network G) by checking the existence of a path between two corresponding vertices, where the edge labels along the path are all "isRelativeOf".

LCR queries are also useful for understanding how metabolic chain reactions take place in metabolic networks. A metabolic network can also be modeled as a graph G, where each vertex corresponds to a chemical compound and each directed edge indicates a chemical reaction from one compound to another. Enzymes catalyze these reactions. Thus, we can use edge labels to denote different enzymes. A metabolic pathway involves the step-by-step modification of an initial molecule to form another product. From the perspective of graph theory, a pathway is a directed path from the initial node in the metabolic network to the target. A common query is as follows: considering the availability of a set of enzymes, is there a pathway from one compound to another one? Obviously, this is an LCR query over a metabolic network.

As demonstrated above, LCR queries are quite useful; however, it is non-trivial to answer LCR queries over a large directed graph. Traditional reachability queries do not consider edge labels along the path [11]. For example, vertex 1 can reach vertex 4 in graph G (in Fig. 1). However, if the constraint label set S is $\{b, c\}$, the LCR query answer is NO, since we cannot find a path from 1 to 4 where all edge labels along the path are a subset of S. Generally speaking, existing reachability indexes are compact data structures of the transitive closure. As traditional reachability queries do not consider edge labels, in order to compute the transitive closure, we only need to consider a single directed path from one vertex to another (if any). However, in order to compute the transitive closures for LCR queries, we have to consider all possible paths between two vertices, because different paths may have different edge labels. Therefore, it is much more complicated to compute the transitive closure for LCR queries. Existing index techniques for traditional reachability queries are not applicable to LCR queries, either. For example, in order to answer reachability queries, we always transform a directed graph G into a directed acyclic graph (DAG) by coalescing each strongly connected component (in G) into a single vertex. However, this method cannot work in LCR queries, since each strongly connected component has different edge labels.

In order to address LCR queries efficiently, we make the following contributions in this work:

(1) Given an edge-labeled directed graph G, we find all maximal strongly connected components in G and replace them by bipartite graphs. Then, the directed graph G is transformed into an augmented DAG with labels. Based on the augmented DAG, we propose a method to compute the path-label transitive closure (Definition 3.8), from which LCR queries can be answered directly.
(2) We re-define the "distance" of a path by the number of distinct edge labels along the path, and also propose a Dijkstra-like algorithm to compute a single-source path-label transitive closure (Definition 3.8). We prove that our algorithm is optimal in terms of search space.
(3) In order to speed up query processing over large graphs, we propose an effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal partition in terms of minimizing the number of traversal steps is NP-hard. Based on the complexity analysis, we design a sampling-based solution to find a sub-optimal partition.
(4) In order to handle graph updates, we propose an efficient index maintenance algorithm.
(5) Last but not least, extensive experiments confirm that our method is faster than the existing ones by orders of magnitude. For example, given a random network satisfying the ER model with 100K vertices and 150K edges, the method in [11] consumes 277 h for index building. Given the same graph, our method only needs 0.5 h for index building. Furthermore, our method can work well in a very large RDF graph (Yago dataset) with more than 2 million vertices, 6 million edges and 97 edge labels.

The rest of this paper is organized as follows. The related work is discussed in Section 2. We formally define the problem and discuss existing solutions in Section 3. Then, we propose several novel techniques for computing path-label transitive closures in Section 4. The partition-based solution is discussed in Section 5. We also discuss how to handle dynamic graphs in Section 6. We evaluate our method in Section 7. Section 8 concludes this paper.
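The online-traversal extreme discussed in this section carries over directly to the label-constrained setting: a BFS that follows only edges whose label belongs to S answers one LCR query without any index. The following is a minimal sketch of that baseline, not the method proposed in this paper. The function name `lcr_bfs` and the edge list `EDGES` are assumptions made for illustration; `EDGES` is a small hypothetical graph loosely modeled on the running example and is not a reproduction of Fig. 1.

```python
from collections import deque

def lcr_bfs(edges, source, target, allowed):
    """Return True if target is reachable from source using only
    edges whose label is in the set `allowed` (index-free baseline)."""
    # Build an adjacency list: vertex -> list of (neighbor, edge label).
    adj = {}
    for u, v, label in edges:
        adj.setdefault(u, []).append((v, label))

    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v, label in adj.get(u, []):
            # Follow an edge only when its label satisfies the constraint.
            if label in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# Hypothetical edge-labeled graph (an assumption, not Fig. 1 itself).
EDGES = [(1, 2, "a"), (2, 3, "a"), (2, 5, "c"),
         (3, 4, "a"), (4, 5, "a"), (5, 6, "a")]
```

In this sketch each query costs one traversal of the label-filtered graph, which is the slow-query end of the trade-off that the indexing techniques in the following sections aim to improve.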

2. Related work

Recently, reachability queries have attracted a lot of attention in the database community [22,2,5,12]. Generally speaking, given two vertices ui and uj in a directed graph G, a reachability query verifies the existence of a directed path from ui to uj. The reachability query is a fundamental operation over graph data, which can be used in different applications. For example, reachability queries can be used to find pathways between two compounds in metabolic networks to understand chain reactions. We can also find some semantic association on the semantic web [3].

Fig. 1. A running example.

Usually, a directed graph G is transformed into a directed acyclic graph by coalescing all strongly connected components into vertices. It is easy to prove that the directed acyclic graph has the same reachability information as G. Thus, all existing approaches assume that graph G is a directed acyclic graph. So far, there have been a lot of proposals to address this issue. Basically, these existing approaches can be classified into three categories: chain-cover [10], tree-cover [20,21] and 2-hop labeling [4,8,7].

The chain cover decomposes a graph G into pairwise disjoint chains. A chain is more general than a path. Each vertex has a distinct code based on its position on a chain. Given any two vertices u1 and u2, if u1 can reach u2 along a chain, it is not necessary to record the reachability information between u1 and u2, because it can be deduced from their codes on the chain. Two typical solutions are proposed in [10] and [5].

The tree cover uses a spanning tree instead of chains to compress the transitive closure. All edges covered by the tree are called tree edges, and others are called non-tree edges. Some tree codes are designed to check the existence of a directed tree path (a path only containing tree edges) from vertex u1 to another vertex u2. In order to consider non-tree edges, many different approaches are proposed. In [20], Trißl and Leser propose a traversal-based solution. Instead, Wang et al. propose a dual-labeling scheme to materialize the reachability information through non-tree edges [21].

Different from the above methods, a 2-hop labeling method over a large graph G assigns to each vertex $u \in V(G)$ a label $L(u) = (L_{in}(u), L_{out}(u))$, where $L_{in}(u), L_{out}(u) \subseteq V(G)$. Vertices in $L_{in}(u)$ and $L_{out}(u)$ are called centers. For reachability labeling, given any two vertices $u_1, u_2 \in V(G)$, there is a path from u1 to u2 (denoted as $u_1 \rightarrow u_2$) if and only if $L_{out}(u_1) \cap L_{in}(u_2) \neq \emptyset$. It is an NP-hard problem to find the minimal number of hops. Many heuristic approaches are proposed, such as [8,7,4]. In [8], Cohen et al. adopt the set-cover solution to find hops. However, the method of [8] is not scalable on large graphs. In order to address this issue, Cheng and Yu in [6] propose efficient algorithms to compute 2-hops.

The LCR query is first proposed in [11], and is a special case of regular path queries [1]. Different from traditional reachability queries, an LCR query needs to consider the edge labels along the path. In this case, some existing techniques cannot be used. For example, we cannot simply replace all strongly connected components by vertices, since this transformation loses the edge labels in each strongly connected component. In [11], Jin et al. use a spanning tree and some local transitive closures to support LCR queries. The intuition of their method is that the full transitive closure can be re-constructed from the spanning tree and local transitive closures. However, this kind of tree-cover method suffers in dense graphs, since local transitive closures may be very large. Label-constraint semantics are also considered in shortest path queries over road networks [18]. Different from [18], our work focuses on "reachability" queries, which follow the same problem definition in [1].

This paper is a heavily expanded journal version of a research paper entitled "Answering Label-Constraint Reachability in Large Graphs" presented at CIKM 2011. In this journal version, we make the following new contributions: First, we propose an optimized solution to compute the local transitive closure in each maximal strongly connected component in Section 4.3. Second, we propose a partition-based method to answer LCR queries over large graphs. Third, we also discuss the index maintenance issues on dynamic graphs. Most existing approaches assume that the graph is static. They do not discuss how to update indexes. However, many real-life graph data, such as social networks and RDF graphs, are always evolving over time. Thus, we should consider reachability over dynamic graphs. Furthermore, in this journal version, we introduce six more experiments (Exp3–Exp8) to evaluate the newly proposed techniques.

3. Background

3.1. Problem definition

We formally define our problem in this section. Table 1 shows some frequently used symbols throughout this paper.

Table 1
Frequently used notations.

| Notation | Description |
|---|---|
| $e$, $\lambda(e)$ | Edge e and the edge label of e (Definition 3.1) |
| $p$, $L(p)$ | Path p and the path label L(p) (Definition 3.1) |
| $S = \{l_1, \ldots, l_n\}$ | The label constraint set (Definition 3.2) |
| $P(u_1, u_2)$, $LS(u_1, u_2)$ | All paths and path labels between u1 and u2 (Definition 3.4) |
| $M_G(u_1, u_2)$ | All non-redundant path labels between u1 and u2 (Definition 3.6) |
| $M_G(u, -)$ | Single-source path-label transitive closure in graph G (Definition 3.8) |
| $M_G$ | The path-label transitive closure of graph G (Definition 3.8) |
| $Prune(\cdot)$ | Remove all redundant paths (Definition 3.9) |
| $\odot$ | The concatenation of two path label sets (Definition 3.10) |
| $\circ$ | The concatenation of two paths |
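The set-valued notation in Table 1 can be made concrete with a short sketch. It represents a path label as a `frozenset` of edge labels and a path-label set as a Python `set` of such frozensets. The helper names `prune`, `odot` and `covers`, and the sample values, are assumptions chosen for illustration; they are not taken from the authors' implementation.

```python
def prune(label_sets):
    """Prune(.): keep only path labels with no proper subset in the
    collection, i.e. drop every redundant (covered) path label."""
    unique = set(label_sets)
    return {ls for ls in unique
            if not any(other < ls for other in unique)}

def odot(ls1, ls2):
    """Concatenation of two path-label sets: the union of every pair
    of path labels, one taken from each set."""
    return {a | b for a in ls1 for b in ls2}

def covers(min_label_sets, constraint):
    """True if some minimal path label is a subset of the constraint S,
    which is how an LCR query is answered from M_G(u1, u2)."""
    s = frozenset(constraint)
    return any(ls <= s for ls in min_label_sets)

# Illustrative values: two path labels {a} and {a, c} between one pair.
A = frozenset({"a"})
AC = frozenset({"a", "c"})
minimal = prune({A, AC})            # only {a} survives
combined = odot({frozenset({"c"})}, {A})   # {c} joined with {a}
```

Storing only the pruned sets is what keeps the closure compact: any constraint satisfied by a redundant label is also satisfied by the label that covers it.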

Definition 3.1. A directed edge-labeled graph G is denoted as $G = (V, E, \Sigma, \lambda)$, where (1) V is a set of vertices, and (2) $E \subseteq V \times V$ is a set of directed edges, and (3) $\Sigma$ is a set of edge labels, and (4) the labeling function $\lambda$ defines the mapping $E \rightarrow \Sigma$.

Given a path p from u1 to u2 in graph G, the path-label of p is denoted as $L(p) = \bigcup_{e \in p} \lambda(e)$, where $\lambda(e)$ denotes e's edge label.

Given a graph G in Fig. 1, the numbers inside vertices are vertex IDs that we introduce to simplify the description of a graph, and the letters beside edges are edge labels. Considering path $p_1 = (1, 2, 5, 6)$, the path-label of p1 is $L(p_1) = \{ac\}$.

Definition 3.2. Given two vertices u1 and u2 in graph G and a label constraint (set) $S = \{l_1, \ldots, l_n\}$, where $l_i$ is a label, $i = 1, \ldots, n$, we say that u1 can reach u2 under label constraint S (denoted as $u_1 \xrightarrow{S} u_2$) if and only if there exists a path p from u1 to u2 and $L(p) \subseteq S$.

Definition 3.3 (Problem Definition). Given two vertices u1 and u2 in graph G and a label set $S = \{l_1, \ldots, l_n\}$, a label-constraint reachability (LCR) query verifies whether u1 can reach u2 under the label constraint S, denoted as $LCR(u_1, u_2, S, G)$.

For example, given two vertices 1 and 6 in graph G in Fig. 1 and label constraint $S = \{ac\}$, it is easy to see that 1 can reach 6 under label constraint S, i.e., $LCR(1, 6, S, G) = true$, since there exists a path $p_1 = (1, 2, 5, 6)$, where $L(p_1) \subseteq S$. If $S = \{bc\}$, query $LCR(1, 6, S, G) = false$.

Definition 3.4. Given two vertices u1 and u2 in graph G, $P(u_1, u_2)$ denotes the set of all paths from u1 to u2. The path-label set from u1 to u2 is defined as $LS(u_1, u_2) = \{L(p) \mid p \in P(u_1, u_2)\}$.

Note that $P(u_1, u_2)$ and $LS(u_1, u_2)$ may be very large and we do not compute them in our method. We only use the concept of $P(u_1, u_2)$ to define the minimal path-label set from u1 to u2 (Definition 3.6). Consider two paths (1, 2, 3, 4, 5, 6) and (1, 2, 5, 6) in $P(1, 6)$, where $L(1, 2, 3, 4, 5, 6) \subseteq L(1, 2, 5, 6)$. Obviously, if path (1, 2, 5, 6) can satisfy some label constraint S, path (1, 2, 3, 4, 5, 6) will also satisfy S. Therefore, path (1, 2, 5, 6) is redundant (Definition 3.5) for any LCR query.

Definition 3.5. Considering two paths p and p′ from vertex u1 to u2, respectively, if $L(p) \subseteq L(p')$, we say L(p) covers L(p′). In this case, p′ is a redundant path, and L(p′) is also redundant in the path-label set $LS(u_1, u_2)$. Considering one path p from vertex u1 to u2, if there exists no other path p′ from u1 to u2 that covers p, p is a non-redundant path.

Definition 3.6. The minimal path-label set from u1 to u2 in graph G is defined as $M_G(u_1, u_2)$, where (1) $M_G(u_1, u_2) \subseteq LS(u_1, u_2)$; and (2) there exists no redundant path-label in $M_G(u_1, u_2)$; and (3) path-labels in $M_G(u_1, u_2)$ cover (defined in Definition 3.5) all path labels in $LS(u_1, u_2)$.

Definition 3.7. Given two vertices u1 and u2 and a label constraint (set) S, we say $M_G(u_1, u_2)$ covers S if and only if there exists a path p from u1 to u2, where $S \supseteq L(p)$ and $L(p) \in M_G(u_1, u_2)$.

Definition 3.8. Given a graph G, the path-label transitive closure is a collection of minimal path-label sets between any two vertices in G. Specifically, $M_G = [M_G(u_1, u_2)]_{|V(G)| \times |V(G)|}$, where $u_1, u_2 \in V(G)$, and a single-source path-label transitive closure is a vector $M_G(u, -) = [M_G(u, u_i)]_{1 \times |V(G)|}$, where $u_i \in V(G)$.

When the context is clear, for simplicity of presentation, we use transitive closure instead of path-label transitive closure. For ease of presentation, we borrow two operator definitions (Prune and $\odot$) from Ref. [11].

Definition 3.9. Given a path label set $LS(u_1, u_2)$ from u1 to u2, $Prune(LS(u_1, u_2))$ is defined to delete all redundant path labels (defined in Definition 3.5) in $LS(u_1, u_2)$, i.e., $Prune(LS(u_1, u_2)) = M_G(u_1, u_2)$.

For example, in Fig. 1, $Prune(LS(1, 6) = \{a, ac\}) = M_G(1, 6) = \{a\}$, since $\{a\} \subseteq \{ac\}$.

Definition 3.10. Given a path-label L(p) and a path-label set $LS = \{L(p_1), \ldots, L(p_n)\}$, $L(p) \odot LS = \{L(p) \cup L(p_1), \ldots, L(p) \cup L(p_n)\}$.

Given two path-label sets $LS_1 = \{L(p_{11}), \ldots, L(p_{1n})\}$ and $LS_2 = \{L(p_{21}), \ldots, L(p_{2m})\}$, $LS_1 \odot LS_2 = \{L(p_{11}) \odot LS_2, L(p_{12}) \odot LS_2, \ldots, L(p_{1n}) \odot LS_2\} = \{L(p_{11}) \cup L(p_{21}), \ldots, L(p_{1n}) \cup L(p_{2m})\}$.

Given a path-label set $LS_1$ and a vector $M_G = [LS_{21}, \ldots, LS_{2n}]_{1 \times n}$, where each $LS_{2i}$ ($i = 1, \ldots, n$) is a path-label set, $LS_1 \odot M_G = [LS_1 \odot LS_{21}, \ldots, LS_1 \odot LS_{2n}]_{1 \times n}$.

For instance, given $M_G(5, 6) = \{a\}$ and $M_G(3, 6) = \{a\}$, $\{\lambda(2, 5)\} \odot M_G(5, 6) = \{ac\}$ and $\{\lambda(2, 3)\} \odot M_G(3, 6) = \{a\}$.

It is easy to prove that $M_G(u_i, u_j) = Prune(\bigcup_{\overrightarrow{u_i u'} \in E(G)} \{\lambda(\overrightarrow{u_i u'})\} \odot M_G(u', u_j))$. For example, $M_G(2, 6) = Prune(\{\{\lambda(2, 5)\} \odot M_G(5, 6), \{\lambda(2, 3)\} \odot M_G(3, 6)\}) = Prune(\{\{a\}, \{ac\}\}) = \{\{a\}\}$.

An extreme approach to answering LCR queries is to materialize the transitive closure $M_G$. At run time, given a query $LCR(u_1, u_2, S, G)$, the query can be answered by simply checking $M_G(u_1, u_2)$. However, computing $M_G$ is much more complicated than computing the traditional transitive closure. We introduce the following theorem, which is used in our LCR query algorithms and the performance analysis.

Theorem 3.1 (Apriori property). Given a path p, p must be redundant if one of its subpaths is redundant.

Proof. Assume that one subpath p1 of p is redundant. We can find another path p2 that has the same end points as p1 and p2 is a non-redundant path, i.e., $L(p_2)$ covers $L(p_1)$. We get another path p′ by replacing p1 by p2. It is easy to prove that L(p′) covers L(p). Thus, p must be a redundant path. □

3.2. Existing approaches

LCR queries are proposed in [11]. Generally speaking, the method in [11] employs a spanning tree T and a partial transitive closure NT to compress the full transitive closure. Specifically, a spanning tree T is found in the graph G. Based on T, all pairwise paths are partitioned into three categories $P_n$, $P_s$ and $P_e$. $P_n$ contains all pairwise paths whose starting edges and end edges are both non-tree edges. $P_s$ (and $P_e$) contains all pairwise paths whose starting (and ending) edges are

Fig. 2. Existing solution. (a) An example of tree-cover. (b) Running example of Algorithm 1.

tree-edges. In Fig. 2(a), (4,5,6,1) is a path in $P_n$, since $\overrightarrow{4,5}$ and $\overrightarrow{6,1}$ are non-tree edges. $NT(u, v)$ contains all path labels between u and v in $P_n$. $M_G(u, v)$ can be re-constructed by Eq. (1). Therefore, we can re-construct the full transitive closure from the spanning tree T and the partial transitive closure $NT = \{NT(u, v), u, v \in V(G)\}$.

$M_G(u, v) = \{\{L(P_T(u, u'))\} \odot NT(u', v') \odot \{L(P_T(v', v))\} \mid u' \in Succ(u) \text{ and } v' \in Pred(v)\}$    (1)

where u′ is reachable from u in the spanning tree T and $L(P_T(u, u'))$ denotes the corresponding path label in T; and v′ can reach v in the spanning tree T and $L(P_T(v', v))$ denotes the corresponding path label in T.

Obviously, different spanning trees will lead to different NT. In order to minimize the size of NT, Jin et al. introduce a "weight" w(e) for each edge e, where w(e) reflects the number of path-labels that can be removed from NT if e is in the tree. Therefore, they propose to use the maximal spanning tree in G. However, it is quite expensive to assign exact edge weights w(e). Thus, they propose a sampling method. For each sampling seed (vertex), they compute the single-source transitive closure, based on which they propose some heuristic methods to define edge weights.

However, there are two limitations of the method in [11]. First, similar to the counterpart methods in traditional reachability queries [21,14], a single spanning tree cannot compress the transitive closure greatly, especially in dense graphs. Consequently, NT may be very large. Second, in order to find the optimal spanning tree T, Jin et al. [11] propose an algorithm to compute the single-source transitive closure for each sampling seed (vertex). However, the search space of their algorithm is not minimal, as it contains a large number of redundant paths in intermediate results. The redundant paths affect the performance greatly. We will analyze this shortcoming in detail shortly. The above two problems (large index size and expensive index building process) affect the scalability of the method in [11].

Since computing the single-source transitive closure is also a building block in our method, we prove that our method is optimal in terms of search space. In order to understand the superiority of our method, we first analyze the algorithm in [11].

Given a vertex u in graph G, the single-source transitive closure of vertex u is a vector $M(u, -) = [M(u, u_1), \ldots, M(u, u_{|V(G)|})]$, where $u_i \in V(G)$ ($i = 1, \ldots, |V(G)|$). The method in [11] adopts a generalization of the Bellman–Ford algorithm to compute $M(u, -)$. Algorithm 1 lists the pseudo-code of their method. Generally speaking, Algorithm 1 adopts a BFS strategy to broadcast one vertex's path-labels to its neighbors in each iteration. Fig. 2 shows a running example of computing $M(1, -)$ in graph G. A problem of this method is that if one path-label $L(u, u_i)$ from u to ui is redundant (Definition 3.5), it may infect ui's neighbors. For example, in Step 2, there is a redundant path-label $\{ac\}$ in $M(1, 5)$ (it will be pruned in Step 4), and it infects its neighbor vertex 6 in Step 3. Thus, there is also a redundant path-label $\{ac\}$ in $M(1, 6)$ in Step 3. Actually, these redundant path-labels should be pruned from the search space to avoid unnecessary computation. Given a redundant path p, the redundant path-label of p will infect all its super paths. The infection will affect the performance greatly, especially in large and dense graphs. An optimal algorithm should "magically" stop the infection as early as possible. For example, a magical method can remove vertex 5 from V1 in Step 3. A key problem in Algorithm 1 is that we cannot know that $\{ac\}$ in $M(1, 5)$ is redundant until Step 4. Therefore, the infection of redundancy cannot be avoided.

In our proposed method (Section 4.1), we can guarantee that if one path p is redundant, all its super paths are pruned from the search space. For example, in our algorithm, path (1, 2, 5, 6) can be pruned, since its subpath (1, 2, 5) is redundant. It means that it is impossible to generate the intermediate result $M(1, 6) = \{ac\}$. Consequently, compared with Algorithm 1, our method reduces the search space greatly.

Algorithm 1. Single source transitive closure computation [11].

Input: A graph G and a vertex u in G;
Output: Single Source Transitive Closure $M(u, -)$.
$M(u, -) \leftarrow NULL$; $V_1 \leftarrow \{u\}$.
while $V_1 \neq NULL$ do
    $V_2 \leftarrow NULL$
    for each vertex $v \in V_1$ do
        for each vertex $v' \in N(v) = \{v' : (v, v') \in E(G)\}$ do
            $New \leftarrow Prune(M(u, v') \cup M(u, v) \odot \{\lambda(v, v')\})$
            if $New \neq M(u, v')$ then
                $M(u, v') \leftarrow New$; $V_2 \leftarrow V_2 \cup \{v'\}$;
            end if
        end for
    end for
    $V_1 \leftarrow V_2$
end while

In [9], Fan et al. add regular expressions to graph reachability queries. Specifically, given two vertices u1 and u2, the method in [9] verifies whether there is a directed path P where all edge labels along the path satisfy the specified regular expression. Obviously, the LCR query is a special case of the problem in [9]. Fan et al. propose a bi-directional BFS algorithm at runtime. We can utilize the method in [9] for LCR queries. Given two vertices u1 and u2 over a large graph G and a label set S, two sets are maintained for u1 and u2. Each set records the vertices that are reachable from (resp. to) u1 (resp. u2) only via edges of labels in S. We expand the smaller set at a time until either the two sets intersect, or they cannot be further expanded (i.e., unreachable). This method works very well in small graphs, but the bidirectional BFS is slow on large graphs.

4. Computing transitive closure

As mentioned earlier, compared with the traditional transitive closure, it is much more challenging to compute the path-label transitive closure (Definition 3.8). This section focuses on computing the path-label transitive closure efficiently. We first propose a Dijkstra-like algorithm to compute the single-source transitive closure efficiently (Section 4.1). However, it is very expensive to iterate the single-source transitive closure computation from each vertex in G to compute $M_G$. In order to address this issue, we propose the augmented DAG (aDAG for short) by representing all strongly connected components as bipartite graphs. The aDAG-based solution to compute transitive closures over G is discussed in Section 4.2. We discuss how to optimize computing the local transitive closure in each strongly connected component in Section 4.3.

4.1. Single-source transitive closure

This subsection focuses on computing the single-source transitive closure efficiently. As discussed in Section 3.2, the method in [11] is not optimal in terms of search space. The key problem is that some redundant paths are visited before their corresponding non-redundant paths. In order to address this issue, we propose a Dijkstra-like algorithm. In each step of Dijkstra's algorithm, we always access one un-visited vertex that has the minimal distance from the origin vertex. In our algorithm, we redefine the "distance" of a path as the number of distinct edge labels along the path (Definition 4.1). Given a redundant path p1, there must exist another non-redundant path p2, where $L(p_2) \subset L(p_1)$. It is straightforward to see that the "distance" of p1 must be larger than that of p2. Dijkstra's algorithm only finds the shortest paths between the origin vertex and other vertices. Therefore, the redundant path p1 must be out of the search space in our single-source transitive closure computation (Theorem 4.1).

Definition 4.1. The distance of a path p is defined as the number of distinct edge labels in p.

Given a graph G in Fig. 1, Fig. 3 demonstrates how to compute $M_G(1, -)$ from vertex 1 in our algorithm (i.e., Algorithm 2). Initially, we set vertex 1 as the source. All of vertex 1's neighbors are put into the heap H. Each neighbor is denoted as a neighbor triple $[L(p), p, d]$, where d denotes the neighbor's ID, p specifies one path from source s to d, and L(p) is the path-label set of p. All neighbor triples are ranked according to the total order defined in Definition 4.2. Since $[\{a\}, (1, 2), 2]$ is the heap head (see Fig. 3), it is moved to path set RS. When we move the heap head T1 into path set RS, we check whether T1 is covered (Definition 4.3) by some neighbor triple T2 in RS (Line 5 in Algorithm 2). If so, we ignore T1 (Lines 5–6); otherwise, we insert T1 into RS (Lines 7–8).

Fig. 3. Algorithm process.

Definition 4.2. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$ in the heap H, $T_1 \leq T_2$ if and only if $|L(p_1)| \leq |L(p_2)|$, where $|L(p_1)|$ is the number of distinct edge labels along path p1.

Definition 4.3. Given one neighbor triple $T_1 = [L(p_1), p_1, d_1]$, T1 is redundant if and only if there exists another neighbor triple $T_2 = [L(p_2), p_2, d_2]$, where $L(p_1) \supseteq L(p_2)$ and $d_1 = d_2$. In this case, we say that T1 is covered by T2.

Definition 4.4. Given two paths p1 and p2, whose lengths are n and n+1, respectively, p1 is a parent path of p2 if and only if $p_2 = p_1 \circ e$, where "$\circ$" denotes the concatenation of a path p1 and an edge e.

Definition 4.5. Given two neighbor triples $T_1 = [L(p_1), p_1, d_1]$ and $T_2 = [L(p_2), p_2, d_2]$, if p1 is a parent path (or a child path) of p2, we say that T1 is a parent neighbor triple (or a child neighbor triple) of T2.

Algorithm 2. Single-source transitive closure computation.

Input: A graph G and a vertex u in G;
Output: single-source transitive closure $M_G(u, -)$.
1: Set u as the source. Set answer set $RS = \emptyset$ and heap $H = \emptyset$.
2: Put all neighbor triples of u into H.
3: while $H \neq \emptyset$ do
4:     Let $T_1 = [L(p_1), p_1, d]$ denote the head in H.
5:     if T1 is covered by some neighbor triple T2 in RS then
6:         Delete T1 from H
7:     else
8:         Move T1 into RS and put $L(p_1)$ into $M_G(u, d)$
9:         for each child neighbor triple $T' = [L(p'), p', d']$ of T1 do
10:            if p′ is a non-simple path then
11:                continue
12:            end if
13:            if T′ is not covered by some neighbor triple T″ in H then
14:                Insert T′ into H
15:            end if
16:            if T′ covers some neighbor triple T″ in H then
17:                Delete T″ from H
18:            end if
19:        end for
20:    end if
21: end while
22: $M_G(u, -) = [M_G(u, u_1), \ldots, M_G(u, u_n)]$ and return $M_G(u, -)$

Then, we put all child neighbor triples (Definition 4.5) of $[\{a\}, (1, 2), 2]$ into heap H. Considering one neighbor of vertex 2, such as vertex 3, we put neighbor triple $[\{a\} \cup L(\overrightarrow{2,3}) = \{a\}, (1, 2, 3), 3]$ into H, where (1, 2, 3) is (1, 2)'s child path. Analogously, we put $[\{a\} \cup L(\overrightarrow{2,5}) = \{ac\}, (1, 2, 5), 5]$ into H. When we insert some neighbor triple $T' = [L(p'), p', d']$ into H, we first check whether p′ is a non-simple path, and we ignore T′ if so (Lines 10–11). Furthermore, we also check whether there exists another triple T″ already in H such that T′ is covered by T″, or T′ covers T″ (Lines 13–18). If T′ is covered by T″, we ignore T′; otherwise, T′ is inserted into H. If T′ covers some triple T″ in H, T″ is deleted from H. At Step 2, the heap head is $[\{a\}, (1, 3), 3]$, which is moved to path set RS.

Iteratively, we put all child neighbor triples of $[\{a\}, (1, 3), 3]$ into heap H. At Step 4, we find that $[\{ac\}, (1, 2, 5), 5]$ is covered by $[\{a\}, (1, 2, 3, 4, 5), 5]$. Therefore, we remove $[\{ac\}, (1, 2, 5), 5]$ from H. Fig. 3 illustrates the whole process. All paths and path-labels in RS are non-redundant. According to RS, it is straightforward to obtain $M_G(1, -)$. Note that our algorithm stops the infection from a redundant path to its child paths (Theorem 4.1). For example, path (1, 2, 5, 6) is pruned from the search space in our algorithm.

Analysis of Algorithm 2:

Theorem 4.1. Given a vertex u in graph G, the following claims about Algorithm 2 hold:

1. All non-redundant paths can be found in Algorithm 2.
2. The paths found by Algorithm 2 (i.e., the paths that are inserted into RS) are non-redundant paths.

(2) Proof of the second claim in Theorem 4.1. Given a heap head $T_1 = [L(p_1), p_1, d_1]$, if p1 is a redundant path, its child paths must be redundant paths (proved in Theorem 3.1). According to Lines 5–6, p1 will be deleted. Also, Algorithm 2 does not expand p1 to generate redundant paths. Thus, all paths found by Algorithm 2 are non-redundant paths. □

Theorem 4.2. Algorithm 2 is optimal in terms of the search space for computing the single source transitive closure.

Proof. In order to compute the single source transitive closure, we must find all non-redundant paths beginning from the original vertex. Otherwise, we will miss reachability information in the single source transitive closure.

According to Theorem 4.1, Algorithm 2 can generate all non-redundant paths beginning from the original vertex but cannot generate any redundant path. Therefore, Algorithm 2 is optimal in the search space. □

Theorem 4.3. The time complexity of Algorithm 2 in the worst case is $O(D^d)$, where D is the maximal outgoing degree and d is the diameter of the graph.

Proof. The worst case is that all edges have distinct edge labels. In this case, all paths (i.e., simple paths) are non-redundant. Thus, Algorithm 2 needs to evaluate all paths (beginning from the original vertex) for computing the single source path-label transitive closure. It is straightforward to see that there are at most $O(D^d)$ paths, where D is the maximal outgoing degree and d is the diameter of the graph. Therefore, the time complexity is $O(D^d)$. □

Proof. (1) Proof of the first claim (proof by induction) Although Algorithm 2 has the same time complexity (a) (Base Case): According to Line 2 of Algorithm 2, all with the method in [11] (i.e., Algorithm 1) in the worst length-1 non-redundant paths beginning from vertex u are case. Our algorithm can avoid visiting redundant paths pushed into H in Algorithm 2. Obviously, these length-1 (Theorem 4.1). In practice, our method is faster than the non-redundant paths are not covered by any path in H or method in [11] significantly. RS. According to the loop steps (Lines 3-21), these length-1 non-redundant paths will be the heap head of H at some iteration step. At this moment, they are inserted into result 4.2. Computing transitive closures set RS. Therefore, Algorithm 2 will not miss these length-1 non-redundant paths. Given a graph G, we can iterate Algorithm 2 from each

(b) (Hypothesis): Assume that all length-n non-redundant vertex in G to compute MG. However, this is an inefficient paths beginning from vertex u can be found in Algorithm 2. solution. Intuitively, given two adjacent vertices u1 and u2, (Induction): Given a length-(nþ1) non-redundant path they share a lot of steps for computing MGðu1; Þ and p, its parent path is denoted as p′. It is straightforward to MGðu2; Þ by Algorithm 2. Therefore, an efficient algorithm know the length of p′ is n, meaning that p′ must be found should avoid unnecessary redundant computation. Usually, in Algorithm 2, according to the above assumption. adirectedgraphG can be transformed into a DAG by Since p is p′'s child, p must be considered in Lines 9–15. coalescing each strongly connected component into a single Since p is a non-redundant path, it means that p cannot vertextocomputetransitiveclosureefficiently.However, be covered by any path in H or RS. Therefore, p must be this method cannot be used for LCR queries since it misses inserted into H. At some iteration step, p is a heap head, some edge labels. Instead, we propose an augmented DAG D which will be moved into RS. by replacing all strongly connected components as bipartite (c) (Conclusion): According to the above analysis, we can graphs. Note that, we allow for some trivial maximal always obtain any length-(nþ1) non-redundant path p, strongly connected components, which include a single once we have obtained its length-n parent p′. Furthermore, vertex. Then, we can compute single-source transitive all length-1 non-redundant paths can be found in RS closure MGðu; Þ according to the reverse order of D.During (proved in Base Case). Therefore, according to the induc- the computation, MGðu; Þ is always transmitted to its tion method, we can find all non-redundant paths in parent vertices in D. In this way, redundant computation Algorithm 2. can be avoided. 54 L. Zou et al. / Information Systems 40 (2014) 47–66
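To make the control flow of Algorithm 2 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the graph is a dictionary mapping each vertex to a list of (neighbor, label) pairs. It orders the heap by the number of distinct labels (Definition 4.1), with an arbitrary tie-break standing in for the total order of Definition 4.2. It also simplifies the cover test by applying it only when a triple is popped, rather than inside the heap as in Lines 13–18. The function name `single_source_closure` and the toy graph are hypothetical.

```python
import heapq


def single_source_closure(graph, source):
    """Return {destination: [minimal label sets]} for paths starting at source.

    graph: dict mapping a vertex to a list of (neighbor, label) pairs.
    Vertices and labels are assumed to be mutually comparable (e.g. ints, strs),
    because they take part in the heap tie-break.
    """
    closure = {}  # plays the role of M_G(source, d) for each destination d
    heap = []     # entries: (distance, sorted labels, path, label set)

    def push(labels, path):
        heapq.heappush(heap, (len(labels), sorted(labels), path, labels))

    # Line 2: neighbor triples of the source.
    for neighbor, label in graph.get(source, []):
        if neighbor != source:
            push(frozenset([label]), (source, neighbor))

    while heap:  # Line 3
        _, _, path, labels = heapq.heappop(heap)  # Line 4: heap head
        dest = path[-1]
        found = closure.setdefault(dest, [])
        # Lines 5-6: drop the triple if a recorded label set covers it.
        if any(prev <= labels for prev in found):
            continue
        found.append(labels)  # Line 8
        # Lines 9-19: expand child triples, skipping non-simple paths.
        for neighbor, label in graph.get(dest, []):
            if neighbor in path:
                continue
            push(labels | {label}, path + (neighbor,))
    return closure


# Hypothetical toy graph (not Fig. 1 of the paper).
toy = {
    1: [(2, "a"), (3, "b")],
    2: [(3, "a"), (4, "c")],
    3: [(4, "a")],
}
closure = single_source_closure(toy, 1)
for dest in sorted(closure):
    print(dest, [sorted(s) for s in closure[dest]])
```

Because triples leave the heap in non-decreasing order of distinct-label count, a label set that reaches the result list cannot later be covered by a smaller one, which is the property the Dijkstra-like ordering is meant to provide.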

Algorithm 3. Building an augmented DAG $D$ for a directed graph $G$.
Input: A directed graph $G$;
Output: The augmented DAG $D$.
1: Find all maximal strongly connected components in $G$.
2: for each maximal strongly connected component $C_i$ do
3:   Replace $C_i$ by a bipartite graph $B_i = (V_{i1}, V_{i2})$, where $V_{i1}$ contains all in-portal vertices in $C_i$ and $V_{i2}$ contains all out-portal vertices in $C_i$.
4:   For any two vertices $u_1 \in V_{i1}$ and $u_2 \in V_{i2}$, we introduce a directed edge from $u_1$ to $u_2$, whose edge label is $M_{C_i}(u_1, u_2)$.
5: end for
6: Set $D$ to be the updated graph.
7: Return $D$

Given a directed graph $G$, Algorithm 3 shows how to get an augmented DAG $D$. Specifically, we first find all maximal strongly connected components in $G$. Then, we replace each maximal strongly connected component $C_i$ by a bipartite graph $B_i = (V_{i1}, V_{i2})$, where $V_{i1}$ contains all in-portal vertices in $C_i$ and $V_{i2}$ contains all out-portal vertices in $C_i$. A vertex $u$ in $C_i$ is called an in-portal if and only if it has at least one incoming edge from vertices outside $C_i$. A vertex $u$ in $C_i$ is called an out-portal if and only if it has at least one outgoing edge to vertices outside $C_i$. If vertex $u$ is both an in-portal and an out-portal, it has two instances $u$ and $u'$ that occur in $V_{i1}$ and $V_{i2}$, respectively. For any two vertices $u_1 \in V_{i1}$ and $u_2 \in V_{i2}$, we introduce a directed edge from $u_1$ to $u_2$, whose edge label is $M_{C_i}(u_1, u_2)$. Theorem 4.4 proves that the graph generated by Algorithm 3 is a directed acyclic graph (DAG).

Given the graph $G$ in Fig. 4(a), only a single maximal strongly connected component $C_1$ is identified in $G$. We compute $M_{C_1}$ for $C_1$. In $C_1$, the in-portals are $V_1 = \{2, 3\}$ and the out-portals are $V_2 = \{3, 4\}$. Note that vertex 3 is both an in-portal and an out-portal. Thus, we introduce two instances of vertex 3. We build a bipartite graph $B_1$ from the in-portal vertices and the out-portal vertices. We introduce directed edges between every pair of vertices between $V_1$ and $V_2$. The edge label is the minimal path-label set between the two vertices.

Fig. 4. Augmented DAG. (a) Graph $G$. (b) Augmented DAG $D$.

Theorem 4.4. The updated graph $D$ in Algorithm 3 is a directed acyclic graph (DAG).

Proof. If $D$ is not a DAG, there must exist at least one cycle in $D$. This cycle corresponds to one maximal strongly connected component, or is embedded in one. It means that the cycle should occur in some maximal strongly connected component, which has been replaced by a bipartite graph $B_i$. It also means that there exists no such cycle in $D$. □

Note that, in the traditional reachability problem, a directed graph $G$ is transformed into a DAG by coalescing each maximal strongly connected component into a single vertex. To differentiate the DAG generated by Algorithm 3 from the DAG generated in the traditional reachability problem, we call the DAG generated by Algorithm 3 the augmented DAG (aDAG for short).

Definition 4.6. Given two maximal strongly connected components $C_i$ and $C_j$, $C_i$ is called an ancestor component of $C_j$ if and only if there exists a directed path from $C_i$ to $C_j$ in the augmented DAG.

For example, $C_5$ is an ancestor component of $C_2$, since there is a directed path $C_5 \to C_1 \to C_2$ in the augmented DAG.

Algorithm 4. Compute transitive closure of graph $G$.
Input: A graph $G$;
Output: $M_G$.
1: Set $M_G(u,-) = \{\emptyset, \ldots, \emptyset\}$ for each vertex $u$.
2: Identify all maximal strongly connected components $C_i$ ($i = 1, \ldots, n$) and build the aDAG $D$ by employing Algorithm 3.
3: for each maximal strongly connected component $C_i$, $i = 1, \ldots, n$ do
4:   for each vertex $u$ in $C_i$ do
5:     Compute $M_{C_i}(u,-)$ by employing Algorithm 2.
6:   end for
7: end for
8: Call Function aDAG to compute $M_G$
9: Return $M_G$

Function aDAG($D, M_{C_1}, \ldots, M_{C_n}$)
1: for each vertex $u$ in $D$ according to the reverse topological order do
2:   $M_G(u,-) = Prune(\bigcup_{i=1,\ldots,n} (\lambda(\overrightarrow{u c_i}) \odot M_G(c_i,-)))$
3: end for
4: for each maximal strongly connected component $C_i$ do
5:   for each intra-vertex $u \in C_i$ do
6:     $M_G(u,-) = M_{C_i}(u,-)$
7:     for each out-portal $u_i$ in $V_{i2}$ do
8:       $M_G(u,-) = Prune(M_G(u,-) \cup Prune(M_{C_i}(u,u_i) \odot M_G(u_i,-)))$
9:     end for
10:  end for
11:  for each vertex $u' \notin C_i$ do
12:    for each intra-vertex $u \in C_i$ do
13:      for each in-portal $u_i$ in $V_{i1}$ do
14:        $M_G(u',u) = Prune(M_G(u',u) \cup Prune(M_G(u',u_i) \odot M_{C_i}(u_i,u)))$
15:      end for
16:    end for
17:  end for
18: end for

Given a graph $G$, Algorithm 4 shows the pseudo-code to compute the transitive closure for $G$. First, we initialize $M_G(u,-) = \{\emptyset, \ldots, \emptyset\}$ and identify all maximal strongly connected components $C_i$ in $G$ (Lines 1–2 in Algorithm 4). We build an aDAG by calling Algorithm 3. For each maximal strongly connected component $C_i$, we iterate Algorithm 2 from each vertex $u$ in $C_i$ to compute $M_{C_i}$ (Lines 3–7). Finally, we call Function aDAG to compute $M_G$.

In Function aDAG, we first perform a topological sort over $D$. We process each vertex in $D$ according to the reverse topological order of $D$. If a vertex $u$ has $n$ children $c_i$, $i = 1, \ldots, n$, we set $M_G(u,-) = Prune(\bigcup_{i=1,\ldots,n} (\lambda(\overrightarrow{u c_i}) \odot M_G(c_i,-)))$. In this way, we can obtain $M_G(u,-)$ for each vertex $u$ in the aDAG $D$.

Now, we need to consider the “intra-vertices” in each cluster $C_i$ (i.e., the vertices that are not in $D$), such as vertices 1, 5 and 6 in Fig. 4. Given an intra-vertex $u$ in $C_i$, we initialize $M_G(u,-) = M_{C_i}(u,-)$. We first consider how an intra-vertex $u$ (in cluster $C_i$) reaches other vertices outside $C_i$. If $u$ reaches another vertex outside $C_i$, the path must go through an out-portal in $C_i$. Thus, for each out-portal $u_i$ in $V_{i2}$, we update $M_G(u,-) = Prune(M_G(u,-) \cup Prune(M_{C_i}(u,u_i) \odot M_G(u_i,-)))$ iteratively.

Then, we consider how other vertices $u'$ (outside $C_i$) reach an intra-vertex $u$ in $C_i$. The path from $u'$ to $u$ must go through one in-portal vertex in $C_i$. Consider any vertex $u' \notin C_i$. Given an intra-vertex $u$ in $C_i$, we compute $M_G(u',u)$ as follows: initially, we set $M_G(u',u) = \emptyset$. For each in-portal $u_i$ in $V_{i1}$, we update $M_G(u',u) = Prune(M_G(u',u) \cup Prune(M_G(u',u_i) \odot M_{C_i}(u_i,u)))$ iteratively.

Theorem 4.5. The time complexity of Function aDAG in Algorithm 4 is $O(|V(G)|^3)$, where $|V(G)|$ is the number of vertices in $G$.

Proof. The time complexity of Lines 1–3 is $O(|V(G)|^2)$, since one vertex has at most $|V(G)|$ children in the aDAG and there are $|V(G)|$ iterations of Lines 1–3. A maximal strongly connected component has at most $|V(G)|$ in-portals, out-portals and intra-vertices. Thus, the time complexity of computing the transitive closures of intra-vertices is $|V(G)|^2$ (Lines 5–17). Since there are at most $|V(G)|$ components, the total time complexity is $O(|V(G)|^3)$. □

Theorem 4.6. The time complexity of Algorithm 4 is $O(\max(|V(G)|^3, |V(G)|^2 \cdot D^d))$, where $|V(G)|$ is the number of vertices in $G$, $D$ is the maximal outgoing vertex degree in some maximal strongly connected component $C_i$ and $d$ is the maximal diameter of some maximal strongly connected component $C_i$.

Proof. There are at most $|V(G)|$ maximal strongly connected components and each of them has at most $|V(G)|$ vertices. Thus, Algorithm 2 is called at most $|V(G)|^2$ times. The time complexity of Algorithm 2 is given in Theorem 4.3. Thus, the time complexity of Lines 1–7 is $O(|V(G)|^2 \cdot D^d)$. Together with the time complexity of Function aDAG (Theorem 4.5), Theorem 4.6 holds. □

4.3. Optimization: computing local transitive closure in strongly connected components

In Lines 3–7 of Algorithm 4, we need to compute the local transitive closure $M_{C_i}$ for each maximal strongly connected component $C_i$. In order to compute $M_{C_i}$, we iterate Algorithm 2 from each vertex $u \in C_i$. Let us consider $C_1$ in graph $G$ in Fig. 4(a). Assume that we have computed $M_{C_1}(3,-)$. It is unnecessary to compute $M_{C_1}(2,-)$ by running Algorithm 2 from vertex 2 from scratch, since some search branches of computing $M_{C_1}(2,-)$ can be terminated as early as possible by using $M_{C_1}(3,-)$. That is, given two vertices $u$ and $u'$ in a strongly connected component $C$, where there exists a directed edge $\overrightarrow{u u'}$ in $C$, if $M_C(u',-)$ has been computed, some of the search space of computing $M_C(u,-)$ can be pruned by using $M_C(u',-)$ to save computation cost. For example, in Fig. 5, if there exist two different paths $p_1$ and $p_3$ from $u$ to vertex $d$, where $p_1$ does not go through $u'$, $p_3$ goes through $u'$, and $L(p_1) = L(p_3)$, it is not necessary to further extend $p_1$ to reach another vertex $d'$, since we can guarantee that we can find another path by extending $p_3$ to reach $d'$ and the two extended paths have the same labels. Specifically, in Fig. 5, $L(p_1 + p_2) = L(p_3 + p_2)$, where $p_1 + p_2$ means concatenating paths $p_1$ and $p_2$. Theorem 4.7 shows the details.

Fig. 5. Theorem 4.7.

Theorem 4.7. Given a non-redundant path $p_1$ from $u$ to $d$, if there exists another non-redundant path $p_3$ from $u$ to $d$ that goes through $u'$, where $u'$ is a neighbor of $u$, and $L(p_1) = L(p_3)$, then for any non-redundant path $p$ from $u$ to another vertex $d'$ that goes through vertex $d$, the following holds:

$L(p) \in (\lambda(\overrightarrow{u, u'}) \odot M_C(u', d'))$

where $M_C(u', d')$ is defined in Definition 3.6.

Proof. (1) Assume that $p$ goes through $u'$. The subpath of $p$ from $u'$ to $d'$ must be a non-redundant path, since $p$ is a non-redundant path. It means that $L(p) \in \lambda(\overrightarrow{u, u'}) \odot M_C(u', d')$.
(2) Assume that $p$ does not go through $u'$. Let the subpath of $p$ from vertex $d$ to $d'$ be denoted as $p_2$, as shown in Fig. 5. Since $L(p_1) \in (\lambda(\overrightarrow{u, u'}) \odot M_C(u', d))$, there exists a path $p_3$ from $u$ to $d$ through vertex $u'$ with $L(p_1) = L(p_3)$. Let $p' = p_3 p_2$, i.e., path $p'$ is formed by concatenating $p_3$ and $p_2$. It is straightforward to see that $L(p') = L(p)$. Since $p$ is a non-redundant path, $p'$ must also be a non-redundant path. The subpath of $p'$ from $u'$ to $d'$ is denoted as $p_4$. According to the Apriori property (Theorem 3.1), $p_4$ is also a non-redundant path. It means that $L(p_4) \in M_C(u', d')$. Therefore, $L(p) = L(p') = L(\overrightarrow{u u'}\, p_4) \in \lambda(\overrightarrow{u, u'}) \odot M_C(u', d')$. □
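The membership condition used in Theorem 4.7 can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the paper's code: a transitive-closure entry such as $M_C(u', d)$ is modeled as a collection of label sets, combining an edge label with an entry is assumed to mean adding that label to every label set, and pruning is assumed to keep only the minimal label sets. The names `prune`, `compose` and `can_stop_extending` are hypothetical.

```python
def prune(label_sets):
    """Keep only minimal label sets: drop any set that has a proper subset present."""
    unique = {frozenset(s) for s in label_sets}
    return {s for s in unique if not any(t < s for t in unique)}


def compose(edge_label, closure_entry):
    """Combine the label of edge (u, u') with every label set in M_C(u', d)."""
    return {frozenset(s) | {edge_label} for s in closure_entry}


def can_stop_extending(path_labels, edge_label, closure_entry):
    """True if L(p1) already appears among the label sets obtained via u'.

    When this holds, the branch rooted at p1 need not be extended further,
    which is the situation described by Theorem 4.7.
    """
    return frozenset(path_labels) in prune(compose(edge_label, closure_entry))


# Hypothetical entry M_C(u', d) = {{a}, {b, c}} and edge label 'a' on (u, u').
entry = [{"a"}, {"b", "c"}]
via_neighbor = prune(compose("a", entry))
print(sorted(sorted(s) for s in via_neighbor))
print(can_stop_extending({"a"}, "a", entry))
```

In this toy case the composed set $\{a, b, c\}$ is dropped by pruning because $\{a\}$ covers it, so only a path labeled $\{a\}$ satisfies the test.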

Assume that $M_C(u',-)$ is computed before computing $M_C(u,-)$. When computing $M_C(u,-)$ by Algorithm 2, we get a head triple $T_1 = [L(p_1), p_1, d]$ in Line 4. If $L(p_1)$ belongs to $Prune(\lambda(\overrightarrow{u, u'}) \odot M_C(u', d))$, it is not necessary to extend the path $p_1$ to reach another vertex $d'$, as we must find another path that goes through $u'$ to reach $d'$. Specifically, Algorithm 5 shows how to revise Algorithm 2 to save computation in computing $M_C(u,-)$. For simplicity, Algorithm 5 only shows the differences between Algorithms 5 and 2, where Lines 1-1, 1-2, 1-3 in Algorithm 5 replace Line 1 in Algorithm 2 and Lines 8-1, 8-2, 8-3 in Algorithm 5 replace Line 8 in Algorithm 2. Given a vertex $u$ in $C$, assume that there exist $m$ outgoing neighbors $u_1, \ldots, u_m$ of $u$, where $M_C(u_j,-)$ has been computed, $j = 1, \ldots, m$. Initially, we set $M_C(u,-) = \emptyset$. For each outgoing neighbor $u_j$ of $u$, we set $M_C(u,-) = Prune(M_C(u,-) \cup \{\lambda(\overrightarrow{u, u_j}) \odot M_C(u_j,-)\})$ (Lines 1-1, 1-2, 1-3 in Algorithm 5). Then, we employ Algorithm 2 from vertex $u$. However, when we pop the neighbor triple $T_1 = [L(p_1), p_1, d]$ from heap $H$ (Line 8 in Algorithm 2), we need to check whether $L(p_1)$ is in $M_C(u,d)$. If so, according to Theorem 4.7, we can terminate the branch. Otherwise, we continue with the following steps of Algorithm 2 (Lines 8-1, 8-2, 8-3 in Algorithm 5).

Algorithm 5. Optimization.
1-1) Set $u$ as the source. Set answer set $RS = \emptyset$, heap $H = \emptyset$ and $M_C(u,-) = \emptyset$
1-2) $M_C(u,-) = \bigcup_{\overrightarrow{u, u_j} \in C} \{\lambda(\overrightarrow{u, u_j}) \odot M_C(u_j,-)\}$.
......
8-1) if path label $L(p_1)$ exists in $M_C(u,d)$
8-2)   continue
8-3) Move $T_1$ into $RS$ and put $L(p_1)$ into $M_C(u,d)$

Let us consider $C_1$ in Fig. 4(a). Assume that we have computed $M_{C_1}(3,-)$ as shown in Fig. 6(a). According to Algorithm 5, we first set $M_{C_1}(2,-) = Prune(\lambda(\overrightarrow{2,3}) \odot M_{C_1}(3,-))$ in Fig. 6(b). Then, we begin Algorithm 2 from vertex 2. In the first step, there are two neighbor triples $[\{b\}, (2,5), 5]$ and $[\{a\}, (2,3), 3]$, as shown in Fig. 6(d). Since triple $[\{b\}, (2,5), 5]$ is not covered by any path label in $M_{C_1}(2,5)$, we insert $\{b\}$ into $M_{C_1}(2,5)$. Then, we insert the neighbor triples of 5 into $H$. At the second step, neighbor triple $[\{a\}, (2,3), 3]$ is removed from heap $H$, since there is a path label $\{a\}$ in $M_{C_1}(2,3)$. Furthermore, neighbor triple $[\{ab\}, (2,5,6), 6]$ is also removed from heap $H$ since $\{ab\}$ is covered by a path label $\{a\}$ in $M_{C_1}(2,6)$. Therefore, the whole process terminates at the second step. We get $M_{C_1}(2,-)$ in Fig. 6(c).

Fig. 6. Optimization technique.

According to Algorithm 5, we have the following method to compute the local transitive closure $M_C$ for a strongly connected component $C$. Specifically, we first find a vertex $u'$ with the largest incoming degree in $C$. We compute $M_C(u',-)$ by Algorithm 2. Then, we perform a breadth-first search over $C$ from this vertex. When visiting some node $u$ in $C$, we utilize Algorithm 5 to compute $M_C(u,-)$.

5. LCR query over large graphs

Given a large graph $G$, it is very expensive to compute path-label transitive closures, despite their capability to answer LCR queries. As discussed earlier, another extreme approach to answering reachability queries is to traverse $G$ on the fly. The intuition behind our method is that we can combine the two extreme approaches to find a good trade-off between offline and online costs.

In our method, we partition a large graph $G$ into several blocks $P_i$, $i = 1, \ldots, k$. For each block $P_i$, we employ the method in Section 4 to compute the path-label transitive closure of $P_i$. All boundary vertices and crossing edges (Definition 5.1) are collected to form a skeleton graph $G^*$ (Definition 5.2). $G^*$ is typically much smaller than $G$. With local transitive closures, we can answer LCR queries over $G$ by traversing $G^*$ on the fly. Different partitions lead to different query performance. We delay the discussion of finding the optimal partition to enhance query performance until Section 5.2, since it is related to our query algorithm in Section 5.1.

5.1. Query algorithm

Definition 5.1. Given a vertex $u$ in a block $P$, $u$ is a boundary vertex if and only if $u$ has at least one neighbor vertex that is outside of block $P$. An edge $e = \overrightarrow{u_1 u_2}$ is called a crossing edge if and only if $u_1$ and $u_2$ are boundary vertices in two different blocks.
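Definition 5.1 can be made concrete with a short Python sketch. It assumes the partition is given as a dictionary from vertex to block id and the graph as a list of labeled edges; both representations and the function name `boundary_and_crossing` are hypothetical choices for illustration only.

```python
def boundary_and_crossing(edges, block_of):
    """Return (boundary vertices, crossing edges) for a partitioned graph.

    edges: iterable of (u, v, label) directed edges.
    block_of: dict mapping each vertex to the id of the block containing it.
    """
    # A crossing edge connects two vertices that lie in different blocks.
    crossing = [(u, v, label) for u, v, label in edges
                if block_of[u] != block_of[v]]
    # A vertex is a boundary vertex if some neighbor (in either direction)
    # lies outside its own block, i.e. it is an endpoint of a crossing edge.
    boundary = {u for u, _, _ in crossing} | {v for _, v, _ in crossing}
    return boundary, crossing


# Hypothetical toy graph split into two blocks.
toy_edges = [(1, 2, "a"), (1, 5, "a"), (2, 3, "b"), (3, 4, "a"), (4, 1, "c")]
toy_blocks = {1: 0, 2: 0, 5: 0, 3: 1, 4: 1}
boundary, crossing = boundary_and_crossing(toy_edges, toy_blocks)
print(sorted(boundary))
print(crossing)
```

The boundary vertices together with the crossing edges are the ingredients that the text collects into the skeleton graph; vertex 5 in the toy data stays out of it because all of its neighbors are in its own block.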

Definition 5.2. Given a large graph $G$ that is partitioned into $k$ blocks $P_i$, $i = 1, \ldots, k$, all boundary vertices and crossing edges are collected to form a skeleton graph $G^*$.

Algorithm 6. Answer LCR queries by traversal over the skeleton graph.
Input: Two vertices $u_1$ and $u_2$ and a label set $S$.
Output: If $u_1 \xrightarrow{S} u_2$, return True; otherwise, return False.
1: Let $P_1$ and $P_2$ denote the corresponding blocks of $u_1$ and $u_2$, respectively.
2: Let $Visited = \emptyset$.
3: if $P_1 = P_2$ then
4:   if $M_{P_1}(u_1, u_2)$ covers $S$ then
5:     Return True
6:   end if
7: end if
8: Let $BD = \{b \mid b$ is a boundary vertex in $P_1$ and $M_{P_1}(u_1, b)$ covers $S\}$
9: for each $b \in BD$ do
10:  Assume that $b$ has $m$ outgoing neighbors $b_i$ ($i = 1, \ldots, m$) that are in blocks other than $P_1$ and $\lambda(\overrightarrow{b b_i}) \in S$.
11:  for each outgoing neighbor $b_i$, $i = 1, \ldots, m$ do
12:    Let $P_i$ denote the block where $b_i$ resides.
13:    if $(b_i, P_i) \in Visited$ then
14:      continue
15:    else
16:      Call Function Travel($b_i, P_i$)
17:    end if
18:  end for
19: end for

Function Travel($b', P'$)
1: Insert $(b', P')$ into $Visited$
2: if $P' = P_2$ then
3:   if $M_{P'}(b', u_2)$ covers $S$ then
4:     Return True
5:   end if
6: end if
7: Let $BD = \{b \mid b$ is a boundary vertex in $P'$ and $M_{P'}(b', b)$ covers $S\}$
8: for each $b \in BD$ do
9:   Assume that $b$ has $m$ outgoing neighbors $b_i$ ($i = 1, \ldots, m$) that are in blocks other than $P'$ and $\lambda(\overrightarrow{b b_i}) \in S$.
10:  for each outgoing neighbor $b_i$, $i = 1, \ldots, m$ do
11:    Let $P_i$ denote the block where $b_i$ resides.
12:    if $(b_i, P_i) \in Visited$ then
13:      continue
14:    else
15:      Call Function Travel($b_i, P_i$)
16:    end if
17:  end for
18: end for

Algorithm 6 shows the pseudo-code for LCR testing over two vertices $u_1$ and $u_2$ under label constraint $S$. Lines 3–5 verify whether $u_1$ and $u_2$ are in the same block $P_1$ and $LCR(u_1, u_2, S, P_1) = true$, and if so, return true. Otherwise, Line 8 finds all boundary vertices (in $P_1$) that are reachable from $u_1$ under constraint $S$. For an adjacent edge $e = \overrightarrow{b b_i}$ of such a boundary vertex, if $\lambda(\overrightarrow{b b_i}) \in S$, the traversal can reach boundary vertices of other blocks under constraint $S$. Note that we use $Visited$ to store all vertices that have been visited in one search branch (Line 1 in Function Travel). A vertex can be visited multiple times (Line 14 in Function Travel) in different branches, but at most once in the same branch. Otherwise, it may generate duplicate results. If the traversal reaches the destination block $P_2$, Lines 2–4 (in Function Travel) verify whether $LCR(b', u_2, S, P_2) = true$, and if so, return true. Otherwise, the traversal continues.

5.2. Finding the optimal partition

As mentioned earlier, different partitions lead to different query performance. For example, given a graph $G$, there are two graph partitions over $G$, as shown in Fig. 7. Consider a query $LCR(1, 2, \{a\}, G)$. In the first partition in Fig. 7(b), our query algorithm can answer $LCR(1, 2, \{a\}, G)$ based on the local transitive closure, since path $(1, 3, 2)$ is contained in one block. However, given the second partition, we have to access another block to answer the same query. The former is faster than the latter, since the latter leads to more I/O cost.

Fig. 7. Partition vs. query performance. (a) Graph $G$. (b) A partition. (c) Another partition.

The optimal partition should minimize the overall query workload. In this subsection, we formalize the graph partition problem and prove that it is an NP-hard problem, since the min-cut graph partitioning problem can be reduced to this optimization problem. Therefore, we adopt some heuristic methods to find a “good” partition to speed up Algorithm 6. Generally speaking, we reduce a classical min-cut edge-weighted graph partition problem into our problem. The main contribution of our method is how to assign edge weights in our problem. Note that, in the following discussion, we assume that the partition number $k$ is given. We will discuss how to set $k$ in Section 5.3.

Considering one path $p$ from $u_1$ to $u_2$ in graph $G$, if there are $l$ crossing edges, $p$ is divided into $2l + 1$ segments. If a segment contains some consecutive non-crossing edges, it is called a non-crossing segment. Otherwise, it is called a crossing segment. For example, given a path $p(1, 2, 3, 4, 5, 6, 7, 8)$ in Fig. 8, since there are two crossing edges, $p$ is divided into five segments, $(1, 2, 3)$, $(3, 4)$, $(4, 5, 6)$, $(6, 7)$ and $(7, 8)$, where $(1, 2, 3)$, $(4, 5, 6)$ and $(7, 8)$ are non-crossing segments, and the others are crossing segments. Let us recall Algorithm 6. We employ local transitive closures to find a non-crossing segment, whose cost is defined as $\alpha$. We employ online traversal to find a crossing segment, whose cost is defined as $\beta$. Thus, we define the cost of finding one path $p$ as follows:

Fig. 8. Crossing segments and non-crossing segments.

Definition 5.3. Given a path $p$, if there are $l$ crossing edges, the cost of finding $p$ is defined as $Cost(p) = (l + 1) \cdot \alpha + l \cdot \beta = \alpha + (\alpha + \beta) \cdot l$.

Given a graph $G$, any non-redundant path may be an answer to an LCR query. Therefore, given a partition over graph $G$, we define the overall cost for LCR queries as follows:

Definition 5.4. Given a partition over graph $G$, the overall cost of this partition is defined as follows:

$Cost = \sum_{p_i \in PC} Cost(p_i) = \alpha \cdot |PC| + (\alpha + \beta) \cdot \sum_{p_i \in PC} l_i$    (2)

where $PC$ denotes all non-redundant paths and $l_i$ denotes the number of crossing edges in path $p_i$.

Given a graph $G$, the first part of Eq. (2) is a constant. Different partitions over $G$ lead to different values of the second part of Eq. (2).

Definition 5.5. A partition over graph $G$ is optimal if and only if its cost (Definition 5.4) is minimal.

Theorem 5.1. Given a graph $G$ and a number $k$, finding the optimal partition (Definition 5.5) that divides $G$ into $k$ disjoint blocks is NP-hard.

Proof. Generally speaking, the classical min-cut graph partition problem, i.e., partitioning $G$ into $k$ disjoint blocks such that the sum of all crossing edge weights is minimized, can be reduced to our optimization problem. In order to prove the theorem, we first introduce the following definition.

Definition 5.6. Given an edge $e$ in graph $G$, its edge weight is defined as follows:

$w(e) = |\{p_i \mid e \in p_i \wedge p_i \in PC\}|$

where $PC$ contains all non-redundant paths. In other words, $w(e)$ denotes the number of non-redundant paths (in $PC$) that contain $e$.

Given a partition over $G$, all crossing edges are denoted as $CE$. The following equation holds:

$Cost = \alpha \cdot |PC| + (\alpha + \beta) \cdot \sum_{p_i \in PC} l_i = \alpha \cdot |PC| + (\alpha + \beta) \cdot \sum_{e \in CE} w(e)$    (3)

Eq. (3) means that the optimal partition for our query algorithm is the partition with the minimal sum of crossing edge weights (Definition 5.6), which is exactly the min-cut graph partitioning problem, a classical NP-hard problem. □

Actually, the proof of Theorem 5.1 implies a way to find the optimal partition. Specifically, all non-redundant paths are enumerated. Then, each edge weight $w(e_i)$ can be determined by Definition 5.6. Finally, a classical min-cut graph partitioning algorithm, such as METIS [15], can be utilized to find a good partition. However, it is prohibitively expensive to enumerate all non-redundant paths (i.e., $PC$) in practice. Therefore, we utilize sampling to estimate edge weights. Specifically, we randomly select $\Delta$ seed vertices from $G$, denoted as $s_i$, $i = 1, \ldots, \Delta$. Then, for each seed $s_i$, we employ Algorithm 2 to enumerate all non-redundant paths beginning from $s_i$. The set of these paths is denoted as $PC'$. The estimated edge weight of $e$ is defined as follows:

$w'(e) = |\{p \mid p \in PC' \wedge e \in p\}|$

Finally, we employ METIS to partition graph $G$ based on these estimated edge weights.

5.3. Setting k

Now, we discuss how to set the partition number $k$. The larger the value of $k$, the smaller each block $P$ is. On the other hand, a larger $k$ means more blocks in the skeleton graph $G^*$, which leads to a larger search space in the online traversal of Algorithm 6. Therefore, in order to optimize query performance, $k$ should be as small as possible. However, a small $k$ leads to a more expensive offline cost. In the extreme, $k = 1$ means computing the transitive closure over the whole graph. In practice, we can tune $k$ to obtain a good trade-off between offline and online performance.

6. LCR query over dynamic graphs

In this section, we address index maintenance issues in dynamic graphs. In real-life applications, such as social networks, the graph structure evolves over time. We model the updates over graphs as a series of edge insertions and deletions. In this section, we only discuss how to handle the two basic operations (edge insertions and deletions).2

2 Inserting an isolated vertex does not affect the reachability information. Deleting a vertex means deleting all adjacent edges to the vertex.

Assume that a graph $G$ is decomposed into $k$ blocks $P_i$, $i = 1, \ldots, k$. All boundary vertices and crossing edges are collected to form a skeleton graph $G^*$. The index maintenance involves the updates to the skeleton graph $G^*$ and the local transitive closures. We first discuss the general framework of our method in Section 6.1. The key problem is how to update local transitive closures, which is presented in Section 6.2.

6.1. Overview

Consider that we insert an edge $e = \overrightarrow{u_i u_j}$ into graph $G$. There are five cases for $e$:

(1) If $u_i \notin G$ AND $u_j \notin G$, inserting an isolated edge $e$ does not affect the reachability information of the original graph. It is a trivial case.

(2) If $u_i \notin G$ AND $u_j \in G$ AND $u_j \in P_j$, we update the local transitive closure of block $P_j$ by the method in Section 6.2.1.
(3) If $u_i \in G$ AND $u_j \notin G$ AND $u_i \in P_i$, we update the local transitive closure of $P_i$ by the method in Section 6.2.1.
(4) If $u_i \in G$ AND $u_j \in G$ AND $\overrightarrow{u_i u_j} \in P_i$, we update the local transitive closure of $P_i$ by the method in Section 6.2.1.
(5) If $u_i \in G$ AND $u_j \in G$ AND $u_i \in P_i$ AND $u_j \in P_j$ AND $P_i \neq P_j$, meaning $\overrightarrow{u_i u_j}$ is a crossing edge, we introduce edge $e$ into $G^*$ directly.

Consider that we delete an edge $e = \overrightarrow{u_i u_j}$ from graph $G$. There are two cases for $e$:

(1) If $e$ is a crossing edge in $G^*$, we delete $e$ from $G^*$ directly.
(2) If $e$ is in one block $P_i$, we delete $e$ from $P_i$ and update the local transitive closure of $P_i$ by the method in Section 6.2.2.

6.2. Local transitive closure maintenance

6.2.1. Edge insertion

Assume that we have computed the local transitive closure of block $P$, denoted as $M_P$. This subsection discusses how to update the local transitive closure if a new edge $e = \overrightarrow{u_i u_j}$ is inserted into $P$. Let $P'$ be the updated block. There are four cases for $e = \overrightarrow{u_i u_j}$: (1) $u_i \notin P \wedge u_j \notin P$; (2) $u_i \in P \wedge u_j \notin P$; (3) $u_i \notin P \wedge u_j \in P$; (4) $u_i \in P \wedge u_j \in P$. Algorithm 7 shows the pseudo-code.

Algorithm 7. Insertion.
Input: The local transitive closure $M_P$ over block $P$, and an inserted edge $e = \overrightarrow{u_i u_j}$;
Output: the updated block $P'$ and the updated transitive closure $M_{P'}$.
1: Insert $e$ into $P$ to get $P'$.
2: //Case 1;
3: $M_{P'} = M_P$.
4: if $u_i \notin P \wedge u_j \notin P$ then
5:   set $M_{P'}(u_i, u_j) = \lambda(e)$ and insert $M_{P'}(u_i, u_j) = \lambda(e)$ into $M_{P'}$
6:   Return $M_{P'}$
7: end if
8: //Case 2;
9: if $u_i \in P \wedge u_j \notin P$ then
10:  for each vertex $u$ in $P$ do
11:    $M_{P'}(u, u_j) = Prune(M_P(u, u_i) \odot \lambda(\overrightarrow{u_i u_j}))$
12:  end for
13:  set $M_{P'}(u_j,-) = \{\emptyset, \ldots, \emptyset\}$ and insert $M_{P'}(u_j,-)$ into $M_{P'}$
14:  Return $M_{P'}$
15: end if
16: //Case 3;
17: if $u_i \notin P \wedge u_j \in P$ then
18:  set $M_{P'}(u_i,-) = Prune(\lambda(\overrightarrow{u_i u_j}) \odot M_P(u_j,-))$ and put $M_{P'}(u_i,-)$ into $M_{P'}$.
19:  Return $M_{P'}$
20: end if
21: //Case 4;
22: if $u_i \in P \wedge u_j \in P$ then
23:  Get an inverse graph $\overline{P'}$ by reversing each edge's direction in $P'$
24:  Employ Algorithm 2 in $\overline{P'}$ from vertex $u_j$, replacing Line 2 by “put neighbor $u_i$ of $u_j$ into $H$”, to get $M_{\overline{P'}, \overrightarrow{u_j u_i}}(u_j,-)$.
25:  for each vertex $u$ in $P$ do
26:    Set $M_{P', \overrightarrow{u_i u_j}}(u, u_j) = M_{\overline{P'}, \overrightarrow{u_j u_i}}(u_j, u)$
27:    Set $M_{P'}(u,-) = Prune(M_P(u,-) \cup M_{P', \overrightarrow{u_i u_j}}(u, u_j) \odot M_P(u_j,-))$
28:  end for
29:  Return $M_{P'}$
30: end if

It is very easy to update the transitive closure $M_P$ in the first three cases. In Case 1 (Fig. 9(a)), the inserted edge does not affect the transitive closure of other vertices, i.e., $\forall u \in P$, $M_P(u,-) = M_{P'}(u,-)$, where $M_P(u,-)$ denotes the single-source transitive closure from vertex $u$ in block $P$.

Let us consider Case 2 in Fig. 9(b). Initially, for each vertex $u \in P$, we set $M_{P'}(u,-) = M_P(u,-)$. For each vertex $u \in P$, $M_{P'}(u, u_j) = Prune(M_P(u, u_i) \odot \lambda(\overrightarrow{u_i u_j}))$. We set $M_{P'}(u_j,-) = \{\emptyset, \ldots, \emptyset\}$.

Let us consider Case 3 in Fig. 9(c). For each vertex $u \in P$, we set $M_{P'}(u,-) = M_P(u,-)$. We set $M_{P'}(u_i,-) = Prune(\lambda(\overrightarrow{u_i u_j}) \odot M_P(u_j,-))$.

The key issue is how to compute $M_{P'}$ in Case 4 in Fig. 9(d), which is the focus of this subsection.

Fig. 9. Four cases.

Theorem 6.1. $P$ is changed into $P'$ by inserting an edge $e$ into $P$. Given any two vertices $u_1$ and $u_2$ in block $P$, $M_P(u_1, u_2) \neq M_{P'}(u_1, u_2)$ if and only if the following conditions hold, where $M_P(u_1, u_2)$ is the set of minimal path-label sets in block $P$ (Definition 3.6).

(1) there exists a path $p$ from $u_1$ to $u_2$ in $P'$, where $p$ goes through $e$; and

(2) $M_P(u_1, u_2)$ does not cover $L(p)$, where $L(p)$ denotes the path label of $p$.

Proof. If the above two conditions hold, it is straightforward that $M_P(u_1, u_2) \neq M_{P'}(u_1, u_2)$. Given two vertices $u_1$ and $u_2$, if there exists no path $p$ from $u_1$ to $u_2$ that goes through $e$, the insertion of $e$ does not affect $M_P(u_1, u_2)$. Thus, $M_P(u_1, u_2) = M_{P'}(u_1, u_2)$. If there exists a path $p$ from $u_1$ to $u_2$ that goes through $e$, but $M_P(u_1, u_2)$ covers $L(p)$, there must exist another path $p'$ from $u_1$ to $u_2$ whose path label $L(p')$ covers $L(p)$. Thus, the insertion of $e$ does not affect $M_P(u_1, u_2)$, i.e., $M_P(u_1, u_2) = M_{P'}(u_1, u_2)$. □

According to Theorem 6.1, we design an algorithm to handle the insertion in Case 4. Assume that edge $e = \overrightarrow{u_i u_j}$ is inserted into block $P$. Initially, for each vertex $u \in P'$, we set $M_{P'}(u, -) = M_P(u, -)$. Due to the inserted edge $\overrightarrow{u_i u_j}$, some reachability information needs to be updated. For example, due to the inserted edge $\overrightarrow{10, 4}$ in Fig. 9(d), vertex 10 can reach vertices {4, 5, 6, 1, 2, 3, 8} in $G'$, since vertex 4 can reach these vertices in $G$ and vertex 10 can reach 4 in $G'$. Therefore, following the reverse direction of the edges, we need to propagate $M_P(4, -)$ to other vertices iteratively through $\overrightarrow{10, 4}$. Note that the propagation process also follows the "best-first" strategy in Algorithm 2 to avoid redundant paths.

Specifically, we design the following algorithm. We first insert edge $\overrightarrow{u_i u_j}$ into block $P$ to obtain $P'$. We get an inversed graph $\overline{P}$ by reversing the direction of each edge in $P'$. Then, we employ Algorithm 2 in $\overline{P}$ from vertex $u_j$. Note that, in the first step, we only consider neighbor $u_i$ in graph $\overline{P}$. In this way, we know how $u_j$ can reach other vertices in $\overline{P}$ through edge $\overrightarrow{u_j u_i}$ in graph $\overline{P}$. Equivalently, we know how each vertex $u \in P'$ can reach $u_j$ through edge $\overrightarrow{u_i u_j}$ in graph $P'$, denoted as $M_{P', \overrightarrow{u_i u_j}}(u, u_j)$. Finally, for each vertex $u \in P'$, we update

$$M_{P'}(u, -) = \mathrm{Prune}\left(M_P(u, -) \cup M_{P', \overrightarrow{u_i u_j}}(u, u_j) \odot M_P(u_j, -)\right).$$

For example, we get a block $P'$ by inserting edge $e = \overrightarrow{10, 4}$ into $P$ in Fig. 9(d). Initially, we set $M_{P'}(u, -) = M_P(u, -)$ for each $u \in P'$. We get a reverse graph $\overline{G}$, as shown in Fig. 10(a). Then, we employ Algorithm 2 in $\overline{P}$ from vertex 4. Note that, in the first step, we only consider neighbor 10 in graph $\overline{P}$. Fig. 10(b) shows the process. Note that we only consider which vertices can be reached from vertex 4 through edge $\overrightarrow{4, 10}$ in $\overline{P}$. Then, we know that $M_{P', \overrightarrow{10, 4}}(10, 4) = \{c\}$ and $M_{P', \overrightarrow{10, 4}}(9, 4) = \{b, c\}$. For every other vertex $u \in P'$, $M_{P', \overrightarrow{10, 4}}(u, 4) = \emptyset$. Finally, we update $M_{P'}(10, -) = M_{P', \overrightarrow{10, 4}}(10, 4) \odot M_{P'}(4, -)$ and $M_{P'}(9, -) = \mathrm{Prune}\left(M_P(9, -) \cup M_{P', \overrightarrow{10, 4}}(9, 4) \odot M_{P'}(4, -)\right)$.

The time complexity analysis is given in Table 2. The time complexity of the first three cases is trivial to derive.

Table 2. Time complexity analysis of inserting edge $e$ into block $P$.

| $u_i \notin P \wedge u_j \notin P$ | $u_i \in P \wedge u_j \notin P$ | $u_i \notin P \wedge u_j \in P$ | $u_i \in P \wedge u_j \in P$ |
|---|---|---|---|
| $O(1)$ | $O(\lvert V(P) \rvert)$ | $O(\lvert V(P) \rvert)$ | $O(Dd)$ |

Note: $e = \overrightarrow{u_i u_j}$, $D$ is the maximal degree in block $P$ and $d$ is the diameter of $P$.

Algorithm 8. Deletion.
Input: the local transitive closure $M_P$ over block $P$, and a deleted edge $e = \overrightarrow{u_i u_j}$.
Output: the updated block $P'$ and the updated transitive closure $M_{P'}$.
Delete $e = \overrightarrow{u_i u_j}$ from $P$ to get $P'$.
Let $C_i$ and $C_j$ be the maximal strongly connected components in which $u_i$ and $u_j$ reside.
// Case 1
if $C_i \neq C_j$ then
    All maximal strongly connected components are unchanged in $P'$.
    Rebuild the aDAG $D'$ by calling Algorithm 3.
    Recompute $M_{P'}$ by calling Function aDAG in Algorithm 4.
end if
// Case 2
if $C_i = C_j$ then
    Recompute all maximal strongly connected components.
    for each maximal strongly connected component $C'$ do
        if $C'$ is the same as some maximal strongly connected component $C$ in $P$ then
            $M_{C'} = M_C$
        else
            Recompute $M_{C'}$ by calling Algorithm 2.
        end if
    end for
    Recompute $M_{P'}$ by calling Function aDAG in Algorithm 4.
end if
Return $M_{P'}$

Fig. 10. Algorithm process. (a) A reverse graph G. (b) Process.
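The insertion update for Case 4 can be illustrated with a minimal Python sketch. It is not the best-first propagation of Algorithm 2: it combines label sets by brute force, which yields the same minimal label sets on small inputs because any path through the new edge is covered by an old path to $u_i$, the new edge, and an old path from $u_j$. The dictionary representation of the closure and the names `prune` and `insert_edge` are assumptions made for illustration only.

```python
def prune(label_sets):
    """Keep only minimal label sets: drop any set that strictly contains another."""
    sets = set(label_sets)
    return {s for s in sets if not any(t < s for t in sets)}


def insert_edge(M, vertices, ui, uj, label):
    """Return the closure after inserting edge ui -> uj carrying `label`.

    M maps (u, v) to a set of frozensets: the minimal path label sets from
    u to v before the insertion. A missing pair means "unreachable".
    (Hypothetical representation, chosen for this sketch.)
    """
    empty_path = {frozenset()}
    new_M = {pair: set(sets) for pair, sets in M.items()}
    for u in vertices:
        # Label sets of old paths u -> ui (the empty path if u is ui itself).
        to_ui = empty_path if u == ui else M.get((u, ui), set())
        for v in vertices:
            if u == v:
                continue
            # Label sets of old paths uj -> v (the empty path if v is uj itself).
            from_uj = empty_path if v == uj else M.get((uj, v), set())
            # Paths that use the new edge: old prefix + new label + old suffix.
            through_e = {a | {label} | b for a in to_ui for b in from_uj}
            if through_e:
                new_M[(u, v)] = prune(M.get((u, v), set()) | through_e)
    return new_M
```

On a hypothetical four-vertex block with edges 9 → 10 (label b) and 4 → 5 (label a), inserting 10 → 4 with label c gives {c} for the pair (10, 4) and {b, c} for the pair (9, 4), the same shape of result as in the example above.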

Fig. 11. Two cases. (a) Case 1. (b) Case 2.
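The two deletion cases of Fig. 11 can be told apart without recomputing all strongly connected components. Because the edge $\overrightarrow{u_i u_j}$ exists before the deletion, $u_i$ and $u_j$ share a component exactly when $u_j$ can also reach $u_i$. The sketch below relies on that observation; the function names and the edge-list input are assumptions for illustration, and the sketch only classifies the deletion, it does not update the transitive closure.

```python
from collections import defaultdict


def reachable(adj, src, dst):
    """Iterative depth-first search: True if dst can be reached from src."""
    stack, seen = [src], {src}
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False


def deletion_case(edges, ui, uj):
    """Classify the deletion of edge (ui, uj) from a block.

    edges: iterable of (source, target) pairs of the block before the
    deletion, including (ui, uj); edge labels do not matter for this test.
    Returns 1 if ui and uj lie in different strongly connected components
    (the components are unchanged by the deletion), and 2 if they share a
    component (the components must be recomputed).
    """
    adj = defaultdict(list)
    for s, t in edges:
        adj[s].append(t)
    # ui -> uj is an edge, so Ci == Cj exactly when uj can reach ui.
    return 2 if reachable(adj, uj, ui) else 1
```

For the hypothetical edge list `[(1, 2), (2, 3), (3, 1), (3, 4)]`, deleting (1, 2) falls into Case 2 because the cycle 1 → 2 → 3 → 1 is broken, while deleting (3, 4) falls into Case 1.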

For the last case, we need to employ Algorithm 2. The time complexity of Algorithm 2 has been given in Theorem 4.3. Thus, the time complexity of the last case is $O(Dd)$, where $D$ is the maximal vertex outgoing degree in block $P$ and $d$ is the diameter of $P$.

6.2.2. Edge deletion

In this section, we discuss how to update the local transitive closure if we delete one edge $e = \overrightarrow{u_i u_j}$ from block $P$. Let $P'$ be the updated block after edge deletion. $C_i$ and $C_j$ are the two maximal strongly connected components containing $u_i$ and $u_j$ in the original block $P$, respectively. There are two cases for edge $e = \overrightarrow{u_i u_j}$. Fig. 11 demonstrates the two cases.

1. $u_i$ and $u_j$ are not in the same maximal strongly connected component, i.e., $C_i \neq C_j$.
2. $u_i$ and $u_j$ are in the same maximal strongly connected component, i.e., $C_i = C_j$.

If $u_i$ and $u_j$ are in two different maximal strongly connected components, it is straightforward that all maximal strongly connected components in $P'$ are the same as those in $P$. According to Algorithm 3, we can compute the updated aDAG, denoted as $D'$. Finally, we re-compute the local transitive closure by calling Function aDAG in Algorithm 4 again. Actually, the computing process can be optimized by the following theorem. According to Theorem 6.2, we only need to begin the propagation from $C_j$.

Theorem 6.2. Let $P'$ be the updated block obtained by deleting an edge $e = \overrightarrow{u_i u_j}$ from $P$, where $u_i$ and $u_j$ are in two different maximal strongly connected components $C_i$ and $C_j$, respectively. Consider a vertex $u \in P'$ that is included in a maximal strongly connected component $C$. $M_{P'}(u, -) \neq M_P(u, -)$ only if there exists a directed path from $C$ to $C_i$ in aDAG $D$, where $D$ is the augmented DAG of block $P$.

Proof. If there is no directed path from $C$ to $C_i$ in aDAG $D$, there exists no path from vertex $u$ to other vertices in $G$ that goes through the deleted edge $e = \overrightarrow{u_i u_j}$. Therefore, the single-source transitive closure satisfies $M_{G'}(u, -) = M_G(u, -)$. □

If $u_i$ and $u_j$ are in the same maximal strongly connected component $C$, we propose the following algorithm to handle the update. First, we identify all maximal strongly connected components in the updated graph $G'$. Then, according to the method in Section 4, we represent $G'$ as a new aDAG $D'$. For each maximal strongly connected component $C'$ in $G'$, if $C'$ is the same as some maximal strongly connected component in $G$, it is not necessary to recompute $M_{C'}$. Otherwise, we need to recompute $M_{C'}$. Since we only delete one edge, most components are not changed in $G'$. Following the reverse topological order of $D'$, we find the lowest component $C'$ whose local transitive closure (i.e., $M_{C'}$) is changed. Finally, we call Function aDAG in Algorithm 4 from the component $C'$ to recompute the transitive closure.

In the first case, the maximal strongly connected components do not change after deleting an edge. We only need to rebuild the aDAG $D'$ and re-compute $M_{P'}$ by calling Function aDAG in Algorithm 4. Thus, the time complexity of the first case is the same as that of Function aDAG, which is $O(|V(P)|^3)$. In the second case, we have to re-compute $M_{C_i}$ for some maximal strongly connected components. Thus, the time complexity of the second case is $O(\mathrm{Max}(|V(P)|^3, |V(P)|^2 Dd))$.

Table 3. Time complexity analysis of deleting edge $e$ from block $P$.

| $u_i \in C_i \wedge u_j \in C_j \wedge C_i \neq C_j$ | $u_i \in C_i \wedge u_j \in C_j \wedge C_i = C_j$ |
|---|---|
| $O(\lvert V(P) \rvert^3)$ | $O(\mathrm{Max}(\lvert V(P) \rvert^3, \lvert V(P) \rvert^2 Dd))$ |

Note: $D$ is the maximal degree in block $P$ and $d$ is the diameter of $P$.

7. Experiments

In this section, we evaluate our methods over both random networks and real datasets, and compare them with the existing solution, the sampling-tree method [11]. Specifically, we experimentally study the performance of four approaches: (1) the sampling-tree method proposed in [11]; (2) the transitive closure method, which computes the path-label transitive closure by Algorithm 4 and answers LCR queries based on it; (3) the partition-based approach proposed in Algorithm 6; (4) the bi-directional search proposed in [9]. The code of the sampling-tree method was provided by the authors of [11]. Our methods, including the transitive closure method and the partition-based approach, are implemented in C++, and our experiments are conducted on a P4 3.0 GHz machine with 2G RAM running Ubuntu Linux.

7.1. Datasets

There are two types of synthetic datasets used in our experiments, namely, the Erdos Renyi Model (ER) and the Scale-Free Model (SF). ER is a classical random graph model. It defines a random graph as $|V|$ vertices connected by $|E|$ edges, chosen randomly from the $|V|(|V|-1)$ possible edges. In our experiments, we vary the density $|E|/|V|$ from 1.5 to 5.0, and vary $|V|$ from 1K to 200K. SF defines a random network with $|V|$ vertices satisfying a power-law distribution in vertex degrees. In our implementation, we use the graph generator gengraphwin (http://fabien.viger.free.fr/liafa/generation/) to generate a large graph $G$ satisfying a power-law distribution. Usually, the power-law distribution parameter $\gamma$ is between 2.0 and 3.0 to simulate real complex networks [19]. Thus, the default value of parameter $\gamma$ is set to 2.5 in this work. In order to study the scalability, we also vary $|V|$ in SF networks from 1K to 200K. The number of edge labels ($|\Sigma|$) is 20. Labels are assigned according to a uniform distribution.

We also employ four real graph datasets (Yeast, Small-Yago, Large-Yago and DBLP) in our experiments. The first two datasets were provided by the authors of [11].

(1) Yeast is a protein-to-protein interaction network in budding yeast. Each vertex denotes a protein and an edge denotes the interaction between two corresponding proteins. The Yeast graph contains 3063 vertices (genes) with density 2.4. It has 5 edge labels, which correspond to different types of interactions, such as protein–DNA and protein–protein interaction.
(2) Small-YAGO is a graph sampled from a large RDF dataset, containing 5000 vertices with 66 labels, and has density $|E|/|V| = 5.7$.
(3) Large-YAGO is the full version of the RDF graph corresponding to the YAGO dataset, which is a knowledge base containing information harvested from Wikipedia and linked to Wordnet. In our experiments, we delete all "literal" vertices from the RDF graph and keep all "entity" and "class" vertices. Each edge label corresponds to one property. Generally speaking, Large-Yago has 2 million vertices, 6 million edges and 97 edge labels. The average density is $|E|/|V| = 2.7$.
(4) DBLP contains a large number of bibliographic descriptions on major computer science journals and proceedings. We use an RDF version of the DBLP dataset, which is available at http://sw.deri.org/aharth/2004/07/dblp/. We also delete all "literal" vertices from the DBLP RDF graph. There are 1,145,882 vertices, 1,699,117 edges and 5 edge labels in the RDF graph. Each edge label denotes one property. The average density is $|E|/|V| = 1.48$.

7.2. Performance of transitive closure method

In this section, we use Algorithm 4 to compute the transitive closure for a graph $G$. We report index construction time (IT), index size (IS) and average query response time (QT) for the experiments on the synthetic datasets. IT-opt refers to the index construction time when we utilize the optimization technique in Section 4.3. Note that the default query constraint size ($|S|$) is $30\% \times |\Sigma| = 6$. Furthermore, we also compare our method with the sampling-tree method. In the following experiments, we always randomly generate 1000 queries to evaluate query performance. QT is reported as the average response time for one query. In these experiments, we evaluate the performance with regard to graph size, graph density and label constraint size $|S|$. Furthermore, we also test the performance of bi-directional search in Table 4. Since bi-directional search does not need offline processing, we only report its QT in the following experiments.

Exp1. varying graph size ($|V|$) on ER graphs: In this experiment, we fix the density $|E|/|V| = 1.5$ and label constraint size $|S| = 6$, and vary $|V|$ from 1000 to 10,000 to study the performance under varying graph sizes. Table 4 reports the detailed performance, namely index sizes (IS), index building times (IT) and average query response time (QT). From Table 4, we know that the transitive closure method is faster than the sampling-tree method in offline processing by orders of magnitude.

Furthermore, the optimization technique (IT-opt) in Section 4.3 can further speed up offline processing, as shown in Table 4. For example, when $|V| = 1$K, the transitive closure method spends 15 s to build the index and the optimization technique only needs 3 s, but the sampling-tree method needs 113 s. The index size of our method is much smaller than that of the sampling-tree method. Furthermore, our query performance is also better than that of the sampling-tree method. From Table 4, we know that bi-directional search is also very fast for LCR queries. However, when the constraint label size $|S|$ increases, the performance of the bi-directional search method degrades greatly, as evaluated in Table 8 in Exp4.

Table 4. Performance vs. $|V|$ in ER graphs ($d = 1.5$). TC: transitive closure method; ST: sampling-tree method; BS: bi-directional search.

| $\lvert V \rvert$ | TC IT (s) | TC IT-opt (s) | TC IS (KB) | TC QT (ms) | ST IT (s) | ST IS (KB) | ST QT (ms) | BS QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 1K | 15 | 3 | 415 | 0.01 | 113 | 13,275 | 0.05 | 0.01 |
| 2K | 23 | 5 | 1396 | 0.01 | 493 | 51,020 | 0.09 | 0.02 |
| 4K | 217 | 7 | 4920 | 0.02 | 3680 | 78,920 | 0.10 | 0.02 |
| 6K | 623 | 10 | 9000 | 0.03 | 5689 | 92,890 | 0.15 | 0.03 |
| 8K | 964 | 12 | 13,200 | 0.03 | 9290 | 100,890 | 0.18 | 0.04 |
| 10K | 5065 | 15 | 33,000 | 0.04 | 100,560 | 123,450 | 0.29 | 0.05 |

Exp2. varying density $|E|/|V|$ on ER graphs: In these experiments, we fix $|V| = 10{,}000$ and vary the density $|E|/|V|$ from 2 to 5 to study the performance of our method in dense graphs. From Table 5, we know that the index building time and index size increase when varying $|E|/|V|$ from 2 to 5 in both methods. Furthermore, the sampling-tree method cannot finish index building in 48 h when $|E|/|V| \geq 4$. From Table 5, we know that the transitive closure method has better scalability with regard to the graph density $|E|/|V|$ than the sampling-tree method. Actually, the two methods need to compute $M_G(u, -)$ (i.e., the single-source transitive closure). As proven in Theorem 4.1, our method has the minimal search space, but the search space in the sampling-tree method is not minimal. Thus, the large search space affects the scalability of the sampling-tree method. Furthermore, our query performance is also better than that of the sampling-tree method.

Another observation is that our optimized solution (Section 4.3) for computing the transitive closure in each strongly connected component (SCC) works very well, since IT-opt is much faster than IT in Table 5, especially in dense graphs. As we know, computing the transitive closure of graph $G$ involves two steps. The first step is to compute the local transitive closure in each SCC. The naive solution for this step is to iterate Algorithm 2 from each vertex in this SCC. However, according to the analysis in Section 4.3, we can speed up the process by sharing computation. In ER graphs, there are lots of SCCs, especially in high-degree graphs. Table 6 shows the statistics about the number of vertices that are in non-trivial SCCs (i.e., size $> 1$). We observe that most vertices are in non-trivial SCCs in ER graphs. Furthermore, the fraction grows with the average vertex degree in ER graphs. It means that computing the local transitive closure in each SCC covers a large proportion of the whole computation. Therefore, IT-opt works much better than IT in ER graphs.

Exp3. varying graph size ($|V|$) on SF graphs: Similar to Exp1, we study the performance of our method by varying $|V|$ from 1K to 10K in SF graphs. Table 7 shows that our method is also better than the sampling-tree method in all performance measures. Note that our method is much faster in SF graphs than in ER graphs. The reason is that most vertices in SF graphs have very small degrees, which reduces the search space in offline processing.

Table 7. Performance vs. $|V|$ in SF graphs ($d = 1.5$). TC: transitive closure method; ST: sampling-tree method; BS: bi-directional search.

| $\lvert V \rvert$ | TC IT (s) | TC IT-opt (s) | TC IS (KB) | TC QT (ms) | ST IT (s) | ST IS (KB) | ST QT (ms) | BS QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 1K | 0.1 | 0.1 | 43 | 0.01 | 0.6 | 1576 | 0.16 | 0.01 |
| 2K | 0.2 | 0.1 | 94 | 0.01 | 2.6 | 2583 | 0.17 | 0.02 |
| 4K | 0.2 | 0.1 | 140 | 0.01 | 9.5 | 4870 | 0.18 | 0.02 |
| 6K | 0.4 | 0.2 | 289 | 0.01 | 27.8 | 8854 | 0.20 | 0.03 |
| 8K | 0.4 | 0.2 | 390 | 0.01 | 45.5 | 11393 | 0.22 | 0.03 |
| 10K | 1.1 | 0.3 | 492 | 0.01 | 80.4 | 23931 | 0.24 | 0.03 |

Exp4. varying query constraint size ($|S|$) on ER and SF graphs: In ER graphs, we fix $|V| = 10{,}000$ and $|E| = 15{,}000$. In SF graphs, we fix $|V| = 10{,}000$ and the power-law distribution parameter $\gamma = 2.5$. We vary $|S|$ from $30\% \times |\Sigma| = 6$ to $80\% \times |\Sigma| = 16$. Note that the offline process does not depend on the label constraint size $|S|$. Thus, we only report the query response time in Table 8. The online performance of the transitive closure method is stable with respect to $|S|$. We also report the performance of the sampling-tree method in Table 8. From Table 8, we have two findings: (1) the transitive closure method is faster than the sampling-tree method in query response time; (2) the performance of the bi-directional search degrades significantly when $|S|$ increases. The reason is that the search space is very large when $|S|$ increases.

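Table 5. Performance vs. density in ER graphs. TC: transitive closure method; ST: sampling-tree method; BS: bi-directional search; F: index building did not finish.

| Degree $d$ | TC IT (s) | TC IT-opt (s) | TC IS (KB) | TC QT (ms) | ST IT (s) | ST IS (KB) | ST QT (ms) | BS QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 6890 | 20 | 26.3 | 0.08 | 123,890 | 95.6 | 0.31 | 0.05 |
| 3 | 11,112 | 23 | 102.3 | 0.09 | 378,563 | 232.7 | 0.53 | 0.09 |
| 4 | 25,347 | 35 | 160 | 0.12 | F | F | F | 0.15 |
| 5 | 33,169 | 80 | 186 | 0.23 | F | F | F | 0.18 |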

Table 6. The fraction of vertices in non-trivial SCCs.

| Degree | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 |
|---|---|---|---|---|---|---|---|---|
| ER10K (%) | 34.47 | 63.57 | 79.16 | 88.19 | 93.42 | 96.14 | 97.58 | 98.58 |
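The statistic in Table 6 can be computed for any directed graph with a standard SCC algorithm. The sketch below uses Kosaraju's two-pass algorithm in iterative form; it is an illustrative assumption, not the code used in the experiments, and the names `scc_sizes` and `nontrivial_fraction` are hypothetical. Vertices are assumed to be numbered 0 to n-1.

```python
from collections import defaultdict


def scc_sizes(n, edges):
    """Sizes of the strongly connected components of a directed graph on
    vertices 0..n-1, computed with Kosaraju's two-pass algorithm."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for s, t in edges:
        fwd[s].append(t)
        rev[t].append(s)

    # Pass 1: record vertices in order of depth-first completion.
    order, seen = [], [False] * n
    for root in range(n):
        if seen[root]:
            continue
        seen[root] = True
        stack = [(root, iter(fwd[root]))]
        while stack:
            v, it = stack[-1]
            nxt = next((w for w in it if not seen[w]), None)
            if nxt is None:
                order.append(v)
                stack.pop()
            else:
                seen[nxt] = True
                stack.append((nxt, iter(fwd[nxt])))

    # Pass 2: sweep the reversed graph in decreasing completion order;
    # each sweep collects exactly one component.
    sizes, assigned = [], [False] * n
    for root in reversed(order):
        if assigned[root]:
            continue
        assigned[root] = True
        size, stack = 0, [root]
        while stack:
            v = stack.pop()
            size += 1
            for w in rev[v]:
                if not assigned[w]:
                    assigned[w] = True
                    stack.append(w)
        sizes.append(size)
    return sizes


def nontrivial_fraction(n, edges):
    """Fraction of vertices that lie in an SCC with more than one vertex."""
    return sum(s for s in scc_sizes(n, edges) if s > 1) / n
```

For a hypothetical five-vertex graph with the cycle 0 → 1 → 2 → 0 and the tail 2 → 3 → 4, three of the five vertices lie in a non-trivial SCC, so the fraction is 0.6.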

Table 8. Performance vs. $|S|$ in ER and SF graphs ($|V| = 10$K; $d = 1.5$ for ER). All values are QT (ms).

| $\lvert S \rvert$ | Transitive closure, ER | Transitive closure, SF | Sampling-tree, ER | Sampling-tree, SF | Bi-directional search, ER | Bi-directional search, SF |
|---|---|---|---|---|---|---|
| 6 | 0.04 | 0.01 | 0.29 | 0.24 | 0.05 | 0.03 |
| 8 | 0.04 | 0.01 | 0.32 | 0.25 | 0.08 | 0.05 |
| 10 | 0.05 | 0.02 | 0.32 | 0.27 | 0.25 | 0.1 |
| 12 | 0.06 | 0.03 | 0.39 | 0.29 | 1.20 | 0.7 |
| 14 | 0.06 | 0.04 | 0.40 | 0.32 | 3.49 | 1.2 |
| 16 | 0.07 | 0.04 | 0.52 | 0.34 | 10.90 | 3.59 |
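To make the measured query concrete, the sketch below answers one LCR query by plain online traversal: it discards every edge whose label is outside $S$ and runs a breadth-first search on what remains. This is only a baseline that states the query semantics; it is neither the bi-directional search of [9] nor any of the index-based methods compared in Table 8. The function name and the edge-triple input are assumptions for illustration.

```python
from collections import defaultdict, deque


def lcr_query(edges, u1, u2, allowed):
    """True if u2 is reachable from u1 along edges whose labels all lie in
    `allowed`.

    edges: iterable of (source, target, label) triples.
    """
    allowed = set(allowed)
    adj = defaultdict(list)
    for s, t, label in edges:
        if label in allowed:  # filter once, before the search starts
            adj[s].append(t)
    queue, seen = deque([u1]), {u1}
    while queue:
        v = queue.popleft()
        if v == u2:
            return True
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```

A larger label set $S$ leaves more edges in the filtered graph, which is consistent with the growth of traversal-based query time as $|S|$ increases.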

Exp5. performance on real graphs: In this experiment, we evaluate the transitive closure method and the sampling-tree method on two small real graphs, Yeast and Small-Yago. Table 9 confirms that our method is much better than the sampling-tree approach in all performance measures. For example, the index building time of our method is only about 1/10 to 1/30 of that of the sampling-tree method. The index size of our method is also much smaller than that of the sampling-tree method. Furthermore, the bi-directional search method cannot work well when $|S|$ is large. We also report the online performance on real datasets in Table 10.

Table 9. Offline performance in real datasets. TC: transitive closure method; ST: sampling-tree method.

| Dataset | TC IT (s) | TC IT-opt (s) | TC IS (MB) | ST IT (s) | ST IS (MB) |
|---|---|---|---|---|---|
| Yeast | 60 | 10 | 3.18 | 877 | 151 |
| Small Yago | 32 | 6 | 1.58 | 945 | 90 |

Table 10. Online performance in real datasets. All values are QT (ms).

| $\lvert S \rvert$ / all labels | Transitive closure, Yeast | Transitive closure, Small Yago | Sampling-tree, Yeast | Sampling-tree, Small Yago | Bi-directional search, Yeast | Bi-directional search, Small Yago |
|---|---|---|---|---|---|---|
| 40% | 0.14 | 0.12 | 0.68 | 1.20 | 0.09 | 0.15 |
| 60% | 0.17 | 0.23 | 0.76 | 1.30 | 1.28 | 2.05 |
| 80% | 0.20 | 0.36 | 1.23 | 1.50 | 3.59 | 3.87 |
| 100% | 0.27 | 0.46 | 1.56 | 1.60 | 9.78 | 10.56 |

7.3. Performance of partition-based approach

Although the transitive closure method has good performance, it suffers from a high offline processing cost in a large graph. Therefore, we evaluate the performance of the partition-based approach in this section.

Setup. Given a large graph $G$, we partition $G$ into $k = \lceil |V|/|P| \rceil$ blocks. The index building time can be estimated as $IT(G) = \lceil |V|/|P| \rceil \times IT(P)$, where $IT(P)$ is the index building time for block $P$. According to the statistics in Tables 4 and 7, we can set $|P|$ to minimize the overall index building time, i.e., $IT(G)$. Therefore, we set $|P| = 2$K in ER graphs and $|P| = 8$K in SF graphs.

In order to obtain an optimal partition for query processing, we utilize the method in Section 5.2 to partition graph $G$ into $k$ blocks. Specifically, given a graph $G$, we randomly select $\Delta = |V(G)| \times 1\%$ vertices as seeds. Then, based on these seeds, we estimate the edge weight $w(e)$. Finally, we employ the METIS algorithm to partition $G$.

Exp6. varying graph size ($|V|$) on large ER and SF graphs: In this experiment, we fix the density $|E|/|V| = 1.5$ and label constraint size $|S| = 6$, and vary $|V|$ from 20K to 200K to study the performance. From Table 11, we know that the sampling-tree method cannot finish index building in a reasonable time ($< 48$ h) when $|V| > 20$K. Generally speaking, index building time and index size are linear in the graph size $|V|$ in the partition-based approach. A promising finding in Table 11 is that the query response time of the partition-based approach is less than 0.1 s. For example, given a graph with 200K vertices and 300K edges, the query response time of the partition-based approach is about 90 ms, as shown in Table 11. Although the bi-directional search method has good performance in Table 11, where $|S| = 6$, it cannot work well when $|S|$ increases, as shown in Table 13. We can observe similar results in large SF graphs in Table 12.

7.4. Performance of updates

Exp8. performance on updates: Table 14 lists the average time for inserting or deleting one edge in the four real datasets. Note that, in Large Yago and DBLP, we adopt the partition-based method; thus, we only need to update the local transitive closure of each partition. Table 14 shows that the average time to insert or delete one edge is less than 100 ms. Specifically, in order to evaluate the insertion performance, we randomly generate 100 edges to be inserted. Since there are four cases of inserted edges, each case has 1/4 probability to be generated in our update workload. We report the average insertion time in Table 14.

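Table 11. Performance vs. $|V|$ in ER graphs ($d = 1.5$). PB: partition-based method; ST: sampling-tree method; BS: bi-directional search; F: index building did not finish.

| $\lvert V \rvert$ | PB IT (s) | PB IT-opt (s) | PB IS (KB) | PB QT (ms) | ST IT (s) | ST IS (KB) | ST QT (ms) | BS QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 20K | 210 | 10 | 0.71 | 20.8 | 253,236 | 367.98 | 0.95 | 0.25 |
| 40K | 396 | 23 | 6.64 | 25.6 | F | F | F | 1.68 |
| 60K | 622 | 35 | 5.55 | 29.8 | F | F | F | 5.36 |
| 80K | 1018 | 60 | 6.14 | 32.1 | F | F | F | 16.89 |
| 100K | 1104 | 65 | 6.88 | 45.5 | F | F | F | 30.10 |
| 120K | 1281 | 72 | 7.81 | 51.6 | F | F | F | 40.68 |
| 140K | 1463 | 81 | 8.88 | 58.9 | F | F | F | 69.59 |
| 160K | 1686 | 85 | 9.83 | 65.4 | F | F | F | 106.80 |
| 180K | 1807 | 120 | 10.70 | 72.6 | F | F | F | 120.56 |
| 200K | 2233 | 178 | 11.80 | 89.9 | F | F | F | 160.58 |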
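The block-size choice for ER graphs can be retraced with the estimate $IT(G) = \lceil |V|/|P| \rceil \times IT(P)$ from Section 7.3. The sketch below is a worked calculation under two assumptions that the text does not state explicitly: that $IT(P)$ is read from the IT column (not IT-opt) of Table 4, and that every block costs as much as a standalone graph of the same size. The function names are hypothetical.

```python
import math

# Single-block index construction times IT(P) in seconds, copied from the
# IT column of the transitive closure method in Table 4 (ER, density 1.5).
IT_ER = {1000: 15, 2000: 23, 4000: 217, 6000: 623, 8000: 964, 10000: 5065}


def estimated_index_time(num_vertices, block_size, block_time):
    """IT(G) = ceil(|V| / |P|) * IT(P)."""
    return math.ceil(num_vertices / block_size) * block_time


def best_block_size(num_vertices, timings):
    """Measured block size with the smallest estimated IT(G)."""
    return min(
        timings,
        key=lambda p: estimated_index_time(num_vertices, p, timings[p]),
    )
```

Under these assumptions, a 200K-vertex ER graph gives an estimate of 100 × 23 s = 2300 s for $|P| = 2$K, which is the smallest among the measured block sizes and agrees with the setting $|P| = 2$K used for ER graphs.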

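Table 12. Performance vs. $|V|$ in SF graphs. PB: partition-based method; ST: sampling-tree method; BS: bi-directional search; F: index building did not finish.

| $\lvert V \rvert$ | PB IT (s) | PB IT-opt (s) | PB IS (KB) | PB QT (ms) | ST IT (s) | ST IS (KB) | ST QT (ms) | BS QT (ms) |
|---|---|---|---|---|---|---|---|---|
| 20K | 3 | 2 | 1.61 | 1.95 | 123.90 | 48.73 | 0.25 | 0.16 |
| 40K | 10 | 7 | 2.15 | 2.89 | 356.80 | 120.50 | 0.31 | 0.96 |
| 60K | 21 | 10 | 3.25 | 3.20 | 835.70 | 378.90 | 0.35 | 2.68 |
| 80K | 50 | 30 | 4.36 | 3.58 | F | F | F | 4.89 |
| 100K | 66 | 41 | 5.47 | 4.84 | F | F | F | 8.45 |
| 120K | 81 | 52 | 6.87 | 5.80 | F | F | F | 10.59 |
| 140K | 171 | 63 | 8.07 | 6.50 | F | F | F | 25.56 |
| 160K | 187 | 75 | 9.27 | 9.56 | F | F | F | 30.43 |
| 180K | 211 | 81 | 10.5 | 10.89 | F | F | F | 40.56 |
| 200K | 233 | 109 | 16.8 | 12.96 | F | F | F | 50.98 |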

We use a similar setting for deletion. We randomly delete 100 edges. Half of them are from the first case and the others are from the second case.

Table 13. Performance vs. $|S|$ in large ER and SF graphs ($|V| = 200$K; $d = 1.5$ for ER). All values are QT (ms).

| $\lvert S \rvert$ | Partition-based, ER | Partition-based, SF | Bi-directional search, ER | Bi-directional search, SF |
|---|---|---|---|---|
| 6 | 89.9 | 12.96 | 160.58 | 50.98 |
| 8 | 115.2 | 15.58 | 197.32 | 85.60 |
| 10 | 125.8 | 18.90 | 235.35 | 156.89 |
| 12 | 153.6 | 22.39 | 302.60 | 198.9 |
| 14 | 180.9 | 25.60 | 430.90 | 285.3 |
| 16 | 210.3 | 30.56 | 560.68 | 300.56 |

Table 14. Evaluating index maintenance.

| Data set | Insertion (ms) | Deletion (ms) |
|---|---|---|
| Yeast | 35 | 12 |
| Small Yago | 56 | 37 |
| Large Yago | 63 | 83 |
| DBLP | 51 | 65 |

8. Conclusions

In this paper, we address label-constraint reachability (LCR) queries over large graphs. Theoretically, we propose several methods to optimize the computation of the path-label transitive closure. In order to address the scalability issue, we propose a partition-based approach built on graph partitioning. We prove that finding the optimal partition is NP-hard. Thus, we propose a sampling-based solution to find a good partition to speed up LCR queries. Last but not least, extensive experiments on both real and synthetic datasets confirm that our methods are faster than the existing solution by orders of magnitude in both offline and online processing.

Acknowledgments

Lei Zou's work was supported by NSFC under Grant 61370055. Dongyan Zhao was supported by NSFC under Grant 61272344 and China 863 Project under Grant no. 2012AA011101. Jeffrey Xu Yu's work was supported by Research Grants Council of the Hong Kong SAR, China under Grant no. 418512. Lei Chen's work was supported in part by the Hong Kong RGC GRF 611411, National Grand Fundamental Research 973 Program of China under Grant 2012-CB316200, Microsoft Research Asia Grant, Huawei Noah's Ark Lab project HWLB06-15C03212/13PN and Google Faculty Award 2013. Yanghua Xiao was supported by NSFC (No. 61003001, 61170006, 61171132, 61033010); Specialized Research Fund for the Doctoral Program of Higher Education No. 20100071120032; Shanghai Municipal Science and Technology Commission with Funding No. 13511505302; NSF of Jiangsu Province (No. BK2010280).

References

[1] S. Abiteboul, V. Vianu, Regular path queries with constraints, J. Comput. Syst. Sci. 58 (3) (1999).
[2] R. Agrawal, A. Borgida, H.V. Jagadish, Efficient management of transitive relationships in large data and knowledge bases, in: SIGMOD Conference, 1989.
[3] K. Anyanwu, A.P. Sheth, ρ-queries: enabling querying for semantic associations on the semantic web, in: WWW, 2003.
[4] R. Bramandia, B. Choi, W.K. Ng, Incremental maintenance of 2-hop labeling of large graphs, IEEE Trans. Knowl. Data Eng. 22 (5) (2010).
[5] Y. Chen, Y. Chen, An efficient algorithm for answering graph reachability queries, in: ICDE, 2008, pp. 893–902.
[6] J. Cheng, J.X. Yu, On-line exact shortest distance query processing, in: EDBT, 2009, pp. 481–492.
[7] J. Cheng, J.X. Yu, X. Lin, H. Wang, P.S. Yu, Fast computing reachability labelings for large graphs with high compression rate, in: EDBT, 2008.
[8] E. Cohen, E. Halperin, H. Kaplan, U. Zwick, Reachability and distance queries via 2-hop labels, SIAM J. Comput. 32 (5) (2003).
[9] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, Adding regular expressions to graph reachability and pattern queries, in: ICDE, 2011.
[10] H.V. Jagadish, A compression technique to materialize transitive closure, ACM Trans. Database Syst. 15 (4) (1990).
[11] R. Jin, H. Hong, H. Wang, N. Ruan, Y. Xiang, Computing label-constraint reachability in graph databases, in: SIGMOD, 2010.
[12] R. Jin, N. Ruan, S. Dey, J.X. Yu, Scarab: scaling reachability computation on large graphs, in: SIGMOD Conference, 2012, pp. 169–180.
[13] R. Jin, Y. Xiang, N. Ruan, D. Fuhry, 3-hop: a high-compression indexing scheme for reachability query, in: SIGMOD, 2009.
[14] R. Jin, Y. Xiang, N. Ruan, H. Wang, Efficiently answering reachability queries on very large directed graphs, in: SIGMOD, 2008, pp. 595–608.
[15] G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, J. Parallel Distrib. Comput. 48 (1) (1998).
[16] S. Lu, F. Zhang, J. Chen, S.-H. Sze, Finding pathway structures in protein interaction networks, Algorithmica 48 (4) (2007).
[17] J.P. McGlothlin, L.R. Khan, Rdfkb: efficient support for RDF inference queries and knowledge management, in: IDEAS, 2009.
[18] M. Rice, V.J. Tsotras, Graph indexing of road networks for shortest path queries with label restrictions, PVLDB 4 (2) (2010) 69–80.
[19] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Rev. Mod. Phys. 74 (2002) 47–97.
[20] S. Trißl, U. Leser, Fast and practical indexing and querying of very large graphs, in: SIGMOD, 2007.
[21] H. Wang, H. He, J. Yang, P.S. Yu, J.X. Yu, Dual labeling: answering graph reachability queries in constant time, in: ICDE, 2006.
[22] J.X. Yu, Graph Reachability Queries: A Survey (Book Chapter), Kluwer Academic Publishers, Boston, Dordrecht, London, 2010.
[23] J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen, Statsnowball: a statistical approach to extracting entity relationships, in: WWW, 2009.