Efficient Processing of Label-Constraint Reachability Queries in Large Graphs

Information Systems 40 (2014) 47–66
Journal homepage: www.elsevier.com/locate/infosys

Lei Zou a (corresponding author), Kun Xu a, Jeffrey Xu Yu b, Lei Chen c, Yanghua Xiao d, Dongyan Zhao a
a Peking University, No.5 Yiheyuan Road Haidian District, Beijing, China
b The Chinese University of Hong Kong, Shatin, NT, Hong Kong
c Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
d Fudan University, Shanghai, China

Article history: Received 22 November 2012. Received in revised form 13 June 2013. Accepted 1 October 2013. Recommended by: Xifeng Yan. Available online 18 October 2013.
Keywords: Graph database, Reachability query

Abstract

In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we check the existence of a directed path from u1 to u2, where edge labels along the path are a subset of S. We propose the path-label transitive closure method to answer LCR queries. Specifically, we transform an edge-labeled directed graph into an augmented DAG by replacing the maximal strongly connected components with bipartite graphs. We also propose a Dijkstra-like algorithm to compute path-label transitive closure by re-defining the “distance” of a path. Compared with the existing solutions, we prove that our method is optimal in terms of the search space. Furthermore, we propose a simple yet effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal graph partition to minimize query processing cost is an NP-hard problem. Therefore, we propose a sampling-based solution to find a sub-optimal partition.
Moreover, we address the index maintenance issues to answer LCR queries over dynamic graphs. Extensive experiments confirm the superiority of our method.
© 2013 Elsevier Ltd. All rights reserved.

Corresponding author at: Institute of Computer Science and Technology, Peking University, No.5 Yiheyuan Road Haidian District, Beijing 100871, China. Tel.: +86 10 82529643. E-mail addresses: [email protected], [email protected] (L. Zou).
0306-4379/$ - see front matter. http://dx.doi.org/10.1016/j.is.2013.10.003

1. Introduction

The growing popularity of graph databases has generated many interesting data management problems. One important type of query over graphs is the reachability query [8,10,13,14,20,21]. Specifically, given two vertices u1 and u2 in a directed graph G, we want to verify whether there exists a directed path (see footnote 1) from u1 to u2. There are many applications of reachability queries, such as pathway finding in biological networks [16], inferring over RDF (resource description framework) graphs [17], and relationship discovery in social networks [23].

Footnote 1: In this paper, all “paths” refer to “simple paths” unless otherwise specified.

There are two extreme solutions to answer reachability queries. One approach is to materialize the transitive closure of a graph, enabling one to answer reachability queries efficiently. At the other extreme, we can perform DFS (depth-first search) or BFS (breadth-first search) over graph G on the fly to answer reachability queries. Obviously, these two methods cannot work in a large graph G, since the former needs O(|V|^2) space to store the transitive closure (large index space cost), and the latter needs O(|V|) time in answering reachability queries (slow query response time), where V is the set of vertices in G. The key issue in reachability queries is how to find a good trade-off between the two extreme solutions. Therefore, many algorithms have been proposed, such as 2-hop [8,7,4], GRIPP [20], path-cover [10], tree-cover [20,21], path-tree [14] and 3-hop [13].

In many real applications, edge labels are utilized to denote different relationships between two vertices. For example, edge labels in RDF graphs denote different properties. We can also use edge labels to define different relationships in social networks. In this paper, we study a variant of reachability queries, called Label-Constraint Reachability (LCR) queries, which were originally proposed in [11]. Specifically, given two vertices u and v and a label set S, an LCR query checks whether there exists a directed path from u to v, where the edge labels along the path are a subset of S. Here, we give some motivating examples to demonstrate the usefulness of LCR queries.

We model a social network as a graph G, in which each vertex in G denotes an individual and an edge indicates the association between two users. Edge labels denote the relationship types, such as isFriendOf, isColleagueOf, isRelativeOf, isSchoolmateOf, isCoauthorOf, isAdvisorOf and so on. In some social network analysis tasks, we are only interested in finding some specified relationships between two individuals. For example, we want to see whether two suspects are remote relatives (in a terrorist network G) by checking the existence of a path between two corresponding vertices, where the edge labels along the path are all “isRelativeOf”.

LCR queries are also useful for understanding how metabolic chain reactions take place in metabolic networks. A metabolic network can also be modeled as a graph G, where each vertex corresponds to a chemical compound and each directed edge indicates a chemical reaction from one compound to another. Enzymes catalyze these reactions. Thus, we can use edge labels to denote different enzymes. A metabolic pathway involves the step-by-step modification of an initial molecule to form another product. From the perspective of graph theory, a pathway is a directed path from the initial node in the metabolic network to the target. The common query is as follows: considering the availability of a set of enzymes, is there a pathway from one compound to another one? Obviously, this is an LCR query over a metabolic network.

As demonstrated above, LCR queries are quite useful; however, it is non-trivial to answer LCR queries over a large directed graph. Traditional reachability queries do not consider edge labels along the path [11]. For example, vertex 1 can reach vertex 4 in graph G (in Fig. 1). However, if the constraint label set S is {b, c}, the LCR query answer is NO, since we cannot find a path from 1 to 4, where all edge labels along the path are a subset of S. Generally speaking, existing reachability indexes are compact data structures of the transitive closure. As traditional reachability queries do not consider edge labels, in order to compute the transitive closure, we only need to consider a single directed path from one vertex to another (if any). However, in order to compute the transitive closure for LCR queries, we have to consider all possible paths between two vertices, because different paths may have different edge labels. Therefore, it is much more complicated to compute the transitive closure for LCR queries. Existing index techniques for traditional reachability queries are not applicable to LCR queries either. For example, in order to answer reachability queries, we always transform a directed graph G into a directed acyclic graph (DAG) by coalescing each strongly connected component (in G) into a single vertex. However, this method cannot work in LCR queries, since each strongly connected component has different edge labels.

In order to address LCR queries efficiently, we make the following contributions in this work:

(1) Given an edge-labeled directed graph G, we find all maximal strongly connected components in G and replace them by bipartite graphs. Then, the directed graph G is transformed into an augmented DAG with labels. Based on the augmented DAG, we propose a method to compute the path-label transitive closure (Definition 3.8), from which LCR queries can be answered directly.
(2) We re-define the “distance” of a path as the number of distinct edge labels along the path, and also propose a Dijkstra-like algorithm to compute a single-source path-label transitive closure (Definition 3.8). We prove that our algorithm is optimal in terms of search space.
(3) In order to speed up query processing over large graphs, we propose an effective partition-based framework (local path-label transitive closure + online traversal) to answer LCR queries in large graphs. We prove that finding the optimal partition in terms of minimizing the number of traversal steps is NP-hard. Based on the complexity analysis, we design a sampling-based solution to find a sub-optimal partition.
(4) In order to handle graph updates, we propose an efficient index maintenance algorithm.
(5) Last but not least, extensive experiments confirm that our method is faster than the existing ones by orders of magnitude. For example, given a random network satisfying the ER model with 100K vertices and 150K edges, the method in [11] consumes 277 h for index building. Given the same graph, our method only needs 0.5 h for index building. Furthermore, our method can work well in a very large RDF graph (Yago dataset) with more than 2 million vertices, 6 million edges and 97 edge labels.

The rest of this paper is organized as follows. The related work is discussed in Section 2. We formally define the problem and discuss existing solutions in Section 3. Then, we propose several novel techniques for computing path-label transitive closures in Section 4. The partition-based solution is discussed in Section 5. We also discuss how to handle dynamic graphs in Section 6.
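
To make the LCR query definition from the introduction concrete, the following is a minimal sketch of the online-traversal extreme (BFS on the fly), restricted to edges whose label belongs to the constraint set S. It is not the path-label transitive closure method proposed in this paper. The function name lcr_reachable, the (source, label, target) edge-triple representation and the toy graph are assumptions made for illustration only, and the toy graph is not the graph of Fig. 1.

```python
from collections import deque


def lcr_reachable(edges, source, target, allowed_labels):
    """Answer an LCR query by BFS over edges whose label is in allowed_labels.

    edges is an iterable of (u, label, v) triples (a hypothetical format).
    Returns True if some directed path from source to target uses only
    labels from allowed_labels, and False otherwise.
    """
    allowed = set(allowed_labels)

    # Keep only the edges permitted by the label constraint.
    adjacency = {}
    for u, label, v in edges:
        if label in allowed:
            adjacency.setdefault(u, []).append(v)

    # Plain BFS; the visited set also guards against cycles.
    visited = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adjacency.get(u, []):
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return False


# Hypothetical toy graph, not taken from the paper.
toy_edges = [(1, "a", 2), (2, "b", 4), (1, "b", 3), (3, "c", 1)]

print(lcr_reachable(toy_edges, 1, 4, {"a", "b"}))  # path 1 -a-> 2 -b-> 4 exists
print(lcr_reachable(toy_edges, 1, 4, {"b", "c"}))  # no path using only b and c
```

In this toy graph, vertex 1 reaches vertex 4 when labels are ignored, but not under S = {b, c}, which mirrors the kind of situation described for Fig. 1. Each query costs a full traversal in the worst case, which is the slow-response extreme that an index is meant to avoid.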
