Efficiently Answering Regular Simple Path Queries on Large Labeled

Efficiently Answering Regular Simple Path Queries on Large Labeled Networks Sarisht Wadhwa, Anagh Prasad, Sayan Ranu, Amitabha Bagchi, Srikanta Bedathur IIT Delhi, Dept. of Computer Science and Engineering, New Delhi, India [sarishtwadhwa.1996,anaghprasad95]@gmail.com,[sayanranu,bagchi,srikanta]@cse.iitd.ac.in ABSTRACT 1 INTRODUCTION A fundamental query in labeled graphs is to determine if A fundamental query in labeled graphs is to determine if there there exists a path between a given source and target vertices, exists a path from a given source vertex to a specified target such that the path satisfies a given label constraint. One of vertex, such that the labels encountered along the path form a the powerful forms of specifying label constraints is through string that is from a given regular language. Such queries are regular expressions, and the resulting problem of reachability termed as Regular Path Queries (RPQs) and are often spec- queries under regular simple paths (RSP) form the core of ified with a starting node, a destination node anda regular many practical graph query languages such as SPARQL from expression over the node and edge labels. For example, in a W3C, Cypher of Neo4J, Oracle’s PGQL and LDBC’s G-CORE. PPI, does there exist a pathway from protein P1 to protein P2 Despite its importance, since it is known that answering that proceeds only through either cleavage or covalent bond- RSP queries is NP-Hard, there are no scalable and practical ing interactions? Does there exist a cascade of interactions solutions for answering reachability with full-range of regu- from user U on Twitter to user V such that all intermediate lar expressions as constraints. In this paper, we circumvent nodes are females of age between 20 and 30? Regular path this computational bottleneck by designing a random-walk queries have been studied under different possible semantics: based sampling algorithm called ARRIVAL, which is backed arbitrary, shortest and simple [13]. In this paper, we will focus by theoretical guarantees on its expected quality. Extensive on the evaluation of RPQs under simple path semantics, or experiments on billion-sized real graph datasets with thou- Regular Simple Path Queries (RSPQs), which require that in sands of labels show that ARRIVAL to be 100 times faster a valid path no node is visited more than once. RPQs form than baseline strategies with an average accuracy of 95%. a proper superset of “label-constrained reachability” (LCR) queries (e.g., [11, 21, 23]) wherein the path from source to KEYWORDS destination must use labels belonging to a specified subset of the graph’s label set. reachability query, regular path query, random walks, knowl- Owing to their edge graphs, regular expression Limitations of existing techniques: wide applications, RPQs have been extensively studied [2, ACM Reference Format: 4, 6, 7, 12, 13, 16]. Nonetheless, several challenges, including Sarisht Wadhwa, Anagh Prasad, Sayan Ranu, Amitabha Bagchi, scalability to large networks, remain open. Srikanta Bedathur. 2019. Efficiently Answering Regular Simple 1. They handle a very small subclass of regular expres- Path Queries on Large Labeled Networks. In 2019 International sions. Existing techniques for RSP queries do not support Conference on Management of Data (SIGMOD ’19), June 30-July 5, most regular expressions as label constraints [10, 11, 21, 23]. 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages. Most existing techniques, including the most recent one by https://doi.org/10.1145/3299869.3319882 Valstar et al. [21], take a set of labels L = fa1; ··· ; ak g as input, and every edge (or node) in the path defining the reachability must contain a label from this set L. This constraint reduces Permission to make digital or hard copies of all or part of this work for ∗ personal or classroom use is granted without fee provided that copies are to the regular expression ¹a1j · · · jak º . Such a formulation sig- not made or distributed for profit or commercial advantage and that copies nificantly reduces the expressive power of the user in posing bear this notice and the full citation on the first page. Copyrights for com- effective queries. An analysis of SPARQL query logs on Wiki- ponents of this work owned by others than the author(s) must be honored. data17 knowledge graph reveals that ≈ 35% of the RSP queries Abstracting with credit is permitted. To copy otherwise, or republish, to cannot be expressed through this restricted form [5]. Fan et post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. al. [8] propose a technique that is slightly more generic and SIGMOD ’19, June 30-July 5, 2019, Amsterdam, Netherlands supports a subclass of regular expressions, which are compu- © 2019 Copyright held by the owner/author(s). Publication rights licensed tationally easier to solve. Fletcher et al.[10] replaces Kleene to the Association for Computing Machinery. closure with paths of bounded length recursion. Hence, they ACM ISBN 978-1-4503-5643-5/19/06...$15.00 operate on a restricted set. https://doi.org/10.1145/3299869.3319882 Age: 19 Age: 43 Table 1: Limitations of existing techniques. Gender: Male Gender: Male Algorithm Regular Non- Query- Dynamic v2 v3 Expressions exponential time label Networks growth with definition Age: 26 v7 Gender: Female label space Age: 27 v1 Valstar et al. [21] Only LCR x x x v4 v5 Gender: Male Jin et al. [11] Only LCR x x x Age: 25 Zou et al. [23] Only LCR x x X Age: 32 Gender: Female Gender: Female Fan et al. [8] X (partially) x x x Fletcher et al. [10] X (partially) X x x v6 Koschmieder et al. [12] XX x x Age: 17 ARRIVAL XXXX Gender: Male Figure 1: A sample graph dataset. For simplicity, we assume that labels are attached only with nodes. • We develop an index-free, sampling-based algorithm called 2. Costs grow exponentially with number of labels. Ex- ARRIVAL, which is remarkably simple in its approach, and isting techniques to answer LCR queries [8, 11, 21, 23] (which yet backed by theoretical analysis. are a proper subset of RSP queries) employ index structures • We perform extensive experiments on several real graphs whose memory and computation costs grow exponentially containing millions of nodes, billions of edges and thou- with the number of labels. Consequently, they are unable to sands of unique labels. Our empirical evaluation establishes scale to networks containing more than a handful of labels. that ARRIVAL is up to 100 times faster than BBFS. Further- In the real world, this assumption is not true. For example, more, ARRIVAL guarantees a precision of 1, while the recall in a social network, if each node is tagged with the resident is 95% on average. (Sec: 5). country of the corresponding user, the number of unique labels would run into hundreds. 2 PROBLEM FORMULATION 3. Do not support query-time label definition. In some scenarios, the constraints are defined as a function over node There are four pieces of input to the RSPQ problem: the graph, or edge labels at query time. For example, recall the previous the source and destination nodes, and the label constraint. query where one wishes to check if there exists a cascade of Definition 1 (Graph). A multi-labeled graph (directed interactions from user U on Twitter to user V such that all in- or undirected) is a triple G = ¹V; E; Lº, where V is the set of termediate nodes are females of age between 20 and 30. Here, nodes, E ⊆ V ×V is the set of edges, and L is a finite non-empty the constraint is not defined based on the presence or absence set of labels over nodes and edges in the graph. A node or an of an existing label, but as a function that is specified at query edge may be labeled with zero or more labels from the set L. time, whose output is like a new label for that node. Existing We do not assume the graph to adhere to any structural techniques do not support query-time label definitions since property such as acyclicity or strong connectedness. However, they assume that the label set space is known. certain quality guarantees may be provided only when the 4. Unable to handle dynamic networks. Many networks graph satisfies certain conditions. We explicitly mention those today are highly dynamic in nature. For example, new nodes conditions in our proofs. are added to the Wikidata17 knowledge base every day and We use the notation l¹vº and l¹eº to denote the set of labels existing nodes get updated by creating new links among them. present in node v and edge e respectively. Several index-based techniques for LCR reachability [8, 11, 21] assume a static network. This limits their applicability to Example 1. A sample multi-labeled graph is shown in dynamic networks. Fig. 1. In this graph, every node is characterized by two features: Proposed solution: Table 1 summarizes the limitations of the age and the gender. The label of a node is formed by concate- existing techniques, most of which result due to relying on an nating the feature name and feature value. For example, v1’s index structure. This property forces us to ask the following labels are “Age=26” and “Gender=Female”. We use the notation question: Is it possible to develop an index-free, near-optimal al- v:Aдe andv:Gender to note the value of the “Age” and “Gender” gorithm with linear storage and time complexity? In this paper, features of v. For example, v1:Aдe = 26.

Load more