Accelerating Parallel Evaluation of Regular Path Queries on Large Graphs by Estimating Joining Cost of Subqueries

Accelerating Parallel Evaluation of Regular Path Queries on Large Graphs by Estimating Joining Cost of Subqueries Van-Quyet Nguyen Van-Hau Nguyen Huy-The Vu Hung Yen University of Technology Hung Yen University of Technology Hung Yen University of Technology and Education and Education and Education Hung Yen, Vietnam Hung Yen, Vietnam Hung Yen, Vietnam [email protected] [email protected] [email protected] Minh-Quy Nguyen Quyet-Thang Huynh Kyungbaek Kim Hung Yen University of Technology Hanoi University of Science and Chonnam National University and Education Technology Gwangju, South Korea Hung Yen, Vietnam Ha Noi, Vietnam [email protected] [email protected] [email protected] ABSTRACT performance guarantee on computational cost due to the large size Using Regular Path Queries (RPQs) is a common way to explore of graphs and/or highly complex queries. The need for developing patterns in graph databases. Traditional automata-based approaches an efficient approach with low computational cost is particularly for evaluating RPQs on large graphs are restricted in the graph size evident for evaluating RPQs, which are most commonly used in and/or highly complex queries, which causes a high evaluation cost. practice [17]. Recently, the threshold rare label based approach applied on large A well-known approach for evaluating RPQs is to exploiting graphs has been proved to be effective. Nevertheless, using rare automata [9]. However, this approach has a disadvantage when labels in a graph provides only coarse information which could not the applied graph is large: the long response time is triggered by always guarantee the minimum searching cost. Hence, the Unit- the mapping of the automaton states onto the graph. To deal with Subquery Cost Matrix (USCM) based approach has been proposed to this issue, optimization techniques have been studied to reduce the reduce the parallel evaluation cost by estimating the searching cost evaluation cost of RPQs. Rewriting RPQs is introduced as the first of RPQs. However, the previous approach does not take the joining technique, in which, the original expression of an RPQ is trans- cost among subqueries into account. In this paper, the method formed into another one to reduce the search space by avoiding the of estimating joining cost of subqueries is proposed in order to whole graph traversal [8], [5], [6]. Nevertheless, there is a limitation accelerate the USCM based parallel evaluation of RPQs. Specifically, when the RPQs are highly complex (e.g., an RPQ with a modifier the proposed method is realized by estimating the result size of operator * over a group of alternate label). the subqueries. Through our experiments upon real-world datasets, Recently, a threshold rare label based approach has been proved it is depicted that estimating joining cost enhances USCM based to be efficient to evaluate RPQs on large graphs [10]. The basic approach up to around 20% in terms of response time. idea of this method is that the original query is split into multiple subqueries at rare labels, which are used as fixed points for graph CCS CONCEPTS searching. However, the threshold rare label based approach has some limitations since it relies on the presence of rare labels, their • Computing methodologies ! Search methodologies; Paral- positions on the queries, and their quantity in the graphs. Our lel computing methodologies. earlier work used Unit-Subqueries Cost Matrix (USCM) to estimate KEYWORDS the searching cost of RPQs and obtain the viability of the usage of subqueries in RPQs evaluation [16]. However, this method does not Graph Queries, USCM, Parallel Evaluation, Estimating Joining Cost take the joining cost among subqueries into account. This neglect could increase the response time when evaluating RPQs on a large 1 INTRODUCTION graph. Regular path query (RPQ) is a common class of queries providing In this paper, we propose a method of estimating the joining cost a way of finding connections and patterns in a graph database of subqueries in order to accelerate the USCM based parallel evalu- [11]. There have been a number of approaches for evaluating RPQs ation of RPQs (called USCM-Join). We first propose cost functions [3], [20]. However, only a few of these approaches provide the and algorithms to estimate the result size of a given RPQ, which is directly related to the joining cost of subqueries. We then present Permission to make digital or hard copies of part or all of this work for personal or how to improve the evaluation performance of RPQs by splitting classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation them with a combination of the estimated joining and searching on the first page. Copyrights for third-party components of this work must be honored. cost. Finally, we conduct extensive experiments on both real-world For all other uses, contact the owner/author(s). graphs and synthetic graphs, and the experimental results show SMA 2020, September 17-19, 2020, Jeju, Republic of Korea, © 2020 Copyright held by the owner/author(s). SMA 2020, September 17-19, 2020, Jeju, Republic of Korea, Van-Quyet et al. that our USCM-Join approach outperforms the original one and these works, there is no one that issues any cost estimating func- other approaches. tions relying on RPQ operators as well as connectivity of labels The rest of this paper is organized as follows. We introduce an in the query and the graph. In this work, we propose an efficient overview of related work in Section 2. We present some terms, method for evaluating an RPQ by splitting it into multiple smaller definitions, and splitting RPQs for parallel evaluation using USCM subqueries based on estimation of their searching cost and joining in Section 3. Section 4 describes our proposal for estimating the cost. result size of RPQs. Section 5 describes how to evaluate an RPQ in a parallel fashion by using the estimated evaluation cost. We 3 PRELIMINARIES conduct experiments to evaluate our method in Section 6. Section 7 concludes with a summary of our proposal. 3.1 Graph Data and Regular Path Queries Graph Data. We consider an edge-labeled directed graph 퐺 = (+ , 퐸, Σ), where + is a set of nodes, Σ is a set of labels, and 퐸 ⊆ + × Σ ×+ is a set of edges. In which, an edge (E, 0, D) indicates edge direction 2 RELATED WORK from node E to node D labeled with 0 2 Σ. There have been a lot of studies done on RPQs evaluation as well as Regular Path Queries. An RPQ & ¹'º is a regular expression ' providing query languages on graph data [3], [10], [20]. A common over some labels in Σ. Here, ' is defined in formally by ' = n j 0 j way to evaluate an RPQ is using the automata-based approach. ' ◦ ' j ' [ ' j '∗, where n is an empty value; a is a label in Σ; ' ◦ ', This approach converts the graph to a DFA (Deterministic Finite ' [ ', and '∗ denote concatenation, alternation, and Kleene Star, Automaton), and the expression of an RPQ can be translated into respectively. an automaton, then computes the cross-product of the automaton Let us categorize regular expression ' into four types of RPQ as to find the answer [9]. However, the limitation of this approach is the following: that every state in automaton needs to be mapped onto the graph, which causes substantial memory space consumption and long • Concatenation RPQ: ' = 0001...0= • Alternation RPQ: ' = 00...08−1 ¹08 j08¸1º08¸2...0= response time. To address this problem, a number of studies has ∗ • Kleene Star RPQ: ' = 0001...08−108 08¸1...0= been proposed with optimization techniques to reduce the cost of ∗ RPQs evaluation. • Highly Complex RPQ: ' = 0001...08−1 ¹08 j08¸1º 08¸2...0= A strategy for reducing the RPQs evaluation cost is to optimize where 08 2 Σ, 0 ≤ 8 ≤ =. For clarity of presentation, we use the the RPQs by rewriting them into other ones [8], [5]. Fernandez Mary symbol j for alternation operator and drop the symbol ◦ in terms et al. [8] presented two optimization techniques based on graph and equations, but keep them in examples. schemas. Calvanese et al. [5] proposed a view-based query rewrit- To answer an RPQ, & ¹'º, we need to search all paths in the ing approach for evaluating RPQs in semi-structured data which graph 퐺 which satisfy a given regular expression '. Here, a path d guarantees the new ones contain all the answers of the original between node E0 and node E: in 퐺 is a sequence ones. However, the query rewriting techniques still have some lim- 00 01 0:−1 itations deal with highly complex RPQs, such as the nested queries d = E0 −−! E1 −−! E2...E:−1 −−−−! E: with modifier recursion, which leads to state explosion after con- verting the rewritten query to a DFA for graph searching. Therefore, such that each (E8, 08,E8¸1), for 0 ≤ 8 < :, is an edge. The sequence ∗ several techniques have also been proposed for estimating query of labels of a path d, denoted as L(d), is the string 0001...0:−1 2 Σ , ∗ size [13] or minimizing DFAs [2], [12]. where Σ is a set of all possible strings over the set of labels Σ. The !¹'º Recently, a threshold rare label based approach has been proved answer of Q(R) is a set of paths in the form & ¹'º = E −! D, where effectively to reduce the search space of RPQ evaluation onlarge E,D 2 + , and !¹'º ⊆ Σ∗ is a regular language. Thus, a path d is an graphs [10]. The authors employ a cost-based technique to deter- answer path of & ¹'º iff L(d) 2 L(R).

Load more