
Fast Computation of Graph Edit Distance

Xiaoyang Chen†, Hongwei Huo∗†, Jun Huan‡, Jeffrey Scott Vitter§
† Dept. of Computer Science, Xidian University, [email protected], [email protected]
‡ Dept. of Electrical Engineering and Computer Science, The University of Kansas, [email protected]
§ Dept. of Computer and Information Science, The University of Mississippi, [email protected]

Abstract—The graph edit distance (GED) is a well-established nodes in the search , existing methods can be divided into distance measure widely used in many applications. However, two broad categories: vertex-based and edge-based mapping existing methods for the GED computation suffer from several methods. When generating successors of a node, the former drawbacks including oversized search space, huge memory consumption, and lots of expensive backtracking. In this paper, extends unmapped vertices of comparing graphs, while the ? we present BSS GED, a novel vertex-based mapping method for later extends unmapped edges. A -GED [5], [7] and DF- the GED computation. First, we create a small search space GED [16] are two major vertex-based mapping methods. A?- by reducing the number of invalid and redundant mappings GED adopts the best-first search paradigm A? [6], which picks involved in the GED computation. Then, we utilize beam-stack up a partial mapping with the minimum induced edit cost to search combined with two heuristics to efficiently compute GED, achieving a flexible trade-off between available memory and extend each time. The first found complete mapping induces expensive backtracking. Extensive experiments demonstrate that the GED of comparing graphs. However, DF-GED carries out BSS GED is highly efficient for the GED computation on sparse a depth-first search, which quickly reaches a leaf node. The as well as dense graphs and outperforms the state-of-the-art edit cost of a leaf node in fact is an upper bound of GED GED methods. In addition, we also apply BSS GED to the graph and hence can be used to prune nodes later to accelerate the similarity search problem and the practical results confirm its efficiency. GED computation. Different from the above two methods, CSI GED [4] is a novel edge-based mapping method based I.INTRODUCTION on common substructure isomorphism, which works well for Graphs are widely used to model various complex structured the sparse and distant graphs. Similar to DF-GED, CSI GED data, including social networks, molecular structures, etc. Due also adopts the depth-first search paradigm. to extensive applications of graph models, there has been a Even though existing methods have achieved promising considerable effort in developing techniques for effective graph preliminary results, they still suffer from several drawbacks. data management and analysis, such as graph matching [3] and Both A?-GED and DF-GED enumerate all possible mappings graph similarity search [14], [17], [19]. between two graphs. However, among these mappings, some Among these studies, similarity computation between two mappings must not be optimal, called invalid mappings, or graphs is a core and essential problem. In this paper, we focus they induce the same edit cost, called redundant mappings. on the similarity measure based on graph edit distance (GED) For invalid mappings, we do not have to generate them, and since it is applicable to virtually all types of data graphs and for redundant mappings, we only need to generate one of them can also precisely capture structural differences. Due to the so as to avoid redundancy. The search space of A?-GED and flexible and error-tolerant characteristics of GED, it has been DF-GED becomes oversized as they generate plenty of invalid successfully applied in many applications, such as molecular and redundant mappings. 
comparison in chemistry [8], object recognition in computer In addition, for A?-GED, it needs to store enormous vision [2] and graph clustering [9]. partial mappings, resulting in a huge memory consumption. Given two graphs G and Q, the GED between them, denoted In practice, A?-GED cannot compute the GED of graphs with arXiv:1709.10305v1 [cs.DS] 29 Sep 2017 by ged(G, Q), is defined as the minimum cost of an edit more than 12 vertices. Though DF-GED performing a depth- path that transforms one graph to another. Unfortunately, first search is efficient in memory, it is easily trapped into a unlike the classical graph matching problem, such as subgraph local (i.e., suboptimal) solution and hence produces lots of isomorphism [15], the fault tolerance of GED allows a vertex expensive backtracking. On the other hand, for CSI GED, it of one graph to be mapped to any vertex of the other graph, adopts the depth-first search paradigm, and hence also faces regardless of their labels and degrees. As a consequence, the the expensive backtracking problem. Besides, the search space complexity of the GED computation is higher than that of of CSI GED is exponential with respect to the number of subgraph isomorphism, which has been proved to be an NP- edges of comparing graphs, making it naturally be unsuitable hard [17] problem. for dense graphs. The GED computation is usually carried out by means To solve the above issues, we propose a novel vertex-based of a tree search algorithm which explores the space of all mapping method for the GED computation, named BSS GED, possible mappings of vertices and edges of comparing graphs. based on beam-stack search [11] which has shown an excellent The underlying search space can be organized as an ordered performance in AI literature. Our contributions in this paper search tree. Based on the way of generating successors of are summarized below. • We propose a novel method of generating successors of if there exists an injective function φ : VG → VQ, such that nodes in the search tree, which reduces a large number (1) ∀u ∈ VG, φ(u) ∈ VQ and L(u) = L(φ(u)). (2) ∀e(u, v) ∈ of invalid and redundant mappings involved in the GED EG, e(φ(u), φ(v)) ∈ EQ and L(e(u, v)) = L(e(φ(u), φ(v))). computation. As a result, we create a small search space. If G ⊆ Q and Q ⊆ G, then G and Q are graph isomorphic Moreover, we also give a rigorous theoretical analysis of to each other, denoted by G =∼ Q. the search space. There are six edit operations can be used to transform one • Incorporating with the beam-stack search paradigm into graph to another [12], [20]: insert/delete an isolated vertex, our method to compute GED, we achieve a flexible trade- insert/delete an edge, and substitute the label of a vertex or off between available memory and the time overhead of an edge. Given two graphs G and Q, an edit path P = backtracking and gain a better performance than the best- hp1, . . . , pki is a sequence of edit operations that transforms first and depth-first search paradigms. p p one graph to another, such as G = G0 −→1 ,..., −→k Gk =∼ Q. • We propose two heuristics to prune the search space, The edit cost of P is defined as the sum of edit cost of all where the first heuristic produces tighter lower bound and operations in P , i.e., Pk c(p ), where c(p ) is the edit cost the second heuristic enables to fast search of tighter upper i=1 i i of the edit operation p . In this paper, we focus on the uniform bound. 
i cost model, i.e., c(pi) = 1 for ∀i, thus the edit cost of P is its • We have conducted extensive experiments on both real length, denoted by |P |. For P , we call it is optimal if and only and synthetic datasets. The experimental results show that if it has the minimum length among all possible edit paths. BSS GED is highly efficient for the GED computation on sparse as well as dense graphs, and outperforms the Definition 2 (Graph Edit Distance). Given two graphs G state-of-the-art GED methods. and Q, the graph edit distance between them, denoted by • In addition, we also extend BSS GED as a standard graph ged(G, Q), is the length of an optimal edit path that trans- similarity search query method and the practical results forms G to Q (or vice versa). confirm its efficiency. Example 1. In Figure 1, we show an optimal edit path P that The rest of this paper is organized as follows: In Section II, transforms graph G to graph Q. The length of P is 4, where we introduce the problem definition and then give an overview we delete two edges e(u1, u2) and e(u1, u3), substitute the of the vertex-based mapping method for the GED computation. label of vertex u1 with label A and insert one edge e(u1, u4) In Section III, we create a small search space by reducing the with label a. number of invalid and redundant mappings involved in the v GED computation. In Section IV, we utilize the beam-stack u1 u2 u1 u2 u1 u2 1 b search paradigm to traverse the search space to compute GED. B A B A A A A In Section V, we propose two heuristics to prune the search b a a a a A C A C A C A a C a A space. In Section VI, we extend BSS GED as a standard graph a a a v v v u3 u4 u3 u4 u3 u4 3 4 2 similarity search query method. In Section VII, we report the G G1 Q1 Q experimental results and our analysis. Finally, we investigate Fig. 1: An optimal edit path P between graphs G and Q. research works related to this paper in Section VIII and then make concluding remarks in Section IX.
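To make the uniform cost model concrete, the following sketch (our own Python encoding, not code from the paper) applies an edit path like the one in Example 1 to a dictionary representation of a labeled graph; the edge labels of Figure 1 are our reading of the figure, and `apply_op`/`edit_path_cost` are hypothetical helper names.

```python
# A minimal sketch of the six edit operations under the uniform cost model:
# every operation costs 1, so the cost of an edit path is its length.

def apply_op(vlab, elab, op):
    """Apply one edit operation in place.

    vlab: dict vertex -> label; elab: dict frozenset({u, v}) -> label.
    op examples: ('add_vertex', u, 'A'), ('del_vertex', u),
                 ('relabel_vertex', u, 'B'), ('add_edge', u, v, 'a'),
                 ('del_edge', u, v), ('relabel_edge', u, v, 'b').
    """
    kind = op[0]
    if kind == 'add_vertex':
        vlab[op[1]] = op[2]
    elif kind == 'del_vertex':                 # only isolated vertices may be deleted
        assert all(op[1] not in e for e in elab)
        del vlab[op[1]]
    elif kind == 'relabel_vertex':
        vlab[op[1]] = op[2]
    elif kind == 'add_edge':
        elab[frozenset((op[1], op[2]))] = op[3]
    elif kind == 'del_edge':
        del elab[frozenset((op[1], op[2]))]
    elif kind == 'relabel_edge':
        elab[frozenset((op[1], op[2]))] = op[3]

def edit_path_cost(path):
    return len(path)                            # uniform model: c(p_i) = 1 for all i

# The edit path of Example 1: delete e(u1,u2) and e(u1,u3), relabel u1 with A,
# insert e(u1,u4) with label a -> cost 4, and the result matches Q of Figure 1.
P = [('del_edge', 'u1', 'u2'), ('del_edge', 'u1', 'u3'),
     ('relabel_vertex', 'u1', 'A'), ('add_edge', 'u1', 'u4', 'a')]
vlab = {'u1': 'B', 'u2': 'A', 'u3': 'A', 'u4': 'C'}
elab = {frozenset(('u1', 'u2')): 'b', frozenset(('u1', 'u3')): 'a',
        frozenset(('u2', 'u4')): 'a', frozenset(('u3', 'u4')): 'a'}
for op in P:
    apply_op(vlab, elab, op)
print(edit_path_cost(P), sorted(vlab.values()), len(elab))   # 4 ['A','A','A','C'] 3
```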

II.PRELIMINARIES B. Graph Mapping In this section, we introduce basic notations. For simplicity In this part, we introduce the graph mapping between two in exposition, we only focus on simple undirected graphs graphs, which can induce an edit path between them. In order to match two unequal size graphs G and Q, we extend their without multi-edges or self-loops. ∗ ∗ ∗ n vertex sets as VG and VQ such that VG = VG ∪ {u } and ∗ n n n A. Problem Definition VQ = VQ ∪ {v }, respectively, where u and v are dummy vertices labeled with ε, s.t., ε∈ / Σ. Then, we define graph Let Σ be a set of discrete-valued labels. A labeled graph mapping as follows: is a triplet G = (VG,EG,L), where VG is the set of vertices, EG ⊆ VG × VG is the set of edges, L : VG ∪ EG → Σ is a Definition 3 (Graph Mapping). A graph mapping from ∗ ∗ labeling function which assigns a label to a vertex or an edge. graph G to graph Q is a bijection ψ : VG → VQ, such that ∗ ∗ For a vertex u, we use L(u) to denote its label. Similarly, ∀u ∈ VG, ψ(u) ∈ VQ, and at least one of u and ψ(u) is not

L(e(u, v)) is the label of an edge e(u, v). ΣVG = {L(u): a dummy vertex. u ∈ V } and Σ = {L(e(u, v)) : e(u, v) ∈ E } are the G EG G Given a graph mapping ψ from G to Q, it induces an label multisets of VG and EG, respectively. For a graph G, unlabeled graph H = (VH ,EH ), where VH = {u : u ∈ S(G) = (VG,EG) is its unlabeled version, i.e., its structure. VG ∧ ψ(u) ∈ VQ} and EH = {e(u, v): e(u, v) ∈ EG ∧ In this paper, we refer |VG| to the size of graph G. e(ψ(u), ψ(v)) ∈ EQ}, then H ⊆ S(G) and H ⊆ S(Q). Definition 1 (Subgraph Isomorphism [15]). Given two graphs Let Gψ (resp. Qψ) be the labeled version of H embedded G and Q, G is subgraph isomorphic to Q, denoted by G ⊆ Q, in G (resp. Q). Accordingly, we obtain an edit path Pψ : ψ ψ G → G → Q → Q. Let CD(ψ), CS(ψ) and CI (ψ) be generating successors. For a leaf node r, we compute the edit ψ ψ ψ the respective edit cost of transforming G to G , G to Q , cost of its corresponding edit path Pψr by Theorem 1. Thus, and Qψ to Q. As Gψ is a subgraph of G, we only need when we generate all leaf nodes, we must find an optimal to delete vertices and edges that do not belong to Gψ when graph mapping and then obtain ged(G, Q). ψ transforming G to G . Thus, CD(ψ) = |VG| − |VH | + |EG| − ψ |EH |. Similarly, CI (ψ) = |VQ|−|VH |+|EQ|−|EH |. Since G Algorithm 1: BasicGenSuccr(r) and Qψ have the same structure H, we only need to substitute ψ ψ 1 ψr ← {(ui1 → vj1 ),..., (uil → vjl )}, succ ← ∅; the corresponding vertex and edge labels between G and Q , r r 2 CG ← VG\{ui1 , . . . , uil },CQ ← VQ\{vj1 , . . . , vjl }; r thus CS(ψ) = |{u : u ∈ VH ∧ L(u) 6= L(ψ(u))}| + |{e(u, v): 3 if |CG| > 0 then r e(u, v) ∈ EH ∧ L(e(u, v)) 6= L(e(ψ(u), ψ(v)))}|. 4 foreach z ∈ CQ do

5 generate successor q, s.t., ψq ← ψr ∪ {(uil+1 → z)}; Theorem 1 ([4]). Given a graph mapping ψ from graph G to 6 succ ← succ ∪ {q}; graph Q. Let Pψ be the edit path induced by ψ, then |Pψ| = n 7 generate successor q, s.t., ψq ← ψr ∪ {(uil+1 → v )}; CD(ψ) + CI (ψ) + CS(ψ). 8 succ ← succ ∪ {q}; Example 2. Consider graphs G and Q in Figure 1. Given a 9 else S n 10 generate successor q, s.t., ψq ← ψr ∪ z∈Cr {(u → z)}; graph mapping ψ : {u1, u2, u3, u4} → {v1, v2, v3, v4}, where Q 11 succ ← succ ∪ {q}; ψ(u1) = v1, ψ(u2) = v2, ψ(u3) = v3, and ψ(u4) = v4, we have H = ({u1, u2, u3, u4}, {e(u2, u4), e(u3, u4)}). Then ψ 12 return succ; ψ ψ induces an edit path Pψ : G → G → Q → Q shown in Figure 1, where Gψ = G and Qψ = Q . By Theorem 1, we 1 1 However, the above method BasicGenSuccr used in A?- compute that C (ψ) = 2, C (ψ) = 1 and C (ψ) = 1, thus D I S GED [5] and DF-GED [16] generates all possible successors. |P | = C (ψ) + C (ψ) + C (ψ) = 4. ψ D I S As a result, both A?-GED and DF-GED enumerate all possible Hereafter, for ease of presentation, we assume that G and Q graph mappings from G to Q and their search space size are the two comparing graphs, and VG = {u1, . . . , u } |VG| is O(|V ||VG|) [4]. However, among these mappings, some and V = {v , . . . , v }. For a graph mapping ψ from G Q Q 1 |VQ| mappings certainly not be optimal, called invalid mappings, to Q, we call it is optimal only when its induced edit path P ψ or they induce the same edit cost, called redundant mappings. is optimal. Next, we give an overview of the vertex-based For invalid mappings, we do not have to generate them, and mapping method for computing ged(G, Q) by enumerating for redundant mappings, we only need to generate one of all possible graph mappings from G to Q. them. Next, we present how to create a small search space C. GED computation: Vertex-based Mapping Approach by reducing the number of invalid and redundant mappings. V ∗ Assuming that vertices in G are processed in the order REATING MALL EARCH PACE n n III.C S S S (ui , . . . , ui , u , . . . , u ), where i1, . . . , i|V | is a permu- 1 |VG| G A. Invalid Mapping Identification tation of 1,..., |VG| detailed in Section V-B. Then, we denote |V ∗| S G Let |ψ| be the length of a graph mapping ψ, i.e., |V ∗|. We a graph mapping from G to Q as ψ = l=1 {(uil → vjl )} in G n the following sections, such that (1) uil = u if il > |VG|; (2) give an estimation of |ψ| in Theorem 2, which can be used to n ∗ vjl = v if jl > |VQ|; and (3) vjl = ψ(uil ) for 1 ≤ l ≤ |VG|. identify invalid mappings. The GED computation is always achieved by means of an Theorem 2. Given an optimal graph mapping ψ from graph G ordered search tree, where inner nodes correspond to partial to graph Q, then |ψ| = max{|V |, |V |}. graph mappings and leaf nodes correspond to complete graph G Q mappings. Such a search tree is created dynamically at runtime Proof: Suppose for the purpose of contradiction that n n by iteratively generating successors linked by edges to the |ψ| > max{|VG|, |VQ|}. Then (x → v ) and (u → y) must currently considered node. Let ψr = {(ui1 → vj1 ), . . ., (uil → be present simultaneously in ψ, where x ∈ VG and y ∈ VQ. We 0 n n vjl )} be the (partial) mapping associated with a node r, construct another graph mapping ψ = (ψ\{(x → v ), (u → 0 where vjk is the mapped vertex of uik for 1 ≤ k ≤ l, then y)}) ∪ {(x → y)}, and then prove |Pψ | < |Pψ| as follows: Algorithm 1 outlines the method of generating successors of r. 
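The cost decomposition behind Theorem 1 can be computed directly from a mapping. The sketch below (our own encoding; `mapping_cost` and the dictionary-based graphs are ours) builds the common structure H induced by ψ and returns C_D(ψ) + C_I(ψ) + C_S(ψ), reproducing the value 4 of Example 2 for the graphs of Figure 1 (edge labels follow our reading of the figure).

```python
# Sketch of Theorem 1: |P_psi| = C_D(psi) + C_I(psi) + C_S(psi).

def mapping_cost(vlab_g, elab_g, vlab_q, elab_q, psi):
    """psi maps vertices of G to vertices of Q (dummy targets simply omitted)."""
    # vertices of H: real vertices of G mapped to real vertices of Q
    VH = {u for u in vlab_g if psi.get(u) in vlab_q}
    # edges of H: edges of G whose images are also edges of Q
    EH = {e for e in elab_g
          if all(v in VH for v in e)
          and frozenset(psi[v] for v in e) in elab_q}
    C_D = (len(vlab_g) - len(VH)) + (len(elab_g) - len(EH))    # deletions in G
    C_I = (len(vlab_q) - len(VH)) + (len(elab_q) - len(EH))    # insertions into Q
    C_S = sum(1 for u in VH if vlab_g[u] != vlab_q[psi[u]])     # vertex relabelings
    C_S += sum(1 for e in EH
               if elab_g[e] != elab_q[frozenset(psi[v] for v in e)])  # edge relabelings
    return C_D + C_I + C_S

# Graphs G and Q of Figure 1 and the mapping of Example 2.
vlab_g = {'u1': 'B', 'u2': 'A', 'u3': 'A', 'u4': 'C'}
elab_g = {frozenset(('u1', 'u2')): 'b', frozenset(('u1', 'u3')): 'a',
          frozenset(('u2', 'u4')): 'a', frozenset(('u3', 'u4')): 'a'}
vlab_q = {'v1': 'A', 'v2': 'A', 'v3': 'A', 'v4': 'C'}
elab_q = {frozenset(('v1', 'v4')): 'a', frozenset(('v2', 'v4')): 'a',
          frozenset(('v3', 'v4')): 'a'}
psi = {'u1': 'v1', 'u2': 'v2', 'u3': 'v3', 'u4': 'v4'}
print(mapping_cost(vlab_g, elab_g, vlab_q, elab_q, psi))   # 4, as in Example 2
```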
Let H and H0 be two unlabeled graphs induced by ψ and ψ0, 0 Algorithm 1 is easy to understand. First, we compute the respectively, then VH0 = {u : u ∈ VG ∧ ψ (u) ∈ VQ} = {u : r r 0 sets of unmapped vertices CG and CQ in G and Q, respectively u ∈ VG ∧ ψ(u) ∈ VQ} ∪ {x : ψ (x) ∈ VQ} = VH ∪ {x}. Let r (line 2). Then, if |CG| > 0, for the vertex uil+1 to be extended, Ax = {z : z ∈ VH ∧ e(x, z) ∈ EG ∧ e(y, ψ(z)) ∈ EQ}, then r n 0 0 we choose a vertex z from CQ or {v } as its mapped vertex, EH0 = {e(u, v): e(u, v) ∈ EG ∧ e(ψ (u), ψ (v)) ∈ EQ} = and finally generate all possible successors of r (lines 4–8); {e(u, v): e(u, v) ∈ EG ∧ e(ψ(u), ψ(v)) ∈ EQ} ∪ {e(x, z): otherwise, all vertices in G were processed, then we insert all z ∈ VH0 ∧e(x, z) ∈ EG ∧e(y, ψ(z)) ∈ EQ} = EH ∪{e(x, z): r vertices in CQ into G and obtain a unique successor leaf node z ∈ Ax}. As x∈ / VH , e(x, z) ∈/ EH for ∀z ∈ Ax. Thus, (lines 10–11). |VH0 | = |VH | + 1 and |EH0 | = |EH | + |Ax|. Staring from a dummy root node root such that ψroot = ∅, As CD(ψ) = |VG|−|VH |+|EG|−|EH | and CI (ψ) = |VQ|− 0 we can create the search tree layer-by-layer by iteratively |VH |+|EQ|−|EH |, we have CD(ψ ) = CD(ψ)−(1+|Ax|) and 0 CI (ψ ) = CI (ψ)−(1+|Ax|). Since CS(ψ) = |{u : u ∈ VH ∧ For an edge e(u, v) in EH , e(ψ(u), ψ(v)) is its mapped 0 0 L(u) 6= L(ψ(u))}| + |{e(u, v): e(u, v) ∈ EH ∧ L(e(u, v)) 6= edge in EQ. As π(ψ(u)) = π(ψ (u)), we have ψ(u) ∼ ψ (u) 0 0 L(e(ψ(u), ψ(v)))}|, we have CS(ψ ) = CS(ψ) + c(x → y) + and then obtain NQ(ψ(u)) = NQ(ψ (u)). Thus, we have P c(e(x, z) → e(y, ψ(z))), where c(·) gives the edit cost e(ψ0(u), ψ(v)) ∈ E . Similarly, since π(ψ(v)) = π(ψ0(v)), z∈Ax Q of relabeling a vertex or an edge, such that c(a → b) = 0 if we obtain ψ(v) ∼ ψ0(v). Thus, there must exist edges between L(a) = L(b), and c(a → b) = 1 otherwise, and a (resp. b) is ψ(u) and ψ0(v), ψ0(u) and ψ0(v) (an illustration is shown 0 0 0 a vertex or an edge in G (resp. Q). Thus, CS(ψ ) ≤ CS(ψ) + in Fig. 2), thus e(ψ (u), ψ (v)) ∈ EQ and hence we obtain 0 0 0 1 + |Ax|. Therefore, |Pψ0 | = CD(ψ ) + CI (ψ ) + CS(ψ ) ≤ e(u, v) ∈ EH0 . Thus, EH ⊆ EH0 . Similarly, we also obtain CD(ψ)−(1+|Ax|)+CI (ψ)−(1+|Ax|)+CS(ψ)+1+|Ax| = EH0 ⊆ EH . So, EH = EH0 . 0 |Pψ|−(1+|Ax|) < |Pψ|. This would be a contradiction that Pψ Since VH = VH0 and EH = EH0 , we have H = H . Thus, 0 0 is optimal. Hence |ψ| = max{|VG|, |VQ|}. CI (ψ) = CI (ψ ) and CD(ψ) = CD(ψ ). Next, we do not Theorem 2 states that a graph mapping whose length distinguish H and H0 anymore. is greater than |V | must be an invalid mapping, where |V | = max{|VG|, |VQ|}. For example, considering graphs G and Q in Figure 1, and a graph mapping ψ = {(u1 → u ψ(u) ψ0(u) n n v1), (u2 → v ), (u3 → v3), (u4 → v4), (u → v2)}, we know that ψ with an edit cost 7 must be invalid as |ψ| = 5 > max{|VG|, |VQ|} = 4. v ψ(v) ψ0(v)
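A minimal Python rendering of Algorithm 1 (BasicGenSuccr), under our own node representation: a node is just its partial mapping, a list of (u, v) pairs, and `DUMMY` stands for the dummy vertex v^n.

```python
DUMMY = None

def basic_gen_succr(psi_r, order_G, V_Q):
    """Generate all successors of the node carrying partial mapping psi_r.

    order_G : processing order (u_{i1}, ..., u_{i|VG|}) of the vertices of G.
    V_Q     : the vertex set of Q.
    """
    mapped_g = [u for u, _ in psi_r]
    mapped_q = {v for _, v in psi_r if v is not DUMMY}
    C_G = [u for u in order_G if u not in mapped_g]          # unmapped vertices of G
    C_Q = [v for v in V_Q if v not in mapped_q]              # unmapped vertices of Q
    succ = []
    if C_G:                                    # extend the next unmapped vertex of G
        u_next = C_G[0]
        for z in C_Q:                          # map it to any unmapped vertex of Q ...
            succ.append(psi_r + [(u_next, z)])
        succ.append(psi_r + [(u_next, DUMMY)]) # ... or to the dummy vertex
    else:                                      # all vertices of G processed:
        succ.append(psi_r + [(DUMMY, z) for z in C_Q])  # unique leaf successor
    return succ

# First expansion for |V_G| = |V_Q| = 3: 3 real targets plus the dummy -> 4 successors.
print(len(basic_gen_succr([], ['u1', 'u2', 'u3'], ['v1', 'v2', 'v3'])))  # 4
```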

B. Redundant Mapping Identification Fig. 2: Illustration of isomorphic vertices. For a vertex u in VQ, its neighborhood information is de- fined as N (u) = {(v, L(e(u, v))) : v ∈ V ∧ e(u, v) ∈ E }. Q Q Q 0 For any vertex u in VH , we have L(ψ(u)) = L(ψ (u)) 0 Definition 4 (Vertex Isomorphism). Given two vertices as ψ(u) ∼ ψ (u). Thus, |{u : u ∈ VH ∧ L(u) 6= 0 u, v ∈ VQ, u is isomorphic to v, denoted by u ∼ v, if and L(ψ(u))}| = |{u : u ∈ VH ∧ L(u) 6= L(ψ (u))}|. For 0 only if L(u) = L(v) and NQ(u) = NQ(v). any edge e(u, v) in EH , since ψ(u) ∼ ψ (u), we know that L(e(ψ(u), ψ(v))) = L(e(ψ0(u), ψ(v))). Similarly, we By Definition 4, we know that the isomorphic relationship obtain L(e(ψ0(u), ψ(v))) = L(e(ψ0(u), ψ0(v))) as ψ(v) ∼ between vertices is an equivalence relation. Thus, we can 0 0 0 1 λQ ψ (v). Thus L(e(ψ(u), ψ(v))) = L(e(ψ (u), ψ (v))). Hence divide VQ into λQ equivalent classes VQ,...,VQ of isomor- |{e(u, v): e(u, v) ∈ EH ∧L(e(u, v)) 6= L(e(ψ(u), ψ(v)))}| = phic vertices. Each vertex u is said to belong to class π(u) = i 0 0 i n |{e(u, v): e(u, v) ∈ EH ∧ L(e(u, v)) 6= L(e(ψ (u), ψ (v)))}|. if u ∈ VQ. Dummy vertices in {v } are isomorphic to each 0 Therefore, C (ψ) = C (ψ ). So, we have |P | = |P 0 |. other, and let π(vn) = λ + 1. S S ψ ψ Q Example 3. Consider graphs G and Q in Figure 1. For Q, Definition 5 (Canonical Code). Given a graph mapping we know that L(v1) = L(v2) = L(v3) = A, and NQ(v1) = S|ψ| N (v ) = N (v ) = {(v , a)}, thus v ∼ v ∼ v . So, we ψ = l=1{(uil → vjl )}, where vjl = ψ(uil ) for 1 ≤ Q 2 Q 3 4 1 2 3 1 2 l ≤ |ψ|. The canonical code of ψ is defined as code(ψ) = divide VQ into equivalent classes VQ = {v1, v2, v3}, VQ = {v }. Given two graph mappings ψ = {(u → v ), (u → hπ(vj1 ), . . . , π(vj|ψ| )i. 4 1 1 2 0 0 0 v2), (u3 → v3), (u4 → v4)} and ψ = {(u1 → v2), (u2 → Given two graph mappings ψ and ψ such that |ψ| = |ψ |, 0 v3), (u3 → v1), (u4 → v4)}, we have code(ψ) = code(ψ ) = we say that code(ψ) = code(ψ0) if and only if π(v ) = jl 0 0 0 0 h1, 1, 1, 2i, and then obtain |Pψ| = |Pψ | = 4. π(v ), where vj = ψ(ui ) and v = ψ (ui ) for 1 ≤ l ≤ |ψ|. jl l l jl l Theorem 3 states that graph mappings with the same 0 Theorem 3. Given two graph mappings ψ and ψ . Let Pψ canonical code induce the same edit cost. Thus, among these 0 and Pψ0 be edit paths induced by ψ and ψ , respectively. If mappings, we only need to generate one of them. Next, 0 GenSuccr code(ψ) = code(ψ ), then we have |Pψ| = |Pψ0 |. we apply Theorems 2, 3 into the procedure of generating successors, which can prevent from generating the Proof: As discussed in Section II-B, |Pψ| = CI (ψ) + above invalid and redundant mappings. CD(ψ) + CS(ψ). In order to prove |Pψ| = |Pψ0 |, we first 0 0 prove CI (ψ) = CI (ψ ) and CD(ψ) = CD(ψ ), then prove C. Generating Successors 0 CS(ψ) = CS(ψ ). Consider a node r associated with a partial graph mapping 0 0 Let H and H be two unlabeled graphs induced by ψ and ψ , ψr = {(ui1 → vj1 ),..., (uil → vjl )} in the search tree. Then, 0 r respectively. For a vertex u in VH , ψ(u) and ψ (u) are the the sets of unmapped vertices in G and Q are CG = 0 r mapped vertices of u, respectively. Since code(ψ) = code(ψ ), VG\{ui1 , . . . , uil } and CQ = VQ\{vj1 , . . . , vjl }, respectively. 0 0 r n we have π(ψ(u)) = π(ψ (u)) and hence obtain ψ(u) ∼ ψ (u). For the vertex uil+1 to be extended, let z ∈ CQ ∪ {v } be a As ψ(u) 6= vn, we have ψ0(u) 6= vn by Definition 4. Thus possible mapped vertex. u ∈ VH0 , and hence we obtain VH ⊆ VH0 . 
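Definitions 4 and 5 can be sketched as follows (our own encoding): the vertices of Q are grouped into isomorphism classes by their label and neighborhood information, and the canonical code of a mapping is the tuple of class ids of its mapped vertices. For the star graph Q of Figure 1 the two mappings of Example 3 receive the same code.

```python
def iso_classes(vlab_q, elab_q):
    """pi: vertex of Q -> class id; vertices with equal label and equal
    neighbourhood information N_Q(.) share a class (Definition 4)."""
    def signature(v):
        nbrs = frozenset((next(iter(e - {v})), lab)
                         for e, lab in elab_q.items() if v in e)
        return (vlab_q[v], nbrs)
    sig_to_class, pi = {}, {}
    for v in sorted(vlab_q):
        pi[v] = sig_to_class.setdefault(signature(v), len(sig_to_class) + 1)
    return pi

def canonical_code(psi, pi, dummy_class):
    """code(psi) = <pi(v_j1), ..., pi(v_j|psi|)>; all dummy vertices share one class."""
    return tuple(dummy_class if v is None else pi[v] for _, v in psi)

# Graph Q of Figure 1: a star around v4, so v1 ~ v2 ~ v3 (Example 3).
vlab_q = {'v1': 'A', 'v2': 'A', 'v3': 'A', 'v4': 'C'}
elab_q = {frozenset(('v1', 'v4')): 'a', frozenset(('v2', 'v4')): 'a',
          frozenset(('v3', 'v4')): 'a'}
pi = iso_classes(vlab_q, elab_q)
dummy = len(set(pi.values())) + 1
psi  = [('u1', 'v1'), ('u2', 'v2'), ('u3', 'v3'), ('u4', 'v4')]
psi2 = [('u1', 'v2'), ('u2', 'v3'), ('u3', 'v1'), ('u4', 'v4')]
print(canonical_code(psi, pi, dummy) == canonical_code(psi2, pi, dummy))   # True
```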
Similarly, we also By Theorem 2, if |VG| ≤ |VQ|, we have |ψ| = |VQ|, which obtain VH0 ⊆ VH . So, VH = VH0 . means that none of the vertices in VG is allowed to be mapped n Layer V to a dummy vertex, i.e., (u → v ) ∈/ ψ for ∀u ∈ VG. G root r r r n Rule 1. If |CG| ≤ |CQ|, then z ∈ CQ; otherwise z = v or r 1 u1 v1 v4 z ∈ CQ. Applying rule 1 into the process GenSuccr of generating 2 u2 v2 v4 v1 successors of each node, we know that if |V | ≤ |V | then G Q 3 u3 v3 v4 v2 v2 none of the vertices in VG will be mapped to a dummy vertex u4 v4 v3 v3 v3 otherwise only |VG|−|VQ| vertices do. As a result, the obtained 4 graph mapping ψ must satisfy |ψ| = max{|VG|, |VQ|}. Fig. 3: Search tree created by GenSuccr. Definition 6 (Canonical Code Partial Order). Let ψ and ψ0 be two graph mappings such that code(ψ) = code(ψ0). We 0 0 define that ψ  ψ if ∃l, 1 ≤ l ≤ |ψ|, s.t., ψ(uik ) = ψ (uik ) D. Search Space Analysis 0 for 1 ≤ k < l and ψ(uil ) < ψ (uil ). Replacing BasicGenSuccr (see Alg. 1) with GenSuccr By Theorem 3, we know that graph mappings with the same to generate successors, we reduce a large number of invalid canonical code induce the same edit cost, thus among these and redundant mappings and then create a small search tree. mappings we only need to generate the smallest according Next, we analyze the size of search tree, i.e., the total number to the partial order defined in Definition 6. For u , we il+1 of nodes in the search tree. only map it to the smallest unmapped vertex in V m, for Q Nodes in the search tree are grouped into different layers 1 ≤ m ≤ λ . This will guarantee that the obtained graph Q based on their distances from the root node. Hence, the search mapping is smallest among those mappings with the same tree is divided into layers, one for each depth. When all canonical code. Then we establish Rule 2 as follows: vertices in VG were processed, for any node in the layer |VG| SλQ r m Rule 2. z ∈ m=1 min{CQ ∩ VQ }. (starting from 0), it generates a unique successor leaf node, Based on the above Rule 1 and Rule 2, we give the method thus we regard this layer as the last layer. Namely, we only of generating successors of r in Algorithm 2, where lines 4–8 need to generate the first |VG| layers. For the layer l, let Nl be corresponds to Rule 2 and lines 9–11 corresponds to Rule 1. the total number of nodes in this layer. So, the total number of P|VG| nodes in the search tree can be computed as SR = l=0 Nl. Algorithm 2: GenSuccr(r) For the layer l, the set of vertices in G that have been l processed is BG = {ui1 , . . . , uil }, correspondingly, we must 1 ψr ← {(ui1 → vj1 ),..., (uil → vjl )}, succ ← ∅; r r choose l vertices from V ∪ {vn} as their mapped vertices. 2 CG ← VG\{ui1 , . . . , uil },CQ ← VQ\{vj1 , . . . , vjl }; Q r l 3 if |CG| > 0 then Let BQ = {vj1 , . . . , vjl } be the l selected vertices, then we 4 for m ← 1 to λQ do use a vector x = [x , . . . , x ] to represent it, where x is r m 1 λQ+1 m 5 if CQ ∩ VQ 6= ∅ then l m r m the number of vertices in BQ that belong to VQ , i.e., xm = 6 z ← min{CQ ∩ VQ }; l m 7 generate successor q, s.t., |BQ ∩ VQ |, for 1 ≤ m ≤ λQ, and xλQ+1 is the number of l ψq ← ψr ∪ {(uil+1 → z)}; dummy vertices in BQ. Thus, we have 8 succ ← succ ∪ {q}; λQ+1 r r X 9 if |CG| > |CQ| then xm = l. (1) n 10 generate successor q, s.t., ψq ← ψr ∪ {(uil+1 → v )}; m=1 11 succ ← succ ∪ {q}; m where 0 ≤ xm ≤ |VQ | for 1 ≤ m ≤ λQ, and xλQ+1 ≥ 0. 
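A sketch of Algorithm 2 (GenSuccr) with Rules 1 and 2, again under our own node representation; `pi` is an isomorphism-class map such as the one computed above. For the graphs of Figure 1 the root gets only two successors (v1 and v4), matching the first layer of the search tree in Figure 3, instead of the five produced by BasicGenSuccr.

```python
DUMMY = None

def gen_succr(psi_r, order_G, V_Q, pi):
    """pi: vertex of Q -> isomorphism class id."""
    mapped_g = [u for u, _ in psi_r]
    mapped_q = {v for _, v in psi_r if v is not DUMMY}
    C_G = [u for u in order_G if u not in mapped_g]
    C_Q = [v for v in V_Q if v not in mapped_q]
    succ = []
    if C_G:
        u_next = C_G[0]
        candidates = {}                        # Rule 2: one candidate per class,
        for v in sorted(C_Q):                  # the smallest unmapped vertex of it
            candidates.setdefault(pi[v], v)
        for z in candidates.values():
            succ.append(psi_r + [(u_next, z)])
        if len(C_G) > len(C_Q):                # Rule 1: dummy only if G has more vertices left
            succ.append(psi_r + [(u_next, DUMMY)])
    else:
        succ.append(psi_r + [(DUMMY, z) for z in C_Q])
    return succ

# Q of Figure 1: classes {v1, v2, v3} -> 1 and {v4} -> 2 (cf. Example 3).
pi = {'v1': 1, 'v2': 1, 'v3': 1, 'v4': 2}
print(len(gen_succr([], ['u1', 'u2', 'u3', 'u4'], ['v1', 'v2', 'v3', 'v4'], pi)))  # 2
```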
12 else S n For a solution x of equation (1), it corresponds to a 13 generate successor q, s.t., ψq ← ψr ∪ z∈Cr {(u → z)}; l Q unique BQ. The reason is as follows: In Rule 2, each time 14 succ ← succ ∪ {q}; m we only select the smallest unmapped vertex in VQ as the 15 return succ; mapped vertex, for 1 ≤ m ≤ λQ. Thus, for xm in x, it means l m that BQ contains the first xm smallest vertices in VQ . For example, let us consider the search tree in Figure 3. Let l = 3 Example 4. Consider graphs G and Q in Figure 1. Figure 3 l and x = [2, 1, 0], then BQ contains the first 2 smallest vertices shows the entire search tree of G and Q created layer-by- 1 2 in VQ, i.e., v1 and v2, and the smallest vertex in VQ, i.e., v4, layer by using GenSuccr, where vertices in G are processed l then we have BQ = {v1, v2, v4}. in the order (u1, u2, u3, u4). In a layer, the values inside Let Ψl be the set of solutions of equation (1), then it the nodes are the possible mapped vertices (e.g., v1 and v4 l covers all possible BQ. For a solution x, it totally produces in layer one are the possible mapped vertices of u1). The l! different (partial) canonical codes. For example, sequence of vertices on the path from root to each leaf node QλQ+1 m=1 xm! gives a complete graph mapping. In this example, we totally for x = [2, 1, 0], it produces 3 partial canonical codes, i.e., generate 4 graph mappings, and then easily compute that h1, 1, 2i, h1, 2, 1i and h2, 1, 1i. As we know, each (partial) l ged(G, Q) = 4. canonical code corresponds to a (partial) mapping from BG l to BQ, which is associated with a node in the layer l, thus to denote the interval of layer l. For a node r in layer l, its successor n in next layer l + 1 is allowed to be expanded X l! Nl = . (2) only when f(n) is in the interval bs[l], i.e., bs[l].fmin ≤ QλQ+1 x ! x∈Ψl m=1 m f(n) < bs[l].fmax . In Rule 1, only when the number of unmapped vertices in G • priority queues open[0], . . . , open[|VG|], where open[l] is greater than that in Q, we select a dummy vertex vn as the (0 ≤ l ≤ |VG|) is used to store the expanded nodes in layer l. mapped vertex. As a result, if |VG| ≤ |VQ| then the number |V | • a table new, where new[H(r)] stores all successors of r of dummy vertices in B G is 0 otherwise is |V | − |V |. Let Q G Q and H(r) is a hash function which assigns a unique ID l = |V | and we then discuss the following two cases: G to r. Case 1. When |VG| > |VQ|. For any x, we have xλQ+1 = |VG| − |VQ|. Then equation (1) is reduced B. Algorithm PλQ PλQ m to m=1 xm = |VQ|. Since m=1 |VQ | = |VQ| and Algorithm 3 performs an iterative search to obtain a more m 0 ≤ xm ≤ |VQ |, for 1 ≤ m ≤ λQ , then equation (1) has and more tight upper bound ub of GED until ub = ged(G, Q). 1 λQ a unique solution x = [|VQ|,..., |VQ |, |VG| − |VQ|], Thus, In an iteration, we perform the following two steps: (1) we |VG|! N|V | = λ . As N0 = 1 and N1 ≤ · · · ≤ utilize beam search [10] to quickly reach to a leaf node whose G (|V |−|V |)! Q Q |V m|! G Q m=1 Q edit cost is an upper bound of GED, then we update ub (line 4). |VG|! N , we obtain SR ≤ |VG| λ + 1. |VG| Q Q m As beam search expands at most w nodes in each layer, some (|VG|−|VQ|)! |V |! m=1 Q nodes are inadmissible pruned when the number of nodes in a Case 2. When |VG| ≤ |VQ|. For any x, we have xλQ+1 = 0. PλQ layer is greater than w, where w is the beam width; Thus, (2) Then equation (1) is reduced to m=1 xm = |VG|. As P |VG|! 
we backtrack and pop items from bs until a layer l such |VG| ≤ |VQ|, we have N = x∈Ψ λ ≤ |VG| |VG| Q Q m=1 xm! that bs.top().fmax < ub (lines 5–6), and then shift the range xλQ+1=0 P |VQ|! |VQ|! of bs.top() (line 9) to re-expand those inadmissible pruned x∈Ψ λ = λ . Since N0 = 1 and |VQ| Q Q Q Q m m=1 xm! m=1 |VQ |! nodes in next iteration to search for tighter ub. If l = −1, xλQ+1=0 |VQ|! it means that we finish a complete search and then obtain N1 ≤ · · · ≤ N , we have SR ≤ |VG| λ + 1. |VG| Q Q m m=1 |VQ |! ub = ged(G, Q) (lines 7–8). However, this will overestimate SR when |VG|  |VQ|. In procedure Search, we perform a beam search starting For the layer l, if we do not consider the isomorphic vertices from layer l to re-expand those inadmissible pruned nodes l l l in BQ, then there are l! mappings from BG to BQ. Since there to search for tighter ub, where P QL and P QLL are two |VQ| l |VQ| temporary priority queues used to record expanded nodes in are at most l possible BQ, we have Nl ≤ l · l! = |VQ|! P|VG| P|VG| |VQ|! two adjacent layers. Each time we pop a node r with the . So, SR = Nl ≤ + 1 (|VQ|−l)! l=0 l=1 (|VQ|−l)! smallest cost to expand (line 4). If r is a leaf node, then |VQ|! P|VG| 1 = |V |−l + 1 (|VQ|−|VG|)! l=1 Q G we update ub and stop the search as g(z) ≥ g(r) holds for m=1 (|VQ|−|VG|+m) |VQ|! P|VG| 1 |VQ|! ∀z ∈ P QL (line 7); otherwise, we call ExpandNode to ≤ |V |−l + 1 ≤ 2 + 1. (|VQ|−|VG|)! l=1 2 G (|VQ|−|VG|)! |VG||VG|! generate all successors of r that are allowed to be expanded In summary, if |VG| > |VQ|, SR = O( λ ); Q Q m (|VG|−|VQ|)! m=1 |VQ |! in next layer and then insert them into P QLL (lines 8–9). As |VG||VQ|! |VQ|! at most w successors are allowed to be expanded, we only otherwise, SR = O(min{ λ , }). Q Q |V m|! (|VQ|−|VG|)! m=1 Q keep the best w nodes (i.e., the smallest cost) in P QLL, IV. GED COMPUTATION USING BEAM-STACK SEARCH and the nodes left are inadmissible pruned (lines 11–13). Correspondingly, line 12, we modify the right boundary of The previous section shows that we create a small search bs.top() as the lowest cost among all inadmissible pruned space. However, we still need an efficient search paradigm to nodes to ensure that the cost of the w successors currently traverse the search space to seek for an optimal graph mapping expanded is in this interval. to compute GED. In this section, based on the efficient search In procedure ExpandNode, we generate all successors of r paradigm, beam-stack search [11], we give our approach for that are allowed to be expanded. Note that, all nodes first the GED computation. generated are marked as false. If r has not been visited, i.e., r.visited = false, then we call GenSuccr (i.e., Alg. 2) to A. Data Structures generate all successors of r and mark r as visited (lines 3–4); For a node r in the search tree, f (r) = g(r) + h(r) is the otherwise, we directly read all successors of r from new total edit cost assigned to r, where g(r) is the edit cost of the (line 6). For a successor n of r, if f(n) ≥ ub or n.visited = partial path accumulated so far, and h(r) is the estimated edit true, then we safely prune it, see Lemma 1. Meanwhile, cost from r to a leaf node, which is less than or equal to the we delete all successors of n from new (line 9); otherwise, real cost. Before formally presenting the algorithm, we first if bs.top().fmin ≤ f (n) < bs.top().fmax , we expand n. 
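Equation (2) can be evaluated directly. The sketch below (our own code; the number of dummy vertices is passed explicitly, and is 0 when |V_G| ≤ |V_Q| by Rule 1) enumerates the capacity-bounded solutions of equation (1) and sums the multinomial counts; for the graph Q of Figure 1 it reproduces the layer sizes 1, 2, 3, 4, 4 of the search tree in Figure 3.

```python
from math import factorial
from itertools import product

def layer_size(l, class_sizes, dummies=0):
    """N_l of equation (2); class_sizes = [|V_Q^1|, ..., |V_Q^lambda_Q|],
    dummies = x_{lambda_Q + 1} (0 when |V_G| <= |V_Q|, by Rule 1)."""
    total = 0
    ranges = [range(min(c, l) + 1) for c in class_sizes]
    for x in product(*ranges):                  # candidate counts x_1, ..., x_{lambda_Q}
        if sum(x) + dummies != l:               # must solve equation (1)
            continue
        codes = factorial(l)
        for xm in list(x) + [dummies]:
            codes //= factorial(xm)             # l! / prod_m x_m!
        total += codes
    return total

# Q of Figure 1 has classes {v1, v2, v3} and {v4}; with |V_G| = |V_Q| = 4
# no dummies are used and the layer sizes match Figure 3.
print([layer_size(l, [3, 1]) for l in range(5)])   # [1, 2, 3, 4, 4]
```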
If all introduce the data structures used as follows: successors of r are safely pruned, we safely prune r, and • a beam stack bs, which is a generalized stack. Each item delete r from open[l] and its successors from new, res- in bs is a half-open interval [fmin , fmax ), and we use bs[l] pectively (line 13). Lemma 1. In ExpandNode, if f(n) ≥ ub or n.visited = Algorithm 3: BSS GED(G, Q, w) true, i.e., line 8, we safely prune n. 1 ψroot ← ∅, bs ← ∅, open[] ← ∅, new[] ← ∅, l ← 0, ub ← ∞; 2 bs.push([0, ub)), open[0].push(root); Proof: For the case f(n) ≥ ub, it is trivial. Next we prove 3 while bs 6= ∅ do it in the other case. 4 Search(l, ub, bs, open, new); Consider bs in the last iteration. Assuming that in this 5 while bs.top().fmax ≥ ub do iteration we perform Search starting from layer k (i.e., 6 bs.pop(), l ← l − 1; backtracking to layer k in the last iteration, see lines 5–6 7 if l = −1 then in Alg. 3), and node r and its successors n are in layers l 8 return ub; and l + 1, respectively, then k ≤ l and bs[m].fmax ≥ ub 9 bs.top().fmin ← bs.top().fmax , bs.top().fmax ← ub; for l + 1 ≤ m ≤ |VG |. If n.visited = true, then we must 10 return ub; have called ExpandNode to generate successors of n in the procedure Search(l, ub, bs, open, new) last iteration. For a successor x of n in layer l + 2, if x is 1 P QL ← open[l], P QLL ← ∅; 2 while P QL 6= ∅ or P QLL 6= ∅ do inadmissible pruned, then f(x) ≥ bs[l + 1].fmax ≥ ub, thus we safely prune x; otherwise, we consider a successor of x and 3 while P QL 6= ∅ do 4 r ← arg minn{f(n): n ∈ P QL}; repeat this decision process until a leaf node z. Then, it must 5 P QL ← P QL\{r}; satisfy that f(z) = g(z) ≥ ub. Thus, none of descendants of n 6 if ψr is a complete graph mapping then can produce tighter ub. So, we safely prune it. 7 ub ← min{ub, g(r)}, return; 8 succ ←ExpandNode(r, l, ub, open, new); Lemma 2. A node r is visited at most O(|VQ|) times. 9 P QLL ← P QLL ∪ succ;

Proof: For a node r in layer l, it generates at most 10 if |P QLL| > w then |VQ| + 1 successors by GenSuccr. In order to fully generate 11 keepNodes ← the best w nodes in P QLL; all successors in layer l +1, we backtrack to this layer at most 12 bs.top().fmax ← min{f(n): n ∈ P QLL ∧ n∈ / keepNodes}; |open[l]|·(|VQ|+1)/w ≤ |VQ|+1 times as |open[l]| ≤ w. After 13 P QLL ← keepNodes; that, when we visit r once again, all successors of r are either pruned or marked, thus we safely prune them by Lemma 1. 14 open[l + 1] ← P QLL, P QL ← P QLL; 15 P QLL ← ∅, l ← l + 1, bs.push([0, ub)); So, r cannot produce tighter ub in this iteration and we safely prune it, i.e., lines 12–13 in ExpandNode. Plus the first time procedure ExpandNode(r, l, ub, open, new) 1 expand ← ∅; when generating r, we totally visit r at most |VQ| + 3 times, 2 if r.visited = false then i.e., O(|VQ|). 3 succ ←GenSuccr(r); Theorem 4. Given two graphs G and Q, BSS GED must 4 new[H(r)] ← succ, r.visited ← true; return ged(G, Q). 5 else 6 succ ← new[H(r)]; Proof: By Lemma 2, a node is visited at most O(|VQ|) 7 foreach n ∈ succ do times, thus all nodes are totally visited at most O(|V |S ) Q R 8 if f(n) ≥ ub or n.visited = true then times (see SR in Section III-D), which is finite. So, BSS GED 9 new[H(n)] ← ∅; always terminates. In Search, we always update ub = 10 else if bs.top().fmin ≤ f(n) < bs.top().fmax then min{ub, g(r)} each time. Thus, ub becomes more and more 11 expand ← expand ∪ {n}; tight. Next, we prove that ub converges to ged(G, Q) when BSS GED terminates by contradiction. 12 if ∀n ∈ succ, f(n) ≥ ub or n.visited = true then 13 open[l] ← open[l]\{r}, new[H(r)] ← ∅; Suppose that ub > ged(G, Q). Let r and n be the leaf nodes whose edit cost is ub and ged(G, Q), respectively. Let x in 14 return expand; layer l be the common ancestor of r and n, which is farthest from root. Let z in layer l + 1 be a successor of x, which is the ancestor of n. Then f(z) ≤ f(n) = ged(G, Q) < ub. the upper bound ub and lower bound h(r) are the keys to For z, it is not in the path from root to r, thus it must be perform pruning. Here, we give two heuristics to prune the pruned in an iteration, i.e., f(z) ≥ ub or z.visited = false, search space as follows: (1) proposing an efficient heuristic line 8 in ExpandNode (if z has been inadmissible pruned, we function to obtain tighter h(r); (2) ordering vertices in G to backtrack and shift the range of bs[l] to re-expand it until that z enable to fast find of tighter ub. is pruned or marked). For the case f(z) ≥ ub, it contradicts that f(z) < ub, and for the other case, we conclude that A. Estimating h(r) f(n) ≥ ub by using the same analysis in Lemma 1, which contradicts that f(n) < ub. Thus, ub = ged(G, Q). Let P be an optimal edit path that transforms G to Q, then it contains at least max{|VG|, |VQ|} − |ΣVG ∩ ΣVQ | edit V. SEARCH SPACE PRUNING operations performed on vertices. Next, we only consider the In BSS GED, for a node r, if f(r) = g(r) + h(r) ≥ ub, edit operations in P performed on edges. Assuming that we then we safely prune r. As g(r) is the irreversible edit cost, first delete γ1 edges to obtain G1, then insert γ2 edges to u u obtain G2, and finally change γ3 edge labels to obtain Q. assuming that we first delete ξ1 and then insert ξ2 edges on u, u When transforming G to G1 by deleting γ1 edges, we and finally substitute ξ3 labels on the outer edges adjacent to u. have Σ ⊆ Σ , thus |Σ ∩ Σ | ≥ |Σ ∩ Σ |. 
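The following is a deliberately simplified, self-contained sketch of the beam-stack search skeleton: it uses h = 0, a toy successor function, and a per-layer `seen` set in place of the paper's open/new/visited bookkeeping. It illustrates how the stack of [fmin, fmax) intervals combines width-w beam search with complete backtracking; it is not the full BSS_GED.

```python
INF = float('inf')

def beam_stack_search(successors, depth, w):
    """successors(state, layer) -> list of (cost_increment, child_state)."""
    ub = INF
    bs = [[0, ub]]                        # bs[l] = [fmin, fmax) filtering layer-l expansions
    layers = [[(0, ())]]                  # layers[l] = nodes (f, state) kept at layer l
    seen = [set() for _ in range(depth + 1)]
    while bs:
        l = len(bs) - 1                   # resume from the deepest remaining layer
        frontier = layers[l]
        while frontier and l < depth:     # one (partial) beam-search pass
            lo, hi = bs[l]
            children = []
            for g, state in frontier:
                for dc, child in successors(state, l):
                    f = g + dc            # here h = 0, so f(n) = g(n)
                    if lo <= f < min(hi, ub) and (f, child) not in seen[l + 1]:
                        children.append((f, child))
            children.sort()
            if len(children) > w:         # inadmissible pruning: remember the
                bs[l][1] = children[w][0] # cheapest pruned f as the layer's new fmax
                children = children[:w]
            seen[l + 1].update(children)
            l += 1
            if l == len(bs):
                bs.append([0, ub])
                layers.append([])
            layers[l] = children
            frontier = children
        if l == depth and layers[l]:      # reached leaves: tighten the upper bound
            ub = min(ub, min(f for f, _ in layers[l]))
        while bs and bs[-1][1] >= ub:     # pop layers that cannot hide anything below ub
            bs.pop()
            layers.pop()
        if bs:                            # shift the top interval and re-expand it
            bs[-1] = [bs[-1][1], ub]
    return ub

def succ(state, layer):
    # Toy tree: the subtree under bit 1 at the root is cheap, but a width-1 beam
    # misses it at first, so the optimum (cost 1) is only found after backtracking.
    if layer == 0:
        return [(0, (0,)), (1, (1,))]
    step = 9 if state[0] == 0 else 0
    return [(step if layer == 1 else 0, state + (0,)),
            (step if layer == 1 else 0, state + (1,))]

print(beam_stack_search(succ, depth=3, w=1))   # 1
```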
Similar to the previous analysis of obtaining inequality (3), we EG1 EG EG EQ EG1 EQ When transforming G1 to G2 by inserting γ2 edges, for have each inserted edge, we no longer change its label, thus  u u |Ou| − ξ1 + ξ2 = |Oψ(u)| u u (4) |ΣEG ∩ ΣEQ | = |ΣEG ∩ ΣEQ | + γ2. When transform- 2 1 |ΣOu ∩ ΣOψ(u) | + ξ2 + ξ3 ≥ |Oψ(u)| ing G2 to Q by changing γ3 edge labels, we need to substitute at least |Σ | − |Σ ∩ Σ | edge labels, thus Thus, P3 ξu ≥ |O |−|Σ ∩Σ |. As |O |+ξu = EQ EG2 EQ i=1 i ψ(u) Ou Oψ(u) u 2 3 γ3 ≥ |ΣE | − |ΣE ∩ ΣE |. So, we have u P u Q G2 Q |Oψ(u)|+ξ1 , we have i=1 ξi ≥ |Oψ(u)|−|ΣOu ∩ΣOψ(u) |+ ξu = |O |−|Σ ∩Σ |+ξu ≥ |O |−|Σ ∩Σ |. So, |Σ ∩ Σ | + γ + γ ≥ |E |. (3) 1 u Ou Oψ(u) 2 u Ou Oψ(u) EG EQ 2 3 Q P3 u i=1 ξi ≥ max{|Ou|, |Oψ(u)|} − |ΣOu ∩ ΣOψ(u) |. Adding all P3 r r Let lb(G, Q) = max{|VG|, |VQ|}−|ΣVG ∩ΣVQ |+ i=1 γi, vertices in G1, we obtain the lower bound LB1 as follows: then ged(G, Q) ≥ lb(G, Q). Obviously, the lower bound r r r X lb(G, Q) should be as tight as possible. In order to achieve LB1 = LB(G2,Q2) + (max{|Ou|, |Oψ(u)|} u∈V r this goal, we utilize the degree sequence of a graph. G1 (5)

For a vertex u in G, its degree du is the number of edges −|ΣOu ∩ ΣOψ(u) |). adjacent to u. The degree sequence δG = [δG[1], . . . , δG[|VG|]] r Definition 8 (Outer Vertex Set). For a vertex u in G1, we of G is a permutation of d1, . . . , d|VG| such that δG [i] ≥ δG [j ] define its outer vertex set as A = {v : v ∈ V r ∧ e(u, v) ∈ u G2 for i < j . For unequal size G and Q, we extend δG and δQ r 0 0 EG}, which consists of vertices in G2 adjacent to u. as δG = [δG[1], . . . , δG[|VG|], 01,..., 0|V |−|VG|] and δQ =

[δQ[1], . . . , δQ[|VQ|], 01,..., 0|V |−|VQ|], resp., where |V | = Correspondingly, Aψ(u) denotes the outer vertex set P 0 n max{|VG|, |VQ|}. Let ∆1(G, Q) = d 0 0 (δ [i] − of ψ(u). Note that, if ψ(u) = v , then A = ∅. Thus, δG[i]>δQ[i] G ψ(u) 0 P 0 0 r S r δ [i])/2e and ∆ (G, Q) = d 0 0 (δ [i] − δ [i])/2e, A = Au denotes the set of vertices in G adjacent Q 2 δ [i]≤δ [i] Q G G u∈VGr 2 G Q 1 r r for 1 ≤ i ≤ |V |, then we give the respective lower bounds to those outer edges between G1 and G2. Similarly, we r S r r γ γ obtain A = Az. If |A | ≤ |A |, then we need of 1 and 2 as follows: Q z∈VQr G Q r 1 r to insert at least |AQ| − |AG| outer edges on some vertices Theorem 5 ([14]). Given two graphs G and Q, we have γ1 ≥ r P u r r in G1, hence u∈V r ξ2 ≥ |AQ| − |AG|; otherwise, ∆1(G, Q) and γ2 ≥ ∆2(G, Q). G1 P ξu ≥ |Ar | − |Ar |. Considering equation (4), for a Based on inequality (3) and Theorem 5, we then establish u∈VGr 1 G Q 1 r u u the following lower bound of GED in Theorem 6. vertex u in G1, we have ξ2 + ξ3 ≥ |Oψ(u)| − |ΣOu ∩ ΣOψ(u) |. P u u u P Thus, (ξ + ξ + ξ ) ≥ (|O | − |ΣO ∩ Theorem 6. Given two graphs G and Q, we u∈VGr 1 2 3 u∈VGr ψ(u) u 1 r r 1 u u have ged(G, Q) ≥ LB(G, Q), where LB(G, Q) = ΣOψ(u) |)+max{0, |AG|−|AQ|}. As |Ou|+ξ2 = |Oψ(u)|+ξ1 , P u u u P max{|V |, |V |} − |Σ ∩ Σ | + max{∆ (G, Q) + we have (ξ + ξ + ξ ) ≥ (|Ou| − |ΣO ∩ G Q VG VQ 1 u∈VGr 1 2 3 u∈VGr u 1 r r 1 ∆2(G, Q), ∆1(G, Q) + |EQ| − |ΣEG ∩ ΣEQ |}. ΣOψ(u) |) + max{0, |AQ| − |AG|}. So, we obtain the lower r r Next, we discuss how to estimate h(r) based on Theorem 6. bounds LB2 and LB3 as follows: Let ψr = {(ui → vj ),..., (ui → vj )} be the partial r r r X 1 1 l l LB = LB(G ,Q ) + (|Oψ(u)| − |ΣO ∩ ΣO |) r G Gr 2 2 2 u ψ(u) mapping associated with , then we divide into two parts 1 u∈V r r r G1 and G , where G is the mapped part of G, s.t., VGr = 2 1 1 +max{0, |Ar | − |Ar |}. {u , . . . , u } E r = {e(u, v): u, v ∈ V r ∧ e(u, v) ∈ G Q i1 il and G1 G1 r (6) E } G V r = V \V r G , and 2 is the unmapped part, s.t., G2 G G1 and r r r X E r = {e(u, v): u, v ∈ V r ∧ e(u, v) ∈ E }. Similarly, we G2 G2 G LB3 = LB(G2,Q2) + (|Ou| − |ΣOu ∩ ΣOψ(u) |) r r also obtain Q and Q . For r, by Theorem 6 we know that u∈V r 1 2 G1 (7) LB(Gr,Qr) is lower bound of ged(Gr,Qr) and hence can 2 2 2 2 +max{0, |Ar | − |Ar |}. adopt it as h(r). However, for the potential edit cost on the Q G r r r r r r r r r edges between G1 (resp. Q1) and G2 (resp. Q2), LB(G2,Q2) Based on the above lower bounds LB1 , LB2 and LB3 , we r r r has not covered it. adopt h(r) = max{LB1 , LB2 , LB3 } as the heuristic function to estimate the edit cost of a node r in BSS GED. Definition 7 (Outer Edge Set). For a vertex u in Gr, we define 1 Example 5. Consider graphs G and Q in Figure 4. For a node r its outer edge set as O = {e(u, v): v ∈ V r ∧e(u, v) ∈ E }, u G2 G associated with a partial mapping ψ(r) = {(u1 → v1), (u2 → which consists of edges adjacent to u that neither belong to r v2)}, then G2 = ({u3, u4, u5}, {e(u3, u5), e(u4, u5)},L) and E r nor E r . G1 G2 r Q2 = ({v3, v4, v5, v6}, {e(v3, v6), e(v4, v6), e(v5, v6)},L). By r r Correspondingly, Oψ(u) is the outer edge set of ψ(u). Theorem 6, we compute LB(G2,Q2) = 2. Considering n Note that, if ψ(u) = v , then Oψ(u) = ∅. Thus, ΣOu = vertices u1 and u2 that have been processed, we have Ou1 =
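A sketch of the Theorem 6 lower bound under our own graph encoding (Counter-based label multisets; Δ1 and Δ2 are read as the ceilings of half the total positive degree differences, as in Theorem 5):

```python
from collections import Counter
from math import ceil

def label_multiset_overlap(a, b):
    return sum((Counter(a) & Counter(b)).values())

def lower_bound(vlab_g, elab_g, vlab_q, elab_q):
    deg_g, deg_q = Counter(), Counter()
    for e in elab_g:
        for v in e: deg_g[v] += 1
    for e in elab_q:
        for v in e: deg_q[v] += 1
    n = max(len(vlab_g), len(vlab_q))
    dg = sorted((deg_g[v] for v in vlab_g), reverse=True) + [0] * (n - len(vlab_g))
    dq = sorted((deg_q[v] for v in vlab_q), reverse=True) + [0] * (n - len(vlab_q))
    d1 = ceil(sum(a - b for a, b in zip(dg, dq) if a > b) / 2)   # Delta_1: edge deletions
    d2 = ceil(sum(b - a for a, b in zip(dg, dq) if a <= b) / 2)  # Delta_2: edge insertions
    v_common = label_multiset_overlap(vlab_g.values(), vlab_q.values())
    e_common = label_multiset_overlap(elab_g.values(), elab_q.values())
    return n - v_common + max(d1 + d2, d1 + len(elab_q) - e_common)

# Graphs G and Q of Figure 1 (same encoding as in the earlier sketches):
# the bound evaluates to 4 and here happens to coincide with ged(G, Q) = 4.
vlab_g = {'u1': 'B', 'u2': 'A', 'u3': 'A', 'u4': 'C'}
elab_g = {frozenset(('u1', 'u2')): 'b', frozenset(('u1', 'u3')): 'a',
          frozenset(('u2', 'u4')): 'a', frozenset(('u3', 'u4')): 'a'}
vlab_q = {'v1': 'A', 'v2': 'A', 'v3': 'A', 'v4': 'C'}
elab_q = {frozenset(('v1', 'v4')): 'a', frozenset(('v2', 'v4')): 'a',
          frozenset(('v3', 'v4')): 'a'}
print(lower_bound(vlab_g, elab_g, vlab_q, elab_q))   # 4
```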

{L(e(u, v)) : e(u, v) ∈ Ou} is the label multiset of Ou. In {e(u1, u3), e(u1, u4)} and Ou2 = {e(u2, u4)}, and then obtain order to make O and O have the same label multiset, Σ = {a, a} and Σ = {b}. Similarly, we have u ψ(u) Ou1 Ou2 Σ = {a, a} and Σ = {b}. Thus LBr = LB(Gr,Qr)+ verse G in a depth-first order, and finally obtain order = Ov1 Ov2 1 2 2 P (max{|O |, |O |} − |Σ ∩ Σ |) = 2. As [u , u , u , u , u ]. Thus, we process vertices in G in the order u∈{u1,u2} u ψ(u) Ou Oψ(u) 2 4 1 3 5 r r r AG = {u3, u4} and AQ = {v3, v4, v5}, we obtain LB2 = (u2, u4, u1, u3, u5) in BSS GED. LB(Gr,Qr) + P (|O | − |Σ ∩ Σ |) + 2 2 u∈{u1,u2} ψ(u) Ou Oψ(u) r r r r r VI.EXTENSIONOF BSS GED max{0, |AG| − |AQ|} = 2, and LB3 = LB(G2,Q2) + P (|O | − |Σ ∩ Σ |) + max{0, |Ar | − |Ar |} u∈{u1,u2} u Ou Oψ(u) Q G In this section, we extend BSS GED to solve the GED r r r = 3. So, h(r) = max{LB1 , LB2 , LB3 } = max{2, 2, 3} = 3. based graph similarity search problem: Given a graph database G = {G1, G2,... }, a query graph Q and an edit distance u1 A B u2 v1 A B v2 threshold τ, the problem aims to find all graphs in G satisfy a a b a a b ged(Gi , Q) ≤ τ. As computing GED is an NP-hard problem, u3 A A u4 v3 A v4 A B v5 most of the existing methods, such as [12], [14], [19], [20], a b a a b all use the filter-and-verify schema, that is, first filtering some u5 C v6 C graphs in G to obtain candidate graphs, and then verifying them. G Q Here, we also use this strategy. For each data graph Gi, Fig. 4: Example of two comparing graphs G and Q. we compute the lower bound LB(Gi, Q) by Theorem 6. If LB(Gi, Q) > τ, then ged(Gi, Q) ≥ LB(Gi, Q) > τ and hence we filter Gi; otherwise, Gi becomes a candidate graph. B. Ordering Vertices in G For a candidate graph Gi, we need to compute ged(Gi, Q) In BSS GED, we use GenSuccr to generate successors. to verify it. The standard method is that we first compute However, we need to determine the processing order of ged(Gi, Q) and then determine Gi is a required graph or not vertices in G at first, i.e., (ui , . . . , ui ) (see Section II-C). 1 |VG| by judging ged(Gi, Q) ≤ τ. Incorporating τ with BSS GED, The most primitive way is to adopt the default vertex order we can further accelerate it as follows: First, we set the initial ? in G, i.e., (u1, . . . , u|VG|), which is used in A -GED [5] and upper bound ub as τ + 1 (line 1 in Alg. 3). Then, during the DF-GED [16]. However, this may be inefficient as it has not execution of BSS GED, when we reach to a leaf node r, if considered the structure relationship between vertices. the cost of r (i.e., g(r)) satisfies g(r) ≤ τ, then Gi must be a For vertices u and v in G such that e(u, v) ∈ EG, if u required graph and we stop running of BSS GED. The reason has been processed, then in order to early obtain the edit cost is that g(r) is an upper bound of GED and hence we know on e(u, v), we should process v as soon as possible. Hence, that ged(Gi, Q) ≤ g(r) ≤ τ. our policy is to traverse G in a depth-first order to obtain (ui , . . . , ui ). However, starting from different vertices to 1 |VG| Algorithm 4: DetermineOrder(G) traverse, we may obtain different orders. In section V-A, we have proposed the heuristic estimate 1 F [1..|VG|] ← false, order[] ← ∅, count ← 1; r r 2 rank ← sort vertices in VG according to the partial order ≺; function h(r), where an important part is LB(G2,Q2) pre- r 3 for i ← 1 to |VG| do sented in Theorem 6. 
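The filter-and-verify strategy of this section can be summarized in a few lines; `lb` and `ged_within` below are placeholders for the Theorem 6 bound and for BSS_GED started with ub = τ + 1 and stopped as soon as a leaf of cost at most τ is reached.

```python
# A sketch of the GED-based similarity search of Section VI (placeholder callables).

def similarity_search(database, query, tau, lb, ged_within):
    """Return the data graphs G_i with ged(G_i, query) <= tau."""
    answers = []
    for g in database:
        if lb(g, query) > tau:          # filtering: the lower bound already exceeds tau
            continue
        if ged_within(g, query, tau):   # verification: threshold-bounded BSS_GED
            answers.append(g)
    return answers
```

Setting the initial upper bound to τ + 1 lets the verifier stop early: it accepts as soon as any leaf of cost at most τ is reached, and rejects once the bounded search proves that no mapping of cost at most τ exists.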
As we know, the more structure G2 4 u ← rank[i]; r r r and Q2 keep, the tighter lower bound LB(G2,Q2) we may 5 if F [u] = false then obtain. As a result, we preferentially consider vertices with 6 DFS(u, F, rank, order, count) small degrees. This is because that when we first process those r r 7 return order vertices, the left unmapped parts G2 and Q2 could keep the procedure DFS(u, F, rank, order, count) structure as much as possible. 1 order[count] ← u, count ← count + 1,F [u] ← true; 2 N ← {v : v ∈ V ∧ e(u, v) ∈ E }; Definition 9 . For two vertices u and v u G G (Vertex Partial Order) 3 while |Nu| > 0 do in G, we define that u ≺ v if and only if du < dv or du = 4 v ← argminj {rank[j]: j ∈ Nu}; dv ∧ u < v. 5 Nu ← Nu\{v}; 6 if F [v] = false then In Algorithm 4, we give the method to compute the order 7 DFS(v, F, rank, order, count); (ui , . . . , ui ). First, we sort vertices to obtain a global 1 |VG| order array rank based on the partial order ≺ (line 2). Then, we call DFS to traverse G in a depth-first order (lines 3–6). In DFS, we sequentially insert u into order and then mark u VII.EXPERIMENTAL RESULTS as visited, i.e., set F [u] = true (line 1). Then, we obtain the set Nu of vertices adjacent to u (line 2). Finally, we select a In this section, we perform comprehensive experiments and smallest unvisited vertex v from Nu based on the partial order then analyse the obtained results. ≺, and then recursively call DFS to traverse the subtree rooted at v (lines 3–7). A. Datasets and Settings Example 6. For the graph G in Figure 4, we first com- We choose several real and synthetic datasets used in the pute rank = [u2, u1, u3, u5, u4]. Starting from u2, we tra- experiment, described as follows: A I D S - 1 0 K P R O T E I N - 3 0 0 S 1 K . E 3 0 . D 3 0 . L 2 0 1 0 0 1 0 0 1 0 0 ) ) ) % % % ( ( (
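A sketch of Algorithm 4 (our own encoding): vertices are ranked by the partial order of Definition 9 and the processing order is produced by a depth-first traversal that always follows the smallest-ranked unvisited neighbour. With the edge set of graph G in Figure 4 as reconstructed from Examples 5 and 6, it reproduces the order (u2, u4, u1, u3, u5).

```python
def determine_order(vertices, edges):
    """vertices: iterable of vertex ids; edges: iterable of frozenset({u, v})."""
    adj = {u: set() for u in vertices}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    # Definition 9: u < v iff (deg(u), u) < (deg(v), v); rank gives the global order.
    rank = {u: i for i, u in enumerate(sorted(vertices, key=lambda u: (len(adj[u]), u)))}
    order, visited = [], set()

    def dfs(u):
        order.append(u)
        visited.add(u)
        for v in sorted(adj[u], key=rank.get):   # smallest-ranked neighbour first
            if v not in visited:
                dfs(v)

    for u in sorted(vertices, key=rank.get):     # also covers disconnected parts
        if u not in visited:
            dfs(u)
    return order

# Graph G of Figure 4 (edges reconstructed from Examples 5 and 6).
V = ['u1', 'u2', 'u3', 'u4', 'u5']
E = [frozenset(p) for p in (('u1', 'u3'), ('u1', 'u4'), ('u2', 'u4'),
                            ('u3', 'u5'), ('u4', 'u5'))]
print(determine_order(V, E))   # ['u2', 'u4', 'u1', 'u3', 'u5'], as in Example 6
```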

Fig. 5: Effect of GenSuccr on the performance of BSS_GED (average solve ratio and average running time per query group on AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20).

• AIDS1. It is an antivirus screen compound dataset from available time and memory be 1000s and 24GB, respectively, the Development and Therapeutics Program in NCI/NIH, and then define the average solve ratio as follows: which contains 42687 chemical compounds. We generate P|D| P|T | slove(Di, Tj) labeled graphs from these chemical compounds and omit sr = i=1 j=1 . (8) Hydrogen atom as did in [13], [14]. |D| × |T | • PROTEIN2. It is a protein database from the Protein where slove(Di, Tj) = 1 if we obtain ged(Di, Tj) within both Data Bank, constituted of 600 protein structures. Vertices 1000s and 24GB, and slove(Di, Tj) = 0 otherwise. Obviously, represent secondary structure elements and are labeled sr should be as large as possible. with their types (helix, sheet or loop). Edges are labeled We have conducted all experiments on a HP Z800 PC with to indicate if two elements are neighbors or not. a 2.67GHz GPU and 24GB memory, running Ubuntu 12.04 • Synthetic. The synthetic dataset is generated by the syn- operating system. We implement our algorithm in C++, with 3 thetic graph data generator GraphGen . In the experiment, -O3 to compile and run. For BSS GED, we set the beam we generate a density graph dataset S1K.E30.D30.L20, width w = 15 for the sparse graphs in datasets AIDS-10K and which means that this dataset contains 1000 graphs; the PROTEIN-300, and w = 50 for the density graphs in dataset 4 average number of edges in each graph is 30; the density S1K.E30.D30.L20. of each graph is 30%; and the distinct vertex and edge labels are 20 and 5, respectively. B. Evaluating GenSuccr Due to the hardness of computing GED, existing methods, In this section, we evaluate the effect of GenSuccr on such as A?-GED [5], DF-GED [16] and CSI GED [4], cannot the performance of BSS GED. To make a comparison, we obtain GED of large graphs within a reasonable time and replace GenSuccr with BasicGenSuccr (i.e., Alg. 1) and memory. Therefore, for AIDS and PROTEIN, we exclude large then obtain BSS GEDb, where BasicGenSuccr is the basic graphs with more than 30 vertices as did in [1], and then method of generating successors used in A?-GED [5] and DF- randomly select 10000 and 300 graphs to make up the datasets GED [16]. In BSS GEDb, we also use the same heuristics AIDS-10K and PROTEIN-300. For S1K.E30.D30.L20, we use proposed in Section V. Figure 5 shows the average solve ratio the entire dataset. and running time. As suggested in [1], [4], for each dataset, we randomly se- As shown in Figure 5, the average solve ratio of BSS GED lect 6 query groups, where each group consists of 3 data graphs is much higher than that of BSS GEDb, and the gap be- having three consecutive graph sizes. Specifically, the number tween them becomes larger as the query graph size in- of vertices of each group is in the range: 6 ± 1, 9 ± 1, 12 ± creases. This indicates that GenSuccr provides more re- 1, 15 ± 1, 18 ± 1 and 21 ± 1. duction on the search space for larger graphs. Regarding the For the tested database D = {D1, D2,... } and query group running time, BSS GED achieves the respective 1x–5x, 0.4x– b T = {T1, T2,... }, we need to perform |D| × |T | times GED 1.5x and 0.1x–4x speedup over BSS GED on AIDS-10K, computation. For each pair of the GED computation, we set the PROTEIN-300 and S1K.E30.D50.L20. Thus, we create a small search space by GenSuccr. 1http://dtp.nci.nih.gov/docs/aids/aidsdata.html 2http://www.fki.inf.unibe.ch/databases/iam-graph-database/download-the- C. 
Evaluating BSS GED iam-graph-database 3http://www.cse.ust.hk/graphgen/ In this section, we evaluate the effect of beam-stack search 4the density of a graph G is defined as 2|EG| . and heuristics on the performance of BSS GED. We fix |VG|(|VG|−1) datasets AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20 as 2x and 9x speedup over Basic on AIDS-10K, PROTEIN- the tested datasets and select their corresponding groups 15±1 300 and S1K.E30.D30.L20. Moreover, compared with +h1, as the query groups, respectively. the running time needed by +h2 decreases 21%, 30% and (1). Effect of w 41% on AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20, As we know, beam-stack search achieves a flexible trade- respectively. Thus, the proposed two heuristics greatly boost off between available memory and expensive backtracking by the performance. setting different w, thus we vary w to evaluate its effect on the performance. Figure 6 shows the average solve ratio and D. Comparing with Existing GED Methods running time. In this section, we compare BSS GED with existing ? 1 0 0 4 0 0 methods A -GED [5], DF-GED [16] and CSI GED [4]. ) ) s (

Figure 8 shows the average solve ratio and running time.

By Figure 8, we know that BSS_GED performs the best in terms of average solve ratio. A*-GED cannot obtain the GED of graphs with more than 12 vertices within 24GB, and DF-GED cannot finish the GED computation of graphs with more than 15 vertices in 1000s. Besides, for the dense graphs in S1K.E30.D30.L20, the average solve ratio of CSI_GED drops sharply as the query graph size increases, which confirms that it is unsuitable for dense graphs.

Regarding the running time, BSS_GED still performs the best in most cases. DF-GED performs better than A*-GED, which is consistent with the previous results in [16]. Compared with DF-GED, BSS_GED achieves 50x–500x, 20x–2000x and 15x–1000x speedup on AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20, respectively. Though CSI_GED performs better than BSS_GED on AIDS-10K when the query graph size is less than 9, BSS_GED achieves 2x–5x speedup over CSI_GED when the graph size is greater than 12. Besides, for S1K.E30.D30.L20, BSS_GED achieves 5x–95x speedup over CSI_GED. Thus, BSS_GED is efficient for the GED computation on sparse as well as dense graphs.

Fig. 6: Effect of w on the performance of BSS_GED.

By Figure 6, we observe that the average solve ratio first increases and then decreases, achieving its maximum when w = 15 on AIDS-10K and PROTEIN-300, and w = 50 on S1K.E30.D30.L20. Several factors contribute to this trend: (1) when w is too small, BSS_GED may be trapped into a local suboptimal solution and hence produces lots of backtracking; (2) when w is too large, BSS_GED expands too many unnecessary nodes in each layer. Note that depth-first search is a special case of beam-stack search when w = 1. Thus, beam-stack search performs better than depth-first search. As previously demonstrated in [16], depth-first search performs better than best-first search. Therefore, we conclude that the beam-stack search paradigm outperforms the best-first and depth-first search paradigms for the GED computation.

(2). Effect of Heuristics

In this part, we evaluate the effect of the proposed two heuristics by injecting them one by one into the base algorithm. We use the term Basic for the baseline algorithm without any heuristics, +h1 for the algorithm obtained from Basic by incorporating the first heuristic (Section V-A), and +h2 for the algorithm obtained from +h1 by incorporating the second heuristic (Section V-B). Figure 7 plots the average solve ratio and running time.

Fig. 7: Effect of heuristics on the performance of BSS_GED.

By Figure 7, we know that the average solve ratio of Basic is only 15% of that of +h1, which means that the proposed heuristic function provides powerful pruning ability. Considering the running time, +h1 brings the respective 50x, 2x and 9x speedup over Basic on AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20. Moreover, compared with +h1, the running time needed by +h2 decreases by 21%, 30% and 41% on AIDS-10K, PROTEIN-300 and S1K.E30.D30.L20, respectively. Thus, the proposed two heuristics greatly boost the performance.

E. Performance Evaluation on Graph Similarity Search

In this part, we evaluate the performance of BSS_GED as a standard graph similarity search query method by comparing it with CSI_GED and GSimJoin [19]. For each dataset described in Section VII-A, we use the entire dataset and randomly select 100 graphs from it as query graphs. Figure 9 shows the total running time (i.e., the filtering time plus the verification time).

It is clear from Figure 9 that BSS_GED has the best performance in most cases, especially for large τ. GSimJoin cannot finish when τ ≥ 8 on AIDS and PROTEIN because of its huge memory consumption. Compared with GSimJoin for the τ values where it can finish, BSS_GED achieves the respective 1.6x–15000x, 3.8x–800x and 2x–3000x speedup on AIDS, PROTEIN and S1K.E30.D30.L20. Considering CSI_GED, it performs slightly better than BSS_GED when τ ≤ 4 on AIDS. However, BSS_GED performs much better than CSI_GED when τ ≥ 6 and the gap between them becomes larger as τ increases. Specifically, BSS_GED achieves 2x–28x, 1.2x–100000x and 1.1x–187x speedup over CSI_GED on AIDS, PROTEIN and S1K.E30.D30.L20, respectively. As previously discussed in [4], CSI_GED performs much better than the state-of-the-art graph similarity search query methods. Thus, we conclude that BSS_GED can efficiently finish the graph

E. Performance Evaluation on Graph Similarity Search

In this part, we evaluate the performance of BSS GED as a standard graph similarity search query method by comparing it with CSI GED and GSimJoin [19]. For each dataset described in Section VII-A, we use the entire dataset and randomly select 100 graphs from it as query graphs. Figure 9 shows the total running time (i.e., the filtering time plus the verification time).

Fig. 9: Performance comparison with CSI GED and GSimJoin. [Plots: total response time (s) of GSimJoin, CSI GED and BSS GED as the GED threshold τ varies, on AIDS, PROTEIN and S1K.E30.D30.L20.]

It is clear from Figure 9 that BSS GED has the best performance in most cases, especially for large τ. GSimJoin cannot finish when τ ≥ 8 on AIDS and PROTEIN because of its huge memory consumption. Compared with GSimJoin, for the τ values where it can finish, BSS GED achieves 1.6x–15000x, 3.8x–800x and 2x–3000x speedups on AIDS, PROTEIN and S1K.E30.D30.L20, respectively. Considering CSI GED, it performs slightly better than BSS GED when τ ≤ 4 on AIDS. However, BSS GED performs much better than CSI GED when τ ≥ 6, and the gap between them becomes larger as τ increases. Specifically, BSS GED achieves 2x–28x, 1.2x–100000x and 1.1x–187x speedups over CSI GED on AIDS, PROTEIN and S1K.E30.D30.L20, respectively. As previously discussed in [4], CSI GED performs much better than the state-of-the-art graph similarity search query methods. Thus, we conclude that BSS GED can efficiently finish the graph similarity search and runs much faster than existing methods.
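The query processing evaluated above follows the standard filter-and-verify pattern, spelled out in the sketch below. This is a schematic illustration rather than the evaluated system: lower_bound and ged_at_most are hypothetical callables standing in for a cheap filtering bound and for a threshold-aware exact GED verifier such as BSS GED.

```python
def similarity_search(query, database, tau, lower_bound, ged_at_most):
    """Return every graph g in the database with GED(query, g) <= tau."""
    # Filtering phase: discard graphs whose lower bound already exceeds tau.
    candidates = [g for g in database if lower_bound(query, g) <= tau]
    # Verification phase: run the exact, threshold-aware GED computation.
    return [g for g in candidates if ged_at_most(query, g, tau)]
```

The total response time reported in Figure 9 corresponds to the sum of the two phases, which is why a faster verifier pays off most at large τ, where many candidates survive filtering.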

VIII. RELATED WORKS

Recently, the GED computation has received considerable attention. A*-GED [5], [7] and DF-GED [16] are two major vertex-based mapping methods, which utilize the best-first and depth-first search paradigms, respectively. Provided that the heuristic function estimates a lower bound on the GED of the unmapped parts, A*-GED guarantees that the first complete mapping found induces the GED of the comparing graphs, which seems very attractive. However, A*-GED stores numerous partial mappings, resulting in huge memory consumption; as a result, it is only suitable for small graphs. To overcome this bottleneck, DF-GED performs a depth-first search, which only stores the partial mappings on the path from the root to a leaf node. However, it may easily be trapped in a locally suboptimal solution, leading to a large amount of expensive backtracking. CSI GED [4] is an edge-based mapping method based on common substructure isomorphism enumeration, which performs very well on sparse graphs. However, the edge-based search space of CSI GED is exponential in the number of edges of the comparing graphs, making it naturally unsuitable for dense graphs. Note that CSI GED only works for the uniform cost model; [1] generalized it to cover the non-uniform model.
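To make the memory contrast between the two vertex-based paradigms concrete, the textbook-style sketch below (cf. [6], [10]) juxtaposes their generic frontiers: best-first keeps every generated partial mapping in a priority queue, whereas depth-first keeps only the current path plus its untried siblings. It is not the A*-GED or DF-GED code; expand, f and is_goal are hypothetical callbacks.

```python
import heapq
from itertools import count

def best_first(root, expand, f, is_goal):
    tie = count()                            # break ties without comparing nodes
    frontier = [(f(root), next(tie), root)]  # priority queue ordered by f = g + h
    while frontier:                          # every generated node stays in memory
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child in expand(node):
            heapq.heappush(frontier, (f(child), next(tie), child))
    return None

def depth_first(root, expand, is_goal):
    stack = [root]                           # memory: one path plus its untried siblings
    while stack:
        node = stack.pop()
        if is_goal(node):
            return node
        stack.extend(expand(node))
    return None
```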
Another work closely related to this paper is the GED-based graph similarity search problem. Due to the hardness of computing GED, existing graph similarity search query methods [12], [14], [17], [18], [19], [20] all adopt the filter-and-verify scheme, that is, they first filter graphs to obtain a candidate set and then verify those candidate graphs. In the verification phase, most existing methods adopt A*-GED as their verifier. As discussed above, BSS GED greatly outperforms A*-GED, and hence it can also be used as a standard verifier to accelerate those graph similarity search query methods.

IX. CONCLUSIONS AND FUTURE WORK

In this paper, we present a novel vertex-based mapping method for the GED computation. First, we reduce the number of invalid and redundant mappings involved in the GED computation and thereby create a small search space. Then, we utilize beam-stack search to efficiently traverse the search space to compute GED, achieving a flexible trade-off between available memory and expensive backtracking. In addition, we give two efficient heuristics to prune the search space. However, it is still very hard to compute the GED of large graphs within a reasonable time. Thus, approximate algorithms that quickly compute suboptimal GED values are left as future work.

X. ACKNOWLEDGMENTS

The authors would like to thank Kaspar Riesen and Zeina Abu-Aisheh for providing their source files, and Karam Gouda and Xiang Zhao for providing their executable files. This work is supported in part by China NSF grants 61173025 and 61373044, and US NSF grant CCF-1017623. Hongwei Huo is the corresponding author.

REFERENCES
[1] D. B. Blumenthal and J. Gamper. Exact computation of graph edit distance for uniform and non-uniform metric edit costs. In GbRPR, pages 211–221, 2017.
[2] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. Int. J. Pattern Recogn., 18(03):265–298, 2004.
[3] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image Vision Comput., 27(7):950–959, 2009.
[4] K. Gouda and M. Hassaan. CSI GED: An efficient approach for graph edit similarity computation. In ICDE, pages 256–275, 2016.
[5] K. Riesen, S. Fankhauser, and H. Bunke. Speeding up graph edit distance computation with a bipartite heuristic. In MLG, pages 21–24, 2007.
[6] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern., 4(2):100–107, 1968.
[7] K. Riesen, S. Emmenegger, and H. Bunke. A novel software toolkit for graph edit distance computation. In GbRPR, pages 142–151, 2013.
[8] R. M. Marín, N. F. Aguirre, and E. E. Daza. Graph theoretical similarity approach to compare molecular electrostatic potentials. J. Chem. Inf. Model., 48(1):109–118, 2008.
[9] A. Robles-Kelly and E. R. Hancock. Graph edit distance from spectral seriation. IEEE Trans. Pattern Anal. Mach. Intell., 27(3):365–378, 2005.

[10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice-Hall, 2002.
[11] R. Zhou and E. A. Hansen. Beam-stack search: Integrating backtracking with beam search. In ICAPS, pages 90–98, 2005.
[12] G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing large sparse graphs for similarity search. IEEE Trans. Knowl. Data Eng., 24(3):440–451, 2012.
[13] D. W. Williams, J. Huan, and W. Wang. Graph database indexing using structured graph decomposition. In ICDE, pages 976–985, 2007.
[14] X. Chen, H. Huo, J. Huan, and J. S. Vitter. MSQ-Index: A succinct index for fast graph similarity search. arXiv preprint arXiv:1612.09155, 2016.
[15] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In SIGMOD, pages 335–346, 2004.
[16] Z. Abu-Aisheh, R. Raveaux, J. Y. Ramel, and P. Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In ICPRAM, pages 271–278, 2015.
[17] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou. Comparing stars: On approximating graph edit distance. PVLDB, 2(1):25–36, 2009.
[18] X. Zhao, C. Xiao, X. Lin, Q. Liu, and W. Zhang. A partition-based approach to structure similarity search. PVLDB, 7(3):169–180, 2013.
[19] X. Zhao, C. Xiao, X. Lin, and W. Wang. Efficient graph similarity joins with edit distance constraints. In ICDE, pages 834–845, 2012.
[20] W. Zheng, L. Zou, X. Lian, D. Wang, and D. Zhao. Efficient graph similarity search over large graph databases. IEEE Trans. Knowl. Data Eng., 27(4):964–978, 2015.