Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling
Sousuke Takami and Akihiro Inokuchi School of Science and Technology, Kwansei Gakuin University, 2-1 Gakuen, Sanda, Hyogo, Japan
Keywords: Graph Edit Distance, Graph Relabeling, Graph Classification.
Abstract: The graph edit distance, a well-known metric for determining the similarity between two graphs, is commonly used for analyzing large sets of structured data, such as those used in chemoinformatics, document analysis, and malware detection. As computing the exact graph edit distance is computationally expensive, and may be intractable for large-scale datasets, various approximation techniques have been developed. In this paper, we present a method based on graph relabeling that is both faster and more accurate than the conventional approach. We use unfolded subtrees to denote the potential relabeling of local structures around a given vertex. These subtree representations are concatenated as a vector, and the distance between different vectors is used to characterize the distance between the corresponding graphs. This avoids the need for multiple calculations of the exact graph edit distance between local structures. Simulation experiments on two real-world chemical datasets are reported. Compared with the conventional technique, the proposed method gives a more accurate approximation of the graph edit distance and is significantly faster on both datasets. This suggests the proposed method could be applicable in the analysis of larger and more complex graph-like datasets.
1 INTRODUCTION tion, or the substitution of a vertex and edge in the graphs. The method based on the A? algorithm is a Graphs are one of the most natural means of repre- well-known technique for computing the exact graph senting structured data. For instance, a chemical com- edit distance (Hart et al., 1968). However, this method pound can be represented as a graph in which each cannot be applied to large graphs, because the prob- vertex corresponds to an atom, each edge corresponds lem of obtaining the exact graph edit distance between to a bond between two atoms, and the label of each two graphs is known to be NP-complete. vertex corresponds to the atom type. With recent im- To overcome this difficulty, a method that uses provements in system throughput, the need to analyze the minimum matching problem of a complete bipar- large numbers of graphs has arisen, and the topic of tite graph (V1,V2,E,w) has been proposed to com- graph mining has received considerable interest be- pute the approximate graph edit distance between the cause the knowledge present in structured data can graphs (Riesen, 2015). In a bipartite graph, V1 and be applied to various real-world datasets. For exam- V2 correspond to the vertices of g1 and g2, respec- ple, in cheminformatics, certain properties of chem- tively, and w is a function assigning values of the dis- ical compounds (e.g., mutagenicity or toxicity) can similarities between local structures around the ver- be identified by analyzing their structural informa- tices to edges in the bipartite. Given the complete tion, and in bioinformatics, the prediction of protein– bipartite graph, the matching problem returns a map- protein interactions is beneficial for drug discovery. ping from the vertices in g1 to vertices in g2 that is 3 When analyzing datasets of graphs, one of the solvable in O(b ), where b = max V1 , V2 . Com- most critical measures is the dissimilarity (or simi- puting the approximate graph edit distance{| | | using|} this larity) among the graphs. A representative measure method is computationally simpler than computing of the dissimilarity is the graph edit distance. The the exact graph edit distance. In this method, the graph edit distance d(g1,g2) between graphs g1 and local structures are important in obtaining as accu- g2 is defined as the minimum length of the sequence rate an approximate distance as possible. The lo- of edit operations needed to transform g1 into g2, cal structure around each vertex v is represented by where one edit operation includes the insertion, dele- star structures (Zeng et al., 2009), random walks
17
Takami, S. and Inokuchi, A. Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling. DOI: 10.5220/0006540000170026 In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 17-26 ISBN: 978-989-758-276-9 Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
from v (Gauz¨ ere` et al., 2014), and subgraphs induced · ¹ · ¹ ¹ »̊ »̋ by vertices reachable within h steps from v, called ½̊ limited-size subgraphs (Carletti et al., 2015). If com- ¸ · ¸ · ¸ · plex and/or large structures such as limited-size sub- graphs are used as local structures, the computation »̌ time required to assign values to edges in the bipar- º º º »̎ »̍ tite graph can be significant. In contrast, if simple ½̋ structures such as stars or walks are used, the ac- ¸ ¹ ¸ · ¸ · curacy of the approximated graph edit distance de- creases. A recently proposed method (Carletti et al., Figure 1: Sequence of edit operations for transforming g1 2015) measures the dissimilarity between local struc- into g2. tures around v1 in g1 and v2 in g2 with small limited- size subgraphs. The dissimilarity in this approach sume that only the vertices in the graphs have labels, is the exact graph edit distance. Thus, although this the methods in this paper can be applied to graphs method gives a relatively accurate approximate graph where both the vertices and edges have labels (Hido and Kashima, 2009). The vertices adjacent to vertex v edit distance between g1 and g2, it requires consid- are represented as N(v) = u (v,u) E . The aver- erable computation time to measure multiple exact { | ∈ } graph edit distances for the small subgraphs. age number of adjacent vertices is represented as d.A In this paper, we tackle the problem of measuring sequence of vertices from v to u is called a path, and the approximate graph edit distance between graphs its step refers to the number of edges on that path. A and propose an accurate and fast method. In the pro- path is said to be simple if and only if it does not have posed method, each of the complex local structures is repeating vertices. The paths discussed in this paper represented by a vector, which enables this method to are not always simple. be applied to complex and large local structures and The graph edit distance is one of the most repre- ensures fast computation. The proposed method re- sentative metrics to measure the dissimilarity between quires O(b2h( Σ + b) time to compute the approxi- graphs. The graph edit distance d(g1,g2) between mate graph edit| distance,| where Σ is the set of vertex graphs g1 and g2 is defined as the minimum length labels in the graphs. In addition, the method requires of the sequence of edit operations needed to trans- O( Σ b + b2) memory. form g1 into g2, where one edit operation includes the | | The remainder of this paper is organized as fol- insertion or deletion of a vertex/edge and the substi- lows. Section 2 formalizes the problem of measur- tution of a vertex label. Although the edit distance ing the graph edit distance and explains the existing was originally proposed for measuring the dissimilar- method based on the minimum matching problem of ity between two strings, the metric was extended to a complete bipartite graph. In Section 3, we propose graphs by introducing graph edit operations. an accurate and fast method for measuring the graph Figure 1 shows a certain sequence of edit oper- edit distance by representing local structures as vec- ations that consists of one deletion of vertex (e2), tors. In Section 4, we verify the computational effi- one insertion of edge (e4), one deletion of edge (e1), ciency of the proposed method and compare it with and two substitutions of labels (e3,e5). Computing the conventional method in terms of accuracy using the edit distance between g1 and g2 is equivalent to real-world datasets. Finally, we conclude the paper in searching for the minimum length of the sequence of Section 5. edit operations needed to transform g1 into g2. The method based on the A? algorithm is a well-known technique for computing the exact graph edit dis- tance (Hart et al., 1968). However, this method can- 2 PRELIMINARIES not be applied to large graphs, because the problem of obtaining the exact graph edit distance between This paper tackles the problem of computing the ap- two graphs is known to be NP-complete. To address proximate graph edit distance between graphs. First, this drawback, various methods for computing the ap- we define the terminology used to solve the problem. proximate graph edit distance have been proposed. An undirected graph is represented as g = (V,E,Σ,`), The graph edit distance is an important measure where V is a set of vertices, E V V is a set of for analyzing graphs, and is applicable to a wide ⊆ × edges, Σ = σ1,σ2, ,σΣ is a set of vertex labels, range of practical applications such as fingerprint au- and ` : V {Σ is a function··· that} assigns a label to each thentication (Choi and Kim, 2010), malware detec- vertex in→ the graph. Additionally, the set of vertices tion (Kinable and Kostakis, 2011), chemoinformat- in graph g is represented as V(g). Although we as- ics (Kashima et al., 2003), bioinformatics, and doc-
18 Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling
ument analysis (Wang et al., 2014). We explain three c denotes the cost of replacing a label of vertex • i j typical fundamental problems that use the analysis vi V1 with a label of vertex v j V2 and of graphs. First, given a set of graphs, we can ap- ∈ ∈ cεi denotes the cost of inserting a vertex vi into g1 ply methods for clustering the graphs to detect some • to edit g1 to g2. latent groups within them (Robles-Kelly and Han- cock, 2003). Second, given a set of graphs and a Additionally, we denote the cost of editing an edge (v ,v ) in g into an edge (v ,v ) in g by c((v ,v ) query graph gq, we can apply a similarity search to i j 1 i0 j0 2 i j (v ,v )). In this paper, we assume that only vertices→ the graphs (Yan et al., 2005). Third, given a set of i0 j0 in graphs have labels, and so c((v ,v ) (v ,v )) is graphs with class labels, we can apply a Support Vec- i j → i0 j0 tor Machine (SVM) and kernel functions to forecast the cost of inserting or deleting an edge. When g1 the class of a graph whose label is unknown (Kashima is edited to g2, the mapping between the two sets of vertices is denoted by a bijective function ϕ : V + et al., 2003). The graph edit distance is converted to 1 → similarity using the Gaussian kernel, which is defined V2. Then, the total cost of editing g1 to g2 via ϕ is as b 2 dist(g1,g2,ϕ) = ∑ ciϕ(i) d(g1,g2) i=1 k(g1,g2) = exp . (1) − 2σ2 b 1 b − + c (v ,v ) v ,v . Most machine learning and data mining algorithms ∑ ∑ i j ϕ(i) ϕ( j) (3) i=1 j=i+1 → that are designed to analyze d-dimensional vectors can be applied to graphs using this kernel. Therefore, the exact edit distance between g1 and g2 is d(g1,g2) = mindist(g1,g2,ϕ), (4) ϕ Φ ∈ 3 APPROXIMATE GRAPH EDIT where Φ is a set of all possible permutations of DISTANCE integers 1,2, ,b. Equation (4) is a type of quadratic assignment··· problem that is known to be NP- This section surveys a framework for computing the complete (Koopmans and Beckmann, 1957), although approximate graph edit distance between two graphs it reduces to the LSAP if Eq. (3) does not contain its using the linear sum assignment problem (LSAP). second term. The problem of computing the exact graph edit dis- Riesen et al. proposed an efficient algorithmic tance between graphs g1 = (V1,E1,Σ,`1) and g2 = framework that enables us to obtain the approximate (V2,E2,Σ,`2) is formalized as follows (Riesen and graph edit distance by omitting the second term of Bunke, 2009; Riesen, 2015): First, the set of vertices Eq. (3). The framework first solves V1 is extended to b ϕˆ = argmin ciϕ(i) (5) b a empty vertices ϕ Φ ∑ − ∈ i=1 + V = V1 ε1,ε2, ,εb a 1 ∪ { ··· − } and then obtains the approximate edit distance be- where V = a, V = b,z and we}| assume{ that V tween g1 and g2 by substituting ϕˆ into Eq. (3). Equa- | 1| | 2| | 1| ≤ V2 . The graph edit distance computation is even- tion (5) implies that each vertex vi in g1 should, as | | + tually performed on graphs g1 = (V1 ,E1,Σ,`1) and far as possible, be mapped to a vertex in g2 that g2 = (V2,E2,Σ,`2). Additionally, a cost matrix for has the same label as vi. Solving Eq. (5) is equiva- editing g1 to g2 is defined as lent to the minimum matching problem of a bipartite graph whose vertices are V + and V , and whose edge 1 2 b 1 2 weights are c in Eq. (2). Therefore, this problem 1 c c ··· c i j 11 12 ··· 1b is tractable in O(b3). However, because the structural 2 c21 c22 c2b . . . ···. . information of the two graphs is ignored and only ver- ...... tex labels are taken into account in the optimization C = a ca1 ca2 cab, (2) problem shown in Eq. (3), we do not always obtain an ··· + 1 cε1 cε2 cεb adequate mapping from V1 to V2. To overcome this . . . ···. . difficulty, the cost matrix in Eq. (2) are redefined to . . . .. . . . . . take account of the structural information as the ma- b a cε1 cε2 cεb trix C with elements − ··· ∗ c∗ = c(local(v ) local(v )), where i j i → j
19 ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
ing to the optimal bipartite graph matching. Finally, in Line 7, Algorithm 1 returns the approximate graph edit distance between graphs g1 and g2. This algorithm has a drawback in terms of com- putational efficiency, because it computes the exact c c edit distance multiple times. To overcome this, we propose a novel method for computing the approxi- Figure 2: Limited-size subgraphs induced by vertices mate graph edit distance more efficiently by compar- i j within h steps from vertex v1. ing structural information in gh and gh with a compu- i j tation time that is proportional to V(gh) and V(gh) . Algorithm 1: Approximate Edit Distance. | | | | Data: graphs g1 and g2, and h ˆ Result: approximate graph edit distance d 4 APPROXIMATE GRAPH EDIT 1 while V(g ) < V(g ) do | 1 | | 2 | 2 V(g ) V(g ) ε ; DISTANCE BASED ON 1 ← 1 ∪ { } 3 for (vi,v j) V(g1) V(g2) do RELABELING GRAPHS ∈ i j × 4 ci∗ j d(gh,gh); ← Given a graph g(h) = (V,E,Σ,`(h)), all labels of 5 ϕˆ LSAP(C∗); ← vertices in g(h) are updated to obtain another 6 dˆ dist(g ,g ,ϕˆ ); 1 2 graph g(h+1) = (V,E,Σ ,`(h+1)). We call this op- ← ˆ 0 7 return d; eration “relabeling,” and define it as `(h+1)(v) = r(v,N(v),`(h)). Representative methods based on re- where local(v ) is the local structure around a vertex i labeling include the Weisfeiler–Lehman Subtree Ker- v , and c(local(v ) local(v )) is the cost of edit- i i j nel (WLSK) (Shervashidze et al., 2011), Neighbor- ing local(v ) to local→(v ). Recently, various methods i j hood Hash Kernel (NHK) (Hido and Kashima, 2009), within this framework have been developed by repre- and Hadamard Code Kernel (HCK) (Kataoka and senting local structures as stars (Zeng et al., 2009), Inokuchi, 2016). The vertex labels of WLSK are rep- walks (Gauz¨ ere` et al., 2014), or limited-size sub- resented as strings and the relabeling of vertex v is de- graphs (Carletti et al., 2015). This enables a more fined as a string concatenation of the labels of N(v). accurate mapping between sets of vertices than that In NHK, the vertex labels are represented as fixed- of Eq. (5). length bit strings and relabeling v is defined in terms Carletti et al. (2015) introduced local(vi), a of logical operations such as XOR on the labels of limited-size subgraph induced by vertices reachable N(v). The labels of HCK are based on the Hadamard within h steps from vertex v1, as shown in Fig. 2. The i code, which is used in spread spectrum-based com- subgraph local(vi) is denoted by gh, and ci∗ j is the munication technologies, and relabeling v is defined i j exact edit distance between gh and gh, that is, ci∗ j = as a summation on the labels of N(v). d(gi ,g j ). d(gi ,g j ) is tractable for sufficiently small Figure 3 shows an example of the framework h h h h ( ) h, although applying the problem to large graphs is based on graph relabeling. Let g 0 be the original intractable, because the problem of computing the graph whose vertices have labels a, b, and c. Each ( ) graph edit distance is NP-complete. Additionally, be- of the labels is relabeled to obtain g 1 . Although cause the structural information contained in limited- the actual calculation depends on the method of re- size subgraphs is greater than that of walks and stars, labeling (e.g., NHK, WLSK, or HCK), the relabeling ( ) the graph edit distance based on limited-size sub- of v is commonly applied using v, N(v), and ` 0 (v). (0) graphs provides an accurate approximate graph edit In the center of Figure 3, ` (v1) = b is relabeled distance. into d using adjacent vertices v2 and v4. Therefore, (1) Algorithm 1 shows the pseudo-code for com- ` (v1) = d represents the characteristics of st(v1,1), puting an approximate graph edit distance between where st(v,1) is a tree of height 1 whose root and graphs g1 and g2. In Lines 1–2, the numbers of ver- leaves are v and N(v), respectively. The labels of v1 (1) tices in g1 and g2 are equalized. For every pair of and v3 in g are identical because st(v1,1) = st(v3,1) (0) vertices in V(g1) V(g2), the exact edit distance be- in g . × i j tween limited-size subgraphs gh and gh is measured Given a graph g and its vertex v, an unfolded sub- and set as the (i, j)-th element in the cost matrix C∗. tree st(v,h) of height h is defined recursively from the LSAP in Line 5 returns the optimum mapping accord- root of the tree:
20 Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling
Figure 3: Example of relabeling (g(0) g(1)). Figure 5: Example of relabeling in LAK. → (h) `L (v) represents the characteristics of st(v,h). In ad- dition, although the size of st(v,h) increases exponen- ( ) tially as h increases, the size of ` h (v) remains Σ . L | | (h) Therefore, retaining `L (v) in memory is tractable, c c although retaining st(v,h) is intractable. We show an example of relabeling in LAK in Fig- ure 5, assuming that Σ = 3 and relabeling is applied | | only once. Consider the graph g(0), whose vertices have labels (1,0,0), (0,1,0), and (0,0,1). We next relabel the graphs to obtain g(1). The label of ver- tex v in g(1) represents the distribution of labels con- st(v, ) v g(1) tained in 1 . For instance, the label of 5 in is (1) `L (v5) = (2,1,1), which indicates that there are two Figure 4: Derivation of an unfolded subtree of height 2. vertices labeled (1,0,0), one vertex labeled (0,1,0), and one vertex labeled (0,0,1). This distribution is the root of st(v,h) is the vertex v in g, equivalent to that of the labels contained in st(v ,1). • 5 each node in st(v,h), except for leaves, has chil- By relabeling the graphs h times, each • dren that are N(u) of the vertex u corresponding vertex v has h + 1 Σ -dimensional vectors (0) (1) (h) | | to the node, and `L (v),``L (v), ,``L (v). We concatenate these vectors to obtain··· an (h + 1) Σ -dimensional vector the height of st(v,h) is h. × | | • as Figure 4 shows an example of an unfolded subtree `(0)(v) `(1)(v) `(h)(v) st(v1,2) derived from the graph shown in Figure 2. A L L L ( ) ( ) ( ) ( ) ( ) ( ) vertex in the graph may appear in the unfolded subtree ` (v) = (` 0 , ,` 0 ,` 1 , ,` 1 , ,` h , ,` h ). (6) L 1 ··· Σ 1 ··· Σ ··· 1 ··· Σ (such as v1, v2, and v3), because multiple paths reach z }| | {| z }| | {| z }| | {| to the vertices from the root; this causes the size of i Given two graphs g1 and g2, if gh is isomorphic the tree to grow exponentially as h increases. j to g , where vi V(g1) and v j V(g2), then `L(vi) is The Label Aggregate Kernel (LAK) (Kataoka and h ∈ ∈ the same as `L(v j). Inokuchi, 2016) is another method based on this Based on this observation, we propose a novel framework. Next, we present a concrete definition of method for computing the difference between local (0) the relabeling operation in LAK. In LAK, `L (v) is a structures around vi and v j. Given two concatenated vector in Σ -dimensional space. If a vertex in a graph vectors ` (v ) and ` (v ), the distance between ` (v ) | | L i L j L i has a label σi from the set Σ = σ1,σ2, ,σ Σ , the and `L(v j) is defined as i-th element in the vector is 1 and{ the other··· elements| |} (h) 2 are 0. In LAK, ` (v) is defined as ci∗ j = `L(vi) `L(v j) (7) L − h (h) (h 1) (h 1) (t) (t ) 2 ` ` − ` − `L (v) = `L (v) + ∑ `L (u). = ∑ `L (vi) `L (v j) . (8) u N(v) t=0 − ∈
(h) Because Eq. (7) is the square of the Euclidean dis- The i-th element in ` (v) equals the frequency of oc- L tance between `L(vi) and `L(v j), the four axioms (h) currence of σi in st(v,h). Label `L (v), obtained by about this metric hold (non-negativity, identity of iteratively relabeling h times, has a distribution of la- indiscernibles, symmetry, and triangle inequality). bels that is reachable within h steps from v. Therefore, However, we need to discuss the identity of indis-
21 ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
2 and set as the (i, j)-th element in a square matrix C∗ i j = l (v ) − l (v ) = 0 d(gh , gh ) 0 L i L j in Line 6. In Lines 7–8, using the set of nonnega- (h) (h) tive integers Z, g1 and g2 are relabeled to obtain (h+1) (h+1) g1 and g2 , respectively. In Line 9, LSAP re- g i = g j l = l turns the mapping ϕˆ from V(g1) to V(g2) according h h L (vi ) L (v j ) to the optimal bipartite graph matching. The pro- cesses in Lines 5–10 are repeated H + 1 times. Fi- nally, the algorithm returns the minimum approximate Figure 6: Identity of indiscernibles. graph edit distance among dˆh. This algorithm runs in O(H( Σ b2 + Σ bd +b3)) time, because the computa- Algorithm 2: Proposed Method. tional| complexities| | | of Lines 6, 7, 9, and 10 are O( Σ ), Data: graphs g1 and g2, and H O( bd) O(b3) O(b2) | | ˆ Σ , , and , respectively. Because Result: approximate graph edit distance d d is| bounded| by b, the computational complexity of 1 while V(g ) < V(g ) do 2 | 1 | | 2 | Algorithm 2 becomes O(b H( Σ + b). If we com- 2 V(g ) V(g ) ε ; | | 1 ← 1 ∪ { } pute Eq. (7) in a straightforward manner, we require 3 C 0; O( Σ H) memory for each vertex of graphs g and g . ∗ ← | | 1 2 4 for h [0,H] do However, when we compute Lines 6, 7, and 8 in Algo- ∈ (τ) 5 for (vi,v j) V(g1) V(g2) do rithm 2, we do not require ` (v) for τ < h. Therefore, ∈ × 2 L 2h (h) (h) O( b + bd) 6 ci∗ j ci∗ j + γ `L (vi) `L (v j) ; Algorithm 2 requires Σ memory. ← − The notable difference| | between Algorithms 1 and Σ (h+1) 7 g1 (V(g1),E(g1 ),Z| |,` ); 2 is that the former requires the exact graph edit dis- ← Σ (h+1) 8 g (V(g ),E(g ),Z ,` ); tance to be computed, which is known to be an NP- 2 ← 2 2 | | 9 ϕˆ LSAP(C ); complete problem. Although the computation is con- ← ∗ 10 dˆh dist(g1,g2,ϕˆ ); ducted for small graphs, Algorithm 1 runs the compu- ← tation b2 times, which entails a significant computa- 11 return minh [0,H] dˆh; ∈ tional cost. In contrast, the proposed method does not require the exact graph edit distance to be computed, cernibles in more detail. As shown in Fig. 6, the replacing this with computations of the Euclidean dis- identity of indiscernibles for the Euclidean distance tance. The complexity of this operation is indepen- implies that ` (v ) = ` (v ) if and only if ` (v ) L i L j k L i − dent of the size of the graphs, which enables us to use ` (v ) 2 = 0. The above framework for relabel- L j k LSAP multiple times in our proposed method. The ing graphs is based on the 1-dimensional Weisfeiler– final output of the proposed method is selected from Lehman algorithm, which checks the isomorphism dˆh for 0 h H, with our method returning the most between two graphs. This algorithm is known to be a accurate≤ graph≤ edit distance. valid isomorphism test for almost all graphs (see (Cai In the proposed method, we used LAK to charac- et al., 1992) for examples of graphs that cannot be terize the local structure around each vertex. If we distinguished by the algorithm) (Shervashidze et al., use WLSK, HCK, or NHK, the difference between gi i j 2 h 2011). Therefore, if g = g , ` (v ) ` (v ) = 0 j h h L i L j and g is represented as only a binary value, rather always holds. In contrast, itsk converse− holdsk for al- h than in a quantitative form. The binary value repre- most all graphs. i j To control the effect of steps from a central vertex sents whether gh and gh are isomorphic or not. This of the local structure, we add to Eq. (6) a coefficient is why we use LAK in our proposed method. that changes exponentially as h increases:
`(0) h`(h) `L (v) γ `L (v) 5 EXPERIMENTAL EVALUATION ( ) ( ) ( ) ( ) ` (v,γ) = (` 0 , ,` 0 , ,γh` h , ,γh` h ). (9) L 1 ··· Σ ··· 1 ··· Σ z }| | {| z }| | {| The proposed method was implemented in Java. All Algorithm 2 shows the pseudo-code of the pro- experiments were conducted on an Intel Xeon E5- posed method for computing the approximate graph 2609 2.50 GHz computer with 32 GB memory run- edit distance between graphs g1 and g2. In Lines 1– ning Microsoft Windows 7. We used two real-world 2, the numbers of vertices in g1 and g2 are equalized. datasets. The first dataset, MUTAG (Debnath et al., This is repeated at most b times. For each pair of 1991), contains information on 188 chemical com- vertices in V(g1) V(g2), the Euclidean distance be- pounds and their class labels. The class labels are × (h) (h) tween two vectors `L (vi) and `L (v j) is measured binary values that indicate the mutagenicity of chem-
22 Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling
8
8
8 `Q]Q VR QJ0VJ 1QJ:C Figure 7: Conversion of a graph.
HQI]% : 1QJ 1IV `I VHa 8 Table 1: Summary of evaluation datasets. MUTAG ENZYMES U Number of graphs D 188 600 Figure 8: Average computation time for various H (MU- | | Maximum graph size 84 126 TAG). Average graph size 53.9 32.6 Number of labels Σ 12 3 | | 8 Number of classes 2 6 (class distribution) (126,63) (100,100,100,) 100,100,100) 8 Avg. degree of vertices 2.1 3.8 ical compounds. Each chemical compound is rep- 8 `Q]Q VR resented as an undirected graph where each vertex, QJ0VJ 1QJ:C edge, vertex label, and edge label corresponds to an atom, chemical bond, atom type, and bond type, re- 8 HQI]% : 1QJ 1IV `I VHa spectively. Because we assume that only the vertices in the graphs have labels, the chemical graphs are con- U verted using a previous method (Hido and Kashima, Figure 9: Average computation time for various H (EN- 2009), that is, an edge labeled with ` that is adjacent ZYMES). to vertices v and u in a chemical graph is replaced with a vertex labeled with ` that is adjacent to v and u with more than 95% of the computation time in the with unlabeled edges, as shown in Figure 7. The sec- conventional method taken up by computing the exact ond dataset, ENZYMES, contains information on 600 graph edit distance between local structures. There- proteins and their class labels. These are one of six la- fore, the conventional method can only take account bels denoting the six EC top-level classes (Schomburg of small local structures around vertices to compute et al., 2004). Table 1 summarizes the datasets. the approximate graph edit distance. In contrast, the computation time of the proposed method is propor- 5.1 Computational Efficiency tional to H, as discussed in the final part of Section 4. As the proposed method does not require the exact Figures 8 and 9 show the average computation time graph edit distance to be computed, we obtained re- required to compute the approximate graph edit dis- sults for up to H = 200 within a total computation tance for each pair of graphs in the MUTAG and EN- time of 3 hours, which indicates that the proposed ZYMES datasets for various H, respectively. These method takes account of large local structures around figures do not contain results of the exact graph edit vertices. Although the proposed method computes distance, because we could not compute the exact LSAP multiple times, the computational complexity graph edit distance between most of the graphs with of LSAP is O(b3), which is sufficiently tractable for more vertices than 30. The horizontal axes are log- the size of the graphs in the experimental datasets. arithmic. To plot the computation times of both From these results, we can confirm that the proposed methods for H = 0, note that the horizontal axes re- method is much faster than the conventional method fer to H + 1 rather than H. The result of the pro- for computing approximate graph edit distances. posed method for H = 0 is almost the same as that of the conventional method for H = 0, because they 5.2 Accuracy of the Graph Edit use the same local structures around vertices. As H increases, the computation time of the conventional Distance method aforementioned in Algorithm 1 (Carletti et al., 2015) increases dramatically. We could compute up Figures 10 and 11 show the average approximate to H = 2 within a total computation time of 3 hours, graph edit distance measured by the proposed method
23 ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
for each pair of graphs in the MUTAG and EN- ZYMES datasets, respectively. The approximate graph edit distance measured by the proposed method is not less than the exact graph edit distance, be- cause the exact graph edit distance is defined as the “minimum length” of the sequence of edit operations needed to transform g1 into g2 and the approximate graph edit distance measured by the proposed method Y is computed by Eq. (3). As H increases, the aver- Y age approximate graph edit distance decreases in both Y ]]`Q61I: V $`:]. VR1 R1 :JHV Y datasets, because large local structures around the ver- tices are used to map the vertices in one graph to 8 8 vertices in another graph. In addition, by tuning γ, we can obtain more accurate edit distances than with Figure 10: Average approximate graph edit distance for var- γ = 1. In Figure 12, each point (dˆc,dˆp) shows the ious H and γ (MUTAG). approximate graph edit distances dˆc and dˆp measured by the conventional and proposed methods, respec- tively, for a pair of graphs in the MUTAG dataset. There are many points above the red line, which sug- gests that most of the approximate graph edit dis- tances measured by the proposed method are more accurate than those given by the conventional method, ˆ ˆ because dc > dp. For the MUTAG dataset, the average approximate graph edit distances of the conventional and proposed methods are 78.5 and 82.4, respectively. Y Y Y Y Figure 13 for ENZYMES indicates the same tendency ]]`Q61I: V $`:]. VR1 R1 :JHV as Figure 12. For the ENZYMES dataset, the average 8 8 approximate graph edit distances of the conventional and proposed methods are 115.8 and 129.0, respec- tively. One of the reasons why the proposed method Figure 11: Average approximate graph edit distance for var- computes accurate approximate graph edit distances ious H and γ (ENZYMES). is that it very efficiently takes account of large local structures around vertices, which enables the mini- this experiment is defined as mum approximate graph edit distance to be selected ˆ 2 ˆ ˆ ˆ d(g1,g2) from among d0,d1, ,dH . k(g ,g ) = exp , (10) ··· 1 2 2 In general, there is a trade-off between fast com- − 2σ ! putation and accuracy. However, as shown in the pre- vious subsection and here, the proposed method is where dˆ(g1,g2) is the approximate graph edit dis- much faster and more accurate than the conventional tance between g1 and g2 measured by the conven- method. tional and proposed methods1. To learn from the kernel matrices generated by the above graph kernel, 5.3 Application of Graph Edit Distance we used the LIBSVM package with 10-fold cross- validation (Chang and Lin, 2001). To validate the applicability of the graph edit dis- Figures 14 and 15 show the classification accu- tance, we compared the accuracy of the predictions racy of the conventional and proposed methods, as given by the conventional and proposed methods. The well as that of WLSK (Shervashidze et al., 2011), for graph classification problem is defined as follows. various H using the MUTAG and ENZYME datasets. Given a set of n training examples D = (gi,yi) The maximum accuracy of the proposed method is (i = 1,2, ,n), where each example is a pair{ consist-} greater than that of the conventional method, because ··· ing of a labeled graph gi and the class yi +1, 1 the proposed method computes accurate approximate to which it belongs, the objective is to learn∈ a{ function− } graph edit distances, as shown in the previous section. f that correctly predicts the classes of the test exam- In addition, the maximum accuracy of the proposed ples. In this experiment, graphs are classified by an 1 SVM using a graph kernel. The graph kernel used in Their experimental results are represented with “Pro- posed” and “Conventional” in Figs. 14 and 15.
24 Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling
`Q]QVR QJ0VJ 1QJ:C
HQJ0VJ 1QJ:CIV .QR C:1`1H: 1QJ HH%`:H7 `Qa
]]`Q8$`:].VR1 R1 :JHV`Q` .V Figure 14: Prediction accuracy for various H (MUTAG). ]]`Q8 $`:]. VR1 R1 :JHV `Q` .V ]`Q]QVR IV .QR `Q]QVR Figure 12: Approximate graph edit distance for H = 2 (MU- QJ0VJ 1QJ:C TAG).
`Qa
C:1`1H: 1QJ HH%`:H7 Figure 15: Prediction accuracy for various H (ENZYMES). HQJ0VJ 1QJ:CIV .QR Simulation experiments on two real-world chemical ]]`Q8$`:].VR1 R1 :JHV`Q` .V datasets were reported. Compared with the conven- tional technique, the proposed method gave a more accurate approximation of the graph edit distance and is significantly faster on both datasets. This suggested ]]`Q8 $`:]. VR1 R1 :JHV `Q` .V ]`Q]QVR IV .QR the proposed method could be applicable in the anal- ysis of larger and more complex graph-like datasets. Figure 13: Approximate graph edit distance for H = 2 (EN- ZYMES). method is comparative with that of WLSK, which is REFERENCES one of the most representative graph kernels. Cai, Jin-yi, Furer,¨ Martin, and Immerman, Neil. 1992. An Optimal Lower Bound on the Number of Variables for Graph Identification. Combinatorica, 12(4), 389–410. 6 CONCLUSION Carletti, Vincenzo, Gauz¨ ere,` Benoit, Brun, Luc, and Vento, Mario. 2015. Approximate Graph Edit Distance Com- putation Combining Bipartite Matching and Exact As computing the exact graph edit distance is compu- Neighborhood Substructure Distance. In Interna- tationally expensive, and may be intractable for large- tional Workshop on Graph Based Representations in scale datasets, in this paper, we proposed a method Pattern Recognition (GbRPR), 188–197. based on graph relabeling that is both faster and more Chang, Chih-Chung, and Lin, Chih-Jen. 2001. LIBSVM: A accurate than the conventional approach. We used un- library for Support Vector Machines. Available online folded subtrees to denote the potential relabeling of at http://www.csie.ntu.edu.tw/cjlin/libsvm. local structures around a given vertex. These subtree Choi, Yeonjoo, and Kim, Gyeonghwan. 2010. Graph-based representations are concatenated as a vector, and the Fingerprint Classification using Orientation Field in distance between different vectors is used to charac- Core Area. IEICE Electronic Express, 7(17), 1303– terize the distance between the corresponding graphs. 1309. This avoids the need for multiple calculations of the Debnath, Asim Kumar, Lopez de Compadre, Rosa L., Deb- exact graph edit distance between local structures. nath, Gargi, Shusterman, Alan J., and Hansch, Cor-
25 ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
win. 1991. Structure-Activity Relationship of Mu- the ACM SIGMOD International Conference on Man- tagenic Aromatic and Heteroaromatic Nitro Com- agement of Data (SIGMOD), 766–777. pounds. Correlation with Molecular Orbital Energies Zeng, Zhiping, Tung, Anthony K. H. Tung, Wang, Jiany- and Hydrophobicity. Journal of Medicinal Chemistry, ong, Feng, Jianhua, and Zhou, Lizhu. 2009. Com- 34, 786–797. paring Stars: On Approximating Graph Edit Distance. Gauz¨ ere,` Benoit, Bougleux, Sebastien,´ Riesen, Kaspar, and In Proc. of International Conference on Very Large Brun, Luc. 2014. Approximate Graph Edit Distance Databases (PVLDB), 2(1), 25–36. Guided by Bipartite Matching of Bags of Walks. In Proc. of International Workshop on Structural and Syntactic Pattern Recognition (SSPR), 73–82. Hart, Peter E., Nilsson, Nils J., and Raphael, Bertram. 1968. APPENDIX A Formal Basis for the Heuristic Determination of Minimum Cost Paths. Journal of IEEE Transaction The Label Aggregate Kernel (LAK) was designed to on Systems Science and Cybernetics, 4(2), 100–107. measure the similarity between two graphs g1 and g2 Hido, Shohei and Kashima, Hisashi. 2009. A Linear-Time to enable the application of a Support Vector Machine Graph Kernel. In Proc. of International Conference (SVM). The kernel is defined as on Data Mining (ICDM), 179–188. h Kashima, Hisashi, Tsuda, Koji, and Inokuchi, Aki- (t) (t) hiro. 2003. Marginalized Kernels Between Labeled k(g1,g2) = ∑ ∑ ∑ δ(`L (vi),``L (v j)), Graphs. In Proc. of International Conference on Ma- t=0 vi V(g1) v j V(g2) chine Learning (ICML), 321–328. ∈ ∈ Kataoka, Tetsuya and Inokuchi, Akihiro. 2016. Hadamard where δ is the Kronecker delta. In contrast, in this pa- Code Graph Kernels for Classifying Graphs. In Proc. per, LAK has been used to measure the approximate of International Conference on Pattern Recognition graph edit distance corresponding to the dissimilarity Applications and Methods (ICPRAM), 24–32. between two graphs g1 and g2. Kinable, Joris, and Kostakis, Orestis. 2011. Malware Clas- sification based on Call Graph Clustering. Journal in Computer Virology, 7(4), 233–245. Koopmans, Tjalling C. and Beckmann, Martin. 1957. As- signment Problems and the Location of Economic Ac- tivities. Econometrica, 25(1), 53–76. Riesen, Kaspar and Bunkle, Horst. 2009. Approximate Graph Edit Distance Computation by Means of Bipar- tite Graph Matching. Image Vision Computing, 27(7), 950–959. Riesen, Kaspar. 2015. Structural Pattern Recognition with Graph Edit Distance: Approximation Algorithms and Applications. Advances in Computer Vision and Pat- tern Recognition, Springer. Robles-Kelly, Antonio, and Hancock, Edwin R. 2003. Edit Distance From Graph Spectra. In Proc. of Interna- tional Conference on Computer Vision (ICCV), 234– 241. Schomburg, Ida, Chang, Antje, Ebeling, Christian, Gremse, Marion, Heldt, Christian, Huhn, Gregor, and Schom- burg, Dietmar. 2004. BRENDA, the Enzyme Database: Updates and Major New Developments. Nucleic Acids Research, 32D, 431–433. Shervashidze, Nino, Schweitzer, Pascal, Jan van Leeuwen, Erik, Mehlhorn, Kurt, and Borgwardt, Karsten M.. 2011. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research (JMLR), 2539–2561. Wang, Peng, Eglin, Veronique,´ Garcia, Christophe, Larg- eron, Christine, Llados,´ Josep, and Fornes,´ Alicia. 2014. A Coarse-to-Fine Word Spotting Approach for Historical Handwritten Documents Based on Graph Embedding and Graph Edit Distance. In Proc of Inter- national Conference on Pattern Recognition (ICPR), 3074–3079. Yan, Xifeng, Yu, Philip S., and Han, Jiawei. 2005 Substruc- ture Similarity Search in Graph Databases. In Proc. of
26