A Binary Linear Programming Formulation of the Graph Edit Distance
Total Page:16
File Type:pdf, Size:1020Kb
1200 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 28, NO. 8, AUGUST 2006 A Binary Linear Programming Formulation of the Graph Edit Distance Derek Justice and Alfred Hero, Fellow, IEEE Abstract—A binary linear programming formulation of the graph edit distance for unweighted, undirected graphs with vertex attributes is derived and applied to a graph recognition problem. A general formulation for editing graphs is used to derive a graph edit distance that is proven to be a metric, provided the cost function for individual edit operations is a metric. Then, a binary linear program is developed for computing this graph edit distance, and polynomial time methods for determining upper and lower bounds on the solution of the binary program are derived by applying solution methods for standard linear programming and the assignment problem. A recognition problem of comparing a sample input graph to a database of known prototype graphs in the context of a chemical information system is presented as an application of the new method. The costs associated with various edit operations are chosen by using a minimum normalized variance criterion applied to pairwise distances between nearest neighbors in the database of prototypes. The new metric is shown to perform quite well in comparison to existing metrics when applied to a database of chemical graphs. Index Terms—Graph algorithms, similarity measures, structural pattern recognition, graphs and networks, linear programming, continuation (homotopy) methods. æ 1INTRODUCTION TTRIBUTED graphs provide convenient structures for chemicals or medicines [6]. Various techniques have been Arepresenting objects when relational properties are of designed for processing the graphs for structural features and interest. Such representations are frequently useful in generating bit strings (referred to as fingerprints) based on the applications ranging from computer-aided drug design to presence or absence of such features [7]. Since the fingerprints machine vision. A familiar machine vision problem is to can be rapidly compared, some prescreening or clustering is recognize specific objects within an image. In this case, the often done based on these to eliminate all but the most similar image is processed to generate a representative graph based graphs to a given input graph. The remaining graphs may on structural characteristics, such a region adjacency graph then be compared to the query using a more sophisticated or a line adjacency graph, and vertex attributes may be (and more computationally demanding) method. Distance assigned according to characteristics of the region to which metrics based on the maximum common subgraph are each vertex corresponds [1]. This representative graph is frequently used in this role [8], [9]. then compared to a database of prototype or model graphs Although computing the maximum common subgraph in order to identify and classify the object of interest. Face (MCS) is no small task (indeed, it is an NP-Hard problem identification [2] and symbol recognition [3] are among the [10]), several graph distance metrics have been proposed that problems in machine vision where graphs have been use the size of the MCS. One such metric is given in [11], with a utilized recently. In this context, a reliable and speedy slight modification presented in [12]. A different MCS-based method for comparing graphs is important. Many heuristics metric is presented in [13] specifically for application to and simplifications have been developed and employed for chemical graphs. An alternate metric that uses the MCS along this purpose in a variety of applications [4]. with the minimum common supergraph has also been Comparing graphs in the context of graph database proposed [14]. For related structures, such as attributed trees, searching has also found a significant application in the it is often possible to derive distance metrics based on the pharmaceutical and agrochemical industries. The attributed maximum common substructure that operate in polynomial graphs of interest are so-called chemical graphs which are time [15]. An intimate relationship between graph compar- derived from chemical structure diagrams. Graphical repre- ison and graph (or subgraph) isomorphism is readily sentations are of great utility here because of the similar apparent in these examples because the MCS defines property principle, which states that molecules with similar subgraphs in the two graphs being compared that are structures will exhibit similar chemical properties [5]. Thus, isomorphic. Indeed, computing a graph distance metric often databases of these chemical graphs are often searched by requires the computation of some sort of isomorphism (also comparing with a query graph to aid in the design of new known as matching) between graphs. An exact graph isomorphism defines a mapping between the nodes of two attributed graphs so that their structures . The authors are with the Department of Electrical Engineering and (vertex attributes along with edges) exactly coincide. As one Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, might expect, this is also a challenging computational MI 48109-2122. E-mail: {justiced, hero}@umich.edu. problem, in general, although it has not been shown to be Manuscript received 5 Jan. 2005; revised 14 Oct. 2005; accepted 3 Jan. 2006; NP-Complete [10]. As with subgraph isomorphism, poly- published online 13 June 2006. Recommended for acceptance by A. Rangarajan. nomial time algorithms are available for certain restricted For information on obtaining reprints of this article, please send e-mail to: classes of graphs [16]. Algorithms for general graph iso- [email protected], and reference IEEECS Log Number TPAMI-0014-0105. morphism that are shown to be quite speedy in practice are 0162-8828/06/$20.00 ß 2006 IEEE Published by the IEEE Computer Society Authorized licensed use limited to: University of Michigan Library. Downloaded on May 5, 2009 at 11:46 from IEEE Xplore. Restrictions apply. JUSTICE AND HERO: A BINARY LINEAR PROGRAMMING FORMULATION OF THE GRAPH EDIT DISTANCE 1201 given in [17], [18]. Such algorithms typically take advantage recognition point of view have been presented. In [36], the EM of vertex attributes to efficiently prune a search tree algorithm is applied to assumed Gaussian mixture models for constructed for finding an isomorphism. Graphs obtained edit events in order to choose costs that enforce similarity (or from real objects are rarely isomorphic, however, so it is dissimilarity) between specific pairs of graphs in a training useful to consider inexact or error-correcting graph iso- set. An application for matching images based on their shock morphisms (ECGI) that allow for graphs to nearly (but not graphs [37] uses the tree edit distance algorithm in [34] and exactly) coincide [19]. As the name suggests, the lack of exact chooses edit costs based on local shape differences within isomorphism can be caused by measurement noise or errors shock graphs corresponding to similar images. In a chemical in a sample graph when compared to a model graph. On the graph recognition application, heuristics are used to choose other hand, when comparing two model graphs, one might the edit costs of a string edit distance between strings formed interpret such “errors” as capturing the essential differences from the maximal paths between vertices in the graphs [38]. between the two graphs. Related studies have also been done into the effectiveness of Error-correcting graph matching attempts to compute a weighting the presence or absence of certain substructures mapping between the vertices of two graphs so that they differently when comparing fingerprints derived from approximately coincide, realizing that the graphs may not be chemical graphs [39]. isomorphic. Many suboptimal approaches exist to tackle this In this paper, we provide a formulation of the graph edit problem [20], [21], [22], [23]. The adjacency matrix eigende- distance whereby error-correcting graph matching may be composition approach of [21] gives fast suboptimal results, performed by solving a binary linear program (BLP—that however, it is only applicable to graphs having adjacency is, a linear program where all variables must take values matrices with no repeated eigenvalues. Graphs with a low from the set f0; 1g). We first present a general framework degree of connectivity will often have adjacency matrices for computing the GED between attributed, unweighted with multiple zero eigenvalues. Heuristics are used in the graphs by treating them as subgraphs of a larger graph graduated assignment type methods of [22], [23] to signifi- referred to as the edit grid. It is argued that the edit grid cantly reduce the exponential complexity of the original need only have as many vertices as the sum of the total problem. These methods can be applied to very large graphs; number of vertices in the graphs being compared. We show however, they require several tuning parameters to which the that graph editing is equivalent to altering the state of the performance of the algorithm is quite sensitive. Unfortu- edit grid and prove that the GED derived in this way is a nately, no systematic method for choosing these parameters metric, provided the cost function for individual edit is provided. The linear programming approach of [20] gives operations is a metric. We then use the adjacency matrix good results in a reasonable amount of time for graphs having representation to formulate a binary linear program to solve the same number of vertices. The authors of [20] use the linear for the GED. Since solving a BLP is NP-Hard [10], we show program to minimize a matrix norm similarity metric. how to obtain upper and lower bounds for the GED in The graph edit distance (GED) is a convenient and logical polynomial time by using solution techniques for standard graph distance metric that arises naturally in the context of linear programming and the assignment problem [40]. error-correcting graph matching [19], [24], [25]. It can also be These bounds may be useful in the event that the problem viewed as an extension of the string edit distance [26].