
Computing Optimal Assignments in Linear Time for Approximate Graph Edit Distance

Nils M. Kriege∗, Pierre-Louis Giscard†, Franka Bause∗, Richard C. Wilson‡
∗Department of Computer Science, TU Dortmund University, Dortmund, Germany {nils.kriege, franka.bause}@tu-dortmund.de
†LMPA Joseph Liouville, Université Littoral Côte d'Opale, Calais, France [email protected]
‡Department of Computer Science, University of York, York, United Kingdom [email protected]

Abstract—Finding an optimal assignment between two sets of objects is a fundamental problem arising in many applications, including the matching of 'bag-of-words' representations in natural language processing and computer vision. Solving the assignment problem typically requires cubic time and its pairwise computation is expensive on large datasets. In this paper, we develop an algorithm which can find an optimal assignment in linear time when the cost function between objects is represented by a tree distance. We employ the method to approximate the edit distance between two graphs by matching their vertices in linear time. To this end, we propose two tree distances, the first of which reflects discrete and structural differences between vertices, and the second of which can be used to compare continuous labels. We verify the effectiveness and efficiency of our methods using synthetic and real-world datasets.

Index Terms—assignment problem, graph matching, graph edit distance, tree distance

I. INTRODUCTION

Vast amounts of data are now available for machine learning, including text documents, images, graphs and many more. Learning from such data typically involves computing a similarity or distance function between the data objects. Since many of these datasets are large, the efficiency of the comparison methods is critical. This is particularly challenging since taking the structure adequately into account is often a hard problem. For example, no polynomial-time algorithms are known even for the basic task of deciding whether two graphs have the same structure [1]. Therefore, the pragmatic approach of describing such data with a 'bag-of-words' or 'bag-of-features' is commonly used. In this representation, a series of objects are identified in the data and each object is described by a label or feature. The labels are placed in a bag, where the order in which they appear does not matter.

In the most basic form, such bags can be represented by histograms or feature vectors, two of which are compared by counting the number of co-occurrences of a label in the bags. This is a common approach not only for images and text, but also for graph comparison, where a number of graph kernels have been proposed which use different substructures as elements of the bag [2, 3]. In the more general case, two bags of features are compared by summing over all pairs of features weighted by a similarity function between the features. However, in both cases, the method is not ideal, since each feature corresponds to a specific element in the data object, and so can correspond to no more than one element in a second data object. The co-occurrence counting method allows each feature to match to multiple features in the other dataset.

A different approach is to explicitly find the best correspondence between the features of two bags. This can be achieved by solving the (linear) assignment problem in O(n^3) time using Hungarian-type algorithms [4]. When we assume weights to be integers within the range [0, N], scaling algorithms become applicable, such as [5], which requires O(n^2.5 log N) time. Several authors studied a geometric version of the problem, where the objects are points in R^d and the total distance is to be minimised. No subquadratic exact algorithm for this task is known, but efficient approximation algorithms exist [6]. This is also the case for various other problem variants, see [7] and references therein. These algorithms are typically involved and focus on theoretical guarantees. For practical applications, often standard algorithms with cubic running time, simple greedy strategies or methods specifically designed for one task are used.

We briefly summarise the use of assignment methods for comparing structured data in machine learning with a focus on graphs. While most kernels for such data are based on co-occurrence counting, there has been growing interest in deriving kernels from optimal assignments in recent years. The pyramid match kernel was proposed to approximate correspondences between bags of features in R^d by employing a space-partitioning tree structure and counting how often points fall into the same bin [8]. For graphs, the optimal assignment kernel was proposed, which establishes a correspondence between the vertices of two graphs using the Hungarian algorithm [9]. However, it was soon realised that this approach does not lead to valid kernels in the general case [2]. Therefore, Johansson and Dubhashi [10] derived kernels from optimal assignments by first sampling a fixed set of so-called landmarks and representing graphs by their optimal assignment similarities to landmarks. Kriege et al. [11] demonstrated that a specific choice of a weight function (derived from a hierarchy) does in fact generate a valid kernel

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. for the optimal assignment method and allows computation in representing cost functions: (i) based on Weisfeiler-Lehman linear time. The approach is not designed to actually construct refinement to quantify discrete and structural differences, the assignment. (ii) based on hierarchical clustering for continuous vertex The term graph matching refers to a diverse set of tech- attributes. We show experimentally that our linear time as- niques which typically establish correspondences between the signment algorithm scales to large graphs and datasets. Our vertices (and edges) of two graphs [12, 13]. This task is closely approach outperforms both exact and approximate methods related to the classical NP-hard maximum common subgraph for computing the graph edit distance in terms of running problem, which asks for the largest graph that is contained as time and provides state-of-the-art classification accuracy. For subgraph in two given graphs [1]. Exact algorithms for this some datasets with discrete labels our method even beats these problem haven been studied extensively, e.g., for applications approaches in terms of accuracy. in cheminformatics [14], where small molecular graphs with about 20 vertices are compared. The term network alignment II.FUNDAMENTALS is commonly used in bioinformatics, where large networks We summarise basic concepts and results on graphs, tree with thousands of vertices are compared such as protein- distances, the assignment problem and the graph edit distance. protein interaction networks. Methods applicable to such graphs typically cannot guarantee that an optimal solution is A. Graph theory found and often solve the assignment problem as a subroutine, An undirected graph G = (V,E) consists of a finite set e.g., [15, 16]. In the following we will focus on the graph V (G) = V of vertices and a finite set E(G) = E of edges, edit distance, which is one of the most widely accepted where each edge connects two distinct vertices. We denote approaches to graph matching with applications ranging from an edge connecting a vertex u and a vertex v by uv or vu, cheminformatics to computer vision [17]. It is defined as the where both refers to the same edge. Two vertices u and v are minimum cost of edit operations required to transform one said to be adjacent if uv ∈ E and referred to as endpoints graph into another graph. The concept has been proposed of the edge uv. The vertices adjacent to a vertex v in G are for pattern recognition tasks more than 30 years ago [18]. denoted by NG(v) = {u ∈ V (G) | uv ∈ E(G)} and referred However, its computation is NP-hard, since it generalises the to as neighbours of v.A path of length n is a sequence of maximum common subgraph problem [19]. The graph edit vertices (v0, . . . , vn) such that vivi+1 ∈ E for 0 ≤ i < n.A distance is closely related to the notoriously hard quadratic weighted graph is a graph G endowed with a weight function assignment problem [20]. Recently several elaborated exact al- w : E(G) → R. 
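As a minimal illustration of the graph definitions above, the following Python sketch represents an undirected weighted graph by an adjacency map and evaluates the path-length notion used next. It is not the authors' implementation (their experiments are in Java), and all identifiers are illustrative only.

```python
from collections import defaultdict

class WeightedGraph:
    """Undirected graph with a weight function w: E(G) -> R."""

    def __init__(self):
        self.adj = defaultdict(dict)          # adj[u][v] = w(uv)

    def add_edge(self, u, v, weight=1.0):
        self.adj[u][v] = weight               # uv and vu denote the same edge
        self.adj[v][u] = weight

    def neighbours(self, v):
        return set(self.adj[v])               # N_G(v)

    def path_length(self, path):
        """Sum of the edge weights along a path (v_0, ..., v_n)."""
        return sum(self.adj[u][v] for u, v in zip(path, path[1:]))
```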
The length of a path in a weighted graph gorithms for computing the graph edit distance have been pro- refers to the sum of weights of the edges contained in the path. posed [21, 22, 23]. Binary formulations A graph G0 = (V 0,E0) is a subgraph of a graph G = (V,E), in combination with highly-optimised general purpose solvers written G0 ⊆ G, if V 0 ⊆ V and E0 ⊆ E. Let V 0 ⊆ V , are among the most efficient approaches, but are still limited to then G0 = (V 0,E0) with E0 = {uv ∈ E | u, v ∈ V 0} is small graphs [22]. Even a restricted special case of the graph said to be the subgraph induced by V 0 in G and is denoted edit distance is APX-hard [24], i.e., there is a constant c > 1, by G[V 0]. An isomorphism between two graphs G and H is such that no polynomial-time algorithm can approximate it a bijection ψ : V (G) → V (H) such that uv ∈ E(G) ⇔ within the factor c, unless P=NP. However, heuristics based ψ(u)ψ(v) ∈ E(H) for all u, v ∈ V (G). Two graphs G and H on the assignment problem turned out to be effective tools [25] are said to be isomorphic if an isomorphism between G and and are widely used in practice [17]. The original approach H exists. An automorphism of a graph G is an isomorphism requires cubic running time, which is still not feasible for ψ : V (G) → V (G). large graphs. Therefore, it has been proposed to use non-exact algorithms for solving the assignment problem. Simple greedy B. Tree metrics and ultrametrics algorithms reduce the running time to O(n2) [26, 27]. For A dissimilarity function d : X × X is a metric on X , if it is large graphs the quadratic running time is still problematic in (i) non-negative, (ii) symmetric, (iii) zero iff two objects are practice. Moreover, many applications require a large number equal and (iv) satisfies the triangle inequality. A metric d on of distance computations, e.g., for solving classification tasks X is an ultrametric if it satisfies the strong triangle inequality or performing similarity search in graph databases [28]. d(x, y) ≤ max{d(x, z), d(z, y)} for all x, y, z ∈ X . A metric Our contribution: We develop a practical algorithm, which d on X is a tree metric if d(x, y) + d(v, w) ≤ max{d(x, v) + solves the assignment problem exactly in linear time when the d(y, w), d(x, w) + d(y, v)} for all v, w, x, y ∈ X . cost function is a tree metric. A tree metric can be represented These restricted classes of distances can equivalently be compactly by a weighted tree of linear size, which we use defined in terms of path lengths in weighted trees, which for constructing an optimal assignment. We show how to has been investigated in detail, e.g., in phylogentics [29]. embed sets of objects in an `1 space preserving the optimal A weighted tree T with positive real-valued edge weights assignment costs. In order to demonstrate that our approach w : E(T ) → R>0 represents the distance d : V (T ) × P is—despite its simplicity—suitable for challenging real-world V (T ) → R≥0 defined as d(T,w)(u, v) = e∈P (u,v) w(e), problems, we use it for approximating the graph edit distance. where P (u, v) denotes the unique path from u to v, for all To this end, we propose two techniques for generating trees u, v ∈ V (T ). For every ultrametric d on X there is a rooted tree T with leaves X and positive real-valued edge weights, In order to obtain a meaningful measure of dissimilarity for such that (i) d is the path length between leaves in T , (ii) all graphs, a cost function must be tailored to the particular paths from any leaf to the root have equal length. 
For every attributes that are present in the considered graphs. tree metric d on X there is a tree T with X ⊆ V (T ) and Approximating the graph edit distance by assignments: positive real-valued edge weights, such that d corresponds to Computing the graph edit distance is an NP-hard problem the path lengths in T . Note that an ultrametric always is a and solving practical instances by exact approaches is often tree metric. For the clarity of notation we distinguish between not feasible. Therefore, Riesen and Bunke [25] proposed to the elements of X and the nodes of a tree T by introducing derive a suboptimal edit path between graphs from an optimal an injective map ϕ : X → V (T ). We will refer to both a assignment of their vertices, where the assignment costs also label u ∈ X and the associated node by the same letter, with encode the local edge structure. For two graphs G and H the meaning clear from the context. We consider the distance with vertices U = {u1, . . . , un} and V = {v1, . . . , vm}, an d : X × X → R≥0 defined as d(x, y) = d(T,w)(ϕ(x), ϕ(y)). assignment cost matrix C is created according to c c ··· c c ∞ · · · ∞  C. The assignment problem 1,1 1,2 1,m 1,  .. .  c2,1 c2,2 ··· c2,m ∞ c2, . .  The assignment problem is a well-studied classical combi-    ......  natorial problem [4]. Given a triple (A, B, c), where A and  ...... ∞    B are sets of distinct objects with |A| = |B| = n and cn,1 cn,2 ··· cn,m ∞ · · · ∞ cn, C =   , (1) c : A × B → R a cost function, the problem asks for a one- c,1 ∞ · · · ∞ 0 0 ··· 0    to-one correspondence M between A and B with minimum  ......  P  ∞ c,2 . . 0 0 . .  costs. The cost of M is c(M) = (a,b)∈M c(a, b). Assuming    ......  an arbitrary, but fixed ordering of the elements of A and B,  . .. .. ∞ . .. .. 0  an assignment instance can also be given by a cost matrix ∞ · · · ∞ c,m 0 ··· 0 0 n×n C ∈ R , where ci,j = c(ai, bj) and ai is the ith element where the entries are estimations of the cost for substituting, of A and bj is the jth element of B. Note that in this case the input C is of size Θ(n2), where otherwise the input size deleting and inserting vertices in G. In more detail, the entry depends on the representation of the cost function c. The ci, is the cost for deleting ui increased by the costs for assignment problem is equivalent to finding a minimum weight deleting the edges incident to ui. The entry c,j is the cost for perfect matching in a complete on the two sets inserting vj and all edges incident to vj. Finally ci,j is the cost A and B with edge weights according to c. Unless C is sparse made up of the cost for substituting the vertex ui by vj and the or contains only integral values from a bounded interval, the cost of an optimal assignment between the incident edges w.r.t. best known algorithms require cubic time, which is achieved the edge substitution, deletion and insertion costs. An optimal by the well-known Hungarian method. assignment for C allows to derive an edit path between G and Here we consider the assignment problem for two sets of H. Its cost is not necessarily minimum possible, but Riesen objects where the objects are labelled by elements of X . There and Bunke [25] show experimentally that this procedure leads exists a labelling ` : A ∪ B → X associating each object a ∈ to a sufficiently good approximation of the graph edit distance A ∪ B with a label `(a) ∈ X . Furthermore, we may associate for many real-world problems. objects with tree nodes using the map % = ϕ · `. 
We then have The costs derived by Riesen and Bunke [25] are directly related to edit costs of various operations on the graph, but c(a, b) = d(`(a), `(b)) = d(T,w)(%(a),%(b)). We denote by c unfortunately these costs are not suitable for our optimal DOA(A, B) the cost of an optimal assignment between A and B, and the assignment problem as the quadruple (A, B, T, %). assignment strategy, which must utilise a tree metric. For this reason, we use a different set of costs, described in Section IV. D. The graph edit distance The optimal assignment is recovered using the method de- scribed in the next section. This assignment then induces an The graph edit distance measures the minimum cost re- edit path which is used to compute a good approximation to quired to transform a graph into another graph by adding, the edit distance. deleting and substituting vertices and edges. Each edit oper- ation o is assigned a cost c(o), which may depend on the III.OPTIMALASSIGNMENTSUNDERATREEMETRIC attributes associated with the affected vertices and edges. A We consider the assignment problem under the assumption sequence of k edit operations (o1, . . . , ok) that transforms a that the costs are derived from a tree metric and propose an graph G into another graph H is called an edit path from G efficient algorithm for constructing a solution. For a dataset of to H. We denote the set of all possible edit paths from G to sets of objects we obtain a distance-preserving embedding of H by Υ (G, H). Let G and H be attributed graphs, the graph the pairwise optimal assignment costs into an `1 space. edit distance from G to H is defined by A. Structural results ( k ) X (A, B, T, %) d(G, H) = min c(o ) (o , . . . , o ) ∈ Υ (G, H) . Let be an assignment instance as described in i 1 k Section II-C. We associate the objects of A and B with the i=1 Index ←− −→ s w su←− tu uv←− vw←− vx←− su−→ tu uv−→ vw−→ vx−→ u v Set t x A 0 2 2 2 5 5 3 3 3 0 B 2 1 3 4 4 3 4 2 1 1

(a) Tree representing a metric. (b) Partition sizes w.r.t. oriented edges.
Fig. 1: A tree representing a metric and the objects of an assignment instance associated to its nodes. In this instance A = {a1, a2, a3, a4, a5} and these objects are labelled (t, t, w, w, w). Similarly, the five objects in B are labelled (s, s, t, w, x). The objects are associated to tree nodes by the map %; distinct markers denote the elements of the set A and of B, respectively.
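To make the setting concrete, the following self-contained Python sketch builds an instance in the spirit of Figure 1: the tree shape and the label multisets A and B follow the caption above, while the edge weights are invented for illustration. It computes the optimal assignment cost with the Hungarian method on the full cost matrix (the cubic-time reference, here via SciPy), with the edge-wise counting formula derived later in Section III (Theorem 2), and via the per-edge ℓ1 embedding of Section III-D, and checks that all three agree. This is an illustrative sketch under these assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not the paper's implementation). The tree shape and
# the label multisets follow Figure 1; the edge weights are made up.
import numpy as np
from scipy.optimize import linear_sum_assignment

# weighted tree T, given as triples (u, v, w(uv))
edges = [("s", "u", 1.0), ("t", "u", 2.0), ("u", "v", 1.0),
         ("v", "w", 3.0), ("v", "x", 2.0)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def tree_distance(a, b):
    """Path length between two tree nodes (the path in a tree is unique)."""
    stack, seen = [(a, 0.0)], {a}
    while stack:
        node, dist = stack.pop()
        if node == b:
            return dist
        for nxt, w in adj[node].items():
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, dist + w))
    raise ValueError("nodes are not connected")

# objects of A and B, represented by their labels (tree nodes), |A| = |B|
A = ["t", "t", "w", "w", "w"]
B = ["s", "s", "t", "w", "x"]

# (1) cubic-time reference: Hungarian method on the full n x n cost matrix
C = np.array([[tree_distance(a, b) for b in B] for a in A])
rows, cols = linear_sum_assignment(C)
hungarian_cost = C[rows, cols].sum()

def side_count(labels, u, v):
    """Number of labels lying in the component containing u when uv is removed."""
    comp, stack = {u}, [u]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in comp and not (node == u and nxt == v):
                comp.add(nxt)
                stack.append(nxt)
    return sum(1 for x in labels if x in comp)

# (2) cost via edge counting: sum over tree edges of |A_uv - B_uv| * w(uv)
tree_cost = sum(abs(side_count(A, u, v) - side_count(B, u, v)) * w
                for u, v, w in edges)

# (3) the l1 embedding phi_c(A) = [A_uv * w(uv)], one component per tree edge
phi_A = [side_count(A, u, v) * w for u, v, w in edges]
phi_B = [side_count(B, u, v) * w for u, v, w in edges]
l1_cost = sum(abs(x - y) for x, y in zip(phi_A, phi_B))

assert hungarian_cost == tree_cost == l1_cost   # all equal 13.0 for this instance
print(hungarian_cost, tree_cost, l1_cost)
```

The edge-counting and embedding computations touch each tree edge once, which is what yields the linear running time discussed in Section III, whereas the Hungarian reference requires the full quadratic-size cost matrix.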

00 0 0 0 nodes of the tree T according to the map %, cf. Figure 1a. An P = P ∪ {P1,P2}\{P1,P2} also defines an assignment, assignment M between the objects of A and B is associated where the edges are contained in the same number of paths with a collection of paths P in T , such that there is a bijection with exception of the edge uv, which appears in two paths between pairs (a, b) ∈ M and paths (%(a),...,%(b)) ∈ P. In less. Since w(uv) > 0, the associated solution has cost particular, the cost of the assignment equals the sum of path c(P00) = c(P0) − 2w(uv) and hence P0 cannot correspond lengths, i.e., to an optimal solution, contradicting the assumption. X X c(M) = w(e). (2) This result allows us to compute the optimal assignment P ∈P e∈P cost as a weighted sum over the edges in the tree representing We do not construct the set P explicitly, but use this notion the cost metric. to develop efficient methods and prove their correctness. Theorem 2. Let (A, B, T, %) be an assignment instance with Using Eq. (2), we can attribute the total costs of an optimal tree edge weights w. The cost of an optimal assignment is assignment to the individual edges by counting how often c X they occur on paths. Deleting an arbitrary edge uv yields DOA(A, B) = |Auv←− − Buv←−| · w(uv). two connected components, one containing the node u and uv∈E(T ) the other containing v. Let Auv←− and Auv−→ denote the number Proof. Directly follows from Eq. (2) and Lemma 1. of objects in A associated by % with nodes in the connected component containing u and v, respectively, cf. Figure 1b. B. Constructing an optimal assignment Lemma 1. Let P be the collection of paths associated with an In order to compute an optimal assignment, and not just optimal assignment between A and B under a cost function its cost, we again associate the objects of A and B with the represented by the weighted tree T . Each edge uv in T appears nodes of the tree T . Then we pick an arbitrary leaf v and match the maximum possible number of elements between the |A←− − B←−| = |A−→ − B−→| times in a path in P. uv uv uv uv subsets of A and B associated with v. The remaining objects Proof. Splitting T at an edge defines a bipartition of A and B. are passed to its neighbour and the considered leaf is deleted. Since |A| = |B| holds, it follows that |Auv←− − Buv←−| = |Auv−→ − Iterating the approach until the tree is eventually empty, yields Buv−→| for every edge uv. When uv appears in |Auv←− − Buv←−| as- an assignment between all objects of A and B. Algorithm 1 signment paths, the maximum number of assignments is made implements this approach. within each subset, with all of the smaller of A and B assigned Theorem 3. Algorithm 1 computes an optimal assignment in within the subset. We may assign at most min{A←−,B←−} uv uv time O(n + t), where n = |A| = |B| is the input size and objects within the connected component containing u, and at t = |V (T )| the size of the tree T . least the remaining |Auv←− − Buv←−| objects must be assigned to objects in the connected component containing v. Therefore, Proof. Since |A| = |B| and every object of A is associated uv appears at least this number of times in paths in P. with exactly one object of B, the algorithm constructs an It remains to be shown that the assignment cannot be assignment. The cost of the assignment corresponds to the optimal when the edge uv is contained in more paths. 
Assume number of objects that are passed to neighbours along the P0 corresponds to an optimal solution and contains the edge weighted edges in lines 8 and 9. Whenever a node v is uv more than |Auv←− − Buv←−| times. Then, there are a1 ∈ A processed in the while-loop, it has exactly one remaining and b2 ∈ B in the connected component containing u, which neighbour n. Since v is deleted after the end of the iteration, are both assigned across the partition to elements b1 ∈ B and objects are passed along the edge nv only in this iteration. a2 ∈ A, respectively, in the component containing v. The cor- After calling the procedure PAIRELEMENTS in line 6 eitherAv responding assignment paths P1 = (a1, . . . , u, v, . . . , b1) and or Bv or both are empty. Since all objects in the connected 0 P2 = (b2, . . . , u, v, . . . , a2) are contained in P . Consider the component of T \ nv that contains v must have been passed 0 0 paths P1 = (a1, . . . , u, . . . , b2) and P2 = (a2,...,v,...,b1), to v in previous iterations, exactly |Anv←− − Bnv←−| objects are which both do not contain uv. The collection of paths passed to n. This is the number of occurrences of nv in every Algorithm 1: Optimal assignment from a cost tree. v ∈ N}. For every node v ∈ N, we (i) add v to the result set Input: Assignment instance (A, B, T, %). R, and (ii) if the parent p of v is not in R and depth(p) ≥ d , set v to p and continue with step (i). Let R = {r ∈ Data: Subsets Av ⊆ A and Bv ⊆ B for each vertex v of min min T , partial assignment M. R | depth(r) = dmin}. If |Rmin| > 1, add the parents of all Output: An optimal assignment M. r ∈ Rmin to R, decrease dmin by one. Repeat this step until |Rmin| = 1. Eventually, we have N ⊆ R and TN = T [R]. 1 forall a ∈ A do A%(a) ← A%(a) ∪ {a} . Initialisation Every node in R is processed only once and the running time 2 forall b ∈ B do B%(b) ← B%(b) ∪ {b} is O(|R|) = O(|V (TN )|). 3 M ← ∅ 4 while V (T ) 6= ∅ do Assuming that the tree and the depth of all nodes are given, 5 v ← arbitrary vertex in T with degree 1 the result directly improves the running time of Algorithm 1 6 M ← M ∪ PAIRELEMENTS(Av,Bv) to O(n + |V (TN )|). 7 n ← NEIGHBOUR(v) . Get distinct neighbour D. Embedding optimal assignment costs 8 An ← An ∪ Av . Pass unmatched objects 9 Bn ← Bn ∪ Bv We show how sets of objects can be embedded in a vector 10 T ← T \ v. Remove the node v from T space such that the Manhattan distance between these vectors equals the optimal assignment costs between sets w.r.t. a given Procedure PAIRELEMENTS(Av,Bv ) cost function. Let D = {D1,...,Dk} be a dataset with |Di| = 11 X ← ∅ |Dj| for all i, j ∈ {1, . . . , k}. Let the cost function c between 12 while A 6= ∅ and B 6= ∅ do S v v the objects i Di be determined by a tree distance represented 13 a ← PICKELEMENT(Av); Av ← Av \{a} by the weighted tree T , and the map % from the objects to the 14 b ← PICKELEMENT(Bv); Bv ← Bv \{b} nodes of the tree. We consider the following map φc from 15 X ← X ∪ {(a, b)} A ∈ D to points in Rd with d = |E(T )| and components 16 return X indexed by the edges of T :

←− φc(A) = [Auv · w(uv)]uv∈E(T ) . The Manhattan distance between these vectors is equal to the optimal solution according to Lemma 1. Therefore, M is an optimal assignment costs between the sets. optimal assignment. The total running time over all iterations for the procedure Theorem 5. Let c be a cost function and φc defined as above, c PAIRELEMENTS is O(n), the size of the assignment. All then DOA(A, B) = kφc(A) − φc(B)k1. the other individual operations within the while-loop can be Proof. We calculate implemented in constant time when using linked lists to store X ←− ←− and pass the objects. Therefore the while-loop and the entire kφc(A) − φc(B)k1 = |Auv · w(uv) − Buv · w(uv)| algorithm run in O(n + t) total time. uv∈E(T ) X Every optimal assignment can be obtained by Algorithm 1 = |Auv←− − Buv←−| · w(uv) depending on the order in which the objects are retrieved by uv∈E(T ) c PICKELEMENT in line 13 and 14. = DOA(A, B), C. Improving the running time where the last equality follows from the Theorem 2. We consider the setting, where the map % and the weighted This makes the optimal assignment costs available to fast tree (T, w) encoding the cost metric c are fixed and the indexing methods and nearest-neighbour search algorithms, c distance DOA should be computed for a large number of pairs. e.g., following the locality sensitive hashing paradigm. The individual assignment instances possibly only populate a small fraction of the nodes of T and only a small subtree may IV. APPROXIMATING THE GRAPH EDIT DISTANCE IN be relevant for Algorithm 1. We show that this subtree can be LINEARTIME identified efficiently. We combine our assignment algorithm with the idea of Given a tree T and a set N ⊆ V (T ), let TN denote the Riesen and Bunke [25] to approximate the graph edit dis- minimal subtree of T with N ⊆ V (TN ). tance detailed in Section II-D. To this end, we propose two methods for constructing a tree distance, such that the optimal Lemma 4. Given a tree T and a set N ⊆ V (T ), the assignment between the vertices of two graphs w.r.t. to these subtree T can be computed in time O(|V (T )|) after a N N distances is suitable for approximating the graph edit dis- preprocessing step of time O(|V (T )|). tance. In order to quantify discrete and structural differences, Proof. In the preprocessing step we pick an arbitrary node of we propose to use the Weisfeiler-Lehman method and, for T as root and compute the depth of every vertex w.r.t. the graphs with continuous labels, hierarchical clustering. Note root using breadth-first search. Let dmin = min{depth(v) | that both approaches can be combined to form a tree taking both, discrete and continuous labels, into account. The tree {{ci−1(u) | uv ∈ E(G)}} to obtain a unique sequence of distances we consider are in fact ultrametrics. We proceed by colours and add ci−1(v) as first element. Assign a new colour a discussion on how to cast the assignment formulation of ci(v) to every vertex v by employing an injective mapping Riesen and Bunke [25] with artificial elements that represent from colour sequences to new colours. Since colours are vertex insertion and deletion costs to an ultrametric. preserved under isomorphism, a necessary condition for two graphs G and H to be isomorphic is A. Ultrametric cost matrices for the graph edit distance {{c (v) | v ∈ V (G)}} = {{c (v) | v ∈ V (H)}} ∀i ≥ 0 , The cost matrix of Eq. (1) is easily seen not to be in i i (4) accordance with the strong triangle inequality. 
Consider the where for both graphs the same injective colour map is used. cost matrix C obtained for a graph G with n vertices compared Vice versa, the condition (4) may be satisfied also for two to itself. Let A = (a1, . . . , a2n) and B = (b1, . . . , b2n) non-isomorphic graphs G and H. and consider c(a1, bn+2) ≤ max{c(a1, bn+1), c(bn+1, bn+2)}. Each colouring ci induces a partition Ci of V (G), where the We have c(a1, bn+2) = ∞ and c(a1, bn+1) = c1, and vertices with the same colour are in one cell. The partition Ci is c(bn+1, bn+2) not specified by C. However, c(bn+1, bn+2) ≤ a refinement of the partition Ci−1. Therefore, colour refinement max{c(an+1, bn+1), c(an+1, bn+2)} = 0 and, thus, a contra- applied to a set of graphs under the same injective colour map diction to the strong triangle inequality, unless c1, = ∞. yields a hierarchy of partitions of the vertices, which forms Therefore, we have to modify the definition of the cost matrix. a tree. We perform colour refinement for a fixed number of The entries ∞ in the upper right and lower left corner have h iterations and consider the metric induced by the resulting been introduced with the argument, that every vertex can be tree. Let % associate the vertices of the graphs in the dataset inserted and deleted at most once [25]. This, however, is with the node representing their final colour in the tree. Then already guaranteed, since the assignment is a bijection. We the path length between two nodes in the tree represents the simplify the cost matrix as follows number of refinement steps in which the associated vertices   have different colours. Assuming that h is fixed, we obtain a c1,1 c1,2 ··· c1,m τ τ ··· τ linear running time for approximating the graph edit distance  .. .  c2,1 c2,2 ··· c2,m τ τ . .  with Theorem 3 and Lemma 4.    ......  1) Relation to Weisfeiler-Lehman graph kernels: The  ...... τ   Weisfeiler-Lehman method has been used successfully to cn,1 cn,2 ··· cn,m τ ··· τ τ Cultra =   , (3) derive efficient and expressive graph kernels [11, 32]. The  τ τ ··· τ 0 0 ··· 0   Weisfeiler-Lehman subtree kernel [32] considers each colour  . . . .   τ τ .. . 0 0 .. .  as a feature and represents a graph by a feature vector, where    ......  each component counts the number of vertices in the graph . .. .. τ . .. .. 0  . .  having that colour in one iteration. The Weisfeiler-Lehman τ ··· τ τ 0 ··· 0 0 optimal assignment kernel [11] applies the histogram intersec- where τ is the cost for vertex deletion and insertion. Moreover, tion kernel to these feature vectors and is equal the optimal we assume that the vertex substitution costs (i) satisfy ci,j ≤ τ assignment similarity between the vertices. However, both for i ∈ {1, . . . , n}, j ∈ {1, . . . , m}, and (ii) can be represented approaches yield a similarity measure that crucially depends by an ultrametric tree. We can extend this tree for vertex on the number h of refinement operations. The similarity insertion and deletion as defined by Cultra by adding a node x keeps changing in value with increasing h even after the to the root, where the edge to the parent has weight τ. Just like partition is stable. Therefore, in classification experiments h is the matrix (3) contains additional rows and columns for vertex typically determined by cross-validation in an computational insertion and deletion, we associate artificial vertices with the expensive grid search, e.g., from the set {0,..., 7}. Our node x via the map %. 
Note that these can be matched at zero approach to approximate the graph edit distance, in contrast, cost at x, which represents the bottom right submatrix of (3) is less sensitive to a particular choice of h and can be expected filled with 0. to always benefit from more iterations. 2) Optimality for amenable graphs: We show that a mod- B. Weisfeiler-Lehman trees for discrete structural differences ification of our algorithm actually constructs an isomorphism Weisfeiler-Lehman refinement, also known as colour refine- (resulting in an empty edit path with zero costs) between two ment or na¨ıve vertex classification, is a classical heuristic for isomorphic graphs under the following assumptions. First, we graph isomorphism testing [30, 31]. It iteratively refines the assume that h is chosen sufficiently large such that the refine- discrete labels of the vertices, called colours, where in each ment process converges, i.e., the stable partition Ch = Ch+1 is iteration two vertices with the same colour obtain different obtained. Moreover, we assume that the graphs are amenable new colours if their neighbourhood differs w.r.t. the current to colour refinement, meaning that the condition (4) is satisfied colouring. More formally, given a graph G with initial colours if and only if G and H are isomorphic [31]. If all vertices c0, a sequence (c1, c2,... ) of refined colours is computed, have distinct colours, which is the case with high probability where ci is obtained from ci−1 by the following procedure. for random graphs [33], the graph is amenable and our For every vertex v ∈ V (G), sort the multiset of colours approach constructs an isomorphism without any modification. However, this is, for example, not the case for graphs with non- TABLE I: Dataset statistics and properties. trivial automorphisms, which may still be amenable. Our graph Properties Labels Attributes edit distance approximation proceeds by assigning vertices Data set Ref. Graphs Classes |V | |E| Vertex Edge Vertex Edge with the most specific colours to each other. For isomorphic ∅ ∅ graphs, there must be a choice for the module PAIRELEMENTS AIDS 2000 2 15.69 16.20 + + – – [39] Letter (L) 2250 15 4.68 3.13 – – + (2) – [39] such that Algorithm 1 constructs an isomorphism, but it does Letter (M) 2250 15 4.67 4.50 – – + (2) – [39] not guarantee that this choice is made. To achieve this, we Letter (H) 2250 15 4.67 4.50 – – + (2) – [39] have to modify two parts of Algorithm 1: (i) the function Mutagenicity 4337 2 30.32 30.77 + + – – [39] NCI1 4110 2 29.87 32.30 + – – – [32] PAIRELEMENTS must respect the edges of G[Av] and H[Bv] when matching the vertices as well as the edges to vertices that were already mapped in previous steps, (ii) the ordering in which the leaves are selected in line 5 must guarantee that Q3 How does it perform regarding runtime and accuracy in the partial mapping can be extended to eventually form an classification tasks? isomorphism. These steps can be implemented efficiently by Q4 How does our method compare to other approaches for inspection of the subgraphs induced by the individual colour graph classification? classes as well as the bipartite graphs containing the edges A. Method between two colour classes. Due to space limitations we We have implemented the following methods for computing refrain from giving a detailed technical description and refer or approximating the graph edit distance in Java using the same the reader to the construction used by Arvind et al. [31] (Proof code base where possible. 
of Theorem 9). We denote by GED[ (G, H) the approximate Exact Binary linear programming approach to compute the graph edit distance (assuming non-zero edit costs) obtained for graph edit distance exactly [22]. We implemented the G and H from the optimal assignment under the Weisfeiler- most efficient formulation (F2) and solved all instance Lehman tree metric with Algorithm 1 modified as described using Gurobi 7.5.2. above. BP Approximate graph edit distance using the Hungarian Theorem 6. Let G and H be graphs amenable to colour re- algorithm to solve the assignment problem as proposed finement, then GED[ (G, H) = 0 ⇔ G and H are isomorphic. by Riesen and Bunke [25]. Greedy The greedy graph edit distance proposed by Riesen Proof. Since GED[ (G, H) is the cost of an edit path trans- et al. [26] solving the assignment problem by a row-wise forming G to H, the implication follows directly. By the greedy algorithm. result of Arvind et al. [31] we obtain that an isomorphism Linear Our approach based on assignments under a tree dis- is constructed if G and H are isomorphic, which yields an tance. For graphs with discrete labels we used Weisfeiler- edit path with cost 0. Lehman trees with h = 7 refinement steps, and bisecting C. Hierarchical clustering for continuous labels k-means clustering with l = 300 leaves otherwise. The experiments were conducted using Java v1.8.0 on an Intel We apply the bisecting k-means algorithm [34] to obtain a Core i7-3770 CPU at 3.4GHz (Turbo Boost disabled) with hierarchical clustering of the continuous vertex labels of all 16GB of RAM. The methods BP, Greedy and Linear use a graphs. This is then used as the tree defining the assignment single processor only, the Gurobi solver for the Exact method costs. We use Lloyd’s algorithm [35] for each 2-means prob- was allowed to use all four cores with additional Hyper- lem and perform bisection steps until a fixed number of l Threading. leaves is created. The k-means algorithm is widely popular for We used the graph classification benchmark sets contained its speed in practice, although its complexity is exponential in the IAM Graph Database [39]1 and the repository of in the worst-case [36]. However, Duda et al. [37] state that benchmark datasets for graph kernels [40]. The datasets AIDS, the number of iterations until the clustering stabilises is often Mutagenicity and NCI1 represent small molecules and have linear or even sublinear on practical data sets. In any case, discrete labels only. The Letter datasets have continuous vertex hierarchical clustering algorithms with linear worst-case time labels representing 2D coordinates and differ w.r.t. the level of complexity are known [38]. Assuming that the clustering is distortion, (L)–low, (M)–medium and (H)–high. The statistics performed in linear time, we also obtain a linear running time of these graph datasets are summarised in Table I. We used for approximating the graph edit distance as above. the predefined train, test and validation sets when available or V. EXPERIMENTAL EVALUATION generated them randomly using 1/3 of the objects for each set, balanced by class label. We performed k-nearest neighbours Our goal in this section is to answer the following questions classification based on the graph edit distance. The costs for experimentally. vertex insertion and deletion were both set to τ and the Q1 How does our approach scale w.r.t. the graph and dataset vertex costs for insertion and deletion of edges were set to τ . size compared to other methods? 
edge Q2 How accurately does it approximate the graph edit dis- 1Please note that the statistics of the datasets may differ from the datasets tance for common datasets? used in [25]. 200 8000 Exact Exact competing approach. Compared to the Greedy approach the Linear 7000 Linear 150 Greedy 6000 Greedy Linear method appears to give slightly better results on an BP BP 5000 average. For the datasets Mutagenicity, Letter (L), (M) and 100 4000 (H) there are more points below the diagonal than above 3000 the diagonal. When comparing to BP, this is still the case 50 2000 for the Mutagenicity dataset, but not for the Letter datasets. 1000 0 0 This can be explained by the fact that continuous distances 0 20 40 60 80 100 0 50 100 150 200 for several points cannot be represented by a tree metric (a) Graph size (b) Dataset size without distortion. In order to compare with the exact method Fig. 2: Runtime in milliseconds to (a) compute the graph edit on Mutagenicity and AIDS, we introduced a timeout of 100 distance between two random graphs; and (b) between all pairs seconds for each distance computation. This was necessary of graphs in a dataset of random graphs. since hard instances may require more than several hours. In case of a timeout the best solution found so far is used, which is not guaranteed to be optimal. The Linear method shows The costs for substituting vertices or edges are determined by a clear divergence from the solutions of the exact approach, the Euclidean distance in case of continuous labels. In case in particular for pairs of graphs with a high (optimal) edit of discrete labels we assume cost 0 for equal labels and 1 distance. However, it is likely that non-optimal solutions in otherwise. We use the validation set to select the parameters this case do not harm a nearest neighbours classification. k ∈ {1, 3, 5}, τvertex ∈ {0.1, 0.5, 0.9, 1.3, 1.7} and τedge ∈ Q3: Table II summarises the results of the classification {0.1, 0.5, 0.9, 1.3, 1.7} by grid search. The approach resembles experiments. The Linear approach provides a high classi- the experimental settings used by Riesen and Bunke [25]. The fication accuracy comparable to BP and Greedy. For the reported runtimes were obtained using the selected parameters. dataset Mutagenicity and NCI1 it even performs better than In order to systematically investigate the dependence of the the other approaches. This can be explained by the ability of runtime on the graph size we generated random graphs ac- the Weisfeiler-Lehman tree to exploit more graph structure cording to the Erdos–R˝ enyi´ model with edge probability 0.15. than BP. For the Letter datasets, the Linear method is on a par For these experiments, we set τvertex = τedge = 1. For the with the other methods for the version with low distortion, but linear method, the runtimes include the time for constructing performs slightly worse when the distortion increases. This the tree. observation is in accordance with the approximation quality For comparison with other approaches to graph classifica- achieved for the datasets, cf. Figure 3. The Linear method tion, we used two graph kernels as a baseline. The Graph- clearly outperforms all other approaches in terms of runtime. Hopper kernel (GH) [41] supports graphs with discrete and This becomes in particular clear for the dataset Mutagenicity, continuous labels by applying either the Dirac kernel or a which contains the largest graphs in the test with 30.32 vertices Gaussian kernel. 
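The k-nearest-neighbour evaluation protocol described above (classification on precomputed graph edit distances, with the parameter k selected on the validation set) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Java code: `ged` stands for any of the distance approximations compared here, and the graph lists and label arrays are placeholders; scikit-learn's `KNeighborsClassifier` with a precomputed metric is used for the k-NN step.

```python
# Sketch of the evaluation protocol, not the authors' Java code. `ged` is any
# of the (approximate) graph edit distances compared above; graphs_* and y_*
# are placeholder lists of graphs and class labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def distance_matrix(graphs_a, graphs_b, ged):
    """Pairwise distances, rows indexed by graphs_a, columns by graphs_b."""
    return np.array([[ged(g, h) for h in graphs_b] for g in graphs_a])

def knn_accuracy(graphs_train, y_train, graphs_val, y_val,
                 graphs_test, y_test, ged):
    d_train = distance_matrix(graphs_train, graphs_train, ged)
    d_val = distance_matrix(graphs_val, graphs_train, ged)
    # select k on the validation set (the edit costs tau_vertex and tau_edge
    # are selected by the same kind of grid search)
    best_k = max((1, 3, 5), key=lambda k: KNeighborsClassifier(
        n_neighbors=k, metric="precomputed").fit(d_train, y_train).score(d_val, y_val))
    knn = KNeighborsClassifier(n_neighbors=best_k, metric="precomputed")
    knn.fit(d_train, y_train)
    return knn.score(distance_matrix(graphs_test, graphs_train, ged), y_test)
```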
The Weisfeiler-Lehman optimal assignment on an average. kernel (WLOA) [11] supports only graphs with discrete labels. Q4: The GraphHopper kernel performs worse than our We used the C-SVM implementation LIBSVM [42], selecting Linear approach w.r.t. running time and classification accu- C ∈ {10−3, 10−2,..., 103} and h ∈ {0, 1,..., 7} using the racy. WLOA can only be applied to the molecular datasets validation set. with discrete labels. For these it performs exceptionally well regarding both accuracy and runtime. The result suggests that B. Results the notion of similarity provided by the graph edit distance is We report on our experimental results and answer our less suitable for this classification task. research questions. Q1: Figure 2 shows the growth of the runtime with increas- VI.CONCLUSION ing graph and dataset size. Our method is the only one of those We have shown that optimal assignments can be computed studied that scales to large graphs. The number of distance efficiently for tree metric costs. Although this is a severe computations and thus the runtime of all methods grows restriction, we designed such costs functions suitable for the quadratically with the dataset size. Even for the small random challenging problem of graph matching. Our approach allows graphs on 15 vertices we generated, our method is more than to embed the optimal assignment costs in an ` space. It one order of magnitude faster than other approximate methods. 1 remains future work to exploit this property, e.g., for efficient Q2: To compare how accurately the graph edit distance nearest neighbour search in graph databases. is computed, we have selected 500 pairs of graphs at random from each IAM dataset and computed their graph edit distance ACKNOWLEDGEMENTS by all four methods. Figure 3 shows how the distance com- puted by the Linear method compares to the distances obtained This work was supported by the German Research Foun- by the other three methods. Points below the diagonal line dation (DFG) within the Collaborative Research Center SFB represent pairs of graphs, for which the edit distance computed 876 “Providing Information by Resource-Constrained Data by Linear is actually smaller than the one computed by the Analysis”, project A6 “Resource-efficient Graph Mining”. rmteIMgahdatasets. graph ( Linear IAM by the computed from distance edit Graph 3: Fig. R hrtkmradP .Aawl Anear-linear “A Agarwal, K. P. and Sharathkumar R. maxi- for [6] algorithm scaling “A Su, H.-H. and Duan R. Martello, S. [5] and Dell’Amico, M. Burkard, E. R. [4] survey “A Morris, C. and Johansson, D. F. Kriege, M. N. [3] Johnson, S. D. and Garey R. M. [1] S .N ihaahn .N cruop,R .Kondor, I. R. Schraudolph, N. N. Vishwanathan, N. V. S. [2] Exact vs. Linear BP vs. Linear Greedy vs. Linear 100 150 200 250 100 120 140 160 180 200 100 120 140 160 180 200 C1—7. 477. 81.5 — — — — 99.6 78.1 79.2 86.0 74.7 98.1 99.5 85.2 91.3 73.5 98.5 99.6 87.2 — 91.2 98.7 99.6 88.1 92.9 88.4 98.5 99.6 93.6 NCI1 98.8 (H) Letter — (M) Letter (L) Letter AIDS Dataset ua.—7. 087. 2380.8 72.3 74.4 70.8 70.7 — Mutag. 50 20 40 60 80 20 40 60 80 0 0 0 time Algorithms Discrete in graphs,” on bipartite in matching weight mum Problems ment https://arxiv.org/abs/1903.11835 Available: [Online]. kernels,” graph on n .M ogad,“rp kernels,” Research Learning “Graph chine Borgwardt, M. K. and NP-Completeness 1979. of Freeman, Theory H. the W. to Guide A tractability: AL I cuayadrniefrcasfigteeeet ftets e,icuigpercsigi necessary. 
if preprocessing including set, test the of elements the classifying for runtime and Accuracy II: TABLE 0 0 0 50 40 40 a AIDS (a)  apoiainagrtmfrgoercbipartite geometric for algorithm -approximation 100 80 80 xc PGed ierGH Linear Greedy BP Exact 150 120 120 IM 2012. SIAM, . 200 160 160 R 250 200 200 EFERENCES CoRR 100 150 200 250 100 150 200 250 100 150 200 250 IM 02 p 1413–1424. pp. 2012, SIAM, . 50 50 50 0 0 0 o.1,p.10–22 2010. 1201–1242, pp. 11, vol. , 0 0 0 o.as10.13,2019. abs/1903.11835, vol. , b Mutagenicity (b) 20 50 50 Accuracy 60 100 100 100 150 150 optr n In- and Computers 140 200 200 ora fMa- of Journal y SVM 250 250 180 ai)adEatGed/P( Exact/Greedy/BP and -axis) Symposium 10 12 14 16 18 10 12 14 16 18 10 12 14 16 0 2 4 6 8 0 2 4 6 8 6 8 0 2 4 Assign- WLOA 0 0 0 2 2 2 c etr(L) Letter (c) 4 4 4 SVM . 6 6 6 8 8 F .JhnsnadD uhsi Lann ihsim- with “Learning Dubhashi, D. and Johansson D. F. [10] 10 10 8 H Fr H. [9] kernel: match pyramid “The Darrell, T. and Grauman K. [8] for approximation “Linear-time Pettie, S. and Duan R. [7] 40 20 12 12 10 8 xc PGed ierGH Linear Greedy BP Exact 0 0 0 — — — 14 14 19 21 38 ofrneo nweg icvr n aaMining 467–476. pp. Data 2015, and ACM, Discovery Knowledge geomet- on of Conference in matchings embeddings,” using graphs ric on functions ilarity 225–232. pp. 2005, ACM, USA: NY, York, graphs,” molecular attributed in for kernels assignment mal fcetlann ihst ffeatures,” of Res. sets with learning Efficient 2014. Jan. 1:1–1:23, matching,” weight maximum 385–394. pp. 2012, ACM, USA: NY, York, New in matching,” 12 16 16 00 00 00 nentoa ofrneo ahn learning Machine on Conference International 14 18 18 o.8 p 2–6,My2007. May 725–760, pp. 8, vol. , hih .K enr .See,adA el “Opti- Zell, A. and Sieker, F. Wegner, K. J. ohlich, ¨ 1 h 10 12 14 16 18 20 10 12 14 16 18 10 12 14 16 18 32 21 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 x 2 0 0 0 ai)fr50rnol hsnpiso graphs of pairs chosen randomly 500 for -axis) 0 0 0 46 13 10 28 13 9 2 2 2 00 00 00 00 00 00 d etr(M) Letter (d) 4 C ypsu nTer fComputing of Theory on Symposium ACM 4 4 6 6 26 56 8 1 6 8 0 0 0 10 17 37 12 35 10 9 9 8 12 Runtime 00 00 00 00 00 00 1hAMSGD International SIGKDD ACM 21th 12 14 10 14 16 12 3 16 18 2 0 49 13 0 14 20 18 4 7 4 4 00 00 00 00 00 00 .ACM J. 10 12 14 16 18 20 10 12 14 16 18 20 10 12 14 16 18 20 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 2 h 0 0 0 > > 2 2 2 0 24 24 o.6,n.1 pp. 1, no. 61, vol. , 2 SVM 36 52 36 31 e etr(H) Letter (e) 4 4 h h 00 00 00 00 4 6 6 .Mc.Learn. Mach. J. 6 8 8 10 WLOA 10 8 12 12 10 14 14 < 12 16 New . SVM 16 19 13 — — — 1 14 18 18 00 00 00 20 16 20 . . [11] N. M. Kriege, P.-L. Giscard, and R. Wilson, “On valid [26] K. Riesen, M. Ferrer, R. Dornberger, and H. Bunke, optimal assignment kernels and applications to graph “Greedy graph edit distance,” in Machine Learning and classification,” in Advances in Neural Information Pro- Data Mining in Pattern Recognition, 2015, pp. 3–16. cessing Systems 29, 2016, pp. 1623–1631. [27] K. Riesen, M. Ferrer, A. Fischer, and H. Bunke, “Ap- [12] D. Conte, P. Foggia, C. Sansone, and M. Vento, “Thirty proximation of graph edit distance in quadratic time,” years of graph matching in pattern recognition,” Inter- in Graph-Based Representations in Pattern Recognition. national Journal of Pattern Recognition and Artificial Springer, 2015, pp. 3–12. Intelligence, 2004. [28] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou, [13] P. Foggia, G. Percannella, and M. 
Vento, “Graph match- “Comparing stars: On approximating graph edit dis- ing and learning in pattern recognition in the last 10 tance,” VLDB Endow., vol. 2, no. 1, pp. 25–36, 2009. years,” IJPRAI, vol. 28, no. 1, 2014. [29] C. Semple and M. Steel, Phylogenetics, ser. Oxford lec- [14] N. M. Kriege, L. Humbeck, and O. Koch, “Chemical ture series in mathematics and its applications. Oxford similarity and substructure searches,” in Encyclopedia of University Press, 2003. Bioinformatics and Computational Biology. Oxford: [30] B. Y. Weisfeiler and A. A. Leman, “A reduction of a Academic Press, 2019, pp. 640 – 649. graph to a canonical form and an algebra arising during [15] R. Singh, J. Xu, and B. Berger, “Global alignment of this reduction,” Nauchno-Technicheskaya Informatsiya, multiple protein interaction networks with application to vol. 2, no. 9, pp. 12–16, 1968, in Russian. functional orthology detection,” PNAS, vol. 105, no. 35, [31] V. Arvind, J. Kobler,¨ G. Rattan, and O. Verbitsky, “On pp. 12 763–12 768, 2008. the power of color refinement,” in Fundamentals of [16] S. Zhang and H. Tong, “FINAL: Fast attributed network Computation Theory. Springer, 2015, pp. 339–350. alignment,” in 22nd ACM SIGKDD International Confer- [32] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, ence on Knowledge Discovery and Data Mining. ACM, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman 2016, pp. 1345–1354. graph kernels,” Journal of Machine Learning Research, [17] M. Stauffer, T. Tschachtli, A. Fischer, and K. Riesen, “A vol. 12, pp. 2539–2561, 2011. survey on applications of bipartite graph edit distance,” [33] L. Babai, P. Erdos,˝ and S. Selkow, “Random graph in Graph-Based Representations in Pattern Recognition. isomorphism,” SIAM Journal on Computing, vol. 9, no. 3, Springer, 2017, pp. 242–252. pp. 628–635, 1980. [18] A. Sanfeliu and K.-S. Fu, “A distance measure between [34] M. Steinbach, G. Karypis, and V. Kumar, “A comparison attributed relational graphs for pattern recognition.” IEEE of document clustering techniques,” in KDD Workshop Transactions on Systems, Man, and Cybernetics, vol. 13, on Text Mining, 2000. no. 3, pp. 353–362, 1983. [35] S. Lloyd, “Least squares quantization in pcm,” IEEE [19] H. Bunke, “On a relation between graph edit distance Transactions on Information Theory, vol. 28, no. 2, pp. and maximum common subgraph,” Pattern Recognition 129–137, March 1982. Letters, vol. 18, no. 8, pp. 689–694, 1997. [36] A. Vattani, “k-means requires exponentially many iter- [20] S. Bougleux, L. Brun, V. Carletti, P. Foggia, B. Gauz¨ ere,` ations even in the plane,” Discrete & Computational and M. Vento, “Graph edit distance as a quadratic as- Geometry, vol. 45, no. 4, pp. 596–616, Jun 2011. signment problem,” Pattern Recognition Letters, vol. 87, [37] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classifi- pp. 38 – 46, 2017. cation. New York, NY, USA: Wiley-Interscience, 2000. [21] K. Gouda and M. Hassaan, “CSI GED: An efficient [38] C. C. Aggarwal and C. K. Reddy, Data Clustering: approach for graph edit similarity computation,” in 32nd Algorithms and Applications, 1st ed. Chapman & IEEE International Conference on Data Engineering, Hall/CRC, 2013. ICDE 2016, Helsinki, Finland, 2016, pp. 265–276. [39] K. Riesen and H. Bunke, “IAM graph database repos- [22] J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Heroux,´ and itory for graph based pattern recognition and machine S. 
Adam, “New binary linear programming formulation learning,” in Structural, Syntactic, and Statistical Pattern to compute the graph edit distance,” Pattern Recognition, Recognition, SSPR & SPR. Springer, 2008, pp. 287–297. vol. 72, no. Supplement C, pp. 254 – 265, 2017. [40] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, [23] X. Chen, H. Huo, J. Huan, and J. S. Vitter, “An ef- and M. Neumann, “Benchmark data sets for graph ficient algorithm for graph edit distance computation,” kernels,” 2016. [Online]. Available: http://graphkernels. Knowledge-Based Systems, vol. 163, pp. 762 – 775, 2019. cs.tu-dortmund.de [24] C.-L. Lin, “Hardness of approximating graph trans- [41] A. Feragen, N. Kasenburg, J. Petersen, M. D. Bruijne, formation problem,” in Algorithms and Computation. and K. Borgwardt, “Scalable kernels for graphs with Springer, 1994, pp. 74–82. continuous attributes,” in Advances in Neural Information [25] K. Riesen and H. Bunke, “Approximate graph edit dis- Processing Systems 26, 2013, pp. 216–224. tance computation by means of bipartite graph match- [42] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for sup- ing,” Image and Vision Computing, vol. 27, no. 7, pp. port vector machines,” ACM Transactions on Intelligent 950–959, 2009. Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.