New Techniques for Graph Edit Distance Computation
by
David B. Blumenthal
This dissertation is submitted for the degree of Doctor of Philosophy at the Faculty of Computer Science of the Free University of Bozen-Bolzano.

Ph. D. Supervisor
Johann Gamper, Free University of Bozen-Bolzano, Bolzano, Italy

External Co-Authors
Sébastien Bougleux, Normandie Université, Caen, France
Luc Brun, Normandie Université, Caen, France
Nicolas Boria, Normandie Université, Caen, France
Évariste Daller, Normandie Université, Caen, France
Benoit Gaüzère, Normandie Université, Rouen, France

External Reviewers
Walter G. Kropatsch, Technische Universität Wien, Vienna, Austria
Jean-Yves Ramel, Université de Tours, Tours, France
Keywords Graph Edit Distance, Graph Matching, Algorithm Design
Copyright © 2019 by David B. Blumenthal

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International Public License. For a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/.

Contents
Acknowledgements
Abstract
1 Introduction
  1.1 Background
  1.2 Contributions and Organization
  1.3 Publications
2 Preliminaries
  2.1 Two Definitions of GED
  2.2 Miscellaneous Definitions
  2.3 Definitions of LSAP and LSAPE
  2.4 Test Datasets and Edit Costs
3 Theoretical Aspects
  3.1 State of the Art
  3.2 Hardness on Very Sparse Graphs
  3.3 Compact Reduction to QAP for Quasimetric Edit Costs
  3.4 Harmonization of GED Definitions
  3.5 Empirical Evaluation
  3.6 Conclusions and Future Work
4 The Linear Sum Assignment Problem with Error-Correction
  4.1 State of the Art
  4.2 A Fast Solver Without Cost Constraints
  4.3 Empirical Evaluation
  4.4 Conclusions and Future Work
5 Exact Algorithms
  5.1 State of the Art
  5.2 Speed-Up of Node Based Tree Search
  5.3 Generalization of Edge Based Tree Search
  5.4 A Compact MIP Formulation
  5.5 Empirical Evaluation
  5.6 Conclusions and Future Work
6 Heuristic Algorithms
  6.1 State of the Art
  6.2 Two New LSAPE Based Lower and Upper Bounds
  6.3 An Anytime Algorithm for Tight Lower Bounds
  6.4 LSAPE, Rings, and Machine Learning
  6.5 Enumeration of Optimal LSAPE Solutions
  6.6 A Local Search Based Upper Bound
  6.7 Generation of Initial Solutions for Local Search
  6.8 Empirical Evaluation
  6.9 Conclusions and Future Work
7 Conclusions and Future Work
A GEDLIB: A C++ Library for Graph Edit Distance Computation
  A.1 Overall Architecture
  A.2 User Interface
  A.3 Abstract Classes for Implementing GED Algorithms
  A.4 Abstract Class for Implementing Edit Costs
  A.5 Conclusions and Future Work
Bibliography
List of Definitions
List of Theorems, Propositions, and Corollaries
List of Tables
List of Figures
List of Algorithms
List of Acronyms
Acknowledgements
First of all, I would like to thank my Ph. D. supervisor Johann Gamper for his contributions of time, effort, and ideas during the last three years. In particular, I would like to thank him for his invaluable suggestions on how to fine-tune our papers to meet the expectations of the research community and for his trust in my ability to independently pursue my ideas.

Moreover, I would like to thank Sébastien Bougleux, Luc Brun, Nicolas Boria, and Évariste Daller for collaborating with me during my stays at the GREYC Research Lab in Digital Sciences in Caen, France. Working with Sébastien, Luc, Nicolas, and Évariste has been truly inspiring and resulted in many ideas which I would not have been able to come up with alone.

I would also like to thank the external reviewers Walter G. Kropatsch and Jean-Yves Ramel for their comments on a preliminary version of this thesis, my colleagues from our research group in Bolzano for innumerable helpful discussions and our table football matches after lunch, and my friends and academic advisors back in Berlin, with whom I discovered and deepened my desire to become a professional researcher (although back then I thought that I was going to do a Ph. D. in Philosophy rather than in Computer Science).

Last but not least, I would like to thank my family: My sister Mirjam for being sister and best friend in one. My father Andreas for raising me with a love of critical thought and for always encouraging me to follow my passions and interests. My daughters Livia and Romina for countless little moments of joy and happiness. And, most importantly, my wife Clizia for supporting me ever since we met during our Erasmus exchanges in Edinburgh almost nine years ago. Thank you!
David B. Blumenthal, August 2019
Abstract
Due to their capacity to encode rich structural information, labeled graphs are often used for modeling various kinds of objects such as images, molecules, and chemical compounds. If pattern recognition problems such as clustering and classification are to be solved on these domains, a (dis-)similarity measure for labeled graphs has to be defined. A widely used measure is the graph edit distance (GED), which, intuitively, is defined as the minimum amount of distortion that has to be applied to a source graph in order to transform it into a target graph. The main advantage of GED is its flexibility and sensitivity to small differences between the input graphs. Its main drawback is that it is hard to compute. In this thesis, new results and techniques for several aspects of computing GED are presented. Firstly, theoretical aspects are discussed: competing definitions of GED are harmonized, the problem of computing GED is characterized in terms of complexity, and several reductions from GED to the quadratic assignment problem (QAP) are presented. Secondly, solvers for the linear sum assignment problem with error-correction (LSAPE) are discussed. LSAPE is a generalization of the well-known linear sum assignment problem (LSAP), and has to be solved as a subproblem by many GED algorithms. In particular, a new solver is presented that efficiently reduces LSAPE to LSAP. Thirdly, exact algorithms for computing GED are presented in a systematic way, and improvements of existing algorithms as well as a new mixed integer programming (MIP) based approach are introduced. Fourthly, a detailed overview of heuristic algorithms that approximate GED via upper and lower bounds is provided, and eight new heuristics are described. Finally, a new easily extensible C++ library for exactly or approximately computing GED is presented.
1
Introduction
1.1 Background
Labeled graphs can be used for modeling various kinds of objects, such as chemical compounds, images, molecular structures, and many more. Because of this, labeled graphs have received increasing attention over the past years. One task researchers have focused on is the following: Given a database G that contains labeled graphs from a domain G, find all graphs G ∈ G that are sufficiently similar to a query graph H or find the k graphs from G that are most similar to H [35, 47, 107]. Being able to quickly and precisely answer graph similarity queries of this kind is crucial for the development of performant pattern recognition techniques in various application domains [104], such as keyword spotting in handwritten documents [103] and cancer detection [81]. The task of answering graph similarity queries can be addressed in several ways [47]. The straightforward approach is to define a graph (dis-)similarity measure f : G × G → R. Subsequently, f can be used for answering the graph similarity queries in the graph space, e. g., via techniques such as the k-nearest neighbors algorithm. If f is defined as a graph kernel, i. e., if f is symmetric and positive semi-definite, more advanced pattern recognition techniques such as support vector machines and principal component analysis can be used, too [51, 78, 79]. An alternative approach consists in defining a graph embedding g : G → Rd that maps graphs to multidimensional real-valued vectors and then answering the graph similarity queries via vector matching techniques [30, 72, 73, 84–86, 109, 110]. A very flexible and therefore widely used graph dissimilarity measure
is the graph edit distance (GED) [29, 96]. GED is defined as the minimum cost of an edit path between two graphs. An edit path between graphs G and H is a sequence of edit operations that transforms G into H. There are six edit operations, namely, node insertion, deletion, and substitution as well as edge insertion, deletion, and substitution. Each edit operation comes with an associated non-negative edit cost, and the cost of an edit path is defined as the sum of the costs of its edit operations. The disadvantage of GED is that it is NP-hard to compute [111]; its advantage is that it is very sensitive to small differences between the input graphs. GED is therefore mainly used for rather small graphs, where information that is disregarded by rougher dissimilarity measures is crucial [104]. It can be employed either as a stand-alone graph dissimilarity measure, or as a building block for graph kernels [51, 78, 79] or graph embeddings [30, 84–86]. Another problem that can be addressed by computing GED is the error-correcting or inexact graph matching problem [28, 29, 47]. This problem asks to align the nodes and edges of two input graphs G and H, while allowing that some nodes and edges may be inserted or deleted. It can be solved by exactly or approximately computing GED, because edit paths correspond to error-correcting graph matchings (cf. Section 2.1). Research on GED has been conducted both in the database and in the pattern recognition community, although the two communities use slightly different problem definitions. While some exact algorithms are available [3, 4, 42, 52, 60, 68, 69, 89], in practice, these algorithms do not scale well and can hence only be used to compute GED on very small graphs. Research has therefore mainly focused on the task of designing heuristics that approximate GED via lower or upper bounds.
These heuristics use various techniques such as transformations to the linear sum assignment problem with error-correction (LSAPE) [32, 40, 50, 60, 83, 88, 111, 113, 114], local search [22, 25, 44, 92, 111], linear programming [42, 60, 68, 69], and reductions to the quadratic assignment problem (QAP) [22, 25].
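To make the notion of an edit path concrete, the following sketch applies a sequence of edit operations to a small labeled graph and sums their costs. The dict-based graph representation, the operation names, and the uniform cost of 1 per operation are illustrative assumptions, not the thesis' formalism.

```python
# Minimal sketch (representation assumed, not from the thesis): a labeled
# graph as dicts of node and edge labels, an edit path as a list of
# operations, and a uniform edit cost per operation.

def apply_edit_path(graph, operations, cost=1.0):
    """Apply edit operations to a copy of graph; return (result, total cost)."""
    nodes, edges = dict(graph["nodes"]), dict(graph["edges"])
    total = 0.0
    for op in operations:
        kind = op[0]
        if kind == "del_edge":
            del edges[op[1]]
        elif kind == "del_node":  # nodes must be isolated when deleted
            assert all(op[1] not in e for e in edges)
            del nodes[op[1]]
        elif kind == "ins_node":
            nodes[op[1]] = op[2]
        elif kind == "sub_node":  # relabel an existing node
            nodes[op[1]] = op[2]
        elif kind == "ins_edge":
            edges[op[1]] = op[2]
        total += cost
    return {"nodes": nodes, "edges": edges}, total

G = {"nodes": {1: "C", 2: "O"}, "edges": {(1, 2): "bond"}}
# Delete the edge, delete the now-isolated node 2, relabel node 1.
H, c = apply_edit_path(G, [("del_edge", (1, 2)), ("del_node", 2), ("sub_node", 1, "N")])
# With uniform cost 1 per operation, this edit path has cost 3, so the
# graph edit distance between G and the resulting graph H is at most 3.
```

Any edit path yields an upper bound on GED in this way; computing GED exactly means minimizing over all such paths.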
1.2 Contributions and Organization
In this thesis, new techniques for exactly and heuristically computing GED are presented. More specifically, it contains the following core contributions:
– We demonstrate that computing and approximating GED is NP-hard
even on very sparse graphs, harmonize the GED definitions used in the database and in the pattern recognition communities, and show how to reduce quasimetric GED to compact instances of QAP.
– We suggest FLWC, an efficient, generally applicable, and easily implementable LSAPE solver.
– We speed up and generalize the existing exact algorithms A*, DFS-GED, and CSI-GED and suggest COMPACT-MIP, a new compact mixed integer programming (MIP) formulation of the problem of computing GED which is smaller than all existing MIP formulations.
– We present the new LSAPE based heuristics BRANCH, BRANCH-FAST, RING, and RING-ML, as well as the post-processing technique MULTI-SOL for tightening the upper bounds computed by all LSAPE based heuristics. While BRANCH and BRANCH-FAST yield excellent tradeoffs between runtime and accuracy of the produced lower bounds, RING outperforms all existing LSAPE based heuristics in terms of tightness of the produced upper bounds.
– We propose BRANCH-TIGHT, an anytime algorithm which computes the tightest available lower bounds in settings where editing edges is more expensive than editing nodes.
– We present the local search algorithm K-REFINE and the framework RANDPOST for stochastically generating promising initial solutions to be used by local search algorithms. On small graphs, K-REFINE is as accurate as but much faster than the most accurate existing local search algorithms; on larger graphs, it yields an excellent tradeoff between runtime and tightness of the produced upper bounds. RANDPOST is particularly effective on larger graphs, where it significantly tightens the upper bounds of all local search algorithms.
– We present GEDLIB, an easily extensible C++ library for exactly or heuristically computing GED.
The thesis is divided into one preliminary chapter, four main chapters, one concluding chapter, and one appendix. In the following paragraphs, we provide concise summaries of all chapters and of the appendix.
Chapter 2 – Preliminaries. In this chapter, concepts, notations, and datasets are introduced that are used throughout the thesis.
Chapter 3 – Theoretical Aspects. In this chapter, it is shown that computing and approximating GED is NP-hard not only on general, but also on very sparse graphs, namely, unlabeled graphs with maximum degree two. Furthermore, the equivalence of two competing definitions of GED is established, which are used, respectively, in the database and in the pattern recognition community. Finally, a new compact reduction from GED to the quadratic assignment problem (QAP) for quasimetric edit costs is presented. Unlike existing reductions to QAP, the newly proposed reduction yields small instances of the standard version of QAP, and hence makes a wide range of methods that were originally designed for QAP available for efficiently computing GED.
Chapter 4 – The Linear Sum Assignment Problem with Error-Correction. LSAPE is a polynomially solvable combinatorial optimization problem, which occurs as a subproblem in many exact and heuristic algorithms for GED. In this chapter, the new LSAPE solver FLWC (fast solver for LSAPE without cost constraints) is presented and evaluated empirically. FLWC is the first available algorithm that works for general instances of LSAPE and is both efficient and easily implementable.
Chapter 5 – Exact Algorithms. In this chapter, a systematic overview of algorithms for exactly computing GED is provided. Existing exact algorithms are either modeled as node based tree search algorithms, as edge based tree search algorithms designed for constant, triangular node edit costs, or as algorithms based on mixed integer linear programming (MIP). Furthermore, we show how to speed up the existing node based tree search algorithms A* and DFS-GED for constant, triangular edit costs, and generalize the existing edge based tree search algorithm CSI-GED to arbitrary edit costs. We also present a new MIP formulation COMPACT-MIP of GED, which is smaller than all existing MIP formulations. Finally, the newly proposed techniques are evaluated empirically. Note that, as the problem of computing GED is NP-hard, in practice, all exact algorithms only work on very small graphs.
Chapter 6 – Heuristic Algorithms. In this chapter, we provide a systematic overview of algorithms that heuristically approximate GED via lower or upper bounds. Whenever possible, the presented heuristics are modeled as transformations to LSAPE, variants of local search, or linear programming (LP) based algorithms. Moreover, we present two new LSAPE based algorithms BRANCH and BRANCH-FAST for lower bounding GED which assume different positions in the range between fast-but-rather-loose and tight-but-rather-slow, as well as the anytime algorithm BRANCH-TIGHT which further improves the lower bound computed by BRANCH. We also present the LSAPE based heuristics RING and RING-ML for upper bounding GED, the post-processing technique MULTI-SOL for tightening LSAPE based upper bounds, the new local search based heuristic K-REFINE, and the framework RANDPOST for generating good initial solutions for local search based heuristics. Extensive experiments are carried out to test the newly proposed techniques and to compare them with a wide range of competitors. The experiments show that all newly proposed techniques significantly improve the state of the art.
Chapter 7 – Conclusions and Future Work. In this chapter, we summarize the thesis’ main contributions and point out directions for future work.
Appendix A – GEDLIB: A C++ Library for Graph Edit Distance Computation. In this appendix, we present GEDLIB, an easily extensible C++ library for exact and approximate GED computation. GEDLIB facilitates the implementation of competing GED algorithms within the same environment and hence fosters fair empirical comparisons. To the best of our knowledge, no currently available software is designed for this purpose. GEDLIB is freely available on GitHub.
1.3 Publications
The thesis extends and consolidates the following articles:
Journal Publications
– D. B. Blumenthal, N. Boria, J. Gamper, S. Bougleux, and L. Brun, “Comparing heuristics for graph edit distance computation”, VLDB J., 2019, in press. doi: 10.1007/s00778-019-00544-1
– D. B. Blumenthal and J. Gamper, “On the exact computation of the graph edit distance”, Pattern Recognit. Lett., 2018, in press. doi: 10.1016/j.patrec.2018.05.002
– D. B. Blumenthal and J. Gamper, “Improved lower bounds for graph edit distance”, IEEE Trans. Knowl. Data Eng., vol. 30, no. 3, pp. 503–516, 2018. doi: 10.1109/TKDE.2017.2772243
– S. Bougleux, B. Gaüzère, D. B. Blumenthal, and L. Brun, “Fast linear sum assignment with error-correction and no cost constraints”, Pattern Recognit. Lett., 2018, in press. doi: 10.1016/j.patrec.2018.03.032
International Conference and Workshop Publications
– D. B. Blumenthal, S. Bougleux, J. Gamper, and L. Brun, “GEDLIB: A C++ library for graph edit distance computation”, in GbRPR 2019, D. Conte, J.-Y. Ramel, and P. Foggia, Eds., ser. LNCS, vol. 11510, Cham: Springer, 2019, pp. 14–24. doi: 10.1007/978-3-030-20081-7_2
– D. B. Blumenthal, S. Bougleux, J. Gamper, and L. Brun, “Ring based approximation of graph edit distance”, in S+SSPR 2018, X. Bai, E. Hancock, T. Ho, R. Wilson, B. Biggio, and A. Robles-Kelly, Eds., ser. LNCS, vol. 11004, Cham: Springer, 2018, pp. 293–303. doi: 10.1007/978-3-319-97785-0_28
– D. B. Blumenthal, É. Daller, S. Bougleux, L. Brun, and J. Gamper, “Quasimetric graph edit distance as a compact quadratic assignment problem”, in ICPR 2018, IEEE Computer Society, 2018, pp. 934–939. doi: 10.1109/ICPR.2018.8546055
– D. B. Blumenthal and J. Gamper, “Exact computation of graph edit distance for uniform and non-uniform metric edit costs”, in GbRPR 2017, P. Foggia, C. Liu, and M. Vento, Eds., ser. LNCS, vol. 10310, Cham: Springer, 2017, pp. 211–221. doi: 10.1007/978-3-319-58961-9_19
– D. B. Blumenthal and J. Gamper, “Correcting and speeding-up bounds for non-uniform graph edit distance”, in ICDE 2017, IEEE Computer Society, 2017, pp. 131–134. doi: 10.1109/ICDE.2017.57
Submitted Manuscripts
– D. B. Blumenthal, S. Bougleux, J. Gamper, and L. Brun, “Upper bounding GED via transformations to LSAPE based on rings and machine learning”, 2019. arXiv: 1907.00203 [cs.DS]
– N. Boria, D. B. Blumenthal, S. Bougleux, and L. Brun, “Improved local search for graph edit distance”, 2019. arXiv: 1907.02929 [cs.DS]

2

Preliminaries
In this chapter, we introduce the most important concepts and notations, and provide an overview of the datasets we used to empirically evaluate the techniques that are proposed in this thesis. In Section 2.1, we provide two definitions of GED that are used, respectively, in the database and in the pattern recognition community. In Section 2.2, we introduce miscellaneous concepts and notations that are used throughout the thesis. In Section 2.3, we introduce the linear sum assignment problem with and without error-correction (LSAPE and LSAP). LSAPE is a polynomially solvable combinatorial optimization problem that has to be solved as a subproblem by many exact and heuristic algorithms for GED. In Section 2.4, we present the test datasets.
2.1 Two Definitions of GED
Since the graphs to which GED based methods are applied are mostly undirected [2, 87, 104], most heuristics for GED are presented for undirected labeled graphs, although they can usually be easily generalized to directed graphs. For ease of presentation, we also restrict our attention to undirected graphs in this thesis.
Definition 2.1 (Graph). An undirected labeled graph G is a 4-tuple G = (V^G, E^G, ℓ_V^G, ℓ_E^G), where V^G and E^G are sets of nodes and edges, Σ_V and Σ_E are label alphabets, and ℓ_V^G : V^G → Σ_V and ℓ_E^G : E^G → Σ_E are labeling functions. The dummy symbol ε denotes dummy nodes and edges as well as their labels.
Throughout the thesis, it is tacitly assumed that the label alphabets Σ_V and Σ_E are fixed. With this assumption, we introduce G := {G | ℓ_V^G[V^G] ⊆ Σ_V ∧ ℓ_E^G[E^G] ⊆ Σ_E} as the domain of graphs with node labels from Σ_V and edge labels from Σ_E. All graphs considered in this thesis are elements of G.

Table 2.1. Edit operations and edit costs.

  edit operations                                 edit costs
  substitute α-labeled node by α′-labeled node    c_V(α, α′)
  delete isolated α-labeled node                  c_V(α, ε)
  insert isolated α-labeled node                  c_V(ε, α)
  substitute β-labeled edge by β′-labeled edge    c_E(β, β′)
  delete β-labeled edge                           c_E(β, ε)
  insert β-labeled edge between existing nodes    c_E(ε, β)
Definition 2.2 (Edit Cost Function). A function c_V : (Σ_V ∪ {ε}) × (Σ_V ∪ {ε}) → R_{≥0} is called node edit cost function for G, if and only if c_V(α, α) = 0 holds for all α ∈ Σ_V ∪ {ε}. A function c_E : (Σ_E ∪ {ε}) × (Σ_E ∪ {ε}) → R_{≥0} is called edge edit cost function for G, if and only if c_E(β, β) = 0 holds for all β ∈ Σ_E ∪ {ε}.

A node edit cost function c_V is called constant just in case there are constants c_V^sub, c_V^del, c_V^ins ∈ R such that c_V(α, α′) = c_V^sub, c_V(α, ε) = c_V^del, and c_V(ε, α′) = c_V^ins holds for all (α, α′) ∈ Σ_V × Σ_V with α ≠ α′; and triangular or quasimetric just in case c_V(α, α′) ≤ c_V(α, ε) + c_V(ε, α′) holds for all (α, α′) ∈ Σ_V × Σ_V. Constant and triangular edge edit cost functions are defined analogously. Edit cost functions c_V and c_E are called uniform if and only if both of them are constant and, additionally, there is a constant c ∈ R such that c = c_V^sub = c_V^del = c_V^ins = c_E^sub = c_E^del = c_E^ins.
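The triangularity condition from Definition 2.2 is easy to check exhaustively on a finite label alphabet. The following sketch assumes a dict-based encoding of the cost function (with the string "eps" standing in for the dummy label ε); both the representation and the toy alphabet are illustrative, not part of the thesis.

```python
# Sketch: check the triangular (quasimetric) property of a node edit cost
# function c on a finite alphabet. "eps" plays the role of the dummy label.

EPS = "eps"

def is_triangular(c, alphabet):
    """c(a, a') <= c(a, eps) + c(eps, a') for all label pairs (Def. 2.2)."""
    return all(c[(a, b)] <= c[(a, EPS)] + c[(EPS, b)]
               for a in alphabet for b in alphabet)

alphabet = {"x", "y"}
c = {("x", "x"): 0.0, ("y", "y"): 0.0, ("x", "y"): 1.0, ("y", "x"): 1.0,
     ("x", EPS): 0.4, ("y", EPS): 0.4, (EPS, "x"): 0.4, (EPS, "y"): 0.4}
# Here c(x, y) = 1.0 > 0.4 + 0.4, so substituting x by y is more expensive
# than deleting x and inserting y: this cost function is not triangular.
```

Under a triangular cost function, no optimal edit path replaces a substitution by a deletion-insertion pair, which is what the compact QAP reduction of Chapter 3 exploits.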
Given two graphs G, H ∈ G and edit cost functions c_V and c_E, GED(G, H) is defined as the minimum cost of an edit path between G and H. An edit path P between two graphs G, H ∈ G is a sequence P = (o_i)_{i=1}^r of edit operations that satisfies (o_r ∘ … ∘ o_1)(G) ≃ H, i. e., transforms G into a graph H′ which is isomorphic to H. There are six different edit operations (deleting, inserting, and substituting nodes or edges), whose associated edit costs are defined in terms of c_V and c_E, as detailed in Table 2.1. The cost c(P) of an edit path P = (o_i)_{i=1}^r is defined as the sum c(P) := ∑_{i=1}^r c(o_i) of the costs of its contained edit operations. We can now give a first, intuitive definition of GED.
Definition 2.3 (GED – First Definition). The graph edit distance (GED) between two graphs G, H ∈ G is defined as GED(G, H) := min{c(P) | P ∈ Ψ(G, H)}, where Ψ(G, H) is the set of all edit paths between G and H.
This definition of GED was proposed in [60] and is now used in the database community. Note that it slightly differs from the original definition of GED due to Bunke and Allermann [29], which is employed in the pattern recognition community. This is because Bunke and Allermann require edit paths between graphs G and H to transform G into a graph that is identical to H, and only allow edit operations that only involve nodes and edges contained in G or H. Definition 2.3 is very intuitive but unsuited for algorithmic purposes. This is because, firstly, there are infinitely many edit paths between two graphs G and H. Secondly, there is no known polynomial algorithm for deciding whether two graphs are isomorphic [6], which implies that edit paths cannot be recognized as such in polynomial time. Therefore, all existing exact or approximate algorithms for GED restrict their attention to those edit paths that are induced by a node map between G and H. Intuitively, a node map π between G and H is a relation π ⊆ (V^G ∪ {ε}) × (V^H ∪ {ε}) that covers all nodes u ∈ V^G and v ∈ V^H exactly once. Clearly, there are only finitely many node maps between G and H. This implies that GED is effectively computable, if it can be shown that it suffices to consider edit paths induced by node maps.
Definition 2.4 (Node Map). Let G, H ∈ G be two graphs. A relation π ⊆ (V^G ∪ {ε}) × (V^H ∪ {ε}) is called node map between G and H if and only if |{v | v ∈ (V^H ∪ {ε}) ∧ (u, v) ∈ π}| = 1 holds for all u ∈ V^G and |{u | u ∈ (V^G ∪ {ε}) ∧ (u, v) ∈ π}| = 1 holds for all v ∈ V^H. We write π(u) = v just in case (u, v) ∈ π and u ≠ ε, and π^{-1}(v) = u just in case (u, v) ∈ π and v ≠ ε. Π(G, H) denotes the set of all node maps between G and H. For edges e = (u_1, u_2) ∈ E^G and f = (v_1, v_2) ∈ E^H, we introduce the short-hand notations π(e) := (π(u_1), π(u_2)) and π^{-1}(f) := (π^{-1}(v_1), π^{-1}(v_2)).
A node map π ∈ Π(G, H) specifies for all nodes and edges of G and H whether they are substituted, deleted, or inserted. Table 2.2 details these edit operations and introduces short-hand notations for their edit costs.
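The covering condition of Definition 2.4 can be sketched directly as a membership test. The pair-set representation with `None` standing in for the dummy node ε is an illustrative assumption.

```python
# Sketch of Definition 2.4 (representation assumed): a node map as a set
# of pairs over (V_G ∪ {None}) × (V_H ∪ {None}), with None as the dummy ε.

def is_node_map(pi, nodes_g, nodes_h):
    """Each node of G appears in exactly one pair, and likewise each node
    of H: real nodes are covered exactly once, only ε may repeat."""
    return (all(sum(1 for (x, _) in pi if x == u) == 1 for u in nodes_g) and
            all(sum(1 for (_, y) in pi if y == v) == 1 for v in nodes_h))

# Substitute node 1 by node "a", delete node 2, insert node "b":
pi = {(1, "a"), (2, None), (None, "b")}
# pi is a valid node map between V_G = {1, 2} and V_H = {"a", "b"}.
```

Because every node map fixes the fate of each node (and, via the short-hand π(e), of each edge), enumerating node maps suffices to enumerate the induced edit paths.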
Definition 2.5 (Induced Edit Path). Let G, H ∈ G be graphs, π ∈ Π(G, H) be a node map between them, and O be the set of π’s induced edit operations as specified in Table 2.2. Then an ordering P_π := (o_i)_{i=0}^{|O|−1} of O is called induced edit path of the node map π if and only if edge deletions come first and edge insertions come last, i. e., if there are indices i_1 and i_2 such that o_i is an edge deletion just in case 0 ≤ i ≤ i_1 and o_i is an edge insertion just in case i_2 ≤ i ≤ |O| − 1.

Table 2.2. Edit operations and edit costs induced by node map π ∈ Π(G, H); u ∈ V^G and v ∈ V^H are nodes, e ∈ E^G and f ∈ E^H are edges.

  case                edit operations                short-hand notation for edit costs
  node edit operations
  π(u) = v            substitute node u by node v    c_V(u, v) := c_V(ℓ_V^G(u), ℓ_V^H(v))
  π(u) = ε            delete node u                  c_V(u, ε) := c_V(ℓ_V^G(u), ε)
  π^{-1}(v) = ε       insert node v                  c_V(ε, v) := c_V(ε, ℓ_V^H(v))
  edge edit operations
  π(e) = f            substitute edge e by edge f    c_E(e, f) := c_E(ℓ_E^G(e), ℓ_E^H(f))
  π(e) ∉ E^H          delete edge e                  c_E(e, ε) := c_E(ℓ_E^G(e), ε)
  π^{-1}(f) ∉ E^G     insert edge f                  c_E(ε, f) := c_E(ε, ℓ_E^H(f))
It has been shown that induced edit paths are indeed edit paths, i. e., that
Pπ ∈ Ψ(G, H) holds for all π ∈ Π(G, H) [22]: starting the edit path with the edge deletions ensures that nodes are isolated when they are deleted; ending it with the edge insertions ensures that edges are inserted only between existing nodes. The cost c(Pπ) of an edit path Pπ induced by a node map π ∈ Π(G, H) is given as follows:
c(P_π) = ∑_{u ∈ V^G, π(u) ≠ ε} c_V(u, π(u)) + ∑_{u ∈ V^G, π(u) = ε} c_V(u, ε) + ∑_{v ∈ V^H, π^{-1}(v) = ε} c_V(ε, v)
       + ∑_{e ∈ E^G, π(e) ∈ E^H} c_E(e, π(e)) + ∑_{e ∈ E^G, π(e) ∉ E^H} c_E(e, ε) + ∑_{f ∈ E^H, π^{-1}(f) ∉ E^G} c_E(ε, f)   (2.1)

The six sums collect the costs of the node substitutions, node deletions, node insertions, edge substitutions, edge deletions, and edge insertions, respectively. Note that, by Definition 2.5, a node map π generally induces more than one edit path. However, all of these edit paths are equivalent, as they differ only w. r. t. the ordering of the contained edit operations. Throughout the thesis, we will therefore identify all edit paths induced by π. We can now give an alternative definition of GED.
Definition 2.6 (GED – Second Definition). The graph edit distance (GED) between two graphs G, H ∈ G is defined as GED(G, H) := min{c(P_π) | π ∈ Π(G, H)}, where P_π ∈ Ψ(G, H) is the edit path induced by the node map π.
In [21, 22], it is shown that Definition 2.6 is equivalent to the original GED definition due to Bunke and Allermann [29], which is used in the pattern recognition community. Furthermore, it has been demonstrated in [60] that, if the underlying edit cost functions cV and cE are metrics, Definition 2.6 is equivalent to Definition 2.3, which is used in the database community. In Section 3.4 below, we show that Definition 2.3 and Definition 2.6 are equiv- alent also for general edit costs, and hence harmonize the GED definitions used in the different research communities. Also note that Definition 2.6 provides a direct link between GED and the error-correcting graph matching problem mentioned in Section 1.1. Since each node map specifies for all nodes and edges whether they are substituted, deleted, or inserted (cf. Table 2.2), node maps can be interpreted as feasible solutions to the error-correcting graph matching problem. An optimal node map π ∈ Π(G, H) with c(Pπ) = GED(G, H) is hence an optimal solution to the error-correcting graph matching problem w. r. t. the edit costs cV and cE. Or put less formally: π aligns the input graphs’ nodes and edges as conservatively as possible.
Example 2.1 (Illustration of Basic Definitions). Consider the graphs G and H shown in Figure 2.1a. G and H are taken from the letter (h) dataset and represent distorted letter drawings [87]. Their nodes are labeled with two-dimensional, non-negative Euclidean coordinates. Edges are unlabeled.
Hence, we have Σ_V = R_{≥0} × R_{≥0} and Σ_E = {1}. In [84], it is suggested that the edit cost functions c_V and c_E for letter (h) should be defined as follows: c_E(1, ε) := c_E(ε, 1) := 0.425, c_V(α, α′) := 0.75 ‖α − α′‖_2, and c_V(α, ε) := c_V(ε, α) := 0.675 for all node labels α, α′ ∈ Σ_V. Now consider the node map π ∈ Π(G, H) also shown in Figure 2.1a. Its induced edit operations and edit costs are detailed in Figure 2.1b. By summing the induced edit costs, we obtain that π’s induced edit path P_π ∈ Ψ(G, H) has cost c(P_π) = 2.623179, which implies GED(G, H) ≤ 2.623179.
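A computation like the one in Example 2.1 can be sketched by evaluating Equation (2.1) directly for a given node map. The dict-based graph encoding, the `None` dummy, and the uniform toy costs below are illustrative assumptions; they are not the thesis' representation or the letter (h) costs.

```python
# Sketch of Equation (2.1) (representation assumed): graphs as dicts of
# node/edge labels, a node map as a dict u -> v with None as the dummy ε,
# and plain Python functions for the cost functions c_V and c_E.

def induced_cost(G, H, pi, c_V, c_E):
    """Cost of the edit path induced by node map pi, following Eq. (2.1)."""
    inv = {v: u for u, v in pi.items() if v is not None}
    g_edges = {frozenset(e) for e in G["edges"]}
    cost = 0.0
    for u, lu in G["nodes"].items():          # node substitutions/deletions
        v = pi.get(u)
        cost += c_V(lu, H["nodes"][v]) if v is not None else c_V(lu, None)
    for v, lv in H["nodes"].items():          # node insertions
        if v not in inv:
            cost += c_V(None, lv)
    for (u1, u2), le in G["edges"].items():   # edge substitutions/deletions
        img = frozenset((pi.get(u1), pi.get(u2)))
        f = next((f for f in H["edges"] if frozenset(f) == img), None)
        cost += c_E(le, H["edges"][f]) if f is not None else c_E(le, None)
    for (v1, v2), lf in H["edges"].items():   # edge insertions
        pre = (inv.get(v1), inv.get(v2))
        if None in pre or frozenset(pre) not in g_edges:
            cost += c_E(None, lf)
    return cost

cV = lambda a, b: 0.0 if a == b else 1.0      # uniform toy costs
cE = lambda a, b: 0.0 if a == b else 1.0
G = {"nodes": {1: "A", 2: "B"}, "edges": {(1, 2): "e"}}
H = {"nodes": {"x": "A", "y": "C"}, "edges": {("x", "y"): "e"}}
pi = {1: "x", 2: "y"}
# Node 1 -> x costs 0 (same label), node 2 -> y costs 1, the edge is
# substituted at cost 0, so the induced edit path costs 1.0.
print(induced_cost(G, H, pi, cV, cE))  # 1.0
```

As in Example 2.1, the value obtained for any single node map is an upper bound on GED(G, H); Definition 2.6 minimizes this quantity over all node maps.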
2.2 Miscellaneous Definitions
The following Definitions 2.7 to 2.24 introduce, in alphabetical order, concepts and notations that are used throughout the thesis.
(a) Graphs G and H and the node map π used in Example 2.1. [Drawings of G, with nodes u_1, …, u_5, and H, with nodes v_1, …, v_4, plotted by their coordinate labels, together with the node map π between them.]

(b) Edit operations and edit costs induced by the node map π shown in Figure 2.1a, given the edit cost functions defined in Example 2.1:

  node edit operations
  substitute node u_1 by node v_1    c_V(u_1, v_1) = 0.75 ‖(0.69, 0.27) − (0.92, 0.32)‖_2
  substitute node u_2 by node v_2    c_V(u_2, v_2) = 0.75 ‖(1.40, 1.85) − (1.76, 1.81)‖_2
  substitute node u_3 by node v_3    c_V(u_3, v_3) = 0.75 ‖(2.55, 0.45) − (2.30, 0.21)‖_2
  substitute node u_4 by node v_4    c_V(u_4, v_4) = 0.75 ‖(0.93, 1.37) − (0.92, 0.85)‖_2
  delete node u_5                    c_V(u_5, ε) = 0.675
  edge edit operations
  substitute edge (u_1, u_2) by edge (v_1, v_2)    c_E((u_1, u_2), (v_1, v_2)) = 0
  substitute edge (u_2, u_3) by edge (v_2, v_3)    c_E((u_2, u_3), (v_2, v_3)) = 0
  delete edge (u_4, u_5)             c_E((u_4, u_5), ε) = 0.425
  insert edge (v_3, v_4)             c_E(ε, (v_3, v_4)) = 0.425

Figure 2.1. Illustration of basic definitions.
Definition 2.7 (Constants Δ_min^{G,H} and Δ_max^{G,H}). Let G, H ∈ G be graphs. The constants Δ_min^{G,H} and Δ_max^{G,H} are defined as Δ_min^{G,H} := min{max deg(G), max deg(H)} and Δ_max^{G,H} := max{max deg(G), max deg(H)}.
Definition 2.8 (Error-Correcting Matching). For all n, m ∈ N, the set of all error-correcting matchings between [n] and [m] is defined as Π_{n,m,ε} := {X ∈ {0, 1}^{(n+1)×(m+1)} | X 1_{m+1} = (1_n^T, ⋆)^T ∧ 1_{n+1}^T X = (1_m^T, ⋆)}, where ⋆ is a placeholder that may be substituted by any natural number. The relational representation π ⊆ [n + 1] × [m + 1] of an error-correcting matching X ∈ Π_{n,m,ε} is defined as π := {(i, k) ∈ [n + 1] × [m + 1] | x_{i,k} = 1}. Depending on the context, we identify Π_{n,m,ε} with the set of all error-correcting matchings in relational representation.
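Membership in Π_{n,m,ε} amounts to simple row- and column-sum checks: each of the first n rows and first m columns must sum to one, while the last row and column (the ε row/column) are unconstrained. A sketch, with the nested-list matrix encoding as an illustrative assumption:

```python
# Sketch of Definition 2.8: check that a 0/1 matrix X of size
# (n+1) x (m+1) is an error-correcting matching. The last row and the
# last column correspond to the dummy ε and may sum to anything.

def is_error_correcting_matching(X, n, m):
    rows_ok = all(sum(X[i]) == 1 for i in range(n))
    cols_ok = all(sum(X[i][k] for i in range(n + 1)) == 1 for k in range(m))
    return rows_ok and cols_ok

X = [[1, 0, 0],   # node 1 of G substituted by node 1 of H
     [0, 0, 1],   # node 2 of G deleted (assigned to the ε column)
     [0, 1, 0]]   # node 2 of H inserted (assigned from the ε row)
# For n = m = 2, X encodes the node map {(1, 1), (2, eps), (eps, 2)}.
```

This matrix form is what LSAPE solvers such as FLWC (Chapter 4) optimize over.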
Definition 2.9 (Graph Diameter). The diameter of a graph G ∈ G is defined as diam(G) := max_{u∈V^G} e^G(u).
Definition 2.10 (Index Set [n]). For all n ∈ N, the index set [n] ⊂ N is defined as [n] := {i ∈ N | 1 ≤ i ≤ n}.
Definition 2.11 (Indicator Function δ). The indicator function δ : {false, true} → {0, 1} is defined as δ_true := 1 and δ_false := 0.
Definition 2.12 (Induced Subgraph). Let G ∈ G be a graph and V′ ⊆ V^G be a subset of G's nodes. Then G[V′] := (V′, E^G ∩ (V′ × V′), ℓ_V^G, ℓ_E^G) denotes G's induced subgraph on V′.
Definition 2.13 (Matrices 1_{n,m} and 0_{n,m}, Vectors 1_n and 0_n). For all n, m ∈ N, 1_{n,m} denotes a matrix of ones of size n × m, and 1_n denotes a column vector of ones of size n. The matrix 0_{n,m} and the column vector 0_n are defined analogously.
Definition 2.14 (Maximum Degree). The maximum degree of a graph G ∈ G is defined as max deg(G) := max_{u∈V^G} deg^G(u).
Definition 2.15 (Maximum Matching). For all n, m ∈ N, the set of all maximum matchings between [n] and [m] is defined as follows:

Π_{n,m} := {X ∈ {0, 1}^{n×m} | X 1_m ≤ 1_n ∧ 1_n^T X ≤ 1_m^T ∧ ∑_{i=1}^n ∑_{k=1}^m x_{i,k} = min{n, m}}

The relational representation π ⊆ [n] × [m] of a maximum matching X ∈ Π_{n,m} is defined as π := {(i, k) ∈ [n] × [m] | x_{i,k} = 1}. Depending on the context, we identify Π_{n,m} with the set of all maximum matchings in relational representation.
Definition 2.16 (Multiset Intersection Operator Γ). Let A and B be multisets and c_sub, c_del, c_ins ∈ R be constants. Then the operator Γ is defined as Γ(A, B, c_sub, c_del, c_ins) := c_sub · (min{|A|, |B|} − |A ∩ B|) + c_del · max{|A| − |B|, 0} + c_ins · max{|B| − |A|, 0}.
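Since Γ appears repeatedly in later chapters, a direct transcription may be helpful; the function name and the use of Counter for multisets are our own choices:

```python
from collections import Counter

def gamma(A, B, c_sub, c_del, c_ins):
    """Multiset intersection operator from Definition 2.16; A and B are
    multisets represented as collections.Counter objects."""
    inter = sum((A & B).values())      # |A intersect B|, with multiplicities
    a, b = sum(A.values()), sum(B.values())
    return (c_sub * (min(a, b) - inter)
            + c_del * max(a - b, 0)
            + c_ins * max(b - a, 0))

# one mismatched element is substituted, nothing is deleted or inserted:
cost = gamma(Counter('aab'), Counter('abb'), 1.0, 2.0, 3.0)  # = 1.0
```

The operator thus charges substitutions for elements that could be matched but differ, and deletions or insertions for the surplus on either side.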
Definition 2.17 (Node Degree). Let G ∈ G be a graph. The degree of a node u ∈ V^G in G is defined as deg^G(u) := |N^G(u)|.
Definition 2.18 (Node Distance). Let G ∈ G be a graph. The distance between two nodes u, u′ ∈ V^G in G is defined as d^G(u, u′) := 0 if u = u′, as d^G(u, u′) := min{|P| | P is a path between u and u′} if u ≠ u′ and u and u′ are in the same connected component of G, and as d^G(u, u′) := ∞ if u and u′ are in different connected components of G.
Definition 2.19 (Node Eccentricity). Let G ∈ G be a graph. The eccentricity of a node u ∈ V^G in G is defined as e^G(u) := max_{u′∈V^G} d^G(u, u′).

Definition 2.20 (Node Incidences). Let G ∈ G be a graph. The set of edges that are incident with a node u ∈ V^G in G is denoted as E^G(u) := {(u, u′) ∈ E^G | u′ ∈ N^G(u)}.
Definition 2.21 (Node Neighborhood). Let G ∈ G be a graph. The kth neighborhood of a node u ∈ V^G in G is defined as N_k^G(u) := {u′ ∈ V^G | d^G(u, u′) = k}. The 1st neighborhood is called the neighborhood of node u and abbreviated as N^G(u) := N_1^G(u).

Definition 2.22 (Path). Let G ∈ G be a graph. A walk between two nodes u, u′ ∈ V^G is called a path between u and u′ if and only if no node is encountered more than once.
Definition 2.23 (Set Image and Multiset Image). Let f : X → Y be a function and A ⊆ X be a subset of its domain. The image of A under f is denoted by f[A], and the multiset image of A under f is denoted by f⟦A⟧. For f[X], we also write img(f).
Definition 2.24 (Walk). Let G ∈ G be a graph. An edge sequence ((u_{i_1}, u_{i_2}))_{i=1}^k with (u_{i_1}, u_{i_2}) ∈ E^G for all i ∈ [k] is called a walk of length k between the nodes u_{1_1}, u_{k_2} ∈ V^G if and only if u_{i_2} = u_{(i+1)_1} holds for all i ∈ [k − 1].
2.3 Definitions of LSAP and LSAPE
The linear sum assignment problem (LSAP), also known as the minimum cost maximum matching problem in bipartite graphs, is a classical combinatorial optimization problem [62, 77]. LSAP is defined on complete bipartite graphs K_{n,m} = (U, V, U × V), where U := {u_i | i ∈ [n]} and V := {v_k | k ∈ [m]} are sets of nodes. An edge set π ⊆ U × V is called a maximum matching for K_{n,m} if and only if |π| = min{n, m} and all nodes in U and V are incident with at most one edge in π. A maximum matching π can be encoded with a binary matrix X ∈ Π_{n,m} by setting x_{i,k} := 1 just in case (u_i, v_k) ∈ π (cf. Definition 2.15 above).
Definition 2.25 (LSAP). Given a cost matrix C ∈ R^{n×m}, LSAP consists in the task of finding a maximum matching X⋆ ∈ arg min_{X∈Π_{n,m}} C(X), where C(X) := ∑_{i=1}^n ∑_{k=1}^m c_{i,k} x_{i,k}.
LSAP can be solved in polynomial time and space with several algorithms. For instance, if the cost matrix C is balanced, i. e., if n = m, it can be solved in O(n³) time and O(n²) space with the Hungarian Algorithm [62, 77]. For the unbalanced case, it can be solved in O(n²m) time [26]. See [31, 63] for more details.
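To make Definition 2.25 concrete, the following sketch enumerates all maximum matchings of a balanced instance by brute force; it runs in exponential time and is only meant for illustration, whereas production code would use an O(n³) Hungarian-style solver (e.g. scipy.optimize.linear_sum_assignment):

```python
from itertools import permutations

def lsap_brute_force(C):
    """Minimum-cost maximum matching for a balanced cost matrix C (n = m).

    Exponential-time reference implementation of Definition 2.25: for n = m,
    maximum matchings are exactly the permutations of [n]."""
    n = len(C)
    best = min(permutations(range(n)),
               key=lambda p: sum(C[i][p[i]] for i in range(n)))
    return sum(C[i][best[i]] for i in range(n)), best

cost, matching = lsap_brute_force([[4, 1, 3],
                                   [2, 0, 5],
                                   [3, 2, 2]])
# cost == 5, attained by u1 -> v2, u2 -> v1, u3 -> v3
```

Comparing the six possible matchings of this 3 × 3 instance by hand confirms that 5 is indeed the optimum.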
Given a maximum matching X for K_{n,m} = (U, V, U × V) and a cost matrix C ∈ R^{n×m}, a cell x_{i,k} of X with x_{i,k} = 1 can be interpreted as a substitution of the node u_i ∈ U by the node v_k ∈ V with associated substitution cost c_{i,k}. LSAP can hence be viewed as the task to substitute all the nodes of U by pairwise distinct nodes of V, such that the substitution cost is minimized. Note that, under this interpretation, LSAP does not require all nodes in U and V to be taken care of, as, if n ≠ m, there are always nodes that are not substituted. There are scenarios such as the approximation of GED, where one is faced with a slightly different matching problem: Firstly, in addition to node substitutions, one also wants to allow node insertions and node deletions. Secondly, one wants to enforce that all the nodes in U and V are taken care of. The linear sum assignment problem with error-correction (LSAPE) models these settings. To this purpose, the sets U and V are extended to U_ε := U ∪ {ε} and V_ε := V ∪ {ε}, where ε is a dummy node. An edge set π = S ∪ R ∪ I ⊆ U_ε × V_ε is called an error-correcting matching for K_{n,m,ε} := (U_ε, V_ε, U_ε × V_ε) if and only if all nodes in U and V are incident with exactly one edge in π. The set S ⊆ U × V contains all node substitutions, R ⊆ U × {ε} contains all node removals, and I ⊆ {ε} × V contains all node insertions. An error-correcting matching π can be encoded with a binary matrix X ∈ Π_{n,m,ε} by setting x_{i,k} := 1 just in case (u_i, v_k) ∈ π, x_{i,m+1} := 1 just in case (u_i, ε) ∈ π, and x_{n+1,k} := 1 just in case (ε, v_k) ∈ π (cf. Definition 2.8 above).
Definition 2.26 (LSAPE). Given a cost matrix C ∈ R_{≥0}^{(n+1)×(m+1)}, LSAPE consists in the task of finding an error-correcting matching X⋆ ∈ arg min_{X∈Π_{n,m,ε}} C(X), where C(X) := ∑_{i=1}^{n+1} ∑_{k=1}^{m+1} c_{i,k} x_{i,k}.
The following Proposition 2.1 states that node maps π ∈ Π(G, H) can be identified with error-correcting matchings π ∈ Π_{|V^G|,|V^H|,ε}. By Definition 2.6, it hence implies that each feasible solution π for an LSAPE instance C ∈ R_{≥0}^{(|V^G|+1)×(|V^H|+1)} yields an upper bound c(P_π) for GED(G, H). This fact is used by many exact and approximate algorithms for GED.
Proposition 2.1 (Equivalence of Node Maps and Error-Correcting Matchings). For all graphs G and H, the function f : Π(G, H) → Π_{|V^G|,|V^H|,ε} defined as {(u, v) | (u, v) ∈ π} ↦ {(f^G(u), f^H(v)) | (u, v) ∈ π} is bijective. The function f^G : V^G ∪ {ε} → [|V^G| + 1] is defined as f^G(ε) := |V^G| + 1 and f^G(u_i) := i for all u_i ∈ V^G. The function f^H : V^H ∪ {ε} → [|V^H| + 1] is defined analogously.
Proof. The proposition follows from Definition 2.4 and Definition 2.8.
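In practice, LSAPE instances can be solved by expanding the (n+1) × (m+1) cost matrix into a balanced (n+m) × (n+m) LSAP instance in which row i may be deleted only via column m+i, column k may be inserted only via row n+k, and dummy-to-dummy cells cost 0. The sketch below (our own naming; brute-force enumeration instead of a proper LSAP solver) illustrates this common reduction:

```python
from itertools import permutations

INF = float('inf')

def lsape_via_lsap(C):
    """LSAPE(C) for C of size (n+1) x (m+1) (Definition 2.26), computed by
    expansion to a balanced (n+m) x (n+m) LSAP that is solved by brute force."""
    n, m = len(C) - 1, len(C[0]) - 1
    size = n + m
    D = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for k in range(size):
            if i < n and k < m:
                D[i][k] = C[i][k]                         # substitution
            elif i < n:
                D[i][k] = C[i][m] if k - m == i else INF  # deletion of row i
            elif k < m:
                D[i][k] = C[n][k] if i - n == k else INF  # insertion of col k
            # else: dummy-to-dummy cell, cost 0
    return min(sum(D[i][p[i]] for i in range(size))
               for p in permutations(range(size)))

# two cheap substitutions beat deleting and re-inserting everything:
value = lsape_via_lsap([[1, 9, 4],
                        [9, 1, 4],
                        [5, 5, 0]])  # = 2
```

The expansion preserves optimal values because every error-correcting matching corresponds to exactly one finite-cost permutation of the enlarged instance.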
2.4 Test Datasets and Edit Costs
To evaluate the techniques proposed in this thesis, we ran tests on the datasets aids, muta, protein, letter (h), grec, and fingerprint from the IAM Graph Database Repository [84, 87], and on the datasets alkane, acyclic, mao, and pah from GREYC's Chemistry Dataset.1 All of these datasets are widely used in the research community [22, 25, 32, 41, 49, 83, 91, 100, 102, 112]. The graphs contained in the datasets aids, muta, mao, pah, alkane, and acyclic represent molecular compounds, which, in the case of aids, muta, mao, and pah, are divided into two classes: molecules that do or do not exhibit activity against HIV (aids), molecules that do or do not cause genetic mutation (muta), molecules that do or do not inhibit the monoamine oxidase (mao), and molecules that are or are not cancerous (pah). The dataset protein contains graphs which represent proteins and are annotated with their EC classes from the BRENDA enzyme database [98]. The graphs contained in the dataset letter (h) represent highly distorted drawings of the capital letters A, E, F, H, I, K, L, M, N, T, V, W, X, Y, and Z. The graphs contained in the dataset grec represent 22 different symbols from electronic and architectural drawings, which consist of geometric primitives such as lines, arcs, and circles. The graphs contained in the dataset fingerprint represent fingerprint images, annotated with their classes from the Galton-Henry classification system [56]. Table 2.3 and Table 2.4 summarize the most important properties of the test datasets. The dataset muta-<100 contains all graphs from muta with fewer than 100 nodes. We included this dataset in the tables, because we noticed that, in all graphs contained in muta, all nodes whose ID is greater than or equal to 100 are isolated. We then contacted the main author of the IAM Graph
1Available at https://iapr-tc15.greyc.fr/links.html.
Table 2.3. Properties of test datasets.
dataset       node/edge labels   avg. (max.) |V^G|   avg. (max.) |E^G|   classes
IAM Graph Database Repository
aids          yes/yes            15.7 (95)           16.2 (103)          2
muta          yes/yes            30.3 (417)          30.8 (112)          2
muta-<100     yes/yes            29.2 (98)           30.1 (105)          2
protein       yes/yes            32.6 (126)          62.1 (149)          6
letter (h)    yes/no             4.7 (9)             4.5 (9)             15
grec          yes/yes            11.5 (26)           12.2 (30)           22
fingerprint   no/yes             5.4 (26)            4.4 (25)            4
GREYC's Chemistry Dataset
mao           yes/yes            18.4 (27)           19.6 (29)           2
pah           yes/yes            20.7 (28)           24.4 (34)           2
alkane        yes/yes            8.9 (10)            7.9 (9)             1
acyclic       yes/yes            8.2 (11)            7.2 (10)            1
Table 2.4. Further properties of test datasets.
dataset       avg. (max.) components   acyclic graphs   planar graphs
IAM Graph Database Repository
aids          1.12 (7)                 21 %             100 %
muta          1.61 (319)               17 %             100 %
muta-<100     1.08 (5)                 17 %             100 %
protein       1.24 (84)                1 %              41 %
letter (h)    1.08 (3)                 35 %             100 %
grec          1.52 (3)                 22 %             100 %
fingerprint   1.01 (2)                 99 %             100 %
GREYC's Chemistry Dataset
mao           1.00 (1)                 0 %              100 %
pah           1.00 (1)                 0 %              100 %
alkane        1.00 (1)                 100 %            100 %
acyclic       1.00 (1)                 100 %            100 %
Database Repository [84, 87] and he confirmed our suspicion that this is probably an error in the data. However, since more than 99 % of the graphs contained in muta have fewer than 100 nodes, this error should not have a big impact on the experiments conducted on muta. Table 2.3 shows that all datasets contain sparse, small to medium-sized graphs. More details about the topology of the graphs are given in Table 2.4. We see that all datasets except for protein contain only planar graphs. Moreover, the datasets from the IAM Graph Database Repository contain graphs that are usually, but not always, connected and cyclic. The datasets from GREYC's Chemistry Dataset can be categorized more clearly: All graphs in all datasets are connected, all graphs contained in alkane and acyclic are acyclic (i. e., trees), and all graphs contained in mao and pah are cyclic. In [2, 84], edit cost functions are defined for all test datasets. In [1], slightly different edit costs are defined for the datasets aids, muta, mao, pah, alkane, and acyclic, which contain graphs that model molecular compounds. Both kinds of edit costs are used for experiments in this thesis. Because of this, we briefly present them in the following Sections 2.4.1 to 2.4.5.
2.4.1 Edit Costs for aids, muta, mao, pah, alkane, and acyclic
The nodes of the graphs contained in aids, muta, mao, pah, alkane, and acyclic are labeled with chemical symbols. Edges are labeled with a valence (either 1 or 2). In [2, 84], node edit costs are defined as
c_V(α, α′) := 5.5 · δ_{α≠α′}
c_V(α, ε) := 2.75
c_V(ε, α′) := 2.75,
for all (α, α′) ∈ Σ_V × Σ_V. Similarly, edge edit costs are defined as
c_E(β, β′) := 1.65 · δ_{β≠β′}
c_E(β, ε) := 0.825
c_E(ε, β′) := 0.825,
for all (β, β′) ∈ Σ_E × Σ_E. In [1], slightly different edit costs are suggested. Given constants (c_V^s, c_V^d, c_V^i, c_E^s, c_E^d, c_E^i) ∈ R_{>0}^6, node edit costs are defined as
c_V(α, α′) := c_V^s · δ_{α≠α′}
c_V(α, ε) := c_V^d
c_V(ε, α′) := c_V^i,
for all (α, α′) ∈ Σ_V × Σ_V. Edge edit costs are defined as
c_E(β, β′) := c_E^s · δ_{β≠β′}
c_E(β, ε) := c_E^d
c_E(ε, β′) := c_E^i,
for all (β, β′) ∈ Σ_E × Σ_E. The constants are suggested to be chosen as (c_V^s, c_V^d, c_V^i, c_E^s, c_E^d, c_E^i) := (2, 4, 4, 1, 1, 1), (c_V^s, c_V^d, c_V^i, c_E^s, c_E^d, c_E^i) := (2, 4, 4, 1, 2, 2), or (c_V^s, c_V^d, c_V^i, c_E^s, c_E^d, c_E^i) := (6, 2, 2, 3, 1, 1).
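Both cost families above are simple enough to implement directly. The following sketch (the function and parameter names are ours; None plays the role of the dummy label ε) returns the node and edge cost functions for a given constant tuple, defaulting to (2, 4, 4, 1, 1, 1):

```python
def make_chem_costs(cvs=2.0, cvd=4.0, cvi=4.0, ces=1.0, ced=1.0, cei=1.0):
    """Constant edit costs for the molecule datasets; the defaults give the
    first suggested constant tuple, and (5.5, 2.75, 2.75, 1.65, 0.825, 0.825)
    reproduces the costs from [2, 84]."""
    def c_v(alpha, alpha_prime):
        if alpha is None:
            return cvi                               # node insertion
        if alpha_prime is None:
            return cvd                               # node deletion
        return cvs if alpha != alpha_prime else 0.0  # substitution
    def c_e(beta, beta_prime):
        if beta is None:
            return cei                               # edge insertion
        if beta_prime is None:
            return ced                               # edge deletion
        return ces if beta != beta_prime else 0.0    # substitution
    return c_v, c_e

c_v, c_e = make_chem_costs()
# c_v('C', 'N') == 2.0, c_v('C', None) == 4.0, c_e(1, 1) == 0.0
```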
2.4.2 Edit Costs for protein
The nodes of the graphs contained in protein are labeled with tuples (t, s), where t is the node's type (helix, sheet, or loop) and s is its amino acid sequence. Nodes are connected via structural or sequential edges or both, i. e., edges (u_i, u_j) are labeled with tuples (t_1, t_2), where t_1 is the type of the first edge connecting u_i and u_j and t_2 is the type of the second edge connecting u_i and u_j (possibly null). In [2, 84], node edit costs are defined as
c_V(α, α′) := 16.5 · δ_{α.t≠α′.t} + 0.75 · δ_{α.t=α′.t} · LD(α.s, α′.s)
c_V(α, ε) := 8.25
c_V(ε, α′) := 8.25,
for all (α, α′) ∈ Σ_V × Σ_V, where LD(·, ·) is Levenshtein's string edit distance. Edge edit costs are defined as
c_E(β, β′) := 0.25 · LSAPE(C^{β,β′})
c_E(β, ε) := 0.25 · f(β)
c_E(ε, β′) := 0.25 · f(β′),
for all (β, β′) ∈ Σ_E × Σ_E, where f(β) := 1 + δ_{β.t_2≠null} and C^{β,β′} ∈ R^{(f(β)+1)×(f(β′)+1)} is constructed as c_{r,s}^{β,β′} := 2 · δ_{β.t_r≠β′.t_s} and c_{r,f(β′)+1}^{β,β′} := c_{f(β)+1,s}^{β,β′} := 1 for all (r, s) ∈ [f(β)] × [f(β′)].
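The edge substitution cost for protein thus hides a tiny LSAPE instance over the at most two edge types on either side. A brute-force sketch (our naming; edge labels are tuples (t1, t2) with t2 possibly None) of c_E(β, β′):

```python
from itertools import product

def protein_edge_sub(beta, beta_prime, scale=0.25):
    """c_E(beta, beta') = scale * LSAPE(C^{beta,beta'}), with the tiny LSAPE
    instance from the definition above solved by enumeration: substitutions
    cost 2 on type mismatch, deletions and insertions cost 1."""
    t = [x for x in beta if x is not None]
    tp = [x for x in beta_prime if x is not None]
    f, fp = len(t), len(tp)
    C = [[2.0 * (t[r] != tp[s]) for s in range(fp)] + [1.0] for r in range(f)]
    C.append([1.0] * fp + [0.0])
    best = float('inf')
    for assign in product(range(fp + 1), repeat=f):  # fp = dummy column
        real = [s for s in assign if s < fp]
        if len(real) != len(set(real)):
            continue                                 # a real column matched twice
        cost = sum(C[r][assign[r]] for r in range(f))
        cost += sum(C[f][s] for s in range(fp) if s not in real)
        best = min(best, cost)
    return scale * best

# identical labels cost nothing; a superfluous second edge type costs 0.25:
same = protein_edge_sub(('structural', None), ('structural', None))   # = 0.0
extra = protein_edge_sub(('structural', 'sequential'),
                         ('structural', None))                        # = 0.25
```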
2.4.3 Edit Costs for letter (h)
The nodes of the graphs contained in letter (h) are labeled with two-dimensional Euclidean coordinates. Edges are unlabeled. In [2, 84], node edit costs are defined as
c_V(α, α′) := 0.75 · ‖α − α′‖
c_V(α, ε) := 0.675
c_V(ε, α′) := 0.675,

for all (α, α′) ∈ Σ_V × Σ_V, where ‖·‖ is the Euclidean norm. The edge edit costs c_E are defined as c_E(1, ε) := c_E(ε, 1) := 0.425.
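These costs make it easy to recompute the running example from Section 2.1. The sketch below (the helper names and the generic induced-cost routine are ours) evaluates the node map π from Figure 2.1 and reproduces c(P_π) ≈ 2.623179:

```python
import math

SUB_FACTOR, NODE_DEL_INS, EDGE_DEL_INS = 0.75, 0.675, 0.425

# coordinates from Figure 2.1; u5 is deleted, so its coordinates are not needed
G_nodes = {'u1': (0.69, 0.27), 'u2': (1.40, 1.85),
           'u3': (2.55, 0.45), 'u4': (0.93, 1.37), 'u5': None}
H_nodes = {'v1': (0.92, 0.32), 'v2': (1.76, 1.81),
           'v3': (2.30, 0.21), 'v4': (0.92, 0.85)}
G_edges = {frozenset(e) for e in [('u1', 'u2'), ('u2', 'u3'), ('u4', 'u5')]}
H_edges = {frozenset(e) for e in [('v1', 'v2'), ('v2', 'v3'), ('v3', 'v4')]}
pi = {'u1': 'v1', 'u2': 'v2', 'u3': 'v3', 'u4': 'v4', 'u5': None}

def induced_cost(pi):
    cost = 0.0
    for u, v in pi.items():                      # node operations
        cost += NODE_DEL_INS if v is None else \
            SUB_FACTOR * math.dist(G_nodes[u], H_nodes[v])
    cost += NODE_DEL_INS * sum(v not in pi.values() for v in H_nodes)
    substituted = set()
    for e in G_edges:                            # edge deletions (subs cost 0)
        image = {pi[u] for u in e}
        if None in image or frozenset(image) not in H_edges:
            cost += EDGE_DEL_INS
        else:
            substituted.add(frozenset(image))
    cost += EDGE_DEL_INS * sum(e not in substituted for e in H_edges)
    return cost

total = induced_cost(pi)  # ~ 2.623179, matching Example 2.1
```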
2.4.4 Edit Costs for grec
The nodes of the graphs contained in grec are labeled with tuples (t, x, y), where t equals one of four node types and (x, y) is a two-dimensional Euclidean coordinate. Nodes are connected via line or arc edges or both, i. e., edges (u_i, u_j) are labeled with tuples (t_1, t_2), where t_1 is the type of the first edge connecting u_i and u_j and t_2 is the type of the second edge connecting u_i and u_j (possibly null). In [2, 84], node edit costs are defined as

c_V(α, α′) := 0.5 · ‖α.(x, y) − α′.(x, y)‖ · δ_{α.t=α′.t} + 90 · δ_{α.t≠α′.t}
c_V(α, ε) := 45
c_V(ε, α′) := 45,

for all (α, α′) ∈ Σ_V × Σ_V. Edge edit costs are defined as

c_E(β, β′) := 0.5 · LSAPE(C^{β,β′})
c_E(β, ε) := 0.5 · f(β)
c_E(ε, β′) := 0.5 · f(β′),

for all (β, β′) ∈ Σ_E × Σ_E, where f(β) := 1 + δ_{β.t_2≠null} and C^{β,β′} ∈ R^{(f(β)+1)×(f(β′)+1)} is constructed as c_{r,s}^{β,β′} := 30 · δ_{β.t_r≠β′.t_s} and c_{r,f(β′)+1}^{β,β′} := c_{f(β)+1,s}^{β,β′} := 15 for all (r, s) ∈ [f(β)] × [f(β′)].
2.4.5 Edit Costs for fingerprint
The nodes of the graphs contained in fingerprint are unlabeled, and edges are labeled with an orientation β ∈ R with −π/2 < β ≤ π/2. In [2, 84], node edit costs are defined as c_V(1, ε) := c_V(ε, 1) := 0.525. Edge edit costs are defined as

c_E(β, β′) := 0.5 · min{|β − β′|, π − |β − β′|}
c_E(β, ε) := 0.375
c_E(ε, β′) := 0.375,

for all (β, β′) ∈ Σ_E × Σ_E.
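Note that the substitution cost treats orientations as undirected angles, so the distance wraps around at π rather than 2π. A short sketch (our naming):

```python
import math

def fingerprint_edge_sub(beta, beta_prime):
    """0.5 * min(|b - b'|, pi - |b - b'|): nearly opposite orientations such
    as pi/2 - 0.1 and -pi/2 + 0.1 describe almost the same ridge direction
    and are therefore close."""
    d = abs(beta - beta_prime)
    return 0.5 * min(d, math.pi - d)

cost = fingerprint_edge_sub(math.pi / 2 - 0.1, -math.pi / 2 + 0.1)  # ~ 0.1
```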
Theoretical Aspects
In this chapter, we summarize known results that characterize the problem of computing GED in terms of complexity and extend these results to settings where the input graphs are very simple. Furthermore, we present a new compact reduction from GED to the quadratic assignment problem (QAP) for quasimetric edit costs. While existing reductions from GED to QAP either use very large QAP instances [21, 22, 82] or reduce GED to non-standard versions of QAP [25], the newly proposed reduction yields small instances of the standard version of QAP. The practical consequence of this is that, with the new reduction, a wide range of methods that were originally defined for the standard version of QAP can efficiently be used for computing GED. The third technical contribution presented in this chapter consists in the harmonization of the two alternative Definitions 2.3 (used in the database community) and 2.6 (used in the pattern recognition community) of GED. In [60], it has already been shown that Definition 2.3 and Definition 2.6 coincide if the underlying edit cost functions are metric. We here extend this result to general edit costs. The results presented in this chapter have previously been published in the following articles:
– D. B. Blumenthal and J. Gamper, "On the exact computation of the graph edit distance", Pattern Recognit. Lett., 2018, in press. doi: 10.1016/j.patrec.2018.05.002
– D. B. Blumenthal, É. Daller, S. Bougleux, L. Brun, and J. Gamper, "Quasimetric graph edit distance as a compact quadratic assignment problem", in ICPR 2018, IEEE Computer Society, 2018, pp. 934–939. doi: 10.1109/ICPR.2018.8546055
The remainder of this chapter is organized as follows: In Section 3.1, the problem of computing GED is characterized in terms of complexity, and existing reductions from GED to QAP are summarized. In Section 3.2, it is shown that GED is hard to compute and approximate also on very simple graphs, namely, forests of undirected paths. In Section 3.3, the new compact reduction is presented. In Section 3.4, the two definitions of GED are harmonized. In Section 3.5, the effect of the newly proposed reduction on QAP based heuristics is empirically evaluated. Section 3.6 concludes the chapter.
3.1 State of the Art
In this section, we provide an overview of known results related to theoretical aspects of computing GED. In Section 3.1.1, we summarize results that characterize the problem of computing GED in terms of complexity. In Section 3.1.2, existing reductions from GED to QAP are presented.
3.1.1 Complexity of GED Computation
We begin with characterizing the problem of computing GED in terms of complexity. More specifically, four propositions are provided, which state that GED is not only hard to compute, but also hard to approximate. All of these propositions are known results from the literature or follow immediately from simple observations. For the sake of completeness, we nonetheless include short sketches of their proofs.
Proposition 3.1 (Hardness of Uniform GED Computation). Exactly computing GED is NP-hard even for uniform edit costs.
Proof. This result was established by Zeng, Tung, Wang, Feng, and Zhou [111] via a polynomial reduction from the subgraph isomorphism problem (SGI), which is known to be NP-complete. SGI is defined as follows: Given two unlabeled, undirected graphs G and H, determine whether G is subgraph isomorphic to H, i. e., whether there is an injection f : V^G → V^H such that G ≃ H[f[V^G]]. To this purpose, G and H are transformed into labeled graphs G′ and H′ by assigning the label 1 to each node and edge, and the edit cost functions are defined to be uniform, i. e., as c_V(α′, α″) := δ_{α′≠α″} and c_E(β, β′) := δ_{β≠β′}. It can easily be verified that G is subgraph isomorphic to H if and only if |V^G| ≤ |V^H|, |E^G| ≤ |E^H|, and GED(G′, H′) = (|V^H| − |V^G|) + (|E^H| − |E^G|). This proves the proposition.
The next question which naturally arises is whether there is an α-approximation algorithm for GED for some approximation guarantee α. Recall that an α-approximation algorithm for GED is an algorithm that runs in polynomial time and returns an edit path P ∈ Ψ(G, H) with cost c(P) ≤ α · GED(G, H).
Proposition 3.2 (Non-Approximability of Non-Metric GED). If arbitrary, non-metric edit cost functions are allowed, there is no α-approximation algorithm for GED for any α, unless P = NP.
Proof. For proving this proposition, it suffices to slightly modify the proof of Proposition 3.1. Given an instance (G, H) of SGI, like before, we define labeled graphs G′ and H′ by assigning the label 1 to each node and edge. However, we now define the edit cost functions as c_V(α′, α″) := δ_{(α′,α″)=(ε,1)} and c_E(β, β′) := δ_{(β,β′)=(ε,1)}, such that only node and edge insertions incur a non-zero edit cost. It is easy to see that, with this construction, G is subgraph isomorphic to H just in case GED(G′, H′) = 0. But then, any α-approximation algorithm for GED can decide SGI, which implies the desired result.
Proposition 3.3 (Non-Approximability of Metric GED). Even if non-metric edit cost functions are forbidden, there is no polynomial-time approximation scheme for GED, i. e., there is a constant c ∈ R_{>0} such that there is no (1 + ε)-approximation algorithm for GED for any 0 < ε < c, unless P = NP.
Proof. This result follows from a work by Lin [70], who proved the same result for the graph transformation problem (GT). In GT, we are given two unlabeled, undirected graphs G and H with |V^G| = |V^H| and |E^G| = |E^H| =: m, and are asked to determine GT(G, H) ∈ N, which is defined as the smallest k ∈ N such that G can be made isomorphic to H by reallocating k edges of G, i. e., by deleting k existing edges from and adding k new edges to G. Note that GT(G, H) ≤ m holds by definition of GT. Let c ∈ R_{>0} be a constant such that, unless P = NP, there is no (1 + ε)-approximation algorithm for GT for any ε ∈ (0, c). In [70], it has been shown that such a constant exists. Like above, let G′ and H′ denote the labeled graphs obtained from G and H by assigning the label 1 to all nodes and edges. Moreover, let the edit cost functions be defined as c_V(α, α′) := (1 + c) · m · δ_{α≠α′} and c_E(β, β′) := δ_{β≠β′}/2. Then each k-sized set of edge reallocations that transforms G into H corresponds to an edit path P ∈ Ψ(G′, H′) without node insertions and deletions and with cost c(P) = k, and vice versa. This implies GED(G′, H′) ≤ GT(G, H). Moreover, we know that an optimal edit path P ∈ Ψ(G′, H′) does not contain any node insertions or deletions, as this would imply GED(G′, H′) = c(P) ≥ (1 + c) · m > GT(G, H) ≥ GED(G′, H′). We therefore have GED(G′, H′) = GT(G, H). Now assume that there is an ε ∈ (0, c) such that there is a (1 + ε)-approximation algorithm ALG for GED with metric edit costs. Let P ∈ Ψ(G′, H′) be the edit path between G′ and H′ computed by ALG. We have c(P) ≤ (1 + ε) · GED(G′, H′) ≤ (1 + ε) · m < (1 + c) · m, from which we can conclude that P does not contain node insertions or deletions. But then P corresponds to a c(P)-sized set of edge reallocations that transforms G into H, which implies that ALG is a (1 + ε)-approximation algorithm for GT. This contradicts the result established in [70].
Proposition 3.4 (Non-Approximability of Uniform GED). Even if only uni- form edit cost functions are allowed, there is no α-approximation algorithm for GED for any α, unless the graph isomorphism problem (GI) is in P.
Proof. Recall that GI is defined as the task to determine whether two unlabeled, undirected graphs G and H are isomorphic to each other. Like in the proof of Proposition 3.1, we transform G and H into labeled graphs G′ and H′ by assigning the label 1 to each node and edge, and consider uniform edit cost functions c_V and c_E, i. e., c_V(α′, α″) := δ_{α′≠α″} and c_E(β, β′) := δ_{β≠β′}. With this construction, we have G ≃ H just in case GED(G′, H′) = 0, which implies that any α-approximation algorithm for (uniform) GED can decide GI and hence proves the statement of the proposition.
The question whether GI ∈ P has been open for decades; the fastest currently available algorithm runs in quasi-polynomial time [6]. In fact, GI is a prime candidate for an NP-intermediate problem that is neither in P nor NP-complete. In view of this and because of Proposition 3.4, it is unrealistic to hope for an α-approximation algorithm for any variant of GED. Rather, when it comes to approximating GED, all one can do is to design heuristics that compute lower or upper bounds and empirically test the tightness of the produced bounds.
3.1.2 Reductions to QAP and QAPE
We present two reductions of GED to, respectively, the quadratic assignment problem (QAP) and the quadratic assignment problem with error-correction (QAPE). QAP and QAPE are defined as follows:
Definition 3.1 (QAP). Given natural numbers n, m ∈ N and a real-valued matrix C ∈ R^{(n·m)×(n·m)}, the quadratic assignment problem (QAP) consists in the task to compute a maximum matching X⋆ := arg min{Q(X, C) | X ∈ Π_{n,m}} between [n] and [m] that minimizes the quadratic cost Q(X, C) := vec(X)^T C vec(X), where vec is a matrix-vectorization operator. The cost Q(X⋆, C) of an optimal solution X⋆ ∈ Π_{n,m} for C is denoted by QAP(C).
Definition 3.2 (QAPE). Given natural numbers n, m ∈ N and a real-valued matrix C ∈ R^{((n+1)·(m+1))×((n+1)·(m+1))}, the quadratic assignment problem with error-correction (QAPE) consists in the task to compute an error-correcting matching X⋆ := arg min{Q(X, C) | X ∈ Π_{n,m,ε}} between [n] and [m] that minimizes the quadratic cost Q(X, C) := vec(X)^T C vec(X), where vec is a matrix-vectorization operator. The cost Q(X⋆, C) of an optimal solution X⋆ ∈ Π_{n,m,ε} for C is denoted by QAPE(C).
We first present the reduction from GED to QAP, which has been proposed independently in [82] and [21, 22] (Section 3.1.2.1). Subsequently, the reduction to QAPE suggested in [25] is presented (Section 3.1.2.2). The main practical use of the reductions is that they render well-performing heuristics that have been designed for QAP applicable to GED, in particular, the integer fixed point method suggested in [67]. Adaptions of this heuristic to GED are presented below in Section 3.5 and Section 6.1.3. For the remainder of this section, we fix graphs G, H ∈ G as well as edit cost functions c_V and c_E, define n := |V^G| and m := |V^H|, and assume w. l. o. g. that V^G = [n] and V^H = [m].
3.1.2.1 Baseline Reduction to QAP
The reduction from GED to QAP proposed in [82] and [21, 22] works on enlarged graphs (V^{G+ε}, E^G) and (V^{H+ε}, E^H), where V^{G+ε} := V^G ∪ ℰ^G, V^{H+ε} := V^H ∪ ℰ^H, and ℰ^G := {ε_1 := n + 1, . . . , ε_m := n + m} and ℰ^H := {ε_1 := m + 1, . . . , ε_n := m + n} contain m and n isolated dummy nodes, respectively. Let X ∈ Π_{(n+m),(m+n)} be a maximum matching between V^{G+ε} and V^{H+ε}. X can be written in the following form:

             V^H    ℰ^H
V^G   [ X^s    X^d ]
ℰ^G   [ X^i    X^0 ]
The north-west quadrant X^s encodes node substitutions, the south-west quadrant X^i encodes node insertions, the north-east quadrant X^d encodes node deletions, and the south-east quadrant X^0 encodes assignments of dummy nodes to dummy nodes. Since nodes can be inserted and removed only once, assignments x_{i,ε_j} = 1 and x_{ε_l,k} = 1 with i ≠ j and k ≠ l are forbidden. The notation i ↛ k is used to denote that assigning i ∈ V^{G+ε} to k ∈ V^{H+ε} is forbidden. Next, auxiliary matrices C^V, C^E ∈ R^{(n+m)²×(n+m)²} are constructed. For the ease of notation, we use two-dimensional indices (i, k) ∈ V^{G+ε} × V^{H+ε} for referring to their elements, and a special vectorization operator vec that first concatenates the columns of V^G × V^H, then the columns of ℰ^G × V^H, followed by the columns of V^G × ℰ^H, and the columns of ℰ^G × ℰ^H. C^V contains the costs of the node edit operations. It is defined as

c^V_{(i,k),(j,l)} := 0 if i ≠ j ∨ k ≠ l ∨ i ↛ k, and c^V_{(i,k),(j,l)} := c′_V(i, k) otherwise,   (3.1)

where c′_V is defined as follows:
c′_V(i, k) := δ_{i∈V^G} δ_{k∈V^H} c_V(i, k) + (1 − δ_{i∈V^G}) δ_{k∈V^H} c_V(ε, k) + δ_{i∈V^G} (1 − δ_{k∈V^H}) c_V(i, ε)   (3.2)
C^E contains the costs of the edge edit operations. It is defined as

c^E_{(i,k),(j,l)} := ω if i ↛ k ∨ j ↛ l, and c^E_{(i,k),(j,l)} := c′_E(i, k, j, l) otherwise,   (3.3)

where ω can be set to any upper bound for GED(G, H) and c′_E is defined as follows:

c′_E(i, k, j, l) := δ_{(i,j)∈E^G} δ_{(k,l)∈E^H} c_E((i, j), (k, l)) + (1 − δ_{(i,j)∈E^G}) δ_{(k,l)∈E^H} c_E(ε, (k, l)) + δ_{(i,j)∈E^G} (1 − δ_{(k,l)∈E^H}) c_E((i, j), ε)   (3.4)
The following theorem provides a reduction from GED to QAP:
Theorem 3.1 (Baseline Reduction to QAP). Let C ∈ R^{(n+m)²×(n+m)²} be defined as C := ½ C^E + C^V, where C^V is defined as specified in equation (3.1) and C^E is defined as specified in equation (3.3). Then it holds that GED(G, H) = QAP(C).
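The reduction can be sanity-checked on tiny instances by enumerating all permutations of the enlarged node sets. The sketch below (all names are ours) instantiates equations (3.1)–(3.4) for unlabeled graphs with uniform costs, where G is a path on two nodes and H is a triangle; the minimum quadratic cost then equals GED(G, H) = 3 (one node insertion and two edge insertions):

```python
from itertools import permutations

n, m = 2, 3                                # G: path 1-2, H: triangle 1-2-3
EG = {(1, 2)}
EH = {(1, 2), (2, 3), (1, 3)}
OMEGA = 100.0                              # any upper bound on GED(G, H)

def has_edge(E, i, j):
    return (min(i, j), max(i, j)) in E

def allowed(i, k):
    """Real nodes may only be deleted/inserted via 'their own' dummy node."""
    if i <= n and k <= m:
        return True
    if i <= n:
        return k - m == i                  # deletion of node i
    if k <= m:
        return i - n == k                  # insertion of node k
    return True                            # dummy to dummy

def c_v_prime(i, k):                       # equation (3.2), uniform costs
    if i <= n and k <= m:
        return 0.0                         # substitution of equal labels
    return 1.0 if (i <= n) != (k <= m) else 0.0

def c_e_prime(i, k, j, l):                 # equation (3.4), uniform costs
    g = i <= n and j <= n and has_edge(EG, i, j)
    h = k <= m and l <= m and has_edge(EH, k, l)
    return 1.0 if g != h else 0.0          # matched edges have equal labels

def c(i, k, j, l):                         # C = C^V + C^E / 2
    cv = c_v_prime(i, k) if (i, k) == (j, l) and allowed(i, k) else 0.0
    ce = c_e_prime(i, k, j, l) if allowed(i, k) and allowed(j, l) else OMEGA
    return cv + 0.5 * ce

size = n + m
qap = min(sum(c(i, p[i - 1], j, p[j - 1])
              for i in range(1, size + 1) for j in range(1, size + 1))
          for p in permutations(range(1, size + 1)))
# qap == 3.0 == GED(G, H)
```

The factor ½ in front of C^E compensates for each undirected edge pair being counted twice in the quadratic sum.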
3.1.2.2 Reduction to QAPE
The reduction from GED to QAPE suggested in [25] constructs auxiliary matrices C^V, C^E ∈ R^{((n+1)·(m+1))×((n+1)·(m+1))}. Like above, C^V is defined as specified in equation (3.1). The definition of C^E is slightly different, since there are no forbidden node assignments. So we have

c^E_{(i,k),(j,l)} := c′_E(i, k, j, l),   (3.5)
where c′_E is defined as specified in equation (3.4). The following theorem provides a reduction from GED to QAPE:

Theorem 3.2 (Reduction to QAPE). Let C′ ∈ R^{((n+1)·(m+1))×((n+1)·(m+1))} be defined as C′ := ½ C^E + C^V, where C^V is defined as specified in equation (3.1) and C^E is defined as specified in equation (3.5). Then it holds that GED(G, H) = QAPE(C′).
The advantage of this reduction to QAPE w. r. t. the baseline reduction to QAP presented in the previous section is that the constructed QAPE instance is significantly smaller than the constructed QAP instance. The drawback is that QAPE is a non-standard variant of QAP. Hence, many methods that can be used off-the-shelf with the reduction to QAP provided by Theorem 3.1 have to be adapted manually if the reduction to QAPE provided by Theorem 3.2 is employed. This requires a thorough knowledge of matching theory on the side of the implementer.
3.2 Hardness on Very Sparse Graphs
In Section 3.1.1 above, we summarized results which show that GED is hard to compute and approximate even if we restrict to special classes of edit cost functions. In this section, we discuss a related question: What if, instead of restricting to special classes of edit cost functions, we restrict to special classes of graphs? Does this reduce the complexity of computing or approximating GED? At first glance, there seem to be good reasons to assume that this is the case. For instance, on planar graphs, GI is known to be polynomially solvable [38], and solving SGI (the problem of determining whether G is subgraph isomorphic to H) is linear in the size |V^H| of the target graph [43] and even polynomial in the size |V^G| of the pattern for special classes of planar graphs [57]. This is especially relevant, because the proofs of Propositions 3.1 and 3.2 given in Section 3.1.1 above employ reductions from SGI. Another reason for conjecturing that computing GED might be tractable on simple graphs is that the tree edit distance (TED) can be computed in polynomial time on ordered trees [7]. TED is similar to GED in that it is a distance measure defined in terms of edit operations and edit costs, namely, insertions, deletions, and substitutions of nodes. However, edges are not edited directly. Rather, they are inserted and deleted simultaneously when editing the nodes, in order to maintain the tree structure at each step of the edit sequence. In spite of the similarities, TED and GED on trees are hence different distance measures, i. e., we in general have TED(G, H) ≠ GED(G, H), even if both G and H are trees. The following Propositions 3.5 and 3.6 show that, despite these prima facie arguments to support the contrary, computing and approximating GED is NP-hard even on very sparse graphs.
More precisely, Proposition 3.5 shows that computing GED is NP-hard even if we restrict to unlabeled graphs with maximum degree two and uniform edit cost functions. Proposition 3.6 demonstrates that, if we allow arbitrary, non-metric edit cost functions, then there is no α-approximation algorithm for GED on unlabeled graphs with maximum degree two, unless P = NP. In other words, the two propositions show that Propositions 3.1 and 3.2 remain valid on forests of unlabeled paths.
Proposition 3.5 (Hardness of Uniform GED Computation on Very Sparse Graphs). Exactly computing GED on unlabeled graphs with maximum degree two is NP-hard even for uniform edit costs.
Proof. We prove the proposition via a polynomial reduction from the 3-partition problem (3-PARTITION), which is known to be strongly NP-hard [108]. Given a natural number B ∈ N and natural numbers (a_i)_{i=1}^{3n} with ∑_{i=1}^{3n} a_i = n · B and B/4 < a_i < B/2 for all i ∈ [3n], 3-PARTITION asks to decide whether the index set [3n] can be partitioned into blocks ({i_{k_1}, i_{k_2}, i_{k_3}})_{k=1}^{n} such that ∑_{l=1}^{3} a_{i_{k_l}} = B holds for all k ∈ [n].
Given an instance (B, (a_i)_{i=1}^{3n}) of 3-PARTITION, we construct two graphs G and H as forests of paths without node or edge labels. For each i ∈ [3n], G contains a path p_i^G on a_i nodes. Similarly, for each k ∈ [n], H contains a path p_k^H on B nodes. The edit cost functions c_V and c_E are assumed to be uniform. Note that this construction is polynomial in the size of the 3-PARTITION instance, because 3-PARTITION is not only NP-hard but strongly NP-hard. By construction, both G and H have n · B nodes, G has n · B − 3n edges, and H has n · B − n edges. Hence, at least 2n edges have to be inserted for transforming G into H, which implies GED(G, H) ≥ 2n. We claim that GED(G, H) = 2n if and only if (B, (a_i)_{i=1}^{3n}) is a yes-instance of 3-PARTITION. This claim proves the proposition.
First assume that (B, (a_i)_{i=1}^{3n}) is a yes-instance of 3-PARTITION. Let ({i_{k_1}, i_{k_2}, i_{k_3}})_{k=1}^{n} be a partition of the index set [3n] such that ∑_{l=1}^{3} a_{i_{k_l}} = B holds for all k ∈ [n]. For all k ∈ [n], we insert two edges into G: one between the last node of p_{i_{k_1}}^G and the first node of p_{i_{k_2}}^G, the other one between the last node of p_{i_{k_2}}^G and the first node of p_{i_{k_3}}^G. After these edit operations, G consists of n paths on B nodes each and is hence isomorphic to H. We have thus found an edit path P between G and H with c(P) = 2n, which, together with GED(G, H) ≥ 2n, proves the first direction of the claim.
For proving the other direction, assume that there is an edit path P between G and H with c(P) = 2n. Then P can contain only edge insertions between start or terminal nodes of some of the paths p_i^G. More precisely, since B/4 < a_i < B/2, we know that there are exactly n paths (p_{i_{k_2}}^G)_{k=1}^{n} such that P joins the first node of p_{i_{k_2}}^G to the last node of another path p_{i_{k_1}}^G and the last node of p_{i_{k_2}}^G to the first node of yet another path p_{i_{k_3}}^G. Moreover, each concatenated path (p_{i_{k_1}}^G, p_{i_{k_2}}^G, p_{i_{k_3}}^G) has B nodes. Hence, ({i_{k_1}, i_{k_2}, i_{k_3}})_{k=1}^{n} is a partition of the index set [3n] that satisfies ∑_{l=1}^{3} a_{i_{k_l}} = B for all k ∈ [n].
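The reduction above is easy to make concrete. The following sketch (ours, not part of the thesis; function names are hypothetical) builds the forests G and H from a 3-PARTITION instance and checks the node and edge counts used in the proof:

```python
# Sketch (not from the thesis): construct the GED instance of Proposition 3.5
# from a 3-PARTITION instance (B, a_1, ..., a_3n). Graphs are unlabeled
# forests of paths, given as node counts plus edge lists.

def path_edges(start, num_nodes):
    # edge list of a path on the nodes start, ..., start + num_nodes - 1
    return [(start + i, start + i + 1) for i in range(num_nodes - 1)]

def build_ged_instance(B, a):
    n = len(a) // 3
    assert len(a) == 3 * n and sum(a) == n * B
    assert all(B / 4 < ai < B / 2 for ai in a)
    G_edges, node = [], 0
    for ai in a:                       # one path p_i^G on a_i nodes per item
        G_edges += path_edges(node, ai)
        node += ai
    G_nodes = node                     # = n * B
    H_edges, node = [], 0
    for _ in range(n):                 # n paths p_k^H on B nodes each
        H_edges += path_edges(node, B)
        node += B
    H_nodes = node                     # = n * B
    return G_nodes, G_edges, H_nodes, H_edges
```

For example, for B = 7 and a = (2, 2, 3, 3, 2, 2), both graphs have n · B = 14 nodes, G has 14 − 6 = 8 edges, and H has 14 − 2 = 12 edges, so at least 2n = 4 edge insertions are needed, exactly as in the proof.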
Proposition 3.6 (Non-Approximability of Non-Metric GED on Very Sparse Graphs). If arbitrary, non-metric edit cost functions are allowed, then, for any α, there is no α-approximation algorithm for GED on unlabeled graphs with maximum degree two, unless P = NP.
Proof. The proof is very similar to the proof of Proposition 3.5; we again use a polynomial reduction from 3-PARTITION. Given an instance (B, (a_i)_{i=1}^{3n}) of 3-PARTITION, G and H are constructed as before. The only difference is that we now define non-uniform edit costs that allow to insert edges for free, i. e., we set c_E(ε, 1) := 0 and c_E(1, ε) := 1, where 1 denotes the unique edge label. By using arguments analogous to the ones provided in the proof of Proposition 3.5, we can show that (B, (a_i)_{i=1}^{3n}) is a yes-instance of 3-PARTITION if and only if GED(G, H) = 0. Since any α-approximation algorithm must return 0 on instances whose optimum is 0, this proves the proposition.
In view of Propositions 3.5 and 3.6, we conclude that, even if one restricts to very sparse graphs, it is unreasonable to look for α-approximation algorithms or even computationally tractable exact algorithms for GED. However, given the existence of polynomially computable distance measures such as TED, it might be a good idea to use these distance measures instead of GED if the input graphs fall within their scope. Yet, discussing when exactly it is beneficial to move from GED to more easily computable distance measures is outside the scope of this thesis, as the thesis' objective is not to justify GED but to provide new techniques for its exact and heuristic computation.
3.3 Compact Reduction to QAP for Quasimetric Edit Costs
In this section, we introduce a new, compact reduction from GED to QAP. The new reduction combines the advantages of the existing reductions presented above. Like the baseline reduction presented in Section 3.1.2.1, it uses the standard quadratic assignment problem QAP rather than the non-standard variant QAPE and hence renders off-the-shelf QAP methods applicable to GED. And like the reduction to QAPE presented in Section 3.1.2.2, it avoids blowing up the input graphs and produces a compact instance of QAP. The only restriction is that the new reduction is applicable only in settings where the triangle inequalities
c_V(α, α′) ≤ c_V(α, ε) + c_V(ε, α′)   (3.6)

c_E(β, β′) ≤ c_E(β, ε) + c_E(ε, β′)   (3.7)
are satisfied for all (α, α′) ∈ Σ_V × Σ_V and all (β, β′) ∈ Σ_E × Σ_E, i. e., for settings where the node and edge edit costs c_V and c_E are quasimetric. Note that equation (3.7) can be assumed to hold w. l. o. g.: Since edge substitutions can always be replaced by a removal and an insertion, we can enforce equation (3.7) by setting c_E(β, β′) := min{c_E(β, β′), c_E(β, ε) + c_E(ε, β′)} (cf. Section 3.4 below for more details). This is not true of equation (3.6), which effectively constrains the node edit costs c_V. However, equation (3.6) is met in many application scenarios [81, 104]. Like in Section 3.1.2, throughout this section, we fix graphs G, H ∈ G as well as edit cost functions c_V and c_E, define n := |V^G| and m := |V^H|, and assume w. l. o. g. that V^G = [n] and V^H = [m]. Furthermore, we assume that c_V and c_E are quasimetric. The following Theorem 3.3 establishes our compact reduction from GED to QAP. Note that Theorem 3.3 implies that, if the edit costs are quasimetric and n ≤ m, then there is an optimal edit path between G and H that contains no node removals and hence no edge removals induced by node removals. Analogously, n ≥ m entails that there is an optimal edit path between G and H that contains no node insertions and hence no edge insertions induced by node insertions.
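As a small illustration of the min-trick mentioned above, the following sketch (ours; the dictionary-based cost representation and the marker EPS for the dummy label ε are assumptions, not the thesis's data structures) enforces equation (3.7) on a given edge edit cost function:

```python
# Minimal sketch (not the thesis implementation): enforce the edge triangle
# inequalities (3.7) by capping each substitution cost at the cost of a
# removal followed by an insertion. EPS is a hypothetical marker for the
# dummy label epsilon; costs are stored in a dict keyed by label pairs.

EPS = None  # dummy label epsilon

def enforce_edge_triangle(c_E, labels):
    """Replace c_E[(b1, b2)] by min(c_E[(b1, b2)], c_E[(b1, EPS)] + c_E[(EPS, b2)])."""
    for b1 in labels:
        for b2 in labels:
            cap = c_E[(b1, EPS)] + c_E[(EPS, b2)]
            if c_E[(b1, b2)] > cap:
                c_E[(b1, b2)] = cap
    return c_E
```

Note that, as stated in the text, the analogous trick is not available for the node costs in equation (3.6), because node substitutions cannot in general be replaced by a removal followed by an insertion.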
Theorem 3.3 (Compact Reduction to QAP). Let C ∈ R^{(n+m)²×(n+m)²} be the QAP instance defined by the baseline reduction in Theorem 3.1, and let Ĉ ∈ R^{(n·m)×(n·m)} be defined as follows:

ĉ_{(i,k),(j,l)} := c_{(i,k),(j,l)} − δ_{n<m} [ (δ_{(k,l)∈E^H}/2) · c_E(ε, (k, l)) + δ_{i=j} δ_{k=l} · c_V(ε, k) ]   (3.8)

Then it holds that

QAP(C) = QAP(Ĉ) + δ_{n<m} [ ∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k) ]   (3.9)
In Section 3.3.1, we prove that the reduction stated in Theorem 3.3 is correct. In Section 3.3.2, we explain how to turn it into a paradigm for (approximately) computing GED.
3.3.1 Proving the Correctness of the Reduction
Throughout this section, we assume w. l. o. g. that n ≤ m. The case n ≥ m is analogous. The first observation, of which we will make continuous use, is that, by applying the vectorization operator vec defined in Section 3.1.2.1, the original QAP formulation C defined in Theorem 3.1 can be rewritten in the following way:
      [ C^ss  C^si  C^sr  C^s0 ]   V^G × V^H
C =   [ C^is  C^ii  C^ir  C^i0 ]   E^G × V^H
      [ C^rs  C^ri  C^rr  C^r0 ]   V^G × E^H
      [ C^0s  C^0i  C^0r  C^00 ]   E^G × E^H

where the column blocks are indexed, in the same order, by V^G × V^H, E^G × V^H, V^G × E^H, and E^G × E^H.
Recall that the cell c_{(i,k),(j,l)} of C contains the cost induced by simultaneously assigning the node i ∈ V^G ∪ E^G to the node k ∈ V^H ∪ E^H and the node j ∈ V^G ∪ E^G to the node l ∈ V^H ∪ E^H. Moreover, assignments (i, k) ∈ V^G × V^H are node substitutions, assignments (i, k) ∈ V^G × E^H are node removals, assignments (i, k) ∈ E^G × V^H are node insertions, and assignments (i, k) ∈ E^G × E^H are assignments of dummy nodes to dummy nodes. Hence, the submatrix C^ss contains the costs induced by two node substitutions, C^is contains the costs induced by a node insertion and a node substitution, C^rs contains the costs induced by a node removal and a node substitution, and C^0s contains the costs induced by an assignment of a dummy node to another dummy node and a node substitution. For the other submatrices of C, analogous statements hold. The following Lemma 3.1 tells us that, w. l. o. g., we can focus on matchings without forbidden assignments.
Lemma 3.1. Let X* ∈ Π_{n+m,m+n} be optimal for C. Then x*_{i,k} = 0 for all i ↛ k, i. e., X* does not contain any forbidden assignments.

Proof. Assume that X* contains a forbidden assignment i ↛ k. Then we have Q(X*, C) ≥ c_{(i,k),(i,k)} x*_{i,k} x*_{i,k} = ω. This contradicts the choice of ω as an upper bound for GED(G, H) and the fact that, by Theorem 3.1, it holds that Q(X*, C) = QAP(C) = GED(G, H).
Next, we observe that some parts of C can be ignored when computing the quadratic cost of matchings without forbidden assignments.
Lemma 3.2. Let X ∈ Π_{n+m,m+n} be a maximum matching without forbidden assignments. Then its quadratic cost Q(X, C) can be rewritten as follows:

Q(X, C) = vec(X^s)^T C^ss vec(X^s) + vec(X^s)^T C^si vec(X^i) + vec(X^s)^T C^sr vec(X^r) + vec(X^i)^T C^is vec(X^s) + vec(X^i)^T C^ii vec(X^i) + vec(X^r)^T C^rs vec(X^s) + vec(X^r)^T C^rr vec(X^r)
Proof. The lemma immediately follows from the construction of C.
We now construct a function f that maps maximum matchings X̂ ∈ Π_{n,m} for Ĉ to maximum matchings for C without node removals. For dummy nodes (e_k, e_i) ∈ E^G × E^H, we introduce the notation e_k →_{X̂} e_i to denote the condition that k is the i-th node in V^H to which X̂ assigns a node from V^G. The mapping f is defined as follows:

f(X̂)_{i,k} :=  x̂_{i,k}                   if (i, k) ∈ V^G × V^H,
               1 − ∑_{j∈V^G} x̂_{j,k}     if i = e_k,
               1                          if (i, k) ∈ E^G × E^H ∧ i →_{X̂} k,
               0                          otherwise.   (3.10)
Lemma 3.3. For each X̂ ∈ Π_{n,m}, it holds that f(X̂) ∈ Π_{n+m,m+n}, that f(X̂)^r = 0_{n,n}, and that f(X̂)_{i,k} = 0 for all i ↛ k.
Proof. For proving the first part of the lemma, note that, since X̂ ∈ Π_{n,m} and n ≤ m, the first two lines in the definition of f ensure that f(X̂) covers all rows in V^G, all columns in V^H, and leaves exactly n rows in E^G uncovered. These rows as well as all columns in E^H are covered by the third line of the definition of f. The second and the third part of the lemma immediately follow from the definition of f.
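The mapping f is straightforward to implement. In the sketch below (ours, with 0/1 matrices as nested lists), row n + k plays the role of the dummy node e_k ∈ E^G and column m + i the role of e_i ∈ E^H; the last loop realizes the condition e_k →_{X̂} e_i from equation (3.10) by assigning the dummy row of the i-th covered column to the i-th dummy column:

```python
def f(X_hat):
    """Map a maximum matching X_hat in Pi_{n,m} (n <= m, 0/1 matrix) to a
    maximum matching in Pi_{n+m,m+n} without node removals, per eq. (3.10).
    Row n + k and column m + i play the roles of the dummies e_k and e_i."""
    n, m = len(X_hat), len(X_hat[0])
    X = [[0] * (m + n) for _ in range(n + m)]
    for i in range(n):                       # node substitutions
        for k in range(m):
            X[i][k] = X_hat[i][k]
    for k in range(m):                       # insertions: uncovered columns
        X[n + k][k] = 1 - sum(X_hat[j][k] for j in range(n))
    covered = [k for k in range(m) if any(X_hat[j][k] for j in range(n))]
    for i, k in enumerate(covered):          # dummy-to-dummy assignments
        X[n + k][m + i] = 1
    return X
```

On any X̂ ∈ Π_{n,m}, the result is a maximum matching of size n + m whose removal block (rows in V^G, columns in E^H) is all zero, as claimed by Lemma 3.3.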
The next lemma shows that restricting to matchings without node removals and restricting to matchings that are contained in img(f) are equivalent.
Lemma 3.4. Let X ∈ Π_{n+m,m+n} be a maximum matching without forbidden assignments that satisfies X^r = 0_{n,n}. Then there is a maximum matching X′ ∈ img(f) such that Q(X, C) = Q(X′, C).
Proof. Since X^r = 0_{n,n} and X ∈ Π_{n+m,m+n}, we know that X^s ∈ Π_{n,m}. This allows us to define X′ := f(X^s). Since X and X′ do not contain forbidden assignments, we know from Lemma 3.2 that X^0 and X′^0 do not contribute to Q(X, C) and Q(X′, C), respectively. Furthermore, we have X^s = X′^s and X^r = 0_{n,n} = X′^r by construction. The lemma thus follows if we can show that X^i = X′^i. So let (i, k) ∈ E^G × V^H. If i ≠ e_k, we have i ↛ k and hence x_{i,k} = x′_{i,k} = 0. Otherwise, it holds that x′_{i,k} = 1 − ∑_{j∈V^G} x_{j,k} = x_{i,k}, where the last equality follows from ∑_{j∈V^{G+ε}} x_{j,k} = 1 and x_{j,k} = 0 for all j ∈ E^G with j ≠ e_k.
Next, we show that, for quasimetric edit costs, it suffices to optimize over the matchings contained in img( f ).
Lemma 3.5. If the edit costs c_V and c_E are quasimetric, then it holds that QAP(C) = min_{X∈img(f)} Q(X, C).
Proof. Because of Lemma 3.1, Lemma 3.3, and Lemma 3.4, it suffices to show that, for each matching X ∈ Π_{n+m,m+n} without forbidden assignments and r > 0 node removals, i. e., |supp(X^r)| = r, there is a matching X′ ∈ Π_{n+m,m+n} without forbidden assignments and r − 1 node removals such that Q(X′, C) ≤ Q(X, C). So assume that X ∈ Π_{n+m,m+n} contains no forbidden assignments and the node removal x_{i,e_i} = 1 for some node i ∈ V^G. Since X is a matching without forbidden assignments and n ≤ m, we then know that X also contains a node insertion x_{e_k,k} = 1 for some node k ∈ V^H. We now define the matrix X′ ∈ Π_{n+m,m+n} as

x′_{j,l} :=  1         if (j, l) = (i, k) ∨ (j, l) = (e_k, e_i),
             0         if (j, l) = (i, e_i) ∨ (j, l) = (e_k, k),
             x_{j,l}   otherwise,

and introduce Δ := Q(X, C) − Q(X′, C). Since X′ ∈ Π_{n+m,m+n} is immediately implied by X ∈ Π_{n+m,m+n}, the lemma follows if we can show that Δ ≥ 0.
Let I := {(i, k), (i, e_i), (e_k, k), (e_k, e_i)} be the set of all indices (j, l) ∈ V^{G+ε} × V^{H+ε} with x_{j,l} ≠ x′_{j,l}. It is easy to see that we have

c_{(j,l),(j′,l′)} x_{j,l} x_{j′,l′} = c_{(j,l),(j′,l′)} x′_{j,l} x′_{j′,l′} = 0

for all ((j, l), (j′, l′)) ∈ I × I \ {((i, k), (i, k)), ((i, e_i), (i, e_i)), ((e_k, k), (e_k, k))}. For this reason, since C is symmetric, and X does not contain forbidden assignments, we can write Δ as follows:
Δ = Δ_V + ∑_{(j,l)∈(V^{G+ε}×V^{H+ε})\(I∪F)} Δ_{j,l} x_{j,l},

where F := {(j, l) ∈ V^{G+ε} × V^{H+ε} | j ↛ l} is the set of all forbidden assignments and Δ_V and Δ_{j,l} are defined as follows:

Δ_V := c_{(i,e_i),(i,e_i)} + c_{(e_k,k),(e_k,k)} − c_{(i,k),(i,k)}

Δ_{j,l} := 2 (c_{(i,e_i),(j,l)} + c_{(e_k,k),(j,l)} − c_{(i,k),(j,l)} − c_{(e_k,e_i),(j,l)})
By definition of C and c′_V given in equation (3.2), we have Δ_V = c_V(i, ε) + c_V(ε, k) − c_V(i, k) ≥ 0, where the inequality follows from the fact that c_V is quasimetric. Similarly, the definition of C and c′_E given in equation (3.4) and the fact that (j, l) ∉ F gives us Δ_{j,l} = δ_{(i,j)∈E^G} δ_{(k,l)∈E^H} [c_E((i, j), ε) + c_E(ε, (k, l)) − c_E((i, j), (k, l))] ≥ 0 for all (j, l) ∈ (V^{G+ε} × V^{H+ε}) \ (I ∪ F), where the inequality follows from c_E being quasimetric.
The next lemma simplifies the quadratic cost Q(X, C) for maximum matchings X ∈ img(f).

Lemma 3.6. The quadratic cost Q(X, C) of a maximum matching X ∈ img(f) can be written as follows:
Q(X, C) = vec(X^s)^T C^ss vec(X^s)
 + ∑_{i∈V^G} ∑_{k∈V^H} ∑_{l∈V^H} (δ_{(k,l)∈E^H}/2) c_E(ε, (k, l)) x_{i,k} x_{e_l,l}   (3.11)
 + ∑_{k∈V^H} ∑_{j∈V^G} ∑_{l∈V^H} (δ_{(k,l)∈E^H}/2) c_E(ε, (k, l)) x_{e_k,k} x_{j,l}   (3.12)
 + ∑_{k∈V^H} ∑_{l∈V^H} (δ_{(k,l)∈E^H}/2) c_E(ε, (k, l)) x_{e_k,k} x_{e_l,l}           (3.13)
 + ∑_{k∈V^H} c_V(ε, k) x_{e_k,k}                                                       (3.14)

Proof. By Lemma 3.3, X does not contain forbidden assignments, which implies vec(X^s)^T C^si vec(X^i) = (3.11), vec(X^i)^T C^is vec(X^s) = (3.12), and vec(X^i)^T C^ii vec(X^i) = (3.13) + (3.14). Furthermore, we have X^r = 0_{n,n} from Lemma 3.3. Therefore, the lemma follows from Lemma 3.2.
The next step is to relate the cost of a maximum matching Xb ∈ Πn,m to the cost of its image under f .
Lemma 3.7. The equation

Q(f(X̂), C) = Q(X̂, Ĉ) + δ_{n<m} [ ∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k) ]

holds for each X̂ ∈ Π_{n,m}.

Proof. Let X := f(X̂) and c_{k,l} := (δ_{(k,l)∈E^H}/2) c_E(ε, (k, l)). If n = m, we obtain X^i = 0_{m,m} from Lemma 3.3 and the definition of Π_{n+m,m+n}. Since Ĉ = C^ss in this case, this implies the statement of the lemma. So we can focus on the case n < m. By definition of Ĉ, Q(X̂, Ĉ) can be written as follows:

Q(X̂, Ĉ) = ∑_{i∈V^G} ∑_{k∈V^H} ∑_{j∈V^G} ∑_{l∈V^H} c_{(i,k),(j,l)} x̂_{i,k} x̂_{j,l}   (3.15)
 − ∑_{i∈V^G} ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l} x̂_{i,k} ∑_{j∈V^G} x̂_{j,l}   (3.16)
 − ∑_{k∈V^H} ∑_{j∈V^G} ∑_{l∈V^H} c_{k,l} x̂_{j,l} ∑_{i∈V^G} x̂_{i,k}   (3.17)
 + ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l} ∑_{i∈V^G} x̂_{i,k} ∑_{j∈V^G} x̂_{j,l}   (3.18)
 − ∑_{k∈V^H} c_V(ε, k) ∑_{i∈V^G} x̂_{i,k}   (3.19)

Since x̂_{i,k} = x_{i,k} for all (i, k) ∈ V^G × V^H, it holds that:

(3.15) = vec(X^s)^T C^ss vec(X^s)   (3.20)

Furthermore, by substituting −∑_{j∈V^G} x̂_{j,l} with 1 − ∑_{j∈V^G} x̂_{j,l} − 1 = x_{e_l,l} − 1 in (3.16) and (3.18) and substituting −∑_{i∈V^G} x̂_{i,k} with 1 − ∑_{i∈V^G} x̂_{i,k} − 1 = x_{e_k,k} − 1 in (3.17), (3.18), and (3.19), we obtain the following equalities:

(3.16) = (3.11) − ∑_{i∈V^G} ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l} x̂_{i,k}   (3.21)
(3.17) = (3.12) − ∑_{k∈V^H} ∑_{j∈V^G} ∑_{l∈V^H} c_{k,l} x̂_{j,l}   (3.22)
(3.18) = (3.13) + ∑_{i∈V^G} ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l} x̂_{i,k} + ∑_{k∈V^H} ∑_{j∈V^G} ∑_{l∈V^H} c_{k,l} x̂_{j,l} − ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l}   (3.23)
(3.19) = (3.14) − ∑_{k∈V^H} c_V(ε, k)   (3.24)

Since ∑_{k∈V^H} ∑_{l∈V^H} c_{k,l} = ∑_{(k,l)∈E^H} c_E(ε, (k, l)), the lemma follows from summing the equations (3.20) to (3.24) and applying Lemma 3.6.

We are now in the position to prove the main theorem.

Proof of Theorem 3.3. Let X̂* ∈ Π_{n,m} be optimal for Ĉ, i. e., Q(X̂*, Ĉ) = QAP(Ĉ). Then we know from Lemma 3.7 that Q(f(X̂*), C) = QAP(Ĉ) + δ_{n<m}[∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k)], which implies QAP(C) ≤ QAP(Ĉ) + δ_{n<m}[∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k)]. On the other hand, Lemma 3.5 implies that there is a matching X* ∈ img(f) with Q(X*, C) = QAP(C). Let X̂ ∈ Π_{n,m} be such that f(X̂) = X*.
Then QAP(C) = Q(X*, C) = Q(X̂, Ĉ) + δ_{n<m}[∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k)] ≥ QAP(Ĉ) + δ_{n<m}[∑_{(k,l)∈E^H} c_E(ε, (k, l)) + ∑_{k∈V^H} c_V(ε, k)], which proves the converse inequality and hence the theorem.

3.3.2 Turning the Reduction into a GED Paradigm

Given the definition of Ĉ in Theorem 3.3, it might well happen that ĉ_{(i,k),(j,l)} < 0 for some (i, j) ∈ V^G × V^G, (k, l) ∈ V^H × V^H. However, some QAP methods only work on non-negative cost matrices. In order to render these methods applicable to Ĉ, Ĉ can be transformed into a non-negative cost matrix Ĉ⁺ ∈ R^{(n·m)×(n·m)} defined as

ĉ⁺_{(i,k),(j,l)} := ĉ_{(i,k),(j,l)} − c,   (3.25)

where c := min{ĉ_{(i,k),(j,l)} | (i, j) ∈ V^G × V^G ∧ (k, l) ∈ V^H × V^H}. Clearly, a maximum matching X̂ ∈ Π_{n,m} is optimal for Ĉ just in case it is optimal for Ĉ⁺, as

Q(X̂, Ĉ) = Q(X̂, Ĉ⁺) + min{n, m}² · c

holds for all X̂ ∈ Π_{n,m}.

Figure 3.1 shows how to use Theorem 3.3 as a paradigm for (suboptimally) computing GED: Given two graphs G and H and a QAP method M that yields a maximum matching X̂ for a given cost matrix Ĉ ∈ R^{(n·m)×(n·m)}, in a first step, a compact QAP instance Ĉ as specified in Theorem 3.3 is constructed (line 1). Subsequently, Ĉ is rendered non-negative by applying equation (3.25), if this is required by M (lines 2 to 3). Next, the paradigm runs M on Ĉ to obtain a maximum matching X̂ ∈ Π_{n,m} (line 4), transforms X̂ into a feasible solution X ∈ Π_{n+m,m+n} for the baseline reduction C from GED to QAP defined in Theorem 3.1 (line 5), and returns Q(X, C) (line 6). We have Q(X, C) = GED(G, H) if the QAP method M is guaranteed to return optimal solutions. Otherwise, Q(X, C) is an upper bound for GED(G, H).

Input: Two graphs G and H, quasimetric edit costs, and a QAP method M that yields maximum matchings X̂ ∈ Π_{n,m} for given cost matrices Ĉ ∈ R^{(n·m)×(n·m)}.
Output: The exact edit distance GED(G, H), if M yields optimal solutions; an upper bound for GED(G, H), otherwise.
1 construct compact QAP instance Ĉ as specified in Theorem 3.3;
2 if M requires non-negative cost matrices then
3     apply equation (3.25) to render Ĉ non-negative;
4 X̂ ← M(Ĉ);
5 X ← f(X̂), where f is defined according to equation (3.10);
6 return Q(X, C), where C is constructed as specified in Theorem 3.1;

Figure 3.1. QAP based computation of GED.

3.4 Harmonization of GED Definitions

Recall that in Section 2.1, we have given two definitions of GED. Definition 2.3 defines GED as a minimization problem over the set of all edit paths. It was introduced in [60] and is nowadays used mainly in the database community. Definition 2.6 defines GED as a minimization problem over the set of all edit paths that are induced by node maps. In [21, 22], this definition is shown to be equivalent to the original GED definition by Bunke and Allermann [29]. Today, it is used in the pattern recognition community. In [60], it is shown that the two definitions coincide if the edit cost functions c_V and c_E are metric. Here, we generalize this result to arbitrary edit costs. More precisely, we show that the following theorem holds:

Theorem 3.4. Let G, H ∈ G be graphs, c_V : (Σ_V ∪ {ε}) × (Σ_V ∪ {ε}) → R_{≥0} be a node edit cost function, and c_E : (Σ_E ∪ {ε}) × (Σ_E ∪ {ε}) → R_{≥0} be an edge edit cost function. Furthermore, let GED(G, H) := min{c(P) | P ∈ Ψ(G, H)} be defined as in Definition 2.3. Then there is a preprocessing routine for transforming the edit costs c_V and c_E that runs in Θ(min{|V^G| + |V^H|, |Σ_V|}³ + min{|E^G| + |E^H|, |Σ_E|}³) time, leaves GED(G, H) invariant, and ensures that, after the preprocessing, it holds that GED(G, H) = min{c(P_π) | π ∈ Π(G, H)}.

The remainder of this section is dedicated to the proof of Theorem 3.4. Our proof crucially builds upon Bougleux et al.'s result that the set of edit paths induced by node maps is exactly the set of restricted edit paths [22].

Definition 3.3 (Restricted Edit Path).
A restricted edit path between the graphs G and H is an edit path that edits each node and edge at most once and does not delete and reinsert edges between nodes that have been substituted.

For proving Theorem 3.4, it hence suffices to show that GED can equivalently be defined as the minimum cost of a restricted edit path. To this purpose, we define the notions of irreducible and strongly irreducible edit operations. Note that all edit operations are strongly irreducible, if and only if the edit cost functions c_V and c_E respect the triangle inequality.

Definition 3.4 (Irreducible Edit Operations). An edit operation o with associated edit cost c_S(α, β), (α, β) ∈ (Σ_S ∪ {ε}) × (Σ_S ∪ {ε}), S ∈ {V, E}, is called irreducible, if and only if c_S(α, β) ≤ c_S(α, γ) + c_S(γ, β) holds for all γ ∈ Σ_S \ {ε}. It is called strongly irreducible, if and only if it is irreducible and c_S(α, β) ≤ c_S(α, ε) + c_S(ε, β) holds, too.

The following Lemma 3.8 and Lemma 3.9 provide links between irreducible edit operations and restricted and optimal edit paths.

Lemma 3.8. Let G, H ∈ G be graphs and assume that all node substitutions are irreducible and that all node deletions and insertions as well as all edge edit operations are strongly irreducible. Then GED(G, H) equals the minimum cost of a restricted edit path.

Proof. Let P be an optimal edit path between G and H which is minimally non-restricted in the sense that, among all optimal edit paths, it contains a minimal number of edit operations which are forbidden for restricted edit paths. The lemma follows if P is restricted. Assume that this is not the case. Then P deletes and reinserts an edge between nodes that have been substituted, or there is a node or an edge that is edited twice by P. Assume that we are in the first case and let e = (u_1, u_2) ∈ E^G and f = (v_1, v_2) ∈ E^H be edges such that P substitutes u_1 by v_1 and u_2 by v_2, but deletes e and reinserts f.
Let P′ be the edit path between G and H which, instead of deleting e and reinserting f, substitutes e by f. Since edge deletions and insertions can always be replaced by edge substitutions, P′ is indeed an edit path. Moreover, P′ contains fewer forbidden edit operations than P. Since edge substitutions are strongly irreducible, we have c_E(e, f) ≤ c_E(e, ε) + c_E(ε, f) and hence c(P′) ≤ c(P). So P′ is optimal, which contradicts P being minimally non-restricted. The other cases follow similarly.

Lemma 3.9. Optimal edit paths between graphs G, H ∈ G contain only strongly irreducible edge edit operations, strongly irreducible node deletions and insertions, and irreducible node substitutions.

Proof. Let P be an optimal edit path between G and H. Assume that P contains an edge edit operation o with associated edit cost c_E(α, β) that is not strongly irreducible. Then there is an edge label γ ∈ Σ_E ∪ {ε}, which might be the dummy label ε, with c_E(α, β) > c_E(α, γ) + c_E(γ, β). Let P′ be the edit path that, instead of o, contains the edit operations associated with c_E(α, γ) and c_E(γ, β). Since edge substitutions can always be replaced by edge deletions and insertions and vice versa, P′ is indeed an edit path between G and H. Furthermore, it holds that c(P′) < c(P), which contradicts the optimality of P. The proofs for showing that an optimal edit path P cannot contain a node insertion or deletion that is not strongly irreducible or a node substitution that is not irreducible are analogous.

Note that Lemma 3.9 cannot be generalized to also show that optimal edit paths only contain strongly irreducible node substitutions. The reason is that, as we can delete and insert only isolated nodes, it is in general illegal to replace a node substitution by a deletion and an insertion. With the help of Lemma 3.8 and Lemma 3.9, we can now prove Theorem 3.4.

Proof of Theorem 3.4.
We describe a preprocessing routine for the edit costs c_V and c_E that leaves GED(G, H) invariant and ensures that, after the preprocessing, all edit operations meet the constraints of Lemma 3.8. Since it has been shown that an edit path is restricted just in case it is induced by a node map [22], this proves the theorem.

Let Σ_E^{G,H} := img(ℓ_E^G) ∪ img(ℓ_E^H). For each edge edit operation o with associated edit cost c_E(α, β), the preprocessing routine computes a shortest path p between α and β in the complete directed graph on the node set Σ_E^{G,H} ∪ {ε} with edge costs c_E, and then updates c_E(α, β) to c_E(p). After this update, o is strongly irreducible, since otherwise p would not be a shortest path. Furthermore, it holds that p = (α, β), if and only if o was strongly irreducible before the update. Lemma 3.9 and the fact that the costs of edge operations that were strongly irreducible before the update have not been changed imply that GED(G, H) is left invariant. The same technique is used to enforce the strong irreducibility of node insertions and deletions and the irreducibility of node substitutions. If one uses the Floyd-Warshall algorithm for computing the shortest paths, the complexity of this preprocessing routine is cubic in |img(ℓ_V^G) ∪ img(ℓ_V^H) ∪ {ε}| = O(min{|V^G| + |V^H|, |Σ_V|}) and in |img(ℓ_E^G) ∪ img(ℓ_E^H) ∪ {ε}| = O(min{|E^G| + |E^H|, |Σ_E|}).

3.5 Empirical Evaluation

In this section, we empirically evaluate the compact QAP formulation Ĉ suggested in Section 3.3. In Section 3.5.1, we describe the setup of the experiments; in Section 3.5.2, we present the results.

3.5.1 Setup and Datasets

Datasets and Edit Costs. We tested on the datasets alkane, acyclic, mao, and pah. For all datasets, we used the quasimetric edit costs suggested in [1], i. e., we set all node substitution costs to 2, all node deletion and insertion costs to 4, and all edge edit costs to 4 (cf. Section 2.4).
This choice guarantees that the results of our experiments are comparable to the experimental results reported in [1].

Compared Methods. We evaluated how employing our QAP formulation Ĉ instead of the baseline formulation C and the QAPE formulation C′ affects the performance of the QAP based method IPFP with MULTI-START [41] (cf. Section 6.1.3.4 and Section 6.1.3.5 below). IPFP with MULTI-START is the best performing currently available QAP based method for upper-bounding GED [1]. In the following, IPFP-C-QAP, IPFP-B-QAP, and IPFP-QAPE denote IPFP set up with our compact QAP formulation Ĉ, with the baseline QAP formulation C, and with the QAPE formulation C′, respectively.

Protocol and Test Metrics. We ran all compared methods on all pairs of graphs from the test datasets. We recorded the mean upper bound for GED (d), the mean runtime in seconds (t), and the mean error w. r. t. the exact GED (e), which we computed with the standard exact algorithm A* [89].

Implementation and Hardware Specifications. All methods were implemented in C++ and the experiments were executed on a Linux Ubuntu machine with 512 GB of main memory and four AMD Opteron processors with 2.6 GHz and 64 cores, four of which were used to run all compared methods in parallel.

Table 3.1. Effect of different QAP and QAPE formulations on IPFP with MULTI-START. Test metrics: mean upper bound (d), mean error w. r. t. exact GED (e), mean runtime in seconds (t).

                                 alkane                acyclic
algorithm                     d      e      t       d      e      t
IPFP-B-QAP [21, 22, 49]    15.37  0.023  0.41    16.77  0.035  0.24
IPFP-QAPE [25]             15.34  0.009  0.22    16.73  0.008  0.13
IPFP-C-QAP [Section 3.3]   15.39  0.062  0.15    16.81  0.079  0.06

                                  mao                   pah
algorithm                     d      e      t       d      e      t
IPFP-B-QAP [21, 22, 49]    33.4    —      2.9     36.7    —      3.14
IPFP-QAPE [25]             33.3    —      0.8     36.6    —      1.17
IPFP-C-QAP [Section 3.3]   39.7    —      1.5     36.7    —      0.89

3.5.2 Results of the Experiments

Table 3.1 shows the results of our experiments.
Note that the graphs contained in mao and pah are too large (more than 16 nodes on average, cf. Section 5.6) to allow for an exact computation of GED, and so, for these datasets, no mean errors are reported. We see that, on all datasets except mao, IPFP-C-QAP is significantly faster than both IPFP-B-QAP and IPFP-QAPE. On mao, IPFP-C-QAP is still faster than IPFP-B-QAP but slower than IPFP-QAPE. In terms of accuracy, on the datasets alkane, acyclic, and pah, IPFP-C-QAP performs only marginally worse than IPFP-B-QAP and IPFP-QAPE. Furthermore, we see from the results for alkane and acyclic that all three algorithms return an upper bound which is very close to the exact GED. The only exception is again mao, where the relative deviation of the upper bound returned by IPFP-C-QAP from the ones returned by IPFP-B-QAP and IPFP-QAPE is around 20 %. The slight accuracy loss of IPFP-C-QAP w. r. t. IPFP-B-QAP and IPFP-QAPE can be explained by the fact that, since Ĉ is denser than C and C′, IPFP-C-QAP reached the maximum number of iterations I before reaching the convergence threshold β more often than the other two algorithms.

3.6 Conclusions and Future Work

In this chapter, we showed that the two definitions of GED, which are used, respectively, in the database and in the pattern recognition community, are equivalent. Furthermore, we proposed a compact QAP formulation Ĉ of GED with quasimetric edit costs. Experiments show that running the state of the art algorithm IPFP with Ĉ instead of the baseline formulation C suggested in [21, 22, 49] leads to a speed-up by a factor between 2 and 4, while the accuracy loss is negligible on most datasets. In comparison to the QAPE formulation C′ [25], the speed-up obtained by using Ĉ is smaller.
However, implementing IPFP with Ĉ is much easier than implementing it with C′: For implementing IPFP with Ĉ, one can use an off-the-shelf solver for the linear sum assignment problem; for implementing it with C′, one has to implement a solver for the linear sum assignment problem with error-correction.

4 The Linear Sum Assignment Problem with Error-Correction

In this chapter, we present FLWC (fast solver for LSAPE without cost constraints), a new efficient solver for the linear sum assignment problem with error-correction (LSAPE). Recall that LSAPE is a polynomially solvable combinatorial optimization problem that has to be solved as a subproblem by many exact and heuristic algorithms for GED. Consequently, all of these algorithms directly benefit if LSAPE can be solved quickly.

Exact solvers for LSAPE can be divided into two classes: Solvers of the first kind reduce instances of LSAPE to instances of the linear sum assignment problem (LSAP) and then solve the resulting LSAP instance with classical methods such as the Hungarian Algorithm [62, 77]. Solvers of the second kind adapt these classical methods to directly solve the LSAPE instances. The main disadvantage of solvers of the second kind w. r. t. solvers of the first kind is that they are much more difficult to implement: While implementations of solvers for LSAP such as the Hungarian Algorithm exist for all major programming languages, adapting these algorithms to LSAPE requires thorough theoretical knowledge on the side of the implementer.

The newly proposed LSAPE solver FLWC is of the first kind. However, unlike existing reductions to LSAP, FLWC neither blows up the LSAPE instances nor assumes that the instances respect certain constraints. FLWC hence constitutes the first easily implementable, efficient, and generally applicable solver for LSAPE. The results presented in this chapter have previously been presented in the following article:

– S. Bougleux, B. Gaüzère, D. B. Blumenthal, and L.
Brun, "Fast linear sum assignment with error-correction and no cost constraints", Pattern Recognit. Lett., 2018, in press. doi: 10.1016/j.patrec.2018.03.032

Throughout this chapter, we make constant use of the fact that, by Definition 2.8 and Definition 2.26, LSAPE instances C ∈ R^{(n+1)×(m+1)} and error-correcting matchings X ∈ Π_{n,m,ε} are of the form given in equation (4.1). C_sub = (c_{i,k}) ∈ R^{n×m} contains the substitution costs, c_del = (c_{i,ε}) ∈ R^{n×1} contains the deletion costs, and c_ins = (c_{ε,k}) ∈ R^{1×m} contains the insertion costs. Analogously, X_sub = (x_{i,k}) ∈ {0, 1}^{n×m} encodes the substitutions, x_del = (x_{i,ε}) ∈ {0, 1}^{n×1} encodes the deletions, and x_ins = (x_{ε,k}) ∈ {0, 1}^{1×m} encodes the insertions. Moreover, we assume w. l. o. g. that we are given LSAPE instances C ∈ R^{(n+1)×(m+1)} with n ≤ m. This can easily be enforced by transposing C.

C = [ C_sub  c_del ]      X = [ X_sub  x_del ]   (4.1)
    [ c_ins    0   ]          [ x_ins    0   ]

The remainder of this chapter is organized as follows: In Section 4.1, state of the art solvers for LSAPE are reviewed. In Section 4.2, FLWC is presented. In Section 4.3, FLWC is evaluated empirically. Section 4.4 concludes the chapter.

4.1 State of the Art

In this section, we provide an overview of state of the art solvers for LSAPE. In Section 4.1.1, we present a reduction from LSAPE to LSAP that works for general LSAPE instances. In Section 4.1.2, we present three smaller reductions that are designed for LSAPE instances which respect the triangle inequality. In Section 4.1.3, we review solvers that use adaptions of algorithms originally designed for LSAP.

4.1.1 Unconstrained Reduction to LSAP

The standard algorithm extended bipartite matching (EBP) [82, 83] solves LSAPE by reducing it to LSAP.
Given an instance C of LSAPE, EBP constructs an instance C̄ ∈ R^{(n+m)×(n+m)} of LSAP as

$$\overline{C} := \begin{bmatrix} C_{\mathrm{sub}} & \omega(1_{n \times n} - I_n) + \mathrm{diag}(c_{\mathrm{del}}) \\ \omega(1_{m \times m} - I_m) + \mathrm{diag}(c_{\mathrm{ins}}) & 0_{m \times n} \end{bmatrix},$$

where ω denotes a very large value and the operator diag maps a vector (v_i)_{i=1}^r to the diagonal matrix (d_{i,k})_{i,k=1}^r with d_{i,i} = v_i and d_{i,k} = 0 for all i ≠ k. EBP then calls an LSAP solver to compute an optimal maximum matching X̄* ∈ Π_{n+m,n+m} for C̄. By construction, X̄* is of the form

$$\overline{X}^{\star} = \begin{bmatrix} X^{\star}_{\mathrm{sub}} & X^{\star}_{\mathrm{del}} \\ X^{\star}_{\mathrm{ins}} & X^{\star}_{\epsilon} \end{bmatrix}, \qquad (4.2)$$

where X*_sub ∈ {0,1}^{n×m}, X*_del ∈ {0,1}^{n×n}, and X*_ins ∈ {0,1}^{m×m}. The south-east quadrant X*_ε ∈ {0,1}^{m×n} contains assignments from dummy nodes to dummy nodes, which are not needed for encoding node deletions or insertions but are computed by LSAP algorithms anyway. X̄* is then transformed into an optimal error-correcting matching X* ∈ Π_{n,m,ε} for LSAPE with C(X*) = C̄(X̄*) by setting X*_sub := X*_sub, x*_del := X*_del 1_n, and x*_ins := 1_m^⊤ X*_ins. The time complexity of EBP is dominated by the complexity of solving the LSAP instance C̄. Therefore, EBP runs in O((n+m)^3) time.

4.1.2 Cost-Constrained Reductions to LSAP

The algorithms fast bipartite matching (FBP, FBP-0) [100] and square fast bipartite matching (SFBP) [99, 101] build upon more compact reductions of LSAPE to LSAP than EBP. However, FBP, FBP-0, and SFBP are applicable only to those instances C of LSAPE which respect the following triangle inequalities:

$$c_{i,k} \leq c_{i,\epsilon} + c_{\epsilon,k} \quad \forall (i,k) \in [n] \times [m] \qquad (4.3)$$

In other words, FBP, FBP-0, and SFBP can be used for instances of LSAPE where substituting a node u_i ∈ U by a node v_k ∈ V is never more expensive than deleting u_i and inserting v_k. The following proposition is the key ingredient of FBP, FBP-0, and SFBP (for its proof, cf. [99–101]).

Proposition 4.1 (Correctness of Cost-Constrained Reductions to LSAP).
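The EBP reduction and the recovery of an error-correcting matching from the solution of the enlarged LSAP instance can be sketched as follows. A naive brute-force LSAP solver stands in for the Hungarian Algorithm, and a large constant stands in for ω; the names are ours and NumPy is assumed, so this is only practical for toy instances:

```python
import itertools
import numpy as np

BIG = 1e9  # stands in for the very large value omega

def brute_force_lsap(C):
    """Exhaustive LSAP solver (stand-in for the Hungarian Algorithm).
    C must be square; returns p with row i assigned to column p[i]."""
    r = C.shape[0]
    best, best_p = None, None
    for p in itertools.permutations(range(r)):
        cost = sum(C[i, p[i]] for i in range(r))
        if best is None or cost < best:
            best, best_p = cost, p
    return best_p

def ebp_reduce(C):
    """Build the (n+m) x (n+m) LSAP instance of EBP from an
    (n+1) x (m+1) LSAPE instance C."""
    n, m = C.shape[0] - 1, C.shape[1] - 1
    C_big = np.zeros((n + m, n + m))
    C_big[:n, :m] = C[:n, :m]                                       # substitutions
    C_big[:n, m:] = BIG * (np.ones((n, n)) - np.eye(n)) + np.diag(C[:n, m])  # deletions
    C_big[n:, :m] = BIG * (np.ones((m, m)) - np.eye(m)) + np.diag(C[n, :m])  # insertions
    return C_big                                                    # bottom-right stays 0

def ebp_solve(C):
    """Solve LSAPE via EBP and return the error-correcting matching."""
    n, m = C.shape[0] - 1, C.shape[1] - 1
    p = brute_force_lsap(ebp_reduce(C))
    X = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(n):
        if p[i] < m:
            X[i, p[i]] = 1     # substitution of node i by node p[i]
        else:
            X[i, m] = 1        # node i assigned to a dummy: deletion
    for k in range(m):
        if not X[:n, k].any():
            X[n, k] = 1        # column k unmatched by a real node: insertion
    return X
```

The brute-force solver visits (n+m)! permutations, so it only serves to make the reduction and the recovery step in equation (4.2) explicit; in practice one would plug in an O((n+m)^3) LSAP solver.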
Let X* ∈ Π_{n,m,ε} be an optimal error-correcting matching for an instance C ∈ R^{(n+1)×(m+1)} of LSAPE with n ≤ m which satisfies equation (4.3). Then X* contains no node deletion, i. e., it satisfies x*_del = 0_n. Note that this implies that X* contains exactly m − n node insertions and hence that x*_ins = 0_m^⊤ if n = m.

Given an instance C of LSAPE that satisfies equation (4.3), FBP constructs an instance C̄ ∈ R^{n×m} of LSAP by setting

$$\overline{C} := C_{\mathrm{sub}} - c_{\mathrm{del}} 1_m^{\top} - 1_n c_{\mathrm{ins}}.$$

Subsequently, FBP computes an optimal maximum matching X̄* ∈ Π_{n,m} for C̄. Because of Proposition 4.1, it then holds that the matrix X* ∈ Π_{n,m,ε} defined by X*_sub := X̄*, x*_del := 0_n, and x*_ins := 1_m^⊤ − 1_n^⊤ X̄* is an optimal error-correcting matching for C.

The variant FBP-0 of FBP transforms C into a balanced instance C' ∈ R^{m×m} of LSAP by appending the matrix 0_{(m−n)×m} below the matrix C̄ defined for FBP. After solving this instance, FBP-0 transforms the resulting optimal maximum matching

$$\overline{X}^{\star} = \begin{bmatrix} X^{\star}_{\mathrm{sub}} \\ X^{\star}_{0} \end{bmatrix}$$

for C' into an optimal error-correcting matching X* ∈ Π_{n,m,ε} for C by setting X*_sub := X*_sub, x*_del := 0_n, and x*_ins := 1_m^⊤ − 1_n^⊤ X*_sub.

Like FBP-0, SFBP transforms an instance C into a balanced instance C̄ ∈ R^{m×m} of LSAP. However, C̄ is now defined as follows:

$$\overline{C} := \begin{bmatrix} C_{\mathrm{sub}} \\ 1_{m-n} c_{\mathrm{ins}} \end{bmatrix}$$

In the next step, SFBP computes an optimal maximum matching

$$\overline{X}^{\star} = \begin{bmatrix} X^{\star}_{\mathrm{sub}} \\ X^{\star}_{\mathrm{ins}} \end{bmatrix}$$

for C̄. SFBP then constructs the matrix X* ∈ Π_{n,m,ε} by setting X*_sub := X*_sub, x*_del := 0_n, and x*_ins := 1_{m−n}^⊤ X*_ins. Again, Proposition 4.1 ensures that X* is indeed an optimal error-correcting matching for LSAPE.

The time complexities of FBP, FBP-0, and SFBP are dominated by the complexities of solving the LSAP instances C̄. Therefore, FBP runs in O(n²m) time, while FBP-0 and SFBP run in O(m³) time. These are significant improvements over EBP. However, recall that FBP, FBP-0, and SFBP can be used only if the cost matrix C respects the triangle inequalities (4.3).
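Under the triangle inequalities (4.3), the FBP reduction can be sketched in a few lines. A brute-force solver that enumerates injections of rows into columns stands in for a proper O(n²m) rectangular LSAP solver; the helper names are ours and NumPy is assumed:

```python
import itertools
import numpy as np

def brute_force_rect_lsap(C):
    """Exhaustive solver for a rectangular n x m LSAP with n <= m:
    every row is matched to a distinct column.  Returns p with
    row i assigned to column p[i]."""
    n, m = C.shape
    best, best_p = None, None
    for p in itertools.permutations(range(m), n):
        cost = sum(C[i, p[i]] for i in range(n))
        if best is None or cost < best:
            best, best_p = cost, p
    return best_p

def fbp_solve(C):
    """Solve an (n+1) x (m+1) LSAPE instance C with n <= m via the FBP
    reduction.  Assumes the triangle inequalities (4.3) hold, so that
    by Proposition 4.1 no node is deleted."""
    n, m = C.shape[0] - 1, C.shape[1] - 1
    c_del, c_ins = C[:n, m], C[n, :m]
    # C_bar := C_sub - c_del * 1^T - 1 * c_ins  (the n x m reduced instance)
    C_bar = C[:n, :m] - c_del[:, None] - c_ins[None, :]
    p = brute_force_rect_lsap(C_bar)
    X = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(n):
        X[i, p[i]] = 1         # substitutions only (no deletions)
    for k in range(m):
        if not X[:n, k].any():
            X[n, k] = 1        # the m - n unmatched columns are inserted
    return X
```

Subtracting the deletion cost from every row and the insertion cost from every column only shifts the objective by a constant over matchings without deletions, which is why, under Proposition 4.1, the optimal matching of the small n × m instance recovers an optimal error-correcting matching.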
4.1.3 Adaptations of Classical Algorithms

LSAPE can also be solved directly by adapting algorithms originally designed for LSAP. An adaptation of the Jonker-Volgenant Algorithm is proposed in [59]. An adaptation of the Hungarian Algorithm, denoted as HNG-E in this thesis, has been suggested in [24]. Both modifications lead to an overall time complexity of O(n²m).

4.2 A Fast Solver Without Cost Constraints

In this section, it is shown that LSAPE without cost constraints can be reduced to an instance of LSAP of size n × m. The reduction translates into the algorithm FLWC, which, like FBP, runs in O(n²m) time but, unlike FBP, FBP-0, and SFBP, does not assume the costs to respect the triangle inequalities (4.3). The following Theorem 4.1 states the reduction principle. It relies on a cost-dependent factorization of substitutions, deletions, and insertions. In Section 4.2.1, we discuss special cases and present FLWC. In Section 4.2.2, we present the proof of Theorem 4.1.

Theorem 4.1 (Reduction Principle of FLWC). Let C ∈ R^{(n+1)×(m+1)} be an instance of LSAPE and let X̄* ∈ Π_{n,m} be an optimal maximum matching for the instance C̄ of LSAP defined by setting

$$\overline{c}_{i,k} := \delta^{C}_{i,k,\epsilon} \, c_{i,k} + (1 - \delta^{C}_{i,k,\epsilon})(c_{i,\epsilon} + c_{\epsilon,k}) - c_{\epsilon,k}$$

for all (i,k) ∈ [n] × [m], where δ^C_{i,k,ε} is set to 1 if the triangle inequality c_{i,k} ≤ c_{i,ε} + c_{ε,k} holds, and to 0 otherwise. Furthermore, let the function f_C : Π_{n,m} → Π_{n,m,ε} be defined as