Efficient Frequent Subtree Mining Beyond Forests

Efﬁcient Frequent Subtree Mining Beyond Forests Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität Bonn vorgelegt von Pascal Welke aus Bonn Bonn, 2018 Angefertigt mit Genehmigung der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität Bonn Diese Arbeit wurde von Dr. Tamás Horváth und Prof. Dr. Stefan Wrobel betreut 1. Gutachter: Prof. Dr. Stefan Wrobel 2. Gutachter: Prof. Dr. Christian Bauckhage Tag der Promotion: 20.3.2019 Erscheinungsjahr: 2019 Pascal Welke Department of Computer Science III [email protected] ii Declaration I, Pascal Welke, confirm that this work is my own and is expressed in my own words.Any uses made within it of the works of other authors in any form (e.g. ideas, equations, fig- ures, text, tables, programs) are properly acknowledged at the point of their use. A full list of the references employed has been included. iii Acknowledgments This thesis would not have been possible without the support of many people. In partic- ular, I would like to thank my supervisors Tamás Horváth and Stefan Wrobel. Without them, this thesis would not exist. I don’t know many people that have the opportunity to talk to their supervisor as frequently as I did with Tamás. I know even less1, who ad- ditionally got invited to several intense weeks of discussions, writing, no Internet, and great food and wine in Nemesvita. I’d like to thank Stefan for the freedom, the generous support of my work, and for creating our motivating working environment. During the course of my work on this thesis I was surrounded by a great group of people. CAML became KDML, countless coffees were imbibed, discussions were had and slowly an idea for this thesis manifested. Thank you, Krisztian Buza, Thomas Gärtner, Laurentiu Ilici, Michael Kamp, Olana Missura, Daniel Paurat, Till Schulz, Florian Seif- farth, Katrin Ullrich, and all the great people at Fraunhofer IAIS. A special thank you also goes to Ionut Andone, Konrad Błaszkiewicz, and Alexander Markowetz for the fruitful distractions. I’d like to thank Lars Borutzky, Moritz Fürneisen, Tamás Horváth, Michael Kamp, Ra- jkumar Ramamurthy, Florian Seiffarth, Till Schulz, and Stefan Welke, who have read var- ious draft versions of my thesis and gave valuable feedback, pointed out errors, and asked nasty helpful questions. All remaining mistakes are of course my own responsibility. Last but not least, I want to thank my parents Daniela and Stefan Welke and my girl- friend Hanna Hünert for everything. You are the best! 1 All of them were supervised by Tamás v Abstract A common paradigm in distance-based learning is to embed the instance space into some appropriately chosen feature space equipped with a metric and to define the dissimilar- ity between instances by the distance of their images in the feature space. If the instances are graphs, then frequent connected subgraphs are a well-suited pattern language to define such feature spaces. Identifying the set of frequent connected subgraphs andsub- sequently computing embeddings for graph instances, however, is computationally in- tractable. As a result, existing frequent subgraph mining algorithms either restrict the structural complexity of the instance graphs or require exponential delay between the output of subsequent patterns. Hence distance-based learners lack an efficient way to op- erate on arbitrary graph data. To resolve this problem, in this thesis we present a mining system that gives up the demand on the completeness of the pattern set to instead guar- antee a polynomial delay between subsequent patterns. Complementing this, we devise efficient methods to compute the embedding of arbitrary graphs into the Hamming space spanned by our pattern set. As a result, we present a system that allows to efficiently apply distance-based learning methods to arbitrary graph databases. To overcome the computational intractability of the mining step, we consider only frequent subtrees for arbitrary graph databases. This restriction alone, however, does not suf- fice to make the problem tractable. We reduce the mining problem from arbitrary graphs to forests by replacing each graph by a polynomially sized forest obtained from a random sample of its spanning trees. This results in an incomplete mining algorithm. However, we prove that the probability of missing a frequent subtree pattern is low. We show em- pirically that this is true in practice even for very small sized forests. As a result, our algorithm is able to mine frequent subtrees in a range of graph databases where state-of- the-art exact frequent subgraph mining systems fail to produce patterns in reasonable time or even at all. Furthermore, the predictive performance of our patterns is compara- ble to that of exact frequent connected subgraphs, where available. The above method considers polynomially many spanning trees for the forest, while many graphs have exponentially many spanning trees. The number of patterns found by our mining algorithm can be negatively influenced by this exponential gap. We hence propose a method that can (implicitly) consider forests of exponential size, while remaining computationally tractable. This results in a higher recall for our incomplete mining algorithm. Furthermore, the methods extend the known positive results on the tractabil- ity of exact frequent subtree mining to a novel class of transaction graphs. We conjecture that the next natural extension of our results to a larger transaction graph class is at least as difficult as proving whether P = NP, or not. Regarding the graph embedding step, we apply a similar strategy as in the mining step. We represent a novel graph by a forest of its spanning trees and decide whether the frequent trees from the mining step are subgraph isomorphic to this forest. As a result, the embedding computation has one-sided error with respect to the exact subgraph isomor- phism test but is computationally tractable. Furthermore, we show that we can leverage a partial order on the pattern set. This structure can be used to reduce the runtime of the embedding computation dramatically. For the special case of Jaccard-similarity between graph embeddings, a further substantial reduction of runtime can be achieved using min- hashing. The Jaccard-distance can be approximated using small sketch vectors that can be computed fast, again using the partial order on the tree patterns. vii Die Welt geht nicht unter, wenn wir nicht vollständig aufzählen. (Tamás Horváth) ix Contents 1. Introduction 1 1.1. A Motivating Experiment ............................... 4 1.2. Contributions ...................................... 6 1.2.1. Efficient Frequent Subtree Mining .................... 7 1.2.2. Fast Computation in Probabilistic Subtree Feature Spaces ...... 10 1.3. Outline .......................................... 11 1.4. Previously Published Work .............................. 12 2. Preliminaries 13 2.1. Notions and Notation ................................. 13 2.2. Frequent Connected Subgraph Mining ...................... 18 2.2.1. A Generic Levelwise Mining Algorithm ................. 20 2.2.2. The Computational Complexity of Frequent Subtree Mining .... 23 2.3. Embedding Computation ............................... 27 2.4. Datasets ......................................... 28 3. Related Work 33 3.1. Algorithms for the SubgraphIsomorphism Problem ........... 35 3.1.1. Embedding Lists and Exponential Algorithms ............. 37 3.2. Algorithms for the FCSM Problem ......................... 39 3.2.1. Frequent Tree Mining Algorithms ..................... 39 3.2.2. Frequent Subgraph Mining Algorithms ................. 42 3.2.3. Algorithms for Relaxed Problems ..................... 44 4. Probabilistic Frequent Subtrees 49 4.1. Mining Probabilistic Frequent Subtrees ...................... 51 4.1.1. The Relaxed Frequent Subtree Mining Problem ............ 51 4.1.2. Probabilistic Bounds and the Importance of Subtrees ........ 56 4.1.3. Implementation Issues and Runtime Analysis ............. 58 4.2. Experimental Evaluation ............................... 60 4.2.1. Runtime ..................................... 61 4.2.2. Recall ...................................... 65 4.2.3. Stability of Probabilistic Subtree Patterns ................ 67 4.2.4. Predictive Performance ........................... 69 4.3. Summary ......................................... 72 5. Boosted Probabilistic Frequent Subtrees 73 5.1. An Efficient Embedding Operator for Trees ................... 75 x Contents 5.2. Mining Boosted Probabilistic Frequent Subtrees ................ 82 5.2.1. Implementation Issues ........................... 85 5.2.2. Experimental Evaluation .......................... 86 5.3. Exact Frequent Subtree Mining on Locally Easy Graphs ............ 90 5.4. Summary and Open Questions ........................... 93 6. Fast Computation in Probabilistic Subtree Feature Spaces 95 6.1. Complete Embeddings into Subtree Feature Spaces .............. 97 6.2. Min-Hashing in Subtree Feature Spaces ...................... 101 6.3. Experimental Evaluation ............................... 104 6.3.1. Efficiency Gains ................................ 105 6.3.2. Predictive and Retrieval Performance .................. 107 6.4. Summary and Open Questions ........................... 110 7. Conclusion 113 7.1. Discussion ........................................ 113 7.2. Outlook .........................................

Load more