Real-Time Analytics for Complex Structure Data
Ting Guo
A Thesis submitted for the degree of Doctor of Philosophy
Faculty of Engineering and Information Technology University
of Technology, Sydney, 2015

Certificate of Authorship and Originality
I certify that the work in this thesis has not previously been submitted for a degree nor has it been submitted as part of requirements for a degree except as fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I have received in my research work and the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.
Signature of Student:
Date:
Acknowledgments
On completing this thesis, I am especially thankful to my supervisor Prof. Chengqi Zhang and co-supervisor Prof. Xingquan Zhu, who led me into a then-unfamiliar area of academic research, trusted me, and gave me as much freedom as possible to pursue my own research interests. Prof. Zhu taught me how to think and study independently and how to solve difficult scientific problems in flexible but rigorous ways. He sacrificed much of his precious time to develop my academic research skills. When I felt lost and fearful about my future, he always gave me the confidence and motivation to keep going and strive to do better. Prof. Zhang has also given me great help and support in life.
I am thankful to the group members I met at the University of Technology, Sydney, including Shirui Pan, Lianhua Chi, Jia Wu, and many others. I learned a lot from these smart people and was always inspired by the interesting and in-depth discussions with them. I enjoyed the wonderful atmosphere of both academic research and daily life that we shared.
I am incredibly grateful to my mother and father for their generosity and encouragement. This thesis could not have been completed without their constant support and understanding. I am also thankful to my friends who have accompanied me, though not always at my side, through the arduous journey of these three years.
Contents

1 Introduction
  1.1 MOTIVATION
  1.2 CONTRIBUTIONS
  1.3 PUBLICATIONS
  1.4 THESIS STRUCTURE

2 Literature Review
  2.1 PRELIMINARY
  2.2 GRAPH CLASSIFICATION
  2.3 FREQUENT SUB-GRAPH MINING (FSM)
  2.4 SUB-GRAPH FEATURE SELECTION
  2.5 DATA STREAM MINING
  2.6 REAL-TIME ANALYSIS
  2.7 ROADMAP

3 Understanding the Roles of Sub-graph Features for Graph Classification: An Empirical Study Perspective
  3.1 INTRODUCTION
  3.2 PROBLEM FORMULATION
    3.2.1 Graph and Sub-graph
    3.2.2 Frequent Sub-graph Mining
    3.2.3 Graph Classification
  3.3 EXPERIMENTAL STUDY
    3.3.1 Benchmark Data
    3.3.2 Sub-graph Features
    3.3.3 Experimental Settings
    3.3.4 Results and Analysis

4 Graph Hashing and Factorization for Fast Graph Stream Classification
  4.1 INTRODUCTION
  4.2 PROBLEM DEFINITION
  4.3 GRAPH FACTORIZATION
    4.3.1 Factorization Model
    4.3.2 Learning Algorithm
  4.4 FAST GRAPH STREAM CLASSIFICATION
    4.4.1 Overall Framework
    4.4.2 Graph Clique Mining
    4.4.3 Clique Set Matrix and Graph Factorization
      Discriminative Frequent Cliques
      Feature Mapping
    4.4.4 Graph Stream Classification
  4.5 EXPERIMENTS
    4.5.1 Benchmark Data
    4.5.2 Experimental Settings
    4.5.3 Experimental Results
      Graph Stream Classification Accuracy
      Graph Stream Classification Efficiency

5 Super-graph based Classification
  5.1 INTRODUCTION
  5.2 PROBLEM DEFINITION
  5.3 OVERALL FRAMEWORK
  5.4 WEIGHTED RANDOM WALK KERNEL
    5.4.1 Kernel on Single-attribute Graphs
    5.4.2 Kernel on Super-Graphs
  5.5 SUPER-GRAPH CLASSIFICATION
  5.6 THEORETICAL STUDY
  5.7 EXPERIMENTS AND ANALYSIS
    5.7.1 Benchmark Data
    5.7.2 Experimental Settings
    5.7.3 Results and Analysis

6 Streaming Network Node Classification
  6.1 INTRODUCTION
  6.2 PROBLEM DEFINITION AND FRAMEWORK
  6.3 THE PROPOSED METHOD
    6.3.1 Streaming Network Feature Selection
      Feature Selection on a Static Network
      Feature Selection on Streaming Networks
    6.3.2 Node Classification on Streaming Networks
  6.4 EXPERIMENTS
    6.4.1 Experimental Settings
    6.4.2 Performance on Static Networks
    6.4.3 Performance on Streaming Networks
    6.4.4 Case Study

7 Conclusion
  7.1 SUMMARY OF THIS THESIS
  7.2 FUTURE WORK
List of Figures

2-1 Graph examples collected from different domains.
2-2 The overall roadmap of this thesis.

3-1 An example of sub-graph pattern representation. The left panel shows two graphs, G1 and G2, and the right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.
3-2 The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.
3-3 A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).
3-4 An example of graph isomorphism.
3-5 Graph representation for a paper (ID17890) in DBLP. The node in red is the main paper; nodes in black ellipses are citations, while nodes in black boxes are keywords.
3-6 Classification accuracy on five NCI chemical compound datasets with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
3-7 Classification accuracy on the D&D protein dataset and the DBLP citation dataset with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
3-8 Classification accuracy on one NCI chemical compound dataset, the D&D protein dataset, and the DBLP citation dataset with respect to different sizes of sub-graph features (using Nearest Neighbours: NN).

4-1 Coarse-grained vs. fine-grained representation.
4-2 An example of graph factorization.
4-3 The framework of FGSC for graph stream classification.
4-4 An example of clique mining in a compressed graph.
4-5 An example of an "in-memory" Clique-class table Γ.
4-6 Graph representation for a paper (ID17890) in DBLP.
4-7 Accuracy w.r.t. different chunk sizes on DBLP Stream. The number of features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800; (c) 600.
4-8 Accuracy w.r.t. different numbers of features on DBLP Stream with each chunk containing 1000 graphs. The number of features selected in each chunk is: (a) 307; (b) 142; (c) 62.
4-9 Accuracy w.r.t. different classification methods on DBLP Stream with each chunk containing 1000 graphs and 142 features per chunk. The classification methods selected here are: (a) NN; (b) SMO; (c) NaiveBayes.
4-10 Accuracy w.r.t. different chunk sizes on IBM Stream. The number of features in each chunk is 75. The batch sizes vary from (a) 500; (b) 400; to (c) 300.
4-11 Accuracy w.r.t. different numbers of features on IBM Stream with each chunk containing 400 graphs. The number of features selected in each chunk is: (a) 148; (b) 75; (c) 43.
4-12 Accuracy w.r.t. different classification methods on IBM Stream with each chunk containing 400 graphs and 75 features per chunk. The classification methods selected include: (a) NN; (b) SMO; (c) NaiveBayes.
4-13 System accumulated runtime using the NN classifier, where |D| = 1000, |m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively. (a) Results on the DBLP stream; (b) results on the IBM stream.

5-1 (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph.
5-2 A conceptual view of a protein interaction network using super-graph representation.
5-3 WRWK on the super-graphs (G, G') and the single-attribute graphs (g1, g2, g3).
5-4 An example of using super-graph representation for scientific publications.
5-5 Super-graph and comparison graph representations.
5-6 Classification accuracy on DBLP and Beer Review datasets w.r.t. different classification methods (NB, DT, SVM, and NN).
5-7 Classification accuracy on the Beer Review dataset w.r.t. different datasets and classification methods (NB, DT, SVM, and NN).
5-8 Performance w.r.t. different edge-cutting thresholds on DBLP and Beer Review datasets using the WRWK method.

6-1 An example of streaming networks, where each color bar denotes a feature.
6-2 An example of using feature selection to capture changes in a streaming network (keywords inside each node denote node content).
6-3 The framework of the proposed streaming network node classification (SNOC) method.
6-4 An example of using feature selection to capture structure similarity.
6-5 The accuracies on three real-world static networks w.r.t. different numbers of selected features (from 50 to 300).
6-6 The accuracy on three networks w.r.t. (a) different maximal path lengths l (from 1 to 5), (b) different values of the weight parameter ξ (from 0 to 1), and (c) different percentages of labeled nodes.
6-7 The accuracy on streaming networks: (a) accuracy on the DBLP citation network from 1991 to 2010, (b) accuracy on the PubMed Diabetes network for 15 time points, and (c) accuracy on the extended DBLP citation network from 1991 to 2010.
6-8 The cumulative runtime on DBLP and PubMed Diabetes networks corresponding to Fig. 6-5.
6-9 Case study on the DBLP citation network.
List of Tables

3.1 The advantages and disadvantages of vector representation vs. graph representation
3.2 NCI datasets used in experiments
3.3 DBLP dataset used in experiments
3.4 Number of sub-graphs with respect to different sizes (i.e., number of edges)

4.1 DBLP dataset used in experiments

6.1 Accuracy results on static networks
Abstract
Advances in data acquisition and analysis technology have resulted in many real-world data sources that are dynamic and contain rich content and structure information. With the fast development of information technology, real-world data are increasingly characterized by dynamic changes, such as new instances, new nodes and edges, and modifications to node content. Different from traditional data, which are represented as feature vectors, data with complex relationships are often represented as graphs that capture both the content of the data entries and their structural relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. In addition, real-time availability is one of the defining features of today's data. Real-time analytics refers to dynamic analysis and reporting based on data entered into a system shortly before the actual time of use; it emphasizes deriving immediate knowledge from dynamic data sources, such as data streams, so knowledge discovery and pattern mining increasingly face complex, dynamic data sources. However, how to combine structure information and node content information for accurate, real-time data mining remains a major challenge. Accordingly, this thesis focuses on real-time analytics for complex structure data. We explore instance correlation in complex structure data and utilize it to make mining tasks more accurate and applicable. Specifically, our objective is to combine node correlation with node content and utilize them for three different tasks: (1) graph stream classification, (2) super-graph classification and clustering, and (3) streaming network node classification.
Understanding the roles of structured patterns for graph classification: The thesis first reviews existing work on data mining from a complex-structure perspective. We then propose a graph factorization-based fine-grained representation model, whose main objective is to use linear combinations of a set of discriminative cliques to represent graphs for learning. The optimization-oriented factorization approach ensures minimum information loss for graph representation and avoids the expensive sub-graph isomorphism validation process. Based on this idea, we propose a novel framework for fast graph stream classification.
A new structure data classification algorithm: The second method introduces a new super-graph classification and clustering problem. Due to the inherent complexity of the super-graph representation, existing graph classification methods cannot be applied to super-graph classification. In this thesis, we propose a weighted random walk kernel that calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs and (b) the common walks of the super-graphs. Our key contributions are: (1) a new super-node and super-graph structure that enriches existing graph representations for real-world applications; (2) a weighted random walk kernel considering both node and structure similarities between graphs; (3) a mixed similarity considering the structured content inside super-nodes and the structural dependency between super-nodes; and (4) an effective kernel-based super-graph classification method with a sound theoretical basis. Empirical studies show that the proposed methods significantly outperform state-of-the-art methods.

A real-time analytics framework for dynamic complex structure data: For streaming networks, the essential challenge is to properly capture the dynamic evolution of node content and node interactions in order to support node classification. While streaming networks evolve dynamically, for a short temporal period a subset of salient features is essentially tied to the network content and structures, and can therefore be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) on the network and use the selected features as a gauge to classify unlabeled nodes. A Laplacian-based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated from node labels and network topology structures.
Node classification is achieved by finding the class label that results in the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to changes in the network for maximal performance gain. Experiments and comparisons on real-world networks demonstrate that the proposed method, SNOC, is able to capture dynamics in network structures and node content, and outperforms baseline approaches with significant performance gains.
Chapter 1
Introduction
1.1 MOTIVATION
The advancement of data acquisition and analysis technology has resulted in many applications involving complex structure data; examples include Cheminformatics [65], Bioinformatics [53], and Social Network Analysis (e.g., DBLP) [3]. Different from traditional data, which are represented as feature vectors, data with structure relationships are often represented as graphs to preserve the content of the data entries and their relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. For example, each node in a social network can denote one person, and links between nodes can represent their social interactions.

In reality, changes are essential components of real-world structure data, mainly because user participation, interactions, and responses to external factors continuously introduce new nodes and edges to the data. In addition, a user may add, delete, or modify online posts, which naturally results in changes to the node content. As a result, such data are inherently dynamic. These dynamic changes may significantly influence mining results. For example, changes in users' interests may result in changes to their social groups or friend circles. Real-time analytics on structure data can therefore help us understand and capture such changes and thereby improve mining performance.

Graph classification concerns learning a discriminative classifier, from training data containing structure information, to classify previously unseen graph samples into specific categories, where the main challenge is to explore the structure information in the training data to build classifiers. One of the most common graph classification approaches is to use sub-graph features to convert graphs into instance-feature representations, so that generic learning algorithms can be applied for classification. Finding good sub-graph features is therefore an important task for this type of learning approach. Common heuristics are to find high-frequency sub-graphs or to apply feature selection measures, such as information gain or the Gini index, to refine frequent sub-graphs. While all these methods have been shown to be effective in the literature, due to the inherent complexity of graph data, the process of finding sub-graphs is non-trivial and is usually the most computationally expensive procedure in the graph classification process. In addition, the genuine discriminative power of sub-graph features has never been comprehensively studied. These observations raise serious concerns:
• Mining sub-graphs involves graph matching operations, which are computationally demanding. Indeed, just testing whether a graph is a sub-graph of another graph is an NP-complete problem [74]. It is very time-consuming to match patterns in graph datasets, especially for sub-graphs with a large number of edges.
• We are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy is affected by different properties of the selected sub-graphs, such as the number of nodes, the number of edges, and the size of the selected sub-graph set. To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs). Experiments show that this objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores, whereas using highly correlated sub-graphs for classification should be avoided, because it introduces redundant, highly dependent features that may deteriorate classification accuracy. This fact can help to select better sub-graphs to represent graph datasets.
• Most existing graph classification methods are computationally expensive, so we wonder whether there is an inexpensive way to achieve graph classification. More specifically, if we use random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to approaches that use sophisticated sub-graph mining algorithms?
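The information-gain criterion mentioned above can be made concrete for a single 0/1 sub-graph indicator feature. The sketch below (function names are illustrative, not the thesis's implementation) computes IG as the drop in class entropy after splitting the graphs by whether the sub-graph occurs in them:

```python
import math

def entropy(pos, neg):
    """Shannon entropy of a binary class distribution."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(has_feature, labels):
    """IG of a 0/1 sub-graph indicator over binary graph labels."""
    pos = sum(labels)
    base = entropy(pos, len(labels) - pos)
    # Split graphs by whether the sub-graph occurs in them.
    with_f = [y for x, y in zip(has_feature, labels) if x]
    without_f = [y for x, y in zip(has_feature, labels) if not x]
    def h(subset):
        p = sum(subset)
        return entropy(p, len(subset) - p)
    n = len(labels)
    cond = (len(with_f) / n) * h(with_f) + (len(without_f) / n) * h(without_f)
    return base - cond

# A perfectly discriminative sub-graph yields IG = 1 bit.
print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

A sub-graph whose occurrence is independent of the class labels scores zero, which is why IG-based refinement of frequent sub-graphs discards patterns that are common but uninformative.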
Motivated by the above concerns, we carry out empirical studies on the role of sub-graphs in graph classification. We study four real-world graph classification tasks using three types of sub-graph features, namely frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs, and two types of learning algorithms, Support Vector Machines and Nearest Neighbour. Our experiments show that (1) the discriminative power of sub-graphs varies with their size; (2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs is important for good performance; (4) increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs; and (5) performance shows similar trends across different classifiers. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.

From the empirical studies, we observe that, unlike traditional instance-feature representations, graphs do not have explicit features, so an essential step in building a graph classification model is to explore graph features that represent graph data in an instance-feature space for effective learning [57]. The majority of existing graph classification models employ an occurrence-based feature representation model, in which a set of sub-graph features is selected and each graph is represented by the occurrence of each sub-graph in the graph (either 0/1 occurrence or the actual number of occurrences) as the feature value. In this thesis, we refer to such an occurrence-based representation model as a coarse-grained representation model.
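As a toy illustration of the coarse-grained model just described (and only an illustration: real containment testing requires sub-graph isomorphism, which is NP-complete), the sketch below treats each graph as a set of labeled edges so that "contains this sub-graph" degenerates to a subset test; all data are hypothetical:

```python
# Illustrative only: real sub-graph containment requires isomorphism
# testing. Here each graph is a frozenset of labeled edges, so
# "contains sub-graph" reduces to a subset test.
def occurrence_vector(graph_edges, subgraph_features):
    """Coarse-grained 0/1 representation of one graph."""
    return [1 if f <= graph_edges else 0 for f in subgraph_features]

# Two toy "molecules" and two single-edge sub-graph features.
g1 = frozenset({("C", "C"), ("C", "O"), ("C", "N")})
g2 = frozenset({("C", "C"), ("C", "N")})
features = [frozenset({("C", "O")}), frozenset({("C", "N")})]

print(occurrence_vector(g1, features))  # [1, 1]
print(occurrence_vector(g2, features))  # [0, 1]
```

Note how the 0/1 vector says nothing about how closely a feature covers a graph; that lost "degree of closeness" is exactly the information-loss problem discussed next.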
Existing sub-graph based graph classification methods (coarse-grained representation models) have a number of disadvantages, especially for graph stream classification. Computational burden of isomorphism validation: Because sub-graphs are selected as features to represent the graph, the expensive sub-graph mining and isomorphism validation process cannot be avoided (graph isomorphism is proven to be NP-hard) [19]. After the sub-graph features have been selected, the isomorphism validation process has to be applied again to map the graphs into the vector space. Moreover, because sub-graphs are selected as features and general learning methods prefer low feature dimensionality, a coarse-grained feature representation model has to limit the number of sub-graphs used to represent the graph data. Severe and unbounded information loss: Because the coarse-grained representation does not characterize the degree of closeness between a sub-graph pattern and the graph, it suffers information loss in representing the graph. Furthermore, for a graph stream with continuously growing volumes (i.e., an increasing number of graphs) and changing structures (e.g., new nodes may appear in incoming graphs), the disadvantages of the existing coarse-grained graph feature model are even greater. This is because (1) stream volumes continuously increase, making it computationally expensive to explore sub-graph features, and (2) changes in the graph stream (such as new structures) render the sub-graph features incapable of representing new graphs. A fine-grained graph feature model is therefore needed to improve classification performance.
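The fine-grained alternative developed below replaces 0/1 occurrences with a best-fit linear combination of clique bases. As a minimal sketch of that underlying idea, the code fits weights by plain gradient descent on the squared reconstruction error; the clique matrix and target vector are hypothetical, and the thesis's factorization model is considerably richer:

```python
# Illustrative only: fit weights w so that a linear combination of
# clique-basis vectors (columns of C) approximates a graph's
# edge-indicator vector g, minimizing 0.5 * ||C w - g||^2 by
# gradient descent. Data below are hypothetical.
def fit_clique_weights(C, g, steps=2000, lr=0.05):
    m, k = len(C), len(C[0])
    w = [0.0] * k
    for _ in range(steps):
        # residual r = C w - g
        r = [sum(C[i][j] * w[j] for j in range(k)) - g[i] for i in range(m)]
        # gradient w.r.t. w is C^T r
        for j in range(k):
            w[j] -= lr * sum(C[i][j] * r[i] for i in range(m))
    return w

# Two hypothetical cliques over a four-edge vocabulary (columns of C).
C = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
g = [1.0, 1.0, 1.0, 0.0]  # the graph's own edge-indicator vector
w = fit_clique_weights(C, g)
recon = [sum(C[i][j] * w[j] for j in range(2)) for i in range(4)]
print([round(x, 2) for x in recon])  # [1.0, 1.0, 1.0, 0.0], i.e. zero loss here
```

The fitted weight vector w, rather than a 0/1 occurrence vector, becomes the graph's feature representation, so the representation error is explicitly minimized instead of being unbounded.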
To build such a fine-grained graph feature model, our main idea is to bypass the expensive sub-graph mining process and use linear combinations of graph cliques to represent graphs. The linear combination ensures that the information loss in transforming graphs to vectors is minimal. Compared to the traditional coarse-grained representation model, a clear advantage of our method is that the fine-grained graph representation enjoys a rigorous theoretical foundation that ensures minimal information loss for graph representation, and it also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard. To achieve this goal, our algorithm relies on two important steps to extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as the base; and (2) using graph factorization to calculate a linear combination of graph cliques that best represents a graph. After applying our fine-grained feature mapping model to represent graphs, we can use any learning algorithm, such as Nearest Neighbour or Support Vector Machines, for graph stream classification. Our method offers a number of advantages, including a fast feature mining process, more precise graph representation, and better performance for graph stream classification. Experiments on two real-world network graph datasets demonstrate that our method outperforms state-of-the-art approaches in both classification accuracy and runtime efficiency.

Although graph representation preserves structure information and thereby improves classification accuracy, all existing frameworks that use graphs to represent objects rely on one of two approaches to describe node content: (1) node as a single attribute: each node has only one attribute (a single-attribute node). A clear drawback of this representation is that a single attribute cannot precisely describe the node content [40]. This representation is commonly referred to as a single-attribute graph (Fig. 5-1 (A)). (2) node as a set of attributes: a set of independent attributes describes the node content (Fig. 5-1 (B)). This representation is commonly referred to as an attributed graph [8, 14, 80]. However, with the development of information technology, data are becoming more and more complex. Indeed, in many applications the attributes/properties used to describe the node content may themselves be subject to dependency structures. For example, in a citation network each node represents one paper and edges denote citation relationships. It is insufficient to use one or multiple independent attributes to describe the detailed information of a paper. Instead, we can represent the content of each paper as a graph, with nodes denoting keywords and edges representing contextual correlations between keywords (e.g., co-occurrence of keywords in sentences or paragraphs). As a result, a paper (a super-node) and all the references it cites can form a super-graph, with each edge between papers denoting their citation relationship. To build learning models for super-graphs, the main challenge is to properly calculate the distance between two super-graphs.
• Similarity between two super-nodes: Because each super-node is itself a graph, the overlapped/intersected graph structure between two super-nodes reveals the similarity between the super-nodes, and in turn contributes to the relationship between the two super-graphs. The traditional hard node-matching mechanism is unsuitable for super-graphs, which require soft node-matching.
• Similarity between two super-graphs: The complex structure of super-graphs requires a similarity measure that considers not only the structure similarity but also
the super-node similarity between two super-graphs. This cannot be achieved without combining node matching and graph matching as a whole to assess similarity between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK) for super-graphs. We generate a new product graph from two super-graphs and then use weighted random walks on the product graph to calculate the similarity between the super-graphs. A weighted random walk is a walk starting from a random weighted node and following succeeding weighted nodes and edges in a random manner. The weight of a node in the product graph denotes the similarity of the two corresponding super-nodes. Given a set of labeled super-graphs, we can use the weighted product graph to establish walk-based relationships between two super-graphs and calculate their similarity, and then obtain the kernel matrix for super-graph classification.

Recent years have also witnessed an increasing number of applications involving networked data, where instances are not only characterized by their content but are also subject to dependency relationships. The mixed node content and structure information raise many unique data mining tasks, such as network node classification [3]. In reality, changes are essential components of real-world networks, mainly because user participation, interactions, and responses to external factors continuously introduce new nodes and edges to the network. In addition, a user may add, delete, or modify online posts, which naturally results in changes to the node content. As a result, such networks are inherently dynamic. Accurate node classification in a streaming network setting is therefore much more challenging than in static networks. In summary, node classification in streaming networks has at least three major challenges:
• Streaming network structures: Network structures encode rich information about node interactions inside the network, which should be considered for node classification. In streaming networks, structures are constantly changing, so node classification needs to rapidly capture and adapt to such changes for maximal accuracy gain.
• Streaming node features: For each node in a streaming network, its content may
constantly evolve (e.g., through user posts or profile updates). As a result, the feature space used to denote the node content is dynamically changing, resulting in streaming features [78] with an infinite feature space. To capture changes, a feature selection method should promptly select the most effective features to ensure that node classification can quickly adapt to the new network.
• Unlimited network node space: The node volume of a streaming network increases dynamically, resulting in an unlimited network node space and nodes that have never appeared in the network before. Node classification needs to scale to the dynamically increasing node volume and incrementally update models discovered from historical data to accurately classify new nodes.
To achieve high node classification accuracy, a fundamental issue is to properly characterize such changes. One possible solution is to use features to capture changes in streaming networks for node classification. For streaming networks, changes are introduced through two major channels: (1) node content and (2) topology structures. Because nodes close to each other in the network structure space tend to share common content information [26], we can use selected features to design a "similarity gauging" procedure that assesses the consistency of the network node content and structures to determine the labels of unlabeled nodes. A smaller gauging value indicates that the node content and structures better align with the node label. Gauging-based classification is therefore carried out such that, for an unlabeled node, its label is the class that results in the minimal gauging value with respect to the identified features. By updating the selected features, the node classification can automatically adapt to changes in the streaming network for maximal accuracy gain. Accordingly, this research proposes a node classification method for networked data, and then a streaming feature selection approach for dynamically drifting networks.

In summary, taking complex structure information into account helps improve mining performance on structure data. Moreover, exploiting changes in a dynamic setting is a promising way to further improve mining accuracy. In this thesis, we investigate different types of complex structure data and propose several methods to solve the above challenges.
1.2 CONTRIBUTIONS
This thesis focuses on exploring and utilizing structure information to solve some important problems in data mining. We list our contributions to each of them below:
• Exploring the roles of sub-graph features for graph classification: We carry out empirical studies on four real-world graph classification tasks, using three types of sub-graph features (frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs) and two types of learning algorithms (Support Vector Machines and Nearest Neighbour). Our experiments show that (1) the discriminative power of sub-graphs varies with their size; (2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs is important for good performance; (4) increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs; and (5) performance shows similar trends across different classifiers. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.
• Exploring clique information and using factorization methods for fast graph stream classification: we propose a fine-grained graph factorization framework for efficient graph stream classification in this thesis. Being fine-grained, our mapping framework relies on a set of discriminative frequent cliques, instead of important sub-graph patterns, to represent graphs. Such a fine-grained representation ensures that the final instance-feature representation is sufficiently close to the original graph, with a theoretical guarantee. Our algorithm relies on two important steps to extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as the base; and (2) using graph factorization to calculate a linear combination of the graph cliques that best represents a graph. Compared to the traditional coarse-grained representation model, a clear advantage of our method is that the fine-grained graph representation enjoys a rigorous theoretical foundation that ensures minimal information loss, and also avoids the expensive sub-graph isomorphism validation process.
• Exploring inner-structure information for super-graph classification: In this thesis, we introduce a special type of graph, where the content of a node can itself be represented as a graph, as a “super-graph”. Likewise, we refer to a node whose content is represented as a graph as a “super-node”. To build learning models for super-graphs, the main challenge is to properly calculate the distance between two super-graphs.
(1) Similarity between two super-nodes: Because each super-node is a graph, the overlapped/intersected graph structure between two super-nodes reveals the similarity between the super-nodes, as well as the relationship between the two super-graphs. Traditional hard-node-matching mechanisms are unsuitable for super-graphs, which require soft-node-matching.
(2) Similarity between two super-graphs: The complex structure of a super-graph requires that the similarity measure considers not only the structure similarity, but also the node similarity between two super-graphs. This cannot be achieved without combining node matching and graph matching as a whole to assess similarity between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel for super-graphs: we propose a weighted random walk kernel which calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key contribution is twofold: (1) a weighted random walk kernel considering node and structure similarities between graphs; and (2) an effective kernel-based super-graph classification method with a sound theoretical basis.
• Exploring manifold information for streaming network node classification: we propose a novel node classification method for streaming networks. Our method takes network structure and node labels into consideration to find an optimal subset of features to represent the network. Based on the selected features, a streaming network node classification method, SNOC, is proposed to classify unlabeled nodes through the minimization of the similarity distance in the network and the feature-based distance between nodes. The main contribution, compared to existing works, is twofold:
(1) Streaming Network Node Classification: We propose a new streaming network node classification (SNOC) method that takes node content and structure similarity into consideration to find important features to model changes in the network for node classification. This method is not only more accurate than existing node classification approaches, but is also effective in capturing changes in networks for node classification.
(2) Streaming Network Feature Selection: We introduce a novel streaming net- work feature selection framework, SNF, for streaming networks. To ensure feature evaluation can timely adapt to changes in the network, SNF incrementally updates the evaluation score of an existing feature by accumulating changes in the network. This allows our method to effectively handle streaming networks with changing fea- ture space and feature distributions for better runtime and performance gain.
1.3 PUBLICATIONS
• Ting Guo, Zhanshan Li, and Xingquan Zhu. Large Scale Diagnosis Using Associations between System Outputs and Components. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp. 1786-1787.
• Ting Guo and Xingquan Zhu. Understanding the roles of sub-graph features for graph classification: an empirical study perspective. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 817-822.
• Ting Guo, Lianhua Chi, and Xingquan Zhu. Graph hashing and factorization for fast graph stream classification. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 1607-1612.
• Ting Guo and Xingquan Zhu. Super-Graph Classification. Proceedings of Advances in Knowledge Discovery and Data Mining, 2014, pp. 323-336.
• Ting Guo, Xingquan Zhu, Jian Pei and Chengqi Zhang. SNOC: Streaming Network Node Classification. Proceedings of the 14th IEEE International Conference on Data Mining, 2014.
1.4 THESIS STRUCTURE
The rest of this thesis is organized as follows:

Chapter 2: This chapter surveys existing works on mining structure data. It summarizes major approaches in the field, along with their technical strengths and weaknesses.

Chapter 3: In this chapter, we carry out empirical studies on four real-world graph classification tasks, using three types of sub-graph features: frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs. The two types of learning algorithms are Support Vector Machines and Nearest Neighbour. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.

Chapter 4: We propose a fine-grained graph factorization approach for Fast Graph Stream Classification (FGSC). A graph clique mining method is given and the factorization algorithm is provided based on the clique mining result. Experiments demonstrate that the proposed method outperforms state-of-the-art approaches in both classification accuracy and efficiency.

Chapter 5: In this chapter, we formulate a new super-graph classification task where each node of a super-graph may contain a graph. To support super-graph classification, we propose a Weighted Random Walk Kernel (WRWK) with sound theoretical properties, including bounded similarity. Experiments confirm that our method significantly outperforms baseline approaches.

Chapter 6: This chapter introduces a new classification method for streaming networks, namely streaming network node classification (SNOC). It provides algorithm details, theoretical proofs, time complexity analysis, and experiments.

Chapter 7: This chapter concludes this thesis and outlines directions for future work.
Chapter 2
Literature Review
This literature review provides an in-depth study of existing data mining methods for complex structure data. Our main objectives are to (1) summarize and categorize graph classification methods, graph clustering methods, and their extensions to streaming structure data; and (2) compare and analyze the strengths and deficiencies of existing approaches. First, we introduce some basic definitions.
2.1 PRELIMINARY
DEFINITION 1 Graph: A graph G is a set of vertices (nodes) v connected by edges (links) e. Thus G = (v, e).
DEFINITION 2 Vertex (Node): A node v is a terminal point or an intersection point of a graph. It is the abstraction of a location such as a city, an administrative division, a road intersection or a transport terminal (stations, terminuses, harbors and airports).
DEFINITION 3 Edge (Link): An edge e is a link between two nodes. The link (i, j) is of initial extremity i and of terminal extremity j. A link is the abstraction of a transport infrastructure supporting movements between nodes. It has a direction that is commonly represented as an arrow. When an arrow is not used, it is assumed the link is bi-directional.
DEFINITION 4 Sub-Graph: A sub-graph G_p is a subset of a graph G, where p indexes the sub-graphs. For instance, G_p = (v_p, e_p) with v_p ⊆ v and e_p ⊆ e can be a distinct sub-graph of G. Unless the global transport system is considered as a whole, every transport network is in theory a sub-graph of another. For instance, the road transportation network of a city is a sub-graph of a regional transportation network, which is itself a sub-graph of a national transportation network.
2.2 GRAPH CLASSIFICATION
Given a collection of training samples, each of which is suitably labeled with a class label, the classification task is to build a learning model to automatically assign a previously unseen example to a specific category (or class). For data with dependency structure, such as the ones shown in Figure 2-1, building a learning model to classify them into respective categories has proved to be a challenging task. This is mainly attributed to the fact that graph data normally do not have features immediately available to support the learning, whereas nearly all existing supervised learning methods require training data to be represented in a tabular instance-feature format. In order to support graph classification, one of the fundamental challenges is to characterize graphs and represent them in tabular feature formats, so that generic supervised learning algorithms can be used to derive learning models from graphs. The three most popular approaches include (1) sub-graph feature based methods; (2) global structure based methods, such as graph edit distance and graph embedding; and (3) graph kernel methods. Sub-graph feature based methods use substructure patterns (sub-graphs) as features to represent each graph as a feature vector, so that a training graph set can be converted into a generic training instance set for learning. Sub-graph based approaches are useful in many graph related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into vector space is that it makes existing algorithmic tools developed
1 In Fig. 2-1, the upper panels are the original representations and the bottom panels are the corresponding transferred graphs: (a) protein data (each node denotes a local amino acid region with special secondary structures and each edge denotes the nearest neighbour in space); (b) a graph collected from chemical compound data (the 3D structure can be transferred into a graph by using molecules connected with chemical bonds); (c) a graph collected from the DBLP citation dataset, where each node denotes a paper ID or a keyword and each edge denotes the citation relationship between papers or keywords appearing in the paper’s title. The label of the graph (y) can be used for training a learning model for graph classification.
Figure 2-1: Graph examples collected from different domains.
for feature based object representations available for graph structure data. For sub-graph feature based methods, one key challenge is how to select discriminative sub-graphs for mapping graphs into vector space, which can help improve the classification accuracy. To achieve this goal, several objective functions are used to test the discriminative power of attributes (sub-graphs), such as frequency, Information Gain (IG) [36], Fisher Score [23] and Laplacian Score [42]. Because sub-graph features can help convert each single graph into a vector representation, graph classification can be achieved by using popular learning algorithms such as Decision Trees [36] and Support Vector Machines (SVM) [16].
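As an illustration of how such an objective function operates, the following sketch computes the information gain of binary sub-graph indicator features against class labels. All data values are hypothetical and the snippet is a simplification of the criterion in [36]: real pipelines score thousands of candidate sub-graphs this way and keep the top-scoring ones.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_column, labels):
    """IG of one binary sub-graph indicator feature with respect to labels."""
    base = entropy(labels)
    n = len(labels)
    cond = 0.0
    for value in (0, 1):
        # Labels of the graphs in which the sub-graph is present/absent.
        subset = [y for x, y in zip(feature_column, labels) if x == value]
        if subset:
            cond += (len(subset) / n) * entropy(subset)
    return base - cond

# Each column says whether one sub-graph occurs in each of four graphs.
labels = [1, 1, 0, 0]
print(information_gain([1, 1, 0, 0], labels))  # 1.0: perfectly discriminative
print(information_gain([1, 0, 1, 0], labels))  # 0.0: uninformative
```

A feature whose presence splits the graph set exactly along class boundaries gets the maximal score, matching the intuition that such a sub-graph is the most useful classification feature.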
Global structure based methods apply inexact graph matching, in which error correction is made part of the matching process. Central to this approach is the measurement of the similarity of pairwise graphs, which can be carried out in many ways. One measure, which has garnered particular interest as it is error-tolerant to noise and distortion, is the graph edit distance (GED), defined as the cost of the least expensive sequence of edit operations needed to transform one graph into another [24]. GED algorithms are influenced considerably by the cost functions related to the edit operations. The GED between pairwise graphs changes with the change of cost functions and its validity is dependent on the
rationality of the cost function definition [6][17]. Graph kernel methods use kernels to calculate the distance between graphs. Various graph kernels, such as subtree kernels [63], shortest-path kernels [5], joint kernels [30] and cyclic pattern kernels [31], have been proposed. All of these methods have been shown to be effective, but are subject to high computational overhead. Geng Li et al. propose an alternative approach based on feature vectors constructed from different global topological attributes [47]. A dissimilarity space embedding graph kernel based on graph edit distance has also been proposed [38], where the edit distance can be calculated by the Lipschitz embedding method [39]. Experiments show that the classification accuracy can be statistically significantly enhanced by using graph edit distance with prototype reduction and dimensionality reduction [7]. Vogelstein et al. develop a set of signal-subgraph estimators for graph classification [76]. These estimators can be considered as local sparse and low-rank matrix decompositions.
2.3 FREQUENT SUB-GRAPH MINING (FSM)
Sub-graph patterns have proved to be a good representation for closing the gap between structural and statistical pattern classification. Selecting features, in the form of sub-graph patterns, from graph data is a well established area in graph data mining, where early methods often use Apriori-based Graph Mining (AGM) to identify frequent induced sub-graphs [33]. Evaluation of AGM on chemical carcinogenesis data demonstrated that it is more efficient than an inductive logic programming based approach combined with a level-wise search. Meanwhile, Kuramochi and Karypis developed the FSG [45] method, which uses a Breadth First Search (BFS) strategy to grow candidates, whereby pairs of identified frequent k-sub-graphs are joined to generate (k+1)-sub-graphs. FSG uses a canonical labeling method for graph comparison and calculates the support of patterns using a vertical transaction list data representation. Experiments show that FSG is inefficient when graphs contain many vertices and edges with identical labels, because the join operation used by FSG allows multiple automorphisms of single or multiple cores. In summary, AGM and FSG are both Apriori-based frequent sub-graph mining algorithms.
In Apriori-based frequent sub-graph mining algorithms, it is time-consuming and computationally ineffective to join two size-k frequent sub-graphs to generate size-(k+1) graph candidates [28]. To improve efficiency, several pattern-growth-based graph mining approaches have been proposed. FFSM [32] attempts to extend graphs from a single sub-graph directly. For each graph G that has been discovered, FFSM recursively adds new edges until all frequent super-graphs of G are discovered. The recursion stops when no more frequent graphs can be generated. Because a pattern-growth algorithm extends a frequent graph pattern by adding a new edge with every possible label, there is a potential risk that the same graph patterns may be generated more than once [28]. The gSpan algorithm [82] solves this problem by using a right-most extension technique. It uses Depth First Search (DFS) lexicographic ordering to construct a tree-like lattice over all possible graph patterns. The search tree is traversed in a DFS manner and all graph patterns with non-minimal DFS codes are pruned so that redundant candidate generation is avoided. gSpan is arguably the most frequently cited FSM algorithm.
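To make the Apriori-style level-wise join step concrete, the sketch below mines frequent edge sets from a toy database, joining size-k frequent sets into size-(k+1) candidates. Graphs are deliberately simplified to sets of labeled edges, so the connectivity checks and expensive isomorphism tests discussed above are omitted; all names and data are illustrative, not the actual AGM/FSG implementation.

```python
from itertools import combinations

def frequent_edge_sets(database, minsup):
    """Level-wise (Apriori-style) mining over graphs simplified to
    frozensets of labeled edges. Real FSM must additionally handle
    connectivity and graph isomorphism, omitted here for brevity."""
    # Level 1: frequent single edges.
    edges = {e for g in database for e in g}
    frequent = [{frozenset([e]) for e in edges
                 if sum(e in g for g in database) >= minsup}]
    while frequent[-1]:
        # Join step: two size-k sets sharing k-1 edges form a size-(k+1) candidate.
        candidates = {a | b for a, b in combinations(frequent[-1], 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep candidates meeting the minimum support threshold.
        frequent.append({c for c in candidates
                         if sum(c <= g for g in database) >= minsup})
    return [s for level in frequent for s in level]

db = [frozenset({"ab", "bc", "cd"}), frozenset({"ab", "bc"}), frozenset({"ab"})]
print(frequent_edge_sets(db, 2))  # {"ab"}, {"bc"}, and {"ab","bc"} are frequent
```

Even in this stripped-down form the candidate explosion is visible: every pair of frequent size-k sets is a potential join, which is exactly the cost that pattern-growth methods such as gSpan avoid.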
2.4 SUB-GRAPH FEATURE SELECTION
To find discriminative sub-graph features for graph classification, common methods first discover a complete set of sub-graph patterns using a minimum support threshold or other parameters such as a sub-graph size limit [13], and then use a feature selection criterion (such as Information Gain [12, 36]) to select a small set of discriminative sub-graphs as features. Such a two-step sub-graph feature selection approach is considered ineffective, and several methods have been proposed to directly generate a compact set of discriminative sub-graphs during the frequent sub-graph mining process. Kudo et al. [43, 59] propose a boosting based method, where a boosting algorithm repeatedly constructs multiple weak classifiers on weighted training graphs and each weak classifier is, in fact, a single sub-graph. Yan et al. propose to mine significant sub-graphs by using a biased search which exploits the correlation between structural similarity and significance similarity [81]. Ranu and Singh propose a scalable method, GraphSig, to mine significant sub-graphs based on a feature vector representation of graphs [56]. GraphSig uses domain knowledge to select
a meaningful feature set, and prior probabilities of features are used to evaluate the significance of sub-graphs in the feature space. This strategy can use existing frequent sub-graph mining techniques to mine significant patterns in a scalable manner. Saigo et al. propose an iterative sub-graph mining method based on partial least squares regression, named gPLS [58]. This method is efficient because the weight vector is updated with elementary matrix calculations. To support real-time graph queries, some researchers index a small set of sub-graphs to enable scalable querying of graph databases [57, 67, 83, 84]. Sun et al. introduce a graph search method for sub-graph queries based on sub-graph frequencies. In addition, evolutionary computation has also been introduced in discriminative sub-graph mining (GAIA) [35].

For the above sub-graph feature selection methods, the graph data are required to be fully labelled, whereas labeling graph data may incur expensive costs. An alternative solution is to combine both labeled and unlabeled graphs for sub-graph feature selection. Kong et al. integrate active learning theory into feature and sample selection for graph classification [41]. They maximize the dependency between sub-graph patterns and graph labels by using an active learning framework and search for the optimal graph to query for a label. In addition, a feature evaluation criterion, gSemi, has been derived to estimate the significance of sub-graphs based on both labeled and unlabeled graphs [42], combining semi-supervised feature selection and sub-graph feature mining. For large graphs, a technique called D-walks (Discriminative random walks) has been developed to tackle semi-supervised classification [9]. The class of unlabeled data can be predicted by maximizing the betweenness score with labeled data.
For applications without negative graph samples, PU learning for graph classification is developed to focus on selecting useful sub-graph features based on positive and unlabeled graph data only [87].
2.5 DATA STREAM MINING
Our problem is also related to data stream mining. An early study of data stream mining was presented in [21], in which incremental learning was used to tackle the increasing volume of the data stream. Ensemble learning-based methods [69, 77] have also
been used to address concept drift and data distribution changes in data streams. All these frameworks only work for generic data with instance-feature representations, and there are no features immediately available for graph streams for learning and classification. Several studies have investigated the graph stream classification problem. To the best of our knowledge, only a few works [1, 15, 46] are directly related to our problem. These methods apply hashing techniques to sketch the graph stream, to save computational cost and control the size of the sub-graph pattern set. In [1], Aggarwal proposes a 2-D random edge hashing scheme to construct an “in-memory” summary for sequentially presented graphs and uses a simple heuristic to select a set of the most discriminative frequent patterns for building a rule-based classifier. Although this method has demonstrated promising performance on graph stream classification, it has two inherent limitations: (1) the selected sub-graphs may contain disconnected edges, which can have less discriminative capability than connected sub-graph patterns because of their deficient structural meaning; and (2) a frequent pattern mining process must be performed on a summary table that comprises massive transactions, resulting in high computational cost. In [15], a clique hashing strategy is used to improve computational efficiency for the graph stream task, but the authors still use a traditional coarse-grained graph representation, which causes severe information loss in the mapping process and further decreases the classification accuracy. In [46], the authors propose a hash kernel to project arbitrary graphs onto a compatible feature space for similarity computation, but this technique can only be applied to node-attributed graphs.
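A minimal sketch of the 2-D random edge hashing idea, loosely following the description of [1] above: each edge of each incoming graph is hashed to a cell of a fixed-size count table, so the summary size is independent of the number of distinct edges seen. The class name and parameters here are our own invention for illustration; the real system additionally mines discriminative patterns from the summary.

```python
import random

class EdgeHashSketch:
    """Fixed-size 'in-memory' summary of a graph stream via 2-D edge hashing.
    Cell counts can over-estimate true edge frequencies due to collisions."""

    def __init__(self, width=8, seed=42):
        self.width = width
        self.table = [[0] * width for _ in range(width)]
        # Random salt so each sketch uses an independent hash function.
        self.salt = random.Random(seed).randrange(1 << 30)

    def _h(self, x):
        return hash((x, self.salt)) % self.width

    def add_graph(self, edges):
        """Fold one stream graph, given as (u, v) edge pairs, into the table."""
        for u, v in edges:
            self.table[self._h(u)][self._h(v)] += 1

    def estimate(self, u, v):
        """Estimated stream frequency of edge (u, v); never under-estimates."""
        return self.table[self._h(u)][self._h(v)]

sketch = EdgeHashSketch()
sketch.add_graph([("a", "b"), ("b", "c")])
sketch.add_graph([("a", "b")])
print(sketch.estimate("a", "b"))  # >= 2 (exact unless a collision occurred)
```

This illustrates the appeal of the approach for streams (constant memory, one pass) as well as its weakness noted above: the summary stores edge counts, not connected sub-graph structure.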
2.6 REAL-TIME ANALYSIS
As mentioned above, real-time analysis aims to capture the changes at each time period, which can be used to improve mining performance. Zhu and Shasha have proposed techniques to compute statistical measures over time series data streams [90]. The proposed techniques use the discrete Fourier transform. The system, called StatStream, is able to compute approximate, error-bounded correlations and inner products over an arbitrarily chosen sliding window. Lin et al. have proposed the use of a symbolic representation of time series data streams [48]. This representation allows dimensionality/numerosity reduction. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection. Chen et al. have proposed the application of so-called regression cubes for data streams [11]. Due to the success of OLAP technology for static stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that can be used for answering aggregate queries over the incoming streams. This research has been extended and adopted in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. have presented and analyzed randomized variations of segmenting time series data streams generated onboard mobile phone sensors [29]. One discussed application of clustering time series is changing the user interface of the mobile phone screen according to the user context. This study shows that Global Iterative Replacement provides an approximately optimal solution with high running-time efficiency.
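To illustrate the sliding-window statistics that systems such as StatStream maintain, the following sketch computes an exact Pearson correlation over the most recent window of two streams. StatStream itself approximates this with DFT coefficients so that thousands of stream pairs can be monitored at once; the class here is purely illustrative and assumes a window with at least two points of nonzero variance.

```python
from collections import deque
from math import sqrt

class SlidingCorrelation:
    """Pearson correlation of two streams over the last `window` points."""

    def __init__(self, window):
        self.x = deque(maxlen=window)  # oldest points fall off automatically
        self.y = deque(maxlen=window)

    def update(self, xv, yv):
        self.x.append(xv)
        self.y.append(yv)

    def correlation(self):
        n = len(self.x)
        mx, my = sum(self.x) / n, sum(self.y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(self.x, self.y))
        vx = sum((a - mx) ** 2 for a in self.x)
        vy = sum((b - my) ** 2 for b in self.y)
        return cov / sqrt(vx * vy)

s = SlidingCorrelation(window=3)
for xv, yv in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    s.update(xv, yv)
print(round(s.correlation(), 6))  # 1.0: the two streams move together
```

Recomputing the sums from scratch costs O(window) per query; streaming systems instead maintain running sums (or DFT sketches) so each update is O(1), which is the essence of the real-time requirement.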
2.7 ROADMAP
The overall roadmap of this thesis is given in Fig.2-2.
Figure 2-2: The overall roadmap of this thesis.
Chapter 3
Understanding the Roles of Sub-graph Features for Graph Classification: An Empirical Study Perspective
3.1 INTRODUCTION
The advancement of data acquisition and analysis technology has resulted in many applications involving complex structure data. Examples include cheminformatics, bioinformatics, and social network analysis. Different from traditional data that are represented as feature vectors, these new data are often represented as graphs to denote the content of the data entries and their structural relationships. This has resulted in an increasing interest in graph classification, which tries to learn a classification model from a number of labeled graphs to separate previously unseen graphs into different categories. This problem, in general, can be divided into two broad branches: (1) classification of nodes (or edges) in a single large graph, as in social networks; and (2) classification of a set of small graphs, as in drug activity prediction. This experimental study focuses on the latter, i.e., the classification of small graphs.
Although graph representation has some inherent advantages, such as the number of nodes and edges can vary to best capture the complex relationship between objects, the
Figure 3-1: An example of sub-graph pattern representation. Left panel shows two graphs, G1 and G2 and right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.
disadvantage is also obvious: the lack of learning algorithms able to support graph structure data. Finding effective representations for graph data is therefore a major challenge for graph classification. One possible solution is to define graph kernels [37] that use graph structures, such as topology, paths, and edge labels of the graphs, to calculate the distance between a pair of graphs, and then use the distance to formulate a generic learning algorithm. The second major solution is to use substructure patterns (i.e., sub-graphs) as features to represent each graph as a feature vector, so that a training graph set can be converted into a generic training instance set for learning. Fig. 3-1 shows a sub-graph pattern representation example. More specifically, sub-graphs g1, g2, and g3 are used to convert graphs G1 and G2 into feature vectors. Depending on whether g1, g2, and g3 appear in each respective graph, say G1, a binary feature vector X1 = [1, 1, 0, ...]^T is created to represent G1. Sub-graph based approaches are useful in many graph related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into vector space is that it makes existing algorithmic tools developed for feature based object representations available for graph structure data.
In this chapter, we mainly focus on sub-graph pattern mining based graph classifica- tion. Frequent Sub-graph Mining (FSM) has been an emerging sub-graph mining problem,
40 mainly because sub-graph patterns are meaningful tokens to effectively map graphs into
vector space for clustering or classification. Given a graph dataset, D = {G0,G1, ..., Gn}, assume Supg denotes the number of graphs (in D) which include g as a sub-graph, the ob- jective of FSM is to find all frequent sub-graphs whose number of occurrence is above the
specified threshold minsup (s.t. Supg ≥ minsup) from the given graph dataset D. Existing FSM methods can be roughly divided into two categories: (i) Apriori-based approaches, and (ii) pattern growth-based approaches. The Apriori-based approaches carry out pattern mining in a generate-and-test manner by using a Breadth First Search (BFS) strategy to explore the sub-graph lattice of the given database. Three well established Apriori-based FSM algorithms include AGM, FSG, and DPMine [33, 22, 45, 75]. Pattern growth-based approaches adopt a Depth First Search (DFS) strategy where, for each discovered sub-graph g, the sub-graph is extended recursively until all frequent super-graphs of g are discovered. FSM algorithms that adopt a DFS strategy tend to need less memory because they traverse the lattice of all possible frequent sub-graphs in a DFS manner. Two well-known example algorithms are gSpan and FFSM [32, 82].
Given a set of frequent sub-graphs g1, g2, ..., gd, a graph Gx can then be represented as a feature vector X = [x1, x2, ..., xd]^T, where xi = 1 if gi ⊆ Gx; otherwise, xi = 0 [81]. Existing research has demonstrated that such vectorization helps to build efficient indices to support fast graph search [83]. In addition, this representation can also help to preserve some basic structural information for graph data analysis.
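The vectorization above can be sketched as follows, with sub-graph containment simplified to labeled-edge-set inclusion; a real implementation would use sub-graph isomorphism testing, and all graphs and features here are toy examples.

```python
def to_feature_vector(graph_edges, subgraph_features):
    """Map a graph to the binary vector X = [x1, ..., xd]^T, with x_i = 1
    iff feature sub-graph g_i is contained in the graph. Containment is
    simplified to labeled-edge-set inclusion for illustration."""
    return [1 if gi <= graph_edges else 0 for gi in subgraph_features]

# Three mined sub-graph features (as labeled edge sets) and two graphs.
features = [frozenset({"ab"}), frozenset({"ab", "bc"}), frozenset({"cd"})]
G1 = frozenset({"ab", "bc"})
G2 = frozenset({"ab", "cd"})
print(to_feature_vector(G1, features))  # [1, 1, 0]
print(to_feature_vector(G2, features))  # [1, 0, 1]
```

Once every graph is such a fixed-length vector, any generic classifier (SVM, nearest neighbour, decision trees) can be trained on the resulting instance set, which is exactly the bridge this chapter studies.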
Several reasons motivate the proposed empirical studies on frequent sub-graphs for graph classification. Firstly, the mining of sub-graphs involves graph matching operations that are computationally demanding. Indeed, just testing whether a graph is a sub-graph of another graph is an NP-complete problem [74]. In Fig. 3-2, we report the sub-graph mining runtime with respect to an increasing number of edges of the sub-graphs for a given graph set. The results show that it is very time-consuming to match patterns in graph datasets, especially for sub-graphs with a large number of edges. Secondly, we are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy is affected by different properties of the selected sub-graphs, such as the number of nodes, the number of edges, and the size of
the selected sub-graph set.

Figure 3-2: The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.

To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs). Experiments show that the objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores, whereas using highly correlated sub-graphs for classification should be avoided because it introduces redundant features with high dependency that may deteriorate the classification accuracy. This fact can help to select better sub-graphs to represent graph datasets. Thirdly, since most graph classification methods are computationally expensive, we wonder whether there is an inexpensive way to achieve graph classification. More specifically, if we use random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to approaches that use sophisticated sub-graph mining algorithms?
In this chapter, we present an empirical comparison of graph classification based on different sets of sub-graph features, including frequent sub-graphs, random sub-graphs, and sub-graphs selected using information gain measure. The comparisons are carried out by using different number of sub-graphs, and sub-graphs with different edges. Experiments are performed on seven real-world graph datasets from three domains, by using two popular learning algorithms including support vector machines and nearest neighbour. The studies
show that using random sub-graph features achieves reasonably good performance compared to its expensive peers which use frequent sub-graphs. Experiments also show that sub-graphs with many edges are not necessarily good candidates for graph classification. The remainder of the chapter is organized as follows. Section 3.2 formulates the problem and defines some important notations. Experimental studies are reported in Section 3.3.
3.2 PROBLEM FORMULATION
In this section, we introduce several basic concepts related to graph representation, frequent sub-graph mining, and graph classification.
3.2.1 Graph and Sub-graph
DEFINITION 5 Graph: LV and LE are finite label sets for nodes and edges, respectively. A graph g = (V, E, μ, ν), where:
• V denotes a finite set of nodes,
• E ⊆ V × V is the set of edges,
• μ : V → LV is the node labeling function, and
• ν : E → LE is the edge labeling function.
The number of nodes of a graph g is denoted by |g|, while D represents the set of all
graphs over the label alphabets LV and LE.
DEFINITION 6 Sub-graph: Let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. Graph g1 is a sub-graph of g2, denoted by g1 ⊆ g2, if
• V1 ⊆ V2,
• E1 ⊆ E2,
• μ1(v)=μ2(v) for all v ∈ V1, and
• ν1(e) = ν2(e) for all e ∈ E1.
An example is shown in Fig. 3-3, where the label of each node is color coded (the same color means the same label). Graph (b) is a sub-graph of (a).
Figure 3-3: A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).
3.2.2 Frequent Sub-graph Mining
DEFINITION 7 Support: Given a graph database D, the support of a sub-graph g, denoted by Sup_g, is the fraction of the graphs in D of which g is a sub-graph, formally:

Sup_g = |{g' ∈ D | g ⊆ g'}| / |D|.

Given a user-specified minimum support threshold minsup and a graph database D, a frequent sub-graph is a sub-graph whose support is at least minsup (i.e., Sup_g ≥ minsup), and the frequent sub-graph mining problem is to find all frequent sub-graphs in D.
DEFINITION 8 Graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2), a graph isomorphism is a bijective function f : V1 → V2 satisfying

• μ1(v) = μ2(f(v)) for all nodes v ∈ V1,

• for each edge e1 = (v1, v2) ∈ E1, there exists an edge e2 = (f(v1), f(v2)) ∈ E2 such that ν1(e1) = ν2(e2), and

• for each edge e2 = (v1, v2) ∈ E2, there exists an edge e1 = (f^{-1}(v1), f^{-1}(v2)) ∈ E1 such that ν1(e1) = ν2(e2).

Two graphs are called isomorphic if there exists a graph isomorphism between them.
Figure 3-4: An example of graph isomorphism.
DEFINITION 9 Sub-graph Isomorphism: Given two graphs g1 =(V1,E1,μ1,ν1) and g2 =(V2,E2,μ2,ν2), an injective function f : V1 → V2 from g1 to g2 is a sub-graph isomorphism if there exists a sub-graph g ⊆ g2 such that f is a graph isomorphism between
g1 and g.
3.2.3 Graph Classification
DEFINITION 10 Graph Embedding: Let D be a set of graphs. A graph embedding is a function ϕ : D→Rn mapping graphs to n-dimensional vectors, i.e.,