Real-Time Analytics for Complex Structure Data

Ting Guo

A Thesis submitted for the degree of

Faculty of Engineering and Information Technology
University of Technology, Sydney

2015

Certificate of Authorship and Originality

I certify that the work in this thesis has not previously been submitted for a degree, nor has it been submitted as part of the requirements for a degree, except as fully acknowledged within the text.

I also certify that the thesis has been written by me. Any help that I have received in my research work and the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Signature of Student:

Date:

Acknowledgments

On completing this thesis, I am especially thankful to my supervisor Prof. Chengqi Zhang and co-supervisor Prof. Xingquan Zhu, who led me into a once unfamiliar area of academic research, trusted me, and gave me as much freedom as possible to pursue my own research interests. Prof. Zhu taught me how to think and study independently and how to solve difficult scientific problems in flexible but rigorous ways. He sacrificed much of his precious time to develop my academic research skills. When I felt lost and terrified about my future, he always gave me the confidence and motivation to keep going and strive to get better. Prof. Zhang has also given me great help and support in life.

I am thankful to the group members I met at the University of Technology, Sydney, including Shirui Pan, Lianhua Chi, Jia Wu, and many others. I learned a lot from these smart people, and I was always inspired by the interesting and in-depth discussions with them. I enjoyed the wonderful atmosphere, in both academic research and daily life, of being with them.

I am incredibly grateful to my mother and father for their generosity and encouragement. This thesis could not have been completed without their constant support and understanding. I am also thankful to my friends who have accompanied me, though not always at my side, through the arduous journey of these three years.

Contents

1 Introduction
  1.1 MOTIVATION
  1.2 CONTRIBUTIONS
  1.3 PUBLICATIONS
  1.4 THESIS STRUCTURE

2 Literature Review
  2.1 PRELIMINARY
  2.2 GRAPH CLASSIFICATION
  2.3 FREQUENT SUB-GRAPH MINING (FSM)
  2.4 SUB-GRAPH FEATURE SELECTION
  2.5 DATA STREAM MINING
  2.6 REAL-TIME ANALYSIS
  2.7 ROADMAP

3 Understanding the Roles of Sub-graph Features for Graph Classification: An Empirical Study Perspective
  3.1 INTRODUCTION
  3.2 PROBLEM FORMULATION
    3.2.1 Graph and Sub-graph
    3.2.2 Frequent Sub-graph Mining
    3.2.3 Graph Classification
  3.3 EXPERIMENTAL STUDY
    3.3.1 Benchmark Data
    3.3.2 Sub-graph Features
    3.3.3 Experimental Settings
    3.3.4 Results and Analysis

4 Graph Hashing and Factorization for Fast Graph Stream Classification
  4.1 INTRODUCTION
  4.2 PROBLEM DEFINITION
  4.3 GRAPH FACTORIZATION
    4.3.1 Factorization Model
    4.3.2 Learning
  4.4 FAST GRAPH STREAM CLASSIFICATION
    4.4.1 Overall Framework
    4.4.2 Graph Clique Mining
    4.4.3 Clique Set Matrix and Graph Factorization
      Discriminative Frequent Cliques
      Feature Mapping
    4.4.4 Graph Stream Classification
  4.5 EXPERIMENTS
    4.5.1 Benchmark Data
    4.5.2 Experimental Settings
    4.5.3 Experimental Results
      Graph Stream Classification Accuracy
      Graph Stream Classification Efficiency

5 Super-graph based Classification
  5.1 INTRODUCTION
  5.2 PROBLEM DEFINITION
  5.3 OVERALL FRAMEWORK
  5.4 WEIGHTED RANDOM WALK KERNEL
    5.4.1 Kernel on Single-attribute Graphs
    5.4.2 Kernel on Super-Graphs
  5.5 SUPER-GRAPH CLASSIFICATION
  5.6 THEORETICAL STUDY
  5.7 EXPERIMENTS AND ANALYSIS
    5.7.1 Benchmark Data
    5.7.2 Experimental Settings
    5.7.3 Results and Analysis

6 Streaming Network Node Classification
  6.1 INTRODUCTION
  6.2 PROBLEM DEFINITION AND FRAMEWORK
  6.3 THE PROPOSED METHOD
    6.3.1 Streaming Network Feature Selection
      Feature Selection on a Static Network
      Feature Selection on Streaming Networks
    6.3.2 Node Classification on Streaming Networks
  6.4 EXPERIMENTS
    6.4.1 Experimental Settings
    6.4.2 Performance on Static Networks
    6.4.3 Performance on Streaming Networks
    6.4.4 Case Study

7 Conclusion
  7.1 SUMMARY OF THIS THESIS
  7.2 FUTURE WORK

List of Figures

2-1 Graph examples collected from different domains.
2-2 The overall roadmap of this thesis.

3-1 An example of sub-graph pattern representation. The left panel shows two graphs, G1 and G2, and the right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.
3-2 The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.
3-3 A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).
3-4 An example of graph isomorphism.
3-5 Graph representation for a paper (ID17890) in DBLP. The node in red is the main paper, nodes in black ellipses are citations, and nodes in black boxes are keywords.
3-6 Classification accuracy on five NCI chemical compound datasets with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
3-7 Classification accuracy on the D&D protein dataset and the DBLP citation dataset with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
3-8 Classification accuracy on one NCI chemical compound dataset, the D&D protein dataset, and the DBLP citation dataset with respect to different sizes of sub-graph features (using Nearest Neighbours: NN).

4-1 Coarse-grained vs. fine-grained representation.
4-2 An example of graph factorization.
4-3 The framework of FGSC for graph stream classification.
4-4 An example of clique mining in a compressed graph.
4-5 An example of an "in-memory" Clique-class table Γ.
4-6 Graph representation for a paper (ID17890) in DBLP.
4-7 Accuracy w.r.t. different chunk sizes on the DBLP Stream. The number of features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800; (c) 600.
4-8 Accuracy w.r.t. different numbers of features on the DBLP Stream with each chunk containing 1000 graphs. The number of features selected in each chunk is: (a) 307; (b) 142; (c) 62.
4-9 Accuracy w.r.t. different classification methods on the DBLP Stream with each chunk containing 1000 graphs and 142 features per chunk. The classification methods selected here are: (a) NN; (b) SMO; (c) NaiveBayes.
4-10 Accuracy w.r.t. different chunk sizes on the IBM Stream. The number of features in each chunk is 75. The batch sizes vary from (a) 500; (b) 400; to (c) 300.
4-11 Accuracy w.r.t. different numbers of features on the IBM Stream with each chunk containing 400 graphs. The number of features selected in each chunk is: (a) 148; (b) 75; (c) 43.
4-12 Accuracy w.r.t. different classification methods on the IBM Stream with each chunk containing 400 graphs and 75 features per chunk. The classification methods include: (a) NN; (b) SMO; (c) NaiveBayes.
4-13 System accumulated runtime using the NN classifier, where |D| = 1000, |m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively. (a) Results on the DBLP stream; (b) results on the IBM stream.

5-1 (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph.
5-2 A conceptual view of a protein interaction network using super-graph representation.
5-3 WRWK on the super-graphs (G, G') and the single-attribute graphs (g1, g2, g3).
5-4 An example of using super-graph representation for scientific publications.
5-5 Super-graph and comparison graph representations.
5-6 Classification accuracy on the DBLP and Beer Review datasets w.r.t. different classification methods (NB, DT, SVM, and NN).
5-7 Classification accuracy on the Beer Review dataset w.r.t. different datasets and classification methods (NB, DT, SVM, and NN).
5-8 The performance w.r.t. different edge-cutting thresholds on the DBLP and Beer Review datasets using the WRWK method.

6-1 An example of streaming networks, where each color bar denotes a feature.
6-2 An example of using feature selection to capture changes in a streaming network (keywords inside each node denote node content).
6-3 The framework of the proposed streaming network node classification (SNOC) method.
6-4 An example of using feature selection to capture structure similarity.
6-5 The accuracies on three real-world static networks w.r.t. different numbers of selected features (from 50 to 300).
6-6 The accuracy on three networks w.r.t. (a) different maximal lengths of path l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and (c) different percentages of labeled nodes.
6-7 The accuracy on streaming networks: (a) accuracy on the DBLP citation network from 1991 to 2010, (b) accuracy on the PubMed Diabetes network for 15 time points, and (c) accuracy on the extended DBLP citation network from 1991 to 2010.
6-8 The cumulative runtime on the DBLP and PubMed Diabetes networks corresponding to Fig. 6-5.
6-9 Case study on the DBLP citation network.

List of Tables

3.1 A comparison of the advantages and disadvantages of vector representation vs. graph representation
3.2 NCI datasets used in experiments
3.3 DBLP dataset used in experiments
3.4 Number of sub-graphs with respect to different sizes (i.e. number of edges)

4.1 DBLP dataset used in experiments

6.1 Accuracy Results on Static Networks

Abstract

The advancement of data acquisition and analysis technology has resulted in many real-world data being dynamic and containing rich content and structured information. More specifically, with the fast development of information technology, many current real-world data are constantly subject to dynamic changes, such as new instances, new nodes and edges, and modifications to the node content. Different from traditional data, which are represented as feature vectors, data with complex relationships are often represented as graphs to denote the content of the data entries and their structural relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. In addition, real-time availability is one of the outstanding features of today's data. Real-time analytics is dynamic analysis and reporting based on data entered into a system before the actual time of use; it emphasizes deriving immediate knowledge from dynamic data sources, such as data streams, so knowledge discovery and pattern mining now face complex, dynamic data sources. However, how to combine structure information and node content information for accurate and real-time data mining is still a big challenge. Accordingly, this thesis focuses on real-time analytics for complex structure data. We explore instance correlation in complex structure data and utilize it to make mining tasks more accurate and applicable. To be specific, our objective is to combine node correlation with node content and utilize them for three different tasks, including (1) graph stream classification, (2) super-graph classification and clustering, and (3) streaming network node classification.

Understanding the roles of structured patterns for graph classification: the thesis first reviews existing work on data mining from a complex structure perspective. We then propose a graph factorization-based fine-grained representation model, where the main objective is to use linear combinations of a set of discriminative cliques to represent graphs for learning. The optimization-oriented factorization approach ensures minimum information loss for graph representation, and also avoids the expensive sub-graph isomorphism validation process. Based on this idea, we propose a novel framework for fast graph stream classification.

A new structure data classification algorithm: the second method introduces a new super-graph classification and clustering problem. Due to the inherent complex structure representation, existing graph classification methods cannot be applied to super-graph classification. In this thesis, we propose a weighted random walk kernel which calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key contributions are: (1) a new super-node and super-graph structure to enrich existing graph representations for real-world applications; (2) a weighted random walk kernel considering node and structure similarities between graphs; (3) a mixed similarity considering structured content inside super-nodes and structural dependency between super-nodes; and (4) an effective kernel-based super-graph classification method with a sound theoretical basis. Empirical studies show that the proposed methods significantly outperform the state-of-the-art methods.

Real-time analytics framework for dynamic complex structure data: for streaming networks, the essential challenge is to properly capture the dynamic evolution of the node content and node interactions in order to support node classification. While streaming networks are dynamically evolving, for a short temporal period a subset of salient features is essentially tied to the network content and structures, and can therefore be used to characterize the network for classification. To achieve this goal, we propose to carry out streaming network feature selection (SNF) from the network, and to use the selected features as a gauge to classify unlabeled nodes. A Laplacian based quality criterion is proposed to guide the node classification, where the Laplacian matrix is generated based on node labels and network topology structures. Node classification is achieved by finding the class label that results in the minimal gauging value with respect to the selected features. By frequently updating the features selected from the network, node classification can quickly adapt to changes in the network for maximal performance gain. Experiments and comparisons on real-world networks demonstrate that SNOC is able to capture dynamics in network structures and node content, and outperforms baseline approaches with significant performance gain.

Chapter 1

Introduction

1.1 MOTIVATION

The advancement of data acquisition and analysis technology has resulted in many applications involving complex structure data; examples include Cheminformatics [65], Bioinformatics [53], and Social Network Analysis (e.g. DBLP) [3]. Different from traditional data, which are represented as feature vectors, data with structure relationships are often represented as graphs to preserve the content of the data entries and their relationships, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. For example, each node in a social network can denote one person, and links between nodes can represent their social interactions.

In reality, changes are essential components of real-world structure data, mainly because user participation, interactions, and responses to external factors continuously introduce new nodes and edges to the data. In addition, a user may add/delete/modify online posts, which naturally results in changes to the node content. As a result, the data are inherently dynamic. These dynamic changes may significantly influence the mining results. For example, changes in users' fields of interest may result in changes to their social groups or friend circles. Real-time analytics on structure data could therefore help us understand and capture such changes and improve mining performance.

Graph classification concerns the learning of a discriminative classifier, from training data containing structure information, to classify previously unseen graph samples into specific categories, where the main challenge is to explore the structure information in the training data to build classifiers. One of the most common graph classification approaches is to use sub-graph features to convert graphs into instance-feature representations, so that generic learning can be applied for classification. Finding good sub-graph features is therefore an important task for this type of learning approach. Common heuristics are to find high-frequency sub-graphs or to apply feature selection measures, such as information gain or the Gini index, to refine frequent sub-graphs. While all these methods have been shown to be effective in the literature, due to the inherent complexity of graph data, the process of finding sub-graphs is a non-trivial task and is usually the most computationally expensive procedure in the graph classification process. In addition, the genuine discriminative power of sub-graph features has never been comprehensively studied. All these observations raise the following concerns:

• The mining of sub-graphs involves graph matching operations, which are computationally demanding. Indeed, just testing whether a graph is a sub-graph of another graph is an NP-complete problem [74]. It is very time-consuming to match patterns in graph datasets, especially for sub-graphs with a large number of edges.

• We are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy is affected by different properties of the selected sub-graphs, such as different numbers of nodes, different numbers of edges, and the size of the selected sub-graph set. To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs); a small illustrative computation is sketched after this list. Experiments show that the objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores, whereas using highly correlated sub-graphs for classification should be avoided because it introduces redundant, highly dependent features which may deteriorate the classification accuracy. This fact can help to select better sub-graphs to represent graph datasets.

• Most existing graph classification methods are computationally expensive, so we wonder whether there is any inexpensive way to achieve graph classification. More specifically, if we use some random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to approaches that use sophisticated sub-graph mining algorithms?
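To make the information gain criterion concrete, the following minimal sketch scores binary sub-graph indicator features by the reduction in label entropy they provide. The feature matrix and labels are made-up toy data, not results from the thesis:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """IG of a binary sub-graph indicator feature with respect to class labels."""
    h = entropy(labels)
    for value in (0, 1):
        mask = feature == value
        if mask.any():
            h -= mask.mean() * entropy(labels[mask])
    return h

# Toy data: rows = graphs, columns = candidate sub-graph features (1 = sub-graph occurs).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
print(scores)  # feature 0 perfectly separates the two classes, so its IG is 1.0
```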

Motivated by the above concerns, we carry out empirical studies on the roles of sub-graphs for graph classification. We study four real-world graph classification tasks, using three types of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs, and using two types of learning algorithms, Support Vector Machines and Nearest Neighbour. Our experiments show that (1) the discriminative power of sub-graphs varies by their sizes; (2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs is important to ensure good performance; (4) increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs; and (5) performance with respect to different classifiers shows a similar trend. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.

From the empirical studies, we observe that, unlike traditional instance-feature representations, graphs do not have explicit features, so an essential step in building a graph classification model is to explore graph features to represent graph data in an instance-feature space for effective learning [57]. The majority of existing graph classification models employ an occurrence-based feature representation model, in which a set of sub-graph features is selected to represent the graph data by using the occurrence of each sub-graph in the graph (either 0/1 occurrence or the actual number of occurrences) as the feature value. In this thesis, we refer to such an occurrence-based representation model as a coarse-grained representation model.

Existing sub-graph based graph classification methods (coarse-grained representation models) have a number of disadvantages, especially for graph stream classification, including the following. Computation burden for isomorphism validation: because sub-graphs are selected as features to represent the graph, the expensive sub-graph mining and isomorphism validation process cannot be avoided (sub-graph isomorphism is proven to be NP-hard) [19]. After the sub-graph features have been selected, the isomorphism validation process has to be applied again to map the graphs into the vector space. Moreover, because sub-graphs are selected as features to represent the graph and general learning methods prefer low feature dimensions, a coarse-grained feature representation model has to limit the number of sub-graphs used to represent the graph data. Severe and unbounded information loss: because the coarse-grained representation does not characterize the degree of closeness between a sub-graph pattern and the graph, it suffers information loss in representing the graph. Furthermore, for a graph stream with continuously growing volumes (i.e. an increasing number of graphs) and changing structures (e.g. new nodes may appear in the incoming graphs), the disadvantage of the existing coarse-grained graph feature model is even greater. This is because (1) stream volumes continuously increase, making it computationally expensive to explore sub-graph features, and (2) the changes in the graph stream (such as new structures) make the sub-graph features incapable of representing the new graphs. A fine-grained graph feature model is therefore needed to improve the classification performance.
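As a hypothetical illustration of the fine-grained idea developed in the next paragraph (and in Chapter 4), the sketch below represents a graph's edge-indicator vector as a least-squares linear combination of clique indicator vectors. The clique basis and the graph are made-up toy data; this is not the actual FGSC algorithm:

```python
import numpy as np

# Toy universe of 6 possible edges; each column of B is the edge-indicator
# vector of one frequent clique (the "basis" cliques).
B = np.array([[1, 0, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Edge-indicator vector of one incoming graph in the stream.
g = np.array([1, 1, 0, 1, 1, 0], dtype=float)

# Fine-grained representation: coefficients w minimizing ||B w - g||^2.
w, *_ = np.linalg.lstsq(B, g, rcond=None)
print("clique weights:", np.round(w, 3))
print("reconstruction error:", np.linalg.norm(B @ w - g))
```

The weight vector w (one weight per clique) then serves as the instance-feature representation of the graph, and the reconstruction error quantifies the information loss, which a coarse-grained 0/1 occurrence model leaves uncontrolled.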

To build a fine-grained graph feature model, our main idea is to bypass the expensive sub-graph mining process and use linear combinations of graph cliques to represent graphs. The linear combination ensures that the information loss of the transformation from graphs to vectors is minimal. Compared to the traditional coarse-grained representation model, the advantage of the fine-grained representation is clear: it enjoys a rigorous theoretical foundation that ensures minimal information loss for graph representation, and it also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard. To achieve this goal, our algorithm relies on two important steps to extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as the base; and (2) using graph factorization to calculate a linear combination of graph cliques that best represents a graph. After applying our fine-grained feature mapping model to represent graphs, we can use any learning algorithm, such as Nearest Neighbour or Support Vector Machines, for graph stream classification. Our method offers a number of advantages, including a fast feature mining process, more precise graph representation, and better performance for graph stream classification. Experiments on two real-world graph datasets demonstrate that our method outperforms state-of-the-art approaches in both classification accuracy and runtime efficiency.

Even though graph representation indeed preserves structure information to improve the classification accuracy, all existing frameworks that use graphs to represent objects rely on two approaches to describe node content: (1) node as a single attribute: each node has only one attribute (a single-attribute node). A clear drawback of this representation is that a single attribute cannot precisely describe the node content [40]. This representation is commonly referred to as a single-attribute graph (Fig. 5-1 (A)). (2) node as a set of attributes: a set of independent attributes is used to describe the node content (Fig. 5-1 (B)). This representation is commonly referred to as an attributed graph [8, 14, 80]. However, with the development of information technology, data are becoming more and more complex. Indeed, in many applications, the attributes/properties used to describe the node content may themselves be subject to dependency structures. For example, in a citation network each node represents one paper and edges denote citation relationships. It is insufficient to use one or multiple independent attributes to describe the detailed information of a paper. Instead, we can represent the content of each paper as a graph with nodes denoting keywords and edges representing contextual correlations between keywords (e.g. co-occurrence of keywords in different sentences or paragraphs). As a result, a paper (a super-node) and all the references it cites can form a super-graph, with each edge between papers denoting their citation relationship. To build learning models for super-graphs, the main challenge is to properly calculate the distance between two super-graphs.

• Similarity between two super-nodes: because each super-node is a graph, the overlapped/intersected graph structure between two super-nodes reveals the similarity between the two super-nodes, as well as the relationship between the two super-graphs. The traditional hard node matching mechanism is unsuitable for super-graphs, which require soft node matching (a toy illustration follows this list).

• Similarity between two super-graphs: the complex structure of a super-graph requires that the similarity measure considers not only the structure similarity, but also the super-node similarity between two super-graphs. This cannot be achieved without combining node matching and graph matching as a whole to assess the similarity between super-graphs.
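To illustrate what soft node matching between super-nodes might look like, the following hypothetical sketch scores the similarity of two super-nodes by the overlap of the keyword graphs stored inside them, using a simple Jaccard overlap of nodes and edges. The real WRWK similarity developed in Chapter 5 is more involved; this is only an intuition-building example:

```python
def supernode_similarity(content_a, content_b):
    """Soft match between two super-nodes, each given as (nodes, edges)
    of its internal content graph. Returns a score in [0, 1]."""
    nodes_a, edges_a = content_a
    nodes_b, edges_b = content_b

    def jaccard(s, t):
        return len(s & t) / len(s | t) if (s | t) else 0.0

    # Average the node overlap and the edge overlap of the two content graphs.
    return 0.5 * (jaccard(nodes_a, nodes_b) + jaccard(edges_a, edges_b))

# Two papers represented as keyword graphs (nodes = keywords,
# edges = co-occurring keyword pairs).
paper_1 = ({"graph", "stream", "kernel"},
           {("graph", "stream"), ("graph", "kernel")})
paper_2 = ({"graph", "kernel", "svm"},
           {("graph", "kernel"), ("kernel", "svm")})
print(supernode_similarity(paper_1, paper_2))  # partial overlap -> score between 0 and 1
```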

The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK) for super-graphs. In this thesis, we generate a new product graph from two super-graphs and then use weighted random walks on the product graph to calculate the similarity between the super-graphs. A weighted random walk denotes a walk starting from a random weighted node and following succeeding weighted nodes and edges in a random manner. The weight of a node in the product graph denotes the similarity of the two corresponding super-nodes. Given a set of labeled super-graphs, we can use the weighted product graph to establish a walk-based relationship between two super-graphs and calculate their similarity. After that, we can obtain the kernel matrix for super-graph classification.

Recent years have also witnessed an increasing number of applications involving networked data, where instances are not only characterized by their content but are also subject to dependency relationships. The mixed node content and structure information raises many unique data mining tasks, such as network node classification [3]. In reality, changes are essential components of real-world networks, mainly because user participation, interactions, and responses to external factors continuously introduce new nodes and edges to the network. In addition, a user may add/delete/modify online posts, which naturally results in changes to the node content. As a result, the networks are inherently dynamic. Accurate node classification in a streaming network setting is therefore much more challenging than in static networks. In summary, node classification in streaming networks has at least three major challenges:

• Streaming network structures: Network structures encode rich information about node interactions inside the network, which should be considered for node classification. In streaming networks, structures are constantly changing, so node classification needs to rapidly capture and adapt to such changes for maximal accuracy gain.

• Streaming node features: For each node in a streaming network, its content may constantly evolve (e.g. through user posts or profile updates). As a result, the feature space used to denote the node content is dynamically changing, resulting in streaming features [78] with an infinite feature space. To capture changes, a feature selection method should select the most effective features in a timely manner to ensure that node classification can quickly adapt to the new network.

• Unlimited Network Node Space: The node volume of a streaming network is dynamically increasing, resulting in an unlimited network node space in which new nodes keep arriving that have never appeared in the network before. Node classification needs to scale to the dynamically increasing node volume and incrementally update models discovered from historical data to accurately classify new nodes.

To achieve high node classification accuracy, a fundamental issue is to properly characterize such changes. One possible solution is to use features to capture changes in streaming networks for node classification. For streaming networks, changes are introduced through two major channels: (1) node content; and (2) topology structures. Because, in a networked world, nodes close to each other in the network structure space tend to share common content information [26], we can use selected features to design a "similarity gauging" procedure that assesses the consistency of the network node content and structures to determine the labels of unlabeled nodes. A smaller gauging value indicates that the node content and structures have a better alignment with the node label. The gauging-based classification is therefore carried out such that, for an unlabeled node, its label is the class that results in the minimal gauging value with respect to the identified features. By updating the selected features, the node classification can automatically adapt to the changes in the streaming network for maximal accuracy gain. Accordingly, this thesis proposes a node classification method for networked data, and then a streaming feature selection approach for dynamically changing networks.

In summary, taking complex structure information into account helps improve the mining performance on structure data. Moreover, exploiting changes in a dynamic setting is a promising way to further improve mining accuracy. In this thesis, we investigate different types of complex structure data and propose several methods to solve the above challenges.
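As a rough illustration of the gauging idea described above (assumptions: a made-up adjacency matrix and a label-only smoothness score; the actual SNOC criterion also incorporates the selected features), the sketch below assigns to an unlabeled node the label whose resulting assignment is smoothest over the network, measured by the Laplacian quadratic form f^T L f:

```python
import numpy as np

# Toy undirected network: nodes 0-3, node 3 is unlabeled.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
labels = {0: 0, 1: 0, 2: 1}             # known labels for nodes 0-2
candidates = [0, 1]

def gauge(unlabeled_node, candidate_label):
    """Smoothness (f^T L f) of the labeling if the node takes this label."""
    f = np.zeros(A.shape[0])
    for node, lab in labels.items():
        f[node] = lab
    f[unlabeled_node] = candidate_label
    return f @ L @ f

scores = {c: gauge(3, c) for c in candidates}
print(scores)                            # smaller gauging value = better alignment
print("predicted label:", min(scores, key=scores.get))
```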


1.2 CONTRIBUTIONS

This thesis focuses on exploring and utilizing structure information to solve several important problems in data mining. We list our contributions to each of these problems below:

• Exploring the roles of sub-graph features for graph classification: we carry out empirical studies on four real-world graph classification tasks, using three types of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs, and using two types of learning algorithms, Support Vector Machines and Nearest Neighbour. Our experiments show that (1) the discriminative power of sub-graphs varies by their sizes; (2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs is important to ensure good performance; (4) increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs; and (5) performance with respect to different classifiers shows a similar trend. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.

• Exploring clique information and using a factorization method for fast graph stream classification: we propose a fine-grained graph factorization framework for efficient graph stream classification. Being fine-grained, our mapping framework relies on a set of discriminative frequent cliques, instead of important sub-graph patterns, to represent graphs. Such a fine-grained representation ensures that the final instance-feature representation is sufficiently close to the original graph, with a theoretical guarantee. To solve the problem, our algorithm relies on two important steps to extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as the base; and (2) using graph factorization to calculate a linear combination of the graph cliques that best represents a graph. Compared to the traditional coarse-grained representation model, a clear advantage of our method is that the fine-grained graph representation enjoys a rigorous theoretical foundation that ensures minimal information loss for graph representation, and also avoids the expensive sub-graph isomorphism validation process.

• Exploring inner-structure information for super-graph classification: in this thesis, we introduce a special type of graph, in which the content of a node can itself be represented as a graph, as a "super-graph". Likewise, we refer to a node whose content is represented as a graph as a "super-node". To build learning models for super-graphs, the main challenge is to properly calculate the distance between two super-graphs.

(1) Similarity between two super-nodes: because each super-node is a graph, the overlapped/intersected graph structure between two super-nodes reveals the similarity between the two super-nodes, as well as the relationship between the two super-graphs. The traditional hard node matching mechanism is unsuitable for super-graphs, which require soft node matching.

(2) Similarity between two super-graphs: the complex structure of a super-graph requires that the similarity measure considers not only the structure similarity, but also the node similarity between two super-graphs. This cannot be achieved without combining node matching and graph matching as a whole to assess the similarity between super-graphs.

The above challenges motivate the proposed Weighted Random Walk Kernel for super-graphs. We propose a weighted random walk kernel which calculates the similarity between two super-graphs by assessing (a) the similarity between super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key contribution is twofold: (1) a weighted random walk kernel considering node and structure similarities between graphs; and (2) an effective kernel-based super-graph classification method with a sound theoretical basis.

• Exploring manifold information for streaming network node classification: we propose a novel node classification method for streaming networks. Our method takes the network structure and node labels into consideration to find an optimal subset of features to represent the network. Based on the selected features, a streaming network node classification method, SNOC, is proposed to classify unlabeled nodes through the minimization of the similarity distance in the network and the feature-based distance between nodes. The main contribution, compared to existing works, is twofold:

(1) Streaming Network Node Classification: We propose a new streaming network node classification (SNOC) method that takes node content and structure similarity into consideration to find important features to model changes in the network for node classification. This method is not only more accurate than existing node classification approaches, but is also effective in capturing changes in networks for node classification.

(2) Streaming Network Feature Selection: We introduce a novel streaming network feature selection framework, SNF, for streaming networks. To ensure that feature evaluation can adapt to changes in the network in a timely manner, SNF incrementally updates the evaluation score of an existing feature by accumulating changes in the network. This allows our method to effectively handle streaming networks with changing feature spaces and feature distributions for better runtime and performance gain.

1.3 PUBLICATIONS

• Ting Guo, Zhanshan Li, and Xingquan Zhu. Large Scale Diagnosis Using Associations between System Outputs and Components. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp. 1786-1787.

• Ting Guo and Xingquan Zhu. Understanding the roles of sub-graph features for graph classification: an empirical study perspective. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 817-822.

• Ting Guo, Lianhua Chi, and Xingquan Zhu. Graph hashing and factorization for fast graph stream classification. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 1607-1612.

• Ting Guo and Xingquan Zhu. Super-Graph Classification. Proceedings of Advances in Knowledge Discovery and Data Mining, 2014, pp. 323-336.

• Ting Guo, Xingquan Zhu, Jian Pei and Chengqi Zhang. SNOC: Streaming Network Node Classification. Proceedings of the 14th IEEE International Conference on Data Mining, 2014.

1.4 THESIS STRUCTURE

The rest of the thesis is organized as follows:

Chapter 2: This chapter surveys existing works on mining structure data. It summarizes the major approaches in the field, along with their technical strengths and weaknesses.

Chapter 3: In this chapter, we carry out empirical studies on four real-world graph classification tasks, using three types of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by information gain, and random sub-graphs. The two types of learning algorithms are Support Vector Machines and Nearest Neighbour. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.

Chapter 4: We propose a fine-grained graph factorization approach for Fast Graph Stream Classification (FGSC). A graph clique mining procedure is given and the factorization algorithm is built on the clique mining result. Experiments demonstrate that the proposed method outperforms state-of-the-art approaches in both classification accuracy and efficiency.

Chapter 5: In this chapter, we formulate a new super-graph classification task where each node of a super-graph may contain a graph. To support super-graph classification, we propose a Weighted Random Walk Kernel (WRWK) with sound theoretical properties, including bounded similarity. Experiments confirm that our method significantly outperforms baseline approaches.

Chapter 6: This chapter introduces a new classification method for streaming networks, namely streaming network node classification (SNOC). It provides algorithm details, theoretical proofs, time complexity analysis, and experiments.

Chapter 7: This chapter concludes the thesis and outlines directions for future work.

Chapter 2

Literature Review

This literature review provides an in-depth study of existing data mining methods for complex structure data. Our main objectives are to (1) summarize and categorize graph classification methods, graph clustering methods, and their extensions to streaming structure data; and (2) compare and analyze the strengths and deficiencies of existing approaches. First, we introduce some basic definitions.

2.1 PRELIMINARY

DEFINITION 1 Graph: A graph G is a set of vertices (nodes) v connected by edges (links) e. Thus G = (v, e).

DEFINITION 2 Vertex (Node): A node v is a terminal point or an intersection point of a graph. It is the abstraction of a location such as a city, an administrative division, a road intersection or a transport terminal (stations, terminuses, harbors and airports).

DEFINITION 3 Edge (Link): An edge e is a link between two nodes. The link (i, j) is of initial extremity i and of terminal extremity j. A link is the abstraction of a transport infrastructure supporting movements between nodes. It has a direction that is commonly represented as an arrow. When an arrow is not used, it is assumed the link is bi-directional.

DEFINITION 4 Sub-Graph: A sub-graph is a subset of a graph G, where p is the number of sub-graphs. For instance, G' = (v', e') can be a distinct sub-graph of G. Unless the global transport system is considered as a whole, every transport network is in theory a sub-graph of another. For instance, the road transportation network of a city is a sub-graph of a regional transportation network, which is itself a sub-graph of a national transportation network.
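Purely as an illustration of Definitions 1-4, the snippet below encodes a small graph as a set of nodes and edges and checks whether another graph is one of its sub-graphs. The names and data are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graph:
    nodes: frozenset      # vertex set v
    edges: frozenset      # link set e, each link stored as an (i, j) pair

    def is_subgraph_of(self, other: "Graph") -> bool:
        """True if this graph's vertices and links are subsets of the other's."""
        return self.nodes <= other.nodes and self.edges <= other.edges

city_network = Graph(frozenset({"A", "B", "C", "D"}),
                     frozenset({("A", "B"), ("B", "C"), ("C", "D")}))
district = Graph(frozenset({"B", "C"}), frozenset({("B", "C")}))
print(district.is_subgraph_of(city_network))  # True: the district network is a sub-graph
```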

2.2 GRAPH CLASSIFICATION

Given a collection of training samples, each of which is suitably labeled with a class label, the classification task is to build a learning model that automatically assigns a previously unseen example to a specific category (or class). For data with dependency structures, such as the ones shown in Figure 2-1, building a learning model to classify them into their respective categories has proved to be a challenging task. This is mainly attributed to the fact that graph data normally do not have features immediately available to support learning, whereas nearly all existing supervised learning methods require training data to be represented in a tabular instance-feature format.

In order to support graph classification, one of the fundamental challenges is to characterize graphs so that they can be represented in tabular feature formats and generic supervised learning algorithms can be used to derive learning models from graphs. The three most popular approaches are (1) sub-graph feature based methods; (2) global structure based methods, such as graph edit distance and graph embedding; and (3) graph kernel methods.

Sub-graph feature based methods use substructure patterns (sub-graphs) as features to represent each graph as a feature vector, so that a training graph set can be converted into a generic training instance set for learning. Sub-graph based approaches are useful in many graph related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into vector space is that it makes existing algorithmic tools developed for feature based object representations available for graph structure data.

Footnote 1: In Fig. 2-1, the upper ones are the original representations and the bottom ones are the corresponding transferred graphs. (a) Protein data (each node denotes a local amino acid region with special secondary structures and each edge denotes the nearest neighbour in space); (b) a graph collected from chemical compound data (the 3D structure can be transferred into a graph by using molecules connected with chemical bonds); (c) a graph collected from the DBLP citation dataset, where each node denotes a paper ID or a keyword, and each edge denotes either the citation relationship between papers or a keyword appearing in the paper's title. The label of the graph (y) can be used for training a learning model for graph classification.

Figure 2-1: Graph examples collected from different domains.

For sub-graph feature based methods, one key challenge is how to select discriminative sub-graphs for mapping graphs into the vector space, which can help improve the classification accuracy. To achieve this goal, several objective functions are used to test the discriminative power of attributes (sub-graphs), such as frequency, Information Gain (IG) [36], Fisher Score [23] and Laplace Score [42]. Because sub-graph features can help convert each single graph into a vector representation, graph classification can then be achieved by using popular learning algorithms such as Decision Trees [36] and Support Vector Machines (SVM) [16].
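A minimal sketch of this pipeline, assuming the graphs have already been converted into binary sub-graph indicator vectors (the feature matrix and labels below are made up), trains a standard SVM on the resulting tabular representation:

```python
import numpy as np
from sklearn.svm import SVC

# Rows = graphs, columns = selected sub-graph features (1 = the sub-graph occurs).
X_train = np.array([[1, 0, 1, 0],
                    [1, 1, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 1]])
y_train = np.array([1, 1, 0, 0])          # class label of each training graph

clf = SVC(kernel="linear")                 # any generic learner works on this representation
clf.fit(X_train, y_train)

X_test = np.array([[1, 0, 0, 0]])          # indicator vector of an unseen graph
print(clf.predict(X_test))
```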

Global structure based methods apply inexact graph matching, in which error correction is made part of the matching process. Central to this approach is the measurement of the similarity of pairwise graphs, which can be measured in many ways. One measure, which has garnered particular interest because it is error-tolerant to noise and distortion, is the graph edit distance (GED), defined as the cost of the least expensive sequence of edit operations needed to transform one graph into another [24]. GED algorithms are influenced considerably by the cost functions associated with edit operations. The GED between pairwise graphs changes with the choice of cost functions, and its validity depends on the rationality of the cost function definition [6][17].

Graph kernel methods use kernels to calculate the distance between graphs. Various graph kernels, such as subtree kernels [63], shortest-path kernels [5], joint kernels [30] and cyclic pattern kernels [31], have been proposed. All these methods have been shown to be effective, but are subject to high computational overhead. Geng Li et al. propose an alternative approach based on feature vectors constructed from different global topological attributes [47]. A novel graph representation method is also proposed by using a dissimilarity space embedding graph kernel based on the graph edit distance [38], where the edit distance can be calculated by the Lipschitz embedding method [39]. Experiments show that the classification accuracy can be statistically significantly enhanced by using graph edit distance with prototype reduction and dimensionality reduction [7]. Vogelstein et al. develop a set of signal-subgraph estimators for graph classification [76]; these estimators can be considered local sparse and low-rank matrix decompositions.
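To make the random-walk family of kernels concrete, the following hedged sketch computes a simple geometric random walk kernel between two small unlabeled graphs via their direct product graph, k(G1, G2) = sum of all entries of (I - λ A_x)^-1. This is a textbook formulation for illustration, not the exact kernels cited above:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.1):
    """Geometric random walk kernel via the direct (tensor) product graph."""
    Ax = np.kron(A1, A2)                      # adjacency of the product graph
    n = Ax.shape[0]
    # Sum over walks of all lengths, weighted by lam^k: sum the entries of (I - lam*Ax)^-1.
    M = np.linalg.inv(np.eye(n) - lam * Ax)
    return M.sum()

A1 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)       # a triangle
A2 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]], dtype=float)       # a path on three nodes
print(random_walk_kernel(A1, A2))
```

The decay factor lam must be small enough for the walk series to converge; the weighted random walk kernel of Chapter 5 extends this construction with node-similarity weights on the product graph.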

2.3 FREQUENT SUB-GRAPH MINING (FSM)

Sub-graph patterns have proved to be a good representation for closing the gap between structural and statistical pattern classification. Selecting features, in the form of sub-graph patterns, from graph data is a well established area in graph data mining, where early methods often use Apriori-based Graph Mining (AGM) to identify frequent induced sub-graphs [33]. Evaluation of AGM on chemical carcinogenesis data demonstrated that it is more efficient than an inductive logic programming based approach combined with a level-wise search. Meanwhile, Kuramochi and Karypis developed the FSG method [45], which uses a Breadth First Search (BFS) strategy to grow candidates, whereby pairs of identified frequent size-k sub-graphs are joined to generate size-(k+1) candidates. FSG uses a canonical labeling method for graph comparison and calculates the support of the patterns using a vertical transaction list data representation. Experiments show that FSG is inefficient when graphs contain many vertices and edges with identical labels, because the join operation used by FSG allows multiple automorphisms of single or multiple cores. In summary, AGM and FSG are both Apriori-based frequent sub-graph mining algorithms.

In Apriori-based frequent sub-graph mining algorithms, it is time-consuming and computationally ineffective to join two size-k frequent sub-graphs to generate size-(k+1) graph candidates [28]. To improve the efficiency, several pattern-growth-based graph mining approaches have been proposed. FFSM [32] attempts to extend graphs from a single sub-graph directly: for each graph G that has been discovered, FFSM recursively adds new edges until all frequent super-graphs of G are discovered, and the recursion stops when no more frequent graphs can be generated. Because a pattern-growth algorithm extends a frequent graph pattern by adding a new edge with every possible label, there is a potential risk that the same graph pattern may be generated more than once [28]. The gSpan algorithm [82] solves this problem by using a right-most extension technique. It uses a Depth First Search (DFS) lexicographic ordering to construct a tree-like lattice over all possible graph patterns. The search tree is traversed in a DFS manner and all graph patterns with non-minimal DFS codes are pruned so that redundant candidate generation is avoided. gSpan is arguably the most frequently cited FSM algorithm.

2.4 SUB-GRAPH FEATURE SELECTION

To find discriminative sub-graph features for graph classification, common methods first discover a complete set of sub-graph patterns using a minimum support threshold or other parameters such as a sub-graph size limit [13], and then use a feature selection criterion (such as Information Gain [12, 36]) to select a small set of discriminative sub-graphs as features. Such a two-step sub-graph feature selection approach is considered ineffective, and several methods have been proposed to directly generate a compact set of discriminative sub-graphs during the frequent sub-graph mining process. Kudo et al. [43, 59] propose a boosting based method, in which a boosting algorithm repeatedly constructs multiple weak classifiers on weighted training graphs and each weak classifier is, in fact, a single sub-graph. Yan et al. propose to mine significant sub-graphs by using a biased search which exploits the correlation between structural similarity and significance similarity [81]. Ranu and Singh propose a scalable method, GraphSig, to mine significant sub-graphs based on a feature vector representation of graphs [56]. GraphSig uses domain knowledge to select a meaningful feature set, and prior probabilities of features are used to evaluate the significance of sub-graphs in the feature space. This strategy can use existing frequent sub-graph mining techniques to mine significant patterns in a scalable manner. Saigo et al. propose an iterative sub-graph mining method based on partial least squares regression, named gPLS [58]. This method is efficient because the weight vector is updated with elementary matrix calculations. To support real-time graph queries, some researchers index a small set of sub-graphs to enable scalable querying of graph databases [57, 67, 83, 84]. Sun et al. introduce a graph search method for sub-graph queries based on sub-graph frequencies. In addition, evolutionary computation has also been introduced in discriminative sub-graph mining (GAIA) [35].

For the above sub-graph feature selection methods, the graph data are required to be fully labelled, whereas labeling graph data may incur expensive costs. An alternative solution is to combine both labeled and unlabeled graphs for sub-graph feature selection. Kong et al. integrate active learning into feature and sample selection for graph classification [41]: they maximize the dependency between sub-graph patterns and graph labels using an active learning framework and search for the optimal graph to query for a label. In addition, a feature evaluation criterion, gSemi, has been derived to estimate the significance of sub-graphs based on both labeled and unlabeled graphs [42], by combining semi-supervised feature selection and sub-graph feature mining. For large graphs, a technique called D-walks (Discriminative random walks) has been developed to tackle semi-supervised classification [9]; the class of unlabeled data can be predicted by maximizing the betweenness score with respect to the labeled data. For applications without negative graph samples, PU learning for graph classification has been developed, focusing on selecting useful sub-graph features based on positive and unlabeled graph data only [87].
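As an illustration of the two-step selection idea (mine a candidate set, then keep the top-k by a discriminative score), the sketch below ranks binary sub-graph features by a simple Fisher score. The data are synthetic and the scoring function is a generic choice, not the exact criterion of any method cited above:

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score of each binary sub-graph feature (higher = more discriminative)."""
    scores = np.zeros(X.shape[1])
    overall_mean = X.mean(axis=0)
    for j in range(X.shape[1]):
        num, den = 0.0, 0.0
        for c in np.unique(y):
            Xc = X[y == c, j]
            num += len(Xc) * (Xc.mean() - overall_mean[j]) ** 2
            den += len(Xc) * Xc.var()
        scores[j] = num / (den + 1e-12)   # small epsilon avoids division by zero
    return scores

X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])
scores = fisher_score(X, y)
top_k = np.argsort(scores)[::-1][:2]      # keep the 2 highest-scoring sub-graph features
print(scores, top_k)
```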

2.5 DATA STREAM MINING

Our problem is also related to data stream mining. An early study of data stream mining was presented in [21], in which incremental learning was used to tackle the increasing volume of the data stream. Ensemble learning-based methods [69, 77] have also been used to address concept drift and data distribution changes in data streams. All these frameworks only work for generic data with instance-feature representations, and no features are immediately available for graph streams to support learning and classification.

Several studies have investigated the graph stream classification problem. To the best of our knowledge, only a few works [1, 15, 46] are directly related to our problem. These methods apply hashing techniques to sketch the graph stream in order to save computational cost and control the size of the sub-graph pattern set. In [1], Aggarwal proposes a 2-D random edge hashing scheme to construct an "in-memory" summary for sequentially presented graphs and uses a simple heuristic to select a set of the most discriminative frequent patterns for building a rule-based classifier. Although this method has demonstrated promising performance on graph stream classification, it has two inherent limitations: (1) the selected sub-graphs contain disconnected edges, which may have less discriminative capability than connected sub-graph patterns because they lack structural meaning; and (2) a frequent pattern mining process has to be performed on a summary table that comprises massive transactions, resulting in high computational cost. In [15], a clique hashing strategy is used to improve the computational efficiency of graph stream classification, but the authors still use the traditional coarse-grained graph representation, which causes severe information loss in the mapping process and further decreases the classification accuracy. In [46], the authors propose a hash kernel to project arbitrary graphs onto a compatible feature space for similarity computation, but this technique can only be applied to node-attributed graphs.
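For intuition, the following hypothetical sketch processes a stream in fixed-size chunks and maintains a small ensemble of per-chunk classifiers whose majority vote labels the next chunk. It is a generic chunk-based ensemble scheme in the spirit of [69, 77], assuming the graphs in each chunk have already been mapped to feature vectors:

```python
import numpy as np
from collections import deque
from sklearn.tree import DecisionTreeClassifier

ENSEMBLE_SIZE = 3
ensemble = deque(maxlen=ENSEMBLE_SIZE)          # keep only the most recent base classifiers

def predict(X):
    """Majority vote of the current ensemble."""
    votes = np.array([clf.predict(X) for clf in ensemble])
    return (votes.mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(0)
for chunk_id in range(5):                       # simulated stream of feature-vector chunks
    X = rng.integers(0, 2, size=(100, 20))
    y = (X[:, 0] ^ (chunk_id % 2)).astype(int)  # the concept drifts from chunk to chunk

    if ensemble:                                # test-then-train: evaluate before updating
        acc = (predict(X) == y).mean()
        print(f"chunk {chunk_id}: accuracy {acc:.2f}")

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    ensemble.append(clf)                        # oldest model is dropped automatically
```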

2.6 REAL-TIME ANALYSIS

As mentioned above, real-time analysis aims to capture the changes at each time period and use them to improve mining performance. Zhu and Shasha propose techniques to compute statistical measures over time series data streams [90]. The proposed techniques use the discrete Fourier transform. Their system, called StatStream, is able to compute approximate, error-bounded correlations and inner products, and works over an arbitrarily chosen sliding window. Lin et al. propose a symbolic representation of time series data streams [48]. This representation allows dimensionality/numerosity reduction, and its applicability has been demonstrated by applying it to clustering, classification, indexing and anomaly detection. Chen et al. propose the application of so-called regression cubes to data streams [11]. Due to the success of OLAP technology for static stored data, multidimensional regression analysis is used to create a compact cube that can answer aggregate queries over the incoming streams. This research has been extended and adopted in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. present and analyze randomized variations of segmenting time series data streams generated onboard mobile phone sensors [29]. One application of clustering time series discussed there is changing the user interface of the mobile phone screen according to the user context. The study shows that Global Iterative Replacement provides an approximately optimal solution with high runtime efficiency.
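As a simplified illustration of the sliding-window statistics such systems maintain (this is a naive exact computation; StatStream's contribution is approximating it cheaply with a few DFT coefficients), the sketch below reports the correlation of two streams over the latest window as new values arrive:

```python
from collections import deque
import numpy as np

WINDOW = 16
stream_a, stream_b = deque(maxlen=WINDOW), deque(maxlen=WINDOW)

def update(a, b):
    """Append the newest readings and return the correlation over the current window."""
    stream_a.append(a)
    stream_b.append(b)
    if len(stream_a) < WINDOW:
        return None                        # window not yet full
    return float(np.corrcoef(stream_a, stream_b)[0, 1])

rng = np.random.default_rng(1)
for t in range(40):
    x = np.sin(t / 3.0) + 0.1 * rng.standard_normal()
    y = np.sin(t / 3.0 + 0.2) + 0.1 * rng.standard_normal()
    corr = update(x, y)
    if corr is not None and t % 8 == 0:
        print(f"t={t}: windowed correlation {corr:.2f}")
```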

2.7 ROADMAP

The overall roadmap of this thesis is given in Fig. 2-2.

Figure 2-2: The overall roadmap of this thesis.

Chapter 3

Understanding the Roles of Sub-graph Features for Graph Classification: An Empirical Study Perspective

3.1 INTRODUCTION

The advancement of data acquisition and analysis technology has resulted in many applications involving complex structure data. Examples include Cheminformatics, Bioinformatics, and Social Network Analysis. Different from traditional data that are represented as feature vectors, the new data are often represented as graphs to denote the content of the data entries and their structural relationships. This has resulted in an increasing interest in graph classification, which tries to learn a classification model from a number of labeled graphs in order to separate previously unseen graphs into different categories. This problem, in general, can be divided into two large branches: (1) classification of nodes (or edges) in a single large graph, as in social networks; and (2) classification of a set of small graphs, as in drug activity prediction. This experimental study focuses on the latter, i.e. the classification of small graphs.

Although graph representation has some inherent advantages, for example the number of nodes and edges can vary to best capture the complex relationships between objects, the


Figure 3-1: An example of sub-graph pattern representation. The left panel shows two graphs, G1 and G2, and the right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.

disadvantage is also obvious: the lack of learning algorithms that are able to support graph structure data. Finding effective representations for graph data is therefore a major challenge for graph classification. One possible solution is to define graph kernels [37] that use graph structures, such as topology, paths, and edge labels, to calculate the distance between a pair of graphs, and then use the distance to formulate a generic learning algorithm. The second major solution is to use substructure patterns (i.e. sub-graphs) as features to represent each graph as a feature vector, so that a training graph set can be converted into a generic training instance set for learning. Fig. 3-1 shows a sub-graph pattern representation example. More specifically, sub-graphs g1, g2, and g3 are used to convert graphs G1 and G2 into feature vectors. Depending on whether g1, g2, and g3 appear in each respective graph, say G1, a binary feature vector X1 = [1, 1, 0, ...]^T is created to represent G1. Sub-graph-based approaches are useful in many graph-related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into vector space is that it makes existing algorithmic tools developed for feature-based object representations available for graph structure data.

In this chapter, we mainly focus on sub-graph pattern mining based graph classification. Frequent Sub-graph Mining (FSM) is an important sub-graph mining problem, mainly because sub-graph patterns are meaningful tokens to effectively map graphs into a vector space for clustering or classification. Given a graph dataset D = {G0, G1, ..., Gn}, and letting Sup_g denote the number of graphs in D which include g as a sub-graph, the objective of FSM is to find all sub-graphs whose number of occurrences is above a specified threshold minsup (i.e. Sup_g ≥ minsup) in the given graph dataset D. Existing FSM methods can be roughly divided into two categories: (i) Apriori-based approaches, and (ii) pattern growth-based approaches. The Apriori-based approaches carry out pattern mining in a generate-and-test manner by using a Breadth First Search (BFS) strategy to explore the sub-graph lattice of the given database. Three well-established Apriori-based FSM algorithms are AGM, FSG, and DPMine [33, 22, 45, 75]. Pattern growth-based approaches adopt a Depth First Search (DFS) strategy in which each discovered sub-graph g is extended recursively until all frequent super-graphs of g are discovered. FSM algorithms that adopt a DFS strategy tend to need less memory because they traverse the lattice of all possible frequent sub-graphs in a DFS manner. Two well-known example algorithms are gSpan and FFSM [32, 82].

Given a set of frequent sub-graphs g1, g2, ..., gd, a graph Gx can then be represented as a feature vector X = [x1, x2, ..., xd]^T, where xi = 1 if gi ⊆ Gx; otherwise, xi = 0 [81]. Existing research has demonstrated that such vectorization helps to build efficient indices to support fast graph search [83]. In addition, this representation can also help to preserve some basic structural information for graph data analysis.
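To make this mapping concrete, the following is a minimal Python sketch of the 0/1 sub-graph feature representation. It assumes node-labelled networkx graphs with a "label" attribute and uses networkx's VF2 matcher as a stand-in for the sub-graph test; the helper names are illustrative and not part of the thesis.

```python
# A hedged sketch (not the thesis implementation) of mapping graphs to binary
# sub-graph feature vectors X = [x_1, ..., x_d]^T with x_i = 1 iff g_i is in G_x.
import networkx as nx
from networkx.algorithms import isomorphism


def contains_subgraph(G, g):
    # VF2 checks (node-induced) sub-graph isomorphism with matching labels;
    # this approximates the sub-graph containment used in this chapter.
    matcher = isomorphism.GraphMatcher(
        G, g, node_match=isomorphism.categorical_node_match("label", None))
    return matcher.subgraph_is_isomorphic()


def to_feature_vectors(graphs, subgraph_features):
    # One binary row per graph, one column per selected sub-graph feature.
    return [[1 if contains_subgraph(G, g) else 0 for g in subgraph_features]
            for G in graphs]
```

The resulting rows can be fed directly to any vector-space learner, which is exactly the convenience that motivates this representation.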

Several reasons motivate the proposed empirical studies on frequent sub-graphs for graph classification. Firstly, the mining of sub-graphs involves graph matching operations that are computationally demanding. Indeed, just testing whether a graph is a sub-graph of another graph is an NP-complete problem [74]. In Fig. 3-2, we report the sub-graph mining runtime with respect to an increasing number of edges of the sub-graphs for a given graph set. The results show that it is very time-consuming to match patterns in graph datasets, especially for sub-graphs with a large number of edges. Secondly, we are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy can be affected by different properties of the selected sub-graphs, such as the number of nodes, the number of edges, and the size of the selected sub-graph set.

Figure 3-2: The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.

To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs). Experiments show that the objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores, whereas using sub-graphs with high correlations for classification should be avoided, because it introduces redundant features with high dependency that may deteriorate the classification accuracy. This fact can help to select better sub-graphs to represent graph datasets. Thirdly, most existing graph classification methods are computationally expensive, so we wonder whether there is an inexpensive way to achieve graph classification. More specifically, if we use some random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to approaches that use sophisticated sub-graph mining algorithms?

In this chapter, we present an empirical comparison of graph classification based on different sets of sub-graph features, including frequent sub-graphs, random sub-graphs, and sub-graphs selected using the information gain measure. The comparisons are carried out by using different numbers of sub-graphs and sub-graphs with different numbers of edges. Experiments are performed on seven real-world graph datasets from three domains, by using two popular learning algorithms, support vector machines and nearest neighbour. The studies show that using random sub-graph features achieves reasonably good performance compared to its expensive peers that use frequent sub-graphs. Experiments also show that sub-graphs with many edges are not necessarily good candidates for graph classification. The remainder of the chapter is organized as follows. Section 3.2 formulates the problem and defines some important notations. Experimental studies are reported in Section 3.3.

3.2 PROBLEM FORMULATION

In this section, we introduce several basic concepts related to graph representation, frequent sub-graph mining, and graph classification.

3.2.1 Graph and Sub-graph

DEFINITION 5 Graph: Let LV and LE be finite label sets for nodes and edges, respectively. A graph is a four-tuple g = (V, E, μ, ν), where:

• V denotes a finite set of nodes,

• E ⊆ V × V is the set of edges,

• μ : V → LV is the node labeling function, and

• ν : E → LE is the edge labeling function.

The number of nodes of a graph g is denoted by |g|, while D represents the set of all

graphs over the label alphabets LV and LE.

DEFINITION 6 Sub-graph: Let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. Graph g1 is a sub-graph of g2, denoted by g1 ⊆ g2, if

• V1 ⊆ V2,

• E1 ⊆ E2,

• μ1(v)=μ2(v) for all v ∈ V1, and

• ν1(e) = ν2(e) for all e ∈ E1.

An example is shown in Fig. 3-3, where the label of each node is color coded (the same color means the same label). Graph (b) is a sub-graph of (a).

 

Figure 3-3: A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).

3.2.2 Frequent Sub-graph Mining

DEFINITION 7 Support: Given a graph database D, the support of a sub-graph g, denoted by Sup_g, is the fraction of the graphs in D of which g is a sub-graph; formally:

Sup_g = |{G ∈ D | g ⊆ G}| / |D|.

Given a user-specified minimum support threshold minsup and a graph database D, a frequent sub-graph is a sub-graph whose support is at least minsup (i.e. Sup_g ≥ minsup), and the frequent sub-graph mining problem is to find all frequent sub-graphs in D.
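As a small illustration of Definition 7 and the FSM objective, the sketch below computes support over a graph database and filters by minsup. The `contains` predicate (e.g. the VF2-based helper sketched in Section 3.1) and the function names are assumptions made for illustration, not the thesis code.

```python
# A minimal sketch of support and frequent sub-graph filtering (Definition 7).
# `contains(G, g)` is an assumed predicate testing whether g is a sub-graph of G.
def support(g, database, contains):
    """Sup_g = |{G in D : g is a sub-graph of G}| / |D|."""
    return sum(1 for G in database if contains(G, g)) / len(database)


def frequent_subgraphs(candidates, database, minsup, contains):
    """Return every candidate whose support is at least minsup."""
    return [g for g in candidates if support(g, database, contains) >= minsup]
```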

DEFINITION 8 Graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2), a graph isomorphism is a bijective function f : V1 → V2 satisfying

• μ1(v)=μ2(f(v)) for all nodes v ∈ V1,

• for each edge e1 =(v1,v2) ∈ E1, there exists an edge e2 =(f(v1),f(v2)) ∈ E2 such

that ν1(e1)=ν2(e2), and

• for each edge e2 = (v1, v2) ∈ E2, there exists an edge e1 = (f^{-1}(v1), f^{-1}(v2)) ∈ E1 such that ν1(e1) = ν2(e2).

Two graphs are called isomorphic if there exists a graph isomorphism between them.

Figure 3-4: An example of graph isomorphism.

DEFINITION 9 Sub-graph Isomorphism: Given two graphs g1 =(V1,E1,μ1,ν1) and g2 =(V2,E2,μ2,ν2), an injective function f : V1 → V2 from g1 to g2 is a sub-graph isomorphism if there exists a sub-graph g ⊆ g2 such that f is a graph isomorphism between

g1 and g.

3.2.3 Graph Classification

DEFINITION 10 Graph Embedding: Let D be a set of graphs. A graph embedding is a function φ : D → R^n mapping graphs to n-dimensional vectors, i.e.,

φ(g) = (x1, x2, ..., xn).

In Table 3.1, we summarize the advantages and disadvantages of graph representation compared with vector representation. Embedding graphs into vector spaces makes existing algorithmic tools developed for feature-based object representations immediately available for graph classification. After mapping graph data into a vector space, graph classification is almost the same as learning and classifying generic instances represented in the feature space.

3.3 EXPERIMENTAL STUDY

In this section we first discuss the benchmark data and the experimental settings, and then report detailed experimental results and analysis.

Table 3.1: Advantages and disadvantages of vector representation vs. graph representation

3.3.1 Benchmark Data

We carry out our experimental studies on seven real-world graph datasets, which are collected from three different domains: chemical compounds, protein structures, and a citation network.

Chemical Compound: The activity of chemical compound molecules can be predicted from their 3-dimensional structures. In our experiments, we use a series of binary-label graph datasets from the PubChem website1. This website provides real-world datasets on the biological activities of small molecules, containing the bioassay records for anti-cancer screen tests with different cancer cell lines collected from the National Cancer Institute (NCI). Each dataset belongs to a certain type of cancer screen with active or inactive responses (i.e. class labels) [81]. Because each NCI bioassay dataset contains very few active graphs, we use under-sampling to down-sample inactive graphs to form a relatively balanced dataset for performance evaluation. The number of vertices in most of these compounds ranges from 10 to 200. We use 5 graph datasets in the experiments, and the information about these data is reported in Table 3.2.

Protein Structure: Proteins are organic compounds made of amino acid sequences joined together by peptide bonds. A huge number of proteins have been sequenced over the years, and the structures of thousands of proteins have been resolved so far. The best-known role of proteins in the cell is as enzymes that catalyze chemical reactions [38]. The D&D dataset we use contains 1178 protein structures that can be divided into two classes: 691 enzymes

1http://pubchem.ncbi.nlm.nih.gov

Table 3.2: NCI datasets used in experiments

Datasets      Data Size   Vertex Size   Edge Size   # Node Labels   # Edge Labels
NCI33         2934        30.2          32.5        39              3
NCI410        2881        21.5          30.1        42              3
NCI2302       13764       23.9          27.7        38              3
NCI489028     11092       33.5          30.7        42              3
NCI485346     18426       29.5          31.4        44              3

and 487 non-enzymes [20]. Each protein is represented by a graph, in which the nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. These proteins, with an average size of 285 vertices and 716 edges, are larger and more strongly connected than the molecules from the NCI screening.

Table 3.3: DBLP dataset used in experiments

Classes   Descriptions                                                   # Papers   # Graphs
DBDM      SIGMOD, VLDB, ICDE, EDBT, PODS, DASFAA, SSDBM, CIKM,
          DEXA, KDD, ICDM, SDM, PKDD, PAKDD                              20601      9530
CVPR      ICCV, CVPR, ECCV, ICPR, ICIP, ACM Multimedia, ICME             18366      9926

DBLP Citation Network: The DBLP dataset is composed of bibliography data in the field of computer science2. Each record in DBLP is associated with a number of attributes such as paper ID, authors, year, title, abstract, and reference IDs [71]. We build a binary-label graph dataset by using papers published in a list of conferences (as shown in Table 3.3). The classification task is to predict whether a paper belongs to the field of DBDM (databases and data mining) or CVPR (computer vision and pattern recognition), by using the references and title of each paper. In our experiments, each paper in DBLP is represented as a graph, where each node denotes a paper ID or a keyword and each edge denotes a citation relationship between papers or the co-occurrence of keywords in the paper's title. More specifically,

2http://arnetminer.org/citation

we denote that (1) each paper ID is a node; (2) if paper A cites another paper B, there is an edge between A and B; (3) each keyword in the title is also a node; (4) each paper ID node is connected to the keyword nodes of the paper; and (5) for each paper, its keyword nodes are fully connected with each other. An example of DBLP graph data is shown in Figure 3-5, where the rectangles are paper ID nodes and the diamonds are keyword nodes. The paper ID17890 cites (connects to) papers ID17883 and ID18068, and ID17890 has the keywords Patch, Motion, and Invariance in its title. Paper ID18068 has the keywords Edge and Detection, and paper ID17883's title includes the keywords Vision and Edge. For each paper, the keywords in the title are linked with each other.



 

  

  

Figure 3-5: Graph representation for a paper (ID17890) in DBLP. The node in red is the main paper. Nodes in black ellipses are citations, and nodes in black boxes are keywords.

3.3.2 Sub-graph Features

To understand the relationship between sub-graph features and classification accuracy, we carry out sub-graph feature selection by using different sizes of sub-graphs (with respect to the number of edges), ranging from one to nine (sub-graphs with more than nine edges are much less frequent). In Table 3.4, we report the number of sub-graph features with respect to different sizes (i.e. number of edges) in the benchmark datasets. Meanwhile, we also vary the number of selected sub-graph features (i.e. the size of the "selected feature set") to include 50, 1000, and 2000 sub-graphs, respectively, in all experiments. To select sub-graph features, we use frequency and Information Gain (IG) based feature selection criteria, respectively. For the frequency based criterion, we simply select the sub-graphs

48 Table 3.4: Number of sub-graphs with respect to different sizes (i.e. number of edges)

Datasets \ #Edges   1      2      3      4      5      6      7      8      9
NCI33               134    437    1361   >2000  >2000  >2000  >2000  >2000  >2000
NCI410              53     174    1512   >2000  >2000  >2000  >2000  >2000  >2000
NCI2302             140    449    1706   >2000  >2000  >2000  >2000  >2000  >2000
NCI489028           48     417    1298   >2000  >2000  >2000  >2000  >2000  >2000
NCI485346           39     140    562    1172   >2000  >2000  >2000  >2000  >2000
D&D                 190    1356   >2000  >2000  >2000  >2000  >2000  >2000  >2000
DBLP                >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000

with the highest frequency. For the IG based approach, we calculate the Information Gain (IG) between each sub-graph and the class label, and select the ones with the highest IG scores as sub-graph features. It is worth noting that IG is commonly used for sub-graph feature selection, and the most significant patterns are likely to fall into the high quantile of frequency [81].
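The following is a hedged numpy sketch of the IG criterion as used here: for each binary sub-graph occurrence feature, IG is the reduction in class-label entropy, and the highest-scoring columns are kept. The array names (X, y) are illustrative inputs, not data from the thesis.

```python
# A hedged sketch of Information Gain based sub-graph feature selection.
# X is the (graphs x sub-graphs) 0/1 occurrence matrix, y the class labels.
import numpy as np


def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature, y):
    """IG(Y; x) = H(Y) - H(Y | x) for one binary occurrence feature."""
    ig = entropy(y)
    for value in (0, 1):
        mask = feature == value
        if mask.any():
            ig -= mask.mean() * entropy(y[mask])
    return ig


def top_k_by_ig(X, y, k):
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]     # column indices of the k best sub-graphs
```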

In addition to frequency and Information Gain based sub-graph features, we also employ a random feature selection approach. In our experiments, we first collect all one-edge sub-graphs from the training set. After that, multiple-edge sub-graphs are generated by randomly combining one-edge sub-graphs. If a multiple-edge sub-graph appears more than once in the training set, the sub-graph is considered as a possible random sub-graph feature. This process is equivalent to collecting random sub-graphs from the whole structure space. Because our experiments focus on sub-graph features with different numbers of edges, we generate a sub-graph with X edges by using X one-edge sub-graphs selected in a random manner (X = 1, 2, ..., 9). This process is much more efficient and more feasible than generating all possible features from the training set (which is computationally infeasible).
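A minimal sketch of this random generation protocol is given below; it assumes networkx graphs for the one-edge pieces, glosses over how pieces share labelled nodes, and leaves the containment test as an assumed `contains` predicate. All names are illustrative.

```python
# A hedged sketch of random sub-graph feature generation: combine X randomly
# chosen one-edge sub-graphs and keep the candidate if it occurs more than
# once in the training set.
import random
import networkx as nx


def random_x_edge_candidate(one_edge_subgraphs, x):
    """Randomly combine x one-edge sub-graphs into a single candidate pattern."""
    candidate = nx.Graph()
    for piece in random.sample(one_edge_subgraphs, x):
        candidate = nx.compose(candidate, piece)   # graph union; shared nodes merge
    return candidate


def random_subgraph_features(one_edge_subgraphs, train_graphs, x, n_features, contains):
    features = []
    while len(features) < n_features:
        cand = random_x_edge_candidate(one_edge_subgraphs, x)
        # keep the candidate only if it appears more than once in the training set
        if sum(1 for G in train_graphs if contains(G, cand)) > 1:
            features.append(cand)
    return features
```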

Among all the benchmark datasets, DBLP is a special graph dataset. The node space in DBLP is very sparse, because there are thousands of nodes with different labels (as shown in Table 3.3); the nodes in DBLP represent paper IDs and paper keywords. As a result, the most discriminative features in DBLP are one-node sub-graphs. So in our experiments, we also include sub-graphs containing only one node (0-edge features denote one-node sub-graphs).

49 3.3.3 Experimental Settings

In our experiments, we use classification accuracy to measure the performance of the algorithms. Suppose the true label set of the testing data is Lt, and the label set returned by our algorithm is Lr. The accuracy is defined as |Lt ∩ Lr|/|Lr|. We use 10-fold cross-validation to evaluate and compare the algorithms' performance. Each graph dataset is evenly partitioned into 10 parts; one part is used as the test set and the other nine parts are used for sub-graph mining, sub-graph feature selection, and classifier generation. To reduce variability, the reported results are averaged over 10 runs of cross-validation. To train classifiers from graph data, we use Support Vector Machines (the Lib-SVM based MATLAB package3) and the Nearest Neighbor algorithm (NN) [27]. All experiments are conducted on machines with 4GB RAM and Intel Core i5 CPUs at 3.10 GHz.

3.3.4 Results and Analysis

In this section we report the graph classification accuracy with respect to four major factors: (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different learning algorithms; and (4) different benchmark datasets. In Figures 3-6, 3-7, and 3-8, we report the algorithm performance with respect to different sizes of sub-graph features, using Support Vector Machines (Figures 3-6 and 3-7) and Nearest Neighbour (Figure 3-8) as the learning algorithms. In each of these figures, the sub-figures in each row correspond to one benchmark dataset, and the left, middle, and right panels correspond to 50, 1000, and 2000 sub-graph features, respectively. The x-axis of each panel denotes the size (i.e. the number of edges) of the selected sub-graph features, and the y-axis denotes the classification accuracy. Each curve corresponds to the classification accuracies obtained using random sub-graph features, frequency based sub-graph features, or Information Gain based sub-graph features, respectively. The results in Figures 3-6, 3-7, and 3-8 suggest the following major findings.

3http://www.csie.ntu.edu.tw/cjlin/libsvm

The discriminative power of sub-graphs varies with their sizes: The results show that the size of sub-graph features has a significant impact on their discriminative power. For the NCI data, features with 4 to 7 edges give good classification performance. For the D&D protein data, features with 1 to 3 edges are better than others. For the DBLP data, one-node features are the most discriminative. The variance in the discriminative power of sub-graphs is mainly attributed to the domain characteristics of the graph data:

• NCI graphs have very similar structures to each other. Examples include benzene rings, which can be observed in many graphs. Each NCI dataset focuses directly on a special type of compound (or structure), so most NCI data are very similar in structure.

• Although D&D graphs have fewer than 44 node labels, these graphs have very complex structures, and the distribution of the data is more scattered than that of the NCI data, because there are up to thousands of different enzymes in the biosphere. The large structural differences result in fewer graphs sharing the same sub-graphs of relatively


Figure 3-6: Classification accuracy on five NCI chemical compound datasets with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).


Figure 3-7: Classification accuracy on D&D protein dataset and DBLP citation dataset with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).

large size. As a result, the most discriminative features of D&D graphs have 1 to 3 edges on average.

• DBLP is the sparsest graph dataset. There are thousands of unique paper IDs and keywords in DBLP (each paper ID and keyword represents one node). Because there is a very large number of node labels, it is hard to find shared sub-graphs with more than 4 edges between any two DBLP graphs. The most frequent and most discriminative features we found are, in fact, one-node features, and sub-graph features with more than 3 edges may appear only a very few times. As a result, when using features with 3 or more edges to build classifiers, the classification accuracy is about 50% (which is equivalent to random prediction), as shown in Figure 3-7 (d), (e), and (f).

Overall, the experiments suggest that the actual discriminative power of sub-graphs varies, depending on their sizes and the domain of the graph data. Sub-graphs with many edges are not necessarily good candidates for graph classification.


Figure 3-8: Classification accuracy on one NCI chemical compound dataset, D&D protein dataset, and DBLP citation dataset with respect to different sizes of sub-graph features (using Nearest Neighbours: NN).

Random sub-graphs have reasonably good performance: Compared to sub-graphs discovered by expensive sub-graph mining and selection criteria, such as frequency based sub-graphs and information gain based sub-graphs, the classification accuracy of random sub-graphs is reasonably good. Our experiments show that none of the feature selection approaches significantly outperforms all other methods (including random sub-graphs) on all seven benchmark graph datasets. Although the accuracy of the information gain based method is often higher than the others, in most cases the difference in accuracy between random feature selection and frequent feature selection with IG is less than 5%.

Random sub-graphs perform well on graph classification tasks because current graph classification or clustering methods are always based on a binary sub-graph representation combined with traditional vector-based methods. As a result, this so-called coarse-grained representation model cannot well capture the real structure of graphs and causes information loss, and different sub-graph selection methods can hardly make up for the loss caused by the binary representation mechanism. We will present a novel fine-grained representation model to address this issue in the next chapter.

Overall, our experiments suggest that random sub-graphs are useful for graph classification. The classifiers built from random sub-graphs are not significantly inferior to their peers.

The number of sub-graphs is important to ensure good performance: When comparing the three figures in each row (the left, middle, and right panels represent the results for 50, 1000, and 2000 sub-graph features, respectively), it is clear that increasing the number of sub-graph features often results in improved accuracy. For example, when increasing the number of sub-graphs from 50 to 1000 and to 2000, the average classification accuracy of the classifiers trained with the three types of sub-graph features (Random, Frequent, and IG) for 9-edge features on the NCI410 graph dataset increases from 70% to 73% and to 75%. Similar trends can also be observed on the other benchmark datasets. For each specific selected sub-graph feature subset (e.g. a feature set with 1000 sub-graphs), the accuracies vary significantly, depending on the size of the sub-graphs in the set. A larger variation can be observed for sub-graph feature subsets containing more features. For example, when using 2000 sub-graphs, the difference between the maximum and minimum classification accuracies is significantly larger than when using 50 sub-graphs (as shown in Figure 3-6 (d) vs. (f)). Overall, our experiments suggest that the number of sub-graphs is important for achieving good classification accuracy. High accuracy can be expected if a good number of sub-graphs is combined with good sizes of sub-graphs.

Increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs: The results on all benchmark datasets show that when the number of sub-graphs increases, the overall prediction accuracy also increases (for example, Figure 3-6 (a), (b), and (c)). For a small number of sub-graphs, e.g. a feature size of 50, the accuracies of the classifiers trained from different sub-graphs differ substantially, and random sub-graphs are inferior to the other approaches (as shown in Figures 3-6 (a), (d), (g), (j), and (m), where the differences in accuracy among the three methods range from 0% to 17.5%). When the size of the feature set increases to 2000, the accuracies of all three classifiers are very close to each other. In some cases, the three curves are exactly the same, which suggests that the performance of random sub-graphs steadily improves when an increasing number of sub-graphs is used to train the classifier. Indeed, when a large number of sub-graphs is used, the difference between different feature sets is reduced. In an extreme case, if all sub-graphs are used to train classifiers, there is no difference between feature sets, which therefore results in the same classifier performance. Overall, our experiments suggest that the difference between classifiers trained by using different sub-graphs is mostly noticeable when the number of sub-graphs is small. This difference is reduced when the number of sub-graphs increases, and may vanish when a relatively large number of sub-graphs, say more than 1000, is used. A direct conclusion from this observation is that when using more than 1000 sub-graph features to train classifiers, there is not much difference between using random sub-graphs and using frequent sub-graphs mined and selected by other expensive methods.

Performance with respect to different classifiers shows a similar trend: In Figure 3-8, we report the results of using the Nearest Neighbour (NN) classifier on the benchmark data. Due to space limitations, we only select one NCI graph dataset. Compared with Support Vector Machines, the overall accuracies of NN are slightly lower than those of SVM. However, the overall trends, such as the increase or decrease of the classification accuracies with respect to the size of the sub-graphs (or the number of sub-graphs), are very similar to the results from SVM. The observations we made for SVM are mostly valid for NN as well. Overall, our experiments suggest that there is only a minor difference between the classifiers trained by different learning methods, whereas the relationship between the classifier and the sub-graphs is mostly the same for different learning algorithms.

In this chapter, we found that current graph classification or clustering methods are always based on a binary sub-graph representation combined with traditional vector-based methods. As a result, the number of sub-graphs and the sub-graph selection method have a strong influence on the subsequent graph classification or clustering performance. What is more, this so-called coarse-grained representation model cannot well capture the real structure of graphs and causes information loss. In the next chapter, we therefore propose a new fine-grained representation model that enjoys a rigorous theoretical foundation ensuring minimal information loss for graph representation, and also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard.

Chapter 4

Graph Hashing and Factorization for Fast Graph Stream Classification

4.1 INTRODUCTION

As discussed in the last chapter, the number of sub-graphs and the sub-graph selection method have a strong influence on the subsequent graph classification or clustering performance. What is more, unlike traditional instance-feature representations, graph-structured data do not have specific features to represent the graph, so an essential step in building a graph classification model is to explore graph features to represent graph data in an instance-feature space for effective learning [57]. The majority of existing graph classification models employ an occurrence-based feature representation model, in which a set of sub-graph features is selected to represent the graph data by using the occurrence of the sub-graph in the graph (either 0/1 occurrence or the actual number of occurrences) as the feature value. In this chapter, we refer to such an occurrence based representation model as the coarse-grained representation model.

An example is shown in Fig. 4-1, where the substructures g1, g2, and g3 are used to represent graph Gi. While the coarse-grained feature representation model has been popularly used for graph classification, it has a number of disadvantages, including:

• Computation burden for isomorphism validation: Because sub-graphs are selected as features to represent the graph, the expensive sub-graph mining and


Figure 4-1: Coarse-grained vs. fine-grained representation.

isomorphism validation process cannot be avoided (sub-graph isomorphism testing is NP-complete) [19]. After the sub-graph features have been selected, the isomorphism validation process has to be applied again to map the graphs into the vector space.

• Severe and unbounded information loss: Because the coarse-grained representation model does not characterize the degree of closeness of a sub-graph pattern to the graph, it suffers severe information loss in representing the graph. More importantly, the information loss incurred in the representation is unbounded, so it is hard to characterize how closely the sub-graph features can represent the underlying graph and what degree of information loss is incurred in the representation.

For a graph stream with continuously growing volumes (i.e. number of graphs) and changing structures (e.g. new nodes may appear in the incoming graphs), the disadvantages of the existing coarse-grained graph feature model are even greater. This is because (1) the stream volume continuously increases, which makes it computationally expensive to explore sub-graph features, and (2) the changes in the graph stream (such as new structures) make the sub-graph features incapable of representing the graphs. For example, if new nodes or structures appear in the graph stream, the previously discovered sub-graph patterns will not be able to represent the graphs, because those patterns do not contain the information of the new nodes and new structures. Although it is always possible to rediscover new patterns from the most recent data, the mining procedures are normally time-consuming, and any features discovered may soon be outdated and incapable of representing future graphs.

60 Motivated by the above observations, we propose a fine-grained graph factorization framework for efficient graph stream classification in this chapter. Being fine-grained, our mapping framework relies on a set of discriminative frequent cliques, instead of important sub-graph patterns, to represent the graph data. In addition, such a fine-grained representa- tion ensures that the final instance-feature representation is sufficiently close to the original graph, with theoretical guarantee.

To solve the problem, our main idea is to bypass the expensive sub-graph mining pro- cess and use linear combinations of graph cliques to represent graphs. This is similar to the Fast Fourier Transform (FFT) [55] in which a set of base functions (i.e. cliques) is used to form input signals (i.e. graphs). To achieve this goal, our algorithm relies on two important steps to extract fine-grained graph representation: (1) finding a set of frequent graph cliques as the base; and (2) using graph factorization to calculate a linear combination of the graph cliques to best represent a graph.

Compared to the traditional coarse-grained representation model, the advantage of fine-grained representation is clear. As shown in Fig. 4-1, the original graphs are represented in more detail through fine-grained feature mapping: Given sub-graph features g1, g2, and g3 and graphs G1 and G2, the coarse-grained representation represents Gi as an instance Xi with 0/1 feature values, whereas the fine-grained representation represents Gi as an instance Xi with fractional values indicating the closeness of each sub-graph gj to the graph Gi. Even though g3 does not appear in G1, there is still a feature value (0.09). This is because our framework no longer relies on the occurrences of the sub-graphs but on linear combinations of a number of base patterns to represent graphs. A clear advantage of our method is that the fine-grained graph representation enjoys a rigorous theoretical foundation that ensures minimal information loss for graph representation, and also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard. After applying our fine-grained feature mapping model to represent graphs, we can use any learning algorithm, such as Nearest Neighbor or Support Vector Machines, for graph stream classification. Our method offers a number of advantages, including a fast feature mining process, more precise graph representation, and better performance for graph stream classification. Experiments on two real-world network graph datasets demonstrate that our method outperforms state-of-the-art approaches in both classification accuracy and runtime efficiency.

The remainder of this chapter is organized as follows. Section 4.2 introduces the problem definition. We describe the proposed graph factorization method in Section 4.3. Section 4.4 describes the FGSC framework for graph stream classification, and experimental studies are reported in Section 4.5.

4.2 PROBLEM DEFINITION

In this chapter, we propose a fast graph stream classification method using graph factorization to address the aforementioned problems and achieve better performance in both classification accuracy and runtime efficiency.

DEFINITION 11 (Connected Graph): A graph is represented as G = (V, E, L), where V = {v1, v2, ..., v_nv} is the set of vertices, E ⊆ V × V denotes the set of edges, and L = {l1, l2, ..., l_nl} is the set of symbol labels for vertices and edges. A connected graph is a graph such that there is a path between any pair of vertices.

DEFINITION 12 (Adjacency Matrix): An adjacency matrix of G (which contains Q unique nodes) is denoted by A(G) ∈ R^{Q×Q}, where each entry a_ij denotes the weight of the edge from vertex v_i to vertex v_j, and a_ii denotes the label on node v_i.

We use the adjacency matrix as the graph representation and assume for simplicity that each edge has a default weight of 1. A graph G_i is a labeled graph if a class label y_i ∈ Y is assigned to G_i. For the binary classification problem, we have y_i ∈ Y = {−1, +1}. A graph G_i is either labeled (denoted by G_i^L) or unlabeled (denoted by G_i^U).

DEFINITION 13 (Clique): A clique c = (V', E', L') in a graph G = (V, E, L) is a sub-graph of G, where V' ⊆ V, E' ⊆ E and L' ⊆ L, such that every two vertices in c are connected by an edge.

DEFINITION 14 (Graph Stream): A graph stream S contains an increasing number of graphs arriving in a streaming fashion, S = {G1, G2, ..., Gi, ...}. At any particular time point, we can collect a batch of graphs (B) for analysis. Formally, S = ∪_{j=1}^{∞} B_j, where B_j = {G_{j1}, G_{j2}, ..., G_{jk_j}}.

The aim of graph stream classification is to learn a classification model from S, at any particular time point, by scanning the graph stream only once, and to predict the class labels of future graphs arriving in the stream with maximal accuracy.

4.3 GRAPH FACTORIZATION

In this section, we present the model used to describe the relationship between cliques and graphs, and propose a graph factorization algorithm to generate fine-grained graph representation.

4.3.1 Factorization Model

Given a graph encoded by a binary symmetric adjacency matrix, our goal is to use discriminative feature-patterns to represent the graph precisely. In this chapter, we propose to use cliques as feature-patterns to represent graphs. There are several advantages of using cliques as feature-patterns: (1) Finding cliques does not require a complicated sub-graph mining process (which is computationally expensive). Although finding maximal cliques is an NP-complete problem (similar to the sub-graph mining problem), the special structure allows us to develop a fast algorithm to find cliques (details are given in Section 4.4.2). (2) Because there is an edge between any two nodes v_i and v_j in a clique, there is no need to describe the edges in the clique, so we can simply use all the nodes appearing in a clique to form a vector to represent the clique.

We assume that the size of the feature set (which is a clique set), denoted by N, and the size of the graph node space, denoted by Q, are given. Our model is parameterized by w_z and m_jz, where w_z denotes the expected weight of clique c_z, and m_jz represents the occurrence of vertex j in clique c_z. The expected weight of the edge that lies between vertices i and j in clique c_z can then be written as m_iz w_z m_jz. If we sum over the cliques in a given clique-pattern set C, where |C| = N, the expected weight of the edge that lies between


Figure 4-2: An example of graph factorization.

vertices i and j can be represented as

a_ij = Σ_{c_z ∈ C} m_iz w_z m_jz        (4.1)

From Eq. (4.1), we have an expected adjacency matrix Â(G) as follows, where M is a Q × N binary matrix and W is an N × N diagonal matrix.

Â(G) = M W M^T        (4.2)

In Eq. (4.2), each row of M corresponds to a node index, which is the same as the node index in A(G). Each column of M records the occurrences of the nodes in a selected clique-pattern c ∈ C (Fig. 4-2 shows an example of M and A(G) for G1 in Fig. 4-1): given graph G1 and three cliques c1, c2, and c3, the clique set matrix M records whether a node (row) appears in each of the three cliques (columns). Graph factorization uses M as a base to find a diagonal matrix W1 which best approximates the adjacency matrix with

A(G1) ≈ Â(G1) = M W1 M^T. The diagonal values in matrix W1 are used as feature values to represent graph G1. More explicitly,

( a_11  ...  a_1Q )        ( w_1  ...  0   )
(  :    ...   :   )  =  M  (  :   ...  :   )  M^T        (4.3)
( a_Q1  ...  a_QQ )        (  0   ...  w_N )

where a_ij is an element of A(G).
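The toy numpy example below illustrates the model of Eqs. (4.2)-(4.3): M marks which hashed nodes belong to each base clique, W holds the clique weights, and M W M^T reconstructs an expected adjacency matrix. The numbers are invented for illustration and are not the example from Fig. 4-2.

```python
# A small numerical illustration (assumed data) of the factorization model:
# A_hat(G) = M W M^T with a binary clique set matrix M.
import numpy as np

Q = 4
cliques = [[0, 1, 2], [1, 3], [2, 3]]         # N = 3 base cliques over Q hashed nodes
M = np.zeros((Q, len(cliques)))
for z, clique in enumerate(cliques):
    M[clique, z] = 1                          # m_jz = 1 iff node j occurs in clique c_z

W = np.diag([0.8, 0.5, 0.2])                  # illustrative clique weights w_z
A_hat = M @ W @ M.T                           # expected adjacency matrix, Eq. (4.2)
```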

The decomposition of the adjacency matrix A(G) can be interpreted in terms of overlapping weight theory. More specifically, denote the i-th column of M as M_·i. We let Θ_i = M_·i M_·i^T, and Eq. (4.3) can then be rewritten as:

Â(G) = Σ_{i=1}^{N} w_i Θ_i        (4.4)

We refer to Θ_i as a basic element. Because the i-th column of M is a clique, Θ_i is a binary matrix in which each diagonal element indicates whether a vertex of the graph appears in this clique, and the other entries represent the edge information of the clique. We use W ∈ R^{N×N} to denote the clique interaction matrix, which can also be considered as the clique degree matrix. As a result, Eq. (4.4) means that a graph can be represented as a set of cliques with different weight values (w_i). For a given M, we can decompose a graph into a diagonal matrix W with diagonal entries [w_1, w_2, ..., w_N]^T, which can be used to represent the graph. This is equivalent to mapping graphs into a vector space. Unfortunately, in reality, M is a non-square matrix, and a W_k satisfying A(G_k) = M W_k M^T may not exist for a given graph G_k.

Thus, we want to find an optimal Wk that satisfies,

A(G_k) ≈ M W_k M^T        (4.5)

By using a squared loss to measure the relaxation error, we have

RE(A(G_k), W, M) = ||M W M^T − A(G_k)||^2 = ||Â(G_k) − A(G_k)||^2        (4.6)

where RE(·) is the function computing the relaxation error. The optimization problem of Eq. (4.5) can then be formulated as

W_k = arg min_W RE(A(G_k), W, M)        (4.7)

DEFINITION 15 (Fine-grained graph representation): Given a set of cliques c1, c2, ..., cn and a set of graphs G1, G2, ..., Gm, for each graph G_k, if there is a vector W_k satisfying W_k = arg min_W RE(A(G_k), W, M), we call W_k the fine-grained representation of G_k under c1, c2, ..., cn.

4.3.2 Learning Algorithm

To solve Eq.(4.7), by combining Eqs.(4.2), (4.3), and (4.4), we have

W_k = arg min_W || Σ_{i=1}^{N} w_i · M_·i M_·i^T − A(G_k) ||^2        (4.8)

W_k = arg min_W || Σ_{i=1}^{N} w_i · Θ_i − A(G_k) ||^2        (4.9)

We define vec(·) as the matrix vectorization operator. Then

W_k = arg min_W || Σ_{i=1}^{N} w_i · vec(Θ_i) − vec(A(G_k)) ||^2        (4.10)

Let P = [vec(Θ_1), vec(Θ_2), ..., vec(Θ_N)]. Then

W_k = arg min_W || P · diag(W) − vec(A(G_k)) ||^2        (4.11)

where diag(·) denotes the main diagonal of a matrix.

66 To solve Eq.(4.11), we employ the General Linear System Theorem proposed in [62].

THEOREM 1 (General Linear System Theorem): Let there exist a matrix B such that

By is a minimum 2-norm least-squares solution of a linear system Ax = y. Then it is necessary and sufficient that B = A†, the Moore-Penrose generalized inverse of matrix A.

Remark. According to Theorem 1, we have the following properties for our proposed graph mapping algorithm:

• The special solution x_0 = A†y is one of the least-squares solutions of a general linear system:

  ||A x_0 − y||^2 = ||A A†y − y||^2 = min_x ||A x − y||^2

• The special solution x_0 = A†y has the smallest 2-norm among all least-squares solutions of Ax = y:

  ||x_0|| = ||A†y|| ≤ ||x||,  ∀x ∈ {x : ||Ax − y||^2 ≤ ||Az − y||^2, ∀z ∈ R^n}

• The minimum 2-norm least-squares solution of Ax = y is unique, and it is x = A†y.

Based on Theorem 1, the smallest 2-norm least-squares solution of the above linear system (Eq. (4.11)) is

diag(W_k) = P† · vec(A(G_k))        (4.12)

where P† is the Moore-Penrose generalized inverse of matrix P (a minimal numerical sketch of this mapping is given after the list below). By using the above factorization result, we can use diag(W_k) to map each graph G_k into a vector space, which is called the fine-grained representation of graph G_k. Because our factorization process ensures that the product of the Clique Set Matrix M and the diagonal matrix W_k best approximates the adjacency matrix of graph G_k with A(G_k) ≈ Â(G_k) = M W_k M^T, our method has the following three key advantages:

• Error Bound for Graph Representation: The difference between the original and the approximated adjacency matrices, A(G_k) vs. Â(G_k), quantitatively measures the information loss of the graph representation. We can further adjust the Clique Set Matrix M to ensure an upper bound on the information loss. This provides a clear theoretical basis for graph representation, which the existing coarse-grained representation cannot achieve.

• Avoid Sub-graph Pattern Mining: The inputs of the graph factorization process include the adjacency matrix A(G_k) and the Clique Set Matrix M. In the next section we show that finding the clique set matrix is much more efficient than finding frequent sub-graph patterns. In addition, our method does not need sub-graph matching to generate vectors for new graphs, as traditional methods do. Because the Clique Set Matrix M has already been generated from the training data, we only need matrix operations to map the test graphs into the vector space, which is more efficient than sub-graph matching. As a result, our approach avoids expensive sub-graph pattern mining and supports efficient graph stream classification.

• Better Representation for Graph Streams: In graph stream scenarios, the graph data and structures may change continuously. As a result, it is very difficult to find representative substructures to represent graph streams (even if the computational cost is not an issue). The fine-grained representation uses linear combinations of base cliques to approximate each graph. Because cliques are basic graph units, which remain relatively stable in graph streams, our approach provides a better representation framework for continuously changing graph streams.
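To make Eq. (4.12) concrete, here is a minimal numpy sketch of the fine-grained mapping: it stacks the vectorised basic elements Θ_i into P and applies the Moore-Penrose pseudo-inverse. It assumes M and A(G_k) are already available; the function name is illustrative, not from the thesis.

```python
# A hedged sketch of the fine-grained mapping of Eqs. (4.11)-(4.12).
# M is the Q x N clique set matrix, A the Q x Q adjacency matrix of one graph.
import numpy as np


def fine_grained_representation(A, M):
    Q, N = M.shape
    # P = [vec(Theta_1), ..., vec(Theta_N)] with Theta_i = M[:, i] M[:, i]^T
    P = np.column_stack([np.outer(M[:, i], M[:, i]).reshape(-1) for i in range(N)])
    # diag(W_k) = P^+ vec(A(G_k)), the minimum-norm least-squares solution
    return np.linalg.pinv(P) @ A.reshape(-1)
```

The returned N-dimensional vector is the feature representation that the classifier described in Section 4.4.4 consumes.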

4.4 FAST GRAPH STREAM CLASSIFICATION

In this section, we first introduce the overall framework for graph stream classification and then address details of the method in respective subsections.

4.4.1 Overall Framework

The proposed graph stream classification framework (FGSC) is illustrated in Fig. 4-3 and contains three key modules:

68  # 

# , 0#)/- &#+/ ( ),'.#)(  .),, *, - (..#)()  .,#(#(!.  ,#(#(! &--# , # 

)  $ -"#(! $

#-,#'#(.#0 , +/ (. ), "%! ( ,. #! ) &#+/ -#(#(! &--# , $  -/$ ..)  -"#(! $ ( ( ,.  &#+/  . .,#1

)  $ -"#(! $  .),, *, - (..#)(  ) . -.. #

)   &, #.#)( . -. -"#(! . -.  

Figure 4-3: The framework of FGSC for graph stream classification.

• Clique Mining: To tackle continuously increasing graph stream volumes and expanding graph nodes and structures, we collect graph data into batches, with each batch containing a number of graphs. For each single graph G_k in a batch, we use a node hashing strategy to map the unlimited node space into a fixed-size node hashing set, so our method can effectively handle graph streams with expanding new nodes and structures. We then decompose the compressed graph into a number of cliques by using the maximal clique representation method.

• Clique Set Matrix and Graph Factorization: The above clique mining process may find an increasing number of clique-patterns, whereas we can only select a fixed number of cliques as the base for factorization. Therefore, we propose a clique hashing strategy to find discriminative frequent clique-patterns to form the Clique Set Matrix (M). It is worth noting that, in the graph stream scenario, the matrix M is discovered from the current and previous batches so that we can find a good set of base cliques for graph factorization, as shown in the dotted box in Fig. 4-3. As a result, all graphs in the current batch B_i can be cast into a vector space through matrix factorization.

• Graph Stream Classification: After mapping the graphs into the vector space, a classifier is trained from the current batch B_i. The graphs in the next batch B_{i+1} are collected as a test set (denoted by G_test in Fig. 4-3). To ensure that the vector space-based classifier properly classifies the test graphs, each graph in B_{i+1} is processed by the factorization module and converted into the vector space by using the generated W for classification.

4.4.2 Graph Clique Mining

As shown in Fig. 4-3, instead of relying on expensive frequent sub-graph mining to capture graph features, we propose to use clique-based graph factorization to represent the graph data. Our first step is to select a set of cliques from each graph in the stream. In stream scenarios, the node set of the graph stream can be extremely large and unbounded. In addition, new nodes and structures may appear in the graphs, and it is therefore necessary to compress each incoming graph into a fixed node space. We use a random hash function to map the original unbounded node set onto a significantly compressed node set Ω of size Q. In other words, the nodes in the compressed graph of G_k are re-indexed by {1, ..., Q}. Since multiple nodes in G_k may be hashed onto the same index, the weight of an edge in the compressed graph is set to the number of edges between the original nodes that are hashed onto the corresponding compressed nodes. After hashing all nodes into a fixed-size space, each graph G_k ∈ S is transferred into a compressed graph G_k^* for clique mining. We denote this node hashing strategy by G_k^* := NodeHash(G_k, Q). An example of graph clique mining is illustrated in Fig. 4-4, where step (B) shows the compressed graph (a Python sketch of this step and of the extraction loop is given after Algorithm 1 below). To discover cliques from compressed graphs, we propose the following fast approach. According to the graph compression results, the weight of an edge in the compressed graph denotes the number of edges that exist in the original graph, and our clique mining process takes this information into consideration. More specifically, our maximal clique set mining process includes:

• Finding Maximal Cliques: We first discover the maximal cliques C_{G_k^*} from the compressed graph G_k^* (where no maximal cliques in C_{G_k^*} share any overlapping edge). Because all nodes in the compressed graph G_k^* are unique, finding maximal


Figure 4-4: An example of clique mining in a compressed graph.

cliques from G_k^* is very efficient (we use the Bron-Kerbosch maximal clique finding algorithm in our experiments). An example of the maximal clique c1 is shown in Fig. 4-4 (C), upper panel;

• Clique Extraction: We set the weight of each edge in a discovered maximal clique c ∈ C_{G_k^*} equal to the smallest weight value of the edges in c. After that, we decrease the weights of the edges in the compressed graph G_k^* by the weights of the corresponding edges in C_{G_k^*} and generate a new weighted graph G̃_k. An example of the updated graph G̃_k after extracting clique c1 is shown in Fig. 4-4 (C), lower panel;

• Loop: If the weight of an edge in the updated graph G̃_k is 0, it means that the edge has already been removed and no longer needs to be considered in the following clique mining process. Looping between the above two steps discovers all cliques, until all edge weights in G̃_k are equal to 0. An example is shown in Fig. 4-4 (E), lower panel.

It is worth noting that because all nodes in the compressed graph G_k^* are unique, G_k^* can be completely restored (i.e. G_k^* can be rebuilt by using all the discovered cliques). This ensures that there is no information loss during graph decomposition. Algorithm 1 describes the detailed process of clique mining, where MaximalCliques(G̃_k) finds the maximal cliques of G̃_k, MinimalWeight(c) sets the edge weights of the input clique c to the smallest weight value of all edges in c, and UpgradeWeights(G̃_k) updates the weight values of graph G̃_k.

Algorithm 1 Clique Mining

Input: graph G_k
1:  G_k^* ← NodeHash(G_k, Q);
2:  G̃_k ← G_k^*;
3:  C_kout ← ∅;
4:  while some edge in G̃_k has weight unequal to 0 do
5:      C_{G̃_k} ← MaximalCliques(G̃_k);
6:      for all c ∈ C_{G̃_k} do
7:          c ← MinimalWeight(c);
8:      end for
9:      C_kout ← C_kout ∪ C_{G̃_k};
10:     G̃_k ← UpgradeWeights(G̃_k);
11: end while
Output: C_kout

An example of the graph clique mining process is illustrated in Fig. 4-4, where nodes of the same color are compressed into one node, with the weight values in the compressed graph changed accordingly. After the clique mining process, graph G is decomposed into a clique set C_out = {c1, c2, c3, c4} for graph factorization.
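The sketch below illustrates NodeHash and the extraction loop of Algorithm 1 in Python, using networkx's Bron-Kerbosch implementation (nx.find_cliques). The simple modulo hash and the bookkeeping details are assumptions made for illustration, not the thesis implementation.

```python
# A hedged sketch of Algorithm 1: compress a graph onto Q hashed nodes, then
# repeatedly peel off maximal cliques until every compressed edge weight is 0.
import networkx as nx


def node_hash(G, Q):
    """NodeHash(G, Q): edge weights count the original edges that collide."""
    H = nx.Graph()
    H.add_nodes_from(range(Q))
    for u, v in G.edges():
        hu, hv = hash(u) % Q, hash(v) % Q      # Python's hash stands in for the random hash
        if hu != hv:                           # collapsed self-loops are ignored here
            w = H[hu][hv]["weight"] + 1 if H.has_edge(hu, hv) else 1
            H.add_edge(hu, hv, weight=w)
    return H


def clique_mining(G, Q):
    """Return (clique node tuple, extracted weight) pairs for one graph."""
    H, cliques = node_hash(G, Q), []
    while H.number_of_edges() > 0:
        for nodes in nx.find_cliques(H):       # maximal cliques via Bron-Kerbosch
            edges = [(u, v) for u in nodes for v in nodes if u < v and H.has_edge(u, v)]
            if not edges:
                continue                       # an isolated node reported as a "clique"
            w = min(H[u][v]["weight"] for u, v in edges)
            if w == 0:
                continue                       # already exhausted earlier in this pass
            cliques.append((tuple(sorted(nodes)), w))
            for u, v in edges:
                H[u][v]["weight"] -= w         # subtract the extracted clique weight
        # drop fully extracted edges so the next pass only sees remaining weight
        H.remove_edges_from([(u, v) for u, v, d in H.edges(data=True) if d["weight"] == 0])
    return cliques
```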

4.4.3 Clique Set Matrix and Graph Factorization

In this subsection, we introduce the detailed process for finding a set of discriminative frequent cliques to form the Clique Set Matrix (M), and then use graph factorization for feature mapping.

Discriminative Frequent Cliques

To find a good set of cliques to form base M for graph factorization, we use correlations between each clique and graph labels (i.e. similar to Information Gain [44] and G-test score [66]) to find cliques that are highly correlated to the classification task. For fast discriminative clique finding, we use an “in-memory” Clique-class table Γ with (Q + |Y| + 2) columns to count the frequency of each clique with respect to each class label.

An example of an “in-memory” clique-class table is shown in Fig. 4-5, where Q = 4 is the size of the compressed node set, which corresponds to the first 4 columns. Columns 5 and 6 give the statistical information for the two classes. Column 7 is the value of the random hash function H, and the final column is the objective score. In the example, ck is a new clique whose hashing result is 1531; we only compare ck with c1 and c3, find that it is equal to c1, and then update the statistical information of c1. Q is the size of the compressed node set Ω and |Y| is the number of graph class labels (the table in Fig. 4-5 contains four nodes and two classes, i.e. Q = 4 and |Y| = 2). A clique c and its statistical information are recorded in one row, where the first Q columns represent the occurrence of the nodes in c (in Fig. 4-5, clique c1 contains nodes 1, 3, and 4). The frequencies of the clique with respect to each class (Yi) are recorded in the next |Y| columns; in Fig. 4-5, c1 appears 85 times in one class and 20 times in the other. The clique hashing output, which helps to speed up the update of the statistical information of each clique pattern, is recorded in the following column; in Fig. 4-5, the 7th column records the hashing output of c1, which is 1531. The last column records the discrimination-test score (objective score) of each clique.

For each graph Gk in the stream, we first use Algorithm 1 to collect its clique set Ckout. After that, for each clique in Ckout, say ck,j, we apply a random hash function h(·) to the string of ordered edges in ck,j to generate an index Hk,j ∈ {1, 2, ··· , τ}, where τ is a control parameter. Then we check the second-last column of Γ to validate whether there is a value equal to Hk,j. In other words, we only compare the nodes of ck,j with the nodes recorded in a row p of Γ for which Γp,o = Hk,j, where o = Q + |Y| + 1. If they are the same clique, we update the information by adding 1 to the class label column to which Gk belongs. If ck,j does not appear in the current Γ, we add a new row at the end of Γ to record this clique. The clique hashing strategy used here accelerates clique matching: instead of comparing all cliques, we only compare cliques with the same hashing results.
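As an illustration, the following minimal Python sketch maintains such a clique-class table, assuming cliques are given as tuples of compressed node indices (as produced by the clique mining sketch above); the row layout follows the description: Q node-occurrence columns, |Y| class counters, the clique-hash column, and the objective-score column. The function and variable names are illustrative.

from collections import defaultdict

def update_table(table, clique_nodes, class_index, Q, num_classes, tau=10**6):
    # table maps a hash value H in {0, ..., tau-1} to candidate rows; each row is
    # [occ_1, ..., occ_Q, count_y1, ..., count_y|Y|, H, score].
    h = hash(tuple(sorted(clique_nodes))) % tau          # clique hashing output
    occurrence = [1 if i in clique_nodes else 0 for i in range(Q)]
    for row in table[h]:                                 # compare only same-hash rows
        if row[:Q] == occurrence:                        # same clique: update the class count
            row[Q + class_index] += 1
            return
    new_row = occurrence + [0] * num_classes + [h, 0.0]  # unseen clique: append a new row
    new_row[Q + class_index] = 1
    table[h].append(new_row)

table = defaultdict(list)    # the "in-memory" table, keyed by clique hash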

Given the “in-memory” Clique-class table Γ, we can find discriminative frequent clique patterns by using two parameters: the frequency threshold α and the clique feature number m. We sort all cliques whose frequencies are equal to or greater than α by their objective scores (e.g. IG) in descending order (the objective scores are listed in the last column of Γ). We then use the top m cliques with the highest scores to generate the Clique Set Matrix M.


Figure 4-5: An example of “in-memory” Clique-class table Γ.

Algorithm 2 Finding Discriminative Frequent Cliques
Input: Training graph set G, frequency threshold α and feature number m
1: for all Gk ∈ G do
2:   Ckout ← CliqueMining(Gk);
3:   Γ ← Γ ⊕ Ckout;
4: end for
5: Γ ← Select(Γ, α);
6: Sort(Γ);
7: M ← Top(Γ, m);
Output: M;

Algorithm 2 shows the procedure of discriminative frequent clique-pattern finding, in which CliqueMining(·) is Algorithm 1 and the operator “⊕” updates Γ by comparing each newly mined clique with the cliques already recorded in Γ. Select(Γ, α) is a function that finds the cliques whose frequencies are equal to or greater than α, Sort(Γ) sorts cliques in descending order according to their objective scores, and Top(Γ, m) returns the top m rows of Γ.
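A minimal sketch of this selection step is given below, assuming the table built above; the purity-style score is only an illustrative stand-in for the IG or G-test objectives mentioned in the text.

import numpy as np

def select_discriminative_cliques(table, Q, num_classes, alpha, m):
    # Select(Γ, α): keep cliques whose total frequency reaches alpha, then
    # Sort(Γ) by objective score and Top(Γ, m) to build the Clique Set Matrix M.
    kept = []
    for bucket in table.values():
        for row in bucket:
            counts = row[Q:Q + num_classes]
            freq = sum(counts)
            if freq < alpha:
                continue
            row[-1] = max(counts) / freq        # illustrative purity score (stand-in for IG)
            kept.append(row)
    kept.sort(key=lambda r: r[-1], reverse=True)
    top = kept[:m]
    # M has one column per selected clique and Q rows (node-occurrence indicators)
    return np.array([r[:Q] for r in top], dtype=float).T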

Feature Mapping

Once the Clique Set Matrix M has been suitably generated, we can convert each compressed graph Gk* into the same vector space by using the factorization model proposed earlier. The factorization generates a diagonal matrix Wk under the objective Wk = arg min_W ||M W M^T − A(Gk*)||^2, and diag(Wk) can then be used as the fine-grained representation of Gk*.
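Because diag(Wk) enters the objective linearly, the factorization reduces to an ordinary least-squares problem. The sketch below solves it with numpy, assuming M is the Q x m Clique Set Matrix from the previous step (one column per selected clique) and A is the Q x Q weighted adjacency matrix of a compressed graph; the function name is illustrative.

import numpy as np

def feature_mapping(M, A):
    # Each clique column m_i contributes the rank-one pattern m_i m_i^T, so
    # vec(M W M^T) is linear in diag(W); solve min ||M W M^T - A||_F^2 by least squares.
    Q, m = M.shape
    design = np.stack([np.outer(M[:, i], M[:, i]).ravel() for i in range(m)], axis=1)
    w, *_ = np.linalg.lstsq(design, A.ravel(), rcond=None)
    return w          # diag(W_k): the fine-grained feature vector of the graph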

4.4.4 Graph Stream Classification

Given the vector representation of each graph in a batch of graphs (Bi), we can use any generic learning algorithm, such as Nearest Neighbors (NN), Naive Bayes, or Support Vector Machines (SVM), to train a classifier. The classifier is then used to predict class labels for new graphs that have yet to arrive. As soon as the new graphs arrive and a new graph batch is collected and labeled, we can easily update the classification model by training a new classifier from the new batch. This procedure is detailed in Algorithm 3, where FeatureMapping(·) obtains the fine-grained representation by using graph factorization.

Algorithm 3 Classification
Input: Classifier ζ, unlabeled test graph Gtest and Clique Set Matrix M
1: Gtest* ← NodeHash(Gtest, Q);
2: Vtest ← FeatureMapping(M, Gtest*);
3: ytest ← ζ(Vtest);
Output: ytest;
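Putting the pieces together, the following sketch shows the batch-wise procedure: predict the incoming batch with the current model, then rebuild the clique features and retrain once the batch is labeled. It reuses the helper functions sketched earlier (node_hash, clique_mining, update_table, select_discriminative_cliques, feature_mapping); a single SVM is an illustrative stand-in for the NN/SMO/Naive Bayes ensemble used in the experiments.

from collections import defaultdict
import networkx as nx
import numpy as np
from sklearn.svm import SVC

def featurize(graphs, M, Q):
    vectors = []
    for g in graphs:
        compressed = node_hash(g, Q)
        A = nx.to_numpy_array(compressed, nodelist=range(Q), weight="weight")
        vectors.append(feature_mapping(M, A))
    return np.array(vectors)

def process_stream(batches, Q, alpha, m, num_classes=2):
    classifier, M = None, None
    for graphs, labels in batches:                       # labels become available per batch
        if classifier is not None:
            predictions = classifier.predict(featurize(graphs, M, Q))
            # ... evaluate or emit the predictions for this chunk here
        table = defaultdict(list)                        # rebuild the clique-class table from the new batch
        for g, y in zip(graphs, labels):
            for clique, _weight in clique_mining(g, Q):
                update_table(table, clique, y, Q, num_classes)
        M = select_discriminative_cliques(table, Q, num_classes, alpha, m)
        classifier = SVC().fit(featurize(graphs, M, Q), labels)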

4.5 EXPERIMENTS

In this section we report the performance of the proposed factorization-based fast graph stream classification method (FGSC) on real-world graph data. Our experiments mainly focus on the study of efficiency and effectiveness.

4.5.1 Benchmark Data

We validate the performance of the algorithm on the following two real-world graph streams.

• DBLP Stream 1: DBLP is a computer science bibliography. Each record in DBLP denotes one publication, and the record is associated with a number of attributes such as the paper ID, authors, year, title, abstract, and reference IDs [72]. In our experiments, we build a graph stream with a binary classification task by using papers published in a list of selected conferences (shown in Table 4.1). The classification task is to predict whether a paper belongs to the field of DBDM (database and data mining) or CVPR (computer vision and pattern recognition), by using structural information, such as the title and references, of each paper. We represent each paper in DBLP as a graph, with each node denoting a paper ID or a keyword and each edge representing a citation relationship between papers or the co-occurrence of keywords in the paper's title. We make the following designations: (1) each paper ID is a node; (2) if a paper A cites another paper B, there is an edge between A and B (we use undirected edges); (3) each keyword in the title is also a node; (4) each paper ID node is connected to the keyword nodes of that paper; and (5) the keyword nodes of each paper are fully connected with one another. An example of DBLP graph data is shown in Fig. 4-6: the rectangles are paper ID nodes and the diamonds are keyword nodes. Paper ID17890 cites (connects to) papers ID17883 and ID18068, and ID17890 has the keywords Patch, Motion, and Invariance in its title. Paper ID18068 has the keywords Edge and Detection, and the title of paper ID17883 includes the keywords Vision and Active. The title keywords of each paper are linked with one another. In the experiments, all papers are arranged in chronological order to form a graph stream. It is worth noting that the paper title keyword space is very large (more than one thousand keywords), and new title keywords may appear at any time, so the node space is constantly changing and expanding, which represents the inherent challenge of the graph stream.

1http://arnetminer.org/citation

Table 4.1: DBLP dataset used in experiments.

Figure 4-6: Graph representation for a paper (ID17890) in DBLP.
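The paper graph of Fig. 4-6 can be reproduced with a few lines of networkx code; the sketch below builds the graph for one paper given its ID, its cited paper IDs, and its title keywords (the cited papers' own keyword nodes are omitted for brevity, and the helper name is illustrative).

import itertools
import networkx as nx

def build_paper_graph(paper_id, cited_ids, keywords):
    g = nx.Graph()
    g.add_node(paper_id)
    g.add_edges_from((paper_id, cited) for cited in cited_ids)     # undirected citation edges
    g.add_edges_from((paper_id, kw) for kw in keywords)            # paper ID linked to its title keywords
    g.add_edges_from(itertools.combinations(keywords, 2))          # title keywords fully connected
    return g

g = build_paper_graph("ID17890", ["ID17883", "ID18068"],
                      ["Patch", "Motion", "Invariance"])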

• IBM Sensor Stream 2: This stream contains information about local traffic in a sensor network that is subject to a set of intrusion attack types. Each graph constitutes a local pattern of traffic in the sensor network, where nodes denote IP addresses and edges correspond to local traffic flows between IP addresses [1]. The selected data contains a stream of intrusion graphs from June 1, 2007 to June 3, 2007. Each graph is associated with a particular intrusion type, and there are over 300 different local patterns in the dataset. We choose 20 typical intrusion types (10 types of “BWEST” pattern and 10 types of “SNMP” pattern), which are used as class labels in our experiments. Our goal is to classify a traffic flow pattern as one of the 20 intrusion types. More details about this data can be found in [1]. Compared to the DBLP stream, each graph in the IBM stream has far fewer nodes and the node labels overlap less between different graphs, which makes accurate classification particularly difficult. We intentionally choose a graph structure that is very different from the DBLP stream, so that the results on the IBM and DBLP streams can help to evaluate the algorithm's performance on different types of graphs.

2http://www.charuaggarwal.net/sens1/gstream.txt

4.5.2 Experimental Settings

Baseline Methods: To evaluate the efficiency and effectiveness of our graph stream classification framework, we compare the proposed FGSC (FG+Stream) with two other methods.

• Coarse-grained representation-based method (CG+Stream): Compared with our fine-grained representation for graphs, this method uses a traditional coarse-grained feature representation model in which the occurrence of each discriminative frequent clique is used to represent the graph data (the selected cliques and the classification methods are the same as in our proposed method, except that CG+Stream uses 0/1 occurrence as the feature value). This method is similar to the clique hashing-based graph stream classification method in [15].

• 2-D edge hash-compressed stream classifier (EH+Stream): This method employs a 2-D random edge hashing scheme to construct an “in-memory” summary for the sequentially presented graphs [1]. The first random-hash scheme is used to reduce the size of the edge set. The second min-hash scheme is used to dynamically update a number of hash codes, which summarize the frequent patterns of co-occurring edges observed so far in the graph stream. A simple heuristic is used to select a set of the most discriminative frequent patterns to build a rule-based classifier.

In the experiments, we use 10-fold cross-validation to evaluate the performance of graph stream classification. For fair comparison, all three methods (FG+Stream, CG+Stream, and EH+Stream) are evaluated with different chunk sizes and feature numbers when predicting the class labels of graphs in the stream. For EH+Stream, the Nearest Neighbor classifier (NN) is used as the classifier in the original method, so we cannot apply other learning algorithms to it. For the other two methods (FG+Stream and CG+Stream), we use three different classifiers, including Nearest Neighbor (NN), Support Vector Machines (SMO), and Naive Bayes, to form an ensemble and predict the class labels of graphs in a future chunk. The default parameter settings are as follows: batch size |D| = {600, 800, 1000} (for DBLP) and {300, 400, 500} (for IBM); feature size |m| = {62, 142, 307} with frequency threshold α = {2%, 1%, 0.5%} (for DBLP), and |m| = {43, 75, 148} with α = {0.3%, 0.1%, 0.06%} (for IBM). All experimental results are collected on a Linux cluster computing node with an Intel(R) Xeon(R) 3.33 GHz CPU and 4GB of memory.

4.5.3 Experimental Results

In this section, we report the algorithm performance with respect to the classification accuracy and classification efficiency on the graph streams.

Graph Stream Classification Accuracy

Results on different chunk sizes |D|: In Figs. 4-7 and 4-10, we report the algorithm performance using different numbers of graphs in each chunk |D| (varying from 1000, 800, to 600 for the DBLP stream and from 500, 400, to 300 for the IBM stream). Overall, the results in Fig. 4-7 show that the accuracy of FG+Stream on the DBLP stream is better than that of CG+Stream and EH+Stream. The EH+Stream method has the worst performance, and its accuracy is only slightly over 50%, which is nearly equivalent to random prediction. The accuracy of FG+Stream, in all three experiments, is consistently better than that of CG+Stream and EH+Stream, even though the accuracy fluctuates to a large extent on some chunks (such as Batch 14 in Fig. 4-7 (b)). Recall that the only difference between FG+Stream and CG+Stream is that the former uses fine-grained feature representations whereas the latter uses 0/1 occurrence (both methods have the same set of discriminative cliques). As a result, the better performance of FG+Stream, compared to CG+Stream, can be attributed to the fact that the fine-grained representation provides more accurate information to describe each single graph and ensures minimal information loss in graph representation. As explained in the Graph Factorization section, FG+Stream uses a linear combination of cliques to represent each graph, so even if some features do not appear in a specific graph, they may still have a value indicating each clique's correlation to the graph. By contrast, CG+Stream can only assign a zero value to indicate that a feature does not appear in the graph. As a result, the feature information given by the fine-grained representation ensures good accuracy for graph classification.


Figure 4-7: Accuracy w.r.t different chunk sizes on DBLP Stream. The number of features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800; (c) 600.


Figure 4-8: Accuracy w.r.t different number of features on DBLP Stream with each chunk containing 1000 graphs. The number of features selected in each chunk is: (a) 307; (b) 142; (c) 62.


Figure 4-9: Accuracy w.r.t different classification methods on DBLP Stream with each chunk containing 1000 graphs, and the number of features in each chunk is 142. The classification methods selected here are: (a) NN; (b) SMO; (c) NaiveBayes.

The results in Fig. 4-10 show that, for the IBM stream, the accuracies of FG+Stream and CG+Stream are almost identical across the whole stream. This is mainly because each graph in the IBM stream is composed of only a few nodes (most graphs contain fewer than four nodes). As a result, the features mined from the training set contain only a few nodes (e.g. one or two nodes) and the overlap between any two features (i.e. the same nodes and edges shared by two features) is infrequent. The feature weight values calculated by our graph factorization method thus degrade to the 0/1 occurrence of the features. As a result, the vector representation obtained by FG+Stream is almost the same as that of CG+Stream, which leads to the same classification accuracy. The accuracy of EH+Stream is significantly worse than the other two methods. As shown in Fig. 4-10 (a), no graph is correctly classified for chunks 14-15, 17-19, 21, and 25-27. This demonstrates that for graphs with very few nodes and edges, clique hashing, as used by EH+Stream, cannot obtain good classification accuracy.


Figure 4-10: Accuracy w.r.t different chunk sizes on IBM Stream. The number of features in each chunk is 75. The batch sizes vary from (a) 500; (b) 400; to (c) 300.


Figure 4-11: Accuracy w.r.t different number of features on IBM Stream with each chunk containing 400 graphs. The number of features selected in each chunk is: (a) 148; (b) 75; (c) 43.


Figure 4-12: Accuracy w.r.t. different classification methods on IBM Stream with each chunk containing 400 graphs, and the number of features in each chunk is 75. The classification methods selected include: (a) NN; (b) SMO; (c) NaiveBayes.

Results on different numbers of features |m|: In Figs. 4-8 and 4-11, we report the algorithm's performance with respect to the number of features in each chunk.

As expected, FG+Stream has the best performance of the three algorithms on the DBLP stream. In addition, the results in Fig. 4-8 show that increasing the number of features for the DBLP stream increases the accuracy of the CG+Stream method, while the accuracy of the FG+Stream method remains relatively stable. This is because CG+Stream directly uses the selected clique features to represent a graph, whereas FG+Stream uses combinations of clique features for representation. As a result, FG+Stream is less dependent on the number of features selected for graph representation.

Interestingly, even though the accuracy of all three methods fluctuates to a large extent across the whole graph stream, the results in Fig. 4-11 show that the accuracy of each method is almost unchanged across the different feature sizes. This may suggest that for the IBM stream, a small number of features (such as particular IP addresses or IP address connections) already has enough discriminative power for classification. Increasing the number of features may only introduce redundant features into the learning process, which is not only time-consuming for learning but also ineffective in improving classification accuracy.

Results on different classifiers: For the EH+Stream method, the authors use a nearest neighbor classifier that scans the training data, re-organizes edges into graphs, and then finds the closest graph for the test instance [1]. This is essentially a Nearest Neighbor classifier, so we cannot apply other classifiers to EH+Stream to validate its performance. We validate the performance of FG+Stream and CG+Stream by using three major classifiers: Nearest Neighbor (NN), Support Vector Machines (SVM), and Naive Bayes. The results in Fig. 4-9 show that, for the DBLP data, Naive Bayes has the worst performance compared to NN and SVM, and when using Naive Bayes, the accuracies of FG+Stream and CG+Stream fluctuate significantly across the whole stream. Meanwhile, NN and SVM have comparable performance in most cases. For the IBM stream, the classification accuracies of all three classifiers are very close to one another.


Figure 4-13: Accumulated system runtime using the NN classifier, where |D| = 1000, |m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively. (a) Results on DBLP stream; (b) Results on IBM stream.

Graph Stream Classification Efficiency

In Fig. 4-13, we report the system runtime (efficiency) for graph stream processing. The results show that FG+Stream always requires less time than CG+Stream. This is mainly because FG+Stream avoids the expensive sub-graph isomorphism validation process for vector generation, which is very time-consuming. EH+Stream performs better than the other two methods on the IBM stream and has a very fast runtime for the first few chunks. This is because EH+Stream always keeps a feature table of fixed size, in which a feature is replaced once a better feature is detected. The graphs in the IBM stream are very small (only several nodes) compared to the DBLP stream, so fewer candidate features are found, which leads to less feature evaluation and replacement during updating. Overall, the results show that FG+Stream scales linearly with the number of chunks. The runtime of FG+Stream is acceptable given its high classification accuracy. As a result, FG+Stream is a practical solution for handling real-world high-speed graph streams.

Chapter 5

Super-graph based Classification

5.1 INTRODUCTION

Many applications, such as social networks and citation networks, commonly use graph structures to represent data entries (i.e. nodes) and their structural relationships (i.e. edges). When using graphs to represent objects, all existing frameworks rely on two approaches to describe node content: (1) node as a single attribute: each node has only one attribute (a single-attribute node). A clear drawback of this representation is that a single attribute cannot precisely describe the node content [40]. This representation is commonly referred to as a single-attribute graph (Fig. 5-1 (A)). (2) node as a set of attributes: a set of independent attributes is used to describe the node content (Fig. 5-1 (B)). This representation is commonly referred to as an attributed graph [8, 14, 80]. Both single-attribute and attributed graphs emphasize the dependency structures between objects (i.e. nodes) but inherently overlook the internal structure inside each node, where the attributes/properties used to describe the node content may be subject to dependency structures as well. For example, in a citation network each node represents one paper and edges denote citation relationships. It is insufficient to use one or multiple independent attributes to describe the detailed information of a paper. Instead, we can represent the content of each paper as a graph, with nodes denoting keywords and edges representing contextual correlations between keywords (e.g. co-occurrence of keywords in different sentences or paragraphs). As a result, each paper and all the references it cites can form a super-graph, with each edge between papers denoting a citation relationship. Similarly, in a protein interaction network each node represents one protein and edges denote interaction relationships. A protein may undergo reversible structural changes in performing its biological function, and these changes are unlikely to be captured by a sequence representation [85], so a sequence alone is insufficient to describe a protein. As shown in Fig. 5-2, each protein can instead be represented as a super-node by using its molecular structure, and the proteins form a super-graph with edges between proteins denoting their interaction relationships; two interacting proteins may share functional or structural similarity. In this chapter, we refer to this type of graph, where the content of each node can itself be represented as a graph, as a “super-graph”. Likewise, we refer to a node whose content is represented as a graph as a “super-node”. Given a set of super-graphs, the classification problem is to identify the category to which a super-graph belongs, based on the known information, such as labeling information, super-node structure, and node content. To build learning models for super-graph based classification, the main challenge is to properly calculate the distance between two super-graphs.


Figure 5-1: (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph.

• Similarity between two super-nodes: Because each super-node is a graph, the overlapped/intersected graph structure between two super-nodes reveals the similarity between the two super-nodes, as well as the relationship between the two super-graphs. The traditional hard node-matching mechanism is unsuitable for super-graphs, which require soft node-matching.


Figure 5-2: A conceptual view of a protein interaction network using super-graph representation.

• Similarity between two super-graphs: The complex structure of a super-graph requires that the similarity measure considers not only the structural similarity, but also the super-node similarity between two super-graphs. This cannot be achieved without combining node matching and graph matching as a whole to assess the similarity between super-graphs.

The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK) for super-graphs. In this chapter, we generate a new product graph from two super-graphs and then use weighted random walks on the product graph to calculate the similarity between the super-graphs. A weighted random walk denotes a walk starting from a random weighted node and following succeeding weighted nodes and edges in a random manner. The weight of a node in the product graph denotes the similarity of the two super-nodes that generate it. Given a set of labeled super-graphs, we can use a weighted product graph to establish walk-based relationships between two super-graphs and calculate their similarities. After that, we can obtain the kernel matrix for super-graph classification. Experiments on two real-world super-graph classification tasks demonstrate that the proposed method significantly outperforms baseline approaches. The remainder of the chapter is structured as follows. Section 5.2 formulates the problem and defines important notation. The weighted random walk kernel is proposed in Section 5.4. The super-graph classification algorithm is given in Section 5.5, followed by theoretical analysis and theorems in Section 5.6. Experiments are reported in Section 5.7.

5.2 PROBLEM DEFINITION

DEFINITION 16 (Single-attribute Graph) A single-attribute graph is represented as g = (V, E, Att, f), where V = {v1, v2, ··· , vn} is a finite set of vertices, E ⊆ V × V denotes a finite set of edges, and f : V → Att is an injective function from the vertex set V to the attribute set Att = {a1, a2, ··· , am}. An attribute ai is a single symbol, e.g. a keyword, denoting node content. Att(g) = [a1(g), ··· , am(g)] denotes the attribute vector, where ai(g) = Σ_{j=1}^{n} h(f(vj), ai), with h(f(vj), ai) = 1 if f(vj) = ai and h(f(vj), ai) = 0 otherwise.

DEFINITION 17 (Super-graph and Super-node) A super-graph is represented as G = (V, E, G, F), where V = {V1, V2, ··· , VN} is a finite set of graph-structured nodes, E ⊆ V × V denotes a finite set of edges, and F : V → G is an injective function from V to G, where G = {g1, g2, ··· , gM} is the set of single-attribute graphs. A node in the super-graph, which is represented by a single-attribute graph, is called a Super-node.

Formally, a single-attribute graph g = (V, E, Att, f) can be uniquely described by its attribute and adjacency matrices. The attribute matrix ϕ is defined by ϕri = 1 ⇔ ai ∈ f(vr), otherwise ϕri = 0. The adjacency matrix θ is defined by θij = 1 ⇔ (vi, vj) ∈ E, otherwise θij = 0. Similarly, the adjacency matrix Θ of a super-graph G = (V, E, G, F) is defined by Θij = 1 ⇔ (Vi, Vj) ∈ E, otherwise Θij = 0. However, because of the complicated structure of a super-graph, we cannot use a single matrix to describe its super-node information. For convenience in later calculations, a super attribute matrix Φ ∈ R^{N×S} is defined for super-graph G, where N = |V| and S = |Att_{g1} ∪ Att_{g2} ∪ ··· ∪ Att_{gM}|. Φri = 1 ⇔ ai ∈ Att_{gr}, otherwise Φri = 0.
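As a small worked example of these definitions, the following Python sketch builds the attribute matrix ϕ and adjacency matrix θ of a single-attribute graph; node and attribute indices follow their positions in the given lists, and the helper name is illustrative.

import numpy as np

def graph_matrices(nodes, node_attr, edges, attributes):
    # nodes: vertex names; node_attr: vertex -> its single attribute f(v);
    # edges: list of vertex pairs; attributes: the ordered attribute set Att.
    n, m = len(nodes), len(attributes)
    phi = np.zeros((n, m))
    for r, v in enumerate(nodes):
        phi[r, attributes.index(node_attr[v])] = 1       # phi_ri = 1 iff a_i = f(v_r)
    theta = np.zeros((n, n))
    for u, v in edges:
        i, j = nodes.index(u), nodes.index(v)
        theta[i, j] = theta[j, i] = 1                    # theta_ij = 1 iff (v_i, v_j) in E
    return phi, theta

phi, theta = graph_matrices(["v1", "v2", "v3"], {"v1": "a1", "v2": "a2", "v3": "a1"},
                            [("v1", "v2"), ("v2", "v3")], ["a1", "a2"])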

A super-graph Gi is a labeled graph if a class label L(Gi) ∈ Y = {y1, y2, ··· , yx} is assigned to Gi. A super-graph Gi is either labeled (GiL) or unlabeled (GiU).

DEFINITION 18 (Super-graph Classification) Given a set of labeled super-graphs DL = {G1L, G2L, ···}, the goal of super-graph classification is to learn a discriminative model from DL to predict previously unseen super-graphs DU = {G1U, G2U, ···} with maximum accuracy.

5.3 OVERALL FRAMEWORK

In Fig. 5-3, we list the major steps of our super-graph based classification framework. The figure shows the WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2, g3), where g1⊗2 and g1⊗3 are the single-attribute product graphs for (g1, g2) and (g1, g3), respectively, and G⊗ is the super product graph for (G, G′). (θ1, ϕ1) and (Θ, Φ) are the adjacency and attribute matrices for g1 and G, respectively. w̄1⊗2 and w are the weight vectors for g1⊗2 and G⊗, respectively. Each element of w is equal to the WRWK of two super-nodes (as the dotted arrows show). A random walk on a single-attribute product graph, say g1⊗3, is equivalent to performing simultaneous random walks on graphs g1 and g3. Assume that v1(a1) means node v1 with attribute a1 in g1. The walk v1(a1) → v3(a5) → v4(a1) in g1 can match the walk v1′(a1) → v3′(a5) → v2′(a1) in g3. The corresponding walk on the single-attribute product graph g1⊗3 is v11′(a1) → v33′(a5) → v42′(a1). k(g1, g2) and k(g1, g3) are the corresponding kernels.

5.4 WEIGHTED RANDOM WALK KERNEL

Defining a kernel on a space X paves the way for using kernel methods [60] for classification, regression, and clustering. To define a kernel function for super-graphs, we employ the random walk kernel principle [25]. Our Weighted Random Walk Kernel (WRWK) is based on a simple idea: given a pair of super-graphs G1 and G2, we can use them to build a product graph, where each node of the product graph contains the attributes shared by super-nodes in G1 and G2. For each node in the product graph, we can assign a weight value based on the similarity of the two super-nodes that generate it. Because the weight of a product node reflects the content shared by the corresponding nodes in both G1 and G2, we can perform random walks through these nodes and measure similarity by counting the number of matching walks (walks through nodes with intersecting attribute sets) and combining the weights of the nodes. The larger this quantity, the more similar the two super-graphs are. Adding weight values to the random walks is meaningful: it provides a way to take graph matching between super-nodes into consideration when calculating the similarity between super-graphs. We consider the similarity between any two super-nodes as the weight value instead of just using 1/0 hard matching. The node similarities between different super-nodes represent the relationship between two graphs in a more precise way, which, in turn, helps improve the classification accuracy. Accordingly, we divide our kernel design into two parts, based on the similarity of super-nodes and of super-graphs.


Figure 5-3: WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2, g3).

Graph kernels provide a natural tool to address the above challenges. Among others, one of the most powerful graph kernels is based on random walks, and it has been successfully applied to many real-world applications. There are several advantages to using a random walk kernel. First, the matching paths between two graphs take node similarities into account, as only paths with the same sequence of attributed nodes can be matched. Second, the random walk kernel also considers structural similarities, as it traverses all matching paths between two graphs without tottering. These paths contain all the structural information, and previous work has shown that they can achieve better accuracy on classification and clustering benchmarks.

5.4.1 Kernel on Single-attribute Graphs

We first introduce the weighted random walk kernel on single-attribute graphs.

DEFINITION 19 (Single-attribute Product Graph) Given two single-attribute graphs g1 = (V1, E1, Att1, f1) and g2 = (V2, E2, Att2, f2), their single-attribute product graph is denoted by g1⊗2 = (V*, E*, Att*, f*) (g⊗ for short), where

• V* = {v | v = ⟨v1, v2⟩, v1 ∈ V1, v2 ∈ V2};

• E* = {e | e = (u, v), u ∈ V*, v ∈ V*, f*(u) ≠ φ, f*(v) ≠ φ, u = ⟨u1, u2⟩, v = ⟨v1, v2⟩, (u1, v1) ∈ E1, (u2, v2) ∈ E2};

• Att* = Att1 ∪ Att2;

• f* = {f*(v) | f*(v) = f1(v1) ∩ f2(v2), v = ⟨v1, v2⟩, v1 ∈ V1, v2 ∈ V2}.

In other words, g⊗ is a single-attribute graph where a vertex v corresponds to the intersection between a pair of nodes in g1 and g2. There is an edge between a pair of vertices in g⊗ if and only if an edge exists between the corresponding vertices in g1 and in g2. An example is shown in Fig. 5-3 (g1⊗2). In the following, we show that an inherent property of the product graph is that performing a weighted random walk on the product graph is equivalent to performing simultaneous random walks on g1 and g2. The single-attribute product graph thus provides an effective way to count the number of walks, combining the weight values on nodes between graphs, without expensive graph matching.

To generate g⊗'s adjacency matrix θ⊗ from g1 and g2 by using matrix operations, we define the Attributed Product as follows:

DEFINITION 20 (Attributed Product) Given matrices B ∈ R^{n×n}, C ∈ R^{m×m} and H ∈ R^{n′×m′}, the attributed product B ⊠ C ∈ R^{nm×nm} and the column-stacking operator vec(H) ∈ R^{n′m′} are defined as

B ⊠ C = [vec(B_{*1} C_{1*}) vec(B_{*1} C_{2*}) ··· vec(B_{*n} C_{m*})],

vec(H) = [H_{*1}^T H_{*2}^T ··· H_{*m′}^T]^T,

where H_{*i} and H_{j*} denote the i-th column and the j-th row of H, respectively.

Based on Def. 20, the adjacency matrix θ⊗ of the single-attribute product graph g⊗ can be directly derived from g1(θ1, ϕ1) and g2(θ2, ϕ2) as follows:

θ⊗ = (θ1 ⊠ θ2) ⊙ vec(ϕ1 ϕ2^T)    (5.1)

where the operator ⊙ is defined, for a matrix B ∈ R^{n×n} and a vector x = [x1, ··· , xn]^T, by

(B ⊙ x)_{ij} = B_{ij} ∧ x_i,  i, j = 1, ··· , n,

and “∧” is a conjunction operation (a ∧ b = 1 iff a = 1 and b = 1).

As a result of the above process, we can use matrix operations to count random walks. More specifically, for the adjacency matrix θx of a graph gx, each element [θx^z]_{ij} of its z-th power provides the number of walks of length z from vi to vj in gx.
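A minimal numpy sketch of Eq. (5.1) is given below: the attributed product of the two adjacency matrices is realised with a Kronecker product, and vec(ϕ1 ϕ2^T) supplies the mask of product nodes whose attribute sets intersect; masking both rows and columns enforces the endpoint condition of Definition 19. Both attribute matrices are assumed to be indexed over one common attribute set, and the helper name and index conventions are illustrative.

import numpy as np

def product_adjacency(theta1, phi1, theta2, phi2):
    kron = np.kron(theta1, theta2)            # attributed product of the two adjacencies
    shared = (phi1 @ phi2.T).ravel()          # vec(phi1 phi2^T): shared-attribute counts per product node
    mask = (shared > 0).astype(float)
    # keep an edge only if both endpoint product nodes have a non-empty attribute intersection
    return kron * mask[:, None] * mask[None, :]

Summing the z-th matrix power of this adjacency then counts the length-z matching walks between g1 and g2.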

According to Eq. (5.1), performing a random walk on the single-attribute product graph g⊗ is equivalent to performing simultaneous random walks on the graphs g1 and g2. After g⊗ is generated, the Attributed Random Walk Kernel (ARWK), which computes the similarity between g1 and g2, can be defined with a sequence of weights δ = δ0, δ1, ··· (δi ∈ R and δi ≥ 0 for all i ∈ N):

K(g1, g2) = ρ Σ_{i,j=1}^{n1 n2} [ Σ_{z=1}^{∞} δz θ⊗^z ]_{ij}    (5.2)

where n1 and n2 are the node sizes of g1 and g2, respectively, and ρ is a control parameter used to make the kernel function converge. As a result, the kernel values are upper bounded (the proof is given later). To compute the ARWK for single-attribute graphs, as defined in Eq. (5.2), a diagonalization decomposition method [25] can be used. Because θ⊗ is a symmetric matrix, the diagonalization decomposition of θ⊗ exists: θ⊗ = T H T^{-1}, where the columns of T are its eigenvectors, and H is a diagonal matrix of the corresponding eigenvalues. The kernel defined in Eq. (5.2) can then be rewritten as:

K(g1, g2) = ρ Σ_{i,j=1}^{n1 n2} [ Σ_{z=1}^{∞} δz (T H^z T^{-1}) ]_{ij} = ρ Σ_{i,j=1}^{n1 n2} [ T ( Σ_{z=1}^{∞} δz H^z ) T^{-1} ]_{ij}    (5.3)

By setting δz = λ^z/z! in Eq. (5.3) and using e^x = Σ_{z=0}^{∞} x^z/z!, we have

K(g1, g2) = ρ Σ_{i,j=1}^{n1 n2} [ T (e^{λH} − I) T^{-1} ]_{ij}    (5.4)

where I is an identity matrix of the same size as θ⊗. The diagonalization decomposition can greatly expedite the ARWK kernel computation. Here we set

ρ = 1 / ( c (e^{λ(c−1)} − 1) ),  where c = Σ_{i=1}^{n1} Σ_{j=1}^{n2} [ϕ1 ϕ2^T]_{ij}    (5.5)

Then we have the following theorem.

THEOREM 2 Given any two single-attribute graphs g1 and g2, the attributed random walk kernel of these two graphs is bounded by 0 ≤ K(g1,g2) ≤ 1.

The proposed ARWK is different from the traditional random walk kernel, which simply counts the total number of shared random walks between two graphs; that count can grow without bound as the number of nodes/edges increases, so the similarity based on the traditional random walk kernel is not bounded. In comparison, ARWK first finds all shared nodes between the two graphs (where a shared node means a node with the same attribute in both graphs), and then calculates the ratio between the number of random walks among the shared nodes of the two graphs and the number of random walks on the complete graph formed by all shared nodes (the latter is given by ρ^{-1} in Eq. (5.5), as proved in Theorem 4). We call this ratio the Structural Integrity Ratio, which is proved to be bounded. This not only provides a bounded measure for similarity assessment, but also provides an effective way to combine attribute and structure information to assess the similarity between two graphs based on the shared node information. We also refer to the proposed ARWK as the Weighted Random Walk Kernel (WRWK) for single-attribute graphs. An example of the WRWK kernel is shown in Fig. 5-3.
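The kernel of Eqs. (5.4)-(5.5) can be evaluated with a matrix exponential; the sketch below uses scipy.linalg.expm in place of the explicit eigendecomposition and reuses the product_adjacency helper sketched above. Parameter names are illustrative.

import numpy as np
from scipy.linalg import expm

def arwk(theta1, phi1, theta2, phi2, lam=1.0):
    theta_prod = product_adjacency(theta1, phi1, theta2, phi2)
    c = float((phi1 @ phi2.T).sum())                      # c in Eq. (5.5)
    if c <= 1:                                            # degenerate case: rho is undefined
        return 0.0
    rho = 1.0 / (c * (np.exp(lam * (c - 1)) - 1))         # Eq. (5.5)
    walk_sum = expm(lam * theta_prod) - np.eye(theta_prod.shape[0])
    return rho * walk_sum.sum()                           # Eq. (5.4)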

5.4.2 Kernel on Super-Graphs

The WRWK of single-attribute graphs calculates the similarity between two graphs, so it can be used to calculate the similarity between two super-nodes. Given a pair of super-graphs G1 and G2, assume we generate a new product graph G⊗ whose nodes are generated from the super-nodes in G1 and G2, and the weight value of each node is the similarity between the super-nodes that generate it. Then the same process shown in Section 5.4.1 can be used to calculate the WRWK between the super-graphs G1 and G2, which denotes their similarity.

DEFINITION 21 (Super Product Graph) Given two super-graphs G1 = (V1, E1, G1, F1) and G2 = (V2, E2, G2, F2), their Super Product Graph is denoted by G1⊗2 = (V*, E*, G*, F*) (G⊗ for short), where

• V* = {V | V = ⟨V1, V2⟩, V1 ∈ V1, V2 ∈ V2};

• E* = {e | e = (V, V′), V ∈ V*, V′ ∈ V*, F*(V) ≠ φ, F*(V′) ≠ φ, V = ⟨V1, V2⟩, V′ = ⟨V1′, V2′⟩, (V1, V1′) ∈ E1, (V2, V2′) ∈ E2};

• G* = G1 ∪ G2;

• F* = {F*(V) | F*(V) = F1(V1) ∩ F2(V2), V = ⟨V1, V2⟩, V1 ∈ V1, V2 ∈ V2}.

An example of a super product graph is shown in Fig. 5-3 (G⊗). Similar to Eq. (5.1), the adjacency matrix Θ⊗ of the super product graph G⊗ can be directly derived from G1(Θ1, Φ1) and G2(Θ2, Φ2) as follows:

Θ⊗ = (Θ1 ⊠ Θ2) ⊙ vec(Φ1 Φ2^T)    (5.6)

Because we use the similarity between two super-nodes as the weight value of a node in the super product graph, the kernel value may increase without bound as the size of the super-graph increases. For the super-graph kernel we therefore add a control variable η to limit the range of the kernel value. The Weighted Random Walk Kernel, which computes the similarity between G1 and G2, can then be defined with a sequence of weights σ = σ0, σ1, ··· (σi ∈ R and σi ≥ 0 for all i ∈ N):

K(G1, G2) = η Σ_{i,j=1}^{N1 N2} [ Σ_{z=0}^{∞} σz (w w^T ◦ Θ⊗)^z ]_{ij}    (5.7)

where N1 and N2 are the node sizes of G1 and G2, respectively, “◦” denotes the element-wise (Hadamard) product, and

w = [k(g1, g1′) k(g1, g2′) ··· k(g_{N1}, g′_{N2})]^T

is the vector of single-attribute WRWK values between the super-nodes of G1 and those of G2.

Similar to the WRWK of single-attribute graphs in Eq. (5.4), the WRWK on super-graphs can be calculated by setting σz = γ^z/z!, which gives

K(G1, G2) = η Σ_{i,j=1}^{N1 N2} [ T e^{γH} T^{-1} ]_{ij}    (5.8)

where w w^T ◦ Θ⊗ = T H T^{-1}. To ensure the convergence of the kernel function, we set

η = (1/(N1 N2)) · exp( (1 − N1² N2²) γ e^{2λ} / (N1 N2) )    (5.9)

where γ is the parameter of the WRWK on super-graphs, and λ is the parameter of the WRWK on single-attribute graphs given in the previous subsection. The WRWK of super-graphs is also upper bounded, with the proof given in Section 5.6.

5.5 SUPER-GRAPH CLASSIFICATION

The WRWK provides an effective way to measure the similarity between super-graphs. Given a number of labeled super-graphs, we can use their pair-wise similarities to form a kernel matrix. Generic classifiers, such as Support Vector Machines (SVM), Decision Trees (DT), Naive Bayes (NB) and Nearest Neighbour (NN), can then be applied to the kernel matrix for super-graph classification. Algorithm 4 shows the framework for training a classifier with WRWK.

Algorithm 4 WRWK Classifier Generation
Input: Labeled super-graph set DL, kernel parameters λ and γ
Output: Classifier ζ
Initialize: Kernel matrix M ← []
1: for any two super-graphs Ga and Gb in DL do
2:   w ← []
3:   for any gx in Ga and gy in Gb do
4:     q ← (x − 1)Nb + y  // q is the element index of w
5:     if Attx ∩ Atty = φ then
6:       wq ← 0
7:     else
8:       ρ ← 1/(nx ny), w̄ ← [1/(nx ny), ··· , 1/(nx ny)], θ⊗ ← (θx ⊠ θy) ⊙ vec(ϕx ϕy^T)
9:       wq ← k(gx, gy)  // k(gx, gy) is calculated by Eq. (5.4)
10:    end if
11:  end for
12:  η ← (1/(Na Nb)) exp((1 − Na² Nb²) γ e^{2λ} / (Na Nb)), Θ⊗ ← (Θa ⊠ Θb) ⊙ vec(Φa Φb^T)
13:  Mab ← K(Ga, Gb)  // K(Ga, Gb) is calculated by Eq. (5.8)
14: end for
15: ζ ← TrainClassifier(C, M, L̄)  /* C is the learning algorithm; L̄ is the class label vector of DL */
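In practice, Algorithm 4 amounts to filling a Gram matrix with pair-wise WRWK values and handing it to a kernel classifier. A brief sketch using scikit-learn's precomputed-kernel SVM (standing in for TrainClassifier) is shown below, where wrwk(Ga, Gb) is assumed to implement the kernel of Eq. (5.8).

import numpy as np
from sklearn.svm import SVC

def train_wrwk_classifier(supergraphs, labels, wrwk):
    n = len(supergraphs)
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(a, n):                       # the kernel matrix is symmetric
            M[a, b] = M[b, a] = wrwk(supergraphs[a], supergraphs[b])
    clf = SVC(kernel="precomputed").fit(M, labels)  # plays the role of TrainClassifier
    return clf, M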

5.6 THEORETICAL STUDY

In this section, we derive the theoretical basis that shows the rationality of WRWK.

THEOREM 3 The Weighted Random Walk Kernel function is positive definite.

Proof 1 As the random walk-based kernel is closed under products [25] and the WRWK can be written as the limit of a polynomial series with positive coefficients (as Eqs. (5.4) and (5.8)), the WRWK function is positive definite.

THEOREM 4 Given any two single-attribute graphs g1 and g2, the weighted random walk kernel of these two graphs is bounded by 0 < k(g1, g2) ≤ e^λ.

Proof 2 Because k(g1, g2) is a positive definite kernel by Theorem 3, k(g1, g2) > 0. We therefore only need to show that e^λ is an upper bound of k(g1, g2). Based on the definition of the WRWK, assume the node sizes of the two single-attribute graphs g1 and g2 are n1 and n2, respectively; the number of random walks on g1 ⊗ g2 cannot be greater than that on the complete graph g^c with n1 × n2 nodes. We therefore consider the worst case g^c = g1 ⊗ g2, where g1 and g2 have n1 and n2 nodes, respectively. So we have

k(g1, g2) = ρ Σ_{i,j=1}^{n1 n2} [ Σ_{z=0}^{∞} δz (w̄ w̄^T ◦ θ⊗)^z ]_{ij} = ρ Σ_{z=0}^{∞} { δz Σ_{i,j=1}^{n1 n2} [ (w̄ w̄^T ◦ θ⊗)^z ]_{ij} }.

Because ρ = 1/(n1 n2), w̄ = [1/(n1 n2), 1/(n1 n2), ··· , 1/(n1 n2)], and g^c is a complete graph, the diagonal elements of its adjacency matrix are equal to 0 and all other elements are equal to 1. So

Σ_{i,j=1}^{n1 n2} [ (w̄ w̄^T ◦ θ⊗)^z ]_{ij} = ( (n1 n2 − 1)/(n1² n2²) )^{(z−1)} (1 − 1/(n1 n2)).

Then

k(g1, g2) = (1/(n1 n2)) Σ_{z=0}^{∞} [ δz ( (n1 n2 − 1)/(n1² n2²) )^{(z−1)} (1 − 1/(n1 n2)) ]
         = (1/(n1 n2)) (1 − 1/(n1 n2)) Σ_{z=0}^{∞} [ δz ( (n1 n2 − 1)/(n1² n2²) )^{(z−1)} ].

Because δz = λ^z / z!, we have

k(g1, g2) = (1/(n1 n2)) (1 − 1/(n1 n2)) Σ_{z=0}^{∞} [ (λ^z / z!) ( (n1 n2 − 1)/(n1² n2²) )^{(z−1)} ]
         = (1/(n1 n2)) (1 − 1/(n1 n2)) ( n1² n2² / (n1 n2 − 1) ) Σ_{z=0}^{∞} [ (λ^z / z!) ( (n1 n2 − 1)/(n1² n2²) )^{z} ].

Because e^x = Σ_{z=0}^{∞} x^z / z!, we have

k(g1, g2) = e^{λ (n1 n2 − 1)/(n1² n2²)} ≤ e^λ.

THEOREM 5 The weighted random walk kernel between super-graphs G1 and G2 is bounded by 0 < K(G1, G2) ≤ e^{γ e^{2λ} − 2λ}.

Similar to Theorem 4, Theorem 5 can be derived.

5.7 EXPERIMENTS AND ANALYSIS

In this section we first discuss the benchmark data and the experimental settings, and then report detailed experimental results and analysis.

5.7.1 Benchmark Data

We carry out experimental studies on real-world datasets from two different domains: DBLP scientific publication dataset and Beer Review dataset.

• DBLP Dataset: The DBLP dataset consists of bibliography data in computer science1. Each record in DBLP is a scientific publication with a number of attributes such as abstract, authors, year, venue, title, and references. To build super-graphs, we select papers published in Artificial Intelligence (AI: IJCAI, AAAI, NIPS, UAI, COLT, ACL, KR, ICML, ECML and IJCNN) and Computer Vision (CV: ICCV, CVPR, ECCV, ICPR, ICIP, ACM Multimedia and ICME) venues to form a classification task. The goal is to predict which field (AI or CV) a paper (i.e. a super-graph) belongs to by using the abstract of the paper (i.e. a super-node) and the abstracts of its references (i.e. other super-nodes), as shown in Fig. 5-4, where the left figure shows that each paper cites a number of references. For each paper (or each reference), its abstract can be converted into a graph, so each paper denotes a super-node, and the citation relationships between papers form a super-graph. The sub-figure of Fig. 5-4 to the right shows the graph representation of a paper abstract, with each node denoting a keyword: (A) the weight values between nodes indicate correlations between keywords; (B) by using a proper threshold, we convert each abstract into an undirected, unweighted graph. An (undirected) edge between two super-nodes indicates a citation relationship between two papers. For each paper, we use the fuzzy cognitive map (E-FCM) [50] to convert the paper abstract into a graph, which represents the relations between keywords whose weights are over a threshold (i.e. the edge-cutting threshold). This graph representation has shown better performance than the simple bag-of-words representation [4]. We select 1000 papers (500 in each class), each of which contains 1 to 10 references, to form 1000 super-graphs.

1http://arnetminer.org/citation/


Figure 5-4: An example of using super-graph representation for scientific publications.


• Beer Review Dataset: The online beer review dataset consists of review data for beers2. Each review in the dataset is associated with attributes such as the appearance score, aroma score, palate score, and taste score (product ratings from 1 to 5), as well as detailed review text. Our goal is to classify each beer into the right style (Ale vs. Not Ale) by using customer reviews. The graph representation for reviews is similar to the sub-graphs in the DBLP dataset: each review is represented as a super-node. The edges between super-nodes are built as follows. Because a review has four rating scores (appearance, aroma, palate, and taste), we use these four scores as a feature vector for each review. If the distance between two reviews in this feature space (Euclidean distance) is less than 2, an edge is used to link the two reviews (i.e. super-nodes). We choose 1000 beer products, half from the Ale style and the rest from the Lager and Hybrid styles, to form 1000 super-graphs for classification.

2http://beeradvocate.com/


Figure 5-5: Super-graph and comparison graph representations.


5.7.2 Experimental Settings

Baseline Methods: Because no existing method can handle super-graph classification, for comparison purposes we use two approaches to generate traditional graph representations from each super-graph: (1) randomly selecting one attribute (i.e. one word) in each super-node (and ignoring all other words) to form a single-attribute graph, as shown in Fig. 5-5 (C). We repeat the random selection five times, each time generating a set of single-attribute graphs from the super-graphs. Each single-attribute graph set is used to train a classifier, and their majority-voted accuracy on test super-graphs is reported in the experiments. (2) Using the attribute set of the single-attribute graph in each super-node as the multiple attributes of that super-node to generate an attributed graph (which is equivalent to removing all the edges inside the super-nodes, as shown in Fig. 5-5 (B)). For both graph representations, we use the traditional random walk kernel (RWK) to measure the similarity between any two graphs [25]. We use the classification accuracy of 10 times 10-fold cross-validation to measure and compare the algorithm performance. To train classifiers from the graph data, we use Naive Bayes (NB), Decision Tree (DT), Support Vector Machines (SVM) and the Nearest Neighbor algorithm (NN). Most experiments are based on the kernel parameter settings λ = 100 and γ = 100. All experiments are conducted on a machine with 4GB RAM and an Intel Core(TM) i5 3.10 GHz CPU.

5.7.3 Results and Analysis

Performance on standard benchmarks: In Fig. 5-6, we report the classification accuracy on the benchmark datasets. For the DBLP dataset, the number of super-nodes in each super-graph varies within 1-10 and the edge-cutting threshold for the single-attribute graphs is 0.001. For the Beer Review dataset, the number of super-nodes in each super-graph varies within 30-60 and the edge-cutting threshold is 0.001. WRWK uses the super-graph representation (Fig. 5-5 (A)); RWK1 and RWK2 use the traditional random walk kernel with the attributed graph representation (Fig. 5-5 (B)) and the single-attribute graph representation (Fig. 5-5 (C)), respectively. The experimental results show that WRWK consistently outperforms the traditional RWK method, regardless of the type of graph representation used by RWK. This is mainly because in traditional graph representations each node has only one attribute or a set of independent attributes, and single-attribute nodes (or multiple independent attributes) cannot precisely describe the node content. For super-graphs, the graph associated with each super-node provides an effective way to describe the node content. In the WRWK method, we use the similarity between any two super-nodes as the weight value instead of just using 1/0 hard matching to indicate whether there is an intersection of attribute sets between two nodes. The soft-matching node similarities between super-nodes capture the relationship between two graphs in a more precise way, which, in turn, helps improve the classification accuracy. Moreover, from Fig. 5-6, we can see that the accuracy increases with the increasing information content of the nodes.

Performance under different super-graph structures: To demonstrate the performance of our WRWK method on super-graphs with different characteristics, we construct super-graphs on the Beer Review dataset by using different super-node sizes and different structures of the single-attribute graphs (the structure of each single-attribute node is controlled by the edge-cutting threshold), as shown in Fig. 5-7 (a)-(d). The figures correspond to three types of super-graph structure on Beer Review. Data 1: the number of super-nodes in each super-graph is 2-30 and the edge-cutting threshold is 0.00001. Data 2: the number of super-nodes in each super-graph is 30-60 and the edge-cutting threshold is 0.001. Data 3: the number of super-nodes in each super-graph is > 60 and the threshold is 0.1. The results show that our WRWK method is stable on super-graphs with different structures.


Figure 5-6: Classification accuracy on DBLP and Beer Review datasets w.r.t. different classification methods (NB, DT, SVM, and NN).

Performance w.r.t. changes in super-nodes and walks: The proposed weighted random walk kernel relies on the similarity between super-nodes and the common walks between two super-graphs to calculate the graph similarity. This raises the question of whether super-node similarity or walk similarity (or both) plays a more important role in assessing the super-graph similarity. To resolve this question, we design the following experiments. In the first set of experiments (Fig. 5-8 (a) and (b)), we fix the super-graph edges and change the edges inside the super-nodes, which affects the super-node similarities. If this results in significant changes, it means that super-node similarity plays a more important role. In the second set of experiments (Fig. 5-8 (c)), we fix the super-nodes but vary the edges in the super-graphs (by randomly removing edges), which affects the common walks between super-graphs. If this results in significant changes, it means that the walks play a more important role than the super-nodes. Fig. 5-8 (a) and (b) report the algorithm performance with respect to the edge-cutting threshold. The accuracy decreases dramatically with the increase of the edge-cutting threshold on both datasets. When the edge-cutting threshold is set to 0.0001 on DBLP and 0.00001 on Beer Review, the single-attribute graph in each super-node is almost a complete graph and all four methods achieve their highest classification accuracy. When the threshold is set to 0.01 on DBLP and 0.1 on Beer Review, the single-attribute graph in each super-node is very small and contains very few edges. As a result, the accuracies are only around 65%. This demonstrates that WRWK relies heavily on the structure information of each super-node to assess the super-graph similarities.


Figure 5-7: Classification accuracy on the Beer Review dataset w.r.t. different datasets and classification methods (NB, DT, SVM, and NN).

Fig. 5-8 (c) reports the algorithm performance when fixing the super-nodes but gradually removing edges in the super-graphs of the Beer Review dataset (as in Fig. 5-6 (b)). In (a) and (b), the node size of each super-graph varies from 0-10 on the DBLP dataset and 30-60 on the Beer Review dataset. The x-axis shows the value of the edge-cutting threshold, and the average node degree of the corresponding single-attribute graph is reported in parentheses. In (c), the average number of edges cut from each super-graph varies from 0 to 100. Similar to the results in Fig. 5-8 (a) and (b), the classification accuracy decreases when edges are continuously removed from the super-graph (even if the super-nodes are fixed). From Fig. 5-8, we find that the graph structure of the super-graph is as important as that of the super-nodes. This is mainly attributed to the fact that our weighted random walk kernel relies on both super-node similarity and walks in the super-graph to calculate graph similarities.


Figure 5-8: The performance w.r.t. different edge-cutting thresholds on the DBLP and Beer Review datasets using the WRWK method.

Chapter 6

Streaming Network Node Classification

6.1 INTRODUCTION

Recent years have witnessed an increasing number of applications involving networked data, where instances are not only characterized by their content but are also subject to dependency relationships. For example, each node in a social network can denote one person, and links between nodes can represent their social interactions. For each node in the network, its content can be described using features, such as a bag-of-words representation of the user's posts. The mixed node content and structure information raise many unique data mining tasks, such as network node classification [3], where the goal is to combine node content and network topology structures to classify unlabeled nodes in the network with maximal accuracy. Applications of network node classification include social spammer detection [70], automatic email categorization [52], inferring personality from social network structures [68], and image classification using social networks [51]. When classifying nodes in networks, existing methods can be roughly categorized into three groups: (1) combining content and structure features into a new feature vector representation, such as iterative collective classification [61] and link-based classification [49]; (2) using network paths, such as random walks [18], to determine node labels; and (3) using content information to build additional structure nodes and generate a new topology network for classification [2]. The theme of all these methods is to leverage node content and topology structures to infer correct labels for unlabeled nodes.


Figure 6-1: An example of streaming networks, where each color bar denotes a feature.

Existing node classification methods are carried out in a static network setting, without taking evolving network structures and node concepts into consideration. In reality, changes are essential components of real-world networks, mainly because user participation, interactions, and responses to external factors continuously introduce new nodes and edges to the network. In addition, a user may add/delete/modify online posts, which naturally results in changes in the node content. As a result, such networks are inherently dynamic. In this thesis, we refer to this type of network, where the network structures and node content are continuously changing, as Streaming Networks. An example of streaming networks is shown in Fig. 6-1, where the topology structures and the node feature distributions are constantly changing with time. Specifically, at time point $t_2$, new nodes (e.g., 4 and 5) and

relevant edges join the network; at $t_3$, Node 5 and the edge between Nodes 1 and 2 are removed. Over the whole period, node content may continuously change (e.g., the content change in Node 3). Accurate node classification in a streaming network setting is therefore much more challenging than in static networks: existing node classification methods for static networks cannot capture the changing information, which may lead to different classification results, while rerunning such methods from scratch at each time point is time-consuming. In summary, node classification in streaming networks has at least three major challenges:

• Streaming network structures: Network topology structures encode rich information about node interactions inside the network, which should be considered for node classification. In streaming networks, topology structures are constantly changing, so

node classification needs to rapidly capture and adapt to such changes for maximal accuracy gain.

• Streaming node features: For each node in a streaming network, its content may constantly evolve (w.r.t. user posts or profile updates). As a result, the feature space used to denote the node content is dynamically changing, resulting in streaming features [78] with an infinite feature space. To capture changes, a feature selection method should select the most effective features in a timely manner to ensure that node classification can quickly adapt to the new network.

• Continuously Changing Network Volumes: Because the node volumes of streaming networks are continuously changing, node classification needs to scale with the dynamically increasing network volume by utilizing models discovered from historical data to boost the learning on new data.

For streaming networks, changes are introduced through two major channels: (1) node content; and (2) topology structures. To achieve a high node classification accuracy, a fundamental issue is how to properly characterize such changes. In this thesis, we propose to address this issue using a feature-driven framework, which uses node content features to model and capture network changes for classification. Fig. 6-2 demonstrates how node features can be used to characterize changes in the network. More specifically, nodes and edges with solid lines denote the network observed at time point $t$, while dashed circles and edges denote networked data arriving at $t+1$. Nodes and edges with curved lines are removed at $t+1$, and the underlined features (keywords) are also removed at $t+1$. Nodes are colored based on their class labels, and white nodes are unlabeled nodes. At time point $t$, the keywords "System" and "Network" are selected as features to represent the node content (assuming the number of features is limited to 2) and classify nodes into two classes (red vs. cyan). At $t+1$, the network changes, incurring new structures and node content. By updating the selected features and using "Spectrum" to replace "System", the new feature space {"Network", "Spectrum"} can effectively classify the unlabeled nodes into the right classes. The above observations motivate the proposed research that uses features to capture changes in streaming networks for node classification.


Figure 6-2: An example of using feature selection to capture changes in a streaming network (keywords inside each node denote node content).

When a network is experiencing changes in the topology structures and node content, we can try to identify a set of important features that can best reveal such changed network structures and node content. Because, in a networked world, nodes close to each other in the network topology structure space tend to share common content information [26], we can use the selected features to design a "similarity gauging" procedure to assess the consistency of the network node content and structures in order to determine the labels of unlabeled nodes. A smaller gauging value indicates that the node content and structures have a better alignment with the node label. So the gauging-based classification is carried out such that, for an unlabeled node, its label is the class that results in the minimal gauging value with respect to the identified features. By updating the selected features, the node classification can automatically adapt to the changes in the streaming network for maximal accuracy gain. The main contribution of this chapter, compared to existing works, is twofold:

• Streaming Network Node Classification: We propose a new streaming network node classification (SNOC) method that takes node content and structure similarity into consideration to find important features to model changes in the network for node classification. This method is not only more accurate than existing node classification approaches, but is also effective in capturing changes in networks for node classification.

• Streaming Network Feature Selection: We introduce a novel streaming network

feature selection framework, SNF, for streaming networks. To ensure that feature evaluation can adapt to the changes in the network in a timely manner, SNF can incrementally update the evaluation score of an existing feature by accumulating changes in the network. This allows our method to effectively handle streaming networks with changing feature spaces and feature distributions for better runtime and performance gains.

The remainder of the chapter is structured as follows. The problem definition and the overall framework are introduced in Sect. 6.2. Sect. 6.3 introduces the proposed node classification method, followed by experiments in Sect. 6.4.

6.2 PROBLEM DEFINITION AND FRAMEWORK

A streaming network contains a dynamic number of nodes and edges, and the node content may also change in terms of new features or new feature values. At a given time point $t$, the network nodes are denoted by $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n_t}$, where $x_i \in \mathbb{R}^{d_t}$ is the original feature vector of node $i$, and $y_i \in \mathcal{Y} = \{0, 1, 2, \ldots, c\}$ is the label of node $i$. $n_t$ and $d_t$ denote the number of nodes and the dimensionality of the node feature space at time point $t$, both of which may vary with time. Specifically, $y_i = 0$ means that node $i$ is unlabeled. $A \in \mathbb{R}^{n_t \times n_t}$ is the adjacency matrix of the networked data, where $A_{ij} = 1$ if there is an edge (link) between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. A path $P_{ij}$ between nodes $i$ and $j$ is a sequence of edges starting at $i$ and ending at $j$. The length of a path is the number of edges on it. For each adjacency matrix $A$, the element $[A^k]_{ij}$ of the $k$-th power matrix denotes the number of length-$k$ paths from $i$ to $j$ in the network [25].

To represent network node content, we use $\mathcal{F} = \{f^1, \ldots, f^r, \ldots, f^{d_t}\}$ to denote the node feature space at time point $t$, where the feature dimension $d_t$ may change dramatically with time $t$. We use $X = [x_1, x_2, \ldots, x_{n_t}] = [\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^{d_t}]^{\top} \in \mathbb{R}^{d_t \times n_t}$ to represent the data matrix, and $C \in \mathbb{R}^{n_t \times n_t}$ represents the label relationship matrix of the networked data, where $C_{ij} = 1$ means nodes $i$ and $j$ are in the same class, and $C_{ij} = 0$ otherwise. We use $f^r$ to denote a feature, and the bold-faced $\mathbf{f}^r$ to represent the indicator vector of feature $f^r$, where $[\mathbf{f}^r]_j$ records the actual value of feature $f^r$ in node $j$. In a binary feature representation (such as the bag-of-words for text), we have $[\mathbf{f}^r]_j = 1$ if feature $f^r$ appears in node $j$, and $[\mathbf{f}^r]_j = 0$ otherwise. Obviously, $\mathbf{f}^r$ helps capture the distribution of feature $f^r$ in the network.

Streaming network node classification aims to classify unlabeled nodes in the network, at any time point $t$, with maximal accuracy. To capture the dynamic changes of the network, we propose to use feature selection to timely discover a feature subset $\mathcal{S}$ of size $m$ from $\mathcal{F}$. When discovering the feature set $\mathcal{S}$, both the node content and the network structures are combined to find the most informative features at each single time point $t$. As a result, the node classification can adapt to the changes in the network to achieve maximal accuracy.

Fig. 6-3 shows the framework of our streaming network node classification method (SNOC). More specifically, Panel A: at time point $t$, the network is denoted by nodes and edges with solid lines. Dashed nodes and edges denote new nodes and edges arriving at time point $t+1$. Colour bars in nodes denote different features, and curved bars mean that a feature appears at $t$ but is removed at $t+1$, such as the purple bar in Node 3. Nodes and edges with curved lines exist at $t$ but are removed at $t+1$, like Node 4. Panel B: at time point $t$, candidate features and selected features are identified based on the F-Score $q_t(f^r)$. At time point $t+1$, streaming network feature selection (SNF) updates the scores of old features (candidate features at $t$), and also calculates feature scores for new features. Panel C: at time point $t+1$, SNOC uses the selected features as a gauge to test whether to classify an unlabeled Node 5 as positive (+) or negative (-). The label with the smallest gauging value is used to label Node 5.

To capture changes in streaming networks, an incremental streaming network feature selection method, SNF, is proposed to timely discover a set of most informative features in the network. To classify unlabeled nodes, SNOC takes both label similarity and structure similarity into consideration and uses a quality criterion to find the most suitable label for an unlabeled node.
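To make the above notation concrete, the following minimal sketch (Python with NumPy; all variable names are illustrative and not part of the thesis) builds the main objects defined above for a toy snapshot: the data matrix whose rows are feature indicator vectors, the label vector, the adjacency matrix, and the label relationship matrix.

import numpy as np

# Toy snapshot at a single time point t: 4 nodes, 3 binary content features.
# Rows of X are feature indicator vectors f^r, columns are nodes (d_t x n_t).
X = np.array([[1, 1, 0, 0],   # feature f^1
              [0, 1, 1, 0],   # feature f^2
              [0, 0, 1, 1]])  # feature f^3

# Node labels: 0 marks an unlabeled node, positive integers are classes.
y = np.array([1, 1, 2, 0])

# Symmetric adjacency matrix A without self-loops.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# Label relationship matrix C: C_ij = 1 when nodes i and j share a class.
n_t = len(y)
C = np.zeros((n_t, n_t), dtype=int)
for i in range(n_t):
    for j in range(n_t):
        if y[i] != 0 and y[i] == y[j]:
            C[i, j] = 1

f2 = X[1, :]   # indicator vector of feature f^2 over all nodes

These objects are reused in the later sketches for the path-weight matrix, the F-Score, and the gauging-based classification criterion.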

6.3 THE PROPOSED METHOD

To classify unlabeled nodes in a streaming network, our theme is to let (1) nodes sharing the same class and having a high structure similarity be close to each other, and (2) nodes belonging to different classes and having a weak structure relationship be far away from each other.


Figure 6-3: The framework of the proposed streaming network node classification (SNOC) method.

This is motivated by the commonly observed phenomenon [26] that nodes close to each other in network topology structures tend to share common content information. For example, friends in the same cohort are more likely to share similar experiences or interests, and a paper and its references often cover related research subjects/topics. Our proposed theme is also consistent with relational collective inference modeling [34], which uses the relationships between class labels and the attributes of neighboring objects in the network for classification. Following the above theme, we can regard streaming network node classification as an optimization problem that tries to find the optimal assignment of class labels to the unlabeled node set $\mathcal{X}^u$, such that the assigned class labels $\mathcal{Y}^u \subseteq \mathcal{Y}$ make the whole network maximally comply with the proposed theme, as defined in Eq. (6.1).

$$\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u) \qquad (6.1)$$

where $\mathcal{Y}^u$ is an assignment of labels to the unlabeled nodes in the network, and $\mathcal{Y}^{u*}$ is the optimal assignment, which results in the minimal utility score $E(\mathcal{Y}^u)$. Following the node classification objective function in Eq. (6.1), the key question is how to properly define the utility function $E(\cdot)$. Clearly, the node content provides valuable information to determine the label of each node, so we should define $E(\cdot)$ based on the node content Feature Space. In streaming networks, the feature space used to denote the node content is continuously changing with new features or updated feature values. Using all features to represent the network is clearly suboptimal. If a set of good features

can be found to capture changes in the network, the node classification in Eq. (6.1) will automatically adapt to the changes in the network for maximal accuracy. So Eq. (6.1) is rewritten as
$$\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u, \mathcal{S}) \qquad (6.2)$$

where $\mathcal{S}$ is the selected feature set used to capture changes in a streaming network. Because the utility function $E(\mathcal{Y}^u, \mathcal{S})$ is constrained by the selected features $\mathcal{S}$, finding the optimal $\mathcal{S}$ becomes the next challenge. Obviously, a good $\mathcal{S}$ should properly capture network node relationships in terms of the node content, node labels, and node topology structures. That means the node relationships in the Feature Space should follow our proposed theme, with the node similarity being impacted by (1) the label-based similarity in the Label Space; and (2) the structure-based similarity in the Structure Space.

Accordingly, the node classification in Eq. (6.2) can be divided into two major steps: (1) finding an optimal feature set $\mathcal{S}$; and (2) finding an optimal assignment of labels to the unlabeled nodes such that the utility score $E(\mathcal{Y}^u, \mathcal{S})$, calculated based on the selected features $\mathcal{S}$, has the minimal value. Therefore, we derive an updated evaluation criterion $E(\mathcal{Y}^u, \mathcal{S})$ as follows:
$$E(\mathcal{Y}^u, \mathcal{S}) = \frac{1}{2} \sum_{i \in \mathcal{X}^u} \sum_{j \in \mathcal{X}} h(i, j, y_i) (\mathcal{D}_{\mathcal{S}} x_i - \mathcal{D}_{\mathcal{S}} x_j)^2 \qquad (6.3)$$
$$\text{s.t.} \quad \min\Big( \frac{1}{2} \sum_{i, j \in \mathcal{X}} h(i, j) (\mathcal{D}_{\mathcal{S}} x_i - \mathcal{D}_{\mathcal{S}} x_j)^2 \Big), \quad \mathcal{S} \subseteq \mathcal{F}, \ |\mathcal{S}| = m

where $h(i, j)$ is the similarity between nodes $i$ and $j$ in the network structure space, which will be formally defined in Eq. (6.8), and $h(i, j, y_i)$ is the similarity between nodes $i$ and $j$ conditioned on setting the label of the unlabeled node $i$ to $y_i$. In Eq. (6.3), $(\mathcal{D}_{\mathcal{S}} x_i - \mathcal{D}_{\mathcal{S}} x_j)^2$ measures the feature-based distance between nodes $i$ and $j$ w.r.t. the currently selected features $\mathcal{S}$. $\mathcal{D}_{\mathcal{S}}$ is a diagonal matrix indicating the features that are selected from $\mathcal{F}$ into the feature set $\mathcal{S}$, where
$$[\mathcal{D}_{\mathcal{S}}]_{ij} = \begin{cases} 1, & \text{if } i = j \text{ and } f^i \in \mathcal{S}; \\ 0, & \text{otherwise.} \end{cases}$$
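To illustrate how the selection matrix $\mathcal{D}_{\mathcal{S}}$ and the pairwise similarity weights enter Eq. (6.3), the sketch below (Python/NumPy; function names are ours) evaluates the weighted feature-distance term by a direct transcription of the double sum, reading the squared difference of the projected feature vectors as a squared Euclidean norm. It is meant only to fix the notation, not to be an efficient implementation.

import numpy as np

def selection_matrix(d, selected):
    # Diagonal D_S with 1 at the indices of the selected features, 0 elsewhere.
    D_S = np.zeros((d, d))
    D_S[selected, selected] = 1.0
    return D_S

def weighted_feature_distance(X, H, selected):
    # (1/2) * sum_{i,j} H[i, j] * ||D_S x_i - D_S x_j||^2, cf. Eq. (6.3).
    # X: d x n data matrix (columns are nodes); H: n x n similarity weights.
    D_S = selection_matrix(X.shape[0], selected)
    Z = D_S @ X                     # keep only the selected feature rows
    n = X.shape[1]
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = Z[:, i] - Z[:, j]
            total += H[i, j] * float(diff @ diff)
    return 0.5 * total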

In Eq. (6.3), we use the network structure similarity $h(i, j, y_i)$ as the weight of the node feature distance $(\mathcal{D}_{\mathcal{S}} x_i - \mathcal{D}_{\mathcal{S}} x_j)^2$. If nodes $i$ and $j$ have a high structure similarity, their feature distance will have a large weight and therefore plays a more important role in the objective function. In an extreme case, if nodes $i$ and $j$ have zero structure similarity, their feature distance will not have any impact on the objective function at all. By doing so, we can effectively combine node structure similarity and node content distance to assess the consistency of the whole network. For streaming networks, the selected feature set $\mathcal{S}$ should be dynamically updated to capture changes in the network. By using the dynamic feature set $\mathcal{S}$ to guide the node classification, Eq. (6.3) provides an efficient way to classify nodes in dynamic networks. This is mainly because any significant changes in the network will be captured by $\mathcal{S}$, and by using $\mathcal{S}$ as the gauge for node classification, our method can automatically adapt to the changes in the network structures and node content.

The solution to the objective function in Eq. (6.3) requires optimization over both variables ($\mathcal{D}_{\mathcal{S}}$ and $\mathcal{Y}^u$). To solve Eq. (6.3), we divide the process into two parts: (1) we propose a novel streaming network feature selection framework, SNF, which takes both network structures and node labels into consideration to find the optimal feature set $\mathcal{S}$; and (2) we propose a Laplacian-based quality criterion to grade an unlabeled node with respect to different labels by using $\mathcal{S}$ as the gauge. Finally, the node classification is achieved by finding the labels that result in the minimal gauging values.

6.3.1 Streaming Network Feature Selection

Given a streaming network, the network observed at a single time point t can be considered as a static network. In this subsection, we first introduce feature selection on a static network, and then extend it to streaming networks.

Feature Selection on a Static Network

We first define feature selection as an optimization problem. Our target is to find an optimal set of features, which can best represent the network node content and structures. The network edges and node labels both play important, yet different, roles for node

classification. We assume that the optimal feature set should have the following properties:

• Label Similarity: a) labeled nodes in the same class should be close to each other, and labeled nodes in different classes should be far away from each other; b) unlabeled nodes should be separated from each other.

• Structure Similarity: The structure similarity between nodes i and j is closely tied to the number of paths and the path lengths between them. The more length-$l$ paths there are between i and j, the higher their structure similarity; the shorter the path between two nodes, the higher their structure similarity.

Note that Item b) in the first bullet incorporates the distributions of unlabeled nodes and tends to select features that can separate nodes from each other. It is similar to the assumption of Principal Component Analysis, which is expressed as the average squared distance between unlabeled samples [88]. Item b) intends to disfavor features that are too rare or too frequent in the data set, because unlabeled nodes cannot be separated from each other using such features [42]. The above two properties can be formalized as follows:

(1) Minimizing the Label Similarity Objective Function:
$$\mathcal{J}_L(f^r) = \frac{1}{2} \sum_{C_{ij} = 1} (\mathcal{V}_r x_i - \mathcal{V}_r x_j)^2 - \frac{1}{2c} \sum_{C_{ij} = 0} (\mathcal{V}_r x_i - \mathcal{V}_r x_j)^2 \qquad (6.4)$$

where $c$ is the total number of classes, and $\mathcal{V}_r$ is an indicator vector showing which feature is selected, defined as $[\mathcal{V}_r]_i = 1$ if $i = r$, and $[\mathcal{V}_r]_i = 0$ otherwise.

(2) Minimizing the Structure Similarity Objective Function:

$$\mathcal{J}_S(f^r) = \frac{1}{2} \sum_{i,j=1}^{n_t} \Theta_{ij} (\mathcal{V}_r x_i - \mathcal{V}_r x_j)^2 \qquad (6.5)$$

where $\Theta_{ij}$ in Eq. (6.5) denotes the $l$-maximal-length path weight parameter between nodes $i$ and $j$, which is defined as follows:

$$\Theta = \sum_{i=1}^{l} \frac{1}{2^{i-1}} A^i \qquad (6.6)$$


Figure 6-4: An example of using feature selection to capture structure similarity.

The number of paths between two nodes is a proven good indicator of the node structure similarity. The shorter the path between two nodes, the closer the two nodes are in structure. So the weight in Eq. (6.6) decreases as the path length increases. An example is shown in Fig. 6-4. More specifically, the left panel shows the network in the original feature space and the right panel shows the network in the selected feature space (which contains $m = 6$ features). In the right panel, Node 1 shares more paths with Node 3 than with Node 7, and the paths between Node 1 and Node 3 are shorter than those between Node 1 and Node 7. So Nodes 1 and 3 are closer to each other than Nodes 1 and 7 from the structure similarity perspective. The structure similarity is tied to the representation of the nodes in the selected feature space. If two nodes share an edge, they will be close to each other in the selected feature space (e.g., Node 1 and Node 5 share one edge, so they have three common features in the selected feature space). As the path length between two nodes increases, the distance between the nodes in the feature space also increases (e.g., Node 1 and Node 6 have no edge, so they share no common feature in the selected feature space).
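A minimal sketch of the path-weight matrix in Eq. (6.6) is given below, assuming a NumPy adjacency matrix (variable names are illustrative). The powers $A^i$ are accumulated iteratively, so longer paths contribute with exponentially decaying weights, exactly as the equation prescribes.

import numpy as np

def path_weight_matrix(A, l):
    # Theta = sum_{i=1..l} A^i / 2^(i-1), cf. Eq. (6.6).
    Theta = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for i in range(1, l + 1):
        power = power @ A            # power now holds A^i
        Theta += power / (2.0 ** (i - 1))
    return Theta

# With the toy adjacency matrix from the earlier sketch and the default
# maximal path length l = 3 used in the experiments:
# Theta = path_weight_matrix(A, l=3)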

In Eq. (6.5), $\Theta$ is used as a penalty factor for two nodes that have a high structure similarity but are far away from each other in the feature space. Intuitively, nodes close in topology structure have a high probability of sharing similar node content [26]. So if any two nodes $i$ and $j$ are close to each other in topology structure but have a large distance in the original

feature space, their $\Theta_{ij}$ value will increase the objective value and thus encourages the feature selection module to find similar features for $i$ and $j$. This provides a unique way to impose network topology structures on the node feature selection process.

By combining the label similarity objective function in Eq. (6.4) and structure similarity

objective function in Eq. (6.5), we can form a combined evaluation criterion for each feature $f^r$ as follows:

$$\mathcal{J}(f^r) = \xi \cdot \mathcal{J}_L(f^r) + (1 - \xi) \cdot \mathcal{J}_S(f^r) \qquad (6.7)$$

where $\xi$ ($0 \le \xi \le 1$) is the weight parameter used to balance the contributions of network structures and node labels. The $\xi$ value allows users to fine-tune structure and label similarity in the feature selection for networks from different domains. In Section 6.4, we report the algorithm performance w.r.t. different $\xi$ values on benchmark networks. An example of using feature selection to capture structure similarity is also shown in Fig. 6-4.

By defining a weight matrix $W = [W_{ij}] \in \mathbb{R}^{n_t \times n_t}$ as

$$W_{ij} = [\xi, \xi/c] \cdot [C_{ij}, C_{ij} - 1]^{\top} + (1 - \xi) \cdot \Theta_{ij} \qquad (6.8)$$

we can rewrite Eq. (6.7) as follows:

$$\mathcal{J}(f^r) = \frac{1}{2} \sum_{i,j=1}^{n_t} (\mathcal{V}_r x_i - \mathcal{V}_r x_j)^2 W_{ij} = \frac{1}{2} \sum_{i,j=1}^{n_t} ([\mathbf{f}^r]_i - [\mathbf{f}^r]_j)^2 W_{ij} = (\mathbf{f}^r)^{\top} D \mathbf{f}^r - (\mathbf{f}^r)^{\top} W \mathbf{f}^r = (\mathbf{f}^r)^{\top} L \mathbf{f}^r \qquad (6.9)$$
where $D$ is a diagonal matrix whose entries are the column sums of $W$, i.e., $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is a Laplacian matrix.

In Eq. (6.8), $W_{ij}$ is equal to the structure similarity $h(i, j)$ in Eq. (6.3), so the constraint part of Eq. (6.3) is equivalent to minimizing $\sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r)$. As a result, the problem of feature selection in a static network is equal to finding a subset $\mathcal{S}$ containing $m$ features that satisfies:
$$\min \sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r), \quad \text{s.t. } \mathcal{S} \subseteq \mathcal{F}, \ |\mathcal{S}| = m \qquad (6.10)$$

DEFINITION 22 (F-Score) Let $X = [\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^{d_t}]$ represent the networked data, and let $W$ be the matrix defined in Eq. (6.8). $L$ is a Laplacian matrix defined as $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. We define a quality criterion $q$, called the F-Score, for a feature $f^r$ as
$$q(f^r) = (\mathbf{f}^r)^{\top} L \mathbf{f}^r \qquad (6.11)$$

The solution to Eq. (6.10) can be found by using the F-Score to assess the features in the original feature space $\mathcal{F}$. Suppose the F-Scores of all features are denoted by $q(f^1) \le q(f^2) \le \cdots \le q(f^{d_t})$ in sorted order; the solution for finding the $m$ most informative features is
$$\mathcal{S} = \{f^r \mid r \le m\} \qquad (6.12)$$
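The static feature selection step can thus be summarized in a few lines. The hedged sketch below (Python/NumPy; helper names are ours) builds the weight matrix of Eq. (6.8), forms the Laplacian L = D - W, scores every feature with the F-Score of Eq. (6.11), and keeps the m features with the smallest scores as in Eq. (6.12).

import numpy as np

def weight_matrix(C, Theta, xi, c):
    # W_ij = xi*C_ij + (xi/c)*(C_ij - 1) + (1 - xi)*Theta_ij, cf. Eq. (6.8).
    return xi * C + (xi / c) * (C - 1) + (1 - xi) * Theta

def f_scores(X, W):
    # F-Score q(f^r) = (f^r)^T L f^r for every feature (row of X), Eq. (6.11).
    D = np.diag(W.sum(axis=0))       # diagonal matrix of column sums
    L = D - W                        # Laplacian
    return np.array([f @ L @ f for f in X])

def select_features(X, W, m):
    # Indices of the m features with the smallest F-Scores, cf. Eq. (6.12).
    return np.argsort(f_scores(X, W))[:m]

# Example with the earlier toy snapshot, using the default values xi = 0.7
# and l = 3 reported in the experiments:
# W = weight_matrix(C, path_weight_matrix(A, 3), xi=0.7, c=2)
# selected = select_features(X, W, m=2)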

Feature Selection on Streaming Networks

When the network continuously evolves at different time points $\mathcal{T} = \{t_1, t_2, \ldots\}$, the network structure, including edges and nodes, and the node features may change accordingly. So we need to adjust the selected feature set $\mathcal{S}$ in order to characterize the changed network. Completely rerunning the feature selection at each single time point from scratch is time-consuming, especially for large networks. In this section, we introduce an incremental feature selection method, which calculates the score of an old feature based on the new networked data and then combines it with the old feature score to update the feature's final score. Such an incremental feature selection process enables our method to tackle the "Continuously Changing Network Volumes" challenge.

To incrementally update the scores of old features, we separate the networked data into two parts: a) nodes and edges that already exist at time point $t$; and b) newly emerged (or disappeared) nodes and their relevant topology structures at $t+1$. After that, we use Part a) to obtain the changed parts of the feature distributions in the old network, and use Part b) to calculate local incremental scores and update the scores of existing features, respectively. If the changed score of an old feature at $t+1$ can be obtained efficiently by using Part a) and Part b), we can compute a feature score by combining its old score at $t$ and the changed score at $t+1$. For ease of representation, we define the following notations:

• A subscript $t$ or $t+1$ on a matrix (or a vector) refers to the time point $t$ or $t+1$ of the matrix (or vector).

• $\mathbf{f}^r_t$ and $\mathbf{f}^r_{t+1}$ denote the indicator vectors of feature $f^r$ in the network at time points $t$ and $t+1$, respectively, where $\mathbf{f}^r_t \in \mathbb{R}^{n_t \times 1}$ and $\mathbf{f}^r_{t+1} \in \mathbb{R}^{n_{t+1} \times 1}$. We then define $\tilde{\mathbf{f}}^r_{t+1} \in \mathbb{R}^{n_t \times 1}$ as $[\tilde{\mathbf{f}}^r_{t+1}]_i = [\mathbf{f}^r_{t+1}]_i$, where $1 \le i \le n_t$.

• $\Delta n$ denotes the number of newly arrived nodes (from time point $t$ to $t+1$).

• $W_o$ denotes the weight matrix, defined as in Eq. (6.8), between the new nodes that arrive at time point $t+1$ and the old nodes that already existed at time point $t$.

• $W_c$ denotes the changed weight matrix from time point $t$ to $t+1$ between the old nodes that already existed at time point $t$.

• $W_{\Delta n}$ denotes the weight matrix between the new nodes that arrive at time point $t+1$.

So the weight matrix of the networked data at time point $t+1$ is $W_{t+1}$, and the updated part between $t$ and $t+1$ is $W^{\Delta}_{t+1}$, i.e.,
$$W_{t+1} = \begin{bmatrix} W_t + W_c & W_o \\ W_o^{\top} & W_{\Delta n} \end{bmatrix}, \qquad W^{\Delta}_{t+1} = \begin{bmatrix} W_c & W_o \\ W_o^{\top} & W_{\Delta n} \end{bmatrix}.$$

Then the F-Score of an old feature $f^r$ at time point $t+1$ is

$$q_{t+1}(f^r) = \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} ([\mathbf{f}^r_{t+1}]_i - [\mathbf{f}^r_{t+1}]_j)^2 [W_{t+1}]_{ij}$$
$$= \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} ([\mathbf{f}^r_{t+1}]_i - [\mathbf{f}^r_{t+1}]_j)^2 \left( \begin{bmatrix} W_t & 0 \\ 0 & 0 \end{bmatrix} + W^{\Delta}_{t+1} \right)_{ij} \qquad (6.13)$$
$$= (\tilde{\mathbf{f}}^r_{t+1})^{\top} L_t \tilde{\mathbf{f}}^r_{t+1} + \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} ([\mathbf{f}^r_{t+1}]_i - [\mathbf{f}^r_{t+1}]_j)^2 [W^{\Delta}_{t+1}]_{ij}$$
$$= (\mathbf{f}^r_t)^{\top} L_t \mathbf{f}^r_t + (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t)^{\top} L_t (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t) + (\mathbf{f}^r_{t+1})^{\top} L^{\Delta}_{t+1} \mathbf{f}^r_{t+1}$$

In Eq. (6.13), $q_{t+1}(f^r)$ contains three parts. The first term is $q_t(f^r)$, and the last two terms are the changed scores at $t+1$, which correspond to Part a) and Part b), respectively.

Formally,
$$q^{\Delta}_{t+1}(f^r) = (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t)^{\top} L_t (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t) + (\mathbf{f}^r_{t+1})^{\top} L^{\Delta}_{t+1} \mathbf{f}^r_{t+1} \qquad (6.14)$$

where $L^{\Delta}_{t+1}$ is the Laplacian matrix of $W^{\Delta}_{t+1}$. We calculate $W^{\Delta}_{t+1}$ by using the changed part of the network (including nodes and edges) as follows:

$$W^{\Delta}_{t+1} = \xi W^{\Delta(L)}_{t+1} + (1 - \xi) W^{\Delta(S)}_{t+1} \qquad (6.15)$$

$W^{\Delta(L)}_{t+1}$ and $W^{\Delta(S)}_{t+1}$ are used to calculate the changed parts of the label relationships and the structure relationships, respectively, from time point $t$ to $t+1$:
$$[W^{\Delta(L)}_{t+1}]_{ij} = \begin{cases} [1, 1/c] \cdot [C_{ij}, C_{ij} - 1]^{\top}, & \text{if } i \text{ or } j \in \Delta n; \\ 0, & \text{otherwise,} \end{cases}$$

and
$$W^{\Delta(S)}_{t+1} = \Theta_{t+1} - \Theta_t,$$

where $[W^{\Delta(L)}_{t+1}]_{ij}$ denotes the incremental weight parameter of the label similarity between nodes $i$ and $j$, and $[W^{\Delta(S)}_{t+1}]_{ij}$ is the incremental $l$-length path weight parameter between nodes $i$ and $j$. Both of them are "incrementally" calculated by using only the changed parts of the streaming network.

As a result, we can obtain the new score $q_{t+1}(f^r)$ by adding $q^{\Delta}_{t+1}(f^r)$ to $q_t(f^r)$, and the new scores of old features are used in the final feature selection process. When the streaming network changes with time, an old feature $f^r$'s new score $q_{t+1}(f^r)$ can be incrementally calculated from $q_t(f^r)$ by using only the part of the network that changed between time points $t$ and $t+1$, which allows SNF to efficiently update feature scores for large-scale dynamic networks.

For streaming features with an infinite feature space, it is infeasible to keep all feature scores for future comparison. So SNF maintains a small feature set, called the candidate feature set, for future comparisons, i.e., $\mathcal{T} = \{f^1, f^2, \ldots, f^m, f^{m+1}, \ldots, f^k\}$, where $q(f^1) \le q(f^2) \le \cdots \le q(f^k)$ and $q(f^k) \le 2 q(f^m)$. This setting ensures that the discarded features are very unlikely to be selected at the next time point. So SNF always keeps a candidate feature set with a dynamic size $k$ and discards less informative features.

Algorithm 5 SNF: Streaming network feature selection

Input: (1) the network at time points $t$ and $t+1$: $X_t$ and $X_{t+1}$; (2) candidate feature set: $\mathcal{T}_t$; (3) F-Score list of $\mathcal{T}_t$: $\mathcal{H}_t$; (4) size of the selected feature set: $m$; and (5) new feature set $\mathcal{V}_{t+1}$.
Output: selected feature set $\mathcal{S}_{t+1}$ and candidate feature set $\mathcal{T}_{t+1}$.
1: Initialize the score list $\mathcal{H}_{t+1}$ and generate the updated Laplacian matrix $L^{\Delta}_{t+1}$;
2: // calculate the F-Score for new features
3: for $f^r \in \mathcal{V}_{t+1}$ do
4:   $q(f^r) \leftarrow (\mathbf{f}^r)^{\top} L_{t+1} \mathbf{f}^r$
5:   $\mathcal{H}_{t+1} \leftarrow \{q(f^r)\} \cup \mathcal{H}_{t+1}$
6: end for
7: // update the F-Score for old features
8: for $f^r \in \mathcal{T}_t$ do
9:   $q^{\Delta}_{t+1}(f^r) \leftarrow (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t)^{\top} L_t (\tilde{\mathbf{f}}^r_{t+1} - \mathbf{f}^r_t) + (\mathbf{f}^r_{t+1})^{\top} L^{\Delta}_{t+1} \mathbf{f}^r_{t+1}$
10:  $q_t(f^r) \leftarrow \mathcal{H}_t(f^r)$
11:  $q_{t+1}(f^r) \leftarrow q_t(f^r) + q^{\Delta}_{t+1}(f^r)$
12:  $\mathcal{H}_{t+1} \leftarrow \{q(f^r)\} \cup \mathcal{H}_{t+1}$
13: end for
14: Sort $\mathcal{H}_{t+1}$ in ascending order
15: $\mathcal{S}_{t+1} \leftarrow$ top-$m$ of $\mathcal{H}_{t+1}$
16: $\mathcal{T}_{t+1} \leftarrow$ top-$k$ of $\mathcal{H}_{t+1}$, where $q(f^k) \le 2 q(f^m)$

For all new features appearing in the new nodes, SNF will calculate their feature scores in order to ensure that important new features can be discovered immediately after they emerge in the network. Algorithm 5 lists the detailed SNF algorithm, which incrementally compares the scores of new features and old features in $\mathcal{T}$ and selects the top-$m$ features to form the final feature set. It is worth noting that the proposed SNF method can efficiently handle three types of changes in streaming networks: (1) Feature distribution changes: for each feature $f^r$, if its distribution changes from $\mathbf{f}^r_t$ to $\mathbf{f}^r_{t+1}$, the first part of Eq. (6.14) is used to calculate its changed score; (2) Node addition and structure changes: for new nodes and their associated edge connections, the second part of Eq. (6.14) will capture the topological structure changes and the node addition information; and (3) Node deletion: for nodes that are removed at $t+1$, we can set their feature indicators to 0 to indicate that the nodes have empty node content, and then use (1) to update the feature scores.
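A hedged sketch of the incremental update in Eqs. (6.13)-(6.14) is given below (Python/NumPy; all names are illustrative). It assumes the old Laplacian $L_t$ and the Laplacian $L^{\Delta}_{t+1}$ of the changed weight matrix are available, and it adds the change term to the stored score instead of recomputing the full quadratic form over the whole network.

import numpy as np

def laplacian(W):
    # L = D - W with D the diagonal matrix of column sums.
    return np.diag(W.sum(axis=0)) - W

def incremental_f_score(q_old, f_t, f_t1, L_t, L_delta):
    # q_{t+1}(f^r) = q_t(f^r) + q^Delta_{t+1}(f^r), cf. Eqs. (6.13)-(6.14).
    # f_t:     indicator vector of f^r over the n_t old nodes at time t
    # f_t1:    indicator vector of f^r over all n_{t+1} nodes at time t+1
    # L_t:     Laplacian of W_t             (n_t x n_t)
    # L_delta: Laplacian of W^Delta_{t+1}   (n_{t+1} x n_{t+1})
    n_t = len(f_t)
    f_t1_old = f_t1[:n_t]            # restriction of f^r_{t+1} to the old nodes
    diff = f_t1_old - f_t
    delta = diff @ L_t @ diff + f_t1 @ L_delta @ f_t1
    return q_old + delta

The updated scores can then be re-sorted and the candidate set pruned with the $q(f^k) \le 2 q(f^m)$ rule before the top-$m$ features are returned, as in steps 14-16 of Algorithm 5.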

Time Complexity Analysis: the time complexity of obtaining $W_t$ is $O(n_t^{3l})$, so the total time complexity of obtaining $q_t(f^r)$ in Algorithm 5 is $O(n_t^{3l})$. In contrast, if the size of $W^{\Delta}_{t+1}$ is $n^{\Delta}_{t+1} \times n^{\Delta}_{t+1}$, the complexity of obtaining $q^{\Delta}_{t+1}(f^r)$ in Algorithm 5 is $O([n^{\Delta}_{t+1}]^{3l})$. In most cases, $n^{\Delta}_{t+1} \ll n_t$, so our update method is very efficient in the streaming network setting.

6.3.2 Node Classification on Streaming Networks

Once the most informative $m$ features are identified at time point $t$ (denoted by $\mathcal{S}_t = \{f^1, f^2, \cdots, f^m\}$), the network nodes can be represented using the selected features as:

$$X_t = [x_1, x_2, \ldots, x_{n_t}] \;\Rightarrow\; X^{\mathcal{S}_t} = [\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^m]^{\top} \in \mathbb{R}^{m \times n_t},$$

The node classification is to provide accurate labels for the unlabeled nodes in the network at time point $t$. If an unlabeled node $u$ is correctly labeled, $u$ should be properly positioned relative to the other nodes w.r.t. the label and structure similarity defined in Eq. (6.3). So the quality criterion of Eq. (6.3) for each unlabeled node $u$ at a given time point $t$ can be rewritten as follows,
$$E(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (\mathcal{D}_{\mathcal{S}_t} x_i - \mathcal{D}_{\mathcal{S}_t} x_j)^2 [W^{y_u}]_{ij} \qquad (6.16)$$

where $y_u \in \mathcal{Y}$, and $W^{y_u}$ is the weight matrix generated from Eq. (6.8) by setting the label of $u$ to $y_u$. $\mathcal{D}_{\mathcal{S}_t}$ is a diagonal matrix indicating the features that are selected from $\mathcal{F}$ into $\mathcal{S}_t$. Because the quality criterion is only affected by the changed part of the weight matrix, and the changes in node labels only affect the label similarity part, we can define the changed weight matrix as:
$$[W^{\Delta y_u}]_{ij} = \begin{cases} 1, & \text{if } j = u, y_i = y_u \text{ or } i = u, y_j = y_u; \\ -\frac{1}{c}, & \text{if } j = u, y_i \ne y_u \text{ or } i = u, y_j \ne y_u; \\ 0, & \text{if } j \ne u \text{ and } i \ne u. \end{cases} \qquad (6.17)$$

So the quality criterion in Eq.(6.16) can be replaced by

$$E(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (\mathcal{D}_{\mathcal{S}_t} x_i - \mathcal{D}_{\mathcal{S}_t} x_j)^2 [W^{\Delta y_u}]_{ij} \qquad (6.18)$$

Algorithm 6 SNOC: Streaming Network Node Classification

Input: (1) the network: $X_t$ and $X_{t-1}$; (2) label list: $\mathcal{Y}_t$; (3) candidate feature set at time point $t-1$: $\mathcal{T}_{t-1}$; (4) F-Score list of $\mathcal{T}_{t-1}$: $\mathcal{H}_{t-1}$; (5) size of the selected feature set: $m$; and (6) new feature set $\mathcal{V}_t$.
Output: label list for the unlabeled data: $\mathcal{Y}^u_t$.
1: $(\mathcal{S}_t, \mathcal{T}_t) \leftarrow \mathrm{SNF}(X_t, X_{t-1}, \mathcal{T}_{t-1}, \mathcal{H}_{t-1}, \mathcal{V}_t, m)$
2: Map $X_t$ into $X^{\mathcal{S}_t}$ by using $\mathcal{S}_t$;
3: for each unlabeled node $u$ do
4:   $y^*_u = \arg\min_{y_u \in \mathcal{Y}} \big( \mathrm{tr}([X^{\mathcal{S}_t}]^{\top} L^{\Delta y_u} X^{\mathcal{S}_t}) \big)$
5: end for

Then we can calculate E(yu, St) as

$$E(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (\mathcal{D}_{\mathcal{S}_t} x_i - \mathcal{D}_{\mathcal{S}_t} x_j)^2 [W^{\Delta y_u}]_{ij}$$
$$= \mathrm{tr}\big( \mathcal{D}_{\mathcal{S}_t} X_t (D^{\Delta y_u} - W^{\Delta y_u}) X_t^{\top} \mathcal{D}_{\mathcal{S}_t} \big) \qquad (6.19)$$
$$= \mathrm{tr}\big( \mathcal{D}_{\mathcal{S}_t} X_t L^{\Delta y_u} X_t^{\top} \mathcal{D}_{\mathcal{S}_t} \big) = \mathrm{tr}\big( [X^{\mathcal{S}_t}]^{\top} L^{\Delta y_u} X^{\mathcal{S}_t} \big)$$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix, and $L^{\Delta y_u}$ is the Laplacian matrix of $W^{\Delta y_u}$. So our target is to select a label for an unlabeled node $u$ that ensures:

$$\min_{y_u \in \mathcal{Y}} E(y_u, \mathcal{S}_t) \qquad (6.20)$$

DEFINITION 23 (SNC) Let $X^{\mathcal{S}_t} = [\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^m]$ represent the mapped network nodes in the selected feature space. Suppose $W^{\Delta y_u}$ is the matrix defined in Eq. (6.17), and $L^{\Delta y_u}$ is a Laplacian matrix defined as $L^{\Delta y_u} = D^{\Delta y_u} - W^{\Delta y_u}$, where $D^{\Delta y_u}$ is a diagonal matrix with $[D^{\Delta y_u}]_{ii} = \sum_j [W^{\Delta y_u}]_{ij}$. We define a labeling criterion, called the streaming network criterion (SNC), for each unlabeled node $u$ as follows,

$$y^*_u = \arg\min_{y_u \in \mathcal{Y}} \big( \mathrm{tr}([X^{\mathcal{S}_t}]^{\top} L^{\Delta y_u} X^{\mathcal{S}_t}) \big) \qquad (6.21)$$

Through the SNC criterion, Eq. (6.1) can be achieved by calculating $y_u$ for each single unlabeled node. Algorithm 6 lists the detailed process of the proposed streaming network

node classification (SNOC) method, which uses SNF to select a feature space to capture network changes and then assigns different labels to each unlabeled node by using the selected features as the gauge. The class label of an unlabeled node is the one that results in the minimal gauging value with respect to the selected features.
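Putting the pieces together, the classification step of Algorithm 6 reduces to evaluating the trace criterion of Eq. (6.21) once per candidate label. A hedged sketch is given below (Python/NumPy; names are illustrative). Pairs involving other unlabeled nodes are skipped here, which is an assumption of the sketch, and the trace is written in the dimensionally consistent order for an m x n mapped feature matrix.

import numpy as np

def delta_weight_matrix(u, y_u, y, c):
    # W^{Delta y_u} of Eq. (6.17): only row/column u is affected.
    n = len(y)
    W = np.zeros((n, n))
    for j in range(n):
        if j == u or y[j] == 0:      # skip u itself and unlabeled nodes (assumption)
            continue
        w = 1.0 if y[j] == y_u else -1.0 / c
        W[u, j] = W[j, u] = w
    return W

def gauging_value(X_S, u, y_u, y, c):
    # Gauging value tr(X_S L^{Delta y_u} X_S^T), cf. Eqs. (6.19) and (6.21).
    W = delta_weight_matrix(u, y_u, y, c)
    L = np.diag(W.sum(axis=0)) - W
    return float(np.trace(X_S @ L @ X_S.T))

def classify_node(X_S, u, y, labels, c):
    # Assign the label with the minimal gauging value to the unlabeled node u.
    return min(labels, key=lambda lab: gauging_value(X_S, u, lab, y, c))

# Usage with the earlier toy snapshot (X_S = rows of X for the selected
# features, candidate labels {1, 2}, c = 2):
# y_star = classify_node(X[[0, 1], :], u=3, y=y, labels=[1, 2], c=2)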

6.4 EXPERIMENTS

In this section, we conduct extensive experiments to evaluate the efficiency and effectiveness of SNOC for node classification in static and streaming networks.

6.4.1 Experimental Settings

We validate the performance of SNOC on the following four real-world networks.

Cora1 is a citation network with 2,708 publications (i.e. nodes) classified into one of seven classes. The citation relationships are captured in 5,429 links. The node content is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary of 1,433 unique words.

CiteSeer1 consists of 3,312 scientific publications classified into one of six classes. The network consists of 4,732 links. Each publication in CiteSeer is described by a 0/1-valued word vector from a dictionary with 3,703 unique words.

PubMed Diabetes1 network consists of 19,717 publications (nodes) from the PubMed database pertaining to three types of diabetes. It has 44,338 links. Each paper is described by a TF-IDF weighted word vector from a dictionary containing 500 unique words.

DBLP2 network contains 2,084,055 papers and 2,244,018 citation relationships. We separate papers into six classes: DataBases, Artificial Intelligence, Hardware and Architecture, Applications and Media, System Technology, and others. Each paper is denoted by a 0/1-valued word vector from a dictionary with 3,000 words.

1 http://linqs.cs.umd.edu/projects//projects/lbc/index.html
2 http://arnetminer.org/citation

To evaluate the performance of SNOC for streaming networks, we first test the algorithm performance on static networks using three networks (Cora, CiteSeer and PubMed Diabetes). After that, we use the DBLP and PubMed Diabetes networks as our streaming network test bed (because the CiteSeer and Cora networks are too small for testing in streaming network settings). DBLP is inherently a streaming network, because publications are continuously updated and the top keywords (node features) also change continuously over time. For the DBLP network in the streaming network setting, we choose 2,000 publications for each year and build a streaming network covering the time period from 1991 to 2010. In addition, we also use the PubMed Diabetes network to simulate a streaming network with 1,000 random nodes included at each time point t (the experiments include a total of 15 time points). For most experiments, we randomly label 40% of the nodes in the network and use the remaining nodes as test data (this is a reasonable setting because real-world networks always have more unlabeled nodes than labeled ones). In addition, we also report the algorithm performance with respect to different percentages of training/test nodes (detailed in Fig. 6-6(c)). For the streaming network experiments, the accuracy is tested on the new nodes arriving at each time point. The default size of the selected feature set is m = 100, the default value of the weight parameter is ξ = 0.7, and the default maximal path length in Eq. (6.6) is set to l = 3.

Baseline Methods: We compare the performance of SNOC with four baselines:

Information Gain+SVM (IG+SVM): This method ignores link structures in the network and uses Information Gain (IG) to select the top-m features from all nodes (using content information in the original bag-of-feature representation). LIBSVM [10] is used as the learning algorithm to train classifiers for node classification.

Link Structure+SVM (LS+SVM): This method ignores the label information of labeled nodes and only uses structure similarity to construct the weight matrix (W), and then calculates the feature scores in a similar way as SNF. LIBSVM is also used as the learning algorithm to train classifiers for node classification.

Collective Classification (GS+LR): This method refers to the combined classification of interlinked objects, including the correlation between node labels and node content.


Figure 6-5: The accuracies on three real-world static networks w.r.t. different numbers of selected features (from 50 to 300).

In our experiments, we use the collective classification method [61], which uses a simplified version of Gibbs sampling (GS) as the approximate inference procedure for networked data, with Logistic Regression (LR) used as the classifier for node classification.

DYCOS: This is a recently proposed method that combines text content and links for network node classification [2]. It is considered the state-of-the-art classification method for streaming networks. A random walk approach, in conjunction with the content of the network, is used for node classification. This results in a new approach to handle variations in content and linkage structures. The Gini index is used to select features in this method.

All experiments are conducted on a cluster machine with 16GB RAM and an Intel Core i7 3.20 GHz CPU.

6.4.2 Performance on Static Networks

Table 6.1 reports the performance of the different methods on three static networks (Cora, CiteSeer and PubMed Diabetes). The results show that SNOC outperforms the other four baseline methods on all three networks with significant performance gains. This is mainly attributed to SNOC's integration of network topology structure and node labels to explore features for node classification. Although DYCOS indeed considers network linkage information and GS+LR considers the correlation between node labels and node content, they do not take into account the impact of deep structure information in both the feature selection and classification processes. So their performance is inferior to SNOC.



Figure 6-6: The accuracy on three networks w.r.t. (a) different maximal lengths of path l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and (c) different percentages of labeled nodes.

Noticeably, even though DYCOS takes structure information into account, the actual contributions of label similarity and structure similarity have not been optimized in those methods in order to achieve the best feature selection results for networked data. This partially explains why the accuracies of DYCOS cannot match IG+SVM on the PubMed data set. In comparison, SNOC considers both labeled and unlabeled nodes, and combines node label similarity and node structure similarity to find effective features. All these designs help SNOC outperform all the other baseline methods.

Table 6.1: Accuracy Results on Static Networks.

Data sets    Cora             CiteSeer         PubMed
IG+SVM       50.34% ± 1.42%   57.21% ± 1.59%   65.24% ± 1.33%
LS+SVM       27.37% ± 2.85%   39.64% ± 2.66%   43.06% ± 2.75%
DYCOS        53.57% ± 1.24%   64.38% ± 1.29%   64.53% ± 1.86%
GS+LR        55.17% ± 1.09%   65.93% ± 2.37%   72.88% ± 2.05%
SNOC         62.66% ± 1.57%   73.81% ± 1.46%   81.09% ± 2.37%

In Fig. 6-5, we report the algorithm performance with respect to different numbers of selected features on the three networks. Overall, SNOC achieves the highest accuracy gain on all three networks across different feature sizes. LS+SVM has the lowest accuracies because the network structure alone provides very little useful information (compared to the node content) for node classification. SNOC and GS+LR have the highest accuracies on the Cora and CiteSeer networks when selecting m = 100 features, and on the PubMed Diabetes data set when m = 50, whereas DYCOS's accuracies decrease with the increase of the feature size on all three data sets.

The accuracies of all methods become closer to each other as the number of selected features continuously increases. This is because including more features may introduce interference and dilute the significance of important node features, so the benefit of feature selection becomes less significant. Because SNOC balances the label and structure information in the feature space for node classification, it still outperforms the other baseline methods.

In Fig. 6-6(a), we report the accuracies w.r.t. different maximal path lengths used to calculate Eq. (6.6). The results show that the accuracies decrease if the path lengths are too long. This is because, even though the paths between two nodes are relevant to the node structure similarity, if the path length is too long, the similarity may be deteriorated by special paths such as cycles and become inaccurate in capturing the node similarity.

In Fig. 6-6(b), we report the algorithm performance w.r.t. changing values of the weight parameter ξ. According to the definition in Eq. (6.7), ξ is used to balance the contributions of network structures and node labels. The results in Fig. 6-6(b) show that node labels play a more important role than network structures. For the Cora network, the accuracy reaches its peak when ξ = 0.6, while the highest accuracies appear at ξ = 0.7 for the CiteSeer and PubMed data, respectively. This suggests that network structures and node labels make different contributions to feature selection for networks from different domains. In order to achieve the best performance, users may need to carefully choose a suitable weight value for different networks.

In the previous experiments, the percentage of labeled nodes is fixed to 40% of the network. In reality, the percentage of labeled nodes in networks may vary significantly, so in this subsection we study the performance of all methods on networks with different percentages of labeled nodes (due to page limitations, we only report the results on the Cora network). The results in Fig. 6-6(c) show that when the number of labeled nodes in the network increases, all methods achieve accuracy gains. After the majority of the network nodes are labeled, all four methods except LS+SVM achieve similar accuracies. This is because labeled nodes provide sufficient content information for classification.


Figure 6-7: The accuracy on streaming networks: (a) accuracy on DBLP citation network from 1991 to 2010, (b) accuracy on PubMed Diabetes network for 15 time points, and (c) accuracy on extended DBLP citation network from 1991 to 2010.

Interestingly, our results show that when the network contains a small percentage of labeled nodes, e.g., 30% or less, SNOC can achieve much more significant accuracy gains compared to the other methods. This observation indicates that SNOC is more suitable for networks with very few labeled nodes. This is mainly attributed to the fact that SNOC can integrate node labels and network structures (which also include unlabeled nodes) to find the most effective features to characterize the network node content and topology information. In addition, the similarity gauging process also tries to find optimal node labels for unlabeled nodes to ensure that the distances evaluated in the feature space are consistent with the network structures.

6.4.3 Performance on Streaming Networks

Because only DYCOS and GS+LR are designed for classifying networked data, in the following we only compare SNOC with DYCOS and GS+LR on streaming networks. In Figs. 6-7 (a) and (b), we report the accuracies on the DBLP and PubMed networks in a streaming network setting. In addition, Fig. 6-8 reports the runtime of the different methods. Because GS+LR is designed for static networks, it needs to be rerun at each time point. Both DYCOS and SNOC can handle streaming networks.

For the DBLP network, the results show that the proposed SNOC outperforms all other methods in the streaming network setting. An exception is in 1997, where the GS+LR method, which is more time-consuming as shown in Fig. 6-8, can match SNOC. This shows that a good balance between node content and network structure is very important for node classification. Although GS+LR emphasizes node content information and DYCOS emphasizes network structures, both of them fail to capture the changes in streaming networks.

Meanwhile, the runtime performance in Fig. 6-8 shows that DYCOS is as fast as SNOC, but its accuracy is inferior to SNOC because DYCOS uses a random walk to predict node labels. Because random walks are inherently uncertain and introduce much randomness into the classification process, the node classification results of DYCOS are inferior to both SNOC and GS+LR. Meanwhile, as the time steps t continuously increase, the runtime curve of DYCOS increases much more quickly than that of SNOC. This is because SNOC only needs to consider the changed part of the network for both node classification and feature selection. Although GS+LR obtains better accuracies compared to DYCOS, it is much more time-consuming compared to SNOC and DYCOS. This is mainly because GS+LR is an iterative algorithm designed for static networks, so it needs to be rerun at each time step.

To validate the performance of the different methods on streaming networks with all types of changes (including dynamically changing node features and the addition and deletion of nodes and edges), we allow each node (i.e., paper) to include its reference's title in the node content.

For example, if a paper $p_i$ is cited by a paper $p_j$ at a particular time point $t$, we will include $p_j$'s title in node $p_i$'s content. By doing so, we introduce dynamically changing features to the nodes in the network. In addition, we also continuously remove old papers from the network to retain only papers published within a five-year period. This will result in node/edge deletion and feature removal for the whole network. All these settings result in a highly complicated streaming network setting for node classification. We denote this network as the full streaming DBLP network and report the results in Fig. 6-7(c). The results show that SNOC clearly outperforms all other methods in this complicated network setting. More specifically, in 1996, the accuracies of all methods deteriorate with significant drops. This is mainly because 1996 is the first time that old nodes are removed from the network. Our method achieves the smallest decline slope compared to the other two methods.

Interestingly, when comparing the results in Fig. 6-7(a) and Fig. 6-7(c), we can find that the average accuracies of SNOC and GS+LR on networks containing all publications (Fig. 6-7(a)) are higher than the accuracies on networks only containing publications with a five-year span (Fig. 6-7(c)). Notice that the former contains a much higher node and edge density in the network, so when the same sets of nodes are given for classification, the rich

topology structures in a dense network will help the algorithm improve the node classification accuracies. For DYCOS, its average accuracy in Fig. 6-7(a) is 1.5% lower than the average accuracy in Fig. 6-7(c). Notice that DYCOS uses random walks for node classification. For dense networks, the random walks will contain many irrelevant paths, which deteriorate the classification accuracy. As a result, its accuracy on the five-year-span networks is actually better than its accuracy on the whole networks.


Figure 6-8: The cumulative runtime on the DBLP and PubMed Diabetes networks corresponding to Fig. 6-7.

6.4.4 Case Study

In Fig. 6-9, we use a case study to demonstrate the performance of the three methods (SNOC, DYCOS and GS+LR) in handling cases with abrupt network changes. In our experiments, from time points 1 to 3, the network only contains nodes from four classes (Hardware and Architecture, Applications and Media, System Technology, and others). From time points 4 to 6, nodes from a new class (DataBases) are included in the network (including unlabeled nodes). From time points 7 to 9, new nodes from another new class (Artificial Intelligence) are introduced into the network.

The results in Fig. 6-9 show that, due to the abrupt inclusion of new class nodes, the accuracies of all methods decrease. When nodes from the new class continuously arrive,

SNOC's accuracy can quickly recover, because SNF in SNOC can find ideal features to represent changes in the network and use these features to adjust the node classification. As a result, SNOC can adapt to the changes in the network for node classification.

 





  


Figure 6-9: Case study on DBLP citation network.

In summary, our experiments confirm that, for streaming networks, using features to capture changes and further classifying nodes by assessing the consistency of node labels and the network structures can provide effective and efficient solutions for node classification.

Chapter 7

Conclusion

7.1 SUMMARY OF THIS THESIS

In this thesis, we have studied mining problems from the perspective of complex structure data, where instances (nodes) are not only characterized by their content but are also subject to dependency relationships. Moreover, with the fast development of information technology, many real-world data are characterized by dynamic changes. Accordingly, this thesis explores instance correlation in complex structure data and utilizes it to make mining tasks more accurate and applicable. Our objective is to combine node correlation with node content and utilize them for three different tasks: (1) graph stream classification, (2) super-graph classification and clustering, and (3) streaming network node classification.

More specifically, in Chapter 3, we presented an empirical study to reveal the roles of sub-graph features for graph classification. Existing research has commonly agreed that finding discriminative sub-graphs to represent graphs is one of the main challenges for graph classification. Yet there is no comprehensive study about (1) the genuine relationship between sub-graphs and the classification accuracy, and (2) the actual difference between sub-graphs discovered from an expensive mining process (such as frequent sub-graph mining) and sub-graphs from simple approaches (such as random sub-graphs). In this thesis, we empirically validated the relationship between sub-graphs and graph classifiers by varying (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different

learning algorithms; and (4) different benchmark datasets. We characterized sub-graphs discovered from different approaches (including random sub-graphs, frequent sub-graphs, and frequent sub-graphs selected by Information Gain) by their size (i.e., number of edges) and by their number (i.e., number of sub-graphs in a set), and validated the performance of classifiers trained from these sub-graphs on seven benchmark graph datasets from three domains. Our study drew a number of important findings, which provide a clear view of the relationship between sub-graphs and graph classification.

In Chapter 4, we proposed to address graph stream classification. We argued that in graph stream scenarios, the data volumes and the structures of the graphs may constantly change. The existing sub-graph feature-based representation model is not only inefficient but also ineffective for graph stream classification. This is because the mining of sub-graph features is time-consuming, and the existing occurrence-based graph representation model results in significant information loss, making sub-graph features ineffective for representing graph data. To solve the problem, we proposed a graph factorization-based fine-grained representation model, where the main objective is to use linear combinations of a set of discriminative cliques to represent graphs for learning. The optimization-oriented factorization approach ensures minimum information loss for graph representation and also avoids the expensive sub-graph isomorphism validation process. Based on this idea, we proposed a novel framework for fast graph stream classification. Experiments on two real-world graph streams validated the proposed design for effective graph stream classification.

In Chapter 5, we first formulated a new super-graph classification problem. Due to the inherently complex structure representation, existing graph classification methods cannot be applied to super-graph classification. In this thesis, we proposed a weighted random walk kernel that calculates the similarity between two super-graphs by assessing (a) the similarity between the super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key contribution is twofold: (1) a weighted random walk kernel considering node and structure similarities between graphs; and (2) an effective kernel-based super-graph classification method with a sound theoretical basis.

In Chapter 6, we proposed a novel node classification method for streaming networks.

We argued that for networks with continuous changes in structure and node content, features are the most effective tool to capture such changes. Accordingly, we proposed to take network topology structure and node labels into consideration to find an optimal subset of features to represent the network. Based on the selected features, a streaming network node classification method, SNOC, was developed to classify unlabeled nodes through the minimization of the similarity distance in the network and the feature-based distance between nodes. Experiments and comparisons demonstrate that SNOC is able to capture emerging changes and outperform baseline approaches with significant performance gains. The key innovation of this chapter, compared to the existing methods, is twofold: (1) a new node classification method for handling streaming networks; and (2) a streaming feature selection method for networked data where the selected features can capture both node content and network structure dependency.

7.2 FUTURE WORK

Even though we have proposed several mining methods for complex structure data, applying mining algorithms to very large scale problems and dynamic settings still poses challenges: (1) how to find an effective representation for high-dimensional, large-scale complex structure data, so as to fit in memory; (2) how to capture the dynamic changes in complex structure data to adjust the mining results; and (3) how to balance manifold information to achieve effective performance. To the best of our knowledge, there is no work combining large-scale learning and dynamic learning. In our future work, we will focus on designing mining algorithms and approaches that are faster, more data efficient and less demanding in computational resources, to achieve scalable algorithms for large-scale problems.

More specifically, we will extend the super-graph classification method to the clustering problem on huge super-graphs. To discover clusters from a large network/graph, existing methods follow three common approaches: (1) Structure-based Clustering, using node connectivity only [54, 79, 64]; (2) Attribute-based Clustering, using node attribute similarity [73, 86]; and (3) Structural and Attribute Clustering, combining node attribute and node connectivity similarities [14, 80, 89]. All the above methods are, however, inapplicable to

super-graph clustering, mainly because they cannot take the internal structures inside super-nodes into account for clustering. Indeed, super-nodes may share some overlapping/intersecting structures that help assess the similarity between super-nodes. In addition, the inter-connected structures between super-nodes also provide useful structure information for clustering. The existence of structure dependency within and between super-nodes requires a clustering algorithm to take both the internal and external structures of nodes into account for super-graph clustering. To cluster a super-graph, the main challenge is to properly calculate the similarities between super-nodes by considering the internal structures inside each super-node and the inter-connectivity between super-nodes. The complex super-node structure, where each node is itself another graph, makes this a very challenging problem. These challenges motivate our research to combine structural and content similarity between super-nodes for clustering.

In the future, we will also focus on finding cluster structures for networked instances and discovering representative features for each cluster, which represents a special co-clustering task useful for many real-world applications, such as the automatic categorization of scientific publications and finding representative keywords for each cluster. To date, although co-clustering has been commonly used for finding clusters of both instances and features, all existing methods focus on instance-feature relationships, without realizing that topology structures between instances provide very valuable information to help boost co-clustering performance. We will try to propose a co-clustering method to ensure that the final cluster structures are consistent across the information from the three aspects with minimum errors.

Bibliography

[1] Charu C. Aggarwal. On classification of graph streams. In Proceedings of the Eleventh SIAM International Conference on Data Mining (SDM’11), pages 652–663, 2011.

[2] Charu C. Aggarwal and Nan Li. On node classification in dynamic content-based networks. In SDM, pages 355–366, 2011.

[3] Charu C. Aggarwal and Haixun Wang. Managing and mining graph data. Springer, 2010.

[4] Ralitsa Angelova and Gerhard Weikum. Graph-based text classification: learn from your neighbors. In ACM SIGIR, pages 485–492, 2006.

[5] Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan, Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(1):47–56, 2005.

[6] Horst Bunke. Recent developments in graph matching. In Proceedings of the 15th International Conference on Pattern Recognition, pages 117–124, 2000.

[7] Horst Bunke and Kaspar Riesen. Graph classification based on dissimilarity space embedding. In Structural, Syntactic, and Statistical Pattern Recognition, 2008.

[8] Yandong Cai, Nick Cercone, and Jiawei Han. An attribute-oriented approach for learning classification rules from relational databases. In ICDE, pages 281–288, 1990.

[9] Jérôme Callut, Kevin Françoisse, Marco Saerens, and Pierre Dupont. Classification in graphs using discriminative random walks. In MLG, 2008.

[10] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. TIST, 2(3), 2011.

[11] Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, and Jianyong Wang. Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 323–334, 2002.

[12] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent pattern analysis for effective classification. In ICDE, 2007.

[13] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent pattern-based graph classification. In Link Mining: Models, Algorithms, and Applications, 2010.

[14] Hong Cheng, Yang Zhou, and Jeffrey Xu Yu. Clustering large attributed graphs: A balance between structural and attribute similarities. ACM TKDD, 5(2), 2011.

[15] Lianhua Chi, Bin Li, and Xingquan Zhu. Fast graph stream classification using discriminative clique hashing. In Advances in Knowledge Discovery and Data Mining (PAKDD’13), pages 225–236, 2013.

[16] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 1995.

[17] Andrew D. J. Cross, Richard C. Wilson, and Edwin R. Hancock. Inexact graph matching using genetic search. Pattern Recognition, 30(6):953–970, 1997.

[18] Thiago Henrique Cupertino and Liang Zhao. Bias-guided random walk for network-based data classification. In Advances in Neural Networks, pages 375–384, 2013.

[19] Hans Dietmar Gröger. On the randomized complexity of monotone graph properties. Acta Cybernetica, 10(3):119–127, 1992.

[20] Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 2003.

[21] Pedro Domingos and Geoff Hulten. Mining high speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’00), pages 71–80, 2000.

[22] Ehud Gudes, Solomon Eyal Shimony, and Natalia Vanetik. Discovering frequent graph patterns using disjoint paths. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.

[23] Terrence S. Furey, Nello Cristianini, Nigel Duffy, David W. Bednarski, Michèl Schummer, and David Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.

[24] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis and Applications, 13(1):113–129, 2010.

[25] Thomas Gärtner, Peter A. Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In COLT, pages 129–143, 2003.

[26] Ray Reagans and Bill McEvily. Network structure and knowledge transfer: The effects of cohesion and range. Administrative Science Quarterly, 48(2):240–267, 2003.

[27] Gregory Gutin, Anders Yeo, and Alexey Zverovich. Traveling salesman should not be greedy: Domination analysis of greedy-type heuristics for the TSP. Discrete Applied Mathematics, 2002.

[28] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 2007.

[29] Johan Himberg, Kalle Korpiaho, Heikki Mannila, and Johanna Tikanmäki. Time series segmentation for context recognition in mobile devices. In Proceedings IEEE International Conference on Data Mining, pages 203–210, 2001.

[30] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between labeled graphs. In ICML, 2003.

[31] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In ACM SIGKDD, 2004.

[32] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM, 2003.

[33] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, 2000.

[34] David Jensen, Jennifer Neville, and Brian Gallagher. Why collective inference improves relational classification. In SIGKDD, 2004.

[35] Ning Jin, Calvin Young, and Wei Wang. GAIA: Graph classification using evolutionary computation. In ACM SIGKDD, 2010.

[36] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[37] Hisashi Kashima and Akihiro Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, volume 2002, 2002.

[38] Kaspar Riesen and Horst Bunke. Cluster ensembles based on vector space embeddings of graphs. Multiple Classifier Systems, 2009.

[39] Kaspar Riesen and Horst Bunke. Graph classification by means of Lipschitz embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2009.

[40] Kaspar Riesen and Horst Bunke. Graph Classification and Clustering Based on Vector Space Embedding. World Scientific Publishing Co., Inc., 2010.

[41] Xiangnan Kong, Wei Fan, and Philip S. Yu. Dual active feature and sample selection for graph classification. In ACM SIGKDD, 2011.

[42] Xiangnan Kong and Philip S. Yu. Semi-supervised feature selection for graph classification. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 793–802, 2010.

[43] Taku Kudo, Eisaku Maeda, and Yuji Matsumoto. An application of boosting to graph classification. Advances in Neural Information Processing Systems, 2004.

[44] Solomon Kullback. Information theory and statistics. Courier Dover Publications, 1968.

[45] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In ICDM, 2001.

[46] Bin Li, Xingquan Zhu, Lianhua Chi, and Chengqi Zhang. Nested subtree hash kernels for large-scale graph classification over streams. In 12th International Conference on Data Mining (ICDM’12), pages 399–408, 2012.

[47] Geng Li, Murat Semerci, Bülent Yener, and Mohammed J. Zaki. Graph classification via topological and label attributes. In 9th Workshop on Mining and Learning with Graphs (with SIGKDD), 2011.

[48] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 2–11, 2003.

[49] Qing Lu and Lise Getoor. Link-based classification. In ICML, pages 496–503, 2003.

[50] Xiangfeng Luo, Zheng Xu, Jie Yu, and Xue Chen. Building association link network for semantic link on web resources. IEEE Transactions on Automation Science and Engineering, 8(3):482–494, 2011.

[51] J. McAuley and J. Leskovec. Image labelling on a network: using social-network metadata for image classification. In ECCV, volume 4, pages 828–841, 2012.

[52] Kenrick Mock. An experimental framework for email categorization and management. In SIGIR, pages 392–393, 2001.

[53] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21(Suppl.1):i302–i310, 2005.

[54] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, 2004.

[55] H. J. Nussbaumer. Fast Fourier transform and convolution algorithms. Springer Series in Information Sciences, 2, 1982.

[56] Sayan Ranu and Ambuj K. Singh. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In Proceedings of the 2009 International Conference on Data Engineering, pages 844–855, Shanghai, China, March 2009.

[57] John W. Raymond, Eleanor J. Gardiner, and Peter Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6):3–35, 2002.

[58] Hiroto Saigo, Nicole Krämer, and Koji Tsuda. Partial least squares regression for graph mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 578–586, 2008.

[59] Hiroto Saigo, Sebastian Nowozin, Tadashi Kadowaki, Taku Kudo, and Koji Tsuda. gboost: a mathematical programming approach to graph classification and regression. Machine Learning, 75(1):69–89, 2009.

[60] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels. MIT Press, 2002.

[61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, and Lise Getoor. Collective classification in network data. In Encyclopedia of Machine Learning, 2010.

[62] D. Serre. Matrices: Theory and Applications. Springer, New York, 2002.

[63] Nino Shervashidze and Karsten M. Borgwardt. Fast subtree kernels on graphs. Advances in Neural Information Processing Systems, 22:1660–1668, 2009.

[64] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[65] Aaron Smalter, Jun Huan, Y. Jia, and Gerald Lushington. GPD: a graph pattern diffusion kernel for accurate graph classification with applications in cheminformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):197–207, 2010.

[66] R. R. Sokal and F. J. Rohlf. Biometry: the principles and practice of statistics in biological research. W.H. Freeman & Co Ltd, New York, 1981.

[67] Srinath Srinivasa and Sujit Kumar. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proceedings of the 29th international conference on Very Large Data Bases, volume 29, pages 975–986, 2003.

[68] Jacopo Staiano, Bruno Lepri, Nadav Aharony, Fabio Pianesi, Nicu Sebe, and Alex Pentland. Friends don't lie: inferring personality traits from social network structure. In Ubicomp, pages 321–330, 2012.

[69] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’01), pages 377–382, 2001.

[70] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers on social networks. In ACSAC, pages 1–9, 2010.

[71] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in dynamic multi-mode networks. In ACM SIGKDD, 2008.

[72] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’08), pages 677–685, 2008.

[73] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for graph summarisation. In ACM SIGMOD, pages 567–580, 2008.

[74] Roger Ming Hieng Ting and James Bailey. Mining Minimal Contrast Subgraph Patterns. University of Melbourne, Department of Computer Science and Software Engineering, 2007.

[75] N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data. In Proceedings of the 2002 IEEE International Conference on Data Mining, pages 458–465, 2003.

[76] Joshua T. Vogelstein, William R. Gray, R. Jacob Vogelstein, and Carey E. Priebe. Graph classification using signal-subgraphs: Applications in statistical connectomics, 2011.

[77] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’03), pages 226–235, 2003.

[78] Xindong Wu, Kui Yu, Wei Ding, Hao Wang, and Xingquan Zhu. Online feature selection with streaming features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1178–1192, 2013.

[79] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. Scan: a structural clustering algorithm for networks. In KDD, pages 824–833, 2007.

[80] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based approach to attributed graph clustering. In Proc. of ACM SIGMOD, pages 505–516, 2012.

[81] Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu. Mining significant graph patterns by leap search. In ACM SIGMOD, 2008.

[82] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM, 2003.

[83] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In ACM SIGMOD, 2004.

[84] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure similarity search in graph databases. In ACM SIGMOD, 2005.

[85] Eng-Hui Yap, Tyler Rosche, Steve Almo, and Andras Fiser. Functional clustering of immunoglobulin superfamily proteins with protein-protein interaction information calibrated hidden Markov model sequence profiles. 426:945–961, 2013.

[86] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Cross-relational clustering with user’s guidance. In ACM SIGKDD, pages 344–353, 2005.

[87] Yuchen Zhao, Xiangnan Kong, and Philip S. Yu. Positive and unlabeled learning for graph classification. In IEEE 11th International Conference on Data Mining, pages 962–971, 2011.

[88] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In SDM, pages 641–646, 2007.

[89] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. In Proceedings of the VLDB Endowment, volume 2, pages 718–729, 2009.

[90] Yunyue Zhu and Dennis Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th international conference on Very Large Data Bases, pages 358–369, 2002.
