Community Search and Detection on Large Graphs Esra Akbas
Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2017

Community Search and Detection on Large Graphs
Esra Akbas

Follow this and additional works at the DigiNole: FSU's Digital Repository. For more information, please contact [email protected]

FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES

COMMUNITY SEARCH AND DETECTION ON LARGE GRAPHS

By
ESRA AKBAS

A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2017

Copyright © 2017 Esra Akbas. All Rights Reserved.

Esra Akbas defended this dissertation on November 6, 2017. The members of the supervisory committee were:

Peixiang Zhao, Professor Directing Dissertation
Washington Mio, University Representative
Piyush Kumar, Committee Member
Xiuwen Liu, Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

To my son and husband...

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my supervisor Dr. Peixiang Zhao for his excellent guidance, valuable suggestions and infinite patience. All this work would have never been possible without his guidance and support. I am also thankful to my committee members, Dr. Kumar, Dr. Liu and Dr. Mio, for spending their time and efforts to read and comment on my dissertation. I am grateful to my family members, especially my husband Mehmet and my son Ahmet, for their love and their support in every stage of my life.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Terminology
  1.2 Community Detection
    1.2.1 Attributed Graph Clustering
  1.3 Community Search
    1.3.1 Truss Based Community Search
    1.3.2 Approximate Closest Truss based Community Search

2 Related Work
  2.1 Attributed Graph Clustering
    2.1.1 Approaches which convert an attribute graph to a weighted graph
    2.1.2 Distance-based Approaches
    2.1.3 Model-based Approaches
    2.1.4 Subspace Clustering
    2.1.5 Comparison of Methods
  2.2 Graph Embedding
  2.3 Community Search

3 Attributed Graph Clustering: an Attribute-aware Graph Embedding Approach
  3.1 Problem Formulation
  3.2 The Attribute-Aware Graph Embedding Framework
    3.2.1 Vertex Attribute Embedding
    3.2.2 Structure Embedding
  3.3 Attributed Graph Clustering Algorithm
  3.4 Experiments
    3.4.1 Datasets
    3.4.2 Evaluation Metrics
    3.4.3 Experimental Results
  3.5 Conclusions

4 Truss based Community Search: a Truss equivalence Based Indexing Approach
  4.1 Preliminaries
  4.2 Truss Equivalence
  4.3 Truss-Equivalence Based Index
    4.3.1 Index Design and Construction
    4.3.2 Community Search on EquiTruss
  4.4 Experiments
    4.4.1 Index Construction
    4.4.2 Community Search
    4.4.3 Effectiveness Analysis in DBLP
  4.5 Conclusions

5 Index based Closest Community Search
  5.1 Preliminaries
  5.2 Basic Algorithmic framework
    5.2.1 Finding Maximal Connected k-truss
    5.2.2 Eliminating Free Riders
    5.2.3 Approximation Analysis
  5.3 Truss equivalence based index
    5.3.1 Index Design and Construction
  5.4 Community search on Index
    5.4.1 Finding maximal connected k-truss on TEQ Index
    5.4.2 MSTTEQ index structure and Querying maximal k-truss on it
    5.4.3 Minimal connected k-truss
  5.5 Experiments
  5.6 Conclusions

6 Conclusions
  6.1 Future Work

Bibliography
Biographical Sketch

LIST OF TABLES

1.1 Main symbols
2.1 Comparison of Attributed Graph Clustering Methods
4.1 Primer of terminologies and notations
4.2 Network statistics (K = 10^3 and M = 10^6)
4.3 Index construction time (in seconds) and space cost (in megabytes) of EquiTruss and TCP-Index, together with the sizes of graphs (in megabytes)
5.1 Graphs statistics (K = 10^3 and M = 10^6)
5.2 Trussness of the Communities for Different Size of Query Node Set, Q
5.3 Number of Edges in the Community for Different Size of Query Node Set, Q
5.4 Trussness of the Communities in All Datasets

LIST OF FIGURES

1.1 A Sample Attributed Graph
1.2 Structure-based Clustering
1.3 Attribute-based Clustering
1.4 Structural/Attribute Clustering
1.5 Truss-based Communities for vertex v7 in G
1.6 Closest Truss Community for vertex 2 and 4 in G
3.1 The attribute-aware graph embedding on a sample attributed graph. (a) presents a sample social graph G containing 13 individuals and their friendship relations. Each individual is characterized by two attributes: education and favorite language; (b) presents the transformed, weighted graph G' with vertex attribute proximity embedded as edge weights; (c) presents the two-dimensional attribute-aware graph embedding, φ, from which the latent cluster structures naturally arise
3.2 Clustering Quality in Political Blog Dataset
3.3 Clustering Quality in DBLP Dataset
3.4 Clustering Quality in Patent Dataset
3.5 Clustering Quality of AA-Cluster w.r.t. Neighborhood Distance, L
3.6 Clustering Quality of AA-Cluster w.r.t. Number of walks, γ, in DBLP graph (k = 10)
3.7 Clustering Quality of AA-Cluster w.r.t. Window Size, w, in DBLP graph (k = 10)
3.8 Runtime Cost in Synthetic Graphs
4.1 A Sample graph G and Truss-based Communities for vertex v7 in G
4.2 k-truss edges in the graph G
4.3 Truss-equivalence based index, EquiTruss, of G
4.4 The two 4-truss communities for the query vertex v4, including A1 with edges in red color and A2 with edges in green color
4.5 Community search performance in different vertex-degree percentile buckets
4.6 Community search performance for different truss values of k
4.7 (a) The summarized graph in EquiTruss for the DBLP four-area graph. Each super-node represents a k-truss community (7 ≤ k ≤ 27), and each super-edge depicts triangle-connectivity between super-nodes. (b) All k-truss communities (7 ≤ k ≤ 27) in the DBLP four-area graph
4.8 7-truss community and 8-truss community for the query Michael Stonebraker
5.1 A sample graph G
5.2 k-truss edges in the graph G in Figure 5.1
5.3 Truss-equivalence based index, TEQ, of G
5.4 Community search on TEQ after first iteration with k = 4
5.5 Final result for Community search on TEQ
5.6 Maximum Spanning tree of TEQ IG in Figure 5.3
5.7 Rooted tree of Maximum Spanning tree given in Figure 5.6
5.8 Query time to find maximum connected k-truss varying query size |Q| on DBLP
5.9 Total query time to find closest truss community varying query size |Q| on DBLP
5.10 Query time to find maximum connected k-truss varying query size |Q| on Facebook
5.11 Total query time to find closest truss community varying query size |Q| on Facebook
5.12 Query time to find maximum connected k-truss on all datasets
5.13 Total query time to find closest connected k-truss on all datasets

ABSTRACT

Modern science and technology have witnessed in the past decade a proliferation of complex data that can