Florida State University Libraries

Electronic Theses, Treatises and Dissertations
The Graduate School

2017
Community Search and Detection on Large Graphs
Esra Akbas

FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

COMMUNITY SEARCH AND DETECTION ON LARGE GRAPHS

By

ESRA AKBAS

A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2017

Copyright © 2017 Esra Akbas. All Rights Reserved.

Esra Akbas defended this dissertation on November 6, 2017. The members of the supervisory committee were:

Peixiang Zhao, Professor Directing Dissertation

Washington Mio, University Representative

Piyush Kumar, Committee Member

Xiuwen Liu, Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

To my son and husband...

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my supervisor, Dr. Peixiang Zhao, for his excellent guidance, valuable suggestions and infinite patience. All this work would never have been possible without his guidance and support. I am also thankful to my committee members, Dr. Kumar, Dr. Liu and Dr. Mio, for spending their time and effort to read and comment on my dissertation. I am grateful to my family members, especially my husband Mehmet and my son Ahmet, for their love and their support in every stage of my life.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Terminology
  1.2 Community Detection
      1.2.1 Attributed Graph Clustering
  1.3 Community Search
      1.3.1 Truss Based Community Search
      1.3.2 Approximate Closest Truss based Community Search

2 Related Work
  2.1 Attributed Graph Clustering
      2.1.1 Approaches which convert an attribute graph to a weighted graph
      2.1.2 Distance-based Approaches
      2.1.3 Model-based Approaches
      2.1.4 Subspace Clustering
      2.1.5 Comparison of Methods
  2.2 Graph Embedding
  2.3 Community Search

3 Attributed Graph Clustering: an Attribute-aware Graph Embedding Approach
  3.1 Problem Formulation
  3.2 The Attribute-Aware Graph Embedding Framework
      3.2.1 Vertex Attribute Embedding
      3.2.2 Structure Embedding
  3.3 Attributed Graph Clustering Algorithm
  3.4 Experiments
      3.4.1 Datasets
      3.4.2 Evaluation Metrics
      3.4.3 Experimental Results
  3.5 Conclusions

4 Truss based Community Search: a Truss equivalence Based Indexing Approach
  4.1 Preliminaries
  4.2 Truss Equivalence
  4.3 Truss-Equivalence Based Index
      4.3.1 Index Design and Construction
      4.3.2 Community Search on EquiTruss
  4.4 Experiments
      4.4.1 Index Construction
      4.4.2 Community Search
      4.4.3 Effectiveness Analysis in DBLP
  4.5 Conclusions

5 Index based Closest Community Search
  5.1 Preliminaries
  5.2 Basic Algorithmic framework
      5.2.1 Finding Maximal Connected k-truss
      5.2.2 Eliminating Free Riders
      5.2.3 Approximation Analysis
  5.3 Truss equivalence based index
      5.3.1 Index Design and Construction
  5.4 Community search on Index
      5.4.1 Finding maximal connected k-truss on TEQ Index
      5.4.2 MSTTEQ index structure and Querying maximal k-truss on it
      5.4.3 Minimal connected k-truss
  5.5 Experiments
  5.6 Conclusions

6 Conclusions
  6.1 Future Work

Bibliography
Biographical Sketch

LIST OF TABLES

1.1 Main symbols

2.1 Comparison of Attributed Graph Clustering Methods

4.1 Primer of terminologies and notations

4.2 Network statistics (K = 10^3 and M = 10^6)

4.3 Index construction time (in seconds) and space cost (in megabytes) of EquiTruss and TCP-Index, together with the sizes of graphs (in megabytes)

5.1 Graph statistics (K = 10^3 and M = 10^6)

5.2 Trussness of the Communities for Different Sizes of Query Node Set, Q

5.3 Number of Edges in the Community for Different Sizes of Query Node Set, Q

5.4 Trussness of the Communities in All Datasets

LIST OF FIGURES

1.1 A Sample Attributed Graph

1.2 Structure-based Clustering

1.3 Attribute-based Clustering

1.4 Structural/Attribute Clustering

1.5 Truss-based Communities for vertex v7 in G

1.6 Closest Truss Community for vertices 2 and 4 in G

3.1 The attribute-aware graph embedding on a sample attributed graph. (a) presents a sample G containing 13 individuals and their friendship relations, where each individual is characterized by two attributes: education and favorite language; (b) presents the transformed, weighted graph G′ with vertex attribute proximity embedded as edge weights; (c) presents the two-dimensional attribute-aware graph embedding, φ, from which the latent cluster structures naturally arise

3.2 Clustering Quality in Political Blog Dataset

3.3 Clustering Quality in DBLP Dataset

3.4 Clustering Quality in Patent Dataset

3.5 Clustering Quality of AA-Cluster w.r.t. Neighborhood Distance, L

3.6 Clustering Quality of AA-Cluster w.r.t. Number of Walks, γ, in DBLP graph (k = 10)

3.7 Clustering Quality of AA-Cluster w.r.t. Window Size, w, in DBLP graph (k = 10)

3.8 Runtime Cost in Synthetic Graphs

4.1 A Sample graph G and Truss-based Communities for vertex v7 in G

4.2 k-truss edges in the graph G

4.3 Truss-equivalence based index, EquiTruss, of G

4.4 The two 4-truss communities for the query vertex v4, including A1 with edges in red color and A2 with edges in green color

4.5 Community search performance in different vertex-degree percentile buckets

4.6 Community search performance for different truss values of k

4.7 (a) The summarized graph in EquiTruss for the DBLP four-area graph. Each super-node represents a k-truss community (7 ≤ k ≤ 27), and each super-edge depicts triangle-connectivity between super-nodes. (b) All k-truss communities (7 ≤ k ≤ 27) in the DBLP four-area graph

4.8 7-truss community and 8-truss community for the query Michael Stonebraker

5.1 A sample graph G

5.2 k-truss edges in the graph G in Figure 5.1

5.3 Truss-equivalence based index, TEQ, of G

5.4 Community search on TEQ after the first iteration with k = 4

5.5 Final result for community search on TEQ

5.6 Maximum spanning tree of the TEQ index of G in Figure 5.3

5.7 Rooted tree of the maximum spanning tree given in Figure 5.6

5.8 Query time to find maximum connected k-truss varying query size |Q| on DBLP

5.9 Total query time to find closest truss community varying query size |Q| on DBLP

5.10 Query time to find maximum connected k-truss varying query size |Q| on Facebook

5.11 Total query time to find closest truss community varying query size |Q| on Facebook

5.12 Query time to find maximum connected k-truss on all datasets

5.13 Total query time to find closest connected k-truss on all datasets

ABSTRACT

Modern science and technology have witnessed in the past decade a proliferation of complex data that can be naturally modeled and interpreted as graphs. In real-world networked applications, the underlying graphs oftentimes exhibit fundamental community structures supporting widely varying interconnected processes. Identifying communities may offer insight into how a network is organized. In this thesis, we work on community detection and community search problems on graph data.

Community detection (graph clustering) has become one of the most well-studied problems in graph management and analytics; its goal is to group the vertices of a graph into densely knitted clusters, with each cluster well separated from all the others. Classic graph clustering methods primarily take advantage of the topological information of graphs to model and quantify the proximity between vertices. With the proliferation of rich, heterogeneous graph contents widely available in real-world graphs, such as user profiles in social networks, it becomes essential to consider both the structures and the attributive contents of graphs for better-quality graph clustering. On the other hand, existing community detection methods focus primarily on discovering communities in an a priori, top-down manner with only reference to the input graph. As a result, all communities have to be exhaustively identified, incurring expensive time/space cost and a huge amount of fruitless computation if only a fraction of them are of special interest to end-users. On many real-world occasions, however, people are more interested in the communities pertaining to a given vertex.

In our first project, we work on the attributed graph clustering problem. We propose a graph embedding approach to cluster content-enriched, attributed graphs. The key idea is to design a unified latent representation for each vertex of a graph such that both the graph connectivity and the vertex attribute proximity within the localized region of the vertex can be jointly embedded into a unified, continuous vector space. As a result, the challenging attributed graph clustering problem is cast to the traditional data clustering problem.

In our second and third projects, we work on a query-dependent variant of community detection, referred to as the community search problem. The objective of community search is to identify dense subgraphs containing the query vertices. We study the community search problem in the truss-based model, aimed at discovering all dense and cohesive k-truss communities to which the query set Q belongs. We introduce a novel equivalence relation, k-truss equivalence, to model the intrinsic density and cohesiveness of edges in k-truss communities, and based on this equivalence we create two space-efficient, truss-preserving index structures, EquiTruss and TEQ. Community search for one query or multiple queries can thus be addressed upon EquiTruss and TEQ without repeated, time-demanding accesses to the original graph, G, which proves to be theoretically optimal. While the query set includes a single query vertex in our second project, it includes multiple query vertices in our third project.

In summary, to obtain better quality in attributed graph clustering, the attribute-aware cluster information is well preserved during graph embedding. While we use the SkipGram method for embedding, other embedding methods exist, and we can use them to study the effect of different embedding methods on attributed graphs. In addition, our index structures support community search on large graphs without considering attribute information. Using attribute information in addition to the structure may yield better communities for given query nodes, so we can extend our index structures to support community search on attributed graphs.

CHAPTER 1

INTRODUCTION

Modern science and technology have witnessed in the past decade a proliferation of complex data that can be naturally modeled and interpreted as graphs. Graphs are structured data representing relationships between objects [1, 22]. They are formed by a set of vertices (also called nodes) and a set of edges, which are connections between pairs of vertices. There are many real-life examples of graph data, such as (1) computer networks, consisting of routers/computers as nodes and the links between them as edges; (2) social networks, consisting of individuals and their interconnections, such as coauthorship and citation networks of scientists; and (3) protein interaction networks from biology, which link proteins that must work together to perform particular biological functions [68, 74].

If the vertices of a graph have a set of attributes describing their properties, such as interests, gender and education, we call the graph an attributed graph (AG). Such attributes can be numerical, categorical or of other data types. In a wide variety of real-world applications, graphs are modeled as AGs. For example, in IP communication networks, IP-nodes have multiple functionalities (e.g., DNS server, Web server, P2P client); in social networks such as Facebook and MySpace, user nodes have multiple attributes (e.g., education degree, mother tongue); in co-authorship networks, author-nodes are experts in particular topics (e.g., databases, data mining, and machine learning); and protein-protein interaction networks carry not only the interactions but also the gene expressions associated with the proteins [104, 115]. Figure 1.1 is an example of an attributed graph; it represents friendship relations, and the node attributes denote the education degree and programming language of individuals.
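To make the notion concrete, the following minimal Python sketch represents a small attributed graph with plain dictionaries: an adjacency set per vertex plus an attribute dictionary per vertex. The vertex names and attribute values are illustrative, not those of Figure 1.1.

```python
# A minimal, library-free representation of an attributed graph:
# adjacency sets for the structure, a per-vertex dictionary for attributes.
edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v3", "v4")]
attrs = {
    "v1": {"education": "PhD", "language": "Python"},
    "v2": {"education": "MSc", "language": "Python"},
    "v3": {"education": "PhD", "language": "Java"},
    "v4": {"education": "BSc", "language": "Java"},
}

adj = {v: set() for v in attrs}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Structure and attributes can then be queried side by side.
print(len(adj["v3"]))            # degree of v3 -> 3
print(attrs["v3"]["language"])   # attribute lookup -> Java
```

Attributed graph clustering methods, discussed below, must exploit both views of this data at once.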
In real-world network applications, the underlying graphs oftentimes exhibit fundamental community structures supporting widely varying interconnected processes [102, 57]. Identifying communities may offer insight into how the network is organized. In this thesis, we work on community detection and search problems on large graphs.

Figure 1.1: A Sample Attributed Graph

In this chapter, we first review the necessary terminology to facilitate discussion in the rest of the thesis. Then, we give a brief introduction to the community detection and community search problems. In the following sections, we briefly present the motivation and definitions of the specific problems studied in this dissertation. Then, we summarize our contributions for each problem and give pointers to the associated chapters.

1.1 Terminology

In this section, we give the necessary terminology and symbols to facilitate discussion in the rest of the thesis. Table 1.1 lists these symbols. The first and most common notation used in this thesis is G(V, E), which denotes an undirected graph G with the set of vertices V and the set of edges E. Secondly, n and m denote the number of elements in V and E, respectively. Moreover, N(vi) is the set of neighbors of the vertex vi, d(vi) is the degree of the vertex vi, eij is the edge between the i-th and j-th vertices, wij is the weight of the edge eij, Ci is the set of vertices in the i-th cluster, C = {C1, ..., Ck} is a clustering of G, nCi is the number of vertices in Ci, mCi is the number of edges in Ci, and cCi is the number of edges on the boundary of Ci. Lastly, m(C) is the set of intra-cluster edges and m̄(C) is the set of inter-cluster edges.

Table 1.1: Main symbols

Notation            Description
G(V, E)             an undirected graph G with the vertex set V and the edge set E
(n, m)              number of elements in V and E, respectively
N(vi)               the set of neighbors of vertex vi
d(vi)               the degree of the vertex vi
eij                 the edge between the i-th and j-th vertices
wij                 the weight of the edge eij
Ci                  the set of vertices in the i-th cluster
C = {C1, ..., Ck}   a clustering of G
nCi                 the number of vertices in Ci
mCi                 the number of edges in Ci
cCi                 the number of edges on the boundary of Ci
m(C)                the set of intra-cluster edges
m̄(C)                the set of inter-cluster edges
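As an illustration of the clustering-related quantities in Table 1.1, the following Python sketch computes nCi, mCi, cCi for one cluster, together with the intra-/inter-cluster edge sets m(C) and m̄(C). The toy graph and clustering are made up for this example.

```python
# Toy graph: two triangles joined by the bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
clusters = [{1, 2, 3}, {4, 5, 6}]

def cluster_stats(edges, C):
    n_C = len(C)                                            # n_Ci
    m_C = sum(1 for u, v in edges if u in C and v in C)     # m_Ci: intra edges
    c_C = sum(1 for u, v in edges if (u in C) != (v in C))  # c_Ci: boundary edges
    return n_C, m_C, c_C

# m(C): edges inside some cluster; m̄(C): edges crossing clusters.
intra = [e for e in edges if any(e[0] in C and e[1] in C for C in clusters)]
inter = [e for e in edges if e not in intra]

print(cluster_stats(edges, clusters[0]))  # (3, 3, 1)
print(len(intra), len(inter))             # 6 1
```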

1.2 Community Detection

In real-world networked applications, the underlying graphs oftentimes exhibit fundamental community structures supporting widely varying interconnected processes. Community detection has thus become one of the most well-studied problems in graph management and analytics. Community detection, also known as graph clustering, is the task of grouping the vertices of a graph with the objective of putting similar vertices into the same clusters, taking into account the topological structure of the graph, so that each cluster is composed of strongly connected vertices [102, 57, 20]. It has been studied for more than five decades, and a vast number of algorithms [50, 28, 100, 113, 69, 103, 67] have been proposed and widely used in many fields including social network analytics, document clustering, bioinformatics and others. Noteworthy examples include:

• In any communication network, graph clustering serves as a tool for analyzing, modeling and predicting the function, usage and evolution of the network [35].

• In the field of bioinformatics, graph clustering typically deals with the classification of gene expression data (specifically, gene-activation dependencies) and protein interactions [48]. An example application of graph clustering on biological networks is the identification of functionally related protein modules in large protein-protein interaction networks.

• Applications of graph clustering in social networks include identifying groups of individuals who have similar interests, such as identifying groups of scientists working together or working on similar topics from collaboration networks of scientists, identifying terrorist networks when a member is known, or locating potentially infected people when an infected and contagious individual is encountered [81].

There are two main approaches to graph clustering: topological graph clustering and attributed graph clustering. Topological approaches take only the edge structure of the graph into consideration and ignore the properties of the vertices. They detect densely connected groups of vertices with a high number of intra-cluster edges and relatively few inter-cluster edges, based on different criteria such as vertex connectivity or neighborhood similarity derived from the structure information in the graph. The core methods proposed in the literature optimize a quality function based on the density of clusters and their connectivity with the rest of the graph. The most popular quality functions are modularity, ratio cut, min-max cut and normalized cut [20]. Several surveys [81, 60, 65] have been written about these methods; we do not focus on them here and instead refer the reader to those surveys.
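As an example of such a quality function, modularity can be written per cluster as Q = Σ_C [ mC/m − (dC/(2m))² ], where m is the total edge count, mC the intra-cluster edge count of C, and dC the total degree of the vertices in C. A small sketch on a toy graph (the graph is illustrative, not from the thesis experiments):

```python
# Per-cluster modularity: Q = sum_C [ m_C / m - (d_C / (2m))^2 ].
def modularity(edges, clusters):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    Q = 0.0
    for C in clusters:
        m_C = sum(1 for u, v in edges if u in C and v in C)  # intra edges of C
        d_C = sum(deg[v] for v in C)                         # total degree of C
        Q += m_C / m - (d_C / (2 * m)) ** 2
    return Q

# Two triangles joined by a bridge, clustered along the bridge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(round(modularity(edges, [{1, 2, 3}, {4, 5, 6}]), 4))  # 0.3571
```

A good cut keeps most edges inside clusters relative to what a random graph with the same degrees would do; here Q = 5/14 ≈ 0.357 for the natural split.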

1.2.1 Attributed Graph Clustering

Different from topological approaches, attributed graph clustering approaches use both structure and attribute information to find similar groups of vertices in the graph. While most previous work in this area has focused on the analysis of graph structures, according to recent studies [13], the structure of a network may not be sufficient to find good communities in it, because clusters generated based on structure information alone tend to have a rather random distribution of vertex properties within clusters [113]. The similarities or differences in the content of nodes may affect the patterns of linking and grouping. Attributes of nodes offer valuable information about the entities that can improve the quality of the clusters. An ideal graph clustering should generate clusters that are dense subgraphs with cohesive intra-cluster structure and homogeneous vertex properties, by balancing the structural and attribute similarities. Many experimental studies confirm that clustering based on both structure and attribute data provides more meaningful clusters than methods that take into account only one type of data [21]. For instance, coauthor networks include vertices that represent authors, with attributes such as primary topic and publication count, and edges that represent the coauthor relationship between authors or co-participation in the same events. In order to detect strongly connected clusters containing individuals with similar research interests, we have to exploit both the attributes associated with each person and the relationships between the members of the network [20].

Figure 1.2: Structure-based Clustering

Figure 1.3: Attribute-based Clustering

As a solution to this, attributed graph clustering approaches consider both structure and attribute information to get better clustering results.

Example 1.1. Figures 1.2, 1.3 and 1.4 show different clustering results on the attributed graph given in Figure 1.1, based on different approaches. Figure 1.2 shows a clustering result based on vertex connectivity, and Figure 1.3 shows another clustering result based on attribute similarity.

Figure 1.4: Structural/Attribute Clustering

Figure 1.4 shows the clustering result based on both structure and attribute information. This clustering result balances the structural and attribute similarities: persons within one cluster are closely connected; meanwhile, they are homogeneous on their attributes.

Attributed graph clustering has found a wide range of real-world applications as it has the potential to yield more informative and better-quality clustering results due in particular to a joint consideration of both structure and attributive content information of the underlying graphs. Some noteworthy examples are outlined as follows:

1. In social networks such as Facebook, LinkedIn, and Google+, users and their friendship relations constitute the underlying social graphs. Additionally, each user is characterized by a series of attributes depicting one's personal profile, including age, education, occupation, location, and hobbies. Clustering social network users by tackling both their social relationships and personal profiles is particularly useful for social targeting and personalized recommendation [59, 34, 113];

2. In protein-protein interaction (PPI) networks, the proteins are often annotated with biological attributes, such as functional classifications, gene ontology (GO) terms, and gene expression profiles. Clustering PPI networks by leveraging both the network topologies and protein attributes can significantly facilitate the identification process of protein complexes and pathways [41, 14];

3. The web graph consists of web pages interwoven by hyperlinks. Each web page is also characterized by a series of attributes including URL, name, keywords, contents, tags, and so forth. When both hyperlinks and web page attributes are considered for web community discovery, the results are often more informative than those identified based solely on web structures [77, 29, 39].

Unfortunately, attributed graph clustering turns out to be significantly more challenging than the classic graph clustering problem, where only the graph structure information is considered [13]. The main reason is that topological structures and graph attributes are two completely different types of information pertaining to graphs. Clustering based solely on one type of information may lead to inaccurate, or even contradicting, graph clusters [114]. Consequently, the key challenge is to synergistically incorporate both the structure and attribute information of graphs, striking the right balance between the two distinct objectives of clustering based on graph structures or on attributes, toward achieving more informative and better-quality graph clusters.

In our first project, we introduce a novel approach for attributed graph clustering considering both graph structure cohesiveness and attribute homogeneity. The key idea is to design a unified latent representation for each vertex u of a graph such that both the graph connectivity and the vertex attribute proximity within the localized region of u can be jointly embedded into a unified, continuous vector space. As a result, the challenging attributed graph clustering problem is cast to the traditional data clustering problem in d-dimensional space, in which the graph structure cohesiveness and the attribute homogeneity are approximately preserved. The main contributions of our work are summarized as follows:

1. We propose a novel, attribute-aware graph embedding framework for attributed graph clustering. It provides a natural and principled approach to encode the localized structure and attribute information of vertices into a unified, latent representation in a low-dimensional space, within which the graph structure cohesiveness and vertex attribute homogeneity can be well preserved. This framework also establishes a general graph embedding approach to tackling attributed graphs in widely varying application domains (Section 3.2);

2. We design an efficient and cost-effective graph embedding algorithm that transforms an attributed graph into its vertex-based, latent representation, which is further fed as input to any data clustering method for attributed graph clustering (Section 3.3);

3. We perform experimental studies for our method on a series of real-world and synthetic graphs in comparison with state-of-the-art attributed graph clustering techniques. Experimental results demonstrate that our method outperforms existing algorithms in terms of both graph structure cohesiveness (w.r.t. graph density) and attribute homogeneity (w.r.t. entropy) in the resultant graph clusters (Section 3.4).
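As a rough illustration of the first step of such a framework, the sketch below folds vertex-attribute agreement into edge weights of a transformed graph, in the spirit of Figure 3.1(b). The weighting rule (a base structural weight of one, plus one per shared attribute value) is a simplifying assumption for illustration, not the dissertation's exact formulation; the attribute names and values are likewise hypothetical.

```python
# Step 1 of an attribute-aware embedding pipeline: convert attribute
# proximity into edge weights, so a later structural embedding also
# respects attribute homogeneity.
attrs = {
    "a": {"edu": "PhD", "lang": "Python"},
    "b": {"edu": "PhD", "lang": "Python"},
    "c": {"edu": "BSc", "lang": "Java"},
}
edges = [("a", "b"), ("b", "c")]

def attribute_weight(u, v):
    # base weight 1, plus one for each attribute the endpoints share
    shared = sum(attrs[u][k] == attrs[v][k] for k in attrs[u])
    return 1 + shared

weighted = {(u, v): attribute_weight(u, v) for u, v in edges}
print(weighted)  # ("a", "b") gets weight 3; ("b", "c") gets weight 1
```

The weighted graph could then be fed to a random-walk based embedding (e.g., SkipGram over walks biased by these weights) followed by an off-the-shelf clusterer such as k-means, mirroring the pipeline outlined above.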

1.3 Community Search

Existing community detection methods focus primarily on discovering communities in an a priori, top-down manner with only reference to the input graph. As a result, all communities have to be exhaustively identified, thus incurring expensive time/space cost and a huge amount of fruitless computation if only a fraction of them are of special interest to end-users. On many real-world occasions, however, people are more interested in the communities pertaining to a given vertex. For example, in a social network, a user is typically more curious about the communities she participates in rather than all the communities of the entire graph [85]. This query-dependent variant of community detection is usually referred to as the community search problem, the objective of which is to identify dense subgraphs containing the query vertex [30, 31, 44, 46, 22, 26, 58, 85]. Community search admits an online, bottom-up process for community detection, desirable especially in real-world, large graphs. Furthermore, it opens the door to personalized community discovery, and has thus found a wide range of applications in expert recommendation and team formation [111, 6], personal context discovery [17, 61], collaborative tagging [70], social contagion modeling [95], and gene/protein regulation [85]. In our second and third projects, we consider community search based on the trussness of a subgraph, which is the minimum edge support plus 2 [18]. Given a graph G, a k-truss (k ≥ 2) is the largest subgraph of G with each constituent edge contained in at least (k − 2) triangles. Numerous dense-subgraph notions have been proposed thus far toward modelling real-world community structures, including clique or quasi-clique [25, 93], k-core [51, 22, 78, 7, 85], k-truss [44, 47, 83, 98], nucleus [79, 80], and k-plex [10], to name a few.
Our choice of k-truss as the underlying community model is influenced by the following important facts: (1) as opposed to primitive vertices/edges, the higher-order graph motif, the triangle, is exploited as a building block to quantify the strong and stable relationships in communities [9]. As a result, the high density and cohesiveness of real-world communities can be encoded in a k-truss with strong theoretical guarantees (more details will be elaborated in Section 2.3); (2) k-truss enables a comprehensive modelling of multiple, overlapping communities in G. By tuning the parameter k, we can derive a collection of k-truss communities that form an inclusive, dense-graph hierarchy representing the cores of G at varied levels of granularity [79]; (3) discovering all k-trusses from G is polynomially tractable [98], while most existing dense-graph models render the community search problem NP-hard [46, 10, 22]. As a consequence, k-truss has been extensively employed for community search in real-world networked applications [44, 45, 46, 47].
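The k-truss definition above leads directly to a simple peeling algorithm: repeatedly delete edges whose support (the number of triangles containing them) falls below k − 2 until none remain. A self-contained sketch, written for clarity rather than efficiency:

```python
from itertools import combinations

def k_truss(edges, k):
    """Return the edge set of the k-truss of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if u < v:
                    # support = number of triangles containing edge (u, v)
                    support = len(adj[u] & adj[v])
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# A 4-clique is a 4-truss: every edge lies in at least 2 triangles,
# while a pendant edge such as (4, 5) is peeled away.
clique = list(combinations([1, 2, 3, 4], 2))
print(sorted(k_truss(clique + [(4, 5)], 4)))
```

Real implementations compute supports once and peel edges with a bucket or priority queue, rather than re-scanning as above; the quadratic loop here is only for readability.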

1.3.1 Truss Based Community Search

In our second project, we consider community search based on the k-truss community model defined in [44]. Given a fixed value of k, there exists only one k-truss in G due to its maximality, but a k-truss is not necessarily a connected graph. Therefore, the classic definition of k-truss is not suitable to directly model real-world communities that are both densely and cohesively connected. To tackle this issue, an edge connectivity constraint is further imposed upon the k-truss community model: any two edges in a community either belong to the same triangle, or are reachable from each other through a series of adjacent triangles. This edge connectivity requirement ensures that a discovered community is connected and cohesive.

Figure 1.5: Truss-based Communities for vertex v7 in G.

Example 1.2. Figure 1.5(a) presents a toy graph G. Given a query vertex v7, the k-truss communities (k = 3, 4, 5) containing v7 are illustrated in Figure 1.5(b). For instance, when k = 5, each edge in the 5-truss community is contained in at least 3 different triangles. By tuning the value of k, we generate a series of dense and cohesive community structures pertaining to v7.
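The triangle-connectivity constraint described above can be realized with a union-find structure over edges that merges the three edges of every triangle; the resulting equivalence classes are the triangle-connected communities. A minimal sketch, assuming the input edges of the k-truss are given as sorted pairs (u < v):

```python
def truss_communities(truss_edges):
    """Group k-truss edges into triangle-connected communities."""
    parent = {e: e for e in truss_edges}

    def find(e):                     # union-find with path halving
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    def union(a, b):
        parent[find(a)] = find(b)

    adj = {}
    for u, v in truss_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def norm(x, y):
        return (x, y) if x < y else (y, x)

    for u, v in truss_edges:
        for w in adj[u] & adj[v]:    # (u, v, w) forms a triangle
            union((u, v), norm(u, w))
            union((u, v), norm(v, w))

    groups = {}
    for e in truss_edges:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())

# Two triangles meeting only at vertex 3 are *not* triangle-connected,
# so they form two separate communities.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
print(len(truss_communities(edges)))  # 2
```

This is the intuition behind grouping edges by shared triangles; the indexes described later in this thesis organize such classes far more compactly.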

Searching k-truss communities based simply on the definition becomes immediately prohibitive, as it incurs a lot of random exploration and wasteful edge accesses in the graph. The state-of-the-art solution, TCP-Index [44], indexes the pre-computed k-trusses in a series of maximum spanning trees (MSTs) for community search. Unfortunately, each edge of G might be maintained in multiple MSTs, making TCP-Index redundant and excessively large, typically several times larger than G. On the other hand, k-truss communities have to be reconstructed online from MSTs during community search, thus involving costly, repeated accesses to G and making community search extremely inefficient, especially in real-world, massive graphs.

We introduce a novel equivalence relation, k-truss equivalence, to model the intrinsic density and cohesiveness of edges in k-truss communities. Consequently, all the edges of G can be partitioned into a series of k-truss equivalence classes that constitute a space-efficient, truss-preserving index structure, EquiTruss. Community search can thus be addressed upon EquiTruss without repeated, time-demanding accesses to the original graph, G, which proves to be theoretically optimal. In addition, EquiTruss can be efficiently maintained in a dynamic fashion when G evolves with edge insertions/deletions. Experimental studies on real-world, large-scale graphs validate the efficiency and effectiveness of EquiTruss, which achieves at least an order of magnitude speedup in community search over the state-of-the-art method, TCP-Index. We summarize the contributions of EquiTruss as follows:

• We introduce a novel notion, k-truss equivalence, to capture the intrinsic relationship of edges in truss-based communities. Based on this new concept, we can partition any graph G into a series of truss-preserving equivalence classes for community search (Section 4.2);

• We design and develop a truss-equivalence based index, EquiTruss, that is space-efficient, cost-effective, and amenable to dynamic changes in the graph G. More importantly, community search can be performed directly upon EquiTruss without costly revisits to G, which is theoretically optimal (Section 4.3);

• We carry out extensive experimental studies in real-world, large-scale graphs, and compare EquiTruss with the state-of-the-art solution, TCP-Index. Experimental results demonstrate that EquiTruss is smaller in size, faster to be constructed and maintained, and admits at least an order of magnitude speedup for community search in large graphs (Section 4.4);

1.3.2 Approximate Closest Truss based Community Search

The k-truss community model defined in the previous section works well to find all overlapping communities containing a single query node q. However, in real applications, searching for com- munities containing a set of query nodes is more common. When we try to extend this k-truss community model for multiple query nodes, some limitations may occur. Since the triangle con- nectivity constraint in this model is very strict and may result in failing to discover any community for query nodes. In general, there are also other community models based on different measures and constraints. Some of them may suffer from the“free rider effect” formally defined and studied in [85]. If the detected community for query nodes may include irrelevant subgraphs as a result of its definition, such irrelevant subgraphs are referred as free riders. For example, if we use the classic density definition(|E|/|V |) as the community goodness metric, including denser part of the graph into the community will increase the overall density. However, this denser part may be too far from query nodes and may not be relevant to query nodes. In our third project, we study the problem of the closest community search, i.e. given a set of query nodes, find a densely connected subgraph that contains the query nodes in which nodes are close to each other. We consider community search based on the k-truss. In addition, we use graph diameter to measure the closeness of all nodes in the community to ensure every node included in the community is tightly related to query nodes and other nodes included in the community. Thus, based on k-truss and graph diameter, we use the closest truss community (CTC) model [46], which requires that all query nodes are connected in this community and the graph structure is a k-truss with the largest trussness k. 
The problem is defined as finding the closest truss community (CTC): a connected k-truss subgraph with the largest k that contains Q and has the minimum diameter among all such subgraphs. In [46], it is proven that CTC is NP-hard and that it is NP-hard to approximate within a factor of (2 − ε), for any ε > 0. The proof is given in Section 5.2.3.
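As background for the trussness values used throughout, the trussness of every edge can be computed by the standard support-peeling procedure from the truss decomposition literature. The following is a minimal, unoptimized sketch of that procedure, not the indexed algorithms developed later in this dissertation:

```python
from collections import defaultdict

def trussness(edges):
    """Compute the trussness of every edge by iterative support peeling.

    An edge's support is the number of triangles it lies in; a k-truss is a
    subgraph whose edges all have support >= k - 2 within the subgraph.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    support = {frozenset((u, v)): len(adj[u] & adj[v]) for u, v in edges}
    truss, k = {}, 2
    while support:
        peel = [e for e, s in support.items() if s <= k - 2]
        if not peel:
            k += 1          # no edge can be peeled: move to the next k
            continue
        for e in peel:
            u, v = tuple(e)
            truss[e] = k
            # every triangle through (u, v) loses one edge: update its wings
            for w in adj[u] & adj[v]:
                for f in (frozenset((u, w)), frozenset((v, w))):
                    if f in support:
                        support[f] -= 1
            adj[u].discard(v)
            adj[v].discard(u)
            del support[e]
    return truss
```

For instance, every edge of a triangle receives trussness 3, and every edge of a complete graph on four vertices receives trussness 4, matching the containment hierarchy of k-trusses.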

Example 1.3. Figure 1.6(a) presents a sample graph G. Given the query vertices 2 and 4, the maximum k-truss community containing them is illustrated in Figure 1.6(b). Since the maximal k is 4, it includes all edges in the graph whose truss value is larger than 3. This 4-truss suffers from the free rider effect, since it includes nodes far away from the query nodes, such as 10 and 11.

Figure 1.6: Closest Truss Community for vertices 2 and 4 in G.

On the other hand, the subgraph without them is also a 4-truss; it has the smallest diameter among all 4-trusses containing the query nodes and does not suffer from the free rider effect, as we can see in Figure 1.6(c).

As a basic approach, Huang et al. [46] propose a two-step algorithm. They first find a connected k-truss G0 with the largest k (kmax) containing the given query nodes in polynomial time. Then, they iteratively remove faraway nodes from it to obtain the closest truss community. This method achieves a 2-approximation to the optimal solution. To speed up the pruning process, they develop an optimization, bulk deletion, which achieves quicker termination at the cost of a slightly worse approximation ratio: instead of removing one faraway node at each step, they remove all nodes whose distance equals that of the farthest node. The first drawback of this algorithm is that finding the maximal connected k-truss on large graphs is time-consuming, since connectivity must be checked at each step while maximizing k. Furthermore, the result may be too large, so removing faraway nodes may take a long time. Differently from these methods, they also propose a heuristic, local exploration: a Steiner tree of the query nodes is found in the graph G and then expanded into a k-truss by exploring its local neighborhood. While this improves efficiency, the trussness of the community decreases too much. To solve these problems, similar to the previous project, we propose a truss equivalence based index, TEQ, which we use to find the connected k-truss with the largest k (kmax). We remove the triangle connectivity constraint from the previous truss equivalence definition and keep only the connectivity constraint. While searching for a community on the index graph is more efficient than searching on the original large graph, it may still take time to find kmax while connecting the query nodes. As a further improvement, we convert our index graph TEQ into a Maximum Spanning Tree (MST) based on the trussness of the supernodes. The MST has fewer edges and a single path of maximum trussness between any two supernodes, so BFS on the index tree is more efficient than on the index graph. It also makes it easy to find the maximum value of k (kmax) for the maximum connected k-truss: to find kmax, we just need to find the path between the query nodes on the MST; the minimum truss value of the edges on that path is kmax. We also convert the MST into a rooted tree to find the path more efficiently.
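The kmax lookup described above amounts to a bottleneck-path query on the maximum spanning tree. A minimal illustration follows, assuming the tree is already built as an adjacency list whose edge weights are trussness values; the data layout and names are illustrative, not the actual TEQ structures:

```python
from collections import deque

def kmax_on_tree(tree, q1, q2):
    """Largest k such that q1 and q2 lie in a common connected k-truss,
    read off as the minimum edge trussness on the unique q1-q2 tree path.

    tree: dict mapping node -> list of (neighbor, trussness) pairs of a
    maximum spanning tree built over the index graph.
    """
    parent = {q1: (None, None)}        # node -> (parent, edge trussness)
    queue = deque([q1])
    while queue:                       # BFS until q2 is reached
        u = queue.popleft()
        if u == q2:
            break
        for v, t in tree[u]:
            if v not in parent:
                parent[v] = (u, t)
                queue.append(v)
    kmax, node = float("inf"), q2      # walk back, tracking the bottleneck
    while parent[node][0] is not None:
        node, t = parent[node]
        kmax = min(kmax, t)
    return kmax
```

Because a tree has exactly one path between any two nodes, a single BFS plus a backward walk suffices, rather than a search over all paths in the index graph.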

While our tree index structure improves the efficiency of finding G0, G0 may still be too large, so the second step of the algorithm, removing faraway nodes, may still take a long time. We propose a new algorithm, Minimal, to eliminate free riders early and improve the efficiency of this step. Since the maximal graph includes more free riders, we instead find a minimal graph with maximum k: we include only the edges that connect the query nodes while maximizing k, rather than all edges whose truss value is greater than or equal to k. In this way, we include fewer nodes and fewer free riders, so removing them in the second step is faster. We summarize our contributions in this project as follows:

• We design and develop a truss equivalence based index, TEQ, that is space- and cost-efficient and supports truss based community search without costly revisits to the original large graph (Section 5.3);

• We also create a Maximum Spanning Tree of the index graph TEQ according to the trussness of the supernodes. The MST has fewer edges, so community search on it is more efficient; it also supports finding kmax efficiently (Section 5.4.2);

• We propose another algorithm, Minimal, which finds a minimal connected k-truss while maximizing k. It includes fewer free riders and improves the efficiency of removing them (Section 5.4.3);

• We carry out experimental studies on real-world graphs and compare our results with the state-of-the-art solutions Basic, BD and LCTC proposed in [46]. Experimental results demonstrate that MST TEQ is faster than Basic at finding the maximal connected k-truss, and that Minimal is faster than BD at finding the closest truss community while achieving higher trussness than LCTC, as it attains kmax (Section 5.5).

CHAPTER 2

RELATED WORK

2.1 Attributed Graph Clustering

Attributed graph clustering uses both structure and attribute information to find clusters in graphs. These clusters are groups of nodes that are densely connected and also highly similar in their attributes. Various graph clustering approaches have been proposed to utilize content information in addition to the structure information of graphs. We categorize these approaches into four groups: (1) approaches that convert an attributed graph to a weighted graph, (2) distance-based approaches, (3) model-based approaches, and (4) subspace-based approaches; a brief summary of selected approaches in each group is presented. The first category includes approaches based on converting the original attributed graph to a weighted graph, such as FocusCO [72]: node attributes are removed from the nodes by storing their information on the edges of the graph, assigning the attribute similarity value between two nodes to their edge as its weight. The second category includes distance-based approaches such as SA-Clustering [113], SI-Cluster [115] and CODICIL [77]: structure information is stored in a similarity (distance) function between nodes and combined with an attribute similarity (distance) function. The third category is model-based approaches; these include, but are not limited to, PCL-DC [109], a Bayesian probabilistic model [105] and CESNA [107], which are based on probabilistic models that avoid the artificial design of a distance measure. The fourth category is subspace clustering approaches; selected methods in this category include CoPaM [64], GAMer [37], DB-CSC [38] and SSCG [36], which identify clusters only in the context of their own relevant features, a subset of all node attributes, especially for high-dimensional data.

2.1.1 Approaches which convert an attribute graph to a weighted graph

Attribute similarity between the nodes of a graph may reflect the strength of the relationship between them. Hence, node attributes are removed from the nodes by storing their information on the edges of the graph: the attribute similarity between two nodes is assigned to the edge between them as its weight. After representing the graph as a weighted graph, different graph clustering algorithms that take edge weights into consideration can be applied. If we keep the edges with high weights during the clustering process, we obtain groups of nodes with similar attribute values. In [67], the matching coefficient metric is used to compute node similarity; objects that are not directly related by an edge in the graph have a similarity of zero regardless of their attribute values. Three different clustering algorithms, Karger's Min-Cut [49], MajorClust [86] and spectral clustering with a normalized cut objective function [84], are applied after weighting the existing link graph with attribute similarities. In [88], the matching coefficient computation is extended to take both discrete and continuous attributes into account, and all nodes connected by edges whose weights are greater than a threshold t are clustered together. In addition to these works, Cruz et al. [24] note that not all attributes may be relevant for clustering, so a machine learning approach known as the self-organizing map (SOM) is applied to find the latent information used to compute the similarity between nodes. The Louvain method [11] is then used to cluster the obtained weighted graph.
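The matching-coefficient weighting step described above can be sketched as follows; attributes are assumed to be fixed-length tuples of categorical values, and only existing edges receive a weight:

```python
def matching_coefficient(attrs_u, attrs_v):
    """Fraction of (categorical) attributes on which two nodes agree."""
    shared = sum(1 for a, b in zip(attrs_u, attrs_v) if a == b)
    return shared / len(attrs_u)

def weight_edges(edges, attrs):
    """Assign an attribute-similarity weight to every existing edge.

    Node pairs without an edge implicitly keep similarity zero, exactly
    as in the matching coefficient approaches described above.
    """
    return {(u, v): matching_coefficient(attrs[u], attrs[v])
            for u, v in edges}
```

Thresholding these weights at some t, as in [88], then groups nodes joined by high-weight edges.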

FocusCO. Most graph clustering algorithms partition the whole graph into groups without considering user preferences or applications. However, users might be concerned with only a few of the available attributes. To overcome this limitation, the novel problem of focused clustering and outlier detection (FocusCO) in attributed graphs is introduced: extracting only the (type of) clusters pertaining to a user's interest rather than partitioning the whole graph [72]. To this end, some attributes, called focus attributes, are selected through a set of user-provided exemplar nodes using an optimization based on the Mahalanobis distance: the relevance weights βu of the node attributes that make the exemplar nodes Cex similar to each other are learned. Edge weights are then computed as the weighted similarity of their end nodes based on the relevance weights, and the graph G0 is induced on the edges with large weights. Then, focused clusters C are extracted from G that are (1) structurally dense and (2) consistent with the focus attributes.

In addition to finding clusters, outlier nodes are identified for each focused cluster. An outlier node is defined as a node that is densely connected to the cluster members but significantly different from them with respect to the focused attributes.

2.1.2 Distance-based Approaches

The straightforward idea for attributed graph clustering is to define a vertex-wise distance metric that takes into account both the structure and attribute information of vertices in a graph. For instance, the differences in vertex attribute values can be quantified as distances between neighboring vertices [87]. Textual web contents and hyperlinks have also been combined in a similarity measure for web page clustering [39]. Different similarity (distance) measures have been proposed for this purpose, and classic distance-based clustering methods can be applied to graph data using them. For an attributed graph, some studies combine an attribute similarity function and a structure similarity function into one equation with a weighting factor [19, 97]. As an example, Equation 2.1 is defined in [19] to compute the similarity between nodes, where dT(i, j) and dS(i, j) are the attribute and structure similarity between node i and node j, respectively.

dTS(i, j) = α · dT(i, j) + (1 − α) · dS(i, j)    (2.1)

Villa-Vialaneix et al. [97] propose the kernel-based similarity in Equation 2.2: a multi-kernel similarity function combining structure and attribute similarity. In the equation, K0(i, j) is the kernel measuring structural similarity, Kd(ci^d, cj^d) is the kernel measuring attribute similarity, ci^d is the d-th attribute of node i, and αd is the weighting factor for the d-th attribute.

KT(i, j) = α0 · K0(i, j) + Σd αd · Kd(ci^d, cj^d)    (2.2)

An extension of the Louvain method [11] is proposed in [27] by including the similarity of attributes in the modularity computation. The extended modularity function is given in Equation 2.3, where C is the set of clusters, S(i, j) is the link strength as in the original modularity, and simA(i, j) is the attribute similarity function.

Q = Σ_{Ci∈C} Σ_{i,j∈Ci} (α · S(i, j) + (1 − α) · simA(i, j))    (2.3)

These are parametric methods, and how to choose α is an important question for them. In addition to these distance computation methods, walk models are used to compute vertex distances in graphs. In [32], the connected k-center problem is proposed: the attributed graph is divided into connected components whose nodes have similar feature vectors, using a simple breadth-first search (BFS) as the walk strategy. After selecting initial centers, nodes are assigned to these centers based on the attribute distance between them. It is not guaranteed that the discovered clusters are dense and connected.
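For concreteness, the parametric combination of Equation 2.1 is a convex combination of the two distance functions. A small sketch follows; the choice of α is exactly the open tuning question noted above, and the default shown is arbitrary:

```python
def make_combined_distance(d_T, d_S, alpha=0.5):
    """Build the node-pair distance of Equation 2.1,
    dTS(i, j) = alpha * dT(i, j) + (1 - alpha) * dS(i, j),
    from an attribute distance d_T and a structure distance d_S.
    """
    def d_TS(i, j):
        return alpha * d_T(i, j) + (1 - alpha) * d_S(i, j)
    return d_TS
```

Any classic distance-based clustering algorithm can then be run with d_TS in place of a single-source distance.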

Structural/Attribute Clustering. Some researchers have also explored ways to augment the underlying network to take content information into account. The Structural/Attribute Clustering (SA-Cluster) algorithm is proposed by Zhou et al. [113, 114]. They transform a graph into an augmented graph with new, artificial attribute vertices representing distinct vertex attribute values. A new attribute edge (u, u0) is created that links an original vertex u and a newly created attribute vertex u0 if u has an attribute whose value is equal to u0. After this transformation, vertices sharing the same attribute values are connected via common attribute vertices. A random walk-based distance measure is then defined on the augmented graph to estimate the closeness of vertices in terms of both structural cohesiveness and attribute proximity. Given l as the maximum length a random walk can go and c ∈ (0, 1) as the restart probability, the neighborhood random walk distance d(vi, vj) from vi to vj is defined as

d(vi, vj) = Σ_{τ: vi⇝vj, length(τ)≤l} p(τ) · c · (1 − c)^length(τ)    (2.4)

where τ is a path from vi to vj of length length(τ), with transition probability p(τ), the probability of transitioning from one vertex to another along τ.

A structure edge (vi, vj) ∈ E is of a different type from an attribute edge (vi, vjk) ∈ Ea. Therefore, their contributions to the neighborhood random walk distance may differ in degree, and the transition probabilities through a structure edge and an attribute edge may be different. The transition probability from vertex vi to vertex vj through a structure edge is

p(vi, vj) = w0 / (|N(vi)| · w0 + w1 + w2 + ... + wm)  if (vi, vj) ∈ E, and 0 otherwise.    (2.5)

Similarly, the transition probability from vi to vjk through an attribute edge is

p(vi, vjk) = wj / (|N(vi)| · w0 + w1 + w2 + ... + wm)  if (vi, vjk) ∈ Ea, and 0 otherwise.    (2.6)

The transition probability from an attribute vertex vjk back to an original vertex vi through an attribute edge is

p(vjk, vi) = 1 / |N(vjk)|  if (vjk, vi) ∈ Ea, and 0 otherwise.    (2.7)

Finally, the transition probability between any two attribute vertices vip and vjq is 0 since there is no edge between them.

A weight wi is assigned to each edge to indicate its importance. As different attributes may have different degrees of importance, a weight wi, initialized to 1.0, is assigned to the attribute edges corresponding to attribute ai. The attribute edge weights {w1, ..., wm} are updated in each iteration of the clustering process to reflect the importance of the different attributes. The K-Medoids clustering method is used as the clustering framework: after selecting the most centrally located vertex in a cluster as a centroid, the remaining vertices are assigned to their closest centroids. The degrees of contribution of structural and attribute similarity, and the attribute weights indicating their importance, are learned automatically. The algorithm iterates until it converges. Although good results are obtained with respect to density and entropy measures, this algorithm is computationally expensive because (1) the resultant augmented graph can be excessively large if the number of distinct vertex attribute values is high, and (2) the random walk distance matrix needs to be recalculated in each iteration of the clustering process. Since the random walk distance calculation involves matrix multiplication, which has a time complexity of O(n^3), the repeated random walk distance calculation incurs a non-trivial computational cost in SA-Cluster [113]. Although an improved version has been proposed to support incremental computation of the random-walk distance matrix [114], SA-Cluster is still hard to scale up to real-world large-scale attributed graphs.
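The transition rules of Equations 2.5-2.7 can be sketched directly. The data layout below (adjacency sets for structure and attribute edges, plus a map from each attribute vertex to the index of the attribute it represents) is an illustrative assumption, not SA-Cluster's actual implementation:

```python
def transition_prob(u, x, structure_adj, attribute_adj, attr_of, w):
    """One-step transition probability on the augmented graph,
    following Equations 2.5-2.7 (variable names are illustrative).

    structure_adj: original vertex -> set of structure neighbors (E)
    attribute_adj: vertex -> set of neighbors via attribute edges (Ea)
    attr_of: attribute vertex -> index j of the attribute it represents
    w: [w0, w1, ..., wm], the structure and attribute edge weights
    """
    if u in structure_adj:                     # u is an original vertex
        denom = len(structure_adj[u]) * w[0] + sum(w[1:])
        if x in structure_adj[u]:              # Eq 2.5: structure edge
            return w[0] / denom
        if x in attribute_adj.get(u, set()):   # Eq 2.6: attribute edge
            return w[attr_of[x]] / denom
        return 0.0
    # u is an attribute vertex: Eq 2.7, uniform over its Ea neighbors
    return 1.0 / len(attribute_adj[u]) if x in attribute_adj[u] else 0.0
```

Raising the resulting transition matrix to powers up to l and accumulating with the restart factor c(1 − c)^length, as in Equation 2.4, yields the neighborhood random walk distance.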

Social Influence based Clustering Algorithm. Differently from other approaches, a social influence based clustering framework for analyzing heterogeneous information networks is presented in [115]. Heterogeneous information networks include different types of entities that have static attributes and are interconnected through heterogeneous types of links representing different kinds of semantic relations. A social influence (SI) based graph clustering algorithm is developed that captures not only the complex attributes of people (vertices) in a social collaboration network, but also the nested and complex relationships between people and other types of entities in different information networks. A novel social influence based vertex similarity metric is introduced in terms of both a self-influence similarity, which measures self-influence vertex closeness on a social graph, and co-influence similarities on influence graphs. A heat diffusion model is used as the distance measure, since heat always flows from an object with high temperature to an object with low temperature; in a large social graph SG, experts with many publications often influence other, later authors, so the spread of influence resembles heat diffusion. An iterative learning algorithm, SI-Cluster, is proposed to dynamically refine the k clusters by continuously quantifying and adjusting the weights on the self-influence similarity and the multiple co-influence similarity scores until clustering convergence.

CODICIL. A novel approach, CODICIL (efficient community detection in large networks using content and links), is presented to link structure with content information efficiently and effectively. It simplifies the graph by identifying and retaining the edges that are important according to both content and graph topology; the simplified graph can then be clustered by any general graph clustering algorithm without considering content. As the first step, content edges are created between nodes that have similar attributes. A unified edge set εu (εu = εc ∪ εt) is obtained by combining the created content edges εc with the structure edges εt of the graph. From the unified edge set, a sampled edge set of edges relevant in local neighborhoods is extracted, using the topological similarity between nodes. In the last step, the simplified graph composed of these sampled edges is given to any fast, content-insensitive standard graph clustering algorithm, such as METIS [50] or Markov clustering [28], which partitions the vertices into a given number of clusters.
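CODICIL's first step, building content edges and uniting them with the structure edges, might look as follows. The use of cosine similarity and a top-k neighbor rule here are illustrative choices, and the brute-force nearest-neighbor search stands in for the more efficient scheme used in the actual method:

```python
import math

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def unified_edge_set(structure_edges, attrs, k=2):
    """Build a unified edge set eps_u = eps_c U eps_t: content edges
    eps_c link each node to its k most attribute-similar nodes, and
    eps_t is the given structure edge set.
    """
    content = set()
    for u in attrs:
        sims = sorted(((cosine(attrs[u], attrs[v]), v)
                       for v in attrs if v != u), reverse=True)
        for _, v in sims[:k]:
            content.add(frozenset((u, v)))   # undirected content edge
    return content | {frozenset(e) for e in structure_edges}
```

The unified set would then be sampled by local topological similarity before being handed to a standard graph clustering algorithm.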

Comparison of the distance-based approaches. SA-Cluster and CODICIL analyze information networks with a single type of node carrying attributes and links; in contrast, SI-Cluster analyzes heterogeneous information networks, which include different types of nodes with attributes and links. SA-Cluster inserts attribute vertices into the graph and attribute edges between the attribute vertices and the original vertices, while CODICIL inserts attribute edges only between original vertices of the graph. SA-Cluster designs a random walk based distance measure, and SI-Cluster proposes a novel distance measure based on a heat diffusion model; both then use the K-Medoids framework as a distance-based clustering algorithm to cluster the graph. Unlike these methods, CODICIL applies an existing graph clustering algorithm to the simplified graph. SA-Cluster and CODICIL are computationally too expensive, which limits their application to large graphs.

2.1.3 Model-based Approaches

Another stream of related work for attributed graph clustering builds primarily upon generative probabilistic models, within which graph structure and vertex attribute information are correlated with a set of shared, hidden variables of cluster membership [108, 104, 5, 110, 40]. This approach avoids the artificial design of a distance measure. A brief summary of selected model-based approaches is given below. GBAGC [106, 104] is a Bayesian probabilistic model in which the cluster label of each vertex of the graph is explicitly represented as a hidden variable. A joint probability distribution is defined and estimated over the space of all possible clusterings of the attributed graph, and a variational inference algorithm is developed to find the posterior distribution with the highest probability. The clustering problem is thereby cast as a standard probabilistic inference problem: finding the clustering with the highest probability.

Yang et al. [109] argue that generative probabilistic models have two shortcomings: (1) link patterns are usually affected not only by communities but also by other factors, such as the popularity of a node (i.e., how likely the node is to be cited by other nodes), so community membership by itself is insufficient to model links; (2) the content information often includes irrelevant attributes, so a generative model without feature selection usually performs poorly. To address these problems, an alternative discriminative probabilistic model is introduced that incorporates content information into the conditional link model and estimates community membership directly. This model combines a popularity conditional link (PCL) model and a discriminative content (DC) model into a unified model (PCL-DC). The link probability between two nodes is decided by the nodes' popularity as well as by their community membership, which is in turn decided by content terms. A two-stage EM algorithm is proposed to alternately optimize the community membership probabilities, via the log-likelihood of the combined model, and the content weights. Upon convergence, each graph node is assigned to the community with the maximum membership probability. CESNA [108] assumes that clusters generate both the graph structure and the vertex attributes, based on the affiliation network model [55] and a separate logistic model, respectively. The intuition is that the value of each node attribute should be predictable from the node's community memberships, so node attributes are also modeled with community information. A maximum-likelihood estimation problem is formulated to detect the community memberships F and the relation factors W between communities and attributes. An efficient coordinate ascent algorithm is used to solve this problem and detect overlapping graph clusters in the attributed graph.

Comparison of the model-based approaches. CESNA detects overlapping communities in networks with node attributes, while PCL-DC and BAGC partition the graph into disjoint subsets. PCL-DC and BAGC assume soft node-community membership, which does not allow a node to have high membership strength in multiple communities simultaneously. In the models of CESNA and BAGC, node attributes and network structure are generated from community membership; in the PCL-DC model, the link probability between nodes is conditioned on the community membership using a popularity-based link model, and the content information is used to model the memberships of nodes.

2.1.4 Subspace Clustering

The main problem of attributed graph clustering is the use of all attributes for similarity computation. Most work on attributed graph clustering enforces attribute homogeneity across all attributes. However, some attributes may not be relevant for all clusters. For instance, it is unlikely that people are similar in all of their characteristics: while one subset of attributes is important for one group of people, another subset is important for another group, and some attributes may not have a strong correlation with the network and may disagree with the clustering structure. Although some methods differentiate the importance of attributes with an attribute weighting strategy, they cannot get rid of irrelevant attributes completely. Recently, some methods use unsupervised feature selection, as in subspace clustering, and extract cohesive subgraphs with homogeneity in a subset of attributes. Subspace clustering methods identify clusters only in the context of their own relevant features, especially for high-dimensional data. However, finding the relevant features is computationally hard, and an optimization step is needed to combine different quality measures such as density, entropy, and dimensionality. A brief summary of selected approaches in this category, which combine subspace clustering with dense subgraph mining, is given below.

The novel problem of mining cohesive patterns from graphs with feature vectors is introduced with CoPaM (Cohesive Pattern Miner) [64]. The problem is defined as finding the set of all maximal cohesive patterns: subgraphs of G that satisfy three cohesive pattern constraints: (1) a connectivity constraint, the subgraph should be connected; (2) a density constraint, the density of the subgraph needs to be higher than a given threshold; (3) a subspace cohesion constraint, the subgraph should be homogeneous in a subset of its features. Several pruning strategies are applied to find all maximal cohesive patterns under these constraints.

GAMer (Graph and Attribute Miner) [37] defines twofold clusters as sets of nodes that are not only densely connected in the given graph but also highly similar in a subset of their attributes. The density, the size, and the number of relevant dimensions of the clusters are maximized. The GAMer algorithm is based on pruning methods derived from this cluster definition. Its clustering model also includes a redundancy model that avoids an unnecessary increase of the result set while still permitting overlaps between clusters in general.

Gunnemann et al. [38] (DB-CSC) use attribute neighborhoods in the subspace of attributes, in addition to structure neighborhoods, to model the density of vertices. They use local densities calculated within the clusters instead of global densities; density-based combined subspace clusters are obtained by merging vertices located in the same dense region based on the combined neighborhood.

Gunnemann et al. [36] introduce a spectral subspace clustering algorithm on graphs with feature vectors (SSCG). The idea of spectral clustering is extended to subspace clustering of attributed graphs, where each cluster has an individual set of relevant features. The clustering process is based on normalized cuts, finding a k-partitioning of the nodes that minimizes inter-cluster connectivity while maximizing intra-cluster connectivity. Node attributes are integrated into the normalized cut equation using kernels to incorporate content information into the clustering process.

Comparison of subspace clustering approaches. CoPaM and GAMer follow a subspace clustering approach based on using vertex subsets to find clusters that show high similarity in a feature subspace and are densely connected in the given graph, while DB-CSC and SSCG take a density-based approach to find subspace clusters. These methods have different limitations: (1) CoPaM and GAMer do not have a vertex subset merging strategy for adjacent subsets, and the formation of a vertex subset depends on the initial vertex it starts with; (2) CoPaM strictly requires all nodes in a cluster to have the same value for each attribute in the subspace, while GAMer and DB-CSC use a single parameter, the maximal width w, to control the difference over all the attributes in the concerned subspace within a vertex set; but for different attributes, which can be categorical or numerical, it is hard to set a uniform threshold to control the attribute value differences; (3) the time complexity of GAMer and DB-CSC increases exponentially with the number of vertices in the input graph, so they can hardly scale to large graphs.

2.1.5 Comparison of Methods

Table 2.1 compares the methods discussed in the previous sections. Rows show the methods and columns show their categories and properties. The "User Preference" property indicates that the corresponding method takes user preferences into consideration; the "Overlapping" property indicates whether clusters can overlap; and the "Scalability" property reflects the efficiency of the method. A checkmark shows that the method in the corresponding row has the property in the corresponding column. Our work differs from existing attributed graph clustering solutions in that we are the first to consider attribute-aware graph embedding, encoding both graph structure cohesiveness and vertex attribute homogeneity into a low-dimensional latent space within which attributed graph clustering can be supported in an efficient and cost-effective way, especially on large-scale attributed graphs.

2.2 Graph Embedding

Table 2.1: Comparison of Attributed Graph Clustering Methods

Method      | Weighted-graph Based | Distance Based | Model Based | Subspace Based | User Preference | Overlapping | Scalability
------------|----------------------|----------------|-------------|----------------|-----------------|-------------|------------
FocusCO     | √                    |                |             |                | √               |             |
SA-Cluster  |                      | √              |             |                |                 |             |
SI-Cluster  |                      | √              |             |                |                 |             |
CODICIL     |                      | √              |             |                |                 |             |
PCL-DC      |                      |                | √           |                |                 |             |
BAGC        |                      |                | √           |                |                 |             |
CESNA       |                      |                | √           |                |                 | √           | √
CoPaM       |                      |                |             | √              |                 | √           |
GAMer       |                      |                |             | √              |                 | √           |
DB-CSC      |                      |                |             | √              |                 |             |
SSCG        |                      |                |             | √              |                 |             |

Graph embedding transforms every vertex u of a graph into a latent, low-dimensional feature vector f(u) ∈ R^d, where d is a small number of latent dimensions. This way, the local graph structure information of u is encoded within f(u) in such a way that vertices of the same cluster will have similar feature vectors. The distance within the latent low-dimensional space should provide a metric for evaluating the structural similarity between the corresponding vertices of the graph. There have been many approaches to learning low-dimensional representations of graphs [8, 91, 76]. Inspired by recent advances in language modeling and deep learning [62], a series of graph embedding works that learn latent vertex representations have been proposed. DeepWalk [73] exploits local structure information of vertices based on the Skip-Gram model [62], learning the latent representations by treating random walks as the equivalent of sentences. When applied to multi-label graph classification in social networks, DeepWalk successfully encodes the global graph structure, especially in the presence of missing information in graphs. LINE [89] is a scalable graph embedding method that uses edge sampling for model inference; it naturally breaks the limitation of the classical stochastic gradient descent method adopted in graph embedding without compromising embedding efficiency. GraRep [15] refines DeepWalk by introducing an explicit loss function of the Skip-Gram model defined on the graph, and extends LINE by capturing k-step (k > 2) high-order information for learning the latent representations; matrix factorization algorithms are used for optimization in GraRep. Our work differs from existing graph embedding solutions in the following two aspects: (1) graph embedding is typically proposed and optimized for the task of graph classification, while our work is primarily designed for attributed graph clustering; (2) graph embedding considers encoding the mere graph structure into a low-dimensional space, while our work is the first to take attributed graphs into account and thus enriches the existing frameworks for attribute-aware graph embedding.

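DeepWalk's corpus-generation step, treating truncated random walks as sentences for a Skip-Gram model, can be sketched as follows; the Skip-Gram training itself is omitted, and the parameter values are illustrative:

```python
import random

def random_walks(adj, walks_per_node=10, walk_length=5, seed=0):
    """Generate truncated random walks, DeepWalk's 'sentences'.

    adj: node -> list of neighbors. Each walk is a list of node ids that
    a Skip-Gram model would consume to learn latent vertex representations.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)             # randomize the start order per pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:      # dead end: truncate the walk
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Each resulting walk plays the role of a sentence, and vertex co-occurrence within a window of the walk drives the Skip-Gram objective.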
2.3 Community Search

As a query-dependent variant of the well-known community detection problem, community search aims to find cohesive and densely connected subgraphs involving a given query vertex (or a set of query vertices) in a graph. Community search was first explored when communities were modeled as k-cores (each vertex in a k-core has degree no less than k) with distance and size constraints, rendering the problem NP-hard [85]. Online search of overlapping communities for a query vertex based on a quasi-clique notion, the α-adjacency γ-quasi-k-clique, was further proposed [26]; the resultant communities, however, may not be cohesively connected in comparison with truss-based communities [44]. Influential community search aims to discover the top-r most influential k-core communities [58]; however, it is not query-dependent, but rather an influence-based extension of the traditional community detection problem. Local community search [101, 26, 7] identifies k-core subgraphs that contain the query vertices and maximize/minimize some goodness metric of communities, such as density, modularity, or graph conductance. However, many "free-rider" vertices irrelevant to the query vertex are inevitably returned for some goodness metrics. Furthermore, only a single community is identified, which fails to account for real-world cases where a query vertex may participate in different communities. Community search has also been examined in attributed graphs [45, 31] and spatial graphs [30].

Truss. Dense, cohesive subgraphs are critical components revealing the potential community structures of real-world, massive graphs. There is a rich literature on modelling and quantifying dense and cohesive graphs, including the clique or quasi-clique [22, 93], k-core [78, 7, 51, 26, 37], nucleus [80, 79], and k-plex [10]. All the aforementioned models except k-core suffer from computational intractability [83], whereas k-core may result in incohesive subgraphs [18, 112].
Truss is defined on the higher-order graph motif, the triangle, and enjoys numerous advantages in community modelling and computation [18]: a k-truss is a (k−1)-core, but not vice versa [18]; a k-truss is (k−1)-edge-connected: any deletion of fewer than (k−1) edges will not disconnect a k-truss; a k-truss with n vertices has its diameter no more than ⌊(2n−2)/k⌋; that is, a k-truss is diameter-bounded [46]. All these properties are critical indicators of good communities [83]. In addition, k-truss based communities exhibit an inclusive hierarchy representing the cores of a graph at different levels of granularity; that is, a k-truss is contained in a (k−1)-truss for k ≥ 3. When augmented with the triangle connectivity constraint, k-truss can account for the more practical case where a vertex may belong to multiple k-truss communities, which is consistent with sociological studies [44, 99]. In [18], the authors designed polynomial algorithms to find k-trusses from a graph and extended the methods to MapReduce. An improved algorithm based on triangle enumeration, running in O(|E_G|^1.5) time, was further proposed [98]. A similar I/O-efficient algorithm for k-truss computation was designed and facilitated by graph database technologies [112]. A parallel method, PETA, can detect local k-trusses within a few iterations, and it has the same complexity as the corresponding serial algorithms [83]. Due to its advantages in community modeling and computation, truss has been extensively applied to community search. In [46], community search was reformulated as an NP-hard problem that identifies a closest k-truss community with the minimum diameter and the largest k containing a set of query vertices. A greedy algorithm with compact indexes was proposed for approximate solutions. Truss has also been extended for community search in probabilistic graphs [47] and attributed graphs [45].

Graph Summarization. As the scale and complexity of graphs increase, the topic of graph summarization has been explored toward simplifying massive graphs into succinct and quality-preserving summaries, which lead to significant reductions of graph storage and computational cost [53]. GraSS [56] summarizes graphs by greedily grouping vertices for a probabilistic adjacency matrix, upon which neighborhood queries can be approximated efficiently.
In [75], graphs are summarized into super-nodes and super-edges with guarantees on the reconstruction error. The compressed summary is used to approximate queries including adjacency, degree, eigenvector centrality, and subgraph counting. In [66], the authors proposed greedy and randomized algorithms to compress graphs with bounded minimum description length (MDL) errors. VOD [53] is a vocabulary-based graph summarization method aimed at minimizing the information-theoretic encoding cost, in terms of MDL, of the graph. SNAP [92] groups vertices based on vertex attributes, and then iteratively splits groups until eventually reaching the maximum attribute- and relationship-compatible grouping. Graph summarization techniques have also been studied on RDF graphs [96] and biological networks [82]. Graph summarization is problem-driven, and typically optimized toward application-dependent objectives. However, there exist no prior graph summarization methods for the community search problem, as addressed in this paper. To our knowledge, no existing research has explored the intrinsic relationships of edges within k-truss communities, which lead to the summarized, community-preserving index, EquiTruss, as proposed in this paper.

TCP-Index. The state-of-the-art solution to truss-based community search is TCP-Index (Triangle Connectivity Preserved Index) [44], which maintains trussness values and triangle-adjacency information of the pre-computed k-trusses in a group of tree-structured indexes. Specifically, for each vertex x ∈ V_G, we consider the vertex-centric ego-net G_x, where V_{G_x} = N_G(x) and

E_{G_x} = {(y, z) | (y, z) ∈ E_G, y, z ∈ N_G(x)}. The edge (y, z) ∈ E_{G_x} is further assigned a weight w indicating that the triangle △xyz arises in a k-truss community (w ≥ k). Given the weighted graph G_x, a maximum spanning tree (MST), T_x, is identified, and all T_x's (x ∈ V_G) constitute the TCP-Index of G. We note that any two vertices connected through a series of edges with weights no less than k in T_x belong to the same k-truss community. Namely, the community structures are losslessly compressed in the TCP-Index. However, during community search, a series of costly, decompression-like operations have to be undertaken online in both the TCP-Index and the original graph G to fully reconstruct the edges of the resultant k-truss communities. For instance, if any edge (u, v) ∈ T_x is in a k-truss community, we have to examine both T_u and T_v in the TCP-Index, and revisit G to find missing edges of the community, thus inevitably incurring expensive computational cost. The limitations of TCP-Index are summarized as follows:

1. given any k-truss community of G, its constituent edges have to be examined and maintained redundantly in different MSTs, thus rendering the construction of TCP-Index extremely time-consuming and the resultant index excessively large;

2. during community search, a costly truss-reconstruction process has to be undertaken through repeated accesses to both TCP-Index and G, thus making community search inefficient;

3. when G evolves, the dynamic maintenance of TCP-Index becomes complicated and time-consuming. For instance, when a new edge is inserted into G or an existing edge is removed from G, a significant fraction of the MSTs in TCP-Index need to be updated accordingly.

As a result, TCP-Index may fail in supporting online, efficient community search, especially in real-world, massive graphs.

CTC-Problem. Another problem defined with the truss-based community model is the closest truss community (CTC) search problem [46]: given a set of query nodes, find a dense, connected subgraph that contains the query nodes, in which the nodes are close to each other and the trussness k of the community is maximized. Huang et al. [46] prove that this problem is NP-hard and propose a 2-approximation algorithm consisting of two steps.

First, given a graph G and query nodes Q, they find the maximal connected k-truss, denoted as G0, containing Q and having the largest trussness. They initialize G0 with the query nodes and use the minimum trussness among the query nodes as the initial value of k. Then, they iteratively insert all edges of nodes in G0 whose truss values are larger than k − 1 and check the connectivity of the query nodes on G0. If they are connected, the algorithm terminates and returns G0 as the result. If not, they decrease k by one and repeat the edge insertion. These steps are repeated until the query nodes are connected on G0. Since G0 is maximal and includes all edges whose truss values are no less than k, it may have a large diameter. In the second step, they iteratively remove nodes far away from the query nodes, while maintaining the trussness of the remaining graph at k. They compute the shortest distances between the query nodes and the other nodes in G0 and find the farthest node, i.e., the one with the maximum distance to a query node. When removing this node, the truss values of some edges in G0 may decrease and fall below k, so these edges need to be removed as well. They continue this process until G0 becomes disconnected. Finally, they obtain the closest truss community with a 2-approximation guarantee.

However, the maximal connected k-truss G0 may be too large, so it may be inefficient to compute G0 and remove free riders from it. They therefore develop two strategies to improve the efficiency of CTC search. In the first strategy, they speed up the pruning process by deleting at least k nodes in a batch. While this sacrifices some of the approximation ratio, it achieves quick termination. In the second strategy, they develop a heuristic method to quickly find the closest truss community in the local neighborhood of the query nodes. While this speeds up finding the closest community, it may decrease its trussness. In the local exploration, a Steiner tree of the query nodes is found and then expanded into a k-truss by exploring its local neighborhood. They define a truss distance, which uses both the length of a path and the minimum truss value of the edges on the path as the path's weight, to make the result denser.
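The first step above, finding the largest k for which the query nodes become connected through edges of trussness at least k, can be sketched with a union-find structure. This is an illustrative sketch (not the authors' implementation), and it assumes edge trussness values have already been computed by a standard truss decomposition:

```python
def find_largest_connecting_k(edges, trussness, query):
    """Largest k such that all query nodes are connected using only
    edges of trussness >= k; returns None if never connected.
    edges: list of (u, v); trussness: dict (u, v) -> k; query: node list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    query = list(query)
    # Insert edges in decreasing order of trussness; the first time the
    # query nodes fall into one component, the current trussness is the
    # largest k that connects them.
    for k, u, v in sorted(((trussness[e], *e) for e in edges), reverse=True):
        union(u, v)
        if all(find(q) == find(query[0]) for q in query):
            return k
    return None
```

A real CTC implementation would then extract the connected k-truss around the query nodes and prune distant vertices, as described above.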

CHAPTER 3

ATTRIBUTED GRAPH CLUSTERING: AN ATTRIBUTE-AWARE GRAPH EMBEDDING APPROACH

This work is published in [3]. In this project, we propose a graph embedding approach to clustering content-enriched, attributed graphs, considering both graph structure cohesiveness and attribute homogeneity. The key idea is to design a unified latent representation for each vertex u of a graph such that both the graph connectivity and the vertex attribute proximity within the localized region of u can be jointly embedded into a unified, continuous vector space. Specifically, the pairwise vertex attribute similarity between u and its incident vertices is first quantified and embedded as new edge weights of the graph. A series of truncated, weight-biased random walks originating from u is further generated to capture the local, attribute-aware structure information surrounding u. Inspired by recent work on graph embedding [16, 90, 73], these random walks are further used to learn a latent representation, r(u) ∈ R^d, of u, which lies in a continuous vector space with a relatively small number d of dimensions. This way, the salient, localized attribute and structure information of vertices can be jointly encoded in a uniform, d-dimensional vector space. As a result, the challenging attributed graph clustering problem is cast to the traditional data clustering problem in d-dimensional space, in which the graph structure cohesiveness and attribute homogeneity are approximately preserved. Finally, we can apply any data clustering algorithm, e.g., k-Medoids, to accomplish the challenging attributed graph clustering task in large graphs. To illustrate the key idea, we present the typical pipeline of our attributed graph clustering method on a sample graph, as shown in Figure 3.1. The main contributions of our work are summarized as follows,

1. We propose a novel, attribute-aware graph embedding framework for attributed graph clustering. It provides a natural and principled approach to encoding the localized structure and attribute information of vertices into a unified, latent representation in a low-dimensional space, within which the graph structure cohesiveness and vertex attribute homogeneity can be well preserved. This framework also establishes a general graph embedding approach to tackling attributed graphs in widely varying application domains (Section 3.2);

Figure 3.1: The attribute-aware graph embedding on a sample attributed graph. (a) presents a sample social graph G containing 13 individuals and their friendship relations. Each individual is characterized by two attributes: education and favorite language; (b) presents the transformed, weighted graph G′ with vertex attribute proximity embedded as edge weights; (c) presents the two-dimensional attribute-aware graph embedding, Φ, from which the latent cluster structures naturally arise.

2. We design an efficient and cost-effective graph embedding algorithm that transforms an attributed graph into its vertex-based, latent representation, which is further fed as input to any data clustering method for attributed graph clustering (Section 3.3);

3. We perform experimental studies for our method in a series of real-world and synthetic graphs in comparison with state-of-the-art attributed graph clustering techniques. Experimental results demonstrate that our method outperforms existing algorithms in terms of both graph structure cohesiveness (w.r.t. graph density) and attribute homogeneity (w.r.t. entropy) in resultant graph clusters (Section 3.4).

3.1 Problem Formulation

Henceforth, we consider clustering graphs where vertices are affiliated with multidimensional attributes. We refer to these complex graphs as attributed graphs, formally defined as follows,

Definition 3.1 (Attributed Graph). An attributed graph is a vertex-labeled graph G = (V, E, A) with vertices V and edges E ⊆ V × V. A = {a_1, ..., a_n} is the set of feature vectors associated with vertices in V for describing vertex attributes. Each vertex v_i in V is associated with an attribute vector a_{v_i} = [a_i^1, ..., a_i^j, ..., a_i^d], where a_i^j is the attribute value of vertex v_i on attribute a_j.

In this project, we consider the attributed graphs as undirected, connected, simple graphs, and all the vertex attributes conform to a unique multidimensional schema, A. However, the proposed attribute-aware graph embedding framework can be effortlessly extended to other types of graphs with vertex attributes conforming to heterogeneous schemas. We further assume that each vertex attribute Ai has a finite set of discrete values and the number of possible values (or cardinality) of

Ai is |Ai|. For vertex attributes with continuous or infinitely countable values, we can transform them into discrete values by binning or histogram techniques. Given a vertex u in an attributed graph G, we denote all the neighboring vertices of u as

N_1(u) = {v | v ∈ V, (u, v) ∈ E}. Analogously, we denote all the vertices that are l (l ≥ 1) hops away from u as N_l(u) = {v | v ∈ V, d(u, v) = l}, where d(·) is the unit-weight shortest-path distance function defined on G. If l is small, N_l(u) consists of all vertices that are in the local vicinity of u. In principle, if the vertices in N_l(u) are densely connected and share similar vertex attributes with u, they are likely to be in the same cluster that u belongs to.
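For concreteness, N_l(u) can be computed with a truncated breadth-first search; the following is a minimal sketch (adjacency lists as a plain dict, function name ours):

```python
from collections import deque

def neighbors_at_distance(adj, u, l):
    """Return N_l(u), the set of vertices exactly l hops away from u.
    adj maps each vertex to a list of its neighbors."""
    dist = {u: 0}
    queue = deque([u])
    result = set()
    while queue:
        v = queue.popleft()
        if dist[v] == l:
            result.add(v)
            continue          # do not expand past l hops
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return result
```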

Definition 3.2 (Attributed Graph Clustering). Given an attributed graph G, we partition G into k mutually exclusive, collectively exhaustive subgraphs Gi = (Vi,Ei,A) with an objective to obtain the following graph clustering properties:

1. structure closeness: Vertices within the same clusters are closely connected while vertices in different clusters are far apart;

2. attribute homogeneity: Vertices in the same clusters have similar attribute values, while vertices in different clusters differ significantly in attribute values.

We note that in the classic graph clustering problem, only the first objective is considered, as only the graph structure information is employed, while for attributed graph clustering, we strive to achieve the dual objectives of structure closeness and attribute homogeneity for graph clusters.

3.2 The Attribute-Aware Graph Embedding Framework

In this section, we will discuss our attribute-aware graph embedding framework for attributed graph clustering. The goal is to transform each vertex u of an attributed graph into a latent,

low-dimensional feature vector f(u) ∈ R^d, where d is a small number of latent dimensions. This way, both the vertex attributes and the local graph structure information of u are encoded within f(u) in such a way that vertices of the same cluster will have similar feature vectors. The distance within the latent low-dimensional space should represent a metric for evaluating the structure-attribute similarity between vertices of the attributed graph.

3.2.1 Vertex Attribute Embedding

Given an attributed graph G, our first step is to embed the information of vertex attribute similarity into a transformed, weighted graph G′ = (V, E, W), where W : E → R_{≥0}. Specifically, for each edge e = (u, v) ∈ E, we assign an edge weight w(e) that quantifies the vertex attribute similarity between u and v. This way, the vertex attribute information of G is encoded into the weighted graph G′ as new edge weights. A straightforward way to quantify the multidimensional attribute similarity of two adjacent vertices u and v is a dimension-wise comparison of the attribute values of u and v, respectively.

We define an indicator function 1_{A_i}(u, v) for the attribute A_i of u and v as follows:

    1_{A_i}(u, v) = 1 if A_i(u) = A_i(v), and 0 otherwise.

Then the vertex attribute similarity, s_0(u, v), of vertices u and v can be computed as

    s_0(u, v) = (1/n) Σ_{i=1}^{n} 1_{A_i}(u, v)    (3.1)

In an attributed graph G, it is not uncommon that two adjacent vertices u and v within the same cluster share few, or even no, identical vertex attribute values. In this case, u and v may be closely connected, and only their structure information plays an essential role in assigning them to one cluster. However, if we rely solely on the structure information of u and v, which disagree on vertex attribute values, it is still likely that u and v will mistakenly be assigned to different clusters. To account for this case, we extend the computation of the vertex attribute similarity by taking into consideration the neighboring vertices of u and v, respectively. This way, even if u and v share few or no common vertex attribute values, the vertices within their vicinity may still hold identical or similar vertex attribute values, given that u and v belong to the same cluster. Formally, we consider all the nearby vertices of u in N_l(u), which are l hops away from u. For each vertex attribute

A_i (1 ≤ i ≤ n), we maintain a histogram vector, H_{A_i}(u), with a total number of |A_i| entries, each of which corresponds to a possible value a_t ∈ Domain(A_i) and maintains the value

    H_{A_i}(u)[t] = |{v | v ∈ N_l(u), A_i(v) = a_t}| / |N_l(u)|,  1 ≤ t ≤ |A_i|    (3.2)

That is, the t-th element of the vector H_{A_i}(u) maintains the percentage of vertices, among all vertices that are l hops away from u, whose value on the attribute A_i is equal to a_t (1 ≤ t ≤ |A_i|). As a result, H_{A_i}(u) maintains the distribution of the values of attribute A_i for the vertices near u. The vertex attribute similarity of u and v in terms of their neighbors that are l hops away can then be formally defined as

    s_l(u, v) = (1/n) Σ_{i=1}^{n} sim(H_{A_i}(u), H_{A_i}(v))    (3.3)

where sim(·, ·) is a similarity function defined on two vectors. In this work, we use the cosine similarity function. For adjacent vertices u and v, where (u, v) ∈ E, we synthesize the overall vertex attribute similarity, s(u, v), by considering both the vertex attribute similarity of u and v themselves (Equation 3.1) and the vertex attribute similarities of all the vertices within the localized vicinity of u and v, respectively, up to L hops away from u and v (Equation 3.3),

    s(u, v) = Σ_{l=0}^{L} s_l(u, v) / 2^l    (3.4)

We note that vertices that are l (1 ≤ l ≤ L) hops away from u and v should contribute less to the vertex attribute similarity between u and v as l grows larger. To account for this, we dampen exponentially the vertex attribute similarities of nearby vertices in terms of their distances from u and v, respectively, as presented in Equation 3.4. As a result, we assign the vertex attribute similarity s(u, v) as the edge weight w(u, v) of the edge (u, v) in the transformed weighted graph G′. This way, the information of vertex attribute similarity is embedded into G′, which will be further explored in order to accommodate the structure information in the attributed graph. Algorithm 1 presents the key steps for vertex attribute embedding. Given an attributed graph G as input, we compute for each edge e = (u, v) a new edge weight w(e) that encodes the vertex attribute similarity, s(u, v), between the two vertices u and v. The vertex attribute similarity s(u, v)

Algorithm 1: Vertex Attribute Embedding (G, L)
Input: attributed graph G(V, E, A), maximum neighborhood length L
Output: weighted graph G′(V, E, W)

1  for each e = (u, v) ∈ E do
2      s(u, v) ← s_0(u, v)                                          /* Eq. 3.1 */
3      for l = 1 to L do
4          for i = 1 to n do
5              construct histograms H_{A_i}(u) and H_{A_i}(v)       /* Eq. 3.2 */
6          s_l(u, v) ← (1/n) Σ_{i=1}^{n} sim(H_{A_i}(u), H_{A_i}(v)) /* Eq. 3.3 */
7          s(u, v) ← s(u, v) + s_l(u, v)/2^l                        /* Eq. 3.4 */
8      w(e) ← s(u, v)
9  return G′(V, E, W)

is first initialized to s_0(u, v) (Line 2). We then expand the neighborhood scope, l, and examine all the vertices l hops away from u and v, respectively. For each of the n = |A| vertex attributes, we construct histograms for the neighboring vertices l hops away from u and v, respectively (Line 5), and the vertex attribute similarity at distance l, s_l(u, v), is computed by a vector-based similarity function sim(·, ·), e.g., cosine similarity (Line 6). The vertex attribute similarity s(u, v) is further augmented by s_l(u, v), exponentially dampened by l (Line 7). The worst-case time complexity of Algorithm 1 is O(|E| × L × |A| × d_max^L), where d_max is the maximum vertex degree in G. In practice, we only need to consider a very small neighborhood L (0 ≤ L ≤ 2), because larger values of L will introduce more vertices that are likely to be in other graph clusters, or that carry a great percentage of "noisy" vertex attribute values that mismatch with each other.
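Equations 3.1–3.4, as traced by Algorithm 1, can be sketched as follows for categorical attributes. This is a minimal illustration under our own naming (the histograms are left unnormalized, since cosine similarity is scale-invariant):

```python
import math
from collections import Counter, deque

def vertices_at_hop(adj, u, l):
    """Vertices exactly l hops away from u (N_l(u)) via truncated BFS."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] < l:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return [v for v, d in dist.items() if d == l]

def cosine(p, q):
    """Cosine similarity of two histograms represented as Counters."""
    dot = sum(p[k] * q.get(k, 0) for k in p)
    norm = (math.sqrt(sum(x * x for x in p.values()))
            * math.sqrt(sum(x * x for x in q.values())))
    return dot / norm if norm else 0.0

def attribute_similarity(adj, attrs, u, v, L):
    """s(u, v) per Eq. 3.4: exact matches at l = 0 (Eq. 3.1), histogram
    similarity for 1 <= l <= L (Eqs. 3.2-3.3), dampened by 2^l.
    attrs[x] is a tuple of categorical attribute values for vertex x."""
    n = len(attrs[u])
    s = sum(a == b for a, b in zip(attrs[u], attrs[v])) / n   # s_0(u, v)
    for l in range(1, L + 1):
        nu, nv = vertices_at_hop(adj, u, l), vertices_at_hop(adj, v, l)
        if not nu or not nv:
            continue
        sl = sum(cosine(Counter(attrs[x][i] for x in nu),
                        Counter(attrs[x][i] for x in nv))
                 for i in range(n)) / n                        # s_l(u, v)
        s += sl / 2 ** l
    return s
```

Running this over every edge and storing the result as the edge weight yields the transformed graph G′.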

3.2.2 Structure Embedding

Once the attributed graph G is transformed into the weighted graph G′, the information of vertex attribute similarity is embedded as the edge weights of G′. Another aspect of graph clusters yet to be explored is the information of structure closeness. Intuitively, vertices within the same graph cluster are closely connected, while those located in different clusters are typically far apart. We thus take the augmented graph G′ as input and further embed the local structure information of vertices for attributed graph clustering.

We consider using a series of short random walks to capture the structure closeness within the localized vicinity of vertices. Random walks have been extensively used as a fundamental data structure to efficiently capture local community structure information of graphs [21, 33, 17, 2]. Specifically, for each vertex u ∈ V, we generate a group of γ truncated random walks rooted at u, denoted as W_l^t(u) = (u, v_1, ..., v_t), where 1 ≤ l ≤ γ, and t denotes the length (i.e., the number of edges) of the random walks. Each truncated random walk is generated as follows. We start from the vertex v_0 = u, and at each step i (0 ≤ i ≤ t − 1), we choose the next vertex v_{i+1} ∈ N_1(v_i) of the truncated random walk with the following probability:

    Pr(v_i, v_{i+1}) = s(v_i, v_{i+1}) / Σ_{v_j ∈ N_1(v_i)} s(v_i, v_j)

where s(v_i, v_{i+1}) is the weight of the edge (v_i, v_{i+1}) in G′, as defined in Equation 3.4. That is, the truncated random walks W_l^t(u) are generated in a biased way such that an edge (v_i, v_{i+1}) with a higher edge weight is chosen with a greater probability. Note that the edge weights in G′ indicate the vertex-wise attribute similarity, as discussed in Section 3.2.1. As a result, the truncated random walks rooted at u are attribute-aware random walks that encode both structure closeness and vertex attribute homogeneity in the local vicinity of u, provided that the length t of the random walks is set small. Inspired by recent advances in language modeling and deep learning [16], we treat each attribute-aware random walk as a short sentence or phrase, and each vertex of the graph as a word in a special language. Our goal is to learn a latent representation Φ : V → R^d that maps each vertex into a low-dimensional vector, Φ(u). Following the intuition of DeepWalk [73], we relax the formulation of random walks as follows: (1) a random walk passing through a vertex v_i ∈ V as the center of the walk is treated as a bi-directional random walk rooted at v_i; that is, we consider the transformed random walk originating from v_i and encompassing the preceding and subsequent vertices in a window of size 2w; (2) we ignore the ordering of vertices in random walks. Such relaxations are particularly useful for latent representation learning, as the order-independence assumption well captures the sense of "closeness" provided by random walks. Furthermore, they simplify the learning process and save training time. To this end, deriving the latent representation of vertices is formulated as an optimization problem:

    min_Φ  − log Pr({v_{i−w}, ..., v_{i−1}, v_i, v_{i+1}, ..., v_{i+w}})    (3.5)

Algorithm 2: Structure Embedding (G′, w, d, γ, t)

Input: weighted graph G′(V, E, W)
Output: matrix of vertex latent representations Φ ∈ R^{|V|×d}

1  for i = 1 to γ do
2      for each u ∈ V do
3          W_i^t(u) ← RandomWalk(G′, u, t)
4          SkipGram(Φ, W_i^t(u), w)
5  return Φ

To solve this problem, we take advantage of SkipGram [62], which maximizes the co-occurrence probability among the words (vertices) appearing within a window w in a sentence (a truncated random walk). We further use Hierarchical Softmax [63] and stochastic gradient descent (SGD) to optimize the approximation of the probability distributions and the parameter estimation.

Algorithm 2 presents the procedure for structure embedding. Given the weighted graph G′ as input, we examine every vertex u of G′ and embed it into a low-dimensional space as a d-dimensional vector Φ(u). We generate truncated random walks of length t (Line 3). When each truncated random walk originating from u is generated, we use the SkipGram algorithm to update the latent representation in accordance with the objective function in Equation 3.5 (Line 4). As the time complexity of the SkipGram training process is O(log |V|) per random walk update, the overall time complexity of Algorithm 2 is O(γt|E| + |V| log |V|).
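The RandomWalk routine invoked in Algorithm 2 can be sketched as a weighted sampling loop; the walks it produces would then be fed to a SkipGram trainer (e.g., an off-the-shelf word2vec implementation). This is an illustrative sketch with our own names, assuming symmetric edge weights:

```python
import random

def biased_random_walk(adj, weights, u, t, rng=None):
    """One truncated random walk of length t (edges) rooted at u.
    The next vertex is drawn with probability proportional to the
    attribute-aware edge weight, i.e., weights[(cur, v)]."""
    rng = rng or random.Random()
    walk = [u]
    for _ in range(t):
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break                     # dead end: truncate the walk early
        walk.append(rng.choices(nbrs,
                                weights=[weights[(cur, v)] for v in nbrs])[0])
    return walk
```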

3.3 Attributed Graph Clustering Algorithm

Based on the attribute-aware graph embedding framework discussed in Section 3.2, it becomes straightforward to support clustering on attributed graphs; the algorithm is sketched in Algorithm 3. Given an attributed graph G, we first embed the vertex attribute similarity information into a weighted graph G′, where the parameter L regulates the scope of the vertex neighborhood for the quantification of vertex attribute similarity (Line 1). We then embed the structure information of G′ by mapping vertices into d-dimensional latent representations, Φ (Line 2), which encode both structure closeness and attribute homogeneity in the local neighborhood of vertices, and thus are important indicators of the cluster membership of vertices. Once the original graph is transformed into its latent representations in the d-dimensional space, we can use any traditional data clustering method, such as k-Medoids, to partition the d-dimensional vectors into the final k clusters.

Algorithm 3: Attributed Graph Clustering (G, k, L, w, d, γ, t)
Input: attributed graph G, number of resultant graph clusters k, maximum neighborhood length L, window size w, embedding size d, random walks per vertex γ, random walk length t
Output: graph clustering C = {C_1, C_2, ..., C_k}

1  G′ ← Vertex Attribute Embedding (G, L)
2  Φ ← Structure Embedding (G′, w, d, γ, t)
3  C ← kMedoids (Φ, k)
4  return C

It is worth noting that although in Definition 3.2 we aim to generate hard graph clusters, meaning that every vertex belongs to at most one cluster, our proposed attribute-aware graph embedding approach can support overlapping graph clustering as well. Once we transform the attributed graph into its latent representations, Φ, we can apply any hierarchical data clustering method to generate overlapping graph clusters.
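As an illustration of the final clustering step, a minimal PAM-style k-medoids on the embedded vectors might look as follows. This is a sketch (not the exact implementation used here), with points given as tuples of floats:

```python
import random

def kmedoids(points, k, iters=20, rng=None):
    """A minimal k-medoids sketch: alternate assignment and medoid update
    until the medoids stabilize. Returns (clusters, medoids), where
    clusters holds lists of point indices."""
    rng = rng or random.Random(42)
    d = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean
    medoids = rng.sample(range(len(points)), k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest medoid
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            j = min(range(k), key=lambda m: d(p, points[medoids[m]]))
            clusters[j].append(i)
        # update step: each medoid becomes the member minimizing total distance
        new_medoids = []
        for j, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[j])
                continue
            new_medoids.append(min(members,
                                   key=lambda c: sum(d(points[c], points[i])
                                                     for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters, medoids
```

In practice, the input points would be the rows of the embedding matrix Φ produced by Algorithm 2.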

3.4 Experiments

In this section, we present the experimental studies for our proposed method, AA-Cluster (attribute-aware graph clustering). We compare AA-Cluster with three state-of-the-art methods: (1) SA-Cluster [113] combines vertex attributes and graph structures in a unified distance measure for attributed graph clustering; (2) BAGC [106] is a Bayesian probabilistic approach for attributed graph clustering; (3) DeepWalk [73] learns the graph structure information as latent features in a low-dimensional space without consideration of vertex attributes. We choose the recommended parameters for SA-Cluster, BAGC, and DeepWalk, as mentioned in the corresponding papers. For our method, AA-Cluster, we choose the following default parameter values, if not specified otherwise: the vertex neighborhood distance L = 1, the number of truncated random walks per vertex γ = 30, the length of truncated random walks t = 30, the window size w = 30, and the embedding dimension d = 40. All our experiments were carried out on a Linux workstation running RedHat Enterprise Server 6.5 with 16 Intel Xeon 2.3GHz CPUs and 128GB of memory.

3.4.1 Datasets

We consider three real-world attributed graphs and a set of synthetic attributed graphs in our experimental studies. The details of datasets are as follows,

1. Political Blogs. This is a network of hyperlinks between web blogs on US politics recorded in 2005¹. It contains 1,490 web blogs as vertices and 19,090 hyperlinks as edges. Each blog has an attribute pertaining to its political leaning: liberal or conservative;

2. DBLP². This is a co-authorship graph consisting of authors in four research areas: database, data mining, information retrieval, and artificial intelligence. The graph contains 27,199 authors as vertices and 66,832 collaborations as edges. For each author, we consider two attributes: topic is the primary one of 100 research topics extracted from paper titles based on topic modelling [113]; level is determined by the number p of papers published by the author: if p > 20, the value of level is highly prolific; if 10 < p < 20, the value is prolific; and if p < 10, the value is low prolific;

3. Patent. This is a patent citation network with vertices representing patents and edges depicting the citations between patents³. We extract a subgraph with all patents between 1988 and 1999. Each patent has six attributes: grant year, number of claims, technological category, technological subcategory, assignee type, and main patent class. There are 1,174,908 vertices and 4,967,216 edges in this graph. Note that this is the largest attributed graph in our experimental studies, and most existing attributed graph clustering methods fail to run on such a large graph;

4. Synthetic Graphs. We generate a series of synthetic attributed, small-world graphs by varying the number of vertices from 1K up to 250K. For vertex attributes, we vary the vertex dimensionality, |A|, from 5 to 40, with half of the attributes following the Gaussian distribution and the remaining half following the uniform distribution. The synthetic graphs are primarily used to examine the clustering efficiency and scalability of our proposed method, AA-Cluster.

3.4.2 Evaluation Metrics

In order to compare the effectiveness of different attributed graph clustering methods and assess the quality of the resultant graph clusters, we consider the following standard evaluation metrics for attributed graph clustering [4],

1http://www-personal.umich.edu/∼mejn/netdata 2http://dblp.uni-trier.de/xml/ 3http://www.nber.org/patents

1. Clustering density. Assume there are k graph clusters C = {C_1, C_2, ..., C_k} generated. The clustering density is defined as

    density = Σ_{i=1}^{k} |{(u, v) | u, v ∈ V_{C_i}, (u, v) ∈ E_{C_i}}| / |E|    (3.6)

Density is a quantitative measure indicating the structure closeness of the resultant graph clusters. Empirically, the larger the density value, the better the quality of the clustering results in terms of structure closeness;

2. Clustering entropy. In order to quantify the homogeneity of vertex attribute values in graph clusters, we consider a second evaluation metric, the average clustering entropy, defined as follows,

    entropy = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} (|V_{C_j}| / |V|) · entropy(a_i, V_{C_j})    (3.7)

where

    entropy(a_i, V_{C_j}) = − Σ_{l=1}^{|A_i|} p_{ijl} log p_{ijl}

Note that n = |A| is the number of vertex attributes in the attributed graph, and p_{ijl} denotes the percentage of vertices in cluster C_j whose value on the attribute A_i equals a_{il}, where a_{il} ∈ Dom(A_i). Empirically, the lower the value of entropy, the more homogeneous the vertex attribute values in the resultant graph clusters. Besides the evaluation metrics for clustering quality, we also examine the runtime cost and scalability of the different methods, as it is important to cluster real-world, large-scale attributed graphs in a fast and potentially scalable way.
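The two metrics follow directly from Equations 3.6 and 3.7; a small sketch (our own helper names, clusters given as lists of vertices, attrs[v] a tuple of categorical values):

```python
import math
from collections import Counter

def clustering_density(clusters, edges):
    """Eq. 3.6: fraction of edges whose endpoints fall in the same cluster."""
    member = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    intra = sum(1 for u, v in edges
                if u in member and v in member and member[u] == member[v])
    return intra / len(edges)

def clustering_entropy(clusters, attrs, n_attrs, n_vertices):
    """Eq. 3.7: size-weighted attribute-value entropy, averaged over attributes."""
    total = 0.0
    for i in range(n_attrs):
        for cluster in clusters:
            counts = Counter(attrs[v][i] for v in cluster)
            h = -sum((c / len(cluster)) * math.log(c / len(cluster))
                     for c in counts.values())
            total += (len(cluster) / n_vertices) * h
    return total / n_attrs
```

A perfectly attribute-homogeneous clustering yields zero entropy; a clustering that cuts no edges yields density 1.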

3.4.3 Experimental Results

1) Clustering Quality: We first apply different attributed graph clustering methods to the Political Blog graph; the clustering quality results in terms of both density and entropy are illustrated in Figure 3.2. By varying the number k of graph clusters, we recognize that the density of clusters generated by AA-Cluster is very close to that of SA-Cluster, and both are consistently higher than the density results of BAGC and DeepWalk (Figure 3.2(a)). Meanwhile, graph clusters generated by AA-Cluster have significantly smaller entropy than those generated by the other three methods, meaning that AA-Cluster leads to more homogeneous graph clusters w.r.t. vertex attributes (Figure 3.2(b)). Therefore, AA-Cluster results in clusters that are both structurally densely connected and attribute-wise homogeneous, with clustering quality higher than the graph clusters generated by SA-Cluster, BAGC and DeepWalk. An interesting observation is that, when k = 7, graph clusters generated by SA-Cluster are structurally imbalanced: there is one large cluster subsuming most vertices of the graph, while the remaining six graph clusters are very small, containing only a handful of vertices. In contrast, graph clusters generated by AA-Cluster are structurally balanced, with quality clustering results in terms of both density and entropy.

Figure 3.2: Clustering Quality in Political Blog Dataset

Figure 3.3: Clustering Quality in DBLP Dataset

Figure 3.4: Clustering Quality in Patent Dataset

We then perform experimental studies on the DBLP graph, and the clustering quality results are presented in Figure 3.3. By tuning the number k of resultant graph clusters, we clearly notice that in terms of both density (Figure 3.3(a)) and entropy (Figure 3.3(b)), AA-Cluster outperforms SA-Cluster, BAGC, and DeepWalk in generating high-quality graph clusters. In addition, the clustering quality of AA-Cluster is stable and insensitive to the number k of graph clusters generated. We further evaluate the different methods on the largest graph, Patent. Both SA-Cluster and BAGC fail on this dataset by returning runtime memory errors. In contrast, AA-Cluster and DeepWalk finish graph clustering, and the clustering quality results are reported in Figure 3.4. In terms of both density and entropy, we find that AA-Cluster is consistently better than DeepWalk in generating high-quality graph clusters. This indicates that a joint consideration of both graph structures and vertex attribute information for graph clustering leads to better-quality graph clusters than methods that take advantage of only graph structures during clustering.
2) Parameter Analysis: AA-Cluster is regulated by a series of important algorithmic parameters. In this section, we examine how these parameters affect the graph clustering performance of AA-Cluster. We first study the parameter of neighborhood distance, L, in vertex attribute embedding. The clustering quality results are reported in Figure 3.5, in terms of density (Figure 3.5(a)) and entropy (Figure 3.5(b)), respectively. We recognize that by increasing the neighborhood scope, more vertices

within the localized region of target vertices are involved in vertex attribute similarity computation. This can be treated as a smoothing step to avoid the case where two intra-cluster vertices happen to share few or even no common vertex attributes. In addition, for the case where two intra-cluster vertices share some vertex attribute values in common but no edge directly connects them, we can still count on the vertex attribute similarity of their neighboring vertices to account for the closeness between them. As a result, the involvement of neighboring vertices in the quantification of vertex attribute similarity can help improve the graph clustering quality. However, it is not always beneficial to increase L for vertex attribute similarity computation. When L is set large, the computational cost of attribute embedding grows as well. More importantly, noisy vertices with heterogeneous vertex attribute values might be involved, thus leading to an increase in entropy. In our experimental studies, we find L = 1 is typically good enough for vertex attribute embedding.

Figure 3.5: Clustering Quality of AA-Cluster w.r.t. Neighborhood Distance, L

We then examine the parameters pertaining to structure embedding for AA-Cluster. We will report the experimental results in the DBLP graph with k = 10 graph clusters thus generated, as we witness similar results and findings in the other two graphs. By varying the length of truncated random walks, t, and the number of truncated random walks rooted per vertex, γ, the clustering quality results are illustrated in Figure 3.6. We recognize that by leveraging more truncated random walks and lengthening these random walks, we can more easily capture the structure closeness of

43 Figure 3.6: Clustering Quality of AA-Cluster w.r.t. Number of walks, γ, in DBLP graph (k = 10)

Figure 3.7: Clustering Quality of AA-Cluster w.r.t. Window Size, w, in DBLP graph (k = 10)

graph clusters. However, the side effect is that we include more vertices with heterogeneous vertex attribute values, thus leading to a growth in entropy. As a result, there is an intrinsic trade-off when setting the values of t and γ, which are closely correlated with the clustering quality. We then examine the parameter w, the window size of SkipGram, in structure embedding, and the clustering quality results are reported in Figure 3.7. When w increases, the clustering quality is enhanced, as the density increases and the entropy decreases simultaneously. This suggests that large window sizes benefit our graph clustering method, AA-Cluster.
3) Scalability: We also evaluate the runtime cost and scalability of different attributed graph

Figure 3.8: Runtime Cost in Synthetic Graphs

clustering methods on a series of synthetic graphs. First, we create a series of synthetic, small-world graphs with the number of vertices ranging from 1K up to 250K and the number of vertex attributes n = |A| = 10, with the values of five vertex attributes following the Gaussian distribution (µ = 3, σ² = 5) and the values of the other five following the uniform distribution. We test the four graph clustering methods on these synthetic graphs, and the runtime results are reported in Figure 3.8(a). We note that SA-Cluster and BAGC cannot scale to large graphs. In contrast, both AA-Cluster and DeepWalk exhibit excellent scalability, and their runtime costs differ only marginally. Note that DeepWalk exploits only graph structure information, while AA-Cluster takes account of both graph structures and vertex attribute information in attributed graph clustering. Therefore, AA-Cluster is efficient in clustering large-scale attributed graphs. We then examine how the value distributions of vertex attributes affect the running time of AA-Cluster. We consider two settings by assigning the mean values (µ) of the Gaussian-distributed vertex attributes to be 5 and 20, respectively. Meanwhile, we vary the variance, σ², from 1 up to 20, and the runtime results are reported in Figure 3.8(b). We note that the runtime cost of AA-Cluster is insensitive to the changes of vertex attribute values. Specifically, for graphs with unevenly distributed vertex attribute values, AA-Cluster is still capable of clustering the graphs efficiently.
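A generator in this spirit can be sketched with the standard library alone. The Watts-Strogatz-style ring-lattice rewiring, the parameter names, and the default values below are illustrative stand-ins for the actual generator used in the experiments; only the attribute distributions (half Gaussian with µ = 3, σ² = 5, half uniform) follow the text.

```python
import random

def synthetic_attributed_graph(n, k=4, p=0.1, n_attrs=10, seed=42):
    """Small-world graph on n vertices with n_attrs vertex attributes:
    the first half Gaussian(mu=3, sigma^2=5), the rest Uniform(0, 1)."""
    rng = random.Random(seed)
    edges = set()
    for u in range(n):                       # ring lattice, k neighbors
        for j in range(1, k // 2 + 1):
            edges.add(frozenset((u, (u + j) % n)))
    for e in list(edges):                    # rewire with probability p
        if rng.random() < p:
            u = min(e)
            v = rng.randrange(n)
            if v != u and frozenset((u, v)) not in edges:
                edges.discard(e)
                edges.add(frozenset((u, v)))
    attrs = {u: tuple(rng.gauss(3, 5 ** 0.5) if i < n_attrs // 2
                      else rng.uniform(0, 1) for i in range(n_attrs))
             for u in range(n)}
    return edges, attrs
```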

3.5 Conclusions

Graph clustering has played a fundamental role in modeling, structuring, and understanding large-scale networks. In many real-world settings, we are concerned not only with interconnected graph structures, but also with the rich graph contents characterized by vertex attributes during graph clustering. More importantly, we want to study the interplay between graph structure and content information with the objective of efficiently generating high-quality graph clusters from real-world, attributed graphs. In this chapter, we devised a new attributed graph clustering method that combines both vertex attributes and graph structure information within a general, unified attribute-aware graph embedding framework. We designed efficient graph embedding algorithms to encode an attributed graph into a low-dimensional latent representation, so that the attribute-aware cluster information is well preserved during graph embedding. We evaluated our attributed graph clustering method, AA-Cluster, on a series of real-world and synthetic graphs, and the experimental results validate the effectiveness and efficiency of AA-Cluster compared with state-of-the-art attributed graph clustering techniques.

CHAPTER 4

TRUSS BASED COMMUNITY SEARCH: A TRUSS EQUIVALENCE BASED INDEXING APPROACH

This work was published in VLDB'17 [4]. In this chapter, we consider the community search problem defined upon a large graph G: given a query vertex q in G, find as output all the densely connected subgraphs of G, each of which contains the query vertex q. As an online, query-dependent variant of the well-known community detection problem, community search enables personalized community discovery from graphs, and has found a wide range of real-world applications. We study the community search problem in the truss-based model, aimed at discovering all dense and cohesive k-truss communities to which the query vertex q belongs. We propose a novel graph indexing solution to the truss-based community search problem in real-world, large-scale graphs. Our main idea is to introduce a new concept of k-truss equivalence among the edges of a graph: given two edges e and e′ of G, they are k-truss equivalent if and only if they belong to the same k-truss, and are further connected by a series of triangles in a strong sense (modeled by the notion of k-triangle connectivity in Definition 4.9). Intuitively, if e belongs to a k-truss community w.r.t. a query vertex q, so does e′. We prove that k-truss equivalence is an equivalence relation, such that every edge of G falls into exactly one equivalence class based on k-truss equivalence. We further design a truss-equivalence based index, EquiTruss, which is a summarized graph G = (V, E) consisting of a super-node set, V, and a super-edge set, E; each super-node ν ∈ V represents an equivalence class of edges based on k-truss equivalence, and there exists a super-edge (ν, µ) ∈ E if the edges partitioned to the super-nodes ν and µ (ν, µ ∈ V), respectively, are connected via triangles in G. We prove that community search can be carried out directly upon EquiTruss without repeated accesses to the original graph G, and its time complexity is solely determined by the actual size of the output, which is theoretically optimal.
In addition, EquiTruss is amenable to efficient, dynamic update when the underlying graph G evolves in terms of edge insertion and deletion. We examine, both theoretically and experimentally, the efficiency and effectiveness of EquiTruss, which has achieved at least an order of magnitude speedup for community search, in

comparison to the state-of-the-art method, TCP-Index. Furthermore, EquiTruss provides simple yet powerful community search functionalities in large-scale graphs, and thus can be effectively employed in studies of real-world networked data. We summarize the contributions of EquiTruss as follows,

• We introduce a novel notion, k-truss equivalence, to capture the intrinsic relationship of edges in truss-based communities. Based on this new concept, we can partition any graph G into a series of truss-preserving equivalence classes for community search (Section 4.2);

• We design and develop a truss-equivalence based index, EquiTruss, that is space-efficient, cost-effective, and amenable to dynamic changes in the graph G. More importantly, community search can be performed directly upon EquiTruss without costly revisits to G, which is theoretically optimal (Section 4.3);

• We carry out extensive experimental studies in real-world, large-scale graphs, and compare EquiTruss with the state-of-the-art solution, TCP-Index. Experimental results demonstrate that EquiTruss is smaller in size, faster to be constructed and maintained, and admits at least an order of magnitude speedup for community search in large graphs (Section 4.4);

The remainder of the chapter is organized as follows. In Section 4.1, we formulate the truss-based community search problem and give the preliminaries. In Section 4.2, we introduce a novel notion, truss equivalence, for truss-based community modeling and search. We present the truss-equivalence based indexing solution, EquiTruss, in Section 4.3. Experimental studies and key findings are reported in Section 4.4, followed by concluding remarks in Section 4.5.

4.1 Preliminaries

We consider an undirected, connected, simple graph G = (VG, EG), where VG is a set of vertices and EG ⊆ VG × VG is a set of edges. Given a vertex v ∈ VG, we denote the set of neighboring vertices of v as NG(v), where NG(v) = {u ∈ VG : (u, v) ∈ EG}, and the degree of v is d(v) = |NG(v)|. We use dmax to denote the maximum degree of vertices in G. A triangle △uvw is a cycle of length three comprising three distinct vertices u, v, w ∈ VG. Based on triangles, we define the following key concepts,

Definition 4.1. (Edge Support). The support of an edge e(u, v) ∈ EG, denoted by supG(e), is the number of triangles with e as a constituent edge, i.e., supG(e) = |{△uvw : w ∈ VG}|.
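As a quick illustration of Definition 4.1 (assuming an adjacency-set representation of G, which is our choice for the sketch, not a structure prescribed by the text), the support of (u, v) is simply the number of common neighbors of u and v:

```python
def edge_support(adj, u, v):
    """sup_G((u, v)): number of triangles containing the edge (u, v)."""
    return len(adj[u] & adj[v])

# In a 4-clique, every edge lies in exactly two triangles:
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
assert edge_support(adj, 0, 1) == 2
```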

Table 4.1: Primer of terminologies and notations

Notation                Description
G = (VG, EG)            An undirected, simple graph G
G = (V, E)              A summarized graph G in EquiTruss
u, v, w, x, y, z        Vertices in VG of G
ν, µ, ψ                 Super-nodes in V of G
N(v)                    The set of neighboring vertices of v ∈ VG
△uvw                    A triangle formed by vertices u, v, w
sup(e)                  The support of e, e ∈ EG
τ(G′), τ(e)             The trussness of a graph G′; the trussness of an edge e
△s ↔ △t, e1 ↔ e2        △s and △t (e1 and e2) are triangle connected
△s ↔k △t, e1 ↔k e2      △s and △t (e1 and e2) are k-triangle connected
e1 =k e2                Edges e1 and e2 are k-truss equivalent

Definition 4.2. (Subgraph Trussness). Given a subgraph G′(V′, E′) ⊆ G, the trussness of G′, denoted as τ(G′), is the largest integer k (k ≥ 2) such that τ(G′) = argmax_k {supG′(e) ≥ k − 2 : ∀e ∈ E′}.

Consider a subgraph G′ ⊆ G. If τ(G′) = k and there exists no super-graph G″ of G′ (G′ ⊆ G″ ⊆ G) such that τ(G″) = τ(G′) = k, then G′ is referred to as a maximal k-truss, or k-truss for short. Given a fixed value of k, there exists only one k-truss in G due to its maximality, while a k-truss is not necessarily a connected graph. Therefore, the classic definition of k-truss is not suitable to directly model real-world communities that are both densely and cohesively connected. To tackle this issue, the triangle-connectivity constraint is further imposed upon k-truss:

Definition 4.3. (Triangle Adjacency). Given two triangles △1, △2 in G, they are adjacent if △1 and △2 share a common edge, which is denoted by △1 ∩ △2 ≠ ∅.

Definition 4.4. (Triangle Connectivity). Given two triangles △s, △t in G, △s and △t are triangle connected, denoted as △s ↔ △t, if there exists a series of triangles △1, ..., △n in G, where n ≥ 2, such that △1 = △s, △n = △t, and for 1 ≤ i < n, △i ∩ △i+1 ≠ ∅.

Analogously, any two edges e, e′ ∈ EG are triangle connected, denoted as e ↔ e′, if and only if (1) e and e′ belong to the same triangle, i.e., e, e′ ∈ △, or (2) e ∈ △s and e′ ∈ △t, s.t. △s ↔ △t. To this end, the truss-based community can be defined as follows,

Figure 4.1: A Sample graph G and Truss-based Communities for vertex v7 in G.

Definition 4.5. (k-truss Community). Given a graph G and an integer k ≥ 3¹, a subgraph G′ ⊆ G is a k-truss community if G′ satisfies the following three conditions:

1. k-truss. G′ is a subgraph of G, denoted as G′ ⊆ G, such that ∀e ∈ E(G′), supG′(e) ≥ (k − 2);

2. Edge Connectivity. ∀e, e′ ∈ EG′, ∃△1, △2 in G′ such that e ∈ △1 and e′ ∈ △2, and either △1 = △2, or △1 is triangle connected with △2, i.e., △1 ↔ △2 in G′;

3. Maximal Subgraph. G′ is a maximal subgraph satisfying conditions (1) and (2). That is, ∄G″ ⊆ G such that G′ ⊂ G″ and G″ satisfies conditions (1) and (2).

Example 4.1. Given the toy graph G as shown in Figure 4.1(a), we consider the subgraph G′ induced by the set of vertices {v7, v8, v9, v10, v11}. We note that for each edge e ∈ G′, we have supG′(e) = 3, meaning that e is involved in three different triangles. As a result, the subgraph trussness of G′, τ(G′), is 5, and G′ is actually a 5-truss community because any pair of edges in G′ are triangle connected. For example, (v7, v8) ↔ (v10, v11), because (v7, v8) ∈ △7,8,11, (v10, v11) ∈ △8,10,11, and △7,8,11 ∩ △8,10,11 = {(v8, v11)}.
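The support value in Example 4.1 pins down the structure: in a 5-vertex graph, an edge with support 3 must have both endpoints adjacent to all three remaining vertices, so the induced subgraph is the complete graph K5. A quick sanity check of the arithmetic (the integer labels are stand-ins for v7, ..., v11):

```python
from itertools import combinations

V = range(5)                                    # stand-ins for v7..v11
adj = {u: {v for v in V if v != u} for u in V}  # the induced 5-clique

# every edge lies in |N(u) ∩ N(v)| = 3 triangles, hence trussness 5
assert all(len(adj[u] & adj[v]) == 3 for u, v in combinations(V, 2))
```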

It has been well recognized that the triangle is a fundamental, higher-order graph motif representing a strong and stable relationship in graphs [43, 52, 9], and community modeling based on triangles, rather than primitive vertices/edges, results in more accurate communities in real-world

¹In the classic k-truss definition, k = 2 indicates a degraded case where a (sub)graph has no triangles involved. Such a graph is neither densely nor cohesively connected, and thus is omitted in our discussion.

graphs [44, 9]. In Definition 4.5, the high density and cohesiveness of communities are guaranteed, respectively: condition (1) ensures the community is a densely-connected subgraph modeled by k-truss, and condition (2) ensures all edges within a community are cohesively connected via strong and stable triangle motifs. Furthermore, this truss-based community definition allows a vertex to participate in multiple communities [44]. We finally define the truss-based community search problem as follows,

Definition 4.6. (k-Truss Community Search). Given a graph G(VG, EG), a query vertex q ∈ VG, and an integer k ≥ 3, find all k-truss communities containing q.

4.2 Truss Equivalence

To systematically address the limitations of TCP-Index and enable efficient community search, we propose a new notion, k-truss equivalence, to characterize a fundamental equivalence relation for edges that are strongly connected in a k-truss community. As a consequence, a truss-equivalence based index, EquiTruss, can be developed that is theoretically optimal for community search. To start with, we consider a preprocessing step that decomposes an input graph G into k-trusses (k ≥ 2).

Given an edge e ∈ EG, we first define the edge trussness of e as follows,

Definition 4.7. (Edge Trussness). The trussness of an edge e ∈ EG, denoted as τ(e), is the maximum subgraph trussness among all subgraphs G* ⊆ G that involve e as a constituent edge, i.e., τ(e) = max_{G*⊆G} {τ(G*) : e ∈ EG*}.

It is important to note that the (maximal) k-truss of G consists of all the edges with edge trussness no less than k. We thus can apply a truss decomposition algorithm [98], as detailed in Algorithm 4, to compute edge trussness and discover all k-trusses from G. The algorithm starts with an initialization step that computes edge supports in O(|E|^1.5) time using existing triangle enumeration methods [54, 71] (Line 1). After the initialization, for k starting from 2, we iteratively select the edge e*(u, v) with the lowest support (Line 5), assign the edge trussness k to e*, and remove it from G (Line 11). Meanwhile, we decrement the support of all the other edges forming triangles with e*, and reorder them based on their new edge support (Lines 7-10). This process continues until all the edges with edge support no greater than (k − 2) are removed from G (Line 4). If there are still edges left in G, we increment k by one to process the edges with edge trussness

Algorithm 4: Truss Decomposition

Input: A graph G(VG, EG)
Output: Edge trussness τ(e) for each e ∈ EG

1  Compute sup(e) for each edge e ∈ EG;
2  Sort all edges in ascending order of their support;
3  k ← 2;
4  while ∃e ∈ EG with sup(e) ≤ (k − 2) do
5      e*(u, v) ← argmin_{e∈EG} sup(e);
6      assume w.l.o.g. d(u) ≤ d(v);
7      foreach w ∈ N(u) with (v, w) ∈ EG do
8          sup(u, w) ← sup(u, w) − 1;
9          sup(v, w) ← sup(v, w) − 1;
10         Reorder (u, w) and (v, w) w.r.t. their new edge support;
11     τ(e*) ← k; remove e* from EG;
12 if ∃e ∈ EG then
13     k ← k + 1;
14     goto Step 4;
15 return {τ(e) | e ∈ EG}

(k + 1) (Lines 12-14). The time complexity of Algorithm 4 is O(|EG|^1.5) and its space complexity is O(|VG| + |EG|) [98].
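A compact, non-optimized sketch of the peeling process in Algorithm 4 follows; dictionary-based bookkeeping replaces the sorted edge list, so this version favors clarity over the O(|EG|^1.5) bound, and the function name and data layout are illustrative.

```python
def truss_decomposition(edges):
    """Peel minimum-support edges, assigning trussness k before removal."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sup = {frozenset((u, v)): len(adj[u] & adj[v]) for u, v in edges}
    tau, k = {}, 2
    while sup:
        while True:
            peel = [e for e, s in sup.items() if s <= k - 2]
            if not peel:
                break
            for e in peel:
                u, v = tuple(e)
                for w in adj[u] & adj[v]:        # triangles through e
                    for f in (frozenset((u, w)), frozenset((v, w))):
                        if f in sup:
                            sup[f] -= 1
                tau[e] = k                       # e leaves at level k
                del sup[e]
                adj[u].discard(v)
                adj[v].discard(u)
        k += 1
    return tau
```

On a 4-clique every edge has support 2, so all six edges receive trussness 4; a pendant edge attached to a triangle receives trussness 2 while the triangle's edges receive 3.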

Example 4.2. We apply Algorithm 4 to the graph G (Figure 4.1(a)) to compute the edge trussness of all edges of G, and the results are presented in Figure 4.2. Edges with different edge trussness values are illustrated in different colors.

We further define a stronger triangle-connectivity constraint, k-triangle connectivity, as follows,

Definition 4.8. (k-triangle). Given a triangle △uvw ⊆ G, if the edge trussness of all three constituent edges is no less than k, i.e., min{τ((u, v)), τ((u, w)), τ((v, w))} ≥ k, then △uvw is denoted as a k-triangle.

Definition 4.9. (k-triangle Connectivity). Given two triangles △s, △t in G, they are k-triangle connected, denoted as △s ↔k △t, if there exists a sequence of n ≥ 2 k-triangles △1, ..., △n s.t. △1 = △s, △n = △t, and for 1 ≤ i < n, △i ∩ △i+1 = {e} for some e ∈ EG with τ(e) = k.

Figure 4.2: k-truss edges in the graph G.

Example 4.3. Consider the graph G as shown in Figure 4.2 and two 4-triangles △4,5,7 and △6,8,11. They are 4-triangle connected, as there are two 4-triangles △5,6,7 and △6,7,8 such that △4,5,7 ∩ △5,6,7 = {(v5, v7)}, △5,6,7 ∩ △6,7,8 = {(v6, v7)}, △6,8,11 ∩ △6,7,8 = {(v6, v8)}, and the edge trussness values of all these join edges are 4. However, the two 3-triangles △1,4,5 and △3,4,7 are not 3-triangle connected.

Intuitively, if △s ↔k △t, the two k-triangles △s and △t are connected by a series of k-triangles with a chain of join edges (common edges shared by consecutive triangles) that have edge trussness k. Analogously, we say two edges e, e′ ∈ EG are k-triangle connected, denoted as e ↔k e′, if and only if (1) e and e′ belong to the same k-triangle, or (2) e ∈ △s and e′ ∈ △t, s.t. △s ↔k △t. To this end, we define a new relation, k-truss equivalence, upon EG, as follows,

Definition 4.10. (k-truss Equivalence). Given any two edges e, e′ ∈ EG, they are k-truss equivalent (k ≥ 3), denoted as e =k e′, if and only if (1) τ(e) = τ(e′) = k, and (2) e ↔k e′.

Theorem 4.1. k-truss equivalence is an equivalence relation upon EG.

Proof. k-truss equivalence is a binary relation defined upon EG, and we prove the three key properties of an equivalence relation for k-truss equivalence:

Reflexivity. Consider an edge e0 ∈ EG s.t. τ(e0) = k. Based on Definition 4.7, there exists at least one subgraph G*(V*, E*) ⊆ G such that e0 ∈ E* and ∀e ∈ E*, τ(e) ≥ k. Since k ≥ 3, there exists at least one k-triangle △ ⊆ G* such that e0 ∈ △. Namely, e0 =k e0.

Symmetry. Consider two edges e1, e2 ∈ EG with e1 =k e2. That is, τ(e1) = τ(e2) = k, and either of the following cases holds: (1) e1 and e2 are in the same k-triangle; (2) there exist two k-triangles △1 and △2 such that e1 ∈ △1, e2 ∈ △2, and △1 ↔k △2. For case (1), as e2 is located in the same k-triangle as e1, we have e2 =k e1. For case (2), note that k-triangle connectivity is symmetric, so △2 ↔k △1; namely, e2 =k e1.

Transitivity. Consider three edges e1, e2, e3 ∈ EG s.t. e1 =k e2 and e2 =k e3. Namely, τ(e1) = τ(e2) = τ(e3) = k, and either of the following cases holds: (1) there exist two k-triangles △1 and △2 such that e1, e2 ∈ △1 and e2, e3 ∈ △2. If △1 = △2, then e1 and e3 are located in the same k-triangle, so e1 =k e3. Otherwise, △1 ∩ △2 = {e2} and τ(e2) = k, so △1 ↔k △2; therefore, e1 =k e3. (2) There exist m (≥ 2) k-triangles △l1, ..., △lm in G s.t. e1 ∈ △l1, e2 ∈ △lm, and all the edges joining these m consecutive k-triangles have the same edge trussness, k. Meanwhile, there exist n (≥ 2) k-triangles △t1, ..., △tn in G s.t. e2 ∈ △t1, e3 ∈ △tn, and all the edges joining these n k-triangles have the same edge trussness, k. If △lm = △t1, then △l1 ↔k △tn through a series of (m + n − 1) adjacent k-triangles △l1, ..., △lm, ..., △tn. Otherwise, △lm ∩ △t1 = {e2} and τ(e2) = k, so △l1 ↔k △tn through a series of (m + n) adjacent k-triangles △l1, ..., △lm, △t1, ..., △tn. Therefore, e1 =k e3.

Given an edge e ∈ EG with τ(e) = k, the set Ce = {e′ ∈ EG | e′ =k e} defines the equivalence class of e w.r.t. k-truss equivalence, and the set of all equivalence classes forms a mutually exclusive and collectively exhaustive partition of EG. In particular, any equivalence class Ce consists of edges with the same edge trussness, k, that are also k-triangle connected, making Ce a k-truss community by definition.
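One way to materialize these equivalence classes, given precomputed edge trussness values, is a union-find pass over all triangles: in each triangle, the edges whose trussness equals the triangle's minimum edge trussness can chain at that level, so they are unioned. This is an illustrative sketch of the partition itself, not the EquiTruss construction of Algorithm 5.

```python
from itertools import combinations

def truss_equiv_classes(adj, tau):
    """Group edges into k-truss equivalence classes via union-find.

    adj -- vertex -> set of neighbors
    tau -- frozenset edge -> edge trussness
    """
    parent = {e: e for e in tau}

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]    # path halving
            e = parent[e]
        return e

    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:                  # triangle u, v, w
                tri = [frozenset((u, v)), frozenset((u, w)),
                       frozenset((v, w))]
                kmin = min(tau[e] for e in tri)
                level = [e for e in tri if tau[e] == kmin]
                for a, b in zip(level, level[1:]):
                    parent[find(a)] = find(b)
    classes = {}
    for e in tau:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())
```

For a 4-clique with a disjoint pendant triangle hanging off one clique vertex, the six trussness-4 clique edges form one class and the three trussness-3 triangle edges form another.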

4.3 Truss-Equivalence Based Index

Based on k-truss equivalence, we design and develop a graph-structured index, EquiTruss (Section 4.3.1), which supports community search with theoretically optimal performance (Section 4.3.2). In addition, EquiTruss allows efficient, dynamic update when G changes dynamically (Section 4.3.3).

4.3.1 Index Design and Construction

According to k-truss equivalence, all the edges of the graph G are partitioned into a series of mutually exclusive equivalence classes, each of which represents a k-truss community. We thus design a truss-equivalence based index, EquiTruss, as a summarized graph G = (V, E), where V is a super-node set and E ⊆ V × V is a super-edge set. A super-node ν ∈ V represents a distinct equivalence class Ce, where e ∈ EG, and a super-edge (µ, ν) ∈ E, where µ, ν ∈ V, indicates that the two equivalence classes are triangle connected; that is, ∃e ∈ µ and ∃e′ ∈ ν s.t. e ↔ e′. It is important to recognize that EquiTruss is a community-preserving graph summary, where all k-truss communities are completely encoded in super-nodes, and the triangle connectivity across different communities is exactly maintained in super-edges, thus making all the information critical to community search readily available in EquiTruss. Furthermore, each edge e of G is maintained in exactly one super-node, representing its k-truss equivalence class, Ce. In comparison to TCP-Index, where e has to be maintained redundantly in multiple MSTs, EquiTruss is significantly more succinct and space-efficient.

Example 4.4. The truss-equivalence based index, EquiTruss, of the graph G (Figure 4.1(a)) is shown in Figure 4.3. It contains 5 super-nodes representing the k-truss equivalence classes for edges in G. For example, the super-node ν2 represents a 4-truss community with 6 edges: they are 4-triangle connected and have the same edge trussness value of 4. Meanwhile, there are 6 super-edges in EquiTruss depicting triangle connectivity between super-nodes (k-truss communities).

Figure 4.3: Truss-equivalence based index, EquiTruss, of G.

Algorithm 5: TEC Index Construction Algorithm

Input: G(VG, EG)
Output: EquiTruss G(V, E)

/* Initialization */
1  Truss decomposition for G;
2  foreach e ∈ EG do
3      e.processed ← FALSE;
4      e.list ← ∅;
5      if τ(e) = k then
6          Φk ← Φk ∪ {e};
7  snID ← 0;
/* Index Construction */
8  for k ← 3 to kmax do
9      while ∃e ∈ Φk do
10         e.processed ← TRUE;
11         Create a super-node ν with ν.snID ← ++snID;
12         V ← V ∪ {ν};                /* A new super-node for Ce */
13         Q.enqueue(e);
14         while Q ≠ ∅ do
15             e(u, v) ← Q.dequeue();
16             ν ← ν ∪ {e};            /* Add e to super-node ν */
17             foreach id ∈ e.list do
18                 Create a super-edge (ν, µ), where µ is the existing super-node with µ.snID = id;
19                 E ← E ∪ {(ν, µ)};   /* Add super-edge */
20             foreach w ∈ N(u) ∩ N(v) do
21                 if τ(u, w) ≥ k and τ(v, w) ≥ k then
22                     ProcessEdge(u, w);
23                     ProcessEdge(v, w);
24             Φk ← Φk − {e}; EG ← EG − {e};
25 return G(V, E)

26 Procedure ProcessEdge(u, v)
27     if τ((u, v)) = k then
28         if (u, v).processed = FALSE then
29             (u, v).processed ← TRUE;
30             Q.enqueue((u, v));
31     else
32         if snID ∉ (u, v).list then
33             (u, v).list ← (u, v).list ∪ {snID}

Given the graph G, we construct the truss-equivalence based index, EquiTruss, in Algorithm 5. In the initialization phase (Lines 1-7), we first call Algorithm 4 to compute the edge trussness of each edge e ∈ EG (Line 1), and then allocate edges to different sets, Φk, in terms of edge trussness (Lines 5-6). Given e ∈ EG, we maintain two auxiliary data structures: processed is a Boolean variable indicating whether e has been examined during index construction, and is initialized to FALSE (Line 3); list is a set of super-node identifiers, each of which represents a previously explored super-node, µ, where τ(µ) < k and µ is triangle connected to the current super-node ν (τ(ν) = k) via the edge e. The set e.list is initialized to be empty (Line 4). We then examine all the edges of G in a non-decreasing order of edge trussness, from Φ3 to Φkmax consecutively (Line 8). When selecting an edge e ∈ Φk, we create a new super-node ν corresponding to the equivalence class Ce of e (Lines 10-12). Using e = (u, v) as an initial seed, we traverse G (in BFS) to identify all the edges k-truss equivalent to e by exploring its incident k-triangles (Lines 20-23), and add them to the super-node ν. Meanwhile, we also check if there exists some super-node µ in e.list, where τ(µ) < τ(ν) = k and µ is triangle connected to ν through e. If so, we create a super-edge (µ, ν) in the index (Lines 17-19). Given any k-triangle, if there exists an edge e′ with τ(e′) > k, the identifier of the current super-node ν will be subscribed to e′.list, as ν is triangle connected to the super-node to which e′ belongs, and a super-edge will be created when e′ is processed (Lines 31-33). After e and all its incident triangles are examined, e is removed from both Φk and EG (Line 24), ensuring that each edge e belongs to at most one k-truss equivalence class represented by a super-node.

Theorem 4.2. EquiTruss can be constructed in O(|EG|^1.5) time and O(|EG|) space by Algorithm 5.

Proof. In the initialization phase of Algorithm 5 (Lines 1-7), the truss decomposition costs O(|EG|^1.5) time. In the index construction phase (Lines 8-24), for each edge e = (u, v) ∈ EG, we consider all the triangles △uvw that involve e in order to identify the k-truss equivalent edges. Then e is eliminated from Φk and EG, making each triangle △uvw examined only once. The procedure ProcessEdge takes O(1) time. So the index construction of EquiTruss is equivalent to enumerating all triangles of G, which takes O(|EG|^1.5) time.

Given an edge e ∈ EG, the size of e.list is bounded by the number of super-nodes µ such that τ(µ) < τ(e) and µ is triangle connected to the super-node to which e belongs. So e.list takes at most O(|EG|) space. Once e has been processed, it is removed from G and the space of e.list is released. As a result, the space complexity of Algorithm 5 is O(|EG|).

57 In practice, EquiTruss is built offline before community search is performed, so it can be constructed efficiently from real-world, massive graphs. Meanwhile, EquiTruss is significantly more space-efficient than TCP-Index, as there are no redundant edges maintained in the index.
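To make the index structure tangible, the following sketch assembles super-nodes and super-edges from precomputed equivalence classes and trussness values. It illustrates the summary graph of Section 4.3.1 under an assumed data layout; it is not the in-place, single-pass construction of Algorithm 5.

```python
from itertools import combinations

def build_equitruss(adj, tau, classes):
    """Assemble the summarized graph: one super-node per equivalence
    class; a super-edge whenever two classes share a triangle."""
    cid = {e: i for i, C in enumerate(classes) for e in C}
    supernodes = {i: (min(tau[e] for e in C), C)
                  for i, C in enumerate(classes)}
    superedges = {i: set() for i in supernodes}
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:                  # triangle u, v, w
                ids = {cid[frozenset((u, v))], cid[frozenset((u, w))],
                       cid[frozenset((v, w))]}
                for a in ids:                # classes meeting in a triangle
                    superedges[a] |= ids - {a}
    return supernodes, superedges
```

For a 4-clique (edge trussness 4) with an extra vertex completing a triangle on one clique edge (the two new edges have trussness 3), the two classes share the mixed triangle, so the summary has two super-nodes joined by one super-edge.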

4.3.2 Community Search on EquiTruss

After EquiTruss is constructed from G, community search can henceforth be carried out directly on EquiTruss without repeated accesses to G, as detailed in Algorithm 6. First of all, we find from the index G the super-nodes within which the query vertex q is located. We use a hash structure H : VG → 2^V to maintain this information, where H(u) = {ν1, ..., νl} such that, for each νi (1 ≤ i ≤ l), there exists some edge (u, v) ∈ νi with u, v ∈ VG. We remark that H can be efficiently built as a by-product of index construction. Starting from each super-node ν ∈ H(q) with τ(ν) ≥ k, we traverse G in a BFS fashion, and for each unexplored, neighboring super-node µ with τ(µ) ≥ k, the edges within µ are added to the k-truss community Al. In the end, each Al represents a k-truss community in which q is involved.

Theorem 4.3. Given a query vertex q ∈ VG and the truss value k, Algorithm 6 correctly computes all k-truss communities containing q.

Proof. According to Algorithm 6, each set Ai (1 ≤ i ≤ l) satisfies the following conditions: (1) the query vertex q is located in Ai; (2) for each edge e ∈ Ai, τ(e) ≥ k; (3) all the edges in Ai are triangle-connected. Therefore, Ai is a k-truss with constituent edges triangle-connected. We then prove that all Ai (1 ≤ i ≤ l) are maximal by way of contradiction. Assume otherwise that there exists at least one set Ai which is not maximal; that is, there exists a subgraph A′ ⊆ G, s.t. Ai ⊂ A′ and A′ is one of the community search results satisfying the aforementioned conditions. As a consequence, there exists at least one edge e ∈ A′ \ Ai, s.t. τ(e) ≥ k, and e is triangle connected to every edge in A′, but is not triangle connected to any edge in Ai. Because the query vertex q is located in both Ai and A′, there exists at least one incident edge of q, denoted as (q, u), where u ∈ VG, s.t. τ(q, u) ≥ k and (q, u) ∈ Ai. As Ai ⊂ A′, the edge (q, u) is also in A′. Therefore, (q, u) is triangle connected to the edge e, which contradicts the fact that no edge of Ai, including (q, u), is triangle connected to e.

Algorithm 6: Community Search Based on EquiTruss
Input: EquiTruss G(V, E), the truss value k ≥ 3, the query vertex q
Output: A: all k-truss communities containing q
/* Initialization */
1  foreach ν ∈ V do
2      ν.processed ← FALSE;
3  l ← 0;
/* BFS traversal for community search */
4  foreach ν ∈ H(q) do
5      if τ(ν) ≥ k and ν.processed = FALSE then
6          ν.processed ← TRUE;
7          l ← l + 1; Al ← ∅;
8          Q ← ∅; Q.enqueue(ν);
9          while Q ≠ ∅ do
10             ν ← Q.dequeue();
11             Al ← Al ∪ {e | e ∈ ν};
12             foreach (ν, µ) ∈ E do
13                 if τ(µ) ≥ k and µ.processed = FALSE then
14                     µ.processed ← TRUE;
15                     Q.enqueue(µ);
16 return {A1, ..., Al}
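As an illustration, the BFS traversal of Algorithm 6 over the summarized index can be sketched in Python. The super-node trussness values, edge sets, super-adjacency, and the hash H below form a hypothetical miniature index loosely mirroring Example 4.5; names such as `tau`, `edges_of`, and `search` are illustrative, not the actual implementation.

```python
from collections import deque

# Hypothetical miniature index: super-node trussness, the edges each
# super-node holds, super-edges, and the hash H (vertex -> super-nodes).
tau = {1: 3, 2: 4, 3: 3, 4: 4, 5: 5}
edges_of = {1: [("a", "b")], 2: [("v4", "x")], 3: [("c", "d")],
            4: [("v4", "y")], 5: [("y", "z")]}
super_adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
H = {"v4": [2, 4]}

def search(q, k):
    """One BFS over qualified (tau >= k) super-nodes per unprocessed seed;
    each BFS tree yields one k-truss community (a list of original edges)."""
    processed, communities = set(), []
    for seed in H.get(q, []):
        if tau[seed] < k or seed in processed:
            continue
        processed.add(seed)
        community, queue = [], deque([seed])
        while queue:
            nu = queue.popleft()
            community.extend(edges_of[nu])      # unfold the super-node
            for mu in super_adj[nu]:
                if tau[mu] >= k and mu not in processed:
                    processed.add(mu)
                    queue.append(mu)
        communities.append(community)
    return communities
```

With k = 4, the two seeds of "v4" yield two communities, and the super-node with trussness 5 is absorbed into the second one, just as ν5 joins A2 in the example.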

Example 4.5. Consider the sample graph G, as shown in Figure 1.5(a), the truss value k = 4, and the query vertex v4. Based on Algorithm 6, we first find from EquiTruss (in Figure 4.4) the super-nodes ν2 and ν4 that contain v4. Starting from ν2, we recognize that τ(ν2) = 4 ≥ k, so all the edges within ν2 are in the first community A1. However, ν2's neighboring super-nodes ν1 and ν3 are disqualified because τ(ν1) = τ(ν3) = 3 < k. We then start with the second super-node ν4. As τ(ν4) = 4 ≥ k, all the edges within ν4 are in the second community A2. Furthermore, since (ν4, ν5) ∈ E and τ(ν5) = 5, ν5 is also qualified and all the edges within ν5 are in the community A2 as well. The whole community search process is illustrated in Figure 4.4(a), and the community search results, including A1 (colored in red) and A2 (colored in green), are presented in Figure 4.4(b).

Theorem 4.4. The time complexity of Algorithm 6 is determined solely by the size of the resultant k-truss communities, i.e., O(|A1 ∪ ... ∪ Al|).

Proof. In Algorithm 6, each edge of Ai is accessed only once, when it is reported as output. Any edge of G that lies in a super-node ν with τ(ν) < k is never accessed by the algorithm.

Figure 4.4: The two 4-truss communities for the query vertex v4, including A1 with edges in red color and A2 with edges in green color.

As a result, the time complexity of Algorithm 6 is O(|A1 ∪ ... ∪ Al|), which is exactly the time used for listing all the edges in the resultant k-truss communities from G. Based on Theorem 4.4, we note that Algorithm 6 is optimal: returning all the edges of the k-truss communities requires Ω(|A1 ∪ ... ∪ Al|) time, and Algorithm 6 achieves this lower bound by visiting each edge of the resultant communities exactly once. It is also worth mentioning that, in Algorithm 6, we do not need to revisit the original graph G; it suffices to leverage EquiTruss alone for community search. In comparison to TCP-Index, which needs repeated accesses to G for community recovery, our method, EquiTruss, is significantly more efficient.

4.4 Experiments

In this section, we report our experimental studies for community search in real-world graphs. We primarily compare our truss-equivalence based indexing approach, EquiTruss, with the state-of-the-art solution, TCP-Index [44]. In addition, we also implement a brute-force community search method, Index-Free, which leverages no index structure for community search: given a query vertex q, Index-Free starts with each incident edge (q, ui) of q where τ(q, ui) ≥ k, and carries out a BFS-like exploration for all the edges that are triangle connected to (q, ui). The algorithm iterates until all k-truss communities relevant to q are identified by definition. To this end, Index-Free can be used as a baseline for community search in our experimental studies. All the algorithms are

implemented in Java and the experiments are performed on a Linux server running Ubuntu 14.04 with two Intel 2.3GHz ten-core CPUs and 256GB memory. Datasets. We consider five real-world graphs, which have been widely adopted in the studies of community search and detection, and are publicly available in the Stanford Network Analysis Project (SNAP)2 and the UF Sparse Matrix Collection3. The general statistics of these graphs are reported in Table 4.2, where dmax denotes the maximum vertex degree, and kmax denotes the maximum edge trussness in G.

Table 4.2: Network statistics (K = 10^3 and M = 10^6)

Network       |V|      |E|       dmax      kmax
Amazon        335K     926K      549       7
DBLP          317K     1M        342       114
LiveJournal   4M       35M       14,815    352
Orkut         3.1M     117M      33,313    78
UK-2002       18.6M    298.1M    194,955   944

4.4.1 Index Construction

We start with the experiments to construct the indexes from graphs. This process is typically performed offline before community search is carried out. Once the indexes are built, they reside in main memory and serve as an efficient vehicle to facilitate community search in large graphs. We focus on two evaluation metrics in our experimental studies: (1) the time spent for index construction, and (2) the space consumed by the overall index structures in memory. We compare our approach EquiTruss with TCP-Index, and the experimental results are reported in Table 4.3 (for the baseline method, Index-Free, no index is pre-built, so no results are reported). From Table 4.3, we recognize that the truss-equivalence based index, EquiTruss, can be constructed more efficiently than TCP-Index for all graphs. The speedup ranges from 3.36x in the Amazon graph up to 14.61x in the Orkut graph. Meanwhile, EquiTruss takes significantly less space than TCP-Index, ranging from 1.88x less in the Orkut graph up to 11x less in the UK-2002 graph, and the index sizes are consistently smaller than the graph sizes. The main reason is that each edge of G is partitioned to at most one super-node in EquiTruss; that is, there is no redundant

2 snap.stanford.edu/data/index.html
3 www.cise.ufl.edu/research/sparse/matrices/LAW/uk-2002.html

Table 4.3: Index construction time (in seconds) and space cost (in megabytes) of EquiTruss and TCP-Index, together with the sizes of graphs (in megabytes).

Network       Graph Size   Index Size (EquiTruss / TCP)   Index Time (EquiTruss / TCP)
Amazon        17.50        7.60 / 32.86                   1.7 / 5.72
DBLP          18.54        9.93 / 44.64                   2.5 / 15,336
LiveJournal   598          428 / 1,367                    345.4 / 1,496
Orkut         1,897        1,687 / 3,164                  2,160 / 31,558
UK-2002       4,336        1,484 / 16,324                 2,288 / 26,632

information maintained in EquiTruss. In contrast, the same edge may occur redundantly in multiple maximum spanning trees originating from different vertices of G in TCP-Index, thus resulting in significantly larger index structures (several times larger than the original graphs) and more index construction time. In consequence, EquiTruss can be constructed more efficiently and with less space cost than TCP-Index in large graphs.

4.4.2 Community Search

Once the indexes are built, we can use them to support community search in graphs. Here we consider two different experimental settings. In the first set of experiments, we select queries with varied vertex degrees, as community patterns vary significantly for vertices with different degrees: high-degree vertices are typically involved in large and dense communities, while low-degree vertices oftentimes participate in communities that are small and sparse. For each graph, we sort vertices in a non-increasing order w.r.t. vertex degree, and partition them into ten equal-width buckets based on degree percentiles. For instance, the first bucket contains the top 10% high-degree vertices in G. We then randomly select 100 vertices from each bucket as queries and report the average runtime for community search in each bucket. We set the truss value k = 4 in Amazon, k = 5 in DBLP, k = 6 in LiveJournal, k = 10 in Orkut, and k = 10 in UK-2002.4 The community search performance is reported in Figure 4.5. We have the following experimental findings in different real-world graphs: (1) The baseline method, Index-Free, is the least efficient algorithm for community search, which is typically orders
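The bucketing scheme described above can be sketched as follows. The helper names `degree_buckets` and `sample_queries`, and the toy graph, are hypothetical; the real experiments operate on the SNAP datasets.

```python
import random

def degree_buckets(adj, n_buckets=10):
    """Vertices sorted by non-increasing degree, split into equal-size
    percentile buckets (bucket 0 holds the top high-degree slice)."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    size = -(-len(order) // n_buckets)  # ceiling division
    return [order[i * size:(i + 1) * size] for i in range(n_buckets)]

def sample_queries(bucket, n=100, seed=0):
    """Draw up to n random query vertices from one bucket."""
    return random.Random(seed).sample(bucket, min(n, len(bucket)))

# Toy degree table: vertex i records i neighbors (half-adjacency is
# enough for bucketing), so degrees strictly increase with the id.
adj = {i: set(range(i)) for i in range(100)}
buckets = degree_buckets(adj)
```

Bucket 0 then contains the ten highest-degree vertices, from which per-bucket query sets are drawn at random.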

4 We explore the full range of values for k in different graphs and recognize that in small or medium-size graphs, such as Amazon or DBLP, high values of k lead to very few, or even no, communities, as most edges in these graphs have small edge trussness. Meanwhile, similar experimental results have been observed for different values of k, and are thus omitted for the sake of brevity.

Figure 4.5: Community search performance in different vertex-degree percentile buckets

Figure 4.6: Community search performance for different truss values of k

of magnitude slower than EquiTruss, especially in large graphs. In Orkut, for queries in high or medium vertex-degree percentile buckets (< 70%), community search cannot finish within 3 hours. The primary reason is that each community search incurs exhaustive BFS exploration and costly triangle-connectivity evaluation, which are extremely time-demanding in large graphs. As a result, Index-Free, without a deliberate indexing scheme, becomes infeasible for community search in real-world graphs; (2) It typically takes more time to search communities for high-degree vertices than for low-degree vertices. When queries are drawn from low-degree percentile buckets, the search time drops steadily for all methods. Specifically, when the degree percentiles are 70% or higher in DBLP, LiveJournal, and UK-2002, the runtime reduces significantly, as there are very few k-truss communities for low-degree vertices; (3) For all vertex-degree percentile buckets in different graphs, EquiTruss outperforms TCP-Index in community search with at least an order of magnitude speedup. In the largest UK-2002 graph, this speedup can be as large as two orders of magnitude, and in most real-world graphs, EquiTruss can find k-truss communities in real time. However, TCP-Index becomes significantly slow, especially in large graphs such as LiveJournal (more than 10 seconds per query) and Orkut (more than 100 seconds per query). In the second set of experiments, by tuning the parameter k, we examine the runtime for community search in different graphs. In each graph, we generate two query sets: 100 high-degree vertices drawn at random from the first 30% degree percentile buckets, and 100 low-degree vertices

drawn at random from the remaining 70% buckets. We denote community search by EquiTruss using these two query sets as EquiTruss-H and EquiTruss-L, respectively. Analogously, we have TCP-H and TCP-L for TCP-Index, and Free-H and Free-L for Index-Free. The experimental results are presented in Figure 4.6. We recognize that for most values of k in all graphs, EquiTruss is the most efficient community search method, at least an order of magnitude faster than TCP-Index for both high-degree and low-degree queries. The primary reason is that, in TCP-Index, repeated accesses to the original graph G are required to reconstruct communities from the maximum spanning trees of the index. In EquiTruss, however, a simple traversal of the graph-structured index suffices for community search without revisiting G. The performance gap becomes more significant in large graphs, such as Orkut and UK-2002, because repeated accesses to G turn out to be extremely time consuming. These experiments again verify the clear advantages of EquiTruss for community search and conform with the theoretical results of the proposed algorithms. On the other hand, Index-Free is the least efficient method for community search, and simply becomes infeasible in large graphs (in Orkut, no performance results are reported, as Index-Free cannot finish within 3 hours).

4.4.3 Effectiveness Analysis in DBLP

In previous studies [44, 46, 47], k-truss has resulted in more accurate community structures than k-core [12] and clique/quasi-clique based models [25] in real-world graphs with ground-truth community information. We note that EquiTruss generates identical k-truss communities to TCP-Index, so it leads to the same effectiveness results (in terms of the F1 measure) [44], which are therefore omitted for the sake of brevity. Instead, we perform case studies in DBLP to showcase the power of EquiTruss in modeling research communities and supporting community search in academic graphs. We focus on the scholars in four designated areas based on their publication records: DB (database), IR (information retrieval), ML (machine learning), and DM (data mining), and visualize the summarized graph G of EquiTruss, as shown in Figure 4.7(a). The summarized graph G provides a macroscopic profile of the original collaboration graph at the granularity of communities: each super-node represents a k-truss community (7 ≤ k ≤ 27) and each super-edge depicts the triangle-connectivity between communities. If we want to "zoom in" to some or all communities in order to find microscopic collaborative patterns at the granularity of vertices/edges of the original graph, we can unfold both super-nodes and super-edges in G, and the detailed

Figure 4.7: (a) The summarized graph in EquiTruss for the DBLP four-area graph. Each super-node represents a k-truss community (7 ≤ k ≤ 27), and each super-edge depicts triangle-connectivity between super-nodes. (b) All k-truss communities (7 ≤ k ≤ 27) in the DBLP four-area graph.

community structures are shown in Figure 4.7(b). Therefore, EquiTruss itself can be of special interest in visualizing large graphs with both schematic views in terms of k-truss communities and detailed connectivity information at the finest resolution of vertices and edges. We then perform a community search for Michael Stonebraker in the DBLP graph, setting the truss value to 7 and 8, respectively; the k-truss communities in which Michael Stonebraker is involved are presented in Figure 4.8(a) and (b), respectively. We recognize that Mike is involved in three 7-truss communities: the first one (colored in yellow) represents collaborators in the database community; the second one (colored in blue) represents collaborators mainly from U.C. Berkeley; and the last one (colored in green) represents other collaborators, primarily from industry. When k is set to 8, the third community dissolves, meaning Mike is more closely tied to the first two communities, represented by two 8-trusses. As a result, by tuning k, we can query a series of communities with different density and cohesiveness, which is vital for personalized community search in real-world graph studies.

Figure 4.8: 7-truss community and 8-truss community for the query Michael Stonebraker

4.5 Conclusions

In this project, we studied the truss-based community search problem in large graphs. We proposed a truss-equivalence based indexing approach, EquiTruss, to simplify an input graph into a space-efficient, truss-preserving summarized graph based on an innovative notion of k-truss equivalence. We proved that, with the aid of EquiTruss, community search can be performed directly upon EquiTruss without costly, repeated accesses to the original graph, and that our EquiTruss-based community search method is theoretically optimal. We further designed efficient dynamic maintenance methods for EquiTruss in the presence of edge insertions and deletions, extending its utility to real-world dynamic graphs. We conducted extensive experimental studies in real-world large-scale graphs, and the results have validated both the efficiency and effectiveness of the proposed community search method, EquiTruss, in comparison to the state-of-the-art algorithm, TCP-Index.

CHAPTER 5

INDEX BASED CLOSEST COMMUNITY SEARCH

In this project, we study the closest community search problem: given a set of query nodes, find a densely connected subgraph that contains the query nodes and in which all nodes are close to each other. We consider community search based on the k-truss. In addition, we use the graph diameter to measure the closeness of all nodes in the community, to ensure that every node included in the community is tightly related to the query nodes and to the other nodes of the reported community. Thus, based on the k-truss and the graph diameter, we use the closest truss community (CTC) model [46], which requires that all query nodes are connected in the community and that the community is a k-truss with the largest trussness k. The problem is defined as finding a closest truss community (CTC): a connected k-truss subgraph with the largest k that contains Q and has the minimum diameter among such subgraphs. It is proven that CTC is an NP-hard problem, and that it is NP-hard to approximate the problem within a factor (2 − ε), for any ε > 0 [46]. Huang et al. propose an algorithm which achieves a 2-approximation to the optimal solution. It is a two-step algorithm. First, given a graph G and a set of query nodes Q, they find the maximal connected k-truss, denoted as G0, containing Q and having the largest trussness. As G0 may have a large diameter, in the second step, they iteratively remove nodes far away from the query nodes, while maintaining the trussness of the remaining graph at k. However, a maximal connected k-truss

G0 may be too large, so it may be inefficient to compute G0 and remove faraway nodes. We propose a truss-preserving index structure based on truss equivalence (TEQ) that supports querying the maximal connected k-truss including the query nodes with maximum k. We prove that community search can be carried out directly upon TEQ without repeated accesses to the original graph G, and that its time complexity is solely determined by the actual size of the output, which is theoretically optimal. We also propose the Minimal algorithm to find a minimal connected k-truss containing the query nodes with maximum k, in order to make the removal of faraway nodes more efficient. We examine, both theoretically and experimentally, the efficiency and effectiveness of our index and algorithm.

In the next sections, we first give the algorithm from the paper [46], which uses the basic index to find the maximal connected k-truss (in Section 5.2.1) and then removes the free riders from it (in Section 5.2.2). We also give the approximation analysis of this algorithm. Then, we give the details of our index structure, which preserves the k-truss community information, and next we use this index to find the maximal connected k-truss efficiently (in Sections 5.4.1, 5.4.2). While this makes the first part efficient, the second part may still be inefficient. To make the second part, the deletion process, more efficient, we need to find a smaller connected k-truss in the first part. To this end, we propose a new algorithm, Minimal, in Section 5.4.3 to find the minimal connected k-truss. We keep the result as small as possible while connecting the query nodes and making k maximum. In this way, we include fewer free riders in the community and need to remove fewer of them in the second part.

5.1 Preliminaries

In this project, we consider an undirected, connected, simple graph G = (VG, EG), where VG is a set of vertices and EG ⊆ VG × VG is a set of edges. Using the support and truss definitions given in Section 4, a connected k-truss is a connected subgraph such that each edge (u, v) in the subgraph is "endorsed" by at least k − 2 common neighbors of u and v. For closeness, we define the following key concepts.

Definition 5.1. (QUERY DISTANCE). Given a graph G and a set of query nodes Q ⊆ V, for each vertex v ∈ G, the vertex query distance of v is the maximum length of a shortest path from v to a query node q ∈ Q, i.e., distG(v, Q) = max_{q∈Q} distG(v, q). For a subgraph H ⊆ G, the graph query distance of H is defined as distG(H, Q) = max_{u∈H} distG(u, Q) = max_{u∈H, q∈Q} distG(u, q).
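A minimal sketch of Definition 5.1, computing vertex query distances with one BFS per query node and taking their maximum; the function names are assumptions, not from the thesis, and the graph is assumed connected as in the preliminaries.

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src via BFS."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def query_distance(adj, Q):
    """dist_G(v, Q) = max over q in Q of dist_G(v, q) for each vertex v,
    plus the graph query distance dist_G(G, Q) = max over v of dist_G(v, Q)."""
    per_q = [bfs_dist(adj, q) for q in Q]
    vdist = {v: max(d[v] for d in per_q) for v in adj}
    return vdist, max(vdist.values())

# Path graph 1-2-3-4 with Q = {1, 4}
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
vdist, gdist = query_distance(adj, [1, 4])
```

On the path graph, the endpoints are the farthest from the opposite query node, so the graph query distance equals the path length.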

Definition 5.2. (GRAPH DIAMETER). The diameter of a graph G is defined as the maximum length of a shortest path in G, i.e., diam(G) = max_{u,v∈G} distG(u, v).

Definition 5.3. (CLOSEST TRUSS COMMUNITY). Given a graph G and a set of query nodes Q, G′ is the closest truss community (CTC) if it satisfies the following two conditions: (1) Connected k-Truss. G′ is a connected k-truss containing Q with the largest k, i.e., Q ⊆ G′ ⊆ G and ∀e ∈ E(G′), sup(e) ≥ k − 2; (2) Smallest Diameter. G′ is a subgraph of the smallest diameter satisfying condition (1). That is, there exists no G″ ⊆ G such that diam(G″) < diam(G′) and G″ satisfies condition (1).

Condition (1) requires that the closest community containing the query nodes Q be densely connected. In addition, condition (2) makes sure that each node is as close as possible to every other node in the community, including the query nodes. The problem of closest truss community (CTC) search studied in this chapter is stated as follows. PROBLEM 1 (CTC-Problem). Given a graph G(V, E) and a set of query vertices Q =

{v1, ..., vr} ⊆ V, find a closest truss community containing Q.

5.2 Basic Algorithmic Framework

Algorithm 7: Basic Framework
Input: G(VG, EG), a set of query nodes Q
Output: A connected k-truss community containing Q
1 Find a maximal connected k-truss containing Q with the largest k as G0 (Sections 5.2.1, 5.4)
2 Delete free riders from G0 until it becomes disconnected or no longer includes a query node (Section 5.2.2)

In this section, we discuss the general framework of the algorithm for the CTC search problem given in the paper [46]. The framework is given in Algorithm 7. It is a two-step algorithm. First, given a graph G and query nodes Q, they find a maximal connected k-truss, denoted as G0, containing Q and having the largest trussness. As G0 may have a large diameter, in the second step they iteratively remove nodes far away from the query nodes as a pruning process, while maintaining the trussness of the remaining graph at k. In addition to this basic method, they also propose a heuristic local-exploration method, LCTC. In the local exploration, a Steiner tree of the query nodes is found on the graph G and then expanded into a k-truss by exploring its local neighborhood. To make the result denser, they define a truss distance, which uses the length of a path and the minimum truss value of the edges on the path as the weight of the path. The first drawback of these algorithms is that finding the maximal connected k-truss on large graphs is time consuming, since the connectivity must be checked at each step while trying to maximize k. Furthermore, the result may be too large, with many free riders, and removing them may take quite a long time.

5.2.1 Finding Maximal Connected k-truss

In this section, we discuss the first step of the basic approach given in Algorithm 7. Huang et al. [46] find a connected k-truss G0 with the largest k (kmax) containing the given query nodes in polynomial time. They need to do a BFS search on the graph to find G0, which is expensive for large graphs.

To find G0 efficiently, they use a simple truss index, which is constructed by organizing edges according to their trussness. For each vertex v ∈ V, its neighbors N(v) are sorted in descending order of the edge trussness τ(e(v, u)) for u ∈ N(v). For each distinct trussness value k ≥ 2, the position of the first vertex u with τ(e(u, v)) = k is marked in the sorted adjacency list. This supports efficient retrieval of v's incident edges with a given trussness value.
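The simple truss index can be sketched as a per-vertex adjacency list sorted by edge trussness, with the first position of each distinct trussness value recorded. The helper name `build_truss_index` is hypothetical, the edge trussness values are assumed precomputed, and the toy values below are illustrative rather than realizable trussness assignments.

```python
def build_truss_index(adj, tau):
    """Per vertex: neighbors sorted by descending edge trussness, plus the
    first index at which each distinct trussness value appears."""
    index = {}
    for v in adj:
        nbrs = sorted(adj[v], key=lambda u: tau[frozenset((v, u))],
                      reverse=True)
        first_pos = {}
        for i, u in enumerate(nbrs):
            # record only the first occurrence of each trussness value
            first_pos.setdefault(tau[frozenset((v, u))], i)
        index[v] = (nbrs, first_pos)
    return index

# Toy triangle with illustrative trussness values.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
tau = {frozenset((1, 2)): 4, frozenset((1, 3)): 3, frozenset((2, 3)): 3}
idx = build_truss_index(adj, tau)
```

Given a vertex and a trussness value k, the recorded position lets the search jump directly to the incident edges with that trussness in the sorted list.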

G0 is initialized as the query vertex set Q, and the edges incident to nodes in G0 are iteratively added to G0, in decreasing order of their trussness, until G0 becomes connected. They initialize k as the minimum of the truss values of the query nodes, k = min{τ(q1), ..., τ(qr)}, and add the incident edges whose truss values are at least k in a BFS manner starting from the query nodes. After traversing the graph, the connectivity of G0 is checked; if it is connected, the algorithm terminates, returning G0 as the result. If not, k is decreased by one and the traversal is repeated with the edges whose truss value is greater than or equal to the new k.

5.2.2 Eliminating Free Riders

After finding G0, the maximal connected k-truss containing the query nodes with maximum k, we need to eliminate free riders from it. Nodes far away from the query nodes are iteratively removed to decrease the diameter, while maintaining the trussness of the remaining graph at k. Algorithm 8 describes the procedure for eliminating free riders.

Algorithm 8: After getting G0, for all u ∈ Gl and q ∈ Q, the shortest distance between u and q is computed to obtain the vertex query distance distGl(u, Q). Among all vertices, a vertex u* with the maximum distGl(u*, Q), which is also the graph query distance distGl(Gl, Q), is selected. Next, the vertex u* and its incident edges are removed from Gl, along with any nodes and edges needed to restore the k-truss property of Gl. The updated graph is assigned as the new Gl. These steps are repeated until Gl no longer has a connected subgraph containing Q. Finally, the algorithm terminates by outputting the graph R as the closest truss community, where R is any graph G′ ∈

Algorithm 8: Free rider elimination
Input: A connected k-truss subgraph G0(V, E), a set of query nodes Q = {q1, ..., qr}
Output: A connected k-truss R with a small diameter
1 while Q is connected do
2     Compute distGl(q, u), ∀q ∈ Q and ∀u ∈ Gl;
3     u* ← arg max_{u∈Gl} distGl(u, Q);
4     distGl(Gl, Q) ← distGl(u*, Q);
5     Delete u* and its incident edges from Gl;
6     Maintain the k-truss property of Gl;
7     Gl+1 ← Gl; l ← l + 1;
8 R ← arg min_{G′∈{G0,...,Gl−1}} distG′(G′, Q)

{G0, ..., Gl−1} with the smallest graph query distance distG′(G′, Q). Note that each intermediate graph G′ ∈ {G0, ..., Gl−1} is a k-truss with the maximum trussness, as required. To improve this pruning process, an optimization method, bulk deletion, is developed to achieve quick termination while sacrificing some approximation ratio. In bulk deletion, instead of removing one faraway node at each step, multiple nodes whose distance equals that of the farthest node are removed. So, instead of removing only the node u* with the maximum distGl(u, Q) = d, all nodes u with distGl(u, Q) = d are removed in one shot. As a second improvement, all nodes u with distGl(u, Q) ≥ d − 1 are removed in one step. This decreases the number of iterations to O(|VG0|/k).

k-truss Maintenance. After removing the nodes Vd and their incident edges from Gl, Gl may no longer be a k-truss, or Q may be disconnected. Thus, we iteratively delete edges having support less than (k − 2), and nodes disconnected from Q, until Gl becomes a connected k-truss containing Q.
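The maintenance step can be sketched as iterative edge peeling. This is a simplified version: the connectivity check against Q is left to the caller, vertices are assumed comparable (e.g., integer ids), and `maintain_k_truss` is an illustrative name.

```python
def maintain_k_truss(adj, k):
    """Iteratively peel edges whose support (# common neighbors) drops
    below k - 2, then drop isolated vertices; repeat until stable."""
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # visit each edge once (u < v) and skip already-peeled edges
                if u < v and v in adj[u] and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
        for u in [x for x in adj if not adj[x]]:
            del adj[u]  # remove isolated vertices
    return adj

# K4 on {1,2,3,4} plus a pendant triangle {4,5,6}
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4},
       4: {1, 2, 3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
maintain_k_truss(adj, 4)
```

With k = 4, the pendant triangle's edges each have support 1 < k − 2 and are peeled away, leaving exactly the 4-clique, which is a 4-truss.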

5.2.3 Approximation Analysis

Algorithm 7 achieves a 2-approximation to the optimal solution [46]. For the self-completeness of this thesis, we give the proof of the 2-approximation of the algorithm [46]. For any optimal solution H*, the first step finds a connected k-truss community G0 which includes the query nodes, Q ⊆ G0, and has the largest trussness value kmax, i.e., τ(G0) = kmax = τ(H*). After removing free riders in

the second step, the diameter of the resulting community R is at most 2 times the optimal, diam(R) ≤ 2 diam(H*). The proof of diam(R) ≤ 2 diam(H*) is given as follows [46]:

FACT 1. Given two graphs G1 and G2 with G1 ⊆ G2, for u, v ∈ V(G1), distG2(u, v) ≤ distG1(u, v) holds. Moreover, if Q ⊆ V(G1), then distG2(G1, Q) ≤ distG1(G1, Q) also holds.

Proof. This follows trivially from the fact that G2 preserves the paths between nodes in G1.

Recall that in Algorithm 8, in each iteration i, a node u* with the maximum distGi(u*, Q) is deleted from Gi, but distGi(Gi, Q) is not monotonically non-increasing during the process; hence distGl−1(Gl−1, Q) is not necessarily the minimum. Note that in Algorithm 8, Gl is not the last feasible graph (i.e., a connected k-truss containing Q), but Gl−1 is. This observation is shown in the following lemma.

Lemma 5.1. In Algorithm 8, it is possible that for some 0 ≤ i < j < l, both Gj ⊂ Gi and distGi(Gi, Q) < distGj(Gj, Q) hold.

Proof. This is easy to see, because for a vertex v ∈ G, distG(v, Q) is monotonically non-decreasing w.r.t. subgraphs of G. More precisely, for v ∈ Gi ∩ Gj, distGi(v, Q) ≤ distGj(v, Q) holds.

We have an important observation: if an intermediate graph Gi obtained by Algorithm 8 contains an optimal solution H*, i.e., H* ⊆ Gi, and distGi(Gi, Q) > distGi(H*, Q), then the algorithm will not terminate at Gi+1.

Lemma 5.2. In Algorithm 8, for any intermediate graph Gi, if H* ⊆ Gi and distGi(Gi, Q) > distGi(H*, Q), then Gi+1 is a connected k-truss containing Q and H* ⊆ Gi+1.

Proof. Suppose H* ⊆ Gi and distGi(Gi, Q) > distGi(H*, Q). Then there exists a node u ∈ Gi \ H* s.t. distGi(u, Q) = distGi(Gi, Q) > distGi(H*, Q). Clearly, u ∉ Q. In the next iteration, Algorithm 8 will delete u from Gi (Step 5) and perform Step 6. The graph resulting from restoring the k-truss property is Gi+1. Since H* is a connected k-truss containing Q, the restoration step (Step 6) must find a subgraph Gi+1 s.t. H* ⊆ Gi+1, and Gi+1 is a connected k-truss containing Q. Thus, the algorithm will not terminate in iteration (i + 1).

We are ready to establish the main result of this section: the polynomial Algorithm 7 finds a connected k-truss community R having the minimum query distance to Q, which is optimal.

Lemma 5.3. For any connected k-truss H with the highest k containing Q, distR(R, Q) ≤ distH(H, Q).

Proof. The following cases arise for Gl−1, which is the last feasible graph obtained by Algorithm 8.

Case (a): H ⊆ Gl−1. We have distGl−1(Gl−1, Q) ≤ distGl−1(H, Q); otherwise, if distGl−1(Gl−1, Q) > distGl−1(H, Q), we can deduce from Lemma 5.2 that Gl−1 is not the last feasible graph obtained by Algorithm 8, a contradiction. Thus, by Step 8 of Algorithm 8 and the fact that distGl−1(Gl−1, Q) ≤ distGl−1(H, Q), we have distR(R, Q) ≤ distGl−1(Gl−1, Q) ≤ distGl−1(H, Q) ≤ distH(H, Q).

Case (b): H ⊄ Gl−1. There exists a vertex v ∈ H deleted from one of the subgraphs {G0, ..., Gl−2}. Suppose the first deleted vertex v* ∈ H is in graph Gi, where 0 ≤ i ≤ l − 2; then v* must be deleted in Step 5, not in Step 6. This is because each vertex/edge of H satisfies the condition of the k-truss, and will not be removed before some vertex is removed from Gi in Step 5. Then we have distGi(Gi, Q) = distGi(v*, Q) = distGi(H, Q), and distGi(Gi, Q) ≥ distR(R, Q) by Step 8. As a result, distR(R, Q) ≤ distGi(H, Q) ≤ distH(H, Q).

Based on the preceding lemmas, we have:

Theorem 5.1. Algorithm 7 provides a 2-approximation to the CTC-Problem, i.e., diam(R) ≤ 2 diam(H*).

Proof. Since distR(R, Q) ≤ distH*(H*, Q) by Lemma 5.3, and diam(R) ≤ 2 distR(R, Q) by the triangle inequality through any query node q ∈ Q, we get diam(R) ≤ 2 distR(R, Q) ≤ 2 distH*(H*, Q) ≤ 2 diam(H*). The theorem follows.

5.3 Truss Equivalence Based Index

Finding the maximal connected k-truss with the basic index requires many edge accesses and is time consuming. To make it more efficient, we propose a truss-preserving index structure which supports finding the maximal connected k-truss efficiently, with theoretically optimal performance. For this, we define k-truss equivalence for edges as follows.

Definition 5.4. (k-truss equivalence). Given any two edges es, et ∈ EG, they are k-truss equivalent (k ≥ 3), denoted as es ↔k et, if and only if (1) their truss values are the same, τ(es) = τ(et) = k, and (2) they are connected via k-truss edges; that is, there exists a sequence of n ≥ 2 edges {e1, e2, ..., en}, s.t. e1 = es, en = et, and for 1 ≤ i < n, ei and ei+1 share a common vertex v ∈ VG and τ(ei) = k.

Given an edge e ∈ E_G with τ(e) = k, the set C_e = {e' | e' ↔_k e, e' ∈ E_G} defines an equivalence class of e w.r.t. k-truss equivalence, and the set of all equivalence classes forms a mutually exclusive and collectively exhaustive partition of E_G. In particular, any equivalence class C_e consists of edges with the same edge trussness, k, that are also connected, making C_e a k-truss community by definition.

Figure 5.1: A sample graph G.

Figure 5.2: k-truss edges in the graph G in Figure 5.1.

Example 5.1. The k-truss-equivalence classes of the edges in the graph G (Figure 5.1(a)) are shown in Figure 5.2. Red edges are 5-truss equivalent; blue edges are 4-truss equivalent (two different classes, which are {(1, 2), (2, 3), (1, 3), (1, 4), (2, 4), (3, 4)} and {(5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)}); and black edges are 3-truss equivalent. The edge (7, 9) is a 3-truss edge and is alone in its equivalence class, since it is not connected to any other 3-truss edge.
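The edge trussness values tabulated in Figure 5.2 can be computed by a standard peeling-style truss decomposition. Below is a minimal Python sketch (the dissertation's actual implementation is in Java; all names here are illustrative): an edge is assigned τ(e) = k at the level k at which its support, the number of triangles it closes, drops to k − 2 or less.

```python
from collections import defaultdict

def truss_decomposition(edges):
    """Peeling-style truss decomposition: repeatedly remove edges whose
    support falls to k - 2 or below, assigning trussness tau(e) = k at
    removal time."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # support(e) = number of triangles containing e
    sup = {frozenset((u, v)): len(adj[u] & adj[v]) for u, v in edges}
    tau, remaining, k = {}, set(sup), 2
    while remaining:
        queue = [e for e in remaining if sup[e] <= k - 2]
        while queue:
            e = queue.pop()
            if e not in remaining:
                continue
            remaining.remove(e)
            tau[e] = k
            u, v = tuple(e)
            for w in adj[u] & adj[v]:
                # removing e breaks the triangle (u, v, w)
                for f in (frozenset((u, w)), frozenset((v, w))):
                    if f in remaining:
                        sup[f] -= 1
                        if sup[f] <= k - 2:
                            queue.append(f)
            adj[u].discard(v)
            adj[v].discard(u)
        k += 1
    return tau
```

For instance, on a triangle {1, 2, 3} with a pendant edge (3, 4), the pendant edge obtains τ = 2 and the triangle edges obtain τ = 3.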

5.3.1 Index Design and Construction

According to k-truss equivalence, all edges of the graph G are partitioned into a series of mutually exclusive equivalence classes, each of which represents a k-truss subgraph. We thus design a truss-equivalence based index, TEQ, as a summarized graph G = (V, E), where V is a super-node set and E ⊆ V × V is a super-edge set. A super-node ν ∈ V represents a distinct equivalence class C_e, where e ∈ E_G, and a super-edge (µ, ν) ∈ E, where µ, ν ∈ V, indicates that the two equivalence classes are connected; that is, ∃e ∈ µ and ∃e' ∈ ν s.t. e and e' are connected (e ∩ e' ≠ ∅). It is important to recognize that TEQ is a truss-preserving graph summary: all maximal connected k-trusses are completely encoded in super-nodes, and the connectivity of different connected k-trusses is exactly maintained in super-edges. Thus, all the information critical to finding the maximal connected k-truss is readily available in TEQ. Furthermore, each edge e of G is maintained in exactly one super-node, the one representing its k-truss equivalence class C_e, so TEQ is also space-efficient.

Example 5.2. The truss-equivalence based index, TEQ, of the graph G (Figure 5.1) is shown in Figure 5.3. It contains 5 super-nodes representing the k-truss equivalence classes of the edges in G, as tabulated in Figure 5.2. For example, the super-node ν4 represents a 4-truss community with 6 edges: they are connected and have the same edge trussness value of 4. Meanwhile, there are 6 super-edges in TEQ depicting the connectivity between super-nodes (k-truss communities).

Figure 5.3: Truss-equivalence based index, TEQ, of G.

Given the graph G, we construct the truss-equivalence based index, TEQ, using Algorithm 9. In the initialization phase (Lines 1-5), we first compute the edge trussness of each edge e ∈ E_G (Line 1), then allocate the edges to different sets Φ_k according to their edge trussness (Lines 2-4). We then examine all the edges of G in non-decreasing order of edge trussness, from Φ_3 to Φ_{kmax} consecutively (Line 8). When selecting an edge e ∈ Φ_k, we create a new super-node N_i corresponding to the equivalence class C_e of e (Line 9). Using e = (u, v) as an initial seed, we traverse G (in BFS) to identify all the

Algorithm 9: TEQ Index Construction
Input: G(V_G, E_G)
Output: TEQ index graph

1:  Perform truss decomposition for G
2:  for each e ∈ E_G do
3:      k ← τ(e)
4:      Φ_k ← Φ_k ∪ {e}
5:  snID ← 0
6:  for k = 3 to k_max do
7:      for each unassigned e(u, v) ∈ Φ_k do
8:          Q.enqueue(e(u, v))
9:          Create super-node N_i with id snID
10:         snID ← snID + 1
11:         while Q ≠ ∅ do
12:             e(x, y) ← Q.dequeue()
13:             Add (x, y) into the edge list of N_i
14:             if N_i ∉ x.list then
15:                 x.list ← x.list ∪ {N_i}        /* likewise for y.list */
16:             for z ∈ N(x) do                     /* likewise for N(y) */
17:                 if τ((x, z)) = k and (x, z) is unassigned then
18:                     Q.enqueue(e(x, z))
19: for each v ∈ V_G do
20:     for each N_i ∈ v.list do
21:         for each N_j ∈ v.list, N_j ≠ N_i do
22:             Create super-edge (N_i, N_j) if it does not exist

edges k-truss equivalent to e by exploring its incident edges, and add them to the super-node N_i (Lines 11-18). For each v ∈ V_G, we keep an auxiliary data structure, v.list, which records the super-nodes that include any edge incident to v. After constructing the super-nodes, we create super-edges based on the connectivity of the edges in the super-nodes: we examine the super-nodes of each vertex and add a super-edge between every pair of super-nodes of that vertex (Lines 19-22). For each super-node v ∈ V, we also sort its neighbors N(v) in descending order of the edge trussness τ(e(v, u)), for u ∈ N(v). For each distinct trussness value k ≥ 2, the position of the first super-node u with τ(e(u, v)) = k is marked in the sorted adjacency list. This supports efficient

retrieval of v's incident super-edges with a certain trussness value. TEQ can be constructed in O(|E_G|^1.5) time and O(|E_G|) space by Algorithm 9. In the initialization phase (Lines 1-5), the truss decomposition costs O(|E_G|^1.5) time, which is the cost of enumerating all triangles. In the index construction phase, we apply BFS on the edges of the graph G, so the complexity is O(|E_G|). The index construction of TEQ is therefore dominated by enumerating all triangles of G, in O(|E_G|^1.5) time. Since each edge is maintained in exactly one super-node, and the number of super-nodes and super-edges is much smaller than in the original graph, TEQ takes O(|E_G|) space.
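Algorithm 9 can be sketched compactly in Python, under the assumption that edge trussness values are already available (e.g., from a truss decomposition). The dictionaries sn_of, sn_edges and vlist below are illustrative stand-ins for the super-node assignment, the per-super-node edge lists, and the auxiliary v.list structure; this is not the dissertation's Java implementation.

```python
from collections import defaultdict, deque

def build_teq(edges, tau):
    """Sketch of Algorithm 9. `tau` maps frozenset({u, v}) -> trussness.
    Returns: member edges per super-node, trussness per super-node, and
    the set of super-edges (classes sharing a vertex)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sn_of, sn_edges, sn_tau, sn_id = {}, defaultdict(list), {}, 0
    for u, v in edges:
        e = frozenset((u, v))
        if e in sn_of:
            continue
        k = tau[e]
        sn_tau[sn_id] = k
        sn_of[e] = sn_id
        q = deque([e])
        while q:  # BFS over adjacent edges with equal trussness
            x, y = tuple(q.popleft())
            sn_edges[sn_id].append(tuple(sorted((x, y))))
            for a in (x, y):
                for z in adj[a]:
                    f = frozenset((a, z))
                    if tau[f] == k and f not in sn_of:
                        sn_of[f] = sn_id
                        q.append(f)
        sn_id += 1
    vlist = defaultdict(set)  # vertex -> super-nodes touching it
    for e, s in sn_of.items():
        for v in e:
            vlist[v].add(s)
    super_edges = {frozenset((a, b)) for sns in vlist.values()
                   for a in sns for b in sns if a != b}
    return sn_edges, sn_tau, super_edges
```

On a triangle with a pendant edge, this yields one 3-truss class and one 2-truss class joined by a single super-edge, mirroring the construction described above.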

5.4 Community search on Index

5.4.1 Finding maximal connected k-truss on TEQ Index

We can find the maximal k-truss simply by traversing the TEQ index graph. Algorithm 7 needs to traverse the original graph and check the connectivity of the community, which involves original graph edges. For a big graph, traversing the graph is very expensive, and if the community is big, checking its connectivity is also expensive. On the other hand, when we use the TEQ index to find the maximal k-truss, we do not need to traverse the original big graph, only the index graph, which is much smaller. Also, the connectivity of the super-nodes reflects the connectivity of the original edges in the community, so we only need to check the connectivity of the super-nodes in the index graph. The number of super-nodes is far smaller than the number of original edges in the community, so the connectivity check is also more efficient with the TEQ index.

Lemma 5.4. Let the TEQ index graph IG and a set of query vertices Q be given, and let q_0 be an arbitrary vertex in Q. The super-nodes in IG that are reachable from the super-nodes of q_0 through edges with weights no less than k, and that make the super-nodes of Q connected, include all edges of the maximal k-truss containing Q.

The proof of the lemma follows directly from the definition of the TEQ index and the maximal k-truss. According to Lemma 5.4, once the TEQ index is constructed from G, the maximal connected k-truss can henceforth be found directly on TEQ without repeated accesses to G, as detailed in Algorithm 10.

The initial trussness value is computed as k = min{τ(q_1), ..., τ(q_r)}. We use S to denote the set of super-nodes to be visited within this level. We start with the super-nodes which include query nodes and whose truss values are greater than or equal to k (Lines 3-6).

Algorithm 10: Query processing using the TEQ Index
Input: G(V, E), a set of query nodes Q = {q_1, ..., q_r}
Output: A maximal connected k-truss G_0 containing Q with maximum k

1:  k ← min{τ(q_1), ..., τ(q_r)}
2:  V(G_max) ← ∅; E(G_0) ← ∅; S ← ∅
3:  for q ∈ Q do
4:      for N_i ∈ H(q) do            /* H(q): super-nodes containing q */
5:          if τ(N_i) ≥ k then
6:              S ← S ∪ {N_i}
    /* BFS search on the TEQ index graph */
7:  while connected(G_max) = false do
8:      for N_j ∈ S do
9:          V(G_max) ← V(G_max) ∪ {N_j}
10:         for N_n ∈ N(N_j) do
11:             if τ(N_n) ≥ k then
12:                 S.push(N_n)
13:     k ← k − 1
14: for N_i ∈ G_max do
15:     E(G_0) ← E(G_0) ∪ N_i.edgelist()

For level k, we process each super-node N_j ∈ S, insert it into G_max, and then visit its neighbors in a

BFS manner on the TEQ index graph. Meanwhile, if a neighbor N_n is not in S and its truss value is greater than or equal to k, we add N_n into S (Lines 7-12). After traversing all super-nodes in S, the algorithm checks whether the super-nodes in G_max are connected in TEQ. If so, the algorithm terminates: the edges in the edge lists of the super-nodes in G_max are pushed into G_0, and G_0 is returned as the maximal k-truss (Lines 14-15); otherwise, we decrease the present level k by 1 (Line 13) and repeat the above steps (Lines 7-12). We store the edges of each super-node in sorted order based on the truss values of the neighboring super-nodes, so retrieving the neighbor super-nodes in Line 10 takes constant time. So the time complexity of

Algorithm 10 is O(|V(G_max)| + |E(G_0)|). Since |V(G_max)| ≤ |E(G_0)|, the complexity is O(|E(G_0)|).
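Algorithm 10 can be sketched as follows, assuming the TEQ summary is represented by per-super-node trussness values (sn_tau), super-node adjacency (super_adj), member edge lists (sn_edges), and a vertex-to-super-node map (sn_of_vertex); all of these names are illustrative, not the dissertation's API. The sketch starts the BFS from the super-nodes of one arbitrary query vertex and lowers k until every query vertex is covered, following Lemma 5.4.

```python
from collections import deque

def query_teq(sn_tau, super_adj, sn_edges, sn_of_vertex, Q):
    """Sketch of Algorithm 10: BFS on the TEQ summary only, restricted to
    super-nodes with trussness >= k, decreasing k until every query
    vertex's super-nodes are reached."""
    # initial k: each query vertex q can be in at most a tau(q)-truss
    k = min(max(sn_tau[s] for s in sn_of_vertex[q]) for q in Q)
    q0 = next(iter(Q))
    while k >= 2:
        seen = {s for s in sn_of_vertex[q0] if sn_tau[s] >= k}
        dq = deque(seen)
        while dq:
            s = dq.popleft()
            for t in super_adj[s]:
                if sn_tau[t] >= k and t not in seen:
                    seen.add(t)
                    dq.append(t)
        # connected iff every query vertex has a super-node in `seen`
        if all(seen & sn_of_vertex[q] for q in Q):
            return k, sorted(e for s in seen for e in sn_edges[s])
        k -= 1
    return None
```

Note that only the small summary graph is traversed; the original graph edges are materialized once at the end, from the edge lists of the visited super-nodes.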

Example 5.3. Consider the sample graph G shown in Figure 5.1, its TEQ index shown in Figure 5.3, and the query nodes {2, 7}. Following Algorithm 10, we initialize k = 4, and we first find the super-node N_1 that contains 2 and the super-node N_3 that contains 7 from TEQ (in Figure 5.3) and add them into S. We iterate over all super-nodes in S, add them into G_max, and also add their neighboring super-nodes whose truss value is greater than or equal to 4 into S. So, we add the super-node N_5 into S, and when we finish the iteration over S, G_max includes {N_1, N_3, N_5}. Then, we check the connectivity of G_max. Since it is not connected, we decrease k from 4 to 3 and repeat the process. Finally, G_max includes all super-nodes in IG. The whole community search process is illustrated in Figure 5.4 and Figure 5.5.

Figure 5.4: Community search on TEQ after first iteration with k = 4

Figure 5.5: Final result for Community search on TEQ

5.4.2 MST of TEQ index structure and querying maximal k-truss on it

While using the TEQ index improves the efficiency of finding the maximal connected k-truss, it may still take a long time for big communities in a large graph, since we do not know the exact maximum k value and must check connectivity and decrease k in a loop until the query nodes become connected. Also, there are many paths between super-nodes in the index that we do not need to use while traversing. To eliminate these paths, we construct the maximum spanning tree of the index graph. It contains a single, maximum-weight path between any two super-nodes and decreases the size of the index by including fewer edges. We can also use it to find the maximum k value for given query nodes.

Given the TEQ index graph IG, we assign to each super-edge (N_i, N_j) a weight equal to the minimum truss value of its two end-super-nodes (i.e., ∀(N_i, N_j) ∈ E(IG), w(N_i, N_j) = min{τ(N_i), τ(N_j)}). Then, we construct a compact tree-structured index T, which is the maximum spanning tree (MST) of the TEQ index IG; that is, the index is a weighted tree where each edge weight is the minimum of the truss values of its two end-super-nodes.

Definition 5.5. (Maximum spanning tree) Given an edge-weighted TEQ index graph IG, a spanning tree of IG is a subgraph of IG that is a tree and contains all nodes of IG. A maximum spanning tree of IG is the spanning tree with the maximum total weight.

The MST has the nice property that it explicitly stores the path with the maximum weight for every pair of nodes, as proved below.

Given any super-nodes N_i and N_j in the TEQ index graph IG, let P_{N_i,N_j} denote the set of all simple paths between the two super-nodes, and define the weight w(P) of a path P as the minimum weight of the super-edges on P (i.e., w(P) = min{w(N'_i, N'_j) | (N'_i, N'_j) ∈ P}).

Lemma 5.5. Let N_i and N_j be any super-nodes in the TEQ index graph IG, and let g_{N_i,N_j} be a connected subgraph of IG which includes the super-nodes N_i and N_j. The maximum truss value of the subgraph g is τ(g) = max_{P ∈ P_{N_i,N_j}} w(P).

Proof. Let P_m be the path in P_{N_i,N_j} with the maximum weight (i.e., P_m = argmax_{P ∈ P_{N_i,N_j}} w(P)). We first prove that τ(g_{N_i,N_j}) ≥ w(P_m). Let P_m = (N_1(= N_i), ..., N_q, N_{q+1}, ..., N_n(= N_j)). Without loss of generality, assume that w(P_m) = w(N_p, N_{p+1}), and let k be this value. Based on the properties of k-truss and the fact that w(N_{p−1}, N_p) ≥ w(N_p, N_{p+1}) = k, g should also contain N_{p−1}. Similarly, we can prove that g contains all vertices in P_m, which include N_i and N_j. Therefore, τ(g_{N_i,N_j}) ≥ k = w(P_m).

Now we prove that τ(g_{N_i,N_j}) ≤ w(P_m) by contradiction. Assume that τ(g_{N_i,N_j}) > w(P_m), and let k be τ(g_{N_i,N_j}). Then, there is a connected subgraph g of IG containing N_i and N_j with τ(g) = k, and the set of edges in G that correspond to edges in g have weights at least k. Since g is connected, we can find a path between N_i and N_j in g with minimum weight at least k (> w(P_m)), which contradicts the fact that P_m is the path with maximum weight among all paths between N_i and N_j (i.e., in P_{N_i,N_j}). Therefore, τ(g_{N_i,N_j}) ≤ w(P_m). Thus, the lemma holds.

Given the weighted TEQ index IG, the MST T can be constructed by a slight modification of the minimum spanning tree algorithms, such as Prim's algorithm or Kruskal's algorithm [23].
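The modification is simply to process edges in descending rather than ascending weight order. A minimal Kruskal-style sketch with a union-find structure (illustrative names; weights such as w = min{τ(N_a), τ(N_b)} would come from the TEQ super-edges):

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges sorted by descending weight yields a
    maximum spanning tree; weighted_edges is a list of (w, a, b) triples."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    mst = []
    for w, a, b in sorted(weighted_edges, reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:  # adding (a, b) does not create a cycle
            parent[ra] = rb
            mst.append((w, a, b))
    return mst
```

Negating all weights and running an off-the-shelf minimum spanning tree routine achieves the same effect.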

Lemma 5.6. Given any MST T constructed from the TEQ index IG, the unique path P in T between super-nodes N_i and N_j has the maximum weight among all paths between N_i and N_j in IG. Thus, τ(g_{N_i,N_j}) = w(P) = min_{(N'_i, N'_j) ∈ P} w_T(N'_i, N'_j), where w_T(N'_i, N'_j) denotes the weight of the edge (N'_i, N'_j) in T.

Proof. We prove the lemma by contradiction. Assume there is a path P' in IG between N_i and N_j with a weight larger than that of P (i.e., w(P') > w(P)). Obviously, not all edges of P' are in T. Without loss of generality, assume that (N_1, N_2) ∈ P has the minimum weight among all edges in P; then (N_1, N_2) ∉ P' since w(P') > w(P). Therefore, there is a simple cycle in P ∪ P' such that (N_1, N_2) is in the cycle and has the minimum weight. According to the cycle property of the maximum spanning tree, (N_1, N_2) cannot be in T, a contradiction. Thus, the lemma holds.

We present the following lemma for computing the maximum trussness of q, k_max, which is the value of k for the maximal connected k-truss containing q; it directly follows from Lemma 5.4 and Lemma 5.6.

Lemma 5.7. Given a set q of query vertices, the maximum trussness of q, k_max, is equal to the minimum edge weight of the subtree T_q of the MST T, where T_q is the minimal connected subtree of T that contains a super-node for each query vertex in q.

Intuitively, T_q is formed by the set of paths in T between a super-node N_i of some vertex v_0 ∈ Q and one super-node of each other vertex in q. Following Lemma 5.7, given a set of vertices Q, we first obtain the subtree T_q of T, and then report the minimum edge weight among all edges in T_q as the maximum trussness of q: k_max = min_{(N_i, N_j) ∈ T_q} w_T(N_i, N_j).

For the algorithm to compute T_q, a naive implementation by BFS or DFS [23] would require O(|V(G_max)|) time to get the subtree T_q for Q, which is slow. We can obtain it in O(|T_q|) time as follows.

Definition 5.6. (Lowest Common Ancestor [2]) The Lowest Common Ancestor (LCA) of two vertices, u and v, in a rooted tree [33], denoted lca(u, v), is defined as the vertex that is farthest from the root and has both u and v as its descendants (where a vertex is allowed to be a descendant of itself).

Similarly, we can define the LCA of the super-nodes S_q of Q, denoted lca(S_q). To compute T_q, we just need to traverse the MST T and find the lowest common ancestor of S_q. Given the MST T, we make it a rooted tree by choosing an arbitrary vertex as the root. Then, it is easy to see that T_q consists of all edges on the paths from every node in S_q to their lowest common ancestor, lca(S_q).
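The k_max computation of Lemma 5.7 can be sketched by rooting the MST once and walking pairs of super-nodes up to their LCA. Since in a tree the union of pairwise paths among S_q equals the union of the paths from one fixed member of S_q to each of the others, the minimum edge weight of T_q is the minimum over those paths. The representation adj_w (node → {neighbor: weight}) and all names are illustrative, and at least two distinct query super-nodes are assumed.

```python
from collections import deque

def root_tree(adj_w, root):
    """Root the MST: record each node's parent, parent-edge weight, depth."""
    parent, pweight, depth = {root: None}, {root: None}, {root: 0}
    dq = deque([root])
    while dq:
        u = dq.popleft()
        for v, w in adj_w[u].items():
            if v not in parent:
                parent[v], pweight[v], depth[v] = u, w, depth[u] + 1
                dq.append(v)
    return parent, pweight, depth

def path_min(u, v, parent, pweight, depth):
    """Minimum edge weight on the unique tree path from u to v,
    found by walking both endpoints up to their LCA."""
    m = float('inf')
    while depth[u] > depth[v]:
        m, u = min(m, pweight[u]), parent[u]
    while depth[v] > depth[u]:
        m, v = min(m, pweight[v]), parent[v]
    while u != v:
        m = min(m, pweight[u], pweight[v])
        u, v = parent[u], parent[v]
    return m

def k_max(query_supernodes, parent, pweight, depth):
    """Lemma 5.7: minimum edge weight in T_q, computed as the minimum
    over the paths from one query super-node to each of the others."""
    s0, *rest = list(query_supernodes)
    return min(path_min(s0, s, parent, pweight, depth) for s in rest)
```

A production version would answer LCA queries in near-constant time with preprocessing [2]; the plain upward walk above suffices to illustrate the idea.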

Figure 5.6: Maximum Spanning tree of TEQ IG in Figure 5.3

Figure 5.7: Rooted tree of Maximum Spanning tree given in Figure 5.6

Example 5.4. Consider the TEQ index graph IG shown in Figure 5.3; its maximum spanning tree is given in Figure 5.6. Blue edges have weight 4 and black edges have weight 3. The rooted tree of this MST is given in Figure 5.7.

Firstly, we present the lemma below, which directly follows from the definition of maximal k-truss and Lemma 5.6.

Lemma 5.8. Given the MST T and the trussness value k_max, the maximal k-truss including Q consists of all edges in the edge lists of the super-nodes that are reachable from the super-nodes of q_0 through super-edges with weight no less than k_max, where q_0 is an arbitrary vertex in Q.

According to Lemma 5.8, after obtaining k_max for the query set Q, we can obtain the maximal k-truss of Q by conducting a BFS starting from a super-node of any query vertex q and visiting all super-nodes whose truss values are at least k_max. If we store T in the form of adjacency lists and organize the edges in each adjacency list in non-increasing weight order, we can implement the BFS of T in O(|V_q|) time, where V_q is the super-node list of the result. Then we just need to retrieve the original graph edges of the super-nodes in V_q.
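The retrieval step of Lemma 5.8 then reduces to a plain graph search over tree edges of weight at least k_max; a minimal sketch (the adjacency representation adj_w is the same illustrative one as above, not the dissertation's data structure):

```python
def community_supernodes(adj_w, start, kmax):
    """Lemma 5.8: collect the super-nodes reachable from `start` through
    MST edges of weight >= kmax; their edge lists form the maximal
    kmax-truss.  adj_w: node -> {neighbor: edge weight}."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v, w in adj_w[u].items():
            if w >= kmax and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```

By Lemma 5.6, an edge weight of at least k_max on the MST path certifies that the corresponding super-nodes belong to the same connected k_max-truss, so no check against the original graph is needed.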

5.4.3 Minimal connected k-truss

In previous sections, we focused on improving the efficiency of CTC by improving the first part, finding the maximal k-truss. While using the MST makes the computation of the maximal connected k-truss efficient, it still takes quite a long time to remove faraway nodes, since the maximal subgraph may be too large and include too many free riders that need to be removed. In this section, we focus on how to improve the efficiency of the second part. If we obtain a smaller k-truss community with maximum k, it will include fewer free riders, and we will need to delete fewer nodes in the second part. To make the removal part more efficient, we should prune early: find a subgraph that has maximum k and makes the query nodes connected, but is minimal. Such a subgraph includes fewer nodes and fewer free riders, so fewer nodes need to be deleted in the second part.

During the process of finding k_max, we obtain a subtree T_q, the minimal connected subtree of T that connects the super-nodes of the query nodes. So when we take the super-nodes in this subtree T_q, the query nodes are connected with maximum k on the path. However, if we just take the edges in these super-nodes, the result will not be a k-truss: some edges ((x, w) with τ(x, w) = z) of the triangles (△xyw) of an edge ((x, y) with τ(x, y) = p) in a p-truss super-node may lie in another super-node with trussness z (z > p). So we need to add these triangle edges to make the result a k_max-truss.

However, we do not need all edges whose truss value is greater than or equal to k_max. So we do not take all other super-nodes with trussness greater than or equal to k_max; instead, we expand T_q with the minimum number of edges until it is a k_max-truss. First, we compute the support values of the edges in the connected super-nodes, and then we find the edges, E_sk, whose support value is less than k_max − 2. Then we traverse the edges in E_sk and add the missing edges of their triangles, which are not in T_q, until their support values reach k_max − 2. Also, note that the truss values of these triangle edges in the input graph G should be at least k_max; otherwise, they would drop the truss value below k_max. We iteratively add edges until G'_0 becomes a k_max-truss. This gives a smaller subgraph that connects the query nodes with maximum k and is a k-truss; it is at most the maximal connected k-truss, but typically smaller. So we need to remove fewer free riders, which makes the second part faster. Algorithm 11 describes the procedure to find the minimal connected k-truss.

Algorithm 11: Minimal Connected k-truss
Input: A rooted MST T, a set of query nodes Q = {q_1, ..., q_r}
Output: A connected k-truss G'_0 with maximal k and minimal edges

1:  S_q ← the super-nodes of the query nodes
2:  Obtain the subtree T_q of T that connects the super-nodes in S_q
3:  k_max ← min{τ(N_i) | N_i ∈ T_q}
4:  G'_0 ← {(u, v) | (u, v) ∈ N_i and N_i ∈ T_q}
5:  Q ← {(u, v) | (u, v) ∈ G'_0 and sup(u, v) < k_max − 2}    /* deficient edges */
6:  for e_j(u, v) ∈ Q do
7:      for w ∈ N(v) ∩ N(u) with τ(w, v) ≥ k_max and τ(u, w) ≥ k_max do
8:          if (w, v) ∉ G'_0 then
9:              G'_0 ← G'_0 ∪ {(w, v)}
10:             if sup(w, v) < k_max − 2 then
11:                 Q ← Q ∪ {(w, v)}
12:         if (w, u) ∉ G'_0 then
13:             G'_0 ← G'_0 ∪ {(w, u)}
14:             if sup(w, u) < k_max − 2 then
15:                 Q ← Q ∪ {(w, u)}

Example 5.5. Consider the sample graph given in Figure 5.1, its rooted MST T given in Figure 5.7, and the query nodes Q = {2, 5}. When we find T_q of Q on T, it includes the super-nodes N_1 and N_2, and k_max = 3. Then, when we collect the edges in these super-nodes and compute their support values, the support value of the edge (6, 10) is 0, since the other edges of its triangle, {(6, 8), (8, 10)}, are in another super-node, N_3. So we should also include these edges to make the edge (6, 10) a 3-truss edge.
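The expansion loop of Algorithm 11 can be sketched as a fixpoint computation over the current edge set: while some edge's support inside the subgraph is below k_max − 2, add the two missing edges of a qualifying triangle (both with trussness at least k_max in G). This is an illustrative Python reformulation under the assumption that the original graph's adjacency (adj) and edge trussness values (tau) are available; it is not the dissertation's Java code.

```python
def expand_to_truss(seed_edges, adj, tau, kmax):
    """Sketch of Algorithm 11's expansion step: starting from the edges of
    the super-nodes in T_q, repeatedly add the two missing edges of a
    triangle (when both have trussness >= kmax) until every edge in the
    result has support >= kmax - 2 within the subgraph."""
    G = {frozenset(e) for e in seed_edges}
    def sup(e):  # support of e inside the current subgraph G
        u, v = tuple(e)
        return sum(1 for w in adj[u] & adj[v]
                   if frozenset((u, w)) in G and frozenset((v, w)) in G)
    changed = True
    while changed:
        changed = False
        for e in [e for e in G if sup(e) < kmax - 2]:
            u, v = tuple(e)
            for w in adj[u] & adj[v]:
                f1, f2 = frozenset((u, w)), frozenset((v, w))
                if (tau[f1] >= kmax and tau[f2] >= kmax
                        and not (f1 in G and f2 in G)):
                    G |= {f1, f2}  # complete the triangle (u, v, w)
                    changed = True
    return sorted(tuple(sorted(e)) for e in G)
```

The loop terminates because G only grows and is bounded by the edges of the input graph; on the triangle of Example 5.5, seeding with only (6, 10) pulls in (6, 8) and (8, 10).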

5.5 Experiments

In this section, we report our experimental studies of closest community search in real-world graphs. To evaluate the efficiency and effectiveness of the improved strategies, we test and compare four algorithms proposed in this paper, namely TEQ, MST, RootedMST, and Minimal. TEQ is Algorithm 10, which uses the TEQ index to find the maximal connected k-truss. MST is the algorithm explained in Section 5.4.2, which uses the MST to find the maximum value of k, k_max, and to compute the maximal connected k-truss. RootedMST is the variant, also explained in Section 5.4.2, which uses the rooted MST for the same purpose. Minimal is Algorithm 11, which finds the minimal k-truss connecting the query nodes with the maximum k value. We primarily compare these algorithms with the state-of-the-art solutions in [46], namely BD, which finds the maximal connected k-truss and then removes free riders with bulk deletion, and LCTC, which finds a local neighborhood of the query nodes with a Steiner tree. We set the parameters as given in [46]. All the algorithms are implemented in Java, and the experiments are performed on a Linux server running Ubuntu 14.04 with two Intel 2.3GHz ten-core CPUs and 256GB of memory.

Datasets. We consider 4 real-world graphs, DBLP, Facebook, Amazon and LiveJournal, which are publicly available from the Stanford Network Analysis Project (SNAP)1. The general statistics of these graphs are reported in Table 5.1.

Table 5.1: Graphs statistics (K = 103 and M = 106)

Network      |V|    |E|    dmax    kgmax
DBLP         317K   1M     342     114
Facebook     4K     88K    1,045   97
Amazon       335K   926K   549     7
LiveJournal  4M     35M    14,815  352

Once the indexes are built, we can use them to support community search in graphs. We randomly generate sets of query nodes, varying the query size |Q|. For efficiency, we report running times in seconds. We test five different query sizes, |Q| ∈ {2, 3, 5, 8, 16}. For each value of |Q|, we randomly select 100 sets of |Q| query nodes and report the average running time. The results for DBLP and Facebook are shown in Figures 5.8, 5.10, 5.11 and 5.9, respectively. We report two different times: the time to find the maximal connected k-truss and the total time to compute the closest truss community. Since our algorithms focus on the efficiency of the first part, the results show the improvements of our algorithms on the first part; however, they, especially Minimal, also improve the total query time. As we see from Figure 5.8 and Figure 5.10, RootedMST performs the best in terms of efficiency for all query sizes for the first part, which is finding the maximal connected k-truss. BD performs BFS on the original large graph and needs to find k_max, so it takes a long time. However, when we use the rooted MST of the TEQ index, it is easy to find k_max, and we perform BFS on the MST, which has fewer nodes and edges than the original graph, so it takes less time. Especially when we increase the number of query nodes, it gets difficult to connect them and to find k_max, since

1snap.stanford.edu/data/index.html

Figure 5.8: Query time to find maximum connected k-truss varying query size |Q| on DBLP

Figure 5.9: Total query time to find closest truss community varying query size |Q| on DBLP

we may need several iterations of decreasing k to connect all the query nodes. However, connecting them on the rooted MST still takes less time. For the total-time comparison of the methods, we use RootedMST, which has the best performance for the first part. It takes quite a long time to remove free riders from the maximal connected k-truss. For

Figure 5.10: Query time to find maximum connected k-truss varying query size |Q| on Facebook

these reasons, while RootedMST improves the first part, the total times of BD and RootedMST are close because of the second part's time. However, Minimal has the best performance in terms of efficiency for all query sizes for the total time. LCTC has a lower total query time on the Facebook data, since the super-nodes of the index, RootedMST, include a large number of original graph edges, and when we include a super-node of a query node, it also brings in many free riders. It is obvious that the number of free riders in the first part of Minimal is less than or equal to the number of free riders in the maximal k-truss, so deletion takes less time. Since LCTC expands the result of the Steiner tree with a size constraint of at most 1000, its deletion part does not take too much time. However, the k-truss value of LCTC is not k_max: it finds a smaller connected subgraph for the given query set, but not with maximal k, so it is not a solution to the CTC-Problem. We give the average truss values of the communities in Table 5.2. As seen from the table, the k value of

LCTC is much lower than k_max, the maximum k value found by the other algorithms. Note that while obtaining the minimal connected k-truss in the Minimal algorithm, we do not decrease the k value; it is the same as the k value of the maximal connected k-truss. We also give the average number of nodes, to show the free-rider elimination, in Table 5.3. Maximum is the number of nodes in the maximal connected k-truss; MaxClosest is the number of nodes in the closest community after removing free riders from Maximum; Minimal is the number of the

88 Figure 5.11: Total query time to find closest truss community varying query size |Q| on Facebook

Table 5.2: Trussness of the Communities for Different Size of Query Node Set, Q

       DBLP            Facebook
|Q|    kmax    kLCTC   kmax    kLCTC
2      7.0     3.2     9.28    4.39
3      6.2     2.9     7.49    2.9
5      5.4     2.8     6.65    2.11
8      5.1     2.1     6.03    2
16     4.5     2.0     5.11    2

nodes in the closest community after removing free riders from the minimal connected k-truss. According to Table 5.3, more free riders are removed with Minimal than with Maximal. After analyzing the query times for different query sizes, we fix the query size to 5 and give the average query time to find the maximal connected k-truss and the total time to compute the closest connected k-truss for all datasets in Figures 5.12 and 5.13. As we see from Figure 5.12, for all datasets, RootedMST gives the best efficiency for finding the maximal connected k-truss. However, since removing free riders takes quite a long time, there is less difference between BD and RootedMST in the total time to compute the closest connected k-truss. Especially for large datasets such as LiveJournal, free-rider elimination takes too much time, so LCTC has better efficiency. On the other hand, as mentioned before, LCTC does not find k_max. We also give the average truss values of communities in

Table 5.3: Number of Edges in the Community for Different Size of Query Node Set, Q

       DBLP                                  Facebook
|Q|    Maximum  MaxClosest  MinimalClosest   Maximum  MaxClosest  MinimalClosest
2      47907    27594       11043            2268     1459        972
3      68546    47887       17376            2967     2031        1087
5      102921   87155       39208            3162     2523        1584
8      115198   103174      49077            3321     2832        1845
16     152845   140998      75032            3549     3194        2025

Table 5.4: Trussness of the Communities in All Datasets

DataSet      kmax    kLCTC
Amazon       3.0     2
Facebook     6.65    2.11
DBLP         5.4     2.8
LiveJournal  7.1     2

Table 5.4 for all datasets.

5.6 Conclusions

In this project, we study the closest truss community search problem in large graphs. We propose a truss-equivalence based indexing approach, TEQ, which is a space-efficient, truss-preserving summarized graph based on the k-truss equivalence of edges. We prove that closest truss community search can be performed directly upon TEQ without costly, repeated accesses to the original graph. We further design the maximum spanning tree of the TEQ index graph, and a rooted version of it, RootedMST, to improve the efficiency of finding the maximal connected k-truss. Moreover, we propose another algorithm, Minimal, to find a minimal connected k-truss while maximizing k, in order to eliminate free riders early. We conduct extensive experimental studies on real-world large-scale graphs, and the results validate both the efficiency and effectiveness of the proposed methods, MST, RootedMST and Minimal, in comparison to the state-of-the-art algorithms, Basic, BD and LCTC [42].

Figure 5.12: Query time to find maximum connected k-truss on all datasets

Figure 5.13: Total query time to find closest connected k-truss on all datasets

CHAPTER 6

CONCLUSIONS

In this thesis, we worked on graphs, structured data representing relationships between objects. If the vertices of a graph have a set of attributes describing their properties, such as interest, gender and education in social network graphs, we call the graph an attributed graph (AG). In real-world networked applications, the underlying graphs oftentimes exhibit fundamental community structures supporting widely varying interconnected processes. Among the different graph mining problems, we worked on the community detection and community search problems. Community detection has become one of the most well-studied problems in graph management and analytics. Community detection, also known as graph clustering, is the task of grouping the vertices of a graph with the objective of putting similar vertices into the same clusters, taking into account the topological structure of the graph, such that the clusters are composed of strongly connected vertices [102, 57, 20]. While most of the work in this area has previously focused on the analysis of graph structures, attributed graph clustering approaches use both structure and attribute information to find similar groups of vertices in the graph. In our first project, we worked on the attributed graph clustering problem: we propose a graph embedding approach to cluster content-enriched, attributed graphs, converting the challenging attributed graph clustering problem into the traditional data clustering problem. Existing community detection methods focus primarily on discovering all communities in a larger graph. In many real-world situations, however, people are more interested in the communities pertaining to a given vertex. In our second and third projects, we work on a query-dependent variant of community detection, referred to as the community search problem.
We study the community search problem in the truss-based community model, aimed at discovering all dense and cohesive k-truss communities to which the query set belongs. We create space-efficient, truss-preserving index structures, EquiTruss and TEQ. While EquiTruss supports community search for a single query node, TEQ supports community search for multiple query nodes. Community search can thus be addressed upon EquiTruss and TEQ

without repeated, time-demanding accesses to the original graph, which proves to be theoretically optimal. Since the maximal k-truss may be too large, we use edge-connectivity constraints in the former project and the graph diameter in the latter to obtain smaller, more compact communities for the query nodes.

6.1 Future Work

Data sources representing attribute information together with network information are widely available in today's applications, such as social networks. To realize their full potential for knowledge extraction, many data mining techniques should consider both information types simultaneously. As my primary future research plan, I want to expand my work on the attributed graph clustering problem to other mining techniques on attributed graphs, such as query, search, and prediction. As another future project, I am planning to work on the community search problem in attributed graphs. To do this, I need to reorganize the index structure to keep attribute information of vertices in addition to structure information. Also, we designed our index structure based on k-truss information; there are other connectivity measures on graphs, such as k-core and k-edge connectivity, so we can apply our ideas to these measures as well. My other future project is about attributed graph embedding. In our attributed graph clustering method, we use random walks to generate the graph embedding; there are other graph embedding methods, such as LINE [89] and GraRep [15], and I am planning to apply these graph embedding methods to attributed graphs as well. Also, according to our survey on user characterization [94], user profiles and their postings are employed as the data source in user characterization. However, there is also important information coming from users' networks. If we combine the network structure with this other information, we get attributed graph data, and we can then answer the questions raised in the survey using it. Therefore, I am planning to use attributed graph data for user characterization. Last but not least, I want to continue to work on sentiment analysis, which I did during my master's. We currently use only text to detect people's opinions; on the other hand, it is obvious that friends also affect users' opinions about different topics.
So, we can analyze opinions based on friend influence in addition to users’ post with considering the data as an attributed graph.
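The truncated random walks used to generate our graph embedding can be sketched in the DeepWalk style [73]; the function and parameter names below are illustrative, not our exact implementation. Each walk is treated as a "sentence" of vertex ids and fed to a word-embedding model such as word2vec:

```python
import random

def generate_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Start `walks_per_node` truncated random walks from every vertex;
    at each step, move to a uniformly chosen neighbor."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: emit the truncated walk
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy graph: a triangle {a, b, c} with a pendant vertex d.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = generate_walks(adj)
assert len(walks) == 2 * len(adj)  # two walks per vertex
assert all(w[i + 1] in adj[w[i]] for w in walks for i in range(len(w) - 1))
```

An attribute-aware variant would additionally let the walk jump between structurally distant vertices that share attribute values, so that the resulting embedding reflects both information types.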

BIBLIOGRAPHY

[1] Charu C Aggarwal and Haixun Wang. Managing and mining graph data, volume 40. Springer, 2010.

[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. On finding lowest common ancestors in trees. In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, STOC ’73, pages 253–265, New York, NY, USA, 1973. ACM.

[3] Esra Akbas and Peixiang Zhao. Attributed graph clustering: An attribute-aware graph embedding approach. 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2017.

[4] Esra Akbas and Peixiang Zhao. Truss Based Community Search: A Truss Equivalence Based Indexing Approach. Proceedings of the VLDB Endowment, 2017.

[5] Leman Akoglu, Hanghang Tong, Brendan Meeder, and Christos Faloutsos. PICS: parameter-free identification of cohesive subgroups in large attributed graphs. In Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim (SDM’12), pages 439–450, 2012.

[6] Aris Anagnostopoulos, Luca Becchetti, Carlos Castillo, Aristides Gionis, and Stefano Leonardi. Online team formation in social networks. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 839–848, New York, NY, USA, 2012. ACM.

[7] Nicola Barbieri, Francesco Bonchi, Edoardo Galimberti, and Francesco Gullo. Efficient and effective community search. Data Min. Knowl. Discov., 29(5):1406–1433, September 2015.

[8] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.

[9] Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.

[10] Devora Berlowitz, Sara Cohen, and Benny Kimelfeld. Efficient enumeration of maximal k-plexes. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 431–444, New York, NY, USA, 2015. ACM.

[11] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[12] Francesco Bonchi, Francesco Gullo, Andreas Kaltenbrunner, and Yana Volkovich. Core decomposition of uncertain graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 1316–1325, New York, NY, USA, 2014. ACM.

[13] Cécile Bothorel, Juan David Cruz, Matteo Magnani, and Barbora Micenková. Clustering attributed graphs: models, measures and methods. CoRR, abs/1501.01676, 2015.

[14] Mario Cannataro, Pietro H. Guzzi, and Pierangelo Veltri. Protein-to-protein interactions: Technologies, databases, and algorithms. ACM Comput. Surv., 43(1):1:1–1:36, 2010.

[15] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM’15), pages 891–900, 2015.

[16] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 891–900, New York, NY, USA, 2015. ACM.

[17] Tanmoy Chakraborty, Sikhar Patranabis, Pawan Goyal, and Animesh Mukherjee. On the formation of circles in co-authorship networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 109–118, New York, NY, USA, 2015. ACM.

[18] Jonathan D. Cohen. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report, pages 1–29, 2008.

[19] D. Combe, C. Largeron, E. Egyed-Zsigmond, and M. Gery. Combining relations and text in scientific network clustering. In Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pages 1248–1253, Aug 2012.

[20] D. Combe, C. Largeron, E. Egyed-Zsigmond, and M. Gery. Getting Clusters from Structure Data and Attribute Data. 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 710–712, August 2012.

[21] D. Combe, C. Largeron, E. Egyed-Zsigmond, and M. Gery. Getting Clusters from Structure Data and Attribute Data. 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 710–712, August 2012.

[22] Diane J. Cook and Lawrence B. Holder. Mining Graph Data. John Wiley & Sons, 2006.

[23] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[24] Juan David Cruz, Cécile Bothorel, and François Poulet. Semantic clustering of social networks using points of view. In COnférence en Recherche d’Informations et Applications - CORIA 2011, 8th French Information Retrieval Conference, Avignon, France, March 16-18, 2011. Proceedings, pages 175–182, 2011.

[25] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, and Wei Wang. Online search of overlapping communities. Proceedings of the 2013 international conference on Management of data - SIGMOD ’13, page 277, 2013.

[26] Wanyun Cui, Yanghua Xiao, Haixun Wang, and Wei Wang. Local search of communities in large graphs. Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD ’14, 2014.

[27] The Anh Dang and Emmanuel Viennet. Community Detection based on Structural and Attribute Similarities. ICDS, pages 7–12, 2012.

[28] S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[29] Yon Dourisboure, Filippo Geraci, and Marco Pellegrini. Extraction and classification of dense communities in the web. In Proceedings of the 16th International Conference on World Wide Web (WWW’07), pages 461–470, 2007.

[30] Yixiang Fang, Reynold Cheng, Xiaodong Li, Siqiang Luo, and Jiafeng Hu. Effective community search over large spatial graphs. Proceedings of the VLDB Endowment, 10(6):709–720, 2017.

[31] Yixiang Fang, Reynold Cheng, Siqiang Luo, and Jiafeng Hu. Effective Community Search for Large Attributed Graphs. Proceedings of the VLDB Endowment, 9(12):1233–1244, 2016.

[32] Rong Ge, Martin Ester, Byron J. Gao, Zengjian Hu, Binay Bhattacharya, and Boaz Ben-Moshe. Joint cluster analysis of attribute data and relationship data: The connected k-center problem, algorithms and applications. ACM Trans. Knowl. Discov. Data, 2(2):7:1–7:35, July 2008.

[33] Alan Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.

[34] Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, and Dawn Song. Evolution of social-attribute networks: Measurements, modeling, and implications using google+. In Proceedings of the 2012 ACM Conference on Internet Measurement Conference (IMC’12), pages 131–144, 2012.

[35] Vic Grout and Stuart Cunningham. A constrained version of a clustering algorithm for switch placement and interconnection in large networks. In CAINE, pages 252–257, 2006.

[36] S. Gunnemann, I. Farber, S. Raubach, and T. Seidl. Spectral subspace clustering for graphs with feature vectors. In 2013 IEEE 13th International Conference on Data Mining (ICDM), pages 231–240, Dec 2013.

[37] S. Gunnemann, I. Farber, B. Boden, and T. Seidl. Subspace clustering meets dense subgraph mining: A synthesis of two paradigms. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 845–850, Dec 2010.

[38] Stephan Gunnemann, Brigitte Boden, and Thomas Seidl. Db-csc: A density-based approach for subspace clustering in graphs with feature vectors. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6911 of Lecture Notes in Computer Science, pages 565–580. Springer Berlin Heidelberg, 2011.

[39] Xiaofeng He, Chris H. Q. Ding, Hongyuan Zha, and Horst D. Simon. Automatic topic identification using webpage clustering. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM’01), pages 195–202, 2001.

[40] Keith Henderson, Tina Eliassi-Rad, Spiros Papadimitriou, and Christos Faloutsos. HCDF: A hybrid community discovery framework. In Proceedings of the SIAM International Conference on Data Mining (SDM’10), pages 754–765, 2010.

[41] Allen L. Hu and Keith C. C. Chan. Utilizing both topological and attribute information for protein complex identification in ppi networks. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 10(3):780–792, 2013.

[42] Jiafeng Hu, Xiaowei Wu, Reynold Cheng, Siqiang Luo, and Yixiang Fang. Querying minimal steiner maximum-connected subgraphs in large graphs. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, pages 1241–1250, New York, NY, USA, 2016. ACM.

[43] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. I/o-efficient algorithms on triangle listing and counting. ACM Transactions on Database Systems (TODS), 39(4):27, 2014.

[44] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. Querying k-truss community in large and dynamic graphs. Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD ’14, 2014.

[45] Xin Huang and Laks V. S. Lakshmanan. Attribute-driven community search. Proceedings of the VLDB Endowment, 10(9):949–960, May 2017.

[46] Xin Huang, Laks V S Lakshmanan, Jeffrey Xu Yu, and Hong Cheng. Approximate Closest Community Search in Networks. Proceedings of the VLDB Endowment, 9(4):276–287, 2015.

[47] Xin Huang, Wei Lu, and Laks VS Lakshmanan. Truss decomposition of probabilistic graphs: Semantics and algorithms. In Proceedings of the 2016 International Conference on Management of Data, pages 77–90. ACM, 2016.

[48] Daxin Jiang, Chun Tang, and Aidong Zhang. Cluster analysis for gene expression data: A survey. Knowledge and Data Engineering, IEEE Transactions, 16(11):1370–1386, 2004.

[49] David R. Karger. Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 21–30, 1993.

[50] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint graph partitioning. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC ’98, pages 1–13, Washington, DC, USA, 1998. IEEE Computer Society.

[51] Wissam Khaouid, Marina Barsky, Venkatesh Srinivasan, and Alex Thomo. K-core decomposition of large networks on a single pc. Proceedings of the VLDB Endowment, 9(1):13–23, September 2015.

[52] Jinha Kim, Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, and Hwanjo Yu. Opt: A new framework for overlapped and parallel triangulation in large-scale graphs. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 637–648. ACM, 2014.

[53] Danai Koutra, U Kang, Jilles Vreeken, and Christos Faloutsos. Summarizing and understanding large graphs. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(3):183–202, 2015.

[54] Matthieu Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci., 407(1-3):458–473, November 2008.

[55] Silvio Lattanzi and D. Sivakumar. Affiliation networks. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pages 427–434, 2009.

[56] Kristen LeFevre and Evimaria Terzi. Grass: Graph structure summarization. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 454–465. SIAM, 2010.

[57] Jure Leskovec, Kevin J. Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 631–640, New York, NY, USA, 2010. ACM.

[58] Rong-hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao. Influential community search in large networks. Proceedings of the VLDB Endowment, 8(5):509–520, 2015.

[59] Rui Li, Chi Wang, and Kevin Chen-Chuan Chang. User profiling in an ego network: Co-profiling attributes and relationships. In Proceedings of the 23rd International Conference on World Wide Web (WWW’14), pages 819–830, 2014.

[60] Fragkiskos Malliaros and Michalis Vazirgiannis. Clustering and community detection in di- rected networks: A survey. Physics Reports, page 86, 2013.

[61] Julian Mcauley and Jure Leskovec. Discovering social circles in ego networks. ACM Trans. Knowl. Discov. Data, 8(1):4:1–4:28, February 2014.

[62] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc.

[63] Andriy Mnih and Geoffrey Hinton. A scalable hierarchical distributed language model. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’08, pages 1081–1088, USA, 2008. Curran Associates Inc.

[64] Flavia Moser, Recep Colak, Arash Rafiey, and Martin Ester. Mining Cohesive Patterns from Graphs with Feature Vectors. In Proceedings of the SIAM International Conference on Data Mining (SDM’09), pages 593–604. Society for Industrial and Applied Mathematics, 2009.

[65] Mari C.V. Nascimento and André C.P.L.F. de Carvalho. Spectral methods for graph clustering - a survey. European Journal of Operational Research, 211(2):221–231, 2011.

[66] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 419–432, New York, NY, USA, 2008. ACM.

[67] Jennifer Neville, Micah Adler, and David Jensen. Clustering relational data using attribute and link information. In Proceedings of the Text Mining and Link Analysis Workshop, 18th International Joint Conference on Artificial Intelligence, pages 9–15, 2003.

[68] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

[69] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.

[70] Vitor Oliveira, Guilherme Gomes, Fabiano Bel´em,Wladmir Brand˜ao,Jussara Almeida, Nivio Ziviani, and Marcos Gon¸calves. Automatic query expansion based on tag recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 1985–1989, New York, NY, USA, 2012. ACM.

[71] Mark Ortmann and Ulrik Brandes. Triangle listing algorithms: Back from the diversion. In Proceedings of the Meeting on Algorithm Engineering & Experiments, pages 1–8, Philadelphia, PA, USA, 2014. Society for Industrial and Applied Mathematics.

[72] Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, and Emmanuel Müller. Focused Clustering and Outlier Detection in Large Attributed Graphs. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[73] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

[74] Nataša Pržulj. Graph theory approaches to protein interaction data analysis. Knowledge Discovery in High-Throughput Biological Domains, 120:000, 2004.

[75] M. Riondato, D. García-Soriano, and F. Bonchi. Graph summarization with quality guarantees. In 2014 IEEE International Conference on Data Mining, pages 947–952, Dec 2014.

[76] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[77] Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. Efficient Community Detection in Large Networks using Content and Links. In WWW, December 2013.

[78] Ahmet Erdem Sariyuce, Buğra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V. Çatalyürek. Incremental k-core decomposition: Algorithms and evaluation. The VLDB Journal, 25(3):425–447, June 2016.

[79] Ahmet Erdem Sariyuce and Ali Pinar. Fast hierarchy construction for dense subgraphs. Proceedings of the VLDB Endowment, 10(3):97–108, November 2016.

[80] Ahmet Erdem Sariyuce, C. Seshadhri, Ali Pinar, and Umit V. Catalyurek. Finding the hierarchy of dense subgraphs using nucleus decompositions. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, pages 927–937, Republic and Canton of Geneva, Switzerland, 2015. International World Wide Web Conferences Steering Committee.

[81] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, August 2007.

[82] Boon-Siew Seah, Sourav S. Bhowmick, C. Forbes Dewey, and Hanry Yu. Fuse: a profit maximization approach for functional summarization of biological networks. BMC Bioinformatics, 13(3):S10, 2012.

[83] Yingxia Shao, Lei Chen, and Bin Cui. Efficient cohesive subgraphs detection in parallel. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 613–624, New York, NY, USA, 2014. ACM.

[84] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, August 2000.

[85] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 939–948, 2010.

[86] Benno Stein and Oliver Niggemann. On the nature of structure and its identification. In Peter Widmayer, Gabriele Neyer, and Stephan Eidenbenz, editors, Graph-Theoretic Concepts in Computer Science, volume 1665 of Lecture Notes in Computer Science, pages 122–134. Springer Berlin Heidelberg, 1999.

[87] Karsten Steinhaeuser and Nitesh V. Chawla. Identifying and evaluating community structure in complex networks. Pattern Recogn. Lett., 31(5):413–421, 2010.

[88] Karsten Steinhaeuser and Nitesh V. Chawla. Community detection in a large real-world social network. In Huan Liu, John J. Salerno, and Michael J. Young, editors, Social Computing, Behavioral Modeling, and Prediction, pages 168–175. Springer US, 2008.

[89] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW’15), pages 1067–1077, 2015.

[90] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.

[91] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[92] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 567–580, New York, NY, USA, 2008. ACM.

[93] Charalampos Tsourakakis, Francesco Bonchi, Aristides Gionis, Francesco Gullo, and Maria Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 104–112. ACM, 2013.

[94] Tayfun Tuna, Esra Akbas, Ahmet Aksoy, Muhammed Abdullah Canbaz, Umit Karabiyik, Bilal Gonen, and Ramazan Aygun. User characterization for online social networks. Social Network Analysis and Mining, 6(1):104, 2016.

[95] Johan Ugander, Lars Backstrom, Cameron Marlow, and Jon Kleinberg. Structural diversity in social contagion. Proceedings of the National Academy of Sciences, 109(16):5962–5966, 2012.

[96] Šejla Čebirić, François Goasdoué, and Ioana Manolescu. Query-oriented summarization of rdf graphs. Proceedings of the VLDB Endowment, 8(12):2012–2015, August 2015.

[97] Nathalie Villa-Vialaneix, Madalina Olteanu, and Christine Cierco-Ayrolles. Carte auto-organisatrice pour graphes étiquetés. In Atelier Fouilles de Grands Graphes (FGG) - EGC’2013, Article numéro 4, Toulouse, France, January 2013.

[98] Jia Wang and James Cheng. Truss decomposition in massive networks. Proceedings of the VLDB Endowment, 5(9):812–823, 2012.

[99] Douglas R White and Frank Harary. The cohesiveness of blocks in social networks: Node connectivity and conditional density. Sociological Methodology, pages 305–359, 2001.

[100] Scott White and Padhraic Smyth. A spectral clustering approach to finding communities in graphs. In In SIAM International Conference on Data Mining, 2005.

[101] Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust Local Community Detection: On Free Rider Effect and Its Elimination. Proceedings of the VLDB Endowment, pages 798–809, 2015.

[102] Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Comput. Surv., 45(4):43:1–43:35, August 2013.

[103] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. Scan: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 824–833, New York, NY, USA, 2007. ACM.

[104] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based approach to attributed graph clustering. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD’12), pages 505–516, 2012.

[105] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based approach to attributed graph clustering. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 505–516, New York, NY, USA, 2012. ACM.

[106] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. Gbagc: A general bayesian framework for attributed graph clustering. ACM Trans. Knowl. Discov. Data, 9(1):5:1–5:43, August 2014.

[107] Jaewon Yang, Julian McAuley, and Jure Leskovec. Community detection in networks with node attributes. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 1151–1156, January 2013.

[108] Jaewon Yang, Julian J. McAuley, and Jure Leskovec. Community detection in networks with node attributes. In IEEE 13th International Conference on Data Mining (ICDM’13), pages 1151–1156, 2013.

[109] Tianbao Yang, Rong Jin, Yun Chi, and Shenghuo Zhu. Combining link and content for community detection: A discriminative approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 927–936, New York, NY, USA, 2009. ACM.

[110] Hugo Zanghi, Stevenn Volant, and Christophe Ambroise. Clustering based on random graph model embedding vertex features. Pattern Recognition Letters, 31(9):830–836, 2010.

[111] Jiawei Zhang, Philip S. Yu, and Yuanhua Lv. Enterprise employee training via project team formation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, pages 3–12, New York, NY, USA, 2017. ACM.

[112] Feng Zhao and Anthony K. H. Tung. Large scale cohesive subgraphs discovery for social network visual analysis. Proceedings of the VLDB Endowment, 6(2):85–96, December 2012.

[113] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment, 2(1):718–729, August 2009.

[114] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Clustering Large Attributed Graphs: An Efficient Incremental Approach. 2010 IEEE International Conference on Data Mining, pages 689–698, December 2010.

[115] Yang Zhou and Ling Liu. Social influence based clustering of heterogeneous information networks. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13, page 338, 2013.

BIOGRAPHICAL SKETCH

The author was born in Afyon, Turkey, and pursued undergraduate studies in Computer Science at TOBB ETU and master’s studies in Computer Science at Bilkent University, Ankara, Turkey. After graduation, she moved to the United States to pursue doctoral studies in Computer Science at Florida State University. She is married and has a lovely son.
