Enumeration of Maximal Cliques from an Uncertain Graph
Total Page:16
File Type:pdf, Size:1020Kb
1 Enumeration of Maximal Cliques from an Uncertain Graph Arko Provo Mukherjee #1, Pan Xu ∗2, Srikanta Tirthapura #3 # Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA Coover Hall, Ames, IA, USA 1 [email protected] 3 [email protected] ∗ Department of Computer Science, University of Maryland A.V. Williams Building, College Park, MD, USA 2 [email protected] Abstract—We consider the enumeration of dense substructures (maximal cliques) from an uncertain graph. For parameter 0 < α < 1, we define the notion of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within a (uncertain) graph. We present an algorithm to enumerate α-maximal cliques whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm. Index Terms—Graph Mining, Uncertain Graph, Maximal Clique, Dense Substructure F 1 INTRODUCTION of the most basic problems in graph mining, and has been applied in many settings, including in finding overlapping Large datasets often contain information that is uncertain communities from social networks [14, 17, 18, 19], finding in nature. For example, given people A and B, it may not overlapping multiple protein complexes [20], analysis of be possible to definitively assert a relation of the form “A email networks [21] and other problems in bioinformat- knows B” using available information. Our confidence in ics [22, 23, 24]. such relations are commonly quantified using probability, and we say that the relation exists with a probability of p, for While the notion of a dense substructure and methods some value p determined from the available information. In for enumerating dense substructures are well understood in this work, we focus on uncertain graphs, where our knowl- a deterministic graph, the same is not true in the case of an edge is represented as a graph, and there is uncertainty in the uncertain graph. This is an important open problem today, presence of each edge in the graph. Uncertain graphs have given that many datasets increasingly incorporate data that been used extensively in modeling, for example, in commu- is noisy and uncertain in nature. Uncertainty can result nication networks [1, 2, 3], social networks [4, 5, 6, 7, 8, 9], from a lack of data. For example, in constructing a social protein interaction networks [10, 11, 12], and regulatory network from data collected through sensors, some com- networks in biological systems [13]. munications between individuals maybe missed, or maybe Identification of dense substructures within a graph is anonymized [4]. In some cases, relationships themselves a fundamental task, with numerous applications in data are probabilistic in nature; for example, the relation of mining, including in clustering and community detection one person influencing another in a social network [25]. in social and biological networks [14], the study of the co- In biological networks such as protein–protein interaction expression of genes under stress [15], integrating different networks, it is known that there are frequent errors in types of genome mapping data [16]. Perhaps the most finding interactions and our knowledge is best modeled elementary dense substructure in a graph, also probably the probabilistically [10]. most commonly used, is a clique, a completely connected In this work, we consider the analog of a maximal clique subgraph. We are typically interested in a maximal clique, in an uncertain graph. Intuitively, a clique in an uncertain which is a clique that is not contained within any other graph is a set of vertices that has a high probability of being clique. Enumerating all maximal cliques from a graph is one a completely connected subgraph. In other words, when we sample from the uncertain graph, this set is likely to form a an extension of MULE to efficiently enumerate only large (deterministic) clique. Finding such sets of vertices enables maximal cliques. us to unearth robust communities within an uncertain graph, Note that the worst–case runtime of our algorithm is for example, a group of proteins such that it is likely that not the same as an exhaustive search. The cost of checking each protein interacts with each other protein. We present whether an uncertain clique is maximal or not can be as a systematic study of the problem of identifying cliques large as Θ(n2). Considering that there are 2n subsets of within an uncertain graph. A preliminary version of this vertices of the graph, exhaustive search has a worst-case work appeared in [26]. runtime of O n2 · 2n, which is worse than our algorithm by a factor of O(n). 1.1 Our Contributions Experimental Evaluation. We present an experimental First, we present a precise definition of a maximal clique in evaluation of MULE using synthetic as well as real-world an uncertain graph, leading to the notion of an α-maximal uncertain graphs. Our evaluation shows that MULE is prac- clique, for parameter 0 < α ≤ 1. A set of vertices U tical and can enumerate maximal cliques in an uncertain in an uncertain graph is an α-maximal clique if U is a graph with tens of thousands of vertices, more than hundred clique with probability at least α, and there does not exist thousand edges and more than two million α-maximal a vertex set U 0 such that U ⊂ U 0 and U 0 is a clique with cliques. Interestingly, the observed runtime of this algorithm probability at least α. When α = 1, the above definition is proportional to the size of the output. The real-world reduces to the well understood notion of a maximal clique graphs included a protein–protein interaction network, and in a deterministic graph. a collaboration network inferred from DBLP. Number of Maximal Cliques. We first consider a basic 1.2 Related Work question on maximal cliques in an uncertain graph: how There has been much recent work on mining from uncertain many α-maximal cliques can be present within an uncertain graphs, including computing shortest paths [28], nearest graph? For deterministic graphs, this question was first neighbors [29], clustering [30], enumerating frequent and considered by Moon and Moser [27] in 1965, who presented reliable subgraphs [31, 32, 33, 34, 35, 36], and distance- matching upper and lower bounds for the largest number constrained reachability [37]. The problem of enumerating of maximal cliques within a graph; on a graph with n dense substructures is different from the above. In partic- vertices, the largest possible number of maximal cliques is ular, the problem of finding reliable subgraphs is one of n 1 3 3 . For the case of uncertain graphs, we present the first finding subgraphs that are connected with a high probability. matching upper and lower bounds for the largest number of However, these individual subgraphs are not required to be α-maximal cliques in a graph on n vertices. We show that dense and may be sparse. In contrast, we are interested in for any 0 < α < 1, the maximum number of α-maximal finding subgraphs that are not just connected, but also fully n cliques possible in an uncertain graph is bn=2c , i.e. there connected with a high probability. The most closely related n work to ours is on mining cliques from an uncertain graph is an uncertain graph on n vertices with bn=2c uncertain maximal cliques and no uncertain graph on n vertices can by Zou et. al [38]. Our work is different from theirs in n significant ways as elaborated below. have more than bn=2c α-maximal cliques. Algorithm for Enumerating All Maximal Cliques. • While we focus on enumerating all α-maximal We present a novel algorithm, MULE (Maximal Uncertain cliques in a graph, they focus on a different problem, cLique Enumeration), for enumerating all α-maximal that of enumerating the k cliques with the highest cliques within an uncertain graph. MULE is based on a probability of existence. depth-first-search of the graph, combined with optimiza- • We present bounds on the number of such cliques tions for limiting exploration of the search space, and a that could exist, while by definition, their problem fast way to check for maximality based on an incremental requires them to output no more than k cliques. computation of clique probabilities. We present a theoretical • We provide a runtime complexity analysis of our analysis showing that the worst-case runtime of MULE is algorithm and show that it is near optimal. No n O (n · 2 ), where n is the number of vertices. This is nearly runtime complexity analysis was provided for the the best possible dependence on n, since our analysis of the algorithm presented in [38]. number of maximal cliques shows that the size of the output • We also provide an algorithm to enumerate only p n can be as much as O( n · 2 ). Such worst-case behavior large maximal uncertain cliques. occurs only in graphs that are very dense; for typical graphs, we can expect the runtime of MULE to be far better, as There is substantial prior work on maximal clique we show in our experimental evaluation. We also present enumeration from a deterministic graph. A popular algo- rithm for maximal clique enumeration problem is the Bron- 1. This assumes that 3 divides n. If not, the expressions are slightly Kerbosch algorithm [39], also based on depth-first-search. different Tomita et al. [40] improved the depth-first-search approach 2 through a better strategy for pivot selection; their resulting Definition 3. In an uncertain graph G, for a set of vertices n algorithm runs in time O(3 3 ), which is worst-case optimal, C ⊆ V , the clique probability of C, denoted by clq(C; G), due to the bound on the number of maximal cliques possi- is defined as the probability that in a graph sampled from ble [27].