1 of Maximal Cliques from an Uncertain Graph

Arko Provo Mukherjee #1, Pan Xu ∗2, Srikanta Tirthapura #3 # Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA Coover Hall, Ames, IA, USA 1 [email protected] 3 [email protected] ∗ Department of Computer Science, University of Maryland A.V. Williams Building, College Park, MD, USA 2 [email protected]

Abstract—We consider the enumeration of dense substructures (maximal cliques) from an uncertain graph. For parameter 0 < α < 1, we define the notion of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within a (uncertain) graph. We present an to enumerate α-maximal cliques whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm.

Index Terms—Graph Mining, Uncertain Graph, Maximal Clique, Dense Substructure !

1 INTRODUCTION of the most basic problems in graph mining, and has been applied in many settings, including in finding overlapping Large datasets often contain information that is uncertain communities from social networks [14, 17, 18, 19], finding in nature. For example, given people A and B, it may not overlapping multiple protein complexes [20], analysis of be possible to definitively assert a relation of the form “A email networks [21] and other problems in bioinformat- knows B” using available information. Our confidence in ics [22, 23, 24]. such relations are commonly quantified using probability, and we say that the relation exists with a probability of p, for While the notion of a dense substructure and methods some value p determined from the available information. In for enumerating dense substructures are well understood in this work, we focus on uncertain graphs, where our knowl- a deterministic graph, the same is not true in the case of an edge is represented as a graph, and there is uncertainty in the uncertain graph. This is an important open problem today, presence of each edge in the graph. Uncertain graphs have given that many datasets increasingly incorporate data that been used extensively in modeling, for example, in commu- is noisy and uncertain in nature. Uncertainty can result nication networks [1, 2, 3], social networks [4, 5, 6, 7, 8, 9], from a lack of data. For example, in constructing a social protein interaction networks [10, 11, 12], and regulatory network from data collected through sensors, some com- networks in biological systems [13]. munications between individuals maybe missed, or maybe Identification of dense substructures within a graph is anonymized [4]. In some cases, relationships themselves a fundamental task, with numerous applications in data are probabilistic in nature; for example, the relation of mining, including in clustering and community detection one person influencing another in a social network [25]. in social and biological networks [14], the study of the co- In biological networks such as protein–protein interaction expression of genes under stress [15], integrating different networks, it is known that there are frequent errors in types of genome mapping data [16]. Perhaps the most finding interactions and our knowledge is best modeled elementary dense substructure in a graph, also probably the probabilistically [10]. most commonly used, is a clique, a completely connected In this work, we consider the analog of a maximal clique subgraph. We are typically interested in a maximal clique, in an uncertain graph. Intuitively, a clique in an uncertain which is a clique that is not contained within any other graph is a set of vertices that has a high probability of being clique. Enumerating all maximal cliques from a graph is one a completely connected subgraph. In other words, when we sample from the uncertain graph, this set is likely to form a an extension of MULE to efficiently enumerate only large (deterministic) clique. Finding such sets of vertices enables maximal cliques. us to unearth robust communities within an uncertain graph, Note that the worst–case runtime of our algorithm is for example, a group of proteins such that it is likely that not the same as an exhaustive search. The cost of checking each protein interacts with each other protein. We present whether an uncertain clique is maximal or not can be as a systematic study of the problem of identifying cliques large as Θ(n2). Considering that there are 2n subsets of within an uncertain graph. A preliminary version of this vertices of the graph, exhaustive search has a worst-case work appeared in [26]. runtime of O n2 · 2n, which is worse than our algorithm by a factor of O(n). 1.1 Our Contributions Experimental Evaluation. We present an experimental First, we present a precise definition of a maximal clique in evaluation of MULE using synthetic as well as real-world an uncertain graph, leading to the notion of an α-maximal uncertain graphs. Our evaluation shows that MULE is prac- clique, for parameter 0 < α ≤ 1. A set of vertices U tical and can enumerate maximal cliques in an uncertain in an uncertain graph is an α-maximal clique if U is a graph with tens of thousands of vertices, more than hundred clique with probability at least α, and there does not exist thousand edges and more than two million α-maximal a vertex set U 0 such that U ⊂ U 0 and U 0 is a clique with cliques. Interestingly, the observed runtime of this algorithm probability at least α. When α = 1, the above definition is proportional to the size of the output. The real-world reduces to the well understood notion of a maximal clique graphs included a protein–protein interaction network, and in a deterministic graph. a collaboration network inferred from DBLP. Number of Maximal Cliques. We first consider a basic 1.2 Related Work question on maximal cliques in an uncertain graph: how There has been much recent work on mining from uncertain many α-maximal cliques can be present within an uncertain graphs, including computing shortest paths [28], nearest graph? For deterministic graphs, this question was first neighbors [29], clustering [30], enumerating frequent and considered by Moon and Moser [27] in 1965, who presented reliable subgraphs [31, 32, 33, 34, 35, 36], and distance- matching upper and lower bounds for the largest number constrained reachability [37]. The problem of enumerating of maximal cliques within a graph; on a graph with n dense substructures is different from the above. In partic- vertices, the largest possible number of maximal cliques is ular, the problem of finding reliable subgraphs is one of n 1 3 3 . For the case of uncertain graphs, we present the first finding subgraphs that are connected with a high probability. matching upper and lower bounds for the largest number of However, these individual subgraphs are not required to be α-maximal cliques in a graph on n vertices. We show that dense and may be sparse. In contrast, we are interested in for any 0 < α < 1, the maximum number of α-maximal finding subgraphs that are not just connected, but also fully n  cliques possible in an uncertain graph is bn/2c , i.e. there connected with a high probability. The most closely related n  work to ours is on mining cliques from an uncertain graph is an uncertain graph on n vertices with bn/2c uncertain maximal cliques and no uncertain graph on n vertices can by Zou et. al [38]. Our work is different from theirs in n  significant ways as elaborated below. have more than bn/2c α-maximal cliques. Algorithm for Enumerating All Maximal Cliques. • While we focus on enumerating all α-maximal We present a novel algorithm, MULE (Maximal Uncertain cliques in a graph, they focus on a different problem, cLique Enumeration), for enumerating all α-maximal that of enumerating the k cliques with the highest cliques within an uncertain graph. MULE is based on a probability of existence. depth-first-search of the graph, combined with optimiza- • We present bounds on the number of such cliques tions for limiting exploration of the search space, and a that could exist, while by definition, their problem fast way to check for maximality based on an incremental requires them to output no more than k cliques. computation of clique probabilities. We present a theoretical • We provide a runtime complexity analysis of our analysis showing that the worst-case runtime of MULE is algorithm and show that it is near optimal. No n O (n · 2 ), where n is the number of vertices. This is nearly runtime complexity analysis was provided for the the best possible dependence on n, since our analysis of the algorithm presented in [38]. number of maximal cliques shows that the size of the output • We also provide an algorithm to enumerate only √ n can be as much as O( n · 2 ). Such worst-case behavior large maximal uncertain cliques. occurs only in graphs that are very dense; for typical graphs, we can expect the runtime of MULE to be far better, as There is substantial prior work on maximal clique we show in our experimental evaluation. We also present enumeration from a deterministic graph. A popular algo- rithm for maximal clique enumeration problem is the Bron- 1. This assumes that 3 divides n. If not, the expressions are slightly Kerbosch algorithm [39], also based on depth-first-search. different Tomita et al. [40] improved the depth-first-search approach

2 through a better strategy for pivot selection; their resulting Definition 3. In an uncertain graph G, for a set of vertices n algorithm runs in time O(3 3 ), which is worst-case optimal, C ⊆ V , the clique probability of C, denoted by clq(C, G), due to the bound on the number of maximal cliques possi- is defined as the probability that in a graph sampled from ble [27]. Further work on enumeration of maximal cliques G, C is a clique. For parameter 0 ≤ α ≤ 1, C is called an includes [41, 42, 43, 44, 45]. There is work on parallel α-clique if clq(C, G) ≥ α. methods for enumerating maximal cliques and bicliques For any set of vertices C ⊆ V , let E denote the set of from a large graph [46, 47]. C edges {e = {u, v}|e ∈ E, u, v ∈ C and u 6= v}, i.e. the Our algorithm uses the general structure of search pre- set of edges connecting vertices in C. sented in [39, 40]. However, unlike the case of a deter- ministic maximal clique where it is easy to incrementally Observation 1. For any set of vertices C ⊆ V in maintain the set of vertices that can be added to the clique, G = (V, E, p), such that C is a clique in G = (V,E), clq(C, G) = Q p(e). for an uncertain graph, this is more complex, since we e∈EC need to be aware of the change in clique probabilities. Proof. Let G be a graph sampled from G. The set C will be Recomputing these can be expensive, and our a clique in G iff every edge in E is present in G. Since the reduce this cost through an incremental computation. Our C events of selecting different edges are independent of each runtime analysis and correctness proof need to take this into other, the observation follows. account, and do not follow from the analysis in [39] or [40]. Roadmap. We present a problem definition in Section 2, Definition 4. Given an uncertain graph G = (V, E, p), and bounds on the number of α-maximal cliques in Section 3, an a parameter 0 ≤ α ≤ 1, a set M ⊆ V is defined as an α- algorithm to enumerate all α-maximal cliques in Section 4, maximal clique if (1) M is an α-clique in G, and (2) There followed by experimental results in Section 5. is no vertex v ∈ (V \ M) such that M ∪ {v} is an α-clique in G. 2 PROBLEM DEFINITION Definition 5. The Maximal Clique Enumeration problem An uncertain graph is a probability distribution over a set in an Uncertain Graph G is to enumerate all vertex sets of deterministic graphs. We deal with undirected simple M ⊆ V such that M is an α-maximal clique in G. graphs, i.e. there are no self-loops or multiple edges. An uncertain graph is a triple G = (V, E, p), where V is a set The following two observations follow directly from of vertices, E ⊆ V × V is a set of (possible) edges, and Observation 1. p : E → (0, 1] is a function that assigns a probability of Observation 2. For any two vertex sets A, B in G, if B ⊂ existence p(e) to each edge e ∈ E. As in prior work on A then, clq(B, G) ≥ clq(A, G). uncertain graphs, we assume that the existence of different edges are mutually independent events. Observation 3. Let C be an α-clique in G. Then for all Let n = |V | and m = |E|. Note that G is a distribution e ∈ EC we have p(e) ≥ α. over 2m deterministic graphs, each of which is a subgraph 3 NUMBEROF MAXIMAL CLIQUES of the undirected graph (V,E). This set of possible deter- The maximum number of maximal cliques in a deterministic ministic graphs is called the set of “possible graphs” of the graph on n vertices is known exactly due to a result by uncertain graph G, and is denoted by D(G). Note that in Moon and Moser [27]. If n mod 3 = 0, this number n n−4 order to sample from an uncertain graph G, it is sufficient to is 3 3 . If n mod 3 = 1, then it is 4 · 3 3 , and if n n−2 sample each edge e ∈ E independently with a probability mod 3 = 2, then it is 2 · 3 3 . The graphs that have the p(e). maximum number of maximal cliques are known as Moon- In an uncertain graph G = (V, E, p), two vertices u and Moser graphs. v are said to be adjacent if there exists an edge {u, v} in For uncertain cliques, no such bound was known so E. Let the neighborhood of vertex u, denoted Γ(u), be the far. In this section, we establish a bound on the maximum set of all vertices that are adjacent to u in G. The next two number of α-maximal cliques in an uncertain graph. For definitions are standard, and apply not to uncertain graphs, 0 < α < 1, let f(n, α) be the maximum number of but to deterministic graphs. α-maximal cliques in any uncertain graph with n nodes, Definition 1. A set of vertices C ⊆ V is a clique in a graph without any assumption about the assignments of edge G = (V,E), if every pair of vertices in C is connected by probabilities. The following theorem is the main result of an edge in E. this section. Theorem 1. Let n ≥ 2, and 0 < α < 1. Then: f(n, α) = Definition 2. A set of vertices M ⊆ V is a maximal clique n  in a graph G = (V,E), if (1) M is a clique in G and bn/2c (2) There is no vertex v ∈ V \ M such that M ∪ {v} is a Proof. We can easily verify that the theorem holds for n = clique in G. n  2. for n ≥ 3, let g(n) = bn/2c . We show f(n, α) is at 3 least g(n) in Lemma 1, and then show that f(n, α) is no We can show that L(n) and U(n) can be bounded as shown more than g(n) in Lemma 2. in Lemma 4 and Lemma 5 respectively.

Lemma 1. For any n ≥ 3, and any α, 0 < α < 1, Lemma 3. For any n ≥ 3, |C∗| = g(n). there exists an uncertain graph G = (V, E, p) with n nodes which has g(n) α-maximal cliques. Proof. We first consider the case when n is even. By Lem- mas 4 and 5, we know n/2 ≤ L(n) ≤ U(n) ≤ n/2. Thus Proof. First, we assume that n is even. Consider G = ∗ ∗ n/2 we have L(n) = U(n) = n/2 which implies C = Cn/2. (V, E, p), where E = V × V . Let κ = 2 . For each C ⊆ C, 0 ≤ k ≤ n κ Recall that k is the collection of subsets e ∈ E, let p(e) = q where q = α. We have 0 < q < 1 of V with the size of k. since 0 < α < 1. Let S be an arbitrary subset of V such ∗ ∗ ∗ We have (1) C = Cn/2 ⊆ Cn/2 and (2) |C | ≥ |Cn/2| that |S| = n/2. We can verify that S is an α-maximal ∗ κ since C is a largest independent set of Gb. Thus we conclude clique since (1) the probability that S is a clique is q = α ∗ n  0 0 0 C = C which has the size of = g(n). and (2) for any set S ) S, S ⊆ V , the probability that S n/2 (n/2) is a clique is at most qqκ = qα < α. We can also observe We next consider the case when n is odd. From Lem- that for any subset S ⊆ V , S cannot be an α-maximal mas 4 and 5, we know (n − 1)/2 ≤ L(n) ≤ U(n) ≤ ∗ ∗ S ∗ clique if |S| < n/2 or |S| > n/2. Thus we conclude that a (n + 1)/2. Thus we have C = C(n−1)/2 C(n+1)/2. subset S ⊆ V is an α-maximal clique iff |S| = n/2 which For notation convenience, we set n1 = (n − 1)/2, n2 =

implies that the total number of α-maximal cliques in G is (n + 1)/2. Let Gb(Cn1 , Cn2 ) be the subgraph of Gb induced n  by C ∪ C . We can view G(C , C ) as a bipartite n/2 . A similar proof applies when n is odd. n1 n2 b n1 n2 graph with two disjoint vertex sets Cn1 and Cn2 respectively. Note that our construction in the Lemma above employs C∗ ⊆ C C∗ ⊆ C E(C∗ ) Observe that n1 n1 and n2 n2 . Let b n1 be the the condition that n ≥ 3 and 0 < α < 1. When α = 1, C∗ G(C , C ) C∗ set of edges induced by n1 in b n1 n2 . Since is an the upper bound is from the result of Moon and Moser for ∗ n independent set of Gb, none of the edges in Eb(Cn ) will have deterministic graphs, and in this case f(n, α) = 3 3 and 1 an end in a node of C∗ , i.e, all the edges of E(C∗ ) should is smaller than g(n). Next we present a useful definition n2 b n1 have an end falling in C \C∗ . Note that in G(C , C ), required for proving the next Lemma. n2 n2 b n1 n2 all nodes have a degree of n2. Thus we have: Definition 6. A collection of sets C is said to be non- |E(C∗ )| = |C∗ |∗n ≤ |C \C∗ |∗n = (|C |−|C∗ |)∗n redundant if for any pair S1,S2 ∈ C, S1 6= S2, we have b n1 n1 2 n2 n2 2 n2 n2 2 S1 * S2 and S2 * S1. |C∗| = |C∗ | + |C∗ | ≤ |C | = from which we obtain n1 n2 n2 n  Lemma 2. g(n) is an upper bound on f(n, α). . Note that Cn itself is an independent set of G with n2 2 b size n . Thus we conclude that |C∗| = n  = g(n). Proof. Let Cα(G) be the collection of all α-maximal cliques n2 n2 in G. Note that by the definition of α-maximal cliques, any Lemma 4. L(n) ≥ bn/2c α-maximal clique S in G can not be a proper subset of any other α-maximal clique in G. Thus from Definition 6, for any uncertain graph G, Cα(G) is a non-redundant collection. Proof. Let us assume n is an even number. We prove by contradiction as follows. Suppose L(n) = ` ≤ n/2−1. Let Hence, it is clear that the largest number of α-maximal ∗ ∗ Ck ⊆ C ,L(n) ≤ k ≤ U(n) be the collection of all sets in cliques in G should be upper bounded by the size of a largest ∗ ∗ ∗ non-redundant collection of subsets of V . C which has the size of k, i.e, Ck = {S ∈ C ||S| = k}. C ⊆ C Let C be the collection of all subsets of V . Based on In the following we construct a new collection new which proves to be an independent set in G with the size C, we construct such an undirected graph Gb = (C, Eb) b being strictly larger than C∗. For each S ∈ C∗, we add to C∗ where for any two nodes S1 ∈ C,S2 ∈ C, there is an ` all subsets of V which has the form as S ∪ {i} where i ∈ edge connecting S1 and S2 iff S1 ⊆ S2 or S2 ⊆ S1. It can ∗ be verified that a sub-collection C0 ⊆ C is a non-redundant V \ S and remove S from C meanwhile. Let Cnew be the 0 collection obtained after we process the same route for all iff C is an independent set in Gb. In Lemma 3, we show that ∗ S S ∈ C` . Mathematically, we have: Cnew = C1 C2 where g(n) is the size of a largest independent set of Gb, which S S ∗ ∗ C1 = ∗ {S ∪ {i}}, C2 = C \C . First we implies that g(n) is an upper bound for the number of α- S∈C` i∈V \S ` maximal cliques in G. show Cnew is an independent set of Gb. Arbitrarily choose two distinct sets, say S1 ∈ Cnew,S2 ∈ Cnew,S1 6= S2. We ∗ Let C be a largest independent set in Gb. Also, let check all the possible cases one by one: Ck ⊆ C, 0 ≤ k ≤ n be the collection of subsets of V with the size of k. Observe that for each 0 ≤ k ≤ n, Ck • S1 ∈ C1,S2 ∈ C1. We observe that |S1| = |S2| = is an independent set of Gb. Also let L(n) and U(n) be `+1 and S1 6= S2. Thus no inclusion relation could ∗ respectively the minimum and maximum size of sets in C . exist between S1 and S2. 4 ∗ ∗ • S1 ∈ C2,S2 ∈ C2. In this case no inclusion We can verify that |Cdual| = |C |. Therefore we can ∗ relation can exist between S1 and S2 since C2 is conclude Cdual is a largest independent set of Gb. By ∗ an independent set of Gb. Lemma 4, we get to know the minimum size of sets in Cdual ∗ • S1 ∈ C1,S2 ∈ C2. Since C` is the collection of should be at least n/2, which yields the maximum size of sets in C∗ which has the smallest size `, we get of sets in C∗ should be at most n/2. For the case when n is that |S2| ≥ ` + 1 = |S1|. Therefore there is only odd, we can analyze essentially the same as above. one possible inclusion relation existing here, that is 0 S1 ⊂ S2. Suppose S1 = S1 ∪ {i1} ⊂ S2 for 4 ENUMERATION ALGORITHM 0 ∗ 0 some S1 ∈ C` . Thus we get that S1 ⊂ S2 which ∗ In this section, we present MULE (Maximal Uncertain implies C is not an independent set of Gb. Hence cLique Enumeration), an algorithm for enumerating all α- we conclude that no inclusion relation could exist maximal cliques in an uncertain graph G, followed by a between S1 and S2. proof of correctness and an analysis of the runtime. We Summarizing the analysis above, we get that no inclu- assume that G has no edges e such that p(e) < α. If there sion relation could exist between S1 and S2 which yields are any such edges, they can be pruned away without losing Cnew is an independent set of Gb. any α-maximal cliques, using Observation 3. Let the vertex ∗ Now we prove that |Cnew| > |C |. Observe that C1 identifiers in G be 1, 2, . . . , n. For clique C, let max(C) ∗ and C2 are disjoint from each other; otherwise C is not an denote the largest vertex in C. For ease of notation, let independent set. So we have |Cnew| = |C1|+|C2|. Note that max(∅) = 0, and let clq(∅, G) = 1. ∗ ∗ ∗ Intuition. We first describe a basic approach to enu- |C | = |C` | + |C2| since C is the union of the two disjoint ∗ ∗ meration using depth-first-search (DFS) with backtracking. parts C` and C2. Therefore |Cnew| > |C | is equivalent to ∗ ∗ The algorithm starts with a set of vertices C (initialized to |C1| > |C` |. Let Gb(C` , C1) be the induced subgraph graph ∗ S ∗ an empty set) that is an α-clique and incrementally adds of Gb by C` C1. Note that Gb(C` , C1) can be viewed as ∗ vertices to C, while retaining the property of C being a bipartite graph where the two disjoint vertex sets are C` ∗ an α-clique, until we can add no more vertices to C. At and C1 respectively. In Gb(C` , C1) we observe that (1) for ∗ this point, we have an α-maximal clique. Upon finding each node S1 ∈ C , its degree d(S1) = n − `; (2) for each ` a clique that is α-maximal, the algorithm backtracks to node S2 ∈ C1, its degree d(S2) ≤ ` + 1. Thus we get ∗ explore other possible vertices that can be used to extend that |Ee| = |C |(n − `) ≤ |C1|(` + 1). According to our ` C, until all possible search paths have been explored. To assumption we have ` ≤ n/2 − 1. Thus we have avoid exploring the same set C more than once, we add |C∗|/|C | ≤ (` + 1)/(n − `) ≤ (n/2)/(n/2 + 1) < 1, ` 1 vertices in increasing order of the vertex id. For instance, if yielding |C∗| < |C | which is equivalent to |C∗| < |C |. ` 1 new C was currently the vertex set {1, 3, 4}, we do not consider So far we have successfully constructed a new collection adding vertex 2 to C, since the resulting clique {1, 2, 3, 4} Cnew ⊆ C such that (1) it is an independent set of Gb and will also be reached by the search path by adding vertices (2) |C | > |C∗|. That contradicts with the fact that C∗ is new 1, 2, 3, 4 in that order. a largest independent set of Gb. Thus our assumption ` ≤ MULE improves over the above basic DFS approach in n/2 − 1 does not hold, which yields ` ≥ n/2. For the case the following ways. First, given a current α-clique C, the when n is odd, we can process essentially the same analysis set of vertices that can be added to extend C includes only as above and get ` ≥ (n − 1)/2. those vertices that are already connected to every vertex Lemma 5. U(n) ≤ dn/2e within C. Instead of considering every vertex that is greater than max(C), it is more efficient to track these vertices as ∗ Proof. Let us assume n is an even number. Based on C , we the recursive algorithm progresses – this will save the effort ∗ ∗ construct a dual collection Cdual as follows: Initialize Cdual of needing to check if a new vertex v can actually be used ∗ as an empty collection. For each S ∈ C , we add V \S into to extend C. This leads us to incrementally track vertices ∗ ∗ S Cdual. Mathematically, we have: Cdual = S∈C∗ {V \ S}. that can still be used to extend C. ∗ First we show Cdual is an independent set of Gb. Arbitrarily Second, note that not all vertices that extend C into ∗ choose two distinct sets, say V \ S1 ∈ Cdual,V \ S2 ∈ a clique preserve the property of C being an α-clique. In ∗ ∗ ∗ Cdual, where S1 ∈ C ,S2 ∈ C ,S1 6= S2. Note that particular, adding a new vertex v to C decreases the clique probability of C by a factor equal to the product of the V \S1 ⊂ V \S2 ⇔ S1 ⊃ S2,V \S2 ⊂ V \S1 ⇔ S2 ⊃ S1 edge probabilities between v and every vertex in C. So, in Thus we have that no inclusion relation could exist considering vertex v for addition to C, we need to compute between V \ S1 and V \ S2 since no inclusion relation the factor by which the clique probability will fall. This ∗ exists between S1 and S2 resulting from the fact that C is computation can itself take Θ(n) time since the size of C ∗ an independent set of Gb. So we get Cdual is an independent can be Θ(n), and there can be Θ(n) edges to consider in set as well. adding v. A key insight is to reduce this time to O(1) by

5 incrementally maintaining this factor for each vertex v still Algorithm 2: Enum–Uncertain–MC(C, q, I, X) under consideration. The recursive subproblem contains, in Input: We assume G and α are available as addition to current clique C, a set I consisting of pairs immutable global variables (u, r) such that u > max(C), u can extend C into an α- C is the current Uncertain Clique being processed clique, and adding u will multiply the clique probability of q = clq(C, G), maintained incrementally C by a factor of r. This set I is incrementally maintained I is a set of all tuples (u, r), such that u > max(C), and supplied to further recursive calls. and C ∪ {u} is an α-clique in G Finally, there is the cost of checking maximality. Sup- X is a set of all tuples (v, s), such that v 6∈ C, pose that at a juncture in the algorithm we found that I was v < max(C), and C ∪ {v} is an α-clique in G empty, i.e. there are no more vertices greater than max(C) 1 if I = ∅ and X = ∅ then that can extend C into an α-clique. This does not yet mean 2 Output C as α-maximal clique that C is an α-maximal clique, since it is possible there 3 return are vertices less than max(C), but not in C, which can extend C to an α-maximal clique (note that such an α- 4 forall the (u, r) ∈ I considered in lexicographical ordering over u do maximal clique will be found through a different search 0 5 C ← C ∪ {u} // Lemma 6 path). This means that we have to run another check to see 0 6 m = max(C ) = u if C is an α-maximal clique. Note that even checking if a 0 2 7 q ← q · r // clq(C ∪ {v}, G) set of vertices C is an α-maximal clique can be a Θ(n ) 0 0 0 8 I ← GenerateI(C , q ,I) operation, since there can be as many as Θ(n) vertices to be 0 0 0 2 9 X ← GenerateX(C , q ,X) potentially added to C, and Θ(n ) edge interactions to be 0 0 0 0 considered. We reduce the time for searching such vertices 10 Enum–Uncertain–MC(C , q ,I ,X ) by maintaining the set X of vertices that can extend C, but 11 X ← X ∪ {(u, r)} will be explored in a different search path. By incrementally maintaining probabilities with vertices in I and X, we can reduce the time for checking maximality of C to Θ(n). Algorithm 3: GenerateI(C0, q0,I) MULE incorporates the above ideas and is described in Input: We assume G and α are available as Algorithm 1. immutable global variables 0 0 1 m ← max(C ), I ← ∅, S ← ∅ Algorithm 1: MULE(G, α) 2 forall the (u, r) ∈ I do Input: G is the input uncertain graph, 3 S ← S ∪ {u}

α, 0 < α < 1 is the user provided probability 4 S ← S ∩ {Γ(m)} threshold 5 forall the (u, r) ∈ I do ˆ 1 I ← ∅ 6 if u > m and u ∈ S then 0 0 2 forall the u ∈ V do 7 clq(C ∪ {u}, G) ← q · r · p({u, m}) ˆ ˆ 0 3 I ← I ∪ {(u, 1)} 8 if (clq(C ∪ {u}, G)) ≥ α then 0 9 4 Enum–Uncertain–MC(∅, 1 ,Iˆ, ∅) u ← u 0 10 r ← r · p({u, m}) 4.1 Proof of Correctness 0 0 0 0 11 I ← I ∪ {(u , r )} In this section we prove the correctness of MULE. Theorem 2. MULE (Algorithm 1) enumerates all α- 12 return I’ maximal cliques from an input uncertain graph G. Proof. Let u0 ∈ V be a vertex such that (1) u0 > max(C0), Proof. To prove the theorem we need to show the following. and (2) C0 ∪ {u0} is an α-clique in G. We need to show that First, if C is a clique emitted by Algorithm 1, then C must (u0, r0) ∈ I0 such that clq(C0 ∪ {u0}, G) = q0 · r0. be an α-maximal clique. Next, if C is an α-maximal clique, Let C0 be a clique being called by Enum–Uncertain–MC then it will be emitted by Algorithm 1. We prove them in with I0. Note that each call of the method adds one vertex Lemmas 8 and 9 respectively. u ∈ I to the current clique C such u > max(C). Since the vertices are added in the lexicographical ordering, there is an Before proving Lemmas 8 and 9, we prove some prop- unique sequence of calls to the method Enum–Uncertain– erties of Algorithm 2. MC such that we reach a point in execution of Algorithm 2 Lemma 6. When Algorithm 2 is called with C0 in line 10, I0 where Enum–Uncertain–MC is called with C0. We call this is a set of all tuples (u0r0), where u0 ∈ V and 0 < r0 ≤ 1, sequence of calls as Call-0, Call-1, ..., Call-|C0|. Also, 0 0 0 0 0 0 such that u > max(C ), and clq(C ∪{u }, G) = q ·r ≥ let Ci be the clique used by method Enum–Uncertain–MC α, i.e. C0 ∪ {u0} is an α-clique in G. during Call-i.

6 Algorithm 4: GenerateX(C0, q0,X) vertices that are greater than the maximum vertex in C. Let Input: We assume G and α are available as X be the corresponding set of tuples used when the call was immutable global variables made to Enum–Uncertain–MC with C. Let u > max(C) 0 0 0 be a vertex such that clq(C ∪ {u}, G) ≥ α and u < m. 1 m ← max(C ), X ← ∅, S ← ∅ Note that u 6∈ C0, u < max(C0), and C0 ∪ {u} is an 2 forall the (v, s) ∈ I do α-clique in G. This means u satisfies all conditions for 3 S ← S ∪ {v} u ∈ X0. We need to show that when Enum–Uncertain– 4 S ← S ∩ {Γ(m)} MC is called with C0, the generated X0 which is passed in 5 forall the (v, s) ∈ X do Enum–Uncertain–MC contains u. 6 if v ∈ S then 0 0 0 Firstly, note that since C ∪ {u} is α-clique in G, we 7 clq(C ∪ {v}, G) ← q · s · p({v, m}) 0 have clq(C ∪ {u}, G) ≥ α (from Observation 2). Since 8 if (clq(C ∪ {v}, G) ≥ α then 0 u > max(C) and clq(C ∪ {u}, G) ≥ α, from Lemma 6, 9 v ← v 0 u will be used in line 4 to call Enum–Uncertain–MC using 10 s ← s · p({v, m}) 0 0 0 0 C ∪ {u}. Once this call is returned, u is added to X in 11 X ← X ∪ {(v , s )} line 11. Note that since the loop at line 4 add vertices in lexicographical order, m will be added to C after u. Thus u 12 return X’ will be in X, when m is used to extend C. Next we show that if u ∈ X, after execution of line 9, u ∈ X0. We prove We prove by induction. First consider the base case. For this as follows. Note that Algorithm 4 is used to generate that consider the first call made to Algorithm 2, i.e. Call- X0 from X. Note that X0 is generated by Algorithm 4 by 0. We know that C is initialized as ∅. During the first call selectively adding vertices from X. A vertex is added to X0 made, all vertices in V satisfy conditions (1) and (2). This from X, only if C0 ∪ {u} is α-clique in G. From our initial is because, first max(∅) = 0. Second any single vertex can assumptions, we know that u satisfies this condition and is be considered as a clique with probability 1. Iˆ is initialized hence added to X0 and passed on to Enum–Uncertain–MC such that all r in Iˆ are 1 ≥ α. Thus for all u such that when it is called with C0. (u, r) ∈ Iˆ, u > max(C). This proves the base case. Now let us consider v, such that v does not satisfy all the For the inductive step, consider a recursive call to the conditions for v ∈ X0. We need to show that v 6∈ X0. There method Call-i which calls Call-(i+1). For every case expect are two cases. First, when v 6∈ X. This case is trivial as X0 initialization, I0 is generated from I by line 8 of Algorithm 2 is constructed from X and hence if v 6∈ X, v 6∈ X0. For which in turn calls Algorithm 3. In Algorithm 3, only the second case, when v ∈ X, we need to show that v will vertices in I that are greater than C0 are added to I0. Thus all not be added to X0 in line 9 of Algorithm 2. Note that since vertices in I that satisfy (1) are added to I0. Next every ver- v ∈ X, we know v 6∈ C0 and v < max(C0). Thus, it must tex in I is connected to C. We need to show that all vertices be that C ∪ {m, v} is not an α-clique in G. Algorithm 4 in I0 are connected to C0. In line 4 of Algorithm 3, we prune will add v to X0 only if C ∪ {m, v} is α-clique in G. But out any vertex in I that is not connected to m = max(C0). from our previous discussion, we know that this condition Assume that u0 extends C such that clq(C ∪ {u0}, G) = r. doesn’t hold. Hence, v will not be added to X0. Thus only Now let c = {C0 \ C }. Note that c is a single vertex. Also, vertices that satisfy all three conditions are in X0. assume u0 > c. From line 4, we know that q0 · r0 ≥ α 0 0 Also from line 6 of Algorithm 3, r = r · p({c, u }). Now Lemma 8. Let C be a clique emitted by Algorithm 2. Then 0 0 0 0 0 0 clq(C ∪ {u }, G) = q · r · p({c, u }) = q · r , Now in C is an α-maximal clique. line 8 of Algorithm 3 we add u0 to I0 only if r0 ≥ α thus proving the inductive step. Proof. Algorithm 2 emits C in Line 2. From Observation 4, we know that C is an α-clique. We need to show that C is The following observation follows from Lemma 6. α-maximal. We use proof by contradiction. Suppose C is Observation 4. The input C to Algorithm 2 is an α-clique. non-maximal. This means that there exists a vertex u ∈ V , such that C ∪{u} is an α-clique. We know that I = ∅ when Lemma 7. When Algorithm 2 is called with C0 in line 10, C X0 is a set of all tuples (v0, s0), where v0 ∈ V and 0 < is emitted. From Lemma 6, we know that there exists no u ∈ V u > max(C) C s0 ≤ 1, such that, ∀(v0, s0) ∈ X0, we have v0 6∈ C0, v0 < vertex such that that can extend . X = ∅ C max(C0), and (clq(C0 ∪ {v0}, G) = q0 · s0) ≥ α, i.e. Again, we know that when is emitted. Thus from v ∈ V C0 ∪ {v0} is an α-clique in G. Lemma 7, we know that there exists no vertex such that v < max(C) that can extend C. This is a contradiction Proof. Let m = max(C0) and C = C0 \{m}. Since Al- and hence C is an α-maximal clique. gorithm 2 was called with C0, it must have been called with C. This is because the working clique is always extended by Lemma 9. Let C be an α-maximal clique in G. Then C is adding vertices from I, and from Lemma 6, I only contains emitted by Algorithm 2.

7 Proof. We first show that a call to method Enum– We next consider the time taken at each internal node. Uncertain–MC with α-clique C enumerates all α-maximal Instead of adding up the times at different internal nodes, cliques C0 in G, such that for all c ∈ {C0 \ C}, we equivalently add up the cost of the different edges in c > max(C). the search tree. At each internal node, the cost of making Without loss of generality, consider a α-maximal clique a recursive call can be analyzed as follows. Line 5 takes C0 in G such that ∀c ∈ {C0 \ C}, c > max(C). Note that O (n) time as we add all vertices in C to C0 and also u. C0 will be emitted as an α-maximal clique by the method Line 7 takes constant time. Lines 8 and 9 take O (n) time Enum–Uncertain–MC when called with C, if the following (Lemmas 10 and 11 respectively). Note that lines 5 to 9 can holds: (1) A call to method Enum–Uncertain–MC is made get executed only once in between the two calls. Thus total with C0, (2) When this call is made, I0 = ∅, and X0 = ∅. runtime for each edge of the search tree is O (n). Since C0 is α-maximal clique in G, the second point follows Note that the total number of calls made to the method from Lemmas 6 and 7. Thus we need to show that a call to method Enum–Uncertain–MC is no more than the possible Enum–Uncertain–MC is made with C0. number of unique subsets of V , which is O (2n). We see 0 We prove this by induction. Let Cˆ = {C \ C}. Let ci that for internal nodes, is O (n) and for leaf represent the ith element in Cˆ in lexicographical order. Also nodes it is O (1). Hence the time complexity of Algorithm 2 n let Ci = C ∪ {c1, c2, . . . , ci}. For the base case, we show is O (n · 2 ). that if a call to Enum–Uncertain–MC is made with C, a call Thus now we need to prove that lines 8 and 9 take will be made with C1 = C ∪ {c1}. This is because, line 4 O (n) time. This implies that time complexity of Algo- of the method loops over every vertex u ∈ I thus implying 0 rithms 3 and 4 is O (n). We prove the same in Lem- u > max(C) and clq(C ∪ {u}, G) ≥ α. Since C is an mas 10 and 11 respectively. α-maximal clique, c1 will satisfy both these conditions and hence a call to Enum–Uncertain–MC is made with C∪{c1}. Lemma 10. The runtime of Algorithm 3 is O (n). Now for the inductive step we show that if a call is made Proof. First note that lines 1-6 takes O (n) time. This is with clique Ci, then this call will in turn call the method because |I| = O (n), and hence the loop at line 4 of Algo- with clique Ci+1. Again, ci+1 is greater than max(Ci) rithm 3 can take O (n) time. Further the set intersection at and clq(Ci ∪ {ci+1}, G) ≥ α. Thus ci+1 ∈ I when the line 6 also takes O (n) time. We need to show that the for call is made to Enum–Uncertain–MC with Ci. Hence using loop in line 7 is O (n), that is each iteration of the loop takes the previous argument, in line 4, ci+1 will be used as a O (1) time. Assume that it takes constant time to find out vertex in the loop which would in turn make a call to Enum– the probability of an edge. This is a valid assumption, as the Uncertain–MC with Ci+1. edge probabilities can be stored as a HashMap and hence for Now without any loss of generality, consider an α- an edge e, in constant time we can find out p(e). With this maximal clique in G. We know that C ⊃ ∅. Thus the proof assumption, it is easy to show that lines 8-13 takes constant follows. time. This is because, they are either constant number of multiplications, or adding one element to a set. Thus total 4.2 Runtime Complexity time complexity is O (n). Theorem 3. The runtime of MULE (Algorithm 1) on an Lemma 11. The runtime of Algorithm 4 is O (n). input graph of n vertices is O (n · 2n). We omit the proof of the above lemma since it is similar Proof. MULE initializes variables and calls to Algorithm 2, to the proof of Lemma 10. hence we analyze the runtime of Algorithm 2. An execution Observation 5. The worst-case runtime of any algorithm of the recursive Algorithm 2 can be viewed as a search tree that can output all maximal cliques of an uncertain graph as follows. Each call to Enum–Uncertain–MC is a node of √ on n vertices is Ω( n · 2n). this search tree. The first call to the method is the root node. A node in this search tree is either an internal node that Proof. From Theorem 1, we know that the number of n  makes one or more recursive calls, or a leaf node that does maximal uncertain cliques can be as much as bn/2c =  n  not make further recursive calls. To analyze the runtime of Θ √2 (using Stirling’s Approximation). Since the size Algorithm 2, we consider the time spent at internal nodes as n of each uncertain clique can be Θ(n), the total output size well as leaf nodes. √ can be Ω( n · 2n), which is a lower bound on the runtime The runtime at each leaf node is O(1). For a leaf node, of any algorithm. the parameter I = ∅, and there are no further recursive calls. This implies that either C is α-maximal (X = ∅) and Lemma 12. The worst-case√ runtime of MULE on an n is emitted in line 2 or it is non-maximal (X 6= ∅) but cannot vertex graph is within a O( n) factor of the runtime of an be extended by the loop in line 4 as I = ∅. Checking the optimal algorithm for Maximal Clique Enumeration on an sizes of I and X takes constant time. uncertain graph.

8 Proof. The proof follows from Theorem 3 and Observa- Algorithm 6: Enum–Uncertain–MC–Large(C, q, I, X, t) tion 5. Input: We assume G and α are available as 4.3 Enumerating Only Large Maximal Cliques immutable global variables C is the current Uncertain Clique being processed For a typical input graph, many maximal cliques are small, q = clq(C, G), maintained incrementally and may not be interesting to the user. Hence it is helpful I is a set of all tuples (u, r), such that u > max(C), to have an algorithm that can enumerate only large maxi- and C ∪ {u} is an α-clique in G mal cliques efficiently, rather than enumerate all maximal X is a set of all tuples (v, s), such that v 6∈ C, cliques. We now describe an algorithm that enumerates v < max(C), and C ∪ {v} is an α-clique in G every α-maximal clique with more than t vertices, where t is the user provided size threshold t is an user provided parameter. 1 if I = ∅ and X = ∅ then As a first step, we prune the input uncertain graph 2 Output C as α-maximal clique G = (V, E, p) by employing techniques described by 3 return Modani and Dey [42]. We apply the “Shared Neighborhood Filtering” where edges are recursively checked and removed 4 forall the u, r ∈ I considered in lexicographical as follows. First drop all edges {u, v} ∈ E, such that ordering over u do 0 |Γ(u) ∩ Γ(v)| < (t − 2). Next drop every vertex v ∈ V , 5 C ← C ∪ {u} // Lemma 6 0 that doesn’t satisfy the following condition. For vertex 6 m = max(C ) = u 0 v ∈ V , there must exist at least (t − 1) vertices in Γ(v), 7 q ← q · r // clq(C ∪ {v}, G) 0 0 0 such that for u ∈ Γ(v), |Γ(u) ∩ Γ(v)| < (t − 2). Let G0 8 I ← GenerateI(C , q ,I) 0 0 denote the graph resulting from G after the pruning step. 9 if |C | + |I | < t then Algorithm 5 runs on the pruned uncertain graph G0 10 continue 0 0 0 to enumerate only large maximal cliques. The recursive 11 X ← GenerateX(C , q ,X) 0 0 0 0 method in Algorithm 6 differs from Algorithm 2 as follows. 12 Enum–Uncertain–MC–Large(C , q ,I ,X , t) Before each recursive call to method Enum–Uncertain–MC- 13 X ← X ∪ {(u, r)} Large (Algorithm 6), the algorithm checks if the sum of the 0 sizes of the current working clique C and the candidate a call must be made to Enum–Uncertain–MC–Large with vertex set I0 are greater than the size threshold t. If not, 0 C1 before the call is made with C1. In the worst case, the recursive method is not called. This optimization leads let us consider that the search tree reaches the execution to a substantial pruning of the search space and hence a 0 point where Enum–Uncertain–MC–Large is called with C1. reduction in runtime. Consider the execution of the algorithm where m1 is added 0 0 to C = C1 to form C = C1. Since C1 is an α-maximal Algorithm 5: LARGE–MULE(G, α,t) clique, I0 will become NULL which implies |I0| = 0. We 0 0 Input: G is the input uncertain graph post pruning know that |C1| < t. Thus |C1 + I | will also be less than α, 0 < α < 1 is the user provided probability t and the If condition (line 8) will succeed. This will result threshold in the execution of the continue statement. Thus Enum– t, t ≥ 2 is the user provided size threshold Uncertain–MC-Large will not be called with C1 implying 1 Iˆ ← ∅ that C1 is not enumerated. 2 forall the u ∈ V do Next we show that any maximal clique of size at least 3 Iˆ ← Iˆ∪ {(u, 1)} t is enumerated by Algorithm 6. Consider an α-maximal clique C2 in G of size at least t. We note that the “If” 4 Enum–Uncertain–MC–Large(∅, 1 ,Iˆ, ∅,t) condition in line 8 is never satisfied in the search path ending with C2 and hence a call is made to the method with Enum-Uncertain-MC-Large with C2. This is easy to Lemma 13. Given an input graph G, LARGE–MULE (Al- see as whenever a call is made to Enum-Uncertain-MC- gorithm 5) enumerates every α-maximal clique with more Large with any C ⊆ C2, since C2 is large, we always have than t vertices. |C| + |I| ≥ t. Proof. First we prove that no maximal clique of size less 5 EXPERIMENTAL RESULTS than t is enumerated by Algorithm 6. Consider an α- maximal clique C1 in G with less than t vertices. Also We report the results of an experimental evaluation of our 0 let m1 = max(C1) and C1 = C1 \{m1}. Note that if algorithm. We implemented the algorithm using Java and C1 is emitted by Algorithm 6, then a call must be made to ran all experiments on a system with a 3.19 GHz Intel(R) Enum–Uncertain–MC–Large with C1. Since the Algorithm Core(TM) i5 processor and 4 GB of RAM, with heap space adds vertices in lexicographical ordering, this implies that configured at 1.5GB.

9 TABLE 1: Input Graphs Input Graph Category Description # Vertices # Edges Fruit-Fly Protein Protein Interaction network PPI for Fruit Fly from STRING 3751 3692 DBLP10 Social network Collaboration network from DBLP 684911 2284991 amazon–0302 Product co-purchasing network March 2003 Amazon co-purchase network 262111 1234877 p2p-Gnutella08 Internet peer-to-peer networks Gnutella network August 8 2002 6301 20777 p2p-Gnutella04 Internet peer-to-peer networks Gnutella network August 4 2003 10879 39994 p2p-Gnutella09 Internet peer-to-peer networks Gnutella network August 9 2003 8114 26013 ca-GrQc Collaboration networks Arxiv General Relativity 5242 28980 wiki-vote Social networks wikipedia who-votes-whom network 7118 103689 BA5000 Barabasi´ −Albert random graphs Random graph with 5K vertices 5000 50032 BA6000 Barabasi´ −Albert random graphs Random graph with 6K vertices 6000 60129 BA7000 Barabasi´ −Albert random graphs Random graph with 7K vertices 7000 70204 BA8000 Barabasi´ −Albert random graphs Random graph with 8K vertices 8000 80185 BA9000 Barabasi´ −Albert random graphs Random graph with 9K vertices 9000 90418 BA10000 Barabasi´ −Albert random graphs Random graph with 10K vertices 10000 99194

100 DFS-NOIP 100 DFS-NOIP MULE MULE

10 10 Runtime (seconds) Runtime (seconds)

1 wiki-vote BA5000 ca-GrQc PPI 1 wiki-vote BA5000 ca-GrQc PPI Input Graph Input Graph (a) α = 0.9 (b) α = 0.8

DFS-NOIP DFS-NOIP 10000 10000 MULE MULE

1000 1000

100 100

10 10 Runtime (seconds) Runtime (seconds)

1 wiki-vote BA5000 ca-GrQc PPI 1 wiki-vote BA5000 ca-GrQc PPI Input Graph Input Graph (c) α = 0.0001 (d) α = 0.0005 Fig. 1: Comparison of Simple and Optimized Depth First Search approaches. The y–axis is in log–scale.

40 120 11600 BA5000 11500 amazon0302 35 BA6000 100 PPI BA7000 ca-GrQc 11400 30 p2p-Gnutella04 BA8000 80 11300 25 BA9000 p2p-Gnutella08 BA10000 p2p-Gnutella09 11200 60 20 wiki-vote 11100 15 40 11000 10 10900 20 5 10800

Algorithm runtime (seconds) Algorithm runtime (seconds) 0 Algorithm runtime (seconds) 10700 0.0001 0.001 0.01 0.1 1 0.0001 0.001 0.01 0.1 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Probability Threshold (α) Probability Threshold (α) Probability Threshold (α) (a) Random Graphs (b) Semi–synthetic and Real Graphs (c) amazon0302 Fig. 2: Runtime vs Probability threshold (α). The x–axis is in log–scale

Input Data: Details of the input graphs that we used are form the STRING 3 database, and the DBLP 4 dataset from shown in Table 1. authors of [36], which is an uncertain network predicting The first set of graphs consists of real world uncertain future co-authorship. The PPI network is an uncertain graph graphs, also used in [32] and [36]. These include a protein- where each vertex represents a protein, and two vertices are protein interaction (PPI) network of a Fruit Fly obtained connected by an edge with a probability representing the by integrating data from the BioGRID 2 database with that 3. http://string-db.org/ 2. http://thebiogrid.org/ 4. http://dblp.uni-trier.de/

10 90000 550000 1.8e+06 amazon0302 80000 PPI 500000 1.6e+06 ca-GrQc 450000 70000 1.4e+06 p2p-Gnutella04 400000 60000 1.2e+06 p2p-Gnutella08 350000 50000 1e+06 p2p-Gnutella09 wiki-vote 300000 40000 BA5000 800000 BA6000 250000 600000 30000 BA7000 200000 BA8000 400000 20000 BA9000 150000 200000 10000 BA10000 100000

Number of Alpha Maximal Cliques Number of Alpha Maximal Cliques 0 Number of Alpha Maximal Cliques 0.0001 0.001 0.01 0.1 1 0.0001 0.001 0.01 0.1 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Probability Threshold (α) Probability Threshold (α) Probability Threshold (α)

(a) Random Graphs (b) Semi–synthetic and Real Graphs (c) amazon0302 Fig. 3: No of α-maximal cliques vs Probability threshold (α). The x–axis is in log–scale likelihood of interaction between the the two proteins. The Figure 1 shows the runtime of MULE and of DFS–NOIP DBLP network represents co-authorship in academic arti- on different input graphs. The results show that MULE is cles. Each vertex in this network represents an author. Two much faster than DFS–NOIP. For instance, for the wiki– vertices are connected by an edge with a probability that vote graph with α = 0.9 DFS–NOIP took 64 seconds depends on the “strength” of their co-authorship, computed while MULE took only 8 seconds. The relative performance as 1−e−c/10, where c is the number of papers co–authored. results hold true over a wide range of input graphs and The second set of graphs was obtained from the Stan- values of α, including synthetic and real-world graphs, and ford Large Network Collection [48], and includes graphs small and large values of α. For α = 0.0001, MULE representing Internet p2p networks, collaboration networks, took only 25 secs on ca-GrQc, while DFS–NOIP took over and an online social network. The amazon0302 network 4400 secs. On the wiki–vote input graph with probability is a product co-purchasing network where each node is threshold 0.9, MULE took 8 seconds while DFS–NOIP took a product and two nodes (products) are connected by an 64 seconds. For the same graph, with probability threshold edge if they are frequently co-purchased. The p2p-Gnutella 0.0001, MULE took 114 seconds, while DFS–NOIP took graphs represent peer to peer file sharing networks, where more than 11 hours. each vertex represents a computer and an edge represents Dependence on α. We measured the runtime of enumer- an application level communication between computers. ation as well as the output size (the number of α-maximal The p2p-Gnutella04, p2p-Gnutella08 and p2p-Gnutella09 cliques that were output) as a function of α. The dependence graphs represent communications occurring on 4th, 8th and of the runtime on α is shown in Figure 2, and the number of 9th of August, 2002 respectively. The ca-GrQc graph repre- cliques as a function of α is shown in Figure 3. We note that sents a collaboration network among scientists working on as α increases, the number of maximal cliques, and the time General Relativity and Quantum Cosmology. Each vertex of enumeration both drop sharply. The decrease in runtime in the graph is a scientist and two vertices are connected is because with a larger value of α, the size of the output by an edge if the corresponding scientists have co-authored decreases and the algorithm is able to prune search paths a paper. Finally the wiki-vote graph represents voting that aggressively early in the enumeration. occurs while selecting a new wikipedia administrator. Each We note here that the number of α-maximal cliques does vertex is either a wikipedia admin or wikipedia user and an not necessarily decrease as α increases. Sometimes it is pos- edge represents a vote that an admin/user casts in favor of sible that the number of α-maximal cliques increases with a candidate. For each graph in the second set, an uncertain α. For instance, a large maximal clique whose probability is graph was created by assigning edge probabilities uniformly above the threshold for a small value of α may not pass the at random. Hence these can be considered as semi–synthetic threshold for a higher value of α, and may split into many uncertain graphs. smaller maximal cliques, thus leading to an increase in the The third set of input graphs was synthetically gener- number of maximal cliques. However, such instances seem ated using the Barabasi´ −Albert (BA) model for random to be rare and are not visible in the plots. graphs [49]. Then the edges were assigned probabilities Dependence on Output Size Figure 4 shows the run- uniformly at random from [0, 1]. time as a function of the number of α-maximal cliques Comparison with other approaches. We compare enumerated, for randomly generated graphs. It can be seen our algorithm with another algorithm based on depth-first- that the runtime of the algorithm increases with the number search, which we call DFS-NOIP (DFS with NO Incre- of maximal cliques in the output, and also that the runtime mental Probability Computation). Similar to MULE, this scales well with the output size. This comparison was not algorithm performs a depth first search to enumerate all α– done for real world or semi–synthetic graphs as these graphs maximal cliques but unlike MULE, it does not compute the have different structural properties, hence different sizes of probabilities incrementally. The DFS-NOIP algorithm is de- maximal cliques and thus there is no meaningful way to scribed in supplementary materials, due to space constraints. interpret the results.

11 20 0.05 18 0.01 when we directly model uncertainty and use our Algorithm, 0.005 16 0.001 we can find structures much faster than finding all maximal 0.0005 14 0.0001 cliques followed by pruning on basis of probability to find 12 maximal uncertain cliques. 10 20 BA 90000 BA SF SF 8 18 85000 80000 16 Algorithm runtime (seconds) 6 75000 14 50000 60000 70000 80000 90000 70000 Output Size ( Number of maximal cliques ) 12 65000 10 60000 (a) Random Graphs 8 55000 6 50000 Algorithm runtime (seconds)

5000 6000 7000 8000 9000 10000 Number of Alpha Maximal Cliques 5000 6000 7000 8000 9000 10000 Fig. 4: Runtime vs Output size Number of vertices Number of vertices

Enumerating Large Maximal Cliques. Figures 5 and 6 (a) Runtime (α = 0.0001) (b) Output Size (α = 0.0001) Fig. 8: BA vs Two–Level graphs show the runtime of LARGE–MULE (Algorithm 5) and the output size respectively as a function of t, the mini- Scale-Free Graph Models. In order to see the per- mum threshold for the size of the α-maximal clique. As formance on different types of scale-free graph models, t increases, both the runtime and output size decrease in addition to the random graphs generated using the substantially. For instance, MULE takes 76797 seconds to Barabasi´ −Albert (BA) model, which is a type of a scale free enumerate all uncertain maximal cliques from the DBLP (SF) model, we constructed random graphs using the two– dataset, for probability threshold 0.9. However, LARGE– level scale free model [50], to compare against the graphs MULE takes only 32 seconds when t = 3. Similarly, generated using the BA model. For each BA input graph that for input graph ca-GrQc and α = 0.0001, MULE takes we used, we generated a SF graph with the same number of 125 seconds, while LARGE–MULE takes 10 seconds when vertices as the BA graph, and compared the runtime of our t = 6 and 6 seconds when t = 7. algorithm and the output size. Due to lack of space, we Dependence on Number of Uncertain Edges. Figure 7 present only a subset of results in Figure 8. We observe shows the change in runtime as a function of the graph from the Figure that there is little difference in processing uncertainty, for the ca–GrQc and DBLP10 networks. We runtime or output size, between the BA graphs and the two- consider the same underlying network with difference in level SF graphs. Both BA as well as two-level SF graphs, the number of uncertain edges. We use the following three took approximately the same amount of time to be processed cases: 1) when all edges are uncertain, 2) when two–thirds and had very similar number of maximal uncertain cliques. of all the edges are uncertain, and 3) when one–third of all We observed the same behavior for different values of α the edges are uncertain. that we tried. We can see that as the number of uncertain edges in 6 CONCLUSION the graph increases, the runtime decreases. Considering that We present a systematic study of the enumeration of maxi- there can be many more uncertain maximal cliques than mal cliques from an uncertain graph, starting from a precise just maximal cliques, Figure 7 shows that our algorithm definition of the notion of an α-maximal clique, followed by employs effective pruning techniques that help us to quickly a proof showing that the maximum number of α-maximal n  identify all maximal uncertain cliques. For example, for cliques in a graph on n vertices is exactly bn/2c , for the graph ca-GrQc and α = 0.5, MULE took 6 and 7 0 < α < 1. We present a novel algorithm, MULE, for seconds for all uncertain and two–third uncertain edges enumerating the set of all α-maximal cliques from a graph, respectively. However, for one–third uncertain edges, it took and an analysis showing that the worst-case runtime of this 124 seconds. Again for the graph DBLP10 and α = 0.9, algorithm is O (n · 2n). We present an experimental evalu- MULE (with size threshold 4), could find all maximal ation of MULE showing its performance, and an extension cliques in 6 seconds with all uncertain edges. But with for faster enumeration of large maximal cliques. only two–third uncertain edges, the same graph took over An interesting open problem is to design an algorithm 42 minutes with one–third uncertain edges, we could not for enumerating maximal cliques from an uncertain√ graph process the graph within 4 hours. Thus, the graph processing whose time complexity is worst-case optimal, O ( n · 2n). time required by our method reduces as the uncertainty Finally, there are various dense substructures that can be of the graph increases. Note that there are two ways in found in a network. Some examples include bicliques, which uncertain cliques can be handled – first by using quasi–cliques and k-cores. Finding these dense substruc- deterministic MCE and then finding embedded uncertain tures in the context of uncertain graphs can be an important cliques, and second, by directly incorporating uncertainty future direction of work. in pruning, as we do. From Figure 7 we can see that for Acknowledgments: This work was partially funded by multiple values of α, as the number of uncertain edges an IBM Faculty Award and by grant 0831903 from the increase, runtime of MULE decreases. This implies that National Science Foundation.

12 25 20 0.2 0.2 0.9 0.1 0.1 20000 0.8 0.05 20 0.05 0.7 0.01 0.01 0.6 15 0.005 0.005 0.5 0.001 0.001 15000 0.4 0.0005 15 0.0005 0.3 0.0001 0.0001 0.2 10 10000 0.1 10

5000 5 5 Algorithm runtime (seconds) Algorithm runtime (seconds) Algorithm runtime (seconds) 0 2 3 4 5 6 7 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 Size Threshold Size Threshold Size Threshold

(a) BA10000 (b) ca-GrQc (c) DBLP Fig. 5: Runtime vs Size threshold of enumerated uncertain maximal cliques

0.9 100000 100000 0.8 10000 0.7 10000 0.6 10000 0.5 1000 0.2 0.2 0.4 0.1 1000 0.1 0.05 0.05 1000 0.3 100 0.01 0.01 0.2 0.005 100 0.005 0.1 0.001 0.001 100 0.0005 0.0005 10 10 Number of maximal cliques 0.0001 Number of maximal cliques 0.0001 Number of maximal cliques 10 2 3 4 5 6 2 3 4 5 6 7 8 2 3 4 5 6 7 8 Size Threshold Size Threshold Size Threshold

(a) BA10000 (b) ca-GrQc (c) DBLP Fig. 6: Number of α-maximal cliques vs Size threshold of enumerated uncertain maximal cliques 10000 100 0.5 0.9

0.1 1000 0.8 0.05 0.7

100 10

10 Number of uncertain edges Algorithm runtime (seconds) 1 1 8 12 16 20 24 28 32 160 170 180 190 200 210 220 Number of uncertain edges(in thousands) Algorithm runtime (thousands of seconds) (a) ca–GrQc (b) DBLP10 (s=4) Fig. 7: Runtime vs the number of uncertain edges in the graph. The y–axis is in log scale. Observe that the runtime decreases when the number of uncertain edges in the graph increases.

REFERENCES inference in web-based social networks,” ACM Transactions on Internet Technology, vol. 10, no. 2, pp. 8:1–8:23, Jun. 2010. [1] J. Ghosh, H. Ngo, S. Yoon, and C. Qiao, “On a routing problem within prob- [9] P. Boldi, F. Bonchi, A. Gionis, and T. Tassa, “Injecting uncertainty in graphs abilistic graphs and its application to intermittently connected networks,” for identity obfuscation,” Proceedings of the VLDB Endowment, vol. 5, in INFOCOM 2007. 26th IEEE International Conference on Computer no. 11, pp. 1376–1387, 2012. Communications. IEEE, 2007, pp. 1721–1729. [10] S. Asthana, O. D. King, F. D. Gibbons, and F. P. Roth, “Predicting pro- [2] S. Biswas and R. Morris, “Exor: opportunistic multi-hop routing for tein complex membership using probabilistic network reliability,” Genome wireless networks,” ACM SIGCOMM Computer Communication Review, Research, vol. 14, pp. 1170–1175, 2004. vol. 35, no. 4, pp. 133–144, Aug. 2005. [11] J. Bader, A. Chaudhuri, J. Rothberg, and J. Chant, “Gaining confidence [3] H. Kawahigashi, Y. Terashima, N. Miyauchi, and T. Nakakawaji, “Modeling in high-throughput protein interaction networks.” Nature Biotechnology, ad hoc sensor networks using random ,” in Second IEEE vol. 22, no. 1, pp. 78–85, 2004. Consumer Communications and Networking Conference, 2005, pp. 104– [12] D. R. Rhodes, S. A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, 109. S. Kalyana-Sundaram, D. Ghosh, A. Pandey, and A. M. Chinnaiyan, [4] E. Adar and C. Re, “Managing uncertainty in social networks,” IEEE Data “Probabilistic model of the human protein-protein interaction network,” Engineering Bulletin, vol. 30, no. 2, pp. 15–22, 2007. Nature Biotechnology, vol. 23, no. 8, pp. 951–959, 2005. [5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, “Propagation of trust [13] R. Jiang, Z. Tu, T. Chen, and F. Sun, “Network motif identification in and distrust,” in Proceedings of the 13th International conference on World stochastic networks,” Proceedings of the National Academy of Sciences, Wide Web, ser. WWW ’04. New York, NY, USA: ACM, 2004, pp. 403– vol. 103, no. 25, pp. 9404–9409, 2006. 412. [14] G. Palla, I. Derenyi,´ I. Farkas, and T. Vicsek, “Uncovering the overlapping [6] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence community structure of complex networks in nature and society.” Nature, through a social network,” in Proceedings of the 9th ACM SIGKDD vol. 435, no. 7043, pp. 814 – 818, 2005. International conference on Knowledge discovery and data mining, ser. [15] O. Rokhlenko, Y. Wexler, and Z. Yakhini, “Similarities and differences of KDD ’03. New York, NY, USA: ACM, 2003, pp. 137–146. gene expression in yeast stress conditions,” Bioinformatics, vol. 23, no. 2, [7] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for pp. 184–190, 2007. social networks,” in Proceedings of the 12th International conference on [16] E. Harley and A. Bonner, “Uniform integration of genome mapping data Information and knowledge management, ser. CIKM ’03. New York, NY, using intersection graphs,” Bioinformatics, vol. 17, no. 6, pp. 487–494, USA: ACM, 2003, pp. 556–559. 2001. [8] U. Kuter and J. Golbeck, “Using probabilistic confidence models for trust [17] J. Mcauley and J. Leskovec, “Discovering social circles in ego networks,”

13 ACM Transactions on Knowledge Discovery from Data, vol. 8, no. 1, pp. [41] D. Eppstein and D. Strash, “Listing all maximal cliques in large sparse real- 4:1–4:28, Feb. 2014. world graphs,” in Experimental Algorithms, ser. Lecture Notes in Computer [18] H. Bernard, P. D. Killworth, and L. Sailer, “Informant accuracy in social Science, P. Pardalos and S. Rebennack, Eds. Springer Berlin / Heidelberg, network data iv: a comparison of clique-level structure in behavioral and 2011, vol. 6630, pp. 364–375. cognitive network data,” Social Networks, vol. 2, no. 3, pp. 191 – 218, [42] N. Modani and K. Dey, “Large maximal cliques enumeration in sparse 19791980. graphs,” in Proceedings of the 17th ACM conference on Information and [19] J. Pattillo, N. Youssef, and S. Butenko, “Clique relaxation models in social knowledge management, ser. CIKM ’08. New York, NY, USA: ACM, network analysis,” in Handbook of Optimization in Complex Networks, ser. 2008, pp. 1377–1378. Springer Optimization and Its Applications. Springer New York, 2012, [43] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa, “A new algorithm for pp. 143–162. generating all the maximal independent sets,” SIAM Journal on Computing, [20] G. A. et al., “Functional organization of the yeast proteome by systematic vol. 6, no. 3, pp. 505–517, 1977. analysis of protein complexes,” Nature, vol. 415, no. 6868, pp. 141 – 147, [44] N. Chiba and T. Nishizeki, “Arboricity and subgraph listing algorithms,” 2002. SIAM Journal on Computing, vol. 14, pp. 210–223, February 1985. [21] N. Pathak, S. Mane, and J. Srivastava, “Who thinks who knows who? socio- [45] K. Makino and T. Uno, “New algorithms for enumerating all maximal cognitive analysis of email networks,” in Sixth International Conference on cliques,” in Algorithm Theory - SWAT 2004, ser. Lecture Notes in Computer Data Mining, 2006, pp. 466–477. Science, T. Hagerup and J. Katajainen, Eds. Springer Berlin / Heidelberg, [22] E. Harley, A. Bonner, and N. Goodman, “Uniform integration of genome 2004, vol. 3111, pp. 260–272. mapping data using intersection graphs,” Bioinformatics, vol. 17, no. 6, pp. [46] M. Svendsen, A. P. Mukherjee, and S. Tirthapura, “Mining maximal cliques 487–494, 2001. from a large graph using mapreduce: Tackling highly uneven subproblem [23] H. M. Grindley, P. J. Artymiuk, D. W. Rice, and P. Willett, “Identification sizes,” J. Parallel Distrib. Comput., vol. 79, pp. 104–114, 2015. of tertiary structure resemblance in proteins using a maximal common [47] A. P. Mukherjee and S. Tirthapura, “Enumerating maximal bicliques from subgraph isomorphism algorithm,” Journal of Molecular Biology, vol. 229, a large graph using mapreduce,” in 2014 IEEE International Congress on no. 3, pp. 707–721, 1993. Big Data, Anchorage, AK, USA, June 27 - July 2, 2014, 2014, pp. 707–716. [24] B. Zhang, B.-H. Park, T. Karpinets, and N. F. Samatova, “From pull- [48] J. Leskovec. Stanford large network dataset collection. down data to protein interaction networks and complexes with biological [49] R. Albert and A.-L. Barabasi,´ “Statistical mechanics of complex networks,” relevance,” Bioinformatics, vol. 24, no. 7, pp. 979–986, 2008. Reviews of Modern Physics, vol. 74, pp. 47–97, Jan 2002. [25] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in [50] C. Dangalchev, “Generation models for scale-free networks,” Physica A: social networks,” in Proceedings of the 15th ACM SIGKDD International Statistical Mechanics and its Applications, vol. 338, no. 34, pp. 659 – 671, Conference on Knowledge Discovery and Data Mining, ser. KDD ’09. New 2004. York, NY, USA: ACM, 2009, pp. 199–208. [26] A. P. Mukherjee, P. Xu, and S. Tirthapura, “Mining maximal cliques from an uncertain graph,” in Proceedings of the Thirty–First IEEE International Conference Data Engineering, ser. ICDE ’15. IEEE, 2015. [27] J. Moon and L. Moser, “On cliques in graphs,” Israel Journal of Mathemat- Arko Provo Mukherjee received his ics, vol. 3, pp. 23–28, 1965. Ph.D. in Computer Engineering from Iowa [28] Y. Yuan, L. Chen, and G. Wang, “Efficiently answering probability threshold-based shortest path queries over uncertain graphs,” in Database State University in 2015. He received Systems for Advanced Applications his Baccalaureate Degree from National , ser. Lecture Notes in Computer Sci- PLACE ence. Springer Berlin Heidelberg, 2010, vol. 5981, pp. 155–170. Institute of Technology, Durgapur, India, [29] M. Potamias, F. Bonchi, A. Gionis, and G. Kollios, “k-nearest neighbors in PHOTO and worked for 3 years as an application uncertain graphs,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, HERE developer at IBM. His research interests pp. 997–1008, September 2010. are in algorithms for large graph mining [30] G. Kollios, M. Potamias, and E. Terzi, “Clustering large probabilistic and tools for large-scale data analytics. graphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 2, pp. 325–336, 2013. [31] P. Hintsanen and H. Toivonen, “Finding reliable subgraphs from large probabilistic graphs,” Data Mining and Knowledge Discovery, vol. 17, no. 1, pp. 3–23, August 2008. [32] Z. Zou, J. Li, H. Gao, and S. Zhang, “Mining frequent subgraph patterns from uncertain graph data,” IEEE TKDE, vol. 22, no. 9, pp. 1203–1218, Pan Xu is a Ph.D. student in Computer 2010. Science at University of Maryland, Col- [33] R. Jin, L. Liu, and C. C. Aggarwal, “Discovering highly reliable subgraphs lege Park. He received his PhD in Indus- in uncertain graphs,” in Proceedings of the 17th ACM SIGKDD Interna- trial Engineering from Iowa State Univer- tional conference on Knowledge discovery and data mining, ser. KDD ’11. PLACE sity. New York, NY, USA: ACM, 2011, pp. 992–1000. PHOTO [34] Z. Zou, H. Gao, and J. Li, “Discovering frequent subgraphs over uncertain HERE graph under probabilistic semantics,” in Proceedings of the 16th ACM SIGKDD International conference on Knowledge discovery and data mining, ser. KDD ’10. New York, NY, USA: ACM, 2010, pp. 633–642. [35] L. Liu, R. Jin, C. Aggarwal, and Y. Shen, “Reliable clustering on uncertain graphs,” in IEEE 12th International Conference on Data Mining (ICDM), 2012, pp. 459–468. [36] A. Khan, F. Bonchi, A. Gionis, and F. Gullo, “Fast reliability search in uncertain graphs,” in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT ’14. New York, NY, USA: ACM, 2014, pp. 535–546. Srikanta Tirthapura received his Ph.D. [37] R. Jin, L. Liu, B. Ding, and H. Wang, “Distance-constraint reachability in Computer Science from Brown Univer- computation in uncertain graphs,” Proceedings of the VLDB Endowment, sity in 2002, and his B.Tech. in Computer vol. 4, no. 9, pp. 551–562, June 2011. Science and Engineering from IIT Madras [38] Z. Zou, J. Li, H. Gao, and S. Zhang, “Finding top-k maximal cliques PLACE in 1996. He is an Associate Professor in an uncertain graph,” in IEEE 26th International Conference on Data PHOTO in the department of Electrical and Com- Engineering (ICDE), 2010, pp. 649–652. HERE [39] C. Bron and J. Kerbosch, “Algorithm 457: finding all cliques of an puter Engineering at Iowa State Univer- undirected graph,” Communications of ACM, vol. 16, no. 9, pp. 575–577, sity. His research is concerned with al- September 1973. gorithms for big data, including streaming [40] E. Tomita, A. Tanaka, and H. Takahashi, “The worst-case time complexity algorithms and parallel and distributed al- for generating all maximal cliques and computational experiments,” Theo- gorithms, and applications to cybersecu- retical Computer Science, vol. 363, pp. 28–42, October 2006. rity and transportation.

14