As Strong As the Weakest Link: Mining Diverse Cliques in Weighted Graphs Appendix

Petko Bogdanov1, Ben Baumer2, Prithwish Basu3, Amotz Bar-Noy4, and Ambuj K. Singh1

1 University of California, Santa Barbara, CA 93106, USA
2 Smith College, Northampton, MA 01063, USA
3 Raytheon BBN Technologies, 10 Moulton St., Cambridge, MA 02138, USA
4 The City University of New York, New York, NY 10016-4309, USA

1 The choice of group score

We model a group as a collection of pairwise interactions. Ideally, we would consider higher-order interactions, such as subgroups of size 3, 4 or more, and incorporate them into our model. However, this approach has two limitations:

1. Although graphs may not be the ideal framework for modeling higher-order interactions, and hypergraphs or simplicial complexes [3, 15, 7] may be the preferred approach, algorithms for the latter settings are computationally demanding.
2. To validate the developed theories, one would need empirical datasets featuring a sufficient number of instances where a subgroup has interacted, and such data is hard to find for higher-order interactions.

To elaborate on the second point, data at the subgroup level either becomes sparser (in the case of sports) or is mostly unavailable (in the case of protein/gene interactions). For bigger subgroups in sports there are fewer games (observations) in which the same group participates, limiting one's ability to measure subgroup performance with high statistical confidence. In the case of gene networks, major technologies allow for testing only pairwise interactions, while the overall goal is to understand the system at a complex/pathway level.

We address both of the above challenges by requiring that all pairwise interactions/performances are strong (penalizing groups in which the minimum pairwise interaction is low).
In a sense, this is a weaker condition that seeks to approximate the group performance. Adopting this approximate condition based on the weakest link allows for scalable solutions and general methods that are applicable to multiple domains regardless of the scarcity of data at the subgroup level. In the gene interaction scenario, we show empirically that the mined cliques involve genes of similar function and hence are likely to capture complexes/pathways. Our weakest link scoring function also correlates well with sports subgroup performance, making it an acceptable approximation.

Arguably, higher-order relation modeling (based on hypergraphs and simplicial complexes) or optimizing different connectivity criteria (e.g., a good spanning tree) is more critical in "heterogeneous" teams where there exists a leadership/hierarchical structure within the group, e.g., in military platoons or terrorist cells. In such scenarios, the strength of every pairwise connection may not be critical, since the organizational hierarchy is specifically designed to streamline communication among its members. For example, a weak link between a general and a private, or between a terrorist cell leader and a terrorist, does not threaten the effectiveness of the unit as a whole. Conversely, the addition of a dynamic leader to an otherwise poorly performing group might make the group highly effective even if the leader does not have a strong pairwise relationship with each member of the group.

For non-hierarchical ("flat") group structures, the topic we focus on in this paper, it may not be necessary to score each individual higher-order subgroup separately, as the score function is unlikely to be discontinuous with the addition of a node, as it might be in the terrorist cell example. Hence, we focus on modeling the strength of non-hierarchical "flat" subgroups based on our weakest link scoring scheme.
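
As a rough illustration of the weakest link idea described above, the following Python sketch scores a group by its minimum pairwise weight. The function name `weakest_link_score`, the dictionary-based weight layout, and the toy weights are assumptions made for this example only; treating an unrecorded pair as weight 0 is likewise an assumption, not something the text specifies.

```python
from itertools import combinations


def weakest_link_score(group, weights):
    """Return the minimum pairwise weight among the members of `group`.

    `weights` maps frozenset({u, v}) -> weight in [0, 1] (assumed layout).
    A pair with no recorded weight is treated as 0 (assumption of this sketch).
    """
    members = list(group)
    if len(members) < 2:
        raise ValueError("a group needs at least two members")
    return min(
        weights.get(frozenset(pair), 0.0)
        for pair in combinations(members, 2)
    )


# Toy weights, invented purely for illustration.
weights = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"a", "c"}): 0.8,
    frozenset({"b", "c"}): 0.7,
    frozenset({"a", "d"}): 0.95,
    frozenset({"b", "d"}): 0.2,
    frozenset({"c", "d"}): 0.9,
}

strong = weakest_link_score({"a", "b", "c"}, weights)   # weakest pair is (b, c)
fragile = weakest_link_score({"a", "b", "d"}, weights)  # weakest pair is (b, d)
```

In this toy example the group {a, b, d} contains two very strong pairs, yet its score is dominated by the single weak pair (b, d), which is the penalizing behavior the weakest link criterion is meant to have.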
2 Proofs

2.1 NP-completeness (Theorem 4.1)

For any scoring function $s()$ that maps a graph substructure to a non-negative real number, the decision problem corresponding to mDkC, namely "Is there a set $A$ of $m$ substructures, each of size $k$, such that $ds(A) \ge B$ for some positive number $B$?", is NP-complete.

Proof. We show NP-completeness for the special case $\alpha = 0$ by a reduction from the Set Cover problem. An instance of Set Cover contains a set $S$ of subsets over a finite element universe $U$ and an integer $m$. The problem asks whether there is a subset $S' \subseteq S$, $|S'| \le m$, that covers the whole universe. From an instance of Set Cover we construct a graph $G = (U, E, w)$ such that the node set corresponds to the elements of the universe $U$, and the edge set is constructed by adding an edge between every two elements $(u_i, u_j)$ that are both included in one of the sets in $S$. Weights are all set to 1 and the decision threshold $B$ is set to $|U|/k$. If there is a Set Cover solution $A$ of size $m$ or less, then the corresponding set of cliques achieves the maximal $ds(A) = |U|/k$ (if the Set Cover solution has fewer than $m$ sets, one may need to pad it with extra cliques, which will not change the score). The implication in the opposite direction holds as well: if there is a set of $k$-cliques that achieves a score of $|U|/k$ (i.e., the cliques include all nodes in the graph), then the corresponding sets in the Set Cover instance cover the universe. QED

2.2 Monotonicity and submodularity (Theorem 4.2)

If $k$ and $\alpha$ are fixed, the diversity score function $ds(A)$ is:

- Monotonic, i.e., for any subset $A \subseteq B$, $ds(A) \le ds(B)$;
- Submodular, i.e., for any sets $A, B$, $ds(A) + ds(B) \ge ds(A \cup B) + ds(A \cap B)$.

Proof. To see monotonicity, suppose that $A \subseteq B$, such that $A \cup D = B$, where $D$ is some possibly empty set.
Then,

\[
\begin{aligned}
ds(B) = ds(A \cup D) &= \frac{\alpha}{k} \sum_{C \in A \cup D} s(C) + \frac{1-\alpha}{k} \Bigl| \bigcup_{C \in A \cup D} C \Bigr| \\
&\ge \frac{\alpha}{k} \sum_{C \in A} s(C) + \frac{1-\alpha}{k} \Bigl| \bigcup_{C \in A} C \Bigr| = ds(A),
\end{aligned}
\]

where the inequality is justified by the fact that $s(C) \ge 0$ for any set $C$, and that the union over $A \cup D$ contains the union over $A$.

Submodularity follows from a similar argument. For any two sets $A$ and $B$, we have that:

\[
\begin{aligned}
ds(A) + ds(B) &= \frac{\alpha}{k} \sum_{C \in A} s(C) + \frac{1-\alpha}{k} \Bigl| \bigcup_{C \in A} C \Bigr| + \frac{\alpha}{k} \sum_{C \in B} s(C) + \frac{1-\alpha}{k} \Bigl| \bigcup_{C \in B} C \Bigr| \\
&\ge \frac{\alpha}{k} \Bigl[ \sum_{C \in A \cup B} s(C) + \sum_{C \in A \cap B} s(C) \Bigr] + \frac{1-\alpha}{k} \Bigl[ \Bigl| \bigcup_{C \in A \cup B} C \Bigr| + \Bigl| \bigcup_{C \in A \cap B} C \Bigr| \Bigr] \\
&= ds(A \cup B) + ds(A \cap B).
\end{aligned}
\]

This follows from elementary set theory: if a clique $C$ is in either $A$ or $B$ but not the other, its nodes get counted once on the LHS of the inequality and once on the RHS (in the $A \cup B$ term). Conversely, if $C$ is in both $A$ and $B$, then its nodes get counted twice on the LHS (once in each term) and twice on the RHS (once in each term). QED

2.3 Maximum score improvement (Theorem 5.1)

Let $C$, $|C| \le k$, be a clique of size not exceeding $k$. The maximum improvement of the $ds$ score when adding any size-$k$ super-clique of $C$ to a clique set $A$ is bounded by:

\[
\delta(A, C) = ds(A \cup C) - ds(A) = \alpha \min_{u,v \in C} w(u, v) + (1 - \alpha) \frac{k - \bigl| \bigl( \bigcup_{B \in A} B \bigr) \cap C \bigr|}{k},
\]

where, in the diversity part, the set $\bigl( \bigcup_{B \in A} B \bigr) \cap C$ is the intersection of the nodes included in $A$ and the nodes in $C$.

Proof. The highest-scoring super-clique of $C$ of size $k$ (if one exists) can improve the score part by no more than $\alpha \min_{u,v \in C} w(u, v)$, since this bound optimistically assumes that no edge of lower weight than those included in $C$ will be added in the super-clique. Similarly, the diversity part improvement is maximal, as it assumes that none of the unobserved nodes in the $k$-completion of $C$ are contained in $A$. QED

3 Data sources and preparation

Team-based data. These data sets contain whole-team performance without an explicit measure of the strength of pairwise interactions. We compute pairwise interactions using a generic scoring model capturing performance significance. We consider the number of games won (highly-viewed movies) for each pair of teammates who appeared in at least one game (movie).
The weight (between 0 and 1) on each edge represents the probability of observing the number of wins in the games in which the pair participated. We assume that the true success probability of a pair is fixed over a fixed time period. The expected pair winning percentage is estimated from a k-NN lookup within historical background data for similar pairs (based on personal success rate), and the p-value of the current observation is used to define the edge score (the score is simply 1 − p-value, as lower p-values correspond to more significant observations).

NBA: This data set contains play-by-play data for 7,139 games from the past six seasons (2005–2012) of the National Basketball Association (NBA)^5. We use the two most recent seasons (2010–2012) as a testing set, while the background set consists of games in the four previous seasons. The generic weight function for each edge is based on the p-value of observing the number of wins for the pair of teammates (over all games in the testing set that included both players), using the performance of similarly successful players in the background set as a null model. To alleviate issues with free-riders, this data was pre-filtered so that a minimum of 10 minutes of playing time in each game was required.
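
The edge-weighting scheme described above can be sketched roughly as follows. The text does not specify the statistical test, the k-NN distance, or the data layout, so this sketch assumes a one-sided binomial test (upper tail), a squared Euclidean distance over the two players' individual success rates, and a simple list of background records. The function names `knn_expected_rate`, `binomial_upper_tail`, and `edge_weight`, and all numbers in the toy data, are hypothetical.

```python
from math import comb


def knn_expected_rate(pair_rates, background, k=3):
    """Estimate a pair's expected win rate from similar background pairs.

    `pair_rates` holds the two players' individual success rates.
    `background` is a list of ((rate_1, rate_2), pair_win_rate) records
    (assumed layout). Similarity is squared Euclidean distance between
    the sorted individual rates (an assumption of this sketch).
    """
    key = sorted(pair_rates)

    def distance(record):
        rates = sorted(record[0])
        return sum((a - b) ** 2 for a, b in zip(key, rates))

    nearest = sorted(background, key=distance)[:k]
    return sum(win_rate for _, win_rate in nearest) / len(nearest)


def binomial_upper_tail(wins, games, p_null):
    """P(X >= wins) for X ~ Binomial(games, p_null)."""
    return sum(
        comb(games, i) * p_null ** i * (1 - p_null) ** (games - i)
        for i in range(wins, games + 1)
    )


def edge_weight(wins, games, p_null):
    """Edge score defined as 1 - p-value (one-sided test assumed)."""
    return 1.0 - binomial_upper_tail(wins, games, p_null)


# Toy background records, invented purely for illustration.
background = [
    ((0.60, 0.50), 0.55),
    ((0.62, 0.48), 0.50),
    ((0.30, 0.20), 0.20),
    ((0.90, 0.80), 0.90),
]

# Null win rate for a pair whose players have success rates 0.6 and 0.5.
p_null = knn_expected_rate((0.6, 0.5), background, k=2)

# The pair won 8 of the 10 test-set games it played together.
weight = edge_weight(wins=8, games=10, p_null=p_null)
```

Under these assumptions, a pair that wins more often than similar background pairs would be expected to obtains a small p-value and therefore an edge weight close to 1, while a pair performing at or below the null rate receives a low weight.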
