Planted Clique Detection Below the Noise Floor Using Low-Rank Sparse Pca
Total Page:16
File Type:pdf, Size:1020Kb
PLANTED CLIQUE DETECTION BELOW THE NOISE FLOOR USING LOW-RANK SPARSE PCA Alexis B. Cook1 and Benjamin A. Miller2 1Department of Applied Mathematics, Brown University, Providence, RI 02906 alexis [email protected] 2MIT Lincoln Laboratory, Lexington, MA 02420 [email protected] ABSTRACT We study approaches to solving the planted clique problem that are based on an analysis of the spectral properties of the graph’s Detection of clusters and communities in graphs is useful in a wide modularity matrix. Nadakuditi proved in [8] the presence of a phase range of applications. In this paper we investigate the problem of transition between a regime in which the principal component of detecting a clique embedded in a random graph. Recent results have the modularity matrix can clearly identify the location of the clique demonstrated a sharp detectability threshold for a simple algorithm vertices and a regime in which it cannot. Through understanding the based on principal component analysis (PCA). Sparse PCA of the breakdown point of this simple spectral method for clique detection graph’s modularity matrix can successfully discover clique locations in high dimensional settings, we hope to motivate new algorithms to where PCA-based detection methods fail. In this paper, we demon- detect cliques smaller than previously thought possible. strate that applying sparse PCA to low-rank approximations of the Sparse principal component analysis (SPCA) has recently been modularity matrix is a viable solution to the planted clique problem shown to enable detection of dense subgraphs that cannot be detected that enables detection of small planted cliques in graphs where run- in the first principal component [7, 9]. The semidefinite program- ning the standard semidefinite program for sparse PCA is not possi- ming formulation known as DSPCA has also been proposed as an ble. approximation that nearly achieves the information-theoretic bound Index Terms— planted clique detection, graph analysis, com- in [4], and as a complexity-theoretic bound for planted clique detec- munity detection, semidefinite programming, sparse principal com- tion [10]. However, DSPCA has great computational burden. This is ponent analysis because several eigendecompositions are performed over the course of the procedure. For subgraphs that are relatively close to the de- tection threshold, however, the subgraph still manifests itself signif- 1. INTRODUCTION icantly in the largest eigenvectors. This suggests that these cliques would be detectable even if only a low-rank subspace of the orig- Real-world graphs are naturally inhomogeneous and exhibit nonuni- inal matrix were considered. In this paper, we investigate the use form edge densities within local substructures. In this setting, it is of DSPCA with a low-dimensional approximationp of the modularity often possible to break graphs into communities, or sets of vertices matrix. The running time of DSPCA is O(n4 log n/), so reduc- with a high number of within-group connections and fewer links be- ing the dimensionality of the problem can potentially improve the tween communities. The problem of community detection in net- running time by multiple orders of magnitude. works has become increasingly prevalent in recent years and has im- The remainder of this paper is organized as follows. Section 2 portant applications to fields such as computer science, biology, and outlines the problem model and defines notation. Section 3 briefly sociology [1–3]. While community detection often considers par- discusses recent relevant research in this area. In Section 4, we titioning a graph into multiple communities, a variant of the prob- demonstrate analytically—using an approximation—that a large lem considers detection of a small subgraph with higher connectiv- fraction of the magnitude of a vector indicating the planted clique ity than the remainder of the graph [4], a special case of which is the will manifest itself in the eigenvectors of the graph’s modularity planted clique problem [5–7]. matrix associated with the largest eigenvalues. Section 5 provides A clique is a set of vertices such that every two vertices are con- empirical results demonstrating improved performance using a small nected by an edge. In the planted clique problem, one is given a number of eigenvectors (rather than only one), and in Section 6 we graph containing a hidden clique, where each possible edge outside summarize and outline possible future work. the clique occurs independently with some probability p. Detecting the location of this maximum-density embedding in a random back- 2. DEFINITIONS AND NOTATION ground graph is a useful proxy for a variety of applications—such as computer network security or social network analysis—in which In the standard (k; p; n) planted clique problem, one is given an a subgraph with anomalous connectivity is to be detected. Erdos-R˝ enyi´ graph G(n; p) with an embedded clique of size k, and the objective is to determine the hidden locations of the clique ver- This work is sponsored by the Assistant Secretary of Defense for Re- search & Engineering under Air Force Contract FA8721-05-C-0002. Opin- tices. Formally, given the graph G = (V; E), where V is the set of ions, interpretations, conclusions and recommendations are those of the au- vertices and E is the set of edges (connections), with jV j = n, the thors and are not necessarily endorsed by the United States Government. desired outcome is the subset of vertices V ∗ ⊂ V , jV ∗j = k that 978-1-4673-6997-8/15/$31.00 ©2015 IEEE 3726 ICASSP 2015 belong to the clique. where k · k0 denotes the L0 quasi-norm. We will apply a convex In our procedure, instead of working directly with the edge set relaxation and use the lifting procedure for semidefinite program- E, we will analyze the adjacency matrix corresponding to our model. ming (SDP), following the method of [12], to yield a semidefinite The adjacency matrix A of an undirected, unweighted graph is a program: useful representation of the graph’s topology. Each row and column X^ = arg max tr(BX) − ρ1T jXj1 is associated with a vertex in V . After applying an arbitrary ordering X2Sn (3) of the vertices with integers from 1 to n, we will denote the ith vertex subject to tr(X) = 1; as v . Then A is 1 if v and v are connected by an edge and is 0 i ij i j where tr(·) denotes the matrix trace, and S is the set of positive otherwise. The model for the adjacency matrix of the planted clique n semidefinite matrices in n×n. Here, ρ > 0 is a tunable parameter problem is: R that controls the sparsity of the solution. After applying DSPCA to a 8 1 if v ; v 2 V ∗ graph observation, we take the principal eigenvector of X^ and assign <> i j ∗ ∗ the vertices associated with the k largest entries in absolute value as Aij = 1 with prob p if vi 2= V or vj 2= V belonging to the clique. We use the DSPCA toolbox provided by the :>0 with prob 1 − p if v 2= V ∗ or v 2= V ∗: i j authors of [12] for the results in this paper. Consistent with the methodology developed by Newman [1], we Fig. 1 compares the results returned by PCA and DSPCA for a will study the modularity matrix B of our observed graph, defined sample test case. We generated an Erdos-R˝ enyi´ graph G(500; 0:2) as: and embedded a clique of size 10. The PCA-based technique fails Bij = Aij − p; here, as the entries in the principal component corresponding to the clique vertices (marked by magenta dots) are not consistently high where we have subtracted the expected value of the adjacency matrix in absolute value. However, the vector returned by DSPCA, or the for the Erdos-R˝ enyi´ model that does not contain the planted clique. principal eigenvector of solution to the SDP relaxation described in (3), accurately identifies the clique vertices. Here, ρ = 0:6 was 3. RECENT METHODS AND RESULTS used. For this sparsity level, the DSPCA solution contains a very clear signal with high values at the clique vertices and is almost zero It has been shown previously that thresholding the principal eigen- at background vertices. vector of B can yield the locations of the clique vertices. The unit- normalized principal eigenvector u1 of the modularity matrix asso- 0.2 0.4 ciated with an Erdos-R˝ enyi´ graph without an embedded clique is p 0.3 asymptotically distributed as nu1 ∼ N(0;I) [11]. Using this fact, 0.1 Nadakuditi establishes an algorithm for detecting vertices V^ ∗ in the 0.2 0 planted clique [8]: 0.1 −0.1 ∗ n p −1 “ α ”o 0 Component Value ^ Component Value V = vi : j nui1j > FN(0;1) 1 − ; (1) 2 −0.2 −0.1 0 100 200 300 400 500 0 100 200 300 400 500 Vector Component Index Vector Component Index where ui1 denotes the ith entry in u1 and the false-alarm probability of identifying a non-clique vertex as part of the clique is α. (a) (b) PCA-based clique detection has been shown to have a well- defined breakdown point in high dimensions. Nadakuditi elucidates Fig. 1. Example PCA result (a) and DSPCA result (b). this breakdown point in the following theorem. Theorem 3.1 (Nadakuditi [8]) Consider a (k; p; n) planted clique DSPCA is an iterative algorithm that needs to compute a full problem where the clique vertices are identified using (1) forp a sig- eigendecomposition of a perturbation of the modularity matrix at nificance level α. Then, for fixed p, as k; n ! 1 such that k= n ! each step. Thus, for large graphs, the algorithm can run quite slowly.