Graphlet Decomposition of a Weighted Network
Total Page:16
File Type:pdf, Size:1020Kb
Graphlet decomposition of a weighted network Hossein Azari Edoardo M. Airoldi Harvard University Harvard University Abstract Λ is related to models for binary networks based on the singular value decomposition and other factoriza- We introduce the graphlet decomposition of tions (Hoff, 2009; Kim and Leskovec, 2010). We aban- a weighted network, which encodes a notion don the popular orthogonality constraint (Jiang et al., of social information based on social struc- 2011) among the basis matrices Pis to attain inter- ture. We develop a scalable algorithm, which pretability in terms of multi-scale social structure, and combines EM with Bron-Kerbosch in a novel we chose not model zero edge weights to enable the es- fashion, for estimating the parameters of the timation to scale linearly with the number of positive model underlying graphlets using one net- weights (Leskovec et al., 2010). These choices lead work sample. We explore theoretical prop- to a non-trivial inferential setting (Airoldi and Haas, erties of graphlets, including computational 2011). Related methods utilize a network to encode in- complexity, redundancy and expected accu- ferred dependence among multivariate data (Coifman racy. We test graphlets on synthetic data, and Maggioni, 2006; Lee et al., 2008). We term our and we analyze messaging on Facebook and method “graphlet decomposition”, as it is reminiscent crime associations in the 19th century. of a wavelet decomposition of a graph. An unrelated literature uses the term graphlets to denote network motifs (Przulj et al., 2006; Kondor et al., 2009). 1 Introduction The key features of the graphlet decomposition we in- troduce in this paper are: the basis matrices Pis are In recent years there has been a surge of interest in col- non-orthogonal, but capture overlapping communities laborative projects similar to Wikipedia, social media at multiple scales; the information used for inference platforms such as Facebook and Twitter, and services comes exclusively from positive weights; parameter es- that rely on the social structure underlying these ser- timation is linear in the number of positive weights. vices, including cnn.com “popular on Facebook” and The basic idea is to posit a Poisson model for the edge nytimes.com “most emailed”. Efforts in the computa- weights P (Y Λ) and parametrize the rate matrix Λ in tional social sciences have begun studying patterns of terms of a binary| factor matrix B. The binary factors behavior that result in organized social structure and can be interpreted as latent features that induce social interactions (e.g., see Lazer et al., 2009). Here, we de- structure through homophily. The factor matrix B al- velop a new tool to analyze data about social structure lows for an equivalent interpretation as basis matrices, and interactions routinely collected in this context. which define overlapping cliques of different sizes. In- ference is carried out in two stages: first we identify There is a rich literature of statistical models to ana- candidate basis matrices using Bron-Kerbosch, then lyze binary interactions or networks (Goldenberg et al., we estimate coefficients using EM. The computational 2010). Arguably, however, only a few of them may complexity of the inference is addressed in Section 3.2. be fit to weighted networks. Our approach is to de- compose the graphon Λ, which defines an exchange- able model (Kallenberg, 2005) for integer-valued mea- 2 Graphlet decomposition and model surements of pairs of individuals P (Y Λ), in terms of a number of basis matrices that grows| with the size Consider observing an undirected weighted network, encoded by a symmetric adjacency matrix with inte- of the network, Λ = i µiPi. The factorization of ger entries and diagonal elements equal to zero. Intu- Appearing in ProceedingsP of the 15th International Con- itively, we want to posit a data generating process that ference on Artificial Intelligence and Statistics (AISTATS) can explain edge weights in terms of social information, 2012, La Palma, Canary Islands. Volume XX of JMLR: quantified by community structure at multiple scales W&CP XX. Copyright 2012 by the authors. and possibly overlapping. We choose to represent com- 54 Graphlet decomposition of a weighted network munities in terms of their constituent maximal cliques. We develop a two-stage estimation algorithm. First, we identify a candidate set of Kc basis elements, using Below, we avoid notation whose sole purpose is to set the Bron-Kerbosch algorithm, and obtain a candidate to zero diagonal elements of the resulting matrices. basis matrix Bc. Second, we develop an expectation- Definition 1. The graphlet decomposition (GD) of a maximization (EM) algorithm to estimate the corre- matrix Λ with non-negative entries λ is defined as ij sponding coefficients µ1:Kc ; several of these coefficients Λ = BWB0, where B is an N K binary matrix, W will vanish, thus selecting a final set of basis elements. is a K K diagonal matrix. Explicitly,× we have × We study theoretical properties of this estimation al- gorithm in Section3. Figure1 illustrates the steps of λ11 λ12 . λ1N µ1 0 ... 0 the estimation process on a toy network. λ21 λ22 . λ2N 0 µ2 ... 0 . = B . B0 . .. .. 2.1.1 Identifying candidate basis elements λ λ . λ 0 0 . µ N1 N2 NN K The algorithm to identify the candidate basis elements where λii is set to zero and µi is positive for each i. proceeds by thresholding the observed network Y at a The basis matrix B can be interpreted in terms over- number of levels, t = max(Y ) ... min(Y ), and by iden- lapping communities at multiple scales. To see this, tifying all the maximal cliques in the corresponding sequence of binary networks, Y (t) = 1(Y t), us- denote the i-th column of the matrix B as b i. We can ≥ re-write Λ = K µ P , where each P is an· N N ing the Bron-Kerbosch algorithm (Bron and Kerbosch, i=1 i i i c matrix defined as P = b b , in which the diagonal× is 1973). The resulting cumulative collection of K max- i i 0i (t) set to zero. WeP refer to·P · as basis elements. Think imal cliques found in the sequence of networks Y i c of the i-th column of B as encoding a community of is then turned into candidate basis elements, Pi , for i = 1 ...Kc. Algorithm1 details this procedure. interest; then the vector b i specifies which individuals · are members of that community, and the basis element Bc = empty basis set Pi is the binary graph that specifies the connections For t = max(Y ) to min(Y ) among the member of that community. The same in- Y (t) = 1(Y t) dividual may participate to multiple communities, and C = maximal≥ cliques(Y (t)) using Bron-Kerbosch the communities may span different scales. Bc = Bc C ∪ c The model for an observed network Y is then Algorithm 1: Estimate a basis matrix, B , encoding candidate elements, P c, from a weighted network, Y . K Y Poisson+ ( i=1 µi Pi). (1) ∼ Algorithm1 allows us to avoid dealing with permuta- This is essentially a binaryP factor model with a trun- tions of the input adjacency matrix Y by inferring the cated Poisson link and two important nuances. Stan- basis elements Pi independently of such permutation. dard factor models assume that entire rows of the ma- trix Y are conditionally independent, even when the In Section 3.1 we show that, under certain conditions matrix is square and row and column i refer to the on the true underlying basis matrix B, Algorithm1 same unit of analysis. In contrast, our model treats finds a set of candidate basis elements that contains the true set of basis elements. In addition, in Section the individual random variables yij as exchangeable, and imposes the restriction that positives entries in the 3.2 we show that, under the same conditions on B and factors map (in some way) to maximal cliques in the additional realistic assumptions, we expect to have a observed network. In addition, the matrix Y is sparse; number of candidate basis elements of the same order of magnitude of the true number of basis elements, our model implies that zero edge weights carry no in- c formation about maximal cliques, and thus are not K = O(K), as the network size N grows, given that relevant for estimation. These nuances—map between the maximum weight is stationary, max(Y ) = O(1). cliques and factors, and information from positives en- tries only—produce a new and interesting inferential 2.1.2 Sparse Poisson deconvolution setting that scales to large weighted networks. Given the candidate set of basis elements encoded in Bc, we develop an EM algorithm to estimate the corre- 2.1 Inference sponding coefficients µ1:Kc . Under certain conditions on the true basis matrix B, this algorithm consistently Computing the graphlet decomposition of a network estimates the coefficients by zeroing out the coefficients Y involves the estimation of a few critical parameters: associated with the unnecessary basis elements. the number of basis elements K, the basis matrix B c or equivalently the basis elements P1:K , and the coef- Recall that we have K candidate basis elements c c ficients associated with these basis elements µ1:K . P1:Kc . Let’s define a set of statistics Tk = ij Pk,ij 55 P Azari & Airoldi 3 3 thres = 4 μ = 3.0 3 1 2 2 3 2 2 μ =0.0 1 4 thres = 3 2 4 2 1 4 μ 3 =2.0 1 thres = 2 1 1 1 1 Network Generation μ 4 =1.0 Final Basis 1 thres = 1 Maximal Cliques Candidate Basis Clique Components Setting Thresholds Bron Kerboshch Expectation Maximization Figure 1: An illustration of the two-stage estimation algorithm for the graphlet decomposition.