Graphlet decomposition of a weighted network

Hossein Azari (Harvard University) and Edoardo M. Airoldi (Harvard University)

Abstract

We introduce the graphlet decomposition of a weighted network, which encodes a notion of social information based on social structure. We develop a scalable algorithm, which combines EM with Bron-Kerbosch in a novel fashion, for estimating the parameters of the model underlying graphlets using one network sample. We explore theoretical properties of graphlets, including computational complexity, redundancy and expected accuracy. We test graphlets on synthetic data, and we analyze messaging on Facebook and crime associations in the 19th century.

1 Introduction

In recent years there has been a surge of interest in collaborative projects similar to Wikipedia, platforms such as Facebook and Twitter, and services that rely on the social structure underlying these services, including cnn.com "popular on Facebook" and nytimes.com "most emailed". Efforts in the computational social sciences have begun studying patterns of behavior that result in organized social structure and interactions (e.g., see Lazer et al., 2009). Here, we develop a new tool to analyze data about social structure and interactions routinely collected in this context.

There is a rich literature of statistical models for analyzing binary interactions or networks (Goldenberg et al., 2010). Arguably, however, only a few of them may be fit to weighted networks. Our approach is to decompose the graphon Λ, which defines an exchangeable model (Kallenberg, 2005) for integer-valued measurements on pairs of individuals P(Y | Λ), in terms of a number of basis matrices that grows with the size of the network, Λ = Σ_i µ_i P_i. The factorization of Λ is related to models for binary networks based on the singular value decomposition and other factorizations (Hoff, 2009; Kim and Leskovec, 2010). We abandon the popular orthogonality constraint (Jiang et al., 2011) among the basis matrices P_i to attain interpretability in terms of multi-scale social structure, and we choose not to model zero edge weights, so that estimation scales linearly with the number of positive weights (Leskovec et al., 2010). These choices lead to a non-trivial inferential setting (Airoldi and Haas, 2011). Related methods utilize a network to encode inferred dependence among multivariate data (Coifman and Maggioni, 2006; Lee et al., 2008). We term our method "graphlet decomposition", as it is reminiscent of a wavelet decomposition of a graph. An unrelated literature uses the term graphlets to denote network motifs (Przulj et al., 2006; Kondor et al., 2009).

The key features of the graphlet decomposition we introduce in this paper are: the basis matrices P_i are non-orthogonal, but capture overlapping communities at multiple scales; the information used for inference comes exclusively from positive weights; and parameter estimation is linear in the number of positive weights. The basic idea is to posit a Poisson model for the edge weights P(Y | Λ) and to parametrize the rate matrix Λ in terms of a binary factor matrix B. The binary factors can be interpreted as latent features that induce social structure through community membership. The factor matrix B allows for an equivalent interpretation as basis matrices, which define overlapping communities of different sizes. Inference is carried out in two stages: first we identify candidate basis matrices using Bron-Kerbosch, then we estimate the coefficients using EM. The computational complexity of the inference is addressed in Section 3.2.

2 Graphlet decomposition and model

Consider observing an undirected weighted network, encoded by a symmetric adjacency matrix with integer entries and diagonal elements equal to zero. Intuitively, we want to posit a data generating process that can explain edge weights in terms of social information, quantified by communities at multiple scales, possibly overlapping. We choose to represent communities in terms of their constituent maximal cliques. Below, we avoid notation whose sole purpose is to set to zero the diagonal elements of the resulting matrices.

Definition 1. The graphlet decomposition (GD) of a matrix Λ with non-negative entries λ_ij is defined as Λ = B W B', where B is an N × K binary matrix and W is a K × K diagonal matrix. Explicitly, we have

    [ λ_11  λ_12  ...  λ_1N ]        [ µ_1   0   ...   0  ]
    [ λ_21  λ_22  ...  λ_2N ]  =  B  [  0   µ_2  ...   0  ]  B'
    [  ...   ...  ...   ... ]        [ ...   ... ...  ... ]
    [ λ_N1  λ_N2  ...  λ_NN ]        [  0    0   ...  µ_K ]

where λ_ii is set to zero and µ_i is positive for each i.

The basis matrix B can be interpreted in terms of overlapping communities at multiple scales. To see this, denote the i-th column of the matrix B by b_·i. We can re-write Λ = Σ_{i=1}^K µ_i P_i, where each P_i is an N × N matrix defined as P_i = b_·i b'_·i, in which the diagonal is set to zero. We refer to the P_i as basis elements. Think of the i-th column of B as encoding a community of interest; then the vector b_·i specifies which individuals are members of that community, and the basis element P_i is the binary graph that specifies the connections among the members of that community. The same individual may participate in multiple communities, and the communities may span different scales.

The model for an observed network Y is then

    Y ∼ Poisson_+ ( Σ_{i=1}^K µ_i P_i ).    (1)

This is essentially a binary factor model with a truncated Poisson link and two important nuances. Standard factor models assume that entire rows of the matrix Y are conditionally independent, even when the matrix is square and row and column i refer to the same unit of analysis. In contrast, our model treats the individual random variables y_ij as exchangeable, and imposes the restriction that positive entries in the factors map (in some way) to maximal cliques in the observed network. In addition, the matrix Y is sparse; our model implies that zero edge weights carry no information about maximal cliques, and thus are not relevant for estimation. These nuances—the map between cliques and factors, and information from positive entries only—produce a new and interesting inferential setting that scales to large weighted networks.

2.1 Inference

Computing the graphlet decomposition of a network Y involves the estimation of a few critical parameters: the number of basis elements K, the basis matrix B or equivalently the basis elements P_{1:K}, and the coefficients µ_{1:K} associated with these basis elements.

We develop a two-stage estimation algorithm. First, we identify a candidate set of K^c basis elements, using the Bron-Kerbosch algorithm, and obtain a candidate basis matrix B^c. Second, we develop an expectation-maximization (EM) algorithm to estimate the corresponding coefficients µ_{1:K^c}; several of these coefficients will vanish, thus selecting a final set of basis elements. We study theoretical properties of this estimation algorithm in Section 3. Figure 1 illustrates the steps of the estimation process on a toy network.

2.1.1 Identifying candidate basis elements

The algorithm to identify the candidate basis elements proceeds by thresholding the observed network Y at a number of levels, t = max(Y), ..., min(Y), and by identifying all the maximal cliques in the corresponding sequence of binary networks, Y(t) = 1(Y ≥ t), using the Bron-Kerbosch algorithm (Bron and Kerbosch, 1973). The resulting cumulative collection of K^c maximal cliques found in the sequence of networks Y(t) is then turned into candidate basis elements, P_i^c, for i = 1, ..., K^c. Algorithm 1 details this procedure.

    B^c = empty basis set
    For t = max(Y) down to min(Y):
        Y(t) = 1(Y ≥ t)
        C = maximal cliques of Y(t), using Bron-Kerbosch
        B^c = B^c ∪ C

Algorithm 1: Estimate a basis matrix, B^c, encoding candidate elements, P^c, from a weighted network, Y.

Algorithm 1 allows us to avoid dealing with permutations of the input adjacency matrix Y, by inferring the basis elements P_i independently of any such permutation. In Section 3.1 we show that, under certain conditions on the true underlying basis matrix B, Algorithm 1 finds a set of candidate basis elements that contains the true set of basis elements. In addition, in Section 3.2 we show that, under the same conditions on B and additional realistic assumptions, we expect the number of candidate basis elements to be of the same order of magnitude as the true number of basis elements, K^c = O(K), as the network size N grows, provided that the maximum weight is stationary, max(Y) = O(1).

2.1.2 Sparse Poisson deconvolution

Given the candidate set of basis elements encoded in B^c, we develop an EM algorithm to estimate the corresponding coefficients µ_{1:K^c}. Under certain conditions on the true basis matrix B, this algorithm consistently estimates the coefficients, zeroing out the coefficients associated with unnecessary basis elements.

Recall that we have K^c candidate basis elements P^c_{1:K^c}. Define the statistics T_k = Σ_{ij} P^c_{k,ij} for k = 1, ..., K^c, and introduce a set of latent N × N matrices G_k with positive entries, subject to the only constraint that Σ_k G_k = Y. Algorithm 2 details the EM iterative procedure that leads to maximum likelihood estimates of the coefficients µ_{1:K^c}.

    Initialize µ_k^(0)
    While ||µ^(t+1) − µ^(t)|| > ε:
        For k = 1 to K^c:
            E-step:  G_{k,ij}^(t+1) = µ_k^(t) · Y_ij P_{k,ij} / ( T_k · Σ_m µ_m^(t) P_{m,ij} )
            M-step:  µ_k^(t+1) = Σ_{i,j} G_{k,ij}^(t+1)
        t = t + 1

Algorithm 2: Estimate non-negative coefficients, µ, for all candidate elements, P^c, in the basis matrix B^c.

This is the Richardson-Lucy algorithm for Poisson deconvolution (Richardson, 1972). The truncated Poisson likelihood can also be maximized directly, using a KL-divergence argument (Csiszar and Shields, 2004) involving the discrete probability distribution obtained by normalizing the data matrix, Ȳ, and a linear combination of discrete probability distributions obtained by normalizing the candidate basis matrices, Σ_k ω_k P̄_k.
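The two-stage procedure is straightforward to prototype. Below is a minimal sketch in Python, assuming the numpy and networkx libraries; the function names and convergence settings are ours, not part of the paper.

    import numpy as np
    import networkx as nx

    def candidate_basis(Y):
        # Algorithm 1 (sketch): maximal cliques of the thresholded networks Y(t) = 1(Y >= t),
        # for t = max(Y) down to 1, collected via Bron-Kerbosch (networkx's find_cliques).
        N = Y.shape[0]
        found = set()
        for t in range(int(Y.max()), 0, -1):
            G = nx.from_numpy_array((Y >= t).astype(int))
            for clique in nx.find_cliques(G):
                if len(clique) >= 2:
                    found.add(tuple(sorted(clique)))
        B = np.zeros((N, len(found)), dtype=int)
        for k, clique in enumerate(sorted(found)):
            B[list(clique), k] = 1
        return B

    def fit_coefficients(Y, B, n_iter=500, tol=1e-8):
        # Algorithm 2 (sketch): Richardson-Lucy EM for the coefficients mu, given candidates B.
        Kc = B.shape[1]
        P = np.einsum('ik,jk->kij', B, B).astype(float)   # P_k = b_k b_k'
        for k in range(Kc):
            np.fill_diagonal(P[k], 0.0)                   # diagonals are set to zero
        T = P.sum(axis=(1, 2))                            # T_k = sum_ij P_k,ij
        mu = np.ones(Kc)
        for _ in range(n_iter):
            rate = np.tensordot(mu, P, axes=1)            # sum_m mu_m P_m,ij
            ratio = np.divide(Y, rate, out=np.zeros_like(rate), where=rate > 0)
            mu_new = mu * np.einsum('kij,ij->k', P, ratio) / T
            if np.abs(mu_new - mu).sum() < tol:
                return mu_new
            mu = mu_new
        return mu

Candidates whose estimated µ_k is numerically zero can then be dropped to obtain the final basis, as illustrated in Figure 1.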

[Figure 1 (illustration): Network Generation → Setting Thresholds (thres = 4, 3, 2, 1) → Maximal Cliques via Bron-Kerbosch → Candidate Basis Components → Expectation Maximization → Final Basis, with estimated coefficients such as µ = 3.0, 0.0, 2.0, 1.0.]

Figure 1: An illustration of the two-stage estimation algorithm for the graphlet decomposition.

3 Theory

Here we develop some theory for graphlets when the graph Y is generated by a non-expandable collection of basis matrices. The extent to which these results extend to the general case is unclear as of this writing; however, the results help form some intuition about how graphlets encode information. See the Appendix for details of the proofs. We begin by establishing some notation.

Definition 2. Given two binary adjacency matrices P_1 and P_2, define P_1 ⊆ P_2 to mean P_2(j,k) = 1 whenever P_1(j,k) = 1; i.e., the graph induced by P_1 is a subgraph of P_2.

Definition 3. Define P = ∨_{i∈I} P_i to mean P(j,k) = 1 whenever P_i(j,k) = 1 for any i ∈ I, and zero otherwise.

Definition 4. Given a collection {P_k}, we say that P' is an expansion of {P_k} if both Definitions 2 and 3 are satisfied.

Definition 5 (non-expandable basis). A collection {P_k} is non-expandable if for any expansion P' of {P_k} we can find j such that P' ⊂ P_j.

3.1 Identifiability

The first result asserts that the collection of Poisson rates Λ can be uniquely decomposed into a collection of basis matrices P_{1:K}, here encoded by the equivalent binary matrix B, with weights W = diag(µ_{1:K}).

Theorem 1. Let Λ be a symmetric N × N non-negative matrix which is generated by a non-expandable basis. Then there exist a unique N × K matrix B and K × K matrix W such that Λ = B W B'.

This implies that, in the absence of noise, the parameters B, W, K are estimable from the data Y. The proof of Theorem 1 is constructive, by means of Algorithm 3, which is fast but sensitive to noise. In practice, we propose a more robust algorithm that uses non-expandability to identify a collection of candidate basis elements B^c, which provably includes the unique non-expandable basis B that generated the graph.

Theorem 2. Let Λ = B W B', where W is a diagonal matrix with non-negative entries and B encodes a non-expandable basis, both unknown. If B^c is the output of Algorithm 1 with Λ as input, then B ⊆ B^c.

These two theorems, with the entries B_ij taken to be IID Bernoulli random variables, may be used to generate random cliques.
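This generative reading is easy to make concrete. The following sketch (numpy assumed; the parameter values mirror the simulations of Section 4.1 and are otherwise arbitrary) draws a random binary basis B, Gamma-distributed coefficients µ, and a weighted network Y from the Poisson model in Equation (1); the zero-truncation implied by the Poisson_+ link is not enforced here, and this is an illustration rather than the authors' simulation code.

    import numpy as np

    def simulate_graphlet_network(N=50, lam=30, alpha=1.0, beta=10.0, p=0.04, seed=0):
        # Draw (B, mu, Y) with Y ~ Poisson(B diag(mu) B') off the diagonal.
        rng = np.random.default_rng(seed)
        K = max(rng.poisson(lam), 1)                     # number of basis elements
        B = rng.binomial(1, p, size=(N, K))              # binary community memberships
        mu = rng.gamma(shape=alpha, scale=beta, size=K)  # positive coefficients
        Lam = (B * mu) @ B.T                             # rate matrix Lambda = B W B'
        np.fill_diagonal(Lam, 0.0)
        Y = np.triu(rng.poisson(Lam), 1)                 # one draw per dyad
        return B, mu, Y + Y.T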

3.2 Redundancy and complexity

Here, we derive an upper bound on the number of candidate basis elements encoded in B^c for a network with at most K = c log_2 N cliques.

Networks with this many cliques include networks generated by most exchangeable graph models (Goldenberg et al., 2010), and they are the largest networks with an implicit representation (Kannan et al., 1988).

Theorem 3. Let the elements B_ik of the basis matrix for a network Y be IID Bernoulli random variables with parameter p_N. Then an asymptotic upper bound on the number of candidate basis elements identified by Algorithm 1, denoted C_{N,p_N}, is

    C_{N,p_N} ≤ Q ( 2^{K H(p_N)} + K ) = Q ( N^{c_1 H(p_N)} + c log_2 N ),    (2)

where Q is the number of thresholds in Algorithm 1 and H(·) is the entropy function.

While Theorem 3 applies in general, a notion of redundancy of the candidate basis set is well defined only for networks generated by a non-expandable basis, in which case B ⊆ B^c. In this case, we define redundancy as the ratio R_{N,p_N} ≡ C_{N,p_N}/K. Theorem 2 states that the number of candidate basis elements is never smaller than the true number of basis elements K. Thus Theorem 3 leads to the following upper bound on redundancy,

    R_{N,p_N} ≤ Q ( 1 + N^{c_1 H(p_N)} / (c log_2 N) ).    (3)

In a realistic scenario, p_N = O(1/log_2 N), the complexity of the candidate basis set is O(Q log_2 N), and the implied redundancy is at most O(Q). Alternatively, if p_N = O(1/K), the implied redundancy is at most O(QK). These are regimes we often encounter in practice. These limiting behaviors of p_N arise whenever the nodes in a network can be assumed to have a capacity that is essentially independent of the size of the network; e.g., individuals participate in the activities of a constant number of groups over time.

These calculations are only suggestive for networks generated by an expandable basis. However, non-expandability tends to be a reasonable approximation in many situations, including whenever exchangeable graph models are a good fit for the network. In the analysis of messaging patterns on Facebook in Section 4.3, for instance, we empirically observe C_{N,p} ≈ 2 K̂.

3.3 Accuracy with fewer than K basis elements

The main estimation algorithm, Algorithm 2, recovers the correct number of basis elements K and the corresponding coefficients µ and basis matrix B whenever the true weighted network Y is generated from a non-expandable basis matrix. Here we quantify the expected loss in reconstruction accuracy if we were to use K̃ < K basis elements to reconstruct Y. To this end we introduce a norm for a network Y, and related metrics.

Definition 6 (τ-norm). Let Y ∼ Poisson(B W B'), where W = diag(µ_1, ..., µ_K), and define the statistics a_k ≡ Σ_{i=1}^N B_ik for k = 1, ..., K. The τ-norm of Y is defined as τ(Y) ≡ Σ_{k=1}^K |µ_k a_k|.

Consider an approximation Ỹ = B W̃ B' characterized by an index set E ⊂ {1, ..., K}, which specifies the basis elements to be excluded by setting the corresponding set of coefficients µ_E to zero. Its reconstruction error is τ(Y − Ỹ) = Σ_{k∈E} |µ_k a_k|, and its reconstruction accuracy is τ(Ỹ)/τ(Y). Thus, given a network matrix Y representable exactly with K basis elements, the best approximation with K̃ basis elements is obtained by zeroing out the lowest K − K̃ coefficients µ.

We posit the following theoretical model,

    µ_k · a_k / N ∼ Gamma(α, 1),    (4)

for k = 1, ..., K. This model may be used to compute the expected accuracy of an approximate reconstruction based on K̃ < K basis elements, since the magnitudes of the ordered µ_k coefficients that are zeroed out are order statistics of a sample of Gamma variates (Shawky and Bakoban, 2009). Given K and α, we can compute the expected coefficient magnitudes and the overall expected accuracy τ_0.

Theorem 4. The theoretical accuracy of the best approximate graphlet decomposition with K̃ out of K basis elements is

    τ_0(K̃, K, α) = Σ_{j=1}^{K̃} f(j, K, α) / (α K),

where

    f(j, K, α) = (K choose j) Σ_{q=0}^{j−1} (j−1 choose q) (−1)^{j−1−q} f(1, K−j+q+1, α) / (K−j+q+1),

    f(1, K, α) = (K / Γ(α)) Σ_{m=0}^{(α−1)(K−1)} c_m(α, K−1) Γ(α+m) / K^{α+m},

and the coefficients c_m are defined by the recursion c_m(α, q) = Σ_{i=0}^{α−1} (1/i!) c_{m−i}(α, q−1), with boundary conditions c_i(α, 1) = 1/i! for i = 1, ..., α.

Figure 2 illustrates this result on simulated networks. The solid (red) line is the theoretical accuracy computed for K = 30 and α = 1, the parameters used to simulate the sample of 100 weighted networks. The (blue) boxplots summarize the empirical accuracy at a number of distinct values of K̃/K.

4 Results

We evaluate graphlets on real and simulated data. Simulation results on weighted networks in Section 4.1 show that the estimation problem is well-posed and that both the binary matrix B and the coefficients µ are estimable without bias, while results on binary networks in Section 4.2 enable an analysis of the comparative performance with existing methods for binary networks. We also develop an analysis of messaging patterns on Facebook for a number of US colleges, and an analysis of historical crime data, in Sections 4.3–4.4.
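To make the τ-norm bookkeeping concrete, the following sketch (numpy assumed; function names are illustrative) computes the per-element contributions |µ_k a_k| and the smallest number of basis elements K̃ whose retained contributions reach a target accuracy τ_0. This mirrors how the fractions K̃/K̂ reported below can be obtained, though it is not the authors' code.

    import numpy as np

    def tau_terms(B, mu):
        # Per-element contributions |mu_k * a_k|, where a_k is the size of basis element k.
        a = B.sum(axis=0)
        return np.abs(mu * a)

    def smallest_K_for_accuracy(B, mu, tau0):
        # Smallest K-tilde such that keeping the largest K-tilde terms attains accuracy tau0.
        terms = np.sort(tau_terms(B, mu))[::-1]      # largest contributions first
        accuracy = np.cumsum(terms) / terms.sum()    # tau(Y_tilde) / tau(Y) as terms are added
        hits = np.flatnonzero(accuracy >= tau0)
        return int(hits[0] + 1) if hits.size else len(terms)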

Figure 2: Theoretical and empirical accuracy for different fractions of basis elements K̃/K, with α = 0.1. The ratio K̃/K also provides a measure of sparsity.

Overall, these results suggest that graphlets encode a new quantification of social information; we provide a concrete illustration of this idea in Section 5.1.

4.1 Parameter estimation

We set out to evaluate whether the parameters underlying graphlets—the entries of the binary matrix B, its size K, and the non-negative coefficients µ—are estimable from one weighted network sample Y.

We simulated M = 100 networks from the model, each with 50 nodes, using random values for the parameters µ, B and K. For the i-th network, the number of basis elements K_i was sampled from a Poisson distribution with rate λ = 30. The coefficients µ_ji were sampled from a Gamma distribution with parameters α = 1 and β = 10. The entries of the basis matrix B were sampled from a Bernoulli distribution with probability of success p = 0.04. The index i = 1, ..., M runs over the networks and the index j = 1, ..., K_i runs over the basis elements of the i-th network.

Algorithms 1–2 were used to estimate the parameters B, µ and K for each synthetic network. The estimation error of K̂ is |K̂ − K|. Since the estimated matrix B̂ is unique only up to a permutation of its columns, the Procrustes transform (Kendall, 1989) was used to identify the optimal rotation of B̂ to be compared with the true matrix B. The estimation error of B̂ is computed as the normalized L2 distance between B and B̂ after the Procrustes transform. The estimation error of µ̂ is computed as the normalized L2 distance between µ and µ̂. The average error in reconstructing the simulated networks is computed as the L1 distance between Y and Ŷ, normalized by the total number of counts in the network and averaged over the 100 simulations; it is denoted |Ỹ|_1. This metric gives more weight to larger cliques, since the clique size enters as a quadratic term. We also provide the average error in reconstructing the simulated networks based on the τ-norm, which is less sensitive to incorrectly estimating larger cliques. In addition, we compute the average error in reconstructing whether or not edges have positive weights, rather than their magnitude, denoted error in 1(Ỹ > 0).

Table 1 summarizes the simulation results for weighted network reconstructions obtained by setting a target accuracy τ_0, ranging from 0.20 to 1.00—no error. The last row of Table 1 supports the claim that the estimation problem is well-posed; the estimation algorithm stably recovers the true parameter values. The only sources of non-zero error are the estimation of the number of basis elements and the estimation of the basis elements themselves, as encoded by the matrix B. However, these errors are negligible, and they do not seem to impact the estimation of the coefficients µ, nor to propagate to the network reconstruction, independently of the metric we use to quantify it.

The average bias in estimating the optimal number of basis elements K is 0.07, with a standard deviation of 0.256. Column K̃/K̂ provides the fraction of the estimated optimal number of basis elements K̂ that is necessary to achieve the desired target accuracy, in terms of the τ-norm, using Theorem 4. Column two provides the empirical τ(Ỹ) error, defined as one minus the accuracy, corresponding to the reported fraction of basis elements. The empirical accuracy matches the expected accuracy given by Theorem 4 well. For instance, Theorem 4 implies that we need 45% of the basis elements in order to achieve an accuracy of 0.85, and the empirical accuracy is slightly in excess of 0.85, on average. Figure 2 shows similar results, for a larger set of 29 values of the target accuracy τ_0.

The first two columns of Table 1 report the average L1 error and the average τ-based error, properly normalized to range between zero and one. Cliques in the simulated networks range between 2 and 5 nodes, for most networks. The differences in the reconstruction error are due to the weight of these larger cliques. For instance, the empirical L1 accuracy obtained by using 45% of the basis elements is around 0.65, as opposed to a τ accuracy of around 0.85, on average.

The accuracy in reconstructing the presence or absence of edges in the network, rather than the magnitude of the corresponding weights, is in between the L1 accuracy and the τ accuracy, for any fraction K̃/K̂.
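The error metrics reported in Table 1 can be assembled along the following lines. This sketch (numpy and scipy assumed) aligns the estimated basis to the truth with an orthogonal Procrustes rotation, assumes B and B̂ have the same number of columns, and compares the coefficient vectors after sorting—simplifications of the matching described above.

    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def estimation_errors(B, B_hat, mu, mu_hat, Y, Y_hat):
        # Normalized errors for B (after Procrustes alignment), mu, and the reconstruction.
        R, _ = orthogonal_procrustes(B_hat.astype(float), B.astype(float))
        err_B = np.linalg.norm(B_hat @ R - B) / np.linalg.norm(B)
        err_mu = np.linalg.norm(np.sort(mu_hat) - np.sort(mu)) / np.linalg.norm(mu)
        err_Y_l1 = np.abs(Y - Y_hat).sum() / Y.sum()     # L1 error per observed count
        err_links = np.mean((Y > 0) != (Y_hat > 0))      # error in 1(Y_tilde > 0)
        return err_B, err_mu, err_Y_l1, err_links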

Table 1: Performance of the estimation algorithm for a target accuracy τ_0, on synthetic data.

    |Ỹ|_1 error     τ(Ỹ) error      error in 1(Ỹ > 0)   error in B        error in µ      K̃/K̂            τ_0
    0.97 ± 0.012    0.65 ± 0.074    0.76 ± 0.062        50.75 ± 10.893    0.55 ± 0.082    0.08 ± 0.035    0.20
    0.92 ± 0.016    0.46 ± 0.026    0.61 ± 0.053        42.40 ± 8.134     0.32 ± 0.041    0.16 ± 0.039    0.50
    0.79 ± 0.036    0.23 ± 0.012    0.40 ± 0.057        29.14 ± 5.536     0.10 ± 0.015    0.33 ± 0.050    0.75
    0.56 ± 0.063    0.09 ± 0.006    0.24 ± 0.048        17.04 ± 3.757     0.02 ± 0.004    0.54 ± 0.062    0.90
    0.00 ± 0.000    0.00 ± 0.000    0.00 ± 0.000        0.67 ± 0.618      0.00 ± 0.000    1.01 ± 0.033    1.00

4.2 Link reconstruction

Graphlet is a method for decomposing a weighted network. Here, however, we evaluate how graphlets fare in terms of the computational complexity versus accuracy trade-off, when compared with the performance that current methods for binary networks can achieve.

For this purpose, we generated 100 synthetic networks from the model, each with 100 nodes, choosing random values for the parameters as in Section 4.1. The number of basis elements was sampled from a Poisson with rate λ = 12. The coefficients were sampled from a Gamma with parameters α = 2 and β = 10. The entries of the basis matrix were sampled from a Bernoulli with probability of success p = 0.1. We then obtained the corresponding binary networks by resetting positive edge weights to one, 1(Y > 0).

We compute comparative performance for a few popular models of networks that can be used for link prediction. These include the Erdös-Rényi-Gilbert random graph model (Erdös and Rényi, 1959; Gilbert, 1959), denoted ERG; a variant of the degree distribution model fitted with importance sampling (Blitzstein and Diaconis, 2010), denoted DDS; the latent space cluster model (Handcock et al., 2007), denoted LSCM; and the stochastic blockmodel with mixed membership (Airoldi et al., 2008), denoted MMSB. A comparison with the singular value decomposition in terms of binary link prediction is problematic, since using a few singular vectors would not lead to binary edge weights.

Table 2 summarizes the comparative performance results. The accuracy of graphlets is reported for different fractions of the estimated optimal number of basis elements K̂, ranging from 25% to 100%—no error. The accuracy is consistently high, while the running time is only a fraction of a second, independent of the desired reconstruction accuracy. This is because the complexity of computing the entire graphlet decomposition is linear in the number of edges present in the network; thus specifying a smaller fraction of the number of basis elements is a choice based on storage considerations, rather than on runtime. The comparative performance results suggest that the graphlet decomposition is very competitive in predicting links.

Table 2: Link reconstruction accuracy and runtime for a number of competing methods, on simulated data. The graphlet decomposition is denoted Glet, with the fraction of basis elements used quoted in brackets.

    Method        Runtime (sec)   Accuracy
    Glet (25%)    0.0636          92.7 ± 0.30
    Glet (50%)    0.0636          94.7 ± 0.15
    Glet (75%)    0.0636          97.0 ± 0.08
    Glet (90%)    0.0636          98.9 ± 0.05
    Glet (100%)   0.0636          100.0 ± 0.00
    ERG           0.0127          86.0 ± 1.20
    DDS           11.1156         89.0 ± 0.85
    LSCM          6.3491          90.0 ± 0.70
    MMSB          2.8555          93.0 ± 0.40

This binary link prediction simulation study is one way to compare graphlets to interesting existing methods, which have been developed for binary graphs.

4.3 Analysis of messaging on Facebook

Here we illustrate graphlets with an application to messaging patterns on Facebook. We analyzed the number of public wall-posts on Facebook, over a three-month period, among students of a number of US colleges. While our data is new, the US colleges we selected have been previously analyzed (Traud et al., 2011).

Table 3 provides a summary of the weighted network data and of the results of the graphlet decomposition. Salient statistics for each college include the number of nodes and edges. The table reports the number of estimated basis elements K̂ for each network and the runtime, in seconds. The τ(Ỹ) error incurred by using a graphlet decomposition is reported for different fractions of the estimated optimal number of basis elements K̂, ranging from 10% to 100%—no error.

Overall, these results suggest that the compression of wall-posts achievable on collegiate networks is substantial. A graphlet decomposition with about 10% of the optimal number of basis elements already leads to a reconstruction error of 10% or less, with a few exceptions. Using 25% of the basis elements further reduces the reconstruction error to below 5%.

Table 3: τ(Ỹ) error for different fractions of the estimated optimal number of basis elements K̂.

    college       nodes   edges    K̂       sec   10%    25%   50%    75%   90%   100%
    Amherst       2235    181907   10151   124   6.5    2.3   .84    .33   .14   0
    Caltech       769     33311    3735    51    5.7    1.8   .65    .27   .11   0
    CMU           6637    499933   11828   169   14.9   5.2   1.89   .73   .34   0
    MIT           6440    502503   13145   191   11.5   4.7   1.68   .65   .30   0
    Rice          4087    369655   12848   155   8.7    3.2   1.25   .51   .22   0
    Swarthmore    1659    122099   7856    96    6.3    2.6   1.06   .44   .20   0

4.4 Analysis of historical crime data

More recently, we have used the graphlet decomposition to analyze crime association records from the 19th century. These records are available at the South Carolina Department of Archives and History (Columbia, SC), Record Group 44, Series L 44158, South Carolina, Court of General Sessions (Union County) Indictments 1800–1913. Nodes in the network are 6221 residents of Union County, during the years 1850 to 1880. Edge weights indicate the number of crimes involving pairs of residents. Details on the structure of this data set and its historical relevance may be found in Frantz-Parsons (2005, 2011).

A first step in the analysis was to explore the degree to which there is an optimal level of granularity, for the observed associations, at which a non-trivial criminal structure emerges. We pursued this analysis by fitting graphlets to the raw criminal association network raised to different powers, Y^k for k = 1, ..., 5. We found that Y^2 contained a substantial degree of criminal social structure. At this resolution, k = 2, two residents are associated with crimes in which they are either directly involved, or in which they are involved indirectly through their direct criminal associates. The results suggest that there are several tight pockets of criminality scattered throughout Union County.

We were interested in developing an interpretation for these pockets of criminality. One compelling hypothesis is that each pocket captures a gang or a clan. If this holds, we might expect the variation in the gangs' relations to be explained, to a large degree, by the spatial organization of the townships they operate in.

To test this hypothesis, we built a meta network of criminal associations, in which each gang identified in the first step of the analysis is represented by a meta node. Two or more gangs were connected in the meta network whenever a criminal record involved at least one of their members. A graphlet decomposition of the meta network identified pockets of criminal activity at the gang level. Plotting these gang-level criminal associations reveals the townships' spatial organization, to a large degree. Figure 3 shows the pockets of criminality identified by the graphlet analysis of the original criminal associations, Y^2, as green nodes associated with numbers, as well as the pockets of criminality identified by the graphlet analysis of the meta network, as purple nodes associated with letters.

[Figure 3 plots the numbered and lettered nodes over the townships of Union County, including Draytonsville, Gowdeysville, Pinkney, Jonesville, Bogansville, Unionville, Santuck, Cross Keys, Fishdam and Goshenhill.]

Figure 3: A two-step analysis of the Union County criminal association network using graphlets reveals tight pockets of criminality (the numbered nodes) and local criminal communities (the lettered nodes).

In summary, we were able to successfully use graphlets as an exploratory tool to capture and distill an interesting, multi-scale organization among criminals in 19th-century Union County, from historical records.

5 Discussion

Taken together, our results suggest that the graphlet decomposition implies a new notion of social information, quantified in terms of multi-scale social structure. We explored this idea with a simulation study.

5.1 A new notion of "social information"

Graphlets quantify social information in terms of community structure (i.e., maximal cliques) at multiple scales and possibly overlapping. To illustrate this notion of social information, we simulated 200 networks: 100 Erdös-Rényi-Gilbert random graphs with Poisson edge weights, and 100 networks from the model described in Section 2.

Figure 4: Comparison between graphlet and SVD decompositions. Panels 1–2 report results on the networks simulated from the model underlying graphlets. Panels 3–4 report results on the Poisson random graphs.

While, arguably, the Poisson random graphs do not contain any such information, the data generating process underlying graphlets was devised to translate such information into edge weights.

As a baseline for comparison, we consider the singular value decomposition (Searle, 2006), applied to the symmetric adjacency matrices with integer entries encoding the weighted networks. The SVD summarizes edge weights using orthogonal eigenvectors v_i as basis elements and the associated eigenvalues λ_i as weights, Y_{N×N} = Σ_{i=1}^N λ_i v_i v_i'. However, SVD basis elements are not interpretable in terms of community structure, thus the SVD should not be able to capture the notion of social information we are interested in quantifying.

We applied graphlets and the SVD to the two sets of networks we simulated. Figure 4 provides an overview of the results. Panels 1–2 report results on the networks simulated from the model underlying graphlets; panels 3–4 report results on the Poisson random graphs. Panels 1 and 3 show box plots of the coefficients associated with each of the basis elements, for graphlets in blue and for the SVD in red. Panels 2 and 4 show box plots of the cumulative representation error as a function of the number of basis elements utilized. Graphlet coefficients decay more slowly than SVD coefficients on Poisson graphs (panel 3). Because of this, the error in reconstructing Poisson graphs achievable with graphlets is consistently worse than the error achievable with the SVD (panel 4). In contrast, graphlet coefficients decay rapidly to zero on networks with social structure, much more sharply than the SVD coefficients (panel 1). Thus, the reconstruction error achievable with graphlets is consistently better than the error achievable with the SVD (panel 2).

These results support our claim that graphlets are able to distill and quantify a notion of social information in terms of social structure.

5.2 Concluding remarks

The graphlet decomposition of a weighted network is an exploratory tool, which is based on a simple but compelling statistical model; it is amenable to theoretical analysis, and it scales to large weighted networks.

The SVD is the most parsimonious orthogonal decomposition of a data matrix Y (Hastie et al., 2001). Graphlets abandon the orthogonality constraint to attain interpretability of the basis matrix in terms of social information. In the presence of social structure, graphlets provide a better summary of the variability in the observed connectivity. The SVD always achieves a lower reconstruction error, however, when only a few basis elements are considered (see Figure 4), suggesting that orthogonality induces a more useful set of constraints whenever a very low-dimensional representation is of interest.

Graphlets provide a parsimonious summary of complex social structure, in practice. The simulation studies in Sections 4.1 and 5.1 suggest that graphlets are leveraging the social structure underlying messaging patterns on Facebook to deliver the high compression/high accuracy ratios we empirically observe in Table 3.

The information graphlets use for inference comes exclusively from the positive weights in the network. This feature is the source of desirable properties; however, because of it, graphlets are also sensitive to missing data, since zero edge weights cannot be imputed. More research is needed to address this issue properly, while maintaining the properties outlined in Section 3. Empirical solutions include considering powers of the original network, as in Section 4.4, or cumulating weights over a long observation window, as in Section 4.3. Intuitively, cumulating messages over long time windows will reveal a large set of basis elements. Algorithm 2 can then be used to project message counts over a short observation window onto the larger set of basis elements, to estimate which elements were active in the short window and how much. Recent empirical studies suggest three months as a useful cumulation period (Cortes et al., 2001; Kossinets and Watts, 2006).

Acknowledgments

This work was partially supported by National Science Foundation grants DMS-1106980, IIS-1017967, CAREER award IIS-1149662, and by Army Research Office grant 58153-MA-MUR, all to Harvard University. EMA is an Alfred P. Sloan Research Fellow.

References

E. M. Airoldi and B. Haas. Polytope samplers for inference in ill-posed inverse problems. In Journal of Machine Learning Research, volume 15 of W&CP (AISTATS), 2011.

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

J. K. Blitzstein and P. Diaconis. A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Internet Mathematics, 6(4):489–522, 2010.

C. Bron and J. Kerbosch. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM, pages 575–577, 1973.

R. R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.

C. Cortes, D. Pregibon, and C. Volinsky. Communities of interest. Proceedings of Intelligent Data Analysis, pages 105–114, 2001.

I. Csiszar and P. C. Shields. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, 2004.

P. Erdös and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 5:290–297, 1959.

E. Frantz-Parsons. Midnight rangers: Costume and performance in the reconstruction-era Ku Klux Klan. Journal of American History, 92(3):811–836, 2005.

E. Frantz-Parsons. Klan skepticism and denial in reconstruction-era public discourse. Journal of Southern History, 67(1):53–90, 2011.

E. N. Gilbert. Random graphs. Annals of Mathematical Statistics, 30:1141–1144, 1959.

A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi. Statistical models of networks. Foundations and Trends in Machine Learning, 2(2):1–117, 2010.

M. Handcock, A. E. Raftery, and J. Tantrum. Model-based clustering for social networks (with discussion). Journal of the Royal Statistical Society, Series A, 170:301–354, 2007.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.

P. D. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational & Mathematical Organization Theory, 15(4):261–272, 2009.

X. Jiang, Y. Yao, H. Liu, and L. Guibas. Compressive network analysis. Manuscript, April 2011. URL http://arxiv.org/abs/1104.4605.

O. Kallenberg. Probabilistic symmetries and invariance principles. In Probability and its Applications. Springer, New York, 2005.

S. Kannan, M. Naor, and S. Rudich. Implicit representation of graphs. Proceedings of the 20th Annual ACM Symposium on Theory of Computing, 1988.

D. G. Kendall. A survey of the statistical theory of shape. Statistical Science, 4:87–99, 1989.

M. Kim and J. Leskovec. Multiplicative attribute graph model of real-world networks. In 7th Workshop on Algorithms and Models for the Web Graph (WAW), 2010.

R. Kondor, N. Shervashidze, and K. M. Borgwardt. The graphlet spectrum. In International Conference on Machine Learning, volume 26, 2009.

G. Kossinets and D. J. Watts. Empirical analysis of an evolving social network. Science, 311:88–90, 2006.

D. Lazer, et al. Computational social science. Science, 323(5915):721–723, 2009.

A. B. Lee, B. Nadler, and L. Wasserman. Treelets: An adaptive multiscale basis for sparse unordered data. Annals of Applied Statistics, 2(2):435–471, 2008.

J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11:985–1042, 2010.

N. Przulj, D. G. Corneil, and I. Jurisica. Efficient estimation of graphlet frequency distributions in protein–protein interaction networks. Bioinformatics, 22(8):974–980, 2006.

W. H. Richardson. Bayesian-based iterative method of image restoration. Journal of the Optical Society of America, 62(1):55–59, 1972.

S. R. Searle. Matrix Algebra Useful for Statistics. Wiley-Interscience, 2006.

A. I. Shawky and R. A. Bakoban. Order statistics from exponentiated gamma distribution and associated inference. International Journal of Contemporary Mathematical Sciences, 4(2):71–91, 2009.

A. L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter. Community structure in online collegiate social networks. SIAM Review, 53(3):526–543, 2011.

A Proofs of theorems and lemmas

In the following, Λ ◦ C denotes the element-wise (Hadamard) product of the two matrices Λ and C.

A.1 Proof of Theorem 1

Proof. We show that B can be recovered from Λ using the deterministic Algorithm 3.

    Set Λ_0 = Λ
    While there is at least one non-zero element in Λ_0:
        ij_c = argmin_{ij: Λ_0(i,j) > 0} Λ_0(i, j)
        C = largest clique in Λ_0 that includes the edge ij_c
        µ_c = min_{i,j} (Λ_0 ◦ C)(i, j)
        Λ_0 = Λ_0 − µ_c C
        Add (C, µ_c) to (B, W), respectively.

Algorithm 3: Algorithm used in Theorem 1 to establish uniqueness of the non-expandable basis.

In each step, due to non-expandability, the maximal clique C of Λ_0 is not a subset of any other clique and contains an edge that belongs only to C, by Lemma 1 below. The weight of this edge in Λ_0 is µ_c, the coefficient of clique C in the composition. Also, the edges of clique C in the network do not have weights less than µ_c, because all the coefficients µ in the composition are positive. So, in each step, the weight corresponding to the edge(s) unique to C is correctly detected and subtracted from the network. Hence, using Algorithm 3 we can find the basis B and the coefficients W of the decomposition Λ = B W B'. Since the algorithm is deterministic given Λ, it provides exactly one non-expandable basis.

Lemma 1. If B generates a non-expandable basis set S, then any clique C ∈ S contains an edge that does not belong to any other clique in S.

Proof. By contradiction: if every edge in C overlapped with another clique in S, then C could be decomposed into other cliques, which contradicts the non-expandability of S.

A.2 Proof of Theorem 2

Proof. Recall that P_k = b_·k b'_·k and Λ = Σ_l µ_l P_l. For any k ∈ {1, 2, ..., K}, setting a threshold on the edge weights of the network Λ, such that edges with weights above the threshold are selected, yields a binary network. Define G_k to be those indices ℓ for which B_ℓ covers B_k. If the threshold is set to th_k = Σ_{l∈G_k} µ_l, then a binary network Λ_{th=th_k} is revealed. Below we prove that P_k is a maximal clique of Λ_{th=th_k}, using Lemma 2. To show that P_k is a maximal clique of Λ_{th=th_k} we need to prove the following two statements.

(1) P_k ⊆ Λ_{th=th_k}, i.e., the clique P_k is included in Λ_{th=th_k}. This holds because the weights of the edges in Λ that correspond to P_k are at least th_k, since the µ's are positive.

(2) P_k cannot be part of a larger clique in Λ_{th=th_k}. By contradiction, assume such a larger clique exists and call it P'; then P_k ⊂ P' ⊆ Λ_{th=th_k}. Since the basis is non-expandable, and since, by Lemma 2, P' ⊆ Λ_{th=th_k} ⊆ ∨_{i∈T\G_k} P_i, we can find j ∈ T\G_k such that P' ⊆ P_j. Hence P_k ⊂ P' ⊆ P_j, which leads to P_k ⊂ P_j for a j ∈ T\G_k. This is clearly a contradiction, since any such j would belong to G_k by definition.

With P_k being one of the maximal cliques of the thresholded binary network at level th_k, and considering that the threshold level Σ_{i∈I} µ_i, for any I ∈ {G_1, G_2, ..., G_K}, satisfies min_{ij} Λ_ij ≤ Σ_{l∈I} µ_l ≤ max_{ij} Λ_ij and is therefore among the levels scanned by Algorithm 1, it is clear that B ⊆ B^c. This ends the proof of Theorem 2.

Lemma 2. Λ_{th=th_k} ⊆ ∨_{i∈T\G_k} P_i, for T = {1, ..., K}.

Proof. We prove this by contradiction. Assume the claim is not true. In that case, an edge of Λ_{th=th_k} can be found which belongs to the complementary set ∨_{i∈G_k} P_i \ ∨_{i∈T\G_k} P_i. However, the weight of this edge is more than th_k = Σ_{l∈G_k} µ_l, which is not possible for edges included in ∨_{i∈G_k} P_i \ ∨_{i∈T\G_k} P_i, because the maximum weight of an edge obtained from this group of cliques is at most th_k = Σ_{l∈G_k} µ_l. This ends the proof of Lemma 2.

A.3 Proof of Theorem 3

Proof. Using the definition of the graphlet decomposition, P_1, ..., P_K are cliques generated from the matrix B. Considering that Y = Σ_l µ_l P_l, any clique P' detected at the different thresholds is of the form P' = ∨_{l∈I} P_l for some I ⊆ {1, 2, ..., K}. Each clique is detected at most Q times. Then the total number of detected cliques is not more than Q times 2^K, the total number of possible index sets I. However, based on the distribution of the elements of the matrix B, we know the typical set of index sets I has size 2^{K H(p_N)}, asymptotically for large N, where H(·) is the entropy function. This follows from the fact that the bit strings for the nodes lie in a typical set in which each bit-string has about K p_N ones. Hence, the number of nonempty common intersections among a non-typical set of cliques converges to zero, and we have C_{N,p_N} ≤ Q · 2^{K H(p_N)} + Q K.
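For completeness, here is one way Algorithm 3 can be prototyped. This sketch (numpy and networkx assumed) reflects one reasonable reading of the pseudocode above—pick a minimum-weight positive edge, take the largest maximal clique of the current residual network containing it, and peel off that clique with the minimum weight found inside it—and is offered as an illustration rather than as the authors' implementation. It assumes Λ was generated exactly by a non-expandable basis, as in Theorem 1.

    import numpy as np
    import networkx as nx

    def recover_basis(Lam, tol=1e-9):
        # Sketch of Algorithm 3: peel cliques off Lambda until no positive weight remains.
        Lam = Lam.astype(float).copy()
        basis, weights = [], []
        while (Lam > tol).any():
            G = nx.from_numpy_array((Lam > tol).astype(int))
            i, j = np.unravel_index(np.argmin(np.where(Lam > tol, Lam, np.inf)), Lam.shape)
            # largest maximal clique of the residual network containing the edge (i, j)
            cliques = [c for c in nx.find_cliques(G) if i in c and j in c]
            C_nodes = max(cliques, key=len)
            C = np.zeros_like(Lam)
            C[np.ix_(C_nodes, C_nodes)] = 1.0
            np.fill_diagonal(C, 0.0)
            mu_c = Lam[C == 1].min()              # weight of the edge(s) unique to C
            Lam = Lam - mu_c * C
            basis.append(C)
            weights.append(mu_c)
        return basis, weights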