Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Generalized Latent Factor Models for Social Network Analysis

Wu-Jun Li†, Dit-Yan Yeung‡, Zhihua Zhang§
† Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
‡ Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
§ College of Computer Science and Technology, Zhejiang University, China
[email protected], [email protected], [email protected]

Abstract

Homophily and stochastic equivalence are two primary features of interest in social networks. Recently, the multiplicative latent factor model (MLFM) was proposed to model social networks with directed links. Although MLFM can capture stochastic equivalence, it cannot model homophily well. However, many real-world networks exhibit homophily or both homophily and stochastic equivalence, and hence the structure of such networks cannot be modeled well by MLFM. In this paper, we propose a novel model, called the generalized latent factor model (GLFM), for social network analysis by enhancing homophily modeling in MLFM. We devise a minorization-maximization (MM) algorithm with linear-time complexity and a convergence guarantee to learn the model parameters. Extensive experiments on some real-world networks show that GLFM can effectively model homophily and dramatically outperforms state-of-the-art methods.

1 Introduction

A social network¹ [Wasserman and Faust, 1994] is often represented as a graph in which the nodes represent the objects and the edges (also called links) represent the binary relations between objects. The edges in a graph can be directed or undirected. If the edges are directed, we call the graph a directed graph; otherwise, the graph is an undirected graph. Unless otherwise stated, we focus on directed graphs in this paper because an undirected edge can be represented by two directed edges with opposite directions. Some typical networks include friendship networks among people, web graphs, and paper citation networks.

¹ In this paper, we use the terms 'network', 'social network' and 'graph' interchangeably.

As pointed out by [Hoff, 2008], homophily and stochastic equivalence are two primary features of interest in social networks. If an edge is more likely to exist between two nodes with similar characteristics than between two nodes with different characteristics, we say the graph exhibits homophily. For example, two individuals are more likely to be friends if they share common interests; hence, a friendship graph has the feature of homophily. On the other hand, if the nodes of a graph can be divided into groups where members within a group have similar patterns of links, we say the graph exhibits stochastic equivalence. The web graph has such a feature because some nodes can be described as hubs which are connected to many other nodes called authorities, but the hubs or authorities are seldom connected among themselves. For stochastic equivalence, the property that members within a group have similar patterns of links also implies that if two nodes link to or are linked by one common node, the two nodes most likely belong to the same group.

Figure 1: Homophily and stochastic equivalence in networks: (a) homophily; (b) stochastic equivalence.

Examples of homophily and stochastic equivalence in directed graphs are illustrated in Figure 1, where the locations in the 2-dimensional space denote the characteristics of the points (nodes). From Figure 1(a), we can see that a link is more likely to exist between two points close to each other, which is the property of homophily. In Figure 1(b), the points form three groups associated with different colors, and the nodes in each group share similar patterns of links to nodes in other groups, but the nodes in the same group are not necessarily connected to each other. This is the property of stochastic equivalence. Note that in a graph exhibiting stochastic equivalence, two points close to each other are not necessarily connected to each other, and connected points are not necessarily close to each other, which is different from the property of homophily.

As social network analysis (SNA) is becoming more and more important in a wide range of applications, many SNA models have been proposed [Goldenberg et al., 2009]. In this paper, we focus on latent variable models [Bartholomew and Knott, 1999], which have been successfully applied to model social networks [Hoff, 2008; Nowicki and Snijders, 2001; Hoff et al., 2002; Kemp et al., 2006; Airoldi et al., 2008; Hoff, 2009].

These models include the latent class model [Nowicki and Snijders, 2001] and its extensions [Kemp et al., 2006; Airoldi et al., 2008], the latent distance model [Hoff et al., 2002], the latent eigenmodel [Hoff, 2008], and the multiplicative latent factor model (MLFM) [Hoff, 2009]. Among these models, the recently proposed latent eigenmodel, which includes both the latent class model and the latent distance model as special cases, can capture both homophily and stochastic equivalence in networks. However, it can only model undirected graphs. MLFM [Hoff, 2009] adapts the latent eigenmodel for directed graphs. However, as will be shown in our experiments, it cannot model homophily well.

In this paper, we propose a novel model, called the generalized latent factor model (GLFM), for social network analysis by enhancing homophily modeling in MLFM. The learning algorithm of GLFM is guaranteed to converge to a local optimum and has linear-time complexity. Hence, GLFM can be used to model large-scale graphs. Extensive experiments on community detection in some real-world networks show that GLFM dramatically outperforms existing methods.

2 Notation and Definition

We use boldface uppercase letters, such as K, to denote matrices, and boldface lowercase letters, such as x, to denote vectors. The ith row and the jth column of a matrix K are denoted by K_{i*} and K_{*j}, respectively. K_{ij} denotes the element at the ith row and jth column of K, and x_i denotes the ith element of x. We use tr(K) to denote the trace of K, K^T its transpose, and K^{-1} its inverse. ‖·‖ denotes the length of a vector, and |·| denotes the cardinality of a set. I denotes the identity matrix whose dimensionality depends on the context. For a matrix K, K ⪰ 0 means that K is positive semi-definite (psd) and K ≻ 0 means that K is positive definite (pd); K ⪰ M means K − M ⪰ 0. N(·) denotes the normal distribution, either for scalars or vectors. ◦ denotes the Hadamard product (element-wise product).

Let N denote the number of nodes in a graph, and let A be the adjacency (link) matrix of the N nodes. A_{ij} = 1 if there exists a link from node i to node j; otherwise, A_{ij} = 0. D denotes the number of latent factors. In real-world networks, if A_{ij} = 1, we can say that there is a relation from i to j. However, A_{ij} = 0 does not necessarily mean that there is no relation from i to j; in most cases, A_{ij} = 0 means that the relationship from i to j is missing. Hence, we use an indicator matrix Z to indicate whether or not an element is missing. More specifically, Z_{ij} = 1 means that A_{ij} is observed, while Z_{ij} = 0 means that A_{ij} is missing.

3 Multiplicative Latent Factor Models

The latent eigenmodel is formulated as follows²:

$$\Theta_{ik} = \log\operatorname{odds}(A_{ik}=1 \mid \mathbf{X}_{i*}, \mathbf{X}_{k*}, \mu) = \mu + \mathbf{X}_{i*}\boldsymbol{\Lambda}\mathbf{X}_{k*}^{T},$$

where X is an N × D matrix with X_{i*} denoting the latent representation of node i, μ is a parameter reflecting the overall density of the links in the network, and Λ is a D × D diagonal matrix whose diagonal entries can be either positive or negative. The latent eigenmodel generalizes both latent class models and latent distance models, and it can model both homophily and stochastic equivalence in undirected graphs [Hoff, 2008].

² Note that in this paper, we assume for simplicity that there is no attribute information for the links. It is straightforward to integrate attribute information into the existing latent variable models as well as our proposed model.

To adapt the latent eigenmodel for directed graphs, MLFM defines

$$\Theta_{ik} = \mu + \mathbf{X}_{i*}\boldsymbol{\Lambda}\mathbf{W}_{k*}^{T}, \qquad (1)$$

where X and W have orthonormal columns. Note that the key difference between the latent eigenmodel and MLFM lies in the fact that MLFM adopts a separate receiver factor matrix W, which enables MLFM to model directed (asymmetric) graphs. As we will show in our experiments, this modification makes MLFM fail to model homophily in networks. Letting Θ = [Θ_{ik}]_{i,k=1}^{N}, we can rewrite MLFM as

$$\boldsymbol{\Theta} = \mu\mathbf{E} + \mathbf{X}\boldsymbol{\Lambda}\mathbf{W}^{T}, \qquad (2)$$

where E is an N × N matrix with all entries being 1.

We can find that MLFM is a special case of the following model:

$$\boldsymbol{\Theta} = \mu\mathbf{E} + \mathbf{U}\mathbf{V}^{T}. \qquad (3)$$

For example, we can get MLFM by setting U = X and V = WΛ. Furthermore, it is easy to compute the X, W and Λ in (2) from the learned U and V in (3). Hence, in the sequel, MLFM refers to the model in (3).
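To make the factorization in (3) concrete, here is a minimal Python sketch (our own illustration, not code from the paper; all names are hypothetical) that maps learned factors U, V and the density parameter μ to link probabilities through the logistic function:

```python
import numpy as np

def mlfm_link_probabilities(U, V, mu):
    """Link probabilities under the MLFM log-odds of (3).

    U, V : N x D sender/receiver factor matrices.
    mu   : scalar controlling the overall link density.
    Returns an N x N matrix S with S[i, k] = p(A_ik = 1).
    """
    Theta = mu + U @ V.T                 # Theta = mu * E + U V^T
    return 1.0 / (1.0 + np.exp(-Theta))  # logistic (sigmoid) link

# Toy usage: N = 4 nodes, D = 2 latent factors.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2))
V = rng.normal(size=(4, 2))
S = mlfm_link_probabilities(U, V, mu=-1.0)
```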
4 Generalized Latent Factor Models

As discussed above, MLFM can capture stochastic equivalence but cannot model homophily well in directed graphs. Here, we propose GLFM to enhance homophily modeling in MLFM.

4.1 Model

In GLFM, Θ_{ik} is defined as follows:

$$\Theta_{ik} = \mu + \tfrac{1}{2}\mathbf{U}_{i*}\mathbf{U}_{k*}^{T} + \tfrac{1}{2}\mathbf{U}_{i*}\mathbf{V}_{k*}^{T}. \qquad (4)$$

Comparing (4) to (3), we can find that GLFM generalizes MLFM by adding an extra term U_{i*}U_{k*}^T.³ It is this extra term that enables GLFM to model homophily in networks: the term is symmetric in i and k and is largest when U_{i*} and U_{k*} are similar, so it directly rewards links between nodes with similar latent representations. This will be detailed in Section 4.2 when we analyze the objective function in (7), and it will also be demonstrated empirically in our experiments.

³ Note that the coefficient 1/2 in (4) makes no essential difference between (4) and (3); it is only for convenience of computation.

Based on (4), the likelihood of the observations can be defined as follows:

$$p(\mathbf{A} \mid \mathbf{U}, \mathbf{V}, \mu) = \prod_{i \neq k} \left[ S_{ik}^{A_{ik}} (1 - S_{ik})^{1 - A_{ik}} \right]^{Z_{ik}}, \qquad (5)$$

where

$$S_{ik} = \frac{\exp(\Theta_{ik})}{1 + \exp(\Theta_{ik})}. \qquad (6)$$

Note that, as in conventional SNA models, we ignore the diagonal elements of A. That is, in this paper, we set A_{ii} = Z_{ii} = 0 by default.

Furthermore, we put normal priors on the parameters μ, U and V:

$$p(\mu) = \mathcal{N}(\mu \mid 0, \tau^{-1}), \quad p(\mathbf{U}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{U}_{*d} \mid \mathbf{0}, \beta\mathbf{I}), \quad p(\mathbf{V}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{V}_{*d} \mid \mathbf{0}, \gamma\mathbf{I}).$$
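As a companion sketch (again hypothetical and only for illustration), the log of the likelihood (5) under the GLFM log-odds (4) can be evaluated in a numerically stable way as follows; A is the 0/1 adjacency matrix and Z the indicator matrix of Section 2:

```python
import numpy as np

def glfm_log_likelihood(A, Z, U, V, mu):
    """Log of the Bernoulli likelihood (5) over observed, off-diagonal entries."""
    Theta = mu + 0.5 * (U @ U.T) + 0.5 * (U @ V.T)  # eq. (4) in matrix form
    # log S_ik and log(1 - S_ik), computed stably via log(1 + exp(Theta))
    log_S = Theta - np.logaddexp(0.0, Theta)
    log_1mS = -np.logaddexp(0.0, Theta)
    mask = Z * (1.0 - np.eye(A.shape[0]))           # observed entries with i != k
    return np.sum(mask * (A * log_S + (1.0 - A) * log_1mS))
```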

4.2 Learning

Although the Markov chain Monte Carlo (MCMC) algorithms designed for other latent variable models can easily be adapted for GLFM, we do not adopt MCMC here because MCMC methods typically incur very high computational cost. Instead, we adopt a minorization-maximization (MM) algorithm [Lange et al., 2000] to learn the parameters. MM is a so-called expectation-maximization (EM) algorithm [Dempster et al., 1977] without missing data, alternating between constructing a concave lower bound of the objective function and maximizing that bound.
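The surviving text omits the objective function (7) and the exact surrogate bound used by the paper, so the sketch below only illustrates the general MM pattern for logistic likelihoods: each log-likelihood term is minorized by a quadratic whose curvature is bounded by 1/4 (a standard uniform-curvature bound), and the surrogate is maximized in closed form for one row U_{i*}. It is an assumption-laden illustration of the technique, not the authors' derivation; for brevity it only accounts for the outgoing links in row i (incoming links contribute an analogous term through the columns of Z and A).

```python
import numpy as np

def mm_update_row(A, Z, U, V, mu, i, beta):
    """One illustrative MM step for U[i] (a sketch, not the paper's exact update).

    Since Theta_ik = mu + U[i] @ (0.5 * (U[k] + V[k])), the row-i part of the
    log-likelihood is a logistic model with features B[k] = 0.5 * (U[k] + V[k]).
    """
    N, D = U.shape
    B = 0.5 * (U + V)
    obs = (Z[i] > 0) & (np.arange(N) != i)    # observed, off-diagonal entries
    theta = mu + B[obs] @ U[i]
    s = 1.0 / (1.0 + np.exp(-theta))          # current S_ik, as in (6)
    grad = B[obs].T @ (A[i, obs] - s) - U[i] / beta    # gradient, with prior term
    H = 0.25 * (B[obs].T @ B[obs]) + np.eye(D) / beta  # bounded curvature (<= 1/4)
    return U[i] + np.linalg.solve(H, grad)    # exact maximizer of the surrogate
```

Cycling such closed-form row updates over U, V and μ monotonically increases the lower-bounded objective, which is what gives MM its convergence guarantee.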

4.3 Convergence and Computational Complexity

With the MM algorithm, the learning procedure of GLFM is guaranteed to converge to a local maximum.

The time complexity to compute the gradient and Hessian for node i is linear in the total number of ones in Z_{*i} and Z_{i*}. In general, this number is O(1) because the observations in real networks are always very sparse. Furthermore, since the Hessian matrices H_i and G_i are D × D, the computational complexity to invert them is O(D³). Typically, D is a very small number. Hence, only O(N) time is needed to update the whole U and V.

5 Experiment

There exist many different SNA tasks, such as social position and role estimation [Wasserman and Faust, 1994], link prediction [Hoff, 2009], node classification [Li et al., 2009a; Li and Yeung, 2009; Li et al., 2009b], community detection [Yang et al., 2009a], and so on. In this paper, we adopt the same evaluation strategy as that in [Yang et al., 2009a; 2009b] for social community detection. The main reason for choosing this task is that, from our model formulation, we can clearly see the difference between GLFM and other latent factor models. However, many other models from different research communities have also been proposed for SNA, and it is difficult to figure out the connections and differences between those models and GLFM from the formulation perspective. Hence, we use empirical evaluation to compare them. Most mainstream models have been compared in [Yang et al., 2009a; 2009b] for community detection, which provides a good platform for our empirical comparison.

For MLFM and GLFM, we adopt k-means to perform clustering based on the normalized latent representation U, where normalization means that the latent representation of each node is divided by its length. Because the magnitude of U_{i*} reflects the activity of node i, we select the most active node as the first seed of k-means, and then choose a point as the seed of the next community if the summation of the distances between this point and all the existing seeds is the largest; hence, the initialization of k-means is fixed, as sketched below. We set Z = A. For fair comparison, the hyper-parameters in GLFM and all the baselines, such as the τ in (7), are chosen from a wide range and the best results are reported. More specifically, for GLFM, τ is fixed to 10⁶, β and γ are set to 2, and D = 20.
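A minimal sketch of this deterministic initialization (our own code, with hypothetical names) is given below; the returned seed matrix can then be fed to any standard k-means implementation, for example scikit-learn's KMeans(n_clusters=k, init=seeds, n_init=1):

```python
import numpy as np

def deterministic_kmeans_seeds(U, k):
    """Normalize U row-wise and pick k fixed seeds as described above."""
    norms = np.linalg.norm(U, axis=1)
    X = U / np.maximum(norms, 1e-12)[:, None]  # unit-length representations
    seeds = [int(np.argmax(norms))]            # most active node goes first
    while len(seeds) < k:
        # Next seed: the point with the largest summed distance to all seeds.
        dist_sum = np.zeros(len(X))
        for s in seeds:
            dist_sum += np.linalg.norm(X - X[s], axis=1)
        dist_sum[seeds] = -np.inf              # never re-pick an existing seed
        seeds.append(int(np.argmax(dist_sum)))
    return X, X[seeds]
```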
5.1 Data Sets and Evaluation Metrics

As in [Yang et al., 2009a], we use two paper citation networks, the Cora and Citeseer data sets⁴, for evaluation. Both data sets contain content information in addition to the directed links.

⁴ The two data sets can be downloaded from http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html.

The Cora data set contains 2708 research papers from 7 subfields of machine learning: case-based reasoning, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. Each paper is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary of 1433 unique words. There are overall 5429 citations (links) between the papers.

The Citeseer data set contains 3312 papers which can be classified into 6 categories. Each paper is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary of 3703 unique words. There are overall 4732 citations (links) between the papers. After deleting the self-links, we obtain 4715 links for our evaluation.

As in [Yang et al., 2009a], we use Normalized Mutual Information (NMI), Pairwise F-Measure (PWF) and Modularity (Modu) as metrics to measure the clustering accuracy of our model. For all the algorithms, we set the number of communities to the ground-truth number of class labels in the data.

5.2 Baselines

We compare GLFM with the closely related method MLFM [Hoff, 2009]. The U and V in both MLFM and GLFM are initialized by principal component analysis (PCA) on the content information. In addition, we also adopt the methods introduced in [Yang et al., 2009a] and [Yang et al., 2009b] for comparison. Those methods can be divided into three groups: link-based methods, content-based methods, and link+content based methods.

The link-based methods include: PHITS [Cohn and Chang, 2000]; LDA-Link [Erosheva et al., 2004], an extension of latent Dirichlet allocation (LDA) for link analysis; the popularity-based conditional link model (PCL) [Yang et al., 2009b]; and normalized cut (NCUT) for spectral clustering [Shi and Malik, 2000].

The content-based methods include: probabilistic latent semantic analysis (PLSA) [Hofmann, 1999]; LDA-Word; and NCUT with the Gaussian RBF kernel and with the probability product (PP) kernel [Jebara et al., 2004], respectively.

The link+content based methods include: PHITS-PLSA [Cohn and Hofmann, 2000], LDA-Link-Word [Erosheva et al., 2004], Link-Content-Factorization (LCF) [Zhu et al., 2007], NCUT, PCL-PLSA, PHITS-DC, PCL-DC and C-PLDC [Yang et al., 2009a]. Here PCL-PLSA represents the combination of PCL and PLSA, PHITS-DC represents the PHITS model combined with the discriminative content (DC) model in [Yang et al., 2009a], PCL-DC represents the PCL model combined with DC, and C-PLDC refers to the combined popularity-driven link model and DC model [Yang et al., 2009a]. Moreover, the setting for t in C-PLDC follows that in [Yang et al., 2009a]; in particular, C-PLDC(t=1) denotes a special case of C-PLDC without popularity modeling [Yang et al., 2009a].

5.3 Illustration

We sample a subset of Cora containing two classes for illustration. The learned latent representations U for the data instances are illustrated in Figure 2, where blue circles and red crosses denote the data instances from the two different classes, respectively, and the (directed) black edges are the citation relationships between the data points. In Figure 2, (a) and (c) show the originally learned latent factors of MLFM and GLFM, respectively, and (b) and (d) show the corresponding normalized latent factors of MLFM and GLFM.

Figure 2: Illustration of homophily and stochastic equivalence modeling in networks: (a) MLFM; (b) MLFM (normalized); (c) GLFM; (d) GLFM (normalized).

Figure 3: Convergence speed of GLFM on (a) Cora and (b) Citeseer.

Figure 4: Sensitivity to the parameter D of GLFM on (a) Cora and (b) Citeseer.

Here normalization means that we divide the latent factor of each node by its length; hence, all the points in (b) and (d) have unit length. Note that, for fair comparison, all the subfigures from (a) to (d) are generated automatically by our program with the same parameter settings and initial values.

In (a) and (b) of Figure 2, two instances are more likely to be close if they are connected by or connect to the same instance, which is just the feature of stochastic equivalence. However, there exist many links across the inner part of the circle in (b), which means that two instances linked with each other are not necessarily close in the latent space. This violates the feature of homophily. Hence, we can conclude that MLFM cannot effectively model homophily in networks. In (c) and (d) of Figure 2, homophily is obvious, since two nodes are in general close to each other if there exists a link between them.

5.4 Convergence Speed

With D = 20, the objective function values of GLFM against the iteration number T are plotted in Figure 3, from which we can see that our learning procedure with the MM method converges very fast. We set the maximum number of iterations to T = 5 in all the following experiments.

5.5 Accuracy

We compare GLFM with all the baselines introduced in Section 5.2 in terms of NMI, PWF and Modu [Yang et al., 2009a; 2009b]. The results are reported in Table 1, from which we can see that GLFM achieves the best performance on both data sets under all three criteria. For the Citeseer data set in particular, GLFM dramatically outperforms the second best model. According to prior knowledge, paper citation networks are more likely to exhibit homophily because citations often exist among papers from the same community. This can explain why GLFM achieves such good performance on these data sets. Hence, GLFM provides a way to model networks which cannot be modeled well by MLFM.

Figure 4 shows the performance of GLFM when D takes different values. We can see that GLFM is not sensitive to D as long as D is not too small.

6 Conclusion

In this paper, a generalized latent factor model is proposed to model homophily in social networks. A linear-time learning algorithm with convergence guarantee is devised to learn the parameters. Experimental results on community detection in real-world networks show that our model can effectively model homophily and outperforms state-of-the-art methods.

Acknowledgments

Li is supported by a grant from the "project of Arts & Science" of Shanghai Jiao Tong University (No. 10JCY09). Yeung is supported by General Research Fund 622209 from the Research Grants Council of Hong Kong. Zhang is supported by the NSFC (No. 61070239), the 973 Program of China (No. 2010CB327903), the Doctoral Program of Specialized Research Fund of Chinese Universities (No. 20090101120066), and the Fundamental Research Funds for the Central Universities.

Table 1: Community detection performance on the Cora and Citeseer data sets (the best performance is shown in bold face).

                                             Cora                        Citeseer
Group          Algorithm                 NMI     PWF     Modu        NMI     PWF     Modu
Link           PHITS                     0.0570  0.1894  0.3929      0.0101  0.1773  0.4588
               LDA-Link                  0.0762  0.2278  0.2189      0.0356  0.2363  0.2211
               PCL                       0.0884  0.2055  0.5903      0.0315  0.1927  0.6436
               NCUT                      0.1715  0.2864  0.2701      0.1833  0.3252  0.6577
Content        PLSA                      0.2107  0.2864  0.2682      0.0965  0.2298  0.2885
               LDA-Word                  0.2310  0.2774  0.2970      0.1342  0.2880  0.3022
               NCUT (RBF kernel)         0.1317  0.2457  0.1839      0.0976  0.2386  0.2133
               NCUT (PP kernel)          0.1804  0.2912  0.2487      0.1986  0.3282  0.4802
Link+Content   PHITS-PLSA                0.3140  0.3526  0.3956      0.1188  0.2596  0.3863
               LDA-Link-Word             0.3587  0.3969  0.4576      0.1920  0.3045  0.5058
               LCF                       0.1227  0.2456  0.1664      0.0934  0.2361  0.2011
               NCUT (RBF kernel)         0.2444  0.3062  0.3703      0.1592  0.2957  0.4280
               NCUT (PP kernel)          0.3866  0.4214  0.5158      0.1986  0.3282  0.4802
               PCL-PLSA                  0.3900  0.4233  0.5503      0.2207  0.3334  0.5505
               PHITS-DC                  0.4359  0.4526  0.6384      0.2062  0.3295  0.6117
               PCL-DC                    0.5123  0.5450  0.6976      0.2921  0.3876  0.6857
               C-PLDC(t=1)               0.4294  0.4264  0.5877      0.2303  0.3340  0.5530
               C-PLDC                    0.4887  0.4638  0.6160      0.2756  0.3611  0.5582
               MLFM                      0.3640  0.3874  0.2325      0.2558  0.3356  0.0089
               GLFM                      0.5229  0.5545  0.7234      0.3951  0.5053  0.7563

References

[Airoldi et al., 2008] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

[Bartholomew and Knott, 1999] David J. Bartholomew and Martin Knott. Latent Variable Models and Factor Analysis. Kendall's Library of Statistics, 7, second edition, 1999.

[Cohn and Chang, 2000] David Cohn and Huan Chang. Learning to probabilistically identify authoritative documents. In ICML, 2000.

[Cohn and Hofmann, 2000] David A. Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS, 2000.

[Dempster et al., 1977] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[Erosheva et al., 2004] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed membership models of scientific publications. Proceedings of the National Academy of Sciences, 101:5220–5227, 2004.

[Goldenberg et al., 2009] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2:129–233, 2009.

[Hoff et al., 2002] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

[Hoff, 2008] Peter D. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. In NIPS, 2008.

[Hoff, 2009] Peter D. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 15:261–272, 2009.

[Hofmann, 1999] Thomas Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999.

[Jebara et al., 2004] Tony Jebara, Risi Imre Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.

[Kemp et al., 2006] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.

[Lange et al., 2000] Kenneth Lange, David R. Hunter, and Ilsoon Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9, 2000.

[Li and Yeung, 2009] Wu-Jun Li and Dit-Yan Yeung. Relation regularized matrix factorization. In IJCAI, 2009.

[Li et al., 2009a] Wu-Jun Li, Dit-Yan Yeung, and Zhihua Zhang. Probabilistic relational PCA. In NIPS, 2009.

[Li et al., 2009b] Wu-Jun Li, Zhihua Zhang, and Dit-Yan Yeung. Latent Wishart processes for relational kernel learning. In AISTATS, 2009.

[Nowicki and Snijders, 2001] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077–1087, 2001.

[Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.

[Wasserman and Faust, 1994] Stanley Wasserman and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[Yang et al., 2009a] Tianbao Yang, Rong Jin, Yun Chi, and Shenghuo Zhu. A Bayesian framework for community detection integrating content and link. In UAI, 2009.

[Yang et al., 2009b] Tianbao Yang, Rong Jin, Yun Chi, and Shenghuo Zhu. Combining link and content for community detection: a discriminative approach. In KDD, 2009.

[Zhu et al., 2007] Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. Combining content and link for classification using matrix factorization. In SIGIR, 2007.
