
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Modeling the Effect between Links and Communities for Overlapping Community Detection

Hongyi Zhang, Tong Zhao, Irwin King, Michael R. Lyu
Shenzhen Key Laboratory of Rich Media Big Data Analytics and Applications,
Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
Department of Computer Science and Engineering,
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
{hyzhang, tzhao, king, lyu}@cse.cuhk.edu.hk

Abstract

Overlapping community detection has drawn much attention recently since it allows nodes in a network to have multiple community memberships. A standard framework to deal with overlapping community detection is Matrix Factorization (MF). Although all existing MF-based approaches use links as input to identify communities, the relationship between links and communities is still under-investigated. Most of the approaches only view links as consequences of communities (community-to-link) but fail to explore how nodes' community memberships can be represented by their linked neighbors (link-to-community). In this paper, we propose a Homophily-based Non-negative Matrix Factorization (HNMF) to model both-sided relationships between links and communities. From the community-to-link perspective, we apply a preference-based pairwise function by assuming that nodes with common communities have a higher probability to build links than those without common communities. From the link-to-community perspective, we propose a new community representation learning with network embedding by assuming that linked nodes have similar community representations. We conduct experiments on several real-world networks and the results show that our HNMF model is able to find communities with better quality compared with state-of-the-art baselines.

1 Introduction

A network is an abstraction representing relationships among real-world objects. A typical pattern of a network is that there are groups of nodes closely connected within the group but rarely making connections with nodes outside the group. Such groups are defined as communities [Girvan and Newman, 2002]. The task of finding such communities in complex networks is referred to as community detection, an important research topic in web mining for more than a decade. Usually, the more complex a network is, the more challenging it is to identify such communities, mainly due to the infeasibility of visualization and the variety of community structures. Classic graph-partition-based community detection approaches assume that a node belongs to one and only one community, which contradicts the fact that a node often holds multiple memberships. To relax this unrealistic constraint, several new algorithms for overlapping community detection have been proposed in recent years.

A majority of existing methods for overlapping community detection are based on Matrix Factorization (MF) [Psorakis et al., 2011; Wang et al., 2011; Zhang and Yeung, 2012; Zhang et al., 2015], which has been a standard technique in other areas such as recommender systems and natural language processing. The basic idea of MF here is to use low-dimensional latent vectors to represent nodes' features in networks. MF naturally fits overlapping community detection since the dimensions of the factorized latent vectors of nodes can be interpreted as their community memberships and hence are no longer latent. MF-based overlapping community detection can be summarized into three steps: (1) assign the number of communities, (2) compute the node-community weight matrix F through a learning objective, and (3) obtain the final community set according to F. Here the most important part is the selection of the learning objective. The simplest way is to recover the adjacency matrix A of the original network by F with minimum error, i.e., to minimize ||A − FF^T|| [Psorakis et al., 2011; Wang et al., 2011]. However, an entry in A is a label (either 0 or 1) whereas an entry in F is a real value, and this mismatch between labels and real-valued entries is hard to justify. To fix it, recent approaches such as [Yang and Leskovec, 2013; Zhang et al., 2015] adopt generative objectives, which are based on the intuition that a node is more likely to build a link with another node inside its community than outside.

When we look into this intuition, it implicitly reveals that links are treated as the consequence of communities (community-to-link), i.e., if two nodes share common communities, they have a higher probability to be linked. However, the reverse perspective (link-to-community) is largely ignored, i.e., whether a node's community membership can be represented by its neighbors' community memberships. Taking MF as an example, the link-to-community perspective can be interpreted as learning a node's community representation via the community representations of its neighbors. Here we use the word homophily, the tendency of an individual node to associate with similar others [Tang et al., 2013], to recapitulate both perspectives.
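To make the squared-loss baseline mentioned above concrete, the following is a minimal illustrative sketch (ours, not the authors' code; all names and hyperparameters are placeholders) that fits a non-negative F so that FF^T approximates A via projected gradient descent. Thresholding the rows of the returned F then yields each node's overlapping memberships, i.e., step (3) above.

```python
import numpy as np

def squared_loss_mf(A, k, lr=0.01, iters=500, seed=0):
    """Illustrative baseline: fit a non-negative F (|V| x k) so that F F^T
    approximates the adjacency matrix A by minimizing ||A - F F^T||_F^2
    with projected gradient descent."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    F = rng.random((n, k)) * 0.1
    for _ in range(iters):
        residual = A - F @ F.T              # reconstruction error
        grad = -4.0 * residual @ F          # gradient of the squared loss w.r.t. F (A symmetric)
        F = np.maximum(F - lr * grad, 0.0)  # gradient step + non-negativity projection
    return F
```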

In this paper, we propose a Homophily-based Non-negative MF (HNMF) model to explicitly capture the homophily effect from both perspectives. From the community-to-link perspective, we apply the pairwise objective function of the Preference-based Non-negative MF (PNMF) model [Zhang et al., 2015]. From the link-to-community perspective, we develop a novel generative objective function based on unsupervised representation learning and network embedding. We combine both objective functions into a joint learning objective, in which parameter learning can be easily parallelized using asynchronous stochastic gradient descent. Through experiments on various real-world datasets, we demonstrate that our model can identify communities with better quality compared with state-of-the-art baselines and can be applied to large datasets.

Contributions. We summarize the main contributions of this paper as follows:
1. Our work is the first to explore the link-to-community side of the homophily effect between links and communities in overlapping community detection. We justify it via observations on real-world datasets with ground-truth communities.
2. We propose a new learning objective to model both perspectives of homophily within the non-negative MF framework. Experiments show that our HNMF model can detect overlapping communities with better quality.

2 Problem Definition and Data Observation

In this section, we first provide several definitions about community and community detection. Then we conduct a data observation on two large real-world networks to strengthen our motivation.

2.1 Problem Definition

Suppose we have a network G(V, E), where V and E are the node and edge sets respectively. A community in G is usually considered as a group of densely connected nodes with a common feature, e.g., students from a university, employees from a company, etc. The task of community detection can be defined as follows.

Definition 2.1 (Community Detection). Community detection is a process that takes a network G as input and produces a set of communities S as output to maximize a particular objective function f, i.e.,
$$\arg\max_{S} f(G, S), \qquad (1)$$
where S = {C_i | C_i ≠ ∅, C_i ≠ C_j, 1 ≤ i, j ≤ |S|}.

While classic community detection requires that C_i ∩ C_j = ∅ if i ≠ j, overlapping community detection does not set any such constraint on S. This relaxation matches the nature of real-world networks better but brings big challenges since classic graph-partition-based algorithms are no longer feasible. Thus, new approaches have been proposed to tackle this problem in recent years. In this paper, we mainly focus on matrix-factorization-based approaches for overlapping community detection.

Definition 2.2 (Overlapping Community Detection via MF). Overlapping community detection via MF is a process that takes the adjacency matrix A ∈ {0,1}^{|V|×|V|} of a network G as input and produces a node-community weight matrix F ∈ R^{|V|×|S|}, whose entry F_{u,c} represents the weight of node u ∈ V in community c ∈ S, by minimizing a particular loss function l, i.e.,
$$\arg\min_{F} l(A, FF^T), \qquad (2)$$
where S is the set of communities. In the end, we obtain S according to F.

As we mentioned, the simplest l is of the form ||A − FF^T||. The main target of this paper is to seek a better l that can capture the nature of communities more precisely.

2.2 Data Observation

In order to validate the link-to-community perspective, we examine two large network datasets with ground-truth communities¹ [Yang and Leskovec, 2012] to see whether linked node pairs have more similar community memberships than non-linked ones. These two datasets are:

• Amazon: a product co-purchasing network based on the "Customers Who Bought This Item Also Bought" feature of the Amazon website.
• DBLP: a collaboration network of research paper authors in computer science.

Basic statistics can be found in Table 1.

Table 1: Data statistics. |V|: number of nodes, |E|: number of links, |S|: number of ground-truth communities, M: average number of nodes per community, A: average number of community memberships per node.

Dataset   |V|     |E|     |S|     M       A
Amazon    335k    926k    49k     100.0   14.83
DBLP      317k    1.0M    2.5k    429.8   2.57

We adopt the average number of shared communities (SC) and the average Jaccard similarity of community memberships (JS) over all linked node pairs as our measurements. They are calculated by
$$SC = \frac{1}{2|E|} \sum_{i \in V} \sum_{j \in N^+(i)} |C_i \cap C_j|, \qquad (3)$$
and
$$JS = \frac{1}{2|E|} \sum_{i \in V} \sum_{j \in N^+(i)} \frac{|C_i \cap C_j|}{|C_i \cup C_j|}, \qquad (4)$$
respectively, where N+(i) is the set of i's neighbors and C_i is the set of communities containing i.

¹http://snap.stanford.edu/data/
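As a reference for Eqs. (3) and (4), here is a small sketch (ours; it assumes communities are given as Python sets of node ids) that computes SC and JS. Averaging over each undirected edge once is equivalent to the 1/(2|E|) normalization above, which counts every edge in both directions.

```python
def shared_community_stats(edges, memberships):
    """Average number of shared communities (SC, Eq. (3)) and average Jaccard
    similarity of community memberships (JS, Eq. (4)) over linked node pairs.
    edges        -- iterable of undirected (i, j) pairs, each edge listed once
    memberships  -- dict mapping node id -> set of community ids (C_i)"""
    edges = list(edges)
    sc = js = 0.0
    for i, j in edges:
        common = memberships[i] & memberships[j]
        union = memberships[i] | memberships[j]
        sc += len(common)
        js += len(common) / len(union) if union else 0.0
    return sc / len(edges), js / len(edges)

# Example: two linked nodes sharing one of their communities.
print(shared_community_stats([(0, 1)], {0: {"a", "b"}, 1: {"a"}}))  # (1.0, 0.5)
```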

Table 2: Data observations. SC: average number of shared communities per linked node pair, SC_r: average number of shared communities per random node pair, JS: average Jaccard similarity of community memberships per linked node pair, JS_r: average Jaccard similarity of community memberships per random node pair.

Dataset   SC      SC_r    JS      JS_r
Amazon    6.767   0.178   0.490   0.010
DBLP      2.078   0.009   0.347   0.002

Table 3: A summary of notations.

Notation                 Meaning
G(V, E)                  graph G (node set V, edge set E)
A ∈ {0,1}^{|V|×|V|}      adjacency matrix of G
S                        the set of detected communities
C_u                      the set of communities containing u
F ∈ R^{|V|×|S|}          node-community weight matrix
F_u                      u's community representation
N+(u)                    node set of u's neighbors
N−(u)                    node set of u's non-neighbors

[Figure 1: The number of linked node pairs sharing a particular number of communities. (a) Amazon; (b) DBLP. Axes: number of common communities versus number of linked node pairs.]

We also draw ten thousand random node pairs, which are not necessarily linked, and compute the same measurements for these pairs. The comparison results are shown in Table 2. The huge gap between linked pairs and random pairs reveals that two linked nodes share many more communities than two random nodes on average, which strongly supports the necessity of the link-to-community perspective.

Moreover, we count the number of linked node pairs that share a particular number of communities in Figure 1. In both networks, the number of linked node pairs reaches its peak near the average number of shared communities for linked node pairs and starts to decrease as this number continues to increase. This observation shows that the average number of shared communities can be used to measure how strong the link-to-community side of the homophily effect is. For example, we can claim that the link-to-community side of the homophily effect in Amazon is much stronger than that in DBLP.

3 Related Work

A lot of effort has been devoted to research on community detection. An extensive survey can be found in [Fortunato, 2010]. As we have mentioned, overlapping community detection draws attention because classic community detection disallows multiple memberships. A recent survey of overlapping community detection algorithms can be found in [Xie et al., 2013]. According to the key idea, we classify those algorithms into local approaches and global approaches.

Local approaches explore a network from small components to the whole. For example, [Palla et al., 2005] and [Kumpula et al., 2008] search for all the k-cliques and combine those sharing k−1 nodes until no further combinations can be made. [Coscia et al., 2012] applies a label propagation algorithm to detect small communities on the ego network of each node, i.e., a node with its neighbors, and lastly merges communities with large overlap. [Whang et al., 2013] picks several seed nodes and conducts PageRank on each of them to find communities.

Global approaches, on the other side, assume that communities exist in the first place and aim to find the most suitable membership for each node. Some major models include game theory models [Chen et al., 2010], stochastic block models [Airoldi et al., 2009; Jin et al., 2015], and matrix factorization (MF) models. Since our model is MF-based, we introduce MF models in more detail. [Psorakis et al., 2011] and [Wang et al., 2011] are the earliest works applying MF to overlapping community detection. While the former factorizes the adjacency matrix into two different matrices, the latter forces them to be the same and thus gives physical meaning to the factorized matrix. However, they both use the simplest squared loss as their learning objective. [Zhang and Yeung, 2012] adds a community interaction matrix, yielding a non-negative matrix tri-factorization model. [Yang and Leskovec, 2013] is among the first to exploit a generative learning objective, which maximizes the likelihood of generating all the links in the original graph.

Our community-to-link perspective applies the pairwise learning objective proposed in the Preference-based Non-negative Matrix Factorization model [Zhang et al., 2015]. The intuition is that two nodes are more likely to become friends if they share more common communities. Our link-to-community perspective borrows the idea from the Skip-Gram model [Mikolov et al., 2013], which was originally designed for natural language processing tasks. The training objective of this model is to find word representations that can predict the surrounding words in a sentence or a document. Perozzi et al. [Perozzi et al., 2014] are the first to extend Skip-Gram to represent nodes in a social network, where the random walks passing through a node are regarded as the context of this node.

4 A Homophily-based Non-negative Matrix Factorization (HNMF) Model

In this section, we first introduce our model assumptions. Then we formalize our HNMF model from both perspectives and combine them into a unified model. In the end, we present our parameter learning algorithm and discuss some further issues. All notations are listed in Table 3.

4.1 Model Assumption

Since we model homophily from both the community-to-link and the link-to-community perspectives, our model assumptions are introduced in two separate parts as well.

For the community-to-link perspective, the basic assumption is that two nodes have a higher probability to build a link with each other if they share more communities, i.e.,
$$P(A_{u,i} = 1) > P(A_{u,j} = 1), \quad \text{if } |C_u \cap C_i| > |C_u \cap C_j|. \qquad (5)$$
Since we apply the PNMF model [Zhang et al., 2015] in this part, we also need to adopt the preference assumption, i.e.,
$$r_{u,i} > r_{u,j}, \quad \text{if } i \in N^+(u) \text{ and } j \in N^-(u), \qquad (6)$$
where r_{u,i} is the preference of node u on node i. It indicates that a node prefers to build links with neighbors over non-neighbors.

For the link-to-community perspective, we assume that two linked nodes are more similar than two non-linked nodes. This is formally denoted as
$$sim_{u,i} > sim_{u,j}, \quad \text{if } i \in N^+(u) \text{ and } j \in N^-(u), \qquad (7)$$
where sim_{u,i} is the similarity between node u and node i.

4.2 Modeling the Community-to-link Perspective

We derive our learning objective for the community-to-link perspective by following the formulation of the PNMF model [Zhang et al., 2015]. For each node u, the objective of PNMF is to maximize the likelihood of a pairwise preference order, denoted as P(>_u). According to the preference assumption, log P(>_u) can be represented as
$$\sum_{i \in N^+(u)} \sum_{j \in N^-(u)} \log P(i >_u j). \qquad (8)$$
Following the core idea of the community-to-link assumption, we use the community representations of nodes u, i, and j to model P(i >_u j). It can be written as
$$P(i >_u j) = \sigma\!\left(F_u^T (F_i - F_j)\right), \qquad (9)$$
where σ(·) is the sigmoid function σ(x) := 1/(1 + e^{−x}). We choose the sigmoid function because it is differentiable and maps any real number into the range between 0 and 1.

Based on Eq. (8) and Eq. (9), the learning objective of the community-to-link perspective is derived by summing up the log-likelihoods of all nodes, i.e.,
$$\mathcal{L}_C(F) := \sum_{u} \sum_{i \in N^+(u)} \sum_{j \in N^-(u)} \log P(i >_u j) = \sum_{u} \sum_{i \in N^+(u)} \sum_{j \in N^-(u)} \log \sigma\!\left(F_u^T (F_i - F_j)\right). \qquad (10)$$
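To make the pairwise objective concrete, below is a minimal NumPy sketch (ours, not the authors' code; function and variable names are illustrative) that evaluates P(i >_u j) from Eq. (9) and the full community-to-link log-likelihood of Eq. (10) given an adjacency list.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable sigmoid.
    return np.where(x >= 0, 1.0 / (1.0 + np.exp(-x)), np.exp(x) / (1.0 + np.exp(x)))

def preference_prob(F, u, i, j):
    """P(i >_u j) = sigma(F_u^T (F_i - F_j)) as in Eq. (9).
    F is the |V| x |S| non-negative node-community matrix; u, i, j are node ids,
    with i a neighbor of u and j a non-neighbor."""
    return sigmoid(F[u] @ (F[i] - F[j]))

def community_to_link_loglik(F, adjacency_lists):
    """Sum of log P(i >_u j) over all (u, i, j) triples (Eq. (10)).
    adjacency_lists maps each node id to the set of its neighbors."""
    nodes = set(adjacency_lists)
    total = 0.0
    for u, neighbors in adjacency_lists.items():
        non_neighbors = nodes - neighbors - {u}
        for i in neighbors:
            for j in non_neighbors:
                total += np.log(preference_prob(F, u, i, j))
    return total
```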
4.3 Modeling the Link-to-community Perspective

Motivated by the success of the Skip-Gram model [Mikolov et al., 2013], in which word representations are learned from the representations of surrounding words in the same context, we adopt a similar idea to learn a node's community representation from the other nodes in its local scope. In our case, for a node u, its local scope is constrained to u's neighbors. According to our link-to-community assumption, u's neighbors should have community representations similar to u's. Formally, given a node u and its neighbors, the learning objective for the link-to-community perspective is to maximize the sum of log-likelihoods of a node representing its neighbors,
$$\sum_{i \in N^+(u)} \log P(i \mid u). \qquad (11)$$
Following the formulation of Skip-Gram, we apply a softmax function to define P(i | u) as
$$P(i \mid u) = \frac{\exp({F'_i}^T F_u)}{\sum_{i'=1}^{|V|} \exp({F'_{i'}}^T F_u)}. \qquad (12)$$
Note that F' needs to be introduced into our model and should be regarded as the latent community representation matrix corresponding to the "output" vector representations; likewise, our learning target F corresponds to the "input" vector representations.

A computationally efficient approximation of the full softmax is Negative Sampling (NEG), a simplified version of Noise Contrastive Estimation (NCE) [Gutmann and Hyvärinen, 2010]. It substitutes Eq. (12) with
$$P(i \mid u) = \sigma({F'_i}^T F_u) + h \, \mathbb{E}_{i' \sim P_{N^-(u)}}\!\left[\sigma(-{F'_{i'}}^T F_u)\right], \qquad (13)$$
where σ(·) is again the sigmoid function, h is the number of negative samples, and P_{N^-(u)} is the unigram distribution raised to the power 3/4.

Thus, we obtain the learning objective of the link-to-community perspective as
$$\mathcal{L}_L(F, F') := \sum_{u} \sum_{i \in N^+(u)} \left[ \log \sigma({F'_i}^T F_u) + h \, \mathbb{E}_{i' \sim P_{N^-(u)}}\!\left[\log \sigma(-{F'_{i'}}^T F_u)\right] \right]. \qquad (14)$$

4.4 The Unified Model

Now we can combine the two perspectives, i.e., Eq. (10) and Eq. (14), into one unified model. The final learning objective of our HNMF model is to maximize
$$\begin{aligned}
\mathcal{L}_U(F, F') :={}& \mathcal{L}_C(F) + \lambda\,\mathcal{L}_L(F, F') - \mathcal{R}(F) \\
={}& \sum_{u} \sum_{i \in N^+(u)} \sum_{j \in N^-(u)} \log \sigma\!\left(F_u^T (F_i - F_j)\right) \\
&+ \lambda \sum_{u} \sum_{i \in N^+(u)} \left[ \log \sigma({F'_i}^T F_u) + h \, \mathbb{E}_{i' \sim P_{N^-(u)}}\!\left[\log \sigma(-{F'_{i'}}^T F_u)\right] \right] \\
&- \beta\,\|F\|_F, \qquad (15)
\end{aligned}$$
where R(F) is a regularization term, for which we employ the Frobenius norm of F, λ is the homophily coefficient used to adjust the importance of one perspective relative to the other, and β is the regularization coefficient.
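The following sketch (ours; the names are illustrative and the expectation in Eq. (14) is replaced by a sum over h sampled non-neighbors, as is standard for NEG) shows the link-to-community contribution for a single (u, i) pair and the 3/4-power noise distribution.

```python
import numpy as np

def noise_distribution(degrees):
    """Unigram (degree) distribution raised to the 3/4 power, as in Skip-Gram NEG."""
    p = np.asarray(degrees, dtype=float) ** 0.75
    return p / p.sum()

def neg_sampling_term(F, F_out, u, i, negatives):
    """Link-to-community contribution for one (u, i) pair under NEG (cf. Eq. (14)):
    log sigma(F'_i^T F_u) + sum over sampled non-neighbors i' of log sigma(-F'_{i'}^T F_u).
    F         -- "input" node-community matrix (the learning target)
    F_out     -- "output" latent representation matrix F'
    negatives -- h node ids drawn from the noise distribution P_{N^-(u)}"""
    def log_sigmoid(x):
        # log sigma(x) computed stably as -log(1 + exp(-x)).
        return -np.logaddexp(0.0, -x)

    positive = log_sigmoid(F_out[i] @ F[u])
    negative = sum(log_sigmoid(-(F_out[n] @ F[u])) for n in negatives)
    return positive + negative
```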

4.5 Parameter Learning

Considering time efficiency and the non-negativity constraint, we use projected stochastic gradient descent [Lee and Seung, 2001; Lin, 2007] as our parameter learning method. It updates the corresponding parameters whenever a single sample or a small batch of samples arrives and maps the parameters back to the nearest point in the projected space, in our case the non-negative space. The update rule for a parameter Θ is
$$\Theta_{t+1} = \max\left\{\Theta_t + \alpha \frac{\partial \mathcal{L}_U}{\partial \Theta},\; 0\right\}, \qquad (16)$$
where α is the learning rate.

The whole process of our learning method is shown in Algorithm 1. Here we discuss some of the steps in more detail.

Algorithm 1: Overlapping community detection using HNMF
  Input: A, the adjacency matrix of the original graph.
  Output: F, the node-community weight matrix.
  Initialization:
    Initialize F (uniformly at random);
    for each node u do
      Construct N+(u);
    end
  Training:
    Compute initial loss;
    repeat
      for each node u do
        Uniformly sample node i from N+(u);
        Community-to-link:
          Uniformly sample node j from N−(u);
          F_u = F_u + α ∂L_C/∂F_u;
          F_i = F_i + α ∂L_C/∂F_i;
          F_j = F_j + α ∂L_C/∂F_j;
        Link-to-community:
          Sample h negative nodes i' ~ P_{N−(u)};
          F_u = F_u + α ∂L_L/∂F_u;
          F'_i = F'_i + α ∂L_L/∂F'_i;
          for each sampled node i' do
            F'_{i'} = F'_{i'} + α ∂L_L/∂F'_{i'};
          end
        Regularization and Projection:
          F_u = max{F_u − α ∂R/∂F_u, 0};
          F_i = max{F_i − α ∂R/∂F_i, 0};
          F_j = max{F_j − α ∂R/∂F_j, 0};
      end
      Compute loss;
    until convergence or max_iter is reached;

• Initialization. We initialize each entry of F to be a random real value between 0 and 1 divided by the number of communities, i.e., the number of columns of F.
• Negative sampling. For the negative sample j from N−(u), we keep sampling j from V until j ∉ N+(u).
• Convergence criterion. We randomly sample a number of triples (u, i, j) and use them to compute the initial loss according to Eq. (15) without the regularization term. After each iteration, we repeat the same process with a different set of samples and stop when the difference between the current loss and the previous loss is less than a very small fraction, say ε, of the initial loss.
• Setting the number of communities. We adopt a cross-validation paradigm by reserving 10% of nodes as a validation set. Since the computational cost on the validation set is still large, sampling is used there as well.
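A compact sketch of one community-to-link step of Algorithm 1 is given below (ours, not the authors' code). It computes the gradients jointly, applies the projected update of Eq. (16), and, for simplicity, uses a squared-Frobenius regularization gradient (β times each row) rather than the exact gradient of ||F||_F; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def projected_step(theta, grad, lr):
    """Projected SGD update of Eq. (16): ascend the objective, then project
    back onto the non-negative orthant."""
    return np.maximum(theta + lr * grad, 0.0)

def community_to_link_update(F, u, i, j, lr, beta):
    """One community-to-link step of Algorithm 1 for a sampled triple (u, i, j).
    Gradients of log sigma(F_u^T (F_i - F_j)) are computed jointly, a simplified
    regularization gradient is subtracted, and each touched row is projected
    back to non-negative values."""
    coeff = 1.0 - sigmoid(F[u] @ (F[i] - F[j]))   # d/ds log sigma(s) = 1 - sigma(s)
    grad_u = coeff * (F[i] - F[j]) - beta * F[u]
    grad_i = coeff * F[u] - beta * F[i]
    grad_j = -coeff * F[u] - beta * F[j]
    F[u] = projected_step(F[u], grad_u, lr)
    F[i] = projected_step(F[i], grad_i, lr)
    F[j] = projected_step(F[j], grad_j, lr)
    return F
```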

4.6 Other Issues

Scalability. To scale our HNMF model up to large networks, we employ an asynchronous version of stochastic gradient descent to update the parameters. Since most updates only modify a small fraction of the parameters, the chance that a parameter is simultaneously updated by more than one worker is very small. Thus, a lock-free approach [Recht et al., 2011] can be adopted to parallelize our parameter learning process. We show in the experiments that the convergence speed is satisfactory.

Community membership threshold. After we obtain the node-community weight matrix F, we still need to determine the community memberships of each node. A standard solution is to set a threshold and discard all the nodes whose weights are below it. We employ the approach in [Zhang et al., 2015] and omit the details here due to the space limit.

5 Experiments

In this section, we compare our HNMF model with six baselines on various real-world datasets, including large networks with ground-truth communities. We measure the quality of communities with two metrics, modularity and F1 score. Our experimental procedures and results are described as follows.

5.1 Data Description

Apart from the two large networks with ground-truth communities introduced in Section 2, we also use six benchmark networks collected by Newman² as our datasets. These networks are relatively small and have no ground-truth communities. We list the basic information of these datasets in Table 4.

Table 4: Statistics of six Newman's datasets. |V|: number of nodes, |E|: number of links.

Dataset                      |V|     |E|
Dolphins                     62      159
Les Misérables               77      254
Books about US politics      105     441
Word adjacencies             112     425
American college football    115     613
High-energy theory           8,361   15,751

5.2 Experimental Setup

Comparison methods. We select two local approaches, namely Sequential Clique Percolation (SCP) [Kumpula et al., 2008] and Demon [Coscia et al., 2012], and four state-of-the-art global approaches, namely BNMF [Psorakis et al., 2011], BNMTF [Zhang and Yeung, 2012], BigCLAM [Yang and Leskovec, 2013], and PNMF [Zhang et al., 2015], to compare with our HNMF model.

²http://www-personal.umich.edu/~mejn/netdata/

Table 5: Experimental results on Newman's networks in terms of modularity.

Dataset                      SCP     Demon   BNMF    BNMTF   BigCLAM   PNMF    HNMF
Dolphins                     0.305   0.680   0.507   0.507   0.423     0.979   1.021
Books about US politics      0.496   0.432   0.461   0.492   0.529     0.864   0.988
Word adjacencies             0.071   0.032   0.254   0.268   0.231     0.668   0.699
American college football    0.605   0.540   0.558   0.573   0.518     1.049   1.113
Power grid                   0.044   0.195   0.342   0.368   1.010     1.105   1.135
High-energy theory           0.543   0.962   0.565   0.600   0.964     0.973   1.060

Evaluation metrics. We use modularity for datasets without ground-truth communities and F1 score for datasets with ground-truth communities.

• Modularity. The well-known modularity [Newman, 2006] Q is defined as
$$Q = \frac{1}{2|E|} \sum_{u,v \in V} \left( A_{u,v} - \frac{d(u)\,d(v)}{2|E|} \right) |C_u \cup C_v|,$$
where d(u) is u's degree. We can see that a node pair (u, v) contributes positively to modularity if the two nodes are linked and negatively otherwise.

• F1 score. The F1 score of a detected community S_i is defined as the harmonic mean of
$$\mathrm{precision}(S_i) = \max_j \frac{|\hat{S}_j \cap S_i|}{|\hat{S}_j|}$$
and
$$\mathrm{recall}(S_i) = \max_j \frac{|\hat{S}_j \cap S_i|}{|S_i|},$$
where Ŝ_j is one of the ground-truth communities. The overall F1 score of the set of detected communities S is the average F1 score of all communities in S.

Table 6: Experimental results on two large networks in terms of F1 score.

Dataset   Demon   BigCLAM   PNMF    HNMF
Amazon    0.082   0.044     0.042   0.122
DBLP      0.102   0.039     0.098   0.104

[Figure 2: Convergence speed of our learning algorithm (percentage of initial loss versus number of iterations, on Amazon and DBLP).]

5.3 Results

Results on Newman's networks in terms of modularity are shown in Table 5. Although PNMF already shows a large improvement over the other baselines, our HNMF model further outperforms PNMF on all datasets, which reflects the significance of the link-to-community perspective in overlapping community detection.

Results on the two large networks in terms of F1 score are shown in Table 6. We notice that the improvement on Amazon is much larger than that on DBLP. Recall our claim in the data observation that the link-to-community side of the homophily effect in Amazon is much stronger than that in DBLP; this explains the difference in improvement between these two datasets. With asynchronous stochastic gradient descent, the running time of our learning algorithm is about 4 hours for Amazon and about 6 hours for DBLP on a computer with a Xeon 24-core 2.60GHz CPU and 128GB memory.

Figure 2 illustrates the convergence speed of our learning algorithm on Amazon and DBLP. Since our computation of the loss employs a global sampling strategy, we can directly sort the losses from all workers according to the time sequence. We set the ε in Section 4.5 to 0.001. We can see that our learning algorithm is able to converge within a small number of iterations.

6 Conclusion

In this paper, we propose a Homophily-based Non-negative Matrix Factorization model to capture both sides of the homophily effect for overlapping community detection. Our unified learning objective is a combination of a preference-based pairwise learning objective for the community-to-link perspective and a generative community representation learning objective with network embedding for the link-to-community perspective. We adopt asynchronous stochastic gradient descent to learn model parameters efficiently. Experiments on real-world networks show that this model can indeed improve the quality of detected overlapping communities.

3943 No. CUHK 14205214 of the General Research Fund), and [Palla et al., 2005] Gergely Palla, Imre Derenyi,´ Illes´ Farkas, 2015 Microsoft Research Asia Collaborative Research Pro- and Tamas´ Vicsek. Uncovering the overlapping commu- gram (Project No. FY16-RES-THEME-005). nity structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005. References [Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social repre- [Airoldi et al., 2009] Edoardo M Airoldi, David M Blei, sentations. In Proceedings of the 20th ACM SIGKDD in- Stephen E Fienberg, and Eric P Xing. Mixed membership ternational conference on Knowledge discovery and data stochastic blockmodels. In Advances in Neural Informa- mining, pages 701–710. ACM, 2014. tion Processing Systems, pages 33–40, 2009. [Psorakis et al., 2011] Ioannis Psorakis, Stephen Roberts, [Chen et al., 2010] Wei Chen, Zhenming Liu, Xiaorui Sun, Mark Ebden, and Ben Sheldon. Overlapping commu- and Yajun Wang. A game-theoretic framework to identify nity detection using bayesian non-negative matrix factor- overlapping communities in social networks. Data Mining ization. Physical Review E, 83(6):066114, 2011. and Knowledge Discovery , 21(2):224–240, 2010. [Recht et al., 2011] Benjamin Recht, Christopher Re, [Coscia et al., 2012] Michele Coscia, Giulio Rossetti, Fosca Stephen Wright, and Feng Niu. Hogwild: A lock-free Giannotti, and Dino Pedreschi. Demon: a local-first dis- approach to parallelizing stochastic gradient descent. In covery method for overlapping communities. In Proceed- Advances in Neural Information Processing Systems, ings of the 18th ACM SIGKDD international conference pages 693–701, 2011. on Knowledge discovery and data mining, pages 615–623. [Tang et al., 2013] Jiliang Tang, Huiji Gao, Xia Hu, and ACM, 2012. Huan Liu. Exploiting homophily effect for trust predic- [Fortunato, 2010] Santo Fortunato. Community detection in tion. In Proceedings of the sixth ACM international confer- graphs. Physics Reports, 486(3):75–174, 2010. ence on Web search and data mining, pages 53–62. ACM, 2013. [Girvan and Newman, 2002] Michelle Girvan and Mark EJ [ ] Newman. in social and biological Wang et al., 2011 Fei Wang, Tao Li, Xin Wang, Shenghuo networks. Proceedings of the National Academy of Sci- Zhu, and Chris Ding. Community discovery using nonneg- ences, 99(12):7821–7826, 2002. ative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011. [ ] Gutmann and Hyvarinen,¨ 2010 Michael Gutmann and [Whang et al., 2013] Joyce Jiyoung Whang, David F Gleich, Aapo Hyvarinen.¨ Noise-contrastive estimation: A new and Inderjit S Dhillon. Overlapping community detection estimation principle for unnormalized statistical models. using seed set expansion. In Proceedings of the 22nd ACM In International Conference on Artificial Intelligence and international conference on Conference on information & Statistics, pages 297–304, 2010. knowledge management, pages 2099–2108. ACM, 2013. [Jin et al., 2015] Di Jin, Zheng Chen, Dongxiao He, and [Xie et al., 2013] Jierui Xie, Stephen Kelley, and Boleslaw K Weixiong Zhang. Modeling with node degree preservation Szymanski. Overlapping community detection in net- can accurately find communities. In Twenty-Ninth AAAI works: The state-of-the-art and comparative study. ACM Conference on Artificial Intelligence, 2015. Computing Surveys (CSUR), 45(4):43, 2013. 
[Kumpula et al., 2008] Jussi M Kumpula, Mikko Kivela,¨ [Yang and Leskovec, 2012] Jaewon Yang and Jure Kimmo Kaski, and Jari Saramaki.¨ Sequential algorithm for Leskovec. Defining and evaluating network commu- fast clique percolation. Physical Review E, 78(2):026109, nities based on ground-truth. In Proceedings of the ACM 2008. SIGKDD Workshop on Mining Data Semantics, page 3. [Lee and Seung, 2001] Daniel D Lee and H Sebastian Se- ACM, 2012. ung. Algorithms for non-negative matrix factorization. In [Yang and Leskovec, 2013] Jaewon Yang and Jure Advances in neural information processing systems, pages Leskovec. Overlapping community detection at scale: a 556–562, 2001. nonnegative matrix factorization approach. In Proceed- ings of the sixth ACM international conference on Web [Lin, 2007] Chih-Jen Lin. Projected gradient methods for search and data mining, pages 587–596. ACM, 2013. nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007. [Zhang and Yeung, 2012] Yu Zhang and Dit-Yan Yeung. Overlapping community detection via bounded nonnega- [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai tive matrix tri-factorization. In Proceedings of the 18th Chen, Greg S Corrado, and Jeff Dean. Distributed repre- ACM SIGKDD international conference on Knowledge sentations of words and phrases and their compositional- discovery and data mining, pages 606–614. ACM, 2012. ity. In Advances in neural information processing systems, [Zhang et al., 2015] Hongyi Zhang, Irwin King, and pages 3111–3119, 2013. Michael R. Lyu. Incorporating implicit link preference [Newman, 2006] Mark EJ Newman. Modularity and com- into overlapping community detection. In Proceedings munity structure in networks. Proceedings of the National of the Twenty-Ninth AAAI Conference on Artificial Academy of Sciences, 103(23):8577–8582, 2006. Intelligence, pages 396–402. ACM, 2015.
