COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency

COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency Yutao Zhang?, Jie Tang?[, Zhilin Yang?, Jian Pei], and Philip S. Yuyz ?Department of Computer Science and Technology, Tsinghua University zInstitute for Data Science, Tsinghua University [Tsinghua National Laboratory for Information Science and Technology (TNList) ]School of Computing Science, Simon Fraser University yDepartment of Computer Science, University of Illinois at Chicago {yt-zhang13,yzl11}@mails.tsinghua.edu.cn, [email protected], [email protected],[email protected] 1. INTRODUCTION We are facing an era of online with offline (OWO)—almost ev- ABSTRACT eryone is using online social networks to connect friends or, more More often than not, people are active in more than one social net- generally, to satisfy social needs at different levels [25]. In fact, work. Identifying users from multiple heterogeneous social net- many users participate in more than one social network, such as works and integrating the different networks is a fundamental is- public networks and private networks, as well as business networks sue in many applications. The existing methods tackle this prob- and family networks. We conducted an interview with 20 graduate lem by estimating pairwise similarity between users in two net- students from the authors’ lab. Preliminary statistics show that the works. However, those methods suffer from potential inconsistency average number of social networks in which a user participates is of matchings between multiple networks. eight. The intentions behind these choices are sophisticated. For example, people may be attracted by different functionalities of- In this paper, we propose COSNET (COnnecting heteroge- 1 neous Social NETworks with local and global consistency), a novel fered by different social networks. A survey in the US shows that energy-based model, to address this problem by considering both two-thirds of online adults (66%) use social media platforms such local and global consistency among multiple networks. An efficient as Facebook, Twitter, MySpace, or LinkedIn, to stay in touch with subgradient algorithm is developed to train the model by converting current friends, family members, and business partners. Users gen- the original energy-based objective function into its dual form. erate heterogeneous content and also build different ego-networks We evaluate the proposed model on two different genres of data in different social networks. One interesting and important ques- collections: SNS and Academia, each consisting of multiple het- tion is: can we automatically integrate the different heterogeneous erogeneous social networks. Our experimental results validate the social networks together? effectiveness and efficiency of the proposed model. On both data The results can benefit many applications in one way or another. collections, the proposed COSNET method significantly outper- For example, if we could correctly integrate different social networks together, we could create an integrated profile for each user, forms several alternative methods by up to 10-30% (p 0:001, 2 t-test) in terms of F1-score. We also demonstrate that applying the and build a better user interest model. Talentbin uses this idea to integration results produced by our method can improve the accu- integrate professional information of an employer that scattered in racy of expert finding, an important task in social networks. different social networks to provide a better view of expertise. We can also leverage the integrated results to help social recommenda- tions [24, 31]. Categories and Subject Descriptors The problem is fundamentally important in social network anal- H.3.3 [Information Search and Retrieval]: Text Mining; H.2.8 ysis and is also very challenging. First, users’ information in dif- [Database Management]: Database Applications—Data Mining ferent networks is very unbalanced. Some network may contain rich profile information such as location and interests, while some General Terms others may not have any information. Users’ behavior is an important clue (also referred to as social fingerprint [39]) to help Algorithms, Experimentation recognize users in different networks. If we use the similarity s(u; v); u 2 G1; v 2 G2 between users to link users from two Keywords different networks (G1 and G2), the problem can be formalized P Social network, Network integration, Energy-based model as an optimization problem: max u;v s(u; v). The problem can be solved efficiently by using a minimum cost flow algorithm [1, Permission to make digital or hard copies of all or part of this work for personal or 33]. This formulation focuses on local consistency based on user classroom use is granted without fee provided that copies are not made or distributed profiles; however, it does not consider the network structure—an for profit or commercial advantage and that copies bear this notice and the full cita- identical user in different networks may have similar ego-networks. tion on the first page. Copyrights for components of this work owned by others than Finding matches between the users of two networks G1 and G2 can ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- be mapped onto the problem of finding bijection between two net- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. 1 KDD’15, August 10-13, 2015, Sydney, NSW, Australia. http://www.pewinternet.org/2011/11/15/why-americans-use- social-media/ c 2015 ACM. ISBN 978-1-4503-3664-2/15/08 ...$15.00. 2 DOI: http://dx.doi.org/10.1145/2783258.2783268. https://www.talentbin.com/, an online recruitment service. 1 Username: 퐺 Ortiz_Brandy Nation: USA 1 Gender: female 푣1 Network Local Consistency Consistency 1 1 +13.4 % 푣 2 푣 3 +7.3 % p << 0.01 (t-test) p << 0.01 (t-test) +17.5 % p << 0.01 (t-test) Global Username: Inconsistency 2 3 @ortizbrandy 푣1 푣1 Nation: USA Gender: female 퐺 2 2 2 3 3 3 푣 2 푣 3 푣 2 푣 3 퐺 (a) Example illustration (b) Accuracy performance (c) Application: Expert Finding Figure 1: (a) Example illustration of global inconsistency when connecting three social networks; (b) Accuracy performance of comparison methods; (c) Application improvement by the proposed model (COSNET). works, a challenging problem in graph theory [35]. Approximation Table 1: Notations. can be accomplished using a greedy algorithm. In our problem, this SYMBOL DESCRIPTION formulation aims to achieve network consistency, also referred to G a set of social networks to be integrated as network matching (Cf. § 3). However, if we extend this problem V a set of jV j = N users to the setting of multiple networks, e.g., fG1;G2; ··· ;Gkg, only E a set of relationships between users considering local consistency and network consistency is then in- R a N × d attribute matrix sufficient. Figure 1 gives an example of connecting three different MG a matching graph constructed from G 1 1 2 2 th networks. User v1 in G has high similarities with v1 in G and xi 2 X the i user pair 3 3 3 y 2 f+1; 0g the binary indicator representing whether ith user v1 in G (indicated by the green arrows), while v1 has the highest i 2 2 pair in MG is a correct matching or not similarity with v3 in G . If we match any two networks indepen- dently, it can be easily seen that we will have an inconsistent results: gl(:); fe(:); ft(:) a set of feature functions defined in the energy-based 1 2 1 3 2 3 model v1 $ v1 , v1 $ v1 , and v3 $ v1 . An ideal solution to the problem is to consider all pieces of information (local, network, and global consistencies) in a unified model and tackle them simultaneously. dual form of the original energy-based objective function by Despite several studies on various related topics including en- means of a subgradient method. tity linking [22, 4, 5, 20, 28], entity resolution [6, 13, 21, 30], and de-anonymization [3], the problem of connecting multiple hetero- • Our empirical study on two different genres of data collec- geneous social networks remains largely unsolved. Most existing tions, SNS and Academia, verifies the effectiveness of the works focus on estimating pairwise similarity between users from proposed model. SNS consists of several popular social net- two networks. They ignore either network consistency or global works including Twitter, Flickr, Myspace, Last.fm, and Live- consistency. Some other methods such as [20] and [6] consider the Journal. The Academia data collection consists of LinkedIn, network consistency; however they are still targeting at two net- ArnetMiner, and VideoLectures. Figure 4 shows the perfor- works and do not consider the global consistency among multiple mance of different comparison methods on the two data col- networks. lections. Clearly, the proposed COSNET method performs Challenges and Our Solution. The problem of connecting mul- on average 10-30% better than the comparative methods in tiple heterogeneous social networks is non-trivial and poses a set terms of F1-score (p 0:001, t-test). of unique challenges. First, how to formulate local, network and global consistencies into a principled optimization model is a chal- • We use expert finding, an important task in social networks, lenging issue. Moreover, as real networks are becoming larger and as an application case study to further validate the effective- larger with millions of nodes, it is important to develop efficient al- ness of the proposed method. Figure 1(c) shows the perfor- gorithms that can scale up well. In addition, how to quantitatively mance of expert finding. When applying the integrated re- validate the usefulness of integrated results is also a challenging sults, it is clear that the performance of expert finding can be task.

Load more