Structure Based User Identification Across Social Networks Xiaoping Zhou, Xun Liang, IEEE Senior Member, Xiaoyong Du, Jichao Zhao
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2784430, IEEE Transactions on Knowledge and Data Engineering.

Abstract—Identifying anonymous identical users across platforms means recognizing the accounts that belong to the same individual on multiple Social Network (SN) platforms. Cross-platform exploration may help solve many problems in social computing, in both theory and practice. However, it remains an intractable problem because the accessible information among SNs is fragmented, inconsistent and disrupted. In contrast to approaches built on user profiles and user content, many studies have turned to network structure, which is accessible and reliable in most SNs. Although substantial achievements have been made, most current network structure-based solutions are supervised or semi-supervised and require prior knowledge in the form of some already identified users. Labeling this prior knowledge manually is laborious, and in some scenarios prior knowledge is hard to obtain at all. Noticing that friend relationships are reliable and consistent across different SNs, we propose an unsupervised scheme, termed Friend Relationship-based User Identification algorithm without Prior knowledge (FRUI-P). FRUI-P first extracts the friend feature of each user in an SN into a friend feature vector, and then calculates the similarities of all the candidate identical users between two SNs. Finally, a one-to-one map scheme is developed to identify the users based on the similarities. Moreover, FRUI-P is proved theoretically to be efficient.
Results of extensive experiments demonstrate that FRUI-P performs much better than the current state-of-the-art network structure-based algorithm without prior knowledge. Due to its high precision, FRUI-P can additionally be utilized to generate prior knowledge for supervised and semi-supervised schemes. In applications, the unsupervised anonymous identical user identification method accommodates more scenarios where the seed users are unobtainable.

Index Terms—Structural Social Network Analysis, Cross-Platform, Social Network, Anonymous Identical Users, Unsupervised, Friend Relationship, User Identification, Prior Knowledge

Xiaoping Zhou is with the School of Information, Renmin University of China, Beijing 100872, China, and Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing 100044, China. E-mail: [email protected].
Xun Liang, Xiaoyong Du and Jichao Zhao are with the School of Information, Renmin University of China, Beijing 100872, China. E-mail: {xliang, duyong, zhaojichao}@ruc.edu.cn. (Corresponding author: Xun Liang.)

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

Social networks (SNs) are pervasive and indispensable in our everyday life. Because they contribute massive behavioral data, SNs have exerted an enormous influence on various research fields, e.g., sociology [1], economics [2] and epidemiology [3]. Different SNs supply an individual with different functions, which prompts the same individual to sign up for several SNs. For instance, in China, people use RenRen, a Facebook-style, real-name (autonymous) SN, for blogs, and Sina Microblog for sharing statuses. Theoretically, integrating users' behaviors from all these SNs can benefit all these studies on SNs. This is because cross-platform investigations merge various SN platforms to create richer raw data and more complete SNs for social computing tasks. Additionally, they enable an all-sided analysis of user behaviors. Accordingly, exploration of this topic lays a foundation for current, as well as further, studies on SNs.

User identification, which is also termed user recognition, user identity resolution, user matching, and anchor linking, aims to find identical users among different SN platforms. The first and most important task for cross-platform SN research is user identification across these SNs, and it has attracted extensive attention in recent years.

In the early stage, researchers utilized the uniqueness of the email address and linked identical users in different SNs correctly by building up the "Find Friend" mechanism through these email addresses [4]. Although email addresses are a powerful attribute for this task, the SNs successively banned their use due to privacy protection concerns. To tackle this issue, more efforts were made on the accessible attributes of the SNs, e.g., the user profile attributes [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], the user activities [21], [22], [23], [24], [25] and the network structure [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36]. All these schemes can identify a portion of identical users and ease the cross-platform task. Currently, the accessible information among SNs is becoming increasingly fragmented, inconsistent and disrupted. These characteristics of SNs marginalize the traditional solutions and trigger demands for new techniques for user identification.

SN connections (or networks) fall into two categories: single-following connections and mutual-following connections (or friend relationships) [28]. Single-following connections are very noisy because not all connections represent true "friend" relationships [37]. However, friend relationships (mutual-following connections in microblogging SNs), in which each connection requires mutual confirmation between the two users, are much more reliable and consistent among SNs, and thus much more suitable for cross-platform user identification tasks. Currently, most of the solutions in this category are supervised or semi-supervised, which require some matched identical users, or seed users (prior knowledge), in advance. The seed users are often acquired from user self-posting websites, e.g., Google+ and About.me, or through human annotation by comparing profile, content and network features [37]. In the latter scenarios, e.g., obtaining prior knowledge from Sina Microblog and RenRen, it is laborious to label the seed users manually. Moreover, the quality and quantity of the seed users have a significant influence on the identification results. For instance, when most of the seed users are located in a single cluster, only a few identical users can be identified. Exploring unsupervised algorithms can therefore reduce the dependency on prior knowledge. To the best of our knowledge, Neighbor Matching (NM) [36] is the only unsupervised method highly pertinent to the user identification problem. Although NM is effective in de-anonymizing SNs and won the "champion of the de-anonymization" task of the WSDM 2013 Data Challenge [38], its performance is not satisfactory in user identification tasks, as shown in Section 5.

Accordingly, in this study, we investigated the unsupervised strategy to recognize anonymous identical users across SNs purely by friend relationships. Unless otherwise specified, the network structure of an SN denotes the friend relationships among the users, and single-direction connections are ignored. Although this

than NM in running time. We also introduced three parameters to improve the performance and one parameter to ensure the precision of FRUI-P.

4. Providing concrete demonstrations of FRUI-P performance with three synthetic networks and two major online SNs in China, namely Sina Microblog and RenRen. The synthetic networks include the Erdős–Rényi (ER) [40] random networks, Watts–Strogatz (WS) [41] small-world networks and Barabási–Albert preferential attachment model (BA) [42] networks. The findings show that FRUI-P is superior to NM in these networks. Moreover, FRUI-P is effective for the de-anonymization task, as the user identification task is similar to, yet tougher than, the de-anonymization problem.

This article proceeds as follows. Section 2 systematically presents terminology on user identification across SN platforms, and formally presents the problem definition in friend relationship-based networks. Section 3 reviews related work on cross-platform user identification. Section 4 proposes the FRUI-P algorithm. Section 5 covers the experimental studies. Section 6 offers conclusions.

2 PROBLEM DEFINITION

2.1 Terminology Definitions

Since only the friend relationships are considered in this study, an SN is defined as SN = {U, F}, where U and F are