This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2784430, IEEE Transactions on Knowledge and Data Engineering

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID 1

Structure Based User Identification across Social Networks

Xiaoping Zhou, Xun Liang, IEEE Senior Member, Xiaoyong Du, Jichao Zhao

Abstract: Identification of anonymous identical users across platforms refers to recognizing the accounts that belong to the same individual among multiple social network (SN) platforms. Evidently, cross-platform exploration may help solve many problems in social computing, in both theory and practice. However, it is still an intractable problem due to the fragmentation, inconsistency and disruption of the accessible information among SNs. Different from the efforts based on user profiles and users’ content, many studies have noticed the accessibility and reliability of network structure in most SNs for addressing this issue. Although substantial achievements have been made, most of the current network structure-based solutions are supervised or semi-supervised and require prior knowledge of some given identified users. Labeling this prior knowledge manually is laborious, and in some scenarios prior knowledge is hard to obtain. Noticing that friend relationships are reliable and consistent in different SNs, we propose an unsupervised scheme, termed Friend Relationship-based User Identification algorithm without Prior knowledge (FRUI-P). FRUI-P first extracts the friend feature of each user in an SN into a friend feature vector, and then calculates the similarities of all the candidate identical users between two SNs. Finally, a one-to-one map scheme is developed to identify the users based on the similarities. Moreover, FRUI-P is proved to be efficient theoretically. Results of extensive experiments demonstrate that FRUI-P performs much better than the current state-of-the-art network structure-based algorithm without prior knowledge. Due to its high precision, FRUI-P can additionally be utilized to generate prior knowledge for supervised and semi-supervised schemes. In applications, the unsupervised anonymous identical user identification method accommodates more scenarios where the seed users are unobtainable.

Index Terms: Structural , Cross-Platform, Social Network, Anonymous Identical Users, Unsupervised, Friend Relationship, User Identification, Prior Knowledge

Xiaoping Zhou is with the School of Information, Renmin University of China, Beijing 100872, China, and Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing 100044, China. E-mail: [email protected].
Xun Liang, Xiaoyong Du and Jichao Zhao are with the School of Information, Renmin University of China, Beijing 100872, China. E-mail: {xliang, duyong, zhaojichao}@ruc.edu.cn. (Corresponding author: Xun Liang.)

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

SOCIAL networks (SNs) are pervasive and indispensable in our everyday life. Owing to the massive behavioral data they contribute, SNs have exerted an enormous influence on various research fields, e.g., sociology [1], economics [2] and epidemiology [3]. Different SNs supply an individual with different functions, which prompts the same individual to sign up for several SNs. For instance, in China, people use , a -style but autonymous SN for blogs, and Sina Microblog for sharing statuses. Theoretically, integrating users’ behaviors from all these SNs can benefit all these studies on SNs, because cross-platform investigations merge various SN platforms to create richer raw data and more complete SNs for social computing tasks. Additionally, they enable an all-sided analysis of user behaviors. Accordingly, exploration of this topic lays a foundation for current, as well as further, studies on SNs.

User identification, which is also termed user recognition, user identity resolution, user matching, and anchor linking, aims to find identical users among different SN platforms. Admittedly, the first and most important task for cross-platform SN research is user identification across these SNs, and it has attracted extensive attention in recent years.

In the early stage, researchers utilized the uniqueness of the email address and linked identical users in different SNs correctly by building up the “Find Friend” mechanism through these email addresses [4]. Although email addresses are a powerful attribute for this task, they were banned by the SNs successively due to privacy protection concerns. To tackle this issue, more efforts were made on the accessible attributes of the SNs, e.g., the user profile attributes [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], the user activities [21], [22], [23], [24], [25] and the network structure [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36]. All these schemes can identify a portion of identical users and mitigate the cross-platform task. Currently, the accessible information among SNs is becoming increasingly fragmented, inconsistent and disruptive. These characteristics of SNs marginalize the traditional solutions and trigger demands for new techniques for user identification.

SN connections (or network) fall into two categories: single-following connections and mutual-following connections (or friend relationships) [28]. Admittedly, the single-following connections are very noisy because not all connections represent true “friend” relationships [37]. However, friend relationships (mutual-following connections in microblogging SNs), in which each connection requires mutual confirmation between the two users, are much more reliable and consistent among SNs, and thus
much more suitable for cross-platform user identification tasks. Currently, most of the solutions in this category are supervised or semi-supervised, and require some matched identical users, or seed users (prior knowledge), in advance. The seed users are often acquired from user self-posting websites, e.g., Google+ and About.me, or through human annotation by comparing profile, content and network features [37]. In the latter scenarios, e.g., obtaining prior knowledge from Sina Microblog and RenRen, it is laborious to label the seed users manually. Moreover, the quality and quantity of the seed users have a significant influence on the identification results. For instance, when most of the seed users are located in one cluster, only a few identical users can be identified. Undoubtedly, exploration of unsupervised algorithms can mitigate the dependency on prior knowledge. To the best of our knowledge, Neighbor Matching (NM) [36] is the only unsupervised method highly pertinent to the user identification problem. Although NM is effective in de-anonymizing SNs and won the “champion of the de-anonymization” task of the WSDM 2013 Data Challenge [38], its performance is not satisfactory in user identification tasks, as shown in Section 5.

Accordingly, in this study, we investigated the unsupervised strategy to recognize anonymous identical users across SNs purely by friend relationships. Unless otherwise specified, the network structure of an SN denotes the friend relationships among the users, and the relationships with a single-direction connection are ignored. Although this algorithm can only identify a portion of the identical users in real-world SNs, it can be applied jointly with other feature-based user identification algorithms for higher performance. Clearly, no algorithm can solve the problem alone.

This study makes the following contributions:

1. Proposing a novel unsupervised user identification algorithm, termed Friend Relationship-based User Identification without Prior Knowledge (FRUI-P). FRUI-P extracts the multiple dimensional features of each user from the network structure, and then the similarities of any two users in two different SNs are evaluated from the multiple dimensional features. Since the multiple dimensional features of users are considered, FRUI-P has the ability to recognize more identical users than NM, in which the features in only one dimension are taken into consideration. As opposed to most of the existing schemes, FRUI-P identifies the identical users without prior knowledge and can be employed to provide prior knowledge for the supervised and semi-supervised algorithms, e.g., FRUI [28].

2. Presenting the friend feature learning process with circumstances. Inspired by the state-of-the-art techniques in Word2Vec in the natural language processing (NLP) area [39], we proposed a Friend Feature Vector Model (FFVM) which utilizes the power of random walk. Although Word2Vec has been widely examined in measuring the similarity of any two words, its performance in the unsupervised cross-platform task is unknown. We first carried out the investigation, and the empirical testing results proved that FFVM is effective in the user identification task.

3. Discussing the efficiencies of FRUI-P and the strategies for ensuring the performance of user identification in FRUI-P. We proved that FRUI-P is much more efficient than NM in running time. We also introduced three parameters to improve the performance and one parameter to ensure the precision of FRUI-P.

4. Providing concrete demonstrations of FRUI-P performance with three synthetic networks and two major online SNs in China, namely Sina Microblog and RenRen. The synthetic networks include the Erdős-Rényi (ER) [40] random networks, Watts-Strogatz (WS) [41] small-world networks and Barabási-Albert preferential attachment model (BA) [42] networks. The findings show that FRUI-P is superior to NM in these networks. Moreover, FRUI-P is effective for the de-anonymization task, as the user identification task is similar to, yet tougher than, the de-anonymization problem.

This article proceeds as follows. Section 2 systematically presents terminology on user identification across SN platforms, and formally presents the problem definition in friend relationship-based networks. Section 3 reviews related work on cross-platform user identification. Section 4 proposes the FRUI-P algorithm. Section 5 covers the experimental studies. Section 6 offers conclusions.

2 PROBLEM DEFINITION

2.1 Terminology Definitions

Since only the friend relationships are considered in this study, an SN is defined as $SN = \{U, F\}$, where $U$ and $F$ are the sets of users and friend relationships in the SN, respectively. As more than one SN is discussed in this study, $SN^A = \{U^A, F^A\}$ is used to represent SN A. For instance, $SN^{Twitter}$ denotes the . Usually, an SN is modeled as an undirected graph, where the nodes and the edges are the users and friend relationships, respectively. Therefore, an SN is also termed a graph in this study.

Throughout this paper, the superscript indicates variables (or notations) associated with an SN, and the subscript is the user in the SN. For instance, $U_i$ denotes user $i$ in an unspecified SN, while $U_i^A$ is user $i$ in $SN^A$. Moreover, the set items are marked in italic, while the vectors are presented in bold. The notations frequently used in this paper are listed in Table 1.

TABLE 1
NOTATIONS

Symbol                 Description
$SN = \{U, F\}$        Social network with user set $U$ and friend set $F$.
$U_i^A$                User $i$ in $SN^A$.
$F_i^A$                The friend set of $U_i^A$.
Ï_{A~B}(i, j)          Candidate identical user pair composed of $U_i^A$ and $U_j^B$.
$U_i^A = U_j^B$        Identical users $U_i^A$ and $U_j^B$.
$C_i^A$                A context of $U_i^A$.
$N_i^A$                The negative user set of $U_i^A$.
$\mathbf{f}_i^A$       Friend feature vector of $U_i^A$.
$\mathbf{c}_i^A$       Context feature vector of $C_i^A$.
$s$                    Friend feature similarity of two users.
$x$                    Dimension of friend/context feature vector.
$S$                    Positive sample set.
$t$                    Rounds of friend feature vector learning.
$\lambda$              Similarity threshold of the identified users.

Note: The superscript indicates variables (or notations) associated with an SN, and the subscript is the user in the SN. $|\cdot|$ is the size of a set.
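As a rough illustration of the notation in Section 2.1, the sketch below builds the friend sets $F_i$ of an SN from raw follow links, keeping only mutual-following pairs and ignoring single-direction connections, as the text describes. The function name `build_friend_sets` and the toy data are hypothetical and are not taken from the paper; this is a minimal sketch, not the authors' implementation.

```python
from collections import defaultdict


def build_friend_sets(follows):
    """Return {user: set of friends} from directed (follower, followee) pairs.

    Only mutual-following pairs become friend relationships; connections
    that exist in a single direction are ignored.
    """
    follow_set = set(follows)
    friends = defaultdict(set)
    for u, v in follow_set:
        if u != v and (v, u) in follow_set:
            friends[u].add(v)
            friends[v].add(u)
    return dict(friends)


# Hypothetical toy data: a<->b and a<->c are mutual, d->a is single-direction.
toy_follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("d", "a")]
F = build_friend_sets(toy_follows)
```

The resulting dictionary is an adjacency-set view of the undirected graph $SN = \{U, F\}$: the keys play the role of $U$ (restricted to users with at least one friend), and each value is the friend set of one user.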


[Figure 1: panels labeled Real World, $SN^A$ and $SN^B$, showing connections among individuals 1-5 and the accounts a-e and a'-e'; the candidate pair Ï_{A~B}(a, a') is marked.]

Fig. 1. Illustration of definitions. $U_a^A$ in $SN^A$ forms candidate identical user pairs with all the users in $SN^B$. Suppose that $\{U_a^A, U_c^A, U_b^A\}$ is the result of a random walk; the context of $U_c^A$ is $C_c^A = \{U_a^A, U_b^A\}$, and the negative user set of $U_c^A$ is $N_c^A = \{U_a^A, U_b^A, U_d^A, U_e^A\}$. Subsequently, $(C_c^A, U_c^A)$ is a positive sample, while $(C_c^A, U_d^A)$ is a negative sample.

For convenience, we had the following definitions. Fig. 1 presents an illustration.

Definition 1 (Candidate Identical User Pair). $U_i^A$ and $U_j^B$ from two different SNs, $SN^A$ and $SN^B$, form a Candidate Identical User Pair, which is denoted as Ï_{A~B}(i, j) or Ï($U_i^A$, $U_j^B$). Also, we denoted Ï(∙, ∙) as all the candidate identical user pairs and Ï($U_i^A$, ∙) as all the candidate identical user pairs containing $U_i^A$. Initially, $|U^A| \times |U^B|$ candidate identical user pairs exist in Ï(∙, ∙) and $|U^B|$ in Ï($U_i^A$, ∙).

Definition 2 (Identical Users). For any Ï($U_i^A$, $U_j^B$), if $U_i^A$ and $U_j^B$ belong to the same individual in real life, then $U_i^A$ and $U_j^B$ are identical users, and $U_i^A = U_j^B$.

Definition 3 (Context). The context of user $U_i$ in an SN, denoted as $C_i$ or $C(U_i)$, is a set of users which have a high probability to predict $U_i$.

In this study, we employ random walk to model the context of users. Since a user may appear more than once in one or more random walks, a user may have more than one context. We anticipate that $C_i$ has a higher probability to predict $U_i$ and a lower probability to predict $U_j$ ($j \neq i$), thus we have Definitions 4 and 5 for $C_i$.

Definition 4 (Positive Sample). For a given $C_i$ in an SN, $(C_i, U_i)$ is a positive sample. All the positive samples form the positive sample set $S$.

Definition 5 (Negative User Set, Negative Sample). Opposed to the definition of positive sample, all the other users except $U_i$, denoted as $N_i$ or $N(U_i)$, are termed the negative user set for $C_i$. Thus, $N_i = U - \{U_i\}$. For any user $U_j \in N_i$, $(C_i, U_j)$ forms a negative sample.

Definition 6 (Friend Feature Vector). For any user $U_i$ in an SN, his friend feature vector $\mathbf{f}_i$ is an $x$-dimensional vector which embeds his friend feature. $x$ is a parameter given in advance.

Definition 7 (Friend Feature Similarity). The similarity of any Ï($U_i^A$, $U_j^B$) is the similarity on the friend feature of the two users, denoted as $s(U_i^A, U_j^B)$. Since the friend feature is represented by a friend feature vector, $s(U_i^A, U_j^B)$ is measured by $\mathbf{f}_i^A$ and $\mathbf{f}_j^B$.

2.2 Problem Definition

In the real world, each person has his own circle of friends, which is highly personal. Therefore, if we know all of the friends of a person, we may know who he/she is. Using the SN in Fig. 1(a) as an example, if one has the friend set of users 2, 3 and 5, it is obvious that he could be user 1. SNs can be considered as mirrors of the real world, and most popular SNs have similar network structures. Some studies have already shown that users tend to have similar friends across different SNs. Dunbar et al. confirmed that the structure of online social networks mirrored those in the offline world by an empirical study on Twitter and Facebook [46]. In our former study [28], we also found that an average of 67.5% of a person’s friends in Sina Microblog concurred in RenRen. Suppose that people set up random friendships in the real world, and the probability of a friendship between any two persons is $p$ ($0 < p < 1$). For any friendship, $s_a$ ($0 < s_a < 1$) and $s_b$ ($0 < s_b < 1$) are the probabilities that the friendship exists in $SN^A$ and $SN^B$, respectively. Therefore, the probabilities that a friendship exists in $SN^A$ and $SN^B$ are $p \cdot s_a$ and $p \cdot s_b$, respectively. Note that a friendship exists in both $SN^A$ and $SN^B$ with the probability of $p \cdot s_a \cdot s_b$. Subsequently, $U_i^A$ and $U_j^B$ share $|U| \cdot p \cdot s_a \cdot s_b$ friends when $U_i^A = U_j^B$, or $|U| \cdot p^2 \cdot s_a \cdot s_b$ friends otherwise, where $|U|$ denotes the number of users in the SNs. Obviously, the number of shared friends differs by a factor of $1/p$ between $U_i^A = U_j^B$ and $U_i^A \neq U_j^B$. This establishes the basis for cross-platform user identification purely using network structure.

The cross-platform user identification problem is to determine whether $U_i^A$ and $U_j^B$ in two different SNs, $SN^A$ and $SN^B$, are identical users or not, and is defined as:

$$ f(U_i^A, U_j^B \mid SN^A, SN^B, P) = \begin{cases} 1, & U_i^A = U_j^B; \\ 0, & \text{otherwise,} \end{cases} \tag{2} $$

where $P$ is the prior knowledge, or extra information, other than $SN^A$ and $SN^B$, which is given in advance. For instance, $P$ is the seed users in FRUI. When there is no other extra information available, $P = \emptyset$. In this scenario, only the friend relationships can be utilized, and (2) turns into

$$ f(U_i^A, U_j^B \mid SN^A, SN^B) = \begin{cases} 1, & U_i^A = U_j^B; \\ 0, & \text{otherwise.} \end{cases} \tag{3} $$

Obviously, $f(\cdot)$ is the function we sought in this study. For various reasons, some people hold a few accounts in the same SN, yet we assumed that these multiple accounts are independent and belong to different individuals. In other words, we only identified one of these accounts.

3 RELATED WORKS

We review the current studies on cross-platform user identification from three categories: profile-based, content-based and network structure-based approaches, and present a brief summary of network embedding.

3.1 Profile-Based User Identification

Several studies addressing anonymous user identification have focused on public profile attributes, including screen name, gender, birthday, city and profile image.

A screen name is the publicly required profile feature in almost all SNs. It has been widely explored as a way to recognize users across different SNs. Perito et al. [5] calculated the similarity of screen names and identified users using binary classifiers. Similarly, Liu et al. [6] focused on deciding whether cross-platform user identities with the same username belong to the same natural person. Zafarani and Liu [7], [8] developed a user mapping method by modeling user behavior on screen names. The profile image is another feature that has received considerable study. Acquisti et al. [9] addressed the user identification task with
a face recognition algorithm. Although both screen name and profile image can identify users, they cannot be applied to large SNs, because some users may have the same screen name and profile image. For example, many users have the screen name “John Smith” on Facebook.

Leveraging a combination of profile features can result in better user identification. Iofciu et al. [10] proposed an approach by measuring the distance between user profiles. Motoyama and Varghese [11] gathered attributes (education, occupation, etc.) as sets of words and matched users by calculating the similarity of users. Goga [12] linked accounts belonging to the same person identity, based solely on the profile information. Cortis [13] proposed a weighted ontology-based user profile resolution technique. Abel et al. [14] aggregated user profiles and matched users across systems. Similar studies across multiple platforms are also found in [15], [16], [17], [18].

Public profile attributes provide powerful information for user identification. However, some attributes are duplicated in large-scale SNs, and are easily impersonated.

3.2 Content-Based User Identification

Content-based user identification solutions attempt to recognize users based on the times and locations at which users post content, as well as the writing style of the content.

Zheng et al. [21] proposed a framework for authorship identification using the writing style of online messages. Almishari and Tsudik [22] linked users across different SNs by exploiting the writing style of authors. Kong and Zhang [23] proposed Multi-Network Anchoring (MNA) to map users. MNA calculates the combined similarities of users’ social, spatial, temporal and text information in different SNs. Goga et al. [24] exploited the geo-location attached to users’ posts, the timestamp of posts, and users’ writing style to address user identification tasks. Riederer et al. [25] also presented an algorithm which utilizes trajectory-based content features to link identical users.

Geo-location appears to have forceful features for user recognition. However, this information is often sparse in SNs, since only a small portion of users are willing to post their locations. The writing-style solutions can be improved by the new techniques on authorship verification and clustering [43].

3.3 Network Structure-Based User Identification

3.3.1 Network Structure-Based User Identification with Prior Knowledge

The network structure-based studies with prior knowledge can be divided into two groups: supervised and semi-supervised models.

In the supervised category, Man et al. [26] developed Predicting Anchor Links via Embedding (PALE) based on latent features of users in SNs. Nie et al. [27] proposed Dynamic Core Interests Mapping (DCIM) by jointly modelling neighborhood-based network features and interested-based content features.

In the semi-supervised category, [28], [29], [30], [31] have a similar workflow: finding seed users first, then using these seed users to recursively propagate information through networks and extend sets of mapped nodes. The key difference among these studies is the similarity definition in the information propagation. Clearly, the similarities of any unmapped nodes are evaluated based on the mapped nodes. Liu et al. [32] proposed Input Output Network Embedding (IONE). IONE identifies the users by extracting embedding-based network features from the follower-ship/followee-ship of each user simultaneously.

The joint use of profile information, user behaviors hidden content and network structures may lead to better results [19], [20]. Zhang et al. [33] presented COSNET, in which an energy-based model is employed to link user identities by considering both local and global consistency. The local consistency refers to the situation of only two social networks, while global consistency is the mapping consistency on multiple networks. Zhang et al. [34] leveraged distance-based public profile features and neighborhood-based network features to link user identities by a local expansion propagation algorithm. Bartunov et al. [30] integrated profiles with a network structure using a Conditional Random Fields model. Liu et al. [35] proposed a user identification solution, namely HYDRA, which makes full use of all the handful user properties and identifies users by learning the features from the seed users using support vector machines.

3.3.2 Network Structure-Based User Identification without Prior Knowledge

Network structure-based user identification without prior knowledge is a hard nut to crack. To the best of our knowledge, NM [36] is the most relevant study in this category. Suppose that there are two SNs both with $n$ users: $SN^A$ and $SN^B$. NM first defines a similarity $s(U_i^A, U_j^B)$ for any two users $U_i^A$ and $U_j^B$ in $SN^A$ and $SN^B$. Initially, $s^{(0)}(U_i^A, U_j^B) = 1$. Then, NM refreshes $s(U_i^A, U_j^B)$ iteratively. In the $k$-th iteration, NM constructs a complete bipartite graph $B_{i,j}^{(k+1)} = (F_i^A, F_j^B, F_i^A \times F_j^B)$, where each edge $(U_{i'}^A, U_{j'}^B)$ is weighted as $s^{(k)}(U_{i'}^A, U_{j'}^B)$. Finally, NM finds the maximum weighted match of $B_{i,j}^{(k+1)}$, denoted as $M_{i,j}^{(k+1)}$, and updates $s(i, j)$:

$$ s^{(k+1)}(U_i^A, U_j^B) = \sum_{(U_{i'}^A,\, U_{j'}^B) \in M_{i,j}^{(k+1)}} s^{(k)}(U_{i'}^A, U_{j'}^B). \tag{1} $$

When the normalization of $s$ has converged, NM returns the top $m$ mappings of the maximal weighted match of $B = (U^A, U^B, U^A \times U^B)$. The overall running time of each iteration is $O(n^2 d_{max}^3)$, where $d_{max}$ is the upper bound of the number of user friends. SNs have repeatedly been verified to be scale-free networks. Thus, $d_{max} \approx n$ and the time complexity of NM is $O(n^5)$ in the SN. Although NM is effective in the de-anonymization task, its time complexity is high and its performance has not been verified in scenarios where the overlap of two SNs is not guaranteed.

Graph kernel methods [44], [45] find the similar subgraphs from two graphs, and are somewhat relevant to the user identification task. However, graph kernel solutions mainly focus on computing the similarities of subgraphs.

In this study, we proposed an innovative approach, FRUI-P, to address the challenges faced by previous methods. FRUI-P differs from NM and our previous study, FRUI, in the following aspects:

1. FRUI requires prior knowledge, whereas FRUI-P and NM do not. FRUI-P resolves the issue of acquiring
matched identical users. Without prior knowledge, it is al- an example. As individual 1 has similar friends in SNA and B A most impossible to measure the similarity of two users SN , if we embedded individual 1 and his accounts Ua in A B B from two SNs purely using network structure. NM ad- SN and Ua’ in SN into two-dimensional vectors, they A B dresses this problem through a similarity iterative process, must localize closely in geometry. That is, fa and fa’ local- while FRUI-P evaluates the similarities from the friend fea- ize almost at the same spot in a two-dimensional space. ture vectors extracting from the network structure. Consequently, it is easy to identify the users by calculating 2. NM is suitable for de-anonymization task, while the geometrical similarities of the vectors. FRUI-P and FRUI focus on the user identification on differ- The key techniques of our unsupervised friend relation- ent SNs. Although all three algorithms find the identical ship-based user identification algorithm can be divided users in two different networks, NM actually works on the into two types: users’ Friend Feature Vector Model (FFVM) same SN. and user identification model over the feature vectors. 3. FRUI-P and FRUI can perform on heterogeneous SNs. 4.1 Friend Feature Vector Model Friend relationship is robust, reliable and consistent in SNs. Both FRUI and FRUI-P convert the connections in different The curse of dimensionality and incapability of the embed- SNs into friend relationships and identify the identical us- ded feature make the traditional adjacent vertices impossi- ers via the friend relationships. ble to facilitate the user identification issue. Thus, the first 4. FRUI-P works more efficiently and effectively than and primary task is to formulate the friend feature vector. NM. We proved that the running time complexity of FRUI- Deep learning allows computational models that are P is O(n2), while NM costs O(n5). 
Apparently, FRUI-P costs composed of multiple processing layers to learn represen- much less in the user identification task. tations of data with multiple levels of abstraction, and its 5. FRUI-P can be easily implemented in a parallel man- effectiveness has been verified in various areas [51]. We ner, whereas NM cannot. also used the power of deep learning to abstract the friend 6. FRUI-P operates an O(xn) in memory space, while feature vector. Our algorithm employs the random walk NM requires O(n2), where x is a small integer representing and is on top of the methodology on Word2Vec [39]. the dimension of the friend feature vector. Thus, FRUI-P As demonstrated in Fig. 3, FFVM consists of two inte- can be applied to much larger SNs. grations, namely the positive sample model and the friend feature vector learning model. 3.4 Network Embedding 4.1.1 Positive Sample Model Network embedding aims to learn latent representations The positive sample model aims to produce the positive of vertices in a network. DeepWalk [47] addresses this is- sample set S of a given SN. Since Ci has a high probability sue by a random walk accompany with a SkipGram model to predict Ui, the feature of Ui can be extracted from Ci. The in natural language processing (NLP) area. Node2vec [48] power of random walk has been verified in multiple areas, improves DeepWalk through a bias random walk. LINE e.g., content recommendation [52] and social representa- [49] learns the vectors using the 1st order proximity and tion [47]. In this study, we employed the simple random 2nd order proximity. TriDNR [50] explores the network walk (SRW) to construct the positive sample model. Be- embedding by combining use of node structure, node con- cause random walk with restart (RWR) biases on the start , and node labels. node, Metropolis-Hasting random walk (MHRW) biases on the nodes with larger degree [53]. 
4 USER IDENTIFICATION WITHOUT PRIOR At the beginning of a random walk, a random user Ui1 is KNOWLEDGE chosen as the root and set as the current user. Then a ran-

Usually, when the similarity of the two given users is dom user Ui2  Fi1 is traversed and set as the current user.

larger than a threshold, the two users are identical. Thus, Following this procedure, a random walk (Ui1, Ui2, …) is the key problem of cross-platform user identification is to generated when l > 0 users are covered. Since the friends find the similarity of any item in Ï(∙, ∙) in the two given SNs. have much more ability to predict a user and those under- Given enough identical users as prior knowledge, many lying connections built upon another user would introduce solutions can be used to define and evaluate the similari- noise [28], only the friends are considered in this study.

ties of the candidate identical users, e.g., the solutions in Subsequently, given a random walk (…, Uik-1, Uik, Uik+1, …),

FRUI [28] and NS [29]. However, it is still a challenging Cik = { Uik-1, Uik+1 }. Clearly, l – 2 positive samples can be gen- task to evaluate the similarity of any two users without erated for each random walk. Finally, S are produced with prior knowledge. To address this issue in the area of social approximately |S| / l processes of random walk. Obvi- network de-anonymization, NM [36] evaluates the similar- ously, S should be large enough to make sure that the ities of all the item in Ï(∙, ∙) by an iteration process. To a friend feature of each user in the given SN can be correctly certain extent, NM only utilizes one feature of the users’ extracted. It is noteworthy that duplicated users are al- network. Intuitively, the performances of NM can be im- lowed in a random walk, thus those prestige users will ap- proved by employing multiple dimensional features. pear several times in a random walk. A A A A A For any individual i in the real world, if the feature of Consider Fig. 1(b) as an example. (Ua , Uc , Ub , Ud , Ub ) is A A A his friend relationships can be extracted and embedded in the result of a random walk. For user Uc , we had Cc = {Ua , A A A A A A A A A A B ⊆ a vector fi, then his friend feature vectors fi in SN and fi Ub } and Nc { Ua , Ub , Ud , Ue }. Thus, (Cc , Uc ) is a positive A A in SNB must be similar, because the individual deploys sample, while (Cc , Ub ) is a negative sample. The positive similar friend relationships in different SNs. Take Fig. 2 as sample model lays the foundation for the friend feature 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. 

vector learning model.

4.1.2 Friend Feature Vector Learning Model

The friend feature vector learning model learns the $f_i$ for any $U_i$ in a given SN. In this study, we used the negative sampling-based CBOW model in NLP [39].

At the beginning, we defined
$$E^i(j) = \begin{cases} 1, & U_i = U_j; \\ 0, & U_i \neq U_j, \end{cases} \tag{4}$$
where $U_i$ and $U_j$ are two users in the given SN.

Here, we used Maximum Likelihood to set up the optimal function. For a given positive sample $(C_i, U_i)$, we had
$$g(C_i, U_i) = \prod_{U_j \in \{U_i\} \cup N_i} p(U_j \mid C_i). \tag{5}$$
In this study, we employed the sigmoidal function to define $p(U_j \mid C_i)$. Let
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \tag{6}$$
we had
$$p(U_j \mid C_i) = \begin{cases} \sigma(c_i^T q_j), & E^i(j) = 1; \\ 1 - \sigma(c_i^T q_j), & E^i(j) = 0, \end{cases} \tag{7}$$
where $q_j \in \mathbb{R}^x$ is an assistant parameter for $U_j$, $c^T$ represents the transpose of $c$, and $c_i$ is the sum of $f_j$, $U_j \in C_i$:
$$c_i = \sum_{U_j \in C_i} f_j. \tag{8}$$
(7) can be rewritten as follows:
$$p(U_j \mid C_i) = [\sigma(c_i^T q_j)]^{E^i(j)} \, [1 - \sigma(c_i^T q_j)]^{1 - E^i(j)}. \tag{9}$$
Thus, (5) turns to be
$$g(C_i, U_i) = \sigma(c_i^T q_i) \prod_{U_j \in N_i} [1 - \sigma(c_i^T q_j)]. \tag{10}$$
Apparently, $\sigma(c_i^T q_i)$ denotes the probability that $C_i$ correctly predicts $U_i$, while $\sigma(c_i^T q_j)$, $U_j \in N_i$, is the probability that $C_i$ predicts user $U_j$. Therefore, for any $C_i$, we desired to maximize $\sigma(c_i^T q_i)$ and minimize $\sigma(c_i^T q_j)$. Thus, the optimal function (10) is deduced. Similarly, given $S$, the Maximum Likelihood function, also the globalized optimal function, is as follows:
$$G = \prod_{(C_i, U_i) \in S} g(C_i, U_i). \tag{11}$$

[Fig. 2. Two-dimensional friend feature vector of SNs in Fig. 1.]

[Fig. 3. Overall framework of friend feature vector model: (a) Graph, (b) Positive Sample Model, (c) Feature Learning.]

To reduce the computation complexity, we converted (11) by taking the logarithm. Thus, we had
$$\begin{aligned}
L = \log G &= \log \prod_{(C_i, U_i) \in S} g(C_i, U_i) = \sum_{(C_i, U_i) \in S} \log g(C_i, U_i) \\
&= \sum_{(C_i, U_i) \in S} \log \prod_{U_j \in \{U_i\} \cup N_i} [\sigma(c_i^T q_j)]^{E^i(j)} [1 - \sigma(c_i^T q_j)]^{1 - E^i(j)} \\
&= \sum_{(C_i, U_i) \in S} \sum_{U_j \in \{U_i\} \cup N_i} \left\{ E^i(j) \log \sigma(c_i^T q_j) + [1 - E^i(j)] \log [1 - \sigma(c_i^T q_j)] \right\} \\
&= \sum_{(C_i, U_i) \in S} \Big\{ \log \sigma(c_i^T q_i) + \sum_{U_j \in N_i} \log [1 - \sigma(c_i^T q_j)] \Big\} \\
&= \sum_{(C_i, U_i) \in S} \Big\{ \log \sigma(c_i^T q_i) + \sum_{U_j \in N_i} \log \sigma(-c_i^T q_j) \Big\}.
\end{aligned} \tag{12}$$
We adopted the asynchronous stochastic gradient algorithm for the optimization objective. Let
$$L(i, j) = E^i(j) \log \sigma(c_i^T q_j) + [1 - E^i(j)] \log [1 - \sigma(c_i^T q_j)]. \tag{13}$$
Employing the properties that
$$\frac{\partial \log \sigma(x)}{\partial x} = 1 - \sigma(x), \tag{14}$$
and
$$\frac{\partial \log [1 - \sigma(x)]}{\partial x} = -\sigma(x), \tag{15}$$
we had
$$\begin{aligned}
\frac{\partial L(i, j)}{\partial q_j} &= \frac{\partial}{\partial q_j} \left\{ E^i(j) \log \sigma(c_i^T q_j) + [1 - E^i(j)] \log [1 - \sigma(c_i^T q_j)] \right\} \\
&= E^i(j) [1 - \sigma(c_i^T q_j)] c_i - [1 - E^i(j)] \sigma(c_i^T q_j) c_i \\
&= \left\{ E^i(j) [1 - \sigma(c_i^T q_j)] - [1 - E^i(j)] \sigma(c_i^T q_j) \right\} c_i \\
&= [E^i(j) - \sigma(c_i^T q_j)] c_i.
\end{aligned} \tag{16}$$
Consequently, the update equation of $q_j$ is:
$$q_j := q_j + \varepsilon [E^i(j) - \sigma(c_i^T q_j)] c_i, \tag{17}$$
where $\varepsilon$ is the learning rate. Higher $\varepsilon$ can make $q_j$ converge quicker but with lower efficiency. Similarly, according to the symmetry of $q_j$ and $c_i$, we had
$$\frac{\partial L(i, j)}{\partial c_i} = [E^i(j) - \sigma(c_i^T q_j)] q_j. \tag{18}$$
Since $c_i$ is the summation of $f_v$, $U_v \in C_i$, the gradient of $c_i$ can directly provide the update of $f_v$. Subsequently, the update equation of $f_v$ is
$$f_v := f_v + \varepsilon \sum_{U_j \in \{U_i\} \cup N_i} \frac{\partial L(i, j)}{\partial c_i}, \quad U_v \in C_i. \tag{19}$$
Unfortunately, (19) is computationally expensive, because it requires a summation over the entire set of users in $N_i$, which is approximately equal to $U$. To address this issue, we adopted Noise Contrastive Estimation (NCE) to sample multiple negative cases according to some noisy distribution for each user $U_i$. This process is also termed negative sampling (NEG) and has proved to be efficient for learning high-quality vector representations [39].
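To make the update rules concrete, the following is a minimal Python sketch of how one random walk can be turned into positive samples and how a single stochastic-gradient step following (8) and (17)-(19) might look. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based storage of $f$ and $q$, the default learning rate, and the caller-supplied list of negative users (a small stand-in for $N_i$) are all hypothetical, and the context keeps both walk neighbours even when they coincide.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) from (6)."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_samples(walk):
    """Turn one random walk into (context, target) pairs.

    Each interior node is a target and its context is its two walk
    neighbours, so a walk of length l yields l - 2 samples.
    """
    return [((walk[k - 1], walk[k + 1]), walk[k]) for k in range(1, len(walk) - 1)]


def update(f, q, context, target, negatives, eps=0.025):
    """One stochastic-gradient step for a positive sample (C_i, U_i).

    f, q: dicts mapping user -> list of floats (feature / auxiliary vectors).
    negatives: users standing in for N_i; assumed not to contain the target.
    """
    dim = len(q[target])
    # c_i is the sum of the context users' feature vectors, as in (8).
    c = [sum(f[u][d] for u in context) for d in range(dim)]
    e = [0.0] * dim  # accumulated gradient passed on to the context vectors
    for j in [target] + list(negatives):
        label = 1.0 if j == target else 0.0  # E^i(j) from (4)
        r = sigmoid(sum(cd * qd for cd, qd in zip(c, q[j])))
        s = eps * (label - r)
        for d in range(dim):
            e[d] += s * q[j][d]  # gradient with respect to c_i, as in (18)
            q[j][d] += s * c[d]  # update of q_j, as in (17)
    for u in context:
        for d in range(dim):
            f[u][d] += e[d]  # update of f_v, as in (19)
```

For the walk in the Fig. 1(b) example, `positive_samples(["a", "c", "b", "d", "b"])` yields three samples, the first being context `("a", "b")` with target `"c"`.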


Specifically, $N_i$ is estimated by $NEG(U_i)$, or $NEG_i$, for each user $U_i$ as follows:
$$NEG_i = \{U_x \mid U_x \sim P_n(v)\}_h, \tag{20}$$
where the subscript of $\{\cdot\}_h$ indicates the number of elements in $\{\cdot\}$, and $h$ is the number of negative samples for each data sample. The task is to distinguish the target user $U_i$ of samples from the noise distribution $P_n(v)$ using logistic regression. We set $P_n(v) \propto d_v^{3/4}$ as proposed in [54], where $n = |U|$ is the total number of users in the SN and $d_v$ is the size of $F_v$. Thus, the objective function for each user can be converted to:
$$\log g(C_i, U_i) \approx \log \sigma(c_i^T q_i) + \sum_{U_j \in NEG_i} \log \sigma(-c_i^T q_j). \tag{21}$$
The first term models the observed user, while the second one reflects the negative samples drawn from the noise distribution. Subsequently, (19) turns to be:
$$f_v := f_v + \varepsilon \sum_{U_j \in \{U_i\} \cup NEG_i} \frac{\partial L(i, j)}{\partial c_i}, \quad U_v \in C_i. \tag{22}$$
Algorithm 1 summarizes the whole process of FFVM. Line 2 generates the negative samples for all the users, line 3 initializes $f_i$ and $q_i$ with random values for each user, and line 4 produces the positive samples $S$. Lines 5-14 iterate over each positive sample and update $f_i$ and $q_i$, while line 15 returns the friend feature vectors for all the users.

Algorithm 1: Friend Feature Vector Model (FFVM)
Input: SN = (U, F), |S|, x
Output: $f_i$, $U_i \in U$
 1: function FFVM(SN, |S|, x)
 2:   Generate $NEG(U_i)$ for all $U_i \in U$ using (20)
 3:   Generate random value for $f_i$ and $q_i$ for all $U_i \in U$
 4:   Generate $S$ using Positive Sample Model
 5:   for each $(C_i, U_i)$ in $S$ do
 6:     $e = 0$
 7:     Calculate $c_i$ using (8)
 8:     for $U_j \in \{U_i\} \cup NEG_i$ do
 9:       $r = \sigma(c_i^T q_j)$
10:       $s = \varepsilon (E^i(j) - r)$
11:       $e := e + s q_j$
12:       $q_j := q_j + s c_i$
13:     for $U_j \in C_i$ do
14:       $f_j := f_j + e$
15:   return $f_i$, $U_i \in U$

4.2 User Identification Using Friend Feature Vector

With the friend feature vector of each user in two given SNs, several metrics can be employed to evaluate the similarities of all the items in Ï(∙, ∙), e.g., Euclidean Distance, Chebyshev Distance, Cosine. Here, we believed that the friend feature vectors of identical users are located much closer than those of non-identical ones. Thus, we used Euclidean Distance to evaluate the similarities of items in Ï(∙, ∙). In order to normalize the similarity, we had:
$$s(U_i^A, U_j^B) = \frac{1}{1 + \log(1 + \|f_i^A - f_j^B\|)}. \tag{23}$$
The user identification problem is often converted into the bipartite matching problem. Different from the perfect matching or complete matching problem, user identification seeks a stable matching. Only the scenario in which one individual owns one account in an SN is considered, and identical users have more similar friend feature vectors than any other pair. Thus, if $U_i^A = U_j^B$, then $s(U_i^A, U_j^B)$ is no less than $s(U_i^A, \cdot)$ and $s(\cdot, U_j^B)$. Accordingly, the user identification function without prior knowledge can be resolved as
$$f(U_i^A, U_j^B) = H\big(s(U_i^A, U_j^B) - \max\{s(U_i^A, \cdot), s(\cdot, U_j^B)\}\big), \tag{24}$$
where $\max\{\cdot\}$ returns the maximal value of $\{\cdot\}$, and $s(U_i^A, \cdot)$ yields the similarities of all the items in Ï($U_i^A$, ∙). $H(y)$ is the Heaviside step function, which is defined as follows:
$$H(y) = \begin{cases} 1, & y \geq a; \\ 0, & y < a. \end{cases} \tag{25}$$
Evidently, user identification without prior knowledge is a challenging task. One round of the feature learning process may not be enough to ensure the performance in some scenarios. Thus, we proposed our user identification algorithm with $t > 0$ rounds of friend feature learning in this part.

Denoting $f^{(t_1)}$ as the $t_1$-th ($1 \leq t_1 \leq t$) friend feature vector of a user, (23) can be generalized:
$$s^{(t_1, t_2)}(U_i^A, U_j^B) = \frac{1}{1 + \log\big(1 + \|f_i^{A(t_1)} - f_j^{B(t_2)}\|\big)}, \tag{26}$$
where $1 \leq t_1, t_2 \leq t$. In order to reduce the running time complexity, we simplified (24) to be:
$$f^{(t_1, t_2)}(U_i^A, U_j^B) = H\big(s^{(t_1, t_2)}(U_i^A, U_j^B) - \max\{s^{(t_1, t_2)}(U_i^A, \cdot)\}\big). \tag{27}$$
With $t$ feature matrices for each SN, $t^2$ rounds of user identification, resulting from the $t^2$ pairwise combinations of the feature matrices, are executed. For each pair, only those items in Ï(∙, ∙) with $f(\cdot, \cdot) = 1$ are left. Noticing that the identical users occur several times, we summed the similarities. Finally, the similarity is defined as
$$s(U_i^A, U_j^B) = \sum_{t_1 = 1}^{t} \sum_{t_2 = 1}^{t} \big[f^{(t_1, t_2)}(U_i^A, U_j^B) \cdot s^{(t_1, t_2)}(U_i^A, U_j^B)\big]. \tag{28}$$
In addition, (24) is used to ensure a one-to-one mapping.

It is conceivable that the friend features of the users with only a few friends can hardly be learned. This is intuitive because only limited contexts can be formed for these users. Subsequently, to increase the precision, both similarities and friend sizes should be considered. Here we upgrade $s(\cdot, \cdot)$ for each item in Ï(∙, ∙):
$$s(U_i^A, U_j^B) := s(U_i^A, U_j^B) \times \log\big(\min(|F_i^A|, |F_j^B|)\big), \tag{29}$$
where $\min\{\cdot\}$ returns the minimal value of $\{\cdot\}$. Only the items in Ï(∙, ∙) with $s(\cdot, \cdot)$ above $\lambda \geq 0$ are considered as the identical users. Subsequently, we have
$$f(U_i^A, U_j^B, \lambda) = H\big(s(U_i^A, U_j^B) - \max\{s(U_i^A, \cdot)\}\big) \times H\big(s(U_i^A, U_j^B) - \lambda\big). \tag{30}$$

Algorithm 2: FRUI-P
Input: SN$^A$, SN$^B$, $|S^A|$, $|S^B|$, $x$, $t$, $\lambda$
Output: Identical users
 1: function FRUI-P(SN$^A$, SN$^B$, $t$, $\lambda$)
 2:   $i = 0$, $t_1 = 0$, $t_2 = 0$, $F^A = []$, $F^B = []$
 3:   for $i$++ < $t$ do
 4:     $F^A[i]$ = FFVM(SN$^A$), $F^B[i]$ = FFVM(SN$^B$)
 5:   for $t_1$++ < $t$ do
 6:     for $t_2$++ < $t$ do
 7:       for each $U_m^A$ in SN$^A$, $U_n^B$ in SN$^B$ do
 8:         update $s(U_m^A, U_n^B)$ using (28)
 9:   update $s(U_m^A, U_n^B)$ using (29)
10:   return Ï($U_m^A$, $U_n^B$) with $f(U_m^A, U_n^B, \lambda) = 1$

Algorithm 2 presents FRUI-P. Line 2 initializes the parameters. Lines 3 and 4 learn the friend feature vectors for each node. Lines 5-9 update the similarities of all the node pairs.
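The similarity and mapping rules in (23), (24), (29) and (30) can be sketched in a few lines of plain Python for the single-round case $t = 1$. This is only an illustrative sketch under assumptions: the function names and the dictionary inputs (user to feature vector, user to friend count) are hypothetical, the step function is taken to fire at zero so that a pair is kept when its similarity equals the row and column maxima, every user is assumed to have at least one friend so that the logarithm in (29) is defined, and ties are not broken, so two equally similar candidates would both be returned.

```python
import math


def similarity(fa, fb):
    """Normalized Euclidean similarity of two feature vectors, as in (23)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
    return 1.0 / (1.0 + math.log(1.0 + dist))


def identify(feats_a, feats_b, friends_a, friends_b, lam=0.0):
    """Return candidate identical pairs (user_a, user_b) for one round.

    feats_*: dict user -> feature vector; friends_*: dict user -> friend count.
    """
    # Pairwise similarity weighted by the smaller friend count, as in (29).
    s = {
        (i, j): similarity(fa, fb) * math.log(min(friends_a[i], friends_b[j]))
        for i, fa in feats_a.items()
        for j, fb in feats_b.items()
    }
    best_in_row = {i: max(s[i, j] for j in feats_b) for i in feats_a}
    best_in_col = {j: max(s[i, j] for i in feats_a) for j in feats_b}
    # Keep a pair only if it is the best match in both directions (24)
    # and its similarity reaches the threshold lam (30).
    return [
        (i, j)
        for (i, j), v in s.items()
        if v >= best_in_row[i] and v >= best_in_col[j] and v >= lam
    ]
```

Raising `lam` trades recall for precision, since pairs whose weighted similarity falls below the threshold are dropped even when they are mutual best matches.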

Line 10 ensures the one-to-one mapping and returns the identical users with similarity larger than $\lambda$.

4.3 Discussions

Lemma 1. The time complexity of FFVM is $O(|U|) + O(|S| \times x^2)$, where $|U|$, $|S|$ and $x$ are the number of users in the SN, the number of positive samples and the dimension of the friend feature vector, respectively.

Proof. The generation of negative sampling is $O(|U|)$. As $|S|$ positive samples require approximately $|S|$ random picks of nodes or their neighbors, the random walk process costs $O(|S|)$ in time complexity. Lines 8-14 in FFVM are the process for each $(C_i, U_i)$, and the time complexity is $O(|NEG_i| \times x^2)$, where $|NEG_i|$ is the number of users in $NEG_i$. Thus, the total complexity of FFVM is $O(|U|) + O(|S|) + O(|S| \times |NEG_i| \times x^2) = O(|U|) + O(|S| \times x^2)$, because $|NEG_i|$ is a small integer. □

Although FFVM also learns the latent friend representations of users from the friend relationships, it is different from the current network embedding solutions. We take DeepWalk [47] as a comparison since its process is similar to FFVM. Firstly, only the friends are taken into account in modeling $S$ because underlying connections built upon another user would introduce noise in identifying users [28]. However, to model the global network feature, DeepWalk encloses a host of unconnected users. Secondly, CBOW is employed in FFVM while SkipGram [39] is utilized in DeepWalk. Finally, to speed up training, FFVM uses negative sampling, while DeepWalk exploits Hierarchical Softmax [55].

Theorem 1. The total time complexity of FRUI-P is $O(t \times \max\{|S^A|, |S^B|\} \times x^2) + O(t^2 \times \max\{|U^A|, |U^B|\}^2)$, where $t$, $|U^A|$, $|U^B|$, $|S^A|$, $|S^B|$ and $x$ are the rounds of friend feature learning, the number of users in SN$^A$ and SN$^B$, the number of positive samples of SN$^A$ and SN$^B$, and the dimension of the friend feature vector, respectively.

Proof. The time complexity of $t$ rounds of friend feature vector learning is $O(t \times |U^A|) + O(t \times |S^A| \times x^2) + O(t \times |U^B|) + O(t \times |S^B| \times x^2) \leq O(t \times \max\{|U^A|, |U^B|\}) + O(t \times \max\{|S^A|, |S^B|\} \times x^2)$. Lines 7-8 in FRUI-P calculate the similarity of each item in Ï(∙, ∙) with $f(\cdot, \cdot) = 1$ and cost $O(|U^A||U^B|)$ in time complexity. Thus, the time complexity of Lines 5-8 is $O(t^2 \times |U^A||U^B|)$. Since Line 9 costs no more than $O(|U^A||U^B|)$, the total time complexity of FRUI-P is $O(t \times \max\{|U^A|, |U^B|\}) + O(t \times \max\{|S^A|, |S^B|\} \times x^2) + O(t^2 \times |U^A||U^B|) + O(|U^A||U^B|) = O(t \times \max\{|S^A|, |S^B|\} \times x^2) + O(t^2 \times \max\{|U^A|, |U^B|\}^2)$. □

Obviously, larger $t$ will increase the complexity of FRUI-P; however, it also improves the recall rate and precision of FRUI-P. More positive samples ensure adequate learning of the friend feature vector, and larger $x$ means that features from more dimensions are taken into account. Increasing either $|S|$ or $x$ raises the complexity of FRUI-P, yet can improve the performance of user identification. Our empirical studies also confirm that $t$, $|S|$ and $x$ can enhance the performance of FRUI-P. $\lambda$ guarantees the minimal similarity of the identified users and ensures the precision. Since identical users with similarity less than $\lambda$ are missed, higher $\lambda$ results in lower recall rate. When $\lambda = 0$, all the identified users are considered as identical users.

Since $|S| \geq |F|$, and $|S| \leq |U|^2$ in practice, we approximately took $O(|S|) = O(|U|^2)$. Thus, the time complexity of FRUI-P can be simplified as $O(t \times \max\{|S^A|, |S^B|\} \times x^2) + O(t^2 \times \max\{|U^A|, |U^B|\}^2) = O(x^2 \times \max\{|U^A|, |U^B|\}^2) \approx O(\max\{|U^A|, |U^B|\}^2)$.

Note that the updates of $f_i$ and $q_i$ in FFVM do not acquire a lock to access the shared parameters. This allows us to use an asynchronous version of the stochastic gradient algorithm in the multi-worker case and helps achieve an optimal rate of convergence [56]. Also, (26) and (27) can work in a parallel manner. Consequently, a parallel version of FRUI-P is easy to implement, which makes FRUI-P scalable.

5 EXPERIMENTAL STUDIES

We verified FRUI-P in both the synthetic and ground-truth networks. All the experiments were conducted using a computer with 8 GB memory and a 2.8 GHz CPU.

We used NM as the main baseline because it is the closest to FRUI-P as a state-of-the-art, network structure-based unsupervised algorithm, which won the first prize in the de-anonymization task of WSDM 2013 Data Challenge [38]. In our empirical testing of NM, the sizes of the graphs are no more than 1,000 nodes, because the running time and the space requirement of NM, which are $O(n^5)$ and $O(n^2)$ respectively, limit its scalability. For instance, for a pair of graphs with 50,000 nodes, NM requires no less than 50,000 × 50,000 × 4 B × 2 (NM uses the $k$-th and $(k+1)$-th similarity matrices) = 20 GB of memory, which is beyond the capacity of our experimental computer.

FRUI is the other baseline in this study, because it is the state-of-the-art semi-supervised user identification algorithm purely using network structure.

We employed recall rate, precision and F1-measure to measure the performance. They are defined as
$$\begin{aligned}
\text{Recall rate} &= \frac{\text{\# of correct identified users}}{\text{\# of total identical users}}, \\
\text{Precision} &= \frac{\text{\# of correct identified users}}{\text{\# of total identified users}}, \\
\text{F1-measure} &= \frac{2 \times \text{Recall rate} \times \text{Precision}}{\text{Recall rate} + \text{Precision}}.
\end{aligned} \tag{31}$$
Higher recall rate, precision and F1-measure indicate better performance of a user identification solution.

Besides the real-world experiments, we also conducted extensive experiments on synthetic networks to evaluate the performance of FRUI-P. SNs are naturally overlapped in both users (nodes) and friends (edges) [28]. The node overlap is the fundamental assumption for all the user identification solutions, while edge overlap lays the foundation for all the network structure-based algorithms. To achieve various degrees of node and edge overlap, we added graphs with different levels of noise. For a given network, a subnetwork is generated by keeping every node/edge with a sampling probability. Through two rounds of the sampling process, a pair of subnetworks are produced, and the common nodes kept in both subnetworks are identical. Fig. 4 presents an illustration. We introduced the Jaccard Coefficient to measure the degree of


node/edge overlap,
$$overlap(X^A, X^B) = \frac{|X^A \cap X^B|}{|X^A \cup X^B|}, \tag{32}$$
where $overlap(\cdot, \cdot)$ denotes the degree of overlap and $X$ stands for the $U$ or $F$ of the two graphs. Thus, when each of the two graphs shares 2/3 of its nodes, $overlap(U^A, U^B) = 0.5$. The degree of edge overlap was limited by the relationships between the overlapping nodes.

[Fig. 4. Overlap sampling of a network.]

5.1 Synthetic Network Experiments

To validate the performance of FRUI-P, we conducted experiments in the ER [40] random networks, WS [41] small-world networks and BA [42] networks. The properties of the three synthetic networks are described in our previous studies [28]. Both ER and WS networks are generated from the regular random network by rewiring each edge with a probability. If all edges are rewired so that the probability of rewiring equals 1, the network turns out to be an ER network; otherwise, it is a WS network. In the experiments, the probability of rewiring in a WS network was 0.5.

We generated 75 pairs of networks in experiments to illustrate the performance of FRUI-P in synthetic networks. In all the three synthetic network experiments, five networks with 5,000 nodes were created. In the five ER and WS network experiments, $p$, the probability that an edge exists between two nodes, equals 0.02, 0.04, 0.06, 0.08 and 0.1, respectively. Similarly, in the BA network experiment, the number of edges to attach a new node to existing nodes, denoted as $m$, increased from 20 to 100 by 20. Subsequently, 75 pairs of networks were generated from the 15 synthetic networks, with $overlap(U^A, U^B) = 1$ and $overlap(F^A, F^B)$ equal to 0.25, 0.36, 0.49, 0.64 and 0.81. We set $t = 1$, $\lambda = 0$, $|S| = 50|F|$ and $x = 500$ in all these experiments. It is notable that all the edges are sampled randomly in our experiments, which can model the real world SNs more precisely. This random sampling also describes the difference between the cross-platform user identification task and the de-anonymization task, in which the two SNs totally overlap among a host of nodes.

The results of empirical testing in ER networks are displayed in Table 2. The performance of FRUI-P increases with the growth of the overlap of the edges between the two SNs. It is also remarkable that FRUI-P identified almost all identical nodes when $overlap(F^A, F^B) \geq 0.49$. Moreover, all the 25 pairs of networks have precisions larger than 90%. The performance of FRUI-P in WS networks is shown in Table 3. Analogous to the experiments in the ER network, FRUI-P recognized no less than 90% of all identical users when $overlap(F^A, F^B) \geq 0.36$, and the precisions in all the 25 pairs of networks are at least 96.3%. Seemingly, the performance of FRUI-P in WS networks is somewhat better than that in ER networks. The results presented in Table 4 show that FRUI-P also performs well in BA networks. Although the performance of FRUI-P was not as good in BA networks as in ER and WS networks, the precisions are at least 91.5% when $overlap(F^A, F^B) \geq 0.49$. Additionally, under the same overlap condition, almost all the identified users are correct when $m \geq 40$, and almost all the identical users are recognized when $overlap(F^A, F^B) \geq 0.64$. Undoubtedly, the performance of FRUI-P in BA networks can be further improved by increasing $x$, $t$ and $|S|$, as verified in Section 5.2. In all the experiments on the ER, WS and BA networks, FRUI-P revealed nearly all identical nodes when $overlap(F^A, F^B) \geq 0.64$, which approximately equals that in real-world SNs. This indicates that FRUI-P can address the user identification task without seed users. Also, the precision is almost 1 when $overlap(F^A, F^B) \geq 0.49$, which verifies that FRUI-P can be used to generate the prior knowledge for some other user identification algorithms, e.g., FRUI [28].

We also conducted experiments to compare the effectiveness of FRUI-P, FRUI and NM. The comparisons of FRUI-P, FRUI and NM in the three synthetic networks are shown in Fig. 5. We changed $overlap(F^A, F^B)$ from 0.25 to 1 in the experiments, with $p = 0.02$ in the ER and WS networks and $m = 20$ in the BA networks. We used the graphs with 5,000 nodes in the evaluation of FRUI-P, FRUI and NM. The size of prior knowledge of FRUI was 50. Since the recall rate, precision and F1-measure exhibit the same trends in the experiments, only the recall rates are demonstrated. The results of experiments conducted in the networks generated by ER and WS models are displayed in Fig. 5(a) and (b). Evidently, both FRUI-P and FRUI identify almost all the identical nodes when $overlap(F^A, F^B) \geq 0.36$ in ER and WS networks. The recall rates of the three algorithms in BA networks are illustrated in Fig. 5(c). Both FRUI-P and FRUI have a promising performance, especially when $overlap(F^A, F^B) \geq 0.64$. In contrast, NM recognizes all the identical nodes when $overlap(F^A, F^B)$ is almost 1, whereas it recognizes little if anything when $overlap(F^A, F^B) \leq 0.81$, in all these three synthetic networks. That means that NM performs inadequately when the two graphs are sampled randomly. Accordingly, we conclude that NM cannot be applied to the cross-platform user identification task when the two SNs overlap at around 65%.

In addition, a more delicate comparison between FRUI-P and FRUI, in all the three synthetic networks with $overlap(F^A, F^B) = 0.25$, is shown in Fig. 6. We compared FRUI-P with FRUI, which was given 50 and 100 matched identical users as prior knowledge. Conspicuously, FRUI-P outperforms FRUI in WS and BA networks when only a small slice of identical users is given in advance.

5.2 Effect of Parameters

Three parameters, namely $x$, $|S|$ and $t$, can improve the performance of FRUI-P, while $\lambda$ is introduced to ensure the high precision of FRUI-P. This section validates the effect of the four parameters. Unless otherwise specified, $overlap(F^A, F^B) = 0.64$, $x = 500$, $|S| = 50|F|$, $t = 1$, $\lambda = 0$ in this part.
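The overlap measure in (32) and the probabilistic sampling used to build a pair of subnetworks can be illustrated with a short Python sketch. It is a minimal example under assumptions rather than the authors' generator: the function names and the use of a seeded `random.Random` instance are hypothetical, and elements (nodes or edges) are simply any hashable, sortable values.

```python
import random


def overlap(xa, xb):
    """Jaccard coefficient of two node sets or two edge sets, as in (32)."""
    xa, xb = set(xa), set(xb)
    union = xa | xb
    return len(xa & xb) / len(union) if union else 0.0


def sample_subnetwork(elements, keep_prob, rng):
    """Keep every node or edge independently with probability keep_prob."""
    # Sorting makes the draw order, and hence the result, reproducible.
    return {e for e in sorted(elements) if rng.random() < keep_prob}


# Two graphs of six nodes each that share four nodes (2/3 of each graph).
print(overlap(range(0, 6), range(2, 8)))  # 0.5

# A pair of edge samples drawn from one original edge set.
rng = random.Random(0)
original = {(u, u + 1) for u in range(100)}
edges_a = sample_subnetwork(original, 0.8, rng)
edges_b = sample_subnetwork(original, 0.8, rng)
print(overlap(edges_a, edges_b))
```

The first printed value reproduces the worked case in the text: four shared nodes out of eight distinct nodes give an overlap of 0.5.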


As shown by the empirical testing on the three synthetic networks, FRUI-P identifies almost all the identical nodes in the ER and WS networks. Thus, we only conducted further experiments on the BA networks to show the effect of these parameters on FRUI-P.

Since enlarging $|S|$ can provide more positive samples for the friend feature learning, we varied $|S|$ from $30|F|$ to $70|F|$ by steps of $10|F|$ to observe the influence of the number of positive samples on FRUI-P. The results shown in Fig. 7(a) and (b) reveal that more positive samples can ensure more identical users with higher precision. We also varied $x$ from 100 to 900 by steps of 200 to evaluate the effect of $x$ on FRUI-P. As illustrated in Fig. 7(c) and (d), larger $x$ can also result in higher recall rate with higher precision.

The experimental results with $overlap(F^A, F^B) = 0.49$ are shown in Fig. 8(a). The recall rates increase along with $t$. In other words, higher $t$ results in more identical users, no matter how sparse or dense the networks are. The results presented in Fig. 8(b) reveal the effect of $t$ with $m = 60$. No matter how much overlap there is between the two SNs, higher $t$ also ensures more identical users.

TABLE 2
PERFORMANCES OF FRUI-P IN ER NETWORKS WITH t = 1, λ = 0.

Index         Edge overlap   p = 0.02   p = 0.04   p = 0.06   p = 0.08   p = 0.1
Recall Rate   0.25           0.756      0.893      0.879      0.814      0.693
              0.36           0.978      0.993      0.986      0.956      0.864
              0.49           1          1          1          0.998      0.974
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1
Precision     0.25           0.990      0.996      0.989      0.968      0.906
              0.36           1          1          0.999      0.998      0.978
              0.49           1          1          1          1          1
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1
F1-Measure    0.25           0.857      0.942      0.931      0.884      0.785
              0.36           0.989      0.996      0.993      0.977      0.917
              0.49           1          1          1          0.999      0.987
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1

TABLE 3
PERFORMANCES OF FRUI-P IN WS NETWORKS WITH t = 1, λ = 0.

Index         Edge overlap   p = 0.02   p = 0.04   p = 0.06   p = 0.08   p = 0.1
Recall Rate   0.25           0.781      0.957      0.953      0.921      0.861
              0.36           0.975      0.998      0.997      0.990      0.953
              0.49           1          1          1          1          1
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1
Precision     0.25           0.963      0.997      0.999      0.993      0.977
              0.36           0.999      1          1          1          0.999
              0.49           1          1          1          1          1
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1
F1-Measure    0.25           0.863      0.977      0.975      0.956      0.915
              0.36           0.987      0.999      0.998      0.995      0.974
              0.49           1          1          1          1          1
              0.64           1          1          1          1          1
              0.81           1          1          1          1          1

TABLE 4
PERFORMANCES OF FRUI-P IN BA NETWORKS WITH t = 1, λ = 0.

Index         Edge overlap   m = 20     m = 40     m = 60     m = 80     m = 100
Recall Rate   0.25           0.023      0.041      0.061      0.087      0.107
              0.36           0.080      0.201      0.304      0.341      0.394
              0.49           0.407      0.703      0.835      0.861      0.859
              0.64           0.953      0.996      0.999      0.999      0.998
              0.81           1          1          1          1          1
Precision     0.25           0.230      0.334      0.446      0.477      0.513
              0.36           0.495      0.790      0.856      0.890      0.917
              0.49           0.915      0.992      0.998      0.999      0.999
              0.64           0.999      1          1          1          1
              0.81           1          1          1          1          1
F1-Measure    0.25           0.042      0.073      0.108      0.147      0.177
              0.36           0.138      0.320      0.448      0.494      0.551
              0.49           0.563      0.823      0.909      0.925      0.924
              0.64           0.976      0.998      1          0.999      0.999
              0.81           1          1          1          1          1

TABLE 5
PERFORMANCE OF FRUI-P IN REAL WORLD SNS

      Pair of subgraphs (nodes)      Precision (# of identified users)
#     Sina       RenRen              FRUI-P           NM               FRUI-P+FRUI
1     3145       5217                0.767 (2574)     0.080 (3145)     0.493 (1587)
2     3892       4919                0.833 (3025)     0.093 (3892)     0.513 (1639)
3     3546       5018                0.817 (2577)     0.087 (3546)     0.507 (1616)

Note: The precisions of FRUI-P and NM are evaluated in the top 300 identified users, while FRUI-P + FRUI in random 300 identified users.
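As a sanity check on the metric definitions in (31), the following minimal Python sketch computes recall rate, precision and F1-measure from sets of user pairs. The function names and the representation of results as sets of `(user_a, user_b)` tuples are assumptions made for illustration. The last line recomputes one reported value: a recall rate of 0.756 and a precision of 0.990 (ER networks, $p = 0.02$, edge overlap 0.25) give an F1-measure of 0.857, matching the reported figure.

```python
def f1_measure(recall, precision):
    """Harmonic mean of recall rate and precision, as in (31)."""
    total = recall + precision
    return 2.0 * recall * precision / total if total else 0.0


def evaluate(identified, truth):
    """Return (recall rate, precision, F1-measure).

    identified: set of (user_a, user_b) pairs returned by an algorithm.
    truth: set of (user_a, user_b) pairs that are really identical.
    """
    correct = len(identified & truth)
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(identified) if identified else 0.0
    return recall, precision, f1_measure(recall, precision)


print(round(f1_measure(0.756, 0.990), 3))  # 0.857
```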

[Fig. 5. Comparisons of FRUI-P, FRUI and NM in synthetic networks: (a) ER network, (b) WS network, (c) BA network. Each panel plots recall rate against overlap(FA, FB). 1% prior knowledge is provided in FRUI. p = 0.02 in ER and WS networks, and m = 20 in BA networks.]


[Fig. 6. Comparisons between the FRUI-P and FRUI in synthetic networks: (a) ER network (recall rate against p), (b) WS network (recall rate against p), (c) BA network (recall rate against m). The sizes of the prior knowledge in FRUI are 1% and 2%. In all the three synthetic networks, the edge overlap is 0.25.]

[Fig. 7. Effect of w and x in FRUI-P in BA networks: (a) Effect of w on recall rate, (b) Effect of w on precision, (c) Effect of x on recall rate, (d) Effect of x on precision. Each panel is plotted against m. overlap(FA, FB) = 0.36 in the experiments. In (a) and (b), x = 500. In (c) and (d), w = |F|.]

[Fig. 8. Effect of t and λ in FRUI-P in BA networks: (a) Effect of t by varying m, (b) Effect of t by varying overlap(FA, FB), (c) Effect of λ, (d) Tradeoff. In (a), overlap(FA, FB) = 0.49. In (b), m = 60. In (c) and (d), m = 60, overlap(FA, FB) = 0.36.]

The experimental results with $m = 60$ and $overlap(F^A, F^B) = 0.36$ are shown in Fig. 8(c) and (d). Fig. 8(c) demonstrates the relationship between the precision and $\lambda$. Clearly, the precision increases along with $\lambda$, and almost all the identified users are correct when $\lambda = 1$. We also present the tradeoff between precision and recall rate along with $\lambda$ in Fig. 8(d). It conveys that the improvement of precision may cause a decline in recall rate. Noticeably, FRUI-P can identify 20% of identical users with no less than 95% precision, even for the condition $t = 1$. This indicates that FRUI-P has the ability to provide prior knowledge for other solutions, e.g., FRUI.

5.3 Social Network Experiments

In this section, we use ground truth datasets to evaluate the user identification resolution. In order to verify FRUI-P in different types of SNs, we collected data from two heterogeneous SNs: Sina Microblog and RenRen. Sina Microblog has 1.17M users and 1.9M friend relationships, while RenRen has 5.5M users and 14.6M friend relationships. More details of the datasets can be found in our previous studies [24]. To evaluate the performance of FRUI-P with real-world datasets, we generated a series of controlled datasets. We selected a pair of subgraphs from both SNs, and each had over 50,000 nodes.

In the experiments, we randomly chose a number of shared nodes as a priori identical users for FRUI, and set $t = 1$, $\lambda = 0$, $x = 500$ and $|S| = 50|F|$ for FRUI-P. Then, we executed user identification in both FRUI-P and FRUI. We increased the percentage of a priori identical users in all identical users from 0.01 to 0.1 by 0.01. Since the average degrees of both Sina Microblog and RenRen are fairly low, only the nodes with no less than $\theta$ neighbors were selected as overlapping nodes. To check the performance of FRUI-P, we increased $\theta$ from 20 to 100 by a step of 20 each time. Although only those users with friends above $\theta$ are selected, after the sampling, the degree distributions also follow the power-law distribution. That is, many, if not most, users in the sampled SNs have fewer than 5 friends and are difficult to identify.

[Fig. 9. Comparison between FRUI-P and FRUI in the Sina and RenRen datasets: (a) Sina Microblog, (b) RenRen. Each panel plots recall rate against θ. Node overlap is 33%, and edge overlap is 33%. In FRUI, 1% and 8% prior knowledge are compared. The minimal degrees vary from 20 to 100 by steps of 20.]

The recall rates of FRUI-P and FRUI in both Sina and RenRen networks are compared in Fig. 9. In FRUI, 1% and 8% of the identical users are the prior knowledge. Although FRUI shows a better performance, FRUI-P also has the ability to identify around 20% of all the identical users without any prior knowledge of the identical users. Moreover, FRUI-P finds slightly more identical users than FRUI in the SNs with $\theta = 20$, when only 1% are given as the prior

XIAOPING ZHOU ET AL.: STRUCTURE BASED USER IDENTIFICATION ACROSS SOCIAL NETWORKS 12

knowledge in FRUI. The performance of FRUI-P in real-world SNs is less satisfactory than in synthetic networks, as almost half of the users in these real-world SNs have only one friend, so their features are virtually impossible to learn from their friends.

We also examined the influence of the number of friends and λ on FRUI-P, which helps explain the performance of FRUI-P in real-world SNs. The trends of precision over the number of friends, in the SNs with θ = 100, are shown in Fig. 10(a). Evidently, the precision of FRUI-P increases with the number of friends. Moreover, the precision is above 80% for the users with more than 60 friends. This suggests that FRUI-P identifies users with more friends with high precision, and verifies that the friend features of users with few friends can hardly be learned. Additionally, almost all the identified users are truly identical when the users have more than 120 friends in Sina and 250 friends in RenRen. The trends of precision along with λ are presented in Fig. 10(b). Similar to the experiments in synthetic networks, a higher λ results in higher precision of the identification results, and the precision reaches 0.9 with λ = 0.4 in both the Sina and RenRen networks. The tradeoff between precision and recall rate is displayed in Fig. 10(c). FRUI-P returns more than 15% of the identical users with 95% precision in both the Sina and RenRen networks. As illustrated in a previous study [21], only a small portion of prior knowledge, e.g., 1%, can help identify more than 40% of the identical users in the Sina dataset and 60% of the identical users in the RenRen dataset. Accordingly, FRUI-P can be a powerful complement to FRUI, and it can be applied to provide prior knowledge for those algorithms that require it.

We also evaluated the performance of FRUI-P in user identification across Sina Microblog and RenRen by conducting three groups of experiments. In each experiment, we selected a pair of subgraphs by starting with the identical users from the Renmin University of China (RUC) and extracting two-layer friends using a breadth-first search. The leaf nodes are removed, and only the users with an education background of RUC are left. Since the exact number of identical users is unknown, only the precision was compared. With λ = 0, FRUI-P returns almost as many identical users as NM. Here we chose the 300 identified users with the highest similarity to check the precision of FRUI-P and NM. Table 5 illustrates the empirical results. The precisions of FRUI-P are around 80%, outperforming NM in all three experiments. These findings reveal that FRUI-P is much more proficient at recognizing identical users across Sina Microblog and RenRen.

Finally, we conducted experiments jointly using FRUI-P and FRUI. The 150 identified users with the highest similarity in FRUI-P are taken as the prior knowledge for FRUI. We randomly chose 300 identified users to check the precisions. The empirical results show that around half of the users identified by jointly using FRUI-P and FRUI are truly identical. The total numbers of identified users and the precisions are presented in Table 5. The results confirm that FRUI-P can be applied to obtain the prior knowledge for the supervised or semi-supervised schemes.

6 CONCLUSIONS

This study addressed the problem of user identification across SN platforms and offered an innovative solution. As a key aspect of SN, network structure is of paramount importance and helps resolve the user identification tasks. Many studies have reported that identical users have similar friends in different SN platforms. Therefore, we proposed a uniform network structure-based user identification solution. Based on our previous study [28], we developed a novel unsupervised friend relationship-based algorithm without prior knowledge called FRUI-P. We also addressed the complexity and discussed its scalability. Finally, we verified our algorithm in both synthetic networks and ground-truth networks.

Results of our empirical experiments reveal that the network structure can accomplish important user identification tasks. Our FRUI-P algorithm is efficient, and performs much better than NM. In scenarios where raw text data is sparse, incomplete, or hard to obtain due to privacy settings, FRUI-P is extremely suitable for cross-platform tasks. Moreover, FRUI-P can provide reliable prior knowledge for many prior knowledge-based user identification schemes, e.g., FRUI. Since the parallel version of FRUI-P is easy to implement, our method is scalable and can be easily applied to large datasets. To improve the efficiency, preprocessing steps that reduce the candidate identical nodes for each node on another graph will be explored in our future work. Clearly, the anonymous identical user identification method without prior knowledge accommodates more scenarios where the seed users are hard to obtain, thus largely broadening its applicability.

Fig. 10. Influence of the number of friends and λ on FRUI-P: (a) effect of number of friends (precision versus number of friends), (b) effect of λ (precision versus λ), (c) tradeoff (precision versus recall rate), each for Sina and RenRen. Node overlap is 33%, and edge overlap is 33%. The SNs are sampled with minimal degree no less than 100.
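As an illustration of the precision and recall tradeoff reported for Fig. 10(b) and (c), and of the step in Section 5.3 where the highest-similarity FRUI-P matches are handed to FRUI as prior knowledge, the following Python sketch may help. It is not the authors' implementation. It assumes, for illustration only, that λ acts as a minimum-similarity cutoff on the matched pairs; the names `precision_recall` and `top_k_seeds`, the data layout, and the toy values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (user in network A, user in network B, similarity).
Match = Tuple[str, str, float]


def precision_recall(matches: List[Match], truth: Dict[str, str],
                     lam: float) -> Tuple[float, float]:
    """Precision and recall of the matches whose similarity is at least lam.

    truth maps each user in network A to its identical user in network B.
    Assumption: lam is treated as a plain similarity cutoff.
    """
    kept = [(a, b) for a, b, sim in matches if sim >= lam]
    correct = sum(1 for a, b in kept if truth.get(a) == b)
    # With nothing returned there are no wrong matches; report 1.0 by convention.
    precision = correct / len(kept) if kept else 1.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def top_k_seeds(matches: List[Match], k: int) -> List[Tuple[str, str]]:
    """The k highest-similarity pairs, usable as seed users for a seeded method."""
    ranked = sorted(matches, key=lambda m: m[2], reverse=True)
    return [(a, b) for a, b, _ in ranked[:k]]


# Toy data, invented for this sketch only.
matches = [("a1", "b1", 0.9), ("a2", "b2", 0.7), ("a3", "b9", 0.5),
           ("a4", "b4", 0.3), ("a5", "b7", 0.2)]
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}

for lam in (0.0, 0.4, 0.8):
    p, r = precision_recall(matches, truth, lam)
    print(f"lambda={lam:.1f} precision={p:.2f} recall={r:.2f}")

print(top_k_seeds(matches, 2))
```

On this toy data, raising the cutoff from 0 to 0.8 lifts precision from 0.6 to 1.0 while recall falls from 0.6 to 0.2, which mirrors the direction of the tradeoff described above; the actual curves depend on the real similarity scores.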


Identifying anonymous users across multiple SNs is challenging work. Therefore, only a portion of identical users with different nicknames can be recognized with this method. This study built the foundation for further studies on this issue. Ultimately, it is our hope that a final approach can be developed to identify all identical users with different nicknames in an ideal case. Other user identification methods can be applied simultaneously to examine multiple SN platforms. These methods are complementary and not mutually exclusive, since the final decision may rely on human involvement. Therefore, we suggest applying these methods synergistically; a proper use of their strengths should benefit their common goals.

ACKNOWLEDGEMENTS

This work was supported by the Natural Science Foundation of China under grant nos. 71531012 and 71601013, the Beijing Natural Science Foundation under grant nos. 4172032 and 4174087, and the Scientific Research Project of Beijing Educational Committee (no. SQKM201710016002).

REFERENCES

[1] X. H. F. Wang, Y. Fang, I. Qureshi, and O. Janssen, "Understanding employee innovative behavior: Integrating the social network and leader–member exchange perspectives," Journal of Organizational Behavior, vol. 36, no. 3, pp. 403-420, 2015.
[2] D. Duvanova, A. Nikolaev, A. Nikolsko-Rzhevskyy, and A. Semenov, "Violent conflict and online segregation: An analysis of social network communication across Ukraine's regions," Journal of Comparative Economics, vol. 44, no. 1, pp. 163-181, 2016.
[3] D. R. Farine and H. Whitehead, "Constructing, conducting and interpreting animal social network analysis," Journal of Animal Ecology, vol. 84, no. 5, pp. 1144-1163, 2015.
[4] M. Balduzzi, C. Platzer, T. Holz, E. Kirda, D. Balzarotti, and C. Kruegel, "Abusing social networks for automated user profiling," Int. Workshop on Recent Advances in Intrusion Detection, pp. 422-441, 2010.
[5] D. Perito, C. Castelluccia, M.A. Kaafar, and P. Manils, "How unique and traceable are usernames?," Privacy Enhancing Technologies (PETS’11), pp. 1-17, 2011.
[6] J. Liu, F. Zhang, X. Song, Y.I. Song, C.Y. Lin, and H.W. Hon, "What's in a name?: an unsupervised approach to link users across communities," Proc. of the 6th ACM international conference on Web search and data mining (WDM’13), pp. 495-504, 2013.
[7] R. Zafarani and H. Liu, "Connecting corresponding identities across communities," Proc. of the 3rd International ICWSM Conference, pp. 354-357, 2009.
[8] R. Zafarani and H. Liu, "Connecting users across sites: a behavioral-modeling approach," Proc. of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’13), pp. 41-49, 2013.
[9] A. Acquisti, R. Gross, and F. Stutzman, "Privacy in the age of augmented reality," Proc. National Academy of Sciences, 2011.
[10] T. Iofciu, P. Fankhauser, F. Abel, and K. Bischoff, "Identifying users across social tagging systems," Proc. of the 5th International AAAI Conference on Weblogs and Social Media, pp. 522-525, 2011.
[11] M. Motoyama and G. Varghese, "I seek you: searching and matching individuals in social networks," Proc. of the 11th international workshop on Web Information and Data Management (WIDM’09), pp. 67-75, 2009.
[12] O. Goga, D. Perito, H. Lei, R. Teixeira, and R. Sommer, "Large-scale Correlation of Accounts across Social Networks," Technical report, 2013.
[13] K. Cortis, S. Scerri, I. Rivera, and S. Handschuh, "An ontology-based technique for online profile resolution," Social Informatics, Berlin: Springer, pp. 284-298, 2013.
[14] F. Abel, E. Herder, G.J. Houben, N. Henze, and D. Krause, "Cross-system user modeling and on the social web," User Modeling and User-Adapted Interaction, vol. 23, pp. 169-209, 2013.
[15] O. De Vel, A. Anderson, M. Corney, and G. Mohay, "Mining e-mail content for author identification forensics," ACM Sigmod Record, vol. 30, no. 4, pp. 55-64, 2001.
[16] E. Raad, R. Chbeir, and A. Dipanda, "User profile matching in social networks," Proc. of the 13th International Conference on Network-Based Information Systems (NBiS’10), pp. 297-304, 2010.
[17] J. Vosecky, D. Hong, and V.Y. Shen, "User identification across multiple social networks," Proc. of the 1st International Conference on Networked Digital Technologies, pp. 360-365, 2009.
[18] X. Mu, F. Zhu, E.P. Lim, J. Xiao, J. Wang, and Z.H. Zhou, "User Identity Linkage by Latent User Space Modelling," Proc. of the 22nd ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 1775-1784, 2016.
[19] P. Jain, P. Kumaraguru, and A. Joshi, "@ i seek 'fb. me': identifying users across multiple online social networks," Proc. of the 22nd International Conference on World Wide Web Companion, pp. 1259-1268, 2013.
[20] P. Jain and P. Kumaraguru, "Finding Nemo: searching and resolving identities of users across online social networks," arXiv preprint arXiv:1212.6147, 2012.
[21] R. Zheng, J. Li, H. Chen, and Z. Huang, "A framework for authorship identification of online messages: writing-style features and classification techniques," J. of the American Society for Information Science and Technology, vol. 57, no. 3, pp. 378-393, 2006.
[22] M. Almishari and G. Tsudik, "Exploring linkability of user reviews," Computer Security–ESORICS 2012 (ESORICS’12), pp. 307-324, 2012.
[23] X. Kong, J. Zhang, and P.S. Yu, "Inferring anchor links across multiple heterogeneous social networks," Proc. of the 22nd ACM International Conf. on Information and Knowledge Management (CIKM’13), pp. 179-188, 2013.
[24] O. Goga, H. Lei, S.H.K. Parthasarathi, G. Friedland, R. Sommer, and R. Teixeira, "Exploiting innocuous activity for correlating users across sites," Proc. 22nd international conference on World Wide Web (WWW’13), pp. 447-458, 2013.
[25] C. Riederer, Y. Kim, A. Chaintreau, N. Korula, and S. Lattanzi, "Linking Users Across Domains with Location Data: Theory and Validation," Proc. of the 25th International Conference on World Wide Web, pp. 707-719, 2016.
[26] T. Man, H. Shen, S. Liu, X. Jin, and X. Cheng, "Predict Anchor Links across Social Networks via an Embedding Approach," Proc. of the 25th International Joint Conference on Artificial Intelligence, pp. 1823-1829, 2016.
[27] Y. Nie, Y. Jia, S. Li, X. Zhu, A. Li, and B. Zhou, "Identifying users
across social networks based on dynamic core interests," Neurocomputing, vol. 210, pp. 107-115, 2016.
[28] X. Zhou, X. Liang, H. Zhang, and Y. Ma, "Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks," IEEE Trans. Knowledge and Data Eng., vol. 28, no. 2, pp. 411-424, 2016.
[29] A. Narayanan and V. Shmatikov, "De-anonymizing social networks," Proc. of the 30th IEEE Symposium on Security and Privacy (SSP’09), pp. 173-187, 2009.
[30] S. Bartunov, A. Korshunov, S. Park, W. Ryu, and H. Lee, "Joint link-attribute user identity resolution in online social networks," The 6th SNA-KDD Workshop ’12, 2012.
[31] N. Korula and S. Lattanzi, "An efficient reconciliation algorithm for social networks," arXiv preprint arXiv:1307.1690, 2013.
[32] L. Liu, W.K. Cheung, X. Li, and L. Liao, "Aligning users across social networks using network embedding," Proc. of the 25th International Joint Conf. on Artificial Intelligence, pp. 1774-1780, 2016.
[33] Y. Zhang, J. Tang, Z. Yang, J. Pei, and P.S. Yu, "Cosnet: Connecting heterogeneous social networks with local and global consistency," Proc. of the 21st ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 1485-1494, 2015.
[34] Y. Zhang, L. Wang, X. Li, and C. Xiao, "Social identity link across incomplete social information sources using anchor link expansion," Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 395-408, 2016.
[35] S. Liu, S. Wang, and F. Zhu, "Structured Learning from Heterogeneous Behavior for Social Identity Linkage," IEEE Trans. Knowledge and Data Eng., vol. 27, no. 7, pp. 2005-2019, 2015.
[36] H. Fu, A. Zhang, and X. Xie, "Effective deanonymization based on graph structure and descriptive information," ACM Trans. Intelligent Systems and Technology, vol. 6, no. 4, pp. 49, 2015.
[37] K. Shu, S. Wang, J. Tang, R. Zafarani, and H. Liu, "User Identity Linkage across Online Social Networks: A Review," SIGKDD Explorations, vol. 18, no. 2, pp. 5-17, 2017.
[38] H. Fu, A. Zhang, and X. Xie, "De-anonymizing social graphs via node similarity," Proc. of the 23rd Int. Conf. on World Wide Web, pp. 263-264, 2014.
[39] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[40] P. Erdős and A. Rényi, "On random graphs I," Publ. Math. Debrecen, vol. 6, pp. 290-297, 2010.
[41] D. J. Watts and S. H. Strogatz, "Collective dynamics of ‘small-world’ networks," Nature, vol. 393, no. 6684, pp. 440-442, 1998.
[42] A. L. Barabasi and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509-512, 1999.
[43] E. Stamatatos, M. Tschuggnall, B. Verhoeven, W. Daelemans, G. Specht, B. Stein, and M. Potthast, "Clustering by authorship within and across documents," Working Notes Papers of the CLEF, 2016.
[44] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, "Graph kernels," J. of Machine Learning Research, vol. 11, pp. 1201-1242, 2010.
[45] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt, "Scalable kernels for graphs with continuous attributes," In Advances in Neural Information Processing Systems, pp. 216-224, 2013.
[46] R. I. Dunbar, V. Arnaboldi, M. Conti, and A. Passarella, "The structure of online social networks mirrors those in the offline world," Social Networks, vol. 43, pp. 39-47, 2015.
[47] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," Proc. of the 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 701-710, 2014.
[48] A. Grover and J. Leskovec, "node2vec: Scalable Feature Learning for Networks," Proc. of the 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 855-864, 2016.
[49] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," Proc. of the 24th Int. Conf. on World Wide Web, pp. 1067-1077, 2015.
[50] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, "Tri-party deep network representation," Int. Joint Conf. on Artificial Intelligence, pp. 1895-1901, 2016.
[51] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[52] F. Fouss, A. Pirotte, J.M. Renders, and M. Saerens, "Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, 2007.
[53] C. H. Lee, X. Xu, and D. Y. Eun, "Beyond random walk and metropolis-hastings samplers: why you should not backtrack for unbiased graph sampling," ACM SIGMETRICS Performance Evaluation Review, pp. 319-330, 2012.
[54] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[55] A. Mnih and G. E. Hinton, "A scalable hierarchical distributed language model," Advances in neural information processing systems, pp. 1081-1088, 2009.
[56] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," Advances in Neural Information Processing Systems, pp. 693-701, 2011.

Xiaoping Zhou received the B.E. and M.E. from Beijing Information Science and Technology University in 2006 and 2009, respectively. He is currently a Ph.D. candidate in the Department of Computer Science at the Renmin University of China and an Associate Professor at the Beijing University of Civil Engineering and Architecture. His research interests include data mining and artificial intelligence.

Xun Liang received the Ph.D. degree in computer engineering from Tsinghua University, China, in 1993. He worked as a Postdoctoral Fellow at the Institute of Computer Science, Peking University from 1993 to 1995. He is a Professor in the Department of Computer Science at the Renmin University of China. His research interests include social computing and machine learning.

Xiaoyong Du received the Ph.D. degree from Nagoya Institute of Technology, Japan. He is a professor and dean in the School of Information, Renmin University of China. His research interests mainly include intelligent information retrieval and semantic web. He has served extensively on many conferences.

Jichao Zhao received the B.E. in information system from Northwest A&F University in 2015. She is currently a postgraduate student in the Department of Computer Science at the Renmin University of China. Her research interests include social computing and data mining.