An Efficient Reconciliation Algorithm for Social Networks
Total Page:16
File Type:pdf, Size:1020Kb
An efficient reconciliation algorithm for social networks Nitish Korula Silvio Lattanzi Google Inc. Google Inc. 76 Ninth Ave, 4th Floor 76 Ninth Ave, 4th Floor New York, NY New York, NY [email protected] [email protected] ABSTRACT user, but probably a poor representation of her working People today typically use multiple online social networks contacts, while LinkedIn is a good representation of work (Facebook, Twitter, Google+, LinkedIn, etc.). Each online contacts but not a very good representation of personal re- network represents a subset of their \real" ego-networks. An lationships. Therefore, analyzing social behaviors in any of interesting and challenging problem is to reconcile these on- these networks has the drawback that the results would only line networks, that is, to identify all the accounts belonging be partial. Furthermore, even if certain behavior can be ob- to the same individual. Besides providing a richer under- served in several networks, there are still serious problems standing of social dynamics, the problem has a number of because there is no systematic way to combine the behavior practical applications. At first sight, this problem appears of a specific user across different social networks and be- algorithmically challenging. Fortunately, a small fraction cause some social relationships will not appear in any social of individuals explicitly link their accounts across multiple network. For these reasons, identifying all the accounts be- networks; our work leverages these connections to identify a longing to the same individual across different social services very large fraction of the network. is a fundamental step in the study of social science. Our main contributions are to mathematically formalize Interestingly, the problem has also very important prac- the problem for the first time, and to design a simple, lo- tical implications. First, having a deeper understanding of cal, and efficient parallel algorithm to solve it. We are able the characteristics of a user across different networks helps to prove strong theoretical guarantees on the algorithm's to construct a better portrait of her, which can be used to performance on well-established network models (Random serve personalized content or advertisements. In addition, Graphs, Preferential Attachment). We also experimentally having information about connections of a user across mul- confirm the effectiveness of the algorithm on synthetic and tiple networks would make it easier to construct tools such real social network data sets. as \friend suggestion" or \people you may want to follow". The problem of identifying users across online social net- works (also referred to as the social network reconciliation 1. INTRODUCTION problem) has been studied extensively using machine learn- The advent of online social networks has generated a re- ing techniques; several heuristics have been proposed to naissance in the study of social behaviors and in the un- tackle it. However, to the best of our knowledge, it has derstanding of the topology of social interactions. For the not yet been studied formally and no rigorous results have first time, it has become possible to analyze networks and been proved for it. One of the contributions of our work is social phenomena on a world-wide scale and to design large- to give a formal definition of the problem, which is a precur- scale experiments on them. This new evolution in social sor to mathematical analysis. Such a definition requires two science has been the center of much attention, but has also key components: A model of the \true" underlying social attracted a lot of critiques; in particular, a longstanding network, and a model for how each online social network is problem in the study of online social networks is to under- formed as a subset of this network. We discuss details of stand the similarity between them and \real" underlying our models in Section 3. social networks [29]. Another possible reason for the lack of mathematical anal- This question is particularly challenging because online ysis is that natural definitions of the problem are demotivat- 1 social networks are often just a realization of a subset of ingly similar to the graph isomorphism problem. In addi- real social networks. For example, Facebook \friends" are tion, at first sight the social network reconciliation problem a good representation of the personal acquaintances of a seems even harder because we are not looking just for iso- morphism but for similar structures, as distinct social net- works are not identical. Fortunately, when reconciling so- This work is licensed under the Creative Commons Attribution- 1 NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- In graph theory, an isomorphism between two graphs G cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- and H is a bijection, f(∗), between the vertex sets of G and mission prior to any use beyond those covered by the license. Contact H such that any two vertices u and v of G are adjacent in G if and only if f(u) and f(v) are adjacent in H. The graph copyright holder by emailing [email protected]. Articles from this volume 0 were invited to present their results at the 40th International Conference on isomorphism problem is: Given two graphs G and G find Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. an isomorphism between them or determine that there is no Proceedings of the VLDB Endowment, Vol. 7, No. 5 isomorphism. The graph isomorphism problem is considered Copyright 2014 VLDB Endowment 2150-8097/14/01. very hard, and no polynomial algorithms are known for it. 377 cial networks, we have two advantages over general graph find the corresponding names. To improve on these first isomorphism: First, real social networks are not the adver- naive approaches, several machine learning models were de- sarially designed graphs which are hard instances of graph veloped [3, 17, 20, 25, 28], all of which collect several features isomorphism, and second, a small fraction of social network of the users (name, location, image, connections topology), users explicitly link their accounts across multiple networks. based on which they try to identify users across networks. The main goal of this paper is to design an algorithm with These techniques may be very fragile with respect to ma- provable guarantees that is simple, parallelizable and robust licious users, as it is not hard to create a fake profile with to malicious users. For real applications, this last prop- similar characteristics. Furthermore, they get lower preci- erty is fundamental, and often underestimated by machine sion experimentally than our algorithm achieves. However, learning models.2 In fact, the threat of malicious users is we note that these techniques can be combined with ours, so prominent that large social networks (Twitter, Google+, both to validate / increase the number of initial trusted Facebook) have introduced the notion of `verification’ for links, and to further improve the performance of our algo- celebrities. rithm. Our first contribution is to give a formal model for the A different approach was studied in [22], where the au- graph reconciliation problem that captures the hardness of thors infer missing attributes of a user in an online social the problem and the notion of an initial set of trusted links network from the attribute information provided by other identifying users across different networks. Intuitively, our users in the network. To achieve their results, they retrieve model postulates the existence of a true underlying graph, communities, identify the main attribute of a community then randomly generates 2 realizations of it which are per- and then spread this attribute to all the user in the commu- turbations of the initial graph, and a set of trusted links for nity. Though it is interesting, this approach suffers from the some users. Given this model, our next significant contri- same limitations of the learning techniques discussed above. bution is to design a simple, parallelizable algorithm (based Recently, Henderson et al. [14] studied which are the most on similar intuition to the algorithm in [23]) and to prove important features to identify a node in a social network, formally that our algorithm solves the graph reconciliation focusing only on graph structure information. They ana- problem if the underlying graph is generated by well estab- lyzed several features of each ego-network, and also added lished network models. It is important to note that our al- the notion of recursive features on nodes at distance larger gorithm relies on graph structure and the initial set of links than 1 from a specific node. It is interesting to notice that of users across different networks in such a way that in order their recursive features are more resilient to attack by ma- to circumvent it, an attacker must be able to have a lot of licious users, although they can be easily circumvented by friends in common with the user under attack. Thus it is the attacker typically assumed in the social network security more resilient to attack than much of the previous work on literature [32], who can create arbitrarily many nodes. this topic. Finally, we note that any mathematical model The problem of reconciling social networks is closely con- is, by necessity, a simplification of reality, and hence it is nected to the problem of de-anonymizing social networks. important to empirically validate the effectiveness of our ap- Backstrom et al. introduced the problem of deanonymiz- proach when the assumptions of our models are not satisfied. ing social networks in [4]. In their paper, they present 2 In Section 5, we measure the performance of our algorithm main techniques: An active attack (nodes are added to the on several synthetic and \real" data sets.