Prediction Promotes Privacy in Dynamic Social Networks

Prediction Promotes Privacy In Dynamic Social Networks Smriti Bhagat Graham Cormode, Balachander Krishnamurthy, Divesh Srivastava Rutgers University AT&T Labs–Research [email protected] fgraham, bala, [email protected] Abstract network. Ensuring sufficient privacy while keeping out- Recent work on anonymizing online social networks put relevant for its intended uses is more challenging in (OSNs) has looked at privacy preserving techniques for the dynamic case. Anonymizing each version of the net- publishing a single instance of the network. However, work independently is easily shown to leak information OSNs evolve and a single instance is inadequate for an- by comparing the different versions of the data [21]. In- alyzing their evolution or performing longitudinal data stead, we ensure that subsequent releases are consistent analysis. We study the problem of repeatedly publish- with the initial release. Bad decisions made for an ini- ing OSN data as the network evolves while preserving tial anonymization mean that subsequent releases may privacy of users. Publishing multiple instances indepen- lead to an undesirable amount of information (measured dently has privacy risks, since stitching the information in terms of probabilities) that can be extracted about the together may allow an adversary to identify users. We users in the data, and may require that some information provide methods to anonymize a dynamic network when is suppressed from the subsequent releases. Without new nodes and edges are added to the published network. knowing how the network will grow, how do we choose These methods use link prediction algorithms to model proper anonymizations early on so that the information the evolution. Using this predicted graph to perform that can be extracted about individuals from later releases group-based anonymization, the loss in privacy caused is minimized? by new edges can be eliminated almost entirely. We pro- We propose a solution based on link prediction algo- pose metrics for privacy loss, and evaluate them for pub- rithms, that use the current state of the network to pre- lishing multiple OSN instances. dict future structure. The prediction is used to choose an anonymization which is expected to remain safe and 1 Introduction useful for future releases. Existing prediction methods OSNs are a ubiquitous feature of modern life. A key tend to over-predict edges, i.e., they suggest many more feature of current OSNs, exemplified by Facebook, is edges than actually arrive. Thus, we cannot treat the pre- that a user’s detailed information is not visible without dicted edges equally to observed edges, and must define their explicit permission. This leaves interested parties— how to integrate predicted edges with anonymization al- network researchers, sociologists, app designers—to gorithms. We present a variety of methods to select a scrape away at the edges. Full release of snapshots of the subset of predicted edges to find a usable anonymiza- network would address this need. But the default settings tion. are private for a reason: OSNs contain sensitive personal Outline and Contributions. Section 2 defines the information of their users. Principled anonymization of anonymization problem for dynamic graphs, and de- OSN data allows sharing with 3rd-parties without reveal- scribes four requirements of the output. Section 3 ing private information. After simplistic anonymization provides metrics for evaluating privacy preservation of methods were shown to be vulnerable [3, 15] more so- anonymizations based on prediction. Section 4 discusses phisticated anonymizations have been proposed [20]. how different prediction models can be incorporated into Prior work focused primarily on static networks: the our framework, and how the results of the prediction can dataset is a single instance of the network, represented as be fine-tuned by adoption of conditions for anonymiza- a graph, failing to capture the highly dynamic nature of tion. Section 5 presents experiments over temporal data social network data. We would like to repeatedly release representing social network activity from three differ- anonymized snapshots reflecting the current state of the ent sources, and empirically evaluates privacy guarantees 1 and utility resulting from our anonymization methods. 2 B 5 E Our study shows that with the correct choice of predic- 1 A tion method and anonymization properties, it is possible 10 C to provide useful data on dynamic social networks while 3 C 4 D retaining sufficient privacy. We conclude in Section 6 6 D after reviewing related work. 9 B 2 Problem Definition 7 E 8 A Graph Model. A time-varying social network can be represented with a graph G = (V ;E ). Here V is (a) Example graph G at time t. (b) Full list anonymized G t t t t Solid lines are existing edges, dotted the set of vertices that represent users (or, entities) Ut are predicted edges that are a part of the network at time t, and Et is the set of all edges (interactions between users) created up Figure 1: Anonymization of a single snapshot of a graph to time t. Each user is associated with a set of at- tributes. Let G = fG1;G2;:::;GT g be the sequence of T graphs representing the network observed at timesteps goal of utility), then we must balance the extra utility t = 1; 2;:::;T respectively. We assume edges and from publishing the new information with the potential nodes are only added to the graph, not deleted (our model threat to the privacy of the previously published data. can be extended to allow deletions, but we do not discuss 3 Understanding Dynamic Privacy this issue in this presentation) Thus, we have Vt ⊆ Vt+1 and Et ⊆ Et+1, i.e. the graph at time t represents the 3.1 Anonymizing a single graph complete history of events recorded on the graph. New The (full) list-based scheme for anonymizing a single edges created between time t and t + 1 form the set graph was proposed in [4] (Section 6 identifies other E nE . Accordingly, any edge created at time t + 1 is t+1 t methods). It masks the mapping between nodes in the one of three kinds (i) “old-old”: between nodes v; w 2 V t graph V and entities U such that each v 2 V is associ- (ii) “old-new”: between node v 2 V and w 2 V nV t t+1 t ated with a list of possible labels l(v) ⊂ U. The original (iii) “new-new”: between nodes v; w 2 V nV . Let T t+1 t label of a node must appear within that node’s list. Us- be the current timestamp so that all prior graphs G for i ing the full list anonymization scheme, jl(v)j ≥ k and i ≤ T are observed and known. The graph continues to jl(v)j nodes are assigned the same label list. The under- evolve so that the graph G for i > 0 represents the T +i lying graph structure is published, with a label list at each (unknown) future state of the network. node instead of the user identifier. The lists can be gen- Problem Statement. Given G as input, our objective at erated by partitioning the nodes into groups of size k, so any time T is to publish an anonymized version of graph that each node in the group is given the same list, which 0 0 Gt as Gt. The output graph Gt should have the follow- consists of all (true) labels of nodes in the group. ing properties based on privacy parameters pn and pe: If the links between nodes in a group, or between 1. entity privacy: any u 2 Ut cannot be identified with a nodes in two groups, are dense, then an observer will 0 node in Gt with probability > pn. conclude a high probability of certain edges. This con- 2. privacy of observed edges: for any two entities tradicts the “privacy of observed edges” requirement, u1; u2 2 Ut, where t ≤ T , without background informa- even while the privacy of entities requirement may be tion it should not be possible to determine the existence met. Hence, lists are generated by dividing nodes into of an edge between them with probability > p . e g groups S1;S2 :::Sg so that they satisfy a Safety Con- 0 3. privacy of future edges: when GT +i is later published, dition. This condition states (informally) that each node it should not be possible to identify the presence of an must interact with at most one node in any group and so edge between u1; u2 2 UT +i with probability > pe. ensures sparsity of interactions between nodes of any two 4. utility: the anonymized graphs should be usable to ob- groups. The resulting grouping guarantees the privacy of tain accurate answers to queries involving longitudinal entities with parameter pn = 1=k, and the privacy of ob- analysis (e.g., how does the interaction between users served edges with pe = 1=k. Our focus in this paper is 0 0 from NJ change between two releases Gt and Gt+1). on maintaining this safety condition in the presence of Prior work in graph anonymization focused on pub- arriving nodes and edges. For more context, and details lishing a single graph instance, with requirements simi- of the strength it provides, see [4]. lar to goals 1 and 2 above. When publishing information about network evolution, new events impact what has al- Example 1. Figure 1(a) shows a sample snapshot of a ready been published, which motivates the third goal. If graph at time t with node-set Vt = f1; 2;::: 10g. In Fig- the anonymization has any value (i.e. it meets the fourth ure 1(b), the graph has been anonymized using the full 2 list method: the nodes are partitioned into groups with EI(S1;S2) is the ratio of the number of edges between k = 2 as A = f1; 8g;B = f2; 9g;C = f3; 10g;D = the two groups to the maximum number of such edges: f4; 6g; and E = f5; 7g.

Prediction Promotes Privacy in Dynamic Social Networks

M&A @ Facebook: Strategy, Themes and Drivers

A Longitudinal Study of Facebook, Linkedin, & Twitter

Arxiv:1403.5206V2 [Cs.SI] 30 Jul 2014

How to Make a Personal Facebook Account on Computer

Introduction to Web 2.0: Facebook, Texting, Twitter, Your Students, and You

Facebook Upload Failed Notification

Facebook Platform Weihaun SHU Socs @ Mcgill Y a Social Networking Website

Internet Reviews: Social Networking Software: Facebook and Myspace Stacey Greenwell University of Kentucky, [email protected]

FACEBOOK, INC. V. DUGUID ET AL

Social Suite User Guide User Guide

Do Neighbor Buddies Make a Difference in Reblog Likelihood? an Analysis on SINA Weibo Data

Realtime Data Processing at Facebook