Accurate Estimation of the Degree Distribution of Private Networks

Michael Hay, Chao Li, Gerome Miklau, and David Jensen
Department of Computer Science, University of Massachusetts Amherst
{mhay, chaoli, miklau, jensen}@cs.umass.edu

Abstract—We describe an efficient algorithm for releasing a provably private estimate of the degree distribution of a network. The algorithm satisfies a rigorous property of differential privacy, and is also extremely efficient, running on networks of 100 million nodes in a few seconds. Theoretical analysis shows that the error scales linearly with the number of unique degrees, whereas the error of conventional techniques scales linearly with the number of nodes. We complement the theoretical analysis with a thorough empirical analysis on real and synthetic graphs, showing that the algorithm's variance and bias is low, that the error diminishes as the size of the input graph increases, and that common analyses like fitting a power-law can be carried out very accurately.

Keywords: privacy; social networks; privacy-preserving data mining; differential privacy.

I. INTRODUCTION

The analysis of social networks is crucial to addressing a diverse set of societally important issues including disease transmission, fraud detection, and the efficiency of communication networks, among many others. Although technological advances have allowed the collection of these networks (often massive in scale), privacy concerns have severely restricted the ability of social scientists and others to study them. Valuable network data remains locked away in institutions, far from the scientists who are best equipped to exploit it, because the data is too sensitive to share.

The challenges of analyzing sensitive graph-structured data have recently received increased attention. It is now well known that removing identifiers and releasing a "naively anonymized" isomorphic network can leave participants open to a range of attacks [1], [7], [17]. In response, a number of more sophisticated anonymization techniques have been proposed [3], [7], [13], [22], [23]. These techniques transform the graph—through the addition/removal of edges or the clustering of nodes into groups—into a new graph which is then published. The analyst uses the transformed graph to derive estimates of various properties of the original.

The conventional goals of such algorithms are privacy and utility, although neither is satisfactorily achieved. Existing approaches provide anonymity, typically through transformations that make each node indistinguishable from others, but they lack rigorous guarantees of privacy. They rely on the assumption that the adversary's knowledge is limited [3], [13], [22] and/or fail to prove that the guarantee of anonymity ensures that private information is not disclosed [7], [13], [22], [23]. In terms of utility, while the transformed graph necessarily differs from the original, the hope is that it retains important structural properties of interest to the analyst. A drawback of existing techniques is that they lack formal bounds on the magnitude of the error introduced by the transformation. A common strategy for assessing utility is to measure familiar graph properties and compare these measures numerically to the original data. Empirically, it appears that with increasing anonymity the graph is rapidly distorted and some metrics are systematically biased [3], [7], [13], [22], [23].

A final limitation of existing techniques is that few scale to the massive graphs that are now collected and studied. Most existing techniques have been evaluated on graphs of about 5,000-50,000 nodes, and may be difficult to scale much larger [7], [13], [22], [23].

In this work, we focus on a specific utility goal—estimating the degree distribution of the graph—and develop an algorithm that provides provable utility, strong privacy, and excellent scalability. The algorithm returns the degree distribution of the graph after applying a two-phase process of random perturbation. The error due to random noise is provably low, yet sufficient to prevent even powerful adversaries from extracting private information. The algorithm can scale to graphs with hundreds of millions of nodes. The techniques we propose here do not result in a published graph; instead, we release only an estimate of the true degree distribution to analysts.

We choose to focus on the degree distribution because it is one of the most widely studied properties of a graph. It influences the structure of a graph and the processes that operate on it, and a diverse line of research has studied properties of random ensembles of graphs consistent with a known degree distribution [12], [14], [18].
The simple strategy of releasing the exact degree distribution fails to provide adequate privacy protection. Some graphs have unique degree distributions (i.e., all graphs matching the degree distribution are isomorphic), making the release of the degree distribution no safer than naive anonymization. In general, it is unclear how to determine what the degree distribution reveals about the structure of the graph. The problem is compounded when either the adversary has partial knowledge of the graph structure or the degree distribution is only one of several statistics published. Our goal is to design an approach that provides robust privacy protection against powerful adversaries and is compatible with releasing multiple statistics.

The algorithms proposed here satisfy a rigorous privacy standard called differential privacy [4]. It protects against any adversary, even one with nearly complete knowledge of the private data. It also composes well: one can release multiple statistics under differential privacy, so long as the algorithm for each statistic satisfies differential privacy. Thus, while we focus on the degree distribution, additional statistics can be incorporated into the privacy framework.

While an existing differentially private algorithm [5] can be easily adapted to obtain noisy answers to queries about the degree distribution, the added noise introduces considerable error. In this work, we capitalize on a recent innovation in differentially private algorithms that has been shown to boost accuracy without sacrificing privacy [8]. The technique performs a post-processing step on the noisy answers, using the fact that the queries impose constraints on the space of possible answers to infer a more accurate result. We apply this technique to obtain an accurate estimate of the degree distribution of a graph.
Our contributions include the following.

• (Privacy) We adapt the definition of differential privacy to graph-structured data. We present several alternative formulations and describe the implications for the privacy-utility tradeoff inherent in them.
• (Scalability) We provide an implementation of the inference step of Hay et al. [8] that is linear, rather than quadratic, in the number of nodes. As a result of this algorithmic improvement the technique can scale to very large graphs, processing a 200 million node graph in 6 seconds. We also extend the inference technique to include additional constraints of integrality and non-negativity.
• (Utility) We assess the utility of the resulting degree distribution estimate through comprehensive experiments on real and synthetic data. We show that: (i) estimates are extremely accurate under strong privacy parameters, exhibiting low bias and variance; (ii) for power-law graphs, the relative boost in accuracy from inference increases with graph size; (iii) an analyst can use the differentially private output to accurately assess whether the degree distribution follows a power-law.

These contributions are some of the first positive results in the private analysis of social network data, showing that a fundamental network analysis task can be performed accurately and efficiently, with rigorous guarantees of privacy. Admittedly, the degree distribution is just one property of a graph, and there is evidence that a number of other properties are not constrained by the degree distribution alone [12], [14]. Nevertheless, it is hard to imagine a useful technique that distorts the degree distribution greatly; thus it is important to know how accurately it can be estimated, independently of other properties. A long-term goal is to develop a differentially private algorithm for publishing a synthetic graph offering good utility for a range of analyses. Because of the degree distribution's profound influence on the structure of the graph, we believe that accurate estimation of it is a critical step towards that long-term goal.

II. DIFFERENTIAL PRIVACY FOR GRAPHS

We first review the definition of differential privacy, and then propose how it can be adapted to graph data.

A. Differential privacy

To define differential privacy, we consider an algorithm A that operates on a private database I. The algorithm is randomized, and the database is modeled as a set of records, each describing an individual's private information. Differential privacy formally limits how much a single individual record in the input can influence the output of the algorithm. More precisely, if I′ is a neighboring database—i.e., one that differs from I by exactly one record—then an algorithm is differentially private if it is likely to produce the same output whether the input is I or I′.

The following is a formal definition of differential privacy, due to Dwork [4]. Let nbrs(I) denote the set of neighbors of I—i.e., I′ ∈ nbrs(I) if and only if |I ⊕ I′| = 1, where ⊕ denotes symmetric difference.¹

Definition II.1 (ε-differential privacy). An algorithm A is ε-differentially private if for all instances I, any I′ ∈ nbrs(I), and any subset of outputs S ⊆ Range(A), the following holds:

    Pr[A(I) ∈ S] ≤ exp(ε) × Pr[A(I′) ∈ S]

where the probability Pr is over the randomness of A.

The parameter ε measures the disclosure and is typically also an input to the algorithm. For example, the techniques used in this paper add random noise to their outputs, where the noise is a function of ε. The choice of ε is a matter of policy, but typically ε is "small," say at most 1, making the probability of any output "almost the same" whether the input is I or I′.

An example illustrates why this protects privacy. Suppose a hospital wants to analyze the medical records of their patients and publish some statistics about the patient population. A patient may wish to have his record omitted from the study, out of a concern that the published results will reveal something about him personally and thus violate his privacy. The above definition assuages this concern because, whether the individual opts in or opts out of the study, the probability of a particular output is almost the same. Clearly, any observed output cannot reveal much about his particular record if that output is (almost) as likely to occur even when the record is excluded from the database.

¹ Differential privacy has been defined inconsistently in the literature. The original definition (called indistinguishability) defines neighboring databases in terms of Hamming distance [5]. Note that ε-differential privacy (as defined above) implies 2ε-indistinguishability.
B. Differential privacy for graphs

In the above definition the database is a table, whereas in the present work the database is a graph. Below we adapt the definition of differential privacy to graphs.

The semantic interpretation of differential privacy rests on the definition of neighboring databases. Since differential privacy guarantees that the output of the algorithm cannot be used to distinguish between neighboring databases, what is being protected is precisely the difference between neighboring databases. In the above definition, a neighboring database is defined by the addition or removal of a single record. With the hospital example, the patient's private information is encapsulated within a single record, so differential privacy ensures that the output of the algorithm does not disclose the patient's medical history.

With network data, which is primarily about relationships among individuals, the correspondence between private data and database records is less clear. To adapt differential privacy to graphs, we must choose a definition for neighboring graphs and understand the privacy semantics of that choice. We propose three alternatives offering varying degrees of privacy protection.

We model the input as a graph G = (V, E), where V is a set of n entities and E is a set of edges. Edges are undirected pairs (u, v) such that u and v are members of V. (Results are easily extended to handle directed edges.) While the meaning of an edge depends on the domain—it could connote friendship, email exchange, sexual relations, etc.—we assume that it represents a sensitive relationship that should be kept private. The focus of the present work is graph structure, so the inclusion of attributes on nodes or edges is left for future work.

The first adaptation of differential privacy to graphs is mathematically similar to the definition for tables. Neighboring graphs are defined as graphs that differ by one "record": given a graph G, one can produce a neighboring graph G′ by either adding/removing an edge in E, or by adding/removing an isolated node in V. Restricting to isolated nodes ensures that the change to V does not require additional changes to E. We call this adaptation edge-differential privacy because it protects individual relationships from disclosure: for example, if an edge indicates who emails whom, edge-differential privacy protects email relationships from being disclosed.

However, in some applications it may be desirable to extend the protection beyond individual edges. For example, Klovdahl et al. [9] analyze the social network structure of "a population of prostitutes, injecting drug users and their personal associates." In this graph, an edge represents a sexual interaction or the use of a shared needle. Edges are clearly private information, but so too are other properties like node degree (the number of sexual/drug partners) and even membership in the network.

A second adaptation of differential privacy to graphs provides much stronger privacy protection. In node-differential privacy, two graphs are neighbors if they differ by at most one node and all of its incident edges. Formally, G and G′ are neighbors if |V ⊕ V′| = 1 and E ⊕ E′ = {(u, v) | u ∈ (V ⊕ V′) or v ∈ (V ⊕ V′)}.

Node-differential privacy mirrors the "opt-in/opt-out" notion of privacy from the hospital example. It assuages any privacy concerns, as a node-differentially private algorithm behaves almost as if the participant did not appear in the data at all.

While node-differential privacy is a desirable privacy objective, it may be infeasible to design algorithms that are both node-differentially private and enable accurate network analysis. A differentially private algorithm must hide even the worst-case difference between neighboring graphs, and this difference can be large under node-differential privacy. For instance, the empty graph (n isolated nodes) is a neighbor of the star graph (a hub node connected to n nodes). We show in Section III-A that estimates about node degrees are highly inaccurate under node-differential privacy.
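To make the first two neighbor relations concrete, the following sketch (our own illustration, not code from the paper) tests them using symmetric differences of the vertex and edge sets; it assumes undirected edges stored consistently, e.g., as sorted tuples.

```python
def is_edge_neighbor(V1, E1, V2, E2):
    """Edge neighbors: the graphs differ by exactly one edge or one isolated node."""
    dV, dE = V1 ^ V2, E1 ^ E2                      # symmetric differences
    differs_by_edge = len(dV) == 0 and len(dE) == 1
    differs_by_isolated_node = (len(dV) == 1 and len(dE) == 0 and
                                not any(v in dV for e in (E1 | E2) for v in e))
    return differs_by_edge or differs_by_isolated_node

def is_node_neighbor(V1, E1, V2, E2):
    """Node neighbors: the graphs differ by one node and all of its incident edges."""
    dV, dE = V1 ^ V2, E1 ^ E2
    return len(dV) == 1 and all(u in dV or v in dV for (u, v) in dE)

# The empty graph on {1,2,3} versus the star centered at node 0:
V1, E1 = {1, 2, 3}, set()
V2, E2 = {0, 1, 2, 3}, {(0, 1), (0, 2), (0, 3)}
print(is_edge_neighbor(V1, E1, V2, E2))   # False: they differ by a node and three edges
print(is_node_neighbor(V1, E1, V2, E2))   # True: the difference is node 0 and its edges
```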
To span the spectrum of privacy between edge- and node-differential privacy, we introduce an extension to edge-differential privacy that allows neighboring graphs to differ by more than a single edge. In k-edge-differential privacy, graphs G and G′ are neighbors if |V ⊕ V′| + |E ⊕ E′| ≤ k. A larger setting of k leads to greater privacy protection. If k = 1, then k-edge-differential privacy is equivalent to edge-differential privacy. If k = |V|, then k-edge-differential privacy is even stronger than node-differential privacy, as the set of neighboring graphs under k-edge-differential privacy is a superset of the neighbors under node-differential privacy. If 1 < k < |V|, the protection lies between these two extremes.

III. DIFFERENTIALLY PRIVATE ESTIMATION OF THE DEGREE SEQUENCE

A. Differentially private query answering

Queries are answered privately by adding random noise calibrated to the sensitivity of the query, S_Q: the maximum L1 distance ||Q(I) − Q(I′)||₁ between the answers on any pair of neighboring databases [5]. The sensitivity of a query depends on how neighboring databases are defined. Intuitively, queries are more sensitive under node-differential privacy than edge-differential privacy, because the difference between neighboring graphs is larger under node-differential privacy.

However, regardless of how neighbors are defined, the following proposition holds. Let Lap(σ)^d denote a d-length vector of independent random samples from a Laplace distribution with mean zero and scale σ.

Proposition 1 ([5]). Let Q̃ denote the randomized algorithm that takes as input a database I, a query Q of length d, and some ε > 0, and outputs

    Q̃(I) = Q(I) + Lap(S_Q/ε)^d

Algorithm Q̃ satisfies ε-differential privacy.

While this proposition holds for any of the adaptations of differential privacy, the accuracy of the answer depends on the magnitude of S_Q, which differs across the adaptations.
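Proposition 1 is the standard Laplace mechanism. A minimal sketch (ours, not from the paper), assuming the query answer is available as a numeric vector and its sensitivity S_Q is known:

```python
import numpy as np

def laplace_mechanism(query_answer, sensitivity, epsilon, rng=None):
    """Release Q(I) + Lap(S_Q / epsilon)^d, which satisfies epsilon-differential privacy."""
    if rng is None:
        rng = np.random.default_rng()
    answer = np.asarray(query_answer, dtype=float)
    return answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=answer.shape)

# Example: a length-3 vector of counts with sensitivity 1, released at epsilon = 0.1.
print(laplace_mechanism([12, 40, 7], sensitivity=1.0, epsilon=0.1))
```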
Using an example query, we illustrate the accuracy tradeoffs between k-edge- and node-differential privacy. Let D_u denote the query that returns the degree of node u if u ∈ V, and otherwise returns −1. Since the addition of Laplace noise introduces error of ±S_Q/ε in expectation, the accuracy of the answer depends on ε and the sensitivity of D_u. Under k-edge-differential privacy the sensitivity is k—in the worst case, neighboring graphs differ by k edges that are all adjacent to u, making u's degree differ by k. Thus we expect an accurate answer to D_u when k/ε is small relative to u's degree, and answers remain accurate for k > 1 provided that ε is appropriately scaled, as in these examples. Under node-differential privacy, by contrast, neighboring graphs can differ in a node of arbitrarily high degree, so the sensitivity of D_u is on the order of n and the added noise overwhelms the true answer, as claimed in Section II-B.

B. Constrained inference

Hay et al. [8] introduce a post-processing technique that operates on the output of algorithm Q̃. It can be seen as an improvement on the basic algorithm of Dwork et al. [5] that boosts accuracy without sacrificing privacy. The main idea behind the approach is to use the semantics of a query to impose constraints on the answer. While the true answer Q(I) always satisfies the constraints, the noisy answer that is output by Q̃ may violate them. Let q̃ denote the output of Q̃. The constrained inference process takes q̃ and finds the answer that is "closest" to q̃ and also satisfies the constraints of the query. Here "closest" is measured in L2 distance, and the consistent output is called the minimum L2 solution.

Definition III.2 (Minimum L2 solution). Let Q be a query sequence with a set of constraints denoted γ_Q. A minimum L2 solution is a vector q̄ that satisfies the constraints γ_Q and minimizes ||q̃ − q̄||₂.

As discussed in [8], the technique has no impact on privacy since it requires no access to the private database, only q̃, the output of the differentially private algorithm.

This technique can be used to obtain an accurate estimate of the degree distribution of the graph. Our approach is to ask a query for the graph's degree sequence, a sequence of non-decreasing numbers corresponding to the degrees of the graph's vertices. Of course, a degree sequence can be converted to a degree distribution by simply counting the frequency of each degree. The advantage of the degree sequence query is that it is constrained, as explained below.

We now define S, the degree sequence query. Let deg(i) return the i-th smallest degree in G. Then the degree sequence of the graph is the sequence S = deg(1), ..., deg(n).² Under 1-edge-differential privacy, the sensitivity of S is 2: suppose a neighboring graph has an additional edge between two nodes of degree d and d′; then two values in S are affected—the largest i such that S[i] = d becomes d + 1, and similarly for d′. Let S̃ denote the application of the algorithm described in Proposition 1 to the S query. A random output of S̃ is denoted s̃.

² We simplify the presentation, as well as later experiments, by assuming that n is known. In practice, n is unlikely to be sensitive, especially for large networks. Alternatively, one could derive a very accurate estimate of n using Proposition 1 and then adjust S accordingly. This would result in a small amount of additional error in the estimate of the number of low degree nodes.

Query S is constrained because the degrees are positioned in sorted order. The constraint set for S is denoted γ_S and contains the inequalities S[i] ≤ S[i+1] for 1 ≤ i < n. Let S̄ denote the algorithm that outputs the minimum L2 solution for s̃ under γ_S, and let s̄ denote a random output of S̄. This solution has a simple closed form.

Theorem 1 ([8]). The minimum L2 solution for query S and constraint set γ_S is given by s̄[k] = max_{1≤i≤k} min_{i≤j≤n} (s̃[i] + ... + s̃[j]) / (j − i + 1).

[Figure 1. Example sequence S(I), noisy estimate s̃ = S̃(I), and constrained inference estimate s̄ = S̄(I).]

Example 1. Figure 1 shows a degree sequence S(I) for a 25-node graph, along with a sampled s̃ and inferred s̄. While the values in s̃ deviate considerably from S(I), s̄ lies very close to the true answer. In particular, for the subsequence [1, 20], the true sequence S(I) is uniform and the constrained inference process effectively averages out the noise of s̃. The twenty-first position is a unique degree in S(I) and constrained inference does not refine the noisy answer, i.e., s̄[21] = s̃[21].

As suggested by the example, the error of S̄ can be lower than that of S̃, particularly when the degree sequence contains subsets of nodes with the same degree. Hay et al. [8] theoretically analyze the error in terms of mean squared error. For a query Q̃, the mean squared error is MSE(Q̃) = E[||Q̃(I) − Q(I)||₂²] = Σ_i E[(Q̃[i] − Q[i])²], where the expectation is over the randomness of Q̃.

Theorem 2 ([8]). Let d be the number of unique degrees in S(I). Then MSE(S̄) = O(d log³ n / ε²). In comparison, MSE(S̃) = Θ(n / ε²).

This result shows that the error of S̄ scales linearly with the number of unique degrees, rather than the number of nodes. While this is a promising result, it is not clear what lower MSE in the degree sequence means for an analyst interested in studying the degree distribution. Furthermore, it is unclear how the asymptotic analysis translates into error on realistic graphs. We address these questions empirically in Section V.
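To fix ideas, here is a minimal end-to-end sketch of S̃ (our own illustration, not the paper's code): compute the sorted degree sequence from an edge list, add Laplace noise with scale 2/ε (the sensitivity of S under 1-edge-differential privacy), and count degree frequencies to obtain a distribution. The constrained estimate s̄ is obtained by post-processing s̃ as described above; Section IV gives a linear-time algorithm for doing so.

```python
import numpy as np
from collections import Counter

def noisy_degree_sequence(edges, n, epsilon, rng=None):
    """s~ : the sorted degree sequence plus Lap(2/epsilon) noise (sensitivity 2 under 1-edge-DP)."""
    if rng is None:
        rng = np.random.default_rng()
    deg = np.zeros(n)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    s = np.sort(deg)                                        # the degree sequence S(I)
    return s + rng.laplace(scale=2.0 / epsilon, size=n)

def to_degree_distribution(sequence):
    """Convert a (possibly noisy) degree sequence into a degree -> frequency table."""
    return Counter(int(round(d)) for d in sequence)
```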

IV. A LINEAR-TIME ALGORITHM FOR CONSTRAINED INFERENCE

Evaluating the closed form of Theorem 1 directly takes linear time per position and therefore quadratic time overall. In this section we describe an algorithm that computes s̄ in time O(n), by computing each position of the solution in amortized constant time rather than linear time.

Before describing the algorithm, we introduce some notation and restate the minimum L2 solution of Theorem 1 using this notation. Let M[i, j] denote the average of the subsequence s̃[i], ..., s̃[j], and let the minimum cumulative average at index k be denoted M_k = min_{k≤j≤n} M[k, j]. Then we can rewrite the solution at s̄[k] as follows:

    s̄[k] = max_{1≤i≤k} min_{i≤j≤n} M[i, j] = max_{1≤i≤k} M_i        (1)
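Equation 1 can be evaluated directly; the following quadratic-time reference implementation (our own sketch, not from the paper) computes each minimum cumulative average M_i and then takes a running maximum. The remainder of this section develops the linear-time algorithm.

```python
import numpy as np

def constrained_inference_naive(s_tilde):
    """O(n^2) evaluation of Equation 1: s_bar[k] = max_{i<=k} min_{j>=i} M[i, j]."""
    s = np.asarray(s_tilde, dtype=float)
    n = len(s)
    min_cum_avg = np.empty(n)
    for i in range(n):
        sums = np.cumsum(s[i:])                                   # sums of s[i..j] for j = i, ..., n-1
        min_cum_avg[i] = (sums / np.arange(1, n - i + 1)).min()   # M_i
    return np.maximum.accumulate(min_cum_avg)                     # running maximum gives s_bar

print(constrained_inference_naive([1, 9, 4, 3, 4]))   # [1. 5. 5. 5. 5.]
```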
The basic idea behind the linear-time algorithm is to construct s̄ incrementally, starting at the end of the sequence and working backwards toward the beginning. At each step ℓ, the algorithm maintains a partial solution for the subsequence s̃[ℓ, ..., n]—meaning that the sort constraints are only enforced on this subsequence and the rest of s̃ is ignored. At each step, the subsequence is extended to include another element of s̃ and the partial solution is updated accordingly.

We denote the partial solution as r̄^ℓ, and from Equation 1, the value of r̄^ℓ at position k is equal to

    r̄^ℓ[k] = max_{ℓ≤i≤k} min_{i≤j≤n} M[i, j] = max_{ℓ≤i≤k} M_i        (2)

Observe that the partial solution r̄¹ is equal to the desired s̄.

Given a partial solution r̄^ℓ, we can extend the solution to ℓ−1 by extending the subsequence to include the observation s̃[ℓ−1] and updating the partial r̄^ℓ to obtain r̄^{ℓ−1}. There are two components of the update procedure. First, we determine the value for the new observation at position ℓ−1. From Equation 2, the solution is simply the minimum cumulative average starting at ℓ−1; i.e., r̄^{ℓ−1}[ℓ−1] = M_{ℓ−1}.

The second step is to check whether including the (ℓ−1)-th element requires updating the existing solution for positions ℓ, ..., n. From Equation 2, we can see that we must update any k = ℓ, ..., n where the current solution r̄^ℓ[k] is smaller than the new value at position ℓ−1, r̄^{ℓ−1}[ℓ−1]:

    r̄^{ℓ−1}[k] = max_{ℓ−1≤i≤k} M_i
               = max( M_{ℓ−1}, max_{ℓ≤i≤k} M_i )
               = max( r̄^{ℓ−1}[ℓ−1], r̄^ℓ[k] )

Thus each step requires first computing M_{ℓ−1} and then updating the existing solution for positions ℓ, ..., n. We show next that we can use the partial solution r̄^ℓ to simplify the cost of finding M_{ℓ−1}. While computing an individual M_{ℓ−1} can take linear time in the worst case, the amortized cost is only O(1). Furthermore, we store the partial solution r̄^ℓ in such a way that once M_{ℓ−1} is found, no additional work is required to update the solution.

Given a partial solution r̄^ℓ, break it into subsequences such that each subsequence is uniform and has maximal length. Let J^ℓ be the set of indexes marking the ends of the uniform subsequences. E.g., if all of the elements in r̄^ℓ are distinct, then J^ℓ = {ℓ, ..., n}; if the values of r̄^ℓ are all the same, then J^ℓ = {n}.

The following theorem shows how we can use r̄^ℓ to compute the minimum cumulative average M_{ℓ−1}. Recall that M_{ℓ−1} = min_{ℓ−1≤j≤n} M[ℓ−1, j]. Let j* denote the end index of this minimum average, i.e., j* = arg min_{ℓ−1≤j≤n} M[ℓ−1, j].

Theorem 3. Given r̄^ℓ and a corresponding J^ℓ, either j* = ℓ−1 or j* ∈ J^ℓ. Moreover, J^{ℓ−1} = {j*} ∪ {j ∈ J^ℓ | j > j*}.

Theorem 3 shows that the set J^ℓ can be used to find the minimum cumulative average for ℓ−1. The algorithm proceeds by considering the indexes in J^ℓ in ascending order. The initial cumulative average is set to M[ℓ−1, ℓ−1], and the average is extended to the next endpoint in J^ℓ so long as it reduces the average. When the average increases, the algorithm terminates the search. The following lemma implies that the stopping point will be j*.

Lemma 1. Let j* be defined as above; then max_{ℓ≤j≤j*} M_j ≤ M_{ℓ−1} ≤ M_{j*+1}.

During the computation we need only store the set J^ℓ instead of the entire sequence r̄^ℓ. Updating J^ℓ is much faster than updating r̄^ℓ, and from J^ℓ we can easily reconstruct the solution r̄^ℓ. The details are shown in Algorithm 1, which computes J^ℓ (lines 1-8) for ℓ = n, ..., 1, and then constructs s̄ using J¹ (lines 10-16).

Algorithm 1 An algorithm for computing s̄ given s̃
 1: J ← ∅, J.push(n)
 2: for k from n−1 to 1 do
 3:     j ← k, j* ← J.top()
 4:     while J ≠ ∅ and M[j+1, j*] ≤ M[k, j] do
 5:         j ← j*, J.pop(), j* ← J.top()
 6:     end while
 7:     J.push(j)
 8: end for
 9: b ← 1
10: while J ≠ ∅ do
11:     j* ← J.pop()
12:     for k from b to j* do
13:         s̄[k] ← M[b, j*]
14:     end for
15:     b ← j* + 1
16: end while
17: return s̄

Example 2. The following table shows a sample input s̃ along with the computations used to compute s̄.

    k    s̃[k]   s̄[k]   arg min_{k≤j≤n} M[k, j]
    1     1      1      M[1, 1]
    2     9      5      M[2, 5]
    3     4      5      M[3, 4]
    4     3      5      M[4, 4]
    5     4      5      M[5, 5]

The algorithm begins with J⁵ = {5}. Stack J⁴ becomes {4, 5} after s̃[4] is considered, since s̃[4] < M₅. When it comes to s̃[3], the stack J³ is {4, 5} since M₃ = M[3, 4]. When s̃[2] is added, the stack J² becomes {5} since M₂ = M[2, 5]. Finally, J¹ is equal to {1, 5}. The algorithm rebuilds s̄ as 1, 5, 5, 5, 5.

The time complexity of Algorithm 1 is O(n). First, once the stack J is completed, reconstructing s̄ (lines 10-16) clearly takes O(n). Second, the complexity of computing stack J is also linear, as can be seen by considering the number of times that line 5 executes: since each execution of line 5 reduces the size of stack J by 1 and there are only O(n) push operations on stack J, line 5 executes at most O(n) times. In the worst case the stack J can require O(n) space; however, only the top of the stack is accessed during computation and the rest can be written to disk as needed.
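For reference, the following Python sketch is our own reformulation of Algorithm 1 rather than a transcription of its pseudocode: it scans s̃ from right to left, keeps the uniform blocks of the partial solution as (sum, length) pairs on a stack, and merges the new element with the block to its right whenever extending the average would not increase it. It reproduces Example 2.

```python
import numpy as np

def constrained_inference(s_tilde):
    """Linear-time minimum-L2 solution under the sort constraints (cf. Algorithm 1)."""
    s = np.asarray(s_tilde, dtype=float)
    blocks = []                                   # stack of (sum, length); top = leftmost block so far
    for x in reversed(s):
        total, length = x, 1
        # merge with the block to the right while that does not increase the running average
        while blocks and blocks[-1][0] / blocks[-1][1] <= total / length:
            t, l = blocks.pop()
            total, length = total + t, length + l
        blocks.append((total, length))
    out = []                                      # blocks were built right-to-left; reverse to emit s_bar
    for total, length in reversed(blocks):
        out.extend([total / length] * int(length))
    return np.array(out)

print(constrained_inference([1, 9, 4, 3, 4]))     # [1. 5. 5. 5. 5.], matching Example 2
```

Each element is pushed and popped at most once, so the running time is O(n), matching the analysis above.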
Incorporating additional constraints: The output of the algorithm, s̄, may be non-integral and may include negative numbers, when in fact the values of the true degree sequence are constrained to lie in {0, ..., n−1}. We would like a solution that respects these constraints, since they are required in many applications. Theorem 4 shows that such a solution is computed from s̄ by simply rounding.

Theorem 4. Let γ′_S be the constraint set γ_S augmented with the additional constraint that each count be an integer in the set {0, ..., n−1}. Given s̄, the minimum L2 solution for constraint set γ_S, let s̄′ denote the sequence derived from s̄ in which each element s̄[k] is rounded to the nearest value in {0, ..., n−1}. Then s̄′ is a minimum L2 solution that satisfies the constraint set γ′_S.
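Theorem 4 licenses a trivially cheap final step; a one-line sketch (ours), assuming s̄ has already been computed:

```python
import numpy as np

def round_and_clamp(s_bar, n):
    """Round each entry of s_bar to the nearest integer in {0, ..., n-1} (cf. Theorem 4)."""
    return np.clip(np.rint(s_bar), 0, n - 1).astype(int)
```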
V. EXPERIMENTS

The goals of our experiments are two-fold. First, we assess the scalability of the constrained inference algorithm introduced in Section IV. Second, we want to understand the tradeoff between privacy and utility. To do so, we first characterize how the noise introduced for privacy distorts the degree distribution. Then, using several metrics to compare distributions, we assess how accurately the distributions derived from S̄ and S̃ approximate the true degree distribution. The accuracy depends on ε, which governs the amount of privacy. We also assess how the privacy-utility tradeoff changes with the size of the graph (does a bigger graph allow for better utility at a fixed level of privacy?). Finally, we consider one of the most common tasks that analysts perform on a degree distribution: assessing whether it follows a power-law. We measure how the added noise affects the fit of a power-law model.

We experiment on both synthetic and real datasets. The real datasets are derived from crawls of four online social networking sites: Flickr (≈1.8M nodes), LiveJournal (≈5.3M), Orkut (≈3.1M), and YouTube (≈1.1M) [16]. To the best of our knowledge, these are the largest publicly available social network datasets. The synthetic datasets include Random, a classical random graph, which has a Poisson degree distribution (λ = 10), and Power, a random graph with a power-law degree distribution (α = 1.5).

A. Scalability

Figure 2 shows that the runtime of Algorithm 1 scales linearly and is extremely fast. The left figure shows the runtime on the real datasets and the right figure shows the runtime on even larger synthetic datasets of up to 200M nodes. In addition to Random and Power, we include two non-random synthetic distributions corresponding to the best- and worst-case inputs for the runtime of the algorithm. The best case is Regular, a uniform degree distribution (all nodes have degree 10); the worst case is Natural, a distribution having one occurrence of each degree in {0, ..., n−1}.

The small variation in runtime across datasets shows that the algorithm is not particularly sensitive to the type of degree distribution. Furthermore, it is extremely fast: processing a 200 million node graph takes less than 6 seconds. In contrast, the algorithm of Hay et al. [8] takes 20 minutes for a 1 million node graph and over an hour to process a 2 million node graph. The efficiency of the improved algorithm makes the constrained inference approach practical for large graphs.

[Figure 2. Runtime of Algorithm 1 on real (left) and larger synthetic datasets (right).]

B. Utility

We use two measures to assess accuracy. First, we use the Kolmogorov-Smirnov (KS) statistic, a measure used to test whether two samples are drawn from the same distribution. Let the empirical cumulative distribution function (CDF) of sample X = X_1, ..., X_n be defined as F_X(x) = (1/n) Σ_{i=1}^n I[X_i ≤ x]. Then the KS statistic between X and Y is KS(X, Y) = max_x |F_X(x) − F_Y(x)|.

The KS statistic is insensitive to differences in the tails of the two distributions, so we also use the Mallows distance (a.k.a. Earth Mover's distance) to capture deviations in the tail. Given samples X and Y each of size n, with X_(i) denoting the i-th largest sample in X, the Mallows p-distance is

    Mallows_p(X, Y) = ( (1/n) Σ_{i=1}^n |X_(i) − Y_(i)|^p )^{1/p}

An example shows how the Mallows distance is more sensitive than the KS statistic to the tail of the distribution. Consider three graphs A, B, and C in which all nodes have degree 1, except in B one node has degree 2 and in C one node has degree n−1. The KS statistic between A and either B or C is O(n⁻¹). The Mallows distance (p = 1) between A and B is O(n⁻¹), but between A and C the Mallows distance is O(1), capturing the difference between their largest degrees.
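Both measures are easy to compute on degree samples; a short sketch (ours), reproducing the A-versus-C comparison above:

```python
import numpy as np

def ks_statistic(x, y):
    """KS(X, Y) = max_t |F_X(t) - F_Y(t)| over the empirical CDFs of two samples."""
    x, y = np.sort(x), np.sort(y)
    grid = np.union1d(x, y)
    fx = np.searchsorted(x, grid, side="right") / len(x)
    fy = np.searchsorted(y, grid, side="right") / len(y)
    return np.abs(fx - fy).max()

def mallows_distance(x, y, p=2):
    """Mallows p-distance between equal-size samples: order statistics compared pairwise."""
    x, y = np.sort(x), np.sort(y)
    return (np.mean(np.abs(x - y) ** p)) ** (1.0 / p)

n = 1000
A = np.ones(n)
C = np.ones(n); C[-1] = n - 1
print(ks_statistic(A, C))             # about 1/n
print(mallows_distance(A, C, p=1))    # about 1, i.e., O(1)
```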
The variance of S is varS(x)= condition remain fixed as n increases. If node degrees were 2 E[(CFS(x) − E[CFS(x)]) ]. We focus on S because it is to increase with graph size, then holding fixed would mean evident from Figure 3(a) that S˜ exhibits substantial bias. that while the absolute disclosure remains fixed, the relative We evaluate the bias/variance of S empirically thru re- disclosure about a node’s neighborhood would increase peated sampling. The results are shown in the bottom panel with n. When evaluating graph models where node degrees of Figure 3(a). The y-axis is the difference in cumulative increase with size (e.g., forest-fire graphs [11]), it may be probability between S and S, CFS(x) − CFS(I)(x). The appropriate to decrease as n increases. line shows the average difference (bias) and the error bars Modeling power-law distributions: Our final exper- depict the standard deviation from the average (square root iment assesses how accurately the analyst can estimate ˜ of variance). The line remains near 0, suggesting that S may the parameters of a power-law model using S or S. The be an unbiased or nearly unbiased estimator of S(I). The experiment is designed as follows. First, we sample a Power variance peaks wherever the CCDF exhibits steepest change. graph with parameters θ =(α =1.5,xmin = 10).We Accuracy vs. privacy: Figures 3(b) and 3(c) show fix this as the true degree distribution. Then we sample s˜ the relationship between privacy and the accuracy for two and s and derive corresponding distributions. To each of measures of accuracy—KS in 3(b), Mallows in 3(c). We these three degree distributions, we fit a power-law model report the average accuracy over 10 trials (random samplings using maximum likelihood [2]. The result is three different of s˜). The amount of privacy is controlled by the parameter estimates for the parameters θ, which we denote θˆ, θ˜, and (horizontal axis)—smaller corresponds to stronger privacy. θ¯ respectively. We are interested in comparing the model fit The results show that S is uniformly more accurate than to the true degree distribution, θˆ, to the models fit under S˜, across all datasets, settings of , and both measures differential privacy, θ˜ and θ¯. of accuracy. Furthermore, for low settings of (stronger The individual parameter estimates are shown in the privacy), the difference in accuracy is greater, suggesting that middle and right plot of Figure 3(e), but the leftmost plot the benefit of constrained inference increases with privacy. provides a holistic assessment of the model fit. It assesses Also shown in the figure is the accuracy of an estimate model fit using the D statistic of Clauset et al. [2] which based on random sampling (10% of the degrees are sampled measures the KS statistic on the power-law tail of the uniformly at random). While sampling does not provide distribution. We consider two variants of this measure: in differential privacy, it can serve as a useful reference point. one, the tail is defined by the estimate of xmin under s˜ or Sampling has very low KS distance (as expected), but higher s; in the other, xmin is based on the true xmin. Mallows distance because random sampling is unlikely to The plots reveal that using either S˜ or S, the analyst will select the high degree nodes in the tail. 
The individual parameter estimates are shown in the middle and right plots of Figure 3(e), but the leftmost plot provides a holistic assessment of the model fit. It assesses model fit using the D statistic of Clauset et al. [2], which measures the KS statistic on the power-law tail of the distribution. We consider two variants of this measure: in one, the tail is defined by the estimate of x_min under s̃ or s̄; in the other, the tail is based on the true x_min.

The plots reveal that, using either S̃ or S̄, the analyst will estimate a model that has a close fit to the tail of the original (power-law) distribution, when the tail is defined by the x_min estimated on the noisy distribution. However, they also show that the size of the tail is under-estimated (the power-law behavior becomes apparent only for large degrees). If we compare the models based on how well they fit the true tail of the power-law distribution (solid lines of the leftmost plot), we see that S̃ has considerable distortion (note the log scale) while S̄ is reasonably accurate even at small ε.

~s s s CF(x) 0.0 0.2 0.4 0.6 0.8 1.0 ) x ( S bias −0.15 0.05 01 49194901 4919490 1 4 9 19 49 99 01 4919490 1 4 9 19 49 99 01 491949 x x x x x x

(a) Complementary CDFs of S(I), s˜ and s (top). Bias of S (bottom)

flickr livejournal orkut youtube power random ~ S S S sampling KS distance to 0.0 0.1 0.2 0.3 0.4 0.5

0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 ε ε ε ε ε ε

(b) Privacy () vs. Accuracy (KS distance)

flickr livejournal orkut youtube power random ~ S S S sampling Mallows distance to Mallows 5e−02 5e−01 5e+00 5e+01 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 ε ε ε ε ε ε

(c) Privacy () vs. Accuracy (Mallows distance with p =2)

power power Power−law Fit Estimating α Estimating xmin ~ α~ ~ θ from x^ xmin ~ ~ ~ min α S S θ from ~x xmin

S min S ● S sample

^ ^ θ α S from x ● sampling ● ● ● sampling min θ from x ● ● ● min ● θ^ from x^ ● ● min ●● ● ● x

● KS distance to KS distance to s Absolute Error to Mallows distance to Mallows

●●●●● ● ● ● ● ● ● ● ● 0.002 0.006 0.010 0.014 1 2 5 10 50 200 0.0 0.1 0.2 0.3 0.4 0.5 10 20 50 200 500

0e+00 2e+06 4e+06 0e+00 2e+06 4e+06 0.0010.010 0.005 0.020 0.032 0.100 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 0.010 0.032 0.100 0.320 1.000 size size ε ε ε (d) Size vs. Accuracy for fixed =0.01 (e) Accuracy of estimating power-law model using S˜, S.

Figure 3. The privacy-utility tradeoff of two differentially private estimators S˜ and S.

VI. RELATED WORK

The constrained inference technique that underlies this work was originally proposed in [8]. That work focuses on using constraints to improve accuracy for a variety of counting queries. While applications to degree estimation were recognized by the authors, a number of issues necessary for practical network analysis were left open. The present paper shows that inference only requires linear time and that the framework can be extended to include the additional constraints of integrality and non-negativity. Further, we resolve open questions about the practical utility of the algorithm—showing that it scales to large graphs and produces accurate, nearly unbiased estimates of the degree distribution—and provide a more complete characterization of the privacy-utility tradeoffs.

Most prior work focuses on protecting privacy through anonymization, transforming the graph so that nodes cannot be re-identified [3], [7], [13], [22], [23]. The output is a published graph which the analyst can study in place of the original. While publishing a graph allows a broader range of analysis, anonymization is a much weaker notion of privacy and is vulnerable to attack (e.g., [6], [21]). Furthermore, for the analyst interested in studying the degree distribution, these techniques may not scale to large graphs and can introduce considerable distortion. For example, the technique of Liu & Terzi [13], which is the most scalable approach, appears to considerably bias power-law degree distributions, reducing the power-law coefficient by 0.5 for reasonable settings of the privacy parameters. Our estimates have much smaller error (e.g., 0.004 at ε = 0.01) and satisfy a much stronger privacy condition.

Differential privacy has been an active area of research (Dwork [4] provides a survey). Enabling accurate analysis of social networks is an often mentioned goal, but we are aware of only a few concrete results: techniques for computing properties, such as the clustering coefficient, that have high sensitivity [19], [20], and a forthcoming approach that estimates the parameters of a random graph model [15].
VII. CONCLUSION

For the task of approximating the degree distribution of a private social network, we present an algorithm that protects privacy, scales to large graphs, and produces extremely accurate approximations. Our approach satisfies differential privacy, which means that, unlike approaches based on anonymization, it provides extremely robust protection, even against powerful adversaries. Finally, given the importance of the degree distribution to the structure of a graph, we believe that our techniques are a critical first step towards the ultimate goal of publishing synthetic graphs that are both accurate and ensure differential privacy.

ACKNOWLEDGMENTS

Hay and Jensen were supported by the Air Force Research Laboratory and the Intelligence Advanced Research Projects Activity (IARPA), under agreement number FA8750-07-2-0158. Hay, Miklau, and Li were supported by NSF CNS 0627642. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Intelligence Advanced Research Projects Activity (IARPA), or the U.S. Government.

REFERENCES

[1] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.
[2] A. Clauset, C. R. Shalizi, and M. Newman. Power-law distributions in empirical data. SIAM Review, 2009.
[3] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy. Class-based graph anonymization for social network data. In VLDB, 2009.
[4] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
[5] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[6] S. Ganta, S. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.
[7] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.
[8] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private histograms through consistency. arXiv:0904.0942, 2009.
[9] A. Klovdahl, J. Potterat, D. Woodhouse, J. Muth, S. Muth, and W. Darrow. Social networks and infectious disease: The Colorado Springs study. Soc Sci Med, 1994.
[10] G. Kossinets and D. Watts. Empirical analysis of an evolving social network. Science, 311(5757):88-90, 2006.
[11] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In KDD, 2005.
[12] L. Li, D. Alderson, J. Doyle, and W. Willinger. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics, 2005.
[13] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008.
[14] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat. Systematic topology analysis and generation using degree correlations. In SIGCOMM, 2006.
[15] D. Mir and R. Wright. A differentially private graph estimator. In International Workshop on Privacy Aspects of Data Mining, 2009.
[16] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC, 2007.
[17] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Symposium on Security and Privacy (Oakland), 2009.
[18] M. Newman, S. Strogatz, and D. Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2), 2001.
[19] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
[20] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output perturbation for queries with joins. In PODS, 2009.
[21] R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving data publishing. In VLDB, 2007.
[22] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008.
[23] L. Zou, L. Chen, and M. T. Özsu. K-Automorphism: A general framework for privacy preserving network publication. In VLDB, 2009.