Graphical-model based estimation and inference for differential privacy

Ryan McKenna¹   Daniel Sheldon¹,²   Gerome Miklau¹

Abstract

Many privacy mechanisms reveal high-level information about a distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms.

1. Introduction

Differential privacy (Dwork et al., 2006) has become the dominant standard for controlling the privacy loss incurred by individuals as a result of public data releases. For complex data analysis tasks, error-optimal algorithms are not known, and a poorly designed algorithm may result in much greater error than strictly necessary for privacy. Thus, careful algorithm design, focused on reducing error, is an area of intense research in the privacy community.

For the private release of statistical queries, nearly all recent algorithms (Zhang et al., 2017; Li et al., 2015; Lee et al., 2015; Proserpio et al., 2014; Li et al., 2014; Qardaji et al., 2013b; Nikolov et al., 2013; Hardt et al., 2012; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Hardt & Rothblum, 2010; Hardt & Talwar, 2010; Barak et al., 2007; Gupta et al., 2011; Thaler et al., 2012; Acs et al., 2012; Zhang et al., 2014; Yaroslavtsev et al., 2013; Cormode et al., 2012; Qardaji et al., 2013a; McKenna et al., 2018) include steps within the algorithm where answers to queries are inferred from noisy answers to a set of measurement queries already answered by the algorithm.

Inference is a critical component of privacy mechanisms because: (i) it can reduce error when answering a query by combining evidence from multiple related measurements, (ii) it provides consistent query answers even when measurements are noisy and inconsistent, and (iii) it provides the above benefits without consuming the privacy-loss budget, since it is performed only on privately-computed measurements without re-using the protected data.

Consider a U.S. Census dataset, exemplified by the Adult table, which consists of 15 attributes including age, sex, race, income, and education. Given noisy answers to a set of measurement queries, our goal is to infer answers to one or more new queries. The measurement queries might be expressed over each individual attribute (age), (sex), (race), etc., as well as selected combinations of attributes (age, income), (age, race, education), etc. When inference is done properly, the estimate for a new query (e.g., counting the individuals with income ≥ 50K, 10 years of education, and over 40 years old) will use many, or even all, available measurements.

Current inference methods are limited in both scalability and generality. Most methods first estimate some model of the data and then answer new queries using the model. Perhaps the simplest model is a full contingency table, which stores a value for every element of the domain. When the measurements are linear queries (a common case, and our primary focus), least-squares (Hay et al., 2010; Nikolov et al., 2013; Li et al., 2014; Qardaji et al., 2013b; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010) and multiplicative-weight updates (Hardt & Rothblum, 2010; Hardt et al., 2012) have both been used to estimate this model from the noisy measurements. New queries can then be answered by direct calculation.
However, the size of the contingency table is the product of the domain sizes of each attribute, so these methods break down for high-dimensional cases (or even a modest number of dimensions with large domains). In the example above, the full contingency table would consist of 10^19 entries. To avoid this, factored models have been considered (Hardt et al., 2012; Zhang et al., 2017). However, these factored approaches have their own limitations, including restricting the query class (Hardt et al., 2012) or failing to properly account for (possibly varying) noise in measurements (Zhang et al., 2017).

¹University of Massachusetts, Amherst  ²Mount Holyoke College. Correspondence to: Ryan McKenna.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

In this work we show that graphical models provide a foundation for significantly improved inference. We propose to use a graphical model instead of a full contingency table as a model of the data distribution. Doing so avoids an intractable full materialization of the contingency table and retains the ability to answer a broad class of queries. We show that the graphical model representation corresponds to using a maximum entropy criterion to select a single data distribution among all distributions that minimize estimation loss. The structure of the graphical model is determined by the measurements, such that no information is lost relative to a full contingency table representation; but when each measurement is expressible over a low-dimensional marginal of the contingency table, as is common, the graphical model representation is much more compact.

This work is focused on developing a principled and general approach to inference in privacy mechanisms. Our method is agnostic to the loss function used to estimate the data model and to the noise distribution used to achieve privacy. We focus primarily on linear measurements, but also describe an extension to non-linear measurements.

We assume throughout that the measurements are given, but we show our inference technique is versatile since it can be incorporated into many existing private query-answering algorithms that determine measurements in different ways. For those existing algorithms that scale to high-dimensional data, our graphical-model based estimation method can substantially improve accuracy (with no cost to privacy). Even more importantly, our estimation method can be added to some algorithms which fail to scale to high-dimensional data, allowing them to run efficiently in new settings. We therefore believe our inference method can serve as a basic building block in the design of new privacy mechanisms.

2. Background and Problem Statement

Data. Our input data represents a population of individuals, each contributing a single record x = (x_1, ..., x_d), where x_i is the ith attribute, belonging to a discrete finite domain X_i of n_i possible values. The full domain is X = X_1 × ... × X_d and its size n = ∏_{i=1}^d n_i is exponential in the number of attributes. A dataset X consists of m such records: X = (x^(1), ..., x^(m)). We also consider a normalized contingency table representation p, which counts the fraction of the population with record equal to x, for each x in the domain. That is, p(x) = (1/m) Σ_{i=1}^m I{x^(i) = x} for all x ∈ X, where I{·} is an indicator function. Thus p is a probability vector in R^n with index set X (ordered lexicographically). We write p = p_X when it is important to denote the dependence on X.

Queries, Marginals, and Measurements. We focus on the most common case of linear queries expressed over subsets of attributes. We will describe an extension to a generalized class of queries, including non-linear ones, in Section 3.1. A linear query set f_Q(X) is defined by a query matrix Q ∈ R^{r×n} and has answer f_Q(X) = Q p_X. The ith row of Q, denoted q_i^T, represents a single scalar-valued query. In most cases we will refer unambiguously to the matrix Q, as opposed to f_Q, as the query set. We often consider query sets that can be expressed on a marginal (over a subset of attributes) of the vector p. Let A ⊆ [d] identify a subset of attributes and, for x ∈ X, let x_A = (x_i)_{i∈A} be the sub-vector of x restricted to A. Then the marginal probability vector (or simply "marginal on A"), µ_A, is defined by:

    µ_A(x_A) = (1/m) Σ_{i=1}^m I{x_A = x_A^(i)},   ∀x_A ∈ X_A := ∏_{i∈A} X_i.

The size of the marginal is n_A := |X_A| = ∏_{i∈A} n_i, which is exponential in |A| but may be considerably smaller than n. Note that µ_A(x_A) is a linear function of p, so there exists a matrix M_A ∈ R^{n_A×n} such that µ_A = M_A p. When a query set depends only on the marginal vector µ_A, we call it a marginal query set, written as Q_A ∈ R^{r_A×n_A}, with answer f_{Q_A}(X) = Q_A µ_A. The marginal query set Q_A is equivalent to the query set Q = Q_A M_A on the full contingency table, since Q_A µ_A = (Q_A M_A) p. One important marginal query set asks for the marginal vector itself, in which case Q_A = I_{n_A×n_A} (the identity matrix).

In our problem formulation, we consider measurements consisting of a collection of marginal query sets. Specifically, let 𝒞 be a collection of measurement sets, where each C ∈ 𝒞 is a subset of [d].¹ For each measurement set C ∈ 𝒞, we are given a marginal query set Q_C.

¹Later, these will comprise the cliques of a graphical model, as the notation suggests.
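To make this notation concrete, the following numpy sketch (our own illustration, not code from the paper) builds the contingency table vector p for a toy two-attribute domain and checks that a marginal is a linear function of p, i.e., µ_A = M_A p:

    import numpy as np

    # Toy domain: d = 2 attributes with sizes (2, 3), so n = 6; m = 4 records.
    sizes = (2, 3)
    X = np.array([[0, 1], [1, 2], [0, 1], [1, 0]])
    m = len(X)

    # Contingency table p: the fraction of records equal to each x in the domain.
    p = np.zeros(sizes)
    for record in X:
        p[tuple(record)] += 1.0 / m

    # Marginal on A = {0}: sum out attribute 1; a linear function of p.ravel().
    mu_A = p.sum(axis=1)
    M_A = np.kron(np.eye(2), np.ones(3))    # shape (2, 6); one-hot row sums
    assert np.allclose(mu_A, M_A @ p.ravel())

The Kronecker form of M_A above foreshadows the factored query matrices developed in the supplement.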
The following notation is helpful to refer to combined measurements and their marginals. Let µ = (µ_C)_{C∈𝒞} be the combined vector of marginals, and let Q_𝒞 be the block-diagonal matrix with diagonal blocks {Q_C}_{C∈𝒞}, so that the entire set of query answers can be expressed as Q_𝒞 µ. Finally, let M_𝒞 be the matrix that vertically concatenates the matrices {M_C}_{C∈𝒞}, so that µ = M_𝒞 p and Q_𝒞 µ = Q_𝒞 M_𝒞 p. This shows that our measurements are equivalent to the combined query set Q = Q_𝒞 M_𝒞 applied to the full table p.
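As a small illustration of this block structure (our own sketch, with hypothetical cliques over a toy three-attribute domain), scipy's sparse utilities assemble Q_𝒞 and M_𝒞 directly:

    import numpy as np
    import itertools
    from scipy.sparse import block_diag, vstack, identity, csr_matrix

    sizes = (2, 3, 2)
    n = int(np.prod(sizes))

    def marg_matrix(A):
        # 0/1 matrix M_A: M_A[i, j] = 1 iff full-domain cell j restricts to cell i of X_A.
        dims = [sizes[a] for a in A]
        M = np.zeros((int(np.prod(dims)), n))
        for j, x in enumerate(itertools.product(*map(range, sizes))):
            M[np.ravel_multi_index([x[a] for a in A], dims), j] = 1.0
        return csr_matrix(M)

    cliques = [(0, 1), (1, 2)]
    M_blocks = [marg_matrix(C) for C in cliques]
    Q_blocks = [identity(M.shape[0]) for M in M_blocks]  # each Q_C asks for the marginal itself

    M_cal = vstack(M_blocks)        # stacked marginalization matrices
    Q_cal = block_diag(Q_blocks)    # block-diagonal query matrix
    p = np.random.default_rng(0).dirichlet(np.ones(n))
    mu = M_cal @ p                  # combined marginal vector
    assert np.allclose(Q_cal @ mu, (Q_cal @ M_cal) @ p)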

Differential privacy. Differential privacy protects individuals by bounding the impact any one individual can have on the output of an algorithm.

Definition 1 (Differential Privacy; Dwork et al., 2006). A randomized algorithm A satisfies (ε, δ)-differential privacy if for any input X, any X′ ∈ nbrs(X), and any subset of outputs S ⊆ Range(A),

    Pr[A(X) ∈ S] ≤ exp(ε) Pr[A(X′) ∈ S] + δ.

Above, nbrs(X) denotes the set of datasets formed by replacing any x^(i) ∈ X with an arbitrary new record x′^(i) ∈ X. When δ = 0 we say A satisfies ε-differential privacy. Differentially private answers to f_Q are typically obtained with a noise-addition mechanism, such as the Laplace or Gaussian mechanism. For ε-differential privacy, the noise added to the output of f_Q is determined by the L1 sensitivity of f_Q, which, specialized to linear queries, is defined as ∆_Q = max_{X, X′∈nbrs(X)} ‖Q p_X − Q p_{X′}‖_1. It is straightforward to show that ∆_Q = (2/m)‖Q‖_1, where ‖Q‖_1 is the maximum L1 norm of the columns of Q.

Definition 2 (Laplace Mechanism; Dwork et al., 2006). Given a query set Q ∈ R^{r×n} of r linear queries, the Laplace mechanism is defined as L(X) = Q p_X + z, where z = (z_1, ..., z_r) and each z_i is an i.i.d. sample from Laplace(∆_Q/ε).

The Laplace mechanism satisfies ε-differential privacy. The sequential composition property implies that if we answer two query sets Q_1 and Q_2, under ε_1 and ε_2 differential privacy, respectively, then the combined answers are (ε_1 + ε_2)-differentially private. The post-processing property of differential privacy (Dwork & Roth, 2014) asserts that post-processing the output of a differentially private algorithm (without using the original protected data) does not affect the privacy guarantee.

Problem Statement. We assume as given a collection 𝒞 of measurement sets, and for each C ∈ 𝒞: a marginal query set Q_C, a privacy parameter ε_C, and an ε_C-differentially private measurement y_C = Q_C µ_C + Lap(∆_{Q_C}/ε_C). The combined measurements are y = (y_C)_{C∈𝒞}, which satisfy ε-differential privacy for ε = Σ_{C∈𝒞} ε_C by sequential composition. Note that there is no loss of generality in these assumptions; in the extreme case, there may be just a single measurement set C = [d] consisting of all attributes. Formulating the problem this way will allow us to realize computational savings when measurements are not full-dimensional, which is common in practice. We also emphasize that the marginal query set Q_C is often a complex set of linear queries expressed over measurement set C (not simply a marginal). Many past works (Li et al., 2015; 2014; Qardaji et al., 2013b; Nikolov et al., 2013; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Barak et al., 2007) have shown that it is beneficial, in the presence of noise-addition for privacy, to measure carefully chosen query sets which balance sensitivity against efficient reconstruction of the workload queries.

Our goal is: given y, derive answers to (possibly different) workload queries W. There are multiple possible motivations: W may include new queries that were not part of the original measurements; or it is possible that W is a subset of the measurement queries, but we can obtain a more accurate answer by combining all of the available information to estimate Wp, as opposed to just using the noisy answer directly. We describe an extension to non-linear queries in Section 3.1; this will be applied to the DualQuery algorithm (Gaboardi et al., 2014) in Section 4.
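For concreteness, a single noisy measurement y_C = Q_C µ_C + Lap(∆/ε_C) can be simulated as in the sketch below (ours, not from the paper). It uses the identity query set Q_C = I, for which each record contributes to exactly one cell of the marginal, so ∆ = 2/m:

    import numpy as np

    def measure_laplace(mu_C, m, epsilon, rng):
        # Identity query set on a normalized marginal: L1 sensitivity is 2/m.
        scale = (2.0 / m) / epsilon
        return mu_C + rng.laplace(loc=0.0, scale=scale, size=mu_C.shape)

    rng = np.random.default_rng(0)
    y_C = measure_laplace(np.array([0.25, 0.5, 0.25]), m=1000, epsilon=0.1, rng=rng)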
3. Algorithms for Estimation and Inference

What principle can we follow to estimate answers to the workload query set? Prior work takes the approach of first using all available information to estimate a full contingency table p̂ ≈ p, and then using p̂ to answer later queries (Hay et al., 2010; Li et al., 2010; Ding et al., 2011; Qardaji et al., 2013b; Lee et al., 2015). We will call finding p̂ estimation, and using p̂ to answer new queries inference.

3.1. Optimization Formulation

The standard framework for estimation and inference is:

    p̂ ∈ argmin_{p∈S} L(p),    (estimation)
    f_W(X) ≈ W p̂.    (inference)

Here S = {p : p ≥ 0, 1^T p = 1} is the probability simplex and L(p) is a loss function that measures how well p explains the observed measurements. In past works, L(p) = ‖Qp − y‖ has been used as a loss function, where Q is the measured query set and ‖·‖ is either the L1 or the L2 norm. Minimizing the L1 norm is equivalent to maximum likelihood estimation when the noise comes from the Laplace mechanism (Lee et al., 2015). Minimizing the L2 norm is far more common in the literature, however, and it is also the maximum likelihood estimator for Gaussian noise (Hay et al., 2010; Nikolov et al., 2013; Li et al., 2014; Qardaji et al., 2013b; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; McKenna et al., 2018). Our method supports both of these loss functions; we only require that L is convex. Both loss functions are easily adapted to the situation where queries in Q may be measured with differing degrees of noise. The constraint p ∈ S may also be relaxed, which simplifies L2 minimization; additionally, under different assumptions and an alternate version of privacy, the number of individuals may not be known. All existing algorithms for solving these variations of the estimation problem suffer from the same drawback: they do not scale to high dimensions, since the size of p is exponential in d and it must be constructed explicitly as an intermediate step, even if the inputs and outputs are small (e.g., all measurement queries are over low-dimensional marginals).

Optimization in Terms of Marginals. For marginal query sets, a loss function will typically depend on p only through its marginals µ. For example, when Q = Q_𝒞 M_𝒞, we have L(p) = ‖Qp − y‖ = ‖Q_𝒞 µ − y‖ = L(µ), where we now write the loss function as L(µ). More generally, we will consider any loss function that depends only on the marginals.
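As a concrete instance (our sketch), the squared L2 loss over the combined marginal vector and its gradient, which the estimation algorithms below consume; we use the squared norm so the gradient is linear, and the paper's algorithms only require that L is convex:

    import numpy as np

    def make_l2_loss(Q, y):
        # Q is the block-diagonal measurement matrix over the combined
        # marginal vector mu; y holds the noisy answers.
        def loss(mu):
            r = Q @ mu - y
            return 0.5 * float(r @ r)        # L(mu) = 0.5 ||Q mu - y||^2
        def grad(mu):
            return Q.T @ (Q @ mu - y)        # gradient of L at mu
        return loss, grad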

A very general case is when L(µ) = −log p(y | µ) is the negative log-likelihood of any differentially private algorithm that produces output y depending only on the marginal vector µ (see our treatment of DualQuery in Section 4).

The marginal vector µ may be much lower dimensional than p. How can we take advantage of this fact? An "obvious" idea would be to modify the optimization to estimate only the marginals as µ̂ ∈ argmin_{µ∈M} L(µ), where M = {µ : ∃p ∈ S s.t. M_𝒞 p = µ} is the marginal polytope, the set of all valid marginals. There are two issues here. First, the marginal polytope has a complex combinatorial structure and, although it is a convex set, it is generally not possible to enumerate its constraints for use with standard convex optimization algorithms. Note that this optimization problem is in fact a generic convex optimization problem over the marginal polytope, and as such it generalizes standard graphical model inference problems (Wainwright & Jordan, 2008). Second, after finding µ̂ it is not clear how to answer new queries, unless they depend only on some measured marginal µ_C.

Graphical Model Representation. After finding an optimal µ̂ we want to answer new queries that do not necessarily depend directly on the measured marginals. To do this we need to identify a distribution p̂ that has marginals µ̂, and we must have a tractable representation of this distribution. Also, since there may be many p̂ that give rise to the same marginals, we want a principled criterion to choose a single estimate, such as the principle of maximum entropy. We accomplish these goals using undirected graphical models.

A graphical model represents a high-dimensional distribution as a product of factors φ_C defined on attribute subsets, i.e., p(x) = ∏_{C∈𝒞} φ_C(x_C). A factored representation is often much more compact than an explicit one and facilitates efficient computation of marginals (Koller & Friedman, 2009).
We use a log-linear form with parameters θ_C(x_C) = log φ_C(x_C).

Definition 3 (Graphical model). Let p_θ(x) = (1/Z) exp(Σ_{C∈𝒞} θ_C(x_C)) be a normalized distribution, where θ_C ∈ R^{n_C}. This distribution is a graphical model that factors over the measurement sets 𝒞, which are the cliques of the graphical model. The vector θ = (θ_C)_{C∈𝒞} is the parameter vector.

Theorem 1 (Maximum entropy (Wainwright & Jordan, 2008)). Given any µ̂ in the interior of M, there is a parameter vector θ̂ such that the graphical model p_θ̂(x) has maximum entropy among all p̂(x) with marginals µ̂.²

²If the marginals are on the boundary of M, e.g., if they contain zeros, there is a sequence of parameters {θ^(n)} such that p_θ^(n)(x) converges to the maximum-entropy distribution as n → ∞. See (Wainwright & Jordan, 2008).

Theorem 1 says that, after finding µ̂, we can obtain a factored representation of the maximum-entropy distribution with these marginals by finding the graphical model parameters θ̂. This is the problem of learning in a graphical model, which is well understood (Wainwright & Jordan, 2008).

3.2. Estimation: optimizing over the marginal polytope

We need algorithms to find µ̂ and θ̂. We considered a variety of algorithms and present two of them here. Both are proximal algorithms for solving convex problems with "simple" constraints (Parikh et al., 2014). Central to our algorithms is a subroutine MARGINAL-ORACLE, which is some black-box algorithm for computing the clique marginals µ of a graphical model from the parameters θ. This is the problem of marginal inference in a graphical model. MARGINAL-ORACLE may be any marginal inference routine — we use belief propagation on a junction tree. In the remainder of this section, we assume that the cliques 𝒞 are the cliques of a junction tree. This is without loss of generality, since we can enlarge cliques as needed until this property is satisfied.

Algorithm 1 Proximal Estimation Algorithm
  Input: Loss function L(µ) between µ and y
  Output: Estimated data distribution p̂_θ
  θ = 0
  for t = 1, ..., T do
    µ = MARGINAL-ORACLE(θ)
    θ = θ − η_t ∇L(µ)
  end for
  return p̂_θ

Algorithm 1 is a routine to find µ̂ by solving a convex optimization problem over the marginal polytope. Due to the special structure of the algorithm, it also finds the parameters θ̂. Algorithm 1 is inspired by the entropic mirror descent algorithm for solving convex optimization problems over the probability simplex (Beck & Teboulle, 2003). The iterates of the optimization are obtained by solving simpler optimization problems of the form:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)    (1)

where D is a Bregman divergence chosen to reflect the geometry of the marginal polytope. Here we use the following Bregman divergence generated from the Shannon entropy: D(µ, µ^t) = −H(µ) + H(µ^t) + (µ − µ^t)^T ∇H(µ^t), where H(µ) is the Shannon entropy of the graphical model p_θ with marginals µ. Since we assumed above that µ are marginals of the cliques of a junction tree, the Shannon entropy is convex and easily computed as a function of µ alone (Wainwright & Jordan, 2008).³

³An alternative would be to use the Bethe entropy as in (Vilnis et al., 2015). The Bethe entropy is convex and computable from µ alone regardless of the model structure. Using the Bethe entropy would lead to approximate marginal inference instead of exact marginal inference in the subproblems, which is an interesting direction for future work.
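Setting aside the derivation details, the resulting procedure is only a few lines once a marginal inference routine is available. A minimal sketch of Algorithm 1 (ours), where loss_grad and marginal_oracle are assumed to be supplied, and θ is represented as a flat numpy vector:

    def proximal_estimation(loss_grad, marginal_oracle, theta0, step, iters=10000):
        # Algorithm 1: mirror descent over the marginal polytope.
        #   loss_grad(mu):       gradient of the convex loss L at clique marginals mu
        #   marginal_oracle(th): clique marginals of the graphical model p_th
        #   step(t):             step size schedule (constant, decreasing, ...)
        theta = theta0
        for t in range(iters):
            mu = marginal_oracle(theta)            # marginal inference, e.g. junction tree BP
            theta = theta - step(t) * loss_grad(mu)
        return theta, marginal_oracle(theta)       # parameters and their marginals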

With this divergence, the objective of the subproblem in Equation 1 can be seen to be equal to a variational free energy, which is minimized by marginal inference in a graphical model. The full derivation is provided in the supplement. The implementation of Algorithm 1 is very simple — it requires only calling MARGINAL-ORACLE at each iteration. Additionally, even though the algorithm is designed to find the optimal µ, it also returns the corresponding graphical model parameters θ "for free" as a by-product of the optimization. This is evident from Algorithm 1: upon convergence, µ is the vector of marginals of the graphical model with parameters θ. The variable η_t in this algorithm is a step size, which can be constant, decreasing, or found via line search. This algorithm is an instance of mirror descent, and thus inherits its convergence guarantees. It will converge for any convex loss function L at a O(1/√t) rate,⁴ even ones that are not smooth, such as the L1 loss.

⁴That is, L(µ^t) − L(µ*) ∈ O(1/√t).

Algorithm 2 Accelerated Proximal Estimation Algorithm
  Input: Loss function L(µ) between µ and y
  Output: Estimated data distribution p̂_θ
  K = Lipschitz constant of ∇L
  ḡ = 0
  ν, µ = MARGINAL-ORACLE(0)
  for t = 1, ..., T do
    c = 2/(t+1)
    ω = (1 − c)µ + cν
    ḡ = (1 − c)ḡ + c ∇L(ω)
    θ = −(t(t+1)/4K) ḡ
    ν = MARGINAL-ORACLE(θ)
    µ = (1 − c)µ + cν
  end for
  return graphical model p̂_θ with marginals µ

We now present a related algorithm, based on the same principles as Algorithm 1, which has an improved O(1/t²) convergence rate for convex loss functions with Lipschitz continuous gradients. Algorithm 2 is based on Nesterov's accelerated dual averaging approach (Nesterov, 2009; Xiao, 2010; Vilnis et al., 2015). The per-iteration complexity is the same as Algorithm 1, as it requires calling MARGINAL-ORACLE once, but this algorithm will converge in fewer iterations. Algorithm 2 has the advantage of not requiring a step size to be set, but it requires knowledge of the Lipschitz constant of ∇L. For the standard L2 loss with linear measurements, this is equal to the largest eigenvalue of Q^T Q. The derivation of this algorithm appears in the supplement.
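The accelerated variant is equally short in code. A sketch of Algorithm 2 (ours), under the same assumptions as the previous sketch, with marginals and averaged gradients represented as flat numpy vectors:

    def accelerated_estimation(loss_grad, marginal_oracle, K, iters=10000):
        # Algorithm 2: accelerated dual averaging over the marginal polytope.
        # K is the Lipschitz constant of grad L (for L2 loss with linear
        # measurements, the largest eigenvalue of Q^T Q).
        g_bar = 0.0
        nu = marginal_oracle(0.0)      # marginals of the uniform distribution
        mu = nu
        for t in range(1, iters + 1):
            c = 2.0 / (t + 1)
            omega = (1 - c) * mu + c * nu
            g_bar = (1 - c) * g_bar + c * loss_grad(omega)
            theta = -t * (t + 1) / (4 * K) * g_bar
            nu = marginal_oracle(theta)
            mu = (1 - c) * mu + c * nu
        return mu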
3.3. Inference

Once p̂_θ has been estimated, we need algorithms to answer new queries without materializing the full contingency table representation. This corresponds to the problem of inference in a graphical model. If the new queries depend on p̂_θ only through its clique marginals µ, we can immediately answer them using MARGINAL-ORACLE, or by saving the final value of µ from Algorithm 1 or 2. If the new queries depend on some other marginals, outside of the cliques of the graphical model, we instead use the variable elimination algorithm (Koller & Friedman, 2009) to first compute the necessary marginal, and then answer the query. In Section B of the supplement, we present a novel inference algorithm that is related to variable elimination but is faster for answering certain queries because it does not need to materialize full marginals if the query does not need them. For more complicated downstream tasks, we can generate synthetic data by sampling from p̂_θ, although this should be avoided when possible as it introduces additional sampling error.

4. Use in Privacy Mechanisms

Next we describe how our estimation algorithms can improve the accuracy and/or scalability of four state-of-the-art mechanisms: MWEM, PrivBayes, HDMM, and DualQuery.

MWEM. The multiplicative weights exponential mechanism (Hardt et al., 2012) is an active-learning style algorithm designed to answer a workload of linear queries. MWEM maintains an approximation of the data distribution and at each time step selects the worst approximated query q_i^T from the workload via the exponential mechanism (McSherry & Talwar, 2007). It then measures the query using the Laplace mechanism as y_i = q_i^T p + z_i and updates the approximate data distribution by incorporating the measured information using the multiplicative weights update rule. The most basic version of MWEM represents the approximate data distribution in vector form, and updates it according to the following formula after each iteration:

    p̂ ← p̂ ⊙ exp(−q_i (q_i^T p̂ − y_i)/2m) / Z,    (2)

where ⊙ is elementwise multiplication and Z is a normalization constant.

It is infeasible to represent p explicitly for high-dimensional data, so this version of MWEM is only applicable to relatively low-dimensional data. Hardt et al. describe an enhanced version of MWEM, which we call factored MWEM, that is able to avoid materializing this vector explicitly in the special case when the measured queries decompose over disjoint subsets of attributes. In that case, p is represented implicitly as a product of independent distributions over smaller domains, i.e., p(x) = ∏_{C∈𝒞} p_C(x_C), and the update is done on one group at a time. However, this enhancement breaks down for measurements on overlapping subsets of attributes in high-dimensional data, so MWEM is still generally infeasible to run except on simple workloads.

We can replace the multiplicative weights update with a call to Algorithm 2 using the standard L2 loss function (on all measurements up to that point in the algorithm). By doing so, we learn a compact graphical model representation of p̂, which avoids materializing the full p vector even when the measured queries overlap in complicated ways. This allows MWEM to scale better and run in settings where it was previously infeasible. Our efficient version of MWEM gives solutions that are identical to a commonly used MWEM variant: the MWEM update (Equation 2) is closely related to the entropic mirror descent update (Beck & Teboulle, 2003), and, if iterated until convergence (as is done in practice), solves the same L2 minimization problem that we consider. More details are given in Section E.2 of the supplement.
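For reference, the explicit update (Equation 2) that our call to Algorithm 2 replaces can be written in a few lines (our sketch); note that it operates on the full domain vector, which is exactly what makes it infeasible at high dimension:

    import numpy as np

    def mw_update(p_hat, q, y, m):
        # One multiplicative weights step (Equation 2) for a single measured
        # query q with noisy answer y, on a dataset of m records.
        p_new = p_hat * np.exp(-q * (q @ p_hat - y) / (2 * m))
        return p_new / p_new.sum()      # renormalize (the constant Z)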
PrivBayes. PrivBayes (Zhang et al., 2017) is a differentially private mechanism that generates synthetic data. It first spends half the privacy budget to learn a Bayesian network structure that captures the dependencies in the data, and then uses the remaining privacy budget to measure the statistics — which are marginals — necessary to learn the model parameters. PrivBayes uses a heuristic of truncating negative entries of noisy measurements and normalizing to get conditional probability tables. It then samples a synthetic dataset of m records from the Bayesian network, from which consistent answers to workload queries can be derived. While this is simple and efficient, the heuristic does not properly account for measurement noise, and sampling may introduce unnecessary error.

We can replace the PrivBayes estimation and sampling step with a call to Algorithm 2, using an appropriate loss function (e.g., L1 or L2), to estimate a graphical model. Then we can answer new queries by performing graphical model inference (Section 3.3), rather than using synthetic data.

HDMM. The high-dimensional matrix mechanism (McKenna et al., 2018) is designed to answer a workload of linear queries on multi-dimensional data. It selects the measurements that minimize expected error on the input workload, which are then answered using the Laplace mechanism, with inconsistencies resolved by solving an ordinary least squares problem of the form p̂ = argmin_p ‖Qp − y‖₂². Solving this least squares problem is the main bottleneck of HDMM, as it requires materializing the data vector even when Q contains queries over the marginals of p.

We can replace the HDMM estimation procedure with Algorithm 2, using the same L2 loss function. If the workload contains queries over low-dimensional marginals of p, then Q will contain measurements over the low-dimensional marginals too. Thus, we replace the full "probability" vector p̂ with a graphical model p̂_θ. Note also that p̂ may contain negative values and need not sum to 1, since HDMM solves an ordinary (unconstrained) least squares problem.

DualQuery. DualQuery (Gaboardi et al., 2014) is an iterative algorithm inspired by the same two-player game underlying MWEM. It generates synthetic data to approximate the true data on a workload of linear queries. DualQuery maintains a distribution over the workload queries that depends on the true data, so that poorly approximated queries have higher probability mass. In each iteration, samples are drawn from the query distribution; these samples are proven to be differentially private. The sampled queries are then used to find a single record from the data domain (without accessing the protected data), which is added to the synthetic database. The measurements — i.e., the random outcomes from the privacy mechanism — are the queries sampled in each iteration. Even though these are very different from the linear measurements we have primarily focused on, we can still express the log-likelihood as a function of p and select p to maximize the log-likelihood using Algorithm 1 or 2. The log-likelihood depends on p only through the answers to the workload queries; if the workload can be expressed in terms of µ instead, the log-likelihood can as well. Thus, after running DualQuery, we can call Algorithm 1 with this custom loss function to estimate the data distribution, which we can use in place of the synthetic data produced by DualQuery. The full details are given in the supplementary material.

5. Experimental Evaluation

In this section, we measure the accuracy and scalability improvements enabled by probabilistic graphical-model (PGM) based estimation when it is incorporated into existing privacy mechanisms.

5.1. Adding PGM estimation to existing algorithms

We run four algorithms — MWEM, PrivBayes, HDMM, and DualQuery — with and without our graphical model technology, using a privacy budget of ε = 1.0 (and δ = 0.001 for DualQuery). We run Algorithm 1 with line search for DualQuery and Algorithm 2 for the other mechanisms, each for 10000 iterations. We repeat each experiment five times and report the average workload error. Experiments are done on 2 cores of a single compute cluster node with 16 GB of RAM and 2.4 GHz processors.

We use a collection of four datasets in our experiments, summarized in Table 1.
Each dataset consists of a collection of categorical and numerical attributes (with the latter discretized into 100 bins). The domain of each dataset is very large, which makes efficient estimation challenging.

[Figure 1 appears here: four bar-chart panels — (a) PrivBayes, (b) DualQuery, (c) MWEM, (d) HDMM — showing workload error on the titanic, adult, loans, and stroke datasets, for each mechanism with and without PGM estimation (MWEM and HDMM compared against MWEM+PGM and HDMM+LLS/HDMM+PGM).]

Figure 1: Workload error of four mechanisms on four datasets, with and without our PGM estimation algorithm, for ε = 1.0.

Table 1: Datasets used in the experiments, along with the number of queries in the workload used with each dataset.

    Dataset   Records   Attributes   Domain   Queries
    Titanic   1304      9            3e8      4851
    Adult     48842     15           1e19     62876
    Loans     42535     48           5e80     362201
    Stroke    19434     110          4e104    17716

For each dataset, we construct a workload of counting queries which is an extension of the set of three-way marginals. First, we randomly choose 15 subsets of attributes of size 3, 𝒞. For each subset C ∈ 𝒞, if C contains only categorical attributes, we define the sub-workload W_C to be a 3-way marginal. However, when C contains any discretized numerical attributes, we replace the set of unit queries used in a marginal with the set of prefix range queries. For example, if C = ⟨sex, education, income⟩, then the resulting sub-workload W_C would consist of all queries of the form: sex = x, education = y, income ∈ [0, z], where x, y, and z range over the domains of the respective attributes. The final workload is the union of the 15 three-way sub-workloads defined above.

We measure the error on the workload queries as:

    Error = (1/|𝒞|) Σ_{C∈𝒞} ‖W_C µ_C − W_C µ̂_C‖_1 / (2 ‖W_C µ_C‖_1),

where the summand is related to the total variation distance (and is equal to it in the special case when W_C = I).
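This metric is straightforward to compute from the true and estimated marginals; a short sketch (ours, with hypothetical dict-based inputs):

    import numpy as np

    def workload_error(workloads, mu_true, mu_est):
        # workloads: dict mapping C -> W_C (query matrix for the sub-workload);
        # mu_true / mu_est: dicts mapping C -> true / estimated marginal vector.
        errs = [np.abs(W @ (mu_true[C] - mu_est[C])).sum()
                / (2.0 * np.abs(W @ mu_true[C]).sum())
                for C, W in workloads.items()]
        return float(np.mean(errs))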
Improved accuracy. PrivBayes and DualQuery are highly scalable algorithms supporting the large domains considered here. Figures 1a and 1b show that incorporating PGM estimation significantly improves accuracy. For PrivBayes, workload error is reduced by a factor of 6× and 7× on the Loans and Stroke datasets, respectively, and a modest 30% for Adult. For DualQuery, we also observe very significant error reductions of 1.2×, 1.8×, 3.5×, and 4.4×.

Replacing infeasible estimation methods. The MWEM and HDMM algorithms fail to run on the datasets and workloads we consider because both require representations too large to maintain in memory. However, incorporating PGM estimation makes these algorithms feasible.

As Figure 1c shows, for the first three datasets, MWEM crashed before completing because it ran out of memory or timed out. For example, on one run of the Adult dataset, the first three chosen queries were on the ⟨race, native-country, income⟩, ⟨workclass, race, capital-gain⟩, and ⟨marital-status, relationship, capital-gain⟩ marginals. Since these all overlap with respect to race and capital-gain, factored MWEM offers no benefit, and the entire vector p_C must be materialized over these attributes, which requires over 100 MB. After 5 iterations, the representation requires more than 2 GB, at which point it timed out. Interestingly, MWEM was able to run on the Stroke dataset, which has the largest domain and greatest number of attributes. This is mainly because the workload did not contain as many queries involving common attributes. In general, MWEM's representation remains feasible as long as the workload (and therefore its measurements) consists solely of queries defined over low-dimensional marginals that do not have common attributes. Unfortunately, this imposes a serious restriction on the workloads MWEM can support. Note that the error on Stroke is identical for MWEM and MWEM+PGM — this is because both are based on the same underlying estimation problem.

Although the HDMM algorithm fails to run, for the purpose of comparison we run a modified version of the algorithm (denoted HDMM+LLS) which uses local least squares independently over each measurement set instead of global least squares over the full data vector. While scalable, Figure 1d shows that this estimation is substantially worse than PGM estimation, especially on the Titanic and Loans datasets. Incorporating PGM estimation offers error reductions of 6.6×, 3.2×, 27×, and 6.3×. These improvements primarily stem from non-negativity and global consistency.

Varying epsilon. While ε is set to 1 in Figure 1, in Figure 2a we look at the impact of varying ε for a fixed dataset and measurement set. We use the Adult dataset and the measurements selected by HDMM (which do not depend on ε). The magnitude of the improvement offered by our PGM estimation algorithm increases as ε decreases. At ε = 0.3 and below, the mechanism has virtually no utility without PGMs.

[Figure 2 appears here: (a) workload error of HDMM+LLS vs. HDMM+PGM on Adult, as ε varies over a log scale from 10⁻² to 10¹; (b) time per iteration of the MW, LSMR, and PGM estimation algorithms, as the number of attributes varies from 10¹ to 10³.]

Figure 2: (a) Error of HDMM variants on Adult as a function of ε. (b) Scalability of estimation algorithms.

At the highest ε of 10.0, HDMM+LLS actually offers slightly lower error than HDMM+PGM on the workload, although both have very low error in an absolute sense. The error of HDMM+PGM on the measurements is still better by more than a factor of three at this privacy level. This behavior has been observed before in the low-dimensional setting, where the least squares estimator generalizes better than the non-negative least squares estimator for workloads with range queries (Li et al., 2015).

5.2. The scalability of PGM estimation

We now evaluate the scalability of our approach compared with two other general-purpose estimation techniques: multiplicative weights (MW; Hardt et al., 2012) and iterative ordinary least squares (LSMR; Fong & Saunders, 2011; Zhang et al., 2018). We omit from the comparison PrivBayes estimation and DualQuery estimation because they are special-purpose estimation methods that cannot handle arbitrary linear measurements. We use synthetic data so that we can systematically vary the domain size and the number of attributes. We measure the marginals for each triple of adjacent attributes — i.e., Q_C = I for all C = (i, i+1, i+2), where 1 ≤ i ≤ d − 2. In Figure 2b, we vary the number of attributes from 3 to 1000 (fixing the domain of each attribute, |X_i|, at 10), and plot the time per iteration of each of these estimation algorithms. Both MW and LSMR fail to scale beyond datasets with 10 attributes, as they both require materializing p in vector form, while PGM easily scales to datasets with 1000 attributes.

The domain size is the primary factor that determines the scalability of the baseline methods. However, the scalability of PGM primarily depends on the complexity of the measurements taken. In the experiment above, the measurements were chosen to highlight a case where PGM estimation scales very well. In general, when the graphical model implied by the measurements has high tree-width, our methods will have trouble scaling, as MARGINAL-ORACLE is computationally expensive. In these situations, MARGINAL-ORACLE may be replaced with an approximate marginal inference algorithm, like loopy belief propagation (Wainwright & Jordan, 2008).

6. Related Work

The release of linear query answers has been extensively studied by the privacy community (Zhang et al., 2017; Li et al., 2015; Zhang et al., 2014; Li et al., 2014; Gaboardi et al., 2014; Yaroslavtsev et al., 2013; Qardaji et al., 2013b; Nikolov et al., 2013; Thaler et al., 2012; Hardt et al., 2012; Cormode et al., 2012; Acs et al., 2012; Gupta et al., 2011; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Hardt & Talwar, 2010; Barak et al., 2007; McKenna et al., 2018; Eugenio & Liu, 2018). Early work using inference includes (Barak et al., 2007; Hay et al., 2010; Williams & McSherry, 2010), motivated by consistency as well as potential accuracy improvements. Inference has since been widely used in techniques for answering linear queries (Lee et al., 2015). These mechanisms often contain custom specialized inference algorithms that exploit properties of the measurements taken, and they can be replaced by our algorithms.

(Williams & McSherry, 2010) introduce the problem of finding posterior distributions over model parameters from the output of differentially private algorithms. Their problem formulation requires a known model parameterization and a prior distribution over the parameter space. Their approach requires approximating a high-dimensional integral, which they do either by Markov chain Monte Carlo, or by upper and lower bounds via the "factored exponential mechanism". In the discrete data case, these bounds require summing over the data domain, which is just as hard as materializing p and is not feasible for high-dimensional data.

(Bernstein et al., 2017) consider the task of privately learning the parameters of an undirected graphical model. They do so by releasing noisy sufficient statistics using the Laplace mechanism, and then using an expectation maximization algorithm to learn model parameters from the noisy sufficient statistics. Their work shares some technical similarities with ours, but the aims are different. They have the explicit goal of learning a graphical model whose structure is specified in advance and used to determine the measurements. Our goal is to find a compact representation of some data distribution that minimizes a loss function where the measurements are determined externally; the graphical model structure is a by-product of the measurements made and the maximum entropy criterion.

(Chen et al., 2015) consider the task of privately releasing synthetic data. Their mechanism is similar to PrivBayes, but it uses undirected graphical models instead of Bayesian networks. It finds a good model structure using a mutual information criterion, then measures the sufficient statistics of the model (which are marginals) and post-processes them to resolve inconsistencies. This post-processing is based on a technique developed by (Qardaji et al., 2014) that ensures all measured marginals are internally consistent, and it may be improved with our methods.

Acknowledgements

This work was supported by the National Science Foundation under grants 1409143, 1617533, and 1749854; and by DARPA and SPAWAR under contract N66001-15-C-4067. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

Acs, G., Castelluccia, C., and Chen, R. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 1–10, 2012.
Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, pp. 273–282, 2007.
Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Bernstein, G., McKenna, R., Sun, T., Sheldon, D., Hay, M., and Miklau, G. Differentially private learning of undirected graphical models using collective graphical models. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Chen, R., Xiao, Q., Zhang, Y., and Xu, J. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138. ACM, 2015.
Cormode, G., Procopiuc, C., Srivastava, D., Shen, E., and Yu, T. Differentially private spatial decompositions. In 2012 IEEE 28th International Conference on Data Engineering, pp. 20–31. IEEE, 2012.
Ding, B., Winslett, M., Han, J., and Li, Z. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 217–228. ACM, 2011.
Domke, J. Learning graphical model parameters with approximate marginal inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2454–2467, 2013.
Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, 2006.
Eaton, F. and Ghahramani, Z. Choosing a variable to clamp. In Artificial Intelligence and Statistics, pp. 145–152, 2009.

Eugenio, E. C. and Liu, F. CIPHER: Construction of differentially private microdata from low-dimensional histograms via solving linear equations with Tikhonov regularization. arXiv preprint arXiv:1812.05671, 2018.
Fong, D. C.-L. and Saunders, M. LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011.
Gaboardi, M., Arias, E. J. G., Hsu, J., Roth, A., and Wu, Z. S. Dual Query: Practical private query release for high dimensional data. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
Gupta, A., Hardt, M., Roth, A., and Ullman, J. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing (STOC), pp. 803–812. ACM, 2011.
Hardt, M. and Rothblum, G. N. A multiplicative weights mechanism for privacy-preserving data analysis. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 61–70. IEEE, 2010.
Hardt, M. and Talwar, K. On the geometry of differential privacy. In Symposium on Theory of Computing (STOC), pp. 705–714, 2010.
Hardt, M., Ligett, K., and McSherry, F. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pp. 2339–2347, 2012.
Hay, M., Rastogi, V., Miklau, G., and Suciu, D. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032, 2010.
Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lee, J., Wang, Y., and Kifer, D. Maximum likelihood postprocessing for differential privacy under consistency constraints. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 635–644. ACM, 2015.
Li, C., Hay, M., Rastogi, V., Miklau, G., and McGregor, A. Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 123–134. ACM, 2010.
Li, C., Hay, M., Miklau, G., and Wang, Y. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.
Li, C., Miklau, G., Hay, M., McGregor, A., and Rastogi, V. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6):757–781, 2015.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, 2015.
McKenna, R., Miklau, G., Hay, M., and Machanavajjhala, A. Optimizing error of high-dimensional statistical queries under differential privacy. Proceedings of the VLDB Endowment, 11(10):1206–1219, 2018.
McSherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 94–103. IEEE, 2007.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
Nikolov, A., Talwar, K., and Zhang, L. The geometry of differential privacy: the approximate and sparse cases. In Symposium on Theory of Computing (STOC), 2013.
Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
Proserpio, D., Goldberg, S., and McSherry, F. Calibrating data to sensitivity in private data analysis. In Conference on Very Large Data Bases (VLDB), 2014.
Qardaji, W., Yang, W., and Li, N. Differentially private grids for geospatial data. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 757–768. IEEE, 2013a.
Qardaji, W., Yang, W., and Li, N. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 6(14):1954–1965, 2013b.
Qardaji, W., Yang, W., and Li, N. PriView: Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1435–1446. ACM, 2014.

Thaler, J., Ullman, J., and Vadhan, S. Faster algorithms for privately releasing marginals. In Proceedings of the 39th International Colloquium on Automata, Languages, and Programming (ICALP), Part I, pp. 810–821. Springer-Verlag, 2012.
Vilnis, L., Belanger, D., Sheldon, D., and McCallum, A. Bethe projections for non-local inference. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 892–901. AUAI Press, 2015.
Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
Williams, O. and McSherry, F. Probabilistic inference and differential privacy. In Advances in Neural Information Processing Systems, pp. 2451–2459, 2010.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Xiao, X., Wang, G., and Gehrke, J. Differential privacy via wavelet transforms. In International Conference on Data Engineering (ICDE), 2010.
Yaroslavtsev, G., Cormode, G., Procopiuc, C. M., and Srivastava, D. Accurate and efficient private release of datacubes and contingency tables. In ICDE, 2013.
Zhang, D., McKenna, R., Kotsogiannis, I., Hay, M., Machanavajjhala, A., and Miklau, G. EKTELO: A framework for defining differentially-private computations. In Conference on Management of Data (SIGMOD), 2018.
Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., and Xiao, X. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):25, 2017.
Zhang, X., Chen, R., Xu, J., Meng, X., and Xie, Y. Towards accurate histogram publication under differential privacy. In SIAM International Conference on Data Mining (SDM). SIAM, 2014.

A. Estimation

Define µ(θ) to be the marginals of the graphical model with parameters θ, which may be computed with the MARGINAL-ORACLE.

A.1. Proximal Algorithm Derivation

Our goal is to solve the following optimization problem:

    µ̂ = argmin_{µ∈M} L(µ)

where L is some convex function such as ‖Q_𝒞 µ − y‖. Using the mirror descent algorithm (Beck & Teboulle, 2003), we can use the following update equation:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)

where D is a Bregman distance measure defined as

    D(µ, µ^t) = ψ(µ) − ψ(µ^t) − (µ − µ^t)^T ∇ψ(µ^t)

for some strongly convex and continuously differentiable function ψ. Taking ψ = −H to be the negative entropy, we arrive at the following update equation:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)
            = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) (−H(µ) + µ^T ∇H(µ^t))
            = argmin_{µ∈M}  µ^T (∇L(µ^t) + (1/η_t) ∇H(µ^t)) − (1/η_t) H(µ)
            = argmin_{µ∈M}  µ^T (η_t ∇L(µ^t) + ∇H(µ^t)) − H(µ)
            = argmin_{µ∈M}  µ^T (η_t ∇L(µ^t) − θ^t) − H(µ)
            = µ(θ^t − η_t ∇L(µ^t))

The first four steps are simple algebraic manipulation of the mirror descent update equation. The final two steps use the observation that ∇H(µ^t) = −θ^t and that marginal inference can be cast as the following optimization problem (Wainwright & Jordan, 2008; Vilnis et al., 2015):

    µ(θ) = argmin_{µ∈M}  −µ^T θ − H(µ)

Thus, optimization over the marginal polytope is reduced to computing the marginals of a graphical model with parameters θ^t − η_t ∇L(µ^t), which can be accomplished using belief propagation or some other MARGINAL-ORACLE.

A.2. Accelerated Proximal Algorithm Derivation

The derivation of the accelerated proximal algorithm is similar. It is based on Algorithm 3 from (Xiao, 2010). Applied to our setting, step 4 of that algorithm requires solving the following problem:

    ν^t = argmin_{µ∈M}  µ^T ḡ − (4K/(t(t+1))) H(µ)
        = argmin_{µ∈M}  (t(t+1)/4K) µ^T ḡ − H(µ)
        = µ(−(t(t+1)/4K) ḡ)

which we solve by using the MARGINAL-ORACLE.

A.3. Direct Optimization

In preliminary experiments we also evaluated a direct method to solve the optimization problem. For the direct method, we estimate the parameters θ̂ directly by reformulating the optimization problem and instead solving the unconstrained problem θ̂ = argmin_θ L(µ(θ)). To evaluate the optimization objective, we use MARGINAL-ORACLE to compute µ(θ) and then compute the loss. For optimization, it has been observed that it is possible to back-propagate through marginal inference procedures (with or without automatic differentiation software) to compute their gradients (Eaton & Ghahramani, 2009; Domke, 2013). We apply automatic differentiation to the entire forward computation (Maclaurin et al., 2015), which includes MARGINAL-ORACLE, to compute the gradient of L. Since this is now an unconstrained optimization problem and we can compute the gradient of L, many optimization methods apply. In our experiments, we use the L2 loss, which is smooth, and apply the L-BFGS algorithm for optimization (Byrd et al., 1995).

However, despite its simplicity, there is a significant drawback to the direct algorithm. The objective is not, in general, convex with respect to θ. This may seem surprising, since the original problem is convex: L(µ) is convex with respect to µ and M is a convex set. Also, the most well known problem of this form, maximum-likelihood estimation in graphical models, is convex with respect to θ (Wainwright & Jordan, 2008); however, this relies on properties of exponential families that do not apply to other loss functions. One can verify for losses as simple as L2 that the Hessian need not be positive definite. As a result, the direct algorithm is not guaranteed to converge to a global minimum of the original convex optimization problem min_{µ∈M} L(µ). We did not observe convergence problems in our experiments, but the direct method was not better in practice than the proximal algorithms, which is why it is not included in the paper.
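A schematic of the direct method in code (our illustration only): marginal_oracle, Q, y, and dim are placeholder names, and marginal_oracle must be written in autograd-compatible numpy so that gradients flow through marginal inference:

    import autograd.numpy as anp
    from autograd import grad
    from scipy.optimize import minimize

    def direct_loss(theta):
        mu = marginal_oracle(theta)              # forward pass: marginal inference
        return 0.5 * anp.sum((Q @ mu - y) ** 2)  # L2 loss on the marginals

    res = minimize(direct_loss, x0=anp.zeros(dim), jac=grad(direct_loss),
                   method="L-BFGS-B")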

B. Inference

We now discuss how to exploit our compact factored representation of p_θ to answer new linear queries. We give an efficient algorithm for answering factored linear queries.

Definition 4 (Factored Query Matrix). A factored query matrix Q has columns that are indexed by x and rows that are indexed by vectors z ∈ [r_1] × ... × [r_d]. The total number of rows (queries) is r = ∏_{i=1}^d r_i. The entries of Q are given by Q(z, x) = ∏_{i=1}^d Q_i(z_i, x_i), where Q_i ∈ R^{r_i×n_i} is a specified factor for the ith attribute. The matrix Q can be expressed as Q = Q_1 ⊗ ... ⊗ Q_d, where ⊗ is the Kronecker product.

Factored query matrices are expressive enough to encode any conjunctive query (or a cartesian product of such queries), and more. There are a number of concrete examples that demonstrate the usefulness of answering queries of this form, including:

• Computing the marginal µ_C for any C ⊆ [d] (including unmeasured marginals).
• Computing the multivariate CDF of µ_C for any C ⊆ [d].
• Answering range queries.
• Compressing the distribution by transforming the domain.
• Computing the (unnormalized) expected value of one variable conditioned on other variables.

For the first two examples, we could have used standard variable elimination to eliminate all variables except those in C. Existing algorithms are not able to handle the other examples without materializing p̂ (or a marginal that supports the queries). Thus, our algorithm generalizes variable elimination. A more comprehensive set of examples, and details on how to construct these query matrices, are given in Section B.1.

Algorithm 3 Inference for Factored Queries
  Input: Parameters θ, factored query matrix Q
  Output: Query answers Q p_θ
  ψ = {exp(θ_C) | C ∈ 𝒞} ∪ {Q_i | i ∈ [d]}
  Z = MARGINAL-ORACLE(θ)
  return VARIABLE-ELIM(ψ, X)/Z

The procedure for answering these queries is given in Algorithm 3, which can be understood as follows. For a particular z, write f(z, x) = Q(z, x) p_θ(x) = ∏_i Q_i(z_i, x_i) p_θ(x). This can be viewed as an augmented graphical model on the variables z and x, where we have introduced new pairwise factors between each (x_i, z_i) pair, defined by the query matrix. Unlike a regular graphical model, the new factors can contain negative values. The query answers are obtained by multiplying Q and p, which sums over x. The zth answer is given by:

    (Q p_θ)(z) = Σ_{x∈X} Q(z, x) p_θ(x) = (1/Z) Σ_{x∈X} ∏_{i=1}^d Q_i(z_i, x_i) ∏_{C∈𝒞} exp[θ_C(x_C)]

This can be understood as marginalizing over the x variables in the augmented model f(z, x). The VARIABLE-ELIM routine referenced in the algorithm is standard variable elimination to perform this marginalization; it can handle negative values with no modification. We stress that, in practice, factor matrices Q_i may have only one row (r_i = 1, e.g., for marginalization); hence the output size r = ∏_{i=1}^d r_i is not necessarily exponential in d.

B.1. Factored Query Matrices

Table 2 gives some example "building block" factors that can be used to construct factored query matrices. This is by no means an exhaustive list of possible factors, but it provides the reader with evidence that answering these types of queries efficiently is practically useful. The factored query matrix for computing the marginal µ_C uses Q_i = I for i ∈ C and Q_i = 1 for i ∉ C. Similarly, the factored query matrix for computing the multivariate CDF of µ_C would simply use Q_i = P for i ∈ C. A query matrix for compressing a distribution could be characterized by functions f_i : [n_i] → [2], or equivalently binary matrices Q_i = R_{f_i} ∈ R^{2×n_i}. The query matrix for computing the (unnormalized) expected value of variable i conditioned on variable j would use Q_i = E and Q_j = I (and Q_k = 1 for all other k). These are only a few examples; the building blocks can be combined arbitrarily to construct a wide variety of interesting query matrices.
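The Kronecker structure is what lets Algorithm 3 avoid materializing Q or p. For a small explicit distribution the equivalence can be checked directly with tensor contractions (our sketch, not code from the paper):

    import numpy as np

    sizes = (2, 3, 2)
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(int(np.prod(sizes)))).reshape(sizes)

    # Factors: keep attribute 0 (I), marginalize attribute 1 (1),
    # CDF-transform attribute 2 (the matrix P from Table 2).
    Q0 = np.eye(2)
    Q1 = np.ones((1, 3))
    Q2 = np.tril(np.ones((2, 2)))
    answers = np.einsum('za,wb,yc,abc->zwy', Q0, Q1, Q2, P)

    # Same answers from the explicit Kronecker-product query matrix.
    Q = np.kron(np.kron(Q0, Q1), Q2)
    assert np.allclose(answers.ravel(), Q @ P.ravel())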
C. Loss Functions

C.1. L1 and L2 losses

The L1 and L2 loss functions have simple (sub)gradients:

$$\nabla L_1(\mu) = Q_C^T \,\mathrm{sign}(Q_C \mu - y)$$
$$\nabla L_2(\mu) = Q_C^T (Q_C \mu - y)$$

C.2. Linear measurements with unequal noise

When the privacy budget is not distributed evenly among the measurements, we have to appropriately modify the loss functions, which assume that the noisy answers all have equal noise scale. In order to do proper estimation and inference, we have to account for this varying noise level in the loss function. In Section 3.1 we claimed that L(p) = ‖Qp − y‖ makes sense as a loss function when the noise introduced to y is iid.

Qi    Requirements       Size      Definition (∀a ∈ [ni])        Description
I     —                  ni × ni   Qi(a, a) = 1                  keep variable
1     —                  1 × ni    Qi(1, a) = 1                  marginalize variable out
e_j   j ∈ [ni]           1 × ni    Qi(1, j) = 1                  inject evidence
e_S   S ⊆ [ni]           1 × ni    Qi(1, j) = 1 ∀j ∈ S           inject evidence (disjuncts)
P     —                  ni × ni   Qi(b, a) = 1 ∀b ≥ a           transform into CDF
R_f   f : [ni] → [ri]    ri × ni   Qi(f(a), a) = 1               compress domain
E     —                  1 × ni    Qi(1, a) = a                  reduce to expected value
E_k   k ≥ 1              k × ni    Qi(b, a) = a^b ∀b ≤ k         reduce to first k moments

Table 2: Example factors in the factored query matrix
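As an illustration of how the Table 2 building blocks compose (a toy sketch, not the paper's code; all names are ours), the E and I factors combine to answer unnormalized conditional expectation queries:

    import numpy as np

    rng = np.random.default_rng(2)
    n1, n2 = 5, 3
    P = rng.random((n1, n2))
    P /= P.sum()                   # toy joint distribution p(x1, x2)

    E = np.arange(n1)[None, :]     # E factor: Qi(1, a) = a (0-indexed here), size 1 x n1
    I = np.eye(n2)                 # I factor: keep variable x2

    # Q = E (kron) I answers, for each value b of x2, the unnormalized
    # expectation sum_a a * p(x1 = a, x2 = b).
    answers = (E @ P @ I.T).ravel()
    assert np.allclose(answers, np.kron(E, I) @ P.ravel())

Dividing by the answers of 1 ⊗ I (the marginal of the conditioning variable) would normalize these into proper conditional expectations.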

Luckily, even if this assumption is not satisfied, it is easy to correct. Assume that yi = qi^T p + εi where εi ∼ Lap(bi). Then (1/bi) yi = (1/bi) qi^T p + (1/bi) εi and (1/bi) εi ∼ Lap(1). Thus, we can replace the query matrix Q ← DQ and the answer vector y ← Dy, where D is the diagonal matrix defined by Dii = 1/bi. All the new query answers have the same effective noise scale, and so the standard loss functions may be used. This idea applies as well if the noise on each query answer is sampled from a normal distribution (for (ε, δ)-differential privacy).

C.3. Dual Query Loss Function

Algorithm 4 shows DualQuery applied to workloads defined over the marginals of the data. There are five hyper-parameters, of which four must be specified and the remaining one can be determined from the others.

The first step of the algorithm computes the answers to the workload queries. Then, for T time steps, observations are made about the true data via samples from the distribution Q^t. These observations are used to find a record x ∈ X to add to the synthetic database.

Algorithm 4 Dual Query for marginals workloads
  Input: X, the true data
  Input: W_C, workload queries
  Input: (s, T, η, ε, δ), hyper-parameters
  Output: synthetic database of T records
  y = W_C µ_X
  Q^1 = uniform(W)
  for t = 1, ..., T do
    sample q_1^t, ..., q_s^t from Q^t
    x^t = argmax_{x∈X} Σ_{i=1}^s q_i^t µ − q_i^t µ_x
    Q^{t+1} = Q^t exp(−η (y − W_C µ_{x^t}))
    normalize Q^t
  end for
  return (x^1, ..., x^T)

Algorithm 5 shows a procedure for computing the negative log likelihood (our loss function) of observing the DualQuery output, given some marginals. Evaluating the log likelihood is fairly expensive, as it requires essentially simulating the entire DualQuery algorithm. Fortunately, we do not have to run the most computationally expensive step within the procedure, which is finding x^t. We differentiate this loss function using automatic differentiation (Maclaurin et al., 2015) for use within our estimation algorithms.

Algorithm 5 Dual Query Loss Function
  Input: µ, marginals of the data
  Input: W_C, workload queries
  Input: cache, all relevant output from DualQuery:
    q_1^t, ..., q_s^t, the sampled queries at each time step
    x^t, the chosen record at each time step
  Output: L(µ), the negative log likelihood
  y = W_C µ
  Q^1 = uniform(W)
  loss = 0
  for t = 1, ..., T do
    loss −= Σ_{i=1}^s log(Q^t(q_i^t))
    Q^{t+1} = Q^t exp(−η (y − W_C µ_{x^t}))
    normalize Q^t
  end for
  return loss

D. Additional Experiments

D.1. L1 vs. L2 Loss

In Section 3 we mentioned that minimizing L1 loss is equivalent to maximizing likelihood for linear measurements with Laplace noise, but that L2 loss is more commonly used in the literature. In this experiment we compare these two estimators side-by-side. Specifically, we consider the workload from Figure 1 and measurements chosen by HDMM with ε = 1.0. As expected, performing L1 minimization results in lower L1 loss but higher L2 loss, although the difference is quite small, especially for L1 loss; the difference is larger for L2 loss. Minimizing L2 loss results in lower workload error, indicating that it generalizes better. This is somewhat surprising given that L1 minimization is maximizing likelihood.
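For reference, a minimal sketch of the two estimators being compared (toy data and plain projected subgradient steps, not the paper's proximal algorithms; names are ours), using the (sub)gradients from Section C.1:

    import numpy as np

    def estimate(Q, y, n, loss="L2", iters=2000, eta=0.01):
        # Subgradient descent over the probability simplex, minimizing
        # ||Q p - y||_1 or (1/2) ||Q p - y||_2^2.
        p = np.ones(n) / n
        for _ in range(iters):
            r = Q @ p - y
            g = Q.T @ (np.sign(r) if loss == "L1" else r)
            p = np.clip(p - eta * g, 0, None)
            p /= p.sum()           # crude renormalization back to the simplex
        return p

    rng = np.random.default_rng(3)
    Q = rng.integers(0, 2, size=(8, 6)).astype(float)  # random counting queries
    y = Q @ np.full(6, 1 / 6) + rng.laplace(scale=0.05, size=8)
    for name in ("L1", "L2"):
        p = estimate(Q, y, 6, name)
        print(name, np.abs(Q @ p - y).sum(), ((Q @ p - y) ** 2).sum())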

Figure 3: L1 minimization vs. L2 minimization, evaluated on L1 loss, L2 loss, and workload error.

Another interesting observation is that the workload error actually starts going up after about 200 iterations, suggesting that some form of over-fitting is occurring. The minimum workload error achieved was 0.066 while the final workload error was 0.084, a meaningful difference. Of course, in practice we cannot stop iterating when workload error starts increasing, because evaluating it requires looking at the true data.

E. Additional Details

E.1. Unknown Total

Our algorithms require that m, the total number of records in the dataset, is known or can be estimated. Under a slightly different privacy definition, where nbrs(X) is the set of databases where a single record is added or removed (instead of modified), this total is a sensitive quantity which cannot be released exactly (Dwork & Roth, 2014). Thus, the total is not known in this setting, but a good estimate can typically be obtained from the measurements taken, without spending additional privacy budget. First observe that 1^T µ_C = m is the total for an unnormalized database. Now suppose we have measured y_C = Q_C µ_C + z_C. Then, as long as 1 is in the row-space of Q_C, m_C = 1^T Q_C^+ y_C is an unbiased estimate for m with variance

$$Var(m_C) = Var(y_C) \, \|1^T Q_C^+\|_2^2$$

This is a direct consequence of Proposition 9 from (Li et al., 2015). We thus have multiple estimates for m, which we can combine using inverse variance weighting, resulting in the final estimate

$$\hat{m} = \frac{\sum_C m_C / Var(m_C)}{\sum_C 1 / Var(m_C)}$$

which we can use in place of m.
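A minimal sketch of this total estimator (names are ours; it assumes the per-entry noise variance of each measurement is known, and unbiasedness requires 1 to lie in each row-space):

    import numpy as np

    def estimate_total(measurements):
        # measurements: list of (Q_C, y_C, var_yC) triples, where
        # y_C = Q_C mu_C + noise and var_yC is the per-entry noise variance.
        # Combines the per-clique totals m_C = 1^T Q_C^+ y_C by
        # inverse variance weighting.
        est, weights = 0.0, 0.0
        for Q, y, var_y in measurements:
            w = np.ones(Q.shape[1]) @ np.linalg.pinv(Q)  # row vector 1^T Q_C^+
            m_C = w @ y                                  # unbiased estimate of m
            var_C = var_y * (w @ w)                      # Var(y_C) * ||1^T Q_C^+||_2^2
            est += m_C / var_C
            weights += 1.0 / var_C
        return est / weights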

E.2. Multiplicative Weights vs Entropic Mirror Descent

Recall from Section 4 that the multiplicative weights update equation is:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-q_i (q_i^T \hat{p} - y_i)/2m}\big) \,/\, Z$$

and the update is applied (possibly cyclically) for i = 1, ..., T. Now imagine taking all of the measurements and organizing them into a T × n matrix Q. Then we can apply all the updates at once, instead of sequentially, and we end up with the following update equation:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-Q^T (Q\hat{p} - y)/2m}\big) \,/\, Z$$

Observing that ∇L2(p̂) = Q^T(Qp̂ − y), this simplifies to:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-\nabla L_2(\hat{p})/2m}\big) \,/\, Z$$

which is precisely the update equation for entropic mirror descent for minimizing L2(p) over the probability simplex (Beck & Teboulle, 2003).
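A quick numeric sanity check of this equivalence (toy data; names are ours): the batched multiplicative-weights step below is an entropic mirror descent step with step size 1/2m, and repeating it drives down the L2 loss:

    import numpy as np

    rng = np.random.default_rng(4)
    n, T, m = 6, 8, 100.0
    Q = rng.integers(0, 2, size=(T, n)).astype(float)
    y = Q @ np.full(n, 1 / n) + rng.laplace(scale=0.02, size=T)

    p = np.full(n, 1 / n)                  # start at the uniform distribution
    for _ in range(100):
        grad = Q.T @ (Q @ p - y)           # gradient of L2(p) = 0.5 ||Qp - y||^2
        p = p * np.exp(-grad / (2 * m))    # batched multiplicative-weights step ...
        p /= p.sum()                       # ... normalized: entropic mirror descent
    print(0.5 * np.sum((Q @ p - y) ** 2))  # loss after the updates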