Graphical-model based estimation and inference for differential privacy

Ryan McKenna¹   Daniel Sheldon¹,²   Gerome Miklau¹

Abstract

Many privacy mechanisms reveal high-level information about a distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms.

1. Introduction

Differential privacy (Dwork et al., 2006) has become the dominant standard for controlling the privacy loss incurred by individuals as a result of public data releases. For complex data analysis tasks, error-optimal algorithms are not known, and a poorly designed algorithm may result in much greater error than strictly necessary for privacy. Thus, careful algorithm design, focused on reducing error, is an area of intense research in the privacy community.

For the private release of statistical queries, nearly all recent algorithms (Zhang et al., 2017; Li et al., 2015; Lee et al., 2015; Proserpio et al., 2014; Li et al., 2014; Qardaji et al., 2013b; Nikolov et al., 2013; Hardt et al., 2012; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Hardt & Rothblum, 2010; Hardt & Talwar, 2010; Barak et al., 2007; Gupta et al., 2011; Thaler et al., 2012; Acs et al., 2012; Zhang et al., 2014; Yaroslavtsev et al., 2013; Cormode et al., 2012; Qardaji et al., 2013a; McKenna et al., 2018) include steps within the algorithm where answers to queries are inferred from noisy answers to a set of measurement queries already answered by the algorithm.

Inference is a critical component of privacy mechanisms because: (i) it can reduce error when answering a query by combining evidence from multiple related measurements, (ii) it provides consistent query answers even when measurements are noisy and inconsistent, and (iii) it provides the above benefits without consuming the privacy-loss budget, since it is performed only on privately-computed measurements without re-using the protected data.

Consider a U.S. Census dataset, exemplified by the Adult table, which consists of 15 attributes including age, sex, race, income, and education. Given noisy answers to a set of measurement queries, our goal is to infer answers to one or more new queries. The measurement queries might be expressed over each individual attribute (age), (sex), (race), etc., as well as selected combinations of attributes (age, income), (age, race, education), etc. When inference is done properly, the estimate for a new query (e.g., counting the individuals with income ≥ 50K, 10 years of education, and over 40 years old) will use many, or even all, available measurements.

Current inference methods are limited in both scalability and generality. Most methods first estimate some model of the data and then answer new queries using the model. Perhaps the simplest model is a full contingency table, which stores a value for every element of the domain. When the measurements are linear queries (a common case, and our primary focus), least-squares (Hay et al., 2010; Nikolov et al., 2013; Li et al., 2014; Qardaji et al., 2013b; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010) and multiplicative-weight updates (Hardt & Rothblum, 2010; Hardt et al., 2012) have both been used to estimate this model from the noisy measurements. New queries can then be answered by direct calculation.
However, the size of the contingency table is the product of the domain sizes of each attribute, so these methods break down for high-dimensional cases (or even a modest number of dimensions with large domains). In the example above, the full contingency table would consist of 10^19 entries. To avoid this, factored models have been considered (Hardt et al., 2012; Zhang et al., 2017). However, these factored approaches have their own limitations, including restricting the query class (Hardt et al., 2012) or failing to properly account for (possibly varying) noise in measurements (Zhang et al., 2017).

¹University of Massachusetts, Amherst  ²Mount Holyoke College. Correspondence to: Ryan McKenna.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

In this work we show that graphical models provide a foundation for significantly improved inference. We propose to use a graphical model instead of a full contingency table as a model of the data distribution. Doing so avoids an intractable full materialization of the contingency table and retains the ability to answer a broad class of queries. We show that the graphical model representation corresponds to using a maximum entropy criterion to select a single data distribution among all distributions that minimize estimation loss. The structure of the graphical model is determined by the measurements, such that no information is lost relative to a full contingency table representation; but when each measurement is expressible over a low-dimensional marginal of the contingency table, as is common, the graphical model representation is much more compact.

This work is focused on developing a principled and general approach to inference in privacy mechanisms. Our method is agnostic to the loss function used to estimate the data model and to the noise distribution used to achieve privacy. We focus primarily on linear measurements, but also describe an extension to non-linear measurements.

We assume throughout that the measurements are given, but we show our inference technique is versatile since it can be incorporated into many existing private query-answering algorithms that determine measurements in different ways. For those existing algorithms that scale to high-dimensional data, our graphical-model based estimation method can substantially improve accuracy (with no cost to privacy). Even more importantly, our estimation method can be added to some algorithms which fail to scale to high-dimensional data, allowing them to run efficiently in new settings. We therefore believe our inference method can serve as a basic building block in the design of new privacy mechanisms.

2. Background and Problem Statement

Data. Our input data represents a population of individuals, each contributing a single record x = (x_1, ..., x_d), where x_i is the ith attribute, belonging to a discrete finite domain X_i of n_i possible values. The full domain is X = X_1 × ... × X_d and its size n = ∏_{i=1}^d n_i is exponential in the number of attributes. A dataset X consists of m such records: X = (x^(1), ..., x^(m)). We also consider a normalized contingency table representation p, which counts the fraction of the population with record equal to x, for each x in the domain. That is, p(x) = (1/m) Σ_{i=1}^m I{x^(i) = x} for all x ∈ X, where I{·} is an indicator function. Thus p is a probability vector in R^n with index set X (ordered lexicographically). We write p = p_X when it is important to denote the dependence on X.

Queries, Marginals, and Measurements. We focus on the most common case of linear queries expressed over subsets of attributes. We will describe an extension to a generalized class of queries, including non-linear ones, in Section 3.1. A linear query set f_Q(X) is defined by a query matrix Q ∈ R^{r×n} and has answer f_Q(X) = Q p_X. The ith row of Q, denoted q_i^T, represents a single scalar-valued query. In most cases we will refer unambiguously to the matrix Q, as opposed to f_Q, as the query set. We often consider query sets that can be expressed on a marginal (over a subset of attributes) of the vector p. Let A ⊆ [d] identify a subset of attributes and, for x ∈ X, let x_A = (x_i)_{i∈A} be the sub-vector of x restricted to A. Then the marginal probability vector (or simply "marginal on A"), µ_A, is defined by:

    µ_A(x_A) = (1/m) Σ_{i=1}^m I{x_A = x_A^(i)},   ∀x_A ∈ X_A := ∏_{i∈A} X_i.

The size of the marginal is n_A := |X_A| = ∏_{i∈A} n_i, which is exponential in |A| but may be considerably smaller than n. Note that µ_A(x_A) is a linear function of p, so there exists a matrix M_A ∈ R^{n_A×n} such that µ_A = M_A p. When a query set depends only on the marginal vector µ_A, we call it a marginal query set, written as Q_A ∈ R^{r_A×n_A}, with answer f_{Q_A}(X) = Q_A µ_A. The marginal query set Q_A is equivalent to the query set Q = Q_A M_A on the full contingency table, since Q_A µ_A = (Q_A M_A) p. One important marginal query set asks for the marginal vector itself, in which case Q_A = I_{n_A×n_A} (the identity matrix).

In our problem formulation, we consider measurements consisting of a collection of marginal query sets. Specifically, let 𝒞 be a collection of measurement sets, where each C ∈ 𝒞 is a subset of [d].¹ For each measurement set C ∈ 𝒞, we are given a marginal query set Q_C.

¹Later, these will comprise the cliques of a graphical model, as the notation suggests.
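To make this notation concrete, the following numpy sketch (our own illustration, not code from the paper) builds the contingency table vector p for a toy two-attribute domain and checks that a marginal is a linear function of p, i.e., µ_A = M_A p:

    import numpy as np

    # Toy domain: d = 2 attributes with sizes (2, 3), so n = 6; m = 4 records.
    sizes = (2, 3)
    X = np.array([[0, 1], [1, 2], [0, 1], [1, 0]])
    m = len(X)

    # Contingency table p: the fraction of records equal to each x in the domain.
    p = np.zeros(sizes)
    for record in X:
        p[tuple(record)] += 1.0 / m

    # Marginal on A = {0}: sum out attribute 1; a linear function of p.ravel().
    mu_A = p.sum(axis=1)
    M_A = np.kron(np.eye(2), np.ones(3))    # shape (2, 6); one-hot row sums
    assert np.allclose(mu_A, M_A @ p.ravel())

The Kronecker form of M_A above foreshadows the factored query matrices developed in the supplement.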
The following notation is helpful to refer to combined measurements and their marginals. Let µ = (µ_C)_{C∈𝒞} be the combined vector of marginals, and let Q_𝒞 be the block-diagonal matrix with diagonal blocks {Q_C}_{C∈𝒞}, so that the entire set of query answers can be expressed as Q_𝒞 µ. Finally, let M_𝒞 be the matrix that vertically concatenates the matrices {M_C}_{C∈𝒞}, so that µ = M_𝒞 p and Q_𝒞 µ = Q_𝒞 M_𝒞 p. This shows that our measurements are equivalent to the combined query set Q = Q_𝒞 M_𝒞 applied to the full table p.
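As a small illustration of this block structure (our own sketch, with hypothetical cliques over a toy three-attribute domain), scipy's sparse utilities assemble Q_𝒞 and M_𝒞 directly:

    import numpy as np
    import itertools
    from scipy.sparse import block_diag, vstack, identity, csr_matrix

    sizes = (2, 3, 2)
    n = int(np.prod(sizes))

    def marg_matrix(A):
        # 0/1 matrix M_A: M_A[i, j] = 1 iff full-domain cell j restricts to cell i of X_A.
        dims = [sizes[a] for a in A]
        M = np.zeros((int(np.prod(dims)), n))
        for j, x in enumerate(itertools.product(*map(range, sizes))):
            M[np.ravel_multi_index([x[a] for a in A], dims), j] = 1.0
        return csr_matrix(M)

    cliques = [(0, 1), (1, 2)]
    M_blocks = [marg_matrix(C) for C in cliques]
    Q_blocks = [identity(M.shape[0]) for M in M_blocks]  # each Q_C asks for the marginal itself

    M_cal = vstack(M_blocks)        # stacked marginalization matrices
    Q_cal = block_diag(Q_blocks)    # block-diagonal query matrix
    p = np.random.default_rng(0).dirichlet(np.ones(n))
    mu = M_cal @ p                  # combined marginal vector
    assert np.allclose(Q_cal @ mu, (Q_cal @ M_cal) @ p)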

Differential privacy. Differential privacy protects individuals by bounding the impact any one individual can have on the output of an algorithm.

Definition 1 (Differential Privacy; Dwork et al., 2006). A randomized algorithm A satisfies (ε, δ)-differential privacy if for any input X, any X′ ∈ nbrs(X), and any subset of outputs S ⊆ Range(A),

    Pr[A(X) ∈ S] ≤ exp(ε) Pr[A(X′) ∈ S] + δ.

Above, nbrs(X) denotes the set of datasets formed by replacing any x^(i) ∈ X with an arbitrary new record x′^(i) ∈ X. When δ = 0 we say A satisfies ε-differential privacy. Differentially private answers to f_Q are typically obtained with a noise-addition mechanism, such as the Laplace or Gaussian mechanism. For ε-differential privacy, the noise added to the output of f_Q is determined by the L1 sensitivity of f_Q, which, specialized to linear queries, is defined as ∆_Q = max_{X, X′∈nbrs(X)} ‖Q p_X − Q p_{X′}‖_1. It is straightforward to show that ∆_Q = (2/m)‖Q‖_1, where ‖Q‖_1 is the maximum L1 norm of the columns of Q.

Definition 2 (Laplace Mechanism; Dwork et al., 2006). Given a query set Q ∈ R^{r×n} of r linear queries, the Laplace mechanism is defined as L(X) = Q p_X + z, where z = (z_1, ..., z_r) and each z_i is an i.i.d. sample from Laplace(∆_Q/ε).

The Laplace mechanism satisfies ε-differential privacy. The sequential composition property implies that if we answer two query sets Q_1 and Q_2, under ε_1 and ε_2 differential privacy, respectively, then the combined answers are (ε_1 + ε_2)-differentially private. The post-processing property of differential privacy (Dwork & Roth, 2014) asserts that post-processing the output of a differentially private algorithm (without using the original protected data) does not affect the privacy guarantee.

Problem Statement. We assume as given a collection 𝒞 of measurement sets, and for each C ∈ 𝒞: a marginal query set Q_C, a privacy parameter ε_C, and an ε_C-differentially private measurement y_C = Q_C µ_C + Lap(∆_{Q_C}/ε_C). The combined measurements are y = (y_C)_{C∈𝒞}, which satisfy ε-differential privacy for ε = Σ_{C∈𝒞} ε_C by sequential composition. Note that there is no loss of generality in these assumptions; in the extreme case, there may be just a single measurement set C = [d] consisting of all attributes. Formulating the problem this way will allow us to realize computational savings when measurements are not full-dimensional, which is common in practice. We also emphasize that the marginal query set Q_C is often a complex set of linear queries expressed over measurement set C (not simply a marginal). Many past works (Li et al., 2015; 2014; Qardaji et al., 2013b; Nikolov et al., 2013; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Barak et al., 2007) have shown that it is beneficial, in the presence of noise-addition for privacy, to measure carefully chosen query sets which balance sensitivity against efficient reconstruction of the workload queries.

Our goal is: given y, derive answers to (possibly different) workload queries W. There are multiple possible motivations: W may include new queries that were not part of the original measurements; or it is possible that W is a subset of the measurement queries, but we can obtain a more accurate answer by combining all of the available information to estimate Wp, as opposed to just using the noisy answer directly. We describe an extension to non-linear queries in Section 3.1; this will be applied to the DualQuery algorithm (Gaboardi et al., 2014) in Section 4.
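For concreteness, a single noisy measurement y_C = Q_C µ_C + Lap(∆/ε_C) can be simulated as in the sketch below (ours, not from the paper). It uses the identity query set Q_C = I, for which each record contributes to exactly one cell of the marginal, so ∆ = 2/m:

    import numpy as np

    def measure_laplace(mu_C, m, epsilon, rng):
        # Identity query set on a normalized marginal: L1 sensitivity is 2/m.
        scale = (2.0 / m) / epsilon
        return mu_C + rng.laplace(loc=0.0, scale=scale, size=mu_C.shape)

    rng = np.random.default_rng(0)
    y_C = measure_laplace(np.array([0.25, 0.5, 0.25]), m=1000, epsilon=0.1, rng=rng)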
3. Algorithms for Estimation and Inference

What principle can we follow to estimate answers to the workload query set? Prior work takes the approach of first using all available information to estimate a full contingency table p̂ ≈ p, and then using p̂ to answer later queries (Hay et al., 2010; Li et al., 2010; Ding et al., 2011; Qardaji et al., 2013b; Lee et al., 2015). We will call finding p̂ estimation, and using p̂ to answer new queries inference.

3.1. Optimization Formulation

The standard framework for estimation and inference is:

    p̂ ∈ argmin_{p∈S} L(p),    (estimation)
    f_W(X) ≈ W p̂.    (inference)

Here S = {p : p ≥ 0, 1^T p = 1} is the probability simplex and L(p) is a loss function that measures how well p explains the observed measurements. In past works, L(p) = ‖Qp − y‖ has been used as a loss function, where Q is the measured query set and ‖·‖ is either the L1 or the L2 norm. Minimizing the L1 norm is equivalent to maximum likelihood estimation when the noise comes from the Laplace mechanism (Lee et al., 2015). Minimizing the L2 norm is far more common in the literature, however, and it is also the maximum likelihood estimator for Gaussian noise (Hay et al., 2010; Nikolov et al., 2013; Li et al., 2014; Qardaji et al., 2013b; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; McKenna et al., 2018). Our method supports both of these loss functions; we only require that L is convex. Both loss functions are easily adapted to the situation where queries in Q may be measured with differing degrees of noise. The constraint p ∈ S may also be relaxed, which simplifies L2 minimization; additionally, under different assumptions and an alternate version of privacy, the number of individuals may not be known. All existing algorithms for solving these variations of the estimation problem suffer from the same drawback: they do not scale to high dimensions, since the size of p is exponential in d and it must be constructed explicitly as an intermediate step, even if the inputs and outputs are small (e.g., all measurement queries are over low-dimensional marginals).

Optimization in Terms of Marginals. For marginal query sets, a loss function will typically depend on p only through its marginals µ. For example, when Q = Q_𝒞 M_𝒞, we have L(p) = ‖Qp − y‖ = ‖Q_𝒞 µ − y‖ = L(µ), where we now write the loss function as L(µ). More generally, we will consider any loss function that depends only on the marginals.
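As a concrete instance (our sketch), the squared L2 loss over the combined marginal vector and its gradient, which the estimation algorithms below consume; we use the squared norm so the gradient is linear, and the paper's algorithms only require that L is convex:

    import numpy as np

    def make_l2_loss(Q, y):
        # Q is the block-diagonal measurement matrix over the combined
        # marginal vector mu; y holds the noisy answers.
        def loss(mu):
            r = Q @ mu - y
            return 0.5 * float(r @ r)        # L(mu) = 0.5 ||Q mu - y||^2
        def grad(mu):
            return Q.T @ (Q @ mu - y)        # gradient of L at mu
        return loss, grad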

A very general case is when L(µ) = −log p(y | µ) is the negative log-likelihood of any differentially private algorithm that produces output y depending only on the marginal vector µ (see our treatment of DualQuery in Section 4).

The marginal vector µ may be much lower dimensional than p. How can we take advantage of this fact? An "obvious" idea would be to modify the optimization to estimate only the marginals as µ̂ ∈ argmin_{µ∈M} L(µ), where M = {µ : ∃p ∈ S s.t. M_𝒞 p = µ} is the marginal polytope, the set of all valid marginals. There are two issues here. First, the marginal polytope has a complex combinatorial structure and, although it is a convex set, it is generally not possible to enumerate its constraints for use with standard convex optimization algorithms. Note that this optimization problem is in fact a generic convex optimization problem over the marginal polytope, and as such it generalizes standard graphical model inference problems (Wainwright & Jordan, 2008). Second, after finding µ̂ it is not clear how to answer new queries, unless they depend only on some measured marginal µ_C.

Graphical Model Representation. After finding an optimal µ̂ we want to answer new queries that do not necessarily depend directly on the measured marginals. To do this we need to identify a distribution p̂ that has marginals µ̂, and we must have a tractable representation of this distribution. Also, since there may be many p̂ that give rise to the same marginals, we want a principled criterion to choose a single estimate, such as the principle of maximum entropy. We accomplish these goals using undirected graphical models.

A graphical model represents a high-dimensional distribution as a product of factors φ_C defined on attribute subsets, i.e., p(x) = ∏_{C∈𝒞} φ_C(x_C). A factored representation is often much more compact than an explicit one and facilitates efficient computation of marginals (Koller & Friedman, 2009).
We use a log-linear form with parameters θ_C(x_C) = log φ_C(x_C).

Definition 3 (Graphical model). Let p_θ(x) = (1/Z) exp(Σ_{C∈𝒞} θ_C(x_C)) be a normalized distribution, where θ_C ∈ R^{n_C}. This distribution is a graphical model that factors over the measurement sets 𝒞, which are the cliques of the graphical model. The vector θ = (θ_C)_{C∈𝒞} is the parameter vector.

Theorem 1 (Maximum entropy (Wainwright & Jordan, 2008)). Given any µ̂ in the interior of M, there is a parameter vector θ̂ such that the graphical model p_θ̂(x) has maximum entropy among all p̂(x) with marginals µ̂.²

²If the marginals are on the boundary of M, e.g., if they contain zeros, there is a sequence of parameters {θ^(n)} such that p_θ^(n)(x) converges to the maximum-entropy distribution as n → ∞. See (Wainwright & Jordan, 2008).

Theorem 1 says that, after finding µ̂, we can obtain a factored representation of the maximum-entropy distribution with these marginals by finding the graphical model parameters θ̂. This is the problem of learning in a graphical model, which is well understood (Wainwright & Jordan, 2008).

3.2. Estimation: optimizing over the marginal polytope

We need algorithms to find µ̂ and θ̂. We considered a variety of algorithms and present two of them here. Both are proximal algorithms for solving convex problems with "simple" constraints (Parikh et al., 2014). Central to our algorithms is a subroutine MARGINAL-ORACLE, which is some black-box algorithm for computing the clique marginals µ of a graphical model from the parameters θ. This is the problem of marginal inference in a graphical model. MARGINAL-ORACLE may be any marginal inference routine — we use belief propagation on a junction tree. In the remainder of this section, we assume that the cliques 𝒞 are the cliques of a junction tree. This is without loss of generality, since we can enlarge cliques as needed until this property is satisfied.

Algorithm 1 Proximal Estimation Algorithm
  Input: Loss function L(µ) between µ and y
  Output: Estimated data distribution p̂_θ
  θ = 0
  for t = 1, ..., T do
    µ = MARGINAL-ORACLE(θ)
    θ = θ − η_t ∇L(µ)
  end for
  return p̂_θ

Algorithm 1 is a routine to find µ̂ by solving a convex optimization problem over the marginal polytope. Due to the special structure of the algorithm, it also finds the parameters θ̂. Algorithm 1 is inspired by the entropic mirror descent algorithm for solving convex optimization problems over the probability simplex (Beck & Teboulle, 2003). The iterates of the optimization are obtained by solving simpler optimization problems of the form:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)    (1)

where D is a Bregman divergence chosen to reflect the geometry of the marginal polytope. Here we use the following Bregman divergence generated from the Shannon entropy: D(µ, µ^t) = −H(µ) + H(µ^t) + (µ − µ^t)^T ∇H(µ^t), where H(µ) is the Shannon entropy of the graphical model p_θ with marginals µ. Since we assumed above that µ are marginals of the cliques of a junction tree, the Shannon entropy is convex and easily computed as a function of µ alone (Wainwright & Jordan, 2008).³

³An alternative would be to use the Bethe entropy as in (Vilnis et al., 2015). The Bethe entropy is convex and computable from µ alone regardless of the model structure. Using the Bethe entropy would lead to approximate marginal inference instead of exact marginal inference in the subproblems, which is an interesting direction for future work.
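Setting aside the derivation details, the resulting procedure is only a few lines once a marginal inference routine is available. A minimal sketch of Algorithm 1 (ours), where loss_grad and marginal_oracle are assumed to be supplied, and θ is represented as a flat numpy vector:

    def proximal_estimation(loss_grad, marginal_oracle, theta0, step, iters=10000):
        # Algorithm 1: mirror descent over the marginal polytope.
        #   loss_grad(mu):       gradient of the convex loss L at clique marginals mu
        #   marginal_oracle(th): clique marginals of the graphical model p_th
        #   step(t):             step size schedule (constant, decreasing, ...)
        theta = theta0
        for t in range(iters):
            mu = marginal_oracle(theta)            # marginal inference, e.g. junction tree BP
            theta = theta - step(t) * loss_grad(mu)
        return theta, marginal_oracle(theta)       # parameters and their marginals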

With this divergence, the objective of the subproblem in Equation 1 can be seen to be equal to a variational free energy, which is minimized by marginal inference in a graphical model. The full derivation is provided in the supplement. The implementation of Algorithm 1 is very simple — it requires only calling MARGINAL-ORACLE at each iteration. Additionally, even though the algorithm is designed to find the optimal µ, it also returns the corresponding graphical model parameters θ "for free" as a by-product of the optimization. This is evident from Algorithm 1: upon convergence, µ is the vector of marginals of the graphical model with parameters θ. The variable η_t in this algorithm is a step size, which can be constant, decreasing, or found via line search. This algorithm is an instance of mirror descent, and thus inherits its convergence guarantees. It will converge for any convex loss function L at a O(1/√t) rate,⁴ even ones that are not smooth, such as the L1 loss.

⁴That is, L(µ^t) − L(µ*) ∈ O(1/√t).

Algorithm 2 Accelerated Proximal Estimation Algorithm
  Input: Loss function L(µ) between µ and y
  Output: Estimated data distribution p̂_θ
  K = Lipschitz constant of ∇L
  ḡ = 0
  ν, µ = MARGINAL-ORACLE(0)
  for t = 1, ..., T do
    c = 2/(t+1)
    ω = (1 − c)µ + cν
    ḡ = (1 − c)ḡ + c ∇L(ω)
    θ = −(t(t+1)/4K) ḡ
    ν = MARGINAL-ORACLE(θ)
    µ = (1 − c)µ + cν
  end for
  return graphical model p̂_θ with marginals µ

We now present a related algorithm, based on the same principles as Algorithm 1, which has an improved O(1/t²) convergence rate for convex loss functions with Lipschitz continuous gradients. Algorithm 2 is based on Nesterov's accelerated dual averaging approach (Nesterov, 2009; Xiao, 2010; Vilnis et al., 2015). The per-iteration complexity is the same as Algorithm 1, as it requires calling MARGINAL-ORACLE once, but this algorithm will converge in fewer iterations. Algorithm 2 has the advantage of not requiring a step size to be set, but it requires knowledge of the Lipschitz constant of ∇L. For the standard L2 loss with linear measurements, this is equal to the largest eigenvalue of Q^T Q. The derivation of this algorithm appears in the supplement.
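The accelerated variant is equally short in code. A sketch of Algorithm 2 (ours), under the same assumptions as the previous sketch, with marginals and averaged gradients represented as flat numpy vectors:

    def accelerated_estimation(loss_grad, marginal_oracle, K, iters=10000):
        # Algorithm 2: accelerated dual averaging over the marginal polytope.
        # K is the Lipschitz constant of grad L (for L2 loss with linear
        # measurements, the largest eigenvalue of Q^T Q).
        g_bar = 0.0
        nu = marginal_oracle(0.0)      # marginals of the uniform distribution
        mu = nu
        for t in range(1, iters + 1):
            c = 2.0 / (t + 1)
            omega = (1 - c) * mu + c * nu
            g_bar = (1 - c) * g_bar + c * loss_grad(omega)
            theta = -t * (t + 1) / (4 * K) * g_bar
            nu = marginal_oracle(theta)
            mu = (1 - c) * mu + c * nu
        return mu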
3.3. Inference

Once p̂_θ has been estimated, we need algorithms to answer new queries without materializing the full contingency table representation. This corresponds to the problem of inference in a graphical model. If the new queries depend on p̂_θ only through its clique marginals µ, we can immediately answer them using MARGINAL-ORACLE, or by saving the final value of µ from Algorithm 1 or 2. If the new queries depend on some other marginals, outside of the cliques of the graphical model, we instead use the variable elimination algorithm (Koller & Friedman, 2009) to first compute the necessary marginal, and then answer the query. In Section B of the supplement, we present a novel inference algorithm that is related to variable elimination but is faster for answering certain queries because it does not need to materialize full marginals if the query does not need them. For more complicated downstream tasks, we can generate synthetic data by sampling from p̂_θ, although this should be avoided when possible as it introduces additional sampling error.

4. Use in Privacy Mechanisms

Next we describe how our estimation algorithms can improve the accuracy and/or scalability of four state-of-the-art mechanisms: MWEM, PrivBayes, HDMM, and DualQuery.

MWEM. The multiplicative weights exponential mechanism (Hardt et al., 2012) is an active-learning style algorithm designed to answer a workload of linear queries. MWEM maintains an approximation of the data distribution and at each time step selects the worst approximated query q_i^T from the workload via the exponential mechanism (McSherry & Talwar, 2007). It then measures the query using the Laplace mechanism as y_i = q_i^T p + z_i and updates the approximate data distribution by incorporating the measured information using the multiplicative weights update rule. The most basic version of MWEM represents the approximate data distribution in vector form, and updates it according to the following formula after each iteration:

    p̂ ← p̂ ⊙ exp(−q_i (q_i^T p̂ − y_i)/2m) / Z,    (2)

where ⊙ is elementwise multiplication and Z is a normalization constant.

It is infeasible to represent p explicitly for high-dimensional data, so this version of MWEM is only applicable to relatively low-dimensional data. Hardt et al. describe an enhanced version of MWEM, which we call factored MWEM, that is able to avoid materializing this vector explicitly in the special case when the measured queries decompose over disjoint subsets of attributes. In that case, p is represented implicitly as a product of independent distributions over smaller domains, i.e., p(x) = ∏_{C∈𝒞} p_C(x_C), and the update is done on one group at a time. However, this enhancement breaks down for measurements on overlapping subsets of attributes in high-dimensional data, so MWEM is still generally infeasible to run except on simple workloads.

We can replace the multiplicative weights update with a call to Algorithm 2 using the standard L2 loss function (on all measurements up to that point in the algorithm). By doing so, we learn a compact graphical model representation of p̂, which avoids materializing the full p vector even when the measured queries overlap in complicated ways. This allows MWEM to scale better and run in settings where it was previously infeasible. Our efficient version of MWEM gives solutions that are identical to a commonly used MWEM variant: the MWEM update (Equation 2) is closely related to the entropic mirror descent update (Beck & Teboulle, 2003), and, if iterated until convergence (as is done in practice), solves the same L2 minimization problem that we consider. More details are given in Section E.2 of the supplement.
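For reference, the explicit update (Equation 2) that our call to Algorithm 2 replaces can be written in a few lines (our sketch); note that it operates on the full domain vector, which is exactly what makes it infeasible at high dimension:

    import numpy as np

    def mw_update(p_hat, q, y, m):
        # One multiplicative weights step (Equation 2) for a single measured
        # query q with noisy answer y, on a dataset of m records.
        p_new = p_hat * np.exp(-q * (q @ p_hat - y) / (2 * m))
        return p_new / p_new.sum()      # renormalize (the constant Z)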
PrivBayes. PrivBayes (Zhang et al., 2017) is a differentially private mechanism that generates synthetic data. It first spends half the privacy budget to learn a Bayesian network structure that captures the dependencies in the data, and then uses the remaining privacy budget to measure the statistics — which are marginals — necessary to learn the model parameters. PrivBayes uses a heuristic of truncating negative entries of noisy measurements and normalizing to get conditional probability tables. It then samples a synthetic dataset of m records from the Bayesian network, from which consistent answers to workload queries can be derived. While this is simple and efficient, the heuristic does not properly account for measurement noise, and sampling may introduce unnecessary error.

We can replace the PrivBayes estimation and sampling step with a call to Algorithm 2, using an appropriate loss function (e.g., L1 or L2), to estimate a graphical model. Then we can answer new queries by performing graphical model inference (Section 3.3), rather than using synthetic data.

HDMM. The high-dimensional matrix mechanism (McKenna et al., 2018) is designed to answer a workload of linear queries on multi-dimensional data. It selects the measurements that minimize expected error on the input workload, which are then answered using the Laplace mechanism, with inconsistencies resolved by solving an ordinary least squares problem of the form p̂ = argmin_p ‖Qp − y‖₂². Solving this least squares problem is the main bottleneck of HDMM, as it requires materializing the data vector even when Q contains queries over the marginals of p.

We can replace the HDMM estimation procedure with Algorithm 2, using the same L2 loss function. If the workload contains queries over low-dimensional marginals of p, then Q will contain measurements over the low-dimensional marginals too. Thus, we replace the full "probability" vector p̂ with a graphical model p̂_θ. Note also that p̂ may contain negative values and need not sum to 1, since HDMM solves an ordinary (unconstrained) least squares problem.

DualQuery. DualQuery (Gaboardi et al., 2014) is an iterative algorithm inspired by the same two-player game underlying MWEM. It generates synthetic data to approximate the true data on a workload of linear queries. DualQuery maintains a distribution over the workload queries that depends on the true data, so that poorly approximated queries have higher probability mass. In each iteration, samples are drawn from the query distribution; these samples are proven to be differentially private. The sampled queries are then used to find a single record from the data domain (without accessing the protected data), which is added to the synthetic database. The measurements — i.e., the random outcomes from the privacy mechanism — are the queries sampled in each iteration. Even though these are very different from the linear measurements we have primarily focused on, we can still express the log-likelihood as a function of p and select p to maximize the log-likelihood using Algorithm 1 or 2. The log-likelihood depends on p only through the answers to the workload queries; if the workload can be expressed in terms of µ instead, the log-likelihood can as well. Thus, after running DualQuery, we can call Algorithm 1 with this custom loss function to estimate the data distribution, which we can use in place of the synthetic data produced by DualQuery. The full details are given in the supplementary material.

5. Experimental Evaluation

In this section, we measure the accuracy and scalability improvements enabled by probabilistic graphical-model (PGM) based estimation when it is incorporated into existing privacy mechanisms.

5.1. Adding PGM estimation to existing algorithms

We run four algorithms — MWEM, PrivBayes, HDMM, and DualQuery — with and without our graphical model technology, using a privacy budget of ε = 1.0 (and δ = 0.001 for DualQuery). We run Algorithm 1 with line search for DualQuery and Algorithm 2 for the other mechanisms, each for 10000 iterations. We repeat each experiment five times and report the average workload error. Experiments are done on 2 cores of a single compute cluster node with 16 GB of RAM and 2.4 GHz processors.

We use a collection of four datasets in our experiments, summarized in Table 1.
Each dataset consists of a collection of categorical and numerical attributes (with the latter discretized into 100 bins). The domain of each dataset is very large, which makes efficient estimation challenging.

[Figure 1 appears here: four bar-chart panels — (a) PrivBayes, (b) DualQuery, (c) MWEM, (d) HDMM — showing workload error on the titanic, adult, loans, and stroke datasets, for each mechanism with and without PGM estimation (MWEM and HDMM compared against MWEM+PGM and HDMM+LLS/HDMM+PGM).]

Figure 1: Workload error of four mechanisms on four datasets, with and without our PGM estimation algorithm, for ε = 1.0.

Table 1: Datasets used in the experiments, along with the number of queries in the workload used with each dataset.

    Dataset   Records   Attributes   Domain   Queries
    Titanic   1304      9            3e8      4851
    Adult     48842     15           1e19     62876
    Loans     42535     48           5e80     362201
    Stroke    19434     110          4e104    17716

For each dataset, we construct a workload of counting queries which is an extension of the set of three-way marginals. First, we randomly choose 15 subsets of attributes of size 3, 𝒞. For each subset C ∈ 𝒞, if C contains only categorical attributes, we define the sub-workload W_C to be a 3-way marginal. However, when C contains any discretized numerical attributes, we replace the set of unit queries used in a marginal with the set of prefix range queries. For example, if C = ⟨sex, education, income⟩, then the resulting sub-workload W_C would consist of all queries of the form: sex = x, education = y, income ∈ [0, z], where x, y, and z range over the domains of the respective attributes. The final workload is the union of the 15 three-way sub-workloads defined above.

We measure the error on the workload queries as:

    Error = (1/|𝒞|) Σ_{C∈𝒞} ‖W_C µ_C − W_C µ̂_C‖_1 / (2 ‖W_C µ_C‖_1),

where the summand is related to the total variation distance (and is equal to it in the special case when W_C = I).
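This metric is straightforward to compute from the true and estimated marginals; a short sketch (ours, with hypothetical dict-based inputs):

    import numpy as np

    def workload_error(workloads, mu_true, mu_est):
        # workloads: dict mapping C -> W_C (query matrix for the sub-workload);
        # mu_true / mu_est: dicts mapping C -> true / estimated marginal vector.
        errs = [np.abs(W @ (mu_true[C] - mu_est[C])).sum()
                / (2.0 * np.abs(W @ mu_true[C]).sum())
                for C, W in workloads.items()]
        return float(np.mean(errs))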
Improved accuracy. PrivBayes and DualQuery are highly scalable algorithms supporting the large domains considered here. Figures 1a and 1b show that incorporating PGM estimation significantly improves accuracy. For PrivBayes, workload error is reduced by a factor of 6× and 7× on the Loans and Stroke datasets, respectively, and a modest 30% for Adult. For DualQuery, we also observe very significant error reductions of 1.2×, 1.8×, 3.5×, and 4.4×.

Replacing infeasible estimation methods. The MWEM and HDMM algorithms fail to run on the datasets and workloads we consider because both require representations too large to maintain in memory. However, incorporating PGM estimation makes these algorithms feasible.

As Figure 1c shows, for the first three datasets, MWEM crashed before completing because it ran out of memory or timed out. For example, on one run of the Adult dataset, the first three chosen queries were on the ⟨race, native-country, income⟩, ⟨workclass, race, capital-gain⟩, and ⟨marital-status, relationship, capital-gain⟩ marginals. Since these all overlap with respect to race and capital-gain, factored MWEM offers no benefit, and the entire vector p_C must be materialized over these attributes, which requires over 100 MB. After 5 iterations, the representation requires more than 2 GB, at which point it timed out. Interestingly, MWEM was able to run on the Stroke dataset, which has the largest domain and greatest number of attributes. This is mainly because the workload did not contain as many queries involving common attributes. In general, MWEM's representation remains feasible as long as the workload (and therefore its measurements) consists solely of queries defined over low-dimensional marginals that do not have common attributes. Unfortunately, this imposes a serious restriction on the workloads MWEM can support. Note that the error on Stroke is identical for MWEM and MWEM+PGM — this is because both are based on the same underlying estimation problem.

Although the HDMM algorithm fails to run, for the purpose of comparison we run a modified version of the algorithm (denoted HDMM+LLS) which uses local least squares independently over each measurement set instead of global least squares over the full data vector. While scalable, Figure 1d shows that this estimation is substantially worse than PGM estimation, especially on the Titanic and Loans datasets. Incorporating PGM estimation offers error reductions of 6.6×, 3.2×, 27×, and 6.3×. These improvements primarily stem from non-negativity and global consistency.

Varying epsilon. While ε is set to 1 in Figure 1, in Figure 2a we look at the impact of varying ε for a fixed dataset and measurement set. We use the Adult dataset and the measurements selected by HDMM (which do not depend on ε). The magnitude of the improvement offered by our PGM estimation algorithm increases as ε decreases. At ε = 0.3 and below, the mechanism has virtually no utility without PGMs.

[Figure 2 appears here: (a) workload error of HDMM+LLS vs. HDMM+PGM on Adult, as ε varies over a log scale from 10⁻² to 10¹; (b) time per iteration of the MW, LSMR, and PGM estimation algorithms, as the number of attributes varies from 10¹ to 10³.]

Figure 2: (a) Error of HDMM variants on Adult as a function of ε. (b) Scalability of estimation algorithms.

At the highest ε of 10.0, HDMM+LLS actually offers slightly lower error than HDMM+PGM on the workload, although both have very low error in an absolute sense. The error of HDMM+PGM on the measurements is still better by more than a factor of three at this privacy level. This behavior has been observed before in the low-dimensional setting, where the least squares estimator generalizes better than the non-negative least squares estimator for workloads with range queries (Li et al., 2015).

5.2. The scalability of PGM estimation

We now evaluate the scalability of our approach compared with two other general-purpose estimation techniques: multiplicative weights (MW; Hardt et al., 2012) and iterative ordinary least squares (LSMR; Fong & Saunders, 2011; Zhang et al., 2018). We omit from the comparison PrivBayes estimation and DualQuery estimation because they are special-purpose estimation methods that cannot handle arbitrary linear measurements. We use synthetic data so that we can systematically vary the domain size and the number of attributes. We measure the marginals for each triple of adjacent attributes — i.e., Q_C = I for all C = (i, i+1, i+2), where 1 ≤ i ≤ d − 2. In Figure 2b, we vary the number of attributes from 3 to 1000 (fixing the domain of each attribute, |X_i|, at 10), and plot the time per iteration of each of these estimation algorithms. Both MW and LSMR fail to scale beyond datasets with 10 attributes, as they both require materializing p in vector form, while PGM easily scales to datasets with 1000 attributes.

The domain size is the primary factor that determines the scalability of the baseline methods. However, the scalability of PGM primarily depends on the complexity of the measurements taken. In the experiment above, the measurements were chosen to highlight a case where PGM estimation scales very well. In general, when the graphical model implied by the measurements has high tree-width, our methods will have trouble scaling, as MARGINAL-ORACLE is computationally expensive. In these situations, MARGINAL-ORACLE may be replaced with an approximate marginal inference algorithm, like loopy belief propagation (Wainwright & Jordan, 2008).

6. Related Work

The release of linear query answers has been extensively studied by the privacy community (Zhang et al., 2017; Li et al., 2015; Zhang et al., 2014; Li et al., 2014; Gaboardi et al., 2014; Yaroslavtsev et al., 2013; Qardaji et al., 2013b; Nikolov et al., 2013; Thaler et al., 2012; Hardt et al., 2012; Cormode et al., 2012; Acs et al., 2012; Gupta et al., 2011; Ding et al., 2011; Xiao et al., 2010; Li et al., 2010; Hay et al., 2010; Hardt & Talwar, 2010; Barak et al., 2007; McKenna et al., 2018; Eugenio & Liu, 2018). Early work using inference includes (Barak et al., 2007; Hay et al., 2010; Williams & McSherry, 2010), motivated by consistency as well as potential accuracy improvements. Inference has since been widely used in techniques for answering linear queries (Lee et al., 2015). These mechanisms often contain custom specialized inference algorithms that exploit properties of the measurements taken, and they can be replaced by our algorithms.

(Williams & McSherry, 2010) introduce the problem of finding posterior distributions over model parameters from the output of differentially private algorithms. Their problem formulation requires a known model parameterization and a prior distribution over the parameter space. Their approach requires approximating a high-dimensional integral, which they do either by Markov chain Monte Carlo, or by upper and lower bounds via the "factored exponential mechanism". In the discrete data case, these bounds require summing over the data domain, which is just as hard as materializing p and is not feasible for high-dimensional data.

(Bernstein et al., 2017) consider the task of privately learning the parameters of an undirected graphical model. They do so by releasing noisy sufficient statistics using the Laplace mechanism, and then using an expectation maximization algorithm to learn model parameters from the noisy sufficient statistics. Their work shares some technical similarities with ours, but the aims are different. They have the explicit goal of learning a graphical model whose structure is specified in advance and used to determine the measurements. Our goal is to find a compact representation of some data distribution that minimizes a loss function where the measurements are determined externally; the graphical model structure is a by-product of the measurements made and the maximum entropy criterion.

(Chen et al., 2015) consider the task of privately releasing synthetic data. Their mechanism is similar to PrivBayes, but it uses undirected graphical models instead of Bayesian networks. It finds a good model structure using a mutual information criterion, then measures the sufficient statistics of the model (which are marginals) and post-processes them to resolve inconsistencies. This post-processing is based on a technique developed by (Qardaji et al., 2014) that ensures all measured marginals are internally consistent, and it may be improved with our methods.

Acknowledgements

This work was supported by the National Science Foundation under grants 1409143, 1617533, and 1749854; and by DARPA and SPAWAR under contract N66001-15-C-4067. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

Acs, G., Castelluccia, C., and Chen, R. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 1–10, 2012.
Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, pp. 273–282, 2007.
Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Bernstein, G., McKenna, R., Sun, T., Sheldon, D., Hay, M., and Miklau, G. Differentially private learning of undirected graphical models using collective graphical models. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Chen, R., Xiao, Q., Zhang, Y., and Xu, J. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138. ACM, 2015.
Cormode, G., Procopiuc, C., Srivastava, D., Shen, E., and Yu, T. Differentially private spatial decompositions. In 2012 IEEE 28th International Conference on Data Engineering, pp. 20–31. IEEE, 2012.
Ding, B., Winslett, M., Han, J., and Li, Z. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 217–228. ACM, 2011.
Domke, J. Learning graphical model parameters with approximate marginal inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2454–2467, 2013.
Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, 2006.
Eaton, F. and Ghahramani, Z. Choosing a variable to clamp. In Artificial Intelligence and Statistics, pp. 145–152, 2009.

Eugenio, E. C. and Liu, F. CIPHER: Construction of differentially private microdata from low-dimensional histograms via solving linear equations with Tikhonov regularization. arXiv preprint arXiv:1812.05671, 2018.
Fong, D. C.-L. and Saunders, M. LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011.
Gaboardi, M., Arias, E. J. G., Hsu, J., Roth, A., and Wu, Z. S. Dual Query: Practical private query release for high dimensional data. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
Gupta, A., Hardt, M., Roth, A., and Ullman, J. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing (STOC), pp. 803–812. ACM, 2011.
Hardt, M. and Rothblum, G. N. A multiplicative weights mechanism for privacy-preserving data analysis. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 61–70. IEEE, 2010.
Hardt, M. and Talwar, K. On the geometry of differential privacy. In Symposium on Theory of Computing (STOC), pp. 705–714, 2010.
Hardt, M., Ligett, K., and McSherry, F. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pp. 2339–2347, 2012.
Hay, M., Rastogi, V., Miklau, G., and Suciu, D. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032, 2010.
Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lee, J., Wang, Y., and Kifer, D. Maximum likelihood postprocessing for differential privacy under consistency constraints. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 635–644. ACM, 2015.
Li, C., Hay, M., Rastogi, V., Miklau, G., and McGregor, A. Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 123–134. ACM, 2010.
Li, C., Hay, M., Miklau, G., and Wang, Y. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.
Li, C., Miklau, G., Hay, M., McGregor, A., and Rastogi, V. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6):757–781, 2015.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, 2015.
McKenna, R., Miklau, G., Hay, M., and Machanavajjhala, A. Optimizing error of high-dimensional statistical queries under differential privacy. Proceedings of the VLDB Endowment, 11(10):1206–1219, 2018.
McSherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 94–103. IEEE, 2007.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
Nikolov, A., Talwar, K., and Zhang, L. The geometry of differential privacy: the approximate and sparse cases. In Symposium on Theory of Computing (STOC), 2013.
Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
Proserpio, D., Goldberg, S., and McSherry, F. Calibrating data to sensitivity in private data analysis. In Conference on Very Large Data Bases (VLDB), 2014.
Qardaji, W., Yang, W., and Li, N. Differentially private grids for geospatial data. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 757–768. IEEE, 2013a.
Qardaji, W., Yang, W., and Li, N. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 6(14):1954–1965, 2013b.
Qardaji, W., Yang, W., and Li, N. PriView: Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1435–1446. ACM, 2014.

Thaler, J., Ullman, J., and Vadhan, S. Faster algorithms for privately releasing marginals. In Proceedings of the 39th International Colloquium on Automata, Languages, and Programming (ICALP), Part I, pp. 810–821. Springer-Verlag, 2012.
Vilnis, L., Belanger, D., Sheldon, D., and McCallum, A. Bethe projections for non-local inference. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 892–901. AUAI Press, 2015.
Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
Williams, O. and McSherry, F. Probabilistic inference and differential privacy. In Advances in Neural Information Processing Systems, pp. 2451–2459, 2010.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Xiao, X., Wang, G., and Gehrke, J. Differential privacy via wavelet transforms. In International Conference on Data Engineering (ICDE), 2010.
Yaroslavtsev, G., Cormode, G., Procopiuc, C. M., and Srivastava, D. Accurate and efficient private release of datacubes and contingency tables. In ICDE, 2013.
Zhang, D., McKenna, R., Kotsogiannis, I., Hay, M., Machanavajjhala, A., and Miklau, G. EKTELO: A framework for defining differentially-private computations. In Conference on Management of Data (SIGMOD), 2018.
Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., and Xiao, X. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):25, 2017.
Zhang, X., Chen, R., Xu, J., Meng, X., and Xie, Y. Towards accurate histogram publication under differential privacy. In SIAM International Conference on Data Mining (SDM). SIAM, 2014.

A. Estimation

Define µ(θ) to be the marginals of the graphical model with parameters θ, which may be computed with the MARGINAL-ORACLE.

A.1. Proximal Algorithm Derivation

Our goal is to solve the following optimization problem:

    µ̂ = argmin_{µ∈M} L(µ)

where L is some convex function such as ‖Q_𝒞 µ − y‖. Using the mirror descent algorithm (Beck & Teboulle, 2003), we can use the following update equation:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)

where D is a Bregman distance measure defined as

    D(µ, µ^t) = ψ(µ) − ψ(µ^t) − (µ − µ^t)^T ∇ψ(µ^t)

for some strongly convex and continuously differentiable function ψ. Taking ψ = −H to be the negative entropy, we arrive at the following update equation:

    µ^{t+1} = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) D(µ, µ^t)
            = argmin_{µ∈M}  µ^T ∇L(µ^t) + (1/η_t) (−H(µ) + µ^T ∇H(µ^t))
            = argmin_{µ∈M}  µ^T (∇L(µ^t) + (1/η_t) ∇H(µ^t)) − (1/η_t) H(µ)
            = argmin_{µ∈M}  µ^T (η_t ∇L(µ^t) + ∇H(µ^t)) − H(µ)
            = argmin_{µ∈M}  µ^T (η_t ∇L(µ^t) − θ^t) − H(µ)
            = µ(θ^t − η_t ∇L(µ^t))

The first four steps are simple algebraic manipulation of the mirror descent update equation. The final two steps use the observation that ∇H(µ^t) = −θ^t and that marginal inference can be cast as the following optimization problem (Wainwright & Jordan, 2008; Vilnis et al., 2015):

    µ(θ) = argmin_{µ∈M}  −µ^T θ − H(µ)

Thus, optimization over the marginal polytope is reduced to computing the marginals of a graphical model with parameters θ^t − η_t ∇L(µ^t), which can be accomplished using belief propagation or some other MARGINAL-ORACLE.

A.2. Accelerated Proximal Algorithm Derivation

The derivation of the accelerated proximal algorithm is similar. It is based on Algorithm 3 from (Xiao, 2010). Applied to our setting, step 4 of that algorithm requires solving the following problem:

    ν^t = argmin_{µ∈M}  µ^T ḡ − (4K/(t(t+1))) H(µ)
        = argmin_{µ∈M}  (t(t+1)/4K) µ^T ḡ − H(µ)
        = µ(−(t(t+1)/4K) ḡ)

which we solve by using the MARGINAL-ORACLE.

A.3. Direct Optimization

In preliminary experiments we also evaluated a direct method to solve the optimization problem. For the direct method, we estimate the parameters θ̂ directly by reformulating the optimization problem and instead solving the unconstrained problem θ̂ = argmin_θ L(µ(θ)). To evaluate the optimization objective, we use MARGINAL-ORACLE to compute µ(θ) and then compute the loss. For optimization, it has been observed that it is possible to back-propagate through marginal inference procedures (with or without automatic differentiation software) to compute their gradients (Eaton & Ghahramani, 2009; Domke, 2013). We apply automatic differentiation to the entire forward computation (Maclaurin et al., 2015), which includes MARGINAL-ORACLE, to compute the gradient of L. Since this is now an unconstrained optimization problem and we can compute the gradient of L, many optimization methods apply. In our experiments, we use the L2 loss, which is smooth, and apply the L-BFGS algorithm for optimization (Byrd et al., 1995).

However, despite its simplicity, there is a significant drawback to the direct algorithm. The objective is not, in general, convex with respect to θ. This may seem surprising, since the original problem is convex: L(µ) is convex with respect to µ and M is a convex set. Also, the most well known problem of this form, maximum-likelihood estimation in graphical models, is convex with respect to θ (Wainwright & Jordan, 2008); however, this relies on properties of exponential families that do not apply to other loss functions. One can verify for losses as simple as L2 that the Hessian need not be positive definite. As a result, the direct algorithm is not guaranteed to converge to a global minimum of the original convex optimization problem min_{µ∈M} L(µ). We did not observe convergence problems in our experiments, but the direct method was not better in practice than the proximal algorithms, which is why it is not included in the paper.
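A schematic of the direct method in code (our illustration only): marginal_oracle, Q, y, and dim are placeholder names, and marginal_oracle must be written in autograd-compatible numpy so that gradients flow through marginal inference:

    import autograd.numpy as anp
    from autograd import grad
    from scipy.optimize import minimize

    def direct_loss(theta):
        mu = marginal_oracle(theta)              # forward pass: marginal inference
        return 0.5 * anp.sum((Q @ mu - y) ** 2)  # L2 loss on the marginals

    res = minimize(direct_loss, x0=anp.zeros(dim), jac=grad(direct_loss),
                   method="L-BFGS-B")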

B. Inference

We now discuss how to exploit our compact factored representation of p_θ to answer new linear queries. We give an efficient algorithm for answering factored linear queries.

Definition 4 (Factored Query Matrix). A factored query matrix Q has columns that are indexed by x and rows that are indexed by vectors z ∈ [r_1] × ... × [r_d]. The total number of rows (queries) is r = ∏_{i=1}^d r_i. The entries of Q are given by Q(z, x) = ∏_{i=1}^d Q_i(z_i, x_i), where Q_i ∈ R^{r_i×n_i} is a specified factor for the ith attribute. The matrix Q can be expressed as Q = Q_1 ⊗ ... ⊗ Q_d, where ⊗ is the Kronecker product.

Factored query matrices are expressive enough to encode any conjunctive query (or a cartesian product of such queries), and more. There are a number of concrete examples that demonstrate the usefulness of answering queries of this form, including:

• Computing the marginal µ_C for any C ⊆ [d] (including unmeasured marginals).
• Computing the multivariate CDF of µ_C for any C ⊆ [d].
• Answering range queries.
• Compressing the distribution by transforming the domain.
• Computing the (unnormalized) expected value of one variable conditioned on other variables.

For the first two examples, we could have used standard variable elimination to eliminate all variables except those in C. Existing algorithms are not able to handle the other examples without materializing p̂ (or a marginal that supports the queries). Thus, our algorithm generalizes variable elimination. A more comprehensive set of examples, and details on how to construct these query matrices, are given in Section B.1.

Algorithm 3 Inference for Factored Queries
  Input: Parameters θ, factored query matrix Q
  Output: Query answers Q p_θ
  ψ = {exp(θ_C) | C ∈ 𝒞} ∪ {Q_i | i ∈ [d]}
  Z = MARGINAL-ORACLE(θ)
  return VARIABLE-ELIM(ψ, X)/Z

The procedure for answering these queries is given in Algorithm 3, which can be understood as follows. For a particular z, write f(z, x) = Q(z, x) p_θ(x) = ∏_i Q_i(z_i, x_i) p_θ(x). This can be viewed as an augmented graphical model on the variables z and x, where we have introduced new pairwise factors between each (x_i, z_i) pair, defined by the query matrix. Unlike a regular graphical model, the new factors can contain negative values. The query answers are obtained by multiplying Q and p, which sums over x. The zth answer is given by:

    (Q p_θ)(z) = Σ_{x∈X} Q(z, x) p_θ(x) = (1/Z) Σ_{x∈X} ∏_{i=1}^d Q_i(z_i, x_i) ∏_{C∈𝒞} exp[θ_C(x_C)]

This can be understood as marginalizing over the x variables in the augmented model f(z, x). The VARIABLE-ELIM routine referenced in the algorithm is standard variable elimination to perform this marginalization; it can handle negative values with no modification. We stress that, in practice, factor matrices Q_i may have only one row (r_i = 1, e.g., for marginalization); hence the output size r = ∏_{i=1}^d r_i is not necessarily exponential in d.

B.1. Factored Query Matrices

Table 2 gives some example "building block" factors that can be used to construct factored query matrices. This is by no means an exhaustive list of possible factors, but it provides the reader with evidence that answering these types of queries efficiently is practically useful. The factored query matrix for computing the marginal µ_C uses Q_i = I for i ∈ C and Q_i = 1 for i ∉ C. Similarly, the factored query matrix for computing the multivariate CDF of µ_C would simply use Q_i = P for i ∈ C. A query matrix for compressing a distribution could be characterized by functions f_i : [n_i] → [2], or equivalently binary matrices Q_i = R_{f_i} ∈ R^{2×n_i}. The query matrix for computing the (unnormalized) expected value of variable i conditioned on variable j would use Q_i = E and Q_j = I (and Q_k = 1 for all other k). These are only a few examples; the building blocks can be combined arbitrarily to construct a wide variety of interesting query matrices.
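The Kronecker structure is what lets Algorithm 3 avoid materializing Q or p. For a small explicit distribution the equivalence can be checked directly with tensor contractions (our sketch, not code from the paper):

    import numpy as np

    sizes = (2, 3, 2)
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(int(np.prod(sizes)))).reshape(sizes)

    # Factors: keep attribute 0 (I), marginalize attribute 1 (1),
    # CDF-transform attribute 2 (the matrix P from Table 2).
    Q0 = np.eye(2)
    Q1 = np.ones((1, 3))
    Q2 = np.tril(np.ones((2, 2)))
    answers = np.einsum('za,wb,yc,abc->zwy', Q0, Q1, Q2, P)

    # Same answers from the explicit Kronecker-product query matrix.
    Q = np.kron(np.kron(Q0, Q1), Q2)
    assert np.allclose(answers.ravel(), Q @ P.ravel())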
C. Loss Functions

C.1. L1 and L2 losses

The L1 and L2 loss functions have simple (sub)gradients:

$$\nabla L_1(\mu) = Q_C^T \,\mathrm{sign}(Q_C \mu - y)$$
$$\nabla L_2(\mu) = Q_C^T (Q_C \mu - y)$$

C.2. Linear measurements with unequal noise

When the privacy budget is not distributed evenly among the measurements, we have to appropriately modify the loss functions, which assume that the noisy answers all have equal noise scale. In order to do proper estimation and inference, we have to account for this varying noise level in the loss function. In Section 3.1 we claimed that L(p) = ‖Qp − y‖ makes sense as a loss function when the noise introduced to y is iid.

Qi    Requirements       Size      Definition (∀a ∈ [ni])        Description
I     —                  ni × ni   Qi(a, a) = 1                  keep variable
1     —                  1 × ni    Qi(1, a) = 1                  marginalize variable out
e_j   j ∈ [ni]           1 × ni    Qi(1, j) = 1                  inject evidence
e_S   S ⊆ [ni]           1 × ni    Qi(1, j) = 1 ∀j ∈ S           inject evidence (disjuncts)
P     —                  ni × ni   Qi(b, a) = 1 ∀b ≥ a           transform into CDF
R_f   f : [ni] → [ri]    ri × ni   Qi(f(a), a) = 1               compress domain
E     —                  1 × ni    Qi(1, a) = a                  reduce to expected value
E_k   k ≥ 1              k × ni    Qi(b, a) = a^b ∀b ≤ k         reduce to first k moments

Table 2: Example factors in the factored query matrix
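As an illustration of how the Table 2 building blocks compose (a toy sketch, not the paper's code; all names are ours), the E and I factors combine to answer unnormalized conditional expectation queries:

    import numpy as np

    rng = np.random.default_rng(2)
    n1, n2 = 5, 3
    P = rng.random((n1, n2))
    P /= P.sum()                   # toy joint distribution p(x1, x2)

    E = np.arange(n1)[None, :]     # E factor: Qi(1, a) = a (0-indexed here), size 1 x n1
    I = np.eye(n2)                 # I factor: keep variable x2

    # Q = E (kron) I answers, for each value b of x2, the unnormalized
    # expectation sum_a a * p(x1 = a, x2 = b).
    answers = (E @ P @ I.T).ravel()
    assert np.allclose(answers, np.kron(E, I) @ P.ravel())

Dividing by the answers of 1 ⊗ I (the marginal of the conditioning variable) would normalize these into proper conditional expectations.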

Luckily, even if this assumption is not satisfied, it is easy to correct. Assume that yi = qi^T p + εi where εi ∼ Lap(bi). Then (1/bi) yi = (1/bi) qi^T p + (1/bi) εi and (1/bi) εi ∼ Lap(1). Thus, we can replace the query matrix Q ← DQ and the answer vector y ← Dy, where D is the diagonal matrix defined by Dii = 1/bi. All the new query answers have the same effective noise scale, and so the standard loss functions may be used. This idea applies as well if the noise on each query answer is sampled from a normal distribution (for (ε, δ)-differential privacy).

C.3. Dual Query Loss Function

Algorithm 4 shows DualQuery applied to workloads defined over the marginals of the data. There are five hyper-parameters, of which four must be specified and the remaining one can be determined from the others.

The first step of the algorithm computes the answers to the workload queries. Then, for T time steps, observations are made about the true data via samples from the distribution Q^t. These observations are used to find a record x ∈ X to add to the synthetic database.

Algorithm 4 Dual Query for marginals workloads
  Input: X, the true data
  Input: W_C, workload queries
  Input: (s, T, η, ε, δ), hyper-parameters
  Output: synthetic database of T records
  y = W_C µ_X
  Q^1 = uniform(W)
  for t = 1, ..., T do
    sample q_1^t, ..., q_s^t from Q^t
    x^t = argmax_{x∈X} Σ_{i=1}^s q_i^t µ − q_i^t µ_x
    Q^{t+1} = Q^t exp(−η (y − W_C µ_{x^t}))
    normalize Q^t
  end for
  return (x^1, ..., x^T)

Algorithm 5 shows a procedure for computing the negative log likelihood (our loss function) of observing the DualQuery output, given some marginals. Evaluating the log likelihood is fairly expensive, as it requires essentially simulating the entire DualQuery algorithm. Fortunately, we do not have to run the most computationally expensive step within the procedure, which is finding x^t. We differentiate this loss function using automatic differentiation (Maclaurin et al., 2015) for use within our estimation algorithms.

Algorithm 5 Dual Query Loss Function
  Input: µ, marginals of the data
  Input: W_C, workload queries
  Input: cache, all relevant output from DualQuery:
    q_1^t, ..., q_s^t, the sampled queries at each time step
    x^t, the chosen record at each time step
  Output: L(µ), the negative log likelihood
  y = W_C µ
  Q^1 = uniform(W)
  loss = 0
  for t = 1, ..., T do
    loss −= Σ_{i=1}^s log(Q^t(q_i^t))
    Q^{t+1} = Q^t exp(−η (y − W_C µ_{x^t}))
    normalize Q^t
  end for
  return loss

D. Additional Experiments

D.1. L1 vs. L2 Loss

In Section 3 we mentioned that minimizing L1 loss is equivalent to maximizing likelihood for linear measurements with Laplace noise, but that L2 loss is more commonly used in the literature. In this experiment we compare these two estimators side-by-side. Specifically, we consider the workload from Figure 1 and measurements chosen by HDMM with ε = 1.0. As expected, performing L1 minimization results in lower L1 loss but higher L2 loss, although the difference is quite small, especially for L1 loss; the difference is larger for L2 loss. Minimizing L2 loss results in lower workload error, indicating that it generalizes better. This is somewhat surprising given that L1 minimization is maximizing likelihood.
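For reference, a minimal sketch of the two estimators being compared (toy data and plain projected subgradient steps, not the paper's proximal algorithms; names are ours), using the (sub)gradients from Section C.1:

    import numpy as np

    def estimate(Q, y, n, loss="L2", iters=2000, eta=0.01):
        # Subgradient descent over the probability simplex, minimizing
        # ||Q p - y||_1 or (1/2) ||Q p - y||_2^2.
        p = np.ones(n) / n
        for _ in range(iters):
            r = Q @ p - y
            g = Q.T @ (np.sign(r) if loss == "L1" else r)
            p = np.clip(p - eta * g, 0, None)
            p /= p.sum()           # crude renormalization back to the simplex
        return p

    rng = np.random.default_rng(3)
    Q = rng.integers(0, 2, size=(8, 6)).astype(float)  # random counting queries
    y = Q @ np.full(6, 1 / 6) + rng.laplace(scale=0.05, size=8)
    for name in ("L1", "L2"):
        p = estimate(Q, y, 6, name)
        print(name, np.abs(Q @ p - y).sum(), ((Q @ p - y) ** 2).sum())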

Figure 3: L1 minimization vs. L2 minimization, evaluated on L1 loss, L2 loss, and workload error.

Another interesting observation is that the workload error actually starts going up after about 200 iterations, suggesting that some form of over-fitting is occurring. The minimum workload error achieved was 0.066 while the final workload error was 0.084, a meaningful difference. Of course, in practice we cannot stop iterating when workload error starts increasing, because evaluating it requires looking at the true data.

E. Additional Details

E.1. Unknown Total

Our algorithms require that m, the total number of records in the dataset, is known or can be estimated. Under a slightly different privacy definition, where nbrs(X) is the set of databases where a single record is added or removed (instead of modified), this total is a sensitive quantity which cannot be released exactly (Dwork & Roth, 2014). Thus, the total is not known in this setting, but a good estimate can typically be obtained from the measurements taken, without spending additional privacy budget. First observe that 1^T µ_C = m is the total for an unnormalized database. Now suppose we have measured y_C = Q_C µ_C + z_C. Then, as long as 1 is in the row-space of Q_C, m_C = 1^T Q_C^+ y_C is an unbiased estimate for m with variance

$$Var(m_C) = Var(y_C) \, \|1^T Q_C^+\|_2^2$$

This is a direct consequence of Proposition 9 from (Li et al., 2015). We thus have multiple estimates for m, which we can combine using inverse variance weighting, resulting in the final estimate

$$\hat{m} = \frac{\sum_C m_C / Var(m_C)}{\sum_C 1 / Var(m_C)}$$

which we can use in place of m.
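A minimal sketch of this total estimator (names are ours; it assumes the per-entry noise variance of each measurement is known, and unbiasedness requires 1 to lie in each row-space):

    import numpy as np

    def estimate_total(measurements):
        # measurements: list of (Q_C, y_C, var_yC) triples, where
        # y_C = Q_C mu_C + noise and var_yC is the per-entry noise variance.
        # Combines the per-clique totals m_C = 1^T Q_C^+ y_C by
        # inverse variance weighting.
        est, weights = 0.0, 0.0
        for Q, y, var_y in measurements:
            w = np.ones(Q.shape[1]) @ np.linalg.pinv(Q)  # row vector 1^T Q_C^+
            m_C = w @ y                                  # unbiased estimate of m
            var_C = var_y * (w @ w)                      # Var(y_C) * ||1^T Q_C^+||_2^2
            est += m_C / var_C
            weights += 1.0 / var_C
        return est / weights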

E.2. Multiplicative Weights vs Entropic Mirror Descent

Recall from Section 4 that the multiplicative weights update equation is:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-q_i (q_i^T \hat{p} - y_i)/2m}\big) \,/\, Z$$

and the update is applied (possibly cyclically) for i = 1, ..., T. Now imagine taking all of the measurements and organizing them into a T × n matrix Q. Then we can apply all the updates at once, instead of sequentially, and we end up with the following update equation:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-Q^T (Q\hat{p} - y)/2m}\big) \,/\, Z$$

Observing that ∇L2(p̂) = Q^T(Qp̂ − y), this simplifies to:

$$\hat{p} \leftarrow \hat{p}\,\exp\big({-\nabla L_2(\hat{p})/2m}\big) \,/\, Z$$

which is precisely the update equation for entropic mirror descent for minimizing L2(p) over the probability simplex (Beck & Teboulle, 2003).
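A quick numeric sanity check of this equivalence (toy data; names are ours): the batched multiplicative-weights step below is an entropic mirror descent step with step size 1/2m, and repeating it drives down the L2 loss:

    import numpy as np

    rng = np.random.default_rng(4)
    n, T, m = 6, 8, 100.0
    Q = rng.integers(0, 2, size=(T, n)).astype(float)
    y = Q @ np.full(n, 1 / n) + rng.laplace(scale=0.02, size=T)

    p = np.full(n, 1 / n)                  # start at the uniform distribution
    for _ in range(100):
        grad = Q.T @ (Q @ p - y)           # gradient of L2(p) = 0.5 ||Qp - y||^2
        p = p * np.exp(-grad / (2 * m))    # batched multiplicative-weights step ...
        p /= p.sum()                       # ... normalized: entropic mirror descent
    print(0.5 * np.sum((Q @ p - y) ** 2))  # loss after the updates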