
Foundations of Bayesian Learning from Synthetic Data

Harrison Wilde (Department of Statistics, University of Warwick) · Jack Jewson (Barcelona GSE, Universitat Pompeu Fabra) · Sebastian Vollmer (Department of Statistics and Mathematics Institute, University of Warwick; The Alan Turing Institute) · Chris Holmes (Department of Statistics, University of Oxford; The Alan Turing Institute)

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data, and fewer still for situations where a researcher wishes to augment real data with another party's synthesised data. We use a Bayesian paradigm to characterise the updating of model parameters when learning in these settings, demonstrating that caution should be taken when applying conventional learning algorithms without appropriate consideration of the synthetic data generating process and learning task at hand. Recent results from general Bayesian updating support a novel and robust approach to Bayesian synthetic-learning founded on decision theory that outperforms standard approaches across repeated experiments on supervised learning and inference problems.

1 Introduction

Privacy enhancing technologies comprise an area of rapid growth (The Royal Society, 2019). An important aspect of this field concerns publishing privatised versions of datasets for learning; it is known that simply anonymising the data is not sufficient to guarantee individual privacy (e.g. Rocher et al., 2019). We instead adopt the Differential Privacy (DP) framework (Dwork et al., 2006) to define working bounds on the probability that an adversary may identify whether a particular observation is present in a dataset, given that they have access to all other observations in the dataset. DP's formulation is context-dependent across the literature; here we amalgamate definitions regarding adjacent datasets from Dwork et al. (2014); Dwork and Lei (2009):

Definition 1 ((ε, δ)-differential privacy) A randomised function or mechanism K is said to be (ε, δ)-differentially private if for all pairs of adjacent, equally-sized datasets D and D′ that differ in one observation, and all S ⊆ Range(K),

    Pr[K(D) ∈ S] ≤ e^ε × Pr[K(D′) ∈ S] + δ.    (1)
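Definition 1 is most easily grounded by its canonical instance, the Laplace mechanism of Dwork et al. (2006), which releases a numeric query by adding noise scaled to the query's sensitivity. The sketch below is a minimal illustration of ours (in Python; the function name and parameters are our own, not taken from the paper), releasing a bounded mean query under (ε, 0)-DP.

```python
import numpy as np

def laplace_mechanism(query_value, sensitivity, epsilon, rng):
    """Release query_value with (epsilon, 0)-DP via the Laplace mechanism.

    Noise is drawn from Laplace(0, lambda) with scale lambda = sensitivity / epsilon,
    where the sensitivity bounds how much the query can change when one
    observation is replaced (Definition 1's adjacent datasets).
    """
    scale = sensitivity / epsilon
    return query_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100).clip(-3, 3)  # bounded data => finite sensitivity

# Sensitivity of the mean of n observations each lying in [-3, 3] is 6 / n.
private_mean = laplace_mechanism(x.mean(), sensitivity=6 / len(x), epsilon=1.0, rng=rng)
print(private_mean)
```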

This characterisation results in generative models that are 'misspecified by design', owing to the constraints imposed upon their design by requiring the fulfilment of a DP guarantee. This inevitably leads to discrepancy between the learner's final model and the one that they would otherwise have formulated if not for this DP restriction. In real-world, finite data contexts where synthesis methods are often 'black-box' in nature, it is difficult for a learner to fully capture and understand the inherent differences in the underlying distributions of the real and synthetic data that they have access to.

There are two key insights that we explore in this paper following the characterisation above: Firstly, when left unchecked, the machine learns model parameters minimising the Kullback-Leibler divergence (KLD) to the synthetic data generating process (S-DGP) (Berk et al., 1966; Walker, 2013) rather than the true data generating process (DGP); Secondly, robust inference methods offer improved performance by acknowledging this misspecification, where in some cases synthetic data can otherwise significantly hinder learning rather than helping it.

In order to investigate these behaviours, we experiment with models based on a mix of simulated-private and real-world data to offer empirical insights on the learning procedure when a varying amount of real data is available, and explore the optimalities in the amounts of synthetic data with which to augment this real data.

The contributions of our work are summarised below:

1. Learning from synthetic data can lead to unpredictable and negative outcomes, due to varying levels of model misspecification introduced by its generation and associated privacy constraints.
2. Robust Bayesian inference offers improvements over classical Bayes when learning from synthetic data.
3. Real and synthetic data can be used in tandem to achieve practical effectiveness through the discovery of desirable stopping points for learning, and optimal model configurations.
4. Consideration of the preferred properties of the inference procedure is critical; the specific task at hand can determine how best to use synthetic data.

We adopt a Bayesian standpoint throughout this paper, but note that many of the results also hold in the frequentist setting.

2 Problem Formulation

We outline the inference problem as follows:

• Let x_{1:n} denote a training set of n exchangeable observations from Nature's true DGP, F0(x), with density f0(x) with respect to the Lebesgue measure, such that x_{1:n} ∼ F0(x); we suppose x_i ∈ R^d. These observations are held privately by a data keeper K.
• K uses data x_{1:n} to produce an (ε, δ)-differentially private synthetic data generating mechanism (S-DGP). With a slight abuse of notation we use G_{ε,δ}(x_{1:n}) to denote the S-DGP, noting that G_{ε,δ} could be a fully generative model, or a private release mechanism that acts directly on the finite data x_{1:n} (see the discussion on the details of the S-DGP below). We denote the density of this S-DGP as g_{ε,δ}.
• Let f_θ(x) denote a learner L's model likelihood for F0(x), parameterised by θ with prior π̃(θ), and marginal (predictive) likelihood p(x) = ∫ f_θ(x) π̃(θ) dθ.
• L's prior may already encompass some other set of real data drawn from F0, leading to π̃(θ) = π(θ | x^L_{1:n_L}), for n_L ≥ 0 prior observations.

We adopt a decision theoretic framework (Berger, 2013), in assuming that L wishes to take some optimal action â in a prediction or inference task, satisfying:

    â = arg max_{a∈A} ∫ U(x, a) dF0(x).    (2)

This is with respect to a user-specified utility function U(x, a) that evaluates actions in the action space A, and makes precise L's desire to learn about F0 in order to accurately identify â.

Details of the synthetic data generation mechanism. In defining G_{ε,δ}, we believe it is important to differentiate between its two possible forms:

1. G_{ε,δ}(x_{1:n}) = G_{ε,δ}(z | x_{1:n}): G is a privacy-preserving generative model fit on the real data, such as the PATE-GAN (Jordon et al., 2018), DP-GAN (Xie et al., 2018) or PrivBayes (Zhang et al., 2017). Privatised synthetic data is produced by injecting potentially heavy-tailed noise into gradient-based learning and/or through partitioned training leveraging marginal distributions, aggregations and subsets of the data. The S-DGP provides conditional independence between z_{1:m} and x_{1:n} and therefore no longer queries the real data after training.
2. G_{ε,δ}(dz) = ∫ K_{ε,δ}(x, dz) F0(dx): A special case of this integral comprises the convolution of F0 with some noise distribution H, such that G_{ε,δ} = F0 ⋆ H_{ε,δ}. The distribution is therefore not a function of the private data x_{1:n}. In this case, the number of samples that we can draw is limited to m ≤ n, as drawing one data item requires using one sample of K's data. Examples of this formulation include the Laplace mechanism (Dwork et al., 2014) and transformation-based privatisation (Aggarwal and Yu, 2004).
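The distinction between the two forms matters operationally. A minimal sketch of ours (Python, with illustrative names; training a real DP generative model such as a PATE-GAN is beyond a short example) contrasts them:

```python
import numpy as np

rng = np.random.default_rng(1)
x_private = rng.normal(0.0, 1.0, size=500).clip(-3, 3)  # K's bounded private data

# Form 2: G = F0 * H -- each private record is perturbed exactly once, so the
# number of synthetic observations is capped at m <= n.
def convolution_sdgp(x, epsilon, sensitivity, rng):
    lam = sensitivity / epsilon        # Laplace scale giving (epsilon, 0)-DP
    return x + rng.laplace(0.0, lam, size=x.shape)

z = convolution_sdgp(x_private, epsilon=6.0, sensitivity=6.0, rng=rng)

# Form 1: a DP generative model is trained once on x_{1:n} under the privacy
# budget; afterwards it can be sampled indefinitely without touching x_{1:n}.
class GenerativeSDGP:
    def __init__(self, trained_sampler):
        self.trained_sampler = trained_sampler   # e.g. a fitted generator network
    def sample(self, m, rng):
        return self.trained_sampler(m, rng)      # z_{1:m} independent of x_{1:n} given training
```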

The fundamental problem of synthetic learning. L wants to learn about F0 but only has access to their prior π̃(θ) and to z_{1:m} ∼ G_{ε,δ}, where G_{ε,δ} ≢ F0. That is, the S-DGP G_{ε,δ}(·) is 'misspecified by design'. This claim is supported by a number of observations:

• L specifies a model p(x) using beliefs about the target F0 to be built using real data x_{1:n}; they are then constrained by a subsequently imposed requirement of guaranteeing DP, which instead requires consideration of the resulting S-DGP G_{ε,δ}. This leads to an inevitable change in their beliefs such that the resulting model would be misspecified relative to the original 'best' model for the true DGP F0.
• Correctly modelling a privacy preserving mechanism as part of an S-DGP, such as a 'black-box' GAN or complex noise convolution, is often intractable.
• There is inherent finiteness to real-world sensitive data contexts that makes it difficult for a generative model to capture the true DGP of even the non-private data it seeks to emulate. Considering sufficiently large quantities of data, and sufficiently flexible models, would in theory allow for the true DGP to be learnt, but relying on asymptotic behaviour in this setting is at odds with the definition of DP, given that the identifiability of individuals would also naturally diminish as data availability increases. Moreover, for non-trivial high-dimensional models, the amount of data required to properly capture the DGP often becomes infeasible, regardless of the magnitude of DP constraints.

Therefore, the posterior predictive converges to a different distribution under real and synthetic data generating processes, such that p(x | z_{1:m→∞}) ≢ p(x | x_{1:n→∞}).

Learning from synthetic data is an intricate example of learning under model misspecification, where the misspecification is by K's design. It is important, as shown below, that this is recognised in the learning of models. Fortunately we can adapt recent advances in Bayesian inference under model misspecification to help optimise learning with respect to L's task.

2.1 Bayesian Inference under model misspecification

Bayesian inference under model misspecification has recently been formalised (Walker, 2013; Bissiri et al., 2016) and represents a growing area of research; see Watson and Holmes (2016); Jewson et al. (2018); Miller and Dunson (2018); Lyddon et al. (2018); Grünwald et al. (2017); Knoblauch et al. (2019) to name but a few. Traditional Bayes rule updating in this context can be seen as an approach that learns about the parameters of the model that minimises the logarithmic score, or equivalently, the Kullback-Leibler divergence (KLD) of the model from the DGP of the data (Berk et al., 1966; Walker, 2013; Bissiri et al., 2016), where KLD(g_{ε,δ} ∥ f) = ∫ log(g_{ε,δ}/f) dG_{ε,δ}.

As a result, if L updates their model f_θ(x) using synthetic data z_{1:m} ∼ G_{ε,δ}(x_{1:n}), then as m → ∞ they will be learning about the limiting parameter that minimises the KLD to the S-DGP:

    θ^{KLD}_{G_{ε,δ}} = arg min_{θ∈Θ} KLD(g_{ε,δ}(·) ∥ f_θ(·)),    (3)

and under regularity conditions the posterior distribution concentrates around that point, π(θ | z_{1:m}) → 1_{θ^{KLD}_{G_{ε,δ}}} as m → ∞.

Furthermore, this posterior will concentrate away from the model that is closest to F0 in KLD, corresponding to the limiting model that would be learnt given an infinite real sample x_{1:∞} from F0:

    θ^{KLD}_{G_{ε,δ}} ≢ θ^{KLD}_{F0} = arg min_{θ∈Θ} KLD(f0(·) ∥ f_θ(·)).    (4)

This problem is exacerbated by the fact that S-DGPs can be prone to generating 'outliers' relative to the private data, as a result of the injection of additional noise into their training or generation to ensure (ε, δ)-DP, combined with the fact that θ^{KLD}_{G_{ε,δ}} is known to be non-robust to outliers (e.g. Basu et al., 1998). Therefore, given that as we collect more synthetic data our inference is no longer minimising the KLD towards F0, we must carefully consider and investigate whether our inference is still 'useful' for learning about F0 at all.

2.2 The approximation to F0

Before we proceed any further we must consider what it means for data from G_{ε,δ} to be 'useful' for learning about F0. We can do so using the concepts of scoring rules and statistical divergence.

Definition 2 (Proper Scoring Rule) The function s : X × P → R is a strictly proper scoring rule provided its difference function D satisfies

    D(f0 ∥ f) = E_{x∼f0}[s(x, f(·))] − E_{x∼f0}[s(x, f0(·))],
    D(f1 ∥ f2) ≥ 0,  D(f ∥ f) = 0,  for all f, f1, f2 ∈ P(x),

    P(x) := { f(x) : f(x) ≥ 0 ∀x ∈ X, ∫_X f(x) dx = 1 },

where the function D measures a distance between two probability distributions. D arises as the divergence associated with the scoring rule s(x, f) and is minimised when f0 = f (Gneiting and Raftery, 2007; Dawid, 2007).
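The following worked example is our own illustration, not drawn from the paper's supplement: it instantiates Definition 2 with the logarithmic score, and contrasts the limiting parameters of Eqs. (3) and (4) for a Gaussian model under the Laplace noise-convolution S-DGP used later in Section 4.1.

```latex
% The logarithmic score recovers the KLD as the difference function in Definition 2:
\[
  s(x, f) = -\log f(x)
  \;\Longrightarrow\;
  D(f_0 \,\|\, f)
  = \mathbb{E}_{x \sim f_0}\!\left[\log \frac{f_0(x)}{f(x)}\right]
  = \mathrm{KLD}(f_0 \,\|\, f).
\]
% Worked contrast of Eqs. (3) and (4): take F_0 = N(\mu_0, \sigma_0^2) and the
% noise-convolution S-DGP G_{\varepsilon,0} = F_0 \star \mathrm{Laplace}(0, \lambda).
% The KLD-minimising Gaussian matches the first two moments of its target, so
\[
  \theta^{\mathrm{KLD}}_{F_0} = (\mu_0, \sigma_0^2),
  \qquad
  \theta^{\mathrm{KLD}}_{G_{\varepsilon,0}} = (\mu_0, \sigma_0^2 + 2\lambda^2),
\]
% since Laplace(0, \lambda) noise adds variance 2\lambda^2. The naive posterior
% therefore concentrates on an over-dispersed model, and the gap widens as the
% privacy budget \varepsilon shrinks (\lambda = S / \varepsilon).
```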

A further advantage of this representation is that it allows for the minimisation of D(f0 ∥ ·) using only samples from F0:

    arg min_{f∈F} D(f0 ∥ f) = arg min_{f∈F} E_{x∼f0}[s(x, f(·))] ← (1/n) Σ_{i=1}^n s(x_i, f(·)) as n → ∞,  x_i ∼ F0.    (5)

Henceforth, we define any concepts of closeness (or 'usefulness') in terms of a chosen divergence D and associated scoring rule s, where the approximating density f is given by the predictive inferences resulting from synthetic z_{1:m} ∼ G_{ε,δ}. Given that inference using G_{ε,δ} is no longer able to exactly capture f0, L can use this notion of closeness to define what aspects of F0 they are most concerned with capturing. The importance of this specification is illustrated in Section 4.

3 Improved learning from the S-DGP

The classical assumptions underlying statistics are that minimising the KLD is the optimal way to learn about the DGP, and that more observations provide more information about this underlying DGP; such logic does not necessarily apply here. L wishes to learn about the private DGP, F0, but must rely on observations from the S-DGP G_{ε,δ} to do so. In this section we acknowledge this setting to propose a framework for improved learning from synthetic data. In so doing we pose the following question and detail our solutions in turn: Given the scoring criterion D, is θ^{KLD}_{G_{ε,δ}} the best the learner can do?

1. Can the robustness of the learning procedure be improved to better approximate F0 by acknowledging the misspecification and outlier prone nature of z_{1:m}?
2. Starting from the prior predictive, p(x), for a given learning method, when does learning using z ∼ G_{ε,δ} stop improving inference for F0(x)? That is, when does

    E_z[D(f0(·) ∥ p(· | z_{1:j+1}))] > E_z[D(f0(·) ∥ p(· | z_{1:j}))]?

3.1 General Bayesian Inference

In order to address these issues we adopt a general Bayesian, minimum divergence paradigm for inference (Bissiri et al., 2016; Jewson et al., 2018) inspired by model misspecification, where L can coherently update beliefs about their model parameter θ from prior π̃(θ) to posterior π(θ | z_{1:m}) using:

    π^ℓ(θ | z_{1:m}) = π̃(θ) exp(−Σ_{j=1}^m ℓ(z_j, f_θ)) / ∫ π̃(θ) exp(−Σ_{j=1}^m ℓ(z_j, f_θ)) dθ,    (6)

where ℓ(z, f_θ) is the loss function used by L for inference. Note that in this formulation, the logarithmic score ℓ_0(z, f_θ) = −log f_θ(z) recovers traditional Bayes rule updating. The predictive distribution associated with such a posterior and the model f_θ is:

    p^ℓ(x | z_{1:m}) = ∫ f_θ(x) π^ℓ(θ | z_{1:m}) dθ.    (7)

3.2 Robust Bayes and dealing with outliers

In the absence of the ability to correctly model the S-DGP, robust loss functions (see e.g. Berger et al., 1994) provide an alternative option to guard against artefacts of the generated synthetic data. We can gain increased robustness in our learning procedure to data z_{1:m} by changing the loss ℓ(z, f_θ) used for inference in Eq. (6). We consider two alternative loss functions to the standard logarithmic score underpinning standard Bayesian statistics:

    ℓ_w(z, f_θ) := −w log f_θ(z),    (8)
    ℓ^{(β)}(z, f_θ) := (1/(β+1)) ∫ f_θ(y)^{β+1} dy − (1/β) f_θ(z)^β.    (9)

Here, ℓ_w(z, f_θ) introduces a learning parameter w > 0 into the Bayesian update (e.g. Lyddon et al., 2018; Grünwald et al., 2017; Miller and Dunson, 2018; Holmes and Walker, 2017). Down-weighting with w < 1 will generally produce a less confident posterior than in the case of traditional Bayes' rule, with a greater dependence on the prior. Conversely, w > 1 will have the opposite effect. The value of w can have ramifications for inference and prediction (Rossell and Rubio, 2018; Grünwald et al., 2017). However, we note that as the sample size grows, using the weighted likelihood posterior will still learn about θ*_G if w is fixed. Choosing w = e/m instead can be seen to average the log-likelihood, with e providing a notion of effective sample size.

Alternatively, minimising ℓ^{(β)}(x, f(·)) in expectation over the DGP is equivalent to minimising the β-divergence (βD) (Basu et al., 1998). Therefore, analogously to the KLD and the log-score, using ℓ^{(β)}(x, f(·)) (Bissiri et al., 2016; Jewson et al., 2018; Ghosh and Basu, 2016) produces a Bayesian update targeting:

    θ^{βD}_{G_{ε,δ}} := arg min_{θ∈Θ} βD(g_{ε,δ}(·) ∥ f_θ).    (10)

As β → 0, βD → KLD, but as β increases it provides increased robustness through skepticism of new observations relative to the prior. We demonstrate the robustness properties of the βD in some simple scenarios in the Supplementary material (see A.1) and refer the reader to e.g. Knoblauch et al. (2019, 2018) for further examples. We note there are many possible divergences providing greater robustness properties than the KLD, e.g. the Wasserstein or Stein discrepancy (Barp et al., 2019), but for our exposition we focus on the βD for its convenience and simplicity.
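The general Bayesian update of Eq. (6) with the losses in Eqs. (8)-(9) can be computed directly for low-dimensional θ. The sketch below is our own minimal Python illustration, assuming a Gaussian location model with known scale and a grid approximation in place of the MCMC used in the paper; it contrasts the log-score, weighted and βD posteriors on synthetic data containing outliers.

```python
import numpy as np
from scipy.stats import norm

def general_bayes_posterior(z, loss, mu_grid, prior):
    """Grid approximation to Eq. (6): posterior ~ prior * exp(-sum_j loss(z_j, mu))."""
    neg_loss = -np.array([sum(loss(zj, mu) for zj in z) for mu in mu_grid])
    log_post = np.log(prior) + neg_loss
    post = np.exp(log_post - log_post.max())      # stabilise before normalising
    return post / np.trapz(post, mu_grid)

sigma = 1.0                                       # assumed known scale

def log_loss(z, mu):                              # Eq. (8) with w = 1: standard Bayes
    return -norm.logpdf(z, mu, sigma)

def weighted_loss(z, mu, w=0.5):                  # Eq. (8): down-weights every z equally
    return -w * norm.logpdf(z, mu, sigma)

def beta_loss(z, mu, beta=0.5):                   # Eq. (9); for a Gaussian the integral
    # int N(y; mu, sigma^2)^(beta+1) dy = (2 pi sigma^2)^(-beta/2) / sqrt(beta+1)
    integral = (2 * np.pi * sigma**2) ** (-beta / 2) / np.sqrt(beta + 1)
    return integral / (beta + 1) - norm.pdf(z, mu, sigma) ** beta / beta

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(0, 1, 50), [8.0, 9.0]])   # synthetic data with 'outliers'
mu_grid = np.linspace(-3, 3, 601)
prior = norm.pdf(mu_grid, 0, 2)                   # pi_tilde(mu) = N(0, 2^2)

for loss in (log_loss, weighted_loss, beta_loss):
    post = general_bayes_posterior(z, loss, mu_grid, prior)
    print(loss.__name__, "posterior mean:", np.trapz(mu_grid * post, mu_grid))
```

Running this, the βD posterior mean is pulled far less towards the injected outliers than the log-score posterior, while the weighted loss merely widens the posterior without changing its target.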

A key difference between the two robust loss functions considered above is that while ℓ_w(z, f_θ) down-weights the log-likelihood of each observation equally, ℓ^{(β)}(x, f(·)) does so adaptively, based on how likely the new observation is under the current inference (Cichocki et al., 2011). It is this adaptive down-weighting that allows the βD to target a different limiting parameter to θ^{KLD}_{G_{ε,δ}}. This, in particular, allows the βD to be robust to outliers and/or heavy tailed contaminations. As a result, we believe that

    D(F0 ∥ f_{θ^{βD}_{G_{ε,δ}}}) < D(F0 ∥ f_{θ^{KLD}_{G_{ε,δ}}})

across a wide range of S-DGPs, i.e. the βD minimising approximation to G_{ε,δ} is a better approximation of F0 than the KLD minimising approximation. Proving that this is the case in general proves complicated and is hampered by the intractability both of the βD minimisation and of many popular S-DGPs. However, we further justify this claim in A.1.2, where we show that for the prevalent Laplace DP mechanism this holds uniformly over the privacy level parameterised by the scaling parameter λ of the Laplace distribution for D = KLD.

A strength of the βD is that, unlike standard robust methods using heavier tailed models or losses (Berger et al., 1994; Huber and Ronchetti, 1981; Beaton and Tukey, 1974), ℓ^{(β)}(x, f(·)) does not change the model used for inference. In the absence of any specific knowledge about the S-DGP, updating using the βD maintains the model L would have used to estimate F0, but updates its parameters robustly. This also has advantages in the data combination scenario where L is combining inferences from their own private data x^L_{1:n_L} with synthetic data z_{1:m}. They can maintain the same model for both datasets, with the same model parameters, yet update robustly about z_{1:m} whilst still using the log-score for x^L_{1:n_L} (i.e. to condition their prior π̃(θ)).

3.3 The Learning Trajectory

The concept of closeness provided by D allows us to consider how L's approximation to F0 changes as more data is collected from the S-DGP. To do so we consider the 'learning trajectory', a function in m of the expected divergence to F0 of inference using m observations from G_{ε,δ}:

    T_ℓ(m; D, f0, g_{ε,δ}) = E_z[D(f0(·) ∥ p^ℓ(· | z_{1:m}))],

where p^ℓ(· | z_{1:m}) is the general Bayesian posterior predictive distribution (using ℓ) based on (synthetic) data z_{1:m}. This 'learning trajectory' traces the path taken by Bayesian inference from its prior predictive (m = 0) towards the synthetic data's DGP under increasing amounts of data (e.g. Figure 1; Section A.4).

We provide a proposition showing that, despite using more data and approaching the limit, θ^{KLD}_{G_{ε,δ}} is not necessarily the optimal target to learn about according to criterion D.

Proposition 1 (Suboptimality of the S-DGP) For S-DGP G_{ε,δ}, model f_θ(·), and divergence D, there exists prior π̃(θ), private DGP F0 and at least one value of 0 ≤ m < ∞ such that

    T_{ℓ_0}(m; D, f0, g_{ε,δ}) ≤ D(F0 ∥ f_{θ^{KLD}_{G_{ε,δ}}}),    (11)

where θ^{KLD}_{G_{ε,δ}} := arg min_{θ∈Θ} KLD(G_{ε,δ} ∥ f_θ) and T_{ℓ_0} is the learning trajectory of the Bayesian posterior predictive distribution (using ℓ_0) based on (synthetic) data z_{1:m}; see Eq. (7).

The proof of Proposition 1 involves a simple counterexample in which the prior is a better approximation to F0 according to divergence D than G_{ε,δ}. While trivial, this could reasonably occur if L has strong, well-calibrated expert belief judgements, or if they have a considerable amount of their own data before beginning to incorporate synthetic data. Furthermore, we argue next that by considering this learning trajectory path between the prior and the S-DGP G_{ε,δ}, in terms of the number of synthetic observations m, it is possible to get even 'closer' to F0 in terms of a chosen D.

Changing the divergence used for inference as suggested in Section 3.2 changes these trajectories by changing their limiting parameter. However, Proposition 1, which considered learning minimising the KLD, can equally be shown for learning minimising the βD (see A.2.1). In the following sections we talk generally about optimising such a trajectory for a given learning method, before focusing on comparisons in these methods and the resulting trajectories in Section 4.

3.4 Optimising the Learning Trajectory

The insights from the previous section raise two questions for a learner L using synthetic data:

1. Is synthetic data able to improve L's inferences about F0 according to divergence D? If so,
2. What is the optimal quantity to use for getting the learner closest to F0 according to D?

Both questions can be solved by the estimation of

    m* := arg min_{0≤m≤M} T_ℓ(m; D, f0, g_{ε,δ}).    (12)

However, clearly the learner never has access to the data generating density. Instead we take advantage of the representation of proper scoring rules and propose

0 using a ‘test set’ x1:N ∼ F0 to estimate study. Here we recommend that alongside releasing synthetic data, K optimises the learning trajectory B N 1 1 X X 0 ` (b) themselves, under some default model, loss and prior mˆ : = arg min s(xj, p (·|z1:m)) (13) 0≤m≤M N B setting by repeatedly partitioning x1:n into test and b=1 j=1 training sets. For example, when releasing classification (b) with {z1:m}b1:B ∼ Gε,δ. data, K could release an mˆ associated with logistic regression and BART, for the log-score, under some F As such we use a small amount of data from 0 to guide vaguely informative priors, providing learners an idea F the synthetic data inference towards 0. We consider of what to expect in terms of utility from the synthetic two procedures for doing so: Firstly, by tailoring to data. Whilst this is less tailored to any specific inference L ’s specific inference problem, and when this is not problem, it still allows K to communicate a broad K possible, putting the onus on to evaluate the general measure of the quality of its released data for learning ability of their S-DGP to capture F0. Before so doing about F0, and is advisable given our results. the following remark briefly considers optimising the learning trajectory for a specific stream of data rather 3.5 Posthoc improvement through averaging than average over the S-DGP. Once mˆ has been estimated, the question then remains Remark 1 We note that although previously the learn- of how to conduct inference given mˆ . In particular, ing trajectory was defined to average across samples if more synthetic data is available (e.g. mˆ << M or from the S-DGP, there may be cases where it is prefer- sampling z1:m, m → ∞ from a GAN), we can average able to define the learning trajectory based on a concrete the posterior predictive distribution across different stream of data z . In such a scenario we may con- 1:m realisations, using mˆ each time, ensuring we do not sider data-dependent waste any synthetic data. Jensen’s inequality allows us 0∗ `  to prove that performance of the predictive distribution m (z1:M ) : = arg min D f0(·) k p (· | z1:m) . (14) 0≤m≤M will not diminish if we consider convex proper scoring rules such as the logarithmic score: This can be seen as learning the optimal point to stop learning for a specific stream of data rather than the Proposition 2 (Predictive Averaging) Given di- optimal size for the average synthetic dataset. Such a vergence D with convex scoring rule, averaging over setting introduces an undesirable dependence between different realisations of the posterior predictive depend- 0∗ m (z1:M ) and the ordering of the data, but this can be ing on different synthetic data sets cannot deteriorate somewhat mitigated by averaging different realisations inference: of the synthetic data (see Prop. 2 and A.2.3/A.5.2 for B ! Jensen’s ineq. more details). For the rest of this paper, however, we 1 X  (b)  ∗ EzD F0 k p˜ x|z1:m ≤ focus on the more generally defined m from Eq. (12). B b=1 B 1 X   (b)    (b)  3.4.1 Optimising for L’s inference D F k p˜ x|z = D F k p˜ x|z . Ez B 0 1:m Ez 0 1:m In the first instance we consider the learner L has b=1 0 access to an independent test set x ∼ F0 allowing 1:N A more detailed proof is provided in A.2.2. The sig- them to calculate the mˆ associated with their specific nificance of this is that more synthetic data cannot learning trajectory. 
Two potential sources are for x0 1:N hinder the predictive distribution, but only if we do are 1) for L to sacrifice some of their own data xL 1:nL not naïvely use all of it to learn at once. when constructing their prior, or 2) require that K hold 0 x1:N out when it trains the S-DGP, which can then be queried by L in order to estimate mˆ . Clearly K is not 4 Experimental Setup and Results able to share the observations with L as this would violate the DP guarantee. Instead a secure protocol In order to investigate the concepts and methodologies for two-way communication between L’s model and outlined above, we consider two experiment types: K’s test set must be established; promising directions include (Cormode et al., 2019; de Montjoye et al., 2018) 1. Learning the location and of a univariate and a practical use case (UK HDR Alliance, 2020). Gaussian distribution. 2. Using Bayesian logistic regression for binary classifi- 3.4.2 A Broader Study cation on a selection of real-world datasets.
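A minimal sketch of the procedures in Eq. (13) and Proposition 2, under our own simplifying assumptions (the log score as s, and placeholder callables `sdgp_sample` and `fit_predictive` standing in for K's S-DGP and for the learner's general Bayesian update, with `fit_predictive` returning an object exposing `logpdf`):

```python
import numpy as np

def estimate_m_hat(sdgp_sample, fit_predictive, x_test, M, B, rng):
    """Eq. (13): pick the synthetic sample size minimising the average
    held-out log score over B realisations of the S-DGP."""
    scores = np.zeros(M + 1)
    for b in range(B):
        z = sdgp_sample(M, rng)                      # one stream z_1:M ~ G_{eps,delta}
        for m in range(M + 1):
            predictive = fit_predictive(z[:m])       # p^l(. | z_1:m); m = 0 is the prior
            scores[m] += -np.mean(predictive.logpdf(x_test))  # log score s(x', p)
    return int(np.argmin(scores / B))

def averaged_predictive_logpdf(sdgp_sample, fit_predictive, m_hat, B, x, rng):
    """Prop. 2: mixture of B posterior predictives, each fit on a fresh
    synthetic dataset of size m_hat, so no synthetic data is wasted."""
    logps = np.stack([fit_predictive(sdgp_sample(m_hat, rng)).logpdf(x)
                      for _ in range(B)])
    return np.logaddexp.reduce(logps, axis=0) - np.log(B)
```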

4 Experimental Setup and Results

In order to investigate the concepts and methodologies outlined above, we consider two experiment types:

1. Learning the location and scale of a univariate Gaussian distribution.
2. Using Bayesian logistic regression for binary classification on a selection of real-world datasets.

In these contexts we investigate the learning trajectory of classical Bayesian updating alongside the robust adjustments discussed in Section 3.2.

In order to draw comparisons between these methods, we study the trajectories' dependence on values spanning a grid of data quantities n_L and m, robustness parameters w and β, prior values, and the parameters of chosen DP mechanisms (see A.6.3 for full experimental specifications). The varying amounts of non-private data available to L were used to construct increasingly informative priors π̃(θ) = π(θ | x^L_{1:n_L}) through the use of standard Bayesian updating, as robustness is not required when learning using data drawn from F0. Learning trajectories are then estimated utilising an unseen dataset x′_{1:N} (mimicking either K's data or some subset of L's data not used in training).

To this end we use optimised MCMC sampling schemes (e.g. Hoffman and Gelman, 2014) to sample from all of the considered posteriors in each experiment's case, drawing comparisons across the grid laid out above and repeating experiments to mitigate any sources of noise. This results in an extensive computational task, made feasible through a mix of Julia's Turing PPL (Ge et al., 2018), MLJ (Blaom et al., 2020) and Stan (Carpenter et al., 2017).

The majority of the experiments are carried out with ε = 6, which is seen to be a realistic value with respect to practical applications (Lee and Clifton, 2011; Erlingsson et al., 2014; Tang et al., 2017; Differential Privacy Team at Apple, 2017) and upon observation of the relationship between privacy and misspecification shown in Figure A.4. We evaluate our experiments according to the divergences and scoring rules discussed in detail in A.6.2, presented across a range of the figures in the main text and supplementary material.

4.1 Simulated Gaussian Models

We first introduce a simple but illustrative simulated example in which we infer the parameters of a Gaussian model f_θ = N(µ, σ²) where θ = (µ, σ²). We place conjugate priors on θ with σ² ∼ InverseGamma(α_p, β_p) and µ ∼ N(µ_p, σ_p × σ) respectively. We consider x_{1:n} drawn from DGP F0 = N(0, 1²) and adopt the Laplace mechanism (Dwork et al., 2014) to define our S-DGP. This mechanism works by perturbing samples drawn from the DGP with noise drawn from the Laplace distribution of scale λ, calibrated via the sensitivity S of the DGP in order to provide (ε, 0)-DP per the Laplace mechanism's definition with ε = S/λ. To achieve finiteness of S in this case, we adjust our model to be that of a truncated Gaussian, restricting its range to ±3σ to allow for meaningful ε's to be calculated under the Laplace mechanism (see A.3 for a proof of this result). We then compare and evaluate the empirical performance of the competing methods defined below (see A.6.1.1 for explicit formulations):

1. The standard likelihood adjusted with an additional reweighting parameter w as in Eq. (8).
2. The posterior under the βD loss as in Eq. (9).
3. The 'Noise-Aware' likelihood, where the S-DGP can be tractably modelled using the Normal-Laplace convolution (Reed, 2006; Amini and Rabbani, 2017).

4.1.1 Results and Discussion

We observe that three different categories of learning trajectory occur across the models; these are illustrated in the 'branching' plots in Figure 1 (explained and analysed further in A.6.4 and A.6.5):

1. The prior π̃ is sufficiently inaccurate or uninformative (in this case due to low n_L) such that the synthetic data continues to be useful across the range of m we consider. As a result the learning trajectory is a monotonically decreasing curve in the criteria of interest.
2. A turning point is observed; synthetic data initially brings us closer to F0 before the introduction of further synthetic observations moves the inference away. We see that in the majority of cases these trajectories lie under the limiting KLD and βD approximations to G_{ε,δ}, demonstrating the efficacy of 'optimising the learning trajectory' through the existence of these optimal turning points.
3. The final scenario occurs under a sufficiently informative prior π̃ (here due to a large n_L) such that the synthetic data is not observably of any use at all; it can be seen to immediately cause model performance to deteriorate.

We can further quantify the turning points, which are perhaps the most interesting characteristic of these experiments. To do this we formulate bootstrapped averages of the number of 'effective real samples' that correspond to the estimated optimal quantity of synthetic data. This is done by comparing these minima with the black curves in the 'branching' plots representing the learning trajectory under an increasing n_L and m = 0 (see A.6.4 for more details). These calculations are shown in Figure 2 (and discussed further in A.6.5).

In general, we observe a significant increase in performance from the βD (see Figures 1, 3), indicated by its proximity to even the noise-aware model at lower values of n_L, alongside more modest improvements from reweighting methods. The βD achieves more desirable minimum-trajectory log score, KLD and Wasserstein values compared to other model types, exhibiting greater robustness to larger amounts of synthetic data where other approaches lose out significantly.
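To make the S-DGP of Section 4.1 concrete, a minimal sketch of the data generation step is given below (our own Python illustration rather than the paper's Julia/Stan code); it shows the truncation that yields a finite sensitivity and the variance inflation that makes the naive model misspecified by design.

```python
import numpy as np

def gaussian_laplace_sdgp(n, mu0, sigma0, epsilon, rng):
    """Draw n private samples from a +/- 3 sigma truncated Gaussian and release
    them under the (epsilon, 0)-DP Laplace mechanism (one output per input)."""
    x = rng.normal(mu0, sigma0, size=4 * n)          # oversample, then truncate
    x = x[np.abs(x - mu0) <= 3 * sigma0][:n]
    S = 6 * sigma0                                   # sensitivity = width of support
    lam = S / epsilon                                # epsilon = S / lambda
    z = x + rng.laplace(0.0, lam, size=n)
    return x, z

rng = np.random.default_rng(3)
x, z = gaussian_laplace_sdgp(10_000, mu0=0.0, sigma0=1.0, epsilon=6.0, rng=rng)

# The synthetic data is over-dispersed relative to F0: Var[z] ~ Var[x] + 2 * lam^2,
# so a naive Gaussian fit to z targets theta^KLD_G rather than theta^KLD_F0.
print(x.var(), z.var(), 2 * (6 * 1.0 / 6.0) ** 2)
```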

[Figure 1: two panels, w = 0.5 (left) and β = 0.5 (right). x-axis: Total Number of Samples (Real + Synth); y-axis: Kullback-Leibler Distance; one curve per number of real samples (2–100).]

Figure 1: Shows how the KLD to F0 changes as we add more synthetic data, starting with increasing amounts of real data. This demonstrates the effectiveness of the optimal discovered βD configuration compared to the closest alternative traditional model in terms of performance, with down-weighting w = 0.5; the black dashed line illustrates KLD(F0 ∥ f_{θ*}) for θ* = θ^{KLD}_{G_{ε,δ}} (left) and θ* = θ^{βD}_{G_{ε,δ}} (right), representing the approximation to F0 given an infinite sample from G_{ε,δ} under the two learning methods, and exhibiting the superiority of the βD.

[Figure 2: two panels, AUROC (left) and Log Score (right). x-axis: Number of Real Samples; y-axis: Maximum Effective Real Sample Gain Through the Use of Synthetic Data; one curve per model type (w = 0.5, w = 1 (Naive), β = 0.25, β = 0.5, β = 0.75).]

Figure 2: Shows the effective number of real samples gained through the optimal m̂ synthetic observations alongside varying amounts of real data usage, with respect to the AUROC and log-score performance criteria. These are calculated and presented here via bootstrapped averaging under a logistic regression model learnt on the Framingham dataset. The number of effective real samples gained is significantly affected by the learning task's criteria.

[Figure 3: (a) Simulated Gaussian, real n = 10 (y-axis: Wasserstein Distance); (b) UCI Heart Dataset, real n = 30 (10%) (y-axis: AUROC). x-axis in both panels: Number of Synthetic Samples; one curve per model configuration (Noise-Aware, Resampled, w = 0 (No Syn), w = 0.5, w = 1 (Naive), β = 0.25–0.8).]

Figure 3: Given a fixed amount of real data, we can compare model performances directly by focusing on one of the 'branches' in the class of diagrams shown in Figures 1 & 4, to see that the βD's performance falls between that of the noise-aware model and the other models, exhibiting robust and desirable behaviour across a range of β. Naïve and reweighting-based approaches fail to gain significantly over not using synthetic data (shown by w = 0's flat trajectory); the resampled model in the logistic case can also be seen to perform very poorly in comparison to models that leverage the synthetic data.

4.2 Logistic Regression

We now move on to a more prevalent and practical class of models that also exhibit the potentially dangerous behaviours of synthetic data in real-world contexts, via datasets concerning subjects that have legitimate potential privacy concerns. Namely, we build logistic regression models for the UCI Heart Disease dataset (Dua and Graff, 2017) and the Framingham Cohort dataset (Splansky et al., 2007). Clearly, we are now only able to access the empirical distribution F_{n*}, where n* is the total amount of data present in each dataset. We use x_{1:n_T} to train an instance of the aforementioned PATE-GAN G_{ε,δ} and keep back x_{1:(n*−n_T)} for evaluation; we then draw synthetic data samples z_{1:m} ∼ G_{ε,δ}. As before, we investigate how the learning trajectories are affected across the experimental parameter grid.

Again, we consider learning using ℓ_w and ℓ^{(β)} applied to the logistic regression likelihood f_θ (see A.6.1.2 for exact formulations). In this case we cannot formulate a 'Noise-Aware' model due to the black-box nature of the GAN, highlighting the reality of the model misspecification setting we find ourselves in aside from simple or simulated examples. We can instead define a 'resampled' model that recycles the real data used in formulating the prior.

4.2.1 Results and Discussion

Here the learning trajectories are defined with respect to the AUROC as well as the log score; whilst not technically a divergence, this gives us a decision theoretic criterion to quantify the closeness of our inference to F0.

Referring to Figures 2, 3 and 4, we see that the learning trajectories observed in this more realistic example mirror those observed in our simulated Gaussian experiments. There are however some cases in which the reweighted posterior outperforms the βD, and we see large discrepancies in m̂ when comparing log score to AUROC values, reinforcing the importance of carefully defining the learning task to prioritise.

Additionally, experiments using synthetic data from a GAN offer the unique observation that performance can actually improve as ε decreases. We believe this is due to potential collapse in the GAN learning process on imbalanced datasets, and concentrations of realisations of G_{ε,δ} as the injected noise increases, such that a small number of synthetic samples can actually be more representative of F0 than even the real data. This effect is short-lived as more synthetic observations are used in learning, as presumably these samples then become over-represented through the posterior distribution and performance begins to deteriorate; see Figure 4 for performance comparisons across ε.¹

Figure 4: This plot illustrates an interesting and important observation made when varying ε for a GAN based model: we observe that there is a privacy 'sweet-spot' around ε = 1 whereby more private data performs better than less private data (see the curves for ε = 100, which essentially represent non-private data).

¹This plot exhibits the effect under the βD model on the Framingham dataset with β = 0.5, but is observable across all model types in both AUROC and log score.
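For the logistic regression experiments, the βD loss of Eq. (9) takes a closed form because the outcome space is binary, so the integral over y becomes a two-term sum. A minimal sketch of ours follows (the exact formulations used in the experiments are those of A.6.1.2; the function and parameter names here are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def beta_loss_logistic(y, x, theta, beta):
    """Eq. (9) for a Bernoulli likelihood: the integral over the outcome space
    is the sum of p(y' | x, theta)^(beta + 1) over y' in {0, 1}."""
    p1 = sigmoid(x @ theta)                        # P(y = 1 | x, theta)
    integral = p1 ** (beta + 1) + (1 - p1) ** (beta + 1)
    likelihood = np.where(y == 1, p1, 1 - p1)
    return integral / (beta + 1) - likelihood ** beta / beta

def neg_log_general_posterior(theta, z_x, z_y, beta, prior_prec=1.0):
    """Unnormalised negative log posterior of Eq. (6) with the betaD loss and a
    N(0, prior_prec^-1 I) prior; minimise directly or pass to an MCMC sampler."""
    loss = beta_loss_logistic(z_y, z_x, theta, beta).sum()
    return loss + 0.5 * prior_prec * theta @ theta
```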
5 Conclusions

We consider foundations of Bayesian learning from synthetic data that acknowledge the intrinsic model misspecification and the learning task at hand. Contrary to traditional statistical inferences, conditioning on increasing amounts of synthetic data is not guaranteed to help you learn about the true data generating process or make better decisions. Down-weighting the information in synthetic data (either using a weight w or the βD divergence) provides a principled approach to robust optimal information processing and warrants further investigation. Further work could consider augmenting these general robust techniques with tailored adjustments to inference based on the specific synthetic data, acknowledging the discrepancies between its generating density and the Learner's true target.

Acknowledgements

HW is supported by the Feuer International Scholarship in Artificial Intelligence. JJ was funded by the Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017 and Government of Spain's Plan Nacional PGC2018-101643-B-I00 grants whilst working on this project. SJV is supported by The Alan Turing Institute (EPSRC grant EP/N510129/) and the University of Warwick IAA funding. CH is supported by The Alan Turing Institute, Health Data Research UK, the Medical Research Council UK, the Engineering and Physical Sciences Research Council (EPSRC) through the Bayes4Health programme Grant EP/R018561/1, and AI for Science and Government UK Research and Innovation (UKRI).

References

Charu C Aggarwal and Philip S Yu. A condensation approach to privacy preserving data mining. In Advances in Database Technology - EDBT 2004, pages 183–199. Springer Berlin Heidelberg, 2004.

Zahra Amini and Hossein Rabbani. Letter to the editor: Correction to "the normal-laplace distribution and its relatives". Communications in Statistics - Theory and Methods, 46(4):2076–2078, 2017.

Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pages 12964–12976, 2019.

Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.

James O Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, March 2013.

James O Berger, Elías Moreno, Luis Raul Pericchi, M Jesús Bayarri, José M Bernardo, Juan A Cano, Julián De la Horra, Jacinto Martín, David Ríos-Insúa, Bruno Betrò, et al. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994.

Robert H Berk et al. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.

José M Bernardo and Adrian FM Smith. Bayesian theory, 2001.

Garrett Bernstein and Daniel R Sheldon. Differentially private Bayesian inference for exponential families. In Advances in Neural Information Processing Systems, pages 2919–2929, 2018.

Garrett Bernstein and Daniel R Sheldon. Differentially private Bayesian linear regression. In Advances in Neural Information Processing Systems, pages 523–533, 2019.

PG Bissiri, CC Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.

Anthony D Blaom, Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J Vollmer. MLJ: A Julia package for composable machine learning. July 2020.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134–170, 2011.

Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy, 2019.

A Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, 2007.

Yves-Alexandre de Montjoye, Sébastien Gambs, Vincent Blondel, Geoffrey Canright, Nicolas de Cordes, Sébastien Deletaille, Kenth Engø-Monsen, Manuel Garcia-Herranz, Jake Kendall, Cameron Kerry, Gautier Krings, Emmanuel Letouzé, Miguel Luengo-Oroz, Nuria Oliver, Luc Rocher, Alex Rutherford, Zbigniew Smoreda, Jessica Steele, Erik Wetter, Alex Sandy Pentland, and Linus Bengtsson. On the privacy-conscientious use of mobile phone data. Sci Data, 5:180286, December 2018.

Differential Privacy Team at Apple. Learning with privacy at scale. 2017.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.

Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: a language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1682–1690, 2018. URL http://proceedings.mlr.press/v84/ge18b.html.

Abhik Ghosh and Ayanendranath Basu. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413–437, 2016.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Peter Grünwald, Thijs Van Ommen, et al. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.

Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.

CC Holmes and SG Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.

Peter J Huber and EM Ronchetti. Robust statistics, series in probability and mathematical statistics, 1981.

Jack Jewson, Jim Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.

James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Doubly robust Bayesian inference for non-stationary streaming data using β-divergences. In Advances in Neural Information Processing Systems (NeurIPS), pages 64–75, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.

Jaewoo Lee and Chris Clifton. How much is enough? Choosing ε for differential privacy. In International Conference on Information Security, pages 325–340. Springer, 2011.

SP Lyddon, CC Holmes, and SG Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 2018.

Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, pages 1–13, 2018.

William J Reed. The normal-Laplace distribution and its relatives. In Advances in Distribution Theory, Order Statistics, and Inference, pages 61–74. Springer, 2006.

Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1):1–9, 2019.

Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, and Joshua Allen. Differentially private synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537, 2020.

David Rossell and Francisco J Rubio. Tractable Bayesian variable selection: beyond normality. Journal of the American Statistical Association, 113(524):1742–1758, 2018.

Greta Lee Splansky, Diane Corey, Qiong Yang, Larry D Atwood, L Adrienne Cupples, Emelia J Benjamin, Ralph B D'Agostino Sr, Caroline S Fox, Martin G Larson, Joanne M Murabito, et al. The third generation cohort of the National Heart, Lung, and Blood Institute's Framingham Heart Study: design, recruitment, and initial examination. American Journal of Epidemiology, 165(11):1328–1335, 2007.

Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in Apple's implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017.

The Royal Society. Privacy Enhancing Technologies: protecting privacy in practice. 2019.

UK HDR Alliance. Trusted research environments (TRE). https://ukhealthdata.org/wp-content/uploads/2020/07/200723-Alliance-Board_Paper-E_TRE-Green-Paper.pdf, 2020. Accessed: 2020-10-15.

Stephen G Walker. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10):1621–1633, 2013.

James Watson and Chris Holmes. Approximate models and robust decisions. Statistical Science, 31(4):465–489, 2016.

Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.

Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Trans. Database Syst., 42(4):1–41, October 2017.