
Foundations of Bayesian Learning from Synthetic Data

Harrison Wilde (Department of Statistics, University of Warwick) · Jack Jewson (Barcelona GSE, Universitat Pompeu Fabra) · Sebastian Vollmer (Department of Statistics and Mathematics Institute, University of Warwick; The Alan Turing Institute) · Chris Holmes (Department of Statistics, University of Oxford; The Alan Turing Institute)

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data, and fewer still for situations where a researcher wishes to augment real data with another party's synthesised data. We use a Bayesian paradigm to characterise the updating of model parameters when learning in these settings, demonstrating that caution should be taken when applying conventional learning algorithms without appropriate consideration of the synthetic data generating process and learning task at hand. Recent results from general Bayesian updating support a novel and robust approach to Bayesian synthetic-learning founded on decision theory that outperforms standard approaches across repeated experiments on supervised learning and inference problems.

1 Introduction

Privacy enhancing technologies comprise an area of rapid growth (The Royal Society, 2019). An important aspect of this field concerns publishing privatised versions of datasets for learning; it is known that simply anonymising the data is not sufficient to guarantee individual privacy (e.g. Rocher et al., 2019). We instead adopt the Differential Privacy (DP) framework (Dwork et al., 2006) to define working bounds on the probability that an adversary may identify whether a particular observation is present in a dataset, given that they have access to all other observations in the dataset. DP's formulation is context-dependent across the literature; here we amalgamate definitions regarding adjacent datasets from Dwork et al. (2014); Dwork and Lei (2009):

Definition 1 ((ε, δ)-differential privacy) A randomised function or mechanism K is said to be (ε, δ)-differentially private if for all pairs of adjacent, equally-sized datasets D and D′ that differ in one observation, and all S ⊆ Range(K),

    Pr[K(D) ∈ S] ≤ e^ε × Pr[K(D′) ∈ S] + δ.    (1)
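Definition 1 is most easily grounded by its canonical instance, the Laplace mechanism of Dwork et al. (2006), which releases a numeric query by adding noise scaled to the query's sensitivity. The sketch below is a minimal illustration of ours (in Python; the function name and parameters are our own, not taken from the paper), releasing a bounded mean query under (ε, 0)-DP.

```python
import numpy as np

def laplace_mechanism(query_value, sensitivity, epsilon, rng):
    """Release query_value with (epsilon, 0)-DP via the Laplace mechanism.

    Noise is drawn from Laplace(0, lambda) with scale lambda = sensitivity / epsilon,
    where the sensitivity bounds how much the query can change when one
    observation is replaced (Definition 1's adjacent datasets).
    """
    scale = sensitivity / epsilon
    return query_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100).clip(-3, 3)  # bounded data => finite sensitivity

# Sensitivity of the mean of n observations each lying in [-3, 3] is 6 / n.
private_mean = laplace_mechanism(x.mean(), sensitivity=6 / len(x), epsilon=1.0, rng=rng)
print(private_mean)
```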

This characterisation results in generative models that are 'misspecified by design', owing to the constraints imposed upon their design by requiring the fulfilment of a DP guarantee. This inevitably leads to discrepancy between the learner's final model and the one that they would otherwise have formulated if not for this DP restriction. In real-world, finite data contexts where synthesis methods are often 'black-box' in nature, it is difficult for a learner to fully capture and understand the inherent differences in the underlying distributions of the real and synthetic data that they have access to.

There are two key insights that we explore in this paper following the characterisation above: Firstly, when left unchecked, the machine learns model parameters minimising the Kullback-Leibler divergence (KLD) to the synthetic data generating process (S-DGP) (Berk et al., 1966; Walker, 2013) rather than the true data generating process (DGP); Secondly, robust inference methods offer improved performance by acknowledging this misspecification, where in some cases synthetic data can otherwise significantly hinder learning rather than helping it.

In order to investigate these behaviours, we experiment with models based on a mix of simulated-private and real-world data to offer empirical insights on the learning procedure when a varying amount of real data is available, and explore the optimalities in the amounts of synthetic data with which to augment this real data.

The contributions of our work are summarised below:

1. Learning from synthetic data can lead to unpredictable and negative outcomes, due to varying levels of model misspecification introduced by its generation and associated privacy constraints.
2. Robust Bayesian inference offers improvements over classical Bayes when learning from synthetic data.
3. Real and synthetic data can be used in tandem to achieve practical effectiveness through the discovery of desirable stopping points for learning, and optimal model configurations.
4. Consideration of the preferred properties of the inference procedure is critical; the specific task at hand can determine how best to use synthetic data.

We adopt a Bayesian standpoint throughout this paper, but note that many of the results also hold in the frequentist setting.

2 Problem Formulation

We outline the inference problem as follows:

• Let x_{1:n} denote a training set of n exchangeable observations from Nature's true DGP, F0(x), with density f0(x) with respect to the Lebesgue measure, such that x_{1:n} ∼ F0(x); we suppose x_i ∈ R^d. These observations are held privately by a data keeper K.
• K uses data x_{1:n} to produce an (ε, δ)-differentially private synthetic data generating mechanism (S-DGP). With a slight abuse of notation we use G_{ε,δ}(x_{1:n}) to denote the S-DGP, noting that G_{ε,δ} could be a fully generative model, or a private release mechanism that acts directly on the finite data x_{1:n} (see the discussion on the details of the S-DGP below). We denote the density of this S-DGP as g_{ε,δ}.
• Let f_θ(x) denote a learner L's model likelihood for F0(x), parameterised by θ with prior π̃(θ), and marginal (predictive) likelihood p(x) = ∫ f_θ(x) π̃(θ) dθ.
• L's prior may already encompass some other set of real data drawn from F0, leading to π̃(θ) = π(θ | x^L_{1:n_L}), for n_L ≥ 0 prior observations.

We adopt a decision theoretic framework (Berger, 2013), in assuming that L wishes to take some optimal action â in a prediction or inference task, satisfying:

    â = arg max_{a∈A} ∫ U(x, a) dF0(x).    (2)

This is with respect to a user-specified utility function U(x, a) that evaluates actions in the action space A, and makes precise L's desire to learn about F0 in order to accurately identify â.

Details of the synthetic data generation mechanism. In defining G_{ε,δ}, we believe it is important to differentiate between its two possible forms:

1. G_{ε,δ}(x_{1:n}) = G_{ε,δ}(z | x_{1:n}): G is a privacy-preserving generative model fit on the real data, such as the PATE-GAN (Jordon et al., 2018), DP-GAN (Xie et al., 2018) or PrivBayes (Zhang et al., 2017). Privatised synthetic data is produced by injecting potentially heavy-tailed noise into gradient-based learning and/or through partitioned training leveraging marginal distributions, aggregations and subsets of the data. The S-DGP provides conditional independence between z_{1:m} and x_{1:n} and therefore no longer queries the real data after training.
2. G_{ε,δ}(dz) = ∫ K_{ε,δ}(x, dz) F0(dx): A special case of this integral comprises the convolution of F0 with some noise distribution H, such that G_{ε,δ} = F0 ⋆ H_{ε,δ}. The distribution is therefore not a function of the private data x_{1:n}. In this case, the number of samples that we can draw is limited to m ≤ n, as drawing one data item requires using one sample of K's data. Examples of this formulation include the Laplace mechanism (Dwork et al., 2014) and transformation-based privatisation (Aggarwal and Yu, 2004).
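The distinction between the two forms matters operationally. A minimal sketch of ours (Python, with illustrative names; training a real DP generative model such as a PATE-GAN is beyond a short example) contrasts them:

```python
import numpy as np

rng = np.random.default_rng(1)
x_private = rng.normal(0.0, 1.0, size=500).clip(-3, 3)  # K's bounded private data

# Form 2: G = F0 * H -- each private record is perturbed exactly once, so the
# number of synthetic observations is capped at m <= n.
def convolution_sdgp(x, epsilon, sensitivity, rng):
    lam = sensitivity / epsilon        # Laplace scale giving (epsilon, 0)-DP
    return x + rng.laplace(0.0, lam, size=x.shape)

z = convolution_sdgp(x_private, epsilon=6.0, sensitivity=6.0, rng=rng)

# Form 1: a DP generative model is trained once on x_{1:n} under the privacy
# budget; afterwards it can be sampled indefinitely without touching x_{1:n}.
class GenerativeSDGP:
    def __init__(self, trained_sampler):
        self.trained_sampler = trained_sampler   # e.g. a fitted generator network
    def sample(self, m, rng):
        return self.trained_sampler(m, rng)      # z_{1:m} independent of x_{1:n} given training
```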

The fundamental problem of synthetic learning. L wants to learn about F0 but only has access to their prior π̃(θ) and to z_{1:m} ∼ G_{ε,δ}, where G_{ε,δ} ≢ F0. That is, the S-DGP G_{ε,δ}(·) is 'misspecified by design'. This claim is supported by a number of observations:

• L specifies a model p(x) using beliefs about the target F0 to be built using real data x_{1:n}; they are then constrained by a subsequently imposed requirement of guaranteeing DP, which instead requires consideration of the resulting S-DGP G_{ε,δ}. This leads to an inevitable change in their beliefs such that the resulting model would be misspecified relative to the original 'best' model for the true DGP F0.
• Correctly modelling a privacy preserving mechanism as part of an S-DGP, such as a 'black-box' GAN or complex noise convolution, is often intractable.
• There is inherent finiteness to real-world sensitive data contexts that makes it difficult for a generative model to capture the true DGP of even the non-private data it seeks to emulate. Considering sufficiently large quantities of data, and sufficiently flexible models, would in theory allow for the true DGP to be learnt, but relying on asymptotic behaviour in this setting is at odds with the definition of DP, given that the identifiability of individuals would also naturally diminish as data availability increases. Moreover, for non-trivial high-dimensional models, the amount of data required to properly capture the DGP often becomes infeasible, regardless of the magnitude of DP constraints.

Therefore, the posterior predictive converges to a different distribution under real and synthetic data generating processes, such that p(x | z_{1:m→∞}) ≢ p(x | x_{1:n→∞}).

Learning from synthetic data is an intricate example of learning under model misspecification, where the misspecification is by K's design. It is important, as shown below, that this is recognised in the learning of models. Fortunately we can adapt recent advances in Bayesian inference under model misspecification to help optimise learning with respect to L's task.

2.1 Bayesian Inference under model misspecification

Bayesian inference under model misspecification has recently been formalised (Walker, 2013; Bissiri et al., 2016) and represents a growing area of research; see Watson and Holmes (2016); Jewson et al. (2018); Miller and Dunson (2018); Lyddon et al. (2018); Grünwald et al. (2017); Knoblauch et al. (2019) to name but a few. Traditional Bayes rule updating in this context can be seen as an approach that learns about the parameters of the model that minimises the logarithmic score, or equivalently, the Kullback-Leibler divergence (KLD) of the model from the DGP of the data (Berk et al., 1966; Walker, 2013; Bissiri et al., 2016), where KLD(g_{ε,δ} ∥ f) = ∫ log(g_{ε,δ}/f) dG_{ε,δ}.

As a result, if L updates their model f_θ(x) using synthetic data z_{1:m} ∼ G_{ε,δ}(x_{1:n}), then as m → ∞ they will be learning about the limiting parameter that minimises the KLD to the S-DGP:

    θ^{KLD}_{G_{ε,δ}} = arg min_{θ∈Θ} KLD(g_{ε,δ}(·) ∥ f_θ(·)),    (3)

and under regularity conditions the posterior distribution concentrates around that point, π(θ | z_{1:m}) → 1_{θ^{KLD}_{G_{ε,δ}}} as m → ∞.

Furthermore, this posterior will concentrate away from the model that is closest to F0 in KLD, corresponding to the limiting model that would be learnt given an infinite real sample x_{1:∞} from F0:

    θ^{KLD}_{G_{ε,δ}} ≢ θ^{KLD}_{F0} = arg min_{θ∈Θ} KLD(f0(·) ∥ f_θ(·)).    (4)

This problem is exacerbated by the fact that S-DGPs can be prone to generating 'outliers' relative to the private data, as a result of the injection of additional noise into their training or generation to ensure (ε, δ)-DP, combined with the fact that θ^{KLD}_{G_{ε,δ}} is known to be non-robust to outliers (e.g. Basu et al., 1998). Therefore, given that as we collect more synthetic data our inference is no longer minimising the KLD towards F0, we must carefully consider and investigate whether our inference is still 'useful' for learning about F0 at all.

2.2 The approximation to F0

Before we proceed any further we must consider what it means for data from G_{ε,δ} to be 'useful' for learning about F0. We can do so using the concepts of scoring rules and statistical divergence.

Definition 2 (Proper Scoring Rule) The function s : X × P → R is a strictly proper scoring rule provided its difference function D satisfies

    D(f0 ∥ f) = E_{x∼f0}[s(x, f(·))] − E_{x∼f0}[s(x, f0(·))],
    D(f1 ∥ f2) ≥ 0,  D(f ∥ f) = 0,  for all f, f1, f2 ∈ P(x),

    P(x) := { f(x) : f(x) ≥ 0 ∀x ∈ X, ∫_X f(x) dx = 1 },

where the function D measures a distance between two probability distributions. D arises as the divergence associated with the scoring rule s(x, f) and is minimised when f0 = f (Gneiting and Raftery, 2007; Dawid, 2007).
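The following worked example is our own illustration, not drawn from the paper's supplement: it instantiates Definition 2 with the logarithmic score, and contrasts the limiting parameters of Eqs. (3) and (4) for a Gaussian model under the Laplace noise-convolution S-DGP used later in Section 4.1.

```latex
% The logarithmic score recovers the KLD as the difference function in Definition 2:
\[
  s(x, f) = -\log f(x)
  \;\Longrightarrow\;
  D(f_0 \,\|\, f)
  = \mathbb{E}_{x \sim f_0}\!\left[\log \frac{f_0(x)}{f(x)}\right]
  = \mathrm{KLD}(f_0 \,\|\, f).
\]
% Worked contrast of Eqs. (3) and (4): take F_0 = N(\mu_0, \sigma_0^2) and the
% noise-convolution S-DGP G_{\varepsilon,0} = F_0 \star \mathrm{Laplace}(0, \lambda).
% The KLD-minimising Gaussian matches the first two moments of its target, so
\[
  \theta^{\mathrm{KLD}}_{F_0} = (\mu_0, \sigma_0^2),
  \qquad
  \theta^{\mathrm{KLD}}_{G_{\varepsilon,0}} = (\mu_0, \sigma_0^2 + 2\lambda^2),
\]
% since Laplace(0, \lambda) noise adds variance 2\lambda^2. The naive posterior
% therefore concentrates on an over-dispersed model, and the gap widens as the
% privacy budget \varepsilon shrinks (\lambda = S / \varepsilon).
```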

A further advantage of this representation is that it allows for the minimisation of D(f0 ∥ ·) using only samples from F0:

    arg min_{f∈F} D(f0 ∥ f) = arg min_{f∈F} E_{x∼f0}[s(x, f(·))] ← (1/n) Σ_{i=1}^n s(x_i, f(·)) as n → ∞,  x_i ∼ F0.    (5)

Henceforth, we define any concepts of closeness (or 'usefulness') in terms of a chosen divergence D and associated scoring rule s, where the approximating density f is given by the predictive inferences resulting from synthetic z_{1:m} ∼ G_{ε,δ}. Given that inference using G_{ε,δ} is no longer able to exactly capture f0, L can use this notion of closeness to define what aspects of F0 they are most concerned with capturing. The importance of this specification is illustrated in Section 4.

3 Improved learning from the S-DGP

The classical assumptions underlying statistics are that minimising the KLD is the optimal way to learn about the DGP, and that more observations provide more information about this underlying DGP; such logic does not necessarily apply here. L wishes to learn about the private DGP, F0, but must rely on observations from the S-DGP G_{ε,δ} to do so. In this section we acknowledge this setting to propose a framework for improved learning from synthetic data. In so doing we pose the following question and detail our solutions in turn: Given the scoring criterion D, is θ^{KLD}_{G_{ε,δ}} the best the learner can do?

1. Can the robustness of the learning procedure be improved to better approximate F0 by acknowledging the misspecification and outlier prone nature of z_{1:m}?
2. Starting from the prior predictive, p(x), for a given learning method, when does learning using z ∼ G_{ε,δ} stop improving inference for F0(x)? That is, when does

    E_z[D(f0(·) ∥ p(· | z_{1:j+1}))] > E_z[D(f0(·) ∥ p(· | z_{1:j}))]?

3.1 General Bayesian Inference

In order to address these issues we adopt a general Bayesian, minimum divergence paradigm for inference (Bissiri et al., 2016; Jewson et al., 2018) inspired by model misspecification, where L can coherently update beliefs about their model parameter θ from prior π̃(θ) to posterior π(θ | z_{1:m}) using:

    π^ℓ(θ | z_{1:m}) = π̃(θ) exp(−Σ_{j=1}^m ℓ(z_j, f_θ)) / ∫ π̃(θ) exp(−Σ_{j=1}^m ℓ(z_j, f_θ)) dθ,    (6)

where ℓ(z, f_θ) is the loss function used by L for inference. Note that in this formulation, the logarithmic score ℓ_0(z, f_θ) = −log f_θ(z) recovers traditional Bayes rule updating. The predictive distribution associated with such a posterior and the model f_θ is:

    p^ℓ(x | z_{1:m}) = ∫ f_θ(x) π^ℓ(θ | z_{1:m}) dθ.    (7)

3.2 Robust Bayes and dealing with outliers

In the absence of the ability to correctly model the S-DGP, robust loss functions (see e.g. Berger et al., 1994) provide an alternative option to guard against artefacts of the generated synthetic data. We can gain increased robustness in our learning procedure to data z_{1:m} by changing the loss ℓ(z, f_θ) used for inference in Eq. (6). We consider two alternative loss functions to the standard logarithmic score underpinning standard Bayesian statistics:

    ℓ_w(z, f_θ) := −w log f_θ(z),    (8)
    ℓ^{(β)}(z, f_θ) := (1/(β+1)) ∫ f_θ(y)^{β+1} dy − (1/β) f_θ(z)^β.    (9)

Here, ℓ_w(z, f_θ) introduces a learning parameter w > 0 into the Bayesian update (e.g. Lyddon et al., 2018; Grünwald et al., 2017; Miller and Dunson, 2018; Holmes and Walker, 2017). Down-weighting with w < 1 will generally produce a less confident posterior than in the case of traditional Bayes' rule, with a greater dependence on the prior. Conversely, w > 1 will have the opposite effect. The value of w can have ramifications for inference and prediction (Rossell and Rubio, 2018; Grünwald et al., 2017). However, we note that as the sample size grows, using the weighted likelihood posterior will still learn about θ*_G if w is fixed. Choosing w = e/m instead can be seen to average the log-likelihood, with e providing a notion of effective sample size.

Alternatively, minimising ℓ^{(β)}(x, f(·)) in expectation over the DGP is equivalent to minimising the β-divergence (βD) (Basu et al., 1998). Therefore, analogously to the KLD and the log-score, using ℓ^{(β)}(x, f(·)) (Bissiri et al., 2016; Jewson et al., 2018; Ghosh and Basu, 2016) produces a Bayesian update targeting:

    θ^{βD}_{G_{ε,δ}} := arg min_{θ∈Θ} βD(g_{ε,δ}(·) ∥ f_θ).    (10)

As β → 0, βD → KLD, but as β increases it provides increased robustness through skepticism of new observations relative to the prior. We demonstrate the robustness properties of the βD in some simple scenarios in the Supplementary material (see A.1) and refer the reader to e.g. Knoblauch et al. (2019, 2018) for further examples. We note there are many possible divergences providing greater robustness properties than the KLD, e.g. the Wasserstein or Stein discrepancy (Barp et al., 2019), but for our exposition we focus on the βD for its convenience and simplicity.
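The general Bayesian update of Eq. (6) with the losses in Eqs. (8)-(9) can be computed directly for low-dimensional θ. The sketch below is our own minimal Python illustration, assuming a Gaussian location model with known scale and a grid approximation in place of the MCMC used in the paper; it contrasts the log-score, weighted and βD posteriors on synthetic data containing outliers.

```python
import numpy as np
from scipy.stats import norm

def general_bayes_posterior(z, loss, mu_grid, prior):
    """Grid approximation to Eq. (6): posterior ~ prior * exp(-sum_j loss(z_j, mu))."""
    neg_loss = -np.array([sum(loss(zj, mu) for zj in z) for mu in mu_grid])
    log_post = np.log(prior) + neg_loss
    post = np.exp(log_post - log_post.max())      # stabilise before normalising
    return post / np.trapz(post, mu_grid)

sigma = 1.0                                       # assumed known scale

def log_loss(z, mu):                              # Eq. (8) with w = 1: standard Bayes
    return -norm.logpdf(z, mu, sigma)

def weighted_loss(z, mu, w=0.5):                  # Eq. (8): down-weights every z equally
    return -w * norm.logpdf(z, mu, sigma)

def beta_loss(z, mu, beta=0.5):                   # Eq. (9); for a Gaussian the integral
    # int N(y; mu, sigma^2)^(beta+1) dy = (2 pi sigma^2)^(-beta/2) / sqrt(beta+1)
    integral = (2 * np.pi * sigma**2) ** (-beta / 2) / np.sqrt(beta + 1)
    return integral / (beta + 1) - norm.pdf(z, mu, sigma) ** beta / beta

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(0, 1, 50), [8.0, 9.0]])   # synthetic data with 'outliers'
mu_grid = np.linspace(-3, 3, 601)
prior = norm.pdf(mu_grid, 0, 2)                   # pi_tilde(mu) = N(0, 2^2)

for loss in (log_loss, weighted_loss, beta_loss):
    post = general_bayes_posterior(z, loss, mu_grid, prior)
    print(loss.__name__, "posterior mean:", np.trapz(mu_grid * post, mu_grid))
```

Running this, the βD posterior mean is pulled far less towards the injected outliers than the log-score posterior, while the weighted loss merely widens the posterior without changing its target.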

A key difference between the two robust loss functions considered above is that while ℓ_w(z, f_θ) down-weights the log-likelihood of each observation equally, ℓ^{(β)}(x, f(·)) does so adaptively, based on how likely the new observation is under the current inference (Cichocki et al., 2011). It is this adaptive down-weighting that allows the βD to target a different limiting parameter to θ^{KLD}_{G_{ε,δ}}. This, in particular, allows the βD to be robust to outliers and/or heavy tailed contaminations. As a result, we believe that

    D(F0 ∥ f_{θ^{βD}_{G_{ε,δ}}}) < D(F0 ∥ f_{θ^{KLD}_{G_{ε,δ}}})

across a wide range of S-DGPs, i.e. the βD minimising approximation to G_{ε,δ} is a better approximation of F0 than the KLD minimising approximation. Proving that this is the case in general proves complicated and is hampered by the intractability both of the βD minimisation and of many popular S-DGPs. However, we further justify this claim in A.1.2, where we show that for the prevalent Laplace DP mechanism this holds uniformly over the privacy level parameterised by the scaling parameter λ of the Laplace distribution for D = KLD.

A strength of the βD is that, unlike standard robust methods using heavier tailed models or losses (Berger et al., 1994; Huber and Ronchetti, 1981; Beaton and Tukey, 1974), ℓ^{(β)}(x, f(·)) does not change the model used for inference. In the absence of any specific knowledge about the S-DGP, updating using the βD maintains the model L would have used to estimate F0, but updates its parameters robustly. This also has advantages in the data combination scenario where L is combining inferences from their own private data x^L_{1:n_L} with synthetic data z_{1:m}. They can maintain the same model for both datasets, with the same model parameters, yet update robustly about z_{1:m} whilst still using the log-score for x^L_{1:n_L} (i.e. to condition their prior π̃(θ)).

3.3 The Learning Trajectory

The concept of closeness provided by D allows us to consider how L's approximation to F0 changes as more data is collected from the S-DGP. To do so we consider the 'learning trajectory', a function in m of the expected divergence to F0 of inference using m observations from G_{ε,δ}:

    T_ℓ(m; D, f0, g_{ε,δ}) = E_z[D(f0(·) ∥ p^ℓ(· | z_{1:m}))],

where p^ℓ(· | z_{1:m}) is the general Bayesian posterior predictive distribution (using ℓ) based on (synthetic) data z_{1:m}. This 'learning trajectory' traces the path taken by Bayesian inference from its prior predictive (m = 0) towards the synthetic data's DGP under increasing amounts of data (e.g. Figure 1; Section A.4).

We provide a proposition showing that, despite using more data and approaching the limit, θ^{KLD}_{G_{ε,δ}} is not necessarily the optimal target to learn about according to criterion D.

Proposition 1 (Suboptimality of the S-DGP) For S-DGP G_{ε,δ}, model f_θ(·), and divergence D, there exists prior π̃(θ), private DGP F0 and at least one value of 0 ≤ m < ∞ such that

    T_{ℓ_0}(m; D, f0, g_{ε,δ}) ≤ D(F0 ∥ f_{θ^{KLD}_{G_{ε,δ}}}),    (11)

where θ^{KLD}_{G_{ε,δ}} := arg min_{θ∈Θ} KLD(G_{ε,δ} ∥ f_θ) and T_{ℓ_0} is the learning trajectory of the Bayesian posterior predictive distribution (using ℓ_0) based on (synthetic) data z_{1:m}; see Eq. (7).

The proof of Proposition 1 involves a simple counterexample in which the prior is a better approximation to F0 according to divergence D than G_{ε,δ}. While trivial, this could reasonably occur if L has strong, well-calibrated expert belief judgements, or if they have a considerable amount of their own data before beginning to incorporate synthetic data. Furthermore, we argue next that by considering this learning trajectory path between the prior and the S-DGP G_{ε,δ}, in terms of the number of synthetic observations m, it is possible to get even 'closer' to F0 in terms of a chosen D.

Changing the divergence used for inference as suggested in Section 3.2 changes these trajectories by changing their limiting parameter. However, Proposition 1, which considered learning minimising the KLD, can equally be shown for learning minimising the βD (see A.2.1). In the following sections we talk generally about optimising such a trajectory for a given learning method, before focusing on comparisons in these methods and the resulting trajectories in Section 4.

3.4 Optimising the Learning Trajectory

The insights from the previous section raise two questions for a learner L using synthetic data:

1. Is synthetic data able to improve L's inferences about F0 according to divergence D? If so,
2. What is the optimal quantity to use for getting the learner closest to F0 according to D?

Both questions can be solved by the estimation of

    m* := arg min_{0≤m≤M} T_ℓ(m; D, f0, g_{ε,δ}).    (12)

However, clearly the learner never has access to the data generating density. Instead we take advantage of the representation of proper scoring rules and propose

0 using a ‘test set’ x1:N ∼ F0 to estimate study. Here we recommend that alongside releasing synthetic data, K optimises the learning trajectory B N 1 1 X X 0 ` (b) themselves, under some default model, loss and prior mˆ : = arg min s(xj, p (·|z1:m)) (13) 0≤m≤M N B setting by repeatedly partitioning x1:n into test and b=1 j=1 training sets. For example, when releasing classification (b) with {z1:m}b1:B ∼ Gε,δ. data, K could release an mˆ associated with logistic regression and BART, for the log-score, under some F As such we use a small amount of data from 0 to guide vaguely informative priors, providing learners an idea F the synthetic data inference towards 0. We consider of what to expect in terms of utility from the synthetic two procedures for doing so: Firstly, by tailoring to data. Whilst this is less tailored to any specific inference L ’s specific inference problem, and when this is not problem, it still allows K to communicate a broad K possible, putting the onus on to evaluate the general measure of the quality of its released data for learning ability of their S-DGP to capture F0. Before so doing about F0, and is advisable given our results. the following remark briefly considers optimising the learning trajectory for a specific stream of data rather 3.5 Posthoc improvement through averaging than average over the S-DGP. Once mˆ has been estimated, the question then remains Remark 1 We note that although previously the learn- of how to conduct inference given mˆ . In particular, ing trajectory was defined to average across samples if more synthetic data is available (e.g. mˆ << M or from the S-DGP, there may be cases where it is prefer- sampling z1:m, m → ∞ from a GAN), we can average able to define the learning trajectory based on a concrete the posterior predictive distribution across different stream of data z . In such a scenario we may con- 1:m realisations, using mˆ each time, ensuring we do not sider data-dependent waste any synthetic data. Jensen’s inequality allows us 0∗ `  to prove that performance of the predictive distribution m (z1:M ) : = arg min D f0(·) k p (· | z1:m) . (14) 0≤m≤M will not diminish if we consider convex proper scoring rules such as the logarithmic score: This can be seen as learning the optimal point to stop learning for a specific stream of data rather than the Proposition 2 (Predictive Averaging) Given di- optimal size for the average synthetic dataset. Such a vergence D with convex scoring rule, averaging over setting introduces an undesirable dependence between different realisations of the posterior predictive depend- 0∗ m (z1:M ) and the ordering of the data, but this can be ing on different synthetic data sets cannot deteriorate somewhat mitigated by averaging different realisations inference: of the synthetic data (see Prop. 2 and A.2.3/A.5.2 for B ! Jensen’s ineq. more details). For the rest of this paper, however, we 1 X  (b)  ∗ EzD F0 k p˜ x|z1:m ≤ focus on the more generally defined m from Eq. (12). B b=1 B 1 X   (b)    (b)  3.4.1 Optimising for L’s inference D F k p˜ x|z = D F k p˜ x|z . Ez B 0 1:m Ez 0 1:m In the first instance we consider the learner L has b=1 0 access to an independent test set x ∼ F0 allowing 1:N A more detailed proof is provided in A.2.2. The sig- them to calculate the mˆ associated with their specific nificance of this is that more synthetic data cannot learning trajectory. 
Two potential sources are for x0 1:N hinder the predictive distribution, but only if we do are 1) for L to sacrifice some of their own data xL 1:nL not naïvely use all of it to learn at once. when constructing their prior, or 2) require that K hold 0 x1:N out when it trains the S-DGP, which can then be queried by L in order to estimate mˆ . Clearly K is not 4 Experimental Setup and Results able to share the observations with L as this would violate the DP guarantee. Instead a secure protocol In order to investigate the concepts and methodologies for two-way communication between L’s model and outlined above, we consider two experiment types: K’s test set must be established; promising directions include (Cormode et al., 2019; de Montjoye et al., 2018) 1. Learning the location and of a univariate and a practical use case (UK HDR Alliance, 2020). Gaussian distribution. 2. Using Bayesian logistic regression for binary classifi- 3.4.2 A Broader Study cation on a selection of real-world datasets.
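A minimal sketch of the procedures in Eq. (13) and Proposition 2, under our own simplifying assumptions (the log score as s, and placeholder callables `sdgp_sample` and `fit_predictive` standing in for K's S-DGP and for the learner's general Bayesian update, with `fit_predictive` returning an object exposing `logpdf`):

```python
import numpy as np

def estimate_m_hat(sdgp_sample, fit_predictive, x_test, M, B, rng):
    """Eq. (13): pick the synthetic sample size minimising the average
    held-out log score over B realisations of the S-DGP."""
    scores = np.zeros(M + 1)
    for b in range(B):
        z = sdgp_sample(M, rng)                      # one stream z_1:M ~ G_{eps,delta}
        for m in range(M + 1):
            predictive = fit_predictive(z[:m])       # p^l(. | z_1:m); m = 0 is the prior
            scores[m] += -np.mean(predictive.logpdf(x_test))  # log score s(x', p)
    return int(np.argmin(scores / B))

def averaged_predictive_logpdf(sdgp_sample, fit_predictive, m_hat, B, x, rng):
    """Prop. 2: mixture of B posterior predictives, each fit on a fresh
    synthetic dataset of size m_hat, so no synthetic data is wasted."""
    logps = np.stack([fit_predictive(sdgp_sample(m_hat, rng)).logpdf(x)
                      for _ in range(B)])
    return np.logaddexp.reduce(logps, axis=0) - np.log(B)
```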

4 Experimental Setup and Results

In order to investigate the concepts and methodologies outlined above, we consider two experiment types:

1. Learning the location and scale of a univariate Gaussian distribution.
2. Using Bayesian logistic regression for binary classification on a selection of real-world datasets.

In these contexts we investigate the learning trajectory of classical Bayesian updating alongside the robust adjustments discussed in Section 3.2.

In order to draw comparisons between these methods, we study the trajectories' dependence on values spanning a grid of data quantities n_L and m, robustness parameters w and β, prior values, and the parameters of chosen DP mechanisms (see A.6.3 for full experimental specifications). The varying amounts of non-private data available to L were used to construct increasingly informative priors π̃(θ) = π(θ | x^L_{1:n_L}) through the use of standard Bayesian updating, as robustness is not required when learning using data drawn from F0. Learning trajectories are then estimated utilising an unseen dataset x′_{1:N} (mimicking either K's data or some subset of L's data not used in training).

To this end we use optimised MCMC sampling schemes (e.g. Hoffman and Gelman, 2014) to sample from all of the considered posteriors in each experiment's case, drawing comparisons across the grid laid out above and repeating experiments to mitigate any sources of noise. This results in an extensive computational task, made feasible through a mix of Julia's Turing PPL (Ge et al., 2018), MLJ (Blaom et al., 2020) and Stan (Carpenter et al., 2017).

The majority of the experiments are carried out with ε = 6, which is seen to be a realistic value with respect to practical applications (Lee and Clifton, 2011; Erlingsson et al., 2014; Tang et al., 2017; Differential Privacy Team at Apple, 2017) and upon observation of the relationship between privacy and misspecification shown in Figure A.4. We evaluate our experiments according to the divergences and scoring rules discussed in detail in A.6.2, presented across a range of the figures in the main text and supplementary material.

4.1 Simulated Gaussian Models

We first introduce a simple but illustrative simulated example in which we infer the parameters of a Gaussian model f_θ = N(µ, σ²) where θ = (µ, σ²). We place conjugate priors on θ with σ² ∼ InverseGamma(α_p, β_p) and µ ∼ N(µ_p, σ_p × σ) respectively. We consider x_{1:n} drawn from DGP F0 = N(0, 1²) and adopt the Laplace mechanism (Dwork et al., 2014) to define our S-DGP. This mechanism works by perturbing samples drawn from the DGP with noise drawn from the Laplace distribution of scale λ, calibrated via the sensitivity S of the DGP in order to provide (ε, 0)-DP per the Laplace mechanism's definition with ε = S/λ. To achieve finiteness of S in this case, we adjust our model to be that of a truncated Gaussian, restricting its range to ±3σ to allow for meaningful ε's to be calculated under the Laplace mechanism (see A.3 for a proof of this result). We then compare and evaluate the empirical performance of the competing methods defined below (see A.6.1.1 for explicit formulations):

1. The standard likelihood adjusted with an additional reweighting parameter w as in Eq. (8).
2. The posterior under the βD loss as in Eq. (9).
3. The 'Noise-Aware' likelihood, where the S-DGP can be tractably modelled using the Normal-Laplace convolution (Reed, 2006; Amini and Rabbani, 2017).

4.1.1 Results and Discussion

We observe that three different categories of learning trajectory occur across the models; these are illustrated in the 'branching' plots in Figure 1 (explained and analysed further in A.6.4 and A.6.5):

1. The prior π̃ is sufficiently inaccurate or uninformative (in this case due to low n_L) such that the synthetic data continues to be useful across the range of m we consider. As a result the learning trajectory is a monotonically decreasing curve in the criteria of interest.
2. A turning point is observed; synthetic data initially brings us closer to F0 before the introduction of further synthetic observations moves the inference away. We see that in the majority of cases these trajectories lie under the limiting KLD and βD approximations to G_{ε,δ}, demonstrating the efficacy of 'optimising the learning trajectory' through the existence of these optimal turning points.
3. The final scenario occurs under a sufficiently informative prior π̃ (here due to a large n_L) such that the synthetic data is not observably of any use at all; it can be seen to immediately cause model performance to deteriorate.

We can further quantify the turning points, which are perhaps the most interesting characteristic of these experiments. To do this we formulate bootstrapped averages of the number of 'effective real samples' that correspond to the estimated optimal quantity of synthetic data. This is done by comparing these minima with the black curves in the 'branching' plots representing the learning trajectory under an increasing n_L and m = 0 (see A.6.4 for more details). These calculations are shown in Figure 2 (and discussed further in A.6.5).

In general, we observe a significant increase in performance from the βD (see Figures 1, 3), indicated by its proximity to even the noise-aware model at lower values of n_L, alongside more modest improvements from reweighting methods. The βD achieves more desirable minimum-trajectory log score, KLD and Wasserstein values compared to other model types, exhibiting greater robustness to larger amounts of synthetic data where other approaches lose out significantly.
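To make the S-DGP of Section 4.1 concrete, a minimal sketch of the data generation step is given below (our own Python illustration rather than the paper's Julia/Stan code); it shows the truncation that yields a finite sensitivity and the variance inflation that makes the naive model misspecified by design.

```python
import numpy as np

def gaussian_laplace_sdgp(n, mu0, sigma0, epsilon, rng):
    """Draw n private samples from a +/- 3 sigma truncated Gaussian and release
    them under the (epsilon, 0)-DP Laplace mechanism (one output per input)."""
    x = rng.normal(mu0, sigma0, size=4 * n)          # oversample, then truncate
    x = x[np.abs(x - mu0) <= 3 * sigma0][:n]
    S = 6 * sigma0                                   # sensitivity = width of support
    lam = S / epsilon                                # epsilon = S / lambda
    z = x + rng.laplace(0.0, lam, size=n)
    return x, z

rng = np.random.default_rng(3)
x, z = gaussian_laplace_sdgp(10_000, mu0=0.0, sigma0=1.0, epsilon=6.0, rng=rng)

# The synthetic data is over-dispersed relative to F0: Var[z] ~ Var[x] + 2 * lam^2,
# so a naive Gaussian fit to z targets theta^KLD_G rather than theta^KLD_F0.
print(x.var(), z.var(), 2 * (6 * 1.0 / 6.0) ** 2)
```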

[Figure 1: two panels, w = 0.5 (left) and β = 0.5 (right). x-axis: Total Number of Samples (Real + Synth); y-axis: Kullback-Leibler Distance; one curve per number of real samples (2–100).]

Figure 1: Shows how the KLD to F0 changes as we add more synthetic data, starting with increasing amounts of real data. This demonstrates the effectiveness of the optimal discovered βD configuration compared to the closest alternative traditional model in terms of performance, with down-weighting w = 0.5; the black dashed line illustrates KLD(F0 ∥ f_{θ*}) for θ* = θ^{KLD}_{G_{ε,δ}} (left) and θ* = θ^{βD}_{G_{ε,δ}} (right), representing the approximation to F0 given an infinite sample from G_{ε,δ} under the two learning methods, and exhibiting the superiority of the βD.

[Figure 2: two panels, AUROC (left) and Log Score (right). x-axis: Number of Real Samples; y-axis: Maximum Effective Real Sample Gain Through the Use of Synthetic Data; one curve per model type (w = 0.5, w = 1 (Naive), β = 0.25, β = 0.5, β = 0.75).]

Figure 2: Shows the effective number of real samples gained through the optimal m̂ synthetic observations alongside varying amounts of real data usage, with respect to the AUROC and log-score performance criteria. These are calculated and presented here via bootstrapped averaging under a logistic regression model learnt on the Framingham dataset. The number of effective real samples gained is significantly affected by the learning task's criteria.

[Figure 3: (a) Simulated Gaussian, real n = 10 (y-axis: Wasserstein Distance); (b) UCI Heart Dataset, real n = 30 (10%) (y-axis: AUROC). x-axis in both panels: Number of Synthetic Samples; one curve per model configuration (Noise-Aware, Resampled, w = 0 (No Syn), w = 0.5, w = 1 (Naive), β = 0.25–0.8).]

Figure 3: Given a fixed amount of real data, we can compare model performances directly by focusing on one of the 'branches' in the class of diagrams shown in Figures 1 & 4, to see that the βD's performance falls between that of the noise-aware model and the other models, exhibiting robust and desirable behaviour across a range of β. Naïve and reweighting-based approaches fail to gain significantly over not using synthetic data (shown by w = 0's flat trajectory); the resampled model in the logistic case can also be seen to perform very poorly in comparison to models that leverage the synthetic data.

4.2 Logistic Regression

We now move on to a more prevalent and practical class of models that also exhibit the potentially dangerous behaviours of synthetic data in real-world contexts, via datasets concerning subjects that have legitimate potential privacy concerns. Namely, we build logistic regression models for the UCI Heart Disease dataset (Dua and Graff, 2017) and the Framingham Cohort dataset (Splansky et al., 2007). Clearly, we are now only able to access the empirical distribution F_{n*}, where n* is the total amount of data present in each dataset. We use x_{1:n_T} to train an instance of the aforementioned PATE-GAN G_{ε,δ} and keep back x_{1:(n*−n_T)} for evaluation; we then draw synthetic data samples z_{1:m} ∼ G_{ε,δ}. As before, we investigate how the learning trajectories are affected across the experimental parameter grid.

Again, we consider learning using ℓ_w and ℓ^{(β)} applied to the logistic regression likelihood f_θ (see A.6.1.2 for exact formulations). In this case we cannot formulate a 'Noise-Aware' model due to the black-box nature of the GAN, highlighting the reality of the model misspecification setting we find ourselves in aside from simple or simulated examples. We can instead define a 'resampled' model that recycles the real data used in formulating the prior.

4.2.1 Results and Discussion

Here the learning trajectories are defined with respect to the AUROC as well as the log score; whilst not technically a divergence, this gives us a decision theoretic criterion to quantify the closeness of our inference to F0.

Referring to Figures 2, 3 and 4, we see that the learning trajectories observed in this more realistic example mirror those observed in our simulated Gaussian experiments. There are however some cases in which the reweighted posterior outperforms the βD, and we see large discrepancies in m̂ when comparing log score to AUROC values, reinforcing the importance of carefully defining the learning task to prioritise.

Additionally, experiments using synthetic data from a GAN offer the unique observation that performance can actually improve as ε decreases. We believe this is due to potential collapse in the GAN learning process on imbalanced datasets, and concentrations of realisations of G_{ε,δ} as the injected noise increases, such that a small number of synthetic samples can actually be more representative of F0 than even the real data. This effect is short-lived as more synthetic observations are used in learning, as presumably these samples then become over-represented through the posterior distribution and performance begins to deteriorate; see Figure 4 for performance comparisons across ε.¹

Figure 4: This plot illustrates an interesting and important observation made when varying ε for a GAN based model: we observe that there is a privacy 'sweet-spot' around ε = 1 whereby more private data performs better than less private data (see the curves for ε = 100, which essentially represent non-private data).

¹This plot exhibits the effect under the βD model on the Framingham dataset with β = 0.5, but is observable across all model types in both AUROC and log score.
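For the logistic regression experiments, the βD loss of Eq. (9) takes a closed form because the outcome space is binary, so the integral over y becomes a two-term sum. A minimal sketch of ours follows (the exact formulations used in the experiments are those of A.6.1.2; the function and parameter names here are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def beta_loss_logistic(y, x, theta, beta):
    """Eq. (9) for a Bernoulli likelihood: the integral over the outcome space
    is the sum of p(y' | x, theta)^(beta + 1) over y' in {0, 1}."""
    p1 = sigmoid(x @ theta)                        # P(y = 1 | x, theta)
    integral = p1 ** (beta + 1) + (1 - p1) ** (beta + 1)
    likelihood = np.where(y == 1, p1, 1 - p1)
    return integral / (beta + 1) - likelihood ** beta / beta

def neg_log_general_posterior(theta, z_x, z_y, beta, prior_prec=1.0):
    """Unnormalised negative log posterior of Eq. (6) with the betaD loss and a
    N(0, prior_prec^-1 I) prior; minimise directly or pass to an MCMC sampler."""
    loss = beta_loss_logistic(z_y, z_x, theta, beta).sum()
    return loss + 0.5 * prior_prec * theta @ theta
```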
5 Conclusions

We consider foundations of Bayesian learning from synthetic data that acknowledge the intrinsic model misspecification and the learning task at hand. Contrary to traditional statistical inferences, conditioning on increasing amounts of synthetic data is not guaranteed to help you learn about the true data generating process or make better decisions. Down-weighting the information in synthetic data (either using a weight w or the βD divergence) provides a principled approach to robust optimal information processing and warrants further investigation. Further work could consider augmenting these general robust techniques with tailored adjustments to inference based on the specific synthetic data, acknowledging the discrepancies between its generating density and the Learner's true target.

Acknowledgements

HW is supported by the Feuer International Scholarship in Artificial Intelligence. JJ was funded by the Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017 and Government of Spain's Plan Nacional PGC2018-101643-B-I00 grants whilst working on this project. SJV is supported by The Alan Turing Institute (EPSRC grant EP/N510129/) and the University of Warwick IAA funding. CH is supported by The Alan Turing Institute, Health Data Research UK, the Medical Research Council UK, the Engineering and Physical Sciences Research Council (EPSRC) through the Bayes4Health programme Grant EP/R018561/1, and AI for Science and Government UK Research and Innovation (UKRI).

References

Charu C Aggarwal and Philip S Yu. A condensation approach to privacy preserving data mining. In Advances in Database Technology - EDBT 2004, pages 183–199. Springer Berlin Heidelberg, 2004.

Zahra Amini and Hossein Rabbani. Letter to the editor: Correction to "the normal-laplace distribution and its relatives". Communications in Statistics - Theory and Methods, 46(4):2076–2078, 2017.

Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pages 12964–12976, 2019.

Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.

James O Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, March 2013.

James O Berger, Elías Moreno, Luis Raul Pericchi, M Jesús Bayarri, José M Bernardo, Juan A Cano, Julián De la Horra, Jacinto Martín, David Ríos-Insúa, Bruno Betrò, et al. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994.

Robert H Berk et al. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.

José M Bernardo and Adrian FM Smith. Bayesian theory, 2001.

Garrett Bernstein and Daniel R Sheldon. Differentially private Bayesian inference for exponential families. In Advances in Neural Information Processing Systems, pages 2919–2929, 2018.

Garrett Bernstein and Daniel R Sheldon. Differentially private Bayesian linear regression. In Advances in Neural Information Processing Systems, pages 523–533, 2019.

PG Bissiri, CC Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.

Anthony D Blaom, Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J Vollmer. MLJ: A Julia package for composable machine learning. July 2020.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13(1):134–170, 2011.

Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy, 2019.

A Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, 2007.

Yves-Alexandre de Montjoye, Sébastien Gambs, Vincent Blondel, Geoffrey Canright, Nicolas de Cordes, Sébastien Deletaille, Kenth Engø-Monsen, Manuel Garcia-Herranz, Jake Kendall, Cameron Kerry, Gautier Krings, Emmanuel Letouzé, Miguel Luengo-Oroz, Nuria Oliver, Luc Rocher, Alex Rutherford, Zbigniew Smoreda, Jessica Steele, Erik Wetter, Alex Sandy Pentland, and Linus Bengtsson. On the privacy-conscientious use of mobile phone data. Sci Data, 5:180286, December 2018.

Differential Privacy Team at Apple. Learning with privacy at scale. 2017.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.

Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: a language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1682–1690, 2018. URL http://proceedings.mlr.press/v84/ge18b.html.

Abhik Ghosh and Ayanendranath Basu. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413–437, 2016.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Peter Grünwald, Thijs Van Ommen, et al. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.

Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.

CC Holmes and SG Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.

Peter J Huber and EM Ronchetti. Robust statistics, series in probability and mathematical statistics, 1981.

Jack Jewson, Jim Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.

James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Doubly robust Bayesian inference for non-stationary streaming data using β-divergences. In Advances in Neural Information Processing Systems (NeurIPS), pages 64–75, 2018.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.

Jaewoo Lee and Chris Clifton. How much is enough? Choosing ε for differential privacy. In International Conference on Information Security, pages 325–340. Springer, 2011.

SP Lyddon, CC Holmes, and SG Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 2018.

Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, pages 1–13, 2018.

William J Reed. The normal-Laplace distribution and its relatives. In Advances in Distribution Theory, Order Statistics, and Inference, pages 61–74. Springer, 2006.

Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1):1–9, 2019.

Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, and Joshua Allen. Differentially private synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537, 2020.

David Rossell and Francisco J Rubio. Tractable Bayesian variable selection: beyond normality. Journal of the American Statistical Association, 113(524):1742–1758, 2018.

Greta Lee Splansky, Diane Corey, Qiong Yang, Larry D Atwood, L Adrienne Cupples, Emelia J Benjamin, Ralph B D'Agostino Sr, Caroline S Fox, Martin G Larson, Joanne M Murabito, et al. The third generation cohort of the National Heart, Lung, and Blood Institute's Framingham Heart Study: design, recruitment, and initial examination. American Journal of Epidemiology, 165(11):1328–1335, 2007.

Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in Apple's implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017.

The Royal Society. Privacy Enhancing Technologies: protecting privacy in practice. 2019.

UK HDR Alliance. Trusted research environments (TRE). https://ukhealthdata.org/wp-content/uploads/2020/07/200723-Alliance-Board_Paper-E_TRE-Green-Paper.pdf, 2020. Accessed: 2020-10-15.

Stephen G Walker. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10):1621–1633, 2013.

James Watson and Chris Holmes. Approximate models and robust decisions. Statistical Science, 31(4):465–489, 2016.

Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.

Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Trans. Database Syst., 42(4):1–41, October 2017.