An Empirical Study on the Intrinsic Privacy of Stochastic Gradient Descent

Stephanie L. Hyland and Shruti Tople
Microsoft Research
{stephanie.hyland, shruti.tople}@microsoft.com

ABSTRACT

In this work, we take the first step towards understanding whether the intrinsic randomness of stochastic gradient descent (SGD) can be leveraged for privacy, for any given dataset and model. In doing so, we hope to mitigate the trade-off between privacy and utility for models trained with differential-privacy (DP) guarantees. Our primary contribution is a large-scale empirical analysis of SGD on convex and non-convex objectives. We evaluate the inherent variability in SGD on 4 datasets and calculate the intrinsic data-dependent ϵ_i(D) values due to the inherent noise. For logistic regression, we observe that SGD provides intrinsic ϵ_i(D) values between 3.95 and 23.10 across four datasets, dropping to between 1.25 and 4.22 using the tight empirical sensitivity bound. For the neural networks considered, we observe high ϵ_i(D) values (>40) owing to their larger parameter count. We propose a method to augment the intrinsic noise of SGD to achieve the desired target ϵ, which produces statistically significant improvements in private model performance (subject to assumptions). Our experiments provide strong evidence that the intrinsic randomness in SGD should be considered when designing private learning algorithms.

1 INTRODUCTION

Respecting the privacy of people contributing their data to train machine learning models is important for the safe use of this technique [11, 27, 30]. Private variants of learning algorithms have been proposed to address this need [5, 12, 24, 29, 31]. Unfortunately, the utility of private models typically degrades, limiting their applicability.
This performance loss often results from the need to add noise during or after model training, to provide the strong protections of ϵ-differential-privacy [8]. However, results to date neglect the fact that learning algorithms are often stochastic. Framing them as ‘fixed’ queries on a dataset neglects an important source of intrinsic noise. Meanwhile, the randomness in learning algorithms such as stochastic gradient descent (SGD) is well known among machine learning practitioners [10, 14], and has been lauded for affording superior generalisation to its non-stochastic counterpart [16]. Moreover, the ‘insensitive’ nature of SGD relative to variations in its input data has been established in terms of uniform stability [13]. The data-dependent nature of this stability has also been characterised [18]. Combining these observations, we speculate that the variability in the model parameters produced by the stochasticity of SGD may exceed its sensitivity to perturbations in the specific input data, affording ‘data-dependent intrinsic’ privacy. In essence, we ask: “Can the intrinsic, data-dependent stochasticity of SGD help with the privacy-utility trade-off?”

Our Approach. We consider a scenario where a model is trained securely, but the final model parameters are released to the public; for example, a hospital which trains a prediction model on its own patient data and then shares it with other hospitals or a cloud provider. The adversary then has access to the fully trained model, including its architecture, and we assume details of the training procedure are public (e.g. batch size, number of training iterations, learning rate), but not the random seed used to initialise the model parameters and sample inputs from the dataset. We therefore focus on how SGD introduces randomness in the final weights of a model. This randomness is introduced from two main sources: (1) random initialisation of the model parameters and (2) random sampling of the input dataset during training. We argue that rather than viewing this variability as a pitfall of stochastic optimisation, it can instead be seen as a source of noise that can mask information about participants in the training data. This prompts us to investigate whether SGD itself can be viewed as a differentially-private mechanism, with some intrinsic data-dependent ϵ-value, which we refer to as ϵ_i(D). To calculate ϵ_i(D), we propose a novel method that characterises SGD as a Gaussian mechanism and estimates the intrinsic randomness for a given dataset, using a large-scale empirical approach. To the best of our knowledge, ours is the first work to report the empirical calculation of ϵ_i(D) values based on the observed distribution. Finally, we propose an augmented differentially-private SGD algorithm that takes into account the intrinsic ϵ_i(D) to provide better utility. We empirically compute the ϵ_i(D) for SGD and the utility improvement for models trained with both convex and non-convex objectives on 4 different datasets: MNIST, CIFAR10, Forest Covertype and Adult.

2 PROBLEM & BACKGROUND

We study the variability due to random sampling, and the sensitivity to dataset perturbations, of stochastic gradient descent (SGD).

Differential Privacy (DP). Differential privacy hides the participation of an individual sample in the dataset [8]. (ϵ, δ)-DP ensures that for all adjacent datasets S and S′, the privacy loss of any individual datapoint is bounded by ϵ with probability at least 1 − δ [9]:

Definition 2.1 ((ϵ, δ)-Differential Privacy). A mechanism M with domain I and range O satisfies (ϵ, δ)-differential privacy if for any two neighbouring datasets S, S′ ∈ I that differ only in one input and for any set E ⊆ O, we have: Pr(M(S) ∈ E) ≤ e^ϵ Pr(M(S′) ∈ E) + δ.

We consider an algorithm whose output is the weights of a trained model, and thus focus on the ℓ2-sensitivity:

Definition 2.2 (ℓ2-Sensitivity (from Def 3.8 in [9])). Let f be a function mapping from a dataset to a vector in R^d. Let S, S′ be two datasets differing in one data point. Then the ℓ2-sensitivity of f is defined as: ∆_2(f) = max_{S, S′} ∥f(S) − f(S′)∥_2.

One method for making a deterministic query f differentially private is the Gaussian mechanism. This gives a way to compute the ϵ of a Gaussian-distributed query given δ, its sensitivity ∆_2(f), and its variance σ².

Theorem 2.3 (Gaussian mechanism (from [9])). Let f be a function that maps a dataset to a vector in R^d. Let ϵ ∈ (0, 1) be arbitrary. For c² > 2 ln(1.25/δ), adding Gaussian noise sampled using the parameter σ ≥ c∆_2(f)/ϵ guarantees (ϵ, δ)-differential privacy.

Stochastic Gradient Descent (SGD). SGD and its derivatives are the most common optimisation methods for training machine learning models [3]. Given a loss function L(w; (x, y)) averaged over a dataset, SGD provides a stochastic approximation of the traditional gradient descent method by estimating the gradient of L at random inputs. At step t, on selecting a random sample (x_t, y_t), the gradient update function G performs w_{t+1} = G(w_t) = w_t − η∇_w L(w_t; (x_t, y_t)), where η is the (constant) step-size or learning rate, and w_t are the weights of the model at step t. In practice, the stochastic gradient is estimated using a mini-batch of B samples. Recently, Wu et al. [31] showed results for the sensitivity of SGD to the change in a single training example for a convex, L-Lipschitz and β-smooth loss L. We use this sensitivity bound in our analyses to compute the intrinsic ϵ_i(D) for convex functions. Let A denote the SGD algorithm using r as the random seed. The upper bound on the sensitivity of k passes of SGD with learning rate η is given by

∆̂_S = max_r ∥A(r, S) − A(r, S′)∥ ≤ 2kLη   (1)

∆̂_S gives the maximum difference in the model parameters due to the presence or absence of a single input sample. When using a batch size of B, as we do, the sensitivity bound can be reduced by a factor of B, i.e., ∆̂_S ≤ 2kLη/B.
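To make these two ingredients concrete, the following minimal sketch (in Python; the function names and numerical values are ours, purely for illustration) computes the theoretical sensitivity bound ∆̂_S ≤ 2kLη/B from Equation (1) and the corresponding Gaussian-mechanism noise scale σ ≥ c∆_2(f)/ϵ from Theorem 2.3.

import math

def sgd_sensitivity_bound(k, lipschitz_L, eta, batch_size):
    # Theoretical upper bound on the L2-sensitivity of k-pass SGD with
    # learning rate eta on a convex, L-Lipschitz, beta-smooth loss,
    # reduced by the batch size B (Equation (1) and the remark below it).
    return 2.0 * k * lipschitz_L * eta / batch_size

def gaussian_mechanism_sigma(sensitivity, epsilon, delta):
    # Noise scale required by the Gaussian mechanism (Theorem 2.3):
    # sigma >= c * Delta_2(f) / epsilon, with c^2 > 2 ln(1.25 / delta).
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    return c * sensitivity / epsilon

# Illustrative values only (not taken from the paper's experiments).
sensitivity = sgd_sensitivity_bound(k=10, lipschitz_L=1.0, eta=0.1, batch_size=32)
sigma = gaussian_mechanism_sigma(sensitivity, epsilon=1.0, delta=1e-5)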
3 AUGMENTED DP-SGD

In this section, we show how to account for the data-dependent intrinsic noise of SGD while adding noise using the output perturbation method that ensures differential privacy guarantees [31]. The premise of output perturbation is to train a model in secret, and then release a noisy version of the final weights. For a desired ϵ, δ, and a known sensitivity value (∆̂_S, ∆̂*_S), the Gaussian mechanism (Theorem 2.3) gives us the required level of noise in the output, which we call σ_target.

In Wu et al. [31], this σ_target defines the variance of the noise vector sampled and added to the model weights, to produce an (ϵ, δ)-DP model. In our case, we reduce σ_target to account for the noise already present in the output of SGD. Since the sum of two independent Gaussians with variance σ_a² and σ_b² is a Gaussian with variance σ_a² + σ_b², if the intrinsic noise of SGD is σ_i(D), then to achieve the desired ϵ we need to augment σ_i(D) to reach σ_target:

σ_augment = √(σ_target² − σ_i(D)²)   (2)

Given the likely degradation of model performance with output noise, accounting for σ_i(D) is expected to help utility without compromising privacy. The resulting algorithm for augmented differentially-private SGD is shown in Algorithm 1.

Algorithm 1 Augmented differentially private SGD
1: Given σ_i(D), ϵ_target, δ, ∆_2(f), model weights w_private.
2: c ← √(2 log(1.25/δ)) + 1 × 10⁻⁵
3: σ_target ← c∆_2(f)/ϵ_target
4: if σ_i(D) < σ_target then
5:     σ_augment ← √(σ_target² − σ_i(D)²)
6: else
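As a sketch of how the augmentation in Equation (2) and Algorithm 1 could be realised, the code below (Python with NumPy; the function name is ours, and the handling of the else branch, which is truncated above, is our assumption rather than the paper's exact specification) releases the trained weights with only the additional noise needed to reach σ_target.

import math
import numpy as np

def augmented_output_perturbation(w_private, sigma_i, sensitivity,
                                  epsilon_target, delta, rng=None):
    # Noise scale demanded by the Gaussian mechanism for (epsilon_target, delta),
    # mirroring steps 2-3 of Algorithm 1.
    c = math.sqrt(2.0 * math.log(1.25 / delta)) + 1e-5
    sigma_target = c * sensitivity / epsilon_target

    if sigma_i < sigma_target:
        # Equation (2): top up the intrinsic noise of SGD to reach sigma_target.
        sigma_augment = math.sqrt(sigma_target ** 2 - sigma_i ** 2)
    else:
        # Assumption for the truncated else branch: the intrinsic noise already
        # meets the target, so no additional noise is added.
        sigma_augment = 0.0

    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma_augment, size=w_private.shape)
    return w_private + noise

This sketch only makes sense under the same assumption as the text above, namely that the intrinsic randomness of SGD is well approximated by a Gaussian with variance σ_i(D)², so that the sum-of-independent-Gaussians argument behind Equation (2) applies.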