Differentially Private Stochastic Coordinate Descent
Georgios Damaskinos,1* Celestine Mendler-Dünner,2* Rachid Guerraoui,1 Nikolaos Papandreou,3 Thomas Parnell3
1 École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
2 University of California, Berkeley
3 IBM Research, Zurich
georgios.damaskinos@epfl.ch, [email protected], rachid.guerraoui@epfl.ch, {npo,tpa}@zurich.ibm.com

*Work partially conducted while at IBM Research, Zurich.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper we tackle the challenge of making the stochastic coordinate descent algorithm differentially private. Compared to the classical gradient descent algorithm, where updates operate on a single model vector and controlled noise addition to this vector suffices to hide critical information about individuals, stochastic coordinate descent crucially relies on keeping auxiliary information in memory during training. This auxiliary information provides an additional privacy leak and poses the major challenge addressed in this work. Driven by the insight that under independent noise addition the consistency of the auxiliary information holds in expectation, we present DP-SCD, the first differentially private stochastic coordinate descent algorithm. We analyze our new method theoretically and argue that decoupling and parallelizing coordinate updates is essential for its utility. On the empirical side we demonstrate competitive performance against the popular stochastic gradient descent alternative (DP-SGD) while requiring significantly less tuning.

1 Introduction

Stochastic coordinate descent (SCD) (Wright 2015) is an appealing optimization algorithm. Compared with classical stochastic gradient descent (SGD), SCD does not require a learning rate to be tuned and often has favorable convergence behavior (Fan et al. 2008; Dünner et al. 2018; Ma et al. 2015; Hsieh, Yu, and Dhillon 2015). In particular, for training generalized linear models, SCD is the algorithm of choice for many applications and has been implemented as a default solver in several popular packages such as Scikit-learn, TensorFlow and Liblinear (Fan et al. 2008).

However, SCD is not designed with privacy concerns in mind: SCD builds models that may leak sensitive information about the training data records. This is a major issue for privacy-critical domains such as health care, where models are trained on the medical records of patients.

Nevertheless, the low tuning cost of SCD is a particularly appealing property for privacy-preserving machine learning (ML). Hyperparameter tuning costs not only additional computation but also additional spending of the privacy budget.

Our goal is to extend SCD with mechanisms that preserve data privacy, and thus to enable the reuse of the large engineering efforts invested in SCD (Bradley et al. 2011a; Ioannou, Mendler-Dünner, and Parnell 2019) for privacy-critical use cases. In this paper we therefore ask the question: Can SCD maintain its benefits (convergence guarantees, low tuning cost) alongside strong privacy guarantees?

We employ differential privacy (DP) (Dwork, Roth et al. 2014) as our mathematical definition of privacy. DP provides a formal guarantee on the amount of information leaked about the participants in a database, given a strong adversary and the output of a mechanism that processes this database. DP provides a fine-grained way of measuring privacy across multiple subsequent executions of such mechanisms and is thus well aligned with the iterative nature of ML algorithms.

The main challenge in making SCD differentially private is that an efficient implementation stores and updates not only the model vector $\alpha$ but also an auxiliary vector $v := X\alpha$ to avoid recurring computations. These two vectors are coupled through the data matrix $X$ and need to be consistent for standard convergence results to hold. However, to provide rigorous privacy guarantees, it is vital to add independent noise to both vectors, which prohibits this consistency.
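To make the consistency issue concrete, consider the following toy sketch of a dual coordinate step for ridge regression that maintains $v = X\alpha$. This is our own illustration under simplifying assumptions, not the DP-SCD algorithm: the function name is hypothetical and the noise scale sigma is left uncalibrated for DP.

```python
import numpy as np

def noisy_dual_coordinate_step(X, y, alpha, v, i, lam, sigma, rng):
    """One dual coordinate step for ridge regression that keeps the auxiliary
    vector v = X @ alpha, with independent Gaussian noise added to both alpha
    and v. Toy illustration only: sigma is NOT calibrated to guarantee DP.
    """
    N = alpha.shape[0]
    theta = v / (lam * N)  # primal map theta(alpha) = X @ alpha / (lam * N)
    # Closed-form coordinate update for the squared loss (ridge regression).
    delta = (y[i] - X[:, i] @ theta - alpha[i]) / (1.0 + X[:, i] @ X[:, i] / (lam * N))
    alpha[i] += delta + rng.normal(0.0, sigma)                   # noise on the model
    v += delta * X[:, i] + rng.normal(0.0, sigma, size=v.shape)  # independent noise on v
    # v now differs from X @ alpha by the zero-mean noise terms, so the
    # coupling v = X @ alpha only holds in expectation after this step.
    return alpha, v
```

Without the two noise terms, the step preserves $v = X\alpha$ exactly; with them, the invariant is perturbed but, as the abstract notes, it still holds in expectation.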
Contributions. We present DP-SCD, a differentially private version of the SCD algorithm (Shalev-Shwartz and Zhang 2013b) with formal privacy guarantees. In particular, we make the following contributions.

• We extend SCD to compute each model update as an aggregate of independent updates computed on a random data sample. This parallelism is crucial for the utility of the algorithm because it reduces the amount of noise per sample that is necessary to guarantee DP.
• We provide the first analysis of SCD in the DP setting and derive a bound on the maximum level of noise that can be tolerated to guarantee a given level of utility.
• We empirically show that for problems where SCD has closed-form updates, DP-SCD achieves a better privacy-utility trade-off than the popular DP-SGD algorithm (Abadi et al. 2016) while, at the same time, being free of a learning rate hyperparameter that needs tuning. Our implementation is available at https://github.com/gdamaskinos/dpscd.

2 Preliminaries

Before we dive into the details of making SCD differentially private, we first formally define the problem class considered in this paper and provide the necessary background on SCD and differential privacy.

2.1 Setup

We target the training of Generalized Linear Models (GLMs), the class of models to which SCD is most commonly applied. This class includes convex problems of the following form:

\[ \min_{\theta} F(\theta; X) := \min_{\theta} \Big\{ \frac{1}{N} \sum_{i=1}^{N} \ell_i(x_i^\top \theta) + \frac{\lambda}{2} \|\theta\|^2 \Big\} \qquad (1) \]

The model vector $\theta \in \mathbb{R}^M$ is learnt from the training dataset $X \in \mathbb{R}^{M \times N}$ that contains the $N$ training examples $x_i \in \mathbb{R}^M$ as columns; $\lambda > 0$ denotes the regularization parameter, and $\ell_i$ the convex loss functions. The norm $\|\cdot\|$ refers to the L2-norm. For the rest of the paper we use the common assumption that for all $i \in [N]$ the data examples $x_i$ are normalized, i.e., $\|x_i\| = 1$, and that the loss functions $\ell_i$ are $1/\mu$-smooth. A wide range of ML models fall into this setup, including ridge regression and L2-regularized logistic regression (Shalev-Shwartz and Zhang 2013b).
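As a concrete instance of Problem (1), the following minimal sketch evaluates the L2-regularized logistic-regression objective with labels $y_i \in \{-1, +1\}$. The function name and data layout are our own choices; examples are stored as columns, matching the setup above.

```python
import numpy as np

def glm_objective(theta, X, y, lam):
    """F(theta; X) from Problem (1) with the logistic loss
    loss_i(z) = log(1 + exp(-y_i * z)); X has shape (M, N), examples as columns.
    """
    margins = y * (X.T @ theta)                 # y_i * x_i^T theta, one per example
    loss = np.mean(np.log1p(np.exp(-margins)))  # (1/N) sum_i loss_i(x_i^T theta)
    return loss + 0.5 * lam * (theta @ theta)   # + (lambda / 2) * ||theta||^2

# The normalization assumption ||x_i|| = 1 corresponds to scaling each column:
# X = X / np.linalg.norm(X, axis=0, keepdims=True)
```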
Threat model. We consider the classical threat model used in (Abadi et al. 2016). In particular, we assume that an adversary has white-box access to the training procedure (algorithm, hyperparameters, and intermediate output) and may even have access to $X \setminus x_k$, where $x_k$ is the data instance the adversary is targeting. However, the adversary cannot access the intermediate results of any update computation. We make this assumption more explicit in §3.1.

2.2 Primal-Dual Stochastic Coordinate Descent

The primal SCD algorithm repeatedly selects a coordinate $j \in [M]$ at random, solves a one-dimensional auxiliary problem, and updates the parameter vector $\theta$:

\[ \theta \leftarrow \theta + e_j \zeta^\star \quad \text{where} \quad \zeta^\star = \arg\min_{\zeta} F(\theta + e_j \zeta; X) \qquad (2) \]

where $e_j$ denotes the unit vector with value 1 at position $j$. Problem (2) often has a closed-form solution; otherwise $F$ can be replaced by its second-order Taylor approximation. A crucial approach for improving the computational complexity of each SCD update is to keep an auxiliary vector $v := X^\top \theta$ in memory.

SCD can equivalently be applied to the dual formulation of Problem (1) (Shalev-Shwartz and Zhang 2013b):

\[ \min_{\alpha} F^*(\alpha; X) := \min_{\alpha} \Big\{ \frac{1}{N} \sum_{i=1}^{N} \ell_i^*(-\alpha_i) + \frac{\lambda}{2} \Big\| \frac{1}{\lambda N} X\alpha \Big\|^2 \Big\} \qquad (3) \]

where $\alpha \in \mathbb{R}^N$ denotes the dual model vector and $\ell_i^*$ the convex conjugate of the loss function $\ell_i$. Since the dual objective (Problem (3)) depends on the data matrix through the linear map $X\alpha$, the auxiliary vector is naturally defined as $v := X\alpha$ in SDCA. We use the first-order optimality conditions to relate the primal and the dual model vectors, which results in $\theta(\alpha) = \frac{1}{\lambda N} X\alpha$ and leads to the important definition of the duality gap (Dünner et al. 2016):

\[ \mathrm{Gap}(\alpha) := F^*(\alpha; X) + F(\theta(\alpha); X) = \langle X\alpha, \theta(\alpha) \rangle + \frac{\lambda}{2} \|\theta(\alpha)\|^2 + \frac{\lambda}{2} \|\theta\|^2 \qquad (4) \]

By the construction of the two problems, the optimal values of the objectives match in the convex setting and the duality gap attains zero (Shalev-Shwartz and Zhang 2013b). Therefore, the model $\theta$ can be learnt by solving either Problem (1) or Problem (3), where we use the map $\theta(\alpha)$ to obtain the final solution when solving Problem (3).

While the two problems have a similar structure, they are quite different from a privacy perspective due to their data access patterns. When applied to the dual, SCD computes each update by processing a single example at a time, whereas primal SCD processes one coordinate across all examples. Several implications arise because differential privacy is defined on a per-example basis.

2.3 Differentially Private Machine Learning

Differential privacy is a guarantee for a function $f$ applied to a database of sensitive data (Dwork, Roth et al. 2014). In the context of supervised ML, this function is the update function of the algorithm, and the data is typically a set of input-label pairs $(x_i, y_i)$ used during model training. Two input datasets are adjacent if they differ in only a single input-label pair. Querying the model translates into making predictions for the label of some new input.

Definition 1 (Differential privacy). A randomized mechanism $\mathcal{M} : \mathcal{D} \to \mathcal{R}$ satisfies $(\epsilon, \delta)$-DP if for any two adjacent inputs $d, d' \in \mathcal{D}$ and for any subset of outputs $S \subseteq \mathcal{R}$ it holds that: $\Pr[\mathcal{M}(d) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(d') \in S] + \delta$.

The Gaussian mechanism. The Gaussian mechanism is a popular method for making a deterministic function $f : \mathcal{D} \to \mathcal{R}$ differentially private. By adding Gaussian noise to the output of the function we can hide particularities of individual input values. The resulting mechanism is defined as $\mathcal{M}(d) := f(d) + \mathcal{N}(0, S_f^2 \sigma^2)$, where the variance of the noise is scaled by the L2-sensitivity $S_f$ of the function $f$.
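For illustration, here is a minimal sketch of the Gaussian mechanism. The helper is our own, not from the paper's code; the calibration rule in the comment is the standard bound for the Gaussian mechanism (Dwork, Roth et al. 2014), valid for $\epsilon \in (0, 1)$.

```python
import numpy as np

def gaussian_mechanism(f_output, sensitivity, sigma, rng=None):
    """Release f(d) + N(0, (S_f * sigma)^2), with noise scaled by the
    L2-sensitivity S_f = max over adjacent d, d' of ||f(d) - f(d')||.
    Choosing sigma >= sqrt(2 * log(1.25 / delta)) / epsilon makes a single
    release (epsilon, delta)-DP for epsilon in (0, 1).
    """
    rng = rng or np.random.default_rng()
    return f_output + rng.normal(0.0, sensitivity * sigma, size=np.shape(f_output))
```

Iterative training applies such a mechanism at every update, and the per-step privacy losses compose across steps, which is why DP's fine-grained accounting fits iterative ML algorithms well.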