Robust and Private Learning of Halfspaces

Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Thao Nguyen
Google Research, Mountain View, CA
{badihghazi, ravi.k53}@gmail.com, {pasin, thaotn}@google.com

Abstract

In this work, we study the trade-off between differential privacy and adversarial robustness under L2-perturbations in the context of learning halfspaces. We prove nearly tight bounds on the sample complexity of robust private learning of halfspaces for a large regime of parameters. A highlight of our results is that robust and private learning is harder than robust or private learning alone. We complement our theoretical analysis with experimental results on the MNIST and USPS datasets, for a learning algorithm that is both differentially private and adversarially robust.

1 Introduction

In this work, we study the interplay between two topics at the core of AI ethics and safety: privacy and robustness.

As modern machine learning models are trained on potentially sensitive data, there has been a tremendous interest in privacy-preserving training methods. Differential privacy (DP) (Dwork et al., 2006b,a) has emerged as the gold standard for rigorously tracking the privacy leakage of algorithms in general (see, e.g., Dwork and Roth, 2014; Vadhan, 2017, and the references therein), and of machine learning models in particular (e.g., Abadi et al., 2016), resulting in several practical deployments in recent years (e.g., Erlingsson et al., 2014; Shankland, 2014; Greenberg, 2016; Apple Differential Privacy Team, 2017; Ding et al., 2017; Abowd, 2018).

Another vulnerability of machine learning models that has also been widely studied recently is with respect to adversarial manipulations of their inputs at test time, with the intention of causing classification errors (e.g., Dalvi et al., 2004; Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016). Numerous methods have been proposed with the goal of training models that are robust to such adversarial attacks (e.g., Madry et al., 2018; Gowal et al., 2018, 2019; Schott et al., 2019), which in turn has led to new attacks being devised in order to fool these models (Athalye et al., 2018; Carlini and Wagner, 2018; Sharma and Chen, 2017). See (Kolter and Madry, 2018) for a recent tutorial on this topic.

Some recent work has suggested incorporating mechanisms from DP into neural network training to enhance adversarial robustness (Lecuyer et al., 2019; Phan et al., 2020, 2019). Given this existing interplay between DP and robustness, we seek to answer the following natural question:

Is achieving privacy and adversarial robustness harder than achieving either criterion alone?

Recent empirical work has provided mixed responses to this question (Song et al., 2019b,a; Hayes, 2020), reporting the success rate of membership inference attacks as a heuristic measure of privacy. Instead, using theoretical analysis and the strict guarantees offered by DP, we formally investigate this question in the classic setting of halfspace learning, and arrive at a near-complete picture.

Background. In order to present our results, we start by recalling some notions from robust learning in the PAC model. Let C ⊆ {0, 1}^X be a (Boolean) hypothesis class on an instance space X ⊆ R^d. A perturbation is defined by a function P : X → 2^X, where P(x) ⊆ X denotes the set of allowable perturbed instances starting from an instance x. The robust risk of a hypothesis h with respect to a distribution D on X × {±1} and perturbation P is defined as R_P(h, D) = Pr_{(x,y)~D}[∃ z ∈ P(x) : h(z) ≠ y]. A distribution D is said to be realizable (with respect to C and P) iff there exists h* ∈ C such that R_P(h*, D) = 0.
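For the halfspace hypotheses studied below, the existential quantifier in the robust-risk definition can be evaluated in closed form: some z with ‖z − x‖₂ ≤ γ is misclassified by h_w exactly when the signed margin y⟨w, x⟩ is at most γ‖w‖₂ (up to how sgn breaks ties at zero). The following minimal Python sketch, our own illustration rather than code from the paper, computes the empirical robust risk this way.

    import numpy as np

    def robust_risk_halfspace(w, X, y, gamma):
        """Empirical robust risk of h_w(x) = sgn(<w, x>) under L2 perturbations of
        radius gamma: the fraction of points (x, y) for which some z with
        ||z - x||_2 <= gamma is misclassified, i.e., y*<w, x> <= gamma*||w||_2
        (up to boundary conventions for sgn at zero)."""
        margins = y * (X @ w)                       # signed margins y * <w, x_i>
        return float(np.mean(margins <= gamma * np.linalg.norm(w)))

    # Tiny usage example on synthetic data realizable by w = (1, 0).
    rng = np.random.default_rng(0)
    w = np.array([1.0, 0.0])
    X = rng.normal(size=(1000, 2))
    y = np.sign(X @ w)
    print(robust_risk_halfspace(w, X, y, gamma=0.1))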

In the adversarially robust PAC learning problem, the learner is given i.i.d. samples from a realizable distribution D on X × {±1}, and the goal is to output a hypothesis h : X → {±1} such that with probability 1 − ξ it holds that R_P(h, D) ≤ α. We refer to ξ as the failure probability and α as the accuracy parameter. A learner is said to be proper if the output hypothesis h belongs to C; otherwise, it is said to be improper.

We focus our study on the concept class of halfspaces, i.e., C_halfspaces := {h_w | w ∈ R^d} where h_w(x) = sgn(⟨w, x⟩), in this model with respect to L2 perturbations, i.e., P_γ(x) := {z ∈ X : ‖z − x‖₂ ≤ γ} for a margin parameter γ > 0. We assume throughout that the domain of our functions is bounded in the d-dimensional Euclidean unit ball B^d := {x ∈ R^d | ‖x‖₂ ≤ 1}. We also write R_γ as a shorthand for R_{P_γ}. An algorithm is said to be a (γ, γ')-robust learner if, for any realizable distribution D with respect to C_halfspaces and P_γ, using a certain number of samples, it outputs a hypothesis h such that w.p. 1 − ξ we have R_{γ'}(h, D) ≤ α, where α, ξ > 0 are sufficiently smaller than some positive constant [1]. We are especially interested in the case where γ' is close to γ; for simplicity, we use γ' = 0.9γ as a representative setting throughout. In robust learning, the main quantities of interest are the sample complexity, i.e., the minimum number of samples needed to learn, and the running time of the learning algorithm.

We use the standard terminology of DP. Recall that two datasets X and X' are neighbors if X' results from adding or removing a single data point from X.

Definition 1 (Differential Privacy (DP) (Dwork et al., 2006b,a)). Let ε, δ ≥ 0. A randomized algorithm A taking as input a dataset is said to be (ε, δ)-differentially private (denoted by (ε, δ)-DP) if for any two neighboring datasets X and X', and for any subset S of outputs of A, it is the case that Pr[A(X) ∈ S] ≤ e^ε · Pr[A(X') ∈ S] + δ. If δ = 0, A is said to be ε-differentially private (denoted by ε-DP).

As usual, ε should be thought of as a small constant, whereas δ should be negligible in the dataset size. We refer to the case where δ = 0 as pure-DP, and the case where δ > 0 as approximate-DP.

Our Results. We assume that ε ≤ O(1) unless otherwise stated, and that γ, α, ξ > 0 are sufficiently smaller than some positive constant. We will not state these assumptions explicitly here for simplicity; interested readers may refer to (the first lines of) the proofs for the exact upper bounds that are imposed.

We first prove that robust learning with pure-DP requires Ω(d) samples.

Theorem 2. Any ε-DP (γ, 0.9γ)-robust (possibly improper) learner has sample complexity Ω(d/ε).

In the private but non-robust setting (i.e., for an ε-DP (γ, 0)-robust learner [2]), Nguyen et al. (2020) showed that O(1/γ²) samples suffice. Together with earlier known results showing that (γ, 0.9γ)-robust learning (without privacy) only requires O(1/γ²) samples (e.g., Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002), our result gives a separation between robust private learning and private learning alone or robust learning alone, whenever d ≫ 1/γ².

For the case of approximate-DP, we establish a lower bound of Ω(min{√d/γ, d}), which holds only against proper learners. As with Theorem 2 in the context of pure-DP, this result implies a similar separation in the approximate-DP proper learning setting.

Theorem 3. Let ε < 1. Any (ε, o(1/n))-DP (γ, 0.9γ)-robust proper learner has sample complexity n = Ω(min{√d/γ, d}).

Our proof technique can also be used to improve the lower bound for DP (γ, 0)-robust learning. Specifically, Nguyen et al. (2020) show an Ω(1/γ²) lower bound for proper (γ, 0)-robust learners with pure-DP. We extend it to hold even for improper (γ, 0)-robust learners with approximate-DP:

Theorem 4. For any ε > 0, there exists δ > 0 such that any (ε, δ)-DP (γ, 0)-robust (possibly improper) learner has sample complexity Ω(1/(εγ²)). Moreover, this holds even when d = O(1/γ²).

Finally, we provide algorithms with nearly matching upper bounds. For pure-DP, we prove the following, which matches the lower bound in Theorem 2 to within a constant factor when d ≥ 1/γ².

Theorem 5. There is an ε-DP (γ, 0.9γ)-robust learner with sample complexity O_α((1/ε) · max{d, 1/γ²}) [3].

For approximate-DP, it is already possible [4] to achieve a sample complexity of n = Õ_α(√d/γ) (Nguyen et al., 2020; Bassily et al., 2014), but the running time is Ω(n²d) [5]. We give a faster algorithm with running time O_α(nd/γ).

[1] It is necessary to have γ' < γ; when γ' = γ, proper (γ, γ')-robust learning is as hard as general proper learning of halfspaces (see, e.g., Diakonikolas et al., 2020), which is impossible with any finite number of samples under DP (Bun et al., 2015).
[2] This means that the output hypothesis only needs to have a small classification error, but may have a large robust risk.
[3] Here O_α(·) hides a factor of poly(1/α), and Õ(·) hides a factor of poly log(1/(γα)).
[4] This can be achieved by running the DP ERM algorithm of (Bassily et al., 2014) with the hinge loss; see (Nguyen et al., 2020) for the analysis.
[5] Specifically, Nguyen et al. (2020) use the DP empirical risk minimization algorithm of (Bassily et al., 2014) with the hinge loss; however, the latter requires Ω(n²) iterations and each iteration requires Ω(d) time.

Bounds                  | Robust (proper)       | Robust (improper)     | Non-robust (proper)   | Non-robust (improper)
Non-private (tight)     | Θ(1/γ²)               | Θ(1/γ²)               | Θ(1/γ²)               | Θ(1/γ²)
Pure-DP, upper          | O(d) (Theorem 5)      | O(d) (Theorem 5)      | Õ(1/γ²)               | Õ(1/γ²)
Pure-DP, lower          | Ω(d) (Theorem 2)      | Ω(d) (Theorem 2)      | Ω(1/γ²)               | Ω(1/γ²) (Theorem 4)
Approximate-DP, upper   | Õ(√d/γ) †             | Õ(√d/γ) †             | Õ(1/γ²)               | Õ(1/γ²)
Approximate-DP, lower   | Ω(√d/γ) (Theorem 3)   | Ω(1/γ²) (Theorem 4)   | Ω(1/γ²) (Theorem 4)   | Ω(1/γ²) (Theorem 4)

Table 1: Trade-offs between privacy and robustness. The robust columns correspond to (γ, 0.9γ)-robust learners, whereas the non-robust columns correspond to (γ, 0)-robust learners. For simplicity of presentation, we assume α, ε, ξ ∈ (0, 1) are sufficiently small constants, d ≥ 1/γ², and, for approximate-DP lower bounds, that δ = o(1/n). While approximate-DP upper bounds (marked with †) can already be derived from previous work, we give a faster algorithm (Theorem 6). For DP, known results are from (Nguyen et al., 2020); for the non-private case, the results follow, e.g., from (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002).
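As a back-of-the-envelope reading of Table 1 (our own illustration: the numbers are chosen only to indicate scale, and every bound hides constants as well as the dependence on α, ε, ξ), take an MNIST-like dimension d = 784 and margin γ = 0.1, so that d ≥ 1/γ² = 100:

    % Illustration only (not from the paper): d = 784, gamma = 0.1, hence 1/gamma^2 = 100 <= d.
    \text{robust, non-private:}\quad O(1/\gamma^2) = O(100), \qquad
    \text{non-robust, private:}\quad O(1/\gamma^2) = O(100),
    \text{robust and approximate-DP:}\quad \Omega(\sqrt{d}/\gamma) = \Omega(280), \qquad
    \text{robust and pure-DP:}\quad \Omega(d) = \Omega(784).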

Theorem 6. There is an (ε, δ)-DP (γ, 0.9γ)-robust learner with sample complexity n = Õ_α((1/ε) · max{√d/γ, 1/γ²}) and running time Õ_α(nd/γ).

Our theoretical results and those from prior works are summarized in Table 1. Notice that the non-private robust setting and the non-robust private setting each require only O(1/γ²) samples, whereas our results show that the private and robust setting requires either Ω(d) samples (for pure-DP) or Ω(√d/γ) samples (for approximate-DP). This separation positively answers the question central to our study.

We complement our theoretical results by empirically evaluating our algorithm (from Theorem 6) on the MNIST (LeCun et al., 2010) and USPS (Hull, 1994) datasets. Our results show that it is possible to achieve both robustness and privacy guarantees while maintaining reasonable performance. We further provide evidence that models trained via our algorithm are more resilient to adversarial noise compared to neural networks trained via DP-SGD (Abadi et al., 2016).

Organization. In the two following sections, we describe in detail the ideas behind each of our proofs. We then present our experimental results in Section 4. Finally, we discuss additional related work and several open questions in Sections 5 and 6, respectively. Due to space constraints, all missing proofs and additional experiments are deferred to the Supplementary Material.

2 Sample Complexity Lower Bounds

In this section, we explain the high-level ideas behind each of our sample complexity lower bounds. Our pure-DP lower bound is based on a packing framework and our approximate-DP lower bounds are based on fingerprinting codes.

2.1 Pure-DP Lower Bound (Theorem 2)

We use the packing framework, a DP lower bound proof technique that originated in (Hardt and Talwar, 2010). Roughly speaking, to apply this framework, we have to construct many input distributions for which the sets of valid outputs for each distribution are disjoint (hence the name "packing"). In our context, this means that we would like to construct distributions D^(1), ..., D^(K) such that the sets G^(i) of hypotheses with small robust risk on D^(i) are disjoint. Once we have done this, the packing framework immediately gives us a lower bound of Ω(log K / ε) on the sample complexity; below we describe a construction for K = 2^Ω(d) distributions, which yields the desired Ω(d/ε) lower bound in Theorem 2.

Our construction proceeds by picking unit vectors w^(1), ..., w^(K) that are nearly orthogonal, i.e., |⟨w^(i), w^(j)⟩| < 0.01 for all i ≠ j. It is not hard to see (and well-known) that such vectors exist for K = 2^Ω(d). We then let D^(i) be the uniform distribution on (1.01γ · w^(i), +1), (−1.01γ · w^(i), −1).

Now, let G^(i) denote the set of hypotheses h for which R_{0.9γ}(h, D^(i)) < 0.5. Since our distribution D^(i) is uniform on two elements, we must have that R_{0.9γ}(h, D^(i)) = 0 for all h ∈ G^(i). To see that G^(i) and G^(j) are disjoint for any i ≠ j, notice that, since |⟨w^(i), w^(j)⟩| < 0.01, the point 1.01γ · w^(i) is within distance 1.8γ of the point −1.01γ · w^(j). This means that no hypothesis h can correctly classify both (1.01γ · w^(i), +1) and (−1.01γ · w^(j), −1) with margin at least 0.9γ, which implies that G^(i) ∩ G^(j) = ∅. This completes our proof sketch.

We end by remarking that the previous work of Nguyen et al. (2020) also uses a packing argument; the main differences between our construction and theirs are in the choice of the distributions D^(i) and the proof of disjointness of the G^(i)'s. Our construction of D^(i) is in fact simpler, since our proof of disjointness can rely on the robustness guarantee. These differences are inherent, since our lower bound holds even against improper learners, i.e., when the output hypothesis may not be a halfspace, whereas the lower bound of Nguyen et al. (2020) only holds against the particular case of proper learners.
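The only probabilistic ingredient in this construction is the existence of K = 2^Ω(d) nearly orthogonal unit vectors. The short Python sketch below, our own illustration and not part of the paper, checks the phenomenon numerically: for random unit vectors the largest pairwise inner product concentrates around √(c · log K / d) for a small constant c, so any fixed threshold such as 0.01 is met once d is a sufficiently large constant multiple of log K, i.e., K can be taken as large as 2^Ω(d).

    import numpy as np

    rng = np.random.default_rng(0)
    d, K = 5000, 1000                                 # dimension and number of candidate vectors
    W = rng.normal(size=(K, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)     # K random unit vectors in R^d
    G = np.abs(W @ W.T)                               # pairwise |<w^(i), w^(j)>|
    np.fill_diagonal(G, 0.0)
    # The maximum pairwise inner product is roughly sqrt(4 * ln(K) / d); driving it
    # below a fixed constant such as 0.01 only requires d = Omega(log K).
    print("max_{i != j} |<w^(i), w^(j)>| =", G.max())
    print("sqrt(4 * ln(K) / d)          =", np.sqrt(4 * np.log(K) / d))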

2.2 Approximate-DP Lower Bound (Theorem 3)

We reduce from a lower bound from the line of works (Bun et al., 2018; Dwork et al., 2015; Steinke and Ullman, 2016, 2017) inspired by fingerprinting codes. Specifically, these works consider the so-called attribute mean problem, where we are given a set of vectors drawn from some (hidden) distribution D on {±1}^d and the goal is to compute the mean. It is known that getting an estimate to within 0.1 of the true mean in each coordinate requires Ω(√d) samples. In fact, Steinke and Ullman (2017) show that even outputting a vector with a "non-trivial" dot product [6] with the mean already requires Ω(√d) samples. This almost implies our desired lower bound: the only remaining step is to turn D into a distribution that is realizable with margin γ*. We do this by conditioning D on only those points x with a sufficiently large dot product with the true mean, and then adding both (x, +1) and (−x, −1) to our distribution. This reduction gives an Ω(√d) lower bound on the sample complexity of (γ*, 0.9γ*)-robust proper learners, for some absolute constant margin parameter γ* > 0.

To get an improved bound for a smaller margin γ, we "embed" Ω(1/γ²) of the hard instances above into the d dimensions. More specifically, let T = γ*/γ; for each i ∈ [T²] we create a distribution D^(i) that is the hard distribution from the previous paragraph in d' := d/T² dimensions, embedded onto coordinates d'(i−1)+1, ..., d'i. We then let the distribution D' be the (uniform) mixture of D^(1), ..., D^(T²). Since each D^(i) is realizable with margin γ* via some halfspace w^(i), we may take w* := (1/T) Σ_{i∈[T²]} w^(i)/‖w^(i)‖ to realize the distribution D' with margin γ = γ*/T, as desired. Now, to find any w with small R_{0.9γ}(h_w, D'), we roughly have to solve (most of) the T² instances D^(i). Recall that solving each of these instances requires Ω(√d') samples. Thus, the combined instance requires Ω(T² · √d') = Ω((1/γ²) · √(γ²d)) = Ω(√d/γ) samples.

[6] Specifically, this holds when the output vector has ℓ2-norm at most √d and the dot product is at least ζd for any constant ζ > 0.

2.3 Non-Robust DP Learning Lower Bound (Theorem 4)

This lower bound once again uses the "embedding" technique described above. Here we start with a hard one-dimensional instance, which is simply the uniform distribution on (x, +1), (−x, −1) where x is either +1 or −1. When δ > 0 is sufficiently small (depending on ε), it is simple to show that any (ε, δ)-DP (1, 0)-robust learner for this instance requires Ω(1/ε) samples. Similar to the previous proof overview, we embed 1/γ² such instances into 1/γ² dimensions. Since the one-dimensional instance requires Ω(1/ε) samples, the combined instance requires Ω(1/(εγ²)) samples, thereby yielding Theorem 4.

3 Sample-Efficient Algorithms

In this section, we present our algorithms for robust and private learning. Our pure-DP algorithm is based on an improved analysis of the exponential mechanism. Our approximate-DP algorithm is based on a private, batched version of the perceptron algorithm.

In the following discussions, we assume that w* is an (unknown) optimal halfspace with respect to the input distribution D and L2-perturbations with margin parameter γ, i.e., that w* satisfies R_γ(h_{w*}, D) = 0. We may assume without loss of generality that ‖w*‖ = 1.

3.1 Pure-DP Algorithm (Theorem 5)

Theorem 5 is shown via the exponential mechanism (EM) (McSherry and Talwar, 2007). Our guarantee is an improvement over the "straightforward" analysis of EM on a (0.1γ)-net of the unit sphere in R^d, which gives an upper bound of O_α(d log(1/γ)/ε) (Nguyen et al., 2020). On the other hand, when d ≥ 1/γ², our sample complexity is O_α(d/ε). The intuition behind our improvement is that, if we take a random unit vector w such that ⟨w, w*⟩ ≥ 0.99, then it already gives a small robust risk (in expectation) when d ≥ 1/γ², because the component of w orthogonal to w* is a random (d−1)-dimensional vector of norm less than one, meaning that in expectation it only affects the margin by O(1/√d) ≪ 0.1γ. Now, a random unit vector satisfies ⟨w, w*⟩ ≥ 0.99 with probability 2^{−O(d)}, which (roughly speaking) means that EM should only require O_α(d/ε) samples.
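The following Python sketch, our own simplified rendering rather than the paper's exact procedure, instantiates the exponential mechanism described above: the candidate set is a data-independent pool of random unit vectors, the (sensitivity-1) score of a candidate is the negated number of samples it fails to classify with margin 0.9γ, and a candidate is sampled with probability proportional to exp(ε · score / 2). The pool size is a placeholder; the formal guarantee is the one stated in Theorem 5.

    import numpy as np

    def em_halfspace(X, y, gamma, eps, num_candidates=10000, seed=0):
        """Exponential mechanism over a data-independent pool of random unit vectors.
        Score(w) = -(number of samples with y*<w, x> < 0.9*gamma), which has
        sensitivity 1 under addition/removal of one sample, so sampling
        proportionally to exp(eps * score / 2) is eps-DP."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(size=(num_candidates, d))
        W /= np.linalg.norm(W, axis=1, keepdims=True)            # candidate unit vectors
        violations = ((y[None, :] * (W @ X.T)) < 0.9 * gamma).sum(axis=1)
        logits = -(eps / 2.0) * violations                       # log-weights of exp(eps*score/2)
        logits -= logits.max()                                    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum()
        return W[rng.choice(num_candidates, p=probs)]

Note that the candidates are generated independently of the data, so the privacy cost comes only from the exponential sampling step; the accuracy analysis above says that, when d ≥ 1/γ², an exponentially small but non-negligible fraction of random candidates already has small robust risk.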

3.2 Approximate-DP Algorithm (Theorem 6)

To prove Theorem 6, we use the DP Batch Perceptron algorithm presented in Algorithm 1. DP Batch Perceptron is a batched and privatized version of the so-called margin perceptron algorithm (Duda and Hart, 1973; Collobert and Bengio, 2004). That is, in each iteration, we randomly sample a batch of samples and, for each sample (x, y) in the batch that is not correctly classified with margin γ', we add y·x to the current weight vector of the halfspace. Furthermore, we add some Gaussian noise to the weight vector to make this algorithm private. We also have a "stopping condition" that terminates whenever the number of samples mislabeled at margin γ' is sufficiently small. (We add Laplace noise to the number of such samples to make it private.) To get a (γ, 0.9γ)-robust learner, it suffices for us to set, e.g., γ' = 0.95γ; we use this value of γ' in the subsequent discussions.

Algorithm 1 DP Batch Perceptron
DP-Batch-Perceptron_{γ'}(σ, p, T, b, {(x_j, y_j)}_{j∈[n]})
 1: w_0 ← 0
 2: for i = 1, ..., T
 3:   S_i ← a set of samples where each (x_j, y_j) is independently included w.p. p
 4:   M_i ← ∅
 5:   for (x, y) ∈ S_i
 6:     if sgn(⟨w_{i−1}/‖w_{i−1}‖, x⟩ − y·γ') ≠ y
 7:       M_i ← M_i ∪ {(x, y)}
 8:   Sample ν_i ~ Lap(b)
 9:   if |M_i| + ν_i < 0.3αpn
10:     return w_{i−1}/‖w_{i−1}‖
11:   u_i ← Σ_{(x,y)∈M_i} y·x
12:   Sample g_i ~ N(0, σ²·I_{d×d})
13:   w_i ← w_{i−1} + u_i + g_i
return FAIL

Before we dive into the details of the proof, we remark that our runtime reduction, compared to the generic algorithm from (Bassily et al., 2014), comes from the fact that, in the accuracy analysis, we only need the number of iterations T to be O_α(1/γ²), similar to the perceptron algorithm (Novikoff, 1963). On the other hand, the generic theorem of Bassily et al. (2014) requires n² = Õ_α(d/γ²) iterations.

The accuracy analysis of DP Batch Perceptron follows the blueprint of that of the perceptron (Novikoff, 1963). Specifically, we keep track of the following two quantities: ⟨w_i, w*⟩, the dot product between the current halfspace w_i and the "true" halfspace w*, and ‖w_i‖, the (Euclidean) norm of w_i. We would like to show that, after the first few iterations, ⟨w_i, w*⟩ increases at a faster rate than ‖w_i‖. Since ⟨w_i, w*⟩ is bounded above by ‖w_i‖, we may use this to bound the number T of iterations required for the algorithm to converge.

For simplicity, we assume that in each iteration |M_i| is equal to m > 0. Let us first consider the case where no noise is added (i.e., σ = 0). From the definition of w*, it is simple [7] to check that ⟨w*, y·x⟩ ≥ γ for all samples (x, y). This means that ⟨w*, u_i⟩ ≥ γm, resulting in

⟨w*, w_i⟩ ≥ ⟨w*, w_{i−1}⟩ + γm.   (1)

On the other hand, the check condition before we add each example (x, y) to M_i (Line 6) ensures that ⟨w_{i−1}, y·x⟩ ≤ γ'·‖w_{i−1}‖ for all (x, y) ∈ M_i. From this one can derive the following bound:

‖w_i‖ ≤ ‖w_{i−1}‖ + γ'm + 0.5m²/‖w_{i−1}‖,

which, when ‖w_{i−1}‖ ≥ 50m/γ, implies that

‖w_i‖ ≤ ‖w_{i−1}‖ + 0.96γm.   (2)

Combining (1) and (2), we arrive at

50m/γ + 0.96γmi ≥ ‖w_i‖ ≥ ⟨w*, w_i⟩ ≥ γmi.

This implies that the algorithm must stop after T = O(1/γ²) iterations. (When m = 1, this noiseless analysis is essentially the same as the original convergence analysis of the perceptron (Novikoff, 1963).)

[7] Specifically, since R_γ(h_{w*}, D) = 0, we have y⟨w*, z⟩ ≥ 0 for all z such that ‖x − z‖ ≤ γ. Plugging in z = x − γyw* yields the claimed inequality.

The previous paragraphs outlined the analysis for the noiseless case where σ = 0. Next, we describe how the noise affects the analysis and our choices of parameters. Roughly speaking, we would like the inequalities (1) and (2) to "approximately" hold even after adding noise. In particular, this means that we would like the right-hand side of these inequalities to be affected by at most o(γm) by the noise addition, with high probability. This condition will determine our selection of parameters.

For (1), the inclusion of the noise term g_i adds ⟨w*, g_i⟩ to the right-hand side. The variance of this term is σ²‖w*‖² = σ², which means that it suffices to ensure that m ≥ ω̃(σ/γ). For (2), it turns out that the dominant additional term is ‖g_i‖²/‖w_{i−1}‖ which, under the assumption that ‖w_{i−1}‖ ≥ 50m/γ, is at most O(γ‖g_i‖²/m); this term is O(γdσ²/m) in expectation. Since we would like this term to be o(γm), it suffices to have m = ω̃(σ·√d). By combining these two requirements, we may pick m = ω̃(σ·(√d + 1/γ)). We remark that the number of iterations still remains T = O(1/γ²), as in the noiseless case above.

While we have so far assumed for simplicity that |M_i| = m in all iterations, in the actual analysis we only require that |M_i| ≥ m. Furthermore, it is simple to show that, as long as the current hypothesis has robust risk significantly more than α, we will have |M_i| ≥ Ω_α(pn) with high probability. Combining this with the previous paragraph gives us the following condition (assuming that α is a constant):

pn ≥ ω̃(σ·(√d + 1/γ)).

This leads us to pick the following set of parameters:

n = Õ((1/ε)·(√d/γ + 1/γ²)),   p = Õ(γ),   σ = Õ(1).

The privacy analysis of our algorithm is similar to that of DP-SGD (Bassily et al., 2014). Specifically, by the choice of σ = Õ(1) and subsampling rate p = Õ(γ), each iteration of the algorithm is (O(γε), O(γ²δ))-DP (e.g., Dwork et al., 2010; Balle et al., 2018). Since the number of iterations is T = O(1/γ²), the advanced composition theorem (Dwork et al., 2010) implies that the entire algorithm is (O(√T·γε), O(T·γ²δ)) = (ε, δ)-DP, as desired.

We end by noting that, despite the popularity of perceptron-based algorithms, we are not aware of any work that analyzes the above noised and batched variant. The most closely related analysis we are aware of is that of Blum et al. (2005), whose algorithm uses the entire dataset in each iteration. While it is possible to adapt their analysis to the batch setting, it unfortunately does not give an optimal sample complexity. Specifically, their analysis requires the batch size to be Ω(√d/γ), resulting in a sample complexity of Ω(√d/γ²). On the other hand, our more careful analysis works even with batch size Õ(√d + 1/γ), which results in the desired Õ_{α,ε}(√d/γ) sample complexity when d ≥ 1/γ².
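To make Algorithm 1 concrete, here is a direct Python transcription (a sketch: the inputs σ, p, T, b must be calibrated as in the analysis above, and as detailed in the supplementary material, for the stated (ε, δ)-DP guarantee to hold; the handling of the first iteration, where w_0 = 0 has no direction, is our own choice).

    import numpy as np

    def dp_batch_perceptron(X, y, gamma_prime, sigma, p, T, b, alpha, seed=0):
        """Sketch of Algorithm 1 (DP Batch Perceptron). X: n x d array in the unit
        ball, y: labels in {-1, +1}. Returns a unit-norm weight vector, or None (FAIL).
        sigma, p, T, b must be calibrated as in the paper for (eps, delta)-DP."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(T):
            batch = rng.random(n) < p                           # Poisson subsampling w.p. p
            Xb, yb = X[batch], y[batch]
            w_norm = np.linalg.norm(w)
            if w_norm == 0:
                mistakes = np.ones(len(yb), dtype=bool)         # at w = 0, everything violates the margin
            else:
                mistakes = yb * (Xb @ (w / w_norm)) < gamma_prime   # margin-gamma' violations
            nu = rng.laplace(scale=b)                           # Laplace noise on the mistake count
            if mistakes.sum() + nu < 0.3 * alpha * p * n:       # noisy stopping condition
                return w / w_norm if w_norm > 0 else w
            u = (yb[mistakes][:, None] * Xb[mistakes]).sum(axis=0)
            g = rng.normal(scale=sigma, size=d)                 # Gaussian noise on the update
            w = w + u + g
        return None                                             # FAIL

Each iteration touches an expected pn = Õ(γn) samples, so with T = O(1/γ²) iterations the total expected work is Õ(T · pn · d) = Õ(nd/γ), matching the running time claimed in Theorem 6.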
4 Experiments

We run our DP Batch Perceptron algorithm on the MNIST (LeCun et al., 2010) and USPS (Hull, 1994) datasets. The overall privacy guarantee is computed using Rényi DP (Abadi et al., 2016; Mironov, 2017); the calculations for this follow the implementation in the official TensorFlow Privacy repository (https://github.com/tensorflow/privacy). For experiments with varying ε (first column of Figure 1), we fix δ to 10^-5 for MNIST and 10^-4 for USPS. We observe that despite the robustness and privacy constraints, DP Batch Perceptron still achieves competitive accuracy on both datasets. We also report performance with varying values of δ ∈ {10^-2, 10^-3, 10^-4, 10^-5} (second column of Figure 1), while keeping ε fixed at 1.0.

Adversarial Robustness Evaluation. We compare the robustness of our models against those of neural networks trained with DP-SGD. For the latter, we follow the architecture found in the official TensorFlow Privacy tutorial, which consists of two convolutional layers, each followed by a MaxPool operation, and a dense layer that outputs predicted logits. The network is then trained with batch size 250, learning rate 0.15, L2 clipping-norm 1.0, and for 60 epochs. This configuration yields competitive performance on the MNIST dataset.

To evaluate the robustness of the models, we calculate the robust risk on the test dataset for varying values of the perturbation norm (i.e., margin) γ. Following common practice in the field, we plot the robust accuracy on the test data, which is defined as one minus the robust risk (i.e., 1 − R_γ(h, D) where D is the test dataset distribution), instead of the robust risk itself. Similarly, our plots use the unnormalized margin, meaning that the images are not re-scaled to have L2 norm equal to 1 before prediction. Note that each pixel value is still scaled (i.e., divided by 255 if necessary) to lie in the range [0, 1].

In the case of DP Batch Perceptron, it is known (see, e.g., Hein and Andriushchenko, 2017) that an example (x, y) cannot be perturbed (using P_γ) to an incorrect label y' ≠ y as long as ⟨w^(y), x⟩ − ⟨w^(y'), x⟩ > γ·‖w^(y) − w^(y')‖₂ for every y' ≠ y, where w^(c) denotes the weight vector associated with class c.
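The certificate above can be turned directly into an exact robust-accuracy computation for a multi-class linear classifier. The sketch below is our own illustration (function and variable names are ours), not the paper's evaluation code.

    import numpy as np

    def certified_robust_accuracy(W, X, y, gamma):
        """W: C x d matrix of per-class weight vectors, X: n x d inputs, y: integer
        labels in {0, ..., C-1}. An example is certified at L2 radius gamma if, for
        every other class c, <w^(y), x> - <w^(c), x> > gamma * ||w^(y) - w^(c)||_2
        (see Hein and Andriushchenko, 2017)."""
        scores = X @ W.T                                    # n x C class scores
        n, C = scores.shape
        ok = np.ones(n, dtype=bool)
        for c in range(C):
            diff_score = scores[np.arange(n), y] - scores[:, c]
            diff_norm = np.linalg.norm(W[y] - W[c][None, :], axis=1)
            ok &= (y == c) | (diff_score > gamma * diff_norm)
        return float(ok.mean())

For linear classifiers this condition is exact, so the resulting curve is the true robust accuracy rather than an attack-based estimate.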

Figure 1: Performance of DP Batch Perceptron halfspace classifiers on the MNIST (top row) and USPS (bottom row) datasets. Panels: (a) accuracy as ε varies; (b) accuracy as δ varies; (c) robust accuracy as ε varies.

Figure 2: Robust accuracy comparison between DP-SGD-trained convolutional neural networks and DP Batch Perceptron halfspace classifiers on the MNIST dataset for a fixed privacy budget. In all three plots, δ = 10^-5, while ε varies: (a) ε = 0.5; (b) ε = 1; (c) ε = 2.

Unlike the linear classifier case, the attack-based evaluation we use for the DP-SGD-trained neural networks only gives a lower bound on the robust risk, meaning that more sophisticated attacks might result in even more incorrect classifications.

We now briefly summarize the attack we use against DP-SGD-trained neural networks; (a version of) this method was already presented in (Szegedy et al., 2014). Let M denote a trained model; recall that the last layer of our model consists of 10 outputs corresponding to each class, and to predict an image we take the class y* with the maximum output. Given a sample (x, y), we would like to determine whether there exists a perturbation Δ ∈ R^d with ‖Δ‖ ≤ γ such that M predicts x + Δ to be some other class y' ≠ y. Instead of solving this (intractable) problem directly, the attack considers a modified objective of

min_{‖Δ‖ ≤ γ} ℓ(M(x + Δ), y'),

where ℓ is some loss function. This optimization problem is then solved using Projected Gradient Descent (PGD). We use the cross-entropy loss in our attack; a short sketch of this procedure is given at the end of this section.

Comparisons of the robust accuracy of models trained via DP Batch Perceptron and those trained via DP-SGD are shown in Figure 2 for δ = 10^-5 and ε = 0.5, 1, 2. In the case of ε = 0.5, while both classifiers have similar test accuracies (without any perturbation, γ = 0), as γ increases, the robust accuracy rapidly degrades for the DP-SGD-trained neural network compared to that of the DP Batch Perceptron model. This overall trend persists for ε = 1, 2; in both cases, the neural networks start off with noticeably larger test accuracy when γ = 0 but are eventually surpassed by the halfspace classifiers as γ increases.
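Returning to the attack just described, the following Python sketch (ours; loss_grad is a hypothetical caller-supplied oracle returning the gradient of the cross-entropy loss with respect to the input, e.g. obtained via automatic differentiation in TensorFlow or JAX) shows the projected gradient descent loop for the targeted L2 objective.

    import numpy as np

    def pgd_attack_l2(x, target, loss_grad, gamma, steps=40, step_size=None):
        """Targeted L2 PGD sketch: approximately minimize loss(M(x + delta), target)
        over ||delta||_2 <= gamma. `loss_grad(x_adv, target)` must return the gradient
        of the loss w.r.t. x_adv (supplied by the caller)."""
        step_size = step_size or 2.5 * gamma / steps
        delta = np.zeros_like(x)
        for _ in range(steps):
            g = loss_grad(x + delta, target)
            g_norm = np.linalg.norm(g) + 1e-12
            delta = delta - step_size * g / g_norm        # normalized descent step on the loss
            d_norm = np.linalg.norm(delta)
            if d_norm > gamma:                            # project back onto the L2 ball of radius gamma
                delta = delta * (gamma / d_norm)
        return x + delta

In this sketch, the attack on (x, y) is declared successful if the model's prediction on the returned point is the target class y' ≠ y; the estimated robust accuracy is then one minus the fraction of test points for which some target class yields a successful attack.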

5 Other Related Work

Learning Halfspaces with Margin. L2-robustness is a classical setting closely related to the notion of margin (e.g., Rosenblatt, 1958; Novikoff, 1963). (See the supplementary material for formal definitions.) Known margin-based learning methods include the classic perceptron algorithm (Rosenblatt, 1958; Novikoff, 1963) and its many generalizations and variants (e.g., Duda and Hart, 1973; Collobert and Bengio, 2004; Freund and Schapire, 1999; Gentile and Littlestone, 1999; Li and Long, 2002; Gentile, 2001), as well as Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik, 1995). A number of works also explore a closely related agnostic setting, where the distribution D is not guaranteed to be realizable with a margin (e.g., Ben-David and Simon, 2000; Shalev-Shwartz et al., 2010; Long and Servedio, 2011; Birnbaum and Shalev-Shwartz, 2012; Diakonikolas et al., 2019, 2020). Generalization aspects of margin-based learning of halfspaces are also a widely studied topic (e.g., Bartlett, 1998; Zhang, 2002; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; McAllester, 2003; Kakade et al., 2008), and it is known that the sample complexity of robust learning of halfspaces is O(1/(αγ)²) (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002).

To the best of our knowledge, the first work that combines the study of learning halfspaces with margin and DP is (Nguyen et al., 2020); their results are presented in Table 1. Recently, Ghazi et al. (2020) gave alternative proofs for some results of Nguyen et al. (2020) via reductions to clustering problems, but these do not provide any improved sample complexity or running time.

Adversarially Robust Learning. There has been a rapidly growing literature on adversarial robustness. Some of these works have presented evidence that training robust classifiers might be harder than training non-robust ones (e.g., Awasthi et al., 2019; Bubeck et al., 2018, 2019; Degwekar et al., 2019). Other works aim to demonstrate the accuracy cost of robustness (e.g., Tsipras et al., 2019; Raghunathan et al., 2019). Another line of work seeks to determine the right quantity that governs adversarial generalization (e.g., Schmidt et al., 2018; Montasser et al., 2019; Khim and Loh, 2018; Yin et al., 2019; Awasthi et al., 2020).

Differentially Private Learning. Private learning has been a popular topic since the early days of differential privacy (e.g., Kasiviswanathan et al., 2008). Apart from the work of Nguyen et al. (2020) on privately learning halfspaces with a margin, a line of work closely related to our setting is the study of the sample complexity of privately learning threshold functions (Beimel et al., 2016; Feldman and Xiao, 2014; Bun et al., 2015; Alon et al., 2019; Kaplan et al., 2020a) and halfspaces (Beimel et al., 2019; Kaplan et al., 2020b,c). These works study the setting where the unit ball B^d is discretized, so that the domain is X^d ∩ B^d (i.e., each coordinate is an element of a finite set X). Interestingly, it has been shown that when X is infinite, halfspaces become unlearnable, i.e., the sample complexity becomes unbounded (Alon et al., 2019). On the other hand, an (ε, o(1/n))-DP learner with sample complexity Õ((d^{2.5}/(αε)) · 2^{O(log* |X|)}) exists (Kaplan et al., 2020b).

While the above setting is not directly comparable to ours, it is possible to reduce between the margin setting and the discretized setting, albeit with some loss. For example, we may use the grid discretization with granularity 0.01γ/√d to obtain a (γ, 0)-robust learner with sample complexity Õ((d^{2.5}/(αε)) · 2^{O(log*(1/γ))}). This is better than the (straightforward) bound of O(d · log(1/γ)) obtained by applying the exponential mechanism (McSherry and Talwar, 2007) when γ is very small (e.g., γ ≤ 2^{−d^{1.6}}). It remains an interesting open problem to close such a gap for very small values of γ.

Several works (e.g., Rubinstein et al., 2012; Chaudhuri et al., 2011) have studied differentially private SVMs. However, to the best of our knowledge, there is no straightforward way to translate their theoretical results to those shown in our paper, as the objectives in the two settings are different.

6 Conclusions and Future Directions

In this work, we prove new trade-offs, measured in terms of sample complexity, between privacy and robustness, two crucial properties within the domain of AI ethics and safety, for the classic task of halfspace learning. Our theoretical results demonstrate that DP and adversarially robust learning requires a larger number of samples than either DP or adversarially robust learning alone. We then propose a learning algorithm that meets both criteria, and test it on two multi-class classification datasets. We also provide empirical evidence that, despite having a slight advantage in terms of test accuracy on the main task, standard neural networks trained with DP-SGD are not as robust as the models trained with our algorithm.

We conclude with a few future research directions. First, it would be interesting to close the gap for the sample complexity of improper approximate-DP (γ, 0.9γ)-robust learners; for d ≥ 1/γ², the upper bound is Õ(√d/γ) (Theorem 6) but the lower bound is only Ω(1/γ²) (Theorem 4). This is the only case where there is still a super-polylogarithmic gap for d ≥ 1/γ². Another technical open question is to improve the lower bound in Theorem 3 to Ω(min{√d/γ, d}/ε). Currently, we are missing the ε term because we invoke a lower bound from Steinke and Ullman (2017) (Theorem 11), which was specifically proved only for ε = 1. Furthermore, it would be natural to extend our study to L_p perturbations for p ≠ 2. An especially noteworthy case is p = ∞, which is a well-studied setting in the adversarial robustness literature.

Finally, it would be very interesting to provide a theoretical understanding of private and robust learning beyond halfspaces, to accommodate complex algorithms (e.g., deep neural networks) that are better suited for more challenging tasks.

References

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In CCS, pages 308–318, 2016.
John M Abowd. The US Census Bureau adopts differential privacy. In KDD, pages 2867–2867, 2018.
Noga Alon, Oded Goldreich, Johan Håstad, and René Peralta. Simple constructions of almost k-wise independent random variables. In FOCS, pages 544–553, 1990.
Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In STOC, pages 852–860, 2019.
Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 2017.
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, pages 274–283, 2018.
Pranjal Awasthi, Abhratanu Dutta, and Aravindan Vijayaraghavan. On robustness to adversarial examples and polynomial optimization. In NeurIPS, pages 13760–13770, 2019.
Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In ICML, pages 431–441, 2020.
Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In NeurIPS, pages 6280–6290, 2018.
Peter L. Bartlett. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory, 44(2):525–536, 1998.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In NeurIPS, pages 11282–11291, 2019.
Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. Private center points and learning of halfspaces. In COLT, pages 269–282, 2019.
Shai Ben-David and Hans Ulrich Simon. Efficient learning of linear perceptrons. In NIPS, pages 189–195, 2000.
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In ECML/PKDD, pages 387–402, 2013.
Aharon Birnbaum and Shai Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In NIPS, pages 935–943, 2012.
Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In PODS, pages 128–138, 2005.
Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In ICML, pages 831–840, 2018.
Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In ICML, pages 831–840, 2019.
Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649, 2015.

Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM J. Comput., 47(5):1888–1938, 2018.
Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In IEEE SPW, pages 1–7, 2018.
Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3):1069–1109, 2011.
Ronan Collobert and Samy Bengio. Links between perceptrons, MLPs and SVMs. In ICML, 2004.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, and Deepak Verma. Adversarial classification. In KDD, pages 99–108, 2004.
Akshay Degwekar, Preetum Nakkiran, and Vinod Vaikuntanathan. Computational limitations in robust classification and win-win results. In COLT, pages 994–1028, 2019.
Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. In NeurIPS, pages 10473–10484, 2019.
Ilias Diakonikolas, Daniel M. Kane, and Pasin Manurangsi. The complexity of adversarially robust proper learning of halfspaces with agnostic noise. In NeurIPS, 2020.
Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In NIPS, pages 3571–3580, 2017.
Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006a.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006b.
Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.
Cynthia Dwork, Adam D. Smith, Thomas Steinke, Jonathan Ullman, and Salil P. Vadhan. Robust traceability from trace amounts. In FOCS, pages 650–669, 2015.
Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, pages 1054–1067, 2014.
Vitaly Feldman and David Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, pages 1000–1019, 2014.
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In STOC, pages 439–449, 2020.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Mach. Learn., 37(3):277–296, 1999.
Claudio Gentile. A new approximate maximal margin classification algorithm. JMLR, 2:213–242, 2001.
Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In COLT, pages 1–11, 1999.
Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Differentially private clustering: Tight approximation ratios. In NeurIPS, 2020.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv Preprint: 1810.12715, 2018.
Sven Gowal, Jonathan Uesato, Chongli Qin, Po-Sen Huang, Timothy Mann, and Pushmeet Kohli. An alternative surrogate loss for PGD-based adversarial testing. arXiv Preprint: 1910.09338, 2019.
Andy Greenberg. Apple's "differential privacy" is about collecting your data – but not your data. Wired, June 13, 2016.
Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In STOC, pages 705–714, 2010.
Jamie Hayes. Provable trade-offs between private & robust machine learning. arXiv Preprint: 2006.04622, 2020.
Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NeurIPS, pages 2266–2276, 2017.
Jonathan J. Hull. A database for handwritten text recognition research. IEEE PAMI, 16(5):550–554, 1994.

Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.
Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263–2285, 2020a.
Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. In NeurIPS, 2020b.
Haim Kaplan, Micha Sharir, and Uri Stemmer. How to find a point in the convex hull privately. In SoCG, pages 52:1–52:15, 2020c.
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In FOCS, pages 531–540, 2008.
Justin Khim and Po-Ling Loh. Adversarial risk bounds via function transformation. arXiv Preprint: 1810.09519, 2018.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
Zico Kolter and Aleksander Madry. Adversarial robustness: Theory and practice. Tutorial at NeurIPS, 2018.
Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database, 2010. http://yann.lecun.com/exdb/mnist.
Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE S&P, pages 656–672, 2019.
Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. Mach. Learn., 46(1-3):361–387, 2002.
Philip M. Long and Rocco A. Servedio. Learning large-margin halfspaces with more malicious noise. In NIPS, pages 91–99, 2011.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
David A. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
Ilya Mironov. Rényi differential privacy. In CSF, pages 263–275, 2017.
Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In COLT, pages 2512–2530, 2019.
Huy Le Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In ALT, pages 704–724, 2020.
Albert B. Novikoff. On convergence proofs for perceptrons. Technical report, Stanford Research Institute, Menlo Park, CA, 1963.
Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv Preprint: 1605.07277, 2016.
NhatHai Phan, Minh Vu, Yang Liu, Ruoming Jin, Dejing Dou, Xintao Wu, and My T. Thai. Heterogeneous Gaussian mechanism: Preserving differential privacy in deep learning with provable robustness. In IJCAI, pages 4753–4759, 2019.
NhatHai Phan, My T. Thai, Han Hu, Ruoming Jin, Tong Sun, and Dejing Dou. Scalable differential privacy with certified robustness in adversarial learning. In ICML, pages 7683–7694, 2020.
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, and Percy Liang. Adversarial training can hurt generalization. arXiv Preprint: 1906.06032, 2019.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. J. Priv. Confidentiality, 4(1), 2012.
Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS, pages 5014–5026, 2018.

Bernhard Schölkopf, Kah Kay Sung, Christopher J. C. Burges, Federico Girosi, Partha Niyogi, Tomaso A. Poggio, and Vladimir Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Process., 45(11):2758–2765, 1997.
Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In ICLR, 2019.
Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, pages 441–450, 2010.
Stephen Shankland. How Google tricks itself to protect Chrome user privacy. CNET, October, 2014.
Yash Sharma and Pin-Yu Chen. Attacking the Madry defense model with l1-based adversarial examples. arXiv Preprint: 1710.10733, 2017.
Liwei Song, Reza Shokri, and Prateek Mittal. Membership inference attacks against adversarially robust deep learning models. In IEEE SPW, pages 50–56, 2019a.
Liwei Song, Reza Shokri, and Prateek Mittal. Privacy risks of securing machine learning models against adversarial examples. In CCS, pages 241–257, 2019b.
Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In FOCS, pages 552–563, 2017.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2019.
Salil Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer, 2017.
Dong Yin, Ramchandran Kannan, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In ICML, pages 7085–7094, 2019.
Tong Zhang. Covering number bounds of certain regularized linear function classes. JMLR, 2:527–550, 2002.