The Sample Complexity of Parity Learning in the Robust Shuffle Model

A Thesis submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By

Chao Yan, B.S.

Washington, DC
April 21, 2021

Copyright © 2021 by Chao Yan
All Rights Reserved

The Sample Complexity of Parity Learning in the Robust Shuffle Model

Chao Yan, B.S.

Thesis Advisor: Kobbi Nissim, Ph.D.

Abstract

Differential privacy [Dwork, McSherry, Nissim, Smith TCC 2006] is a standard of privacy in the analysis of personal information, requiring that the information of any single individual should not influence the analysis' output distribution significantly. The focus of this thesis is the shuffle model of differential privacy [Bittau, Erlingsson, Maniatis, Mironov, Raghunathan, Lie, Rudominer, Kode, Tinnés, Seefeld SOSP 2017], [Cheu, Smith, Ullman, Zeber, Zhilyaev EUROCRYPT 2019]. In the shuffle model, users communicate with an analyzer via a trusted shuffler, which permutes all received messages uniformly at random before submitting them to the analyst; the analyst then outputs an aggregate answer. Another model which we will discuss is the pan-privacy model [Dwork, Naor, Pitassi, Rothblum, Yekhanin ICS 2010], where an online algorithm is required to maintain differential privacy of both its internal state and its output (jointly). We focus on the task of parity learning in the robust shuffle model and obtain the following contributions:

• We construct a reduction from pan-private parity learning to robust shuffle parity learning: given an (ε, δ, 1/3)-robust shuffle private parity learner, we construct an (ε, δ)-pan-private parity learner. Applying recent pan-privacy lower bounds [Cheu, Ullman 2021], we obtain a lower bound of Ω(2^{d/2}) on the sample complexity of pan-private parity learning, which in turn implies the same lower bound in the robust shuffle model.

• We present a simple robust shuffle parity learning algorithm with sample complexity O(d · 2^{d/2}). The algorithm evaluates, with differential privacy, the empirical error of every possible parity function, and selects the one with minimal error.

Index words: Differential privacy, private learning, parity learning

Acknowledgments

This thesis could not have been completed without the help of many people, and I want to express my appreciation here. I would like to thank my advisor Kobbi Nissim, who gave me a lot of support from the beginning. He taught me a great deal during my master's studies, including the foundations of differential privacy, research methods, and thesis writing. I am grateful for his help, guidance, and patience. I want to thank the faculty who have taught me at Georgetown: Sasha Golovnev, Bala Kalyanasundaram, Calvin Newport, Kobbi Nissim, Micah Sherr, Richard Squier, and Justin Thaler. Their courses gave me a solid background in computer science. I also want to thank Sasha Golovnev and Jonathan Ullman for being willing to serve as my committee members. Finally, I want to thank my parents. Without their support, I would not have had the chance to study at Georgetown.

Table of Contents

Chapter 1: Introduction
  1.1 Results of this thesis
Chapter 2: Preliminaries
  2.1 Differential privacy
  2.2 Models of computation for differential privacy
  2.3 Private learning
Chapter 3: Related Work
  3.1 Non-private and private learning
  3.2 Reduction from pan-private algorithms to shuffle private algorithms
  3.3 Hard tasks for pan-private mechanisms
  3.4 Private summation protocol in shuffle model
Chapter 4: A Lowerbound on the Sample Complexity of Parity Learning in the Shuffle Model
  4.1 From robust shuffle model parity learner to a pan-private parity learner
  4.2 From pan-private parity learner to distinguishing hard distributions
  4.3 Tightness of the lowerbound
Chapter 5: Conclusion
Appendix: Tail Bounds
Bibliography

List of Figures

2.1 The curator model of differential privacy
2.2 The local model
2.3 The shuffle model
2.4 The pan-privacy model

Chapter 1

Introduction

Differential privacy [9] is a mathematical definition of privacy applied to analyses of personal data. It requires that any observer of the outcome of the data analysis cannot determine whether a specific individual is included in the data set. Differential privacy requires that the output distributions on two neighbouring inputs are similar, where neighbouring datasets are two datasets that differ in only one element, i.e., for x = (x_1, . . . , x_n) and x′ = (x′_1, . . . , x′_n), there exists exactly one i ∈ {1, . . . , n} such that x_i ≠ x′_i. More formally, with parameters ε and δ, an algorithm M satisfies (ε, δ)-differential privacy if for any event T,

Pr[M(x) ∈ T] ≤ e^ε · Pr[M(x′) ∈ T] + δ.

There are several different settings for differential privacy. The most widely researched setting is the curator model, also known as the central model, where the data is collected by a trusted curator who then outputs an aggregate answer, computed with differential privacy. To provide differential privacy, the answer must be randomized. For example, suppose x ∈ {0, 1}^n and the goal is to approximate f(x) = Σ_{i=1}^n x_i. Dwork et al. [9] presented a mechanism satisfying (ε, 0)-differential privacy:

M(x) = Σ_{i=1}^n x_i + Lap(1/ε).

A disadvantage of the curator model is that it relies on a trusted curator. The local model [17] avoids this drawback by eliminating the need for a trusted curator. In the local model, instead of directly sharing raw data, each user randomizes their data by applying a differentially private mechanism, called a local randomizer, before sharing it with an analyst. The data analyst post-processes the messages it received to publish a differentially private answer. As the adversary can view all users' outputs, the local model must satisfy, for all neighboring (x_1, . . . , x_n), (x′_1, . . . , x′_n) and all events T:

Pr[(R_1(x_1), . . . , R_n(x_n)) ∈ T] ≤ e^ε · Pr[(R_1(x′_1), . . . , R_n(x′_n)) ∈ T] + δ,

or, equivalently,

Pr[R_i(x) ∈ T] ≤ e^ε · Pr[R_i(x′) ∈ T] + δ,

for all i ∈ [n], x, x′ and T. However, processing personal information in the local model restricts utility (compared with the curator model). For example, any differentially private mechanism computing a binary sum in the local model incurs noise of at least Ω_ε(√n) [5], while in the curator model the mechanism presented above yields O_ε(1) noise [9].

Considering the advantages and issues of the curator and local models, it is natural to seek alternative models. The shuffle model is one alternative approach, composed of an untrusted curator, local randomizers, and a trusted shuffler [3, 6]. In the shuffle model, each individual sends (potentially noisy) data to the shuffler, which randomly permutes the messages and then submits them to an analyzer. The random permutation potentially disassociates messages from their senders, and hence potentially amplifies privacy. The pan-private model [10] is another alternative model, where the curator is trusted, but an adversary can intrude on the internal state of the algorithm. A pan-private algorithm receives a stream of raw data and starts with an initial state.

In each time unit, the algorithm processes one element and updates its internal state by combining the element with the current state. At the end, the algorithm processes its final state and outputs the aggregate answer. Pan-privacy requires that both the internal state and the output are differentially private. Balcer et al. [1] discovered a connection between the pan-private model and the shuffle model. Concretely, if a shuffle private algorithm is robust, meaning it maintains differential privacy provided that a certain fraction of the agents participate, then there is a way to construct a corresponding pan-private algorithm. This thesis proves a lower bound on the sample complexity of parity learning in the robust shuffle model. The thesis also provides a robust shuffle parity learning algorithm, which shows that the bound is almost tight.

1.1 Results of this thesis

We first construct a reduction from pan-private parity learning to robust shuffle parity learning. This is based on the works of Balcer et al. [1] and Cheu and Ullman [7]. We then prove the lower bound on the sample complexity of parity learning algorithms in the robust shuffle model. This applies a result of Cheu and Ullman [7]: given a family of distributions {P_v}_{v∈V}, if a pan-private algorithm can distinguish the two mixture distributions P_V^n and U^n (see details in Section 3.3), then this algorithm has sample complexity

n = Ω(1 / (ε‖{P_v}‖_{∞→2})).

They also gave a family of hard distributions P_{d,k,α} and computed ‖P_{d,k,α}‖_{∞→2}. We show how to use a pan-private learner to distinguish P_V^n and U^n when the family of distributions is P_{d,k,1/2}. Then we apply the lower bound of Cheu and Ullman. We prove this lower bound is almost tight by constructing a concrete algorithm that learns parity functions. This uses the summation algorithm from the work of Balle et al. [2].

Chapter 2

Preliminaries

2.1 Differential privacy

Differential privacy is a standard requiring that the output be insensitive to any single element of the input data. By introducing randomness, the behaviour of a differentially private analysis is similar on neighbouring datasets that differ on the entry of any single individual. In mathematical language, x, x′ ∈ X^n is a pair of neighbouring datasets if

|{i : x_i ≠ x′_i}| = 1.

Definition 1 (Differential Privacy [9]) Let X be a data universe and let ε, δ ≥ 0. A (randomized) mechanism M : X^n → Y is (ε, δ)-differentially private if for every event T ⊆ Y and for all neighboring datasets x, x′ ∈ X^n,

Pr[M(x) ∈ T] ≤ e^ε Pr[M(x′) ∈ T] + δ,

where the probability is over the randomness of M.

The randomization in differential privacy is necessary: if an algorithm M is deterministic with M(x) = y_1 and M(x′) = y_2, where y_1 ≠ y_2, then whenever the output is y_1 we know the input cannot be x′. One example of a differentially private mechanism is randomized response, introduced by Warner [21]. The randomized function R(x) with input x ∈ {0, 1} is set as:

R(x) = x with probability e^ε/(e^ε + 1), and 1 − x with probability 1/(e^ε + 1).

It can be verified that

Pr[R(1) = 1] ≤ e^ε · Pr[R(0) = 1],   Pr[R(0) = 0] ≤ e^ε · Pr[R(1) = 0],
Pr[R(0) = 1] ≤ e^ε · Pr[R(1) = 1],   Pr[R(1) = 0] ≤ e^ε · Pr[R(0) = 0].
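As a sanity check, here is a small Python sketch (our illustration, not part of the thesis) that samples the randomizer and empirically estimates two of these probabilities; up to sampling error, the left value stays below the e^ε multiple of the right one.

import math
import random

def randomized_response(x: int, eps: float) -> int:
    """Report the true bit with probability e^eps/(e^eps + 1); otherwise flip it."""
    keep = math.exp(eps) / (math.exp(eps) + 1)
    return x if random.random() < keep else 1 - x

eps, trials = 1.0, 200_000
p11 = sum(randomized_response(1, eps) for _ in range(trials)) / trials  # ~ Pr[R(1) = 1]
p01 = sum(randomized_response(0, eps) for _ in range(trials)) / trials  # ~ Pr[R(0) = 1]
print(p11, math.exp(eps) * p01)  # the first value should not exceed the second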

Thus R(x) is (ε, 0)-differentially private. Applying any algorithm f to the output of an (ε, δ)-differentially private algorithm M yields a mechanism that also satisfies (ε, δ)-differential privacy.

Proposition 1 (Post-Processing) Let M : X^n → Y be an (ε, δ)-differentially private algorithm and let f : Y → Y′ be any algorithm. Then f ◦ M : X^n → Y′ is (ε, δ)-differentially private.

We can combine two differentially private mechanisms to create a new differentially private mechanism. As each of the two analyses can contribute to the leakage of information, the parameters of the combined analysis cannot be smaller than those of the individual analyses. A collection of composition theorems controls how the parameters ε, δ grow. Basic composition, below, states that the growth is at most linear.

Theorem 1 (Basic Composition [8, 11]) Let mechanism M_1 be (ε_1, δ_1)-differentially private and let mechanism M_2 be (ε_2, δ_2)-differentially private. Define M_{1,2}(x) = (M_1(x), M_2(x)) as the combination of M_1 and M_2. Then M_{1,2} satisfies (ε_1 + ε_2, δ_1 + δ_2)-differential privacy.

Better composition theorems exist, where the growth of the privacy parameters is sub-linear. One such theorem is provided by Dwork, Rothblum, and Vadhan:

Theorem 2 (Advanced Composition [12]) Given M_1, M_2, . . . , M_k, which are each (ε, δ)-differentially private, the composition M = (M_1, M_2, . . . , M_k) is (ε′, kδ + δ′)-differentially private, where

ε′ = √(2k ln(1/δ′)) · ε + k · ε(e^ε − 1).

Specifically, when ε′ < 1, to ensure that the composed mechanism is (ε′, kδ + δ′)-differentially private, it suffices that each mechanism is (ε, δ)-differentially private, where

ε = ε′ / (2√(2k ln(1/δ′))).
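To make the comparison with basic composition concrete, here is a small Python helper (our illustration, not part of the thesis) that evaluates the advanced-composition bound:

import math

def advanced_composition_eps(eps: float, k: int, delta_prime: float) -> float:
    """Overall eps' for k-fold composition of (eps, delta)-DP mechanisms:
    eps' = sqrt(2k ln(1/delta')) * eps + k * eps * (e^eps - 1)."""
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * (math.exp(eps) - 1))

# Composing k = 100 mechanisms, each (0.01, 0)-differentially private:
print(advanced_composition_eps(0.01, 100, 1e-6))  # ~0.54, vs. 1.0 under basic composition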

2.2 Models of computation for differential privacy

2.2.1 The curator model

In the curator model of differential privacy, it is assumed that the data analyst is trustworthy. The curator collects individual information in the clear and produces the outcome of a differentially private computation. To make the computation private, the curator introduces random noise. One widely used method is the Laplace mechanism [9], which guarantees differential privacy by adding noise sampled from the Laplace distribution.

Figure 2.1: The curator model of differential privacy

Definition 2 (The Laplace Distribution) The probability density function of the zero-centered Laplace distribution Lap(b) is

h(x) = (1/2b) · exp(−|x|/b).

The variance of Lap(b) is 2b².

Definition 3 (Global ℓ_1 sensitivity) Given a function f : X^n → R^k, the global sensitivity of f is ∆f = max ‖f(x) − f(x′)‖_1, where the maximum is taken over neighboring x, x′ ∈ X^n.

Theorem 3 (The Laplace Mechanism [9]) Given a function f : X^n → R^k, the mechanism

M_L(x) = f(x) + (Lap_1(∆f/ε), . . . , Lap_k(∆f/ε)),

where Lap_1, . . . , Lap_k are independent Laplace samples, is ε-differentially private.

For example, suppose x = (x_1, x_2, . . . , x_n) ∈ {0, 1}^n and f(x) = Σ_{i=1}^n x_i. We have that ∆f = 1, and hence M_L(x) = Σ_{i=1}^n x_i + Lap(1/ε) is ε-differentially private.
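As a quick illustration (ours, not from the thesis), the counting example above in a few lines of Python:

import numpy as np

def laplace_count(x: np.ndarray, eps: float, rng=np.random.default_rng()) -> float:
    """eps-DP release of a binary sum: the count has sensitivity 1,
    so Lap(1/eps) noise suffices."""
    return float(x.sum() + rng.laplace(loc=0.0, scale=1.0 / eps))

x = np.random.default_rng(0).integers(0, 2, size=1000)  # 1000 private bits
print(x.sum(), laplace_count(x, eps=0.5))               # true vs. noisy count

The Laplace mechanism applies to computations that map to real numbers. The discrete version of it is the discrete Laplace mechanism, which maps to the integers.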

Definition 4 (Discrete Laplace Distribution [14]) For any integer k, the probability mass function of the discrete Laplace distribution DLap(α), with parameter α > 1, is

p(k) = ((α − 1)/(α + 1)) · α^{−|k|}.
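A convenient way to sample this distribution (a sketch of ours, relying on a standard decomposition rather than anything specific to [14]) is as the difference of two geometric variables:

import numpy as np

def sample_dlap(alpha: float, size: int, rng=np.random.default_rng()) -> np.ndarray:
    """Sample DLap(alpha) for alpha > 1: p(k) proportional to alpha^(-|k|).
    If G1, G2 count the failures before the first success, each with success
    probability 1 - 1/alpha, then G1 - G2 has exactly this two-sided law."""
    p = 1.0 - 1.0 / alpha
    g1 = rng.geometric(p, size) - 1  # numpy's geometric takes values 1, 2, ...
    g2 = rng.geometric(p, size) - 1
    return g1 - g2

samples = sample_dlap(alpha=np.e, size=100_000)  # alpha = e^eps with eps = 1
print(samples.mean(), samples.var())  # mean ~0; variance 2*alpha/(alpha-1)^2 ~ 1.84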

2.2.2 The local model

In the local model, each user outputs noisy data through a local randomizer; an analyzer then post-processes the received noisy data and publishes the output. The local model requires that each local randomizer is differentially private. By the post-processing property of differential privacy, the adversary's view is then differentially private.

Definition 5 (Local Model, Non-Interactive [13, 18, 21]) A local model protocol M = ((R_1, R_2, . . . , R_n), A) is (ε, δ)-differentially private if (R_1(x_1), . . . , R_n(x_n)) is (ε, δ)-differentially private or, equivalently, if R_i(x_i) is (ε, δ)-differentially private for all i ∈ [n].

Figure 2.2: The local model

This definition can be generalized to the interactive local model. In the context of private learning (discussed below), Kasiviswanathan et al. [18] proved an equivalence between the local model and statistical query (SQ) learning. As an example of a computation in the local model of differential privacy, to compute bit addition the local randomizer can be set to the randomized response functionality [21]:

R(x) = x with probability e^ε/(e^ε + 1), and 1 − x with probability 1/(e^ε + 1).

The aggregator can be set as:

A(r_1, . . . , r_n) = ((e^ε + 1)/(e^ε − 1)) · (Σ_{i=1}^n r_i − n/(e^ε + 1)).

It is easy to verify that E[A(R(x_1), . . . , R(x_n))] = Σ_{i=1}^n x_i. By the Chernoff bound, the error is less than O(√n/ε) with constant probability.
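A runnable Python sketch of this protocol (our illustration; the function names are ours):

import math
import random

def local_randomizer(x: int, eps: float) -> int:
    """Randomized response: each user perturbs their own bit."""
    keep = math.exp(eps) / (math.exp(eps) + 1)
    return x if random.random() < keep else 1 - x

def aggregate(reports, eps: float) -> float:
    """Debias the sum of the randomized-response reports."""
    e = math.exp(eps)
    n = len(reports)
    return (e + 1) / (e - 1) * (sum(reports) - n / (e + 1))

random.seed(0)
n, eps = 10_000, 1.0
data = [random.randint(0, 1) for _ in range(n)]
reports = [local_randomizer(x, eps) for x in data]
print(sum(data), round(aggregate(reports, eps)))  # error is O(sqrt(n)/eps) w.h.p.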

2.2.3 The shuffle model

In the shuffle model, there exists a trusted shuffler that applies a uniformly random permutation to the input messages. In this thesis, we focus on the variant presented in [3, 6], the Encode, Shuffle, Analyze (ESA) architecture. In the ESA framework, users encode their data in one or more messages. A shuffler collects and randomly permutes these messages. Finally, an analyst processes the permuted messages to generate an output. A similar model was studied by Ishai et al. [16] in the context of secure multiparty computation.

Definition 6 (Shuffle Model [3, 6]) A shuffle model protocol M = (R, S, A) consists of three kinds of randomized algorithms:

1. A set of randomizers R = (R_1, . . . , R_n), where for each i ∈ [n], R_i : X → Y^∗ maps an input to a number of messages.

2. A shuffler S : Y^∗ → Y^∗ that permutes the input messages uniformly at random.

3. An analyst A : Y^∗ → Z that computes the result of the protocol.

Definition 7 A shuffle model protocol M = (R, S, A) is (ε, δ)-differentially private if for every event T and all neighbouring datasets x = (x_1, . . . , x_n) and x′ = (x′_1, . . . , x′_n),

Pr[S(R_1(x_1), . . . , R_n(x_n)) ∈ T] ≤ e^ε Pr[S(R_1(x′_1), . . . , R_n(x′_n)) ∈ T] + δ.

Figure 2.3: The shuffle model

A shuffle model protocol that preserves differential privacy when only a fraction of the users are honest is called a robust shuffle model protocol. Formally:

Definition 8 (Robust Shuffle Model [1]) For γ ∈ (0, 1], a shuffle private algorithm M = (R, S, A) is (ε, δ, γ)-robust shuffle private if for any γ′ ≥ γ, any permutation function π : [n] → [n], and any x = (x_1, . . . , x_{γ′n}),

S(R_{π(1)}(x_1), . . . , R_{π(γ′n)}(x_{γ′n}))

is (ε, δ)-differentially private.

2.2.4 Pan-privacy

Pan-privacy is a requirement on online algorithms. The pan-private model, introduced by Dwork et al. [10], requires that differential privacy be preserved in the face of an intruder that can observe the outcome of the online algorithm as well as examine its internal state.

Definition 9 (Online Algorithm) An online algorithm M = (R, A) : X^n → Y consists of an internal algorithm R : I × X → I (where I is the algorithm's internal state space) and an analysis algorithm A : I → Y. The execution of M on input x ∈ X^n begins with the initial state S_0. Then, the internal algorithm R maps the current state S_{i−1} ∈ I and the input data point x_i to the new state S_i = R(S_{i−1}, x_i). At the end of the data stream, the analysis algorithm maps the final state to the output of the algorithm, A(S_n).

Definition 10 (Pan-privacy [10]) For an online algorithm M = (R, A) : X^n → Y, let S_{≤t}(x) = (S_0, S_1, . . . , S_t) ∈ I^{t+1} denote the first t + 1 internal states when M is executed on input x. M is (ε, δ)-pan-private if for every t, every pair of neighbouring inputs x, x′, and every event T ⊆ I^{t+1} × Y,

Pr[(S_{≤t}(x), M(x)) ∈ T] ≤ e^ε Pr[(S_{≤t}(x′), M(x′)) ∈ T] + δ.
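As a toy illustration (our own sketch, not the construction from [10]), consider counting the 1s in a bit stream while defending against a single intrusion: the state stores the running count plus Laplace noise drawn at initialization, so any single snapshot of the state is differentially private, and fresh noise at output covers elements that arrive after the intrusion.

import numpy as np

class PanPrivateBitCounter:
    """Toy pan-private counter for a stream of bits (single intrusion).
    The internal state is the running count plus Lap(1/eps) noise fixed
    at initialization; the output adds fresh Lap(1/eps) noise."""

    def __init__(self, eps: float, rng=np.random.default_rng()):
        self.eps = eps
        self.rng = rng
        self.state = rng.laplace(scale=1.0 / eps)  # S_0: noise only

    def process(self, x: int) -> None:
        self.state += x                            # S_i = R(S_{i-1}, x_i)

    def output(self) -> float:
        return self.state + self.rng.laplace(scale=1.0 / self.eps)

counter = PanPrivateBitCounter(eps=0.5)
for bit in np.random.default_rng(1).integers(0, 2, size=1000):
    counter.process(int(bit))
print(counter.output())  # close to the true count (about 500 here)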

Figure 2.4: The pan-privacy model

2.3 Private learning

A concept is a Boolean mapping c : X → {0, 1}. A concept class C = {c} is a collection of concepts. Let P be a distribution over X and let h : X → {0, 1} be a hypothesis. The generalization error between hypothesis h and concept c is defined as:

err_P(c, h) = Pr_{x∼P}[h(x) ≠ c(x)].

Definition 11 (PAC Learning [20]) A concept class C is (α, β, m)-PAC learnable if there exists an algorithm L such that for any distribution P over X and all concepts c ∈ C,

Pr[{x_i}_{i=1}^m ∼ P^m ; h ← L({(x_i, c(x_i))}_{i=1}^m) : err_P(c, h) ≤ α] ≥ 1 − β,

where the probability is over the choice of x_1, . . . , x_m i.i.d. from P and the randomness of L. Letting d denote the size of the elements of X, we say that the concept class C is efficiently PAC learnable if it is (α, β, m)-PAC learnable with sample complexity m = poly(d, 1/α, log(1/β)).

Agnostic PAC learning [15] extends the definition of generalization error in PAC learning. Let P be a distribution over X × {0, 1}. Define

err_P(h) = Pr_{(x,y)∼P}[h(x) ≠ y].

Let err_OPT denote the minimal possible error over hypotheses taken from the hypothesis class H, i.e.,

err_OPT = min_{h∈H} err_P(h).

Definition 12 (Agnostic PAC Learning with hypothesis class H [15]) A hypothesis class H is (α, β, m)-agnostic PAC learnable if there exists an algorithm L such that for any distribution P over X × {0, 1},

Pr[{(x_i, y_i)}_{i=1}^m ∼ P^m ; h ← L({(x_i, y_i)}_{i=1}^m) : err_P(h) ≤ err_OPT + α] ≥ 1 − β,

where the probability is over the choice of (x_1, y_1), . . . , (x_m, y_m) i.i.d. from P and the randomness of L.

Kasiviswanathan et al. [18] defined private PAC learning algorithms to be PAC learning algorithms that satisfy differential privacy.

Definition 13 (Privately PAC Learning [18]) A concept class C is privately PAC learnable by algorithm L with parameters α, β, m, ε, δ if L is (ε, δ)-differentially private and L (α, β, m)-PAC learns the concept class C.

Definition 14 (Privately Agnostic PAC Learning [18]) A hypothesis class H is privately agnostically PAC learnable by algorithm L with parameters α, β, m, ε, δ if L is (ε, δ)-differentially private and L (α, β, m)-agnostic PAC learns the hypothesis class H.

The statistical query (SQ) model [19] restricts the learning algorithm to observing only statistical properties of the distribution. In SQ learning, the learner accesses its data through an SQ oracle, which answers statistical queries with bounded error.

Definition 15 (SQ Oracle [19]) Let P be a distribution over the domain X. A statistical query consists of a function f : X → [0, 1] and a tolerance parameter τ = 1/poly(d). An SQ oracle O_P^{SQ} answers this statistical query by outputting a value v ∈ [0, 1] such that

|v − E_{x∼P}[f(x)]| ≤ τ.

Definition 16 (SQ Learning [19]) An SQ learning algorithm with parameters α, β, τ is a PAC learning algorithm that accesses its data through O_P^{SQ} and outputs a hypothesis h such that

Pr[err_P(h) ≥ α] ≤ β.

In this thesis we focus on learning parity functions.

Definition 17 (weight k parity) Let PARITY_{d,k} = {c_{ℓ,b}}_{ℓ⊆[d], |ℓ|≤k, b∈{±1}}, where c_{ℓ,b} : {±1}^d → {±1} is defined as c_{ℓ,b}(x) = b · ∏_{i∈ℓ} x_i. When k = d we omit k and write PARITY_d.

Definition 18 A distribution-free parity learner is a PAC learning algorithm for PARITY_{d,k}. A uniform distribution parity learner is a PAC learning algorithm for PARITY_{d,k} where the underlying distribution P is known to be uniform over X = {±1}^d.
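For concreteness, here is a short Python sketch (our illustration; indices are 0-based here, while the text uses [d] = {1, . . . , d}) of a parity concept and of drawing labeled examples under the uniform distribution:

import numpy as np

def parity_concept(ell: list[int], b: int):
    """The concept c_{ell,b}(x) = b * prod_{i in ell} x[i] over {+1, -1}^d."""
    return lambda x: b * int(np.prod(x[ell]))

def sample_uniform_parity(d: int, ell: list[int], b: int, m: int,
                          rng=np.random.default_rng()):
    """Draw m labeled examples for a uniform distribution parity learner."""
    xs = rng.choice([-1, 1], size=(m, d))
    c = parity_concept(ell, b)
    ys = np.array([c(x) for x in xs])
    return xs, ys

xs, ys = sample_uniform_parity(d=8, ell=[0, 3, 5], b=1, m=5)
print(xs, ys)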

Chapter 3

Related Work

3.1 Non-private and private learning

Kasiviswanathan et al. [18] proved an equivalence between the statistical queries model and the local model, i.e., an algorithm in one model can be simulated by an algorithm in the other model. For the parity learning problem, Kasiviswanathan et al. [18] presented an efficient private PAC learning algorithm in the curator model, but Blum et al. [4] proved that parity cannot be efficiently learned in the SQ model. This shows a separation between the local model and the private PAC learning model. In this thesis, we prove that learning parity functions in the robust shuffle model requires Ω(2^{d/2}/ε) samples.

3.2 Reduction from pan-private algorithms to shuffle private algorithms

Balcer et al. [1] established a relationship between the robust shuffle model and the pan-private model, and then used lower bounds in the pan-privacy setting to establish lower bounds for the robust shuffle model. As one example from their work, given an (ε, δ, 1/3)-robust shuffle uniformity testing algorithm (informally, a distinguisher deciding whether samples come from the uniform distribution or from a distribution that is "far" from uniform), they constructed a pan-private algorithm for the uniformity testing problem. To maintain pan-privacy, the initial state is generated by applying the 1/3-robust algorithm to N/3 data points sampled from the uniform distribution. When a data point arrives, the internal state is updated as follows: (i) the new data point is mapped into several messages, which are concatenated with the current state; (ii) the shuffler then randomly permutes all messages and sets them as the new state. Before the final state is handed to the analyst, messages produced from an extra N/3 data points sampled from the uniform distribution are added to it. The analyst then computes and publishes the final output of the algorithm. This idea was generalized by Cheu and Ullman [7], as presented in Algorithm 1.

Algorithm 1: M^Π, a pan-private algorithm constructed from the robust shuffle private algorithm Π [7]
Let Π = (R, S, A) be a 1/3-robust differentially private shuffle model protocol.
Input: Data stream x ∈ X^{n/3}.
1 Create initial state s_0 ← S(R^{n/3}(U^{n/3})).
2 Sample N′ ∼ Bin(n, 2/9).
3 Set N′ ← min(N′, n/3).
4 for i ∈ [n/3] do
5   if i ∈ [N′] then
6     w_i ← x_i
7   else
8     w_i ∼ U
9   end
10   s_i ← S(s_{i−1}, R(w_i))
11 end
12 s_final ← S(s_{n/3}, R^{n/3}(U^{n/3}))
13 return A(s_final)

Theorem 4 ([1, 7]) Given an (ε, δ, 1/3)-robust shuffle private algorithm Π, Algorithm M^Π is (ε, δ)-pan-private.

Proof. (sketch, following [1, 7]) Let x and x′ be two neighboring data sets and let j be the index where x and x′ differ. Let 1 ≤ t ≤ n/3 be the time at which an adversary probes the algorithm's memory.

If t ≥ j, then S_{≤t} = (S ◦ (R_1, . . . , R_{n/3+t}))(U^{n/3}, w_1, . . . , w_t) and, as Π is a robust differentially private mechanism, S_{≤t} preserves (ε, δ)-differential privacy. Because A(s_final) is a post-processing of S_{≤t}, the outcome of M^Π is (ε, δ)-pan-private.

If t < j, then S_{≤t}(x) is identically distributed to S_{≤t}(x′). Note that, as Π is a robust differentially private mechanism, we get that

σ = (S ◦ (R_t, . . . , R_{2n/3−t}))(S_{≤t}, w_{t+1}, . . . , w_{N′}, U^{n/3})

preserves (ε, δ)-differential privacy. To conclude the proof, note that (S_{≤t}(x), A(s_final)) is the result of post-processing σ.

3.3 Hard tasks for pan-private mechanisms

Cheu and Ullman [7] proved a lower bound on the sample complexity of pan-private algorithms that can distinguish between a family of distributions and their uniform mixture. Concretely, let {P_v}_{v∈V} be a family of distributions and define U = E_{v∈V}(P_v). Let U^n be the Cartesian product of n copies of U and let P_V^n = E_{v∈V}(P_v^n). The (∞ → 2)-norm of {P_v}_{v∈V} is defined as

‖{P_v}‖²_{∞→2} = sup_{f:X→[±1]} E_{v∼V}[(E_{x∼P_v}[f(x)] − E_{x∼U}[f(x)])²].

Theorem 5 ([7]) Let {P_v}_{v∈V} be a family of distributions and let M be an (ε, δ)-pan-private algorithm such that δ log²(|V|/δ) = o(ε²‖{P_v}‖²_{∞→2}) and d_TV(M(P_V^n), M(U^n)) = Ω(1). Then

n = Ω(1 / (ε‖{P_v}‖_{∞→2})).

They provided a family of hard distributions. Let x ∈ {±1}^d, let α ∈ (0, 1/2] be a parameter (Cheu and Ullman restricted α to the open interval (0, 1/2), but their proof also holds for α = 1/2), let ℓ ⊆ [d] be a non-empty set, and let b ∈ {±1} be a bit. The distribution P_{d,ℓ,b,α} is defined as:

P_{d,ℓ,b,α}(x) = (1 + 2α) · 2^{−d}  if ∏_{i∈ℓ} x_i = b,
P_{d,ℓ,b,α}(x) = (1 − 2α) · 2^{−d}  if ∏_{i∈ℓ} x_i = −b.

Define the family of distributions

P_{d,k,α} = {P_{d,ℓ,b,α} : ℓ ⊆ [d], |ℓ| ≤ k, b ∈ {±1}}.

In particular, for α = 1/2 we get

P_{d,ℓ,b,1/2}(x) = 2^{−d+1}  if ∏_{i∈ℓ} x_i = b,
P_{d,ℓ,b,1/2}(x) = 0         if ∏_{i∈ℓ} x_i = −b,

and

P_{d,k,1/2} = {P_{d,ℓ,b,1/2} : ℓ ⊆ [d], |ℓ| ≤ k, b ∈ {±1}}.

Cheu and Ullman proved an upper bound on the (∞ → 2)-norm of P_{d,k,α}:

Lemma 1 ([7]) For d ∈ N and α ∈ (0, 1/2],

‖P_{d,k,α}‖²_{∞→2} ≤ 4α² / (d choose ≤k),

where (d choose ≤k) = Σ_{i=0}^{k} (d choose i). Substituting this bound in Theorem 5 results in the following theorem:

Theorem 6 ([7], rephrased) Let P_{d,L,B,α} be a distribution chosen uniformly at random from P_{d,k,α}, where L ⊆ [d] is a random set with |L| ≤ k and B is uniform over {±1}. For an (ε, δ)-pan-private algorithm M such that δ log²((d choose ≤k)/δ) = o(ε²α²/(d choose ≤k)) and d_TV(M(P_{d,L,B,α}^n), M(U^n)) = Ω(1), we have

n = Ω(√(d choose ≤k) / (εα)).

We apply this lower bound to prove a lower bound on the sample complexity of parity learning algorithms, which in turn implies the lower bound for robust shuffle parity learners.

3.4 Private summation protocol in shuffle model

Balle et al. [2] constructed a real summation protocol in the shuffle model. In their protocol, for each user's input x, the local randomizer computes the noisy value y = x + Polya(1/n, α) − Polya(1/n, α) and encodes y as k uniformly random numbers (y_1, . . . , y_k) satisfying Σ_{i=1}^k y_i = y. The analyzer computes the sum of all received messages; the total additive noise Σ_{i=1}^n (Polya(1/n, α) − Polya(1/n, α)) follows the discrete Laplace distribution, and the protocol is hence differentially private. In our robust shuffle parity learner, we use the summation protocol of Balle et al. to estimate and compare the accuracy of all possible parity functions on the given samples.
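The key fact is the divisibility of the discrete Laplace distribution. A numpy sketch of the noise-generation step (our illustration; the actual protocol additionally splits each noisy value into uniformly random messages for the shuffler):

import numpy as np

def polya(r: float, alpha: float, size: int, rng) -> np.ndarray:
    """Polya(r, alpha): a negative binomial with real-valued shape r,
    P(k) proportional to binom(k + r - 1, k) * alpha^k * (1 - alpha)^r."""
    return rng.negative_binomial(r, 1.0 - alpha, size)

rng = np.random.default_rng(0)
n, eps = 1000, 1.0
alpha = np.exp(-eps)
# Each of the n users adds the difference of two Polya(1/n, alpha) draws;
# summed over all users, the total noise is distributed as DLap(e^eps).
noise = (polya(1.0 / n, alpha, n, rng) - polya(1.0 / n, alpha, n, rng)).sum()
print(noise)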

Chapter 4

A Lowerbound on the Sample Complexity of Parity Learning in the Shuffle Model

4.1 From robust shuffle model parity learner to a pan-private parity learner

We show how to construct, given a robust shuffle model distribution-free parity learner, a uniform distribution pan-private parity learner. Our reduction, Algorithm LearnParUnif, is described in Algorithm 2. We use a technique similar to the padding presented in [1, 7], with small modifications. To allow the shuffle model protocol to use different randomizers R_1, . . . , R_n, the pan-private learner applies these randomizers in a random order (the random permutation π). The padding is done with samples of the form (1^d, b̂), where b̂ is a uniformly random bit. Finally, as in [7], the number of labeled samples which the pan-private algorithm takes from its input is binomially distributed, so that if the (x_i, y_i) are such that x_i is uniform in X and y_i = c_{r,b}(x_i) = b · ∏_{i∈r} x_i, then (after a random shuffle) the input distribution presented to the shuffle model protocol is statistically close to a mixture of the two following distributions: (i) a distribution where Pr[(x_i, y_i) = (1^d, b̂)] = 1, and (ii) a distribution where x_i is uniformly selected in {±1}^d and y_i = c_{r,b}(x_i).

Proposition 2 Algorithm LearnParUnif is (ε, δ)-pan-private.

Proof. The proof is identical to the proof of Theorem 4 [1, 7].

Algorithm 2: LearnParUnif, a uniform distribution pan-private parity learner
Let M = ((R_1, . . . , R_n), S, A) be a 1/3-robust differentially private distribution-free parity learner.
Input: n/3 labeled examples (x_i, y_i), where x_i ∈ X and y_i ∈ {±1}.
1 Randomly choose a permutation π : [n] → [n].
2 Randomly choose b̂ ∈_R {±1}.
3 Create initial state s_0 ← S(R_{π(1)}(1^d, b̂), . . . , R_{π(n/3)}(1^d, b̂)).
4 Sample N′ ∼ Bin(n, 2/9).
5 Set N′ ← min(N′, n/3).
6 for i ∈ [n/3] do
7   if i ∈ [N′] then
8     w_i ← (x_i, y_i)
9   else
10     w_i ← (1^d, b̂)
11   end
12   s_i ← S(s_{i−1}, R_{π(n/3+i)}(w_i))
13 end
14 s_final ← S(s_{n/3}, R_{π(2n/3+1)}(1^d, b̂), . . . , R_{π(n)}(1^d, b̂))
15 return A(s_final)

Proposition 3 (learning) Let M be an (α, β, m) distribution-free parity learner, where α, β < 1/4 and m = n/9. Then Algorithm LearnParUnif is a uniform distribution parity learner that, with probability at least 1/4, correctly identifies the concept c_{r,b}.

Proof. (sketch) Algorithm LearnParUnif correctly guesses the label b for 1^d with probability 1/2. Assuming b̂ = b, the application of M uniquely identifies (r, b) with probability at least 1/2. Thus, LearnParUnif recovers c_{r,b} with probability at least 1/4.

4.2 From pan-private parity learner to distinguishing hard distributions

In this section, we use Theorem 6 to obtain a lower bound on the sample complexity of parity learning in the shuffle model. In Algorithm 3, we provide a reduction from identifying the hard distributions P_{d,ℓ,b,1/2} presented in Section 3.3 to pan-private parity learning.

Algorithm 3: IdentifyHard, a pan-private algorithm for identifying the distribution P_{d,ℓ,b,1/2}
Let Π be a pan-private uniform distribution parity learner.
Input: A sample of n examples z = (z_1, z_2, . . . , z_n), where each example is of the form z_j = (z_j[1], z_j[2], . . . , z_j[d]) ∈ {±1}^d.
1 Randomly choose i^∗ ∈_R [d].
2 /* Apply the uniform distribution parity learner Π: */
3 for j ∈ [n] do
4   y_j ← z_j[i^∗]
5   x_j ← z_j
6   x_j[i^∗] ← ⊥  /* i.e., x_j equals z_j with entry i^∗ erased */
7   Provide (x_j, y_j) to Π.
8 end
9 (r, b) ← Π((x_1, y_1), . . . , (x_n, y_n))
10 ℓ ← r ∪ {i^∗}
11 return (ℓ, b)

Observation 1 The pan-privacy of Algorithm 3 follows from the pan-privacy of algorithm Π.

Proposition 4 Given a uniform distribution parity learner that, with probability at least 1/4, correctly identifies the concept c_{r,b}, Algorithm 3 correctly identifies the distribution P_{d,ℓ,b,1/2} with probability at least |ℓ|/4d.

Proof. Note that with probability |ℓ|/d we get that i^∗ ∈ ℓ, in which case the inputs x_1, . . . , x_n provided to the learner Π in Step 7 are uniformly distributed in {±1}^{d−1} and y_j = b · ∏_{i∈ℓ\{i^∗}} x_j[i], i.e., the inputs to Π are consistent with the concept c_{ℓ\{i^∗},b}. Hence, conditioned on i^∗ ∈ ℓ, Π recovers (ℓ \ {i^∗}, b) with probability at least 1/4, and Algorithm 3 outputs (ℓ, b).

On the uniform distribution, the generalization error of any parity function is 1/2. On P_{d,ℓ,b,1/2}, Algorithm 3 succeeds in identifying (ℓ, b) with probability at least |ℓ|/4d. Algorithm 4 evaluates the generalization error of the concept learned in Algorithm 3, towards exhibiting a large total variation distance between P_{d,L,B,1/2}^n and U^n.

Algorithm 4: DistPU, a distinguisher for P_{d,L,B,1/2}^{n+m} and U^{n+m}
Let M = ((R_1, . . . , R_n), S, A) be the pan-private algorithm described in Algorithm 3.
Input: A sample of n + m examples z = (z_1, z_2, . . . , z_{n+m}), where m = max{512d/k, 64√(2d/k)/ε} and each example is of the form z_j = (z_j[1], z_j[2], . . . , z_j[d]) ∈ {±1}^d.
1 Let (ℓ, b) be the outcome of executing M on the first n examples z_1, . . . , z_n.
2 c ← Lap(1/ε)
3 for i ∈ [m] do
4   if ∏_{j∈ℓ} z_{i+n}[j] = b then c ← c + 1
5 end
6 c^∗ ← c + Lap(1/ε)
7 if c^∗ ≥ 3m/4 then return 1 else return 0

Observe that if z ∼ P_{d,L,B,1/2}^{n+m}, then in every execution of Algorithm 4 there exist ℓ ⊆ [d] of cardinality at most k and b ∈ {±1} such that z ∼ P_{d,ℓ,b,1/2}^{n+m}.

Proposition 5 Pr_{z∼P_{d,L,B,1/2}^{n+m}}[DistPU(z) = 1] ≥ |ℓ|/8d.

Proof. For any z ∼ P_{d,ℓ,b,1/2}^{n+m}, every example satisfies ∏_{j∈ℓ} z_i[j] = b, so

Pr[DistPU(z) = 1] ≥ Pr[DistPU correctly identifies (ℓ, b)] · Pr[c^∗ ≥ 3m/4]
  ≥ (|ℓ|/4d) · Pr[Lap(1/ε) + Lap(1/ε) ≥ −m/4]
  ≥ (|ℓ|/4d) · (1/2)   (by symmetry of Lap around 0)
  = |ℓ|/8d.

Proposition 6 Pr_{z∼U^{n+m}}[DistPU(z) = 1] ≤ k/64d.

Proof. For all (ℓ, b), we have that Pr_{z∼U}[∏_{j∈ℓ} z[j] = b] = 1/2, so we have

Pr_{z∼U^{n+m}}[DistPU(z) = 1] = Pr[Bin(m, 1/2) + Lap(1/ε) + Lap(1/ε) ≥ 3m/4]
  ≤ Pr[|Bin(m, 1/2) + Lap(1/ε) + Lap(1/ε) − m/2| ≥ m/4]
  ≤ (m/4 + 2/ε² + 2/ε²) / (m²/16)   (Chebyshev's inequality)
  = 4/m + 64/(ε²m²)
  ≤ k/128d + k/128d = k/64d.

Proposition 7 d_TV(DistPU(U^{n+m}), DistPU(P_{d,L,B,1/2}^{n+m})) ≥ k/64d.

Proof.

d_TV(DistPU(U^{n+m}), DistPU(P_{d,L,B,1/2}^{n+m}))
  ≥ Pr_{z∼P_{d,L,B,1/2}^{n+m}}[DistPU(z) = 1] − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  = Σ_{ℓ⊆[d], |ℓ|≤k, b∈{±1}} Pr_{z∼P_{d,ℓ,b,1/2}^{n+m}}[DistPU(z) = 1] · Pr[(L, B) = (ℓ, b)] − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  ≥ Σ_{ℓ⊆[d], k/2≤|ℓ|≤k, b∈{±1}} Pr_{z∼P_{d,ℓ,b,1/2}^{n+m}}[DistPU(z) = 1] · Pr[(L, B) = (ℓ, b)] − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  ≥ Σ_{ℓ⊆[d], k/2≤|ℓ|≤k, b∈{±1}} (k/16d) · Pr[(L, B) = (ℓ, b)] − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  = (k/16d) · Pr[|L| ≥ k/2] − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  = (k/16d) · ((d choose ≤k) − (d choose ≤k/2)) / (d choose ≤k) − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  ≥ k/32d − Pr_{z∼U^{n+m}}[DistPU(z) = 1]
  ≥ k/64d.

The next-to-last inequality follows from ((d choose ≤k) − (d choose ≤k/2)) / (d choose ≤k) ≥ 1/2, i.e., from (d choose ≤k) ≥ 2(d choose ≤k/2): if k = d this is immediate; otherwise (k < d), for 0 ≤ i ≤ ⌊k/2⌋ the difference between ⌊k/2⌋ + 1 + i and d/2 is smaller than the difference between ⌊k/2⌋ − i and d/2, hence (d choose ⌊k/2⌋−i) < (d choose ⌊k/2⌋+1+i), and thus (d choose ≤k) = Σ_{0≤i≤⌊k/2⌋} (d choose i) + Σ_{⌊k/2⌋+1≤i≤k} (d choose i) > 2 Σ_{0≤i≤⌊k/2⌋} (d choose i) = 2(d choose ≤k/2).

In particular, for all k we get that d_TV(DistPU(U^{n+m}), DistPU(P_{d,L,B,1/2}^{n+m})) ≥ k/64d, and for k = d we get d_TV(DistPU(U^{n+m}), DistPU(P_{d,L,B,1/2}^{n+m})) ≥ 1/64.

Theorem 7 Any (ε, δ, 1/3)-robust shuffle private distribution-free parity learning algorithm, where ε = O(1), has sample complexity

n = Ω(2^{d/2} / ε).

Proof. Let k = d. Applying Theorem 6, DistPU has sample complexity

n + m = Ω(2^{d/2} / ε).

Since k ≥ 1 and ε = O(1), we have m = O(d/ε). By the construction of Algorithm DistPU from an (ε, δ, 1/3)-robust shuffle private parity learning algorithm (via Algorithms 2 and 3), any (ε, δ, 1/3)-robust shuffle private parity learning algorithm has sample complexity

n = Ω(2^{d/2} / ε).

This completes our proof.

4.3 Tightness of the lowerbound

We now observe that Theorem 7 is almost tight, as there exists a 1/3-robust agnostic parity learner in the shuffle model with a nearly matching sample complexity. For every possible hypothesis (ℓ, b) (there are 2^{d+1} hypotheses), the learner estimates the number of samples that are consistent with the hypothesis,

c_{ℓ,b} = |{i : b · ∏_{j∈ℓ} x_i[j] = y_i}|.

One possibility for counting the number of consistent samples is to use the protocol by Balle et al. [2], which is an (ε, δ)-differentially private one-round shuffle model protocol for estimating Σ a_i, where a_i ∈ [0, 1]. The outcome of this protocol is statistically close to Σ a_i + DLap(e^ε), and the statistical distance δ can be made arbitrarily small by increasing the number of messages sent by each agent. (In the discrete Laplace distribution DLap(e^ε), introduced in Definition 4, the probability of selecting i ∈ Z is proportional to e^{−ε|i|}.) The protocol uses the divisibility of discrete Laplace random variables, generating discrete Laplace noise ν as the sum of differences of Polya random variables:

ν = Σ_{i=1}^n (Polya(1/n, e^{−ε}) − Polya(1/n, e^{−ε})).

To make the protocol γ-robust, we slightly change the noise generation to guarantee (ε, δ)-differential privacy even in the case where only n/3 parties participate in the protocol. This can be done by changing the first parameter of the Polya random variables to 3/n, resulting in ν = Σ_{i=1}^n (Polya(3/n, e^{−ε}) − Polya(3/n, e^{−ε})). Observe that ν is then distributed as the sum of three independent DLap(e^ε) random variables. Using this protocol, the analyzer can compute a noisy estimate of the number of samples consistent with each hypothesis,

c̃_{ℓ,b} = c_{ℓ,b} + ν,

and then output

(ℓ̂, b̂) = argmax_{ℓ,b} c̃_{ℓ,b}.

The sample complexity of this learner is O_{α,β,ε,δ}(d · 2^{d/2}). The resulting protocol is presented in Algorithm 5; a sketch of the selection step appears below. We denote by k the size of the concept class, i.e., k = 2^{d+1}, and Par_{ℓ,b} denotes the parity function with parameters ℓ and b.
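To illustrate the selection step, here is a central-model Python sketch (ours; it replaces the robust shuffle summation protocol with directly added Laplace noise, so it shows the statistics of the learner rather than its distributed implementation):

import itertools
import numpy as np

def learn_parity_agnostic(xs: np.ndarray, ys: np.ndarray, eps: float,
                          rng=np.random.default_rng()):
    """Compute a noisy count of consistent samples for each of the 2^(d+1)
    parity hypotheses and return the argmax."""
    d = xs.shape[1]
    best, best_score = None, -np.inf
    for r in range(d + 1):
        for ell in itertools.combinations(range(d), r):
            for b in (-1, 1):
                preds = b * np.prod(xs[:, list(ell)], axis=1, dtype=int)
                score = int((preds == ys).sum()) + rng.laplace(scale=1.0 / eps)
                if score > best_score:
                    best, best_score = (ell, b), score
    return best

xs = np.random.default_rng(2).choice([-1, 1], size=(2000, 6))
ys = np.prod(xs[:, [1, 4]], axis=1)            # labels from a parity on two coordinates
print(learn_parity_agnostic(xs, ys, eps=1.0))  # typically ((1, 4), 1)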

Proposition 8 (privacy) For ε < 1, LearnParity is (ε, δ, γ)-robust shuffle private, where δ = k · δ′ + δ^∗.

Algorithm 5: LearnParity, an agnostic parity learning algorithm
Let ε′ = ε / (4√(2^d ln(1/δ^∗))). Let ShuffleCount be an (ε′, δ′, γ)-robust shuffle protocol that computes the sum of {0, 1} bits.
Input: n ≥ max{ 36((d+2) ln 2 − ln β)/α², 48(ln 3 + (d+2) ln 2)√(2^d ln(1/δ^∗))/(αε) } labeled examples (x_i, y_i), where x_i ∈ {±1}^d and y_i ∈ {±1}.
1 for ℓ ⊆ [d], b ∈ {±1} do
2   Apply ShuffleCount to obtain a noisy count c̃_{ℓ,b} of the samples for which b · ∏_{j∈ℓ} x_i[j] = y_i.
3 end
4 (ℓ̂, b̂) ← argmax_{ℓ,b} {c̃_{ℓ,b}}
5 return (ℓ̂, b̂)

Proof. LearnParity performs k counting computations using ShuffleCount and then selects the largest count. By the corollary of advanced composition (Theorem 2), setting ε′ = ε/(2√(2k ln(1/δ^∗))) (with k = 2^{d+1}, this equals the ε′ set in Algorithm 5) makes LearnParity (ε, kδ′ + δ^∗)-differentially private. Since ShuffleCount is γ-robust, LearnParity is γ-robust.

To prove that LearnParity is an (α, β)-agnostic parity learner, we show that (i) the true number of samples that agree with a parity function is close to its expectation (Proposition 9), and (ii) the noisy estimate produced by ShuffleCount is close to the true number of samples that agree with the parity function (Proposition 10).

Let p_{ℓ,b} denote the probability that one example agrees with the parity function Par_{ℓ,b}.

Proposition 9

Pr[|p_{ℓ,b} · n − c_{ℓ,b}| ≤ αn/4] ≥ 1 − 2e^{−α²n/36}.

Proof. c_{ℓ,b} is distributed as Bin(n, p_{ℓ,b}); by the Chernoff bound,

Pr[c_{ℓ,b} > (p_{ℓ,b} + α/4) · n] = Pr[c_{ℓ,b} > (1 + α/4p_{ℓ,b}) · p_{ℓ,b}n] ≤ e^{−α²n/(32p_{ℓ,b} + 4α)} ≤ e^{−α²n/36},

Pr[c_{ℓ,b} < (p_{ℓ,b} − α/4) · n] = Pr[c_{ℓ,b} < (1 − α/4p_{ℓ,b}) · p_{ℓ,b}n] ≤ e^{−α²n/(32p_{ℓ,b})} ≤ e^{−α²n/36}.

Proposition 10

Pr[|c̃_{ℓ,b} − c_{ℓ,b}| ≤ αn/4] ≥ 1 − 3e^{−αnε′/12}.

Proof. The noise added in ShuffleCount amounts to the sum of three DLap(e^{ε′}) variables. The probability that a single DLap(e^{ε′}) variable exceeds αn/12 in absolute value is

Pr[|DLap(e^{ε′})| > αn/12] = 2 · ((e^{ε′} − 1)/(e^{ε′} + 1)) · ((e^{−ε′})^{αn/12+1} + (e^{−ε′})^{αn/12+2} + · · ·)
  = 2 · ((e^{ε′} − 1)/(e^{ε′} + 1)) · e^{−ε′(αn/12+1)}/(1 − e^{−ε′})
  = 2 · e^{−αnε′/12}/(e^{ε′} + 1)
  < e^{−αnε′/12}.

Hence, by a union bound, the probability that the sum of three DLap(e^{ε′}) variables exceeds αn/4 in absolute value is at most 3 · e^{−αnε′/12}.

Let OPT be the lowest possible error of a hypothesis taken from all parity functions. If |c̃_{ℓ,b} − n·p_{ℓ,b}| < αn/2 for all (ℓ, b), then the error of the hypothesis output by the algorithm is less than OPT + α.

Proposition 11 LearnParity is an (α, β)-agnostic learner.

Proof. By Propositions 9 and 10 and a union bound over all k hypotheses, the failure probability is at most

2k · e^{−α²n/36} + 3k · e^{−αnε′/12} ≤ β/2 + β/2 = β,

by the choice of n in Algorithm 5.

Chapter 5

Conclusion

The shuffle model of differential privacy has been suggested as an appealing alternative to the local model of differential privacy. Indeed, for tasks such as addition and histogram computation, the shuffle model allows for a lower level of noise than that required in the local model. The task of interest for this thesis is parity learning, which was shown to require exponential sample complexity in the local model (at least when the round complexity of a protocol is limited to be polynomial). An interesting question was whether it is possible to perform parity learning efficiently in the shuffle model. The recent work of Cheu and Ullman suggested that this may not be the case: they gave an exponential lower bound on the sample complexity of learning parity agnostically in the robust shuffle model. We extend this lower bound to apply also to distribution-free parity learning. We also present a concrete parity learner, whose sample complexity exceeds our lower bound by a poly(d) factor. We leave open the question of the sample complexity of parity learning under the uniform distribution.

Appendix

Tail Bounds

Theorem 8 (Chebyshev's inequality) For a random variable X with expected value E(X) = μ and variance Var(X) = σ², and any a > 0,

Pr[|X − μ| ≥ a] ≤ σ²/a².

Theorem 9 (Chernoff bound) Let X be a sum of independent random variables taking values in [0, 1], with expected value E(X) = μ. Then

(1) Pr[X ≥ (1 + δ)μ] ≤ e^{−δ²μ/(2+δ)} for all δ > 0;

(2) Pr[X ≤ (1 − δ)μ] ≤ e^{−δ²μ/2} for all 0 < δ < 1.

Bibliography

[1] Victor Balcer, Albert Cheu, Matthew Joseph, and Jieming Mao. Connecting robust shuffle privacy and pan-privacy. CoRR, abs/2004.09481, 2020.

[2] Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. Differentially private summation with multi-message shuffling. CoRR, abs/1906.09116, 2019.

[3] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnés, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 441–459. ACM, 2017.

[4] Avrim Blum, Merrick L. Furst, Jeffrey Jackson, Michael J. Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262. ACM, 1994.

[5] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Optimal lower bound for differentially private multi-party aggregation. In Leah Epstein and Paolo Ferragina, editors, Algorithms - ESA 2012 - 20th Annual European Symposium, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, volume 7501 of Lecture Notes in Computer Science, pages 277–288. Springer, 2012.

[6] Albert Cheu, Adam D. Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In Yuval Ishai and Vincent Rijmen, editors, Advances in Cryptology - EUROCRYPT 2019, volume 11476 of Lecture Notes in Computer Science, pages 375–403. Springer, 2019.

[7] Albert Cheu and Jonathan R. Ullman. The limits of pan privacy and shuffle privacy for learning and estimation. CoRR, abs/2009.08000, 2020.

[8] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, pages 371–380. ACM, May 31–June 2 2009.

[9] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006.

[10] Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Andrew Chi-Chih Yao, editor, Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings, pages 66–80. Tsinghua University Press, 2010.

[11] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211– 407, 2014.

[12] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE Computer Society, 2010.

[13] Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222. ACM, 2003.

[14] A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In STOC, 2009.

[15] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[16] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography from anonymity. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 239–248. IEEE Computer Society, 2006.

[17] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011.

[18] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011.

[19] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.

[20] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984.

[21] Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
