New directions in sublinear algorithms and testing properties of distributions

by Themistoklis Gouleakis

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author: Signature redacted
Department of Electrical Engineering and Computer Science
August 31, 2018

Certified by: Signature redacted
Ronitt Rubinfeld
Professor of Electrical Engineering and Computer Science
Thesis Supervisor

Accepted by: Signature redacted
Leslie Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students


New directions in sublinear algorithms and testing properties of distributions

by Themistoklis Gouleakis

Submitted to the Department of Electrical Engineering and Computer Science on August 31, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

Abstract

This thesis deals with sublinear algorithms for various types of problems in statistics, combinatorial optimization and graph algorithms. A first focus of this thesis is algorithms for testing whether a probability distribution, to which the algorithm has sample access, is equal to a given hypothesis distribution, using a number of samples that is sublinear in the domain size. A second focus is to consider various other models of computation, defined by the type of queries available to the user. This thesis shows how more powerful queries, such as the ability to draw a sample from the conditional distribution on a specified set, allow one to obtain faster algorithms for a number of problems. Thirdly, this thesis considers the problem of certifying and correcting the result of a crowdsourced computation with potentially erroneous worker reports, using verification queries on a sublinear number of reports. Finally, we show improved methods to simulate graph algorithms for maximal independent set, minimum vertex cover and maximum matching by distributing the computation to multiple sublinear-space computing machines and allowing only a sublinear number of rounds of communication between them.

Thesis Supervisor: Ronitt Rubinfeld Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

First and foremost, I would like to thank my advisor, Ronitt Rubinfeld, for her endless support during my PhD and for being an excellent source of new research problems for me. She has always been encouraging and optimistic about every research direction I was going to pursue, whether jointly with her or not. Furthermore, she has been an invaluable mentor, helping me improve my writing and presentation skills as well as giving me advice for my next steps.

I would like to thank the other members of my thesis committee, Constantinos Daskalakis, Ilias Diakonikolas and Ankur Moitra, for the guidance and career advice that they gave me. I would particularly like to thank Ilias Diakonikolas for being a valuable collaborator who has always proposed the right research directions, which have led to a large part of this thesis. Also, , although we had not collaborated on research until very recently, has always been very accessible and has given me helpful advice throughout my years at MIT.

I am also grateful to the Onassis Foundation and Paris Kanellakis' family for the support I received in my first years of graduate school.

Next, I would like to thank all my collaborators, from whom I have learned a lot: Dimitris Achlioptas, Maryam Aliakbarpour, Amartya Shankha Biswas, Clément Canonne, Constantinos Daskalakis, Ilias Diakonikolas, Mohsen Ghaffari, Slobodan Mitrović, John Peebles, Eric Price, Ronitt Rubinfeld, Christos Tzamos, Anak Yodpinyanee and Manolis Zampetakis, as well as all the members of the MIT theory group for creating a very lively, friendly and motivating environment. A special thank you goes to Dimitris Achlioptas, who was a mentor for me in the last years of my undergraduate studies at NTUA and the reason I decided to do research in theoretical computer science. This thesis would not have been possible without all of them.

Last, but not least, I would like to thank my family and all my friends for their constant support throughout my life.

Contents

1 Introduction
  1.1 Property testing of distributions
  1.2 Part II: Sublinear algorithms with conditional sampling and verification queries
  1.3 Part III: Distributed algorithms

2 Testing properties of probability distributions
  2.1 Overview of the results and prior work
    2.1.1 High confidence regime
    2.1.2 Sample complexity in terms of the error probability
    2.1.3 Techniques
    2.1.4 Discussion and Prior Work
    2.1.5 Distribution identity testing model
    2.1.6 Notation
  2.2 Collision-based uniformity testing
    2.2.1 Analysis of TEST-UNIFORMITY-COLLISIONS
  2.3 Testing Closeness via Collisions
    2.3.1 Analysis of TEST-CLOSENESS-COLLISIONS
  2.4 Sample-Optimal Uniformity Testing
    2.4.1 Stochastic Domination for Statistics of the Histogram
    2.4.2 Our Test Statistic
    2.4.3 Bounding the Expectation Gap
    2.4.4 Proof of Lemma 22
    2.4.5 Concentration of Test Statistic: Proof of Theorem 21
    2.4.6 Non-Existence of Indefinite Closed-Form for Components of Expectation
  2.5 Information-Theoretic Lower Bound

3 Sublinear algorithms with conditional sampling
  3.1 Conditional sampling model and motivation
    3.1.1 Previous Work in the classical model
    3.1.2 Our Contributions
  3.2 Formal definitions
    3.2.1 Conditional Sampling as Computational Model
  3.3 Basic Primitives
    3.3.1 Point in Set and Support Estimation
    3.3.2 Point of Maximum Weight
    3.3.3 Sum of Weights of Points
    3.3.4 Weighted Sampling
    3.3.5 Distinct Elements Sampling (ℓ0 Sampling)
  3.4 k-means clustering
  3.5 Euclidean Minimum Spanning Tree
    3.5.1 Computing the size of small connected components
    3.5.2 Algorithm for estimating the number of connected components

4 Certified computation
  4.1 Computation on unreliable datasets
    4.1.1 Our Model and Results
    4.1.2 More Related Work
  4.2 Certification Schemes for Linear Programs
    4.2.1 Computing the Sum of Records
    4.2.2 Functions given by Linear Programs
  4.3 Certification Schemes for w-Lipschitz Functions and Applications
    4.3.1 Certification Schemes for w-Lipschitz Functions
  4.4 Weak Correction Model
  4.5 Strong Correction Model
    4.5.1 Computing the Sum of Values of Records
    4.5.2 Lower Bound for the Maximum of Sums Function
    4.5.3 From Algorithms using Conditional Sampling to Strong Correction Schemes
  4.6 Applications of Theorem 55
    4.6.1 Optimal Travelling Salesman Tour
    4.6.2 Steiner tree

5 Distributed algorithms
  5.1 Introduction
    5.1.1 The models
    5.1.2 Related work
    5.1.3 Our contributions
    5.1.4 Our techniques
    5.1.5 Notation
  5.2 Maximal Independent Set
    5.2.1 Randomized Greedy Algorithm for MIS
    5.2.2 Simulation in O(log log Δ) rounds of MPC and CONGESTED-CLIQUE
    5.2.3 Analysis
  5.3 Matching and Vertex Cover, Simple Approximations
    5.3.1 Basic O(log n)-iteration Centralized Algorithm
    5.3.2 An Attempt for Simulation in O(log log n) rounds of MPC
    5.3.3 Our Actual Simulation in O(log log n) rounds of MPC
    5.3.4 Analysis
  5.4 Integral Matching and Improved Approximation

List of Figures

List of Tables

Chapter 1

Introduction

One of the important practical challenges in statistics and computer science is that datasets of many different types have been steadily growing. Due to this growth, even linear-time algorithms have become impractical. Moreover, real-time constraints often require us to make a decision before we are able to examine the input in its entirety. These challenges motivate research on the design and analysis of sublinear-time algorithms. In the past 20 years, a large amount of significant theoretical research has been conducted in this field (see, for example, [124, 125, 44, 61, 119, 73]). Sublinear-time algorithms exploit the structure of the specific problem at hand and are able to obtain an approximate answer in sublinear time using clever random sampling.

Specifically, a large part of the research in this thesis is on learning and testing properties of distributions over large domains. Testing was first considered in [74, 72, 19] (see, e.g., the surveys [28, 122] for a detailed exposition). In these settings, the aim is to design algorithms that use only a small (usually sublinear in the domain size) number of samples from an unknown distribution to perform various tasks, such as density estimation, parameter estimation or testing membership in a particular class of distributions. In the first part of this thesis, we consider the sample complexity of tasks such as deciding whether or not samples are coming from a given known distribution.

Apart from expanding our knowledge in distribution testing, this thesis also deals with the design of sublinear-time algorithms for combinatorial optimization problems [9, 79, 78]. In many cases, classical models of computation are inadequate, either because they are too conservative in describing the access to information available to the user, or because they make very strong assumptions on the data. For those settings, we consider new models based on weak structural assumptions that either allow for much more efficient algorithms [79, 78] or expand the applicability of existing algorithms [31].

1.1 Property testing of distributions

It is often important to distinguish whether a distribution has a certain property: for example, is it uniform or identical to some other known distribution? Does it have high entropy, independent marginals, k-modality, log-concavity or a monotone hazard rate? A large portion of this thesis considers distribution property testing, where the generic inference problem is the following (see also, e.g., [20, 21, 122, 28, 71]): given sample access to one or more unknown distributions, distinguish the case that they satisfy some property from the case that they are far from any distribution that satisfies the property. More specifically, for each property of interest, let P be the class of distributions that have that property. Then, the property testing problem can be equivalently described as distinguishing whether a distribution D is in P from the case that it is ε-far from any D' ∈ P according to some natural distance measure. As in any other hypothesis testing problem, we also consider the error probability δ of the test, which is analogous to p-values in statistics.
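The distance measure used throughout this chapter is total variation distance, which for two explicit probability vectors is half the ℓ1 distance. A minimal sketch (helper names are illustrative, not from the thesis):

```python
def total_variation(p, q):
    """TV(p, q) = (1/2) * sum_i |p_i - q_i| for two distributions
    given as equal-length probability vectors over the same domain."""
    assert len(p) == len(q)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def is_eps_far(p, q, eps):
    """p is eps-far from q when their total variation distance is >= eps."""
    return total_variation(p, q) >= eps
```

For example, a point mass on a domain of size 4 is at total variation distance 0.75 from the uniform distribution, so it is ε-far from uniform for any ε ≤ 0.75.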

Examples of such properties include: uniformity, identity to a known distribution, closeness of two distributions in total variation distance, entropy estimation, independence of marginals, and various shape-restriction properties such as being k-modal, log-concave or having a monotone hazard rate. The main research question of interest is: how does the number of samples T(P, n, ε, δ) needed for testing property P scale with the domain size (n), accuracy (ε) and error probability (δ)?

Tight analysis of the collision-based tester for uniformity: One of the most fundamental tasks in this field is deciding whether an unknown discrete distribution is approximately uniform on its domain, known as the problem of uniformity testing. Goldreich and Ron [72], motivated by the question of testing the expansion of graphs, proposed a simple and natural uniformity tester that relies on the collision probability of the unknown distribution, which was further formulated and analyzed by Batu et al. in [19]. The collision probability of a discrete distribution p is the probability that two samples drawn according to p are equal. The key insight here is that the uniform distribution has the minimum collision probability among all distributions on the same domain, and that any distribution that is ε-far from uniform has noticeably larger collision probability.

In this thesis, we provide a new analysis of this very natural collision uniformity tester, establishing a tight O(n^{1/2}/ε^2) upper bound on its sample complexity. Our bound improves on the previously known O(n^{1/2}/ε^4) upper bound for this algorithm and matches the lower bound of Paninski [115] and upper bounds by less natural algorithms [129, 51]. That is, we show that the originally proposed uniformity tester based on collisions is in fact sample-optimal, both in the dependence on n and in the dependence on ε, up to constant factors.

A related testing problem of central importance in the field is the following: given samples from two unknown distributions p, q over [n] with the promise that max{||p||_2, ||q||_2} ≤ b, distinguish between the cases that ||p − q||_2 ≤ ε/2 and ||p − q||_2 > ε. That is, we want to test the closeness between two unknown distributions with small ℓ2-norm. (We remark here that the assumption that both p and q have small ℓ2-norm is critical in this context.) The seminal work of Batu et al. [20] gave a collision-based tester for this problem that uses O(b^2/ε^4 + b^{1/2}/ε^2) samples. Subsequent work by Chan, Diakonikolas, Valiant, and Valiant [36] gave a different "chi-squared type" tester that uses O(b^{1/2}/ε^2) samples; this sample bound was shown [36, 129] to be optimal, up to constant factors.

Similarly to the case of uniformity testing, prior to this work it was not known whether the analysis of the collision-based closeness tester in [20] is tight. As our second contribution, we show (Theorem 9) that (essentially) the collision-based tester of [20] succeeds with O(b^{1/2}/ε^2) samples, i.e., it is sample-optimal, up to constants, for the corresponding problem.

Remark: Uniformity testing has been a useful algorithmic primitive for several other distribution testing problems as well [19, 45, 51, 50, 29, 69]. Notably, Goldreich [69] recently showed that the more general problem of testing the identity of any explicitly given distribution can be reduced to uniformity testing with only a constant-factor loss in sample complexity.

The problem of ℓ2 closeness testing for distributions with small ℓ2-norm has been identified as an important algorithmic primitive since the original work of Batu et al. [20], who exploited it to obtain the first ℓ1 closeness tester. Recently, Diakonikolas and Kane [49] gave a collection of reductions from various distribution testing problems to the above ℓ2 closeness testing problem. The approach of [49] shows that one can obtain sample-optimal testers for a range of different properties of distributions by applying an optimal tester for the above problem as a black box.

High probability uniformity/identity testing: Note that most previous works in this field were aiming for property testers that output the correct result with some fixed constant probability greater than 1/2 (e.g., 2/3). However, this is not sufficient when high confidence hypothesis testing is desired. Therefore, we would like to know how the sample complexity scales as the desired probability of error δ decreases. Since testing is a decision problem, standard amplification techniques can be used to boost the success probability to any desired level given any algorithm that succeeds with probability 2/3. Specifically, if S(n, ε) is a sample complexity upper bound for T(P, n, ε, 1/3), this generic method implies that S(n, ε) · Θ(log(1/δ)) is a sample complexity upper bound for T(P, n, ε, δ).
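The generic amplification referred to here is a simple majority vote over independent repetitions; by a Chernoff bound, Θ(log(1/δ)) runs of a tester that is correct with probability 2/3 make the majority answer correct with probability 1 − δ. A sketch, with an illustrative constant c (not a tuned value from the thesis):

```python
import math
from collections import Counter

def amplify(tester, delta, c=18):
    """Boost a tester that succeeds with probability >= 2/3 to success
    probability >= 1 - delta by majority vote over Theta(log(1/delta))
    independent runs. `tester` draws fresh samples on each call."""
    k = max(1, math.ceil(c * math.log(1 / delta)))
    votes = Counter(tester() for _ in range(k))
    return votes.most_common(1)[0][0]
```

Each run consumes S(n, ε) fresh samples, which is exactly why this black-box route multiplies the sample complexity by Θ(log(1/δ)).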

For almost all distribution properties, the best known sample complexity upper bounds for testing follow from this technique. In our work [52], we obtain tight bounds on the dependence of the sample complexity on the error probability for the problem of uniformity testing, and for the more general problem of identity testing (related to the goodness-of-fit problem in statistics) against an explicit distribution. Identity testing corresponds to the property P = {q}, where q is an explicitly given distribution on a domain of size n. Uniformity testing is the special case where q is the uniform distribution over the domain. For both of these problems, we show that the generic amplification is suboptimal for any δ = o(1), and give a new identity tester that achieves the optimal sample complexity, which is Θ((1/ε^2) · (√(n log(1/δ)) + log(1/δ))).

Our main contribution here is the design of the first sample-optimal uniformity tester for the high success probability regime, which is surprisingly simple: to test whether p = U_n versus ||p − U_n||_1 ≥ ε, we simply threshold ||p̂ − U_n||_1, where p̂ is the empirical probability distribution. In a literature with several different uniformity testers [72, 19, 115, 129, 51], the empirical total variation distance had never been previously proposed as a statistic, due to the fact that when the sample size is sublinear in the domain size, the empirical total variation distance is very far from the actual total variation distance. However, our results show that the empirical distance from uniformity is noticeably smaller for the uniform distribution than for "far from uniform" distributions, even with a sublinear sample size. Our sample-optimal identity tester follows by applying the recent result of Goldreich [69], which provides a black-box reduction of identity to uniformity. We also show a matching information-theoretic lower bound on the sample complexity.
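The statistic itself is a one-liner. A sketch, assuming samples come from a domain identified with {0, …, n − 1} (helper name illustrative; the acceptance threshold, which comes out of the analysis, is omitted here):

```python
from collections import Counter

def empirical_distance_from_uniform(samples, n):
    """||p_hat - U_n||_1, where p_hat is the empirical distribution of
    the samples and U_n is uniform over a domain of size n."""
    m = len(samples)
    counts = Counter(samples)
    # elements that appeared at least once
    seen = sum(abs(c / m - 1 / n) for c in counts.values())
    # each unseen element contributes |0 - 1/n| = 1/n
    unseen = (n - len(counts)) / n
    return seen + unseen
```

Note that with m ≪ n samples this value is close to its maximum even for the uniform distribution (most of the domain is unseen), which is exactly why the statistic was previously dismissed; the point of the analysis is that the gap between the uniform and far-from-uniform cases is nevertheless detectable.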

Brief Overview of Techniques. To analyze our tester, we introduce two new techniques for the analysis of distribution testing statistics, which we describe in more detail in Section 2.1.3. Our techniques leverage a simple common property of numerous distribution testing statistics which does not seem to have been previously exploited in their analysis: their convexity. Our first technique crucially exploits an underlying strong convexity property to bound from below the expectation gap between the completeness and soundness cases. We remark that this is in contrast to most known distribution testers, where bounding the expectation gap is easy and the challenge is in bounding the variance of the statistic.

Our second technique implies a new, fast method for obtaining empirical estimates of the true worst-case failure probability of any member of a broad class of uniformity testing statistics. This class includes all uniformity testing statistics studied in the literature. Critically, these estimates come with provable guarantees about the worst-case failure probability of the statistic over all possible input distributions, and have tunable additive error. We elaborate in Section 2.1.3.

1.2 Part II: Sublinear algorithms with conditional sampling and verification queries

It often happens that generic models of computation cannot capture the characteristics of various practical applications, which results in hiding the true complexity of the problems or in limiting the applicability of known algorithms. So, another important contribution of this thesis is examining the power of more specialized models of computation.

One example is the powerful, yet reasonable, model of conditional sampling, which we exploit in this thesis in order to obtain sublinear-time algorithms for various geometric problems that are exponentially faster than their classical counterparts in the literature.

Conditional sampling: Recent work in distribution property testing [30, 35] utilizes conditional sampling queries¹ in order to design distribution property testing algorithms with an exponentially smaller sample complexity than what was known before. Queries of this type have also been implemented in practice using special data structures [96]. In [79], we propose a different notion of conditional sampling for sublinear-time algorithms, inspired by these results. More specifically, we introduce a computational model based on conditional sampling, where the dataset is provided as a distribution and the algorithms have access to a conditional sampling oracle. A conditional sampling oracle is given as input the description of a boolean circuit C, and returns datapoints drawn at random from the conditional distribution on the subset of the domain that satisfies the circuit. As an example, given a function f described by a small circuit and a constant a, we can construct a slightly bigger circuit C' that, given a point x as input, outputs 1 if and only if f(x) > a. Then the sampling oracle, given C', will output a random datapoint among those whose value f(x) is greater than a.

¹That is, the ability to sample from the conditional distribution on a subset S ⊆ [n] of the domain.
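One way to picture the oracle is as a filter-then-sample operation over an explicit weighted dataset. The sketch below is purely illustrative (all names are hypothetical): the point of the model is that the oracle answers such queries without the algorithm itself scanning the data.

```python
import random

def conditional_sample(points, weights, circuit):
    """Illustrative reference behavior of a conditional sampling oracle:
    draw one point from the distribution restricted to {x : circuit(x) == 1}.
    A real oracle would not enumerate the dataset like this."""
    subset = [(x, w) for x, w in zip(points, weights) if circuit(x)]
    if not subset:
        return None  # conditioning set is empty
    xs, ws = zip(*subset)
    return random.choices(xs, weights=ws, k=1)[0]
```

For the example in the text, `circuit` would be `lambda x: f(x) > a`, so the oracle returns a random datapoint with value above the level a.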

Using conditional sampling, we provide algorithms that are exponentially faster than the best known algorithms in the classical model. Examples of problems to which we apply this model include Euclidean MST and k-means clustering of points in a metric space. For example, the best known upper bound for (1 + ε)-approximate Euclidean MST of n points in R^d in the classical model, even when augmented with more powerful geometric queries, is O(√n · (1/ε)^{O(d)}) due to [39]. In [79], we are able to get poly(d, log n, 1/ε) running time in the conditional sampling model. It is important to note that these conditional sampling queries can be simulated by passes over the input of a streaming algorithm, and can also be implemented by parallel and distributed algorithms. By solving more basic problems such as support estimation, weighted sampling, ℓ0-sampling etc., we have created a framework which can potentially be applied to other problems as well.

Certified computation: In many practical scenarios, we need to perform computation on unreliable data. Crowdsourcing is an example where data is provided by a very large number of workers. These workers may need to put in significant effort to extract high-quality data, and without the right incentives they might choose not to do so, resulting in very noisy and unreliable reports. In [78], we consider such a setting where a fraction of the data is erroneous. The goal is either to certify that those errors will not affect the outcome of the computation by more than a (1 + ε) factor (certification), or to find a subset of the input that would give a (1 + ε)-approximation to the correct output (correction). The algorithm is given access to verification queries answering whether a data point is valid or not. Since these verification queries can be expensive, we are looking for algorithms that use as few such queries as possible. We show that verification of small sets of critical worker reports is sufficient to guarantee high-quality learning outcomes for various optimization objectives. More specifically, we show that many problems only need poly(1/ε) verification queries to ensure that the output of the computation is at most a factor of (1 + ε) away from the truth. For any given instance, we provide an instance-optimal solution that verifies the minimum number of workers possible to approximately certify correctness. In case this certification step fails, a misreporting worker will be identified. Removing these workers and repeating until success guarantees that the result will be correct and will depend only on the verified workers.

Surprisingly, as we show, for several computation tasks more efficient methods are possible. These methods always guarantee that the produced result is not affected by the misreporting workers, since any misreport that affects the output will be detected and verified.
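The certify-then-correct loop described above can be sketched as follows. All helper names are hypothetical and problem-specific: `critical_set` picks the (small) set of reports whose correctness pins down the output, and `verify` is the expensive query.

```python
def certify_and_correct(reports, compute, critical_set, verify):
    """Hedged sketch of the loop: verify the critical reports; if they all
    check out, the output is certified; otherwise drop the identified
    misreporters and repeat until success."""
    reports = list(reports)
    while True:
        critical = critical_set(reports)
        bad = [r for r in critical if not verify(r)]
        if not bad:
            # every critical report is valid: the output is certified
            return compute(reports)
        # certification failed: misreporting workers identified; remove them
        reports = [r for r in reports if r not in bad]
```

For instance, if the task is a maximum, the only critical report is the current maximizer, so each failed round verifies one report and eliminates one misreporter.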

1.3 Part III: Distributed algorithms

Distributed algorithms is an area of theoretical computer science that aims at capturing the ability of computers connected in a network to perform a computation collectively. This is particularly important in relation to sublinear algorithms, as it can make it possible for a number of time- and space-limited computers to perform a computational task that would be impossible for each one of them separately.

In particular, for all the distributed algorithms that we examine in this thesis, each computer has access to a limited fraction of the input but has the ability to communicate with every other computer by synchronous message passing. That is, in each round of the computation, after some computation done internally in each computer, there is an exchange of messages between the computers of the network, which collectively aim to compute the global result in as few rounds as possible.

We are interested in designing algorithms for graph problems such as Maximal Independent Set, Maximum Matching and Minimum Vertex Cover, where the amount of memory needed by each individual computer is sublinear in the size of the input graph. The study of these problems in models of parallel computation dates back to PRAM algorithms. A seminal work of Luby [106] gives a simple randomized algorithm for constructing an MIS in O(log n) PRAM rounds. When this algorithm is applied to the line graph of the input graph G, it outputs a maximal matching of G, and hence a 2-approximate maximum matching of G. The output maximal matching also provides a 2-approximate minimum vertex cover. Similar results, also in the context of PRAM algorithms, were obtained in [10, 89, 90]. Since then, the aforementioned problems have been studied quite extensively in various models of computation.
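A common variant of Luby's algorithm (random priorities; local minima join the MIS and their neighborhoods drop out) can be sketched as follows. This sketch is sequential and only illustrates the round structure that parallel and distributed simulations exploit; each while-iteration corresponds to one parallel round, and the number of rounds is O(log n) with high probability.

```python
import random

def luby_mis(adj):
    """Maximal independent set of an undirected graph given as a dict
    mapping each vertex to the set of its neighbors."""
    alive = set(adj)
    mis = set()
    while alive:
        # each surviving vertex draws a random priority
        r = {v: random.random() for v in alive}
        # local minima among surviving neighbors join the MIS
        winners = {v for v in alive
                   if all(r[v] < r[u] for u in adj[v] if u in alive)}
        mis |= winners
        # remove the winners and their entire neighborhoods
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & alive
        alive -= removed
    return mis
```

Running the same procedure on the line graph of G (vertices = edges of G, adjacent when they share an endpoint) yields a maximal matching, as described above.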

The PRAM model mentioned above is best suited for modeling multiprocessor applications where each processor has access to the entire memory and the goal is to minimize the total amount of computation performed. However, there are applications in which it is unrealistic for each processor to have access to the entire memory (i.e., the input data), due to the size of the dataset or other reasons that make it impossible to store it on a single computer. In this thesis, we will instead consider the following two closely related models: Massively Parallel Computation (MPC) and the CONGESTED-CLIQUE model of distributed computing. Indeed, we consider it a conceptual contribution of this work to (further) exhibit the proximity of these two models. These models attempt to capture the fact that each computer only has access to its own part of the dataset, as well as the fact that communication is more expensive than computation. As a result, we will be interested in minimizing the total number of communication rounds. We next review these models briefly and state our results afterwards.

The MPC model: The MPC model was first introduced in [94] and later refined in [76, 22, 11]. It can be informally described as follows: suppose that we are given an input of N bits and we have m machines at our disposal, where each machine has a limited memory of S bits. The regime we are interested in is when S is between N/m and N, since otherwise the problem becomes either impossible or trivial. In particular, if S is smaller than N/m, there is not enough total storage to solve the problem, and if S is larger than N, then we could trivially finish in just one round using one machine. In general, a distributed algorithm in this model works as follows: initially, the input is distributed among the m machines, which perform some computation locally on their respective parts of the input, and this is done in rounds. Between every pair of rounds, each machine can exchange messages with any other machine, under the restriction that the total amount of communication is upper bounded by its internal memory. At least one of the machines should have enough information to output the result at the end.

CONGESTED-CLIQUE: The second model that we consider is the CONGESTED-CLIQUE model of distributed computing, which was introduced by Lotker, Pavlov, Patt-Shamir, and Peleg [105] and has been studied extensively since then; see, e.g., [116, 58, 24, 101, 59, 112, 83, 82, 32, 81, 23, 65, 68, 97, 84, 33, 66, 93]. In this model, we have n machines which can communicate in synchronous rounds to solve a graph-theoretic problem. Each node of the input graph is associated with one machine, which initially knows the local topology of the graph around the vertex and at the end of the execution is required to know the local output: for example, whether or not the vertex is in the maximal independent set of the graph. As far as communication is concerned, each machine can again send messages to any other machine regardless of the graph topology, but each message between two machines is now restricted to O(log n) bits.

Our results: In our work [67] (presented in Section 5.2 of this thesis), we design O(log log Δ)-round algorithms for Maximal Independent Set in both the CONGESTED-CLIQUE model and the MPC model, where the memory per machine is linear in the number of nodes of the graph. As our second result, in Section 5.3, we first design an algorithm that returns a (2 + ε)-approximate fractional maximum matching and a (2 + ε)-approximate integral minimum vertex cover in O(log log n) MPC rounds, which improves on the recent O((log log n)^2)-round algorithm by Czumaj et al. [40]. We also show how to round this fractional matching to a (2 + ε)-approximate integral maximum matching. After applying vertex-based random partitioning (proposed in this context in [40]), the algorithm repeats only a couple of simple steps to perform all its decisions. See Section 5.1.2 for more details on related work on these problems.

Chapter 2

Testing properties of probability distributions

The generic inference problem in distribution property testing [20, 21, 74] (also see, e.g., [122, 28, 71]) is the following: given sample access to one or more unknown distributions, determine whether they satisfy some global property or are "far" from satisfying the property. During the past couple of decades, distribution testing, which is a generalization of statistical hypothesis testing [113, 100], has developed into a mature field. One of the most fundamental tasks in this field is deciding whether an unknown discrete distribution is approximately uniform on its domain, known as the problem of uniformity testing. Formally, we want to design an algorithm that, given independent samples from a discrete distribution p over [n] and a parameter ε > 0, distinguishes (with high probability) the case that p is uniform from the case that p is ε-far from uniform, i.e., the total variation distance between p and the uniform distribution over [n] is at least ε.

In this chapter, we first present a tighter analysis showing that a well-known tester for uniformity [72, 20] has optimal sample complexity in terms of the domain size and the distance parameter. Afterwards, we show how to obtain an optimal sample complexity even when the error probability is also taken into account, improving on the best known upper bound and showing a matching lower bound. All results can be extended to identity testing using a reduction by Goldreich [69].

2.1 Overview of the results and prior work

Uniformity testing was the very first problem considered in this line of work: Goldreich and Ron [72] and Batu et' al [191 proposed a simple and natural uniformity tester that relies on the collision probability of the unknown distribution. The collision probability of a discrete distribution p is the probability that two samples drawn according to p are equal. The key intuition here is that the uniform distribution has the minimum collision probability among all distributions on the same domain, and that any distribution that is e-far from uniform has noticeably larger collision probability. Formalizing this intuition, Batu et' al [191 showed that the collision- based uniformity tester succeeds after drawing O(n1 / 2 /64 ) samples from the unknown distribution. An information-theoretic lower bound of Q(n1 / 2 ) on the number of samples required by any uniformity tester follows from a simple birthday-paradox argument [75, 19, 741, even for constant values of the parameter E. In subsequent work, Paninski [1151 showed an information-theoretic lower bound of Q(ni1 / 2 /E2 ), and also provided a matching upper bound of O(n1/ 2/ 2 ) that holds under the assumption

that ε = Ω(n^{−1/4}).¹ This assumption on ε is not inherent: as shown in a number of recent works [129, 51] (see also [1, 36]), a variant of Pearson's χ²-tester can test uniformity with O(√n/ε²) samples for all values of n and ε > 0. The "chi-squared type" testers of [36, 129] are simple, but are also arguably slightly less natural than the original collision-based uniformity tester [72, 19].

Perhaps surprisingly, prior to this work, the sample complexity of the collision uniformity tester was not fully understood. In particular, it was not known whether the sample upper bound of O(√n/ε⁴) established in [72, 19] is tight for this tester, or whether there exists an improved analysis that gives a better upper bound. As our first main contribution (Theorem 3), we provide a new analysis of the collision uniformity tester establishing a tight O(√n/ε²) upper bound on its sample complexity. That is, we show that the originally proposed uniformity tester is in fact sample-optimal,

¹ The uniformity tester of [115] relies on the number of unique elements, i.e., the elements that appear in the sample set exactly once. Such a tester is only meaningful in the regime where the total number of samples is smaller than the domain size.

up to constant factors.

2.1.1 High confidence regime

Research in this field has primarily centered on determining tight bounds on the sample complexity of testing various properties in the constant probability of success regime. That is, the testing algorithm must succeed with probability (say) at least 2/3. This constant confidence regime is fairly well understood: for a range of fundamental properties [115, 36, 129, 51, 50, 2, 49, 48] we now have testers that use a provably optimal number of samples (up to constant factors) in this regime.

In sharp contrast, the high confidence regime, i.e., the case where the desired failure probability is subconstant, is poorly understood even for the most basic properties. For essentially all distribution property testing problems studied in the literature, the standard amplification method is the only known way to achieve a high confidence success probability. Amplification is a black-box method that can boost the success probability to any desired accuracy. However, using it increases the number of required samples beyond what is necessary to obtain constant confidence. Specifically, to achieve a high confidence success probability of 1 − δ via amplification, the number of samples required increases by a factor of Θ(log(1/δ)) compared to the constant confidence regime.
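The standard amplification argument can be made concrete: run the base tester k times on fresh samples and take a majority vote. The exact failure probability of the vote is a binomial tail, as the sketch below computes (the helper is a hypothetical illustration, not a tester from the literature):

```python
from math import comb

def majority_failure_prob(k, base_fail=1/3):
    """Exact probability that a majority vote over k independent runs of a
    base tester (each failing with probability base_fail) is wrong,
    i.e., that more than half of the runs fail."""
    return sum(comb(k, i) * base_fail**i * (1 - base_fail)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# The failure probability decays exponentially in k, which is why reaching
# confidence 1 - delta via amplification costs a Theta(log(1/delta)) factor
# in the number of samples.
assert majority_failure_prob(1) == 1/3
assert majority_failure_prob(31) < majority_failure_prob(11) < majority_failure_prob(3) < 1/3
```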

This discussion raises the following natural questions: For a given distribution property testing problem, does black-box amplification give sample-optimal testers for obtaining a high confidence success probability? Specifically, is the Θ(log(1/δ)) multiplicative increase in the sample size the best possible? If not, can we design testers that have optimal sample complexity for all values of the relevant problem parameters, including the success probability δ?

We believe that these are fundamental questions that merit theoretical investigation in their own right. As Goldreich notes [70], "eliminating the error probability as a parameter does not allow to ask whether or not one may improve over the straightforward error reduction". From a practical perspective, understanding this

high confidence regime is important for applications of hypothesis testing (e.g., in biology), because the failure probability δ of the test can be reported as a p-value.² Standard techniques for addressing the problem of multiple comparisons, such as the Bonferroni correction, require vanishingly small p-values.

Perhaps surprisingly, with one exception [85], this basic problem has not been previously investigated in the finite sample regime. A conceptual contribution of this work is to raise this problem as a fundamental goal in distribution property testing.

We note here that the analogous question in the context of distribution learning has been intensely studied in statistics and probability theory (see, e.g., [131, 47]) and tight bounds are known in a range of settings.

2.1.2 Sample complexity in terms of the error probability

Our main result is a complete characterization of the worst-case sample complexity of identity testing in the high confidence regime. For this problem, we show that black-box amplification is suboptimal for any δ = o(1), and we give a new identity tester that achieves the optimal sample complexity:

Theorem 1 (Main Result). There exists a computationally efficient (ε, δ)-identity tester for discrete distributions of support size n with sample complexity

Θ( (√(n·log(1/δ)) + log(1/δ)) / ε² ).    (2.1)

Moreover, this sample size is information-theoretically optimal, up to a constant factor, for all n, ε, δ.

As we explain in Section 2.1.4, [85] gave a tester that achieves the optimal sample complexity when the sample size is o(n). However, this tester completely fails with Ω(n) samples, as may be required when either ε or δ is sufficiently small. Theorem 1 provides a complete characterization of the worst-case sample complexity of

² The family of distribution testing algorithms with success probability 1 − δ for a given problem is equivalent to the family of statistical tests whose p-value (probability of Type I error) and probability of Type II error are both at most δ.

the problem with a single statistic for all settings of the parameters n, ε, δ.

2.1.3 Techniques

Tight analysis of the collisions tester

We now provide a brief summary of previous analyses and a comparison with our work.

The canonical way to construct and analyze distribution property testers roughly works as follows: given m independent samples s₁, …, s_m from our distribution(s), we consider an appropriate random variable (statistic) F(s₁, …, s_m). If F(s₁, …, s_m) exceeds an appropriately defined threshold T, our tester rejects; otherwise, it accepts. The canonical analysis proceeds by bounding the expectation and variance of F for the case that the distribution(s) satisfy the property (completeness), and the case that they are ε-far from satisfying the property (soundness), followed by an application of Chebyshev's inequality.

The main difficulty is choosing the statistic F appropriately so that the expectations for the completeness and soundness cases are sufficiently separated after a small number of samples, while at the same time the variance of the statistic is not "too large". Typically, the challenging step in the analysis is bounding the variance of F from above in the soundness case. Our analysis follows this standard framework. Roughly speaking, for both problems we consider, we provide a tighter analysis of the variance of the corresponding estimators, which in turn leads to the optimal sample complexity upper bound.

More specifically, for the case of uniformity testing, the argument of [72] proceeds by showing that the collision tester yields a (1 + γ)-multiplicative approximation of the ℓ₂-norm of the unknown distribution with O(√n/γ²) samples. Setting γ = ε² gives a uniformity tester under the ℓ₁ distance that uses O(√n/ε⁴) samples. We note that the quadratic dependence on 1/γ in the multiplicative approximation of the ℓ₂ norm is tight in general. (To see this, consider the case that our distribution is either uniform over two elements, or assigns probability mass 1/2 − γ and 1/2 + γ to the two elements.) Roughly speaking, we show that we can do better when the ℓ₂ norm of

the distribution in question is small. More specifically, the collision uniformity tester can distinguish between the case that ||p||₂² ≤ (1 + γ/2)/n and the case that ||p||₂² ≥ (1 + γ)/n with

O(√n/γ) samples. This immediately yields the desired ℓ₁ guarantee.

For the closeness testing problem (under our bounded ℓ₂ norm assumption), Batu et al. [20] construct a statistic whose expectation is proportional to the square of the ℓ₂ distance between the two distributions p and q. This statistic has three terms, whose expectations are proportional to ||p||₂², ||q||₂², and p·q, respectively. Specifically, the first term is obtained by considering the number of self-collisions in a set of samples from p. Similarly, the second term is proportional to the number of self-collisions in a set of samples from q. The third term is obtained by considering the number of "cross-collisions" between samples from p and samples from q. In order to simplify the analysis, [20] uses a separate set of fresh samples for the cross-collisions term. This set is independent of the set of samples used for the two self-collisions terms. While this choice makes the analysis cleaner, it ends up increasing the variance of the estimator too much, leading to a suboptimal sample upper bound. We show that by reusing samples to calculate the number of cross-collisions, one achieves sufficiently good variance to get the optimal sample complexity. This comes at the cost of a more complicated analysis involving a very careful calculation of the variance.

Upper Bound for High-Probability Uniformity Testing

We would like to show that the test statistic d_TV(p̂, U_n), computed on the empirical distribution p̂, is, with high probability, larger when d_TV(p, U_n) ≥ ε than when p = U_n. We start by showing that among all possible alternative distributions p with d_TV(p, U_n) ≥ ε, it suffices to consider those in a very simple family. We then show that the test statistic is highly concentrated around its expectation, and that the expectations are significantly different in the two cases. The main technical components of this chapter are our techniques for accomplishing these tasks.

To simplify the structure of p, we show (Section 2.4.1) that if p majorizes another distribution q, then the test statistic d_TV(p̂, U_n) stochastically dominates d_TV(q̂, U_n), where p̂ and q̂ denote the empirical distributions of samples drawn from p and q, respectively.

(In fact, this statement holds for any test statistic that is a convex symmetric function of the empirical histogram.) Therefore, for any p, if we average out the large and small entries of p, the test statistic becomes harder to distinguish from uniform.

We remark, as a matter of independent interest, that this stochastic domination lemma immediately implies a fast algorithm for performing rigorous empirical comparisons of test statistics. A major difficulty in empirical studies of distribution testing is that it is not possible to directly check the failure probability of a tester over every possible distribution as input, because the space of such distributions is quite large.

Our structural lemma reduces the search space dramatically for uniformity testing: for any convex symmetric test statistic (which includes all existing ones), the worst-case distribution will have αn coordinates of value (1 + ε/α)/n, and the rest of value (1 − ε/(1 − α))/n, for some α. Hence, there are only n possible worst-case distributions for any ε. Notably, this reduction does not lose anything, so it could be used to identify the non-asymptotic optimal constants that a distribution testing statistic achieves for a given set of parameters.
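As a sketch of this reduced search space, the following code constructs the two-valued family and checks that each member is a genuine distribution at total variation distance exactly ε from uniform. The helper name is ours, and note that the construction requires ε ≤ 1 − α for the light coordinates to remain nonnegative:

```python
from fractions import Fraction

def worst_case_distribution(n, eps, k):
    """Two-valued distribution with k heavy coordinates of mass (1 + eps/alpha)/n
    and n - k light coordinates of mass (1 - eps/(1 - alpha))/n, alpha = k/n."""
    alpha = Fraction(k, n)
    heavy = (1 + eps / alpha) / n
    light = (1 - eps / (1 - alpha)) / n
    return [heavy] * k + [light] * (n - k)

n, eps = 12, Fraction(1, 10)
for k in range(1, n - 1):  # k kept small enough that eps <= 1 - k/n holds here
    p = worst_case_distribution(n, eps, k)
    assert sum(p) == 1  # a genuine probability distribution
    # total variation distance from uniform is exactly eps for every k
    assert Fraction(1, 2) * sum(abs(x - Fraction(1, n)) for x in p) == eps
```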

Returning to our uniformity tester, at the cost of a constant factor in ε we can assume α = 1/2. As a result, we only need to consider p to be either U_n or of the form (1 ± ε)/n in each coordinate. We now need to separate the expectation of the test statistic in these two situations. The challenge is that both expectations are large, and we do not have a good analytic handle on them. We therefore introduce a new technique for showing a separation between the completeness and soundness cases that utilizes the strong convexity of the test statistic. Specifically, we obtain an explicit expression for the Hessian of the expectation, as a function of p. The Hessian is diagonal, and for our two situations of p_i ≈ 1/n each entry is within constant factors of the same value, giving a lower bound on its eigenvalues. Since the expectation is minimized at p = U_n, strong convexity implies an expectation gap. Specifically, we prove that this gap is ε² · min(m²/n², m/n, 1/ε²).

Finally, we need to show that the test statistic concentrates about its expectation.

For m > n, this follows from McDiarmid's inequality: since the test statistic is (1/m)-Lipschitz in the m samples, with probability 1 − δ it lies within √(log(1/δ)/m) of its expectation. When m is larger than the desired sample complexity given in (2.1), this is less than the expectation gap above. The concentration is trickier when m < n, since the expectation gap is smaller, so we need to establish tighter concentration.

We get this by using a Bernstein variant of McDiarmid's inequality, which is stronger than the standard version of McDiarmid in this context. We note that the use of stochastic domination is also crucial here. Since our statistic is a symmetric convex function of the histogram values, we can use Lemma 16 to assume without loss of generality that the soundness case distribution has possible probability mass values exclusively in the set {(1 − ε′)/n, 1/n, (1 + ε′)/n}, for some ε′ = Θ(ε). This distribution has a stronger Lipschitz-type property than the other soundness case distributions. Therefore, we are able to use a stronger concentration bound via McDiarmid's inequality and argue that even though other soundness case distributions may have weaker concentration, they still have smaller error due to our stochastic domination argument.

Upper Bound for Identity Testing

In [69], it was shown how to reduce ε-testing of identity to an arbitrary distribution q over [n] to (ε/3)-testing of uniformity over a domain of size 6n. This reduction preserves the error probability δ, so applying it gives an identity tester with the same sample complexity as our uniformity tester, up to constant factors.³

Sample Complexity Lower Bound

To match our upper bound (2.1), we need two lower bounds. The lower bound of Ω(log(1/δ)/ε²) is straightforward from the same lower bound as for distinguishing a fair coin from an ε-biased coin, while the √(n·log(1/δ))/ε² bound is more challenging.

For intuition, we start with a √(n·log(1/δ)) lower bound for constant ε. When p = U_n, the chance that all m samples are distinct is at least (1 − m/n)^m ≈ e^{−m²/n}. Hence, if m ≲ √(n·log(1/δ)), this would happen with probability significantly larger than 2δ. On the other hand, if p is uniform over a random subset of n/2 coordinates,

³ Note that the fact that identity testing can be reduced to uniformity testing was already known from [19], who showed that testing identity can be reduced to testing uniformity on several subsets of the domain, rather than to testing uniformity of a single distribution.

the m samples will also all be distinct with probability (1 − 2m/n)^m > 2δ. The two situations thus look the same with probability 2δ, so no tester could have accuracy 1 − δ.
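The birthday-paradox calculation behind this argument can be checked exactly. The sketch below (with an illustrative, conservatively small constant in front of √(n·log(1/δ))) computes the all-distinct probability in both scenarios:

```python
from math import log, sqrt

def prob_all_distinct(m, n):
    """Exact probability that m i.i.d. samples from a uniform distribution
    over n elements are all distinct: prod_{i < m} (1 - i/n)."""
    p = 1.0
    for i in range(m):
        p *= 1 - i / n
    return p

n, delta = 10_000, 0.1
m = int(0.5 * sqrt(n * log(1 / delta)))  # below the sqrt(n log(1/delta)) threshold

# With this few samples, "all distinct" is likely both under U_n and under a
# uniform distribution on n/2 elements, so the two cases cannot be told apart
# with confidence 1 - delta.
assert prob_all_distinct(m, n) > 2 * delta
assert prob_all_distinct(m, n // 2) > 2 * delta
```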

This intuition can easily be extended to include a 1/ε dependence, but getting the desired 1/ε² dependence requires more work. First, we Poissonize the number of samples, so we independently see Poi(m·p_i) samples of each coordinate i; with exponentially high probability, this Poissonization only affects the sample complexity by constant factors. Then, in the alternative hypothesis, we set each p_i independently at random to be (1 ± ε)/n. This has the unfortunate property that p no longer sums to 1, so it is a "pseudo-distribution" rather than an actual distribution. Still, it is exponentially likely to sum to Θ(1), and using techniques from [135, 49] this is sufficient for our purposes.

At this point, we are considering a situation where the number of times we see each coordinate is distributed either as Poi(m/n) or as the mixture ½(Poi((1 − ε)m/n) + Poi((1 + ε)m/n)), and every coordinate is independent of the others. These two distributions have Hellinger distance Ω(ε²m/n) in each coordinate. Then the composition property of the Hellinger distance over n independent coordinates implies that m ≳ √(n·log(1/δ))/ε² is necessary for success probability 1 − δ.

2.1.4 Discussion and Prior Work

Uniformity testing is the first and one of the most well-studied problems in distribution testing [72, 115, 129, 51, 48]. As already mentioned, the literature has almost exclusively focused on the case of constant error probability δ. The first uniformity tester, introduced by Goldreich and Ron [72], counts the number of collisions among the samples and was shown to work with O(√n/ε⁴) samples [72]. A related tester proposed by Paninski [115], which relies on the number of distinct elements in the set of samples, was shown to have the optimal m = Θ(√n/ε²) sample complexity, as long as m = o(n). Recently, a chi-squared based tester was shown in [129, 51] to achieve the optimal Θ(√n/ε²) sample complexity without any restrictions. Finally, the original collision-based tester of [72] was very recently shown to also achieve the optimal Θ(√n/ε²) sample complexity [48]. Thus, the situation for constant values of δ is well understood.

The problem of identity testing against an arbitrary (explicitly given) distribution was studied in [19], who gave an (ε, 1/3)-tester with sample complexity Õ(√n)·poly(1/ε). The tight bound of Θ(√n/ε²) was first given in [129] using a chi-squared type tester (inspired by [36]). In subsequent work, a similar chi-squared tester that also achieves the same sample complexity bound was given in [2]. (We note that the [129, 2] testers have suboptimal sample complexity in the high confidence regime, even for the case of uniformity.) In a related work, [51] obtained a reduction of identity to uniformity that preserves the sample complexity, up to a constant factor, in the constant error probability regime. More recently, Goldreich [69], building on [49], gave a different reduction of identity to uniformity that preserves the error probability. We use the latter reduction here in order to obtain an optimal identity tester starting from our new optimal uniformity tester.

Since the sample complexity of identity testing is Θ(√n/ε²) for δ = 1/3 [129, 51], standard amplification gives a sample upper bound of O(√n·log(1/δ)/ε²) for this problem. It is not hard to observe that this naive bound cannot be optimal for all values of δ. For example, in the extreme case that δ = 2^{−Θ(n)}, this gives a sample complexity of O(n^{3/2}/ε²). On the other hand, one can learn the underlying distribution (and therefore test for identity) with O(n/ε²) samples for such values of δ.⁴

The case where 1 ≫ δ ≥ 2^{−Θ(n)} is more subtle, and it is not a priori clear how to improve upon naive amplification. Theorem 1 provides a smooth transition between the extremes of Θ(√n/ε²) for constant δ and Θ(n/ε²) for δ = 2^{−Θ(n)}. It thus provides a quadratic improvement in the dependence on δ over the naive bound for all δ ≥ 2^{−Θ(n)}, and shows that this is the best possible. For δ ≤ 2^{−Θ(n)}, it turns out that the additive Θ(log(1/δ)/ε²) term is necessary, as outlined in Section 2.1.3, so learning the distribution is optimal for such values of δ.

⁴ This follows from the fact that, for any distribution p over n elements, the empirical distribution p̂_m obtained after m = Ω((n + log(1/δ))/ε²) samples drawn from p is ε-close to p in total variation distance with probability at least 1 − δ.

We obtain the first sample-optimal uniformity tester for the high confidence regime. Our sample-optimal identity tester follows from our uniformity tester by applying the recent result of Goldreich [69], which provides a black-box reduction of identity to uniformity. We also show a matching information-theoretic lower bound on the sample complexity.

The sample-optimal uniformity tester we introduce is remarkably simple: to distinguish between the cases that p is the uniform distribution U_n over n elements versus d_TV(p, U_n) ≥ ε, we simply compute d_TV(p̂, U_n) for the empirical distribution p̂. The tester accepts that p = U_n if the value of this statistic is below some well-chosen threshold, and rejects otherwise.

It should be noted that such a tester was not previously known to work with sub-learning sample complexity, i.e., fewer than Θ(n/ε²) samples, even in the constant confidence regime. Surprisingly, in a literature with several different uniformity testers [72, 115, 129, 51], no one had previously used the empirical total variation distance. On the contrary, it would be natural to assume, as was suggested in [20, 21], that this tester cannot possibly work. A likely reason for this is the following observation: when the sample size m is smaller than the domain size n, the empirical total variation distance is very far from the true distance to uniformity. This suggests that the empirical distance statistic gives little, if any, information in this setting.

Despite the above intuition, we prove that the natural "plug-in" estimator relying on the empirical distance from uniformity actually works, for the following reason: the empirical distance from uniformity is noticeably smaller for the uniform distribution than for "far from uniform" distributions, even with a sub-linear sample size. Moreover, we obtain the stronger statement that the "plug-in" estimator is a sample-optimal uniformity tester for all parameters n, ε and δ.

In [85], it was shown that the distinct-elements tester of [115] achieves the optimal sample complexity of m = O(√(n·log(1/δ))/ε²), as long as m = o(n). When m = Ω(n), as is the case in many practically relevant settings (see, e.g., the Polish lottery example in [123]), this tester is known to fail completely, even in the constant confidence regime. On the other hand, in such settings the sample size is not sufficiently large that we can actually learn the underlying distribution.

It is important to note that all previously considered uniformity testers [72, 115, 129, 51] do not achieve the optimal sample complexity (as a function of all parameters, including δ), and this is inherent, i.e., not just a failure of previous analyses. Roughly speaking, since the collision statistic [72] and the chi-squared based statistic [129, 51] are not Lipschitz, it can be shown that their high-probability performance is poor. Specifically, in the completeness case (p = U_n), if many samples happen to land in the same bucket (domain element), these test statistics become quite large, leading to their suboptimal behavior for all δ = o(1). (For a formal justification, the reader is referred to Section V of [85].) On the other hand, the distinct-elements tester [115] does not work for m = ω(n). For example, if ε or δ is sufficiently small to necessitate m ≫ n·log n, then typically all n domain elements will appear in both the completeness and soundness cases, hence the test statistic provides no information.

2.1.5 Distribution identity testing model

We now formally define the task of identity testing, which is arguably the most fundamental distribution testing problem.

Definition 2 (Distribution Identity Testing Problem). Given a target distribution q with domain D of size n, parameters 0 < ε, δ < 1, and sample access to an unknown distribution p over the same domain, we want to distinguish with probability at least 1 − δ between the following cases:

• Completeness: p = q.

• Soundness: d_TV(p, q) ≥ ε.

We call this the problem of (ε, δ)-testing identity to q. The special case where q is uniform is known as uniformity testing. An algorithm that solves one of these problems will be called an (ε, δ)-tester for identity/uniformity.

Note that d_TV(p, q) denotes the total variation distance or statistical distance between distributions p and q, i.e., d_TV(p, q) := (1/2)·||p − q||₁. The goal is to characterize the sample complexity of the problem, i.e., the number of samples that are necessary and sufficient to correctly distinguish between the completeness and soundness cases with success probability 1 − δ.

2.1.6 Notation

We write [n] to denote the set {1, …, n}. We consider discrete distributions over [n], which are functions p : [n] → [0, 1] such that Σ_{i=1}^n p_i = 1. We use the notation p_i to denote the probability of element i in distribution p. For S ⊆ [n], we will denote p(S) = Σ_{i∈S} p_i. We will also sometimes think of p as an n-dimensional vector. We will denote by U_n the uniform distribution over [n].

For r ≥ 1, the ℓ_r-norm of a distribution is identified with the ℓ_r-norm of the corresponding vector, i.e., ||p||_r = (Σ_{i=1}^n |p_i|^r)^{1/r}. The ℓ_r-distance between distributions p and q is defined as the ℓ_r-norm of the vector of their difference. The total variation distance between distributions p and q is defined as d_TV(p, q) := max_{S⊆[n]} |p(S) − q(S)| = (1/2)·||p − q||₁. The Hellinger distance between p and q is H(p, q) := (1/√2)·||√p − √q||₂ = (1/√2)·(Σ_{i=1}^n (√p_i − √q_i)²)^{1/2}. We denote by Poi(λ) the Poisson distribution with parameter λ.
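These definitions translate directly into code; a small sketch for concreteness:

```python
from math import sqrt

def lr_norm(p, r):
    """l_r norm of the probability vector p."""
    return sum(abs(x) ** r for x in p) ** (1 / r)

def tv_distance(p, q):
    """Total variation distance: (1/2) * ||p - q||_1."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def hellinger_distance(p, q):
    """H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2."""
    return sqrt(0.5 * sum((sqrt(x) - sqrt(y)) ** 2 for x, y in zip(p, q)))

p = [0.5, 0.5, 0.0, 0.0]
u = [0.25] * 4
assert abs(tv_distance(p, u) - 0.5) < 1e-12    # d_TV(p, U_4) = 1/2
assert abs(lr_norm(u, 2) ** 2 - 0.25) < 1e-12  # ||U_n||_2^2 = 1/n
assert 0 <= hellinger_distance(p, u) <= 1      # H always lies in [0, 1]
```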

2.2 Collision-based uniformity testing

In this section, we show that the natural collision uniformity tester proposed in [72, 19] is sample-optimal up to constant factors. More specifically, we are given m samples from a probability distribution p over [n], and we wish to distinguish (with high constant probability) between the cases that p is uniform versus ε-far from uniform in ℓ₁-distance. The main result of this section is that the collision-based uniformity tester succeeds in this task with m = O(√n/ε²) samples.

In fact, we prove the following stronger ℓ₂-guarantee for the collisions tester: with m = O(√n/ε²) samples, it distinguishes between the cases that ||p − U_n||₂² ≤ ε²/(2n) (completeness) versus ||p − U_n||₂² ≥ ε²/n (soundness). The desired ℓ₁ guarantee follows from this ℓ₂ guarantee by an application of the Cauchy-Schwarz inequality in the soundness case.

Formally, we analyze the following tester:

Algorithm TEST-UNIFORMITY-COLLISIONS(p, n, ε)

Input: sample access to a distribution p over [n], and ε > 0.
Output: "YES" if ||p − U_n||₂² ≤ ε²/(2n); "NO" if ||p − U_n||₂² ≥ ε²/n.

1. Draw m i.i.d. samples from p.

2. Let σ_ij be an indicator variable which is 1 if samples i and j are the same and 0 otherwise.

3. Define the random variable s = Σ_{i<j} σ_ij and the threshold t = (m choose 2)·(1 + 3ε²/4)/n.

4. If s > t, return "NO"; otherwise, return "YES".

The following theorem characterizes the performance of the above estimator:

Theorem 3. The above estimator, when given m samples drawn from a distribution p over [n], will, with probability at least 3/4, distinguish the case that ||p − U_n||₂² ≤ ε²/(2n) from the case that ||p − U_n||₂² ≥ ε²/n, provided that m ≥ 3200·√n/ε².

The rest of this section is devoted to the proof of Theorem 3. Note that, since ||p − U_n||₂² = ||p||₂² − 1/n, the condition of the theorem is equivalent to testing whether ||p||₂² ≤ (1 + ε²/2)/n versus ||p||₂² ≥ (1 + ε²)/n. Our tester takes m = 3200·√n/ε² samples from p and distinguishes between the two cases with probability at least 3/4.

2.2.1 Analysis of TEST-UNIFORMITY-COLLISIONS

The analysis proceeds by bounding the expectation and variance of the estimator for the completeness and soundness cases, and applying Chebyshev's inequality. The novelty here is a tight analysis of the variance which leads to the optimal sample bound.

We start by recalling the following simple closed formula for the expected value:

Lemma 4. We have that E[s] = (m choose 2)·||p||₂².

Proof. For any i < j, the probability that samples i and j are equal is ||p||₂². By this and linearity of expectation, we get

E[s] = E[Σ_{i<j} σ_ij] = Σ_{i<j} E[σ_ij] = (m choose 2)·||p||₂². □
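Lemma 4 can be sanity-checked by brute force on a tiny instance, enumerating every possible sample sequence with exact rational arithmetic (the instance itself is arbitrary):

```python
from itertools import product
from math import comb
from fractions import Fraction

# A small non-uniform distribution and sample size.
p = [Fraction(7, 10), Fraction(3, 10)]
m = 3

# E[s] computed exactly by enumerating all m-sample outcomes.
expected_s = Fraction(0)
for outcome in product(range(len(p)), repeat=m):
    prob = Fraction(1)
    for x in outcome:
        prob *= p[x]
    s = sum(1 for i in range(m) for j in range(i + 1, m)
            if outcome[i] == outcome[j])
    expected_s += prob * s

# Lemma 4: E[s] = C(m, 2) * ||p||_2^2.
assert expected_s == comb(m, 2) * sum(x * x for x in p)
```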

Thus, we see that in the completeness case the expected value is at most (m choose 2)·(1 + ε²/2)/n. In the soundness case, the expected value is at least (m choose 2)·(1 + ε²)/n. This motivates our choice of the threshold t halfway between these two expected values.

In order to argue that the statistic will be close to its expected value, we bound its variance from above and use Chebyshev's inequality. We bound the required number of samples in two steps. First, we bound the variance of the statistic by m²·||p||₂² + m³·(||p||₃³ − ||p||₂⁴), and then we do a case analysis and show the same sample complexity upper bound regardless of whether the first or the second term in the above bound dominates for a given distribution p. We note here that for any probability distribution p we have ||p||₃³ − ||p||₂⁴ ≥ 0, since

||p||₂² = Σ_{i=1}^n p_i² = Σ_{i=1}^n p_i^{1/2} · p_i^{3/2} ≤ (Σ_{i=1}^n p_i)^{1/2} · (Σ_{i=1}^n p_i³)^{1/2} = ||p||₃^{3/2},

where we have used the Cauchy-Schwarz inequality.

Lemma 5. We have that Var[s] ≤ m²·||p||₂² + m³·(||p||₃³ − ||p||₂⁴).

Proof. Write s² = Σ_{i<j} Σ_{k<l} σ_ij·σ_kl and split the sum according to how the index pairs {i, j} and {k, l} overlap. If the two pairs coincide, then E[σ_ij·σ_kl] = E[σ_ij] = ||p||₂², and there are (m choose 2) such terms. If they share exactly one index, then the corresponding three samples must all be equal, so E[σ_ij·σ_kl] = ||p||₃³, and there are m(m − 1)(m − 2) ≤ m³ such terms. If the two pairs are disjoint, then E[σ_ij·σ_kl] = ||p||₂⁴ by independence. Therefore,

Var[s] = E[s²] − E[s]² = (m choose 2)·(||p||₂² − ||p||₂⁴) + m(m − 1)(m − 2)·(||p||₃³ − ||p||₂⁴) ≤ m²·||p||₂² + m³·(||p||₃³ − ||p||₂⁴). □

Remark: We note that the upper bound of the previous lemma is tight, up to constant factors. The −m³·||p||₂⁴ term is critical for getting the optimal dependence on ε in the sample bound.

Continuing the analysis, we now derive an upper bound on the number of samples that suffices for the tester to achieve the desired success probability of 3/4.

Lemma 6. Let α satisfy ||p||₂² = (1 + α)/n, and let σ be the standard deviation of s. Then TEST-UNIFORMITY-COLLISIONS has error probability at most 1/4 provided that the number of samples m satisfies

m² ≥ 5σn / |α − 3ε²/4|.

Proof. By Chebyshev's inequality, we have that

Pr[ |s − (m choose 2)·||p||₂²| ≥ kσ ] ≤ 1/k²,

where σ = √Var[s]. We want s to be closer to its expected value than the threshold t is to that expected value, because when this occurs the tester outputs the right answer. Furthermore, to achieve our desired error probability of at most 1/4, we want this to happen with probability at least 3/4. So, we want

kσ ≤ |E[s] − t| = (m choose 2)·|(1 + α)/n − (1 + 3ε²/4)/n| = (m choose 2)·|α − 3ε²/4| / n.

For sufficiently large m and k = 2, the following slightly stronger condition on m suffices:

m²·|α − 3ε²/4| / (5n) ≥ σ.

So, it suffices to have

m² ≥ 5σn / |α − 3ε²/4|.

We might as well take the smallest number of samples m for which the tester works, which yields the desired bound. □

To complete the proof of Theorem 3, we need to show that, given enough samples, there is a clear separation between the completeness and soundness cases regarding the value of our statistic. By Lemma 6, it suffices to bound the variance σ² from above. We proceed by a case analysis based on whether the term m²·||p||₂² or the term m³·(||p||₃³ − ||p||₂⁴) contributes more to the variance.

Case when m²·||p||₂² is Larger

Lemma 7. Let ||p||₂² = (1 + α)/n. Consider the completeness case when α ≤ ε²/2 and the soundness case when α ≥ ε². If m²·||p||₂² contributes more to the variance, i.e., if

m²·||p||₂² ≥ m³·(||p||₃³ − ||p||₂⁴),

then the required number of samples is at most

m ≤ 48·√n/ε²

in order to get error probability at most 1/4.

Proof. Suppose that m²·||p||₂² ≥ m³·(||p||₃³ − ||p||₂⁴). Then σ² ≤ 2m²·||p||₂² = 2m²·(1 + α)/n. Substituting this into Lemma 6 and solving for m gives that the necessary number of samples is at most

m ≤ 8·√n·√(1 + α) / |α − 3ε²/4|.

Maximizing this expression over α, one gets that α = ε² maximizes it on [0, ε²/2] ∪ [ε², n − 1], since the right-hand side is increasing on the first interval and decreasing on the second. Thus,

m ≤ 8·√n·√(1 + ε²) / (ε²/4) = 32·√n·√(1 + ε²)/ε² ≤ 48·√n/ε². □

Case when m³·(||p||₃³ − ||p||₂⁴) is Larger

Lemma 8. Let ||p||₂² = (1 + α)/n. Consider the completeness case when α ≤ ε²/2 and the soundness case when α ≥ ε². If m³·(||p||₃³ − ||p||₂⁴) contributes more to the variance, i.e., if

m³·(||p||₃³ − ||p||₂⁴) ≥ m²·||p||₂²,

then the required number of samples is at most

m ≤ 3200·√n/ε²

in order to get error probability at most 1/4.

Proof. Suppose that m³·(||p||₃³ − ||p||₂⁴) ≥ m²·||p||₂². Then σ² ≤ 2m³·(||p||₃³ − ||p||₂⁴). Substituting this into Lemma 6 and solving for m gives that the necessary number of samples is at most

m ≤ 50n²·(||p||₃³ − ||p||₂⁴) / (α − 3ε²/4)².

Let us parameterize p as p_i = 1/n + a_i, for some vector a with Σ_{i=1}^n a_i = 0. Then we have ||a||₂² = α/n.

In the completeness case, we can write:

m ≤ 50n²·(||p||₃³ − ||p||₂⁴) / (3ε²/4 − α)² ≤ 50n²·(||p||₃³ − ||p||₂⁴) / (ε²/4)²    (since α ≤ ε²/2).

In the soundness case, we can write:

m ≤ 50n²·(||p||₃³ − ||p||₂⁴) / (α − 3ε²/4)² ≤ 50n²·(||p||₃³ − ||p||₂⁴) / (α/4)²    (since ε² ≤ α).

We also have that:

$$\|p\|_3^3 - \|p\|_2^4 \le \|p\|_3^3 - \frac{1}{n^2} = \sum_{i=1}^n\left(\frac{1}{n} + a_i\right)^3 - \frac{1}{n^2} = \frac{3}{n}\sum_{i=1}^n a_i^2 + \sum_{i=1}^n a_i^3 \le \frac{3}{n}\cdot\frac{\alpha}{n} + \left(\frac{\alpha}{n}\right)^{3/2},$$

where we used that $\sum_i a_i = 0$, that $\|p\|_2^4 \ge 1/n^2$, and that $\sum_i a_i^3 \le \|a\|_2^3 = (\alpha/n)^{3/2}$.

Using the above bound, in the completeness case we get the following:

$$m \le 50n^2\cdot\frac{\frac{3}{n}\cdot\frac{\alpha}{n} + (\alpha/n)^{3/2}}{(\epsilon^2/4)^2} = \frac{2400\,\alpha}{\epsilon^4} + \frac{800\, n^{1/2}\alpha^{3/2}}{\epsilon^4} \le \frac{1200}{\epsilon^2} + \frac{300\, n^{1/2}}{\epsilon} \le \frac{3200\, n^{1/2}}{\epsilon^2} \quad (\text{since } \alpha \le \epsilon^2/2).$$

In the soundness case, we get:

$$m \le 50n^2\cdot\frac{\frac{3}{n}\cdot\frac{\alpha}{n} + (\alpha/n)^{3/2}}{(\alpha/4)^2} = \frac{2400}{\alpha} + \frac{800\, n^{1/2}}{\sqrt{\alpha}} \le \frac{2400}{\epsilon^2} + \frac{800\, n^{1/2}}{\epsilon} \le \frac{3200\, n^{1/2}}{\epsilon^2} \quad (\text{since } \epsilon^2 \le \alpha).$$

Combining the above, the required number of samples is at most $m \le 3200\, n^{1/2}/\epsilon^2$. $\Box$

Note that, as mentioned earlier, if we had ignored the $-\|p\|_2^4$ term, we would have had an $\Omega(1/\epsilon^4)$ term in our bound, which would have given us the wrong dependence on $\epsilon$.

    Theorem 3 now follows as an immediate consequence of Lemmas 7 and 8.

Remark: It is worth noting that the collisions statistic analyzed in this section is very similar to the chi-squared-like uniformity tester in [51], itself a simplification of similar testers in [36, 129], which also achieves the optimal sample complexity of $O(n^{1/2}/\epsilon^2)$. Specifically, if $X_i$ denotes the number of times we see the $i$-th domain element in the sample, the [51] statistic is

$$\sum_i \left[(X_i - m/n)^2 - X_i\right] = \sum_i X_i^2 - \frac{2m}{n}\sum_i X_i + \frac{m^2}{n} - \sum_i X_i.$$

We note that the [51] analysis uses Poissonization; i.e., instead of drawing $m$ samples from the distribution, we draw $\mathrm{Poi}(m)$ samples.

Without Poissonization, the aforementioned statistic simplifies to $2s - m^2/n$, where $s$ is the collisions statistic. While the non-Poissonized versions of the two testers are equivalent, the Poissonized versions are not. Specifically, the Poissonized version of the [51] uniformity tester has sufficiently good variance to yield the sample-optimal bound. On the other hand, the Poissonized version of the collisions statistic does not have good variance: its variance lacks the $-\|p\|_2^4$ term which, as noted earlier, is necessary to get the optimal $\epsilon$ dependence.

2.3 Testing Closeness via Collisions

Given samples from two unknown distributions $p, q$ over $[n]$ with the promise that $\max\{\|p\|_2^2, \|q\|_2^2\} \le b$, we want to distinguish between the cases that $\|p - q\|_2 \le \epsilon/2$ versus $\|p - q\|_2 \ge \epsilon$. We show that a natural collisions-based tester succeeds in this task with $O(b^{1/2}/\epsilon^2)$ samples. The estimator we analyze is a slight variant of the $\ell_2$ tester in [20], described in pseudocode below.

We define the number of self-collisions in a sequence of samples from a distribution as $\sum_{i<j}\mathbb{1}[s_i = s_j]$, i.e., the number of pairs of samples that land on the same element. Similarly, we define the number of cross-collisions between two sequences of samples as $\sum_{i,j} f_{ij}$, where $f_{ij}$ is the indicator variable denoting whether sample $i$ from the first sequence is the same as sample $j$ from the second sequence.

Algorithm TEST-CLOSENESS-COLLISIONS(p, q, n, b, ε)

Input: sample access to distributions $p, q$ over $[n]$; $\epsilon, b > 0$.

Output: "YES" if $\|p - q\|_2 \le \epsilon/2$; "NO" if $\|p - q\|_2 \ge \epsilon$.

1. Draw two multisets $S_p, S_q$ of $m$ i.i.d. samples from $p$ and $q$, respectively. Let $C_1$ denote the number of self-collisions of $S_p$, $C_2$ the number of self-collisions of $S_q$, and $C_3$ the number of cross-collisions between $S_p$ and $S_q$.

2. Define the random variable $Z = C_1 + C_2 - \frac{m-1}{m}C_3$ and the threshold $t = \binom{m}{2}\epsilon^2/2$.

3. If $Z > t$ return "NO"; otherwise, return "YES".

    The following theorem characterizes the performance of the above estimator:

Theorem 9. There exists an absolute constant $c$ such that the above estimator, when given $m$ samples drawn from each of two distributions $p, q$ over $[n]$, will, with probability at least $3/4$, distinguish the case $\|p - q\|_2 \le \epsilon/2$ from the case that $\|p - q\|_2 \ge \epsilon$, provided that $m \ge c\cdot\sqrt{b}/\epsilon^2$, where $b$ is an upper bound on $\|p\|_2^2, \|q\|_2^2$.

    2.3.1 Analysis of TEST-CLOSENESS-COLLISIONS

Let $X_i, Y_i$ be the number of times we see the element $i$ in each set of samples $S_p$ and $S_q$, respectively. The above random variables are distributed according to binomial distributions as follows: $X_i \sim \mathrm{Bin}(m, p_i)$, $Y_i \sim \mathrm{Bin}(m, q_i)$. Note that the statistic $Z$ can be written as

$$Z = \sum_{i=1}^n\left[\frac{X_i(X_i-1)}{2} + \frac{Y_i(Y_i-1)}{2}\right] - \frac{m-1}{m}\sum_{i=1}^n X_iY_i.$$

Therefore, we have:

$$Z = \frac{m-1}{2m}\sum_{i=1}^n\left[(X_i - Y_i)^2 - X_i - Y_i\right] + \frac{1}{2m}\sum_{i=1}^n\left[X_i(X_i-1) + Y_i(Y_i-1)\right] = \frac{m-1}{2m}\, A + \frac{1}{2m}\, B,$$

where $A = \sum_{i=1}^n[(X_i - Y_i)^2 - X_i - Y_i]$ and $B = \sum_{i=1}^n[X_i(X_i-1) + Y_i(Y_i-1)]$. Note that

$$\mathrm{Var}[Z] \le 4\cdot\max\left\{\mathrm{Var}\left[\frac{m-1}{2m}A\right],\ \mathrm{Var}\left[\frac{1}{2m}B\right]\right\} = 4\cdot\max\left\{\frac{(m-1)^2}{4m^2}\mathrm{Var}[A],\ \frac{1}{4m^2}\mathrm{Var}[B]\right\}.$$

Note that $B$ is equal to twice the number of self-collisions within two disjoint sets of samples, hence we already have an upper bound on its variance. The bulk of the analysis goes into bounding from above the variance of $A = \sum_{i=1}^n A_i = \sum_{i=1}^n[(X_i - Y_i)^2 - X_i - Y_i]$.
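The decomposition $Z = \frac{m-1}{2m}A + \frac{1}{2m}B$ is a purely algebraic identity, so it can be checked on arbitrary histograms; a sketch with hypothetical small parameters:

```python
import random
from math import comb

random.seed(0)
m, n = 40, 8                                  # hypothetical parameters
X = [0] * n                                   # histogram of S_p
Y = [0] * n                                   # histogram of S_q
for _ in range(m):
    X[random.randrange(n)] += 1
    Y[random.randrange(n)] += 1

C1 = sum(comb(x, 2) for x in X)               # self-collisions of S_p
C2 = sum(comb(y, 2) for y in Y)               # self-collisions of S_q
C3 = sum(x * y for x, y in zip(X, Y))         # cross-collisions
Z = C1 + C2 - (m - 1) / m * C3

A = sum((x - y) ** 2 - x - y for x, y in zip(X, Y))
B = sum(x * (x - 1) + y * (y - 1) for x, y in zip(X, Y))

# Z equals ((m-1)/(2m)) * A + (1/(2m)) * B exactly.
assert abs(Z - ((m - 1) / (2 * m) * A + B / (2 * m))) < 1e-9
```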

Remark: The collision-based $\ell_2$ tester we analyze here is closely related to the $\ell_2$ tester of [36]. Specifically, the $A$ term in the expression for $Z$ has the same formula as the $\ell_2$ tester of [36]. However, a key difference is that the statistic of [36] is Poissonized, which is crucial for its analysis.

    We now proceed to analyze the collision-based closeness tester. We start with a

    simple formula for its expectation:

Lemma 10. For the expectation of the statistic $Z$ in the closeness tester, we have:

$$\mathbb{E}[Z] = \binom{m}{2}\|p - q\|_2^2. \tag{2.2}$$

Proof. Viewing $p$ and $q$ as vectors, we have

$$\mathbb{E}[Z] = \mathbb{E}\left[C_1 + C_2 - \frac{m-1}{m}C_3\right] = \binom{m}{2}\|p\|_2^2 + \binom{m}{2}\|q\|_2^2 - \frac{m-1}{m}\cdot m^2\langle p, q\rangle = \binom{m}{2}\|p - q\|_2^2. \qquad\Box$$
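The algebra in the proof can be replayed numerically for any fixed vectors; the $p, q$ below are arbitrary examples.

```python
from math import comb

m = 25
p = [0.5, 0.3, 0.1, 0.1]
q = [0.25, 0.25, 0.25, 0.25]

lhs = (comb(m, 2) * sum(x * x for x in p)
       + comb(m, 2) * sum(y * y for y in q)
       - (m - 1) / m * m ** 2 * sum(x * y for x, y in zip(p, q)))
rhs = comb(m, 2) * sum((x - y) ** 2 for x, y in zip(p, q))
assert abs(lhs - rhs) < 1e-9   # ((m-1)/m) * m^2 = 2 * C(m,2), so the terms combine
```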

    For the variance, we show the following upper bound:

Lemma 11. For the variance of the statistic $Z$ in the closeness tester, we have:

$$\mathrm{Var}[Z] \le 116m^2 b + 16m^3\|p - q\|_2^2\, b^{1/2}.$$

To prove this lemma, we will use the following proposition, whose proof is deferred to the following subsection.

Proposition 12. We have that $\mathrm{Var}[A] \le 100m^2 b + 8m^3\sum_i(p_i - q_i)(p_i^2 - q_i^2)$.

Proof of Lemma 11. Recall that by Lemma 5 we have

$$\mathrm{Var}[B] \le 4m^2(\|p\|_2^2 + \|q\|_2^2) + 4m^3(\|p\|_3^3 - \|p\|_2^4 + \|q\|_3^3 - \|q\|_2^4).$$

Combined with Proposition 12, we obtain:

$$\mathrm{Var}[Z] \le 4\cdot\max\left\{\frac{(m-1)^2}{4m^2}\mathrm{Var}[A],\ \frac{1}{4m^2}\mathrm{Var}[B]\right\}.$$

The second argument of the max statement, after multiplying by $4$, is $\mathrm{Var}[B]/m^2 \le 4(\|p\|_2^2 + \|q\|_2^2) + 4m(\|p\|_3^3 + \|q\|_3^3) \le 16mb$, using $\|p\|_3^3 \le (\|p\|_2^2)^{3/2} \le b$ and similarly for $q$. This is at most $16(m-1)^2 b$ for $m \ge 3$. Thus, using $\max\{x, y\} \le x + y$ for nonnegative $x, y$, we have

$$\mathrm{Var}[Z] \le 116(m-1)^2 b + 8m(m-1)^2\sum_i(p_i - q_i)(p_i^2 - q_i^2)$$
$$\le 116m^2 b + 8m^3\sum_i(p_i - q_i)^2(p_i + q_i)$$
$$\le 116m^2 b + 8m^3\|p - q\|_2^2\sqrt{\sum_i(p_i + q_i)^2} \quad (\text{by the Cauchy-Schwarz inequality})$$
$$\le 116m^2 b + 16m^3\|p - q\|_2^2\, b^{1/2} \quad \left(\text{since } \sum_i(p_i + q_i)^2 \le 4b\right). \qquad\Box$$

    Proof of Theorem 9

By Lemma 11, we have that

$$\mathrm{Var}[Z] \le 116m^2 b + 16m^3\|p - q\|_2^2\, b^{1/2}.$$

We wish to show we can distinguish the completeness case (i.e., $\|p - q\|_2 \le \epsilon/2$) from the soundness case (i.e., $\|p - q\|_2 \ge \epsilon$). Set $a := \|p - q\|_2^2$. Then we are promised that either $a \ge \epsilon^2$ or $a \le \epsilon^2/4$. Recall we chose $t = \binom{m}{2}\epsilon^2/2$ and that Lemma 10 says that $\mathbb{E}[Z] = \binom{m}{2}a$.

Since

$$\mathbb{E}[Z \mid \text{completeness case}] < t < \mathbb{E}[Z \mid \text{soundness case}],$$

the only way we fail to distinguish the completeness and soundness cases is if $Z$ deviates from its expectation additively by at least

$$|t - \mathbb{E}[Z]| \ge \binom{m}{2}\cdot\frac{\max\{a, \epsilon^2\}}{4},$$

where the last inequality follows by the promise on $a$ in the completeness and soundness cases.¹ By Chebyshev's inequality, the probability this happens is at most

$$\Pr\big[|Z - \mathbb{E}[Z]| \ge |t - \mathbb{E}[Z]|\big] \le \frac{\mathrm{Var}[Z]}{\left(\binom{m}{2}\max\{a, \epsilon^2\}/4\right)^2} \le O(1)\cdot\left(\frac{b}{m^2\epsilon^4} + \frac{b^{1/2}}{m\epsilon^2}\right),$$

where we simplified using the assumptions that $m \ge 2$ (so that $\binom{m}{2} \ge m^2/4$) and that $\max\{a, \epsilon^2\}^2 \ge \max\{a\epsilon^2, \epsilon^4\}$. Thus, if we set $m = \Omega(\sqrt{b}/\epsilon^2)$, we get a constant probability of error in both cases, as desired. $\Box$

    Proof of Proposition 12

Recall that $A = \sum_{i=1}^n A_i = \sum_{i=1}^n[(X_i - Y_i)^2 - X_i - Y_i]$, hence $\mathrm{Var}(A) = \sum_{i=1}^n\mathrm{Var}(A_i) + \sum_{i\ne j}\mathrm{Cov}(A_i, A_j)$. We proceed to bound from above the individual variances and covariances via a sequence of elementary but quite tedious calculations.

Bounding $\mathrm{Var}(A_i)$:

Since

$$A_i = X_i^2 + Y_i^2 - 2X_iY_i - X_i - Y_i,$$

we can write:

$$\mathrm{Var}(A_i) = \mathrm{Var}(X_i^2) + \mathrm{Var}(Y_i^2) + 4\,\mathrm{Var}(X_iY_i) + \mathrm{Var}(X_i) + \mathrm{Var}(Y_i)$$
$$+ 2\left[-2\,\mathrm{Cov}(X_i^2, X_iY_i) - \mathrm{Cov}(X_i^2, X_i) - 2\,\mathrm{Cov}(Y_i^2, X_iY_i) - \mathrm{Cov}(Y_i^2, Y_i) + 2\,\mathrm{Cov}(X_iY_i, X_i) + 2\,\mathrm{Cov}(X_iY_i, Y_i)\right].$$

Let $s_k \in [n]$ be the random variable corresponding to the value of the $k$-th sample drawn. We proceed to calculate the individual quantities:

¹In the completeness case, where $a \le \epsilon^2/4$ and $\mathbb{E}[Z] = \binom{m}{2}a$, $Z$ has to deviate by at least $t - \binom{m}{2}a \ge \binom{m}{2}\epsilon^2/4$ to cross the threshold $t = \binom{m}{2}\epsilon^2/2$. In the soundness case, where $a \ge \epsilon^2$, $Z$ has to deviate by at least $\binom{m}{2}a - t \ge \binom{m}{2}a/2$ to cross the threshold $t$.

(a)

$$\mathrm{Cov}(X_i^2, X_i) = \sum_{r,s,t\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = i]\,\mathbb{1}[s_s = i],\ \mathbb{1}[s_t = i]\big)$$
$$= \sum_{r\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = i], \mathbb{1}[s_r = i]\big) + 2\sum_{r,s\in[m],\, r\ne s}\mathrm{Cov}\big(\mathbb{1}[s_r = i]\,\mathbb{1}[s_s = i],\ \mathbb{1}[s_r = i]\big)$$
$$= mp_i(1 - p_i) + 2(m^2 - m)\big(\mathbb{E}[\mathbb{1}[s_r = i]\,\mathbb{1}[s_s = i]] - p_i^2\cdot p_i\big)$$
$$= mp_i(1 - p_i) + 2(m^2 - m)(p_i^2 - p_i^3)$$
$$= mp_i(1 - p_i)\left[1 + 2p_i(m - 1)\right]$$
$$= mp_i(1 - p_i)(1 - 2p_i) + 2m^2p_i^2(1 - p_i).$$

(b)

$$\mathrm{Cov}(X_i^2, X_iY_i) = \mathbb{E}[X_i^3Y_i] - \mathbb{E}[X_i^2]\,\mathbb{E}[X_iY_i] = \mathrm{Cov}(X_i^2, X_i)\cdot\mathbb{E}[Y_i]$$
$$= m^2p_iq_i(1 - p_i)(1 - 2p_i) + 2m^3p_i^2q_i(1 - p_i).$$

(c) $\mathrm{Cov}(X_i, X_iY_i) = \mathrm{Var}(X_i)\cdot\mathbb{E}[Y_i] = m^2p_i(1 - p_i)q_i$.

(d)

$$\mathrm{Var}(X_i^2) = \mathbb{E}[X_i^4] - (\mathbb{E}[X_i^2])^2$$
$$= \big(mp_i - 7mp_i^2 + 7m^2p_i^2 + 12mp_i^3 - 18m^2p_i^3 + 6m^3p_i^3 - 6mp_i^4 + 11m^2p_i^4 - 6m^3p_i^4 + m^4p_i^4\big)$$
$$\quad - \big(m^2p_i^2 + m^2p_i^4 + m^4p_i^4 + 2m^3p_i^3 - 2m^3p_i^4 - 2m^2p_i^3\big)$$
$$= mp_i - 7mp_i^2 + 6m^2p_i^2 + 12mp_i^3 - 16m^2p_i^3 + 4m^3p_i^3 - 6mp_i^4 + 10m^2p_i^4 - 4m^3p_i^4.$$

(e)

$$\mathrm{Var}(X_iY_i) = \mathbb{E}[X_i^2]\,\mathbb{E}[Y_i^2] - (\mathbb{E}[X_i]\,\mathbb{E}[Y_i])^2$$
$$= (mp_i - mp_i^2 + m^2p_i^2)(mq_i - mq_i^2 + m^2q_i^2) - m^4p_i^2q_i^2$$
$$= m^2p_iq_i + m^2p_i^2q_i^2 - m^2(p_iq_i^2 + p_i^2q_i) + m^3(p_i^2q_i + p_iq_i^2) - 2m^3p_i^2q_i^2.$$

So, we get:

$$\mathrm{Var}(A_i) = \left[mp_i - 7mp_i^2 + 6m^2p_i^2 + 12mp_i^3 - 16m^2p_i^3 + 4m^3p_i^3 - 6mp_i^4 + 10m^2p_i^4 - 4m^3p_i^4\right]$$
$$+ \left[mq_i - 7mq_i^2 + 6m^2q_i^2 + 12mq_i^3 - 16m^2q_i^3 + 4m^3q_i^3 - 6mq_i^4 + 10m^2q_i^4 - 4m^3q_i^4\right]$$
$$+ 4\left[m^2(p_iq_i + p_i^2q_i^2 - p_iq_i^2 - p_i^2q_i) + m^3(p_i^2q_i + p_iq_i^2) - 2m^3p_i^2q_i^2\right]$$
$$+ mp_i(1 - p_i) + mq_i(1 - q_i)$$
$$- 4\left[m^2p_iq_i(1 - p_i)(1 - 2p_i) + 2m^3p_i^2q_i(1 - p_i)\right] - 4\left[m^2p_iq_i(1 - q_i)(1 - 2q_i) + 2m^3p_iq_i^2(1 - q_i)\right]$$
$$- 2\left[mp_i(1 - p_i)(1 - 2p_i) + 2m^2p_i^2(1 - p_i)\right] - 2\left[mq_i(1 - q_i)(1 - 2q_i) + 2m^2q_i^2(1 - q_i)\right]$$
$$+ 4m^2p_i(1 - p_i)q_i + 4m^2q_i(1 - q_i)p_i.$$

Grouping by powers of $m$, the terms collect into

$$\mathrm{Var}(A_i) = m\left[-2p_i^2 + 8p_i^3 - 6p_i^4 - 2q_i^2 + 8q_i^3 - 6q_i^4\right]$$
$$+ m^2\left[2p_i^2 - 12p_i^3 + 10p_i^4 + 2q_i^2 - 12q_i^3 + 10q_i^4 + 4p_iq_i(1 + p_i + q_i + p_iq_i - 2p_i^2 - 2q_i^2)\right]$$
$$+ 4m^3(p_i - q_i)^2\left[p_i(1 - p_i) + q_i(1 - q_i)\right]$$
$$\le 8m(p_i^3 + q_i^3) + 12m^2(p_i + q_i)^2 + 4m^3(p_i - q_i)^2(p_i + q_i)$$
$$\le 20m^2(p_i + q_i)^2 + 4m^3(p_i - q_i)^2(p_i + q_i).$$

Bounding the Covariances

It suffices to show that the covariances of $A_i$ and $A_j$, for $i \ne j$, are appropriately bounded from above. Let $i \ne j$. Note that if $s_k$ is the result of sample $k$, we have:

$$\mathrm{Cov}(X_i, X_j) = \sum_{r,u\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = i], \mathbb{1}[s_u = j]\big) = \sum_{r\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = i], \mathbb{1}[s_r = j]\big) = -mp_ip_j.$$

Similarly,

$$\mathrm{Cov}(X_i^2, X_j) = \sum_{r,s,t\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = s_s = i], \mathbb{1}[s_t = j]\big)$$
$$= \sum_{r\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = i], \mathbb{1}[s_r = j]\big) + 2\sum_{r,s\in[m],\, r\ne s}\mathrm{Cov}\big(\mathbb{1}[s_r = s_s = i], \mathbb{1}[s_r = j]\big)$$
$$= -mp_ip_j - 2m(m-1)p_i^2p_j.$$

Similarly, grouping the quadruples $(r, s, t, u)$ according to which indices are shared between $\{r, s\}$ and $\{t, u\}$,

$$\mathrm{Cov}(X_i^2, X_j^2) = \sum_{r,s,t,u\in[m]}\mathrm{Cov}\big(\mathbb{1}[s_r = s_s = i], \mathbb{1}[s_t = s_u = j]\big)$$
$$= -mp_ip_j - 2m(m-1)\big(p_i^2p_j + p_ip_j^2 + p_i^2p_j^2\big) - 4m(m-1)(m-2)p_i^2p_j^2$$
$$= -mp_ip_j - 2m(m-1)(p_i^2p_j + p_ip_j^2) - 2m(m-1)(2m-3)p_i^2p_j^2.$$

And, using the independence of the $X$'s and the $Y$'s,

$$\mathrm{Cov}(X_iY_i, X_jY_j) = \mathbb{E}[X_iY_iX_jY_j] - \mathbb{E}[X_iY_i]\,\mathbb{E}[X_jY_j] = \mathbb{E}[X_iX_j]\,\mathbb{E}[Y_iY_j] - \mathbb{E}[X_i]\mathbb{E}[Y_i]\mathbb{E}[X_j]\mathbb{E}[Y_j]$$
$$= \big(\mathrm{Cov}(X_i, X_j) + \mathbb{E}[X_i]\mathbb{E}[X_j]\big)\cdot\big(\mathrm{Cov}(Y_i, Y_j) + \mathbb{E}[Y_i]\mathbb{E}[Y_j]\big) - \mathbb{E}[X_i]\mathbb{E}[X_j]\mathbb{E}[Y_i]\mathbb{E}[Y_j]$$
$$= (m^2 - 2m^3)\, p_ip_jq_iq_j.$$

Also,

$$\mathrm{Cov}(X_iY_i, X_j) = \mathbb{E}[X_iY_iX_j] - \mathbb{E}[X_iY_i]\,\mathbb{E}[X_j] = \mathbb{E}[X_iX_j]\,\mathbb{E}[Y_i] - \mathbb{E}[X_i]\mathbb{E}[Y_i]\mathbb{E}[X_j]$$
$$= \big(\mathrm{Cov}(X_i, X_j) + \mathbb{E}[X_i]\mathbb{E}[X_j]\big)\cdot\mathbb{E}[Y_i] - \mathbb{E}[X_i]\mathbb{E}[X_j]\mathbb{E}[Y_i] = \mathrm{Cov}(X_i, X_j)\cdot\mathbb{E}[Y_i].$$

Similar equations hold if we swap $i$ and $j$ and/or we swap $X$ and $Y$. Because covariance is bilinear, this gives us all the information we need in order to exactly compute $\mathrm{Cov}(A_i, A_j)$. In particular, by setting $W_i = X_i - Y_i$, we have:

$$\mathrm{Cov}(A_i, A_j) = \mathrm{Cov}(W_i^2 - X_i - Y_i,\ W_j^2 - X_j - Y_j)$$
$$= \mathrm{Cov}(X_i, X_j) + \mathrm{Cov}(Y_i, Y_j) + \mathrm{Cov}(X_i, Y_j) + \mathrm{Cov}(X_j, Y_i) - \mathrm{Cov}(W_i^2, X_j) - \mathrm{Cov}(W_i^2, Y_j) - \mathrm{Cov}(W_j^2, X_i) - \mathrm{Cov}(W_j^2, Y_i) + \mathrm{Cov}(W_i^2, W_j^2).$$

For the summands we have:

(a)

$$\mathrm{Cov}(W_i^2, X_j) = \mathrm{Cov}\big((X_i - Y_i)^2, X_j\big) = \mathrm{Cov}(X_i^2, X_j) - 2\,\mathrm{Cov}(X_iY_i, X_j)$$
$$= -mp_ip_j - 2m(m-1)p_i^2p_j + 2m^2p_ip_jq_i = -mp_ip_j(1 - 2p_i) + 2m^2p_ip_j(q_i - p_i).$$

(b)

$$\mathrm{Cov}(W_i^2, Y_j) = -mq_iq_j(1 - 2q_i) + 2m^2q_iq_j(p_i - q_i).$$

(c)

$$\mathrm{Cov}(W_j^2, X_i) = -mp_ip_j(1 - 2p_j) + 2m^2p_ip_j(q_j - p_j).$$

(d)

$$\mathrm{Cov}(W_j^2, Y_i) = -mq_iq_j(1 - 2q_j) + 2m^2q_iq_j(p_j - q_j).$$

(e)

$$\mathrm{Cov}(W_i^2, W_j^2) = \mathrm{Cov}(X_i^2, X_j^2) + \mathrm{Cov}(Y_i^2, Y_j^2) + 4\,\mathrm{Cov}(X_iY_i, X_jY_j)$$
$$\quad - 2\,\mathrm{Cov}(X_i^2, X_jY_j) - 2\,\mathrm{Cov}(X_j^2, X_iY_i) - 2\,\mathrm{Cov}(Y_i^2, X_jY_j) - 2\,\mathrm{Cov}(Y_j^2, X_iY_i),$$

where $\mathrm{Cov}(X_i^2, X_jY_j) = \mathrm{Cov}(X_i^2, X_j)\,\mathbb{E}[Y_j]$ and analogously for its symmetric variants. Hence,

$$\mathrm{Cov}(W_i^2, W_j^2) = -mp_ip_j - 2m(m-1)(p_i^2p_j + p_ip_j^2) - 2m(m-1)(2m-3)p_i^2p_j^2$$
$$- mq_iq_j - 2m(m-1)(q_i^2q_j + q_iq_j^2) - 2m(m-1)(2m-3)q_i^2q_j^2$$
$$+ 4(m^2 - 2m^3)p_ip_jq_iq_j$$
$$+ 2m^2p_ip_jq_j + 4m^2(m-1)p_i^2p_jq_j + 2m^2p_ip_jq_i + 4m^2(m-1)p_ip_j^2q_i$$
$$+ 2m^2q_iq_jp_j + 4m^2(m-1)q_i^2q_jp_j + 2m^2q_iq_jp_i + 4m^2(m-1)q_iq_j^2p_i.$$

By substituting, we get:

$$\mathrm{Cov}(A_i, A_j) = \mathrm{Cov}(X_i, X_j) + \mathrm{Cov}(Y_i, Y_j) - \mathrm{Cov}(W_i^2, X_j) - \mathrm{Cov}(W_i^2, Y_j) - \mathrm{Cov}(W_j^2, X_i) - \mathrm{Cov}(W_j^2, Y_i) + \mathrm{Cov}(W_i^2, W_j^2)$$
$$= -m(p_ip_j + q_iq_j)$$
$$+ mp_ip_j(1 - 2p_i) - 2m^2p_ip_j(q_i - p_i) + mq_iq_j(1 - 2q_i) - 2m^2q_iq_j(p_i - q_i)$$
$$+ mp_ip_j(1 - 2p_j) - 2m^2p_ip_j(q_j - p_j) + mq_iq_j(1 - 2q_j) - 2m^2q_iq_j(p_j - q_j)$$
$$+ \mathrm{Cov}(W_i^2, W_j^2).$$

Substituting the expression for $\mathrm{Cov}(W_i^2, W_j^2)$ and collecting terms by powers of $m$, this simplifies to

$$\mathrm{Cov}(A_i, A_j) = -6m(p_i^2p_j^2 + q_i^2q_j^2)$$
$$+ 2m^2\left[5(p_i^2p_j^2 + q_i^2q_j^2) + 2p_ip_jq_iq_j - 2(p_iq_j + p_jq_i)(p_ip_j + q_iq_j)\right]$$
$$- 4m^3(p_i - q_i)(p_j - q_j)(p_ip_j + q_iq_j).$$

In summary,

$$\mathrm{Cov}(A_i, A_j) = -6m(p_i^2p_j^2 + q_i^2q_j^2)$$
$$+ 2m^2\left[5(p_i^2p_j^2 + q_i^2q_j^2) - 6p_ip_jq_iq_j - 2p_iq_i(p_j - q_j)^2 - 2p_jq_j(p_i - q_i)^2\right]$$
$$- 4m^3(p_i - q_i)(p_j - q_j)(p_ip_j + q_iq_j).$$

The total contribution of the covariances to the variance, over all $i \ne j$, is $\sum_{i\ne j}\mathrm{Cov}(A_i, A_j)$. We consider the coefficients on each of the powers of $m$ separately, writing $[m^k]$ for the coefficient of $m^k$. For the $[m]$, $[m^2]$ and $[m^3]$ terms of the covariance, we have:

$$[m^3]\sum_{i\ne j}\mathrm{Cov}(A_i, A_j) = -4\sum_{i\ne j}(p_i - q_i)(p_j - q_j)(p_ip_j + q_iq_j)$$
$$= 4\sum_i(p_i - q_i)^2(p_i^2 + q_i^2) - 4\sum_{i,j}(p_i - q_i)(p_j - q_j)(p_ip_j + q_iq_j)$$
$$= 4\sum_i(p_i - q_i)^2(p_i^2 + q_i^2) - 4(p - q)^T(pp^T + qq^T)(p - q)$$
$$\le 4\sum_i(p_i - q_i)^2(p_i^2 + q_i^2) \le 4\sum_i(p_i - q_i)^2(p_i + q_i).$$

Also, $[m]\sum_{i\ne j}\mathrm{Cov}(A_i, A_j) = -6\sum_{i\ne j}(p_i^2p_j^2 + q_i^2q_j^2) \le 0$. Finally,

$$[m^2]\sum_{i\ne j}\mathrm{Cov}(A_i, A_j) = 2\sum_{i\ne j}\left[5(p_i^2p_j^2 + q_i^2q_j^2) - 6p_ip_jq_iq_j - 2p_iq_i(p_j - q_j)^2 - 2p_jq_j(p_i - q_i)^2\right]$$
$$\le 10\sum_{i\ne j}(p_i^2p_j^2 + q_i^2q_j^2) \le 10\|p\|_2^4 + 10\|q\|_2^4 \le 20b^2 \le 20b.$$

Completing the Proof

$$\mathrm{Var}[A] = \sum_i\mathrm{Var}[A_i] + \sum_{i\ne j}\mathrm{Cov}(A_i, A_j)$$
$$\le \left[80m^2b + 4m^3\sum_i(p_i - q_i)^2(p_i + q_i)\right] + \left[20m^2b + 4m^3\sum_i(p_i - q_i)^2(p_i + q_i)\right]$$
$$\le 100m^2b + 8m^3\sum_i(p_i - q_i)(p_i^2 - q_i^2). \qquad\Box$$

Here, the first bracket follows from summing the bound on $\mathrm{Var}(A_i)$ using $\sum_i(p_i + q_i)^2 \le 2(\|p\|_2^2 + \|q\|_2^2) \le 4b$, and the second bracket collects the covariance contributions bounded above.

2.4 Sample-Optimal Uniformity Testing

In this section, we describe and analyze our optimal uniformity tester. Given samples from an unknown distribution $p$ over $[n]$, our tester returns "YES" with probability $1 - \delta$ if $p = U_n$, and "NO" with probability $1 - \delta$ if $d_{TV}(p, U_n) \ge \epsilon$. Before we get into the details of the statistic and the analysis, in the next section we present some useful results of possible independent interest. That is, we show stochastic domination between statistics of a certain type computed on similar distributions. This will allow us to identify worst-case families of distributions for the specific problem at hand and therefore simplify our worst-case analysis.

    2.4.1 Stochastic Domination for Statistics of the Histogram

In this section, we consider the set of statistics which are symmetric convex functions of the histogram (i.e., of the number of times each domain element is sampled) of a set of samples drawn from an arbitrary distribution. We start with the following definition:

Definition 13. Let $p = (p_1, \ldots, p_n)$, $q = (q_1, \ldots, q_n)$ be probability distributions and let $p^\downarrow, q^\downarrow$ denote the vectors with the same values as $p$ and $q$ respectively, but sorted in non-increasing order. We say that $p$ majorizes $q$ (denoted by $p \succ q$) if

$$\forall k:\quad \sum_{i=1}^k p_i^\downarrow \ge \sum_{i=1}^k q_i^\downarrow. \tag{2.3}$$

The following theorem from [12] gives an equivalent definition:

Theorem 14 ([12]). Let $p = (p_1, \ldots, p_n)$, $q = (q_1, \ldots, q_n)$ be any pair of probability distributions. Then, $p \succ q$ if and only if there exists a doubly stochastic matrix $A$ such that $q = Ap$.

Remark: It is shown in [12] that multiplying the distribution $p$ by a doubly stochastic matrix is equivalent to performing a series of so-called "Robin Hood operations" and permutations of elements. Robin Hood operations are operations in which probability mass is transferred from heavier to lighter elements. For more details, the reader is referred to [12, 107].

    Note that Definition 13 defines a partial order over the set of probability distributions.

    We will see that the uniform distribution is a minimal element for this partial order, which directly follows as a special case of the following lemma.

Lemma 15. Let $p$ be a probability distribution over $[n]$ and $S \subseteq [n]$. Let $q$ be the distribution which is identical to $p$ on $[n]\setminus S$, and for every $i \in S$ we have $q_i = \frac{\sum_{j\in S}p_j}{|S|}$, where $|S|$ denotes the cardinality of $S$. Then, we have that $p \succ q$.

Proof. Let $A = (a_{ij})$ be the doubly stochastic matrix with entries:

$$a_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } i \notin S\\[2pt] \frac{1}{|S|} & \text{if } i \in S \text{ and } j \in S\\[2pt] 0 & \text{otherwise.} \end{cases}$$

Observe that $q = Ap$. Therefore, Theorem 14 implies that $p \succ q$. $\Box$
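Definition 13 and Lemma 15 can be illustrated in a few lines of code; `majorizes` below is a direct transcription of the prefix-sum condition (2.3), applied to a hypothetical example.

```python
def majorizes(p, q, tol=1e-12):
    """Return True iff p majorizes q (prefix sums of sorted-descending vectors)."""
    ps, qs = sorted(p, reverse=True), sorted(q, reverse=True)
    sp = sq = 0.0
    for a, b in zip(ps, qs):
        sp += a
        sq += b
        if sp < sq - tol:
            return False
    return True

# Averaging a subset S of coordinates (the operation of Lemma 15)
# moves the distribution down in the majorization order.
p = [0.4, 0.3, 0.2, 0.1]
S = {0, 2}
avg = sum(p[i] for i in S) / len(S)
q = [avg if i in S else p[i] for i in range(len(p))]
assert majorizes(p, q)       # p majorizes its averaged version
assert not majorizes(q, p)   # and not vice versa (q is strictly "flatter")
```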

In the rest of this section, we use the following standard terminology: We say that a real random variable $A$ stochastically dominates a real random variable $B$ if for all $x \in \mathbb{R}$ it holds that $\Pr[A \ge x] \ge \Pr[B \ge x]$. We now state the main result of this section (see Section 2.4.1 for the proof):

Lemma 16. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a symmetric convex function and $p$ be a distribution over $[n]$. Suppose that we draw $m$ samples from $p$, and let $X_i$ denote the number of times we sample element $i$. Let $g(p)$ be the random variable $f(X_1, X_2, \ldots, X_n)$. Then, for any distribution $q$ over $[n]$ such that $p \succ q$, we have that $g(p)$ stochastically dominates $g(q)$.

As a simple consequence of the above, we obtain the following:

Fact 17. Let $p$ be a distribution on $[n]$ and $S \subseteq [n]$. Let $p'$ be the distribution which is identical to $p$ on $[n]\setminus S$ and whose probabilities in $S$ are averaged (i.e., $p'_i = \frac{\sum_{j\in S}p_j}{|S|}$ for $i \in S$). Then, we have that $\mu(p') \le \mu(p)$, where $\mu(\cdot)$ denotes the expectation of our statistic as defined in Section 2.4. In particular, $\mu(U_n) \le \mu(p)$ for all $p$.

Proof. Recall that our statistic applies a symmetric convex function $f$ to the histogram of the sampled distribution. Since $p'$ is obtained by averaging the probability masses on a subset $S \subseteq [n]$, Lemma 15 gives us that $p \succ p'$. Therefore, by Lemma 16 we conclude that $g(p)$ stochastically dominates $g(p')$, which implies that $\mu(p) = \mathbb{E}[g(p)] \ge \mathbb{E}[g(p')] = \mu(p')$, as was to be shown. $\Box$

The following lemma shows that given an arbitrary distribution $p$ over $[n]$ that is $\epsilon$-far from the uniform distribution $U_n$, if we average the heaviest $\lceil n/2\rceil$ elements and then the lightest $\lfloor n/2\rfloor$ elements, we will get a distribution that is $\epsilon'$-far from uniform, for some $\epsilon' \ge \epsilon/2$.

Lemma 18. Let $p$ be a probability distribution over $[n]$ and $p'$ be the distribution obtained from $p$ after averaging the $\lceil n/2\rceil$ heaviest and the $\lfloor n/2\rfloor$ lightest elements separately. Then, the following holds:

$$\frac{\|p - U_n\|_1}{2} \le \|p' - U_n\|_1 \le \|p - U_n\|_1.$$

Proof. Recall that $p^\downarrow$ denotes the vector $p$ with entries rearranged in non-increasing order. Suppose that at least $n/2$ of the elements have probability mass at least $1/n$.⁶ Therefore, if $p$ is not the uniform distribution, we have

$$\sum_{k=1}^{\lfloor n/2\rfloor} p_k^\downarrow = \frac{1}{2} + \epsilon',$$

for some $\epsilon' > 0$.

Thus, we have that

$$p'^\downarrow_k = \begin{cases}\dfrac{1 + 2\epsilon'}{n} & \text{for } k \le \dfrac{n}{2}\\[6pt] \dfrac{1 - 2\epsilon'}{n} & \text{for } k > \dfrac{n}{2}\end{cases}$$

⁶This is without loss of generality, since we can use essentially the same argument in the other case.

when $n$ is even; when $n$ is odd, the analogous expressions hold, with the $\lceil n/2\rceil$ heaviest entries of $p'$ equal to their common average and the $\lfloor n/2\rfloor$ lightest entries equal to theirs. Moreover, since we are just averaging within each part, the prefix sum $\sum_{k\le\lceil n/2\rceil} p'^\downarrow_k = \sum_{k\le\lceil n/2\rceil} p^\downarrow_k$ is preserved.

Since we have assumed that the majority of elements has mass at least $1/n$, the total variation distance is given by:

$$d_{TV}(p', U_n) = \sum_{i:\, p'_i > 1/n}(p'_i - 1/n) = \sum_{k\le\lceil n/2\rceil}(p^\downarrow_k - 1/n) \ge \frac{1}{2}\sum_{i:\, p_i > 1/n}(p_i - 1/n) = \frac{1}{2}\, d_{TV}(p, U_n),$$

where the inequality holds because each of the $\lceil n/2\rceil$ largest excesses $p^\downarrow_k - 1/n$ is at least as large as each of the at most $\lfloor n/2\rfloor$ remaining ones. Thus, $d_{TV}(p', U_n) \ge (1/2)\, d_{TV}(p, U_n)$, or $\|p - U_n\|_1/2 \le \|p' - U_n\|_1$, as desired. $\Box$
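The two-sided inequality of the lemma can be checked on concrete examples; the sketch below (with an arbitrary example distribution and $n$ even for simplicity) averages the two halves and compares total variation distances.

```python
def tv_from_uniform(p):
    n = len(p)
    return 0.5 * sum(abs(x - 1.0 / n) for x in p)

def average_halves(p):
    """Average the n/2 heaviest and n/2 lightest entries separately
    (n even here, for simplicity of the sketch)."""
    n = len(p)
    s = sorted(p, reverse=True)
    top, bot = s[: n // 2], s[n // 2:]
    return [sum(top) / len(top)] * len(top) + [sum(bot) / len(bot)] * len(bot)

p = [0.35, 0.05, 0.30, 0.10, 0.15, 0.05]      # an arbitrary example
pp = average_halves(p)
# p' remains far from uniform: tv(p)/2 <= tv(p') <= tv(p).
assert tv_from_uniform(p) / 2 <= tv_from_uniform(pp) <= tv_from_uniform(p) + 1e-12
```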

We note that by doing the averaging as suggested by the above lemma, we obtain a distribution $p'$ that is supported on a set of at most three values of the form $\left\{\frac{1+\epsilon_1}{n}, \frac{1}{n}, \frac{1-\epsilon_2}{n}\right\}$, for some $\epsilon_1, \epsilon_2 > 0$.

Symmetric convex statistics

    Symmetric convex statistics

The following two lemmas will be useful for analyzing the behavior of symmetric convex statistics of the histogram when the probability distribution becomes smoother through Robin Hood operations.

Lemma 19. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a symmetric convex function, and $a, b, c \in \mathbb{R}$ be such that $0 < a < b$ and $c > 0$. Then,

$$f(a, b+c, x_3, \ldots, x_n) \ge f(a+c, b, x_3, \ldots, x_n).$$

Proof. Consider the set of convex functions $f'_{x_3,\ldots,x_n} : \mathbb{R}^2 \to \mathbb{R}$ defined as:

$$f'_{x_3,\ldots,x_n}(x_1, x_2) = f(x_1, x_2, x_3, \ldots, x_n).$$

We will show that for every possible choice of $x_3, \ldots, x_n$ it holds that:

$$f'_{x_3,\ldots,x_n}(a, b+c) \ge f'_{x_3,\ldots,x_n}(a+c, b).$$

Since $f$ is symmetric, so is $f'$. Therefore, we have that $f'_{x_3,\ldots,x_n}(a, b+c) = f'_{x_3,\ldots,x_n}(b+c, a)$. The three points $P_1 = (a, b+c)$, $P_2 = (a+c, b)$, $P_3 = (b+c, a)$ are collinear, since their coordinates satisfy the equation $x_1 + x_2 = a + b + c$.

We have that $P_2$ is between $P_1$ and $P_3$ since:

$$\langle \overrightarrow{P_1P_2},\, \overrightarrow{P_2P_3}\rangle = \langle (c, -c),\, (b - a, a - b)\rangle = 2c(b - a) > 0.$$

Writing $P_2 = \lambda P_1 + (1-\lambda)P_3$ for some $\lambda \in [0, 1]$ and applying convexity (Jensen's inequality), we get that

$$f'_{x_3,\ldots,x_n}(a+c, b) \le \lambda\, f'_{x_3,\ldots,x_n}(a, b+c) + (1-\lambda)\, f'_{x_3,\ldots,x_n}(b+c, a) = f'_{x_3,\ldots,x_n}(a, b+c),$$

as desired. $\Box$

The stochastic domination between the two statistics is established in the following lemma:

Lemma 20. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a symmetric convex function, $p$ be a distribution over $[n]$, and $a, b \in [n]$ be such that $p_a \le p_b$. Also, let $q$ be the distribution which is identical to $p$ on $[n]\setminus\{a, b\}$, and for which:

$$\begin{pmatrix} q_a \\ q_b \end{pmatrix} = \begin{pmatrix} w & 1-w \\ 1-w & w \end{pmatrix}\begin{pmatrix} p_a \\ p_b \end{pmatrix}, \tag{2.4}$$

where $w \in [\frac{1}{2}, 1]$. Suppose we take $m$ samples from $p$ and let $X_i$ denote the number of times we sample element $i$. Let $g(p)$ be the random variable $f(X_1, X_2, \ldots, X_n)$. Then, $g(p)$ stochastically dominates $g(q)$.

Proof. To prove stochastic domination between $g(p)$ and $g(q)$, we are going to define a coupling under which it is always true that $g(p)$ takes a larger value than $g(q)$.

Initially, we define an auxiliary coupling between $p$ and $q$ as follows: To get a sample from $q$, we first sample from $p$ and use it as our sample, unless the output is element "$b$", in which case we output "$a$" with probability $\frac{(1-w)(p_b - p_a)}{p_b}$ and "$b$" otherwise.⁷
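That this swap rule reproduces the marginals prescribed by (2.4) is a one-line calculation, which the following sketch verifies for hypothetical values of $w, p_a, p_b$.

```python
# Hypothetical values: any w in [1/2, 1] and p_a <= p_b will do.
w = 0.7
pa, pb = 0.1, 0.3
qa = w * pa + (1 - w) * pb        # marginals prescribed by Eq. (2.4)
qb = w * pb + (1 - w) * pa

swap = (1 - w) * (pb - pa) / pb   # prob. that a sampled "b" is rewritten to "a"
assert abs((pa + pb * swap) - qa) < 1e-12   # coupled prob. of outputting "a"
assert abs(pb * (1 - swap) - qb) < 1e-12    # coupled prob. of outputting "b"
```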

    Suppose now that we draw m samples from p, which we also convert to samples from q using the above rule.

In relation to this coupling we define the following random variables:

* $X_{\mathrm{low}}$: The number of times element "$a$" is sampled.

* $X_{\mathrm{high}}$: The number of times element "$b$" is sampled and is not swapped for element "$a$" in $q$.

* $X_{\mathrm{mid}}$: The number of times element "$b$" is sampled and is swapped for element "$a$" in $q$.

From the above, we have that:

$$X_a = X_{\mathrm{low}}, \quad X_b = X_{\mathrm{high}} + X_{\mathrm{mid}}, \quad X^q_a = X_{\mathrm{low}} + X_{\mathrm{mid}}, \quad X^q_b = X_{\mathrm{high}},$$

where $X^q_i$ is the number of times element $i$ is sampled in $q$.

We want to show that $g(p)$ stochastically dominates $g(q)$. That is, we want to show that:⁸

$$\forall t:\quad \Pr[f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n) \ge t] \ge \Pr[f(X_{\mathrm{low}} + X_{\mathrm{mid}}, X_{\mathrm{high}}, X_3, \ldots, X_n) \ge t].$$

We now condition on the events $E_{\{y,z\}} : \{X_{\mathrm{low}}, X_{\mathrm{high}}\} = \{y, z\}$ and $E_{c,x_3,\ldots,x_n} : \{X_{\mathrm{mid}} = c \wedge X_3 = x_3 \wedge \cdots \wedge X_n = x_n\}$, where $y \le z$ without loss of generality. Let $B = E_{\{y,z\}} \wedge E_{c,x_3,\ldots,x_n}$.

⁷Note that this coupling does not fix the value of $g(q)$ given a fixed value for $g(p)$, and is defined for convenience. We still have to show stochastic domination for the coupled random variables using a second coupling.
⁸To simplify notation, we pick $a = 1$ and $b = 2$ without loss of generality.

We have that:

$$\Pr[f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n) \ge t] = \sum_{y,z,c,x_3,\ldots,x_n}\Pr[f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n) \ge t \mid B]\cdot\Pr[B].$$

So, it suffices to show that for every $y, z, c, x_3, \ldots, x_n, t$:

$$\Pr[f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n) \ge t \mid B] \ge \Pr[f(X_{\mathrm{low}} + X_{\mathrm{mid}}, X_{\mathrm{high}}, X_3, \ldots, X_n) \ge t \mid B]. \tag{2.5}$$

At this point, we have conditioned on everything except which of $X_{\mathrm{low}}$ and $X_{\mathrm{high}}$ is $y$ and which is $z$. That is, after conditioning on the event $B = E_{\{y,z\}} \wedge E_{c,x_3,\ldots,x_n}$, we have that:

$$\{f(X_{\mathrm{low}} + X_{\mathrm{mid}}, X_{\mathrm{high}}, X_3, \ldots, X_n),\ f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n)\} = \{u, v\},$$

where $u = f(y + c, z, x_3, \ldots, x_n)$ and $v = f(y, z + c, x_3, \ldots, x_n)$. Since by assumption $y \le z$, we have by Lemma 19 that $u \le v$. Then (2.5) holds trivially as an equality for $t \le u$ and for $t > v$. For the remaining values of $t$, it is equivalent to:

$$\Pr[f(X_{\mathrm{low}}, X_{\mathrm{high}} + X_{\mathrm{mid}}, X_3, \ldots, X_n) = v \mid B] \ge \Pr[f(X_{\mathrm{low}} + X_{\mathrm{mid}}, X_{\mathrm{high}}, X_3, \ldots, X_n) = v \mid B],$$

and hence to

$$\Pr[X_{\mathrm{low}} = y, X_{\mathrm{high}} = z \mid B] \ge \Pr[X_{\mathrm{low}} = z, X_{\mathrm{high}} = y \mid B].$$

Now, this is also equivalent to a version with less restricted conditioning,

$$\Pr[X_{\mathrm{low}} = y, X_{\mathrm{high}} = z \mid E_{c,x_3,\ldots,x_n}] \ge \Pr[X_{\mathrm{low}} = z, X_{\mathrm{high}} = y \mid E_{c,x_3,\ldots,x_n}],$$

because neither event occurs in the added regime where $E_{\{y,z\}}$ is false. But if we rethink how our samples were drawn, we find that this is equivalent to showing that

$$p_a^y\, q_b^z \ge p_a^z\, q_b^y.$$

This holds since $q_b^{z-y} \ge p_a^{z-y}$, concluding the proof. $\Box$

Proof of Lemma 16: Since $p \succ q$, we have by Theorem 14 (and the remark that follows it) that $q$ can be constructed from $p$ by repeated applications of operations of the form (2.4). Therefore, Lemma 20 and the fact that stochastic domination is transitive imply that $g(p)$ stochastically dominates $g(q)$. $\Box$

    2.4.2 Our Test Statistic

We define a very natural statistic that yields a uniformity tester with optimal dependence on the domain size $n$, the proximity parameter $\epsilon$, and the error probability $\delta$. Our statistic is a thresholded version of the empirical total variation distance between the unknown distribution $p$ and the uniform distribution. Our tester TEST-UNIFORMITY is described in the following pseudocode:

Algorithm TEST-UNIFORMITY(p, n, ε, δ)
Input: sample access to a distribution $p$ over $[n]$, $\epsilon > 0$, and $\delta > 0$.
Output: "YES" if $p = U_n$; "NO" if $d_{TV}(p, U_n) \ge \epsilon$.

1. Draw $m = \Theta\left((1/\epsilon^2)\cdot\left(\sqrt{n\log(1/\delta)} + \log(1/\delta)\right)\right)$ i.i.d. samples from $p$.

2. Let $X = (X_1, X_2, \ldots, X_n) \in \mathbb{Z}_{\ge 0}^n$ be the histogram of the samples. That is, $X_i$ is the number of times domain element $i$ appears in the (multi-)set of samples.

3. Define the random variable $S = \sum_{i=1}^n \max\{X_i - \frac{m}{n}, 0\}$ and set the threshold

$$t = \mu(U_n) + C\cdot\epsilon^2\cdot\begin{cases} m^2/n & \text{for } m \le n\\ m\sqrt{m/n} & \text{for } n \le m \le n/\epsilon^2\\ m/\epsilon & \text{for } n/\epsilon^2 \le m, \end{cases}$$

where $C$ is a universal constant (derived from the analysis of the algorithm), and $\mu(U_n)$ is the expected value of the statistic in the completeness case. (We can compute $\mu(U_n)$ in $O(m)$ time using the procedure in Section 2.4.2.)

4. If $S > t$ return "NO"; otherwise, return "YES".

In order to compute $\mu(U_n)$ in step 3 of the algorithm, we use the following procedure:

Computation of the Expectation in the Completeness Case

Our statistic can be written as $S = \sum_{i=1}^n \max\{X_i - \frac{m}{n}, 0\}$. Therefore, by linearity of expectation, we get:

$$\mathbb{E}[S] = \sum_{i=1}^n \mathbb{E}\left[\max\left\{X_i - \frac{m}{n},\, 0\right\}\right].$$

So, all we need to do is to compute $\mathbb{E}[\max\{X_i - \frac{m}{n}, 0\}]$ for a single value of $i$ in the completeness case.

Note that $X_i \sim \mathrm{Bin}(m, \frac{1}{n})$ and that the above expectation can be written as:

$$\mathbb{E}\left[\max\left\{X_i - \frac{m}{n},\, 0\right\}\right] = \sum_{k=\lceil m/n\rceil}^{m} \Pr[X_i = k]\left(k - \frac{m}{n}\right), \tag{2.6}$$

where $\Pr[X_i = k] = \binom{m}{k}\left(\frac{1}{n}\right)^k\left(1 - \frac{1}{n}\right)^{m-k}$. This is a sum of $O(m)$ terms, each of which can be computed in constant time, giving an $O(m)$ runtime overall.
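The $O(m)$ procedure just described can be written out as follows; this is a sketch and the function name is our own.

```python
from math import ceil, comb

def mu_uniform(m, n):
    """E[S] when p = U_n, where S = sum_i max(X_i - m/n, 0); runs in O(m) time
    by evaluating Eq. (2.6) for a single Bin(m, 1/n) coordinate."""
    thresh = m / n
    prob = 1.0 / n
    total = 0.0
    for k in range(ceil(thresh), m + 1):
        total += comb(m, k) * prob ** k * (1 - prob) ** (m - k) * (k - thresh)
    return n * total

# Tiny sanity check: for n = m = 2, E[max(X_1 - 1, 0)] = Pr[X_1 = 2] = 1/4,
# so E[S] = 2 * (1/4) = 1/2.
assert abs(mu_uniform(2, 2) - 0.5) < 1e-12
```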

    The main part of this section is devoted to the analysis of TEST-UNIFORMITY, establishing the following theorem:

Theorem 21. There exists a universal constant $C > 0$ such that the following holds: Given $m \ge C\cdot(1/\epsilon^2)\left(\sqrt{n\log(1/\delta)} + \log(1/\delta)\right)$ samples from an unknown distribution $p$, Algorithm TEST-UNIFORMITY is an $(\epsilon, \delta)$-tester for uniformity of the distribution $p$.

As we point out in Section 2.4.2, the value $\mu(U_n)$ can be computed efficiently, hence our overall tester is computationally efficient. To prove correctness of the above tester, we need to show that the expected value of the statistic in the completeness case is sufficiently separated from the expected value in the soundness case, and also that the value of the statistic is highly concentrated around its expectation in both cases. In Section 2.4.3, we bound from below the difference in the expectation of our statistic in the completeness and soundness cases. In Section 2.4.5, we prove the desired concentration, which completes the proof of Theorem 21.

    2.4.3 Bounding the Expectation Gap

The expectation of the statistic in Algorithm TEST-UNIFORMITY can be viewed as a function of the $n$ variables $p_1, \ldots, p_n$. We denote this expectation by $\mu(p) \stackrel{\mathrm{def}}{=} \mathbb{E}[S(X_1, \ldots, X_n)]$ when the samples are drawn from distribution $p$.

Our analysis has a number of complications for the following reason: the function $\mu(p) - \mu(U_n)$ is a linear combination of sums that have no indefinite closed form, even if the distribution $p$ assigns only two possible probabilities to the elements of the domain. This statement is made precise in Section 2.4.6. As such, we should only hope to obtain an approximation of this quantity.

A natural approach to try to obtain such an approximation would be to produce separate closed-form approximations for $\mu(p)$ and $\mu(U_n)$, and combine these quantities to obtain an approximation for their difference. However, one should not expect such an approach to work in our context. The reason is that the difference $\mu(p) - \mu(U_n)$ can be much smaller than $\mu(p)$ and $\mu(U_n)$; it can even be arbitrarily small. As such, obtaining separate approximations of $\mu(p)$ and $\mu(U_n)$ to any fixed accuracy would contribute too much error to their difference.

To overcome these difficulties, we introduce the following technique, which is novel in this context. We directly bound from below the difference $\mu(p) - \mu(U_n)$ using strong convexity. Specifically, we show that the function $\mu$ is strongly convex with appropriate parameters and use this fact to bound the desired expectation gap. The main result of this section is the following lemma:

Lemma 22. Let $p$ be a distribution over $[n]$ and $\epsilon = d_{TV}(p, U_n)$. For all $m \ge 6$ and $n \ge 2$, we have that:

$$\mu(p) - \mu(U_n) \ge \Omega(1)\cdot\epsilon^2\cdot\begin{cases} m^2/n & \text{for } m \le n\\ m\sqrt{m/n} & \text{for } n \le m \le n/\epsilon^2\\ m/\epsilon & \text{for } n/\epsilon^2 \le m. \end{cases}$$

We note that the bounds on the right-hand side above are tight, up to constant factors. Any asymptotic improvement would yield a uniformity tester with sample complexity that violates our tight information-theoretic lower bounds.

    The proof of Lemma 22 requires a couple of important intermediate lemmas. Our starting point is as follows: By the intermediate value theorem, we have the quadratic expansion

    1 p(p) = [(U.) + Vp(Un)T(p - Un) + (p - Un)THP (p - Un) 2 where Hp is the Hessian matrix of the function p at some point p' which lies on the line segment between Un and p. This expression can be simplified as follows: First,

    67 we show (Fact 17) that our p is minimized over all probability distributions on input U. Thus, the gradient Vp(U) must be orthogonal to being a direction in the space of probability distributions. In other words, Vp(Un) must be proportional to the all- ones vector. More formally, since p is symmetric its gradient is a symmetric function, which implies it will be symmetric when given symmetric input. Moreover, (p - U") is a direction within the space of probability distributions, and therefore sums to 0, making it orthogonal to the all-ones vector. Thus, we have that Vy (U,)T(p- U") = 0, and we obtain

    μ(p) − μ(U_n) = (1/2)(p − U_n)ᵀ H_{p′} (p − U_n) ≥ (1/2)‖p − U_n‖₂² · σ ≥ (1/2)‖p − U_n‖₁² · σ/n,   (2.7)

where σ is the minimum eigenvalue of the Hessian of μ on the line segment between U_n and p.

The majority of this section is devoted to proving a lower bound for σ. Before doing so, however, we must first address a technical consideration. Because we are considering a function over the space of probability distributions, which is not full-dimensional, the Hessian and gradient of μ with respect to ℝⁿ depend not only on the definition of our statistic S, but also on its parameterization. For the purposes of this subsection, we parameterize S as

    S(x) = (1/m) Σ_{i=1}^n max{x_i − m/n, 0}.

In the analysis we are about to perform, it will be helpful to replace m/n with a free parameter t, which we will eventually set back to roughly m/n. Thus, we define

    S_t(x) := (1/m) Σ_{i=1}^n max{x_i − t, 0}

and

    μ_t(p) := E_{x∼Multinomial(m,p)}[S_t(x)] = (1/m) Σ_{i=1}^n Σ_{k=⌈t⌉}^{m} (m choose k) p_i^k (1 − p_i)^{m−k} (k − t).   (2.8)

Note that when t = m/n we have S_t = S and μ_t = μ. Also note that when we compute the Hessian of μ_t(p), we are treating μ_t(p) as a function of p and not of t.

In the following lemma, we derive an exact expression for the entries of the Hessian. This result is perhaps surprising in light of the likely nonexistence of a closed-form expression for μ(p): that is, while the expectation μ(p) may have no closed form, we prove that its Hessian does in fact have one.

Lemma 23. The Hessian of μ_t(p), viewed as a function of p, is a diagonal matrix whose i-th diagonal entry is given by

    h_ii = s_{t,i},

where we define s_{t,i} as follows. Let Δ_t be the distance of t from the next largest integer, i.e., Δ_t := ⌈t⌉ − t. Then, we have that

    s_{t,i} = 0                                                      for t = 0,
    s_{t,i} = (m − 1) (m−2 choose t−1) p_i^{t−1} (1 − p_i)^{m−t−1}   for t ∈ ℤ_{>0},
    s_{t,i} = Δ_t · s_{⌊t⌋,i} + (1 − Δ_t) · s_{⌈t⌉,i}                for t > 0 and t ∉ ℤ.

In other words, we will derive the formula for integral t ≥ 1 and then prove that the value for nonintegral t > 0 can be found by linearly interpolating between the values at the two closest integral values of t.

Proof. Note that because S_t(x) is a separable function of x, μ_t(p) is a separable function of p, and hence the Hessian of μ_t(p) is a diagonal matrix. By Equation (2.8), the i-th diagonal entry of this Hessian can be written explicitly as the following expression:

    s_{t,i} = ∂²/∂p_i² (1/m) Σ_{k=⌈t⌉}^{m} (m choose k) p_i^k (1 − p_i)^{m−k} (k − t).

Notice that if we sum starting from k = 0 instead of k = ⌈t⌉, then the sum equals the expectation of Bin(m, p_i) minus t, whose second derivative vanishes. That is:

    ∂²/∂p_i² (1/m) Σ_{k=0}^{m} (m choose k) p_i^k (1 − p_i)^{m−k} (k − t) = ∂²/∂p_i² (1/m)(m·p_i − t) = 0.

By this observation, and the fact that the summand is 0 for integer t when k = t, we can switch which values of k we are summing over, to k from 0 through ⌊t⌋, if we negate the expression:

    s_{t,i} = ∂²/∂p_i² (1/m) Σ_{k=0}^{⌊t⌋} (m choose k) p_i^k (1 − p_i)^{m−k} (t − k).

We first prove the case t ∈ ℤ₊. In this case, we view s_{t,i} as a sequence with respect to t (where i is fixed), which we denote s_t. We now derive a generating function for this sequence.⁹ Since Σ_k (m choose k) p_i^k (1 − p_i)^{m−k} x^k = (p_i x + 1 − p_i)^m and the sequence t ↦ t has generating function x/(1 − x)², the inner sums above form a convolution whose generating function is (1/m)(p_i x + 1 − p_i)^m · x/(1 − x)². Observing that derivatives that are not with respect to the formal variable commute with taking generating functions, the generating function for the sequence {s_t} is

    ∂²/∂p_i² [(1/m)(p_i x + 1 − p_i)^m · x/(1 − x)²] = (1/m) · m(m − 1)(x − 1)² (p_i x + 1 − p_i)^{m−2} · x/(1 − x)² = (m − 1) · x · (p_i x + 1 − p_i)^{m−2}.

Note that the coefficient on x⁰ is 0, so s_{0,i} = 0, as claimed. For t ∈ ℤ_{>0}, the right-hand side is the generating function of

    (m − 1) (m−2 choose t−1) p_i^{t−1} (1 − p_i)^{m−t−1}.

Thus, this expression gives the i-th diagonal entry of the Hessian for t ∈ ℤ_{>0}, as claimed.

⁹To avoid potential convergence issues, we view generating functions as formal power series. Under this formalism, there is no need to deal with convergence at all.

Now consider the case when t is not an integer. In this case, we have:

    s_{t,i} = ∂²/∂p_i² (1/m) Σ_{k=⌈t⌉}^{m} (m choose k) p_i^k (1 − p_i)^{m−k} (k − t)
            = ∂²/∂p_i² (1/m) Σ_{k=⌈t⌉}^{m} (m choose k) p_i^k (1 − p_i)^{m−k} (k − ⌈t⌉ + Δ_t)
            = s_{⌈t⌉,i} + Δ_t · ∂²/∂p_i² (1/m) Σ_{k=⌈t⌉}^{m} (m choose k) p_i^k (1 − p_i)^{m−k}
            = s_{⌈t⌉,i} − Δ_t · ∂²/∂p_i² (1/m) Σ_{k=0}^{⌈t⌉−1} (m choose k) p_i^k (1 − p_i)^{m−k}.

The last equality holds because if we change the bounds on the sum so that they run from 0 through m, the sum equals 1, which has second derivative 0. Thus, we can flip which terms we are summing over if we negate the expression.

Note that the expression we are subtracting above can alternatively be written as:

    Δ_t · ∂²/∂p_i² (1/m) Σ_{k=0}^{⌈t⌉−1} (m choose k) p_i^k (1 − p_i)^{m−k} = Δ_t · (s_{⌈t⌉,i} − s_{⌊t⌋,i}).

Thus, we have

    s_{t,i} = s_{⌈t⌉,i} − Δ_t · (s_{⌈t⌉,i} − s_{⌊t⌋,i}) = Δ_t · s_{⌊t⌋,i} + (1 − Δ_t) · s_{⌈t⌉,i},

as desired. This completes the proof of Lemma 23. □
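Since μ_t is separable, the i-th diagonal Hessian entry is the second derivative of a one-dimensional function of p_i, so the closed form of Lemma 23 is easy to sanity-check numerically. The sketch below (an illustrative check, pure Python; the parameter values are arbitrary) compares the closed form against a central finite difference, for one integral and one nonintegral value of t:

```python
import math

def mu_t_marginal(p, m, t):
    """i-th coordinate's contribution to mu_t(p):
       (1/m) * sum_{k = ceil(t)}^{m} C(m,k) p^k (1-p)^(m-k) (k - t)."""
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k) * (k - t)
               for k in range(math.ceil(t), m + 1)) / m

def s_closed_form(p, m, t):
    """Diagonal Hessian entry s_{t,i} from Lemma 23."""
    def s_int(tt):
        if tt == 0:
            return 0.0
        return (m - 1) * math.comb(m - 2, tt - 1) * p**(tt - 1) * (1 - p)**(m - tt - 1)
    if float(t).is_integer():
        return s_int(int(t))
    delta = math.ceil(t) - t  # distance of t from the next largest integer
    return delta * s_int(math.floor(t)) + (1 - delta) * s_int(math.ceil(t))

def second_derivative(f, p, h=1e-4):
    return (f(p + h) - 2 * f(p) + f(p - h)) / (h * h)

m, p = 10, 0.3
checks = {t: (second_derivative(lambda q: mu_t_marginal(q, m, t), p),
              s_closed_form(p, m, t))
          for t in (3, 2.6)}
```

The finite-difference value and the closed form agree to several digits for both the integral and the interpolated case.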

    It will be convenient to simplify the exact expressions of Lemma 23 into something more manageable. This is done in the following lemma:

Lemma 24. Fix any constant c > 0, and let p be a distribution whose probability mass values lie in the set {(1 − ε)/n, 1/n, (1 + ε)/n}. The Hessian of μ(p), viewed as a function of p, is a diagonal matrix whose i-th diagonal entry satisfies

    h_ii = s_{t=m/n,i} ≥ Θ(1) · m²/n    for m ≤ n,
    h_ii = s_{t=m/n,i} ≥ Θ(1) · √(mn)   for n ≤ m ≤ c · n/ε²,

for m ≥ 6, n ≥ 2, and ε ≤ 1/2.

Similarly, these bounds are tight up to constant factors, as further improvements would violate our sample complexity lower bounds.

Proof. By Lemma 23, we have an exact expression s_{t,i} for the i-th diagonal entry of the Hessian of μ_t(p).

First, consider the case where m ≤ n. Then t = m/n ≤ 1, so ⌊t⌋ = 0 and s_{⌊t⌋,i} = 0, and we have

    s_{t,i} = (1 − Δ_t) · s_{⌈t⌉,i}.

Substituting t = m/n, ⌈t⌉ = 1, and Δ_t = ⌈t⌉ − t = 1 − m/n gives

    s_{t,i} = (m/n) · (m − 1)(1 − p_i)^{m−2} = Θ(m²/n),

where the last equality uses that (1 − p_i)^{m−2} = Θ(1), since p_i = Θ(1/n) and m ≤ n.

Now consider the case where n ≤ m ≤ Θ(1) · n/ε². Note that the subcase n ≤ m ≤ 2n follows from (i) the fact that s_{t,i} for fractional t linearly interpolates between the values s_{t′,i} at the nearest two integral values of t′, and (ii) the analyses of the cases m ≤ n and 2n ≤ m ≤ Θ(1) · n/ε². Thus, all we have left to do is prove the case where 2n ≤ m ≤ Θ(1) · n/ε².

Since s_{t,i} is a convex combination of s_{⌈t⌉,i} and s_{⌊t⌋,i}, it suffices to bound from below both of these quantities for t = m/n. Both tasks can be accomplished simultaneously by bounding from below the quantity s_{t=m/n+γ,i} for arbitrary γ ∈ [−1, 1]. We do this as follows: let t = m/n + γ, and use Stirling's approximation, which for any γ ∈ [−1, 1] gives the bounds below.

Bounding s_{t,i}: Note that Stirling's approximation is tight up to constant factors as long as the number whose factorial we take is nonzero. Here m − 2 ≥ 1, t − 1 ≥ 1, and m − t − 1 ≥ m/2 − 2 ≥ 1. Thus, if we apply Stirling's approximation to the factorials in the definition of the binomial coefficient and substitute t = m/n + γ, we obtain the following approximation, which is tight up to constant factors:

    (m−2 choose t−1) = Θ(1) · √(n/m) · (m − 2)^{m−2} / ((t − 1)^{t−1} · (m − t − 1)^{m−t−1}).

Using this approximation in the closed form s_{t,i} = (m − 1) · (m−2 choose t−1) · p_i^{t−1} (1 − p_i)^{m−t−1}, substituting p_i = (1 ± ε)/n (the two signs are handled symmetrically), and simplifying the resulting exponentials, we get:

    s_{t,i} = Θ(1) · √(mn) · (1 ± ε)^{m/n+γ−1} · (1 ∓ ε/(n − 1))^{m−m/n−γ−1}
            = Θ(1) · √(mn) · (1 − ε²)^{Θ(m/n)}
            = Θ(1) · √(mn) · e^{−Θ(1)·ε²·(m/n)}
            ≥ Θ(1) · √(mn)   (since m ≤ Θ(1) · n/ε²).

This completes the proof of Lemma 24. □
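As a numerical sanity check on the √(mn) bound (an illustration, not a proof), one can evaluate the closed form of Lemma 23 at p_i = 1/n and t = m/n (chosen integral here for simplicity) and compare it against √(mn); the ratio stays within a constant factor across parameter settings:

```python
import math

def hessian_entry_uniform(m, n):
    """s_{t,i} from Lemma 23 at p_i = 1/n with t = m/n (assumed integral here),
       evaluated in log-space to avoid overflow."""
    t = m // n
    log_s = (math.log(m - 1)
             + math.lgamma(m - 1) - math.lgamma(t) - math.lgamma(m - t)  # log C(m-2, t-1)
             + (t - 1) * math.log(1.0 / n)
             + (m - t - 1) * math.log(1.0 - 1.0 / n))
    return math.exp(log_s)

ratios = [hessian_entry_uniform(m, n) / math.sqrt(m * n)
          for n, m in [(50, 200), (100, 800), (200, 1000)]]
```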

2.4.4 Proof of Lemma 22

Given Lemmas 23 and 24 from Section 2.4.3, we are ready to prove the desired expectation gap.

Proof of Lemma 22: We start by reducing the soundness case to a much simpler setting. To do this, we use the following fact, established in Section 2.4.1:

Fact 25. Let S(D) be the random variable taking the value of our test statistic when the samples come from the distribution D. For any distribution p on [n], there exists a distribution p′ supported on [n] whose probability mass values are in the set {(1 − ε′)/n, 1/n, (1 + ε′)/n} for some ε′ ≥ d_TV(p, U_n)/2, with at most one element having mass 1/n, and such that the statistic S(p) stochastically dominates S(p′). In particular, we have that μ(p′) ≤ μ(p).

Proof. This is an immediate corollary of Lemmas 15, 16, and 18. □

By Fact 25, there is a distribution p′ that satisfies the conditions of Lemma 24, has total variation distance Θ(ε) to the uniform distribution, and satisfies μ(p′) ≤ μ(p). Therefore, it suffices to prove a lower bound on the expectation gap between the completeness and soundness cases for distributions p of this form. Note that all probability distributions on the line segment from p to U_n are also of this form, for different (no larger) values of ε. Thus, Lemma 24 gives a lower bound on the diagonal entries of the Hessian at all points on this segment. Since the Hessian is diagonal, this also bounds from below the minimum eigenvalue of the Hessian on this segment. Therefore, by this and Equation (2.7), we obtain the first two cases of this lemma, as well as the third case for n/ε² ≤ m ≤ 4n/ε².

The final case of this lemma, for 4n/ε² ≤ m, follows immediately from the folklore fact that if one takes at least this many samples, the empirical distribution approximates the true distribution within expected ℓ₁ error at most ε/2. For completeness, we give a proof. We have

    E[‖X/m − p‖₁] = Σ_{i=1}^n E[|X_i/m − p_i|] ≤ Σ_{i=1}^n √(Var[X_i/m]) ≤ Σ_{i=1}^n √(m·p_i/m²) ≤ n · √(m·(1/n)/m²) = √(n/m) ≤ ε/2,   (2.9)

where the penultimate inequality in Equation (2.9) follows from the fact that the sum is a symmetric concave function of p, so it is maximized by setting all the p_i's to be equal.
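The last step of (2.9), namely Σ_i √(p_i/m) ≤ √(n/m) for every distribution p, is straightforward to check numerically; it also follows from Cauchy–Schwarz. A quick sketch (the parameters and seed are illustrative):

```python
import math
import random

def l1_bound(p, m):
    """sum_i sqrt(p_i / m): the bound on E||X/m - p||_1 used in Equation (2.9)."""
    return sum(math.sqrt(pi / m) for pi in p)

rng = random.Random(1)
n, m = 100, 1600
raw = [rng.random() for _ in range(n)]
total = sum(raw)
p = [x / total for x in raw]  # a random distribution on [n]
```

The bound is tight exactly at the uniform distribution, where it equals √(n/m).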

    2.4.5 Concentration of Test Statistic: Proof of Theorem 21

Let the m samples be Y₁, …, Y_m ∈ [n], and let X_i, i ∈ [n], be the number of j ∈ [m] for which Y_j = i. Let S be our empirical total variation test statistic, S = (1/2) Σ_i |X_i/m − 1/n|.

We prove the theorem in two parts: one when m ≥ n, and one when m ≤ n.

We will require a "Bernstein" form of the standard bounded differences (McDiarmid) inequality:

Lemma 26 (Bernstein version of McDiarmid's inequality [136]). Let Y₁, …, Y_m be independent random variables taking values in the set 𝒴. Let f : 𝒴^m → ℝ be a function of y₁, …, y_m such that, for every j ∈ [m] and all y₁, …, y_m, y′_j ∈ 𝒴, we have:

    |f(y₁, …, y_j, …, y_m) − f(y₁, …, y′_j, …, y_m)| ≤ B.

Then, we have:

    Pr[f(Y₁, …, Y_m) − E[f] ≥ z] ≤ exp(−2z²/(m·B²)).   (2.10)

If, in addition, for each j ∈ [m] and y₁, …, y_{j−1}, y_{j+1}, …, y_m we have that Var_{y_j}[f(y₁, …, y_j, …, y_m)] ≤ σ_j², then we have

    Pr[f(Y₁, …, Y_m) − E[f] ≥ z] ≤ exp(−z²/(2(Σ_{j=1}^m σ_j² + B·z/3))).   (2.11)

Case I: m ≥ n

Since the Y_j's are independent and S is (1/m)-Lipschitz in them, the first form of McDiarmid's inequality implies that

    Pr[S − E[S] ≥ z] ≤ exp(−2mz²),

and similarly, by applying it to −S, we have Pr[S − E[S] ≤ −z] ≤ exp(−2mz²). Let R be the right-hand side of the inequality in Lemma 22, so that μ(p) − μ(U_n) ≥ R in the soundness case. Since we threshold the tester at τ = μ(U_n) + R/2, we find in both the completeness and soundness cases that the success probability is at least 1 − exp(−mR²/2), and hence we just need to show

    mR²/2 ≥ log(1/δ).   (2.12)

Since we are in the regime m ≥ n, there are two possible cases in Lemma 22. For n ≤ m ≤ n/ε², we need that

    (m/2) · Θ(1) · ε⁴ · (m/n) ≥ log(1/δ),  or  m ≥ Θ(1) · √(n log(1/δ))/ε².

For m ≥ n/ε², we need that

    (m/2) · Θ(1) · ε² ≥ log(1/δ),  or  m ≥ Θ(1) · log(1/δ)/ε².

The theorem's assumption on m implies that both conditions hold, which completes the proof of Theorem 21 in this case.
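For concreteness, here is a sketch of the resulting tester in the regime m ≥ n/ε². All constants are illustrative: rather than using the suppressed Θ(1) factors, we calibrate the threshold by estimating μ(U_n) empirically, and the far distribution is a hypothetical worst case with half the support missing:

```python
import random
from collections import Counter

def tv_statistic(samples, n):
    """S = (1/2) sum_i |X_i/m - 1/n| (unseen elements contribute 1/n each)."""
    m = len(samples)
    counts = Counter(samples)
    seen = sum(abs(c / m - 1.0 / n) for c in counts.values())
    unseen = (n - len(counts)) / n
    return 0.5 * (seen + unseen)

def draw(p, m, rng):
    return rng.choices(range(len(p)), weights=p, k=m)

rng = random.Random(7)
n, m, eps = 100, 4000, 0.5          # here m >= n/eps^2, so the gap is Theta(eps)
uniform = [1.0 / n] * n
far = [2.0 / n] * (n // 2) + [0.0] * (n // 2)   # d_TV(far, U_n) = 1/2 = eps

# Calibrate: estimate mu(U_n) empirically, then threshold at mu(U_n) + eps/4
mu_u = sum(tv_statistic(draw(uniform, m, rng), n) for _ in range(50)) / 50
threshold = mu_u + eps / 4

accepts_uniform = tv_statistic(draw(uniform, m, rng), n) <= threshold
accepts_far = tv_statistic(draw(far, m, rng), n) <= threshold
```

With these parameters the statistic concentrates well inside its respective side of the threshold in both cases.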

Case II: m ≤ n

To establish Theorem 21 for m ≤ n, we will require the Bernstein form of McDiarmid's inequality (Equation (2.11) in Lemma 26).

To apply this form of Lemma 26, it suffices to compute B and σ_j for our test statistic as a function of the Y_j's. Note that for m ≤ n, whenever X_i ≠ 0 we have X_i/m ≥ 1/m ≥ 1/n, so |X_i/m − 1/n| = X_i/m − 1/n. In particular, this implies

    S = (1/2) Σ_{i : X_i ≠ 0} (X_i/m − 1/n) + (1/2) Σ_{i : X_i = 0} 1/n = 1 − |{i : X_i ≠ 0}|/n.
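The identity above, i.e., that for m ≤ n the statistic depends only on the number of distinct elements seen, can be checked directly (an illustrative check; the parameters and seed are arbitrary):

```python
import random
from collections import Counter

def stat_direct(samples, n):
    """S = (1/2) sum_i |X_i/m - 1/n|, computed from the definition."""
    m = len(samples)
    c = Counter(samples)
    return 0.5 * sum(abs(c[i] / m - 1.0 / n) for i in range(n))

def stat_support(samples, n):
    """For m <= n: S = 1 - |{i : X_i != 0}| / n."""
    return 1.0 - len(set(samples)) / n

rng = random.Random(3)
n, m = 1000, 300                    # the sparse regime m <= n
samples = [rng.randrange(n) for _ in range(m)]
```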

Hence, the value of the parameter B for our test statistic is 1/n, since changing a single Y_j changes the number of nonzero X_i's by at most 1. In particular, the function value as Y_j varies (with the other Y_{j′}'s kept fixed) can be written as the sum of a deterministic quantity plus (1/n) · b, where b is a Bernoulli random variable that is 1 if sample Y_j collides with another sample Y_{j′} and 0 otherwise. Thus, the variance of S as Y_j varies with the other Y_{j′}'s kept fixed is Var[(1/n) · b] = (1/n²) · r(1 − r), where r is the probability that Y_j collides with another sample.

By Fact 25, the probability that the value of the statistic falls below the threshold (i.e., the type II error) is maximized for some distribution that has probability mass values in the set {(1 − ε′)/n, 1/n, (1 + ε′)/n} for some ε′ ≥ d_TV(p, U_n)/2. This allows us to consider only this worst case. Therefore, we have that r ≤ m(1 + ε′)/n ≤ 2m/n, and the variance of S as Y_j varies with the other Y_{j′}'s kept fixed is at most

    (1/n²) · r(1 − r) ≤ r/n² ≤ 2m/n³ =: σ_j²

in both the completeness and soundness cases. Applying Equation (2.11) of Lemma 26, we find

    Pr[|S − E[S]| ≥ z] ≤ 2 exp(−z²/(4m²/n³ + (2/3)·z/n)).

By Lemma 22, in the soundness case we have an expectation gap μ(p) − μ(U_n) ≥ R := C·ε²·m²/n² for some constant C ≤ 1. Substituting z = R/2 into the above concentration inequality yields that our tester is correct with probability 1 − δ as long as m ≥ Θ(1) · (1/ε²) · √(n log(2/δ)), for an appropriately chosen constant, which is true by assumption. This completes the proof of Theorem 21.

    2.4.6 Non-Existence of Indefinite Closed-Form for Components of Expectation

In this section, we formalize and prove our assertion from Section 2.4.3 that the function μ(p) − μ(U_n) is a linear combination of sums, each of which has no indefinite closed form.

Recall Equation (2.8), which says that the expectation is a linear combination of sums with summands of the form (m choose k) q^k (1 − q)^{m−k} (k − t) for various values of q, where the values of q are themselves different variables that any closed form would need to depend on (in addition to the other variables). A sum is said to have an indefinite closed form if, when the upper and lower limits of the sum are replaced with new variables, the resulting sum has a closed form valid for all values of all variables.

By closed form, we mean a closed form as defined in [118, Definition 8.1.1], which, as far as we are aware, is the main formal sense in which the phrase is used in combinatorics. This definition of closed form says that a function can be written as a sum of a constant number of rational functions, where the numerator and denominator of each is a linear combination of a constant number of products of exponentials, factorials, and constant-degree polynomials. An example of such a function is (n choose k) · 5^k/7^{2k} · k! + 2^k + k³.

To prove that a sum with summands (m choose k) q^k (1 − q)^{m−k} (k − t) has no indefinite closed form, where m, k, q, t, and the limits of the sum are the variables that the closed form would need to be a function of, one can run Gosper's algorithm on this summand with k as the index of summation and observe that it returns that there is no indefinite closed-form solution in the sense we have described [118, Theorem 5.6.3], [77, 117].

    2.5 Information-Theoretic Lower Bound

    In this section, we prove our matching sample complexity lower bound. Namely, we prove:

Theorem 27. Any algorithm that distinguishes, with probability at least 1 − δ, the uniform distribution on [n] from any distribution that is ε-far from uniform in total variation distance requires at least

    Ω((√(n log(1/δ)) + log(1/δ))/ε²)

samples.

Theorem 27 will immediately follow from separate sample complexity lower bounds of Ω(log(1/δ)/ε²) and Ω(√(n log(1/δ))/ε²) that we will prove. We start with the simple sample complexity lower bound of Ω(log(1/δ)/ε²):

Lemma 28. For all n, ε, and δ, any (ε, δ) uniformity tester requires Ω(log(1/δ)/ε²) samples.

Proof. If n is odd, set the last probability to 1/n, subtract 1 from n, and invoke the following lower bound instance on the remaining elements. If n is even, do the following. Consider the distribution p which has probability p_i = (1 + 2ε)/n for each element 1 ≤ i ≤ n/2 and p_i = (1 − 2ε)/n for each element n/2 < i ≤ n. Clearly, d_TV(p, U_n) = ε. Note that the probability that a sample comes from the first half of the domain is (1 + 2ε)/2 and the probability that it comes from the second half of the domain is (1 − 2ε)/2. Therefore, distinguishing p from U_n is equivalent to distinguishing between a fair coin and an ε-biased coin. It is well known (see, e.g., Chapter 2 of [17]) that this task requires m = Ω(log(1/δ)/ε²) samples. □
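The coin-distinguishing reduction can be visualized by computing the exact total variation distance between Bin(m, 1/2) and Bin(m, 1/2 + ε): it stays bounded away from 1 until m reaches order 1/ε². A numerical illustration (log-space pmf evaluation avoids overflow for large m; the parameters are arbitrary):

```python
import math

def binom_logpmf(k, m, p):
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            + k * math.log(p) + (m - k) * math.log(1.0 - p))

def tv_fair_vs_biased(m, eps):
    """Exact d_TV(Bin(m, 1/2), Bin(m, 1/2 + eps))."""
    return 0.5 * sum(abs(math.exp(binom_logpmf(k, m, 0.5))
                         - math.exp(binom_logpmf(k, m, 0.5 + eps)))
                     for k in range(m + 1))

eps = 0.02
tv_small = tv_fair_vs_biased(100, eps)    # m << 1/eps^2: nearly indistinguishable
tv_large = tv_fair_vs_biased(10000, eps)  # m >> 1/eps^2: essentially distinguishable
```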

    The rest of this section is devoted to the proof of the following lemma, which gives our desired lower bound:

Lemma 29. For all n, ε, and δ, any (ε, δ) uniformity tester requires at least Ω(√(n log(1/δ))/ε²) samples.

To prove Lemma 29, we will construct two indistinguishable families of pseudo-distributions. A pseudo-distribution w is a non-negative measure; i.e., it is similar to a probability distribution, except that the "probabilities" may sum to something other than 1. We will require that our pseudo-distributions always sum to a quantity within a constant factor of 1. A pair of pseudo-distribution families is said to be δ-indistinguishable using m samples if no tester exists that can, for every pair of pseudo-distributions w, w′, one from each of the two families, distinguish the product distributions ⊗_i Poi(m·w_i) versus ⊗_i Poi(m·w′_i) with failure probability at most δ.

This technique is fairly standard and has been used in [135, 129, 49] to establish lower bounds for distribution testing problems. The benefit of the method is that it is much easier to show that pseudo-distributions are indistinguishable than to work directly with ordinary distributions. Moreover, lower bounds proven using pseudo-distributions imply lower bounds for the original distribution testing problem.

We will require the following lemma, whose proof is implicit in the analyses of [135, 129, 49]:

Lemma 30. Let P′ be a property of distributions. We extend P′ to the unique property P of pseudo-distributions which agrees with P′ on true distributions and is preserved under rescaling. Suppose we have two families F₁, F₂ of pseudo-distributions with the following properties:

1. All pseudo-distributions in F₁ have property P, and all those in F₂ are ε-far in total variation distance from any pseudo-distribution that has the property.

2. F₁ and F₂ are δ-indistinguishable using m samples.

3. Every pseudo-distribution in each family has ℓ₁-norm within the interval [1/c₁, c₂], for some constants c₁, c₂ ≥ 1.

Then there exist two families F̃₁, F̃₂ of probability distributions with the following properties:

1. All distributions in F̃₁ have property P′, and all those in F̃₂ are Θ(ε)-far in total variation distance from any distribution that has the property.

2. Any tester that can distinguish F̃₁ and F̃₂ has worst-case error probability at least δ − 2^{−cm}, for some constant c > 0, or requires Θ(1) · m samples.

In our case, the property of distributions P′ is simply being the uniform distribution. The families of pseudo-distributions we will use for our lower bound are the family F₁ that contains only the uniform distribution, and the family F₂ of all pseudo-distributions of the form w_i = (1 ± ε)/n such that |1 − Σ_i w_i| ≤ ε/2. Note that this constraint on the sum of the w_i's ensures the first and last conditions needed to invoke Lemma 30.¹⁰

Furthermore, by Lemma 28, the required number of samples satisfies m ≥ Ω(log(1/δ)). Ignoring constant factors, we may assume that m ≥ c′ log(1/δ) for any constant c′ > 0. In particular, by selecting c′ appropriately, we can guarantee that 2^{−cm} ≤ δ/3, where c is the constant in the last statement of Lemma 30. Thus, the error probability guaranteed by Lemma 30 for distinguishing the true distribution families is at least (2/3)δ.

¹⁰If we did not have this constraint, the first condition would not be satisfied, because, e.g., a w such that w_i = (1 + ε)/n for all i is not ε-far from being proportional to the all-ones vector.

Thus, all that remains is to show that F₁ and F₂ are δ-indistinguishable using m samples. In order to show these families are indistinguishable, we show that it is impossible to distinguish whether the product distribution ⊗_i Poi(m·w_i) has w uniform or w generated according to the following random process: we pick each w_i independently, setting w_i = (1 + ε)/n or w_i = (1 − ε)/n, each with probability 1/2.

A pseudo-distribution generated by this process has a small probability of not being in F₂. Specifically, this happens iff it fails to satisfy the constraint that its sum lie within 1 ± ε/2. However, by an application of the Chernoff bound, it follows that this happens with probability at most 2^{−Θ(1)·n}. Since the bound in Lemma 28 is larger (up to constant factors) than the lower bound we presently wish to prove in the case that δ ≤ 2^{−Θ(1)·n}, we may assume δ ≥ 3 · 2^{−Θ(1)·n}; in that case, the following lemma implies that we can still invoke Lemma 30, where the probability of not being in F₂ is absorbed into our overall indistinguishability probability, and we get a final indistinguishability probability of at least (2/3)δ − δ/3 = δ/3. The following lemma is implicit in [135, 129, 49]:

Lemma 31. Let P be a property of pseudo-distributions. Suppose we have two families F₁, F₂ of pseudo-distributions and two distributions D₁, D₂ over F₁ and F₂, respectively, with the following properties:

1. With probability at least 1 − δ₁ each, a pseudo-distribution drawn from D₁ is in F₁ and one drawn from D₂ is in F₂.

2. If we generate w according to D₁ or D₂, then any algorithm for determining which family w came from, given access to ⊗_i Poi(m·w_i), has worst-case error probability at least δ₂.

Then F₁ and F₂ are (δ₂ − δ₁)-indistinguishable using m samples.

Thus, we now simply need to show that D₁ and D₂ are hard to distinguish. Let X_i, X′_i be the random variables equal to the number of times the element i is sampled in the completeness and soundness cases, respectively. We will require a technical lemma that will be used to bound the Hellinger distance between any pair of corresponding coordinates in the completeness and soundness cases. By (1/2)·Poi(λ₁) + (1/2)·Poi(λ₂), we denote a uniform mixture of the corresponding distributions. We have:

Fact 32 (Lemma 7 of [129]). For any λ > 0 and ε ≤ 1, we have

    H²(Poi(λ), (1/2)·Poi((1 + ε)λ) + (1/2)·Poi((1 − ε)λ)) ≤ C·λ²·ε⁴

for some constant C.

We also require the following lemma, which gives a tighter relationship between Hellinger distance and total variation distance when the former is close to 1.

Lemma 33. Any distributions with squared Hellinger distance H²(p, q) ≤ 1 − δ have total variation distance at most d_TV(p, q) ≤ 1 − δ²/2.

The more standard inequality between these quantities only gives d_TV(p, q) ≤ √(2(1 − δ)), which is worse than the trivial bound of 1 when δ is small.

Proof. Let H²(p, q) ≤ 1 − δ. Let a_i := min(p_i, q_i) and b_i := max(p_i, q_i). Then we have

    δ ≤ 1 − H²(p, q) = Σ_i √(p_i q_i) = Σ_i √(a_i b_i).

Applying Cauchy–Schwarz then yields

    δ² ≤ (Σ_i √(a_i b_i))² ≤ (Σ_i a_i)(Σ_i b_i) ≤ 2 Σ_i a_i = 2(1 − d_TV(p, q)).

Thus,

    d_TV(p, q) ≤ 1 − δ²/2. □
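Lemma 33 is easy to verify numerically on random distribution pairs (an illustrative check only; the seed and dimensions are arbitrary, and H² is taken in the convention H²(p, q) = 1 − Σ_i √(p_i q_i) used in the proof):

```python
import math
import random

def hellinger_sq(p, q):
    """H^2(p, q) = 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

rng = random.Random(5)

def rand_dist(n):
    raw = [rng.random() for _ in range(n)]
    s = sum(raw)
    return [x / s for x in raw]

violations = 0
for _ in range(200):
    p, q = rand_dist(20), rand_dist(20)
    delta = 1.0 - hellinger_sq(p, q)
    if tv(p, q) > 1.0 - delta * delta / 2.0 + 1e-12:
        violations += 1
```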

    We are now ready to prove Lemma 29.

Proof of Lemma 29: As follows from the discussion preceding Fact 32, it suffices to show that D₁ and D₂ are hard to distinguish. We will use Fact 32 to show that the Hellinger distance between the overall product distributions is bounded away from 1, which implies the same for their total variation distance, and hence that they cannot be distinguished with failure probability smaller than δ.

Recall that each of the n coordinates of the vectors output by these distributions is distributed according to Poi(m/n) for D₁, versus a uniform mixture of Poi((1 ± ε)m/n) for D₂. A single coordinate then has

    H²(X₁, X′₁) ≤ C·(m²/n²)·ε⁴,

so the collection of all coordinates has

    H²(X, X′) ≤ 1 − (1 − C(m²/n²)ε⁴)ⁿ ≤ 1 − e^{−2C(m²/n)ε⁴},

so by Lemma 33,

    d_TV(X, X′) ≤ 1 − e^{−4C(m²/n)ε⁴}/2.

For this latter quantity to be at least 1 − δ, as is necessary for any tester with failure probability δ, we need

    m = Ω((1/ε²) · √(n log(1/δ)))

samples, as desired. The result then follows from Lemma 31 and Lemma 30. This completes the proof. □

Chapter 3

    Sublinear algorithms with conditional sampling

The enormous growth in the size of modern datasets has created a shortage of resources, such as time and space, for computations on them. This is the problem that sublinear algorithms aim to address, and there has been significant progress in this field in the past 20 years (see [124, 125, 44, 61, 119, 73] for more details). Many of these works assume random access to the input dataset and use a sublinear-size random sample from it to compute the result. In this thesis, we examine how we can further improve the running time under stronger, yet reasonable, assumptions on data access.

    3.1 Conditional sampling model and motivation

In several situations, more flexible access to the dataset may be possible, e.g., when the data is stored in a database. Such improved access can significantly reduce the number of queries needed to perform support estimation or other tasks. One recent model, called conditional sampling, introduced by [34, 30] for distribution testing, captures such a possibility. In that model, there is an underlying distribution D, and a conditional sampling oracle that takes as input a subset S of the domain and produces a sample from D conditioned on the set S. The works of [34] and [30] study several problems in distribution testing and obtain surprising results: using conditional queries, it is possible to bypass lower bounds on the sample complexity in the standard distribution testing framework and obtain testers with only polylogarithmic, or even constant, query complexity, an exponential improvement over the lower bounds. Their results were further improved by [60] for identity and closeness testing. In follow-up work, [4, 120] consider the support estimation problem we described above and prove that it can be solved using only O(poly log log n) conditional samples. This is a doubly exponentially better guarantee compared to the optimal classical algorithm, which requires Θ(n/log n) samples [130].

Inspired by the power of these results, we introduce a computational model based on conditional sampling, where access to the dataset is provided via a conditional sampling oracle that returns data points at random from a specified set. More precisely, an algorithm is given access to an oracle COND(C) that takes as input a function C : Ω → {0, 1} and returns a tuple (i, x_i) with C(x_i) = 1, where i is chosen uniformly at random from the subset {j ∈ [n] | C(x_j) = 1}. If no such tuple exists, the oracle returns ⊥. The function C is represented as a binary circuit. We assume that queries to the conditional sampling oracle COND take time linear in the circuit size.
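In code, the COND oracle can be modeled as follows (a minimal sketch: the predicate C is an arbitrary Python function standing in for the binary circuit, None stands in for the oracle's ⊥ answer, and we scan the points explicitly rather than querying a database):

```python
import random

def cond(points, C, rng):
    """COND(C): return (i, x_i) with i uniform over {j : C(x_j) = 1},
    or None (standing in for the 'bottom' answer) if no point satisfies C."""
    satisfying = [(j, x) for j, x in enumerate(points) if C(x)]
    if not satisfying:
        return None
    return satisfying[rng.randrange(len(satisfying))]

rng = random.Random(11)
points = [4, 7, 1, 9, 12, 3]
answer = cond(points, lambda v: v > 5, rng)   # some (i, x_i) with x_i > 5
empty = cond(points, lambda v: v > 100, rng)  # no satisfying point
```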

Most works using the conditional sampling model for accessing the data measure the performance of algorithms only by their query complexity. The work of [30] considers the description complexity of the query set S by examining restricted conditional queries that specify either pairs of points or intervals. Similarly, the work of [41] provides property testing algorithms in a model similar to conditional sampling that allows querying arbitrary subsets of a Euclidean space; to keep the description complexity small, they focus on queries in the form of axis-parallel boxes or simplices. However, in many cases, such simple queries might not be sufficient to obtain efficient algorithms. We instead use the circuit size of the description of a set as a measure of its simplicity, which allows for richer queries and is naturally counted toward the running time of our algorithms.

Apart from its theoretical interest, it is also practically useful to consider algorithms that perform well in the conditional sampling model, because efficient algorithms for this model can be directly implemented in a number of different computational models that arise when we have to deal with huge amounts of data.

Parallel and distributed computing: We note that the computation of one conditional sample can easily be parallelized: it suffices to assign to each processor a part of the input and send each of them the description of the circuit. Each processor can compute which of its points satisfy the circuit and pick one uniformly at random among them. Then, we can select as output the sample of one processor chosen at random, where the probability of choosing a processor is proportional to the number of points assigned to it that belong to the conditioning set. This way, we can implement a conditional sampling query in just a few steps. If the input is divided evenly among m processors, the load on each of them is n/m. Combining the answers can be done in log m steps, and therefore the running time of an algorithm A in the parallel computation model is O(q · s · (n/m + log m) + r), where q is the number of conditional queries, s is the size of the description of the sets used, and r is the additional computation time needed. This gives a non-trivial parallelization of the problem P. Besides the running time, one important issue that can decrease the performance of a parallel algorithm is the communication needed among the processors, as described in the work of Afrati et al. [5]. This communication cost can be bounded by the size s of the circuit at each round, plus the communication for partitioning the input, which happens only once at the beginning.

The implementation of a conditional query in the distributed computation model can follow the same ideas as in the parallel computation model.

Streaming algorithms: We can implement a conditional query in the streaming model, where we want to minimize the number of passes over the input, as follows: with one pass over the input, we can select one point uniformly at random from the points that belong to the conditioning set, using standard streaming techniques (reservoir sampling). The space that we need for each of these passes is just s, and we need q passes over the input.
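A one-pass implementation of such a conditional query via reservoir sampling restricted to the conditioning set can be sketched as follows (C is again an arbitrary predicate standing in for the circuit; the stream contents are illustrative):

```python
import random

def streaming_cond_sample(stream, C, rng):
    """One pass over the stream; returns (index, point) uniform over the
    points satisfying C, via reservoir sampling, or None if no point does."""
    chosen, seen = None, 0
    for i, x in enumerate(stream):
        if C(x):
            seen += 1
            if rng.randrange(seen) == 0:   # replace the reservoir with prob 1/seen
                chosen = (i, x)
    return chosen

rng = random.Random(13)
stream = [5, 2, 9, 4, 7, 10]
hit = streaming_cond_sample(iter(stream), lambda v: v % 2 == 0, rng)
miss = streaming_cond_sample(iter(stream), lambda v: v > 99, rng)
```

Only the current reservoir entry and a counter are stored, on top of the description of C.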

The surprising observation in all the above cases is that once we use the conditional sampling model appropriately in each of the aforementioned computational models, we get high-performance algorithms whenever q, s, and r are small. In this chapter, we show how to design algorithms for which all of these quantities are only polylogarithmic in the size of the input, which leads to very efficient algorithms in all the above models.

    3.1.1 Previous Work in the classical model

We consider two very well-studied combinatorial problems: k-means clustering and minimum spanning tree. For these problems, the following is known about sublinear algorithms in the classical setting.

    k-means Clustering

Sublinear algorithms for k-median and k-means clustering were first studied by Indyk [87]. In this work, given a set of n points from a metric space, an algorithm is given that computes a set of size O(k) that approximates the cost of the optimal clustering within a constant factor and runs in time O(nk). Notice that the algorithm is sublinear, as the input contains all the pairwise distances between the points, which have total size Θ(n²).

    In followup work, Mettu and Plaxton [1091 gave a randomized constant approxi- mation algorithm for the k-median problem with running time O(n(k+ log n)) subject to the constraint R < 20(n/ log(n/k)), where R denotes the ratio between the maximum and the minimum distance between any pair of distinct points in the metric space.

Also, Meyerson et al. [110] presented a sublinear algorithm for the k-median problem with running time O((k²/ε) log(k/ε)), under the assumption that each cluster has size Ω(nε/k).

In a different line of work, Mishra, Oblinger and Pitt [111], and later Czumaj and Sohler [43], assume that the diameter Δ of the set of points is bounded and known.

The running time of the algorithm by Mishra et al. [111] is only logarithmic in the input size n, but is polynomial in Δ. Their algorithm is very simple: it just picks a subset of points uniformly at random and solves the clustering problem on that subset. Following similar ideas, Czumaj and Sohler [43] gave a tighter analysis of the same algorithm, proving that the running time depends only on the diameter Δ and is independent of n. The dependence on Δ is still polynomial in this work. The guarantee in both these works is a constant multiplicative approximation with an additional additive error term.

    Minimum Spanning Tree in Euclidean metric space

    There is a large body of work on sublinear algorithms for the minimum spanning tree.

In [87], given n points in a metric space, an algorithm is provided that outputs a spanning tree in time Õ(n/δ) achieving a (1/2 − δ)-approximation to the optimum.

When considering only the task of estimating the weight of the optimal spanning tree, Czumaj and Sohler [42] provided an algorithm that gets a (1 + ε)-approximation. The running time of this algorithm is Õ(n · poly(1/ε)).

To achieve better guarantees, several assumptions could be made. A first assumption is that we are given a graph with bounded average degree deg and edge weights bounded by W. For this case, the work of Chazelle et al. [38] provides a sublinear algorithm with running time Õ(deg · W · 1/ε²) that returns the weight of the minimum spanning tree with relative error ε. Although the algorithm completely removes the dependence on the number of points n, it depends polynomially on the maximum weight W. Also, in very dense graphs deg is polynomial in n, and therefore we again have a polynomial dependence on n.

Finally, another assumption that we could make is that the points belong to the d-dimensional Euclidean space. For this case, the work of Czumaj et al. [39] provides a (1 + ε)-approximation algorithm that requires time Õ(√n · (1/ε)^d), while access to specific geometric queries is assumed via some special data structures. Note that in this case the size of the input is O(n) and not O(n²), since given the coordinates of the n points we can calculate any distance. Therefore, the algorithms described before with running time Õ(n) are not sublinear anymore. Although Czumaj et al. [39] manage to achieve a sublinear algorithm in this case, they cannot escape the polynomial dependence on n. Additionally, their algorithm has exponential dependence on the dimension of the Euclidean space.

    3.1.2 Our Contributions

The main result in this chapter is that in the conditional sampling framework we can get exponentially faster sublinear algorithms compared to the sublinear algorithms in the classical framework.

We first provide some basic building blocks: useful primitives for the design of algorithms. These building blocks are:

1. Compute the size of a set given its description (Section 3.3.1).

2. Compute the maximum of the weights of the points of a set, given the description of the set and the description of the weights (Section 3.3.2).

3. Compute the sum of the weights of the points of a set, given the description of the set and the description of the weights (Section 3.3.3).

4. Get a weighted conditional sample from the input set of points, given the description of the weights (Section 3.3.4).

5. Get an ℓ₀-sample, given the description of labels of the points (Section 3.3.5).

For all these primitives, we give algorithms that run in time polylogarithmic in the domain size and the value range of the weights. We achieve this by querying the conditional sampling oracle with random subsets produced by an appropriately chosen distribution on the domain. Intuitively, this helps to estimate the density of the input points on different parts of the domain. One important issue of conditioning on random sets is that the description complexity of the set can be almost linear in the domain size. To overcome this difficulty, we replace the random sets with pseudorandom ones based on Nisan's pseudorandom generator [114]. The implementation of these primitives is of independent interest, especially the fourth one, since it shows that the weighted conditional sample, which is similar to sampling models that have been used in the literature [3], can be simulated by the conditional sampling model with only a polylogarithmic overhead in the query complexity and the running time.

After describing and analyzing these basic primitives, we use them to design fast sublinear algorithms for k-means clustering and the minimum spanning tree.

k-means Clustering

Departing from the works of Mishra, Oblinger and Pitt [111] and Czumaj and Sohler [43], where the algorithms start by choosing a uniform random subset, we start by choosing a random subset based on weighted sampling. In the classical computational model we need at least linear time to get one such sample, and thus it is not possible to use the power of weighted sampling to get sublinear-time algorithms for the k-means problem. But when we work in the conditional sampling model, weighted sampling can be implemented in polylogarithmic time and queries. This enables us to use the known literature on obtaining efficient algorithms via weighted sampling [13]. Quantitatively, the advantage of weighted sampling is that we get sublinear algorithms with running time Õ(poly(log Δ, log n)), where Δ is the diameter of the metric space and n the number of points in the input. This is exponentially better than Indyk [87] in terms of n, and exponentially better than Czumaj and Sohler [43] in terms of Δ. This shows the huge advantage that one can get from the ability to use or implement conditional sampling queries. We develop these ideas in detail in Section 3.4.

    Minimum Spanning Tree in Euclidean metric space

Based on the series of works on sublinear algorithms for minimum spanning trees, we develop algorithms that exploit the power of conditional sampling and achieve polylogarithmic time with respect to the number of input points n and only polynomial time with respect to the dimension of the Euclidean space. This is very surprising, since in the classical model there seems to exist a polynomial barrier that we cannot escape. Compared to the algorithm by Czumaj et al. [39], we get running time Õ(poly(d, log n, 1/ε)), which is an exponential improvement with respect to both the parameters n and d.

We present our algorithm in Section 3.5. From a technical point of view, we use a gridding technique similar to [39], but prove that using a random grid can significantly reduce the runtime of the algorithm, as we avoid tricky configurations that can happen in the worst case.

    3.2 Formal definitions

Notation: For m ∈ ℕ we denote the set {1, …, m} by [m]. We use Õ(N) to denote O(N · log^{O(1)} N).

Given a function f that takes values over the rationals, we use C_f to denote the binary circuit that takes as input the binary representation of the input x of f and outputs the binary representation of the output f(x). If the input or the output is a rational number, then the representation is the pair (numerator, denominator).

Suppose we are given an input x̄ = (x₁, x₂, …, xₙ) of length n, where every xᵢ belongs to some set Q. In this work, we will fix Q = [D]^d for some D = n^{O(1)} to be the discretized d-dimensional Euclidean space. Our goal is to compute the value of a symmetric function f : Qⁿ → ℝ₊ on input x̄ ∈ Qⁿ. We assume that all xᵢ are distinct and define X ⊆ Q as the set X = {xᵢ : i ∈ [n]}. Since we consider symmetric functions f, it is convenient to extend the definition of f to sets: f(X) = f(x̄).

A randomized algorithm that estimates the value f(x̄) is called sublinear if and only if its running time is o(n). We are interested in additive or multiplicative approximation. A sublinear algorithm ALG for computing f has (ε, δ)-additive approximation if and only if

P[|ALG(x̄) − f(x̄)| > ε] ≤ δ,

and has (ε, δ)-multiplicative approximation if and only if

P[(1 − ε)f(x̄) ≤ ALG(x̄) ≤ (1 + ε)f(x̄)] ≥ 1 − δ.

We usually refer simply to (ε, δ)-approximation when it is clear from the context whether we mean the additive or the multiplicative one.

3.2.1 Conditional Sampling as a Computational Model

The standard sublinear model assumes that the input is stored in a random access memory that has no further structure. Since f is symmetric in the input points, the only reasonable operation is to uniformly sample points from the input. Equivalently, the input can be provided by an oracle SUB that returns a tuple (i, xᵢ), where i is chosen uniformly at random from the set [n] = {1, …, n}.

When the input has additional structure (e.g. points stored in a database), more complex queries can be performed. The conditional sampling model allows such queries of small description complexity. In particular, the algorithm is given access to an oracle COND(C) that takes as input a function C : Q → {0, 1} and returns a tuple (i, xᵢ) with C(xᵢ) = 1, where i is chosen uniformly at random from the subset {j ∈ [n] | C(xⱼ) = 1}. If no such tuple exists, the oracle returns ⊥. The function C is represented as a sublinear-sized binary circuit. All the results presented in this chapter use polylogarithmic circuit sizes.

    We assume that queries to the conditional sampling oracle COND take time linear in the circuit size. Equivalently, we could assume constant time, as we are already paying linear cost in the size of the circuit to construct it.
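To make the model concrete, here is a minimal sketch (in Python; the names are ours, not from the thesis) of a linear-time simulation of the COND oracle. An actual conditional-sampling algorithm receives the oracle as a black box; this simulation is only for experimentation.

```python
import random

def cond_oracle(points, circuit, rng=random.Random(1)):
    """Naive linear-time simulation of COND(C): return a uniformly random
    pair (i, x_i) among the input points with circuit(x_i) == 1, or None
    (standing in for the oracle's ⊥ answer) when no point satisfies C."""
    matches = [(i, x) for i, x in enumerate(points) if circuit(x)]
    if not matches:
        return None  # the oracle's ⊥ answer
    return rng.choice(matches)
```

The cost model in the text charges each such query time linear in the size of the circuit describing C, since the circuit must be constructed anyway.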

    3.3 Basic Primitives

In this section, we describe some primitive operations that can be efficiently implemented in this model. We will use these primitives as black boxes in the algorithms for the combinatorial problems we consider. We make this separation because these primitives are commonly used building blocks, and it makes the presentation of our algorithms cleaner.

A lot of the algorithmic primitives are based on constructing random subsets of the domain and querying the oracle COND with a description of such a set. A barrier is that these subsets have description complexity that is linear in the domain size. For this reason, we will use pseudorandom sets whose descriptions are polylogarithmic in the domain size. The main tool for this is Nisan's pseudorandom generator [114], which produces pseudorandom bits that appear perfectly random to algorithms running in small space.

Theorem 34. Let U_N and U_ℓ denote uniformly random binary sequences of length N and ℓ respectively. There exists a map G : {0, 1}^ℓ → {0, 1}^N such that for any algorithm A : {0, 1}^N → {0, 1} with A ∈ SPACE(S), where S = S(N), it holds that

|P(A(U_N) = 1) − P(A(G(U_ℓ)) = 1)| ≤ 2^{−S}

for ℓ = Θ(S log N).

Nisan's pseudorandom generator is a simple recursive process that starts with Θ(S log N) random bits and generates a sequence of N bits. The sequence is generated in blocks of size S, and every block can be computed from the seed of size Θ(S log N) using O(log N) multiplications on S-bit numbers. The overall time and space complexity to compute the k-th block of S bits is Õ(S log N), and there exists a circuit of size Õ(S log N) that performs this computation.

Using Nisan's theorem, we can easily obtain pseudorandom sets for conditional sampling. We are interested in random sets where every element x appears with probability g(x) for some given function g.

Corollary 35. Let R be a random set, described by a circuit C_R, that is generated by independently adding each element x ∈ Q with probability g(x), where g is described by a circuit C_g. For any δ < |Q|^{−1}, there exists a random set R′ described by an O(|C_g| + log|Q| log(1/δ))-sized circuit C_{R′} such that

|P(COND(C ∧ C_R) = x) − P(COND(C ∧ C_{R′}) = x)| ≤ δ    (3.1)

for all circuits C and elements x ∈ Q.

Proof. The corollary is an application of Nisan's pseudorandom generator to conditional sampling. A simple linear-time algorithm that performs conditional sampling based on a random set R is as follows. We keep two variables: cnt_matched, which keeps track of the number of elements that pass the selection criteria and is initialized to 0, and the selected element. For every element x in the domain Q, in order, we perform the following:

1. Draw k random bits b ∈ {0, 1}^k and check whether the number b · 2^{−k} > g(x).

2. If yes, skip x and continue to the next element.

3. Otherwise, if C(x) = 1, increment cnt_matched and with probability 1/cnt_matched change the selected element to x.

Note that here we have truncated the probabilities g(x) to accuracy 2^{−k}, so the random set R̃ actually used is slightly different from R. However, picking k = Θ(log(|Q|/δ)), we have that

|P(COND(C ∧ C_R) = x) − P(COND(C ∧ C_{R̃}) = x)| ≤ δ/2    (3.2)

for all circuits C and elements x ∈ Q.

To prove the statement, we will use Nisan's pseudorandom generator to generate the sequence of bits for the algorithm. The algorithm requires only the memory to store the two variables, which is O(log|Q|). Moreover, the total number of random bits used is k|Q|, and thus by applying Theorem 34 for S = Θ(log(1/δ)) = Ω(log|Q|) and N = k|Q|, we can create a sequence of pseudorandom bits based on a seed of size O(S log N) and give them to the algorithm. This sequence can be computed in blocks of size S = Θ(log(1/δ)) using a circuit C′ of size O(log(k|Q|) log(1/δ)) = Õ(log(|Q|) log(1/δ)). We align blocks of bits with points x ∈ Q, so that the circuit C′ gives, for input x, the k bits needed in the first step of the above algorithm. This implies that the circuit C_{R′}, which takes the output of C′ and compares it with C_g, satisfies

|P(COND(C ∧ C_{R̃}) = x) − P(COND(C ∧ C_{R′}) = x)| ≤ δ/2    (3.3)

for all circuits C and elements x ∈ Q. By the triangle inequality, we get the desired error probability with respect to the circuit C_R.

The total size of the circuit C_{R′} is O(|C_g| + log|Q| log(1/δ)), which completes the proof. ∎

    3.3.1 Point in Set and Support Estimation

    Point in Set

The point-in-set function takes a set S ⊆ Q, given as a circuit C, and returns one point x ∈ S, or ⊥ if there is no such point among the input points, i.e. X ∩ S = ∅. The notation that we use for this function is EP(·), and it takes as input the description C of S. The implementation of this function in the conditional sampling model is trivial: since point-in-set may return any point in the described set S, a random point suffices. Therefore we just call the oracle COND(C) and return the result as the answer to EP(C).

We can test whether there is a unique point in a set by setting x* = EP(C) and querying EP(C ∧ 1_{x ≠ x*}). Similarly, if the set has k points, we can get all points in the set in time O(|C|k + k²) by querying k times, each time excluding the points that have already been found.
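The enumeration by repeated EP queries with exclusions can be sketched as follows (Python; the linear scan simulates the oracle, and the interface is hypothetical):

```python
import random

def enumerate_set(points, circuit, limit, rng=random.Random(8)):
    """Sketch of repeated EP queries: to list all (at most `limit`) input
    points in the set described by `circuit`, query COND with the original
    circuit AND'ed with the exclusion of the points found so far."""
    found = []
    while len(found) < limit:
        # COND(C ∧ "x not already found"), simulated by a linear scan.
        candidates = [x for x in points if circuit(x) and x not in found]
        if not candidates:
            break  # the oracle answered ⊥: the set is exhausted
        found.append(rng.choice(candidates))
    return found
```

With limit = 2 this is exactly the uniqueness test: the set is a singleton if and only if one point is returned.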

    Support Estimation

The support estimation function takes as input a description C of a set S ⊆ Q and outputs an estimate of the size of the set S_X = S ∩ X. We call this function SE(C).

The first step is to define a random subset R ⊆ Q by independently adding every element x ∈ Q with probability 1/a, for some integer parameter a that corresponds to a guess of the support size. Let C_R be the description of R. We will later use Corollary 35 to show that an approximate version of C_R can be efficiently constructed. We then use the point-in-set primitive and query EP(C ∧ C_R). This tests whether S_X ∩ R ≠ ∅; writing S_X = {s₁, …, s_k}, the complementary event happens with probability

P[S_X ∩ R = ∅] = P[(s₁ ∉ R) ∧ (s₂ ∉ R) ∧ ⋯ ∧ (s_k ∉ R)] = (1 − 1/a)^k,

where |S_X| = k.

Using this query, we can distinguish whether |S_X| ≤ (1 − ε)a or |S_X| ≥ (1 + ε)a. The emptiness probabilities in the two cases are P₁ ≥ (1 − 1/a)^{(1−ε)a} and P₂ ≤ (1 − 1/a)^{(1+ε)a} respectively. The gap between the two cases is

P₁ − P₂ = P₁ (1 − P₂/P₁) ≥ (1 − 1/a)^{(1−ε)a} (1 − (1 − 1/a)^{2εa}) ≥ (1/4)(1 − e^{−2ε}) = Ω(ε),

where for the second-to-last inequality we assumed a ≥ 2.

We can therefore conclude that we can distinguish with probability 1 − δ between |S_X| ≤ (1 − ε)a and |S_X| ≥ (1 + ε)a using O(log(1/δ)/ε²) queries of the form EP(C ∧ C_R). Binary searching over the possible values of a, we can compute a (1 ± ε)-approximation of the support size by repeating this O(log n) times, as there are n possible values for a. A more efficient estimator, since we care about multiplicative approximation, only considers values of a of the form (1 + ε)^i. There are log_{1+ε} n = O((1/ε) log n) such values, so a binary search over them takes O(log(1/ε) + log log n) iterations. Thus, overall, the total number of queries is O(log(1/δ) log log n/ε²).

To efficiently implement each query, we produce a circuit C_{R′} using Corollary 35 with error parameter δ′ and the constant function g(x) = 1/a. The only change is that at every comparison the probabilities P₁ and P₂ are accurate to within δ′. Choosing δ′ = Θ(ε) implies that |P₁ − P₂| is still Ω(ε), and thus the same analysis goes through. (The case a = 1 can be trivially handled by listing a few points from S_X.) The circuit C ∧ C_{R′} has size O(|C| + log²(|Q|) + log(|Q|) log(1/ε)), which implies that the total runtime for the O(log(1/δ) log log n/ε²) queries is Õ((|C| + log²(|Q|)) log(1/δ)/ε²), as n = O(|Q|).

    Using our conditional sampling oracle, we are able to obtain the following lemma:

Lemma 36. There exists a procedure SE(C) that takes as input the description C of a set S and computes an (ε, δ)-multiplicative approximation of the size of S ∩ X using O(log log n log(1/δ)/ε²) conditional samples, in time Õ((|C| + log²|Q|) log(1/δ)/ε²).
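The core of the SE primitive can be sketched as follows. This is our own simplified toy version, assuming an emptiness-test oracle: it scans the geometric guesses a = (1+ε)^i linearly where the text does a binary search, and it locates the support size by the point where the emptiness probability crosses e^{−1}.

```python
import math
import random

def estimate_support_size(empty_test, n, eps=0.25, trials=300):
    """Sketch of SE.  empty_test(a) simulates one query EP(C ∧ C_R), i.e.
    reports whether S_X ∩ R = ∅ for a fresh random R containing each
    domain element independently with probability 1/a.  Since
    P[S_X ∩ R = ∅] = (1 - 1/a)^{|S_X|} ≈ exp(-|S_X|/a), the guess a at
    which this probability crosses e^{-1} is roughly |S_X|."""
    i = 1
    while True:
        a = (1 + eps) ** i
        p_empty = sum(empty_test(a) for _ in range(trials)) / trials
        if p_empty >= math.exp(-1) or a > n:
            return a
        i += 1

def make_empty_test(k, rng=random.Random(3)):
    """Simulated emptiness oracle for a hidden support of size k."""
    return lambda a: all(rng.random() >= 1.0 / a for _ in range(k))
```

For a hidden support of size 100, `estimate_support_size(make_empty_test(100), 10**4)` returns a value within a small constant factor of 100; driving the factor down to 1 ± ε only needs more trials per guess, as in the lemma.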

    Distinct Values

One function that is highly related to support estimation is the distinct values function, which we denote by DV. The input of this function is a description C of a set S, together with a function f : Q → [M] described by a circuit C_f. The output of DV is the total number of distinct values taken by f on the subset S_X = S ∩ X, i.e.

DV(C, C_f) = |{f(x) | x ∈ S_X}|.

To implement the distinct values function we perform support estimation on the range [M] of the function f. This is done as before, by computing random sets R ⊆ [M] and performing the queries EP(C ∧ (C_R ∘ C_f)), where C_R is the circuit description of the set R, and the circuit C_R ∘ C_f takes value 1 on input x if f(x) ∈ R and 0 otherwise.

Lemma 37. There exists a procedure DV(C, C_f) that takes as input the description C of a set S and a function f : Q → [M] given as a circuit C_f and computes an (ε, δ)-multiplicative approximation to the number of distinct values of f on the set S ∩ X in time Õ((|C| + |C_f| + log² M) log(1/δ)/ε²). The number of conditional samples used is O(log log n log(1/δ)/ε²).

3.3.2 Point of Maximum Weight

The point of maximum weight function takes as input a description C of a set S, together with a function f : Q → [M] given by a circuit C_f. Let S_X = S ∩ X. The output of the function is the value max_{x∈S_X} f(x). We call this function MAX(C, C_f). Sometimes we are also interested in finding a point where this maximum is achieved, i.e. argmax_{x∈S_X} f(x), which we call ARGMAX(C, C_f).

This is simple to implement by binary search over the value MAX(C, C_f). At every step, we make a guess m for the answer and test whether there exists a point in the set S_X ∩ {x : f(x) > m}. This requires log M queries, and the runtime is O((|C| + |C_f|) log M).

Lemma 38. There exists a procedure MAX(C, C_f) that takes as input the description C of a set S and a function f : Q → [M] given as a circuit C_f and computes the value max_{x∈S_X} f(x) using O(log M) conditional samples, in time O((|C| + |C_f|) log M).

An alternative algorithm solves this task using O(log n log(1/δ)) queries with failure probability δ. The algorithm starts with the lowest possible value m for the answer, i.e. m = 1. At every step, it asks the COND oracle for a random point with f(x) > m. If such a point x* exists, the algorithm updates the estimate by setting m = f(x*). Otherwise, if no such point exists, the current value of m is optimal. It is easy to see that at every step, with probability 1/2, half of the surviving points are discarded. Repeating O(log(log n/δ)) times, the surviving points are halved with probability 1 − δ/log n. Thus after O(log n log(log n/δ)) steps, the points have been halved log n times and the maximum is identified with probability 1 − δ. Hence the total number of queries is Õ(log n log(1/δ)), and we obtain the following lemma.

Lemma 39. There exists a procedure MAX(C, C_f) that takes as input the description C of a set S and a function f : Q → [M] given as a circuit C_f and computes the value max_{x∈S_X} f(x) using Õ(log n log(1/δ)) conditional samples, in time Õ((|C| + |C_f|) log n log(1/δ)), with probability of error δ.
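The threshold-raising algorithm behind Lemma 39 can be sketched directly (Python, our own names; the list comprehension stands in for the COND oracle call with circuit "f(x) > m"):

```python
import random

def cond_max(points, f, rng=random.Random(4)):
    """Sketch of the second MAX algorithm: repeatedly ask the COND oracle
    for a random point with f(x) > m and raise the threshold m to f(x*).
    In expectation each round discards half of the surviving points, so
    O(log n) oracle calls suffice with high probability."""
    m = None
    rounds = 0
    while True:
        above = [x for x in points if m is None or f(x) > m]  # oracle simulation
        if not above:
            return m, rounds  # no point beats m, so m is the maximum value
        m = f(rng.choice(above))
        rounds += 1
```

On n points the expected number of rounds is the harmonic number H_n = O(log n), matching the query bound of the lemma up to the amplification factor.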

3.3.3 Sum of Weights of Points

The sum of weights of points function takes as input a description C of a set S, together with a function f : Q → [M]. The output of the function is a (1 ± ε)-approximation of the sum of f(x) over all x in S_X = S ∩ X, i.e. Σ_{x∈S_X} f(x). We call this function SUM(C, C_f).

To implement this function in the conditional sampling model, we first compute MAX = MAX(C, C_f) (Lemma 39). We then create k = log_{1+ε}(n/ε) = O(log n/ε) sets S_i = {x ∈ S : f(x) ∈ ((1 + ε)^{−i}, (1 + ε)^{−i+1}] · MAX} for i ∈ [k], grouping together points whose values are close. Let C_{S_i} denote the circuit description of each set S_i. The circuit C_{S_i} can be implemented using the circuit C_f and an implementation of a comparison gate.

We can get an estimate of the overall sum as

SUM(C, C_f) = Σ_{i=1}^{k} SE(C_{S_i}) · (1 + ε)^{−i} · MAX.

To see why this is an accurate estimate, we rewrite the summation in the following form:

Σ_{x∈S_X} f(x) = Σ_{x∈S_X : f(x) ≤ (1+ε)^{−k} MAX} f(x) + Σ_{i=1}^{k} Σ_{x∈S_i∩X} f(x).    (3.4)

To bound the error of the second term of (3.4), notice that for every i ∈ [k] and x ∈ S_i, we have f(x) ∈ ((1 + ε)^{−i}, (1 + ε)^{−i+1}] · MAX. Thus, the value |S_i ∩ X| · (1 + ε)^{−i} · MAX is a (1 + ε)-approximation to the sum Σ_{x∈S_i∩X} f(x). Since the primitive SE(C_{S_i}) returns a (1 + ε)-approximation to |S_i ∩ X|, we get that the second term of (3.4) is approximated by SUM(C, C_f) multiplicatively within (1 + ε)² ≤ 1 + 3ε.

The first term introduces an additive error of at most n · MAX · (1 + ε)^{−k} = ε · MAX ≤ ε · SUM(C, C_f), which implies that SUM(C, C_f) gives a (1 ± 4ε)-multiplicative approximation to the sum of weights. Rescaling ε by a constant, we get the desired guarantee. Thus, we can get the estimate using one query to the MAX primitive and k = O(log n/ε) queries to SE. For the process to succeed with probability 1 − δ, we require all k of the SE queries to succeed with probability 1 − δ′, where δ′ = δ/k. Plugging in the corresponding guarantees of Lemmas 36 and 39, we obtain the following:

Lemma 40. There exists a procedure SUM(C, C_f) that takes as input the description C of a set S and a function f : Q → [M] given by a circuit C_f and computes an (ε, δ)-multiplicative approximation of the value Σ_{x∈S_X} f(x) using Õ(log n log(1/δ)/ε³) conditional samples, in time Õ((|C| + |C_f| + log²|Q|) log n log(1/δ)/ε³).
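The bucketing argument above can be sketched as follows. For clarity this toy version of ours counts each bucket exactly where the real algorithm would call SE, so it isolates just the discretization error of rounding each weight down to its bucket's lower endpoint:

```python
import math

def cond_sum(values, max_val, eps=0.1):
    """Sketch of SUM: group weights into k = O(log n / eps) geometric
    buckets ((1+eps)^{-i}, (1+eps)^{-i+1}] * max_val, estimate each
    bucket's size (here: count exactly; the real algorithm calls SE),
    and add size * (1+eps)^{-i} * max_val.  Weights below
    (1+eps)^{-k} * max_val contribute at most eps * max_val in total
    and are dropped."""
    n = len(values)
    k = max(1, math.ceil(math.log(n / eps, 1 + eps)))
    total = 0.0
    for i in range(1, k + 1):
        lo = max_val * (1 + eps) ** (-i)
        hi = max_val * (1 + eps) ** (-i + 1)
        bucket_size = sum(1 for v in values if lo < v <= hi)
        total += bucket_size * lo  # each value in the bucket lies in (lo, hi]
    return total
```

The returned value underestimates the true sum by at most a (1 + eps) factor plus the dropped tail, mirroring the (1 ± 4ε) analysis in the text.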

    3.3.4 Weighted Sampling

The weighted sampling function gets as input a description C of a set S, together with a function f : Q → [M] given as a circuit C_f. The output of the function is a point x in the set S_X = S ∩ X chosen with probability proportional to the value f(x). Therefore, we are interested in creating an oracle WCOND(C, C_f) that outputs each element x ∈ S_X with probability f(x)/Σ_{y∈S_X} f(y).

To implement weighted sampling in the conditional sampling model, we use a similar idea as in support estimation. First we compute SUM = SUM(C, C_f), and then we define a random set H that independently contains every element x with probability

P[x ∈ H] = f(x)/(2 · SUM).    (3.5)

Let C_H be the description of H. We will later use Corollary 35 to build a pseudorandom set H′ with a small circuit description C_{H′} that approximately achieves the guarantees of C_H.

Based on the random set H, we describe Algorithm 1, which performs weighted sampling according to the function f.

We argue the correctness of this algorithm. Given a purely random H, we first show that at every iteration, the probability of selecting each point x ∈ S_X is proportional to its weight. This implies that the same is true for the final distribution, as we perform rejection sampling on ⊥ outcomes.

Algorithm 1 Sampling elements according to their weight.
1: selected ← ⊥
2: while selected = ⊥ and #iterations ≤ k do
3:   Construct the random set H and C_H as described by equation (3.5)
4:   Check if there exists a unique point x ∈ S_X in the set H
5:   if such a unique point x exists then
6:     With probability 1 − f(x)/(2 · SUM), set selected ← x
7: return selected

The probability that in one iteration the algorithm returns the point x ∈ S_X is the probability that x has been chosen in H, that |H ∩ S_X| = 1 (i.e. x is the unique point of the input set X that lies in S and in H), and that x was not filtered out in line 6. For every x ∈ S_X, this probability is equal to

P[x ∈ H] · ∏_{y∈S_X, y≠x} P[y ∉ H] · P[keep x]
  = (f(x)/(2 · SUM)) · ∏_{y∈S_X, y≠x} (1 − f(y)/(2 · SUM)) · (1 − f(x)/(2 · SUM))
  = (f(x)/(2 · SUM)) · ∏_{y∈S_X} (1 − f(y)/(2 · SUM)),

and it is easy to see that this probability is proportional to f(x), as all other factors do not depend on x.

We now bound the probability of selecting a point in one iteration. This is equal to

Σ_{x∈S_X} (f(x)/(2 · SUM)) · ∏_{y∈S_X} (1 − f(y)/(2 · SUM)) ≥ (Σ_{x∈S_X} f(x)/(2 · SUM)) · 4^{−Σ_{y∈S_X} f(y)/(2 · SUM)},

which is at least a constant, close to 1/4, for a small enough approximation parameter ε > 0 chosen in our estimate SUM of the total sum Σ_{x∈S_X} f(x). Thus at every iteration there is a constant probability of outputting a point. Repeating k = Θ(log(1/δ)) times, the algorithm fails to output any point with probability at most δ/2.

Summarizing, if we assume a purely random set H, the probability that the above procedure fails in O(log(1/δ)) iterations is at most δ/2, plus the probability that the computation of the sum fails, which we can also make at most δ/2, for a total failure probability of δ. Since we only need a constant multiplicative approximation to the sum, by Lemma 40 the total number of queries needed for the failure probability to be at most δ/2 is Õ(log n log(1/δ)).

Since the random set H can have very large description complexity, we use Corollary 35 to generate a pseudorandom set H′. If we apply the corollary with error δ′, we get that the total variation distance between the output distribution of one iteration when using H′ and the distribution when using H is at most

Σ_{x∈S_X} |P(COND(C ∧ C_H) = x) − P(COND(C ∧ C_{H′}) = x)| ≤ n · δ′.    (3.6)

Since we make at most k = Θ(log(1/δ)) queries to the oracle COND, the total variation distance between the two output distributions is O(n δ′ log(1/δ)). Setting δ′ = O(ε/(n log(1/δ))), we get that this distance is at most ε. Computing the total runtime and number of samples, we obtain the following lemma.

Lemma 41. There exists a procedure WCOND(C, C_f) that takes as input the description C of a set S and a function f given by a circuit C_f and returns a point x ∈ S_X from a probability distribution that is at most ε-far in total variation distance from the probability distribution that selects each element x ∈ S_X with probability proportional to f(x). The procedure fails with probability at most δ, uses Õ(log n log(1/δ)) conditional samples, and takes time Õ((|C| + |C_f| + log²|Q|) log n log(1/δ) + log|Q| log(1/ε) log(1/δ)).
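Algorithm 1 can be sketched as rejection sampling (Python, our own names; the true SUM would come from Lemma 40, while here we compute the total exactly for simplicity):

```python
import random

def weighted_cond_sample(points, f, rng=random.Random(5)):
    """Sketch of Algorithm 1 / WCOND: put each point x into a random set H
    independently with probability f(x)/(2*SUM); if exactly one input point
    lands in H, keep it with probability 1 - f(x)/(2*SUM).  A successful
    iteration then outputs x with probability proportional to f(x), and
    failed iterations (the ⊥ outcome) are simply retried."""
    total = sum(f(x) for x in points)  # the real algorithm estimates this via SUM
    while True:
        in_h = [x for x in points if rng.random() < f(x) / (2 * total)]
        if len(in_h) == 1:
            x = in_h[0]
            if rng.random() < 1 - f(x) / (2 * total):
                return x
```

Since each iteration succeeds with constant probability, the expected number of retries is O(1), matching the rejection-sampling analysis above.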

3.3.5 Distinct Elements Sampling (ℓ₀-Sampling)

The distinct elements sampling function gets as input a description C of a set S, together with a function f : Q → [M] described by a circuit C_f. It outputs samples from a distribution on the set S_X = S ∩ X such that the distribution of values f(x) is uniform over the image f(S_X). We thus want that for every y ∈ f(S_X),

P[x ∈ f^{−1}(y)] = 1/|f(S_X)|.

We first explain the implementation of the algorithm assuming access to true randomness. Assume therefore that we have a circuit C_h that describes a purely random hash function h : [M] → [M]. Then argmax_{x∈S_X} h(f(x)) produces an element whose value is uniformly random over f(S_X), as long as the maximum hash value is unique. This means that if we call the procedure ARGMAX to find a point x* = ARGMAX(C, C_h ∘ C_f), and check that no point x ∈ S_X exists with f(x) ≠ f(x*) and h(f(x)) = h(f(x*)), then the result is a point distributed according to the correct distribution. Repeating O(log(1/δ)) times guarantees that we get a valid point with probability at least 1 − δ.

Therefore the only question is how to replace h with a pseudorandom h′. We can again apply Nisan's pseudorandom generator. Consider an algorithm that, for every value y ∈ [M] in order, draws a sample s uniformly at random from [M] and checks whether y ∈ f(S_X) and whether s is the largest value seen so far. This algorithm computes argmax_{y∈f(S_X)} h(y) while only keeping track of the largest sample s and the corresponding value y. It uses Θ(log M) bits of memory and O(M log M) random bits. Therefore we can apply Nisan's theorem (Theorem 34) for space Θ(log(1/ε)) with ε ≤ M^{−1}, and we can replace h with an h′ that uses only O(log M log(1/ε)) random bits and whose circuit representation has size O(log M log(1/ε)).

This means that we can use Lemma 39 and Theorem 34 to get the following lemma about ℓ₀-sampling.

Lemma 42. There exists a procedure DES(C, C_f) that takes as input the description C of a set S and a function f : Q → [M] given by a circuit C_f and returns a point x ∈ S_X from a probability distribution that is at most ε-far in total variation distance (for ε ≤ M^{−1}) from the probability distribution that assigns probability 1/|f(S_X)| to every set f^{−1}(y) for y ∈ f(S_X). This procedure fails with probability at most δ, uses Õ(log n log(1/δ)) conditional samples, and takes time Õ((|C| + |C_f| + log M log(1/ε)) log n log(1/δ)).
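The max-hash idea behind DES can be sketched as follows (Python; a truly random table stands in for the hash circuit C_h, and the hash range M³ is our choice to keep ties rare):

```python
import random

def l0_sample(points, f, M, rng=random.Random(6)):
    """Sketch of DES (ℓ₀-sampling over values): draw a random hash
    h : [M] -> [M^3], take a point whose value's hash is maximal, and
    reject (retry with a fresh h) if two distinct values tie for the
    maximum.  Each value in f(S_X) is then equally likely to own the
    maximum, so the returned point's value is uniform over the distinct
    values, which is exactly the guarantee of Lemma 42."""
    while True:
        h = {v: rng.randrange(M ** 3) for v in range(M)}  # fresh random hash
        best = max(points, key=lambda x: h[f(x)])
        ties = {f(x) for x in points if h[f(x)] == h[f(best)]}
        if len(ties) == 1:  # the maximising value is unique: accept
            return best
```

Note that only the *value* f(x) of the output is guaranteed uniform; which representative within the value class is returned is left unspecified, just as in the lemma.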

3.4 k-means clustering

In this section we describe how known algorithms for k-means clustering can be transformed into sublinear-time algorithms when we have access to conditional samples. The basic tool of these algorithms, D²-sampling, was introduced by Arthur and Vassilvitskii [13].

D²-sampling: This technique provides some very simple algorithms that easily get a constant-factor approximation to the optimal k-means clustering, as in the work of Aggarwal et al. [6]. Also, if we allow exponential running-time dependence on k, then we can get a PTAS, as in the work of Jaiswal et al. [91]. The drawback of this PTAS is that it works only for points in d-dimensional Euclidean space.

When working in arbitrary metric spaces, inspired by Aggarwal et al. [6], we use weighted sampling to get a constant-factor approximation to the k-means problem.

We now describe how all these steps can be implemented in sublinear time using the primitives of the previous section. The steps of the algorithm are:

1. Pick a set P of O(k) points according to D²-sampling.

For O(k) steps, let P_i denote the set of samples chosen up to the i-th step. We pick the (i + 1)-th point according to the following distribution:

P[the (i + 1)-th point is x] = d²(x, P_i) / Σ_{y∈X} d²(y, P_i).

Implementation: To implement this step we simply use the primitive WCOND(C, C_f), where C is the constant true circuit and f(x) = d²(x, P_i) = min_{p∈P_i} d²(x, p). The circuit implementing the function d²(·, ·) has size O(log|Q|). Since |P_i| ≤ O(k), we can implement the minimum using a tournament with only O(k) comparisons, each of size O(log|Q|). This means that the size of the circuit of f is bounded by |C_f| ≤ O(k log|Q|). Therefore we can use Lemma 41 and get that we need Õ(k log n log(1/δ)) queries and running time Õ((k log|Q| + log²|Q|) log n log(1/δ) + log|Q| log(1/ε₁) log(1/δ)) to get the O(k) needed samples, each drawn from a distribution that is ε₁-close in total variation distance to the correct distribution and has error probability δ.

2. Weight the points of P according to the number of points that each one represents.

For any point p ∈ P we set

w_p = |{x ∈ Ω | ∀p′(≠ p) ∈ P : d(x, p) ≤ d(x, p′)}|

Implementation: To implement this step, given the previous one, we just iterate over all the points in P, and for each such point p we compute the weight w_p using the procedure SUM as described in Lemma 40. Similarly to the previous step, C is the constant 1 circuit and f_p(x) is equal to 1 if the closest point to x in P is p, and zero otherwise. To describe this function we need, as before, a circuit of size O(k log|Ω|). Therefore for this step we need O(log n log(1/δ)/ε₂²) conditional samples and running time O((k log|Ω| + log²|Ω|) log n log(1/δ)/ε₂²) in order to get an (ε₂, δ)-multiplicative approximation of every w_p.

3. Solve the weighted k-means problem with the weighted points of P.

This can be done using an arbitrary constant factor approximation algorithm for k-means, since the size of P is O(k); therefore the running time will be just poly(k), which is already sublinear in n.
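In non-sublinear form, steps 1 and 2 can be sketched as follows (a simplified illustration over one-dimensional points; the function names are ours, and plain enumeration over all points stands in for the WCOND and SUM primitives):

```python
import random

def d2_sampling(points, k, rng):
    """Step 1: pick k centers; each new center is drawn with probability
    proportional to its squared distance to the nearest center so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if w > 0 and acc >= r:   # skip zero-weight points (already centers)
                centers.append(p)
                break
    return centers

def weights(points, centers):
    """Step 2: weight of a center = number of points whose closest center it is."""
    w = {c: 0 for c in centers}
    for p in points:
        w[min(centers, key=lambda c: (p - c) ** 2)] += 1
    return w
```

With conditional samples, the enumerations over all points in both steps are replaced by the WCOND and SUM calls analyzed above, which is what makes the query count independent of n.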

To prove that this algorithm achieves a constant factor approximation we use Theorems 1 and 4 of [6]. From Theorem 1 of [6] and the fact that we sample from a distribution that is ε₁-close in total variation distance to the correct one, we conclude that the set P that we chose satisfies Theorem 1 of [6] with probability of error at most ε₁ + O(k)δ. It is then easy to see from Theorem 4 of [6] that when we have a constant factor approximation of the weights, we lose only a constant factor in the approximation ratio. Therefore we can choose ε₂ to be a constant. Finally, for the total probability of error to be constant, we have to pick ε₁ to be a constant. Combining all these with the description of the steps above, we get the following result.

Theorem 43. There exists an algorithm that computes an O(1)-approximation to the k-means clustering and uses only Õ(k² log n log(k/δ)) conditional queries and has running time O(poly(k) log²|Ω| log n log(1/δ)).

Remark 1. The above algorithm can be extended to an arbitrary metric space where we are given a circuit C_d that describes the distance metric function. In this case the running time will also depend on |C_d|.

Remark 2. In the case that the points belong to d-dimensional Euclidean space, we can also use the Find-k-means(X) algorithm of [91] to get a (1 + ε)-approximation instead of a constant one. This algorithm iterates over a number of subsets of the input of a specific size that have been selected using D²-sampling. Then, among all these different solutions, it selects the one with the minimum cost. We can implement this algorithm using our WCOND and SUM primitives to get a sublinear (ε, δ)-multiplicative approximation algorithm that uses Õ(2^{Õ(k²/ε)} · log n · log(1/δ)) conditional samples and has running time Õ(2^{Õ(k²/ε)} · log|Ω| log n log(1/δ)).

    3.5 Euclidean Minimum Spanning Tree

In this section we discuss how to use the primitives described earlier in order to estimate the weight of the minimum spanning tree of n points in Euclidean space. More specifically, suppose that we have a d-dimensional discrete Euclidean space Ω = {1, ..., Δ}^d and a set of n points X = {x₁, ..., x_n}, where x_i ∈ Ω. We assume that Δ = poly(n), which is a reasonable assumption to make when bounded-precision arithmetic is used. This means that each coordinate of our points can be specified using O(log n) bits.²

²This is consistent with the standard word-RAM model assumption that the size of the coordinates is O(log n). Otherwise, even representing one point in the grid would need more than O(log n) space.

In what follows, we use the following formula, which relates the weight of an MST to the number of connected components of certain graphs. Let W denote the maximum distance between any pair of points in Ω. Moreover, let G^{(i)} be the graph whose vertices correspond to points in X, with an edge between two vertices if and only if the distance of the corresponding points is at most (1+ε)^i. By c_i we denote the number of connected components of the graph G^{(i)}. In [42], it is shown that the following quantity leads to a (1+ε)-multiplicative approximation of the weight of the minimum spanning tree:

n − W + ε · Σ_{i=0}^{log_{1+ε} W − 1} (1+ε)^i · c_i    (3.7)

The quantity would be equal to the weight of the minimum spanning tree if all pairwise distances between points were (1+ε)^i for some i ∈ ℕ.
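As a sanity check, formula (3.7) can be evaluated exactly on a small instance (a plain quadratic-time computation for illustration over one-dimensional points; the point of the rest of this section is to estimate the c_i terms sublinearly):

```python
import math
from itertools import combinations

def mst_weight(points):
    """Exact MST weight of 1-D points via Prim's algorithm, for reference."""
    dist = {i: abs(points[i] - points[0]) for i in range(1, len(points))}
    total = 0.0
    while dist:
        j = min(dist, key=dist.get)
        total += dist.pop(j)
        for i in dist:
            dist[i] = min(dist[i], abs(points[i] - points[j]))
    return total

def components(points, thr):
    """c_i: number of connected components of G^(i) (edges of length <= thr),
    computed with a simple union-find."""
    parent = list(range(len(points)))
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for a, b in combinations(range(len(points)), 2):
        if abs(points[a] - points[b]) <= thr:
            parent[find(a)] = find(b)
    return len({find(a) for a in range(len(points))})

def mst_estimate(points, eps):
    """Formula (3.7): n - W + eps * sum_i (1+eps)^i * c_i."""
    n = len(points)
    W = max(abs(a - b) for a, b in combinations(points, 2))
    L = math.ceil(math.log(W) / math.log(1 + eps))
    return n - W + eps * sum(
        (1 + eps) ** i * components(points, (1 + eps) ** i) for i in range(L))
```

On the points {0, 1, 2, 10} with ε = 0.1, the exact MST weight is 10 and the formula evaluates to roughly 10.97, within the promised (1+ε)-factor (up to the rounding of log_{1+ε} W).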

In order to estimate the weight of the MST, we need to estimate the number of connected components c_i for each graph G^{(i)}. As shown in [64], for every i we can equivalently perform the estimation after rounding the coordinates of all points to an arbitrary grid of cell side ε(1+ε)^i/√d. This introduces a multiplicative error of at most 1 + O(ε), which we can ignore by scaling ε by a constant.

We thus assume that every point is at the center of a grid cell when performing our estimation. We perform a sampling process which samples uniformly from the occupied grid cells (regardless of the number of points in each of them) and estimates the number of cells covered by the connected component of G^{(i)} that the sampled cell belongs to. Comparing that estimate to an estimate for the total number of occupied grid cells, we obtain an estimate for the total number of connected components. In more detail, if the sampled component covers a p fraction of the cells, the guess for the number of components is 1/p. For that estimator to be accurate, we need to make sure that the total expected number of occupied grid cells is comparable to the total number of components, without blowing up exponentially with the dimension d of the Euclidean space. We achieve this by choosing a uniformly random shifted

grid. This random shift helps us avoid corner cases where a very small component spans a very large number of cells even when all its contained points are very close together. With a random shift, such cases have negligible probability.

    We will first use the following lemma to get an upper bound on the number of occupied cells which holds with high probability:

Lemma 44. Let C ⊂ ℝ^d be a 1-D curve of length L, let G ⊂ ℝ^d be a grid of side length R, and let v⃗ be a random vector distributed uniformly over [0, R]^d. Then the expected number of grid cells of G + v⃗ that contain some point of the curve is vol([0, R]^d + C)/R^d, where + denotes the Minkowski sum of the two sets.

Proof. Consider the grid G = {Rz⃗ : z⃗ ∈ ℤ^d} ⊂ ℝ^d shifted by a random vector v⃗ ∈ [0, R]^d to obtain the grid G_v = v⃗ + G. We associate every point z⃗ ∈ G_v with the cell z⃗ + [0, R]^d. Observe that the cell corresponding to a grid point z⃗ intersects the curve C if (z⃗ + [0, R]^d) ∩ C ≠ ∅, or equivalently if z⃗ ∈ C + [−R, 0]^d. The expected number of occupied grid cells is thus equal to the expected number of grid points of G_v which lie in the Minkowski sum of C and [−R, 0]^d. Note that each of the original grid points z⃗ ∈ G can move inside a d-dimensional hypercube of side length R, and all those hypercubes are pairwise disjoint and span the whole d-dimensional space.

Now let I_z be the indicator random variable for the event that z⃗ ∈ C + [−R, 0]^d. Clearly,

E[I_z] = Pr[z⃗ ∈ C + [−R, 0]^d] = vol((C + [−R, 0]^d) ∩ (z⃗ + [0, R]^d)) / R^d

So the expected number of points in C + [−R, 0]^d is:

E[#points] = Σ_{z⃗} E[I_z] = vol(C + [−R, 0]^d)/R^d = vol(C + [0, R]^d)/R^d. □
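In one dimension the lemma can be checked directly: for a segment of length L and cell side R, the expectation is vol([0,R] + C)/R = (R + L)/R. The following small deterministic check averages over an evenly spaced set of shifts, which converges to the uniform expectation (the helper name is ours):

```python
import math

def occupied_cells_1d(a, b, R, v):
    """Number of cells of the grid {v + kR : k in Z} intersecting [a, b]."""
    return math.floor((b - v) / R) - math.floor((a - v) / R) + 1

# Average over evenly spaced shifts v in (0, R) approximates E over v ~ U[0, R).
N = 1000
mean = sum(occupied_cells_1d(0.0, 2.5, 1.0, (i + 0.5) / N) for i in range(N)) / N
# Lemma 44 in d = 1 predicts (R + L)/R = (1 + 2.5)/1 = 3.5.
```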

The following lemma bounds the volume of the required Minkowski sum.

Lemma 45. Let C ⊂ ℝ^d be a 1-D curve of length L. The volume of the Minkowski sum C + [0, R]^d is at most R^d + √d · R^{d−1} · L.

Proof. We can think of the Minkowski sum as the set of points spanned by the d-dimensional hypercube [0, R]^d as it travels along the curve C. Now suppose that we move the hypercube for a very small distance dL along an arbitrary unit vector r⃗ = (a₁, ..., a_d) with positive coordinates (we assume this w.l.o.g. since all other cases are symmetric). Also, let e⃗₁, ..., e⃗_d be the standard basis vectors (i.e., e⃗_i = (0, ..., 0, 1, 0, ..., 0), where the "1" is at the i-th coordinate). Note that each of those vectors is orthogonal to a facet of the hypercube, and the total volume spanned by each facet during the movement is equal to the absolute value of the inner product r⃗ · e⃗_i scaled by dL, where e⃗_i is the standard basis vector orthogonal to that facet.

The volume spanned by this displacement is equal to the sum of the volumes spanned by each of the facets and is given by the following formula:

dV = dL · R^{d−1} · Σ_{i=1}^{d} |r⃗ · e⃗_i| = dL · R^{d−1} · Σ_{i=1}^{d} |a_i| = dL · R^{d−1} · ‖r⃗‖₁

So in the worst case the curve C is a straight line segment along the all-ones unit vector w⃗ = (1/√d)(1, 1, ..., 1), since this is the unit vector that has the maximum ℓ₁ norm. In this case, the total volume spanned during the movement along C is:

V_C = L · R^{d−1} · ‖w⃗‖₁ = √d · R^{d−1} · L

So, the volume of the Minkowski sum C + [0, R]^d is:

vol(C + [0, R]^d) = R^d + V_C = R^d + √d · R^{d−1} · L

and this is an upper bound for the general case. □

We can view the minimum spanning tree as a 1-D curve by considering its Euler tour. The length of this Euler tour is 2 · MST, since each edge is traversed exactly twice. For the same reason, each point in the Minkowski sum is "covered by" at least two points of the curve, so effectively the length of the curve can be divided by 2. Thus, the volume of the Minkowski sum T + [0, R]^d is at most R^d + √d · R^{d−1} · MST. Therefore, by Lemma 44, we get that:

E[#cells] = 1 + √d · MST / R

Using Markov's inequality, we get that

Pr[#cells > 2 · (1 + √d · MST / R)] < 1/2

Finally, we can use our support estimation primitive of Lemma 36 to estimate the number of occupied grid cells after a random shift, which enables us to amplify the success probability to 1 − δ by picking the random shift with the smallest number of cells after O(log(1/δ)) repetitions.

    We immediately get the following corollary:

Corollary 46. We can find a grid of side length R such that the number of grid cells that contain points is at most 2 · (1 + √d · MST / R), using Õ(log log n · log²(1/δ)) conditional samples, while the failure probability is at most δ. The total running time is O(log² n log²(1/δ)).

    3.5.1 Computing the size of small connected components

As we have said earlier, we will use (3.7) in order to estimate the weight of the MST. For every i, we estimate the number of connected components c_i in the graph G^{(i)}, assuming that the points are at the centers of a given grid with side length R = ε(1+ε)^i/√d and that the total number of occupied grid cells is at most 2 · (1 + √d · MST / R) = 2 · (1 + d · MST/(ε(1+ε)^i)). For that purpose, we will sample grid cells uniformly at random and estimate the size of the connected component that we hit during our sampling procedure.

In order to do this, we first sample a uniformly random grid cell using the Distinct Elements Sampling procedure of Lemma 42 and then perform a BFS-like search starting from that cell to count the number of cells in its connected component. More specifically, at every iteration we ask for a uniformly random cell that is adjacent in G^{(i)} to one of the cells we have already visited, using our conditional sampling oracle. If we visit more than t distinct cells (for some threshold t) during this search, we stop and declare the connected component "big". Otherwise, we stop when we have completely explored the connected component and output its size. Since there cannot be too many "big" connected components, ignoring them does not affect our final estimate too much. More specifically, we will set t = Θ((d/ε) · log_{1+ε} W) and note that there can be at most #cells/t "big" connected components. So, by ignoring them, we introduce an additive error of at most #cells/t.

3.5.2 Algorithm for estimating the number of connected components

Now, we can continue with the main part of the algorithm, which shows how to estimate the number of connected components of the graph G^{(i)}.

Let s₁, ..., s_k be the numbers of cells occupied by each of the k connected components of G^{(i)}, and let X be the random variable for the index of the connected component that our sample hits. Also, let S be the total number of occupied grid cells, and Ŝ our estimate for it using the SE primitive from Lemma 36.

Our estimator, which comes from Algorithm 2, has expectation

E[ĉ_i] = Σ_{j=1}^{k} Pr[X = j] · Ŝ/s̃_j

where s̃_j = s_j when s_j < t, and s̃_j = Ŝ otherwise (i.e., the estimate returned is 1).

Algorithm 2 Estimating ĉ_i
1: x₀ ← uniformly random occupied cell, obtained by ℓ₀-sampling.
2: U ← {x₀}
3: s ← 1
4: while s < t do
5:   Let C_U be the circuit such that C_U(x) = (∃p ∈ U : cell(x) is a neighbor of cell(p)) ∧ (∀p ∈ U : cell(x) ≠ cell(p))
6:   x ← COND(C_U)
7:   if x = ⊥ then
8:     return ĉ_i = Ŝ/s
9:   U ← U ∪ {x}
10:  s ← s + 1
11: return ĉ_i = 1
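A plain-Python simulation of Algorithm 2, with the ℓ₀-sampling and COND oracles replaced by direct access to the set of occupied cells (the names are ours): exploration that finishes within t cells returns Ŝ/s, and anything larger is declared big and returns 1.

```python
def estimate_component(cells, neighbors, start, S_hat, t):
    """One run of Algorithm 2 from a sampled cell `start`: graph search over
    occupied `cells`, capped at t cells.  `neighbors(c)` lists cells adjacent
    to c (the visiting order is irrelevant for counting)."""
    seen, frontier = {start}, [start]
    while frontier:
        if len(seen) >= t:
            return 1.0               # component declared "big"
        c = frontier.pop()
        for nb in neighbors(c):
            if nb in cells and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return S_hat / len(seen)         # S_hat / s_j

# Expectation check: averaged over a uniformly random starting cell, the
# estimator equals the number of components when every component is small.
cells = {(0, 0), (0, 1), (1, 1), (5, 5)}        # two components
def neighbors(c):
    x, y = c
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
S = len(cells)
avg = sum(estimate_component(cells, neighbors, c, S, t=10) for c in cells) / S
```

Here the three-cell component contributes Ŝ/3 from each of its cells and the singleton contributes Ŝ/1, so the average over cells is exactly 2, the true component count.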

Note that s̃_j always overestimates the true value s_j, and a lower bound for this expectation is the same sum restricted to the non-"big" connected components. This means that:

0 ≤ E[ĉ_i] − |{j : s_j < t}| · Ŝ/S ≤ |{j : s_j ≥ t}|

where we substituted Pr[X = j] = s_j/S, since every component is selected with probability proportional to the number of cells it contains.

From the SE primitive, we have that

Pr[|Ŝ − S| > ε · S] ≤ δ

So, by conditioning on the support estimation procedure succeeding (which happens with probability at least 1 − δ), we get that:

|{j : s_j < t}| · (1 − ε) ≤ E[ĉ_i] ≤ |{j : s_j < t}| · (1 + ε) + |{j : s_j ≥ t}|

Thus, E[ĉ_i] is an accurate approximation of c_i with probability at least 1 − δ. In particular, since |{j : s_j ≥ t}| ≤ S/t, we get:

(1 − ε)c_i − S/t ≤ E[ĉ_i] ≤ (1 + ε)c_i + S/t

with probability at least 1 − δ.

We are going to repeat the above estimation m times independently and keep the average which, as we will show, is well concentrated around its mean μ_i = E[ĉ_i]. To show this, we can use Hoeffding's inequality, since our trials are independent and the value of each individual estimate is trivially lower and upper bounded by 1 and Ŝ respectively.

Let μ̂_i denote the estimated average. From Hoeffding's inequality we get:

Pr[|μ̂_i − μ_i| > S/t] ≤ 2e^{−2m(S/t)²/Ŝ²} ≈ 2e^{−2m/t²}

If we set m = O(t² log(1/δ)), we get the above guarantee with probability at least 1 − δ. This means that with probability 1 − δ we get:

(1 − ε)c_i − 2S/t ≤ μ̂_i ≤ (1 + ε)c_i + 2S/t

This means that μ̂_i = (1 ± ε)c_i ± 4(1 + d · MST/(ε(1+ε)^i))/t, and applying this to equation (3.7), we get the following estimator for the weight of the MST:

M̂ST = n − W + ε · Σ_{i=0}^{log_{1+ε} W − 1} (1+ε)^i · μ̂_i = (1 ± ε) · MST ± (4ε/t) · Σ_{i=0}^{log_{1+ε} W − 1} (1+ε)^i · (1 + d · MST/(ε(1+ε)^i))

The last error term is bounded by O(d · log_{1+ε} W · MST / t), which for t = Θ((d/ε) · log_{1+ε} W) gives a (1+ε)-multiplicative approximation to MST.

The total runtime requires log_{1+ε} W iterations to estimate every c_i by μ̂_i. In every iteration a random shift is performed and the total number of occupied grid cells is estimated using the SE primitive. Moreover, O(t²) samples of occupied grid cells are required for the estimation, obtained using Distinct Elements Sampling (ℓ₀-sampling). Finally, for each such sample, a BFS procedure is performed for at most t iterations. The circuit complexity of the conditional sampling queries is negligible in most cases, as it is subsumed by the runtime of the corresponding algorithmic primitive used. Only the queries performed during the BFS have large circuit size, as the circuit needs to keep track of all grid cells that have been visited; the size in that case is bounded by O(t log|Ω|) = O(t · d · log n). The number of samples is bounded by Õ(log_{1+ε} W · t³) = Õ(d³ log⁴ n / ε⁷) and the total runtime is bounded by Õ(log_{1+ε} W · t⁴ · d log n) = Õ(d⁵ log⁶ n / ε⁹), if we require constant success probability. Repeating log(1/δ) times, we can amplify the total success probability to 1 − δ. The following theorem shows the dependence of the running time and query complexity on the parameters n, d, ε, δ:

Theorem 47. It is possible to compute an (ε, δ)-multiplicative approximation to the weight of the Euclidean minimum spanning tree using Õ(d³ log⁴ n / ε⁷) · log(1/δ) conditional queries in time Õ(d⁵ log⁶ n / ε⁹) · log(1/δ).

Chapter 4

    Certified computation

    4.1 Computation on unreliable datasets

Modern science and business involve using large amounts of data to perform various computational or learning tasks. The data required by a particular research group or enterprise usually contain errors and inaccuracies for the following reasons:

- The validity of the data changes dynamically. For example, data involving home locations of customers or employees are not constant over time. Hence a set of data collected in a particular time frame will likely not remain valid forever.

- The data might be provided by other entities or collected online from a source that offers no certification of their validity, for example data collected from crowdsourcing environments.

We call a set of data with the property that only a subset of them is valid an unreliable data set. The goal of this work is to develop theoretically grounded methods that lead to certified computation over such data sets. Towards this goal, we assume that we have the ability to verify the validity of a record in our data set. Usually this verification process is costly, and hence it does not make sense to verify all the records in our data set every time we want to compute a function on them. On the other hand, if the majority or the most important part of the data is invalid, then trying to find a valid subset could lead to essentially verifying the entire data set. In our work, we introduce the concept of learning with certification, in which we can distinguish between the following scenarios:

1. the value of the function computed on the unreliable data set is close to its value computed on the valid subset of our data set, or

2. there exists at least one invalid record that could dramatically change the value of the function that we want to compute.

In either case, the answer is obtained by verifying only a small number of records.

Computations in Crowdsourcing. Crowdsourcing [57] is a popular instantiation of an unreliable data set, where records are provided by a very large number of workers. These workers may need to put in significant effort to extract high-quality data, and without the right incentives they might choose not to do so, giving, as a result, very noisy and unreliable reports. Experimental evidence [95, 132, 133] suggests that there are a large number of examples where crowdsourcing fails in practice because of the unreliability of the data that it produces. An anecdotal failure of crowdsourcing is the example of Walmart's mechanism that made the famous rapper Pitbull travel to the remote island of Kodiak, Alaska; see e.g. [?]. In 2012, Walmart asked their customers to vote, through Facebook, for their favorite local store. The store with the most votes would host a promotional performance by Pitbull. Perhaps as a joke, a handful of people organized an #ExilePitbull campaign, inviting Facebook users to vote for the most remote Walmart store, at Kodiak. The campaign went viral, and Pitbull performed at Kodiak in July 2012. While the objective of Walmart was to learn the location that would maximize attendance at the concert, the resulting outcome was terribly off because the incentives of the workers were misaligned.

The topic of this chapter is motivated by these observations and aims, through the use of verification, to provide a generic approach that guarantees high-quality learning outcomes. Verification can be implemented either directly, in tasks such as peer grading, by having an expert regrade the assignment, or indirectly, e.g. in the Walmart example by verifying the locations of the voters. The main challenge in our framework is to minimize such verifications, since they can be very costly.

4.1.1 Our Model and Results

A data set is a set of records N. The set N may contain, apart from the valid subset of records T, a set of records N \ T that are invalid for the reasons described earlier. But how much does the presence of these invalid records affect the output of the computation? The answer to this question depends on the number but also on the importance of the invalid records, where the importance depends on the specific computational task that we want to run.

In order to assess whether the computed output is accurate, we can verify the validity of some of the records. Our goal is to verify as few records as possible and eventually be confident that the output of the computation is accurate. At this point we have to define a measure to evaluate the accuracy of an output. Ideally, an accurate output is the output that we would get if all the records were valid. Such a benchmark, however, is impossible to achieve, as the correct values of the invalid records are unobservable. We instead focus on a simpler benchmark. We want to decide whether, given an unreliable data set, the output of the computation based on N is close to the output of the computation based only on T. That is, if we could see which records are invalid and perform the computational task after discarding them, would the output of the computation be close to the current value?

Certification Schemes: A positive answer to the question above is called a certification of the computational task based on the unreliable data set N. A negative answer is a witness that at least one record in N is invalid. The first goal of this work is to provide certification schemes for general computational tasks that verify only a small number of records and can distinguish between these two cases.

As a toy example of these models, let us consider the simple function f(x⃗_N) = max_{i∈N} x_i, where we assume that each record is a real number x_i ∈ ℝ. For the certification task, we want to check whether f(x⃗_N) = f(x⃗_T) or not. This can easily be done by checking whether the record i* = arg max_{i∈N} x_i is valid.
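The max example can be written out directly; `is_valid` stands in for the costly verification oracle, and the correction loop anticipates the correction schemes discussed below:

```python
def certify_max(x, is_valid):
    """Certify f(x) = max(x) with a single verification: f over all records
    equals f over the valid records iff the arg-max record is valid."""
    i_star = max(range(len(x)), key=lambda i: x[i])
    return is_valid(i_star)

def correct_max(x, is_valid):
    """Correction: scan records in decreasing order until a valid one is
    found; that value is exactly the max over the valid subset."""
    for i in sorted(range(len(x)), key=lambda i: -x[i]):
        if is_valid(i):
            return x[i]
    return None  # every record was invalid
```

The number of verifications paid by `correct_max` is one per invalid record encountered, which is the natural accounting in the correction models introduced later in this section.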

Not all functions, though, have such efficient deterministic and exact certification schemes. For several functions, we can obtain randomized certification schemes that succeed with high probability and certify that the output is close up to a multiplicative factor. Moreover, for some functions it might not even be possible to certify them efficiently without verifying almost everything. One extreme example is the threshold function that is 1 if all records are valid and 0 otherwise, f = 1{N = T}, for which we cannot obtain any meaningful approximation without verifying Ω(n) records of N.

In Section 4.2, we provide efficient certification schemes for many different functions. Our results are the following:

- Sum function. We start by presenting a randomized scheme for certifying the sum of all records that uses only O(1/ε) verifications to certify correctness up to a multiplicative factor 1 + ε. This is a very useful primitive that can be used in several different tasks. For example, for computing the average, we can compute and certify the total sum of records and divide by the total number of records, which we can also certify as another summation task. Another example is the max-of-sums function, where, as in the Walmart example we presented earlier, agents vote on different categories and the goal is to compute the category that has the maximum number (or sum) of valid records. This can easily be certified by computing the max of all sums of records and certifying that this sum is approximately correct.

- Functions given by linear programs. We then study the set of functions expressible as LPs, where the input data corresponds to either variables or constraints of the LP. We show that for functions expressible as packing or covering LPs, O(1/ε) verifications suffice to certify correctness up to a multiplicative factor 1 + ε, while for more general LPs we provide a deterministic scheme whose complexity depends on the dimensionality (number of variables or constraints).

- Instance-optimal schemes. To study more general functions, we devise a

linear program that characterizes (up to a constant factor), for any given instance, the minimum number of verifications needed for approximate certification. We show that even though optimal certification schemes may be arbitrarily complex, there are simple schemes that verify records independently and are almost optimal.

- Main theorem for certification. Our most general result is that we can provide explicit solutions to the instance-optimal linear program for a large class of objectives that satisfy a w-Lipschitz property. We illustrate the flexibility of this condition by showing that even very complex functions that correspond to NP-hard problems satisfy the w-Lipschitz property. Specifically, using our general theorem, we prove this for the TSP and the Steiner tree problems, where we show that the certification complexity is only O(1/ε). These capture settings where agents report their locations in a metric space and the goal is to design an optimal tour that visits all of them (TSP) or to connect them in a network of minimum total cost (Steiner tree).
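To give a flavor of the sum primitive above, here is one natural way such a certifier can work (our sketch, not necessarily the exact scheme of Section 4.2): verify m = ⌈ln(1/δ)/ε⌉ records drawn with probability proportional to their value, so that if an ε fraction of the sum comes from invalid records, at least one invalid record is drawn with probability at least 1 − δ.

```python
import math, random

def certify_sum(x, is_valid, eps, delta, rng):
    """Hedged sketch of a sum certifier: draw m = ceil(ln(1/delta)/eps)
    records with probability proportional to their (nonnegative) value.
    If all drawn records are valid, invalid records carry less than an eps
    fraction of the total sum with probability at least 1 - delta."""
    m = math.ceil(math.log(1 / delta) / eps)
    total = sum(x)
    for _ in range(m):
        r, acc = rng.random() * total, 0.0
        for i, v in enumerate(x):       # inverse-CDF draw, proportional to v
            acc += v
            if acc >= r:
                if not is_valid(i):
                    return False        # witness: an invalid record
                break
    return True
```

If an invalid mass of at least an ε fraction is present, each proportional draw misses it with probability at most 1 − ε, so m independent draws miss it with probability at most (1 − ε)^m ≤ δ.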

Correction Schemes: Although very useful, the certification process fails when at least one invalid record is found. Naturally, the next question is how to proceed in order to actually compute the value of the function that we are interested in by throwing away the invalid records. In a worst-case example where all records are invalid, we would need to verify all the records to complete the correction task. To get a more meaningful and realistic measure of the verification complexity of a correction task, we carefully define it in terms of a budget B. For verification complexity V, we assume that the designer has an initial budget for verifications B = V, which decreases as he performs verifications but may increase every time he finds an invalid record. The rationale behind the increase is that verification of an incorrect record leads to its removal from the data set, which makes the data set more accurate. Therefore, in this model we measure the number of verifications needed to correct one incorrect record. We distinguish correction schemes into two models, depending on the budget increase.

In the weak correction model, the budget increases by V every time an invalid record is found. This means that finding an invalid record allows us to restart the process from the beginning.

In the strong correction model, when an invalid record is found the budget does not increase, but it does not decrease either. This means that verification of invalid records is costless.

Of course, as we said, in the worst case a correction scheme has to verify all the data in the data set, which is not realistic. However, during the correction procedure for specific tasks, it is reasonable to have an upper bound on the number of invalid data points that we are willing to verify before dismissing the entire data set as too corrupted. In particular, if the correction scheme succeeds within the verification budget, we have an accurate output for our computational task. Otherwise, we can conclude that our data set is too corrupted, and hence we need to collect the data from scratch.

Notice that in the example of the max function, if the certification fails then we can continue by checking the second largest record, and so on, until we find a valid record, which gives us the value f(x⃗_T) precisely. However, strong correction schemes are in general much harder to obtain than weak ones.

If our computational or learning task has a deterministic certification scheme, e.g. the max function, it is easy to obtain weak correction schemes by repeating the certification scheme until success. For randomized schemes, though, one needs to be more careful, as errors can accumulate. This is easy to fix by requiring that the certification scheme fail with probability at most 1/n; however, this increases the total weak-verification complexity by a logarithmic factor.

In Section 4.4, we prove our main result for weak correction schemes, which implies that such an increase is not necessary: it is possible to obtain weak correction schemes with the same complexity as the underlying certification scheme (up to constant factors). To do this, we run the certification scheme many times and do not stop the first time it succeeds, but continue until the total number of successes exceeds the number of failures. A random walk argument guarantees that this produces the correct answer with constant probability. If the objective function is not monotone, additional care is needed to obtain the same guarantee.
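The random-walk idea can be illustrated abstractly (our sketch, with illustrative names, not the chapter's exact scheme): a noisy test that is correct with constant probability p > 1/2 is amplified by running it until one outcome leads the other by a fixed margin, and the biased walk ends on the correct side with probability that improves geometrically in the margin, without a log n factor in the expected number of runs.

```python
def amplify(run_certifier, margin=3, max_runs=10_000):
    """Repeat a {True, False}-valued certifier until one outcome leads the
    other by `margin`; return the winning outcome and the number of runs."""
    lead = 0
    for runs in range(1, max_runs + 1):
        lead += 1 if run_certifier() else -1
        if abs(lead) >= margin:
            return lead > 0, runs
    return None, max_runs
```

For a certifier biased toward True, the walk drifts upward and typically terminates after only O(margin) runs, which is why the amplification costs no extra logarithmic factor.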

While weak correction schemes with good verification complexity exist for all tasks that we can efficiently certify, strong correction schemes are rarer. In Section 4.5, we show that it is possible to obtain strong correction schemes for the sum function using only O(1/ε) verifications of valid records. Since that many verifications are necessary to get a 1 + ε multiplicative approximation for the sum, this implies a gap between the weak and strong correction models. The gap between them can be arbitrarily large, though. As an example, the max-of-sums function we discussed earlier has certification and weak-correction complexity O(1/ε), although it is impossible to get a constant factor approximation in the strong correction model without verifying Ω(n) valid records.

Despite the impossibility of obtaining strong correction schemes even for simple functions such as the max-of-sums, we can show that efficient strong correction schemes exist for quite general optimization objectives. We prove (Theorem 59) a very interesting and tight connection between strong correction schemes and sublinear algorithms that use conditional sampling [79]. We can exploit this connection to directly obtain efficient strong correction schemes. This gives efficient schemes for general optimization tasks such as clustering, minimum spanning tree, TSP and Steiner tree, which capture settings where agent reports lie in some metric space.

    4.1.2 More Related Work

Our certification task resembles the task of property testing, as formalized in [74], where one has to decide whether the data has a particular property versus being ε-far from it in some distance metric. In our case, the property we want to test is whether the evaluation of the function on all the collected data is equal to its evaluation on the subset of the valid data only.

Our correction task is related to a large body of work in statistics on how to deal with noisy or incomplete datasets. Several methods have been proposed for dealing with missing data. The popular method of imputation [121, 103, 127] corrects the dataset by filling in the missing values with maximum likelihood estimates.

In addition, the field of robust statistics [80, 86] deals with the problem of designing estimators when the dataset contains random or adversarially corrupted datapoints.

Several efficient algorithmic results have recently appeared in the context of robust parameter estimation and distribution learning [53, 54, 55, 56, 98]. The goal of these works is to learn the parameters of a multidimensional distribution, belonging to a known parametric family, while a constant fraction of the samples have been adversarially corrupted. In [37], the authors deal with parameter estimation in cases where more than half of the dataset is corrupted and identification is impossible, by providing a list of candidate estimates. They show that the correct estimate can be chosen as long as a small "verified" set of data is provided. In contrast to [37], the verification oracle in our model allows us to verify any subset of datapoints, but verification is costly. The work of [128] considers similar verification access to the dataset in crowdsourced peer grading settings.

Selective verification of datapoints has also been explored in the context of mechanism design. In [63], the authors study mechanisms with verification and achieve truthfulness by solving a task similar to certification in social choice problems.

    Finally, another related branch of literature considers the task of correcting datasets through local queries ([92, 27, 25, 126, 8, 31]). For example, using local queries, [8] correct datasets to ensure monotonicity and other structural properties. Solutions to similar local correction tasks for noisy probability distributions are presented in [31].

Notation: For m ∈ ℕ we denote the set {1, …, m} by [m]. Let N = [n] be the set of all records of the data set and T ⊆ N be the subset of records that are valid. The set T is unknown to the algorithm. Suppose we are given an input x_N = (x_1, x_2, …, x_n) of length n, where every x_i belongs to some set Ω. Let x_T = (x_j)_{j∈T} be the vector consisting only of the coordinates of x_N that are in T. Our general goal is to approximate the value of a symmetric function f : Ω* → ℝ+ on input x_T ∈ Ω*. Finally, every input x_j with j ∈ T is called valid and the rest, with j ∈ N \ T, are called invalid.

We consider two different tasks: certification and correction.

    In the certification task, we count the total number of verifications of records needed to test between the following two hypotheses:

1. f(x_N) ∈ [1 − ε, 1/(1 − ε)] · f(x_T), i.e. the value computed on all records approximates the value on the valid records only;

2. there exists a record i such that i ∉ T.

We allow a small probability δ that the algorithm fails to find a witness, i.e.

P( f(x_N) ∉ [1 − ε, 1/(1 − ε)] · f(x_T)  ∧  no invalid record found ) ≤ δ.

In the correction task, the goal is to always compute an approximation to the correct answer, even when the certification task fails. We consider two models for correction: the weak correction model and the strong correction model.

In the weak correction model, after catching an invalid record we are allowed to restart the task, and therefore we do not count the number of verifications that we already used before catching the invalid record. So if we have the guarantee that a weak correction scheme uses v(n, ε) verifications and during the execution of the scheme we find k invalid records, then the total number of verifications used is at most (k + 1) · v(n, ε).

In the strong correction model, instead of restarting every time we find an invalid record, we just ignore the data of this record and we also do not count it in the number of verifications. So if we have the guarantee that a strong correction scheme uses v(n, ε) verifications and during the execution of the scheme we find k invalid records, then the total number of verifications used is at most k + v(n, ε).

    4.2 Certification Schemes for Linear Programs

In this section, we present examples of certification schemes, as defined in Section 4.1, for frequently arising problems such as computing the sum of values or functions that can be expressed as linear programs. In the next section, we will see a more general statement about certification schemes for functions that satisfy a general Lipschitz continuity condition.

    4.2.1 Computing the Sum of Records

One of the most basic certification tasks is computing the sum of the values of the records. For this task, we are given n positive real numbers x_1, x_2, …, x_n, each one coming from a record in our data set. Our goal is to certify whether the sum of all the records is close to the sum over the subset of records that are valid, i.e. belong to T.

More formally, we want to check, with probability of failure at most δ > 0, whether Σ_{i∈N} x_i ∈ [1 − ε, 1/(1 − ε)] · Σ_{i∈T} x_i or there is at least one record i such that i ∉ T. We show that there exists an efficient certification scheme for this task:

Lemma 48. Let x_1, x_2, …, x_n ≥ 0 be the values of the records in N and f(x_N) = Σ_{i∈N} x_i. Consider the probability distribution p_i = x_i / Σ_{j∈N} x_j which assigns to each record a probability proportional to its value x_i. Verifying k = O((1/ε) log(1/δ)) records sampled independently from p guarantees that the certification task succeeds with probability at least 1 − δ.

Proof. Since T ⊆ N and the x_i are positive numbers, the inequality Σ_{i∈N} x_i ≥ (1 − ε) Σ_{i∈T} x_i holds trivially. If the inequality Σ_{i∈T} x_i ≥ (1 − ε) Σ_{i∈N} x_i does not hold, we can bound the probability that all of the k verifications fail to find an invalid record as follows.

The probability that a single verification fails to find an invalid record is Σ_{i∈T} p_i = Σ_{i∈T} x_i / Σ_{i∈N} x_i < 1 − ε.

Therefore, the probability that all k verifications fail is at most (1 − ε)^k. Setting k = O((1/ε) log(1/δ)), we guarantee that an invalid record is found with probability at least 1 − δ. □
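The proof above translates directly into a sampling procedure. The following is a minimal Python sketch of the scheme of Lemma 48; the helper `is_valid` stands in for the costly verification oracle, and the constant in k is illustrative rather than tuned.

```python
import math
import random

def certify_sum(x, is_valid, eps, delta, rng=random):
    """Sketch of the Lemma 48 certification scheme for the sum.

    Samples k = O((1/eps) * log(1/delta)) records with probability
    proportional to their value and verifies each one.  Returns the
    index of an invalid record (a witness) if one is found, or None
    to accept.  `is_valid` models the costly verification oracle.
    """
    n = len(x)
    k = math.ceil((2 / eps) * math.log(1 / delta))  # illustrative constant
    for _ in range(k):
        # sample index i with probability p_i = x_i / sum_j x_j
        i = rng.choices(range(n), weights=x, k=1)[0]
        if not is_valid(i):
            return i  # witness: an invalid record
    return None  # accept: sum over all records ~ sum over valid ones
```

If the invalid records contribute more than an ε fraction of the total value, each sample hits one with probability at least ε, so after k samples a witness is found with probability at least 1 − δ.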

    4.2.2 Functions given by Linear Programs

We now extend the previous results for the sum function to more general objective functions that can be represented as linear programs. We first consider the special case of packing and covering LPs, while later we present a result for general linear programs.

Packing LP:  max_y Σ_{i∈N} c_i y_i  s.t.  Σ_{i∈N} a_{ij} y_i ≤ b_j for j = 1, …, m;  y_i ≥ 0 for i ∈ N.

Covering LP:  min_x Σ_{j=1}^m b_j x_j  s.t.  Σ_{j=1}^m a_{ij} x_j ≥ c_i for i ∈ N;  x_j ≥ 0 for j = 1, …, m.

Packing and covering LPs are parameterized by the non-negative parameters a_{ij}, b_j, c_i. We assume that each record i contains all parameters under its control, i.e. the value c_i and a_{ij} for all j, while the parameters b_j are accurately known in advance.

Packing LPs capture settings where several resources (each available in a quantity b_j) are to be divided among a set of agents in the system; agents report how much of each resource they need (given by a_{ij}) and how much value they can generate if they are given the resources they ask for (given by c_i). Our goal is to compute an efficient allocation to agents that maximizes the total value generated. For the certification task, we want to certify that the total value generated by the valid agents in an optimal allocation is close to the value computed under the possibly incorrect reports. We show that efficient certification schemes exist by extending the certification scheme presented for the sum function:

Lemma 49. Let a_{ij}, c_i ≥ 0 be values contained in the records N and y* be the optimal solution to the packing LP. Consider the probability distribution p_i = c_i y*_i / Σ_{j∈N} c_j y*_j, which assigns records a probability proportional to their computed value c_i y*_i. Verifying k = O((1/ε) log(1/δ)) records sampled independently from p guarantees that the certification task for the packing LP succeeds with probability at least 1 − δ.

To see why this lemma holds, notice that the value Σ_{i∈N} c_i y*_i computed using all the records N is higher than the value Σ_{i∈T} c_i ŷ_i computed using only the valid records T. Moreover, if Σ_{i∈T} c_i y*_i ≥ (1 − ε) Σ_{i∈N} c_i y*_i, then it must be that Σ_{i∈T} c_i ŷ_i ≥ (1 − ε) Σ_{i∈N} c_i y*_i as well, since setting y_i = y*_i for i ∈ T and y_i = 0 otherwise is a feasible solution to the packing LP under the valid records. Finally, if Σ_{i∈T} c_i y*_i < (1 − ε) Σ_{i∈N} c_i y*_i, it means that invalid records contribute more than an ε fraction of the total value, and thus an invalid record can easily be found as in the previous case of the sum function.
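Since only the reported contributions c_i y*_i are needed, the scheme of Lemma 49 reduces to importance sampling over those contributions. A hedged Python sketch (`is_valid` again stands in for the verification oracle; the constant in k is illustrative):

```python
import math
import random

def certify_packing_lp(c, y_star, is_valid, eps, delta, rng=random):
    """Sketch of the Lemma 49 scheme: sample records with probability
    proportional to their contribution c_i * y*_i to the LP value and
    verify them; any invalid record found is returned as a witness."""
    weights = [ci * yi for ci, yi in zip(c, y_star)]
    k = math.ceil((2 / eps) * math.log(1 / delta))  # O((1/eps) log(1/delta))
    for _ in range(k):
        i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if not is_valid(i):
            return i  # witness: invalid record with large contribution
    return None  # accept the computed LP value
```

The sketch assumes the optimal solution y* has already been computed from the reports; solving the LP itself is outside the verification budget.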

Covering LPs naturally capture various settings with public goods, where the designer wants to introduce new goods to satisfy all the demands coming from the records of our data set while minimizing the total cost. In facility location problems, for example, the designer wants to open facilities so that every agent in a given set has access to at least one facility, and the agents report which locations are accessible to them.

    Certification schemes for covering LPs are less direct than previous examples, but can be easily obtained through LP duality. As the dual of a covering LP is a packing

    LP which has the exact same value, we can use the certification scheme of Lemma 49 to certify that value. We directly get the following:

Lemma 50. Let a_{ij}, c_i ≥ 0 be values in the records N. Verifying k = O((1/ε) log(1/δ)) records sampled independently according to a distribution p given by the solution to the dual packing LP guarantees that the certification task for the covering LP succeeds with probability at least 1 − δ.

General LPs can be written in the form of a packing or a covering LP but have arbitrary (possibly negative) parameters a_{ij}, b_j, c_i. The value of such LPs is harder to certify in general, as many more verifications than before might be needed. However, we can show that m verifications suffice to certify their value exactly.

Lemma 51. Let a_{ij}, c_i be (possibly negative) values contained in the records N. The certification complexity for general LPs (written in the form of packing or covering LPs above) is at most m.

To see why this is true, notice that in the covering LP formulation, the optimal value is determined by at most m tight constraints, as there are only m variables. Verifying the m records relevant to those constraints guarantees that the optimal value of the LP under only the valid records is equal to the computed one. This is because only those m constraints determine the optimal value, and even if every other constraint i was dropped (i.e. because i ∉ T) the value would remain the same. The result also holds for general LPs under the packing LP formulation by LP duality.

4.3 Certification Schemes for w-Lipschitz Functions and Applications

In this section, we present a unified way of finding almost-optimal certification schemes. For a given function f, a desired approximation parameter ε and an instance x_N, we want to compute the "instance-optimal" number of verifications needed to certify that f(x_N) ∈ [1 − ε, 1/(1 − ε)] · f(x_T) with probability of failure at most 1/3. The first result of this section is structural: we show that even though optimal schemes may be arbitrarily complex, there are simpler schemes, which verify records independently, that are almost-optimal.

To show this, we define for every set S ⊆ N the probability p_S that the instance-optimal certification scheme C* verifies at least one record in S, i.e.

p_S = P( ∪_{i∈S} {C* verifies record i} ).

For such an event, we say that the certification scheme verifies S, and for simplicity we denote by p_i the probability that C* verifies record i, i.e. p_i = p_{{i}}.

For the instance x_N, the set of invalid records could be any S ⊆ N. For the certification scheme to work with failure probability at most 1/3, we must have that p_S ≥ 2/3 for any subset S ⊆ N such that f(x_N)/f(x_{N\S}) ∉ [1 − ε, 1/(1 − ε)]. If this does not hold for some S, an adversary could choose the set of invalid records to be S, and the certification scheme C* would fail with probability more than 1/3. Moreover, even though the optimal certification scheme C* may verify records in a very correlated way, we have that Σ_{i∈S} p_i ≥ p_S ≥ 2/3 by a simple union bound. Therefore, the certification scheme C* must satisfy the following set of necessary conditions:

Σ_{i∈S} p_i ≥ 2/3   for all S ⊆ N such that f(x_N)/f(x_{N\S}) ∉ [1 − ε, 1/(1 − ε)].

By linearity of expectation, the expected total number of verifications that C* performs is

E[total number of verifications] = E[ Σ_{i∈N} 1{C* verifies record i} ] = Σ_{i∈N} p_i.

The above imply that the value of the following linear program is a lower bound on the total number of verifications needed by the optimal scheme C* for this specific instance x_N:

min Σ_{i∈N} p_i
s.t. Σ_{i∈S} p_i ≥ 2/3,  ∀S ⊆ N such that f(x_N)/f(x_{N\S}) ∉ [1 − ε, 1/(1 − ε)]   (4.1)
     0 ≤ p_i ≤ 1,  ∀i ∈ N

Notice that the solutions to LP (4.1) do not directly correspond to certification schemes with success probability 2/3. However, as we show, any solution to LP (4.1) can be converted to a certification scheme with at most twice as many verifications as the optimal value of LP (4.1) and success probability 2/3. Since the optimal value of LP (4.1) lower bounds the instance-optimal number of verifications, our derived certification scheme will be a 2-approximation to the instance-optimal scheme.

Definition 52. For a solution p of LP (4.1), we define the certification scheme C_p that verifies each record i independently with probability q_i = min{2p_i, 1}.
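The scheme C_p is simple enough to sketch in a few lines of Python; the LP solution p is assumed to be given as input, and `is_valid` models the verification oracle.

```python
import random

def scheme_Cp(p, is_valid, rng=random):
    """Sketch of the scheme C_p from Definition 52: verify each record i
    independently with probability q_i = min(2 * p_i, 1).  Returns the
    records that were verified and the invalid ones found among them."""
    verified, invalid = [], []
    for i, pi in enumerate(p):
        if rng.random() < min(2 * pi, 1.0):
            verified.append(i)
            if not is_valid(i):
                invalid.append(i)
    return verified, invalid
```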

It is clear that the certification scheme C_p uses in expectation at most twice as many verifications as the optimal value of LP (4.1), and hence as the instance-optimal scheme. We now show that it also achieves success probability 2/3, as required.

Assume that the subset of valid records is T = N \ S. The probability that the scheme C_p does not verify any record in the set S = {s_1, …, s_m} is

P(C_p does not verify S) = P( (C_p does not verify s_1) ∧ … ∧ (C_p does not verify s_m) ) = Π_{s∈S} (1 − q_s).

Since p is a feasible solution to LP (4.1), the probability that some record from S is verified is

P(C_p verifies S) = 1 − Π_{s∈S} (1 − q_s) ≥ 1 − exp( −2 Σ_{i∈S} p_i ) ≥ 1 − exp(−4/3) > 2/3.

This means that our certification scheme succeeds with probability 2/3 using at most twice the optimal number of verifications in expectation. We can amplify the probability of 2/3, making it arbitrarily close to one, by repeating the certification scheme. Since the repetitions are independent and each of them fails with probability at most 1/3, after r repetitions the total probability of failure is 3^{−r}. Repeating r = log(1/δ) times guarantees that for any subset S, the probability that it will be verified is at least 1 − δ. This result is summarized in the following theorem.

Theorem 53. For any given function f : Ω* → ℝ and any set of record values x_N, a solution to LP (4.1) corresponds to a certification scheme that verifies records of the data set independently, uses at most twice as many verifications as the optimal scheme for this instance, and succeeds with probability 2/3. Repeating the scheme log(1/δ) times increases the success probability to 1 − δ.

    Remark: We note that the LP (4.1) has exponentially many constraints and it may be computationally intractable to solve depending on the function. It is very useful though as a tool to uncover the structure of approximately optimal certification schemes. For example, Theorem 53 implies that even though optimal schemes may be arbitrarily complex, there are simpler schemes, that verify records independently, which are almost-optimal.

In the following section, we derive a general methodology to obtain solutions to LP (4.1) for the very general class of w-Lipschitz functions.

4.3.1 Certification Schemes for w-Lipschitz Functions

In this section, we show how to use Theorem 53 to derive sufficient smoothness conditions on the function f under which certification schemes with a small number of verifications exist.

For any record i ∈ N, we define w_i to be the weight of record i. The weight of record i will determine the probability with which we verify record i in the verification scheme that we define. We now state the property that we want f to satisfy in order to obtain a good verification scheme.

Definition 54. We say that a function f : Ω* → ℝ is w-Lipschitz, with w ∈ ℝ^n_+, if for any S ⊆ N,

|f(x_N) − f(x_{N\S})| ≤ Σ_{i∈S} w_i.

For any function that satisfies this Lipschitz property we can get a good verification scheme that depends on the weight vector w.

Theorem 55. For any non-negative w-Lipschitz function f : Ω* → ℝ+, set of records N with values x_N, and real numbers ε, δ > 0, there exists a certification scheme that uses at most (4 Σ_{i∈N} w_i / (3 ε f(x_N))) · log(1/δ) verifications and has probability of success at least 1 − δ.

Proof. We set p_i = 2w_i / (3 ε f(x_N)) and we show that those values satisfy LP (4.1). Thus, if we choose to verify record i with probability min{2p_i, 1}, we get a valid (ε, δ)-certification scheme.

For any subset S ⊆ N, by the w-Lipschitz property we get that

Σ_{i∈S} p_i = (2 / (3 ε f(x_N))) Σ_{i∈S} w_i ≥ (2 / (3ε)) · |f(x_N) − f(x_{N\S})| / f(x_N).

Now if f(x_N)/f(x_{N\S}) > 1/(1 − ε), we have f(x_{N\S}) < (1 − ε) f(x_N), hence |f(x_N) − f(x_{N\S})| / f(x_N) > ε and Σ_{i∈S} p_i > 2/3. Also, if f(x_N)/f(x_{N\S}) < 1 − ε, we have f(x_{N\S}) > f(x_N)/(1 − ε), hence |f(x_N) − f(x_{N\S})| / f(x_N) > ε/(1 − ε) ≥ ε and again Σ_{i∈S} p_i > 2/3.

Therefore, when f(x_N)/f(x_{N\S}) ∉ [1 − ε, 1/(1 − ε)], we have Σ_{i∈S} p_i ≥ 2/3. This means that LP (4.1) is satisfied. Now we can apply Theorem 53 and conclude that the certification scheme that verifies each record independently with probability min{2p_i, 1}, where 2p_i = 4w_i / (3 ε f(x_N)), verifies at most 4 Σ_{i∈N} w_i / (3 ε f(x_N)) records in expectation and has probability of success at least 2/3. In order to get probability of success 1 − δ, we instead verify each record i with probability min{2p_i log(1/δ), 1} and the theorem follows. □
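The resulting scheme is easy to state in code. The sketch below assumes the weights w_i and the value f(x_N) are given as inputs, and `is_valid` models the verification oracle.

```python
import math
import random

def lipschitz_certify(w, f_value, is_valid, eps, delta, rng=random):
    """Sketch of the Theorem 55 scheme: verify record i independently
    with probability min(2 * p_i * log(1/delta), 1), where
    p_i = 2 * w_i / (3 * eps * f_value).  Expected verifications:
    O((sum(w) / (eps * f_value)) * log(1/delta))."""
    boost = math.log(1 / delta)
    invalid = []
    for i, wi in enumerate(w):
        q = min((4 * wi / (3 * eps * f_value)) * boost, 1.0)
        if rng.random() < q and not is_valid(i):
            invalid.append(i)
    return invalid  # empty list means the value is certified
```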

In Sections 4.6.1 and 4.6.2, we present two applications of Theorem 55 that give certification schemes for the Traveling Salesman Problem (TSP) and the Steiner tree problem. In both applications, we show that the optimal solution is w-Lipschitz with Σ_{i∈N} w_i / f(x_N) ≤ 2. Hence, by Theorem 55, the total number of verifications is O((1/ε) log(1/δ)).

    4.4 Weak Correction Model

    We show how starting from a certification scheme, we can obtain a weak-correction scheme with the same verification complexity (up to constants).

Theorem 56. Suppose that there exists a certification scheme for a function f that uses q(n, ε) verifications and fails with probability 1/3. Then, there exists a weak-correction scheme with verification complexity O(q(n, ε) log(1/δ)) that outputs an accurate estimate of the function f and fails with probability δ.

Proof. We will first provide a simple analysis for the case where the function f is increasing in the record values, and later extend it to the general case.

Proof of Theorem 56 for increasing functions. Our weak correction scheme works by repeating the certification process enough times so that the number of times it failed is less than the number of times it succeeded. In particular, we model this procedure as a random walk on the integers, starting from point C and ending once it reaches 0. We move to the right whenever a round of verifications (i.e. an execution of the certification scheme) reveals some invalid record, and we move to the left otherwise.

The random walk is guaranteed to return to the origin eventually, since once all invalid records are removed the certification scheme cannot find any additional invalid record. The only case in which the weak correction scheme fails is if the walk returns early, without removing enough invalid records, while the estimate still has value larger than f(x_T)/(1 − ε). In such a case, at all points of the random walk the estimate was larger than f(x_T)/(1 − ε), which means that at every step the walk moved to the right with probability at least 2/3. The probability that such a biased random walk reaches the origin is at most (1/2)^C = 2^{−C}. Setting C = log(1/δ) guarantees a probability of error δ. The number of verifications performed if k invalid records are found is (C + 2k) q(n, ε), thus the total verification complexity is O(q(n, ε) log(1/δ)). □
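The random-walk stopping rule above can be rendered as a short Python sketch for increasing f. Here `certify_round` is assumed to run one execution of the certification scheme on the current record set and return an invalid record it caught, or None.

```python
import math

def weak_correct(records, certify_round, delta):
    """Sketch of the random-walk weak correction scheme (Theorem 56,
    increasing f).  Start a counter at C = ceil(log2(1/delta)); move up
    when a certification round finds an invalid record (and remove it),
    move down otherwise; stop when the counter reaches 0.  Returns the
    surviving records and the number of invalid records removed."""
    pos = math.ceil(math.log2(1 / delta))
    records = set(records)
    removed = 0
    while pos > 0:
        bad = certify_round(records)
        if bad is None:
            pos -= 1          # round succeeded: step toward the origin
        else:
            records.discard(bad)
            removed += 1
            pos += 1          # round caught an invalid record: step away
    return records, removed
```

In the weak model, each round after an invalid record is caught restarts the accounting, matching the (k + 1) · v(n, ε) bound.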

We will now remove the assumption that the function f is increasing. We again use the same random walk that starts at C and ends at 0 as before. However, instead of outputting the result of the function f on the final subset of records (after all deletions), we will consider every possible intermediate subset of records during the random walk as a candidate for producing a (1 + ε)-approximate solution. Note that, at each step i of the random walk, we run a certification scheme on some set S_i ⊆ N. We define a subset S ⊆ N to be "bad" if f(x_S)/f(x_T) ∉ [1 − ε, 1/(1 − ε)] and to be "good" otherwise. By the definition of the certification scheme, if the set S is "bad", then an invalid record is found with probability at least 2/3, in which case the random walk moves to the right. Otherwise, we do not have any guarantee on how the random walk will behave.

However, if at all steps the probability of finding an invalid record is more than 3/5, then the probability that the random walk reaches 0 is less than (2/3)^C < δ/2 for C = O(log(1/δ)). Thus, given that the walk returned, with high probability there must be some set S_i for which the certification scheme accepts with probability more than 2/5. Note that this can only be true if the set S_i is good, since 2/5 > 1/3.

At this point, given a list of these subsets, our goal is to find a "good" subset for which the certification scheme accepts with probability more than 1/3. We know that a "good" subset exists for which the acceptance probability is more than 2/5. We view the certification process for a subset S as sampling from a Bernoulli random variable: we say that a set S has probability p if the certification process on the set S does not find an invalid record with probability p.

Let Test(S, γ) be a test that accepts if the probability of a set S is more than 2/5 (call such a set "very good") and rejects if it is less than 1/3. Such a test fails with probability γ and requires O(log(1/γ)) samples.

The main idea behind this algorithm is to iteratively run Test(S, γ) for all candidate subsets S with varying error probabilities γ, throwing out the failing ones, until a significant fraction of the subsets in our pool is "good". When this happens, we pick a subset at random and check whether it is actually "good" by running Test(S, γ) with small γ. We repeat this until we actually find a good subset, and we output the value of the function f on that subset. To ensure that this will eventually happen, we choose parameters appropriately, so that a constant fraction of the "bad" subsets fail while the "good" subsets pass the certifications with high enough probability.

Let K be the number of candidate subsets S_i. We have that K is equal to the number of invalid records found during the random walk process.

Our algorithm proceeds in rounds until there are at most K/log K sets remaining. In the t-th round:

- The algorithm runs Test(S_i, 10^{−t}) for every set S_i and discards all sets that fail.

- If the number of remaining sets did not drop by a factor of 2, the algorithm stops and returns a set S_i uniformly at random from the remaining sets.

If the algorithm has not returned after log log K rounds, then it runs Test(S_i, 1/K²) for every remaining set S_i and returns one that passes the test.

The proposed algorithm returns a "good" set with probability more than 3/5 − o(1). First, notice that a "very good" set will be discarded in the first log log K rounds with probability at most Σ_{t≥1} 10^{−t} = (1/10)/(1 − 1/10) = 1/9. Hence, if the algorithm did not return after log log K rounds, the last step returns a "good" set with high probability.

Now, suppose the algorithm returns at some round t. Let K_{t−1} be the number of remaining sets before round t. The probability that the number B_t of "bad" sets remaining after round t is more than K_{t−1}/5 is at most

P(B_t > K_{t−1}/5) ≤ exp(−B_{t−1}/10) ≤ exp(−K_{t−1}/50) ≤ exp(−K/(50 log K)).

This is an exponentially small probability, and by a union bound over all log log K rounds it is still negligible.

Thus, assuming that B_t ≤ K_{t−1}/5 and K_t ≥ K_{t−1}/2, a set chosen uniformly at random is "bad" with probability B_t/K_t ≤ 2/5.

Therefore, a good set is chosen with probability at least 3/5 − o(1), and thus by repeating O(log(1/δ)) times and choosing the median of the values f(x_{S_i}), we have that with probability 1 − δ/2 the output lies in [1 − ε, 1/(1 − ε)] · f(x_T). The total number of times the certification scheme is called is O(log(1/δ)) · Σ_t O(2^{−t} K log 10^t) = O(K log(1/δ)). Thus, the verification complexity of the weak correction scheme is O(q(n, ε) log(1/δ)) and the theorem follows. □

Theorem 56 shows that the certification task we defined in Section 4.2 is already strong enough to perform this seemingly more challenging task. Intuitively, this is because we can run many rounds of certification until we have enough confidence that we have an accurate result, while removing from the dataset any invalid record we might find during these rounds. Indeed, a simple way to make the conversion is to start from a certification scheme with error probability 1/3, reduce its probability of error to δ/n by repeating it log(n/δ) times, and then use this scheme repeatedly until no more invalid records are detected. By a union bound, the probability of error is at most δ, since the process takes at most n steps. Theorem 56 is stronger: it shows that the logarithmic dependence on the number of records can be avoided if the stopping time is chosen more carefully.

4.5 Strong Correction Model

In this section, we show that it is possible to obtain more efficient correction schemes for several problems. We show that this is true for the sum function, which implies strong correction schemes for other functions such as the average. In addition, we show strong correction schemes for more general combinatorial tasks through a connection with conditional sampling. However, as we show, there are simple functions, e.g. the composition of the max and the sum function, for which good weak correction schemes exist but which do not admit strong correction schemes.

    4.5.1 Computing the Sum of Values of Records

Using the formulation in Section 4.2.1, we get the following result. It shows that Θ((1/ε²) log(1/δ)) verifications are both necessary and sufficient.

Lemma 57. Let x_1, x_2, …, x_n > 0 be the values of the records N and f(x_N) = Σ_{i∈N} x_i. Consider the probability distribution p_i = x_i / Σ_{j∈N} x_j, which selects a record i with probability proportional to its value. If we sample M times independently from p until k = Θ((1/ε²) log(1/δ)) valid records are found, then the estimator Ê = (k/M) · Σ_{i∈N} x_i lies in [1 − ε, 1/(1 − ε)] · Σ_{i∈T} x_i with probability at least 1 − δ.

Proof. Let the random variable M be the total number of verifications until we find k valid records, and let M be the set of samples that we observed. Also define Z = |M ∩ T| / |M| = k/M. We claim that Z ∈ [1 − ε, 1/(1 − ε)] · Σ_{i∈T} x_i / Σ_{i∈N} x_i, given that |M ∩ T| = k.

Let q = Σ_{i∈T} x_i / Σ_{i∈N} x_i. For the sake of contradiction, suppose Z > q/(1 − ε); then M < (1 − ε) k / q. Hence the expected number of valid records, if we draw M samples according to the described distribution, is at most (1 − ε)k. But now, using simple Chernoff bounds and the fact that k ≥ (1/ε²) log(2/δ), we get that with probability at most δ/2 the number of valid records found is at least k.

Similarly, we can show that if Z < (1 − ε) q, then with probability at most δ/2 the number of valid records found is at most k. Hence we have

P(|M ∩ T| = k) = P(|M ∩ T| = k | q ∈ [1 − ε, 1/(1 − ε)] · Z) · P(q ∈ [1 − ε, 1/(1 − ε)] · Z)
 + P(|M ∩ T| = k | q < (1 − ε) Z) · P(q < (1 − ε) Z)
 + P(|M ∩ T| = k | q > Z/(1 − ε)) · P(q > Z/(1 − ε)),

but from the definition of the strong correction scheme P(|M ∩ T| = k) = 1, and as we proved, P(|M ∩ T| = k | q < (1 − ε) Z) ≤ δ/2 and P(|M ∩ T| = k | q > Z/(1 − ε)) ≤ δ/2, therefore

1 = P(|M ∩ T| = k) ≤ P(|M ∩ T| = k | q ∈ [1 − ε, 1/(1 − ε)] · Z) · P(q ∈ [1 − ε, 1/(1 − ε)] · Z) + δ,

which implies

P(q ∈ [1 − ε, 1/(1 − ε)] · Z) ≥ 1 − δ.

This finally implies that our estimator is in the correct range:

P( Ê ∈ [1 − ε, 1/(1 − ε)] · Σ_{i∈T} x_i ) ≥ 1 − δ.

To see that Θ((1/ε²) log(1/δ)) verifications are also necessary, let x_1 = ⋯ = x_n = 1/n and let |T| = q|N| with q = 1/2. This instance is identical to estimating the bias of a Bernoulli random variable with error at most ε, and since all the x_i's are equal, we can assume without loss of generality that at each step we take a uniform sample from N. But it is well known that estimating the bias of a Bernoulli random variable within ε with probability of failure at most δ requires at least Θ((1/ε²) log(1/δ)) total samples. Half of those samples are expected to be valid samples, and hence the verification complexity of any strong correction scheme is also at least Θ((1/ε²) log(1/δ)). □
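The estimator of Lemma 57 can be sketched in a few lines of Python (constants illustrative; `is_valid` models the verification oracle, and, as in the strong model, invalid draws are skipped at no further cost):

```python
import math
import random

def strong_correct_sum(x, is_valid, eps, delta, rng=random):
    """Sketch of the Lemma 57 strong correction scheme for the sum:
    sample records with probability proportional to their value until
    k = O(log(1/delta)/eps^2) valid ones are found; with M the total
    number of draws, the estimator (k/M) * sum(x) is within a
    [1-eps, 1/(1-eps)] factor of the valid sum w.h.p."""
    k = math.ceil((2 / eps ** 2) * math.log(2 / delta))  # illustrative
    n, total = len(x), sum(x)
    valid_found, samples = 0, 0
    while valid_found < k:
        samples += 1
        i = rng.choices(range(n), weights=x, k=1)[0]
        if is_valid(i):
            valid_found += 1
        # invalid records are simply ignored at no further cost
    return (k / samples) * total
```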

    4.5.2 Lower Bound for the Maximum of Sums Function

In this section, we show that no efficient strong correction scheme exists for the composition of the max and the sum function. More precisely, we assume we have a partition J = {N_1, …, N_ℓ} of the set N and we want a strong correction scheme for the function f(x_N) = max_{A∈J} Σ_{i∈A} x_i. Lemma 58 shows that any strong correction scheme for f that achieves a constant factor approximation has to verify at least a constant fraction of the records.

Lemma 58. Let c > 0. There exists a partition J = {N_1, …, N_ℓ} of N and a vector x ∈ ℝⁿ such that any strong correction scheme for the function f(x_N) = max_{A∈J} Σ_{i∈A} x_i that returns an estimate ŝ ∈ [1/c, c] · f(x_T) with probability at least 3/4 must verify |N|/4c² records.

Proof. We consider a partition J of N into n/c² sets of size c² each, with x_i = 1 for all i ∈ N. Let S be a correction scheme that verifies fewer than n/4c² records. Then there exists a set A ∈ J such that

P(S verifies some j ∈ A) < 1/4.

We prove this by contradiction. Suppose P(S verifies some j ∈ A) ≥ 1/4 for all A ∈ J. Then E[number of verifications by S] ≥ Σ_{A∈J} P(S verifies some j ∈ A) ≥ n/4c², contradicting the assumption that S verifies fewer than n/4c² records. Let ŝ be the output estimator of S; then we have

P(ŝ ∈ [1/c, c] · f(x_T)) = P(ŝ ∈ [1/c, c] · f(x_T) | S verifies some j ∈ A) · P(S verifies some j ∈ A)
 + P(ŝ ∈ [1/c, c] · f(x_T) | S does not verify A) · P(S does not verify A)
 ≤ 1/4 + P(ŝ ∈ [1/c, c] · f(x_T) | S does not verify A).

Now, if we fix a set Q ⊆ ℝ+, we observe that the quantity P(ŝ ∈ Q | S does not verify A) does not depend on T ∩ A, since we are conditioning on the event that S does not verify any record in A. Now let j_B be an arbitrary record from the set B ∈ J. We consider the following two possibilities for the set T:

T_0 = ∪_{B∈J, B≠A} {j_B}

T_1 = T_0 ∪ A

We observe that if T = T_0 then f(x_T) = 1, and if T = T_1 then f(x_T) = c². Since ŝ does not depend on T ∩ A given that S does not verify A, we can change T between T_0 and T_1 without changing the quantity P(ŝ ∈ Q | S does not verify A). Now:

- if P(ŝ ∈ [1/c, c] | S does not verify A) ≤ 1/2, then we set T = T_0, and

- if P(ŝ ∈ [c, c³] | S does not verify A) ≤ 1/2, then we set T = T_1.

Observe that one of the two cases has to be true. In either case, we get that

P(ŝ ∈ [1/c, c] · f(x_T) | S does not verify A) ≤ 1/2.

Hence we get that

P(ŝ ∈ [1/c, c] · f(x_T)) ≤ 1/4 + P(ŝ ∈ [1/c, c] · f(x_T) | S does not verify A) ≤ 3/4,

and therefore S has to verify at least n/4c² records. □

4.5.3 From Algorithms using Conditional Sampling to Strong Correction Schemes

The design of a strong correction scheme can be challenging since the guarantee is very strong. Our main theorem in this section shows that there is a correspondence between strong correction schemes and sublinear algorithms using conditional sampling, introduced recently in [79]. We state the main theorem for this section below:

    Theorem 59. Any function that can be approximated using k conditional samples admits a strong correction scheme with cost k.

More precisely, we are given an input $\vec{x} = (x_1, x_2, \ldots, x_n)$ of length $n$, where every $x_i$ belongs to some set $\Omega$. In this section, we will fix $\Omega = [D]^d$ for some $D = n^{O(1)}$ to be the discretized $d$-dimensional Euclidean space. Our goal is to compute the value of a symmetric function $f : \Omega^n \to \mathbb{R}_+$ on the input $\vec{x} \in \Omega^n$. We assume that all $x_i$ are distinct and define $X \subseteq \Omega$ as the set $X = \{x_i : i \in \mathcal{N}\}$. Since we consider symmetric functions $f$, it is convenient to extend the definition of $f$ to sets: $f(X) = f(\vec{x})$. The conditional sampling model allows such queries of small description complexity to be performed. In particular, the algorithm is given access to an oracle $\mathrm{COND}(C)$ that takes as input a function $C : \Omega \to \{0, 1\}$ and returns a tuple $(i, x_i)$ with $C(x_i) = 1$, where $i$ is chosen uniformly at random from the subset $\{j \in [n] \mid C(x_j) = 1\}$. If no such tuple exists, the oracle returns $\bot$.

    The main result of this section is a reduction from any algorithm that uses con- ditional sampling to a strong correction scheme.

    Theorem 60. An algorithm that uses k conditional samples to compute a function f can produce a strong correction scheme with verification cost k.

Proof. We will show how to implement one conditional sample using only one verification. We take all the values of the records $x_1, \ldots, x_n$ and randomly shuffle them to obtain $x_{\pi_1}, \ldots, x_{\pi_n}$. Then we take the records $x_{\pi_i}$ one by one in this new order and check whether $C(x_{\pi_i}) = 1$. If yes, then we verify $x_{\pi_i}$ and, if it is valid, we return it as the result of the conditional sampling oracle. If $x_{\pi_i}$ is invalid, then we just ignore this record without any cost and proceed to the next record. If we exhaust the records without finding a valid record $x_{\pi_i}$ such that $C(x_{\pi_i}) = 1$, then we return $\bot$. It is easy to see that this procedure produces at every step a verified conditional sample. Since the conditional sampling algorithm has only this access to the data, any guarantees of the conditional sampling algorithm immediately transfer to the corresponding strong correction scheme. $\square$
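The reduction in this proof can be sketched in a few lines. The following is a minimal illustration, not the thesis's implementation: the names `cond_sample` and `verify`, and the record representation, are hypothetical choices made here for concreteness.

```python
import random

def cond_sample(records, C, verify):
    """Simulate one conditional sample using at most one successful verification.

    records: list of (index, value) pairs, possibly containing invalid values.
    C: predicate Omega -> {0, 1} describing the conditioning set.
    verify: oracle returning True iff a record is valid.
    Returns a uniformly random valid (i, x_i) with C(x_i) == 1, or None (i.e. ⊥).
    """
    shuffled = records[:]
    random.shuffle(shuffled)      # random order = uniform choice among matching records
    for i, x in shuffled:
        if C(x) == 1:
            if verify(i, x):      # the single paid verification of a valid record
                return (i, x)
            # invalid record: ignored without cost, keep scanning
    return None                   # no valid record satisfies C
```

Because the first valid record encountered in a uniformly random order is itself uniform among all valid records satisfying `C`, the returned pair has the distribution required of the COND oracle.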

The above result gives a general framework for designing strong correction schemes for several computational and learning problems. We give some examples below that are based on [79]. For other distributional learning tasks, one can use the conditional sampling algorithms of [30] to get efficient strong correction schemes.

k-means Clustering: Let $\Omega$ be a metric space with distance metric $d : \Omega \times \Omega \to \mathbb{R}_+$, i.e. $d(x, y)$ represents the distance between $x$ and $y$. Given a set of centers $C$, we define the distance of a point $x$ from $C$ to be $d(x, C) = \min_{c \in C} d(x, c)$. Now, given a set of $n$ input points $X \subseteq \Omega$ and a set of centers $C \subseteq \Omega$, we define the cost of $C$ for $X$ to be $d(X, C) = \sum_{x \in X} d(x, C)$. The k-means problem is the problem of minimizing the squared cost $d^2(X, C) = \sum_{x \in X} d^2(x, C)$ over the choice of centers $C$, subject to the constraint $|C| = k$. We assume that the diameter of the metric space is $\Delta = \max_{x, y \in X} d(x, y)$. In this setting, the records contain the points in the $d$-dimensional metric space.
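As a concrete rendering of these definitions, the short sketch below computes $d(x, C)$ and the squared k-means cost. The function names and the choice of the Euclidean metric are illustrative assumptions, not part of the text.

```python
import math

def dist_to_centers(x, centers, d):
    """d(x, C) = min over centers c of d(x, c)."""
    return min(d(x, c) for c in centers)

def kmeans_cost(X, centers, d):
    """Squared cost d^2(X, C) = sum over points x of d(x, C)^2."""
    return sum(dist_to_centers(x, centers, d) ** 2 for x in X)

def euclidean(p, q):
    """One possible metric d: Euclidean distance between coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, four points at distance 1 from their nearest of two centers contribute a squared cost of 4.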

Corollary 61. Let $x_1, x_2, \ldots, x_n$ be the points in the $d$-dimensional metric space $\Omega$ stored in the records $\mathcal{N}$, and let $f(\vec{x}_\mathcal{N})$ be the cost of the optimal k-means clustering of the points $\vec{x}_\mathcal{N}$. There exists a strong correction scheme with $\tilde{O}(k^2 \log n \log(k/\delta))$ verifications that guarantees a constant-factor approximation of the value of the optimal clustering, with probability of failure at most $\delta$.

The proof of this corollary is based on Theorems 59 and 43.

Euclidean Minimum Spanning Tree: Given a set of points $\vec{x}_\mathcal{N}$ in $\mathbb{R}^d$, the minimum spanning tree problem in Euclidean space asks to compute a spanning tree $T$ on the points minimizing the sum of the weights of its edges. The weight of an edge between two points is equal to their Euclidean distance. We will focus on a simpler variant of the problem, which is to compute just the weight of the best possible spanning tree, i.e. to estimate the quantity $\min_{\text{tree } T} \sum_{(x, x') \in T} \|x - x'\|_2$.

Corollary 62. Let $x_1, x_2, \ldots, x_n$ be the points in $\mathbb{R}^d$ stored in the records $\mathcal{N}$ and $f(\vec{x}_\mathcal{N}) = \min_{\text{tree } T} \sum_{(x, x') \in T} \|x - x'\|_2$. There exists a strong correction scheme with $O(d \log^4 n/\varepsilon^7) \cdot \log(1/\delta)$ verifications that guarantees a $(1 + \varepsilon)$-approximation of the weight of the minimum spanning tree, with probability of failure at most $\delta$.

The proof of this corollary is based on Theorems 59 and 47.

Remark. Observe that the value of the MST gives a 2-approximation of the metric TSP and the metric Steiner tree problems. Hence, Corollary 62 implies efficient strong correction schemes that achieve a constant approximation for those problems as well.

    4.6 Applications of Theorem 55

    4.6.1 Optimal Travelling Salesman Tour

In this section we examine the metric travelling salesman problem, where we are given $n$ points (each provided by one record in $\mathcal{N}$) in a metric space $X$ and we wish to find the length of the minimum cycle going through each point in the set $T \subseteq \mathcal{N}$ of correct answers. As usual, we let $\vec{x}_\mathcal{N}$ be the input vector of record values, whose coordinates are points in the metric space $X$. Our goal is to find a certification scheme for this metric travelling salesman problem. That is, the algorithm should either output a sufficiently accurate value¹ for the minimum weight cycle going through the points in $\vec{x}_T$, or find an invalid record. The following lemma, combined with Theorem 55, gives us the desired result.

Lemma 63. Let $f : \Omega^* \to \mathbb{R}$ be the function mapping a set of points in a metric space $X$ to the length of their minimum TSP tour, and let $v_1 v_2 \cdots v_n$ be the minimum TSP tour. Also, let $\vec{w} \in \mathbb{R}^n$, $\vec{w} = (w_1, \ldots, w_n)$, where $w_i = d(v_{i-1}, v_i) + d(v_i, v_{i+1})$ and the indices are taken mod $n$. Then, $f$ is $\vec{w}$-continuous.

Proof. According to Definition 54, we need to show that for any $S \subseteq \mathcal{N}$:

$$f(\vec{x}_\mathcal{N}) \le f(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{i \in S} w_i \qquad (4.2)$$

¹Note that throughout this chapter we do not consider the computational complexity of the problems, since we are mainly interested in the number of verifications needed. Besides, in the case of Euclidean TSP we could use the known $(1 + \varepsilon)$-approximation algorithm in order to get similar results and avoid NP-completeness.

To see why this inequality is satisfied, let $T_R$ be the minimum TSP tour going through the points in $R = \mathcal{N} \setminus S$, and let $T_\mathcal{N} = v_1 v_2 \cdots v_n$ be the minimum TSP tour that goes through all the points in the set $\mathcal{N} \supseteq R$. Now let $j_1 < j_2 < \cdots < j_r$ be the indices at which the points of the set $R$ appear in this TSP tour. Consider two consecutive such points $v_{j_k}, v_{j_{k+1}}$, and let $P_k = \{v_{j_k+1}, v_{j_k+2}, \ldots, v_{j_{k+1}-1}\}$ be the set of consecutive points in the tour $T_\mathcal{N}$ between $v_{j_k}$ and $v_{j_{k+1}}$. Clearly, $\forall k : P_k \subseteq S$, and therefore the weights of those points appear in the sum on the right-hand side of equation (4.2). Now consider the two paths $P_{1,k} = v_{j_k}, v_{j_k+1}, \ldots, v_{j_{k+1}-1}$ and $P_{2,k} = v_{j_k+1}, v_{j_k+2}, \ldots, v_{j_{k+1}}$, which are both part of $T_\mathcal{N}$. We have that:

$$\sum_{i \in P_k} w_i = \sum_{i = j_k+1}^{j_{k+1}-1} \big( d(v_{i-1}, v_i) + d(v_i, v_{i+1}) \big) = l(P_{1,k}) + l(P_{2,k}) \ge 2 \min\{l(P_{1,k}),\, l(P_{2,k})\},$$

where $l(\cdot)$ denotes the length of a path. We now consider the walk that goes through all the vertices in $\mathcal{N}$ and has the following two properties:

- It respects the order in which the vertices in $R$ are visited by $T_R$.

- Between any two consecutive such vertices, it follows whichever path among $P_{1,k}$ and $P_{2,k}$ has smaller length, in the forward and then backward direction.

We know that $f(\vec{x}_\mathcal{N})$ is smaller than or equal to the length of the walk we have just defined, since the walk goes through all the given points and even repeats the points in $R$.² Thus,

$$f(\vec{x}_\mathcal{N}) \le f(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{k=1}^{r} 2 \min\{l(P_{1,k}),\, l(P_{2,k})\} \le f(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{k=1}^{r} \sum_{i \in P_k} w_i \le f(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{i \in S} w_i. \qquad \square$$

²Since we are working in a metric space, skipping points in the order that we visit them can only decrease the cost.

Using Lemma 63 and Theorem 55, we get the following corollary:

Corollary 64. Let $f : \Omega^* \to \mathbb{R}$ be the function mapping a set of points in a metric space $X$ to the length of their minimum TSP tour. Then, there exists a verification scheme that uses at most $O(\frac{1}{\varepsilon} \log(\frac{1}{\delta}))$ verifications per correction.

Proof. This is a straightforward application of Lemma 63 and Theorem 55, since $\sum_{i \in \mathcal{N}} w_i$ counts each of the edges in the optimal TSP tour $T_\mathcal{N}$ exactly twice. Thus,

$$\sum_{i \in \mathcal{N}} w_i = 2 f(\vec{x}_\mathcal{N}). \qquad \square$$
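The identity $\sum_i w_i = 2 f(\vec{x}_\mathcal{N})$ is easy to check mechanically. The sketch below computes the weight vector of Lemma 63 from a given tour; the helper names `tour_weights` and `tour_length` are hypothetical, introduced only for this illustration.

```python
def tour_weights(points, tour, d):
    """w_i = d(v_{i-1}, v_i) + d(v_i, v_{i+1}), indices taken mod n,
    where `tour` is an ordering (list of indices into `points`)."""
    n = len(tour)
    w = {}
    for pos, i in enumerate(tour):
        prev_p = points[tour[(pos - 1) % n]]
        next_p = points[tour[(pos + 1) % n]]
        cur = points[i]
        w[i] = d(prev_p, cur) + d(cur, next_p)
    return w

def tour_length(points, tour, d):
    """Total length of the closed tour."""
    n = len(tour)
    return sum(d(points[tour[k]], points[tour[(k + 1) % n]]) for k in range(n))
```

On any tour, summing the weights counts every tour edge exactly twice, so `sum(w.values())` equals twice the tour length.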

    4.6.2 Steiner tree

In the classic Steiner tree problem, the input is a positively weighted graph $G = (V, E, w)$, and the set of vertices $V$ is partitioned into two disjoint sets $T$ and $U$ such that $V = T \cup U$. Usually, $T$ is called the set of terminal nodes and $U$ the set of Steiner nodes. The goal is to compute a connected subgraph of $G$ of the smallest possible weight whose vertex set $V'$, with $T \subseteq V' \subseteq V$, includes all terminal nodes and any number of Steiner nodes.

Here, we are going to examine the Steiner tree problem in the following setting: we are given a fixed graph $G = (V, E)$ on $|V|$ vertices, and we also have $|\mathcal{N}|$ values from the set of records $\mathcal{N}$. Each record is a node from the set $V$, claiming that this node is in the set $T \subseteq V$ of terminal nodes that need to be connected by the tree. However, the records might be invalid, and the algorithm is allowed to perform verifications on those records. Let $\vec{x}_\mathcal{N}$ be the input vector whose coordinates are the vertices claimed to be in the set $T$ of terminal nodes. Similarly, let $\vec{x}_A$ be a vector containing only a subset $A \subseteq \mathcal{N}$ of those vertices. Our goal is again to either output a sufficiently accurate answer for the cost of the optimal Steiner tree or find an invalid record.

As in the previous section, we are going to use Theorem 55 to achieve this. The conditions of Theorem 55 are satisfied in this case due to the following lemma:

Lemma 65. Let $G = (V, E)$ be a graph and $f_G : V^* \to \mathbb{R}$ be the function mapping a set of vertices $T \subseteq V$ to the minimum cost of a Steiner tree connecting the vertices in $T$. Then, there exists a vector $\vec{w} \in \mathbb{R}^n$ such that $f_G$ is $\vec{w}$-continuous and, moreover, $\sum_{i \in \mathcal{N}} w_i = O(f_G(\vec{x}_\mathcal{N}))$.

Proof. We need to show that there exists a vector $\vec{w} \in \mathbb{R}^n$, $\vec{w} = (w_1, \ldots, w_n)$, such that for any $S \subseteq \mathcal{N}$ the following inequality holds:

$$f_G(\vec{x}_\mathcal{N}) \le f_G(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{i \in S} w_i \qquad (4.3)$$

We start by introducing some notation. Let $t$ be a tree subgraph of $G$. We denote by $H_t$ the Eulerian multigraph that results when we double each edge of $t$. Also, let $t_A$ denote the optimal Steiner tree for the set $A \subseteq V$ of terminal nodes. Thus, $\forall A : f_G(\vec{x}_A) = \mathrm{cost}(t_A)$.

Now let $t_R$ be the optimal Steiner tree for the set $R = \mathcal{N} \setminus S$ of terminal nodes. In order to show equation (4.3), it suffices to show that there exist a tree $t$ and a vector $\vec{w} \in \mathbb{R}^n$ such that $t$ is a valid Steiner tree for the set $\mathcal{N}$ of terminal nodes and its cost satisfies $\mathrm{cost}(t) \le \mathrm{cost}(t_R) + \sum_{i \in S} w_i$. In other words, we would like to find a weight vector $\vec{w} \in \mathbb{R}_+^n$ such that, starting from the Steiner tree $t_R$ and using the weight assigned to the set $S = \mathcal{N} \setminus R$ as budget, we are able to construct a Steiner tree that covers the set $\mathcal{N}$. To keep the number of verifications low, we also require this vector to be such that $\sum_{i \in \mathcal{N}} w_i = O(f_G(\vec{x}_\mathcal{N}))$.

Now fix a specific Euler tour (i.e. an ordering of the nodes) $U_\mathcal{N}$ for the graph $H_{t_\mathcal{N}}$, and also fix an Euler tour $U_R$ for the graph $H_{t_R}$. Note that the cost of each Euler tour is exactly twice the cost of the corresponding Steiner tree (e.g. $\mathrm{cost}(U_R) = 2\,\mathrm{cost}(t_R)$, where $\mathrm{cost}(\cdot)$ denotes the sum of the weights of all edges in the Euler tour or the tree). We define each weight $w_i$ to be the length of the path from the predecessor to the successor of node $i$ in the ordering $U_\mathcal{N}$.

Our goal is to find a new Euler tour which directly corresponds to a valid Steiner tree³ for the set $\mathcal{N}$ and is within our budget $\sum_{i \in S} w_i$.

Now let $U_\mathcal{N} = v_1 v_2 \cdots v_s$ be the order in which the terminal nodes are visited in the Euler tour of $H_{t_\mathcal{N}}$, and let $j_1 < j_2 < \cdots < j_r$ be the indices at which the points of the set $R = \mathcal{N} \setminus S$ appear in this Euler tour. Consider two consecutive such points $v_{j_k}, v_{j_{k+1}}$ in this sequence, and let $P_k = \{v_{j_k+1}, v_{j_k+2}, \ldots, v_{j_{k+1}-1}\} \subseteq S$ be the set of consecutive points in the Euler tour $U_\mathcal{N}$ between $v_{j_k}$ and $v_{j_{k+1}}$. Note that the sets $P_k$ are mutually disjoint, and therefore $\sum_{k=1}^{r} \sum_{i \in P_k} w_i \le \sum_{i \in S} w_i$. Also, $\sum_{i \in P_k} w_i$ is enough budget to add the set of nodes $P_k$ into the ordering $U_R$ between $v_{j_k}$ and $v_{j_{k+1}}$.⁴ By repeating this for all $k \in [r]$, we get the desired Steiner tree $t$ that covers all nodes in $\mathcal{N}$ and is such that:

$$2\,\mathrm{cost}(t_\mathcal{N}) \le 2\,\mathrm{cost}(t) \le 2\,\mathrm{cost}(t_R) + \sum_{i \in S} w_i$$

$$\Rightarrow \quad \mathrm{cost}(t_\mathcal{N}) \le \mathrm{cost}(t_R) + \sum_{i \in S} \frac{w_i}{2}$$

$$\Rightarrow \quad f_G(\vec{x}_\mathcal{N}) \le f_G(\vec{x}_{\mathcal{N} \setminus S}) + \sum_{i \in S} w_i', \quad \text{where } w_i' = \frac{w_i}{2}.$$

Thus, $f_G$ is $\vec{w}'$-continuous, and also $\sum_i w_i' = \frac{1}{2} \sum_i w_i \le \mathrm{cost}(U_\mathcal{N}) = 2 f_G(\vec{x}_\mathcal{N})$, since every edge of $U_\mathcal{N}$ appears in at most two of the predecessor-to-successor paths. $\square$
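The edge-doubling step used in this proof can be checked mechanically: doubling the edges of a tree yields an Eulerian multigraph whose Euler tour costs exactly twice the tree. The sketch below uses Hierholzer's algorithm; all names are illustrative, and this is a sanity check of the $\mathrm{cost}(U) = 2\,\mathrm{cost}(t)$ identity rather than part of the correction scheme itself.

```python
from collections import defaultdict

def double_edges(tree_edges):
    """H_t: double each edge of the tree, making every vertex degree even."""
    return tree_edges + tree_edges

def euler_tour(multi_edges, start):
    """Hierholzer's algorithm on an Eulerian multigraph.
    Returns the closed tour as a sequence of vertices."""
    adj = defaultdict(list)
    for eid, (u, v) in enumerate(multi_edges):
        adj[u].append((v, eid))
        adj[v].append((u, eid))
    used = [False] * len(multi_edges)
    stack, tour = [start], []
    while stack:
        v = stack[-1]
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()                # discard already-traversed edge copies
        if adj[v]:
            u, eid = adj[v].pop()
            used[eid] = True
            stack.append(u)             # extend the current trail
        else:
            tour.append(stack.pop())    # backtrack: vertex is finished
    return tour
```

Traversing the doubled tree visits each original edge exactly twice, so the tour's weighted length is twice the tree cost.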

The following corollary is a direct application of Lemma 65 and Theorem 55:

Corollary 66. Let $G = (V, E)$ be a graph and $f_G : V^* \to \mathbb{R}$ be the function mapping a set of vertices $T \subseteq V$ to the minimum cost of a Steiner tree connecting the vertices in $T$. Then, there exists a verification scheme that uses at most $O(\frac{1}{\varepsilon} \log(\frac{1}{\delta}))$ verifications per correction.

³That is, the tour traversing each edge of that tree twice, in opposite directions.
⁴To be more precise here, we need an argument similar to the two-paths argument in the proof of Lemma 63.

Chapter 5

    Distributed algorithms

    5.1 Introduction

A growing need to process massive data has led to the development of a number of frameworks for large-scale computation, such as MapReduce [46], Hadoop [134], Spark [137], and Dryad [88]. Thanks to their natural approach to processing massive data, these frameworks have gained great popularity. In this work, we consider the Massively Parallel Computation (MPC) model that is abstracted out of the capabilities of these frameworks.

In our work, we study some of the most fundamental problems in algorithmic graph theory: maximal independent set (MIS), maximum matching and minimum vertex cover. The study of these problems in models of parallel computation dates back to PRAM algorithms. A seminal work of Luby [106] gives a simple randomized algorithm for constructing an MIS in $O(\log n)$ PRAM rounds. When this algorithm is applied to the line graph of a graph $G$, it outputs a maximal matching of $G$, and hence a 2-approximate maximum matching. The output maximal matching also provides a 2-approximate minimum vertex cover. Similar results, also in the context of PRAM algorithms, were obtained in [10, 89, 90]. Since then, the aforementioned problems have been studied quite extensively in various models of computation. In the context of MPC, we design simple randomized algorithms that construct (approximate) solutions for all three problems.

5.1.1 The models

We consider two closely related models: Massively Parallel Computation (MPC), and the CONGESTED-CLIQUE model of distributed computing. Indeed, we consider it a (non-technical) contribution of this chapter to (further) exhibit the proximity of these two models, and we are hopeful that it will bring the related research communities closer. We next review these models.

    The MPC model

The MPC model was first introduced in [94] and later refined in [76, 22, 11]. The computation in this model proceeds in synchronous rounds carried out by $m$ machines. At the beginning of every round, the data (e.g. vertices and edges) is distributed across the machines. During a round, each machine performs computation locally without communicating with other machines. At the end of the round, the machines exchange messages which are used to guide the computation in the next round. In every round, each machine receives and outputs messages that fit into its local memory.

Space: In this model, each machine has $S$ words of space. If $N$ is the total size of the data, then one usually requires that $S$ be sublinear in $N$ and that $S \cdot m = \Theta(N)$. That is, the total memory across all the machines suffices to fit all the data, but is not much larger than that. If we are given a graph on $n$ vertices, in our work we consider the regimes in which $S \in \Theta(n/\mathrm{polylog}\, n)$ or $S \in \tilde{\Theta}(n)$.

Communication vs. computation complexity: Our main focus is the communication complexity, i.e. the number of rounds required to finish the computation. Although we do not explicitly state the computation complexity in our results, it will be apparent from the description of our algorithms that the total computation time across all the machines is nearly linear in the input size.

    CONGESTED-CLIQUE

The other model that we consider is the CONGESTED-CLIQUE model of distributed computing, which was introduced by Lotker, Pavlov, Patt-Shamir, and Peleg [105] and has been studied extensively since then; see e.g. [116, 58, 24, 101, 59, 112, 83, 82, 32, 81, 23, 65, 68, 97, 84, 33, 66, 93]. In this model, we have $n$ players which can communicate in synchronous rounds. Per round, each player can send $O(\log n)$ bits to each other player. Besides this communication restriction, the model does not limit the players, e.g., they can use large space and arbitrary computations; though, in our algorithms, both of these will be small. Furthermore, in studying graph problems in this model, the standard setting is that we have an $n$-vertex graph $G = (V, E)$, and each player is associated with one vertex of this graph. Initially, each player knows only the edges incident on its own vertex. At the end, each player should know the part of the output related to its own vertex, e.g., whether its vertex is in the computed maximal independent set or not, or whether any of its edges is in the matching or not.

We emphasize that CONGESTED-CLIQUE provides an all-to-all communication model. It is worth contrasting this with the more classical models of distributed computing. For instance, the LOCAL model, first introduced by Linial [102], allows the players to communicate only along the edges of the problem graph $G$ (with unbounded-size messages).

    5.1.2 Related work

Maximum Matching and Minimum Vertex Cover: If the space per machine is $O(n^{1+\delta})$, for any constant $\delta > 0$, Lattanzi et al. [99] show how to construct a maximal matching, and hence a 2-approximate minimum vertex cover, in $O(1/\delta)$ MPC rounds. Furthermore, in case the machine space is $O(n)$, their algorithm requires $O(\log n)$ rounds to output a maximal matching. In their work, they apply filtering techniques to gradually sparsify the graph. Ahn and Guha [7] provide a method for constructing a $(1 + \varepsilon)$-approximation of weighted maximum matching in $O(1/(\delta\varepsilon))$ rounds while, similarly to [99], requiring that the space per machine be $O(n^{1+\delta})$.

If the space per machine is $\tilde{O}(n\sqrt{n})$, Assadi and Khanna [16] show how to construct an $O(1)$-approximate maximum matching and an $O(\log n)$-approximate minimum vertex cover in two rounds. Their approach is based on designing randomized composable coresets.

Recently, Czumaj et al. [40] designed an algorithm for constructing a $(1 + \varepsilon)$-approximate maximum matching in $O((\log\log n)^2)$ MPC rounds of computation and $O(n/\mathrm{polylog}\, n)$ memory per machine. To obtain this result, they start from a variant of a PRAM algorithm that requires $O(\log n)$ parallel iterations, and show how to compress many of those iterations (on average, $O(\log n/(\log\log n)^2)$ of them) into $O(1)$ MPC rounds. Their result does not transfer to an algorithm for computing an $O(1)$-approximate minimum vertex cover.

Building on [40] and [16], Assadi [14] shows how to produce an $O(\log n)$-approximate minimum vertex cover in $O(\log\log n)$ MPC rounds when the space per machine is $O(n/\mathrm{polylog}\, n)$. The work of Assadi et al. [15] also addresses these two problems, and provides a way to construct a $(1 + \varepsilon)$-approximate maximum matching and an $O(1)$-approximate minimum vertex cover in $O(\log\log n)$ rounds when the space per machine is $\tilde{O}(n)$. Their result builds on techniques originally developed in the context of dynamic matching algorithms and composable coresets.

Maximal Independent Set: Maximal independent set has been central in the study of graph algorithms in both the parallel and the distributed models. The seminal works of Luby [106] and Alon, Babai, and Itai [10] provide $O(\log n)$-round parallel and distributed algorithms for constructing an MIS. The distributed complexity in the LOCAL model was first improved by Barenboim et al. [18] and subsequently by Ghaffari [65], which led to the current best round complexity of $O(\log \Delta) + 2^{O(\sqrt{\log\log n})}$. In the CONGESTED-CLIQUE model of distributed computing, Ghaffari [66] gave another algorithm which computes an MIS in $\tilde{O}(\sqrt{\log \Delta})$ rounds. A deterministic $O(\log n \log \Delta)$-round CONGESTED-CLIQUE algorithm was given by Censor-Hillel et al. [33].

It is also worth referring to the literature on one particular MIS algorithm, known as randomized greedy MIS, which is relevant to what we do for MIS. In this algorithm, we permute the vertices uniformly at random and then add them to the MIS greedily. Blelloch et al. [26] showed that one can implement this algorithm in $O(\log^2 n)$ parallel/distributed rounds, and recently Fischer and Noever [62] improved that to a tight bound of $\Theta(\log n)$. We will show an $O(\log\log \Delta)$-round simulation of the randomized greedy MIS algorithm in the MPC and CONGESTED-CLIQUE models.

    5.1.3 Our contributions

As our first result, in section 5.2 we present an algorithm for constructing an MIS.

Theorem 67. There is an algorithm that with high probability computes an MIS in $O(\log\log \Delta)$ rounds of the MPC model, with $\tilde{O}(n)$ bits of memory per machine. Moreover, the same algorithm can be adapted to compute an MIS in $O(\log\log \Delta)$ rounds of the CONGESTED-CLIQUE model.

As our second result, in section 5.3 we first design an algorithm that returns a $(2 + \varepsilon)$-approximate fractional maximum matching and a $(2 + \varepsilon)$-approximate integral minimum vertex cover in $O(\log\log n)$ MPC rounds. Then, in section 5.4, we show how to round this fractional matching to a $(2 + \varepsilon)$-approximate integral maximum matching. Although, compared to prior work, our result has better round complexity (compared to [40]) or provides a stronger approximation guarantee (compared to the result for vertex cover in [15]), we find that the main advantage of our algorithm is its simplicity. After applying random partitioning, the algorithm repeats only a couple of simple steps to perform all its decisions.

Theorem 68. There is an algorithm that with high probability computes a $(2 + \varepsilon)$-approximate integral maximum matching and a $(2 + \varepsilon)$-approximate integral minimum vertex cover in $O(\log\log n)$ rounds of the MPC model, with $\tilde{O}(n)$ bits of memory per machine.

As noted by Assadi et al. [15], applying the techniques of [108] to Theorem 68 leads to the following result.

Corollary 69. There exists an algorithm that with high probability constructs a $(1 + \varepsilon)$-approximate integral maximum matching in $O(\log\log n) \cdot (1/\varepsilon)^{O(1/\varepsilon)}$ MPC rounds, with $\tilde{O}(n)$ bits of memory per machine.

As noted by Czumaj et al. [40], the result of Lotker et al. [104] can be used to obtain the following result.

Corollary 70. There exists an algorithm that outputs a $(2 + \varepsilon)$-approximation to maximum weighted matching in $O(\log\log n \cdot (1/\varepsilon))$ MPC rounds and $\tilde{O}(n)$ bits of memory per machine.

For the sake of clarity, we present our algorithms for the case in which each machine has $\tilde{O}(n)$ bits of memory (or $\tilde{O}(n)$ words of memory). However, similarly to [40], our algorithm for matching and vertex cover can be adjusted to still run in $O(\log\log n)$ MPC rounds even when the memory per machine is $O(n/\mathrm{polylog}\, n)$.

    5.1.4 Our techniques

Maximal independent set: Our MPC algorithm for MIS is based on the randomized greedy MIS algorithm. We show how to efficiently implement this algorithm in only $O(\log\log n)$ MPC and CONGESTED-CLIQUE rounds.

Maximum matching and vertex cover: In section 5.3.1, we start from a sequential algorithm that outputs a $(2 + \varepsilon)$-approximate fractional maximum matching and a $(2 + \varepsilon)$-approximate integral minimum vertex cover. The algorithm maintains edge weights. Initially, every edge weight is set to $1/n$. Then, gradually, at each iteration, the edge weights are simultaneously increased by a multiplicative factor of $1/(1 - \varepsilon)$. Each vertex for which the sum of the weights of incident edges becomes $1 - 2\varepsilon$ or larger gets frozen, and its incident edges do not change their weights afterward. The vertices that got frozen in this process constitute the desired vertex cover. It is not hard to see that after $O(\log n/\varepsilon)$ iterations every edge is incident to at least one frozen vertex, and at this point the algorithm terminates.
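The sequential algorithm just described can be sketched directly. The following is a plain single-machine rendering under assumed data structures (an edge list and a dictionary of weights), not the MPC simulation developed later; the function name is hypothetical.

```python
def fractional_matching_vertex_cover(n, edges, eps):
    """Sequential weight-raising scheme from the text: all edge weights start
    at 1/n; each iteration multiplies the weights of still-active edges by
    1/(1 - eps); a vertex freezes once its incident weight reaches 1 - 2*eps;
    frozen vertices form the vertex cover, the weights a fractional matching."""
    y = {e: 1.0 / n for e in edges}
    frozen = set()
    active = set(edges)
    while active:
        for e in active:
            y[e] /= (1.0 - eps)
        for v in range(n):
            if v not in frozen:
                # total weight incident to v
                wv = sum(y[e] for e in edges if v in e)
                if wv >= 1 - 2 * eps:
                    frozen.add(v)      # v joins the vertex cover
        # edges touching a frozen vertex keep their weight forever
        active = {e for e in active if e[0] not in frozen and e[1] not in frozen}
    return y, frozen
```

Since an active edge's weight grows geometrically, every edge eventually saturates one endpoint, so the loop terminates; moreover, no vertex's incident weight exceeds 1, because a vertex freezes before its weight can grow past $(1 - 2\varepsilon)/(1 - \varepsilon) < 1$.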

In section 5.3.3, we show how to simulate this sequential algorithm in the MPC model, by on average simulating $O(\log n/\log\log n)$ iterations in $O(1)$ MPC rounds. As the first step, motivated by [40], we apply vertex-based sampling. Namely, the vertex set is randomly partitioned across the machines, and each machine considers only the graph induced on its local copy of the vertices. Then, during one MPC round, each machine simulates several iterations of the sequential algorithm on the subgraph it has. During this simulation, each machine estimates the weights of its local vertices in order to decide which vertices should be frozen. However, even if the estimates are sharp, a slight error could potentially cause many vertices to deviate largely from their true behavior. To alleviate this issue, instead of having a fixed threshold $1 - 2\varepsilon$, for each vertex and in every iteration we choose a random threshold from the interval $[1 - 4\varepsilon, 1 - 2\varepsilon]$. Then, a vertex gets frozen only if its estimated weight is above this randomly chosen threshold. Intuitively, this significantly reduces the chance of these decisions (on whether to freeze a vertex or not) deviating from the true ones.

As our final component, in section 5.4, we provide a rounding procedure that, for a given fractional matching, produces an integral one whose size is only a constant factor smaller than the size of the fractional matching. Furthermore, in that rounding method every vertex chooses edges based only on its neighborhood, i.e. makes local decisions, so it is easy to parallelize this procedure.

    5.1.5 Notation

For a graph $G = (V, E)$ and a set $V' \subseteq V$, $G[V']$ denotes the subgraph of $G$ induced on the set $V'$, i.e. $G[V'] = (V', E \cap (V' \times V'))$. We use $N(v)$ to refer to the neighborhood of $v$ in $G$. Throughout this chapter, we use $n := |V|$ to denote the number of vertices in the input graph.
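For concreteness, the two notational definitions above translate into code as follows; this is a minimal sketch assuming edges are stored as a set of pairs, with hypothetical helper names.

```python
def induced_subgraph(E, Vp):
    """G[V'] = (V', E ∩ (V' × V')): keep only edges with both endpoints in V'."""
    Vp = set(Vp)
    return Vp, {(u, v) for (u, v) in E if u in Vp and v in Vp}

def neighborhood(E, v):
    """N(v): all vertices sharing an edge with v."""
    return {u for e in E for u in e if v in e and u != v}
```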

    5.2 Maximal Independent Set

Consider a sequential randomized greedy algorithm that ranks/permutes the vertices $1$ to $n$ randomly and then greedily adds vertices to the MIS while walking through this permutation. In this section, we simulate this algorithm in $O(\log\log \Delta)$ rounds of the CONGESTED-CLIQUE model, thus proving the following result:

Theorem 67. There is an algorithm that with high probability computes an MIS in $O(\log\log \Delta)$ rounds of the MPC model, with $\tilde{O}(n)$ bits of memory per machine. Moreover, the same algorithm can be adapted to compute an MIS in $O(\log\log \Delta)$ rounds of the CONGESTED-CLIQUE model.

    5.2.1 Randomized Greedy Algorithm for MIS

Let us first consider a randomized variant of the sequential greedy MIS algorithm, described below. We remark that this algorithm has been studied before in the literature on parallel algorithms [62, 26].

Greedy Randomized Maximal Independent Set:

- Initially, choose a permutation $\pi : [n] \to [n]$ uniformly at random.

- Repeat until the next rank is at least $n/\log^{10} n$ and the maximum degree is at most $\log^{10} n$:

  (A) Mark the vertex $v$ which has the smallest rank among the remaining vertices according to $\pi$, and add $v$ to the MIS.

  (B) Remove all the neighbors of $v$.

- Run $O(\log\log \Delta)$ rounds of the Sparsified MIS Algorithm of [66] on the remaining graph. Remove from the graph the constructed MIS and its neighborhood.

- Deliver the remaining graph to a single machine and find its MIS.

- At the end, output the constructed MIS sets.
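The underlying sequential process, before any of the MPC-specific truncation and sparsification steps, is just the classic randomized greedy MIS. A minimal sketch, assuming an adjacency-set representation (the name `randomized_greedy_mis` is ours):

```python
import random

def randomized_greedy_mis(adj):
    """Plain sequential randomized greedy MIS: permute the vertices uniformly
    at random, then scan the permutation, adding each still-remaining vertex
    to the MIS and removing its neighbors. `adj` maps vertex -> set of neighbors."""
    order = list(adj)
    random.shuffle(order)             # the random ranks pi
    removed, mis = set(), []
    for v in order:
        if v not in removed:
            mis.append(v)             # v has the smallest remaining rank
            removed.add(v)
            removed |= adj[v]         # delete v's neighbors from the graph
    return mis
```

Whatever the permutation, the output is an independent set (no chosen vertex survives a chosen neighbor) and is maximal (every unchosen vertex was removed by a chosen neighbor).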

5.2.2 Simulation in $O(\log\log \Delta)$ rounds of MPC and CONGESTED-CLIQUE

Simulation in the MPC model: We now explain how to simulate the above algorithm in the MPC model with $O(n \log n)$ bits of memory per machine, and also in the CONGESTED-CLIQUE model. In each iteration, we take an induced subgraph of $G$ that is guaranteed to have $O(n)$ edges and simulate the above algorithm on that graph. We show that the total number of edges drops fast enough that $O(\log\log \Delta)$ rounds suffice. More concretely, we first consider the subgraph induced by the vertices with ranks $1$ to $n/\Delta^{\alpha}$, for $\alpha = 3/4$. This subgraph has $O(n)$ edges with high probability, so we can deliver it to one machine and have it simulate the algorithm up to this rank. Now, this machine sends the resulting MIS to all other machines. Then, each machine removes its vertices that are in the MIS or neighboring the MIS. In the second phase, we take the subgraph induced by the remaining vertices with ranks $n/\Delta^{\alpha}$ to $n/\Delta^{\alpha^2}$. Again, we can see that this subgraph has $O(n)$ edges (a proof is given below), so we can simulate it in $O(1)$ rounds. More generally, in the $i$-th iteration, we go up to rank $n/\Delta^{\alpha^i}$. Once the next rank becomes $n/\log^{10} n$, which as we show happens after $O(\log\log \Delta)$ rounds, the maximum degree of the graph is some value $\Delta' \le O(\log^{11} n)$ (see Lemma 71). Note that clearly also $\Delta' \le \Delta$. At that point, we apply the MIS Algorithm of [66] for sparse graphs to the remaining graph. This algorithm is applicable whenever the maximum degree is at most $2^{O(\sqrt{\log n})}$ (see Theorem 1.1 of [66]). After $O(\log\log \Delta')$ rounds, w.h.p., that algorithm finds an MIS which, after being removed along with its neighborhood, results in a graph with $O(n)$ edges. Now we deliver the whole remaining graph to one machine, where it is processed in a single MPC round.

We note that the Algorithm of [66] performs only simple local decisions with low communication, and hence every iteration of the algorithm can be implemented in $O(1)$ MPC rounds, with $\tilde{O}(n)$ memory per machine, using standard techniques.

Simulation in CONGESTED-CLIQUE: We now argue that each iteration can be implemented in $O(1)$ rounds of CONGESTED-CLIQUE. For that, in each iteration, we make all vertices with permutation rank in the selected range send their edges to the leader vertex. Here, the leader is an arbitrarily chosen vertex, e.g., the one with the minimum identifier. As we show below, the number of these edges per iteration is $O(n)$ with high probability, and thus we can deliver all the messages to the leader in $O(1)$ rounds using Lenzen's routing method [101]. Then, the leader can compute the MIS among the vertices with ranks in the selected range. It then reports the result to all the vertices in a single round, by telling each vertex whether it is in the computed independent set or not. A single round of communication, in which the vertices in the independent set report to all their neighbors, is then used to remove all the vertices that have a neighbor in the independent set (or are in the set). After these steps, the algorithm proceeds to the next iteration.

Regarding the round complexity of the algorithm once the rank becomes $n/\log^{10} n$: the work [66] already provides a way to solve MIS in $O(\log\log \Delta')$ CONGESTED-CLIQUE rounds for any $\Delta' = 2^{O(\sqrt{\log n})}$. Here, $\Delta'$ is the maximum degree of the graph remaining after processing the vertices up to rank $n/\log^{10} n$, and, as we show in Lemma 71, $\Delta' \le \mathrm{polylog}\, n \le 2^{O(\sqrt{\log n})}$. Hence, the overall round complexity is again $O(\log\log \Delta)$ rounds.

    5.2.3 Analysis

Since by the ith iteration the algorithm has processed the ranks up to n/Δ^{α^i}, the rank n/log^10 n is processed within O(log log Δ) iterations. In the proof of Theorem 67 presented below, we prove that with high probability the number of edges sent to one machine per phase is O(n). Before that, we present a lemma that will aid in bounding the degrees and the number of edges in our analysis.
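To make the rank schedule concrete, the following sketch (our illustration, not code from the thesis; the function name and the specific values of log₂ n and log₂ Δ are assumptions) counts the ranks r_i = n/Δ^{α^i}, with α = 3/4, until the rank reaches the final threshold n/log^10 n. Working with exponents avoids astronomically large numbers:

```python
import math

def iterations_to_final_rank(log2_n: float, log2_delta: float,
                             alpha: float = 0.75) -> int:
    """Count the ranks r_i = n / Delta**(alpha**i) until r_i reaches
    n / log2(n)**10.  In exponent space, the stopping condition
    r_i >= n / log2(n)**10 becomes alpha**i * log2(Delta) <= 10 * log2(log2(n))."""
    target = 10 * math.log2(log2_n)
    i = 0
    while (alpha ** i) * log2_delta > target:
        i += 1
    return i

# For log2(n) = 10000 and log2(Delta) = 9000 the count is small,
# consistent with the i* <= log_{4/3} log(Delta) bound derived below.
i_star = iterations_to_final_rank(10000, 9000)
assert i_star <= math.log(9000) / math.log(4 / 3) + 1
```

The doubly exponential decay of α^i is what yields the O(log log Δ) iteration count.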

Lemma 71. Suppose that we have simulated the algorithm up to rank r. Let G_r be the remaining graph. Then, the maximum degree in G_r is O(n log n/r) with high probability.

Proof. We first upper-bound the probability that G_r contains a vertex of degree at least d. Then, we conclude that the degree of every vertex in G_r is O(n log n/r) with high probability.

Consider a vertex whose degree is still d. When the sequential algorithm considers one more vertex, which amounts to choosing a random one among the remaining vertices, this vertex or one of its neighbors gets hit with probability at least d/n. If that happens, this vertex is removed. The probability that this does not happen throughout ranks 1 to r is at most (1 − d/n)^r ≤ exp(−rd/n). Now, the probability that a vertex in G_r has degree more than 20n log n/r is at most 1/n^5, which implies that the maximum degree of G_r is at most 20n log n/r with probability at least 1 − 1/n^4, by a union bound over all vertices.
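The degree bound in Lemma 71 is easy to check empirically. The sketch below (our illustration; the random graph model G(n, p), the parameters, and the fixed seed are our assumptions, not part of the thesis) runs the random-order greedy MIS for the first r ranks and verifies that the residual maximum degree stays below the 20n log n/r bound:

```python
import math
import random

def residual_max_degree(n: int, p: float, r: int, seed: int = 0) -> int:
    """Greedy MIS over a uniformly random vertex order on G(n, p):
    process ranks 1..r, and whenever the considered vertex is still
    alive, add it to the MIS and remove it with its whole neighborhood;
    return the maximum degree among the surviving vertices."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    order = list(range(n))
    rng.shuffle(order)
    removed = set()
    for v in order[:r]:
        if v in removed:
            continue
        removed.add(v)       # v joins the MIS ...
        removed |= adj[v]    # ... and its neighborhood is removed
    alive = [v for v in range(n) if v not in removed]
    return max((len(adj[v] - removed) for v in alive), default=0)

n, r = 400, 200
deg = residual_max_degree(n, 0.5, r)
assert deg <= 20 * n * math.log(n) / r  # Lemma 71's bound, amply satisfied
print(deg)
```

On a dense graph the residual graph empties out after only a few MIS vertices, so the bound holds with a large margin.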

    We are now ready to prove the main theorem of this section.

Proof of Theorem 67. We first argue about the MPC round complexity of the algorithm, and then show that it requires O(n) memory.

Round complexity: Recall that the algorithm considers ranks of the form r_i := n/Δ^{α^i}, until the rank becomes n/log^10 n or greater. When that occurs, it applies other algorithms for O(log log Δ) iterations, as described in section 5.2.2. Hence, the algorithm runs for at most i* + log log Δ iterations, where i* is the smallest integer such that the rank r_{i*} := n/Δ^{α^{i*}} ≥ n/log^10 n. A simple calculation gives i* ≤ log_{4/3} log Δ, for α = 3/4. Furthermore, every iteration can be implemented in O(1) rounds as discussed above.

Memory requirement: We first discuss the memory required to implement the process until the rank becomes n/log^10 n. By Lemma 71 we have that after the graph up to rank r_i is simulated, the maximum degree in the remaining graph is O(n log n/r_i) w.h.p. Observe that this also trivially holds in the first iteration, i.e., the initial graph has maximum degree O(n). Let G_i be the graph induced by the ranks between r_i and r_{i+1}. Then, a neighbor u of vertex v appears in G_i with probability (r_{i+1} − r_i)/(n − r_i) ≤ r_{i+1}/n. Hence, the expected degree of every vertex in this graph is at most

μ := Θ(n log n/r_i · r_{i+1}/n) = Θ(Δ^{α^i/4} log n).

Since μ ≥ log n, by a Chernoff bound we have that every vertex in G_i has degree O(μ) w.h.p. Now, since there are O(r_{i+1}) vertices in G_i, we have that G_i contains

O(r_{i+1} · Δ^{α^i/4} log n) = O(n Δ^{−α^i/2} log n)   (5.1)

many edges w.h.p., where we used that α = 3/4. Recall that the algorithm iterates over the ranks until the maximum degree becomes less than log^10 n. Also, O(n log n/r_i) upper-bounds the maximum degree (see Lemma 71). Hence, we have

Θ(n log n/r_i) ≥ log^10 n, and therefore Δ^{α^i} ≥ Ω(log^9 n).

Combining the last implication with (5.1) shows that G_i contains O(n) edges w.h.p.

After the rank becomes n/log^10 n or greater, we run the CONGESTED-CLIQUE algorithm of [66] for O(log log Δ) iterations. Since that algorithm performs only simple local decisions with low communication, every iteration of the algorithm can be implemented in O(1) MPC rounds, with O(n) memory per machine, by using standard techniques. Finally, the graph remaining after running O(log log Δ) iterations of that algorithm contains O(n) edges w.h.p. (see Lemma 2.11 of [66]). Hence, we deliver the remaining graph to one machine and construct its MIS. □

5.3 Matching and Vertex Cover, Simple Approximations

In this section, we describe a simple algorithm that leads to a fractional matching of weight within a (2 + ε)-factor of the (integral) maximum matching and, via the same algorithm, to a (2 + ε)-approximation of minimum vertex cover, for any small constant ε > 0. In the next section (section 5.4), we explain how to obtain an integral (2 + ε)-approximate maximum matching from the described fractional one. That result, along with the standard techniques outlined in section 5.1.3, provides a (1 + ε)-approximation of maximum matching.

In section 5.3.1, we first present the advertised algorithm, which runs in O(log n) rounds. Then, in section 5.3.2 and section 5.3.3, we explain how to simulate this algorithm in O(log log n) rounds of the MPC model. In section 5.3.4 we provide the analysis of this simulation.

5.3.1 Basic O(log n)-iteration Centralized Algorithm

We now provide a simple centralized algorithm for obtaining the described fractional matching and minimum vertex cover. We refer to this algorithm as CENTRAL.

CENTRAL: Centralized O(log n)-round Fractional Matching and Vertex Cover:

- Initially, for each edge e ∈ E, set x_e = 1/n.

- Then, until each edge is frozen, in iteration t:

  (A) Freeze each vertex v for which y_v := Σ_{e∋v} x_e ≥ 1 − 2ε, and freeze all its edges.

  (B) For each active edge, set x_e ← x_e/(1 − ε).

- At the end, once all edges are frozen, output the set of values x_e as a fractional matching and the set of frozen vertices as a vertex cover.
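The pseudocode above translates directly into a few lines. The sketch below is our own rendering (the toy 5-cycle instance and the value of ε are assumptions); it runs CENTRAL and checks that the frozen vertices cover every edge and that the output is a feasible fractional matching:

```python
def central(n, edges, eps=0.05):
    """CENTRAL: start every edge at x_e = 1/n; freeze a vertex (and its
    edges) once its fractional weight y_v reaches 1 - 2*eps; inflate the
    still-active edges by a factor 1/(1 - eps) each iteration."""
    x = {e: 1.0 / n for e in edges}
    active = set(edges)
    frozen_vertices = set()
    iterations = 0
    while active:
        y = {v: 0.0 for v in range(n)}
        for (u, v), w in x.items():   # y_v sums frozen and active edges
            y[u] += w
            y[v] += w
        for v in range(n):            # step (A)
            if y[v] >= 1 - 2 * eps and v not in frozen_vertices:
                frozen_vertices.add(v)
                active -= {e for e in active if v in e}
        for e in active:              # step (B)
            x[e] /= 1 - eps
        iterations += 1
    return x, frozen_vertices, iterations

# Toy instance: a 5-cycle.
n = 5
edges = [(i, (i + 1) % n) for i in range(n)]
x, cover, rounds = central(n, edges)
assert all(u in cover or v in cover for (u, v) in edges)  # vertex cover
y = {v: sum(w for e, w in x.items() if v in e) for v in range(n)}
assert all(yv <= 1 for yv in y.values())  # feasible fractional matching
```

Note that a vertex freezes while y_v < 1 − 2ε and one inflation multiplies y_v by at most 1/(1 − ε), so y_v ≤ (1 − 2ε)/(1 − ε) < 1 always holds, which is the feasibility fact used in the proof of Lemma 72 below.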

Lemma 72. For any constant ε such that 0 < ε < 1/10, the algorithm CENTRAL terminates after O(log n) iterations, at which point all edges are frozen. Moreover, we have two properties:

(A) The set of frozen vertices, i.e., those v for which y_v = Σ_{e∋v} x_e ≥ 1 − 2ε, is a vertex cover that has size within a (2 + 5ε) factor of the minimum vertex cover.

(B) Σ_{e∈E} x_e ≥ |M*|/(2 + 5ε), that is, the computed fractional matching has size within a (2 + 5ε)-factor of the maximum matching.

Proof. We first prove the claim about vertex cover, and then about maximum matching.

Vertex cover: Let C be the vertex cover obtained by the algorithm. Every vertex added to C has weight at least 1 − 2ε. Furthermore, an edge can be incident to at most 2 vertices of C. Let W_M be the weight of the fractional matching the algorithm constructs. Then, we have |C| ≤ 2W_M/(1 − 2ε) ≤ 2(1 + 5ε)W_M, for ε ≤ 1/10.

Note that the algorithm ensures that at every step y_v ≤ 1. Hence, from strong duality we have that the weight of a fractional minimum vertex cover is at least W_M. Therefore, the minimum (integral) vertex cover has size at least W_M as well. This now implies that C is a 2(1 + 5ε)-approximate minimum vertex cover.

Maximum matching: Let W*_M be the weight of a fractional maximum matching. Then, it holds that |M*| ≤ W*_M ≤ |C|. From our analysis above and the last chain of inequalities we have W_M ≥ |M*|/(2(1 + 5ε)). □

    5.3.2 An Attempt for Simulation in O(log log n) rounds of MPC

An Idealized MPC Simulation: Next, we explain an attempt toward simulating the algorithm CENTRAL in the MPC model. Once we discuss this, we will point out some shortcomings and then explain how we adjust the algorithm to address them.

The algorithm starts with every vertex and every edge being active. If not active, an edge/vertex is frozen. Throughout the algorithm, the minimum active fractional edge value increases, and consequently the degree of each vertex with respect to active edges gradually decreases. We break the simulation into phases, where the ith phase simulates enough of the algorithm that the minimum active fractional edge value becomes 1/Δ^{(0.9)^i}, which implies that the active degree is at most Δ^{(0.9)^i}. Hence, we finish within O(log log n) phases.

Remark: In our final implementation, the number of iterations one phase simulates is slightly different from the one presented here. However, that final implementation, which we precisely define in the sequel, follows the exact same behavior as presented here.

Let us focus on one phase. Suppose that G' is the remaining graph on the active edges, the minimum active fractional edge value is 1/d, and thus G' has degree at most d. In this phase, we simulate the algorithm until the minimum active fractional edge value reaches 1/d^{0.9}, which implies that the active degree is at most d^{0.9}.

We randomly partition the vertex set of G', which consists only of active edges, among m = √d machines; let G'_i be the graph given to machine i. In this way, each machine receives O(n) edges w.h.p. For the next (log_{1/(1−ε)} d)/10 rounds, machine i simulates the basic algorithm on G'_i. For that, in each round the machine that received a vertex v estimates y_v = Σ_{e∋v} x_e by y_v^{local}, defined as

y_v^{local} := m · Σ_{e∋v; e∈G'_i} x_e + Σ_{e∋v; e∈G∖G'} x_e.

That is, y_v^{local} is the summation of the values of the G'-edges incident on v whose other endpoint is in the same machine, multiplied by m (to normalize for the partitioning), plus the value of all edges remaining from G ∖ G', i.e., edges that were frozen before this phase. In each round and for every vertex v, if y_v^{local} ≥ 1 − 2ε, then the machine freezes v and the edges incident to v. After this step, for each active edge e ∈ G'_i the machine sets x_e ← x_e/(1 − ε). The phase ends after (log_{1/(1−ε)} d)/10 rounds. At the end, the round in which different vertices were frozen determines when the corresponding edges got frozen (if they did). So, it suffices to spread the information about the frozen vertices and the related timing to deduce the edge values of all edges. Since per iteration each active edge value increases by a factor of 1/(1 − ε), after (log_{1/(1−ε)} d)/10 rounds the minimum active edge value reaches 1/d^{0.9} and we are done with this phase.
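The local estimator is the one nonstandard ingredient of this phase. A minimal sketch of it (our illustration; the star graph, the weights, the machine count, and the fixed seed are assumed toy data) shows that y_v^{local} is an unbiased estimate of y_v when no edges are frozen yet:

```python
import random

def y_local(v, machine_of, x_active, x_frozen, m):
    """m * (weight of active edges at v whose other endpoint shares v's
    machine) + full weight of v's previously frozen edges."""
    same = sum(w for (a, b), w in x_active.items()
               if v in (a, b) and machine_of[a] == machine_of[b])
    old = sum(w for e, w in x_frozen.items() if v in e)
    return m * same + old

# Star centered at 0 with 1000 leaves, uniform active weight, no frozen edges.
rng = random.Random(1)
m = 10
x_active = {(0, u): 1e-3 for u in range(1, 1001)}
machine_of = {v: rng.randrange(m) for v in range(1001)}
y_true = sum(x_active.values())          # y_0 = 1.0
est = y_local(0, machine_of, x_active, {}, m)
assert abs(est - y_true) < 0.4           # concentrates around y_0
```

Each leaf lands on the center's machine with probability 1/m, so E[m · (local weight)] equals the true weight; the deviation observed here is the same sampling noise that the random-thresholding argument below has to absorb.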

The Issue with the Direct Simulation: Consider first the following wishful-thinking scenario. Assume for a moment that in every iteration it holds that |y_v − (1 − 2ε)| > |y_v − y_v^{local}|, that is, y_v and y_v^{local} are "on the same side" of the threshold. Then, the algorithm CENTRAL and the MPC simulation of it make the same decision on whether a vertex v gets frozen or not. Moreover, this happens in every iteration, as can be formalized by a simple induction. This in turn implies that the MPC algorithm performs the exact same computations as the CENTRAL algorithm, and thus it provides the same approximation as CENTRAL. However, in the general case, even if y_v and y_v^{local} are almost equal, e.g., |y_v − y_v^{local}| ≤ ε, it might happen that y_v ≥ 1 − 2ε and y_v^{local} < 1 − 2ε, resulting in the two algorithms making different decisions with respect to v. Furthermore, this situation could occur for many vertices simultaneously, and this deviation of the two algorithms might grow as we go through the rounds; these complicate the task of analyzing the behavior of the MPC algorithm.

Random Thresholding to the Rescue: Observe that if |y_v − y_v^{local}| is small, then there is only a "small range" of values of y_v around the threshold 1 − 2ε which could potentially lead to the two algorithms behaving differently with respect to v. Motivated by this observation, instead of having one fixed threshold throughout the whole algorithm, in each iteration t and for each vertex v the algorithm will uniformly at random choose a fresh threshold τ_{v,t} from the interval [1 − 4ε, 1 − 2ε]. We call this algorithm CENTRAL-RAND, and state it below. Then, if v is not frozen until the tth iteration, v gets frozen by CENTRAL-RAND if y_{v,t} ≥ τ_{v,t} (and similarly, v gets frozen by the MPC simulation if y_{v,t}^{local} ≥ τ_{v,t}). In that case, if |y_v − y_v^{local}| ≪ ε, then most of the time y_v would be far from the threshold, and the two algorithms would behave similarly. We make this intuition formal in the next section by Lemma 82.
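This intuition can be made concrete: with τ uniform on [1 − 4ε, 1 − 2ε], the two algorithms disagree on v exactly when τ falls between y_v and y_v^{local}, an event of probability |y_v − y_v^{local}|/(2ε). The sketch below (our illustration; the specific numbers are assumptions) computes this probability exactly and checks the σ/ε-style bound:

```python
def disagreement_prob(y, y_local, eps):
    """Probability over tau ~ Uniform[1-4*eps, 1-2*eps] that exactly one
    of y >= tau and y_local >= tau holds, i.e. that the centralized
    algorithm and the simulation decide differently for this vertex."""
    lo, hi = 1 - 4 * eps, 1 - 2 * eps
    a, b = sorted((y, y_local))
    overlap = max(0.0, min(b, hi) - max(a, lo))  # tau between y and y_local
    return overlap / (hi - lo)

eps, sigma = 0.05, 0.004
y, y_loc = 0.88, 0.88 + sigma            # |y - y_local| <= sigma
p = disagreement_prob(y, y_loc, eps)
assert p <= sigma / eps                  # the bound proved in Lemma 82
```

A fixed threshold would make the disagreement event depend adversarially on where y_v sits; randomizing the threshold spreads that event thinly over the whole interval.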

    5.3.3 Our Actual Simulation in O(log log n) rounds of MPC

We now present the modified CENTRAL-RAND algorithm with the random thresholding and then discuss how we simulate it in the MPC model.

CENTRAL-RAND: Centralized O(log n)-round Fractional Matching and Vertex Cover with Random Thresholding:

- Each vertex v chooses a list of thresholds {τ_{v,t}}_t such that: the thresholds are chosen independently; each threshold is chosen uniformly at random from [1 − 4ε, 1 − 2ε].

- Initially, for each edge e ∈ E, set x_e = 1/n.

- Then, until each edge is frozen, in iteration t:

  (A) Freeze each vertex v for which y_{v,t} := Σ_{e∋v} x_e ≥ τ_{v,t}, and freeze all its edges.

  (B) For each active edge, set x_e ← x_e/(1 − ε).

- At the end, once all edges are frozen, output the set of values x_e as a fractional matching and the set of frozen vertices as a vertex cover.

Our Actual MPC Simulation: We now provide an MPC simulation of CENTRAL-RAND, which we will refer to as MPC-SIMUL, and discuss it below.

Our algorithm begins by selecting a collection of random thresholds τ. In the actual implementation, since these thresholds are chosen independently and each from the same interval, threshold τ_{v,t} can be sampled when needed ("on the fly").

During the simulation, we maintain a vertex set V' ⊆ V that consists of the vertices that we consider for the rest of the simulation. The algorithm defines the initial weight of the edges to be w_0 = (1 − 2ε)/n. Also, it maintains a variable d representing an upper bound on the maximum degree in the remaining graph (in principle, the maximum degree can be smaller than d).

MPC-SIMUL is divided into phases. At the beginning of a phase, we consider the subgraph G' of G[V'] that consists only of the active edges. In Lemma 77 we prove that the maximum degree in G' is at most d. Then, the vertex set V' is distributed across m = √d machines. Each machine collects the induced graph of G' on the vertex set assigned to it. In Lemma 78 we prove that each of these induced graphs consists of O(n) edges. Also at the beginning of a phase, the algorithm defines y_v^{old} (see line (d)). This is the part of the vertex weight that remains the same throughout the execution of the phase. It corresponds to the sum of the weights of the edges incident to v that were frozen in prior phases.

Each phase executes the steps under line (e), which simulate I iterations of CENTRAL-RAND. During a phase, we maintain the iteration counter t. The value of t counts all the iterations since the beginning of the algorithm, and not only from the beginning of a phase. After this simulation is over, the weight x_e^{MPC} of each edge e is properly set/updated. For instance, if e was not assigned to any of the machines (i.e., its endpoints were assigned to distinct machines), then e was not changing during the simulation of CENTRAL-RAND in this phase, even if both of its endpoints were active. To account for that, at line (g) the value x_e^{MPC} is set to w_{t'}, where t' is the last iteration when both endpoints of e were active. To implement this step, each vertex also keeps a variable corresponding to the iteration when it was last active.

Every vertex v that has weight more than 1, i.e., y_v^{MPC} > 1, is removed from consideration along with its incident edges, i.e., removed from V' at line (i), but v is added to the vertex cover that is reported at the end of the algorithm. Note that after the removal of such v, the edges incident to it are not considered anymore while computing y_v^{MPC} or y_v^{local}. This step ensures that throughout the algorithm the fractional matching on G[V'] will be valid. But it also ensures that all the edges in G[V ∖ V'], in particular those incident to v, will be covered by the final vertex cover.

MPC-SIMUL: MPC Simulation of algorithm CENTRAL-RAND:

(1) Each vertex v chooses a list of thresholds {τ_{v,t}}_t such that: the thresholds are chosen independently; each threshold is chosen uniformly at random from [1 − 4ε, 1 − 2ε].

(2) Init: V' = V; for each edge e ∈ E, set x_e^{MPC} = w_0 = (1 − 2ε)/n; d = n; t = 0.

(3) While d > log^20 n:

  (a) Set: # machines m = √d; # iterations I = (log m)/(10 log 5).

  (b) Partition V' into m sets V_1, ..., V_m by assigning each vertex to a machine independently and uniformly at random.

  (c) Let G' be the graph on V' consisting only of the active edges of G[V'].

  (d) For each v ∈ V', define y_v^{old} := Σ_{e∋v; e∈G∖G'} x_e^{MPC}.

  (e) For each i ∈ {1, ..., m} in parallel, execute I iterations:

    (A) Freeze each v ∈ V_i for which y_v^{local} := m · Σ_{e∋v; e∈G'[V_i]} x_e^{MPC} + y_v^{old} ≥ τ_{v,t}, and freeze all its edges.

    (B) For each active edge of G'[V_i], set x_e^{MPC} ← x_e^{MPC}/(1 − ε).

    (C) Increment the total iteration count: t ← t + 1.

  (f) Update d ← d(1 − ε)^I.

  (g) For every edge e = {u, v}: set x_e^{MPC} = w_{t'}, where t' is the last iteration in which both u and v were active.

  (h) For each v ∈ V', let y_v^{MPC} := Σ_{e∋v} x_e^{MPC}.

  (i) For each v ∈ V' such that y_v^{MPC} > 1: remove v from V'.

  (j) For each v ∈ V' such that y_v^{MPC} ≥ 1 − 2ε: freeze v and freeze all its edges.

(4) Directly simulate log_{1/(1−ε)} log^20 n iterations of CENTRAL-RAND.

(5) Output the vector x^{MPC} as a fractional matching and the set of frozen vertices as a vertex cover.

If some vertex has weight between 1 − 2ε and 1, it has sufficiently large fractional weight, so we simply freeze it (line (j)) before the next phase.

Once the upper bound d becomes less than log^20 n, the algorithm exits the main while loop, and the rest of the iterations needed to simulate CENTRAL-RAND are executed one by one. During this part of the simulation, MPC-SIMUL and CENTRAL-RAND behave identically.

    5.3.4 Analysis

We prove that the set of frozen vertices forms a 2 + O(ε) approximation of the minimum vertex cover, and that the computed fractional matching is a 2 + O(ε) approximation of maximum matching.

Lemma 73. MPC-SIMUL with high probability outputs a (2 + 50ε)-approximate minimum vertex cover and a fractional matching which is a (2 + 50ε)-approximation of maximum matching. Moreover, there is an implementation of MPC-SIMUL that with high probability has O(log log n) MPC-round complexity and requires O(n) space per machine.

Furthermore, the algorithm outputs a fractional matching x and a vertex cover C such that the fractional weight of at least |C|/3 vertices of C is at least 1 − 5ε.

Remark: For technical reasons and for the sake of clarity of our exposition, in our analysis we assume that ε ≤ 1/50. (If the input ε > 1/50, we simply reduce its value and deliver an even better approximation than required.) Also, although ε is a constant, we assume that ε ≥ 1/log n.

Roadmap: We split the proof of Lemma 73 into three parts. We start by, in section 5.3.4, showing some properties of the edge weights and the maximum degree of vertices during the course of MPC-SIMUL. Then, in section 5.3.4 we prove that O(n) space per machine suffices for the execution of MPC-SIMUL, and that the algorithm can be executed in O(log log n) MPC rounds. Next, in section 5.3.4 we relate the vertex weights in MPC-SIMUL (i.e., the vectors y^{local} and y^{MPC}) to the corresponding weights in the algorithm CENTRAL-RAND (i.e., to the vector y). Namely, we trace |y_v^{local} − y_v| over the course of one phase, and show that for most of the vertices this difference remains small. We put those results together in section 5.3.4 and prove Lemma 73.

    Weight and degree properties

We now state several properties of edge weights that are easily derived from the algorithm, and provide an upper bound on the maximal active degree of any vertex. These properties will be used throughout our proofs in the coming sections.

Define w_t := w_0/(1 − ε)^t. Observe that at the tth iteration, the weight of all the active edges that are on some of the machines equals w_t. Furthermore, if for an edge e = {u, v} such that u and v are on different machines, vertices u and v are both active in the tth iteration, then after the phase ends the weight x_e^{MPC} will be set to at least w_t (see line (g)). We next state two observations.

Observation 74 (Degree/active-weight invariant). Consider an iteration t at which d is updated at line (f) of MPC-SIMUL. Then, just after the degree d is updated, it holds that d · w_t = 1 − 2ε.

Proof. At the beginning of the algorithm, it holds that w_0 · d = 1 − 2ε. Over a phase, the weights of the active edges increase by a factor of 1/(1 − ε)^I. On the other hand, the degree bound d decreases by the same factor 1/(1 − ε)^I. Hence, their product remains the same after every phase. □
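This invariant is easy to sanity-check numerically. The sketch below (our illustration; the parameters and, in particular, the log² n loop cutoff, chosen smaller than log^20 n so that the loop actually runs at this toy scale, are assumptions) replays the weight and degree updates of several phases and confirms that d · w stays pinned at 1 − 2ε:

```python
import math

eps = 0.1
n = 10 ** 6
d, w = float(n), (1 - 2 * eps) / n   # initial degree bound and edge weight
while d > math.log2(n) ** 2:         # assumed demo cutoff (not log^20 n)
    m = math.sqrt(d)
    I = math.log(m) / (10 * math.log(5))
    w /= (1 - eps) ** I              # line (B), applied I times in a phase
    d *= (1 - eps) ** I              # line (f)
    assert abs(d * w - (1 - 2 * eps)) < 1e-9  # Observation 74's invariant
```

The weight is multiplied and the degree bound divided by the identical factor per phase, so the product is preserved up to floating-point noise.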

Observation 75 (Maximum weight of an active edge). The weight of any active edge at the beginning of a phase is (1 − 2ε)/m², where m is the number of machines used in that phase. During that phase, the weight of any edge is at most 1/m^{1.8}.

Proof. Let w_{t*} be the weight of any active edge at the beginning of a phase. As defined at line (a), we have m² = d. From Observation 74 we hence conclude that w_{t*} = (1 − 2ε)/m².

Also at line (a), I is defined to be (log m)/(10 log 5) ≤ (log m)/5. On the other hand, over at most I iterations the weight of any active edge is increased by a factor of at most 1/(1 − ε) ≤ 2 per iteration at line (B). Hence

w_{t*+I} ≤ 2^I · (1 − 2ε)/m² ≤ (1 − 2ε) m^{0.2}/m² ≤ m^{−1.8}. □

    Memory requirement and round complexity

In this section, we first show that O(n) space per machine suffices to store the induced graphs G'[V_i] considered by MPC-SIMUL (see Lemma 78). After that, in Lemma 79, we upper-bound the number of phases of MPC-SIMUL. At the end of the section, we combine these together to prove the following lemma.

Lemma 76. There is an implementation of MPC-SIMUL that requires O(n) memory per machine and executes O(log log n) MPC rounds w.h.p.

We start by upper-bounding the number of active edges incident to a vertex of V'.

Lemma 77. Let V', G', and d be as defined in MPC-SIMUL at the beginning of the same phase. Then, the degree of every vertex in G'[V'] is at most d.

Proof. At the beginning of the algorithm, we have that d = n, and hence the statement holds for the very first phase.

Towards a contradiction, assume that there exists a phase and a vertex v such that the degree of v in G'[V'] is more than d. Let d_v denote its degree. Let t* be the first iteration of that phase. Notice that w_{t*} was the weight of the active edges at the end of the previous phase. Now by Observation 74 we have

w_{t*} · d_v > w_{t*} · d = 1 − 2ε.

But this now contradicts the step at line (j) of MPC-SIMUL, after which all the edges incident to v would have become frozen. □

Now we prove that every induced graph processed on a machine has O(n) edges.

Lemma 78 (Size of induced graphs). Let G' and V_i be as defined at line (c) and line (b) of MPC-SIMUL, respectively. Then, |E(G'[V_i])| ∈ O(n) w.h.p.

Proof. We split the proof into two parts. First, we argue that the size of V_i is O(n/m) w.h.p. Then, we argue that the degree of each vertex in G'[V_i] is O(d/m) w.h.p., after which the proof follows by a union bound.

Size of V_i: We have E|V_i| = |V'|/m ≤ n/m. Observe that m ≤ √n at any phase, and hence n/m ≥ √n. Now a Chernoff bound implies that

|V_i| ≤ |V'|/m + √((n/m) log n) ∈ O(n/m)   (5.2)

holds w.h.p.

Degree bound: Consider a vertex v ∈ V_i, and let d_v be its degree in V_i. Lemma 77 implies E d_v ≤ d/m. By definition it holds that d/m = m, and also m ≥ log^10 n. Now, again by applying a Chernoff bound, we conclude that

d_v ≤ d/m + √(m log n) ∈ O(m)   (5.3)

holds w.h.p.

Combining the bounds: Since (5.2) and (5.3) hold independently and w.h.p., by taking a union bound over all the vertices we conclude that the number of edges in G'[V_i] is bounded by O((n/m) · m) = O(n) w.h.p. This concludes the proof. □

Lemma 79 (Number of phases upper bound). MPC-SIMUL executes O(log log n) phases.

Proof. Let d_i be the value of d in MPC-SIMUL at the beginning of a phase, and d_{i+1} the value updated at line (f) after the phase ends. Let I = (log m)/(10 log 5) = (log d_i)/(20 log 5). Then, by the definition of the algorithm, we have the following relation:

d_{i+1} = d_i(1 − ε)^I = d_i · d_i^{−log(1/(1−ε))/(20 log 5)} = d_i^{1 − log(1/(1−ε))/(20 log 5)}.   (5.4)

For the sake of brevity, define γ := 1 − log(1/(1−ε))/(20 log 5). Observe that for a constant ε such that 0 < ε < 1/2, γ is a constant and γ < 1. Then, from (5.4) we have that MPC-SIMUL is executed for i* phases, where i* is the smallest integer such that d_{i*} < log^20 n. This implies

n^{γ^{i*}} < log^20 n.

Taking logs on both sides of the last inequality, we obtain

γ^{i*} log n < 20 log log n.

Now a simple calculation shows that i* ∈ O(log(log n/log log n)) ⊆ O(log log n). □
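A quick numerical check of this recursion (our illustration; the huge value of n, given via ln n = 200, is an assumption chosen so that log^20 n < n and the loop actually runs) confirms the doubly logarithmic phase count:

```python
import math

def num_phases(ln_n: float, eps: float) -> int:
    """Iterate d <- d*(1-eps)^I with I = (log m)/(10 log 5) and m = sqrt(d),
    starting from d = n, until d < (log n)^20; count the phases.
    We track ln(d) instead of d itself to avoid overflow."""
    ln_d = ln_n
    threshold = 20 * math.log(ln_n)      # ln(log^20 n), using natural logs
    phases = 0
    while ln_d > threshold:
        I = (ln_d / 2) / (10 * math.log(5))
        ln_d += I * math.log(1 - eps)    # ln(d) shrinks geometrically
        phases += 1
    return phases

# n = e^200: only a few dozen phases, consistent with O(log log n).
p = num_phases(200.0, 0.3)
assert 0 < p < 100
```

Each phase multiplies ln d by the constant γ < 1, so the phase count is O(log(log n / log log n)), matching the calculation above.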

Proof of Lemma 76. By Lemma 79, MPC-SIMUL executes O(log log n) phases. Also, for constant ε, line (4) requires O(log log n) iterations. Furthermore, by Lemma 78, each induced graph G'[V_i] that is processed on a single machine during a phase has O(n) size w.h.p.

There is an implementation of MPC-SIMUL such that, when every machine has space O(n), each phase and each of the operations the algorithm performs are executed in O(1) MPC rounds. For more details on such an implementation, we refer the reader to [40], section "MPC Implementation Details", and to [76]. □

    Properties of vertex- and edge-weights in MPC-SIMUL

In this section, we show that |y_v − y_v^{local}| remains small for most of the vertices (this claim is formalized in Lemma 86). Before we provide an outline of the analysis, we state some definitions and describe the notation we use.

Definition 80 (Bad and good vertex). We say that a vertex is bad if it gets frozen in CENTRAL-RAND and not in MPC-SIMUL (or the other way around). Once bad, the vertex remains bad throughout the whole phase. If a vertex is not bad, we say it is good.

Definition 81 (Local neighbor). If a neighbor u of vertex v is, in the given iteration of MPC-SIMUL, on the same machine as v, then we say that u is a local neighbor of v.

Notation: We use w_t to refer to the weight of an active edge in the tth iteration. Let N^{central}(v, t) (resp. N^{MPC}(v, t)) denote the active neighbors of v at the beginning of the tth iteration of CENTRAL-RAND (resp. MPC-SIMUL). Similarly, we use N^{local}(v, t) to denote the local neighbors of v in iteration t. If it is clear from the context which iteration we are referring to, we sometimes omit t from the notation. Throughout our proofs, we will be making claims of the form a = b ± c, which should be read as a ∈ [b − c, b + c].

Analysis Outline: Recall that y_{v,t}^{local} and y_{v,t} represent the fractional weight of vertex v in the tth iteration of MPC-SIMUL and CENTRAL-RAND, respectively. From the definitions, we have y_{v,t} = y_{v,t−1} + ε w_t |N^{central}(v, t)|, and similarly y_{v,t}^{local} = y_{v,t−1}^{local} + ε w_t m |N^{local}(v, t)|. To say that the algorithms stay close to each other, we upper-bound |y_{v,t} − y_{v,t}^{local}| inductively as a function of t. Suppose that we already have an upper bound on |y_{v,t−1} − y_{v,t−1}^{local}|; we focus on upper-bounding the difference between |N^{central}(v, t)| and m|N^{local}(v, t)|. There are two sources of difference between |N^{central}(v, t)| and m|N^{local}(v, t)|:

(1) Some of the neighbors of v might be bad. Notice that this also implies that in general the set N^{local}(v, t) might not even be a subset of N^{central}(v, t).

(2) Even in the very first iteration (or, more generally, even if there is no bad vertex in N^{local}(v, t)), the set N^{local}(v, t) is a random sample of N^{central}(v, t), and hence |N^{local}(v, t)| might deviate from its expectation |N^{central}(v, t)|/m.

Furthermore, in our analysis, we assume that at the beginning of each phase MPC-SIMUL and CENTRAL-RAND start from the same fractional matching. Namely, we compare MPC-SIMUL to the behavior of CENTRAL-RAND assuming that initially x equals x^{MPC}, for the value of x^{MPC} at the beginning of the given phase. Since we ensure that x^{MPC} is always a valid fractional matching at the beginning of a phase, CENTRAL-RAND in our approach will also maintain a valid fractional matching.

Also, we assume that the thresholds, i.e., τ_{v,t} for each v ∈ V and each iteration t, are the same for both MPC-SIMUL and CENTRAL-RAND. Note that the latter algorithm is only a hypothetical one, whose purpose is to compare our simulation to a process that constructs a fractional matching, so this assumption is made without loss of generality.

Analysis: The following claim is a direct consequence of choosing the thresholds randomly at each iteration.

Lemma 82. Consider the tth iteration of a phase. Let |y_{v,t} − y_{v,t}^{local}| ≤ σ for every vertex v that is active in both CENTRAL-RAND and MPC-SIMUL. Then, v becomes bad in the tth iteration with probability at most σ/ε and independently of the other vertices.

Proof. If |y_{v,t}^{local} − τ_{v,t}| > σ, then MPC-SIMUL and CENTRAL-RAND behave the same with respect to vertex v. Since τ_{v,t} is chosen uniformly at random within an interval of size 2ε, MPC-SIMUL and CENTRAL-RAND differ in iteration t with respect to v with probability at most 2σ/(2ε) = σ/ε. Furthermore, as τ_{v,t} is chosen independently of the other vertices, v becomes bad independently of the other vertices. □

There are two distinct steps where MPC-SIMUL directly or indirectly estimates y_v. The first one is computing y_v^{local}, which is used to deduce whether a vertex should be frozen or not. The second one corresponds to the actual weight that MPC-SIMUL assigns to the vertices. Namely, at the end of a phase, a weight is assigned to each edge at line (g): for an edge e = {u, v}, if u or v is frozen, then x_e^{MPC} is set to w_{t'}, where t' is the iteration when the first of the two vertices got frozen; otherwise, x_e^{MPC} = w_t for t being the most recent simulated iteration. Then, the weight of a vertex v, which we denote by y_v^{MPC}, is simply the sum of all x_e^{MPC} incident to v. This can be seen as an indirect estimate of y_v.

Our next goal is to understand how the estimate y_v^{local} and the simulated vertex weight y_v^{MPC} relate to y_v. To that end, we define a notion to capture the difference in how the weights y_v and y_v^{MPC} are composed.

Definition 83 (Weight-difference). We use diff(v, t) to denote the total weight of the edges that contributed to the weight y_{v,t} and not to y_v^{MPC}, and the other way around. Formally, let x_{e,t}^{MPC} be the updated weight of edge e in iteration t in MPC-SIMUL (updated in the sense given by line (g) of the algorithm). Let x_{e,t} be the weight of edge e in iteration t in CENTRAL-RAND. Then,

diff(v, t) := Σ_{e∋v} |x_{e,t}^{MPC} − x_{e,t}|.

Notice that |y_{v,t} − y_v^{MPC}| ≤ diff(v, t). In general, it might be the case that |y_{v,t} − y_v^{MPC}| < diff(v, t). For instance, consider two edges e_1 and e_2, both incident to v. Assume that in CENTRAL-RAND e_1 is active while e_2 is frozen. On the other hand, assume that in MPC-SIMUL it is the case that e_1 is frozen while e_2 is active. So, these two edges alone do not make any difference in the change of the weights y_{v,t} and y_v^{MPC}: their effect cancels out. However, their effect does not cancel out in the definition of diff(v, t).

As a first step, we show that y_v is close to y_v^{local} in the first iteration of a phase.

Lemma 84. Let iteration t* be the first iteration of some phase, and let v be a vertex that is still active at iteration t*. Then, w.h.p.,

|y_{v,t*} − y_v^{local}| ≤ m^{−0.2}.

Furthermore, diff(v, t*) = 0 with certainty.

Proof. To argue that diff(v, t*) = 0, it suffices to observe that, at the beginning of a phase, the edge weights in MPC-SIMUL and CENTRAL-RAND coincide.

We upper-bound |y_{v,t*} − y_v^{local}| as follows. At the beginning of the phase, no vertex is bad. Hence, the only difference between y_{v,t*} and y_v^{local} comes from the random partitioning of the vertices.

Observe that N^{local}(v, t*) is a random sample of N^{central}(v, t*), and hence μ := E[m · |N^{local}(v, t*)|] = |N^{central}(v, t*)|. We now consider two cases, based on μ.

Case p ≤ m^{1.6}: Then, from a Chernoff bound we have

P[ |m · |N_A^{local}(v, t*)| − p| ≥ m^{1.6} ] ≤ 2 exp(−m^{2.2}/(3p)) ≤ 2 exp(−m^{0.6}/3),   (5.5)

which is high probability, as m ≥ log^{10} n during every phase.

Case p > m^{1.6}: Now, again by a Chernoff bound, we have

P[ |m · |N_A^{local}(v, t*)| − p| ≥ m^{−0.2} p ] ≤ 2 exp(−m^{−1.4} p/3) ≤ 2 exp(−m^{0.2}/3).   (5.6)

This probability is again high, as m ≫ log n.

Combining the two cases: From (5.5) and (5.6) we derive that, w.h.p.,

| m · |N_A^{local}(v, t*)| − |N_A^{central}(v, t*)| | ≤ max{ m^{−0.2} |N_A^{central}(v, t*)|, m^{1.6} }.

Hence, w.h.p.,

|y_{v,t*} − y^{local}_{v,t*}| ≤ w_{t*} · max{ m^{−0.2} |N_A^{central}(v, t*)|, m^{1.6} }.

Now, from Theorem 74 and Theorem 75 we have that |N_A^{central}(v, t*)| · w_{t*} ≤ 1 and w_{t*} ≤ m^{−1.8}. This implies that w.h.p. it holds that

|y_{v,t*} − y^{local}_{v,t*}| ≤ m^{−0.2},

as desired. □
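The concentration behind this proof can be checked empirically with a standalone simulation of the random vertex partition; the parameters below are illustrative and sit in the p > m^{1.6} regime of (5.6) (this is a sanity-check simulation, not part of MPC-SIMUL):

```python
import random

def local_neighborhood_size(n_central, m, rng):
    """How many of the n_central neighbors of v land on v's machine when
    every vertex is assigned to one of m machines uniformly at random."""
    return sum(1 for _ in range(n_central) if rng.randrange(m) == 0)

rng = random.Random(0)
m, p = 30, 100_000          # p = |N_A^central(v, t*)|, here p > m^1.6
tolerance = m ** -0.2 * p   # the w.h.p. deviation bound of (5.6)
for _ in range(10):
    estimate = m * local_neighborhood_size(p, m, rng)
    assert abs(estimate - p) <= tolerance
```

The scaled local count m·|N^{local}| lands well within the m^{−0.2}·p window; the standard deviation of the binomial count is orders of magnitude below the tolerance, which is why the lemma can afford a union bound over all vertices.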

We now prove our main technical lemma, which quantifies the increase in the difference between y_v and its estimates y^{local}_v and y^{MPC}_v over the course of one phase.

Lemma 85 (Evolution of weight-estimates). Let v be an active vertex in iteration t − 1 in both CENTRAL-RAND and MPC-SIMUL. Then, if |y_{v,t−1} − y^{local}_{v,t−1}| ≤ σ and diff(v, t − 1) ≤ σ, it holds that

* |y_{v,t} − y^{local}_{v,t}| ≤ 4(σ + εm^{−0.2}), and

* diff(v, t) ≤ 4(σ + εm^{−0.2}).

Proof. We proceed by upper-bounding the effects of three different kinds of vertices on y_{v,t} − y^{local}_{v,t}: bad vertices from before the t-th iteration; vertices becoming bad in the t-th iteration; and the effect of the random partitioning. Observe that diff(v, t) is not directly affected by the random partitioning.

Old bad vertices: Let B_{v,t−1} be the set of bad vertices prior to the beginning of the t-th iteration that are also local neighbors of v. The set B_{v,t−1} accounts for the vertices in N^{local}(v, t − 1) \ N^{central}(v, t − 1), and for the local neighbors of v that are in N^{central}(v, t − 1) but not in N^{local}(v, t − 1). Since their weight was bounded by σ in the (t − 1)-th iteration, their weight is bounded by (1 + ε)σ in the t-th iteration. Hence, from iteration t − 1 to iteration t, the effect of the old bad vertices increased by εσ.

In a similar way, by considering the set of bad neighbors of v across all the machines, we get that from iteration t − 1 to iteration t the effect of the old bad vertices on diff(v, t) also increased by εσ.

New bad vertices: In addition to the bad vertices in B_{v,t−1}, there might be new bad vertices at the beginning of the t-th iteration: the vertices of N_A^{local}(v, t − 1) ∩ N_A^{central}(v, t − 1) that are not in N^{local}(v, t) ∩ N^{central}(v, t). To bound the weight of those bad vertices, we first upper-bound the cardinality of N_A^{local}(v, t − 1) ∩ N_A^{central}(v, t − 1). For the sake of brevity, define n^{local}_{v,t−1} := |N^{local}(v, t − 1) ∩ N_A^{central}(v, t − 1)|, where, as a reminder, the set N^{local}(v, t − 1) refers to the local neighbors (both frozen and active) of v. We trivially have

|N_A^{local}(v, t − 1) ∩ N_A^{central}(v, t − 1)| ≤ n^{local}_{v,t−1}.

Then, by Theorem 82, the number of new bad vertices is in expectation at most n^{local}_{v,t−1} · σ/ε. We now provide a sharp concentration around this expected value. To that end, we first derive an upper bound on n^{local}_{v,t−1} that holds w.h.p.

Observe that N_A^{central}(v, t − 1) is defined deterministically and independently of the MPC algorithm. Then, if |N_A^{central}(v, t − 1)| ≥ m^{1.6}, we have that w.h.p. n^{local}_{v,t−1} ≤ (1 + m^{−0.2}) |N_A^{central}(v, t − 1)|/m. Otherwise, if |N_A^{central}(v, t − 1)| < m^{1.6}, then w.h.p. n^{local}_{v,t−1} ≤ 2m^{0.6}.¹ Therefore, for γ := max{(1 + m^{−0.2}) |N_A^{central}(v, t − 1)|, 2m^{1.6}}, we have w.h.p.

n^{local}_{v,t−1} ≤ γ/m.

Applying similar reasoning to n^{local}_{v,t−1} σ/ε, i.e., considering the cases n^{local}_{v,t−1} σ/ε ≥ m^{0.6} and n^{local}_{v,t−1} σ/ε < m^{0.6}, we obtain that w.h.p. the number of new bad vertices is upper-bounded by max{(1 + m^{−0.2}) n^{local}_{v,t−1} σ/ε, 2m^{0.6}}. So, putting it all together, the weight coming from new bad vertices that affects the local estimate of y_{v,t} is at most

σ_2 := w_{t−1} · max{(1 + m^{−0.2}) γσ/ε, 2m^{1.6}}.

But now, using that w_{t−1} ≤ m^{−1.8} and also that |N_A^{central}(v, t − 1)| · w_{t−1} ≤ 1, as v is an active vertex in CENTRAL-RAND, we derive σ_2 ≤ 2(σ + εm^{−0.2}).

It remains to comment on the effect of new bad vertices on diff(v, t). Note that the expected number of new bad vertices affecting diff(v, t) is at most |N_A^{central}(v, t − 1)| σ/ε. So, applying the same arguments as above, the weight of new bad vertices affects diff(v, t) by at most σ_2 w.h.p.

Effect of random partitioning: Finally, we upper-bound the effect of the random partitioning on the estimate y^{local}_{v,t}. Similarly to our earlier arguments, we have that w.h.p. the number of vertices of N_A^{central}(v, t) that are local neighbors of v deviates from |N_A^{central}(v, t)|/m by at most η := max{m^{−0.2} |N_A^{central}(v, t)|, m^{1.6}}/m. The total weight of these vertices, scaled by m, is at most η · m · w_t ≤ εm^{−0.2}.

Final step: Putting it all together, if |y_{v,t−1} − y^{local}_{v,t−1}| ≤ σ and diff(v, t − 1) ≤ σ, then

|y_{v,t} − y^{local}_{v,t}| ≤ (1 + ε)σ + 2(σ + εm^{−0.2}) + εm^{−0.2} ≤ 4(σ + εm^{−0.2}),

and similarly

diff(v, t) ≤ (1 + ε)σ + 2(σ + εm^{−0.2}) ≤ 4(σ + εm^{−0.2}),

as desired. □

¹A more detailed proof of this type of claim is given in the proof of Lemma 84.

Lemma 86. Let v be an active vertex in iteration t − 1 in both MPC-SIMUL and CENTRAL-RAND. If a phase consists of at most I := (log m)/(10 log 5) iterations, then w.h.p. it holds that |y_{v,t} − y^{local}_{v,t}| ≤ m^{−0.1} and diff(v, t) ≤ m^{−0.1}.

Proof. Let iteration t* be the first iteration of the corresponding phase. Combining Lemma 84 and Lemma 85, for any t* ≤ t ≤ t* + I in which v is not bad, it holds that

|y_{v,t} − y^{local}_{v,t}| ≤ 5^I m^{−0.2} ≤ m^{−0.1},

and diff(v, t) ≤ 5^I m^{−0.2} ≤ m^{−0.1}. □
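The constant 5 here comes from iterating the recurrence of Lemma 85: starting from the bound σ = m^{−0.2} of Lemma 84, each application of σ ↦ 4(σ + εm^{−0.2}) multiplies the error by at most 5 once ε ≤ 1/4, and 5^I = m^{0.1} by the choice of I. A quick numeric check, with illustrative values of m and ε:

```python
import math

def iterate_error(m, eps, iterations):
    """Iterate the Lemma 85 recurrence sigma -> 4*(sigma + eps*m**-0.2),
    starting from the Lemma 84 bound sigma = m**-0.2."""
    sigma = m ** -0.2
    for _ in range(iterations):
        sigma = 4 * (sigma + eps * m ** -0.2)
    return sigma

m, eps = 10.0 ** 30, 0.1                     # illustrative values
I = int(math.log(m) / (10 * math.log(5)))    # phase length, I = 4 here
assert iterate_error(m, eps, I) <= 5 ** I * m ** -0.2  # growth factor <= 5
assert 5 ** I * m ** -0.2 <= m ** -0.1       # and 5^I * m^-0.2 <= m^-0.1
```

With these values the iterated error is about 2.9·10^{−4}, comfortably below the 5^I·m^{−0.2} = 6.25·10^{−4} and m^{−0.1} = 10^{−3} ceilings.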

    We are now ready to prove the main result of this section.

Proof of Theorem 73

We start the proof by recalling that Theorem 76 shows the desired bound on the space and round complexity of MPC-SIMUL. The rest of the proof is divided into two parts: first we prove the statement for vertex cover, and then for matching.

Throughout the proof we consider only those rounds of the MPC algorithm that execute at least two iterations. The rounds in which only one iteration is executed coincide with the ideal algorithm, and for them the claims in the rest of the proof follow directly. In this section, we assume that a maximum matching and a minimum vertex cover have size at least log^{10} n. In Section 5.3.4 we show how to handle the case when the maximum matching has size less than log^{10} n.

Part I: Vertex Cover

Let C be the vertex cover constructed by MPC-SIMUL. First, C is indeed a vertex cover, as by the end of the algorithm every edge is incident to at least one frozen vertex, and every frozen vertex is included in C. This follows from the following facts: by the end of the algorithm the weight of active edges is at least 1 − 2ε; and the last iterations of MPC-SIMUL directly simulate CENTRAL-RAND. Since CENTRAL-RAND freezes any vertex (and its incident edges) having an incident edge of weight at least 1 − 2ε, MPC-SIMUL freezes such vertices as well.

Informally, our goal is to show that |C| is at most roughly twice W_M := Σ_{v ∈ V'} y^{MPC}_v, where V' is the set of vertices after removing those of weight more than 1. Our proof consists of two main parts. First, we consider the contribution to W_M of the vertices that remained active in CENTRAL-RAND for at least as many iterations as in MPC-SIMUL (at first, we ignore the other vertices). After that, we take into account the remaining vertices, and at the same time account for the vertices having weight more than 1.

CENTRAL-RAND freezing last: Let t be the last iteration of a phase. We first consider only those vertices added to C which remained active in CENTRAL-RAND for at least as many iterations as in MPC-SIMUL, and claim that for every such vertex v it holds that y^{MPC}_v ≥ 1 − 5ε. In this analysis, we ignore that some vertices u with y^{MPC}_u > 1 got removed along with their incident edges. We analyze two types of vertices: good vertices, and bad vertices that got frozen by MPC-SIMUL first.

(1) If v is good, then it was active in MPC-SIMUL in the same iterations as in CENTRAL-RAND. Hence, by Lemma 86, we have that y^{MPC}_v ≥ y_v − m^{−0.1} ≥ 1 − 4ε − ε = 1 − 5ε.

(2) Assume that v is bad, but got frozen by MPC-SIMUL first. Let t' be the iteration in which v got frozen by MPC-SIMUL. This directly implies that y^{MPC}_{v,t'} ≥ 1 − 4ε. Since v was active in both algorithms in iteration t' − 1, by Lemma 86 we have y_{v,t'} ≥ 1 − 4ε − m^{−0.1}. But now, again by Lemma 86, we conclude that

y^{MPC}_v ≥ 1 − 4ε − m^{−0.1} − m^{−0.1} ≥ 1 − 5ε.

Informally (again), this analysis can be stated as follows: for every vertex of the two considered types which is added to C, and while disregarding the vertices whose incident edges got removed, there is at least (1 − 5ε)/2 weight in W_M. The weight is scaled by 2, as every edge is incident to at most 2 vertices of C.

MPC-SIMUL freezing last: We now consider the vertices that got frozen by MPC-SIMUL in a later iteration than by CENTRAL-RAND. We call such vertices late-bad, and use n_{late} to denote their number. Let C̃ denote the vertex cover constructed by CENTRAL-RAND. Observe that the late-bad vertices are a subset of C̃: if a vertex is not active in CENTRAL-RAND anymore, it has been frozen and added to C̃. Late-bad vertices have another important property: every vertex v such that y^{MPC}_v > 1 is late-bad, as we argue in the sequel. In our next step, we upper-bound n_{late} by the number of vertices of C̃ that are bad.

Let C̃_t denote the vertices that join the vertex cover C̃ in the t-th iteration of CENTRAL-RAND. From Theorem 82 and Lemma 86, a vertex is bad with probability at most m^{−0.1}/ε. Hence, the expected number of bad vertices in the t-th iteration is at most m^{−0.1} |C̃_t|/ε. Notice that C̃ is a deterministic set, defined independently of MPC-SIMUL. Furthermore, in the t-th iteration, every vertex of C̃_t becomes bad independently of the other vertices. So, the number of bad, and also heavy-bad, vertices throughout all the phases is with high probability upper-bounded by O(max{log² n, m^{−0.1} |C̃|/ε}). Recall that we assume m^{−0.1} ≤ ε², and also that a minimum vertex cover of the graph has size at least log^{10} n. This now implies that at least

(1 − m^{−0.1}/ε) |C̃| ≥ (1 − ε) |C̃|

vertices of C̃ are not bad, and hence

n_{late} ≤ ε |C̃| ≤ (ε/(1 − ε)) |C|,

where the last inequality uses that every vertex of C̃ that is not bad is frozen by MPC-SIMUL in the same iteration and hence also belongs to C, so that |C| ≥ (1 − ε) |C̃|.

Vertices v such that y^{MPC}_v > 1: We say that a vertex v is heavy-bad if y^{MPC}_v > 1. The analysis we performed above relating |C| and W_M does not take heavy-bad vertices into account. Recall that heavy-bad vertices are removed from the graph along with their incident edges, which we did not account for while lower-bounding y^{MPC}_u for u ∈ C. Next, we discuss how much heavy-bad vertices affect y^{MPC}_u for any vertex u ∈ V' ∩ C.

Observe that v is heavy-bad only if: v joins the vertex cover of CENTRAL-RAND in some iteration t; v was active in the (t − 1)-th iteration of MPC-SIMUL; and v remained active (in MPC-SIMUL) throughout the t-th iteration. Hence, every heavy-bad vertex is also late-bad (but there can be a late-bad vertex that is not heavy-bad).

Let v be late-bad. Observe that by the time y^{MPC}_v > 1 holds, vertex v is already bad w.h.p.: as long as v is not bad, from Lemma 86 we have y^{MPC}_v ≤ y_v + m^{−0.1} ≤ 1 − ε + m^{−0.1} < 1. On the other hand, from the iteration in which v got frozen in CENTRAL-RAND along with its incident edges, the increase in y^{MPC}_u is accounted for by diff(·, ·), which we have already analyzed. So, to account for the removal of heavy-bad vertices and their incident edges, it suffices to upper-bound the total weight of heavy-bad vertices while they were still active in CENTRAL-RAND. That weight is trivially upper-bounded by n_{late}. Furthermore, the weight n_{late} takes into account those vertices that are late-bad but not heavy-bad, so we do not consider such vertices separately (as we did for the other kind of bad vertices and for the good ones).

Finalizing: Let C^{late} be the subset of C consisting of the late-bad vertices. Our analysis shows that

W_M + n_{late} ≥ ((1 − 5ε)/2) · (|C| − |C^{late}|).

For the sake of brevity, define a := (1 − 5ε)/2. Using that |C^{late}| ≤ n_{late} ≤ (ε/(1 − ε)) |C|, we get

W_M ≥ (a − (1 + a) · ε/(1 − ε)) |C|.   (5.8)

Next, observe that a < 1 and hence 1 + a < 2. Also, we assume ε ≤ 1/2, so ε/(1 − ε) ≤ 2ε. Then, (5.8) further implies W_M ≥ (a − 4ε) |C| = ((1 − 13ε)/2) |C|, and hence

|C| ≤ (2/(1 − 13ε)) · W_M ≤ 2(1 + 50ε) W_M.   (5.9)

Finally, from strong duality it follows that |C| ≤ 2(1 + 50ε) W*, where W* is the minimum fractional vertex cover weight. Since the minimum integral vertex cover has size at least W*, the statement for vertex cover follows.
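The last numeric step in (5.9), 2/(1 − 13ε) ≤ 2(1 + 50ε), is a direct calculation: (1 + 50ε)(1 − 13ε) = 1 + 37ε − 650ε² ≥ 1 whenever ε ≤ 37/650 ≈ 0.057, so in particular any ε ≤ 1/20 suffices. A quick check over that range:

```python
def lhs(eps):
    """2 / (1 - 13*eps): the bound on |C| / W_M derived from (5.8)."""
    return 2 / (1 - 13 * eps)

def rhs(eps):
    """2 * (1 + 50*eps): the bound stated in (5.9)."""
    return 2 * (1 + 50 * eps)

# (1 + 50e)(1 - 13e) = 1 + 37e - 650e^2 >= 1 iff e <= 37/650 ~ 0.0569.
for k in range(1, 501):
    eps = k / 10000.0            # eps ranges over (0, 0.05]
    assert lhs(eps) <= rhs(eps)
```

At ε = 0.05 the two sides are roughly 5.71 and 7, so there is visible slack; the constant 50 is generous rather than tight.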

Part II: Maximum Matching

Having established the upper bound (5.9) on |C|, the analysis of the weight of the fractional maximum matching our algorithm MPC-SIMUL constructs follows almost directly. First, recall that W_M denotes the weight of the fractional matching MPC-SIMUL constructs. Also recall that W_M does not include the vertices v that got removed due to having y^{MPC}_v > 1. Therefore, by the design of the algorithm, the vertex weights y^{MPC}_v satisfy the matching constraint, i.e., y^{MPC}_v ≤ 1. Furthermore, from (5.9) and from the fact that |C| ≥ W*, we have

W_M ≥ |C| / (2(1 + 50ε)) ≥ W* / (2(1 + 50ε)).

Now, by strong duality, W_M is a (2(1 + 50ε))-approximation of the maximum fractional matching.

    Finding small matchings and vertex covers

In the proof of Theorem 73 we made the assumption that the maximum matching size is at least log^{10} n. In this section we show how, when the maximum matching size is less than log^{10} n, to find a maximal matching and a 2-approximate minimum vertex cover in O(log log n) rounds when the memory per machine is Õ(n).

First, observe that if the size of a minimum vertex cover is O(log^{10} n), then the underlying graph has O(n log^{10} n) edges, as each vertex can cover at most n edges. If our graph has O(n log^{10} n) edges, we apply the result of [99] to find a maximal matching of the graph in O(log log n) MPC rounds. Namely, in the proof of Lemma 3.2 of [99] it is shown that their algorithm w.h.p. halves the number of edges in each MPC round. Hence, after O(log log n) rounds the algorithm will have produced some matching, and the induced graph on the unmatched vertices will have O(n) edges. After that, we gather all the remaining edges on one machine and find the rest of the matching there. The endpoints of this maximal matching give a 2-approximate vertex cover. We point out that it is crucial that their method outputs a maximal matching, as this makes it easy to turn it into a 2-approximate minimum vertex cover.
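The final step, taking the endpoints of a maximal matching as a 2-approximate vertex cover, is easy to sketch sequentially (a single-machine stand-in for the MPC procedure of [99], not that procedure itself):

```python
def greedy_maximal_matching(edges):
    """One pass over the edges, keeping every edge whose endpoints are
    both still unmatched; the result is a maximal matching."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching, matched

# Maximality means every edge touches a matched vertex, so the matched
# endpoints form a cover; any cover needs at least one endpoint per
# matching edge, so this cover has size at most 2 * OPT.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
matching, cover = greedy_maximal_matching(edges)
assert all(u in cover or v in cover for u, v in edges)  # valid cover
assert len(cover) == 2 * len(matching)                  # endpoints only
```

Note that the 2-approximation argument relies on maximality, which is exactly why the text stresses that the method of [99] outputs a maximal (not merely large) matching.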

    5.4 Integral Matching and Improved Approximation

    In this section we prove the following theorem.

Theorem 68. There is an algorithm that with high probability computes a (2 + ε)-approximate integral maximum matching and a (2 + ε)-approximate integral minimum vertex cover in O(log log n) rounds of the MPC model, with Õ(n) bits of memory per machine.

Before we provide a proof, recall that Theorem 73 shows how to construct a fractional matching of large weight. In the following lemma we show how to round that matching (i.e., to obtain an integral one) while still retaining a large fraction of the weight of the fractional matching. This lemma is the main ingredient of the proof of Theorem 68.

Lemma 87 (Randomized rounding). Let G = (V, E) be a graph, and let x : E → [0, 1] be a fractional matching of G, i.e., for each v ∈ V it holds that Σ_{e ∋ v} x_e ≤ 1. Let C ⊆ V be a set of vertices such that for each v ∈ C it holds that Σ_{e ∋ v} x_e ≥ 1 − β, for some constant β ≤ 1/2. Then, there exists an algorithm that with probability at least 1 − 2 exp(−|C|/5000) outputs a matching in G of size at least |C|/50.

In our proof of this lemma we use McDiarmid's inequality, which we review first.

Theorem 88 (McDiarmid's inequality). Suppose that X_1, …, X_k are independent random variables, and assume that f is a function that satisfies

|f(x_1, …, x_i, …, x_k) − f(x_1, …, x_i', …, x_k)| ≤ c, for all i and all x_1, …, x_k, x_i'.

(The inequality above states that if one coordinate of the function is changed, then the value of the function changes by at most c.) Then, for any δ > 0 it holds that

P[ |f(X_1, …, X_k) − E[f(X_1, …, X_k)]| ≥ δ ] ≤ 2 exp(−2δ²/(k c²)).
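Instantiating Theorem 88 with the parameters used in the proof below (k = |C| independent variables, c = 2, and δ = |C|/50) gives exactly the failure exponent |C|/5000 claimed in Lemma 87; the arithmetic:

```python
def mcdiarmid_exponent(k, c, delta):
    """The exponent 2*delta**2 / (k * c**2) in the McDiarmid bound
    2 * exp(-2*delta**2 / (k * c**2))."""
    return 2 * delta ** 2 / (k * c ** 2)

size_C = 10_000  # |C|, illustrative
assert mcdiarmid_exponent(k=size_C, c=2, delta=size_C / 50) == size_C / 5000
```

So a deviation of |C|/50 from the mean fails with probability at most 2·exp(−|C|/5000), which is the guarantee stated in Lemma 87.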

Proof of Lemma 87. Our goal is to apply Theorem 88 in order to prove this lemma. So, we will design a randomized process that corresponds to the setup of the theorem and at the same time rounds the fractional matching x.

Setup and the rounding algorithm: For every vertex v ∈ C we define a random variable X_v as follows. X_v takes values in the set N(v) ∪ {⋆}. So, X_v is either a neighbor of v or a special symbol ⋆. Intuitively, X_v will correspond to v (randomly) choosing one of its neighbors, and X_v = ⋆ means that v has not chosen any of its neighbors. The probability space for each X_v is defined as follows: for every u ∈ N(v), we define P[X_v = u] = x_{{u,v}}/10, and P[X_v = ⋆] = 1 − (Σ_{e ∋ v} x_e)/10. Observe that P[X_v = ⋆] ≥ 9/10. For any two vertices u, v ∈ C, the random variables X_u and X_v are chosen independently.

Now we define the function f. First, given a set of edges H, we say that an edge e ∈ H is good if H \ {e} does not contain an edge incident to e. For a set of variables {X_v}_{v ∈ C} we construct a set of edges H_X as follows: if X_v ≠ ⋆, we add the edge {v, X_v} to H_X; otherwise, X_v does not contribute to H_X. Let {v_1, …, v_{|C|}} be the vertices of C. We set f(X_{v_1}, …, X_{v_{|C|}}) to be the number of good edges in H_X.

The good edges obtained in this random process represent our rounded matching. Next, we lower-bound the size of the integral matching obtained in this way. To that end, we derive an upper bound on the constant c for f as defined in Theorem 88, and we lower-bound the expectation of f.

Upper bound on c: Fix a vertex v ∈ C. If X_v = ⋆, then X_v does not contribute any edge to H_X. If X_v changed to some neighbor of v, this would result in adding the edge e = {X_v, v} to H_X. But then, if there were good edges incident to v or X_v, they would not be good anymore. So, changing X_v from ⋆ to a neighbor of v could change f by at most 2. On the other hand, if there was no edge incident to {X_v, v} in H_X, then changing X_v in the described way would increase f by 1.

Assume now that X_v ≠ ⋆. Then, similarly to the analysis above, changing X_v = u to another neighbor u' of v could increase f by at most 2: if u initially had two incident edges while u' had none, then by changing X_v to u' there are two more good edges. In the opposite direction, the number of good edges could decrease by at most 2. Finally, changing X_v to ⋆ could increase f by at most 2 or decrease it by at most 1.

From this case analysis, we conclude that c = 2.

Lower-bounding the expectation of f: Consider an edge e = {u, v} incident to a vertex v ∈ C. We now analyze when {v, X_v} with X_v = u is a good edge. If X_u = ⋆, and for every neighbor w ∈ N(v) ∩ C we have X_w ≠ v, then the outcome X_v = u contributes 1 to f. First, P[X_u = ⋆] ≥ 9/10. On the other hand,

P[X_w ≠ v for all w ∈ N(v) ∩ C] = Π_{w ∈ N(v) ∩ C} (1 − x_{{v,w}}/10) ≥ exp(− Σ_{w ∈ N(v) ∩ C} (x_{{v,w}}/10 + x_{{v,w}}²/100)),   (5.10)

where we used the inequality −ln(1 − y) ≤ y + y², which holds for |y| ≤ 1/2. Now, using y² ≤ y for 0 ≤ y ≤ 1 and Σ_{e ∋ v} x_e ≤ 1, from (5.10) we further have

P[X_w ≠ v for all w ∈ N(v) ∩ C] ≥ exp(−1/10 − 1/100) ≥ 1 − 11/100 = 89/100,

where the last inequality follows from 1 − y ≤ exp(−y).

So, X_v = u contributes 1 to f with probability at least (x_{{v,u}}/10) · (9/10) · (89/100) ≥ 4x_{{v,u}}/50. Since for each vertex v ∈ C it holds that Σ_{e ∋ v} x_e ≥ 1 − β, and β ≤ 1/2, from linearity of expectation we get

E[f(X_{v_1}, …, X_{v_{|C|}})] ≥ 4|C|(1 − β)/50 ≥ |C|/25.   (5.11)

Applying Theorem 88: We are now ready to conclude the proof. Let δ = |C|/50. By applying Theorem 88 to the function f and the random variables we defined, using that c = 2 and the lower bound (5.11) on the expectation of f, we conclude that f(X_{v_1}, …, X_{v_{|C|}}) ≥ |C|/50 with probability at least 1 − 2 exp(−|C|/5000). □
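The rounding process in this proof can be implemented directly; the sketch below is a sequential simulation on an illustrative cycle instance, with `heavy` playing the role of C:

```python
import random

def round_fractional_matching(x, heavy, rng):
    """Round a fractional matching x: {frozenset({u, v}): weight},
    following the proof of Lemma 87: each heavy vertex picks a neighbor
    u with probability x_e/10 (or no one, the role of the symbol *),
    and an edge of H_X is kept iff it is good, i.e. no other chosen
    edge shares an endpoint with it."""
    neighbors = {}
    for edge in x:
        u, v = edge
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    chosen = set()
    for v in heavy:
        r, acc = rng.random(), 0.0
        for u in neighbors.get(v, []):
            acc += x[frozenset((u, v))] / 10
            if r < acc:
                chosen.add(frozenset((u, v)))
                break
    degree = {}
    for edge in chosen:
        for w in edge:
            degree[w] = degree.get(w, 0) + 1
    return [edge for edge in chosen if all(degree[w] == 1 for w in edge)]

# Illustrative input: an even cycle with x_e = 1/2 on every edge, so
# every vertex has fractional weight 1 and the heavy set is all of V.
n = 1000
x = {frozenset((i, (i + 1) % n)): 0.5 for i in range(n)}
matching = round_fractional_matching(x, heavy=range(n), rng=random.Random(0))
endpoints = [w for edge in matching for w in edge]
assert len(endpoints) == len(set(endpoints))  # the good edges are a matching
assert all(edge in x for edge in matching)    # ...and a subset of E
```

By construction the good edges are pairwise disjoint, so the output is always a valid matching; the lemma's contribution is the probabilistic guarantee that it is also large.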

    We are now ready to prove the main theorem.

Theorem 68 (restated). There is an algorithm that with high probability computes a (2 + ε)-approximate integral maximum matching and a (2 + ε)-approximate integral minimum vertex cover in O(log log n) rounds of the MPC model, with Õ(n) bits of memory per machine.

Proof. Invoking Theorem 73 with the approximation parameter set to ε/50, we obtain the desired approximation of the minimum vertex cover. To obtain a (2 + ε)-approximate (integral) maximum matching, we alternately apply the results of Theorem 73 and Lemma 87, as we describe in the sequel. We proceed as follows. First, we describe how to handle the case when the input graph has a small matching. Second, we define an algorithm that iteratively extracts a matching whose size is a constant fraction of the current maximum matching. Finally, we analyze the designed algorithm: the probability of success, and the number of iterations required to produce a (2 + ε)-approximate maximum matching.

Small matching size: We invoke two methods separately, each of them providing a matching, and we output the larger of the two as the final result. The first method is described in Section 5.3.4 and performs well when the matching size is O(log^{10} n). Hence, from now on we assume that the maximum matching has size at least log^{10} n.

Algorithm: Now, define an algorithm A that gets as input a graph G = (V, E) and consists of the following steps:

* Invoke MPC-SIMUL to obtain a fractional matching x.

* Apply the rounding method described by Lemma 87 to x. Let M be the produced integral matching.

* Update V by removing from it all the vertices matched in M.

Analysis of the algorithm: Consider one execution of A. Let x be the fractional matching returned by MPC-SIMUL for the approximation parameter set to ε/50, and let W(x) denote its weight. Let C be the vertex cover as defined in the statement of Theorem 73. By Theorem 73, and from the fact that W(x) ≤ |C|, there are at least W(x)/3 vertices that have fractional weight at least 1 − 5ε. Hence, as long as x has weight at least log⁹ n, the rounding method described by Lemma 87 w.h.p. produces an integral matching M of size at least W(x)/150.

Consider now multiple executions of A. Once W(x) < log⁹ n holds, we have already collected a large fraction of any maximal matching, i.e., a (1 − 1/log n)-fraction. On the other hand, as long as W(x) ≥ log⁹ n, algorithm A produces an integral matching of size at least a 1/150-fraction of the size of the current maximum matching. This discussion motivates our final algorithm, which is as follows: run A for log_{150/149}(1/ε) many iterations, and output the union of the integral matchings it produces. Our discussion implies that the final returned matching is a (2 + ε)-approximate maximum matching of the input graph. Furthermore, for constant ε, this algorithm can be implemented in O(log log n) MPC rounds. □

Bibliography

[1] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, S. Pan, and A. Suresh. Competitive classification and closeness testing. In COLT, 2012.

[2] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3591–3599, 2015.

[3] Jayadev Acharya, Clément L. Canonne, and Gautam Kamath. Adaptive estimation in weighted group testing. In IEEE International Symposium on Information Theory, ISIT 2015, Hong Kong, China, June 14-19, 2015, pages 2116–2120, 2015.

[4] Jayadev Acharya, Clément L. Canonne, and Gautam Kamath. A chasm between identity and equivalence testing with conditional queries. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, pages 449–466, 2015.

[5] Foto N. Afrati, Magdalena Balazinska, Anish Das Sarma, Bill Howe, Semih Salihoglu, and Jeffrey D. Ullman. Designing good algorithms for MapReduce and beyond. In ACM Symposium on Cloud Computing, SOCC '12, San Jose, CA, USA, October 14-17, 2012, page 26, 2012.

[6] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 12th International Workshop, APPROX 2009, and 13th International Workshop, RANDOM 2009, Berkeley, CA, USA, August 21-23, 2009. Proceedings, pages 15–28, 2009.

[7] Kook Jin Ahn and Sudipto Guha. Access to data and number of iterations: Dual primal algorithms for maximum matching under resource constraints. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2015, Portland, OR, USA, June 13-15, 2015, pages 202–211, 2015.

[8] Nir Ailon, Bernard Chazelle, Seshadhri Comandur, and Ding Liu. Property-preserving data reconstruction. Algorithmica, 51(2):160–182, April 2008.

[9] Maryam Aliakbarpour, Amartya Shankha Biswas, Themistoklis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Sublinear-time algorithms for counting star subgraphs with applications to join selectivity estimation. CoRR, abs/1601.04233, 2016.

[10] Noga Alon, László Babai, and Alon Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. Journal of Algorithms, 7(4):567–583, 1986.

[11] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for geometric graph problems. In Proceedings of the 46th ACM Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31-June 3, 2014, pages 574–583, 2014.

[12] Barry Arnold. Majorization and the Lorenz order: A brief introduction, volume 43. Springer Science & Business Media, 2012.

[13] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035, 2007.

[14] Sepehr Assadi. Simple round compression for parallel vertex cover. CoRR, abs/1709.04599, September 2017.

[15] Sepehr Assadi, MohammadHossein Bateni, Aaron Bernstein, Vahab S. Mirrokni, and Cliff Stein. Coresets meet EDCS: algorithms for matching and vertex cover on massive graphs. CoRR, abs/1711.03076, 2017.

[16] Sepehr Assadi and Sanjeev Khanna. Randomized composable coresets for matching and vertex cover. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 3–12, 2017.

[17] Z. Bar-Yossef. The Complexity of Massive Data Set Computations. PhD thesis, Berkeley, CA, USA, 2002.

[18] Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. The locality of distributed symmetry breaking. In Foundations of Computer Science (FOCS) 2012, pages 321–330. IEEE, 2012.

[19] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proc. 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451, 2001.

[20] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000.

[21] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. J. ACM, 60(1):4, 2013.

[22] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA, June 22-27, 2013, pages 273–284, 2013.

[23] Florent Becker, Antonio Fernández Anta, Ivan Rapaport, and Eric Rémila. Brief announcement: A hierarchy of congested clique models, from broadcast to unicast. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), PODC '15, pages 167–169. ACM, 2015.

[24] Andrew Berns, James Hegeman, and Sriram V. Pemmaraju. Super-fast distributed algorithms for metric facility location. In the Proc. of the Int'l Colloquium on Automata, Languages and Programming (ICALP), pages 428–439, 2012.

[25] Arnab Bhattacharyya, Elena Grigorescu, Madhav Jha, Kyomin Jung, Sofya Raskhodnikova, and David P. Woodruff. Lower bounds for local monotonicity reconstruction from transitive-closure spanners. SIAM Journal on Discrete Mathematics, 26(2):618–646, 2012.

[26] Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. Greedy sequential maximal independent set and matching are parallel on average. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 308–317. ACM, 2012.

[27] M. Blum, M. Luby, and R. Rubinfeld. Self-testing/correcting with applications to numerical problems. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, STOC '90, pages 73–83, New York, NY, USA, 1990. ACM.

[28] C. L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.

[29] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, pages 25:1–25:14, 2016.

[30] Clément Canonne, Dana Ron, and Rocco A. Servedio. Testing equivalence between distributions using conditional samples. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1174–1192. Society for Industrial and Applied Mathematics, 2014.

[31] Clément L. Canonne, Themis Gouleakis, and Ronitt Rubinfeld. Sampling correctors. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, Cambridge, MA, USA, January 14-16, 2016, pages 93–102, 2016.

[32] Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), pages 143–152. ACM, 2015.

[33] Keren Censor-Hillel, Merav Parter, and Gregory Schwartzman. Derandomizing local distributed algorithms under bandwidth restrictions. In 31st International Symposium on Distributed Computing, 2017.

[34] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah. On the power of conditional samples in distribution testing. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 561–580. ACM, 2013.

[35] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah. On the power of conditional samples in distribution testing. SIAM J. Comput., 45(4):1261–1296, 2016.

[36] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193–1203, 2014.

[37] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.

[38] Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM Journal on Computing, 34(6):1370–1379, 2005.

[39] Artur Czumaj, Funda Ergün, Lance Fortnow, Avner Magen, Ilan Newman, Ronitt Rubinfeld, and Christian Sohler. Approximating the weight of the Euclidean minimum spanning tree in sublinear time. SIAM J. Comput., 35(1):91–109, 2005.

[40] Artur Czumaj, Jakub Łącki, Aleksander Mądry, Slobodan Mitrović, Krzysztof Onak, and Piotr Sankowski. Round compression for parallel matching algorithms. STOC, 2018.

[41] Artur Czumaj and Christian Sohler. Property testing with geometric queries. In Algorithms - ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, pages 266–277, 2001.

[42] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC '04, pages 175–183, New York, NY, USA, 2004. ACM.

    [43] Artur Czumaj and Christian Sohler. Sublinear-time approximation algorithms for clustering via random sampling. Random Struct. Algorithms, 30(1-2):226-256, 2007.

    [44] Artur Czumaj and Christian Sohler. Sublinear-time algorithms. In Property Testing, pages 41-64. Springer, 2010.

    [45] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833-1852, 2013.

    [46] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI'04, pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.

    [47] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer, 2001.

    [48] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.

    [49] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685-694, 2016. Full version available at arXiv:1601.05557.

    [50] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183-1202, 2015.

    [51] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841-1854, 2015.

    [52] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, pages 41:1-41:14, 2018.

    [53] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 655-664, 2016.

    [54] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2017.

    [55] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2683-2702, 2018.

    [56] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 73-84. IEEE, 2017.

    [57] Anhai Doan, Raghu Ramakrishnan, and Alon Y. Halevy. Crowdsourcing systems on the world-wide web. Communications of the ACM, 54(4):86-96, 2011.

    [58] Danny Dolev, Christoph Lenzen, and Shir Peled. "Tri, tri again": Finding triangles and small subgraphs in a distributed setting. In Distributed Computing, pages 195-209. Springer, 2012.

    [59] Andrew Drucker, Fabian Kuhn, and Rotem Oshman. On the power of the congested clique model. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), pages 367-376. ACM, 2014.

    [60] Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Faster algorithms for testing under conditional sampling. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 607-636, 2015.

    [61] Eldar Fischer. The art of uninformed decisions: A primer to property testing. Current Trends in Theoretical Computer Science: The Challenge of the New Century, 1:229-264, 2004.

    [62] Manuela Fischer and Andreas Noever. Tight analysis of parallel randomized greedy MIS. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2152-2160. SIAM, 2018.

    [63] Dimitris Fotakis, Christos Tzamos, and Manolis Zampetakis. Mechanism design with selective verification. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 771-788. ACM, 2016.

    [64] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. Int. J. Comput. Geometry Appl., 18(1/2):3-28, 2008.

    [65] Mohsen Ghaffari. An improved distributed algorithm for maximal independent set. In Proc. of ACM-SIAM Symp. on Disc. Alg. (SODA), 2016.

    [66] Mohsen Ghaffari. Distributed MIS via all-to-all communication. In Proceedings of the ACM Symposium on Principles of Distributed Computing, pages 141-149. ACM, 2017.

    [67] Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrović, and Ronitt Rubinfeld. Improved massively parallel computation algorithms for MIS, matching, and vertex cover. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, pages 129-138. ACM, 2018.

    [68] Mohsen Ghaffari and Merav Parter. MST in log-star rounds of congested clique. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), 2016.

    [69] O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. ECCC, 23, 2016.

    [70] O. Goldreich. Commentary on two works related to testing uniformity of distributions, 2017.

    [71] O. Goldreich. Lecture Notes on Property Testing of Distributions. Available at http://www.wisdom.weizmann.ac.il/~oded/PDF/pt-dist.pdf, March, 2016.

    [72] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity, 7(20), 2000.

    [73] Oded Goldreich. Combinatorial property testing (a survey). Randomization Methods in Algorithm Design, 43:45-59, 1999.

    [74] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. In 37th Annual Symposium on Foundations of Computer Science, FOCS '96, Burlington, Vermont, USA, 14-16 October, 1996, pages 339-348, 1996.

    [75] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC '97, pages 406-415, New York, NY, USA, 1997. ACM.

    [76] Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the MapReduce framework. In International Symposium on Algorithms and Computation, pages 374-383. Springer, 2011.

    [77] R. William Gosper. Decision procedure for indefinite hypergeometric summation. Proceedings of the National Academy of Sciences, 75(1):40-42, 1978.

    [78] Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Certified computation from unreliable datasets. In Conference On Learning Theory, pages 3271-3294, 2018.

    [79] Themistoklis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Faster sublinear algorithms using conditional sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 1743-1757, 2017.

    [80] Frank R. Hampel, Peter J. Rousseeuw, Elvezio M. Ronchetti, and Werner A. Stahel. Robust statistics: the approach based on influence functions. Wiley, 1986.

    [81] James W. Hegeman, Gopal Pandurangan, Sriram V. Pemmaraju, Vivek B. Sardeshmukh, and Michele Scquizzato. Toward optimal bounds in the congested clique: Graph connectivity and MST. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), pages 91-100. ACM, 2015.

    [82] James W. Hegeman and Sriram V. Pemmaraju. Lessons from the congested clique applied to MapReduce. In the Proceedings of the International Colloquium on Structural Information and Communication Complexity, pages 149-164. Springer, 2014.

    [83] James W. Hegeman, Sriram V. Pemmaraju, and Vivek B. Sardeshmukh. Near-constant-time distributed algorithms on a congested clique. In Proc. of the Int'l Symp. on Dist. Comp. (DISC), pages 514-530. Springer, 2014.

    [84] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. A deterministic almost-tight distributed algorithm for approximating single-source shortest paths. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 489-498. ACM, 2016.

    [85] D. Huang and S. Meyn. Generalized error exponents for small sample universal hypothesis testing. IEEE Trans. Inf. Theor., 59(12):8157-8181, December 2013.

    [86] Peter J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248-1251. Springer, 2011.

    [87] Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, STOC '99, pages 428-434, New York, NY, USA, 1999. ACM.

    [88] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. SIGOPS Operating Systems Review, 41(3):59-72, March 2007.

    [89] Amos Israeli and Alon Itai. A fast and simple randomized parallel algorithm for maximal matching. Information Processing Letters, 22(2):77-80, 1986.

    [90] Amos Israeli and Yossi Shiloach. An improved parallel algorithm for maximal matching. Information Processing Letters, 22(2):57-60, 1986.

    [91] Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D^2-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22-46, 2014.

    [92] M. Jha and S. Raskhodnikova. Testing and reconstruction of Lipschitz functions with applications to data privacy. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 433-442, Oct 2011.

    [93] Tomasz Jurdzinski and Krzysztof Nowicki. MST in O(1) rounds of congested clique. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2620-2632. SIAM, 2018.

    [94] Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 938-948, 2010.

    [95] Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 205-214. ACM, 2011.

    [96] Albert Kim, Liqi Xu, Tarique Siddiqui, Silu Huang, Samuel Madden, and Aditya Parameswaran. Speedy browsing and sampling with needletail. CoRR, 2016.

    [97] Janne H. Korhonen. Deterministic MST sparsification in the congested clique. arXiv preprint arXiv:1605.02022, 2016.

    [98] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665-674. IEEE, 2016.

    [99] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2011, San Jose, CA, USA, June 4-6, 2011, pages 85-94, 2011.

    [100] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, 2005.

    [101] Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), pages 42-50, 2013.

    [102] Nathan Linial. Distributive graph algorithms: global solutions from local data. In Proc. of the Symp. on Found. of Comp. Sci. (FOCS), pages 331-335. IEEE, 1987.

    [103] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics. John Wiley & Sons, 2002. Second edition.

    [104] Zvi Lotker, Boaz Patt-Shamir, and Adi Rosén. Distributed approximate matching. SIAM Journal on Computing, 39(2):445-460, 2009.

    [105] Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in O(log log n) communication rounds. In the Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 94-100. ACM, 2003.

    [106] Michael Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036-1053, 1986.

    [107] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and its applications, volume 143. Springer, 1979.

    [108] Andrew McGregor. Finding graph matchings in data streams. In Approximation, Randomization and Combinatorial Optimization, Algorithms and Techniques, 8th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2005 and 9th International Workshop on Randomization and Computation, RANDOM 2005, Berkeley, CA, USA, August 22-24, 2005, Proceedings, pages 170-181, 2005.

    [109] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1-3):35-60, 2004.

    [110] Adam Meyerson, Liadan O'Callaghan, and Serge A. Plotkin. A k-median algorithm with running time independent of data size. Machine Learning, 56(1-3):61-87, 2004.

    [111] Nina Mishra, Daniel Oblinger, and Leonard Pitt. Sublinear time approximate clustering. In Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA, pages 439-447, 2001.

    [112] Danupon Nanongkai. Distributed approximation algorithms for weighted shortest paths. In Proc. of the Symp. on Theory of Comp. (STOC), 2014.

    [113] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289-337, 1933.

    [114] N. Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, STOC '90, pages 204-212, New York, NY, USA, 1990. ACM.

    [115] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750-4755, 2008.

    [116] Boaz Patt-Shamir and Marat Teplitsky. The round complexity of distributed sorting. In the Proc. of the Int'l Symp. on Princ. of Dist. Comp. (PODC), pages 249-256, 2011.

    [117] P. Paule and M. Schorn. A Mathematica version of Zeilberger's algorithm for proving binomial coefficient identities. Journal of Symbolic Computation, 20(5):673-698, 1995.

    [118] M. Petkovsek, H.S. Wilf, and D. Zeilberger. A = B (Online Edition). Ak Peters Series. Taylor & Francis, 1997.

    [119] Dana Ron. Property testing.

    [120] Dana Ron and Gilad Tsur. The power of an example: Hidden set size approximation using group queries and conditional sampling. TOCT, 8(4):15, 2016.

    [121] Donald B. Rubin. Multiple imputation for nonresponse in surveys. John Wiley & Sons, 1987.

    [122] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24-28, 2012.

    [123] R. Rubinfeld. Taming probability distributions over big domains. Talk given at STOC'14 Workshop on Efficient Distribution Estimation, 2014. Available at http://www.iliasdiakonikolas.org/stoc14-workshop/rubinfeld.pdf.

    [124] Ronitt Rubinfeld. Sublinear time algorithms. In International Congress of Mathematicians, volume 3, pages 1095-1110. Citeseer, 2006.

    [125] Ronitt Rubinfeld and Asaf Shapira. Sublinear time algorithms. SIAM Journal on Discrete Mathematics, 25(4):1562-1588, 2011.

    [126] Michael Saks and C. Seshadhri. Local monotonicity reconstruction. SIAM Journal on Computing, 39(7):2897-2926, 2010.

    [127] Joseph L. Schafer. Analysis of incomplete multivariate data. CRC press, 1997.

    [128] Jacob Steinhardt, Gregory Valiant, and Moses Charikar. Avoiding imposters and delinquents: Adversarial crowdsourcing and peer prediction. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4439-4447, 2016.

    [129] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.

    [130] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pages 685-694, 2011.

    [131] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.

    [132] Jeroen Vuurens, Arjen P. de Vries, and Carsten Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR 2011), pages 21-26, 2011.

    [133] Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, and Hari Simons. Towards building a high-quality workforce with mechanical turk. Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS), pages 1-5, 2010.

    [134] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

    [135] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, June 2016.

    [136] Y. Ying. McDiarmid's inequalities of Bernstein and Bennett forms. City Uni- versity of Hong Kong, 2004.

    [137] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'10, Boston, MA, USA, June 22, 2010.
