Copyright by Xue Chen 2018

The Dissertation Committee for Xue Chen certifies that this is the approved version of the following dissertation:

Using and Saving Randomness

Committee:

David Zuckerman, Supervisor

Dana Moshkovitz

Eric Price

Yuan Zhou

Using and Saving Randomness

by

Xue Chen

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

May 2018

Dedicated to my parents Daiguang and Dianhua.

Acknowledgments

First, I am grateful to my advisor, David Zuckerman, for his unwavering support and encouragement. David has been a constant resource for great advice and insightful feedback about my ideas. Beyond these and our many fruitful discussions, his wisdom, way of thinking, and optimism guided me through the last six years. I also thank him for arranging a visit to the Simons Institute in Berkeley, where I enriched my understanding and made new friends. I could not have hoped for a better advisor.

I would like to thank Eric Price. His viewpoints and intuitions opened a different gate for me, which was important for the line of research presented in this thesis. I benefited greatly from our long research meetings and discussions. Besides learning many technical ideas from him, I hope I have absorbed some of his passion for and attitude toward research.

Apart from David and Eric, I have had many mentors during my time as a student. I especially thank Yuan Zhou, who has been an amazing collaborator and friend since college. I thank Dana Moshkovitz for discussions at various points and for serving on my thesis committee. I thank Xin Li for many valuable discussions and for sharing many research directions. I thank Pinyan Lu for inviting me to China Theory Week and arranging a visit to Shanghai University of Finance and Economics.

I thank the other professors and graduate students at UT Austin. Their discussions and numerous talks gave me the chance to learn new results and ideas. I also want to thank my collaborators: Guangda Hu, Daniel Kane, Zhao Song, Xiaoming Sun, and Lei Wang.

Last, and most important of all, I would like to thank my parents for a lifetime of support.

Using and Saving Randomness

Xue Chen, Ph.D. The University of Texas at Austin, 2018

Supervisor: David Zuckerman

Randomness is ubiquitous and exceedingly useful in computer science. For example, in sparse recovery, randomized algorithms are more efficient and robust than their deterministic counterparts. At the same time, because random sources from the real world are often biased and defective with limited entropy, high-quality randomness is a precious resource. This motivates the study of pseudorandomness and randomness extraction. In this thesis, we explore the role of randomness in these areas. Our research contributions broadly fall into two categories: learning structured signals and constructing pseudorandom objects.

Learning a structured signal. One common task in audio signal processing is to compress an interval of observation by finding the dominant k frequencies in its Fourier transform. We study the problem of learning a Fourier-sparse signal from noisy samples, where [0,T] is the observation interval and the frequencies can be "off-grid". Previous methods for this problem required the gap between frequencies to be above 1/T, which is necessary to robustly identify individual frequencies. We show that this gap is not necessary to recover the signal as a whole: for arbitrary k-Fourier-sparse signals under ℓ_2-bounded noise, we provide a learning algorithm with a constant-factor growth of the noise and sample complexity polynomial in k and logarithmic in the bandwidth and signal-to-noise ratio.

In addition, we introduce a general method to avoid a condition number depending on the signal family F and the distribution D of measurements in the sample

complexity. In particular, for any linear family F with dimension d and any distribution D over the domain of F, we show that this method provides a robust learning algorithm with O(d log d) samples. Furthermore, we improve the sample complexity to O(d) via spectral sparsification (optimal up to a constant factor), which provides the best known result for a range of linear families such as low-degree multivariate polynomials. Next, we generalize this result to an active learning setting, where we get a large number of unlabeled points from an unknown distribution and choose a small subset to label. We design a learning algorithm optimizing both the number of unlabeled points and the number of labels.

Pseudorandomness. Next, we study hash families, which have simple forms in theory and efficient implementations in practice. The size of a hash family is crucial for many applications such as derandomization. In this thesis, we study upper bounds on the sizes of hash families fulfilling their applications in various problems. We first investigate the number of hash functions needed to constitute a randomness extractor, which is equivalent to the degree of the extractor. We present a general probabilistic method that reduces the degree of any given strong extractor to almost optimal, at least when outputting few bits. For various almost universal hash families, including Toeplitz matrices, Linear Congruential Hash, and Multiplicative Universal Hash, this approach significantly improves the upper bound on the degree of strong extractors in these hash families. Then we consider explicit hash families and multiple-choice schemes in the classical problem of placing balls into bins. We construct explicit hash families of almost-polynomial size that derandomize two classical multiple-choice schemes, matching the maximum loads of a perfectly random hash function.

Table of Contents

Acknowledgments

Abstract

List of Tables

List of Figures

List of Algorithms

Chapter 1. Introduction
1.1 Overview
1.1.1 Continuous Sparse Fourier Transforms
1.1.2 Condition-number Free Query and Active Learning
1.1.3 Existence of Extractors from Simple Hash Families
1.1.4 Hash functions for Multiple-choice Schemes
1.1.5 CSPs with a global cardinality constraint
1.2 Organization

Chapter 2. Preliminaries
2.1 Condition Numbers
2.2 Chernoff Bounds

Chapter 3. Condition Numbers of Continuous Sparse Fourier Transform
3.1 The Worst-case Condition Number of Fourier Sparse Signals
3.2 The Average Condition Number of Fourier Sparse Signals
3.3 Polynomials and Fourier Sparse Signals
3.3.1 A Reduction from Polynomials to Fourier Sparse Signals
3.3.2 Legendre Polynomials and a Lower Bound

Chapter 4. Learning Continuous Sparse Fourier Transforms
4.1 Gram Matrices of Complex Exponentials
4.1.1 The Determinant of Gram Matrices of Complex Exponentials
4.2 Shifting One Frequency

Chapter 5. Fourier-clustered Signal Recovery
5.1 Band-limit Signals to Polynomials
5.2 Robust Polynomial Interpolation

Chapter 6. Query and Active Learning of Linear Families
6.1 Condition Number of Linear Families
6.2 Recovery Guarantee for Well-Balanced Samples
6.2.1 Proof of Theorem 6.0.3
6.2.2 Proof of Corollary 6.2.2
6.3 Performance of i.i.d. Distributions
6.3.1 Proof of Lemma 6.3.2
6.4 A Linear-Sample Algorithm for Known D
6.4.1 Proof of Lemma 6.4.1
6.5 Active Learning
6.6 Lower Bounds

Chapter 7. Existence of Extractors in Simple Hash Families
7.1 Tools
7.2 Restricted Extractors
7.2.1 The Chaining Argument Fooling One Test
7.2.2 Larger Degree with High Confidence
7.3 Restricted Strong Extractors
7.3.1 The Chaining Argument of Strong Extractors

Chapter 8. Hash Functions for Multiple-Choice Schemes
8.1 Preliminaries
8.2 Witness Trees
8.3 Hash Functions

8.3.1 Proof Overview
8.4 The Uniform Greedy Scheme
8.5 The Always-Go-Left Scheme
8.6 Heavy Load

Chapter 9. Constraint Satisfaction Problems Above Average with Global Cardinality Constraints
9.1 Notation and Tools
9.1.1 Basics of Fourier Analysis of Boolean functions
9.1.2 Distributions conditioned on global cardinality constraints
9.1.3 Eigenspaces in the Johnson Schemes
9.2 Eigenspaces and Eigenvalues of E_{D_p}[f^2] and Var_{D_p}(f)
9.3 Parameterized algorithm for CSPs above average with the bisection constraint
9.3.1 Rounding
9.3.2 2 → 4 Hypercontractive inequality under distribution D
9.3.3 Proof of Theorem 9.3.1

9.4 2 → 4 Hypercontractive inequality under distribution D_p
9.4.1 Proof of Claim 9.4.4
9.5 Parameterized algorithm for CSPs above average with global cardinality constraints
9.5.1 Rounding
9.5.2 Proof of Theorem 9.5.1

Chapter 10. Integrality Gaps for Dispersers and Bipartite Expanders
10.1 Preliminaries
10.1.1 Pairwise Independent Subspaces
10.2 Integrality Gaps
10.2.1 Integrality Gaps for Dispersers
10.2.2 Integrality Gaps for Expanders
10.3 An Approximation Algorithm for Dispersers

Appendices
A Omitted Proofs in Chapter 3
B Omitted Proofs in Chapter 7
C Omitted Proofs in Chapter 8

Bibliography

List of Tables

6.1 Lower bounds and upper bounds in different models

List of Figures

8.1 A witness tree with distinct balls and a pruned witness tree with 3 collisions
8.2 An example of extracting C′ from C given two collisions

List of Algorithms

1 Recover k-sparse FT
2 RobustPolynomialLearningFixedInterval
3 SampleDF
4 A well-balanced sampling procedure based on Randomized BSS
5 Regression over an unknown distribution D
6 Construct M

Chapter 1

Introduction

In this thesis, we study the role of randomness in designing algorithms and constructing pseudorandom objects. Often, randomized algorithms are simpler and faster than their deterministic counterparts. We explore the advantage of randomized algorithms in providing more efficient sample complexity and more robust guarantees than deterministic algorithms in learning. On the other hand, both from a theoretical viewpoint and for practical applications, it is desirable to derandomize them and construct pseudorandom objects such as hash functions using as few random bits as possible. We investigate the following two basic questions in computer science:

1. A fundamental question in many fields is how to efficiently recover a signal from noisy observations. This problem takes many forms, depending on the measurement model, the signal structure, and the desired norms. For example, coding theory studies the recovery of discrete signals in the Hamming distance. In this thesis, we consider continuous signals that are approximately sparse in the Fourier domain or close to a linear family such as multivariate low-degree polynomials. This is a common setting in practice. For example, the compression of audio and radio signals exploits their sparsity in the Fourier domain. Often the sample complexity is of more interest than the running time, because it is costly to take measurements in practical applications. Also, a sample of labeled data requires expensive human/oracle annotation in active learning. This raises a natural question: how can we learn a structured signal from a few noisy

1 queries? In the first half of this thesis, we design several robust learning algorithms with efficient sample complexity for various families of signals.

2. While randomness is provably necessary for certain tasks in cryptography and distributed computing, the situation is unclear for randomized algorithms. In the second half, we consider an important pseudorandom object in derandomization: hash functions, which have wide applications in both theory and practice. For many derandomization problems, a key tool is a hash family of small size, such that a deterministic algorithm can enumerate all hash functions in the family. In this thesis, we study how to construct small hash families that fulfill their requirements in building randomness extractors and in the classical problem of placing balls into bins.

Next we discuss the problems studied in this thesis in detail and show how our methods apply to them. We remark that several techniques developed in this thesis are very general and apply to many other problems.

1.1 Overview

1.1.1 Continuous Sparse Fourier Transforms.

The Fourier transform is a ubiquitous computational tool in processing a variety of signals, including audio, image, and video. In many practical applications, the main reason for using the Fourier transform is that the transformed signal is sparse: it concentrates its energy on k frequencies for a small number k. The sparse representation in the Fourier basis exhibits structure that can be exploited to reduce the sample complexity and speed up the computation of the Fourier transform. For n-point discrete signals, this idea has led to a number of efficient algorithms for discrete sparse Fourier transforms, which achieve the optimal O(k log(n/k)) sample complexity [IK14] and O(k log n · log(n/k)) running time [HIKP12].

However, many signals, including audio and radio, originate in a continuous domain. Because the standard way to convert a continuous Fourier transform into a discrete one blows up the sparsity, it is desirable to solve the sparse Fourier transform directly in the continuous setting for efficiency gains. In this thesis, we study sparse Fourier transforms in the continuous setting.

Previous work. Let [0,T] be the observation interval of k-Fourier-sparse signals and F denote the "bandlimit", i.e., the bound on the frequency domain. This problem, recovering a signal with a sparse Fourier transform off the grid (frequencies that are not multiples of 1/T), has been a question of extensive study. All previous research starts by finding the frequencies of the signal and then recovering the magnitude of each frequency.

The first algorithm was by Prony in 1795, which works in the noiseless setting. Its refinements MUSIC [Sch81] and ESPRIT [RPK86] empirically work better than Prony's algorithm under noise. The matrix pencil method [BM86] computes the maximum likelihood signal under Gaussian noise and evenly spaced samples. Moitra [Moi15] showed that it has a poly(k) approximation factor if the frequency gap is at least 1/T.

However, the above results use F·T samples, which is analogous to n in the discrete setting. A variety of works [FL12, BCG+14, TBR15] have studied how to adapt sparse Fourier techniques from the discrete setting to get sublinear sample complexity; they all rely on the minimum separation among the frequencies being at least c/T for c ≥ 1, or even larger gaps c = Ω(k), and additional assumptions such as i.i.d. Gaussian noise. Price and Song [PS15] gave the first algorithm with an O(1) approximation factor for arbitrary noise, finding the frequencies when c ≳ log(1/δ), and the signal when c ≳ log(1/δ) + log^2 k. All of the above algorithms are designed to recover the frequencies and show this yields a good approximation to the overall signal. Such an approach necessitates c ≥ 1:

Moitra [Moi15] gave a lower bound, showing that any algorithm finding the frequencies with approximation factor 2^{o(k)} must require c ≥ 1.

Our contribution. In joint work with Kane, Price, and Song [CKPS16], we design the first sample-efficient algorithm that recovers signals with arbitrary k Fourier frequencies from noisy samples. In contrast to all previous approaches, which require a frequency gap of at least 1/T to recover every individual frequency, our algorithm demonstrates that the gap is not necessary to learn the signal: it outputs a sparse representation in the Fourier basis for arbitrary k-Fourier-sparse signals under ℓ_2-bounded noise. This, for the first time, provides a strong theoretical guarantee for learning k-Fourier-sparse signals with continuous frequencies.

An important ingredient in our work is the first bound on the condition number ‖f‖_∞^2 / ‖f‖_2^2 = O(k^4 log^3 k) for any k-Fourier-sparse signal f over the interval [0,T], where ‖f‖_2^2 denotes the average energy (1/T)·∫_0^T |f(t)|^2 dt. In this thesis, we present an improved condition number O(k^3 log^2 k). The condition number of k-sparse signals with 1/T-separated frequencies is O(k) by the Hilbert inequality. However, for signals with k arbitrarily close frequencies, this family is not very well-conditioned: its condition number is Ω(k^2), from degree k − 1 polynomials via a Taylor expansion.
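The Ω(k^2) behavior for polynomials can be checked numerically. The sketch below is illustrative only and is not taken from this thesis: it uses the standard extremal example ∑_{j&lt;k} (2j+1) P_j(x) of Legendre polynomials, which peaks at x = 1 with value k^2 while its average energy is also k^2, giving a condition number of roughly k^2.

```python
import numpy as np
from numpy.polynomial import legendre

def cond(coefs, grid=200001):
    # Estimate sup_x |p(x)|^2 / E_{x~U[-1,1]} |p(x)|^2 on a fine grid,
    # for p given by its Legendre coefficients.
    x = np.linspace(-1.0, 1.0, grid)
    p2 = legendre.legval(x, coefs) ** 2
    return p2.max() / p2.mean()

k = 10
# The combination sum_{j<k} (2j+1) P_j(x) attains p(1) = k^2 and has
# average energy sum_{j<k} (2j+1)^2 / (2j+1) = k^2, so cond ~ k^2.
c = np.array([2 * j + 1 for j in range(k)], dtype=float)
print(round(cond(c)))  # close to k^2 = 100
```

By contrast, a constant polynomial has condition number exactly 1, matching the intuition that the ratio measures how spiky a signal in the family can be.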

On the other hand, degree-k polynomials are a special case of k-Fourier-sparse signals in the limit of all frequencies close to 0, by a Taylor expansion. This is a regime with no frequency gap, so previous sparse Fourier results do not apply, but our result shows that poly(k) samples suffice. In this special case, we prove that O(k) samples are enough to learn any degree-k polynomial under noise, which is optimal up to a constant factor. Our strategy is to query points according to the Chebyshev distribution even though the distance is measured under the uniform distribution over [0,T]. A natural question is to generalize this idea to Fourier-sparse signals and avoid the condition number Ω(k^2). This motivates the follow-up work with Eric Price [CP17].
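The Chebyshev-biased query strategy can be sketched as follows (a hedged illustration with made-up parameters, not the thesis's actual algorithm): draw query points from the Chebyshev density on [0,T], which concentrates near the endpoints where polynomials can spike, then reweight the least-squares fit so the error is still minimized under the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def chebyshev_fit(f_noisy, degree, m, T=1.0):
    """Fit a polynomial of the given degree from m noisy queries placed
    according to the Chebyshev density on [0, T], reweighted so the fit
    minimizes error under the uniform distribution on [0, T]."""
    u = rng.uniform(0.0, np.pi, m)
    x = T * (1.0 + np.cos(u)) / 2.0                 # Chebyshev-distributed points
    dens = 1.0 / (np.pi * np.sqrt(x * (T - x)))     # query density at x
    w = np.sqrt((1.0 / T) / dens)                   # importance weights
    A = np.vander(x, degree + 1) * w[:, None]
    coef, *_ = np.linalg.lstsq(A, f_noisy(x) * w, rcond=None)
    return coef

true = np.array([2.0, -1.0, 0.5, 3.0])              # 2x^3 - x^2 + 0.5x + 3
noisy = lambda x: np.polyval(true, x) + 0.01 * rng.standard_normal(x.shape)
est = chebyshev_fit(noisy, 3, 200)
```

Here 200 queries and noise level 0.01 are arbitrary; the point is only that endpoint-biased sampling plus reweighting recovers the coefficients robustly.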

1.1.2 Condition-number Free Query and Active Learning.

In joint work with Price [CP17], we proposed a framework to avoid the condition number in the sample complexity of learning structured signals. Let F be a family of signals

being learned and D be the distribution under which we measure the ℓ_2 distance between signals. We show a strong lower bound on the number of random samples from D needed to recover a signal in F: the sample complexity depends on the condition number sup_{x ∈ supp(D)} sup_{f ∈ F} { |f(x)|^2 / E_{y∼D}[|f(y)|^2] }, which can be arbitrarily large under adversarial distributions.

We consider two alternative models of access to the signal to circumvent the lower bound. The first model allows us to query the signal wherever we want, as in the problem of sparse Fourier transforms. We show how to improve the condition number by biasing the choice of queries toward points of high variance. For a large class of families, including the family of k-Fourier-sparse signals and any linear family, this approach significantly reduces the number of queries in agnostic learning.

For continuous sparse Fourier transforms discussed in Section 1.1.1, we present an explicit distribution that scales down the condition number from O(k^3 log^2 k) to O(k log^2 k) through this approach. Because the condition number is at least k for any query strategy, our estimate O(k log^2 k) is almost optimal, up to log factors. Based on this, we show that O(k^4 log^3 k + k^2 log^2 k · log(FT)) samples are sufficient to recover any k-Fourier-sparse signal with "bandlimit" F on the interval [0,T], which significantly improves the sample complexity of the previous work [CKPS16].

For any linear family F of dimension d and any distribution D, we show that this approach scales down the condition number to d and provides an algorithm with O(d log d) queries to learn any signal in F under arbitrary noise. The log d factor in the query complexity is due to the fact that the algorithm uses only the distribution with the least condition number to generate all queries. Then, we improve this query complexity to O(d) by designing a sequence of distributions and generating one query from each distribution. On

the other hand, we prove a lower bound on the number of queries that matches the query complexity of our algorithm up to a constant factor. Hence this work completely characterizes query-access learning of signals from linear families.
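For a linear family given as a finite design matrix, the distribution with the least condition number is proportional to the leverage scores of the rows. The sketch below (illustrative parameters only, not the thesis's algorithm) samples O(d log d) rows by leverage score and solves a reweighted least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)

def leverage_scores(A):
    # l_i = a_i^T (A^T A)^{-1} a_i, computed stably via a thin QR
    Q, _ = np.linalg.qr(A)
    return np.sum(Q ** 2, axis=1)

def leverage_sample_fit(A, y, m):
    """Sample m rows with probability proportional to leverage score,
    reweight them, and solve the reduced least-squares problem."""
    l = leverage_scores(A)
    p = l / l.sum()
    idx = rng.choice(len(A), size=m, replace=True, p=p)
    w = 1.0 / np.sqrt(m * p[idx])                  # importance-sampling weights
    coef, *_ = np.linalg.lstsq(A[idx] * w[:, None], y[idx] * w, rcond=None)
    return coef

n, d = 5000, 8
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.01 * rng.standard_normal(n)
est = leverage_sample_fit(X, y, m=400)             # ~ d log d samples in theory
```

The leverage scores always sum to d, so sampling m ≈ d log d rows covers every direction of the family with high probability; removing the log d factor requires the more careful sequence of distributions described above.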

The second model of access is active learning, where the algorithm receives a set of unlabeled points from the unknown distribution D and chooses a subset of these points to obtain their labels from noisy observations. We design a learning algorithm that optimizes both the number of labeled and unlabeled examples required to learn a signal from any linear family F under any unknown distribution D.

Previous work. We note that the distribution with the least condition number for linear families described in our work [CP17] is similar to strategies proposed in importance sampling [Owe13], leverage score sampling [DMM08, MI10, Woo14], and graph sparsification [SS11]. One significant body of work on sampling regression problems uses leverage score sampling [DMM08, MI10, Mah11, Woo14]. For linear function classes, our sampling distribution with the least condition number can be seen as a continuous limit of the leverage score sampling distribution, so our analysis of it is similar to previous work. At least in the fixed design setting, it was previously known that O(d log d + d/ε) samples suffice [Mah11].

Multiple researchers [BDMI13, SWZ17] have studied switching from leverage score sampling (which is analogous to the near-linear spectral sparsification of [SS11]) to the linear-size spectral sparsification of [BSS12]. This improves the sample complexity to O(d/ε), but the algorithms need to know all the y_i as well as the x_i before deciding which x_i to sample; this makes them unsuitable for active/query learning settings. Because [BSS12] is deterministic, this limitation is necessary to ensure the adversary doesn't concentrate the noise on the points that will be sampled. Our result uses the alternative linear-size spectral sparsification of [LS15] to avoid this limitation: we only need the x_i to choose O(d/ε) points that will perform well (on average) regardless of where the adversary places the noise.

The above references all consider the fixed design setting, while [SM14] gives a method for active regression in the random design setting. Their result shows (roughly speaking) that O((d log d)^{5/4} + d/ε) samples of labeled data suffice for the desired guarantee. They do not give a bound on the number of unlabeled examples needed by the algorithm. (Nor do the previous references, but the question does not make sense in fixed-design settings.)

Less directly comparable to this work is [CKNS15], where the noise distribution (Y | x) is assumed to be known. This assumption, along with a number of assumptions on the structure of the distributions, allows for significantly stronger results than are possible in our agnostic setting. The optimal sampling distribution is quite different, because it depends on the distribution of the noise as well as the distribution of x.

1.1.3 Existence of Extractors from Simple Hash Families.

Next we consider hash families and their application to constructing another pseudorandom object: randomness extractors. In the real world, random sources are often biased and defective, and rarely satisfy the requirements of applications in algorithm design, distributed computing, and cryptography. A randomness extractor is an efficient algorithm that converts a "weak random source" into an almost uniform distribution. As is standard, we model a weak random source as a probability distribution with a lower bound on its min-entropy.

Definition 1.1.1. The min-entropy of a random variable X is H_∞(X) = min_{x ∈ supp(X)} log_2 (1 / Pr[X = x]).
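For a distribution given explicitly as a probability table, min-entropy is determined by the single heaviest outcome; a small sketch:

```python
import math

def min_entropy(dist):
    """H_inf(X) = min_x log2(1 / Pr[X = x]) = -log2(max_x Pr[X = x])."""
    return -math.log2(max(dist.values()))

uniform8 = {x: 1 / 8 for x in range(8)}        # flat source on 8 outcomes
biased = {0: 1 / 2, 1: 1 / 4, 2: 1 / 8, 3: 1 / 8}
print(min_entropy(uniform8))  # 3.0
print(min_entropy(biased))    # 1.0  (dominated by the heaviest outcome)
```

Note that min-entropy depends only on the maximum probability, unlike Shannon entropy, which averages over all outcomes.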

It is impossible to construct a deterministic randomness extractor for all sources of min-entropy k [SV86], even if k is as large as n − 1. Therefore a seeded extractor also takes as input an additional independent uniform random string, called a seed, to guarantee that the output is close to uniform [NZ96].

Definition 1.1.2. For any d ∈ N^+, let U_d denote the uniform distribution over {0,1}^d. For two random variables W and Z with the same support, let ‖W − Z‖ denote the statistical (variation) distance ‖W − Z‖ = max_{T ⊆ supp(W)} |Pr_{w∼W}[w ∈ T] − Pr_{z∼Z}[z ∈ T]|.

Ext : {0,1}^n × {0,1}^t → {0,1}^m is a (k, ε)-extractor if for any source X with min-entropy k and an independent uniform distribution Y on {0,1}^t, ‖Ext(X,Y) − U_m‖ ≤ ε. It is a strong (k, ε)-extractor if in addition it satisfies ‖(Ext(X,Y), Y) − (U_m, Y)‖ ≤ ε.
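For distributions on a finite support, the maximum over tests T is attained by T = {x : W(x) &gt; Z(x)}, so the statistical distance equals half the ℓ_1 distance between the probability tables; a small sketch:

```python
def statistical_distance(W, Z):
    """||W - Z|| = max_T |Pr[W in T] - Pr[Z in T]|; the maximizing test is
    T = {x : W(x) > Z(x)}, so this equals half the l1 distance."""
    support = set(W) | set(Z)
    return 0.5 * sum(abs(W.get(x, 0.0) - Z.get(x, 0.0)) for x in support)

U2 = {x: 0.25 for x in range(4)}               # uniform over {0,1}^2
V = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}      # a slightly biased distribution
print(statistical_distance(U2, V))  # 0.25
```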

We call 2^t the degree of Ext, because when Ext is viewed as a bipartite graph on {0,1}^n ∪ {0,1}^m, its left degree is 2^t. Often the degree 2^t is of more interest than the seed length t. For example, when we view an extractor as a hash family H = {Ext(·, y) | y ∈ {0,1}^t}, the degree 2^t corresponds to the size of H.

Minimizing the degree of an extractor is crucial for many applications such as constructing optimal samplers and simulating probabilistic algorithms with weak random sources. It is well known that the optimal degree of (k, ε)-extractors is Θ((n−k)/ε^2), where the upper bound is from the probabilistic method and the lower bound was shown by Radhakrishnan and Ta-Shma [RT00]. At the same time, explicit constructions [Zuc07] with an optimal degree, even for constant error, have a variety of applications in theoretical computer science, such as inapproximability [Zuc07] and constructing almost optimal Ramsey graphs [BDT17].

Most known extractors are sophisticated constructions based on error-correcting codes or expander graphs and are complicated to implement in practice. This raises natural questions: are there simple constructions from linear transformations, or even simpler ones from Toeplitz matrices? Are there extractors with good parameters and efficient implementations? In joint work with Zuckerman, we answer these questions in the context of the extractor degree. Our main result establishes the existence of strong extractors consisting of linear transformations and Toeplitz matrices, as well as extremely efficient strong extractors, with degree close to optimal, at least when outputting few bits.

Our contributions. In joint work with Zuckerman [CZ18], we present a probabilistic construction to improve the degree of any given extractor. We consider the following method

to improve the degree of any strong extractor while keeping almost the same min-entropy and error parameters.

Definition 1.1.3. Given an extractor Ext : {0,1}^n × {0,1}^t → {0,1}^m and a sequence of seeds (y_1, ..., y_D) where each y_i ∈ {0,1}^t, we define the restricted extractor Ext_{(y_1,...,y_D)} to be Ext restricted to the domain {0,1}^n × [D], where Ext_{(y_1,...,y_D)}(x, i) = Ext(x, y_i).

Our main result is that given any strong (k, ε)-extractor Ext, most restricted extractors of Ext with a quasi-linear degree Õ(n/ε^2) are strong (k, 3ε)-extractors for a constant number of output bits, regardless of the degree of Ext.

Theorem 1.1.4. There exists a universal constant C such that given any strong (k, ε)-extractor Ext : {0,1}^n × {0,1}^t → {0,1}^m, for D = C · (n·2^m/ε^2) · log(n·2^m/ε) random seeds y_1, ..., y_D ∈ {0,1}^t, Ext_{(y_1,...,y_D)} is a strong (k, 3ε)-extractor with probability 0.99.
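The restriction operation is easy to explore on toy parameters. The sketch below is illustrative only: it uses a 1-output-bit inner-product hash family (a strong extractor by the Leftover Hash Lemma), a single flat source, and a simplified bias proxy rather than the full strong-extractor error, and compares a random restriction to D = 32 seeds against the full family of 256 seeds.

```python
import random

random.seed(0)
n = 8                                   # source length; toy parameters
def ext(x, a):
    # inner-product hash family over GF(2): one output bit <x, a>
    return bin(x & a).count("1") % 2

# A flat source with min-entropy k = 4: uniform over 16 of the 256 strings.
source = random.sample(range(2 ** n), 16)

def avg_bias(seeds):
    # proxy for the strong-extractor error: average over seeds of
    # |Pr_{x ~ source}[ext(x, a) = 1] - 1/2|
    return sum(abs(sum(ext(x, a) for x in source) / len(source) - 0.5)
               for a in seeds) / len(seeds)

full = avg_bias(range(2 ** n))                           # degree 2^t = 256
restricted = avg_bias(random.sample(range(2 ** n), 32))  # degree D = 32
```

On typical runs the restricted family's average bias is close to that of the full family, which is the qualitative phenomenon the theorem makes precise.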

We state the corollaries of Theorem 1.1.4 for simple extractors based on linear transformations and Toeplitz matrices, and for hash families with efficient implementations, in Chapter 7.

On the other hand, the same statement of Theorem 1.1.4 holds for extractors. In this case, we observe that the dependency on 2^m in the degree D of restricted extractors is necessary to guarantee that the error is less than 1/2 on m output bits.

Proposition 1.1.5. There exists a (k, ε)-extractor Ext : {0,1}^n × {0,1}^t → {0,1}^m with k = 1 and ε = 0 such that any restricted extractor of Ext requires degree D ≥ 2^{m−1} to guarantee that its error is less than 1/2.

While the dependency on 2^m is necessary for the degree of restricted extractors, it may not be necessary for strong extractors.

Previous work. In a seminal work, Impagliazzo, Levin, and Luby [ILL89] proved the Leftover Hash Lemma, i.e., all functions from an almost universal hash family constitute a strong extractor. In particular, this implies that all linear transformations and all Toeplitz matrices constitute strong extractors, respectively.
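A Toeplitz hash family is attractive precisely because it is so simple: an m × n Toeplitz matrix over GF(2) is constant along each diagonal, so it is determined by n + m − 1 random bits rather than m·n. A minimal sketch (illustrative parameters):

```python
import random

random.seed(2)

def toeplitz_hash(x_bits, first_col, rest_row):
    """Apply an m x n Toeplitz matrix over GF(2) to an n-bit input.
    The matrix is described by its first column (m bits) plus the rest
    of its first row (n - 1 bits): n + m - 1 random bits in total."""
    m, n = len(first_col), len(rest_row) + 1
    assert len(x_bits) == n
    out = []
    for i in range(m):
        # entry (i, j) depends only on the diagonal index i - j
        row = [first_col[i - j] if j <= i else rest_row[j - i - 1]
               for j in range(n)]
        out.append(sum(r * b for r, b in zip(row, x_bits)) % 2)
    return out

n, m = 16, 4
seed_col = [random.randrange(2) for _ in range(m)]
seed_row = [random.randrange(2) for _ in range(n - 1)]
x = [random.randrange(2) for _ in range(n)]
y = toeplitz_hash(x, seed_col, seed_row)
```

Because the map is GF(2)-linear in its input for every fixed seed, Toeplitz hashing is also an example of the linear extractors discussed below.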

Most previous research has focused on linear extractors, whose extractor functions are linear in the random source for every fixed seed. Because of their simplicity and various applications, such as serving as building blocks of extractors for structured sources [Li16, CL16], there have been several constructions of linear extractors with small degree. The first nontrivial progress was due to Trevisan [Tre01], who constructed the first linear extractor with degree polynomial in n and 1/ε. Based on Trevisan's work, Shaltiel and Umans [SU05] built linear extractors with almost linear degree for constant error. Later on, Guruswami, Umans, and Vadhan [GUV09a] constructed almost optimal linear condensers and vertex-expansion expanders, which are variants of extractors and lead to an extractor with degree n · poly(k/ε). However, the GUV extractor is not linear. Moreover, it is still open whether the degree of linear extractors can match the degree Θ((n−k)/ε^2) of general extractors.

On the other hand, much less is known about extractors consisting of Toeplitz matrices. Prior to this work, even for m = 2 output bits, the best known upper bound on the degree of extractors consisting of Toeplitz matrices was exponential in n, by the Leftover Hash Lemma [ILL89] applied to all Toeplitz matrices.

At the same time, from a practical point of view, it is desirable to have an extractor with a small degree that runs fast and is easy to implement. In this work, we consider efficient extractors from almost universal hash families, which are easier to implement than the error-correcting codes and expander graphs used in most known constructions of extractors. In Chapter 7, we describe a few notable almost universal hash families with efficient implementations. Prior to this work, the best known upper bounds on the degree of extractors from almost universal hash families were from the Leftover Hash Lemma [ILL89], which are

exponential in the length n of the random source.

Other work on improving extractors focuses on other parameters, such as the error and the number of output bits. Raz et al. [RRV99] showed how to reduce the error and enlarge the number of output bits of any given extractor at the cost of a larger degree.

1.1.4 Hash functions for Multiple-choice Schemes.

We consider explicit hash families for the classical problem of placing balls into bins. The basic model hashes n balls into n bins independently and uniformly at random, which we call the 1-choice scheme. A well-known and useful fact about the 1-choice scheme is that with high probability, each bin contains at most O(log n / log log n) balls. By high probability, we mean probability 1 − n^{−c} for an arbitrary constant c.
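The 1-choice bound is easy to observe in simulation (a toy sketch; the choice n = 100,000 is arbitrary):

```python
import random

random.seed(3)

def one_choice_max_load(n):
    # throw n balls into n bins uniformly at random; return the maximum load
    loads = [0] * n
    for _ in range(n):
        loads[random.randrange(n)] += 1
    return max(loads)

# The maximum load concentrates around log n / log log n;
# for n = 100,000 a typical run gives a max load near 7 or 8.
print(one_choice_max_load(100_000))
```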

An alternative variant, which we call Uniform-Greedy, provides d ≥ 2 independent random choices for each ball and places the ball in the bin with the lowest load. In a seminal work, Azar et al. [ABKU99] showed that the Uniform-Greedy scheme with d independent random choices guarantees a maximum load of only log log n / log d + O(1) with high probability for n balls. Later, Vöcking [Voc03] introduced the Always-Go-Left scheme to further improve the maximum load to log log n / (d log φ_d) + O(1) for d choices, where φ_d > 1.61 is the constant satisfying φ_d^d = 1 + φ_d + ··· + φ_d^{d−1}. For convenience, we use d-choice schemes to denote the Uniform-Greedy and Always-Go-Left schemes with d ≥ 2 choices.
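Both multiple-choice schemes take only a few lines to simulate. The sketch below (illustrative parameters, not from this thesis) implements Uniform-Greedy and a grouped, left-biased tie-breaking variant in the spirit of Always-Go-Left:

```python
import random

random.seed(4)

def uniform_greedy_max_load(n, d=2):
    """Each ball picks d uniform bins and goes to the least loaded one."""
    loads = [0] * n
    for _ in range(n):
        choices = [random.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda i: loads[i])
        loads[best] += 1
    return max(loads)

def always_go_left_max_load(n, d=2):
    """Grouped variant in the spirit of Always-Go-Left: one choice from
    each of d consecutive groups of n/d bins, ties broken to the left."""
    g = n // d
    loads = [0] * (g * d)
    for _ in range(g * d):
        choices = [j * g + random.randrange(g) for j in range(d)]
        best = min(choices, key=lambda i: (loads[i], i))
        loads[best] += 1
    return max(loads)

n = 100_000
# With d = 2 both schemes give a max load of roughly log log n + O(1):
# typically around 4 here, versus 7-8 for the 1-choice scheme at this n.
print(uniform_greedy_max_load(n), always_go_left_max_load(n))
```

The exponential improvement from O(log n / log log n) to O(log log n) is visible already at moderate n, which is what makes the d-choice schemes so useful in practice.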

Traditional analyses of load balancing assume a perfectly random hash function. A large body of research is dedicated to removing this assumption by designing explicit hash families using fewer random bits. In the 1-choice scheme, it is well known that O(log n / log log n)-wise independent functions guarantee a maximum load of O(log n / log log n) with high probability, which reduces the number of random bits to O(log^2 n / log log n). Celis et al. [CRSW13] designed a hash family with a description of O(log n log log n) random bits that achieves the same maximum load of O(log n / log log n) as a perfectly random hash function.

In this thesis, we are interested in explicit constructions of hash families that achieve the same maximum loads as a perfectly random hash function in the d-choice schemes. More precisely, we study how to derandomize the perfectly random hash function in the Uniform-Greedy and Always-Go-Left schemes. For these two schemes, O(log n)-wise independent hash functions achieve the same maximum loads by Vöcking's argument [Voc03], which provides a hash family with Θ(log² n) random bits. Recently, Reingold et al. [RRW14] showed that the hash family in [CRSW13] guarantees a maximum load of O(log log n) in the Uniform-Greedy scheme with O(log n log log n) random bits.

Our results. We construct a hash family with O(log n log log n) random bits based on the previous work of Celis et al. [CRSW13] and show the following results.

1. This hash family has a maximum load of log log n/log d + O(1) in the Uniform-Greedy scheme.

2. It has a maximum load of log log n/(d log φ_d) + O(1) in the Always-Go-Left scheme.

The maximum loads of our hash family match those of a perfectly random hash function [ABKU99, Voc03] in the Uniform-Greedy and Always-Go-Left schemes, respectively.
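The gap between the 1-choice and d-choice maximum loads is easy to observe empirically. The following is a minimal simulation sketch using truly random choices (not the derandomized hash families constructed in this thesis); the function names are illustrative only.

```python
import random

def one_choice_max_load(n, rng):
    # throw n balls into n bins independently and uniformly at random
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

def uniform_greedy_max_load(n, d, rng):
    # d independent uniform choices per ball; place in the least-loaded bin
    loads = [0] * n
    for _ in range(n):
        best = min((rng.randrange(n) for _ in range(d)), key=loads.__getitem__)
        loads[best] += 1
    return max(loads)
```

For n in the tens of thousands, the 1-choice maximum load is typically around log n/log log n, while the Uniform-Greedy scheme with d = 2 stays near log log n/log 2 + O(1).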

1.1.5 CSPs with a global cardinality constraint.

A variety of problems in pseudorandomness, such as bipartite expanders and dispersers, can be stated as constraint satisfaction problems complying with a global cardinality constraint. In a d-ary constraint satisfaction problem (CSP), we are given a set of boolean variables {x₁, x₂, ⋯, x_n} over {±1} and m constraints C₁, ⋯, C_m, where each constraint C_i consists of a predicate on at most d variables. A constraint is satisfied if and only if the assignment of the related variables is in the predicate of the constraint. The task is to find an assignment to {x₁, ⋯, x_n} so that the greatest (or the least) number of constraints in {C₁, ⋯, C_m} are satisfied.

Given a boolean CSP instance I, we can impose a global cardinality constraint ∑_{i=1}^n x_i = (1 − 2p)n (we assume that pn is an integer). Such a constraint is called the bisection constraint if p = 1/2. For example, the MaxBisection problem is the MaxCut problem with the bisection constraint. Constraint satisfaction problems with global cardinality constraints are natural generalizations of boolean CSPs. Researchers have been studying approximation algorithms for CSPs with global cardinality constraints for decades, where the MaxBisection problem [FJ95, Ye01, HZ02, FL06, GMR+11, RT12, ABG13] and the SmallSet Expansion problem [RS10, RST10, RST12] are two prominent examples.

Adding a global cardinality constraint can strictly increase the hardness of a problem. The SmallSet Expansion problem can be viewed as the MinCut problem with the cardinality of the selected subset constrained to be ρ|V| for some ρ ∈ (0, 1). While MinCut admits a polynomial-time algorithm that finds the optimal solution, we do not know a good approximation algorithm for SmallSet Expansion. Raghavendra and Steurer [RS10] suggested that the SmallSet Expansion problem is where the hardness of the notorious UniqueGames problem [Kho02] stems from.

We study several problems about CSPs under a global cardinality constraint in this thesis. The first is the fixed-parameter tractability (FPT) of CSPs above average under a global cardinality constraint. Specifically, let AVG be the expected number of constraints satisfied by a random assignment to x₁, x₂, …, x_n complying with the global cardinality constraint. We show an efficient algorithm that finds an assignment (complying with the cardinality constraint) satisfying more than AVG + t constraints, for an input parameter t. The second is to approximate the vertex expansion of a bipartite graph. Our main results are strong integrality gaps in the Lasserre hierarchy and an approximation algorithm for dispersers and bipartite expanders.
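On small instances, the quantity AVG can be computed by brute force. The following hypothetical illustration (not the algorithm of this thesis) computes the average cut size of MaxBisection over all bisections of a tiny graph:

```python
from itertools import combinations

def avg_bisection_cut(n, edges):
    """Average number of cut edges over all bisections, i.e. all assignments
    placing exactly n/2 vertices on each side; brute-force illustration of AVG."""
    total = count = 0
    for side in combinations(range(n), n // 2):
        s = set(side)
        total += sum(1 for u, v in edges if (u in s) != (v in s))
        count += 1
    return total / count
```

For the 4-cycle, the average over bisections is 8/3, which exceeds the unconstrained random-assignment average m/2 = 2; the bisection constraint shifts AVG.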

1.2 Organization

We provide some preliminaries and discuss condition numbers in Chapter 2. We also introduce a few tools for signal recovery in continuous sparse Fourier transform and linear families in this chapter.

In Chapter 3, we prove that the condition number of k-Fourier-sparse signals is in [k², Õ(k³)] and show how to scale it down to Õ(k). In Chapter 4, we present the sample-efficient algorithm for continuous sparse Fourier transform. In Chapter 5, we consider a special case of continuous sparse Fourier transform and present the robust polynomial recovery algorithm. In Chapter 6, we study how to learn signals from any given linear family. Most of the material in these chapters is based on joint work with Kane, Price, and Song [CKPS16, CP17].

We consider hash functions and their applications in Chapters 7 and 8. We show how to reduce the degree of any extractor and discuss its implications for almost universal hash families in Chapter 7. We present our hash families that derandomize multiple-choice schemes in Chapter 8. Most of the material in these chapters is based on joint work with Zuckerman [CZ18] and [Che17].

We study problems about CSPs under a global cardinality constraint in Chapters 9 and 10. We show that they are fixed-parameter tractable in Chapter 9. We present the integrality gaps for approximating dispersers and bipartite expanders in Chapter 10. Most of the material in these chapters is based on joint work with Zhou [CZ17] and [Che16].

Chapter 2

Preliminaries

Let [n] = {1, 2, ⋯, n}. For convenience, we always use \binom{S}{d} to denote the set of all subsets of size d in S and \binom{S}{≤d} to denote the set of all subsets of size at most d in S (including ∅). For two subsets S and T, we use S∆T to denote the symmetric difference of S and T.

Let n! denote the product ∏_{i=1}^n i and n!! = ∏_{i=0}^{⌊(n−1)/2⌋} (n − 2i). We use \vec{0} (\vec{1} resp.) to denote the all-0 (all-1 resp.) vector, and 1_E to denote the indicator variable of an event E, i.e., 1_E = 1 when E is true and 1_E = 0 otherwise.

We use X ≲ Y to denote the inequality X ≤ C·Y for a universal constant C. For a random variable X on {0,1}^n and a function f on {0,1}^n, let f(X) denote the random variable f(x) on the image of f when x ∼ X.

Given a distribution D, let ‖f‖_D denote the ℓ₂ norm under D, i.e., ‖f‖_D = (E_{x∼D}[|f(x)|²])^{1/2}. Given a sequence S = (t₁, …, t_m) (allowing repetition in S) and corresponding weights (w₁, …, w_m), let ‖f‖²_{S,w} denote the weighted ℓ₂ norm ∑_{j=1}^m w_j·|f(t_j)|². For convenience, we omit w when it is the uniform distribution on S, i.e., ‖f‖_S = (E_{i∈[m]}[|f(t_i)|²])^{1/2}. For a vector \vec{v} = (v(1), …, v(m)) ∈ C^m, let ‖\vec{v}‖_k denote the L_k norm, i.e., (∑_{i∈[m]} |v(i)|^k)^{1/k}.

Given a matrix A ∈ C^{m×m}, let A* denote the conjugate transpose, A*(i, j) = \overline{A(j, i)}; let ‖A‖ denote the operator norm ‖A‖ = max_{\vec{v}≠\vec{0}} ‖A\vec{v}‖₂/‖\vec{v}‖₂; and let λ(A) denote the set of all eigenvalues of A. Given a self-adjoint matrix A, let λ_min(A) and λ_max(A) denote the smallest and largest eigenvalues of A.

Next, we discuss the condition numbers of a family under a distribution D.

2.1 Condition Numbers

Given a family F (not necessarily linear) of signals and a fixed distribution D on the domain of F, we consider the problem of estimating ‖f‖²_D with high confidence, and we define the worst-case condition number and average condition number of F under D.

Let x₁, ⋯, x_m be m random samples from D. The basic way to estimate ‖f‖²_D = E_{x∼D}[|f(x)|²] is (1/m)·∑_{i=1}^m |f(x_i)|². To show that this concentrates, we would like to apply Chernoff bounds, which depend on the maximum value of the summands. In particular, the concentration depends on the worst-case condition number of F, i.e.,

K_F = sup_{x∈supp(D)} sup_{f∈F} |f(x)|²/‖f‖²_D.

We also consider estimating ‖f‖²_D = E_{x∼D}[|f(x)|²] by samples from another distribution D′. For any distribution D′ over supp(D) and m samples t₁, ⋯, t_m from D′, we always assign the weight w_i = D(t_i)/(D′(t_i)·m) to each i ∈ [m], so that

E_{t₁,⋯,t_m}[∑_{i=1}^m w_i·|f(t_i)|²] = ∑_{i=1}^m E_{t_i∼D′}[(D(t_i)/(D′(t_i)·m))·|f(t_i)|²] = E_{t∼D}[|f(t)|²] = ‖f‖²_D.

When D is clear, we also use f^{(D′)}(x) to denote the weighted function √(D(x)/D′(x))·f(x), so that E_{x∼D}[|f(x)|²] = E_{x∼D′}[|f^{(D′)}(x)|²]. When F and D are clear, let K_{D′} denote the condition number of signals in F under random sampling from D′:

K_{D′} = sup_{x∈supp(D)} sup_{f∈F} (D(x)/D′(x))·(|f(x)|²/‖f‖²_D) = sup_{x∈supp(D)} (D(x)/D′(x))·sup_{f∈F} (|f(x)|²/‖f‖²_D).
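The weighted estimator above can be sketched as follows (a minimal sketch with hypothetical helper names; `weight(t)` plays the role of the density ratio D(t)/D′(t)):

```python
import random

def weighted_energy_estimate(f, m, sample_dprime, weight, rng):
    """Estimate ||f||_D^2 from m samples t_i ~ D', weighting each sample
    by D(t_i)/D'(t_i); unbiased by the identity in the text."""
    return sum(weight(t) * abs(f(t)) ** 2
               for t in (sample_dprime(rng) for _ in range(m))) / m
```

For example, with D = D′ = uniform on [−1, 1] (so the ratio is 1) and f(t) = t, the estimate converges to E[t²] = 1/3.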

Let D_F be the distribution minimizing K_{D′}, obtained by making the inner term the same for every x, i.e.,

D_F(x) = D(x)·sup_{f∈F}(|f(x)|²/‖f‖²_D) / κ_F   for   κ_F = E_{x∼D}[sup_{f∈F} |f(x)|²/‖f‖²_D].  (2.1)

For convenience, we call κ_F the average condition number of F. It is straightforward to verify that the condition number K_{D_F} equals κ_F when sampling F from D_F.

Claim 2.1.1. For any family F and any distribution D on its domain, let D_F be the distribution defined in (2.1) with κ_F. The condition number K_{D_F} = sup_x sup_{f∈F} (D(x)/D_F(x))·(|f(x)|²/‖f‖²_D) is at most κ_F.

Proof. For any g ∈ F and any x in the domain,

(|g(x)|²/‖g‖²_D)·(D(x)/D_F(x)) = (|g(x)|²/‖g‖²_D)·D(x) / (D(x)·sup_{f∈F}(|f(x)|²/‖f‖²_D)/κ_F) ≤ κ_F.

In the rest of this work, we will use κ_F to denote the condition number under D_F. We discuss the application of this approach to sparse Fourier transform and linear families in Chapters 3 and 6, respectively.

2.2 Chernoff Bounds

We state a few versions of the Chernoff bound for random sampling. We start with the Chernoff bound for real numbers [Che52].

Lemma 2.2.1. Let X₁, X₂, ⋯, X_n be independent random variables with 0 ≤ X_i ≤ 1 for each i ∈ [n]. Let X = X₁ + X₂ + ⋯ + X_n and µ = E[X] = ∑_{i=1}^n E[X_i]. Then for any ε > 0,

Pr[X ≥ (1 + ε)µ] ≤ exp(−(ε²/(2 + ε))·µ)   and   Pr[X ≤ (1 − ε)µ] ≤ exp(−(ε²/2)·µ).

In this work, we use the following version of the Chernoff bound.

Corollary 2.2.2. Let X₁, X₂, ⋯, X_n be independent random variables in [0, R], each with expectation 1. For any ε < 1/2, X = (∑_{i=1}^n X_i)/n, which has expectation 1, satisfies

Pr[|X − 1| ≥ ε] ≤ 2·exp(−(ε²/3)·(n/R)).
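A quick numerical sanity check of this bound is easy to run. The sketch below picks one particular distribution satisfying the hypotheses — X_i equals R with probability 1/R and 0 otherwise, so X_i ∈ [0, R] with mean 1 — and compares the empirical failure rate against the bound:

```python
import math
import random

def chernoff_bound(eps, n, R):
    # the tail bound of Corollary 2.2.2
    return 2 * math.exp(-(eps ** 2 / 3) * (n / R))

def empirical_failure_rate(n, R, eps, trials, rng):
    # X_i = R with probability 1/R, else 0: values in [0, R], mean 1
    fails = 0
    for _ in range(trials):
        xbar = sum(R if rng.random() < 1 / R else 0 for _ in range(n)) / n
        fails += abs(xbar - 1) >= eps
    return fails / trials
```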

We state the matrix Chernoff inequality from [Tro12].

Theorem 2.2.3 (Theorem 1.1 of [Tro12]). Consider a finite sequence {X_k} of independent, random, self-adjoint matrices of dimension d. Assume that each random matrix satisfies X_k ⪰ 0 and λ_max(X_k) ≤ R. Define µ_min = λ_min(∑_k E[X_k]) and µ_max = λ_max(∑_k E[X_k]). Then

Pr[λ_min(∑_k X_k) ≤ (1 − δ)µ_min] ≤ d·(e^{−δ}/(1 − δ)^{1−δ})^{µ_min/R}   for δ ∈ [0, 1], and  (2.2)

Pr[λ_max(∑_k X_k) ≥ (1 + δ)µ_max] ≤ d·(e^{δ}/(1 + δ)^{1+δ})^{µ_max/R}   for δ ≥ 0.  (2.3)

Chapter 3

Condition Numbers of Continuous Sparse Fourier Transform

We study the worst-case and average condition numbers of k-Fourier-sparse signals in the continuous setting. Without loss of generality, we set [−1, 1] to be the interval of observations and F to be the bandlimit of the frequencies. In this chapter, we set the family of k-Fourier-sparse signals to be

F = { f(x) = ∑_{j=1}^k v_j·e^{2πi f_j x} : v_j ∈ C, |f_j| ≤ F }.  (3.1)

We consider the uniform distribution over [−1, 1] and define ‖f‖₂ = (E_{x∈[−1,1]}[|f(x)|²])^{1/2} to denote the energy of a signal f.

We prove lower and upper bounds on the worst-case condition number and the average condition number of F. Our main contributions are two upper bounds on the condition numbers of F, which are the first results about the condition numbers of sparse Fourier transform with continuous frequencies.

Theorem 3.0.1. Let F be the family defined in (3.1) given F and k. Then

K_F = sup_{f∈F} sup_{x∈[−1,1]} |f(x)|²/‖f‖₂² = O(k³ log² k).

At the same time, there exists f ∈ F such that sup_{x∈[−1,1]} |f(x)|²/‖f‖₂² = k².

Furthermore, we show that the average condition number κ_F is Õ(k).

Theorem 3.0.2. Given any F and k, let F be the family defined in (3.1). Then

κ_F = E_{x∈[−1,1]}[sup_{f∈F} |f(x)|²/‖f‖₂²] = O(k log² k).

From Claim 2.1.1 in Chapter 2, we obtain the following corollary. Recall that D(x) = 1/2 for the uniform distribution over [−1, 1].

Corollary 3.0.3. For any k, there exists an explicit distribution D_F such that

∀x ∈ [−1, 1], sup_{f∈F} (1/(2·D_F(x)))·(|f(x)|²/‖f‖₂²) = O(k log² k).

On the other hand, the average condition number κ_F = E_{x∈[−1,1]}[sup_{f∈F} |f(x)|²/‖f‖₂²] is at least k, because for any x ∈ [−1, 1] there exists a periodic signal f with |f(x)|²/‖f‖₂² = k. This indicates that our estimate of the average condition number is tight up to log factors.

In the rest of this chapter, we prove the upper bound of Theorem 3.0.1 in Section 3.1. Then we prove Theorem 3.0.2 and Corollary 3.0.3 in Section 3.2. For completeness, we show the lower bound of Theorem 3.0.1 through the approximation of degree k − 1 polynomials by k-Fourier-sparse signals in Section 3.3.
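The lower bound κ_F ≥ k mentioned above can be illustrated numerically: the periodic signal f(x) = ∑_{j=1}^k e^{πi j x} has components that are orthonormal under the uniform distribution on [−1, 1], so ‖f‖₂² = k while f(0) = k, giving the ratio k²/k = k. This is a minimal sketch (assuming F ≥ k/2 so the frequencies j/2 belong to the family):

```python
import cmath

def ratio_at_zero(k, n_grid=4000):
    """|f(0)|^2 / ||f||_2^2 for f(x) = sum_{j=1}^k e^{pi i j x}; equals k."""
    f = lambda x: sum(cmath.exp(1j * cmath.pi * j * x) for j in range(1, k + 1))
    # midpoint rule for E_{x ~ Unif[-1,1]} |f(x)|^2
    energy = sum(abs(f(-1 + (2 * i + 1) / n_grid)) ** 2
                 for i in range(n_grid)) / n_grid
    return abs(f(0)) ** 2 / energy
```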

3.1 The Worst-case Condition Number of Fourier Sparse Signals

We bound the worst-case condition number of signals in F in this section. We first state the technical result to prove the upper bound in Theorem 3.0.1.

Theorem 3.1.1. Given any k > 0, there exists d = O(k² log k) such that for any f(x) = ∑_{j=1}^k v_j·e^{2πi f_j·x}, any t ∈ R, and any ∆ > 0,

|f(t)|² ≤ O(k)·∑_{j=1}^d |f(t + j·∆)|².

We first finish the proof of Theorem 3.0.1, bounding the worst-case condition number K_F by the above relation.

Proof of Theorem 3.0.1. Given any f ∈ F, we prove that

|f(t)|² = O(k³ log² k)·∫_t^1 |f(x)|² dx   for any t ≤ 0,

which indicates |f(t)|² = O(k³ log² k)·E_{x∼[−1,1]}[|f(x)|²]. By symmetry, it also implies |f(t)|² = O(k³ log² k)·E_{x∼[−1,1]}[|f(x)|²] for any t ≥ 0.

We apply Theorem 3.1.1 to f(t) and average over ∆ ∈ [0, (1−t)/d]:

((1−t)/d)·|f(t)|² ≤ O(k)·∫_{∆=0}^{(1−t)/d} ∑_{j∈[d]} |f(t + j∆)|² d∆
≲ k·∑_{j∈[d]} ∫_{∆=0}^{(1−t)/d} |f(t + j∆)|² d∆
≲ k·∑_{j∈[d]} (1/j)·∫_{∆′=0}^{(1−t)j/d} |f(t + ∆′)|² d∆′
≲ k·∑_{j∈[d]} (1/j)·∫_{x=−1}^1 |f(x)|² dx
≲ k log k·∫_{x=−1}^1 |f(x)|² dx.

From the discussion above, we have |f(t)|² ≲ dk log k·E_{x∈[−1,1]}[|f(x)|²].

We prove Theorem 3.1.1 in the rest of this section. We first provide a polynomial interpolation lemma.

Theorem 3.1.2. Given z₁, ⋯, z_k with |z₁| = |z₂| = ⋯ = |z_k| = 1, there exists a degree d = O(k² log k) polynomial P(z) = ∑_{j=0}^d c(j)·z^j satisfying

1. P(z_i) = 0 for each i ∈ [k].

2. Its coefficients satisfy ∑_{j=1}^d |c(j)|² = O(k)·|c(0)|².

We use residual polynomials to prove Theorem 3.1.2.

Lemma 3.1.3. Given z₁, ⋯, z_k, for any integer n, let r_{n,k}(z) = ∑_{i=0}^{k−1} r_{n,k}^{(i)}·z^i denote the residual polynomial r_{n,k} ≡ z^n mod ∏_{j=1}^k (z − z_j). Then each coefficient of r_{n,k} is bounded: |r_{n,k}^{(i)}| ≤ \binom{k−1}{i}·\binom{n}{k−1} for n ≥ k, and |r_{n,k}^{(i)}| ≤ \binom{k−1}{i}·\binom{|n|+k−1}{k−1} for n < 0.

For completeness, we provide a proof of Lemma 3.1.3 in Appendix A. We finish the proof of Theorem 3.1.2 here.

Proof. Let C₀ be a large constant and d = 5·k² log k. We use P to denote the following subset of polynomials with bounded coefficients:

P = { ∑_{j=0}^d α_j·2^{−j/k}·z^j : α₀, …, α_d ∈ [−C₀, C₀] ∩ Z }.

For each polynomial P(z) = ∑_{j=0}^d α_j·2^{−j/k}·z^j ∈ P, we rewrite P(z) mod ∏_{j=1}^k (z − z_j) as

∑_{j=0}^d α_j·2^{−j/k}·z^j mod ∏_{j=1}^k (z − z_j) = ∑_{i=0}^{k−1} (∑_{j=0}^d α_j·2^{−j/k}·r_{j,k}^{(i)}) z^i.

Each coefficient ∑_{j=0}^d α_j·2^{−j/k}·r_{j,k}^{(i)} is bounded by

∑_{j=0}^d C₀·2^{−j/k}·2^k·j^{k−1} ≤ d·C₀·2^k·d^k ≤ d^{2k}.

We apply the pigeonhole principle to the (2C₀ + 1)^d polynomials in P after taking them modulo ∏_{j=1}^k (z − z_j): there exist m > (2C₀ + 1)^{0.9d} polynomials P₁, ⋯, P_m such that each coefficient of (P_i − P_j) mod ∏_{j=1}^k (z − z_j) is at most d^{−2k}, from the counting bound

(2C₀ + 1)^d / (d^{2k}/d^{−2k})^{2k} > (2C₀ + 1)^{0.9d}.

Because m > (2C₀ + 1)^{0.9d}, there exist j₁ ∈ [m] and j₂ ∈ [m]∖{j₁} such that the lowest monomial z^l with different coefficients in P_{j₁} and P_{j₂} satisfies l ≤ 0.1d. Eventually we set

P(z) = z^{−l}·(P_{j₁}(z) − P_{j₂}(z)) − (z^{−l} mod ∏_{j=1}^k (z − z_j))·((P_{j₁}(z) − P_{j₂}(z)) mod ∏_{j=1}^k (z − z_j))

to satisfy the first property P(z₁) = P(z₂) = ⋯ = P(z_k) = 0. We prove the second property in the rest of this proof.

We bound every coefficient of (z^{−l} mod ∏_{j=1}^k (z − z_j))·((P_{j₁}(z) − P_{j₂}(z)) mod ∏_{j=1}^k (z − z_j)) by k·2^{k−1}(l + k)^{k−1}·d^{−2k} ≤ d·2^{k−1}·d^{k−1}·d^{−2k} ≤ d^{−0.5k}. On the other hand, the constant coefficient of z^{−l}·(P_{j₁}(z) − P_{j₂}(z)) is at least 2^{−l/k} ≥ 2^{−0.1d/k} = k^{−0.5k}, because z^l is the smallest monomial with different coefficients in P_{j₁} and P_{j₂} from P. Thus the constant coefficient of P(z) satisfies |C(0)|² ≥ 0.5·2^{−2l/k}.

Next we upper bound the sum of the remaining coefficients ∑_{j=1}^d |C(j)|² by

∑_{j=1}^d (2C₀·2^{−(l+j)/k} + d^{−0.5k})² ≤ 2·4C₀²·∑_{j=1}^d 2^{−2(l+j)/k} + 2·∑_{j=1}^d d^{−k} ≲ k·2^{−2l/k},

which demonstrates the second property.

Then we finish the proof of Theorem 3.1.1 using the above polynomial interpolation bound.

Proof of Theorem 3.1.1. Given k frequencies f₁, ⋯, f_k and ∆, we set z₁ = e^{2πi f₁·∆}, ⋯, z_k = e^{2πi f_k·∆}. Let C(0), ⋯, C(d) be the coefficients of the degree d polynomial P(z) in Theorem 3.1.2. We have

∑_{j=0}^d C(j)·f(t + j·∆) = ∑_{j=0}^d C(j)·∑_{j′∈[k]} v_{j′}·e^{2πi f_{j′}(t + j∆)} = ∑_{j′∈[k]} v_{j′}·e^{2πi f_{j′} t}·∑_{j=0}^d C(j)·z_{j′}^j = 0.

Hence

−C(0)·f(t) = ∑_{j=1}^d C(j)·f(t + j·∆).  (3.2)

By the Cauchy-Schwarz inequality, we have

|C(0)|²·|f(t)|² ≤ (∑_{j=1}^d |C(j)|²)·(∑_{j=1}^d |f(t + j·∆)|²).  (3.3)

From the second property of C(0), ⋯, C(d) in Theorem 3.1.2, |f(t)|² ≤ O(k)·∑_{j=1}^d |f(t + j·∆)|².

3.2 The Average Condition Number of Fourier Sparse Signals

We bound the average condition number of F in this section. The key ingredient is an upper bound on |f(t)|²/‖f‖₂² depending on t.

Lemma 3.2.1. For any t ∈ (−1, 1),

sup_{f∈F} |f(t)|²/‖f‖₂² ≲ k log k/(1 − |t|).

We state the improvement compared to Theorem 3.1.1 bounding the worst-case con- dition number.

Claim 3.2.2. Given f(x) = ∑_{j=1}^k v_j·e^{2πi f_j·x} and ∆, there exists l ∈ [2k] such that for any t,

|f(t + l·∆)|² ≲ ∑_{j∈[2k]∖{l}} |f(t + j·∆)|².

Proof. Given k frequencies f₁, ⋯, f_k and ∆, we set z₁ = e^{2πi f₁·∆}, ⋯, z_k = e^{2πi f_k·∆}. Let V be the linear subspace

V = { (α(0), …, α(2k−1)) ∈ C^{2k} : ∑_{j=0}^{2k−1} α(j)·z_i^j = 0 for all i ∈ [k] }.

Because the dimension of V is k, let α₁, ⋯, α_k ∈ V be k orthogonal coefficient vectors with unit length ‖α_i‖₂ = 1. Let l be the coordinate in [2k] with the largest weight ∑_{i=1}^k |α_i(l)|².

From the definition of α_i, we have

∑_{j∈[2k]} α_i(j)·f(t + j·∆) = ∑_{j∈[2k]} α_i(j)·∑_{j′∈[k]} v_{j′}·e^{2πi f_{j′}(t + j∆)} = ∑_{j′∈[k]} v_{j′}·e^{2πi f_{j′} t}·∑_{j∈[2k]} α_i(j)·z_{j′}^j = 0.

Hence for every i ∈ [k],

−α_i(l)·f(t + l·∆) = ∑_{j∈[2k]∖{l}} α_i(j)·f(t + j·∆).  (3.4)

Let A ∈ C^{k×(2k−1)} denote the matrix of the coefficients excluding the coordinate l, i.e., the i-th row of A is (α_i(0), ⋯, α_i(l−1), α_i(l+1), ⋯, α_i(2k−1)).

For the k × k matrix A·A*, its entry (i, i′) equals

∑_{j∈[2k]∖{l}} α_i(j)·\overline{α_{i′}(j)} = ⟨α_i, α_{i′}⟩ − α_i(l)·\overline{α_{i′}(l)} = 1_{i=i′} − α_i(l)·\overline{α_{i′}(l)}.

Thus the eigenvalues of A·A* are bounded by 1 + ∑_{i∈[k]} |α_i(l)|², which also bounds the eigenvalues of A*·A. From (3.4),

∑_{i∈[k]} |α_i(l)·f(t + l·∆)|² ≤ λ_max(A·A*)·∑_{j∈[2k]∖{l}} |f(t + j·∆)|²
⇒ (∑_{i∈[k]} |α_i(l)|²)·|f(t + l·∆)|² ≤ (1 + ∑_{i∈[k]} |α_i(l)|²)·∑_{j∈[2k]∖{l}} |f(t + j·∆)|².

Because l = arg max_{j∈[2k]} ∑_{i∈[k]} |α_i(j)|² and α₁, ⋯, α_k are unit vectors, ∑_{i∈[k]} |α_i(l)|² ≥ ∑_{i=1}^k ‖α_i‖₂²/2k ≥ 1/2. Therefore

|f(t + l·∆)|² ≤ 3·∑_{j∈[2k]∖{l}} |f(t + j·∆)|².

Corollary 3.2.3. Given f(x) = ∑_{j=1}^k v_j·e^{2πi f_j·x}, for any ∆ and t,

|f(t)|² ≲ ∑_{i=1}^{2k} |f(t + i∆)|² + ∑_{i=1}^{2k} |f(t − i∆)|².

Next we finish the proof of Lemma 3.2.1.

Proof of Lemma 3.2.1. We assume t = 1 − ε and integrate ∆ from 0 to ε/2k:

(ε/2k)·|f(t)|² ≲ ∫_{∆=0}^{ε/2k} (∑_{i=1}^{2k} |f(t + i∆)|² + ∑_{i=1}^{2k} |f(t − i∆)|²) d∆
= ∑_{i∈[2k]} ∫_{∆=0}^{ε/2k} (|f(t + i∆)|² + |f(t − i∆)|²) d∆
≲ ∑_{i∈[2k]} (1/i)·∫_{∆′=0}^{εi/2k} |f(t + ∆′)|² d∆′ + ∑_{i∈[2k]} (1/i)·∫_{∆′=0}^{εi/2k} |f(t − ∆′)|² d∆′
≲ ∑_{i∈[2k]} (1/i)·∫_{∆′=−ε}^{ε} |f(t + ∆′)|² d∆′
≲ log k·∫_{x=−1}^1 |f(x)|² dx.

From the discussion above, we have |f(1 − ε)|² ≲ (k log k/ε)·E_{x∈[−1,1]}[|f(x)|²].

Next we finish the proof of Theorem 3.0.2.

Proof of Theorem 3.0.2. We bound

κ_F = E_{x∈[−1,1]}[sup_{f∈F} |f(x)|²/‖f‖₂²]
= (1/2)·∫_{x=−1}^1 sup_{f∈F} (|f(x)|²/‖f‖₂²) dx
≲ ∫_{x=−1+ε}^{1−ε} sup_{f∈F} (|f(x)|²/‖f‖₂²) dx + ε·k³ log² k   (from Theorem 3.0.1)
≲ ∫_{x=−1+ε}^{1−ε} (k log k/(1 − |x|)) dx + ε·k³ log² k   (from Lemma 3.2.1)
≲ k log k·log(1/ε) + ε·k³ log² k ≲ k log² k,

by choosing ε = 1/(k³ log k).

We now exhibit the distribution D_F.

Proof of Corollary 3.0.3. From Claim 2.1.1 in Chapter 2, there exists a constant c = Θ(1) such that the distribution

D_F(x) = c/((1 − |x|)·log k)   for |x| ≤ 1 − 1/(k³ log² k),   and   D_F(x) = c·k³ log k   for |x| > 1 − 1/(k³ log² k)

guarantees that for any f(x) = ∑_{j=1}^k v_j·e^{2πi f_j x} and all x ∈ [−1, 1], |f(x)|²·(D(x)/D_F(x)) = O(k log² k)·‖f‖²_D.

3.3 Polynomials and Fourier Sparse Signals

We prove that for any ε > 0 and degree k − 1 polynomial p(x), there exists a signal f ∈ F such that |f(x) − p(x)| ≤ ε for all x ∈ [−1, 1]. Then we show that there exists a degree k − 1 polynomial p(x) with |p(1)|²/‖p‖₂² = k², which complements the lower bound in Theorem 3.0.1.

Theorem 3.3.1. For any degree k − 1 polynomial p(x) = ∑_{j=0}^{k−1} c_j·x^j and ε > 0, there always exist τ > 0 and f(x) = ∑_{j=0}^{k−1} a_j·e^{2πi·(jτ)·x} such that

∀x ∈ [−1, 1], |f(x) − p(x)| ≤ ε.

At the same time, we use the Legendre polynomials to show the lower bound.

Theorem 3.3.2. For any k, there exists a degree k − 1 polynomial p(x) such that

|p(1)|²/‖p‖₂² = k².

At the same time, for any degree k − 1 polynomial q and any x ∈ [−1, 1], |q(x)|²/‖q‖₂² ≤ k².

We first prove Theorem 3.3.1 in Section 3.3.1 using the Taylor expansion. Then we review a few facts about the Legendre polynomials and prove Theorem 3.3.2 in Section 3.3.2.

3.3.1 A Reduction from Polynomials to Fourier Sparse Signals

We prove Theorem 3.3.1 in this section. We first consider the Taylor expansion of a signal f(x) = ∑_{j=0}^{k−1} a_j·e^{2πi·(jτ)·x} with a small frequency gap τ:

f(x) = ∑_{j=0}^{k−1} a_j·∑_{l=0}^∞ (2πi·(jτ)·x)^l/l! = ∑_{l=0}^∞ ((2πi·τ)^l/l!·∑_{j=0}^{k−1} a_j·j^l)·x^l.

Given τ and a polynomial p(x) = ∑_{l=0}^{k−1} c_l·x^l, we consider how to match their coefficients:

(2πi·τ)^l/l!·∑_{j=0}^{k−1} a_j·j^l = c_l   for every l = 0, …, k − 1.

This is equivalent to

∑_{j=0}^{k−1} j^l·a_j = c_l·l!/(2πi·τ)^l   for every l = 0, …, k − 1.

Let A denote the k × k Vandermonde matrix with entries A(l, j) = j^l. Since det(A) = ∏_{i<j} (j − i) ≠ 0, A is invertible, so there exists a solution

(a₀, ⋯, a_{k−1})^⊤ = A^{−1}·(c_l·l!/(2πi·τ)^l)^⊤_{l=0,⋯,k−1}

satisfying the above equations.

Then we use the properties of the Vandermonde matrix A to bound max_j |a_j| ≤ max_l{|c_l|}·k^{k²}/τ^{k−1}. Since det(A) = ∏_{i<j} (j − i) ≥ 1 while every entry of A is at most k^{k−1}, we have λ_min(A) ≥ k^{−k(k−1)}, so

‖(a₀, ⋯, a_{k−1})‖₂ ≤ λ_min(A)^{−1}·‖(c_l·l!/(2πi·τ)^l)_{l=0,⋯,k−1}‖₂.

This indicates

max_i |a_i| ≤ k^{k(k−1)}·k·max_l{|c_l|}/τ^{k−1} ≤ max_l{|c_l|}·k^{k²}/τ^{k−1}.

Given k and the point-wise error ε, we set τ = (ε/max_j{|c_j|})·k^{−2k²} such that the tail of the Taylor expansion is less than ε for x ∈ [−1, 1]:

|∑_{l=k}^∞ ((2π·τ)^l/l!·∑_{j=0}^{k−1} a_j·j^l)·x^l| ≤ ∑_{l=k}^∞ (2π·τ)^l/l!·k·max_j{|a_j|}·(k − 1)^l
≤ ∑_{l=k}^∞ k·(2π·τ·k)^l/l!·max_j{|c_j|}·k^{k²}/τ^{k−1}
≤ ∑_{l=k}^∞ k·(2π·k)^l/l!·max_j{|c_j|}·k^{k²}·τ^{l−k+1} ≤ ε.
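The construction of this subsection can be carried out numerically for small k. The sketch below (hypothetical helper names; a stdlib Gaussian elimination stands in for A^{−1}) solves the Vandermonde system for the a_j and evaluates the resulting Fourier-sparse signal:

```python
import cmath

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small complex system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                q = M[r][c] / M[c][c]
                M[r] = [x - q * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fourier_approx(coeffs, tau):
    """Match the first k Taylor coefficients of sum_j a_j e^{2 pi i (j tau) x}
    to the polynomial sum_l c_l x^l, via the Vandermonde system in the text."""
    k = len(coeffs)
    fact, b = 1, []
    for l in range(k):
        if l > 0:
            fact *= l
        b.append(coeffs[l] * fact / (2j * cmath.pi * tau) ** l)
    A = [[complex(j ** l) for j in range(k)] for l in range(k)]
    a = solve(A, b)
    return lambda x: sum(a[j] * cmath.exp(2j * cmath.pi * (j * tau) * x)
                         for j in range(k))
```

With a small gap such as τ = 10⁻⁴, the signal tracks the polynomial to within a few thousandths on [−1, 1], in line with the tail bound above.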

3.3.2 Legendre Polynomials and a Lower Bound

We provide a brief introduction to Legendre polynomials (see [Dun10] for a complete introduction).

Definition 3.3.3. Let L_n(x) denote the Legendre polynomial of degree n, the solution to Legendre's differential equation:

(d/dx)[(1 − x²)·(d/dx)L_n(x)] + n(n + 1)·L_n(x) = 0.  (3.5)

We will use the following two facts about the Legendre polynomials in this work.

Fact 3.3.4. L_n(1) = 1 for every n ≥ 0.

Fact 3.3.5. The Legendre polynomials constitute an orthogonal basis with respect to the inner product on the interval [−1, 1]:

∫_{−1}^1 L_m(x)·L_n(x) dx = (2/(2n + 1))·δ_{mn},

where δ_{mn} denotes the Kronecker delta, i.e., it equals 1 if m = n and 0 otherwise.

For any polynomial P(x) of degree at most d with complex coefficients, the above properties give a set of coefficients such that

P(x) = ∑_{i=0}^d α_i·L_i(x),   where α_i ∈ C for all i ∈ {0, 1, 2, ⋯, d}.

Now we finish the proof of Theorem 3.3.2.

Proof of Theorem 3.3.2. We first consider the upper bound. Observe that it is enough to prove |q(1)|²/‖q‖₂² ≤ k² for any degree k − 1 polynomial q, due to the definition ‖q‖₂² = E_{x∼[−1,1]}[|q(x)|²]. Indeed, for any q, let x* = arg max_{x∈[−1,1]} |q(x)|². Then

|q(x*)|²/E_{x∼[−1,x*]}[|q(x)|²] ≤ k²   and   |q(x*)|²/E_{x∼[x*,1]}[|q(x)|²] ≤ k²

imply |q(x*)|²/‖q‖₂² ≤ k².

For any degree k − 1 polynomial q, let q(x) = ∑_{j=0}^{k−1} α_j·L_j(x) be its expansion in the Legendre polynomials. Thus q(1) = ∑_{j=0}^{k−1} α_j and ‖q‖₂² = ∑_{j=0}^{k−1} |α_j|²/(2j + 1). From the Cauchy-Schwarz inequality,

|q(1)|² ≤ (∑_{j=0}^{k−1} |α_j|)² ≤ (∑_{j=0}^{k−1} |α_j|²/(2j + 1))·(∑_{j=0}^{k−1} (2j + 1)) = k²·‖q‖₂².

On the other hand, there exists q such that the above inequality is tight.
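The tight case q = ∑_j (2j+1)·L_j, which makes the Cauchy-Schwarz step an equality, can be checked numerically. The sketch below evaluates L_j via the Bonnet recurrence and approximates the expectation with a midpoint rule:

```python
def legendre(n, x):
    # Bonnet recurrence: (j+1) P_{j+1}(x) = (2j+1) x P_j(x) - j P_{j-1}(x)
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for j in range(1, n):
        p_prev, p = p, ((2 * j + 1) * x * p - j * p_prev) / (j + 1)
    return p

def extremal_ratio(k, n_grid=20000):
    """|q(1)|^2 / ||q||_2^2 for q = sum_{j<k} (2j+1) L_j; equals k^2."""
    q = lambda x: sum((2 * j + 1) * legendre(j, x) for j in range(k))
    # midpoint rule for E_{x ~ Unif[-1,1]} q(x)^2
    energy = sum(q(-1 + (2 * i + 1) / n_grid) ** 2
                 for i in range(n_grid)) / n_grid
    return q(1.0) ** 2 / energy
```

Here q(1) = ∑_{j<k} (2j+1) = k² and ‖q‖₂² = ∑_{j<k} (2j+1)²/(2j+1) = k², so the ratio is k⁴/k² = k².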

Chapter 4

Learning Continuous Sparse Fourier Transforms

We follow the notation of the last chapter. Let F be the bandlimit of the frequencies, [−T, T] be the interval of observations, and

F = { f(t) = ∑_{j=1}^k v_j·e^{2πi f_j t} : v_j ∈ C, |f_j| ≤ F }

be the family of signals. We consider the uniform distribution over [−T, T] and define ‖f‖₂ = (E_{t∈[−T,T]}[|f(t)|²])^{1/2} to denote the energy of a signal f. In this chapter, we study the sample complexity of learning a signal f ∈ F under noise. Our main result is a sample-efficient algorithm under bounded ℓ₂ noise.

Theorem 4.0.1. For any F > 0, T > 0, and ε > 0, there exists an algorithm that, given any observation y(t) = ∑_{j=1}^k v_j·e^{2πi f_j t} + g(t) with |f_j| ≤ F for each j, takes m = O(k⁴ log³ k + k² log² k·log(FT/ε)) samples t₁, …, t_m and outputs f̃(t) = ∑_{j=1}^k ṽ_j·e^{2πi f̃_j t} satisfying

‖f − f̃‖₂ ≲ ‖g‖₂ + ε·‖f‖₂

with probability 0.99.

We show our algorithm in Algorithm 1. To prove the correctness of Algorithm 1, we first state the main technical result of this chapter.

Lemma 4.0.2. Let C be a universal constant and N_f = (ε/(T·k^{Ck²}))·Z ∩ [−F, F] denote a net of frequencies given F and T. For any signal f(t) = ∑_{j=1}^k v_j·e^{2πi·f_j t}, there exists a k-sparse signal

f′(t) = ∑_{j=1}^k v′_j·e^{2πi f′_j t}

with frequencies f′₁, …, f′_k ∈ N_f satisfying ‖f − f′‖₂ ≤ ε·‖f‖₂.

Algorithm 1 Recover k-sparse FT
1: procedure SparseFT(y, F, T, ε)
2:   m ← O(k⁴ log³ k + k² log² k·log(FT/ε))
3:   Sample t₁, …, t_m from D_F, where D_F is the distribution provided in Corollary 3.0.3
4:   Set the corresponding weights (w₁, …, w_m) and S = (t₁, …, t_m)
5:   Query y(t₁), …, y(t_m) from the observation y
6:   N_f ← (ε/(T·k^{Ck²}))·Z ∩ [−F, F] for a constant C
7:   for all possible k frequencies f′₁, …, f′_k in N_f do
8:     Find h(t) in span{e^{2πi·f′₁ t}, …, e^{2πi·f′_k t}} minimizing ‖h − y‖_{S,w}
9:     Update f̃ ← h if ‖h − y‖_{S,w} ≤ ‖f̃ − y‖_{S,w}
10:   end for
11:   Return f̃
12: end procedure

We prove Theorem 4.0.1 here, then finish the proof of Lemma 4.0.2 in the rest of this chapter.
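As a toy illustration of the grid enumeration in Algorithm 1 (not the thesis's construction: we take k = 1, a plain uniform frequency grid in place of N_f, and unweighted samples in place of D_F; all names are illustrative):

```python
import cmath
import random

def sparse_ft_k1(y, F, grid_step, samples):
    """Brute force over a frequency grid; for each candidate frequency,
    fit the single coefficient by least squares and keep the best fit."""
    ys = [y(t) for t in samples]
    steps = int(round(F / grid_step))
    best = None
    for i in range(-steps, steps + 1):
        f = i * grid_step
        phases = [cmath.exp(2j * cmath.pi * f * t) for t in samples]
        # least-squares coefficient for a unit-modulus basis vector
        v = sum(yt * ph.conjugate() for yt, ph in zip(ys, phases)) / len(samples)
        resid = sum(abs(yt - v * ph) ** 2 for yt, ph in zip(ys, phases))
        if best is None or resid < best[0]:
            best = (resid, f, v)
    _, f, v = best
    return f, v
```

On a noiseless 1-sparse input whose frequency lies on the grid, the search recovers the frequency and coefficient exactly (up to floating point).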

Proof of Theorem 4.0.1. We use Lemma 4.0.2 to rewrite y = f + g = f′ + g′, where f′ has frequencies in N_f and g′ = g + f − f′ with ‖g′‖₂ ≤ ‖g‖₂ + ε·‖f‖₂. Our goal is to recover f′. We construct a δ-net with δ = 0.05 for

{ h(t) = ∑_{j=1}^{2k} v_j·e^{2πi·ĥ_j t} : ‖h‖₂ = 1, ĥ_j ∈ N_f }.

We first pick 2k frequencies ĥ₁, …, ĥ_{2k} in N_f, then construct a δ-net (in ℓ₂ norm) on the linear subspace span{e^{2πi ĥ₁ t}, …, e^{2πi ĥ_{2k} t}}. Hence the size of the δ-net is at most

\binom{4FT·k^{Ck²}/ε}{2k}·(12/δ)^{2k} ≤ (4FT·k^{Ck²}/(ε·δ))^{3k}.

Now we consider the number of random samples from D_F needed to estimate signals in the δ-net. Based on the condition number of D_F in Theorem 3.0.2 and the Chernoff bound of Corollary 2.2.2, a union bound over the δ-net indicates that

m = O((k log² k/δ²)·log|net|) = O((k log² k/δ²)·(k³ log k + k·log(FT/(εδ))))

random samples from D_F guarantee that for any signal h in the net, ‖h‖²_{S,w} = (1 ± δ)·‖h‖₂². From the property of the net, we obtain:

for any h(t) = ∑_{j=1}^{2k} v_j·e^{2πi·ĥ_j t} with ĥ_j ∈ N_f,   ‖h‖²_{S,w} = (1 ± 2δ)·‖h‖₂².

Finally, we bound ‖f − f̃‖₂ as follows. The expectation of ‖f − f̃‖₂ over the random samples S = (t₁, …, t_m) is at most

‖f − f′‖₂ + ‖f′ − f̃‖₂ ≤ ‖f − f′‖₂ + 1.1·‖f′ − f̃‖_{S,w}
≤ ‖f − f′‖₂ + 1.1·(‖f′ − y‖_{S,w} + ‖y − f̃‖_{S,w})
≤ ‖f − f′‖₂ + 1.1·(‖g′‖_{S,w} + ‖y − f′‖_{S,w})
≤ ε·‖f‖₂ + 2.2·(‖g‖₂ + ε·‖f‖₂).

By Markov's inequality, with probability 0.99, ‖f − f̃‖₂ ≲ ε·‖f‖₂ + ‖g‖₂.

We sketch the proof of Lemma 4.0.2 here. Given f(t) = ∑_{j=1}^k v_j·e^{2πi·f_j t}, where the frequencies f₁, ⋯, f_k could be arbitrarily close to each other, we first show how to shift one frequency f_j to f′_j while keeping almost the same signal f(t) on [−T, T].

Lemma 4.0.3. There is a universal constant C₀ > 0 such that for any f(t) = ∑_{j=1}^k v_j·e^{2πi f_j t} and any frequency f_{k+1}, there always exists

f′(t) = ∑_{j=1}^{k−1} v′_j·e^{2πi f_j t} + v′_{k+1}·e^{2πi f_{k+1} t}

with k coefficients v′₁, v′₂, ⋯, v′_{k−1}, v′_{k+1} satisfying

‖f′ − f‖₂ ≤ k^{C₀k²}·(|f_k − f_{k+1}|·T)·‖f‖₂.

We defer the proof of Lemma 4.0.3 to Section 4.2. Then we separate the frequencies f₁, ⋯, f_k in f(t) by at least η.

Lemma 4.0.4. Given F and η, let N_f = ηZ ∩ [−F, F]. For any k frequencies f₁ < f₂ < ⋯ < f_k in [−F, F], there exist k frequencies f′₁, ⋯, f′_k such that min_{i∈[k−1]} {f′_{i+1} − f′_i} ≥ η and |f′_i − f_i| ≤ kη for all i ∈ [k].

Proof. Given a frequency f, let π(f) denote the first element f′ in N_f satisfying f′ ≥ f. We define the new frequencies f′_i as follows: f′₁ = π(f₁) and f′_i = max{f′_{i−1} + η, π(f_i)} for i ∈ {2, 3, ⋯, k}.
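The greedy rounding in this proof can be sketched directly (a minimal sketch; `eta` plays the role of η):

```python
import math

def separate_frequencies(freqs, eta):
    """Round each frequency up to the grid eta * Z, pushing right as needed
    so that consecutive gaps are at least eta (the proof of Lemma 4.0.4)."""
    out = []
    for f in sorted(freqs):
        g = math.ceil(f / eta) * eta      # pi(f): first grid point >= f
        if out:
            g = max(out[-1] + eta, g)
        out.append(g)
    return out
```

Each output frequency moves by at most kη in total, since every push adds at most η and there are at most k of them.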

We finish the proof of Lemma 4.0.2 using the above two lemmas.

Proof of Lemma 4.0.2. Let η = ε/(5k²T·k^{Ck²}) for C = C₀ + 1, where C₀ is the universal constant in Lemma 4.0.3. Using Lemma 4.0.4 on the frequencies f₁, ⋯, f_k of the signal f, we obtain k new frequencies f′₁, ⋯, f′_k such that their pairwise gap is at least η and max_i |f_i − f′_i| ≤ kη. Next we use a hybrid argument to find the signal f′.

Let f^{(0)}(t) = f(t). For i = 1, ⋯, k, we apply Lemma 4.0.3 to shift the frequency f_i to f′_i and obtain

f^{(i)}(t) = ∑_{j=i+1}^k v_j^{(i)}·e^{2πi f_j t} + ∑_{j=1}^i v_j^{(i)}·e^{2πi f′_j t}   s.t.   ‖f^{(i)} − f^{(i−1)}‖₂ ≤ k^{C₀k²}·(|f_i − f′_i|·T)·‖f^{(i−1)}‖₂.

At the same time, we bound ‖f^{(i)}(t)‖₂ in terms of ‖f^{(0)}(t)‖₂ by

(1 − k^{C₀k²}·(kηT))^i·‖f^{(0)}‖₂ ≤ ‖f^{(i)}‖₂ ≤ (1 + k^{C₀k²}·(kηT))^i·‖f^{(0)}‖₂,

which is within (1 ± 0.1)·‖f^{(0)}‖₂ for η ≤ (1/(5k²T))·k^{−Ck²} with C = C₀ + 1.

At last, we set f′(t) = f^{(k)}(t) and bound the distance between f′(t) and f(t) by

‖f^{(k)} − f^{(0)}‖₂ ≤ ∑_{i=1}^k ‖f^{(i)} − f^{(i−1)}‖₂   (triangle inequality)
≤ ∑_{i=1}^k k^{C₀k²}·(|f_i − f′_i|·T)·‖f^{(i−1)}‖₂   (Lemma 4.0.3)
≤ ∑_{i=1}^k 2·k^{C₀k²}·(kηT)·‖f^{(0)}‖₂   (since max_i |f_i − f′_i| ≤ kη)
≤ 2k·2·k^{C₀k²}·(kηT)·‖f‖₂
≤ ε·‖f‖₂,

where the last inequality follows from our choice of η = ε/(5k²T·k^{Ck²}).

In the rest of this chapter, we review a few properties about the determinant of Gram matrices in Section 4.1 and finish the proof of Lemma 4.0.3 in Section 4.2.

4.1 Gram Matrices of Complex Exponentials

We provide a brief introduction to Gram matrices (see [Haz01] for a complete introduction). We use ⟨x, y⟩ to denote the inner product between vectors x and y.

Let v₁, ⋯, v_n be n vectors in an inner product space and span{v₁, ⋯, v_n} be the linear subspace spanned by these n vectors with coefficients in C, i.e., {∑_{i∈[n]} α_i·v_i : α_i ∈ C for all i ∈ [n]}. The Gram matrix Gram_{v₁,⋯,v_n} of v₁, ⋯, v_n is the n × n matrix defined by Gram_{v₁,⋯,v_n}(i, j) = ⟨v_i, v_j⟩ for any i ∈ [n] and j ∈ [n].

Fact 4.1.1. det(Gram_{v₁,⋯,v_n}) is the square of the volume of the parallelotope formed by v₁, ⋯, v_n.

Let Gram_{v₁,⋯,v_{n−1}} be the Gram matrix of v₁, ⋯, v_{n−1}. Let v_n^∥ be the projection of v_n onto the linear subspace span{v₁, ⋯, v_{n−1}} and v_n^⊥ = v_n − v_n^∥ be the part orthogonal to span{v₁, ⋯, v_{n−1}}. We use ‖v‖ to denote the length of v in the inner product space, which is √⟨v, v⟩.

Claim 4.1.2.
\[
\|\vec{v}^{\,\perp}_n\|^2 = \frac{\det(\mathrm{Gram}_{\vec{v}_1,\dots,\vec{v}_n})}{\det(\mathrm{Gram}_{\vec{v}_1,\dots,\vec{v}_{n-1}})}.
\]

Proof.
\[
\det(\mathrm{Gram}_{\vec{v}_1,\dots,\vec{v}_n}) = \mathrm{volume}^2(\vec{v}_1,\dots,\vec{v}_n) = \mathrm{volume}^2(\vec{v}_1,\dots,\vec{v}_{n-1})\cdot\|\vec{v}^{\,\perp}_n\|^2 = \det(\mathrm{Gram}_{\vec{v}_1,\dots,\vec{v}_{n-1}})\cdot\|\vec{v}^{\,\perp}_n\|^2.
\]
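Claim 4.1.2 is the standard "base times height" identity, and it can be sanity-checked for ordinary Euclidean vectors. The sketch below (illustrative Python/NumPy with random vectors, not part of the proof) compares the Gram-determinant ratio against the residual of a least-squares projection:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 3))              # columns v_1, v_2, v_3 in R^5
gram = lambda M: M.T @ M                     # Gram matrix of the columns

# Orthogonal part of v_3 against span{v_1, v_2}, via least squares.
coef, *_ = np.linalg.lstsq(V[:, :2], V[:, 2], rcond=None)
v_perp = V[:, 2] - V[:, :2] @ coef

ratio = np.linalg.det(gram(V)) / np.linalg.det(gram(V[:, :2]))
print(ratio, np.dot(v_perp, v_perp))         # the two quantities agree
```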

We keep using the notation e^{2\pi i f_j t} to denote a vector, i.e., a function from [-T,T] to \mathbb{C}, and consider the inner product \langle e^{2\pi i f_i t}, e^{2\pi i f_j t}\rangle_T = \frac{1}{2T}\int_{-T}^{T} e^{2\pi i (f_i - f_j)t}\,\mathrm{d}t for complex exponential functions. We bound the determinant of the Gram matrix of e^{2\pi i f_1 t}, \dots, e^{2\pi i f_k t} as follows, which will be used extensively in Section 4.2.
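This inner product has the closed form \langle e^{2\pi i f_i t}, e^{2\pi i f_j t}\rangle_T = \frac{\sin(2\pi(f_i - f_j)T)}{2\pi(f_i - f_j)T}, which makes the Gram matrix easy to tabulate. The following sketch (illustrative Python/NumPy, not part of the dissertation; the frequency values are arbitrary) builds such a Gram matrix and checks that it is positive semidefinite with unit diagonal, as any Gram matrix must be:

```python
import numpy as np

def gram_exponentials(freqs, T):
    """Gram matrix of e^{2*pi*i*f_j*t} under <f, g>_T = (1/2T) * integral_{-T}^{T} f(t)*conj(g(t)) dt.
    The (i, j)-entry equals sinc(2*(f_i - f_j)*T), with numpy's normalized sinc."""
    f = np.asarray(freqs, dtype=float)
    diff = f[:, None] - f[None, :]
    return np.sinc(2.0 * diff * T)  # np.sinc(x) = sin(pi*x)/(pi*x), so the diagonal is 1

G = gram_exponentials([0.0, 0.3, 1.1, 2.5], T=1.0)
print(np.diag(G), np.linalg.eigvalsh(G).min(), np.linalg.det(G))
```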

Lemma 4.1.3. There exists a universal constant \alpha > 0 such that, for any T > 0 and real numbers f_1, \dots, f_k, the k \times k Gram matrix \mathrm{Gram}_{f_1,\dots,f_k} of e^{2\pi i f_1 t}, \dots, e^{2\pi i f_k t}, whose (i,j)-entry is \mathrm{Gram}_{f_1,\dots,f_k}(i,j) = \langle e^{2\pi i f_i t}, e^{2\pi i f_j t}\rangle_T, has a determinant between
\[
k^{-\alpha k^2}\prod_{i<j}\min\big((|f_i - f_j|T)^2, 1\big) \le \det\big(\mathrm{Gram}_{f_1,\dots,f_k}\big) \le k^{\alpha k^2}\prod_{i<j}\min\big((|f_i - f_j|T)^2, 1\big).
\]

We prove this lemma in Section 4.1.1.

4.1.1 The Determinant of Gram Matrices of Complex Exponentials

Because we can rescale T to 1 and f_i to f_i \cdot T, we replace the interval [-T,T] by [-1,1] and prove the following version: for real numbers f_1, \dots, f_k, let G_{f_1,\dots,f_k} be the matrix whose (i,j)-entry is
\[
\int_{-1}^{1} e^{2\pi i (f_i - f_j)t}\,\mathrm{d}t.
\]
We plan to prove
\[
\det(G_{f_1,\dots,f_k}) = k^{O(k^2)}\prod_{i<j}\min(|f_i - f_j|^2, 1). \tag{4.1}
\]

This suffices because
\[
\det(\mathrm{Gram}_{f_1,\dots,f_k}) = 2^{-k}\cdot\det(G_{f_1,\dots,f_k}) \in [k^{-\alpha k^2}, k^{\alpha k^2}]\cdot\prod_{i<j}\min\big((|f_i - f_j|T)^2, 1\big) \quad \text{for some } \alpha > 0.
\]

First, we note by the Cauchy–Binet formula that the determinant in (4.1) is equal to
\[
\int_{-1}^{1}\int_{-1}^{1}\cdots\int_{-1}^{1} \Big|\det\big([e^{2\pi i f_i t_j}]_{i,j}\big)\Big|^2\,\mathrm{d}t_1\,\mathrm{d}t_2\cdots\mathrm{d}t_k. \tag{4.2}
\]

We next need to consider the integrand in the special case when \sum_i |f_i| \le 1/8.

Lemma 4.1.4. If f_i \in \mathbb{R} and t_j \in \mathbb{R} satisfy \sum_i |f_i|\cdot(\max_i |t_i|) \le 1/8, then
\[
\Big|\det\big([e^{2\pi i f_i t_j}]_{i,j}\big)\Big| = \Theta\left((2\pi)^{\binom{k}{2}}\prod_{i<j}|f_i - f_j|\cdot\prod_{i<j}\frac{|t_i - t_j|}{j - i}\right).
\]

Proof. Firstly, by adding a constant to all the t_j we can make them non-negative. This multiplies the determinant by a root of unity, and at most doubles \sum_i |f_i|\cdot(\max_i |t_i|).

By continuity, it suffices to consider the ti to all be multiples of 1/N for some large

integer N. By multiplying all the tj by N and all fi by 1/N, we may assume that all of the

tj are non-negative integers with t1 ≤ t2 ≤ ... ≤ tk.
Let z_i = \exp(2\pi i f_i). Then our determinant is \det\big([z_i^{t_j}]_{i,j}\big), which is equal to the Vandermonde determinant times the Schur polynomial s_\lambda(z_i), where \lambda is the partition \lambda_j = t_j - (j-1).

Therefore, this determinant equals
\[
\prod_{i<j}(z_i - z_j)\cdot s_\lambda(z_1, z_2, \dots, z_k).
\]
The absolute value of \prod_{i<j}(z_i - z_j) is approximately (2\pi)^{\binom{k}{2}}\prod_{i<j}|f_i - f_j|, since |z_i - z_j| = 2|\sin(\pi(f_i - f_j))| and each |f_i - f_j| is small.

By standard results, sλ is a polynomial in the zi with non-negative coefficients, and all exponents at most maxj |tj| in each variable. Therefore, the monomials with non-zero coefficients will all have real part at least 1/2 and absolute value 1 when evaluated at the zi. Therefore,

|sλ(z1, . . . , zk)| = Θ(|sλ(1, 1,..., 1)|).

On the other hand, by the Weyl character formula,
\[
s_\lambda(1, 1, \dots, 1) = \prod_{i<j}\frac{t_j - t_i}{j - i}.
\]

This completes the proof.

Next we prove our theorem when the f_i have small total variation.
Lemma 4.1.5. If there exists an f_0 so that \sum_i |f_i - f_0| < 1/8, then
\[
\det(G_{f_1,\dots,f_k}) = \Theta\!\left((2\pi)^{k(k-1)}\prod_{i<j}\frac{|f_i - f_j|^2}{(j-i)^2}\cdot\prod_{n=0}^{k-1}\frac{2^{n+1}(n!)^2}{(n+1)(2n)!}\right).
\]

Proof. By translating the fi we can assume that f0 = 0.

By the above, we have that \det(G_{f_1,\dots,f_k}) is
\[
\Theta\!\left((2\pi)^{k(k-1)}\prod_{i<j}\frac{|f_i - f_j|^2}{(j-i)^2}\right)\cdot\int_{-1}^{1}\cdots\int_{-1}^{1}\prod_{i<j}(t_i - t_j)^2\,\mathrm{d}t_1\cdots\mathrm{d}t_k,
\]
and the remaining integral is a product of explicit constants,
\[
\prod_{n=0}^{k-1}\frac{2^{n+1}(n!)^2}{(n+1)(2n)!}.
\]
This completes the proof.

Next we extend this result to the case that all the f are within poly(k) of each other.

Claim 4.1.6. If there exists an f_0 so that |f_i - f_0| = \mathrm{poly}(k) for all i, then
\[
\det(G_{f_1,\dots,f_k}) = k^{O(k^2)}\prod_{i<j}\min(|f_i - f_j|^2, 1).
\]

Proof. We begin by proving the lower bound. We note that for 0 < x < 1,
\[
\det(G_{f_1,\dots,f_k}) \ge \int_{-x}^{x}\cdots\int_{-x}^{x}\Big|\det\big([e^{2\pi i f_i t_j}]_{i,j}\big)\Big|^2\,\mathrm{d}t_1\cdots\mathrm{d}t_k = x^k\det(G_{x f_1, x f_2, \dots, x f_k}).
\]
Taking x = 1/\mathrm{poly}(k), we may apply the above lemma to compute the determinant on the right hand side, yielding an appropriate lower bound.

To prove the upper bound, we note that we can divide our f_i into clusters C_i, where for any i, j in the same cluster |f_i - f_j| < 1/k and for i and j in different clusters |f_i - f_j| \ge 1/k^2. We then note, as a property of Gram matrices, that
\[
\det(G_{f_1,\dots,f_k}) \le \prod_{C_i}\det(G_{\{f_j \in C_i\}}) = k^{O(k^2)}\prod_{i<j}|f_i - f_j|^2 = k^{O(k^2)}\prod_{i<j}\min(|f_i - f_j|^2, 1).
\]

Finally, we are ready to prove our Theorem.

Proof. Let I(t) be the indicator function of the interval [−1, 1].

From Lemma 6.6 in [CKPS16], there is a function h(t) so that, for any function f that is a linear combination of at most k complex exponentials, \|h(t)f(t)\|_2 = \Theta(\|I(t)f(t)\|_2), and so that \hat{h} is supported on an interval of length \mathrm{poly}(k) < k^C about the origin.

Note that we can divide our f_i into clusters C so that for i and j in the same cluster |f_i - f_j| < k^{C+1}, and for i and j in different clusters |f_i - f_j| > k^C.
Let \tilde{G}_{f_1, f_2, \dots, f_{k'}} be the matrix with (i,j)-entry \int_{\mathbb{R}}|h(t)|^2 e^{(2\pi i)(f_i - f_j)t}\,\mathrm{d}t. We claim that for any k' \le k,
\[
\det(\tilde{G}_{f_1, f_2, \dots, f_{k'}}) = 2^{O(k')}\det(G_{f_1, f_2, \dots, f_{k'}}).
\]

This is because both are Gram determinants, one for the set of functions I(t)\exp((2\pi i) f_j t) and the other for h(t)\exp((2\pi i) f_j t). However, since any linear combination of the former has L_2 norm a constant multiple of that of the same linear combination of the latter, we have that
\[
\tilde{G}_{f_1, f_2, \dots, f_{k'}} = \Theta(G_{f_1, f_2, \dots, f_{k'}})
\]
as self-adjoint matrices. This implies the appropriate bound.

Therefore, we have that
\[
\det(G_{f_1,\dots,f_k}) = 2^{O(k)}\det(\tilde{G}_{f_1,\dots,f_k}).
\]

However, note that by the Fourier support of h,
\[
\int_{\mathbb{R}}|h(t)|^2 e^{(2\pi i)(f_i - f_j)t}\,\mathrm{d}t = 0
\]
if |f_i - f_j| > k^C, which happens if i and j are in different clusters. Therefore \tilde{G} is block diagonal, and hence its determinant equals
\[
\det(\tilde{G}_{f_1,\dots,f_k}) = \prod_{C}\det(\tilde{G}_{\{f_j \in C\}}) = 2^{O(k)}\prod_{C}\det(G_{\{f_j \in C\}}).
\]
However, Claim 4.1.6 above shows that
\[
\prod_{C}\det(G_{\{f_j \in C\}}) = k^{O(k^2)}\prod_{i<j}\min(1, |f_i - f_j|^2).
\]
Combining the equations above completes the proof.

4.2 Shifting One Frequency

We finish the proof of Lemma 4.0.3 in this section. We plan to shift fk to fk+1 and prove that any vector in span{e2πif1t, ··· , e2πifk−1t, e2πifkt} is close to some vector in the linear subspace span{e2πif1t, ··· , e2πifk−1t, e2πifk+1t}.

For convenience, we use \vec{u}^{\,\parallel} to denote the projection of the vector e^{2\pi i f_k t} onto the linear subspace U = \mathrm{span}\{e^{2\pi i f_1 t}, \dots, e^{2\pi i f_{k-1} t}\} and \vec{w}^{\,\parallel} to denote the projection of the vector e^{2\pi i f_{k+1} t} onto this linear subspace U. Let \vec{u}^{\,\perp} = e^{2\pi i f_k t} - \vec{u}^{\,\parallel} and \vec{w}^{\,\perp} = e^{2\pi i f_{k+1} t} - \vec{w}^{\,\parallel} be their parts orthogonal to U, respectively.

From the definition e^{2\pi i f_k t} = \vec{u}^{\,\parallel} + \vec{u}^{\,\perp} and \vec{u}^{\,\parallel} \in U = \mathrm{span}\{e^{2\pi i f_1 t}, \dots, e^{2\pi i f_{k-1} t}\}, we rewrite the linear combination
\[
f(t) = \sum_{j=1}^{k} v_j e^{2\pi i f_j t} = \sum_{j=1}^{k-1}\alpha_j e^{2\pi i f_j t} + v_k\cdot\vec{u}^{\,\perp}
\]
for some scalars \alpha_1, \dots, \alpha_{k-1}. We substitute \vec{u}^{\,\perp} by \vec{w}^{\,\perp} in the above linear combination and find a set of new coefficients. Let \vec{w}^{\,\perp} = \vec{w}_1 + \vec{w}_2, where \vec{w}_1 = \frac{\langle \vec{u}^{\,\perp}, \vec{w}^{\,\perp}\rangle}{\|\vec{u}^{\,\perp}\|_2^2}\,\vec{u}^{\,\perp} is the projection of \vec{w}^{\,\perp} onto \vec{u}^{\,\perp}. Therefore \vec{w}_2 is the part of the vector e^{2\pi i f_{k+1} t} orthogonal to V = \mathrm{span}\{e^{2\pi i f_1 t}, \dots, e^{2\pi i f_{k-1} t}, e^{2\pi i f_k t}\}.
We use \delta = \frac{\|\vec{w}_2\|_2}{\|\vec{w}^{\,\perp}\|_2} for convenience. Notice that \min_\beta \frac{\|\vec{u}^{\,\perp} - \beta\cdot\vec{w}^{\,\perp}\|_2}{\|\vec{u}^{\,\perp}\|_2} = \delta and that \beta^* = \frac{\langle \vec{u}^{\,\perp}, \vec{w}^{\,\perp}\rangle}{\|\vec{w}^{\,\perp}\|_2^2} is the optimal choice. Therefore we set
\[
f'(t) = \sum_{j=1}^{k-1}\beta_j e^{2\pi i f_j t} + v_k\cdot\beta^*\cdot\vec{w}^{\,\perp} \in \mathrm{span}\{e^{2\pi i f_1 t}, \dots, e^{2\pi i f_{k-1} t}, e^{2\pi i f_{k+1} t}\}
\]
where the coefficients \beta_1, \dots, \beta_{k-1} guarantee that the projection of f' onto U is the same as the projection of f onto U. From the choice of \beta^* and the definition of f',
\[
\|f(t) - f'(t)\|_2^2 = \delta^2\cdot|v_k|^2\cdot\|\vec{u}^{\,\perp}\|_2^2 \le \delta^2\cdot\|f(t)\|_2^2.
\]

Finally, we show an upper bound for \delta^2 using Claim 4.1.2.
\begin{align*}
\delta^2 &= \frac{\|\vec{w}_2\|_2^2}{\|\vec{w}^{\,\perp}\|_2^2}\\
&= \frac{\det(\mathrm{Gram}_{e^{2\pi i f_1 t},\dots,e^{2\pi i f_{k-1}t},\, e^{2\pi i f_k t},\, e^{2\pi i f_{k+1}t}})}{\det(\mathrm{Gram}_V)}\Bigg/\frac{\det(\mathrm{Gram}_{e^{2\pi i f_1 t},\dots,e^{2\pi i f_{k-1}t},\, e^{2\pi i f_{k+1}t}})}{\det(\mathrm{Gram}_U)} && \text{by Claim 4.1.2.}
\end{align*}
Applying Lemma 4.1.3 to each of the four determinants, every factor \min(|f_i - f_j|T, 1) appears once in a numerator and once in a denominator except the one for the pair (f_k, f_{k+1}), so
\[
\delta^2 \le k^{4\alpha k^2}\cdot\min\big((|f_k - f_{k+1}|T)^2, 1\big) \le k^{4\alpha k^2}\,|f_k - f_{k+1}|^2 T^2.
\]

Chapter 5

Fourier-clustered Signal Recovery

In this chapter, we reduce the recovery of signals f(t), whose frequency representations \hat{f} are restricted to a small band [-\Delta, \Delta], to the interpolation of low-degree polynomials. We first show that any such f(t) can be approximated by low-degree polynomials. In this chapter, we set [-T,T] to be the interval of observations and use \langle f, g\rangle = \frac{1}{2T}\int_{-T}^{T} f(t)\overline{g(t)}\,\mathrm{d}t to denote the inner product of two signals f and g, so that \|f\|_2^2 = \langle f, f\rangle.
Theorem 5.0.1. For any \Delta > 0 and any \varepsilon > 0, let f(t) = \sum_{j\in[k]} v_j e^{2\pi i f_j t} where |f_j| \le \Delta for each j \in [k]. There exists a polynomial P(t) of degree at most
\[
d = O(T\Delta + k^3\log k + k\log 1/\varepsilon)
\]
such that
\[
\|P(t) - f(t)\|_2^2 \le \varepsilon\|f\|_2^2.
\]

In Chapter 3, Theorem 3.3.1 shows that any degree k-1 polynomial P(t) can be approximated by a signal with a k-sparse Fourier transform. Theorem 5.0.1 provides an approximation in the reverse direction. Notice that the dependency on \Delta T is necessary for a signal like f(t) = \sin(2\pi\cdot\Delta t).

Next we provide an efficient algorithm that recovers polynomials under noise with an optimal sample complexity (up to a constant factor).

Theorem 5.0.2. For any degree d polynomial P(t) and an arbitrary noise function g(t), there exists an algorithm that takes O(d) samples from x(t) = P(t) + g(t) over [-T,T] and reports a degree d polynomial Q(t) in time O(d^3) such that, with probability at least 99/100,
\[
\|P(t) - Q(t)\|_2^2 \lesssim \|g(t)\|_2^2.
\]

A direct corollary of the above two theorems indicates an efficient algorithm to recover signals f(t) whose frequencies are restricted to a small band [−∆, ∆].

Corollary 5.0.3. For any k > 0, T > 0, \Delta > 0, and \varepsilon > 0, there exist d = O(T\Delta + k^3\log k + k\log 1/\varepsilon) and an efficient algorithm that takes O(d) samples from y(t) = f(t) + g(t), where f(t) = \sum_{j\in[k]} v_j e^{2\pi i f_j t} with |f_j| \le \Delta and g is an arbitrary noise function, and outputs a degree d polynomial Q(t) in time O(d^3) such that, with probability at least 99/100,
\[
\|f - Q\|_2^2 \lesssim \|g\|_2^2 + \varepsilon\|f\|_2^2.
\]

In the rest of this chapter, we prove Theorem 5.0.1 in Section 5.1 and Theorem 5.0.2 in Section 5.2.

5.1 Band-limited Signals to Polynomials

We first prove a special case of Theorem 5.0.1 for k-Fourier-sparse signals with a frequency gap bounded away from zero. To prove this, we bound the coefficients v_1, \dots, v_k in f(t) = \sum_{j=1}^{k} v_j e^{2\pi i f_j t} by its energy \|f\|_2^2.

Lemma 5.1.1. There exists a universal constant c > 0 such that for any f(t) = \sum_{j=1}^{k} v_j e^{2\pi i f_j t} with frequency gap \eta = \min_{i\ne j}|f_i - f_j|, we have
\[
\|f(t)\|_2^2 \ge k^{-ck^2}\cdot\min\big((\eta T)^{2k}, 1\big)\cdot\sum_{j=1}^{k}|v_j|^2.
\]
Proof. Let \vec{v}_i denote the vector e^{2\pi i f_i t} and V = \{\vec{v}_1, \dots, \vec{v}_k\}. Notice that \|\vec{v}_i\|_2^2 = \langle \vec{v}_i, \vec{v}_i\rangle = 1. For each \vec{v}_i, we define \vec{v}^{\,\parallel}_i to be the projection of \vec{v}_i onto the linear subspace \mathrm{span}\{V\setminus\vec{v}_i\} = \mathrm{span}\{\vec{v}_1,\dots,\vec{v}_{i-1},\vec{v}_{i+1},\dots,\vec{v}_k\} and \vec{v}^{\,\perp}_i = \vec{v}_i - \vec{v}^{\,\parallel}_i, which is orthogonal to \mathrm{span}\{V\setminus\vec{v}_i\} by definition.

Therefore, from the orthogonality,
\[
\|f(t)\|_2^2 \ge \max_{j\in[k]}\big\{|v_j|^2\cdot\|\vec{v}^{\,\perp}_j\|_2^2\big\} \ge \frac{1}{k}\sum_{j=1}^{k}|v_j|^2\cdot\|\vec{v}^{\,\perp}_j\|_2^2.
\]
It is enough to estimate \|\vec{v}^{\,\perp}_j\|_2^2 from Claim 4.1.2:
\[
\|\vec{v}^{\,\perp}_j\|_2^2 = \frac{\det(\mathrm{Gram}(V))}{\det(\mathrm{Gram}(V\setminus\vec{v}_j))} \ge k^{-2\alpha k^2}\prod_{i\ne j}\min\big(((f_j - f_i)T)^2, 1\big) \ge k^{-2\alpha k^2}\min\big((\eta T)^{2k-2}, 1\big),
\]
where we use Lemma 4.1.3 to lower bound it in the last step.

We show that for signals with a frequency gap, its Taylor expansion is a good ap- proximation.
Lemma 5.1.2 (Existence of a low degree polynomial). Let f(t) = \sum_{j=1}^{k} v_j e^{2\pi i f_j t}, where |f_j| \le \Delta for all j \in [k] and \min_{i\ne j}|f_i - f_j| \ge \eta. There exists a polynomial Q(t) of degree
\[
d = O\big(T\Delta + k\log 1/(\eta T) + k^2\log k + k\log(1/\varepsilon)\big)
\]
such that
\[
\|Q(t) - f(t)\|_2^2 \le \varepsilon\|f(t)\|_2^2. \tag{5.1}
\]
Proof. For each frequency f_j, let Q_j(t) = \sum_{l=0}^{d-1}\frac{(2\pi i f_j t)^l}{l!} be the first d terms in the Taylor expansion of e^{2\pi i f_j t}. For any t \in [-T,T], we know the difference between Q_j(t) and e^{2\pi i f_j t} is at most
\[
|Q_j(t) - e^{2\pi i f_j t}| \le \left|\frac{(2\pi i f_j T)^d}{d!}\right| \le \left(\frac{2\pi T\Delta\cdot e}{d}\right)^d.
\]
We define
\[
Q(t) = \sum_{j=1}^{k} v_j Q_j(t)
\]
and bound the distance between Q and f from the above estimation:
\begin{align*}
\|Q(t) - f(t)\|_2^2 &= \frac{1}{2T}\int_{-T}^{T}|Q(t) - f(t)|^2\,\mathrm{d}t\\
&= \frac{1}{2T}\int_{-T}^{T}\Big|\sum_{j=1}^{k} v_j\big(Q_j(t) - e^{2\pi i f_j t}\big)\Big|^2\,\mathrm{d}t\\
&\le k\sum_{j=1}^{k}|v_j|^2\cdot\frac{1}{2T}\int_{-T}^{T}\big|Q_j(t) - e^{2\pi i f_j t}\big|^2\,\mathrm{d}t && \text{by the Cauchy–Schwarz inequality}\\
&\le k\sum_{j=1}^{k}|v_j|^2\cdot\Big(\frac{2\pi T\Delta\cdot e}{d}\Big)^{2d} && \text{by the Taylor expansion bound.}
\end{align*}
On the other hand, from Lemma 5.1.1, we know
\[
\|f(t)\|_2^2 \ge (\eta T)^{2k}\cdot k^{-ck^2}\sum_{j}|v_j|^2.
\]
Because d = 10\pi e\big(T\Delta + k\log 1/(\eta T) + k^2\log k + k\log(1/\varepsilon)\big) is large enough, we have
\[
k\Big(\frac{2\pi T\Delta\cdot e}{d}\Big)^{2d} \le \varepsilon(\eta T)^{2k}\cdot k^{-ck^2},
\]
which indicates that \|Q(t) - f(t)\|_2^2 \le \varepsilon\|f\|_2^2 from all discussion above.
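The truncation step above is easy to check numerically. In the sketch below (illustrative Python, with arbitrary small values f = 1.5, T = 1, d = 40 standing in for f_j, T, and d), the degree-(d-1) truncation of e^{2\pi i f t} stays within the (2\pi T\Delta\cdot e/d)^d bound on a grid of sample points:

```python
import numpy as np
from math import factorial, pi, e

def taylor_trunc(f, ts, d):
    """First d terms of the Taylor series of e^{2*pi*i*f*t}, evaluated at the points ts."""
    z = 2j * pi * f * ts
    return sum(z**l / factorial(l) for l in range(d))

f, T, d = 1.5, 1.0, 40                       # arbitrary illustrative values
ts = np.linspace(-T, T, 201)
err = np.abs(taylor_trunc(f, ts, d) - np.exp(2j * pi * f * ts)).max()
bound = (2 * pi * f * T * e / d) ** d        # the (2*pi*T*Delta*e/d)^d bound from the proof
print(err, bound)
```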

Proof of Theorem 5.0.1. We first use Lemma 4.0.2 on f to obtain f' with a frequency gap \eta \ge \frac{\varepsilon}{T\cdot k^{Ck^2}} satisfying \|f - f'\|_2 \le \varepsilon\|f\|_2. Then we use Lemma 5.1.2 on f' to obtain a degree
\[
d = O\big(T\Delta + k\log 1/(\eta T) + k^2\log k + k\log(1/\varepsilon)\big) = O\big(T\Delta + k^3\log k + k\log(1/\varepsilon)\big)
\]
polynomial Q satisfying \|Q - f'\|_2 \le \varepsilon\|f'\|_2. Hence
\[
\|Q - f\|_2 \le \|f - f'\|_2 + \|f' - Q\|_2 \le 3\varepsilon\|f\|_2.
\]
5.2 Robust Polynomial Interpolation
We show how to learn a degree-d polynomial P with n = O(d) samples and prove Theorem 5.0.2 in this section. By defining \tilde{f}(t) = f(Tt), we rescale the interval from [-T,T] to [-1,1] and use \|f\|_2^2 = \frac{1}{2}\int_{-1}^{1}|f(t)|^2\,\mathrm{d}t.

Lemma 5.2.1. Let d \in \mathbb{N} and \varepsilon \in \mathbb{R}^+. There exists an efficient algorithm to compute a partition of [-1,1] into n = O(d/\varepsilon) intervals I_1, \dots, I_n such that for any degree d polynomial P(t): \mathbb{R} \to \mathbb{C} and any n points t_1, \dots, t_n in the intervals I_1, \dots, I_n respectively, the function Q(t) defined by

Q(t) = P (tj) if t ∈ Ij

approximates P by

\[
\|Q - P\|_2 \le \varepsilon\|P\|_2. \tag{5.2}
\]

One direct corollary of the above lemma is that observing n = O(d/\varepsilon) points, one from each of I_1, \dots, I_n, provides a good approximation for all degree d polynomials. Recall that given a sequence S = (t_1, \dots, t_m) and corresponding weights (w_1, \dots, w_m), \|f\|_{S,w} = \big(\sum_{i=1}^{m} w_i\cdot|f(t_i)|^2\big)^{1/2}.

Corollary 5.2.2. Let I_1, \dots, I_n be the intervals in the above lemma and w_j = |I_j|/2 for each j \in [n]. For any t_1, \dots, t_n in the intervals I_1, \dots, I_n respectively, let S = (t_1, \dots, t_n) with weights (w_1, \dots, w_n). Then for any degree d polynomial P, we have
\[
\|P\|_{S,w} \in \big[(1-\varepsilon)\|P\|_2, (1+\varepsilon)\|P\|_2\big].
\]

We first prove one property of low degree polynomials using the Legendre basis.

Lemma 5.2.3. For any degree d polynomial P (t): R → C with derivative P 0(t), we have, Z 1 Z 1 (1 − t2)|P 0(t)|2dt ≤ 2d2 |P (t)|2dt. (5.3) −1 −1
Proof. Given a degree d polynomial P(x), we rewrite P(x) as a linear combination of the Legendre polynomials:
\[
P(x) = \sum_{i=0}^{d}\alpha_i L_i(x).
\]
We use F_i(x) = (1-x^2)L_i'(x) for convenience. From the definition of the Legendre polynomials in Equation (3.5), F_i'(x) = -i(i+1)\cdot L_i(x). Hence, noting that (1-x^2)P'(x) = \sum_i \alpha_i F_i(x), integrating by parts with F_i(\pm 1) = 0, and using the orthogonality of the Legendre polynomials, we have
\begin{align*}
\int_{-1}^{1}(1-x^2)|P'(x)|^2\,\mathrm{d}x &= \int_{-1}^{1}\Big(\sum_{i}\alpha_i F_i(x)\Big)\cdot\overline{P'(x)}\,\mathrm{d}x\\
&= \Big[\sum_{i}\alpha_i F_i(x)\cdot\overline{P(x)}\Big]_{-1}^{1} - \int_{-1}^{1}\Big(\sum_{i}\alpha_i F_i'(x)\Big)\cdot\overline{P(x)}\,\mathrm{d}x\\
&= \int_{-1}^{1}\Big(\sum_{i}\alpha_i\cdot i(i+1)\cdot L_i(x)\Big)\cdot\Big(\sum_{j}\overline{\alpha_j}\, L_j(x)\Big)\,\mathrm{d}x\\
&= \sum_{i}|\alpha_i|^2\, i(i+1)\,\|L_i\|_2^2 \le d(d+1)\|P\|_2^2.
\end{align*}
Proof of Lemma 5.2.1. We set m = 10d/\varepsilon and show a partition of [-1,1] into n \le 20m intervals. We define g(t) = \frac{\sqrt{1-t^2}}{m} and y_0 = 0. Then we choose y_i = y_{i-1} + g(y_{i-1}) for i \in \mathbb{N}^+. Let l be the first index such that y_l \ge 1 - \frac{9}{m^2}. We show l \lesssim m.
Let j_k be the first index in the sequence such that y_{j_k} \ge 1 - 2^{-k}. Notice that
\[
j_2 \le \frac{3/4}{\sqrt{1-(3/4)^2}/m} \le 1.5m
\quad\text{and}\quad
y_i - y_{i-1} = g(y_{i-1}) = \frac{\sqrt{1-y_{i-1}^2}}{m} \ge \frac{\sqrt{1-y_{i-1}}}{m}.
\]
Then for all k > 2, every step before reaching j_k has 1 - y > 2^{-k}, so
\[
j_k - j_{k-1} \le \frac{2^{-k}}{2^{-k/2}/m} = 2^{-k/2}m.
\]
Therefore j_k \le \big(1.5 + (2^{-3/2} + \cdots + 2^{-k/2})\big)m and l \le 10m.
Because y_{l-1} \le 1 - \frac{9}{m^2}, for any i \in [l] and any t \in [y_{i-1}, y_i], we have the following property:
\[
\frac{1-t^2}{m^2} \ge \frac{1}{2}\cdot\frac{1-y_{i-1}^2}{m^2} = (y_i - y_{i-1})^2/2. \tag{5.4}
\]

Now we set n and partition [−1, 1] into I1, ··· ,In as follows:

1. n = 2(l + 1).

2. For j ∈ [l], I2j−1 = [yj−1, yj] and I2j = [−yj, −yj−1].

3. I2l+1 = [yl, 1] and I2l+2 = [−1, −yl].
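The breakpoints y_0, y_1, \dots, y_l and the resulting intervals can be generated directly from the recurrence above. The following sketch (illustrative Python, with an arbitrary choice m = 200) builds the partition and reports l and n = 2(l+1), which stay within the bounds l \le 10m and n \le 20m + 2 claimed in the proof:

```python
import numpy as np

def build_partition(m):
    """Breakpoints y_0 = 0, y_i = y_{i-1} + sqrt(1 - y_{i-1}^2)/m, stopped at the
    first y_l >= 1 - 9/m^2, following the construction in the proof of Lemma 5.2.1."""
    ys = [0.0]
    while ys[-1] < 1 - 9 / m**2:
        y = ys[-1]
        ys.append(y + np.sqrt(1 - y * y) / m)
    l = len(ys) - 1
    # Intervals [y_{j-1}, y_j], their mirror images, and the two end caps.
    right = [(ys[j - 1], ys[j]) for j in range(1, l + 1)]
    left = [(-b, -a) for (a, b) in right]
    return ys, right + left + [(ys[-1], 1.0), (-1.0, -ys[-1])]

ys, intervals = build_partition(m=200)
l = len(ys) - 1
print(l, len(intervals))
```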
For any t_1, \dots, t_n where t_j \in I_j for each j \in [n], we rewrite the LHS of (5.2) as follows:
\[
\underbrace{\sum_{j=1}^{n-2}\int_{I_j}|P(t_j) - P(t)|^2\,\mathrm{d}t}_{A} + \underbrace{\int_{I_{n-1}}|P(t_{n-1}) - P(t)|^2\,\mathrm{d}t + \int_{I_n}|P(t_n) - P(t)|^2\,\mathrm{d}t}_{B}. \tag{5.5}
\]
For A in Equation (5.5), from the Cauchy–Schwarz inequality, we have
\[
\sum_{j=1}^{n-2}\int_{I_j}|P(t_j) - P(t)|^2\,\mathrm{d}t = \sum_{j=1}^{n-2}\int_{I_j}\left|\int_{t_j}^{t}P'(y)\,\mathrm{d}y\right|^2\mathrm{d}t \le \sum_{j=1}^{n-2}\int_{I_j}|t - t_j|\int_{t_j}^{t}|P'(y)|^2\,\mathrm{d}y\,\mathrm{d}t.
\]
Algorithm 2 RobustPolynomialLearningFixedInterval
1: procedure RobustPolynomialLearningFixedInterval(y, d)
2:   \varepsilon \leftarrow 1/20.
3:   Let I_1, \dots, I_n be the intervals from Lemma 5.2.1 with parameters d and \varepsilon.
4:   Randomly choose t_j \in I_j for every j \in [n].
5:   Define S = \{t_1, \dots, t_n\} with weights w_1 = \frac{|I_1|}{2}, \dots, w_n = \frac{|I_n|}{2}.
6:   Q(t) = \arg\min_{\deg(Q)=d}\{\|Q - y\|_{S,w}\}.
7:   Return Q(t).
8: end procedure

Then we swap dt with dy and use Equation (5.4):
\[
\sum_{j=1}^{n-2}\int_{I_j}\int_{t\notin(t_j,y)}|P'(y)|^2\,|t - t_j|\,\mathrm{d}t\,\mathrm{d}y \le \sum_{j=1}^{n-2}\int_{I_j}|P'(t)|^2\cdot|I_j|^2\,\mathrm{d}t \le \sum_{j=1}^{n-2}\int_{I_j}|P'(t)|^2\,\frac{2(1-t^2)}{m^2}\,\mathrm{d}t.
\]
We use Lemma 5.2.3 to simplify it by
\[
\sum_{j=1}^{n-2}\int_{I_j}|P(t_j) - P(t)|^2\,\mathrm{d}t \le \int_{-1}^{1}|P'(t)|^2\,\frac{2(1-t^2)}{m^2}\,\mathrm{d}t \le \frac{4d^2}{m^2}\int_{-1}^{1}|P(t)|^2\,\mathrm{d}t.
\]
For B in Equation (5.5), notice that |I_{n-1}| = |I_n| = 1 - y_l \le 9m^{-2} and, for j \in \{n-1, n\},
\[
|P(t) - P(t_j)|^2 \le 4\max_{t\in[-1,1]}|P(t)|^2 \le 4(d+1)^2\|P\|_2^2
\]
from the properties of degree-d polynomials, i.e., Theorem 3.3.2. Therefore B in Equation (5.5) is upper bounded by 2\cdot 4(d+1)^2(9m^{-2})\|P(t)\|_2^2.

From all discussion above, \|Q(t) - P(t)\|_2^2 \le \frac{99d^2}{m^2}\|P(t)\|_2^2 \le \varepsilon^2\|P(t)\|_2^2.
Now we use the above lemma to provide a faster learning algorithm for polynomials on the interval [-1,1] with noise, instead of using the \varepsilon-net argument. We show it in Algorithm 2.

Lemma 5.2.4. For any degree d polynomial P(t) and an arbitrary noise function g(t), Algorithm RobustPolynomialLearningFixedInterval takes O(d) samples from y(t) = P(t) + g(t) over [-1,1] and reports a degree d polynomial Q(t) in time O(d^3) such that, with probability at least 99/100,
\[
\|P(t) - Q(t)\|_2^2 \lesssim \|g(t)\|_2^2.
\]
Proof. Notice that n = O(d/\varepsilon) = O(d), and finding a degree d polynomial Q(t) minimizing \|y(t) - Q(t)\|_{S,w} is equivalent to computing a pseudoinverse, which takes O(d^3) time. It is enough to bound the distance between P and Q:

\begin{align*}
\|P - Q\|_2 &\le 1.09\|P - Q\|_{S,w} && \text{by Corollary 5.2.2}\\
&= 1.09\|y - g - Q\|_{S,w} && \text{by } y = P + g\\
&\le 1.09\|g\|_{S,w} + 1.09\|y - Q\|_{S,w} && \text{by the triangle inequality}\\
&\le 1.09\|g\|_{S,w} + 1.09\|y - P\|_{S,w} && \text{by } Q = \arg\min_{\text{degree-}d\ R}\|R - y\|_{S,w}\\
&\le 2.2\|g\|_{S,w}.
\end{align*}
Because \mathbb{E}_S[\|g\|_{S,w}^2] = \|g\|_2^2, we know that \|P - Q\|_2^2 \lesssim \|g\|_2^2 with probability \ge 0.99 by Markov's inequality.
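Line 6 of Algorithm 2 is a weighted least-squares solve. The sketch below (illustrative Python/NumPy; for simplicity it uses n equal-length intervals with one uniform point each, rather than the partition of Lemma 5.2.1, so the constants differ from the lemma's) shows the recovery step on a noisy degree-5 polynomial:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
coeffs = rng.standard_normal(d + 1)            # ground-truth polynomial P
P = np.polynomial.Polynomial(coeffs)

# Simplified stand-in for Lemma 5.2.1: n equal intervals, one uniform point in each.
n = 200
edges = np.linspace(-1, 1, n + 1)
ts = rng.uniform(edges[:-1], edges[1:])
w = np.full(n, 1.0 / n)                        # w_j = |I_j| / 2

ys = P(ts) + 0.001 * rng.standard_normal(n)    # noisy observations y = P + g
# Weighted least squares: minimize sum_j w_j * |Q(t_j) - y_j|^2 over degree-d Q.
A = np.vander(ts, d + 1, increasing=True) * np.sqrt(w)[:, None]
q = np.linalg.lstsq(A, ys * np.sqrt(w), rcond=None)[0]
print(np.abs(q - coeffs).max())                # coefficient error is small
```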

Chapter 6

Query and Active Learning of Linear Families

In Chapter 5, we show that O(d) samples suffice to robustly interpolate degree d polynomials under noisy observations. In this chapter, we generalize this result to any linear family under any distribution over its support and improve the guarantee of the output.

Let \mathcal{F} be a linear family of dimension d and D be a distribution over the domain of \mathcal{F}. Recall that the worst-case condition number and the average condition number are
\[
K_{\mathcal{F}} = \sup_{x\in\mathrm{supp}(D)}\ \sup_{f\in\mathcal{F}}\frac{|f(x)|^2}{\|f\|_D^2} \quad\text{and}\quad \kappa_{\mathcal{F}} = \mathop{\mathbb{E}}_{x\sim D}\left[\sup_{f\in\mathcal{F}}\frac{|f(x)|^2}{\|f\|_D^2}\right].
\]
We first show that for any linear family \mathcal{F} and any distribution D over the support of \mathcal{F}, the average condition number is \kappa_{\mathcal{F}} = d. For convenience, in this chapter let \langle f, g\rangle = \mathbb{E}_{x\sim D}[f(x)\overline{g(x)}] denote the inner product under D and \|f\|_D = (\mathbb{E}_{x\sim D}[|f(x)|^2])^{1/2}.

Lemma 6.0.1. For any linear family \mathcal{F} of dimension d,
\[
\mathop{\mathbb{E}}_{x\sim D}\Big[\sup_{h\in\mathcal{F}:\|h\|_D=1}|h(x)|^2\Big] = d,
\]
so that D_{\mathcal{F}}(x) = D(x)\cdot\sup_{h\in\mathcal{F}:\|h\|_D=1}|h(x)|^2/d has a condition number K_{D_{\mathcal{F}}} = d. Moreover, there exists an efficient algorithm to sample x from D_{\mathcal{F}} and compute its weight \frac{D(x)}{D_{\mathcal{F}}(x)}.

Based on this condition number d, we use the matrix Chernoff bound to show that

O(d log d) i.i.d. samples from DF suffice to learn F.

However, this approach needs Ω(d log d) samples due to a coupon-collector argument, because it only samples points from one distribution DF. Next we consider how to improve
it to O(d) using linear-size spectral sparsification [BSS12, LS15]. We need our sample points to be sampled non-independently (to avoid coupon-collector issues) but still fairly randomly (so adversarial noise cannot predict them). A natural approach is to design a

sequence of distributions D1, ··· ,Dm (m is not necessarily fixed) then sample xi ∼ Di and

assign a weight wi for xi, where Di+1 could depend on the previous points x1, ··· , xi.

Ideally each K_{D_i} would be O(d), but we do not know how to produce such distributions while still getting linear-sample spectral sparsification to guarantee \|h\|_{S,w} \approx \|h\|_D for every h \in \mathcal{F}. Therefore we use a coefficient \alpha_i to control every K_{D_i}, and set w_i = \alpha_i\cdot\frac{D(x_i)}{D_i(x_i)} instead of \frac{D(x_i)}{m\,D_i(x_i)}.

Definition 6.0.2. Given a linear family F and underlying distribution D, let P be a random sampling procedure that terminates in m iterations (m is not necessarily fixed) and provides a coefficient αi and a distribution Di to sample xi ∼ Di in every iteration i ∈ [m].

We say P is an ε-well-balanced sampling procedure if it satisfies the following two properties:
1. Let the weight w_i = \alpha_i\cdot\frac{D(x_i)}{D_i(x_i)}. With probability 1 - 10^{-3},
\[
\sum_{i=1}^{m} w_i\cdot|h(x_i)|^2 \in \Big[\frac{3}{4}, \frac{5}{4}\Big]\cdot\|h\|_D^2 \quad \forall h \in \mathcal{F}.
\]

2. For a universal constant C, the coefficients always satisfy \sum_i \alpha_i \le \frac{5}{4} and \alpha_i\cdot K_{D_i} \le \varepsilon/C.
Intuitively, the first property says that the sampling procedure preserves the signal, and the second property says that the recovery algorithm does not blow up the noise on average. For such sampling procedures we consider the weighted ERM \tilde{f} \in \mathcal{F} minimizing \sum_i w_i|\tilde{f}(x_i) - y_i|^2. We prove that \tilde{f} satisfies the desired guarantee:
Theorem 6.0.3. Given a linear family \mathcal{F}, joint distribution (D,Y), and \varepsilon > 0, let P be an \varepsilon-well-balanced sampling procedure for \mathcal{F} and D, and let f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2] be the true risk minimizer. Then the weighted ERM \tilde{f} resulting from P satisfies
\[
\|f - \tilde{f}\|_D^2 \le \varepsilon\cdot\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]
\]
with 99% probability.

Well-balanced sampling procedures. We observe that two standard sampling proce- dures are well-balanced, so they yield agnostic recovery guarantees by Theorem 6.0.3. The 0 simplest approach is to set each Di to a fixed distribution D and αi = 1/m for all i. For m = O(KD0 log d + KD0 /ε), this gives an ε-well-balanced sampling procedure. These results appear in Section 6.3.

We get a stronger result of m = O(d/ε) using the randomized BSS algorithm by Lee and Sun [LS15]. The [LS15] algorithm iteratively chooses points xi from distributions Di. A term considered in their analysis—the largest increment of eigenvalues—is equivalent to our KDi . By looking at the potential functions in their proof, we can extract coefficients

αi bounding αiKDi in our setting. This lets us show that the algorithm is a well-balanced sampling procedure; we do so in Section 6.4.

Next we generalize the above result to active learning, where the algorithm receives a set of unlabelled points xi from the unknown distribution D and chooses a subset of these points to receive their labels.
Theorem 6.0.4. Consider any dimension d linear space \mathcal{F} of functions from a domain G to \mathbb{C}. Let (D,Y) be a joint distribution over G \times \mathbb{C} and f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2]. Let K = \sup_{x\in G}\ \sup_{h\in\mathcal{F}:h\neq 0}\frac{|h(x)|^2}{\|h\|_D^2}. For any \varepsilon > 0, there exists an efficient algorithm that takes O(K\log d + \frac{K}{\varepsilon}) unlabeled samples from D and requests O(\frac{d}{\varepsilon}) labels to output \tilde{f} satisfying
\[
\mathop{\mathbb{E}}_{x\sim D}[|\tilde{f}(x) - f(x)|^2] \le \varepsilon\cdot\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]
\]
with probability \ge 0.99.

Finally, we show two lower bounds on the sample complexity and query complexity that match our upper bounds.

                       Lower bound                              Upper bound
Query complexity       Theorem 6.6.1: \Omega(d/\varepsilon)              Theorem 6.4.2: O(d/\varepsilon)
Sample complexity      Theorem 6.6.4: \Omega(K\log d + K/\varepsilon)    Theorem 6.3.3: O(K\log d + K/\varepsilon)

Table 6.1: Lower bounds and upper bounds in different models

Organization. We organize the rest of this chapter as follows. In Section 6.1, we show the

distribution DF to prove Lemma 6.1.1. Then we discuss the ERM of well-balanced sampling procedures and prove Theorem 6.0.3 in Section 6.2. In Section 6.3 we analyze the number of samples required for sampling from an arbitrary distribution D0 to be well-balanced. In Section 6.4 we show that [LS15] yields a well-balanced linear-sample procedure in the query setting or in fixed-design active learning. Next we show the active learning algorithm of Theorem 6.0.4 in Section 6.5. Finally, we prove the lower bounds on the query complexity and sample complexity in Table 6.1 in Section 6.6.

6.1 Condition Number of Linear Families

We use the linearity of F to prove κ = d and describe the distribution DF. Let
\{v_1, \dots, v_d\} be any orthonormal basis of \mathcal{F}, where inner products are taken under the distribution D.

Lemma 6.1.1. For any linear family \mathcal{F} of dimension d and any distribution D,
\[
\mathop{\mathbb{E}}_{x\sim D}\Big[\sup_{h\in\mathcal{F}:\|h\|_D=1}|h(x)|^2\Big] = d,
\]
Algorithm 3 SampleDF
1: procedure SampleDF(\mathcal{F} = \mathrm{span}\{v_1, \dots, v_d\}, D)
2:   Sample j \in [d] uniformly.
3:   Sample x from the distribution W_j(x) = D(x)\cdot|v_j(x)|^2.
4:   Set the weight of x to be \frac{d}{\sum_{i=1}^{d}|v_i(x)|^2}.
5: end procedure
so that D_{\mathcal{F}}(x) = D(x)\cdot\sup_{h\in\mathcal{F}:\|h\|_D=1}|h(x)|^2/d has a condition number K_{D_{\mathcal{F}}} = d. Moreover, there exists an efficient algorithm to sample x from D_{\mathcal{F}} and compute its weight \frac{D(x)}{D_{\mathcal{F}}(x)}.

Proof. Given an orthonormal basis v_1, \dots, v_d of \mathcal{F}, for any h \in \mathcal{F} with \|h\|_D = 1 there exist c_1, \dots, c_d such that h(x) = \sum_i c_i\cdot v_i(x). Then for any x in the domain, from the Cauchy–Schwarz inequality,
\[
\sup_{h}\frac{|h(x)|^2}{\|h\|_D^2} = \sup_{c_1,\dots,c_d}\frac{\big|\sum_{i\in[d]} c_i v_i(x)\big|^2}{\sum_{i\in[d]}|c_i|^2} \le \frac{\big(\sum_{i\in[d]}|c_i|^2\big)\cdot\big(\sum_{i\in[d]}|v_i(x)|^2\big)}{\sum_{i\in[d]}|c_i|^2} = \sum_{i\in[d]}|v_i(x)|^2.
\]
This is tight because the choice c_i = \overline{v_i(x)} achieves \big|\sum_{i\in[d]} c_iv_i(x)\big|^2 = \big(\sum_{i\in[d]}|c_i|^2\big)\cdot\big(\sum_{i\in[d]}|v_i(x)|^2\big). Hence
\[
\mathop{\mathbb{E}}_{x\sim D}\Big[\sup_{h\in\mathcal{F}:h\neq 0}\frac{|h(x)|^2}{\|h\|_D^2}\Big] = \mathop{\mathbb{E}}_{x\sim D}\Big[\sum_{i\in[d]}|v_i(x)|^2\Big] = d.
\]

By Claim 2.1.1, this indicates K_{D_{\mathcal{F}}} = d. At the same time, this calculation indicates
\[
D_{\mathcal{F}}(x) = \frac{D(x)\cdot\sup_{\|h\|_D=1}|h(x)|^2}{d} = \frac{D(x)\cdot\sum_{i\in[d]}|v_i(x)|^2}{d}.
\]

We present our sampling procedure in Algorithm 3.
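On a finite domain, Algorithm 3 amounts to sampling proportionally to the leverage scores \sum_{i\in[d]}|v_i(x)|^2. The sketch below (illustrative Python/NumPy on a 50-point domain with a random d-dimensional family; all concrete values are arbitrary) constructs D_{\mathcal{F}} and the weights D(x)/D_{\mathcal{F}}(x), and verifies the identity \mathbb{E}_{x\sim D}[\sum_i|v_i(x)|^2] = d from Lemma 6.0.1:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 4
D = np.full(N, 1.0 / N)                        # underlying distribution on a 50-point domain

# Orthonormalize a random d-dimensional family under <f, g> = E_D[f * g].
B = rng.standard_normal((N, d))
Q, _ = np.linalg.qr(np.sqrt(D)[:, None] * B)
V = Q / np.sqrt(D)[:, None]                    # columns v_i with E_D[v_i * v_j] = 1{i = j}

lev = (V**2).sum(axis=1)                       # sup_{||h||_D = 1} |h(x)|^2 = sum_i |v_i(x)|^2
DF = D * lev / d                               # the sampling distribution D_F
weights = D / DF                               # importance weight attached to a draw from D_F
x = rng.choice(N, p=DF)
print((D * lev).sum(), DF.sum(), weights[x])
```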
6.2 Recovery Guarantee for Well-Balanced Samples

In this section, we show for well-balanced sampling procedures (per Definition 6.0.2) that an appropriately weighted ERM approximates the true risk minimizer, and hence the true signal.

Definition 6.2.1. Given a random sampling procedure P , and a joint distribution (D,Y ), we define the weighted ERM resulting from P to be the result
\[
\tilde{f} = \arg\min_{h\in\mathcal{F}}\Big\{\sum_{i=1}^{m} w_i\cdot|h(x_i) - y_i|^2\Big\}
\]
after we use P to generate random points x_i \sim D_i with w_i = \alpha_i\cdot\frac{D(x_i)}{D_i(x_i)} for each i, and draw

yi ∼ (Y | xi) for each point xi.
For generality, we first consider points and labels from a joint distribution (D,Y) and use f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2] to denote the truth.

Theorem 6.0.3 (restated). Given a linear family \mathcal{F}, joint distribution (D,Y), and \varepsilon > 0, let P be an \varepsilon-well-balanced sampling procedure for \mathcal{F} and D, and let f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2] be the true risk minimizer. Then the weighted ERM \tilde{f} resulting from P satisfies
\[
\|f - \tilde{f}\|_D^2 \le \varepsilon\cdot\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]
\]
with 99% probability.

Next, we provide a corollary for specific kinds of noise. In the first case, we con- sider noise functions representing independently mean-zero noise at each position x such as i.i.d. Gaussian noise. Second, we consider arbitrary noise functions on the domain.
Corollary 6.2.2. Given a linear family \mathcal{F} and distribution D, let y(x) = f(x) + g(x) for f \in \mathcal{F} and g a randomized function. Let P be an \varepsilon-well-balanced sampling procedure for \mathcal{F} and D. With 99% probability, the weighted ERM \tilde{f} resulting from P satisfies:

1. \|f - \tilde{f}\|_D^2 \le \varepsilon\cdot\mathbb{E}_g[\|g\|_D^2], when g(x) is a random function from G to \mathbb{C} where each g(x) is an independent random variable with \mathbb{E}_g[g(x)] = 0.

2. \|f - \tilde{f}\|_D \le (1+\varepsilon)\cdot\|g\|_D for any other noise function g.

In the rest of this section, we prove Theorem 6.0.3 in Section 6.2.1 and Corollary 6.2.2 in Section 6.2.2.

6.2.1 Proof of Theorem 6.0.3

We introduce a bit more notation in this proof. Given \mathcal{F} and the measurement distribution D, let \{v_1, \dots, v_d\} be a fixed orthonormal basis of \mathcal{F}, where inner products are taken under the distribution D, i.e., \mathbb{E}_{x\sim D}[v_i(x)\cdot\overline{v_j(x)}] = \mathbf{1}_{i=j} for any i, j \in [d]. For any function h \in \mathcal{F}, let \alpha(h) denote the coefficients (\alpha(h)_1, \dots, \alpha(h)_d) under the basis (v_1, \dots, v_d), so that h = \sum_{i=1}^{d}\alpha(h)_i\cdot v_i and \|\alpha(h)\|_2 = \|h\|_D. We characterize the first property in Definition 6.0.2 of well-balanced sampling procedures as bounding the eigenvalues of A^*\cdot A, where A is the m \times d matrix defined as A(i,j) = \sqrt{w_i}\cdot v_j(x_i).
Lemma 6.2.3. For any \varepsilon > 0, given S = (x_1, \dots, x_m) and their weights (w_1, \dots, w_m), let A be the m \times d matrix defined as A(i,j) = \sqrt{w_i}\cdot v_j(x_i). Then
\[
\|h\|_{S,w}^2 \in [1\pm\varepsilon]\cdot\|h\|_D^2 \ \text{ for every } h \in \mathcal{F}
\]
if and only if the eigenvalues of A^*A are in [1-\varepsilon, 1+\varepsilon].
Proof. Notice that
\[
A\cdot\alpha(h) = \big(\sqrt{w_1}\cdot h(x_1), \dots, \sqrt{w_m}\cdot h(x_m)\big). \tag{6.1}
\]
Because
\[
\|h\|_{S,w}^2 = \sum_{i=1}^{m} w_i|h(x_i)|^2 = \|A\cdot\alpha(h)\|_2^2 = \alpha(h)^*\cdot(A^*\cdot A)\cdot\alpha(h) \in [\lambda_{\min}(A^*\cdot A), \lambda_{\max}(A^*\cdot A)]\cdot\|h\|_D^2
\]
and h ranges over the linear family \mathcal{F}, these two properties are equivalent.

Next we consider the computation of the weighted ERM \tilde{f}. Given the weights (w_1, \dots, w_m) on (x_1, \dots, x_m) and labels (y_1, \dots, y_m), let \vec{y}_w denote the vector of weighted labels (\sqrt{w_1}\cdot y_1, \dots, \sqrt{w_m}\cdot y_m). From (6.1), the empirical distance \|h - (y_1, \dots, y_m)\|_{S,w} equals \|A\cdot\alpha(h) - \vec{y}_w\|_2 for any h \in \mathcal{F}. The function \tilde{f} minimizing \|h - (y_1, \dots, y_m)\|_{S,w} = \|A\cdot\alpha(h) - \vec{y}_w\|_2 over all h \in \mathcal{F} is given by the pseudoinverse of A applied to \vec{y}_w, i.e.,
\[
\alpha(\tilde{f}) = (A^*\cdot A)^{-1}\cdot A^*\cdot\vec{y}_w \quad\text{and}\quad \tilde{f} = \sum_{i=1}^{d}\alpha(\tilde{f})_i\cdot v_i.
\]
Finally, we consider the distance between f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|h(x) - y|^2] and \tilde{f}. For convenience, let \vec{f}_w = \big(\sqrt{w_1}\cdot f(x_1), \dots, \sqrt{w_m}\cdot f(x_m)\big). Because f \in \mathcal{F}, (A^*\cdot A)^{-1}\cdot A^*\cdot\vec{f}_w = \alpha(f). This implies
\[
\|\tilde{f} - f\|_D^2 = \|\alpha(\tilde{f}) - \alpha(f)\|_2^2 = \|(A^*\cdot A)^{-1}\cdot A^*\cdot(\vec{y}_w - \vec{f}_w)\|_2^2.
\]
We assume the eigenvalues of A^*\cdot A are bounded and consider \|A^*\cdot(\vec{y}_w - \vec{f}_w)\|_2^2.

Lemma 6.2.4. Let P be a random sampling procedure terminating in m iterations (m is not necessarily fixed) that, in every iteration i, provides a coefficient \alpha_i and a distribution D_i to sample x_i \sim D_i. Let the weight w_i = \alpha_i\cdot\frac{D(x_i)}{D_i(x_i)}, and let A \in \mathbb{C}^{m\times d} denote the matrix A(i,j) = \sqrt{w_i}\cdot v_j(x_i). Then for f = \arg\min_{h\in\mathcal{F}}\mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2],
\[
\mathop{\mathbb{E}}_{P}\big[\|A^*(\vec{y}_w - \vec{f}_w)\|_2^2\big] \le \sup_{P}\Big\{\sum_{i=1}^{m}\alpha_i\Big\}\cdot\max_{j}\big\{\alpha_j\cdot K_{D_j}\big\}\cdot\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2],
\]
where K_{D_i} is the condition number for samples from D_i: K_{D_i} = \sup_{x}\Big\{\frac{D(x)}{D_i(x)}\cdot\sup_{v\in\mathcal{F}}\frac{|v(x)|^2}{\|v\|_D^2}\Big\}.
Proof. For convenience, let g_j denote y_j - f(x_j) and let \vec{g}_w \in \mathbb{C}^m denote the vector (\sqrt{w_j}\cdot g_j)_{j=1,\dots,m} = \vec{y}_w - \vec{f}_w, so that A^*\cdot(\vec{y}_w - \vec{f}_w) = A^*\cdot\vec{g}_w.
\begin{align*}
\mathbb{E}[\|A^*\cdot\vec{g}_w\|_2^2] &= \mathbb{E}\Bigg[\sum_{i=1}^{d}\Big|\sum_{j=1}^{m}\overline{A(j,i)}\,\vec{g}_w(j)\Big|^2\Bigg]\\
&= \sum_{i=1}^{d}\mathbb{E}\Bigg[\Big|\sum_{j=1}^{m} w_j\overline{v_i(x_j)}\cdot g_j\Big|^2\Bigg] = \sum_{i=1}^{d}\mathbb{E}\Bigg[\sum_{j=1}^{m} w_j^2\cdot|v_i(x_j)|^2\cdot|g_j|^2\Bigg],
\end{align*}
where the last step uses the following fact:
\[
\mathop{\mathbb{E}}_{x_j\sim D_j,\, y_j\sim Y(x_j)}\big[w_j\overline{v_i(x_j)}\cdot g_j\big] = \mathop{\mathbb{E}}\Big[\alpha_j\cdot\frac{D(x_j)}{D_j(x_j)}\cdot\overline{v_i(x_j)}\,g_j\Big] = \alpha_j\cdot\mathop{\mathbb{E}}_{x_j\sim D,\, y_j\sim Y(x_j)}\big[\overline{v_i(x_j)}\,(y_j - f(x_j))\big] = 0.
\]
We swap i and j:
\[
\sum_{i=1}^{d}\mathbb{E}\Bigg[\sum_{j=1}^{m} w_j^2|v_i(x_j)|^2|g_j|^2\Bigg] = \sum_{j=1}^{m}\mathbb{E}\Bigg[\Big(w_j\sum_{i=1}^{d}|v_i(x_j)|^2\Big)\cdot w_j|g_j|^2\Bigg] \le \sum_{j=1}^{m}\sup_{x_j}\Big\{w_j\sum_{i=1}^{d}|v_i(x_j)|^2\Big\}\cdot\mathbb{E}\big[w_j\cdot|g_j|^2\big].
\]
For \mathbb{E}[w_j\cdot|g_j|^2], it equals \mathbb{E}_{x_j\sim D_j,\, y_j\sim Y(x_j)}\big[\alpha_j\cdot\frac{D(x_j)}{D_j(x_j)}\cdot|y_j - f(x_j)|^2\big] = \alpha_j\cdot\mathbb{E}_{x_j\sim D,\, y_j\sim Y(x_j)}\big[|y_j - f(x_j)|^2\big].

For \sup_{x_j}\{w_j\sum_{i=1}^{d}|v_i(x_j)|^2\}, we bound it as
\[
\sup_{x_j}\Big\{w_j\sum_{i=1}^{d}|v_i(x_j)|^2\Big\} = \sup_{x_j}\Big\{\alpha_j\cdot\frac{D(x_j)}{D_j(x_j)}\cdot\sum_{i=1}^{d}|v_i(x_j)|^2\Big\} = \alpha_j\sup_{x_j}\Big\{\frac{D(x_j)}{D_j(x_j)}\cdot\sup_{h\in\mathcal{F}}\frac{|h(x_j)|^2}{\|h\|_D^2}\Big\} = \alpha_j\cdot K_{D_j}.
\]
We use the fact that \sup_{h\in\mathcal{F}}\frac{|h(x_j)|^2}{\|h\|_D^2} = \sup_{(a_1,\dots,a_d)}\frac{|\sum_{i=1}^{d}a_iv_i(x_j)|^2}{\sum_{i=1}^{d}|a_i|^2} = \sum_{i=1}^{d}|v_i(x_j)|^2 by the Cauchy–Schwarz inequality. From all discussion above, we have
\[
\mathbb{E}[\|A^*\cdot\vec{g}_w\|_2^2] \le \sum_{j}\alpha_j K_{D_j}\cdot\alpha_j\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2] \le \Big(\sum_j \alpha_j\Big)\max_j\big\{\alpha_jK_{D_j}\big\}\mathop{\mathbb{E}}_{(x,y)\sim(D,Y)}[|y - f(x)|^2].
\]
We combine all discussion above to prove Theorem 6.0.3.

Proof of Theorem 6.0.3. The first property of $P$ indicates $\lambda(A^* A) \in [1 - 1/4, 1 + 1/4]$. On the other hand, $\mathbb{E}\big[\|A^* (\vec{y}_w - \vec{f}_w)\|_2^2\big] \le \frac{\epsilon}{C} \cdot \mathbb{E}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]$ from Lemma 6.2.4. Conditioned on the first property, we have $\mathbb{E}\big[\|(A^* A)^{-1} A^* (\vec{y}_w - \vec{f}_w)\|_2^2\big] \le \frac{2\epsilon}{C} \cdot \mathbb{E}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]$. By choosing a large constant $C$, with probability $1 - \frac{1}{200}$,
\[
\|\widetilde{f} - f\|_D^2 = \|(A^* A)^{-1} A^* (\vec{y}_w - \vec{f}_w)\|_2^2 \le \lambda_{\max}\big((A^* A)^{-1}\big)^2 \cdot \|A^* (\vec{y}_w - \vec{f}_w)\|_2^2 \le \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big]
\]
from the Markov inequality.
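The estimator in this proof is the weighted least-squares solution $(A^*A)^{-1}A^*\vec{y}_w$. The small sketch below (illustrative code, not part of the thesis; all variable names and the random instance are hypothetical) checks numerically that solving the weighted normal equations agrees with a standard least-squares solver applied to $A$ and $\vec{y}_w$:

```python
import numpy as np

# Sketch: the weighted ERM coefficients alpha(f~) = (A* A)^{-1} A* y_w, where
# A(i, j) = sqrt(w_i) * v_j(x_i) and y_w(i) = sqrt(w_i) * y_i.
rng = np.random.default_rng(0)
m, d = 50, 3
V = rng.standard_normal((m, d))       # v_j(x_i): basis functions at the sampled points
w = rng.uniform(0.5, 1.5, size=m)     # weights w_i = alpha_i * D(x_i)/D_i(x_i)
y = rng.standard_normal(m)            # observed labels y_i

A = np.sqrt(w)[:, None] * V           # A(i, j) = sqrt(w_i) v_j(x_i)
y_w = np.sqrt(w) * y                  # weighted label vector

coef_normal = np.linalg.solve(A.T @ A, A.T @ y_w)        # normal equations
coef_lstsq = np.linalg.lstsq(A, y_w, rcond=None)[0]      # least-squares solver
assert np.allclose(coef_normal, coef_lstsq)
```

Both routes give the same coefficients whenever $A^*A$ is invertible, which the first property of a well-balanced procedure guarantees.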

6.2.2 Proof of Corollary 6.2.2

For the first part, let $(D, Y) = \big(D, f(x) + g(x)\big)$ be our joint distribution of $(x, y)$. Because the expectation $\mathbb{E}[g(x)] = 0$ for every $x \in G$, $\arg\min_{v \in \mathcal{F}} \mathbb{E}_{(x,y)\sim(D,Y)}[|y - v(x)|^2] = f$. From Theorem 6.0.3, for $\alpha(\widetilde{f}) = (A^* A)^{-1} A^* \vec{y}_w$ and $m = O(d/\epsilon)$,
\[
\|\widetilde{f} - f\|_D^2 = \|\alpha(\widetilde{f}) - \alpha(f)\|_2^2 \le \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] = \epsilon \cdot \mathbb{E}\big[\|g\|_D^2\big], \quad \text{with probability } 0.99.
\]

For the second part, let $g^{\parallel}$ be the projection of $g(x)$ onto $\mathcal{F}$ and $g^{\perp} = g - g^{\parallel}$ be the part orthogonal to $\mathcal{F}$. Let $\alpha(g^{\parallel})$ denote the coefficients of $g^{\parallel}$ in the fixed orthonormal basis $(v_1, \dots, v_d)$, so that $\|\alpha(g^{\parallel})\|_2 = \|g^{\parallel}\|_D$. We decompose $\vec{y}_w = \vec{f}_w + \vec{g}_w = \vec{f}_w + \vec{g}^{\parallel}_w + \vec{g}^{\perp}_w$. Therefore
\[
\alpha(\widetilde{f}) = (A^* A)^{-1} A^* \big(\vec{f}_w + \vec{g}^{\parallel}_w + \vec{g}^{\perp}_w\big) = \alpha(f) + \alpha(g^{\parallel}) + (A^* A)^{-1} A^* \vec{g}^{\perp}_w.
\]
The distance $\|\widetilde{f} - f\|_D = \|\alpha(\widetilde{f}) - \alpha(f)\|_2$ equals
\[
\|(A^* A)^{-1} A^* \vec{y}_w - \alpha(f)\|_2 = \|\alpha(f) + \alpha(g^{\parallel}) + (A^* A)^{-1} A^* \vec{g}^{\perp}_w - \alpha(f)\|_2 = \|\alpha(g^{\parallel}) + (A^* A)^{-1} A^* \vec{g}^{\perp}_w\|_2.
\]
From Theorem 6.0.3, with probability $0.99$, $\|(A^* A)^{-1} A^* \vec{g}^{\perp}_w\|_2 \le \sqrt{\epsilon} \cdot \|g^{\perp}\|_D$. Thus
\[
\|(A^* A)^{-1} A^* \vec{y}_w - \alpha(f)\|_2 = \|\alpha(g^{\parallel}) + (A^* A)^{-1} A^* \vec{g}^{\perp}_w\|_2 \le \|g^{\parallel}\|_D + \sqrt{\epsilon} \cdot \|g^{\perp}\|_D.
\]
Let $1 - \beta$ denote $\|g^{\parallel}\|_D / \|g\|_D$, so that $\|g^{\perp}\|_D / \|g\|_D = \sqrt{2\beta - \beta^2}$. We rewrite the bound as
\[
\Big(1 - \beta + \sqrt{\epsilon} \cdot \sqrt{2\beta - \beta^2}\Big)\|g\|_D \le \Big(1 - \beta + \sqrt{\epsilon} \cdot \sqrt{2\beta}\Big)\|g\|_D \le \Big(1 - \big(\sqrt{\beta} - \sqrt{\tfrac{\epsilon}{2}}\big)^2 + \tfrac{\epsilon}{2}\Big)\|g\|_D.
\]

From all discussion above, $\|\widetilde{f} - f\|_D = \|\alpha(\widetilde{f}) - \alpha(f)\|_2 = \|(A^* A)^{-1} A^* \vec{y}_w - \alpha(f)\|_2 \le (1 + \epsilon)\|g\|_D$.

6.3 Performance of i.i.d. Distributions

Given the linear family $\mathcal{F}$ of dimension $d$ and the measure $D$ defining the distance, we provide a distribution $D_{\mathcal{F}}$ with condition number $K_{D_{\mathcal{F}}} = d$.

Lemma 6.3.1. Given any linear family $\mathcal{F}$ of dimension $d$ and any distribution $D$, there always exists an explicit distribution $D_{\mathcal{F}}$ such that the condition number
\[
K_{D_{\mathcal{F}}} = \sup_x \Big\{ \frac{D(x)}{D_{\mathcal{F}}(x)} \cdot \sup_{h \in \mathcal{F}} \frac{|h(x)|^2}{\|h\|_D^2} \Big\} = d.
\]

Next, for generality, we bound the number of i.i.d. random samples from an arbitrary distribution $D'$ needed to fulfill the requirements of well-balanced sampling procedures in Definition 6.0.2.

Lemma 6.3.2. There exists a universal constant $C_1$ such that given any distribution $D'$ with the same support as $D$ and any $\epsilon > 0$, the random sampling procedure with $m = C_1\big(K_{D'} \log d + \frac{K_{D'}}{\epsilon}\big)$ i.i.d. random samples from $D'$ and coefficients $\alpha_1 = \cdots = \alpha_m = 1/m$ is an $\epsilon$-well-balanced sampling procedure.

By Theorem 6.0.3, we state the following result, which will be used in active learning. For $G = \mathrm{supp}(D)$ and any $x \in G$, let $Y(x)$ denote the conditional distribution $(Y \mid D = x)$ and $(D', Y(D'))$ denote the distribution that first generates $x \sim D'$ then generates $y \sim Y(x)$.

Theorem 6.3.3. Consider any dimension-$d$ linear space $\mathcal{F}$ of functions from a domain $G$ to $\mathbb{C}$. Let $(D, Y)$ be a joint distribution over $G \times \mathbb{C}$, and $f = \arg\min_{h \in \mathcal{F}} \mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2]$. Let $D'$ be any distribution on $G$ and $K_{D'} = \sup_x\Big\{\sup_{h \in \mathcal{F}}\Big\{\frac{D(x)}{D'(x)} \cdot \frac{|h(x)|^2}{\|h\|_D^2}\Big\}\Big\}$. The weighted ERM $\widetilde{f}$ resulting from $m = O\big(K_{D'} \log d + \frac{K_{D'}}{\epsilon}\big)$ random queries of $(D', Y(D'))$ with weights $w_i = \frac{D(x_i)}{m \cdot D'(x_i)}$ for each $i \in [m]$ satisfies
\[
\|\widetilde{f} - f\|_D^2 = \mathbb{E}_{x \sim D}\big[|\widetilde{f}(x) - f(x)|^2\big] \le \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] \quad \text{with probability} \ge 0.99.
\]
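As an illustration of the sampling scheme in Theorem 6.3.3, the sketch below (a toy finite-domain setup; the instance and names are illustrative, not from the thesis) draws i.i.d. samples from a distribution $D'$, applies the importance weights $w_i = \frac{D(x_i)}{m \cdot D'(x_i)}$, and checks that in the noiseless case the weighted ERM recovers $f$ exactly:

```python
import numpy as np

# Toy instance: a d-dimensional linear family on a finite domain of size N,
# a target measure D, and a different sampling distribution D'.
rng = np.random.default_rng(1)
N, d, m = 20, 3, 200
domain = np.arange(N)
D = np.full(N, 1.0 / N)                              # target measure D (uniform)
Dp = rng.uniform(0.5, 1.5, size=N); Dp /= Dp.sum()   # sampling distribution D'

V = rng.standard_normal((N, d))                      # columns span the family F
true_coef = np.array([1.0, -2.0, 0.5])
f = V @ true_coef                                    # ground truth f in F, no noise

xs = rng.choice(domain, size=m, p=Dp)                # i.i.d. samples from D'
w = D[xs] / (m * Dp[xs])                             # importance weights of Theorem 6.3.3
A = np.sqrt(w)[:, None] * V[xs]                      # A(i, j) = sqrt(w_i) v_j(x_i)
y_w = np.sqrt(w) * f[xs]                             # weighted (noiseless) labels
est = np.linalg.lstsq(A, y_w, rcond=None)[0]
assert np.allclose(est, true_coef)                   # exact recovery without noise
```

With noise the same estimator satisfies the error bound of the theorem rather than exact recovery.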

We show the proof of Lemma 6.3.2 in Section 6.3.1.

6.3.1 Proof of Lemma 6.3.2

We use the matrix Chernoff theorem to prove the first property in Definition 6.0.2. We still use $A$ to denote the $m \times d$ matrix $A(i,j) = \sqrt{w_i} \cdot v_j(x_i)$.

Lemma 6.3.4. Let $D'$ be an arbitrary distribution over $G$ and
\[
K_{D'} = \sup_{h \in \mathcal{F}: h \neq 0} \sup_{x \in G} \frac{|h^{(D')}(x)|^2}{\|h\|_D^2}, \tag{6.2}
\]
where $h^{(D')}(x) = \sqrt{D(x)/D'(x)} \cdot h(x)$. There exists an absolute constant $C$ such that for any $n \in \mathbb{N}^+$, any linear family $\mathcal{F}$ of dimension $d$, any $\epsilon \in (0,1)$, and any $\delta \in (0,1)$, when $S = (x_1, \dots, x_m)$ is drawn independently from the distribution $D'$ with $m \ge \frac{C}{\epsilon^2} \cdot K_{D'} \log\frac{d}{\delta}$ and $w_j = \frac{D(x_j)}{m \cdot D'(x_j)}$ for each $j \in [m]$, the $m \times d$ matrix $A(i,j) = \sqrt{w_i} \cdot v_j(x_i)$ satisfies
\[
\|A^* A - I\| \le \epsilon \quad \text{with probability at least } 1 - \delta.
\]

Proof of Lemma 6.3.4. Let $v_1, \dots, v_d$ be the orthonormal basis of $\mathcal{F}$ in the definition of the matrix $A$. For any $h \in \mathcal{F}$, let $\alpha(h) = (\alpha_1, \dots, \alpha_d)$ denote the coefficients of $h$ under $v_1, \dots, v_d$, so that $\|h\|_D^2 = \|\alpha(h)\|_2^2$. At the same time, for any fixed $x$, $\sup_{h \in \mathcal{F}} \frac{|h^{(D')}(x)|^2}{\|h\|_D^2} = \sup_{\alpha(h)} \frac{|\sum_{i=1}^d \alpha(h)_i \cdot v_i^{(D')}(x)|^2}{\|\alpha(h)\|_2^2} = \sum_{i \in [d]} |v_i^{(D')}(x)|^2$ by the tightness of the Cauchy-Schwarz inequality. Thus
\[
K_{D'} \overset{\text{def}}{=} \sup_{x \in G} \sup_{h \in \mathcal{F}: h \neq 0} \frac{|h^{(D')}(x)|^2}{\|h\|_D^2} \quad \text{indicates} \quad \sup_{x \in G} \sum_{i \in [d]} |v_i^{(D')}(x)|^2 \le K_{D'}. \tag{6.3}
\]
For each point $x_j$ in $S$ with weight $w_j = \frac{D(x_j)}{m \cdot D'(x_j)}$, let $A_j$ denote the $j$th row of the matrix $A$. It is a vector in $\mathbb{C}^d$ defined by $A_j(i) = A(j,i) = \sqrt{w_j} \cdot v_i(x_j) = \frac{v_i^{(D')}(x_j)}{\sqrt{m}}$. So $A^* A = \sum_{j=1}^m A_j^* \cdot A_j$.

For $A_j^* \cdot A_j$, it is always $\succeq 0$. Notice that the only non-zero eigenvalue of $A_j^* \cdot A_j$ is
\[
\lambda(A_j^* \cdot A_j) = A_j \cdot A_j^* = \frac{1}{m} \sum_{i \in [d]} |v_i^{(D')}(x_j)|^2 \le \frac{K_{D'}}{m}
\]
from (6.3).

At the same time, $\sum_{j=1}^m \mathbb{E}[A_j^* \cdot A_j]$ equals the identity matrix of size $d \times d$ because the expectation of the entry $(i, i')$ in $A_j^* \cdot A_j$ is
\[
\mathbb{E}_{x_j \sim D'}\big[A(j,i) \cdot A(j,i')\big] = \mathbb{E}_{x_j \sim D'}\left[\frac{v_i^{(D')}(x_j) \cdot v_{i'}^{(D')}(x_j)}{m}\right] = \mathbb{E}_{x_j \sim D'}\left[\frac{D(x_j) \cdot v_i(x_j) \cdot v_{i'}(x_j)}{m \cdot D'(x_j)}\right] = \mathbb{E}_{x_j \sim D}\left[\frac{v_i(x_j) \cdot v_{i'}(x_j)}{m}\right] = \mathbb{1}_{i = i'}/m.
\]

Now we apply Theorem 2.2.3 on $A^* A = \sum_{j=1}^m (A_j^* \cdot A_j)$:
\[
\Pr\big[\lambda(A^* A) \notin [1 - \epsilon, 1 + \epsilon]\big] \le d \cdot \left(\frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}}\right)^{m/K_{D'}} + d \cdot \left(\frac{e^{\epsilon}}{(1+\epsilon)^{1+\epsilon}}\right)^{m/K_{D'}} \le 2d \cdot e^{-\frac{\epsilon^2 m}{3 K_{D'}}} \le \delta \quad \text{given } m \ge \frac{6 K_{D'} \log\frac{d}{\delta}}{\epsilon^2}.
\]

Then we finish the proof of Lemma 6.3.2.

Proof of Lemma 6.3.2. Because the coefficients satisfy $\alpha_i = 1/m$ with $m = \Omega(K_{D'}/\epsilon)$, so that $\alpha_i \cdot K_{D'} = O(\epsilon)$, and $\sum_i \alpha_i = 1$, this indicates the second property of well-balanced sampling procedures.

Since $m = \Theta(K_{D'} \log d)$, by Lemma 6.3.4, we know all eigenvalues of $A^* A$ are in $[1 - 1/4, 1 + 1/4]$ with probability $1 - 10^{-3}$. By Lemma 6.2.3, this indicates the first property of well-balanced sampling procedures.

6.4 A Linear-Sample Algorithm for Known D

We provide a well-balanced sampling procedure with a linear number of random samples in this section. The procedure requires knowing the underlying distribution D, which makes it directly useful in the query setting or the “fixed design” active learning setting, where D can be set to the empirical distribution D0.

Lemma 6.4.1. Given any dimension-$d$ linear space $\mathcal{F}$, any distribution $D$ over the domain of $\mathcal{F}$, and any $\epsilon > 0$, there exists an efficient $\epsilon$-well-balanced sampling procedure that terminates in $O(d/\epsilon)$ rounds with probability $1 - \frac{1}{200}$.

From Corollary 6.2.2, we obtain the following theorem for the ERM fe resulting from the well-balanced sampling procedure in Lemma 6.4.1.

Theorem 6.4.2. Consider any dimension-$d$ linear space $\mathcal{F}$ of functions from a domain $G$ to $\mathbb{C}$ and any distribution $D$ over $G$. Let $y(x) = f(x) + g(x)$ be our observed function, where $f \in \mathcal{F}$ and $g$ denotes a noise function. For any $\epsilon > 0$, there exists an efficient algorithm that observes $y(x)$ at $m = O(\frac{d}{\epsilon})$ points and outputs $\widetilde{f}$ such that with probability $0.99$,

1. $\|\widetilde{f} - f\|_D^2 \le \epsilon \cdot \mathbb{E}_g\big[\|g\|_D^2\big]$, when $g(x)$ is a random function from $G$ to $\mathbb{C}$ where each $g(x)$ is an independent random variable with $\mathbb{E}_g[g(x)] = 0$;

2. $\|\widetilde{f} - f\|_D \le (1 + \epsilon) \cdot \|g\|_D$ for any other noise function $g$.

We show how to extract the coefficients $\alpha_1, \dots, \alpha_m$ from the randomized BSS algorithm by Lee and Sun [LS15] in Algorithm 4. Given $\epsilon$, the linear family $\mathcal{F}$, and the distribution $D$, we fix $\gamma = \sqrt{\epsilon}/C_0$ for a constant $C_0$ and $v_1, \dots, v_d$ to be an orthonormal basis of $\mathcal{F}$ in this section. For convenience, we use $v(x)$ to denote the vector $\big(v_1(x), \dots, v_d(x)\big)$. In the rest of this section, we prove Lemma 6.4.1 in Section 6.4.1.

6.4.1 Proof of Lemma 6.4.1

We state a few properties of randomized BSS [BSS12, LS15] that will be used in this proof. The first property is that the matrices $B_1, \dots, B_m$ in Procedure RandomizedSamplingBSS always have bounded eigenvalues.

Lemma 6.4.3. [BSS12, LS15] For any j ∈ [m], λ(Bj) ∈ (lj, uj).

Lemmas 3.6 and 3.7 of [LS15] show that with high probability, the while loop in Procedure RandomizedSamplingBSS finishes within $O(\frac{d}{\gamma^2})$ iterations and guarantees the last matrix $B_m$ is well-conditioned, i.e., $\frac{\lambda_{\max}(B_m)}{\lambda_{\min}(B_m)} \le \frac{u_m}{l_m} \le 1 + O(\gamma)$.

Lemma 6.4.4. [LS15] There exists a constant $C$ such that with probability at least $1 - \frac{1}{200}$, Procedure RandomizedSamplingBSS takes at most $m = C \cdot d/\gamma^2$ random points $x_1, \dots, x_m$ and guarantees that $\frac{u_m}{l_m} \le 1 + 8\gamma$.

Algorithm 4 A well-balanced sampling procedure based on Randomized BSS
1: procedure RandomizedSamplingBSS($\mathcal{F}, D, \epsilon$)
2:   Find an orthonormal basis $v_1, \dots, v_d$ of $\mathcal{F}$ under $D$;
3:   Set $\gamma = \sqrt{\epsilon}/C_0$ and $\mathrm{mid} = \frac{4d/\gamma}{1/(1-\gamma) - 1/(1+\gamma)}$;
4:   $j = 0$; $B_0 = 0$;
5:   $l_0 = -2d/\gamma$; $u_0 = 2d/\gamma$;
6:   while $u_{j+1} - l_{j+1} < 8d/\gamma$ do
7:     $\Phi_j = \mathrm{Tr}\big((u_j I - B_j)^{-1}\big) + \mathrm{Tr}\big((B_j - l_j I)^{-1}\big)$;   ▷ The potential function at iteration $j$.
8:     Set the coefficient $\alpha_j = \frac{\gamma}{\Phi_j} \cdot \frac{1}{\mathrm{mid}}$;
9:     Set the distribution $D_j(x) = D(x) \cdot \big(v(x)^{\top}(u_j I - B_j)^{-1} v(x) + v(x)^{\top}(B_j - l_j I)^{-1} v(x)\big)/\Phi_j$ for $v(x) = \big(v_1(x), \dots, v_d(x)\big)$;
10:    Sample $x_j \sim D_j$ and set a scale $s_j = \frac{\gamma}{\Phi_j} \cdot \frac{D(x_j)}{D_j(x_j)}$;
11:    $B_{j+1} = B_j + s_j \cdot v(x_j) v(x_j)^{\top}$;
12:    $u_{j+1} = u_j + \frac{\gamma}{\Phi_j(1-\gamma)}$; $l_{j+1} = l_j + \frac{\gamma}{\Phi_j(1+\gamma)}$;
13:    $j = j + 1$;
14:  end while
15:  $m = j$;
16:  Assign the weight $w_j = s_j/\mathrm{mid}$ for each $x_j$;
17: end procedure
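A compact Python sketch of this procedure on a toy finite domain with uniform $D$ (the domain size and the constants $\epsilon$, $C_0$ are illustrative choices for this example, not prescribed by the thesis):

```python
import numpy as np

# Illustrative run of the randomized-BSS sampling loop above, with D uniform on N points.
rng = np.random.default_rng(2)
N, d = 40, 2
eps, C0 = 0.08, 2.0
gamma = np.sqrt(eps) / C0
mid = (4 * d / gamma) / (1 / (1 - gamma) - 1 / (1 + gamma))

# Orthonormal basis under uniform D: columns of Q scaled by sqrt(N), so E_x[v_i v_j] = 1{i=j}.
Q, _ = np.linalg.qr(rng.standard_normal((N, d)))
V = Q * np.sqrt(N)                 # row x is the vector v(x)
D = np.full(N, 1.0 / N)

B = np.zeros((d, d))
l, u = -2 * d / gamma, 2 * d / gamma
weights, points = [], []
while u - l < 8 * d / gamma:
    Mu = np.linalg.inv(u * np.eye(d) - B)
    Ml = np.linalg.inv(B - l * np.eye(d))
    phi = np.trace(Mu) + np.trace(Ml)
    scores = np.einsum('ni,ij,nj->n', V, Mu + Ml, V)    # v(x)^T (Mu + Ml) v(x)
    Dj = D * scores / phi                                # the distribution D_j
    x = rng.choice(N, p=Dj / Dj.sum())
    s = gamma / scores[x]                                # s_j = (gamma/phi) * D(x)/D_j(x)
    B = B + s * np.outer(V[x], V[x])
    u += gamma / (phi * (1 - gamma))
    l += gamma / (phi * (1 + gamma))
    weights.append(s / mid); points.append(x)

A = np.sqrt(np.array(weights))[:, None] * V[points]
eigs = np.linalg.eigvalsh(A.T @ A)
# By construction A^T A = B_m / mid, so these eigenvalues lie in (l_m/mid, u_m/mid).
print(len(points), eigs)
```

The final weights $w_j = s_j/\mathrm{mid}$ are exactly the weights assigned in line 16 of the pseudocode.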

We first show that $A^* A$ is well-conditioned from the definition of $A$. We prove that our choice of $\mathrm{mid}$ is very close to $\sum_{j=1}^m \frac{\gamma}{\Phi_j} = \frac{u_m + l_m}{\frac{1}{1-\gamma} + \frac{1}{1+\gamma}} \approx \frac{u_m + l_m}{2}$.

Claim 6.4.5. After exiting the while loop in Procedure RandomizedSamplingBSS, we always have

1. um − lm ≤ 9d/γ.

2. $(1 - 0.5\gamma^2/d) \cdot \sum_{j=1}^m \frac{\gamma}{\Phi_j} \le \mathrm{mid} \le \sum_{j=1}^m \frac{\gamma}{\Phi_j}$.

Proof. Let us first bound the last term $\frac{\gamma}{\Phi_m}$ in the while loop. Since $u_{m-1} - l_{m-1} < 8d/\gamma$, $\Phi_m \ge 2d \cdot \frac{1}{4d/\gamma} \ge \frac{\gamma}{2}$, which indicates the last term $\frac{\gamma}{\Phi_m} \le 2$. Thus
\[
u_m - l_m \le 8d/\gamma + 2\Big(\frac{1}{1-\gamma} - \frac{1}{1+\gamma}\Big) \le 8d/\gamma + 5\gamma.
\]
From our choice $\mathrm{mid} = \frac{4d/\gamma}{1/(1-\gamma) - 1/(1+\gamma)} = 2d(1-\gamma^2)/\gamma^2$ and the condition of the while loop $u_m - l_m = \sum_{j=1}^m \frac{\gamma}{\Phi_j} \cdot \big(\frac{1}{1-\gamma} - \frac{1}{1+\gamma}\big) + 4d/\gamma \ge 8d/\gamma$, we know
\[
\sum_{j=1}^m \frac{\gamma}{\Phi_j} \ge \mathrm{mid} = 2d(1-\gamma^2)/\gamma^2.
\]
On the other hand, since $u_{m-1} - l_{m-1} < 8d/\gamma$ is still inside the while loop, $\sum_{j=1}^{m-1} \frac{\gamma}{\Phi_j} < \mathrm{mid}$. Hence
\[
\mathrm{mid} > \sum_{j=1}^{m-1} \frac{\gamma}{\Phi_j} \ge \sum_{j=1}^m \frac{\gamma}{\Phi_j} - 2 \ge (1 - 0.5\gamma^2/d) \cdot \Big(\sum_{j=1}^m \frac{\gamma}{\Phi_j}\Big).
\]

Lemma 6.4.6. Given $\frac{u_m}{l_m} \le 1 + 8\gamma$, $\lambda(A^* A) \in (1 - 5\gamma, 1 + 5\gamma)$.

Proof. For $B_m = \sum_{j=1}^m s_j v(x_j) v(x_j)^{\top}$, $\lambda(B_m) \in (l_m, u_m)$ from Lemma 6.4.3. At the same time, given $w_j = s_j/\mathrm{mid}$,
\[
A^* A = \sum_{j=1}^m w_j v(x_j) v(x_j)^{\top} = \frac{1}{\mathrm{mid}} \cdot \sum_{j=1}^m s_j v(x_j) v(x_j)^{\top} = \frac{B_m}{\mathrm{mid}}.
\]
Since $\mathrm{mid} \in [1 - \frac{3\gamma^2}{d}, 1] \cdot \Big(\sum_{j=1}^m \frac{\gamma}{\Phi_j}\Big) = [1 - \frac{3\gamma^2}{d}, 1] \cdot \Big(\frac{u_m + l_m}{\frac{1}{1-\gamma} + \frac{1}{1+\gamma}}\Big) \subseteq [1 - 2\gamma^2, 1 - \gamma^2] \cdot \Big(\frac{u_m + l_m}{2}\Big)$ from Claim 6.4.5, $\lambda(A^* A) = \lambda(B_m)/\mathrm{mid} \in (l_m/\mathrm{mid}, u_m/\mathrm{mid}) \subset (1 - 5\gamma, 1 + 5\gamma)$ given $\frac{u_m}{l_m} \le 1 + 8\gamma$ in Lemma 6.4.4.

We finish the proof of Lemma 6.4.1 by combining all discussion above.

Proof of Lemma 6.4.1. From Lemma 6.4.4 and Lemma 6.4.6, $m = O(d/\gamma^2)$ and $\lambda(A^* A) \in [1 - 1/4, 1 + 1/4]$ with probability $0.995$.

For $\alpha_i = \frac{\gamma}{\Phi_i} \cdot \frac{1}{\mathrm{mid}}$, we bound $\sum_{i=1}^m \frac{\gamma}{\Phi_i} \cdot \frac{1}{\mathrm{mid}}$ by $1.25$ from the second property of Claim 6.4.5.

Then we bound $\alpha_j \cdot K_{D_j}$. We notice that $\sup_{h \in \mathcal{F}} \frac{|h(x)|^2}{\|h\|_D^2} = \sum_{i \in [d]} |v_i(x)|^2$ for every $x \in G$, because $\sup_{h \in \mathcal{F}} \frac{|h(x)|^2}{\|h\|_D^2} = \sup_{\alpha(h)} \frac{|\sum_i \alpha(h)_i \cdot v_i(x)|^2}{\|\alpha(h)\|_2^2} = \sum_i |v_i(x)|^2$ by the Cauchy-Schwarz inequality. This simplifies $K_{D_j}$ to $\sup_x\big\{\frac{D(x)}{D_j(x)} \cdot \sum_{i=1}^d |v_i(x)|^2\big\}$ and bounds $\alpha_j \cdot K_{D_j}$ by
\begin{align*}
\frac{\gamma}{\Phi_j \cdot \mathrm{mid}} \cdot \sup_x\Big\{\frac{D(x)}{D_j(x)} \cdot \sum_{i=1}^d |v_i(x)|^2\Big\}
&= \frac{\gamma}{\mathrm{mid}} \cdot \sup_x\left\{\frac{\sum_{i=1}^d |v_i(x)|^2}{v(x)^{\top}(u_j I - B_j)^{-1} v(x) + v(x)^{\top}(B_j - l_j I)^{-1} v(x)}\right\} \\
&\le \frac{\gamma}{\mathrm{mid}} \cdot \sup_x\left\{\frac{\sum_{i=1}^d |v_i(x)|^2}{\lambda_{\min}\big((u_j I - B_j)^{-1}\big) \cdot \|v(x)\|_2^2 + \lambda_{\min}\big((B_j - l_j I)^{-1}\big) \cdot \|v(x)\|_2^2}\right\} \\
&\le \frac{\gamma}{\mathrm{mid}} \cdot \frac{1}{1/(u_j - l_j) + 1/(u_j - l_j)} = \frac{\gamma}{\mathrm{mid}} \cdot \frac{u_j - l_j}{2} \qquad \text{(apply the first property of Claim 6.4.5)} \\
&\le \frac{4.5 \cdot d}{\mathrm{mid}} \le 3\gamma^2 = 3\epsilon/C_0^2.
\end{align*}
By choosing $C_0$ large enough, this satisfies the second property of well-balanced sampling procedures. At the same time, by Lemma 6.2.3, Algorithm 4 also satisfies the first property of well-balanced sampling procedures.

6.5 Active Learning

In this section, we investigate the case where we do not know the distribution $D$ of $x$ and only receive random samples from $D$. We finish the proof of Theorem 6.0.4, which bounds the number of unlabeled samples in terms of the condition number of $D$ and the number of labeled samples in terms of $\dim(\mathcal{F})$ needed to recover the truth under $D$. At the end of this section, we state a corollary for specific kinds of noise.

Theorem 6.0.4. Consider any dimension-$d$ linear space $\mathcal{F}$ of functions from a domain $G$ to $\mathbb{C}$. Let $(D, Y)$ be a joint distribution over $G \times \mathbb{C}$ and $f = \arg\min_{h \in \mathcal{F}} \mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2]$. Let $K = \sup_{h \in \mathcal{F}: h \neq 0} \frac{\sup_{x \in G} |h(x)|^2}{\|h\|_D^2}$. For any $\epsilon > 0$, there exists an efficient algorithm that takes $O(K \log d + \frac{K}{\epsilon})$ unlabeled samples from $D$ and requests $O(\frac{d}{\epsilon})$ labels to output $\widetilde{f}$ satisfying
\[
\mathbb{E}_{x \sim D}\big[|\widetilde{f}(x) - f(x)|^2\big] \le \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] \quad \text{with probability} \ge 0.99.
\]

For generality, we bound the number of labels using any well-balanced sampling procedure, so that Theorem 6.0.4 follows from the next lemma instantiated with the linear-sample procedure of Lemma 6.4.1.

Lemma 6.5.1. Consider any dimension-$d$ linear space $\mathcal{F}$ of functions from a domain $G$ to $\mathbb{C}$. Let $(D, Y)$ be a joint distribution over $G \times \mathbb{C}$ and $f = \arg\min_{h \in \mathcal{F}} \mathbb{E}_{(x,y)\sim(D,Y)}[|y - h(x)|^2]$. Let $K = \sup_{h \in \mathcal{F}: h \neq 0} \frac{\sup_{x \in G}|h(x)|^2}{\|h\|_D^2}$ and $P$ be a well-balanced sampling procedure terminating in $m_p(\epsilon)$ rounds with probability $1 - 10^{-3}$ for any linear family $\mathcal{F}$, measurement $D$, and $\epsilon$. For any $\epsilon > 0$, Algorithm 5 takes $O(K \log d + \frac{K}{\epsilon})$ unlabeled samples from $D$ and requests at most $m_p(\epsilon/8)$ labels to output $\widetilde{f}$ satisfying
\[
\mathbb{E}_{x \sim D}\big[|\widetilde{f}(x) - f(x)|^2\big] \le \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] \quad \text{with probability} \ge 1 - \frac{1}{200}.
\]

Algorithm 5 first takes $m_0 = O(K \log d + K/\epsilon)$ unlabeled samples and defines a distribution $D_0$ to be the uniform distribution on these $m_0$ samples. Then it uses $D_0$ to simulate $D$ in $P$, i.e., it outputs the ERM resulting from the well-balanced sampling procedure $P$ with the linear family $\mathcal{F}$, the measurement $D_0$, and error parameter $\frac{\epsilon}{8}$.

Algorithm 5 Regression over an unknown distribution D
1: procedure RegressionUnknownDistribution($\epsilon, \mathcal{F}, D, P$)
2:   Set $C$ to be a large constant and $m_0 = C \cdot (K \log d + K/\epsilon)$.
3:   Take $m_0$ unlabeled samples $x_1, \dots, x_{m_0}$ from $D$.
4:   Let $D_0$ be the uniform distribution over $(x_1, \dots, x_{m_0})$.
5:   Output the ERM $\widetilde{f}$ resulting from $P$ with parameters $\mathcal{F}$, $D_0$, and $\epsilon/8$.
6: end procedure
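The two-stage structure of Algorithm 5 can be sketched as follows (a toy noiseless instance with illustrative parameters, using the i.i.d. procedure of Lemma 6.3.2, with $D' = D_0$, as the inner procedure $P$):

```python
import numpy as np

# Stage 1: unlabeled samples from the unknown D define the empirical distribution D0.
# Stage 2: label only the points selected by the i.i.d. procedure run against D0.
rng = np.random.default_rng(3)
N, d, m0, m_labels = 30, 2, 300, 60
D = rng.uniform(0.5, 1.5, size=N); D /= D.sum()      # unknown sampling distribution
V = rng.standard_normal((N, d))                      # basis of the linear family F
true_coef = np.array([2.0, -1.0])
label = lambda x: V[x] @ true_coef                   # noiseless label oracle Y(x)

unlabeled = rng.choice(N, size=m0, p=D)              # stage 1: no labels requested
support, counts = np.unique(unlabeled, return_counts=True)
D0 = counts / m0                                     # empirical distribution over support

xs = support[rng.choice(len(support), size=m_labels, p=D0)]   # stage 2: label requests
w = np.full(m_labels, 1.0 / m_labels)                # alpha_i = 1/m since D' = D0 here
A = np.sqrt(w)[:, None] * V[xs]
y_w = np.sqrt(w) * np.array([label(x) for x in xs])
est = np.linalg.lstsq(A, y_w, rcond=None)[0]
assert np.allclose(est, true_coef)                   # exact in the noiseless toy case
```

Only the `m_labels` points chosen in stage 2 ever receive labels, which is the source of the label savings in Lemma 6.5.1.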

Proof. We still use $\|f\|_{D_0}$ to denote $\sqrt{\mathbb{E}_{x \sim D_0}[|f(x)|^2]}$ and $D_1$ to denote the weighted distribution generated by Procedure $P$ given $\mathcal{F}$, $D_0$, and $\epsilon$. By Lemma 6.3.2 with $D$ and the property of $P$, with probability at least $1 - 2 \cdot 10^{-3}$,
\[
\|h\|_{D_0}^2 = (1 \pm 1/4) \cdot \|h\|_D^2 \quad \text{and} \quad \|h\|_{D_1}^2 = (1 \pm 1/4) \cdot \|h\|_{D_0}^2 \quad \text{for every } h \in \mathcal{F}. \tag{6.4}
\]

We assume (6.4) holds in the rest of this proof.

Let $y_i$ denote a random label of $x_i$ from $Y(x_i)$ for each $i \in [m_0]$, including the unlabeled samples in the algorithm and the labeled samples in Step 5 of Algorithm 5. Let $f'$ be the weighted ERM over $(x_1, \dots, x_{m_0})$ and $(y_1, \dots, y_{m_0})$ under $D_0$, i.e.,
\[
f' = \arg\min_{h \in \mathcal{F}} \mathbb{E}_{x_i \sim D_0,\, y_i \sim Y(x_i)}\big[|y_i - h(x_i)|^2\big]. \tag{6.5}
\]
Given Property (6.4) and Lemma 6.3.2, $\mathbb{E}_{(x_1,y_1),\dots,(x_{m_0},y_{m_0})}\big[\|f' - f\|_D^2\big] \le \frac{2K}{m_0} \cdot \mathbb{E}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]$ from the proof of Theorem 6.0.3. Let the constant in $m_0$ be large enough such that, from the Markov inequality, with probability $1 - 10^{-3}$,
\[
\|f' - f\|_D^2 \le \frac{\epsilon}{8} \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big].
\]
In the rest of this proof, we plan to show that the weighted ERM $\widetilde{f}$ resulting from $P$ with measurement $D_0$ guarantees $\|\widetilde{f} - f'\|_{D_0}^2 \lesssim \epsilon \cdot \mathbb{E}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]$ with high probability. Given Property (6.4) and the guarantee of Procedure $P$, we have $\mathbb{E}_P\big[\|\widetilde{f} - f'\|_{D_0}^2\big] \le \frac{2\epsilon}{C} \cdot \mathbb{E}_{x_i \sim D_0}\big[|y_i - f'(x_i)|^2\big]$ from the proof of Theorem 6.0.3. Next we bound the right hand side $\mathbb{E}_{x_i \sim D_0}\big[|y_i - f'(x_i)|^2\big]$ by $\mathbb{E}_{(x,y)\sim(D,Y)}[|y - f(x)|^2]$ over the randomness of $(x_1, y_1), \dots, (x_{m_0}, y_{m_0})$:
\begin{align*}
\mathbb{E}_{(x_1,y_1),\dots,(x_{m_0},y_{m_0})}\Big[\mathbb{E}_{x_i \sim D_0}\big[|y_i - f'(x_i)|^2\big]\Big]
&\le \mathbb{E}_{(x_1,y_1),\dots,(x_{m_0},y_{m_0})}\Big[2\, \mathbb{E}_{x_i \sim D_0}\big[|y_i - f(x_i)|^2\big] + 2\|f - f'\|_{D_0}^2\Big] \\
&\le 2\, \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] + 3\, \mathbb{E}_{(x_1,y_1),\dots,(x_{m_0},y_{m_0})}\big[\|f - f'\|_D^2\big] \quad \text{from (6.4)}.
\end{align*}
Hence
\[
\mathbb{E}_{(x_1,y_1),\dots,(x_{m_0},y_{m_0})}\Big[\mathbb{E}_P\big[\|\widetilde{f} - f'\|_{D_0}^2\big]\Big] \le \Big(2 \cdot \frac{2\epsilon}{C} + 3 \cdot \frac{2K}{m_0}\Big) \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big].
\]
Since $C$ is a large constant, from the Markov inequality, with probability $1 - 2 \cdot 10^{-3}$ over $(x_1, y_1), \dots, (x_{m_0}, y_{m_0})$ and $P$,
\[
\|\widetilde{f} - f'\|_{D_0}^2 \le \frac{\epsilon}{4} \cdot \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big].
\]

From all discussion above, we have
\[
\|\widetilde{f} - f\|_D^2 \le 2\|\widetilde{f} - f'\|_D^2 + 2\|f' - f\|_D^2 \le 3\|\widetilde{f} - f'\|_{D_0}^2 + \frac{\epsilon}{4}\, \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big] \le \epsilon\, \mathbb{E}_{(x,y)\sim(D,Y)}\big[|y - f(x)|^2\big].
\]

We apply Corollary 6.2.2 to obtain the following result.

Corollary 6.5.2. Let $\mathcal{F}$ be a family of functions from a domain $G$ to $\mathbb{C}$ with dimension $d$ and $D$ be a distribution over $G$ with bounded condition number $K = \sup_{h \in \mathcal{F}: h \neq 0} \frac{\sup_{x \in G}|h(x)|^2}{\|h\|_D^2}$. Let $y(x) = f(x) + g(x)$ be our observation with $f \in \mathcal{F}$.

For any $\epsilon > 0$, there exists an efficient algorithm that takes $O(K \log d + \frac{K}{\epsilon})$ unlabeled samples from $D$ and requests $O(\frac{d}{\epsilon})$ labels to output $\widetilde{f}$ such that with probability $0.99$,

1. $\|\widetilde{f} - f\|_D^2 \le \epsilon \cdot \mathbb{E}\big[\|g\|_D^2\big]$, when $g$ is a random function where each $g(x)$ is an independent random variable with $\mathbb{E}[g(x)] = 0$;

2. $\|\widetilde{f} - f\|_D \le (1 + \epsilon) \cdot \|g\|_D$ otherwise.

6.6 Lower Bounds

We present two lower bounds on the number of samples in this section. We first prove a lower bound for query complexity based on the dimension d. Then we prove a lower bound for the sample complexity based on the condition number of the sampling distribution.

Theorem 6.6.1. For any $d$ and any $\epsilon < \frac{1}{10}$, there exist a distribution $D$ and a linear family $\mathcal{F}$ of functions with dimension $d$ such that, for the i.i.d. Gaussian noise $g(x) = N(0, \frac{1}{\epsilon})$, any algorithm that observes $y(x) = f(x) + g(x)$ for $f \in \mathcal{F}$ with $\|f\|_D = 1$ and outputs $\widetilde{f}$ satisfying $\|f - \widetilde{f}\|_D \le 0.1$ with probability $\ge \frac{3}{4}$ needs at least $m \ge \frac{0.8 d}{\epsilon}$ queries.

Notice that this lower bound matches the upper bound in Theorem 6.4.2. In the rest of this section, we focus on the proof of Theorem 6.6.1. Let F = {f :[d] → R} and D be the uniform distribution over [d]. We first construct a packing set M of F.

Claim 6.6.2. There exists a subset M = {f1, . . . , fn} ⊆ F with the following properties:

1. kfikD = 1 for each fi ∈ M.

2. kfik∞ ≤ 1 for each fi ∈ M.

3. kfi − fjkD > 0.2 for distinct fi, fj in M.

4. $n \ge 2^{0.7d}$.

Proof. We construct $M$ from $U = \big\{f : [d] \to \{\pm 1\}\big\}$ in Procedure ConstructM. Notice that $|U| = 2^d$ before the while loop. At the same time, Procedure ConstructM removes at most $\binom{d}{\le 0.01d} \le 2^{0.3d}$ functions each time, because $\|g - h\|_D < 0.2$ indicates $\Pr[g(x) \neq h(x)] \le (0.2)^2/4 = 0.01$. Thus $n \ge 2^d / 2^{0.3d} \ge 2^{0.7d}$.

Algorithm 6 Construct M
1: procedure ConstructM($d$)
2:   Set $n = 0$ and $U = \big\{f : [d] \to \{\pm 1\}\big\}$.
3:   while $U \neq \emptyset$ do
4:     Choose any $h \in U$ and remove all functions $h' \in U$ with $\|h - h'\|_D < 0.2$.
5:     $n = n + 1$ and $f_n = h$.
6:   end while
7:   Return $M = \{f_1, \dots, f_n\}$.
8: end procedure
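Running Procedure ConstructM directly for a small $d$ illustrates the packing (illustrative code; at this scale any two distinct $\pm 1$ vectors are already more than $0.2$-separated, so each round removes only $h$ itself):

```python
from itertools import product

# Greedy packing of {±1}^d under the normalized distance
# ||g - h||_D = sqrt( (1/d) * sum_x |g(x) - h(x)|^2 ), as in Claim 6.6.2.
d = 8
U = list(product((-1, 1), repeat=d))
dist = lambda g, h: (sum((a - b) ** 2 for a, b in zip(g, h)) / d) ** 0.5

M = []
while U:
    h = U[0]
    M.append(h)
    U = [g for g in U if dist(g, h) >= 0.2]   # drop the ball of radius 0.2 around h

# Every pair kept in M is 0.2-separated, as the third property of M requires.
for i in range(len(M)):
    for j in range(i + 1, len(M)):
        assert dist(M[i], M[j]) >= 0.2
print(len(M))   # → 256 for d = 8: distinct ±1 vectors differ by at least sqrt(4/d) > 0.2
```

For large $d$ the removal balls become nontrivial, and the counting argument in the proof shows the packing still has size at least $2^{0.7d}$.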

We state the Shannon-Hartley theorem from information theory to finish the proof of Theorem 6.6.1.

Theorem 6.6.3 (The Shannon-Hartley Theorem [Har28, Sha49]). Let $S$ be a real-valued random variable with $\mathbb{E}[S^2] = \tau^2$ and $T \sim N(0, \sigma^2)$ be independent of $S$. The mutual information satisfies $I(S; S + T) \le \frac{1}{2} \log\big(1 + \frac{\tau^2}{\sigma^2}\big)$.

Proof of Theorem 6.6.1. Because of Yao's minimax principle, we assume $A$ is a deterministic algorithm given the i.i.d. Gaussian noise. Let $I(\widetilde{f}; f_j)$ denote the mutual information between a random function $f_j \in M$ and $A$'s output $\widetilde{f}$ given $m$ observations $(x_1, y_1), \dots, (x_m, y_m)$ with $y_i = f_j(x_i) + N(0, \frac{1}{\epsilon})$. When the output $\widetilde{f}$ satisfies $\|\widetilde{f} - f_j\|_D \le 0.1$, $f_j$ is the closest function to $\widetilde{f}$ in $M$ by the third property of $M$. From Fano's inequality [Fan61], $H(f_j \mid \widetilde{f}) \le H(\frac{1}{4}) + \frac{\log(|M| - 1)}{4}$. This indicates
\[
I(f_j; \widetilde{f}) = H(f_j) - H(f_j \mid \widetilde{f}) \ge \log|M| - 1 - \log(|M| - 1)/4 \ge 0.7 \log|M| \ge 0.4 d.
\]

At the same time, by the data processing inequality, since the algorithm $A$ makes $m$ queries $x_1, \dots, x_m$ and sees $y_1, \dots, y_m$,
\[
I(\widetilde{f}; f_j) \le I\big((y_1, \dots, y_m); f_j\big) = \sum_{i=1}^m I\big(y_i; f_j(x_i) \mid y_1, \dots, y_{i-1}\big). \tag{6.6}
\]
For the query $x_i$, let $D_{i,j}$ denote the distribution of $f_j \in M$ in the algorithm $A$ given the first $i-1$ observations $(x_1, y_1), \dots, (x_{i-1}, y_{i-1})$. We apply Theorem 6.6.3 to $D_{i,j}$, which bounds
\begin{align*}
I\Big(y_i = f_j(x_i) + N\big(0, \tfrac{1}{\epsilon}\big);\ f_j(x_i) \,\Big|\, y_1, \dots, y_{i-1}\Big)
&\le \frac{1}{2}\log\left(1 + \frac{\mathbb{E}_{f \sim D_{i,j}}[f(x_i)^2]}{1/\epsilon}\right) \\
&\le \frac{1}{2}\log\left(1 + \frac{\max_{f \in M} f(x_i)^2}{1/\epsilon}\right) = \frac{1}{2}\log(1 + \epsilon) \le \frac{\epsilon}{2},
\end{align*}
where we apply the second property of $M$ in the second step to bound $f(x)^2$ for any $f \in M$. Hence $\sum_{i=1}^m I(y_i; f_j \mid y_1, \dots, y_{i-1}) \le m \cdot \frac{\epsilon}{2}$. This implies
\[
0.4 d \le m \cdot \frac{\epsilon}{2} \implies m \ge \frac{0.8 d}{\epsilon}.
\]

Next we consider the sample complexity of linear regression.

Theorem 6.6.4. For any $K$, $d$, and $\epsilon > 0$, there exist a distribution $D$, a linear family of functions $\mathcal{F}$ with dimension $d$ whose condition number $\sup_{h \in \mathcal{F}: h \neq 0} \sup_{x \in G} \frac{|h(x)|^2}{\|h\|_D^2}$ equals $K$, and a noise function $g$ orthogonal to $\mathcal{F}$ such that any algorithm observing $y(x) = f(x) + g(x)$ for $f \in \mathcal{F}$ needs at least $\Omega(K \log d + \frac{K}{\epsilon})$ samples from $D$ to output $\widetilde{f}$ satisfying $\|\widetilde{f} - f\|_D \le 0.1\sqrt{\epsilon} \cdot \|g\|_D$ with probability $\frac{3}{4}$.

Proof. We fix $K$ to be an integer and set the domain of the functions in $\mathcal{F}$ to be $[K]$. We choose $D$ to be the uniform distribution over $[K]$. Let $\mathcal{F}$ denote the family of functions $\big\{f : [K] \to \mathbb{C} \mid f(d+1) = f(d+2) = \cdots = f(K) = 0\big\}$. Its condition number $\sup_{h \in \mathcal{F}: h \neq 0} \sup_{x \in G} \frac{|h(x)|^2}{\|h\|_D^2}$ equals $K$: the function $h(x) = \mathbb{1}_{x=1}$ provides the lower bound $\ge K$; at the same time, $\frac{|h(x)|^2}{\|h\|_D^2} = \frac{|h(x)|^2}{\sum_{i=1}^K |h(i)|^2 / K} \le K$ indicates the upper bound $\le K$.

We first consider the case $K \log d \ge \frac{K}{\epsilon}$. Let $g = 0$, so that $g$ is orthogonal to $\mathcal{F}$. Notice that $\|\widetilde{f} - f\|_D \le 0.1\sqrt{\epsilon} \cdot \|g\|_D$ indicates $\widetilde{f}(x) = f(x)$ for every $x \in [d]$. Hence the algorithm needs to sample $f(x)$ for every $x \in [d]$ when sampling from $D$, the uniform distribution over $[K]$. From the lower bound for the coupon collector problem, this takes at least $\Omega(K \log d)$ samples from $D$.

Otherwise, we prove that the algorithm needs $\Omega(K/\epsilon)$ samples. Without loss of generality, we assume $\mathbb{E}_{x \sim [d]}\big[|f(x)|^2\big] = 1$ for the truth $f$ in $y$. Let $g(x) = N(0, 1/\epsilon)$ for each $x \in [d]$. From Theorem 6.6.1, to find $\widetilde{f}$ satisfying $\mathbb{E}_{x \sim [d]}\big[|\widetilde{f}(x) - f(x)|^2\big] \le 0.1\, \mathbb{E}_{x \sim [d]}\big[|f(x)|^2\big]$, the algorithm needs at least $\Omega(d/\epsilon)$ queries of points $x \in [d]$. Hence it needs $\Omega(K/\epsilon)$ random samples from $D$, the uniform distribution over $[K]$.

Chapter 7

Existence of Extractors in Simple Hash Families

We study randomness extractors consisting of hash families in this chapter. Recall that the min-entropy of a random variable $X$ is
\[
H_\infty(X) = \min_{x \in \mathrm{supp}(X)} \log_2 \frac{1}{\Pr[X = x]}.
\]
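The quantity above can be computed directly for small distributions; the helper below is illustrative code (not from the thesis), with distributions given as outcome-to-probability maps:

```python
from math import log2

# Min-entropy H_inf(X) = min over outcomes of log2(1/Pr[X = x]).
def min_entropy(dist):
    return min(log2(1 / p) for p in dist.values() if p > 0)

uniform8 = {x: 1 / 8 for x in range(8)}
skewed = {'a': 0.5, 'b': 0.25, 'c': 0.25}
assert min_entropy(uniform8) == 3.0   # uniform over 2^k outcomes has min-entropy k
assert min_entropy(skewed) == 1.0     # determined by the heaviest outcome, Pr = 1/2
```

Note that min-entropy is governed entirely by the single most likely outcome, unlike Shannon entropy.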

For convenience, we provide the definition of extractors and strong extractors here.

Definition 7.0.1. For any $d \in \mathbb{N}^+$, let $U_d$ denote the uniform distribution over $\{0,1\}^d$. For two random variables $W$ and $Z$ with the same support, let $\|W - Z\|$ denote the statistical (variation) distance
\[
\|W - Z\| = \max_{T \subseteq \mathrm{supp}(W)} \Big| \Pr_{w \sim W}[w \in T] - \Pr_{z \sim Z}[z \in T] \Big|.
\]
A function $\mathrm{Ext} : \{0,1\}^n \times \{0,1\}^t \to \{0,1\}^m$ is a $(k, \epsilon)$-extractor if for every source $X$ with min-entropy $k$ and an independent uniform distribution $Y$ on $\{0,1\}^t$,
\[
\|\mathrm{Ext}(X, Y) - U_m\| \le \epsilon.
\]
It is a strong $(k, \epsilon)$-extractor if, in addition, it satisfies $\big\|\big(\mathrm{Ext}(X, Y), Y\big) - \big(U_m, Y\big)\big\| \le \epsilon$.
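The statistical distance above equals half the $\ell_1$ distance between the two probability vectors; the brute-force check below (illustrative code, not from the thesis) verifies that identity over all statistical tests $T$ for a small pair of distributions:

```python
from itertools import product

# Statistical distance by maximizing over every test T subset of the support.
def tv_by_tests(P, Q):
    outcomes = list(P)
    best = 0.0
    for bits in product((0, 1), repeat=len(outcomes)):
        T = [x for x, b in zip(outcomes, bits) if b]
        best = max(best, abs(sum(P[x] for x in T) - sum(Q[x] for x in T)))
    return best

P = {0: 0.5, 1: 0.25, 2: 0.25}
Q = {0: 0.25, 1: 0.25, 2: 0.5}
half_l1 = sum(abs(P[x] - Q[x]) for x in P) / 2
assert abs(tv_by_tests(P, Q) - half_l1) < 1e-12   # both equal 0.25 here
```

The maximizing test is always the set of outcomes where $P$ exceeds $Q$, which is why the half-$\ell_1$ formula holds in general.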

Given an extractor Ext : {0, 1}n × {0, 1}t → {0, 1}m, we sample a few seeds from {0, 1}t and consider the new extractor, called a restricted extractor, constituted by these seeds.

78 Definition 1.1.3. Given an extractor Ext : {0, 1}n × {0, 1}t → {0, 1}m and a sequence of t seeds (y1, ··· , yD) where each yi ∈ {0, 1} , we define the restricted extractor Ext(y1,··· ,yD) to n be Ext restricted in the domain {0, 1} × [D] where Ext(y1,··· ,yD)(x, i) = Ext(x, yi).

In this chapter, we prove that given any $(k, \epsilon)$-extractor $\mathrm{Ext}$, most restricted extractors from $\mathrm{Ext}$ with a quasi-linear degree $\widetilde{O}(\frac{n}{\epsilon^2})$ are $(k, 3\epsilon)$-extractors for a constant number of output bits, regardless of the degree of $\mathrm{Ext}$.

Theorem 1.1.4. There exists a universal constant $C$ such that given any strong $(k, \epsilon)$-extractor $\mathrm{Ext} : \{0,1\}^n \times \{0,1\}^t \to \{0,1\}^m$, for $D = C \cdot \frac{n \cdot 2^m}{\epsilon^2} \cdot \log^2 \frac{n \cdot 2^m}{\epsilon}$ random seeds $y_1, \dots, y_D \in \{0,1\}^t$, $\mathrm{Ext}_{(y_1,\dots,y_D)}$ is a strong $(k, 3\epsilon)$-extractor with probability $0.99$.

At the same time, the same statement of Theorem 1.1.4 holds for extractors. But in this case, the dependency on $2^m$ in the degree $D$ of restricted extractors is necessary to guarantee error less than $1/2$ on $m$ output bits.

Proposition 1.1.5. There exists a $(k, \epsilon)$-extractor $\mathrm{Ext} : \{0,1\}^n \times \{0,1\}^t \to \{0,1\}^m$ with $k = 1$ and $\epsilon = 0$ such that any restricted extractor of $\mathrm{Ext}$ requires degree $D \ge 2^{m-1}$ to guarantee its error is less than $1/2$.

Though the dependency on $2^m$ is necessary for extractors in the above proposition, it may not be necessary for strong extractors.

Applications of Theorem 1.1.4. In a seminal work, Impagliazzo, Levin, and Luby [ILL89] proved the Leftover Hash Lemma, i.e., all functions from an almost-universal hash family constitute a strong extractor. We first define universal hash families and almost universal hash families.

Definition 7.0.2 (Universal hash families by Carter and Wegman [CW79]). Let $H$ be a family of functions mapping $\{0,1\}^n$ to $\{0,1\}^m$. $H$ is universal if
\[
\forall x, y \in \{0,1\}^n \ (x \neq y), \quad \Pr_{h \sim H}\big[h(x) = h(y)\big] \le 2^{-m}.
\]
Moreover, we say $H$ is almost universal if
\[
\forall x, y \in \{0,1\}^n \ (x \neq y), \quad \Pr_{h \sim H}\big[h(x) = h(y)\big] \le 2^{-m} + 2^{-n}.
\]

We discuss several applications of Theorem 1.1.4 based on the Leftover Hash Lemma [ILL89] on almost universal hash families.

Lemma 7.0.3 (Leftover Hash Lemma [ILL89]). For any $n$ and $m$, let $H$ be a family of hash functions $\{h_1, \dots, h_T\}$ mapping $[2^n]$ to $[2^m]$ such that for any distinct $x$ and $y$, $\Pr_{h \sim H}[h(x) = h(y)] \le 2^{-n} + 2^{-m}$. Then $\mathrm{Ext} : \{0,1\}^n \times [T] \to \{0,1\}^m$ defined as $\mathrm{Ext}(x, y) = h_y(x)$ is a strong $(k, \epsilon)$-extractor for any $k$ and $\epsilon$ satisfying $k \ge m + 2\log\frac{1}{\epsilon}$.

For completeness, we provide a proof of the Leftover Hash Lemma in Appendix B.

Plugging the extractors given by all linear transformations and all Toeplitz matrices [ILL89] into Theorem 1.1.4, our result indicates that most extractors constituted by a quasi-linear number $\widetilde{O}(\frac{n}{\epsilon^2})$ of linear transformations or Toeplitz matrices keep nearly the same min-entropy and error parameters, for a constant number of output bits. We treat the set $\{0, 1, 2, \dots, 2^n - 1\}$ the same as $\{0,1\}^n$ and $\mathbb{F}_2^n$ in this work.

Corollary 7.0.4. There exists a universal constant $C$ such that for any integers $n, m, k$ and any $\epsilon > 0$ with $k \ge m + 2\log\frac{1}{\epsilon}$, $\mathrm{Ext}_{(A_1,\dots,A_D)}$ with $D = C \cdot \frac{n \cdot 2^m}{\epsilon^2} \cdot \log^2 \frac{n \cdot 2^m}{\epsilon}$ random matrices $A_1, \dots, A_D \in \mathbb{F}_2^{m \times n}$, mapping $\mathbb{F}_2^n \times [D]$ to $\mathbb{F}_2^m$ as $\mathrm{Ext}_{(A_1,\dots,A_D)}(x, i) = A_i \cdot x$, is a strong $(k, 3\epsilon)$-extractor with probability $0.99$.

Moreover, the same holds for $D$ random Toeplitz matrices $A_1, \dots, A_D \in \mathbb{F}_2^{m \times n}$.

80 Next we consider extractors from almost-universal hash families, which have efficient implementations and wide applications in practice. We describe a few examples of almost- universal hash families with efficient implementations.

1. Linear congruential hash by Carter and Wegman [CW79]: for any $n$ and $m$, let $p$ be a prime $> 2^n$ and $H_1 = \{h_{a,b} \mid a, b \in \{0, 1, \dots, p-1\}\}$ be the hash family defined as $h_{a,b}(x) = \big((ax + b) \bmod p\big) \bmod 2^m$ for every $x \in \{0, 1, \dots, 2^n - 1\}$.

2. Multiplicative universal hash by Dietzfelbinger et al. [DHKP97] and Woelfel [Woe99]: for any $n$ and $m$, let $H_2 = \{h_{a,b} \mid a \in \{1, 3, 5, \dots, 2^n - 1\},\ b \in \{0, 1, \dots, 2^{n-m} - 1\}\}$ be the hash family mapping $\{0, 1, \dots, 2^n - 1\}$ to $\{0, 1, \dots, 2^m - 1\}$ that first computes $ax + b$ modulo $2^n$ and then takes the high-order $m$ bits as the hash value, i.e., $h_{a,b}(x) = \big((ax + b) \bmod 2^n\big)\ \mathrm{div}\ 2^{n-m}$. In C code, this hash function can be implemented as h_{a,b}(x) = (a*x + b) >> (n - m) when n = 64.

3. Shift register hash by Vazirani [Vaz87]: let $p$ be a prime such that $2$ is a generator modulo $p$, and let $a^{(i)}$ denote the $i$th cyclic shift of a string $a \in \mathbb{F}_2^n$, i.e., $a^{(i)} = a_{i+1} a_{i+2} \cdots a_n a_1 \cdots a_i$. For $n = p - 1$ and any $m \le n$, let $H_3 = \{h_a \mid a \in \mathbb{F}_2^p\}$ be the hash family mapping $\mathbb{F}_2^n$ to $\mathbb{F}_2^m$ as $h_a(x) = \big(\langle a, 1 \circ x\rangle, \langle a^{(1)}, 1 \circ x\rangle, \dots, \langle a^{(m-1)}, 1 \circ x\rangle\big)$, where $\langle w, z \rangle$ denotes the inner product of $w, z \in \mathbb{F}_2^p$ over $\mathbb{F}_2$.
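Item 2 translates directly into code; the following sketch (the parameters $n = 8$, $m = 3$ and the tested multipliers are illustrative choices) mirrors the C expression above and checks that the outputs always fit in $m$ bits:

```python
# Multiplicative hash h_{a,b}(x) = ((a*x + b) mod 2^n) div 2^(n-m), for small n, m.
n, m = 8, 3

def h(a, b, x):
    # first compute a*x + b modulo 2^n, then keep the high-order m bits
    return ((a * x + b) % (1 << n)) >> (n - m)

# Sanity checks: odd multiplier a, offset b < 2^(n-m), outputs always in [0, 2^m).
for a in (1, 37, 255):
    for b in (0, 5, 31):
        for x in range(1 << n):
            assert 0 <= h(a, b, x) < (1 << m)
```

The whole computation stays inside $n$-bit arithmetic, which is what makes this family attractive in practice compared with the mod-$p$ family of item 1.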

Because all these hash families are almost-universal, by the Leftover Hash Lemma

[ILL89], Ext(x, y) = hy(x) is a strong extractor for all hash functions hy in one family. Plugging these extractors in Theorem 1.1.4, we obtain extractors of almost optimal degrees with efficient implementations.

Corollary 7.0.5. Let $H$ be any almost universal hash family mapping $\{0,1\}^n$ to $\{0,1\}^m$. There exists a universal constant $C$ such that for any integer $k$ and any $\epsilon > 0$ with $k \ge m + 2\log\frac{1}{\epsilon}$, $\mathrm{Ext}_{(h_1,\dots,h_D)}$ with $D = C \cdot \frac{n \cdot 2^m}{\epsilon^2} \cdot \log^2 \frac{n \cdot 2^m}{\epsilon}$ random hash functions $h_1, \dots, h_D \sim H$, defined as $\mathrm{Ext}_{(h_1,\dots,h_D)}(x, i) = h_i(x)$, is a strong $(k, 3\epsilon)$-extractor with probability $0.99$.

81 Organization. In the rest of this chapter, we provide a few basic tools and Lemmas in Section 7.1. Then we prove Theorem 1.1.4 for extractors and Proposition 1.1.5 in Section 7.2. Finally, we prove Theorem 1.1.4 for strong extractors in Section 7.3.

7.1 Tools

Given a subset $\Lambda \subseteq \{0,1\}^n$, we consider the flat random source given by the uniform distribution over $\Lambda$, whose min-entropy is $\log_2 |\Lambda|$. Because any random source with min-entropy $k$ is a convex combination of flat random sources of min-entropy $k$, we focus on flat random sources in the rest of this work.

We always use N(0, 1) to denote the standard Gaussian random variable and use the following concentration bound on Gaussian random variables [LT91].

Lemma 7.1.1. Given any $n$ Gaussian random variables $G_1, \dots, G_n$ (not necessarily independent) where each $G_i$ has expectation $0$ and variance $\sigma_i^2$,
\[
\mathbb{E}\Big[\max_{i \in [n]} |G_i|\Big] \lesssim \sqrt{\log n} \cdot \max_{i \in [n]} \sigma_i.
\]

Let $S$ be a set of events, $X$ a random variable, and $f$ any function from $S \times \mathrm{supp}(X)$ to $\mathbb{R}^+$. We state the standard symmetrization and Gaussianization [LT91, RW14] that transform bounding $\max_{\Lambda \in S} \sum_{j=1}^n f(\Lambda, x_j)$ over $n$ independent random variables $x_1, \dots, x_n \sim X$ into bounding a Gaussian process.

Theorem 7.1.2. For any integer $n$ and $n$ independent random samples $x_1, \dots, x_n$ from $X$,
\[
\mathbb{E}_{x_1,\dots,x_n \sim X}\left[\max_{\Lambda \in S} \sum_{j=1}^n f(\Lambda, x_j)\right] \le \max_{\Lambda \in S}\, \mathbb{E}_{x}\left[\sum_{j=1}^n f(\Lambda, x_j)\right] + \sqrt{2\pi} \cdot \mathbb{E}_{x}\left[\mathbb{E}_{g \sim N(0,1)^n}\left[\max_{\Lambda \in S} \sum_{j=1}^n f(\Lambda, x_j)\, g_j\right]\right].
\]
The first term $\max_{\Lambda \in S} \mathbb{E}_x\big[\sum_j f(\Lambda, x_j)\big]$ is the largest expectation over all events $\Lambda$, and the second term bounds the deviation of every event from its expectation simultaneously. For completeness, we provide a proof of Theorem 7.1.2 in Appendix B.

We also state the Beck-Fiala theorem from discrepancy theory [Cha00].

Theorem 7.1.3 (Beck-Fiala theorem). Given a universe $[n]$ and a collection of subsets $S_1, \dots, S_l$ such that each element $i \in [n]$ appears in at most $d$ subsets, there exists an assignment $\chi : [n] \to \{\pm 1\}$ such that for each subset $S_j$, $\big|\sum_{i \in S_j} \chi(i)\big| < 2d$.
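On a tiny instance the Beck-Fiala guarantee can be checked by brute force (the universe and sets below are chosen for this illustration, not taken from the thesis):

```python
from itertools import product

# Each element lies in at most d = 2 of the sets, so by Beck-Fiala some coloring
# achieves discrepancy strictly below 2d = 4 on every set simultaneously.
universe = range(6)
sets = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
d = max(sum(i in S for S in sets) for i in universe)   # max element frequency, here 2

best = min(
    max(abs(sum(chi[i] for i in S)) for S in sets)
    for chi in ({i: c for i, c in zip(universe, cs)}
                for cs in product((-1, 1), repeat=len(universe)))
)
assert best < 2 * d   # the theorem guarantees such a coloring exists
print(best)           # → 1: the alternating coloring gives every set the sum ±1
```

The theorem's strength is that the bound $2d$ is independent of $n$ and of the number of sets.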

7.2 Restricted Extractors

We study restricted extractors in this section and prove Theorem 1.1.4 for extractors. The main result of this section is that most sequences of $\widetilde{O}(\frac{n \cdot 2^m}{\epsilon^2})$ seeds from any given extractor constitute a restricted extractor with nearly the same min-entropy and error parameters. On the other hand, we show that for certain extractors, the degree of any restriction must be $\Omega(2^m)$ to guarantee any constant error.

We first consider the upper bound on the degree of restricted extractors for all entropy- k flat sources fooling one fixed statistical test.

Lemma 7.2.1. Let $\mathrm{Ext} : \{0,1\}^n \times \{0,1\}^t \to \{0,1\}^m$ be a $(k, \epsilon)$-extractor and $D = C \cdot \frac{n \log^2 \frac{n}{\epsilon}}{\epsilon^2}$ for a universal constant $C$. Given any subset $T \subseteq \{0,1\}^m$, for $D$ independently random seeds $y_1, \dots, y_D$ in $\{0,1\}^t$,
\[
\mathbb{E}_{y_1,\dots,y_D}\left[\max_{\Lambda : |\Lambda| = 2^k} \sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big]\right] \le D \cdot \Big(\frac{|T|}{2^m} + 2\epsilon\Big). \tag{7.1}
\]

Proof. We symmetrize and Gaussianize the L.H.S. of (7.1) by Theorem 7.1.2:
\begin{align}
& \mathbb{E}_{y_1,\dots,y_D}\left[\max_{\Lambda : |\Lambda| = 2^k} \sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big]\right] \tag{7.2} \\
& \le \max_{\Lambda}\, \mathbb{E}_{y}\left[\sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big]\right] + \sqrt{2\pi} \cdot \mathbb{E}_{y}\left[\mathbb{E}_{g \sim N(0,1)^D}\left[\max_{\Lambda} \sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big] \cdot g_i\right]\right]. \tag{7.3}
\end{align}
Because $\mathrm{Ext}$ is an extractor for entropy-$k$ sources with error $\epsilon$, the first term satisfies
\[
\max_{|\Lambda| = 2^k}\, \mathbb{E}_{y_1,\dots,y_D}\left[\sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big]\right] \le D \cdot \Big(\frac{|T|}{2^m} + \epsilon\Big).
\]
The technical result is a bound on the Gaussian process for any $y_1, \dots, y_D$.

Claim 7.2.2. For any $y_1, \dots, y_D$, $\mathbb{E}_{g \sim N(0,1)^D}\left[\max_{\Lambda : |\Lambda| = 2^k} \sum_{i=1}^D \Pr\big[\mathrm{Ext}(\Lambda, y_i) \in T\big] \cdot g_i\right] \le C_0 \cdot \sqrt{nD} \cdot \log D$ for some universal constant $C_0$.

We defer the proof of Claim 7.2.2 to Section 7.2.1.

2 n n log  By choosing the constant C large enough, for D = C 2 , we can ensure that √  D |T | C0 nD · log D ≤ 5 . This bounds (7.3) by D · ( 2m + ) + D.

Next, we show that a restricted extractor is good with high probability. To do this, we provide a concentration bound on $\max_{|\Lambda|=2^k}\sum_{i=1}^D\Pr\bigl[\mathrm{Ext}(\Lambda,y_i)\in T\bigr]$. We prove that a restricted extractor with $D$ random seeds achieves the guarantee of Lemma 7.2.1 with probability $1-\delta$ after enlarging $D$ by a factor of $\widetilde{O}(\log\frac1\delta)$.

Lemma 7.2.3. For any $\delta>0$, let $D=C'\cdot\frac{n\log\frac1\delta}{\epsilon^2}\cdot\log^2\frac{n\log\frac1\delta}{\epsilon}$ for a universal constant $C'$. Given any $(k,\epsilon)$-extractor $\mathrm{Ext}:\{0,1\}^n\times\{0,1\}^t\to\{0,1\}^m$ and any subset $T\subseteq\{0,1\}^m$, for $D$ independently random seeds $y_1,\dots,y_D$ in $\{0,1\}^t$,
$$\Pr_{y_1,\dots,y_D}\left[\max_{\Lambda:|\Lambda|=2^k}\left\{\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]\right\}-\frac{D\cdot|T|}{2^m}\le D\cdot3\epsilon\right]\ge1-\delta.$$

We defer the proof of Lemma 7.2.3 to Section 7.2.2. Finally, we state the result about extractors.

Theorem 7.2.4. Let $\mathrm{Ext}:\{0,1\}^n\times\{0,1\}^t\to\{0,1\}^m$ be a $(k,\epsilon)$-extractor and $D=C\cdot\frac{n\cdot(\log\frac1\delta+2^m)}{\epsilon^2}\cdot\log^2\frac{n\cdot(\log\frac1\delta+2^m)}{\epsilon}$ for a universal constant $C$. For a random sequence $(y_1,\dots,y_D)$ where each $y_i\sim\{0,1\}^t$, the restricted extractor $\mathrm{Ext}_{(y_1,\dots,y_D)}$ is a $(k,2\epsilon)$-extractor with probability $1-\delta$.

Proof. We choose the error probability to be $\frac{\delta}{2^{2^m}}$ in Lemma 7.2.3 and apply a union bound over all $2^{2^m}$ possible statistical tests $T\subseteq\{0,1\}^m$.

For extractors, we show that the $2^m$ dependence in the degree is necessary.

Claim 7.2.5. There exists a $(k=1,\epsilon=0)$-extractor $\mathrm{Ext}$ such that for any constants $\epsilon'\le1/2$ and $k'>0$, any restriction $\mathrm{Ext}_{(y_1,\dots,y_D)}$ requires $D=\Omega(2^m)$ to be a $(k',\epsilon')$-extractor.

Proof. Consider the extractor $\mathrm{Ext}:\{0,1\}^n\times\{0,1\}^m\to\{0,1\}^m$ defined by $\mathrm{Ext}(x,y)=y$. By definition, it is a $(k=1,\epsilon=0)$-extractor. On the other hand, $\mathrm{Ext}_{(y_1,\dots,y_D)}$ is a $(k',0.5)$-extractor only if $D\ge0.5\cdot2^m$.
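The counting behind this lower bound can be checked directly: since $\mathrm{Ext}(x,y)=y$, the restricted extractor's outputs lie in the set of $D$ chosen seeds, so the statistical test consisting of the unused outputs is never hit while its density is $1-D/2^m$. A small sketch with hypothetical parameters:

```python
# Hypothetical parameters: m = 4 output bits, D = 3 fixed seeds.
m, D = 4, 3
seeds = [0b0001, 0b0110, 0b1011]           # the D chosen seeds y_1..y_D

def ext(x, y):
    return y                                # Ext(x, y) = y ignores the source

# Outputs of the restricted extractor lie in {y_1, ..., y_D},
# regardless of the input distribution.
support = {ext(x, y) for x in range(2 ** 8) for y in seeds}

# The test T of all unseen outputs witnesses error |T|/2^m - 0.
T = set(range(2 ** m)) - support
error = len(T) / 2 ** m                     # Pr over uniform output minus 0
print(len(support), error)                  # 3 0.8125
```

So keeping the error below any constant $\epsilon'\le1/2$ forces $D\ge(1-\epsilon')\cdot2^m$.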

However, this lower bound may not be necessary for strong extractors.

7.2.1 The Chaining Argument for Fooling One Test

Given $y_1,\dots,y_D$ and $T$, for any subset $\Lambda$, we use $\vec p(\Lambda)$ to denote the vector
$$\Bigl(\Pr[\mathrm{Ext}(\Lambda,y_1)\in T],\dots,\Pr[\mathrm{Ext}(\Lambda,y_D)\in T]\Bigr).$$

Let $t=10$ be a fixed parameter in this proof.

We rewrite the Gaussian process as
$$\mathop{\mathbb{E}}_{g\sim N(0,1)^D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]\cdot g_i\Bigr|\right]=\mathop{\mathbb{E}}_{g\sim N(0,1)^D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\bigl|\langle\vec p(\Lambda),g\rangle\bigr|\right].$$

We construct a sequence of subsets $F_{t-1},F_t,\dots,F_k$ of vectors in $\mathbb{R}^D$ and a sequence of maps $\pi_j:\binom{\{0,1\}^n}{2^k}\to F_j$ for each $j$ from $t-1$ to $k$. We first set $F_k$ to be the subset of all vectors in the Gaussian process, i.e., $F_k=\bigl\{\vec p(\Lambda)\mid\Lambda\in\binom{\{0,1\}^n}{2^k}\bigr\}$, and $\pi_k(\Lambda)=\vec p(\Lambda)$. For convenience, we set $F_{t-1}=\{\vec0\}$ and $\pi_{t-1}(\Lambda)=\vec0$ for every $\Lambda\in\binom{\{0,1\}^n}{2^k}$, and specify $F_j$ and $\pi_j$ for $j\in[t,k-1]$ later. For any $\vec p(\Lambda)$ in the Gaussian process, we use the telescoping identity $\vec p(\Lambda)=\sum_{j=t}^{k}\bigl(\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\bigr)$ to rewrite it:
$$\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}|\langle\vec p(\Lambda),g\rangle|\right]=\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}\Bigl|\Bigl\langle\sum_{j=t}^{k}\bigl(\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\bigr),g\Bigr\rangle\Bigr|\right]\tag{7.4}$$
$$=\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}\Bigl|\sum_{j=t}^{k}\bigl\langle\pi_j(\Lambda)-\pi_{j-1}(\Lambda),g\bigr\rangle\Bigr|\right]\tag{7.5}$$
$$\le\sum_{j=t}^{k}\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}\bigl|\bigl\langle\pi_j(\Lambda)-\pi_{j-1}(\Lambda),g\bigr\rangle\bigr|\right]\tag{7.6}$$
$$\lesssim\sum_{j=t}^{k}\sqrt{\log(|F_j|\cdot|F_{j-1}|)}\cdot\max_{\Lambda}\bigl\|\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\bigr\|_2.\tag{7.7}$$

Here (7.7) follows from the union bound over Gaussian random variables, Lemma 7.1.1. In the rest of this proof, we provide upper bounds on $\|\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\|_2$ and $|F_j|$ to finish the calculation of (7.7).

Two upper bounds for $\|\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\|_2$. Next, we provide two methods to bound $\|\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\|_2$ in (7.7). In this proof, for any map $\pi_j$ and any $\Lambda$, we always choose $\pi_j(\Lambda)=\vec p(\Lambda')$ for some subset $\Lambda'$.

Claim 7.2.6. Given $|\Lambda_0|\ge D^2$, there always exists $\Lambda_1\subseteq\Lambda_0$ with size $|\Lambda_1|\in\bigl[|\Lambda_0|/2-2D,\,|\Lambda_0|/2+2D\bigr]$ such that
$$\|\vec p(\Lambda_0)-\vec p(\Lambda_1)\|_2\le6D^{1.5}/|\Lambda_0|.$$

Proof. We apply the Beck–Fiala Theorem 7.1.3 to find $\Lambda_1$ given $\Lambda_0$. Let the ground set be $\Lambda_0$ and the collection of subsets be $S_i=\bigl\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\bigr\}$ for each $i\in[D]$, together with $S_{D+1}=\Lambda_0$. Because each element appears in at most $D+1$ subsets, Theorem 7.1.3 yields an assignment $\chi:\Lambda_0\to\{\pm1\}$ satisfying $\bigl|\sum_{\alpha\in S_i}\chi(\alpha)\bigr|<2(D+1)$ for each $S_i$. We set $\Lambda_1=\{\alpha\in\Lambda_0\mid\chi(\alpha)=1\}$.

Because $\bigl|\sum_{\alpha\in\Lambda_0}\chi(\alpha)\bigr|<2(D+1)$, we have $\bigl||\Lambda_1|-\frac{|\Lambda_0|}{2}\bigr|<(D+1)\le2D$. At the same time, for each $i\in[D]$, $\bigl||\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|-\frac{|S_i|}{2}\bigr|<(D+1)$.

To finish the proof, we show $\bigl|\Pr[\mathrm{Ext}(\Lambda_0,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda_1,y_i)\in T]\bigr|\le\frac{6D}{|\Lambda_0|}$:
$$\bigl|\Pr[\mathrm{Ext}(\Lambda_0,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda_1,y_i)\in T]\bigr|=\left|\frac{|S_i|}{|\Lambda_0|}-\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_1|}\right|$$
$$\le\left|\frac{|S_i|}{|\Lambda_0|}-\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_0|/2}\right|+\left|\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_0|/2}-\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_1|}\right|$$
$$<\frac{2(D+1)}{|\Lambda_0|}+|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|\cdot\frac{\bigl||\Lambda_0|/2-|\Lambda_1|\bigr|}{|\Lambda_0|/2\cdot|\Lambda_1|}$$
$$<\frac{2(D+1)}{|\Lambda_0|}+\frac{D+1}{|\Lambda_0|/2}\le\frac{6D}{|\Lambda_0|}.$$

From the definition of $\vec p(\Lambda_0)=\bigl(\Pr[\mathrm{Ext}(\Lambda_0,y_1)\in T],\dots,\Pr[\mathrm{Ext}(\Lambda_0,y_D)\in T]\bigr)$, summing this bound over the $D$ coordinates implies $\|\vec p(\Lambda_0)-\vec p(\Lambda_1)\|_2\le\sqrt D\cdot\frac{6D}{|\Lambda_0|}=6D^{1.5}/|\Lambda_0|$.

Next, we provide an alternative bound for $\Lambda_0$ of small size using the probabilistic method.

Claim 7.2.7. Given any $\Lambda_0$ of size at least 100, there always exists $\Lambda_1\subseteq\Lambda_0$ with size $|\Lambda_1|\in\bigl[|\Lambda_0|/2-\sqrt{|\Lambda_0|},\,|\Lambda_0|/2+\sqrt{|\Lambda_0|}\bigr]$ such that
$$\|\vec p(\Lambda_0)-\vec p(\Lambda_1)\|_2\le6\sqrt{D/|\Lambda_0|}.$$

Proof. We first show the existence of $\Lambda_1$ with the following two properties:

1. $|\Lambda_1|\in\bigl[|\Lambda_0|/2-\sqrt{|\Lambda_0|},\,|\Lambda_0|/2+\sqrt{|\Lambda_0|}\bigr]$.

2. $\sum_{i\in[D]}\bigl(|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|-|\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\}|/2\bigr)^2\le D\cdot|\Lambda_0|$.

We put each element $\alpha\in\Lambda_0$ into $\Lambda_1$ randomly and independently with probability $1/2$. For the first property, $\mathop{\mathbb{E}}_{\Lambda_1}\bigl[(|\Lambda_1|-|\Lambda_0|/2)^2\bigr]=|\Lambda_0|/4$ implies that it holds with probability at least $3/4$.

At the same time,
$$\mathop{\mathbb{E}}_{\Lambda_1}\left[\sum_{i\in[D]}\Bigl(|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|-|\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\}|/2\Bigr)^2\right]=\sum_{i\in[D]}\frac{|\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{4}\le\frac{D\cdot|\Lambda_0|}{4}$$
implies that the second property holds with probability at least $3/4$. Therefore there exists $\Lambda_1$ satisfying both properties.

Now let us bound $\|\vec p(\Lambda_0)-\vec p(\Lambda_1)\|_2^2$:
$$\sum_{i\in[D]}\bigl(\Pr[\mathrm{Ext}(\Lambda_0,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda_1,y_i)\in T]\bigr)^2$$
$$\le2\sum_{i\in[D]}\left(\frac{|\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_0|}-\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_0|/2}\right)^2+2\sum_{i\in[D]}\left(\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_0|/2}-\frac{|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|}{|\Lambda_1|}\right)^2$$
$$\le\frac{8}{|\Lambda_0|^2}\sum_{i\in[D]}\Bigl(|\{\alpha\in\Lambda_0\mid\mathrm{Ext}(\alpha,y_i)\in T\}|/2-|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|\Bigr)^2+2\sum_{i\in[D]}\left(|\{\alpha\in\Lambda_1\mid\mathrm{Ext}(\alpha,y_i)\in T\}|\cdot\frac{\bigl||\Lambda_1|-|\Lambda_0|/2\bigr|}{|\Lambda_1|\cdot|\Lambda_0|/2}\right)^2$$
$$\le\frac{8D}{|\Lambda_0|}+2\sum_{i\in[D]}\frac{|\Lambda_0|}{|\Lambda_0|^2/4}\le\frac{16D}{|\Lambda_0|}.$$
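The two properties of the random halving are easy to check empirically. The sketch below uses hypothetical "hit sets" $S_i$ in place of $\{\alpha:\mathrm{Ext}(\alpha,y_i)\in T\}$ (the extractor itself plays no role in the averaging argument) and verifies that a uniformly random half $\Lambda_1$ satisfies both bounds for well over half of the samples:

```python
import random

random.seed(0)
n0 = 1024                                   # |Lambda_0|
D = 20
# Hypothetical hit sets S_i, each about a third of Lambda_0.
S = [set(random.sample(range(n0), n0 // 3)) for _ in range(D)]

good = 0
trials = 200
for _ in range(trials):
    lam1 = {a for a in range(n0) if random.random() < 0.5}
    # Property 1: |Lambda_1| is within sqrt(|Lambda_0|) of |Lambda_0|/2.
    prop1 = abs(len(lam1) - n0 / 2) <= n0 ** 0.5
    # Property 2: total squared deviation of the halved hit counts.
    dev = sum((len(S[i] & lam1) - len(S[i]) / 2) ** 2 for i in range(D))
    prop2 = dev <= D * n0
    good += prop1 and prop2

print(good / trials)
```

Markov and Chebyshev only promise each property with probability $3/4$; in practice the thresholds are loose and nearly every sample satisfies both.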

Constructions of $F_j$. We construct $F_{k-1},\dots,F_t$ to fit with Claims 7.2.6 and 7.2.7. We define two parameters $s(j)_l$ and $s(j)_u$ on the order of $2^j$ for each $F_j$ such that
$$F_j=\left\{\vec p(\Lambda)\;\middle|\;\Lambda\in\binom{\{0,1\}^n}{s(j)_l}\cup\binom{\{0,1\}^n}{s(j)_l+1}\cup\cdots\cup\binom{\{0,1\}^n}{s(j)_u}\right\}.$$

We start with $s(k)_l=s(k)_u=2^k$ and define $s(j)_l$ and $s(j)_u$ from $j=k-1$ down to $t$.

1. $j>2\log D+8$: we define $s(j)_l=\frac{s(j+1)_l}{2}-2D$ and $s(j)_u=\frac{s(j+1)_u}{2}+2D$. In this regime, we bound $2^j-4D\le s(j)_l\le s(j)_u\le2^j+4D$.

2. $j\le2\log D+8$: we define $s(j)_l=\frac{s(j+1)_l}{2}-\sqrt{s(j+1)_l}$ and $s(j)_u=\frac{s(j+1)_u}{2}+\sqrt{s(j+1)_u}$. We bound $0.8\cdot2^j\le s(j)_l\le s(j)_u\le1.4\cdot2^j$ by induction. The base case is $j>2\log D+8$, which is proved above. Because $2D$ is always less than $\sqrt{s(j+1)_l}$ for $j>2\log D+8$,
$$\frac{s(j)_l}{2^j}=\prod_{i=j}^{k-1}\frac{2\,s(i)_l}{s(i+1)_l}\ge\prod_{i=j}^{k-1}\left(1-\frac{2}{\sqrt{s(i+1)_l}}\right)\ge1-\sum_{i=j}^{k-1}\frac{2}{\sqrt{s(i+1)_l}}.$$
By induction, $\sum_{i=j}^{k-1}\frac{2}{\sqrt{s(i+1)_l}}\le\sum_{i\ge t}\frac{2}{\sqrt{0.8\cdot2^i}}\le0.2$ given $t=10$. Similarly,
$$\frac{s(j)_u}{2^j}=\prod_{i=j}^{k-1}\frac{2\,s(i)_u}{s(i+1)_u}\le\prod_{i=j}^{k-1}\left(1+\frac{2}{\sqrt{s(i+1)_u}}\right).$$
By induction, $\sum_{i=j}^{k-1}\frac{2}{\sqrt{s(i+1)_u}}\le\sum_{i\ge t}\frac{2}{\sqrt{1.4\cdot2^i}}\le0.2$ given $t=10$, which implies $\frac{s(j)_u}{2^j}\le1.4$.

Constructions of $\pi_j$. Next, we define $\pi_j$ from $j=k$ down to $j=t$ by induction. The base case is $j=k$, where $\pi_k(\Lambda)=\vec p(\Lambda)$ for any $\Lambda$ of size $2^k$. Given $\Lambda$ and $\pi_j(\Lambda)\in F_j$, we define $\pi_{j-1}(\Lambda)$ using Claim 7.2.6 or 7.2.7. From the definition of $F_j$, $\pi_j(\Lambda)=\vec p(\Lambda_j)$ for some $\Lambda_j$ with size in $[s(j)_l,s(j)_u]$.

For $j>2\log D+8$, we apply Claim 7.2.6 to $\Lambda_j$ to find $\Lambda_{j-1}$ of size $|\Lambda_{j-1}|\in\bigl[|\Lambda_j|/2-2D,\,|\Lambda_j|/2+2D\bigr]$ satisfying $\|\vec p(\Lambda_j)-\vec p(\Lambda_{j-1})\|_2\le6D^{1.5}/|\Lambda_j|$.

For $j\le2\log D+8$, we apply Claim 7.2.7 to $\Lambda_j$ to find $\Lambda_{j-1}$ of size $|\Lambda_{j-1}|\in\bigl[|\Lambda_j|/2-\sqrt{|\Lambda_j|},\,|\Lambda_j|/2+\sqrt{|\Lambda_j|}\bigr]$ satisfying $\|\vec p(\Lambda_j)-\vec p(\Lambda_{j-1})\|_2\le6\sqrt{D/|\Lambda_j|}$.

Thus $|\Lambda_{j-1}|$ always lies in $[s(j-1)_l,s(j-1)_u]$, which means $\vec p(\Lambda_{j-1})\in F_{j-1}$. We set $\pi_{j-1}(\Lambda)=\vec p(\Lambda_{j-1})$.

To finish this proof, we plug $0.8\cdot2^j\le s(j)_l\le s(j)_u\le1.4\cdot2^j$ and $|F_j|\le\sum_{i=s(j)_l}^{s(j)_u}\binom{2^n}{i}\le\bigl(s(j)_u-s(j)_l+1\bigr)\cdot\binom{2^n}{s(j)_u}\le2^j\cdot2^{1.4n\cdot2^j}\le2^{2n\cdot2^j}$ into (7.7):
$$\sum_{j=t}^{k}\sqrt{\log(|F_j|\cdot|F_{j-1}|)}\cdot\max_{\Lambda}\bigl\|\pi_j(\Lambda)-\pi_{j-1}(\Lambda)\bigr\|_2$$
$$\le\sum_{j=2\log D+9}^{k}\sqrt{4n\cdot2^j}\cdot\frac{6D^{1.5}}{s(j)_l}+\sum_{j=t}^{2\log D+8}\sqrt{4n\cdot2^j}\cdot6\sqrt{\frac{D}{s(j)_l}}$$
$$\lesssim\sum_{j=2\log D+9}^{k}\sqrt{n\cdot2^j}\cdot\frac{D^{1.5}}{2^j}+\sum_{j=t}^{2\log D+8}\sqrt{n\cdot2^j}\cdot\sqrt{\frac{D}{2^j}}$$
$$\le\sum_{j=2\log D+9}^{k}\sqrt{nD}\cdot\frac{D}{2^{j/2}}+\sum_{j=t}^{2\log D+8}\sqrt{nD}$$
$$\lesssim\log D\cdot\sqrt{nD}.$$

7.2.2 Larger Degree with High Confidence

We finish the proof of Lemma 7.2.3 in this section. Given $(y_1,\dots,y_D)\in\{0,1\}^{t\times D}$ and $T$, we consider the error vector in $\mathbb{R}^{\binom{2^n}{2^k}}$:
$$\mathrm{Err}(y_1,\dots,y_D)=\left(\sum_{i=1}^D\left(\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\frac{|T|}{2^m}\right)\right)_{\Lambda\in\binom{\{0,1\}^n}{2^k}}.$$

Because $\max_{\Lambda:|\Lambda|=2^k}\bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\frac{D\cdot|T|}{2^m}\bigr|\le\|\mathrm{Err}(y_1,\dots,y_D)\|_\infty$, it suffices to prove
$$\Pr_{y_1,\dots,y_D}\bigl[\|\mathrm{Err}(y_1,\dots,y_D)\|_\infty\ge3\epsilon D\bigr]\le\delta\quad\text{for }D=C'\cdot\frac{n\log\frac1\delta}{\epsilon^2}\cdot\log^2\frac{n\log\frac1\delta}{\epsilon}.\tag{7.8}$$

Since $\mathrm{Err}(y_1,\dots,y_D)=\mathrm{Err}(y_1,\emptyset,\dots,\emptyset)+\mathrm{Err}(\emptyset,y_2,\emptyset,\dots,\emptyset)+\cdots+\mathrm{Err}(\emptyset,\dots,\emptyset,y_D)$, we plan to apply a concentration bound to prove (7.8).

Our main tool is a concentration inequality of Ledoux and Talagrand [LT91] for symmetric random vectors. For convenience, we use the following version for any Banach space from Rudelson and Vershynin, stated as Theorem 3.8 in [RV08].

Theorem 7.2.8. Given a Banach space with norm $\|\cdot\|$, let $Y_1,\dots,Y_m$ be independent and symmetric random vectors taking values in it with $\|Y_j\|\le r$ for all $j\in[m]$. There exists an absolute constant $C_1$ such that for any integers $l\ge q$ and any $t>0$, the random variable $\|\sum_{j=1}^m Y_j\|$ satisfies
$$\Pr_{Y_1,\dots,Y_m}\left[\Bigl\|\sum_{j=1}^m Y_j\Bigr\|\ge8q\,\mathop{\mathbb{E}}\Bigl\|\sum_{j=1}^m Y_j\Bigr\|+2r\cdot l+t\right]\le\left(\frac{C_1}{q}\right)^l+2\exp\left(-\frac{t^2}{256q\,\mathbb{E}\bigl[\|\sum_{j=1}^m Y_j\|\bigr]^2}\right).$$

To apply this theorem for symmetric random vectors, we symmetrize our target $\mathrm{Err}(y_1,\dots,y_D)$ as follows. Given a subset $T\subseteq\{0,1\}^m$ and $2D$ seeds $(y_1,\dots,y_D)$ and $(z_1,\dots,z_D)$, we define a vector $\Delta_{y,z}$ from $\binom{\{0,1\}^n}{2^k}$ to $\mathbb{R}$:
$$\Delta_{y,z}(\Lambda)=\sum_{i=1}^D\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda,z_i)\in T]\bigr).$$

We use the $\ell_\infty$ norm in this section:
$$\|\Delta_{y,z}\|_\infty=\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}|\Delta_{y,z}(\Lambda)|.$$

Next, we use the following lemma to bridge Theorem 7.2.8 for symmetric random vectors and our goal (7.8).

Lemma 7.2.9. When we generate $y=(y_1,\dots,y_D)$ and $z=(z_1,\dots,z_D)$ independently from the uniform distribution on $\{0,1\}^{D\times t}$, for $\mathrm{Err}(y)$ over the random choices $y=(y_1,\dots,y_D)$,
$$\Pr_y\Bigl[\|\mathrm{Err}(y)\|_\infty\ge2\,\mathop{\mathbb{E}}_y\bigl[\|\mathrm{Err}(y)\|_\infty\bigr]+\delta\Bigr]\le2\cdot\Pr_{y,z}\bigl[\|\Delta_{y,z}\|_\infty\ge\delta\bigr].$$

Proof. Let $Z$ and $Z'$ be independent, identically distributed, non-negative random variables. We use the following fact from [RV08]:
$$\Pr\bigl[Z\ge2\,\mathbb{E}[Z]+\delta\bigr]\le2\Pr\bigl[Z-Z'\ge\delta\bigr].\tag{7.9}$$

The reason is that
$$\Pr_Z\bigl[Z\ge2\,\mathbb{E}[Z]+\delta\bigr]\le\Pr_{Z,Z'}\bigl[Z-Z'\ge\delta\,\big|\,Z'\le2\,\mathbb{E}[Z]\bigr]\le\frac{\Pr\bigl[Z-Z'\ge\delta\wedge Z'\le2\,\mathbb{E}[Z]\bigr]}{\Pr_{Z'}\bigl[Z'\le2\,\mathbb{E}[Z']\bigr]}\le\frac{\Pr\bigl[Z-Z'\ge\delta\bigr]}{1/2}.$$

By plugging $Z=\|\mathrm{Err}(y)\|_\infty$ and $Z'=\|\mathrm{Err}(z)\|_\infty$ into (7.9),
$$\Pr_y\Bigl[\|\mathrm{Err}(y)\|_\infty\ge2\,\mathop{\mathbb{E}}_y\bigl[\|\mathrm{Err}(y)\|_\infty\bigr]+\delta\Bigr]\le2\Pr_{y,z}\bigl[\|\mathrm{Err}(y)\|_\infty-\|\mathrm{Err}(z)\|_\infty\ge\delta\bigr].$$

Then, for any y = (y1, ··· , yD) and z = (z1, ··· , zD), we bound kErr(y)k∞ − kErr(z)k∞ by    

 X   D · |T |   X   D · |T |  max Pr Ext(Λ, yi) ∈ T − − max Pr Ext(Λ, zi) ∈ T − {0,1}n m {0,1}n m Λ∈ 2 Λ∈ 2 ( 2k )  i∈[D]  ( 2k )  i∈[D]   

 X   D · |T | X   D · |T |  ≤ max Pr Ext(Λ, yi) ∈ T − − Pr Ext(Λ, zi) ∈ T − {0,1}n m m Λ∈ 2 2 ( 2k )  i∈[D] i∈[D]       

 X   D · |T | X   D · |T |  ≤ max Pr Ext(Λ, yi) ∈ T − − Pr Ext(Λ, zi) ∈ T − {0,1}n  m   m  Λ∈ 2 2 ( 2k )  i∈[D] i∈[D]   

 X   X    = max Pr Ext(Λ, yi) ∈ T − Pr Ext(Λ, zi) ∈ T = k∆y,zk∞ {0,1}n Λ∈ ( 2k )  i∈[D] i∈[D] 

From the discussion above, we have   Pr kErr(y)k∞ ≥ 2 [kErr(y)k∞] + δ ≤2 Pr [k∆y,zk∞ ≥ δ] . y Ey y,z
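Inequality (7.9) is distribution-free, so a quick Monte Carlo check can confirm it; the non-negative distribution below is arbitrary and chosen only for illustration:

```python
import random

random.seed(1)
# An arbitrary non-negative distribution for Z: Exp(1) plus an
# occasional bump, split into two independent i.i.d. halves Z, Z'.
samples = [random.expovariate(1.0) + (2.0 if random.random() < 0.1 else 0.0)
           for _ in range(200_000)]
mean = sum(samples) / len(samples)
z, z_prime = samples[:100_000], samples[100_000:]

for thresh in (0.5, 1.0, 2.0):
    # Pr[Z >= 2 E[Z] + t]  <=  2 Pr[Z - Z' >= t]   (fact (7.9))
    lhs = sum(v >= 2 * mean + thresh for v in z) / len(z)
    rhs = 2 * sum(a - b >= thresh for a, b in zip(z, z_prime)) / len(z)
    assert lhs <= rhs + 1e-3
print("ok")
```

The slack is large in practice; the $1/2$ in the proof comes only from Markov's inequality applied to $Z'$.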

Proof of Lemma 7.2.3. From Lemma 7.2.9, it is enough to use Theorem 7.2.8 to show a concentration bound on $\Delta_{y,z}$. We first bound $\mathbb{E}[\|\mathrm{Err}(y)\|_\infty]$ and $\mathbb{E}[\|\Delta_{y,z}\|_\infty]$. Notice that the proofs of Theorem 7.1.2 and Lemma 7.2.1 indicate
$$\mathop{\mathbb{E}}_y\bigl[\|\mathrm{Err}(y)\|_\infty\bigr]=\mathop{\mathbb{E}}_y\left[\max_{\Lambda:|\Lambda|=2^k}\Bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\frac{D\cdot|T|}{2^m}\Bigr|\right]$$
$$\le\mathop{\mathbb{E}}_y\left[\max_{\Lambda:|\Lambda|=2^k}\Bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\mathop{\mathbb{E}}_{y'}\Bigl[\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y'_i)\in T]\Bigr]\Bigr|\right]+\max_{\Lambda:|\Lambda|=2^k}\Bigl|\mathop{\mathbb{E}}_{y'}\Bigl[\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y'_i)\in T]\Bigr]-\frac{D\cdot|T|}{2^m}\Bigr|$$
$$\le\sqrt{2\pi}\,\mathop{\mathbb{E}}_y\left[\mathop{\mathbb{E}}_{g\sim N(0,1)^D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]\cdot g_i\Bigr|\right]\right]+\epsilon\cdot D\qquad\text{by Claim B.1}$$
$$\le C_2\sqrt{nD}\cdot\log D+\epsilon D.\qquad\text{by Claim 7.2.2}$$

Similarly,
$$\mathop{\mathbb{E}}_{y,z}\bigl[\|\Delta_{y,z}\|_\infty\bigr]=\mathop{\mathbb{E}}_{y,z}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i=1}^D\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda,z_i)\in T]\bigr)\Bigr|\right]$$
$$\le\sqrt{2\pi}\,\mathop{\mathbb{E}}_y\left[\mathop{\mathbb{E}}_{g\sim N(0,1)^D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i=1}^D\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]\cdot g_i\Bigr|\right]\right]\qquad\text{by Claim B.1}$$
$$\le C_2\sqrt{nD}\cdot\log D.\qquad\text{by Claim 7.2.2}$$

Now we rewrite $\Delta_{y,z}(\Lambda)=\sum_{i=1}^D\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda,z_i)\in T]\bigr)$ as the sum of $D$ symmetric and independent random vectors
$$\Delta_{y_i,z_i}(\Lambda)=\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda,z_i)\in T]$$
for $\Lambda\in\binom{\{0,1\}^n}{2^k}$. Next, we bound each term:
$$r=\max_{y_i,z_i}\|\Delta_{y_i,z_i}\|_\infty=\max_{y_i,z_i,\Lambda\in\binom{\{0,1\}^n}{2^k}}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)\in T]-\Pr[\mathrm{Ext}(\Lambda,z_i)\in T]\bigr|\le1.$$

We choose the parameters $q=2C_1=\Theta(1)$, $l=\log\frac1\delta\le\epsilon D/10$, and $t=\epsilon D/3$, and plug them into Theorem 7.2.8 to obtain
$$\Pr\Bigl[\|\Delta_{y,z}\|_\infty\ge8q\cdot C_2\sqrt{nD}\cdot\log D+2r\cdot l+t\Bigr]\le2^{-l}+2e^{-\frac{t^2}{256q\,\mathbb{E}[\|\Delta_{y,z}\|_\infty]^2}}\le3\delta,$$
while $8q\cdot C_2\sqrt{nD}\cdot\log D+2r\cdot l+t\le0.8\epsilon D$.

Since $\mathbb{E}[\|\mathrm{Err}(y)\|_\infty]\le1.1\epsilon D$, Lemma 7.2.9 gives $\Pr[\|\mathrm{Err}(y)\|_\infty\ge3\epsilon D]\le3\delta$, which proves (7.8) after rescaling $\delta$ by a constant factor.

7.3 Restricted Strong Extractors

We extend our techniques to strong extractors in this section.

Theorem 7.3.1. Let $D=C\cdot\frac{n\cdot2^m}{\epsilon^2}\cdot(\log\frac n\epsilon+m)^2$ for a universal constant $C$ and $\mathrm{Ext}:\{0,1\}^n\times\{0,1\}^t\to\{0,1\}^m$ be any strong $(k,\epsilon)$-extractor. For $D$ independently random seeds $y_1,\dots,y_D$, $\mathrm{Ext}_{(y_1,\dots,y_D)}$ is a strong extractor for entropy-$k$ sources with expected error $2\epsilon$.

Similar to Lemma 7.2.3, when we enlarge the degree by a factor of $\widetilde{O}(\log\frac1\delta)$, we improve the guarantee to a high probability $1-\delta$ instead of an expected error.

Corollary 7.3.2. For any $\delta>0$, let $D=C\cdot\frac{n\cdot2^m\log\frac1\delta}{\epsilon^2}\cdot\log^2\frac{n\cdot2^m\log\frac1\delta}{\epsilon}$ for a universal constant $C$. Given any strong $(k,\epsilon)$-extractor $\mathrm{Ext}:\{0,1\}^n\times\{0,1\}^t\to\{0,1\}^m$, for $D$ independently random seeds $y_1,\dots,y_D$, $\mathrm{Ext}_{(y_1,\dots,y_D)}$ is a strong $(k,3\epsilon)$-extractor with probability at least $1-\delta$.

The proof of Theorem 7.3.1 follows the same outline as the proof of Lemma 7.2.1 with different parameters. We apply a chaining argument to bound the $\ell_1$ error over all entropy-$k$ flat sources $\Lambda$:
$$\max_{|\Lambda|=2^k}\left\{\sum_{i=1}^D\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-2^{-m}\bigr|\right\},$$
instead of bounding the error over all statistical tests as for degree-$D$ (non-strong) extractors.

Proof of Theorem 7.3.1. We plan to show
$$\mathop{\mathbb{E}}_{y_1,\dots,y_D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\sum_{i=1}^D\sum_{\alpha\in\{0,1\}^m}\Bigl|\Pr_{x\sim\Lambda}[\mathrm{Ext}(x,y_i)=\alpha]-2^{-m}\Bigr|\right]\le4\epsilon D.\tag{7.10}$$

For convenience, we use $\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]$ to denote $\Pr_{x\sim\Lambda}[\mathrm{Ext}(x,y_i)=\alpha]$, and $\mathrm{Err}_y(\Lambda)$ to denote the error of seed $y$ on subset $\Lambda$, i.e., $\mathrm{Err}_y(\Lambda)=\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y)=\alpha]-2^{-m}\bigr|$. With these notations, (7.10) becomes
$$\mathop{\mathbb{E}}_{y_1,\dots,y_D}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\right].$$
Then we symmetrize and Gaussianize it by Theorem 7.1.2:
$$\mathop{\mathbb{E}}_{y}\left[\max_{\Lambda}\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\right]\le\max_{\Lambda}\mathop{\mathbb{E}}_{y}\left[\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\right]+\sqrt{2\pi}\,\mathop{\mathbb{E}}_{y}\left[\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}\Bigl|\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\cdot g_i\Bigr|\right]\right].\tag{7.11}$$

Because $\mathrm{Ext}$ is a strong extractor, the first term $\mathop{\mathbb{E}}_{y_1,\dots,y_D}\bigl[\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\bigr]$ is at most $2\epsilon D$ for any $\Lambda$ of size $2^k$.

To bound the second term in (7.11), we fix the seeds $y_1,\dots,y_D$ and bound the Gaussian process.

Claim 7.3.3. For any seeds $y_1,\dots,y_D$, $\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i\in[D]}\mathrm{Err}_{y_i}(\Lambda)\cdot g_i\Bigr|\right]\le C_0(\log D+m)\sqrt{nD\cdot2^m}$ for a constant $C_0$.

We defer the proof of this claim to Section 7.3.1. We finish the proof by bounding (7.10) as follows:
$$\mathop{\mathbb{E}}_{y_1,\dots,y_D}\left[\max_{\Lambda}\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\right]\le\sqrt{2\pi}\cdot\mathop{\mathbb{E}}_{y}\left[\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda}\Bigl|\sum_{i}\mathrm{Err}_{y_i}(\Lambda)\cdot g_i\Bigr|\right]\right]+2\epsilon D\le\sqrt{2\pi}\cdot C_0(\log D+m)\sqrt{nD\cdot2^m}+2\epsilon D.$$

We choose $D=10C_0^2\cdot\frac{n(\log\frac n\epsilon+m)^2\cdot2^m}{\epsilon^2}$ such that
$$\mathop{\mathbb{E}}_{y_1,\dots,y_D}\left[\max_{\Lambda}\sum_{i=1}^D\mathrm{Err}_{y_i}(\Lambda)\right]\le4\epsilon D.$$

This indicates that the error of the restricted strong extractor given by the seeds $y_1,\dots,y_D$ is at most $2\epsilon$ in expected statistical distance, since the $\ell_1$ distance is twice the statistical distance.

Bottleneck of the chaining argument. In our proof of Theorem 7.3.1, we use the following relaxation to bound the distance between two vectors of the Gaussian process corresponding to subsets $\Lambda$ and $\Lambda'$:
$$\Bigl\|\bigl(\|\mathrm{Ext}(\Lambda,y_i)-U_m\|_1\bigr)_{i=1,\dots,D}-\bigl(\|\mathrm{Ext}(\Lambda',y_i)-U_m\|_1\bigr)_{i=1,\dots,D}\Bigr\|_2^2\tag{7.12}$$
$$=\sum_{i=1}^D\bigl(\|\mathrm{Ext}(\Lambda,y_i)-U_m\|_1-\|\mathrm{Ext}(\Lambda',y_i)-U_m\|_1\bigr)^2\tag{7.13}$$
$$\le\sum_{i=1}^D\bigl(\|\mathrm{Ext}(\Lambda,y_i)-\mathrm{Ext}(\Lambda',y_i)\|_1\bigr)^2.\tag{7.14}$$
The shortcoming of our approach is that the subset chaining argument provides a tight analysis of (7.14) but not of (7.12).

We show that the Gaussian process under the distance (7.14) is $\Omega(\sqrt{2^m})$ by the Sudakov minoration. For example, consider the distance in the first coordinate, $\|\mathrm{Ext}(\Lambda,y_1)-\mathrm{Ext}(\Lambda',y_1)\|_1$. Because of the existence of random codes with constant rate and linear distance, there exist $l=\exp\bigl(\Omega(2^m)\bigr)$ subsets $T_1,\dots,T_l$ of $\{0,1\}^m$ such that $|T_i\setminus T_j|=\Omega(2^m)$ for any $i\ne j$. Let $\Lambda_1,\dots,\Lambda_l$ be the inverse images of $T_1,\dots,T_l$ under $\mathrm{Ext}(\cdot,y_1)$. Then $\|\mathrm{Ext}(\Lambda_i,y_1)-\mathrm{Ext}(\Lambda_j,y_1)\|_1=\Omega(1)$ for any $i\ne j$ from the distance of the code, which indicates that the Gaussian process is $\Omega(\sqrt{2^m})$ under the distance (7.14) by the Sudakov minoration.

7.3.1 The Chaining Argument for Strong Extractors

We prove Claim 7.3.3 in this section. We fix a parameter $t=8$ in this proof.

Recall that $y_1,\dots,y_D$ are fixed in this section. We use $\mathrm{Err}(\Lambda)$ to denote the vector $\bigl(\mathrm{Err}_{y_1}(\Lambda),\dots,\mathrm{Err}_{y_D}(\Lambda)\bigr)$. We rewrite the Gaussian process as
$$\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\Bigl|\sum_{i\in[D]}\mathrm{Err}_{y_i}(\Lambda)\cdot g_i\Bigr|\right]=\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\bigl|\langle\mathrm{Err}(\Lambda),g\rangle\bigr|\right].$$

We define a sequence of subsets $F_{t-1},F_t,F_{t+1},\dots,F_k$ of vectors in $\mathbb{R}^D$, where $F_{t-1}=\{\vec0\}$, $|F_i|\le2^{2n\cdot2^i}$, and $F_k=\bigl\{\mathrm{Err}(\Lambda)\mid\Lambda\in\binom{\{0,1\}^n}{2^k}\bigr\}$. For each $i$ from $t$ to $k$, we construct a map $\pi_i:F_k\to F_i$, where $\pi_k$ is the identity map and $\pi_{t-1}(v)=\vec0$ for any $v$. For any vector $v\in F_k$,
$$v=\sum_{j=t}^{k}\bigl(\pi_j(v)-\pi_{j-1}(v)\bigr).$$

We plug these notations into the Gaussian process:
$$\mathop{\mathbb{E}}_{g}\left[\max_{\Lambda\in\binom{\{0,1\}^n}{2^k}}\bigl|\langle\mathrm{Err}(\Lambda),g\rangle\bigr|\right]=\mathop{\mathbb{E}}_{g}\left[\max_{v\in F_k}\bigl|\langle v,g\rangle\bigr|\right]\tag{7.15}$$
$$=\mathop{\mathbb{E}}_{g}\left[\max_{v\in F_k}\Bigl|\Bigl\langle\sum_{j=t}^{k}\bigl(\pi_j(v)-\pi_{j-1}(v)\bigr),g\Bigr\rangle\Bigr|\right]\tag{7.16}$$
$$\le\mathop{\mathbb{E}}_{g}\left[\max_{v\in F_k}\sum_{j=t}^{k}\bigl|\bigl\langle\pi_j(v)-\pi_{j-1}(v),g\bigr\rangle\bigr|\right]\tag{7.17}$$
$$\le\sum_{j=t}^{k}\mathop{\mathbb{E}}_{g}\left[\max_{v\in F_k}\bigl|\bigl\langle\pi_j(v)-\pi_{j-1}(v),g\bigr\rangle\bigr|\right]\tag{7.18}$$
$$\lesssim\sum_{j=t}^{k}\sqrt{\log\bigl(|F_j|\cdot|F_{j-1}|\bigr)}\cdot\max_{v\in F_k}\bigl\|\pi_j(v)-\pi_{j-1}(v)\bigr\|_2.\tag{7.19}$$

We first construct $F_j$ from $j=k$ down to $j=t$, and then define the maps $\pi_{k-1},\dots,\pi_t$. To construct $F_j$, we specify two parameters $s(j)_l$ and $s(j)_u$ of order $\Theta(2^j)$ for the size of $\Lambda$ such that
$$F_j=\left\{\mathrm{Err}(\Lambda)\;\middle|\;\Lambda\in\binom{\{0,1\}^n}{s(j)_l}\cup\cdots\cup\binom{\{0,1\}^n}{s(j)_u}\right\}.$$

Notice that the size of each subset $F_j$ is bounded by
$$|F_j|\le\binom{2^n}{s(j)_l}+\cdots+\binom{2^n}{s(j)_u}.$$

The base case is $s(k)_l=s(k)_u=2^k$ and $F_k=\bigl\{\mathrm{Err}(\Lambda)\mid\Lambda\in\binom{\{0,1\}^n}{2^k}\bigr\}$.

Construction of $F_j$ for $j>4\log D+m$: $s(j)_l=s(j+1)_l/2-2D$ and $s(j)_u=s(j+1)_u/2+2D$. We bound $s(j)_l\ge2^j-4D$ and $s(j)_u\le2^j+4D$ for all $j>4\log D+m$.

Construction of $F_j$ for $j\le4\log D+m$: $s(j)_l=s(j+1)_l/2-\sqrt{s(j+1)_l}$ and $s(j)_u=s(j+1)_u/2+\sqrt{s(j+1)_u}$. We bound $s(j)_l\ge0.8\cdot2^j$ because $s(j)_l/2^j=\prod_{i=j}^{k-1}\frac{2\,s(i)_l}{s(i+1)_l}$ is at least
$$\left(1-\frac{2}{\sqrt{s(j+1)_l}}\right)\cdot\left(1-\frac{2}{\sqrt{s(j+2)_l}}\right)\cdots\left(1-\frac{2}{\sqrt{s(k)_l}}\right)\ge1-\sum_{i=j+1}^{k}\frac{2}{\sqrt{s(i)_l}}\ge1-\sum_{i=j+1}^{k}\frac{2}{\sqrt{0.8\cdot2^i}}\ge0.8.$$

Similarly, we bound $s(j)_u\le1.4\cdot2^j$ because
$$\left(1+\frac{2}{\sqrt{s(j+1)_u}}\right)\cdot\left(1+\frac{2}{\sqrt{s(j+2)_u}}\right)\cdots\left(1+\frac{2}{\sqrt{s(k)_u}}\right)\le1+2\sum_{i=j+1}^{k}\frac{2}{\sqrt{s(i)_u}}\le1+2\sum_{i=j+1}^{k}\frac{2}{\sqrt{1.4\cdot2^i}}\le1.4.$$

Construction of $\pi_j$: we construct the map $\pi_j$ from $j=k-1$ down to $j=t$ and bound $\|\pi_{j+1}(v)-\pi_j(v)\|_2$ for each $v\in F_k$ in (7.19). We first use the Beck–Fiala theorem from discrepancy theory to construct $\pi_j$ for $j>4\log D+m$, then use a randomized argument to construct $\pi_j$ for $j\le4\log D+m$.

Claim 7.3.4. Given $|\Lambda|\ge D^4$ and $D$ seeds $y_1,\dots,y_D$, there always exists $\Lambda'\subseteq\Lambda$ with size $|\Lambda'|\in\bigl[|\Lambda|/2-2D,\,|\Lambda|/2+2D\bigr]$ such that
$$\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2\le6D^{1.5}\cdot2^m/|\Lambda|.$$

Proof. We apply the Beck–Fiala theorem from discrepancy theory. We define the ground set $S=\Lambda$ and $m'=2^m\cdot D+1$ subsets $T_1,\dots,T_{m'}$ by
$$T_{(i-1)\cdot2^m+\alpha}=\bigl\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\bigr\}\quad\text{for each }\alpha\in\{0,\dots,2^m-1\}\text{ and }i\in[D],$$
and the last subset $T_{m'}=S=\Lambda$. Notice that the degree of every element $x\in\Lambda$ is $D+1$.

From the Beck–Fiala theorem, there always exists $\chi:\Lambda\to\{\pm1\}$ such that
$$\Bigl|\sum_{x\in T_i}\chi(x)\Bigr|<2D+2\quad\text{for every }i\in[m'].$$

We choose $\Lambda'=\{x\mid\chi(x)=1\}$. From the guarantee on $T_{m'}$, we know $|\Lambda'|\in\bigl[|\Lambda|/2-(D+1),\,|\Lambda|/2+(D+1)\bigr]$. Next, we consider $\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2$.

We fix $\alpha\in\{0,1\}^m$ and $i\in[D]$ and bound $\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2$ as follows:
$$\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2$$
$$\le2\left(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|/2}\right)^2+2\left(\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|/2}-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\right)^2$$
$$\le2\left(\frac{|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|-2|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|}\right)^2+2\left(|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|\cdot\frac{\bigl||\Lambda|/2-|\Lambda'|\bigr|}{|\Lambda|/2\cdot|\Lambda'|}\right)^2$$
$$\le2\left(\frac{3D}{|\Lambda|}\right)^2+2\left(\frac{3D}{|\Lambda|}\right)^2\le\frac{36D^2}{|\Lambda|^2},$$
where both numerators are bounded by $2(D+1)\le3D$ using the Beck–Fiala guarantee.

We bound $\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2^2$ using the above inequality:
$$\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2^2=\sum_{i=1}^D\left(\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-2^{-m}\bigr|-\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]-2^{-m}\bigr|\right)^2$$
$$\le\sum_{i=1}^D\left(\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr|\right)^2$$
$$\le\sum_{i=1}^D2^m\sum_{\alpha\in\{0,1\}^m}\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2\le\frac{36D^3\cdot2^{2m}}{|\Lambda|^2}.$$

Claim 7.3.5. Given any $\Lambda$ of size at least 100, there always exists $\Lambda'\subseteq\Lambda$ with size $|\Lambda'|\in\bigl[|\Lambda|/2-\sqrt{|\Lambda|},\,|\Lambda|/2+\sqrt{|\Lambda|}\bigr]$ such that
$$\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2\le6\sqrt{D\cdot2^m/|\Lambda|}.$$

Proof. We show the existence of $\Lambda'$ by the probabilistic method, picking each element of $\Lambda$ into $\Lambda'$ independently with probability $1/2$. Because $\mathbb{E}[|\Lambda'|]=\frac{|\Lambda|}{2}$ and $\mathbb{E}\bigl[(|\Lambda'|-\frac{|\Lambda|}{2})^2\bigr]=\frac{|\Lambda|}{4}$, by the Chebyshev inequality $\Lambda'$ satisfies
$$|\Lambda'|\in\Bigl[\frac{|\Lambda|}{2}-\sqrt{|\Lambda|},\;\frac{|\Lambda|}{2}+\sqrt{|\Lambda|}\Bigr]\tag{7.20}$$
with probability at least $3/4$.

Next, we consider
$$\mathop{\mathbb{E}}_{\Lambda'}\left[\sum_{i\in[D]}\sum_{\alpha\in\{0,1\}^m}\Bigl(|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|-|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|/2\Bigr)^2\right]$$
$$=\sum_{i\in[D]}\sum_{\alpha\in\{0,1\}^m}\mathop{\mathbb{E}}_{\Lambda'}\left[\Bigl(|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|-|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|/2\Bigr)^2\right]=\sum_{i\in[D]}\sum_{\alpha\in\{0,1\}^m}\frac{|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{4}=\frac{D\cdot|\Lambda|}{4}.$$

By Markov's inequality, with probability at least $3/4$,
$$\sum_{i\in[D]}\sum_{\alpha\in\{0,1\}^m}\Bigl(|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|/2-|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|\Bigr)^2\le D\cdot|\Lambda|.\tag{7.21}$$

We set $\Lambda'$ to be a subset satisfying both (7.20) and (7.21) and consider $\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2$.

We fix $\alpha\in\{0,1\}^m$ and $i\in[D]$ and bound $\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2$ as follows:
$$\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2$$
$$\le2\left(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|/2}\right)^2+2\left(\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|/2}-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\right)^2$$
$$\le2\left(\frac{|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|-2|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|}\right)^2+2\left(|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|\cdot\Bigl(\frac{1}{|\Lambda|/2}-\frac{1}{|\Lambda'|}\Bigr)\right)^2$$
$$\le8\,\frac{\bigl(|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|/2-|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|\bigr)^2}{|\Lambda|^2}+20\,\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|^2},\tag{*}$$
where we use property (7.20) in the last step to bound the second term.

Next, we bound $\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2^2$ based on the above inequality:
$$\|\mathrm{Err}(\Lambda)-\mathrm{Err}(\Lambda')\|_2^2=\sum_{i=1}^D\left(\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-2^{-m}\bigr|-\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]-2^{-m}\bigr|\right)^2$$
$$\le\sum_{i=1}^D\left(\sum_{\alpha\in\{0,1\}^m}\bigl|\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr|\right)^2$$
$$\le\sum_{i=1}^D2^m\sum_{\alpha\in\{0,1\}^m}\bigl(\Pr[\mathrm{Ext}(\Lambda,y_i)=\alpha]-\Pr[\mathrm{Ext}(\Lambda',y_i)=\alpha]\bigr)^2\qquad\text{next apply (*)}$$
$$\le8\cdot2^m\sum_{i,\alpha}\frac{\bigl(|\{x\in\Lambda\mid\mathrm{Ext}(x,y_i)=\alpha\}|/2-|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|\bigr)^2}{|\Lambda|^2}+20\cdot2^m\sum_{i,\alpha}\frac{|\{x\in\Lambda'\mid\mathrm{Ext}(x,y_i)=\alpha\}|}{|\Lambda|^2}$$
$$\le8\cdot2^m\cdot\frac{D\cdot|\Lambda|}{|\Lambda|^2}+20\cdot2^m\cdot\frac{D\cdot|\Lambda|}{|\Lambda|^2}\le\frac{36D\cdot2^m}{|\Lambda|}.$$

Now we define our map $\pi_j:F_k\to F_j$ from $j=k$ down to $t$ by induction. The base case $\pi_k$ is the identity map. Then we define $\pi_{j-1}$ given $\pi_j$.

For $j>4\log D+m$, given $\Lambda\in\binom{\{0,1\}^n}{2^k}$, let $v=\pi_j\bigl(\mathrm{Err}(\Lambda)\bigr)$ be the corresponding vector in $F_j$. From the definition of $F_j$, there exists $\Lambda_j$ of size in $[s(j)_l,s(j)_u]$ such that $v=\mathrm{Err}(\Lambda_j)$. Let $\Lambda_{j-1}$ be the subset satisfying the guarantee of Claim 7.3.4 for $\Lambda_j$. We set $\pi_{j-1}\bigl(\mathrm{Err}(\Lambda)\bigr)=\mathrm{Err}(\Lambda_{j-1})$.

Similarly, for $j\le4\log D+m$, given $u=\mathrm{Err}(\Lambda)$ and $\pi_j(u)=\mathrm{Err}(\Lambda_j)$ for $\Lambda$ of size $2^k$, we define $\pi_{j-1}(u)=\mathrm{Err}(\Lambda_{j-1})$, where $\Lambda_{j-1}$ is the subset satisfying the guarantee of Claim 7.3.5 for $\Lambda_j$.

To finish the calculation of (7.19), we bound $|F_j|$ by
$$|F_j|\le\binom{2^n}{s(j)_l}+\cdots+\binom{2^n}{s(j)_u}\le\bigl(s(j)_u-s(j)_l+1\bigr)\cdot\binom{2^n}{s(j)_u}\le2^j\cdot2^{1.4n\cdot2^j}\le2^{2n\cdot2^j}.$$

From all the discussion above, we bound the Gaussian process in (7.19) as
$$\sum_{j=t}^{k}\sqrt{\log\bigl(|F_j|\cdot|F_{j-1}|\bigr)}\cdot\max_{v\in F_k}\bigl\|\pi_j(v)-\pi_{j-1}(v)\bigr\|_2$$
$$\le\sum_{j=4\log D+m}^{k}\sqrt{2n\cdot2^j}\cdot\bigl(10D^{1.5}\cdot2^{m-j}\bigr)+\sum_{j=t}^{4\log D+m}\sqrt{2n\cdot2^j}\cdot10\sqrt{D\cdot2^{m-j}}$$
$$\lesssim\sum_{j=4\log D+m}^{k}\sqrt n\cdot\frac{D^{1.5}\cdot2^m}{2^{j/2}}+\sum_{j=t}^{4\log D+m}\sqrt{2nD\cdot2^m}$$
$$\lesssim\sqrt n\cdot\frac{D^{1.5}\cdot2^m}{D^2\cdot2^{m/2}}+(4\log D+m)\cdot\sqrt{2nD\cdot2^m}$$
$$\lesssim(4\log D+m)\cdot\sqrt{2nD\cdot2^m}.$$

Chapter 8

Hash Functions for Multiple-Choice Schemes

We present explicit constructions of hash families that guarantee the same maximum loads as a perfectly random hash function in multiple-choice schemes. We construct our hash families based on the hash family of Celis et al. [CRSW13] for the 1-choice scheme, which is $O(\log\log n)$-wise independent over $n$ bins and "almost" $O(\log n)$-wise independent over a fraction of $\mathrm{poly}(\log n)$ bins.

We first show that our hash family guarantees a maximum load of $\frac{\log\log n}{\log d}+O(1)$ in the Uniform-Greedy scheme [ABKU99, Voc03] with $d$ choices. We use $U$ to denote the pool of balls and consider placing $m=O(n)$ balls into $n$ bins here. Without loss of generality, we always assume $|U|=\mathrm{poly}(n)$ and $d$ is a constant at least 2 in this work.

Theorem 8.0.1 (Informal version of Theorem 8.4.1). For any $m=O(n)$ and any constants $c$ and $d$, there exists a hash family with $O(\log n\log\log n)$ random bits such that given any $m$ balls in $U$, with probability at least $1-n^{-c}$, the max-load of the Uniform-Greedy scheme with $d$ independent choices of $h$ is $\frac{\log\log n}{\log d}+O(1)$.

Our hash family has an evaluation time of $O(\log\log n)^4$ in the RAM model, based on the algorithm designed by Meka et al. [MRRR14] for the hash family of Celis et al. [CRSW13].

Then we show that this hash family guarantees a maximum load of $\frac{\log\log n}{d\log\phi_d}+O(1)$ in the Always-Go-Left scheme [Voc03] with $d$ choices. The Always-Go-Left scheme [Voc03] is an asymmetric allocation scheme that partitions the $n$ bins into $d$ groups of equal size and uses an unfair tie-breaking mechanism. Its allocation process provides $d$ independent choices for each ball, one from each of the $d$ groups, and always chooses the left-most bin with the least load for each ball. We defer the formal description of the Always-Go-Left scheme to Section 8.5. Notice that the constant $\phi_d$ in the equation $\phi_d^d=1+\phi_d+\cdots+\phi_d^{d-1}$ satisfies $1.61<\phi_2<\phi_3<\phi_4<\cdots<\phi_d<2$. Compared to the Uniform-Greedy scheme, the Always-Go-Left scheme [Voc03] improves the maximum load exponentially with regard to $d$. Even for $d=2$, the Always-Go-Left scheme improves the maximum load from $\log\log n+O(1)$ to $0.7\log\log n+O(1)$.

Theorem 8.0.2 (Informal version of Theorem 8.5.3). For any $m=O(n)$ and any constants $c$ and $d$, there exists a hash family with $O(\log n\log\log n)$ random bits such that given any $m$ balls in $U$, with probability at least $1-n^{-c}$, the max-load of the Always-Go-Left scheme with $d$ independent choices of $h$ is $\frac{\log\log n}{d\log\phi_d}+O(1)$.

At the same time, given the lower bound of $\frac{\log\log n}{d\log\phi_d}-O(1)$ on the maximum load of any random $d$-choice scheme shown by Vöcking [Voc03], the maximum load achieved by our hash family is optimal for $d$-choice schemes up to the low-order constant term.

Finally, we show that our hash family guarantees the same maximum load as a perfectly random hash function in the 1-choice scheme for $m=n\cdot\mathrm{poly}(\log n)$ balls. Given $m\ge n\log n$ balls in $U$, the maximum load of the 1-choice scheme becomes $\frac mn+O\bigl(\sqrt{\frac mn\cdot\log n}\bigr)$ from the Chernoff bound. For convenience, we refer to this case of $m\ge n\log n$ balls as a heavy load. In a recent breakthrough, Gopalan, Kane, and Meka [GKM15] designed a pseudorandom generator with seed length $O(\log n(\log\log n)^2)$ that fools the Chernoff bound within polynomial error. Hence the pseudorandom generator of [GKM15] provides a hash function with $O(\log n(\log\log n)^2)$ random bits for the heavy-load case. Compared to the hash function of [GKM15], we provide a simplified construction that achieves the same maximum load but only works for $m=n\cdot\mathrm{poly}(\log n)$ balls.

Theorem 8.0.3 (Informal version of Theorem 8.6.1). For any constants $c$ and $a\ge1$, there exists a hash function generated from $O(\log n\log\log n)$ random bits such that for any $m=\log^a n\cdot n$ balls, with probability at least $1-n^{-c}$, the max-load of the $n$ bins in the 1-choice scheme with $h$ is $\frac mn+O\bigl(\sqrt{\frac mn\cdot\log n}\bigr)$.
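The heavy-load formula is easy to sanity-check by simulation. Below, a truly random hash stands in for the derandomized family, and the constant 4 in the deviation term is chosen loosely for illustration:

```python
import math
import random

random.seed(4)
n = 1 << 10
m = n * 64                       # heavy load: m = n * poly(log n)

# 1-choice scheme: each ball goes to a uniformly random bin.
load = [0] * n
for _ in range(m):
    load[random.randrange(n)] += 1

# Chernoff-style bound m/n + O(sqrt((m/n) * log n)); constant 4 is loose.
bound = m / n + 4 * math.sqrt(math.log(n) * m / n)
print(max(load), round(bound))
```

The observed maximum load sits well below the bound, reflecting the concentration that the derandomized hash family must reproduce.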

8.1 Preliminaries

We use $U$ to denote the pool of balls, $m$ the number of balls in $U$, and $n$ the number of bins. We assume $m\ge n$ and that $n$ is a power of 2 in this work. We use $\mathbb{F}_p$ to denote the Galois field of size $p$ for a prime power $p$.

Definition 8.1.1. Given a prime power $p$, a distribution $D$ on $\mathbb{F}_p^n$ is a $\delta$-biased space if for any non-trivial character function $\chi_\alpha$ of $\mathbb{F}_p^n$, $\bigl|\mathop{\mathbb{E}}_{x\sim D}[\chi_\alpha(x)]\bigr|\le\delta$.

A distribution $D$ on $\mathbb{F}_p^n$ is a $k$-wise $\delta$-biased space if for any non-trivial character function $\chi_\alpha$ of $\mathbb{F}_p^n$ with support size at most $k$, $\bigl|\mathop{\mathbb{E}}_{x\sim D}[\chi_\alpha(x)]\bigr|\le\delta$.

The seminal works [NN90, AGHP90] provide small-biased spaces with optimal seed length.

Lemma 8.1.2 ([NN90, AGHP90]). For any prime power $p$ and integer $n$, there exist explicit constructions of $\delta$-biased spaces on $\mathbb{F}_p^n$ with seed length $O(\log\frac{pn}{\delta})$ and explicit constructions of $k$-wise $\delta$-biased spaces with seed length $O(\log\frac{kp\log n}{\delta})$.

n Given two distributions D1 and D2 with the same support Fp , we define the statistical P distance to be kD − D k = n |D (x) − D (x)|. Vazirani [Vaz86] proved that small- 1 2 1 x∈Fp 1 2 biased spaces are close to the uniform distribution.

n n/2 Lemma 8.1.3 ([Vaz86]). A δ-biased space on Fp is δ · p close to the uniform distribution in statistical distance.

107 n k/2 Given a subset S of size k in [n], a k-wise δ-biased space on Fp is δ · p close to the uniform distribution on S in statistical distance.

Given a distribution $D$ on functions from $U$ to $[n]$, $D$ is $k$-wise independent if for any $k$ elements $x_1,\dots,x_k$ in $U$, $\bigl(D(x_1),\dots,D(x_k)\bigr)$ is a uniform distribution on $[n]^k$. For small-biased spaces, we choose $p=n$ and the space to be $\mathbb{F}_n^{|U|}$ in Lemma 8.1.2 and summarize the discussion above.

Lemma 8.1.4. Given $k$ and $n$, a $k$-wise $\delta$-biased space from $U$ to $[n]$ is $\delta\cdot n^{k/2}$-close to the uniform distribution from $U$ to $[n]$ on any $k$ balls, and it needs $O(\log\frac{kn\log n}{\delta})$ random bits.

Remark 8.1.5. In this work, we always choose $\delta\le1/n$ and $k=\mathrm{poly}(\log n)$ in the small-biased spaces, so that the seed length is $O(\log\frac1\delta)$. At the same time, we only use $k$-wise small-biased spaces rather than small-biased spaces to improve the evaluation time from $O(\log n)$ to $O(\log\log n)^4$.
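As a point of comparison, the textbook $k$-wise independent family — a random degree-$(k-1)$ polynomial over a prime field, not the [CRSW13] family used in this chapter — can be verified exhaustively over a small field. For $k=2$, every pair of outputs on two fixed keys appears equally often across the family:

```python
from itertools import product
from collections import Counter

p, k = 5, 2                      # field size and independence parameter

def poly_hash(coeffs, x):
    # h(x) = c_0 + c_1*x + ... + c_{k-1}*x^{k-1}  (mod p)
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

# Enumerate all p^k members of the family; for any two distinct keys,
# (h(x1), h(x2)) should be uniform over [p] x [p].
x1, x2 = 1, 3
counts = Counter()
for coeffs in product(range(p), repeat=k):
    counts[(poly_hash(coeffs, x1), poly_hash(coeffs, x2))] += 1

print(len(counts), set(counts.values()))   # 25 {1}: every pair exactly once
```

This family needs $k\log p$ random bits, which is why the chapter instead combines $O(\log\log n)$-wise independence with small-biased spaces to stay within $O(\log n\log\log n)$ bits.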

We state the Chernoff bound for $k$-wise independence by Schmidt et al. [SSS95].

Lemma 8.1.6 (Theorem 5 (I) (b) in [SSS95]). If $X$ is the sum of $k$-wise independent random variables, each of which is confined to the interval $[0,1]$, with $\mu=\mathbb{E}[X]$, then for $\delta\le1$ and $k\ge\delta^2\mu\cdot e^{-1/3}$,
$$\Pr\bigl[|X-\mu|\ge\delta\mu\bigr]\le e^{-\delta^2\mu/3}.$$

8.2 Witness Trees

We first provide some notation and definitions used in this work. Then we review the witness tree argument of Vöcking [Voc03] for the Uniform-Greedy scheme.

Definition 8.2.1 (Uniform-Greedy with $d$ choices). The process inserts balls in an arbitrary fixed order. Let $h^{(1)},\dots,h^{(d)}$ be $d$ hash functions from $U$ to $[n]$. The allocation process works as follows: for each ball $i$, the algorithm considers the $d$ bins $\{h^{(1)}(i),\dots,h^{(d)}(i)\}$ and puts ball $i$ into the bin with the least load among $\{h^{(1)}(i),\dots,h^{(d)}(i)\}$. When there are several bins with the least load, it picks an arbitrary one.

We define the height of a ball to be its height in the bin it is allocated to in the above process.
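A direct simulation with fully random hash functions — a stand-in for the $O(\log n\log\log n)$-bit family constructed in this chapter — illustrates the $\frac{\log\log n}{\log d}$ behavior that Theorem 8.0.1 matches:

```python
import random

random.seed(2)

def uniform_greedy(m, n, d):
    """Place m balls into n bins; each ball picks the least-loaded of
    d uniformly random bins. Returns the maximum load."""
    load = [0] * n
    for _ in range(m):
        choices = [random.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: load[b])
        load[best] += 1
    return max(load)

n = 1 << 14
one = uniform_greedy(n, n, 1)    # ~ log n / log log n for one choice
two = uniform_greedy(n, n, 2)    # ~ log log n / log 2 + O(1) for two
print(one, two)
```

Even at this modest $n$, the second choice collapses the maximum load from roughly $\frac{\log n}{\log\log n}$ down to roughly $\log\log n$ — the "power of two choices" that the derandomized family must preserve.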

Next we follow the notation of Vöcking [Voc03] to define witness trees and pruned witness trees. Given the balls and d hash functions h^{(1)}, …, h^{(d)} in the allocation process, we construct a symmetric witness tree for each ball in this process.

Definition 8.2.2 (Symmetric witness trees). A symmetric witness tree T with height l for a ball b is a complete d-ary tree of height l. Every node w in this tree corresponds to a ball T(w), and the root corresponds to the ball b. A ball u in T has a ball v as its ith child iff, when we allocate u in the process, ball v is the top ball in the bin h^{(i)}(u). Hence v < u and the bin h^{(i)}(u) is in the subset {h^{(1)}(v), …, h^{(d)}(v)} of [n] when v is the ith child of u.

Next we trim the repeated nodes in a witness tree so that no duplicate edge remains after the trimming.

Definition 8.2.3 (Pruned witness trees and collisions). Given a witness tree T where nodes v_1, …, v_j in T correspond to the same ball, let v_1 be the node among them in the bottommost level of T. Consider the following process: first remove v_2, …, v_j and their subtrees; then redirect the edges of v_2, …, v_j from their parents to v_1 and call these edges collisions. Given a symmetric witness tree T, we call the new tree without repeated nodes obtained by the above process the pruned witness tree of T.

We call different witness trees with the same structure but different balls a configuration. For example, the configuration of symmetric witness trees with distinct nodes is a full d-ary tree without any collision.


Figure 8.1: A witness tree with distinct balls and a pruned witness tree with 3 collisions

Next we define the height and size of pruned witness trees.

Definition 8.2.4 (Height of witness trees). Given any witness tree T, let the height of T be the length of the shortest path from the root of T to its leaves. Because height(u) = min_{v ∈ children(u)} height(v) + 1, the height of the pruned witness tree equals the height of the original witness tree. Given a ball b of height h and any h′ < h, we always consider the pruned witness tree of b with height h′ whose leaves have height h − h′.

At the same time, let |T | denote the number of vertices in T for any witness tree T and |C| denote the number of nodes in a configuration C.

Finally we review the argument of Vöcking [Voc03] for m = n balls. One difference between this proof and Vöcking's [Voc03] proof is an alternate argument for the case of witness trees with many collisions.

Lemma 8.2.5 ([Voc03]). For any constants c ≥ 2 and d, with probability at least 1 − n^{−c}, the max-load of the Uniform-Greedy scheme with d independent choices from perfectly random hash functions is (log log n)/(log d) + O(1).

Proof. We fix a parameter l = ⌈log_d((2 + 2c) log n)⌉ + 3c + 5 for the height of witness trees. In this proof, we bound the probability that any symmetric witness tree of height l with leaves of height at least 4 exists under perfectly random hash functions. From the definition of witness trees, this also bounds the probability of a ball with height l + 4 in the d-choice Uniform-Greedy scheme.

For symmetric witness trees of height l, it is sufficient to bound the probability that their pruned counterparts appear in perfectly random hash functions. We separate all pruned witness trees into two cases according to the number of edge collisions: pruned witness trees with at most 3c collisions and pruned witness trees with at least 3c collisions.

Pruned witness trees with at most 3c collisions. Let us fix a configuration C with at most 3c collisions and consider the probability that any pruned witness tree with configuration C appears under perfectly random hash functions. Because each node of this configuration C corresponds to a distinct ball, there are at most n^{|C|} possible ways to instantiate balls into C.

Next, we fix one possible pruned witness tree T and bound the probability of the appearance of T in h^{(1)}, …, h^{(d)}. We consider the probability of two events: every edge (u, v) in the tree T appears during the allocation process, and every leaf of T has height at least 4. For the first event, an edge (u, v) holds during the process only if the hash functions satisfy

h^{(i)}(u) ∈ {h^{(1)}(v), …, h^{(d)}(v)}, which happens with probability at most d/n. (8.1)

Secondly, the probability that a fixed leaf ball has height at least 4 is at most 3^{−d}. A leaf ball of height 4 requires that each bin among its choices has height at least 3. Because at most n/3 bins contain at least 3 balls at any moment, the probability that a random bin has height at least 3 is at most 1/3. Thus the probability that all d random bins have height at least 3 is at most 3^{−d}.

We apply a union bound on the probability that any witness tree with the configuration C appears under perfectly random hash functions:

n^{|C|} · ∏_{(u,v)∈C} (d/n) · (3^{−d})^{number of leaves}. (8.2)

We lower bound the number of edges in C by |C| − 1 because C is connected. Next we lower bound the number of leaves. Because C is a d-ary tree with at most 3c collisions, the number of leaves is at least (|C| − 3c)/2. At the same time, C is trimmed from the d-ary symmetric witness tree of height l. Thus |C| ≥ 1 + d + ··· + d^{l−3c}. From the discussion above, we bound (8.2) by

n^{|C|} · (d/n)^{|C|−1} · (3^{−d})^{(|C|−3c)/2} ≤ n · (d^{2.5} · 3^{−d})^{|C|/2.5} ≤ n · (d^{2.5} · 3^{−d})^{d^{l−3c}/2.5} ≤ n · (0.8)^{10(2+2c) log n} ≤ n^{−2c−1}.

Finally, we apply a union bound on all possible configurations with at most 3c collisions: the number of configurations is at most ∑_{i=0}^{3c} (d^{l+1})^{2i} ≤ n, so the probability that any witness tree with height l and at most 3c collisions exists is at most n^{−c}.

Pruned witness trees with at least 3c collisions. We use the extra 3c collisions with equation (8.1) instead of the number of leaves in this case.

Given any configuration C with at least 3c collisions, we consider the first 3c collisions e_1, …, e_{3c} in the BFS of C. Let C′ be the induced subgraph of C that only contains the nodes in e_1, …, e_{3c} and their ancestors in C. At the same time, the size |C′| ≤ 3c(2l + 1) and the number of edges in C′ is |C′| + 3c − 1.

Because any pruned witness tree of configuration C exists only if its corresponding counterpart of configuration C′ exists under perfectly random hash functions, it suffices to bound the probability of the latter event. There are at most n^{|C′|} instantiations of balls in C′. For each instantiation, we bound the probability that all edges survive by (8.1):

(d/n)^{number of edges} = (d/n)^{|C′|+3c−1}.


Figure 8.2: An example of extracting C′ from C given two collisions.

We bound the probability that any pruned witness tree of configuration C′ survives under perfectly random hash functions by

(d/n)^{|C′|+3c−1} · n^{|C′|} ≤ (1/n)^{3c−1} · d^{(2l+2)·3c} ≤ n^{−2c}.

Finally, we apply a union bound over all possible configurations C′: there are at most (1 + d + ··· + d^l)^{2·3c} ≤ n configurations with 3c collisions.

Remark 8.2.6. Because the sizes of all witness trees are bounded by d^{l+1} = O(log n), O_{c,d}(log n)-wise independent hash functions could adopt the above argument to prove a max-load of log_d log n + O(d + c).

8.3 Hash Functions

We construct our hash family and show its properties for the derandomization of d-choice schemes in this section. We sketch the derandomization of Vöcking's argument (Lemma 8.2.5) in Section 8.3.1.

Let ◦ denote the concatenation operation and ⊕ denote the bit-wise XOR operation.

Construction 8.3.1. Given δ_1 > 0, δ_2 > 0, and two integers k, k_g, let

1. h_i : U → [n^{2^{−i}}] denote a function generated by an O(log n)-wise δ_1-biased space for each i ∈ [k],

2. h_{k+1} : U → [n^{2^{−k}}] denote a function generated by an O(log n)-wise δ_2-biased space, such that (h_1(x) ◦ h_2(x) ◦ ··· ◦ h_k(x) ◦ h_{k+1}(x)) is a function from U to [n],

3. g : U → [n] denote a function from a k_g-wise independent family from U to [n].

We define a random function h : U → [n] in our hash family H with parameters δ_1, δ_2, k, and k_g to be

h(x) = (h_1(x) ◦ h_2(x) ◦ ··· ◦ h_k(x) ◦ h_{k+1}(x)) ⊕ g(x).

Hence the seed length of our hash family is O(k · log (n log² n · log |U| / δ_1) + log (n log² n · log |U| / δ_2) + k_g log n). We always choose k ≤ log log n, k_g = O(log log n), δ_1 = 1/poly(n), and δ_2 = (log n)^{−O(log n)} such that the seed length is O(log n log log n).
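The shape of the construction can be sketched as follows. This is an illustrative mock-up only: truly random lookup tables stand in for the O(log n)-wise small-biased components h_1, …, h_{k+1} and for the k_g-wise independent g, so only the bit-splitting, the concatenation ◦, and the final ⊕ are faithful to the construction; every name and parameter below is ours.

```python
import random

def make_hash(universe_size, n, k, seed=0):
    """Sketch of h(x) = (h_1(x) . ... . h_{k+1}(x)) XOR g(x).

    h_1 . ... . h_i narrows the range to roughly n^{1 - 2^{-i}} bins, so
    each h_i contributes (roughly) half of the remaining output bits, and
    h_{k+1} supplies the last few bits; g masks the whole output."""
    rng = random.Random(seed)
    bits = n.bit_length() - 1            # assume n is a power of two
    shares, remaining = [], bits
    for _ in range(k):
        share = (remaining + 1) // 2     # h_i fixes ~half the remaining bits
        shares.append(share)
        remaining -= share
    shares.append(remaining)             # h_{k+1} fixes the last bits
    # Random tables as stand-ins for the small-biased / k_g-wise components.
    tables = [[rng.randrange(1 << s) for _ in range(universe_size)]
              for s in shares]
    g = [rng.randrange(n) for _ in range(universe_size)]

    def h(x):
        prefix = 0
        for share, table in zip(shares, tables):
            prefix = (prefix << share) | table[x]   # concatenation
        return prefix ^ g[x]                        # XOR with g
    return h

h = make_hash(universe_size=1 << 12, n=1 << 10, k=3)
values = [h(x) for x in range(1 << 12)]
```

The point of the sketch is the structure: the prefix h_1 ◦ ··· ◦ h_k determines a coarse bin of [n/log³ n], the suffix h_{k+1} the fine bin, and g rerandomizes the whole output.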

Remark 8.3.2. Our parameters of h_1 ◦ ··· ◦ h_{k+1} are stronger than the parameters in [CRSW13]. While the last function h_{k+1} of [CRSW13] is still a δ_1-biased space, we use δ_2 = (δ_1)^{O(k)} in h_{k+1} to provide almost O(log n)-wise independence on (log n)^{O(log n)} subsets of size O(log n) for our calculations.

Properties of h. We state the properties of h that will be used in the derandomization.

Because of the kg-wise independence in g and the ⊕ operation, we have the same property for h.

Property 8.3.3. h is kg-wise independent.

Then we fix g and discuss h_1 ◦ ··· ◦ h_k ◦ h_{k+1}. For each i ∈ [k], it is natural to think of h_1 ◦ ··· ◦ h_i as a function from U to [n^{1−2^{−i}}], i.e., a hash function that maps all balls into n^{1−2^{−i}} bins. Celis et al. [CRSW13] showed that for every i ∈ [k], the number of balls in every bin of h_1 ◦ ··· ◦ h_i is close to its expectation (m/n) · n^{2^{−i}} in 1/poly(n)-biased spaces.

Lemma 8.3.4 ([CRSW13]). Given k = log_2(log n/(3 log log n)) and β = (log n)^{−0.2}, for any constant c > 0, there exists δ_1 = 1/poly(n) such that given m = O(n) balls, with probability at least 1 − n^{−c}, for all i ∈ [k], every bin in [n^{1−2^{−i}}] contains at most (1 + β)^i · (m/n) · n^{2^{−i}} balls under h_1 ◦ ··· ◦ h_i.

For completeness, we provide a proof of Lemma 8.3.4 in Appendix C. In this work, we use the following version: after fixing g in Construction 8.3.1, h_1 ◦ h_2 ◦ ··· ◦ h_k still allocates the balls evenly.

Corollary 8.3.5. For any constant c > 0, there exists δ_1 = 1/poly(n) such that given m = O(n) balls and any function g_0 : U → [n/log³ n], with probability at least 1 − n^{−c} over h_1, …, h_k, any bin j ∈ [n^{1−2^{−k}}] = [n/log³ n] contains at most 1.01 · log³ n · (m/n) balls under the hash function (h_1(x) ◦ ··· ◦ h_k(x)) ⊕ g_0(x).

Next we discuss the last function h_{k+1} generated from a δ_2-biased space on [log³ n]^U. For a subset S ⊆ U, let h(S) denote the distribution of a random function h on S and U_{[log³ n]}(S) denote the uniform distribution over all maps from S to [log³ n]. From Lemma 8.1.3 and the union bound, we have the following claim.

Claim 8.3.6. Given δ_2 = (log n)^{−C log n}, for a fixed subset S of size (C/3) · log n, h_{k+1}(S) is (log n)^{−(C/2)·log n}-close to the uniform distribution on S, i.e., ‖h_{k+1}(S) − U_{[log³ n]}(S)‖_1 ≤ (log n)^{−(C/2)·log n}.

Then for m = (log n)^{(C/3)·log n} subsets S_1, …, S_m of size (C/3) · log n, we have

∑_{i∈[m]} ‖h_{k+1}(S_i) − U_{[log³ n]}(S_i)‖_1 ≤ m · (log n)^{−(C/2)·log n} ≤ (log n)^{−(C/6)·log n}.

In other words, h_{k+1} is close to the uniform distribution on (log n)^{(C/3) log n} subsets of size (C/3) log n. However, h_{k+1} (or h) is not close to a log n-wise independent distribution on n balls.

Remark 8.3.7 (Evaluation time). Our hash function has an evaluation time of O((log log n)^4) in the RAM model. Because we use (log n)^{−O(log n)}-biased spaces in h_{k+1}, we lose a factor of O((log log n)²) compared to the hash family of [CRSW13]. The reason is as follows.

g can be evaluated by a degree-O(log log n) polynomial over the Galois field of size poly(n), which takes O(log log n) time. The first k hash functions h_1, …, h_k use 1/poly(n)-biased spaces, which have total evaluation time O(k · log log n) = O((log log n)²) in the RAM model from [MRRR14].

The last function h_{k+1} is an O(log n)-wise n^{−O(log log n)}-biased space from U to [log³ n], whose seed needs O(log log n) words in the RAM model. Thus the evaluation time becomes O(log log n) times the cost of a quadratic operation in a Galois field of size n^{O(log log n)}, which is O((log log n)^4).

8.3.1 Proof Overview

We sketch the derandomization of Lemma 8.2.5 in this section. Similar to the proof of Lemma 8.2.5, we bound the probability that any pruned witness tree of height l = log_d log n + O(1) exists in h^{(1)}, …, h^{(d)}, where each h^{(i)}(x) = (h_1^{(i)}(x) ◦ ··· ◦ h_{k+1}^{(i)}(x)) ⊕ g^{(i)}(x). We use the property of h_1 ◦ ··· ◦ h_{k+1} to derandomize the case of pruned witness trees with at most 3c collisions and the property of g to derandomize the other case.

Pruned witness trees with at most 3c collisions. We show how to derandomize the union bound (8.2) for a fixed configuration C with at most 3c collisions. There are two probabilities in (8.2): the second term ∏_{(u,v)∈C} (d/n) over all edges in C and the last term 3^{−d · number of leaves} over all leaves. We focus on the second term ∏_{(u,v)∈C} (d/n) in this discussion, because it contributes a smaller probability. Since |C| ∈ [d^{l−3c}, d^{l+1}] = Θ(log n), it would need O(log n)-wise independence over [n] bins for every possible witness tree in (8.2), which is impossible to support with o(log² n) bits [Sti94].

We omit {g^{(1)}, …, g^{(d)}} and focus on the other part (h_1^{(i)} ◦ ··· ◦ h_k^{(i)} ◦ h_{k+1}^{(i)} | i ∈ [d]) in this case. Our strategy is to first fix the prefixes in the d hash functions, (h_1^{(i)} ◦ ··· ◦ h_k^{(i)} | i ∈ [d]), and then recalculate (8.2) using the suffixes h_{k+1}^{(1)}, …, h_{k+1}^{(d)}. Let T be a possible witness tree in the configuration C. For an edge (u, v) in T to satisfy (8.1), the prefixes of h^{(1)}(v), …, h^{(d)}(v) and h^{(i)}(u) must satisfy

h_1^{(i)}(u) ◦ ··· ◦ h_k^{(i)}(u) ∈ {h_1^{(1)}(v) ◦ ··· ◦ h_k^{(1)}(v), …, h_1^{(d)}(v) ◦ ··· ◦ h_k^{(d)}(v)}. (8.3)

After fixing the prefixes, let F_T denote the subset of possible witness trees in the configuration C that satisfy the prefix condition (8.3) for every edge. Because each bin of [n/log³ n] receives at most 1.01 log³ n balls from every prefix function h_1^{(j)} ◦ h_2^{(j)} ◦ ··· ◦ h_k^{(j)} by Corollary 8.3.5, we could bound

|F_T| ≤ n · (d · 1.01 log³ n)^{|C|−1} ≤ n · (1.01d)^{|C|} · (log³ n)^{|C|−1} = (log n)^{O(log n)}

instead of n^{|C|} in the original argument.

Now we consider all possible witness trees in F_T under the suffixes h_{k+1}^{(1)}, …, h_{k+1}^{(d)}. We could treat h_{k+1}^{(1)}, …, h_{k+1}^{(d)} as O(log n)-wise independent functions for all possible witness trees in F_T by Claim 8.3.6, because |C| = O(log n) and |F_T| = (log n)^{O(log n)}. In the next step, we use O(log n)-wise independence to rewrite (8.2) and finish the proof of this case.

Pruned witness trees with at least 3c collisions. In our alternate argument for this case in Lemma 8.2.5, the subconfiguration C′ of C has at most 3c · (2l + 1) nodes and 3c · (2l + 1) + 3c edges. Since l = log_d log n + O(1), the number of edges in C′ is O(log log n).

By choosing kg = Θ(log log n) with a sufficiently large constant, h with kg-wise independence supports the argument in Lemma 8.2.5.

8.4 The Uniform-Greedy Scheme

We prove our main result for the Uniform-Greedy scheme, Theorem 8.0.1, in this section.

Theorem 8.4.1. For any m = O(n), any constant c ≥ 2, and integer d, there exists a hash family H from Construction 8.3.1 with O(log n log log n) random bits that guarantees that the max-load of the Uniform-Greedy scheme with d independent choices from H is log_d log n + O(c + m/n) with probability at least 1 − n^{−c} for any m balls in U.

Proof. We specify the parameters of H as follows: k_g = 10c(log_d log m + log_d(2 + 2c) + 5 + 3c), k = log_2 (log n/(3 log log n)), δ_2 = (log n)^{−C log n} for a large constant C, and δ_1 = 1/poly(n) such that Corollary 8.3.5 holds with probability at least 1 − n^{−c−1}. Let h^{(1)}, …, h^{(d)} denote the d independent hash functions from H with the above parameters, where each

h^{(j)}(x) = (h_1^{(j)}(x) ◦ h_2^{(j)}(x) ◦ ··· ◦ h_k^{(j)}(x) ◦ h_{k+1}^{(j)}(x)) ⊕ g^{(j)}(x).

We use the notation g to denote {g^{(1)}, g^{(2)}, …, g^{(d)}} in the d choices and h_i to denote the group of hash functions {h_i^{(1)}, …, h_i^{(d)}} in this proof.

We bound the probability that any symmetric witness tree of height l = ⌈log_d log m + log_d(2 + 2c) + 5 + 3c⌉ with leaves of height at least b = 10d · (m/n) + 1 exists in h^{(1)}, …, h^{(d)}. Similar to the proof of Lemma 8.2.5, we bound the probability of pruned witness trees of height l in h^{(1)}, …, h^{(d)}. We separate all pruned witness trees into two cases according to the number of edge collisions: pruned witness trees with at most 3c collisions and pruned witness trees with at least 3c collisions.

Pruned witness trees with at least 3c collisions. We start with a configuration C of pruned witness trees with height l and at least 3c collisions. Let e_1, …, e_{3c} be the first 3c collisions in the BFS of C. Let C′ be the induced subgraph of C that only contains the nodes in these edges e_1, …, e_{3c} and their ancestors in C. Therefore any pruned witness tree T of configuration C exists in h^{(1)}, …, h^{(d)} only if the corresponding counterpart T′ of T with configuration C′ exists in h^{(1)}, …, h^{(d)}. The existence of T′ in h^{(1)}, …, h^{(d)} indicates that for every edge (u, v) in T′, h^{(1)}, …, h^{(d)} satisfy

h^{(i)}(T(u)) ∈ {h^{(1)}(T(v)), …, h^{(d)}(T(v))} when v is the ith child of u. (8.4)

Notice that the number of edges in C′ and T′ is at most 3c · 2l + 3c = 3c(2l + 1) ≤ k_g/2.

Because h^{(1)}, …, h^{(d)} are k_g-wise independent, we bound the probability that all edges of T′ satisfy (8.4) in h^{(1)}, …, h^{(d)} by

∏_{(u,v)∈T′} (d/n) = (d/n)^{|C′|+3c−1}.

Now we apply a union bound over all choices of balls in C′. There are at most m^{|C′|} choices of balls for the nodes of C′. Therefore we bound the probability that any witness tree with at least 3c collisions survives under k_g-wise independent functions by

(d/n)^{|C′|+3c−1} · m^{|C′|} ≤ (d/n)^{3c−1} · ((m/n) · d)^{|C′|} ≤ (d/n)^{3c−1} · ((m/n) · d)^{3c·(2l+1)} ≤ n^{−2c}.

Next we apply a union bound over all configurations C′. Because there are at most (1 + d + ··· + d^l)^{2·3c} ≤ n configurations with 3c collisions, with probability at least 1 − n^{−c}, no pruned witness tree with at least 3c collisions and height l exists in h^{(1)}, …, h^{(d)}.

Pruned witness trees with at most 3c collisions. We fix a configuration C of pruned witness trees with height l and fewer than 3c collisions. Next we bound the probability that any pruned witness tree in this configuration C with leaves of height at least b exists in h^{(1)}, …, h^{(d)}.

We extensively use the fact that after fixing g and h_1 ◦ ··· ◦ h_k, at most d · (1.01 log³ n · m/n) elements in h^{(1)}, …, h^{(d)} are mapped to any bin of [n/log³ n] by Corollary 8.3.5. Another property is the number of leaves in C: because there are at most 3c collisions in C, C has at least d^{l−3c} ∈ [d^5(2 + 2c) log m, d^6(2 + 2c) log m] leaves. On the other hand, the number of leaves is at least (|C| − 3c)/2. For a pruned witness tree T with configuration C, T exists in h^{(1)}, …, h^{(d)} only if

∀(u, v) ∈ C, h^{(i)}(T(u)) ∈ {h^{(1)}(T(v)), …, h^{(d)}(T(v))} when v is the ith child of u. (8.5)

We restate the above condition on the prefixes and suffixes of h^{(1)}, …, h^{(d)} separately. Let g_p(x) denote the first log n − 3 log log n bits of g(x) and g_s(x) denote the last 3 log log n bits of g(x), which match h_1(x) ◦ ··· ◦ h_k(x) and h_{k+1}(x) separately. Since h^{(i)}(x) = (h_1^{(i)}(x) ◦ ··· ◦ h_{k+1}^{(i)}(x)) ⊕ g^{(i)}(x), property (8.5) indicates that the prefixes of the balls b_u = T(u) and b_v = T(v) satisfy

(h_1^{(i)}(b_u) ◦ ··· ◦ h_k^{(i)}(b_u)) ⊕ g_p^{(i)}(b_u) ∈ {(h_1^{(j)}(b_v) ◦ ··· ◦ h_k^{(j)}(b_v)) ⊕ g_p^{(j)}(b_v)}_{j∈[d]} (8.6)

and their suffixes satisfy

h_{k+1}^{(i)}(b_u) ⊕ g_s^{(i)}(b_u) ∈ {h_{k+1}^{(1)}(b_v) ⊕ g_s^{(1)}(b_v), …, h_{k+1}^{(d)}(b_v) ⊕ g_s^{(d)}(b_v)}. (8.7)

Let F_T be the subset of witness trees in the configuration C whose edges satisfy the prefix condition (8.6), i.e.,

F_T = {T | configuration(T) = C and (u, v) satisfies (8.6) for every edge (u, v) ∈ T}.

We show that

|F_T| ≤ m · (d · 1.01 log³ n · m/n)^{|C|−1}.

The reason is as follows. There are m choices of balls for the root u in C. For the ith child v of the root u, we have to satisfy condition (8.6) for (u, v). For a fixed bin (h_1^{(i)}(b_u) ◦ ··· ◦ h_k^{(i)}(b_u)) ⊕ g_p^{(i)}(b_u), there are at most 1.01 · log³ n · m/n elements from each hash function h^{(j)} mapped to this bin by Corollary 8.3.5. Hence there are at most d · 1.01 log³ n · m/n choices for each child of u. Then we repeat this argument for all non-leaf nodes in C.

Next we consider the suffixes h_{k+1}^{(1)}, …, h_{k+1}^{(d)}. We first calculate the probability that any possible witness tree in F_T survives in h_{k+1}^{(1)}, …, h_{k+1}^{(d)} under t-wise independence for t = 5b · d^{l+2} = O(log n). After fixing g_s, for a possible witness tree T in F_T, h_{k+1}^{(1)}, …, h_{k+1}^{(d)} satisfy (8.7) for every edge (u, v) ∈ C with probability d/log³ n in t/2-wise independent distributions, because the number of edges in C is less than t/2.

For each leaf v in T, we bound the probability that its height is at least b = 10d · (m/n) + 1 by 2^{−3d²} · (n/m)^{2d} under (b·d + 1)-wise independence. Given a choice i ∈ [d] of leaf v, we fix the bin to be h^{(i)}(v). Then we bound the probability that there are at least b − 1 balls w_1, …, w_{b−1} in this bin, excluding all balls in the tree, by

∑_{w_1 < w_2 < ··· < w_{b−1}} ∑_{j_1, …, j_{b−1} ∈ [d]} Pr[h^{(i)}(v) = h^{(j_1)}(w_1) = ··· = h^{(j_{b−1})}(w_{b−1})] ≤ (3/4)^{b−1}.

For all d choices of this leaf v, this probability is at most (3/4)^{(b−1)·d} ≤ 2^{−3d²} · (n/m)^{2d}.

Because w_1, …, w_{b−1} are not in the tree T for every leaf, they are disjoint from, and independent of, the events of (8.7) in T, which are over all edges in the tree. Hence we could multiply these two probabilities together under t-wise independence given t/2 ≥ (b·d + 1) · number of leaves. Then we apply a union bound over all possible pruned witness trees in F_T to bound the probability (under t-wise independence) that there is a witness tree of height l whose leaves have height at least 10d · (m/n) + 1 by

|F_T| · (d/log³ n)^{|C|−1} · ((3/4)^{b·d})^{number of leaves}
≤ m · (1.01d · log³ n · (m/n) · (d/log³ n))^{|C|} · (2^{−3d²} · (n/m)^{2d})^{(|C|−3c)/2}
≤ m · (2d² · (m/n) · 2^{−3d²} · (n/m)^{2d})^{|C|/3}
≤ m · 2^{−|C|/3} ≤ n^{−c−1}.

Finally we replace the t-wise independence by a δ_2-biased space for δ_2 = n^{−c−1} · (log³ n)^{−t}/|F_T| = (log n)^{−O(log n)}. We apply Claim 8.3.6 to all possible pruned witness trees in F_T: in δ_2-biased spaces, the probability of the existence of any height-l witness tree with leaves of height at least b = 10d · (m/n) + 1 is at most

n^{−c−1} + |F_T| · δ_2 · (log³ n)^t ≤ 2n^{−c−1}.

Then we apply a union bound on all possible configurations with at most 3c collisions:

(d^{l+1})^{2·3c} · 2n^{−c−1} ≤ 0.5n^{−c}.

From all discussion above, with probability at least 1 − n−c, there is no ball of height more than l + b = logd log n + O(1).

8.5 The Always-Go-Left Scheme

We show that the hash family in Section 8.3 with proper parameters also achieves a max-load of log log n/(d log φ_d) + O(1) in the Always-Go-Left scheme [Voc03] with d choices, where φ_d > 1 is the constant satisfying φ_d^d = 1 + φ_d + ··· + φ_d^{d−1}. We define the Always-Go-Left scheme [Voc03] as follows:

Definition 8.5.1 (Always-Go-Left with d choices). Our algorithm partitions the bins into d groups G_1, …, G_d of the same size n/d. Let h^{(1)}, …, h^{(d)} be d functions from U to G_1, …, G_d separately. For each ball b, the algorithm considers the d bins {h^{(1)}(b) ∈ G_1, …, h^{(d)}(b) ∈ G_d} and chooses the bin with the least number of balls. If there are several bins with the least number of balls, the algorithm always chooses the bin with the smallest group index.
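For intuition, this tie-breaking rule can be simulated directly. The sketch below is our own illustration: uniformly random choices stand in for the hash functions h^{(1)}, …, h^{(d)}, and the load threshold in the final comparison is an empirical observation, not a proved constant.

```python
import random

def always_go_left(m, n, d, seed=0):
    """d groups of n//d bins each; every ball draws one random bin per
    group and goes to the least-loaded choice, breaking ties toward the
    smallest group index (the Always-Go-Left rule)."""
    rng = random.Random(seed)
    group_size = n // d
    load = [[0] * group_size for _ in range(d)]
    for _ in range(m):
        choices = [(j, rng.randrange(group_size)) for j in range(d)]
        # min over (load, group index): ties go to the leftmost group
        j, b = min(choices, key=lambda jb: (load[jb[0]][jb[1]], jb[0]))
        load[j][b] += 1
    return max(max(group) for group in load)

max_agl = always_go_left(10000, 10000, 2)
# Expected max-load is log log n/(d log phi_d) + O(1), i.e. tiny here.
```

For n = 10000 and d = 2 the observed max-load is typically 3 or 4, slightly below the Uniform-Greedy value, matching the asymmetry advantage discussed below.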

We define asymmetric witness trees for the Always-Go-Left mechanism such that a ball of height l + C in the Always-Go-Left scheme indicates that there is an asymmetric witness tree of height l whose leaves have height at least C. For an asymmetric witness tree T , the height of T is still the shortest distance from the root to its leaves.

Definition 8.5.2 (Asymmetric witness tree). The asymmetric witness tree T of height l in group G_i is a d-ary tree. The root has d children, where the subtree of the jth child is an asymmetric witness tree in group G_j of height l − 1_{j≥i}.

Given d functions h^{(1)}, …, h^{(d)} from U to G_1, …, G_d separately, a ball b with height more than l + C in a bin of group G_i indicates an asymmetric witness tree T of height l in G_i whose leaves have height at least C. Each node of T corresponds to a ball, and the root of T corresponds to the ball b. A ball u in T has a ball v as its jth child iff, when we insert the ball u in the Always-Go-Left mechanism, v is the top ball in the bin h^{(j)}(u). Hence v < u and h^{(j)}(u) = h^{(j)}(v) when the jth child of u is v.

For an asymmetric witness tree T of height l in group G_i, we use the height l and the group index i ∈ [d] to determine its size. Let f(l, i) be the size of a full asymmetric witness tree of height l in group G_i. From the definition, we have f(0, i) = 1 and

f(l, i) = ∑_{j=1}^{i−1} f(l, j) + ∑_{j=i}^{d} f(l − 1, j).

Let g((l − 1) · d + i) = f(l, i) such that

g(n) = g(n − 1) + g(n − 2) + ··· + g(n − d).

We know there exist c_0 > 0, c_1 = O(1), and φ_d > 1 satisfying

φ_d^d = 1 + φ_d + ··· + φ_d^{d−1} such that g(n) ∈ [c_0 · φ_d^n, c_1 · φ_d^n].

Hence

f(l, i) = g((l − 1) · d + i) ∈ [c_0 · φ_d^{(l−1)d+i}, c_1 · φ_d^{(l−1)d+i}].

Similar to the pruned witness tree of a symmetric witness tree, we use the same process in Definition 8.2.3 to obtain the pruned asymmetric witness tree of an asymmetric witness tree.
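For illustration, the constants φ_d can be computed numerically as the unique root of x^d = 1 + x + ··· + x^{d−1} in (1, 2); this is the growth rate of the d-th order Fibonacci-type recurrence above. The bisection sketch below is our own.

```python
def phi(d, iters=100):
    """Largest root of x^d = 1 + x + ... + x^{d-1}, found by bisection
    on [1, 2]; f(1) = 1 - d < 0 and f(2) = 1 > 0, so a root exists."""
    f = lambda x: x**d - sum(x**j for j in range(d))
    lo, hi = 1.0, 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# phi_2 is the golden ratio; phi_d increases toward 2 as d grows.
```

For example, phi(2) ≈ 1.618, phi(3) ≈ 1.839, and phi(4) ≈ 1.928, all inside the interval (1.61, 2) claimed in Theorem 8.5.3 below.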

Vöcking [Voc03] showed that with a perfectly random hash function, the maximum load is log log n/(d log φ_d) + O(1) with high probability given any n balls. We outline Vöcking's argument for distinct balls here: let b be a ball of height l + 4 for l = (log log n + log(1 + c))/(d log φ_d) + 1. Without loss of generality, we assume that b is in the first group G_1. By the definition of the asymmetric witness tree, there exists a tree T in G_1 with root b and height l whose leaves have height at least 4. For each ball u and its ith child v, the hash function h^{(i)} satisfies h^{(i)}(u) = h^{(i)}(v). Similar to (8.2), we apply a union bound on all possible witness trees of height l in this configuration to bound the probability by

n^{f(l,1)} · (d/n)^{f(l,1)−1} · (3^{−d})^{number of leaves},

which is less than n^{−c} given f(l, 1) = Θ(φ_d^{(l−1)d+1}) = Θ((1 + c) log n). We prove our derandomization of Vöcking's argument here.

Theorem 8.5.3. For any m = O(n), any constants c > 1 and d ≥ 2, there exist a constant φ_d ∈ (1.61, 2) and a hash family H in Construction 8.3.1 with O(log n log log n) random bits such that for any m balls in U, with probability at least 1 − n^{−c}, the max-load of the Always-Go-Left mechanism with d independent choices from H is log log n/(d log φ_d) + O(c + m/n).

Proof. Let l be the smallest integer such that c_0 · φ_d^{ld} ≥ 10(2 + 2c) log m, and let b = 10d · (m/n) + 1. We bound the probability that a witness tree of height l + 3c + 1 whose leaves have height more than b exists in h^{(1)}, …, h^{(d)} during the Always-Go-Left scheme. Notice that a ball of height l + b + 3c + 1 in any bin of G_2, G_3, …, G_d indicates that there is a ball of the same height in G_1.

We choose the parameters of H as follows: k_g = 20c · d · (l + b + 1 + 3c) = O(log log n), k = log_2(log n/(3 log log n)), δ_1 = 1/poly(n) such that Corollary 8.3.5 fails with probability at most n^{−c−1}, and the bias δ_2 = (log n)^{−O(log n)} of h_{k+1} chosen later. We set h_{k+1} to be a hash function from U to [log³ n/d] and g to be a function from U to [n/d] such that

h^{(j)} = (h_1^{(j)} ◦ h_2^{(j)} ◦ ··· ◦ h_k^{(j)} ◦ h_{k+1}^{(j)}) ⊕ g^{(j)}

is a map from U to the group G_j of [n/d] bins for each j ∈ [d]. We use h^{(1)}, …, h^{(d)} to denote d independent hash functions from H with the above parameters, and the notation h_i to denote the group of hash functions {h_i^{(1)}, …, h_i^{(d)}} in this proof. We assume Corollary 8.3.5 and follow the same argument as in the proof of Theorem 8.4.1. We bound the probability of witness trees in two cases depending on the number of collisions.

Pruned witness trees with at least 3c collisions: Given a configuration C with at least 3c collisions, we consider the first 3c collisions e_1, …, e_{3c} in the BFS of C. Let C′ be the induced subgraph of C that only contains the vertices in e_1, …, e_{3c} and their ancestors in C. Therefore C survives under h^{(1)}, …, h^{(d)} only if C′ survives under h^{(1)}, …, h^{(d)}.

Observe that |C′| ≤ 3c · 2 · d · height(T). There are at most m^{|C′|} possible instantiations of balls in C′. For each instantiation T′ of C′, because k_g ≥ 2 · number of edges = 2(|C′| + 3c − 1), we bound the probability that any instantiation of C′ survives in h by

m^{|C′|} · (d/n)^{number of edges} = m^{|C′|} · (d/n)^{|C′|+3c−1} ≤ (dm/n)^{|C′|} · (d/n)^{3c−1} ≤ n^{−2c}.

At the same time, there are at most (|T|²)^{3c} = poly(log n) configurations of C′. Hence we bound the probability that any witness tree with at least 3c collisions survives by n^{−c}.

Pruned witness trees with less than 3c collisions: We fix a configuration C of witness trees in group G_1 with height l + 1 + 3c and less than 3c collisions. Thus |C| ∈ [f(l + 1, 1), f(l + 1 + 3c, 1)].

Let F_T be the subset of possible asymmetric witness trees with configuration C after fixing the prefixes h_1, h_2, …, h_k. For any T ∈ F_T, each edge (u, v) has to satisfy h^{(i)}(T(u)) = h^{(i)}(T(v)) in the Always-Go-Left scheme when v is the ith child of u. This indicates that their prefixes are equal:

h_1^{(i)}(T(u)) ◦ ··· ◦ h_k^{(i)}(T(u)) = h_1^{(i)}(T(v)) ◦ ··· ◦ h_k^{(i)}(T(v)).

By the same argument as in the proof of Theorem 8.4.1, we bound

|F_T| ≤ m · (1.01 log³ n · m/n)^{|C|−1}

under h_1, h_2, …, h_k by Corollary 8.3.5.

We first consider h_{k+1} as a t-wise independent distribution from U to [log³ n/d] for t = 5bd · f(l + 3c + 1, 1) = O(log m), and then move to δ_2-biased spaces. For each asymmetric witness tree, every edge (u, v) maps to the same bin with probability d/log³ n in h_{k+1}. For each leaf, its height is at least b only if each bin among its choices has height at least b − 1, which happens with probability at most

binom(1.01 · log³ n · (m/n) · d, b − 1) / (log³ n/d)^{b−1} ≤ (1.01d² · m/n)^{b−1} / (b − 1)! ≤ 2^{−3d²} · (n/m)^{2d}

from the proof of Theorem 8.4.1.

Because these two types of events are on disjoint subsets of balls, the probability that any possible asymmetric witness tree in F_T exists under t-wise independent distributions over the suffixes is at most

|F_T| · (d/log³ n)^{|C|−1} · (2^{−3d²} · (n/m)^{2d})^{(d−1)(|C|−3c)/d} ≤ m · (1.01d² · (m/n) · 2^{−3d²} · (n/m)^{2d})^{|C|/3} ≤ m · 2^{−f(l+1,1)} ≤ n^{−c−1}.

We choose δ_2 = n^{−c−1} · (log³ n/d)^{−t}/|F_T| = (log n)^{−O(log n)} such that in δ_2-biased spaces, any possible asymmetric witness tree in F_T exists in h_{k+1} with probability at most n^{−c−1} + |F_T| · δ_2 · (log³ n/d)^{bd·f(l+3c+1,1)} ≤ 2n^{−c−1}. At the same time, the number of possible configurations is at most (f(l + 3c + 1, 1)²)^{3c} ≤ 0.1n.

From all discussion above, with probability at most n^{−c}, there exists a ball in the Always-Go-Left mechanism with height at least l + b + 3c + 1 = log log n/(d log φ_d) + O(1).

8.6 Heavy Load

We consider the derandomization of the 1-choice scheme when we have m = n · poly(log n) balls and n bins. From the Chernoff bound, w.h.p. the max-load among n bins is (m/n) · (1 + O(√(log n · n/m))) when we throw m > n log n balls into n bins independently at random. We modify the hash function from [CRSW13] with proper parameters for m = poly(log n) · n balls and prove that the max-load is still (m/n) · (1 + O(√(log n · n/m))). We assume m = log^a n · n for a constant a ≥ 1 in the rest of this section.

Theorem 8.6.1. For any constants c > 0 and a ≥ 1, there exist a constant C and a hash function from U to [n] generated by O(log n log log n) random bits such that for any m = log^a n · n balls, with probability at least 1 − n^{−c}, the max-load of the n bins in the 1-choice scheme with the hash function h is at most (m/n) · (1 + C · √(log n · n/m)).
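This heavy-load regime can be checked empirically with a perfectly random hash function. The sketch below is our own illustration; the constant 8 standing in for C is our choice, not the constant from the theorem.

```python
import math
import random

def one_choice_max_load(m, n, seed=0):
    """Throw m balls into n bins uniformly at random; return the max load."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(m):
        load[rng.randrange(n)] += 1
    return max(load)

n = 2000
m = n * int(math.log(n)) ** 2                 # m = n * polylog(n) balls
heavy_max = one_choice_max_load(m, n)
# Target shape: (m/n) * (1 + C * sqrt(log n * n/m)) for some constant C.
bound = (m / n) * (1 + 8 * math.sqrt(math.log(n) * n / m))
```

With these parameters the average load m/n is 49, the observed max-load is typically in the 70s, and the bound is around 200, so the deviation term √(log n · n/m) indeed controls the overshoot.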

We omit $g$ in this section and choose $h_1,\ldots,h_{k+1}$ with different parameters. We choose $k=\log\frac{\log n}{(2a)\log\log n}$, use $h_i$ to denote a hash function from $U$ to $[n^{2^{-i}}]$ for $i\in[k]$, and use $h_{k+1}$ to denote a hash function from $U$ to $[n^{2^{-k}}]=[\log^{2a}n]$, such that $h_1\circ h_2\circ\cdots\circ h_k\circ h_{k+1}$ constitutes a hash function from $U$ to $[n]$. We set $\beta=4(c+2)\sqrt{\log n\cdot\frac{n}{m}}$. For convenience, we still think of $h_1\circ h_2\circ\cdots\circ h_i$ as a hash function mapping to $n^{1-2^{-i}}$ bins for any $i\le k$. In this section, we still use $\delta_1$-biased spaces on $h_1,\ldots,h_k$ and a $\delta_2$-biased space on $h_{k+1}$, for $\delta_1=1/\mathrm{poly}(n)$ and $\delta_2=(\log n)^{-O(\log n)}$.

Claim 8.6.2. For any constant $c>0$, there exists $\delta_1=1/\mathrm{poly}(n)$ such that given $m=\log^a n\cdot n$ balls, with probability $1-n^{-c-1}$, for any $i\in[k]$ and any bin $b\in[n^{1-2^{-i}}]$, there are fewer than $\prod_{j\le i}\big(1+\frac{\beta}{(k+2-j)^2}\big)\cdot\frac{m}{n}\cdot n^{2^{-i}}$ balls in this bin.

Proof. We still use induction on $i$. The base case is $i=0$: because there are at most $m$ balls, the hypothesis is true. Suppose it is true for $i=l$. Now we fix a bin and assume there are $s=\prod_{j\le l}\big(1+\frac{\beta}{(k+2-j)^2}\big)\cdot\frac{m}{n}\,n^{2^{-l}}\le(1+\beta)\frac{m}{n}\,n^{2^{-l}}$ balls in this bin, by the induction hypothesis. $h_{l+1}$ maps these $s$ balls to $t=n^{2^{-(l+1)}}$ bins. We will prove that with high probability, every bin among these $t$ bins of $h_{l+1}$ contains at most $\big(1+\frac{\beta}{(k+1-l)^2}\big)s/t$ balls.

We use $X_i\in\{0,1\}$ to denote whether ball $i$ is in one fixed bin of $[t]$ or not. Hence $\Pr[X_i=1]=1/t$. Let $Y_i=X_i-\mathbb{E}[X_i]$. Therefore $\mathbb{E}[Y_i]=0$ and $\mathbb{E}[|Y_i|^l]\le1/t$ for any $l\ge2$. Let $b=\beta2^l$ for a large constant $\beta$ chosen later.

$$\Pr_{D_{\delta_1}}\Big[\sum_i X_i>\Big(1+\frac{\beta}{(k+1-l)^2}\Big)\frac{s}{t}\Big]\le\frac{\mathbb{E}_{D_{\delta_1}}\big[(\sum_i Y_i)^b\big]}{\big(\frac{\beta}{(k+1-l)^2}\cdot\frac{s}{t}\big)^b}\le\frac{\sum_{i_1,\ldots,i_b}\mathbb{E}_U[Y_{i_1}\cdots Y_{i_b}]+\delta_1 s^{2b}}{\big(\frac{\beta}{(k+1-l)^2}\cdot\frac{s}{t}\big)^b}\le\frac{2^b b!\,(s/t)^{b/2}+\delta_1 s^{2b}}{\big(\frac{\beta}{(k+1-l)^2}\cdot\frac{s}{t}\big)^b}\le\Bigg(\frac{2b\,(s/t)}{\big(\frac{\beta}{(k+1-l)^2}\cdot\frac{s}{t}\big)^2}\Bigg)^{b/2}+\delta_1\cdot s^{2b}.$$

We use the bounds $k=\log\frac{\log n}{(2a)\log\log n}<\log\log n$, $b\le\beta2^k<\frac{\beta\log n}{(2a)\log\log n}$, and $n^{2^{-l-1}}\ge n^{2^{-k}}\ge\log^{2a}n\ge(m/n)^2$ to simplify the above bound:

$$\Big(\frac{2\log n}{\frac{\beta^2}{(\log\log n)^4}\cdot s/t}\Big)^{b/2}+\delta_1 s^{2b}\le\Big(\frac{2\log^2 n}{(\log n\cdot\frac{n}{m})\cdot(\frac{m}{n}\,n^{2^{-l-1}})}\Big)^{b/2}+\delta_1 s^{2b}\le\Big(\frac{1}{n^{0.5\cdot2^{-l-1}}}\Big)^{b/2}+\delta_1 s^{2b}\le n^{-0.5\cdot2^{-l-1}\cdot\beta2^l/2}+\delta_1\Big(\frac{2m}{n}n^{2^{-l}}\Big)^{2\beta2^l}\le n^{-\beta/8}+\delta_1\cdot n^{6\beta}.$$

Hence we choose the two parameters $\beta>8(c+2)$ and $\delta_1=n^{-6\beta-c-2}$ such that the above probability is bounded by $2n^{-c-2}$. Finally, we apply the union bound over $i$ and all bins.

Proof of Theorem 8.6.1. We first apply Claim 8.6.2 to h1, . . . , hk.

In $h_{k+1}$, we first consider it as a $b=16(c+2)\log n=O(\log n)$-wise independent distribution that maps $s<\prod_{j\le k}\big(1+\frac{\beta}{(k+2-j)^2}\big)\cdot\frac{m}{n}\,n^{2^{-k}}$ balls to $t=n^{2^{-k}}$ bins. From Lemma 8.1.6 and Theorem 5 (I) in [SSS95], we bound the probability that one bin receives more than $(1+\beta)s/t$ balls by $e^{-\beta^2\cdot\mathbb{E}[s/t]/3}\le n^{-c-2}$, given $b\ge\beta^2\,\mathbb{E}[s/t]$.

Then we choose $\delta_2=(\log n)^{-5ab}=(\log n)^{-O(\log n)}$ such that any $\delta_2$-biased space from $[\frac{2m}{n}\log^{2a}n]$ to $[\log^{2a}n]$ is $\big(\delta_2\cdot\big(\frac{2m}{n}\log^{2a}n\cdot\log^{2a}n\big)^{b}<n^{-c-2}\big)$-close to a $b$-wise independent distribution. Hence in $h_{k+1}$, with probability at most $2n^{-c-2}$, there is one bin that receives more than $(1+\beta)s/t$ balls. Overall, the number of balls in any bin of $[n]$ is at most
$$\prod_{i\le k}\Big(1+\frac{\beta}{(k+2-i)^2}\Big)(1+\beta)\cdot\frac{m}{n}\;\le\;\prod_{i\le k+1}\Big(1+\frac{\beta}{(k+2-i)^2}\Big)\cdot\frac{m}{n}\;\le\;(1+2\beta)\frac{m}{n}.$$

Chapter 9

Constraint Satisfaction Problems Above Average with Global Cardinality Constraints

In this chapter, we consider the constraint satisfaction problem on $\{-1,1\}^n$ under a global cardinality constraint. For generality, we allow different constraints to use different predicates.

Definition 9.0.1. An instance I of a constraint satisfaction problem of arity $d$ consists of a set of variables $V=\{x_1,\ldots,x_n\}$ and a set of $m$ constraints $C_1,\ldots,C_m$. Each constraint $C_i$ consists of $d$ variables $x_{i_1},\ldots,x_{i_d}$ and a predicate $P_i\subseteq\{-1,1\}^d$. An assignment on $x_{i_1},\ldots,x_{i_d}$ satisfies $C_i$ if and only if $(x_{i_1},\ldots,x_{i_d})\in P_i$. The value $\mathrm{val}_I(\alpha)$ of an assignment $\alpha$ is the number of constraints in $C_1,\ldots,C_m$ that are satisfied by $\alpha$. The goal of the problem is to find an assignment with maximum possible value.

An instance I of a constraint satisfaction problem with a global cardinality constraint consists of an instance I of a CSP and a global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$ specified by a parameter $p$. The goal of the problem is to find an assignment of maximum possible value complying with the global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$. We denote the value of the optimal assignment by
$$OPT=\max_{\alpha:\sum_i\alpha_i=(1-2p)n}\mathrm{val}_I(\alpha).$$
The average value $AVG$ of I is the expected value of an assignment chosen uniformly at random among all assignments complying with the global cardinality constraint:
$$AVG=\mathbb{E}_{\alpha:\sum_i\alpha_i=(1-2p)n}[\mathrm{val}_I(\alpha)].$$

Given an instance I of a constraint satisfaction problem of arity $d$, we associate a multilinear polynomial $f_I$ of degree at most $d$ with I such that $f_I(\alpha)=\mathrm{val}_I(\alpha)$ for any $\alpha\in\{\pm1\}^n$:
$$f_I(x)=\sum_{i\in[m]}\sum_{\sigma\in P_i}\frac{\prod_{j\in[d]}(1+\sigma_j\cdot x_{i_j})}{2^d}.$$
Notice that given an instance I and a global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$, the expectation of I under the global cardinality constraint, $AVG=\mathbb{E}_D[f_I]$, is different from its expectation under the uniform distribution, even for CSPs of arity 2 under the bisection constraint.
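The identity $f_I(\alpha)=\mathrm{val}_I(\alpha)$ holds because each product $\prod_j(1+\sigma_j x_{i_j})/2^d$ is the 0/1 indicator that the constraint's variables equal $\sigma$. A small sketch checks this on a toy instance (the instance and function names are illustrative, not from the thesis):

```python
from itertools import product

def f_I(x, constraints, d):
    """Evaluate f_I(x) = sum_i sum_{sigma in P_i} prod_j (1 + sigma_j * x_{i_j}) / 2^d."""
    total = 0.0
    for variables, predicate in constraints:
        for sigma in predicate:
            term = 1.0
            for j in range(d):
                term *= 1 + sigma[j] * x[variables[j]]
            total += term / 2 ** d
    return total

def val(x, constraints):
    """Number of constraints satisfied by x, counted directly."""
    return sum(tuple(x[v] for v in variables) in predicate
               for variables, predicate in constraints)

# A toy arity-2 instance on 3 variables:
constraints = [((0, 1), {(1, -1), (-1, 1)}),            # x0 != x1
               ((1, 2), {(1, 1), (1, -1), (-1, 1)})]    # not both equal to -1
assert all(abs(f_I(x, constraints, 2) - val(x, constraints)) < 1e-9
           for x in product([-1, 1], repeat=3))
print("f_I(alpha) = val_I(alpha) for all 8 assignments")
```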

Definition 9.0.2. In the satisfiability above average problem, we are given an instance of a CSP of arity $d$, a global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$, and a parameter $t$. We need to decide whether $OPT\ge AVG+t$ or not.

In this chapter, we show that it is fixed-parameter tractable. Namely, given a parameter $t$ and an instance of a CSP of arity $d$ under a global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$, we design an algorithm that either finds a kernel on $O(t^2)$ variables or certifies that $OPT\ge AVG+t$.

Theorem 9.0.3 (Informal version of Theorem 9.3.1 and Theorem 9.5.1). For any integer constant $d$ and real constant $p_0\in(0,1/2]$, given a $d$-ary CSP with $n$ variables and $m=n^{O(1)}$ constraints, a global cardinality constraint $\sum_{i=1}^n x_i=(1-2p)n$ such that $p\in[p_0,1-p_0]$, and an integer parameter $t$, there is an algorithm that runs in time $n^{O(1)}+2^{O(t^2)}$ and decides whether there is an assignment complying with the cardinality constraint that satisfies at least $AVG+t$ constraints or not.

One important ingredient in the proof of our main theorem is the $2\to4$ hypercontractivity of low-degree multilinear polynomials in a correlated probability space. Let $D_p$ be the uniform distribution on all assignments to the $n$ variables complying with the cardinality constraint $\sum_{i=1}^n x_i=(1-2p)n$. We show the following inequality.

Theorem 9.0.4 (Informal version of Corollary 9.3.8 and Corollary 9.4.2). For any degree-$d$ multilinear polynomial $f$ on variables $x_1,x_2,\ldots,x_n$, we have
$$\mathbb{E}_{D_p}[f^4]\le\mathrm{poly}(d)\cdot C_p^d\cdot\mathbb{E}_{D_p}[f^2]^2,$$
where the constant $C_p=\mathrm{poly}\big(\frac{1}{1-p},\frac{1}{p}\big)$.
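For intuition, the inequality can be checked numerically in the bisection case $p=1/2$ (the distribution $D$) by exact enumeration over the support. The degree-2 polynomial below is an arbitrary example of ours, and the constant 100 in the check is a loose placeholder, not the constant of Theorem 9.0.4:

```python
from itertools import product

n, d = 8, 2
# Support of D: all balanced +-1 assignments (the bisection constraint).
support = [x for x in product([-1, 1], repeat=n) if sum(x) == 0]

def f(x):
    """An arbitrary degree-2 multilinear polynomial."""
    return x[0]*x[1] + x[2]*x[3] + x[4]*x[5] + x[6]*x[7]

mean = sum(f(x) for x in support) / len(support)
m2 = sum((f(x) - mean) ** 2 for x in support) / len(support)
m4 = sum((f(x) - mean) ** 4 for x in support) / len(support)
ratio = m4 / m2 ** 2
print(f"E[g^4] / E[g^2]^2 = {ratio:.2f}")  # bounded by a constant depending on d, p
```

Note that the variables are pairwise correlated here ($\mathbb{E}_D[x_ix_j]=-1/(n-1)\ne0$), which is exactly what rules out the standard induction proofs mentioned below.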

The ordinary $2\to4$ hypercontractive inequality (see Section 9.1.1 for details) has wide applications in computer science, e.g., invariance principles [MOO10], a lower bound on the influence of variables on the Boolean cube [KKL88], and an upper bound on the fourth moment of low-degree functions [AGK+11, MMZ15] (see [O'D14] for a complete introduction, more applications, and references therein). The inequality admits an elegant induction proof, which was first introduced in [MOO05]; the proof was later extended to different settings (e.g., to the low-level sum-of-squares proof system [BBH+14], and to more general product distributions [MMZ15]). All the previous induction proofs, to the best of our knowledge, rely on the local independence of the variables (i.e., the independence within every constant-sized subset of the random variables). In the $2\to4$ hypercontractive inequality we prove, however, every pair of the random variables is correlated.

Because of the lack of pairwise independence, our induction proof (as well as the proof of the main theorem, Theorem 9.0.3) crucially relies on the analysis of the eigenvalues of several $n^{O(d)}\times n^{O(d)}$ set-symmetric matrices. We introduce more details about this analysis in the next subsection.

Related work. Recently, Gutin and Yeo [GY10] showed that it is possible to decide whether there is an assignment satisfying more than $\lceil m/2+t\rceil$ constraints in time $2^{O(t^2)}+O(m)$ for the MaxBisection problem with $m$ constraints and $n$ variables. The running time was later improved to $2^{O(t)}+O(m)$ by Mnich and Zenklusen [MZ12]. However, observe that in the MaxBisection problem, the trivial randomized algorithm satisfies $AVG=\big(\frac{1}{2}+\frac{1}{2(n-1)}\big)m$ constraints in expectation. Therefore, when $m\gg n$, our problem MaxBisection above average asks for more than what was proved in [GY10, MZ12]. For the MaxCut problem without any global cardinality constraint, Crowston et al. [CJM15] showed that optimizing above the Edwards–Erdős bound is fixed-parameter tractable, which is comparable to the bound in our work, while our algorithm outputs a solution strictly satisfying the global cardinality constraint $\sum_{i=1}^n x_i=(1-2p)n$.

Independently, Filmus and Mossel [FM16] provided a hypercontractive inequality over $D_p$ based on the log-Sobolev inequality due to Lee and Yau [LY98]. They utilized the property that harmonic polynomials constitute an orthogonal basis in $D_p$. In this chapter, we use parity functions and their Fourier coefficients to analyze the eigenspaces of $\mathrm{Var}_{D_p}$ and prove the hypercontractivity in $D_p$. Parity functions do not constitute an orthogonal basis in $D_p$; e.g., the $n$ variables are not independent under any global cardinality constraint $\sum_{i=1}^n x_i=(1-2p)n$. However, there is another important component in the proof of our main theorem: we need to prove that the variance of a random solution is high if the optimal solution is much above average, and parity functions play an important role in this component.

Organization. We review several basic tools, such as Fourier analysis and the Johnson scheme, in Section 9.1. Then we analyze the eigenspaces of $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$ in Section 9.2. Next we consider CSPs under the bisection constraint in Section 9.3. We prove the hypercontractivity Theorem 9.0.4 in Section 9.4. Finally, we consider an arbitrary global constraint in Section 9.5.

9.1 Notation and Tools

In this chapter, we only consider $f:\{\pm1\}^n\to\mathbb{R}$. Let $U$ denote the uniform distribution on $\{\pm1\}^n$ and $U_p$ denote the biased product distribution on $\{\pm1\}^n$ in which each bit equals $-1$ with probability $p$ and equals $1$ with probability $1-p$.

For a random variable $X$ with standard deviation $\sigma$, it is known that a bound on the fourth moment suffices to guarantee that there exists $x\in\mathrm{supp}(X)$ greater than $\mathbb{E}[X]+\Omega(\sigma)$ [Ber97, AGK+11, O'D14]. We state this result as follows.

Lemma 9.1.1. Let $X$ be a real random variable. Suppose that $\mathbb{E}[X]=0$, $\mathbb{E}[X^2]=\sigma^2>0$, and $\mathbb{E}[X^4]<b\sigma^4$ for some $b>0$. Then $\Pr[X\ge\sigma/(2\sqrt{b})]>0$.
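A minimal numerical illustration of the lemma on a toy two-point distribution (our example, not from the thesis):

```python
import math

# A mean-zero random variable given by (value, probability) pairs:
# X = -1 w.p. 3/4 and X = 3 w.p. 1/4.
dist = [(-1.0, 0.75), (3.0, 0.25)]

mean = sum(v * p for v, p in dist)
m2 = sum(v ** 2 * p for v, p in dist)       # sigma^2 = 3
m4 = sum(v ** 4 * p for v, p in dist)       # fourth moment = 21
sigma = math.sqrt(m2)
b = m4 / m2 ** 2                            # E[X^4] = b * sigma^4 here
threshold = sigma / (2 * math.sqrt(b))

# Lemma 9.1.1: some point of the support is at least sigma / (2 sqrt(b)).
mass_above = sum(p for v, p in dist if v >= threshold)
print(f"sigma = {sigma:.3f}, b = {b:.3f}, threshold = {threshold:.3f}, "
      f"Pr[X >= threshold] = {mass_above}")
```

Here the threshold is about $0.57$ while the support point $3$ carries probability $1/4$, so the lemma's conclusion holds with room to spare.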

In this chapter, we always use $D$ to denote the uniform distribution on all assignments to the $n$ variables complying with the bisection constraint $\sum_{i=1}^n x_i=0$, and $D_p$ to denote the uniform distribution on all assignments complying with the cardinality constraint $\sum_{i=1}^n x_i=(1-2p)n$.

9.1.1 Basics of Fourier Analysis of Boolean functions

We state several basic properties of the Fourier transform for Boolean functions that will be useful in this chapter. We first introduce the standard Fourier transform on $\{\pm1\}^n$, which will be used in Sections 9.3 and 9.5. We will also use the $p$-biased Fourier transform in several proofs, especially for the $2\to4$ hypercontractive inequality under $D_p$ in Sections 9.1.2, 9.2, and 9.4.

For the uniform distribution $U$, we define the inner product of a pair of functions $f,g:\{\pm1\}^n\to\mathbb{R}$ by $\langle f,g\rangle=\mathbb{E}_{x\sim U}[f(x)g(x)]$. Hence the characters $\chi_S(x)=\prod_{i\in S}x_i$ over all subsets $S\subseteq[n]$ constitute an orthonormal basis for the functions from $\{\pm1\}^n$ to $\mathbb{R}$. We simplify notation by writing $\chi_S$ instead of $\chi_S(x)$. Hence every Boolean function has a unique multilinear polynomial expression $f=\sum_{S\subseteq[n]}\hat{f}(S)\chi_S$, where $\hat{f}(S)=\langle f,\chi_S\rangle$ is the coefficient of $\chi_S$ in $f$. In particular, $\hat{f}(\emptyset)=\mathbb{E}_{x\sim U}[f(x)]$. An important fact about Fourier coefficients is Parseval's identity, $\sum_S\hat{f}(S)^2=\mathbb{E}_{x\sim U}[f(x)^2]$, which implies $\mathrm{Var}_U(f)=\sum_{S\ne\emptyset}\hat{f}(S)^2$. Given any Boolean function $f$, we define its degree to be the largest size of $S$ with non-zero Fourier coefficient $\hat{f}(S)$. In this chapter, we focus on multilinear polynomials $f$ of degree at most $d$. We use the Fourier coefficients of weight $i$ to denote all Fourier coefficients $\{\hat{f}(S)\,|\,S\in\binom{[n]}{i}\}$ of size-$i$ character functions. For a polynomial $f$ of degree at most $d$, we abuse notation and let $f$ also denote a vector in the linear space $\mathrm{span}\{\chi_S\,|\,S\in\binom{[n]}{\le d}\}$, where each coordinate corresponds to a character function $\chi_S$ of a subset $S$.

We state the standard Bonami lemma for Bernoulli $\pm1$ random variables [Bon70, O'D14], which is also known as the $2\to4$ hypercontractivity for low-degree multilinear polynomials.

Lemma 9.1.2. Let $f:\{-1,1\}^n\to\mathbb{R}$ be a multilinear polynomial of degree at most $d$. Let $X_1,\ldots,X_n$ be independent unbiased $\pm1$-Bernoulli variables. Then
$$\mathbb{E}[f(X_1,\ldots,X_n)^4]\le9^d\cdot\mathbb{E}[f(X_1,\ldots,X_n)^2]^2.$$
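The lemma can be verified exactly by enumeration for small $n$. A quick sketch for $d=2$ with an arbitrary degree-2 polynomial of ours:

```python
from itertools import combinations, product

n, d = 6, 2
pairs = list(combinations(range(n), 2))

def f(x):
    """Degree-2 multilinear polynomial: sum of x_i * x_j over all pairs i < j."""
    return sum(x[i] * x[j] for i, j in pairs)

pts = list(product([-1, 1], repeat=n))          # exact expectation over {-1,1}^n
m2 = sum(f(x) ** 2 for x in pts) / len(pts)
m4 = sum(f(x) ** 4 for x in pts) / len(pts)
print(f"E[f^2] = {m2}, E[f^4] = {m4}, ratio = {m4 / m2**2:.3f}")
```

By orthonormality of the characters, $\mathbb{E}[f^2]$ equals the number of pairs, $\binom{6}{2}=15$, and the fourth-moment ratio stays below the Bonami bound $9^d=81$.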

For the $p$-biased distribution $U_p$, we define the inner product of a pair of functions $f,g:\{\pm1\}^n\to\mathbb{R}$ by $\langle f,g\rangle=\mathbb{E}_{x\sim U_p}[f(x)g(x)]$. Then we define $\phi_i(x)=\sqrt{\frac{p}{1-p}}\,\mathbf{1}_{x_i=1}-\sqrt{\frac{1-p}{p}}\,\mathbf{1}_{x_i=-1}$ and $\phi_S(x)=\prod_{i\in S}\phi_i(x)$. We abuse notation by writing $\phi_S$ instead of $\phi_S(x)$. It is straightforward to verify $\mathbb{E}_{U_p}[\phi_i]=0$ and $\mathbb{E}_{U_p}[\phi_i^2]=1$. Notice that $\phi_S\phi_T\ne\phi_{S\Delta T}$ in general, unlike $\chi_S\chi_T=\chi_{S\Delta T}$, which holds for all $x$. However, $\langle\phi_S,\phi_T\rangle=0$ for distinct $S$ and $T$ under $U_p$. Thus we have the biased Fourier expansion $f(x)=\sum_{S\subseteq[n]}\hat{f}(S)\phi_S(x)$, where $\hat{f}(S)=\langle f,\phi_S\rangle$ in $U_p$. We also have $\hat{f}(\emptyset)=\mathbb{E}_{U_p}[f]$ and Parseval's identity $\sum_S\hat{f}(S)^2=\mathbb{E}_{U_p}[f^2]$, which implies $\mathrm{Var}_{U_p}(f)=\sum_{S\ne\emptyset}\hat{f}(S)^2$. We state two facts about $\phi_i$ that will be useful in later sections.

1. $x_i=2\sqrt{p(1-p)}\cdot\phi_i+1-2p$. Hence $\sum_i\phi_i(x)=0$ for any $x$ satisfying $\sum_i x_i=(1-2p)n$.

2. $\phi_i^2=q\cdot\phi_i+1$ for $q=\frac{2p-1}{\sqrt{p(1-p)}}$. Thus we can write $f$ as a multilinear polynomial of the $\phi_i$.
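Both identities are immediate to check pointwise on $x\in\{-1,1\}$; a two-line verification sketch ($p=0.3$ is an arbitrary choice of ours):

```python
import math

p = 0.3
q = (2 * p - 1) / math.sqrt(p * (1 - p))

def phi(x):
    """phi_i(x) = sqrt(p/(1-p)) if x = 1, else -sqrt((1-p)/p)."""
    return math.sqrt(p / (1 - p)) if x == 1 else -math.sqrt((1 - p) / p)

for x in (-1, 1):
    # Fact 1: x = 2 sqrt(p(1-p)) * phi(x) + (1 - 2p).
    assert math.isclose(x, 2 * math.sqrt(p * (1 - p)) * phi(x) + 1 - 2 * p)
    # Fact 2: phi(x)^2 = q * phi(x) + 1.
    assert math.isclose(phi(x) ** 2, q * phi(x) + 1)
print("both identities hold for p =", p)
```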

Observe that the largest size of $T$ with non-zero Fourier coefficient $\hat{f}(T)$ in the basis $\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$ equals the degree of $f$ defined via $\{\chi_S\,|\,S\in\binom{[n]}{\le d}\}$. Hence we still define the degree of $f$ to be $\max_{S:\hat{f}(S)\ne0}|S|$. We abuse notation and let $f$ also denote a vector in the linear space $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$.

For the biased distribution $U_p$, we know $\mathbb{E}_{U_p}[\phi_i^4]=\frac{p^2}{1-p}+\frac{(1-p)^2}{p}\ge1$. Therefore we state the $2\to4$ hypercontractivity in the biased distribution $U_p$ as follows.

Lemma 9.1.3. Let $f:\{-1,1\}^n\to\mathbb{R}$ be a multilinear polynomial of $\phi_1,\ldots,\phi_n$ of degree at most $d$. Then
$$\mathbb{E}_{U_p}[f(X_1,\ldots,X_n)^4]\le\Big(9+9\cdot\Big(\frac{p^2}{1-p}+\frac{(1-p)^2}{p}\Big)\Big)^d\cdot\mathbb{E}_{U_p}[f(X_1,\ldots,X_n)^2]^2.$$

Finally, we notice that the definition of $\phi_S$ coincides with that of $\chi_S$ when $p=1/2$. When the distribution $U_p$ is fixed and clear from context, we use $\|f\|_2=\mathbb{E}_{x\sim U_p}[f(x)^2]^{1/2}$ to denote the $L_2$ norm of a Boolean function $f$. From Parseval's identity, $\|f\|_2$ also equals $\big(\sum_S\hat{f}(S)^2\big)^{1/2}$. From the Cauchy–Schwarz inequality, one useful property is $\|fg\|_2\le\|f^2\|_2^{1/2}\|g^2\|_2^{1/2}$.

9.1.2 Distributions conditioned on global cardinality constraints

We will study the expectation and the variance of a low-degree multilinear polynomial $f$ in $D_p$. Because $\phi_S$ coincides with $\chi_S$ when $p=1/2$, we fix the basis to be the $\phi_S$ of the $p$-biased Fourier transform. We treat $q=\frac{2p-1}{\sqrt{p(1-p)}}$ as a constant and hide it in the big-Oh notation.

We first discuss the expectation of $f$ under $D_p$. Because $\mathbb{E}_{D_p}[\phi_S]$ is not necessarily $0$ for a non-empty subset $S$, $\mathbb{E}_{D_p}[f]\ne\hat{f}(\emptyset)$. Let $\delta_S=\mathbb{E}_{D_p}[\phi_S]$. By symmetry, $\delta_S=\delta_{S'}$ for any $S$ and $S'$ of the same size. For convenience, we write $\delta_k=\delta_S$ for any $S\in\binom{[n]}{k}$. From the definition of $\delta$, we have $\mathbb{E}_{D_p}[f]=\sum_S\hat{f}(S)\cdot\delta_S$.

For $p=1/2$ and $D$, $\delta_k=0$ for all odd $k$ and $\delta_k=(-1)^{k/2}\frac{(k-1)!!}{(n-1)(n-3)\cdots(n-k+1)}$ for even $k$. We calculate it this way: pick any $T\in\binom{[n]}{k}$ and consider $\mathbb{E}_D[(\sum_i x_i)\chi_T]=0$. This yields
$$k\cdot\delta_{k-1}+(n-k)\delta_{k+1}=0.$$

From $\delta_0=1$ and $\delta_1=0$, we can obtain $\delta_k$ for every $k>1$. For $p\ne1/2$ and $D_p$ under the global cardinality constraint $\sum_{i\in[n]}x_i=(1-2p)n$, we consider $\mathbb{E}_{D_p}[\phi_S]$, because $\sum_{i\in[n]}x_i=(1-2p)n$ implies $\sum_i\phi_i=0$. Thus we use $\delta_S=\mathbb{E}_{D_p}[\phi_S]$ and calculate it as follows: pick any $T\in\binom{[n]}{k}$ and consider $\mathbb{E}_{D_p}[(\sum_i\phi_i)\phi_T]=0$. We have $\phi_i\phi_T=\phi_{T\cup i}$ for $i\notin T$, and $\phi_i\phi_T=q\cdot\phi_T+\phi_{T\setminus i}$ for $i\in T$ from the fact $\phi_i^2=q\cdot\phi_i+1$. We obtain
$$k\cdot\delta_{k-1}+k\cdot q\cdot\delta_k+(n-k)\delta_{k+1}=0.\tag{9.1}$$

Remark 9.1.4. For $p=1/2$ and the bisection constraint, $q=0$ and the recurrence relation becomes $k\cdot\delta_{k-1}+(n-k)\delta_{k+1}=0$, which is consistent with the above characterization. Thus we abuse the notation $\delta_k$ when $U_p$ is fixed and clear from context.
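For $p=1/2$, the recurrence can be cross-checked against a brute-force computation of $\mathbb{E}_D[\chi_S]$ over the balanced assignments; a verification sketch for $n=8$ (our code, using exact rational arithmetic):

```python
from itertools import product
from fractions import Fraction

n = 8  # bisection constraint: p = 1/2, so q = 0
support = [x for x in product([-1, 1], repeat=n) if sum(x) == 0]

def delta_exact(k):
    """E_D[chi_S] for |S| = k, by enumeration over balanced assignments."""
    total = 0
    for x in support:
        v = 1
        for i in range(k):
            v *= x[i]
        total += v
    return Fraction(total, len(support))

# delta_k from the recurrence k * delta_{k-1} + (n - k) * delta_{k+1} = 0.
delta = [Fraction(1), Fraction(0)]
for k in range(1, n - 1):
    delta.append(Fraction(-k, n - k) * delta[k - 1])

for k in range(5):
    assert delta_exact(k) == delta[k]
print([str(d) for d in delta[:5]])  # ['1', '0', '-1/7', '0', '3/35']
```

The values match the closed form $\delta_{2i}=(-1)^i\frac{(2i-1)!!}{(n-1)(n-3)\cdots(n-2i+1)}$, e.g. $\delta_2=-1/7$ and $\delta_4=3/35$ for $n=8$.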

From $\delta_0=1$, $\delta_1=0$, and the relation above, we can determine $\delta_k$ for every $k$. For example, $\delta_2=-\frac{1}{n-1}$ and $\delta_3=-\frac{\delta_2}{n-2}\cdot2\cdot q=\frac{2q}{(n-1)(n-2)}$. We bound $\delta_i$ as follows:

Claim 9.1.5. For any $i\ge1$, $\delta_{2i-1}=(-1)^iO(n^{-i})$ and $\delta_{2i}=(-1)^i\frac{(2i-1)!!}{n^i}+O(n^{-i-1})$.

Proof. We use induction on $i$. Base case: $\delta_0=1$ and $\delta_1=0$.

Because $\delta_{2i-2}=(-1)^{i-1}\Theta(n^{-i+1})$ and $\delta_{2i-1}=(-1)^iO(n^{-i})$, the major term of $\delta_{2i}$ is determined by $\delta_{2i-2}$. We choose $k=2i-1$ in equation (9.1) to obtain
$$\delta_{2i}=(-1)^i\frac{(2i-1)!!}{(n-1)(n-3)\cdots(n-2i+1)}+O\Big(\frac{1}{n^{i+1}}\Big)=(-1)^i\frac{(2i-1)!!}{n^i}+O\Big(\frac{1}{n^{i+1}}\Big).$$
At the same time, from $\delta_{2i}$ and $\delta_{2i-1}$,
$$\delta_{2i+1}=(-1)^{i+1}q\cdot\frac{(2i)(2i-1)!!+(2i)(2i-2)(2i-3)!!+\cdots+(2i)!!}{n^{i+1}}+O\Big(\frac{1}{n^{i+2}}\Big)=(-1)^{i+1}O\Big(\frac{1}{n^{i+1}}\Big).$$

Now we turn to $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$ for a multilinear polynomial $f$ of degree at most $d$. From the definition and the Fourier transform $f=\sum_S\hat{f}(S)\phi_S$,
$$\mathbb{E}_{D_p}[f^2]=\sum_{S,T}\hat{f}(S)\hat{f}(T)\delta_{S\Delta T},\qquad\mathrm{Var}_{D_p}(f)=\mathbb{E}_{D_p}[f^2]-\mathbb{E}_{D_p}[f]^2=\sum_{S,T}\hat{f}(S)\hat{f}(T)(\delta_{S\Delta T}-\delta_S\delta_T).$$
We associate a $\binom{n}{\le d}\times\binom{n}{\le d}$ matrix $A$ with $\mathbb{E}_{D_p}[f^2]$, where $A(S,T)=\delta_{S\Delta T}$. Hence $\mathbb{E}_{D_p}[f^2]=f^{\top}Af$ by definition, when we think of $f$ as a vector in the linear space $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$. Similarly, we associate a $\binom{n}{\le d}\times\binom{n}{\le d}$ matrix $B$ with $\mathrm{Var}_{D_p}(f)$, where $B(S,T)=\delta_{S\Delta T}-\delta_S\cdot\delta_T$. Hence $\mathrm{Var}_{D_p}(f)=f^{\top}Bf$. Notice that an entry $(S,T)$ of $A$ or $B$ only depends on the sizes of $S$, $T$, and $S\cap T$.

Remark 9.1.6. Because $B(\emptyset,S)=B(S,\emptyset)=0$ for any $S$ and $\mathrm{Var}_{D_p}(f)$ is independent of $\hat{f}(\emptyset)$, we can neglect $\hat{f}(\emptyset)$ in $B$, so that $B$ is a $\big(\binom{n}{d}+\cdots+\binom{n}{1}\big)\times\big(\binom{n}{d}+\cdots+\binom{n}{1}\big)$ matrix. $\hat{f}(\emptyset)$ is the only difference between the analyses of the eigenvalues of $A$ and $B$. In fact, the difference $\delta_S\cdot\delta_T$ between $A(S,T)$ and $B(S,T)$ does not affect the analysis of their eigenvalues, except for the eigenvalue induced by $\hat{f}(\emptyset)$.

In Section 9.2, we study the eigenvalues of $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$ in the linear space $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$, i.e., the eigenvalues of $A$ and $B$.

9.1.3 Eigenspaces in the Johnson Schemes

We shall use a few characterizations of the eigenspaces of the Johnson scheme to analyze the eigenspaces and eigenvalues of $A$ and $B$ in Section 9.2 (see [God10] for a complete introduction).

We divide $A$ into $(d+1)\times(d+1)$ submatrices, where $A_{i,j}$ is the matrix of $A(S,T)$ over all $S\in\binom{[n]}{i}$ and $T\in\binom{[n]}{j}$. For each diagonal matrix $A_{i,i}$, observe that $A_{i,i}(S,T)$ only depends on $|S\cap T|$ because $|S|=|T|=i$, which means $A_{i,i}$ belongs to an association scheme, in particular, the Johnson scheme.

Definition 9.1.7. A matrix $M\in\mathbb{R}^{\binom{[n]}{r}\times\binom{[n]}{r}}$ is set-symmetric if for every $S,T\in\binom{[n]}{r}$, $M(S,T)$ depends only on $|S\cap T|$.

For $r\le n/2$, let $J_r\subseteq\mathbb{R}^{\binom{[n]}{r}\times\binom{[n]}{r}}$ be the subspace of all set-symmetric matrices. $J_r$ is called the Johnson scheme.

Let $r$ be a fixed integer and let $M\in\mathbb{R}^{\binom{[n]}{r}\times\binom{[n]}{r}}$ be a matrix in the Johnson scheme $J_r$. We treat a vector in $\mathbb{R}^{\binom{[n]}{r}}$ as a homogeneous degree-$r$ polynomial $f=\sum_{T\in\binom{[n]}{r}}\hat{f}(T)\phi_T$, where each coordinate corresponds to an $r$-subset. Although the eigenvalues of $M$ depend on the entries of $M$, the eigenspaces of $M$ are independent of $M$ as long as $M$ is in the Johnson scheme.

Fact 9.1.8. There are $r+1$ eigenspaces $V_0,V_1,\ldots,V_r$ of $M$. For $i\in[r]$, the dimension of $V_i$ is $\binom{n}{i}-\binom{n}{i-1}$; the dimension of $V_0$ is $1$. We define $V_i$ through $\hat{f}(S)$ over all $S\in\binom{[n]}{i}$, although $M$ and $f$ only depend on $\{\hat{f}(T)\,|\,T\in\binom{[n]}{r}\}$. $V_i$ is the linear space spanned by $\{\hat{f}(S)\phi_S\,|\,S\in\binom{[n]}{i}\}$ with the following two properties:

1. For any $T'\in\binom{[n]}{i-1}$, $\{\hat{f}(S)\,|\,S\in\binom{[n]}{i}\}$ satisfies $\sum_{j\notin T'}\hat{f}(T'\cup j)=0$ (neglect this property for $V_0$).

2. For any $T\in\binom{[n]}{r}$, $\hat{f}(T)=\sum_{S\in\binom{T}{i}}\hat{f}(S)$.

It is straightforward to verify that the dimension of $V_i$ is $\binom{n}{i}-\binom{n}{i-1}$ and that $V_i$ is an eigenspace of $M$. Notice that the homogeneous degree-$i$ polynomial $\sum_{S\in\binom{[n]}{i}}\hat{f}(S)\phi_S$ is an eigenvector of matrices in $J_i$.
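Fact 9.1.8 can be checked directly on a small example ($n=6$, $r=2$, $i=1$); the entries $g(0),g(1),g(2)$ of the set-symmetric matrix below are arbitrary choices of ours:

```python
from itertools import combinations

n, r = 6, 2
subsets_r = list(combinations(range(n), r))

# An arbitrary set-symmetric matrix: M(S, T) depends only on |S intersect T|.
g = {0: 1.0, 1: 2.0, 2: 5.0}
def M(S, T):
    return g[len(set(S) & set(T))]

# Build f in V_1: pick hat{f} on singletons summing to zero (property 1),
# then set hat{f}(T) = sum of hat{f}({j}) over j in T (property 2).
f_hat1 = {0: 1.0, 1: -1.0}                # hat{f}({0}) = 1, hat{f}({1}) = -1, rest 0
f = {T: sum(f_hat1.get(j, 0.0) for j in T) for T in subsets_r}

Mf = {S: sum(M(S, T) * f[T] for T in subsets_r) for S in subsets_r}

# f should be an eigenvector of every matrix in the Johnson scheme.
lam = next(Mf[S] / f[S] for S in subsets_r if abs(f[S]) > 1e-12)
assert all(abs(Mf[S] - lam * f[S]) < 1e-9 for S in subsets_r)
print("eigenvalue:", lam)
```

Changing $g$ changes the eigenvalue but not the eigenvector, illustrating that the eigenspaces depend only on membership in the scheme.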

To show the orthogonality between $V_i$ and $V_j$, it is enough to prove the following claim.

Claim 9.1.9. For any $j\le r$ and any $S\in\binom{[n]}{\le j-1}$, $\sum_{T\in\binom{[n]}{j}:\,S\subset T}\hat{f}(T)=0$ for any $f\in V_j$.

Proof. We use downward induction on the size of $S$.

Base case $|S|=j-1$: from the definition of $f$, $\sum_{T:S\subset T}\hat{f}(T)=0$.

Suppose $\sum_{T:S\subset T}\hat{f}(T)=0$ for any $S\in\binom{[n]}{k+1}$. We prove it is true for any $S\in\binom{[n]}{k}$:
$$\sum_{T:S\subset T}\hat{f}(T)=\frac{1}{j-|S|}\sum_{i\notin S}\sum_{T:(S\cup i)\subset T}\hat{f}(T)=0.$$

9.2 Eigenspaces and Eigenvalues of $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$

In this section we analyze the eigenvalues and eigenspaces of $A$ and $B$, following the approach of Grigoriev [Gri01b]. We fix any $p\in(0,1)$ with the global cardinality constraint $\sum_i x_i=(1-2p)n$ and use the $p$-biased Fourier transform in this section, i.e., the basis $\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$. Because $\chi_S$ coincides with $\phi_S$ for $p=1/2$, it is enough to study the eigenspaces of $A$ and $B$ in $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$. Since $A$ can be divided into $(d+1)\times(d+1)$ submatrices whose diagonal blocks have eigenspaces known from the Johnson scheme, we study the eigenspaces of $A$ through the global cardinality constraint $\sum_i\phi_i=0$ and the relations between the eigenspaces of these diagonal matrices characterized in Section 9.1.3. We focus on the analysis of $A$ for most of this section and discuss $B$ at the end.

We first exhibit the eigenspace $V'_{\mathrm{null}}$ of $A$ with eigenvalue $0$, i.e., the null space of $A$. Because $\sum_i x_i=(1-2p)n$ on the support of $D_p$, we have $\sum_i\phi_i(x)=0$ for any $x$ in the support of $D_p$. Thus $(\sum_i\phi_i)h=0$ for every polynomial $h$ of degree at most $d-1$, which gives the linear subspace $\mathrm{span}\{(\sum_i\phi_i)\phi_S\,|\,S\in\binom{[n]}{\le d-1}\}$. This linear space is the eigenspace of $A$ with eigenvalue $0$, and its dimension is $\binom{n}{\le d-1}=\binom{n}{d-1}+\binom{n}{d-2}+\cdots+\binom{n}{0}$. For the same reason, $V'_{\mathrm{null}}$ is the eigenspace of $B$ with eigenvalue $0$.

Let $V_d$ be the largest eigenspace of $A_{d,d}$ on $\binom{[n]}{d}\times\binom{[n]}{d}$. We demonstrate how to find an eigenspace of $A$ based on $V_d$. By the definition of $V_d$, any $f_d\in V_d$ satisfies $\sum_{j\notin T}\hat{f}_d(T\cup j)=0$ for any $T\in\binom{[n]}{d-1}$, by the property of the Johnson scheme. Thus, from Claim 9.1.9 and the fact that $A(S,T)$ only depends on $|S\cap T|$ given $S\in\binom{[n]}{i}$ and $T\in\binom{[n]}{d}$, we know $A_{i,d}f_d=\vec{0}$ for all $i\le d-1$. We construct an eigenvector $\vec{f}$ of $A$ from $f_d$ as follows: $\hat{f}(S)=0$ for all $S\in\binom{[n]}{<d}$ and $\hat{f}(S)=\hat{f}_d(S)$ for all $S\in\binom{[n]}{d}$.

Then we move to $V_{d-1}$ in $A_{d,d}$ and illustrate how to use an eigenvector in $V_{d-1}$ to construct an eigenvector of $A$. For any $f_d\in V_{d-1}$, let $f_{d-1}=\sum_{S\in\binom{[n]}{d-1}}\hat{f}_{d-1}(S)\phi_S$ be the homogeneous degree-$(d-1)$ polynomial such that $f_d=\sum_{T\in\binom{[n]}{d}}\big(\sum_{S\in\binom{T}{d-1}}\hat{f}_{d-1}(S)\big)\phi_T$. From Claim 9.1.9, $A_{i,d}f_d=0$ for all $i<d-1$ and $A_{i,d-1}f_{d-1}=0$ for all $i<d-2$. Observe that $f_{d-1}$ is an eigenvector of $A_{d-1,d-1}$, although the eigenvalue of $f_{d-1}$ in $A_{d-1,d-1}$ may differ from the eigenvalue of $f_d$ in $A_{d,d}$. At the same time, from the symmetry of $A$ and the relationship between $f_d$ and $f_{d-1}$, we have $A_{d-1,d}f_d=\beta_0f_{d-1}$ and $A_{d,d-1}f_{d-1}=\beta_1f_d$ for some constants $\beta_0$ and $\beta_1$ depending only on $\delta$ and $d$. Thus we can find a constant $\alpha_{d-1,d}$ such that $(0,f_{d-1},\alpha_{d-1,d}f_d)$ becomes an eigenvector of $A$.

More directly, we determine the constant $\alpha_{d-1,d}$ from the orthogonality between $(0,f_{d-1},\alpha_{d-1,d}\cdot f_d)$ and the null space $\mathrm{span}\{(\sum_i\phi_i)\phi_S\,|\,S\in\binom{[n]}{\le d-1}\}$. We pick any $T\in\binom{[n]}{d-1}$ and rewrite $(\sum_i\phi_i)\phi_T=\sum_{j\in T}\phi_{T\setminus j}+q\cdot|T|\cdot\phi_T+\sum_{j\notin T}\phi_{T\cup j}$. From the orthogonality,
$$q|T|\cdot\hat{f}_{d-1}(T)+\alpha_{d-1,d}\sum_{j\notin T}\sum_{T'\in\binom{T\cup j}{d-1}}\hat{f}_{d-1}(T')=0$$
$$\Rightarrow\quad\big(q|T|+(n-|T|)\alpha_{d-1,d}\big)\hat{f}_{d-1}(T)+\alpha_{d-1,d}\sum_{T''\in\binom{T}{d-2}}\sum_{j\notin T}\hat{f}_{d-1}(T''\cup j)=0.$$
From the property of $f_{d-1}$ that $\sum_{j\notin T''}\hat{f}_{d-1}(T''\cup j)=0$ for all $T''\in\binom{[n]}{d-2}$, we simplify this to
$$\big(q|T|+(n-2|T|)\alpha_{d-1,d}\big)\hat{f}_{d-1}(T)=0,$$

which determines $\alpha_{d-1,d}=\frac{-(d-1)q}{n-2d+2}$ directly.

Following this approach, we obtain all eigenspaces of $A$ from the eigenspaces $V_0,V_1,\ldots,V_d$ of $A_{d,d}$. For convenience, we use $V'_k$ for $k\le d$ to denote the $k$-th eigenspace of $A$ extended from $V_k$ in $A_{d,d}$. We first choose the coefficients in the combination. Let $\alpha_{k,i}=0$ for all $i<k$, $\alpha_{k,k}=1$, and let $\alpha_{k,k+1},\ldots,\alpha_{k,d}$ satisfy the recurrence relation (we justify this choice of $\alpha$ below):
$$i\cdot\alpha_{k,k+i-1}+(k+i)\cdot q\cdot\alpha_{k,k+i}+(n-2k-i)\alpha_{k,k+i+1}=0.\tag{9.2}$$

Then for every $f\in V'_k$, the coefficients $\hat{f}(T)$ over all $T\in\binom{[n]}{\le d}$, spanned by $\{\hat{f}(S)\,|\,S\in\binom{[n]}{k}\}$, satisfy the following three properties:

1. $\forall T\in\binom{[n]}{k-1}$, $\sum_{j\notin T}\hat{f}(T\cup j)=0$ (neglect this property for $V'_0$);

2. $\forall T\in\binom{[n]}{>k}$, $\hat{f}(T)=\alpha_{k,|T|}\cdot\sum_{S\in\binom{T}{k}}\hat{f}(S)$;

3. $\forall T\in\binom{[n]}{<k}$, $\hat{f}(T)=0$.

Now we derive the recurrence relation for $\alpha_{k,k+i}$ from the fact that $f$ is orthogonal to the null space of $A$. We consider $(\sum_i\phi_i)\phi_T$ in the null space for a subset $T$ of size $k+i<d$ and simplify $(\sum_i\phi_i)\phi_T$ to $\sum_{j\in T}\phi_{T\setminus j}+q\cdot|T|\cdot\phi_T+\sum_{j\notin T}\phi_{T\cup j}$. We have
$$\alpha_{k,k+i-1}\sum_{j\in T}\sum_{S\in\binom{T\setminus j}{k}}\hat{f}(S)+(k+i)\cdot q\cdot\alpha_{k,k+i}\sum_{S\in\binom{T}{k}}\hat{f}(S)+\alpha_{k,k+i+1}\sum_{j\notin T}\sum_{S\in\binom{T\cup j}{k}}\hat{f}(S)=0$$
$$\Rightarrow\quad\big(i\cdot\alpha_{k,k+i-1}+(k+i)q\cdot\alpha_{k,k+i}+(n-k-i)\alpha_{k,k+i+1}\big)\sum_{S\in\binom{T}{k}}\hat{f}(S)+\alpha_{k,k+i+1}\sum_{T'\in\binom{T}{k-1}}\sum_{j\notin T}\hat{f}(T'\cup j)=0.$$

Using the first property, $\forall T'\in\binom{[n]}{k-1}$, $\sum_{j\notin T'}\hat{f}(T'\cup j)=0$, to eliminate all $S$ not in $T$, we obtain
$$\big(i\cdot\alpha_{k,k+i-1}+(k+i)\cdot q\cdot\alpha_{k,k+i}+(n-2k-i)\alpha_{k,k+i+1}\big)\sum_{S\in\binom{T}{k}}\hat{f}(S)=0.$$
Because $\sum_{S\in\binom{T}{k}}\hat{f}(S)$ is not necessarily $0$ under the first property (in fact, $\sum_{S\in\binom{T}{k}}\hat{f}(S)=0$ for all $T\in\binom{[n]}{k+i}$ would imply $\hat{f}(S)=0$ for all $S\in\binom{[n]}{k}$), the coefficient must be $0$, which gives the recurrence relation (9.2).

The dimension of $V'_k$ is $\binom{n}{k}-\binom{n}{k-1}$ from the first property. (It is straightforward to verify $\sum_{k=0}^d\dim(V'_k)+\dim(V'_{\mathrm{null}})=\sum_{k=0}^d\big(\binom{n}{k}-\binom{n}{k-1}\big)+\binom{n}{d-1}+\binom{n}{d-2}+\cdots+\binom{n}{0}=\binom{n}{\le d}$.) The orthogonality between $V'_i$ and $V'_j$ follows from Claim 9.1.9 and the orthogonality of $V_i$ and $V_j$.

Remark 9.2.1. $V'_1,\ldots,V'_d$ are the non-zero eigenspaces of $B$, except for $V'_0$. For $f\in V'_0$, observe that $\hat{f}(T)$ only depends on the size of $T$ and $\hat{f}(\emptyset)$. Hence any polynomial $f\in V'_0$ is a constant function over the support of $D_p$, i.e., $\mathrm{Var}_{D_p}(V'_0)=0$. Therefore $V'_0$ is in the null space of $B$.

We use induction on $i$ to bound $\alpha_{k,k+i}$. From $\alpha_{k,k}=1$ and the recurrence relation (9.2), the first few terms are $\alpha_{k,k+1}=-\frac{kq}{n-2k}$ and $\alpha_{k,k+2}=-\frac{1+(k+1)q\cdot\alpha_{k,k+1}}{n-2k-1}=-\frac{1}{n-2k-1}+O(n^{-2})$.

Claim 9.2.2. $\alpha_{k,k+2i}=(-1)^i\frac{(2i-1)!!}{n^i}+O(n^{-i-1})$ and $\alpha_{k,k+2i+1}=(-1)^{i+1}O(n^{-i-1})$.

Proof. We use induction on $i$ again. Base case: $\alpha_{k,k}=1$ and $\alpha_{k,k+1}=-\frac{kq}{n-2k}$.

From the induction hypothesis, $\alpha_{k,k+2i-2}=(-1)^{i-1}\Theta(n^{-i+1})$ and $\alpha_{k,k+2i-1}=(-1)^iO(n^{-i})$, so the major term of $\alpha_{k,k+2i}$ is determined by $\alpha_{k,k+2i-2}$, which gives $\alpha_{k,k+2i}=(-1)^i\frac{(2i-1)!!}{n^i}+O(n^{-i-1})$. Similarly, $\alpha_{k,k+2i+1}=(-1)^{i+1}O(n^{-i-1})$.

Now we bound the eigenvalue of $V'_k$. For convenience, we adopt the conventions $0!=1$ and $(-1)!!=1$.

Theorem 9.2.3. For any $k\in\{0,\ldots,d\}$, the eigenvalue of $V'_k$ in $A$ is $\sum_{\text{even }i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!}\pm O(n^{-1})$. For any $k\in\{1,\ldots,d\}$, the eigenvalue of $V'_k$ in $B$ is the same $\sum_{\text{even }i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!}\pm O(n^{-1})$.

Proof. We fix a polynomial $f\in V'_k$ and $S\in\binom{[n]}{k}$, and calculate $\sum_{T\in\binom{[n]}{\le d}}A(S,T)\hat{f}(T)$ to obtain the eigenvalue of $V'_k$ in $A$. From the fact $\hat{f}(T)=\alpha_{k,|T|}\cdot\sum_{S'\in\binom{T}{k}}\hat{f}(S')$, we expand $\sum_TA(S,T)\hat{f}(T)$ into a summation of $\hat{f}(S')$ over all $S'\in\binom{[n]}{k}$ with coefficients. From the symmetry of $A$, the coefficient of $\hat{f}(S')$ in the expansion only depends on $|S\cap S'|$ (the sizes of $S$ and $S'$ are both $k$). Hence we use $\tau_i$ to denote the coefficient of $\hat{f}(S')$ given $|S'\Delta S|=i$. Thus $\sum_TA(S,T)\hat{f}(T)=\sum_{S'\in\binom{[n]}{k}}\tau_{|S'\Delta S|}\hat{f}(S')$.

We calculate $\tau_0,\ldots,\tau_{2d}$ as follows. Because $|S'|=|S|=k$, $|S\Delta S'|$ is always even. For $\tau_0$, we only consider $T$ containing $S$ and use $k+i$ to denote the size of $T$:
$$\tau_0=\sum_{i=0}^{d-k}\binom{n-k}{i}\cdot\alpha_{k,k+i}\cdot\delta_i.\tag{9.3}$$

For $\tau_{2l}$, we fix a subset $S'$ with $|S\Delta S'|=2l$ and only consider $T$ containing $S'$. We use $k+i$ to denote the size of $T$ and $t$ to denote the size of the intersection of $T$ and $S\setminus S'$:
$$\tau_{2l}=\sum_{i=0}^{d-k}\alpha_{k,k+i}\sum_{t=0}^{i}\binom{l}{t}\binom{n-k-2l}{i-t}\delta_{2l+i-2t}.\tag{9.4}$$

We will prove that $\tau_0=\Theta(1)$ and $\tau_{2l}=O(n^{-l})$ for all $l\ge1$, and then eliminate all $S'\ne S$ in $\sum_TA(S,T)\hat{f}(T)=\sum_{S'\in\binom{[n]}{k}}\tau_{|S'\Delta S|}\hat{f}(S')$ to obtain the eigenvalue of $V'_k$.

From Claim 9.1.5 and Claim 9.2.2, we separate the summation $\tau_0=\sum_{i=0}^{d-k}\binom{n-k}{i}\cdot\alpha_{k,k+i}\cdot\delta_i$ into $\sum_{\text{even }i}\binom{n-k}{i}\alpha_{k,k+i}\delta_i+\sum_{\text{odd }i}\binom{n-k}{i}\alpha_{k,k+i}\delta_i$. We replace $\delta_i$ and $\alpha_{k,k+i}$ by the bounds in Claim 9.1.5 and Claim 9.2.2:
$$\sum_{\text{even }i}\binom{n-k}{i}(-1)^{\frac{i}{2}+\frac{i}{2}}\Big(\frac{(i-1)!!}{n^{i/2}}+O(n^{-\frac{i}{2}-1})\Big)\Big(\frac{(i-1)!!}{n^{i/2}}+O(n^{-\frac{i}{2}-1})\Big)+\sum_{\text{odd }i}\binom{n-k}{i}(-1)^{\frac{i+1}{2}+\frac{i+1}{2}}\cdot O(n^{-\frac{i+1}{2}})\cdot O(n^{-\frac{i+1}{2}}).$$
This shows $\tau_0=\sum_{\text{even }i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!}+O(n^{-1})$. For $\tau_{2l}$, we bound it by $O(n^{-l})$ through a similar method, where the $O(n^{-l})$ comes from the facts that $\alpha_{k,k+i}=O(n^{-\frac{i}{2}})$, $\binom{n-k-2l}{i-t}<n^{i-t}$, and $\delta_{2l+i-2t}=O(n^{-\frac{2l+i-2t}{2}})$.

Finally, we show that the eigenvalue of $V'_k$ is $O(1/n)$-close to $\tau_0$, which is enough to finish the proof. From the fact that for any $T'\in\binom{[n]}{k-1}$, $\sum_{j\notin T'}\hat{f}(T'\cup j)=0$, we have (recall that $|S|=k$)
$$(k-i)\Big(\sum_{S_0\in\binom{S}{i}}\sum_{S_1\in\binom{[n]\setminus S}{k-i}}\hat{f}(S_0\cup S_1)\Big)+(i+1)\Big(\sum_{S_0\in\binom{S}{i+1}}\sum_{S_1\in\binom{[n]\setminus S}{k-i-1}}\hat{f}(S_0\cup S_1)\Big)=\sum_{S_0\in\binom{S}{i}}\sum_{S_1\in\binom{[n]\setminus S}{k-i-1}}\sum_{j\notin S_0\cup S_1}\hat{f}(S_0\cup S_1\cup j)=0.$$
Thus we apply this on $\sum_i\tau_{2i}\big(\sum_{S'\in\binom{[n]}{k}:|S\cap S'|=k-i}\hat{f}(S')\big)$ to remove all $S'\ne S$. Let $\tau'_{2k}=\tau_{2k}$ and
$$\tau'_{2k-2i-2}=\tau_{2k-2i-2}-\frac{i+1}{k-i}\cdot\tau'_{2k-2i}.$$
Using the above rule, it is straightforward to verify
$$\sum_{i=j}^{k}\tau_{2i}\Big(\sum_{S'\in\binom{[n]}{k}:|S\cap S'|=k-i}\hat{f}(S')\Big)=\tau'_{2j}\Big(\sum_{S'\in\binom{[n]}{k}:|S\cap S'|=k-j}\hat{f}(S')\Big)$$
from $j=k$ down to $j=0$ by induction. Therefore $\sum_i\tau_{2i}\big(\sum_{S'\in\binom{[n]}{k}:|S\cap S'|=k-i}\hat{f}(S')\big)=\tau'_0\hat{f}(S)$. Because $\tau_{2i}=O(n^{-i})$, we have $\tau'_0=\tau_0\pm O(1/n)$. (Remark: actually, $\tau'_0=\sum_{i=0}^{k}\tau_{2i}\cdot(-1)^i\binom{k}{i}$.)

From the discussion above, the eigenvalue of $V'_k$ in $A$ is $\tau'_0$, which is $\sum_{\text{even }i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!}\pm O(n^{-1})$.

For $V'_k$ in $B$ with $k\ge1$, observe that the difference $\delta_S\cdot\delta_T$ between $A(S,T)$ and $B(S,T)$ does not change the calculation of $\tau$, because $\sum_{T\in\binom{[n]}{i}}\delta_S\delta_T\hat{f}(T)=\delta_S\delta_i\sum_{T\in\binom{[n]}{i}}\hat{f}(T)=0$ from the fact that $f$ is orthogonal to $V'_0$.

Because $\frac{(i-1)!!\,(i-1)!!}{i!}\le1$ for any even integer $i\ge0$, we have the following two corollaries.

Corollary 9.2.4. All non-zero eigenvalues of $\mathbb{E}_{D_p}[f^2]$ in the linear space $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{\le d}\}$ are between $0.5$ and $\lfloor\frac{d}{2}\rfloor+1\le d$.

Corollary 9.2.5. All non-zero eigenvalues of $\mathrm{Var}_{D_p}(f)$ in the linear space $\mathrm{span}\{\phi_S\,|\,S\in\binom{[n]}{1,\ldots,d}\}$ are between $0.5$ and $\lfloor\frac{d+1}{2}\rfloor\le d$.

Because $f+(\sum_i\phi_i)h\equiv f$ over $\mathrm{supp}(D_p)$ for any $h$ of degree at most $d-1$, we define the projection of $f$ onto $V'_{\mathrm{null}}$ to compare $\|f\|_2^2$ and $\mathbb{E}_{D_p}[f^2]$.

Definition 9.2.6. Fix the global cardinality constraint $\sum_i x_i=(1-2p)n$ and the Fourier transform $\phi_S$. Let $h_f$ denote the projection of a degree-$d$ multilinear polynomial $f$ onto the null space $\mathrm{span}\{(\sum_i\phi_i)\phi_S\,|\,S\in\binom{[n]}{\le d-1}\}$ of $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$, i.e., $f-(\sum_i\phi_i)h_f$ is orthogonal to the eigenspace of eigenvalue $0$ of $\mathbb{E}_{D_p}[f^2]$ and $\mathrm{Var}_{D_p}(f)$.

From the above two corollaries and the definition of $h_f$, we bound $\mathbb{E}_{D_p}[f^2]$ in terms of $\|f\|_2^2$ as follows. For $\mathrm{Var}_{D_p}(f)$, we exclude $\hat{f}(\emptyset)$ because $\mathrm{Var}_{D_p}(f)$ is independent of $\hat{f}(\emptyset)$. Recall that $\|f\|_2^2=\mathbb{E}_{U_p}[f^2]=\sum_S\hat{f}(S)^2$.

Corollary 9.2.7. For any degree-$d$ multilinear polynomial $f$ and a global cardinality constraint $\sum_i x_i=(1-2p)n$, $\mathbb{E}_{D_p}[f^2]\le d\|f\|_2^2$ and $\mathbb{E}_{D_p}[f^2]\ge0.5\|f-(\sum_i\phi_i)h_f\|_2^2$.

Corollary 9.2.8. For any degree-$d$ multilinear polynomial $f$ and a global cardinality constraint $\sum_i x_i=(1-2p)n$, $\mathrm{Var}_{D_p}(f)\le d\|f-\hat{f}(\emptyset)\|_2^2$ and $\mathrm{Var}_{D_p}(f)\ge0.5\|f-\hat{f}(\emptyset)-(\sum_i\phi_i)h_{f-\hat{f}(\emptyset)}\|_2^2$.

9.3 Parameterized algorithm for CSPs above average with the bisection constraint

We prove that CSPs above average with the bisection constraint are fixed-parameter tractable. Given an instance $\mathcal{I}$ of a $d$-ary CSP and the bisection constraint $\sum_i x_i = 0$, we use the standard basis $\{\chi_S \mid S \in \binom{[n]}{\leq d}\}$ of the Fourier transform in $U$ and abbreviate $f_{\mathcal{I}}$ to $f$. Recall that $\|f\|_2^2 = \mathbb{E}_U[f^2] = \sum_S \hat{f}(S)^2$, and that $D$ is the uniform distribution on all assignments in $\{\pm 1\}^n$ complying with the bisection constraint.

For $f$ with a small variance in $D$, we use $h_{f-\hat{f}(\emptyset)}$ to denote the projection of $f - \hat{f}(\emptyset)$ onto the null space $\mathrm{span}\{(\sum_i x_i)\chi_S \mid S \in \binom{[n]}{\leq d-1}\}$. We know $\|f - \hat{f}(\emptyset) - (\sum_i x_i)h_{f-\hat{f}(\emptyset)}\|_2^2 \leq 2\,\mathrm{Var}_D(f)$ from Corollary 9.2.8, i.e., from the lower bound on the non-zero eigenvalues of $\mathrm{Var}_D(f)$.

Then we show how to round $h_{f-\hat{f}(\emptyset)}$ in Section 9.3.1 to a degree-$(d-1)$ polynomial $h$ with integral coefficients such that $\|f - \hat{f}(\emptyset) - (\sum_i x_i)h\|_2^2 = O(\|f - \hat{f}(\emptyset) - (\sum_i x_i)h_{f-\hat{f}(\emptyset)}\|_2^2)$, which indicates that $f - \hat{f}(\emptyset) - (\sum_i x_i)h$ has a small kernel under the bisection constraint. Otherwise, for $f$ with a large variance in $D$, we show the hypercontractivity in $D$, namely $\mathbb{E}_D[(f - \mathbb{E}_D[f])^4] = O(\mathbb{E}_D[(f - \mathbb{E}_D[f])^2]^2)$, in Section 9.3.2. From the fourth moment method, we know there exists $\alpha$ in the support of $D$ satisfying $f(\alpha) \geq \mathbb{E}_D[f] + \Omega(\sqrt{\mathrm{Var}_D[f]})$. At last, we prove the main theorem in Section 9.3.3.

Theorem 9.3.1. Given an instance $\mathcal{I}$ of a CSP problem of arity $d$ and a parameter $t$, there is an algorithm with running time $O(n^{3d})$ that either finds a kernel on at most $C_d t^2$ variables or certifies that $OPT \geq AVG + t$ under the bisection constraint, for a constant $C_d = 24d^2 \cdot 7^d \cdot 9^{2d} \cdot 2^{2d} \cdot \big(d!(d-1)!\cdots 2!\big)^2$.

9.3.1 Rounding

In this section, we show that for any polynomial $f$ of degree $d$ with integral coefficients, there exists an efficient algorithm to round $h_f$ into an integral-coefficient polynomial $h$ while keeping $\|f - (\sum_i x_i)h\|_2^2 = O(\|f - (\sum_i x_i)h_f\|_2^2)$.

Theorem 9.3.2. For any constants $\gamma$ and $d$, given a degree-$d$ multilinear polynomial $f$ with $\|f - (\sum_i x_i)h_f\|_2^2 \leq \sqrt{n}$ whose Fourier coefficient $\hat{f}(S)$ is a multiple of $\gamma$ for all $S \in \binom{[n]}{\leq d}$, there exists an efficient algorithm to find a polynomial $h$ of degree at most $d-1$ such that

1. The Fourier coefficients of $h$ are multiples of $\frac{\gamma}{d!(d-1)!\cdots 2!}$, which demonstrates that the Fourier coefficients of $f - (\sum_i x_i)h$ are multiples of $\frac{\gamma}{d!(d-1)!\cdots 2!}$.

2. $\|f - (\sum_i x_i)h\|_2^2 \leq 7^d \cdot \|f - (\sum_i x_i)h_f\|_2^2$.

The high-level idea of the algorithm is to round $\hat{h}_f(S)$ to $\hat{h}(S)$ from the coefficients of weight $d-1$ down to the coefficient of weight $0$. At the same time, we guarantee that for every $k < d$, the rounding of the coefficients of weight $k$ keeps $\|f - (\sum_i x_i)h\|_2^2 = O(\|f - (\sum_i x_i)h_f\|_2^2)$ in the same order.

Because $h_f$ contains non-zero coefficients up to weight $d-1$, we first prove that we can round $\{\hat{h}_f(S) \mid S \in \binom{[n]}{d-1}\}$ to multiples of $\gamma/d!$. Observe that for $T \in \binom{[n]}{d}$, the coefficient of $\chi_T$ in $f - (\sum_i x_i)h_f$ is $\hat{f}(T) - \sum_{j \in T} \hat{h}_f(T \setminus j)$. Because $\sum_{T \in \binom{[n]}{d}} \big(\hat{f}(T) - \sum_{j \in T} \hat{h}_f(T \setminus j)\big)^2 = o(n)$, $\hat{f}(T) - \sum_{j \in T} \hat{h}_f(T \setminus j)$ is close to $0$ for most $T \in \binom{[n]}{d}$. Hence $\sum_{j \in T} \hat{h}_f(T \setminus j) \bmod \gamma$ is close to $0$ for most $T$. Our starting point is to prove from this discussion that for any $S \in \binom{[n]}{d-1}$, $\hat{h}_f(S)$ is close to a multiple of $\gamma/d!$.

Lemma 9.3.3. If $\hat{f}(T)$ is a multiple of $\gamma$ and $\hat{f}(T) - \sum_{S \in \binom{T}{d-1}} \hat{h}_f(S) = 0$ for all $T \in \binom{[n]}{d}$, then $\hat{h}_f(S)$ is a multiple of $\gamma/d!$ for all $S \in \binom{[n]}{d-1}$.

Proof. From the two conditions, we know

$$\sum_{S \in \binom{T}{d-1}} \hat{h}_f(S) \equiv 0 \pmod{\gamma}$$

for any $T \in \binom{[n]}{d}$. We prove that

$$(d-1)! \cdot \hat{h}_f(S_1) + (-1)^d (d-1)! \cdot \hat{h}_f(S_2) \equiv 0 \pmod{\gamma}$$

for any $S_1 \in \binom{[n]}{d-1}$ and $S_2 \in \binom{[n]\setminus S_1}{d-1}$. Thus

$$0 \equiv \sum_{S_2 \in \binom{T}{d-1}} (d-1)! \cdot \hat{h}_f(S_2) \equiv d! \cdot \hat{h}_f(S_1) \pmod{\gamma}$$

for any $T$ with $S_1 \cap T = \emptyset$, which indicates that $\hat{h}_f(S_1)$ is a multiple of $\gamma/d!$.

Without loss of generality, we assume $S_1 = \{1,2,\dots,d-1\}$ and $S_2 = \{k_1,k_2,\dots,k_{d-1}\}$. For a subset $T \in \binom{S_1\cup S_2}{d}$, because $\sum_{S \in \binom{T}{d-1}} \hat{h}_f(S) = \sum_{j\in T} \hat{h}_f(T\setminus j)$, we use $(T)$ to denote the equation

$$\sum_{j\in T} \hat{h}_f(T\setminus j) \equiv 0 \pmod{\gamma}. \qquad (T)$$

Let $\beta_{d-1,1} = (d-2)!$ and $\beta_{d-i-1,i+1} = \frac{-i}{d-i-1}\cdot\beta_{d-i,i}$ for any $i\in\{1,\dots,d-2\}$ (we choose $\beta_{d-1,1}$ to guarantee that all coefficients are integers). Consider the following linear combination of the equations over $T\in\binom{S_1\cup S_2}{d}$ with coefficients $\beta_{d-i,i}$:

$$\sum_{i=1}^{d-1}\;\sum_{T_1\in\binom{S_1}{d-i},\,T_2\in\binom{S_2}{i}} \beta_{d-i,i}\,(T_1\cup T_2) \;\Rightarrow\; \sum_{i=1}^{d-1}\;\sum_{T_1\in\binom{S_1}{d-i},\,T_2\in\binom{S_2}{i}} \beta_{d-i,i} \sum_{j\in T_1\cup T_2} \hat{h}_f(T_1\cup T_2\setminus j) \equiv 0 \pmod{\gamma}. \qquad (9.5)$$

Observe that for any $i\in\{1,\dots,d-2\}$, $S\in\binom{S_1}{d-i-1}$, and $S'\in\binom{S_2}{i}$, the coefficient of $\hat{h}_f(S\cup S')$ in equation (9.5) is $i\cdot\beta_{d-i,i} + (d-i-1)\cdot\beta_{d-i-1,i+1} = 0$, where $i$ is the number of choices of $T_1$ (namely $d-1-|S| = i$) and $d-i-1$ is the number of choices of $T_2$ (namely $(d-1)-|S'|$).

Hence equation (9.5) indicates that $(d-1)\beta_{d-1,1}\hat{h}_f(S_1) + (d-1)\beta_{1,d-1}\hat{h}_f(S_2) \equiv 0 \pmod{\gamma}$. Plugging in $\beta_{d-1,1} = (d-2)!$ and $\beta_{1,d-1} = (-1)^{d-2}(d-2)!$, we obtain

$$(d-1)!\cdot\hat{h}_f(S_1) + (-1)^d(d-1)!\cdot\hat{h}_f(S_2) \equiv 0 \pmod{\gamma}.$$

Corollary 9.3.4. If $\sum_{T\in\binom{[n]}{d}} \big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\big)^2 = k = o(n^{0.6})$, then for all $S\in\binom{[n]}{d-1}$, $\hat{h}_f(S)$ is $\frac{0.1}{d}\cdot\gamma/d!$-close to a multiple of $\gamma/d!$.

Proof. From the condition, we know that except for at most $n^{0.8}$ choices of $T\in\binom{[n]}{d}$, $\sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)$ is $n^{-0.1}$-close to a multiple of $\gamma$, because $n^{0.8}\cdot(n^{-0.1})^2 > k$. Observe that the above proof depends only on the Fourier coefficients on the at most $2d+1$ variables of $S_1\cup T$. Because $n^{0.8} = o(n)$, for any subset $S_1\in\binom{[n]}{d-1}$ there is a subset $T\in\binom{[n]\setminus S_1}{d}$ such that for any $T'\in\binom{S_1\cup T}{d}$, $\sum_{S\in\binom{T'}{d-1}}\hat{h}_f(S)$ is $n^{-0.1}$-close to a multiple of $\gamma$.

Following the proof of Lemma 9.3.3, we obtain that $\hat{h}_f(S)$ is $\frac{(2d)!(d!)^2}{n^{0.1}} < \frac{0.1}{d}\cdot\gamma/d!$-close to a multiple of $\gamma/d!$ for any $S\in\binom{[n]}{d-1}$.

We consider a natural method to round $h_f$: round $\hat{h}_f(S)$ to the closest multiple of $\gamma/d!$ for every $S\in\binom{[n]}{d-1}$.

Claim 9.3.5. Let $h_{d-1}$ be the rounded polynomial of $h_f$ such that $\hat{h}_{d-1}(S) = \hat{h}_f(S)$ for any $|S|\neq d-1$ and $\hat{h}_{d-1}(S)$ is the closest multiple of $\gamma/d!$ to $\hat{h}_f(S)$ for any $S\in\binom{[n]}{d-1}$. Let $\varepsilon(S) = \hat{h}_{d-1}(S) - \hat{h}_f(S)$.

If $|\varepsilon(S)| < \frac{0.1}{d}\cdot\gamma/d!$ and $\alpha(T)$ is a multiple of $\gamma$ for every $T$, then

$$\sum_{T\in\binom{[n]}{d}}\Big(\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\Big)^2 \;\leq\; \sum_{T\in\binom{[n]}{d}}\Big(\alpha(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2.$$

Proof. For each $T\in\binom{[n]}{d}$, because $\sum_{S\in\binom{T}{d-1}}|\varepsilon(S)| < 0.1\cdot\gamma/d!$, we have $\big|\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\big| < \big|\alpha(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\big|$. Hence $\sum_{T\in\binom{[n]}{d}}\big(\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\big)^2 \leq \sum_{T\in\binom{[n]}{d}}\big(\alpha(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\big)^2$.

From now on, we use $h_{d-1}$ to denote the degree-$(d-1)$ polynomial obtained from $h_f$ by the above rounding process on the Fourier coefficients of weight $d-1$. Now we bound the sum of the squares of the Fourier coefficients of $f - (\sum_i x_i)h_{d-1}$, i.e., $\|f - (\sum_i x_i)h_{d-1}\|_2^2$. Observe that rounding $\hat{h}_f(S)$ only affects the terms of $T\in\binom{[n]}{d}$ containing $S$ and of $T'\in\binom{[n]}{d-2}$ inside $S$, because $(\sum_i x_i)\hat{h}_f(S)\chi_S = \sum_{i\in S}\hat{h}_f(S)\chi_{S\setminus i} + \sum_{i\notin S}\hat{h}_f(S)\chi_{S\cup i}$.

Lemma 9.3.6. $\|f - (\sum_i x_i)h_{d-1}\|_2^2 \leq 7\|f - (\sum_i x_i)h_f\|_2^2$.

Proof. Let $\varepsilon(S) = \hat{h}_{d-1}(S) - \hat{h}_f(S)$. It suffices to prove

$$\sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S) - \sum_{S\in\binom{T}{d-1}}\varepsilon(S)\Big)^2 \leq 4\sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2 \qquad (9.6)$$

and

$$\sum_{T'\in\binom{[n]}{d-2}}\Big(\hat{f}(T') - \sum_{S\in\binom{T'}{d-3}}\hat{h}_f(S) - \sum_{j\notin T'}\hat{h}_f(T'\cup\{j\}) - \sum_{j\notin T'}\varepsilon(T'\cup\{j\})\Big)^2 \leq 2\,\|f - (\textstyle\sum_i x_i)h_f\|_2^2. \qquad (9.7)$$

Inequality (9.6) follows from the fact that

$$\sum_{T\in\binom{[n]}{d}}\Big(\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\Big)^2 \leq \sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2$$

by Claim 9.3.5; from the inequality of arithmetic and geometric means, the cross terms satisfy

$$2\sum_{T\in\binom{[n]}{d}}\Big|\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big|\cdot\Big|\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\Big| \leq 2\sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2.$$

For (9.7), observe that

$$\sum_{T'\in\binom{[n]}{d-2}}\Big(\sum_{j\notin T'}\varepsilon(T'\cup\{j\})\Big)^2 = (d-1)\sum_{S\in\binom{[n]}{d-1}}\varepsilon(S)^2 + \sum_{S,S':|S\cap S'|=d-2} 2\,\varepsilon(S)\varepsilon(S') \leq \sum_{T\in\binom{[n]}{d}}\Big(\sum_{S\in\binom{T}{d-1}}\varepsilon(S)\Big)^2 \leq \sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2.$$

Hence we have

$$\sum_{T'\in\binom{[n]}{d-2}}\Big(\hat{f}(T') - \sum_{S\in\binom{T'}{d-3}}\hat{h}_f(S) - \sum_{j\notin T'}\hat{h}_f(T'\cup\{j\})\Big)^2 + \sum_{T'\in\binom{[n]}{d-2}}\Big(\sum_{j\notin T'}\varepsilon(T'\cup\{j\})\Big)^2$$
$$\leq \sum_{T'\in\binom{[n]}{d-2}}\Big(\hat{f}(T') - \sum_{S\in\binom{T'}{d-3}}\hat{h}_f(S) - \sum_{j\notin T'}\hat{h}_f(T'\cup\{j\})\Big)^2 + \sum_{T\in\binom{[n]}{d}}\Big(\hat{f}(T) - \sum_{S\in\binom{T}{d-1}}\hat{h}_f(S)\Big)^2 \leq \|f - (\textstyle\sum_i x_i)h_f\|_2^2.$$

We use the inequality of arithmetic and geometric means again to obtain inequality (9.7).

Proof of Theorem 9.3.2. We apply Claim 9.3.5 and Lemma 9.3.6 $d$ times on the Fourier coefficients of $h_f$, from $\{\hat{h}_f(S)\mid S\in\binom{[n]}{d-1}\}$, $\{\hat{h}_f(S)\mid S\in\binom{[n]}{d-2}\}$, $\dots$, down to $\{\hat{h}_f(S)\mid S\in\binom{[n]}{0}\}$, choosing $\gamma$ properly. More specifically, let $h_i$ be the polynomial after rounding the coefficients of weight $\geq i$, with $h_d = h_f$. Each time, we use Claim 9.3.5 to round the coefficients $\{\hat{h}_i(S)\mid S\in\binom{[n]}{i}\}$ of $h_{i+1}$, for $i = d-1,\dots,0$. We use a different parameter $\gamma$ in each round: $\gamma$ in the rounding of $h_{d-1}$, $\gamma/d!$ in the rounding of $h_{d-2}$, $\frac{\gamma}{d!\cdot(d-1)!}$ in the rounding of $h_{d-3}$, and so on. After $d$ rounds, all coefficients of $h_0$ are multiples of $\frac{\gamma}{d!(d-1)!(d-2)!\cdots 2!}$.

Because $\|f - (\sum_i x_i)h_i\|_2^2 \leq 7\|f - (\sum_i x_i)h_{i+1}\|_2^2$ by Lemma 9.3.6, eventually $\|f - (\sum_i x_i)h_0\|_2^2 \leq 7^d\cdot\|f - (\sum_i x_i)h_f\|_2^2$.
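The weight-by-weight rounding in this proof can be sketched as follows. This is a minimal illustration of the schedule of granularities only (our own encoding of coefficients as a dictionary; the bookkeeping that relates $h$ back to $f$ and the error analysis are omitted):

```python
import math

def round_coefficients(h_coeffs, d, gamma):
    """Round the coefficients of h_f weight by weight, from weight d-1
    down to weight 0.  In the round for weight w, every weight-w
    coefficient moves to the nearest multiple of the current granularity,
    which shrinks by a factor of (w+1)! per round and ends at
    gamma / (d!(d-1)!...2!), matching property 1 of Theorem 9.3.2."""
    h = dict(h_coeffs)                    # {frozenset(S): coefficient}
    unit = gamma
    for w in range(d - 1, -1, -1):        # weights d-1, d-2, ..., 0
        unit /= math.factorial(w + 1)     # gamma/d!, gamma/(d!(d-1)!), ...
        for S in h:
            if len(S) == w:
                h[S] = round(h[S] / unit) * unit
    return h
```

For example, with $d = 2$ and $\gamma = 1$, the weight-$1$ coefficients are rounded to multiples of $1/2!$ and the constant term to multiples of $1/(2!\cdot 1!)$.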

9.3.2 2 → 4 Hypercontractive inequality under distribution D

We prove the 2 → 4 hypercontractivity for a degree d polynomial g in this section.

Theorem 9.3.7. For any degree-at-most-$d$ multilinear polynomial $g$, $\mathbb{E}_D[g^4] \leq 3d\cdot 9^{2d}\cdot\|g\|_2^4$.

Recall that $\|g\|_2 = \mathbb{E}_U[g^2]^{1/2} = (\sum_S\hat{g}(S)^2)^{1/2}$ and that $g - (\sum_i x_i)h_g \equiv g$ on the support of $D$. Because $\|g - (\sum_i x_i)h_g\|_2^2 \leq 2\,\mathbb{E}_{x\sim D}[g^2]$ from the lower bound on the non-zero eigenvalues of $\mathbb{E}_D[g^2]$ in Corollary 9.2.4, without loss of generality we assume $g$ is orthogonal to the null space $\mathrm{span}\{(\sum_i x_i)\chi_S\mid S\in\binom{[n]}{\leq d-1}\}$.

Corollary 9.3.8. For any degree-at-most-$d$ multilinear polynomial $g$, $\mathbb{E}_D[g^4] \leq 12d\cdot 9^{2d}\cdot\mathbb{E}_D[g^2]^2$.

Before proving the theorem, we observe that uniformly sampling a bisection $(S,\bar{S})$ is the same as first choosing a random perfect matching $M$ of $[n]$ and then independently assigning each pair of $M$ to the two subsets. For convenience, we use $P(M)$ to denote this product distribution given $M$, and $\mathbb{E}_M$ to denote the expectation over a uniformly random perfect matching $M$. Let $M(i)$ denote the vertex matched with $i$ in $M$, and let $M(S) = \{M(i)\mid i\in S\}$. From the $2\to 4$ hypercontractive inequality for the product distribution $P(M)$, we have the following claim:
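This two-step view of a uniform bisection can be sketched directly (a small illustration; drawing the matching by randomly permuting $[n]$ and pairing consecutive elements is our own choice of implementation):

```python
import random

def sample_bisection(n):
    """Sample a uniform bisection of {0,...,n-1} (n even): first draw a
    uniformly random perfect matching M, then send the two endpoints of
    each pair of M to opposite sides with an independent random sign."""
    order = random.sample(range(n), n)         # random order => random matching
    x = [0] * n
    for a, b in zip(order[::2], order[1::2]):  # consecutive elements are matched
        s = random.choice((1, -1))
        x[a], x[b] = s, -s                     # each pair straddles the cut
    return x
```

Every output satisfies $\sum_i x_i = 0$, and by symmetry every bisection is equally likely; conditioned on the matching $M$, the assignment is a product distribution over the pairs, which is exactly $P(M)$.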

Claim 9.3.9. $\mathbb{E}_M\big[\mathbb{E}_{P(M)}[g^4]\big] \leq 9^d\,\mathbb{E}_M\big[\mathbb{E}_{P(M)}[g^2]^2\big]$.

Now we prove the main technical lemma of the $2\to 4$ hypercontractivity under the bisection constraint to finish the proof.

Lemma 9.3.10. $\mathbb{E}_M\big[\mathbb{E}_{P(M)}[g^2]^2\big] \leq 3d\cdot 9^d\cdot\|g\|_2^4$.

Theorem 9.3.7 follows from Claim 9.3.9 and Lemma 9.3.10. Now we proceed to the proof of Lemma 9.3.10.

Proof of Lemma 9.3.10. Using $g(x) = \sum_{S\in\binom{[n]}{\leq d}}\hat{g}(S)\chi_S$, we rewrite $\mathbb{E}_M\big[\mathbb{E}_{P(M)}[g^2]^2\big]$ as

$$\mathbb{E}_M\Big[\mathbb{E}_{P(M)}\Big[\Big(\sum_{S\in\binom{[n]}{\leq d}}\hat{g}(S)\chi_S\Big)^2\Big]^2\Big] = \mathbb{E}_M\Big[\Big(\sum_{S\in\binom{[n]}{\leq d}}\hat{g}(S)^2 + \sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\,\mathbb{E}_{P(M)}[\chi_{S\triangle S'}]\Big)^2\Big].$$

Notice that $\mathbb{E}_{P(M)}[\chi_{S\triangle S'}] = (-1)^{|S\triangle S'|/2}$ if $M(S\triangle S') = S\triangle S'$, and $0$ otherwise. We expand the square to

$$\mathbb{E}_M\Big[\Big(\|g\|_2^2 + \sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\cdot\mathbf{1}_{S\triangle S' = M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}\Big)^2\Big]$$
$$= \|g\|_2^4 + 2\|g\|_2^2\cdot\mathbb{E}_M\Big[\sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\cdot\mathbf{1}_{S\triangle S' = M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}\Big] + \mathbb{E}_M\Big[\Big(\sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\cdot\mathbf{1}_{S\triangle S' = M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}\Big)^2\Big],$$

where the sums range over $S, S'\in\binom{[n]}{\leq d}$ with $S\neq S'$.

We first bound the expectation of $\sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\cdot\mathbf{1}_{S\triangle S'=M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}$ over the uniform distribution on all perfect matchings, and then bound the expectation of its square. Observe that for a subset $U\subseteq[n]$ of even size, $\mathbb{E}_M[\mathbf{1}_{U=M(U)}] = \frac{(|U|-1)(|U|-3)\cdots 1}{(n-1)(n-3)\cdots(n-|U|+1)}$, so that $\mathbb{E}_M[\mathbf{1}_{U=M(U)}\cdot(-1)^{|U|/2}] = \delta_{|U|}$, i.e., the expectation $\mathbb{E}_D[\chi_U]$ of $\chi_U$ in $D$. Hence

$$\mathbb{E}_M\Big[\sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\,\mathbf{1}_{S\triangle S'=M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}\Big] \leq \sum_{S,S'}\hat{g}(S)\hat{g}(S')\cdot\delta_{S\triangle S'} = \mathbb{E}_D[g^2].$$

From Corollary 9.2.4, the largest non-zero eigenvalue of the matrix with entries $\delta_{S\triangle S'}$ is at most $d$. Thus this expectation is upper bounded by $d\cdot\|g\|_2^2$.

We define $g'$ to be the degree-$2d$ polynomial $\sum_{T\in\binom{[n]}{\leq 2d}}\hat{g}'(T)\chi_T$ with coefficients

$$\hat{g}'(T) = \sum_{S,S'\in\binom{[n]}{\leq d}:\,S\triangle S'=T}\hat{g}(S)\hat{g}(S')$$

for all $T\in\binom{[n]}{\leq 2d}$. Hence we rewrite

$$\mathbb{E}_M\Big[\Big(\sum_{S\neq S'}\hat{g}(S)\hat{g}(S')\cdot\mathbf{1}_{S\triangle S'=M(S\triangle S')}\cdot(-1)^{|S\triangle S'|/2}\Big)^2\Big] = \mathbb{E}_M\Big[\Big(\sum_{T\in\binom{[n]}{\leq 2d}}\hat{g}'(T)\cdot\mathbf{1}_{T=M(T)}(-1)^{|T|/2}\Big)^2\Big]$$
$$= \mathbb{E}_M\Big[\sum_{T,T'}\hat{g}'(T)\hat{g}'(T')\cdot\mathbf{1}_{T=M(T)}\mathbf{1}_{T'=M(T')}(-1)^{|T|/2+|T'|/2}\Big].$$

Intuitively, because $|T|\leq 2d$ and $|T'|\leq 2d$, most pairs $T$ and $T'$ are disjoint, in which case $\mathbb{E}_M[\mathbf{1}_{T=M(T)}\mathbf{1}_{T'=M(T')}] \approx \mathbb{E}_M[\mathbf{1}_{T=M(T)}]\cdot\mathbb{E}_M[\mathbf{1}_{T'=M(T')}]$. The summation is then approximately $\mathbb{E}_M\big[\sum_T\hat{g}'(T)\mathbf{1}_{T=M(T)}(-1)^{|T|/2}\big]^2$, which is bounded by $d^2\|g\|_2^4$ from the discussion above. However, we still need to bound the contribution from the correlated pairs of $T$ and $T'$.

Notice that $\|g'\|_2^2 = \mathbb{E}_U[g^4]$, which can be upper bounded by $9^d\|g\|_2^4$ using the standard $2\to 4$ hypercontractivity.
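The standard inequality invoked here, $\mathbb{E}_U[g^4] \le 9^d\,\mathbb{E}_U[g^2]^2$ for a degree-$d$ multilinear $g$ over the uniform cube, is easy to sanity-check by brute force on small instances (our own coefficient encoding; this is an illustration, not part of the proof):

```python
from itertools import product
from math import prod

def fourth_moment_vs_bound(n, d, coeffs):
    """Return (E_U[g^4], 9**d * E_U[g^2]**2) for the multilinear
    polynomial g = sum_S c_S * prod_{i in S} x_i over uniform {-1,1}^n,
    where coeffs maps frozenset(S) -> c_S and every |S| <= d."""
    m2 = m4 = 0.0
    for x in product((-1, 1), repeat=n):
        g = sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
        m2 += g * g
        m4 += g ** 4
    m2 /= 2 ** n
    m4 /= 2 ** n
    return m4, 9 ** d * m2 * m2
```

By orthonormality of the characters, $\mathbb{E}_U[g^2] = \|g\|_2^2$, so the second returned value is exactly the bound $9^d\|g\|_2^4$ used for $\|g'\|_2^2$ above.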

Instead of bounding it by $\|g\|_2^4$ directly, we bound it by $2d\cdot\|g'\|_2^2 \leq 2d\cdot 9^d\|g\|_2^4$ through an analysis of its eigenvalues and eigenspaces. For convenience, we rewrite it as

$$\mathbb{E}_M\Big[\sum_{T,T'}\hat{g}'(T)\hat{g}'(T')\,\mathbf{1}_{T=M(T)}\mathbf{1}_{T'=M(T')}(-1)^{|T|/2+|T'|/2}\Big] = \sum_{T,T'}\hat{g}'(T)\hat{g}'(T')\,\Delta(T,T'),$$

where $\Delta(T,T') = 0$ if $|T|$ or $|T'|$ is odd, and otherwise

$$\Delta(T,T') = \mathbb{E}_M\big[\mathbf{1}_{T\cap T'=M(T\cap T')}\mathbf{1}_{T=M(T)}\mathbf{1}_{T'=M(T')}(-1)^{|T|/2+|T'|/2}\big] = \frac{(|T\cap T'|-1)!!\cdot(|T\setminus T'|-1)!!\cdot(|T'\setminus T|-1)!!}{(n-1)(n-3)\cdots(n-|T\cup T'|+1)}\cdot(-1)^{|T\triangle T'|/2}.$$

Let $A'$ be the $\binom{n}{\leq 2d}\times\binom{n}{\leq 2d}$ matrix whose entry $(T,T')$ is $\Delta(T,T')$. We prove that the eigenspace of $A'$ with eigenvalue $0$ is still $\mathrm{span}\{(\sum_i x_i)\chi_T\mid T\in\binom{[n]}{\leq 2d-1}\}$. Because $\Delta(T,T')\neq 0$ if and only if $|T|$, $|T'|$, and $|T\cap T'|$ are all even, it is sufficient to show $\sum_i A'(S,T\triangle i) = 0$ for all odd-sized $T$ and even-sized $S$.

1. $|S\cap T|$ is odd: $\Delta(S,T\triangle i)\neq 0$ if and only if $i\in S$. We separate the calculation into $i\in S\cap T$ or not:

$$\sum_i A'(S,T\triangle i) = \sum_{i\in S\cap T}\Delta(S,T\setminus i) + \sum_{i\in S\setminus T}\Delta(S,T\cup i).$$

Plugging in the definition of $\Delta$, we obtain

$$\frac{|S\cap T|\cdot(|S\cap T|-2)!!\cdot|S\setminus T|!!\cdot(|T\setminus S|-1)!!}{(n-1)(n-3)\cdots(n-|S\cup T|+1)}\,(-1)^{|S|/2+(|T|-1)/2} + \frac{|S\setminus T|\cdot|S\cap T|!!\cdot(|S\setminus T|-2)!!\cdot(|T\setminus S|-1)!!}{(n-1)(n-3)\cdots(n-|S\cup T|+1)}\,(-1)^{|S|/2+(|T|+1)/2} = 0.$$

2. $|S\cap T|$ is even: $\Delta(S,T\triangle i)\neq 0$ if and only if $i\notin S$. We separate the calculation into $i\in T$ or not:

$$\sum_i A'(S,T\triangle i) = \sum_{i\in T\setminus S}\Delta(S,T\setminus i) + \sum_{i\notin S\cup T}\Delta(S,T\cup i).$$

Plugging in the definition of $\Delta$, we obtain

$$\frac{|T\setminus S|\cdot(|S\cap T|-1)!!\cdot(|S\setminus T|-1)!!\cdot(|T\setminus S|-2)!!}{(n-1)(n-3)\cdots(n-|S\cup T|+1)}\,(-1)^{|S|/2+(|T|-1)/2} + \frac{(n-|S\cup T|)\cdot(|S\cap T|-1)!!\cdot(|S\setminus T|-1)!!\cdot|T\setminus S|!!}{(n-1)(n-3)\cdots(n-|S\cup T|)}\,(-1)^{|S|/2+(|T|+1)/2} = 0.$$

From the same analysis as in Section 9.2, the eigenspaces of $A'$ are the same as the eigenspaces of $A$ of degree $2d$; only the eigenvalues differ, and the differences come from the differences between $\Delta_{S\triangle T}$ and $\delta_{S\triangle T}$. We could compute the eigenvalues of $A'$ by the same calculation as for $A$; instead, we bound the eigenvalues of $A'$ via $0 \preceq A' \preceq A$, as follows.

Observe that for any $S$ and $T$, $A'(S,T)$ and $A(S,T)$ always have the same sign. At the same time, $|A'(S,T)| = O\big(\frac{|A(S,T)|}{n^{|S\cap T|}}\big)$. For an eigenspace $V_k$ of $A$, we focus on $\tau_0$ because the eigenvalue is $O(1/n)$-close to $\tau_0$ by the proof of Theorem 9.2.3. We replace $\delta_i$ by any $\Delta(S,T)$ with $|S|=k$, $|T|=k+i$, and $|S\triangle T| = i$ in $\tau_0 = \sum_{i=0}^{d-k}\binom{n-k}{i}\cdot\alpha_{k,k+i}\cdot\delta_i$ to obtain $\tau_0'$ for $A'$. Thus $\alpha_{k,k+i}\cdot\Delta(S,T) = \Theta\big(\frac{\alpha_{k,k+i}\delta_i}{n^i}\big)$ indicates $\tau_0' = \Theta(1) < \tau_0$ from the contribution of $i=0$. Repeating this calculation for $\tau_{2l}$, we can show $\tau_{2l}' = O(\tau_{2l})$ for all $l$. Hence the eigenvalue of $A'$ in $V_k$ is upper bounded by the eigenvalue of $A$, by the cancellation rule of $\tau$ in the proof of Theorem 9.2.3. On the other hand, $A'\succeq 0$ by definition, since it is the expectation of a square term over $M$.

From Corollary 9.2.4 and the discussion above, we bound the largest eigenvalue of $A'$ by $2d$. Therefore

$$\mathbb{E}_M\Big[\sum_{T,T'}\hat{g}'(T)\hat{g}'(T')\,\mathbf{1}_{T=M(T)}\mathbf{1}_{T'=M(T')}(-1)^{|T|/2+|T'|/2}\Big] = \sum_{T,T'}\hat{g}'(T)\hat{g}'(T')\,\Delta(T,T') \leq 2d\,\|g'\|_2^2 \leq 2d\cdot 9^d\cdot\|g\|_2^4.$$

Combining all of the above, $\mathbb{E}_M\big[\mathbb{E}_{P(M)}[g^2]^2\big] \leq \|g\|_2^4 + 2d\,\|g\|_2^4 + 2d\cdot 9^d\cdot\|g\|_2^4 \leq 3d\cdot 9^d\cdot\|g\|_2^4$.

9.3.3 Proof of Theorem 9.3.1

In this section, we prove Theorem 9.3.1. Let $f = f_{\mathcal{I}}$ be the degree-$d$ multilinear polynomial associated with the instance $\mathcal{I}$, and let $g = f - \mathbb{E}_D[f]$ for convenience. We discuss $\mathrm{Var}_D[f]$ in two cases.

If $\mathrm{Var}_D[f] = \mathbb{E}_D[g^2] \geq 12d\cdot 9^{2d}\cdot t^2$, we have $\mathbb{E}_D[g^4] \leq 12d\cdot 9^{2d}\cdot\mathbb{E}_D[g^2]^2$ from the $2\to 4$ hypercontractivity of Theorem 9.3.7. By Lemma 9.1.1, we know $\Pr_D\Big[g \geq \frac{\sqrt{\mathbb{E}_D[g^2]}}{2\sqrt{12d\cdot 9^{2d}}}\Big] > 0$. Thus $\Pr_D[g\geq t] > 0$, which demonstrates that $\Pr_D[f\geq\mathbb{E}_D[f]+t] > 0$.
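For completeness, the fourth moment argument behind Lemma 9.1.1 is the standard one; here is a sketch of our own rederivation, for a random variable $X$ with $\mathbb{E}[X] = 0$ and $\mathbb{E}[X^4] \le b\,\mathbb{E}[X^2]^2$, using Hölder's inequality with exponents $3/2$ and $3$:

```latex
\mathbb{E}[X^2] \;=\; \mathbb{E}\!\left[|X|^{2/3}\cdot |X|^{4/3}\right]
 \;\le\; \mathbb{E}\!\left[|X|\right]^{2/3} \mathbb{E}\!\left[X^4\right]^{1/3}
 \quad\Longrightarrow\quad
 \mathbb{E}\!\left[|X|\right] \;\ge\; \frac{\mathbb{E}[X^2]^{3/2}}{\mathbb{E}[X^4]^{1/2}}
 \;\ge\; \frac{\sqrt{\mathbb{E}[X^2]}}{\sqrt{b}}\,.
```

Since $\mathbb{E}[X] = 0$ gives $\mathbb{E}[\max(X,0)] = \frac{1}{2}\mathbb{E}[|X|] \ge \frac{\sqrt{\mathbb{E}[X^2]}}{2\sqrt{b}}$, the event $X \ge \sqrt{\mathbb{E}[X^2]}\,/\,(2\sqrt{b})$ has positive probability; the application above takes $X = g$ and $b = 12d\cdot 9^{2d}$.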

Otherwise, we know $\mathrm{Var}_D[f] < 12d\cdot 9^{2d}\cdot t^2$. We now consider $f - \hat{f}(\emptyset)$. Let $h_{f-\hat{f}(\emptyset)}$ be the projection of $f-\hat{f}(\emptyset)$ onto the null space, so that $\|f-\hat{f}(\emptyset)-(\sum_i x_i)h_{f-\hat{f}(\emptyset)}\|_2^2 \leq 2\,\mathrm{Var}_D(f)$ from Corollary 9.2.8, and let $\gamma = 2^{-d}$. From Theorem 9.3.2, we can round $h_{f-\hat{f}(\emptyset)}$ to $h$ for $f-\hat{f}(\emptyset)$ in time $n^{O(d)}$ such that

1. the coefficients of $f-\hat{f}(\emptyset)-(\sum_i x_i)h$ are multiples of $\frac{\gamma}{d!(d-1)!\cdots 2!}$;

2. $\|f-\hat{f}(\emptyset)-(\sum_i x_i)h\|_2^2 \leq 7^d\,\|f-\hat{f}(\emptyset)-(\sum_i x_i)h_{f-\hat{f}(\emptyset)}\|_2^2 \leq 7^d\cdot 2\cdot 12d\cdot 9^{2d}\cdot t^2$.

We first observe that $f(\alpha) = f(\alpha) - (\sum_i\alpha_i)h(\alpha)$ for any $\alpha$ in the support of $D$. Then we argue that $f-\hat{f}(\emptyset)-(\sum_i x_i)h$ has a small kernel, which implies that $f$ has a small kernel. From the above two properties, there are at most $\|f-\hat{f}(\emptyset)-(\sum_i x_i)h\|_2^2\,\big/\,\big(\frac{\gamma}{d!(d-1)!\cdots 2!}\big)^2$ non-zero coefficients in $f-\hat{f}(\emptyset)-(\sum_i x_i)h$. Because each non-zero coefficient involves at most $d$ variables, the instance $\mathcal{I}$ has a kernel on at most $24d^2\cdot 7^d\cdot 9^{2d}\cdot 2^{2d}\cdot\big(d!(d-1)!\cdots 2!\big)^2\,t^2$ variables.

The running time of this algorithm is the time to find $h_f$ plus the rounding time $O(n^d)$. Therefore the algorithm runs in time $O(n^{3d})$.

9.4 2 → 4 Hypercontractive inequality under distribution Dp

In this section, we prove the $2\to 4$ hypercontractivity of low-degree multilinear polynomials under the distribution $D_p$ conditioned on the global cardinality constraint $\sum_i x_i = (1-2p)n$.

We assume $p\in(0,1)$ is such that $p\cdot n$ is an integer. We fix the Fourier transform to be the $p$-biased Fourier transform in this section, whose basis is $\{\phi_S\mid S\in\binom{[n]}{\leq d}\}$. Hence we use $\phi_1,\dots,\phi_n$ instead of $x_1,\dots,x_n$, and we say that a function depends only on a subset of characters $\{\phi_i\mid i\in S\}$ if this function takes input only from the variables $\{x_i\mid i\in S\}$. For a degree-$d$ multilinear polynomial $f = \sum_{S\in\binom{[n]}{\leq d}}\hat{f}(S)\phi_S$, we use $\|f\|_2 = \mathbb{E}_{U_p}[f^2]^{1/2} = (\sum_S\hat{f}(S)^2)^{1/2}$ in this section.

We rewrite the global cardinality constraint as $\sum_i\phi_i = 0$. For convenience, we use $n_+ = (1-p)n$ to denote the number of $\sqrt{\frac{p}{1-p}}$'s in the global cardinality constraint on the $\phi_i$, and $n_- = pn$ to denote the number of $-\sqrt{\frac{1-p}{p}}$'s. If $\frac{n_+}{n_-} = \frac{1-p}{p} = \frac{p_1}{p_2}$ for some integers $p_1$ and $p_2$, we could follow the approach of Section 9.3.2: first partition $[n_+ + n_-]$ into tuples of size $p_1+p_2$, and then consider the product distribution over tuples. However, this approach would introduce a dependence on $p_1+p_2$ into the bound, which may be super-constant. Instead of partitioning, we use induction on the number of characters and on the degree to prove the $2\to 4$ hypercontractivity of low-degree multilinear polynomials in $D_p$.

Theorem 9.4.1. For any degree-at-most-$d$ multilinear polynomial $f$ on $\phi_1,\dots,\phi_n$,

$$\mathbb{E}_{D_p}\big[f(\phi_1,\dots,\phi_n)^4\big] \leq 3\cdot d^{3/2}\cdot\Big(256\cdot\big(\big(\tfrac{1-p}{p}\big)^2 + \big(\tfrac{p}{1-p}\big)^2\big)\Big)^d\cdot\|f\|_2^4.$$

Recall that $h_f$ is the projection of $f$ onto the null space $\mathrm{span}\{(\sum_i\phi_i)\phi_S\mid S\in\binom{[n]}{\leq d-1}\}$. We know $f-(\sum_i\phi_i)h_f \equiv f$ on $\mathrm{supp}(D_p)$, which indicates $\mathbb{E}_{D_p}[f^k] = \mathbb{E}_{D_p}[(f-(\sum_i\phi_i)h_f)^k]$ for any integer $k$. Without loss of generality, we assume $f$ is orthogonal to $\mathrm{span}\{(\sum_i\phi_i)\phi_S\mid S\in\binom{[n]}{\leq d-1}\}$. From the lower bound on the eigenvalues of $\mathbb{E}_{D_p}[f^2]$ in Corollary 9.2.4, $0.5\|f\|_2^2 \leq \mathbb{E}_{D_p}[f^2]$. We have a direct corollary as follows.

Corollary 9.4.2. For any degree-at-most-$d$ multilinear polynomial $f$ on $\phi_1,\dots,\phi_n$,

$$\mathbb{E}_{D_p}\big[f(\phi_1,\dots,\phi_n)^4\big] \leq 12\cdot d^{3/2}\cdot\Big(256\cdot\big(\big(\tfrac{1-p}{p}\big)^2+\big(\tfrac{p}{1-p}\big)^2\big)\Big)^d\,\mathbb{E}_{D_p}\big[f(\phi_1,\dots,\phi_n)^2\big]^2.$$

Note that since $x_i$ can be written as a linear function of $\phi_i$, and a linear transformation does not change the degree of a multilinear polynomial, we also have, for any degree-at-most-$d$ multilinear polynomial $g$ on $x_1,\dots,x_n$,

$$\mathbb{E}_{D_p}\big[g(x_1,\dots,x_n)^4\big] \leq 12\cdot d^{3/2}\cdot\Big(256\cdot\big(\big(\tfrac{1-p}{p}\big)^2+\big(\tfrac{p}{1-p}\big)^2\big)\Big)^d\,\mathbb{E}_{D_p}\big[g(x_1,\dots,x_n)^2\big]^2.$$

Proof of Theorem 9.4.1. We assume the inequality holds for all polynomials of degree $< d$, and use induction on the number of characters in a degree-$d$ multilinear polynomial $f$ to prove that if $f$ depends on at most $k$ of the characters $\phi_1,\dots,\phi_n$, then $\mathbb{E}_{D_p}[f^4] \leq d^{3/2}\cdot C^d\cdot\beta^k\cdot\|f\|_2^4$ for $C = 256\cdot\big((\frac{1-p}{p})^2 + (\frac{p}{1-p})^2\big)$ and $\beta = 1+1/n$.

Base case. If $f$ is a constant function, independent of $\phi_1,\dots,\phi_n$, then $\mathbb{E}_{D_p}[f^4] = \hat{f}(\emptyset)^4 = \|f\|_2^4$.

Induction step. Suppose $f$ contains $k\geq 1$ of the characters $\phi_1,\dots,\phi_n$. Without loss of generality, we assume $\phi_1$ is one of the characters in $f$, and rewrite $f = \phi_1 h_0 + h_1$ for a degree-$(d-1)$ polynomial $h_0$ with at most $k-1$ characters and a degree-$d$ polynomial $h_1$ with at most $k-1$ characters. Because $f$ is a multilinear polynomial, $\|f\|_2^2 = \|h_0\|_2^2 + \|h_1\|_2^2$. We expand $\mathbb{E}_{D_p}[f^4] = \mathbb{E}_{D_p}[(\phi_1 h_0+h_1)^4]$ to

$$\mathbb{E}_{D_p}[\phi_1^4 h_0^4] + 4\,\mathbb{E}_{D_p}[\phi_1^3 h_0^3 h_1] + 6\,\mathbb{E}_{D_p}[\phi_1^2 h_0^2 h_1^2] + 4\,\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] + \mathbb{E}_{D_p}[h_1^4].$$

From the induction hypothesis, $\mathbb{E}_{D_p}[h_1^4] \leq d^{3/2} C^d\beta^{k-1}\|h_1\|_2^4$ and

$$\mathbb{E}_{D_p}[\phi_1^4 h_0^4] \leq \max\Big\{\big(\tfrac{1-p}{p}\big)^2,\big(\tfrac{p}{1-p}\big)^2\Big\}\,\mathbb{E}_{D_p}[h_0^4] \leq d^{3/2}\cdot C^{d-0.5}\beta^{k-1}\|h_0\|_2^4. \qquad (9.8)$$

Hence $\mathbb{E}_{D_p}[\phi_1^2 h_0^2 h_1^2] \leq \big(\mathbb{E}_{D_p}[\phi_1^4 h_0^4]\big)^{1/2}\big(\mathbb{E}_{D_p}[h_1^4]\big)^{1/2}$ from the Cauchy–Schwarz inequality. From the above discussion, this is at most

$$d^{3/4}\,C^{d/2}\,\beta^{(k-1)/2}\,\|h_1\|_2^2\cdot d^{3/4}\,C^{(d-0.5)/2}\,\beta^{(k-1)/2}\,\|h_0\|_2^2 \leq d^{3/2}\cdot C^{d-1/4}\cdot\beta^{k-1}\cdot\|h_0\|_2^2\|h_1\|_2^2. \qquad (9.9)$$

Applying the inequality of arithmetic and geometric means to $\mathbb{E}_{D_p}[\phi_1^3 h_0^3 h_1]$, we know it is at most

$$\big(\mathbb{E}_{D_p}[\phi_1^4 h_0^4] + \mathbb{E}_{D_p}[\phi_1^2 h_0^2 h_1^2]\big)/2 \leq \big(d^{3/2} C^{d-0.5}\beta^{k-1}\|h_0\|_2^4 + d^{3/2} C^{d-1/4}\beta^{k-1}\|h_0\|_2^2\|h_1\|_2^2\big)/2. \qquad (9.10)$$

Finally, we bound $\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3]$. However, we cannot apply the Cauchy–Schwarz inequality or the inequality of arithmetic and geometric means here, because we cannot afford a term like $d^{3/2} C^d\beta^{k-1}\|h_1\|_2^4$ anymore. We use $D_{\phi_1>0}$ (resp. $D_{\phi_1<0}$) to denote the conditional distribution of $D_p$ on fixing $\phi_1 = \sqrt{\frac{p}{1-p}}$ (resp. $\phi_1 = -\sqrt{\frac{1-p}{p}}$), and rewrite

$$\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] = \sqrt{p(1-p)}\,\mathbb{E}_{D_{\phi_1>0}}[h_0 h_1^3] - \sqrt{p(1-p)}\,\mathbb{E}_{D_{\phi_1<0}}[h_0 h_1^3]. \qquad (9.11)$$

Let $L$ be the matrix corresponding to the quadratic form $\mathbb{E}_{D_{\phi_1>0}}[fg] - \mathbb{E}_{D_{\phi_1<0}}[fg]$ for low-degree multilinear polynomials $f$ and $g$ (i.e., let $L$ be a matrix such that $f^T L g = \mathbb{E}_{D_{\phi_1>0}}[fg] - \mathbb{E}_{D_{\phi_1<0}}[fg]$). The main technical lemma of this section is an upper bound on the spectral norm of $L$.

Lemma 9.4.3. Let $g$ be a degree-$d$ ($d\geq 1$) multilinear polynomial on the characters $\phi_2,\dots,\phi_n$. Then

$$\Big|\mathbb{E}_{D_{\phi_1>0}}[g^2] - \mathbb{E}_{D_{\phi_1<0}}[g^2]\Big| \leq \frac{3d^{3/2}}{p(1-p)}\cdot\frac{\|g\|_2^2}{\sqrt{n}}.$$

Therefore, the spectral norm of $L$ is upper bounded by $\frac{3d^{3/2}}{p(1-p)}\cdot\frac{1}{\sqrt{n}}$.

From the above lemma, we rewrite equation (9.11) using the upper bound on the eigenvalues of $L$:

$$\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] = \sqrt{p(1-p)}\Big(\mathbb{E}_{D_{\phi_1>0}}[h_0 h_1^3] - \mathbb{E}_{D_{\phi_1<0}}[h_0 h_1^3]\Big) = \sqrt{p(1-p)}\,(h_0h_1)^T L\,(h_1^2) \leq \sqrt{p(1-p)}\cdot\frac{3(2d)^{3/2}}{p(1-p)}\cdot\frac{1}{\sqrt{n}}\cdot\|h_0h_1\|_2\cdot\|h_1^2\|_2.$$

Then we use the inequality of arithmetic and geometric means on it:

$$\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] \leq \sqrt{p(1-p)}\cdot\frac{10d^{3/2}}{p(1-p)}\cdot\frac{\|h_0h_1\|_2^2 + \|h_1^2\|_2^2/n}{2}.$$

Next, we use the $2\to 4$ hypercontractivity $\|h^2\|_2^2 = \mathbb{E}_{U_p}[h^4] \leq \Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d\|h\|_2^4$ in $U_p$ and the Cauchy–Schwarz inequality to simplify it further to

$$\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] \leq \sqrt{p(1-p)}\cdot\frac{5d^{3/2}\cdot\Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d}{p(1-p)}\cdot\Big(\|h_0\|_2^2\cdot\|h_1\|_2^2 + \frac{\|h_1\|_2^4}{n}\Big) \leq \frac{5d^{3/2}}{\sqrt{p(1-p)}}\cdot\Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d\Big(\|h_0\|_2^2\|h_1\|_2^2 + \frac{1}{n}\|h_1\|_2^4\Big). \qquad (9.12)$$

Combining the discussion above, we bound $\mathbb{E}[f^4]$ by the upper bounds in inequalities (9.8), (9.10), (9.9), and (9.12):

$$\mathbb{E}_{D_p}[(\phi_1 h_0+h_1)^4] = \mathbb{E}_{D_p}[\phi_1^4 h_0^4] + 4\,\mathbb{E}_{D_p}[\phi_1^3 h_0^3 h_1] + 6\,\mathbb{E}_{D_p}[\phi_1^2 h_0^2 h_1^2] + 4\,\mathbb{E}_{D_p}[\phi_1 h_0 h_1^3] + \mathbb{E}_{D_p}[h_1^4]$$
$$\leq d^{3/2} C^{d-\frac12}\beta^{k-1}\|h_0\|_2^4 + 2d^{3/2}\big(C^{d-\frac12}\beta^{k-1}\|h_0\|_2^4 + C^{d-\frac14}\beta^{k-1}\|h_0\|_2^2\|h_1\|_2^2\big) + 6d^{3/2} C^{d-\frac14}\beta^{k-1}\|h_0\|_2^2\|h_1\|_2^2$$
$$+ \frac{20d^{3/2}}{\sqrt{p(1-p)}}\cdot\Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d\Big(\|h_0\|_2^2\|h_1\|_2^2 + \frac{1}{n}\|h_1\|_2^4\Big) + d^{3/2} C^d\beta^{k-1}\|h_1\|_2^4$$
$$\leq 3d^{3/2} C^{d-\frac12}\beta^{k-1}\|h_0\|_2^4 + \Big(8d^{3/2} C^{d-\frac14}\beta^{k-1} + \frac{20d^{3/2}}{\sqrt{p(1-p)}}\cdot\Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d\Big)\|h_0\|_2^2\|h_1\|_2^2$$
$$+ \Big(\frac{20d^{3/2}}{\sqrt{p(1-p)}}\cdot\Big(9\cdot\big(\tfrac{p^2}{1-p}+\tfrac{(1-p)^2}{p}\big)\Big)^d\cdot\frac{1}{n} + d^{3/2} C^d\beta^{k-1}\Big)\|h_1\|_2^4$$
$$\leq d^{3/2} C^d\beta^k\|h_0\|_2^4 + d^{3/2} C^d\beta^k\cdot 2\|h_0\|_2^2\|h_1\|_2^2 + d^{3/2}\big(C^d/n + C^d\beta^{k-1}\big)\|h_1\|_2^4 \leq d^{3/2} C^d\beta^k\|f\|_2^4.$$

We prove Lemma 9.4.3 to finish the proof. Intuitively, both $\mathbb{E}_{D_{\phi_1>0}}[g^2]$ and $\mathbb{E}_{D_{\phi_1<0}}[g^2]$ are close to $\mathbb{E}_{D_p}[g^2]$ (adding the dummy character $\phi_1$ back into $D_p$) for a low-degree multilinear polynomial $g$; therefore their gap should be small compared to $\mathbb{E}_{D_p}[g^2] = O(\|g\|_2^2)$. Recall that $D_p$ is the uniform distribution on the constraint $\sum_i\phi_i = 0$, i.e., there are always $n_+$ characters $\phi_i$ equal to $\sqrt{\frac{p}{1-p}}$ and $n_-$ characters equal to $-\sqrt{\frac{1-p}{p}}$. For convenience, we abuse notation and use $\phi$ to denote the vector of characters $(\phi_1,\dots,\phi_n)$.

Proof of Lemma 9.4.3. Let $F$ be the $p$-biased distribution on $n-2$ characters with the global cardinality constraint $\sum_i\phi_i = -\sqrt{\tfrac{p}{1-p}} + \sqrt{\tfrac{1-p}{p}} = -q$, i.e., $n_+-1$ of the characters are always $\sqrt{\frac{p}{1-p}}$ and $n_--1$ of the characters are always $-\sqrt{\frac{1-p}{p}}$. Let $\vec{\phi}_{-i}$ denote the vector $(\phi_2,\dots,\phi_{i-1},\phi_{i+1},\dots,\phi_n)$ of $n-2$ characters, so that we can sample $\vec{\phi}_{-i}$ from $F$. Hence $\phi\sim D_{\phi_1<0}$ is equivalent to the distribution that first samples $i$ from $\{2,\dots,n\}$, then fixes $\phi_i = \sqrt{\frac{p}{1-p}}$ and samples $\vec{\phi}_{-i}\sim F$. Similarly, $\phi\sim D_{\phi_1>0}$ is equivalent to the distribution that first samples $i$ from $\{2,\dots,n\}$, then fixes $\phi_i = -\sqrt{\frac{1-p}{p}}$ and samples $\vec{\phi}_{-i}\sim F$.

For a multilinear polynomial $g$ depending on the characters $\phi_2,\dots,\phi_n$, we rewrite

$$\mathbb{E}_{D_{\phi_1>0}}[g^2] - \mathbb{E}_{D_{\phi_1<0}}[g^2] = \mathbb{E}_i\,\mathbb{E}_{\phi\sim F}\Big[g\big(\phi_i = \sqrt{\tfrac{p}{1-p}},\,\vec{\phi}_{-i}=\phi\big)^2 - g\big(\phi_i = -\sqrt{\tfrac{1-p}{p}},\,\vec{\phi}_{-i}=\phi\big)^2\Big]. \qquad (9.13)$$

We will show that its eigenvalue is upper bounded by $\frac{3d^{3/2}}{p(1-p)}\cdot\sqrt{1/n}$. We first use the Cauchy–Schwarz inequality:

$$\mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) - g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)\cdot\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) + g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)\Big]$$
$$\leq \mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) - g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)^2\Big]^{1/2}\cdot\mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) + g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)^2\Big]^{1/2}.$$

From the inequality of arithmetic and geometric means and the fact that $\mathbb{E}_{D_p}[g^2]\leq d\|g\|_2^2$, observe that the second factor is bounded by

$$\mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) + g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)^2\Big] \leq 2\,\mathbb{E}_{D_{\phi_1>0}}\Big[g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big)^2\Big] + 2\,\mathbb{E}_{D_{\phi_1<0}}\Big[g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)^2\Big]$$
$$\leq \frac{3}{p(1-p)}\,\mathbb{E}_{D_p}[g^2] \leq \frac{3d}{p(1-p)}\cdot\|g\|_2^2.$$

Then we turn to $\mathbb{E}_{\phi\sim F}\big[\big(g(\phi_i{=}\sqrt{\frac{p}{1-p}},\vec{\phi}_{-i}{=}\phi) - g(\phi_i{=}{-}\sqrt{\frac{1-p}{p}},\vec{\phi}_{-i}{=}\phi)\big)^2\big]^{1/2}$. We use $g_i = \sum_{S:\,i\in S}\hat{g}(S)\phi_{S\setminus i}$ to rewrite the difference $g(\phi_i{=}\sqrt{\frac{p}{1-p}},\cdot) - g(\phi_i{=}{-}\sqrt{\frac{1-p}{p}},\cdot)$:

$$\mathbb{E}_{\phi\sim F}\bigg[\Big(\sum_{S:\,i\in S}\hat{g}(S)\Big(\sqrt{\tfrac{p}{1-p}}+\sqrt{\tfrac{1-p}{p}}\Big)\phi_{S\setminus i}\Big)^2\bigg]^{1/2} = \Big(\sqrt{\tfrac{p}{1-p}}+\sqrt{\tfrac{1-p}{p}}\Big)\cdot\mathbb{E}_{\phi\sim F}\big[g_i(\phi)^2\big]^{1/2}.$$

Eventually, we bound $\mathbb{E}_{\phi\sim F}[g_i(\phi)^2]$ by its eigenvalues. Observe that $F$ is the distribution on $n-2$ characters with $n_+-1$ values $\sqrt{\frac{p}{1-p}}$ and $n_--1$ values $-\sqrt{\frac{1-p}{p}}$, which indicates $\sum_j\phi_j + q = 0$ in $F$. However, the small difference between $\sum_j\phi_j + q = 0$ and $\sum_j\phi_j = 0$ does not change the leading term in the eigenvalues of $D_p$. From the same analysis, the largest eigenvalue of $\mathbb{E}_F[g_i^2]$ is at most $d$. For completeness, we provide the calculation in Section 9.4.1.

Claim 9.4.4. For any degree-at-most-$d$ multilinear polynomial $g_i$, $\mathbb{E}_{\phi\sim F}[g_i(\phi)^2] \leq d\|g_i\|_2^2$.

Therefore we have $\mathbb{E}_i\big[\mathbb{E}_{\phi\sim F}[g_i(\phi)^2]^{1/2}\big] \leq \sqrt{d}\cdot\mathbb{E}_i[\|g_i\|_2]$, and we simplify the right-hand side of inequality (9.13) further:

$$\mathbb{E}_i\bigg[\mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) - g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)^2\Big]^{1/2}\cdot\mathbb{E}_{\phi\sim F}\Big[\Big(g\big(\phi_i{=}\sqrt{\tfrac{p}{1-p}},\vec{\phi}_{-i}{=}\phi\big) + g\big(\phi_i{=}{-}\sqrt{\tfrac{1-p}{p}},\vec{\phi}_{-i}{=}\phi\big)\Big)^2\Big]^{1/2}\bigg]$$
$$\leq \mathbb{E}_i\bigg[\Big(\sqrt{\tfrac{p}{1-p}}+\sqrt{\tfrac{1-p}{p}}\Big)\cdot\sqrt{d}\cdot\|g_i\|_2\cdot\sqrt{\frac{3d}{p(1-p)}}\cdot\|g\|_2\bigg] \leq \frac{3d}{p(1-p)}\cdot\|g\|_2\cdot\mathbb{E}_i\big[\|g_i\|_2\big].$$

Using the fact that $\sum_i\|g_i\|_2^2 \leq d\|g\|_2^2$ and the Cauchy–Schwarz inequality again, we simplify the expression above to obtain the desired upper bound on the absolute value of the eigenvalues of $\mathbb{E}_{D_{\phi_1>0}}[g^2] - \mathbb{E}_{D_{\phi_1<0}}[g^2]$:

$$\frac{3d}{p(1-p)}\cdot\|g\|_2\cdot\mathbb{E}_i\big[\|g_i\|_2\big] \leq \frac{3d}{p(1-p)}\cdot\|g\|_2\cdot\frac{\big(\sum_i\|g_i\|_2^2\big)^{1/2}}{\sqrt{n}} \leq \frac{3d}{p(1-p)}\cdot\|g\|_2\cdot\sqrt{\frac{d}{n}}\cdot\|g\|_2 \leq \frac{3d^{3/2}}{p(1-p)}\cdot\frac{\|g\|_2^2}{\sqrt{n}}.$$

9.4.1 Proof of Claim 9.4.4

We follow the approach of Section 9.2 to determine the eigenvalues of $\mathbb{E}_F[f^2]$ for a polynomial $f = \sum_{S\in\binom{[n-2]}{\leq d}}\hat{f}(S)\phi_S$. Recall that $\sum_i\phi_i + q = 0$ over the support of $F$, for $q = \frac{2p-1}{\sqrt{p(1-p)}}$ as defined before.

We abuse notation and write $\delta_k = \mathbb{E}_F[\phi_S]$ for a subset $S$ of size $k$. We start with $\delta_0 = 1$ and $\delta_1 = -q/(n-2)$. From $(\sum_i\phi_i + q)\phi_S \equiv 0$ for $k\leq d-1$ and any $S\in\binom{[n-2]}{k}$, we have

$$\mathbb{E}\Big[\sum_{j\in S}\phi_j\phi_S + q\,\phi_S + \sum_{j\notin S}\phi_{S\cup j}\Big] = 0 \;\Rightarrow\; k\cdot\delta_{k-1} + (k+1)\,q\cdot\delta_k + (n-2-k)\cdot\delta_{k+1} = 0.$$
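This recurrence can be verified numerically: by exchangeability, $\delta_k$ is a hypergeometric average over how many positions of $S$ receive the positive value. The following check (our own illustration, assuming $pn$ is an integer so that $n_+ = (1-p)n$ and $n_- = pn$) confirms the base cases and the recurrence for a small $n$:

```python
from math import comb, sqrt, isclose

def delta(k, n, p):
    """delta_k = E_F[phi_S] for |S| = k, where F places n_+ - 1 values
    a = sqrt(p/(1-p)) and n_- - 1 values b = -sqrt((1-p)/p) uniformly at
    random over the m = n - 2 positions."""
    n_pos, n_neg = round((1 - p) * n) - 1, round(p * n) - 1
    a, b = sqrt(p / (1 - p)), -sqrt((1 - p) / p)
    return sum(comb(n_pos, j) * comb(n_neg, k - j) * a**j * b**(k - j)
               for j in range(max(0, k - n_neg), min(n_pos, k) + 1)) / comb(n_pos + n_neg, k)

n, p = 20, 0.25
q = (2 * p - 1) / sqrt(p * (1 - p))
m = n - 2
assert isclose(delta(0, n, p), 1.0)        # delta_0 = 1
assert isclose(delta(1, n, p), -q / m)     # delta_1 = -q/(n-2)
for k in range(1, 5):
    # k*delta_{k-1} + (k+1)*q*delta_k + (n-2-k)*delta_{k+1} = 0
    lhs = k * delta(k - 1, n, p) + (k + 1) * q * delta(k, n, p) \
        + (m - k) * delta(k + 1, n, p)
    assert isclose(lhs, 0.0, abs_tol=1e-9)
```

The recurrence holds exactly here because the $p$-biased characters satisfy $\phi_j^2 = 1 + q\,\phi_j$, so $\mathbb{E}[\phi_j\phi_S] = \delta_{k-1} + q\,\delta_k$ for $j \in S$.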

Now we determine the eigenspaces of $\mathbb{E}_F[f^2]$. The eigenspace of eigenvalue $0$ is $\mathrm{span}\{(\sum_i\phi_i+q)\phi_S\mid S\in\binom{[n-2]}{\leq d-1}\}$. There are $d+1$ non-zero eigenspaces $V_0,V_1,\dots,V_d$, where the eigenspace $V_k$ of $\mathbb{E}_F[f^2]$ is spanned by $\{\hat{f}(S)\phi_S\mid S\in\binom{[n-2]}{k}\}$. Each $f\in V_k$ satisfies the following properties:

1. $\forall T'\in\binom{[n-2]}{k-1}$, $\sum_{j\notin T'}\hat{f}(T'\cup j) = 0$ (neglect this constraint for $V_0$).

2. $\forall T\in\binom{[n-2]}{j}$ with $j < k$, $\hat{f}(T) = 0$.

3. $\forall T\in\binom{[n-2]}{>k}$, $\hat{f}(T) = \alpha_{k,|T|}\sum_{S\in\binom{T}{k}}\hat{f}(S)$, where $\alpha_{k,k} = 1$ and $\alpha_{k,k+i}$ satisfies

$$i\cdot\alpha_{k,k+i-1} + (k+i+1)\cdot q\cdot\alpha_{k,k+i} + (n-2-2k-i)\,\alpha_{k,k+i+1} = 0.$$

We show the calculation of $\alpha_{k,k+i}$ as follows: fix a subset $T$ of size $k+i$ and consider the orthogonality between $(\sum_j\phi_j+q)\phi_T$ and $f\in V_k$:

$$\alpha_{k,k+i-1}\sum_{j\in T}\sum_{S\in\binom{T\setminus j}{k}}\hat{f}(S) + (k+i+1)\cdot q\cdot\alpha_{k,k+i}\sum_{S\in\binom{T}{k}}\hat{f}(S) + \alpha_{k,k+i+1}\sum_{j\notin T}\sum_{S\in\binom{T\cup j}{k}}\hat{f}(S) = 0$$
$$\Rightarrow\;\Big((k+i+1)\cdot q\cdot\alpha_{k,k+i} + (n-2-k-i)\,\alpha_{k,k+i+1} + i\cdot\alpha_{k,k+i-1}\Big)\sum_{S\in\binom{T}{k}}\hat{f}(S) + \alpha_{k,k+i+1}\sum_{T'\in\binom{T}{k-1}}\sum_{j\notin T}\hat{f}(T'\cup j) = 0.$$

Using the first property, $\forall T'\in\binom{[n-2]}{k-1}$, $\sum_{j\notin T'}\hat{f}(T'\cup j) = 0$, to remove all $S'\not\subseteq T$, we have

$$\Big(i\cdot\alpha_{k,k+i-1} + (k+i+1)\cdot q\cdot\alpha_{k,k+i} + (n-2-2k-i)\,\alpha_{k,k+i+1}\Big)\sum_{S\in\binom{T}{k}}\hat{f}(S) = 0.$$

We calculate the eigenvalues of $V_k$ following the approach of Section 9.2. Fix $S$ and $S'$ with $i = |S\triangle S'|$; we still use $\tau_i$ to denote the coefficient of $\hat{f}(S')$ in the expansion of $\sum_T(\delta_{S\triangle T} - \delta_S\delta_T)\hat{f}(T)$. Observe that $\tau_i$ is the same as the definition in Section 9.2 in terms of $\delta$ and $\alpha_k$:

$$\tau_0 = \sum_{i=0}^{d-k}\binom{n-2-k}{i}\cdot\alpha_{k,k+i}\cdot\delta_i, \qquad \tau_{2l} = \sum_{i=0}^{d-k}\sum_{t=0}^{i}\binom{l}{t}\binom{n-2-k-2l}{i-t}\,\alpha_{k,k+i}\,\delta_{2l+i-2t}.$$

Observe that the small difference between $(\sum_i\phi_i + q)\equiv 0$ and $\sum_i\phi_i\equiv 0$ only changes the recurrence formulas of $\delta$ and $\alpha$ slightly. For $\delta_{2i}$ and $\alpha_{k,k+2i}$ with an integer $i$, the leading term is still determined by $\delta_{2i-2}$ and $\alpha_{k,k+2i-2}$. For $\delta_{2i+1}$ and $\alpha_{k,k+2i+1}$, they remain of the same order (the constant in front of $n^{-i-1}$ does not change the order). Using the same induction on $\delta$ and $\alpha$, we have

1. $\delta_{2i} = (-1)^i\,\frac{(2i-1)!!}{n^i} + O(n^{-i-1})$;

2. $\delta_{2i+1} = O(n^{-i-1})$;

3. $\alpha_{k,k+2i} = (-1)^i\,\frac{(2i-1)!!}{n^i} + O(n^{-i-1})$;

4. $\alpha_{k,k+2i+1} = O(n^{-i-1})$.

Hence $\tau_0 = \sum_{\text{even } i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!} + O(n^{-1})$ and $\tau_{2l} = O(n^{-l})$. Following the same analysis as in Section 9.2, we know the eigenvalue of $V_k$ is $\tau_0 \pm O(\tau_2) = \sum_{\text{even } i=0}^{d-k}\frac{(i-1)!!\,(i-1)!!}{i!} + O(n^{-1})$.

From all the discussion above, the eigenvalues of $\mathbb{E}_F[f^2]$ are at most $\lfloor\frac{d}{2}\rfloor + 1 \leq d$.

9.5 Parameterized algorithm for CSPs above average with global cardinality constraints P We show that CSPs above average with the global cardinality constraint i xi =

(1 − 2p)n are fixed-parameter tractable for any p ∈ [p0, 1 − p0] with an integer pn. We n still use Dp to denote the uniform distribution on all assignments in {±1} complying with P i xi = (1 − 2p)n.

Without loss of generality, we assume $p < 1/2$ and that $(1-2p)n$ is an integer. We choose the standard basis $\{\chi_S\}$ in this section instead of $\{\phi_S\}$, because the Fourier coefficients in $\{\phi_S\}$ can be arbitrarily small for some $p\in(0,1)$.
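Since $D_p$ is just the uniform distribution over one slice of the Boolean cube, small cases can be checked exhaustively. The following toy sketch (our own code; the function name and the instance are hypothetical, not from the text) enumerates the support of $D_p$ and computes $\mathrm{Var}_{D_p}(f)$ directly:

```python
# Brute-force sanity check: enumerate the support of D_p for small n
# and compute Var_{D_p}(f) exactly.
from itertools import product

def var_Dp(f, n, p):
    """Variance of f under the uniform distribution D_p on assignments
    in {+1,-1}^n satisfying sum_i x_i = (1 - 2p) * n."""
    target = round((1 - 2 * p) * n)
    support = [x for x in product([1, -1], repeat=n) if sum(x) == target]
    mean = sum(f(x) for x in support) / len(support)
    return sum((f(x) - mean) ** 2 for x in support) / len(support)

# Toy instance: n = 4, p = 1/4, so the constraint is x_1 + ... + x_4 = 2.
f = lambda x: x[0] * x[1]          # the monomial chi_{ {1,2} }
print(var_Dp(f, 4, 0.25))          # 1.0 on this instance
```

The four feasible assignments each have exactly one $-1$ coordinate, which makes the small variance computation easy to verify by hand.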

Theorem 9.5.1. For any constants $p_0\in(0,1)$ and $d$, given an instance of a $d$-ary CSP, a parameter $t$, and a parameter $p\in[p_0,1-p_0]$, there exists an algorithm with running time $n^{O(d)}$ that either finds a kernel on at most $C\cdot t^2$ variables or certifies that $\mathsf{OPT} \ge \mathsf{AVG} + t$ under the global cardinality constraint $\sum_i x_i = (1-2p)n$, for a constant $C = 160d^2\cdot 30^d\Big(\big(\frac{1-p_0}{p_0}\big)^2 + \big(\frac{p_0}{1-p_0}\big)^2\Big)^d\cdot(d!)^{3d^2}\cdot(1/2p_0)^{4d}$.

9.5.1 Rounding

Let $f$ be a degree $d$ polynomial whose coefficients are multiples of $\gamma$ in the standard basis $\{\chi_S\mid S\in\binom{[n]}{\le d}\}$. We show how to find a polynomial $h$ with integral coefficients such that $f - \big(\sum_i x_i - (1-2p)n\big)h$ depends on only $O(\mathrm{Var}_{D_p}(f))$ variables. We use the rounding algorithm in Section 9.3.1 as a black box; it provides a polynomial $h$ such that $f - (\sum_i x_i)h$ depends on only $O(\mathrm{Var}_D(f))$ variables (where $D$ is the distribution conditioned on the bisection constraint). Without loss of generality, we assume $\hat f(\emptyset) = 0$ because $\mathrm{Var}_{D_p}(f)$ is independent of $\hat f(\emptyset)$.

Before proving that $f$ depends on at most $O(\mathrm{Var}_{D_p}(f))$ variables, we first define the inactivity of a variable $x_i$ in $f$.

Definition 9.5.2. A variable $x_i$ for $i\in[n]$ is inactive in $f = \sum_S\hat f(S)\chi_S$ if $\hat f(S) = 0$ for all $S$ containing $i$. A variable $x_i$ is inactive in $f$ under the global cardinality constraint $\sum_i x_i = (1-2p)n$ if there exists a polynomial $h$ such that $x_i$ is inactive in $f - \big(\sum_i x_i - (1-2p)n\big)h$.
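The definition can be checked mechanically on a toy example by tracking Fourier expansions as dictionaries and using the identity $x_i\cdot\chi_S = \chi_{S\triangle\{i\}}$ over $\{\pm1\}^n$. A minimal sketch (our own code and names; the bisection-style instance with $c = 0$ is hypothetical, not from the text):

```python
# Polynomials as Fourier expansions {S: coefficient}; multiplying by x_i
# maps chi_S to chi_{S xor {i}}.
def mul_by_cardinality_term(h, n, c):
    """Expand (sum_i x_i - c) * h in the chi basis."""
    out = {}
    for S, v in h.items():
        for i in range(n):
            T = S ^ frozenset([i])
            out[T] = out.get(T, 0) + v
        out[S] = out.get(S, 0) - c * v
    return out

def inactive_vars(f, n):
    """Variables i with hat f(S) = 0 for every S containing i."""
    return {i for i in range(n)
            if all(abs(v) < 1e-9 for S, v in f.items() if i in S)}

n, c = 3, 0          # bisection-style constraint: sum_i x_i = 0
f = {frozenset([0]): 1, frozenset([1]): 1, frozenset([2]): 1}
h = {frozenset(): 1}  # the constant h = 1 turns every variable inactive here
g = dict(f)
for S, v in mul_by_cardinality_term(h, n, c).items():
    g[S] = g.get(S, 0) - v
print(inactive_vars(g, n))   # {0, 1, 2}
```

Here $f = \chi_1+\chi_2+\chi_3$ has no inactive variables on its own, but subtracting $(\sum_i x_i)\cdot 1$ cancels every coefficient, so all three variables become inactive under the constraint.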

In general, there are multiple ways to choose $h$ to make a variable inactive. However, if we know a subset $S$ of $d$ variables and the existence of some $h$ making $S$ inactive in $f - \big(\sum_i x_i - (1-2p)n\big)h$, we show that $h$ is uniquely determined by $S$. Intuitively, for any subset $S_1$ with $d-1$ variables, there are $\binom{|S_1\cup S|}{d} = \binom{2d-1}{d}$ ways to choose a subset of size $d$. Any $d$-subset $T$ of $S_1\cup S$ contains at least one inactive variable, so that $\hat f(T) - \sum_{j\in T}\hat h(T\setminus j) = 0$ by the assumption. At the same time, there are at most $\binom{2d-1}{d-1}$ coefficients of $\hat h$ in $S_1\cup S$. So there is only one solution for the coefficients of $\hat h$ satisfying these $\binom{2d-1}{d}$ equations.

Claim 9.5.3. Given $f = \sum_{T\in\binom{[n]}{\le d}}\hat f(T)\chi_T$ and $p\in[p_0,1-p_0]$, let $S$ be a subset with at least $d$ variables such that there exists a degree $\le d-1$ multilinear polynomial $h$ making $S$ inactive in $f - \big(\sum_i x_i - (1-2p)n\big)h$. Then $h$ is uniquely determined by any $d$ elements of $S$ and $f$.

Proof. Without loss of generality, we assume $S = \{1,\cdots,d\}$ and determine $\hat h(S_1)$ for $S_1 = \{i_1,\cdots,i_{d-1}\}$.

For simplicity, we first consider the case $S\cap S_1 = \emptyset$. From the definition, we know that any $T\in\binom{S\cup S_1}{d}$ contains at least one inactive variable, which indicates $\hat f(T) - \sum_{j\in T}\hat h(T\setminus j) = 0$. Hence we can repeat the argument in Lemma 9.3.3 to determine $\hat h(S_1)$ from $\hat f(T)$ over all $T\in\binom{S\cup S_1}{d}$.

Let $\beta_{d-1,1} = (d-2)!$ and $\beta_{d-i-1,i+1} = \frac{-i}{d-i-1}\beta_{d-i,i}$ for any $i\in\{1,\cdots,d-2\}$ be the parameters defined in Lemma 9.3.3. For any $S_2\in\binom{S}{d-1}$, by the same calculation,

$$\sum_{i=1}^{d-1}\beta_{d-i,i}\sum_{T_1\in\binom{S_1}{d-i},\,T_2\in\binom{S_2}{i}}\Big(\hat f(T_1\cup T_2) - \sum_{j\in T_1\cup T_2}\hat h(T_1\cup T_2\setminus j)\Big) = 0$$

indicates that (all $\hat h(S')$ with $S'\neq S_1,S_2$ cancel with each other)

$$(d-1)!\cdot\hat h(S_1) + (-1)^d(d-1)!\cdot\hat h(S_2) = \sum_{i=1}^{d-1}\beta_{d-i,i}\sum_{T_1\in\binom{S_1}{d-i},\,T_2\in\binom{S_2}{i}}\hat f(T_1\cup T_2).$$

Hence $\hat h(S_2) = (-1)^{d-1}\hat h(S_1) + (-1)^d\sum_{i=1}^{d-1}\frac{\beta_{d-i,i}}{(d-1)!}\sum_{T_1\in\binom{S_1}{d-i},\,T_2\in\binom{S_2}{i}}\hat f(T_1\cup T_2)$ for any $S_2\in\binom{S}{d-1}$. Replacing all $\hat h(S_2)$ in the equation $\hat f(S) - \sum_{S_2\in\binom{S}{d-1}}\hat h(S_2) = 0$, we obtain $\hat h(S_1)$ in terms of $f$ and $S$.

If $S\cap S_1\neq\emptyset$, then we set $S' = S\setminus S_1$. Next we add arbitrary $|S\cap S_1|$ more variables into $S'$ such that $|S'| = d$ and $S'\cap S_1 = \emptyset$. Observe that $S'\cup S_1$ contains at least $d$ inactive variables. Repeating the above argument, we can determine $\hat h(S_1)$. After determining $\hat h(S_1)$ for all $S_1\in\binom{[n]}{d-1}$, we repeat this argument for $S_1\in\binom{[n]}{d-2}$ and so on. Therefore we can determine $\hat h(S_1)$ for all $S_1\in\binom{[n]}{\le d-1}$ from $S$ and the coefficients of $f$.

Remark 9.5.4. The coefficients of h are multiples of γ/d! if the coefficients of f are multiples of γ.

Let $h_1$ and $h_2$ be two polynomials such that at least $d$ variables are inactive in both $f - \big(\sum_i x_i - (1-2p)n\big)h_1$ and $f - \big(\sum_i x_i - (1-2p)n\big)h_2$. We know that $h_1 = h_2$ from the above claim. Furthermore, it implies that any variable that is inactive in $f - \big(\sum_i x_i - (1-2p)n\big)h_1$ is inactive in $f - \big(\sum_i x_i - (1-2p)n\big)h_2$ from the definition, and vice versa.

Based on this observation, we show how to find a degree $d-1$ function $h$ such that few active variables are left in $f - \big(\sum_i x_i - (1-2p)n\big)h$. The high-level idea is to randomly sample a subset $Q$ of $(1-2p)n$ variables and restrict all variables in $Q$ to 1. The remaining variables then constitute a bisection constraint on $2pn$ variables, so that we can use the rounding process in Section 9.3.1. Let $k$ be a large number, $Q_1,\cdots,Q_k$ be $k$ random subsets, and $h_1,\cdots,h_k$ be the $k$ functions after rounding in Section 9.3.1. Intuitively, the numbers of active variables in $f - (\sum_{i\notin Q_1}x_i)h_1,\cdots,f - (\sum_{i\notin Q_k}x_i)h_k$ are small with high probability, so that $h_1,\cdots,h_k$ share at least $d$ inactive variables. We can use one function $h$ to represent $h_1,\cdots,h_k$ by the above claim, so that the union of the inactive variables in $f - (\sum_{i\notin Q_j}x_i)h_j$ over all $j\in[k]$ is inactive in $f - (\sum_i x_i)h$ by the definition. Therefore there are few active variables in $f - (\sum_i x_i)h$.

Let us move to $f - \big(\sum_i x_i - (1-2p)n\big)h$. Because $h$ has degree at most $d-1$, $(1-2p)n\cdot h$ is a function of degree $\le d-1$. Thus the number of active variables among the degree-$d$ terms of $f - \big(\sum_i x_i - (1-2p)n\big)h$ is upper bounded by the number of active variables in $f - (\sum_i x_i)h$. For the terms of degree $< d$ left in $f - \big(\sum_i x_i - (1-2p)n\big)h$, we repeat the above process again.

Theorem 9.5.5. Given a global cardinality constraint $\sum_i x_i = (1-2p)n$ and a degree $d$ function $f = \sum_{S\in\binom{[n]}{\le d}}\hat f(S)\chi_S$ with $\mathrm{Var}_{D_p}(f) < n^{0.5}$ and coefficients that are multiples of $\gamma$, there is an efficient algorithm running in time $O(dn^{2d})$ that finds a polynomial $h$ such that there are at most $\frac{C'_{p,d}\cdot\mathrm{Var}_{D_p}(f)}{\gamma^2}$ active variables in $f - \big(\sum_i x_i - (1-2p)n\big)h$, for $C'_{p,d} = \frac{20d^2\cdot 7^d\cdot(d!)^{2d^2}}{(2p)^{4d}}$.

Proof. For any subset $Q\in\binom{[n]}{(1-2p)n}$, we consider the assignments conditioned on $x_Q = \vec 1$ and use $f_Q$ to denote the restriction of $f$ to $x_Q = \vec 1$. Conditioned on $x_Q = \vec 1$, the global cardinality constraint on the remaining variables is $\sum_{i\notin Q}x_i = 0$. We use $D_Q$ to denote the distribution on assignments of $\{x_i\mid i\notin Q\}$ satisfying $\sum_{i\notin Q}x_i = 0$, i.e., the distribution of $\{x_i\mid i\notin Q\}$ under the bisection constraint.

Let $X_Q(i)\in\{0,1\}$ denote whether $x_i$ is active in $f_Q$ under the bisection constraint on $\bar Q$ after the bisection rounding in Theorem 9.3.2. From Theorem 9.3.2, we get an upper bound on the number of active variables in $f_Q$, i.e.,

$$\sum_i X_Q(i) \le 2C'_d\cdot\mathrm{Var}_{D_Q}(f_Q)$$

for $C'_d = \frac{7^d\big(d!(d-1)!\cdots 2!\big)^2}{\gamma^2}$ and any $Q$ with $\mathrm{Var}_{D_Q}(f_Q) = O(n^{0.6})$. We claim that

$$\mathbb{E}_Q\big[\mathrm{Var}_{D_Q}(f_Q)\big] \le \mathrm{Var}_{D_p}(f).$$

From the definition, $\mathbb{E}_Q[\mathrm{Var}_{D_Q}(f_Q)] = \mathbb{E}_Q\,\mathbb{E}_{y\sim D_Q}[f_Q(y)^2] - \mathbb{E}_Q\big(\mathbb{E}_{y\sim D_Q}[f_Q(y)]\big)^2$. At the same time, we observe that $\mathbb{E}_Q\,\mathbb{E}_{y\sim D_Q}[f_Q(y)^2] = \mathbb{E}_{D_p}[f^2]$ and $\mathbb{E}_Q\big[\big(\mathbb{E}_{y\sim D_Q}f_Q(y)\big)^2\big] \ge \mathbb{E}_{D_p}[f]^2$. Therefore $\mathbb{E}_Q[\mathrm{Var}_{D_Q}(f_Q)] \le \mathrm{Var}_{D_p}[f]$. One observation is that $\Pr_Q[\mathrm{Var}_{D_Q}(f_Q)\ge n^{0.6}] < n^{-0.1}$ from the assumption $\mathrm{Var}_{D_p}(f) < n^{0.5}$, which is small enough that we can neglect it in the rest of the proof. From the discussion above, we have $\mathbb{E}_Q[\sum_i X_Q(i)] \le 2C'_d\cdot\mathrm{Var}_{D_p}(f)$ with high probability.
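The claim $\mathbb{E}_Q[\mathrm{Var}_{D_Q}(f_Q)]\le\mathrm{Var}_{D_p}(f)$ can be sanity-checked numerically on a tiny instance (our own toy example, not from the text): $n = 4$, $p = 1/4$, $f = x_1x_2$.

```python
# Enumerate all pins Q of size (1-2p)n, build each bisection distribution
# D_Q on the remaining variables, and compare E_Q[Var_{D_Q}(f_Q)] to Var_{D_p}(f).
from itertools import combinations, product

def var_on(support, f):
    vals = [f(x) for x in support]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

n, p = 4, 0.25
f = lambda x: x[0] * x[1]
Dp = [x for x in product([1, -1], repeat=n) if sum(x) == (1 - 2 * p) * n]
var_p = var_on(Dp, f)

q_size = round((1 - 2 * p) * n)      # |Q| = (1-2p)n variables pinned to +1
vars_Q = []
for Q in combinations(range(n), q_size):
    rest = [i for i in range(n) if i not in Q]
    DQ = []                          # bisection: sum of unpinned variables = 0
    for y in product([1, -1], repeat=len(rest)):
        if sum(y) == 0:
            x = [1] * n
            for i, b in zip(rest, y):
                x[i] = b
            DQ.append(tuple(x))
    vars_Q.append(var_on(DQ, f))

avg = sum(vars_Q) / len(vars_Q)
print(avg, var_p)   # the average is below Var_{D_p}(f) on this instance
```

On this instance the average works out to $2/3$ while $\mathrm{Var}_{D_p}(f) = 1$, consistent with the claimed inequality.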

Now we consider the number of $i$'s with $\mathbb{E}_Q[X_Q(i)] \le \frac{(2p)^{2d}}{5d}$. Without loss of generality, we use $m$ to denote the number of such $i$'s and further assume these variables are $\{1,2,\cdots,m\}$ for convenience. Hence for any $i > m$, $\mathbb{E}_Q[X_Q(i)] > \frac{(2p)^{2d}}{5d}$. We know that $\mathrm{Var}_{D_Q}(f_Q) \le \frac{2\mathrm{Var}_{D_p}(f)}{(2p)^{2d}}$ holds with probability at least $1 - \frac{(2p)^{2d}}{2}$, which implies

$$n - m \le \frac{2C'_d\cdot\mathrm{Var}_{D_Q}(f_Q)}{(2p)^{2d}} \le \frac{20d\cdot C'_d\cdot\mathrm{Var}_{D_p}(f)}{(2p)^{4d}}.$$

We are going to show that $\mathbb{E}_Q[X_Q(i)]$ is either 0 or at least $\frac{(2p)^{2d}}{5d}$, which means that only $x_{m+1},\cdots,x_n$ are active in $f$ under $\sum_i x_i = 0$. Then we discuss how to find a polynomial $h_d$ such that $x_1,\cdots,x_m$ are inactive in the degree-$d$ terms of $f - \big(\sum_i x_i - (1-2p)n\big)h_d$.

We fix $d$ variables $x_1,\cdots,x_d$ and pick $d$ arbitrary variables $x_{j_1},\cdots,x_{j_d}$ from $\{d+1,\cdots,n\}$. We focus on $\{x_1,x_2,\cdots,x_d,x_{j_1},\cdots,x_{j_d}\}$ now. With probability at least $(2p)^{2d} - o(1) \ge 0.99(2p)^{2d}$ over the random sampling of $Q$, none of these $2d$ variables is in $Q$. At the same time, with probability at least $1 - 2d\cdot\frac{(2p)^{2d}}{5d}$, all variables in the intersection $\{x_1,\ldots,x_m\}\cap\{x_1,x_2,\cdots,x_d,x_{j_1},\cdots,x_{j_d}\}$ are inactive in $f_Q$ under the bisection constraint on $\bar Q$ (the factor $2d$ accounts for $x_{j_1},\cdots,x_{j_d}$ if necessary). Therefore, with probability at least $0.99(2p)^{2d} - 2d\cdot\frac{(2p)^{2d}}{5d} - \frac{(2p)^{2d}}{2} \ge 0.09(2p)^{2d}$, the variables $x_1,x_2,\cdots,x_d$ are inactive in $f_Q$ under $\sum_{i\notin Q}x_i = 0$ and $n-m$ is small. Namely, there exists a polynomial $h_{x_{j_1},\cdots,x_{j_d}}$ such that the variables in $\{x_1,\ldots,x_m\}\cap\{x_1,x_2,\cdots,x_d,x_{j_1},\cdots,x_{j_d}\}$ are inactive in $f_Q - (\sum_{i\notin Q}x_i)h_{x_{j_1},\cdots,x_{j_d}}$.

Now we apply Claim 9.5.3 on $S = \{1,\cdots,d\}$ in $f$ to obtain the unique polynomial $h_d$, which is the combination of $h_{x_{j_1},\cdots,x_{j_d}}$ over all choices of $j_1,\cdots,j_d$, and consider $f - (\sum_i x_i)h_d$. Because of the arbitrary choices of $x_{j_1},\cdots,x_{j_d}$, it implies that $x_1,\cdots,x_d$ are inactive in $f - (\sum_i x_i)h_d$. For example, fix any $j_1,\cdots,j_d$ and $T = \{1,j_1,\cdots,j_{d-1}\}$. We know $\hat f(T) - \sum_{j\in T}\hat h_{x_{j_1},\cdots,x_{j_d}}(T\setminus j) = 0$ from the definition of $h_{x_{j_1},\cdots,x_{j_d}}$. Because $h_{x_{j_1},\cdots,x_{j_d}}$ agrees with $h_d$ on the Fourier coefficients by Claim 9.5.3, we have $\hat f(T) - \sum_{j\in T}\hat h_d(T\setminus j) = 0$. Furthermore, it implies that $x_{d+1},\cdots,x_m$ are also inactive in $f - (\sum_i x_i)h_d$. For example, fix $j_1\in\{d+1,\cdots,m\}$ and choose $j_2,\cdots,j_d$ arbitrarily. Then $x_1,\cdots,x_d$ and $x_{j_1}$ are inactive in $f_Q - (\sum_{i\notin Q}x_i)h_{x_{j_1},\cdots,x_{j_d}}$ for some $Q$ from the discussion above, which indicates that $x_{j_1}$ is inactive in $f - (\sum_i x_i)h_d$ by Claim 9.5.3.

To find $h_d$ in time $O(n^{2d})$, we enumerate all possible choices of $d$ variables in $[n]$ as $S$. Then we apply Claim 9.5.3 to find the polynomial $h_S$ corresponding to $S$ and check $f - (\sum_i x_i)h_S$. If there are more than $m$ inactive variables in $f - (\sum_i x_i)h_S$, then we set $h_d = h_S$. Therefore the running time of this process is $\binom{n}{d}\cdot O(n^d) = O(n^{2d})$.

Hence, we can find a polynomial $h_d$ efficiently such that at least $m$ variables are inactive in $f - (\sum_i x_i)h_d$. Let us return to the original global cardinality constraint $\sum_i x_i = (1-2p)n$. Let

$$f_d = f - \Big(\sum_i x_i - (1-2p)n\Big)h_d.$$

$x_1,\cdots,x_m$ are no longer inactive in $f_d$ because of the extra term $(1-2p)n\cdot h_d$. However, $x_1,\cdots,x_m$ are at least independent of the degree-$d$ terms in $f_d$. Let $A_d$ denote the set of active variables in the degree-$d$ terms of $f_d$; its size is less than $\frac{20d\cdot C'_d\,\mathrm{Var}_{D_p}(f)}{(2p)^{4d}}$ from the upper bound on $n-m$.

For $f_d$, observe that $\mathrm{Var}_{D_p}(f_d) = \mathrm{Var}_{D_p}(f)$ and all coefficients of $f_d$ are multiples of $\gamma_{d-1} = \gamma/d!$ from Claim 9.5.3. We neglect the degree-$d$ terms of $f_d$ in $A_d$ and treat $f_d$ as a degree $d-1$ function from now on. Then we repeat the above process for the degree $d-1$ terms of $f_d$ to obtain a degree $d-2$ polynomial $h_{d-1}$ such that the active set $A_{d-1}$ in the degree $d-1$ terms of $f_{d-1} = f_d - \big(\sum_i x_i - (1-2p)n\big)h_{d-1}$ contains at most $\frac{20d\cdot C'_{d-1}\,\mathrm{Var}_{D_p}(f)}{(2p)^{4d}}$ variables, for $C'_{d-1} = \frac{7^{d-1}\big((d-1)!\cdots 2!\big)^2}{\gamma_{d-1}^2}$. At the same time, observe that the degree of $\big(\sum_i x_i - (1-2p)n\big)h_{d-1}$ is at most $d-1$, so that it does not introduce degree-$d$ terms to $f_{d-1}$. Then we repeat this again for terms of degree $d-2$, $d-3$, and so on.

To summarize, we can find a polynomial $h$ such that $A_d\cup A_{d-1}\cup\cdots\cup A_1$ is the active set in $f - \big(\sum_i x_i - (1-2p)n\big)h$. At the same time, $|A_d\cup A_{d-1}\cup\cdots\cup A_1| \le \sum_i|A_i| \le \frac{20d^2\,7^d\cdot\mathrm{Var}_{D_p}(f)\cdot(d!)^{2d^2}}{\gamma^2\,(2p)^{4d}}$.

9.5.2 Proof of Theorem 9.5.1

In this section, we prove Theorem 9.5.1. Let $f = f_{\mathcal I}$ be the degree $d$ function associated with the instance $\mathcal I$ and $g = f - \mathbb{E}_{D_p}[f]$ for convenience. We discuss $\mathrm{Var}_{D_p}[f]$ in two cases. If $\mathrm{Var}_{D_p}[f] = \mathbb{E}_{D_p}[g^2] \ge 8\Big(16\cdot\big(\big(\tfrac{1-p}{p}\big)^2 + \big(\tfrac{p}{1-p}\big)^2\big)\cdot d^3\Big)^d\cdot t^2$, we have

$$\mathbb{E}_{D_p}[g^4] \le 12\Big(256\cdot\big(\big(\tfrac{1-p}{p}\big)^2 + \big(\tfrac{p}{1-p}\big)^2\big)^2\cdot d^6\Big)^d\,\mathbb{E}_{D_p}[g^2]^2$$

from the $2\to 4$ hypercontractivity in Theorem 9.4.2. By Lemma 9.1.1, we know

$$\Pr_{D_p}\left[g \ge \sqrt{\frac{\mathbb{E}_{D_p}[g^2]}{12\Big(256\cdot\big(\big(\frac{1-p}{p}\big)^2+\big(\frac{p}{1-p}\big)^2\big)^2\cdot d^6\Big)^d}}\ \right] > 0.$$

Thus $\Pr_{D_p}[g\ge t] > 0$, which demonstrates that $\Pr_{D_p}[f \ge \mathbb{E}_{D_p}[f] + t] > 0$.

Otherwise we know $\mathrm{Var}_{D_p}[f] \le 8\Big(16\cdot\big(\big(\tfrac{1-p}{p}\big)^2 + \big(\tfrac{p}{1-p}\big)^2\big)\cdot d^3\Big)^d\cdot t^2$. We set $\gamma = 2^{-d}$. From Theorem 9.5.5, we can find a degree $d-1$ function $h$ in time $O(n^{2d})$ such that $f - \big(\sum_i x_i - (1-2p)n\big)h$ contains at most $\frac{C'_{p,d}\cdot\mathrm{Var}_{D_p}(f)}{\gamma^2}$ active variables. We further observe that $f(\alpha) = f(\alpha) - \big(\sum_i\alpha_i - (1-2p)n\big)h(\alpha)$ for any $\alpha$ in the support of $D_p$. Then we know the kernel of $f$ and $\mathcal I$ is at most

$$8\Big(16\cdot\big(\big(\tfrac{1-p}{p}\big)^2+\big(\tfrac{p}{1-p}\big)^2\big)\cdot d^3\Big)^d\cdot t^2\cdot\frac{C'_{p,d}}{\gamma^2} = 8\Big(16\cdot\big(\big(\tfrac{1-p}{p}\big)^2+\big(\tfrac{p}{1-p}\big)^2\big)\cdot d^3\Big)^d\cdot t^2\cdot\frac{20d^2\,7^d\cdot(d!)^{2d^2}\cdot 2^{2d}}{(2p)^{4d}} < C\cdot t^2.$$

The running time of this algorithm is $O(dn^{2d})$.

Chapter 10

Integrality Gaps for Dispersers and Bipartite Expanders

In this chapter, we study the vertex expansion of bipartite graphs. For convenience, we always use G to denote a bipartite graph and [N]∪[M] to denote the vertex set of G. Let D and d denote the maximal degree of vertices in [N] and [M], respectively. For a subset S in [N] ∪ [M] of a bipartite graph G = ([N], [M],E), we use Γ(S) to denote its neighbor set {j|∃i ∈ S, (i, j) ∈ E}. We consider the following two useful concepts in bipartite graphs:

Definition 10.0.1. A bipartite graph G = ([N], [M],E) is a (k, s)-disperser if for any subset S ⊆ [N] of size k, the neighbor set Γ(S) contains at least s distinct vertices.

Definition 10.0.2. A bipartite graph G = ([N], [M],E) is a (k, a)-expander if for any subset S ⊆ [N] of size k, the neighbor set Γ(S) contains at least a·k distinct vertices. It is a (≤ K, a) expander if it is a (k, a)-expander for all k ≤ K.
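Both definitions can be checked by brute force on small graphs. A toy sketch (our own code; the adjacency-list representation and the tiny graph are hypothetical, not from the text):

```python
# Brute-force checkers for the disperser / expander definitions above.
from itertools import combinations

def neighbors(Gamma_left, S):
    """Gamma_left[i] is the neighbor set of left vertex i; return Gamma(S)."""
    out = set()
    for i in S:
        out |= Gamma_left[i]
    return out

def is_disperser(Gamma_left, N, k, s):
    """Every k-subset of [N] has at least s distinct neighbors."""
    return all(len(neighbors(Gamma_left, S)) >= s
               for S in combinations(range(N), k))

def is_expander(Gamma_left, N, k, a):
    """Every k-subset of [N] has at least a*k distinct neighbors."""
    return all(len(neighbors(Gamma_left, S)) >= a * k
               for S in combinations(range(N), k))

# Tiny example: N = 3 left vertices, M = 4 right vertices.
Gamma = [{0, 1}, {1, 2}, {2, 3}]
print(is_disperser(Gamma, 3, 2, 3), is_expander(Gamma, 3, 2, 1.5))
```

This exhaustive check is of course exponential in $k$; the point of the chapter is precisely that certifying these properties efficiently is hard.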

Because dispersers focus on hitting most vertices in $[M]$, while expanders emphasize expansion proportional to the degree $D$, it is often more convenient to use parameters $\rho$, $\delta$, and $\epsilon$ with $k = \rho N$, $s = (1-\delta)M$, and $a = (1-\epsilon)D$ for dispersers and expanders.

These two combinatorial objects have wide applications in computer science. Dispersers are well known for obtaining non-trivial derandomization results, e.g., for derandomization of inapproximability results for MAX Clique and other NP-complete problems [Zuc96a, TZ04, Zuc07], deterministic amplification [Sip88], and oblivious sampling [Zuc96b]. Dispersers are also closely related to other combinatorial constructions such as randomness extractors, and some constructions of dispersers follow the constructions of randomness extractors directly [TZ04, BKS+10, Zuc07]. Explicit constructions achieving almost optimal degree have been designed by Ta-Shma [Ta-02] and Zuckerman [Zuc07], respectively, in different important parameter regimes.

For bipartite expanders, it is well known that the probabilistic method provides very good expanders, and some applications depend on the existence of such bipartite expanders, e.g., proofs of lower bounds in different computation models [Gri01a, BOT02]. Expanders also constitute an important part of other pseudorandom constructions, such as expander codes [SS96] and randomness extractors [TZ04, CRVW02, GUV09b]. A beautiful application of bipartite expanders was given by Buhrman et al. [BMRV00] for the static membership problem (see [CRVW02] and the references therein for more applications). Explicit constructions for expansion $a = (1-\epsilon)D$ with almost-optimal parameters have been designed in [CRVW02] for constant degree and in [TZ04, GUV09b] for super-constant degree.

We consider the natural problem of approximating the vertex expansion of $\rho N$-subsets in a bipartite graph $G$ on $[N]\cup[M]$ in terms of the degrees $D$, $d$, and the parameter $\rho$. More precisely, given a parameter $\rho$ such that $k = \rho N$, it is natural to ask for the size of the smallest neighbor set over all $\rho N$-subsets of $[N]$. To the best of our knowledge, this question has only been studied in the context of expander graphs when $G$ is $d$-regular with $M = N$ and $D = d$, by bounding the second eigenvalue. In [Kah95], Kahale proved that the second eigenvalue can be used to show the graph $G$ is a $(\le\rho N, \frac{D}{2})$-expander for $\rho \le \mathrm{poly}(\frac{1}{D})$. Moreover, Kahale showed that some Ramanujan graphs do not expand by more than $D/2$ among small subsets, which indicates that $D/2$ is the best expansion parameter achievable by the eigenvalue method. In another work [WZ99], Wigderson and Zuckerman pointed out that the expander mixing lemma only helps us determine whether the bipartite graph $G$ is a $(\rho N, (1-\frac{4}{D\rho})N)$-disperser or not, which is not helpful if $D\rho \le 4$. Even if $D\rho = \Omega(1)$, the expander mixing lemma is unsatisfactory because a random bipartite graph on $[N]\cup[M]$ with right degree $d$ is a $(\rho N, (1-O((1-\rho)^d))M)$-disperser with high probability. Therefore Wigderson and Zuckerman provided an explicit construction for the case $d\rho = \Omega(1)$ when $d = N^{1-\delta+o(1)}$ and $\rho = N^{-(1-\delta)}$ for any $\delta\in(0,1)$. However, there exist graphs whose second eigenvalue is close to 1 but which have very good expansion among small subsets [KV05, BGH+12]. Therefore the study of the eigenvalue is not enough to fully characterize the vertex expansion. On the other hand, it is well known that a random regular bipartite graph is a good disperser and a good expander simultaneously; it is therefore natural to ask how to certify that a random bipartite graph is a good disperser or a good expander.

Our main results are strong integrality gaps and an approximation algorithm for the vertex expansion problem in bipartite graphs. We prove the integrality gaps in the Lasserre hierarchy, a strong algorithmic tool in approximation algorithm design: most currently known semidefinite-programming-based algorithms can be derived from a constant number of levels of this hierarchy.

We first provide integrality gaps for dispersers in the Lasserre hierarchy. It is well known that a random bipartite graph on $[N]\cup[M]$ is an $(N^\alpha, (1-\delta)M)$-disperser with very high probability when $N$ is large enough and the left degree is $D = \Theta_{\alpha,\delta}(\log N)$; these dispersers have wide applications in theoretical computer science [Sha02, Zuc07]. We show average-case hardness of the disperser problem: given a random bipartite graph, the Lasserre hierarchy cannot approximate the size of the subset of $[N]$ (equivalently, the min-entropy of the disperser) required to hit at least a $0.01$ fraction of the vertices in $[M]$ as neighbors. The second result is an integrality gap for any constant $\rho > 0$ and random bipartite graphs with constant right degree $d$ (the formal statements are in Section 10.2.1).

Theorem 10.0.3 (Informal Statement). For any $\alpha\in(0,1)$ and any $\delta\in(0,1)$, the $N^{\Omega(1)}$-level Lasserre hierarchy cannot distinguish whether, for a random bipartite graph $G$ on $[N]\cup[M]$ with left degree $D = O(\log N)$:

1. $G$ is an $(N^\alpha, (1-\delta)M)$-disperser,

2. $G$ is not an $(N^{1-\alpha}, \delta M)$-disperser.

Theorem 10.0.4 (Informal Statement). For any $\rho > 0$, there exist infinitely many $d$ such that the $\Omega(N)$-level Lasserre hierarchy cannot distinguish whether, for a random bipartite graph $G$ on $[N]\cup[M]$ with right degree $d$:

1. $G$ is a $(\rho N, (1-(1-\rho)^d)M)$-disperser,

2. $G$ is not a $\big(\rho N, (1 - C_0\cdot\frac{1-\rho}{\rho d+1-\rho})M\big)$-disperser for a universal constant $C_0 > 0.1$.

We also provide an approximation algorithm that finds a subset of size exactly $\rho N$ with a relatively small neighbor set when the graph is not a good disperser. For a balanced constant $\rho$, e.g., $\rho\in[1/3,2/3]$, $\frac{\rho}{1-\rho}$ and $\frac{1-\rho}{\rho}$ are just constants, and the approximation ratio of our algorithm is close to the integrality gap in Theorem 10.0.4, within an extra loss of $\log d$.

Theorem 10.0.5. Given a bipartite graph $([N],[M])$ with right degree $d$ that is not a $(\rho N, (1-\Delta)M)$-disperser, there exists a polynomial time algorithm that returns a $\rho N$-subset of $[N]$ with a neighbor set of size at most $\Big(1 - \Omega\big(\frac{\min\{(\frac{\rho}{1-\rho})^2,\,1\}}{\log d}\cdot d(1-\rho)^d\big)\Delta\Big)M$.

For expanders, we will show that for any constant $\epsilon > 0$ there is another constant $\epsilon' < \epsilon$ such that the Lasserre hierarchy cannot distinguish whether the bipartite graph is a $(\rho N, (1-\epsilon')D)$-expander or not a $(\rho N, (1-\epsilon)D)$-expander for small $\rho$ (the formal statement is in Section 10.2.2). To the best of our knowledge, this is the first hardness result for such an expansion property. For example, it indicates that the Lasserre hierarchy cannot distinguish between a $(\rho N, 0.6322D)$-expander and a graph that is not a $(\rho N, 0.499D)$-expander.

Theorem 10.0.6. For any $\epsilon > 0$ and $\epsilon' < \frac{e^{-2\epsilon}-(1-2\epsilon)}{2\epsilon}$, there exist constants $\rho$ and $D$ such that the $\Omega(N)$-level Lasserre hierarchy cannot distinguish whether, for a bipartite graph $G$ on $[N]\cup[M]$ with left degree $D$:

1. $G$ is a $(\rho N, (1-\epsilon')D)$-expander,

2. $G$ is not a $(\rho N, (1-\epsilon)D)$-expander.

10.1 Preliminaries

In this chapter, we always use $\Lambda$ to denote a constraint satisfaction problem (CSP) and $\Phi$ to denote an instance of the CSP. A CSP $\Lambda$ is specified by a width $d$, a finite field $\mathbb{F}_q$ for a prime power $q$, and a predicate $C\subset\mathbb{F}_q^d$. An instance $\Phi$ of $\Lambda$ consists of $n$ variables and $m$ constraints such that every constraint $j$ is of the form $x_{j,1}\cdots x_{j,d}\in C+\vec b_j$ for some $\vec b_j\in\mathbb{F}_q^d$ and $d$ variables $x_{j,1},\cdots,x_{j,d}$. We provide a short description of the semidefinite programming relaxations from the Lasserre hierarchy [Las02] (see [Rot13, Bar14] for a complete introduction to the Lasserre hierarchy and sum-of-squares proofs). We will use $f\in\{0,1\}^S$ for $S\subset[N]$ to denote an assignment to the variables in $\{x_i\mid i\in S\}$. Conversely, let $f_S$ denote the partial assignment on $S$ for $f\in\{0,1\}^n$ and $S\subset[n]$. For two assignments $f\in\{0,1\}^S$ and $g\in\{0,1\}^T$ that agree on $S\cap T$, we use $f\circ g$ to denote the combined assignment on $S\cup T$. For a matrix $A$, we will use $A(i,j)$ to denote the entry $(i,j)$ of $A$ and $A\succeq 0$ to denote that $A$ is positive semidefinite.

Consider a $\{0,1\}$-program with an objective function $Q$ and constraints $P_0,\cdots,P_m$, where $Q, P_0,\cdots,P_m$ map $\binom{[n]}{\le d}\times\{0,1\}^d$ to $\mathbb{R}$:

$$\max \sum_{R\in\binom{[n]}{\le d},\,h\in\{0,1\}^R} Q(R,h)\,1_{x_R=h}$$
$$\text{subject to } \sum_{R\in\binom{[n]}{\le d},\,h\in\{0,1\}^R} P_j(R,h)\,1_{x_R=h} \ge 0 \quad\forall j\in[m]$$
$$x_i\in\{0,1\}\quad\forall i\in[n]$$

Let $y_S(f)$ denote the probability that the assignment on $S$ is $f$ in the pseudo-distribution. This $\{0,1\}$-program [Las02] in the $t$-level Lasserre hierarchy becomes:

$$\max \sum_{R\in\binom{[n]}{\le d},\,h\in\{0,1\}^R} Q(R,h)\,y_R(h)$$
$$\text{subject to } \Big(y_{S\cup T}(f\circ g)\Big)_{(S\subset\binom{[n]}{\le t},\,f\in\{0,1\}^S),\,(T\subset\binom{[n]}{\le t},\,g\in\{0,1\}^T)} \succeq 0, \qquad (10.1)$$
$$\Big(\sum_{R\in\binom{[n]}{\le d},\,h\in\{0,1\}^R} P_j(R,h)\,y_{S\cup T\cup R}(f\circ g\circ h)\Big)_{(S\in\binom{[n]}{\le t},\,f\in\{0,1\}^S),\,(T\in\binom{[n]}{\le t},\,g\in\{0,1\}^T)} \succeq 0, \quad\forall j\in[m]. \qquad (10.2)$$

An important tool in the Lasserre hierarchy for proving that the matrices in (10.2) are positive semidefinite was introduced by Guruswami, Sinop, and Zhou in [GSZ14]; we restate it here and prove it for completeness. Let $\vec u_S(f)$ for all $S\in\binom{[n]}{\le t}$, $f\in\{0,1\}^S$ be the vectors that explain the matrix (10.1).

Lemma 10.1.1 (Restatement of Theorem 2.2 in [GSZ14]). If $\sum_{R,h}P(R,h)\,\vec u_R(h) = \vec 0$, then the corresponding matrix in (10.2) is positive semidefinite.

Proof. From the definition of $\vec u$, we have

$$\sum_{R,h}P(R,h)\,y_{S\cup T\cup R}(f\circ g\circ h) = \sum_{R,h}\Big\langle P(R,h)\,\vec u_{S\cup R}(f\circ h),\ \vec u_T(g)\Big\rangle = \Big\langle\vec u_{S\cup T}(f\circ g),\ \sum_{R,h}P(R,h)\,\vec u_R(h)\Big\rangle = 0.$$

10.1.1 Pairwise Independent Subspaces

We introduce an extra property of pairwise independent subspaces for our construction of integrality gaps of list constraint satisfaction problems.

Definition 10.1.2. Let $C$ be a pairwise independent subspace of $\mathbb{F}_q^d$ and $Q$ be a subset of $\mathbb{F}_q$ of size $k$. We say that $C$ stays in $Q$ with probability $p$ if $\Pr_{x\sim C}[x\in Q^d] = \frac{|C\cap Q^d|}{|C|} \ge p$.

In [BGGP12], Benjamini et al. proved $p \le \frac{k/q}{(1-k/q)\cdot d+k/q}$ for infinitely many $d$ when $|Q| = k$. They also provided a distribution that matches the upper bound $\frac{k/q}{(1-k/q)\cdot d+k/q}$ for every $d$ with $q\mid(d-1)$. In this thesis, we need the property that $C$ is a subspace rather than an arbitrary distribution in $\mathbb{F}_q^d$. We provide two constructions for the base cases $k = 1$ and $k = q-1$.

Lemma 10.1.3. There exist infinitely many $d$ such that there is a pairwise independent subspace $C\subset\mathbb{F}_q^d$ that stays in a size-1 subset $Q$ of $\mathbb{F}_q$ with probability $\frac{1/q}{(1-1/q)d+1/q}$.

Proof. Choose $Q = \{0\}$ and $C$ to be the dual code of the Hamming code over $\mathbb{F}_q$ with block length $d = \frac{q^l-1}{q-1}$ and distance 3 for an integer $l$. Using $|C| = q^l$, the probability is $\frac{1}{|C|} = \frac{1/q}{(1-1/q)d+1/q}$. It is pairwise independent because the dual distance of $C$ is 3.
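The construction can be verified directly for $q = 2$ and $l = 3$ (so $d = 7$): the dual of the $[7,4]$ Hamming code is spanned by a $3\times 7$ matrix whose columns are the nonzero vectors of $\mathbb{F}_2^3$. A sanity-check sketch (our own code, not from the text):

```python
# Dual-Hamming (simplex) code over F_2 with l = 3, d = 7.
from itertools import product

l, d = 3, 7
cols = [v for v in product([0, 1], repeat=l) if any(v)]   # 7 nonzero columns
C = set()
for a in product([0, 1], repeat=l):
    C.add(tuple(sum(ai * vi for ai, vi in zip(a, v)) % 2 for v in cols))

assert len(C) == 2 ** l                 # |C| = q^l = 8
p_zero = sum(1 for c in C if not any(c)) / len(C)
print(p_zero)                           # = (1/q) / ((1 - 1/q) d + 1/q) = 1/8

# pairwise independence: any two coordinates are uniform over F_2^2
for i in range(d):
    for j in range(i + 1, d):
        pairs = [(c[i], c[j]) for c in C]
        for ab in product([0, 1], repeat=2):
            assert pairs.count(ab) == len(C) // 4
```

Pairwise independence holds because any two distinct nonzero columns of the generator matrix are linearly independent over $\mathbb{F}_2$.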

Lemma 10.1.4. There exist infinitely many $d$ such that there is a pairwise independent subspace $C\subset\mathbb{F}_q^d$ staying in a $(q-1)$-subset $Q$ of $\mathbb{F}_q$ with probability at least $\Omega\big(\frac{(q-1)/q}{d/q+(q-1)/q}\big)$.

Proof. First, we provide a construction for $d = q-1$ and then generalize it to $d = (q-1)q^l$ for any integer $l$. For $d = q-1$, the generator matrix of the subspace is a $(q-1)\times 2$ matrix whose row $i$ is $(\alpha_i,\alpha_i^2)$ for $q-1$ distinct elements $\{\alpha_1,\cdots,\alpha_{q-1}\} = \mathbb{F}_q^*$. Because $\alpha_i\neq\alpha_j$ for any two different rows $i$ and $j$, it is pairwise independent. Let $Q = \mathbb{F}_q\setminus\{1\}$. Using the inclusion-exclusion principle and the fact that a quadratic equation has at most 2 roots in $\mathbb{F}_q$:

$$\Pr_{x\sim C}[x\in Q^d] = \Pr_{x\sim C}[\forall\beta\in\mathbb{F}_q^*,\ x_\beta\neq 1]$$
$$= 1 - \sum_{\beta\in\mathbb{F}_q^*}\Pr[x_\beta = 1] + \sum_{\{\beta_1,\beta_2\}\in\binom{\mathbb{F}_q^*}{2}}\Pr[x_{\beta_1}=1\wedge x_{\beta_2}=1] - \sum_{\{\beta_1,\beta_2,\beta_3\}\in\binom{\mathbb{F}_q^*}{3}}\Pr[x_{\beta_1}=1\wedge x_{\beta_2}=1\wedge x_{\beta_3}=1] + \cdots$$
$$= 1 - \frac{q-1}{q} + \frac{\binom{q-1}{2}}{q^2} - 0 + 0 = \frac{q^2-q+2}{2q^2} \ge \frac12 - \frac1{2q}.$$

For any $d = (q-1)q^l$, the generator matrix of the subspace is a $d\times(l+2)$ matrix whose rows are of the form $(\alpha,\alpha^2,\beta_1,\cdots,\beta_l)$ for all nonzero elements $\alpha\in\mathbb{F}_q^*$ and $\beta_1\in\mathbb{F}_q,\cdots,\beta_l\in\mathbb{F}_q$. The pairwise independence follows from a similar analysis. $\Pr_{x\sim C}[x\in Q^d] \ge \frac{1}{q^l}\big(\frac12-\frac1{2q}\big)$, because the case reduces to $d = q-1$ when all coefficients before $\beta_1,\cdots,\beta_l$ are 0, and this is $\ge \frac13\cdot\frac{q-1}{d+q-1}$.
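The base case $d = q-1$ can be verified exhaustively for $q = 5$ (our own code, not from the text); the count $11/25 = 0.44$ matches $(q^2-q+2)/(2q^2)$:

```python
# The d = q - 1 construction for q = 5: codeword coordinates are
# x_alpha = a*alpha + b*alpha^2 for random (a, b) in F_5^2.
from itertools import product

q = 5
alphas = list(range(1, q))                 # F_q^*
C = [tuple((a * t + b * t * t) % q for t in alphas)
     for a, b in product(range(q), repeat=2)]

# probability of staying in Q = F_q \ {1}: no coordinate equals 1
stay = sum(1 for c in C if 1 not in c) / len(C)
print(stay)                                # (q^2 - q + 2) / (2 q^2) = 0.44

# pairwise independence: any two coordinates range over all of F_q^2
for i in range(len(alphas)):
    for j in range(i + 1, len(alphas)):
        pairs = {(c[i], c[j]) for c in C}
        assert len(pairs) == q * q         # each of the q^2 pairs appears once
```

The pairwise-independence check succeeds because the $2\times 2$ Vandermonde-type determinant $\alpha_i\alpha_j(\alpha_j-\alpha_i)$ is nonzero for distinct $\alpha_i,\alpha_j\in\mathbb{F}_q^*$.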

Remark 10.1.5. The construction for $d = q-1$ also provides a subspace that stays in $Q$ with probability $1 - \frac dq + \frac{\binom d2}{q^2}$ for any $d < q-1$, by deleting unnecessary rows of the matrix.

10.2 Integrality Gaps

We first consider a natural $\{0,1\}$ program to determine the vertex expansion of $\rho N$-subsets of $[N]$, given a bipartite graph $G = ([N],[M],E)$:

$$\min \sum_{j=1}^{M}\bigvee_{i\in\Gamma(j)}x_i = \min \sum_{j=1}^{M}\big(1 - 1_{\forall i\in\Gamma(j),\,x_i=0}\big)$$
$$\text{subject to } \sum_{i=1}^{N}x_i \ge \rho N$$
$$x_i\in\{0,1\}\ \text{for every } i\in[N]$$
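For small graphs this integer program can be solved by brute force, which is useful for checking intuition (our own toy example, not from the text):

```python
# Brute force over all rho*N subsets: the optimum is the size of the
# smallest neighbor set, min over |S| = k of |{j : Gamma(j) meets S}|.
from itertools import combinations

def min_neighbor_set(Gamma, N, k):
    """Gamma[j] is the neighbor set of right vertex j; k = rho * N."""
    return min(sum(1 for nbrs in Gamma if nbrs & set(S))
               for S in combinations(range(N), k))

# N = 4 left vertices, M = 3 right vertices, rho = 1/2
Gamma = [{0, 1}, {2, 3}, {0, 2}]
print(min_neighbor_set(Gamma, 4, 2))   # 2
```

For example, the subset $\{x_1,x_2\}$ (indices 0 and 1) touches only the first and third right vertices, matching the optimum 2.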

We relax it to a convex program in the $t$-level Lasserre hierarchy.

$$\min \sum_{j=1}^{M}\big(1 - y_{\Gamma(j)}(\vec 0)\big)$$
$$\text{subject to } \Big(y_{S\cup T}(f\circ g)\Big)_{(S\in\binom{[N]}{\le t},\,f\in\{0,1\}^S),\,(T\in\binom{[N]}{\le t},\,g\in\{0,1\}^T)} \succeq 0 \qquad (10.3)$$
$$\Big(\sum_{i=1}^{N}y_{S\cup T\cup\{i\}}(f\circ g\circ 1) - \rho N\cdot y_{S\cup T}(f\circ g)\Big)_{(S\in\binom{[N]}{\le t},\,f\in\{0,1\}^S),\,(T\in\binom{[N]}{\le t},\,g\in\{0,1\}^T)} \succeq 0 \qquad (10.4)$$

In this section, we focus on random bipartite graphs $G$ on $[N]\cup[M]$ that are $d$-regular on $[M]$, generated by connecting each vertex of $[M]$ to $d$ random vertices of $[N]$ independently. The main technical result we prove in this section is:

Lemma 10.2.1. Suppose there is a pairwise independent subspace $C\subseteq\mathbb{F}_q^d$ staying in a $k$-subset with probability $\ge p_0$. Let $G = ([N],[M],E)$ be a random bipartite graph with $M = O(N)$ that is $d$-regular on $[M]$. Then the $\Omega(N)$-level Lasserre hierarchy for $G$ and $\rho = 1-k/q$ has an objective value at most $(1-p_0+\frac{1}{N^{1/3}})M$ with high probability.

We introduce list constraint satisfaction problems, which allow every variable to take $k$ values from the alphabet. Next, we lower bound the objective value of an instance of a list CSP in the Lasserre hierarchy by the objective value of the corresponding instance of the CSP in the Lasserre hierarchy. Then we show how to use list CSPs to obtain an upper bound on the vertex expansion for $\rho = 1-k/q$ in the Lasserre hierarchy.

Definition 10.2.2 (list Constraint Satisfaction Problem). A list Constraint Satisfaction Problem (list CSP) $\Lambda$ is specified by a constant $k$, a width $d$, a domain over a finite field $\mathbb{F}_q$ for a prime power $q$, and a predicate $C\subseteq\mathbb{F}_q^d$. An instance $\Phi$ of $\Lambda$ consists of a set of variables $\{x_1,\cdots,x_n\}$ and a set of constraints $\{C_1,C_2,\cdots,C_m\}$ on the variables. Every variable $x_i$ takes $k$ values in $\mathbb{F}_q$, and every constraint $C_j$ consists of a set of $d$ variables $x_{j,1},x_{j,2},\cdots,x_{j,d}$ and an assignment $\vec b_j\in\mathbb{F}_q^d$. The value of $C_j$ is $|(C+\vec b_j)\cap x_{j,1}\times x_{j,2}\times\cdots\times x_{j,d}|\in\mathbb{N}$. The value of $\Phi$ is the sum of the values over all constraints, and the objective is to find an assignment to $\{x_1,\cdots,x_n\}$ that makes the total value as large as possible.

Remark 10.2.3. We abuse the notation $C_j$ to also denote the variable subset $\{x_{j,1},x_{j,2},\cdots,x_{j,d}\}$. Our definition is consistent with the definition of the classical CSP when $k = 1$. The differences between a list CSP and a classical CSP are that a list CSP allows each variable to choose $k$ values in $\mathbb{F}_q$ instead of one value and relaxes every constraint $C_i$ from $\mathbb{F}_q^d\to\{0,1\}$ to $\mathbb{F}_q^d\to\mathbb{N}$.
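The value of a single list-CSP constraint is simply a count over the product of the value lists. A toy evaluation (our own code and example, not from the text):

```python
# Value of one list-CSP constraint: |(C + b) ∩ (x_1 × ... × x_d)|.
from itertools import product

def constraint_value(C, b, lists, q):
    """C: list of codewords, b: shift vector, lists: k-value set per variable."""
    shifted = {tuple((c + bb) % q for c, bb in zip(cw, b)) for cw in C}
    return sum(1 for t in product(*lists) if t in shifted)

C = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]   # the span of (1,1,1) in F_3^3
b = (0, 0, 0)
lists = [{0, 1}] * 3                    # each variable keeps k = 2 values
print(constraint_value(C, b, lists, 3))  # 2: both (0,0,0) and (1,1,1) survive
```

Shifting by a different $\vec b$, e.g. $b = (0,1,2)$, moves the coset off the product set entirely and the value drops to 0, illustrating how the constraint value counts surviving coset elements rather than a single satisfying assignment.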

The $\{0,1\}$ program for an instance $\Phi$ with variables $\{x_1,\cdots,x_n\}$ and constraints $\{C_1,\cdots,C_m\}$ of $\Lambda$ with parameters $k$, $\mathbb{F}_q$, and a predicate $C$ is as follows (the variable set of the $\{0,1\}$ program is the direct product of $[n]$ and $\mathbb{F}_q$):

$$\max \sum_{j\in[m]}\sum_{f\in C+\vec b_j} 1_{\forall i\in C_j,\ x_{i,f(i)}=1}$$
$$\text{subject to } x_{i,\alpha}\in\{0,1\}\quad\forall(i,\alpha)\in[n]\times\mathbb{F}_q$$
$$\sum_{\alpha\in\mathbb{F}_q}x_{i,\alpha} = k\quad\forall i\in[n]$$

The SDP in the $t$-level Lasserre hierarchy for $\Phi$ relaxes this $\{0,1\}$ program as follows:

$$\max \sum_{j\in[m]}\sum_{f\in C+\vec b_j} y_{(C_j,f)}(\vec 1)$$
$$\text{s.t. } \Big(y_{S\cup T}(f\circ g)\Big)_{(S\subset\binom{[n]\times\mathbb{F}_q}{\le t},\,f\in\{0,1\}^S),\,(T\subset\binom{[n]\times\mathbb{F}_q}{\le t},\,g\in\{0,1\}^T)} \succeq 0 \qquad (10.5)$$
$$\Big(k\cdot y_{S\cup T}(f\circ g) - \sum_{\alpha}y_{S\cup T\cup\{(i,\alpha)\}}(f\circ g\circ 1)\Big)_{(S\subset\binom{[n]\times\mathbb{F}_q}{\le t},\,f\in\{0,1\}^S),\,(T\subset\binom{[n]\times\mathbb{F}_q}{\le t},\,g\in\{0,1\}^T)} = 0,\quad\forall i\in[n] \qquad (10.6)$$

Definition 10.2.4. Let $\Lambda$ be the list CSP with parameters $k,q,d$ and a predicate $C\subset\mathbb{F}_q^d$. Let $\Phi$ be an instance of $\Lambda$ with $n$ variables and $m$ constraints. $p(\Phi)$ is the projection instance of $\Phi$ in the CSP with the same parameters $q,d$, $C\subseteq\mathbb{F}_q^d$, and the same constraints $(C_1,\vec b_1),(C_2,\vec b_2),\cdots,(C_m,\vec b_m)$, except that $k = 1$.

Recall that a subspace $C\subset\mathbb{F}_q^d$ stays in a subset $Q\subset\mathbb{F}_q$ with probability $p_0$ if $\Pr_{x\sim C}[x\in Q^d]\ge p_0$. We lower bound $\Phi$'s objective value in the Lasserre hierarchy by exploiting the subspace property of $C$ and $Q$.

Lemma 10.2.5. Let $\Phi$ be an instance of the list CSP $\Lambda$ with parameters $k,q,d$ and a predicate $C$, where $C$ is a subspace of $\mathbb{F}_q^d$ staying in a $k$-subset $Q$ with probability at least $p_0$. Suppose $p(\Phi)$'s value is $\gamma$ in the $w$-level Lasserre hierarchy; then $\Phi$'s value is at least $p_0|C|\cdot\gamma$ in the $w$-level Lasserre hierarchy.

Proof. Let $y_S(f)$ and $\vec v_S(f)$ for $S\in\binom{[n]\times\mathbb{F}_q}{\le w}$ and $f\in\{0,1\}^S$ denote the pseudo-distribution and the vectors in the $w$-level Lasserre hierarchy for $p(\Phi)$, respectively. Let $z$ and $\vec u$ denote the pseudo-distribution and vectors in the $w$-level Lasserre hierarchy for $\Phi$. The construction of $z$ and $\vec u$ from $y$ and $\vec v$ is based on the subspace $C$ and $Q$. The intuition is to choose $x_i = \alpha + Q$ in $\Phi$ if $x_i = \alpha$ for some $\alpha\in\mathbb{F}_q$ in $p(\Phi)$.

Before constructing $z$ and $\vec u$, we define the $\oplus$ operation as follows. For any $S\in\binom{[n]\times\mathbb{F}_q}{\le w}$, $g\in\{0,1\}^S$, and $P\subseteq\mathbb{F}_q$, let $S\oplus P$ denote the union of the subsets $(i,\alpha+P)$ over all elements $(i,\alpha)$ in $S$, which is $\cup_{(i,\alpha)\in S}\{(i,\alpha+P)\}$ in $[n]\times\mathbb{F}_q$, and let $g\oplus P\in\{0,1\}^{S\oplus P}$ denote the assignment on $S\oplus P$ such that $g\oplus P(i,\alpha+P) = g(i,\alpha)$. If there is a conflict in the definition of $g\oplus P$, namely $\exists(i,\beta)$ such that $(i,\beta)\in(i,\alpha_1+P)$ and $(i,\beta)\in(i,\alpha_2+P)$ for two distinct $(i,\alpha_1),(i,\alpha_2)$ in $S$, define $g\oplus P(i,\beta)$ to be an arbitrary one of them. Because every variable takes only one value in $p(\Phi)$, $y_S(g) = 0$ whenever there is a conflict on $g\oplus P\in\{0,1\}^{S\oplus P}$. Following the intuition mentioned above, for any $S\subset\binom{[n]\times\mathbb{F}_q}{\le w}$ and $g\in\{0,1\}^S$, let $R = \{i\mid\exists\alpha,(i,\alpha)\in S\}$ and

$$z_S(g) = \sum_{\substack{T\in\binom{R\times\mathbb{F}_q}{\le w},\,g'\in\{0,1\}^T:\\ S\subseteq T\oplus Q,\ g'\oplus Q(S)=g}} y_T(g'), \qquad \vec u_S(g) = \sum_{\substack{T\in\binom{R\times\mathbb{F}_q}{\le w},\,g'\in\{0,1\}^T:\\ S\subseteq T\oplus Q,\ g'\oplus Q(S)=g}} \vec v_T(g').$$

The verification that $\vec u$ explains $z$ in (10.5) for $\Phi$ is straightforward. To verify that (10.6) is positive semidefinite, notice that every variable $x_i$ takes $k$ values in $\mathbb{F}_q$:

$$\sum_{\alpha\in\mathbb{F}_q}z_{(i,\alpha)}(1) = \sum_{\alpha\in\mathbb{F}_q}\sum_{\beta\in Q}y_{(i,\alpha-\beta)}(1) = \sum_{\beta\in Q}\sum_{\alpha\in\mathbb{F}_q}y_{(i,\alpha-\beta)}(1) = |Q| = k.$$

By a similar analysis, $\sum_{\alpha\in\mathbb{F}_q}\vec u_{(i,\alpha)}(1) = k\vec v_\emptyset$, and we apply Lemma 10.1.1 to prove that (10.6) is PSD.

Recall that $p(\Phi)$'s value is $\sum_{j\in[m]}\sum_{f\in C+\vec b_j}y_{(C_j,f)}(\vec 1) = \gamma$, so $\Phi$'s objective value in the $w$-level Lasserre hierarchy is

$$\sum_{j\in[m]}\sum_{f\in C+\vec b_j}z_{(C_j,f)}(\vec 1) = \sum_{j\in[m]}\sum_{f\in C+\vec b_j}\sum_{f'\in\mathbb{F}_q^d:\,f\in f'\oplus Q}y_{(C_j,f')}(\vec 1) = \sum_{j\in[m]}\sum_{f'\in\mathbb{F}_q^d}\sum_{f\in C+\vec b_j}y_{(C_j,f')}(\vec 1)\cdot 1_{f\in f'\oplus Q}$$
$$\ge \sum_{j\in[m]}\sum_{f'\in C+\vec b_j}y_{(C_j,f')}(\vec 1)\cdot|(f'\oplus Q)\cap(C+\vec b_j)| \ge \sum_{j\in[m]}\sum_{f'\in C+\vec b_j}y_{(C_j,f')}(\vec 1)\cdot p_0|C| \ge p_0|C|\cdot\gamma.$$

Before proving Lemma 10.2.1, we restate Theorem G.8, which is summarized by Chan in [Cha13] from the previous works [Gri01a, Sch08, Tul09], and observe that the pseudo-distribution in their construction is uniform over C on every constraint.

Theorem 10.2.6 ([Cha13]). Let F_q be the finite field of size q and C be a pairwise independent subspace of F_q^d for some constant d ≥ 3. The CSP is specified by parameters F_q, d, k = 1 and a predicate C. The value of an instance Φ of this CSP on n variables with m constraints is m in the Ω(t)-level Lasserre hierarchy if every subset T of at most t constraints contains at least (d − 1.4)|T| variables.

Observation 10.2.7. Let y_S({0, 1}^S) denote the pseudo-distribution on S provided by the solution of the semidefinite program in the Lasserre hierarchy of Φ. For every constraint C_j (j ∈ [m]) in Φ, y_{C_j}({0, 1}^{C_j}) provides a uniform distribution over all assignments that satisfy constraint C_j.

Proof of Lemma 10.2.1. Without loss of generality, we assume [N] = [n] × F_q. It is natural to think of [N] as corresponding to n variables, where each variable has q vertices corresponding to F_q. Let G be a random bipartite graph on [N] ∪ [M] that is d-regular on [M]. For each

vertex j ∈ [M], the probability that j has two or more neighbors in {i} × F_q for some i ∈ [n] is at most d²q/n. Let R denote the subset of vertices in [M] that do not have two or more neighbors in any {i} × F_q. With probability at least 1 − 1/√n, |R| ≥ (1 − d²q/√n)M, because the neighbors of each vertex in [M] are generated by choosing d random vertices in [N]. For each vertex in R, the generation of its neighbors is the same as first sampling d

random variables in [n] and then sampling an element of F_q for each variable. By a standard calculation using the Chernoff bound and Stirling's formula, there exists a constant β = O_{d,M/n}(1) such that, with high probability, every subset T ⊆ R of size at most βn contains at least (d − 1.4)|T| variables.

We construct an instance Φ based on the induced graph of ([n] × F_q) ∪ R, in the list CSP with parameters k, q, d and the predicate {~0}. For each vertex j ∈ R, let (i_1, b_1), ···, (i_d, b_d) be its neighbors in G. We add a constraint C_j in Φ with variables x_{i_1}, ···, x_{i_d} and ~b = (b_1, ···, b_d).

Recall that C is a subspace staying in a subset Q of size k with probability p_0. We use the following two claims to prove that the value of the vertex expansion of ρN-subsets in the Lasserre hierarchy is at most (1 − p_0)|R| + (M − |R|) ≤ (1 − p_0)(1 − d²q/√n)M + (d²q/√n)M ≤ (1 − p_0 + o(1))M with high probability.

Claim 10.2.8. Φ’s value is at least p0|R| in the Ω(βn)-level Lasserre hierarchy.

Claim 10.2.9. Suppose Φ's value is at least r in the t-level Lasserre hierarchy. Then the objective value of the t-level Lasserre hierarchy is at most |R| − r for the vertex expansion problem on [N] ∪ R with ρ = 1 − k/q.

Proof of Claim 10.2.8. Let Λ be the list CSP with parameters F_q, k, d and predicate C. Let Φ′ be the instance corresponding to Φ in Λ. From Theorem 10.2.6, p(Φ′)'s value is |R|, because every small subset T of constraints contains at least (d − 1.4)|T| variables. From Lemma 10.2.5, Φ's value is at least p_0|C| · |R| in the Ω(βn)-level Lasserre hierarchy.

Let us take a closer look: for each constraint j in p(Φ′), the pseudo-distribution on C_j is uniformly distributed over ~b_j + C. Therefore every assignment f + ~b_j for f ∈ C appears in the pseudo-distribution of p(Φ′) on C_j with probability 1/|C|. For the same reason, every assignment f + ~b_j appears in the pseudo-distribution of Φ′ with the same probability, and |Q^d ∩ C|/|C| = p_0. Because ~0 ∈ C, the probability that C_j contains ~0 + ~b_j in the pseudo-distribution of Φ′ is p_0 by this analysis. Using the solution of Φ′ in the Ω(βn)-level Lasserre hierarchy as the solution of Φ, it is easy to see that Φ's value is at least p_0|R|.

Proof of Claim 10.2.9. Let y_S(f), ~v_S(f), for every subset S ⊆ [n] × F_q of size at most t and f ∈ {0, 1}^S, be the pseudo-distribution and vectors in the t-level Lasserre hierarchy for Φ. We define z_S(f), ~u_S(f), for every such S ([N] = [n] × F_q) and f ∈ {0, 1}^S, to be the pseudo-distribution and vectors for the vertex expansion problem as follows:

~u_S(f) = ~v_S(~1 − f), z_S(f) = y_S(~1 − f).

The verification of the fact that ~u explains the matrix (10.3) of z in the Lasserre hierarchy is straightforward. Another property from the construction is

Σ_{(x_i,b)} ~u_{(x_i,b)}(1) = Σ_{(x_i,b)} ~v_{(x_i,b)}(0) = Σ_{(x_i,b)} (~v_∅ − ~v_{(x_i,b)}(1))
= Σ_{i∈[n]} Σ_{b∈F_q} (~v_∅ − ~v_{(x_i,b)}(1)) = Σ_{i∈[n]} (q~v_∅ − k~v_∅) = ρN · ~v_∅,

which implies that the matrix in (10.4) is positive semidefinite by Lemma 10.1.1. The value of the vertex expansion problem given z, ~u is Σ_{j∈[R]}(1 − z_{N(j)}(~0)) = Σ_{j∈[R]}(1 − y_{N(j)}(~1)) = |R| − Σ_{j∈[R]} y_{N(j)}(~1) ≤ |R| − r.

On the other hand, it is easy to prove that a random bipartite graph has very good vertex expansion using the Chernoff bound and a union bound.

Lemma 10.2.10. For any constants d, ρ, ε > 0, and c ≥ 20q/((1−ρ)^d·ε²), with high probability, a random bipartite graph on [N] ∪ [M] (M = cN) that is d-regular in [M] guarantees that every ρN-subset in [N] has at least (1 − (1+ε)(1−ρ)^d)M different neighbors in [M].

Proof. For any subset S ⊆ [N] of size ρN, the probability that a vertex in [M] is not a neighbor of S is at most (1 − ρ)^d + o(1). Applying the Chernoff bound to the M independent experiments, the probability that S has fewer than (1 − (1+ε)(1−ρ)^d)M neighbors in [M] is at most exp(−ε²(1−ρ)^d·M/12) ≤ 2^{−2N}. By a union bound over the at most 2^N subsets of size ρN, every ρN-subset has at least (1 − (1+ε)(1−ρ)^d)M neighbors with high probability.
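The first step of this proof can be checked numerically: when the d neighbors of a right vertex are sampled uniformly with replacement from [N], the probability of missing a fixed ρN-subset S is (1 − ρ)^d. The sketch below is only an illustrative Monte Carlo check; the parameters N, ρ, d are made up.

```python
import random

# Illustrative parameters (not from the text): N left vertices,
# a fixed subset S of size rho*N, and right degree d.
random.seed(0)
N, rho, d = 1000, 0.5, 3
S = set(range(int(rho * N)))  # any fixed rho*N-subset

trials = 200_000
misses = 0
for _ in range(trials):
    # neighbors of one right vertex: d uniform samples from [N]
    nbrs = [random.randrange(N) for _ in range(d)]
    if all(v not in S for v in nbrs):
        misses += 1

est = misses / trials
print(est, (1 - rho) ** d)  # estimate should be close to (1-rho)^d = 0.125
```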

10.2.1 Integrality Gaps for Dispersers

Theorem 10.2.11. For any ε > 0 and ρ ∈ (0, 1), there exist infinitely many d such that a random bipartite graph on [N] ∪ [M] that is d-regular in [M] satisfies the following two properties with high probability:

1. It is a (ρN, (1 − (1−ρ)^d − ε)M)-disperser.

2. The objective value of the Ω(N)-level Lasserre hierarchy for ρ is at most (1 − C_0 · (1−ρ)/(dρ+1−ρ))M for a universal constant C_0 ≥ 1/10.

Proof. Let M ≥ (20q/((1−ρ)^d·ε²))·N. A random bipartite graph G is a (ρN, (1 − (1−ρ)^d − ε)M)-disperser from Lemma 10.2.10 with very high probability.

On the other hand, choose a prime power q and k in the base cases of Lemma 10.1.3 or Lemma 10.1.4 such that ρ′ = 1 − k/q > ρ, and let p_0 be the probability that the subspace C stays in a k-subset. From the construction, p_0 ≥ (1/3)·(1−ρ′)/(dρ′+1−ρ′) ≥ (1/9)·(1−ρ)/(dρ+1−ρ). From Lemma 10.2.1, a random graph G that is d-regular in [M] has vertex expansion at most (1 − p_0)M for ρ′ in the Lasserre hierarchy with high probability. Because ρ′ ≥ ρ, the objective value of the Ω(N)-level Lasserre hierarchy for ρ is at most (1 − (1/9)·(1−ρ)/(dρ+1−ρ))M. Therefore, a random bipartite graph G satisfies the two properties with high probability.

We generalize the above construction to d = Θ(log N) and prove that the Lasserre hierarchy cannot approximate the entropy of a disperser in the rest of this section. Because d = Θ(log N) is super-constant, we relax the strong requirement on the variable expansion of constraints and follow the approach of [Tul09]. We also note that the same observation was independently provided by Bhaskara et al. in [BCV+12].

Theorem 10.2.12 (Restatement of Theorem 4.3 in [Tul09]). Let C be the dual space of a linear code with dimension d and distance l over F_q. Let Φ, with n variables and m constraints, be an instance of the CSP Λ with d, k = 1, F_q and predicate C. If every subset S of at most t constraints in Φ contains at least (1 − l/2 + 0.2)d · |S| different variables, then the value of Φ is m in the Ω(t)-level Lasserre hierarchy.

Lemma 10.2.13. For any prime power q, ε > 0, δ > 0, and any constant c, a random bipartite graph on [N] ∪ [M] that is d = c log N-regular in [M] has the following two properties with high probability:

1. It is a (δN, (1 − 2(1−δ)^d)M)-disperser.

2. The objective value of the N^{Ω(1)}-level Lasserre hierarchy for ρ = (q−1)/q is at most (1 − q^{−εd} + 1/N^{1/3})M.

Proof. Let A be a linear code over F_q of length d, rate (1 − ε) (i.e., dimension (1 − ε)d), and distance 3γd for some γ > 0. C is the dual space of A, with size |C| = q^{εd}. Let M = 20q·N/((1−δ)^d·ε²), which is poly(N) here. From Lemma 10.2.10, a random bipartite graph G on [N] ∪ [M] that is d-regular in [M] is a (δN, (1 − 2(1−δ)^d)M)-disperser with very high probability.

In the rest of the proof, it is enough to show that for every subset S ⊆ [M] of size at most N^{γ/2} in Φ, the constraints in S contain at least (1 − γ)|S|d variables. By a union bound, the probability that this does not happen is bounded by

Σ_{l=1}^{N^{γ/2}} (M choose l)·(N choose (1−γ)dl)·((1−γ)dl/N)^{dl} ≤ Σ_{l≤N^{γ/2}} M^l · N^{(1−γ)dl} · (dl/N)^{dl} ≤ Σ_{l≤N^{γ/2}} M^l·(dl)^{dl}/N^{γ·dl} ≤ Σ_{l≤N^{γ/2}} 1/N^{γ·dl/2} ≤ 0.1.

By Lemma 10.2.1, the value of G with ρ = (q−1)/q is at most (1 − 1/|C| + d²q/√N)M ≤ (1 − q^{−εd} + 1/N^{1/3})M in the Ω(N^{γ/2})-level Lasserre hierarchy.

We show the equivalence between the vertex expansion problem and the problem of approximating the entropy in a disperser:

Problem 10.2.14. Given a bipartite graph ([N], [M], E) and ρ, determine the size of the smallest neighbor set over all subsets of size at least ρN in [N].

Problem 10.2.15. Given a bipartite graph ([N], [M],E) and γ, determine the size of the largest subset in [N] with a neighbor set of size ≤ γM.

We prove the equivalence of these two problems with parameters ρ + γ = 1. For a bipartite graph ([N], [M],E) and a parameter γ, let T be the largest subset in [N] with |Γ(T )| ≤ γM. Let S = [M] \ Γ(T ). Then |S| ≥ (1 − γ)M and Γ(S) ⊆ [N] \ T . Since T is the largest subset with |Γ(T )| ≤ γM, S is the subset of size at least (1 − γ)M with the smallest neighbor set. The converse is similar, which shows the equivalence between these two problems.
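The direction used above — taking an optimal T for Problem 10.2.15 and passing to S = [M] \ Γ(T) — can be brute-forced on a tiny graph. The sketch below uses a made-up random bipartite graph and illustrative parameters; it checks the resulting inequality g ≤ N − f, where f and g are the optima of the two problems.

```python
import itertools, random

random.seed(1)
N, M = 5, 5
# hypothetical random bipartite graph: nbr_right[j] = neighbors of j in [N]
nbr_right = [set(random.sample(range(N), 2)) for _ in range(M)]
nbr_left = [set(j for j in range(M) if i in nbr_right[j]) for i in range(N)]

def gamma_left(T):   # neighbor set in [M] of a subset T of [N]
    return set().union(*[nbr_left[i] for i in T]) if T else set()

def gamma_right(S):  # neighbor set in [N] of a subset S of [M]
    return set().union(*[nbr_right[j] for j in S]) if S else set()

gM = 2  # allow |Γ(T)| <= gM, i.e. γM with γ = gM/M
# Problem 10.2.15: largest T in [N] with a small neighbor set
f = max(len(T) for r in range(N + 1)
        for T in itertools.combinations(range(N), r)
        if len(gamma_left(T)) <= gM)
# Problem 10.2.14: smallest neighbor set over subsets of size >= M - gM
g = min(len(gamma_right(S)) for r in range(M - gM, M + 1)
        for S in itertools.combinations(range(M), r))
assert g <= N - f  # S = [M] \ Γ(T) witnesses this bound
print(f, g)
```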

Theorem 10.2.16. For any α ∈ (0, 1), any δ ∈ (0, 1) and any prime power q, there exists a constant c such that a random bipartite graph on [N] ∪ [M] that is D = c log N-regular in [N] has the following two properties with high probability:

1. It is an (N^α, (1 − δ)M)-disperser.

2. The objective value of the SDP in the N^{Ω(1)}-level Lasserre hierarchy for obtaining M/q distinct neighbors is at least N^{1−α/2}.

Proof. Let ε = α·log(1/(1−δ))/(4 log q) = O(1) and d = log N/(4ε log q), so that |C| = q^{εd} = N^{1/4} and M = 20q·N/((1−δ)^d·ε²) ≥ N^{1/α}. So d = O(log M).

From Lemma 10.2.13, a random bipartite graph on [N] ∪ [M] that is d-regular in [M] is a (δN, M − M^α)-disperser, but the value of the N^{Ω(1)}-level Lasserre hierarchy for G with ρ = 1 − 1/q is at most M − M^{1−α/2}. From the equivalence, any subset of size M^α in [M] has a neighbor set of size at least (1 − δ)N. On the other hand, it is possible that there exists an M^{1−α/2}-subset of [M] with a neighbor set of size at most N/q in the Lasserre hierarchy, from the fact that the N^{Ω(1)}-level Lasserre hierarchy has a value at most M − M^{1−α/2} for ρ = 1 − 1/q. To finish the proof, swap [N] and [M] in the bipartite graph, so that D = d in the new bipartite graph.

Corollary 10.2.17 (Restatement of Theorem 10.0.3). For any α ∈ (0, 1) and any δ ∈ (0, 1), there exists a constant c such that a random bipartite graph on [N] ∪ [M] that is D = c log N-regular in [N] has the following two properties with high probability:

1. It is an (N^α, (1 − δ)M)-disperser.

2. The objective value of the SDP in the N^{Ω(1)}-level Lasserre hierarchy for obtaining δM distinct neighbors is at least N^{1−α}.

10.2.2 Integrality Gaps for Expanders

We prove that a random bipartite graph is almost D-regular on the left-hand side and use the fact that DN ≈ dM.

Theorem 10.2.18. For any prime power q, integer d < q and constant δ > 0, there exist a constant D and a bipartite graph G on [N] ∪ [M], with largest left degree D and largest right degree d, that has the following two properties for ρ = 1/q:

1. It is a (ρN, (1 − ε′ − 2δ)D)-expander with ε′ = ((1−ρ)^d − (1−ρd))/(ρd) = Σ_{i=1}^{d−1} (−1)^{i−1}·((d−1)···(d−i+1)/(i+1)!)·ρ^i.

2. The objective value of the vertex expansion for G with ρ in the Ω(N)-level Lasserre hierarchy is at most (1 − ε + δ)D·ρN with ε = ρ(d−1)/2.

Proof. Let β be a very small constant specified later and c = 100q·log(1/β)/(d(1−ρ)^d·δ²). Let G_0 be a random graph on [N] ∪ [M] with M = cN that is d-regular in [M]. Let D_0 = dM/N, and let L denote the vertices in [N] with degree in [(1 − δ)D_0, (1 + δ)D_0]. Let G_1 denote the induced graph of G_0 on L ∪ [M]. The largest degree in L is D = (1 + δ)D_0 and the largest degree in [M] is d. We

will prove that G_1 is a bipartite graph satisfying the two properties of this lemma with high probability. Because G_0 is a random graph, we assume there exists a constant γ = O_{M/N,d}(1) such that every subset S ⊆ [M] of size at most γN has at least (d − 1.1)|S| different neighbors.

In expectation, each vertex in [N] has degree D_0. By the Chernoff bound, the fraction of vertices in [N] of G_0 with degree more than (1+δ)D_0 or less than (1−δ)D_0 is at most 2exp(−δ²·(dM/N)/12) ≤ β⁴. At the same time, with high probability, G_0 satisfies that any β³N-subset of [N] has total degree at most βdM, because (N choose β³N)·exp(−(1/β²)²·(β³d)·M/12) is exponentially small in N. Therefore, with high probability, |L| ≥ (1 − β³)N and there are at least (1 − β)dM edges in G_1.

We first verify that the objective value of the vertex expansion for G_1 with ρ = 1/q in the Ω(N)-level Lasserre hierarchy is at most (1 − ε + δ)D·ρN. Let R be the vertices in [M] that have degree d. From Lemma 10.2.1, the objective value of the vertex expansion for L ∪ R with ρ = 1/q in the Ω(γN)-level Lasserre hierarchy is at most (1 − p_0)|R|, where p_0 is the staying probability of C in a (q−1)-subset. From Lemma 10.1.4, p_0 = 1 − dρ + (d choose 2)ρ². Therefore (1 − p_0)|R| ≤ (dρ − (d choose 2)ρ²)M. The vertices in [M] \ R contribute at most dβM to the objective value of the Lasserre hierarchy. Therefore the objective value for G_1 is at most (dρ − (d choose 2)ρ² + dβ)M = (1 − (d−1)ρ/2 + β/ρ)·ρdM ≤ (1 − ε + δ)D·ρN for β small enough. For the integral value, every ρN-subset in [N] has at least (1 − (1+β)(1−ρ)^d)M

neighbors in G_0 by Lemma 10.2.10. Because G_1 is the induced graph of G_0 on L ∪ [M], every ρN-subset in L has at least (1 − (1+β)(1−ρ)^d)M ≥ (1 − ε′ − β/(ρd))·ρdM ≥ (1 − ε′ − β/(ρd))·D_0·ρN neighbors in G_1. By setting β small enough, there exists a bipartite graph with the required two properties.

Corollary 10.2.19. For any ε > 0 and any ε′ < (e^{−2ε} − (1 − 2ε))/(2ε), there exist ρ small enough and a bipartite graph G with largest left degree D = O(1) that has the following two properties:

1. It is a (ρN, (1 − ε′)D)-expander.

2. The objective value of the vertex expansion for G with ρ in the Ω(N)-level Lasserre hierarchy is at most (1 − ε)D·ρN.

Proof. Think of ρ as a small constant and d = 2ε/ρ + 1, so that ε = ρ(d−1)/2 is very close to ρd/2. Then the limit of ε′ = ((1−ρ)^d − (1−ρd))/(ρd) as ρ decreases is (e^{−ρd} − (1−ρd))/(ρd) = (e^{−2ε} − (1−2ε))/(2ε).

10.3 An Approximation Algorithm for Dispersers

In this section, we will provide a polynomial time algorithm that has an approximation ratio close to the integrality gap in Theorem 10.0.4.

Theorem 10.3.1. Given a bipartite graph ([N], [M], E) with right degree d, if (1 − ∆)M is the size of the smallest neighbor set over ρN-subsets of [N], then there exists a polynomial time algorithm that outputs a subset T ⊆ [N] such that |T| = ρN and |Γ(T)| ≤ (1 − Ω((min{(ρ/(1−ρ))², 1}/log d)·d(1−ρ)^d·∆))M.

We consider a simple semidefinite program for finding a subset T ⊆ [N] that maximizes the number of vertices unconnected to T:

max Σ_{j∈[M]} ‖(1/d)·Σ_{i∈Γ(j)} ~v_i‖²_2    (*)
subject to ⟨~v_i, ~v_i⟩ ≤ 1 for every i ∈ [N], and Σ_{i=1}^n ~v_i = ~0.

We first show that the objective value of the semidefinite program is at least min{(ρ/(1−ρ))², 1}·∆·M. For convenience, let δ denote the value of this semidefinite program and A denote the positive definite matrix of the objective function, so that δ = Σ_{i,j} A_{i,j}(~v_i^T ~v_j). If ρ ≥ 0.5, then δ ≥ ∆·M by choosing ~v_i = (1, 0, ···, 0) for every i ∉ S and ~v_i = (−(1−ρ)/ρ, 0, ···, 0) for every i ∈ S. But this is not a valid solution for the SDP when ρ < 0.5. However, δ ≥ (ρ/(1−ρ))²·∆·M in this case by choosing ~v_i = (ρ/(1−ρ), 0, ···, 0) for every i ∉ S and ~v_i = (−1, 0, ···, 0) for every i ∈ S. Therefore δ ≥ min{(ρ/(1−ρ))², 1}·∆M. Without loss of generality, both δ and ∆M are at least M/d; otherwise a random subset is enough to achieve the desired approximation ratio.

The algorithm has two stages: first round ~v_i to z_i ∈ [−1, 1] while keeping Σ_i z_i almost

balanced, which is motivated by the work of [AN04], and then round z_i to x_i using the algorithm suggested by [CMM07].

Lemma 10.3.2. There exists a polynomial time algorithm that, given ‖~v_i‖ ≤ 1 for every i, Σ_i ~v_i = ~0, and δ = Σ_j ‖(1/d)·Σ_{i∈Γ(j)} ~v_i‖²_2 ≥ M/d, finds z_i ∈ [−1, 1] for every i such that |Σ_i z_i| = O(N/d) and Σ_j ((1/d)·Σ_{i∈Γ(j)} z_i)² ≥ Ω(δ/log d).

Proof. The algorithm works as follows:

1. Sample ~g ∼ N(0, 1)^N and choose t = √(3 log d).

2. Let ζ_i = ⟨~g, ~v_i⟩ for every i = 1, 2, ···, n.

3. If ζ_i > t or ζ_i < −t, truncate ζ_i to t or −t respectively.

4. Set z_i = ζ_i/t.
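These four steps are easy to implement directly. In the sketch below, the SDP vectors are replaced by random unit vectors purely for illustration (in the algorithm they come from solving (*)); n, dim, and d are hypothetical parameters.

```python
import math, random

random.seed(2)
n, dim, d = 50, 8, 16
t = math.sqrt(3 * math.log(d))

# placeholder vectors standing in for the SDP solution: random unit vectors
def unit():
    v = [random.gauss(0, 1) for _ in range(dim)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]
vs = [unit() for _ in range(n)]

g = [random.gauss(0, 1) for _ in range(dim)]               # step 1
zeta = [sum(gi * vi for gi, vi in zip(g, v)) for v in vs]  # step 2
zeta = [max(-t, min(t, zi)) for zi in zeta]                # step 3: truncate at +/- t
z = [zi / t for zi in zeta]                                # step 4

assert all(-1 <= zi <= 1 for zi in z)
```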

It is convenient to analyze the approximation ratio with another set of vectors {~u_i | i ∈ [n]} in a Hilbert space, where ~u_i(~g) = ⟨~v_i, ~g⟩ and ⟨~u_i, ~u_j⟩ = E_~g[~u_i(~g)·~u_j(~g)] = ⟨~v_i, ~v_j⟩. So Σ_{i,j} A_{i,j}(~u_i^T ~u_j) = δ and Σ_i ~u_i = ~0 again. Let ~u′_i be the vector in the same Hilbert space obtained by applying the truncation with parameter t to ~u_i; namely, ~u′_i(~g) = t (or −t) when ~u_i(~g) > t (or < −t), and otherwise ~u′_i(~g) = ~u_i(~g) ∈ [−t, t]. Therefore the algorithm is the same as sampling a random point ~g and setting z_i = ~u′_i(~g)/t.

Fact 10.3.3. For every i, ‖~u′_i − ~u_i‖_1 = O(1/d^{4.5}) and ‖~u′_i − ~u_i‖²_2 = O(1/d⁴).

The analysis uses the second fact to bound Σ_{i,j} A_{i,j}((~u_i − ~u′_i)^T ~u_j) ≤ O(M/d²) as follows. Notice that A is a positive definite matrix, and consider Σ_{i,j} A_{i,j}(~w_i^T ~w′_j) for any unit vectors ~w_i and ~w′_j. It reaches its maximal value when ~w_i = ~w′_i, by the property of positive definite matrices. And Σ_{i,j} A_{i,j}(~w_i^T ~w_j) = Σ_{j∈[M]} ‖(1/d)·Σ_{i∈Γ(j)} ~w_i‖²_2 is always bounded by M because the ~w_i are unit vectors. So Σ_{i,j} A_{i,j}(~w_i^T ~w′_j) ≤ max{‖~w_1‖_2, ···, ‖~w_n‖_2} · max{‖~w′_1‖_2, ···, ‖~w′_n‖_2} · M. Hence

Σ_{i,j} A_{i,j}(~u_i^T ~u_j) − Σ_{i,j} A_{i,j}(~u′_i^T ~u′_j) = Σ_{i,j} A_{i,j}(~u_i^T (~u_j − ~u′_j)) + Σ_{i,j} A_{i,j}((~u_i − ~u′_i)^T ~u′_j) ≤ O(M/d²).

Therefore Σ_{i,j} A_{i,j}(~u′_i^T ~u′_j) ≥ 0.99·Σ_{i,j} A_{i,j}(~u_i^T ~u_j) ≥ 0.99M/d, and it is upper bounded by t²·M. So with probability at least .49/(dt²), the vector ~g satisfies Σ_{i,j} A_{i,j}·(~u′_i(~g)·~u′_j(~g)) ≥ .49δ. On the other hand, |Σ_i ~u′_i(~g)| ≥ N/d with probability at most 1/d³, from the first property ‖Σ_i ~u′_i‖_1 ≤ O(N/d⁴).

Overall, with probability at least .49/(dt²) − 1/d³, the z_i satisfy |Σ_i z_i| = O(N/d) and Σ_j ((1/d)·Σ_{i∈Γ(j)} z_i)² = Ω(δ/t²) = Ω(δ/log d).

It is not difficult to verify that independently resampling z_i ∈ {−1, 1} for every i according to its bias z_i does not reduce the objective value and keeps the same overall bias. Without loss of generality, let z_i ∈ {−1, 1} from now on.

Lemma 10.3.4. There exists a polynomial time algorithm that, given z_i with |Σ_i z_i| = O(N/d), outputs x_i ∈ {0, 1} such that Σ_i x_i = (1 − ρ)(1 ± 1/d^{1.5})N and Σ_j 1[∀i∈Γ(j): x_i = 1] ≥ Ω(d(1−ρ)^d) · Σ_j ((1/d)·Σ_{i∈Γ(j)} z_i)².

Proof. The algorithm works as follows:

1. Let δ = (1 − ρ)·√(2/d). Execute Step 2 or Step 3, each with probability 0.5.

2. For every i ∈ [N], set x_i = 1 with probability 1 − ρ + δz_i.

3. For every i ∈ [N], set x_i = 1 with probability 1 − ρ − δz_i.
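A direct sketch of this two-sided rounding follows; the parameters ρ and d are hypothetical and chosen so that 1 − ρ ± δ stays inside [0, 1], and the balanced ±1 values stand in for the z_i of Lemma 10.3.2.

```python
import math, random

random.seed(3)
N, rho, d = 20000, 0.3, 16
delta = (1 - rho) * math.sqrt(2 / d)             # step 1
assert 0 <= 1 - rho - delta and 1 - rho + delta <= 1

z = [1 if i % 2 == 0 else -1 for i in range(N)]  # balanced +/-1 values
sign = 1 if random.random() < 0.5 else -1        # choose step 2 or step 3
x = [1 if random.random() < (1 - rho + sign * delta * zi) else 0
     for zi in z]

frac = sum(x) / N
print(frac)  # concentrates around 1 - rho
```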

Let y_j = (1/d)·Σ_{i∈Γ(j)} z_i. The probability that x_i = 1 for every i ∈ Γ(j) is

(1/2)·(1−ρ+δ)^{(1+y_j)d/2}·(1−ρ−δ)^{(1−y_j)d/2} + (1/2)·(1−ρ−δ)^{(1+y_j)d/2}·(1−ρ+δ)^{(1−y_j)d/2}
= (1/2)·(1−ρ+δ)^{d/2}(1−ρ−δ)^{d/2}·[((1−ρ+δ)/(1−ρ−δ))^{y_j·d/2} + ((1−ρ−δ)/(1−ρ+δ))^{y_j·d/2}]
= (1−ρ)^d·(1 − δ²/(1−ρ)²)^{d/2} · cosh(y_j·(d/2)·ln((1−ρ+δ)/(1−ρ−δ)))
≥ (1/2)·(1−ρ)^d·(1−2/d)^{d/2} · 0.9 · (y_j·(d/2)·ln((1−ρ+(1−ρ)√(2/d))/(1−ρ−(1−ρ)√(2/d))))²
≥ (1/2)·(1−ρ)^d·(1−2/d)^{d/2} · 0.9 · y_j²·(d/2)²·(√(2/d))²
≥ Ω((1−ρ)^d · y_j² · d).

At the same time, Σ_i x_i is concentrated around E[Σ_i x_i] = Σ_i (1 − ρ ± δz_i) = (1 − ρ)N ± δΣ_i z_i = (1 − ρ)(1 ± 1/d^{1.5})N with very high probability. Therefore {x_1, ···, x_n} satisfies Σ_i x_i = (1 − ρ)(1 ± 1/d^{1.5})N and Σ_j 1[∀i∈Γ(j): x_i = 1] ≥ Ω(d(1−ρ)^d)·Σ_j y_j² with constant probability.

Proof of Theorem 10.3.1. Let δ be the value of SDP (*), which is at least min{(ρ/(1−ρ))², 1}·∆·M from the analysis above. By Lemma 10.3.2, round ~v_i into z_i ∈ [−1, 1] such that |Σ_i z_i| = O(N/d) and Σ_j ((1/d)·Σ_{i∈Γ(j)} z_i)² ≥ Ω(δ/log d). By Lemma 10.3.4, round z_i into x_i ∈ {0, 1} such that Σ_i x_i = (1 − ρ)(1 ± 1/d^{1.5})N and Σ_j 1[∀i∈Γ(j): x_i = 1] ≥ Ω(d(1−ρ)^d · δ/log d).

Let T = {i | x_i = 0}. Then |T| = ρ(1 ± O(1/d^{1.5}))N and |Γ(T)| ≤ (1 − C·(min{(ρ/(1−ρ))², 1}/log d)·d(1−ρ)^d·∆)M for some absolute constant C. At last, adjust the size of T by randomly adding or deleting O(N/d^{1.5}) vertices so that the size of T is exactly ρN. Because at most O(N/d^{1.5}) vertices are added to T, with constant probability Γ(j) ∩ T = ∅ after the adjustment whenever Γ(j) ∩ T = ∅ for a node j ∈ [M] before the adjustment. Therefore |Γ(T)| ≤ (1 − C_0·(min{(ρ/(1−ρ))², 1}/log d)·d(1−ρ)^d·∆)M for some absolute constant C_0.

Appendices

A Omitted Proof in Chapter 3

We fix z_1, ···, z_k to be complex numbers on the unit circle and use Q(z) to denote the degree-k polynomial ∏_{i=1}^k (z − z_i). We first bound the coefficients of r_{n,k}(z) for n ≥ k.

Lemma A.1. Given z_1, ···, z_k, for any positive integer n, let r_{n,k}(z) = Σ_{i=0}^{k−1} r_{n,k}^{(i)} · z^i denote the residual polynomial r_{n,k} ≡ z^n mod ∏_{j=1}^k (z − z_j). Then each coefficient of r_{n,k} is bounded: |r_{n,k}^{(i)}| ≤ (k−1 choose i)·(n choose k−1) for every i.

Proof. By definition, r_{n,k}(z_i) = z_i^n. From polynomial interpolation, we have

r_{n,k}(z) = Σ_{i=1}^k [∏_{j∈[k]\i}(z − z_j)·z_i^n] / [∏_{j∈[k]\i}(z_i − z_j)].

Let Sym_{S,i} be the symmetric polynomial of degree i in the variables {z_j | j ∈ S} for a subset S ⊆ [k], i.e., Sym_{S,i} = Σ_{S′⊆S, |S′|=i} ∏_{j∈S′} z_j. Then the coefficient of z^l in r_{n,k}(z) is

r_{n,k}^{(l)} = (−1)^{k−1−l} · Σ_{i=1}^k [Sym_{[k]\i, k−1−l} · z_i^n] / [∏_{j∈[k]\i}(z_i − z_j)].

We omit (−1)^{k−1−l} in the rest of the proof and use induction on n, k, and l to prove |r_{n,k}^{(l)}| ≤ (k−1 choose l)·(n choose k−1).

Base case of n: for any n < k, from the definition, r_{n,k}(z) = z^n and |r_{n,k}^{(l)}| ≤ 1. Suppose the claim is true for any n < n_0; we consider r_{n_0,k}^{(l)} from now on. When k = 1, r_{n_0,1} = z_1^{n_0} is bounded by 1 because z_1 is on the unit circle of ℂ.

Given n_0, suppose the induction hypothesis is true for any k < k_0 and any l < k. For k = k_0, we first prove that |r_{n_0,k_0}^{(k_0−1)}| ≤ (n_0 choose k_0−1), and then prove that |r_{n_0,k_0}^{(l)}| ≤ (k_0−1 choose l)·(n_0 choose k_0−1) for l = k_0 − 2, ···, 0.

r_{n_0,k_0}^{(k_0−1)} = Σ_{i=1}^{k_0} z_i^{n_0} / ∏_{j∈[k_0]\i}(z_i − z_j)
= Σ_{i=1}^{k_0−1} z_i^{n_0} / ∏_{j∈[k_0]\i}(z_i − z_j) + z_{k_0}^{n_0} / ∏_{j∈[k_0]\k_0}(z_{k_0} − z_j)
= Σ_{i=1}^{k_0−1} (z_i^{n_0} − z_i^{n_0−1}z_{k_0} + z_i^{n_0−1}z_{k_0}) / ∏_{j∈[k_0]\i}(z_i − z_j) + z_{k_0}^{n_0} / ∏_{j∈[k_0]\k_0}(z_{k_0} − z_j)
= Σ_{i=1}^{k_0−1} z_i^{n_0−1} / ∏_{j∈[k_0−1]\i}(z_i − z_j) + z_{k_0}·[Σ_{i=1}^{k_0−1} z_i^{n_0−1} / ∏_{j∈[k_0]\i}(z_i − z_j) + z_{k_0}^{n_0−1} / ∏_{j∈[k_0]\k_0}(z_{k_0} − z_j)]
= r_{n_0−1,k_0−1}^{(k_0−2)} + z_{k_0}·r_{n_0−1,k_0}^{(k_0−1)}.

Hence |r_{n_0,k_0}^{(k_0−1)}| ≤ |r_{n_0−1,k_0−1}^{(k_0−2)}| + |r_{n_0−1,k_0}^{(k_0−1)}| ≤ (n_0−1 choose k_0−2) + (n_0−1 choose k_0−1) = (n_0 choose k_0−1). For l < k_0 − 1, we have

r_{n_0,k_0}^{(l)} = Σ_{i=1}^{k_0} Sym_{[k_0]\i, k_0−1−l} · z_i^{n_0} / ∏_{j∈[k_0]\i}(z_i − z_j)    (let l′ = k_0 − 1 − l)
= Σ_{i=1}^{k_0−1} (Sym_{[k_0−1]\i, l′} + Sym_{[k_0−1]\i, l′−1}·z_{k_0}) · z_i^{n_0} / ∏_{j∈[k_0]\i}(z_i − z_j) + Sym_{[k_0−1], l′} · z_{k_0}^{n_0} / ∏_{j<k_0}(z_{k_0} − z_j)
= Σ_{i=1}^{k_0−1} (Sym_{[k_0−1]\i, l′}·(z_i − z_{k_0})·z_i^{n_0−1} + Sym_{[k_0−1]\i, l′}·z_{k_0}·z_i^{n_0−1} + Sym_{[k_0−1]\i, l′−1}·z_{k_0}·z_i^{n_0}) / ∏_{j∈[k_0]\i}(z_i − z_j) + Sym_{[k_0−1], l′} · z_{k_0}^{n_0} / ∏_{j<k_0}(z_{k_0} − z_j)
= Σ_{i=1}^{k_0−1} Sym_{[k_0−1]\i, l′}·(z_i − z_{k_0})·z_i^{n_0−1} / ∏_{j∈[k_0]\i}(z_i − z_j) + Σ_{i=1}^{k_0−1} Sym_{[k_0−1], l′}·z_{k_0}·z_i^{n_0−1} / ∏_{j∈[k_0]\i}(z_i − z_j) + Sym_{[k_0−1], l′} · z_{k_0}^{n_0} / ∏_{j<k_0}(z_{k_0} − z_j)
= Σ_{i=1}^{k_0−1} Sym_{[k_0−1]\i, l′}·z_i^{n_0−1} / ∏_{j∈[k_0−1]\i}(z_i − z_j) + Sym_{[k_0−1], l′}·z_{k_0}·[Σ_{i=1}^{k_0−1} z_i^{n_0−1} / ∏_{j∈[k_0]\i}(z_i − z_j) + z_{k_0}^{n_0−1} / ∏_{j<k_0}(z_{k_0} − z_j)],

where we used Sym_{[k_0−1]\i, l′} + z_i·Sym_{[k_0−1]\i, l′−1} = Sym_{[k_0−1], l′} in the fourth line. By the induction hypothesis, |r_{n_0,k_0}^{(l)}| ≤ (k_0−2 choose l−1)·(n_0−1 choose k_0−2) + (k_0−1 choose l)·(n_0−1 choose k_0−1) ≤ (k_0−1 choose l)·(n_0 choose k_0−1).

Similarly, we can bound the coefficients of z^{−n} mod ∏_{j=1}^k (z − z_j).

Lemma A.2. Given z_1, ···, z_k, for any integer n ≥ 0, let r_{−n,k}(z) = Σ_{i=0}^{k−1} r_{−n,k}^{(i)} · z^i denote the residual polynomial r_{−n,k} ≡ z^{−n} mod ∏_{j=1}^k (z − z_j). Then each coefficient of r_{−n,k} is bounded: |r_{−n,k}^{(i)}| ≤ (k−1 choose i)·(n+k−1 choose k−1) for every i.
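The Lemma A.1 bound can be spot-checked numerically by reducing z^n modulo Q(z) directly; the unit-circle angles below are an arbitrary illustrative choice, and the small k, n are for the check only.

```python
import math
from math import comb

# Spot-check of Lemma A.1 for illustrative unit-circle points.
k, n = 3, 7
zs = [complex(math.cos(2 * math.pi * t), math.sin(2 * math.pi * t))
      for t in (0.13, 0.41, 0.77)]

# P(z) = prod_j (z - z_j), coefficients indexed low degree -> high degree.
P = [1 + 0j]
for z in zs:
    P = [(P[i - 1] if i > 0 else 0) - z * (P[i] if i < len(P) else 0)
         for i in range(len(P) + 1)]

# r = z^n mod P via repeated multiply-by-z and reduction (P is monic).
r = [0j] * k
r[0] = 1
for _ in range(n):
    lead = r[-1]
    r = [0j] + r[:-1]
    r = [c - lead * P[i] for i, c in enumerate(r)]

# r interpolates z_j^n at every root of P ...
for z in zs:
    assert abs(sum(c * z ** i for i, c in enumerate(r)) - z ** n) < 1e-6
# ... and each coefficient obeys |r^(i)| <= C(k-1, i) * C(n, k-1).
for i, c in enumerate(r):
    assert abs(c) <= comb(k - 1, i) * comb(n, k - 1) + 1e-9
```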

B Omitted Proof in Chapter 7

We finish the proof of the Leftover Hash Lemma 7.0.3 and the proof of Theorem 7.1.2 in this section.

Proof of Lemma 7.0.3. Given a subset Λ of size 2^k, we first consider

Σ_{h∈H} Σ_{α∈[2^m]} (Pr_{g∼H,x∼Λ}[h = g and h(x) = α] − Pr_{g∼H}[h = g]/2^m)²
= Σ_{h∈H} Σ_{α∈[2^m]} (Pr_{g∼H,x∼Λ}[h = g and h(x) = α]² − 2·Pr_{g∼H,x∼Λ}[h = g and h(x) = α]·Pr_{g∼H}[h = g]/2^m + (Pr_{g∼H}[h = g]/2^m)²)
= Σ_{h∈H} Σ_{α∈[2^m]} Pr_{g∼H,x∼Λ}[h = g and h(x) = α]² − 1/(T·2^m).

Notice that Σ_{h,α} Pr_{g∼H,x∼Λ}[h = g and h(x) = α]² is the collision probability
Pr_{g∼H,h∼H,x∼Λ,y∼Λ}[g = h and g(x) = h(y)]
= (1/T)·Pr_{x∼Λ,y∼Λ}[x = y] + (1/T)·Pr_{x∼Λ,y∼Λ}[x ≠ y]·Pr_{h∼H}[h(x) = h(y) | x ≠ y]
= (1/T)·(2^{−k} + (2^{−m} + 2^{−k})(1 − 2^{−k}))
≤ (1/T)·(2·2^{−k} + 2^{−m}).

Thus Σ_{h∈H} Σ_{α∈[2^m]} (Pr_{g∼H,x∼Λ}[h = g and h(x) = α] − Pr_{g∼H}[h = g]/2^m)² ≤ 2/(T·2^k). From the Cauchy–Schwarz inequality,
Σ_{h∈H} Σ_{α∈[2^m]} |Pr_{g∼H,x∼Λ}[h = g and h(x) = α] − Pr_{g∼H}[h = g]/2^m| ≤ √(T·2^m) · √(2/(T·2^k)) = √2 · 2^{−(k−m)/2}.
Thus Ext(x, a) = h_a(x) is a strong extractor with error √2 · 2^{−(k−m)/2} (in statistical distance).
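The collision bound above rests on the pairwise-independence of the hash family, which is easy to verify directly for a standard family; the family h_{a,b}(x) = ((ax + b) mod p) mod m below is a common textbook example, not the specific family of Lemma 7.0.3, and p, m are illustrative.

```python
# Collision probability of h_{a,b}(x) = ((a*x + b) % p) % m over the
# family {a in [1, p-1], b in [0, p-1]}: for x != y it is <= 1/m + 1/p.
p, m = 101, 10
pairs = [(3, 7), (0, 1), (50, 99)]  # a few distinct inputs x != y

for x, y in pairs:
    coll = 0
    total = 0
    for a in range(1, p):
        for b in range(p):
            total += 1
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
                coll += 1
    assert coll / total <= 1 / m + 1 / p
```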

Then we apply symmetrization and Gaussianization to prove Theorem 7.1.2.

Proof of Theorem 7.1.2. We first symmetrize it by

E_{x_1,···,x_n}[max_Λ Σ_{j=1}^n f(Λ, x_j)]
= E_{x_1,···,x_n}[max_Λ (Σ_{j=1}^n f(Λ, x_j) − E_{x′_1,···,x′_n}[Σ_{j=1}^n f(Λ, x′_j)] + E_{x′_1,···,x′_n}[Σ_{j=1}^n f(Λ, x′_j)])]
≤ max_Λ E_{x′}[Σ_{j=1}^n f(Λ, x′_j)] + E_x[max_Λ |Σ_{j=1}^n f(Λ, x_j) − E_{x′}[Σ_{j=1}^n f(Λ, x′_j)]|].

Then we apply Gaussianization on the second term.

Claim B.1. Let ~g = (g_1, ···, g_n) denote a Gaussian vector sampled from N(0, 1)^n. Then

E_x[max_Λ |Σ_{j=1}^n f(Λ, x_j) − E_{x′}[Σ_{j=1}^n f(Λ, x′_j)]|] ≤ √(2π) · E_x[E_g[max_Λ |Σ_{j=1}^n f(Λ, x_j)·g_j|]].

Proof. We first use the convexity of the |·| function to move E_{x′} inside the maximum:

E_x[max_Λ |Σ_{j=1}^n f(Λ, x_j) − E_{x′}[Σ_{j=1}^n f(Λ, x′_j)]|] ≤ E_x[max_Λ E_{x′}|Σ_{j=1}^n f(Λ, x_j) − Σ_{j=1}^n f(Λ, x′_j)|]
(use max_i E_G[G_i] ≤ E_G[max_i G_i] to move E_{x′} out of the maximum)
≤ E_{x,x′}[max_Λ |Σ_{j=1}^n f(Λ, x_j) − Σ_{j=1}^n f(Λ, x′_j)|].

Then we Gaussianize it using the fact E[|g_j|] = √(2/π). Let ~g denote a sequence of n independent Gaussian random variables. We have

E_{x,x′}[max_Λ |Σ_{j=1}^n (f(Λ, x_j) − f(Λ, x′_j))|]
= √(π/2) · E_{x,x′}[max_Λ |E_g[Σ_{j=1}^n (f(Λ, x_j) − f(Λ, x′_j))·|g_j|]|]   (since E[|g_j|] = √(2/π))
≤ √(π/2) · E_{x,x′}[max_Λ E_g|Σ_{j=1}^n (f(Λ, x_j) − f(Λ, x′_j))·|g_j||]   (convexity of |·|, to move E_g out)
≤ √(π/2) · E_{x,x′}[E_g[max_Λ |Σ_{j=1}^n (f(Λ, x_j) − f(Λ, x′_j))·|g_j||]]   (max_i E_G[G_i] ≤ E_G[max_i G_i])
= √(π/2) · E_g[E_{x,x′}[max_Λ |Σ_{j=1}^n (f(Λ, x_j) − f(Λ, x′_j))·g_j|]]   (symmetry of f(Λ, x_j) − f(Λ, x′_j))
≤ √(π/2) · E_{x,x′}[E_g[max_Λ |Σ_{j=1}^n f(Λ, x_j)·g_j| + max_Λ |Σ_{j=1}^n f(Λ, x′_j)·g_j|]]   (triangle inequality and symmetry of g_j)
≤ √(2π) · E_x[E_g[max_Λ |Σ_{j=1}^n f(Λ, x_j)·g_j|]].

Remark B.2. We use the independence between x_1, ···, x_n in the third-to-last step.
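The Gaussianization constant E[|g_j|] = √(2/π) used above is easy to confirm with a quick Monte Carlo check (sample size and seed are arbitrary):

```python
import math, random

random.seed(4)
samples = [abs(random.gauss(0, 1)) for _ in range(200_000)]
est = sum(samples) / len(samples)
print(est, math.sqrt(2 / math.pi))  # both close to 0.7979
```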

C Omitted Proof in Chapter 8

We apply an induction from i = 0 to i = k. The base case i = 0 is true because there are at most m balls.

Suppose it is true for i = l < d. For a fixed bin j ∈ [n^{1−1/2^l}], there are at most (1+β)^l·(m/n)·n^{1/2^l} balls. Without loss of generality, from the induction hypothesis we assume there are exactly s = (1+β)^l·(m/n)·n^{1/2^l} balls. Under the hash function h_{l+1}, we allocate these balls into t = n^{1/2^{l+1}} bins.

We fix one bin of h_{l+1} and prove that this bin receives at most

(1+β)·s/t = (1+β)·(1+β)^l·(m/n)·n^{1/2^l}/n^{1/2^{l+1}} = (1+β)^{l+1}·(m/n)·n^{1/2^{l+1}}

balls with probability at most 2n^{−c−2} in a log³n-wise δ_1-biased space, for δ_1 = 1/poly(n) with a sufficiently large polynomial. We use X_i ∈ {0, 1} to denote whether the i-th ball is in the bin or not; hence E[Σ_{i∈[s]} X_i] = s/t.

For convenience, we use Y_i = X_i − E[X_i]; hence Y_i = 1 − 1/t with probability 1/t, and Y_i = −1/t otherwise. Notice that E[Y_i] = 0 and |E[Y_i^l]| ≤ 1/t for any l ≥ 2. We choose the even number b = β·2^l = O(log n), for the constant β fixed below, and compute the b-th moment of Σ_{i∈[s]} Y_i as follows.

Pr_{δ_1-biased}[|Σ_{i∈[s]} X_i| > (1+β)s/t] ≤ E_{δ_1-biased}[(Σ_i Y_i)^b]/(βs/t)^b
≤ (E_U[(Σ_i Y_i)^b] + δ_1·s^{2b}t^b)/(βs/t)^b
≤ (Σ_{i_1,...,i_b} E[Y_{i_1}·Y_{i_2}···Y_{i_b}] + δ_1·s^{3b})/(βs/t)^b
≤ (Σ_{j=1}^{b/2} (b−j−1 choose j−1)·(b!/2^j)·s^j·(1/t)^j + δ_1·s^{3b})/(βs/t)^b
≤ (2^{b/2}·b!·(s/t)^{b/2} + δ_1·s^{3b})/(βs/t)^b.

Because s/t ≥ n^{1/2^k} ≥ log³ n, we have b ≤ β·2^k ≤ β·log n/(3 log log n) ≤ (s/t)^{1/3} and (βs/t)² ≥ (s/t)^{1.8}, so we simplify it to

(2b²·(s/t))^{b/2}/(βs/t)^b + δ_1·s^{3b} ≤ (2·(s/t)^{2/3}·(s/t))^{b/2}/((s/t)^{1.8})^{b/2} + δ_1·s^{3b}
≤ (s/t)^{(−0.1)·b/2} + δ_1·s^{3b} ≤ (n^{1/2^{l+1}})^{−0.1·β·2^l/2} + δ_1·(n^{3/2^l})^{3β·2^l} = n^{−0.025β} + δ_1·n^{9β} ≤ 2n^{−c−2}.

Finally, we choose β = 40(c+2) = O(1) and δ_1 = n^{−9β−c−2} to finish the proof.
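One level of the induction step can be simulated directly: throw s balls into t bins and check that a fixed bin's load stays below (1+β)s/t. The sketch below uses a truly random hash as a stand-in for the δ_1-biased space, and the constants are illustrative.

```python
import random

random.seed(5)
t = 50             # bins at this level
load_target = 100  # s/t, the expected load per bin
s = t * load_target
beta = 1.0         # illustrative slack

bins = [0] * t
for _ in range(s):
    bins[random.randrange(t)] += 1

# With s/t large, no bin should exceed (1+beta)*s/t except with tiny probability.
assert max(bins) <= (1 + beta) * s / t
print(max(bins))
```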

Bibliography

[ABG13] Per Austrin, Siavosh Benabbas, and Konstantinos Georgiou. Better Balance by Being Biased: A 0.8776-Approximation for Max Bisection. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’13, pages 277–294, 2013. 13

[ABKU99] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM J. Comput., 29(1):180–200, September 1999. 11, 12, 105

[AGHP90] N. Alon, O. Goldreich, J. Hastad, and R. Peralta. Simple construction of almost k-wise independent random variables. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 1990. 107

[AGK+11] Noga Alon, Gregory Gutin, EunJung Kim, Stefan Szeider, and Anders Yeo. Solving MAX-r-SAT Above a Tight Lower Bound. Algorithmica, 61(3):638–655, 2011. 132, 134

[AN04] Noga Alon and Assaf Naor. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 72–80, 2004. 197

[Bar14] Boaz Barak. Sums of squares upper bounds, lower bounds, and open questions. http://www.boazbarak.org/sos/, 2014. Page 39. Accessed: October 28, 2014. 180

[BBH+14] Boaz Barak, Fernando G.S.L. Brandao, Aram W. Harrow, Jonathan Kelner, David Steurer, and Yuan Zhou. Hypercontractivity, Sum-of-squares Proofs, and Their Applications. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing, STOC '14, pages 307–326, 2014. 132

[BCG+14] Petros Boufounos, Volkan Cevher, Anna C Gilbert, Yi Li, and Martin J Strauss. What's the frequency, Kenneth?: Sublinear Fourier sampling off the grid. In Algorithmica (a preliminary version of this paper appeared in the Proceedings of RANDOM/APPROX 2012, LNCS 7408, pp. 61–72), pages 1–28. Springer, 2014. 3

[BCV+12] Aditya Bhaskara, Moses Charikar, Aravindan Vijayaraghavan, Venkatesan Guruswami, and Yuan Zhou. Polynomial integrality gaps for strong SDP relaxations of densest k-subgraph. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 388–405. SIAM, 2012. 191

[BDMI13] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880–6892, 2013. 6

[BDT17] Avraham Ben-Aroya, Dean Doron, and Amnon Ta-Shma. An efficient reduction from two-source to non-malleable extractors: achieving near-logarithmic min-entropy. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1185–1194, 2017. 8

[Ber97] Bonnie Berger. The Fourth Moment Method. SIAM J. Comput., 26(4):1188–1207, August 1997. 134

[BGGP12] Itai Benjamini, Ori Gurel-Gurevich, and Ron Peled. On k-wise independent distributions and boolean functions. http://arxiv.org/abs/1207.0016, 2012. Accessed: October 28, 2014. 182

[BGH+12] Boaz Barak, Parikshit Gopalan, Johan Hastad, Raghu Meka, Prasad Raghavendra, and David Steurer. Making the long code shorter. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 370–379, Washington, DC, USA, 2012. IEEE Computer Society. 178

[BKS+10] B. Barak, G. Kindler, R. Shaltiel, B. Sudakov, and A. Wigderson. Simulating independence: New constructions of condensers, ramsey graphs, dispersers, and extractors. J. ACM, 57(4):20:1–20:52, May 2010. 177

[BM86] Y. Bresler and A. Macovski. Exact maximum likelihood parameter estimation of superimposed exponential signals in noise. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(5):1081–1089, Oct 1986. 3

[BMRV00] H. Buhrman, P. B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, STOC, pages 449–458, New York, NY, USA, 2000. ACM. 177

[Bon70] Aline Bonami. Étude des coefficients de Fourier des fonctions de L^p(G). Annales de l'Institut Fourier, 20(2):335–402, 1970. 135

[BOT02] Andrej Bogdanov, Kenji Obata, and Luca Trevisan. A lower bound for testing 3-colorability in bounded-degree graphs. In Proceedings of the 43rd Symposium on Foundations of Computer Science, pages 93–102, 2002. 177

[BSS12] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012. 6, 54, 67

[Cha00] Bernard Chazelle. The Discrepancy Method. Cambridge University Press, 2000. 83

[Cha13] Siu On Chan. Approximation resistance from pairwise independent subgroups. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 447–456, New York, NY, USA, 2013. ACM. 187, 188

[Che52] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23:493–507, 1952. 17

[Che16] Xue Chen. Integrality gaps and approximation algorithms for dispersers and bipartite expanders. In SODA, 2016. 14

[Che17] Xue Chen. Derandomized allocation process. Manuscript (https://arxiv.org/abs/1702.03375), 2017. 14

[CJM15] Robert Crowston, Mark Jones, and Matthias Mnich. Max-Cut parameterized above the Edwards-Erdős bound. Algorithmica, 72(3):734–757, 2015. 133

[CKNS15] Kamalika Chaudhuri, Sham M. Kakade, Praneeth Netrapalli, and Sujay Sanghavi. Convergence rates of active learning for maximum likelihood estimation. In Advances in Neural Information Processing Systems, pages 1090–1098, 2015. 7

[CKPS16] Xue Chen, Daniel M. Kane, Eric Price, and Zhao Song. Fourier-sparse interpolation without a frequency gap. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, 2016. 4, 5, 14, 41

[CL16] Eshan Chattopadhyay and Xin Li. Extractors for sumset sources. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 299–311, 2016. 10

[CMM07] Moses Charikar, Konstantin Makarychev, and Yury Makarychev. Near-optimal algorithms for maximum constraint satisfaction problems. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 62–68, 2007. 197

[CP17] Xue Chen and Eric Price. Condition number-free query and active learning of linear families. Manuscript (https://arxiv.org/abs/1711.10051), 2017. 4, 5, 6, 14

[CRSW13] L. Elisa Celis, Omer Reingold, Gil Segev, and Udi Wieder. Balls and bins: Smaller hash families and faster evaluation. SIAM J. Comput., 42(3):1030–1050, 2013. 11, 12, 105, 114, 115, 116, 127

[CRVW02] Michael Capalbo, Omer Reingold, Salil Vadhan, and Avi Wigderson. Randomness conductors and constant-degree lossless expanders. In Proceedings of the 34th Annual ACM STOC, pages 659–668. ACM, 2002. 177

[CW79] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions (extended abstract). Journal of Computer and System Sciences, 18:143–154, 1979. 80, 81

[CZ17] Xue Chen and Yuan Zhou. Parameterized algorithms for constraint satisfaction problems above average with global cardinality constraints. In SODA, 2017. 14

[CZ18] Xue Chen and David Zuckerman. Existence of simple extractors. Manuscript, 2018. 8, 14

[DHKP97] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19–51, October 1997. 81

[DMM08] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008. 6

[Dun10] Mark Dunster. Legendre and Related Functions. Handbook of Mathematical Functions, Cambridge University Press, 2010. 30

[Fan61] Robert Fano. Transmission of information; a statistical theory of communica- tions. Cambridge, Massachusetts, M.I.T. Press, 1961. 75

[FJ95] Alan Frieze and Mark Jerrum. Improved approximation algorithms for max k-cut and max bisection. In Proceedings of the 4th International Conference on Integer Programming and Combinatorial Optimization, pages 1–13, 1995. 13

[FL06] Uriel Feige and Michael Langberg. The RPR2 Rounding Technique for Semidefinite Programs. J. Algorithms, 60(1):1–23, July 2006. 13

[FL12] Albert Fannjiang and Wenjing Liao. Coherence pattern-guided compressive sensing with unresolved grids. SIAM Journal on Imaging Sciences, 5(1):179–202, 2012. 3

[FM16] Yuval Filmus and Elchanan Mossel. Harmonicity and Invariance on Slices of the Boolean Cube. In Computational Complexity Conference 2016, CCC 2016, 2016. 133

[GKM15] Parikshit Gopalan, Daniel Kane, and Raghu Meka. Pseudorandomness via the discrete Fourier transform. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 903–922, 2015. 106

[GMR+11] Venkatesan Guruswami, Yury Makarychev, Prasad Raghavendra, David Steurer, and Yuan Zhou. Finding Almost-Perfect Graph Bisections. In Proceedings of the 2nd Symposium on Innovations in Computer Science, ICS '11, pages 321–337, 2011. 13

[God10] Chris Godsil. Association Schemes. http://www.math.uwaterloo.ca/~cgodsil/ pdfs/assoc2.pdf, 2010. Last accessed: November 6, 2015. 138

[Gri01a] Dima Grigoriev. Linear lower bound on degrees of Positivstellensatz calculus proofs for the parity. Theoretical Computer Science, 259:613–622, 2001. 177, 187

[Gri01b] Dmitry Grigoriev. Complexity of Positivstellensatz proofs for the knapsack. Computational Complexity, 10(2):139–154, 2001. 140

[GSZ14] Venkatesan Guruswami, Ali Kemal Sinop, and Yuan Zhou. Constant factor Lasserre integrality gaps for graph partitioning problems. SIAM Journal on Optimization, 24(4):1698–1717, 2014. 181

[GUV09a] Venkatesan Guruswami, Christopher Umans, and Salil Vadhan. Unbalanced expanders and randomness extractors from parvaresh–vardy codes. J. ACM, 56(4):20:1–20:34, July 2009. 10

[GUV09b] Venkatesan Guruswami, Christopher Umans, and Salil Vadhan. Unbalanced expanders and randomness extractors from parvaresh–vardy codes. J. ACM, 56(4):20:1–20:34, July 2009. 177

[GY10] Gregory Gutin and Anders Yeo. Note on maximal bisection above tight lower bound. Inf. Process. Lett., 110(21):966–969, 2010. 132, 133

[Har28] Ralph Hartley. Transmission of information. Bell System Technical Journal, 1928. 75

[Haz01] Michiel Hazewinkel. Gram matrix. Encyclopedia of Mathematics, Springer, 2001. 36

[HIKP12] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing. ACM, 2012. 2

[HZ02] Eran Halperin and Uri Zwick. A Unified Framework for Obtaining Improved Approximation Algorithms for Maximum Graph Bisection Problems. Random Struct. Algorithms, 20(3):382–402, May 2002. 13

[IK14] Piotr Indyk and Michael Kapralov. Sample-optimal Fourier sampling in any constant dimension. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 514–523. IEEE, 2014. 2

[ILL89] R. Impagliazzo, L. A. Levin, and M. Luby. Pseudo-random generation from one-way functions. In Proceedings of the Twenty-first Annual ACM Symposium on Theory of Computing, STOC ’89, pages 12–24, New York, NY, USA, 1989. ACM. 10, 79, 80, 81

[Kah95] Nabil Kahale. Eigenvalues and expansion of regular graphs. J. ACM, 42(5):1091–1106, September 1995. 177

[Kho02] Subhash Khot. On the Power of Unique 2-prover 1-round Games. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, STOC '02, pages 767–775, New York, NY, USA, 2002. ACM. 13

[KKL88] J. Kahn, G. Kalai, and N. Linial. The Influence of Variables on Boolean Functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science, FOCS '88, pages 68–80, 1988. 132

[KV05] Subhash Khot and Nisheeth K. Vishnoi. The unique games conjecture, integrality gap for cut problems and embeddability of negative type metrics into l1. In 46th Annual IEEE Symposium on Foundations of Computer Science, 23-25 October 2005, Pittsburgh, PA, USA, Proceedings, pages 53–62, 2005. 178

[Las02] Jean B. Lasserre. An explicit equivalent positive semidefinite program for nonlinear 0-1 programs. SIAM J. on Optimization, 12(3):756–769, March 2002. 180, 181

[Li16] Xin Li. Improved two-source extractors, and affine extractors for polylogarithmic entropy. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 168–177, 2016. 10

[LS15] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), FOCS '15, pages 250–269. IEEE Computer Society, 2015. 6, 54, 55, 56, 67

[LT91] M. Ledoux and M. Talagrand. Probability in Banach spaces. Springer, 1991. 82, 91

[LY98] Tzong-Yau Lee and Horng-Tzer Yau. Logarithmic Sobolev inequality for some models of random walks. Ann. Prob., 26(4):1855–1873, 1998. 133

[Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011. 6

[MI10] Malik Magdon-Ismail. Row sampling for matrix algorithms via a non-commutative Bernstein bound. arXiv preprint arXiv:1008.0587, 2010. 6

[MMZ15] Konstantin Makarychev, Yury Makarychev, and Yuan Zhou. Satisfiability of Ordering CSPs Above Average Is Fixed-Parameter Tractable. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’15, pages 975–993, 2015. 132

[Moi15] Ankur Moitra. The threshold for super-resolution via extremal functions. In STOC, 2015. 3, 4

[MOO05] Elchanan Mossel, Ryan O’Donnell, and Krzysztof Oleszkiewicz. Noise stability of functions with low influences: invariance and optimality. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’05, pages 21–30, 2005. 132

[MOO10] Elchanan Mossel, Ryan O’Donnell, and Krzysztof Oleszkiewicz. Noise stability of functions with low influences: invariance and optimality. Annals of Mathe- matics, 171(1):295–341, 2010. 132

[MRRR14] Raghu Meka, Omer Reingold, Guy N. Rothblum, and Ron D. Rothblum. Fast pseudorandomness for independence and load balancing - (extended abstract). In Automata, Languages, and Programming - 41st International Colloquium, ICALP 2014, Proceedings, pages 859–870, 2014. 105, 116

[MZ12] Matthias Mnich and Rico Zenklusen. Bisections above tight lower bounds. In the 38th International Workshop on Graph-Theoretic Concepts in Computer Science, WG '12, pages 184–193, 2012. 132, 133

[NN90] J. Naor and M. Naor. Small-bias probability spaces: Efficient constructions and applications. In Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, STOC ’90, pages 213–223. ACM, 1990. 107

[NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space. J. Comput. Syst. Sci., 52(1):43–52, 1996. 7

[O’D14] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014. 132, 134, 135

[Owe13] Art B. Owen. Monte Carlo theory, methods and examples. 2013. 6

[PS15] Eric Price and Zhao Song. A robust sparse Fourier transform in the continuous setting. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 583–600. IEEE, 2015. 3

[Rot13] Thomas Rothvoß. The Lasserre hierarchy in approximation algorithms. http://www.math.washington.edu/~rothvoss/lecturenotes/lasserresurvey.pdf, 2013. Accessed: October 28, 2014. 180

[RPK86] Robert Roy, Arogyaswami Paulraj, and Thomas Kailath. ESPRIT–a subspace rotation approach to estimation of parameters of cisoids in noise. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(5):1340–1342, 1986. 3

[RRV99] Ran Raz, Omer Reingold, and Salil Vadhan. Error reduction for extractors. In Proceedings of the 40th Annual Symposium on the Foundations of Computer Science, New York, NY, October 1999. IEEE. 11

[RRW14] Omer Reingold, Ron D. Rothblum, and Udi Wieder. Pseudorandom graphs in data structures. In Automata, Languages, and Programming - 41st International Colloquium, ICALP 2014, Proceedings, pages 943–954, 2014. 12

[RS10] Prasad Raghavendra and David Steurer. Graph expansion and the unique games conjecture. In Proceedings of the 42nd ACM STOC, pages 755–764, 2010. 13

[RST10] Prasad Raghavendra, David Steurer, and Prasad Tetali. Approximations for the isoperimetric and spectral profile of graphs and related parameters. In Proceedings of the 42nd ACM STOC, pages 631–640, 2010. 13

[RST12] Prasad Raghavendra, David Steurer, and Madhur Tulsiani. Reductions between expansion problems. In Proceedings of the 27th Conference on Computational Complexity, pages 64–73, 2012. 13

[RT00] Jaikumar Radhakrishnan and Amnon Ta-Shma. Bounds for dispersers, extractors, and depth-two superconcentrators. SIAM J. Discrete Math., 13(1):2–24, 2000. 8

[RT12] Prasad Raghavendra and Ning Tan. Approximating CSPs with global cardinality constraints using SDP hierarchies. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '12, pages 373–387, 2012. 13

[RV08] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025–1045, 2008. 91, 92

[RW14] Atri Rudra and Mary Wootters. Every list-decodable code for high noise has abundant near-optimal rate puncturings. In STOC, 2014. 82

[Sch81] Ralph Otto Schmidt. A signal subspace approach to multiple emitter location spectral estimation. Ph.D. thesis, Stanford University, 1981. 3

[Sch08] Grant Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. In Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS '08, pages 593–602, Washington, DC, USA, 2008. 187

[Sha49] Claude Shannon. Communication in the presence of noise. Proc. Institute of Radio Engineers, 37(1):10–21, 1949. 75

[Sha02] Ronen Shaltiel. Recent developments in explicit constructions of extractors. Bulletin of the EATCS, 77:67–95, 2002. 178

[Sip88] Michael Sipser. Expanders, randomness, or time versus space. J. Comput. Syst. Sci., 36(3):379–383, June 1988. 177

[SM14] Sivan Sabato and Remi Munos. Active regression by stratification. In Advances in Neural Information Processing Systems, pages 469–477, 2014. 7

[SS96] Michael Sipser and Daniel Spielman. Expander codes. IEEE Transactions on Information Theory, 42(6):1710–1722, 1996. 177

[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011. 6

[SSS95] Jeanette P. Schmidt, Alan Siegel, and Aravind Srinivasan. Chernoff-Hoeffding bounds for applications with limited independence. SIAM J. Discret. Math., 8(2):223–250, May 1995. 108, 129

[Sti94] D. R. Stinson. Combinatorial techniques for universal hashing. Journal of Computer and System Sciences, 48(2):337–346, 1994. 116

[SU05] Ronen Shaltiel and Christopher Umans. Simple extractors for all min-entropies and a new pseudorandom generator. J. ACM, 52(2):172–216, March 2005. 10

[SV86] Miklos Santha and Umesh V. Vazirani. Generating quasi-random sequences from slightly-random sources. J. Comput. System Sci., 33:75–87, 1986. 7

[SWZ17] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. arXiv preprint arXiv:1704.08246, 2017. 6

[Ta-02] Amnon Ta-Shma. Almost optimal dispersers. Combinatorica, 22(1):123–145, 2002. 177

[TBR15] Gongguo Tang, Badri Narayan Bhaskar, and Benjamin Recht. Near minimax line spectral estimation. Information Theory, IEEE Transactions on, 61(1):499–512, 2015. 3

[Tre01] Luca Trevisan. Extractors and pseudorandom generators. J. ACM, 48(4):860– 879, July 2001. 10

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foun- dations of Computational Mathematics, 12:389–434, 2012. 18

[Tul09] Madhur Tulsiani. CSP gaps and reductions in the Lasserre hierarchy. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC '09, pages 303–312, New York, NY, USA, 2009. ACM. 187, 191

[TZ04] Amnon Ta-Shma and David Zuckerman. Extractor codes. IEEE Transactions on Information Theory, 50(12):3015–3025, 2004. 177

[Vaz86] Umesh Vazirani. Randomness, adversaries and computation. In Ph.D. Thesis, EECS, UC Berkeley, pages 458–463. 1986. 107

[Vaz87] U. Vazirani. Efficiency considerations in using semi-random sources. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC '87, pages 160–168, 1987. 81

[Voc03] Berthold Vöcking. How asymmetry helps load balancing. J. ACM, 50(4):568–589, July 2003. 11, 12, 105, 106, 108, 109, 110, 122, 124

[Woe99] Philipp Woelfel. Efficient strongly universal and optimally universal hashing. In International Symposium on Mathematical Foundations of Computer Science, MFCS, 1999, pages 262–272, 1999. 81

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014. 6

[WZ99] Avi Wigderson and David Zuckerman. Expanders that beat the eigenvalue bound: Explicit construction and applications. Combinatorica, 19(1):125–138, 1999. 177

[Ye01] Yinyu Ye. A .699-approximation algorithm for Max-Bisection. Math. Program., 90(1):101–111, 2001. 13

[Zuc96a] David Zuckerman. On unapproximable versions of NP-complete problems. SIAM J. Comput., 25(6):1293–1304, 1996. 177

[Zuc96b] David Zuckerman. Simulating BPP using a general weak random source. Algo- rithmica, 16(4/5):367–391, 1996. 177

[Zuc07] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103–128, 2007. 8, 177, 178
