Probability and Algorithms, Caltech CS150, Fall 2018

Leonard J. Schulman

Syllabus

The most important questions of life are, for the most part, really only problems of probability. Strictly speaking one may even say that nearly all our knowledge is problematical; and in the small number of things which we are able to know with certainty, even in the mathematical sciences themselves, induction and analogy, the principal means for discovering truth, are based on probabilities, so that the entire system of human knowledge is connected with this theory.

Pierre-Simon Laplace, Introduction to Théorie Analytique des Probabilités. Oeuvres, t. 7. Paris, 1886, p. 5.

Class time: MWF 10:00-11:00 in Annenberg 314. Office hours will be in my office, Annenberg 317. My office hours are by appointment at the beginning of the term; after that I’ll fix a regular time (but appointments will still be fine). Starting Oct 12: my OH are on Fridays at 11:00. TA: Jenish Mehta, [email protected]. TA Office Hours: weeks that an assignment is due: M 7:00pm Ann 121; off weeks: W 7:00pm Ann 107. (There is a calendar on the class web page that includes all this information.)

Problem sets are due to Jenish’s mailbox by W 6:00pm. There will be problem sets, due on Wednesdays; there will not be an exam. You may collaborate with other students on the sets; just make a note of the extent of collaboration and with whom (truly joint work with Alice, Bob helped me on this problem, etc.). This is assuming it’s a collaboration and doesn’t regularly become one-way; if you feel that happening, (a) focus on doing a decent fraction of the problems on your own or with consultation with me or the TA, (b) don’t collaborate until after you’ve already spent some time thinking about the problem yourself.

Lecture notes will be posted after the fact. The topics covered during the quarter are listed in the table of contents. Some other topics not reached that I’ll try to cover when I add a second quarter to the course: Randomized vs. Distributional Complexity. Game tree evaluation: upper and lower bounds. Karger’s min-cut algorithm. Hashing, AKS dictionary hashing, cuckoo hashing. Power of two choices. Talagrand concentration inequality. Linial-Saks graph partitioning. A. Kalai’s sampling random factored numbers. Feige leader election. Approximation of the permanent and self-reducibility. Equivalence of approximate counting and approximate sampling. ε-biased k-wise independent spaces. #DNF-approximation. Shamir secret sharing. An interactive proof for a problem only known to be in coNP: graph non-isomorphism. Searching for the first spot where two sequences disagree. Weighted sampling (e.g., Karger network reliability). Markov Chain Monte Carlo.

Notes. (1) This course can be only an exposure to probability and its role in the theory of algorithms. We will stay focused on key ideas and examples; we will not be overconcerned with best bounds. (2) I assume this is not your first exposure to probability. Likewise I’ll assume you have some familiarity with algorithms. However, the first lecture will start out with some basic examples and definitions.

Books. There will be no assigned book, but I recommend the following references. On reserve at SFL:

• Mitzenmacher & Upfal, Probability and Computing, Cambridge 2005
• Motwani & Raghavan, Randomized Algorithms, Cambridge 1995
• Williams, Probability with Martingales, Cambridge 1991
• Alon & Spencer, The Probabilistic Method, 4th ed., Wiley 2016

Not on reserve:


• Adams & Guillemin, Measure Theory and Probability, Birkhäuser 1996

• Billingsley, Probability and Measure, 3rd ed., Wiley 1995

Contents

1 Some basic probability theory
  1.1 Lecture 1 (3/Oct): Appetizers
  1.2 Lecture 2 (5/Oct): Some basics
    1.2.1 Measure
    1.2.2 Measurable functions, random variables and events
  1.3 Lecture 3 (8/Oct): Linearity of expectation, union bound, existence theorems
    1.3.1 Countable additivity
    1.3.2 Coupon collector
    1.3.3 Application: the probabilistic method
    1.3.4 Union bound
    1.3.5 Using the union bound in the probabilistic method: Ramsey theory
  1.4 Lecture 4 (10/Oct): Upper and lower bounds
    1.4.1 Bonferroni inequalities
    1.4.2 Tail events: Borel-Cantelli
    1.4.3 B-C II: a partial converse to B-C I
  1.5 Lecture 5 (12/Oct): More on tail events: Kolmogorov 0-1, random walk
  1.6 Lecture 6 (15/Oct): More probabilistic method
    1.6.1 Markov inequality (the simplest tail bound)
    1.6.2 Variance and the Chebyshev inequality: a second tail bound
    1.6.3 Power mean inequality
    1.6.4 Large girth and large chromatic number; the deletion method
  1.7 Lecture 7 (17/Oct): FKG inequality
  1.8 Lecture 8 (19/Oct) Part I: Achieving expectation in MAX-3SAT
    1.8.1 Another appetizer
    1.8.2 MAX-3SAT
    1.8.3 Derandomization by the method of conditional expectations


2 Algebraic Fingerprinting
  2.1 Lecture 8 (19/Oct) Part II: Fingerprinting with Linear Algebra
    2.1.1 Polytime Complexity Classes Allowing Randomization
    2.1.2 Verifying Matrix Multiplication
  2.2 Lecture 9 (22/Oct): Fingerprinting with Linear Algebra
    2.2.1 Verifying Associativity
  2.3 Lecture 10 (24/Oct): Perfect matchings, polynomial identity testing
    2.3.1 Matchings
    2.3.2 Bipartite perfect matching: deciding existence
    2.3.3 Polynomial identity testing
  2.4 Lecture 11 (26/Oct): Perfect matchings in general graphs. Parallel computation. Isolating lemma.
    2.4.1 Deciding existence of a perfect matching in a graph
    2.4.2 Parallel computation
    2.4.3 Sequential and parallel linear algebra
    2.4.4 Finding perfect matchings in general graphs. The isolating lemma
  2.5 Lecture 12 (29/Oct): Isolating lemma, finding a perfect matching in parallel
    2.5.1 Proof of the isolating lemma
    2.5.2 Finding a perfect matching, in RNC

3 Concentration of Measure
  3.1 Lecture 13 (31/Oct): Independent rvs, Chernoff bound, applications
    3.1.1 Independent rvs
    3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk)
    3.1.3 Application: set discrepancy
    3.1.4 Entropy and Kullback-Leibler divergence
  3.2 Lecture 14 (2/Nov): Stronger Chernoff bound, applications
    3.2.1 Chernoff bound using divergence; robustness of BPP
    3.2.2 Balls and bins
    3.2.3 Preview of Shannon’s coding theorem
  3.3 Lecture 15 (5/Nov): Application of large deviation bounds: Shannon’s coding theorem
    3.3.1 Shannon’s block coding theorem. A probabilistic existence argument
    3.3.2 Central limit theorem
  3.4 Lecture 16 (7/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions
    3.4.1 Gale-Berlekamp game
    3.4.2 Moment generating functions, Chernoff bound for general distributions
  3.5 Lecture 17 (9/Nov): Johnson-Lindenstrauss embedding ℓ2 → ℓ2
    3.5.1 Normed spaces
    3.5.2 JL: the original method
    3.5.3 JL: a similar, and easier to analyze, method
  3.6 Lecture 18 (12/Nov): cont. JL embedding; Bourgain embedding
    3.6.1 cont. JL
    3.6.2 Bourgain embedding X → Lp, p ≥ 1
    3.6.3 Embedding into L1
  3.7 Lecture 19 (14/Nov): cont. Bourgain embedding
    3.7.1 cont. Bourgain embedding: L1
    3.7.2 Embedding into any Lp, p ≥ 1
    3.7.3 Aside: Hölder’s inequality

4 Limited independence
  4.1 Lecture 20 (16/Nov): Pairwise independence, Shannon coding theorem again, second moment inequality
    4.1.1 Improved proof of Shannon’s coding theorem using linear codes
    4.1.2 Pairwise independence and the second-moment inequality
  4.2 Lecture 21 (19/Nov): G(n, p) thresholds
    4.2.1 Threshold for H as a subgraph in G(n, p)
    4.2.2 Most pairs independent: threshold for K4 in G(n, p)
  4.3 Lecture 22 (21/Nov): Concentration of the number of prime factors; begin Khintchine-Kahane for 4-wise independence
    4.3.1 4-wise independent random walk
  4.4 Lecture 23 (26/Nov): Cont. Khintchine-Kahane for 4-wise independence; begin MIS in NC
    4.4.1 Paley-Zygmund: solution through an in-probability bound
    4.4.2 Berger: a direct expectation bound
    4.4.3 cont. proof of Theorem 73
    4.4.4 Maximal Independent Set in NC
  4.5 Lecture 24 (28/Nov): Cont. MIS, begin derandomization from small sample spaces
    4.5.1 Cont. MIS
    4.5.2 Descent Processes
    4.5.3 Cont. MIS
    4.5.4 Begin derandomization from small sample spaces
  4.6 Lecture 25 (30/Nov): Limited linear independence, limited statistical independence, error correcting codes
    4.6.1 Generator matrix and parity check matrix
    4.6.2 Constructing C from M
    4.6.3 Proof of Thm (87) Part (1): Upper bound on the size of k-wise independent sample spaces
    4.6.4 Back to Gale-Berlekamp
    4.6.5 Back to MIS


5 Lovász local lemma
  5.1 Lecture 26 (3/Dec): The Lovász local lemma
  5.2 Lecture 27 (5/Dec): Applications and further versions of the local lemma
    5.2.1 Graph Ramsey lower bound
    5.2.2 van der Waerden lower bound
    5.2.3 Heterogeneous events and infinite dependency graphs
  5.3 Lecture 28 (7/Dec): Moser-Tardos branching process algorithm for the local lemma

Bibliography

Chapter 1

Some basic probability theory

1.1 Lecture 1 (3/Oct): Appetizers

1. Measure the length of a long string coiled under a glass tabletop. You have an ordinary rigid ruler (longer than the sides of the table).

2. N gentlemen check their hats in the lobby of the opera, but after the performance the hats are handed back at random. How many men, on average, get their own hat back?

3. The coins-on-dots problem: On the table before us are 10 dots, and in our pocket are 10 nickels. Prove the coins can be placed on the table (no two overlapping) in such a way that all the dots are covered.

4. Birthday Paradox. I just remind you of this: a class of 23 students has better than even odds of some common birthday. (Supposing birthdates are uniform on 365 possibilities.) The exact calculation is

Pr(some common birthday) = 1 − (365 · 364 ··· 343)/365^23 ≈ 0.507297

but a better way to understand this is that the number of ways this can happen is (k choose 2) for k students; so long as these events don’t start heavily overlapping, we can almost add their probabilities (which are each just 1/365). We’ll be more formal about the upper and lower bounds soon.

5. The envelope swap paradox: You’re on a TV game show and the host offers you two identical-looking envelopes, each of which contains a check in your name from the TV network. You pick whichever envelope you like and take it, still unopened. Then the host explains: one of the checks is written for a sum of $N (N > 0), and the other is for $10N. Now, he says, it’s 50-50 whether you selected the small check or the big one. He’ll give you a chance, if you like, to swap envelopes. It’s a good idea for you to swap, he explains, because your expected net gain is (with $m representing the sum currently in hand):

E(gain) = (1/2)(10m − m) + (1/2)(m/10 − m) = (81/20)m

How can this be?

6. Consider a certain society in which parents prefer female offspring. Can a couple increase their expected fraction of daughters by halting reproduction after the first daughter?


Let’s just make explicit here that we are not using advanced medical technologies. That is to say, the couple can control whether they create a pregnancy, but no other property of the fetus.

Before moving on we note that this is closely related to a famous problem which we will return to: the gambler’s ruin problem. A gambler starts with $1 in his pocket and repeatedly risks $1 on a fair coin toss, until he goes broke. He is very likely to go broke, right? Indeed if he sticks around indefinitely, he will go broke with probability 1. But that is exactly equivalent (when boys and girls are equiprobable) to the event of a sufficiently large family having an excess of girls over boys. Why are our intuitions so opposite in these two cases? It has to do with the fact that we clearly internalize the finiteness of the family size, whereas we can easily imagine the gambler addictively playing for an extraordinarily long time. So his high-probability doom impresses us. If we decide in advance that we will stop him after one million plays, whether or not he has stopped himself by that time, then his expected wealth at that time is equal to $1, even though he has almost certainly lost his $1 and gone broke; there’s a small chance he has earned a lot of money.

7. Unbalancing lights: You’re given an n × n grid of lightbulbs. For each bulb, at position (i, j), there is a switch bij; there is also a switch ri on each row and a switch cj on each column. The (i, j) bulb is lit if bij + ri + cj is odd.

What is the greatest f(n) such that for any setting to the bij’s, you can set the row and column switches to light at least n^2/2 + f(n) bulbs?

Now, we haven’t yet defined either random variables or expectations, but I think you likely already have a feel for these concepts, so let’s see how linearity of expectation already resolves several of our appetizers. If you’re not sure how to be rigorous about this, no worries, we’ll proceed more methodically in the next lecture.

(1): Let the tabletop be the rectangle [−a, a] × [−b, b]. Set r = √(a^2 + b^2). Choose θ uniformly in [0, π) and z uniformly in [−r, r]. Lay the ruler along the affine line of points (x, y) satisfying x cos θ + y sin θ = z. Count the number of times the ruler crosses the string. Since we have in mind a physical string, mathematically we can model it as differentiable, and therefore the number of intersections is equal to the limit in which we decompose the string into short straight segments. A ruler can intersect such a segment only 0 or 1 times (apart from a probability 0 event of aligning perfectly). Observe that our process with the ruler is such that no matter where a straight segment lies on the table, the probability of the ruler intersecting it is proportional to its length. Applying linearity of expectation, we conclude that the total length is proportional to the expected number of intersections. We skip calculating the constant of proportionality.

(2): The probability that each gentleman gets his hat back is 1/N. So the expected number of his hats that he receives (this can be only 0 or 1) is 1/N. By linearity of expectation, the expected number of hats restored to their proper owners overall is ∑_{1}^{N} 1/N = 1. (Note, if you know about independence of random events, the events corresponding to success of the various gentlemen are not independent! But that doesn’t matter, since we are only adding expectations.)
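Appetizer (2) lends itself to a quick numerical sanity check. The following is a small Python sketch (my own illustration, not part of the notes): it simulates random hat returns and estimates the expected number of gentlemen who receive their own hat, which should be close to 1 for every N, exactly as linearity of expectation predicts, even though the indicator events are dependent.

import random

def avg_fixed_points(n, trials=100_000):
    # Each trial: a uniformly random permutation of n hats;
    # count how many gentlemen receive their own hat back.
    total = 0
    for _ in range(trials):
        hats = list(range(n))
        random.shuffle(hats)
        total += sum(1 for i, h in enumerate(hats) if i == h)
    return total / trials

for n in (3, 10, 100):
    print(n, avg_fixed_points(n))   # each estimate should be close to 1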


1.2 Lecture 2 (5/Oct) Some basics

1.2.1 Measure

Frequently one can “get by” with a na¨ıve treatment of probability theory: you can treat random variables quite intuitively so long as you maintain Bayes’ law for conditional probabilities of events:

Pr(A1 ∩ A2) Pr(A1|A2) = Pr(A2)

However, that’s not good enough for all situations, so we’re going to be more careful, and methodically answer the question, “what is a random variable?” (For a philosophical and historical discussion of this question see Mumford in [70].) First we need measure spaces. Let’s start with some standard examples.

1. Z with the counting measure.

2. R with the Lebesgue measure, i.e., the measure (general definition momentarily) in which intervals have measure proportional to their length: µ([a, b]) = b − a for b ≥ a.

3. [0, 1] with the Lebesgue measure.

4. A finite set with the uniform probability measure.

As we see, a measure µ assigns a real number to (some) subsets of a universe; if, as in the last two examples, we also have

µ(universe) = 1 (1.1)

then we say the measure space is a probability space or a sample space. Let’s see what are the formal properties we want from these examples. As we just hinted, we don’t necessarily assign a measure to all subsets of the universe; only to the measurable sets. In order to make sense of this, we need to define the notion of a σ-algebra (also known as a σ-field).

A σ-algebra (M, M̃) is a set M along with a collection M̃ of subsets of M (called the measurable sets) which satisfy: (1) ∅ ∈ M̃, and (2) M̃ is closed under complement and countable intersection. It follows also that M ∈ M̃ and M̃ is closed under countable union (de Morgan). By induction this gives a stability property: we can take any finite sequence of the form, a countable union of countable intersections of . . . of countable unions of measurable sets, and the result will be a measurable set.

A measure space is a σ-algebra (M, M̃) together with a measure µ, which is a function

µ : M̃ → [0, ∞] (1.2)

that is countably additive, that is, for any pairwise disjoint S1, S2, . . . ∈ M̃,

µ(∪ Si) = ∑ µ(Si). (1.3)

So, (1.2) and (1.3) give us a measure space, and if we also assume (1.1) then we have a probability space. Let us see some properties of measure spaces:

I. µ(∅) = 0 since µ(∅) + µ(∅) = µ(∅ ∪ ∅) = µ(∅).


II. The modular identity µ(S) + µ(T) = µ(S ∩ T) + µ(S ∪ T) holds because necessarily S − T, T − S and S ∩ T are measurable, and both sides of the equation may be decomposed into the same linear combination of the measures of these sets. (The set S − T is S ∩ (¬T).) This identity is sometimes also called the lattice or valuation property. III. From the modular identity and nonnegativity, S ⊆ T ⇒ µ(S) ≤ µ(T).

1.2.2 Measurable functions, random variables and events

A measurable function is a mapping X from one measure space, say (M1, M̃1, µ1), into another, say (M2, M̃2, µ2), such that pre-images of measurable sets are measurable, that is to say, if T ∈ M̃2, then X^{−1}(T) ∈ M̃1.

If M1 is a probability space we call X a random variable.

The range of the random variable, M2, can be many things, for example:

• M2 = R, with the σ-field consisting of rays (a, ∞), rays [a, ∞), and any set formed out of these by closing under the operations of complement and countable union. (In CS language, any other measurable set is formed by a finite-depth formula whose leaves are rays of the afore- mentioned type, and each internal node is either a complementation or a countable union.) Sometimes it is convenient to use the “extended real line,” the real line with ∞ and −∞ adjoined, as the base set.

• M2 = names of people eligible for a draft which is going to be implemented by a lottery. The σ-field here is 2^{M2}, namely the power set of M2.

• M2 = deterministic algorithms for a certain computational problem, with the counting measure. On a countably infinite set M2, just as on a finite set, we can use the power set as the σ-field. The counting measure assigns to S ⊆ M2 its cardinality |S|.

Events With any measurable subset T of M2 we associate the event X ∈ T; if X is understood, we simply call this the event T. This event has the probability Pr(X ∈ T) (or if X is understood, Pr(T)) dictated by

Pr(X ∈ T) = µ1(X^{−1}(T)). (1.4)

The indicator of this event is the function ⟦T⟧ or I_T,

I_T : M1 → {0, 1} ⊆ R

I_T(y) = 1 if y ∈ X^{−1}(T), and 0 otherwise.

The basic but key property is that

Pr(X ∈ T) = ∫ I_T dµ1 = E(I_T). (1.5)

It follows that probabilities of events satisfy:

1. Pr(∅) = 0 (“the experiment has an outcome”)


2. Pr(M2) = 1 (“the experiment has only one outcome”)

3. Pr(A) ≥ 0

4. Pr(A) + Pr(B) = Pr(A ∩ B) + Pr(A ∪ B)

Note that events can themselves be thought of as random variables taking values in {0, 1}; indeed we will sometimes define an event directly, rather than creating it out of some other random variable X and subset T of the image of X. For the most part we will sidestep measure theory—one needs it to cure pathologies but we will be studying healthy patients. However I recommend Adams and Guillemin [3] or Billingsley [16]. Often when studying probability one may suppress any mention of the sample space in favor of abstract axioms of probability. For us the situation will be quite different. While starting out as a formality, explicit sample spaces will soon play a significant role.

Joint distributions Given two random variables X1 : M → M1, X2 : M → M2 (where each Mi has associated with it a σ-field (Mi, M̃i)), we can form the “product” random variable (X1, X2) : M → M1 × M2. The same goes for any countable collection of rvs on M, and it is important that we can do this for countable collections; for instance we want to be able to discuss unbounded sequences of coin tosses. Given a product rv (X1, X2, . . .) : M → M1 × M2 × . . . ,

its marginals are probability distributions on each of the measure spaces Mi. These distributions are defined by, for A ∈ M̃i,

Pr(Xi ∈ A) = Pr((X1, X2, . . .) ∈ M1 × M2 × . . . × Mi−1 × A × Mi+1 × . . .)

That is, you simply ignore what happens to the other rvs, and assign to set A ∈ M̃i the probability µ(Xi^{−1}(A)).

X1, X2, . . . are independent if for any finite S = {s1, . . . , sn} and any A_{s1} ∈ M̃_{s1}, . . . , A_{sn} ∈ M̃_{sn}, we have

Pr((X_{s1}, . . . , X_{sn}) ∈ A_{s1} × · · · × A_{sn}) = Pr(X_{s1} ∈ A_{s1}) ··· Pr(X_{sn} ∈ A_{sn}).

(Note that Pr((X1, X2) ∈ A1 × A2) is just another way of writing Pr((X1 ∈ A1) ∧ (X2 ∈ A2)).) Example: a pair of fair dice. Let M be the set of 36 ways in which two dice can roll, each outcome having probability 1/36. On this sample space we can define various useful functions: e.g., Xi = the value of die i (i = 1, 2); Y = X1 + X2. X1 and X2 are independent; X1 and Y are not independent.
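For the dice example, the 36-point sample space is small enough to enumerate exactly. The following Python sketch (my own illustration, not part of the notes) checks the product rule for a pair of events involving X1 and X2, and exhibits its failure for a pair involving X1 and Y = X1 + X2.

from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), range(1, 7)))   # the 36 equally likely outcomes
P = Fraction(1, 36)

def pr(event):
    # probability of an event given as a predicate on an outcome (x1, x2)
    return sum(P for w in space if event(w))

# X1 and X2 are independent: the two numbers agree
print(pr(lambda w: w[0] == 3 and w[1] == 5),
      pr(lambda w: w[0] == 3) * pr(lambda w: w[1] == 5))

# X1 and Y = X1 + X2 are not: e.g. Pr(X1 = 3, Y = 12) = 0 but the product is not
print(pr(lambda w: w[0] == 3 and sum(w) == 12),
      pr(lambda w: w[0] == 3) * pr(lambda w: sum(w) == 12))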

X1,...: M → T are independent and identically distributed (iid) if they are independent and all marginals are identical. If T is finite and the marginals are the uniform distribution, we say that the rv’s are uniform iid. We use the same terminology in case T is infinite but of finite measure (e.g., Lebesgue measure on a compact set), and the marginal on T is the probability distribution proportional to this measure on T. Conditional Probabilities are defined by

Pr(X ∈ A|X ∈ B) = Pr(X ∈ A ∩ B) / Pr(X ∈ B) (1.6)

provided the denominator is positive. An old example. You meet Mr. Smith and find out that he has exactly two children, at least one of which is a girl. What is the probability that both are girls? Answer1: 1/3.

1As usual in such examples we suppose that the sexes of the children are uniform iid. Some facts from general knowledge should be enough for you to doubt both uniformity and independence.
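A small simulation (my own illustration, under the uniform-iid assumption of the footnote, not part of the notes) of the Mr. Smith example: conditioning on “at least one girl” leaves probability about 1/3 for two girls.

import random

def estimate(trials=200_000):
    both = at_least_one = 0
    for _ in range(trials):
        kids = [random.choice("BG") for _ in range(2)]
        if "G" in kids:                  # condition on: at least one is a girl
            at_least_one += 1
            if kids == ["G", "G"]:
                both += 1
    return both / at_least_one           # estimate of Pr(both girls | at least one girl)

print(estimate())    # should be close to 1/3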


Taking (1.6) and applying induction, we have that if Pr(A1 ∩ ... An) > 0, then: Chain rule for conditional probabilities

Pr(A1 ∩ ... An) = Pr(An|A1 ∩ ... An−1) · Pr(An−1|A1 ∩ ... An−2) ··· Pr(A2|A1) · Pr(A1).

(If Pr(A1 ∩ ... An) = 0 then because of the denominators, some of the conditional probabilities in the chain might not be well defined, but you can say that either Pr(A1) = 0 or there is some i s.t. Pr(Ai|A1 ∩ ... Ai−1) = 0.)

Real-valued random variables; expectations If X is a real-valued rv on a sample space with measure µ, its expectation (aka average, mean or first moment) is given by the following integral

E(X) = ∫ X dµ

which is defined in the Lebesgue manner by2

∫ X dµ = lim_{h→0} ∑_{integer j, −∞<j<∞} jh · Pr(jh ≤ X < (j + 1)h) (1.7)

provided the corresponding sum of absolute values converges:

lim_{h→0} ∑_{integer j, −∞<j<∞} |j|h · Pr(jh ≤ X < (j + 1)h) < ∞. (1.8)

It is not hard to innocently encounter cases where the integral is not defined. Stand a meter from an infinite wall, holding a laser pointer. Spin so you’re pointing at a uniformly random orientation. If the laser pointer is not shining at a point on the wall (which happens with probability 1/2), repeat until it does. The displacement of the point you’re pointing at, relative to the point closest to you on the wall, is tan α meters for α uniformly distributed in (−π/2, π/2). You could be forgiven for thinking the average displacement “ought” to be 0, but the integral does not converge absolutely, because

∫_0^{π/2} tan α dα = −∫_{cos(0)}^{cos(π/2)} (1/x) dx = −[log x]_{cos(0)}^{cos(π/2)} = +∞,

using the substitution x = cos α. To see the kind of problem that this can create, consider that for an integration definition to make sense, we ought to have the property that if lim a_m = a and lim b_m = b, then lim ∫_{a_m}^{b_m} f(α) dα = ∫_a^b f(α) dα. But in the present circumstance we can, for instance, take a_m = −arccos(1/m), b_m = arccos(2/m), and then ∫_{a_m}^{b_m} tan α dα = log(1/m) − log(2/m) = −log 2 (rather than 0).

Nonnegative integrands. As we see in (1.8), it is essential to be able to characterize whether an integral of a nonnegative function converges. (That equation is a discretization of ∫ |X| dµ.) It is worth pointing out that for a probability measure µ supported on the nonnegative integers, ∫ x dµ(x) = ∑_{n≥1} n µ({n}) = ∑_{n≥1} µ({n, n+1, . . .}). So the integral converges iff the sequence µ({n, n+1, . . .}) has a finite sum. Exercise: State and verify the analogous statement when µ is supported on the nonnegative reals.

2 One can be more scrupulous about the measure theory; see the suggested references.

But he knew little out of his way, and was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in everything said, or was for ever denying or distinguishing upon trifles, to the disturbance of all conversation. He soon left us.
The Autobiography of Benjamin Franklin, chapter 5
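Returning to the laser-pointer example, here is a hedged numerical illustration in Python (mine, not in the notes): the sample mean of tan α, with α uniform on (−π/2, π/2), keeps jumping around as the sample size grows, which is what an undefined expectation looks like empirically.

import math, random

def sample_mean_tan(n):
    # average displacement over n spins; α uniform in (−π/2, π/2)
    return sum(math.tan(random.uniform(-math.pi / 2, math.pi / 2))
               for _ in range(n)) / n

random.seed(0)
for n in (10**3, 10**4, 10**5, 10**6):
    # the estimates do not settle down: occasional near-(±π/2) angles dominate
    print(n, sample_mean_tan(n))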


1.3 Lecture 3 (8/Oct): Linearity of expectation, union bound, existence theorems

Let’s return to one of our appetizers, the coins-on-dots problem (3): I don’t want to give this away entirely, but here’s a hint: what is the fraction of the plane covered by unit disks packed in a hexagonal pattern?

1.3.1 Countable additivity

Back to the theory. If we have two real-valued rvs X, Y on the same sample space, we can form their sum rv X + Y. No matter the joint distribution of X and Y, we have, providing their expectations are well defined:

E(X + Y) = E(X) + E(Y)    (linearity of expectation)

for the simple reason that expectation is a first moment. You have only to verify:

Exercise: Absolute convergence of ∫ X dµ and ∫ Y dµ implies absolute convergence of ∫ (X + Y) dµ. Because ∫ |X + Y| dµ ≤ ∫ (|X| + |Y|) dµ < ∞.

In the nonnegative case we have also countable additivity:

Exercise: Let X1, . . . be nonnegative real-valued with expectations E(Xi). Then

E(∑ Xi) = ∑ E(Xi).

1.3.2 Coupon collector

There are n distinct types of coupons and you want to own the whole set. Each draw is uniformly distributed, no matter what has happened earlier. What is the expected time to elapse until you own the set?

Think of the coupons being sampled at times 1, 2, . . .. Let Yi = the first time at which we are in state Si, which is when we have seen exactly i different kinds of coupons (i = 0, . . . , n). So Y0 = 0, Y1 = 1. Let Xi = Yi − Yi−1. In state Si−1, in each round there is probability (n − i + 1)/n that we see a new kind of coupon, until that finally happens. That is to say, Xi is geometrically distributed with parameter pi = (n − i + 1)/n. We can work out E(Xi) from the geometric sum, but there’s a slicker way.

If we’re in state Si−1, then with probability (n − i + 1)/n we’re in Si in one more time step, else we’re back in the same situation.

[Diagram (1.9): Markov chain S0 → S1 → ··· → Sn; the transition S_{i−1} → S_i has probability (n−i+1)/n, the self-loop at S_{i−1} has probability (i−1)/n, and S_n is absorbing.]

So

E(X_i) = 1 + ((n − i + 1)/n) · 0 + ((i − 1)/n) · E(X_i) (1.10)

((n − i + 1)/n) · E(X_i) = 1 (1.11)

E(X_i) = n/(n − i + 1) (1.12)


Now we have:

E(Y_n) = ∑_{i=1}^{n} E(X_i) = ∑_{i=1}^{n} n/(n − i + 1) = n ∑_{i=1}^{n} 1/i = n H_n = n(log n + O(1))
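A minimal Python simulation (my own illustration, not part of the notes) of the coupon collector bound: empirical average collection times should track nH_n.

import random

def collect_time(n):
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))   # one uniform coupon draw
        t += 1
    return t

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

for n in (10, 100, 1000):
    trials = 2000
    avg = sum(collect_time(n) for _ in range(trials)) / trials
    print(n, avg, n * harmonic(n))      # the two numbers should be close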

1.3.3 Application: the probabilistic method

A tournament of size n is a directed complete graph. We may think of a tournament T equivalently as a skew-symmetric mapping T : [n] × [n] → {1, 0, −1} that is 0 only on the diagonal.3 A Hamilton path in a tournament (or a digraph more generally) is a directed simple path through all the vertices.

Lemma 1 There exists a tournament with at least n! · 2^{−n+1} Hamilton paths.

This certainly isn’t true for all tournaments—as an extreme case, the totally ordered tournament has only one H-path. Proof: This is an opportunity to consider a nice random variable: the random tournament. You simply fix n vertices, and direct each edge between them uniformly iid. Any particular permutation of the vertices has probability 2^{−n+1} of being a H-path, so the expectation of the indicator rv for this event is 2^{−n+1}. The indicator rvs are far from independent, but anyway, by linearity of expectation, the expected number of H-paths is n! · 2^{−n+1}. So some tournament has at least this many H-paths. 2

Exercise: explicit construction. Describe a specific tournament with n!(2 + o(1))^{−n} Hamilton paths.

1.3.4 Union bound

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B) ≤ Pr(A) + Pr(B) The bound applies also to countable unions:

Lemma 2 Pr(∪_{i=1}^{∞} A_i) ≤ ∑_{i=1}^{∞} Pr(A_i).

Proof: First note that by induction the bound applies to any finite union. Now, if the right-hand side is at least 1, the result is immediate. If not, consider any counterexample; since the sequences Pr(∪_{i=1}^{k} A_i) and ∑_{i=1}^{k} Pr(A_i) each monotonically converge to their respective limits, there is a finite k for which Pr(∪_{i=1}^{k} A_i) > ∑_{i=1}^{k} Pr(A_i). Contradiction. 2

Later in the lecture we’ll use the following which, while trivial, has the whiff of assigning a value to ∞/∞:

3We frequently use the notation [n] = {1, . . . , n}.


Corollary 3 If a countable list of events A1, . . . all satisfy Pr(A_i) = 0, then Pr(∪ A_i) = 0. Likewise if for all i, Pr(A_i) = 1, then Pr(∩ A_i) = 1.

Now let’s revisit the birthday paradox (4). For a year of n days and a class of r students, let the rv B = the number of pairs of students who share a birthday.

E(B) = (r choose 2) · (1/n)

which suggests that the probability of some joint birthday may be a constant once r is large enough that r ∼ √n. We can easily verify the correctness of one side of this claim. The event of there being some common birthday is [B > 0]. With B_ij being the event that students i, j share a birthday, [B > 0] = ∪_{i<j} B_ij, so by the union bound

Pr(B > 0) ≤ (r choose 2) · (1/n).

This shows that r ∈ o(√n) ⇒ Pr(B > 0) ∈ o(1). The converse holds, too; fundamentally this is because there is not much overlap in the sample space between the (r choose 2) different events. We postpone this for now but will show below how to carry out this argument.
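As a hedged numerical aside (my own, not in the notes), the exact birthday probability and the union bound (r choose 2)/n, which also equals E(B), can be compared directly; the bound is accurate for small r and becomes vacuous (exceeds 1) once r is well past √n.

from math import comb

def exact_collision_prob(r, n=365):
    # 1 - n(n-1)...(n-r+1)/n^r
    p_distinct = 1.0
    for i in range(r):
        p_distinct *= (n - i) / n
    return 1 - p_distinct

def union_bound(r, n=365):
    return comb(r, 2) / n      # also equals E(B)

for r in (10, 23, 40):
    print(r, exact_collision_prob(r), union_bound(r))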

1.3.5 Using the union bound in the probabilistic method: Ramsey theory

Theorem 4 (Ramsey [74]) Fix any nonnegative integers k, ℓ. There is a finite “Ramsey number” R(k, ℓ) such that every graph on R(k, ℓ) vertices contains either a clique of size k or an independent set of size ℓ. Specifically, R(k, ℓ) ≤ (k+ℓ−2 choose ℓ−1).

(The finiteness is due to Ramsey [74] and the bound to Erdős and Szekeres [31].) Numerous generalizations of Ramsey’s argument have since been developed—see the book [44]. Proof: (of Theorem (4)) This is outside our main line of development but we include it for completeness. First, R(k, 1) = R(1, k) = 1 = (k−1 choose 0). Now if k, ℓ > 1, consider a graph with R(k, ℓ − 1) + R(k − 1, ℓ) vertices and pick any vertex v. Let VY denote the vertices connected to v by an edge, and let VN denote the remaining vertices. Either |VN| ≥ R(k, ℓ − 1) or |VY| ≥ R(k − 1, ℓ) (since |VN| + |VY| = R(k, ℓ − 1) + R(k − 1, ℓ) − 1, at least one of the two thresholds must be met).

If |VN| ≥ R(k, ` − 1) then either the graph spanned by VN contains a k-clique or the graph spanned by VN ∪ {v} contains an independent set of size `.

On the other hand if |VY| ≥ R(k − 1, ℓ) then either the graph spanned by VY ∪ {v} contains a k-clique or the graph spanned by VY contains an independent set of size ℓ. Applying this argument and induction on k + ℓ, we have: R(k, ℓ) ≤ R(k, ℓ − 1) + R(k − 1, ℓ) ≤ (k+ℓ−3 choose ℓ−2) + (k+ℓ−3 choose ℓ−1) = (k+ℓ−2 choose ℓ−1). (The final equality counts subsets of [k + ℓ − 2] of size ℓ − 1 according to whether the first item is selected.)

If you apply Stirling’s approximation, this gives the bound R(k, k) ≤ (2k−2 choose k−1) ∈ O(4^k/√k). In the intervening nearly-a-century there have been some improvements on this bound, first by Rödl [43], then by Thomason [85], and most recently by Conlon [20] to R(k, k) ≤ 4^k k^{−Ω(log k / log log k)}.

What we use the union bound for is to show a converse:

Theorem 5 (Erdős [28]) If (n choose k) < 2^{(k choose 2)−1} then R(k, k) > n. Thus R(k, k) ≥ (1 − o(1)) (k/(e√2)) 2^{k/2}.


This leaves an exponential gap. Actually this gap is small by the standards of Ramsey theory. The gap has been slightly tightened since Erdős’s work, as we will show later in the course, but remains exponential, and is a major open problem in combinatorics. Proof: (of Theorem (5)) This is an opportunity to introduce one of the most-studied random variables in combinatorics, the random graph G(n, p), in which each edge is present, independently, with probability p. Among other things, people use this model to study threshold phenomena for many properties such as connectivity, appearance of a Hamilton cycle, etc. For the lower bound on R(k, k) we use G(n, 1/2). Any particular subset of k vertices has probability 2^{1−(k choose 2)} of forming either a clique or an independent set. Take a union bound over all subgraphs. 2
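The bound in Theorem 5 is concrete enough to compute. The following Python sketch (my illustration, not part of the notes) finds, for small k, the largest n with (n choose k) < 2^{(k choose 2)−1}, which certifies R(k, k) > n.

from math import comb

def erdos_lower_bound(k):
    # largest n such that C(n, k) < 2^(C(k,2) - 1); then R(k, k) > n
    threshold = 2 ** (comb(k, 2) - 1)
    n = k
    while comb(n + 1, k) < threshold:
        n += 1
    return n

for k in range(3, 11):
    print(k, erdos_lower_bound(k))   # e.g. k = 3 gives 3, so R(3,3) > 3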


1.4 Lecture 4 (10/Oct): Upper and lower bounds

1.4.1 Bonferroni inequalities

The union bound is a special case of the Bonferroni inequalities. Let A1, . . . , An be events in some probability space, and A_i^c their complements. For S ⊆ [n] let A_S = ∩_{i∈S} A_i. For 0 ≤ j ≤ n let ([n] choose j) denote the subsets of [n] of cardinality j.

Lemma 6 For j ≥ 1 let (see Fig. 1.1):

m_j = ∑_{S ∈ ([n] choose j)} Pr(A_S)

M_k = ∑_{j=1}^{k} (−1)^{j+1} m_j = ∑_{j=1}^{k} (−1)^{j+1} ∑_{J⊆[n], |J|=j} Pr(A_J)


Figure 1.1: m2 (left), M2 (right)

Then:

M2, M4, . . . ≤ Pr(∪ A_i) ≤ M1, M3, . . .

Moreover, Pr(∪ A_i) = M_n; this is known as the inclusion-exclusion principle.

Comment: Often, but not always, larger values of k give improved bounds. See the problem set. Proof: The sample space is partitioned into 2n measurable sets

\ \ c BS = ( Ai) ∩ ( Ai ) i∈S i∈/S

Note that A_S = ∪_{T⊇S} B_T, which, since the B_T’s are disjoint, gives Pr(A_S) = ∑_{T⊇S} Pr(B_T).

m_j = ∑_{S ∈ ([n] choose j)} Pr(A_S) = ∑_{S ∈ ([n] choose j)} ∑_{T⊇S} Pr(B_T) = ∑_T (|T| choose j) Pr(B_T)


M_k = ∑_{j=1}^{k} (−1)^{j+1} m_j
    = ∑_T Pr(B_T) ∑_{j=1}^{k} (−1)^{j+1} (|T| choose j)
    = ∑_{T≠∅} Pr(B_T) ∑_{j=1}^{k} (−1)^{j+1} (|T| choose j)     because (0 choose j) = 0 for j ≥ 1.

Observe Pr(∪ A_i) = 1 − Pr(B_∅) = ∑_{T≠∅} Pr(B_T). So

M_k − Pr(∪ A_i) = ∑_{T≠∅} Pr(B_T) ∑_{j=0}^{k} (−1)^{j+1} (|T| choose j)

where we have inserted the needed −Pr(B_T) for T ≠ ∅ by starting the internal summation from j = 0. The inequalities now follow from the claim that for t ≥ 1,

∑_{j=0}^{k} (−1)^{j+1} (t choose j)  is  = 0 for k ≥ t;  ≥ 0 for k odd;  ≤ 0 for k even. (1.13)

(For the inclusion-exclusion principle, note that once k ≥ n, all t fall into the first category.) The first line follows by expanding (1 − 1)^t (and noting that all terms t < j ≤ k have (t choose j) = 0). For the remaining two lines we use the identity

(t choose j) − (t choose j−1) = (t−1 choose j) − (t−1 choose j−2) (1.14)

(which holds for t, j ≥ 1 with the interpretation (a choose b) = 0 for a ≥ 0, b < 0). Therefore when we group adjacent pairs j in the summation on the LHS of (1.13) (that is, {k, k − 1}, {k − 2, k − 3}, etc., with 0 unpaired for k even), we obtain a telescoping sum, and so we have

For k odd: ∑_{j=0}^{k} (−1)^{j+1} (t choose j) = (t−1 choose k) − (t−1 choose −1) = (t−1 choose k) ≥ 0

For k even: ∑_{j=0}^{k} (−1)^{j+1} (t choose j) = −(t−1 choose k) + (t−1 choose 0) − (t choose 0) = −(t−1 choose k) ≤ 0

2

Comment: inclusion-exclusion is a special case of what is known in order theory as Möbius inversion.
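A hedged numerical check (mine, not from the notes) of Lemma 6 on an instance where everything is computable: n independent events of probability p each, so that Pr(A_S) = p^{|S|}, m_j = (n choose j) p^j, and the exact union probability is 1 − (1 − p)^n.

from math import comb

n, p = 8, 0.3                      # 8 independent events, each of probability p
exact = 1 - (1 - p) ** n           # Pr(union)

def M(k):
    # Bonferroni partial sum: sum_{j=1}^{k} (-1)^(j+1) C(n, j) p^j
    return sum((-1) ** (j + 1) * comb(n, j) * p ** j for j in range(1, k + 1))

print("exact", exact)
for k in range(1, n + 1):
    print(k, M(k))   # odd k: upper bounds; even k: lower bounds; k = n: equality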

1.4.2 Tail events: Borel-Cantelli

Here is a very fundamental application of the union bound.

Definition 7 Let B = {B1,...} be a countable collection of events. lim sup B is the event that infinitely many of the events Bi occur.


Lemma 8 (Borel Cantelli I) Let ∑_{i≥1} Pr(B_i) < ∞. Then Pr(lim sup B) = 0.

lim sup B is what is called a tail event: a function of infinitely many other events (in this case the B1, . . .) that is unaffected by the outcomes of any finite subset of them. Proof: It is helpful to write lim sup B as

lim sup B = ∩_{i≥0} ∪_{j≥i} B_j.

For every i, lim sup B ⊆ ∪_{j≥i} B_j, so Pr(lim sup B) ≤ inf_i Pr(∪_{j≥i} B_j). By the union bound, the latter is ≤ inf_i ∑_{j≥i} Pr(B_j) = 0. 2

1.4.3 B-C II: a partial converse to B-C I

Lemma 8 does not have a “full” converse.

To show a counterexample, we need to come up with events Bi for which ∑i≥1 Pr(Bi) = ∞ but Pr(lim sup B) = 0. Here is an example. Pick a point x uniformly from the unit interval. Let Bi be the event x < 1/i. You will notice that in this example the events are not independent. That is crucial, for B-C I does have the partial converse:

Lemma 9 (Borel Cantelli II) Suppose that B1,... are independent events and that ∑i≥1 Pr(Bi) = ∞. Then Pr(lim sup B) = 1.

Proof: We’ll show that (lim sup B)^c, the event that only finitely many B_i occur, occurs with probability 0. Write (lim sup B)^c = ∪_{i≥0} ∩_{j≥i} B_j^c. By the union bound (Cor. 3), it is enough to show that Pr(∩_{j≥i} B_j^c) = 0 for all i. Of course, for any I ≥ i, Pr(∩_{j≥i} B_j^c) ≤ Pr(∩_{I≥j≥i} B_j^c). By independence, Pr(∩_{I≥j≥i} B_j^c) = ∏_{j=i}^{I} Pr(B_j^c), so what remains to show is that

For any i,  lim_{I→∞} ∏_{j=i}^{I} Pr(B_j^c) = 0. (1.15)

(Note the LHS is decreasing in I.) There’s a classic inequality we often use:

1 + x ≤ e^x (1.16)

which follows because the RHS is convex and the two sides agree in value and first derivative at a point (namely at x = 0).

Consequently if a finite sequence xi satisfies ∑ xi ≥ 1 then ∏(1 − xi) ≤ 1/e.

Supposing (1.15) is false, fix i for which it fails, let q_i > 0 be the limit of the LHS, and let I be sufficient that ∏_{j=i}^{I} Pr(B_j^c) ≤ 2q_i. Let I′ be sufficient that ∑_{j=I+1}^{I′} Pr(B_j) ≥ 1. Then ∏_{j=i}^{I′} Pr(B_j^c) ≤ 2q_i/e < q_i. Contradiction. 2


1.5 Lecture 5 (12/Oct): More on tail events: Kolmogorov 0-1, random walk

A beautiful fact about tail events is Kolmogorov’s famous 0-1 law.

Theorem 10 (Kolmogorov) If Bi is a sequence of independent events and C is a tail event of the sequence, then Pr(C) ∈ {0, 1}.

We won’t be using this theorem, and its usual proof requires some measure theory, so I’ll merely offer a few examples of its application.

Bond percolation

Fix a parameter 0 ≤ p ≤ 1. Start with a fixed infinite, connected, locally finite graph H, for instance the grid graph Z^2 (nodes (i, j) and (i′, j′) are connected if |i − i′| + |j − j′| = 1) and form the graph G by including each edge of the grid in G independently with probability p. “Locally finite” means the degree of every vertex is finite. The graph is said to “percolate” if there is an infinite connected component. Percolation is a tail event (with respect to the events indicating whether each edge is present): consider the effect of adding or removing just one edge. Now induct on the number of edges added or removed. It is easy to see by a coupling argument that Pr(percolation) is monotone nondecreasing in p, as follows: Instead of choosing just a single bit at each edge e, choose a real number X_e ∈ [0, 1] uniformly. Include the edge if X_e < p. Now, if p < p′, we can define two random graphs G_p, G_{p′}, each is a percolation process from the respective parameter value, and G_p ⊆ G_{p′}.

Due to the 0-1 law, there exists a “critical” p_H such that Pr(percolation) = 0 for p < p_H and Pr(percolation) = 1 for p > p_H. (See Fig. 1.2.) A lot of work in probability theory has gone into determining values of p_H for various graphs, and also into figuring out whether Pr(percolation) is 0 or 1 at p_H.

[Plot omitted: Pr(percolate) as a function of p.]

Figure 1.2: Bond percolation in the 2D square grid


Another example of a tail event for bond percolation, this one not monotone, is the event that there are infinitely many infinite components. No matter what the underlying graph is, the probability of this event is 0 at p ∈ {0, 1}.
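The infinite grid cannot be simulated directly, but here is a hedged finite-size illustration in Python (my own sketch, not from the notes; the left-to-right crossing event is a standard finite stand-in for percolation): on an L × L grid of sites, with each edge kept independently with probability p, it estimates the probability of an open path from the left column to the right column. As L grows this curve sharpens around the critical value.

import random

def crosses(L, p):
    # Bond percolation on an L x L grid of sites: is there an open path
    # joining the left column (x = 0) to the right column (x = L-1)?
    parent = list(range(L * L))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)
    for x in range(L):
        for y in range(L):
            if x + 1 < L and random.random() < p:
                union(x * L + y, (x + 1) * L + y)   # keep horizontal edge
            if y + 1 < L and random.random() < p:
                union(x * L + y, x * L + y + 1)     # keep vertical edge
    left_roots = {find(y) for y in range(L)}
    return any(find((L - 1) * L + y) in left_roots for y in range(L))

L, trials = 30, 200
for p in (0.3, 0.45, 0.5, 0.55, 0.7):
    print(p, sum(crosses(L, p) for _ in range(trials)) / trials)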

Site percolation

A closely related process is that starting from a fixed infinite, connected, locally finite graph H, we retain vertices independently with probability p. (And of course we retain an edge if both its vertices are retained.)

Let N be the random variable representing the number of infinite components of the resulting random graph. Here “number” can be any nonnegative integer or ∞. It is known under fairly general conditions (particularly, H should be vertex-transitive), that for any p, exactly one of the following three events has probability 1: N = 0; N = 1; N = ∞. See [71] for the beginning of this story, and [10] for a survey. (It happens that in the square grid, in any dimension and for any p, Pr(N = ∞) = 0; as p increases, we go from Pr(N = 0) = 1 to Pr(N = 1) = 1 [6], and stay there. However, in more “expanding” graphs such as d-regular trees, d > 1, and also other non-amenable graphs, there can be a phase in the middle with Pr(N = ∞) = 1. See [65, 47].)

Random Walk on Z

Here is another example of a tail event, but this one we can work out without relying on the 0-1 law, and also see which of 0, 1 is the value: Consider rw on Z that starts at 0 and in every step with probability p goes left, and with probability 1 − p goes right. Let L = the event that the walk visits every x ≤ 0. Let R = the event that the walk visits every x ≥ 0. Each of L and R is a tail event. So by Theorem 10, for any p, Pr(L) and Pr(R) lie in {0, 1}. In fact, we will show—without relying on Theorem 10, but relying on Lemma 8 (Borel-Cantelli I)—that:

Theorem 11

• For p < 1/2, Pr(L) = 0 and Pr(R) = 1.

• For p > 1/2, Pr(L) = 1 and Pr(R) = 0. (Obviously this is symmetric to the preceding.) • For p = 1/2, Pr(L) = Pr(R) = 1. (Note that if L ∩ R occurs, then the walk must actually visit every point infinitely often. (Suppose not, and let t be the last time that some site y was visited. Then on one side of y, the point t + 1 steps away cannot have been visited yet, and will never be visited.) Thus in this case of the theorem, since Pr(L ∩ R) = 1 by union bound, Pr(every point in Z is visited infinitely often) = 1. The term for this is that unbiased rw on the line is recurrent.)

Proof: First, no matter what p is, let q_y be the probability that the walk ever visits the point y. Let’s start with the cases p ≠ 1/2. The first step of the argument doesn’t depend on the sign of p − 1/2: Consider any y and let B_{y,t} = the event that the walk is at y at time t. The following


calculation shows that for any y, ∑_t Pr(B_{y,t}) < ∞. For t s.t. t ≡ y (mod 2), we have

Pr(B_{y,t}) = (t choose (t−|y|)/2) p^{(t−y)/2} (1 − p)^{(t+y)/2}
           = (t choose (t−|y|)/2) ((1 − p)/p)^{y/2} (p(1 − p))^{t/2}
           ≤ 2^t ((1 − p)/p)^{y/2} (p(1 − p))^{t/2}
           = ((1 − p)/p)^{y/2} (4p(1 − p))^{t/2}

Therefore

∑_t Pr(B_{y,t}) ≤ ((1 − p)/p)^{y/2} · 1/(1 − √(4p(1 − p)))

which is < ∞ for p ≠ 1/2. So by Borel-Cantelli-I (Lemma 8), with probability 1, y is visited only finitely many times. Then by the union bound, with probability 1 every y is visited only finitely many times. Now let’s suppose further that p > 1/2 (i.e., the walk drifts left). Then for any x ∈ Z,

∑_{y≥x} ∑_t Pr(B_{y,t}) ≤ ((1 − p)/p)^{x/2} · 1/(1 − √((1 − p)/p)) · 1/(1 − √(4p(1 − p))) < ∞

So we get the even stronger conclusion, again by BC-I, that with probability 1 the walk spends only finite time in the interval [x, ∞]. Since this holds for all x, we get Pr(L) = 1. Plugging in x = 0 gives Pr(R) = 0. Applying symmetry, we’ve covered the first two cases of the theorem. For p = 1/2, the claims Pr(L) = 1 and Pr(R) = 1 are equivalent so let’s focus on the first. The claim Pr(L) = 1 is equivalent to saying that for any x ≥ 0, with probability 1 the walk reaches the point −x. This is the same as saying that in the gambler’s ruin problem, no matter what the initial stake x of the gambler, he will with probability 1 go broke.

For x ≥ 0 let’s write qx = the probability the gambler goes broke from initial stake x. We claim that qx is harmonic on the nonnegative axis with boundary condition q0 = 1. The harmonic condition means that on all interior points of the nonnegative axis, which means all x > 0, the function value is the average of its neighbors: qx = (qx−1 + qx+1)/2 That this is so is obvious from the description of the gambler’s ruin process. But this equation indicates that qx is affine linear on x ≥ 0, because for x ≥ 1 the “discrete second derivative” is 0:

(qx+1 − qx) − (qx − qx−1) = qx+1 − 2qx + qx−1 = 0

However, the function qx is also bounded in [0, 1]. So it can only be a constant function, agreeing with its boundary value q0 = 1. 2
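A hedged numerical illustration (mine, not in the notes) of the p = 1/2 case: truncating the gambler at a horizon T, the empirical probability of ruin approaches 1 as T grows, while the average wealth at the stopping time stays roughly at the initial $1, which is the point made in appetizer (6) above.

import random

def gamblers_ruin(T):
    # start with $1; fair +/-1 bets; stop when broke or after T steps
    w = 1
    for _ in range(T):
        if w == 0:
            break
        w += random.choice((-1, 1))
    return w

for T in (10**2, 10**3, 10**4):
    trials = 5000
    final = [gamblers_ruin(T) for _ in range(trials)]
    ruined = sum(1 for w in final if w == 0) / trials
    print(T, "Pr(broke) ~", ruined, " E(wealth) ~", sum(final) / trials)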


1.6 Lecture 6 (15/Oct): More probabilistic method

1.6.1 Markov inequality (the simplest tail bound)

Lemma 12 Let A be a non-negative random variable4 with finite expectation µ1. Then for any λ ≥ 1, Pr(A > λµ1) < 1/λ. In particular, for µ1 = 0, Pr(A > µ1) = 0.

(Of course the lemma holds trivially also for 0 < λ < 1.) Proof:

If Pr(A ≥ λµ1) > 1/λ then E(A) > µ1, a contradiction. So Pr(A ≥ λµ1) ≤ 1/λ and therefore, if the lemma fails, it must be that Pr(A > λµ1) = 1/λ. In particular for some ε > 0 there is a δ > 0 s.t. Pr(A ≥ λµ1 + ε) ≥ δ. Then E(A) ≥ δ · (λµ1 + ε) + (1/λ − δ) · λµ1 = µ1 + δε > µ1, a contradiction. 2

For a more visual argument (but proving the slightly weaker Pr(A ≥ λµ1) ≤ 1/λ), note that the step function ⟦x ≥ λµ1⟧ satisfies the inequality ⟦x ≥ λµ1⟧ ≤ x/(λµ1) for all nonnegative x. If µ is the probability distribution of the rv A, then Pr(A ≥ λµ1) = ∫ ⟦x ≥ λµ1⟧ dµ ≤ ∫ x/(λµ1) dµ = µ1/(λµ1) = 1/λ.

1.6.2 Variance and the Chebyshev inequality: a second tail bound

Let X be a real-valued rv. If E(X) and E(X^2) are both well-defined and finite, let Var(X) = E(X^2) − E(X)^2. We can also see that E((X − E(X))^2) = Var(X) by expanding the LHS and applying linearity of expectation. In particular, the variance is nonnegative. If c ∈ R then since the variance is homogeneous and quadratic, Var(cX) = c^2 Var(X).

Lemma 13 (Chebyshev) If E(X) = θ, then Pr(|X − θ| > λ√Var(X)) < 1/λ^2.

Proof:

Pr(|X − θ| > λ√Var(X)) = Pr((X − θ)^2 > λ^2 Var(X))

< 1/λ^2 by the Markov inequality (Lemma 12). 2

The Chebyshev inequality is the most elementary and weakest kind of concentration bound. We will talk about this more in the context of sums of random variables. A frequently useful corollary of the Chebyshev inequality (Lemma 13) is:

Corollary 14 Suppose X is a nonnegative rv. Then Pr(X = 0) ≤ Var(X)/(E(X))^2.
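As a small numerical illustration (mine, not in the notes), one can compare the actual tail of a distribution with what Markov and Chebyshev guarantee. The sketch below uses an Exponential(1) random variable, whose mean and standard deviation are both 1, so the two bounds line up at the same thresholds; the thresholds and sample size are arbitrary choices.

import random

random.seed(1)
N = 200_000
xs = [random.expovariate(1.0) for _ in range(N)]

mu, sd = 1.0, 1.0                   # true mean and standard deviation of Exp(1)
for lam in (2.0, 4.0, 8.0):
    t = mu + lam * sd               # threshold: mu plus lam standard deviations
    tail = sum(1 for x in xs if x > t) / N
    markov = mu / t                 # Markov: Pr(X > t) <= E(X)/t
    chebyshev = 1 / lam ** 2        # Chebyshev: Pr(|X - mu| > lam*sd) < 1/lam^2
    print(lam, tail, markov, chebyshev)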

1.6.3 Power mean inequality

Nonnegativity of the variance is merely a special case of monotonicity of the power means. (In this context, though, we will assume the random variable X is positive-valued. For the variance we don’t need this constraint.)

4For a nonnegative rv there can be no problems with absolute convergence of the expectation; however, it may be infinite.


Lemma 15 (Power means inequality) For a positive-real-valued rv X, and for reals s < t,

(E(X^s))^{1/s} ≤ (E(X^t))^{1/t}.

Proof: Let µ be the probability measure. Recall that for r ≥ 1, the function x^r is convex (“cup”) in x. For a convex function f, f(∫ x dµ(x)) ≤ ∫ f(x) dµ(x). (This is sometimes called Jensen’s inequality.) Applying this with r = t/s, we have

∫ X^s dµ ≤ (∫ X^t dµ)^{s/t}.

2

Using the concave function f(x) = log(x) gives us

exp(∫ log x dµ) ≤ ∫ x dµ (1.17)

which is the arithmetic-geometric mean inequality: in the case of a uniform distribution on n positive values of X, it reads (∏ X_i)^{1/n} ≤ (1/n) ∑ X_i. That (1.17) is a special case of the power means inequality can be seen by fixing t = 1 and taking the limit s → 0 (approximating x^s by 1 + s log x).

1.6.4 Large girth and large chromatic number; the deletion method

Earlier we saw our first example of the probabilistic method, the proof of the existence of graphs with no small clique or independent set. In that case, just picking an element of a set at random was already enough in order to produce an object that is hard to construct “explicitly”. However, the probabilistic method in that form can construct only an object with properties that are shared by a large fraction of objects. Now we will see an example that enables the probabilistic method to construct something that is quite rare—indeed, it is maybe a bit surprising that this kind of object even exists. We consider graphs here to be undirected and without loops or multiple edges. The chromatic number χ of a graph is the least number of colors with which the vertices can be colored, so that no two neighbors share a color. Clearly, as you add edges to a graph, its chromatic number goes up. The girth γ of a graph is the length of a shortest simple cycle. (“Simple” = no edges repeat.) Clearly, as you add edges to a graph, its girth goes down. These numbers are both monotone in the inclusion partial order on graphs. Chromatic number is monotone increasing, while girth is monotone decreasing. An important theorem we’ll cover shortly is the FKG Inequality, which implies in this setting that for any k, g > 0, if you pick a graph u.a.r., and condition on the event that its chromatic number is above k, that reduces the probability that its girth will be above g. In symbols, for the G(n, p) ensemble,

Pr((χ(G) > k) ∩ (γ(G) > g)) < Pr(χ(G) > k) Pr(γ(G) > g).

So in this precise sense, chromatic number and girth are anticorrelated. Indeed, having large girth means that the graph is a tree in large neighborhoods around each vertex. A tree has chromatic number 2. If you just allow yourself 3 colors, you gain huge flexibility in how to color a tree. Surely,


with large girth, you might be able to color the local trees so that when they finally meet up in cycles, you can meet the coloring requirement? No! Here is a remarkable theorem.

Theorem 16 (Erd¨os[29]) For any k, g there is a graph with chromatic number χ ≥ k and girth γ ≥ g.

Proof: Pick a graph G from G(n, p), where p = n^{−1+1/g}. This is likely to be a fairly sparse graph; the expected degree is n^{1/g} (minus 1). Let the rv X be the number of cycles in G of length < g. E(X) = ∑_{m=3}^{g−1} p^m · n(n − 1) ··· (n − m + 1)/(2m). (Pick the cycle sequentially and forget the starting point and orientation.) Then

E(X) < ∑_{m=3}^{g−1} p^m n^m/(2m) = ∑_{m=3}^{g−1} n^{m/g}/(2m) ≤ ∑_{m=3}^{g−1} n^{m/g}/6.

For sufficiently large n, specifically n > 2^g, the successive terms in this sum at least double, so E(X) ≤ n^{1−1/g}/3. By Markov’s inequality, Pr(X > n^{1−1/g}) < 1/3. For the chromatic number we use a simple lower bound. Let I be the size of a largest independent set in G. Since every color class of a coloring must be an independent set,

I · χ ≥ n. (1.18)

Now Pr(I ≥ i) ≤ (n choose i)(1 − p)^{(i choose 2)}, and recalling (1.16), the simple inequality for the exponential function, we have Pr(I ≥ i) ≤ (n choose i) e^{−p (i choose 2)} = (n choose i) e^{−(i choose 2) n^{−1+1/g}}. Using the wasteful bound (n choose i) ≤ n^i we have Pr(I ≥ i) ≤ e^{i log n − (i choose 2) n^{−1+1/g}} = e^{i log n + (i/2 − i^2/2) n^{−1+1/g}}. Finally we apply this at i = 3 n^{1−1/g} log n.

Pr(I ≥ i) ≤ e^{3 n^{1−1/g} log^2 n + (1/2)(3 n^{1−1/g} log n) n^{−1+1/g} − (1/2)(3 n^{1−1/g} log n)^2 n^{−1+1/g}}
          = e^{(3/2)(log n − n^{1−1/g} log^2 n)}

which for sufficiently large n is < 1/3. Thus, for sufficiently large n, there is probability at least 1/3 that G has both I < 3 n^{1−1/g} log n and at most n^{1−1/g} ≤ n/2 cycles of length strictly less than g. Removing vertices from G can only reduce I, because any set that is independent after the removal, was also independent before. (By contrast, removing edges can only increase I.) So, by removing one vertex from each cycle, we obtain a graph with ≥ n/2 vertices, girth ≥ g, and I ≤ 3 n^{1−1/g} log n. Applying (1.18) (to the graph now of size n/2), we have χ ≥ n^{1/g}/(6 log n) which for sufficiently large n is ≥ k. 2


1.7 Lecture 7 (17/Oct): FKG inequality

Consider again the random graph model G(n, p). Suppose someone peeks at the graph and tells you that it has a Hamilton cycle. How does that affect the probability that the graph is planar? Or that its girth is less than 10? Or consider the percolation process on the n × n square grid. Suppose you check and find that there is a path from (0, 0) to (1, 0). What does that tell you about the chance that the graph has an isolated vertex? These questions fall into the general framework of correlation inequalities. History: Harris (1960) [45], Kleitman (1966) [57], Fortuin, Kasteleyn and Ginibre (1971) [34], Holley (1974) [49], Ahlswede and Daykin “Four Functions Theorem” (1978) [5].

We are concerned here with the probability space Ω of n independent random bits b1, . . . , bn. It doesn’t matter whether they are identically distributed. Let p_i = Pr(b_i = 1). We consider the boolean lattice B on these bits: b ≥ b′ if for all i, b_i ≥ b′_i. So, Ω is the distribution on B for which

Pr(b) = ∏_{i: b_i=1} p_i · ∏_{i: b_i=0} (1 − p_i).

Definition 17 A real-valued function f on Ω is increasing if b ≥ b0 ⇒ f (b) ≥ f (b0). It is decreasing if − f is increasing. Likewise, an event on Ω (or in other words a subset of B) is increasing if its indicator function is increasing, and decreasing if its indicator function is decreasing.

Theorem 18 (FKG [34]) If f and g are increasing functions on Ω then

E( f g) ≥ E( f )E(g)

Corollary 19 1. If A and B are increasing events on Ω then Pr(A ∩ B) ≥ Pr(A) Pr(B). 2. If f is an increasing function and g is a decreasing function, then E( f g) ≤ E( f )E(g).

3. If A is an increasing event and B is a decreasing event, then Pr(A ∩ B) ≤ Pr(A) Pr(B).

Before we begin the proof we should introduce an important concept:

Conditional expectation Suppose X and Y are random variables. Let T be some subset of the range of Y with Pr(Y ∈ T) > 0 (actually this restrictive assumption is not necessary, but for most purposes in this course we can settle for finite sample spaces, so we won’t worry about doing the measure theory. It all works out as you’d expect.) Then

E(X|Y ∈ T) = (1/Pr(Y ∈ T)) ∫_{Y^{−1}(T)} X dµ

The conditional expectation E(X|Y) is a random variable, and specifically, it is a function of the random variable Y. This has the following natural consequence, which is called the tower property of conditional ex- pectations: E(X) = E(E(X|Y)) (1.19) Notice that on the RHS, the outer expectation is over the distribution of Y; on the inside we have the rv which is a real number that is, as we have said, a function of Y. To see (1.19) in the case of discrete rvs, one need only note that both sides equal E(X) = ∑y Pr(Y = y)E(X|Y = y).


Of course, (1.19) still makes sense under any conditioning on a third rv Z:

E(X|Z) = E(E(X|Y)|Z) (1.20)

Now let us reinterpret the theorem. Suppose g is the indicator function of some increasing event. Then

E( f g) = Pr(g = 1)E( f |g = 1) + Pr(g = 0)E(0|g = 0) = Pr(g = 1)E( f |g = 1) = E(g)E( f |g = 1)

so

E( f |g = 1) = E( f g)/E(g) ≥ E( f )E(g)/E(g) = E( f ).

The interpretation is that conditioning on an increasing event, only increases the expectation of any increasing function. Proof: By induction on n. Case n = 1 (with p = Pr(b = 1)):

E( f g) − E( f )E(g) = p f (1)g(1) + (1 − p) f (0)g(0) − (p f (1) + (1 − p) f (0))(pg(1) + (1 − p)g(0))
                     = p(1 − p)( f (1)g(1) + f (0)g(0) − f (1)g(0) − f (0)g(1))
                     = p(1 − p)( f (1) − f (0))(g(1) − g(0))
                     ≥ 0 by the monotonicity of both functions

Now for the induction. Observe that for any assignment (b2 . . . bn) ∈ {0, 1}^{n−1}, f becomes a monotone function of the single bit b1. For convenience, in the expectations to follow the subscript indicates explicitly which subset of bits the expectation is taken with respect to. So for instance in the second line, f g has the role of X, above, and (b2 . . . bn) has the role of Y. These subscripts are extraneous and I’m just including them for clarity.

E( f g) = E1...n( f g) = E2...n (E1( f g|b2 ... bn)) ≥ E2...n (E1( f |b2 ... bn) · E1(g|b2 ... bn)) applying the base-case

Observe again that E1( f |b2 ... bn) is a function of b2 ... bn. By monotonicity of f , it is an increasing function. Likewise for E1(g|b2 ... bn). Since by induction we may assume the theorem for the case n − 1, we have

... ≥ E_{2...n}( E_1(f | b_2 ... b_n) ) · E_{2...n}( E_1(g | b_2 ... b_n) )
= E_{1...n}(f) · E_{1...n}(g)   by (1.19)
= E(f)E(g)

2 Easy Application: In the random graph G(2k, 1/2), the probability that all degrees are ≤ k − 1 is at least 2^{−2k}: each event {deg(v) ≤ k − 1} has probability 1/2, and these are decreasing events, so by FKG the probability of their intersection is at least the product of their probabilities. (Call this event A. One can also ask for an upper bound on Pr(A). As was noted in class, A is disjoint from the event that all degrees are at least k, which has by symmetry the same probability, so we can conclude that Pr(A) ≤ 1/2. Here is a simple improvement, showing Pr(A) tends toward 0. Fix a set L of the vertices, of size ℓ. For v ∈ L, if it has at most k − 1 neighbors, then it has at most k − 1 neighbors in L^c. So we'll just upper bound the probability that every vertex in L has at most


k − 1 neighbors in L^c. These events (ranging over v ∈ L) are independent. So we can use the upper bound (2^{−(2k−ℓ)} C(2k−ℓ, ≤ k−1))^ℓ, where C(N, ≤ j) denotes ∑_{i≤j} C(N, i). Fixing ℓ proportional to √k sets the base of this exponential to a constant < 1 (a deviation bound at a constant number of standard deviations from the mean), and therefore yields a bound of the form Pr(A) ≤ exp(−Ω(√k)).)
Application: The FKG inequality provides a very efficient proof of an inequality of Daykin and Lovász [25]:

Theorem 20 Let H be a family of subsets of [n] such that for all A, B ∈ H, ∅ ⊂ A ∩ B and A ∪ B ⊂ [n] (strict containments). Then |H| ≤ 2^{n−2}.

Proof: Let F be the "upward order ideal" generated by H: F = {S : ∃T ∈ H, T ⊆ S}. Let G be the "downward order ideal" generated by H: G = {S : ∃T ∈ H, S ⊆ T}. Then H ⊆ F ∩ G. |F| ≤ 2^{n−1} because F satisfies the property that ∅ ⊂ A ∩ B for all A, B ∈ F, and therefore F cannot contain any set and its complement. Likewise, |G| ≤ 2^{n−1} because G satisfies the property that A ∪ B ⊂ [n] for all A, B ∈ G, and therefore G cannot contain any set and its complement. Interpreting this in terms of the bits being distributed uniformly iid, we have that Pr(F) ≤ 1/2 and Pr(G) ≤ 1/2. Since F is an increasing event and G a decreasing event, Pr(F ∩ G) ≤ 1/4. 2
Application: We won't show the argument here, but the FKG inequality was used in a very clever way by Shepp to prove the "XYZ inequality" conjectured by Rival and Sands. Let Γ be a finite poset. A linear extension of Γ is any total order on its elements that is consistent with Γ. Consider the uniform distribution on linear extensions of Γ. The XYZ inequality says:

Theorem 21 (Shepp [81]) For any three elements x, y, z of Γ,

Pr((x ≤ y) ∧ (x ≤ z)) ≥ Pr(x ≤ y) · Pr(x ≤ z).


1.8 Lecture 8 (19/Oct) Part I: Achieving expectation in MAX-3SAT.

Logistics: Jenish’s OH will be in Annenberg 107 or 121 according to the day; see class webpage and Google calendar.

1.8.1 Another appetizer

Consider unbiased random walk on the n-cycle. Index the vertices clockwise 0, . . . , n − 1, and start the walk at 0. What is the probability distribution on the last vertex to be reached?
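If you would like to experiment with this question before its answer is revealed, here is a small simulation sketch (my own, not from the notes; standard library only) that estimates the distribution of the last-reached vertex empirically.

import random
from collections import Counter

def last_vertex_reached(n):
    """Run an unbiased walk on the n-cycle from vertex 0 until every vertex has been
    visited; return the last new vertex to be reached."""
    pos, unvisited = 0, set(range(1, n))
    last = None
    while unvisited:
        pos = (pos + random.choice((-1, 1))) % n
        if pos in unvisited:
            unvisited.discard(pos)
            last = pos
    return last

n, trials = 8, 20000
counts = Counter(last_vertex_reached(n) for _ in range(trials))
for v in range(1, n):
    print(v, counts[v] / trials)   # compare the empirical frequencies across vertices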

1.8.2 MAX-3SAT

Let's start looking at some computational problems. A 3CNF formula on variables x_1,..., x_n is the conjunction of clauses, each of which is a disjunction of at most three literals. (A literal is an x_i or x_i^c, where x_i^c is the negation of x_i.) You will recall that it is NP-complete to decide whether a 3CNF formula is satisfiable, that is, whether there is an assignment to the x_i's s.t. all clauses are satisfied. Let's take a little different focus: think about the maximization problem of satisfying as many clauses as possible. Of course this is NP-hard, since it includes satisfiability as a special case. But, being an optimization problem, we can still ask how well we can do.

Theorem 22 For any 3CNF formula there is an assignment satisfying ≥ 7/8 of the clauses. Moreover such an assignment can be found in randomized time O(m²), where m is the number of clauses (and we suppose that every variable occurs in some clause).

Proof: The existence assertion is due to linearity of expectation, while the algorithm might be attributed to the English educator Hickson [48]: 'Tis a lesson you should heed: / Try, try, try again. / If at first you don't succeed, / Try, try, try again.
Now that we've been suitably educated, let's ask, how long does this process take? In a single trial we check one assignment, which takes time O(m). How many trials do we need to succeed? Let the rv M be the number of satisfied clauses of a random assignment. m − M is a nonnegative rv with expectation m/8, and Markov's inequality tells us that Pr(M ≤ (7/8 − ε)m) = Pr(m − M ≥ (1 + 8ε)m/8) ≤ 1/(1 + 8ε). This says we have a good chance of getting close to the desired number of satisfied clauses; however, we asked to achieve 7/8, not 7/8 − ε. We can get this by noting that M is integer-valued, so for ε < 1/m, an assignment satisfying more than (7/8 − ε)m of the clauses satisfies 7/8 of them. With the choice ε = 1/(2m), then, the probability that a trial succeeds is at least

1 − 1/(1 + 8ε) = 8ε/(1 + 8ε) = 4/(m + 4) ∈ Ω(1/m)

Trials succeed or fail independently so the expected number of trials to success is the expectation of a geometric random variable with parameter Ω(1/m), which is O(m). 2
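As an illustration (my own sketch, not from the notes), here is the try-until-success procedure in Python. The clause encoding is my own choice: each clause is a list of nonzero integers, +i for x_i and −i for its negation; the 7/8 guarantee presumes clauses with three distinct variables.

import random

def count_satisfied(clauses, assignment):
    """assignment[i] is the boolean value of variable i (1-indexed)."""
    return sum(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in clauses)

def max3sat_by_retrying(clauses, nvars):
    m = len(clauses)
    target = -(-7 * m // 8)          # ceil(7m/8)
    while True:                      # expected O(m) iterations by the argument above
        assignment = {v: random.random() < 0.5 for v in range(1, nvars + 1)}
        if count_satisfied(clauses, assignment) >= target:
            return assignment

clauses = [[1, 2, 3], [-1, 2, -3], [1, -2, 3], [-1, -2, -3]]
a = max3sat_by_retrying(clauses, 3)
print(count_satisfied(clauses, a), "of", len(clauses), "clauses satisfied")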

1.8.3 Derandomization by the method of conditional expectations

How can we improve on this simple-minded method? We do not have a way forward on increasing the fraction of satisfied clauses, because of:


Theorem 23 (Håstad [46]) For all ε > 0 it is NP-hard to approximate Max-3SAT within factor 7/8 + ε.

But we might hope to reduce the runtime, and also perhaps the dependence on random bits. As it turns out we can accomplish both of these objectives.

Theorem 24 There is an O(m)-time deterministic algorithm to find an assignment satisfying 7/8 of the clauses of any 3CNF formula on m clauses.

Proof: This algorithm illustrates the method of conditional expectations. The point is that we can de- randomize the randomized algorithm by not picking all the variables at once. Instead, we consider the alternative choices to just one of the variables, and choose the branch on which the conditional expected number of satisfying clauses is greater. This very general method works in situations in which one can actually quickly calculate (or at least approximate) said conditional expectations. We use the tower property of conditional expectations, (1.19): letting Y = the number of satisfied clauses for a uniformly random setting of the rvs,

E(Y) = E(E(Y|x_1)), or explicitly

E(Y) = (1/2) E(Y|x_1 = 0) + (1/2) E(Y|x_1 = 1)

and the strategy is to pursue the value of x_1 which does better. In the present example computing the conditional expectations is easy. The probability that a clause of size i is satisfied is 1 − 2^{−i}. If a formula has m_i clauses of size i, the expected number of satisfied clauses is ∑ m_i (1 − 2^{−i}). Now, partition the clauses of size i into m_i^1 that contain the literal x_1, m_i^0 that contain the literal x_1^c, and those that contain neither. The expected number of satisfied clauses conditional on setting x_1 = 1 is

∑ m_i^1 + ∑ m_i^0 (1 − 2^{−i+1}) + ∑ (m_i − m_i^1 − m_i^0)(1 − 2^{−i}).   (1.21)

Similarly the expected number of satisfied clauses conditional on setting x1 = 0 is

∑ m_i^1 (1 − 2^{−i+1}) + ∑ m_i^0 + ∑ (m_i − m_i^1 − m_i^0)(1 − 2^{−i}).   (1.22)

A simple way to do this: we can compute each of these quantities in time O(m). (Actually, since these quantities average to the current expectation, which we already know, we only have to calculate one of them.) This simple process runs in time O(m²). However, we can actually do the process in time O(m).

[Figure 1.3: the bipartite incidence structure between the clauses C1, C2, C3, C4, ..., Cm (each of size ≤ 3) and the variables x1, x2, x3, ..., xn.]

We don't even really need to calculate the quantities (1.21), (1.22). We start with variable x1 and scan all the clauses it participates in (see Fig. 1.3). For each clause Ci (which, say, currently has |Ci| literals), the effect of setting x1 = 1 is to change the contribution of the clause


to the expectation from 1 − 2^{−|Ci|} to either 1 (if the variable satisfies the clause) or to 1 − 2^{−(|Ci|−1)} (otherwise); i.e., the expectation either increases or decreases by 2^{−|Ci|}, while the effect of setting x1 = 0 is exactly the negative of this. We add up these contributions of ±2^{−|Ci|}, conditional on x1 = 1, as we scan the clauses containing x1; if the sum is nonnegative we fix x1 = 1, otherwise we fix x1 = 0. Having done that, we edit the relevant clauses to eliminate x1 from them. Then we continue with x2, etc. The work spent per variable is proportional to its degree in this bipartite graph (the number of clauses containing it), and the sum of these degrees is ≤ 3m. So the total time spent is O(m). 2
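Here is a sketch (mine, not from the notes) of the method of conditional expectations in its simplest form, recomputing the conditional expectation from scratch for each branch rather than doing the O(m) bookkeeping just described; the clause encoding is the same as in the earlier MAX-3SAT sketch.

def conditional_expectation(clauses, fixed):
    """Expected number of satisfied clauses when the variables in `fixed` are set
    and the remaining variables are uniformly random."""
    total = 0.0
    for clause in clauses:
        satisfied, free = False, 0
        for lit in clause:
            v = abs(lit)
            if v in fixed:
                if (lit > 0) == fixed[v]:
                    satisfied = True
            else:
                free += 1
        total += 1.0 if satisfied else 1.0 - 0.5 ** free
    return total

def derandomized_max3sat(clauses, nvars):
    fixed = {}
    for v in range(1, nvars + 1):
        # choose the branch with the larger conditional expectation
        e1 = conditional_expectation(clauses, {**fixed, v: True})
        e0 = conditional_expectation(clauses, {**fixed, v: False})
        fixed[v] = e1 >= e0
    return fixed

clauses = [[1, 2, 3], [-1, 2, -3], [1, -2, 3], [-1, -2, -3]]
a = derandomized_max3sat(clauses, 3)
num_sat = sum(any((lit > 0) == a[abs(lit)] for lit in c) for c in clauses)
print(a, num_sat, "clauses satisfied")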

Chapter 2

Algebraic Fingerprinting

There are several key ways in which randomness is used in algorithms. One is to “push apart” things that are different even if they are similar. We’ll study a few examples of this phenomenon.

2.1 Lecture 8 (19/Oct) Part II: Fingerprinting with Linear Algebra

2.1.1 Polytime Complexity Classes Allowing Randomization

Some definitions of one-sided and two-sided error in randomized computation are useful.

Definition 25 BPP, RP, co-RP, ZPP: These are the four main classes of randomized polynomial-time computation. All are decision classes. A language L is:

• In BPP if there is a polynomial-time randomized algorithm which, on every input x, errs (i.e., misreports whether x ∈ L) with probability ≤ 1/3.
• In RP if there is such an algorithm which for x ∈ L errs with probability ≤ 1/3, and for x ∈/ L errs with probability 0

(note, RP is like NP in that it provides short proofs of membership), while the subsidiary definitions are:

• L ∈ co-RP if L^c ∈ RP, that is to say, if for x ∈ L the algorithm errs with probability 0, and if for x ∈/ L the algorithm errs with probability ≤ 1/3.
• ZPP = RP ∩ co-RP.

It is a routine exercise that none of these constants matter and can be replaced by any 1/poly, although completing that exercise relies on the Chernoff bound which we’ll see in a later lecture.

Exercise: Show that the following are two equivalent characterizations of ZPP: (a) there is a poly- time randomized algorithm that with probability ≥ 1/3 outputs the correct answer, and with the remaining probability halts and outputs “don’t know”; (b) there is an expected-poly-time algorithm that always outputs the correct answer. We have the following obvious inclusions:

P ⊆ ZPP ⊆ RP, co-RP ⊆ BPP


What is the difference between ZPP and BPP? In BPP, we can never get a definitive answer, no matter how many independent runs of the algorithm execute. In ZPP, we can, and the expected time until we get a definitive answer is polynomial; but we cannot be sure of getting the definitive answer in any fixed time bound. Here are the possible outcomes for any single run of each of the basic types of algorithm:

          x ∈ L     x ∈/ L
RP        ∈, ∈/     ∈/
co-RP     ∈         ∈, ∈/
BPP       ∈, ∈/     ∈, ∈/

If L ∈ ZPP, then we can run simultaneously an RP algorithm A and a co-RP algorithm B for L. Ideally, this will soon give us a definitive answer: if both algorithms say "x ∈ L", then A cannot have been wrong, and we are sure that x ∈ L; if both algorithms say "x ∈/ L", then B cannot have been wrong, and we are sure that x ∈/ L. The expected number of iterations until one of these events happens (whichever is viable) is constant. But, in any particular iteration, we can (whether x ∈ L or x ∈/ L) get the indefinite outcome that A says "x ∈/ L" and B says "x ∈ L". This might continue for arbitrarily many rounds, which is why we can't make any guarantee about what we'll be able to prove in bounded time.
An algorithm with "BPP"-style two-sided error is often referred to as "Monte Carlo", while a "ZPP"-style algorithm is often referred to as "Las Vegas".
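Schematically (a sketch of mine, not the notes' code), the tandem run looks like this; run_RP and run_coRP below are toy stand-ins for single randomized runs of an RP and a co-RP algorithm for the same language, here hypothetically taken to be the multiples of 3.

import random

def decide_ZPP(x, run_RP, run_coRP):
    """Las Vegas decision procedure built from one-sided-error subroutines:
    run_RP never answers True when x is not in L; run_coRP never answers False
    when x is in L. The expected number of rounds is O(1)."""
    while True:
        a, b = run_RP(x), run_coRP(x)
        if a and b:
            return True      # the RP run cannot have erred on a "yes"
        if not a and not b:
            return False     # the co-RP run cannot have erred on a "no"
        # otherwise this round was inconclusive; repeat with fresh randomness

# toy stand-ins for the language L = {multiples of 3}, purely illustrative
def run_RP(x):
    return x % 3 == 0 and random.random() < 2 / 3   # may falsely say "no", never falsely "yes"

def run_coRP(x):
    return x % 3 == 0 or random.random() < 1 / 3    # may falsely say "yes", never falsely "no"

print([decide_ZPP(x, run_RP, run_coRP) for x in range(6)])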

2.1.2 Verifying Matrix Multiplication

It is a familiar theme that verifying a fact may be easier than computing it. Most famously, it is widely conjectured that P6=NP. Now we shall see a more down-to-earth example of this phenomenon. In what follows, all matrices are n × n. In order to eliminate some technical issues (mainly numerical precision, also the design of a substitute for uniform sampling), we suppose that the entries of the matrices lie in Z/p, p prime; and that scalar arithmetic can be performed in unit time. (The same method will work for any finite field and a similar method will work if the entries are integers less than poly(n) in absolute value, so that we can again reasonably sweep the runtime for scalar arithmetic under the rug.) Here are two closely related questions:

1. Given matrices A, B, compute A · B.

2. Given matrices A, B, C, verify whether C = A · B.

The best known algorithm, as of 2014, for the first of these problems runs in time O(n^{2.3728639}) [39]. Resolving the correct lim inf exponent (usually called ω) is a major question in computer science. Clearly the second problem is no harder, and a lower bound of Ω(n²) even for that is obvious since one must read the whole input. Randomness is not known to help with problem (1), but the situation for problem (2) is quite different.

Theorem 26 (Freivalds [36]) There is a co-RP-type algorithm for the language "C = A · B", running in time O(n²).


I wrote "co-RP-type" rather than co-RP because the issue at stake is not the polynomiality of the runtime (since n^{ω+o(1)} is an upper bound and the gain from randomization is that we're achieving n²), but only the error model.
Proof: Note that the obvious procedure for matrix-vector multiplication runs in time O(n²). The verification algorithm is simple. Select uniformly a vector x ∈ (Z/p)^n. Check whether ABx = Cx without ever multiplying AB: applying associativity, (AB)x = A(Bx), this can be done in just three matrix-vector multiplications. Output "Yes" if the equality holds; output "No" if it fails. Clearly if AB = C, the output will be correct. In order to get a co-RP-type result, it remains to show that Pr(ABx = Cx | AB ≠ C) ≤ 1/2. The event ABx = Cx is equivalently stated as the event that x lies in the right kernel of AB − C. Given that AB ≠ C, that kernel is a strict subspace of (Z/p)^n and therefore of at most half the cardinality of the larger space. Since we select x uniformly, the probability that it is in the kernel is at most 1/2. 2
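A direct sketch of the check in Python (my own rendering, not from the notes). As argued above, a "False" answer is always correct; a "True" answer when in fact AB ≠ C happens with probability at most 1/2 per trial (in fact at most 1/p over Z/p), so independent repetitions drive the error down geometrically.

import random

def mat_vec(M, x, p):
    return [sum(row[j] * x[j] for j in range(len(x))) % p for row in M]

def freivalds(A, B, C, p, reps=20):
    """Randomized check that A*B == C over Z/p in O(reps * n^2) time."""
    n = len(A)
    for _ in range(reps):
        x = [random.randrange(p) for _ in range(n)]
        if mat_vec(A, mat_vec(B, x, p), p) != mat_vec(C, x, p):
            return False            # a witness was found: definitely A*B != C
    return True                     # equality held for every random x tried

p = 101
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_good = [[19, 22], [43, 50]]
C_bad  = [[19, 22], [43, 51]]
print(freivalds(A, B, C_good, p), freivalds(A, B, C_bad, p))   # True False (whp)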


2.2 Lecture 9 (22/Oct): Fingerprinting with Linear Algebra

2.2.1 Verifying Associativity

Let a set S of size n be given, along with a binary operation ◦ : S × S → S. Thus the input is a table of size n²; we call the input (S, ◦). The problem we consider is testing whether the operation is associative, that is, whether for all a, b, c ∈ S,

(a ◦ b) ◦ c = a ◦ (b ◦ c)   (2.1)

A triple for which (2.1) fails is said to be a nonassociative triple. No sub-cubic-time deterministic algorithm is known for this problem. However,

Theorem 27 (Rajagopalan & Schulman [73]) There is an O(n²)-time co-RP type algorithm for associativity.

Proof: An obvious idea is to replace the O(n³)-time exhaustive search for a nonassociative triple, by randomly sampling triples and checking them. The runtime required is inverse to the fraction of nonassociative triples, so this method would improve on exhaustive search if we were guaranteed that a nonassociative operation had a super-constant number of nonassociative triples. However, for every n ≥ 3 there exist nonassociative operations with only a single nonassociative triple. So we'll have to do something more interesting.
Let's define a binary operation (S̄, ◦) on a much bigger set S̄. Define S̄ to be the vector space with basis S over the field Z/2, that is to say, an element x ∈ S̄ is a formal sum

x = ∑_{a∈S} a x_a,   x_a ∈ Z/2

The product of two such elements x, y is

x ◦ y = ∑_{a∈S} ∑_{b∈S} (a ◦ b) x_a y_b = ∑_{c∈S} c ( ⊕_{a,b: a◦b=c} x_a y_b )

where of course ⊕ denotes sum mod 2. On (S̄, ◦) we have an operation that we do not have on (S, ◦), namely, addition:

x + y = ∑_{a∈S} a (x_a + y_a)

(Those who have seen such constructions before will recognize (S̄, ◦) as an "algebra" of (S, ◦) over Z/2.) The algorithm is now simple: check the associative identity for three random elements of S̄. That is, select x, y, z u.a.r. in S̄. If (x ◦ y) ◦ z = x ◦ (y ◦ z), report that (S, ◦) is associative, otherwise report that it is not associative. The runtime for this process is clearly O(n²). If (S, ◦) is associative then so is (S̄, ◦), because then (x ◦ y) ◦ z and x ◦ (y ◦ z) have identical expansions as sums. Also, nonassociativity of (S, ◦) implies nonassociativity of (S̄, ◦) by simply considering "singleton" vectors within the latter. But this equivalence is not enough. The crux of the argument is the following:


Lemma 28 If (S, ◦) is nonassociative then at least one eighth of the triples (x, y, z) of elements of S̄ are nonassociative triples.

Proof: The proof relies on a variation on the inclusion-exclusion principle. For any triple a, b, c ∈ S, let g(a, b, c) = (a ◦ b) ◦ c − a ◦ (b ◦ c). Note that g is a mapping g : S³ → S̄. Now extend g to ḡ : S̄³ → S̄ by:

ḡ(x, y, z) = ∑_{a,b,c} g(a, b, c) x_a y_b z_c

If you imagine the n × n × n cube indexed by S³, with each position (a, b, c) filled with the entry g(a, b, c), then ḡ(x, y, z) is the sum of the entries in the combinatorial subcube of positions where x_a = 1, y_b = 1, z_c = 1. (We say "combinatorial" only to emphasize that unlike a physical cube, here the slices that participate in the subcube are not in any sense adjacent.) Fix (a′, b′, c′) to be any nonassociative triple of S.
Partition S̄³ into blocks of eight triples apiece, as follows. Each of these blocks is indexed by a triple (x, y, z) s.t. x_{a′} = 0, y_{b′} = 0, z_{c′} = 0. The eight triples are (x + ε₁a′, y + ε₂b′, z + ε₃c′) where ε_i ∈ {0, 1}. Now observe that

∑_{ε₁,ε₂,ε₃} ḡ(x + ε₁a′, y + ε₂b′, z + ε₃c′) = g(a′, b′, c′)

To see this, note that each of the eight terms on the LHS is, as described above, a sum of the entries in a "subcube" of the "S³ cube". These subcubes are closely related: there is a core subcube whose indicator function is x × y × z, and all entries of this subcube are summed within all eight terms. Then there are additional width-1 pieces: the entries in the region a′ × y × z occur in four terms, as do the regions x × b′ × z and x × y × c′. The entries in the regions a′ × b′ × z, a′ × y × c′ and x × b′ × c′ occur in two terms, and the entry in the region a′ × b′ × c′ occurs in one term. Since we are working over Z/2, the contributions that occur an even number of times cancel, leaving exactly g(a′, b′, c′). Since the RHS is nonzero, so is at least one of the eight terms on the LHS. 2 2
Corollary: in time O(n²) we can sample x, y, z u.a.r. in S̄ and determine whether (x ◦ y) ◦ z = x ◦ (y ◦ z). If (S, ◦) is associative, then we get =; if (S, ◦) is nonassociative, we get ≠ with probability ≥ 1/8.
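Here is a small sketch of the resulting tester (my own encoding, not from the notes): an element of S̄ is represented as the set of basis elements with coefficient 1, and table[a][b] encodes a ◦ b. The toy tables at the end are my own test data.

import random

def mult(x, y, table):
    """Product in the Z/2 algebra: x, y are sets of elements of S (indices 0..n-1)."""
    out = set()
    for a in x:
        for b in y:
            out ^= {table[a][b]}     # coefficients add mod 2, so toggle membership
    return out

def probably_associative(table, reps=40):
    """Report False only if a nonassociative triple of vectors is found; if (S,o) is
    nonassociative, each rep catches it with probability >= 1/8 by Lemma 28."""
    n = len(table)
    for _ in range(reps):
        x, y, z = ({a for a in range(n) if random.random() < 0.5} for _ in range(3))
        if mult(mult(x, y, table), z, table) != mult(x, mult(y, z, table), table):
            return False
    return True

# sanity check: Z/3 under addition (associative) vs. a table perturbed in one entry
add3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]
bad = [row[:] for row in add3]
bad[2][2] = 2                        # this perturbation makes the operation nonassociative
print(probably_associative(add3), probably_associative(bad))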


2.3 Lecture 10 (24/Oct): Perfect matchings, polynomial identity testing

2.3.1 Matchings

A matching in a graph G = (V, E) is a set of vertex disjoint edges; the size of the matching is the number of edges. Let n = |V| and m = |E|. A perfect matching is one of size n/2. A maximal matching is one to which no edges can be added. A maximum matching is one of largest size. How hard are the problems of finding such objects?
It is of course easy to find a maximal matching—sequentially. On the other hand, finding one on a parallel computer is a much more interesting problem, which I hope to return to later in the course. Returning to sequential computation: Finding a maximum matching, or deciding whether a perfect matching exists, are interesting problems. In bipartite graphs, Hall's theorem and the augmenting path method give very nice and accessible deterministic algorithms for maximum matching. In general graphs the problem is harder but there are deterministic algorithms running in time O(√n m) [66, 38].

2.3.2 Bipartite perfect matching: deciding existence

The first problem we focus on here is to decide whether a bipartite graph has a perfect matching. As noted there are nice deterministic algorithms for this problem but the randomized one is even simpler. Write G = (V1, V2, E) with E ⊆ V1 × V2. Form the V1 × V2 “variable” matrix A which has Aij = xij if {i, j} ∈ E, and otherwise Aij = 0.

Let q be some prime power and consider the xij as variables in GF(q). The determinant of A, then, is a polynomial in the variables xij. Before launching into this, a word on a subtle point: what does it mean for a (multivariate) polyno- mial p to be nonzero? Consider a polynomial over any field κ, which is to say, the coefficients of all the monomials in the polynomial lie in κ.

Definition 29 We consider a polynomial nonzero if any monomial has a nonzero coefficient.

A stronger condition, which is not the definition we adopt, is that p(x) ≠ 0 for some x ∈ κ. Of course this implies the condition in the definition; but it is strictly stronger, as we can see from the example of the polynomial x² + x over the field Z/2. However, the conditions are equivalent in the following two cases:

1. Over infinite fields such as R or C. This will follow from Lemma 35. 2. For multilinear polynomials. (This applies in particular to Det(A) which we are considering now, so for Lemma 31, it wouldn’t have mattered which definition we used.) Specifically we have:

Lemma 30 Let p(x) be a nonzero multilinear polynomial over field κ. Then there is a setting of the x_i to values c_i in κ s.t. p(c) ≠ 0.


Proof: Every monomial is associated with a set of variables; choose a minimal such set. (E.g., if there is a constant term, then the empty set.) Assign the value 1 to all variables in this set, and 0 to all variables outside this set. Only the chosen monomial can be nonzero, so p 6= 0 for this assignment. 2

Lemma 31 G has a perfect matching iff Det(A) ≠ 0.

(This is the "baby" version of a result of Tutte that we will see later in Theorem 36.)
Proof: Every monomial in the expansion of the determinant corresponds to a permutation. A permutation supported by the nonzero entries of A is simply a pattern of nonzero entries hitting each row and column exactly once, namely, a perfect matching in the bipartite graph. Conversely, if some perfect matching is present, it puts a monomial in the determinant with coefficient either 1 or −1; distinct matchings contribute distinct monomials, so these cannot cancel. 2

Corollary 32 Fix any field κ. G has a perfect matching iff there is an assignment of the variables in A such that the determinant is nonzero.

Proof: Apply Lemma 30 to Lemma 31. 2 This suggests the following exceptionally simple algorithm: compute the polynomial and see if it is nonzero. There’s a problem with this idea! The determinant has exponentially many monomials. This is not a problem for computing determinants over a ring such as the integers, because even the sum of exponentially many integers only has polynomially more bits than the largest of those integers has. However, in this ring of multivariate polynomials (i.e., the ring κ[{xij}] where κ = Q or κ = GF(q), for the moment it doesn’t matter), there are exponentially many distinct terms to keep track of if you want to write the polynomial out as a sum of monomials. Of course the determinant has a more concise representation (namely, as “Det(A)”), but we do not know how to efficiently convert that to any representation that displays transparently whether the polynomial is the 0 polynomial. So we modify the original suggestion. Since we do know how to efficiently compute determinants of scalar matrices, let’s substitute scalar values for the xij’s. What values should we use? Random ones.

Revised Algorithm: Sample the x_ij's u.a.r. in GF(q); call the sampled matrix A_R. Compute Det(A_R); report "G has/hasn't a perfect matching" according to whether Det(A_R) ≠ 0 or = 0.

[Figure 2.1: a commutative diagram with nodes Det(variables), Det(scalars), monomials(variables), and value of Det, and arrows labeled substitute, expand, evaluate, and substitute. This diagram commutes, but for a fast commute, go right and then down.]

Clearly the algorithm answers correctly if there is no perfect matching, and it is fast (see Fig. 2.3.2). What needs to be shown is that the probability of error is small if there is a perfect matching (and q is large enough). So this is an RP-type algorithm for “G has a perfect matching”.

Theorem 33 The algorithm is error-free on bipartite graphs lacking a perfect matching, and the probability of error of the algorithm on bipartite graphs which have a perfect matching is at most n/q. The runtime of the algorithm is n^{ω+o(1)}, where ω is the matrix multiplication exponent.


All we have to do, then, is use a prime power q ≥ 2n in order to have error probability ≤ 1/2. Incidentally, there is always a prime 2n ≤ q < 4n; this follows from "Bertrand's postulate". This fact alone isn't quite enough if we want to find a prime in the right size range efficiently, but that too can be done, in a slightly larger range. (The density of primes of this size is about 1/ log n so we don't have to try many values before we should get lucky; and note, primality testing has efficient algorithms in ZPP and even somewhat less efficient algorithms in P [4].) However, we don't have to work this hard, since we're satisfied with prime powers rather than primes. We can simply use the first power of 2 after 2n. We will prove Theorem 33 after introducing a general useful tool.
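Before that, here is an end-to-end sketch of the Revised Algorithm (my own rendering, not from the notes). The determinant over Z/p is computed by ordinary Gaussian elimination, and p is assumed prime so that inverses can be taken via Fermat's little theorem.

import random

def det_mod_p(M, p):
    """Determinant of a square matrix over Z/p by Gaussian elimination, O(n^3)."""
    M = [row[:] for row in M]
    n, d = len(M), 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] % p), None)
        if pivot is None:
            return 0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            d = -d
        inv = pow(M[col][col], p - 2, p)        # inverse mod p (p prime)
        for r in range(col + 1, n):
            f = (M[r][col] * inv) % p
            for c in range(col, n):
                M[r][c] = (M[r][c] - f * M[col][c]) % p
        d = (d * M[col][col]) % p
    return d % p

def bipartite_has_pm(n, edges, p=10007):
    """edges are pairs (i, j), i in the left part and j in the right part (0-indexed).
    One-sided error: a 'False' answer is wrong with probability <= n/p."""
    A = [[0] * n for _ in range(n)]
    for (i, j) in edges:
        A[i][j] = random.randrange(p)           # random scalar substituted for x_ij
    return det_mod_p(A, p) != 0

edges = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]   # a 3+3 bipartite graph with a perfect matching
print(bipartite_has_pm(3, edges))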

2.3.3 Polynomial identity testing

In the previous section we saw that testing for existence of a perfect matching in a bipartite graph can be cast as a special case of the following problem. We are given a polynomial p(x), of total degree n, in variables x = (x1,..., xm), m ≥ 1. (The total degree of a monomial is the sum of the degrees of the variables in it; the total degree of a polynomial is the greatest total degree of its monomials.) We are agnostic as to how we are “given” the polynomial, and demand only that we be able to quickly evaluate it at any scalar assignment to the variables. We wish to test whether the polynomial is identically 0, and our procedure for doing so is to evaluate it at a random point and report “yes” if the value there is 0. We rely on the following lemma. Let z(p) be the set of roots (zeros) of a polynomial p.

Lemma 34 Let p be a nonzero polynomial over GF(q), of total degree n in m variables. Then |z(p)| ≤ nq^{m−1}.

As a fraction, this is saying that |z(p)|/q^m ≤ n/q, and in this form the lemma immediately implies Theorem 33. The univariate case of the lemma is probably familiar to you. The lemma is a special case of the following more general statement which holds for any, even infinite, field κ.

Lemma 35 Let p be a nonzero polynomial over a field κ, of total degree n in variables x_1,..., x_m. Let S_1,..., S_m be subsets of κ with |S_i| ≤ s for all i. Then |z(p) ∩ (S_1 × ... × S_m)| ≤ s^{m−1} n.

This is usually known as the Schwartz-Zippel lemma [77, 90], although the results in these two publications were not precisely equivalent, and there were at least two other discoveries of versions of the result, by Ore [62] and by DeMillo and Lipton [26]. A generalization beyond polynomials is due to Gonnet [41]. Recalling the two candidate definitions of what it means for a polynomial to be nonzero, since in Defn 29 we chose the weaker condition, Lemma 35 is stronger than it would be otherwise. Proof: of Lemma 35: The lemma is trivial if n ≥ s, so suppose n < s. First consider the univariate case, m = 1. (In this case the two lemmas are identical since any set S1 is a product set.) This follows by induction on n because if n ≥ 1 and p(α) = 0, then p can be factored as p(x) = (x − α) · q(x) for some q of degree n − 1. (Because, make the change of variables to x − α. After this change the polynomial cannot have any constant term. So we can factor out (x − α).)


Next we handle m > 1 by induction. If x_1 does not appear in p then the conclusion follows from the case m − 1. Otherwise, write p in the form p(x) = ∑_{i=0}^{n} x_1^i p_i(x_2,..., x_m), and let 0 < i ≤ n be largest such that p_i ≠ 0. The degree of p_i is at most n − i, so by induction,

|z(p_i) ∩ (S_2 × ... × S_m)| / s^{m−1} ≤ (n − i)/s

Let r be the LHS, i.e., the fraction of S2 × ... × Sm that are roots of pi.

For (x2,..., xm) ∈ z(pi) we allow as a worst case that all choices of x1 ∈ S1 yield a zero of p.

For (x_2,..., x_m) ∈/ z(p_i), p restricts to a nonzero polynomial of degree i in the variable x_1, so by the case m = 1,

|z(p) ∩ (S_1 × {x_2} × ... × {x_m})| / s ≤ i/s

Since i/s ≤ n/s < 1, the weighted average of our two bounds (on the fraction of roots in sets of the form S_1 × {x_2} × ... × {x_m}) is worst when r is as large as possible. Thus

|z(p) ∩ (S_1 × ... × S_m)| / s^m ≤ r · 1 + (1 − r) · (i/s)
≤ ((n − i)/s) · 1 + (1 − (n − i)/s) · (i/s)
= n/s − i(n − i)/s²
≤ n/s    2

Comment: This lemma gives us an efficient randomized way of testing whether a polynomial is identically zero, and naturally, people have wondered whether there might be an efficient deterministic algorithm for the same task. So far, no such algorithm has been found, and it is known that any such algorithm would have hardness implications in complexity theory that are currently out of reach [52]¹.

1Specifically: If one can test in polynomial time whether a given arithmetic circuit over the integers computes the zero polynomial, then either (i) NEXP 6⊆ P/poly or (ii) the Permanent is not computable by polynomial-size arithmetic circuits.


2.4 Lecture 11 (26/Oct): Perfect matchings in general graphs. Parallel computation. Isolating lemma.

2.4.1 Deciding existence of a perfect matching in a graph

Bipartite graphs were handled last time. Now we consider general graphs. Deterministically, deciding the existence of a perfect matching in a general graph is harder than the same problem in a bipartite graph. (As noted, we have poly-time algorithms, but not nearly so simple ones.) With randomization, however, we can adapt the same approach to work with almost equal efficiency. We must define the Tutte matrix of a graph G = (V, E). Order the vertices arbitrarily from 1, . . . , n and set

T_ij = 0 if {i, j} ∈/ E;   T_ij = x_ij if {i, j} ∈ E and i < j;   T_ij = −x_ji if {i, j} ∈ E and i > j.

Theorem 36 (Tutte [86]) Det(T) ≠ 0 iff G has a perfect matching.

This determinant is not multilinear in the variables, so the lemma from last time does not apply.

Proof: If G has a perfect matching, assign x_ij = 1 for edges in the matching, and 0 otherwise. Each matching edge {i, j} describes a transposition of the vertices i, j. With this assignment every row and column has a single nonzero entry corresponding to the matching edge it is part of, so the matrix is the permutation matrix (with some signs) of the involution that transposes the vertices on each edge. Since a transposition has sign −1 and there is a single −1 in each pair of nonzero entries, the contribution of each transposition to the determinant is 1, and overall we have Det(T) = 1.
Conversely suppose Det(T) ≠ 0 as a polynomial. Consider the determinant as a signed sum over permutations. The net contribution to the determinant from all permutations having an odd cycle is 0, for the following reason. In each such permutation identify the "least" odd cycle by some criterion, e.g., ordering the cycles by their least-indexed vertex. Then flip the direction of that least odd cycle. This map is an involution on the set of such permutations. It carries the permutation to another which contributes the opposite sign to the determinant, since the signs of all the entries in the flipped cycle (an odd number of them) are reversed. (Figure 2.2.)

 ......   ......   ......   ......       ...... x ...  vs.  ...... x  (2.2)  34   35   ...... x45   ...... −x34 ...  ...... −x35 ...... −x45

Figure 2.2: Flipping a cycle among vertices 3, 4, 5. Preserves permutation sign; reverses signs of cycle variables.

Therefore there are permutations of the vertices, supported by T (i.e., each vertex is mapped to a destination along one of the edges incident to it, that is, π(i) = j ⇒ T_ij ≠ 0), having only even cycles. The even cycles of length 2 are matching edges, and in any even cycle of length greater than 2, we can use every alternate edge; altogether we obtain a perfect matching. See Figure 2.3 for a graph having perfect matchings, and two permutations from which we can read off perfect matchings. 2


[Figure 2.3: a 4-vertex graph's Tutte matrix (Eqn. 2.3), with entries x_{12}, x_{13}, x_{14}, x_{23}, x_{34} and their negatives, together with two permutation-support patterns (Eqn. 2.4) from which we can read off perfect matchings: the involution (12)(34) and the 4-cycle (1234).]

In exactly the same way as for the bipartite case, this yields:

Theorem 37 The algorithm to determine existence of a perfect matching in a graph on n vertices, is error- free on graphs lacking a perfect matching, and the probability of error of the algorithm on graphs which have a perfect matching, is at most n/q. The runtime of the algorithm is nω+o(1), where ω is the matrix multiplication exponent.

By self-reducibility this immediately yields an Õ(n^{ω+2})-time algorithm for finding a perfect matching. (Remove an edge, see if there is still a perfect matching without it, . . . )
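A sketch of exactly that (mine, not from the notes): the Tutte-matrix test with random values mod a prime p, plus the edge-removal self-reducibility loop. det_mod_p is the Gaussian-elimination determinant from the bipartite sketch above, assumed available here.

import random

def has_pm(n, edges, p=10007):
    """Randomized perfect-matching test for a general graph via a random
    instantiation of the Tutte matrix. 'False' errs with probability <= n/p."""
    T = [[0] * n for _ in range(n)]
    for (i, j) in edges:
        x = random.randrange(p)
        T[i][j], T[j][i] = x, (-x) % p
    return det_mod_p(T, p) != 0

def find_pm_by_self_reducibility(n, edges, trials=20):
    """Keep an edge only if removing it would (apparently) destroy all perfect
    matchings; repeat each one-sided test to shrink the false-negative probability."""
    if not any(has_pm(n, edges) for _ in range(trials)):
        return None
    kept = list(edges)
    for e in list(kept):
        rest = [f for f in kept if f != e]
        if any(has_pm(n, rest) for _ in range(trials)):
            kept = rest                          # e is not needed; drop it
    return kept                                  # what remains is (whp) a perfect matching

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # 4 vertices: a 4-cycle plus a chord
print(find_pm_by_self_reducibility(4, edges))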

2.4.2 Parallel computation

There are two major processes at work in the above algorithm: determinant computations, and sequential branching used in the self-reducibility argument. As we discuss in a moment, the linear algebra can be parallelized. But the branching is inherently sequential. Nevertheless, we will shortly see a completely different algorithm that avoids this sequential branching. In this way we'll put the problem of finding a perfect matching in RNC.
NC^i = problems solvable deterministically by poly(n) processors in log^i n time. (Equivalently, by poly(n)-size, log^i n-depth circuits.) NC = ∪_i NC^i. RNC^i and RNC are defined the same way except that the processors (gates) may use random bits. (Note, we are glossing over the "uniformity" conditions of the complexity classes.)

2.4.3 Sequential and parallel linear algebra

In sequential computation, there are reductions showing that matrix inversion and multiplication have the same time complexity (up to a factor of log n), and that determinant is no harder than these.


In parallel computation, the picture is actually a little simpler. Matrix multiplication is in NC^1 (right from the product definition, since we can use a tree of depth log n to sum the n terms of a row-column inner product). Matrix inversion and determinant are in NC^2, due to Csanky [24] (over fields of characteristic 0) (and using fewer processors in RNC^2 by Pan [72]); the problem is also in NC over general fields [12, 19]. Csanky's algorithm builds on the result of Valiant, Skyum, Berkowitz and Rackoff [87] that any deterministic sequential algorithm computing a polynomial of degree d in time m can be converted to a deterministic parallel algorithm computing the same polynomial in time O((log d)(log d + log m)) using O(d^6 m^3) processors. For a good explanation of Csanky's algorithm see [59] §31, and for more on parallel algorithms see [61].

2.4.4 Finding perfect matchings in general graphs. The isolating lemma

We now develop a randomized method of Mulmuley, U. Vazirani and V. Vazirani [69] to find a perfect matching if one exists. A polynomial time algorithm is implied by the previous testing method along with self-reducibility of the perfect matching decision problem. However, with the following method we can solve the same problem in parallel, that is to say, in polylog depth on polynomially many processors. This is not actually the first RNC algorithm for this task—that is an RNC^3 method due to [54]—but it is the "most parallel" since it solves the problem in RNC^2.
A slight variant of the method yields a minimum weight perfect matching in a weighted graph that has "small" weights, that is, integer weights represented in unary; and there is a fairly standard reduction from the problem of finding a maximum matching to finding a minimum weight perfect matching in a graph with weights in {0, 1}. So through this method we can actually find a maximum matching in a general graph, with a similar total amount of work.
There are really two key ingredients to this algorithm. The first, which we have already noted, is that all basic linear algebra problems can be solved in NC^2. The second ingredient, which will be our focus, is the following lemma.
First some notation. Let A = {a_1, a_2,..., a_m} be a finite set. If a_1,..., a_m are assigned weights w_1,..., w_m, the weight of a set S ⊆ A is defined to be w(S) = ∑_{a_i ∈ S} w_i. Let S = {S_1,..., S_k} be a collection of subsets of A. Let min(S : w) ⊆ S be the collection of those S ∈ S of least weight. We are interested in the event that the least weight is uniquely attained, i.e., the event that |min(S : w)| = 1.

Lemma 38 ([69] Isolating Lemma, based on improved version in [84]) Let the weights w1,..., wm be independent random variables, each wi being sampled uniformly in a set Ri ⊆ R, |Ri| ≥ r ≥ 2. Then

Pr(|min(S : w)| = 1) ≥ (1 − 1/r)^m.   (2.5)

This lemma is remarkable because of the absence of a dependence on k, the size of the family, in the conclusion.
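A quick empirical sanity check of the lemma (my own sketch, not from the notes): draw a random family of subsets, assign random weights from sets of size r, and compare the frequency of a unique minimizer to the bound (1 − 1/r)^m.

import random

def unique_min_frequency(A_size, num_sets, r, trials=5000):
    """Empirical frequency with which random weights isolate a unique minimum-weight
    set, for a random family of distinct subsets of {0,...,A_size-1}."""
    universe = list(range(A_size))
    family = {frozenset(random.sample(universe, random.randint(1, A_size)))
              for _ in range(num_sets)}
    family = list(family)
    isolated = 0
    for _ in range(trials):
        w = [random.randrange(1, r + 1) for _ in range(A_size)]
        weights = [sum(w[i] for i in S) for S in family]
        if weights.count(min(weights)) == 1:
            isolated += 1
    return isolated / trials, (1 - 1 / r) ** A_size   # empirical rate vs. the lemma's bound

print(unique_min_frequency(A_size=8, num_sets=200, r=16))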


2.5 Lecture 12 (29/Oct.): Isolating lemma, finding a perfect matching in parallel

2.5.1 Proof of the isolating lemma

Proof: Write Ri = {ui(1),..., ui(|Ri|)} with ui(j) < ui(j + 1) for all 1 ≤ j ≤ |Ri| − 1.

Think of u as the mapping ∏ u_i where u_i(j) is the evaluation of the function u_i at 1 ≤ j ≤ |R_i|. Let V = ∏_{i=1}^m {1, . . . , |R_i|} and V′ = ∏_{i=1}^m {2, . . . , |R_i|}. Any composition u ◦ v is a weight function on A, and if v ∈ V′ then this weight function avoids using the weights u_i(1). Note |V′|/|V| = ∏_{i=1}^m (1 − 1/|R_i|) ≥ (1 − 1/r)^m.
Given v ∈ V′, fix a set T ∈ min(S : u ◦ v) of largest cardinality. Define φ : V′ → V by

φv(i) = v(i) − 1 if i ∈ T;   φv(i) = v(i) otherwise.

We claim that min(S : u ◦ φv) = {T} and that φ is a bijection. Observe that for any S ∈ S,

(u ◦ v)(S) − (u ◦ φv)(S) = ∑_{i ∈ S∩T} ( u_i(v(i)) − u_i(v(i) − 1) ).

with every summand on the RHS being positive. In particular (u ◦ v)(T) − (u ◦ φv)(T) is the largest change in weight possible for any S, and is achieved by S only if T ⊆ S. Because T has largest cardinality among sets in min(S : u ◦ v), no other set of min(S : u ◦ v) can contain T, and therefore T decreases its weight by strictly more than any other set of min(S : u ◦ v). Other sets of S might have their weight decrease by the same amount as T, but not more. So, min(S : u ◦ φv) = {T} as desired.

Consequently also T can be identified as the unique min-weight element of min(S : u ◦ φv). So φ can be inverted. (Keep in mind, at different v in the domain in φ, different sets T get used, so in order to invert φ we need to be able to identify T just from seeing the mapping φv. (And of course, u, which is fixed.) ) Thus |φ(V0)| = |V0|. So, with v sampled uniformly,

Pr(|min(S : u ◦ v)| = 1) ≥ Pr(v ∈ φ(V′)) = Pr(v ∈ V′) ≥ (1 − 1/r)^m.

2

2.5.2 Finding a perfect matching, in RNC

Now we describe the algorithm to find a perfect matching (or report that probably none exists) in a graph G = (V, E) with n = |V|, m = |E|.

For every {i, j} ∈ E pick an integer weight w_ij iid uniformly distributed in {1, . . . , 2m}. By the isolating lemma, if G has any perfect matchings, then with probability at least (1 − 1/(2m))^m ≥ 1/2 it obtains a unique minimum weight perfect matching. Define the matrix T by:

T_ij = 0 if {i, j} ∈/ E;   T_ij = 2^{w_ij} if {i, j} ∈ E, i < j;   T_ij = −2^{w_ji} if {i, j} ∈ E, i > j.   (2.6)

This is an instantiation of the Tutte matrix, with x_ij = 2^{w_ij}.


Claim 39 If G has a unique minimum weight perfect matching (call it M, and its weight w(M)) then Det(T) ≠ 0 and moreover, Det(T) = 2^{2w(M)} × [an odd number].

Proof: of Claim: As before we look at the contributions to Det(T) of all the permutations π that are supported by edges of the graph. The contributions from permutations having odd cycles cancel out—that is just because this is a special case of a Tutte matrix. It remains to consider permutations π that have only even cycles.

• If π consists of transpositions along the edges of M then it contributes 2^{2w(M)}.

• If π has only even cycles, but does not correspond to M, then:

– If π is some other matching M′ of weight w(M′) > w(M) then it contributes 2^{2w(M′)}.
– If π has only even cycles and at least one of them is of length ≥ 4, then by separating each cycle into a pair of matchings on the vertices of that cycle, π is decomposed into two matchings M_1 ≠ M_2 of weights w(M_1), w(M_2), so π contributes ±2^{w(M_1)+w(M_2)}. Because of the uniqueness of M not both of M_1 and M_2 can achieve weight w(M), so w(M_1) + w(M_2) > 2w(M). 2

Now let T̂_ij be the (i, j)-deleted minor of T (the matrix obtained by removing the i'th row and j'th column from T), and set

m_ij = ∑_{π: π(i)=j} sign(π) ∏_{k=1}^{n} T_{k,π(k)} = ±2^{w_ij} Det(T̂_ij)   (2.7)

Claim 40 For every {i, j} ∈ E:

1. The total contribution to mij of permutations π having odd cycles is 0. 2. If there is a unique minimum weight perfect matching M, then:

(a) If {i, j} ∈ M then m_ij/2^{2w(M)} is odd.
(b) If {i, j} ∈/ M then m_ij/2^{2w(M)} is even.

Proof: of Claim: This is much like our argument for Det(T) but localized.

1. If π has an odd cycle then it has an even number of odd cycles and hence an odd cycle not containing point i. Pick the "first" odd cycle that does not contain point i and flip it to obtain a permutation π^r. Note that (π^r)^r = π. The contribution of π^r to m_ij is the negation of the contribution of π to m_ij, because we have replaced an odd number of entries of the Tutte matrix by the same entries with flipped signs.

2. By the preceding argument, whether or not {i, j} ∈ M, we need only consider permutations containing solely even cycles. Just as argued for Claim 39, the contribution of every such permutation π can be written as ±2^{w(M_1)+w(M_2)}, where M_1 and M_2 are two perfect matchings obtained as follows: each transposition (i′, j′) in π puts the edge {i′, j′} into both of the matchings; each even cycle of length ≥ 4 can be broken alternatingly into two matchings, one of which (arbitrarily) is put into M_1 and one into M_2.


The only case in which there is a term for which w(M1) + w(M2) = 2w(M) is the single case that {i, j} ∈ M and π consists entirely of transpositions along the edges of M. In every other case, at least one of M1 or M2 is distinct from M, and therefore w(M1) + w(M2) > 2w(M). The claim follows. 2

Finally we collect all the elements necessary to describe the algorithm:

1. Generate the weights w_ij uniformly in {1, . . . , 2m}.
2. Define T as in Eqn (2.6), compute its determinant and if it is nonsingular invert it. (Otherwise, start over.) This determinant computation and the inversion can be done (deterministically) in depth O(log² n) as discussed earlier.
3. Determine w(M) by factoring the greatest power of 2 out of Det(T).

4. Obtain the values ±m_ij from the equations m_ij = ±2^{w_ij} Det(T̂_ij) and

Det(T̂_ij) = (−1)^{i+j} (T^{−1})_{ji} Det(T)   (Cramer's rule)

If m_ij/2^{2w(M)} is odd then place {i, j} in the matching.
5. Check whether this defines a perfect matching. This is guaranteed if the minimum weight perfect matching is unique. If a perfect matching was not obtained (which will occur for sure if there is no perfect matching, but with probability ≤ 1/2 if there is one), generate new weights and repeat the process.

Of course, if the graph has a perfect matching, the probability of incurring k repetitions without success is bounded by 2^{−k}, and the expected number of repetitions until success is at most 2.

The simultaneous computation of all the m_ij's in step 2 is key to the efficiency of this procedure. The numbers in the matrix T are integers bounded by ±2^{2m}. Pan's RNC^2 matrix inversion algorithm will compute T^{−1} using O(n^{3.5}m) processors.
For the maximum matching problem, we use a simple reduction: use weights for each of the non-edges too, but sample those weights uniformly from 2mn + 1, . . . , 2mn + 2m (rather than 1, . . . , 2m like the graph edges). Then no minimum weight perfect matching will use any of the non-edges. The cost of this reduction is that the integers in the matrix now use O(mn) rather than O(m) bits, so the number of processors used by the maximum matching algorithm is O(n^{4.5}m).
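For very small graphs one can simulate the whole procedure sequentially and exactly. The sketch below is my own toy rendering (and of course misses the entire point of the algorithm, which is parallelism): it computes determinants by brute-force permutation expansion so that the powers of 2 come out exactly, and extracts the matching by the parity test of Claim 40.

import random
from itertools import permutations

def det(M):
    """Exact integer determinant by permutation expansion; only for tiny matrices."""
    n, total = len(M), 0
    for perm in permutations(range(n)):
        sign = 1
        for a in range(n):                      # sign via inversion count
            for b in range(a + 1, n):
                if perm[a] > perm[b]:
                    sign = -sign
        prod = 1
        for i in range(n):
            prod *= M[i][perm[i]]
        total += sign * prod
    return total

def mvv_perfect_matching(n, edges):
    m = len(edges)
    w = {e: random.randint(1, 2 * m) for e in edges}       # random weights
    T = [[0] * n for _ in range(n)]
    for (i, j) in edges:                                   # Tutte matrix with x_ij = 2^{w_ij}
        T[i][j] = 2 ** w[(i, j)]
        T[j][i] = -(2 ** w[(i, j)])
    d = det(T)
    if d == 0:
        return None                                        # probably no perfect matching
    twice_wM = 0
    while d % 2 == 0:                                      # Det(T) = 2^{2w(M)} * odd when isolation holds
        d //= 2
        twice_wM += 1
    matching = []
    for (i, j) in edges:
        minor = [[T[r][c] for c in range(n) if c != j] for r in range(n) if r != i]
        mij = (2 ** w[(i, j)]) * det(minor)                # up to sign, which does not affect parity
        if (abs(mij) // 2 ** twice_wM) % 2 == 1:
            matching.append((i, j))
    used = [v for e in matching for v in e]
    if len(matching) == n // 2 and len(set(used)) == n:
        return matching
    return None                                            # isolation failed; caller retries

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]                   # the 4-cycle: two perfect matchings
M = None
for _ in range(20):                                        # each attempt succeeds with prob >= 1/2
    M = mvv_perfect_matching(4, edges)
    if M:
        break
print("perfect matching found:", M)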

Chapter 3

Concentration of Measure

3.1 Lecture 13 (31/Oct): Independent rvs, Chernoff bound, applications

3.1.1 Independent rvs

Lemma 41 If X1,..., Xn are independent real rvs with finite expectations (recall this assumption requires that the integrals converge absolutely), then

E(∏ Xi) = ∏ E(Xi).

This is a consequence of the fact that the probabilities of independent events multiply; one only has to be careful about the measure theory. It is enough to consider the case n = 2 and proceed by induction. Recall the definition of expectation from Eqn 1.7:

E(X) = lim_{h→0} ∑_{j integer, −∞ < j < ∞} jh Pr(jh ≤ X < (j + 1)h)

Pr((jh ≤ X < (j + 1)h) ∧ (j′h ≤ Y < (j′ + 1)h)) = Pr(jh ≤ X < (j + 1)h) · Pr(j′h ≤ Y < (j′ + 1)h) for independent X, Y. If you want to do the measure theory carefully, this boils down to the Fubini Theorem.

3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk)

The Chernoff bound1 will be one of two ways in which we’ll display the concentration of measure phenomenon, the other being the central limit theorem. In the types of problems we’ll be looking at the Chernoff bound is the more frequently useful of the two but they’re closely related.

Let’s begin with the special case of iid fair coins, aka iid uniform Bernoulli rvs: P(Xi = 1) = 1/2, P(Xi = 0) = 1/2. Put another way, we have n independent events, each of which occurs with

1First due to Bernstein [14, 15, 13] but we follow the standard naming convention in Computer Science.


probability 1/2. We want an exponential tail bound on the probability that significantly more than half the events occur. This very short argument is the seed of more general or stronger bounds that we will see later.

It will be convenient to use the rvs Yi = 2Xi − 1, where Xi is the indicator rv of the ith event. This shift lets us work with mean-0 rvs. This leaves the Yi independent; that is a special case of the following lemma, which is an immediate consequence of the definitions in Sec. 1.2:

Lemma 42 If f1,... are measurable functions and X1,... are independent rvs then f1(X1),... are indepen- dent rvs. (Proof omitted.)

Theorem 43 Let Y_1,..., Y_n be iid rvs, with Pr(Y_i = −1) = Pr(Y_i = 1) = 1/2. Let Y = ∑_{i=1}^n Y_i. Then Pr(Y > λ√n) < e^{−λ²/2} for any λ > 0.

The significance of √n here is that it is the standard deviation of Y (i.e., √(Var(Y))), because (a) E(Y_i²) = 1 (easy), and (b):
Exercise: If Z_1,..., Z_n are independent (actually pairwise independent is enough) real rvs with well defined first and second moments, then

Var(∑ Z_i) = ∑ Var(Z_i).   (3.1)

Consequently, we already know from the Chebyshev bound, Lemma 13, to "expect" √n to be the regime where we start to get a meaningful deviation bound.
Proof: Fix any α > 0. Exercise:²

E(e^{αY_i}) = cosh α ≤ e^{α²/2}.

By independence of the rvs e^{αY_i},

E(e^{αY}) = ∏ E(e^{αY_i}) ≤ e^{nα²/2}.

Pr(Y > λ√n) = Pr(e^{αY} > e^{αλ√n})
≤ E(e^{αY}) / e^{αλ√n}   (Markov ineq.)
≤ e^{nα²/2 − αλ√n}

We now optimize this bound by making the choice α = λ/√n, and obtain:

Pr(Y > λ√n) ≤ e^{−λ²/2}. 2

Here's another way to think about this calculation: Let s_x(y) be the step function s_x(y) = 1 for y > x, s_x(y) = 0 for y ≤ x. Note, for any α > 0, s_x(y) ≤ exp(α(y − x)); which is to say, the exponential kernel of integration is greater than the threshold kernel of integration. (See Fig. 3.1.)

Pr(Y > λ√n) = E( s_{λ√n}(Y) )
≤ E( exp(α(Y − λ√n)) )
= ∏_{i=1}^n E( exp(α(Y_i − λ/√n)) )
= ( e^{−αλ/√n} cosh α )^n

²For k ≥ 0, (2k)! = ∏_{i=1}^k i(k + i) ≥ 2^k k!, so for any real x, e^{x²/2} = ∑_{k≥0} x^{2k}/(2^k k!) ≥ ∑_{k≥0} x^{2k}/(2k)! = cosh x.


[Figure 3.1: integrating a probability mass against two different nonnegative kernels, the threshold kernel and the exponential kernel.]

We get the best upper bound by minimizing the base b of this exponent. If we pick α = λ/√n, which doesn't exactly optimize the bound but comes close, we get b = e^{−λ²/n} cosh(λ/√n) ≤ e^{−λ²/n} e^{λ²/2n} = e^{−λ²/2n}. Then substituting back we get

Pr(Y > λ√n) ≤ e^{−λ²/2}.
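A quick empirical comparison of this tail bound against simulation (my own sketch, standard library only):

import math, random

def tail_estimate(n, lam, trials=20000):
    """Empirical Pr(Y > lam*sqrt(n)) for Y a sum of n independent +/-1 signs,
    compared against the bound exp(-lam^2/2)."""
    thresh = lam * math.sqrt(n)
    hits = sum(1 for _ in range(trials)
               if sum(random.choice((-1, 1)) for _ in range(n)) > thresh)
    return hits / trials, math.exp(-lam ** 2 / 2)

print(tail_estimate(n=100, lam=2.0))   # empirical tail vs. the Chernoff bound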

3.1.3 Application: set discrepancy

For a function χ : {1, . . . , n} → {1, −1} and a subset S of {1, . . . , n}, let χ(S) = ∑_{i∈S} χ(i). Define the discrepancy of χ on S to be Disc(S, χ) = |χ(S)|, and the discrepancy of χ on a collection of sets S = {S_1,..., S_n} to be Disc(S, χ) = max_j |χ(S_j)|.

Theorem 44 (Spencer) With the definitions above, ∃χ with Disc(S, χ) ∈ O(√n).

We won’t provide Spencer’s argument, but the starting point for it is the proof of the following weaker statement.

Theorem 45 With the definitions above, a function χ selected u.a.r. has Disc(S, χ) ∈ O(√(n log n)) with positive probability.

Proof: By Theorem 43, for any particular set S_j (noting that |S_j| ≤ n),

Pr(|χ(S_j)| > c√(n log n)) = Pr( |χ(S_j)| > (c√(n log n)/√|S_j|) · √|S_j| )
≤ 2 exp( −c² n log n / (2|S_j|) )
≤ 2 exp( −c² (log n) / 2 )
= 2 n^{−c²/2}.


Now take a union bound over the sets.

Pr(∃j : |χ(S_j)| > c√(n log n)) ≤ n Pr(|χ(S_j)| > c√(n log n)) < 2 n^{1−c²/2}.

Plug in any c > √2 to show the theorem for sufficiently large values of n. 2

3.1.4 Entropy and Kullback-Liebler divergence

When we introduced BPP we specified that at the end of the poly-time computation, strings in the language should be accepted with probability ≥ 2/3, and strings not in the language should be accepted with probability ≤ 1/3. We also noted that these values were immaterial and did not even need to be constants—we need only that they be separated by some 1/poly. We’ll shortly see why. We start by defining two important functions.

Definition 46 The entropy (base 2) of a probability distribution {p_1,..., p_n} is h₂(p) = ∑ p_i lg(1/p_i).

In natural units we use h(p) = ∑ p_i log(1/p_i).

Definition 47 Let r = (r_1,..., r_n) and s = (s_1,..., s_n) be two probability distributions and suppose s_i > 0 ∀i. The (base 2) Kullback-Leibler divergence D₂(r‖s) "from s to r," or "of r w.r.t. s," is defined by

D₂(r‖s) = ∑_i r_i lg(r_i/s_i)

This is also known as information divergence, directed divergence or relative entropy³. In natural units the divergence is D(r‖s) = ∑_i r_i log(r_i/s_i), and we also use this notation when the base doesn't matter. D(r‖s) is not a metric (it isn't symmetric and doesn't satisfy the triangle inequality) but it is nonnegative, and zero only if the distributions are the same. Exercise:

(a) D(r‖s) ≥ 0 ∀r, s
(b) D(r‖s) = 0 ⇒ r = s
(c) D(s + ε‖s) = ∑_i ε_i²/(2s_i) + O(ε_i³)

(d) for n = 2, D((s_1 + ε, 1 − s_1 − ε)‖(s_1, 1 − s_1)) is increasing in |ε|.

The "‖" notation is strange but is the convention. From (c) and (d) we have that for n = 2, D((s_1 + ε, 1 − s_1 − ε)‖(s_1, 1 − s_1)) ∈ Ω(ε²) (with the constant depending on s_1). When s is the uniform distribution, we have:

D(r‖uniform) = ∑ r_i log(n r_i) = log n + ∑ r_i log r_i = log n − h(r)

So D(r‖uniform) can be thought of as the entropy deficit of r, compared to the uniform distribution. In the case n = 2 we will write p rather than (p, 1 − p), thus: h₂(p) = p lg(1/p) + (1 − p) lg(1/(1 − p)), D₂(p‖q) = p lg(p/q) + (1 − p) lg((1 − p)/(1 − q)).

³D is useful throughout information theory and statistics (and is closely related to "Fisher information"). See [23].


[Figure 3.2: plot of x²/2 and D((1 + x)/2 ‖ 1/2) for −1 ≤ x ≤ 1, comparing the two Chernoff bounds at q = 1/2.]

3.2 Lecture 14 (2/Nov): Stronger Chernoff bound, applications

3.2.1 Chernoff bound using divergence; robustness of BPP

Let's extend and improve the previous large deviation bound for symmetric random walk. The new bound is almost the same for relatively mild deviations (just a few standard deviations) but is much stronger at many (especially, Ω(√n)) standard deviations. It also does not depend on the coins being fair.

Theorem 48 If X_1,..., X_n are iid coins each with probability q of being heads, the probability that the number of heads, X = ∑ X_i, is > pn (for p ≥ q) or < pn (for p ≤ q), is < 2^{−nD₂(p‖q)} = exp(−nD(p‖q)).

Exercise: Derive from the above one side of Stirling's approximation for (n choose pn).
Note 1: this improves on Thm 43 even at q = 1/2 because the inequality cosh α ≤ exp(α²/2) that we used before, though convenient, was wasteful. (But the two bounds converge for p in the neighborhood of q.) Specifically we have (see Figure 3.2):

D(p‖1/2) ≥ (2p − 1)²/2   (3.2)

Note 2: The divergence is the correct constant in the above inequality; and this remains the case even when we “reasonably” extend this inequality to alphabets larger than 2—that is, dice rather than coins; see Sanov’s Theorem [23, 75]. There are of course lower-order terms that are not captured by the inequality. Note 3: Let’s see what we mean by “concentration of measure”. Clearly, the Chernoff bound is telling us that something, namely the rv X, is very tightly concentrated about a particular value. On the other hand, if you look at the full underlying rv, namely the vector (X1,..., Xn), that is not concentrated at all; if say q = 1/2, then it is actually as smoothly distributed as it could be, being uniform on the hypercube! The concentration of measure phenomenon, then, is a statement about low dimension representation of high dimensional objects. In fact the “representation” does not have to be a nice linear function like X = ∑ Xi. It is sufficient that f (X1,..., Xn) be a Lipschitz function, namely that that there be some constant bound c s.t. flipping any one of the Xi’s changes the function value by no more than c. From this simple information you can already get a large deviation bound on f for independent inputs Xi.


Proof: Consider the case p ≥ q; the other case is similar. Set Yi = Xi − q and Y = ∑ Yi. Now for α > 0,

Pr(Y > n(p − q)) = Pr(e^{αY} > e^{αn(p−q)})
< E(e^{αY}) / e^{αn(p−q)}   (Markov)
= ( ((1 − q)e^{−αq} + qe^{α(1−q)}) / e^{α(p−q)} )^n   (Independence)

Set α = log( p(1 − q) / ((1 − p)q) ). Continuing,

= ( (q/p)^p ((1 − q)/(1 − p))^{1−p} )^n = e^{−nD(p‖q)}

This is saying that the probability of a coin of bias q empirically “masquerading” as one of bias at least p > q, drops off exponentially, with the coefficient in the exponent being the divergence.

Back to BPP

Suppose we start with a randomized polynomial-time decision algorithm for a language L which for x ∈ L, reports "Yes" with probability at least p, and for x ∈/ L, reports "Yes" with probability at most q, for p = q + 1/f(n) for some f(n) ∈ n^{O(1)}. Also, D(q + ε‖q) is monotone in each of the regions ε > 0, ε < 0. So if we perform O(n f²(n)) repetitions of the original BPP algorithm, and accept x iff the fraction of "Yes" votes is above (p + q)/2, then the probability of error on any input is bounded by exp(−n).
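As a rough numeric illustration (my own sketch, not from the notes; the constants are not tuned), one can compute the repetition count that this bound suggests for a given gap and target error:

import math

def D(p, q):
    """Natural-log KL divergence between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def repetitions_needed(p, q, target_error):
    """Majority vote at threshold (p+q)/2: by the Chernoff bound each side's error is
    about exp(-N * D((p+q)/2 || q)) (and symmetrically with p), so solve for N."""
    t = (p + q) / 2
    rate = min(D(t, q), D(t, p))
    return math.ceil(math.log(1 / target_error) / rate)

# e.g. a weak BPP-style algorithm with acceptance probabilities 0.51 vs 0.49
print(repetitions_needed(0.51, 0.49, 1e-9))   # roughly 10^5 repetitions for this gap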

3.2.2 Balls and bins

Suppose you throw n balls, uniformly iid, into n bins. What is the highest bin occupancy?

Let Ai = # balls in bin i. Claim: ∀c > 1, Pr(max Ai > c log n/ log log n) ∈ o(1).

To avoid a morass of iterated logarithms, write L = log n, L2 = log log n, L3 = log log log n. So we wish to show Pr(max Ai > cL/L2) ∈ o(1). Proof: by the union bound,

Pr(max A_i > cL/L₂) ≤ n Pr(A_i > cL/L₂)
≤ n exp( −n D( cL/(nL₂) ‖ 1/n ) )
= n ( L₂/(cL) )^{cL/L₂} ( (1 − 1/n) / (1 − cL/(nL₂)) )^{(1 − cL/(nL₂)) n}
≤ n ( L₂/(cL) )^{cL/L₂} ( 1 / (1 − cL/(nL₂)) )^{(1 − cL/(nL₂)) n}


[Figure 3.3: plot of (1/(1 − p))^{1−p} vs. 1 + p for 0 ≤ p ≤ 1.]

Expand the first term and apply the inequality (1/(1 − p))^{1−p} ≤ e^p (0 ≤ p < 1) to the second term:⁴

... ≤ exp( L + (cL/L₂)(L₃ − L₂ − log c) + cL/L₂ )
= exp( (1 − c)L + cL(L₃ − log c + 1)/L₂ )
≤ exp( (1 − c)L + o(L) ) = n^{1−c+o(1)}.

Omitted: Show that a constant fraction of bins are unoccupied. A much more precise analysis of this balls-in-bins process is available, due to G. Gonnet [40].
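A small simulation (my own sketch, standard library only) to see the log n / log log n scale emerge:

import math, random
from collections import Counter

def max_load(n):
    bins = Counter(random.randrange(n) for _ in range(n))   # throw n balls into n bins
    return max(bins.values())

n, trials = 100000, 20
loads = [max_load(n) for _ in range(trials)]
print("observed max loads:", loads)
print("log n / log log n  =", math.log(n) / math.log(math.log(n)))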

3.2.3 Preview of Shannon’s coding theorem

This is an exceptionally important application of large deviation bounds. Consider one party (Alice) who can send a bit per second to another party (Bob). She wants to send him a k-bit message. How- ever, the channel between them is noisy, and each transmitted bit may be flipped, independently, with probability p < 1/2. What can Alice and Bob do? You can’t expect them to communicate reliably at 1 bit/second anymore, but can they achieve reliable communication at all? If so, how many bits/second can they achieve? This question turns out to have a beautiful answer that is the starting point of modern communication theory. Before Shannon came along, the only answer to this question was, basically, the following na¨ıve strategy: Alice repeats each bit some ` times. Bob takes the majority of his ` receptions as his best guess for the value of the bit. We’ve already learned how to evaluate the quality of this method: Bob’s error probability on each bit is bounded above by, and roughly equal to, exp(−`D(1/2kp)). In order for all bits to arrive correctly, then, Alice must use ` proportional to log k. This means the rate of the communication, the number of message bits divided by elapsed time, is tending to 0 in the length of the message (scaling as 1/ log k). And if Alice and Bob want to have exponentially small probability of error exp(−k), she would have to employ ` ∼ k, so the rate would be even worse, scaling as 1/k. Shannon showed that in actual fact one does not need to sacrifice rate for reliability. This was a great insight, and we will see next time how he did it. Roughly speaking—but not exactly—his

4 1 1−p 1 In fact we have the stronger ( 1−p ) ≤ 1 + p (see Fig. 3.2.2) although we don’t need this. Let α = log 1−p , so α ≥ 0. −α −α Then p = 1 − e−α and we are to show that 2 ≥ e−α + eαe =: f (α). f (0) = 2 and f 0 = e−α(eαe − 1 − α) so it suffices to show −α for α ≥ 0 that g(α) := eαe ≤ 1 + α. At α = 0 this is satisfied with equality so it suffices to show that 1 ≥ g0 = (1 − α)e−α g. −α Since 1 − α ≤ e−α, it suffices to show that 1 ≥ e−2α g = eα(e −2), which holds (with room to spare) because e−α ≤ 1.

53 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE argument uses a randomly chosen code. He achieves error probability exp(−Ω(k)) at a constant communication rate. What is more, the rate he achieves is arbitrarily close to the theoretical limit.

54 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.3 Lecture 15 (5/Nov): Application of large deviation bounds: Shannon’s coding theorem. Central limit theorem

3.3.1 Shannon’s block coding theorem. A probabilistic existence argument.

In order to communicate reliably, Alice and Bob are going to agree in advance on a codebook, a set of codewords that are fairly distant from each other (in Hamming distance), with the idea that when a corrupted codeword is received, it will still be closer to the correct codeword than to all others. In this discussion we completely ignore a key computational issue: how are the encoding and decoding maps computed efficiently? In fact it will be enough for us, for a positive result, to demonstrate existence of an encoding map E : {0, 1}k → {0, 1}n and a decoding map D : {0, 1}n → {0, 1}k (we’ll call this an (n, k) code) with the desired properties; we won’t even explicitly describe what the maps are, let alone specify how to efficiently compute them. We will call k/n the rate of such a code. Shannon’s achievement was to realize (and show) that you can simultaneously have positive rate and error probability tending to 0—in fact, exponentially fast.

Theorem 49 (Shannon [78]) Let p < 1/2. For any ε > 0, for all k sufficiently large, there is an (n, k) code −Ω(k) with rate ≥ D2(pk1/2) − ε and error probability e on every message. (The constant in the Ω depends on p and ε.)

In this theorem statement, “Error” means that Bob decodes to anything different from X, and error probabilities are taken only with respect to the random bit-flips introduced by the channel. Proof: Let k n = (3.3) D2(pk1/2) − ε (ignoring rounding). Let R ∈ {0, 1}n denote the error string. So, with Y denoting the received message, Y = E(X) + R with X uniform in {0, 1}k, and R consisting of iid Bernoulli rvs which are 1 with probability p. The error event is that D(E(X) + R) 6= X. As a first try, let’s design E by simply mapping each X ∈ {0, 1}k to a uniformly, independently chosen string in {0, 1}n. (This won’t be good enough for the theorem.) So (for now) when we speak of error probability, we have two sources of randomness: channel noise R, and code design E. To describe the decoding procedure we start with the notion of Hamming distance H. The Ham- ming distance H(x, y) between two same-length strings over a common alphabet Σ, is the number n of indices in which the strings disagree: H(x, y) = |{i : xi 6= yi}| for x, y, ∈ Σ . Define the decoding D to map Y to a closest codeword in Hamming distance. For most of the remainder of the proof (in particular until after the lemma), we fix a particular message X, and analyze the probability that it is decoded incorrectly.

In order to speak separately about the two sources of error, we define MX to be the rv (which is a function of E) MX = PrR(Error on X|E). So for any E, 0 ≤ MX ≤ 1. In order to analyze how well this works, we pick δ sufficiently small that p + δ < 1/2 (3.4) and D2(p + δk1/2) > D2(pk1/2) − ε/2. (3.5) Note that if both

55 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

1. H(E(X) + R, E(X)) < (p + δ)n (“channel noise is low”), and

2. ∀X0 6= X : H(E(X) + R, E(X0)) > (p + δ)n (“code design is good for X, R”)

then Bob will decode correctly. The contrapositive is that if Bob decodes X incorrectly then at least one of the following events has to have occurred:

Bad1: H(E(X) + R, E(X)) ≥ (p + δ)n 0 0 Bad2: ∃X 6= X : H(E(X) + R, E(X )) ≤ (p + δ)n

1−cn Lemma 50 ∃c > 0 s.t. EE (MX) < 2

Proof: Specifically we show this for c = min{D2(p + δkp), ε/2}. In what follows when we write a bound on PrW (...) we mean that “conditional on any setting to the other random variables, the randomness in W is enough to ensure the bound”.

EE (MX) ≤ Pr(Bad1) + Pr(Bad2) R ∑ E X06=X ≤ Pr(H(~0, R) ≥ (p + δ)n) + Pr (H(R, E(X0) − E(X)) ≤ (p + δ)n) R ∑ E( 0) E( ) X06=X X , X ≤ 2−nD2(p+δkp) + 2k−nD2(p+δk1/2) = 2−nD2(p+δkp) + 2n(D2(pk1/2)−ε−D2(p+δk1/2)) substituting value of k ≤ 2−nD2(p+δkp) + 2−εn/2 using inequality (3.5) ≤ 21−cn using value of c

2 All of the above analysis treated an arbitrary but fixed message X. We showed that, picking the code at random, the expected value of MX = PrR(Error on X|E) is small.

Let Z be the rv which is the fraction of X’s for which MX ≤ 2E(MX). By the Markov inequality, ∃E s.t. Z ≥ 1/2. Let E ∗ be a specific such code. ∗ E works well for most messages X, but this isn’t quite what we want—we want MX to be small for all messages X. There is a simple solution. Choose a code E ∗ as above for k + 1 bits, then map the k-bit messages to the good half of the messages. Note that removal of some codewords from E ∗ can only decrease any MX. (Assuming we still use closest-codeword decoding.) 2−cn So now the bound PrR(Error on X) ≤ 2E(MX) ≤ 2 applies to all X. The asymptotic rate is unaffected by this trick; the error exponent is also unaffected. To be explicit, using E ∗ designed for k + 1 bits and with n = k+1 we have for all X ∈ {0, 1}k D2(pk1/2)−ε

Pr(Error on X) ≤ 22−cn R

Thus no matter what message Alice sends, Bob’s probability of error is exponentially small. 2

56 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.3.2 Central limit theorem

As I mentioned earlier in the course, there are two basic ways in which we express concentration of measure: large deviation bounds, and the central limit theorem. Roughly speaking the former is a weaker conclusion (only upper tail bounds) from weaker assumptions (we don’t need full independence—we’ll talk about this soon). The proof of the basic CLT is not hard but relies on a little Fourier analysis and would take us too far out of our way this lecture, so I will just quote it. Let µ be a probability distribution on R, i.e., for X distributed as µ, measurable S ⊆ R, Pr(X ∈ S) = µ(S). For X1,..., Xn sampled independently 1 n from µ set X = n ∑i=1 Xi.

Theorem 51 Suppose that µ possesses both first and second moments: Z θ = E [X] = x dµ mean

h i Z σ2 = E (X − θ)2 = (x − θ)2 dµ variance

Then for all a < b, b aσ bσ 1 Z 2 lim Pr( √ < X − θ < √ ) = √ e−t /2 dt. (3.6) n n n 2π a

The form of convergence to the normal distribution in 3.6 is called convergence in distribution or convergence in law. For a proof of the CLT see [16] Sec. 27 or for a more accessible proof for the case that the Xi are bounded, see [3] Sec. 3.8.

57 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.4 Lecture 16 (7/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions

3.4.1 Gale-Berlekamp game

Let’s remember a problem we saw in the first lecture (slightly retold):

• You are given an n × n grid of lightbulbs. For each bulb, at position (i, j), there is a switch bij; there is also a switch ri on each row and a switch cj on each column. The (i, j) bulb is lit if bij + ri + cj is even. For a setting b, r, c of the switches, let F(b, r, c) be the number of lit bulbs bij+ri+cj minus the number of unlit bulbs. Then F(b, r, c) = ∑ij(−1) .

Let F(b) = maxr,c F(b, r, c). What is the greatest f (n) such that for all b, F(b) ≥ f (n)?

This is called the Gale-Berlekamp game after David Gale and Elwyn Berlekamp, who viewed it as a competitive game: the first player chooses b and then the second chooses r and c to maximize the number of lit bulbs. So f (n) is the outcome of the game for perfect players. In the 1960s, at Bell Labs, Berlekamp even built a physical 10 × 10 grid of lightbulbs with bij, ri and cj switches. People have labored to determine the exact value of f (n) for small n—see [33]. But the key issue is the asymptotics.

Theorem 52 f (n) ∈ Θ(n3/2).

Proof: First, the upper bound f (n) ∈ O(n3/2): We have to find a setting b that is favorable for the “mini- mizing f ” player, who goes first. That is, we have to find a b with small F(b). Fix any r, c. Then for b selected u.a.r.,

−n2D ( 1 + √k k 1 ) Pr(F(b, r, c) > kn3/2) ≤ 2 2 2 2 n 2 we’ll choose a value for k shortly 2 ≤ 2−k n/(2 log 2) using D(pk1/2) ≥ (2p − 1)2/2

Now take a union bound over all r, c.

2 Pr(F(b) > kn3/2) ≤ 22n−k n/(2 log 2) p p For k > 2 log 2 this is < 1. So ∃b s.t. ∀r, c, F(b, r, c) ≤ 2 log 2n3/2. Next we show the lower bound. Here we must consider any setting b and show how to choose r, c favorably. Initially, set all ri = 0 and pick cj u.a.r. Then for any fixed i, the row sum

bij+cj ∑(−1) =: Xi j is binomially distributed, being an unbiased random walk of length n. Now, unlike the Chernoff bound, we’d like to see not an upper but a lower tail bound on random walk. Let’s derive this from the CLT: √ Corollary 53 For X the sum of m uniform iid ±1 rvs, E(|X|) = (1 + o(1)) 2m/π.

58 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

(Proof sketch: for X distributed as the unit-variance Gaussian N (0, 1), this value is exact; see [89]. The CLT shows this is a good enough approximation to our rv.) Comment: Instead of using Corollary 53, we could alternatively have used the following result:

Theorem 54 (Khintchine-Kahane) Let a = (a1,..., an), ai ∈ R. Let si ∈U ±1 and set S = |∑ siai|. Then √1 kak ≤ E(S) ≤ kak . 2 2 2

The original result of this form is [56]; the above constant and generality are found in [60]; for an elegant one-page proof see [32]. Not coincidentally, both this result and the CLT are proven through Fourier analysis. Comment: Since we haven’t provided proofs of either of these, and we are about to use them, let me mention that later in the course (Sec. 4.3.1) we’ll come back and finish the proof (with a weaker constant) through a more elementary argument, and with the added benefit that we will be able to give the player a deterministic poly-time strategy for choosing the row and column bits. (Here we gave the player only a randomized poly-time strategy.) In any case we now continue, using√ the conclusion (with the largest constant, coming from the CLT): for every i, E(|Xi|) = (1 + o(1)) 2n/π. ri Now for each√ row, flip ri if the row sum is negative. So E(∑i(−1) Xi) = E(∑i |Xi|) = ∑i E(|Xi|) = (1 + o(1)) 2/πn3/2. √ 3/2 This shows (assuming the CLT) that√ for any b, Ec maxr F(b, r, c) is (1 + o(1)) 2/πn . Conse- quently, for all b, F(b) ≥ (1 + o(1)) 2/πn3/2, which proves the theorem. 2 Comment: It was convenient in this problem that the system of switches at your disposal was “bipartite”, that is, there are no interactions amongst the effects of the row switches, and likewise amongst the effects of the column switches. However, even when such effects are present it is possible to attain similar theorems. See [53].

3.4.2 Moment generating functions, Chernoff bound for general distributions

Now for a version of the Chernoff bound which we can apply to sums of independent real rvs with very general probability distributions. After presenting the bound we’ll see an application of it, with broad computational applications, in the theory of metric spaces. Let X be a real-valued random variable with distribution µ: for measurable S ⊆ R, Pr(X ∈ S) = µ(S).

Definition 55 The moment generating function (mgf) of X (or, more precisely, of µ) is defined for β ∈ R by

h βXi gµ(β) = E e provided this converges in an open neighborhood of 0 ∞ βk = ∑ E(Xk) 0 k!

Incidentally note that (a) if instead of taking β to be real we take it to be imaginary, this gives the Fourier transform, (b) both are “slices” of the Laplace transform. For any probability measure µ, gµ(0) = E [1] = 1. (3.7)

59 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

We are interested in large deviation bounds for random walk with steps from µ. That is, if we 1 n sample X1,..., Xn iid from µ and take X = n ∑i=1 Xi, we want to know if the distribution of X is concentrated around E [X]. It will be convenient to re-center µ, if necessary, so that E [X] = 0; clearly this just subtracts a known constant off each step of the rw, so it does not affect any probabilistic calculations. So without loss of generality we now take E [X] = 0. Perhaps not surprisingly, the quality of the large deviation bound that is possible, depends on how heavy the tails of µ are. What is interesting is that this is nicely measured by the smoothness of gµ at the origin. Specifically, a moment-generating function that is differentiable at the origin guarantees exponential tails. One way to think about this intuitively is to examine the Fourier transform (the imaginary axis), rather than the characteristic function, near the origin. If µ has light tails—as an extreme case suppose µ has bounded support—then near the origin, the Fourier coefficients are picking up only very long-wavelength information, and seeing almost no “cancellations”—negative contributions can come only from very far away and therefore be very small. So the Fourier coefficients near 0 are vanishingly different from the Fourier coefficient at 0, and so gµ is differentiable at 0. This goes both ways—if µ has heavy tails, then even at very long wavelengths, the Fourier integral picks up substantial cancellation, and so the Fourier coefficients change a lot moving away from 0.

Theorem 56 (Chernoff) If the mgf gµ(β) is differentiable at 0, then ∀ε 6= 0 ∃cε < 1 such that

n Pr(X/ε > 1) < cε .

Specifically −βε cε ≤ inf e gµ(β) < 1. (3.8) β

Proof: Let N be a neighborhood of 0 in which the mgf converges. Start with the case ε > 0.

Pr(X > ε) = Pr(eβ ∑i Xi > eβnε) for any β > 0 (3.9) h i < e−βnεE eβ ∑i Xi Markov bound, for β ∈ N  h in −βnε βX1 = e E e Xi are independent n  −βε  = e gµ(β) (3.10)

−βε 0 We now need to show that there is a β > 0 such that e gµ(β) < 1. At β = 0, e gµ(0) = 1, so let’s −βε find the derivative of e gµ(β) at 0. Since gµ is differentiable at 0 we have:

 βX ∂gµ(β) ∂E e = ∂β ∂β 0 0  βX  ∂e = E ∂β 0 h i = E XeβX 0 = E [X] = 0 (3.11)

So, because we have shifted the mean to 0, the moment-generating function is flat at 0.

60 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

Now we can differentiate the whole function:

−βε ∂e gµ(β) = e−ε·0g0 (0) − εe−ε·0g (0) product rule ∂β µ µ 0 −ε·0 0 −ε·0 = e g (0) −ε e gµ(0) at β = 0 |{z} µ |{z} 1 | {z } 1 | {z } 0 1 = −ε (3.12)

−βε We have determined that ∃β > 0 such that e gµ(β) < 1, and thus there is a cε < 1 as stated in the theorem. The case ε < 0 is similar. All that changes is that for line 3.9 we substitute

Pr(X < ε) = Pr(eβ ∑i Xi > eβnε) for any β < 0 (3.13)

The rest of the derivation is identical up to and including line 3.12, which in this case shows that −βε ∃β < 0 such that e gµ(β) < 1, and thus there is a cε < 1 as stated in the theorem. 2

This method also allows us, in some cases, to find the value of cε which gives the tightest Chernoff bound. (For general µ and ε this can be a complicated task and we may have to settle for bounds on the best cε.)

Exercise: What is the mgf of the uniform distribution on ±1? What is the best cε?

61 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.5 Lecture 17 (9/Nov): Johnson-Lindenstrauss embedding `2 → `2

By a small sample we may judge of the whole piece. Cervantes, Don Quixote de la Mancha §I-1-4

Today we’ll see a geometric application of the Chernoff bound. At first glance the question we solve, which originates in analysis, appears to have nothing to do with probability. But actually it illustrates a shared geometric core between analysis and probability.

Definition 57 A metric space (M, dM) is a set M and a function dM : M × M → (R ∪ ∞) that is symmetric; 0 on the diagonal; and obeys the triangle inequality, dM(x, y) ≤ dM(x, z) + dM(z, y).

Examples:

n p n 2 1. A Euclidean space is a vector space R equipped with the metric d(x, y) = ∑1 (xi − yi) .

2. The same vector space can be equipped with a different metric, for instance the `∞ metric, maxi |xi − yi|, or the `1 metric, ∑i |xi − yi|. Actually in real vector spaces the metrics we use, like these, are usually derived from norms (see Sec. 3.5.1).

3. Sometimes we get important metrics as restrictions of another metric. For instance let ∆n n denote the probability simplex, ∆n = {x ∈ R : ∑i xi = 1, xi ≥ 0}. In this space (half of) the `1 distance is referred to as “total variation distance”, dTV. It has another characterization, dTV(p, q) = maxA⊆[n] p(A) − q(A). Exercise: Usually a metric arises through a “min” definition (shortest path from one place to another), and in Example 5 we will see that dTV does have that kind of definition. Why does it coincide with a “max” definition? 4. Many metric spaces have nothing to do with vector spaces. An important class of metrics are the shortest path metrics, derived from undirected graphs: If G = (V, E) is a graph and x, y ∈ V, let d(x, y) denote the length of (number of edges on) a shortest path between them. 5. If you start with a metric d on a measurable space M you can “lift” it to the transportation metric dtrans. This is much bigger: the “points” of this new metric space are probability distributions on M, and the transportation distance is how far you have to shift probability mass in order to transform one distribution to the other. Here is the formal definition for the case of a finite space M. Let µ, ν be the two distributions. π will range over probability distributions on the direct product space M2.

dtrans(µ, ν) = min{∑ d(x, y)π(x, y)|∀x : ∑ π(x, y) = µ(x), ∀y : ∑ π(x, y) = ν(y), ∀x, y : π(x, y) ≥ 0} x,y y x

Sometimes this is called “earthmover distance” (imagine bulldozers pushing the probability mass around).

For example, if M is the graph metric on a clique of size k (as in Example 4) then dtrans = dTV = variation distance among probability distributions on the vertices (i.e., the metric space of Example 3).

0 Definition 58 An embedding f : M → M is a mapping of a metric space (M, dM) into another metric 0 dM0 ( f (a), f (b)) dM(c,d) space (M , dM0 ). The distortion of the embedding is supa b c d∈M · . The mapping , , , dM(a,b) dM0 ( f (c), f (d)) is called isometric if it has distortion 1.

62 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

A finite metric space is one in which the underlying set is finite. A finite `2 space is one that can be embedded isometrically into a Euclidean space of any dimension. Exercise: The dimension need not be greater than n − 1. (n points span only at most an (n − 1)- dimensional affine subspace.) Exercise: Generically, the dimension must be n − 1. (Show the distances between points in Euclidean space determine their coordinates up to a rotation, reflection and translation. Then consider the volume of the simplex.)

What we’ll see today is a method of embedding an n-point `2 metric into a very low-dimensional Euclidean space with only slight distortion. This is useful in the theory of computation because many algorithms for geometric problems have complexity that scales exponentially in the dimension of the input space. We’ll have to skip giving example applications, but there are quite a few by now, and because of these, a variety of improvements and extensions of the embedding method have also been developed. Our goal is to prove the following claim:

Theorem 59 (Johnson and Lindenstrauss [51]) Given a set A of n points in a Euclidean space, there k −2 ε exists a map f : A → (R , `2) with k = O(ε log n) that is of distortion e on the metric restricted to A. Moreover, the map f can be taken to be linear and can be found with a simple randomized algorithm in expected time polynomial in n.

Although the points of A generically span an (n − 1)-dimensional affine space, and the map is linear, nonetheless observe that we are not embedding all of Rn−1 with low distortion—that is impossible, as the map is many-one—we care only about the distances among our n input points.

3.5.1 Normed spaces

A real normed space is a vector space V equipped with a nonnegative real-valued “norm” k · k satisfying kcvk = ckvk for c ≥ 0, kvk 6= 0 for v 6= 0, and kv + wk ≤ kvk + kwk. Norms automatically define metrics, as in examples 1, 2, by taking the distance between v and w to be kv − wk.

Let S = (S, µ) be any measure space. For p ≥ 1, the Lp normed space w.r.t. the measure µ, Lp(S), is defined to be the vector space of functions

f : S → R of finite “Lp norm,” defined by

Z 1/p p k f kp = k f (x)k dµ(x) S

Exercise: k f + gkp ≤ k f kp + kgkp

So (like any normed space), Lp(S) is also automatically a metric space.

This framework allows us to discuss the collection of all L2 (Euclidean) spaces, all L1 spaces, etc. The most commonly encountered cases are indeed L1, L2 and L∞, which is defined to be the sup norm (so µ doesn’t matter). Today we discuss embeddings L2 → L2. Time permitting we may also discuss embeddings of general metrics into L1. k We will use the shorthand Lp to refer to an Lp space on a set S of cardinality k, with the counting measure.

63 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.5.2 JL: the original method

Returning to the statement of the Johnson-Lindenstrauss Theorem (59), how do we find such a map f ? Here is the original construction: pick an orthogonal projection, W˜ , onto Rk uniformly at random, and let f (x) = Wx˜ for x ∈ A. For k as specified, this is satisfactory with high (constant) probability (which depends on the con- stant in k = O(ε−2 log n)). An equivalent description of picking a projection W˜ at random is as follows: choose U uniformly (i.e., using the Haar measure) from On (the orthogonal group). Let Q˜ be the n × n matrix which is the projection map onto the first k basis vectors:

 1 0 0 ··· 0 0   0 1 0 ··· 0 0     .  ˜  ..  Q =   .  0 0 ··· 1 0 0     0 0 0 0 0 0  0 0 0 0 0 0

Then set W = U−1QU˜ . I.e., a point x ∈ A is mapped to U−1QUx˜ . Let’s start simplifying this. The final multiplication by U−1 doesn’t change the length of any vector so it is equivalent to use the mapping x → QUx˜ and ask what this does to the lengths of vectors between points of A. Having simplified the mapping in this way, we can now discard the all-0 rows of Q˜ , and use just Q:

 1 0 0 ··· 0 0   0 1 0 ··· 0 0    Q =  .  .  ..  0 0 ··· 1 0 0 So JL’s final mapping is f (x) = QUx.

In order to analyze this map, we will consider a vector v, the difference between two points in A, i.e. v = x − y for some x, y ∈ A. Since the question of distortion of the length of v is scale invariant, we can simplify by supposing that kvk = 1. Moreover, the process described above has the same distribution for all rotations of v. That is to say, for any v, w ∈ Rn and any orthogonal matrix A,

Pr(QUv = w) = Pr(QU(Av) = w). (prob. densities) U U

So we may as well consider that v is the vector v = (1, 0, 0, . . . , 0)∗. (Where ∗ denotes transpose.)

In that case, kQUvk equals k(QU)∗1k where (QU)∗1 is the first column of QU. But (QU)∗1 = ∗ (U1,1, U2,1,..., Uk,1) , i.e., the top few entries of the first column of U. Since U is a random orthogonal matrix, the distribution of its first column (or indeed of any other single column) is simply that of a random unit vector in Rn.

64 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

So the whole question boils down to showing concentration for the length of the projection of a random unit vector onto the subspace spanned by the first k standard basis vectors. This distribution is somewhat deceptive in low dimensions. For n = 2, k = 1 the density looks like Figure (3.4).

2.5

2

1.5

1

0.5

0 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

Figure 3.4: Density of projection of a unit vector in 2D onto a random unit vector

However, in higher dimensions, this density looks more like Figure (3.5). The phenomenon we are encountering is truly a feature of high dimension.

−1 0 1

Figure 3.5: Density of projection of a unit vector in high dimension onto a random unit vector

Remarks:

1. In the one-dimensional projection density (Fig. 3.5) some constant fraction of the probability h i √−1 √1 is contained in the interval n , n . 2. The squares of the projection-lengths onto each of the k dimensions are “nearly independent” random variables, so long as k is small relative to n.

Johnson and Lindenstrauss pushed this argument through but there is an easier way to get there, by just slightly changing the construction.

65 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.5.3 JL: a similar, and easier to analyze, method

Pick k vectors w1, w2,..., wk independently from the spherically symmetric Gaussian density with standard deviation 1, i.e., from the probability density

! 1 1 n η(x x ) = − x 2 1,..., n n/2 exp ∑ i (2π) 2 i=1

Note 1: the projection of this density on any line through the origin is the 1D Gaussian with standard deviation 1, i.e., the density 1  x2  √ exp − 2π 2 (Follows immediately from the formula, by factoring out the one dimension against an entire “con- ditioned” Gaussian on the remaining n − 1 dimensions.) Note 2: The distribution is invariant under the orthogonal group. (Follows immediately from the formula.)

Note 3: The coordinates x1, x2 etc. are independent rvs. (Follows immediately from the formula.) Set   ...... w1 ......  ...... w2 ......    W =  .   .  ...... wk ...... n (The rows of W are the vectors wi.) Then, for v ∈ R set f (v) = Wv. By Notes 1 & 3, each entry of W is an i.i.d. random variable with density √1 exp(−x2/2). 2π Informally, this process is very similar to that of JL, although it is certainly not identical. Individual entries of W can (rarely) be very large, and rows are not going to be exactly orthogonal, although they will usually be quite close to orthogonal. Because of Note 2, analysis of this method boils down, just as for the original JL construction, to showing a concentration result for the length of the first column of W, which we denote w1. 1 2 k 2 Because of Note 3, the expression kw k = ∑1 wi1 gives the LHS as the sum of independent, and by Note 1 iid, rvs. This will enable us to show concentration through a Chernoff bound.

66 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.6 Lecture 18 (12/Nov): cont. JL embedding; Bourgain embedding

3.6.1 cont. JL

Recall that our projection of (any particular) unit vector in the original space, is a vector whose 2 2 coordinates w11,..., wk1 are iid normally distributed with E(wi1) = 1. So E(∑ wi1) = k. We want a 2 deviation bound on ∑ wi1. 2 2 2 There is a name for these rvs: each wi1is a “χ ” rv with parameter 1, and their sum is a χ rv with parameter k.

1.2

k=1 1.0 k=2 0.8 k=3 k=4 0.6 k=10

0.4

0.2

0.0 0 1 2 3 4

1 2 Figure 3.6: Probability density of k ∑ wi1

2 Set random variables yi = wi1 − 1 so that E(yi) = 0. With this change of variables we now want a 1 k bound on the deviation from 0 of the rv y = k ∑i=1 yi. To get a Chernoff bound, we need the mgf, g(β), for yi, in order to use Eqn. 3.8 to write:

P(y/ε > 1) < [ inf e−εβg(β)]k for ε 6= 0. (3.14) β>0

So what is g(β)?

2 g(β) = E(eβy) = E(eβ(w −1)) Z ∞ 1 2 = e−β √ ew (β−1/2)dw −∞ 2π −β Z ∞ p e 1 − 2β − 1 w2(1−2β) = p √ e 2 dw 1 − 2β −∞ 2π e−β = p 1 − 2β The last equality follows as the integrand is the density of a normal random variable with standard deviation √ 1 . 1−2β 1 Thus, g(β) is well defined and differentiable in (−∞, 2 ), with (necessarily) g(0) = 1 (which recall from (3.7) holds for the mgf of any probability measure), and g0(0) = 0 (because g0(0) = the

67 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

first moment of the probability measure, that’s why it’s called the moment generating function, recall (3.11); and we have centered the distribution at 0). For a given ε what β should be used in the Chernoff bound (Eqn. 3.14)? After some calculus, we = ε find that β 2(1+ε) is the best value (for both signs of ε). Figure (3.7) shows the dependence of β on ε.

β(ϵ)

ϵ -1.0 -0.5 0.5 1.0

-0.5

-1.0

-1.5

-2.0

-2.5

2 = ε Figure 3.7: Best choice of β as a function of ε for the χ distribution: β 2(1+ε)

Plugging this value of β above into the bound, we get

1 − ε k P(y/ε > 1) < ((1 + ε) 2 e 2 ) (3.15)

1 ε 1 2 3 k 2 − 2 which we incidentally note is (1 − 2 ε + O(ε )) . The function (1 + ε) e is shown in Fig. 3.8.

1.0

0.8

0.6

0.4

0.2

ϵ 2 4 6 8

2 1 − ε Figure 3.8: Base of the Chernoff bound for the χ distribution: cε = (1 + ε) 2 e 2

Now let’s apply this bound to the modified JL construction. We will ensure distortion (Defn. 58) δ n e (with positive probability) by showing that for each of our (2) vectors v, with probability > n 1 − 1/(2),

1 kvke−δ/2 ≤ √ kWvk ≤ kvkeδ/2. k

68 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

We already argued, by the invariance of our construction under the orthogonal group, that for any v this has the same probability as the event r 1 e−δ/2 ≤ w2 ≤ eδ/2 k ∑ i1 1 e−δ ≤ w2 ≤ eδ k ∑ i1 or equivalently e−δ − 1 ≤ y¯ ≤ eδ − 1. (3.16)

Applying the Chernoff bound (3.15) first on the right of (3.16), we have

δ δ 2 Pr(y¯ > eδ − 1) < ek(δ/2−(e −1)/2) = e(k/2)(1+δ−e ) < e−kδ /4

Next applying the Chernoff bound (3.15) on the left of (3.16), we have

−δ −δ 2 3 Pr(y¯ < e−δ − 1) < ek(−δ/2−(e −1)/2) = e(k/2)(1−δ−e ) < e−k(δ /4+O(δ ))

−2 1 2 −δ δ 2 In all, taking k = 8(1 + O(δ))δ log n suffices so that Pr( k ∑ wi1 ∈/ [e , e ]) < 1/n and therefore so the mapping with probability at least 1/2 has distortion bounded by eδ. Finally, for the computational aspect: to get a randomized “Las Vegas” algorithm simply try matri- ces W at random and examine each to test whether the distortion is satisfactory.

Note: About another embedding question: Finite l2 metric spaces can be embedded in l1 isomet- rically. There’s also an algorithm—deterministic, in fact—to find such an embedding, but it takes exponential time in the number of points in the space. Comment: There are deterministic poly-time algorithms producing an embedding up to the stan- dards of the Johnson-Lindenstrauss theorem, see Engebretsen, Indyk and O’Donnell [27], Sivaku- mar [82].

3.6.2 Bourgain embedding X → Lp, p ≥ 1

In the previous result, we saw how an already “rigid” metric, namely an L2 metric, could be embedded in reduced dimension. Now we will see how a relatively “loose” object, just a metric space, can be embedded in a more rigid object, namely a vector space with an Lp norm. There will be a price in distortion to pay for this.

O(log2 n) Theorem 60 (Bourgain [18]) Any metric (X, d) with n = |X| can be embedded in Lp , p ≥ 1, with distortion O(log n). There is a randomized poly-time algorithm to find such an embedding.

Some comments are in order. Dimension: The dimension bound here is actually due not to Bourgain but to Linial, London and Rabinovich [63]. Also, Bourgain showed embedding into L2; after we prove the L1 result we’ll show how it also implies all p ≥ 1. A later variation of the Bourgain proof that achieves dimension O(log n) is due to Abraham, Bartal and Neiman [1]. Derandomization: It will follow from ideas we see soon, that there is a deterministic algorithm to construct a Bourgain embedding into dimension poly(n). This will be on a problem set. It is also possible, by the method of conditional probabilities, to reduce the dimension to O(log2 n); we probably won’t have time to discuss this.

69 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

Distortion: The distortion in the theorem is best possible: expander graphs require it. However, there are open questions for restricted classes of metrics: for example whether the distortion can be improved, possibly to a constant, for shortest path metrics in planar graphs. See [50] for a survey on metric embeddings from 2004. Method: Weighted Frechet´ embeddings.

3.6.3 Embedding into L1

Proof: Since the domain of our mapping is merely a metric space rather than a normed space, we cannot apply anything like the JL technique, and something quite different is called for. Bourgain’s proof employs a type of embedding introduced much earlier by Frechet´ [35]. The one-dimensional Frechet´ embedding imposed by a set T ⊂ X is the mapping

τ : X → R+

τ(x) = d(x, T) := min d(x, t) t∈T

Observe that by the triangle inequality for d, |τ(x) − τ(y)| ≤ d(x, y). So τ is a contraction.

We can also combine several such Ti’s in separate coordinates. If we label the respective mappings τi and give each a nonnegative weight wi, with the weights summing to 1—that is to say, the weights form a probability distribution:

τ(x) = (w1τ1(x),..., wkτk(x))

k then we can consider the composite mapping τ as an embedding into L1 and it too is contractive, namely, kτ(x) − τ(y)k1 ≤ d(x, y).

So the key part of the proof is the lower bound. 0 Let s = dlg ne. For 1 ≤ t ≤ s and 1 ≤ j ≤ s ∈ Θ(s), choose set Ttj by selecting each point of X independently with probability 2−t. Let all the weights be uniform, i.e., 1/ss0. This defines an 0 embedding τ = (..., τtj,...)/ss of the desired dimension. We need to show that with positive probability

∀x, y ∈ X : kτ(x) − τ(y)k1 ≥ Ω(d(x, y)/s).

Just as in JL, the proof proceeds by considering just a single pair x 6= y and showing that with prob- ability greater than 1 − 1/n2 (enabling a union bound) it is embedded with the desired distortion (in this case O(log n) = O(s)).

70 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.7 Lecture 19 (14/Nov): cont. Bourgain embedding

3.7.1 cont. Bourgain embedding: L1

We use this notation for open balls:

Br(x) = {z : d(x, z) < r}

and for closed balls, B¯r(x) = {z : d(x, z) ≤ r}. Recall that we are now analyzing the embedding for any single pair of points x, y.

Let ρ0 = 0 and, for t > 0 define

t t ρt = sup{r : |Br(x)| < 2 or |Br(y)| < 2 } (3.17) up to tˆ = max{t : RHS < d(x, y)/2}. It is possible to have tˆ = 0 (for instance if no other points are near x and y). ¯ ˆ ¯ t ¯ t Observe that for the closed balls B we have that for all t ≤ t, |Bρt (x)| ≥ 2 and |Bρt (y)| ≥ 2 . This means in particular that (due to the radius cap at d(x, y)/2, which means that y is excluded from these balls around x and vice versa), tˆ < s. ˆ t t Set ρtˆ+1 = d(x, y)/2, which means that it still holds for t = t + 1 that |Bρt (x)| < 2 or |Bρt (y)| < 2 , ˆ although (in contrast to t ≤ t), ρtˆ+1 is not the largest radius for which this holds. ˆ Note t + 1 ≥ 1. Also, ρtˆ+1 > ρtˆ (because the latter was defined to be less than d(x, y)/2). But for t ≤ tˆ it is possible to have ρt = ρt−1. tˆ + 1 will be the number of scales used in the analysis of the lower bound for the pair x, y. I.e., we use the sets Ttj for 0 ≤ t ≤ tˆ + 1. Any contribution from higher-t (smaller expected cardinality) sets is “bonus.” Consider any 1 ≤ t ≤ tˆ + 1. √ Lemma 61 With positive probability (specifically at least (1 − 1/ e)/4), |τt1(x) − τt1(y)| > ρt − ρt−1.

t ¯ t−1 Proof: Suppose wlog that |Bρt (x)| < 2 . By Eqn (3.17) (with t − 1), |Bρt−1 (y)| ≥ 2 (and the same for x but we don’t need that). If

Tt1 ∩ Bρt (x) = ∅ (3.18) and ¯ Tt1 ∩ Bρt−1 (y) 6= ∅ (3.19) then kτt1(x) − τt1(y)k > ρt − ρt−1. We wish to show that this conjunction happens with constant probability. (See Fig. 3.9.)

The two events (3.18), (3.19) are independent because Tt1 is generated by independent sampling, ¯ and because, due to the radius cap at d(x, y)/2 (and because ρtˆ < ρtˆ+1), Bρt (x) ∩ Bρt−1 (y) = ∅. First, the x-ball event (3.18):

−t |Bρt (x)| Pr(Tt1 ∩ Bρt (x) = ∅) = (1 − 2 ) t ≥ (1 − 2−t)2 t = (1 − 2−t)2 ≥ 1/4 for t ≥ 1

71 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

Figure 3.9: Balls Bρt−1 (x), Bρt (x), Bρt−1 (y) depicted. Events 3.18 and 3.19 have occurred, because no point has been selected for Tt1 in the larger-radius (ρt) region around x, while some point (marked in red) has been selected for Tt1 in the smaller-radius (ρt−1) region around y.

(For large t this is actually about 1/e.) Second, the y-ball event (3.19):

|B (y)| ¯ −t ρt−1 Pr(Tt1 ∩ Bρt−1 (y) 6= ∅) = 1 − (1 − 2 ) t−1 ≥ 1 − (1 − 2−t)2 and recalling 1 + x ≤ ex for all real x, ≥ 1 − e−1/2 √ Consequently, |τ (x) − τ (y)| > ρt − ρ − with probability at least (1 − 1/ e)/4. 2 t1 t1 t 1 √ Now, let Gx,y,t be the “good” event that at least (1 − 1/ e)/8 of the coordinates at level t, namely s0 {τtj}j=1, have |τtj(x) − τtj(y)| > ρt − ρt−1. If the good event occurs for all t, then for all x, y, √ 1 (1 − 1/ e) d(x, y) kτ(x) − τ(y)k ≥ . 1 s 8 2 Here the first factor is from the normalization, the second from the definition of good events, and the third from the cap on the ρt’s.

We can upper bound the probability that a good event Gx,y,t fails to happen using Chernoff:

−Ω(s0) Pr(¬Gx,y,t) ≤ e . Now taking a union bound over all x, y, t,

−Ω(s0) 2 Pr(∪x,y,t¬Gt) ≤ e n lg n < 1/2 for a suitable s0 ∈ Θ(log n). To be specific we can use the following version of the Chernoff bound (see problem set 4):

Lemma 62 Let F1,..., Fs0 be independent Bernoulli rvs, each with expectation ≥ µ. Pr(∑ Fi < (1 − 2 0 ε)µs0) ≤ e−ε µs /2. √ which permits us (plugging in ε = 1/2) to take s0 = √32 e log(n2 lg n). 2 e−1 n Exercise: Form a Frechet´ embedding X → R by using as Ti’s all singleton sets. Argue that this is n an isometry of X into (R , L∞). Consequently L∞ is universal for finite metrics. (This, I believe, was Frechet’s´ original result [35].)

72 Schulman: CS150 2018 CHAPTER 3. CONCENTRATION OF MEASURE

3.7.2 Embedding into any Lp, p ≥ 1

As a matter of fact the above embedding method has distortion just as good into Lp, for any p ≥ 1. Start by expanding:

!1/p 1 p kτ(x) − τ(y)kp = 0 ∑(τij(x) − τij(y)) (3.20) ss ij

We begin with the upper bound, which is unexciting:

!1/p 1 p (3.20) ≤ 0 ∑(d(x, y)) ss ij = d(x, y).

For the lower bound, we use the power-means inequality. Note that (3.20) is a p’th mean of the quantities (τij(x) − τij(y)), ranging over i, j. So from Lemma 15,

1 (3.20) ≥ 0 ∑ τij(x) − τij(y) = kτ(x) − τ(y)k1 ss ij

so for any τ and any p > 1, the Lp distortion of τ is no more than its L1 distortion. This demonstrates O(log2 n) O(log2 n) the generalization of Theorem (60) with Lp (p ≥ 1) replacing L1 .

3.7.3 Aside: H¨older’s inequality

Although we already proved the power means inequality directly in Lemma 15, it is worth seeing how it fits into a framework of inequalities. The power means inequality is a comparison between two integrals over a measure space that is also a probability space (i.e., the total measure of the space is 1). Power means follows immediately from an important inequality that holds for any measure space (and indeed generalizes the Cauchy-Schwarz inequality), Holder’s¨ inequality:

Lemma 63 (H¨older) For norms with respect to any fixed measure space, and for 1/p + 1/q = 1 (p and q are “conjugate exponents”), k f kp · kgkq ≥ k f gk1.

To see the power means inequality, note that over a probability space, k f kp is simply a p’th mean. Now plug in the function g = 1 and Holder¨ gives you power means.

73 Chapter 4

Limited independence

4.1 Lecture 20 (16/Nov): Pairwise independence, Shannon coding theorem again, second moment inequality

4.1.1 Improved proof of Shannon’s coding theorem using linear codes

Very commonly, in Algorithms, we have a tradeoff between how much randomness we use, and efficiency. But sometimes we can actually improve our efficiency by carefully eliminating some of the ran- domness we’re using. Roughly, the intuition is that some of the randomness is going not toward circumventing a barrier (especially, leaving the adversary in the dark about what we are going to do), but just into noise.1 A case in point is the proof of Shannon’s Coding Theorem. In a previous lecture we proved the theorem as follows: we first built an encoding map E : {0, 1}k → {0, 1}n by sampling a uniformly random function; then, we had to delete up to half the codewords to eliminate all kinds of fluctua- tions in which codewords fell too close to one another. It turns out that this messy solution can be avoided. The key observation is that our analysis depended only on pairwise data about the code—basically, pairwise distances between codewords. “Higher level” structure (mutual distances among triples, etc.) didn’t feature in the analysis. So the argument will still go through with a pairwise-independently constructed code. So we’ll do this now, and in the process we’ll see how this helps. Sample E from the following pairwise independent family of functions {0, 1}k → {0, 1}n. Select k n k vectors v1,..., vk iid ∈U {0, 1} . Now map the vector (x1,..., xk) to ∑1 xivi. This is, of course, a linear map, consisting of multiplication by the generator matrix G whose rows are the vi:   − − − v1 − − −  − − − v − − −  (message x)  2  = (codeword)  − − − ... − − −  − − − vk − − −

The message 0¯ ∈ {0, 1}k is always mapped to the codeword 0¯ ∈ {0, 1}n, and every other codeword is uniformly distributed in {0, 1}n. It is not hard to see that the images of messages are pairwise independent. (Including even the image of the 0¯ message.)

1People who pack a tent, wind up spending the night on the mountain – a climbing instructor of mine

74 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Let’s see why: say the two messages are x 6= x0. W.l.o.g. x0 6= 0. Now, we want to show that ∀y, y0, Pr(x0G = y0|xG = y) = 2−n. Since x0 6= x (and we are over GF2), this means that x0 ∈/ span(x). 0 0 Consequently, there exists a G s.t. xG = y, x G = y . But since there exists such a G, call it G0 the number of G’s satisfying this pair of equations does not depend upon y0; the set of such G’s is simply 0 0 0 0 0 equal to all G0 + G where x, x ∈ ker G . The number of such G depends only upon dim span(x, x ). (If you want a more concrete argument, you can change basis to where x, if nonzero, is a singleton vector, and x0 − x is another singleton vector. Then the row of G corresponding to x is y, the row corresponding to x0 − x is y0 − y, and the rest of the matrix can be anything.) Now let’s remember some of the settings we used for this theorem in Section (3.3.1): (1) The code rate is (3.3) n = k ; D2(pk1/2)−ε (2) First upper bound on δ is (3.4): p + δ < 1/2;

(3) Second upper bound on δ is (3.5): D2(p + δk1/2) > D2(pk1/2) − ε/2;

And finally we make δ as large as we can subject to these constraints, and set c = min{D2(p + δkp), ε/2} > 0. Looking back at the analysis of the error probability on message X in Section (3.3.1), it had two parts, in each of which we bounded the probability of one of the following two sources of error:

Bad1: H(E(X) + R, E(X)) ≥ (p + δ)n. That is to say, the error vector R has weight (number of 1’s) at least (p + δ)n. This analysis is of course unchanged, and doesn’t depend at all on choice of the code. As before, the bound is

~ −D2(p+δkp)n −cn Pr(Bad1) = Pr(H(0, R) ≥ (p + δ)n) ≤ 2 ≤ 2 . R R

0 0 Bad2: ∃X 6= X : H(E(X) + R, E(X )) ≤ (p + δ)n. For this, pairwise independence is enough to obtain an analysis similar to before. Specifically, for any pair X 6= X0 and any R, the rv (which now depends only on the choice of code) R + E(X) − E(X0) is uniformly distributed in {0, 1}n (because X − X0 is not the zero string, so E(X − X0) is uniform) so, the choice of R does not affect PrR(Bad2), and we can bound it as

k−nD2(p+δk1/2) Pr(Bad2) ≤ 2 R = 2n(D2(pk1/2)−ε−D2(p+δk1/2)) from (3.3) ≤ 2−nε/2 from (3.5) ≤ 2−cn

1−cn So, we get the same as before: PrE,R(Error on X) ≤ 2 for the same c > 0 that depends only on p, ε. That is, for every X, with MX = PrR(Error on X|E), we have

1−cn EE (MX) ≤ 2 (4.1)

Next, just as before, we wish to remove E from the randomization in the analysis. In order to do this it helps to consider the uniform distribution over messages X and derive from Eqn. 4.1 the weaker 1−cn EX,E (MX) ≤ 2 (4.2) The reason is that this weaker guarantee is maintained even if we now modify the decoding algo- rithm so that it commutes with translation by codewords. Specifically, no matter what the decoder did before, set it now so that D(Y) is uniformly sampled among “max-likelihood” decodings of Y,

75 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

which is equivalent (thanks to the uniformity over X and to the noise R being independent of X) to those X which minimize H(E(X), Y). For the uniform distribution, max-likelihood decoding min- imizes the average probability of error, so this new decoder D also satisfies 4.2. The new decoder has the commutation advantage that we promised: for any E,

 commutes with D(E(X) + R) = D(E(X)) + D(R) translation by code (4.3)  decoding correct = X + D(R) on codewords

As a consequence,

For all E, X1, X2: Pr(Error on X1|E) = Pr(Error on X2|E). R R

So we can define a variable M which is a function of E, M = Pr(Error on 0¯|E) = Pr(Error on X|E) for all X R R and we have 1−cn EE (M) ≤ 2 2−cn Since M ≥ 0, PrE (M > 2 ) < 1/2 and so if we just pick linear E at random, there is probability ≥ 1/2 that (using the already-described decoder D for it), for all X the decoding-error probability is ≤ 22−cn. What is much more elegant about this construction than about the preceding fully-random-E is that no X’s with high error probabilities need to be thrown away. The set of codewords is always just a linear subspace of {0, 1}n. The code also has a very concise description, O(k2) bits (recall n ∈ Θ(k)); whereas the previous full-independence approach gave a code with description size exponential in k. One comment is that although picking a code at random is easy, checking whether it indeed satisfies the desired condition is slow: one can either do this in time exponential in n, exactly, by exhaustively considering R’s, or one can try to estimate the probability of error by sampling R, but even this will require time inverse in the decoding-error-probability of R until we see error events and can get a good estimate of the error probability of R; in particular we cannot certify a good code this way in time less than 2cn.

4.1.2 Pairwise independence and the second-moment inequality

A common situation in which we use Chebyshev’s inequality, Lemma 13 is when we have many variables which are not fully independent, but are pairwise independent (or nearly so).

Definition 64 (Pairwise and k-wise independence) A set of rvs are pairwise independent if every pair of them are independent; this is a weaker requirement than that all be independent. Likewise, the variables are k-wise independent if every subset of size k is independent.

Definition 65 (Covariance) The covariance of two real-valued rvs X, Y is (if well-defined) Cov(X, Y) = E(XY) − E(X)E(Y).

Exercise: Show that if X and Y are independent then Cov(X, Y) = 0, but that the converse need not be true. n Exercise: If X = ∑1 Xi, Var X = ∑i Var(Xi) + ∑i6=j Cov(Xi, Xj).

76 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Corollary 66 If X1,..., Xn are pairwise independent real rvs with well-defined variances, then Var(∑ Xi) = ∑ Var(Xi). (We already mentioned this in (3.1).) If in addition they are identically distributed and X = 1 1 n ∑ Xi, then E(X) = E(X1) and Var(X) = n Var(X1).

Exercise: Apply the Chebyshev inequality to obtain:

Lemma 67 (2nd moment inequality) If X1,..., Xn are identically distributed, pairwise-independent real q Var(X1) 2 rvs with finite 1st and 2nd moments then P(|X − E(X)| > λ n ) < 1/λ .

Corollary 68 (Weak Law) Pairwise independent rvs obey the weak . Specifically, if X1,..., Xn are identically distributed, pairwise-independent real rvs with finite variance then for any ε, limn→∞ P(|X − E(X)| > ε) = 0.

So we see that the weak law holds under a much weaker condition than full independence. When we talk about the cardinality of sample spaces, we’ll see why pairwise (or small k-wise) indepen- dence has a huge advantage over full independence, so that it is often desirable in computational settings to make do with limited independence.

77 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.2 Lecture 21 (19/Nov): G(n, p) thresholds

4.2.1 Threshold for H as a subgraph in G(n, p)

Working with low moments of random variables can be incredibly effective, even when we are not specifically looking for limited-independence sample spaces. Here is a prototypical example. “When” does a specific, constant-sized graph H, show up as a subgraph of a random graph selected from the distribution G(n, p)? We have in mind that we are “turning the knob” on p. If H has any edges then when p = 0, with probability 1 there is no subgraph isomorphic to H. When p = 1, with probability 1 such subgraphs are everywhere 2. In between, for any finite n, the probability is some increasing function of p. But we won’t take n finite, we will take it tending to ∞. So the question is,3 can we identify a function π(n) such that in the model G(n, p(n)), with H denoting the event that there is an H in the random graph G, J K

(a) If p(n) ∈ o(π(n)), then limn Pr( H ) = 0. J K (b) If p(n) ∈ ω(π(n)), then limn Pr( H ) = 1. J K Such a function π(n) is known as the threshold for appearance of H. It follows from work of Bollobas and Thomason [17] that monotone events—events that must hold in G0 if they hold in some G ⊆ G0—always have a threshold function. (A related but incomparable statement: for a monotone graph property, i.e., a monotone property invariant under vertex permutations, for any ε > 0 there is a p(n) such that Prp(n)(property) ≤ ε and Prp(n)+O(1/ log n)(property) ≥ 1 − ε. See [37].)

4.2.2 Most pairs independent: threshold for K4 in G(n, p)

Let S ⊂ {1, . . . , n}, |S| = 4. Let XS be the event that K4 occurs as a subgraph of G at S—that is, when you look at those four vertices, all the edges between them are present. Conflating XS with its indicator function and letting X be the number of K4’s in G, we have

X = ∑ XS S and n E(X) = p6. 4

We are interested in Pr(X > 0). Let π(n) = n−2/3.

(a) For p(n) ∈ o(π(n)), E(X) ∈ o(1), so Pr( K4 ) ∈ o(1) and therefore limn Pr( K4 ) = 0. J K J K (b) For 1 > p(n) ∈ ω(π(n)), E(X) ∈ ω(1). We’d like to conclude that likely X > 0 but we do not have enough information to justify this, as it could be that X is usually 0 and occasionally very 4 large. We will exclude that possibility for K4 by studying the next moment of the distribution. Before carrying out this calculation, though, we have to make one important note. Since the event 0 K4 is monotone, [p ≤ p ] ⇒ [PrG(n,p) K4 ≤ PrG(n,p0) K4 ]. (An easy way to see this is by choosing realsJ K iid uniformly in [0, 1] at each edge,J andK placingJ theK edge in the graph if the rv is above the p

2 Today we focus on H = K4, the 4-clique, but more generally this method will establish the probability of any fixed graph H occurring as a subgraph in G, that is, ∃ injection of V(H) into V(G) carrying edges to edges. This is different from asking that H occur as an induced subgraph of G, which requires also that non-edges be carried to non-edges. That question is different in an essential way: the event is not monotone in G. 3Recall p(n) ∈ o(π(n)) means that lim sup p(n)/π(n) = 0, and p(n) ∈ ω(π(n)) means that lim sup π(n)/p(n) = 0. 4 When we study not K4-subgraphs, but other subgraphs, this can really happen. We’ll discuss this below.

78 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

0 or p threshold.) This means that it is enough to show that K4 “shows up” slightly above π. This is useful because some of our calculations break down far above π, not because there is anything wrong with the underlying statement but because the inequalities we use are not strong enough to be useful there and a direct calculation would need to take account of further moments. To simplify our remaining calculations, then, let p = n−2/3g(n), so n4 p6 = g6 for any sufficiently small g(n) ∈ ω(1); we’ll see how this is helpful in the calculations. By an earlier exercise, Var(X) = ∑ Var(XS) + ∑ Cov(XS, XT) S S6=T

6 6 6 XS is a coin (or Bernoulli rv) with Pr(XS = 1) = p . The variance of such an rv is p (1 − p ). The covariance terms are more interesting.

1. If |S ∩ T| ≤ 1, no edges are shared, so the events are independent and Cov(XS, XT) = 0. 2. If |S ∩ T| = 2, one edge is shared, and a total of 11 specific edges must be present for both cliques to be present. A simple way to bound the covariance is (since E(XS), E(XT) ≥ 0) that 11 Cov(XS, XT) = E(XSXT) − E(XS)E(XT) ≤ E(XSXT) = p . 3. If |S ∩ T| = 3, three edges are shared, and a specific 9 edges must be present for both cliques 9 to be present. Similarly to the previous case, Cov(XS, XT) ≤ p .

n  n   n  Var(X) ≤ p6(1 − p6) + p11 + p9 4 2, 2, 2 3, 1, 1 ∈ O(n4 p6 + n6 p11 + n5 p9) = O(g6n4−4 + g11n6−22/3 + g9n5−6) from p = n−2/3g(n) = O(g6 + g11n−4/3 + g9n−1) (4.4) = O(g6) provided g5 ∈ O(n4/3) and g3 ∈ O(n)

This gives us the key piece of information. For g ∈ ω(1) but not too large, we have Var(X) O(g6) O(g6) ∈ = = O(g−6) ⊆ o(1) (E(X))2 Θ((n4 p6)2) Θ(g12) and we have only to apply the Chebyshev inequality (Cor. 14) (or better yet Paley-Zygmund, Lemma 74 which we haven’t proven yet) to conclude that Pr(X = 0) ∈ o(1) and so

lim Pr( K4 ) = 1. (4.5) n J K

Since K4 is a monotone event, (4.5) holds even for g above the range we needed for the calculation to hold.J (Note,K though, since there is so much “room” in the calculation, we could even have used the upper bound O(g11) on 4.4, and not resorted to this monotonicity argument.) Exercise: Show that the threshold for appearance, as a subgraph, of the graph with 5 edges and 4 vertices is n−4/5. Comment: For a general H the threshold for appearance of H in G(n, p) as a subgraph is determined not by the ratio ρH of edges to vertices, but by the maximum of this ratio over induced subgraphs of H, call it ρmax H. We’ll see this on a problem set (and see [8]). If these numbers are different then above n−1/ρH the expected number of H’s starts tending to ∞ but almost certainly we have none; once we cross the higher threshold n−1/ρmax H , there is an “explosion” of many of these subgraphs appearing. (They show up highly intersecting in the fewer copies of the critical induced subgraph.)

79 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.3 Lecture 22 (21/Nov): Concentration of the number of prime factors; begin Khintchine-Kahane for 4-wise independence

Now for an application of near-pairwise independence in number theory. Let m(k) be the number of primes dividing k. Hardy and Ramanujan showed that for large k this number is almost always close to log log k. Specifically, let k ∈U [n], and let M be the rv

M = m(k).

Always M ≤ lg k. But usually this is a vast overestimate:

Theorem 69 (Hardy & Ramanujan) Let 1 ≤ λ ≤ n1/4.

p 1 + o(1) Pr(|M − log log k| > λ log log k) < . λ2 (This doesn’t make much sense for k < 16; instead more formally the failure event should be k < 16 OR |M − p log log k| > λ log log k .) J K Proof: We show an elegant proof due to Turan. Before we begin the proof in earnest let’s simplify things. The function log log,√ besides being monotone, is so slowly growing√ that it√ hardly distinguishes between n and n. Specifically, log log n = log 2 + log log n, so for k ≥ n we have

log log n − log 2 ≤ log log k ≤ log log n

which in particular implies

|M − log log n| + log 2 ≥ |M − log log k|

Consequently: p √ Pr(|M − log log k| > λ log log k) ≤ Pr(k ≤ n) p √ √ + Pr(|M − log log n| + log 2 > λ log log n − log 2|k > n) Pr(k > n)

Use Pr(A|B) = Pr(A ∩ B)/ Pr(B) ≤ Pr(A)/ Pr(B): √ ... = Pr(k ≤ n) p + Pr(|M − log log n| + log 2 > λ log log n − log 2) (4.6) √ | − |+ log log n−log 2+log 2 If the event in (4.6) holds for λ ≥ 1 then M log log n log 2 ≤ √ . Consequently, |M−log log n| log log n−log 2 setting p 0 |M − log log n| log log n − log 2 λ = λ p (|M − log log n| + log 2) log log n we can say p p log log n − log 2 + log 2 log log n 0 0 λ ≤ p p λ ∈ (1 + o(1))λ log log n − log 2 log log n − log 2

80 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Now continuing from (4.6),

p 1 Pr(|M − log log k| > λ log log k) ≤ √ (4.7) n p + Pr(|M − log log n| > λ0 log log n) (4.8)

p If λ0 > (lg n)/ log log n, the probability on line (4.8) is 0 because M ≤ lg n, so we’re done. p √ It remains to consider the range λ0 ≤ (lg n)/ log log n. In this range, the 1/ n term in 4.7 is 02 1 ≤ 1+o(1) 1+o(1) dominated by 1/λ . Since λ02 λ2 , it remains for the theorem only to bound (4.8) by λ02 .

Proposition 70 p 1 + o(1) Pr(|M − log log n| > λ0 log log n) < . λ02

For prime p let p|k be the indicator rv for p dividing k. Note M = ∑p p|k . J K J K E( p|k ) = bn/pc/n J K So 1/p − 1/n ≤ E( p|k ) ≤ 1/p J K

1 1 − 1 + ∑ ≤ E(M) ≤ ∑ (4.9) prime p≤n p prime p≤n p For k ≥ 1 let π(k) = |{p : p ≤ k, p prime}|. We remind ourselves of the

Theorem 71 (Prime number theorem) π(k) ∈ (1 + o(1))k/ log k.

We use the following corollary (proof omitted in class):5

1 Corollary 72 ∑prime p≤n p ∈ (1 + o(1)) log log n.

5 Proof: For k ≥ 2, π(k) − π(k − 1) is the indicator rv for k being prime.

1 π(k) − π(k − 1) ∑ = ∑ prime p≤n p 2≤k≤n k π(n)  1 1  = + π(k) − ∑ + n 2≤k≤n−1 k k 1 π(k) = o(1) + ∑ ( + ) 2≤k≤n−1 k k 1 1 + o(1) = o(1) + by the prime number theorem ∑ ( + ) 2≤k≤n−1 k 1 log k There’s now a minor exercise. We want to move the (1 + o(1)) out of the summation but inside it refers to k, outside to n. Exercise: Moving it outside is justified for any summation which tends to ∞, which we will shortly see is true of this one. In the process we can also subsume the additive o(1). More formally, Exercise: Let ak, bk ≥ 0 be series such that ak ∈ (1 ± ok(1))bk n n n and ∑1 ak ∈ ωn(1) (where the subscripts of o or ω indicate the variable tending to ∞). Then ∑1 bk ∈ (1 + on(1)) ∑1 ak. Now applying this: 1 = (1 + o(1)) ∑ ( + ) 2≤k≤n−1 k 1 log k 1 = (1 + o(1)) ∑ using same exercise again 2≤k≤n−1 k log k

81 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

So, from Eqn. (4.9) and Cor. 72 we know that E(M) ∈ (1 + o(1)) log log n. Now for the variance of M. The proposition will follow from showing Var(M) = (1 + o(1)) log log n (4.10) and an application of the Chebyshev inequality. To show (4.10), as always we can write Var(M) = ∑ Var( p|k ) + ∑ Cov( p|k , q|k ) (4.11) prime p≤n J K primes p6=q≤n J K J K

What we will discover is that the sum of covariances is very small and so the bound on Var(M) is almost as if were had pairwise independence between the events p|k . J K As we’ve already noted on occasion, for a {0, 1}-valued rv Y, Var(Y) = E(Y)(1 − E(Y)) ≤ E(Y). Applying this we have ∑ Var( p|k ) ≤ ∑ E( p|k ) ∈ (1 + o(1)) log log n. (4.12) prime J K prime J K p≤n p≤n

Now to handle the covariances. Observe that for primes p 6= q, p|k q|k is the indicator rv pq|k . Just as for primes, E( pq|k ) = b n c/n ≤ 1 . So J KJ K J K J K pq pq Cov( p|k , q|k ) = E( pq|k ) − E( p|k )E( q|k ) J K J K J K J K J K 1  1 1   1 1  ≤ − − − pq p n q n 1  1 1  ≤ + n p q This is a very low covariance, which is crucial to the theorem. 1  1 1  ∑ Cov( p|k , q|k ) ≤ ∑ + primes J K J K primes n p q p6=q≤n p6=q≤n  prime 2 1  = (1 + o(1)) π(n) number n ∑ p primes  theorem p≤n 2 = (1 + o(1)) π(n) log log n Cor. (72) n 2 log log n = (1 + o(1)) log n

By the same exercise, this can be evaluated by comparison to an integral: Z n−1 1 = (1 + o(1)) dx 2 x log x Z log(n−1) 1 = ( + ( )) u = u 1 o 1 u e du substitution x e log 2 e u Z log n 1 = (1 + o(1)) du log 2 u = (1 + o(1)) log log n 2

82 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

This is dominated by the sum of variances in (4.11), i.e., by (4.12) (this is even tending to 0), so we have established (4.10). 2

4.3.1 4-wise independent random walk

Earlier we quoted the CLT or Khintchine-Kahane to conclude that the value of the Gale-Berlekamp game is Ω(n3/2). Specifically we used this to show that for a symmetric random walk of length n, n 1/2 X = ∑1 Xi with Xi ∈U {1, −1}, E(|X|) ∈ Ω(n ). Now we will show this from first principles—and more importantly, using only information about the 2nd and 4th moments. This is not only of methodological interest. It makes the conclusion more robust, specifically the conclusion holds for any 4-wise independent space, and therefore implies a poly-time deterministic algorithm to find a Gale-Berlekamp solution of value Ω(n3/2), because there exist k-wise indepen- dent sample spaces of size O(nbk/2c), as we will show in a later lecture.

n Theorem 73 Let X = ∑1 Xi where the Xi are 4-wise independent and Xi ∈U {1, −1}. Then E(|X|) ∈ Ω(n1/2).

Proof: We start with two calculations. These calculations are made easy by the fact that for any product of the form Xb1 ··· Xb4 , with i ,..., i distinct and b ≥ 0 integer, i1 i4 1 4 i

 0 if any b is odd E(Xb1 ··· Xb4 ) = i1 i4 1 otherwise

So now 2 2 E(X ) = ∑ E(XiXj) = ∑ E(Xi ) = n i,j i

4 2 2 4 2 E(X ) = 3 ∑ E(Xi Xj ) − 2 ∑ E(Xi ) = 3n − 2n. i,j One is tempted to apply Chebyschev’s inequality (in the form of Cor. 14) to the rv X2, because we know both its expectation and its variance. Unfortunately, the numbers are not favorable! Var(X2) = 3n2 − 2n − n2 = 2n2 − 2n > n2 = (E(X2))2. So Cor. 14 gives us only Pr(X2 = 0) ≤ Var(X2)/(E(X2))2, where 1 ≤ Var(X2)/(E(X2))2, which is useless. (Let alone that we would actu- ally want to bound the larger quantity Pr(X2) < cn for some c > 0.) There are two ways to solve this, and I’ll show you both. For the strongest bound for Gale- Berlekamp, however, one may skip the next section and proceed to Sec. 4.4.2.

83 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.4 Lecture 23 (26/Nov): Cont. Khintchine-Kahane for 4-wise in- dependence; begin MIS in NC

4.4.1 Paley-Zygmund: solution through an in-probability bound

Paley-Zygmund is usually stated as an alternative (to Cor. 14) lower-tail bound for nonnegative rvs; i.e., it gives a way to say that a nonnegative rv A is “often large”.

Let µi be the ith moment of A. Knowing only the first moment µ1 of A is not enough, because for any value—even infinite—of the first moment, we can arrange, for any δ > 0, a nonnegative rv A which equals 0 with probability 1 − δ, yet has first moment µ1. We just have to move δ of the probability mass out to the point µ1/δ, or, in the infinite µ case, spread δ probability mass out in a measure whose first moment diverges. 6 However, a finite second moment µ2 is enough to justify such a “usually large” statement : Actually PZ can be stated for rvs which are not necessarily nonnegative, so we’ll do it that way.

Lemma 74 (Paley-Zygmund) Let A be a real rv with positive µ1 and finite µ2. For any 0 < λ ≤ 1,

2 2 λ µ1 Pr(A > (1 − λ)µ1) > . µ2

Proof: Let ν be the distribution of A. Let p = Pr(A > (1 − λ)µ1). (This is what we want to lower bound.) Decompose µ = R x dν(x) + R x dν(x). Now examine each of these 1 [−∞,(1−λ)µ1] ((1−λ)µ1,∞] terms. Z x dν(x) ≤ (1 − p)(1 − λ)µ1 ≤ (1 − λ)µ1 (4.13) [−∞,(1−λ)µ1] 7 x2 I Apply Cauchy-Schwarz to the functions and x>(1−λ)µ1 . These are not effectively proportional 2 to each other w.r.t. ν (unless ν is supported on a single point, in which case µ1 = µ2 and the Lemma is immediate), so we get a strict inequality,

Z r Z ∞ 2 1/2 1/2 x dν(x) < p x dν(x) = p µ2 (4.14) ((1−λ)µ1,∞] −∞

1/2 1/2 Putting (4.13), (4.14) together, µ1 < (1 − λ)µ1 + p µ2

1/2 1/2 λµ1 < p µ2 as desired. (There’s not normally much to be gained by preserving the “1 − p” factor in (4.13), but it’s at least another reason for writing strict inequality in the Lemma.) 2

6 2 We don’t need to also assume anything about µ1 because µ1 ≤ µ2, by nonnegativity of the variance (special case of the power means inequality). 7

Lemma 75 (Cauchy-Schwarz) If functions f , g are square-integrable w.r.t. measure ν then R f (x)g(x) dν(x) ≤ q R f 2(x) dν(x) · R g2(x) dν(x).

Proof: Squaring and subtracting sides, it suffices to show: 0 ≤ RR f 2(x)g2(y) dν(x)dν(y) − RR f (x)g(x) f (y)g(y) dν(x)dν(y). This is equivalent (by swapping the dummy variables) to showing 0 ≤ RR ( f 2(x)g2(y) + f 2(y)g2(x) − 2 f (x)g(x) f (y)g(y)) dν(x)dν(y) = RR ( f (x)g(y) − f (y)g(x))2 dν(x)dν(y) which is an integral of squares. 2 Say that f and g are effectively proportional to each other w.r.t. ν if this last integral is 0; this is the condition for equality in Cauchy-Schwarz.

84 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

2 2 µ2−µ1 µ2−µ1 Comment: This gives Pr(A ≤ 0) ≤ µ which improves on the upper bound 2 of Cor. 14. It 2 µ1 2 should be said though that PZ does not dominate Cor. 14 in all ranges (e.g., if the variance µ2 − µ1 2 is very small compared to the µ1, and λ is small). Returning to Gale-Berlekamp: Lemma 74 is not directly usable for our purpose, i.e., we cannot plug 2 3 2 2 in the rv A = |X|, because all it will tell us is that µ1 ≥ (1 − λ)λ µ1/µ2, i.e., µ2 ≥ (1 − λ)λ µ1, which follows already from Cauchy-Schwarz (with the better constant 1). Note, this shows how Paley-Zygmund serves as a more flexible, albeit slightly weaker, version of Cauchy-Schwarz. Instead, we set B = |X| and A = B2, and then apply Paley-Zygmund to A. This is not a technicality. It means that we are relying on 4-wise independence, not just 2-wise independence, of the Xi’s. And indeed, Exercise: There are for arbitrarily large n, collections of n pairwise independent Xi’s, uniform in ±1, s.t. Pr(|X| = 0) = 1 − 1/n, Pr(|X| = n) = 1/n.

Corollary 76 Let B be a nonnegative rv with Pr(B = 0) < 1 and fourth moment µ4(B) < ∞. Then E(B) ≥ 16√ µ5/2/µ . 25 5 2 4

Proof: For any θ, E(B) ≥ θ Pr(B ≥ θ) = θ Pr(B2 ≥ θ2), so, applying Lemma 74 to A = B2, with p θ = µ2(B)/5 and λ = 4/5,

r 2 5/2 µ2(B) 2 (4/5) µ2(B) E(B) ≥ Pr(B ≥ µ2(B)/5) ≥ √ . 5 5 µ4(B) 2

4.4.2 Berger: a direct expectation bound

3/2 µ2(B) Lemma 77 (Berger [11]) Let B be a nonnegative rv with 0 < µ4(B) < ∞. Then µ1(B) ≥ 1/2 . µ4(B)

This is stronger than Cor. 76 for two reasons: the constant, and perhaps more importantly, because 1/2 µ2/µ4 ≤ 1 (power mean inequality). Of course, this lemma does not give an in-probability bound, so it is incomparable with Lemma 74. Lemma 77 is a special case of the following, with p = 1, q = 2, r = 4:

Lemma 78 Let 0 < p < q < r and let B be a nonnegative rv with probability measure θ, θ({0}) < 1. For r−p q−p x r−q − r−q x > 0 let µx = E(B ). Then µp(B) ≥ µq(B) µr(B) .

Proof: A more usual way to write this is

r−q q−p r−p r−p µq ≤ µp µr (4.15)

Note the exponents sum to 1, and that the average of p and r weighted by the exponents is q, i.e., r − q q − p p · + r · = q r − p r − p

so (4.15) is a consequence of an important fact, the log-concavity of the moments (i.e., of µq as a function of q). We’ll show this next. 2

85 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Lemma 79 (Log concavity of the moments) If θ is a probability distribution on the nonnegative reals 2 with ({ }) < then, for all q at which R yq 2 y d converges absolutely, ∂ > . θ 0 1 log θ ∂2q log µq 0

Proof: 2 2 Z ∂ ∂ q log µq = log y dθ ∂2q ∂2q R ∂ yq log y dθ = ∂q µq µ R (yq log2 y + yq−1) dθ − (R yq log y dθ)2 = q 2 µq µ R yq log2 y dθ − (R yq log y dθ)2 ≥ q 2 µq 1 ZZ   = q q 2 − q q ( ) ( ) 2 x y log y x y log x log y dθ x dθ y µq 1 ZZ = q q ( − )2 ( ) ( ) 2 x y log y log x dθ x dθ y 2µq ≥ 0

2

4.4.3 cont. proof of Theorem 73

We apply Lemma 77. Substituting our known moments of the rv |X|,

n3/2 √ E(|X|) ≥ ≥ n/3. (3n2 − 2n)1/2

Observe that we have lost only a small constant factor here compared with the precise value ob- tained for a fully-independent sample space from the CLT. 2

4.4.4 Maximal Independent Set in NC

Parallel complexity classes

L = log-space = problems decidable by a Turing Machine having a read-only input tape and a read-write work tape of size (for inputs of length n) O(log n). S k k NC = k NC , where NC = languages s.t. ∃c < ∞ s.t. membership can be computed, for inputs of size n, by nc processors running for time logk n. RNC = same, but the processors are also allowed to use random bits. For x ∈ L Pr( error ) ≤ 1/2, for x ∈/ L Pr( error ) = 0. L ⊆ NC1 ⊆ ... ⊆ NC ⊆ RNC ⊆ RP. P-Complete = problems that are in P, and that are complete for P w.r.t. reductions from a lower complexity class (usually, log-space).

86 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Maximal Independent Set

MIS is the problem of finding a Maximal Independent Set. That is, an independent set that is not strictly contained in any other. This does not mean it needs to be a big, let alone a maximum cardinality set. (It is NP-complete to find an independent set of maximum size. This is more commonly known as the problem of finding a maximum clique, in the complement graph.) There is an obvious sequential greedy algorithm for MIS: list the vertices {1, . . . , n}. Use vertex 1. Remove it and its neighbors. Use the smallest-index vertex which remains. Remove it and its neighbors, etc. The independent set you get this way is called the Lexicographically First MIS. Finding it is P- complete w.r.t. L-reductions [22]. So it is interesting that if we don’t insist on getting this particular MIS, but are happy with any MIS, then we can solve the problem in parallel, specifically, in NC2. We’ll see an RNC, i.e., randomized parallel, algorithm of Luby [64] for MIS. Then, we’ll see how to derandomize the algorithm. (Some of the ideas we’ll see also come from the papers [55, 7]).

Notation: Dv is the neighborhood of v, not including v itself. dv = |Dv|. Luby’s MIS algorithm: Given: a graph G = (V, E) with n vertices. Start with I = ∅. Repeat until the graph is empty:

1. Mark each vertex v pairwise independently with probability 1 . 2dv+1 2. For each doubly-marked edge, unmark the vertex of lower degree (break ties arbitrarily).

3. For each marked vertex v, append v to I and remove the vertices v ∪ Dv (and of course all incident edges) from the graph.

An iteration can be implemented in parallel in time O(log n), using a processor per edge. We’ll show that an expected constant fraction of edges is removed in each iteration (and then we’ll show that this is enough to ensure expected logarithmically many iterations).

Definition 80 A vertex v is good if it has ≥ dv/3 neighbors of degree ≤ dv. (Let G be the set of good vertices, and B the remaining ones which we call bad.) An edge is good if it contains a good vertex.

1 Lemma 81 If dv > 0 and v is good, then Pr(∃ marked w ∈ Dv after step (1) ) ≥ 18 .

This follows immediately from the following, using Pr(w marked ) ≥ dv 1 ≥ 1 . ∑w∈Dv 3 2dv+1 9

S 1 1 Lemma 82 If {Xi} are pairwise independent events s.t. Pr(Xi) = pi then Pr( Xi) ≥ 2 min( 2 , ∑ pi).

Compare with the pairwise-independent version of the second Borel-Cantelli lemma. Of course, that is about guaranteeing that infinitely many events occur, here we’re just trying to get one to occur, but the lemmas are nonetheless quite analogous.

Proof: If ∑ pi < 1/2 then consider all events, otherwise there is a subset s.t. 1/2 ≤ ∑ pi ≤ 1 (consider two cases depending on whether any pi exceeds 1/2); apply the following argument just to that subset.

87 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

[ Pr( Xi) ≥ ∑ pi − ∑ Pr(Xi ∩ Xj) Bonferroni level 2 i

= ∑ pi − ∑ pi pj i

88 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.5 Lecture 24 (28/Nov): Cont. MIS, begin derandomization from small sample spaces

4.5.1 Cont. MIS

Lemma 83 If v is marked then the probability it is unmarked in step (2) is ≤ 1/2.

Proof: It is unmarked only if a weakly-higher-degree neighbor is marked. Each of these events happens, conditional on v being marked, with probability at most 1 . Apply a union bound. 2 2dv+1

Corollary 84 The probability that a good vertex is removed in step 3 is at least 1/36.

Proof: Immediate from the previous two lemmas. 2 Now for our measure of progress.

Lemma 85 At least half the edges in a graph (V, E) are good.

Proof: Sort the vertices from left to right so that du < dv ⇒ u < v (ties arbitrarily). Direct each edge in out from lower to higher degree vertex; now we have in-degrees dv and out-degrees dv .

A bad vertex has > 2dv/3 neighbors with degree > dv. (4.16)

For two sets of vertices V1, V2 let E(V1, V2) be the edges directed from V1 to V2. (In particular E = E(V, V).) If by Eˆ(V1, V2) we mean the set of undirected edges associated with the directed edges E(V1, V2), then note that Eˆ(B, B) ∪ Eˆ(B, G) ∪ Eˆ(G, B) ∪ Eˆ(G, G) is a disjoint partition of the edges of the graph, |E(B, B)| = |Eˆ(B, B)|, and

Eˆ(B, B) is the set of bad edges. Eˆ(B, G) ∪ Eˆ(G, B) ∪ Eˆ(G, G) is the set of good edges.

From (4.16), every v ∈ B has at least twice as many outgoing edges as incoming edges. Each edge in E(B, B) contributes one incoming edge to B, so there must be at least 2|E(B, B)| outgoing edges from B; only |E(B, B)| of these can be accounted for by outgoing edges of E(B, B). The remainder are accounted for by edges in E(B, G), so |E(B, G)| ≥ |E(B, B)|. Consequently |E(B, B)| ≤ |E|/2. 2 Due to the corollary, each good edge is removed with probability at least 1/36. Of course the edge- removals are correlated, but in any case, since at least half the edges are good, the expected fraction of edges removed is at least 1/72. In the next section we analyze how long it takes this process to terminate. First, a comment: the analysis above was not sensitive to the precise probability 1 with which 2dv+1 vertices were marked. For instance, it would be fine if each vertex were marked with some proba- bility p , 1 ≤ p ≤ 1 ; the only effect would be to change the “1/72” to some other constant. v 4dv v 2dv+1 We will actually only modify each 1 by a factor 1 + o (1) when we derandomize the algorithm. 2dv+1 n

89 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.5.2 Descent Processes

(This is not widespread terminology but things like this come up often. The coupon collector problem which we saw in Sec. 1.3.2 is another example.) In a descent process the state of the process is a nonnegative integer n; the process terminates when n = 0. At n > 0, you sample a random variable X from a distribution pn on {0, . . . , n}, and transition to state n − X. The question is, how many iterations does it take you to hit 0? Let Tn be n this random variable, when you start from state n. (So E(T0) = 0.) Write θn = Epn (X) = ∑0 ipn(i).

> ( ) = ( ) ≤ n 1 Lemma 86 For i 0 let g i mini≤m≤n θm. Then E Tn ∑1 g(i) .

This bound is pretty good if θn is monotone increasing, which is the common situation. It can be a bad bound if there are a few bottlenecks which the descent process usually avoids. In our case we won’t know g exactly, but we will know lower bounds on it, which is enough.

Note that g is nondecreasing and g(n) = θn. Proof: The lemma is trivial for n = 0 (the LHS is 0 and the RHS is an empty summation). For n > 0 proceed by induction.

n E(Tn) = 1 + pn(0)E(Tn) + ∑ pn(i)E(Tn−i) i=1

n (1 − pn(0))E(Tn) = 1 + ∑ pn(i)E(Tn−i) i=1 n n−i 1 ≤ 1 + p (i) induction ∑ n ∑ ( ) i=1 j=1 g j ! n n 1 n 1 = 1 + p (i) − ∑ n ∑ ( ) ∑ ( ) i=1 j=1 g j j=n−i+1 g j n 1 n n 1 = 1 + (1 − p (0)) − p (i) n ∑ ( ) ∑ n ∑ ( ) j=1 g j i=1 j=n−i+1 g j ! n 1 n i n i n 1 = 1 + (1 − p (0)) − p (i) + p (i) − n ∑ ( ) ∑ n ( ) ∑ n ( ) ∑ ( ) j=1 g j i=1 g n i=1 g n j=n−i+1 g j n 1 n i ≤ 1 + (1 − p (0)) − p (i) g nondecreasing, g(n) = θ n ∑ ( ) ∑ n n j=1 g j i=1 θn n 1 = 1 + (1 − p (0)) − 1 n ∑ ( ) j=1 g j 2

4.5.3 Cont. MIS

|E| 72 As a consequence, the expected number of iterations until the algorithm terminates is ≤ ∑1 i ∈ O(log |E|) ∈ O(log n). Each iteration alone takes time O(log n) to do the local marking and unmark- ing, and removing vertices and edges. This is an RNC2 algorithm, using O(|E|) processors, for MIS.

90 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

In Section 4.6.5 we’ll see how we can derandomize this using a factor of O(n2) more processors, and thereby put MIS in NC2. Question: Here is a different parallel algorithm for MIS. At each vertex v choose uniformly a real number rv in [0, 1]. Put a vertex in I if rv > ru for every neighbor u of v. Remove I and all its neighbors. Repeat. (We don’t really need to pick random real numbers; we can just pick multiples of 1/n3, and we’re unlikely to have any ties to deal with.) This process is a bit simpler than Luby’s algorithm because there is no “unmarking”. Question: If the rv’s are chosen independently, is it the case that the expected number of edges that are removed, is a constant fraction of |E|? If so, is this still true if the rv’s are pairwise independent?

4.5.4 Begin derandomization from small sample spaces

We discussed in an earlier lecture the notion of linear error-correcting codes. We worked over the base field GF(2), also known as Z/2. (Which is to say, we added bit-vectors using XOR.) Encoding of messages in such a code is simply multiplication of the message, as a vector v ∈ Fm, by the generator matrix C of the code; the result, if C is m × n, is an n-bit codeword.

 generator matrix  (message v)   = (codeword vC) C

The set of codewords is exactly Rowspace(C). The minimum weight of a linear code is the least number of 1’s in a codeword. If the minimum weight is k + 1 then the code is k-error-detecting; this property ensures

1. Error detection up to k errors 2. Error correction up to bk/2c errors.

This property is not possessed by codes achieving near-optimal rate in Shannon’s coding theorem. That theorem protects only against random noise, and if that is what you want, then the mininimum weight property is too strong to allow optimally efficient codes. It protects against adversarial noise.

91 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.6 Lecture 25 (30/Nov): Limited linear independence, limited sta- tistical independence, error correcting codes.

4.6.1 Generator matrix and parity check matrix

Error detection can be performed with the aid of the parity check matrix M:

Left Nullspace(M) = Rowspace(C)

 parity   generator matrix     check    = 0   matrix   C   M

wM = 0 ⇐⇒ w ∈ Rowspace(C) ⇐⇒ w is a codeword

  Every vector in   Every k rows Rowspace(C) has ⇐⇒ of M are linearly weight ≥ k + 1   independent

In coding theory terms, this is an (n, m, k + 1) code over GF(2). (Unfortunately, coding theorists conventionally use the letters (n, k, d) but we have k + 1 reserved for the least weight, because we’re following the conventional terminology from “k-wise independence”.) For any fixed values of n and k, the code is most efficient when the message length m, which is the number of rows of C, is as large as possible; equivalently, the number of columns of M, ` = n − m, is as small as possible. So we’ll want to design a matrix M with few columns in which every k rows are linearly independent. But first, let’s see a connection between linear and statistical independence. Let B be a k × ` matrix over GF(2), with full row rank. (So k ≤ `.) ` k If x ∈U (GF(2)) then y = Bx ∈U (GF(2)) ,    y   B  =  x  because the pre-image of any y is an affine subspace (a translate of the right nullspace of B). (We already made this observation in the context of Freivalds’ verification algorithm, Theorem 26.) Now, if we have a matrix M with n rows, of which every k are linearly independent, then every k bits of z = Mx are uniformly distributed in (GF(2))k.

               z  =  M   x         

We’ve exhibited dual applications of the parity check matrix:

92 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

• Action on row vectors: checking validity of a received word w as a codeword. (s = wM is called the “syndrome” of w; in the case of non-codewords, i.e., s 6= 0, one of the ways to decode is to maintain a table containing for every s, the least-weight vector η s.t. ηM = s. Then w − η is the closest codeword to w. This table-lookup method is practical for some very high rate codes, where there are not many possible vectors s.) • Action on column vectors: converting few uniform iid bits into many k-wise independent uniform bits.

Now we can see an entire sample space on n bits that are uniform and k-wise-independent. At the right end we place the uniform distribution on all 2` vectors of the vector space GF(2)`.

          0 0 . . . 1 1      Ω  =  M   . . . unif. dist. on cols          0 1 . . . 0 1

Ω is the uniform distribution on the columns on the LHS. m n−` Maximizing the transmission rate n = n of a binary, k-error-detecting code, is equivalent to minimizing the size |Ω| = 2` of a linear k-wise independent binary sample space. So how big does |Ω| have to be?

Theorem 87

1. For all n there is a sample space of size O(nbk/2c) with n uniform k-wise independent bits. For larger ranges one has: For all n there is a sample space of size O(2k max{m,dlg ne}) with n k-wise independent rvs, each uniformly distributed on [2m]. 2. For all n, any sample space on n k-wise independent bits, none of which is a.s. constant, has size Ω(nbk/2c).

We show Part 1 in Sec. 4.6.3; Part 2 will be on the problem set. First though, returning to the subject of codes, there is a question worth asking even though we don’t need it for our current purpose:

4.6.2 Constructing C from M

Suppose we have constructed a parity check matrix M. How do we then get a generator matrix C? One should note that over a finite field, Gram-Schmidt does not work. Gram-Schmidt would have allowed us to produce new vectors which are both orthogonal to the columns of M and linearly independent of them. But this is generally not possible: the row space of C and the column space of M do not necessarily span the n-dimensional space. For example over GF(2) we may have

 1  C = 1 1  , M = 1

However, Gaussian elimination does work over finite fields, and that is what is essential.

93 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Specifically, given n × a M and n × b N, b < n − a, N of full column rank (i.e., rank b), we show how to construct a vector c s.t. cM = 0 and c0 is linearly independent of the columns of N. (Then adjoin c0 to N and repeat.) Perform Gaussian elimination on the columns of N so that it is lower triangular, with a nonzero diagonal. (That is, allowed operations are adding columns to one another, and permuting rows. When permuting rows of N, permute the rows of M to match.) Obviously this does not change the column space of N (except for the simultaneous permution of rows in N and M). Now take the submatrix of M consisting of the a + 1 rows b + 1, . . . , b + a + 1. By Gaussian elimina- tion on these rows we can find a linear dependence among them. Extending that dependence to the n-dimensional space with 0 coefficients elsewhere yields a vector c s.t. cM = 0 and s.t. the support of c is disjoint from the first b coordinates. Then c is linearly independent of the column space of N because the restriction of N to its first b rows is nonsingular, so any linear combination of the rows of N has a nonzero value somewhere among its first b entries.

4.6.3 Proof of Thm (87) Part (1): Upper bound on the size of k-wise independent sample spaces

(We’ll do this carefully for producing binary rvs and only mention at the end what should be done for larger ranges.) This construction uses the finite fields whose cardinalities are powers of 2. These are called exten- sion fields of GF(2). If you are not familiar with this, just keep in mind that for each integer r ≥ 1 there is a (unique) field with 2r elements. We can add, subtract, multiply and divide these without leaving the set; in particular, in the usual way of representing the elements of the field as bit strings of length r, addition is simply XOR addition.8 Specifically, we can think of the elements of GF(2r) as the polynomials of degree ≤ r − 1 over GF(2), taken modulo some fixed irreducible polynomial r−1 p of degree r. That is, a field element c has the form c = cr−1x + ... + c1x + c0 (mod p(x)), and our usual way of representing this element is through the mapping β : GF(2r) → (GF(2))r given by β(c) = (cr−1,..., c0). (I.e., the list of coefficients.) But all we really need today are three things: (a) Like GF(2), GF(2r) is a field of characteristic 2, i.e., 2x = 0. (b) For matrices over GF(2r) the usual concepts of linear independence and rank apply. (c) β is injective, linear (namely β(c) + β(c0) = β(c + c0)), and β(1) = 0 . . . 01. r Now, round n up to the nearest n = 2 − 1, and let a1,..., an denote the nonzero elements of the r field. Let M1 be the following Vandermonde matrix over the field GF(2 ):

 2 k−1  1 a1 a1 ... a1  1 a a2 ... ak−1  M =  2 2 2  1  ......  2 k−1 1 an an ... an r Exercise: Every k rows of M1 are linearly independent over GF(2 ). (Form any such submatrix B, say that using the first k rows. Verify that Det(B) = ∏i

 k−1  β(1) = 001 β(a1) = 001 . . . β(a1 ) = 001  β(1) = 001 β(a ) = 010 . . . β(ak−1) = ...  M =  2 2  2  ......  k−1 β(1) = 001 β(an) = 111 . . . β(an ) = ... 8See any introduction to Algebra, for instance Artin [9].

94 Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Corollary: Every k rows of M2 are linearly independent over GF(2). Actually it is possible to even further reduce the number of columns while retaining the corollary. First, we can drop the leading 0’s in the first entry. Second, we can strike out all batches of columns generated by positive even powers.

 3  1 β(a1) = 001 β(a1) = 001 ......  1 β(a ) = 010 β(a3) = ......  M =  2 2  3  ......  3 1 β(an) = 111 β(an) = ......

r Lemma 88 Every set of rows that is linearly independent (over GF(2 )) in M1 is also linearly independent (over GF(2)) in M3. Hence every k rows of M3 are linearly independent.

Proof: Let a set of rows R be independent in M1; we show the same is true in M3. Since M3 is over GF(2), this is equivalent to saying that for every ∅ ⊂ S ⊆ R, the sum of the rows S in M3 is nonzero.

So we are to show that S independent in M1 has nonzero sum in M3. Independence in M1 implies in particular that the sum in M2 of the rows in S is nonzero.

If |S| is odd then the same sum in M3 has a nonzero first entry and we are done. t Otherwise, let t > 0 be the smallest value such that ∑i∈S ai 6= 0; it is enough to show that t is odd. Suppose not, so t = 2t0. Then, since Characteristic(GF(2r)) = 2, !2 2t0 t0 ∑ ai = ∑ ai i∈S i∈S t0 so ∑i∈S ai 6= 0, contradicting minimality of t. 2 Finally for the binary construction, recalling that n = 2r − 1, we have |Ω| = 21+rbk/2c ∈ O(nbk/2c). Comment: If you want n k-wise independent bits with nonuniform marginals, then this construction doesn’t work. The best general construction, due to Koller and Megiddo [58], is of size O(nk). Larger ranges: this is actually simpler because we’re not achieving the factor-2 savings in the expo- nent. Let r, as in the statement, be r = max{m, dlg ne}. Just form the matrix M1.

4.6.4 Back to Gale-Berlekamp

We now see a deterministic polynomial-time algorithm for playing the Gale-Berlekamp game. As we demonstrated last time, it is enough to use a 4-wise independent sample space in order to achieve Ω(n3/2) expected performance. The above construction gives us a 4-wise independent sample space of size O(n2). All we have to do is exhaustively list the points of the sample space until we find one with performance Ω(n3/2).

4.6.5 Back to MIS

For MIS we need only pairwise independence, but want the marking probabilities pv to be more varied (approximately 1 ). This, however, is easy to achieve: use the matrix M , with k = 2, 2dv+1 1 r without modifying to M2 and M3. This generates for each v an element in the field GF(2 ); these elements are pairwise independent; and one can designate for each v a fraction of approximately 1 elements which cause v to be marked. The deterministic algorithm is therefore as described 2dv+1 in Sec. 4.4.4, with a space of size O(n2).

95 Chapter 5

Lov´aszlocal lemma

5.1 Lecture 26 (3/Dec): The Lov´aszlocal lemma

The Lovasz´ local lemma is a fairly widely applicable tool for controlling the interactions among a large collection of random variables.

Consider a probability space in which there is a long (possibly infinite) list of “bad” events B1,... that might occur. We may wish to show that the union of the bad events is not the entire space. c S c T c That is, with denoting complement, we wish to show that ( Bi) = Bi 6= ∅. There are in the probabilistic method two elementary tools to show this kind of statement:

S c 1. The union bound. If ∑ Pr(Bi) < 1 then ( Bi) 6= ∅.

2. Independence. If Pr(Bi) < 1 for all i, and the Bi are mutually independent, then for any finite T c n n, Pr Bi = ∏1 (1 − Pr(Bi)) > 0.

Let’s consider two toy examples of using just the second tool, independence. First toy example: no matter how many fair coins you toss, it is possible for all to come up Heads. Second toy example: Show that any finite (or even infinite but locally finite) tree has a valid 2- coloring. Of course this is trivial (and true also without assuming local finiteness), but can you show it by just coloring the vertices uniformly at random? Suppose the tree has n + 1 vertices. There are n “bad” events, each corresponding to a particular edge being monochromatic; these are mutually independent (check). So the probability that the tree is properly colored is 2−n > 0. This shows that there is a valid coloring of the tree, even though the probability that a random coloring is valid is vanishing in n. (For an infinite, locally finite tree, extend the argument by compactness.) The second toy example illustrates that the use of independence is very fragile. If you insert into the tree just one extra edge closing an odd-length cycle—no matter how long that cycle is—the de- pendence induced among distant events is enough to ruin the 2-colorability. So even an assumption of (n − 1)-way independence among n variables is not sufficient to imply that all good events may intersect. What Lovasz´ did was to create an argument, somewhat like the independence argument we set out above, but more robust, which still works in situations where the bad events are not entirely independent. His argument is a wonderful combination of tools (1) and (2). We present here one form of Lovasz’s´ bound.

96 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

Definition 89 (Dependency graph) Let B1,... be a finite set of events labeled by a set S. A “dependency graph” for the events is a directed graph whose vertices are the set S, with the following property. Let Di be the set of in-neighbors of i. (We do not include an event among its own in-neighbors.) Then for all i, Bi is independent of the product random variable B . ∏j∈S−{i}−Di j

Observe that the same set of events may be assigned many different dependency graphs. In partic- ular, any edges may be added; more significantly, there can be incomparable minimal dependency graphs. Example: consider three events indexed by i = {1, 2, 3}, each with Pr(Bi) = 1/2, pairwise independent but such that the number of bad events is always even. A digraph is a dependency graph iff every vertex has indegree at least 1. Those familiar with the notion of a “graphical model” should note this is a different concept.1

| | ≤ ( ) ≤ 1 T c 6= Lemma 90 (Lov´asz[30]) Suppose that for all i, Di ∆ and Pr Bi e(∆+1) . Then i∈S Bi ∅. In other words, it is possible for all good events to coincide.

The factor of e here is best possible; this was shown by Shearer [79]. An application: k-SAT with restricted intersections. This is a canonical application of the Lovasz´ lemma. Let H = (V, E) be a SAT instance in conjunctive normal form (CNF). That is, V is a set of boolean variables; a literal is a variable v ∈ V or its negation. E is a collection of clauses, each T ∈ E being a set of literals, which is satisfied if at least one of them is satisfied. H is satisfied if all T ∈ E are satisfied. So H is a conjunction of disjunctions. We say that two clauses are neighbors if they share any variable (not necessarily literal).

Corollary 91 Suppose every clause in T ∈ E has size ≥ k and has at most d neighbors. If d + 1 ≤ 2k/e then H is satisfiable.

Two cases of this corollary are easy: if the total number of clauses is small, or if the clauses are all disjoint (share no variables). The local lemma uses both effects:

 union bound: local (in the dependency graph) independence: global

Proof: Make a random, uniform assignment to the variables V. For each T there is a “bad event” −k BT = no literal in T is satisfied. Pr(BT) = 2 . After excluding the edges intersecting T, BT is independent of the collection of all remaining events, because even finding out the exact coloring of V within those edges (not merely which events occurred) does not affect the probability of event BT. Now apply Lemma (90). 2 A closely related application is to NAE-SAT (“Not All Equal”, or “Property B”): Let H = (V, E) be hypergraph (a set system whose elements we call edges). Specifically V is finite and E ⊆ 2V. H has Property B if V can be two-colored so that no edge is monochromatic. This is just like SAT except that two assignments, rather than one assignment, are ruled out per clause.

1To simplify comparison let us consider only bidirected dependency graphs and undirected graphical models. So in either case, we are considering a simple undirected graph among the events Bi, i ∈ S. In a graphical model, the condition is B B B that i be independent of ∏j∈S−({i}∪Di ) j conditional on ∏j∈Di j. Consider as an example the following graphical model (a special case of the Ising model): given that b neighbors j of i b−c have B occur, and that the remaining c neighbors of i have Bc occur, Pr(B ) = e . (Exercise: there is a joint distribution j j i eb−c+ec−b with this property.) The graph of the graphical model is not a dependency graph for this space. The information that a single event Bj has occurred, even if very far away from i in the graph, will at least slightly bias toward Bi occurring. Conversely, consider the above example of three events of which an even number always hold. One (bidirected) depen- dency graph for this is a path of two edges. However, this graph is not valid for a graphical model on the sample space: if the path is 1-2-3, B1 is not independent of B3 conditional on B2. To the contrary, conditional on B2, B3 determines B1.

97 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

Corollary 92 Suppose every edge T ∈ E has size ≥ k and intersects at most d other edges. If d + 1 ≤ 2k−1/e then H has Property B.

Proof: of the local lemma: |R| T c  ∆  We demonstrate more concretely that for any finite subset R of S, Pr i∈R Bi ≥ ∆+1 . This is a corollary of the following assertion which we prove by induction on m:

For any set of m − 1 events (w.l.o.g. relabeled as) B1,..., Bm−1 and any other event (w.l.o.g. relabeled as) B , m   \ c 1 Pr Bm | B  ≤ . (5.1) j ∆ + 1 j≤m−1 | {z } p1 If m = 1 this is immediate from the hypothesis of the lemma. Now suppose the claim is true for values smaller than m. Reorder B1,..., Bm−1 so that for some 0 ≤ d ≤ ∆, Dm ∩ {1, . . . , m − 1} = {m − d,..., m − 1} =: D. Write also D0 = {1, . . . , m − d − 1}. If d = 0, Eqn. (5.1) is again immediate from the hypothesis of the lemma. Otherwise, write     \ c \ c Pr Bm ∩  Bj  | Bj  j∈D j∈D0 | {z } p2     (5.2) \ c \ c \ c = Pr Bm | Bj  Pr  Bj | Bj  j≤m−1 j∈D j∈D0 | {z } | {z } p1 p3

We’re going to upper bound p by expressing it in the form p = p2 , and showing the upper bound 1 1 p3 ≤ 1 ≥ p2 e(∆+1) and the lower bound p3 1/e.

Term p2 is the application of independence at the global level. We use a simple upper bound for it: T c Bm ∩ j∈D Bj ⊆ Bm, so       \ c \ c \ c Pr Bm ∩  Bj  | Bj  ≤ Pr Bm | Bj  j∈D j∈D0 j∈D0 | {z } p2

= Pr(Bm) 1 ≤ . e(∆ + 1)

Term p3 is the union bound at the local level. We could in fact write it explicitly as a union bound but the lemma would suffer the slightly inferior factor of 4 in place of e, so we use the following

98 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

slightly slicker derivation.     \ c \ c c \ c Pr  Bj | Bj  = ∏ Pr Bj | Bj0  j∈D j∈D0 m−d≤j≤m−1 j0 1/e

Where the first inequality is by induction because every on the right-hand side is of the form (5.1) and involves at most m − 1 sets. Combining our two bounds we obtain the following from (5.2):

1 p e(∆+1) 1 p = 2 ≤ = . 1 p 1 ∆ + 1 3 e 2

99 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

5.2 Lecture 27 (5/Dec): Applications and further versions of the local lemma

5.2.1 Graph Ramsey lower bound

Ramsey’s theorem (the upper bound on the Ramsey number) runs in the opposite direction to Property B because it establishes the existence of something monochromatic. Not surprisingly, then, our use of the local lemma will be to provide a lower bound on Ramsey numbers. We already saw such a lower bound: a simple union bound argument gave R(k, k) ≥ (1 − o(1)) √k 2k/2. Now e 2 we will see how to improve this. √ k n k (2)−1 k 2 k/2 Theorem 93 R(k, k) ≥ max{n : e(2)(k−2) ≤ 2 }. Thus R(k, k) ≥ (1 − o(1)) e 2 .

1 k To see that the condition on n implies the conclusion, raise each side to the power k−2 . The e, (2) and −1 are inconsequential; we find that if n satisfies the following, then R(k, k) ≥ n:

−(k−2) log(k−2)+k−2 k2−k (1 + o(1))ne k−2 ≤ 2 2(k−2)

k+1 ne/k ≤ (1 + o(1))2 2

Proof: As before, sample a graph from G(n, 1/2). For each set of k vertices the “bad event” of a −(k) clique or independent set occurs with probability 21 2 . For the dependency graph, connect two subsets if they intersect in at least two vertices. The degree of this dependency graph is strictly less k n−2 than (2)(k−2) (relying on k ≥ 3, since the theorem is easy for k = 2) because this counts neighbors k n with the extra information of a distinguished edge in the intersection, so ∆ + 1 ≤ (2)(k−2). 2 This bound, due to Spencer [83], improves on the union bound by a factor of only 2. While the improvement factor is very small, qualitatively it is meaningful. It shows that a certain negative correlation among edges is possible: you have a graph which is big enough to have on average many copies of each graph of size k. (The union bound was tailored so that the expected number of copies of a k-clique was just below 1/2, and the same for a k-indep-set. Now we have twice as many places to put each of the k vertices, so we expect to see about 2k−1 of each of these subgraphs.) Yet as you look across different subgraphs of this graph, there is a kind of negative correlation which prevents the occurrence of these extreme graphs (the k-clique and the k-indep-set). The FKG inequality helps illustrate that the Lovasz´ local lemma did something truly non-local in the probability space. Any independent sampling method would result in a monotone event such as a specific k-clique, being at least as likely as the product of its constituent events (the indicators of each edge in the clique). √ 1 1 It is a major open problem to improve on either lim inf k log R(k, k) ≥ 2 or lim sup k log R(k, k) ≤ 4. Actually this gap is small by the standards of Ramsey theory. For more on the general topic see [21].

5.2.2 van der Waerden lower bound

Here is another “eventual inevitability” theorem; as before, the local lemma will provide a counter- point.

Theorem 94 (van der Waerden [88]) For every integer k ≥ 1 there is a finite W(k) such that every two- coloring of {1, . . . , W(k)} contains a monochromatic arithmetic sequence of k terms.

100 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

The best upper bound on W(k), due to Gowers [42], is2

2k+9 22 W(k) ≤ 22 . | {z } five two’s

The gap in our knowledge for this problem is even worse than for the graph Ramsey numbers: the − ( ) ≥ 2k 1 current lower bound, which we’ll see below, is W k (k+2)e . (A better bound is known for prime k.) First we show an elementary lower bound:

k−1 √ Theorem 95 W(k) > 2 2 k − 1.

Proof: Color uniformly iid. The probability of any particular sequence being monochromatic is 21−k. The union bound shows that all these events can be avoided provided

n(n − 1) 21−k < 1 (5.3) k − 1

n−1 (count n places the sequence can start, while the step size is bounded by k−1 ). The bound n ≤ k−1 √ 2 2 k − 1 implies 5.3. 2 Now for the improved bound through the local lemma:

− ( ) ≥ 2k 1 Theorem 96 (Lov´asz[30]) W k (k+2)e .

Proof: Again color uniformly iid. For a dependency graph, connect any two intersecting sequences. The degree of this graph is bounded by (n − 1)k2 k − 1 2 n−1 (k choices for which elements they have in common, k−1 possible step sizes). Thus all the bad events can be avoided if 1 21−k < , k2(n−1) e(1 + k−1 ) which in turn is implied by the bound in the statement of the lemma. The improvement here came because a union bound over approximately n2/k terms was replaced by a smaller factor of about nk. 2

5.2.3 Heterogeneous events and infinite dependency graphs

There are two generalized forms of the local lemma that come fairly easily.

2The original bound of van der Waerden is of Ackermann type growth [2]. The first primitive recursive bound, due to Shelah [80], is this. For any function f : N → N let fb : N → N be fb(1) = f (1), fb(k) = f ( fb(k − 1))(k > 1). So, letting k [ exp2(k) := 2 , the tower function is T = exp[2. Shelah’s bound is of the form Tb or in other words exp[2.

101 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

Heterogeneous events

It is not necessary that we use the same upper bound on Pr(Bj) for all j. Instead, we can allow events of various probabilities. Those which are more likely to occur, must have in-edges from events of smaller total probability. On the other hand less likely events can tolerate more in-edges (as measured by total probability). This is formulated, in a slightly circuitous way, in the following version of the lemma.

Lemma 97 Let events Bj and dependency edges E be as before. If there are xj < 1 such that for all j,

Pr(Bj) ≤ xj ∏ (1 − xk) (k,j)∈E

Then \ c Pr( Bj ) ≥ ∏(1 − xj). j

The proof method is the same. Show inductively on m that (for any subcollection of m events and  T c any ordering on them), Pr Bm | j≤m−1 Bj ≤ xm.

Infinite dependency graphs

Typically, the restriction that S is finite can be dropped due to compactness of the probability space. Specifically, suppose that—as in most applications—there is an underlying space of independent rvs Xk, k ranging over some index set U, and each Xk ranging in some compact topological space Rk. Moreover suppose that every one of the bad events Bj is a function of only finitely many of the Xk’s, ∈ ⊂ c say of k Uj U, Uj finite. Suppose moreover that each Bj is an open set in ∏k∈Uj Rk. Then each Bj is a closed set in the product topology on ∏k∈U Rk. Since the product topology is itself compact by Tychonoff’s theorem, it satisfies the Finite Intersection Property: a collection of closed sets of which any finite subcollection has nonempty intersection, has nonempty intersection. Consequently, under the additional topological assumptions made here—which are trivially satisfied if each Xk takes on only finitely many values—the supposition in the local lemma (in either formulation 90 or 97) that S is finite, may be dropped.

Example

Here is an example that takes advantage of both the above generalizations. ∗ For a word w ∈ Σ , Σ a set which we think of as an “alphabet”, let w = wn ... w1 be the reversal of 1 w and let DPal(w) = n dHamming(w, w). Large DPal means that w is far from being a palindrome. Consider the infinite ternary tree T3. The local lemma implies the existence of strongly palindrome- avoiding labelings T3:

Theorem 98 For all δ > 0 there is an integer |Σ| < ∞, and a labeling α : Vertices(T3) → Σ, such that for all words w of length > 1 that you read along a simple path in T3, DPal(w) ≥ 1 − δ.

(The proof gives |Σ| < exp(1/δ).) This is used in communication theory [76]. Comment: it is an open problem, not only for all δ > 0 but even for any δ < 1, to explicitly construct such a tree. Namely, we want an algorithm which for the rooted ternary tree, and a vertex v at distance n from the root, outputs α(v) in time poly(n).

102 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

5.3 Lecture 28 (7/Dec): Moser-Tardos branching process algorithm for the local lemma

Now we describe an algorithm for finding satisfying assignments to the local lemma. The algorithm works in great generality and achieves the same limiting threshold (whenever the algorithm is applicable) as the full local lemma; however, for simplicity, we describe it here in a slightly restricted setting. (Most notably the dependency graph will be symmetric.) Let H = (V, E) be a SAT instance (see definitions in Sec. 5.1); we can encode most applications of the local lemma in these terms. As before, say that two clauses are neighbors in the dependency graph if they share any variable (not necessarily literal). Write n = |V| (number of variables), m = |E| (number of clauses). Call the clauses T1,..., Tm according to an arbitrary ordering. We have from last time a corollary of the local lemma:

Corollary 99 Suppose every clause in T ∈ E has size k and has at most d neighbors. If d + 1 ≤ 2k/e then H is satisfiable.

Moser-Tardos Algorithm [67, 68] Pick a random assignment to V While there is an unsatisfied clause, pick the first-indexed such clause T and run Fix(T). Fix(T) Recolor the variables of T u.a.r. until it is satisfied. While T has an unsatisfied neighbor, pick the first-indexed such neighbor T0 and run Fix(T0). (“First-indexed” is a mere convenience, any deterministic order is ok, even depending on the his- tory of the algorithm so far). Observe that Fix implements a recursive or stack-based exploration analogous to Depth First Search (DFS), but it is possible for clauses to be revisited.

Theorem 100 If 4(d + 2) ≤ 2k then the Moser-Tardos algorithm finds a satisfying assignment to H in time O˜ (n + mk).

We are being loose in this presentation about the leading constant of 4 and about the d + 2. These can be improved to e and d + 1. We’re also being a bit loose about the run-time. For a bound of O(n + mk) we’ll just keep track of the number of random bits the algorithm uses. The actual run-time, which includes data structure management, will be a little larger but only by some factor of about log nm. Before presenting the proof, let’s see why what we are studying is very similar to a branching process. Fixing some clause as the root, there is an implicit tree extending out first to neighboring clauses, then to neighbors of those, and so on. (Of course there may be repetition but that works out in our favor.) The degree of this tree is d + 1, but our DFS needs to explore only a subtree of it, generated at random, in which the expected number of children of a node is bounded by (d + 1)/2k < 1. So, intuitively, what is going on is that a Fix call that is initiated by the main procedure, tends to terminate after generating a finite DFS tree. This is only of course intuition, and the formal proof follows.

Proof: The algorithm is implemented with the aid of an (infinite) random bit string z = z1 . . .. The first n bits are used for the initial assignment. Then, successive bits are used in batches of k for the Fix procedure. The choice of z amounts to uniformly choosing a path down a non-degenerate binary tree (no vertices with one child), whose leaves represent successful terminations of the algorithm. (Note,

103 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

this is the tree of random bits, and we only descend in it—it is not the graph of clauses in which we are performing a DFS-like process!) Of course the tree is infinite (we might endlessly sample bits badly). However, we will argue that with high probability we reach a leaf fairly soon. A key observation is that a call to Fix(T) has the following monotonicity property. If Fix(T) termi- nates, then after termination (a) T is satisfied, (b) any clause T0 that was satisfied before the call is still satisfied. (a) is obvious from the text of the procedure. For (b), consider the last time after Fix(T) started, at which any of the variables inside T0 were changed. This change, which left T0 unsatisfied, occurred during a call to Fix(T00) where T00 is T0 or one of its neighbors. But as we see in the procedure, we cannot have terminated Fix(T00) while T0 is unsatisfied. So (since this is all on a stack) we also cannot have terminated Fix(T). Hence, the main procedure calls Fix at most m times on any z.

Let Nt be the number of nodes that the algorithm tree has at depth t. Since the algorithm always runs the first n steps and then operates in batches of k bits, leaves of the tree occur at the levels n + sk, after s calls to Fix (whether from the main procedure or recursively from within Fix). For any such node which actually exists in the tree, what is the probability that the algorithm reaches it? Since all random seeds z are equally likely, this probability is precisely 2−n−sk. So what is the expected runtime of the tree? It is the sum of the probabilities that we reach each node. Namely, −t −n−sk ∑ Nt2 = n + k ∑ 2 Nn+sk. t≥0 s≥1 n+sk You can see that if we had, say, Nn+sk = 2 , this sum would diverge, as of course it should, since that tree has no leaves at all. But even if there are some leaves, the sum can readily diverge. We need to show the tree is thin enough that the sum converges. The method is to devise an alternative way of naming a vertex at level t = n + sk. The obvious way t is to give the bits z1 ... zt, but that allows us to name 2 vertices, and so will not do. Our naming scheme must give all vertices distinct names, yet be such that the name space for vertices at depth t is considerably smaller than 2t.

Call the vertex we are focusing on Z = (z1 ... zn+sk). Suppose that rather than being told all the bits z1 ... zn+sk, we’re instead told, in order, the arguments (names of clauses) to Fix; plus the assignment to all the variables at the time we reach Z.

Then we can determine z1 ... zn+sk by “working backwards.” The last clause, before its recoloring, had to have been in its unique unsatisfied assignment. In turn, the penultimate clause, before its recoloring, had to have been in its unique unsatisfied assignment. And so forth. How many bits are required to specify Z in this alternative way?

1. n bits for the last assignment. 2. We list, in chronological order, each clause as it is pushed on the stack (i.e., is called in Fix) and when it is done (popped off the stack). Since subsequent Pushes are neighbors in the dependency graph, this requires only lg(d + 2) bits per call (reserve one symbol for “Popping”). When we Pop all the way out to the main procedure (which is something we know has occurred from keeping track of the stack), we just need one bit per each of clause Ti, to indicate whether the main procedure calls Fix(Ti).

So, lg Nn+sk ≤ min{n + sk, n + m + sdlg(d + 2)e}.

104 Schulman: CS150 2018 CHAPTER 5. LOVASZ´ LOCAL LEMMA

Now, as above measuring runtime in terms of how many bits of z we read, we have,

−n−sk E(runtime) = n + k ∑ 2 Nn+sk s≥1 ≤ n + k ∑ 2−n−sk2n+min{sk,m+s(1+lg(d+2))} s≥1 = n + k ∑ 2min{0,m+s(1+lg(d+2)−k)} s≥1

Since we have assumed k ≥ 2 + lg(d + 2), this is

≤ n + k ∑ 2min{0,m−s} s≥1 m = n + k ∑ 1 + k ∑ 2m−s s=1 s>m = n + k(m + 1).

2

105 Bibliography

[1] I. Abraham, Y. Bartal, and O. Neiman. Advances in metric embedding theory. In STOC, 2006. [2] W. Ackermann. Zum hilbertschen aufbau der reellen zahlen. Mathematische Annalen, 99(1):118– 133, 1928. URL: http://dx.doi.org/10.1007/BF01459088. [3] M. Adams and V. Guillemin. Measure Theory and Probability. Birkhauser,¨ 1996. [4] M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Ann. of Math., 160:781–793, 2004. doi:https://doi.org/10.4007/annals.2004.160.781. [5] R. Ahlswede and D. E. Daykin. An inequality for the weights of two families of sets, their unions and intersections. Z. Wahrscheinl. V. Geb, 43:183–185, 1978. [6] M. Aizenman, H. Kesten, and C. Newman. Uniqueness of the infinite cluster and continuity of connectivity functions for short and long range percolation. Comm. Math. Phys., 111:505–531, 1987. [7] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. J. Algorithms, 7:567–583, 1986. [8] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 3rd edition, 2008. [9] M. Artin. Algebra. Prentice-Hall, 1991. [10] I. Benjamini and O. Schramm. Percolation beyond zd, many questions and a few answers. Electron. Commun. Probab., 1:71–82, 1996. URL: http://dx.doi.org/10.1214/ECP.v1-978, doi:10.1214/ECP.v1-978. [11] B. Berger. The fourth moment method. SIAM J. Comput., 26(4):1188–1207, 1997. [12] S. Berkowitz. On computing the determinant in small parallel time using a small number of processors. Information Processing Letters, 18:147–150, 1984. [13] Bernstein inequality. In Encyclopedia of Mathematics. Springer and Europ. Math. Soc. URL: http://www.encyclopediaofmath.org. [14] S. N. Bernstein. On a modification of Chebyshev’s inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math. 1, 1924. [15] S. N. Bernstein. On certain modifications of Chebyshev’s inequality. Doklady Akademii Nauk SSSR, 17(6):275–277, 1937. [16] P. Billingsley. Probability and Measure. Wiley, third edition, 1995. [17] B. Bollobas´ and A. G. Thomason. Threshold functions. Combinatorica, 7(1):35–38, 1987. URL: http://dx.doi.org/10.1007/BF02579198, doi:10.1007/BF02579198.


[18] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52:46–52, 1985.
[19] A. L. Chistov. Fast parallel calculation of the rank of matrices over a field of arbitrary characteristic. In Proc. Conf. Foundations of Computation Theory, pages 63–69. Springer-Verlag, 1985.
[20] D. Conlon. A new upper bound for diagonal Ramsey numbers. Ann. of Math., 170:941–960, 2009.
[21] D. Conlon, J. Fox, and B. Sudakov. Recent developments in graph Ramsey theory. In Surveys in Combinatorics, pages 49–118. Cambridge University Press, 2015.
[22] S. A. Cook. A taxonomy of problems with fast parallel algorithms. Information and Control, 64:2–22, 1985.
[23] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[24] L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Computing, 5:618–623, 1976.
[25] D. E. Daykin and L. Lovász. The number of values of Boolean functions. J. London Math. Soc., 2(12):225–230, 1976.
[26] R. A. DeMillo and R. J. Lipton. A probabilistic remark on algebraic program testing. Information Processing Letters, 7(4):193–195, 1978. URL: http://www.sciencedirect.com/science/article/pii/0020019078900674.
[27] L. Engebretsen, P. Indyk, and R. O’Donnell. Derandomized dimensionality reduction with applications. In SODA, 2002.
[28] P. Erdős. Some remarks on the theory of graphs. Bull. Amer. Math. Soc., 53:292–294, 1947.
[29] P. Erdős. Graph theory and probability. Canad. J. Math., 11:34–38, 1959.
[30] P. Erdős and L. Lovász. Problems and results on 3-chromatic hypergraphs and some related questions. In Infinite and Finite Sets. North-Holland, 1975.
[31] P. Erdős and G. Szekeres. A combinatorial problem in geometry. Compositio Math., 2:463–470, 1935.
[32] Y. Filmus. Khintchine-Kahane using Fourier Analysis. 2011. URL: http://www.cs.toronto.edu/~yuvalf/.
[33] P. C. Fishburn and N. J. A. Sloane. The solution to Berlekamp’s switching game. Discrete Mathematics, 74:263–290, 1989.
[34] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre. Correlation inequalities on some partially ordered sets. Commun. Math. Phys., 22:89–103, 1971.
[35] M. Fréchet. Sur quelques points du calcul fonctionnel. Rend. Circ. Matem. Palermo, 22:1–72, 1906. doi:10.1007/BF03018603.
[36] R. Freivalds. Probabilistic machines can use less running time. In IFIP Congress, pages 839–842, 1977.
[37] E. Friedgut and G. Kalai. Every monotone graph property has a sharp threshold. Proc. Amer. Math. Soc., 124:2993–3002, 1996.
[38] H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for general graph-matching problems. J. ACM, 38(4):815–853, 1991.


[39] F. Le Gall. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 296–303, 2014. arXiv:1401.7714.
[40] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. Assoc. Comput. Mach., 28:289–304, 1981.
[41] G. H. Gonnet. Determining equivalence of expressions in random polynomial time. In Proc. Sixteenth Annual ACM Symposium on Theory of Computing (STOC), pages 334–341. ACM, 1984. URL: http://doi.acm.org/10.1145/800057.808698.
[42] W. T. Gowers. A new proof of Szemerédi’s theorem. Geom. Funct. Anal., 11(3):465–588, 2001. URL: http://dx.doi.org/10.1007/s00039-001-0332-9, doi:10.1007/s00039-001-0332-9.
[43] R. L. Graham and V. Rödl. Numbers in Ramsey theory. In Surveys in Combinatorics, London Math. Soc. Lecture Note Ser. Vol. 123, pages 111–153. Cambridge University Press, 1987.
[44] R. L. Graham, B. L. Rothschild, and J. H. Spencer. Ramsey Theory. Wiley, 2nd edition, 1990.
[45] T. E. Harris. Lower bound for the critical probability in a certain percolation process. Math. Proc. Cambridge Philos. Soc., 56:13–20, 1960.
[46] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001.
[47] M. Heydenreich and R. van der Hofstad. Progress in high-dimensional percolation and random graphs. Springer, 2017.
[48] W. E. Hickson. In Oxford Dictionary of Quotations, page 251. Oxford University Press, 3rd edition, 1979.
[49] R. Holley. Remarks on the FKG inequalities. Communications in Mathematical Physics, 36:227–231, 1974. URL: http://dx.doi.org/10.1007/BF01645980.
[50] P. Indyk and J. Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of Discrete and Computational Geometry, pages 177–196. CRC Press, 2004.
[51] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189–206, 1984.
[52] V. Kabanets and R. Impagliazzo. Derandomizing polynomial identity tests means proving circuit lower bounds. Comput. Complex., 13:1–46, 2004.
[53] G. Kalai and L. J. Schulman. Quasi-random multilinear polynomials. Isr. J. Math., 2018. To appear; arXiv:1804.04828.
[54] R. M. Karp, E. Upfal, and A. Wigderson. Constructing a Maximum Matching is in Random NC. Combinatorica, 6(1):35–48, 1986.
[55] R. M. Karp and A. Wigderson. A fast parallel algorithm for the maximal independent set problem. In Proc. 16th ACM STOC, pages 266–272, 1984.
[56] A. Khintchine. Über dyadische Brüche. Math. Z., 18:109–116, 1923.
[57] D. J. Kleitman. Families of non-disjoint subsets. J. Combin. Theory, 1:153–155, 1966.
[58] D. Koller and N. Megiddo. Constructing small sample spaces satisfying given constraints. SIAM J. Discret. Math., 7:260–274, May 1994. Previously in 25th STOC, pp. 268–277, 1993. URL: http://portal.acm.org/citation.cfm?id=178422.178455.
[59] D. C. Kozen. The design and analysis of algorithms. Springer, 1992.


[60] R. Latała and K. Oleszkiewicz. On the best constant in the Khinchin-Kahane inequality. Studia Mathematica, 109(1):101–104, 1994. URL: http://eudml.org/doc/216056.
[61] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[62] R. Lidl and H. Niederreiter. Finite Fields. Cambridge U Press, 2nd edition, 1997. (Theorem 6.13).
[63] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.
[64] M. Luby. A simple parallel algorithm for the maximal independent set problem. In Proc. 17th ACM STOC, pages 1–10, 1985.
[65] R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, 2017. URL: http://pages.iu.edu/~rdlyons/.
[66] S. Micali and V. V. Vazirani. An O(√|V| · |E|) algorithm for finding maximum matching in general graphs. In Proc. 21st FOCS, pages 17–27. IEEE, 1980.
[67] R. A. Moser. A constructive proof of the Lovász local lemma. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC ’09, pages 343–350, New York, NY, USA, 2009. ACM. URL: http://doi.acm.org/10.1145/1536414.1536462.
[68] R. A. Moser and G. Tardos. A constructive proof of the general Lovász local lemma. J. ACM, 57(2):11:1–11:15, 2010. URL: http://doi.acm.org/10.1145/1667053.1667060.
[69] K. Mulmuley, U. V. Vazirani, and V. V. Vazirani. Matching is as easy as matrix inversion. Combinatorica, 7:105–113, 1987.
[70] D. Mumford. The dawning of the age of stochasticity. In V. Arnold, M. Atiyah, P. Lax, and B. Mazur, editors, Mathematics: Frontiers and Perspectives. AMS, 2000.
[71] C. M. Newman and L. S. Schulman. Infinite clusters in percolation models. Journal of Statistical Physics, 26(3):613–628, 1981. URL: http://dx.doi.org/10.1007/BF01011437.
[72] V. Pan. Fast and efficient parallel algorithms for the exact inversion of integer matrices. In S. N. Maheshwari, editor, Foundations of Software Technology and Theoretical Computer Science, volume 206 of Lecture Notes in Computer Science, pages 504–521. Springer, 1985. URL: http://dx.doi.org/10.1007/3-540-16042-6_29.
[73] S. Rajagopalan and L. J. Schulman. Verification of identities. SIAM J. Comput., 29(4):1155–1163, 2000.
[74] F. P. Ramsey. On a problem of formal logic. Proc. London Math. Soc., 48:264–286, 1930.
[75] I. N. Sanov. On the probability of large deviations of random variables. Mat. Sbornik, 42:11–44, 1957.
[76] L. J. Schulman. Coding for interactive communication. IEEE Transactions on Information Theory, 42(6):1745–1756, 1996.
[77] J. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM, 27:701–717, 1980.
[78] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423; 623–656, 1948.


[79] J. B. Shearer. On a problem of Spencer. Combinatorica, 5:241–245, 1985.
[80] S. Shelah. Primitive recursive bounds for van der Waerden numbers. J. Amer. Math. Soc., 1:683–697, 1988.
[81] L. A. Shepp. The XYZ-conjecture and the FKG-inequality. Ann. Probab., 10(3):824–827, 1982.
[82] D. Sivakumar. Algorithmic derandomization via complexity theory. In STOC, 2002.
[83] J. Spencer. Asymptotic lower bounds for Ramsey functions. Discrete Math., 20:69–76, 1977.
[84] N. Ta-Shma. A simple proof of the isolation lemma. ECCC TR15-080, 2015. URL: http://eccc.hpi-web.de/report/2015/080/.
[85] A. Thomason. An upper bound for some Ramsey numbers. J. Graph Theory, 12, 1988.
[86] W. T. Tutte. The factorization of linear graphs. J. London Math. Soc., s1-22(2):107–111, 1947. URL: http://jlms.oxfordjournals.org/content/s1-22/2/107.short.
[87] L. Valiant, S. Skyum, S. Berkowitz, and C. Rackoff. Fast parallel computation of polynomials using few processors. SIAM J. Comput., 12(4):641–644, 1983.
[88] B. L. van der Waerden. Beweis einer Baudetschen Vermutung. Nieuw. Arch. Wisk., 15:212–216, 1927.
[89] Wikipedia. Folded normal distribution. [Online; accessed 6-November-2016]. URL: https://en.wikipedia.org/w/index.php?title=Folded_normal_distribution&oldid=748178170.
[90] R. Zippel. Probabilistic algorithms for sparse polynomials. In E. W. Ng, editor, Symbolic and Algebraic Computation, volume 72 of Lecture Notes in Computer Science, pages 216–226. Springer, 1979. URL: http://dx.doi.org/10.1007/3-540-09519-5_73, doi:10.1007/3-540-09519-5_73.
