
Probability and Caltech CS150a, 2020 Lecture Notes

Leonard J. Schulman

Glossary

a.s. = almost surely = with probability 1.

1 = the constant one, and when used in linear algebra, the all-ones vector.

⟦A⟧ = 1A = the indicator function for occurrence of event A. (An event is a measurable subset of the sample space. The indicator then is the function which is 1 on that subset, and 0 elsewhere.) We will often conflate an event and its indicator function, so the notations A ⊆ B and ⟦A⟧ ≤ ⟦B⟧ will occur interchangeably.

N+ = the positive integers

[n] = the set {1, . . . , n}, provided n ∈ N+.

Let B be a collection of countably many events Bi. The event that infinitely many Bi occur (also called "B infinitely often") is variously written lim sup B or ⟦B i.o.⟧.

⊂ always indicates strict containment.

Ac = the complement of event A.

Caltech CS150 2020.

Contents

1 Some basic probability 7
1.1 Lecture 1 (30/Sep): Appetizers ...... 7
1.2 Lecture 2 (2/Oct) Some basics ...... 8
1.2.1 Measure ...... 8
1.2.2 Measurable functions, random variables and events ...... 9
1.3 Lecture 3 (5/Oct): Linearity of expectation, union bound, existence theorems ...... 12
1.3.1 Countable additivity ...... 13
1.3.2 Coupon collector ...... 13
1.3.3 Application: the probabilistic method ...... 14
1.4 Lecture 4 (7/Oct): Upper and lower bounds ...... 15
1.4.1 Union bound ...... 15
1.4.2 Using the union bound in the probabilistic method: Ramsey theory ...... 15
1.4.3 Bonferroni inequalities ...... 16
1.5 Lecture 5 (9/Oct): Tail events. Borel-Cantelli, Kolmogorov 0-1, percolation ...... 19
1.5.1 Borel-Cantelli ...... 19
1.5.2 B-C II: a partial converse to B-C I ...... 19
1.5.3 Kolmogorov 0-1, percolation ...... 20
1.6 Lecture 6 (12/Oct): , gambler's ruin ...... 22
1.7 Lecture 7 (14/Oct): Percolation on trees. Basic inequalities ...... 24
1.7.1 A simple model for (among other things) epidemiology: percolation on the regular d-ary tree, d > 1 ...... 24
1.7.2 Markov inequality (the simplest tail bound) ...... 25
1.7.3 Variance and the Chebyshev inequality: a second tail bound ...... 25
1.7.4 Omitted material 2020: Cont. basic inequalities, probabilistic method ...... 26
1.8 Lecture 8 (16/Oct): FKG inequality ...... 28
1.8.1 Omitted 2020: Achieving expectation in MAX-3SAT ...... 31


2 Algebraic Fingerprinting 33
2.1 Lecture 9 (19/Oct): Fingerprinting with ...... 33
2.1.1 Polytime Complexity Classes Allowing Randomization ...... 33
2.1.2 Verifying Multiplication ...... 34
2.1.3 Verifying Associativity ...... 35
2.2 Lecture 10 (21/Oct): Cont. associativity; perfect matchings, polynomial identity testing ...... 38
2.2.1 Matchings ...... 38
2.2.2 Bipartite perfect matching: deciding existence ...... 38
2.3 Lecture 11 (23/Oct): Cont. perfect matchings; polynomial identity testing ...... 41
2.3.1 Polynomial identity testing ...... 41
2.3.2 Deciding existence of a perfect matching in a graph ...... 42
2.4 Lecture 12 (26/Oct): Parallel computation: finding perfect matchings in general graphs. Isolating lemma ...... 44
2.4.1 Parallel computation ...... 44
2.4.2 Sequential and parallel linear algebra ...... 44
2.4.3 Finding perfect matchings in general graphs, in parallel. The Isolating Lemma ...... 44
2.5 Lecture 13 (28/Oct): Finding a perfect matching, in RNC ...... 47

3 Concentration of Measure 49
3.1 Lecture 14 (30/Oct): Independent rvs: data processing, Chernoff bound, applications ...... 49
3.1.1 Two facts about independent rvs ...... 49
3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk) ...... 50
3.1.3 Application: set discrepancy ...... 51
3.1.4 Entropy and Kullback-Leibler divergence ...... 52
3.2 Lecture 15 (2/Nov): CLT. Stronger Chernoff bound and applications. Start Shannon coding ...... 54
3.2.1 ...... 54
3.2.2 Chernoff bound using divergence; robustness of BPP ...... 54
3.2.3 Balls and bins ...... 56
3.2.4 Preview of Shannon's coding theorem ...... 56
3.3 Lecture 16 (4/Nov): Application of large deviation bounds: Shannon's coding theorem ...... 58
3.4 Lecture 17 (6/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions ...... 60
3.4.1 Gale-Berlekamp game ...... 60
3.4.2 Moment generating functions, Chernoff bound for general distributions ...... 62
3.5 Lecture 18 (9/Nov): Metric spaces ...... 64
3.5.1 Metric space examples ...... 64
3.5.2 Embedding for n points in L2 ...... 65
3.5.3 Normed spaces ...... 66
3.5.4 Exponential savings in dimension for any fixed distortion ...... 66


3.6 Lecture 19 (11/Nov): Johnson-Lindenstrauss embedding ...... 67
3.6.1 The original method ...... 67
3.6.2 JL: a similar, and easier to analyze, method ...... 69

3.7 Lecture 20 (13/Nov): Bourgain embedding X → Lp, p ≥ 1...... 73

3.7.1 Embedding into L1: Weighted Fréchet embeddings ...... 73
3.7.2 Good things can happen ...... 74
3.7.3 Aside: Hölder's inequality ...... 76

4 Limited independence 77
4.1 Lecture 21 (16/Nov): , improved proof of coding theorem using linear codes ...... 77
4.2 Lecture 22 (18/Nov): Pairwise independence, second moment inequality, G(n, p) thresholds ...... 80
4.2.1 Threshold for H as a subgraph in G(n, p) ...... 80

4.2.2 Most pairs independent: threshold for K4 in G(n, p) ...... 81
4.3 Lecture 23 (20/Nov): Limited independence: near-pairwise for primes, 4-wise for Khintchine-Kahane ...... 83
4.3.1 Turán's proof of a theorem of Hardy and Ramanujan ...... 83
4.3.2 4-wise independent random walk ...... 85
4.4 Lecture 24 (23/Nov): Khintchine-Kahane for 4-wise independence; begin MIS in NC ...... 86
4.4.1 Log concavity of moments and Berger's bound ...... 86
4.4.2 Khintchine-Kahane for 4-wise independent rvs ...... 87
4.4.3 Khintchine-Kahane from Paley-Zygmund (omitted in class) ...... 87
4.4.4 Maximal Independent Set in NC ...... 88
4.5 Lecture 25 (25/Nov): Luby's parallel algorithm for maximal independent set ...... 90
4.5.1 Descent Processes ...... 91
4.6 Lecture 26 (30/Nov): Limited linear independence, limited statistical independence, error correcting codes ...... 93
4.6.1 Begin derandomization from small sample spaces ...... 93
4.6.2 Generator matrix and parity check matrix ...... 93
4.7 Lecture 27 (2/Dec): Limited linear independence, limited statistical independence, error correcting codes ...... 96
4.7.1 Constructing C from M ...... 96
4.7.2 Proof of Thm (93) Part (1): Upper bound on the size of k-wise independent sample spaces ...... 96
4.7.3 Back to Gale-Berlekamp ...... 98
4.7.4 Back to MIS ...... 98

5 Special topic 99
5.1 Lecture 28 (4/Dec): factored ...... 99


A Material omitted in lecture 102
A.1 Paley-Zygmund in-probability bound, applied to the 4-wise indep. Khintchine-Kahane ...... 102

Bibliography 104

Chapter 1

Some basic probability theory

1.1 Lecture 1 (30/Sep): Appetizers

1. N gentlemen check their hats in the lobby of the opera, but after the performance the hats are handed back at random. How many men, on average, get their own hat back?

2. Measure the length of a long string coiled under a rectangular glass tabletop. You have an ordinary rigid ruler (longer than the sides of the table).

3. On the table before us are 10 dots, and in our pocket are 10 nickels. Prove the coins can be placed on the table (no two overlapping) in such a way that all the dots are covered.

4. The envelope swap paradox: You're on a TV game show and the host offers you two identical-looking envelopes, each of which contains a check in your name from the TV network. You pick whichever envelope you like and take it, still unopened. Then the host explains: one of the checks is written for a sum of $N (N > 0), and the other is for $10N. Now, he says, it's 50-50 whether you selected the small check or the big one. He'll give you a chance, if you like, to swap envelopes. It's a good idea for you to swap, he explains, because your expected net gain is (with $m representing the sum currently in hand):

E(gain) = (1/2)(10m − m) + (1/2)(m/10 − m) = (81/20)m

How can this be? 5. Unbalancing lights (Gale-Berlekamp): You’re given an n × n grid of lightbulbs. For each bulb, at position (i, j), there is a switch bij; there is also a switch ri on each row and a switch cj on each column. The (i, j) bulb is lit if bij + ri + cj is odd.

What is the greatest f(n) such that for any setting to the bij's, you can set the row and column switches to light at least n²/2 + f(n) bulbs?


1.2 Lecture 2 (2/Oct) Some basics

1.2.1 Measure

Frequently one can "get by" with a naïve treatment of probability theory: you can treat random variables quite intuitively so long as you maintain Bayes' law for conditional probabilities of events:

Pr(A1|A2) = Pr(A1 ∩ A2) / Pr(A2)

However, that’s not good enough for all situations, so we’re going to be more careful, and me- thodically answer the question, “what is a random ?” (For a philosophical and historical discussion of this question see Mumford in [75].) First we need measure spaces. Let’s start with some standard examples.

1. Z with the counting measure.

2. A finite set with the uniform measure. In particular, a "fair coin" is the set {H, T}, each having probability 1/2.

3. (a) R with the Lebesgue measure, i.e., the measure (general definition momentarily) in which intervals have measure proportional to their length: µ([a, b]) = b − a for b ≥ a. (b) Likewise, [0, 1] with the Lebesgue measure.

4. Radial measure: the measure on R2 induced by, for infinitesimal intervals (r, r + dr) and (θ, θ + dθ), µ((r, r + dr) × (θ, θ + dθ)) = r dr dθ.

As we see, a measure µ assigns a nonnegative value to (some) subsets of a universe. Let's see what are the formal properties we want from these examples. As we just hinted, we don't necessarily assign a measure to all subsets of the universe; only to the measurable sets. In order to make sense of this, we need to define the notion of a σ-algebra (also known as a σ-field).

Definition 1. A σ-algebra (M, M̃) is a set M along with a collection M̃ of subsets of M (called the measurable sets) which satisfy: (1) ∅ ∈ M̃, and (2) M̃ is closed under complement and countable intersection.

It follows also that M ∈ M̃ and M̃ is closed under countable union (de Morgan). By induction this gives a closure property: we can take any finite nesting of the form, a countable union of countable intersections of . . . of countable unions of measurable sets, and the result will be a measurable set.

Definition 2. A measure µ on a σ-algebra (M, M˜ ) is a function

µ : M̃ → [0, ∞] (1.1)

that is countably additive, that is, for any pairwise disjoint S1, S2, ... ∈ M̃,

µ(∪ Si) = ∑ µ(Si). (1.2)

(M, M̃, µ) is called a measure space. If we also assume µ(M) = 1 then it is called a probability space.


Let us see some properties of measure spaces:

I. µ(∅) = 0 since µ(∅) + µ(∅) = µ(∅ ∪ ∅) = µ(∅).

II. The modular identity µ(S) + µ(T) = µ(S ∩ T) + µ(S ∪ T) holds because necessarily S − T, T − S and S ∩ T are measurable, and both sides of the equation may be decomposed into the same linear combination of the measures of these sets. (The set S − T is S ∩ (¬T).) This identity is sometimes also called the lattice or valuation property.

III. From the modular identity and nonnegativity, S ⊆ T ⇒ µ(S) ≤ µ(T).

Example. I mentioned above the example of Lebesgue measure on R. So now, more formally: the σ-algebra it uses is called the Borel σ-algebra, which is the (smallest) σ-field that contains all the open intervals (a, b), a < b. That is, a set is measurable if you can write it by starting from open intervals, and finitely-many-times applying the operations of complement and countable union. Finally the Lebesgue measure on this σ-algebra is the measure induced by setting µ((a, b)) = b − a.
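The modular identity is easy to check exhaustively in the very simplest setting, the counting measure µ(S) = |S| on a finite universe. A short sketch (the 5-point universe is an arbitrary choice):

```python
from itertools import combinations

# Counting measure on a finite universe: mu(S) = |S|.
# Check the modular identity mu(S) + mu(T) = mu(S ∩ T) + mu(S ∪ T)
# over all pairs of subsets of a 5-point universe.
universe = range(5)
subsets = [frozenset(c) for r in range(6) for c in combinations(universe, r)]

for S in subsets:
    for T in subsets:
        assert len(S) + len(T) == len(S & T) + len(S | T)
print("checked", len(subsets) ** 2, "pairs")  # 32^2 = 1024 pairs
```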

1.2.2 Measurable functions, random variables and events

A measurable function is a mapping X from one σ-algebra, say (M1, M̃1), into another, say (M2, M̃2), such that pre-images of measurable sets are measurable, that is to say, if T ∈ M̃2, then X^{−1}(T) ∈ M̃1.

We will be entirely concerned with the situation that the domain M1 is also equipped with a measure, µ1. In this case µ1 induces a push-forward measure for sets T ∈ M̃2:

µ2(T) := µ1(X^{−1}(T)). (1.3)

Definition 3. A random variable is a measurable function X whose domain M1 is a probability space. In this case M1 is called the sample space of the random variable X.

The range of the random variable, M2, can be many things, for example:

• M2 = R, with the σ-field consisting of rays (a, ∞), rays [a, ∞), and any set formed out of these by closing under the operations of complement and countable union. (In CS language, any other measurable set is formed by a finite-depth formula whose leaves are rays of the afore- mentioned type, and each internal node is either a complementation or a countable union.) Sometimes it is convenient to use the “extended real line,” the real line with ∞ and −∞ adjoined, as the base set.

• M2 = names of people eligible for a draft which is going to be implemented by a lottery. The σ-field here is 2^{M2}, namely the power set of M2.

• M2 = deterministic algorithms for a certain computational problem. (On a countably infinite set M2, just as on a finite set, we can use the power set as the σ-field.)

Sanity check Let’s check that the push-forward definition 1.3 does actually give us a measure. Taking pairwise disjoint S1, S2,... ∈ M˜ 2,

[ −1 [ [ −1 µ2( Si) = µ1(X ( Si)) = µ1( X (Si)) −1 −1 = ∑ µ1(X (Si)) Since the Si are disjoint, so are the X (Si) = ∑ µ2(Si).

Events With any measurable subset T of M2 we associate the event that X lies in T; if X is understood, we


simply call this the event T. This event has the probability Pr(X ∈ T) (or if X is understood, Pr(T)) dictated by

Pr(X ∈ T) = µ1(X^{−1}(T)). (1.4)

The indicator of this event is the function ⟦T⟧ or 1T or ⟦X ∈ T⟧:

1T : M1 → {0, 1} ⊆ R
1T(y) = 1 if y ∈ X^{−1}(T), and 0 otherwise.

The basic but key property is that

Pr(X ∈ T) = ∫ 1T dµ = E(1T). (1.5)

It follows that probabilities of events satisfy:

1. Pr(∅) = 0 ("the experiment has an outcome")

2. Pr(M2) = 1 ("the experiment has only one outcome")

3. Pr(A) ≥ 0

4. Pr(A) + Pr(B) = Pr(A ∩ B) + Pr(A ∪ B)

Note that events can themselves be thought of as random variables taking values in {0, 1}; indeed we will sometimes define an event directly, rather than creating it out of some other random variable X and subset T of the image of X.

For the most part we will sidestep measure theory—one needs it to cure pathologies but we will be studying healthy patients. However I recommend Adams and Guillemin [2] or Billingsley [18].

Often when studying probability one may suppress any mention of the sample space in favor of abstract axioms of probability. For us the situation will be quite different. While starting out as a formality, explicit sample spaces will soon play a significant role.

Joint distributions

Given two random variables X1 : M → M1, X2 : M → M2 (where each Mi has associated with it a σ-field (Mi, M̃i)), we can form the "product" random variable (X1, X2) : M → M1 × M2. The same goes for any countable collection of rvs on M, and it is important that we can do this for countable collections; for instance we want to be able to discuss unbounded sequences of coin tosses. Given a product rv

X⃗ = (X1, X2, ...) : M → M1 × M2 × ...,

its marginals are probability distributions on each of the measure spaces Mi. These distributions are defined by, for A ∈ M̃i,

Pr(Xi ∈ A) = Pr(X⃗ ∈ M1 × M2 × ... × Mi−1 × A × Mi+1 × ...)

That is, you simply ignore what happens to the other rvs, and assign to set A ∈ M̃i the "push-forward" probability µ(Xi^{−1}(A)).

X1, X2, . . . are independent if for any finite S = {s1, ..., sn} and any A_{s1} ∈ M̃_{s1}, ..., A_{sn} ∈ M̃_{sn}, we have

Pr((X_{s1}, ..., X_{sn}) ∈ A_{s1} × · · · × A_{sn}) = Pr(X_{s1} ∈ A_{s1}) · · · Pr(X_{sn} ∈ A_{sn}).


(Note that Pr((X1, X2) ∈ A1 × A2) is just another way of writing Pr((X1 ∈ A1) ∧ (X2 ∈ A2)).)

Example: a pair of fair dice. Let M be the set of 36 ways in which two dice can roll, each outcome having probability 1/36. On this sample space we can define various useful functions: e.g., Xi = the value of die i (i = 1, 2); Y = X1 + X2. X1 and X2 are independent; X1 and Y are not independent.
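The 36-point sample space is small enough to verify both claims exhaustively. A short sketch (the helper pr is an ad hoc name; it evaluates an event given as a predicate on outcomes):

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of a pair of fair dice.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def pr(event):
    """Probability of the event {w : event(w)} on the uniform space."""
    return sum(p for w in omega if event(w))

# X1 and X2 are independent: Pr(X1 = a and X2 = b) factors, for all a, b.
for a in range(1, 7):
    for b in range(1, 7):
        assert pr(lambda w: w[0] == a and w[1] == b) == \
               pr(lambda w: w[0] == a) * pr(lambda w: w[1] == b)

# X1 and Y = X1 + X2 are not independent: Pr(X1 = 1 and Y = 12) = 0,
# while Pr(X1 = 1) * Pr(Y = 12) = (1/6)(1/36) = 1/216.
lhs = pr(lambda w: w[0] == 1 and w[0] + w[1] == 12)
rhs = pr(lambda w: w[0] == 1) * pr(lambda w: w[0] + w[1] == 12)
print(lhs, rhs)  # 0 1/216
```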

X1, ... : M → T are independent and identically distributed (iid) if they are independent and all marginals are identical. If T is finite and the marginals are the uniform distribution, we say that the rvs are uniform iid. We use the same terminology in case T is infinite but of finite measure (e.g., Lebesgue measure on a compact set), and the marginal on T is proportional to this measure.

Conditional Probabilities are defined by

Pr(X ∈ A | X ∈ B) = Pr(X ∈ A ∩ B) / Pr(X ∈ B) (1.6)

provided the denominator is positive.

An old example. You meet Mr. Smith and find out that he has exactly two children, at least one of whom is a girl. What is the probability that both are girls? Answer¹: 1/3.

¹As usual in such examples we suppose that the sexes of the children are uniform iid. Some facts from general knowledge should be enough for you to doubt uniformity, and perhaps even independence. See e.g., a Stanford genetics post about people, something about parrots, and something less definite about mammals.
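Under the footnote's uniform-iid assumption, the answer can be checked by brute force over the four equally likely orderings:

```python
from itertools import product

# The four equally likely sex orderings of two children.
outcomes = list(product("GB", repeat=2))                  # GG, GB, BG, BB
at_least_one_girl = [w for w in outcomes if "G" in w]     # GG, GB, BG
both_girls = [w for w in at_least_one_girl if w == ("G", "G")]

# Conditional probability per (1.6): Pr(both | at least one girl).
print(len(both_girls), "/", len(at_least_one_girl))  # 1 / 3
```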


1.3 Lecture 3 (5/Oct): Linearity of expectation, union bound, existence theorems

Taking (1.6) and applying induction, we have that if Pr(A1 ∩ ... ∩ An−1) > 0, then: Chain rule for conditional probabilities

Pr(A1 ∩ ... ∩ An) = Pr(An|A1 ∩ ... ∩ An−1) · Pr(An−1|A1 ∩ ... ∩ An−2) ··· Pr(A2|A1) · Pr(A1).

(What happens if Pr(A1 ∩ ... ∩ An) = 0? Then we can't necessarily write this, because some denominator in the chain might be 0. But we can focus on the smallest i at which Pr(A1 ∩ ... ∩ Ai) = 0, and there say that either Pr(A1) = 0 or, for i > 1, Pr(Ai|A1 ∩ ... ∩ Ai−1) = 0.)

Real-valued random variables; expectations

If X is a real-valued rv on a sample space with measure µ, its expectation (a.k.a. average, or first moment) is given by the following

E(X) = ∫ X dµ (1.7)

which is defined in the Lebesgue manner by²

∫ X dµ = lim_{h→0} ∑_{j integer} jh · Pr(jh ≤ X < (j + 1)h) (1.8)

provided the corresponding sum of absolute values converges:

∑_{j integer} |jh| · Pr(jh ≤ X < (j + 1)h) < ∞. (1.9)

It is not hard to innocently encounter cases where the integral is not defined. Stand a meter from an infinite wall, holding a laser pointer. Spin so you're pointing at a uniformly random orientation. If the laser pointer is not shining at a point on the wall (which happens with probability 1/2), repeat until it does. The displacement of the point you're pointing at, relative to the point closest to you on the wall, is tan α meters for α uniformly distributed in (−π/2, π/2). You could be forgiven for thinking the average displacement "ought" to be 0, but the integral does not converge absolutely, because, using the substitution x = cos α (so that dx = − sin α dα):

∫_0^{π/2} tan α dα = ∫_{cos(0)}^{cos(π/2)} (sin α(x) / cos α(x)) · (−1 / sin α(x)) dx = ∫_0^1 (1/x) dx = [log x]_0^1 = +∞

Nonnegative integrands. As we see in (1.9), it is essential to be able to characterize whether an integral of a nonnegative function converges. (That equation is a discretization of ∫_{−∞}^{∞} |X| dµ.) It is worth pointing out that for probability measure µ supported on the nonnegative integers, ∫ x dµ(x) = ∑_{n≥1} n µ({n}) = ∑_{n≥1} µ({n, n+1, ...}). So the integral converges iff the sequence µ({n, n+1, ...}) has a finite sum. Exercise: State and verify the analogous statement when µ is supported on the nonnegative reals.
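As a concrete instance of the tail-sum identity, take µ({n}) = 2^{−n}, for which µ({n, n+1, ...}) = 2^{1−n} and both sums equal 2. A truncated numerical check (the cutoff N is an arbitrary choice; the discarded tails are negligible):

```python
from fractions import Fraction

# mu on the positive integers with mu({n}) = 2^(-n).
# Tail-sum identity: sum_n n*mu({n}) = sum_n mu({n, n+1, ...}) = 2.
N = 200
mean = sum(n * Fraction(1, 2 ** n) for n in range(1, N + 1))
tail_sums = sum(Fraction(1, 2 ** (n - 1)) for n in range(1, N + 1))
print(float(mean), float(tail_sums))  # both ≈ 2.0
```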

²One can be, and sometimes must be, more scrupulous about the measure theory. When in doubt consult the texts of Adams & Guillemin, Billingsley, or Williams. But we will try to stay in the company of Benjamin Franklin:

But he knew little out of his way, and was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in everything said, or was for ever denying or distinguishing upon trifles, to the disturbance of all conversation. He soon left us. — The Autobiography of Benjamin Franklin, chapter 5


1.3.1 Countable additivity

Back to the theory. Naturally, being defined by an integral (1.7), expectation of real-valued rvs is linear. That is, for c ∈ R, E(cX) = cE(X), and if we have two real-valued rvs X, Y on the same sample space, we can form their sum rv X + Y. No matter the joint distribution of X and Y, we have, providing their expectations on the RHS are well defined:

E(X + Y) = E(X) + E(Y) linearity of expectation

To believe this you have only to verify: Exercise: Absolute convergence of ∫ X dµ and ∫ Y dµ implies absolute convergence of ∫ (X + Y) dµ. Because ∫ |X + Y| dµ ≤ ∫ (|X| + |Y|) dµ < ∞. In the nonnegative case we have also countable additivity:

Exercise: Let X1, . . . be nonnegative real-valued with expectations E(Xi). Then

E(∑ Xi) = ∑ E(Xi).

Let’s return to one of our appetizers, the coins-on-dots problem (3): I don’t want to give this away entirely, but here’s a hint: what is the fraction of the plane covered by unit disks packed in a hexagonal pattern?

1.3.2 Coupon collector

There are n distinct types of coupons and you want to own the whole set. Each draw is uniformly distributed, no matter what has happened earlier. What is the expected time to elapse until you own the set?

Think of the coupons being sampled at times 1, 2, . . .. Let Yi = the first time at which we are in state Si, which is when we have seen exactly i different kinds of coupons (i = 0, . . . , n). So Y0 = 0, Y1 = 1. Let Xi = Yi − Yi−1. In state Si−1, in each round there is probability (n − i + 1)/n that we see a new kind of coupon, until that finally happens. That is to say, Xi is geometrically distributed with pi = (n − i + 1)/n. We can work out E(Xi) from the geometric sum, but there’s a slicker way.

If we’re in state Si−1, then with probability (n − i + 1)/n we’re in Si in one more time step, else we’re back in the same situation.

(Diagram (1.10): the Markov chain S0 → S1 → · · · → Sn−1 → Sn. The transition S_{i−1} → S_i is taken with probability (n − i + 1)/n, so the forward probabilities are 1, 1 − 1/n, ..., 2/n, 1/n; otherwise the chain stays put, so the self-loop probabilities at S1, ..., Sn−1 are 1/n, 2/n, ..., 1 − 1/n.)

So

E(Xi) = 1 + ((n − i + 1)/n) · 0 + ((i − 1)/n) · E(Xi) (1.11)

((n − i + 1)/n) E(Xi) = 1 (1.12)

E(Xi) = n/(n − i + 1) (1.13)


Now we have:

E(Yn) = ∑_{i=1}^n E(Xi) = ∑_{i=1}^n n/(n − i + 1) = n ∑_{i=1}^n 1/i = n Hn (here Hn are the "harmonic sums") = n(log n + O(1))
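A quick Monte Carlo experiment agrees with E(Yn) = nHn (the choices n = 50 and 2000 trials are arbitrary):

```python
import random

# Simulate the coupon collector and compare the empirical mean to n*H_n.
def collect(n, rng):
    """Draw uniform coupons until all n types are seen; return the time."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

n, trials = 50, 2000
rng = random.Random(0)
avg = sum(collect(n, rng) for _ in range(trials)) / trials
H = sum(1 / i for i in range(1, n + 1))
print(avg, n * H)  # empirical mean vs n*H_50 ≈ 224.96
```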

1.3.3 Application: the probabilistic method

A tournament of size n is a directed complete graph. We may think of a tournament T equivalently as a skew-symmetric mapping T : [n] × [n] → {1, 0, −1} that is 0 only on the diagonal. A Hamilton path in a tournament (or a digraph more generally) is a directed simple path through all the vertices.

Lemma 4. There exists a tournament with at least n! 2^{−n+1} Hamilton paths.

This certainly isn’t true for all tournaments—as an extreme case, the totally ordered tournament has only one Hamilton path.

Proof. This is an opportunity to consider a nice random variable: the random tournament. You simply fix n vertices, and direct each edge between them uniformly iid. Any particular permutation of the vertices has probability 2^{−n+1} of being a Hamilton path, so the expectation of the indicator rv for this event is 2^{−n+1}. The indicator rvs are far from independent, but anyway, by linearity of expectation, the expected number of Hamilton paths is n! 2^{−n+1}. So some tournament has at least this many Hamilton paths. □

Exercise: explicit construction. Describe a specific tournament with n!(2 + o(1))^{−n} Hamilton paths.
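For small n the expectation n! 2^{−n+1} can be checked directly by sampling random tournaments and counting Hamilton paths by brute force. A sketch with n = 6, where n! 2^{−n+1} = 22.5 (the sample size and helper names are arbitrary choices):

```python
import random
from itertools import permutations
from math import factorial

def hamilton_paths(n, rng):
    """Count Hamilton paths in one uniformly random tournament on n vertices."""
    # For i < j, edge[(i, j)] is True iff the edge is oriented i -> j.
    edge = {(i, j): rng.random() < 0.5 for i in range(n) for j in range(i + 1, n)}
    def arrow(a, b):  # True iff a -> b in the tournament
        return edge[(a, b)] if a < b else not edge[(b, a)]
    return sum(all(arrow(p[i], p[i + 1]) for i in range(n - 1))
               for p in permutations(range(n)))

n, trials = 6, 400
rng = random.Random(1)
avg = sum(hamilton_paths(n, rng) for _ in range(trials)) / trials
print(avg, factorial(n) / 2 ** (n - 1))  # empirical mean vs 22.5
```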


1.4 Lecture 4 (7/Oct): Upper and lower bounds

1.4.1 Union bound

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B) ≤ Pr(A) + Pr(B)

The bound applies also to countable unions:

Lemma 5 (countable subadditivity). Pr(∪_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ Pr(Ai).

Proof. First note that by induction the bound applies to any finite union. Now, if the right-hand side is at least 1, the result is immediate. If not, consider any counterexample; since the sequences Pr(∪_{i=1}^k Ai) and ∑_{i=1}^k Pr(Ai) each monotonically converge to their respective limit, then there is a finite k for which Pr(∪_{i=1}^k Ai) > ∑_{i=1}^k Pr(Ai). Contradiction. □

Later in the lecture we'll use the following which, while trivial, has the whiff of assigning a value to ∞/∞:

Corollary 6. If a countable list of events A1, ... all satisfy Pr(Ai) = 0, then Pr(∪ Ai) = 0. Likewise if for all i, Pr(Ai) = 1, then Pr(∩ Ai) = 1.

Birthday paradox: skipped in class.³

1.4.2 Using the union bound in the probabilistic method: Ramsey theory

Theorem 7 (Ramsey [80]). Fix any nonnegative integers k, ℓ. There is a finite "Ramsey number" R(k, ℓ) such that every graph on R(k, ℓ) vertices contains either a clique of size k or an independent set of size ℓ. Specifically, R(k, ℓ) ≤ (k+ℓ−2 choose ℓ−1).

(The finiteness is due to Ramsey [80] and the bound to Erdős and Szekeres [34].) Numerous generalizations of Ramsey's argument have since been developed—see the book [46].

3In case you’re not famliar with this one: a class of 23 students has better than even of some common birthday. (Supposing birthdates are uniform on 365 possibilities.) The exact calculation is 365 ··· 343 Pr(some common birthday) = 1 − =∼ 0.507297 36523 r but a better way to understand this is that the number of ways this can happen is (2) for r students; so long as these events 1 don’t start heavily overlapping, we can almost add their probabilities (which are each just 365 ). For a year of n days and a class of r students, let the rv B = the number of pairs of students who share a birthday.

r 1 E(B) = 2 n √ which suggests that the probability of some joint birthday may be a constant once r is large enough that r ∼ n. We can easily verify the correctness of one side of this claim. The event of there being some common birthday is B > 0 . With Bij S J K being the event students i, j share a birthday, B > 0 = i 0) ≤ . 2 n √ This shows that r ∈ o( n) ⇒ Pr(B > 0) ∈ o(1). r The converse holds, too; fundamentally this is because there is not much overlap in the sample space between the (2) different events. We postpone this for now but will show below how to carry out this argument.
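The exact calculation, and the heuristic E(B) = (r choose 2)/n next to it, in a few lines:

```python
from fractions import Fraction

# Exact probability of some common birthday among r students, n days.
def pr_common(r, n=365):
    no_collision = Fraction(1)
    for i in range(r):
        no_collision *= Fraction(n - i, n)
    return 1 - no_collision

print(float(pr_common(23)))   # ≈ 0.507297
print(23 * 22 / 2 / 365)      # E(B) = (23 choose 2)/365 ≈ 0.693
```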


Proof. (of Theorem (7)) This is outside our main line of development but we include it for completeness. First, R(k, 1) = R(1, k) = 1 = (k−1 choose 0). Now if k, ℓ > 1, consider a graph with R(k, ℓ − 1) + R(k − 1, ℓ) vertices and pick any vertex v. Let VY denote the vertices connected to v by an edge, and let VN denote the remaining vertices. Either |VN| ≥ R(k, ℓ − 1) or |VY| ≥ R(k − 1, ℓ).

If |VN| ≥ R(k, ℓ − 1) then (by induction on k + ℓ) either the graph spanned by VN contains a k-clique or the graph spanned by VN ∪ {v} contains an independent set of size ℓ.

On the other hand if |VY| ≥ R(k − 1, ℓ) then (by induction on k + ℓ) either the graph spanned by VY ∪ {v} contains a k-clique or the graph spanned by VY contains an independent set of size ℓ.

So we have: R(k, ℓ) ≤ R(k, ℓ − 1) + R(k − 1, ℓ) ≤ (k+ℓ−3 choose ℓ−2) + (k+ℓ−3 choose ℓ−1) = (k+ℓ−2 choose ℓ−1). (The final equality counts subsets of [k + ℓ − 2] of size ℓ − 1 according to whether the first item is selected.)

If you apply Stirling's approximation, this gives the bound R(k, k) ≤ (2k−2 choose k−1) ∈ O(4^k/√k). In the intervening nine decades there have been some improvements on this bound, first by Rödl [45], then by Thomason [91], then by Conlon [24], and most recently by Sah [82] to (for k ≥ 3) R(k, k) ≤ (2k−2 choose k−1) exp(−c log² k).

What we use the union bound for is to show a converse:

Theorem 8 (Erdős [32]). If (n choose k) < 2^{(k choose 2)−1} then R(k, k) > n. Thus R(k, k) ≥ (1 − o(1)) (k/(e√2)) 2^{k/2}.

This leaves an exponential gap. Actually this gap is small by the standards of Ramsey theory. The gap has been slightly tightened since Erdős's work, as we will show later in the course, but remains exponential, and is a major open problem in combinatorics.

Proof. (of Theorem (8)) This is an opportunity to introduce one of the most-studied random variables in combinatorics, the random graph G(n, p), in which each edge is present, independently, with probability p. Among other things, people use this model to study threshold phenomena for many properties such as connectivity, appearance of a Hamilton cycle, etc. For the lower bound on R(k, k) we use G(n, 1/2). Any particular subset of k vertices has probability 2^{1−(k choose 2)} of forming either a clique or an independent set. Take a union bound over all subgraphs. □
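The condition in Theorem 8 is easy to evaluate mechanically. This sketch (the helper name erdos_lower is mine) finds, for small k, the largest n the theorem certifies, so that R(k, k) > n:

```python
from math import comb

# Largest n certified by Theorem 8: R(k,k) > n whenever C(n,k) < 2^(C(k,2)-1).
def erdos_lower(k):
    n = k
    while comb(n + 1, k) < 2 ** (comb(k, 2) - 1):
        n += 1
    return n  # every larger n fails the condition

for k in (4, 5, 6):
    print(k, erdos_lower(k))
```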

1.4.3 Bonferroni inequalities

The union bound is a special case of the Bonferroni inequalities:

Let A1, ..., An be events in some probability space, and Ai^c their complements. For S ⊆ [n] let AS = ∩_{i∈S} Ai. For 0 ≤ j ≤ n let ([n] choose j) denote the subsets of [n] of cardinality j.

Lemma 9 (Bonferroni). For j ≥ 1 let (see Fig. 1.1):

mj = ∑_{S ∈ ([n] choose j)} Pr(AS)

Mk = ∑_{j=1}^k (−1)^{j+1} mj = ∑_{j=1}^k (−1)^{j+1} ∑_{J⊆[n], |J|=j} Pr(AJ)

Then:

M2, M4, ... ≤ Pr(∪ Ai) ≤ M1, M3, ...

Moreover, Pr(∪ Ai) = Mn; this is known as the inclusion-exclusion principle. It is a special case of what is known in combinatorics as Möbius inversion.

Caltech CS150 2020. 16 1.4. Lecture 4 (7/Oct): Upper and lower bounds

[Figure 1.1: m1 (left), m2 (middle), M2 (right)]

Comment: Often, but not always, larger values of k give improved bounds. See the problem set.

Proof. The sample space is partitioned into 2^n measurable sets

BS = (∩_{i∈S} Ai) ∩ (∩_{i∉S} Ai^c)

Note that AS = ∪_{S⊆T} BT, which, since the BT's are disjoint, gives Pr(AS) = ∑_{S⊆T} Pr(BT).

mj = ∑_{S ∈ ([n] choose j)} Pr(AS) = ∑_{S ∈ ([n] choose j)} ∑_{S⊆T} Pr(BT) = ∑_T (|T| choose j) Pr(BT)

Let’s start with the cases M1, M2. S Observe Pr( Ai) = 1 − Pr(B∅) = ∑T6=∅ Pr(BT). Now   |T| [ M1 = m1 = ∑ Pr(BT) = ∑ |T| Pr(BT) ≥ ∑ Pr(BT) = Pr( Ai). T 1 T T6=∅

Let t = |T|.

M2 = m1 − m2 = ∑_T [(t choose 1) − (t choose 2)] Pr(BT) = ∑_T (t(3 − t)/2) Pr(BT)

The quadratic t(3 − t)/2 is ≤ 1 at integer t, and 0 at t = 0, therefore

M2 ≤ ∑_{T≠∅} Pr(BT) = Pr(∪ Ai).

Now consider general k. By the above observation,

Mk − Pr(∪ Ai) = ∑_{T≠∅} Pr(BT) ∑_{j=0}^k (−1)^{j+1} (|T| choose j)

where we have inserted the needed − Pr(BT) for T ≠ ∅ by starting the internal summation from j = 0. Exercise: for t ≥ 1, k ≥ 0, ∑_{j=0}^k (−1)^{j+1} (t choose j) = (−1)^{k+1} (t−1 choose k). The sign here completes the argument. □
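Lemma 9 can be verified exactly, with rational arithmetic, on a small random instance; the 12-point uniform sample space and the 4 random events below are arbitrary choices:

```python
import random
from fractions import Fraction
from itertools import combinations

# A uniform 12-point sample space and 4 random events.
rng = random.Random(7)
omega = range(12)
events = [set(w for w in omega if rng.random() < 0.5) for _ in range(4)]
p = Fraction(1, 12)

union = len(set().union(*events)) * p
n = len(events)
# m[j] = sum over |S| = j of Pr(A_S)
m = {j: sum(len(set.intersection(*[events[i] for i in S])) * p
            for S in combinations(range(n), j)) for j in range(1, n + 1)}
# M[k] = alternating partial sums of the m[j]
M = {k: sum((-1) ** (j + 1) * m[j] for j in range(1, k + 1))
     for k in range(1, n + 1)}

for k in range(1, n + 1):
    assert (union <= M[k]) if k % 2 == 1 else (M[k] <= union)
assert M[n] == union  # inclusion-exclusion, exactly
print("Bonferroni inequalities verified; Pr(union) =", union)
```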

The inequalities hold also for countable collections of events, with a proviso.


Lemma 10 (Bonferroni for countable collections). Let A1, ... be a countable collection of events, let mj = ∑_{S ∈ (N+ choose j)} Pr(AS), define Mk as previously, and suppose mj < ∞ for all 1 ≤ j ≤ k. Then (−1)^{k+1}(Mk − Pr(∪ Ai)) ≥ 0.

By the way, it is not sufficient to require that m1 < ∞, as m2 may still be infinite. For X ∈U [0, 1] let Ai = ⟦X < 1/i²⟧. Then m1 < ∞ but m2 ≥ ∑_{i<j} 1/j² = ∑_j (j − 1)/j² = ∞.

Proof. Suppose to the contrary that (−1)^{k+1}(Mk − Pr(∪ Ai)) = −ε < 0.

Let mj(n) and Mk(n) be as defined earlier but for the list of events A1, ..., An. Fix n sufficiently large that Pr(∪_{i=1}^n Ai) ≥ Pr(∪ Ai) − ε/(2(k + 1)) and, for each 1 ≤ j ≤ k, mj(n) ≥ mj − ε/(2(k + 1)). Then (−1)^{k+1}(Mk(n) − Pr(∪_{i=1}^n Ai)) ≤ (−1)^{k+1}(Mk − Pr(∪ Ai)) + ε/2 ≤ −ε/2 < 0, contradicting Lemma 9. □


1.5 Lecture 5 (9/Oct): Tail events. Borel-Cantelli, Kolmogorov 0-1, percolation

1.5.1 Borel-Cantelli

Here is a very fundamental application of the union bound.

Definition 11. Let B = {B1,...} be a countable collection of events. lim sup B is the event that infinitely many of the events Bi occur.

Lemma 12 (Borel-Cantelli I). Let ∑_{i≥1} Pr(Bi) < ∞. Then Pr(lim sup B) = 0.

lim sup B is what is called a tail event: a function of infinitely many other events (in this case the B1, ...) that is unaffected by the outcomes of any finite subset of them.

Proof. It is helpful to write lim sup B as

lim sup B = ∩_{i≥0} ∪_{j≥i} Bj.

For every i, lim sup B ⊆ ∪_{j≥i} Bj, so Pr(lim sup B) ≤ inf_i Pr(∪_{j≥i} Bj). By the union bound, the latter is ≤ inf_i ∑_{j≥i} Pr(Bj) = 0. □

1.5.2 B-C II: a partial converse to B-C I

Lemma 12 does not have a “full” converse.

To show a counterexample, we need to come up with events Bi for which ∑i≥1 Pr(Bi) = ∞ but Pr(lim sup B) = 0. Here is an example. Pick a point x uniformly from the unit interval. Let Bi be the event x < 1/i. You will notice that in this example the events are not independent. That is crucial, for B-C I does have the partial converse:
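A quick numerical illustration of this counterexample (a sketch, not a proof; the function name and sample size are ours): for x drawn uniformly from (0, 1], the event $B_i = ⟦x < 1/i⟧$ occurs exactly for the finitely many i with i < 1/x, even though $\sum \Pr(B_i) = \sum 1/i$ diverges.

```python
import math
import random

random.seed(0)

def num_occurring(x):
    """Number of events B_i = [x < 1/i] that occur for this x in (0,1]:
    exactly the indices i with i < 1/x, i.e. ceil(1/x) - 1 of them."""
    return max(0, math.ceil(1.0 / x) - 1)

# 1 - random.random() lies in (0, 1], avoiding division by zero.
counts = [num_occurring(1.0 - random.random()) for _ in range(10000)]
# Only finitely many B_i ever occur, despite sum Pr(B_i) being infinite.
assert all(c >= 0 for c in counts)
print(max(counts))
```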

Lemma 13 (Borel Cantelli II). Suppose that B1,... are independent events and that ∑i≥1 Pr(Bi) = ∞. Then Pr(lim sup B) = 1.

Proof. We'll show that $(\limsup B)^c$, the event that only finitely many $B_i$ occur, occurs with probability 0. Write

$$(\limsup B)^c = \bigcup_{i \geq 0} \bigcap_{j \geq i} B_j^c.$$

By the union bound (Cor. 6), it is enough to show that $\Pr(\bigcap_{j \geq i} B_j^c) = 0$ for all i. Of course, for any $I \geq i$, $\Pr(\bigcap_{j \geq i} B_j^c) \leq \Pr(\bigcap_{j=i}^{I} B_j^c)$. By independence, $\Pr(\bigcap_{j=i}^{I} B_j^c) = \prod_{j=i}^{I} \Pr(B_j^c)$, so what remains to show is that

For any i, $\lim_{I \to \infty} \prod_{j=i}^{I} \Pr(B_j^c) = 0$. (1.14)

(Note the LHS is decreasing in I.) There’s a classic inequality we often use:

1 + x ≤ ex (1.15)

which follows because the RHS is convex and the two sides agree in value and first derivative at a point (namely at x = 0).

Consequently if a finite sequence xi satisfies ∑ xi ≥ 1 then ∏(1 − xi) ≤ 1/e.

Supposing (1.14) is false, fix i for which it fails, and let $q_i > 0$ be the limit of the LHS. Let I be sufficiently large that $\prod_{j=i}^{I} \Pr(B_j^c) \leq 2q_i$, and let $I'$ be sufficiently large that $\sum_{j=I+1}^{I'} \Pr(B_j) \geq 1$. Then $\prod_{j=i}^{I'} \Pr(B_j^c) \leq 2q_i/e < q_i$. Contradiction. 2

1.5.3 Kolmogorov 0-1, percolation

A beautiful fact about tail events is Kolmogorov’s famous 0-1 law.

Theorem 14 (Kolmogorov). If Bi is a sequence of independent events and C is a tail event of the sequence, then Pr(C) ∈ {0, 1}.

We won’t be using this theorem. It takes a bit of work to prove, either using measure theory (though not much more than I’ve already shown you), or through (to me a more informative argument, but requiring more background). In any case, to keep us on our track, I’ll only offer here a few examples of its application.

Bond percolation

Fix a parameter 0 ≤ p ≤ 1. Start with a fixed infinite, connected, locally finite graph H, for instance the grid graph $\mathbb{Z}^2$ (nodes $(i,j)$ and $(i',j')$ are connected if $|i-i'| + |j-j'| = 1$), and form the graph G by including each edge of H in G independently with probability p. ("Locally finite" means the degree of every vertex is finite.) The graph is said to "percolate" if there is an infinite connected component. Percolation is a tail event (with respect to the events indicating whether each edge is present): consider the effect of adding or removing just one edge, which cannot create or destroy an infinite component, and then induct on the number of edges added or removed. It is easy to see by a coupling argument that Pr(percolation) is monotone nondecreasing in p, as follows: instead of choosing just a single bit at each edge e, choose a real number $X_e \in [0,1]$ uniformly, and include the edge if $X_e < p$. Now, if $p < p'$, this defines two random graphs $G_p$, $G_{p'}$ on a common probability space, each a percolation process with the respective parameter value, and $G_p \subseteq G_{p'}$.
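The coupling can be sketched directly; the code below runs it on a finite n × n grid (the grid size and the parameter values 0.4 and 0.6 are arbitrary choices for illustration). A single uniform draw $X_e$ per edge serves both parameters at once, so $G_p \subseteq G_{p'}$ holds pointwise, not merely in distribution; consequently every increasing event has nondecreasing probability in p.

```python
import random

random.seed(1)
n = 20

def grid_edges(n):
    """Edges of the n x n grid graph."""
    edges = []
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < n:
                edges.append(((i, j), (i, j + 1)))
    return edges

edges = grid_edges(n)
X = {e: random.random() for e in edges}   # one uniform number per edge

def included(p):
    """The edge set of G_p under the common coupling."""
    return {e for e in edges if X[e] < p}

Gp, Gp2 = included(0.4), included(0.6)
assert Gp <= Gp2   # G_p is literally a subgraph of G_{p'}
```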

Due to the 0-1 law, there exists a "critical" $p_H$ such that Pr(percolation) = 0 for $p < p_H$ and Pr(percolation) = 1 for $p > p_H$. (See Fig. 1.2.) A lot of work in probability theory has gone into determining values of $p_H$ for various graphs, and also into figuring out whether Pr(percolation) is 0 or 1 at $p_H$ itself. Another example of a tail event for bond percolation, this one not monotone, is the event that there are infinitely many infinite components. No matter what the underlying graph is, the probability of this event is 0 at p ∈ {0, 1}.

Site percolation

A closely related process is that starting from a fixed infinite, connected, locally finite graph H, we retain vertices independently with probability p. (And of course we retain an edge if both its vertices are retained.)


[Figure 1.2: Bond percolation in the 2D grid (Pr(percolate) plotted against p).]

Let N be the random variable representing the number of infinite components in the random graph. Here "number" can be any nonnegative integer or ∞. The events ⟦N = 0⟧ and ⟦N = ∞⟧ are tail events, but for 0 < r < ∞, ⟦N = r⟧ is not a tail event.

It is worth noting that N is not a monotone function of the percolation graph. The event ⟦N = 0⟧ is monotone decreasing (just as in the bond percolation discussion earlier). The event ⟦N = ∞⟧ is not monotone increasing.

It is known under fairly general conditions (particularly, if H is connected and vertex-transitive) that for any p, one (and therefore only one) of the following three events has probability 1: N = 0; N = 1; N = ∞. You can see that this is not implied by Kolmogorov's law, but it does follow by an extension of the argument. See [76] for the beginning of this story, and [11] for a survey. (It happens that in the square grid, in any dimension and for any p, Pr(N = ∞) = 0; as p increases, we go from Pr(N = 0) = 1 to Pr(N = 1) = 1 [5], and stay there. However, in more "expanding" graphs such as d-regular trees, d > 1, and also other non-amenable graphs, there can be a phase in the middle with Pr(N = ∞) = 1. See [69, 49].)


1.6 Lecture 6 (12/Oct): Random walk, gambler’s ruin

Random walk on Z, gambler’s ruin

Here is another example of a tail event, but this one we can work out without relying on the 0-1 law, and also see which of {0, 1} is the value. Consider random walk on Z that starts at 0 and in every step goes left with probability p and right with probability 1 − p. Let L = the event that the walk visits every x ≤ 0. Let R = the event that the walk visits every x ≥ 0. Exercise: each of L and R is a tail event. So by Theorem 14 (Kolmogorov), for any p, Pr(L) and Pr(R) lie in {0, 1}. In fact, we will show, without relying on Theorem 14 but relying on Lemma 12 (Borel-Cantelli I), that:

Theorem 15.

• For p < 1/2, Pr(L) = 0 and Pr(R) = 1.

• For p > 1/2, Pr(L) = 1 and Pr(R) = 0. (Obviously this is symmetric to the preceding.)

• For p = 1/2, Pr(L) = Pr(R) = 1.

Note that if L ∩ R occurs, then the walk must actually visit every point infinitely often. (Suppose not, and let t be the last time that some site y was visited. Then on one side of y, the point t + 1 steps away cannot have been visited yet, and will never be visited.) Thus in the third case in the theorem, since Pr(L ∩ R) = 1 by union bound,

Pr(every point in Z is visited infinitely often) = 1.

The term for this property of unbiased random walk on Z is that it is recurrent. Unbiased random walk is still recurrent in Z², but not in Z^d, d > 2. The 3-dimensional case is particularly interesting, as this roughly describes how photons make their way out of the interior of the sun. This glosses over quantum properties of photons, so don't take it as too serious a model. Nonetheless, it gives a feel for why photons should escape the sun at a positive rate, rather than accumulating within it.
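A Monte Carlo sketch of the contrast between the biased and unbiased cases (horizon, trial count, and seed are arbitrary choices of ours): within a fixed number of steps, the fair walk returns to 0 many times, while a biased walk's total number of returns has a small finite expectation.

```python
import random

random.seed(2)

def returns_to_zero(p_left, steps):
    """Number of times a walk on Z starting at 0 revisits 0 within `steps` steps."""
    pos, count = 0, 0
    for _ in range(steps):
        pos += -1 if random.random() < p_left else 1
        if pos == 0:
            count += 1
    return count

trials = 200
fair = sum(returns_to_zero(0.5, 2000) for _ in range(trials)) / trials
biased = sum(returns_to_zero(0.7, 2000) for _ in range(trials)) / trials
print(fair, biased)  # the fair walk revisits 0 far more often
```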

Proof. First, no matter what p is, let $q_y$ be the probability that the walk ever visits the point y.

Part 1: p ≠ 1/2. The first step of the argument doesn't depend on the sign of p − 1/2:

Lemma 16. Let p ≠ 1/2. Then with probability 1, every y is visited only finitely many times.

Proof. Consider any y and let $B_{y,t}$ = the event that the walk is at y at time t. The following calculation shows that for any y, $\sum_t \Pr(B_{y,t}) < \infty$. For t such that t ≡ y (mod 2), we have

$$\Pr(B_{y,t}) = \binom{t}{\frac{t-|y|}{2}} p^{\frac{t-y}{2}} (1-p)^{\frac{t+y}{2}} = \binom{t}{\frac{t-|y|}{2}} (p(1-p))^{t/2} \left(\frac{1-p}{p}\right)^{y/2} \leq 2^t (p(1-p))^{t/2} \left(\frac{1-p}{p}\right)^{y/2} = (4p(1-p))^{t/2} \left(\frac{1-p}{p}\right)^{y/2}$$


Therefore

$$\sum_t \Pr(B_{y,t}) \leq \left(\frac{1-p}{p}\right)^{y/2} \frac{1}{1 - \sqrt{4p(1-p)}} < \infty$$

where the final inequality uses p ≠ 1/2. So by Borel-Cantelli I (Lemma 12), with probability 1, y is visited only finitely many times. Then by the union bound, with probability 1 every y is visited only finitely many times. 2

Now let's suppose further that p > 1/2 (i.e., the walk drifts left). Then for any x ∈ Z,

$$\sum_{y \geq x} \sum_t \Pr(B_{y,t}) \leq \left(\frac{1-p}{p}\right)^{x/2} \cdot \frac{1}{1 - \sqrt{(1-p)/p}} \cdot \frac{1}{1 - \sqrt{4p(1-p)}} < \infty$$

So we get the even stronger conclusion, again by BC-I, that with probability 1 the walk spends only finite time in the interval [x, ∞). Plugging in x = 0 gives Pr(R) = 0. And since this holds for all x, we get Pr(L) = 1. By symmetry, we've covered part 1 (the first two cases) of the theorem.

Part 2: p = 1/2. Here the claims Pr(L) = 1 and Pr(R) = 1 are equivalent, so let's focus on the first. The claim Pr(L) = 1 is equivalent to saying that for any integer x ≥ 0, with probability 1 the walk reaches the point −x. This is often referred to as the phenomenon of gambler's ruin:

Suppose a gambler with a finite integer initial stake $x, bets in each round (so long as he is solvent, i.e., he has a positive amount of money) $1 on the outcome of a fair coin. Then the probability that he goes broke is 1.

Let's show this. For x ≥ 0, $q_{-x}$ is, in the notation above, the probability that the gambler goes broke from initial stake x. We claim that the function x ↦ $q_{-x}$ is harmonic on N with boundary condition $q_0 = 1$. The harmonic condition means that at all interior points of the nonnegative integer axis, which means all x ∈ N₊, the function value is the average of its neighbors:

$$q_{-x} = (q_{-(x-1)} + q_{-(x+1)})/2.$$

That this is so is obvious from the description of the gambler's process (condition on the outcome of the first bet). This equation implies that $q_{-x}$ is affine linear in x, because for x ≥ 1 the "discrete second derivative" is 0:

$$(q_{-(x+1)} - q_{-x}) - (q_{-x} - q_{-(x-1)}) = q_{-(x+1)} - 2q_{-x} + q_{-(x-1)} = 0.$$

However, the function $q_{-x}$ is also bounded in [0, 1]. A bounded affine linear function on N is constant, so $q_{-x}$ agrees with its boundary value $q_0 = 1$ for every x. 2
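The same harmonic-function reasoning can be checked concretely in the finite version of the problem, where the gambler also stops upon reaching a target fortune N. Writing r(x) for the ruin probability from stake x (our notation), the unique solution of the boundary-value problem r(0) = 1, r(N) = 0 with the averaging property is r(x) = 1 − x/N, which tends to 1 as N → ∞. A small sketch:

```python
def ruin_prob_fair(x, N):
    """Ruin probability r(x) for a fair gambler with stake x who stops
    at 0 (broke) or at target N: the unique harmonic function with
    boundary values r(0) = 1, r(N) = 0 is r(x) = 1 - x/N."""
    return 1.0 - x / N

N = 50
# the averaging (harmonic) property holds at every interior point
for x in range(1, N):
    avg = 0.5 * (ruin_prob_fair(x - 1, N) + ruin_prob_fair(x + 1, N))
    assert abs(ruin_prob_fair(x, N) - avg) < 1e-12

# letting the upper stopping boundary N go to infinity makes ruin certain
for x in (1, 10, 100):
    assert ruin_prob_fair(x, 10 ** 9) > 0.9999
```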


1.7 Lecture 7 (14/Oct): Percolation on trees. Basic inequalities.

1.7.1 A simple model for (among other things) epidemiology: percolation on the regular d-ary tree, d > 1

In epidemiology there are various mathematical models for the spread of an infectious disease. A primary workhorse is the "SIR" (susceptible/infected/recovered) model, and from there it escalates in complexity. Today we'll talk about a model that is even simpler than SIR. Yet this model is still good enough to capture the most fundamental quantity epidemiologists work with, which is R₀, the "basic reproductive number": the expected number of individuals who contract the infection due to contact with one infected individual. The rule of thumb is that a disease spreads if R₀ > 1, and dies out if R₀ < 1. Today we'll give some justification for this rule of thumb.

Consider $T_d$, the infinite rooted d-ary tree (which means every vertex has d children; as a graph, all vertices but the root have degree d + 1). Perform percolation with parameter 0 < p < 1, i.e., each edge is present independently with probability p. Observe that the expected number of children a vertex is connected to is pd. We have in mind that the root is "patient 0," the first individual carrying the disease; presence of an edge from "parent" X to "child" Y indicates infection of Y by X. (Of course, in actual epidemiology there is no fixed underlying tree.) So, in our modeling: R₀ = pd. Let C be the connected component of the root. Think of C finite as representing the disease dying out; C infinite is the limiting concept of a pandemic.

Theorem 17. If pd < 1 then C is a.s. finite. If pd > 1 then Pr(C is infinite) > 0.

Proof. If p < 1/d then E(|C|), the expected number of nodes connected to the root, is $\sum_{\ell \geq 0} (pd)^{\ell} < \infty$. So BC-I (applied to the events ⟦v ∈ C⟧, one per vertex v) implies that C is a.s. finite.

If p > 1/d, define $Z_\ell$ = ⟦the root is connected to some vertex at level ℓ⟧. Then ⟦C is infinite⟧ = $\bigcap_{\ell} Z_\ell$. Observe that

1. Pr(Z0) = 1,

2. ∀ℓ, $Z_\ell \supseteq Z_{\ell+1}$, so

(a) Pr($Z_\ell$) is nonincreasing in ℓ, and
(b) Pr($\bigcap_\ell Z_\ell$) = $\inf_\ell$ Pr($Z_\ell$).

3. Pr(Z`) is increasing in p (by the coupling argument for percolation we mentioned in an earlier lecture).

Let 0 < ε = log pd. At each child of the root of a tree of depth ℓ + 1, consider the event that that child is connected to the root and that, within its own subtree, the event $Z_\ell$ occurs. Applying Bonferroni level 2 to these d events, we have:

$$\Pr(Z_{\ell+1}) \geq pd \Pr(Z_\ell) - \binom{d}{2} p^2 \Pr(Z_\ell)^2$$

$$\geq \Pr(Z_\ell)\, pd\, \big(1 - \Pr(Z_\ell)\, p(d-1)/2\big) \geq \Pr(Z_\ell)\, e^{\varepsilon} \left(1 - \Pr(Z_\ell) \frac{e^{\varepsilon} - p}{2}\right)$$

If $\Pr(Z_\ell)$ were very small, specifically if $1 - \Pr(Z_\ell) \frac{e^{\varepsilon}-p}{2} > e^{-\varepsilon}$ (equivalently $\Pr(Z_\ell) < 2\frac{1-e^{-\varepsilon}}{e^{\varepsilon}-p}$), this would show that $\Pr(Z_{\ell+1}) > \Pr(Z_\ell)$, which would contradict 2a. Therefore, $\Pr(Z_\ell) \geq 2\frac{1-e^{-\varepsilon}}{e^{\varepsilon}-p} > 0$ for all ℓ, which by 2b completes the proof. 2
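Separately from the proof above, the survival probability admits an exact branching-process recursion: conditioning on the d edges at the root, $z_{\ell+1} = 1 - (1 - p z_\ell)^d$ with $z_0 = 1$, where $z_\ell = \Pr(Z_\ell)$. A sketch iterating this recursion (iteration count and parameter values are arbitrary choices of ours) exhibits the phase transition at pd = 1:

```python
def survival(p, d, iters=10000):
    """Approximate inf_l Pr(Z_l) by iterating z_{l+1} = 1 - (1 - p*z_l)^d
    from z_0 = 1; the limit is positive exactly when pd > 1."""
    z = 1.0
    for _ in range(iters):
        z = 1.0 - (1.0 - p * z) ** d
    return z

d = 3
assert survival(0.2, d) < 1e-6   # pd = 0.6 < 1: the component dies out
assert survival(0.5, d) > 0.1    # pd = 1.5 > 1: survives with positive probability
```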


1.7.2 Markov inequality (the simplest tail bound)

Lemma 18. Let A be a non-negative random variable with finite expectation µ₁.⁴ Then for any λ ≥ 1, Pr(A > λµ₁) < 1/λ. In particular, for µ₁ = 0, Pr(A > µ₁) = 0.

(Of course the lemma holds trivially also for 0 < λ < 1.)

Proof. If Pr(A ≥ λµ₁) > 1/λ then E(A) > µ₁, a contradiction. So Pr(A ≥ λµ₁) ≤ 1/λ, and therefore, if the lemma fails, it must be that Pr(A > λµ₁) = 1/λ. In particular for some ε > 0 there is a δ > 0 s.t. Pr(A ≥ λµ₁ + ε) ≥ δ. Then E(A) ≥ δ · (λµ₁ + ε) + (1/λ − δ) · λµ₁ = µ₁ + δε > µ₁, a contradiction. 2

For a more visual argument (but proving the slightly weaker Pr(A ≥ λµ₁) ≤ 1/λ), note that the step function ⟦x ≥ λµ₁⟧ satisfies the inequality ⟦x ≥ λµ₁⟧ ≤ x/(λµ₁) for all nonnegative x. If µ is the probability distribution of the rv A, then

$$\Pr(A \geq \lambda\mu_1) = \int ⟦x \geq \lambda\mu_1⟧\, d\mu \leq \int \frac{x}{\lambda\mu_1}\, d\mu = \frac{\mu_1}{\lambda\mu_1} = \frac{1}{\lambda}.$$

1.7.3 Variance and the Chebyshev inequality: a second tail bound

Let X be a real-valued rv. If E(X) and E(X²) are both well-defined and finite, let Var(X) = E(X²) − E(X)². We can also see that E((X − E(X))²) = Var(X) by expanding the LHS and applying linearity of expectation. In particular, the variance is nonnegative. If c ∈ R then, since the variance is homogeneous of degree 2, Var(cX) = c² Var(X).

Lemma 19 (Chebyshev). If E(X) = θ, then Pr(|X − θ| > λ√(Var(X))) < 1/λ².

Proof.

$$\Pr\big(|X - \theta| > \lambda\sqrt{\mathrm{Var}(X)}\big) = \Pr\big((X - \theta)^2 > \lambda^2 \mathrm{Var}(X)\big)$$

< 1/λ² by the Markov inequality (Lemma 18). 2

A frequently useful corollary of the Chebyshev inequality (Lemma 19) is:

Corollary 20. If X is a real rv with mean 0 < µ < ∞ and variance σ² < ∞, then Pr(X ≤ 0) ≤ σ²/µ².

On a homework I’ll ask you to show:

Lemma 21. If X is a nonnegative rv with mean µ < ∞ and variance σ² < ∞, then Pr(X = 0) ≤ σ²/(µ² + σ²).

⁴For a nonnegative rv there can be no problems with absolute convergence of the expectation; however, it may be infinite.
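These tail bounds are easy to sanity-check on a small explicit distribution; the values and probabilities below are invented for illustration:

```python
# A small explicit discrete rv (values and probabilities chosen arbitrarily).
dist = {0: 0.2, 1: 0.3, 2: 0.3, 5: 0.2}
assert abs(sum(dist.values()) - 1.0) < 1e-12

mean = sum(v * q for v, q in dist.items())                 # mu_1 = 1.9
var = sum(v * v * q for v, q in dist.items()) - mean ** 2  # sigma^2 = 2.89

def pr(pred):
    """Probability of the event {v : pred(v)} under dist."""
    return sum(q for v, q in dist.items() if pred(v))

lam = 2.0
assert pr(lambda v: v > lam * mean) <= 1 / lam                         # Markov (Lemma 18)
assert pr(lambda v: abs(v - mean) > lam * var ** 0.5) <= 1 / lam ** 2  # Chebyshev (Lemma 19)
assert pr(lambda v: v <= 0) <= var / mean ** 2                         # Corollary 20
assert pr(lambda v: v == 0) <= var / (mean ** 2 + var)                 # Lemma 21
```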


1.7.4 Omitted material 2020: Cont. basic inequalities, probabilistic method

Power mean inequality

Nonnegativity of the variance is merely a special case of monotonicity of the power means. (In this context, though, we will assume the random variable X is positive-valued. For the variance we don’t need this constraint.)

Lemma 22 (Power means inequality). For a positive-real-valued rv X, and for reals s < t,

$$(E(X^s))^{1/s} \leq (E(X^t))^{1/t}.$$

Proof. Let µ be the probability measure. Recall that for r ≥ 1, the function $x^r$ is convex ("cup") in x; equivalently, $x^{1/r}$ is concave. For a concave function f, $\int f(x)\, d\mu(x) \leq f(\int x\, d\mu(x))$. (This is sometimes called Jensen's inequality.) Applying this with r = t/s, i.e., to the concave function $x^{s/t}$ (take 0 < s < t; the other cases are similar), we have

$$\int X^s\, d\mu \leq \left(\int X^t\, d\mu\right)^{s/t}.$$

Raising both sides to the power 1/s gives the lemma.

2

Using the concave function f(x) = log(x) gives us

$$\exp\left(\int \log x\, d\mu\right) \leq \int x\, d\mu \qquad (1.16)$$

which is the arithmetic-geometric mean inequality: in the case of a uniform distribution on n positive values of X, it reads $(\prod X_i)^{1/n} \leq \frac{1}{n} \sum X_i$. That (1.16) is a special case of the power means inequality can be seen by fixing t = 1 and taking the limit s → 0 (approximating $x^s$ by 1 + s log x).
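A numerical check of the monotonicity (the sample values and the set of orders r are arbitrary choices of ours):

```python
import math

def power_mean(xs, r):
    """Power mean of order r with uniform weights; r = 0 is the geometric mean."""
    n = len(xs)
    if r == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** r for x in xs) / n) ** (1.0 / r)

xs = [0.5, 1.0, 2.0, 7.0, 3.5]   # arbitrary positive sample values
orders = [-2, -1, 0, 1, 2, 3]
means = [power_mean(xs, r) for r in orders]
# monotone nondecreasing in r; in particular GM (r = 0) <= AM (r = 1)
assert all(a <= b + 1e-12 for a, b in zip(means, means[1:]))
```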

Large girth and large chromatic number; the deletion method

Earlier we saw our first example of the probabilistic method, the proof of the existence of graphs with no small clique or independent set. In that case, just picking an element of a set at random was already enough in order to produce an object that is hard to construct “explicitly”. However, the probabilistic method in that form can construct only an object with properties that are shared by a large fraction of objects. Now we will see an example that enables the probabilistic method to construct something that is quite rare. It is maybe even a bit surprising that this kind of object exists. We consider graphs here to be undirected and without loops or multiple edges. The chromatic number χ of a graph is the least number of colors with which the vertices can be colored, so that no two neighbors share a color. Clearly, as you add edges to a graph, its chromatic number goes up. The girth γ of a graph is the length of a shortest simple cycle. (“Simple” = no edges repeat. The girth of a forest is infinite.) Clearly, as you add edges to a graph, its girth goes down. These numbers are both monotone in the inclusion partial order on graphs. Chromatic number is monotone increasing, while girth is monotone decreasing. An important theorem we’ll cover shortly is the FKG Inequality, which implies in this setting that for any k, g > 0, if you pick a graph


u.a.r., and condition on the event that its chromatic number is above k, that reduces the probability that its girth will be above g. In symbols, for the G(n, p) ensemble,

Pr((χ(G) > k) ∩ (γ(G) > g)) ≤ Pr(χ(G) > k) Pr(γ(G) > g).

So in this precise sense, chromatic number and girth are anticorrelated. Indeed, having large girth means that the graph is a tree in large neighborhoods around each vertex. A tree has chromatic number 2. If you just allow yourself 3 colors, you gain huge flexibility in how to color a tree. Surely, with large girth, you might be able to color the local trees so that when they finally meet up in cycles, you can meet the coloring requirement? No! Here is a remarkable theorem.

Theorem 23 (Erdős [33]). For any k, g there is a graph with chromatic number χ ≥ k and girth γ ≥ g.

Proof. Pick a graph G from G(n, p), where $p = n^{-1+1/g}$. This is likely to be a fairly sparse graph; the expected degree is just under $n^{1/g}$. Let the rv X be the number of cycles in G of length < g. Then $E(X) = \sum_{m=3}^{g-1} p^m\, n(n-1)\cdots(n-m+1)/(2m)$. (Pick the cycle sequentially and forget the starting point and orientation.) Then

$$E(X) < \sum_{m=3}^{g-1} p^m n^m/(2m) = \sum_{m=3}^{g-1} n^{m/g}/(2m) \leq \sum_{m=3}^{g-1} n^{m/g}/6.$$

For sufficiently large n, specifically $n > 2^g$, the successive terms in this sum at least double, so $E(X) \leq n^{1-1/g}/3$. By Markov's inequality, $\Pr(X > n^{1-1/g}) < 1/3$. For the chromatic number we use a simple lower bound. Let I be the size of a largest independent set in G. Since every color class of a coloring must be an independent set,

I · χ ≥ n. (1.17)

Now $\Pr(I \geq i) \leq \binom{n}{i}(1-p)^{\binom{i}{2}}$, and recalling (1.15), the simple inequality for the exponential function, we have $\Pr(I \geq i) \leq \binom{n}{i} e^{-p\binom{i}{2}} = \binom{n}{i} e^{-\binom{i}{2} n^{-1+1/g}}$. Using the wasteful bound $\binom{n}{i} \leq n^i$ we have $\Pr(I \geq i) \leq e^{i \log n - \binom{i}{2} n^{-1+1/g}} = e^{i \log n + (i/2 - i^2/2) n^{-1+1/g}}$. Finally we apply this at $i = 3n^{1-1/g} \log n$.

$$\Pr(I \geq i) \leq e^{3n^{1-1/g} \log^2 n + \frac{1}{2}(3n^{1-1/g} \log n) n^{-1+1/g} - \frac{1}{2}(3n^{1-1/g} \log n)^2 n^{-1+1/g}} = e^{\frac{3}{2}(\log n - n^{1-1/g} \log^2 n)}$$

which for sufficiently large n is < 1/3. Thus, for sufficiently large n, there is probability at least 1/3 that G has both $I < 3n^{1-1/g} \log n$ and at most $n^{1-1/g} \leq n/2$ cycles of length strictly less than g. Removing vertices from G can only reduce I, because any set that is independent after the removal was also independent before. (By contrast, removing edges can only increase I.) So, by removing one vertex from each such cycle, we obtain a graph with ≥ n/2 vertices, girth ≥ g, and $I \leq 3n^{1-1/g} \log n$. Applying (1.17) (to the graph now of size ≥ n/2), we have $\chi \geq n^{1/g}/(6 \log n)$, which for sufficiently large n is ≥ k. 2


1.8 Lecture 8 (16/Oct): FKG inequality

Another appetizer: pub crawl. Pubs labelled 0, . . . , k in clockwise order are situated all around a city block. A drunk starts his evening in pub 0 and has a beer. When he’s finished his beer he steps out, randomly walks one storefront left or right, steps into that pub, and has a beer there. He repeats this all evening, and his festivities end only when he has visited all pubs. Question: What is the probability distribution on the pub where he finishes his evening?

Correlations among monotone events

Consider again the random graph model G(n, p). Suppose someone peeks at the graph and tells you that it has a Hamilton cycle. How does that affect the probability that the graph is planar? Or that its girth is less than 10? Or consider the percolation process on the n × n square grid. Suppose you check and find that there is a path from (0, 0) to (1, 0). What does that tell you about the chance that the graph has an isolated vertex? These questions fall into the general framework of correlation inequalities. History: Harris (1960) [47], Kleitman (1966) [61], Fortuin, Kasteleyn and Ginibre (1971) [37], Holley (1974) [51], Ahlswede and Daykin “Four Functions Theorem” (1978) [4].

We are concerned here with the probability space Ω of n independent random bits $b_1, \ldots, b_n$. It doesn't matter whether they are identically distributed. Let $p_i = \Pr(b_i = 1)$. We consider the boolean lattice B on these bits: $b \geq b'$ if for all i, $b_i \geq b'_i$. So, Ω is the distribution on B for which

$$\Pr(b) = \left(\prod_{i: b_i = 1} p_i\right)\left(\prod_{i: b_i = 0} (1 - p_i)\right) = \prod_i \left(\frac{1}{2} + (-1)^{b_i}\left(\frac{1}{2} - p_i\right)\right)$$

Definition 24. A real-valued function f on Ω is increasing if $b \geq b' \Rightarrow f(b) \geq f(b')$. It is decreasing if −f is increasing. Likewise, an event on Ω (or in other words a subset of B) is increasing if its indicator function is increasing, and decreasing if its indicator function is decreasing.

Theorem 25 (FKG [37]). If f and g are increasing functions on Ω then

E( f g) ≥ E( f )E(g)

Corollary 26. 1. If A and B are increasing events on Ω then Pr(A ∩ B) ≥ Pr(A) Pr(B).

2. If f is an increasing function and g is a decreasing function, then E(fg) ≤ E(f)E(g).

3. If A is an increasing event and B is a decreasing event, then Pr(A ∩ B) ≤ Pr(A) Pr(B).

Before we begin the proof we should introduce an important concept: Suppose X and Y are random variables defined over a common probability measure µ. Let T be a measurable subset of the range of Y with Pr(Y ∈ T) > 0. Then

$$E(X \mid Y \in T) = \frac{1}{\Pr(Y \in T)} \int X \cdot ⟦Y \in T⟧\, d\mu$$

Caltech CS150 2020. 28 1.8. Lecture 8 (16/Oct): FKG inequality

The conditional expectation E(X|Y) is a random variable; specifically, it is a function of the random variable Y. This has the following natural consequence, which is called the tower property of conditional expectations: E(X) = E(E(X|Y)) (1.18) Notice that on the RHS, the outer expectation is over the distribution of Y; on the inside we have the rv which is a real number that is, as we have said, a function of Y. We're not going to have time to do the measure theory here properly, but to see (1.18) in the case of discrete rvs, one need only note that both sides equal $\sum_y \Pr(Y = y) E(X \mid Y = y)$. Before we prove the FKG theorem let's just reinterpret it. Suppose g is the indicator function of some increasing event. Then

E(fg) = Pr(g = 1)E(f | g = 1) + Pr(g = 0) · 0 = Pr(g = 1)E(f | g = 1) = E(g)E(f | g = 1)

so

$$E(f \mid g = 1) = \frac{E(fg)}{E(g)} \geq \frac{E(f)E(g)}{E(g)} = E(f).$$

The interpretation is that conditioning on an increasing event only increases the expectation of any increasing function.

Proof. By induction on n. Case n = 1 (with p = Pr(b = 1)):

$$E(fg) - E(f)E(g) = p f(1)g(1) + (1-p) f(0)g(0) - \big(p f(1) + (1-p) f(0)\big)\big(p g(1) + (1-p) g(0)\big)$$
$$= p(1-p)\big(f(1)g(1) + f(0)g(0) - f(1)g(0) - f(0)g(1)\big) = p(1-p)\big(f(1) - f(0)\big)\big(g(1) - g(0)\big) \geq 0$$

by the monotonicity of both functions.

Now for the induction. Observe that for any assignment $(b_2 \ldots b_n) \in \{0,1\}^{n-1}$, f becomes a monotone function of the single bit $b_1$. For convenience, in the expectations to follow the subscript indicates explicitly which subset of bits the expectation is taken with respect to. So for instance in the second line, fg has the role of X, above, and $(b_2 \ldots b_n)$ has the role of Y. These subscripts are extraneous and I'm just including them for clarity.

$$E(fg) = E_{1\ldots n}(fg) = E_{2\ldots n}\big(E_1(fg \mid b_2 \ldots b_n)\big) \geq E_{2\ldots n}\big(E_1(f \mid b_2 \ldots b_n) \cdot E_1(g \mid b_2 \ldots b_n)\big),$$

applying the base case.

Observe again that E1( f |b2 ... bn) is a function of b2 ... bn. By monotonicity of f , it is an increasing function. Likewise for E1(g|b2 ... bn). Since by induction we may assume the theorem for the case n − 1, we have

$$\ldots \geq E_{2\ldots n}\big(E_1(f \mid b_2 \ldots b_n)\big) \cdot E_{2\ldots n}\big(E_1(g \mid b_2 \ldots b_n)\big) = E_{1\ldots n}(f) \cdot E_{1\ldots n}(g) = E(f)E(g),$$

using (1.18) in the middle equality.

2
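The theorem can also be verified by brute force on a small cube. The sketch below generates random increasing functions as monotone envelopes (f(b) = max over b' ≤ b of a random h(b')), draws random independent bit biases, and checks E(fg) ≥ E(f)E(g); all names and parameter choices are ours, and this is a check, not a proof:

```python
import itertools
import random

random.seed(3)
n = 4
cube = list(itertools.product([0, 1], repeat=n))
p = [random.uniform(0.1, 0.9) for _ in range(n)]   # arbitrary independent biases

def prob(b):
    out = 1.0
    for i, bit in enumerate(b):
        out *= p[i] if bit else 1.0 - p[i]
    return out

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def random_increasing():
    """Monotone envelope of a random function: f(b) = max_{b' <= b} h(b')."""
    h = {b: random.random() for b in cube}
    return {b: max(h[c] for c in cube if leq(c, b)) for b in cube}

def E(f):
    return sum(f[b] * prob(b) for b in cube)

for _ in range(200):
    f, g = random_increasing(), random_increasing()
    fg = {b: f[b] * g[b] for b in cube}
    assert E(fg) >= E(f) * E(g) - 1e-12   # FKG holds in every trial
```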


Easy Application: In the random graph G(2k, 1/2), there is probability at least $2^{-2k}$ that all degrees are ≤ k − 1. (Call this event A; it is an intersection of 2k decreasing degree events, each of probability ≥ 1/2, so the FKG inequality gives the lower bound. One can also ask for an upper bound on Pr(A). A is contained in the event that vertex 1 has degree ≤ k − 1, so Pr(A) ≤ 1/2. Here is a simple improvement, showing that Pr(A) tends to 0. Fix a set L of the vertices, of size ℓ. For v ∈ L, if it has at most k − 1 neighbors, then it has at most k − 1 neighbors in $L^c$. So we'll just upper bound the probability that every vertex in L has at most k − 1 neighbors in $L^c$. These events (ranging over v ∈ L) are independent. So we can use the upper bound $\big(2^{-(2k-\ell)} \binom{2k-\ell}{\leq k-1}\big)^{\ell}$, where $\binom{m}{\leq k-1} = \sum_{j \leq k-1} \binom{m}{j}$. Fixing ℓ proportional to $\sqrt{k}$ sets the base of this exponential to a constant < 1 (a deviation bound at a constant number of standard deviations from the mean), and therefore yields a bound of the form $\Pr(A) \leq \exp(-\Omega(\sqrt{k}))$. It might be an interesting exercise to improve this bound.)

Application: The FKG inequality provides a very efficient proof of an inequality of Daykin and Lovász [28]:

Theorem 27. Let H be a family of subsets of [n] such that for all A, B ∈ H, ∅ ⊂ A ∩ B and A ∪ B ⊂ [n] (strict containments). Then $|H| \leq 2^{n-2}$.

Proof. Let F be the “upward order ideal” generated by H:

F = {S : ∃T ∈ H, T ⊆ S}.

Similarly let G be the "downward order ideal" generated by H: G = {S : ∃T ∈ H, S ⊆ T}. Then H ⊆ F ∩ G. $|F| \leq 2^{n-1}$ because F cannot contain any set and its complement. Likewise, $|G| \leq 2^{n-1}$ because G too cannot contain any set and its complement. Interpreting this in terms of the bits being distributed uniformly iid, we have that Pr(F) ≤ 1/2 and Pr(G) ≤ 1/2. Since F is an increasing event and G a decreasing event, Pr(F ∩ G) ≤ Pr(F) Pr(G) ≤ 1/4, i.e., $|H| \leq |F \cap G| \leq 2^{n-2}$. 2

Application: We won't show the argument here, but the FKG inequality was used in a very clever way by Shepp to prove the "XYZ inequality" conjectured by Rival and Sands. Let Γ be a finite poset. A linear extension of Γ is any total order on its elements that is consistent with Γ. Consider the uniform distribution on linear extensions of Γ. The XYZ inequality says:

Theorem 28 (Shepp [86]). For any three elements x, y, z of Γ,

Pr((x ≤ y) ∧ (x ≤ z)) ≥ Pr(x ≤ y) · Pr(x ≤ z).


1.8.1 Omitted 2020: Achieving expectation in MAX-3SAT.

MAX-3SAT

Let's start looking at some computational problems. A 3CNF formula on variables $x_1, \ldots, x_n$ is the conjunction of clauses, each of which is a disjunction of at most three literals. (A literal is an $x_i$ or $x_i^c$, where $x_i^c$ is the negation of $x_i$.) You will recall that it is NP-complete to decide whether a 3CNF formula is satisfiable, that is, whether there is an assignment to the $x_i$'s s.t. all clauses are satisfied. Let's take a little different focus: think about the maximization problem of satisfying as many clauses as possible. Of course this is NP-hard, since it includes satisfiability as a special case. But, being an optimization problem, we can still ask how well we can do.

Theorem 29. For any 3CNF formula there is an assignment satisfying ≥ 7/8 of the clauses. Moreover such an assignment can be found in randomized time O(m2), where m is the number of clauses (and we suppose that every variable occurs in some clause).

Proof. The existence assertion is due to linearity of expectation: a uniformly random assignment satisfies each clause on three distinct literals with probability 1 − 2⁻³ = 7/8 (as usual for this problem we take each clause to have exactly three distinct literals), so the expected number of satisfied clauses is 7m/8, and some assignment must achieve this. The algorithm might be attributed to the English educator Hickson [50]: 'Tis a lesson you should heed: / Try, try, try again. / If at first you don't succeed, / Try, try, try again. Now that we've been suitably educated, let's ask, how long does this process take? In a single trial we check one assignment, which takes time O(m). How many trials do we need? Let the rv M be the number of satisfied clauses of a random assignment. m − M is a nonnegative rv with expectation m/8, and Markov's inequality tells us that Pr(M ≤ (7/8 − ε)m) = Pr(m − M ≥ (1 + 8ε)m/8) ≤ 1/(1 + 8ε). This says we have a good chance of getting close to the desired number of satisfied clauses; however, we asked to achieve 7/8, not 7/8 − ε. We can get this by noting that M is integer-valued, so for ε < 1/m, an assignment satisfying 7/8 − ε of the clauses satisfies 7/8 of them.

1 8ε 4 1 − = = ∈ Ω(1/m) 1 + 8ε 1 + 8ε m + 4

Trials succeed or fail independently so the expected number of trials to success is the expectation of a geometric random variable with parameter Ω(1/m), which is O(m). 2
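A sketch of the try-try-again algorithm on a toy formula. The clause list, in DIMACS-style signed-literal form, is a hypothetical example of ours; for the O(m) expected trial count, the clauses should each have exactly three distinct literals:

```python
import math
import random

random.seed(4)

def satisfied(clause, assign):
    """A clause (list of signed variable indices) is satisfied by assign."""
    return any((lit > 0) == assign[abs(lit)] for lit in clause)

def count_sat(clauses, assign):
    return sum(satisfied(c, assign) for c in clauses)

def max3sat_retry(clauses, n_vars):
    """Try, try again: resample uniform assignments until at least
    ceil(7m/8) clauses are satisfied."""
    target = math.ceil(7 * len(clauses) / 8)
    while True:
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        if count_sat(clauses, assign) >= target:
            return assign

# toy 3CNF: -1 denotes the negation of x_1, etc. (hypothetical example)
clauses = [[1, 2, 3], [-1, 2, 4], [1, -3, -4], [-2, -3, 4], [2, 3, -4]]
a = max3sat_retry(clauses, 4)
assert count_sat(clauses, a) >= math.ceil(7 * len(clauses) / 8)
```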

Derandomization by the method of conditional expectations

How can we improve on this simple-minded method? We do not have a way forward on increasing the fraction of satisfied clauses, because of:

Theorem 30 (Håstad [48]). For all ε > 0 it is NP-hard to approximate Max-3SAT within factor 7/8 + ε.

But we might hope to reduce the runtime, and also perhaps the dependence on random bits. As it turns out we can accomplish both of these objectives.

Theorem 31. There is an O(m)-time deterministic algorithm to find an assignment satisfying 7/8 of the clauses of any 3CNF formula on m clauses.


Proof. This algorithm illustrates the method of conditional expectations. The point is that we can derandomize the randomized algorithm by not picking all the variables at once. Instead, we consider the alternative choices for just one of the variables, and choose the branch on which the conditional expected number of satisfied clauses is greater. This very general method works in situations in which one can actually quickly calculate (or at least approximate) said conditional expectations. We use the tower property of conditional expectations, (1.18): letting Y = the number of satisfied clauses for a uniformly random setting of the rvs,

E(Y) = E(E(Y|x1))

or explicitly

$$E(Y) = \frac{1}{2} E(Y \mid x_1 = 0) + \frac{1}{2} E(Y \mid x_1 = 1)$$

and the strategy is to pursue the value of $x_1$ which does better. In the present example computing the conditional expectations is easy. The probability that a clause of size i is satisfied is $1 - 2^{-i}$. If a formula has $m_i$ clauses of size i, the expected number of satisfied clauses is $\sum_i m_i(1 - 2^{-i})$. Now, partition the clauses of size i into $m_i^1$ that contain the literal $x_1$, $m_i^0$ that contain the literal $x_1^c$, and those that contain neither. The expected number of satisfied clauses conditional on setting $x_1 = 1$ is

$$\sum_i m_i^1 + \sum_i m_i^0 (1 - 2^{-i+1}) + \sum_i (m_i - m_i^1 - m_i^0)(1 - 2^{-i}). \qquad (1.19)$$

Similarly the expected number of satisfied clauses conditional on setting x1 = 0 is

$$\sum_i m_i^1 (1 - 2^{-i+1}) + \sum_i m_i^0 + \sum_i (m_i - m_i^1 - m_i^0)(1 - 2^{-i}). \qquad (1.20)$$

Each of these quantities can be computed in time O(m). (Actually, since these quantities average to the current expectation, which we already know, we only have to calculate one of them.) This simple process runs in time O(m²). However, we can actually do the process in time O(m). We don't even really need to calculate the quantities (1.19), (1.20). We start with variable $x_1$ and scan all the clauses it participates in (see Fig. 1.3). For each clause $C_i$ (which say currently has $|C_i|$ literals), the effect of setting $x_1 = 1$ changes the contribution of the clause to the expectation from $1 - 2^{-|C_i|}$ to either 1 (if the variable satisfies the clause) or to $1 - 2^{-(|C_i|-1)}$ (otherwise); i.e., the expectation either increases or decreases by $2^{-|C_i|}$, while the effect of setting $x_1 = 0$ is exactly the negative of this. We add up these contributions of $\pm 2^{-|C_i|}$, conditional on $x_1 = 1$, as we scan the clauses containing $x_1$; if the total is nonnegative we fix $x_1 = 1$, otherwise we fix $x_1 = 0$. Having done that, we edit the relevant clauses to eliminate $x_1$ from them. Then we continue with $x_2$, etc. The work spent per variable is proportional to its degree in this bipartite graph (the number of clauses containing it), and the sum of these degrees is ≤ 3m. So the total time spent is O(m). 2

[Figure 1.3: the bipartite incidence graph between clauses $C_1, \ldots, C_m$ of size ≤ 3 and variables $x_1, \ldots, x_n$.]
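The method of conditional expectations can be sketched as follows. For simplicity this version recomputes the full conditional expectation for each choice, i.e., it is the O(m²)-style variant rather than the incremental O(m) one described above; the function names are ours:

```python
def expected_sat(clauses, partial):
    """E[# satisfied clauses] when the variables not in `partial`
    are set to independent uniform random bits."""
    total = 0.0
    for clause in clauses:
        unset, sat = 0, False
        for lit in clause:
            v = abs(lit)
            if v in partial:
                sat = sat or ((lit > 0) == partial[v])
            else:
                unset += 1
        total += 1.0 if sat else 1.0 - 2.0 ** -unset
    return total

def conditional_expectations(clauses, n_vars):
    """Fix variables one at a time, never letting the conditional
    expectation decrease; the final integer count is >= the initial E."""
    partial = {}
    for v in range(1, n_vars + 1):
        e1 = expected_sat(clauses, {**partial, v: True})
        e0 = expected_sat(clauses, {**partial, v: False})
        partial[v] = e1 >= e0
    return partial

# toy 3CNF in signed-literal form (hypothetical example)
clauses = [[1, 2, 3], [-1, 2, 4], [1, -3, -4], [-2, -3, 4], [2, 3, -4]]
assign = conditional_expectations(clauses, 4)
sat = sum(any((l > 0) == assign[abs(l)] for l in c) for c in clauses)
assert sat >= 7 * len(clauses) // 8   # deterministic 7/8 guarantee
```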

Caltech CS150 2020.

Chapter 2

Algebraic Fingerprinting

There are several key ways in which algebra is used in algorithms. One is to "push apart" things that are different even if they are similar. We'll study a few examples of this phenomenon.

2.1 Lecture 9 (19/Oct): Fingerprinting with Linear Algebra

2.1.1 Polytime Complexity Classes Allowing Randomization

Some definitions of one-sided and two-sided error in randomized computation are useful. Definition 32. BPP, RP, co-RP, ZPP: These are the four main classes of randomized polynomial-time computation. All are decision classes. A language L is:

• In BPP if the algorithm errs with probability ≤ 1/3.

• In RP if for x ∈ L the algorithm errs with probability ≤ 1/3, and for x ∉ L the algorithm errs with probability 0. RP: no false positives.

(Note, RP is like NP in that it provides short proofs of membership.) The subsidiary definitions are:

• L ∈ co-RP if L^c ∈ RP, that is to say, if for x ∈ L the algorithm errs with probability 0, and if for x ∉ L the algorithm errs with probability ≤ 1/3. co-RP: no false negatives

• ZPP = RP ∩ co-RP. ZPP: no errors

It is a routine exercise that none of these constants matter and that they can be replaced by any 1/poly, although completing that exercise relies on the Chernoff bound which we'll see in a later lecture.
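As an aside on why the constant 1/3 is harmless, here is a small simulation of amplification by independent repetition and majority vote; the toy decider, its error rate, and the repetition count are illustrative choices of mine, not from the notes.

```python
import random

random.seed(0)

def noisy_decider(x):
    """Stand-in for a BPP algorithm: answers whether x is even,
    correctly with probability 2/3."""
    truth = (x % 2 == 0)
    return truth if random.random() < 2 / 3 else not truth

def amplified(x, k):
    """Run the decider k times (k odd) and take a majority vote;
    by Chernoff, the error decays exponentially in k."""
    votes = sum(noisy_decider(x) for _ in range(k))
    return votes > k - votes

trials = 2000
base_err = sum(noisy_decider(4) is not True for _ in range(trials)) / trials
amp_err = sum(amplified(4, 31) is not True for _ in range(trials)) / trials
```

Empirically `base_err` stays near 1/3 while `amp_err`, with only 31 repetitions, drops well below it.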

Exercise: Show that the following are two equivalent characterizations of ZPP: (a) there is a poly-time randomized algorithm that with probability ≥ 1/3 outputs the correct answer, and with the remaining probability halts and outputs "don't know;" (b) there is an expected-poly-time algorithm that always outputs the correct answer.

We have the inclusions shown in Figure 2.1. What is the difference between ZPP and BPP? In BPP, we can never get a definitive answer, no matter how many independent runs of the algorithm we execute. In ZPP, we can, and the expected


Figure 2.1: Some inclusions among complexity classes: P ⊆ ZPP; ZPP ⊆ RP ⊆ NP; ZPP ⊆ co-RP ⊆ co-NP; RP ⊆ BPP and co-RP ⊆ BPP.

time until we get a definitive answer is polynomial; but we cannot be sure of getting the definitive answer in any fixed time bound. Here are the possible outcomes for any single run of each of the basic types of algorithm:

          x ∈ L    x ∉ L
RP        ∈, ∉    ∉
co-RP     ∈       ∈, ∉
BPP       ∈, ∉    ∈, ∉

If L ∈ ZPP, then we can run simultaneously an RP algorithm A and a co-RP algorithm B for L. Ideally, this will soon give us a definitive answer: if both algorithms say "x ∈ L", then A cannot have been wrong, and we are sure that x ∈ L; if both algorithms say "x ∉ L", then B cannot have been wrong, and we are sure that x ∉ L. The expected number of iterations until one of these events happens (whichever is viable) is constant. But, in any particular iteration, we can (whether x ∈ L or x ∉ L) get the indefinite outcome that A says "x ∉ L" and B says "x ∈ L". This might continue for arbitrarily many rounds, which is why we can't make any guarantee about what we'll be able to prove in bounded time.

An algorithm with "BPP"-style two-sided error is often referred to as "Monte Carlo," while a "ZPP"-style algorithm is often referred to as "Las Vegas."

2.1.2 Verifying Matrix Multiplication

It is a familiar theme that verifying a fact may be easier than computing it. Most famously, it is widely conjectured that P ≠ NP. Now we shall see a more down-to-earth example of this phenomenon. In what follows, all matrices are n × n. In order to eliminate some technical issues (mainly numerical precision, also the design of a substitute for uniform sampling), we suppose that the entries of the matrices lie in Z/p, p prime, and that scalar arithmetic can be performed in unit time. (The same method will work for any finite field, and a similar method will work if the entries are integers less than poly(n) in absolute value, so that we can again reasonably sweep the runtime for scalar arithmetic under the rug.) Here are two closely related questions:

1. Given matrices A, B, compute A · B.


2. Given matrices A, B, C, verify whether C = A · B.

The best known algorithm, as of 2014, for the first of these problems runs in time O(n^{2.3728639}) [42]. Resolving the correct lim inf exponent (usually called ω) is a major question in complexity theory. Clearly the second problem is no harder, and a lower bound of Ω(n^2) even for that is obvious since one must read the whole input. Randomness is not known to help with problem (1), but the situation for problem (2) is quite different. I'll use the term co-RP-type in the following theorems only to indicate the acceptance conditions given in the definition, but not the runtime, which we will be giving explicitly.

Theorem 33 (Freivalds [39]). There is a co-RP-type algorithm for the language "C = A · B," running in time O(n^2).

Proof. Note that the obvious procedure for matrix-vector multiplication runs in time O(n^2). The verification algorithm is simple. Select uniformly a vector x ∈ (Z/p)^n. Check whether ABx = Cx without ever multiplying AB: applying associativity, (AB)x = A(Bx), this can be done in just three matrix-vector multiplications. Output "Yes" if the equality holds; output "No" if it fails. Clearly if AB = C, the output will be correct. In order to get the co-RP-type result, it remains to show that Pr(ABx = Cx | AB ≠ C) ≤ 1/2. The event ABx = Cx is equivalently stated as the event that x lies in the right kernel of AB − C. Given that AB ≠ C, that kernel is a strict subspace of (Z/p)^n and therefore of at most half the cardinality of the larger space. Since we select x uniformly, the probability that it is in the kernel is at most 1/2. 2
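A minimal sketch of Freivalds' check follows; the list-of-lists matrix representation and the trial count are my choices. Over Z/p a single trial accepts a wrong C with probability at most 1/p (the kernel has dimension ≤ n − 1), even better than the 1/2 bound in the proof, and independent repetitions multiply the error.

```python
import random

def matvec(M, x, p):
    """Matrix-vector product over Z/p, O(n^2) time."""
    return [sum(a * b for a, b in zip(row, x)) % p for row in M]

def freivalds(A, B, C, p, trials=20):
    """co-RP-type check of C == A*B: a 'False' answer is always
    correct; a 'True' answer is wrong w.p. at most (1/p)^trials."""
    n = len(A)
    for _ in range(trials):
        x = [random.randrange(p) for _ in range(n)]
        # three matrix-vector products instead of one matrix product
        if matvec(A, matvec(B, x, p), p) != matvec(C, x, p):
            return False          # found a witness: C != A*B
    return True
```

Each trial costs O(n^2), versus n^{ω+o(1)} for computing A·B outright.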

2.1.3 Verifying Associativity

Let a set S of size n be given, along with a binary operation ◦ : S × S → S. Thus the input is a table of size n^2; we call the input (S, ◦). The problem we consider is testing whether the operation is associative, that is, whether for all a, b, c ∈ S,

(a ◦ b) ◦ c = a ◦ (b ◦ c)   (2.1)

A triple for which (2.1) fails is said to be a nonassociative triple. No sub-cubic-time deterministic algorithm is known for this problem. However,

Theorem 34 (Rajagopalan & Schulman [79]). There is an O(n^2)-time co-RP-type algorithm for associativity.

Proof. An obvious idea is to replace the O(n^3)-time exhaustive search for a nonassociative triple by randomly sampling triples and checking them. The runtime required is inversely proportional to the fraction of nonassociative triples, so this method would improve on exhaustive search if we were guaranteed that a nonassociative operation had a super-constant number of nonassociative triples. However, for every n ≥ 3 there exist nonassociative operations with only a single nonassociative triple. So we'll have to do something more interesting. Let's define a binary operation (S̄, ◦) on a much bigger set S̄. Define S̄ to be the vector space with basis S over the field Z/2, that is to say, an element x ∈ S̄ is a formal sum

x = ∑_{a∈S} x_a a,   x_a ∈ Z/2


The product of two such elements x, y is

x ◦ y = ∑_{a∈S} ∑_{b∈S} (a ◦ b) x_a y_b = ∑_{c∈S} c ( ⊕_{a,b : a◦b=c} x_a y_b ),

where of course ⊕ denotes sum mod 2. On (S̄, ◦) we have an operation that we do not have on (S, ◦), namely, addition:

x + y = ∑_{a∈S} a (x_a + y_a)

(Those who have seen such constructions before will recognize (S̄, ◦) as an "algebra" of (S, ◦) over Z/2.) The algorithm is now simple: check the associative identity for three random elements of S̄. That is, select x, y, z u.a.r. in S̄. If (x ◦ y) ◦ z = x ◦ (y ◦ z), report that (S, ◦) is associative; otherwise report that it is not associative. The runtime for this process is clearly O(n^2). If (S, ◦) is associative then so is (S̄, ◦), because then (x ◦ y) ◦ z and x ◦ (y ◦ z) have identical expansions as sums. Also, nonassociativity of (S, ◦) implies nonassociativity of (S̄, ◦), by simply considering "basis" vectors within the latter. But this equivalence is not enough. The crux of the argument is the following:

Lemma 35. If (S, ◦) is nonassociative then at least one eighth of the triples (x, y, z) in S̄ are nonassociative triples.

Proof. The proof relies on a variation on the inclusion-exclusion principle. For any triple a, b, c ∈ S, let g(a, b, c) = (a ◦ b) ◦ c − a ◦ (b ◦ c). Note that g is a mapping g : S^3 → S̄. It is somewhere nonzero, because it is nonzero exactly on the nonassociative triples of (S, ◦). Fix (a^0, b^0, c^0) to be such a nonassociative triple. Now extend g to ḡ : S̄^3 → S̄ by:

ḡ(x, y, z) = ∑_{a,b,c} g(a, b, c) x_a y_b z_c

If you imagine the n × n × n cube indexed by S^3, with each position (a, b, c) filled with the entry g(a, b, c), then ḡ(x, y, z) is the sum of the entries in the combinatorial subcube of positions where x_a = 1, y_b = 1, z_c = 1. (We say "combinatorial" only to emphasize that unlike a physical cube, here the slices that participate in the subcube are not in any sense adjacent.) Partition S̄^3 into blocks of eight triples apiece, as follows. Each of these blocks is indexed by a triple x, y, z s.t. x_{a^0} = 0, y_{b^0} = 0, z_{c^0} = 0. The eight triples are (x + ε_1 a^0, y + ε_2 b^0, z + ε_3 c^0) where ε_i ∈ {0, 1}. Now we claim that

∑_{ε_1,ε_2,ε_3} ḡ(x + ε_1 a^0, y + ε_2 b^0, z + ε_3 c^0) = g(a^0, b^0, c^0)   (2.2)

To see this, note that each of the eight terms on the LHS is, as described above, a sum of the entries in a "subcube" of the "S^3 cube." These subcubes are closely related: there is a core subcube whose indicator function is x × y × z, and all entries of this subcube are summed within all eight terms. Then there are additional width-1 pieces: the entries in the region a^0 × y × z occur in four terms, as do the regions x × b^0 × z and x × y × c^0. The entries in the regions a^0 × b^0 × z, a^0 × y × c^0 and x × b^0 × c^0 occur in two terms, and the entry in the region a^0 × b^0 × c^0 occurs in one term. Since all coefficients lie in Z/2, the entries counted an even number of times cancel, which proves (2.2).


Since the RHS of (2.2) is nonzero, so is at least one of the eight terms on the LHS. 2

The algorithmic conclusion is now a corollary: in time O(n^2) we can sample x, y, z u.a.r. in S̄ and determine whether (x ◦ y) ◦ z = x ◦ (y ◦ z). If (S, ◦) is associative, then we get =; if (S, ◦) is nonassociative, we get ≠ with probability ≥ 1/8. 2
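The whole test can be sketched in Python; representing an element of S̄ by the set of basis elements with coefficient 1 is my own convention, as are the helper names and the trial count.

```python
import random

def bar_product(X, Y, table):
    """Multiply two elements of the Z/2 algebra S-bar, represented as
    sets: the coefficient of c in X∘Y is the parity of
    #{(a, b) in X×Y : a◦b = c}.  Runs in O(n^2) time."""
    counts = {}
    for a in X:
        for b in Y:
            c = table[a][b]
            counts[c] = counts.get(c, 0) + 1
    return frozenset(c for c, k in counts.items() if k % 2)

def probably_associative(table, trials=100):
    """Each failing trial certifies nonassociativity; if (S,◦) is
    nonassociative, a single trial misses with probability <= 7/8,
    so `trials` repetitions miss with probability <= (7/8)^trials."""
    n = len(table)
    for _ in range(trials):
        X, Y, Z = (frozenset(a for a in range(n) if random.random() < 0.5)
                   for _ in range(3))
        if bar_product(bar_product(X, Y, table), Z, table) != \
           bar_product(X, bar_product(Y, Z, table), table):
            return False
    return True
```

Note that each trial is O(n^2), while exhaustive search over triples is O(n^3).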


2.2 Lecture 10 (21/Oct): Cont. associativity; perfect matchings, polynomial identity testing

2.2.1 Matchings

A matching in a graph G = (V, E) is a set of vertex-disjoint edges; the size of the matching is the number of edges. Let n = |V| and m = |E|. A perfect matching is one of size n/2. A maximal matching is one to which no edges can be added. A maximum matching is one of largest size. How hard are the problems of finding such objects? It is of course easy to find a maximal matching—sequentially. On the other hand, finding one on a parallel computer is a much more interesting problem, which I hope to return to later in the course. Returning to sequential computation: finding a maximum matching, or deciding whether a perfect matching exists, are interesting problems. In bipartite graphs, Hall's theorem and the augmenting path method give very nice and accessible deterministic algorithms for maximum matching. In general graphs the problem is harder, but there are deterministic algorithms running in time O(m√n) [72, 41].

2.2.2 Bipartite perfect matching: deciding existence

The first problem we focus on here is to decide whether a bipartite graph has a perfect matching. As noted, there are nice deterministic algorithms for this problem, but the randomized one is even simpler. Write G = (V_1, V_2, E) with E ⊆ V_1 × V_2. Form the V_1 × V_2 "variable" matrix A which has A_{ij} = x_{ij} if {i, j} ∈ E, and otherwise A_{ij} = 0.

Let q be some prime power and consider the x_{ij} as variables over GF(q). The determinant of A, then, is a polynomial in the variables x_{ij}. Before launching into this, a word on a subtle point: what does it mean for a (multivariate) polynomial p to be nonzero? Consider a polynomial over any field κ, which is to say, the coefficients of all the monomials in the polynomial lie in κ.

Definition 36. We consider a polynomial nonzero if some monomial has a nonzero coefficient.

A stronger condition, which is not the definition we adopt, is that p(x) ≠ 0 for some x ∈ κ^m. Of course this implies the condition in the definition; but it is strictly stronger, as we can see from the example of the polynomial x^2 + x over the field Z/2. However, the conditions are equivalent in the following two cases:

1. Over infinite fields, e.g., R, Q, or any algebraically closed field. This will follow from Lemma 42.

2. For multilinear polynomials. (This applies in particular to Det(A) which we are considering now.)

Specifically for the multilinear case we have:

Lemma 37. Let p(x_1, ..., x_m) be a nonzero multilinear polynomial over field κ. Then there is a setting of the x_i to values c_i in κ s.t. p(c_1, ..., c_m) ≠ 0.


Proof. Every monomial is associated with a set of variables; choose a monomial whose set is minimal under inclusion. (E.g., if there is a constant term, then the empty set.) Assign the value 1 to all variables in this set, and 0 to all variables outside it. Only the chosen monomial can be nonzero, so p ≠ 0 for this assignment. 2
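The proof is constructive, and can be sketched as follows; the dict-of-monomials input format is my own convention, not from the notes.

```python
def nonzero_witness(monomials):
    """Given a nonzero multilinear polynomial as a dict mapping
    frozensets of variables to nonzero coefficients, return a 0/1
    assignment on which it is nonzero, plus the value attained.
    Strategy of Lemma 37: pick a monomial whose variable set is
    minimal under inclusion, set those variables to 1, the rest to 0."""
    sets = list(monomials)
    minimal = next(S for S in sets if not any(T < S for T in sets))
    assignment = {v: 1 for v in minimal}
    # Only monomials T ⊆ minimal survive the assignment; by minimality
    # that is the chosen monomial alone, so the value is its coefficient.
    value = sum(c for T, c in monomials.items() if T <= minimal)
    return assignment, value

# p(x1, x2, x3) = x1*x2 - x3, a nonzero multilinear polynomial
p = {frozenset({1, 2}): 1, frozenset({3}): -1}
assignment, value = nonzero_witness(p)
```

Here the monomial x1*x2 is inclusion-minimal, so the witness sets x1 = x2 = 1, x3 = 0.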

Lemma 38. G has a perfect matching iff Det(A) ≠ 0.

(This is the “baby” version of a result of Tutte that we will see later in Theorem 43.)

Proof. Every monomial in the expansion of the determinant corresponds to a permutation. A permutation is simply a pattern of 1's hitting each row and column exactly once, namely, a perfect matching in the bipartite graph. Conversely, if some perfect matching is present, it puts a monomial in the determinant with coefficient either 1 or −1. 2

Corollary 39. Fix any field κ. G has a perfect matching iff there is an assignment of the variables in A such that the determinant is nonzero.

Proof. Apply Lemma 37 to Lemma 38. 2

This suggests the following exceptionally simple algorithm: compute the polynomial and see if it is nonzero. There's a problem with this idea! The determinant has exponentially many monomials. This is not a problem for computing determinants over a ring such as the integers, because even the sum of exponentially many integers only has polynomially more bits than the largest of those integers has. However, in this ring of multivariate polynomials (i.e., the ring κ[{x_{ij}}] where κ = Q or κ = GF(q); for the moment it doesn't matter), there are exponentially many distinct terms to keep track of if you want to write the polynomial out as a sum of monomials. Of course the determinant has a more concise representation (namely, as "Det(A)"), but we do not know how to efficiently convert that to any representation that displays transparently whether the polynomial is the 0 polynomial. So we modify the original suggestion. Since we do know how to efficiently compute determinants of scalar matrices, let's substitute scalar values for the x_{ij}'s. What values should we use? Random ones.

Revised Algorithm: Sample the x_{ij}'s u.a.r. in GF(q); call the sampled matrix A_R. Compute Det(A_R); report "G has/hasn't a perfect matching" according to whether Det(A_R) ≠ 0 or = 0.

Figure 2.2: A commutative square: substituting scalars into Det(variables) and then evaluating gives the same value of Det as first expanding Det(variables) into monomials and then substituting. This diagram commutes, but for a fast commute, go clockwise.

Clearly the algorithm answers correctly if there is no perfect matching, and it is fast (see Fig. 2.2). What needs to be shown is that the probability of error is small if there is a perfect matching (and q is large enough). So this is an RP-type algorithm for "G has a perfect matching."

Theorem 40. The algorithm is error-free on bipartite graphs lacking a perfect matching, and the probability of error of the algorithm on bipartite graphs which have a perfect matching is at most n/q. The runtime of the algorithm is n^{ω+o(1)}, where ω is the matrix multiplication exponent.


All we have to do, then, is use a prime power q ≥ 2n in order to have error probability ≤ 1/2. Incidentally, there is always a prime 2n ≤ q < 4n; this is called "Bertrand's postulate." This fact alone isn't quite strong enough if we want to find a prime in the right size range efficiently, but that too can be done, in a slightly larger range. (The density of primes of this size is about 1/log n, so we don't have to try many values before we should get lucky; and note, primality testing has efficient algorithms in ZPP and even somewhat less efficient algorithms in P [3].) However, we don't have to work this hard, since we're satisfied with prime powers rather than primes. We can simply use the first power of 2 after 2n. We will prove Theorem 40 after introducing a generally useful tool about polynomial identity testing.
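A sketch of the revised algorithm follows. I use a fixed prime modulus and plain O(n^3) Gaussian elimination as a stand-in for the n^{ω+o(1)} determinant computation of Theorem 40; all names are mine.

```python
import random

def det_mod_p(M, p):
    """Determinant over Z/p by Gaussian elimination, O(n^3) time."""
    M = [[x % p for x in row] for row in M]
    n = len(M)
    det = 1
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col]), None)
        if piv is None:
            return 0
        if piv != col:
            M[col], M[piv] = M[piv], M[col]
            det = (-det) % p
        det = det * M[col][col] % p
        inv = pow(M[col][col], p - 2, p)      # Fermat inverse; p prime
        for r in range(col + 1, n):
            f = M[r][col] * inv % p
            M[r] = [(a - f * b) % p for a, b in zip(M[r], M[col])]
    return det

def bipartite_matching_test(n, edges, p=4099, trials=20):
    """RP-type test: substitute u.a.r. scalars for the x_ij in the
    variable matrix.  A 'True' answer is always right; a 'False'
    answer is wrong with probability <= (n/p) per trial."""
    for _ in range(trials):
        A = [[random.randrange(p) if (i, j) in edges else 0
              for j in range(n)] for i in range(n)]
        if det_mod_p(A, p) != 0:
            return True
    return False
```

Here 4099 is a prime comfortably larger than 2n for small test graphs.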


2.3 Lecture 11 (23/Oct): Cont. perfect matchings; polynomial iden- tity testing

2.3.1 Polynomial identity testing

In the previous section we saw that testing for existence of a perfect matching in a bipartite graph can be cast as a special case of the following problem. We are given a polynomial p(x), of total degree n, in variables x = (x_1, ..., x_m), m ≥ 1. (The total degree of a monomial is the sum of the degrees of the variables in it; the total degree of a polynomial is the greatest total degree of its monomials.) We are agnostic as to how we are "given" the polynomial, and demand only that we be able to quickly evaluate it at any scalar assignment to the variables. We wish to test whether the polynomial is identically 0, and our procedure for doing so is to evaluate it at a random point and report "yes" if the value there is 0. We rely on the following lemma. Let z(p) be the set of roots (zeros) of a polynomial p.

Lemma 41. Let p be a nonzero polynomial over GF(q), of total degree n in m variables. Then |z(p)| ≤ n q^{m−1}.

As a fraction, this is saying that |z(p)|/q^m ≤ n/q, and in this form the lemma immediately implies Theorem 40. The univariate case of the lemma is probably familiar to you. The lemma is a special case of the following more general statement, which holds for any, even infinite, field κ.

Lemma 42. Let p be a nonzero polynomial over a field κ, of total degree n in variables x_1, ..., x_m. Let S_1, ..., S_m be subsets of κ with |S_i| ≤ s for all i. Then |z(p) ∩ (S_1 × ... × S_m)| ≤ s^{m−1} n.

This is usually known as the Schwartz-Zippel lemma [84, 97], although the results in these two publications were not precisely equivalent, and there was another discovery around the same period by DeMillo and Lipton [29], and all these were preceded by Ore [66]. A generalization beyond polynomials is due to Gonnet [44]. (Recall the two candidate definitions of what it means for a polynomial to be nonzero. Since in Definition 36 we chose the weaker condition, Lemma 42 is stronger than it would have been had we chosen the stronger condition.)

Proof of Lemma 42: The lemma is trivial if n ≥ s, so suppose n < s. First consider the univariate case, m = 1. (In this case the two lemmas are identical since any set S_1 is a product set.) This follows by induction on n, because if n ≥ 1 and p(α) = 0, then p can be factored as p(x) = (x − α) · q(x) for some q of degree n − 1. (To see this, change variables to y = x − α. In the new variable the polynomial has no constant term, so the factor y = x − α can be pulled out.)

Next we handle m > 1 by induction. If x_1 does not appear in p then the conclusion follows from the case m − 1. Otherwise, write p in the form p(x) = ∑_{i=0}^{n} x_1^i p_i(x_2, ..., x_m), and let 0 < i ≤ n be largest such that p_i ≠ 0. The degree of p_i is at most n − i, so by induction,

|z(p_i) ∩ (S_2 × ... × S_m)| / s^{m−1} ≤ (n − i)/s

Let r be the LHS, i.e., the fraction of S_2 × ... × S_m that are roots of p_i.

For (x_2, ..., x_m) ∈ z(p_i) we allow as a worst case that all choices of x_1 ∈ S_1 yield a zero of p.


For (x_2, ..., x_m) ∉ z(p_i), p restricts to a nonzero polynomial of degree i in the variable x_1, so by the case m = 1,

|z(p) ∩ (S_1 × x_2 × ... × x_m)| / s ≤ i/s

Since i/s ≤ n/s < 1, the weighted average of our two bounds (on the fraction of roots in sets of the form S_1 × x_2 × ... × x_m) is worst when r is as large as possible. Thus

|z(p) ∩ (S_1 × ... × S_m)| / s^m ≤ r · 1 + (1 − r) · i/s
    ≤ (n − i)/s · 1 + (1 − (n − i)/s) · i/s
    = n/s − i(n − i)/s^2
    ≤ n/s. 2

Comment: This lemma gives us an efficient randomized way of testing whether a polynomial is identically zero, and naturally, people have wondered whether there might be an efficient deterministic algorithm for the same task. So far, no such algorithm has been found, and it is known that any such algorithm would have hardness implications in complexity theory that are currently out of reach [54]¹.
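As a black box, the test looks like this; the toy polynomials f, g, h, the prime p, and the trial count are illustrative choices of mine, not from the notes.

```python
import random

def pit_agree(f, g, m, p, trials=10):
    """Schwartz-Zippel identity test over Z/p: if f - g is a nonzero
    polynomial of total degree n, each random evaluation catches the
    difference with probability >= 1 - n/p."""
    for _ in range(trials):
        x = [random.randrange(p) for _ in range(m)]
        if f(x, p) != g(x, p):
            return False          # provably different polynomials
    return True

p = 10007
f = lambda v, q: pow(v[0] + v[1], 2, q)                        # (x+y)^2
g = lambda v, q: (v[0] * v[0] + 2 * v[0] * v[1] + v[1] * v[1]) % q
h = lambda v, q: (v[0] * v[0] + v[0] * v[1] + v[1] * v[1]) % q  # off by xy
```

Note that f and g are the same polynomial presented differently, while f − h = xy is nonzero of total degree 2, so a random point separates them with probability ≥ 1 − 2/p.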

2.3.2 Deciding existence of a perfect matching in a graph

Bipartite graphs were handled last time. Now we consider general graphs. Deterministically, deciding the existence of a perfect matching in a general graph is harder than the same problem in a bipartite graph. (As noted, we have poly-time algorithms, but not nearly so simple ones.) With randomization, however, we can adapt the same approach to work with almost equal efficiency. We must define the Tutte matrix of a graph G = (V, E). Order the vertices arbitrarily from 1, . . . , n and set

T_{ij} = 0 if {i, j} ∉ E;  T_{ij} = x_{ij} if {i, j} ∈ E and i < j;  T_{ij} = −x_{ji} if {i, j} ∈ E and i > j.

Theorem 43 (Tutte [92]). Det(T) ≠ 0 iff G has a perfect matching.

This determinant is not multilinear in the variables, so the lemma from last time does not apply.

Proof. If G has a perfect matching, assign x_{ij} = 1 for edges in the matching, and 0 otherwise. Each matching edge {i, j} describes a transposition of the vertices i, j. With this assignment every row and column has a single nonzero entry corresponding to the matching edge it is part of, so the matrix is the permutation matrix (with some signs) of the involution that transposes the vertices on each edge. Since a transposition has sign −1 and there is a single −1 in each pair of nonzero entries, the contribution of each transposition to the determinant is 1, and overall we have Det(T) = 1. Conversely, suppose Det(T) ≠ 0 as a polynomial. Consider the determinant as a signed sum over permutations. The net contribution to the determinant from all permutations having an odd cycle

¹Specifically: If one can test in polynomial time whether a given arithmetic circuit over the integers computes the zero polynomial, then either (i) NEXP ⊈ P/poly or (ii) the Permanent is not computable by polynomial-size arithmetic circuits.


is 0, for the following reason. In each such permutation identify the "least" odd cycle by some criterion, e.g., ordering the cycles by their least-indexed vertex. Then flip the direction of the least odd cycle. This map is an involution on the set of permutations. It carries the permutation to another, which contributes the opposite sign to the determinant, since the signs of all edges in the cycle flipped. (Figure 2.3.)

 ......   ......   ......   ......       ...... x ...  vs.  ...... x  (2.3)  34   35   ...... x45   ...... −x34 ...  ...... −x35 ...... −x45

Figure 2.3: Flipping a cycle among vertices 3, 4, 5. Preserves permutation sign; reverses signs of cycle variables.

Therefore there are permutations of the vertices, supported by T (i.e., each vertex is mapped to a destination along one of the edges incident to it, that is, π(i) = j ⇒ T_{ij} ≠ 0), having only even cycles. The even cycles of length 2 are matching edges, and in any even cycle of length greater than 2, we can use every alternate edge; altogether we obtain a perfect matching. See Figure 2.4 for a graph having perfect matchings, and two permutations from which we can read off perfect matchings. 2

The Tutte matrix of the example graph (edges {1,2}, {1,3}, {1,4}, {2,3}, {3,4}):

T = [ 0, x_{12}, x_{13}, x_{14} ; −x_{12}, 0, x_{23}, 0 ; −x_{13}, −x_{23}, 0, x_{34} ; −x_{14}, 0, −x_{34}, 0 ]   (2.4)

The two permutation patterns (∗ marking the entries used): the involution (12)(34) uses positions (1,2), (2,1), (3,4), (4,3); the 4-cycle (1234) uses positions (1,2), (2,3), (3,4), (4,1).   (2.5)

Figure 2.4: A graph and its Tutte matrix with two different permutations π from which we can read off perfect matchings: the involution (12)(34) and the 4-cycle (1234).

In exactly the same way as for the bipartite case, this yields:

Theorem 44. The algorithm to determine existence of a perfect matching in a graph on n vertices is error-free on graphs lacking a perfect matching, and the probability of error of the algorithm on graphs which have a perfect matching is at most n/q. The runtime of the algorithm is n^{ω+o(1)}, where ω is the matrix multiplication exponent.
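The same random substitution works verbatim on the Tutte matrix. A sketch, again using a prime modulus and cubic-time elimination as a stand-in for fast linear algebra; all names are mine.

```python
import random

def det_mod_p(M, p):
    """Determinant over Z/p via Gaussian elimination (O(n^3))."""
    M = [[x % p for x in row] for row in M]
    n = len(M)
    det = 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c]), None)
        if piv is None:
            return 0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = (-det) % p
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)          # Fermat inverse; p prime
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            M[r] = [(a - f * b) % p for a, b in zip(M[r], M[c])]
    return det

def tutte_matching_test(n, edges, p=4099, trials=20):
    """Substitute random scalars into the Tutte matrix: T[i][j] = x,
    T[j][i] = -x for each edge {i,j} with i < j.  A 'True' answer is
    always right; a 'False' answer errs w.p. <= n/p per trial."""
    for _ in range(trials):
        T = [[0] * n for _ in range(n)]
        for (i, j) in edges:                  # edges given with i < j
            x = random.randrange(1, p)
            T[i][j], T[j][i] = x, p - x
        if det_mod_p(T, p) != 0:
            return True
    return False
```

A triangle (odd number of vertices) can never have a perfect matching, and indeed its skew-symmetric matrix has determinant identically 0.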


2.4 Lecture 12 (26/Oct): Parallel computation: finding perfect matchings in general graphs. Isolating lemma.

2.4.1 Parallel computation

By self-reducibility the decision algorithm above immediately yields an Õ(n^{ω+2})-time algorithm for finding a perfect matching. (Remove an edge, see if there is still a perfect matching without it, . . . ) There are two major processes at work in the above algorithm: determinant computations, and sequential branching used in the self-reducibility argument. As we discuss in a moment, the linear algebra can be parallelized. But the branching is inherently sequential. Nevertheless, we will shortly see a completely different algorithm that avoids this sequential branching. In this way we'll put the problem of finding a perfect matching in RNC.

NC^i = problems solvable deterministically by poly(n) processors in log^i n time. Equivalently, by poly(n)-size, log^i n-depth circuits. (We add that the circuits must be "log-space uniform," that is, there must be a Turing machine which, on input string 1^n presented on a read-only tape, uses only an O(log n)-length read-write tape as its workspace, and outputs on a write-only tape the circuit in full. This "uniformity" requirement keeps us from artificially hiding complex behavior in the sequence of circuits for different values of n. For example, without this kind of requirement, your complexity class would include some uncomputable functions.)

NC = ∪_i NC^i. RNC^i and RNC are defined the same way, except that the processors (gates) may use random bits. (We are glossing over the error model, i.e., Las Vegas, Monte Carlo etc., and will just mention that explicitly when relevant.)

2.4.2 Sequential and parallel linear algebra

In sequential computation, there are reductions showing that matrix inversion and multiplication have the same time complexity (up to a factor of log n), and that determinant is no harder than these. In parallel computation, the picture is actually a little simpler. Matrix multiplication is in NC^1 (right from the product definition, since we can use a tree of depth log n to sum the n terms of a row-column inner product). Matrix inversion and determinant are in NC^2, due to Csanky [27] (over fields of characteristic 0) (and using fewer processors in RNC^2 by Pan [77]); the problem is also in NC over general fields [13, 23]. Csanky's algorithm builds on the result of Valiant, Skyum, Berkowitz and Rackoff [93] that any deterministic sequential algorithm computing a polynomial of degree d in time m can be converted to a deterministic parallel algorithm computing the same polynomial in time O((log d)(log d + log m)) using O(d^6 m^3) processors. For a good explanation of Csanky's algorithm see [63] §31, and for more on parallel algorithms see [65].

2.4.3 Finding perfect matchings in general graphs, in parallel. The Isolating Lemma

We now develop a randomized method of Mulmuley, U. Vazirani and V. Vazirani [74] to find a perfect matching if one exists. A polynomial time algorithm is implied by the previous testing


method along with self-reducibility of the perfect matching decision problem. However, with the following method we can solve the same problem in parallel, that is to say, in polylog depth on polynomially many processors. This is not actually the first RNC algorithm for this task—that is an RNC^3 method [58]—but it is the "most parallel" since it solves the problem in RNC^2. A slight variant of the method yields a minimum weight perfect matching in a weighted graph that has "small" weights, that is, integer weights represented in unary; and there is a fairly standard reduction from the problem of finding a maximum matching to finding a minimum weight perfect matching in a graph with weights in {0, 1}. So through this method we can actually find a maximum matching in a general graph, with a similar total amount of work.

There are really two key ingredients to this algorithm. The first, which we have already noted, is that all basic linear algebra problems can be solved in NC^2. The second ingredient, which will be our focus, is the following lemma. First some notation. Let A = {a_1, a_2, ..., a_m} be a finite set. If a_1, ..., a_m are assigned weights w_1, ..., w_m, the weight of a set S ⊆ A is defined to be w(S) = ∑_{a_i ∈ S} w_i. Let S = {S_1, ..., S_k} be a collection of subsets of A. Let

min(S, w) = {S ∈ S : ∀T ∈ S, w(S) ≤ w(T)}

be the collection of those S ∈ S of least weight. We are interested in the event that the least weight is uniquely attained, i.e., the event that |min(S, w)| = 1.

Lemma 45 ([74] Isolating Lemma, based on improved version in [90]). Let A = [m] and let the weights w_1, ..., w_m be independent random variables, each w_i being sampled uniformly among values u_i(1) < ... < u_i(r_i), u_i(j) ∈ R, r_i ≥ r ≥ 2. Then

Pr(|min(S, w)| = 1) ≥ (1 − 1/r)^m.   (2.6)

This lemma is remarkable because of the absence of a dependence on k, the size of the family, in the conclusion.

Proof. To simplify notation we give the proof only for the "hardest" case that all r_i = r. Think of u as the mapping (∏ u_i) : [r]^m → R^m. Let V = [r]^A and V' = {2, . . . , r}^A. If v ∈ V then the composition u ◦ v : A → R is a weight function on A, and if v ∈ V' then this weight function avoids using the weights u_i(1). Note |V'|/|V| = ∏_1^m (1 − 1/r) = (1 − 1/r)^m.

Given v ∈ V', fix any set T ∈ min(S, u ◦ v) of largest cardinality. Define

φ : V' → V,  v ↦ φv,  where

φv(i) = v(i) − 1 if i ∈ T;  φv(i) = v(i) otherwise.

We claim (a) that min(S, u ◦ φv) = {T} and (b) that φ is a bijection onto its image. It will follow that the set of weight functions {u ◦ φv : v ∈ V'} is at least a fraction (1 − 1/r)^m of all weight functions, and moreover that for each weight function in this set, the minimum weight is uniquely achieved.

(a) Observe that for any S ∈ S,

(u ◦ v)(S) − (u ◦ φv)(S) = ∑_{i ∈ S∩T} ( u_i(v(i)) − u_i(v(i) − 1) ),


with every summand on the RHS being positive. In particular (u ◦ v)(T) − (u ◦ φv)(T) is the largest change in weight possible for any S, and is achieved by S only if T ⊆ S. Therefore every set S ∈ S other than T either started out with the same weight but had its weight reduced by strictly less than T did (since T was chosen of largest cardinality among the minimizers, no other minimizer contains T); or started out with strictly greater weight and had its weight reduced by at most as much as T was. So min(S, u ◦ φv) = {T} as desired.

(b) Consequently T can be identified as the unique element of min(S, u ◦ φv). So φ can be inverted. That is, for any w ∈ Image φ there is a unique v ∈ V' such that φv = w. Thus φ is a bijection. 2
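The lemma can be checked empirically; in this sketch the family of subsets, the weight range, and the trial count are arbitrary choices of mine, not from the notes.

```python
import random

def num_minimizers(family, w):
    """Number of sets in `family` attaining the minimum weight."""
    weights = [sum(w[a] for a in S) for S in family]
    return weights.count(min(weights))

random.seed(3)
m, r = 6, 12          # ground set [m]; weights u.a.r. in {1, ..., r}
family = [frozenset(S) for S in
          [{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {0, 5},
           {0, 2, 4}, {1, 3, 5}]]
trials = 2000
unique = sum(num_minimizers(family, [random.randrange(1, r + 1)
                                     for _ in range(m)]) == 1
             for _ in range(trials))
freq = unique / trials   # the lemma guarantees >= (1 - 1/r)^m ~ 0.593
```

Note the guarantee does not depend on the number of sets in the family, only on the ground set size m and the weight range r.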


2.5 Lecture 13 (28/Oct): Finding a perfect matching, in RNC

Now we describe the algorithm to find a perfect matching (or report that probably none exists) in a graph G = (V, E) with n = |V|, m = |E|.

For every {i, j} ∈ E pick an integer weight w_{ij} iid uniformly in {1, . . . , 2m}. By the isolating lemma, if G has any perfect matchings, then with probability at least (1 − 1/(2m))^m ≥ 1/2 it obtains a unique minimum weight perfect matching. Define the matrix T by:

T_{ij} = 0 if {i, j} ∉ E;  T_{ij} = 2^{w_{ij}} if {i, j} ∈ E, i < j;  T_{ij} = −2^{w_{ji}} if {i, j} ∈ E, i > j.   (2.7)

This is an instantiation of the Tutte matrix, with x_ij = 2^{w_ij}.

Lemma 46. If G has a unique minimum weight perfect matching (call it M, and its weight w(M)) then Det(T) ≠ 0; moreover, Det(T)/2^{2w(M)} is an odd integer.

Proof of Lemma: As before we look at the contributions to Det(T) of all the permutations π that are supported by edges of the graph. The contributions from permutations having odd cycles cancel out, because this is a special case of a Tutte matrix. It remains to consider permutations π that have only even cycles.

• If π consists of transpositions along the edges of M then it contributes 2^{2w(M)}.

• If π has only even cycles, but does not correspond to M, then:

– If π is some other perfect matching M′ of weight w(M′) > w(M) then it contributes 2^{2w(M′)}.
– If π has only even cycles and at least one of them is of length ≥ 4, then by separating each cycle into a pair of matchings on the vertices of that cycle, π is decomposed into two matchings M_1 ≠ M_2 of weights w(M_1), w(M_2), so π contributes ±2^{w(M_1)+w(M_2)}. Because of the uniqueness of M, not both of M_1 and M_2 can achieve weight w(M), so w(M_1) + w(M_2) > 2w(M). □

Now let T̂_ij be the (i, j)-deleted minor of T (the matrix obtained by removing the i'th row and j'th column from T), and set

m_ij = ∑_{π: π(i)=j} sign(π) ∏_{k=1}^n T_{k,π(k)}   (2.8)
     = ±2^{w_ij} Det(T̂_ij).

Lemma 47. For every {i, j} ∈ E:

1. The total contribution to m_ij of permutations π having odd cycles is 0.
2. If there is a unique minimum weight perfect matching M, then:
   (a) If {i, j} ∈ M then m_ij/2^{2w(M)} is odd.
   (b) If {i, j} ∉ M then m_ij/2^{2w(M)} is even.

Proof of Lemma: This is much like our argument for Det(T), but localized.


1. If π has an odd cycle then it has an even number of odd cycles, and hence an odd cycle not containing vertex i. If π(i) = j, i.e., π contributes to m_ij, then an odd cycle not containing i also does not contain j. Pick the "first" odd cycle that does not contain vertex i and flip it to obtain a permutation π^r. Note that (π^r)^r = π. The contribution of π^r to m_ij is the negation of the contribution of π to m_ij, because we have replaced an odd number of entries of the Tutte matrix by the same entries with flipped signs.

2. By the preceding argument, whether or not {i, j} ∈ M, we need only consider permutations containing solely even cycles. Just as argued for Lemma 46, the contribution of every such permutation π can be written as ±2^{w(M_1)+w(M_2)}, where M_1 and M_2 are two perfect matchings obtained as follows: each transposition (i′, j′) in π puts the edge {i′, j′} into both of the matchings; each even cycle of length ≥ 4 is broken alternately into two matchings, one of which (arbitrarily) is put into M_1 and the other into M_2.

The only case in which there is a term for which w(M_1) + w(M_2) = 2w(M) is the single case that {i, j} ∈ M and π consists entirely of transpositions along the edges of M. In every other case, at least one of M_1 or M_2 is distinct from M, and therefore w(M_1) + w(M_2) > 2w(M). The lemma follows. □

Finally we collect all the elements necessary to describe the algorithm:

1. Generate the weights w_ij uniformly in {1, . . . , 2m}.

2. Define T as in Eqn (2.7), compute its determinant (over Z) and, if T is nonsingular, invert it. (Otherwise, start over.) This determinant computation and the inversion can be done (deterministically) in depth O(log² n) as discussed earlier.

3. Determine w(M) by factoring the greatest power of 2 out of Det(T).

4. Obtain the values ±m_ij from the equations m_ij = ±2^{w_ij} Det(T̂_ij) and

   Det(T̂_ij) = (−1)^{i+j} (T^{−1})_{ji} Det(T)   (Cramer's rule).

   If m_ij/2^{2w(M)} is odd then place {i, j} in the matching.

5. Check whether this defines a perfect matching. This is guaranteed if the minimum weight perfect matching is unique. If a perfect matching was not obtained (which will occur for sure if there is no perfect matching, but with probability ≤ 1/2 if there is one), generate new weights and repeat the process.

Of course, if the graph has a perfect matching, the probability of incurring k repetitions without success is bounded by 2^{−k}, and the expected number of repetitions until success is at most 2.

The simultaneous computation of all the m_ij's, via the single matrix inversion in step 2, is key to the efficiency of this procedure. The entries of T are integers bounded by ±2^{2m}. Pan's RNC² matrix inversion algorithm will compute T^{−1} using O(n^{3.5} m) processors.

For the maximum matching problem, we use a simple reduction: assign weights to each of the non-edges too, but sample those weights uniformly from {2mn + 1, . . . , 2mn + 2m} (rather than {1, . . . , 2m} like the graph edges). Then any minimum weight perfect matching must use the minimum possible number of originally-non-edges. The cost of this reduction is that the integers in the matrix now use O(mn) rather than O(m) bits, so the number of processors used by the maximum matching algorithm is O(n^{4.5} m).
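The whole pipeline above can be sketched in code. The following toy implementation is a sequential sketch, not the parallel RNC version: the function names are ours, and exact integer determinants are computed by fraction-free (Bareiss) elimination instead of Pan's parallel inversion. It follows steps 1–5, using the parity test m_ij/2^{2w(M)} odd to select the matching edges.

```python
import random

def det_bareiss(M):
    """Exact integer determinant via fraction-free (Bareiss) elimination."""
    M = [row[:] for row in M]
    n = len(M)
    sign, prev = 1, 1
    for k in range(n - 1):
        if M[k][k] == 0:  # pivot: find a row below with a nonzero entry
            for r in range(k + 1, n):
                if M[r][k] != 0:
                    M[k], M[r] = M[r], M[k]
                    sign = -sign
                    break
            else:
                return 0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # division is exact: each entry is a minor of the input
                M[i][j] = (M[i][j] * M[k][k] - M[i][k] * M[k][j]) // prev
            M[i][k] = 0
        prev = M[k][k]
    return sign * M[n - 1][n - 1]

def find_perfect_matching(n, edge_list, tries=50):
    """Sketch of the isolating-lemma matching algorithm (steps 1-5)."""
    edges = [tuple(sorted(e)) for e in set(map(frozenset, edge_list))]
    m = len(edges)
    for _ in range(tries):
        w = {e: random.randint(1, 2 * m) for e in edges}  # step 1
        T = [[0] * n for _ in range(n)]                   # step 2, Eqn (2.7)
        for (i, j) in edges:
            T[i][j] = 2 ** w[(i, j)]
            T[j][i] = -(2 ** w[(i, j)])
        d = det_bareiss(T)
        if d == 0:
            continue  # no PM, or unlucky weights; resample
        two_wM = (d & -d).bit_length() - 1  # step 3: 2w(M) = exponent of 2
        M = []
        for (i, j) in edges:                # step 4, via minors directly
            minor = [[T[r][c] for c in range(n) if c != j]
                     for r in range(n) if r != i]
            mij = 2 ** w[(i, j)] * det_bareiss(minor)  # = +-m_ij
            if (abs(mij) >> two_wM) & 1:    # m_ij / 2^{2w(M)} odd
                M.append((i, j))
        covered = [v for e in M for v in e]  # step 5: verify
        if len(M) == n // 2 and len(set(covered)) == n:
            return M
    return None  # probably no perfect matching
```

On the 4-cycle (edges {0,1}, {1,2}, {2,3}, {3,0}) this returns one of the two perfect matchings; on a graph with an odd number of vertices Det(T) = 0 always (odd-order skew-symmetric matrix), so it returns None.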

Chapter 3

Concentration of Measure

3.1 Lecture 14 (30/Oct): Independent rvs: data processing, Chernoff bound, applications

3.1.1 Two facts about independent rvs

Lemma 48. If X_1, . . . , X_n are independent real rvs with finite expectations (recall this assumption requires that the defining integrals converge absolutely), then

E(∏ Xi) = ∏ E(Xi).

This is a consequence of the fact that the probabilities of independent events multiply; one only has to be careful about the measure theory. It is enough to consider the case n = 2 and proceed by induction. The calculation is as follows, with µ being the underlying probability measure; justifying the moves between double and single integrals is the content of the Fubini Theorem, which we skip here.

E(XY) = ∫ XY dµ
      = ∫_R ∫_R xy Pr(X = x ∧ Y = y) dy dx
      = ∫_R x Pr(X = x) ∫_R y Pr(Y = y | X = x) dy dx
      = ∫_R x Pr(X = x) ∫_R y Pr(Y = y) dy dx
      = ∫_R x Pr(X = x) E(Y) dx
      = E(Y) ∫_R x Pr(X = x) dx
      = E(X) E(Y).

The following lemma is an immediate consequence of the definitions in Sec. 1.2:

Lemma 49. If f_1, . . . are measurable functions and X_1, . . . are independent rvs then f_1(X_1), . . . are independent rvs.


This lemma is a special case of what is known in information theory as the data processing inequality, which says that for any rvs X, Y and any measurable function f, I(f(X); Y) ≤ I(X; Y), where I is "mutual information." We will not define that quantity right now though.

3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk)

The Chernoff bound¹ will be one of two ways in which we'll display the concentration of measure phenomenon, the other being the central limit theorem. In the types of problems we'll be looking at, the Chernoff bound is the more frequently useful of the two, but they're closely related.

Let’s begin with the special case of iid fair coins, aka iid uniform Bernoulli rvs: P(Xi = 1) = 1/2, P(Xi = 0) = 1/2. Put another way, we have n independent events, each of which occurs with probability 1/2. We want an exponential tail bound on the probability that significantly more than half the events occur. This very short argument is the seed of more general or stronger bounds that we will see later.

It will be convenient to use the rvs Y_i = 2X_i − 1, where X_i is the indicator rv of the ith event. This shift lets us work with mean-0 rvs, and it leaves the Y_i independent; that follows from Lemma 49.

Theorem 50. Let Y_1, . . . , Y_n be iid rvs, with Pr(Y_i = −1) = Pr(Y_i = 1) = 1/2. Let Y = ∑_1^n Y_i. Then for any λ > 0,

Pr(Y > λ√n) < e^{−λ²/2}.

The significance of √n here is that it is the standard deviation of Y (i.e., √Var(Y)), because (a) Var(Y_i) = E(Y_i²) = 1 (easy), and (b):

Exercise:² If Z_1, . . . , Z_n are independent real rvs with well defined first and second moments, then

Var(∑ Z_i) = ∑ Var(Z_i).   (3.1)

Consequently, the Chebyshev bound, Lemma 19, suggests that √n is about where we should start to get a meaningful deviation bound.

Proof. Fix any α > 0. Exercise:³

E(e^{αY_i}) = cosh α ≤ e^{α²/2}.

By independence of the rvs e^{αY_i},

E(e^{αY}) = ∏ E(e^{αY_i}) ≤ e^{nα²/2}.

Pr(Y > λ√n) = Pr(e^{αY} > e^{αλ√n})
 ≤ E(e^{αY}) / e^{αλ√n}   (Markov ineq.)
 ≤ e^{nα²/2 − αλ√n}

We now optimize this bound by making the choice α = λ/√n, and obtain:

Pr(Y > λ√n) ≤ e^{−λ²/2}.  □

¹First due to Bernstein [16, 17, 14], but we follow the standard naming convention in Computer Science.
²Pairwise independence is enough; we'll soon get to this.
³For k ≥ 0, (2k)! = ∏_1^k i(k + i) ≥ 2^k k!, so for any real x, e^{x²/2} = ∑_{k≥0} x^{2k}/(2^k k!) ≥ ∑_{k≥0} x^{2k}/(2k)! = cosh x.
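A quick Monte Carlo sanity check of Theorem 50 (illustrative only; the particular n, λ and trial count are arbitrary choices of ours):

```python
import math
import random

def tail_estimate(n, lam, trials=20000, rng=random.Random(1)):
    """Estimate Pr(Y > lam*sqrt(n)) for Y a sum of n uniform +-1 steps."""
    thresh = lam * math.sqrt(n)
    hits = sum(sum(rng.choice((-1, 1)) for _ in range(n)) > thresh
               for _ in range(trials))
    return hits / trials

n, lam = 100, 1.5
emp = tail_estimate(n, lam)       # empirical tail probability
bound = math.exp(-lam ** 2 / 2)   # Theorem 50's bound, about 0.325
assert emp <= bound               # the bound holds, with plenty of slack
```

The slack is expected: the theorem bounds the tail by e^{−λ²/2} ≈ 0.325 at λ = 1.5, while the true tail here is closer to the Gaussian value Pr(Z > 1.5) ≈ 0.07.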


Figure 3.1: Integrating a probability mass against two different nonnegative kernels (threshold kernel vs. exponential kernel).

Here’s another way to think about this calculation: Let s_x(y) be the step function s_x(y) = 1 for y > x, s_x(y) = 0 for y ≤ x. Note, for any α > 0, s_x(y) ≤ exp(α(y − x)), which is to say, the “threshold kernel” is less than the “exponential kernel.” (See Fig. 3.1.)

Pr(Y > λ√n) = E(s_{λ√n}(Y))
 ≤ E(exp(α(Y − λ√n)))
 = ∏_1^n E(exp(α(Y_i − λ/√n)))
 = (e^{−αλ/√n} cosh α)^n

We get the best upper bound by minimizing the base b of this exponential. If we pick α = λ/√n, which doesn’t exactly optimize the bound but comes close, we get b = e^{−λ²/n} cosh(λ/√n) ≤ e^{−λ²/n} e^{λ²/2n} = e^{−λ²/2n}. Then substituting back we get

Pr(Y > λ√n) ≤ e^{−λ²/2}.

3.1.3 Application: set discrepancy

For a function χ : {1, . . . , n} → {1, −1} and a subset S of {1, . . . , n}, let χ(S) = ∑_{i∈S} χ(i). Define the discrepancy of χ on S to be |χ(S)|, and the discrepancy of χ on a collection of sets S = {S_1, . . . , S_n} to be Disc(S, χ) = max_j |χ(S_j)|.

Theorem 51 (Spencer [89]). With the definitions above, ∃χ with Disc(S, χ) ≤ 6√n.

There is a coincidence here between the number of sets and the size of the universe; actually all that matters for the upper bound is the number of sets. There has been an enormous amount of attention to discrepancy minimization, in a variety of formulations. See for example the books [70, 22], or the article [10].


Here is a slightly more general version of the theorem (which follows by basically the same proof if you keep s = n). Let ‖·‖_∞ be the L_∞ vector norm, i.e., for v ∈ R^n, ‖v‖_∞ = max_i |v_i|. Take any s vectors v¹, . . . , v^s in R^n, each with ‖v^j‖_∞ ≤ 1, that is, max_i |(v^j)_i| ≤ 1. You should think of each v^j as “the indicator function for the sets containing element j of the universe [s].” Form these vectors into a matrix V_ij = (v^j)_i. Think of its rows as indicator functions for the given sets. Then ∃χ ∈ {1, −1}^s s.t. ‖Vχ‖_∞ ≤ 6√n, where χ is regarded as a column vector.

In this formulation, Theorem 51 is seen to be a cousin of a famous and still open conjecture of J. Komlós regarding discrepancy in ‖·‖₂, the Euclidean norm. The conjecture says that there is a universal constant K such that for any s and n, and for any v¹, . . . , v^s ∈ R^n with ‖v^j‖₂ ≤ 1, and with V formed as above, ∃χ ∈ {1, −1}^s s.t. ‖Vχ‖₂ ≤ K.

We won’t prove Theorem 51, but the starting point for it is the proof of the following weaker bound.

Theorem 52. With the definitions above, a function χ selected u.a.r. has Disc(S, χ) ∈ O(√(n log n)) with positive probability.

Proof. By Theorem 50, for any particular set S_j (noting that |S_j| ≤ n),

Pr(|χ(S_j)| > c√(n log n)) = Pr(|χ(S_j)| > (c√(n log n)/√|S_j|) · √|S_j|)

 ≤ 2 e^{−c²n log n/(2|S_j|)}
 ≤ 2 e^{−(c²/2) log n}
 = 2 n^{−c²/2}.

Now take a union bound over the sets.

Pr(∃j : |χ(S_j)| > c√(n log n)) ≤ n Pr(|χ(S_j)| > c√(n log n)) < 2 n^{1−c²/2}.

Plug in any c > √2 to show the theorem for sufficiently large values of n. □
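The bound of Theorem 52 is easy to observe experimentally; a sketch (the random set system and the constant c = 2 are our arbitrary choices):

```python
import math
import random

def discrepancy(sets, chi):
    """Disc(S, chi) = max_j |sum_{i in S_j} chi(i)|."""
    return max(abs(sum(chi[i] for i in S)) for S in sets)

rng = random.Random(7)
n = 200
# n random sets over the universe {0,...,n-1}, each element kept w.p. 1/2
sets = [{i for i in range(n) if rng.random() < 0.5} for _ in range(n)]
# a coloring chosen uniformly at random
chi = [rng.choice((-1, 1)) for _ in range(n)]
bound = 2 * math.sqrt(n * math.log(n))   # c = 2 > sqrt(2)
assert discrepancy(sets, chi) <= bound
```

Typical values land far below the bound (for n = 200, around 30 versus a bound of about 65), matching the slack built into the union-bound argument.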

3.1.4 Entropy and Kullback-Leibler divergence

When we introduced BPP we specified that at the end of the poly-time computation, strings in the language should be accepted with probability ≥ 2/3, and strings not in the language should be accepted with probability ≤ 1/3. We also noted that these values were immaterial and did not even need to be constants—we need only that they be separated by some 1/poly. We’ll shortly see why. We start by defining two important functions.

Definition 53. The entropy (base 2) of a probability distribution p = (p_1, . . . , p_n) is h_2(p) = ∑ p_i lg(1/p_i).

In natural units we use h(p) = ∑ p_i log(1/p_i).

Definition 54. Let r = (r_1, . . . , r_n) and s = (s_1, . . . , s_n) be two probability distributions and suppose s_i > 0 ∀i. The (base 2) Kullback-Leibler divergence D_2(r‖s), “from s to r,” or “of r w.r.t. s,” is defined by

D_2(r‖s) = ∑_i r_i lg(r_i/s_i).


This is also known as information divergence, directed divergence or relative entropy⁴. In natural units the divergence is D(r‖s) = ∑_i r_i log(r_i/s_i), and we also use this notation when the base doesn’t matter. D(r‖s) is not a metric (it isn’t symmetric and doesn’t satisfy the triangle inequality) but it is nonnegative, and zero only if the distributions are the same. Exercise:

(a) D(r‖s) ≥ 0 ∀r, s
(b) D(r‖s) = 0 ⇒ r = s
(c) D(s + ε‖s) = ∑_i (ε_i²/(2s_i) + O(ε_i³))

(d) for n = 2, D((s_1 + ε, 1 − s_1 − ε) ‖ (s_1, 1 − s_1)) is increasing in |ε|

The “‖” notation is strange but is the convention.

From (c) and (d) we have that for n = 2, D((s_1 + ε, 1 − s_1 − ε) ‖ (s_1, 1 − s_1)) ∈ Ω(ε²) (with the constant depending on s_1). When s is the uniform distribution, we have:

D(r‖uniform) = ∑ r_i log(n r_i) = log n + ∑ r_i log r_i = log n − h(r).

So D(r‖uniform) can be thought of as the entropy deficit of r, compared to the uniform distribution. In the case n = 2 we will write p rather than (p, 1 − p); thus h_2(p) = p lg(1/p) + (1 − p) lg(1/(1 − p)) and D_2(p‖q) = p lg(p/q) + (1 − p) lg((1 − p)/(1 − q)).

⁴D is useful throughout information theory and statistics (and is closely related to “Fisher information”). See [26].
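The binary-case formulas translate directly into code; a minimal sketch (our own helper names), including the n = 2 instance of the identity D(r‖uniform) = log n − h(r):

```python
import math

def h2(p):
    """Binary entropy (base 2) of the distribution (p, 1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def D2(p, q):
    """Binary Kullback-Leibler divergence D_2(p || q); requires 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log2(p / q)
    if p < 1:
        total += (1 - p) * math.log2((1 - p) / (1 - q))
    return total

assert h2(0.5) == 1.0            # uniform coin has 1 bit of entropy
assert D2(0.5, 0.5) == 0.0       # divergence vanishes iff r = s
# entropy deficit: D_2(p || 1/2) = 1 - h_2(p)
assert abs(D2(0.3, 0.5) - (1 - h2(0.3))) < 1e-12
```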


3.2 Lecture 15 (2/Nov): CLT. Stronger Chernoff bound and applications. Start Shannon coding theorem

3.2.1 Central limit theorem

As I mentioned earlier in the course, there are two basic ways in which we express concentration of measure: large deviation bounds, and the central limit theorem. Roughly speaking the former is a weaker conclusion (only upper tail bounds) from weaker assumptions (we don’t need full independence—we’ll talk about this soon). The proof of the basic CLT is not hard but relies on a little Fourier analysis and would take us too far out of our way this lecture, so I will just quote it.

Let µ be a probability distribution on R, i.e., for X distributed as µ and measurable S ⊆ R, Pr(X ∈ S) = µ(S). For X_1, . . . , X_n sampled independently from µ set X̄ = (1/n) ∑_{i=1}^n X_i.

Theorem 55. Suppose that µ possesses both first and second moments:

θ = E[X] = ∫ x dµ   (mean)

σ² = E[(X − θ)²] = ∫ (x − θ)² dµ   (variance)

Then for all a < b,

lim_{n→∞} Pr(aσ/√n < X̄ − θ < bσ/√n) = (1/√(2π)) ∫_a^b e^{−t²/2} dt.   (3.2)

The form of convergence to the Gaussian in (3.2) is called convergence in distribution or convergence in law. For a proof of the CLT see [18] Sec. 27, or, for a more accessible proof for the case that the X_i are bounded, see [2] Sec. 3.8.

3.2.2 Chernoff bound using divergence; robustness of BPP

Let’s extend and improve the previous large deviation bound for symmetric random walk. The new bound is almost the same for relatively mild deviations (just a few standard deviations) but is much stronger at many (especially, Ω(√n)) standard deviations. It also does not depend on the coins being fair.

Theorem 56. If X_1, . . . , X_n are iid Bernoulli rvs with Pr(X_i = 1) = q, and X = ∑ X_i, then Pr(X > pn) (for p ≥ q) or Pr(X < pn) (for p ≤ q) is < 2^{−nD_2(p‖q)} = exp(−nD(p‖q)).

Exercise: Derive from the above one side of Stirling’s approximation for the binomial coefficient (n choose pn).

Note 1: this improves on Thm 50 even at q = 1/2, because the inequality cosh α ≤ exp(α²/2) that we used before, though convenient, was wasteful. (But the two bounds have the same leading quadratic term for p in the neighborhood of q.) Specifically we have (see Figure 3.2):

D(p‖1/2) ≥ (2p − 1)²/2.   (3.3)

Note 2: The divergence is the correct constant in the above inequality; and this remains the case even when we “reasonably” extend this inequality to alphabets larger than 2—that is, dice rather than coins; see Sanov’s Theorem [26, 83]. There are of course lower-order terms that are not captured by the inequality.


Figure 3.2: Comparing the two Chernoff bounds at q = 1/2: x²/2 vs. D((1 + x)/2 ‖ 1/2).

Note 3: Let’s see what we mean by “concentration of measure.” Clearly, the Chernoff bound is telling us that something, namely the rv X, is very tightly concentrated about a particular value. On the other hand, if you look at the full underlying rv, namely the vector (X_1, . . . , X_n), that is not concentrated at all; if say q = 1/2, then it is actually as smoothly distributed as it could be, being uniform on the hypercube! The concentration of measure phenomenon, then, is a statement about low-dimensional representations of high-dimensional objects. In fact the “representation” does not have to be a nice linear function like X = ∑ X_i. It is sufficient that f(X_1, . . . , X_n) be a Lipschitz function, namely that there be some c < ∞ s.t. flipping any one of the X_i’s changes the function value by no more than c. From this simple information you can already get a large deviation bound on f for independent inputs X_i. We won’t prove that here.

Proof. Consider the case p ≥ q; the other case is similar. Set Y_i = X_i − q and Y = ∑ Y_i. Now for α > 0,

Pr(Y > n(p − q)) = Pr(e^{αY} > e^{αn(p−q)})
 < E(e^{αY}) / e^{αn(p−q)}   (Markov)
 = (((1 − q)e^{−αq} + q e^{α(1−q)}) / e^{α(p−q)})^n   (independence)

Set α = log(p(1−q)/((1−p)q)). Continuing,

 = ((q/p)^p ((1−q)/(1−p))^{1−p})^n = e^{−nD(p‖q)}.  □

This is saying that the probability of a coin of bias q empirically “masquerading” as one of bias at least p > q drops off exponentially, with the coefficient in the exponent being the divergence.

Back to BPP

Suppose we start with a randomized polynomial-time decision algorithm for a language L which, for x ∈ L, reports “Yes” with probability at least p, and for x ∉ L, reports “Yes” with probability at most q, for p = q + 1/f(n) for some f(n) ∈ n^{O(1)}.


Also, D(q + ε‖q) is monotone in each of the regions ε > 0, ε < 0. So if we perform O(n f²(n)) repetitions of the original BPP algorithm, and accept x iff the fraction of “Yes” votes is above (p + q)/2, then by Theorem 56 the probability of error on any input is bounded by exp(−n).
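The amplification can be checked with exact binomial sums. A toy computation with hypothetical acceptance probabilities p = 0.55, q = 0.45 (so the gap 1/f(n) = 0.1), comparing the exact majority-vote error against the exp(−tD) decay that Theorem 56 predicts:

```python
import math

def majority_error(t, bias, thresh):
    """Exact Pr(at most floor(thresh*t) successes among t coins of this bias),
    i.e. the probability the vote lands on the wrong side of the threshold."""
    cutoff = math.floor(thresh * t)
    return sum(math.comb(t, s) * bias ** s * (1 - bias) ** (t - s)
               for s in range(cutoff + 1))

p, q = 0.55, 0.45        # hypothetical acceptance probabilities
thresh = (p + q) / 2     # accept iff the "Yes" fraction exceeds this
# natural-units divergence D(thresh || p) governs the decay (Theorem 56)
D = (thresh * math.log(thresh / p)
     + (1 - thresh) * math.log((1 - thresh) / (1 - p)))
for t in (25, 101, 401):
    err = majority_error(t, p, thresh)   # error prob. for x in L
    assert err < math.exp(-t * D)        # Chernoff: err < exp(-t*D)
```

With a 1/poly gap, D ∈ Ω(1/f²(n)), so O(n f²(n)) repetitions drive the error below exp(−n), as stated.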

3.2.3 Balls and bins

Suppose you throw n balls, uniformly iid, into n bins. What is the highest bin occupancy? Let Ai = # balls in bin i.

Theorem 57. ∀c > 1, Pr(max Ai > c log n/ log log n) ∈ o(1).

Proof. To avoid a morass of iterated logarithms, write L = log n, L_2 = log log n, L_3 = log log log n. So we wish to show Pr(max A_i > cL/L_2) ∈ o(1). By the union bound,

Pr(max A_i > cL/L_2) ≤ n Pr(A_1 > cL/L_2)
 ≤ n exp(−n D(cL/(nL_2) ‖ 1/n))
 = n (L_2/(cL))^{cL/L_2} ((1 − 1/n)/(1 − cL/(nL_2)))^{(1 − cL/(nL_2))n}
 ≤ n (L_2/(cL))^{cL/L_2} (1/(1 − cL/(nL_2)))^{(1 − cL/(nL_2))n}

Expand the first term and apply the inequality⁵ (1/(1 − p))^{1−p} ≤ e^p (0 ≤ p < 1), with p = cL/(nL_2), to the second term:

 . . . ≤ exp(L + (cL/L_2)(L_3 − L_2 − log c) + cL/L_2)
 = exp((1 − c)L + cL(L_3 − log c + 1)/L_2)
 ≤ exp((1 − c)L + o(L)) = n^{1−c+o(1)}.

□

Omitted: Show that a constant fraction of bins are unoccupied. A much more precise analysis of this balls-in-bins process is available, see [43].
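A simulation of the balls-in-bins process (a sketch; the choices c = 3 and n = 10⁵ are ours), which also exhibits the omitted claim that about a 1/e fraction of bins stay empty:

```python
import math
import random
from collections import Counter

def throw(n, rng):
    """Throw n balls u.a.r. into n bins; return the occupancy counter."""
    return Counter(rng.randrange(n) for _ in range(n))

rng = random.Random(3)
n = 100_000
counts = throw(n, rng)
# Theorem 57: the max load stays below c*log(n)/log(log(n)) (c > 1, n large)
assert max(counts.values()) <= 3 * math.log(n) / math.log(math.log(n))
# and a constant fraction (about 1/e, roughly 0.368) of bins remain empty
empty_frac = 1 - len(counts) / n
assert 0.35 < empty_frac < 0.39
```

The empty-bin fraction follows from Pr(bin i empty) = (1 − 1/n)^n → e^{−1}.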

3.2.4 Preview of Shannon’s coding theorem

This is an exceptionally important application of large deviation bounds. Consider one party (Alice) who can send a bit per second to another party (Bob). She wants to send him a k-bit message. How- ever, the channel between them is noisy, and each transmitted bit may be flipped, independently,

⁵In fact we have the stronger (1/(1−p))^{1−p} ≤ 1 + p (see Fig. 3.3), although we don’t need this. Let α = log(1/(1−p)), so α ≥ 0. Then p = 1 − e^{−α} and we are to show that 2 ≥ e^{−α} + e^{αe^{−α}} =: f(α). Now f(0) = 2 and f′ = e^{−α}((1 − α)e^{αe^{−α}} − 1), so it suffices to show for α ≥ 0 that g(α) := e^{αe^{−α}} ≤ 1 + α. At α = 0 this is satisfied with equality, so it suffices to show that 1 ≥ g′ = (1 − α)e^{−α} g. Since 1 − α ≤ e^{−α}, it suffices to show that 1 ≥ e^{−2α} g = e^{α(e^{−α} − 2)}, which holds (with room to spare) because e^{−α} ≤ 1.


Figure 3.3: (1/(1 − p))^{1−p} vs. 1 + p vs. e^p.

with probability p < 1/2. What can Alice and Bob do? You can’t expect them to communicate reliably at 1 bit/second anymore, but can they achieve reliable communication at all? If so, how many bits/second can they achieve? This question turns out to have a beautiful answer that is the starting point of modern communication theory.

Before Shannon came along, the only answer to this question was, basically, the following naïve strategy: Alice repeats each bit some ℓ times; Bob takes the majority of his ℓ receptions as his best guess for the value of the bit. We’ve already learned how to evaluate the quality of this method: Bob’s error probability on each bit is bounded above by, and roughly equal to, exp(−ℓD(1/2‖p)). In order for all bits to arrive correctly, then, Alice must use ℓ proportional to log k. This means the rate of the communication, the number of message bits divided by elapsed time, tends to 0 in the length of the message (scaling as 1/log k). And if Alice and Bob want to have exponentially small probability of error exp(−k), she would have to employ ℓ ∼ k, so the rate would be even worse, scaling as 1/k.

Shannon showed that in actual fact one does not need to sacrifice rate for reliability. This was a great insight, and we will see next time how he did it. Roughly speaking—but not exactly—his argument uses a randomly chosen code. He achieves error probability exp(−Ω(k)) at a constant communication rate. What is more, the rate he achieves is arbitrarily close to the theoretical limit.
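The naïve repetition strategy can be evaluated exactly; a sketch comparing Bob’s per-bit majority error with the exp(−ℓ D(1/2‖p)) bound (ℓ odd to avoid ties; p = 0.1 is an arbitrary choice of ours):

```python
import math

def rep_error(ell, p):
    """Exact Pr(majority of ell receptions is wrong) when each bit is
    flipped independently with probability p (ell odd, so no ties)."""
    return sum(math.comb(ell, s) * p ** s * (1 - p) ** (ell - s)
               for s in range(ell // 2 + 1, ell + 1))

p = 0.1
# D(1/2 || p) in natural units
D = 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1 - p))
for ell in (5, 15, 45):
    assert rep_error(ell, p) < math.exp(-ell * D)  # Theorem 56's bound
```

Tripling ℓ multiplies the exponent by 3: reliability improves geometrically, but only by spending rate, which is exactly the tradeoff Shannon's theorem removes.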


3.3 Lecture 16 (4/Nov): Application of large deviation bounds: Shannon’s coding theorem

In order to communicate reliably, Alice and Bob are going to agree in advance on a codebook, a set of codewords that are fairly distant from each other (in Hamming distance), with the idea that when a corrupted codeword is received, it will still be closer to the correct codeword than to all others. In this discussion we completely ignore a key computational issue: how are the encoding and decoding maps computed efficiently? In fact it will be enough for us, for a positive result, to demonstrate existence of an encoding map E : {0,1}^k → {0,1}^n and a decoding map D : {0,1}^n → {0,1}^k (we’ll call this an (n, k) code) with the desired properties; we won’t even explicitly describe what the maps are, let alone specify how to efficiently compute them. We will call k/n the rate of such a code. Shannon’s achievement was to realize (and show) that you can simultaneously have positive rate and error probability tending to 0—in fact, exponentially fast. This was one of the first applications of the probabilistic method.

Theorem 58 (Shannon [85]). Let p < 1/2. For any ε > 0, for all k sufficiently large, there is an (n, k) code with rate ≥ D_2(p‖1/2) − ε and error probability e^{−Ω(k)} on every message. (The constant in the Ω depends on p and ε.)

In this theorem statement, “Error” means that Bob decodes to anything different from X, and error probabilities are taken only with respect to the random bit-flips introduced by the channel.

Proof. Let

n = k / (D_2(p‖1/2) − ε)   (3.4)

(ignoring rounding). Let R ∈ {0,1}^n denote the error string. So, with Y denoting the received message, Y = E(X) + R, with X uniform in {0,1}^k and R consisting of iid Bernoulli rvs which are 1 with probability p. The error event is that D(E(X) + R) ≠ X. As a first try, let’s design E simply (this won’t be good enough but we’ll later modify it):

E maps each X ∈ {0,1}^k to a uniformly, independently chosen string in {0,1}^n.

So (for now) when we speak of error probability, we have two sources of randomness: channel noise R, and code design E. To describe the decoding procedure we start with the notion of Hamming distance H. The Hamming distance H(x, y) between two same-length strings over a common alphabet Σ is the number of indices in which the strings disagree: H(x, y) = |{i : x_i ≠ y_i}| for x, y ∈ Σ^n. Define the decoding D to map Y to a closest codeword in Hamming distance. For most of the remainder of the proof (in particular until after the lemma), we fix a particular message X, and analyze the probability that it is decoded incorrectly.

In order to speak separately about the two sources of error, we define a rv M_X which is a function of the rv E: M_X = Pr_R(Error on X | E). So for any E, M_X is a number in the range [0, 1]. In order to analyze how well this works, we pick δ sufficiently small that

p + δ < 1/2   (3.5)

and

D_2(p + δ‖1/2) > D_2(p‖1/2) − ε/2.   (3.6)


Note that if both

1. H(E(X) + R, E(X)) < (p + δ)n (“channel noise is low”), and
2. ∀X′ ≠ X : H(E(X) + R, E(X′)) > (p + δ)n (“code design is good for X, R”)

then Bob will decode correctly. The contrapositive is that if Bob decodes X incorrectly then at least one of the following events has to have occurred:

Bad₁: H(E(X) + R, E(X)) ≥ (p + δ)n
Bad₂: ∃X′ ≠ X : H(E(X) + R, E(X′)) ≤ (p + δ)n

Lemma 59. ∃c > 0 s.t. E_E(M_X) < 2^{1−cn}.

Proof. Specifically we show this for c = min{D_2(p + δ‖p), ε/2}. In what follows, when we write a bound on Pr_W(. . .) we mean that “conditional on any setting of the other random variables, the randomness in W ensures the bound.”

E_E(M_X) ≤ Pr_R(Bad₁) + Pr_E(Bad₂)
 ≤ Pr_R(H(0⃗, R) ≥ (p + δ)n) + ∑_{X′≠X} Pr_{E(X′),E(X)}(H(R, E(X′) − E(X)) ≤ (p + δ)n)
 ≤ 2^{−nD_2(p+δ‖p)} + 2^{k−nD_2(p+δ‖1/2)}
 = 2^{−nD_2(p+δ‖p)} + 2^{n(D_2(p‖1/2)−ε−D_2(p+δ‖1/2))}   (substituting the value of k)
 ≤ 2^{−nD_2(p+δ‖p)} + 2^{−εn/2}   (using inequality (3.6))
 ≤ 2^{1−cn}   (using the value of c)

□

All of the above analysis treated an arbitrary but fixed message X. We showed that, picking the code at random, the expectation of M_X = Pr_R(Error on X | E) is small.

Let Z be the rv which is the fraction of X’s for which M_X ≤ 2E(M_X). By the Markov inequality, ∃E s.t. Z ≥ 1/2. Let E* be a specific such code.

E* works well for most messages X, but this isn’t quite what we want—we want M_X to be small for all messages X. There is a simple solution. Choose a code E* as above for k + 1 bits, then map the k-bit messages to the good half of the (k + 1)-bit messages. Note that removal of some codewords from E* can only decrease any M_X. (Assuming we still use closest-codeword decoding.)

So now the bound Pr_R(Error on X) ≤ 2E(M_X) ≤ 2^{2−cn} applies to all X. The asymptotic rate is unaffected by this trick; the error exponent is also unaffected. To be explicit, using E* designed for k + 1 bits and with n = (k + 1)/(D_2(p‖1/2) − ε), we have for all X ∈ {0,1}^k

Pr_R(Error on X) ≤ 2^{2−cn}.

Thus no matter what message Alice sends, Bob’s probability of error is exponentially small. □
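A miniature of the random-code construction (hypothetical toy parameters k = 4, n = 24, p = 0.05, so brute-force nearest-codeword decoding over all 2^k codewords is feasible):

```python
import random

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

rng = random.Random(11)
k, n, p = 4, 24, 0.05   # hypothetical toy parameters
# E: each of the 2^k messages gets an independent uniform n-bit codeword
code = [tuple(rng.randrange(2) for _ in range(n)) for _ in range(2 ** k)]

def transmit(word):
    """The channel: flip each bit independently with probability p."""
    return tuple(bit ^ (rng.random() < p) for bit in word)

def decode(received):
    """D: nearest codeword in Hamming distance (brute force)."""
    return min(range(2 ** k), key=lambda x: hamming(code[x], received))

trials = 2000
errors = sum(decode(transmit(code[x])) != x
             for x in (rng.randrange(2 ** k) for _ in range(trials)))
print(errors / trials)  # small for a typical random code at this rate
```

The rate here, 4/24 ≈ 0.17, sits far below the limit D_2(0.05‖1/2) ≈ 0.71, which is why even this tiny unoptimized random code decodes reliably most of the time.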


3.4 Lecture 17 (6/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions

3.4.1 Gale-Berlekamp game

Let’s remember a problem we saw in the first lecture (slightly retold):

• You are given an n × n grid of lightbulbs. For each bulb, at position (i, j), there is a switch bij; there is also a switch ri on each row and a switch cj on each column. The (i, j) bulb is lit if bij + ri + cj is even. For a setting b, r, c of the switches, let

F(b, r, c) = (number of lit bulbs) − (number of unlit bulbs)

Then F(b, r, c) = ∑_{ij} (−1)^{b_ij + r_i + c_j}.

Let F(b) = max_{r,c} F(b, r, c). What is the greatest f(n) such that for all b, F(b) ≥ f(n)?

This is called the Gale-Berlekamp game after David Gale and Elwyn Berlekamp, who viewed it as a competitive game: the first player chooses b and then the second chooses r and c to maximize the number of lit bulbs. So f(n) is the outcome of the game for perfect players. In the 1960s, at Bell Labs, Berlekamp even built a physical 10 × 10 grid of lightbulbs with b_ij, r_i and c_j switches. People have labored to determine the exact value of f(n) for small n—see [36]. But the key issue is the asymptotics.

Theorem 60. f(n) ∈ Θ(n^{3/2}).

Proof. First, the upper bound f (n) ∈ O(n3/2): We have to find a setting b that is favorable for the “minimizing f ” player, who goes first. That is, we have to find a b with small F(b). Fix any r, c. Then for b selected u.a.r.,

Pr(F(b, r, c) > kn^{3/2}) ≤ 2^{−n² D_2(1/2 + k/(2√n) ‖ 1/2)}   (we’ll choose a value for k shortly)
 ≤ 2^{−k²n/(2 log 2)}   (using D(p‖1/2) ≥ (2p − 1)²/2)

Now take a union bound over all r, c.

Pr(F(b) > kn^{3/2}) ≤ 2^{2n − k²n/(2 log 2)}

For k > 2√(log 2) this is < 1. So ∃b s.t. ∀r, c, F(b, r, c) ≤ 2√(log 2) · n^{3/2}.

Next we show the lower bound. Here we must consider any setting b and show how to choose r, c favorably. Initially, set all r_i = 0 and pick c_j u.a.r. Then for any fixed i, the row sum

X_i := ∑_j (−1)^{b_ij + c_j}

is binomially distributed, being an unbiased random walk of length n. Now, unlike the Chernoff bound, we’d like to see not an upper but a lower tail bound on the random walk. Let’s derive this from the CLT:

Corollary 61. For X the sum of m uniform iid ±1 rvs, E(|X|) = (1 + o(1)) √(2m/π).


(Proof sketch: for X distributed as the unit-variance Gaussian N(0, 1), this value is exact; see [94]. The CLT shows this is a good enough approximation to our rv.) Comment: Instead of using Corollary 61, we could alternatively have used the following result, which allows also for step sizes of different lengths:

Theorem 62 (Khintchine-Kahane). Let a = (a_1, . . . , a_n), a_i ∈ R. Let s_i ∈_U ±1 and set S = |∑ s_i a_i|. Then (1/√2) ‖a‖₂ ≤ E(S) ≤ ‖a‖₂.

The original result of this form is [60]; Kahane generalized to normed linear spaces [55]; the above constant and generality are found in [64]; for an elegant one-page proof see [35]. Not coincidentally, the best constant in the fully independent case is, like the CLT, proven through Fourier analysis.

Comment: Since we haven’t provided proofs of either of these, and we are about to use them, let me mention that later in the course (Sec. 4.3.2) we’ll come back and finish the proof (with a weaker constant) through a more elementary argument, and with the added benefit that we will be able to give the player a deterministic poly-time strategy for choosing the row and column bits. (Here we gave the player only a randomized poly-time strategy.)

In any case we now continue, using the conclusion (with the largest constant, coming from the CLT): for every i, E(|X_i|) = (1 + o(1)) √(2n/π).

Now for each row, flip r_i if the row sum is negative. So E(∑_i (−1)^{r_i} X_i) = E(∑_i |X_i|) = ∑_i E(|X_i|) = (1 + o(1)) √(2/π) n^{3/2}.

This shows (assuming the CLT) that for any b, E_c max_r F(b, r, c) is (1 + o(1)) √(2/π) n^{3/2}. Consequently, for all b, F(b) ≥ (1 + o(1)) √(2/π) n^{3/2}, which proves the theorem. □

Comment: It was convenient in this problem that the system of switches at your disposal is bipartite, in that there are no interactions amongst the effects of the row switches, and likewise amongst the effects of the column switches. However, even when such effects are present it is possible to attain similar theorems. See [57].
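The randomized lower-bound strategy (pick the columns u.a.r., then flip each row switch whose row sum is negative) runs in a few lines; a sketch with an arbitrary n of our choosing:

```python
import math
import random

def gb_randomized(b, rng):
    """Random column switches, then flip each row switch whose row sum is
    negative; returns the achieved value of F(b, r, c)."""
    n = len(b)
    c = [rng.choice((0, 1)) for _ in range(n)]
    total = 0
    for i in range(n):
        row_sum = sum((-1) ** (b[i][j] + c[j]) for j in range(n))
        total += abs(row_sum)  # choosing r_i optimally yields |row sum|
    return total

rng = random.Random(5)
n = 60
b = [[rng.choice((0, 1)) for _ in range(n)] for _ in range(n)]
best = max(gb_randomized(b, rng) for _ in range(20))
# lower bound of Theorem 60: about sqrt(2/pi) * n^{3/2} in expectation
assert best >= 0.5 * math.sqrt(2 / math.pi) * n ** 1.5
```

The achieved value concentrates near √(2/π) n^{3/2} ≈ 371 for n = 60, so the factor-1/2 slack in the assertion is generous.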


3.4.2 Moment generating functions, Chernoff bound for general distributions

Now for a version of the Chernoff bound which we can apply to sums of independent real rvs with very general probability distributions. After presenting the bound we’ll see an application of it, with broad computational applications, in the theory of metric spaces. Let X be a real-valued random variable with distribution µ: for measurable S ⊆ R, Pr(X ∈ S) = µ(S).

Definition 63. The moment generating function (mgf) of X (or, more precisely, of µ) is defined for β ∈ R by

gµ(β) = E[e^{βX}]   (provided this converges in an open neighborhood of 0)
      = ∑_{k=0}^∞ (β^k / k!) E(X^k)

Incidentally note that (a) if instead of taking β to be real we take it to be imaginary, this gives the characteristic function (the Fourier transform of µ), (b) both are "slices" of the complex function E[e^{zX}] (a two-sided Laplace transform). For any probability measure µ,

gµ(0) = E [1] = 1. (3.7)
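To make the two forms of the definition concrete, here is a small numerical check (a sketch, not part of the notes) for a Bernoulli(p) variable, for which E(X^k) = p for all k ≥ 1 and the mgf has the closed form 1 − p + p e^β:

```python
# Check that the series form of the mgf, g(beta) = sum_k beta^k E(X^k) / k!,
# matches the closed form g(beta) = E[e^{beta X}] = 1 - p + p*e^beta
# for a Bernoulli(p) random variable (E(X^k) = p for all k >= 1).
import math

p, beta = 0.3, 0.7

closed_form = 1 - p + p * math.exp(beta)
# k = 0 term is E(X^0) = 1; truncate the series at k = 30 (remainder is negligible)
series = 1 + sum(beta ** k * p / math.factorial(k) for k in range(1, 30))
assert abs(closed_form - series) < 1e-12
```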

We are interested in large deviation bounds for random walk with steps from µ. That is, if we sample X_1, ..., X_n iid from µ and take X = (1/n) ∑_{i=1}^n X_i, we want to know if the distribution of X is concentrated around E[X]. It will be convenient to re-center µ, if necessary, so that E[X] = 0; clearly this just subtracts a known constant off each step of the rw, so it does not affect any probabilistic calculations. So without loss of generality we now take E[X] = 0.

Perhaps not surprisingly, the quality of the large deviation bound that is possible depends on how heavy the tails of µ are. What is interesting is that this is nicely measured by the smoothness of gµ at the origin. Specifically, a moment-generating function that is differentiable at the origin guarantees exponential tails. One way to think about this intuitively is to examine the Fourier transform (the imaginary axis), rather than the mgf (the real axis), near the origin. If µ has light tails—as an extreme case suppose µ has bounded support—then near the origin, the Fourier coefficients are picking up only very long-wavelength information, and seeing almost no "cancellations"—negative contributions can come only from very far away and therefore be very small. So the Fourier coefficients near 0 are vanishingly different from the Fourier coefficient at 0, and so gµ is differentiable at 0. This goes both ways—if µ has heavy tails, then even at very long wavelengths, the Fourier integral picks up substantial cancellation, and so the Fourier coefficients change a lot moving away from 0.

Theorem 64 (Chernoff). If the mgf gµ(β) is differentiable at 0, then ∀ε ≠ 0 ∃cε < 1 such that

Pr(X/ε > 1) < cε^n.

Specifically,

cε ≤ inf_β e^{−βε} gµ(β) < 1. (3.8)


Proof. Let N be a neighborhood of 0 in which the mgf converges. Start with the case ε > 0.

Pr(X > ε) = Pr(e^{β ∑_i X_i} > e^{βnε})      for any β > 0          (3.9)
          < e^{−βnε} E[e^{β ∑_i X_i}]        Markov bound, for β ∈ N
          = e^{−βnε} (E[e^{βX_1}])^n         X_i are independent
          = (e^{−βε} gµ(β))^n                                        (3.10)

We now need to show that there is a β > 0 such that e^{−βε} gµ(β) < 1. At β = 0, e^0 gµ(0) = 1, so let's find the derivative of e^{−βε} gµ(β) at 0. Since gµ is differentiable at 0 we have:

∂gµ(β)/∂β |_0 = ∂E[e^{βX}]/∂β |_0
              = E[∂e^{βX}/∂β] |_0
              = E[X e^{βX}] |_0
              = E[X] = 0        (3.11)

So, because we have shifted the mean to 0, the moment-generating function is flat at 0. Now we can differentiate the whole function:

∂(e^{−βε} gµ(β))/∂β |_0 = e^{−ε·0} g′µ(0) − ε e^{−ε·0} gµ(0)    product rule, at β = 0
                        = 1 · 0 − ε · 1 · 1
                        = −ε                                     (3.12)

We have determined that ∃β > 0 such that e^{−βε} gµ(β) < 1, and thus there is a cε < 1 as stated in the theorem. The case ε < 0 is similar. All that changes is that for line 3.9 we substitute

Pr(X < ε) = Pr(e^{β ∑_i X_i} > e^{βnε})    for any β < 0    (3.13)

The rest of the derivation is identical up to and including line 3.12, which in this case shows that ∃β < 0 such that e^{−βε} gµ(β) < 1, and thus there is a cε < 1 as stated in the theorem. 2

This method also allows us, in some cases, to find the value of cε which gives the tightest Chernoff bound. (For general µ and ε this can be a complicated task and we may have to settle for bounds on the best cε.)
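As an illustration of how (3.8) is used in practice, here is a small numerical sketch (not from the notes; to avoid spoiling the exercise below it uses steps uniform on {−1, 0, +1}, whose mgf is (1 + 2 cosh β)/3, rather than ±1 steps): grid-search the infimum to approximate cε, and compare the resulting bound cε^n against the exact tail probability.

```python
# Approximating c_eps = inf_beta e^{-beta*eps} g(beta) for a toy centered
# distribution: X uniform on {-1, 0, +1}, mgf g(beta) = (1 + 2*cosh(beta)) / 3.
import math

def g(beta):
    return (1 + 2 * math.cosh(beta)) / 3

def chernoff_base(eps):
    """Grid-search approximation of inf over beta > 0 of e^{-beta*eps} g(beta)."""
    betas = [i / 1000 for i in range(1, 5001)]  # beta in (0, 5]
    return min(math.exp(-b * eps) * g(b) for b in betas)

def exact_tail(n, eps):
    """Exact Pr(sum of n iid uniform{-1,0,1} steps > eps*n), by convolution."""
    dist = {0: 1.0}
    for _ in range(n):
        new = {}
        for s, prob in dist.items():
            for step in (-1, 0, 1):
                new[s + step] = new.get(s + step, 0.0) + prob / 3
        dist = new
    return sum(prob for s, prob in dist.items() if s > eps * n)

eps, n = 0.5, 20
c = chernoff_base(eps)
assert c < 1                       # a nontrivial exponential bound exists
assert exact_tail(n, eps) <= c**n  # the Chernoff bound indeed holds
```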

Exercise: What is the mgf of the uniform distribution on ±1? What is the best cε for it?


3.5 Lecture 18 (9/Nov): Metric spaces

Today we’ll start to see a geometric application of the Chernoff bound. At first glance the question we solve, which originates in analysis, appears to have nothing to do with probability. But actually it illustrates a shared geometric core between analysis and probability.

Definition 65. A metric space (M, dM) is a set M and a function dM : M × M → (R ∪ {∞}) that is symmetric; 0 on the diagonal; and obeys the triangle inequality, dM(x, y) ≤ dM(x, z) + dM(z, y).

3.5.1 Metric space examples

1. A Euclidean space is a vector space R^n equipped with the metric d(x, y) = √(∑_{i=1}^n (x_i − y_i)²).

2. The same vector space can be equipped with a different metric, for instance the ℓ∞ metric, max_i |x_i − y_i|, or the ℓ1 metric, ∑_i |x_i − y_i|. Actually in real vector spaces the metrics we use, like these, are usually derived from norms (see Sec. 3.5.3).

3. Sometimes we get important metrics as restrictions of another metric. For instance let ∆_n denote the probability simplex, ∆_n = {x ∈ R^n : ∑_i x_i = 1, x_i ≥ 0}. In this space (half of) the ℓ1 distance is referred to as "total variation distance", dTV. It has another characterization, dTV(p, q) = max_{A⊆[n]} p(A) − q(A). Exercise: Usually a metric arises through a "min" definition (shortest path from one place to another), and in Example 5 we will see that dTV does have that kind of definition. Why does it coincide with a "max" definition?

4. Many metric spaces have nothing to do with vector spaces. An important class of metrics are the shortest path metrics, derived from undirected graphs: If G = (V, E) is a graph and x, y ∈ V, let d(x, y) denote the length of (number of edges on) a shortest path between them.

5. If you start with a metric d on a measurable space M you can "lift" it to the transportation metric dtrans. This is much bigger: the points of this new metric space are probability distributions on M, and the transportation distance is how far you have to shift probability mass in order to transform one distribution to the other. Here is the formal definition for the case of a finite space M. Let µ, ν be two probability distributions. π will range over probability distributions on the direct product space M². We sometimes call π a coupling of µ and ν.

dtrans(µ, ν) = min_π { ∑_{x,y} d(x, y) π(x, y)  |  ∀x : ∑_y π(x, y) = µ(x),  ∀y : ∑_x π(x, y) = ν(y) }

Sometimes this is called "earthmover distance" (imagine bulldozers pushing the probability mass around).

For example, if M is the graph metric on a clique of size k (as in Example 4) then dtrans = dTV = variation distance among probability distributions on the vertices (i.e., the metric space of Example 3). Of course, even if M is not finite, just a measure space with a given σ-algebra, we need only for µ and ν to be measures on this σ-algebra, and the minimization ranges over measures π on the product σ-algebra on M². So, with S and T ranging over measurable sets:

dtrans(µ, ν) = inf_π { ∫∫ d(x, y) π(x, y) dx dy  |  ∀S : ∫∫ 1_{x∈S} π(x, y) dx dy = µ(S),  ∀T : ∫∫ 1_{y∈T} π(x, y) dx dy = ν(T) }.
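Returning to the exercise in Example 3: before proving that the "min" and "max" characterizations of dTV coincide, it can help to check them numerically. Here is a brute-force sketch (not from the notes) on a three-point space:

```python
# Check the two characterizations of total variation distance on a finite space:
# half the l1 distance equals the max over subsets A of p(A) - q(A).
from itertools import combinations

def tv_l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / 2

def tv_max(p, q):
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            best = max(best, sum(p[i] - q[i] for i in A))
    return best

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
assert abs(tv_l1(p, q) - tv_max(p, q)) < 1e-9
```

(The "max" is attained by taking A to be exactly the coordinates where p exceeds q, which is the key observation in the exercise.)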


Definition 66. An embedding f : M → M′ is a mapping of a metric space (M, dM) into another metric space (M′, dM′). The distortion of the embedding is

sup_{a,b,c,d∈M} [dM′(f(a), f(b)) / dM(a, b)] · [dM(c, d) / dM′(f(c), f(d))].

The mapping is called isometric if it has distortion 1.

(This is slightly more generous than the usual definition of isometry because it allows for uniform nonzero scaling of all distances.)

A finite metric space is one in which the underlying set is finite. A finite ℓ2 space is one that can be embedded isometrically into a Euclidean space of some dimension.

3.5.2 Embedding dimension for n points in L2

Exercise 1: The dimension need not be greater than n − 1. (n points span at most an (n − 1)-dimensional affine subspace.)

Exercise 2: Generically, the dimension must be n − 1.

Let me suggest two ways to think about this. One is to show that the distances between points in Euclidean space determine their coordinates up to a rotation, reflection and translation. (I.e., up to an action of the orthogonal group.) Then consider the volume of the convex hull of the points.

A second way, maybe a little more direct, is this. Suppose you have embedded the n points into n vectors v_i in R^d. Subtract v_n off each of the vectors; this does not change any distances, and puts v_n = 0. Let V be the (n − 1) × d matrix whose rows are the vectors v_1, ..., v_{n−1}. The singular value decomposition theorem tells us that we can write V = PDQ where P is an (n − 1) × (n − 1) orthogonal matrix, Q is a d × d orthogonal matrix, and D is a diagonal matrix with nonnegative diagonal entries (these are the singular values) σ_1 ≥ σ_2 ≥ ..., with σ_ℓ = 0 for all ℓ > min{d, n − 1}. This being the case, Exercise 1 follows by forming P′ by chopping off all columns of P beyond min{d, n − 1}, and likewise forming D′ by chopping off all rows of D beyond min{d, n − 1}. Then V = P′D′Q. Our new embedding of the i'th point (i < n) is the i'th row of P′D′; the n'th point is still taken to 0. The distances between these points are unchanged—think about the Gram matrix VV* of the original embedding, where * denotes transpose.

For 1 ≤ i < n let e_i be the i'th standard basis vector (as a row vector). Just to make the notation go through, let e_n be the zero vector. Then for any i, j, d_ij² = ‖v_i − v_j‖² = (e_i − e_j) VV* (e_i − e_j)*. But this quantity is unchanged when we use the Gram matrix of the new embedding, because VV* = PDQQ*D*P* = P′D′Q Q^{−1} (P′D′)* = (P′D′)(P′D′)*. So all the distances are unchanged.

Now instead of showing that "generically" we need n − 1 dimensions, let's just consider a concrete case, and leave the general situation for you to think about. The concrete case I have in mind is the clique metric, or if you prefer, the v_i are the n vertices of the probability simplex. That naturally sits in n − 1 dimensions, but why is that necessary? Well, suppose the points sit in some dimension d, possibly less than n − 1; and form V as above. Now, I claim we know what the Gram matrix VV* is. Its diagonal entries are all 1 because for i < n, 1 = d_in² = (e_i − e_n) VV* (e_i − e_n)* = e_i VV* e_i* = (VV*)_ii. Then its off-diagonal entries are all 1/2 because for distinct i, j < n, 1 = d_ij² = (e_i − e_j) VV* (e_i − e_j)* = (VV*)_ii + (VV*)_jj − 2(VV*)_ij = 2 − 2(VV*)_ij. Let I be the identity matrix, and J the matrix of all 1's; note that J has a single nonzero eigenvalue of n − 1. Then VV* = (I + J)/2 has one eigenvalue of n/2 and another n − 2 eigenvalues of 1/2; consequently it is nonsingular. But then, since the nonzero eigenvalues of VV* are the same as the nonzero eigenvalues of the d × d matrix V*V, we must have d ≥ n − 1.


3.5.3 Normed spaces

A real normed space is a vector space V equipped with a nonnegative real-valued "norm" ‖·‖ satisfying ‖cv‖ = |c| ‖v‖ for c ∈ R, ‖v‖ ≠ 0 for v ≠ 0, and ‖v + w‖ ≤ ‖v‖ + ‖w‖. Norms always induce metrics, as in examples 1, 2, by taking the distance between v and w to be ‖v − w‖.

Let S = (S, µ) be any measure space. For p ≥ 1, the Lp normed space w.r.t. the measure µ, Lp(S), is defined to be the vector space of functions

f : S → R of finite “Lp norm,” defined by

‖f‖_p = ( ∫_S |f(x)|^p dµ(x) )^{1/p}

Exercise: ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p

So (like any normed space), Lp(S) is also automatically a metric space.

This framework allows us to discuss the collection of all L2 (Euclidean) spaces, all L1 spaces, etc. The most commonly encountered cases are indeed L1, L2 and L∞, which is defined to be the sup norm (so µ doesn’t matter). Today we discuss embeddings L2 → L2. Time permitting we may also discuss embeddings of general metrics into L1.

We will use the shorthand Lp(k) to refer to an Lp space on a set S of cardinality k, with the counting measure.

3.5.4 Exponential savings in dimension for any fixed distortion

In view of the lower bound in Section 3.5.2, our next result may be surprising. We will see a method of embedding any n-point ℓ2 metric into a very low-dimensional Euclidean space with only slight distortion. This is useful in the design of algorithms because many algorithms for geometric problems have complexity that scales exponentially in the dimension of the input space. We'll have to skip giving example applications, but there are quite a few by now, and because of these, a variety of improvements and extensions of the embedding method have also been developed. Our goal is to prove this:

Theorem 67 (Johnson and Lindenstrauss [53]). Given a set A of n points in a Euclidean space L2(d), there exists a map h : A → L2(k) with k = O(ε^{−2} log n) that is of distortion e^ε on A. Moreover, the map h can be taken to be linear and can be found with a simple randomized algorithm in expected time polynomial in n.

Although the points of A may (and generically will) span a d = (n − 1)-dimensional affine space, and the map is linear, nonetheless observe that we are not embedding all of R^{n−1} with low distortion—that is impossible, as the map is many-one—we care only about the distances among our n input points.


3.6 Lecture 19 (11/Nov): Johnson-Lindenstrauss embedding

By a small sample we may judge of the whole piece. Cervantes, Don Quixote de la Mancha §I-1-4

3.6.1 The original method

Returning to the statement of the Johnson-Lindenstrauss Theorem (67), how do we find such a map h? Here is the original construction: pick an orthogonal projection, W̃, onto R^k uniformly at random, and let h(x) = W̃x for x ∈ A. For k as specified, this is satisfactory with high (constant) probability (which depends on the constant in k = O(ε^{−2} log n)).

An equivalent description of picking a projection W̃ at random is as follows: choose U uniformly (i.e., using the Haar measure) from O_d (the orthogonal group). Let Q̃ be the d × d matrix which is a projection map, sending column vectors onto their coordinates in the first k basis vectors:

 1 0 0 ··· 0 0   0 1 0 ··· 0 0     .  ˜  ..  Q =   .  0 0 ··· 1 0 0     0 0 0 0 0 0  0 0 0 0 0 0

Then set W̃ = U^{−1} Q̃ U. I.e., a point x ∈ A is mapped to U^{−1} Q̃ U x. Let's start simplifying this. The final multiplication by U^{−1} doesn't change the length of any vector so it is equivalent to use the mapping x → Q̃Ux and ask what this does to the lengths of vectors between points of A. Having simplified the mapping in this way, we can now discard the all-0 rows of Q̃, and use just Q:

 1 0 0 ··· 0 0   0 1 0 ··· 0 0    Q =  .  .  ..  0 0 ··· 1 0 0 So JL’s final mapping is f (x) = QUx.

In order to analyze this map, we will consider a vector v, the difference between two points in A, i.e. v = x − y for some x, y ∈ A. Since the question of distortion of the length of v is scale invariant, we can simplify by supposing that ‖v‖ = 1. Moreover, the process described above has the same distribution for all rotations of v. That is to say, for any v ∈ R^d, measurable S ⊆ R^d, and orthogonal matrix A,

Pr_U(QUv ∈ S) = Pr_U(QUAv ∈ S).

(This is precisely the content of saying that the Haar measure is invariant under actions of the orthogonal group.) So we may as well consider that v is the vector v = e_1 = (1, 0, 0, ..., 0)*.


In that case, ‖QUv‖ equals ‖(QU)_{*1}‖ where (QU)_{*1} is the first column of QU. But (QU)_{*1} = (U_{1,1}, U_{2,1}, ..., U_{k,1})*, i.e., the top k entries of the first column of U. Since U is a random orthogonal matrix, the distribution of its first column (or indeed of any other single column) is simply that of a random unit vector in R^d. So the whole question boils down to showing concentration for the length of the projection of a random unit vector onto the subspace spanned by the first k standard basis vectors. This distribution is deceptive in low dimensions. For d = 2, k = 1 the density looks like Figure (3.4). The projection is not at all tightly concentrated.

Figure 3.4: Density of projection of a unit vector in 2D onto a random unit vector

However, in higher dimensions, this density looks more like Figure (3.5). The phenomenon we are encountering is truly a feature of high dimension.

Figure 3.5: Density of projection of a unit vector in high dimension onto a random unit vector

Remarks:

1. In the one-dimensional projection density (Fig. 3.5) some constant fraction of the probability is contained in the interval [−1/√d, 1/√d].

2. The squares of the projection-lengths onto each of the k dimensions are "nearly independent" random variables, so long as k is small relative to d.

Johnson and Lindenstrauss pushed this argument through but there is an easier way to get there, by just slightly changing the construction.


3.6.2 JL: a similar, and easier to analyze, method

d Pick k vectors w1, w2,..., wk ∈ R independently from the spherically symmetric Gaussian density with standard deviation 1, i.e., from the probability density on x = (x1,..., xd):

η(x) = (1 / (2π)^{d/2}) exp( −(1/2) ∑_{i=1}^d x_i² )

A few notes:

1. the projection of this density on any line through the origin is the 1D Gaussian with standard deviation 1, i.e., the density (1/√(2π)) exp(−x²/2). (Follows immediately from the formula, by factoring out the one dimension against an entire "conditioned" Gaussian on the remaining d − 1 dimensions.)

2. The distribution is invariant under the orthogonal group. (Follows immediately from the formula.)

3. The coordinates x1, x2 etc. are independent rvs. (Follows immediately from the formula.)

Set

W = [ ··· w_1 ··· ]
    [ ··· w_2 ··· ]
    [      ⋮      ]
    [ ··· w_k ··· ]

(The rows of W are the vectors w_i.) Then, for v ∈ R^d set h(v) = Wv.

By Notes 1 and 3, each entry of W is an iid random variable with density (1/√(2π)) exp(−x²/2). Informally, this process is very similar to that of JL, although it is certainly not identical. Individual entries of W can (rarely) be very large, and rows are not going to be exactly orthogonal, although they will usually be quite close to orthogonal.

Because of Note 2, analysis of this method boils down, just as for the original JL construction, to showing a concentration result for the length of the first column of W, which we denote w^1. Because of Note 3, the expression ‖w^1‖² = ∑_{i=1}^k w_i1² gives the LHS as the sum of independent, and by Note 1 iid, rvs. This will enable us to show concentration through a Chernoff bound.

So now our situation is that our projection of (any particular) unit vector in the original space has the distribution of a vector whose coordinates w_11, ..., w_k1 are iid normally distributed with E(w_i1²) = 1. So E(∑ w_i1²) = k. We want a deviation bound on ∑ w_i1².

There is a name for these rvs: each w_i1² is a "Chi Squared" rv with parameter 1, and their sum is a Chi Squared rv with parameter k.

Set random variables y_i = w_i1² − 1 so that E(y_i) = 0. With this change of variables we now want a bound on the deviation from 0 of the rv ȳ = (1/k) ∑_{i=1}^k y_i. To get a Chernoff bound, we need the mgf, g(β), for y_i, in order to use Eqn. 3.8 to write:

P(ȳ/ε > 1) < [ inf_{β>0} e^{−εβ} g(β) ]^k    for ε ≠ 0. (3.14)


Figure 3.6: Probability density of (1/k) ∑ w_i1², shown for k = 1, 2, 3, 4, 10

So what is g(β)?

g(β) = E(e^{βy}) = E(e^{β(w²−1)})
     = e^{−β} ∫_{−∞}^{∞} (1/√(2π)) e^{w²(β−1/2)} dw
     = (e^{−β}/√(1−2β)) ∫_{−∞}^{∞} (√(1−2β)/√(2π)) e^{−(1/2) w² (1−2β)} dw
     = e^{−β}/√(1−2β)

The last equality follows as the integrand is the density of a normal random variable with standard deviation 1/√(1−2β).

Thus, g(β) is well defined and differentiable in (−∞, 1/2), with (necessarily) g(0) = 1 (which recall from (3.7) holds for the mgf of any probability measure), and g′(0) = 0 (because g′(0) = the first moment of the probability measure, that's why it's called the moment generating function, recall (3.11); and we have centered the distribution at 0).

For a given ε what β should be used in the Chernoff bound (Eqn. 3.14)? After some calculus, we find that β = ε/(2(1+ε)) is the best value (for both signs of ε). Figure (3.7) shows the dependence of β on ε. Plugging this value of β above into the bound, we get
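The claim that β = ε/(2(1 + ε)) minimizes e^{−εβ} g(β) is easy to verify numerically (a sketch, not part of the notes); the minimum value works out to (1 + ε)^{1/2} e^{−ε/2}, the base in Eqn. (3.15).

```python
# Verify numerically that beta = eps/(2(1+eps)) minimizes e^{-eps*beta} g(beta)
# for the centered chi-squared mgf g(beta) = e^{-beta} / sqrt(1 - 2*beta),
# and that the minimum equals (1+eps)^{1/2} e^{-eps/2}.
import math

def g(beta):                      # valid for beta < 1/2
    return math.exp(-beta) / math.sqrt(1 - 2 * beta)

def objective(beta, eps):
    return math.exp(-eps * beta) * g(beta)

eps = 0.3
betas = [i / 100000 for i in range(-40000, 49999)]   # grid in (-0.4, 0.5)
best_beta = min(betas, key=lambda b: objective(b, eps))
assert abs(best_beta - eps / (2 * (1 + eps))) < 1e-3
assert abs(objective(best_beta, eps)
           - math.sqrt(1 + eps) * math.exp(-eps / 2)) < 1e-8
```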

P(ȳ/ε > 1) < ( (1 + ε)^{1/2} e^{−ε/2} )^k (3.15)

which we incidentally note is (1 − ε²/4 + O(ε³))^k. The function (1 + ε)^{1/2} e^{−ε/2} is shown in Fig. 3.8.

Now let's apply this bound to the modified JL construction. We will ensure distortion (Defn. 66) e^δ (with positive probability) by showing that for each of our (n choose 2) vectors v, with probability > 1 − 1/(2 (n choose 2)),

‖v‖ e^{−δ/2} ≤ (1/√k) ‖Wv‖ ≤ ‖v‖ e^{δ/2}.


Figure 3.7: Best choice of β as a function of ε for the χ² distribution: β = ε/(2(1 + ε))

Figure 3.8: Base of the Chernoff bound for the χ² distribution: c_ε = (1 + ε)^{1/2} e^{−ε/2}

We already argued, by the invariance of our construction under the orthogonal group, that for any v this has the same probability as the event

e^{−δ/2} ≤ √( (1/k) ∑ w_i1² ) ≤ e^{δ/2},

i.e.,

e^{−δ} ≤ (1/k) ∑ w_i1² ≤ e^{δ},

or equivalently

e^{−δ} − 1 ≤ ȳ ≤ e^{δ} − 1. (3.16)

Applying the Chernoff bound (3.15) first on the right of (3.16), we have

Pr(ȳ > e^δ − 1) < e^{k(δ/2 − (e^δ − 1)/2)} = e^{(k/2)(1 + δ − e^δ)} < e^{−kδ²/4}

Next applying the Chernoff bound (3.15) on the left of (3.16), we have

Pr(ȳ < e^{−δ} − 1) < e^{k(−δ/2 − (e^{−δ} − 1)/2)} = e^{(k/2)(1 − δ − e^{−δ})} < e^{−k(δ²/4 + O(δ³))}


In all, taking k = 8(1 + O(δ)) δ^{−2} log n suffices so that Pr( (1/k) ∑ w_i1² ∉ [e^{−δ}, e^{δ}] ) < 1/n², and therefore the mapping with probability at least 1/2 has distortion bounded by e^δ.

Finally, for the computational aspect: to get a randomized "Las Vegas" algorithm simply try matrices W at random and examine each to test whether the distortion is satisfactory.
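The whole construction is short enough to sketch in code (an illustration, not the notes' implementation): multiply the points by a k × d matrix of iid standard Gaussians, rescale by 1/√k, and measure the worst pairwise distortion. The specific n, d, k below are illustrative choices.

```python
# Sketch of the modified JL map: project n points in R^d through a k x d
# matrix of iid standard Gaussians, scale by 1/sqrt(k), and measure the
# worst distortion of pairwise distances.
import math
import random

random.seed(0)

def jl_map(points, k):
    d = len(points[0])
    W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    scale = 1 / math.sqrt(k)
    return [[scale * sum(W[i][j] * p[j] for j in range(d)) for i in range(k)]
            for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def max_distortion(points, images):
    worst = 1.0
    n = len(points)
    for a in range(n):
        for b in range(a + 1, n):
            ratio = dist(images[a], images[b]) / dist(points[a], points[b])
            worst = max(worst, ratio, 1 / ratio)
    return worst

# 30 random points in dimension d = 200, projected down to k = 100 dimensions.
pts = [[random.gauss(0, 1) for _ in range(200)] for _ in range(30)]
imgs = jl_map(pts, 100)
assert max_distortion(pts, imgs) < 2  # modest distortion despite halving the dimension
```

A Las Vegas algorithm, as in the text, would simply redraw W whenever the distortion test fails.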

Note: About another embedding question: finite ℓ2 metric spaces can be embedded in ℓ1 isometrically. There's also an algorithm—deterministic, in fact—to find such an embedding, but it takes exponential time in the number of points in the space.

Comment: There are deterministic poly-time algorithms producing an embedding up to the standards of the Johnson-Lindenstrauss theorem, see Engebretsen, Indyk and O'Donnell [31], Sivakumar [88].


3.7 Lecture 20 (13/Nov): Bourgain embedding X → Lp, p ≥ 1

In the previous result, we saw how an already "rigid" metric, namely an L2 metric, could be embedded in reduced dimension. Now we will see how a relatively "loose" object, just a metric space, can be embedded in a more rigid object, namely a vector space with an Lp norm. There will be a price in distortion to pay for this.

Theorem 68 (Bourgain [21]). Any metric (X, d) with n = |X| can be embedded in Lp(O(log² n)), p ≥ 1, with distortion O(log n). There is a randomized poly-time algorithm to find such an embedding.

Some comments are in order.

Dimension: The dimension bound here is actually due not to Bourgain but to Linial, London and Rabinovich [67]. Also, Bourgain showed embedding into L2; after we prove the L1 result we'll show how it also implies all p ≥ 1. A later variation of the Bourgain proof that achieves dimension O(log n) is due to Abraham, Bartal and Neiman [1].

Derandomization: It will follow from ideas we see soon, that there is a deterministic algorithm to construct a Bourgain embedding into dimension poly(n). This will be on a problem set. It is also possible, by the method of conditional probabilities, to reduce the dimension to O(log² n); we probably won't have time to discuss this.

Distortion: The distortion in the theorem is best possible: expander graphs require it. However, there are open questions for restricted classes of metrics: for example whether the distortion can be improved, possibly to a constant, for shortest path metrics in planar graphs. See [52] for a survey on metric embeddings from 2004.

3.7.1 Embedding into L1: Weighted Fréchet embeddings

Proof. Since the domain of our mapping is merely a metric space rather than a normed space, we cannot apply anything like the JL technique, and something quite different is called for. Bourgain's proof employs a type of embedding introduced much earlier by Fréchet [38]. The one-dimensional Fréchet embedding imposed by a set ∅ ⊂ T ⊆ X is the mapping

τ : X → R+

τ(x) = d(x, T) := min_{t∈T} d(x, t)

Observe that by the triangle inequality for d, |τ(x) − τ(y)| ≤ d(x, y). So τ is a contraction of X into the metric on the line.

We can also combine several such Ti’s in separate coordinates. If we label the respective mappings τi and give each a nonnegative weight wi, with the weights summing to 1—that is to say, the weights form a probability distribution:

τ(x) = (w_1 τ_1(x), ..., w_k τ_k(x))

then we can consider the composite mapping τ as an embedding into L1(k) and it too is contractive, namely, ‖τ(x) − τ(y)‖_1 ≤ d(x, y).
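The contraction property of a weighted Fréchet embedding is easy to check mechanically. Here is a small sketch (not from the notes; the 5-cycle graph metric and the random choice of the T_i are illustrative assumptions):

```python
# Weighted Frechet embedding of a toy graph metric; check that it is
# contractive in the l1 norm.
import random

random.seed(1)

# shortest-path metric of a 5-cycle on vertices 0..4
n = 5
def d(x, y):
    diff = abs(x - y)
    return min(diff, n - diff)

def frechet_coord(x, T):         # one-dimensional Frechet map: d(x, T)
    return min(d(x, t) for t in T)

# a few random nonempty subsets T_i, with uniform weights summing to 1
Ts = [random.sample(range(n), random.randint(1, n)) for _ in range(8)]
w = 1 / len(Ts)

def tau(x):
    return [w * frechet_coord(x, T) for T in Ts]

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

for x in range(n):
    for y in range(n):
        assert l1(tau(x), tau(y)) <= d(x, y) + 1e-12  # contraction
```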

So the key part of the proof is the lower bound.

Let s = ⌈lg n⌉. For 1 ≤ t ≤ s and 1 ≤ j ≤ s′, with s′ ∈ Θ(s), choose set T_tj by selecting each point of X independently with probability 2^{−t}. Let all the weights be uniform, i.e., 1/ss′. This defines an


embedding τ = (..., τ_tj, ...)/ss′ of the desired dimension. (If any T_tj is empty, let τ_tj map all x ∈ X to 0.) We need to show that with positive probability

∀x, y ∈ X : ‖τ(x) − τ(y)‖_1 ≥ Ω(d(x, y)/s).

Just as in JL, in the proof we focus on a single pair x 6= y and show that with probability greater than 1 − 1/n2 (enabling a union bound) it is embedded with the desired distortion O(s).

3.7.2 Good things can happen

We use this notation for open balls:

Br(x) = {z : d(x, z) < r}

and for closed balls, B¯r(x) = {z : d(x, z) ≤ r}. Recall that we are now analyzing the embedding for any fixed pair of points x, y.

Let ρ_0 = 0 and, for t > 0 define

ρ_t = sup{ r : |B_r(x)| < 2^t or |B_r(y)| < 2^t } (3.17)

up to t̂ = max{t : RHS < d(x, y)/2}. It is possible to have t̂ = 0 (for instance if no other points are near x and y).

Observe that for the closed balls B̄ we have that for all t ≤ t̂, |B̄_{ρ_t}(x)| ≥ 2^t and |B̄_{ρ_t}(y)| ≥ 2^t. This means in particular that (due to the radius cap at d(x, y)/2, which means that y is excluded from these balls around x and vice versa), t̂ < s.

Set ρ_{t̂+1} = d(x, y)/2, which means that it still holds for t = t̂ + 1 that |B_{ρ_t}(x)| < 2^t or |B_{ρ_t}(y)| < 2^t, although (in contrast to t ≤ t̂), ρ_{t̂+1} is not the largest radius for which this holds.

Note t̂ + 1 ≥ 1. Also, ρ_{t̂+1} > ρ_{t̂} (because the latter was defined to be less than d(x, y)/2). But for t ≤ t̂ it is possible to have ρ_t = ρ_{t−1}.

t̂ + 1 will be the number of scales used in the analysis of the lower bound for the pair x, y. I.e., we use the sets T_tj for 0 ≤ t ≤ t̂ + 1. Any contribution from higher-t (smaller expected cardinality) sets is "bonus." Consider any 1 ≤ t ≤ t̂ + 1.

Lemma 69. With positive probability (specifically ≥ (1 − 1/√e)/4), |τ_t1(x) − τ_t1(y)| > ρ_t − ρ_{t−1}.

Proof. Suppose wlog that

|B_{ρ_t}(x)| < 2^t. (3.18)

By Eqn (3.17) (with t − 1), |B̄_{ρ_{t−1}}(y)| ≥ 2^{t−1} (and the same for x but we don't need that). Therefore

|B̄_{ρ_{t−1}}(y)| > |B_{ρ_t}(x)|/2 (3.19)

Now observe that a sufficient condition for

|τ_t1(x) − τ_t1(y)| > ρ_t − ρ_{t−1}

is that the following two events both hold:

T_t1 ∩ B_{ρ_t}(x) = ∅ (3.20)


Figure 3.9: Balls B_{ρ_{t−1}}(x), B_{ρ_t}(x), B_{ρ_{t−1}}(y) depicted. Events 3.20 and 3.21 have occurred, because no point has been selected for T_t1 in the larger-radius (ρ_t) region around x, while some point (marked in red) has been selected for T_t1 in the smaller-radius (ρ_{t−1}) region around y.

and

T_t1 ∩ B̄_{ρ_{t−1}}(y) ≠ ∅ (3.21)

We claim that this conjunction (see Fig. 3.9) happens with constant probability, specifically at least (1 − 1/√e)/4. Instead of proving this here, a version of it will be on the homework, where I'll ask you to show:

Lemma 70. Suppose A, B are disjoint sets with A ∪ B = [m], m > 0 and |A| ≥ cm. Let R_i be pairwise independent binary rvs for i = 1, ..., m, with Pr(R_i = 1) = p (for a value of p to be determined). Let R = {i : R_i = 1}. Then ∀c > 0 ∃d > 0 s.t. ∀m ∃p such that Pr((A ∩ R ≠ ∅) ∧ (B ∩ R = ∅)) ≥ d.

(Note that this is considerably stronger than we need for the lemma since we're assuming only pairwise independence. In exchange we allow some sacrifice in the constant d.)

Thanks to (3.19), Lemma 69 follows by applying Lemma 70 with c = 1/3. 2

Now, let G_{x,y,t} be the "good" event that at least a (1 − 1/√e)/8 fraction of the coordinates at level t, namely {τ_tj}_{j=1}^{s′}, have |τ_tj(x) − τ_tj(y)| > ρ_t − ρ_{t−1}.

If the good event occurs for all t, then for all x, y,

‖τ(x) − τ(y)‖_1 ≥ (1/s) · ((1 − 1/√e)/8) · (d(x, y)/2).

Here the first factor is from the normalization, the second from the definition of good events, and the third from the cap on the ρ_t's.

We can upper bound the probability that a good event Gx,y,t fails to happen using Chernoff:

Pr(¬G_{x,y,t}) ≤ e^{−Ω(s′)}.

To be specific we can use the following version of the Chernoff bound which does not rely on exact knowledge of the mgf (see problem set 4):

Lemma 71. Let F_1, ..., F_{s′} be independent Bernoulli rvs, each with expectation ≥ µ. Then Pr(∑ F_i < (1 − ε)µs′) ≤ e^{−ε²µs′/2}.


This permits us (plugging in ε = 1/2) to take s′ = (32√e/(√e − 1)) log(n² lg n). Now taking a union bound over all x, y, t,

Pr(∪_{x,y,t} ¬G_{x,y,t}) ≤ e^{−Ω(s′)} · n² lg n < 1/2

for a suitable s′ ∈ Θ(log n). 2

Exercise: Form a Fréchet embedding X → R^n by using as T_i's all singleton sets. Argue that this is an isometry of X into L∞(n). Consequently L∞ is universal for finite metrics. (This, I believe, was Fréchet's original result [38].)

3.7.3 Aside: H¨older’s inequality

Although we already proved the PMI directly in Lemma 22, it is worth seeing how the PMI fits into a broader framework of inequalities. The PMI is a comparison between two integrals over a probability space (i.e., the total measure of the space is 1). The power mean inequality follows immediately from an important inequality that holds for any measure space (and indeed generalizes the Cauchy-Schwarz inequality), Hölder's inequality:

Lemma 72 (Hölder). For norms with respect to any fixed measure space, and for 1/p + 1/q = 1 (p and q are "conjugate exponents"), ‖f‖_p · ‖g‖_q ≥ ‖fg‖_1.

To see the PMI from this, note that over a probability space, ‖f‖_p is simply a p'th mean. Now plug in the function g = 1 and Hölder gives you the PMI.
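A quick numerical sanity check of Hölder's inequality over a finite probability space (a sketch, not part of the notes):

```python
# Check Holder's inequality ||f||_p * ||g||_q >= ||f g||_1 over a finite
# probability space, for several pairs of conjugate exponents.
import random

random.seed(2)

def mean(vals, weights):
    return sum(w * v for v, w in zip(vals, weights))

def p_norm(f, weights, p):
    return mean([abs(x) ** p for x in f], weights) ** (1 / p)

n = 10
weights = [1 / n] * n                      # uniform probability space
f = [random.uniform(0, 5) for _ in range(n)]
g = [random.uniform(0, 5) for _ in range(n)]

for p, q in [(2, 2), (3, 1.5), (4, 4 / 3)]:   # 1/p + 1/q = 1
    lhs = p_norm(f, weights, p) * p_norm(g, weights, q)
    rhs = mean([abs(a * b) for a, b in zip(f, g)], weights)
    assert lhs >= rhs - 1e-12

# The PMI special case: with g = 1, ||f||_p >= ||f||_1, i.e. p'th mean >= mean.
assert p_norm(f, weights, 3) >= mean(f, weights) - 1e-12
```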

Chapter 4

Limited independence

4.1 Lecture 21 (16/Nov): Pairwise independence, improved proof of coding theorem using linear codes

Very commonly, in Algorithms, we have a tradeoff between how much randomness we use, and efficiency. But sometimes we can actually improve our efficiency by carefully eliminating some of the randomness we're using. Roughly, the intuition is that some of the randomness is going not toward circumventing a barrier (especially, leaving the adversary in the dark about what we are going to do), but just into noise, and costs.1

A case in point is the proof of Shannon's Coding Theorem, for error correction over channels which flip bits independently with probability p < 1/2. In a previous lecture we proved the theorem as follows: we first built an encoding map E : {0, 1}^k → {0, 1}^n by sampling a uniformly random function; then, we had to delete up to half the codewords to eliminate all kinds of fluctuations in which codewords fell too close to one another.

It turns out that this messy solution can be avoided. The key observation is that our analysis depended only on pairwise data about the code—basically, pairwise distances between codewords. "Higher level" structure (mutual distances among triples, etc.) didn't feature in the analysis. So the argument will still go through with a pairwise-independently constructed code. So we'll do this now, and in the process we'll see how this helps.

Sample E from the following pairwise independent family of functions {0, 1}^k → {0, 1}^n. Select k vectors v_1, ..., v_k iid ∈_U {0, 1}^n. Now map the vector (x_1, ..., x_k) to ∑_{i=1}^k x_i v_i. This is, of course, a linear map, consisting of multiplication by the generator matrix G whose rows are the v_i:

                [ − − − v_1 − − − ]
(message x) ·   [ − − − v_2 − − − ]   = (codeword)
                [ − − − ... − − − ]
                [ − − − v_k − − − ]

The message $\bar 0 \in \{0,1\}^k$ is always mapped to the codeword $\bar 0 \in \{0,1\}^n$, and every other codeword is uniformly distributed in $\{0,1\}^n$. It is not hard to see that the images of the messages are pairwise independent. (Including even the image of the $\bar 0$ message.)
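As an added sanity check (not part of the notes), the pairwise independence of the images under the random linear map can be verified exhaustively for tiny parameters; here $k = n = 2$, and the helper names are ours:

```python
# Exhaustive check, for tiny parameters (k = n = 2), that the random linear map
# x -> xG over GF(2) gives pairwise-independent codewords: for two distinct
# nonzero messages x, x', the pair (xG, x'G) is uniform over all 2^n * 2^n
# values when G is a uniformly random k x n binary matrix.
from itertools import product
from collections import Counter

k, n = 2, 2

def encode(x, G):
    # xG over GF(2): XOR together the rows v_i selected by the bits of x.
    word = (0,) * n
    for xi, row in zip(x, G):
        if xi:
            word = tuple(a ^ b for a, b in zip(word, row))
    return word

x, xp = (1, 0), (0, 1)  # two distinct messages
matrices = list(product(product((0, 1), repeat=n), repeat=k))  # all 2^(kn) G's
counts = Counter()
for G in matrices:
    counts[(encode(x, G), encode(xp, G))] += 1

# Uniform over the 2^n * 2^n possible codeword pairs:
assert len(counts) == 2**n * 2**n
assert all(c == len(matrices) // (2**n * 2**n) for c in counts.values())
```

For these two messages the images are exactly the two rows of $G$, so uniformity of the pair is immediate; the check confirms the bookkeeping.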

1If you pack a tent, you’ll spend the night on the mountain. – a climbing instructor of mine


Let's see why: say the two messages are $x \neq x'$. W.l.o.g. $x' \neq 0$. Now, we want to show that $\forall y, y'$, $\Pr(x'G = y' \mid xG = y) = 2^{-n}$. Since $x' \neq x$ (and we are over GF(2)), this means that $x' \notin \mathrm{span}(x)$. Consequently, there exists a $G$ s.t. $xG = y$, $x'G = y'$. But since there exists such a $G$, call it $G_0$, the number of $G$'s satisfying this pair of equations does not depend upon $y'$; the set of such $G$'s is simply equal to all $G_0 + G'$ where $x, x' \in \ker G'$. The number of such $G'$ depends only upon $\dim \mathrm{span}(x, x')$. (If you want a more concrete argument, you can change basis to where $x$, if nonzero, is a singleton vector, and $x' - x$ is another singleton vector. Then the row of $G$ corresponding to $x$ is $y$, the row corresponding to $x' - x$ is $y' - y$, and the rest of the matrix can be anything.)

Now let's remember some of the settings we used for this theorem in Section (??):

(1) The code rate is (3.4): $n = \dfrac{k}{D_2(p\|1/2) - \varepsilon}$;

(2) First upper bound on $\delta$ is (3.5): $p + \delta < 1/2$;

(3) Second upper bound on $\delta$ is (3.6): $D_2(p+\delta\|1/2) > D_2(p\|1/2) - \varepsilon/2$;

And finally we make $\delta$ as large as we can subject to these constraints, and set $c = \min\{D_2(p+\delta\|p), \varepsilon/2\} > 0$. Looking back at the analysis of the error probability on message $X$ in Section (??), it had two parts, in each of which we bounded the probability of one of the following two sources of error:

Bad$_1$: $H(E(X)+R,\,E(X)) \ge (p+\delta)n$. That is to say, the error vector $R$ has weight (number of 1's) at least $(p+\delta)n$. This analysis is of course unchanged, and doesn't depend at all on the choice of code. As before, the bound is

$$\Pr_R(\mathrm{Bad}_1) = \Pr_R\big(H(\vec 0, R) \ge (p+\delta)n\big) \le 2^{-D_2(p+\delta\|p)\,n} \le 2^{-cn}.$$

Bad$_2$: $\exists X' \neq X: H(E(X)+R,\,E(X')) \le (p+\delta)n$. For this, pairwise independence is enough to obtain an analysis similar to before. Specifically, for any pair $X \neq X'$ and any $R$, the rv (which now depends only on the choice of code) $R + E(X) - E(X')$ is uniformly distributed in $\{0,1\}^n$ (because $X - X'$ is not the zero string, so $E(X - X')$ is uniform); so the choice of $R$ does not affect $\Pr_R(\mathrm{Bad}_2)$, and we can bound it as

$$\Pr_R(\mathrm{Bad}_2) \le 2^{k - nD_2(p+\delta\|1/2)} = 2^{n(D_2(p\|1/2) - \varepsilon - D_2(p+\delta\|1/2))} \quad \text{from (3.4)}$$
$$\le 2^{-n\varepsilon/2} \quad \text{from (3.6)} \qquad \le 2^{-cn}$$

So, we get the same as before: $\Pr_{E,R}(\text{Error on } X) \le 2^{1-cn}$ for the same $c > 0$ that depends only on $p, \varepsilon$. That is, for every $X$, with $M_X = \Pr_R(\text{Error on } X \mid E)$, we have

$$E_E(M_X) \le 2^{1-cn} \qquad (4.1)$$

Next, just as before, we wish to remove E from the randomization in the analysis. In order to do this it helps to consider the uniform distribution over messages X and derive from Eqn. 4.1 the weaker

$$E_{X,E}(M_X) \le 2^{1-cn} \qquad (4.2)$$

The reason is that this weaker guarantee is maintained even if we now modify the decoding algorithm so that it commutes with translation by codewords. Specifically, no matter what the decoder did before, set it now so that $D(Y)$ is uniformly sampled among "max-likelihood" decodings of $Y$,


which is equivalent (thanks to the uniformity over $X$ and to the noise $R$ being independent of $X$) to those $X$ which minimize $H(E(X), Y)$. For the uniform distribution, max-likelihood decoding minimizes the average probability of error, so this new decoder $D$ also satisfies (4.2). The new decoder has the commutation advantage that we promised: for any $E$,

$$D(E(X) + R) = D(E(X)) + D(R) \quad \text{(commutes with translation by code)}$$
$$= X + D(R) \quad \text{(decoding correct on codewords)} \qquad (4.3)$$

As a consequence,

$$\text{For all } E, X_1, X_2: \quad \Pr_R(\text{Error on } X_1 \mid E) = \Pr_R(\text{Error on } X_2 \mid E).$$

So we can define a variable M which is a function of E,

$$M = \Pr_R(\text{Error on } \bar 0 \mid E) = \Pr_R(\text{Error on } X \mid E) \ \text{ for all } X$$
and we have
$$E_E(M) \le 2^{1-cn}.$$
Since $M \ge 0$, $\Pr_E(M > 2^{2-cn}) < 1/2$, and so if we just pick linear $E$ at random, there is probability $\ge 1/2$ that (using the already-described decoder $D$ for it), for all $X$ the decoding-error probability is $\le 2^{2-cn}$.

What is much more elegant about this construction than about the preceding fully-random-$E$ construction is that no $X$'s with high error probabilities need to be thrown away. The set of codewords is always just a linear subspace of $\{0,1\}^n$. The code also has a very concise description, $O(k^2)$ bits (recall $n \in \Theta(k)$); whereas the previous full-independence approach gave a code with description size exponential in $k$.

One comment: although picking a code at random is easy, checking whether it indeed satisfies the desired condition is slow. One can either do this exactly, in time exponential in $n$, by exhaustively considering $R$'s; or one can try to estimate the probability of error by sampling $R$, but even this requires time inverse in the decoding-error probability before we see error events and can get a good estimate of the error probability. In particular we cannot certify a good code this way in time less than $2^{cn}$.


4.2 Lecture 22 (18/Nov): Pairwise independence, second moment inequality, G(n, p) thresholds

Recall that Chebyshev's inequality (19) says: If $E(X) = \theta$, then $\Pr(|X - \theta| > \lambda\sqrt{\mathrm{Var}(X)}) < 1/\lambda^2$. A common situation in which we use this is when we have many variables which are not fully independent, but are pairwise independent (or nearly so).

Definition 73 (Pairwise and k-wise independence). A set of rvs are pairwise independent if every pair of them are independent; this is a weaker requirement than that all be independent. Likewise, the variables are k-wise independent if every subset of size $k$ is independent.

Definition 74 (Covariance). The covariance of two real-valued rvs $X, Y$ is $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$.

(Here and in what follows we assume that all expectations we use are well-defined, i.e., the underlying integral converges absolutely.)

Exercise: Show that if $X$ and $Y$ are independent then $\mathrm{Cov}(X, Y) = 0$, but that the converse need not be true.

Exercise: If $X = \sum_1^n X_i$, then $\mathrm{Var}\,X = \sum_i \mathrm{Var}(X_i) + \sum_{i\neq j} \mathrm{Cov}(X_i, X_j)$.

Corollary 75. If $X_1, \ldots, X_n$ are pairwise independent real rvs, then $\mathrm{Var}(\sum X_i) = \sum \mathrm{Var}(X_i)$. (We already mentioned this in (3.1).) If in addition they are identically distributed and $\overline X = \frac1n \sum X_i$, then $E(\overline X) = E(X_1)$ and $\mathrm{Var}(\overline X) = \frac1n \mathrm{Var}(X_1)$.
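A tiny illustration (an addition, not from the notes): three pairwise-independent but not mutually independent bits, built from two fair coins, for which the variance of the sum nevertheless splits as in Corollary 75.

```python
# Classic construction: X, Y fair bits, Z = X xor Y. The triple (X, Y, Z) is
# pairwise independent but not mutually independent. Verify, by exact
# enumeration over the 4 equally likely outcomes, that all pairwise covariances
# vanish, hence Var(X+Y+Z) = Var(X)+Var(Y)+Var(Z).
from itertools import product

outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

def E(f):
    return sum(f(*o) for o in outcomes) / len(outcomes)

def cov(i, j):
    return E(lambda *o: o[i] * o[j]) - E(lambda *o: o[i]) * E(lambda *o: o[j])

assert all(abs(cov(i, j)) < 1e-12 for i in range(3) for j in range(3) if i != j)

var_sum = E(lambda x, y, z: (x + y + z) ** 2) - E(lambda x, y, z: x + y + z) ** 2
assert abs(var_sum - (cov(0, 0) + cov(1, 1) + cov(2, 2))) < 1e-12  # = 3/4

# But not 3-wise independent: Z is determined by X and Y.
assert all(z == x ^ y for x, y, z in outcomes)
```

Each bit has variance $1/4$, so $\mathrm{Var}(X+Y+Z) = 3/4$ despite the full dependence of $Z$ on $(X, Y)$.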

The Chebyshev inequality and Corollary 75 give us:

Lemma 76 (2nd moment inequality). If $X_1, \ldots, X_n$ are identically distributed, pairwise-independent real rvs with finite 1st and 2nd moments, then $\Pr\big(|\overline X - E(\overline X)| > \lambda\sqrt{\mathrm{Var}(X_1)/n}\big) < 1/\lambda^2$.

Corollary 77 (Weak Law). Pairwise independent rvs obey the weak law of large numbers. Specifically, if $X_1, \ldots, X_n$ are identically distributed, pairwise-independent real rvs with finite variance, then for any $\varepsilon > 0$, $\lim_{n\to\infty}\Pr(|\overline X - E(\overline X)| > \varepsilon) = 0$.

So we see that the weak law holds under a much weaker condition than full independence. When we talk about the cardinality of sample spaces, we'll see why pairwise (or small k-wise) independence has a huge advantage over full independence, so that it is often desirable in computational settings to make do with limited independence.

4.2.1 Threshold for H as a subgraph in G(n, p)

Working with low moments of random variables can be incredibly effective, even when we are not specifically looking for limited-independence sample spaces. Here is a prototypical example. "When" does a specific, constant-sized graph $H$ show up as a subgraph of a random graph selected from the distribution $G(n, p)$? We have in mind that we are "turning the knob" on $p$. If $H$ has any edges then when $p = 0$, with probability 1 there is no subgraph isomorphic to $H$. When $p = 1$, with probability 1 such subgraphs are everywhere.² In between, for any finite $n$, the probability is some increasing function of $p$. But we won't take $n$ finite, we will take it tending to $\infty$.

2 Today we focus on H = K4, the 4-clique, but more generally this method will establish the probability of any fixed graph H occurring as a subgraph in G, that is, ∃ injection of V(H) into V(G) carrying edges to edges. This is different from asking that H occur as an induced subgraph of G, which requires also that non-edges be carried to non-edges. That question is different in an essential way: the event is not monotone in G.


So the question is,³ can we identify a function $\pi(n)$ such that in the model $G(n, p(n))$, with $⟦H⟧$ denoting the event that there is an $H$ in the random graph $G$,

(a) If $p(n) \in o(\pi(n))$, then $\lim_n \Pr(⟦H⟧) = 0$.

(b) If $p(n) \in \omega(\pi(n))$, then $\lim_n \Pr(⟦H⟧) = 1$.

It is usual to focus only on $\pi \le 1/2$ since otherwise one may simply consider the complementary property, with threshold $1 - \pi$. Such a function $\pi(n)$ is known as the threshold for appearance of $H$. It follows from work of B. Bollobás and A. G. Thomason that monotone properties (this extends even beyond properties of graphs to general set systems, see [20] or [19] §6 for the statement) always have a threshold function.

Comments: (1) We usually refer to $\pi$ as "the" threshold for our monotone event, but note, the definition only specifies $p$ to within a constant. Sometimes we can determine the transition more sharply. (2) In particular for a monotone graph property $\mathcal E$, i.e., a monotone property invariant under vertex permutations, for any $\varepsilon > 0$ there is a $p(n)$ such that $\Pr_{p(n)}(\mathcal E) \le \varepsilon$ and $\Pr_{p(n)+O(1/\log n)}(\mathcal E) \ge 1 - \varepsilon$. See [40]. This improves (for monotone graph properties) on the general definition if $\pi \in \omega(1/\log n)$.

4.2.2 Most pairs independent: threshold for K4 in G(n, p)

Let S ⊆ {1, . . . , n}, |S| = 4. Let XS be the event that K4 occurs as a subgraph of G at S—that is, when you look at those four vertices, all the edges between them are present. Conflating XS with its indicator function and letting X be the number of K4’s in G, we have

$$X = \sum_S X_S$$

and
$$E(X) = \binom{n}{4} p^6.$$

We are interested in Pr(X > 0). Let π(n) = n−2/3.

(a) For $p(n) \in o(\pi(n))$, $E(X) \in o(1)$, so $\Pr(⟦K_4⟧) \in o(1)$ and therefore $\lim_n \Pr(⟦K_4⟧) = 0$.

(b) For $1 > p(n) \in \omega(\pi(n))$, $E(X) \in \omega(1)$. We'd like to conclude that likely $X > 0$, but we do not have enough information to justify this, as it could be that $X$ is usually 0 and occasionally very large.⁴ We will exclude that possibility for $K_4$ by studying the next moment of the distribution.

Before carrying out this calculation, though, we have to make one important note. Since the event $⟦K_4⟧$ is monotone, $[p \le p'] \Rightarrow [\Pr_{G(n,p)}⟦K_4⟧ \le \Pr_{G(n,p')}⟦K_4⟧]$. (An easy way to see this is by choosing reals iid uniformly in $[0,1]$ at each edge, and placing the edge in the graph if the rv is below the threshold $p$ (respectively $p'$).) This means that it is enough to show that $K_4$ "shows up" slightly above $\pi$. This is useful because some of our calculations break down far above $\pi$: not because there is anything wrong with the underlying statement, but because the inequalities we use are not strong enough to be useful there, and a direct calculation would need to take account of further moments. To simplify our remaining calculations, then, let

$$p = n^{-2/3}g(n), \quad\text{so}\quad n^4 p^6 = g^6$$
for any sufficiently small $g(n) \in \omega(1)$; we'll see how this is helpful in the calculations.

³Recall $p(n) \in o(\pi(n))$ means that $\limsup p(n)/\pi(n) = 0$, and $p(n) \in \omega(\pi(n))$ means that $\limsup \pi(n)/p(n) = 0$.
⁴When we study not $K_4$-subgraphs, but other subgraphs, this can really happen. We'll discuss this below.


By an earlier exercise,
$$\mathrm{Var}(X) = \sum_S \mathrm{Var}(X_S) + \sum_{S \neq T} \mathrm{Cov}(X_S, X_T)$$

$X_S$ is a coin (or Bernoulli rv) with $\Pr(X_S = 1) = p^6$. The variance of such an rv is $p^6(1-p^6)$. The covariance terms are more interesting.

1. If $|S \cap T| \le 1$, no edges are shared, so the events are independent and $\mathrm{Cov}(X_S, X_T) = 0$.
2. If $|S \cap T| = 2$, one edge is shared, and a total of 11 specific edges must be present for both cliques to be present. A simple way to bound the covariance is (since $E(X_S), E(X_T) \ge 0$) that $\mathrm{Cov}(X_S, X_T) = E(X_S X_T) - E(X_S)E(X_T) \le E(X_S X_T) = p^{11}$.
3. If $|S \cap T| = 3$, three edges are shared, and a specific 9 edges must be present for both cliques to be present. Similarly to the previous case, $\mathrm{Cov}(X_S, X_T) \le p^9$.

$$\mathrm{Var}(X) \le \binom{n}{4}p^6(1-p^6) + \binom{n}{2,2,2}p^{11} + \binom{n}{3,1,1}p^9$$
$$\in O(n^4 p^6 + n^6 p^{11} + n^5 p^9)$$
$$= O(g^6 n^{4-4} + g^{11} n^{6-22/3} + g^9 n^{5-6}) \qquad \text{from } p = n^{-2/3}g(n)$$
$$= O(g^6 + g^{11} n^{-4/3} + g^9 n^{-1}) \qquad (4.4)$$
$$= O(g^6) \qquad \text{provided } g^5 \in O(n^{4/3}) \text{ and } g^3 \in O(n)$$

This gives us the key piece of information. For g ∈ ω(1) but not too large, we have

$$\frac{\mathrm{Var}(X)}{(E(X))^2} \in \frac{O(g^6)}{\Theta((n^4 p^6)^2)} = \frac{O(g^6)}{\Theta(g^{12})} = O(g^{-6}) \subseteq o(1)$$

and we have only to apply the Chebyshev inequality (Cor. 20) (or better yet Paley-Zygmund, Lemma 84 which we haven’t proven yet) to conclude that Pr(X = 0) ∈ o(1) and so

$$\lim_n \Pr(⟦K_4⟧) = 1. \qquad (4.5)$$

Since $⟦K_4⟧$ is a monotone event, (4.5) holds even for $g$ above the range we needed for the calculation to hold. (Note, though, that since there is so much "room" in the calculation, we could even have used the upper bound $O(g^{11})$ in (4.4), and not resorted to this monotonicity argument.)

Exercise: Show that the threshold for appearance, as a subgraph, of the graph with 5 edges and 4 vertices is $n^{-4/5}$.

Comment: For a general $H$ the threshold for appearance of $H$ in $G(n,p)$ as a subgraph is determined not by the ratio $\rho_H$ of edges to vertices, but by the maximum of this ratio over induced subgraphs of $H$, call it $\rho_{\max H}$. We'll see this on a problem set (and see [7]). If these numbers are different then above $n^{-1/\rho_H}$ the expected number of $H$'s starts tending to $\infty$ but almost certainly we have none; once we cross the higher threshold $n^{-1/\rho_{\max H}}$, there is an "explosion" of many of these subgraphs appearing. (They show up highly intersecting in the fewer copies of the critical induced subgraph.)
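A quick Monte Carlo illustration (our own addition; `has_k4` and the parameters are ours, and $n$ is far too small for the asymptotics to bite): $K_4$'s are essentially surely absent well below the threshold $\pi(n) = n^{-2/3}$ and essentially surely present well above it.

```python
# Sample G(n, p) and look for a K4, for p far below / far above n^(-2/3).
import random
from itertools import combinations

def has_k4(n, p, rng):
    adj = [[False] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u][v] = adj[v][u] = True
    # Check all 4-subsets for a complete clique.
    return any(all(adj[a][b] for a, b in combinations(S, 2))
               for S in combinations(range(n), 4))

rng = random.Random(0)
n = 30                          # pi(30) = 30^(-2/3) ~ 0.10
assert not has_k4(n, 0.01, rng)  # far below threshold: E(#K4) ~ 3e-8
assert has_k4(n, 0.5, rng)       # far above threshold: E(#K4) ~ 430
```

At $p = 0.5$ the expected number of $K_4$'s is $\binom{30}{4}/2^6 \approx 430$ and the second-moment argument says the count concentrates, so presence is overwhelmingly likely.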


4.3 Lecture 23 (20/Nov): Limited independence: near-pairwise for primes, 4-wise for Khintchine-Kahane

4.3.1 Turán's proof of a theorem of Hardy and Ramanujan

(Adapted from Alon & Spencer.) Let m(k) be the number of primes dividing k. Always m(k) ≤ lg k. But usually this is a vast overestimate, because m(k) is almost always close to log log k.

Theorem 78 (Hardy & Ramanujan, 1920). Let 1 ≤ λ, and sample k ∈U [n]. Then

$$\Pr\big(|m(k) - \log\log n| > \lambda\sqrt{\log\log n}\big) < \frac{1+o(1)}{\lambda^2}. \qquad (4.6)$$

Proof. We show an elegant proof due to Turán, 1934. The theorem is trivial for $\lambda > \lg n$, so we consider only $\lambda \le \lg n$. We're now going to do a minor technical trick that will be useful toward the end of the argument. Note that $k$ can have at most one prime factor greater than $\sqrt n$, so if we define $\bar m(k)$ to be the number of prime factors of $k$ that are $\le \sqrt n$, it is enough to show (4.6) for $\bar m(k)$. Although we said that most $k$ have about $\log\log k$ prime factors, the bound in the theorem is stated in terms of $n$, and so naturally, we cannot expect the bound to hold in the rare case that we sample $k$ much smaller than $n$. Separating out these cases by thresholding $k$ at $\sqrt n$ (there is no connection between this threshold and that for the prime factors; indeed either could have been chosen at any fixed polynomial in $n$) is our first step.

Let $B$ be the event that $|\bar m(k) - \log\log n| > \lambda\sqrt{\log\log n}$.
$$\Pr(B) \le \Pr(k \le \sqrt n) + \Pr(B \mid k > \sqrt n)\Pr(k > \sqrt n) \le \frac{1}{\sqrt n} + \Pr(B \mid k > \sqrt n)$$

Since $\lambda \le \lg n$, we are guaranteed that $\frac{1}{\sqrt n} \in o(\lambda^{-2})$, so we need only show that

$$\Pr(B \mid k > \sqrt n) < \frac{1+o(1)}{\lambda^2}.$$
This will follow from showing
$$E(\bar m(k)) = \log\log n + o(\sqrt{\log\log n}) \qquad (4.7)$$

$$\mathrm{Var}(\bar m(k)) \le (1+o(1))\log\log n \qquad (4.8)$$
and applying the Chebyshev inequality.

For fixed $a$ and $k \in_U [n]$, let $⟦a|k⟧$ be the indicator rv for $a$ dividing $k$. Throughout the following expressions, $p$ and $q$ are always prime. Note

$$\bar m(k) = \sum_{p \le \sqrt n} ⟦p|k⟧. \qquad (4.9)$$

$$E(⟦p|k⟧) = \lfloor n/p \rfloor / n \qquad (4.10)$$


So
$$\frac1p - \frac1n \ \le\ E(⟦p|k⟧) \ \le\ \frac1p \qquad (4.11)$$

From (4.9) and (4.11) we have

$$-\frac{1}{\sqrt n} + \sum_{p\le\sqrt n}\frac1p \ \le\ E(\bar m(k)) \ \le\ \sum_{p\le\sqrt n}\frac1p \qquad (4.12)$$

Now we use a well-known bound in :

$$\sum_{p\le r}\frac1p = \log\log r + O(1) \qquad (4.13)$$
(where the "$+O(1)$" might be negative). Observe $\log\log\sqrt n = \log(\tfrac12\log n) = \log\log n - \log 2$. So, applying both sides of (4.12),
$$E(\bar m(k)) = \log\log n + O(1)$$

which shows (4.7). It remains to show (4.8).

As always we can write

$$\mathrm{Var}(\bar m(k)) = \sum_{p\le\sqrt n}\mathrm{Var}(⟦p|k⟧) + \sum_{p\neq q\le\sqrt n}\mathrm{Cov}(⟦p|k⟧, ⟦q|k⟧) \qquad (4.14)$$

What we will discover is that the sum of covariances is very small, and so the bound on $\mathrm{Var}(\bar m(k))$ is almost as if we had pairwise independence between the events $⟦p|k⟧$.

A simple bound we've used before is that for a $\{0,1\}$-valued rv $Y$, $\mathrm{Var}(Y) = E(Y)(1-E(Y)) \le E(Y)$. Applying this,

$$\sum_{p\le\sqrt n}\mathrm{Var}(⟦p|k⟧) \ \le\ \sum_{p\le\sqrt n}E(⟦p|k⟧) = E(\bar m(k)) = \log\log n + O(1) \qquad (4.15)$$

Now to handle the covariances. For primes $p \neq q$, $⟦p|k⟧\,⟦q|k⟧ = ⟦pq|k⟧$. Just as we noted for primes in (4.10), we have
$$\frac{1}{pq} - \frac1n \ \le\ E(⟦pq|k⟧) = \lfloor n/pq \rfloor / n \ \le\ \frac{1}{pq}$$

So (and using the lower bound in (4.11)):

$$\mathrm{Cov}(⟦p|k⟧, ⟦q|k⟧) = E(⟦pq|k⟧) - E(⟦p|k⟧)E(⟦q|k⟧) \ \le\ \frac{1}{pq} - \Big(\frac1p - \frac1n\Big)\Big(\frac1q - \frac1n\Big) \ \le\ \frac1n\Big(\frac1p + \frac1q\Big)$$


This is a very low covariance, which is crucial to the theorem.
$$\sum_{p\neq q\le\sqrt n}\mathrm{Cov}(⟦p|k⟧, ⟦q|k⟧) \ \le\ \frac1n\sum_{p\neq q\le\sqrt n}\Big(\frac1p+\frac1q\Big) \ \le\ \frac1n\sum_{p,q\le\sqrt n}\Big(\frac1p+\frac1q\Big) = \frac2n\sum_{p,q\le\sqrt n}\frac1p$$
$$\le\ \frac{2}{\sqrt n}\sum_{p\le\sqrt n}\frac1p \qquad \text{(this is why we switched to } \bar m\text{)}$$
$$=\ \frac{2}{\sqrt n}(\log\log n + O(1)) \in o(1)$$
Combining this with (4.15) in (4.14) gives us (4.8). 2
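An empirical check (our own addition, and with $n$ far too small for the asymptotics to be sharp) that the number of distinct prime factors concentrates near $\log\log n$:

```python
# Sieve the number of distinct prime factors omega(k) for k <= n, and compare
# the average over [n] to log log n. By Mertens' theorem the average is
# log log n + O(1), and Turan's argument says the distribution concentrates.
import math

n = 10**5
omega = [0] * (n + 1)
for p in range(2, n + 1):
    if omega[p] == 0:                      # p is prime (untouched so far)
        for multiple in range(p, n + 1, p):
            omega[multiple] += 1

avg = sum(omega[1:]) / n
loglog = math.log(math.log(n))             # ~ 2.44 for n = 10^5
assert abs(avg - loglog) < 1.0             # average is log log n + O(1)

# Concentration: few k deviate by more than 2*sqrt(log log n).
dev = 2 * math.sqrt(loglog)
frac_far = sum(1 for k in range(1, n + 1)
               if abs(omega[k] - loglog) > dev) / n
assert frac_far < 0.25
```

The deviating fraction is in fact far below the asserted 0.25; the loose constant just keeps the check robust at this small $n$.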

4.3.2 4-wise independent random walk

Earlier we quoted the CLT or Khintchine-Kahane to conclude that the value of the Gale-Berlekamp game is $\Omega(n^{3/2})$. Specifically we used this to show that for a symmetric random walk of length $n$, $X = \sum_1^n X_i$ with $X_i \in_U \{1, -1\}$, $E(|X|) \in \Omega(n^{1/2})$. Now we will show this from first principles—and more importantly, using only information about the 2nd and 4th moments.

This is not only of methodological interest. It makes the conclusion more robust; specifically, the conclusion holds for any 4-wise independent space, and therefore implies a poly-time deterministic algorithm to find a Gale-Berlekamp solution of value $\Omega(n^{3/2})$, because there exist k-wise independent sample spaces of size $O(n^{\lfloor k/2\rfloor})$, as we will show in a later lecture.

Theorem 79. Let $X = \sum_1^n X_i$ where the $X_i$ are 4-wise independent and $X_i \in_U \{1, -1\}$. Then $E(|X|) \in \Omega(n^{1/2})$.

We'll prove a more general version of this next time, but we start with two calculations. These calculations are made easy by the fact that for any product of the form $X_{i_1}^{b_1}\cdots X_{i_4}^{b_4}$, with $i_1, \ldots, i_4$ distinct and $b_i \ge 0$ integer,
$$E(X_{i_1}^{b_1}\cdots X_{i_4}^{b_4}) = \begin{cases} 0 & \text{if any } b_i \text{ is odd} \\ 1 & \text{otherwise} \end{cases}$$
So now
$$E(X^2) = \sum_{i,j} E(X_iX_j) = \sum_i E(X_i^2) = n$$
$$E(X^4) = 3\sum_{i,j} E(X_i^2X_j^2) - 2\sum_i E(X_i^4) = 3n^2 - 2n.$$
One is tempted to apply Chebyshev's inequality (in the form of Cor. 20) to the rv $X^2$, because we know both its expectation and its variance. Unfortunately, the numbers are not favorable! $E(X^2) = n$, $\mathrm{Var}(X^2) = 3n^2 - 2n - n^2 = 2n^2 - 2n$, so all we get is $\Pr(X^2 = 0) \le \mathrm{Var}(X^2)/(E(X^2))^2 \cong \frac{2n^2}{n^2} = 2$, which is useless. (Let alone that we actually need to bound the larger quantity $\Pr(X^2 < cn)$ for some $c > 0$.) Chebyshev's inequality isn't the right tool for this problem. There are two standard tools that work; we'll see one of them next time (and cover the second in an appendix to these lecture notes).
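The two moment computations can be confirmed by exact enumeration over all sign vectors for a small $n$ (a check added here, not in the notes; full independence is in particular 4-wise independence):

```python
# Exact verification of E(X^2) = n and E(X^4) = 3n^2 - 2n for the fully
# independent +-1 walk, by enumerating all 2^n sign vectors.
from itertools import product

n = 6
signs = list(product((1, -1), repeat=n))
E2 = sum(sum(s) ** 2 for s in signs) / len(signs)
E4 = sum(sum(s) ** 4 for s in signs) / len(signs)

assert E2 == n                    # = 6
assert E4 == 3 * n**2 - 2 * n     # = 96
```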


4.4 Lecture 24 (23/Nov): Khintchine-Kahane for 4-wise indepen- dence; begin MIS in NC

4.4.1 Log concavity of moments and Berger’s bound

Lemma 80 (Berger [12]). Let $B$ be a nonnegative rv with $0 < \mu_4(B) < \infty$. Then $\mu_1(B) \ge \dfrac{\mu_2(B)^{3/2}}{\mu_4(B)^{1/2}}$.

For a nonnegative rv $B$, for any $x > 0$ let $\mu_x = E(B^x)$. Lemma 80 is a special case of the following, with $p = 1$, $q = 2$, $r = 4$:

Lemma 81. Let 0 < p < q < r and let B be a nonnegative rv with 0 < µr(B) < ∞. Then

$$\mu_p(B) \ \ge\ \mu_q(B)^{\frac{r-p}{r-q}}\,\mu_r(B)^{-\frac{q-p}{r-q}}.$$

Proof. A more usual way to write this is

$$\mu_q \ \le\ \mu_p^{\frac{r-q}{r-p}}\,\mu_r^{\frac{q-p}{r-p}} \qquad (4.16)$$

Note the exponents sum to 1, and that the average of $p$ and $r$ weighted by the exponents is $q$, i.e.,
$$p\cdot\frac{r-q}{r-p} + r\cdot\frac{q-p}{r-p} = q$$

so (4.16) is a consequence of an important fact, the log-concavity of the moments (i.e., of $\mu_q$ as a function of $q$). We'll show this next. 2

Lemma 82 (Log concavity of the moments). If $\theta$ is a probability distribution on the nonnegative reals then, for all $q$ at which $\int y^q\,d\theta$ converges absolutely, $\frac{\partial^2}{\partial q^2}\log\mu_q \ge 0$.

Proof.

$$\frac{\partial^2}{\partial q^2}\log\mu_q = \frac{\partial^2}{\partial q^2}\log\int y^q\,d\theta = \frac{\partial}{\partial q}\,\frac{\int y^q\log y\,d\theta}{\mu_q}$$
$$= \frac{1}{\mu_q^2}\left(\mu_q\int y^q\log^2 y\,d\theta - \Big(\int y^q\log y\,d\theta\Big)^2\right)$$
$$= \frac{1}{2\mu_q^2}\iint x^q y^q(\log y - \log x)^2\,d\theta(x)\,d\theta(y)$$
$$\ge 0$$

2

Worth noting is that the power-means inequality is another corollary of Lemma 81; if we take $p = 0$, then we have $\mu_p = 1$, and this gives $\mu_r^q \ge \mu_q^r$.
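A numeric spot-check of Lemma 81 (our addition; the distribution is an arbitrary choice):

```python
# Check the moment inequality mu_q^(r-p) <= mu_p^(r-q) * mu_r^(q-p)
# (log-convexity of q -> mu_q) for B uniform on {1, 2, 3, 4}, with
# (p, q, r) = (1, 2, 4).
samples = [1.0, 2.0, 3.0, 4.0]

def mu(x):
    return sum(b ** x for b in samples) / len(samples)

p, q, r = 1, 2, 4
assert mu(q) ** (r - p) <= mu(p) ** (r - q) * mu(r) ** (q - p)

# Equivalently, Berger's form (Lemma 80): mu_1 >= mu_2^(3/2) / mu_4^(1/2).
assert mu(1) >= mu(2) ** 1.5 / mu(4) ** 0.5
```

Here $\mu_1 = 2.5$, $\mu_2 = 7.5$, $\mu_4 = 88.5$, so the inequality reads $7.5^3 = 421.875 \le 2.5^2 \cdot 88.5 = 553.125$.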


4.4.2 Khintchine-Kahane for 4-wise independent rvs

Here is our more general version of Theorem 79.

Theorem 83 (Khintchine-Kahane, 4-wise). Let $a = (a_1,\ldots,a_n)$, $a_i \in \mathbb R$. Let $R_i \in_U \{\pm1\}$, $X_i = R_i a_i$, and set $X = |\sum X_i|$. Then:

1. If the $R_i$ are pairwise independent, $E(X) \le \|a\|_2$.
2. If the $R_i$ are 4-wise independent, $\frac{1}{\sqrt 3}\|a\|_2 \le E(X)$.
3. If the $R_i$ are independent, $\frac{1}{\sqrt 2}\|a\|_2 \le E(X)$.

We’ll only prove the first two parts here. For the third I recommend nice notes of Filmus [35].

Proof. The upper bound is implied by nonnegativity of the variance of |X| (special case of the power-means inequality) and by the calculation

$$E(X^2) = \sum_{1\le i,j\le n} E(X_iX_j) = \sum_{i=1}^n E(X_i^2) = \sum_i a_i^2.$$

The main point of interest is the lower bound. We apply Lemma 80. In order to compute the moments we just slightly generalize the calculation we saw in the previous lecture. For any product of the form $X_{i_1}^{b_1}\cdots X_{i_4}^{b_4}$, with $i_1,\ldots,i_4$ distinct and $b_i \ge 0$ integer,
$$E(X_{i_1}^{b_1}\cdots X_{i_4}^{b_4}) = \begin{cases} 0 & \text{if any } b_i \text{ is odd} \\ a_{i_1}^2 a_{i_2}^2 & \text{if } b_1 = b_2 = 2 \\ a_{i_1}^4 & \text{if } b_1 = 4 \end{cases}$$
So now
$$E(X^4) = 3\sum_{1\le i,j\le n} E(X_i^2X_j^2) - 2\sum_{i=1}^n E(X_i^4) = 3\sum_{i,j} a_i^2a_j^2 - 2\sum_i a_i^4 = 3\Big(\sum a_i^2\Big)^2 - 2\sum a_i^4$$

$$E|X| \ \ge\ \frac{(E(X^2))^{3/2}}{(E(X^4))^{1/2}} = \frac{(\sum a_i^2)^{3/2}}{\big(3(\sum a_i^2)^2 - 2\sum a_i^4\big)^{1/2}} \ \ge\ \frac{(\sum a_i^2)^{3/2}}{\sqrt 3\,\sum a_i^2} = \frac{1}{\sqrt 3}\sqrt{\sum a_i^2}$$
Observe that we lost only a small constant factor here compared with the precise $1/\sqrt 2$ value obtained for a fully-independent sample space from the CLT.
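For intuition, the two bounds of Theorem 83 can be checked by exact enumeration under full independence (our own check, with an arbitrary vector $a$; full independence is in particular 4-wise independent):

```python
# Verify (1/sqrt(3)) ||a||_2 <= E|X| <= ||a||_2 for X = sum R_i a_i with
# independent uniform signs R_i, by exact enumeration of all sign patterns.
from itertools import product
import math

a = [1.0, 2.0, 0.5, 3.0]
norm = math.sqrt(sum(x * x for x in a))

signs = list(product((1, -1), repeat=len(a)))
EX = sum(abs(sum(r * x for r, x in zip(s, a))) for s in signs) / len(signs)

assert norm / math.sqrt(3) <= EX <= norm
```

For this $a$, $E|X| = 3.125$ sits between $\|a\|_2/\sqrt 3 \approx 2.18$ and $\|a\|_2 \approx 3.77$.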

4.4.3 Khintchine-Kahane from Paley-Zygmund (omitted in class)

There is another way to prove the 4-wise independent Khintchine-Kahane. It is slightly weaker in the constant in $E(|X|) \ge \Omega(\sqrt n)$ (we focus here just on the case of all $a_i = 1$) than the strategy we gave above, but it gives an in-probability bound. It relies on the Paley-Zygmund inequality. We'll just write the statement here, and defer proofs to Appendix A.1.

Paley-Zygmund is usually stated as an alternative to the Chebyshev (Cor. 20) lower-tail bound for nonnegative rvs; i.e., it gives a way to say that a nonnegative rv $A$ is "often large."


Let $\mu_i$ be the $i$th moment of $A$. Knowing only the first moment $\mu_1$ of $A$ is not enough, because for any value—even infinite—of the first moment, we can arrange, for any $\delta > 0$, a nonnegative rv $A$ which equals 0 with probability $1 - \delta$, yet has first moment $\mu_1$. We just have to move $\delta$ of the probability mass out to the point $\mu_1/\delta$, or, in the infinite-$\mu_1$ case, spread $\delta$ probability mass out in a measure whose first moment diverges.

However, a finite second moment µ2 is enough to justify such a “usually large” statement. Actually Paley-Zygmund holds for rvs which are not necessarily nonnegative.

Lemma 84 (Paley-Zygmund). Let A be a real rv with positive µ1 and finite µ2. For any 0 < λ ≤ 1,

$$\Pr(A > (1-\lambda)\mu_1) \ \ge\ \frac{\lambda^2\mu_1^2}{\mu_2}.$$

Comment: This gives $\Pr(A \le 0) \le \frac{\mu_2 - \mu_1^2}{\mu_2}$, which improves on the upper bound $\frac{\mu_2 - \mu_1^2}{\mu_1^2}$ of Cor. 20. It should be said though that PZ does not dominate Cor. 20 in all ranges (e.g., if the variance $\mu_2 - \mu_1^2$ is very small compared to $\mu_1^2$, and $\lambda$ is small).
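A small numeric comparison (our addition; the toy distribution is arbitrary): for a rv that is usually 0, Paley-Zygmund is tight exactly where the Chebyshev-style bound says nothing.

```python
# A = 0 with prob 0.9, A = 10 with prob 0.1: mu_1 = 1, mu_2 = 10.
# Paley-Zygmund (lambda = 1) bounds Pr(A <= 0) by (mu2 - mu1^2)/mu2 = 0.9,
# tight here, while the bound (mu2 - mu1^2)/mu1^2 = 9 is vacuous.
dist = {0.0: 0.9, 10.0: 0.1}

mu1 = sum(v * p for v, p in dist.items())
mu2 = sum(v * v * p for v, p in dist.items())

pz_bound = (mu2 - mu1**2) / mu2        # = 0.9
cheb_bound = (mu2 - mu1**2) / mu1**2   # = 9.0, useless as a probability bound

actual = dist[0.0]                     # Pr(A <= 0) = 0.9
assert abs(pz_bound - actual) < 1e-12  # PZ is exactly tight for this rv
assert cheb_bound > 1                  # the Chebyshev-style bound exceeds 1
```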

4.4.4 Maximal Independent Set in NC

Recall from section 2.4.1: parallel complexity classes

L = log-space = problems decidable by a Turing Machine having a read-only input tape and a read-write work tape of size (for inputs of length $n$) $O(\log n)$.

NC = $\bigcup_k$ NC$^k$, where NC$^k$ = languages s.t. $\exists c < \infty$ s.t. membership can be computed, for inputs of size $n$, by $n^c$ processors running for time $\log^k n$.

RNC = same, but the processors are also allowed to use random bits. For $x \in L$, $\Pr(\text{error}) \le 1/2$; for $x \notin L$, $\Pr(\text{error}) = 0$.

L ⊆ NC¹ ⊆ ... ⊆ NC ⊆ RNC ⊆ RP.

P-Complete = problems that are in P, and that are complete for P w.r.t. reductions from a lower complexity class (usually, log-space).

Maximal Independent Set

MIS is the problem of finding a Maximal Independent Set: an independent set that is not strictly contained in any other. This does not mean it needs to be big, let alone of maximum cardinality. (It is NP-complete to find an independent set of maximum size. This is more commonly known as the problem of finding a maximum clique, in the complement graph.)

There is an obvious sequential greedy algorithm for MIS: list the vertices {1, . . . , n}. Use vertex 1. Remove it and its neighbors. Use the smallest-index vertex which remains. Remove it and its neighbors, etc.

The independent set you get this way is called the Lexicographically First MIS. Finding it is P-complete w.r.t. log-space reductions [25]. So it is interesting that if we don't insist on getting this particular MIS, but are happy with any MIS, then we can solve the problem in parallel; specifically, in NC². We'll start with an RNC² algorithm of Luby [68] for MIS. Then, we'll see how to derandomize the algorithm. (Some of the ideas we'll see also come from the papers [59, 6].)


Notation: $D_v$ is the neighborhood of $v$, not including $v$ itself; $d_v = |D_v|$.

Luby's MIS algorithm: Given: a graph $G = (V, E)$ with $n$ vertices. Start with $I = \emptyset$. Repeat until the graph is empty:

1. Mark each vertex $v$ pairwise independently with probability $\frac{1}{2d_v+1}$.
2. For each doubly-marked edge, unmark the vertex of lower degree (break ties arbitrarily).

3. For each marked vertex v, append v to I and remove the vertices v ∪ Dv (and of course all incident edges) from the graph.

An iteration can be implemented in parallel in time O(log n), using a processor per edge. We’ll show that an expected constant fraction of edges is removed in each iteration (and then we’ll show that this is enough to ensure expected logarithmically many iterations).
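As an aside (not from the notes), the loop is easy to prototype sequentially; this sketch uses fully independent marks for simplicity, whereas the analysis only requires pairwise independence, and all names are ours:

```python
# A sketch of Luby's MIS algorithm. Randomness affects only the number of
# rounds; the output is always a maximal independent set.
import random

def luby_mis(n, edges, rng):
    """Return a maximal independent set of the graph ([n], edges)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    I = set()
    while alive:
        deg = {v: len(adj[v] & alive) for v in alive}
        # Step 1: mark v with probability 1/(2 d_v + 1).
        marked = {v for v in alive if rng.random() < 1.0 / (2 * deg[v] + 1)}
        # Step 2: for each doubly-marked edge, unmark the lower-degree
        # endpoint (ties broken by vertex id).
        doubly = [(u, v) for u in marked for v in adj[u] & marked if u < v]
        for u, v in doubly:
            marked.discard(u if (deg[u], u) < (deg[v], v) else v)
        # Step 3: add surviving marks to I; delete them and their neighborhoods.
        I |= marked
        for v in marked:
            alive.discard(v)
            alive -= adj[v]
    return I

rng = random.Random(1)
n = 50
edges = [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < 0.1]
I = luby_mis(n, edges, rng)

adj = {v: set() for v in range(n)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
# Independent: no edge inside I.  Maximal: every vertex is in or adjacent to I.
assert all(not (adj[u] & I) for u in I)
assert all(v in I or (adj[v] & I) for v in range(n))
```

Note that after step 2 no edge has both endpoints marked, so $I$ stays independent; and every removed vertex is in $I$ or adjacent to it, so the final set is maximal.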


4.5 Lecture 25 (25/Nov): Luby’s parallel algorithm for maximal independent set

Today we analyze the algorithm given last time.

Definition 85. A vertex v is good if it has ≥ dv/3 neighbors of degree ≤ dv.

Let G be the set of good vertices, and B the remaining ones which we call bad. So

A bad vertex has > 2dv/3 neighbors with degree > dv. (4.17)

Definition 86. An edge is good if it contains a good vertex.

Lemma 87. If $d_v > 0$ and $v$ is good, then $\Pr(\exists\text{ marked } w \in D_v \text{ after step (1)}) \ge \frac{1}{18}$.

This is implied by the following, using $\sum_{w\in D_v}\Pr(w \text{ marked}) \ge \frac{d_v}{3}\cdot\frac{1}{2d_v+1} \ge \frac19$.

Lemma 88. If $X_i$ are pairwise independent events s.t. $\Pr(X_i) = p_i$, then $\Pr(\bigcup X_i) \ge \frac12\min\big(\frac12, \sum p_i\big)$.

Compare with the pairwise-independent version of the second Borel-Cantelli lemma. Of course, that is about guaranteeing that infinitely many events occur, here we’re just trying to get one to occur, but the lemmas are nonetheless quite analogous.

Proof. If there is any single event $i$ with $p_i \ge 1/2$ then the Lemma is immediate. Otherwise, if $\sum p_i < 1/2$ then in the unions and sums below consider all events $i$, while if $\sum p_i \ge 1/2$ then in the unions and sums below consider a prefix of the events s.t. $1/2 \le \sum p_i \le 1$.
$$\Pr\Big(\bigcup X_i\Big) \ \ge\ \sum p_i - \sum_{i<j}\Pr(X_i \cap X_j) \qquad \text{(Bonferroni level 2)}$$

$$= \sum p_i - \sum_{i<j} p_i p_j \ \ge\ \sum p_i - \frac12\Big(\sum p_i\Big)^2 = \Big(\sum p_i\Big)\Big(1 - \frac12\sum p_i\Big) \ \ge\ \frac12\min\Big(\frac12, \sum p_i\Big),$$
where the last inequality uses that the relevant $\sum p_i$ lies in $(0, 1]$. 2

So we can run the algorithm using a pairwise independent space, with the bits having various biases $\frac{1}{2d_v+1}$.

Lemma 89. If $v$ is marked then the probability it is unmarked in step (2) is $\le 1/2$.

Proof. It is unmarked only if a weakly-higher-degree neighbor is marked. By pairwise independence, each of these events happens, conditional on $v$ being marked, with probability at most $\frac{1}{2d_v+1}$. Apply a union bound. 2

Corollary 90. For any good vertex v, the probability that it is removed in step 3 is at least 1/36.


Proof. If dv = 0, it is certainly removed. Otherwise, we rely just on neighbors of v being marked and remaining so after Step 2. Apply Lemma 87 and Lemma 89. 2

Now for our measure of progress. Lemma 91. At least half the edges in a graph (V, E) are good.

Proof. Sort the vertices in order of increasing degree, so that $d_u < d_v \Rightarrow u < v$ (ties arbitrarily). Direct each edge from the lower-indexed to the higher-indexed vertex; now we have in-degrees $d_v^{\mathrm{in}}$ and out-degrees $d_v^{\mathrm{out}}$, with $d_v^{\mathrm{in}} + d_v^{\mathrm{out}} = d_v$.

For two sets of vertices $V_1, V_2$ let $\vec E(V_1, V_2)$ be the edges directed from $V_1$ to $V_2$. So $\vec E(V_1, V_1)$ is the set of directed edges within $V_1$ and is in 1-1 correspondence with $E(V_1, V_1)$, the set of undirected edges within the set. In particular $E(V, V) = E$.

We use lower-case e,~e to indicate cardinality. So ~e(V1, V1) = e(V1, V1). Note that E(B, B) ∪ E(B, G) ∪ E(G, B) ∪ E(G, G) is a partition of E, and

E(B, B) is the set of bad edges. E(B, G) ∪ E(G, B) ∪ E(G, G) is the set of good edges.

From (4.17), every v ∈ B has at least twice as many outgoing edges as incoming edges, so

$$2\vec e(V, B) \le \vec e(B, V)$$
$$2(\vec e(B, B) + \vec e(G, B)) \le \vec e(V, V)$$
$$2\vec e(B, B) \le \vec e(V, V)$$
$$e(B, B) \le e(V, V)/2$$

2

Due to Corollary 90, each good edge is removed with probability at least 1/36. Since at least half the edges are good, the expected fraction of edges removed is at least 1/72. (Of course the edge-removals are correlated but that doesn't matter.) In the next section we analyze how long it takes this process to terminate.

First, a comment: the analysis above was not sensitive to the precise probability $\frac{1}{2d_v+1}$ with which vertices were marked. For instance, it would be fine if each vertex were marked with some probability $p_v$, $\frac{1}{4d_v} \le p_v \le \frac{1}{2d_v+1}$; the only effect would be to change the "1/72" to some other constant. We will indeed modify each $\frac{1}{2d_v+1}$ when we derandomize the algorithm, but only by a factor $1 + o_n(1)$.

4.5.1 Descent Processes

(This is not widespread terminology but things like this come up often. The coupon collector problem is another example.) In a descent process we start from a positive integer m, and in general the state of the process is a nonnegative integer n; the process terminates when n = 0. At n > 0, you sample a random variable X from a distribution pn on {0, . . . , n}, and transition to state n − X. The question is, how many iterations does it take you to hit 0? Let Tn be this random variable, when you start from state n. (So E(T0) = 0.) If pm(0) = 1, the process is uninteresting as then Tm = ∞ a.s. In fact if there is


any 0 < n ≤ m, reachable by descents from m, such that pn(0) = 1, then Tm = ∞ has positive probability; this too we exclude, and simply insist that pn(0) < 1 for all 0 < n ≤ m.

In this case all $E_{p_n}(X) > 0$. Now let $\theta_n$ ($n \ge 1$) be any nondecreasing sequence of positive reals such that $\theta_n \le E_{p_n}(X) = \sum_{x=0}^n x\,p_n(x)$. (In most cases, including the current application, the expectations, or at least our bounds on them, are monotone, and we can simply use those bounds to define $\theta_n$.)

Lemma 92. $E(T_n) \le \sum_{j=1}^n \frac{1}{\theta_j}$.

Proof. The lemma is trivial for n = 0 (the LHS is 0 and the RHS is an empty summation). For n > 0 proceed by induction.

n E(Tn) = 1 + pn(0)E(Tn) + ∑ pn(x)E(Tn−x) x=1 n (1 − pn(0))E(Tn) = 1 + ∑ pn(x)E(Tn−x) x=1 n n−x 1 ≤ 1 + ∑ pn(x) ∑ (induction) x=1 j=1 θj ! n n 1 n 1 = 1 + ∑ pn(x) ∑ − ∑ x=1 j=1 θj j=n−x+1 θj n 1 n n 1 = 1 + (1 − pn(0)) ∑ − ∑ pn(x) ∑ j=1 θj x=1 j=n−x+1 θj n 1 1 n ≤ 1 + (1 − pn(0)) ∑ − ∑ pn(x)x (θj nondecreasing) j=1 θj θn x=1 n 1 = ( − ( )) ( ≤ ( )) 1 pn 0 ∑ θn Epn X j=1 θj 2

As a consequence, the expected number of iterations until the algorithm terminates is ≤ ∑_{j=1}^{|E|} 72/j ∈ O(log |E|) ∈ O(log n). Each iteration alone takes time O(log n) to do the local marking and unmarking, and removing vertices and edges. This is an RNC² algorithm, using O(|E|) processors, for MIS. In Section 4.7.4 we'll see how we can derandomize this using a factor of O(n²) more processors, and thereby put MIS in NC².

Question: Here is a different parallel algorithm for MIS. At each vertex v choose uniformly a real number r_v in [0, 1]. Put a vertex in I if r_v > r_u for every neighbor u of v. Remove I and all its neighbors. Repeat. (We don't really need to pick random real numbers; we can just pick multiples of 1/n³, and we're unlikely to have any ties to deal with.) This process is a bit simpler than Luby's algorithm because there is no "unmarking." Question: If the r_v's are chosen independently, is it the case that the expected number of edges removed is a constant fraction of |E|? If so, is this still true if the r_v's are pairwise independent?


4.6 Lecture 26 (30/Nov): Limited linear independence, limited sta- tistical independence, error correcting codes.

4.6.1 Begin derandomization from small sample spaces

We discussed in an earlier lecture the notion of linear error-correcting codes. We worked over the base field GF(2), also known as Z/2. (Which is to say, we added bit-vectors using XOR.) Encoding of messages in such a code is simply multiplication of the message, as a vector v ∈ (GF(2))^m, by the generator matrix C of the code; the result, if C is m × n, is an n-bit codeword.

(message v) · (generator matrix C) = (codeword vC)

The set of codewords is exactly Rowspace(C). The minimum weight of a linear code is the least number of nonzero entries in a nonzero codeword. Over GF(2), this is the same as the least number of 1's in a nonzero codeword. If the minimum weight is k + 1 then the code ensures

1. Error detection up to k errors

2. Error correction up to bk/2c errors.

This property is not possessed by codes achieving near-optimal rate in Shannon's coding theorem. That theorem protects only against random noise, and if that is what you want, then the minimum weight property is too strong to allow optimally efficient codes. The minimum weight property protects against adversarial noise.
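As a concrete instance of these definitions, here is the [7, 4] Hamming code over GF(2) (the particular generator matrix below is our illustrative choice, not one fixed by the notes). Its minimum weight is 3, i.e., k + 1 with k = 2, so it detects up to 2 errors and corrects 1:

```python
from itertools import product

# Generator matrix C of the [7,4] Hamming code over GF(2): 4-bit messages,
# 7-bit codewords (message bits followed by three parity bits).
C = [[1, 0, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 1, 1]]

def encode(v):
    """The codeword vC over GF(2)."""
    return tuple(sum(vi * cij for vi, cij in zip(v, col)) % 2
                 for col in zip(*C))

# Minimum weight = least number of 1's in a nonzero codeword.
min_wt = min(sum(encode(v)) for v in product((0, 1), repeat=4) if any(v))
assert min_wt == 3
```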

4.6.2 Generator matrix and parity check matrix

Error detection can be performed with the aid of a parity check matrix M:

Left Nullspace(M) = Rowspace(C)

(generator matrix C) · (parity check matrix M) = 0

wM = 0 ⇐⇒ w ∈ Rowspace(C) ⇐⇒ w is a codeword

Every vector in Rowspace(C) has weight ≥ k + 1 ⇐⇒ every k rows of M are linearly independent


In coding-theory terms, this is an (n, m, k + 1) code over GF(2). (Unfortunately, coding theorists conventionally use the letters (n, k, d), but we have k + 1 reserved for the least weight, because we're following the conventional terminology from "k-wise independence".) For any fixed values of n and k, the code is most efficient when the message length m, which is the number of rows of C, is as large as possible; equivalently, when the number of columns of M, ℓ = n − m, is as small as possible. So we'll want to design a matrix M with few columns in which every k rows are linearly independent.

But first, let's see a connection between linear and statistical independence. Let B be a k × ℓ matrix over GF(2), with full row rank. (So k ≤ ℓ.) If x ∈_U (GF(2))^ℓ then y = Bx ∈_U (GF(2))^k, because the pre-image of any y is an affine subspace (a translate of the right nullspace of B), and these pre-images all have the same cardinality. (We already made this observation in the context of Freivalds' verification algorithm, Theorem 33.) Now, if we have a matrix M with n rows, of which every k are linearly independent, then every k bits of z = Mx are uniformly distributed in (GF(2))^k.

z = M x,  where M is n × ℓ, x ∈ (GF(2))^ℓ, and z ∈ (GF(2))^n.

We’ve exhibited dual applications of the parity check matrix:

• Action on row vectors: checking validity of a received word w as a codeword. (s = wM is called the “syndrome” of w; in the case of non-codewords, i.e., s 6= 0, one of the ways to decode is to maintain a table containing for every s, the least-weight vector η s.t. ηM = s. Then w − η is the closest codeword to w. This table-lookup method is only practical for very high rate codes, where there are not many possible vectors s.)

• Action on column vectors: converting few uniform iid bits into many k-wise independent uniform bits.
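Both actions can be exercised on a tiny example. Take M to be the parity check matrix of the [7, 4] Hamming code: its rows are the 7 nonzero vectors of (GF(2))^3, so every 2 rows are linearly independent, and 3 uniform seed bits stretch into 7 pairwise independent bits (a brute-force check of ours):

```python
from collections import Counter
from itertools import product

# Rows of M: the 7 nonzero vectors of (GF(2))^3.  Any 2 distinct nonzero
# vectors are linearly independent over GF(2), so k = 2 here.
M = [[(i >> s) & 1 for s in (2, 1, 0)] for i in range(1, 8)]

# The sample space: z = Mx for every seed x in (GF(2))^3.
samples = [tuple(sum(m * xi for m, xi in zip(row, x)) % 2 for row in M)
           for x in product((0, 1), repeat=3)]

# Every pair of the 7 bits takes each of the 4 patterns equally often.
for i in range(7):
    for j in range(i + 1, 7):
        counts = Counter((z[i], z[j]) for z in samples)
        assert all(counts[p] == 2 for p in product((0, 1), repeat=2))
```

Eight sample points suffice for 7 pairwise independent uniform bits, versus 2^7 points for fully independent ones.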

Now we can see an entire sample space on n bits that are uniform and k-wise-independent. At the right end we place the uniform distribution on all 2^ℓ vectors of the vector space (GF(2))^ℓ.

          0 0 . . . 1 1      Ω  =  M   . . . unif. dist. on cols          0 1 . . . 0 1

Ω is the uniform distribution on the columns on the LHS.


Maximizing the transmission rate m/n = (n − ℓ)/n of a binary, k-error-detecting code is equivalent to minimizing the size |Ω| = 2^ℓ of a linear k-wise independent binary sample space. So how big does |Ω| have to be?

Theorem 93.

1. For all n there is a sample space of size O(n^⌊k/2⌋) with n uniform k-wise independent bits. If instead of bits one wants rvs uniformly distributed on [2^m]: for all n there is a sample space of size O(2^{k·max{m, ⌈lg n⌉}}) with n k-wise independent rvs, each uniformly distributed on [2^m].

2. For all n, any sample space on n k-wise independent rvs, none of which is a.s. constant, has size Ω(n^⌊k/2⌋).

We show Part 1 in Sec. 4.7.2; Part 2 will be on the problem set.


4.7 Lecture 27 (2/Dec): Limited linear independence, limited sta- tistical independence, error correcting codes.

Returning to the subject of codes, there is a question worth asking even though we don’t need it for our current purpose:

4.7.1 Constructing C from M

Suppose we have constructed a parity check matrix M. How do we then get a generator matrix C? One should note that over a finite field, Gram-Schmidt does not work. Gram-Schmidt would have allowed us to produce new vectors which are both orthogonal to the columns of M and linearly independent of them. But this is generally not possible: the row space of C and the column space of M do not necessarily span the n-dimensional space. For example over GF(2) we may have

C = (1 1),   M = (1 1)^T.

(Here CM = 0, but the row space of C coincides with the column space of M, so together they span only a one-dimensional space.)

However, Gaussian elimination does work over finite fields, and that is what is essential. Specifically, given an n × a matrix M and a b × n matrix C, b < n − a, with C of full row rank (i.e., rank b), we show how to construct a vector c′ s.t. c′M = 0 and c′ is linearly independent of the rows of C. (Then adjoin c′ to C and repeat.)

Perform Gaussian elimination on the rows of C so that it is upper triangular, with a nonzero diagonal. That is, the allowed operations are adding rows to one another, and permuting columns. When permuting columns of C, permute rows of M to match. Obviously the row operations do not change the row space of C; and the column permutations do not affect the rank of either matrix, or the inner products of rows of C with columns of M. Schematically (Key: # means nonzero, ∗ arbitrary):

C = ( # ∗ ∗ ∗ ∗ ∗ )        M = ( ∗ ∗ ∗ )
    ( 0 # ∗ ∗ ∗ ∗ )            ( ∗ ∗ ∗ )
                               ( ∗ ∗ ∗ )
                               ( ∗ ∗ ∗ )
                               ( ∗ ∗ ∗ )
                               ( ∗ ∗ ∗ )

Now take the submatrix of M consisting of the a + 1 rows b + 1, . . . , b + a + 1. By Gaussian elimination on these rows we can find a linear dependence among them. Extending that dependence to the n-dimensional space, with 0 coefficients in the first b coordinates, yields a vector c′ s.t. c′M = 0 and s.t. the support of c′ is disjoint from the first b coordinates. Then c′ is linearly independent of the row space of C because the restriction of C to its first b columns is nonsingular, so any nontrivial linear combination of the rows of C has a nonzero value somewhere among its first b entries.
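Over GF(2), the whole adjoin-and-repeat loop can be collapsed into one elimination: row-reduce the augmented rows [M_i | e_i]; rows whose M-part vanishes record coefficient vectors c with cM = 0, and together they form a generator matrix. (A sketch of ours, not the procedure of the notes verbatim; the helper name is hypothetical.)

```python
def generator_from_parity_check(M, n):
    """Given the n rows of a parity check matrix M over GF(2), return a basis
    of the left nullspace, i.e., rows c with cM = 0: a generator matrix."""
    a = len(M[0]) if M else 0
    # Augment each row of M with the corresponding standard basis vector.
    rows = [list(M[i]) + [1 if j == i else 0 for j in range(n)]
            for i in range(n)]
    pivots = 0
    for col in range(a):
        for r in range(pivots, n):
            if rows[r][col]:
                rows[pivots], rows[r] = rows[r], rows[pivots]
                for r2 in range(n):
                    if r2 != pivots and rows[r2][col]:
                        rows[r2] = [x ^ y for x, y in zip(rows[r2], rows[pivots])]
                pivots += 1
                break
    # Rows whose M-part was eliminated to zero record left-nullspace vectors.
    return [r[a:] for r in rows[pivots:]]

# Example: the Hamming parity check matrix (7 rows, 3 columns) gives a
# 4-row generator matrix C with CM = 0.
M = [[(i >> s) & 1 for s in (2, 1, 0)] for i in range(1, 8)]
C = generator_from_parity_check(M, 7)
assert len(C) == 4
for c in C:
    assert all(sum(c[i] * M[i][j] for i in range(7)) % 2 == 0 for j in range(3))
```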

4.7.2 Proof of Thm (93) Part (1): Upper bound on the size of k-wise independent sample spaces

(We’ll do this carefully for producing binary rvs and only mention at the end what should be done for larger ranges.) This construction uses the finite fields whose cardinalities are powers of 2. These are called exten- sion fields of GF(2). If you are not familiar with this, just keep in mind that for each integer r ≥ 1 there is a (unique) field with 2r elements. We can add, subtract, multiply and divide these without


leaving the set; in particular, in the usual way of representing the elements of the field as bit strings of length r, addition is simply XOR addition.5 Specifically, we can think of the elements of GF(2^r) as the polynomials of degree ≤ r − 1 over GF(2), taken modulo some fixed irreducible polynomial p of degree r. That is, a field element c has the form c = c_{r−1}x^{r−1} + . . . + c_1 x + c_0 (mod p(x)), and our usual way of representing this element is through the mapping vec : GF(2^r) → (GF(2))^r given by vec(c) = (c_{r−1}, . . . , c_0). (I.e., the list of coefficients.) But all we really need today are three things: (a) Like GF(2), GF(2^r) is a field of characteristic 2, i.e., 2x = 0. (b) For matrices over GF(2^r) the usual concepts of linear independence and rank apply. (c) vec is injective, linear (namely vec(c) + vec(c′) = vec(c + c′)), and vec(1) = 0 . . . 01.

Now, round n up to the nearest n = 2^r − 1, and let a_1, . . . , a_n denote the nonzero elements of the field. Let M_1 be the following Vandermonde matrix over the field GF(2^r):

M_1 = ( 1  a_1  a_1^2  . . .  a_1^{k−1} )
      ( 1  a_2  a_2^2  . . .  a_2^{k−1} )
      ( .   .    .     . . .     .      )
      ( 1  a_n  a_n^2  . . .  a_n^{k−1} )

Exercise: Every k rows of M_1 are linearly independent over GF(2^r). (Form any such submatrix B, with rows indexed by distinct field elements b_1, . . . , b_k. Verify that Det(B) = ∏_{i<j} (b_j − b_i) ≠ 0.)

Now form M_2 by replacing each entry c of M_1 by the row vector vec(c) ∈ (GF(2))^r:

M_2 = ( vec(1)  vec(a_1)  . . .  vec(a_1^{k−1}) )
      ( vec(1)  vec(a_2)  . . .  vec(a_2^{k−1}) )
      (   .        .      . . .        .        )
      ( vec(1)  vec(a_n)  . . .  vec(a_n^{k−1}) )

(Illustrated with r = 3: vec(1) = 001, vec(a_2) = 010, vec(a_n) = 111, etc.)

Corollary: Every k rows of M_2 are linearly independent over GF(2). Actually it is possible to reduce the number of columns even further while retaining the corollary. First, we can drop the leading 0's in the first entry. Second, we can strike out all batches of columns generated by positive even powers.

 3  1 vec(a1) = 001 vec(a1) = 001 ......  1 vec(a ) = 010 vec(a3) = ......  M =  2 2  3  ......  3 1 vec(an) = 111 vec(an) = ...... r Lemma 94. Every set of rows that is linearly independent (over GF(2 )) in M1 is also linearly independent (over GF(2)) in M3. Hence every k rows of M3 are linearly independent.

Proof. Let a set of rows R be independent in M1; we show the same is true in M3. Since M3 is over GF(2), this is equivalent to saying that for every ∅ ⊂ S ⊆ R, the sum of the rows S in M3 is nonzero.

So we are to show that S independent in M1 has nonzero sum in M3. Independence in M1 implies in particular that the sum of the rows of S in M2 is nonzero.

If |S| is odd then the same sum in M3 has a nonzero first entry and we are done.

5See any introduction to Algebra, for instance Artin [8].


Otherwise, let t > 0 be the smallest value such that ∑_{i∈S} a_i^t ≠ 0; it is enough to show that t is odd. Suppose not, so t = 2t′. Then, since Characteristic(GF(2^r)) = 2,

∑_{i∈S} a_i^{2t′} = ( ∑_{i∈S} a_i^{t′} )^2

so ∑_{i∈S} a_i^{t′} ≠ 0, contradicting minimality of t. □

Finally for the binary construction, recalling that n = 2^r − 1, we have |Ω| = 2^{1 + r⌊k/2⌋} ∈ O(n^⌊k/2⌋).

Uniform on larger ranges: this is actually simpler because we're not achieving the factor-2 savings in the exponent. Let r, as in the statement, be r = max{m, ⌈lg n⌉}. Just form the matrix M_1.

Comment: If you want n k-wise independent bits with nonuniform marginals, then this construction doesn't work. The best general construction, due to Koller and Megiddo [62], is of size O(n^k).
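The whole pipeline M_1 → M_3 can be verified by brute force for k = 4, r = 4 (a check of ours; the modulus x^4 + x + 1 is one standard choice of irreducible polynomial): n = 15 rows of the form [1 | vec(a) | vec(a^3)], 1 + 2r = 9 columns, so |Ω| = 2^9 = 512, and every 4 of the 15 output bits should be exactly uniform.

```python
from collections import Counter
from itertools import combinations, product

def gf16_mul(a, b, poly=0b10011):
    """Multiply in GF(2^4) with modulus p(x) = x^4 + x + 1 (elements are
    4-bit integers, bit i = coefficient of x^i)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= poly
        b >>= 1
    return r

# Rows of M3: [1 | vec(a) | vec(a^3)] for the 15 nonzero field elements a.
rows = []
for a in range(1, 16):
    a3 = gf16_mul(a, gf16_mul(a, a))
    rows.append([1] + [(a >> s) & 1 for s in range(3, -1, -1)]
                    + [(a3 >> s) & 1 for s in range(3, -1, -1)])

# The sample space: z = M3 x over GF(2), for all 2^9 seeds x.
samples = [tuple(sum(r * xi for r, xi in zip(row, x)) % 2 for row in rows)
           for x in product((0, 1), repeat=9)]

# 4-wise independence: every 4 of the 15 bits take each of the 16 patterns
# equally often (512 / 16 = 32 times).
for idx in combinations(range(15), 4):
    counts = Counter(tuple(z[i] for i in idx) for z in samples)
    assert all(counts[p] == 32 for p in product((0, 1), repeat=4))
```

So 512 sample points provide 15 four-wise independent uniform bits, against 2^15 for full independence.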

Question: Suppose you want n pairwise independent bits Xi, for each of which Pr(Xi = 1) = p. Is there for all p a subquadratic size sample space?

Proof of Thm (93) Part (2): Lower bound on the size of k-wise independent sample spaces. See PS5.1.

4.7.3 Back to Gale-Berlekamp

We now see a deterministic polynomial-time algorithm for playing the Gale-Berlekamp game. As we demonstrated last time, it is enough to use a 4-wise independent sample space in order to achieve Ω(n^{3/2}) expected performance. The above construction gives us a 4-wise independent sample space of size O(n²). All we have to do is exhaustively list the points of the sample space until we find one with performance Ω(n^{3/2}).

4.7.4 Back to MIS

For MIS we need only pairwise independence, but we want the marking probabilities p_v to be more varied (approximately 1/(2d_v + 1)). This, however, is easy to achieve: use the matrix M_1, with k = 2, without modifying to M_2 and M_3. This generates for each v an element in the field GF(2^r); these elements are pairwise independent; and one can designate for each v a fraction of approximately 1/(2d_v + 1) of the field elements which cause v to be marked. The deterministic algorithm is therefore as described in Sec. 4.4.4, with a sample space of size O(n²).

Chapter 5

Special topic

5.1 Lecture 28 (4/Dec): Sampling factored numbers

As you are probably aware, we have the following contrast.

1. The “primality testing” problem:

Given an n-bit number m, is m prime?

is computable in polynomial time. Already in the 1970’s people put this problem in ZPP. Then, in 2004 it was even found that primality testing is in P [3].

2. The “factoring” problem:

Given an n-bit number m, give its prime factorization.

has no known sub-exponential-time algorithms1 (where by exponential time we mean ⋃_{c>0} 2^{n^c}), despite enormous efforts, driven largely by cryptographic applications [81]. The best known runtime (and even this is only a heuristic estimate, not a proven bound2) is around 2^{n^{1/3}}. (We are, of course, discussing complexity for classical computers. Quantum computers will be able to factor in poly-time if they are built [15, 87].)

If you want to slot factoring into a complexity class, you have to deal with the fact that it isn't a decision problem (every n has a factorization) but still, it is "essentially" in NP; what it really is, is a TFNP search problem. TFNP is the set of binary functions f(x, y) such that (a) |y| = |x|^c for some constant c, (b) f ∈ P, (c) ∀x ∃y : f(x, y) = 1. The corresponding search problem is, given x, to find a y s.t. f(x, y) = 1. If P=NP then every TFNP search problem is solvable in P, by self-reducibility. Specifically, with an NP oracle we can figure out whether there is a witness y that starts with y_1 = 0, then depending on that answer find out what's an acceptable second bit of y, etc. The converse is not known to hold, i.e., it could be that TFNP search problems are all solvable in P, yet P≠NP. The distinction is that in TFNP we have a "promise problem," namely we are assured of the existence of a witness.

1 Sometimes people reserve this term for the more restrictive ⋃_c 2^{cn}.
2 Because the runtime estimate relies on a conjecture that numbers generated by the algorithm are about as likely to be smooth as a random number of the same size would be. A number is b-smooth if all its prime factors are bounded by b.


Now, in view of this contrast between primality testing and factoring, the following problem is interesting:

Given an n-bit number m, sample a uniform 1 ≤ r ≤ m along with its factorization.

The naïve approach is to pick r at random and then factor it, but in view of the contrast given above, that won't work. We can make procedure calls to a primality testing algorithm, so we can discover when r is not prime, but then we don't know how to go ahead and factor it. By the way, you might object that you'll just focus your sampling on numbers that have only small factors, and that will be "good enough." Specifically, define:

Definition 95. A number is b-smooth if all its prime factors are bounded by b.

If r is (log r)-smooth then the Sieve of Eratosthenes is good enough to factor it in polylog r time. If most numbers were (log r)-smooth, then you might have suggested settling for sampling just from such r's as a reasonable substitute for the original goal. However, there are two reasons this objection doesn't hold up. The first is merely that it isn't what we asked for: we want to sample exactly from the distribution, not from a distribution that is close to it. The second, more substantial response is that even if you say you're willing to settle for the lesser goal, most numbers simply aren't smooth enough. About 69% of numbers r (a fraction ln 2 ≈ 0.693; see the footnote) have a prime factor larger than √r, i.e., are not even √r-smooth, let alone polylogarithmically smooth. So, about 69% of numbers r have a prime factor whose number of digits is at least half of the number of digits of r itself. Going further down, over 99% of numbers have a prime factor with at least one quarter as many digits.3

So, we must do something clever. The first to solve this was Eric Bach [9]. We'll show an elegant (albeit slightly slower) method of Adam Kalai [56]. In what follows we assume m > 1. Algorithm:

(0) Set s0 = m.

(1) For i = 1, 2, . . ., iteratively pick s_i ∈_U [s_{i−1}], stopping when the value 1 is reached; say s_L > 1 and s_{L+1} = 1.

Test each s_1, . . . , s_L for primality. Compute ε_i = ⟦s_i is prime⟧.

(2) Let r = ∏_{i=1}^{L} s_i^{ε_i}.

(3) If r ≤ m, then with probability r/m output r. Else restart at line (1).

Clearly the strategy here is to start with the factors of r, rather than starting with r and looking for its factors. The question is, why does this procedure give us a uniform distribution on r, and why does it terminate in expected polynomial time? (Ok that's two questions.) Obviously for the latter point we just need to ensure that the expected number of calls to a primality-testing algorithm is polynomial.

Let q(r) be the probability we generate candidate r in line (2). In view of the r/m "rejection sampling" in line (3), in order to show a uniform distribution on the output, we need to show that for r ≤ m, q(r) is proportional to 1/r.
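The algorithm transcribes almost directly into code (a sketch of ours: trial division stands in for a real primality test, which is fine for small m, and names like `kalai_sample` are hypothetical):

```python
import random

def is_prime(k):
    """Trial division; adequate for the small demo values used here."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def kalai_sample(m, rng=random):
    """Return a uniform r in [1, m] together with its prime factorization
    (with multiplicity), by Kalai's algorithm."""
    while True:
        seq, s = [], m
        while s > 1:                            # (1) s_i uniform in [s_{i-1}]
            s = rng.randint(1, s)
            if s > 1:
                seq.append(s)
        primes = [s for s in seq if is_prime(s)]
        r = 1
        for p in primes:                        # (2) r = product of the prime s_i
            r *= p
        if r <= m and rng.randint(1, m) <= r:   # (3) accept with probability r/m
            return r, primes
```

Note that repeated prime values among the s_i are what produce prime powers in r.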

For 1 < k ≤ m write H(k) = |{j : sj = k}|. Let 1 < p1 < ... < pv ≤ m be all the primes ≤ m.

3 Let Ψ(m, B) := fraction of numbers r ≤ m that are B-smooth. The numbers above come from the following two theorems (the second subsuming the first) (see [30], [78], [73], and [96] for plots): (a) For fixed 1 ≤ u ≤ 2 and m → ∞, we have that Ψ(m, m^{1/u}) ∼ 1 − log u. (b) For u ≥ 2 this function still has a limit but it's a little more complicated. Theorem (Dickman): For all u ≥ 1, Ψ(m, m^{1/u}) ∼ ρ(u). The notation ∼ allows for a multiplicative factor 1 + o(1). Here ρ(u) is defined as the solution of the delay differential equation uρ′(u) = −ρ(u − 1) on u ≥ 1, with boundary condition ρ(u) = 1 for u ∈ [0, 1]. For 1 ≤ u ≤ 2, ρ(u) = 1 − log u; and it is known that ρ(u) is positive for all u ≥ 0, and tends to 0 roughly as u^{−u}.
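These densities are easy to see empirically (our own check, via a largest-prime-factor sieve; the cutoff N = 10^5 is arbitrary): the fraction of r ≤ N with a prime factor exceeding √r comes out near ln 2 ≈ 0.693.

```python
N = 100_000
lpf = list(range(N + 1))     # lpf[r] will end up the largest prime factor of r
for p in range(2, N + 1):
    if lpf[p] == p:          # p is prime; it overwrites lpf at its multiples,
        for mult in range(2 * p, N + 1, p):
            lpf[mult] = p    # and since primes are visited in increasing
                             # order, the largest prime factor is written last
frac = sum(1 for r in range(2, N + 1) if lpf[r] ** 2 > r) / (N - 1)
assert 0.6 < frac < 0.8      # near ln 2, up to finite-size effects
```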


Lemma 96. Let α_1, . . . , α_v be any nonnegative integers. Then

Pr(H(p_1) = α_1 ∧ . . . ∧ H(p_v) = α_v) = ∏_{i=1}^{v} (1 − 1/p_i) p_i^{−α_i} = M(m) ∏_{i=1}^{v} p_i^{−α_i}

where M(m) = ∏_{i=1}^{v} (1 − 1/p_i) is called Mertens' function.

Proof. Here is an equivalent description of the sampling process. At each 1 < k ≤ m, pick a geometrically distributed rv h(k), with parameter 1/k. That means

Pr(h(k) = 0) = 1 − 1/k,  Pr(h(k) = 1) = (1/k)(1 − 1/k),  Pr(h(k) = 2) = (1/k²)(1 − 1/k),  etc.

h(k) represents the number of times that the descent process hits site k; note that the geometric distribution has the "memoryless" property that ∀b, Pr(h(k) ≥ b + 1 | h(k) ≥ b) = 1/k. Set

s_i = max{k : ∑_{ℓ=k}^{m} h(ℓ) ≥ i}

Observe that conditional on s_{i−1} = k,

Pr(s_i = k′) = ((k−1)/k) · ((k−2)/(k−1)) · · · (k′/(k′+1)) · (1/k′) = 1/k

namely, s_i is uniform between 1 and k, which is why this distribution of the s_i's is the same as that described in the algorithm. From this we conclude that the sequence H(2), . . . , H(m) is equidistributed with the sequence h(2), . . . , h(m). Really, the algorithm's process of generating the H(k)'s was simply a (much) more efficient process, made possible because most of them are 0. The lemma follows because the h(k)'s for various k are independent geometric rvs with the stated parameters. □

Corollary 97. Let r be an m-smooth number, r = ∏_{1}^{v} p_i^{α_i}. Then q(r) = M(m) ∏_{i=1}^{v} p_i^{−α_i} = M(m)/r.

This already verifies correctness of the process, since q(r) is proportional to 1/r. Now we deal with the runtime. First, let's bound how many primality tests we run per "restart" of the algorithm; we can upper bound this by the number of jumps performed by the algorithm until it terminates at 1, namely, ∑_{2}^{m} H(k). We could work this out exactly, but it's simpler to apply the general tool we developed for Descent Processes in Lemma 92, which shows that E(∑_{2}^{m} H(k)) ≤ ∑_{2}^{m} 2/k ∈ O(log m).

Next we bound the expectation of the number of rounds of the algorithm. The number of rounds is one plus the number of "restarts"; the number of restarts is a geometric rv. The probability that any particular round is a success, i.e., is the last round, is

∑_{1≤r≤m} (r/m) q(r) = ∑_{1≤r≤m} (r/m)(M(m)/r) = M(m).    (5.1)

By Mertens' "third theorem" [71, 95] this limits to e^{−γ}/log m, for the Euler–Mascheroni constant γ ≈ 0.577. Thus the expected number of rounds is O(log m). Combining these two bounds, we have that the expected total number of primality tests is O(log² m).

Appendix A

Material omitted in lecture

A.1 Paley-Zygmund in-probability bound, applied to the 4-wise indep. Khintchine-Kahane

Paley-Zygmund is usually stated as an alternative (to Cor. 20) lower-tail bound for nonnegative rvs; i.e., it gives a way to say that a nonnegative rv A is “often large.”

Let µ_i be the ith moment of A. Knowing only the first moment µ_1 of A is not enough, because for any value, even infinite, of the first moment, we can arrange, for any δ > 0, a nonnegative rv A which equals 0 with probability 1 − δ, yet has first moment µ_1. We just have to move δ of the probability mass out to the point µ_1/δ, or, in the infinite-µ_1 case, spread δ probability mass out in a measure whose first moment diverges.

However, a finite second moment µ2 is enough to justify such a “usually large” statement. Actually Paley-Zygmund holds for rvs which are not necessarily nonnegative.

Lemma 98 (Paley-Zygmund). Let A be a real rv with positive µ_1 and finite µ_2. For any 0 < λ ≤ 1,

Pr(A > (1 − λ)µ_1) > λ²µ_1²/µ_2.

Proof. Let ν be the distribution of A. Let p = Pr(A > (1 − λ)µ_1). (This is what we want to lower bound.) Decompose µ_1 = ∫_{[−∞,(1−λ)µ_1]} x dν(x) + ∫_{((1−λ)µ_1,∞]} x dν(x). Now examine each of these terms.

∫_{[−∞,(1−λ)µ_1]} x dν(x) ≤ (1 − p)(1 − λ)µ_1 ≤ (1 − λ)µ_1    (A.1)

Apply Cauchy-Schwarz to the functions x and the indicator ⟦x > (1 − λ)µ_1⟧. These are not effectively proportional to each other w.r.t. ν (unless ν is supported on a single point, in which case µ_1² = µ_2 and the Lemma1

Lemma 99 (Cauchy-Schwarz). If functions f, g are square-integrable w.r.t. measure ν then ∫ f(x)g(x) dν(x) ≤ √( ∫ f²(x) dν(x) · ∫ g²(x) dν(x) ).

Proof. Squaring and subtracting sides, it suffices to show: 0 ≤ ∫∫ f²(x)g²(y) dν(x)dν(y) − ∫∫ f(x)g(x)f(y)g(y) dν(x)dν(y). This is equivalent (by swapping the dummy variables) to showing 0 ≤ ∫∫ (f²(x)g²(y) + f²(y)g²(x) − 2f(x)g(x)f(y)g(y)) dν(x)dν(y) = ∫∫ (f(x)g(y) − f(y)g(x))² dν(x)dν(y), which is an integral of squares. □

Say that f and g are effectively proportional to each other w.r.t. ν if this last integral is 0; this is the condition for equality in Cauchy-Schwarz.


is immediate), so we get a strict inequality,

∫_{((1−λ)µ_1,∞]} x dν(x) < p^{1/2} ( ∫_{−∞}^{∞} x² dν(x) )^{1/2} = p^{1/2} µ_2^{1/2}    (A.2)

Putting (A.1), (A.2) together, µ_1 < (1 − λ)µ_1 + p^{1/2} µ_2^{1/2}, so

λµ_1 < p^{1/2} µ_2^{1/2}, as desired. (There's not normally much to be gained by preserving the "1 − p" factor in (A.1), but it's at least another reason for writing strict inequality in the Lemma.) □

Comment: Taking λ = 1, this gives Pr(A ≤ 0) ≤ (µ_2 − µ_1²)/µ_2, which improves on the upper bound (µ_2 − µ_1²)/µ_1² of Cor. 20. It should be said though that PZ does not dominate Cor. 20 in all ranges (e.g., if the variance µ_2 − µ_1² is very small compared to µ_1², and λ is small).

Returning to Gale-Berlekamp: Lemma 84 is not directly usable for our purpose, i.e., we cannot plug in the rv A = |X|, because all it will tell us is that µ_1 ≥ (1 − λ)λ²µ_1³/µ_2, i.e., µ_2 ≥ (1 − λ)λ²µ_1², which follows already from Cauchy-Schwarz (with the better constant 1). Note, this shows how Paley-Zygmund serves as a more flexible, albeit slightly weaker, version of Cauchy-Schwarz. Instead, we set B = |X| and A = B², and then apply Paley-Zygmund to A. This is not a technicality. It means that we are relying on 4-wise independence, not just 2-wise independence, of the X_i's. And indeed, Exercise: There are, for arbitrarily large n, collections of n pairwise independent X_i's, uniform in ±1, s.t. Pr(|X| = 0) = 1 − 1/n, Pr(|X| = n) = 1/n.
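Paley-Zygmund can be checked exactly on any small discrete rv (the three-point distribution below is an arbitrary illustration of ours):

```python
# A takes value x with probability w.
dist = [(0.0, 0.2), (1.0, 0.5), (4.0, 0.3)]
mu1 = sum(x * w for x, w in dist)        # first moment, = 1.7
mu2 = sum(x * x * w for x, w in dist)    # second moment, = 5.3

for lam in (0.25, 0.5, 0.75, 1.0):
    lhs = sum(w for x, w in dist if x > (1 - lam) * mu1)  # Pr(A > (1-lam) mu1)
    assert lhs > lam ** 2 * mu1 ** 2 / mu2                # Lemma 98
```

The inequality is strict here because A is not concentrated on a single point.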

Corollary 100. Let B be a nonnegative rv with Pr(B = 0) < 1 and fourth moment µ_4(B) < ∞. Then E(B) ≥ (16/(25√5)) µ_2^{5/2}/µ_4.

Proof. For any θ, E(B) ≥ θ Pr(B ≥ θ) = θ Pr(B² ≥ θ²), so, applying Lemma 84 to A = B², with θ = √(µ_2(B)/5) and λ = 4/5,

E(B) ≥ √(µ_2(B)/5) · Pr(B² ≥ µ_2(B)/5) ≥ ((4/5)² / √5) · µ_2(B)^{5/2}/µ_4(B). □

Lemma 80 is stronger than Cor. 100 for two reasons: the constant, and perhaps more importantly, because µ_2/µ_4^{1/2} ≤ 1 (power mean inequality). Of course, Lemma 80 does not give an in-probability bound, so it is incomparable with Lemma 84.

Bibliography

[1] I. Abraham, Y. Bartal, and O. Neiman. Advances in metric embedding theory. In Proc. 38th STOC, 2006.

[2] M. Adams and V. Guillemin. Measure Theory and Probability. Birkhäuser, 1996.

[3] M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Ann. of Math., 160:781–793, 2004. doi:10.4007/annals.2004.160.781.

[4] R. Ahlswede and D. E. Daykin. An inequality for the weights of two families of sets, their unions and intersections. Z. Wahrscheinlichkeitstheorie verw. Geb., 43:183–185, 1978.

[5] M. Aizenman, H. Kesten, and C. Newman. Uniqueness of the infinite cluster and continuity of connectivity functions for short and long range percolation. Comm. Math. Phys., 111:505–531, 1987.

[6] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. J. Algorithms, 7:567–583, 1986.

[7] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 3rd edition, 2008.

[8] M. Artin. Algebra. Prentice-Hall, 1991.

[9] E. Bach. How to generate factored random numbers. SIAM J. Comput., 17:179–193, 1988.

[10] N. Bansal and J. Spencer. Deterministic discrepancy minimization. Algorithmica, 67:451–471, 2013. doi:10.1007/s00453-012-9728-1.

[11] I. Benjamini and O. Schramm. Percolation beyond Z^d, many questions and a few answers. Electron. Commun. Probab., 1:71–82, 1996. doi:10.1214/ECP.v1-978.

[12] B. Berger. The fourth moment method. SIAM J. Comput., 26(4):1188–1207, 1997.

[13] S. Berkowitz. On computing the determinant in small parallel time using a small number of processors. Information Processing Letters, 18:147–150, 1984.

[14] Bernstein inequality. In Encyclopedia of Mathematics. Springer and Europ. Math. Soc. URL: http://www.encyclopediaofmath.org.

[15] E. Bernstein and U. Vazirani. Quantum complexity theory. SIAM J. Comput., 26(5):1411–1473, 1997. (STOC 1993).

[16] S. N. Bernstein. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math. 1, 1924.

[17] S. N. Bernstein. On certain modifications of Chebyshev's inequality. Doklady Akademii Nauk SSSR, 17(6):275–277, 1937.


[18] P. Billingsley. Probability and Measure. Wiley, third edition, 1995.

[19] B. Bollobás. Combinatorics. Cambridge U. Press, 1986.

[20] B. Bollobás and A. G. Thomason. Threshold functions. Combinatorica, 7(1):35–38, 1987. doi:10.1007/BF02579198.

[21] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52:46–52, 1985.

[22] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, 2001.

[23] A. L. Chistov. Fast parallel calculation of the rank of matrices over a field of arbitrary charac- teristic. In Proc. Conf. Foundations of Computation Theory, pages 63–69. Springer-Verlag, 1985.

[24] D. Conlon. A new upper bound for diagonal Ramsey numbers. Ann. of Math., 170:941–960, 2009.

[25] S. A. Cook. A taxonomy of problems with fast parallel algorithms. Information and Control, 64:2–22, 1985.

[26] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

[27] L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Computing, 5:618–623, 1976.

[28] D. E. Daykin and L. Lovasz. The number of values of Boolean functions. J. London Math. Soc., 2(12):225–230, 1976.

[29] R. A. DeMillo and R. J. Lipton. A probabilistic remark on algebraic program testing. Informa- tion Processing Letters, 7(4):193 – 195, 1978. URL: http://www.sciencedirect.com/science/ article/pii/0020019078900674.

[30] K. Dickman. On the frequency of numbers containing prime factors of a certain relative mag- nitude. Ark. Mat. Astr. Fys., 22:1–14, 1930.

[31] L. Engebretsen, P. Indyk, and R. O’Donnell. Derandomized dimensionality reduction with applications. In SODA, 2002.

[32] P. Erdős. Some remarks on the theory of graphs. Bull. Amer. Math. Soc., 53:292–294, 1947.

[33] P. Erdős. Graph theory and probability. Canad. J. Math., 11:34–38, 1959.

[34] P. Erdős and G. Szekeres. A combinatorial problem in geometry. Compositio Math., 2:463–470, 1935.

[35] Y. Filmus. Khintchine-Kahane using Fourier Analysis, 2011. URL: https://yuvalfilmus.cs.technion.ac.il/Manuscripts/KK.pdf.

[36] P. C. Fishburn and N. J. A. Sloane. The solution to Berlekamp's switching game. Discrete Mathematics, 74:263–290, 1989.

[37] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre. Correlation inequalities on some partially ordered sets. Commun. Math. Phys., 22:89–103, 1971.

[38] M. Fréchet. Sur quelques points du calcul fonctionnel. Rend. Circ. Matem. Palermo, 22:1–72, 1906. doi:10.1007/BF03018603.


[39] R. Freivalds. Probabilistic machines can use less running time. In IFIP Congress, pages 839–842, 1977.

[40] E. Friedgut and G. Kalai. Every monotone graph property has a sharp threshold. Proc. Amer. Math. Soc., 124:2993–3002, 1996.

[41] H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for general graph-matching problems. J. ACM, 38(4):815–853, 1991.

[42] F. Le Gall. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 296–303, 2014. arXiv:1401.7714.

[43] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. ACM, 28:289–304, 1981.

[44] G. H. Gonnet. Determining equivalence of expressions in random polynomial time. In Proc. 16th STOC, pages 334–341. ACM, 1984. URL: http://doi.acm.org/10.1145/800057.808698.

[45] R. L. Graham and V. Rödl. Numbers in Ramsey theory. In Surveys in Combinatorics, London Math. Soc. Lecture Note Ser. Vol. 123, pages 111–153. Cambridge University Press, 1987.

[46] R. L. Graham, B. L. Rothschild, and J. H. Spencer. Ramsey Theory. Wiley, 2nd edition, 1990.

[47] T. E. Harris. Lower bound for the critical probability in a certain percolation process. Math. Proc. Cambridge Philos. Soc., 56:13–20, 1960.

[48] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001.

[49] M. Heydenreich and R. van der Hofstad. Progress in high-dimensional percolation and random graphs. Springer, 2017.

[50] W. E. Hickson. Try, try. . . . In Oxford Dictionary of Quotations, page 251. Oxford University Press, 3rd edition, 1979.

[51] R. Holley. Remarks on the FKG inequalities. Communications in Mathematical Physics, 36:227–231, 1974. URL: http://dx.doi.org/10.1007/BF01645980.

[52] P. Indyk and J. Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of Discrete and Computational Geometry, pages 177–196. CRC Press, 2004.

[53] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189–206, 1984.

[54] V. Kabanets and R. Impagliazzo. Derandomizing polynomial identity tests means proving circuit lower bounds. Comput. Complex., 13:1–46, 2004.

[55] J.-P. Kahane. Sur les sommes vectorielles ∑ ±un. C. R. Acad. Sci. Paris, 259:2577–2580, 1964.

[56] A. Kalai. Generating random factored numbers, easily. J. Cryptology, 16:287–289, 2003.

[57] G. Kalai and L. J. Schulman. Quasi-random multilinear polynomials. Isr. J. Math., 230(1):195– 211, 2019. (arXiv:1804.04828).

[58] R. M. Karp, E. Upfal, and A. Wigderson. Constructing a Maximum Matching is in Random NC. Combinatorica, 6(1):35–48, 1986.

[59] R. M. Karp and A. Wigderson. A fast parallel algorithm for the maximal independent set problem. In Proc. 16th ACM STOC, pages 266–272, 1984.

[60] A. Khintchine. Über dyadische Brüche. Math. Z., 18:109–116, 1923.

[61] D. J. Kleitman. Families of non-disjoint subsets. J. Combin. Theory, 1:153–155, 1966.

[62] D. Koller and N. Megiddo. Constructing small sample spaces satisfying given constraints. SIAM J. Discret. Math., 7:260–274, 1994. Previously in Proc. 25th STOC, pp. 268–277, 1993. URL: http://portal.acm.org/citation.cfm?id=178422.178455.

[63] D. C. Kozen. The design and analysis of algorithms. Springer, 1992.

[64] R. Latała and K. Oleszkiewicz. On the best constant in the Khinchin-Kahane inequality. Studia Mathematica, 109(1):101–104, 1994. URL: http://eudml.org/doc/216056.

[65] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.

[66] R. Lidl and H. Niederreiter. Finite Fields. Cambridge U. Press, 2nd edition, 1997. (Theorem 6.13).

[67] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.

[68] M. Luby. A simple parallel algorithm for the maximal independent set problem. In Proc. 17th ACM STOC, pages 1–10, 1985.

[69] R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, 2017. URL: http://pages.iu.edu/~rdlyons/.

[70] J. Matoušek. Geometric Discrepancy: An Illustrated Guide. Springer, 2010.

[71] F. Mertens. Ein Beitrag zur analytischen Zahlentheorie. Journal für die reine und angewandte Mathematik, 78:46–62, 1874. URL: http://resolver.sub.uni-goettingen.de/purl?PPN243919689_0078.

[72] S. Micali and V. V. Vazirani. An O(√|V| · |E|) algorithm for finding maximum matching in general graphs. In Proc. 21st FOCS, pages 17–27. IEEE, 1980.

[73] H. L. Montgomery and R. C. Vaughan. Multiplicative Number Theory I. Classical Theory. Cambridge Studies in Advanced Mathematics, Vol. 97. Cambridge U. Press, 2006. Theorem 7.2.

[74] K. Mulmuley, U. V. Vazirani, and V. V. Vazirani. Matching is as easy as matrix inversion. Combinatorica, 7:105–113, 1987.

[75] D. Mumford. The dawning of the age of stochasticity. In V. Arnold, M. Atiyah, P. Lax, and B. Mazur, editors, Mathematics: Frontiers and Perspectives. AMS, 2000.

[76] C. M. Newman and L. S. Schulman. Infinite clusters in percolation models. Journal of Statistical Physics, 26(3):613–628, 1981. doi:10.1007/BF01011437.

[77] V. Pan. Fast and efficient parallel algorithms for the exact inversion of integer matrices. In S. N. Maheshwari, editor, Foundations of Software Technology and Theoretical Computer Science, volume 206 of Lecture Notes in Computer Science, pages 504–521. Springer, 1985. doi:10.1007/3-540-16042-6_29.

[78] C. Pomerance. Smooth numbers and the quadratic sieve. In J. P. Buhler and P. Stevenhagen, editors, Algorithmic number theory, pages 69–81. Cambridge U. Press, 2008.

[79] S. Rajagopalan and L. J. Schulman. Verification of identities. SIAM J. Comput., 29(4):1155–1163, 2000.

[80] F. P. Ramsey. On a problem of formal logic. Proc. London Math. Soc., 48:264–286, 1930.

[81] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Comm. ACM, 21:120–126, 1978.

[82] A. Sah. Diagonal Ramsey via effective quasirandomness, 2020. arXiv:2005.09251.

[83] I. N. Sanov. On the probability of large deviations of random variables. Mat. Sbornik, 42:11–44, 1957.

[84] J. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM, 27:701–717, 1980.

[85] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423; 623–656, 1948.

[86] L. A. Shepp. The XYZ-conjecture and the FKG-inequality. Ann. Probab., 10(3):824–827, 1982.

[87] P. W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Computing, 26:1484–1509, 1997. (FOCS 1994).

[88] D. Sivakumar. Algorithmic derandomization via complexity theory. In STOC, 2002.

[89] J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc., 289:679–706, 1985. doi:10.1090/S0002-9947-1985-0784009-0.

[90] N. Ta-Shma. A simple proof of the isolation lemma. ECCC TR15-080, 2015. URL: http://eccc.hpi-web.de/report/2015/080/.

[91] A. Thomason. An upper bound for some Ramsey numbers. J. Graph Theory, 12, 1988.

[92] W. T. Tutte. The factorization of linear graphs. J. London Math. Soc., s1-22(2):107–111, 1947. URL: http://jlms.oxfordjournals.org/content/s1-22/2/107.short.

[93] L. Valiant, S. Skyum, S. Berkowitz, and C. Rackoff. Fast parallel computation of polynomials using few processors. SIAM J. Comput., 12(4):641–644, 1983.

[94] Wikipedia. Folded normal distribution. [Online; accessed 6-November-2016]. URL: https://en.wikipedia.org/w/index.php?title=Folded_normal_distribution&oldid=748178170.

[95] Wikipedia contributors. Mertens’ theorems, 2018. [Online; accessed 1-August-2019]. URL: https://en.wikipedia.org/w/index.php?title=Mertens%27_theorems&oldid=825228172.

[96] Wikipedia contributors. Dickman function, 2019. [Online; accessed 2-October-2019]. URL: https://en.wikipedia.org/w/index.php?title=Dickman_function&oldid=901888243.

[97] R. Zippel. Probabilistic algorithms for sparse polynomials. In E. W. Ng, editor, Symbolic and Algebraic Computation, volume 72 of Lecture Notes in Computer Science, pages 216–226. Springer, 1979. doi:10.1007/3-540-09519-5_73.