Analysis of Boolean Functions (CMU 18-859S, Spring 2007)
Lecture 10: Learning DNF, AC0, Juntas
Feb 15, 2007
Lecturer: Ryan O’Donnell
Scribe: Elaine Shi

1 Learning DNF in Almost Polynomial Time

From previous lectures, we have learned that if a function f is ε-concentrated on some collection S, then we can learn the function using membership queries in time poly(|S|, 1/ε) · poly(n) · log(1/δ). In the last lecture, we showed that a DNF of width w is ε-concentrated on a set of size n^{O(w/ε)}, and concluded that width-w DNFs are learnable in time n^{O(w/ε)}.

Today, we shall improve this bound by showing that a DNF of width w is ε-concentrated on a collection of size w^{O(w log(1/ε))}. We shall hence conclude that poly(n)-size DNFs are learnable in almost polynomial time.

Recall that in the last lecture we introduced Håstad’s Switching Lemma, and we showed that DNFs of width w are ε-concentrated on degrees up to O(w log(1/ε)).
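As an aside (not part of the original notes), the learning algorithm behind this framework, often called the "low-degree algorithm", is simple to state in code. The notes use membership queries to identify the collection; once the collection S is known explicitly, random labeled examples suffice to estimate the coefficients. The following is a minimal sketch, with illustrative function names:

    def chi(S, x):
        # The character chi_S(x) = product over i in S of x_i, for x in {-1,1}^n.
        value = 1
        for i in S:
            value *= x[i]
        return value

    def low_degree_learn(examples, collection):
        # Empirically estimate hat{f}(S) for each S in `collection` from labeled
        # examples (x, f(x)), then output the hypothesis
        # h(x) = sign( sum over S of estimate_S * chi_S(x) ).
        estimates = {
            S: sum(y * chi(S, x) for x, y in examples) / len(examples)
            for S in collection
        }
        def h(x):
            total = sum(c * chi(S, x) for S, c in estimates.items())
            return 1 if total >= 0 else -1
        return h

    # For concentration on low degrees one would take, e.g.,
    #   collection = [S for k in range(d + 1)
    #                   for S in itertools.combinations(range(n), k)]

If f is ε-concentrated on the collection and each estimate is accurate enough, a standard argument shows the hypothesis h disagrees with f with probability only O(ε).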

Theorem 1.1 (Håstad’s Switching Lemma) Let f be computable by a width-w DNF. If (I, X) is a random restriction with *-probability ρ, then for all d ∈ ℕ,

    Pr_{I,X}[DT-depth(f_{X→I}) > d] ≤ (5ρw)^d

Theorem 1.2 If f is a width-w DNF, then

    ∑_{|U| ≥ O(w log(1/ε))} f̂(U)² ≤ ε

To show that a DNF of width w is ε-concentrated on a collection of size w^{O(w log(1/ε))}, we also need the following theorem:

Theorem 1.3 If f is a width-w DNF, then

    ∑_{U ⊆ [n]} |f̂(U)| · (1/(20w))^{|U|} ≤ 2

Proof: Let (I, X) be a random restriction with *-probability 1/(20w). After this restriction, the DNF becomes an O(1)-depth decision tree with high probability. Due to Håstad’s Switching Lemma, we have the following:

    E_{I,X}[‖f_{X→I}‖₁] = ∑_{d=0}^{n} Pr_{I,X}[DT-depth(f_{X→I}) = d] · E_{I,X}[‖f_{X→I}‖₁ | DT-depth(f_{X→I}) = d]
                       ≤ ∑_{d=0}^{n} (5 · (1/(20w)) · w)^d · 2^d    (Håstad’s Switching Lemma; a DT of size s has Fourier 1-norm ≤ s, and a depth-d DT has size ≤ 2^d)
                       = ∑_{d=0}^{n} (1/2)^d ≤ 2.

On the other hand, we have

    E_{I,X}[‖f_{X→I}‖₁] = E_I E_X [ ∑_{S⊆I} |f̂_{X→I}(S)| ]
                       = E_I [ ∑_{S⊆I} E_X | E_{y∈{-1,1}^I}[f_{X→I}(y) · y_S] | ]
                       ≥ E_I [ ∑_{S⊆I} | E_X E_y[f(y, X) · y_S] | ] = E_I [ ∑_{S⊆I} |f̂(S)| ]
                       = ∑_{U⊆[n]} |f̂(U)| · Pr_I[U ⊆ I] = ∑_{U⊆[n]} |f̂(U)| · (1/(20w))^{|U|}.

Combining the two bounds proves the theorem.  □

Corollary 1.4 If f is a width-w DNF, then

    ∑_{|U| ≤ O(w log(1/ε))} |f̂(U)| ≤ w^{O(w log(1/ε))}

Proof:

    2 ≥ ∑_{U⊆[n]} |f̂(U)| · (1/(20w))^{|U|} ≥ ∑_{|U| ≤ O(w log(1/ε))} |f̂(U)| · (1/(20w))^{|U|}
      ≥ (1/(20w))^{O(w log(1/ε))} · ∑_{|U| ≤ O(w log(1/ε))} |f̂(U)|,

so ∑_{|U| ≤ O(w log(1/ε))} |f̂(U)| ≤ 2 · (20w)^{O(w log(1/ε))} = w^{O(w log(1/ε))}.  □

Corollary 1.5 If f is a width-w DNF, it is ε-concentrated on a collection of size w^{O(w log(1/ε))}.

Proof: Define S = { U : |U| ≤ O(w log(1/ε)) and |f̂(U)| ≥ ε / w^{O(w log(1/ε))} }. By Parseval, we get that |S| ≤ w^{O(w log(1/ε))}. We now show that f is ε-concentrated on S. By Theorem 1.2, we know that

    ∑_{U∉S, |U| ≥ O(w log(1/ε))} f̂(U)² ≤ ε.

By Corollary 1.4, we have

    ∑_{U∉S, |U| ≤ O(w log(1/ε))} f̂(U)² ≤ ( ∑_{|U| ≤ O(w log(1/ε))} |f̂(U)| ) · max_{U∉S} |f̂(U)| ≤ w^{O(w log(1/ε))} · (ε / w^{O(w log(1/ε))}) = ε.

Therefore, f is 2ε-concentrated on S.  □

Corollary 1.6 poly(n)-size DNFs are ε-concentrated on collections of size

    (log(n/ε))^{O(log(n/ε) · log(1/ε))} = (n/ε)^{O(log log(n/ε) · log(1/ε))}

And if ε = Θ(1), then the above is n^{O(log log n)}. Note that this uses the fact that size-n DNF formulas are ε-close to a width-log(n/ε) DNF.
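To see the equality between the two expressions (a routine check, spelled out here for clarity), write m = n/ε and take logarithms:

    log[ (log m)^{O(log m · log(1/ε))} ] = O(log m · log(1/ε)) · log log m
                                        = O(log log m · log(1/ε)) · log m
                                        = log[ m^{O(log log m · log(1/ε))} ].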

An open research problem is the following: are poly(n)-size DNFs ε-concentrated on a collection of size poly(n), assuming that ε = Θ(1)?

2 Learning AC0

We will now study how to learn polynomial-size, constant-depth circuits, AC0. Consider circuits with unbounded fan-in AND, OR, and NOT gates. The size of the circuit is defined as the number of AND/OR gates. Observe the following fact:

Fact 2.1 Let d denote the depth of the circuit. At the expense of a factor of d in size, these circuits can be taken to be “layered”. Here “layered” means that each layer consists of the same type of gates, either AND or OR, and adjacent layers contain gates of the opposite type.

In a layered circuit, the number of layers is the depth of the circuit, and we define the bottom fan-in of a layered circuit to be the maximum fan-in at the lowest layer (i.e., the layer closest to the inputs).

Theorem 2.2 (LMN) Let f be computable by a circuit of size ≤ s, depth ≤ D, and bottom fan-in ≤ w. Then

    ∑_{|U| ≥ (10w)^D} f̂(U)² ≤ O(s · 2^{-w})

Before we show a proof of the LMN theorem, let us first look at some corollaries implied by this theorem.

Corollary 2.3 If f has a size-s, depth-D circuit, then

    ∑_{|U| ≥ [O(log(s/ε))]^D} f̂(U)² ≤ ε

Proof: Notice that such an f is ε-close to a similar circuit with bottom fan-in ≤ log(s/ε): replace each bottom gate of larger fan-in by a constant; each such replacement errs with probability at most 2^{-log(s/ε)} = ε/s, so a union bound over at most s gates gives ε. Then apply Theorem 2.2 with w = log(s/ε).  □

Corollary 2.4 AC0, i.e., the class of poly-size, constant-depth circuits, is learnable from random examples in time n^{poly(log(n/ε))}, where n denotes the size of the circuit.

Proof: According to Corollary 2.3, AC0 circuits are ε-concentrated on a collection of size n^{poly(log(n/ε))}.  □

Corollary 2.5 If f has a size-s, depth-D circuit, then I(f) ≤ [O(log s)]^D.

Remark 2.6 Due to Håstad, the above bound can be improved to [O(log s)]^{D-1}.

Proof: (sketch) Define F(τ) = ∑_{|U| > τ} f̂(U)², and write τ₀ = O(log s)^D. Then

    I(f) = ∑_U |U| · f̂(U)²
         = ∑_{|U| ≤ τ₀} |U| · f̂(U)² + τ₀ · F(τ₀) + ∑_{r=τ₀}^{n} F(r)
         ≤ O(log s)^D + ∑_{r=τ₀}^{n} F(r),

using ∑_U f̂(U)² = 1 to bound the first two terms by τ₀ each. It remains to show that ∑_{r=τ₀}^{n} F(r) ≤ O(log s)^D. By Corollary 2.3,

    F(τ) = ∑_{|U| > τ} f̂(U)² ≤ s · 2^{-Ω(τ^{1/D})}.

Using this fact plus some manipulation, it is not hard to show that ∑_{r=τ₀}^{n} F(r) ≤ O(log s)^D. (One way: group the r’s by the value of t = ⌈r^{1/D}⌉; each group has at most D(t+1)^{D-1} terms, each at most s · 2^{-Ω(t)}, and for τ₀ = (C log s)^D with C a large enough constant the resulting series over t ≥ C log s decays geometrically and sums to O(log s)^D.)  □

Using this fact, we can derive the following two corollaries:

Corollary 2.7 Parity ∉ AC0, Majority ∉ AC0.

Proof: I(Parity) = n and I(Majority) = Θ(√n), both of which exceed the polylog(n) influence bound of Corollary 2.5 for poly(n)-size, constant-depth circuits.  □
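To recall where these influence values come from (standard computations, spelled out here for completeness): flipping any coordinate always flips Parity, so Inf_i(Parity) = 1 for every i and I(Parity) = n. For Majority (on odd n), flipping coordinate i changes the output iff the other n − 1 votes are split evenly, which happens with probability (n−1 choose (n−1)/2) / 2^{n-1} = Θ(1/√n); hence I(Majority) = n · Θ(1/√n) = Θ(√n).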

Definition 2.8 (PRFG) A function f : {-1,1}^m × {-1,1}^n → {-1,1} is a Pseudo-Random Function Generator (PRFG) if for every Probabilistic Polynomial Time (P.P.T.) algorithm A with oracle access to a function,

    | Pr_{S∈{-1,1}^m, A's random bits}[A(f(S, ·)) = YES] − Pr_{g, A's random bits}[A(g) = YES] | ≤ 1/n^{ω(1)},

where g is a function picked at random from all functions {h | h : {-1,1}^n → {-1,1}}.

Corollary 2.9 Pseudo-random function generators ∉ AC0.

Proof: Suppose that f ∈ AC0, i.e., each function f(S, ·) is computable by a poly-size, constant-depth circuit. Then we can construct a P.P.T. adversary A such that A can tell f apart from a truly random function with non-negligible probability. Consider an adversary A that picks X ∈ {-1,1}^n and i ∈ [n] at random. Given an oracle for a function g, A queries g at X and at X^{(i)} (X with the i-th coordinate flipped), and outputs YES iff g(X) ≠ g(X^{(i)}).

If g is truly random, then Pr[A outputs YES] = 1/2; if g ∈ AC0, then by Corollary 2.5, Pr[A outputs YES] ≤ I(g)/n = polylog(n)/n.  □
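The distinguisher used in this proof is short enough to write out. Below is a minimal sketch (illustrative, not from the notes); it assumes the oracle g is given as a Python function on ±1 tuples of length n:

    import random

    def adversary(g, n):
        # One-shot distinguisher: query g at a random X and at X with a random
        # coordinate flipped; answer YES (True) iff the two values differ.
        # For a truly random g, Pr[YES] = 1/2; for g computable in AC0,
        # Pr[YES] <= I(g)/n = polylog(n)/n by Corollary 2.5.
        X = [random.choice([-1, 1]) for _ in range(n)]
        i = random.randrange(n)
        X_flipped = list(X)
        X_flipped[i] = -X_flipped[i]
        return g(tuple(X)) != g(tuple(X_flipped))

Note that a single run already achieves distinguishing advantage 1/2 − polylog(n)/n, which is non-negligible.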

After seeing all these applications of the LMN theorem, we now show how to prove it. To prove the LMN theorem, we need the following tools:

Observation 2.10 A depth-w Decision Tree (DT) is expressible as a width-w DNF (one term per accepting path) or as a width-w CNF (one clause per rejecting path).

Lemma 2.11 Let f : {-1,1}^n → {-1,1}, and let (I, X) be a random restriction with *-probability ρ. Then for all d ≥ 5,

    ∑_{|U| ≥ 2d/ρ} f̂(U)² ≤ 2 · Pr_{(I,X)}[DT-depth(f_{X→I}) > d].

Now using Lemma 2.11, we can prove the LMN theorem.

Proof: (LMN)

Claim 2.12 Let (I, X) be a random restriction with *-probability (1/(10w))^{D-1}. Then

    Pr_{I,X}[DT-depth(f_{X→I}) > w] ≤ s · 2^{-w}.

The above claim in combination with Lemma 2.11 would complete the proof: applying Lemma 2.11 with ρ = (1/(10w))^{D-1} and d = w gives ∑_{|U| ≥ 2w·(10w)^{D-1}} f̂(U)² ≤ 2s · 2^{-w}, and 2w · (10w)^{D-1} ≤ (10w)^D. We now show why the claim is true.

Observe that we can view choosing a random restriction with *-probability (1/(10w))^{D-1} as the following:

• First choose a random restriction with *-probability 1/(10w).
• Further choose a random restriction on the surviving variables with *-probability 1/(10w).
• ...
• Repeat the above D − 1 times.

After the first restriction, due to Håstad’s Switching Lemma, for any level-2 circuit (a width-w DNF or CNF at the bottom, by the bottom fan-in bound), the probability that it doesn’t turn into a depth-w DT can be bounded as below:

    Pr[doesn’t turn into a depth-w DT] ≤ (5w/(10w))^w = 2^{-w}

Due to Observation 2.10, we can express a depth-w DT as a width-w DNF or CNF. Using this, we can transform the bottom two layers of the circuit into the opposite of what they were before, i.e., CNF to DNF, and DNF to CNF. This will succeed except with probability 2^{-w} × (number of level-2 gates). Now the 2nd lowest layer and the 3rd lowest layer have the same type of gates, so we can collapse them into a single layer. Observe that this operation preserves the bottom fan-in, because the resulting CNF or DNF has width ≤ w. So we repeat this operation D − 1 times, and the probability that the resulting circuit is not a depth-w DT can be bounded as below:

    Pr[DT-depth of resulting circuit > w] ≤ 2^{-w} × (number of level-2 gates + number of level-3 gates + ···) ≤ s · 2^{-w}  □

3 Learning Juntas

We now study how to learn juntas. Let C_r = { f : {-1,1}^n → {-1,1} | f is an r-junta } denote the family of r-juntas. Note that C_{log n} ⊆ {poly-size DTs} ⊆ {poly-size DNFs} (an r-junta has a decision tree of depth r, hence of size ≤ 2^r, and a size-s DT yields a size-s DNF).

Remark 3.1 To learn C_r, it suffices for the algorithm to identify the r relevant variables, since then we can just draw O(r · 2^r) random examples and, with high probability, learn the entire truth table (of size 2^r). (This is a coupon-collector argument: O(r · 2^r) uniform examples hit every one of the 2^r settings of the relevant variables with high probability.)
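A sketch of this second phase in code (a hypothetical helper, not from the notes, assuming the set of relevant variables has already been identified):

    def learn_truth_table(examples, relevant):
        # Build the truth table of the junta over the relevant variables.
        # With O(r * 2^r) random examples, every setting of the relevant
        # variables is observed with high probability (coupon collector).
        relevant = sorted(relevant)
        table = {}
        for x, y in examples:
            table[tuple(x[i] for i in relevant)] = y
        def h(x):
            # Patterns never observed (unlikely with enough examples) default to 1.
            return table.get(tuple(x[i] for i in relevant), 1)
        return h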

Observation 3.2 If f is an r-junta, then every Fourier coefficient of f is either 0 or at least 2^{-r} in absolute value. This follows directly from the definition of Fourier coefficients and the fact that f depends on only r variables.
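To spell the observation out (a one-line computation, added for clarity): say f depends only on the coordinates in R ⊆ [n] with |R| = r, and write f(x) = g(x_R). Then for S ⊆ R,

    f̂(S) = E_{x∈{-1,1}^n}[f(x) · x_S] = (1/2^r) · ∑_{y∈{-1,1}^R} g(y) · y_S,

which is an integer multiple of 2^{-r} since each term g(y) · y_S is ±1; and f̂(S) = 0 whenever S ⊄ R, as Fact 3.3 below shows.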

Fact 3.3 If f̂(S) ≠ 0, then all variables in S are relevant.

Proof: Suppose f̂(S) ≠ 0, but there exists an irrelevant i ∈ S. By definition, f̂(S) = E_x[f(x) · x_S]. But for any x, consider x^{(i)} (x with the i-th coordinate flipped); since f(x) = f(x^{(i)}) while x_S = −(x^{(i)})_S, we have

    f(x) · x_S = −f(x^{(i)}) · (x^{(i)})_S.

Therefore, everything cancels out in pairs in f̂(S), and f̂(S) = 0, a contradiction.  □

Using the above facts, we give one idea for learning juntas, and show that r-juntas are learnable in time poly(n, 2^r) · n^r (a code sketch of the procedure appears at the end of the section):

• Estimate f̂(∅).
• Estimate f̂(S) for all |S| = 1 up to accuracy 2^{-r}/4. If we find an S such that f̂(S) ≠ 0, then we know that all variables in S are relevant. Note that this takes time poly(n, 2^r) · (n choose 1).
• Estimate f̂(S) for all |S| = 2 up to accuracy 2^{-r}/4. If we find an S such that f̂(S) ≠ 0, then we know that all variables in S are relevant. Note that this takes time poly(n, 2^r) · (n choose 2).
• Do the above for S of size 3, 4, ..., r.

Observation 3.4 The above gives us a poly(n, 2^r) · n^r-time algorithm for learning r-juntas.

In the next lecture, we shall improve the above result. In particular, we will ask the question: what kind of functions f can have f̂(S) = 0 for all 1 ≤ |S| ≤ d? If f is not such a function, then by step d of the above algorithm we will have found a relevant variable. If f is such a function, i.e., f̂(S) = 0 for all 1 ≤ |S| ≤ d, we will show an algorithm for learning such functions.
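Here is the promised code sketch of the coefficient-scanning procedure (illustrative only, not from the notes; it assumes labeled random examples (x, f(x)) are available, and the function names are hypothetical):

    import itertools

    def estimate_coefficient(S, examples):
        # Empirical estimate of hat{f}(S) from labeled examples (x, f(x)).
        total = 0.0
        for x, y in examples:
            chi = 1
            for i in S:
                chi *= x[i]
            total += y * chi
        return total / len(examples)

    def find_relevant_variables(examples, n, r):
        # Scan S of size 1, 2, ..., r. With enough examples each estimate is
        # accurate to 2^{-r}/4, so any S whose estimate exceeds 2^{-r}/2 in
        # absolute value has hat{f}(S) != 0, and all its variables are relevant
        # (Fact 3.3); conversely every S with hat{f}(S) != 0 is caught, since
        # nonzero coefficients are at least 2^{-r} in magnitude (Observation 3.2).
        relevant = set()
        threshold = 2.0 ** (-r) / 2
        for size in range(1, r + 1):
            for S in itertools.combinations(range(n), size):
                if abs(estimate_coefficient(S, examples)) > threshold:
                    relevant.update(S)
        return relevant

By Remark 3.1, once the (at most r) relevant variables are found, the truth table can be learned from O(r · 2^r) further random examples.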
